So first off, what is a language model?
A language model is a probabilistic model that assigns a probability to a sequence of tokens (words, or sometimes characters). So if (w_1, ..., w_n) is a sentence with n words, the chain rule of probability gives:

P(w_1, ..., w_n) = P(w_n | w_1, ..., w_(n-1)) * P(w_(n-1) | w_1, ..., w_(n-2)) * ... * P(w_2 | w_1) * P(w_1)
Don’t be intimidated by the formula, okay? That's great advice given to me by another data scientist. This formula simply says that the probability of a sentence decomposes into conditional probabilities: the probability of each word is conditioned on all the words that came before it.
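To make the chain rule concrete, here is a tiny sketch for a three-word sentence. The probability values are made up purely for illustration:

```python
# Toy illustration of the chain rule for a 3-word sentence.
# These conditional probabilities are invented for demonstration only.
p_w1 = 0.2              # P(w1)
p_w2_given_w1 = 0.5     # P(w2 | w1)
p_w3_given_w1_w2 = 0.4  # P(w3 | w1, w2)

# P(w1, w2, w3) = P(w1) * P(w2 | w1) * P(w3 | w1, w2)
p_sentence = p_w1 * p_w2_given_w1 * p_w3_given_w1_w2
print(p_sentence)  # 0.04
```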
Are you one of those scientists who care about the applicability of your work? Let us review some use cases of language models:
* Generating sentences: we can repeatedly pick the most probable next word (or token) to follow a sequence of words, and so generate new text.
* Scoring in translation: a translation model usually produces several candidate results for the same input. The chosen translation is the candidate with the highest probability, where those scores are generated by the language model.
* Speech recognition: when you match a sound wave to a word or a sentence, you also wanna pick the candidate that corresponds to the sentence with the highest probability, right?
In this activity, we are going to focus on something close to the first use case. We would like to create a language model that can predict the next word, meaning it will generate new text based on the text we feed it. So one valid question: in most machine learning models, we use features like categorical or numerical values. How can we feed them text? Very good question. In the field of Natural Language Processing, or NLP (language models are a category of NLP models), we transform text into features so that it is digestible for those algorithms. Yes, we need a pipeline.
- Tokenize: We need to divide sentences into tokens, mostly words (or characters). So "amir likes data science" becomes [amir, likes, data, science].
- Special tokens for the beginning and end of sentences: simple! Just add new tokens that mark the start and end of each sentence, like: <begin> amir likes data science <end>. This helps in cases where we wanna find the most probable word to start a sentence.
- Word frequencies (dictionary): We count the frequency of each word and set a threshold; words that have occurred fewer times than the threshold are removed. The reason is that words that happen rarely are not gonna be useful for our predictions in most cases.
- Index to word: Now that the word frequencies form a list, we can go back to our sentences and replace each word with its index in that word frequency list. <begin> is gonna be index 0, amir is gonna be 1, and so on. This turns a sentence into something like [0, 1, 138, 200, 100]. See? We are slowly transforming text into numbers. What happens when we hit a word that we removed with that frequency threshold? We replace it with a special unknown-word token (often written <unknown> or <unk>).
- Create one-hot vectors from the indices: Numbers like 36 need a vector representation to serve as neural network inputs, or in our case, sequences. For this, we use one-hot vectors. What is a one-hot vector, you might ask. It’s simply a binary vector with the length of our word frequency list (our dictionary), with a 1 at the word's index and 0 everywhere else. So if amir has index 1, its one-hot vector will be (0, 1, 0, 0, 0, 0, 0, …, 0).
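The whole pipeline above can be sketched in a few lines of Python. The example sentences, the threshold value, and the exact token spellings (<begin>, <end>, <unknown>) are illustrative assumptions, not fixed conventions:

```python
from collections import Counter

# A minimal sketch of the preprocessing pipeline described above.
sentences = [
    ["amir", "likes", "data", "science"],
    ["amir", "likes", "science"],
]

# 1) Add begin/end-of-sentence markers.
marked = [["<begin>"] + s + ["<end>"] for s in sentences]

# 2) Count word frequencies and drop words below a threshold.
threshold = 2
counts = Counter(w for s in marked for w in s)
vocab = [w for w, c in counts.items() if c >= threshold] + ["<unknown>"]

# 3) Map each word to its index in the vocabulary.
word_to_index = {w: i for i, w in enumerate(vocab)}
unk = word_to_index["<unknown>"]
indexed = [[word_to_index.get(w, unk) for w in s] for s in marked]

# 4) One-hot encode an index as a binary vector of vocabulary length.
def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec

first_vectors = [one_hot(i, len(vocab)) for i in indexed[0]]
print(indexed)  # "data" occurs only once, so it maps to <unknown>
```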
Neural Language Model using RNN
One of the main features we need from our neural network is that it should be able to work with sequences. Classical neural network models assume inputs are independent of each other. One type of neural network that handles sequences is the (plain) RNN (Recurrent Neural Network), and its general architecture looks like below:
They are the type of models that have been a bit more mysterious and harder to work with, one reason being that their architecture is more complicated. Their compact version looks like the left side of the photo above. X is the input, and w_x is the weight (or parameter) associated with the input. S is the hidden layer; however, S itself takes different values at each t (t can be a timestamp or, in our case, element t of the sequence). So we feed this neural network a sequence (x_1, …, x_n) like (amir, likes, data, science), and Y will be the probability value of that sequence according to our model. Notice that what w_rec does is carry the weights we trained from hidden state s_0 to the next hidden state s_1, and then onwards. That’s the whole sophisticated part of an RNN that allows us to work with sequences.
That is how it works, but how do we formulate it? What is it these networks are trying to optimize?
The hidden layers follow a tanh activation:

s_t = tanh(w_x * x_t + w_rec * s_(t-1))

At each stage ("hidden" in the picture), there is an actual output, obtained by running the softmax function on the relevant state (with an output weight we can call w_y):

y_t = softmax(w_y * s_t)

The resulting y_t is a probability distribution over our dictionary: each entry is the model's probability that the corresponding word comes next.
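The forward pass described above can be sketched with NumPy. The dimensions, the random inputs, and the output weight name w_y are illustrative assumptions; w_x and w_rec follow the text:

```python
import numpy as np

# A minimal sketch of an RNN forward pass, for illustration only.
rng = np.random.default_rng(0)
vocab_size, hidden_size = 6, 4

w_x = rng.normal(size=(hidden_size, vocab_size))     # input-to-hidden
w_rec = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden
w_y = rng.normal(size=(vocab_size, hidden_size))     # hidden-to-output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One-hot vectors for a toy sequence of word indices.
xs = np.eye(vocab_size)[[0, 1, 2, 3, 4]]

s = np.zeros(hidden_size)  # s_0
outputs = []
for x_t in xs:
    s = np.tanh(w_x @ x_t + w_rec @ s)  # s_t = tanh(w_x x_t + w_rec s_(t-1))
    y_t = softmax(w_y @ s)              # y_t = softmax(w_y s_t)
    outputs.append(y_t)

# Each y_t is a probability distribution over the vocabulary.
print(np.allclose([y.sum() for y in outputs], 1.0))  # True
```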
Initialization of the parameters
Unlike traditional neural networks, we can’t initialize the parameters with 0 in RNNs. Instead, we draw them uniformly: w_rec from [-1/sqrt(number_of_hidden_units), 1/sqrt(number_of_hidden_units)], and w_x from [-1/sqrt(number_of_frequent_words), 1/sqrt(number_of_frequent_words)] (one over the square root of each weight matrix's number of incoming connections).
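As a sketch, the uniform initialization above looks like this in NumPy (the specific sizes are illustrative assumptions):

```python
import numpy as np

# A sketch of the uniform initialization described above, assuming
# 100 hidden units and 8000 frequent words in the dictionary.
rng = np.random.default_rng(42)
hidden_size, vocab_size = 100, 8000

bound_rec = 1.0 / np.sqrt(hidden_size)  # for hidden-to-hidden weights
bound_x = 1.0 / np.sqrt(vocab_size)     # for input-to-hidden weights

w_rec = rng.uniform(-bound_rec, bound_rec, size=(hidden_size, hidden_size))
w_x = rng.uniform(-bound_x, bound_x, size=(hidden_size, vocab_size))

print(abs(w_rec).max() <= bound_rec)  # True
```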
If you wanna see how YOU can code an RNN, read this tutorial: http://peterroelants.github.io/posts/rnn...on_part01/
If you don’t wanna re-invent the wheel, use Tensorflow: https://www.tensorflow.org/versions/r0.1...index.html
RNNs are limited by problems remembering a very long history; this issue is addressed by LSTMs. For a good understanding, read this: http://colah.github.io/posts/2015-08-Und...ing-LSTMs/