2.2 Word Representation and Language Model
2.2.3 Language Model
A language model is an umbrella term for a variety of mathematical models used to formulate, analyse and generate natural-language text. Essentially, a language model is used to judge whether an input text is reasonable and comprehensible. It plays a vital role in tasks such as information retrieval, machine translation and speech recognition.
Statistical Language Model
Assume there is a sequence $S$ containing $m$ words $[w_1, w_2, \cdots, w_m]$. A statistical language model estimates the probability $P(S = w_{1:m})$ of $[w_1, w_2, \cdots, w_m]$ to represent the likelihood that $S$ is a sentence, calculated as:

\[
P(S = w_{1:m}) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) \cdots P(w_i \mid w_1, w_2, \cdots, w_{i-1}) \cdots P(w_m \mid w_1, w_2, \cdots, w_{m-1}) = P(w_1) \prod_{i=2}^{m} P(w_i \mid w_{1:(i-1)}) \tag{2.1}
\]

However, if the text contains too many words, it is hard to estimate $P(w_i \mid w_1, w_2, \cdots, w_{i-1})$ because of sparsity. Alternatively, we can simplify it with an n-gram model that only considers the $n-1$ words immediately preceding $w_i$, which corresponds to a Markov chain of order $n-1$:

\[
P(w_i \mid w_{1:(i-1)}) \approx P(w_i \mid w_{1+i-n}, w_{2+i-n}, \cdots, w_{i-1}) \tag{2.2}
\]
How, then, can we obtain the value of $P(w_i \mid w_1, w_2, \cdots, w_{i-1})$? Normally, we assume that the words in the text follow a multinomial distribution with parameter $\theta$, and use maximum likelihood estimation with a Lagrange multiplier to estimate the value of $\theta$.
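For the simplest case, in which words are drawn independently, this estimation can be sketched as follows. The symbols $c_k$ (the number of times the $k$-th vocabulary word occurs), $N$ (the total number of tokens), $|V|$ (the vocabulary size) and $\lambda$ (the Lagrange multiplier) are introduced here for illustration only. Maximising the log-likelihood under the constraint $\sum_k \theta_k = 1$ gives:

\[
\begin{aligned}
\mathcal{L}(\theta, \lambda) &= \sum_{k=1}^{|V|} c_k \log \theta_k + \lambda \Big( 1 - \sum_{k=1}^{|V|} \theta_k \Big), \\
\frac{\partial \mathcal{L}}{\partial \theta_k} &= \frac{c_k}{\theta_k} - \lambda = 0 \;\Rightarrow\; \theta_k = \frac{c_k}{\lambda}, \\
\sum_{k=1}^{|V|} \theta_k = 1 &\;\Rightarrow\; \lambda = \sum_{k=1}^{|V|} c_k = N, \qquad \hat{\theta}_k = \frac{c_k}{N}.
\end{aligned}
\]

Under these assumptions the maximum likelihood estimate is simply the relative frequency of each word, which is why n-gram probabilities are computed by counting.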
When $n = 1$, it is called the unigram model and $P(S = w_{1:m} \mid \theta) = \prod_{i=1}^{m} P(w_i)$, which indicates that words are independent of each other and carry no semantic or ordering information. When $n = 2$ and $n = 3$, it is called the bigram model and the trigram model respectively. Under the n-gram model, the conditional probability is normally computed by frequency counting:
\[
P(w_i \mid w_{1+i-n}, w_{2+i-n}, \cdots, w_{i-1}) = \frac{\mathrm{count}(w_{1+i-n}, w_{2+i-n}, \cdots, w_{i-1}, w_i)}{\mathrm{count}(w_{1+i-n}, w_{2+i-n}, \cdots, w_{i-1})} \tag{2.3}
\]
where $\mathrm{count}(w_{1+i-n}, w_{2+i-n}, \cdots, w_{i-1})$ is the number of times that the word sequence $[w_{1+i-n}, w_{2+i-n}, \cdots, w_{i-1}]$ appears in the text. Intuitively, a larger $n$ keeps more ordering and semantic information of the text; however, it also leads to a more severe sparsity issue with Eq. 2.3. Therefore, the trigram model and smoothing methods are usually adopted in real applications. For example, to attenuate the noise caused by co-occurrences that happen rarely or never, the Global Vectors (GloVe) model, which is based on a word-word co-occurrence matrix in which $x_{ij}$ is the number of times that words $w_i$ and $w_j$ co-occur in a corpus, adds a weight function $f(x_{ij})$ to its cost function with the properties: 1) $f(0) = 0$; 2) $f(x)$ is non-decreasing and relatively small for large $x_{ij}$, so that neither rare nor frequent co-occurrences are overweighted [132].
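The frequency counting in Eq. 2.3 can be sketched in a few lines of Python. This is a minimal illustration, not the estimator of any particular toolkit; the function names `ngram_counts` and `cond_prob` and the toy corpus are assumptions introduced here. Contexts are counted only where a following word exists, so that the estimates for each seen context sum to one.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram and the (n-1)-word context that precedes its last word."""
    grams = Counter()
    contexts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        grams[gram] += 1
        contexts[gram[:-1]] += 1  # context = the n-gram without its last word
    return grams, contexts

def cond_prob(word, context, grams, contexts):
    """Eq. 2.3: count(context + word) / count(context); 0.0 for an unseen context."""
    context = tuple(context)
    if contexts[context] == 0:
        return 0.0
    return grams[context + (word,)] / contexts[context]

# Toy corpus (hypothetical), already tokenised by whitespace.
tokens = "the cat sat on the mat the cat ate".split()
grams, contexts = ngram_counts(tokens, 2)  # bigram model, n = 2

p_seen = cond_prob("cat", ["the"], grams, contexts)    # "the cat" occurs twice
p_unseen = cond_prob("dog", ["the"], grams, contexts)  # "the dog" never occurs
```

On this toy corpus the bigram estimate of P(cat | the) is 2/3, while the unseen bigram "the dog" receives probability 0. The second result is the sparsity problem that smoothing methods are meant to address.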
Neural Network Language Model
Neural network language models started from the following idea:
• The problem of estimating $P(w_i \mid w_{1:(i-1)})$ within a corpus whose vocabulary $V$ has size $|V|$ can be seen as a multi-class classification problem in which the number of classes is $|V|$. It can be formulated as follows:
\[
P_{k \in |V|}\big( \mathrm{label}(w_i) = k \mid h_i = \mathrm{label}(w_{1:(i-1)}) \big) = f_k(w_{1:(i-1)}, \alpha) \tag{2.4}
\]
where $\mathrm{label}(w_i)$ is the predicted class label of word $w_i$; $h_i = \mathrm{label}(w_{1:(i-1)})$ is the label sequence of the preceding words; and $f_k$ is used to estimate how probable it is that $w_i$ is the $k$-th word in the vocabulary, subject to the constraint $\sum_{k=1}^{|V|} f_k(w_{1:(i-1)}, \alpha) = 1$, where $\alpha$ is a parameter.
The optimisation of such a classifier can be handled by many machine learning methods, among which the neural network model has drawn much attention. Xu et al. first introduced the neural network model to the bigram language model [?]. Bengio et al. formally proposed a neural network language model based on the n-gram model [13,14], which became the representative work in this research direction. Typically, when we know the first $i-1$ words $[w_1, w_2, \cdots, w_{i-1}]$, we estimate the $i$-th word $w_i$ with a neural network language model consisting of the following three layers:
• Input layer maps $[w_1, w_2, \cdots, w_{i-1}]$ to word embeddings $[e_1, e_2, \cdots, e_{i-1}]$. In this layer, a word $w_i$ is transformed into a dense real-valued vector $e_{w_i}$ by $e_{w_i} = M v_i$, where $v_i$ is the one-hot representation of the word $w_i$ and $M \in \mathbb{R}^{t \times |V|}$ is a $t$-dimensional word embedding matrix. Each column of $M$ is the real-valued vector of one word, denoted $e_{w_i}$. This procedure embeds a word into a continuous semantic space and reduces its dimension from $|V|$ to $t$.
• Hidden layer employs different types of neural networks, such as the Feed-forward Neural Network (FNN), the Recurrent Neural Network (RNN) and many other variants, to compute a representation $h_t$ of the linear distributional features of the preceding information. An FNN requires a fixed-size input vector; therefore, the word embeddings from the input layer are concatenated in order into one long vector $x$, and $h_t = \tanh(b_1 + W x)$, where $W$ is the input-to-hidden weight matrix and $b_1$ is the hidden-layer bias.
• Output layer uses a classifier, $y_t = \mathrm{softmax}(O h_t + b_2)$, which maps the values of $h_t$ to a vector $y_t \in \mathbb{R}^{|V|}$ representing a probability distribution, in which the $j$-th element is the posterior probability that the $t$-th word is the $j$-th word in $V$, where $O \in \mathbb{R}^{|V| \times t'}$ is the hidden-to-output weight matrix and $b_2$ is the output bias. $O$ can also be seen as another word embedding matrix, in which each row is a new word embedding.
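The three layers above can be sketched as one forward pass of a feed-forward model. This is an untrained toy illustration in plain Python, not the implementation of Bengio et al.; the sizes (`V`, `t`, `hidden`, `context_len`), the random initialisation and all function names are assumptions introduced here.

```python
import math
import random

def matvec(A, x):
    """Multiply matrix A (a list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def nnlm_forward(context_ids, M, W, b1, O, b2):
    """Return a probability distribution over the vocabulary for the next word.

    M: t x |V| embedding matrix (column k is the embedding of word k)
    W: hidden x (context_len * t) input-to-hidden weights
    O: |V| x hidden hidden-to-output weights
    """
    # Input layer: M times a one-hot vector just selects column k of M.
    x = []
    for k in context_ids:
        x.extend(row[k] for row in M)  # concatenate the embeddings in order
    # Hidden layer: h = tanh(b1 + W x)
    h = [math.tanh(b + s) for b, s in zip(b1, matvec(W, x))]
    # Output layer: y = softmax(O h + b2)
    return softmax([s + b for s, b in zip(matvec(O, h), b2)])

rng = random.Random(0)
V, t, hidden, context_len = 5, 3, 4, 2  # toy sizes, chosen arbitrarily
M = random_matrix(t, V, rng)
W = random_matrix(hidden, context_len * t, rng)
O = random_matrix(V, hidden, rng)
b1 = [0.0] * hidden
b2 = [0.0] * V

# Distribution over the 5 vocabulary words given the context words 1 and 3.
y = nnlm_forward([1, 3], M, W, b1, O, b2)
```

Because the weights are random, the resulting distribution carries no linguistic meaning. The sketch only shows how the shapes fit together and that the softmax output satisfies the sum-to-one constraint on $f_k$.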
The neural network language model has been a research hotspot in recent years. Many efforts have been made with different concerns, for example how to train word embeddings [38,198], how to improve the basic n-gram model [111] and how to introduce deep learning methods into neural network language models [85,117,159].
The performance of a language model depends on the input text and the training model, and it affects the performance of downstream tasks and applications. Conversely, methods designed for other tasks are tightly associated with the input text representation. Next, we focus on the primary techniques of topic detection and document clustering and classification tasks, starting with their common foundation, the topic model.