2.3 Distributional Semantics
3.1.3 Corpus-based measures
Corpus-based measures rely on co-occurrence information, on the assumption that related words tend to appear together in text. The semantic relatedness of two text fragments (words, phrases, sentences, etc.) can be obtained by calculating the similarity between their high-dimensional vectors in a distributional semantic space. Distributional representation is based on the underlying idea proposed by Firth [47] that “the semantic meaning of a word can (at least to a certain extent) be inferred from its usage in context”, i.e. its distribution in text. This semantic representation is built through statistical analysis of the large amount of contextual information in which a word occurs. Distributional Semantic Models (DSMs) compute relatedness scores from these distributional representations. DSMs are based on the distributional hypothesis introduced by Harris [59], i.e. words that occur in the same contexts tend to have similar meanings.
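As a rough illustration of the distributional hypothesis, the following sketch builds co-occurrence count vectors from a tiny made-up corpus and compares them with the cosine measure. The corpus, the window size of 2 and the function names are assumptions chosen for illustration only; they are not taken from the models discussed in this chapter.

```python
from collections import Counter, defaultdict
import math


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (hypothetical); a real DSM would use millions of sentences.
toy_corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the cat",
]
vecs = cooccurrence_vectors(toy_corpus)
print(cosine(vecs["cat"], vecs["dog"]))
print(cosine(vecs["cat"], vecs["milk"]))
```

Words that share many context words, such as "cat" and "dog" here, end up with vectors pointing in similar directions, which is the intuition the models below refine.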
3.1.3.1 Latent Semantic Analysis
The most common DSM is Latent Semantic Analysis (LSA) ([85], [49]). LSA constructs a semantic space from a large text corpus by first casting it onto a rectangular word-by-document matrix, where each row represents a unique word and each cell contains the association strength of that word in a given document. This matrix is decomposed using a well-known algebraic matrix factorization method called Singular Value Decomposition (SVD), in which the k largest singular values are retained and the remainder are set to 0. LSA relies on the word distribution represented by the word-document matrix and calculates the similarity between words and documents by taking the cosine of the two corresponding vectors in this k-dimensional space.
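Let $M$ be the rectangular word-document matrix, where rows are indexed by words $T_i$ and columns by documents $D_j$, so that entry $M_{ij}$ holds the association strength of the $i$th word in the $j$th document. This matrix $M$ can be decomposed into a product of three matrices $U$, $\Sigma$ and $V^T$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ is the diagonal matrix of singular values. Retaining only the $k$ largest singular values of $\Sigma$, denoted $\Sigma_k$, reduces the high-dimensional space to a $k$-dimensional space. The decomposition can be represented by the following equation:

\[ M = U \Sigma V^T \tag{3.7} \]

Here, $U$ and $V$ contain the eigenvectors of $M M^T$ and $M^T M$ respectively. Therefore, the relatedness between two terms can be represented as

\[ \mathrm{rel}(T_i, T_r) = U_i \Sigma_k \cdot U_r \Sigma_k \tag{3.8} \]

Similarly, the relatedness between two documents can be represented as

\[ \mathrm{rel}(D_i, D_r) = \Sigma_k V_i \cdot \Sigma_k V_r \tag{3.9} \]

where $U_i$ is the vector in $U$ corresponding to word $T_i$ and $V_i$ is the vector in $V$ corresponding to document $D_i$.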
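The decomposition and the relatedness scores of Eq. (3.8) and (3.9) can be sketched with NumPy as follows. This is a minimal illustration on a made-up 5x4 count matrix, not the setup used in the cited work; the word list, the matrix values, the choice k = 2 and the helper names `lsa` and `cosine` are all assumptions.

```python
import numpy as np

# Toy word-document count matrix M (rows: words T_i, columns: documents D_j).
# The values are invented for illustration.
words = ["cat", "dog", "pet", "stock", "market"]
M = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)


def lsa(M, k):
    """Truncated SVD: return word and document coordinates in the k-dimensional space."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    word_vecs = U_k * s_k      # row i holds U_i Sigma_k
    doc_vecs = Vt_k.T * s_k    # row j holds Sigma_k V_j, laid out as a row
    return word_vecs, doc_vecs


def cosine(a, b):
    """Cosine of the angle between two dense vectors (0.0 if either is zero)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


word_vecs, doc_vecs = lsa(M, k=2)
rel_terms = word_vecs[0] @ word_vecs[1]  # Eq. (3.8) for "cat" and "dog"
rel_docs = doc_vecs[0] @ doc_vecs[1]     # Eq. (3.9) for documents 0 and 1
print(rel_terms, rel_docs)
print(cosine(word_vecs[0], word_vecs[1]), cosine(word_vecs[0], word_vecs[3]))
```

Equations (3.8) and (3.9) are unnormalized dot products; dividing by the vector norms, as `cosine` does, gives the cosine score mentioned earlier in this section.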
3.1.3.2 Explicit Semantic Analysis
Gabrilovich and Markovitch [52] introduced Explicit Semantic Analysis (ESA), which, like LSA, represents the semantics of a given word in a high-dimensional distributional semantic space. LSA performs dimensionality reduction to obtain latent concepts. In contrast, ESA directly uses supervised topics, such as manually built Wikipedia concepts, and assumes that every concept represents a unique topic. ESA creates a high-dimensional vector to represent the semantics of a word, where every dimension corresponds to a unique Wikipedia concept (article). Each component of this vector is the TF-IDF weight of the given word in the corresponding Wikipedia article. The semantic relatedness between two words is expressed by the cosine score between the corresponding vectors. ESA represents composite semantics by creating a high-dimensional vector for a document, which is the sum of the vectors of the words appearing in that document.
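The construction described above can be sketched on a toy scale. The three "articles" below are a hypothetical stand-in for Wikipedia, the TF-IDF variant (raw term frequency times log(N/df)) is one common choice and only an assumption here, and the function names are illustrative rather than part of any ESA implementation.

```python
import math
from collections import Counter

# Hypothetical stand-in for Wikipedia: concept title -> article text.
concepts = {
    "Cat": "cat feline pet animal whiskers cat",
    "Dog": "dog canine pet animal bark dog",
    "Stock market": "stock market shares trading price",
}


def build_inverted_index(concepts):
    """Return word -> {concept: tf-idf weight}, i.e. one sparse ESA vector per word."""
    n = len(concepts)
    tf = {c: Counter(text.lower().split()) for c, text in concepts.items()}
    df = Counter(w for counts in tf.values() for w in counts)  # document frequency
    index = {}
    for c, counts in tf.items():
        for w, freq in counts.items():
            index.setdefault(w, {})[c] = freq * math.log(n / df[w])
    return index


def text_vector(text, index):
    """Sum the ESA vectors of the words in `text`; unknown words are skipped."""
    vec = Counter()
    for w in text.lower().split():
        for c, weight in index.get(w, {}).items():
            vec[c] += weight
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


index = build_inverted_index(concepts)
print(cosine(text_vector("cat pet", index), text_vector("dog pet", index)))
print(cosine(text_vector("cat", index), text_vector("stock", index)))
```

Each word vector has one dimension per concept, so with real Wikipedia data the vectors are very long and sparse, which is why the sketch stores them as dictionaries rather than dense arrays.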
Figure 3.1 illustrates the process of building an ESA vector and calculating the relatedness scores. ESA requires preprocessing the data to build an inverted index of every word appearing in the corpus. For instance, the figure shows an inverted index built over Wikipedia articles. Every entry in the index represents a word and its DSM vector, where $W_{ij}$ is the TF-IDF weight of $word_i$ in the content of the Wikipedia article with $URI_j$. The length of each vector is N, as there are N articles in Wikipedia. Thus, ESA generates a very large and sparse vector in comparison to LSA. Using the inverted index, ESA retrieves the vectors of all the words in a given text, and the semantic interpreter adds them to generate the vector for that text. Finally, it computes a cosine score between the obtained vectors.

Fig. 3.1 Explicit Semantic Analysis

There are several other corpus-based measures that use the probability distribution of terms over a large corpus. Thomas Hofmann proposed Probabilistic Latent Semantic Analysis (PLSA) [64], which extends the classical concept of LSA with a strong statistical foundation by calculating the probability distribution. It uses the aspect model, whose parameters are fitted by maximizing the likelihood of the observed data. PLSA does not include prior distributions over the topics. By using a Dirichlet prior distribution, Latent Dirichlet Allocation [23] showed significant improvement over PLSA. On the same basis, there are further machine learning based models [100, 109, 123] that perform topic modeling and represent the semantics of a word by using dense vectors over these hidden topics.
3.1.3.3 Word embeddings
Learning semantic representations of words with neural network architectures has recently become very popular due to its ability to learn high-quality semantics in a dense, low-dimensional space. The most popular and easy-to-use method is Word2Vec [100], which learns the representations by training a language model over a very large corpus. Mikolov et al. [100] proposed the Skip-gram model, which is trained with the objective of predicting nearby words. Let $w_i$ be the word at the $i$th position; the Skip-gram model then predicts the adjacent words $w_{i-2}$, $w_{i-1}$, $w_{i+1}$ and $w_{i+2}$. Although Word2Vec achieved high accuracy in different NLP tasks, it relies only on the local context around the words. Therefore, the learned representations are highly sensitive to the local context and do not benefit from words that lie farther away in a larger context window. Pennington et al. [109] proposed GloVe, which considers the global context, similar to other matrix factorization based methods. However, GloVe also adds a context representation to the word vector, giving a higher preference to the local context. Levy et al. [88] showed that GloVe can be seen as a matrix factorization method similar to LSA. However, considering the extra context representations makes GloVe perform better than LSA and Word2Vec in the word similarity task.
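To make the Skip-gram prediction targets concrete, the sketch below enumerates the (center word, context word) pairs that such a model would be trained on for a window of two words on each side. It only shows how the training pairs are formed; the neural network and its training procedure are omitted, and the function name and example sentence are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every word within `window` positions."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


tokens = "the quick brown fox jumps".split()
pairs = list(skipgram_pairs(tokens, window=2))

# Context words predicted for w_i = "brown": w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}
print([context for center, context in pairs if center == "brown"])
# prints ['the', 'quick', 'fox', 'jumps']
```

Because only words inside the window ever form a pair, words that are related but usually occur farther apart contribute nothing to training, which is the locality limitation discussed in this section.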
Since word embeddings are learned by giving preference to the local context, they tend to capture words with high substitutability. Therefore, these methods improve accuracy in finding similar words, but they may fail to capture related words that do not appear in the same or similar contexts and generally appear far from each other in a larger context window. In Chapter 6, we analyze the performance of Word2Vec and GloVe in calculating word similarity and relatedness.