Doc2Vec - Textual Context Matching Based on Semantic Document Embeddings

7.2 Textual Context Matching Based on Semantic Document Embeddings

7.2.1 Doc2Vec

Doc2Vec is a modification of Word2Vec presented by Le and Mikolov [Le14]. It learns fixed-length embeddings from variable-length pieces of text such as documents. Throughout this chapter, we use the terms documents and paragraphs interchangeably. Doc2Vec addresses some of the key weaknesses of bag-of-words models by incorporating more semantics and considering the word order within a small context. As an example of the semantic embedding, the Doc2Vec model embeds the word ‘powerful’ closer to ‘strong’ than to ‘Paris’, which is not the case in bag-of-words models.

The architecture is either based on the distributed memory model (PV-DM), which is similar to the CBOW model of Word2Vec, or on the distributed bag-of-words model (PV-DBOW), which is similar to the skip-gram model. In the following, we describe both approaches in more detail.

Distributed Memory Model

The distributed memory model (PV-DM) is inspired by the continuous bag-of-words model in Word2Vec, which can be summarized as predicting a word given its context. The word vectors are initialized randomly and are then adapted during training as a result of the prediction task. A very similar idea is used in the PV-DM model for Doc2Vec. In addition to word vectors, document or paragraph vectors contribute to the prediction of the next word given different contexts sampled from the respective paragraph. Figure 7.1 shows an example of the PV-DM model in which a set of words and the respective paragraph id are used in the prediction task.
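For reference, the prediction task can be written down compactly. The following is a sketch in the notation of [Le14]; the symbols T (number of words), k (window size), U and b (softmax parameters) and h (averaging or concatenation) come from that paper and are not used elsewhere in this section.

```latex
\[
\frac{1}{T}\sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, \dots, w_{t+k}),
\qquad
p(w_t \mid w_{t-k}, \dots, w_{t+k}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}},
\qquad
y = b + U\,h(w_{t-k}, \dots, w_{t+k};\, W, D)
\]
```

In the plain word-vector setting, h is built from the word matrix W alone. In PV-DM, h additionally takes the vector of the current paragraph from the paragraph matrix D.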

[Figure 7.1 diagram: Paragraph ID (paragraph matrix D) and context words "the dog sat on the" (word matrix W) → Average/Concatenate → Classifier → "table"]

Figure 7.1: The distributed memory model is similar to the CBOW model of Word2Vec. An additional paragraph token is added to the context words and the concatenation or average of the paragraph vector with a context of multiple (five) words is used to predict the sixth word (inspired by [Le14]).

As depicted, Doc2Vec makes use of a document matrix 𝐷 (in addition to the word matrix 𝑊), which holds the paragraphs’ weights (vectors). The document vector and word vectors are averaged or concatenated to predict the next word in a context. The paragraph token can be thought of as an additional word (vectors of documents and words are of equal size). It acts as a memory that remembers what is missing from the current context, or the topic of the paragraph. For this reason, this model is called the Distributed Memory Model of Doc2Vec [Le14]. The context size must be set a priori, and the context words are sampled from a sliding window over the given paragraph. The paragraph vector is shared across all contexts generated from the same paragraph but not across paragraphs. Further, the word matrix is shared across all paragraphs in the corpus, i.e., the vector of a specific word is the same across all paragraphs.
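The PV-DM prediction step described above can be illustrated with a minimal pure-Python sketch. It is not the Gensim implementation. The toy paragraphs, the values of `DIM` and `WINDOW`, the softmax matrix `U` and all function names are assumptions made for illustration. Only the averaging variant and the forward pass are shown, with no training updates.

```python
import math
import random

random.seed(0)

DIM = 4      # embedding size (assumed); identical for words and paragraphs
WINDOW = 3   # number of context words, fixed a priori (assumed value)

# Toy corpus (hypothetical); each paragraph is a list of tokens.
paragraphs = [
    "the dog sat on the table".split(),
    "the cat sat on the mat".split(),
]
vocab = sorted({w for p in paragraphs for w in p})
word_id = {w: i for i, w in enumerate(vocab)}


def random_matrix(rows, cols):
    """Randomly initialized weights, as at the start of training."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


W = random_matrix(len(vocab), DIM)       # word matrix, shared by all paragraphs
D = random_matrix(len(paragraphs), DIM)  # paragraph matrix, one row per paragraph
U = random_matrix(len(vocab), DIM)       # softmax weights (averaging variant)


def contexts(tokens, window=WINDOW):
    """Slide a window over one paragraph and yield (context words, next word)."""
    for i in range(len(tokens) - window):
        yield tokens[i:i + window], tokens[i + window]


def pv_dm_probabilities(par_id, context_words):
    """Average the paragraph vector with the context word vectors, then softmax."""
    vectors = [D[par_id]] + [W[word_id[w]] for w in context_words]
    hidden = [sum(col) / len(vectors) for col in zip(*vectors)]
    scores = [sum(u * h for u, h in zip(row, hidden)) for row in U]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Every context of a paragraph is paired with that paragraph's id.
examples = [(pid, ctx, nxt)
            for pid, par in enumerate(paragraphs)
            for ctx, nxt in contexts(par)]
probs = pv_dm_probabilities(examples[0][0], examples[0][1])
```

All contexts drawn from the same paragraph reuse the same row of `D`, while rows of `W` are reused across paragraphs. This mirrors the sharing scheme described above.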

A significant advantage of PV-DM is that it addresses the key weakness of bag-of-words models. While bag-of-words models ignore the word order entirely, PV-DM considers the word order in a small context, which is comparable to n-gram models with relatively large n. The authors claim that PV-DM might be better than bag-of-n-grams models since these models would create a very high-dimensional representation that tends to generalize poorly [Le14].
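The loss of word order in bag-of-words models can be seen in a small sketch. The two sentences and the helper `ordered_contexts` are hypothetical. The helper only mimics the sliding-window contexts that PV-DM consumes and does not train anything.

```python
from collections import Counter

# Two toy sentences with the same words in a different order (assumed example).
a = "dog bites man".split()
b = "man bites dog".split()

# A bag-of-words representation only counts words, so both sentences collapse
# to the same representation.
bow_a, bow_b = Counter(a), Counter(b)


def ordered_contexts(tokens, window=2):
    """Ordered (context, next word) pairs from a sliding window."""
    return [(tuple(tokens[i:i + window]), tokens[i + window])
            for i in range(len(tokens) - window)]


# The windowed contexts keep the order and therefore differ.
ctx_a = ordered_contexts(a)
ctx_b = ordered_contexts(b)
```

Here `bow_a == bow_b` holds, whereas `ctx_a` and `ctx_b` differ, so a model trained on ordered contexts can in principle tell the two sentences apart.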

Distributed Bag-of-Words Model

While the distributed memory model concatenates/averages the word vectors and the paragraph vector to predict the next word in a context, the distributed bag-of-words model works the other way around. It ignores the context words in the input and tries to predict words randomly sampled from the paragraph in the output. More technically, in each iteration the algorithm samples a text window, then samples a word within this text window and creates a classification task given the paragraph id (vector). We provide an example of the PV-DBOW model in Figure 7.2.

[Figure 7.2 diagram: Paragraph ID → Paragraph Matrix (D) → Classifier → output words: sat, dog, the, on, the]

Figure 7.2: In the distributed bag-of-words model of Doc2Vec, the paragraph vector is trained to predict the words in a small window (inspired by [Le14]).
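The sampling step of PV-DBOW can be sketched as follows. This is an illustration under assumptions, not the procedure of any specific library. The function name `pv_dbow_example`, the toy paragraph and the window size are made up for this example.

```python
import random


def pv_dbow_example(par_id, tokens, window, rng):
    """Create one PV-DBOW classification example.

    Samples a text window from the paragraph, then one word inside that
    window. The input of the task is only the paragraph id; the sampled
    word is the target. Assumes len(tokens) >= window.
    """
    start = rng.randrange(len(tokens) - window + 1)
    text_window = tokens[start:start + window]
    target = rng.choice(text_window)
    return par_id, target


rng = random.Random(0)  # fixed seed so the sketch is repeatable
tokens = "the dog sat on the table".split()
pairs = [pv_dbow_example(0, tokens, 4, rng) for _ in range(5)]
```

Each pair contains no context words on the input side, which is why no word matrix is required for this model.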

In comparison to PV-DM, this model is conceptually simpler and requires less memory during computation. We only need to store the softmax weights and the paragraph weights, whereas PV-DM stores the softmax weights, the word weights and the paragraph weights. We emphasize that some frameworks that contain a PV-DBOW Doc2Vec implementation also offer to train word vectors to improve the paragraph vectors during the training process (e.g., Gensim [Řeh10]). However, this is out of scope for this work and we refer to the respective literature. The PV-DBOW model can be compared to the skip-gram model used in Word2Vec [Mik13a].
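The difference in memory can be made concrete with a rough parameter count. The sketch below assumes a full softmax without bias terms and ignores approximations such as hierarchical softmax or negative sampling. The function `parameter_counts` and the example sizes are hypothetical and do not correspond to the settings used in our experiments.

```python
def parameter_counts(vocab_size, num_paragraphs, dim, context_size, concatenate=False):
    """Rough number of stored weights for PV-DM and PV-DBOW (no biases)."""
    paragraph_weights = num_paragraphs * dim
    word_weights = vocab_size * dim
    # With concatenation, the classifier input holds the paragraph vector
    # plus one vector per context word; with averaging it has size dim.
    dm_input = dim * (context_size + 1) if concatenate else dim
    pv_dm = vocab_size * dm_input + word_weights + paragraph_weights
    pv_dbow = vocab_size * dim + paragraph_weights
    return {"PV-DM": pv_dm, "PV-DBOW": pv_dbow}


counts = parameter_counts(vocab_size=10_000, num_paragraphs=1_000,
                          dim=100, context_size=5)
# counts == {"PV-DM": 2_100_000, "PV-DBOW": 1_100_000}
```

Under these assumptions, the saving of PV-DBOW equals the size of the word matrix when averaging is used, and it grows further when PV-DM concatenates its inputs.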

In summary, Doc2Vec provides several significant advantages over other baseline approaches:

• The word order is considered to preserve information in the paragraphs (PV-DM).

• Paragraph vectors inherit the semantics of words.

• Vectors are trained from unlabeled data, which becomes useful for tasks lacking enough labeled data.

Various authors were concerned about the optimal parameter tuning since the results of the original authors were hard to reproduce. For instance, Dai et al. [Dai15] provided a thorough comparison of Doc2Vec to other document modeling algorithms such as LDA on Wikipedia and arXiv. They also evaluated the accuracy of the method as they varied the dimensionality of the learned representations. Further, Lau and Baldwin [Lau16] provided a rigorous evaluation of Doc2Vec on the Forum Question Duplication and Semantic Textual Similarity tasks. In that work, the authors provided a significant number of parameter suggestions and evaluations for Wikipedia and other data sets. We used both papers to adapt our Doc2Vec parameter settings for our experiments.

As with Word2Vec, we used the VSM toolkit Gensim [Řeh10] to train our Doc2Vec model. Training on the Wikipedia corpus took ≈2 days with 5 iterations overall on our server with 20 cores and 25 GB RAM.

In document Robust Entity Linking in Heterogeneous Domains (Page 133-135)