2.1.2 Vector space representation of documents

The vector space model is a representation of term occurrences in documents that is primarily of historic significance, but provides a basis and intuition for many of the related works that we discuss in the following. We only give a brief overview and refer to textbooks for further details (for example, see [125]).

From bags of words to vectors

To represent the text contained in a collection of documents in numeric form, the vector space model discards the word order and treats documents as sets of words (typically, multisets are considered, which are also called bags, hence the name). Each word in the vocabulary W of the collection then corresponds to one component of a vector whose dimension |W| equals the size of the vocabulary. In order to obtain a representation of the content of documents, each document d is treated as a |W|-dimensional vector v_d. Entries of v_d equal zero if the corresponding word (or term) is not contained in the document, and are non-zero if the word is contained. In the simplest form, one could consider a binary vector whose entries equal 1 if and only if the corresponding word is contained in the document, but more sophisticated approaches are viable, as we discuss in the following. Intuitively, the vector space model thus represents each document as an index vector that encodes for each term whether it occurs in the document. A document collection then becomes a collection of vectors that allow the computation of the similarities of documents or the importance of their contents.
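As a minimal sketch of the binary index vectors described above, the following Python snippet builds a vocabulary from a toy collection and maps each document to a |W|-dimensional 0/1 vector. The whitespace tokenizer, the function names, and the example sentences are assumptions made for illustration and are not part of the original text.

```python
def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_vocabulary(documents):
    # Sorted list of distinct terms; a term's position is its vector component.
    return sorted({term for doc in documents for term in tokenize(doc)})


def binary_vector(document, vocabulary):
    # Entry is 1 if and only if the term occurs in the document, else 0.
    present = set(tokenize(document))
    return [1 if term in present else 0 for term in vocabulary]


# Toy collection (illustrative data only).
docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab = build_vocabulary(docs)
vectors = [binary_vector(d, vocab) for d in docs]

print(vocab)
for vec in vectors:
    print(vec)
```

Sorting the vocabulary only fixes a reproducible order of components; any fixed assignment of terms to dimensions would serve equally well.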

Term weighting

Index vectors with binary entries as described above have the downside of discarding the frequency information of terms, which can be useful in estimating the relevance of a term for a given document. Thus, instead of using binary entries, one could also use integer values that count the occurrences of each word in a document. This value is typically referred to as the term frequency tf(t,d) and denotes the frequency of term t in document d. This, however, is prone to creating an imbalance in the vectors, since some words are simply more frequent than others, while not necessarily being more important. For example, consider common stop words such as articles, prepositions, or conjunctions, which add little content to a document but occur frequently. Thus, it makes sense to normalize the term frequency by the overall frequency of the word across all documents in the collection, the document frequency df(t). Since we want to encode greater word importance as larger values, it makes sense to consider the inverse of the document frequency idf(t) for normalization. A commonly used version uses a logarithmic scaling and defines the inverse document frequency as

$$\mathrm{idf}(t) := \log \frac{|W|}{\mathrm{df}(t)}. \tag{2.1}$$

The normalization of the term frequency with this inverse document frequency is then referred to as the tf-idf score [173], and is defined as

$$\text{tf-idf}(t,d) := \mathrm{tf}(t,d) \, \log \frac{|W|}{\mathrm{df}(t)}. \tag{2.2}$$

The resulting score can be used as component values of the document vectors. In contrast to simple frequency vectors, these normalized vectors now encode a sort of relevance information, since the component values are highest for terms that occur frequently in a small number of documents, and lower for terms that are either infrequent in the given document or frequent across many documents. This weighted vector representation then enables a scoring and comparison of document vectors.
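The weighting scheme can be sketched in a few lines of Python. The snippet below is an illustration under stated assumptions, not a reference implementation: it takes df(t) to be the number of documents that contain t, and it passes the numerator of the logarithm as an explicit argument. The example call uses the vocabulary size |W| to follow the equations as printed here, whereas many textbook formulations use the number of documents in the collection instead. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def document_frequency(term, tokenized_docs):
    # Assumption: df(t) is the number of documents that contain the term.
    return sum(1 for tokens in tokenized_docs if term in tokens)


def tf_idf_vector(tokens, vocabulary, tokenized_docs, numerator):
    # tf-idf(t, d) = tf(t, d) * log(numerator / df(t)), one entry per term.
    tf = Counter(tokens)
    vector = []
    for term in vocabulary:
        df = document_frequency(term, tokenized_docs)
        # Terms that occur in no document get weight 0 (assumed convention).
        idf = math.log(numerator / df) if df > 0 else 0.0
        vector.append(tf[term] * idf)
    return vector


# Toy collection (illustrative data only).
docs = ["the cat sat on the mat", "the dog chased the cat"]
tokenized = [d.lower().split() for d in docs]
vocab = sorted({t for tokens in tokenized for t in tokens})

# Numerator |W| follows the equations as printed in this section.
weights = [
    tf_idf_vector(tokens, vocab, tokenized, numerator=len(vocab))
    for tokens in tokenized
]

for term, w in zip(vocab, weights[0]):
    print(f"{term}: {w:.3f}")
```

Keeping the numerator as a parameter makes it easy to compare the variant printed above with the document-count variant on the same data.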

Vector similarity

Based on the representation of documents as vectors, we now have a direct means of computing the similarity of documents as a similarity of vectors. Since each dimension of the vector space corresponds to one specific term and the component values of each document vector denote the weight of this term for the document, an intuitive approach is to use the angle between two vectors to derive a similarity. The more similar the content of the documents is, the smaller the angle, and vice versa. This is captured by the cosine similarity, which is given for two document vectors v_{d_1} and v_{d_2} as

$$\cos(v_{d_1}, v_{d_2}) = \frac{v_{d_1} \cdot v_{d_2}}{\|v_{d_1}\| \, \|v_{d_2}\|},$$

where · denotes the inner product of the two vectors, and ‖·‖ is the Euclidean norm. While we introduced it for the similarity of document vectors here, the cosine similarity is also frequently used to compare vector embeddings as discussed in the following, or to measure the similarity of graph neighbourhoods [184].
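The cosine similarity translates directly into code. The following sketch uses plain Python lists as vectors; the function name and the choice to return 0 for a zero vector (where the quotient is undefined) are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    # Inner product divided by the product of the Euclidean norms.
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Parallel vectors have a similarity of 1, orthogonal vectors of 0.
print(cosine_similarity([1, 2], [2, 4]))
print(cosine_similarity([1, 0], [0, 1]))
```

Because document vectors with non-negative weights have no negative components, their cosine similarity lies between 0 and 1.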

The retrieval of information from document collections is then enabled by modelling queries as small documents in the same vector space as the documents in the collection. By comparing the vector that represents the query terms to the document vectors, the documents can be ranked according to the computed similarity score, and presented to the user as a ranked list of documents by descending relevance to the query. In Chapter 3, we use a similar intuition for graphs instead of vectors, by representing the joint content of the document collection as a graph, in which we then search the local neighbourhood of nodes that correspond to query entities and terms.
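The vector-based ranking described at the start of this paragraph can be illustrated with a short, self-contained Python sketch. For brevity it uses raw term-frequency vectors rather than tf-idf weights, and it ignores query terms that do not occur in the collection, since they have no component in the vector space. Both choices, along with the function names and the toy data, are assumptions for illustration only.

```python
import math
from collections import Counter


def to_vector(text, vocabulary):
    # Term-frequency vector over the collection vocabulary.
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def rank(query, documents):
    # Treat the query as a small document in the same vector space.
    vocabulary = sorted({t for d in documents for t in d.lower().split()})
    query_vec = to_vector(query, vocabulary)
    scored = [
        (index, cosine(query_vec, to_vector(doc, vocabulary)))
        for index, doc in enumerate(documents)
    ]
    # Descending similarity; ties keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy collection and query (illustrative data only).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply",
]
ranking = rank("dog cat", docs)

for index, score in ranking:
    print(f"{score:.3f}  {docs[index]}")
```

Replacing `to_vector` with a tf-idf weighting would reduce the influence of frequent words such as "the" on the ranking without changing the rest of the procedure.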

In document Implicit Entity Networks: A Versatile Document Model (Page 31-33)