Document Signatures
3.2 Term-document matrix
3.2.1 Bag-of-words model
The term-document matrix is a method for describing a bag-of-words document collection as a matrix in the vector space model of document representation [Salton et al., 1975]. In a bag-of-words representation, documents are viewed as consisting purely of the terms they contain, without consideration for the order in which those terms appear or their proximity to other terms. In a term-document matrix, the rows represent documents, the columns represent terms and each cell represents the presence of its column's term within its row's document.
By way of example, consider a simple collection of 3 documents.
• Document A contains the text "The apple is red".
• Document B contains the text "The sky is blue".
• Document C contains the text "The flag is red and blue".
When these documents are processed by the indexing engine, all extra information that is not needed for a bag-of-words representation is stripped out. Capitalisation and punctuation are both ignored, as are any formatting tags or other data not considered part of the document. The indexer typically performs other steps as well, including stopping and stemming, but those steps are omitted from this example as they do not contribute to the explanation.
The terms that appear in these three documents are: "the", "is", "and", "apple", "sky", "flag", "red", "blue". These terms become the columns of the collection's term-document matrix. It is worth noting that any term that appears in the collection requires its own column, irrespective of how few documents that term appears in.
The three documents become the rows of the same matrix, and a given cell has a value based on the incidence of the term its column represents in the document its row represents.
The resultant term-document matrix is depicted in Table 3.1.

             the   is   and   apple   sky   flag   red   blue
Document A    1    1     0      1      0     0      1     0
Document B    1    1     0      0      1     0      0     1
Document C    1    1     1      0      0     1      1     1

Table 3.1: An example term-document matrix of a simple collection
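The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the indexing engine itself: the names `tokenize` and `incidence_row` are hypothetical, the column order is hard-coded to match Table 3.1, and stopping and stemming are omitted, as in the text.

```python
import re

# The example collection from the text.
documents = {
    "Document A": "The apple is red",
    "Document B": "The sky is blue",
    "Document C": "The flag is red and blue",
}

# Column order copied from Table 3.1 (hard-coded so the rows match the table).
vocabulary = ["the", "is", "and", "apple", "sky", "flag", "red", "blue"]


def tokenize(text):
    """Lower-case the text and keep only alphabetic runs, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def incidence_row(text, vocabulary):
    """Return one matrix row: 1 if the term occurs in the text, otherwise 0."""
    terms = set(tokenize(text))
    return [1 if term in terms else 0 for term in vocabulary]


# One row per document, one column per vocabulary term.
term_document_matrix = {
    name: incidence_row(text, vocabulary) for name, text in documents.items()
}

for name, row in term_document_matrix.items():
    print(name, row)
```

Because word order is discarded by the `set`, the sketch keeps only the information a bag-of-words representation needs.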
3.2.2 Query matrix arithmetic
Search queries are provided in the same bag-of-words representation as the documents. In a pure bag-of-words representation, phrase queries [1] are not automatically available. They can be performed by searching for documents that contain all the terms and later removing documents from the result list that do not contain the phrase; however, n-gram representation [2] or similar approaches are typically necessary to enable phrase searching.

[1] Queries that require the result documents to match segments of text consisting of multiple terms. These queries are typically specified by quoting the text in question.
[2] In n-gram representation, unique groups of n adjacent terms are stored. Bag-of-words representation could be considered a case of n-gram representation where n = 1.

the     0
is      0
and     1
apple   0
sky     0
flag    0
red     1
blue    1

Table 3.2: An example query vector, represented as a column matrix

Figure 3.1: Example of performing a query by multiplying the collection’s term-document matrix by the query’s column matrix, returning a second column matrix containing the scores of the three documents for that query

In this particular example each term appears at most once per document, and the associated term-document cell is assigned 1 if the term appears in that document and 0 if it does not. These cells can also be weighted differently, using factors such as the number of times the term appears in the document and the relative importance of the term, to exercise finer control over the results.
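As a rough illustration of count-based weighting, the sketch below fills each cell with the number of times the term occurs instead of a 0/1 flag. The function name `count_row` and the sample sentence are hypothetical and not part of the example collection. Weighting by the relative importance of a term is not shown.

```python
from collections import Counter

# Column order copied from Table 3.1.
vocabulary = ["the", "is", "and", "apple", "sky", "flag", "red", "blue"]


def count_row(text, vocabulary):
    """Return one matrix row whose cells hold raw term counts."""
    counts = Counter(text.lower().split())
    # A Counter returns 0 for terms that never occurred.
    return [counts[term] for term in vocabulary]


# Hypothetical document in which "red" occurs twice.
row = count_row("The red apple is red", vocabulary)
print(row)
```

With such a row, a query containing "red" would contribute twice as much to this document's score as it would under the 0/1 scheme.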
With the term-document matrix prepared, performing a query is a matter of matrix multiplication. Query vectors are constructed in the same way as the rows of the term-document matrix. For instance, the query "red and blue" would be represented as the column matrix depicted in Table 3.2. The query is represented as a column matrix because the right-hand operand of a matrix product must have as many rows as the left-hand operand (here, the term-document matrix) has columns.
Once the query matrix has been constructed, a query is performed by multiplying the term-document matrix by it. The resulting column matrix contains the score of each document with respect to the query (Figure 3.1).
The more terms from the query that appear in a document, the higher that document’s score will be. These scores can then be sorted and the ranked results provided to the user.
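The query multiplication can be sketched as follows for the example collection and the query "red and blue". The helper names `query_vector` and `score` are hypothetical, and the matrix-vector product is written out in plain Python rather than with a linear algebra library.

```python
# Column order and rows copied from Table 3.1.
vocabulary = ["the", "is", "and", "apple", "sky", "flag", "red", "blue"]
names = ["Document A", "Document B", "Document C"]
term_document_matrix = [
    [1, 1, 0, 1, 0, 0, 1, 0],  # Document A
    [1, 1, 0, 0, 1, 0, 0, 1],  # Document B
    [1, 1, 1, 0, 0, 1, 1, 1],  # Document C
]


def query_vector(query, vocabulary):
    """Build the query's column matrix as a flat list of 0/1 entries."""
    terms = set(query.lower().split())
    return [1 if term in terms else 0 for term in vocabulary]


def score(matrix, vector):
    """Multiply the matrix by the column vector: one dot product per row."""
    return [sum(cell * weight for cell, weight in zip(row, vector))
            for row in matrix]


q = query_vector("red and blue", vocabulary)
scores = score(term_document_matrix, q)

# Sort documents by descending score to obtain the ranked result list.
ranking = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
for name, value in ranking:
    print(name, value)
```

Document C contains all three query terms and so scores 3, while Documents A and B each contain one query term and score 1.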
3.2.3 Scalability
This model is a fundamental underpinning of many information retrieval approaches, but has certain issues preventing it from being used as-is in any practical sense.
The collection matrix is very large, with one column per term in the collection’s vocabulary and one row per document in the collection. Even a modest-sized collection like the Wikipedia corpus (§ B.2) requires a matrix that is millions of columns by millions of rows in size.
Such a large matrix would require sparse storage techniques to make up for the inefficiencies associated with reserving columns for all possible terms, even though very few of them actually appear in each document. The matrix multiplication operation would also be expensive to perform on such large matrices, potentially requiring billions of multiplications and additions.
Using the Wikipedia XML corpus as an example illustrates some of the computational problems associated with this approach. This is a collection with 2 666 192 documents and a vocabulary size of 2 132 352 [De Vries and Geva, 2012]. As a result, the term-document matrix for this collection would be 2 666 192 × 2 132 352 and the search query would be a 2 132 352 × 1 column matrix. Multiplying these two matrices would result in a 2 666 192 × 1 output matrix, each cell of which would require 2 132 352 multiplications and additions, for a total of roughly 5.7 trillion multiplications and additions per query. Performing such a computation is not out of the reach of modern hardware, especially as matrix multiplication is highly parallelisable; however, it is still overly computationally expensive for a single query on a document collection that is not particularly large. A sparse representation of the same matrix, while able to take advantage of the fact that only 695 864 108 of the roughly 5.7 trillion cells are filled, would nonetheless still be very expensive to multiply.
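The scale of these figures can be checked with some back-of-the-envelope arithmetic. The sketch below uses the document count, vocabulary size and filled-cell count as stated above. The variable names are illustrative, and the assumption of one multiply-add per cell applies only to a dense product with a single query vector.

```python
# Figures as stated in the text for the Wikipedia XML corpus.
num_documents = 2_666_192   # rows of the term-document matrix
vocabulary_size = 2_132_352  # columns of the term-document matrix
non_zero_cells = 695_864_108  # filled cells reported for the sparse form

# A dense matrix reserves one cell for every (document, term) pair.
dense_cells = num_documents * vocabulary_size

# Assumption: a dense matrix-vector product costs one multiply-add per cell.
dense_ops = dense_cells

# Fraction of cells that actually hold a value.
density = non_zero_cells / dense_cells

# Filled cells per row, i.e. average number of distinct terms per document.
avg_terms_per_document = non_zero_cells / num_documents

print(f"dense multiply-adds per query: {dense_ops / 1e12:.1f} trillion")
print(f"matrix density: {density:.6%}")
print(f"average distinct terms per document: {avg_terms_per_document:.0f}")
```

The very low density is what makes sparse storage attractive, even though the text notes that multiplying the sparse form remains expensive.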