components as for traditional data (Kimball, 1996). In particular, all the standard elements of the ETL (extraction-transformation-load) process must be present (Kimball & Caserta, 2002): First, it is necessary to find sources of documents, and to establish the contents and capabilities of those sources; next, one must specify procedures for retrieving the documents into the warehouse. This includes querying the sources according to their capabilities (e.g., do they accept SQL?), determining when the sources offer new documents (not already retrieved), and physically moving the documents into the warehouse. Transformations may also be needed, as in the case of regular data: Some formats may have to be processed, and some documents may need cleaning. For instance, e-mail messages may have to be stripped of headers attached to them by different programs; documents in PDF or Word may be stripped of their formatting, leaving only the text. In some cases, documents may be stored both in their original form and in a standardized, cleaned form. Then, it is necessary to determine how documents are to be stored. Since documents may vary substantially in size (from a small Web page to a large manual), this is a complex decision. Simplifying considerably, two typical options are to store the document outside the database software that supports the warehouse, leaving pointers to the location inside the database, or to store the document inside the database. We explore this second option only, as it seems to be the most widely accepted nowadays. The manner in which the documents are stored depends on the options offered by the database system being used; we describe a typical scenario for a relational database. For each document, there are two parts that we need to store: the document itself, and its metadata.
The document itself is usually stored as an attribute in a table, as most RDBMSs now support storage of binary large objects (BLOBs) and character large objects (CLOBs) (Melton & Simon, 2002). Documents are very rich in metadata, and possess several types: traditional metadata (like information about sources, size, etc.), content metadata (like title, subject, description and format), plus other metadata specific to documents (for
instance, information about versions, author, etc.) (Sullivan, 2001). While there is no absolute agreement on what constitutes adequate content metadata, standards like the Dublin Core are usually considered the reference (Dublin, 2004). A data warehouse would ideally capture all these different types of metadata. Note that a data warehouse usually has a metadata repository, which contains information about the data in the warehouse (Kimball, 1996; Kimball & Caserta, 2004). While some document metadata may go into that repository, the rest should go into tables. Thus, depending on the amount of metadata present and the degree of normalization desired, the typical option is to create one or more tables that include as attributes the document itself and all other metadata. When several tables are needed, it is usually possible to create a small star schema (Kimball, 1998), with the fact table being the one holding the documents, and different types of metadata stored in dimension tables: One dimension may correspond to content metadata, another one to storage metadata, and yet another one to author information (Sullivan, 2001).
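As a rough illustration of the star schema just described, the following sketch uses Python's built-in sqlite3 module. All table and column names are hypothetical, chosen only for this example, and SQLite's TEXT type stands in for the CLOB type of a full RDBMS.

```python
import sqlite3

# Hypothetical schema: every table and column name here is illustrative only.
SCHEMA = """
CREATE TABLE content_dim (
    content_id  INTEGER PRIMARY KEY,
    title       TEXT,
    subject     TEXT,
    description TEXT,
    format      TEXT
);
CREATE TABLE storage_dim (
    storage_id  INTEGER PRIMARY KEY,
    source      TEXT,
    size_bytes  INTEGER
);
CREATE TABLE author_dim (
    author_id   INTEGER PRIMARY KEY,
    name        TEXT
);
CREATE TABLE document_fact (
    doc_id      INTEGER PRIMARY KEY,
    body        TEXT,  -- the document itself (a CLOB in other systems)
    content_id  INTEGER REFERENCES content_dim(content_id),
    storage_id  INTEGER REFERENCES storage_dim(storage_id),
    author_id   INTEGER REFERENCES author_dim(author_id)
);
"""

def build_warehouse():
    """Create an in-memory database holding the fact and dimension tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

def load_document(conn, body, title, fmt, source, author):
    """Insert one document and its metadata; return the new fact-table key."""
    cur = conn.cursor()
    cur.execute("INSERT INTO content_dim (title, format) VALUES (?, ?)",
                (title, fmt))
    content_id = cur.lastrowid
    cur.execute("INSERT INTO storage_dim (source, size_bytes) VALUES (?, ?)",
                (source, len(body.encode("utf-8"))))
    storage_id = cur.lastrowid
    cur.execute("INSERT INTO author_dim (name) VALUES (?)", (author,))
    author_id = cur.lastrowid
    cur.execute(
        "INSERT INTO document_fact (body, content_id, storage_id, author_id) "
        "VALUES (?, ?, ?, ?)",
        (body, content_id, storage_id, author_id))
    conn.commit()
    return cur.lastrowid
```

In this sketch each load creates fresh dimension rows; a real warehouse would normally look up and reuse an existing author or source row instead.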
However, what really characterizes a document is its content, which is in some natural language and therefore cannot be easily analyzed with traditional database tools. In order to deal effectively with document content, information retrieval tools are used. Information retrieval (IR) treats a document as an (unordered) bag of words and does not carry out any syntactic or semantic analysis. Each document in a collection is scanned and tokenized (i.e., divided into discrete units or tokens); stopwords (words that appear in almost every document and carry little meaning, like “the” or “and”) are dropped, and the remaining words are normalized (stemmed). Stemming involves getting rid of word inflections (prefixes, suffixes) to link several words to a common root (like “running” and “run”); it usually involves quite a few manipulations, like normalizing verbal forms. Thus, each document ends up as a list of “content” words. Each word is then substituted by a number that represents the importance of the word in the document (usually simply a count of how many times the word appears) and in the collection (usually, the inverse of the fraction of documents where the word appears). As a result, the document is seen as a vector <t1,...,tn> where each ti represents the weight given to the ith term in the document.
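The preprocessing steps above (tokenizing, dropping stopwords, stemming, counting) can be sketched as follows. The stopword list is a tiny illustrative sample, and naive_stem is a deliberately crude, hypothetical stand-in for a real stemming algorithm; neither is meant as a production component.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real systems use much longer ones.
STOPWORDS = {"the", "and", "a", "of", "is", "in", "to"}

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling: "running" -> "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

def bag_of_words(text):
    """Tokenize, drop stopwords, stem, and count the remaining content words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    content = [naive_stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(content)
```

For example, bag_of_words("The dog is running and the dogs run") counts "dog" twice and "run" twice: word order is lost, which is exactly the bag-of-words simplification.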
Technically, let D be a collection of documents, and let T be the collection of all terms in D. The term occurrence frequency (tf) is the number of times a term appears in a document. For term t in T, document d in D, we denote the occurrence frequency of t in d as tf(t,d). The intuition is that if t occurs frequently, then it is likely to be very significant to d. The inverse document frequency of t (in symbols, idf(t)) is computed as the inverse of the fraction of documents in D where t appears, that is, the total number of documents divided by the number of documents containing t (a logarithm is commonly applied to dampen the value). The idf tries to account for the fact that terms occurring in many documents are not good discriminators. In fact, some experiments have shown that words that occur too frequently or too infrequently are not good discriminators (the latter because they have a lower probability of being used). A
tf-idf weight is a weight assigned to a term t in a document d, obtained by combining tf(t,d) and idf(t). Simple multiplication may be used, but most of the time a normalization factor is also introduced in the calculation. Basically, this makes different weights comparable by neutralizing document length (without the normalization, longer documents — which tend to have higher tfs — would dominate the collection).
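A minimal sketch of the tf-idf computation follows. It assumes two common choices that the text leaves open: idf(t) is taken as the logarithm of the number of documents divided by the number of documents containing t, and normalization divides each document vector by its Euclidean length. The function name tf_idf_vectors is hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # df(t): number of documents in which t appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    # idf(t) = log(N / df(t)); the logarithm is an assumed damping choice.
    idf = {t: math.log(n_docs / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf(t, d): raw occurrence counts
        weights = {t: tf[t] * idf[t] for t in tf}
        # Length normalization: divide by the Euclidean norm of the vector
        # so that longer documents do not dominate.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors
```

Under these assumptions a term that occurs in every document receives weight 0, since log(1) = 0, which matches the intuition that such a term does not discriminate between documents.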
A query in natural language can be represented as a vector, just like a document. That is, non-content words are thrown out, stemming is used, words are possibly grouped into phrases, and weights are assigned to each term by the system. When the query consists simply of a list of keywords (the usual case in IR), transforming it into a vector is trivial: We simply set to 1 the terms that appear in the query, and all others to 0. Answering a query, in this context, means finding the document vectors which are closest to the query vector, where “close” is defined by some similarity (or distance) measure in the vector space. A possible similarity measure is the dot product of two vectors: the larger the value, the closer the vectors. If the vectors are binary (a 0 or 1 reflecting presence/absence of a word in a document is used instead of a tf-idf weight), this value gives the number of terms in common. However, this measure favors longer documents, which are likely to have more terms, i.e., to be represented by less sparse vectors. To remedy this, another measure called the cosine function is used. Intuitively, this measure gives the cosine of the angle between the two vectors in the vector space; in this representation, two vectors with no common terms are orthogonal to each other. While it has been found that the cosine function favors short documents over long ones, it is still the most widely used measure (it is difficult to determine how to deal best with document length: There is a difference between a document being longer because it treats more topics, and being longer because it repeats the same message more times, i.e., redundancy).
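The two measures just discussed can be sketched over sparse vectors stored as dictionaries from term to weight. The representation and the function names (dot, cosine, keyword_query) are assumptions made for this example.

```python
import math

def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)

def keyword_query(keywords):
    """Binary query vector: 1 for each keyword, implicitly 0 elsewhere."""
    return {k: 1.0 for k in keywords}
```

With binary vectors, dot(keyword_query(["dog", "run"]), doc) counts the terms shared by the query and the document, so a document that contains more distinct terms can only score the same or higher; dividing by the vector lengths, as cosine does, removes that advantage.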
Extensions to this basic framework abound. For instance, some systems accept phrases. Technically, phrases are n-grams (that is, n terms adjacent in the text) that together have a meaning different from the separate terms (for example, “operating system”), and hence are terms in their own right. However, discovering phrases may be complicated, so not all systems support this option. Other systems use approximate string matching in order to deal with typos, misspellings, etc., which may be quite frequent in informal messages (like e-mails) due to the lack of editorial control.
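Both extensions can be illustrated with a short sketch. Treating n-grams that recur in the collection as candidate phrases is only a crude frequency heuristic assumed here, not a method the text prescribes; the approximate matching relies on get_close_matches from Python's standard difflib module. The helper names and the thresholds are hypothetical.

```python
import difflib
from collections import Counter

def ngrams(tokens, n=2):
    """All runs of n adjacent tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidate_phrases(docs, n=2, min_count=2):
    """n-grams occurring at least min_count times across the collection."""
    counts = Counter(g for doc in docs for g in ngrams(doc, n))
    return {g for g, c in counts.items() if c >= min_count}

def correct_term(term, vocabulary, cutoff=0.8):
    """Map a possibly misspelled term to its closest vocabulary entry, if any."""
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else term
```

For instance, correct_term("sytem", ["system", "design"]) yields "system", while a term with no sufficiently similar vocabulary entry is returned unchanged.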
One significant advantage of using a similarity measure is that the documents retrieved can be ranked or ordered according to this measure. When the number of documents is large (which is usual), the user can focus on only the top k results (or the system may retrieve only the top k results). This is one of the most outstanding differences between IR and database systems, since in database systems the answer to a query is a set (unordered
collection) of tuples or other elements (even in object-oriented databases, where ordered lists can be returned as answers, the ordering has nothing to do with a notion of ranking; all elements of the answer are equally important and fully justified to be in the answer (Catell et al., 2000)).
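Ranking and top-k selection can be sketched in a few lines, assuming the similarity scores have already been computed and are held in a dictionary keyed by document identifier (a hypothetical representation chosen for this example).

```python
import heapq

def top_k(scores, k):
    """scores: {doc_id: similarity}. Returns the k best (doc_id, score) pairs,
    highest score first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

Using a heap avoids sorting the whole collection when k is much smaller than the number of documents.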
To support the efficient implementation of the vector-space concept, most systems rely on the idea of an inverted index (Belew, 2000). Given a collection of documents, each document is analyzed as explained earlier (tokenized, and perhaps subjected to additional steps). Then, an index is built by creating a sorted list of terms; for each term, a list of document identifiers (the documents where the term appears) is kept. This is the simplest form of inverted index. One can also keep, for each term and document, the number of occurrences. Or, even more, one can keep the position(s) where the term appears (this makes it possible to support proximity queries, where the user asks that some term appear near others). These indices are very useful as they support both Boolean searches and vector-space searches. However, such an index may grow very large, and it may also be hard to maintain if the collection of documents is not static (in any sense: if the collection grows and shrinks and/or if the documents already in the collection can be edited). Note that in typical IR applications (libraries) the collection is static, but in modern applications (e.g., the Web) this is not the case at all. Therefore, a lot of research has gone into compression and maintenance of the index (Belew, 2000; Chakrabarti, 2003).
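A positional inverted index of the kind described above can be sketched with nested dictionaries. This is a minimal in-memory illustration with hypothetical function names; it ignores the compression and maintenance issues mentioned in the text.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: list of tokens}. Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def boolean_and(index, terms):
    """Identifiers of the documents containing every one of the given terms."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

def near(index, a, b, window):
    """Documents where some occurrence of a is within `window` positions of b."""
    hits = set()
    for doc_id in boolean_and(index, [a, b]):
        pos_a, pos_b = index[a][doc_id], index[b][doc_id]
        if any(abs(i - j) <= window for i in pos_a for j in pos_b):
            hits.add(doc_id)
    return hits
```

The number of occurrences of a term in a document is simply the length of its position list, so the same structure serves Boolean, proximity, and vector-space queries.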
To evaluate the performance of an IR system, two main metrics are used: recall and precision. Recall is the ratio of relevant documents that are retrieved to the total number of relevant documents (in other words, the fraction of relevant documents actually retrieved). Precision is the ratio of relevant documents that are retrieved to the total number of retrieved documents (that is, the fraction of documents retrieved that is actually relevant). It is common that when recall increases, precision decreases. That is, when we retrieve more documents, we increase the probability of retrieving more relevant documents, but we also tend to retrieve more irrelevant documents. By convention, if no documents are returned, precision is considered 1 but recall is considered 0. From that starting point, as the answer set grows, recall cannot decrease, while precision is likely to drop.
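The two definitions translate directly into code. The sketch below assumes that the set of relevant documents is known and non-empty, and it adopts the convention stated above for an empty answer set; the function name is hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the documents returned by the system.
    relevant:  identifiers of the documents truly relevant (assumed known).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        raise ValueError("recall is undefined without relevant documents")
    hits = len(retrieved & relevant)
    # Convention from the text: an empty answer set has precision 1.
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant)
    return precision, recall
```

For example, retrieving four documents of which two are relevant, when three relevant documents exist in the collection, gives a precision of 2/4 and a recall of 2/3.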
Note that both measures depend on knowing the set of all the documents that are truly relevant to a query. In practice, this is very problematic, as it would involve defining relevance and examining the whole collection. Relevance seems to be a subjective concept (what is relevant to one user may not be to another), and thus quite hard to formalize. Therefore, these measures are idealizations. In principle, determining all documents relevant to a query, independent of any particular user, could be achieved with a corpus carefully annotated by a set of diverse experts. However, the effort required to fully annotate large collections, and to ensure that annotations form a consensus, is so large that such
collections are very rare. Only the TREC conferences and a few research centers have managed to create collections of this nature (TREC, 2003).