The IR-based approach has severe limitations, since the only information available is which terms appear in which documents. Given that terms are (at best) second-order indicators of the content of a document, analysis of document content is very difficult in this framework. The vector-space approach also presents some issues of its own. We briefly review some of the difficulties inherent to the IR approach.
One problem with the vector-space approach outlined earlier is that the number of dimensions in the space equals the number of distinct terms across all documents. However, any given document is unlikely to contain all terms, or even a large fraction of them. As a result, each vector representing a document is usually sparse, with zeroes for all the terms that do not appear in that document. Also, when we make each term a dimension, we implicitly assume that terms are independent of each other. This is rarely the case: some terms may be highly correlated with others, and this is masked by giving each of them an independent dimension (for instance, synonyms are clearly terms that should be given a single, shared dimension). Because of this, several techniques attempt dimensionality reduction, that is, they reduce the number of dimensions used to represent documents by switching from terms to something else.
One of the best-known techniques for reducing dimensions is latent semantic indexing (LSI), which is based on a linear algebra technique, singular value decomposition (SVD) (Berry & Browne, 1999; Belew, 2000). Without entering into technical details, SVD treats the collection as a term-document matrix, whose dimensions are the number of terms and the number of documents, and in which each cell contains the number of appearances of a given term in a given document. As explained above, this matrix is bound to be sparse; SVD can be seen as a compression technique. Given n total terms in a corpus, SVD will create a reduced representation with k < n dimensions, in which each document is represented by a vector in k-space instead of n-space. Each of the k factors is uncorrelated with the others. The factors do not correspond exactly to the original terms, however. Rather, they capture hidden correlations among terms and documents. Terms that were not close in the original n-space may now be close in the k-space because they co-occur frequently in documents deemed similar. Consider the terms “car,” “driver,” “automobile” and “elephant” (this example is from Berry and Browne, 1999). If “car” and “automobile” co-occur with many of the same words, they will be mapped very close together (or to the same dimension) in the k-space. The word “driver,” which is somewhat related, will be mapped to a somewhat close dimension. The word “elephant,” which is unrelated, will be mapped to a different dimension. Thus, even though something is lost (because only k < n dimensions are retained), the hope is that what is lost is noise, since the most important “patterns” are the ones retained.
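The effect described above can be reproduced on a toy example. The following is only a minimal sketch, assuming NumPy is available; the term-document counts, the choice k = 2, and the helper names (cosine, sim_before, sim_after) are invented for illustration and are not taken from Berry and Browne.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors (assumes neither is all zeros)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

terms = ["car", "automobile", "driver", "elephant"]

# Hypothetical term-document matrix: rows are terms, columns are documents,
# each cell is the count of that term in that document.
# "car" and "automobile" never share a document, but both co-occur with "driver".
A = np.array([
    [1, 1, 0, 0, 0, 0, 0],  # car
    [0, 0, 1, 1, 0, 0, 0],  # automobile
    [1, 1, 1, 1, 0, 0, 0],  # driver
    [0, 0, 0, 0, 1, 1, 1],  # elephant
], dtype=float)

# Full SVD, then keep only the k largest singular values (truncation).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]      # one row per term, in k-space
doc_vecs = Vt[:k, :].T * s[:k]    # one row per document, in k-space

idx = {t: i for i, t in enumerate(terms)}

def sim_before(a, b):
    """Similarity of two terms in the original document space."""
    return cosine(A[idx[a]], A[idx[b]])

def sim_after(a, b):
    """Similarity of two terms after projection onto the k factors."""
    return cosine(term_vecs[idx[a]], term_vecs[idx[b]])

for a, b in [("car", "automobile"), ("car", "driver"), ("car", "elephant")]:
    print(a, b, round(sim_before(a, b), 3), round(sim_after(a, b), 3))
```

In this toy corpus “car” and “automobile” have no document in common, so their original similarity is zero. After truncation they fall on the same factor because both co-occur with “driver,” while “elephant” stays on a separate factor. With such a small k, “driver” collapses onto that same factor as well; a realistic corpus and a larger k would leave it merely close.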
Ideally, the set of terms for a corpus should be exhaustive (have terms for all the topics any user may want to search for) and specific (have terms that help identify the relevant documents for any topic). However, the two properties are in tension; there is a trade-off between them. In effect, if an index is exhaustive, recall is high but specificity may suffer (if we associate many keywords with a document, we increase its chances of being retrieved, but precision may go down); if the index is not very exhaustive, we can achieve high precision, but recall may not be very good. Note that this analysis concerns the corpus (i.e., the documents); if we look at the user’s queries, the set of terms used there may be different. Thus, there may be a mismatch between the terms used in queries and the terms in the corpus; this is called the vocabulary mismatch (Belew, 2000).
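The trade-off can be made concrete with a small calculation. The sketch below is illustrative only: the two keyword indexes, the document identifiers d1 to d4, and the set of relevant documents are all hypothetical, and the functions precision_recall and retrieve are names introduced here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def retrieve(index, term):
    """Return the ids of all documents indexed under the given term."""
    return {doc for doc, keywords in index.items() if term in keywords}

# Hypothetical ground truth: only d1 and d2 are really about takeovers.
relevant = {"d1", "d2"}

# Exhaustive index: many keywords per document, including passing mentions.
exhaustive = {
    "d1": {"takeover", "bid", "software"},
    "d2": {"takeover", "merger", "price"},
    "d3": {"football", "transfer", "takeover"},  # mentions the word in passing
    "d4": {"weather", "forecast"},
}

# Less exhaustive index: one keyword per document.
narrow = {
    "d1": {"takeover"},
    "d2": {"merger"},
    "d3": {"football"},
    "d4": {"weather"},
}

p_exh, r_exh = precision_recall(retrieve(exhaustive, "takeover"), relevant)
p_nar, r_nar = precision_recall(retrieve(narrow, "takeover"), relevant)
print("exhaustive:", p_exh, r_exh)
print("narrow:    ", p_nar, r_nar)
```

Under these assumptions the exhaustive index retrieves every relevant document but also one irrelevant one, while the narrow index retrieves only a relevant document but misses the other.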
Any approach based on the appearance of particular words has to deal with issues like synonyms, homonyms, ambiguity, and lack of context. One route that has been tried to attack the problem of relations among words (homonyms, synonyms and similar relations) is a knowledge-based approach: the use of a thesaurus. A thesaurus is a list of words together with a set of relationships among those words. There are some relationships that all thesauri have: a “broader, more general term” or superclass relationship, and its inverse, the “narrower term” or “more concrete” or subclass relationship. Besides these, a “synonym” relationship may connect synonyms, and an “antonym” relationship may connect antonyms. Other semantic relationships may also exist, like “related term.” Some thesauri will, at higher levels in the hierarchy of terms, have a “theme” or “topic” relationship that relates an abstract word (like “war”) to words that are connected to some aspect of the topic that the abstract word represents (like “weapons,” “strategy” or “history”). Note that this connection is very informal and may link words that are only somewhat related; some words are related to other words only in a particular context.
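One way to picture such a structure is as a labeled graph over terms. The following is a minimal sketch, not a description of any particular thesaurus: the Thesaurus class, its method names, the relation labels, and the sample entries are all assumptions made for illustration, as is the choice to expand a query term with its synonyms and narrower terms.

```python
from collections import defaultdict

class Thesaurus:
    """Toy thesaurus: terms connected by labeled relationships."""

    # Each relationship is stored together with its inverse.
    INVERSE = {
        "broader": "narrower",
        "narrower": "broader",
        "synonym": "synonym",
        "antonym": "antonym",
        "related": "related",
    }

    def __init__(self):
        # term -> relation label -> set of connected terms
        self._rel = defaultdict(lambda: defaultdict(set))

    def add(self, term, relation, other):
        """Record that `other` is a `relation` term of `term`, plus the inverse."""
        self._rel[term][relation].add(other)
        self._rel[other][self.INVERSE[relation]].add(term)

    def get(self, term, relation):
        """Return the terms connected to `term` by `relation` (possibly empty)."""
        return set(self._rel.get(term, {}).get(relation, ()))

    def expand(self, term):
        """Expand a query term with its synonyms and narrower terms."""
        return {term} | self.get(term, "synonym") | self.get(term, "narrower")

th = Thesaurus()
th.add("car", "synonym", "automobile")
th.add("vehicle", "narrower", "car")     # "car" is a narrower term of "vehicle"
th.add("war", "related", "weapons")      # informal "topic"-style link
th.add("friendly", "antonym", "hostile")

print(th.expand("car"))
print(th.get("car", "broader"))
```

Note that nothing in this structure records the context in which a relationship holds; every link applies unconditionally.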
Thesauri can help deal with the issue of word relationships, but they present significant problems of their own. The most important are:
• Thesauri incorporate a classification/ontology in their relationships. For all but the simplest domains there is usually more than one point of view; different users may have different views, and a given thesaurus represents only one. When a domain is standardized, this issue does not come up, but many domains are not standardized. Since most thesauri are fixed in their classification (i.e., terms cannot be moved from category to category), there is no way to overcome this problem.
• Thesauri are difficult to build and maintain. There are two main methods for building thesauri: manual (which is labor-intensive, slow, and must deal with subjectivity issues) and automatic. Automatically constructed thesauri are usually built using some machine-learning mechanism. Statistical methods are based on co-occurrence, but this approach has limitations, as it is not clear what the semantic relationship among co-occurring words is. Heuristics are usually added to such methods, but state-of-the-art procedures still require extensive supervision (Chakrabarti, 2003).
• Thesauri are limited; they cannot capture all semantic relationships for a natural language. Even for the relationships they do capture, the situation may be more complex than the thesaurus indicates: some terms, A and B, may be related only in a certain context. Most thesauri do not capture context. There are going to be failures, then, when using thesauri. However, there is no agreed-upon mechanism to deal with such failures.
From our point of view, the most important limitation of IR systems is their lack of semantic understanding, which makes it impossible to analyze the meaning of the information in the documents. This in turn harms their integration with the information in the database. As an example, assume a stockbroker database on company acquisitions. There is a table BIDS(bidder, target, date, price, character), where bidder is the company bidding to acquire another company, target is the company that is to be acquired, date and price are as expected, and character is one of “hostile,” “friendly” and “neutral.” There is also a collection of documents, extracted from news feeds, that we use to complement the contents of the database table. Assume we want to query our database for all information known about a certain acquisition. We can query the information in the tables with SQL, and we can do a search with an IR system for the word “bid” and synonyms or related words (like “takeover”). However, what is returned by the two searches is very different, not only in structure (rows of structured information in one case, documents in the other) but in character: the information in the database is guaranteed to be relevant to the SQL query, while the information in a document may or may not be relevant. The documents may not be about acquisitions at all, even if they contain some relevant words (note that this very document would be returned by such an IR search, since it contains the right words, yet it is not a document about takeovers). We need to extract only the relevant information from the document and, if possible, we need to structure it.
Another way to see this problem is to assume that our goal is to build a method that would update our database table automatically from the information in the documents (since, once in the table, all information is available to users through SQL). For this, we need more than the presence of certain words. One article may contain the following sentence: “Oracle has announced a bid to acquire PeopleSoft.” For an IR system, the sentence above is indistinguishable from the following: “PeopleSoft has announced a bid to acquire Oracle” (systems that hold word offsets (positions) would be able to tell the difference, but such information is not currently used in IR systems to make these types of distinctions, only to implement the predicate NEAR). However, the two sentences express clearly different content, and they should generate different tuples. Also, if we want to find out the character of a bid, it is not enough to look for certain words: “Oracle has announced a bid to acquire PeopleSoft. It is a friendly bid.” and “Oracle has announced a bid to acquire PeopleSoft. It is not a friendly bid.” both contain the word “friendly” but are opposite in meaning (further, note that “not” would probably be considered a stopword and ignored in most IR systems, rendering both sentences basically equivalent). Of course, the system could select documents where the right words appear and present them to a human, who could easily make such judgments. However, when the number of documents is large, this is clearly not a good solution.
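The two examples above can be checked directly against a bag-of-words representation. This is a minimal sketch using only the Python standard library; the tokenizer, the stopword list, and the function names are assumptions chosen for illustration, not the behavior of any specific IR system.

```python
import re
from collections import Counter

# Hypothetical stopword list; real systems use much longer ones.
STOPWORDS = frozenset({"a", "has", "is", "it", "not", "to"})

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text, stopwords=frozenset()):
    """Term counts for a text, ignoring word order and any stopwords."""
    return Counter(t for t in tokenize(text) if t not in stopwords)

s1 = "Oracle has announced a bid to acquire PeopleSoft."
s2 = "PeopleSoft has announced a bid to acquire Oracle."

# Who bids for whom is lost: both sentences yield the same term counts.
same_bag = bag_of_words(s1) == bag_of_words(s2)

# Word offsets would preserve the difference, if they were used for this.
t1, t2 = tokenize(s1), tokenize(s2)
order_differs = (t1.index("oracle") < t1.index("peoplesoft")) != (
    t2.index("oracle") < t2.index("peoplesoft")
)

friendly = s1 + " It is a friendly bid."
hostile = s1 + " It is not a friendly bid."

# With stopword removal, "not" disappears and the two texts coincide.
differ_raw = bag_of_words(friendly) != bag_of_words(hostile)
same_filtered = bag_of_words(friendly, STOPWORDS) == bag_of_words(hostile, STOPWORDS)

print(same_bag, order_differs, differ_raw, same_filtered)
```

Under these assumptions, neither the direction of the bid nor its character can be recovered from the term counts alone, which is why neither sentence pair could be turned into the correct BIDS tuples on this basis.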
Also, note that some documents may be quite large, while all the relevant information may be contained in a few sentences. With the architecture sketched above, only entire documents can be returned as the answer to a query. To mitigate this, some systems offer partial solutions, like highlighting key terms in the documents (which makes it easier to find the relevant paragraphs) or summarizing them. However, these solutions are not standardized and are highly system-dependent. In fact, summarization is an active area of research, so standards are unlikely to appear any time soon.