The typical search scenario supported by today’s text retrieval systems allows a user to enter a query and receive a ranked list of documents. The IR system attempts to infer the user’s information need from the query and to return the documents that will best satisfy that need.
In this dissertation, our work is motivated by the many user tasks that require finding more than one relevant document. Examples include:
• Literature searches: For example, when a scholar needs to fully review the literature in a research area.
• Legal discovery: Lawyers need to uncover all past relevant cases in order to make strong arguments.
• Medical conditions / treatments: When someone becomes ill, the doctor, patient, and family often need to find as much information as possible regarding the illness.
Figure 1.1. The traditional scenario for interacting with an information retrieval system to find relevant documents via relevance feedback.
Within this broad problem of finding multiple relevant documents, we focus on one specific problem: once a user has found a relevant document, how should the user proceed to find other relevant documents? The user could find this document via a typical search or the user could already know of the document. The classic solution to this problem is what is known as relevance feedback.
In relevance feedback, the user gives the IR system feedback on the results it has returned.
The IR system can collect feedback in many ways, but a typical approach is multiple-item relevance feedback whereby the user judges the top 5 or 10 documents as relevant or non-relevant and submits these judgments to the IR system. The IR system uses the judgments to craft a new query and returns a new set of results. The aim is for this feedback loop to continue until the user’s information need is satisfied. Figure 1.1 shows this process. Interaction techniques like relevance feedback focus on helping the user after the initial query rather than on improving the initial retrieval.
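The step in which the system "uses the judgments to craft a new query" can be illustrated with a small sketch. The text above does not say which reformulation method is used, so the Rocchio-style update below is a swapped-in, commonly cited technique rather than the method of this dissertation. The function name `rocchio_update`, the dictionary-based term weights, and the default values of `alpha`, `beta`, and `gamma` are all assumptions made for illustration.

```python
from collections import Counter


def rocchio_update(query, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Build a new term-weight query from judged documents.

    query, and each document in relevant / non_relevant, is a dict
    mapping a term to a weight. This is a Rocchio-style sketch, not
    the reformulation method of any particular system.
    """
    new_query = Counter()
    # Keep the original query terms, scaled by alpha.
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Move toward the average of the documents judged relevant.
    for doc in relevant:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant)
    # Move away from the average of the documents judged non-relevant.
    for doc in non_relevant:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(non_relevant)
    # Keep only terms that still carry positive weight.
    return {term: w for term, w in new_query.items() if w > 0}


# One round of the feedback loop with hypothetical judgments.
initial_query = {"export": 1.0}
judged_relevant = [{"export": 1.0, "cryptography": 2.0}]
judged_non_relevant = [{"sports": 1.0}]
new_query = rocchio_update(initial_query, judged_relevant, judged_non_relevant)
```

In a full feedback loop, the new query would be run against the collection, the user would judge the new top results, and the update would be repeated until the information need is satisfied.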
While relevance feedback is known to be a powerful technique for improving retrieval quality (Ruthven and Lalmas (2003) provide an extensive review of over 30 years of relevance feedback research), it has seen little adoption by search systems. A feedback-like technique that has seen adoption is an interaction mechanism we term find-similar.
Figure 1.2. The Excite search system (circa 1996) provided a find-similar link next to each search result labeled “More Like This: Click here for a list of documents like this one.”
Find-similar allows a user to request a list of documents similar to a given document. As a user interface feature, find-similar is typically instantiated as a button or link next to each result in the list of search results. For example, the Excite search engine (circa 1996) labeled its find-similar link “More Like This: Click here for a list of documents like this one” as shown in Figure 1.2. As such, find-similar provides a way for users to navigate from one document to another and supports the search techniques commonly employed by users (Bates, 1989).
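As a rough illustration of what a find-similar request might compute, the sketch below ranks the other documents in a collection by cosine similarity of their raw term counts to the chosen document. This is only one simple way to realize content similarity and is an assumption made here; the similarity methods actually studied are described later in the dissertation. The names `cosine` and `find_similar`, the whitespace tokenization, and the toy collection are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar(doc_id, collection, k=10):
    """Return the ids of the k documents most similar to doc_id.

    collection maps a document id to its text. Tokenization is a
    plain lowercase whitespace split, which is an assumption.
    """
    target = Counter(collection[doc_id].lower().split())
    scored = []
    for other_id, text in collection.items():
        if other_id == doc_id:
            continue  # do not return the document itself
        vector = Counter(text.lower().split())
        scored.append((cosine(target, vector), other_id))
    # Highest similarity first; ties broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for _, other_id in scored[:k]]


# Toy collection, invented for illustration only.
collection = {
    "d1": "export controls cryptography",
    "d2": "cryptography export rules",
    "d3": "football match report",
}
similar_to_d1 = find_similar("d1", collection, k=2)
```

A "More Like This" link next to a result would then correspond to calling `find_similar` with that result's identifier and displaying the returned list.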
While not all people have experience using find-similar, significant evidence exists that many users utilize this interaction mechanism. Spink et al. (2000, 2001) analyzed samples of Excite’s query logs and reported that between 5 and 9.7 percent of the queries came from the use of the “more like this” find-similar feature. Lin et al. (2007) have reported that for the U.S. National Library of Medicine’s search engine, PubMed, 18.5% of non-trivial search sessions involve clicks on articles suggested by PubMed’s find-similar, which PubMed refers to as related articles.
Figure 1.3. This figure shows an example of find-similar use. The user starts by entering a query and getting a ranked list of documents (1). The user examines the documents and finds the first one to be non-relevant. The second document is relevant and the user decides to apply find-similar to that document (2). The system produces a ranked list of documents that are similar to the requested document. The user continues (3) until reaching a point where the list of documents has few relevant documents and the user clicks the “back button” in the interface to go back to the previous ranked list (4). The user continues searching (5) via find-similar until finished.
This dissertation focuses on find-similar’s use as a search tool. To use find-similar as a search tool, a user will apply find-similar to a relevant document to find more relevant documents, and so forth. A user can either start with an initial query and apply find-similar to individual results, or a user can start with a known relevant document found via other means and apply find-similar to that document. Figure 1.3 shows an example of find-similar use starting from an initial query.
Find-similar can provide other forms of similarity to the user besides content similarity. For example, Figure 1.4 shows a page from CiteSeer (Bollacker et al., 1998), which is a research paper repository and search system. Many of the links provided by CiteSeer can be considered find-similar links. Some of the links are to documents with similar content while others are to documents with similar citations.
Figure 1.4. The CiteSeer research paper search system provides many types of similarity links besides content similarity, e.g., papers that cite this paper and co-citation similarity. This web page is also an example of using find-similar to find other relevant documents when a document, rather than a query, is the starting point.
A similar search system, Google Scholar (http://scholar.google.com/), provides links to documents written by the author of a paper, which allows the user to navigate along another dimension of similarity.
Using find-similar to navigate via similarity has significant potential to improve retrieval quality. Figure 1.5 shows an example of the power of navigating via similarity for the TREC ad-hoc query number 334, “export controls cryptography.” This example uses query likelihood to perform the initial retrieval and regular document-to-document similarity (details given in Chapter 3). For this query, the initial retrieval pulls up three relevant documents into the top of the results at ranks 1, 2, and 4. In total, there are nine known relevant documents for query 334. The initial query finds the other relevant documents at ranks 79, 2465, 13564, 23831, 37874, and the last document is not retrieved at all since it doesn’t contain any of the three query terms.
In the figure, this last document is given a rank of 528,155, which is the number of documents in the collection. Without find-similar or some other mechanism, the documents at ranks 2465 and greater are effectively “out of reach” of the user. If the user uses find-similar to request similar documents for the relevant document at rank 4, the user will then find relevant documents at ranks 3, 6, 8, and 11 that were all at ranks greater than 1000 for the initial retrieval. Find-similar makes these documents much easier to reach. For example, document R5 goes from a distance of 2465 to a distance of 7 (4 + 3). Using find-similar, the hardest-to-reach relevant document is 46 documents away from the initial query and the remaining relevant documents are found within 15 documents. This is a dramatic improvement over finding only 3 relevant documents in the initial retrieval.
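The distance arithmetic for document R5 can be written out as a short sketch. It assumes, based only on the "4 + 3" in the example above, that the distance to a document is the sum of the ranks traversed along the path used to reach it: the rank of the document to which find-similar is applied, plus the rank of the target in the resulting similarity list. The function name `navigation_distance` is hypothetical, and the formal measures used later in the dissertation may be defined differently.

```python
def navigation_distance(path_ranks):
    """Sum the ranks traversed along a navigation path.

    path_ranks lists, for each ranked list the user visits, the rank
    at which the user either applies find-similar or reaches the
    target document. Treating distance as this sum is an assumption
    taken from the worked example in the text.
    """
    return sum(path_ranks)


# R5 reached directly from the initial ranked list.
direct = navigation_distance([2465])

# R5 reached by applying find-similar to the relevant document at
# rank 4 and then reading down to rank 3 of the similarity list.
via_find_similar = navigation_distance([4, 3])
```

Under this reading, one application of find-similar shortens the path to R5 from 2465 documents to 7.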
In a broad sense, find-similar aims to add links to documents such that the time for a user to get from relevant document to relevant document is minimized. We next describe the work in this dissertation that addresses how to measure and improve the performance of find-similar.