
While find-similar has existed for many years and many search systems have provided find-similar functionality to users, little research existed regarding find-similar before the work presented in this dissertation. We add to this existing research and further fill the gap between practice and research knowledge of find-similar.

The work in this dissertation is best viewed as falling into three categories: measurement, performance improvement, and applicability to different domains.

Figure 1.5. This figure shows the shortest paths via find-similar from the initial query to the nine relevant documents for TREC query 334, “export controls cryptography.” Non-relevant documents are represented by an “N” or omitted to save space.

The relevant document “R3” at rank 4 in the initial query results makes the majority of relevant documents much easier to reach.

We look at two ways to measure find-similar. The first, as presented in Chapter 3, performs a simulation of user behavior given an interface that incorporates find-similar. Here we measure find-similar’s potential to improve retrieval quality as compared to a state-of-the-art retrieval system as well as compared to relevance feedback. In Chapter 4, we use this simulation methodology to investigate the effect of different initial conditions on find-similar’s performance. In Chapter 5, we present a method that does away with the user interface and focuses its measurement on the document network formed by find-similar’s document-to-document similarity measure.
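As an illustration of the document-network view, the following minimal sketch treats find-similar as a directed graph in which each document links to the documents returned as similar to it, and uses breadth-first search to count the fewest find-similar steps needed to reach each document from the initial query. The graph, the node labels, and the function name `shortest_hops` are hypothetical and chosen only for illustration; this is not the measurement method of Chapter 5.

```python
from collections import deque


def shortest_hops(links, start):
    """Return the minimum number of link traversals from `start`
    to every reachable node, using breadth-first search."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in links.get(node, []):
            if nxt not in dist:  # first visit is the shortest path in BFS
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


# Hypothetical toy network: "query" links to its initial results, and
# each document links to the documents a find-similar request would list.
links = {
    "query": ["N1", "N2", "N3", "R3"],
    "R3": ["R1", "R5", "N4"],
    "R1": ["R2"],
    "R5": ["R7"],
}

hops = shortest_hops(links, "query")
for doc in sorted(d for d in hops if d.startswith("R")):
    print(doc, hops[doc])
```

In this toy network, a relevant document in the initial results acts as a gateway: every other relevant document is reached through it, and documents with no incoming path simply do not appear in the result.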

To improve find-similar, we can provide user interface support for find-similar as well as change the document-to-document similarity measure. In Chapter 3, we investigate the need for an interface to help the user avoid the reexamination of documents while using find-similar. In Chapters 3, 4, 5, and 6, we look at various types of document-to-document similarity. We look at content similarity both with and without query-biasing in Chapters 3 and 5. As part of Chapter 4, we compare our language modeling document-to-document similarity to PubMed’s actual document-to-document similarity. In Chapter 6, we treat the World Wide Web’s hyperlinks as a form of document-to-document similarity and measure the Web’s navigability both with and without additional content similarity links.

To understand if find-similar is applicable across different domains, we investigate its performance with newswire and government documents in Chapters 3 and 5, abstracts of biomedical texts in Chapter 4, and with Web pages in Chapter 6.

We review related work in Chapter 2 and conclude the dissertation in Chapter 7.

1.3 Contributions

In this dissertation, we make the following contributions:

1. We show that find-similar has the potential to produce a 23% improvement over a non-interactive state-of-the-art baseline as measured by mean average precision. This performance matches relevance feedback. (Chapter 3)

2. By simulating simple and plausible user browsing patterns, we show that find-similar’s performance is significantly affected by the browsing pattern. In particular, if carelessly used when results are already good and not in need of much improvement, find-similar can degrade these results. (Chapter 3)

3. We find that find-similar benefits from user interface support to avoid the re-examination of documents. Without support to avoid the re-examination of documents, find-similar only benefits the poorest performing topics. (Chapter 3)

4. Poor initial retrievals can come from complex information needs, the retrieval method, or novice users. As part of a case study of PubMed, we show how find-similar compensates for poor initial retrievals. This work also broadens the applicability of find-similar beyond the newswire and government documents of Chapter 3 to biomedical abstracts. (Chapter 4)

5. We find that poorer retrieval systems are helped more by find-similar while the more difficult topics are not helped as much as the easier topics. (Chapter 4)

6. We find that find-similar’s performance can be improved by using a query-biased, document-to-document content similarity rather than a similarity measure that simply uses the document as a query. (Chapters 3 and 5)

7. We create a novel and well-defined method to evaluate the ability of document-to-document similarity measures to cluster relevant documents. We show that both local and global measures of clustering are needed. (Chapter 5)

8. We show that the query-biased similarity that performed better under simulation also clusters relevant documents better than a regular similarity that treats a document as a query. The query-biased similarity produces a relative gain of 45% in the global measure of clustering while also producing a relative gain of 38% in a local measure of clustering (precision at rank 5). (Chapter 5)

9. We show that, to a limited extent, the cluster hypothesis is true on the web when the document-to-document similarity measure is the distance to navigate from one document to another using hyperlinks. We find that the automatic addition of content similarity hyperlinks to web pages can significantly increase the number of relevant documents reachable from a given relevant document. We quantify this increase in navigability using the method of Chapter 5 and show that find-similar produced an absolute gain in global navigability of 13.8% while at the same time increasing the local navigability of the web. (Chapter 6)
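The contributions above are stated in terms of mean average precision, precision at rank 5, and both relative and absolute gains. As a reminder of how these quantities differ, the following is a minimal sketch using the standard textbook definitions; the function names, the toy ranking, and the example scores are hypothetical, and the dissertation's own evaluation setup may differ in its details.

```python
def average_precision(ranking, relevant):
    """Average of the precision values at each rank where a relevant
    document appears, divided by the total number of relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def precision_at_k(ranking, relevant, k=5):
    """Fraction of the top k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def relative_gain(baseline, improved):
    """Gain expressed as a fraction of the baseline score."""
    return (improved - baseline) / baseline


def absolute_gain(baseline, improved):
    """Gain expressed as a plain difference in score."""
    return improved - baseline


# Hypothetical toy data: one relevant document ("d9") is never retrieved.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d9"}

print(average_precision(ranking, relevant))
print(precision_at_k(ranking, relevant, k=5))
print(relative_gain(0.20, 0.25), absolute_gain(0.20, 0.25))
```

Mean average precision is the mean of `average_precision` over a set of topics. The distinction between the two gain functions matters when reading the figures above: moving a score from 0.20 to 0.25 is an absolute gain of 0.05 but a relative gain of 25%.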

CHAPTER 2