CHAPTER 2 RELATED WORK

2.7 Related Network Analysis Work

In Chapter 5, our focus shifts to measuring the document networks formed by different document-to-document similarity measures. Newman (2003) provides a review of the large body of research on networks and various graph theoretic measures of network properties. Costa et al. (2007) provide a survey of many of these network measures.

Chapter 5 discusses the well-known cluster hypothesis, which essentially says that relevant documents will tend to be more similar to each other than to non-relevant documents. Given a document-to-document similarity measure, we can view the similarity relationships between documents in a graph theoretic sense. Documents are nodes and links are placed between nodes based on the given similarity measure.

IR similarity measures tend to produce a numeric similarity score, and some measures are asymmetric. Given these properties, an obvious document network for IR is a directed network with a weight on each link between nodes.
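As a minimal sketch of such a directed, weighted document network, the Python below builds one from an asymmetric similarity. The `containment` measure, the `build_network` helper, and the toy documents are all assumptions made for illustration. They are not the similarity measures studied in this dissertation.

```python
from typing import Dict, Set


def containment(a: Set[str], b: Set[str]) -> float:
    """Hypothetical asymmetric similarity: the fraction of a's terms found in b."""
    return len(a & b) / len(a) if a else 0.0


def build_network(docs: Dict[str, Set[str]]) -> Dict[str, Dict[str, float]]:
    """Return a directed, weighted network as {source: {target: weight}}.

    Assumption: a link is placed from source to target whenever the
    similarity is nonzero; other linking rules are equally possible.
    """
    network: Dict[str, Dict[str, float]] = {}
    for src, src_terms in docs.items():
        network[src] = {}
        for dst, dst_terms in docs.items():
            if dst == src:
                continue
            weight = containment(src_terms, dst_terms)
            if weight > 0:
                network[src][dst] = weight
    return network


# Toy documents represented as sets of terms.
docs = {
    "d1": {"graph", "network"},
    "d2": {"graph", "network", "cluster", "measure"},
    "d3": {"query", "ranking"},
}
network = build_network(docs)
```

Because the measure is asymmetric, the link from `d1` to `d2` carries a different weight than the link from `d2` to `d1`, which is why the network must be directed.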

Given the hypothesis's name, clustering coefficient network measures appear to be obvious choices for measuring the cluster hypothesis. These measures exist to characterize the extent to which the nodes in a network are clustered. Many of these measures are designed only for undirected, unweighted networks. Cluster measures typically average, over all nodes in the network, a measurement that captures the extent to which a node's neighbors are themselves neighbors.

In the world of undirected, unweighted networks, one can see that these cluster measures may say a network is perfectly clustered even if some nodes cannot be reached from other nodes. In other words, a network can consist of separate groups of fully connected subgraphs and still be perfectly clustered. Even when these measures are extended to weighted networks (but not to directed networks) (Barthélemy et al., 2005; Kalna and Higham, 2006), they still aim to capture the same characteristic of networks, which is different from the notion of clustering in the cluster hypothesis.
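The point about disconnected yet perfectly clustered networks can be checked on a small example. The sketch below assumes the common local clustering coefficient (the fraction of a node's neighbor pairs that are themselves linked), averaged over all nodes. The function names and the two-triangle network are hypothetical and serve only as illustration.

```python
from itertools import combinations
from typing import Dict, Set


def average_clustering(adj: Dict[str, Set[str]]) -> float:
    """Average local clustering coefficient of an undirected, unweighted network."""
    total = 0.0
    for node, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        linked_pairs = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
        total += 2 * linked_pairs / (k * (k - 1))
    return total / len(adj)


def reachable(adj: Dict[str, Set[str]], start: str) -> Set[str]:
    """Set of nodes reachable from start by following links."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


# Two fully connected triangles with no link between them.
two_triangles = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"},
}

clustering = average_clustering(two_triangles)
from_a = reachable(two_triangles, "a")
```

Here `clustering` is 1.0, the maximum, even though `from_a` contains only the nodes of the first triangle.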

The cluster hypothesis says that, for various subsets of the graph, where each subset is a set of relevant documents, the nodes in each subset are closer to each other than to the other nodes in the network. As we will explain in Chapter 5, we need a measure that captures a sense of the distance between these relevant documents as well as a measure that captures the local or neighborhood quality of a relevant document.
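To make the notion of distance between relevant documents concrete, one possible illustration is the number of links on the shortest directed path between each ordered pair of relevant documents. This is only a sketch under that assumption, not the measure defined in Chapter 5. The function names and the toy network are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def hop_distances(links: Dict[str, List[str]], source: str) -> Dict[str, int]:
    """Breadth-first search: fewest links from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in links.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def relevant_pair_distances(
    links: Dict[str, List[str]], relevant: List[str]
) -> Dict[Tuple[str, str], Optional[int]]:
    """Shortest-path length for each ordered pair of relevant documents.

    A value of None means the target cannot be reached from the source.
    """
    result: Dict[Tuple[str, str], Optional[int]] = {}
    for src in relevant:
        dist = hop_distances(links, src)
        for dst in relevant:
            if dst != src:
                result[(src, dst)] = dist.get(dst)
    return result


# Toy directed network: r1, r2, r3 are relevant; n1 is non-relevant.
links = {"r1": ["n1"], "n1": ["r2"], "r2": ["r1"], "r3": ["r1"]}
pairs = relevant_pair_distances(links, ["r1", "r2", "r3"])
```

Because the network is directed, the distance from `r1` to `r2` need not equal the distance from `r2` to `r1`, and some relevant documents may be unreachable from others.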

In a similar fashion to our analysis in Chapter 5, Lawrie (2003) performed a shortest-paths and reachability analysis of hierarchical summaries of search results to compare the quality of various summarization methods.

Since users of find-similar navigate a document network formed by find-similar’s document-to-document similarity measure, this dissertation would seem to hold many connections to work that has looked at the navigation of social networks. Perhaps the most famous of the social network navigation works is that of Travers and Milgram (1969), who measured the number of people needed to reach a known target person via personal “first name” relationships only. An important difference between this work and ours is that, for the IR tasks we concern ourselves with, our users do not know their destination a priori. Travers and Milgram (1969) gave participants a description of the target that included “his name, address, occupation and place of employment,” . . . “his college and year of graduation, his military service dates, and his wife’s maiden name and hometown.” In network terms, participants had to forward the research packet (message) to a neighboring node, where the nodes were people and the links represented knowing someone on a first-name basis. Participants picked the neighboring node that they thought would most likely get the message to the target. When we know the destination in IR, we call this known item search, which we do not examine in this dissertation. We and our retrieval system users do not know what the set of relevant documents is for a given information need until the documents have been found as part of search. To design our IR systems, we collect known sets of relevant documents and measure the quality of an IR system using these test sets, but in operation these sets are unknown.

Another way in which social network navigation research differs from our work is that we are interested in constructing inherently navigable networks rather than in only studying navigation on an existing network. We want to create similarity measures and interfaces that, combined, make the cluster hypothesis true. If a user has found a relevant document, we want it to be trivial for the user to find the other relevant documents. We do not create algorithms to guide the user’s navigation of a network as has been done for other networks (Şimşek and Jensen, 2005). If we knew how best to guide the user through the network, we would utilize this information to improve the initial ranking of documents for the user. For our work in Chapter 3, we assume that the state-of-the-art baseline retrieval system has already exploited all reasonable information in its ranking of documents.

CHAPTER 3

USING SIMULATION TO EVALUATE THE POTENTIAL