7.2 Access Counting for Difficulty Prediction
7.2.1 Determining Query Difficulty Using Access-Order
An access-ordering is based on the frequency with which documents occur in search results. As we have seen (Figure 3.1, page 58), the likelihood of access is non-uniform across the documents in a collection over a set of user queries. Using access counts, we can establish for each document a probability that indicates the likelihood of seeing that document in the top-ranked results of a random future query. Given such a set of document probabilities, an absolute ordering of documents from most to least likely to be retrieved indicates a default ranking for the documents in the collection. Taking this reasoning one step further, the ordering offers an approximation of the expected ranking of documents for any query, independent of the query terms. That is, if a document appears frequently in result sets, then it has a greater chance of being ranked highly.
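As a minimal sketch of how such an access-ordering could be derived, the following assumes that each past query is available as a ranked list of document identifiers, and estimates each document's probability as the fraction of queries in which it appears among the top-ranked results. The function name `access_order`, the cut-off `top_k`, and the tie-break on document identifier are illustrative assumptions, not details taken from the text.

```python
from collections import Counter

def access_order(result_sets, top_k=1000):
    """Return (documents ordered from most to least accessed, probability per document).

    result_sets: one ranked list of document ids per past query (assumed input format).
    top_k: how many top-ranked results count as an "access" (assumed cut-off).
    """
    counts = Counter()
    for results in result_sets:
        # Count each document at most once per query.
        counts.update(set(results[:top_k]))
    n_queries = len(result_sets)
    # Estimated likelihood of seeing the document in the top results of a query.
    probs = {doc: count / n_queries for doc, count in counts.items()}
    # Most likely first; ties broken by document id (an arbitrary, assumed choice).
    order = sorted(probs, key=lambda doc: (-probs[doc], doc))
    return order, probs
```

Documents that never appear in any result set receive no entry here; how they should be placed in the ordering is left open.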
Document ranking functions order documents with respect to the user query. As seen in Section 2.6 (page 27), for each document a score is obtained that is typically a combination of three primary components: the length of the document, the frequency of the query term in the document, and the uniqueness of the query term in the collection. Of these, the uniqueness of the query term in the collection establishes the impact that the term will have relative to other terms in the query, while the frequency of the term in the document and the document length establish the impact the term will have on the score of that document.
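To make the three components concrete, the sketch below uses a generic TF-IDF-style score with a simple length normalisation. It is an illustrative stand-in, not the ranking function of Section 2.6; the function name `tf_idf_score`, its parameters, and the particular weighting formulas are all assumptions.

```python
import math

def tf_idf_score(query_terms, doc_terms, doc_freq, n_docs, avg_doc_len):
    """Generic TF-IDF-style score with length normalisation (illustrative only).

    doc_terms: list of terms in the document.
    doc_freq: mapping from term to the number of documents containing it.
    """
    doc_len = len(doc_terms)
    if doc_len == 0:
        return 0.0
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        df = doc_freq.get(term, 0)
        if tf == 0 or df == 0:
            continue
        # Uniqueness of the term in the collection: rarer terms weigh more.
        idf = math.log(1 + n_docs / df)
        # Term frequency, discounted for documents longer than average.
        norm_tf = tf / (doc_len / avg_doc_len)
        score += idf * norm_tf
    return score
```

Under these assumptions, a rare query term contributes more to the score than a common one, and the same term frequency counts for less in a longer document.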
Figure 7.2: Proposed query difficulty prediction approach. Query 1 produces a result set in which the documents are ranked in an order that does not differ greatly from the collection-wide access-order. Such a query is expected to be difficult to resolve. Conversely, query 2 produces a result set for which the ranking of the documents varies significantly from that of the access-order. Such a query has high discriminatory power and is predicted to be simple to resolve.
We speculate that, for a difficult query, the documents in the result set will be ranked in much the same order as those in the access-ordering. That is, the query terms will not have a strong enough impact on the document scores to significantly alter the ordering of documents from the access-order. Conversely, a query that produces a result set whose order differs significantly from the absolute access-ordering is considered to have high discriminatory power, and is therefore considered a simple query to resolve. Figure 7.2 illustrates the principle. In the figure, query 1 produces a result set in which the documents appear in a ranked order that closely resembles the collection-wide access-order. We therefore assume that such a query has a low influence on the overall ranking of documents, and the query is predicted to be “difficult” to resolve. Query 2 produces a result set in which the ranking of the documents differs significantly from that of the collection-wide access-order. Such a query has a strong influence on the document ranking, and is therefore predicted to be “simple” to resolve.
Our proposed difficulty predictor is based on measuring the difference in the ordering of documents between the ranked result set of a query and the collection access-order. One technique for measuring this difference is Kendall’s τ rank correlation coefficient. Kendall’s τ measures the agreement between two rankings by comparing the number of pairwise element changes required to convert one set of ranked data to the other [Sheskin, 1997]. The τ value lies between −1 and +1, where +1 indicates perfect agreement between the two rankings, −1 indicates perfect disagreement, and 0 indicates that the rankings are independent. Using Kendall’s τ, we can measure the correlation between the collection access-order and the ranked results of each query. In doing so, we can evaluate the potential of access counts as a predictor of query difficulty. Should our hypothesis hold, low accuracy queries will be those with τ correlations approaching +1, while higher accuracy queries will have τ values closer to −1.
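The measurement described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: it computes the tau-a form of the coefficient by counting concordant and discordant document pairs, ignores result documents that are absent from the access-order, and reports 0 when fewer than two documents remain. The function name `kendall_tau` and these handling choices are assumptions.

```python
from itertools import combinations

def kendall_tau(result_ranking, access_order):
    """Kendall's tau-a between a query's ranked results and the access-order.

    Assumption: documents absent from the access-order are ignored.
    """
    access_rank = {doc: i for i, doc in enumerate(access_order)}
    # Access-order positions of the results, listed in result-rank order.
    positions = [access_rank[d] for d in result_ranking if d in access_rank]
    n = len(positions)
    if n < 2:
        return 0.0  # assumption: an undefined tau is reported as 0
    concordant = discordant = 0
    for a, b in combinations(positions, 2):
        if a < b:
            concordant += 1   # pair is in the same order in both rankings
        elif a > b:
            discordant += 1   # pair is in opposite order
    return (concordant - discordant) / (n * (n - 1) / 2)
```

With an access-order of `d1, d2, d3, d4`, a result set in the same order gives τ = +1 and the reversed order gives τ = −1. Under the hypothesis stated above, the first query would be predicted difficult and the second simple. For long result lists, a library routine such as SciPy's `scipy.stats.kendalltau` may be preferable to this quadratic pair count.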
Our proposed approach is novel in several respects. While the idea of a default ranking of documents in the collection is not new [Page et al., 1999; Kleinberg, 1999], to our knowledge it has not previously been considered in query difficulty prediction. Although we propose applying this technique to the ordering obtained from access counts, it could potentially be applied to other forms of document priors. Further, other difficulty prediction approaches have sought external information to determine query difficulty. Specifically, Swen et al. [2004] have considered sources such as WordNet to determine the ambiguity of a query term. This work presents an alternative means of measuring the discriminatory power of an entire query with respect to the search engine.
[Scatter plot. x-axis: Correlation (−0.4 to 0.4); y-axis: Average Precision (0.0 to 1.0).]
Figure 7.3: Comparison of the access-order based predictor to the average precision of the TREC-9 topics on the WT10g collection. Each point represents an individual topic. The x-axis shows the Kendall’s τ correlation of the top 1,000 results returned for a topic, with the collection access-order. The y-axis shows the average precision of the topic.