User Feedback - Adaptive Crawl Ordering Schemes

2.7 Adaptive Crawl Ordering Schemes

2.7.2 User Feedback

Several studies have also considered user feedback as part of an adaptive crawling algorithm. Bullot et al. [2003], for example, compare crawl ordering schemes that use user feedback so that popular documents are kept up-to-date. Their approach maintains a queue in which URLs at the front are visited next. The score of any URL that is visited is reset to 0, new documents are inserted into the queue with a high score, and older URLs are given higher scores.
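The queue behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the implementation of Bullot et al.: the class name `CrawlQueue`, the constants `NEW_URL_SCORE` and `AGE_INCREMENT`, and the reading of "older URLs are given higher scores" as a fixed score increment per ageing step are all assumptions made for the example.

```python
from dataclasses import dataclass, field

NEW_URL_SCORE = 100.0  # assumed "high" score for newly discovered URLs
AGE_INCREMENT = 1.0    # assumed score growth per ageing step for waiting URLs


@dataclass
class CrawlQueue:
    """Hypothetical score-ordered crawl queue."""
    scores: dict = field(default_factory=dict)

    def add_new(self, url):
        # New documents enter the queue with a high score.
        self.scores.setdefault(url, NEW_URL_SCORE)

    def age(self):
        # URLs that have waited longer accumulate a higher score.
        for url in self.scores:
            self.scores[url] += AGE_INCREMENT

    def visit_next(self):
        # The highest-scoring URL is at the front; its score resets to 0.
        url = max(self.scores, key=self.scores.get)
        self.scores[url] = 0.0
        return url


queue = CrawlQueue()
queue.add_new("a")
first = queue.visit_next()   # "a" is the only URL, so it is visited
queue.add_new("b")
queue.age()
second = queue.visit_next()  # the new URL "b" outranks the just-visited "a"
```

A production crawler would more likely use a heap than a linear scan for the maximum; the dictionary is used here only to keep the three rules visible.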

They monitor statistics regarding category, frequency of click, position of a URL, and next penalisation to determine which URL to visit next.

• Category

The category of each clicked URL is noted and other URLs belonging to the same category can also be visited. In this way URLs that are not clicked, but on the same topic, also have their score affected.

• Frequency of Click

Frequency of click keeps track of the frequency at which URLs are visited by users. URLs that are visited more frequently by users are rated higher than those visited less frequently.

Figure 2.3: The embarrassment metric decision tree. (Node labels: Resource Fresh, Stale Resource; Resource Returned, Not Returned; URL Clicked, URL Not Clicked; Resource Relevant, Resource Irrelevant.)

• Position of URL

Position of URL indicates the location of a URL on the results pages. Since URLs on the results pages are sorted by importance, URLs at the top of the first page are considered more important than those at any other position on any page.

• Next Penalisation

Next penalisation reduces the score of all URLs on pages where users have clicked the next button, since users have most probably found the results either irrelevant or repeated. If a user clicks on some URL before clicking next, URLs on the page are penalised less. If users click on page five instead of next, all URLs on pages one through four are penalised.
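The four signals above can be sketched as score updates. This is only an illustration under stated assumptions: the source does not give formulas, so the weights (`CLICK_BONUS`, `CATEGORY_BONUS`, `NEXT_PENALTY`, `NEXT_PENALTY_AFTER_CLICK`), the reciprocal position weighting, and the function names are all hypothetical. A page jump is modelled as penalising every page from the one the user left up to, but not including, the page they moved to, which gives pages one through four for a jump from page one to page five.

```python
from collections import defaultdict

# Hypothetical weights; the source does not specify any values.
CLICK_BONUS = 1.0
CATEGORY_BONUS = 0.25
NEXT_PENALTY = 0.5
NEXT_PENALTY_AFTER_CLICK = 0.1


def position_weight(page, rank, per_page=10):
    """Assumed weighting: the top of page one gets weight 1, lower positions less."""
    overall = (page - 1) * per_page + rank  # 1-based overall position
    return 1.0 / overall


def record_click(scores, category_of, url, page, rank):
    """Frequency of click, position, and category: reward the clicked URL
    and, to a lesser degree, unclicked URLs in the same category."""
    scores[url] += CLICK_BONUS * position_weight(page, rank)
    for other, category in category_of.items():
        if other != url and category == category_of[url]:
            scores[other] += CATEGORY_BONUS


def record_jump(scores, results_by_page, from_page, to_page, clicked_pages):
    """Next penalisation: penalise URLs on the pages the user moved past,
    less heavily on pages where the user clicked something first."""
    for page in range(from_page, to_page):
        penalty = NEXT_PENALTY_AFTER_CLICK if page in clicked_pages else NEXT_PENALTY
        for url in results_by_page.get(page, []):
            scores[url] -= penalty


scores = defaultdict(float)
results_by_page = {1: ["a", "b"], 2: ["c"], 3: ["d"]}
category_of = {"a": "news", "b": "news", "c": "sport", "d": "sport"}

record_click(scores, category_of, "a", page=1, rank=1)
record_jump(scores, results_by_page, from_page=1, to_page=3, clicked_pages={1})
```

Clicking the next button on page p corresponds to `record_jump` with `from_page=p` and `to_page=p + 1`.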

Bullot et al. [2003] also discuss the effect of updating score information at different intervals. They propose updating the score information whenever a URL is clicked (real time update), periodically at a set interval (scheduled time update), or under a hybrid model (scheduled limited update) that updates whenever a set number of URL clicks have been made. The work is only a proposal and is not investigated either experimentally or theoretically. Furthermore, while they do propose schemes that alter the crawl order with regard to various user feedback, they nonetheless assume that changes follow a Poisson process.
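The three update policies can be contrasted with a small sketch. The class `ScoreUpdater`, its method names, and the default `interval` and `click_limit` values are assumptions for illustration; the proposal itself does not specify an interface. Clicks are buffered and applied to the scores according to the chosen policy.

```python
from collections import Counter


class ScoreUpdater:
    """Hypothetical buffer that applies click counts under one of three policies."""

    POLICIES = ("real_time", "scheduled", "scheduled_limited")

    def __init__(self, policy, interval=60.0, click_limit=3):
        if policy not in self.POLICIES:
            raise ValueError(f"unknown policy: {policy}")
        self.policy = policy
        self.interval = interval        # seconds between scheduled updates (assumed unit)
        self.click_limit = click_limit  # clicks that trigger a limited update
        self.pending = []               # clicks not yet applied to the scores
        self.scores = Counter()
        self.last_flush = 0.0

    def _flush(self, now):
        # Apply all buffered clicks to the scores in one step.
        self.scores.update(self.pending)
        self.pending.clear()
        self.last_flush = now

    def on_click(self, url, now):
        self.pending.append(url)
        if self.policy == "real_time":
            self._flush(now)  # update on every click
        elif self.policy == "scheduled_limited" and len(self.pending) >= self.click_limit:
            self._flush(now)  # update once enough clicks have accumulated

    def on_tick(self, now):
        # Periodic update at a set interval.
        if self.policy == "scheduled" and now - self.last_flush >= self.interval:
            self._flush(now)
```

The trade-off the sketch makes visible is between how current the scores are and how often they must be recomputed: `real_time` flushes on every click, while the other two policies batch the work.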

In another study examining user feedback, Schaale et al. [2003] introduce a new method that analyses user queries to determine which documents should be crawled. Their scheme ranks domains based on their relevance to user queries, then combines this with the existing ordering method to produce a new crawl order. The work is untested.

Wolf et al. [2002] introduce the concept of search engine “embarrassment”, which measures the likelihood that a user clicks on a search result only to find that it is irrelevant to their query, identifying how embarrassed the search engine would be by the failure of a user’s search result. As highlighted in the decision tree shown in Figure 2.3, the embarrassment metric breaks the evaluation into steps. It determines whether a resource is fresh or stale, and in the latter case whether the stale resource is returned in response to a user query. If returned, the scheme determines whether the resource is clicked by the user. Finally it determines whether the clicked stale document is relevant to the query.
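The decision steps just described can be written as a short sketch. The function names and the probability parameters are assumptions for illustration; the second function simply multiplies the probabilities along the single embarrassing path of the tree, assuming each probability is conditional on the preceding steps. It is not the formulation used by Wolf et al.

```python
def is_embarrassing(fresh, returned, clicked, relevant):
    """Follow the decision tree for one resource and one query.

    Only the path stale -> returned -> clicked -> irrelevant
    ends in embarrassment; every other leaf does not.
    """
    if fresh:
        return False
    if not returned:
        return False
    if not clicked:
        return False
    return not relevant


def expected_embarrassment(p_stale, p_returned, p_clicked, p_irrelevant):
    """Probability of reaching the embarrassing leaf.

    Assumes each argument is conditional on the earlier steps,
    e.g. p_clicked is the click probability given a returned stale resource.
    """
    return p_stale * p_returned * p_clicked * p_irrelevant
```

For example, `is_embarrassing(fresh=False, returned=True, clicked=True, relevant=False)` follows the full path and returns `True`, while a stale resource that is never returned cannot embarrass the search engine.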

They model the probability of a user clicking on a result, based on its rank and page position. They also consider combinations of web update models. Their results show that their scheme for determining the optimal number of crawls for each document outperforms both a proportional and a uniform scheme. However, their work does not consider cases where the crawler fails to retrieve highly relevant, new documents during the crawl.

Finally, in an empirical study, Pandey and Olston [2005] examine how to schedule the recrawling of documents to improve user experience. They formulate a user-centric search repository quality metric that measures the impact of crawling strategies on users. We investigated a similar concept [Ali and Williams, 2003], which we describe in Chapter 4. Pandey and Olston compare three schemes that analyse past change to predict future change, and measure the resource usage of each:

• Staleness-Based Refreshing

The staleness-based approach attempts to reduce the number of stale documents in the search repository [Cho and García-Molina, 2000a]. They use this in conjunction with shingling [Broder et al., 1997] and the transportation algorithm [Wolf et al., 2002] for scheduling.

• Embarrassment-Based Refreshing

The embarrassment-based approach minimises the level of “embarrassment” caused to the search engine [Wolf et al., 2002]. Embarrassment increases when a user clicks on a search result and finds that it is irrelevant. Again, this is used in conjunction with shingling [Broder et al., 1997]. The results are simulated using a Poisson update distribution, and click frequencies are simulated using a Zipf-like function. In the original work by Wolf et al., the likelihood of relevance is simulated by selecting a uniform random number between 0 and 1. Instead, Pandey and Olston assume that a document that undergoes change becomes irrelevant to an average query if the fraction of shingles that change exceeds a given threshold.

• User-Centric Refreshing

They determine user impact by ranking with tf.idf and inlink count.
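The shingle-based relevance assumption used in the embarrassment-based scheme can be illustrated with a minimal sketch. The shingle width `w`, the default `threshold`, and the choice to measure change as the fraction of the old version's shingles missing from the new version are assumptions made here; the source does not state which values or which exact definition Pandey and Olston use.

```python
def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    tokens = text.split()
    if len(tokens) < w:
        # Short documents yield a single shingle, or none if empty.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def changed_fraction(old, new, w=4):
    """Fraction of the old version's shingles that no longer appear in the new one."""
    old_shingles = shingles(old, w)
    new_shingles = shingles(new, w)
    if not old_shingles:
        return 1.0 if new_shingles else 0.0
    return len(old_shingles - new_shingles) / len(old_shingles)


def becomes_irrelevant(old, new, threshold=0.5, w=4):
    """Assumed rule: the document is treated as irrelevant to an average
    query once the changed fraction exceeds the threshold."""
    return changed_fraction(old, new, w) > threshold
```

With `w=2`, changing one word in the middle of a six-word document replaces two of its five shingles, so the changed fraction is 0.4: below a threshold of 0.5 but above one of 0.3.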

To determine user interest, Pandey and Olston use the AltaVista query set. They allow each scheme to crawl all documents on the initial crawl, then on subsequent crawls allow only a certain number of documents to be refreshed. They compare the volume of resources required by each scheme to achieve the same level of collection quality. Interestingly, their results show that a user-centric approach requires substantially fewer resources to achieve the same level of repository quality. This is due in particular to their use of query relevance, which avoids crawling documents that change in “uninteresting” ways.

It is important to note that in their work, Pandey and Olston use a collection spanning 48 weeks, as opposed to our collection of 8 weeks. This allows them to use stochastic approaches such as a Poisson process to model change. Furthermore, their evaluation metric is closely tied to their schemes, and hence optimised for them. Their implementations of the various schemes make many assumptions in order to test them, for instance in deciding whether a document is relevant, in using Zipf-based click frequencies, and in simulating results using a Poisson update distribution. Unlike our work, they do not incorporate the addition of new pages in any way. Finally, as noted earlier, this work was published two years after our work [Ali and Williams, 2003].
