2.7 Adaptive Crawl Ordering Schemes
2.7.1 Adaptive Crawling
Adaptive crawling schemes attempt to detect changes as part of the crawling process and alter the crawl based on this information. One study of adaptive crawling describes the design and implementation of an incremental crawler for the WebFountain project [Edwards et al., 2001].
The model, based on linear programming, considers crawl strategies for improving document coherence and maintaining collection freshness. Unlike much other work, but similar to our work described in Section 4.3, their design does not make assumptions about the statistical nature of changes made to web documents. Instead, it adapts to actual change rates detected as part of the crawling process.
The design uses 256 change frequency buckets to partition documents, grouping documents with similar change frequencies together. Documents that change very rapidly are handled separately, on the assumption that these documents are typically media sites.
They find that different objectives produce different optimal solutions, but that a solution considering several criteria produces a good overall result.
• When the objective is to reduce the number of obsolete documents that are present in the final time period, the optimum solution is to crawl documents only during the final period, ignoring them during other time periods.
• With an objective to reduce the number of obsolete documents in each time period, the optimum solution is to crawl fast-changing documents in many time periods, while ignoring the 40% of documents with the lowest change rates.
• If the objective is to minimise the number of obsolete documents in the final time period, but to still try to consider the number of obsolete documents in each time period, the optimum solution is to crawl all documents just once during a crawl cycle but spread them across the entire cycle.
In this work, the model considers a document obsolete if it no longer matches the version on the Web, as determined by shingling [Broder et al., 1997], described in Section 2.5.1. While their results are promising, they are simulated, and they note that it is not possible to run the crawler model for a longer period and obtain a useful mathematical solution. Therefore, their scheme must be periodically reset and started again from the beginning, something they argue would be necessary anyway in practice in order to update parameters and re-optimise the crawler.
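As an illustration of how shingling can decide whether a stored copy still matches the live document, the following is a minimal sketch rather than the WebFountain implementation. It uses word-level shingles and the resemblance (set overlap) measure associated with Broder et al.; the shingle width `w`, the `threshold`, and all function names are assumptions chosen for the example.

```python
def shingles(text: str, w: int = 4) -> set[tuple[str, ...]]:
    """Return the set of contiguous w-word shingles in text."""
    words = text.lower().split()
    if len(words) < w:
        # Short documents: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a: str, b: str, w: int = 4) -> float:
    """Overlap of the two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # assumption: two empty documents count as identical
    return len(sa & sb) / len(sa | sb)


def is_obsolete(stored: str, live: str, threshold: float = 1.0, w: int = 4) -> bool:
    """Assumed rule: the stored copy is obsolete when resemblance < threshold."""
    return resemblance(stored, live, w) < threshold
```

With the default threshold of 1.0, any difference in the shingle sets marks the stored copy as obsolete; a lower threshold would tolerate small edits.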
In another major study, Cho and Ntoulas [2002] examine how to effectively use sampling to detect document change. They measure the collection freshness while comparing the following three document crawling methods.
• Round Robin
Round robin downloads a different subset of the collection during each crawl cycle, guaranteeing that all documents are retrieved once each. This method was used by the early version of the Google crawler [Brin and Page, 1998] and the Mercator crawler [Heydon and Najork, 1999].
• Change Frequency Based
The change frequency method uses past change history to determine how frequently documents are changing, and hence how frequently to revisit them. This method is investigated further in other work [Coffman et al., 1998; Cho and García-Molina, 2000a].
• Sampling-Based
The sampling-based method retrieves a small sample from each site to determine what percentage has changed, then allocates the remaining resources accordingly.
In the study they conduct crawls over a collection in cycles, downloading a limited number of documents per cycle. Furthermore, they define several evaluation metrics, though they use only the ChangeRatio measure, which we discussed in Section 2.5.
Cho and Ntoulas also define two different sampling policies, proportional and greedy. Once the initial sample is made, the proportional scheme allocates the remaining resources in proportion to the number of changed resources at each site. In contrast, the greedy scheme allocates all the remaining resources to the site that had the most changed resources, then if any resources are left they are allocated to the next most dynamic site and so on. While the greedy policy is expected to produce a better ChangeRatio than the proportional scheme when the estimation is correct, it is also expected to have a larger variation in performance, producing a worse result when the estimation is incorrect.
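The difference between the two policies can be sketched as follows. This is an illustrative reading of the description above, not code from the study: `changed` holds the number of changed documents found in each site's sample, `remaining` is an assumed per-site cap on how many documents are still available to download, and `budget` is the number of downloads left after sampling.

```python
def allocate_proportional(changed: dict[str, int], budget: int) -> dict[str, int]:
    """Split the budget across sites in proportion to sampled change counts."""
    total = sum(changed.values())
    if total == 0:
        return {site: 0 for site in changed}
    # Integer floor of each share; downloads lost to rounding stay unallocated.
    return {site: budget * n // total for site, n in changed.items()}


def allocate_greedy(changed: dict[str, int], remaining: dict[str, int],
                    budget: int) -> dict[str, int]:
    """Fill the most-changed site first, then the next most dynamic, and so on."""
    alloc = {site: 0 for site in changed}
    for site in sorted(changed, key=changed.get, reverse=True):
        take = min(remaining[site], budget)
        alloc[site] = take
        budget -= take
        if budget == 0:
            break
    return alloc


# Hypothetical example: three sites, 100 downloads left after sampling.
changed = {"a": 6, "b": 3, "c": 1}
remaining = {"a": 50, "b": 80, "c": 100}
proportional = allocate_proportional(changed, 100)
greedy = allocate_greedy(changed, remaining, 100)
```

In this example the proportional policy spreads downloads over all three sites, while the greedy policy gives nothing to site `c`, which is why a wrong estimate for the top-ranked site hurts the greedy policy more.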
Cho and Ntoulas also examine the optimal sample size for producing the highest ChangeRatio. They also present an adaptive sampling process that detects the ChangeRatio of each site during the sampling process and examines the confidence intervals. Once the confidence interval is large enough to select a site for crawling, the scheme switches to the greedy policy. They note that both the greedy and adaptive schemes may never download some documents, and they examine this further in their experimental results.
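One way to picture sampling that watches a confidence interval is sketched below. The stopping rule used here (stop once the interval for a site's change ratio lies wholly above or below a fixed `threshold`) is an assumption made for illustration and is not taken from the study; the normal-approximation (Wald) interval, `min_samples`, and the function names are likewise hypothetical choices.

```python
import math


def wald_interval(changed: int, sampled: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval for a proportion (crude for small samples)."""
    p = changed / sampled
    half = z * math.sqrt(p * (1 - p) / sampled)
    return max(0.0, p - half), min(1.0, p + half)


def sample_until_decided(observations, threshold: float,
                         min_samples: int = 5, z: float = 1.96) -> tuple[str, int]:
    """Consume per-document change flags until the interval clears the threshold.

    Returns (decision, samples_used), where decision is "crawl", "skip",
    or "undecided" if the observations run out first.
    """
    changed = 0
    sampled = 0
    for sampled, did_change in enumerate(observations, start=1):
        changed += bool(did_change)
        if sampled < min_samples:
            continue  # assumed warm-up before trusting the interval
        low, high = wald_interval(changed, sampled, z)
        if low > threshold:
            return "crawl", sampled
        if high < threshold:
            return "skip", sampled
    return "undecided", sampled
```

Sites that are clearly dynamic or clearly static are decided after few samples, so sampling effort concentrates on sites whose change ratio is close to the threshold.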
Cho and Ntoulas examine “subset” sampling when there is only a small volume of resources available. In this case, sites are grouped into subsets and then, in each download cycle, only documents from one subset are sampled and downloaded. Different subsets are visited in a round robin fashion over multiple download cycles. They discuss subset size but do not provide an optimal subset size.
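The round robin rotation over subsets can be sketched in a few lines. The grouping rule (consecutive slices of a site list) and the names `make_subsets` and `subset_for_cycle` are assumptions for illustration; the study does not prescribe how sites are assigned to subsets.

```python
def make_subsets(sites: list[str], subset_size: int) -> list[list[str]]:
    """Group sites into consecutive subsets of at most subset_size sites."""
    return [sites[i:i + subset_size] for i in range(0, len(sites), subset_size)]


def subset_for_cycle(subsets: list[list[str]], cycle: int) -> list[str]:
    """Pick the subset to sample and download in a given cycle (0-based)."""
    return subsets[cycle % len(subsets)]


# Hypothetical example: five sites in subsets of two.
subsets = make_subsets(["s1", "s2", "s3", "s4", "s5"], 2)
first = subset_for_cycle(subsets, 0)
fourth = subset_for_cycle(subsets, 3)  # wraps around to the first subset
```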
Cho and Ntoulas test these schemes against a web collection of 353,000 documents from 252 web sites, retrieved on a monthly basis for six months. They repeat the data for experiments requiring a longer change history. They also test some of the schemes against synthetic data following a normal change distribution.
Their results show that:
• The greedy and adaptive schemes perform extremely well, with an average ChangeRatio around 75%, compared to around 40% for the round robin, proportional, and frequency-based approaches.
• When the volume of resources allocated to sampling is high, the performance of all schemes is the same since all documents are downloaded. When sampling resources are low, the performance of the greedy policy degrades.
• Subset sampling improves performance when there are limited resources.
• With longer change history data, the frequency policy gradually improves in performance, requiring about 100 download cycles to match the performance of the greedy policy.
To evaluate the greedy policy's ability to revisit all documents, they introduce a measure of “fairness”, which they define as:
\[
\text{fairness} = \frac{\text{\# of changed and visited documents up to the $i$th cycle}}{\text{total \# of changed documents up to the $i$th cycle}} \tag{2.26}
\]
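The fairness measure can be computed from per-cycle records as in the sketch below. It assumes each cycle is recorded as a pair of sets, the documents that changed and the documents that were visited, and it counts a changed document as visited if it appears in any visited set up to cycle i, regardless of order; that simplification and the function name `fairness_up_to` are assumptions, not details from the study.

```python
def fairness_up_to(cycles: list[tuple[set[str], set[str]]], i: int) -> float:
    """Fraction of documents changed in cycles 1..i that were visited at least once.

    cycles[k] is (changed_docs, visited_docs) for cycle k + 1.
    """
    changed: set[str] = set()
    visited: set[str] = set()
    for changed_k, visited_k in cycles[:i]:
        changed |= changed_k
        visited |= visited_k
    if not changed:
        return 1.0  # assumption: nothing changed, so nothing was missed
    return len(changed & visited) / len(changed)


# Hypothetical two-cycle history.
history = [
    ({"a", "b"}, {"a"}),   # cycle 1: a and b changed, only a visited
    ({"b", "c"}, {"b"}),   # cycle 2: b and c changed, only b visited
]
```

Here fairness is 0.5 after the first cycle and 2/3 after the second: document `b` is counted once it has been visited, even though its first change was missed.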
Importantly, this measure does not count multiple missed changes as a negative, as long as a changed document has been visited at least once. After five download cycles, the greedy policy visits about 80% of the changed resources at least once. Cho and Ntoulas examine the number of times changed resources are revisited in each change group, and show that the number of visits is proportional to the number of changes. On synthetic data, the schemes perform similarly, with only marginal differences, though the greedy and adaptive schemes still perform best.
A different study of the sampling method of Cho and Ntoulas [2002], by Ghodsi et al. [2005], introduces a hybrid method for updating collections. Their work combines the change frequency method [Cho, 2001] with their own version of the sampling method. They use the sampling method for the initial crawls until a large history is available, then switch to the change frequency method. They also replace the original sampling method with an iterative one that downloads a sample in each iteration.
Initially, the sample size is equal for each site; it then adapts over the iterations, allocating more resources to sites that contain a larger number of changed documents.
Their data set consists of 100,000 documents (1000 web documents from each of 100 web sites) crawled once a fortnight for eight weeks, a total of four times. They then repeat the four cycles 125 times for a total of 500 cycles, measuring the efficiency of the different schemes against the collection. They measure efficiency as the number of retrieved documents that changed against the total number of retrieved documents.
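The efficiency measure and the replay of the four observed cycles can be written down directly. The sketch below is illustrative only: it assumes each retrieved document is represented by a boolean flag saying whether it had changed, and the names `efficiency` and `replay_cycles` are hypothetical.

```python
def efficiency(retrieved_changed_flags: list[bool]) -> float:
    """Share of retrieved documents that had changed since the last visit."""
    if not retrieved_changed_flags:
        return 0.0  # assumption: no downloads means zero efficiency
    return sum(retrieved_changed_flags) / len(retrieved_changed_flags)


def replay_cycles(observed_cycles: list, times: int) -> list:
    """Repeat the observed crawl cycles back to back to extend the history."""
    return observed_cycles * times


# Four fortnightly cycles repeated 125 times gives 500 simulated cycles.
simulated = replay_cycles(["c1", "c2", "c3", "c4"], 125)
```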
Their results show that the standard adaptive sampling had an efficiency of 75%, compared to their improved sampling scheme at 81% efficiency, and their hybrid scheme, which had an efficiency of 87%.