
2.7 Adaptive Crawl Ordering Schemes

2.7.3 Data Warehousing

The issue of adaptively maintaining data consistency has been approached from a data warehousing perspective.

One study describes a scheme to maintain weak consistency of stock prices in a data warehouse by estimating a good Time-To-Live (TTL) value based on different measures [Srinivasan et al., 1998]. A TTL value indicates how long a data item is expected to match the source item and is typically used by proxy cache systems to predict the change frequency of cached items. In this work, Srinivasan et al. suggest several approaches for determining a “good” TTL.

• A Static TTL Value

This scheme is easy to implement. However, selecting a low TTL may cause the web server to be contacted too often, wasting resources, while selecting a high TTL saves network resources but may leave the data warehouse storing stale information.

• A Semi-Static TTL Value

This scheme begins with a large TTL. It then reduces the TTL each time the source is found to be changing more frequently than it is being retrieved. Because the TTL can only be reduced, it always settles at the worst case, even if the worst case occurs infrequently, which leads to excessive polling.

• A Dynamic TTL Value Based on the Most Recent Changes

This scheme assumes a low TTL initially, leading to frequent polling. If the source changes frequently, the scheme polls frequently (low TTL). If the source changes infrequently, the scheme polls infrequently (high TTL). More recent changes are given greater emphasis, in this case by utilising a history of two TTL values.

• A Dynamic TTL Value with Static Bounds

To prevent the TTL value from becoming too high in cases where the source does not change for a long time, this scheme uses static bounds to limit the maximum and minimum TTL values. These static values, however, may not be representative of all sources.

• An Adaptive TTL Value

This scheme uses a dynamic TTL with a dynamic upper bound that is based on the most rapid change observed so far.
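The five schemes above can be sketched in a few lines. The sketch below is an illustration only: the text does not give the exact update rules used by Srinivasan et al., so the weighted average in `dynamic_ttl`, the `weight` and `slack` parameters, and all function and class names are assumptions made here for clarity. A static TTL is simply a constant and needs no code.

```python
from dataclasses import dataclass


def semi_static_ttl(current_ttl: float, observed_interval: float) -> float:
    """Semi-static: the TTL may shrink to a shorter observed interval but never grows."""
    return min(current_ttl, observed_interval)


def dynamic_ttl(previous_ttl: float, observed_interval: float,
                weight: float = 0.7) -> float:
    """Dynamic: blend the latest observed change interval with the previous TTL.

    `weight` is a hypothetical parameter: the emphasis given to the most
    recent observation (a history of two values, as described in the text).
    """
    return weight * observed_interval + (1.0 - weight) * previous_ttl


def bounded_ttl(ttl: float, lower: float, upper: float) -> float:
    """Static bounds: clamp the TTL into the fixed range [lower, upper]."""
    return max(lower, min(ttl, upper))


@dataclass
class AdaptiveTTL:
    """Adaptive: a dynamic TTL whose upper bound follows the fastest change seen."""
    ttl: float
    lower: float
    fastest_interval: float = float("inf")
    weight: float = 0.7
    slack: float = 2.0  # hypothetical: upper bound = slack * fastest interval

    def update(self, observed_interval: float) -> float:
        # Track the most rapid change observed so far.
        self.fastest_interval = min(self.fastest_interval, observed_interval)
        upper = max(self.lower, self.slack * self.fastest_interval)
        candidate = dynamic_ttl(self.ttl, observed_interval, self.weight)
        self.ttl = bounded_ttl(candidate, self.lower, upper)
        return self.ttl
```

Under these assumptions, a source that once changed after 10 time units can never be given a TTL above 20, however quiet it later becomes, whereas the plain dynamic scheme would let the TTL drift upwards without limit.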

Srinivasan et al. judge performance by how well a scheme minimises the following metrics:

• Number of Pollings

Measures the number of times the source is polled and indicates the network overhead of a scheme.

• Violation Probability

Indicates the probability that a user’s temporal consistency requirement is violated. This metric measures the fraction of the total time during which the difference between the local price and the source price exceeds the user constraint.

$$V_{Prob} = \frac{1}{T} \left( \sum_{i=1}^{n} t_i \right) \qquad (2.27)$$

where $t_1, t_2, \ldots, t_n$ denote the durations during which the difference between the local price and source price exceeded the user constraint, and $T$ is the total time that the data was presented to the user.
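As a small worked illustration of Equation 2.27, the function below sums the violation durations and divides by the total presentation time. The function name and the example numbers are hypothetical and are not taken from the study.

```python
from typing import Iterable


def violation_probability(violation_durations: Iterable[float],
                          total_time: float) -> float:
    """Return (1 / T) * sum(t_i), following Equation 2.27.

    violation_durations: the t_i, each a period during which the local and
        source prices differed by more than the user constraint.
    total_time: T, the total time the data was presented to the user,
        in the same unit as the durations.
    """
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return sum(violation_durations) / total_time


# Hypothetical example: three violations of 30 s, 90 s and 60 s in one hour.
example = violation_probability([30, 90, 60], 3600)
```

Here the constraint is violated for 180 of 3600 seconds, so the violation probability is 0.05.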

Using these metrics, they find that the adaptive TTL approach achieves the best results, producing the lowest probability of violation. The schemes are evaluated against real stock prices for several major IT companies over three-hour intervals. While these schemes appear to work well for rapidly changing stock prices, which have a uniform format, it is unclear how well they would adapt to changing web documents.

In another study into data warehousing, Sundaresan et al. [2003] examine the problem of revisiting documents that have changed in order to keep local data current. They examine the problem from the perspective of a distributed shared memory domain, where a similar problem, memory coherency, exists. Their approach is pull-based, which in turn means that the data warehouse must poll the data sources (web documents) for changes.

Sundaresan et al. measure the average freshness of the warehouse collection and examine the ability of the data warehouse to effectively determine change frequency with varying levels of information availability.

• At level 1, only the updated data is available.

• At level 2, a timestamp — equivalent to the Last-Modified HTTP header — is available so that the age of the data can be determined.

• At level 3, a history of timestamps is available, making it possible to model the update rate based on a window of previous updates. The longer the history, the less susceptible the model is to noise; however, it will adapt to changes in update frequency much more slowly.
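The trade-off at level 3 between noise and adaptivity can be illustrated with a sliding window over update timestamps. This is a minimal sketch, not the estimator used by Sundaresan et al.: the class name, the use of a simple mean interval, and the `window` parameter are assumptions made for illustration.

```python
from collections import deque
from typing import Optional


class WindowedChangeEstimator:
    """Estimate the update interval from the last `window` update timestamps."""

    def __init__(self, window: int) -> None:
        if window < 2:
            raise ValueError("window must hold at least two timestamps")
        # Older timestamps fall out automatically once the window is full.
        self._timestamps: deque = deque(maxlen=window)

    def record_update(self, timestamp: float) -> None:
        self._timestamps.append(timestamp)

    def mean_interval(self) -> Optional[float]:
        """Average gap between consecutive updates in the window."""
        if len(self._timestamps) < 2:
            return None
        span = self._timestamps[-1] - self._timestamps[0]
        return span / (len(self._timestamps) - 1)

    def predict_next_update(self) -> Optional[float]:
        """Last update time plus the mean interval, or None if unknown."""
        interval = self.mean_interval()
        if interval is None:
            return None
        return self._timestamps[-1] + interval
```

For updates at times 0, 10, 20 and 60, a window of two timestamps yields an interval of 40, reacting at once to the latest gap, while a window of three yields 25, smoothing over it.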

Sundaresan et al. find that using the last update (level 2) to estimate the next change produces the lowest freshness, but also has the lowest number of polls. Maintaining a list of all updates (level 3) produces significantly better freshness, but more than doubles the number of polls when compared to maintaining only the last update (level 2). Doubling the poll rate of the last update (level 2) scheme produces slightly better results than maintaining all updates (level 3), but requires nearly double the number of polls required by the latter scheme.

While these schemes work reasonably well, they rely on the accuracy of the timestamp information, something that cannot be relied upon in a Web environment. Furthermore, they find that a simple push model, where each source informs the local warehouse of changes, produces the best results; however, this is not feasible for search engines, as outlined in Section 2.3.1. They also consider different scheduling orders, such as First In First Out (FIFO), Least Recently Requested (LRR), and Most Frequently Changed (MFC). The LRR approach favours a model that does not over-poll, and so is more efficient, while the MFC scheme attempts to keep the most views up-to-date.
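The three scheduling orders can be read as three different sort keys over the set of sources awaiting a poll. The sketch below takes each name literally; the `Source` record, its fields, and the tie-breaking behaviour are hypothetical and may differ from the definitions used by Sundaresan et al.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Source:
    name: str
    enqueued_at: float     # when the source joined the poll queue
    last_requested: float  # when the source was last requested
    change_count: int      # number of changes observed so far


def fifo_order(sources: List[Source]) -> List[Source]:
    """First In First Out: poll in order of arrival in the queue."""
    return sorted(sources, key=lambda s: s.enqueued_at)


def lrr_order(sources: List[Source]) -> List[Source]:
    """Least Recently Requested: the oldest request time is polled first."""
    return sorted(sources, key=lambda s: s.last_requested)


def mfc_order(sources: List[Source]) -> List[Source]:
    """Most Frequently Changed: the largest change count is polled first."""
    return sorted(sources, key=lambda s: s.change_count, reverse=True)
```

Because `sorted` is stable, sources with equal keys keep their input order in all three functions.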

Sundaresan et al. find that, of all the schemes they compare, the MFC scheduling order produces the best overall freshness. The results are based on a Gaussian distribution of update intervals and on real trace statistics. While these results are interesting, they are not directly related to web document change. Furthermore, change is implemented as a simple binary measure, and so the implications of the results for real web documents are unclear.

In this section we have reviewed various adaptive crawl ordering schemes from the literature. In the next section we discuss some of the past work on modelling the Web.

In document Effective web crawlers (Page 84-87)