Web Crawling and Web Change - Web Change and Modelling

2.8 Web Change and Modelling

2.8.3 Web Crawling and Web Change

The change rate of documents on the Web has been considered as part of web crawling strategies to improve collection freshness. We first focus on studies that have examined the link between web change and top level domains.

Top Level Domains

Top level domains are the last part of an internet domain name, that is, the letters after the final dot. Many studies have highlighted the link between generic top level domain names, such as the commercial “.com” domain, and change frequency.

Cho and García-Molina [2000c], for example, monitor the daily evolution of 720,000 popular web documents over a four-month period to determine how the Web evolves. They study how often web documents change, their lifespan, and how long it takes for half the documents to change, and they model the change mathematically to determine how it affects crawling strategies. The results of their experiments, presented in Table 2.2, illustrate the volatility of the “.com” domain, and the relative stability of the “.edu” and “.gov” domains.

They also find that 50% of all documents change or are replaced after approximately fifty days. The “.com” domain requires only eleven days for 50% of documents to change, while the “.gov” domain requires almost four months for the same proportion of documents to change. From this study, they conclude that web document change can be modelled according to a Poisson process and they use this in their crawling techniques.
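
To make the Poisson assumption concrete: if changes to a document arrive as a Poisson process with rate λ, the probability that the document has changed at least once within t days is 1 − e^(−λt), so a half-life of h days corresponds to λ = ln 2 / h. The sketch below applies this to the half-lives quoted above. Treating a whole domain as having a single change rate is a simplification made for this example only, the value of 120 days for “.gov” is our approximation of "almost four months", and the function names are illustrative rather than taken from the cited work.

```python
import math


def rate_from_half_life(half_life_days: float) -> float:
    """Poisson change rate (changes per day) for which half of the
    documents have changed after `half_life_days`."""
    return math.log(2) / half_life_days


def prob_changed(rate: float, days: float) -> float:
    """Probability of at least one change within `days` under a Poisson model."""
    return 1.0 - math.exp(-rate * days)


# Half-lives taken from the text; 120 days approximates "almost four months".
half_lives = {"all domains": 50.0, ".com": 11.0, ".gov": 120.0}

for domain, half_life in half_lives.items():
    lam = rate_from_half_life(half_life)
    weekly = prob_changed(lam, 7)
    print(f"{domain}: rate = {lam:.4f}/day, P(changed within 7 days) = {weekly:.3f}")
```

Under this model a crawler that revisits every document weekly would find a much larger share of “.com” documents changed than “.gov” documents, which is the intuition behind using change rates in scheduling.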

Their work discusses web document change and how it affects various crawling strategies, such as incremental and batch crawlers. Their definition of incremental crawling differs from that used in most of the literature. They define an incremental crawler as one that continues crawling the Web once the crawler's resources are full, replacing less important documents in its collection with newer documents. In contrast, they define batch crawlers as those that must periodically recrawl the Web and replace the entire collection with freshly crawled documents.

Their work does not consider in detail the problem of determining web resource change from document content. Rather, their crawler uses a simple checksum to determine whether documents have changed. It then examines the history of the changes to determine how many times the crawler detected that the document changed [Cho and García-Molina, 2000b; 2003a] and uses this in its scheduling strategy. Their results are then used to determine the features of an effective crawling technique. Specifically, they indicate that the average freshness of the collection is the same for both batch and incremental (steady) crawlers, if all documents are revisited every month by both schemes.
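
Checksum-based change detection with a per-document history can be sketched as follows. This is a minimal illustration, not the crawler described by Cho and García-Molina: the text says only that a "simple checksum" is used, so the choice of SHA-256, the `ChangeHistory` class, and the ratio it reports are all our assumptions.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeHistory:
    """Per-document record of how often a revisit detected a change."""
    last_checksum: Optional[str] = None
    visits: int = 0
    changes_detected: int = 0

    def record_visit(self, content: bytes) -> bool:
        """Store the checksum of `content`; return True if it differs
        from the checksum seen on the previous visit."""
        # SHA-256 is an assumption; any stable checksum would do here.
        checksum = hashlib.sha256(content).hexdigest()
        changed = self.last_checksum is not None and checksum != self.last_checksum
        self.visits += 1
        if changed:
            self.changes_detected += 1
        self.last_checksum = checksum
        return changed

    def observed_change_ratio(self) -> float:
        """Fraction of revisits (visits after the first) that detected a change."""
        comparisons = self.visits - 1
        return self.changes_detected / comparisons if comparisons > 0 else 0.0


history = ChangeHistory()
for snapshot in (b"<p>v1</p>", b"<p>v1</p>", b"<p>v2</p>"):
    history.record_visit(snapshot)
print(history.changes_detected, history.observed_change_ratio())
```

A checksum of this kind flags any byte-level difference, so it cannot distinguish a trivial edit from a substantial one, and several changes between two visits are counted as one.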

If both schemes retrieve the same number of documents over the same period of time, the batch crawler will have a higher peak speed. The batch crawler does not crawl continuously, as the steady crawler does, and so must collect documents at a higher speed during the times it operates. The paper also shows that updating the collection as the crawler is retrieving documents improves freshness but reduces availability. The paper describes an incremental crawler where document importance is measured using PageRank. Document importance is used to determine which documents to remove from the collection to make room for any new documents that are crawled. Occasionally the crawler decides to revisit documents to refresh the collection. In this case the crawler uses a checksum to determine whether documents have changed. It then examines the history of the changes to determine how many times the crawler detected that the document changed [Cho and García-Molina, 2003a].
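
The idea of replacing less important documents once the collection is full can be sketched with a bounded min-heap. This is a hypothetical illustration only: the importance scores are assumed to be computed elsewhere (for example by PageRank), and the `BoundedCollection` class and its replacement rule are our own, not the design described in the paper.

```python
import heapq
from typing import List, Optional, Tuple


class BoundedCollection:
    """Fixed-capacity collection that keeps the most important documents."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Min-heap of (importance, url): the least important entry is at index 0.
        self._heap: List[Tuple[float, str]] = []

    def offer(self, url: str, importance: float) -> Optional[str]:
        """Offer a newly crawled document.

        Returns the URL that was discarded (which may be `url` itself),
        or None if the document was stored without discarding anything.
        """
        entry = (importance, url)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return None
        if importance <= self._heap[0][0]:
            return url  # not important enough to displace anything
        _, evicted = heapq.heapreplace(self._heap, entry)
        return evicted

    def urls(self) -> List[str]:
        """URLs currently held, in no particular order."""
        return [url for _, url in self._heap]


collection = BoundedCollection(capacity=2)
for url, score in [("a", 0.5), ("b", 0.1), ("c", 0.3), ("d", 0.2)]:
    discarded = collection.offer(url, score)
    print(url, "->", discarded)
```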

The paper does not consider how significant the changes are, only whether a document has changed. The authors state that importance could be used to determine whether a document needs to be kept up-to-date. It is unclear whether the crawler they describe has been implemented. Furthermore, they do not consider the creation of new documents.

Douglis et al. [1997] analyse the following factors that affect how frequently resources change on the Web:

• Size of the resource.

• Resource type.

• Frequency of access.

• Resource location.

• Top-level domain.

The work focuses on how these factors affect proxy caching. The research analyses requests and resources taken from two traces: one taken from a gateway containing the content of requests and response messages, and another taken from an Internet proxy.

Douglis et al. determine the fraction of requests that access resources that have changed, how old resources are when they are accessed, how modification times and access rates interact, how much duplication there is in the Web, and whether changes can be detected and exploited in HTML resources.

They find that, over a period of two weeks, many resources were modified to some extent, some were never modified, and a significant number were modified at least once between each trace. They also find that, although rate of change depends on many factors, content type was a significant factor along with top level domains, while the size of the resource had little correlation with rate of change. Their results are especially relevant to our work. In Chapter 4 we show that past change size is a relatively good predictor of change; however, we do not consider the size of the resource, but the change in size of the resource.

Fetterly et al. [2003b] also examine top level domains, monitoring 150,836,209 HTML documents on a weekly basis for eleven weeks to determine how frequently they change. They find that the frequency of change varies dramatically for different top level domains, with documents in the “.com” domain changing more frequently than documents in the “.gov” and “.edu” domains. However, there is a much weaker relationship between top level domain and the amount of change. They also find that larger documents changed more frequently and more significantly than smaller documents. Typically, documents change either in markup or in trivial ways. The relationship between the amount of change, the frequency of change, and document size is more significant for commercial domains than for educational and government domains. Finally, they find that past change is a good predictor of future change and that the quality of documents is also important, considering the number of spam documents they encounter, and the number of documents that change to a small degree in meaningful content.

Next we present studies that have examined web change without focusing on top level domains.

Other Studies

Wills and Mikhailov [1999a] discuss the accuracy of headers and their implications for caching. They use an MD5 hash to determine the accuracy of Last-Modified headers in test sets (home-pages taken from 100hot.com) over a two-week period.

As discussed in Section 2.1.1, they find that, in more than 9% of cases, resources had not changed, despite a change in the Last-Modified header. In approximately 0.3% of cases the resource had changed, despite no change in the Last-Modified header. They also show that in 14%–18% of cases no Last-Modified header was available.
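
The comparison between content hashes and Last-Modified headers can be sketched as a classification of consecutive observations of the same resource. The category names, the `Observation` type, and the decision to treat any pair with a missing header as `no_header` are our assumptions for illustration; the text does not describe how Wills and Mikhailov structured their analysis beyond the use of MD5.

```python
import hashlib
from collections import Counter
from typing import NamedTuple, Optional, Sequence


class Observation(NamedTuple):
    body: bytes
    last_modified: Optional[str]  # raw header value, or None if absent


def classify(prev: Observation, curr: Observation) -> str:
    """Compare two consecutive observations of one resource."""
    if prev.last_modified is None or curr.last_modified is None:
        return "no_header"
    body_changed = hashlib.md5(prev.body).digest() != hashlib.md5(curr.body).digest()
    header_changed = prev.last_modified != curr.last_modified
    if header_changed and not body_changed:
        return "header_changed_only"  # header claims a change the content lacks
    if body_changed and not header_changed:
        return "body_changed_only"    # content changed but header did not
    return "consistent"


def summarise(observations: Sequence[Observation]) -> Counter:
    """Count each category over all consecutive pairs of observations."""
    return Counter(classify(p, c) for p, c in zip(observations, observations[1:]))


series = [
    Observation(b"<html>same</html>", "Mon, 01 Jan 2001 00:00:00 GMT"),
    Observation(b"<html>same</html>", "Tue, 02 Jan 2001 00:00:00 GMT"),
    Observation(b"<html>new</html>", "Tue, 02 Jan 2001 00:00:00 GMT"),
]
print(summarise(series))
```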

They also performed a case study on several high-profile home-pages to determine short- and long-term trends in changes, finding that many of the changes were predictable: they affected the same few lines of the HTML code, particularly banner advertisements. This research heavily motivates our work in Chapter 4: changes need to be measured and analysed to determine their importance, particularly in the case of banner advertisements, and the Last-Modified header is not a reliable indicator of change.

Brewington and Cybenko [2000a] present an analysis of web document changes and show that they do not follow a pure Poisson process as stated by Cho et al. Nevertheless, they assume that web document changes follow a Poisson process for their calculations.

In separate work, Brewington and Cybenko [2000b] discuss how fast the Web evolves based on daily observations of 100,000 documents over a period of seven months, and estimate the rate at which web search engines need to revisit the documents to keep them up-to-date. They model changes in web documents and use this to determine the frequency at which crawlers must re-index them to achieve particular levels of index consistency.

The age of a document is determined by calculating the difference between a document’s Last-Modified header and the time it was retrieved. Their system monitors specific URLs for changes. In addition, the system checks the top query responses whenever users present a query to the search engine; however, a particular user’s query will not be run more often than once every three days unless the same query is posted by another user.
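
The age calculation described above can be sketched with the Python standard library. This is an illustration under our own assumptions: the function name is hypothetical, the header is assumed to be in the usual HTTP date format, and a header without timezone information is treated as UTC.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def document_age_days(last_modified: str, retrieved_at: datetime) -> float:
    """Age in days: retrieval time minus the Last-Modified timestamp.

    `retrieved_at` must be timezone-aware.
    """
    modified = parsedate_to_datetime(last_modified)
    if modified.tzinfo is None:
        # Assumption: treat a header without zone information as UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    return (retrieved_at - modified).total_seconds() / 86400.0


retrieved = datetime(2015, 10, 22, 7, 28, 0, tzinfo=timezone.utc)
print(document_age_days("Wed, 21 Oct 2015 07:28:00 GMT", retrieved))
```

Because the calculation depends entirely on the header, it inherits the header's unreliability and cannot be applied at all to documents served without one.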

Through their work, Brewington and Cybenko [2000b] find:

• A web crawler must download at least 45 million documents a day (a re-indexing period of 18 days) to maintain a 95% probability that a document taken at random is no more than one week old (assuming a Web of 800 million documents).

• A crawler must re-index at least 94 million documents a day (a re-indexing period of 8.5 days) to maintain a 95% probability that a document taken at random is no more than a day old.

• The Last-Modified header is available for around 65% of observations.

• Most web documents are modified during the span of US working hours.

• 4 kB–5 kB files with fewer than 2–3 images were modified most frequently.

• 4% of documents that were observed six times or more changed on every observation; 70% of these had no timestamp.

• 56% of documents observed six times or more had no change.

• 20% of the documents were younger than eleven days, with a median of roughly one hundred days.

• The older half of documents had a very long tail.
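
The relationship between daily download volume and re-indexing period in the first two findings is simple arithmetic, sketched below. It assumes that every document is fetched exactly once per cycle at a uniform rate, which is our simplification; the 95% guarantees themselves come from Brewington and Cybenko's change model and are not reproduced here.

```python
WEB_SIZE = 800_000_000  # documents, as assumed in the findings above


def reindexing_period_days(web_size: int, docs_per_day: int) -> float:
    """Days needed to revisit every document once at a uniform daily rate."""
    return web_size / docs_per_day


def required_daily_downloads(web_size: int, period_days: float) -> float:
    """Documents per day needed to revisit everything within `period_days`."""
    return web_size / period_days


for docs_per_day in (45_000_000, 94_000_000):
    period = reindexing_period_days(WEB_SIZE, docs_per_day)
    print(f"{docs_per_day:,} documents/day -> re-indexing period of {period:.1f} days")
```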

This research is closely related to ours, in that they too are examining the problem of determining frequency of document change. However, they are more concerned with modelling the change as opposed to determining the effect of the changes on query results. They do not consider the type, significance, and importance of the changes to documents in a search context. Their method of polling does not guarantee that all changes that occur during any particular day are captured; however, they have conducted their experiments on a fairly large data set.

Although many studies have investigated web change, they share several common traits. Most use simple change metrics, despite studies showing that many modifications are trivial changes to content and markup, or predictable changes to banner advertising. Many assume change follows a Poisson process, despite observations to the contrary. Page content and purpose have a significant effect on change frequency, with commercial pages typically changing more frequently. Finally, many studies rely on past change statistics, the Last-Modified header, or some combination of the two, and do not consider the creation of new pages.

While these studies have examined web change and the implications for web crawling, other studies have examined the relationship between crawl ordering and its impact on the “quality” of the collection.
