
Document Deletion, Creation, and Update

In document Effective web crawlers (Page 88-91)

2.8 Web Change and Modelling

2.8.2 Document Deletion, Creation, and Update

Several studies have examined the rate at which documents are created, updated, and deleted on the Web.

Pitkow and Pirolli use a survival analysis to examine the lifespan of documents on the Web. They consider the survival rate of documents that are mainly requested internally by the author, externally requested by others, and mutually requested by both the author and the external community. They find that documents that are mainly requested by external users have the highest survival rate, followed by documents that are mutually requested, and that internally requested documents have the lowest survival rate.

Lawrence and Giles [2000] showed that even the most powerful crawlers can take weeks or months to discover that a document has been created, deleted, or updated. This figure has substantially improved over the years.

Notess [2003] examined search results for six different queries and noted that most search engines had some results that had been indexed within the previous few days. Notess also showed that the bulk of the index was about one month old, and that some documents had not been re-indexed for much longer periods.

Lewandowski [2004] studies date-restricted queries to determine how effective search engines are at determining the correct date of a document. Using fifty random queries from the Fireball [2008] search engine, Lewandowski finds that the major search engines, Google [2008], Yahoo! [2008], and Teoma [2004], all have difficulty in this regard.

In another study, Lewandowski et al. [2006] compare Google [2008], Yahoo! [2008], and MSN [2008], finding that Google has the best overall freshness, with most documents being updated on a daily basis. However, only MSN maintains all documents with a freshness of less than twenty days. Recent studies of a large domain show that deep Web coverage has also improved [McCown et al., 2006].

Lim et al. [2001] crawl five popular sites to a maximum depth of five levels, twice a day, for one month and examine the extent to which documents change, and how “clustered” the changes are. They measure the degree of change by using word edit distance, and determine how clustered the changes are by dividing documents into blocks and determining which blocks contain changes. They devise two methods of dividing documents into blocks.

The first method divides the document into predefined blocks of 32 words, while the second method divides documents by paragraph “<p>” tags.
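The two measurements described above can be sketched as follows. This is a minimal illustration rather than the implementation used by Lim et al.: the function names are hypothetical, the edit distance is normalised by the length of the longer document (an assumption), and blocks are compared position by position (also an assumption).

```python
from typing import List


def word_edit_distance(old: List[str], new: List[str]) -> int:
    """Levenshtein distance computed over words rather than characters."""
    prev = list(range(len(new) + 1))
    for i, old_word in enumerate(old, start=1):
        curr = [i]
        for j, new_word in enumerate(new, start=1):
            cost = 0 if old_word == new_word else 1
            curr.append(min(prev[j] + 1,          # delete old_word
                            curr[j - 1] + 1,      # insert new_word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def degree_of_change(old_text: str, new_text: str) -> float:
    """Word edit distance as a fraction of the longer document (assumed normalisation)."""
    old, new = old_text.split(), new_text.split()
    longest = max(len(old), len(new))
    return word_edit_distance(old, new) / longest if longest else 0.0


def fixed_blocks(words: List[str], size: int = 32) -> List[List[str]]:
    """Split a word list into consecutive blocks of `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def changed_block_fraction(old_text: str, new_text: str, size: int = 32) -> float:
    """Fraction of block positions whose content differs between the two versions."""
    old_blocks = fixed_blocks(old_text.split(), size)
    new_blocks = fixed_blocks(new_text.split(), size)
    total = max(len(old_blocks), len(new_blocks))
    if total == 0:
        return 0.0
    changed = 0
    for k in range(total):
        old_block = old_blocks[k] if k < len(old_blocks) else None
        new_block = new_blocks[k] if k < len(new_blocks) else None
        if old_block != new_block:
            changed += 1
    return changed / total
```

Because blocks are compared by position in this sketch, a single inserted word shifts every later block and marks them all as changed. A more careful version would align the two versions before comparing blocks, and the paragraph-based method would split on “<p>” tags instead of counting words.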

Their results show that 90% of documents have a change of less than 20%. Furthermore, most documents have changes that affect less than half the blocks when blocks are defined as 32-word groups. When paragraph tags are used to define blocks, they note, unsurprisingly, that many documents have changes that affect less than half the blocks.

They conclude that, since changes are typically small and clustered, crawlers should use an incremental update approach to improve efficiency. That is, instead of discarding the collection and rebuilding the entire collection from scratch, only small subsets of the Web need to be recrawled to update the collection.

Ntoulas et al. [2004a] examine the evolution of web documents, monitoring 150 popular web sites for a year to determine the changes in both content and structure. Specifically, their work examines changes that affect search engines, such as changes to link structure, the rate at which new documents and new content are created, the rate of change in content as measured by tf.idf, and the number of new words introduced.

Their study shows that new documents are created at a rate of approximately 8% per week, while 80% of documents are no longer available after one year. New documents tend to “borrow” content from existing documents: they estimate that 38% of content in new documents is “borrowed”. They also show that the link structure of the Web is much more dynamic than the Web itself, with 25% of all links being new links created each week and about 80% of links being replaced after a year.

Ntoulas et al. also show that documents tend to either remain unchanged or go through small changes. There is less than 5% difference in 70% of changed documents after one week. About 50% of documents that are available after one year have no changes. Changes tend to be localised, confined to restricted portions: weather, counter, reports, advertisements, and last update snippets.

Significantly, the results of Ntoulas et al. show that frequency of change is not a good predictor of degree of change: there is no correlation between how frequently a document changes and how much it changes. This means that existing schemes that concentrate on frequency of change [Coffman et al., 1998; Cho and García-Molina, 2000c] would not maximise the degree of change detected. Their results also show that past degree of change is a good predictor of future degree of change [Fetterly et al., 2003b]. That is, a document that changes by 10% in one week is likely to change by the same amount the next week.
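One way a crawler could act on this observation is to rank documents for recrawling by their previously observed degree of change. The sketch below is illustrative only and is not a scheme proposed by Ntoulas et al.: the smoothing factor `alpha`, the function names, and the example URLs are all assumptions.

```python
from typing import Dict, List


def update_estimate(previous: float, observed: float, alpha: float = 0.5) -> float:
    """Blend the latest observed degree of change into a running estimate.

    `alpha` is an assumed smoothing factor; higher values weight recent
    observations more heavily.
    """
    return alpha * observed + (1 - alpha) * previous


def pick_for_recrawl(estimates: Dict[str, float], budget: int) -> List[str]:
    """Return up to `budget` URLs with the largest estimated degree of change.

    Ties are broken alphabetically so that the ordering is deterministic.
    """
    ranked = sorted(estimates, key=lambda url: (-estimates[url], url))
    return ranked[:budget]


# Hypothetical estimates: fraction of each document expected to change.
estimates = {
    "http://example.com/news": 0.30,
    "http://example.com/about": 0.02,
    "http://example.com/blog": 0.10,
}
selected = pick_for_recrawl(estimates, budget=2)
```

With these hypothetical values, the news and blog documents are selected ahead of the largely static about page, regardless of how often each has changed in the past.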

Baeza-Yates and Castillo [2001] and Baeza-Yates et al. [2004] show that, unsurprisingly, new documents have poor PageRank, supporting our observation in Chapter 5 that finding documents that are both new and popular is difficult. In our work in Chapter 5, we determine the popularity of a document by the frequency with which it is returned in response to user queries, and we define a document as new if it has not been previously retrieved by the crawler.

A study by Koehler [2002] examines the stability, availability, and change rate of a set of documents over a four-year period. The study does not consider the creation of new documents over the time period, and so represents the dynamics of an aging collection. In this work, a document is considered comatose if it cannot be successfully downloaded during six weekly requests. Koehler selected 361 URLs at random and retrieved them on a weekly basis between December 1996 and February 2001. The study examines the change in size of the documents, the number of new links, the changed items linked from the document, and the document purpose. Koehler uses two different definitions of document purpose: navigational and content. A navigational document is one whose main purpose is to direct users to the information the web site was designed to provide, which is contained within content documents [McDonnell et al., 2000].

The results show that document longevity is closely linked to domain type and document purpose, with commercial navigational documents having a better survival rate than content documents, while the opposite is true for educational documents. Koehler finds that, in general, navigational documents are more likely to survive than content documents. Furthermore, aging documents had a half-life of about two years, while the frequency and type of changes tend to become more stable over time. Aging documents were also less likely to be removed from the Web. Koehler argues that this may be due to the author becoming either satisfied with or disinterested in the document.
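To give a sense of what a two-year half-life implies, the sketch below computes the fraction of a collection expected to survive after a given time. It assumes a constant decay rate, which is a simplification: as noted above, Koehler reports that aging documents become less likely to be removed, so the real curve would flatten over time. The function name and default value are illustrative.

```python
def surviving_fraction(age_years: float, half_life_years: float = 2.0) -> float:
    """Fraction of documents still available after `age_years`.

    Assumes simple exponential decay: the collection halves every
    `half_life_years`.
    """
    return 0.5 ** (age_years / half_life_years)


# Under this simplified model, a quarter of the original documents
# remain after four years.
after_four_years = surviving_fraction(4.0)
```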

While these studies have examined web dynamics from a general perspective, others have also concentrated on web dynamics with respect to crawling.

                         Top-Level Domain
                         All    Com    Edu    Gov    Netorg
Change < 1 day           23%    41%     2%     1%    11%
No change > 4 months     29%    14%    51%    54%    34%
Persistent > 1 month     72%    64%    84%    87%    71%
Persistent > 4 months    37%    26%    50%    56%    40%

Table 2.2: Change and persistence statistics of 720,000 popular pages. These values highlight the relative volatility of the “.com” domain, and the relative stability of the “.edu” and “.gov” domains.
