2.5 Crawler Evaluation Metrics
2.5.1 Measuring Document Change
The metrics in the previous section that evaluate crawler effectiveness require a definition of change. In this section we present various methods for detecting and measuring change and briefly compare them.
Simple Change Detection
A simple method of detecting byte-wise change between two documents is to compare a hash value, such as that computed by an MD5 hash [Rivest, 1992]. If two documents have the same hash value, it is extremely likely that they are identical. Hashing has been used in many web change, collection freshness, and mirroring studies [Wills and Mikhailov, 1999a; Bharat and Broder, 1999; Mogul, 1999; Wills and Mikhailov, 1999b; Cho and García-Molina, 2000c; Brewington and Cybenko, 2000b;a; Cho and García-Molina, 2003a]. While hashing is fast and simple to compute, it is not a useful definition of change from a crawling perspective since it does not quantify the differences between two versions of a document, instead producing a binary indication of change regardless of the size or type of that change.
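As a minimal sketch of this binary test, the following Python fragment compares the MD5 digests of two versions of a document. The function names and the sample documents are illustrative assumptions and are not taken from any of the cited studies.

```python
import hashlib


def digest(document: bytes) -> str:
    """Return the MD5 digest of a document's raw bytes as a hex string."""
    return hashlib.md5(document).hexdigest()


def has_changed(old: bytes, new: bytes) -> bool:
    """Binary change indicator: True when the two digests differ."""
    return digest(old) != digest(new)


# Hypothetical versions of a page; the second only gains an <hr> tag.
v1 = b"<p>the quick brown fox</p>"
v2 = b"<p>the quick brown fox</p><hr>"

print(has_changed(v1, v1))  # False
print(has_changed(v1, v2))  # True
```

The second comparison reports a change even though only a formatting tag was added, which illustrates why a hash alone cannot say how large or how important a change is.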
Detecting Meaningful Change
The size and type of change in a document may be important when assessing crawler effectiveness. Some changes are unlikely to alter the index term entries for that document, and hence will not affect user query results. For example, changes in formatting, such as the addition of a “<hr>” HTML tag to a document, will not affect the results of queries for most search mechanisms. Furthermore, the alteration of an advertisement within a document is also unlikely to affect search results. Many studies of change on the Web have shown that changes are localised, due to rotating banner advertisements, changing date fields, and so on [Wills and Mikhailov, 1999a; Ntoulas et al., 2004a]. It would be a waste of crawler resources to recrawl these documents because doing so is unlikely to result in important changes to the index.
(A) the quick brown fox jumped over the lazy dog
(B) the lazy dog jumped over the quick brown fox.
Common to (A) and (B): the quick brown; quick brown fox; jumped over the; the lazy dog
In (A) only: brown fox jumped; fox jumped over; over the lazy
In (B) only: lazy dog jumped; dog jumped over; over the quick
Figure 2.2: Measuring the difference between sentences (A) and (B) using a three-word shingling metric. The first row of shingles is common to both sentences.
A first criterion for a meaningful measure of change, therefore, is that it should only take into account HTML content that is indexed by the search engine.
Shingling
The shingling scheme [Broder et al., 1997] measures the resemblance of two documents by removing all HTML from the documents, converting their text to lowercase and then measuring the number of unique sequences of n words, known as n-grams or n-shingles, that the two documents have in common, as shown in Equation 2.24. The scheme can also be used to measure the containment of one document within another. For example, Equation 2.25 computes the percentage of document (A) that originates from document (B).
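\[
\text{Resemblance} = \frac{\text{Number of } n\text{-shingles (A) and (B) have in common}}{\text{Total number of unique } n\text{-shingles in (A) or (B)}} \tag{2.24}
\]
\[
\text{Containment} = \frac{\text{Number of } n\text{-shingles (A) and (B) have in common}}{\text{Number of unique } n\text{-shingles in (A)}} \tag{2.25}
\]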
The example in Figure 2.2 uses three-word shingles and produces ten unique sequences, of which four are common to both sentences, giving a resemblance of 4/10 = 40%.
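The calculation can be reproduced with a short Python sketch of Equations 2.24 and 2.25. It assumes that HTML has already been removed from the input, that words are maximal runs of word characters, and that two empty documents have a resemblance of 1; the function names are illustrative and these conventions are our assumptions rather than part of the original scheme.

```python
import re


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of unique n-word shingles in the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def resemblance(a: str, b: str, n: int = 3) -> float:
    """Equation 2.24: common shingles over all unique shingles in (A) or (B)."""
    sa, sb = shingles(a, n), shingles(b, n)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0  # assumed convention


def containment(a: str, b: str, n: int = 3) -> float:
    """Equation 2.25: common shingles over the unique shingles in (A)."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa) if sa else 1.0  # assumed convention


A = "the quick brown fox jumped over the lazy dog"
B = "the lazy dog jumped over the quick brown fox."

print(resemblance(A, B))            # 0.4
print(round(containment(A, B), 3))  # 0.571
```

For the sentences of Figure 2.2 the containment of (A) in (B) is 4/7, since (A) has seven unique shingles of which four also occur in (B).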
Since shingling requires a large amount of processing and storage, sketches are used, which store a specific number of shingles selected from the document. The selection criterion may be positional, frequency-based, or structural and has been surveyed by Hoad and Zobel [2003]. The selected shingles are converted into distinct numbers that are then sorted. “Supershingles” can be produced by creating shingles of the sorted sketches.
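The following Python fragment is a rough sketch of how a sketch and its supershingles might be built. The selection rule used here, which keeps the numerically smallest shingle identifiers, is an assumption chosen for brevity and is not one of the criteria named above; a positional, frequency-based, or structural rule would slot in at the same point. The use of MD5 to turn shingles into numbers, the sketch size, and all function names are likewise illustrative.

```python
import hashlib
import re


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of unique n-word shingles in the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shingle_id(shingle: str) -> int:
    """Map a shingle to a 64-bit number (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(shingle.encode("utf-8")).digest()[:8], "big")


def sketch(text: str, n: int = 3, size: int = 4) -> list[int]:
    """Assumed selection rule: keep the `size` smallest ids, in sorted order."""
    return sorted(shingle_id(s) for s in shingles(text, n))[:size]


def supershingles(sk: list[int], m: int = 2) -> set[tuple[int, ...]]:
    """Shingle the sorted sketch: each supershingle is m consecutive ids."""
    return {tuple(sk[i:i + m]) for i in range(len(sk) - m + 1)}


A = "the quick brown fox jumped over the lazy dog"
sk = sketch(A)
print(len(sk), len(supershingles(sk)))
```

Two documents can then be compared through their fixed-size sketches, or more coarsely through shared supershingles, instead of through their full shingle sets.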
Other Measures of Change
Various measures of change have been used as part of large scale web document monitoring and reporting tools. These tools allow users to specify web documents that they want monitored, and the type of changes that they want reported. The WebCQ system [Liu et al., 2000], in particular, allows the following types of change to be detected:
• Content update: Any update on a document
• Content insertion: Increase in size (above a threshold)
• Content deletion: Reduction in size (above a threshold)
• Link change: New or removed HTML links
• Image change: New or removed images
• Word change: New words added or existing words removed
• Phrase update: Detect changes to specific phrases
• Phrase deletion: Detect removal of specific phrases
• Table change: Detect changes to specific tables
• List change: Detect changes to specific lists
• Arbitrary text change: Identify any changes in specified fragments
• Keyword: Detect addition or removal of selected keywords
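Three of the change types listed above can be approximated with the Python sketch below. It is not WebCQ's implementation or interface: the function names, the character-count threshold, and the crude regular expressions used to find links and strip tags are all assumptions made for illustration.

```python
import re
from typing import Optional


def size_change(old: str, new: str, threshold: int = 20) -> Optional[str]:
    """Report an insertion or deletion when the size change exceeds the threshold."""
    delta = len(new) - len(old)
    if delta > threshold:
        return "content insertion"
    if delta < -threshold:
        return "content deletion"
    return None


def link_change(old: str, new: str) -> tuple[set[str], set[str]]:
    """Return (added, removed) link targets, found with a simple href pattern."""
    href = re.compile(r'href="([^"]*)"', re.IGNORECASE)
    old_links, new_links = set(href.findall(old)), set(href.findall(new))
    return new_links - old_links, old_links - new_links


def word_change(old: str, new: str) -> tuple[set[str], set[str]]:
    """Return (added, removed) words after crudely stripping tags."""
    def words(html: str) -> set[str]:
        return set(re.findall(r"\w+", re.sub(r"<[^>]+>", " ", html).lower()))
    return words(new) - words(old), words(old) - words(new)


# Hypothetical versions of a monitored page.
old = '<p>Prices rise</p><a href="/a">A</a>'
new = '<p>Prices fall sharply today</p><a href="/b">B</a>'

print(size_change(old, new, threshold=10))
print(link_change(old, new))
print(word_change(old, new))
```

Each function answers a different question about the same pair of versions, which is why such tools let users choose which types of change they want reported.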
We have evaluated several of these schemes in our work on change metrics in Section 4.3, and investigated these in a web crawling and search context.
ChangeDetector, another tool for monitoring web sites for change, uses document classification to filter changes by topic and uses entity-based change detection to filter changes by semantic concepts, such as names and dates [Boyapati et al., 2002]. One such example is the presence of a date near a location, which is indicative of a press release.
When detecting change, the ChangeDetector scheme first looks for byte-wise differences. If a change is found, the scheme then processes the document with an XML-aware difference algorithm followed by entity extraction. The difference algorithm breaks the document into nodes based on the W3C Document Object Model [Hégaret et al., 2005]. The text that falls below a node is normalised by ignoring differences in whitespace and structure. If a hash of the text detects differences, the algorithm attempts to align the text with insertions and deletions. Any “uninteresting” changes, such as spelling corrections, are ignored. The remaining changes are filtered through entity extractors to detect different types of semantic content. The distance between entities is recorded into a database allowing users to submit queries.
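The normalisation, hashing, and alignment steps can be illustrated with the Python sketch below. It is not the ChangeDetector implementation: it assumes the document has already been broken into nodes, represented here as a hypothetical mapping from node path to text, it uses the standard difflib module for alignment, it only compares nodes present in both versions, and it omits the filtering of uninteresting changes and the entity extraction stage.

```python
import difflib
import hashlib


def normalise(text: str) -> str:
    """Collapse runs of whitespace so that layout differences are ignored."""
    return " ".join(text.split())


def node_hash(text: str) -> str:
    """Hash the normalised text below a node."""
    return hashlib.md5(normalise(text).encode("utf-8")).hexdigest()


def align(old: str, new: str) -> list[tuple[str, str, str]]:
    """Align two texts word by word and return the non-equal edit operations."""
    a, b = normalise(old).split(), normalise(new).split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


def diff_nodes(old_nodes: dict[str, str], new_nodes: dict[str, str]) -> dict:
    """Align only those shared nodes whose normalised text hashes differ."""
    return {path: align(old_nodes[path], new_nodes[path])
            for path in old_nodes.keys() & new_nodes.keys()
            if node_hash(old_nodes[path]) != node_hash(new_nodes[path])}


# Hypothetical node texts keyed by an illustrative node path.
old_nodes = {"body/p[1]": "Released  in Melbourne on 3 May", "body/p[2]": "Contact   us"}
new_nodes = {"body/p[1]": "Released in Sydney on 3 May", "body/p[2]": "Contact us"}

print(diff_nodes(old_nodes, new_nodes))
# {'body/p[1]': [('replace', 'Melbourne', 'Sydney')]}
```

The second node differs only in whitespace, so its hashes match after normalisation and it is never aligned, which mirrors the ordering of cheap tests before expensive ones described above.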
Flesca and Masciari [2003] present a change metric that has been implemented in CMW, a change detection system that can create web update triggers that allow users to identify changes they want monitored and the actions to perform when they occur. Their approach does not examine the exact sequence of changes that are required to produce the modified document, but instead the amount of change that has been made. They represent documents as trees and compare document subtrees to determine the amount of difference between them. Work by Chawathe and García-Molina [1997] examined change detection in structured text. Their scheme, MH-DIFF, detects change in the hierarchy of a tree using the insert, delete, update, copy, move, and glue operations to classify changes:
• Insert adds a new node to the hierarchy.
• Delete removes a node from the hierarchy.
• Update alters the label of a node.
• Copy creates a copy of a subtree in another position in the hierarchy.
• Move removes a section of a subtree and places it into another position in the hierarchy.
• Glue replaces a section of the subtree with a copy of another subtree in the hierarchy.
While there has been a significant amount of research dealing with measuring change, particularly in web documents, much work dealing with modelling the Web has dealt with simple change metrics.