Mirroring - Effective web crawlers

The wide disparity in available bandwidth across the Web, combined with many users frequently requesting the same data at the same time, places excessive strain on web servers. To alleviate this problem and improve load sharing, web administrators can use mirroring techniques. Mirroring replicates groups of documents or entire web sites at multiple URLs. While mirroring is beneficial to users, it can be a problem for web crawlers. Retrieving multiple copies of the same data does not benefit search engine users, since repeated search results provide no new information. Furthermore, recrawling the same data at a different location wastes crawler bandwidth that could otherwise be used to improve search results. Crawlers therefore need to be able to detect mirrors and avoid crawling them unnecessarily. Propagation delay can further complicate the detection of mirrors, since an update may not yet have reached every copy and mirrors may therefore not be identical.

A study by Fetterly et al. [2003a] examines the number and distribution of “near-duplicate” documents on the Web. They measure similarity using five-word shingles [Broder et al., 1997] and define “near-duplicate” documents as documents that share two “supershingles”. They collected data on a weekly basis over a duration of eleven weeks, consisting of approximately 150 million web documents. Their results show that 29.2% of these documents are very similar, while 22.2% are virtually identical. Many of these near duplicates remain so over time, with implications for mirroring.
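As a rough illustration of shingle-based similarity, the sketch below builds the set of five-word shingles for each document and compares two documents by the overlap of their shingle sets (the Jaccard coefficient). It is a minimal sketch only: the function names `shingles` and `resemblance` are hypothetical, whitespace tokenisation is an assumption, and the supershingle step used by Fetterly et al. is not reproduced.

```python
def shingles(text, w=5):
    """Return the set of w-word shingles (contiguous word sequences) in text."""
    words = text.lower().split()  # assumption: simple whitespace tokenisation
    if len(words) < w:
        # A short document yields a single shingle holding all of its words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(doc_a, doc_b, w=5):
    """Jaccard overlap of the two shingle sets, between 0.0 and 1.0."""
    set_a, set_b = shingles(doc_a, w), shingles(doc_b, w)
    if not set_a and not set_b:
        return 1.0  # two empty documents are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox jumps over the lazy cat"
    print(resemblance(a, b))
```

Changing a single word alters every shingle that contains it, so small edits lower the score gradually rather than making the documents look unrelated.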

In other work, Cho et al. [2000] identify techniques for detecting mirroring and examine 25 million documents for mirroring. They use a content similarity measure that determines textual overlap [Shivakumar and García-Molina, 1995], and consider both document similarity and link similarity to identify mirrors.

In a large study, Bharat and Broder [1999] examine 179 million URLs and 238,000 hosts to determine how much mirroring there is on the Web. They define two hosts as mirrors if a large percentage of URL paths are valid on both hosts, and the common paths contain documents with similar content. Specifically, they examine structural and content similarity. Structural similarity is defined by the relative paths on a host. When two different hosts have the same set of paths, they are classified as structurally identical. Content similarity is graded in four steps. If two documents are byte-wise identical, they are classified as content identical. If documents have undergone changes at the byte level, such as the addition of white-space or HTML reformatting without changes to content, they are no longer content identical. Instead, two documents are considered content equivalent if they are identical after they are normalised for such changes. If documents undergo changes due to banner advertising or other dynamic content, but remain highly similar at a syntactic level, they are considered highly similar. Similarity at a syntactic level is measured using shingling [Broder et al., 1997]. Finally, two documents are considered related if they change substantially at the syntactic level but are semantically similar. Using these measures of similarity they define six levels of mirroring:

• Level 1: Structural and Content Identity.

– Every document on Host A is replicated with byte-wise identical content on Host B and vice versa.

• Level 2: Structural identity. Content equivalence.

– Every document on Host A is replicated with equivalent content on Host B and vice versa.

• Level 3: Structural identity. Content similarity.

– Every document on Host A is replicated with highly similar content on Host B and vice versa.

• Level 4: Partial structural match. Content similarity.

– Some documents on Host A are replicated with highly similar content on Host B and vice versa.

• Level 5: Structural identity. Related content.

– Every document on Host A is replicated with related but not syntactically similar content on Host B and vice versa.

• Mismatch: None of the above.
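The levels above can be read as a lookup on two inputs: how well the path structure of the two hosts matches, and how closely their document content matches. The sketch below is one possible encoding, not Bharat and Broder's implementation. The names `Content`, `exact_relation` and `mirror_level` are hypothetical, normalisation is reduced to collapsing white-space, and any combination the list does not name is assumed to fall under Mismatch.

```python
import re
from enum import Enum


class Content(Enum):
    IDENTICAL = "byte-wise identical"
    EQUIVALENT = "identical after normalisation"
    SIMILAR = "highly similar at the syntactic level"
    RELATED = "semantically related only"
    UNRELATED = "unrelated"


def exact_relation(doc_a: bytes, doc_b: bytes):
    """Detect the two strictest content relations, or return None.

    Assumption: normalisation only collapses runs of white-space.
    """
    if doc_a == doc_b:
        return Content.IDENTICAL

    def normalise(doc):
        return re.sub(rb"\s+", b" ", doc).strip()

    if normalise(doc_a) == normalise(doc_b):
        return Content.EQUIVALENT
    return None  # shingling would be needed to test for SIMILAR


# Levels that require structural identity, keyed by content relation.
_STRUCTURALLY_IDENTICAL = {
    Content.IDENTICAL: "Level 1",
    Content.EQUIVALENT: "Level 2",
    Content.SIMILAR: "Level 3",
    Content.RELATED: "Level 5",
}


def mirror_level(same_paths: bool, some_paths: bool, content: Content) -> str:
    """Map a structure/content pair to a mirroring level (strict reading)."""
    if same_paths:
        return _STRUCTURALLY_IDENTICAL.get(content, "Mismatch")
    if some_paths and content is Content.SIMILAR:
        return "Level 4"
    return "Mismatch"  # assumption: unnamed combinations are mismatches
```

For example, `mirror_level(False, True, Content.SIMILAR)` yields Level 4, since only part of the path structure matches.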

They also investigate whether the IP addresses of hosts are related. They compare paths by tokenising them by directory and creating word bi-grams on consecutive directories. They convert characters to lowercase, treat non-alphabetical characters as word breaks, eliminate stop words, and ignore features that occur only once in a host. Finally, they compare document content with the use of shingling [Broder et al., 1997]. In their experiments they find that 10% of the hosts in the collection are mirrored.
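The path-feature steps described above might be sketched as follows. This is one possible reading of the description rather than the authors' code: the stop-word list is invented for illustration, the function names are hypothetical, and bi-grams are formed over adjacent words in path order after stop words have been removed.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the original list is not given in the text.
STOP_WORDS = frozenset({"index", "html", "htm"})


def path_bigrams(path):
    """Lowercase a URL path, split it at non-alphabetical characters,
    drop stop words, and pair up adjacent words."""
    words = [
        word
        for word in re.split(r"[^a-z]+", path.lower())
        if word and word not in STOP_WORDS
    ]
    return list(zip(words, words[1:]))


def host_features(paths):
    """Bi-grams that occur more than once across all paths of a host."""
    counts = Counter(bigram for path in paths for bigram in path_bigrams(path))
    return {bigram for bigram, n in counts.items() if n > 1}


if __name__ == "__main__":
    host_paths = [
        "/Docs/API/v2/index.html",
        "/docs/api/changelog.html",
        "/about/team.html",
    ]
    print(host_features(host_paths))
```

Two hosts could then be compared by the overlap of their feature sets before any document content is fetched.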

In Appendix A, we describe the design and development of our Lara crawler; however, we sidestep the issues related to mirroring by manually avoiding mirrors. If we were to conduct a general crawl of the Web, we would need to implement some or all of the techniques we have described.

While we have discussed maintaining freshness and consistency of collections retrieved by web crawlers, a similar problem is encountered with proxy caching.

2.14 Proxy Caching

Proxy caches operate in multi-user environments and attempt to reduce web traffic by storing local copies of web resources that are frequently requested by local users. There are two main issues regarding proxy caching:

• Cache replacement policies

There are potentially unbounded numbers of resources on the Web, due to documents that are dynamically generated [Baeza-Yates and Castillo, 2004]. For instance, online calendars, diaries, and sites that calculate the value of π can dynamically generate an unbounded number of documents. This, coupled with limited local storage capacity, dictates that proxies need to decide which resource to replace in the local cache once it is full. This issue is generally not directly relevant to crawling.

• Cache consistency

Proxy caches store local copies of web resources, and so, as with crawling, need to deal with issues regarding synchronising the local copy of a resource with the source. Chankhunthod et al. [1996] discuss a proxy cache system that has different levels of hierarchical caching, with sibling and parent caches, to provide greater flexibility and scalability. Belloum and Hertzberger [2002] examine the impact of dynamic documents on a cache replacement policy. They show that performance can be improved dramatically by pre-fetching cached documents when they become stale according to their TTL value, before they are requested by users. In this sense the cache operates much like a crawler for a search engine, since updates are not made in response to user requests.
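The idea of refreshing stale entries ahead of user requests can be sketched as below. This is a minimal illustration under stated assumptions, not the system evaluated by Belloum and Hertzberger: `CacheEntry`, `stale_urls` and `prefetch_stale` are hypothetical names, each entry carries a fixed TTL in seconds, and `fetch` stands in for whatever function retrieves a resource from its origin server.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    url: str
    body: bytes
    fetched_at: float  # time of last retrieval, in seconds
    ttl: float         # lifetime in seconds before the copy is deemed stale


def stale_urls(cache, now):
    """URLs whose cached copy has outlived its TTL."""
    return [e.url for e in cache.values() if now - e.fetched_at >= e.ttl]


def prefetch_stale(cache, fetch, now):
    """Refresh every stale entry before any user asks for it.

    `fetch` is a caller-supplied function mapping a URL to its current body.
    Returns the list of URLs that were refreshed.
    """
    refreshed = []
    for url in stale_urls(cache, now):
        entry = cache[url]
        entry.body = fetch(url)
        entry.fetched_at = now
        refreshed.append(url)
    return refreshed
```

A background task would call `prefetch_stale` periodically, which is why the behaviour resembles a crawler's revisit schedule rather than on-demand caching.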

Cao and Liu [1998] analyse three different approaches to maintaining strong cache consistency in a proxy cache system: adaptive TTL, poll-every-time, and invalidation. The adaptive TTL approach that they use is the same as that proposed by Cate [1992]. The poll-every-time approach sends an if-modified-since request whenever a user requests a resource contained in the cache. The if-modified-since request asks the web server to return the requested resource only if it has been modified since the date on which the client last retrieved it, which the client provides as part of the request. Finally, the invalidation approach relies on the server to notify clients when files are modified.
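The poll-every-time exchange can be illustrated with Python's standard library. The sketch below only builds the conditional request and decides which copy to serve; it does not contact a server. The helper names `conditional_request` and `resolve` are hypothetical, and the cache is assumed to have recorded the retrieval time as a timezone-aware `datetime`.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def conditional_request(url, last_fetched):
    """Build a GET request that carries an If-Modified-Since header.

    `last_fetched` must be timezone-aware; it is converted to GMT,
    the form that HTTP dates use.
    """
    stamp = format_datetime(last_fetched.astimezone(timezone.utc), usegmt=True)
    return Request(url, headers={"If-Modified-Since": stamp})


def resolve(status, cached_body, response_body):
    """Choose the body to serve after the conditional request returns.

    A 304 (Not Modified) status means the cached copy is still current;
    any other status is assumed here to carry a fresh body.
    """
    if status == 304:
        return cached_body
    return response_body


if __name__ == "__main__":
    fetched = datetime(1994, 11, 15, 8, 12, 31, tzinfo=timezone.utc)
    request = conditional_request("http://example.com/page.html", fetched)
    print(request.header_items())
```

Every user request still costs a round trip to the server under this scheme, although an unmodified resource is answered with headers only, which keeps the response small.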

Cao and Liu show that an invalidation approach generates a similar volume of network traffic and server workload to an adaptive TTL approach, yet maintains consistency more effectively. In addition, the invalidation approach maintains consistency as effectively as the poll-every-time approach, but produces a much smaller volume of network traffic and server workload. The invalidation approach, however, is not currently supported by HTTP, and may be difficult to implement on the Web, making it unsuitable for web crawling.

An alternative approach to improving collection consistency has been to improve cooperation between crawlers and web servers, which we discuss next.
