Topic-driven or focused crawling attempts to gather only those documents that are relevant to a given topic [Chakrabarti et al., 1999]. It orders the crawl frontier according to the relevance of each URL's anchor text to the topic. Topic-driven crawling is based on the observation that documents sharing a link are more likely to be topically related [Davison, 2000].
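As an illustration of ordering a frontier by anchor text relevance, the following is a minimal sketch and not the method of any cited work. The scoring function `anchor_relevance` (the fraction of anchor words that are topic terms), the `Frontier` class, and the example URLs are all assumptions made for illustration.

```python
import heapq
from itertools import count


def anchor_relevance(anchor_text, topic_terms):
    """Assumed toy score: fraction of anchor words that are topic terms."""
    words = anchor_text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in topic_terms) / len(words)


class Frontier:
    """Hypothetical crawl frontier that returns the most relevant URL first."""

    def __init__(self, topic_terms):
        self.topic_terms = {t.lower() for t in topic_terms}
        self._heap = []
        self._seen = set()
        self._tie = count()  # keeps insertion order among equal scores

    def add(self, url, anchor_text):
        if url in self._seen:
            return  # each URL is queued at most once
        self._seen.add(url)
        score = anchor_relevance(anchor_text, self.topic_terms)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, next(self._tie), url))

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

A crawler built on this sketch would call `add` for every link extracted from a fetched document and `pop` to choose the next URL to fetch, so that links with topical anchor text are followed before all others.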
Topic-driven crawling is particularly relevant to our work in Chapter 6, which uses a similar technique. However, while topic-driven crawling measures the similarity between anchor text and a topic to order a crawl by relevance, we use anchor term statistics to measure the likelihood that anchor text points to a changed or new document.
One of the earliest examples of focused crawling is that of Chakrabarti et al. [1999], who used example documents to search for other related documents.
Rennie and McCallum [1999] use reinforcement learning to improve the likelihood of crawling pages of a particular type or topic, producing a three-fold efficiency improvement over a breadth-first crawl.
Diligenti et al. [2000] applied context graphs to avoid pursuing short-term gains in relevance at the expense of following less relevant links that eventually lead to groups of highly relevant documents. They use search engines to train the crawler so that link distances between relevant documents can be determined for different topics.
McCallum et al. [2000] examine the use of machine learning to help in the construction of internet portals that are concentrated on a particular topic, through the use of topic-driven crawling.
Chau and Chen [2003] compare two traditional crawlers, one using breadth-first search and one using PageRank, with a crawler using neural network techniques to order a topic-driven crawl of a domain. In the case of the breadth-first search, there is no way to ensure that documents are relevant to a topic. The PageRank scheme maintains two priority queues, one for URLs with relevant anchor text and a second for all other URLs. The neural network spider maintains a graph in which nodes represent documents and the links between nodes represent anchor links. Each link is weighted according to the relevance of its anchor terms to the topic, and links with a weight above a threshold are traversed, retrieving the documents they link to. The scheme then updates the graph, adding new nodes and links. The process stops once a predetermined number of documents has been retrieved, or once the average weight of nodes is lower than the allowable error. For the content-based schemes, they develop a lexicon of both relevant and irrelevant terms against which anchor terms are compared.
Chau and Chen also selected five high-quality hub documents as seed URLs. Relevant documents were determined by the percentage of relevant terms they contained.
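The threshold rule and the document limit described above can be sketched as follows. This is a simplified illustration only, not Chau and Chen's neural network: the weight function `link_weight`, the default `threshold` and `max_docs` values, and the in-memory `links` dictionary (standing in for fetching and parsing documents) are all assumptions, and the stopping rule based on average node weight is omitted.

```python
from collections import deque


def link_weight(anchor_text, relevant, irrelevant):
    """Assumed toy weight: (relevant hits - irrelevant hits) / words, floored at 0."""
    words = anchor_text.lower().split()
    if not words:
        return 0.0
    good = sum(1 for w in words if w in relevant)
    bad = sum(1 for w in words if w in irrelevant)
    return max(0.0, (good - bad) / len(words))


def threshold_crawl(links, seeds, relevant, irrelevant,
                    threshold=0.3, max_docs=100):
    """Follow only links whose weight exceeds `threshold`.

    `links` maps a URL to a list of (target_url, anchor_text) pairs; it is
    a hypothetical stand-in for downloading a page and extracting its links.
    """
    retrieved = []
    seen = set(seeds)
    queue = deque(seeds)
    # Stop when nothing is left to visit or the document limit is reached.
    while queue and len(retrieved) < max_docs:
        url = queue.popleft()
        retrieved.append(url)
        for target, anchor in links.get(url, []):
            if target in seen:
                continue
            if link_weight(anchor, relevant, irrelevant) > threshold:
                seen.add(target)
                queue.append(target)
    return retrieved
```

Using both a relevant and an irrelevant term list lets an anchor such as "crawler spam" score lower than one containing only relevant terms, which is the role the lexicon plays in the description above.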
Chau and Chen find that the neural network crawler performs best, with the breadth-first crawler performing marginally worse. Interestingly, they find that the PageRank crawler performed significantly worse than the other schemes, despite requiring the most computation time. The exceptional performance of the breadth-first scheme was directly related to the choice of high-quality seed URLs. PageRank performed exceptionally poorly because it placed great emphasis on documents with a large number of links, which were typically irrelevant. Their work is of importance to us since it supports our results in Chapter 5 with regard to freshness.
Menczer et al. [2001] discuss methods of evaluating adaptive topic-driven web crawlers. They present three methods for evaluation: using classifiers trained on a sample set to assess newly crawled documents, ranking of crawled documents via an independent retrieval system and examining the order in which they were crawled, and finally, using cosine similarity to assess the mean similarity between each crawled document and the focused crawl topic.
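The third measure, mean cosine similarity between crawled documents and the topic, can be sketched as below. This is a minimal illustration assuming raw term-frequency vectors and whitespace tokenisation; the weighting and preprocessing used by Menczer et al. are not specified here, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Raw term-frequency vector (assumed; no stemming or stop-word removal)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_topic_similarity(documents, topic_description):
    """Average similarity of each crawled document to the topic text."""
    if not documents:
        return 0.0
    topic = term_vector(topic_description)
    total = sum(cosine(term_vector(d), topic) for d in documents)
    return total / len(documents)
```

A crawl that stays on topic yields a mean close to 1, while a crawl that drifts towards unrelated documents pulls the mean towards 0.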
Srinivasan et al. [2005] and Menczer et al. [2004] present an evaluation framework for topic-driven web crawlers. The system allows all logic related to the crawling algorithm to be encapsulated in a single module that is connected to the evaluation system via a standard interface. The system then keeps track of the resources used by the crawler and the documents that are retrieved. Topics and relevant documents are retrieved from DMOZ [2008]. The system measures the recall and precision of the crawlers with regard to a specific set of relevant target documents. The similarities of documents that are not in the target set are also evaluated to determine their degree of relevance with regard to the target set.
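For reference, precision and recall against a fixed target set can be computed as in the following minimal sketch. It uses the standard set-based definitions; the function name and the convention of returning 0.0 for an empty set are assumptions, and the framework's own interface is not reproduced here.

```python
def precision_recall(crawled_urls, target_urls):
    """Set-based precision and recall of a crawl against known targets."""
    crawled = set(crawled_urls)
    targets = set(target_urls)
    hits = len(crawled & targets)  # target documents that were retrieved
    precision = hits / len(crawled) if crawled else 0.0
    recall = hits / len(targets) if targets else 0.0
    return precision, recall
```

Precision is the fraction of crawled documents that are targets, and recall is the fraction of targets that were crawled, so a crawler that fetches many off-topic pages can achieve high recall while its precision remains low.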
Chakrabarti et al. [2002] examine the information that is available about a document in the documents that link to it, and how this information can be used to accelerate topic-driven crawling. They implement a system that consists of two separate classifiers. The first of these is an apprentice that assigns priorities to unvisited URLs using features derived from the W3C Document Object Model [Hégaret et al., 2005]. The second classifier is a trainer that generates training instances for the apprentice. In effect, the trainer is a user specification of the desired content, while the apprentice learns how to find documents that match the desired content. Chakrabarti et al. show that their approach reduces the fraction of false positives by 30%–90%.
Next, we discuss “Hidden Web” crawlers, another specific type of crawler, created to retrieve documents that are only accessible via a search interface.