Web Crawling - Website boundary detection via machine learning

Web crawling is the process of automatically traversing (crawling, spidering) the hyperlink structure of the web whilst downloading web pages, which can then be used in various applications. Web crawling thus provides a method of collecting data from the web. There are three main categories of task that a web crawler might be used for [172, 128]: (i) broad, (ii) focused and (iii) continuous. A broad crawl aims to cover a wide area of the web. A focused crawl aims to cover a smaller, deeper collection of pages that is closely related to a specific task. A continuous crawl covers a portion of the web at repeated time intervals. The work described in this thesis uses focused web crawling, as a specific target collection of pages is required for the task of WBD.

2.5.1 Basic web crawler

A basic web crawling process starts at some set of seed web pages. The links from within the pages are extracted, and used to fetch further pages. This process is repeated until some end criterion is reached.
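The process just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TOY_WEB`, `fetch_links` and `crawl` are hypothetical names, the "web" is an in-memory dictionary rather than real HTTP fetches, URLs are visited in breadth-first order, and the end criterion is a simple page budget.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs its page links to.
# A real crawler would fetch each page over HTTP and parse links from its HTML.
TOY_WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/"],
    "http://example.org/c": [],
}


def fetch_links(url):
    """Stand-in for fetching and parsing: return the out-links of `url`."""
    return TOY_WEB.get(url, [])


def crawl(seeds, max_pages=100):
    """Crawl breadth-first from `seeds` until no URLs remain to visit
    or `max_pages` pages have been fetched (the end criterion)."""
    to_visit = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs that have ever been queued
    visited = []              # pages fetched so far, in visit order

    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()          # get the next URL
        links = fetch_links(url)          # fetch the page and extract links
        visited.append(url)
        for link in links:
            if link not in seen:          # never queue the same URL twice
                seen.add(link)
                to_visit.append(link)
    return visited
```

Calling `crawl(["http://example.org/"])` visits the four toy pages once each; swapping the `deque` for a different container changes the visiting order without changing the loop.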

Figure 2.4: The operation of a basic web crawler (based on [128]).

A basic web crawler is illustrated in Figure 2.4. The web crawler maintains a set of links (URLs) in the “frontier”. These URLs are selected in some order that determines which page is visited next. Initially the frontier is populated with the given set of seed URLs. The main loop repeats the process of: (1) getting a URL, (2) fetching the resource that resides at the URL and (3) parsing the content of the resource for any further links to traverse, which are then added to the frontier. This process terminates when some given criterion is met, for example a time constraint, the capacity of the repository, or simply because there are no more URLs in the frontier.

Types of web crawler

A web crawler is basically a graph search (traversal) algorithm, and so graph traversal techniques can be applied to crawl the web. Various types of web crawler can be implemented using the basic process shown in Figure 2.4. There are two main categories of web crawler:

(1) Universal Crawlers are general web crawlers that aim to cover as much of the web as possible, as quickly and efficiently as possible. An example use of a universal crawler is to build a search engine index, which aims to be as comprehensive and relevant as possible. These types of crawlers are used for broad and continuous crawling of the web.

(2) Preferential or Focused Crawlers are web crawlers that select the next URL from the frontier in some biased way. The biasing can be done using many different methods, and for many different goals. An example is to calculate a probability value/score for each URL in the frontier based on a notion of similarity. A (cosine) similarity can be assigned to the page residing at a URL with respect to a given topic of interest, using keyword relevancy for example [128, 147]. A probability can then be calculated so that URLs which are more likely to contain content on the desired topic receive higher values. The URLs can then be preferentially selected from the frontier (and subsequently crawled) based on this probability. This process is known as focused web crawling.
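As a rough illustration of preferential selection, the sketch below keeps the frontier in a priority queue ordered by a cosine similarity score. Several simplifications are assumptions made here and are not taken from the cited work: each URL is scored from the anchor text of the link that led to it, term vectors are plain word counts, and the highest-scoring URL is always chosen (greedy best-first) rather than sampled according to a probability. `TOY_WEB`, `cosine_similarity` and `focused_crawl` are hypothetical names.

```python
import heapq
import math
from collections import Counter

# Hypothetical toy web: URL -> list of (anchor text, target URL) pairs.
TOY_WEB = {
    "seed": [("football results", "sport"), ("web crawler tutorial", "crawl")],
    "sport": [("league table", "table")],
    "crawl": [("crawler frontier design", "frontier"), ("contact us", "contact")],
    "table": [],
    "frontier": [],
    "contact": [],
}


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def focused_crawl(seeds, topic, max_pages=10):
    """Visit URLs in decreasing order of anchor-text similarity to `topic`."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for anchor, link in TOY_WEB.get(url, []):
            if link not in seen:
                seen.add(link)
                score = cosine_similarity(anchor, topic)
                heapq.heappush(frontier, (-score, link))
    return visited
```

With the topic `"web crawler frontier"` and a budget of three pages, the crawl follows the crawler-related links and leaves the football pages unvisited. Replacing the greedy pop with weighted random sampling would give the probabilistic selection described in the text.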

Web Crawling Issues

There are many practical issues related to crawling the web; some of these are listed below:

• Coverage - visit as many pages (of importance) as possible in a timely and efficient manner.

• Importance - pages that are deemed important for a goal need to be identified to be crawled.

• Freshness - keeping the local index of web pages as relevant as possible.

• Fetching - HTTP requests, time outs, download sizes and so on.

• Parsing - from simple extraction of URLs to building a Document Object Model (DOM) of web pages.

• Stop word removal and Stemming - methods to apply to the content of a web page.

• Link extraction and Canonicalisation - reduce links to some consistent format.

• Spider traps - dynamically generated content can cause infinitely traversable sets of links.

• Page repository - storing pages efficiently and in a format which allows for future processing if need be.

• Concurrency - network bottlenecks, CPU and disk operations can each hinder a web crawler's performance.
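To make one of the issues above concrete, link canonicalisation can be sketched with Python's standard `urllib.parse` module. The rules applied here (resolve relative links, lower-case the scheme and host, drop default ports and fragments, use "/" for an empty path) are one plausible choice assumed for illustration; the function name `canonicalise` is hypothetical, and a production crawler would typically apply further rules.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource addressed.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalise(base_url, href):
    """Reduce a link found on the page at `base_url` to a consistent form."""
    absolute = urljoin(base_url, href)       # resolve relative links
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()    # hostname omits any user info
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"                 # treat an empty path as the root
    # The fragment is dropped: it addresses a position within the same page.
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

For example, `canonicalise("http://Example.org/a/b.html", "../c.html#top")` yields `http://example.org/c.html`, so two differently written links to the same page are stored only once in the frontier.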

2.5.2 Random Walk

As stated above, a web crawler is simply an implementation of an abstract graph search (or traversal) method applied to the WWW. In this section a random walk method of web graph traversal is explained. Given an arbitrary graph G, the sequence of vertices visited by starting at a vertex and then repeatedly moving from one vertex to the next by selecting a neighbour of the current vertex at random is termed a random walk on the graph G (see for instance [133]). Random walks arise in many settings (e.g. the shuffling of a deck of cards, or the motion of a dust particle in the air) and there is a vast literature related to them (the interested reader is referred to the classical [81], or the very recent [24] and the references therein). In particular, random walks can be used as a means of exploring a graph [25].
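The definition above translates directly into code. The following is a minimal sketch, assuming the graph is given as an adjacency-list dictionary and that neighbours are chosen uniformly at random; the names `random_walk` and `CYCLE` are hypothetical.

```python
import random


def random_walk(graph, start, steps, rng=None):
    """Return the sequence of vertices visited by a random walk.

    `graph` maps each vertex to a list of its neighbours. At every step the
    walk moves to a neighbour of the current vertex chosen uniformly at random.
    """
    rng = rng or random.Random()
    walk = [start]
    current = start
    for _ in range(steps):
        neighbours = graph[current]
        if not neighbours:        # dead end: no neighbour to move to
            break
        current = rng.choice(neighbours)
        walk.append(current)
    return walk


# A small connected, undirected graph: a cycle on four vertices.
CYCLE = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
```

For instance, `random_walk(CYCLE, "a", 10, random.Random(0))` returns a list of eleven vertices in which every consecutive pair is joined by an edge; passing a seeded `random.Random` makes the walk reproducible.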

It is well known [25] that any random walk on a finite connected graph eventually visits all of its vertices. Random walks, as defined above, are examples of so-called Markov stochastic processes [81]. The evolution of a process of this type is fully determined by its current state. Assuming the graph is completely known to the walk, every time a vertex is visited there is an equal chance of moving along any of the links connected to it. Random walks have been used to traverse the web on a large scale [99]. However, a random walk on the web graph is not as simple to perform as it may seem. In a true random walk pages would be selected uniformly at random from the graph, which does not happen when using the web. A random walk on the web graph suffers from two problems: (1) starting state bias, and (2) the walk can only move to pages that are known so far. Both problems stem from the fact that until the whole web graph is known, including all the in and out hyperlinks of each page, a true random walk cannot be performed.
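The Markov property and the equal-chance rule described above can be stated compactly. The notation is an assumption introduced here for illustration: G = (V, E) is a simple undirected graph, X_t denotes the vertex occupied by the walk at step t, and deg(u) is the number of neighbours of u.

```latex
\[
  \Pr\bigl(X_{t+1} = v \mid X_t = u, X_{t-1}, \dots, X_0\bigr)
  = \Pr\bigl(X_{t+1} = v \mid X_t = u\bigr)
  = \begin{cases}
      \dfrac{1}{\deg(u)} & \text{if } \{u, v\} \in E, \\[1ex]
      0 & \text{otherwise.}
    \end{cases}
\]
```

The first equality says that the next step depends only on the current vertex; the second says that each neighbour of the current vertex is equally likely to be chosen.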
