and preprocessing - Construction de corpus généraux et spécialisés à partir du Web

“Probe the universe in a myriad points.”

— Henry D. Thoreau, The Journal of Henry David Thoreau, Boston: Houghton Mifflin Co., 1906, p. 457.

3.1 Introduction

3.1.1 Structure of the Web and definition of web crawling

The Web The Web is ubiquitous in today’s information society. As the fundamental processes governing its constitution and functioning have proved robust and scalable, it is increasingly becoming a true “world wide web”. These principles are known and well studied; they are not a matter of chance. However, how people use the Web, and what they do with websites in terms of interaction, creation, or habits, holds considerable potential for scientific study. This is particularly the case for the social sciences, as part of what Hendler, Shadbolt, Hall, Berners-Lee, and Weitzner (2008) call Web science. The structure of the WWW cannot be determined exactly, which leaves room for studies in computer and information science.

Since a full-fledged web science study falls beyond the scope of this work, I will focus in the following on technical aspects which make up the very essence of a web crawl and I will describe the challenges to be faced when text is to be gathered on the Web.

First of all, concerning the basic mechanisms, a crawl relies on the notion of URIs (Uniform Resource Identifiers) and their subclass URLs (Uniform Resource Locators). They are what Berners-Lee et al. (2006) call the “basic ingredients” of the Web, which make the very concept of a web crawl possible:

“The Web is a space in which resources are identified by Uniform Resource Identifiers (URIs). There are protocols to support interaction between agents, and formats to represent the information resources. These are the basic ingredients of the Web. On their design depends the utility and efficiency of Web interaction, and that design depends in turn on a number of principles, some of which were part of the original conception, while others had to be learned from experience.”

(Berners-Lee et al., 2006, p. 8)

The robustness of the URL scheme contributes to a great extent to the robustness of the Web itself. As the expression “surfing the Web” suggests, the perception of the Web is shaped by the experience of following links that actually lead from website to website, and of being able to record them so that the “surf” is reproducible because the links are technically reliable.
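To make the role of URLs in link following concrete: links found in a page are often relative and have to be resolved against the address of the page before they can be recorded or followed. The following is a minimal sketch using Python's standard `urllib.parse` module. The function name `resolve_link` and all example addresses are hypothetical and serve only as an illustration.

```python
from urllib.parse import urljoin, urldefrag

def resolve_link(base_url, href):
    """Turn a link as found in a page into an absolute URL without fragment."""
    # Resolve relative references against the address of the containing page.
    absolute = urljoin(base_url, href)
    # Drop the fragment ("#top"): it points inside a page, not to another page.
    absolute, _fragment = urldefrag(absolute)
    return absolute

# Hypothetical page address and links, for illustration only.
page = "http://example.org/blog/post.html"
hrefs = ["../about.html", "/index.html#top", "http://example.net/x"]
resolved = [resolve_link(page, h) for h in hrefs]
```

Once normalized this way, two links pointing to the same page compare as equal strings, which is what makes a sequence of visited links recordable and reproducible.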

From a more technical point of view, using notions familiar to computer science, the image of a graph with nodes and edges is commonly used to describe the Web:

“One way to understand the Web, familiar to many in CS [Computer Science], is as a graph whose nodes are Web pages (defined as static HTML documents) and whose edges are the hypertext links among these nodes. This was named the ‘Web graph’, which also included the first related analysis. The in-degree of the Web graph was shown [...] to follow a power-law distribution.” (Hendler et al., 2008, p. 64)

Zipf’s law in linguistics is an example of a power law, that is, a functional relationship between two quantities in which one quantity varies as a power of the other. It states that the frequency of any word is inversely proportional to its rank in the frequency table.
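Written as a formula, the relationship can be sketched as follows. The symbols are notation introduced here for illustration and do not come from the cited sources: f(r) is the frequency of the word of rank r, C is a constant, and s is the exponent, with s = 1 corresponding to Zipf’s law as stated above. On a log-log scale the relationship becomes a straight line of slope -s.

```latex
f(r) = \frac{C}{r^{s}}
\quad\Longleftrightarrow\quad
\log f(r) = \log C - s \log r
```

By analogy, a power-law distribution of in-degrees means that the proportion P(k) of pages with in-degree k decreases as a power of k, with an exponent written here as gamma (again an assumed notation):

```latex
P(k) \propto k^{-\gamma}, \qquad \gamma > 0
```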

The web is thought to be a scale-free network, as opposed to a random network (Schäfer & Bildhauer, 2013, p. 8), because of its link distribution. In other words, the global shape of the web graph is probably not random: there are pages that benefit a lot more from interlinking than others, and many cases where the links only go in one direction.

“Each node has an in-degree (the number of nodes linking to it) and an out-degree (number of nodes linked to by it). It is usually reported that the in-degrees in the web graph are distributed according to a power law.” (Biemann et al., 2013, p. 23)
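The two notions defined in this quotation can be illustrated on a toy example. The sketch below counts in-degrees and out-degrees from a list of directed links. The edge list and the function name `degrees` are hypothetical; real web graphs are of course far larger.

```python
from collections import Counter

# Hypothetical toy link graph: (source page, target page) pairs.
edges = [
    ("a", "hub"),
    ("b", "hub"),
    ("c", "hub"),
    ("hub", "a"),
    ("c", "d"),
]

def degrees(edge_list):
    """Return (in_degree, out_degree) counters for a directed edge list."""
    # In-degree: how many links point to a page.
    in_degree = Counter(target for _source, target in edge_list)
    # Out-degree: how many links leave a page.
    out_degree = Counter(source for source, _target in edge_list)
    return in_degree, out_degree

in_degree, out_degree = degrees(edges)
```

In this toy graph the page `hub` attracts three of the five links while `b` receives none, a small-scale picture of the uneven link distribution discussed above.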

As a consequence, one may think of the web graph as a polynuclear structure where the nuclei are quite dense and well-interlinked, with a vast, scattered periphery and probably not too many intermediate pages somewhere in-between. This structure has a tremendous impact on certain crawling strategies described below.

The problem is that there are probably different linguistic realities behind link distribution phenomena. While these notions of web science may seem abstract, the difference in centrality and weight between websites could be compared to the difference between the language variant used by the spokesperson of an organization and the variants spoken by its various members.

Possible ways to analyze these phenomena and to cope with them are described in the experiments below (cf. chapter 4).

Web crawler The basic definition of a web crawler given by Olston and Najork (2010) gets to the point:

“A web crawler (also known as a robot or a spider) is a system for the bulk downloading of web pages.” (Olston & Najork, 2010, p. 176)

As such, crawling is no more than a massive download of web pages. However, since the Web is large and diverse, building web crawlers has evolved into a subtle combination of skills.

Web crawling The starting ground that makes web crawling possible in the first place is the connectedness of the web and the existing standards concerning the presence of links in the form of Uniform Resource Locators (URLs) (Schäfer & Bildhauer, 2013, p. 8).

“The raison d’être for web crawlers lies in the fact that the web is not a centrally managed repository of information, but rather consists of hundreds of millions of independent web content providers, each one providing their own services, and many competing with one another.” (Olston & Najork, 2010, p. 176)

Probably because the web has been and continues to be shaped by computer science paradigms, which favor abstract concepts for theoretical reasons and efficiency for practical ones, a crawl (web crawling operation) is most commonly seen as the traversal of a web graph.

“Crawling (sometimes also called spidering, with occasional minimal semantic differences) is the recursive process of discovering and downloading Web pages by following links extracted (or harvested) from pages already known.” (Schäfer & Bildhauer, 2013, p. 15)

“Web crawling is the process of fetching web documents by recursively following hyperlinks. Web documents link to unique addresses (URLs) of other documents, thus forming a directed and cyclic graph with the documents as nodes and the links as edges.” (Biemann et al., 2013, p. 23)
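The traversal described in these definitions can be sketched in a few lines. The example below is a simplified illustration and not the crawler used in this work: it assumes a breadth-first strategy and replaces real downloads with a hypothetical in-memory graph (`TOY_WEB`). The names `crawl`, `toy_fetch_links`, and `max_pages` are also assumptions made for the example.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/c"],
    "http://example.org/c": ["http://example.org/a"],
    "http://example.org/orphan": ["http://example.org/"],
}

def toy_fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(url, [])

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first traversal; returns URLs in the order they were visited."""
    frontier = deque(seeds)   # URLs discovered but not yet visited
    seen = set(seeds)         # guards against cycles in the graph
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl(["http://example.org/"], toy_fetch_links)
```

Because the graph is directed, the page `http://example.org/orphan` links into the rest of the graph but is never linked to, so a crawl starting from this seed does not discover it, however long it runs. The `seen` set is what keeps the cycles mentioned in the quotation from causing endless revisits.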

One may risk an analogy with space exploration by saying that in both cases the size, composition, and shape of the universe cannot be determined with absolute certainty. In both cases, it takes a conceptual effort, a theoretical framework, and huge datasets of measurements to give scientists an indirect idea of the characteristics of their universe.

Let us say for now that there are web pages which will most probably be found with any kind of crawling strategy and nearly regardless of the length of the crawl, while others may be as “interesting” as the first ones but still will not be found even by extensive strategies.

3.1.2 “Offline” web corpora: overview of constraints and processing
