Web corpus sources, qualification, and exploitation

4.1 Introduction: qualification steps and challenges

4.1.1 Hypotheses

4.1.1.1 From pre-qualified URLs to web corpora

Many problems of web corpus construction have been solved neither conceptually nor technically, for example the issue of web genres and the question of whether balance is relevant to web corpora. Tanguy (2013) claims that an interesting research program would be to work towards automatic characterization processes that do not classify web pages but rather yield useful information for further linguistic exploitation. The resulting data could then be used by more advanced tools.¹

In fact, it could be useful to pre-qualify web documents, i.e. to work on the lists of URLs that then serve as sources for a crawl, in order to spot “licit” contexts (Tanguy, 2013) for use by corpus linguists, and thus to make web corpus building easier. The whole process could be divided into the following main steps:

1. Fit to the peculiarities of “web texts”, most notably by finding appropriate text descriptors.

In fact, the heterogeneity of the raw material makes effortless data mining impossible.

It is necessary to perform some data wrangling and to find salient surface cues (such as formal and sentence-based descriptors) in order to manipulate this material:

“While in some domains, simple ‘data mining’ is conceivable, in the case of text corpora (and, possibly, for other ‘heterogeneous’ databases), prior re-description is a necessity.” (Wallis & Nelson, 2001, p. 313)

2. Prequalification step:

• Filter URLs and web documents using the resulting relevant characteristics that enable the expression of heuristics and statistical processes.

• Annotate the resources so that they can fit various users’ needs.

3. Qualification step: using metadata added on purpose, qualify downloaded web documents, i.e. determine if they seem suitable for inclusion in a corpus.

The operation of qualification may seem similar to a linguistic characterization of the texts.

However, due to the extreme diversity of the documents taken into consideration (see definition and examples), the qualification brings statistical significance as well as features related to operationalization into focus. Robustness, for example, is paramount. As Bronckart, Bain, Schneuwly, Davaud, and Pasquier (1985) explain, these kinds of results do not yield conclusions at a linguistic level per se (or indirectly at best).²

¹ “One of the most interesting avenues would, in my view, be the implementation of automatic on-the-fly characterization procedures, aimed not at categorization into genres but rather at providing useful information about a web page for linguistic exploitation (for example the identification of contexts considered licit). Such procedures would cover needs in googleology and could be integrated into more heavily tooled approaches.” (Tanguy, 2013, p. 29)

² “One cannot generally infer from the statistical significance of a difference its relevance at the linguistic or, more generally, communicative level.” (Bronckart et al., 1985, p. 72)

Nonetheless, this identification should be performed without losing sight of a typological perspective. As Loiseau (2008) explains, automatic classification is not a goal in itself; it is paramount to take a stand on textual typology in order to come to a proper description.³

4.1.1.2 Usefulness of the results of readability studies for filtering and qualification tasks

I formulate the following hypotheses with respect to the notions introduced in chapter 2:

• It is possible to develop a filter based on URLs, HTML characteristics and text-based statistics in order to discriminate with good precision between incoherent web-specific document types and linguistically relevant ones.

• Research on text readability has yielded a series of metrics and analytical tools which can be generalized for text analysis purposes. These results can be used to qualify texts gathered on the web which are annotated on this basis. In fact, indicators can be aggregated to build multi-dimensional criteria that enable a proper classification of the texts and/or the construction of subcorpora.

• The precision as well as the recall of the prequalification and qualification can be evaluated using specially designed samples (respectively URL and text samples) as well as with a series of heuristics applied to the whole corpus, such as, for instance, existing anti-spam tools, n-gram dispersion or language model perplexities.
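As a rough illustration of the second hypothesis, surface indicators of the kind used in readability research can be gathered into a small feature set per text. The function name `surface_indicators` and the three indicators chosen here are assumptions made for the sake of the example, not the metrics actually used in this work.

```python
import re

def surface_indicators(text):
    """Return a few simple surface statistics for a text (illustrative choice)."""
    # Naive sentence split: terminal punctuation followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Word tokens: runs of word characters (letters, digits, underscore)
    words = re.findall(r"\w+", text)
    if not words or not sentences:
        return {"words_per_sentence": 0.0,
                "chars_per_word": 0.0,
                "type_token_ratio": 0.0}
    return {
        "words_per_sentence": len(words) / len(sentences),
        "chars_per_word": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
    }

indicators = surface_indicators("The cat sat. The dog ran!")
```

Several such values taken together form the kind of multi-dimensional criterion mentioned above; thresholds on them would have to be calibrated on annotated samples.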

4.1.1.3 Insights on collected corpora

Concerning the content exploitation as well as exploration, the following hypotheses can be formulated:

• It is possible to develop semi-automatic procedures to help with quality assessment.

• Corpora gathered on the Web can be compared to existing reference corpora as well as to one another, for quality assessment as well as for typological purposes.

• It is possible to give access to corpus texts via a visualization interface.

4.1.2 Concrete issues

4.1.2.1 Qualification of URLs (prequalification)

Problems to address The methodology used so far to gather URLs relies heavily on search engine APIs. However, many APIs are no longer freely available. Moreover, the question of whether this is the best way to collect a high number of URLs reflecting language use on the web remains open, as the so-called BootCaT method is prone to several serious biases (see above). Link gathering on other sources like social networks is a way to solve this URL shortage problem and to compensate for the biases of search engine algorithms and optimization by adding user-based information to the crawl seeds.

³ “The new means of describing textuality should therefore no doubt be tied to a program of textual typology, beyond the perspectives of automatic classification.” (Loiseau, 2008)

If one considers all potential URLs in a breadth-first search manner, a large part is redundant, misleading, or simply does not lead to a kind of content that one would consider integrating in any kind of text corpus.

Therefore, the main goal seems to be a proper calibration of the filter, which should only remove URLs that are obviously not reliable when it comes to creating a text corpus, leaving the rest of the work to a content filter.

In terms of precision and recall, recall takes priority: it is more important to keep as many interesting documents as possible, because the ones that lack relevance can always be filtered out later.
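The trade-off can be made concrete with a minimal sketch that computes both measures for a filter's output against a hand-annotated sample. The function name and the toy URL identifiers are hypothetical.

```python
def precision_recall(kept, relevant):
    """Precision and recall of a filter, given the items it kept
    and the items annotated as relevant."""
    kept, relevant = set(kept), set(relevant)
    true_positives = len(kept & relevant)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"a", "b", "c", "d"}

# A lenient filter keeps every relevant item plus some noise
lenient = precision_recall({"a", "b", "c", "d", "x", "y"}, relevant)

# A strict filter keeps only relevant items but loses half of them
strict = precision_recall({"a", "b"}, relevant)
```

Under the priority stated above, the lenient filter is the better first pass: the two irrelevant items it lets through can still be removed by a content filter, whereas the items discarded by the strict filter are lost.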

Method The prequalification of URLs has recently become a research topic in itself, all the more since big data became a field of interest. Due to the quantity of available web pages and the costs of processing large amounts of data, it has become an Information Retrieval task to try to classify web pages merely by taking their URLs into account and without fetching the documents they link to. Several heuristics such as trigram-based methods have proven to be efficient as a first pass, be it for topic (Baykan, Henzinger, Marian, & Weber, 2009) and genre guessing (Abramson & Aha, 2012) or for language identification (Baykan et al., 2008).
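To give an idea of what a trigram-based representation of a URL looks like, the following sketch extracts character trigram counts from the host and path. It only illustrates the feature extraction step under simple assumptions (lowercasing, scheme and query dropped) and does not reproduce the methods of the studies cited above; `url_trigrams` is a hypothetical name.

```python
from collections import Counter
from urllib.parse import urlsplit

def url_trigrams(url):
    """Count overlapping character trigrams in the host and path of a URL."""
    parts = urlsplit(url.lower())
    # Keep host and path only; scheme, query and fragment are dropped
    string = parts.netloc + parts.path
    return Counter(string[i:i + 3] for i in range(len(string) - 2))

features = url_trigrams("http://ab.de/x")
```

Such counts could then be passed to any standard classifier trained on URLs with known topic, genre or language labels.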

URL classification has also been used to find parallel texts, for example, which leaves many questions unanswered as to the text quality of the “dirt cheap” corpus gathered this way (Smith et al., 2013) and shows that the task is not trivial, as it impacts all downstream applications.

The work mentioned above paves the way for a first-pass filter enabling selection of possible candidates for a web corpus before actually downloading anything. Spam and advertising are a major issue, as are URLs that simply lead to image or video files or to web pages that do not mainly consist of text, for example photoblogs.
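A minimal sketch of such a first-pass filter, working on the URL string alone, could look as follows. The extension list and the name `is_candidate` are assumptions made for illustration; spam, advertising and photoblogs cannot be detected this way and would require further heuristics or a content filter.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical, non-exhaustive list of extensions pointing to non-text files
MEDIA_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp",
                    ".mp3", ".mp4", ".avi", ".mov", ".flv"}

def is_candidate(url):
    """Return False only for URLs that are obviously unsuitable."""
    parts = urlsplit(url)
    # Discard anything that is not a web address with a host
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    # Discard paths ending in a known media file extension
    extension = posixpath.splitext(parts.path.lower())[1]
    return extension not in MEDIA_EXTENSIONS

urls = ["http://example.org/blog/post.html",
        "https://example.org/img/photo.JPG"]
candidates = [u for u in urls if is_candidate(u)]
```

The filter is deliberately conservative: when in doubt, the URL is kept, which matches the preference for recall over precision at this stage.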

4.1.2.2 Qualification of web texts

Problems to address There are obviously content and text types on the Internet that do not belong to a linguistically relevant text corpus (see examples below).

It is not easy to filter them out because inter-annotator agreement is remarkably low (see p. 189). It seems that the extension of the notion of web corpus varies greatly according to the possible end users.

Due to the wide variety of web texts it is necessary to find robust definitions and to build or use robust tools in order to enable classification and quality assessment in a large range of different languages, text genres, and web page types.

Exploitation and visualization Special interfaces are needed in order to provide easier access to the actual content of web corpora, both restricted and general-purpose ones. In fact, due to the size of the corpora, it is not usually possible to read or even skim through a significant part of their content. The main access available even on bare, unannotated texts is to examine random samples of the corpus.
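When a corpus is too large to be held in memory, a random sample can still be drawn in a single pass. The sketch below uses reservoir sampling (Algorithm R), which is one possible technique among others and is not claimed to be the one used here; the function name and parameters are hypothetical.

```python
import random

def sample_documents(documents, k, seed=None):
    """Draw k documents uniformly at random from an iterable in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for index, document in enumerate(documents):
        if index < k:
            # Fill the reservoir with the first k documents
            reservoir.append(document)
        else:
            # Replace a random slot with probability k / (index + 1)
            j = rng.randint(0, index)
            if j < k:
                reservoir[j] = document
    return reservoir

sample = sample_documents(range(1000), 10, seed=1)
```

Fixing the seed makes the sample reproducible, which is convenient when several annotators have to examine the same documents.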

Quality assessment and general corpus analysis could be easier with other ways to look at corpora, for instance from the particular angle of a precise tool, or using a visualization which maps either a particular characteristic as it is present or absent throughout the corpus, or a general summary of corpus content.

In document Construction de corpus généraux et spécialisés à partir du Web (Page 164-167)