Impact on research results

3.3 An underestimated step: Preprocessing

3.3.3 Impact on research results

3.3.3.1 Quantitative impact

A measure of the impact. As an example of the tremendous impact of preprocessing, the stage-by-stage breakdown performed on the DECOW2012 corpus (Schäfer & Bildhauer, 2013) shows how much web text is discarded in a conservative approach. Because of the necessity of delivering a high-quality corpus to theoretical linguists, as much as 94% of the web documents which were downloaded and stored are eliminated during the procedure.

Algorithm removes       No. of documents    Percentage
Very short pages              93,604,922        71.67%
Non-text documents            16,882,377        12.93%
Perfect duplicates             3,179,884         2.43%
Near-duplicates                9,175,335         7.03%
Total                        122,842,518        94.06%

Table 3.1: Number of documents lost in the cleanup steps of DECOW2012, a 9 billion token corpus of German from a crawl of the .de domain, according to Schäfer & Bildhauer (2013, p. 19)

Table 3.1 shows that most of the computational effort spent on downloading and storing documents is lost during preprocessing, which means that infrastructure costs are considerably higher than they would be if only the retained documents had been fetched.
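The figures in Table 3.1 can be cross-checked with a few lines of arithmetic. The sketch below only re-uses the numbers reported in the table; the estimate of the crawl size rests on the assumption that the percentages are relative to all downloaded documents, and the variable names are illustrative.

```python
# Figures copied from Table 3.1 (Schäfer & Bildhauer, 2013, p. 19).
removed = {
    "Very short pages": 93_604_922,
    "Non-text documents": 16_882_377,
    "Perfect duplicates": 3_179_884,
    "Near-duplicates": 9_175_335,
}
reported_total = 122_842_518
reported_share = 0.9406  # 94.06% of the documents, as reported

# The four cleanup steps should add up to the reported total.
total_removed = sum(removed.values())

# Assumption: the share is relative to all downloaded documents, so the
# crawl size can be estimated by inverting it. The share is rounded to
# two decimals, hence the result is only an approximation.
estimated_downloaded = round(total_removed / reported_share)
estimated_kept = estimated_downloaded - total_removed

print("Steps add up to reported total:", total_removed == reported_total)
print(f"Estimated documents downloaded: {estimated_downloaded:,}")
print(f"Estimated documents kept:       {estimated_kept:,}")
```

Under this assumption, the crawl comprises roughly 130 million documents, of which fewer than 8 million survive the cleanup.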

Raising awareness about consequences on the content. Despite the cleaning steps, web corpora still aggregate a certain amount of undesirable material. In the eyes of traditional corpus linguists, this may even be their worst downside.

“[Web corpora] still contain significant amounts of noisy data, such as spam Web pages, redundant auto-generated content from content management systems, misspellings, etc. Compared to users of traditional corpora, users of Web corpora must therefore be more aware of the steps which were taken in the construction of the corpus, such that they are aware of potential distortions of their results.” (Schäfer & Bildhauer, 2013, p. 6)

Variations in content quality can have consequences even for the most enthusiastic user base. The advantage which computational linguists see in web data resides in the sheer quantity of available occurrences: a larger amount of unclean data seems to be better than a smaller amount of clean data, which could in itself enable statistical methods such as language models to obtain better results.

However, this maxim should not be adopted systematically, and a number of statistical measurements should be taken with caution and not compared with more traditional corpora without taking potential differences into account:

“As an example, [...] the word type count for Web corpora is usually much too high to be plausible due to a large amount of noisy material, such that naively drawn statistical conclusions might be invalid.” (Schäfer & Bildhauer, 2013, p. 6)
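The effect on the type count can be illustrated with a toy example. The following sketch uses made-up sentences and made-up noise tokens (none of them come from a real corpus) and a naive whitespace tokenization, which is itself a simplifying assumption.

```python
def type_token_stats(text: str) -> tuple[int, int]:
    """Return (number of word types, number of tokens) for a text,
    using lowercasing and whitespace tokenization only."""
    tokens = text.lower().split()
    return len(set(tokens)), len(tokens)

# Toy data: a clean passage and a handful of invented noise tokens
# (typos, spam fragments, identifiers).
clean = "the cat sat on the mat and the dog sat on the rug"
noise = "xq7z buy-now!!! catt 0x3fa9 click-here teh"
noisy = clean + " " + noise

clean_types, clean_tokens = type_token_stats(clean)
noisy_types, noisy_tokens = type_token_stats(noisy)

print(f"clean: {clean_types} types / {clean_tokens} tokens")
print(f"noisy: {noisy_types} types / {noisy_tokens} tokens")
```

Since nearly every noise token is a new type, a modest share of noise in the token count translates into a disproportionate increase in the type count, which is why type-based statistics are particularly sensitive to insufficient cleaning.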

The more complicated the processes, the less directly interpretable the results. The issue of data “clean up” is mentioned by S. Abney (1996) when talking about statistical inquiries:

“There is always the danger that the simple principles we arrive at are artifacts of our data selection and data adjustment” (S. Abney, 1996, p. 11)

Even from a statistical point of view, access to the corpus is far from being immediate, because of the blend of research goals and processing issues described above. This blend is particularly difficult to see through regarding web corpora; a small shift in research practices can hinder a proper interpretation and assessment of results, even if the raw text base is the same:

“The value of statistical measurements strongly depends on their reproducibility and comparability. Even small changes in used definitions or working steps can lead to uncomparable and unappraisable results. This especially holds for Web corpora with a large set of pre-processing steps. Here, a well-defined and language independent pre-processing is indispensable for language comparison based on measured values. Conversely, irregularities found in such measurements are often a result of poor pre-processing and therefore such measurements can help to improve corpus quality.” (Eckart et al., 2012, p. 2321)

3.3.3.2 Evaluation of boilerplate removal

Boilerplate removal has two main uses: corpus construction and readability enhancement for mobile devices (Kohlschütter et al., 2010). Even on mobile devices, it is important to preserve the layout, as it helps the reader to process the text at an acceptable cognitive cost. Therefore, if a corpus is used for understanding or simulating how humans understand texts, these layout marks have to be kept.

The CLEANEVAL competition (Baroni, Chantree, Kilgarriff, & Sharoff, 2008) was an attempt to evaluate several boilerplate removal tools, with major shortcomings. Most importantly, the evaluation metrics favored destructive annotation, since the potential loss of original markup was not well evaluated. However, HTML tags convey information which helps the reader to understand the content, so that they ought to be taken into consideration for most applications (Lejeune, 2013).

The main issue resides in the distinction between informative and less (or non-)informative content. In the case of press articles for instance, information is not only conveyed by bare words, since the ability to extract metadata such as title and date is at least as important. Other end-users may also consider paragraph boundaries as highly relevant information.

Most of the time, tools are evaluated on English text, which raises an issue concerning other frequent languages on the Internet such as Russian or Chinese, as well as concerning multilingual corpora.

In this context, the choice of metrics is also open to discussion. Word-level metrics are not appropriate for languages like Chinese, where a character n-gram evaluation would be better (Lejeune, 2013).
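To make the idea of a character n-gram evaluation concrete, here is a minimal sketch. It is not the metric used in CLEANEVAL or by Lejeune (2013): the function names, the choice of bigrams, the removal of whitespace before counting, and the example strings are all assumptions made for illustration.

```python
from collections import Counter

def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count character n-grams, ignoring whitespace (an assumption:
    whitespace is treated as layout, not content)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def ngram_prf(extracted: str, gold: str, n: int = 2) -> tuple[float, float, float]:
    """Precision, recall and F1 of the extracted text against a gold
    standard, computed on multisets of character n-grams."""
    ext, ref = char_ngrams(extracted, n), char_ngrams(gold, n)
    overlap = sum((ext & ref).values())  # multiset intersection
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical example: the gold text is kept entirely, but two
# navigation labels ("首页", "登录") survive as boilerplate.
gold = "今天天气很好"
extracted = "首页 今天天气很好 登录"
precision, recall, f1 = ngram_prf(extracted, gold)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Because no word segmentation is needed, the same function applies unchanged to English, Russian, or Chinese, which is what makes character-based measures attractive for multilingual comparisons.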

“From the point of view of our target user, boilerplate identification is critical, since too much boilerplate will invalidate statistics collected from the corpus and impair attempts to analyze the text by looking at KWiC concordances. Boilerplate stripping is a challenging task, since, unlike HTML and javascript, boilerplate is natural language text and it is not cued by special mark-up.” (Baroni et al., 2009, p. 215)

3.3.3.3 Qualitative impact

Beyond mere data cleaning, a corpus generally aims at being both an authentic and representative sample of language, as proponents of corpus linguistics such as Firth, Halliday or Sinclair “share the belief that each single act of communication shows the language system in operation” (Tognini-Bonelli, 2001).

This issue is all the more important in text linguistics, a case for which the web texts should not be truncated, because variations in the context can go as far as to invalidate interpretation:

“One fairly obvious feature of a text is that it is not the same all the way through. In barest outline it has a beginning, middle and end, but it is likely to have a much more elaborate structure than that, and each aspect of its internal structure leads to different phraseology, different vocabulary and different structures.” (Sinclair, 2008, p. 25)

Then, how should one proceed with web corpora? In fact, cleaning drastically impacts the final collection, so that authenticity can be questioned, which undermines the corpus reasoning process. A proper evaluation process of boilerplate removal, as shown below, and text quality, as shown in the next section, can be crucial.

Analyzing a corpus’ quality should take into account the potential corpus users, since there are different understandings of corpus quality, corresponding to diverging requirements among disciplines.

“There are diverse notions of ‘corpus quality’ (and, consequently, ‘noise’), which depend on the intended use of the corpus. In empirically oriented theoretical linguistics, carefully selected sampling procedures and non-destructive cleaning is important, while for many tasks in computational linguistics and language technology, aggressive cleaning is fundamental to achieve good results.” (Biemann et al., 2013, p. 23)

In order to exemplify corpus quality concerns, two examples are given below, with a linguist as potential corpus user in mind.

3.3.3.4 Practical examples from the point of view of a linguist

In the general-purpose approach of corpora from the web, insufficient corpus size is not a problem anymore, the corollary being that text quality becomes one.

“Crawled raw data for web corpus construction contains a lot of documents which are technically in the target language, but which fail as a text. Documents just containing tag clouds, lists of names or products, etc., need to be removed or at least marked as suspicious. Defining the criteria by which the decision to remove a document is made, however, is quite difficult. For instance, many documents contain a mix of good and bad segments and thus represent borderline cases. The decision to systematically remove documents is thus a design decision with major consequences for the composition of the corpus and with potential negative side effects on the distribution of linguistic features.” (Schäfer et al., 2013, p. 7)

Text selection is a consequence of extreme text and markup variety and also of potential preprocessing pitfalls. Thus, it integrates into a global processing toolchain which goes from the raw HTML document to the annotated, accessible corpus document. It is one of the steps which are performed at some point in each and every web corpus project, but whose impact is often underrated by end users and sometimes corpus designers themselves.

Example 1: tag clouds, lists, and/or search engine optimization techniques. The following paragraphs are taken from test data¹⁹ obtained after preprocessing.

A high quality, independent, contract caterer for business, industry, food production and distribution sites. We deliver a hassle-free service, with strict controls ensuring all legal requirements are exceeded. Connect Catering

food service, contract catering, canteen, school catering, independent caterers, industrial catering, university catering, catering contractors, staff canteen, staff restaurant, college catering, connect catering, connect catering ltd, food and service, on site catering, professional catering management, school dinners, schools catering, site catering, university food, works canteen

Example 2: classified ads. The following paragraphs are also taken from test data²⁰ obtained after preprocessing. The text lacks any coherence, because the car ads it is taken from were only partially extracted from a website whose layout is disorienting, to say the least.

renault espace 2.0 t privilege 5dr 2004 priviledge 5 door estate, grey, petrol, manual, rear wiper, immobiliser, solid paint, passenger airbag, trip computer, 1 previous owner(s), 2...

scroll over the thumbnails to enlarge model year: 2009 mileage: 31,500 miles transmission: manual engine size (in ccm): 1,995 power: 150 bhp fuel: diesel interested? call us 01209 821133 or 01209 821133 or email us call or visit a friendly member of our sales team who will be more than happy to help. dales motor group wheal rose scorrier redruth cornwall tr16 5bx

find out where we are

The classification of the examples above is not as clear as it seems, as discussions regarding these examples showed me (more details in the following section).
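One way to make such borderline cases at least measurable is to look at the share of function words, which tends to be low in keyword lists and higher in running text. The sketch below is only an illustration of that idea: the word list is a small hand-picked assumption, the function name is hypothetical, and this is not presented as the procedure of Schäfer et al. (2013). No decision threshold is proposed, precisely because the boundary is debatable.

```python
import re

# Hypothetical, deliberately tiny list of English function words.
FUNCTION_WORDS = frozenset({
    "a", "an", "the", "and", "or", "of", "to", "in", "on", "for",
    "with", "is", "are", "we", "who", "will", "be", "our", "us",
    "all", "than",
})

def function_word_ratio(text: str) -> float:
    """Share of alphabetic tokens that belong to FUNCTION_WORDS."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in FUNCTION_WORDS for token in tokens) / len(tokens)

# Two short strings in the style of Example 1: a sentence and a keyword list.
sentence = ("We deliver a hassle-free service, with strict controls "
            "ensuring all legal requirements are exceeded.")
tag_list = ("food service, contract catering, canteen, school catering, "
            "independent caterers, industrial catering")

print(f"sentence: {function_word_ratio(sentence):.2f}")
print(f"tag list: {function_word_ratio(tag_list):.2f}")
```

A score of this kind can separate a pure keyword list from a well-formed sentence, but a document mixing both, or a classified ad that ends with a full sentence, falls somewhere in between, so the decision to keep or discard remains a design choice.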

¹⁹ Extracted from test data analyzed in (Schäfer et al., 2013): http://www.cylex-review.co.uk/tags/hygiene,health/?sort=rating_desc

²⁰ Extracted from test data analyzed in (Schäfer et al., 2013): http://www.carocean.co.uk/for-sale-Renault+20-6.html

In document Construction de corpus généraux et spécialisés à partir du Web (Page 154-158)
