1.4 Web native corpora – Continuities and changes

1.4.2 From content oversight to suitable processing: Known problems

1.4.2.1 Challenging the text criterion

The very characteristics of web documents may be a problem: efficient tools are needed for the automatic removal of web-specific formatting (Hundt et al., 2007, p. 4), the notion of text is difficult to define, and there is what Renouf (2007) calls the “handling of web pages with their hotchpotch of more and less text-like texts”.

Compared to well-known and in a way “controlled” corpora, the “heterogeneity and arbitrariness of the text” (Renouf, 2007) contained in web corpora may be questioned, as it is opposed to the ideal notion of text described by Atkins et al. (1992) (see p. 11).

First, not all web texts are of discursive nature.

Second, a considerable amount is much shorter than books or newspaper articles.

Third, web texts are not necessarily integral, due to the very nature of web pages (and the interlinking and content injection sometimes called the Web 2.0), or due to the fragmentary nature of popular text genres such as comments, microtexts, or follow-ups. As a consequence, web texts are often not “conscious products of a unified authorial effort”, among other reasons because the possible authors are not aware that they are cooperating, and/or because there are no common writing guidelines, for instance on the homepages of blog communities, social networks, or online versions of newspapers, which aggregate content from different sources.

Last, despite the existence of well-documented guidelines and the enforcement of common editing rules, the texts on Wikipedia – one of the most frequently viewed websites in the world and home to a considerable amount of usable text – cannot be considered to be “stylistically homogeneous”.

Another typical phenomenon for the web as it has evolved is that text often comes second to images, audio files, videos or multimedia content, both in terms of web page design and actual time spent on audio/video platforms or social networks, which are among the most popular and among the largest existing websites.

1.4.2.2 To what extent do new text types and genres lead to a lack of metadata?

Description of the problem

Among the loosely defined group of internet users there are new or so far unobserved linguistic populations. There are also spontaneous and less spontaneous uses, different intentions, and a vast array of possible audiences and communicational goals, which in sum account for corpora of a different nature than the canonical text corpora. From microtext by the elderly to sponsored fashion blogs, including machine-translated and automatically generated content, this diversity calls into question the good practices from pre-web times.

Biber and Kurjian (2007) describe the problems in terms of language registers:

“With most standard corpora, register categories are readily identifiable and can therefore be used in linguistic studies. However, research based on the web lacks this essential background information. [...] The fundamental problem is that we have no reliable methods for identifying the kinds of texts included in a general web search. In fact, there is an even more basic underlying problem: we do not at present know what range of registers exists on the web.” (Biber & Kurjian, 2007, pp. 111–112)

In other words, this is what Bergh and Zanchetta (2008) called the “heterogeneous and somewhat intractable character of the Web” (p. 310). Consequently, there are texts on the Web for which a successful classification is yet to be achieved. This is neither a trivial task nor a secondary one, since there are corpus linguists who believe that “text category is the most important organizing principle of most modern corpora” (O’Keeffe & McCarthy, 2010, p. 241).

Renouf (2007) also claims that lack of metadata makes an exhaustive study impossible or at least undermines it. Potential register- or variety-based studies, which require a precise idea of production conditions and text genre, are a good example.

Corresponding to the potential lack of information concerning the metadata of the texts is a lack of information regarding the content, which has to be recorded and evaluated a posteriori.

“Automated methods of corpus construction allow for limited control over the contents that end up in the final corpus. The actual corpus composition needs therefore to be investigated through post hoc evaluation methods.” (Baroni et al., 2009, p. 217)

Thus, the problem is twofold: on the one hand it is a meta-information and categorization issue (Bergh & Zanchetta, 2008, p. 325), and on the other hand the actual contents of a web corpus can only be listed with certainty once the corpus is complete.

The issue of text typology

General text typology is summarized above (see p. 13). Among the criteria mentioned by Atkins et al. (1992, p. 17), the following four are particularly problematic.

First of all, the makeup of a text is supposed to be “self-evident” (“a single text by one author is single”). But authorship of unknown web texts is much more difficult to determine.

Second, the factuality criterion, which leaves “many problem areas” in traditional corpora (although literary texts in reference corpora for instance are not a problem), causes even more problems in web corpora, where the credibility of web pages and the value of statements in texts cannot be precisely assessed.

Third, the setting of a text, i.e. “in what social context does the text belong?”, may not be determined in an unequivocal manner due to evolving contexts and possibly unknown categories.

Fourth, the function of texts was “not easy” to assess to begin with according to Lemnitzer and Zinsmeister (2010), and it has become even harder, for instance because of new text types such as microtexts.

Last but not least, two criteria impact all previous ones, namely text status and language status. The first is supposed to take the values “original”, “reprint”, “updated”, “revised”, etc., which is not only difficult to retrace on the internet but also different in nature, for instance because of retweets, reblogs, or reposts. This category actually leads to a ubiquitous issue, the need for near-duplicate removal. Finally, the language status (“source” or “translation”) leads to complex text classification issues such as machine translation detection and the identification of the age and first language of text producers.
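The near-duplicate problem mentioned above can be illustrated with a minimal sketch. It assumes one common family of techniques, word n-gram “shingles” compared with the Jaccard coefficient; this is not necessarily the method used by any of the cited projects. The function names, the shingle length, and the threshold of 0.5 are hypothetical choices made for illustration only.

```python
from typing import Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int = 3) -> Set[Shingle]:
    """Return the set of word n-grams of a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Very short texts yield a single shingle, or none if the text is empty.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard coefficient: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(t1: str, t2: str, threshold: float = 0.5, n: int = 3) -> bool:
    """Flag two texts as near-duplicates (threshold is an illustrative assumption)."""
    return jaccard(shingles(t1, n), shingles(t2, n)) >= threshold


original = "the quick brown fox jumps over the lazy dog near the river bank"
repost = "RT " + original  # a repost that only adds a short prefix
unrelated = "completely different content about corpus linguistics"

print(is_near_duplicate(original, repost))     # the prefix barely changes the shingle set
print(is_near_duplicate(original, unrelated))  # no shingle is shared
```

A pairwise comparison of this kind does not scale to large crawls; it only shows why a repost with minimal changes is not an “original” in the sense of the text status criterion.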

The notion of genre is anything but unanimously defined, all the more since new or unknown genres are bound to emerge again and again in the context of web texts. The notions of authorship, mode, audience, aim and domain (Sharoff, 2004, 2006) are attempts to address this issue, as well as those of audience, authorship and artifact (Warschauer & Grimes, 2007).

The difficulty of defining new genres that remain stable over time is precisely the starting point of a current research project involving Douglas Biber and Mark Davies (A linguistic taxonomy of English web registers, 2012-2015).

Specific issues are also to be found, for instance in the case of computer-mediated communication (CMC), with a growing range of genres as well as a rapidly evolving media universe.

Internet-based communication possibly has to be tackled with ad-hoc tools and corpora, so that an ongoing project to build a reference corpus for CMC in German (Beißwenger, Ermakova, Geyken, Lemnitzer, & Storrer, 2013) involves a series of decisions and cannot be established from ready-made procedures.

Text categories as extrinsic and intrinsic composition criteria, a possible caveat of “traditional” corpora?

“The design of most corpora is based on such external criteria, that is, using situational determinants rather than linguistic characteristics, as the parameters of composition.” (Hunston, 2008, p. 156)

Additionally, several typologies can be articulated on text level (Habert et al., 1997), such as genres and registers on the one hand, and intuitive categories used by speakers that may evolve as well as invariant “situational parameters” on the other.

“The text categories sampled in the Brown corpus have often been referred to as ‘text types’ or ‘genres’. In the narrower, text linguistic sense, the use of this terminology is hardly justified. The categories are only a fairly rough-and-ready classification of texts. Research by Biber (1998) has shown, for instance, that sometimes more variation within traditional text categories (such as ‘newspapers’) exists than between different text categories.” (Hundt, 2008, p. 171)

Press text is seen as close to the norm. It is supposed to replicate developments in society with relatively close coverage, allowing a user to look for new words or new expressions as well as to witness the disappearance of others.

However, as Hundt (2008) put it, “the question is whether one year’s worth of The Guardian or The Times can be considered a single-register corpus or not” (p. 179). In fact, apart from the particular style of a newspaper, texts published online are increasingly a blend of locally produced texts and material from other sources.

Thus, the lack of control and/or precision regarding metadata is not necessarily typical for web corpora. The latter only make an already existing issue more salient.

“It is worth pointing out that the lack of metadata is not unique to web corpora. Consider, for instance, corpora containing mostly newspaper articles (like the German DeReKo), where authorship cannot always be attributed to specific individuals.” (Biemann et al., 2013, p. 47)

The lack of metadata is also a possible issue in the case of the BNC (see p. 27).

According to Kilgarriff and Grefenstette (2003), text typology is a case where there is still much to define, so that the Web as a potential resource even “forces the issue”⁴⁹. Thus, it could be considered a telltale sign.

All in all, corpus categories are wished for, but so far they have essentially been a finite, well-known series of different types. Texts on the internet reflect a manifold reality which is difficult to grasp. Some see it as a downside of web corpora that one has to cope with less meta-information and with more a posteriori evaluation of the content. However, one could also say that these two characteristics exemplify tendencies that were already present in traditional corpora, as well as issues that were not properly settled, including the very operative definition of genres or registers.

49 “‘Text type’ is an area in which our understanding is, as yet, very limited. Although further work is required irrespective of the Web, the use of the Web forces the issue.” (Kilgarriff & Grefenstette, 2003, p. 343)

1.4.2.3 Representativeness of web texts

Obviously, texts taken from the Web do not constitute a balanced corpus in a traditional sense, mostly because if nothing is done in order to establish such a balance, then textual material is not controlled, limited or balanced in any way (Bergh & Zanchetta, 2008, p. 325). The issue of representativeness as it is understood in corpus linguistics is discussed below. For a discussion of web representativeness, see p. 126.

Representativeness and typology

The issue of representativeness follows directly from the potential lack of meta-information described above. Be it on a corpus design level or on a statistical level, it is impossible to know to what extent the gathered texts are representative of the whole Web, first because not much is known about the texts, and secondly because the composition of the Web is completely unknown, and can evolve very quickly in the course of time or according to a change of parameters, such as language or text type.

“It is rather complicated to do [...] stratified sampling with web data, because (i) the relative sizes of the strata in the population are not known, and (ii) it would be required to start with a crawled data set from which the corpus strata are sampled, as web documents are not archived and pre-classified like many traditional sources of text. Web documents have to be discovered through the crawling process, and cannot be taken from the shelves.” (Biemann et al., 2013, p. 24)

Precisely because a general approach to web corpora involves a constant discovery of web pages and web documents, the result cannot be known in advance and thus cannot be balanced.
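The first obstacle named in the quotation above can be made concrete with a small sketch of proportional allocation, the usual starting point of stratified sampling. The function name and the example strata are hypothetical. The point is only that the computation requires the size of each stratum as input, which is exactly what is unknown for the Web.

```python
from typing import Dict


def proportional_allocation(strata_sizes: Dict[str, int], sample_size: int) -> Dict[str, int]:
    """Split a sample across strata in proportion to their (known) population sizes.

    Leftover units after rounding down are given to the strata with the
    largest fractional remainders, so that the parts sum to sample_size.
    """
    total = sum(strata_sizes.values())
    if total <= 0:
        raise ValueError("stratum sizes must be known and sum to a positive number")
    exact = {name: sample_size * size / total for name, size in strata_sizes.items()}
    allocation = {name: int(share) for name, share in exact.items()}
    leftover = sample_size - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name], reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical strata of a traditional, pre-classified text collection.
known_strata = {"news": 600, "fiction": 300, "academic": 100}
print(proportional_allocation(known_strata, 50))
```

For a traditional archive, the stratum sizes can be read from a catalogue. For a web crawl, no such catalogue exists, so the first argument cannot be filled in before the documents have been discovered and classified.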

Nonetheless, this does not mean that the discussion about representativeness and balance does not apply to web corpora. It may indicate, however, that this discussion will not yield satisfying results.

Redefining the issue

The representativeness issue may be typical of the corpus linguistics community, but it does not seem to be very intelligible, particularly to web scientists:

“There are arguments about how representative the Web is as a corpus, but the notion of what a corpus should represent – should it include speech, writing, background language such as mumbling or talking in one’s sleep, or errors for example? – is hard to pin down with any precision.” (Berners-Lee et al., 2006, p. 50)

It also seems to show how corpora from the web exemplify existing issues and debates in the field of linguistics, which may have been ill-defined and as such need to be discussed:

“The Web is not representative of anything else. But neither are other corpora, in any well-understood sense. Picking away at the question merely exposes how primitive our understanding of the topic is and leads inexorably to larger and altogether more interesting questions about the nature of language, and how it might be modeled.” (Kilgarriff & Grefenstette, 2003, p. 343)

One may also say that it is cumbersome to pay too much attention to a notion that is not precise or adapted enough in a web context. In this regard, the idea that corpus size may solve conceptual problems can also be found among linguists, in this case lexicographers:

“In a billion-word corpus, the occasional oddball text will not compromise the overall picture, so we now simply aim to ensure that the major text-types are all well represented in our corpus. The arguments about ‘representativeness’, in other words, have lost some of their force in the brave new world of mega-corpora.” (Rundell, 2008, p. 26)

A potential way to cope with representativeness would then be to aim for global composition requirements loose enough not to get in the way of corpus building. A “weak” understanding of representativeness seems indeed to pave the way to compromise.

Possible solutions

On the corpus linguistics front, Leech (2006) developed a reception-based estimation of representativeness. However, other researchers such as Atkins et al. (1992) are in favor of a balanced ratio of production and reception:

“The corpus builder has to remain aware of the reception and production aspects, and though texts which have a wide reception are by definition easier to come by, if the corpus is to be a true reflection of native speaker usage, then every effort must be made to include as much production material as possible.” (Atkins et al., 1992, p. 7)

Web corpora make such a balance possible, whether because of or in spite of web page interlinking biases as well as website audience statistics. However, the tradition is not to worry about sampling, as long as there is a certain degree of variation (Schäfer & Bildhauer, 2013, p. 31).

“[Results suggest that] Web corpora built by a single researcher literally in minutes are, in terms of variety of genres, topics and lexicon represented, closer to traditional ‘balanced’ corpora such as the BNC than to mono-source corpora, such as newswire-based corpora.” (Baroni & Ueyama, 2006, p. 32)

Maybe because of biases in the way texts are collected, and prominent pages being favored in the process (for a comparison of sources see chapter 4), the results can be considered to be acceptable, especially with respect to traditional reference corpora.

1.4.2.4 Suitable toolchains and processing speed: practical issues

A user-friendliness problem

First of all, one may say that corpora aiming for web scale have a problem with “user-friendliness (as the Web was not originally designed for linguistic research)” (Bergh & Zanchetta, 2008, p. 325). It affects both the corpus builders and the users.

In fact, the web as a corpus framework was fashionable in 2004 with the launch of the WAC-workshop. Then a few major contributors left, and it began to get more and more complicated to gather corpora as the web kept expanding and diversifying, e.g. with the Web 2.0 and social media.

Moreover, the weaknesses of generic formats also account for some difficulties. Text encoding schemes, such as XML TEI, are not easy to apply to web texts, since they are primarily conceived for printed texts. Standards in general are not easy to adapt to the new reality, for instance dating systems. There are several ways to date a text. On a text-by-text basis, one may use the time of last modification of the file on the web server, the time of writing as advertised in the content or in the metadata, or the creation time of the page. On a corpus basis, one may use for instance the time of retrieval by the corpus builders or the first release of the corpus.
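The competing notions of date listed above can be kept apart in a small record type, sketched here under stated assumptions. The class name, the field names, and above all the preference order in `best_available` are hypothetical; they are not taken from XML TEI or from any existing corpus format.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass
class TextDates:
    """Candidate dates for one web text; any of them may be unknown (None)."""
    # Text-level dates
    last_modified: Optional[date] = None      # as reported by the web server
    stated_in_content: Optional[date] = None  # as advertised in the content or metadata
    page_created: Optional[date] = None       # creation time of the page
    # Corpus-level dates
    retrieved: Optional[date] = None          # time of retrieval by the corpus builders
    corpus_release: Optional[date] = None     # first release of the corpus

    def best_available(self) -> Optional[Tuple[str, date]]:
        """Return the first known date and its source, in an assumed preference order."""
        order = ("stated_in_content", "page_created", "last_modified",
                 "retrieved", "corpus_release")
        for field_name in order:
            value = getattr(self, field_name)
            if value is not None:
                return field_name, value
        return None


# A page for which only the server timestamp and the crawl date are known.
dates = TextDates(last_modified=date(2013, 5, 1), retrieved=date(2013, 6, 1))
print(dates.best_available())
```

Returning the source of the date together with its value keeps the provenance explicit, so that a later study can decide whether a server timestamp is precise enough for its purposes.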

Similarly to the standards issue, scalable retrieval and querying infrastructure may impede adoption of web resources by linguists used to well-documented corpora and tools as well as to stabilized architectures. Most corpus construction projects are works in progress, so that software and content updates can be frequent, and results as well as display are usually constantly improved, which can affect the global user experience.

On the other side, there are also early adopters of search engines such as Google who are used to the apparent simplicity of the interface and who may be confused by the profusion of details of query syntax, subcorpora, or simply information overflow. This may explain why there are linguists who still try their luck on search engines directly (see the remarks on Googleology p. 43).

Too high a cost?

Processing speed should not be as much of a problem as it was in the 2000s for web corpora or even in the 1980s for digital corpora. However, in a context of expanding web size and decreasing expenditures on public research, the situation is not favorable to researchers.

Tanguy (2013) states that it might be too costly for a research institution to address the web as a whole⁵⁰, because the material costs for running a crawler and extracting text are much too high for academic budgets.

In recent articles on web corpus construction, no one claims to truly harvest data on a web scale, in the sense that research teams would compete with commercial search engines in terms of computational power or pages seen and actively maintained. The adjective “web-scale” typically refers to large corpora, meaning that they could not have been gathered from sources other than the web; it does not mean that the corpora are truly on the scale of the web, or even within one or two orders of magnitude of it.

However, the claim by Tanguy (2013) should be kept in perspective. With the profusion of open-source software, most tools are already available and need not be specially crafted for a particular case. Thus, it is not necessary to invest much time and energy in document retrieval, but “merely” computing power, which becomes cheaper over the course of time.

That said, corpus processing is tricky, as shown in chapter 3. All in all, I prefer to see Tanguy’s claim as a call for light-weight, more efficient approaches, for example by restricting

50 “The creation of a search engine, or rather of a crawler capable of traversing the Web in order to extract its textual content, is a very long-term undertaking, and the material cost of running it is colossal, far beyond the reach of academic budgets.” (Tanguy, 2013, p. 14)