Changes in corpus design and construction

1.5 Intermediate conclusions

1.5.2 Changes in corpus design and construction

Web corpora offer an unprecedented opportunity to observe linguistic phenomena rarely seen in written corpora and to apprehend new text types, which raise new issues regarding linguistic standards and norms, contact, and diversity. Web texts may belong to new genres resulting from practices developed in the age of internet-based communication. Statistically speaking, they may also give broader access to text production by a large number of existing socio- and ethnolinguistic groups, including lesser-known ones.

A matter of perspective: reopening the case of word order in German

The word-order example, first mentioned in the introduction on p. 6, makes it clear that divergent conclusions on language can be drawn depending on the perspective a corpus gives. For instance, grammarians working on reference corpora that mostly include “traditional” written text genres, such as newspaper articles and novels, may rule out the rare cases of verb-second subordinates they encounter. They may then conclude that the subordinate clause with the verb at the end is the standard structure in German. This is even more likely if the sentences are hand-picked, because theoretical linguists may then tend to choose sentences that are “favorable” to the theory they are trying to prove.

In contrast, web corpora containing more casual genres or even speech transcriptions such as subtitles or spontaneous blog comments may lead to other conclusions. First, despite the meaning of “representative” in traditional corpus building, they may present a more representative image of how language is currently spoken by a broader spectrum of speakers. In that case, the relative abundance of verb-second subordinates implies that they are bound to be detected. Second, the use of more quantitative methods may also lead to the conclusion that, in spite of numerous subordinate clauses following the “classical” model, verbs in second position are clearly present in a majority of cases, because there are more principal clauses than subordinate clauses and because the verbs do not always come at the end in the latter case. That is why German is generally considered to be a V2 or a flexible language in language typology. For instance, it is classified by the World Atlas of Language Structures (Dryer & Haspelmath, 2013) as having “no dominant order”.
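The arithmetic behind this quantitative argument can be sketched in a few lines. The following is an illustration only: the function name `overall_v2_share` and the proportions passed to it are hypothetical and are not drawn from any corpus, and the sketch makes the simplifying assumption that every principal clause is verb-second.

```python
def overall_v2_share(main_clause_share: float, v2_share_in_subordinates: float) -> float:
    """Return the share of all clauses that have the verb in second position.

    Simplifying assumption (not from the source): every principal clause is V2.
    Both arguments are proportions between 0 and 1.
    """
    for value in (main_clause_share, v2_share_in_subordinates):
        if not 0.0 <= value <= 1.0:
            raise ValueError("proportions must lie between 0 and 1")
    subordinate_share = 1.0 - main_clause_share
    # V2 clauses = all principal clauses + the V2 fraction of subordinate clauses
    return main_clause_share + subordinate_share * v2_share_in_subordinates


if __name__ == "__main__":
    # Hypothetical proportions, chosen only to illustrate the reasoning.
    print(overall_v2_share(main_clause_share=0.6, v2_share_in_subordinates=0.1))
```

Under this assumption, verb-second clauses form a majority as soon as principal clauses make up more than half of all clauses, and any verb-second subordinates only add to that share.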

One may argue that the sheer number of occurrences of a given structure does not necessarily illustrate its importance. That is why theories in linguistic research are often a matter of perspective. Since corpora play a major role in empirical linguistics, their origin and their composition ought to be better known.

Summary: characteristics of “pre-web” corpus construction

• Typology fixed most of the time before construction: find texts for all classes and balance the whole

• Normalization: sometimes before, sometimes after. Considered to be important.

• Complete texts and/or extracts and/or derivates (n-grams, word frequencies)

• Persistent vs. temporary corpora

• Metadata: rich, sometimes manually edited or poor/not verified

• Classification according to text production parameters or according to tools output (machine learning, statistical criteria: clustering)

• Subcorpora: used for balancing or register comparison

Because they are often difficult to access and difficult to process, non-standard variants are usually not part of reference corpora.

Post-web changes

The prototypical “web as corpus” construction method is an extreme case where the corpus design steps mentioned by Atkins et al. (1992) (see p. 11) are reduced to the last two: data capture followed by corpus processing.

The shifts listed below are changes as compared to “pre-web” corpora:

• Whether the aim is a precise target and typology, as in specialized corpora and focused crawling, or exploratory corpus construction, as in general-purpose web corpora, it is necessary to consider texts, text types, and text genres beyond the previous extension of these notions and beyond known categories.

• The normalization of the documents is much more delicate: it can be done a posteriori as a separate step or left out altogether, since it may be considered error-prone.

• Usually, web corpus construction deals with the retrieval of complete texts, but due to the nature of HTML documents and the necessary post-processing (see the following chapter), the texts may consist of extracts that are artifacts of the text type and processing tools, and thus differ from the traditional sense of “extract”.

• Consequently, there are usually no subcorpora as such and general-purpose corpora are taken as a whole, while specialized corpora are divided into categories as long as available metadata allow for such an operation. However, web corpora also lead to an abundant production of derivates: n-grams, word frequencies, language models, training sets for diverse tools.
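As an illustration of the derivates mentioned in the last item, the following minimal sketch computes word frequencies and n-gram counts from a list of texts. It is not the procedure used for any particular corpus: the function names (`tokenize`, `ngrams`, `derivates`), the naive regular-expression tokenizer, and the sample documents are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive tokenizer (assumption): lowercase sequences of word characters."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def derivates(documents: list[str], n: int = 2) -> tuple[Counter, Counter]:
    """Compute word frequencies and n-gram counts over a collection of texts.

    N-grams are computed per document so that they never span a document boundary.
    """
    word_freq: Counter = Counter()
    ngram_freq: Counter = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        word_freq.update(tokens)
        ngram_freq.update(ngrams(tokens, n))
    return word_freq, ngram_freq


if __name__ == "__main__":
    # Hypothetical sample documents, for illustration only.
    sample = ["The web is a corpus", "The web is large"]
    words, bigrams = derivates(sample, n=2)
    print(words.most_common(3))
    print(bigrams.most_common(3))
```

Such derivates can be distributed without the full texts, which is one reason they are produced in abundance from web corpora.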

By contrast, the perspective stays the same concerning the following topics:

• There are persistent and temporary corpora. A given corpus may be extended regularly, and thus correspond to the notion of monitor corpus.

• The metadata may be rich for some specialized corpora, but poor otherwise.

The following facts about digital corpora are not necessarily true anymore regarding corpora from the Web:

• “All corpora are a compromise between what is desirable, that is, what the corpus designer has planned, and what is possible.” (Hunston, 2008, p. 156)

In fact, general-purpose web corpora are rather a compromise on the technical side. Since there is no general web cartography and often no design plan, it is not possible to assess the recall of a web corpus with respect to a population of web documents.

• “Any corpus, unless it is unusually specific in content, may be perceived as a collection of sub-corpora, each one of which is relatively homogeneous.” (Hunston, 2008, p. 154)

The homogeneity of these subparts is not guaranteed anymore, nor are corpora really perceived as such a collection. When this is the case, the characteristics shared by web texts are of a different nature than the criteria used to build subcorpora, and most of the time such common characteristics must be inferred from the data and cannot be determined a priori.

• The most important practical constraints are software limitations, copyright and ethical issues, and text availability. (Hunston, 2008, p. 157)

Software limitations can now be considered secondary, as well as text availability, since it is the very condition of text inclusion. Ethical issues and copyright are not really a constraint, but rather a factor.

In document Construction de corpus généraux et spécialisés à partir du Web (Page 82-84)