3.2.5 Finding a vaguely defined needle in a haystack: targeting small, partly unknown fractions of the web
3.2.5.1 Less-resourced languages
Main challenges Concerning less-resourced languages, many methodological issues remain, leading to different notions of web corpora and different expectations towards the experimental reality they offer.
A major issue is precisely the lack of interest and project financing when dealing with certain low-resource languages, which makes it necessary to use light-weight approaches where costs are lowered as much as possible:
“The lack of funding opportunities or commercial interest in this work has led to an approach based on certain principles that offer maximal ‘bang for the buck’: monolingual and parallel corpora harvested by web-crawling, language-independent tools when possible, an open-source development model that leverages volunteer labor by language enthusiasts, and unsupervised machine learning algorithms.”
(Scannell, 2007, p. 5)
Another issue resides in the potential sources, which have to be found and properly evaluated:
“The first major challenge facing any corpus builder is the identification of suitable sources of corpus data. Design criteria for large corpora are of little use if no repositories of electronic text can be found with which to economically construct a corpus.” (Baker et al., 2004, p. 510f)
No consensus in research practice There is no consensus to be found among the existing techniques. For instance, URL-based classification is unreliable enough that proper language identification of the content itself remains necessary: especially for lesser-known languages, it is difficult to find working patterns like those used by Baykan et al. (2008), who try to classify web pages by main language solely by examining their URLs.
If it is not possible to determine the nature of the content without seeing it, the way web documents are classified on the crawling side is not clear either.
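The limits of URL-only classification can be illustrated with a minimal sketch. It is not the feature set of Baykan et al. (2008): the hint tables `TLD_HINTS` and `TOKEN_HINTS` and the function `guess_language_from_url` are hypothetical names introduced here, and the example URLs are invented. The point is only that a language without a dedicated top-level domain or a conventional URL token yields no decision at all.

```python
from urllib.parse import urlparse
import re

# Hypothetical hint tables, for illustration only.
TLD_HINTS = {"de": "German", "fr": "French"}
TOKEN_HINTS = {"de": "German", "fr": "French", "es": "Spanish"}


def guess_language_from_url(url):
    """Guess a page's main language from its URL alone, or return None."""
    parsed = urlparse(url)
    labels = (parsed.hostname or "").lower().split(".")
    # 1. Country-code top-level domain, e.g. example.de
    if labels[-1] in TLD_HINTS:
        return TLD_HINTS[labels[-1]]
    # 2. Subdomain labels and path segments, e.g. fr.example.org or /fr/
    path_tokens = [t for t in re.split(r"[/_\-.]", parsed.path.lower()) if t]
    for token in labels[:-1] + path_tokens:
        if token in TOKEN_HINTS:
            return TOKEN_HINTS[token]
    # 3. No usable pattern: the content itself has to be inspected.
    return None
```

For a URL such as `https://example.org/nuacht/alt.html` the function returns `None`, which corresponds to the situation described above: the document has to be downloaded before its language can be determined.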
Baker et al. (2004) state that it was faster for them not to use any automatic crawling and to turn to hand-picked content:
“We found it was faster for a human to visit the site, sort the text from the adverts, identify the useful material and save it.” (Baker et al., 2004, p. 511)
However, the technical state of the art is more developed now than it was in 2004. In recent years, at least two major projects have relied exclusively on crawling techniques rather than on hand-picked content. The Crúbadán⁹ project was originally devoted to the study of Celtic languages, but was later adapted to target several hundred minority languages. The project researchers chose to focus on one language at a time because crawling the whole web was considered a waste of time and resources:
“The crawler focuses on one language at a time. A reasonable alternative would have been to crawl the web very broadly and categorize each downloaded document using the language recognizer, but this is clearly inefficient if one cares primarily about finding texts in languages that do not have a large presence on the web.” (Scannell, 2007, p. 10)
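The trade-off described in this quotation can be made concrete with a small sketch. It is one possible reading of a single-language focus, not the Crúbadán implementation: the in-memory `TOY_WEB`, the word list `IRISH_WORDS`, the threshold, and both function names are assumptions made for illustration, and the texts are invented toy strings. A focused crawl only follows links found on pages recognized as being in the target language, whereas a broad crawl follows every link and classifies afterwards.

```python
from collections import deque

# Hypothetical toy web: URL -> (text, outgoing links). Not real data.
TOY_WEB = {
    "seed": ("agus tá an lá go maith", ["ga1", "en1"]),
    "ga1": ("tá an scéal seo agus an ceann eile", ["ga2"]),
    "ga2": ("is é an rud atá ann", []),
    "en1": ("the news of the day and the week", ["en2", "en3"]),
    "en2": ("this is the end of the story", ["en3"]),
    "en3": ("and that is all of it", []),
}

# Hypothetical function-word list standing in for a real language recognizer.
IRISH_WORDS = {"agus", "tá", "an", "go", "is", "é", "atá", "seo"}


def looks_like_target(text, wordlist=IRISH_WORDS, threshold=0.4):
    """Return True if enough tokens belong to the target-language word list."""
    tokens = text.lower().split()
    if not tokens:
        return False
    hits = sum(token in wordlist for token in tokens)
    return hits / len(tokens) >= threshold


def crawl(seed, focused):
    """Crawl TOY_WEB from seed; return (pages downloaded, target-language pages)."""
    queue, seen = deque([seed]), {seed}
    downloaded, kept = 0, []
    while queue:
        url = queue.popleft()
        text, links = TOY_WEB[url]
        downloaded += 1
        in_target = looks_like_target(text)
        if in_target:
            kept.append(url)
        # Focused mode: do not follow links found on off-target pages.
        if focused and not in_target:
            continue
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return downloaded, kept
```

On this toy graph both modes retrieve the same three target-language pages, but the focused crawl downloads four documents where the broad crawl downloads six. On the real web the gap is expected to be far larger for languages with a small web presence, at the risk of missing target pages that are only linked from off-target ones.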
⁹ Literally “crawler” in Irish, but with the additional (appropriate in this context) connotation of unwanted or clumsy “pawing”, from the root crúb (“paw”). http://borel.slu.edu/crubadan/
On the other hand, the Leipzig Corpora Collection (Goldhahn, Eckart, & Quasthoff, 2012), which started as the FindLinks project (Heyer & Quasthoff, 2004), is an example of a global approach. Little is known about the crawling methods used, other than that they are breadth-first and start from a directory of news websites as a source for many less-spoken languages, an approach which seems to be successful.
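Since the actual crawling methods are not documented in detail, the following is only a generic sketch of breadth-first traversal from a list of seeds, not a description of the Leipzig setup. The link graph `LINKS`, the seed names, and the `max_depth` parameter are hypothetical. The sketch shows the defining property of the strategy: all seed sites are visited first, then all pages one link away, and so on.

```python
from collections import deque

# Hypothetical link graph; the seeds stand in for a directory of news sites.
LINKS = {
    "news-a": ["a/1", "a/2"],
    "news-b": ["b/1"],
    "a/1": ["a/1/x"],
    "a/2": [],
    "b/1": ["a/1"],
    "a/1/x": [],
}


def breadth_first(seeds, max_depth):
    """Return (url, depth) pairs in the order a breadth-first crawl visits them."""
    queue = deque((seed, 0) for seed in seeds)
    seen = set(seeds)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth == max_depth:
            continue  # do not expand pages at the depth limit
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

Calling `breadth_first(["news-a", "news-b"], max_depth=1)` visits both seeds before any of the pages they link to, which is why the quality and coverage of the seed directory matter so much for this kind of approach.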
Then again, Scannell (2007) states that crawling without expert knowledge is “doomed to failure”, which shows that there is no consensus on this point either.
Problems and benefits of document nature Mixed-language documents are unevenly distributed on the web. As a matter of fact, they have been shown to be more of a problem for minority languages, making it even harder to gather web texts in these cases:
“The majority of web pages that contain text in a minority language also contain text in other languages.” (King & Abney, 2013, p. 1110)
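What such a mixed page looks like to a processing chain can be sketched with a simple token-level lookup. This is a deliberately naive stand-in, not the method of King & Abney (2013): the word lists in `WORDLISTS`, the `min_share` threshold, and both function names are assumptions made for illustration, and the example strings are invented.

```python
from collections import Counter

# Hypothetical mini word lists; real systems would use trained models.
WORDLISTS = {
    "ga": {"agus", "tá", "an", "seo", "atá"},
    "en": {"the", "and", "of", "this", "click", "here"},
}


def language_shares(text):
    """Return the share of tokens attributed to each language or to 'unknown'."""
    tokens = text.lower().split()
    counts = Counter()
    for token in tokens:
        label = next(
            (lang for lang, words in WORDLISTS.items() if token in words),
            "unknown",
        )
        counts[label] += 1
    total = len(tokens)
    return {label: n / total for label, n in counts.items()} if total else {}


def is_mixed(text, min_share=0.2):
    """Flag a text in which at least two known languages exceed min_share."""
    shares = language_shares(text)
    known = [
        lang for lang, share in shares.items()
        if lang != "unknown" and share >= min_share
    ]
    return len(known) > 1
```

A page whose body is in the target language but whose navigation is in English, for example `"tá an scéal seo click here the menu"`, is flagged as mixed, while `"tá an scéal seo agus"` is not. A document-level language label would hide this distinction.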
As for the reasons for using several languages in a single web document, code-switching alone does not seem to be a convincing explanation, which means that the study of mixed-language documents for linguistic purposes is probably of limited relevance:
“Though code-switching has been well-studied linguistically, it is only one possible reason to explain why a document contains multiple languages” (King & Abney, 2013, p. 1111)
Thus, it can be said that the texts are different in nature, which may have benefits in some cases, for instance concerning text quality and more precisely machine-generated text:
“One benefit of working with under-resourced languages is that they are only rarely the target of ‘WAC spam’ – documents not written by humans who speak the target language but instead generated automatically by a computer one way or another.” (Scannell, 2007, p. 13)
Conclusion: more downsides than advantages To conclude, one may say that the study of less-resourced languages on the Web overall bears more downsides than advantages. Indeed, facilitating factors, such as the lack of spam and the small community of users, are outweighed by factors of complexity, such as the difficulty of the crawling process, the extrinsic and intrinsic variation of documents, and last but not least the difficulty of funding research projects on this topic.
3.2.5.2 Computer-mediated communication and related specific text genres
So far, there are few projects dealing with computer-mediated communication (CMC), mostly due to the recent expansion of social networks and the fast pace at which those resources are changing. Most scientific studies are focused on information extraction, such as sentiment extraction, e.g. whether tweets on a particular topic are rather positive or negative, or comparisons based on geographic information, e.g. by using the metadata of short messages.
In the case of German, the DeRiK project features ongoing work with the purpose of building a reference corpus dedicated to CMC (Beißwenger et al., 2013), which could be used in order to study German and its varieties as they are spoken on the Web. More specifically, this kind of corpus can be used to find relevant examples for lexicography and dictionary building projects, and/or to test linguistic annotation chains for robustness.
The problems to be solved in order to be able to build reliable CMC corpora are closely related to the ones encountered when dealing with general web corpora as described above.
Specific issues are threefold. First, what is relevant content and where is it to be found? Second, how can information extraction issues be tackled? Last, is it possible to get a reasonable image of the result in terms of text quality and diversity?
3.2.5.3 Summary of issues to address
Targeting small, partly unknown fractions of the web enables researchers to target different speaker communities and various text types. It also involves dealing with potentially extreme content variability, and consequently requires adaptation of the crawling strategies.
Because the objects of study are still new, such as in the case of social networks, or difficult to fund, as in the case of less-resourced languages, there is no consensus in research practice. There is also a real potential in terms of discoveries, provided the researchers manage to maximize the efficiency of the crawls and to provide the most “bang for the buck” (Scannell, 2007).
Rich metadata, considered to be a desideratum for web corpora by Tanguy (2013), are particularly relevant in this case, be it for language documentation purposes or to do justice to the variability of the texts and microtexts which have been gathered.
The issue of less-resourced languages is addressed in section 4.2.1.2, while microtexts, social networks, and computer-mediated communication are studied in section 4.2.1.3 concerning a cross-language comparison.