3.4 Conclusive remarks and open questions
3.4.1 Remarks on current research practice
The apparent simplicity of data collection Behind the apparent simplicity of data collection on the Web, there are many mechanisms and biases which can influence the course of a crawl, since a crawl generally means the exploration and retrieval of a tiny sample of the WWW.
“In a sense, data collection is the simplest step in Web corpus construction [...]
However, the pages which are crawled are a sample from a population (the Web documents), and the sampling procedure pre-determines to a large extent the nature of the final corpus." (Schäfer & Bildhauer, 2013, p. 35)
Crawler traps and other deception mechanisms targeted at machines are further potential pitfalls, which make the way a crawler learns and finds its way through the Web differ considerably from a general surfer’s experience.
Moreover, the issue of corpus representativeness, which for part of the corpus tradition is a desideratum, changes dramatically in its nature with respect to web representativeness.
The latter is even harder to define and to analyze than the former; it remains an open challenge, with projects currently being funded on web corpus sampling and classification.21
Big data opportunism In a way, the enthusiasm about huge datasets and the belief that probabilistic approaches can mitigate potential design flaws and inconstant text quality can become an obstacle to corpus approaches on the linguistic side.
There seems to be a “quick and dirty" work tradition concerning web corpora, coming especially from language modeling and machine translation specialists. N-gram-based approaches are no panacea, since proper cleaning and segmentation remain crucial.
This trend is focused on applications and benchmarks such as translation evaluation frameworks. The ability to gather as much parallel text as possible, run it through existing toolchains, and finally score it through evaluation frameworks supersedes a proper quality evaluation phase.
There is a widely-shared belief among computational linguists working on web corpora that researchers could take advantage of existing linguistic processing and annotation infrastructure.
Web data and web corpus scientists are needed Due to commercial interests, the exact process of web crawling is not well documented: there is no precise manual one could take inspiration from, and before 2009-2010 there did not seem to be any synthetic overview of what web crawling is and how it is done. In this context, notions of Web science can be very helpful in order to understand what happens during a crawl, so that the final result is not left to chance.
All in all, there is a real need for skilled data scientists, combining the skills of computer scientists and librarians (The Royal Society Science Policy Center, 2012, p. 64).
21 http://gepris.dfg.de/gepris/projekt/261902821
“Given the breadth of the Web and its inherently multi-user (social) nature, its science is necessarily interdisciplinary, involving at least mathematics, CS [computer science], artificial intelligence, sociology, psychology, biology, and economics."
(Hendler et al., 2008, p. 64)
Web for corpus and Web 2.0: a post-BootCaT world? Web corpora are prone to diverse biases and based on a constantly changing resource. These changes are combined with an evolving web document structure and a slow but irresistible shift from “web as corpus" to “web for corpus", due to the increasing number of web pages and the necessity to extract what is after all a tiny subset of what the web as we know it is supposed to be.
All these changes are part of what I call the post-BootCaT world in web corpus construction (Barbaresi, 2013a).22
There are not only fast URL changes and ubiquitous redirections. Content injected from multiple sources is a growing issue for recent general-purpose web corpora, one which can be linked to the Web 2.0 paradigm (see p. 116): the probability of running into lower text quality, for instance because of machine-generated content or mixed-language documents as described above, may increase with the number of different sources. Additionally, external sources changing at a fast pace make filtering or blacklisting of domain names more difficult to implement.
As the complexity of a document with respect to its source(s) rises, so does the difficulty of establishing a quotable source for linguists who depend on a more traditional reviewing process of linguistic proof and who need reliable, clearly established sources. It also makes the decision to exclude potentially noisy sources more delicate.
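The domain blacklisting mentioned above can be illustrated with a minimal sketch. It is not the procedure used for any corpus cited here: the function names `is_blacklisted` and `filter_urls` and the example domains are hypothetical, and the matching rule (a host is rejected if it or any of its parent domains is listed) is an assumption made for illustration.

```python
from urllib.parse import urlparse


def is_blacklisted(url, blacklist):
    """Return True if the URL's host, or any parent domain of it, is listed.

    `blacklist` is a set of lowercase domain names (hypothetical data).
    """
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # Test "a.b.c", then "b.c", then "c", so that subdomains of a listed
    # domain are rejected without matching unrelated hosts by substring.
    return any(".".join(labels[i:]) in blacklist for i in range(len(labels)))


def filter_urls(urls, blacklist):
    """Keep only the URLs whose host is not covered by the blacklist."""
    return [url for url in urls if not is_blacklisted(url, blacklist)]


# Illustrative, made-up domains.
blacklist = {"spam.example"}
candidates = [
    "http://ads.spam.example/page",
    "http://notspam.example/article",
    "http://news.example.org/2013/report",
]
kept = filter_urls(candidates, blacklist)
```

Such a static list only covers sources that are already known, which is why fast-changing external sources make this kind of filtering hard to maintain.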
3.4.2 Existing problems and solutions
Recent advances Recent advances in general-purpose corpora, for instance in the work of Suchomel and Pomikálek (2012) or Schäfer and Bildhauer (2012), include resource-efficient processing tools, steps towards encoding of available metadata, and overall cleaner corpora through better selection and verification procedures. The crawling infrastructure, corpus processing tools, and corpus search engines make it possible to create and master web corpora on a scale of 10 billion tokens or more, for languages with a large, worldwide speaker community such as French or Spanish, but also for other cases such as Swedish or Czech (Jakubíček et al., 2013).
On the side of specialized corpora, work has been done to help with the normalization and annotation of text types such as Internet-based communication (Beißwenger et al., 2013), for which there are neither annotation schemes nor processing practices. Still, targeting small communities or particular text types, extracting the texts, and processing the documents remain a challenge.
22 Note that the proponents of the BootCaT method seem to acknowledge this evolution, see for example Marco Baroni’s talk at the BootCaTters of the world unite (BOTWU) workshop (2013): “My love affair with the Web... and why it’s over!"
Existing problems Search engines have not been taken as a source merely because they were convenient: they actually yield good results in terms of linguistic quality. The main advantage was to outsource operations such as web crawling and website quality filtering, which are considered to be too costly or too complicated to deal with while the main purpose is actually to build a corpus.
Nonetheless, the quality of the links may not live up to expectations. First, purely URL-based approaches favor speed, sacrificing precision. Language identification tasks are a good example of this phenomenon (Baykan et al., 2008). Second, machine-translated content is a major issue, as is text quality in general, especially when it comes to web texts (Arase & Zhou, 2013). Third, mixed-language documents slow down text gathering processes (King & Abney, 2013). Fourth, link diversity is also a problem, which in my opinion has not gotten the attention it deserves. Last, the resource is constantly moving. Regular exploration and re-analysis could be the way to go to ensure the durability of the resource.
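The trade-off between speed and precision in purely URL-based approaches can be made concrete with a small sketch. It is not the method of Baykan et al. (2008): the hint tables `TLD_HINTS` and `LANGUAGE_CODES` and the function `guess_language_from_url` are hypothetical and only cover a handful of cases.

```python
from urllib.parse import urlparse

# Hypothetical hint tables, deliberately tiny.
TLD_HINTS = {"de": "de", "fr": "fr", "es": "es", "se": "sv", "cz": "cs"}
LANGUAGE_CODES = {"de", "fr", "es", "sv", "cs", "en"}


def guess_language_from_url(url):
    """Guess a language code from the URL alone, or return None.

    No page is downloaded, which makes the guess fast but unreliable.
    """
    parsed = urlparse(url)
    labels = (parsed.hostname or "").lower().split(".")
    # 1. A path segment such as /fr/ or /en/ is taken as the strongest hint.
    for segment in parsed.path.lower().split("/"):
        if segment in LANGUAGE_CODES:
            return segment
    # 2. A language subdomain such as fr.example.org.
    if len(labels) > 2 and labels[0] in LANGUAGE_CODES:
        return labels[0]
    # 3. Fall back to the country-code top-level domain, if it is known.
    return TLD_HINTS.get(labels[-1])


for url in (
    "http://example.de/artikel",
    "http://example.com/fr/page.html",
    "http://example.com/page",
):
    print(url, guess_language_from_url(url))
```

The last URL yields no guess at all, and any page whose language does not match its domain or path is silently misclassified, so the content still has to be fetched and analyzed if precision matters.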
The inefficiency of crawling The crawling process in itself cannot be completely efficient: since even content prediction using URLs cannot be expected to be accurate, the links have to be visited in order to retrieve the content of a page, if only to discover that there is little or no text content. Moreover, up to 94% of the web documents which have been downloaded and stored are eliminated during preprocessing (Schäfer & Bildhauer, 2013).
Crawling and preprocessing are thus resource-intensive, and the prospect of saving much computation time is highly relevant, in particular when processing power is scarce.
The question is therefore whether crawling (in)efficiency is unalterable or whether it can be improved.
As it is not possible to start a web crawl from scratch, the question concerns both the sources and the course of the crawl, and can be roughly formulated as such: where may one find web pages which are bound to be interesting for corpus linguists and which in turn contain many links to other interesting web pages?
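To give an idea of the cost implied by the elimination rate quoted above, the following back-of-the-envelope sketch computes how many documents have to be downloaded for a given number to survive preprocessing. The function name `downloads_needed` and the target of one million documents are assumptions made for illustration; only the 94% figure comes from the text, where it is reported as an upper bound.

```python
import math
from fractions import Fraction


def downloads_needed(target_docs, elimination_rate):
    """Downloads required so that `target_docs` documents are retained.

    `elimination_rate` is the share of stored documents discarded during
    preprocessing, with 0 <= rate < 1. Exact fractions are used so that
    the rounding up is not disturbed by floating-point error.
    """
    rate = Fraction(str(elimination_rate))
    if not 0 <= rate < 1:
        raise ValueError("elimination_rate must be in [0, 1)")
    retained_fraction = 1 - rate
    return math.ceil(target_docs / retained_fraction)


# Worst case reported above: up to 94% of the stored documents are discarded.
needed = downloads_needed(1_000_000, 0.94)
print(f"{needed:,} downloads for 1,000,000 retained documents")
```

Under this worst-case assumption, each retained document costs roughly seventeen downloads, which is why better sources and a better course of the crawl can save a substantial share of the computation time.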
Preprocessing I used the word preprocessing, and not post-processing as sometimes found in the literature, in order to highlight the fact that it is this operation that makes corpora exploitable and available in a linguistic sense. It is more than just a cleaning operation, as it involves selecting the texts and balancing a corpus between opportunistic inclusion and strict selectivity, thus affecting its general profile and quality.
One may say that preprocessing suffers from a lack of attention, since there is no external evaluation procedure as to how useful given web texts are for a corpus, and the last evaluation campaign regarding boilerplate removal dates back to 2008 and leaves much to be desired.
3.4.3 (Open) questions
Answerable questions The following questions are addressed, at least partially, in what follows:
• How can the linguistic relevance of web data be assessed?
Under what conditions is a text worthy to become part of a corpus?
• Is it possible for public research infrastructures to gather large quantities of text from the Web?
In fact, Tanguy (2013) states that publicly funded research centers cannot compete with commercial search engines, mostly because of infrastructure costs. The answer to that problem may be to look for more efficient ways to build web corpora. In the following, solutions are introduced (see chapter 4).
– Are there ways to cope with the closing of search engine APIs when looking for URL crawling seeds?
– What are possible ways towards more efficient crawls?
– How can web corpus construction and its issues be brought closer to linguists and not left to “crawl engineers"?
• Operationalize document classification: towards the reproducibility of decisions
Open questions The following questions arise from the state of the art presented in the general introduction as well as in this part. They are known to be of interest, but fall beyond the scope of this work.
• How many text genres can be identified on the web?
What are the best machine learning techniques to deal with these genres?
• What is the best way to find promising and evenly distributed URL crawling seeds?
What is the best crawling strategy?
• What is the most adequate solution to the debate about corpus balance?
• What is the best boilerplate removal method? How can it be evaluated on a wide range of criteria and texts?