4. Data and methods
4.2 Corpora
The aim of this section is to describe the corpora used in the subsequent studies. For each corpus, it covers the composition (i.e. which types of texts it includes), the tagging, and various other technical details. The first sub-section introduces the British National Corpus, which features prominently in chapter 5. The second sub-section introduces the Corpus of Historical American English, which was used for the study presented in chapter 6. Links to the official websites of these corpora are provided under the “corpora” sub-section of the references.
4.2.1 The British National Corpus (BNC)
The British National Corpus is a 100-million-word corpus that was compiled from 1991 to 1994. It mostly contains data from 1975 onwards, although it also includes several texts from the 1960s. It was designed as a representative corpus of British English, which means that it includes a broad variety of texts. The main distinction is between its written and spoken components, which amount to 90% and 10% of the corpus, respectively. The written component mostly comprises books and periodicals. These consist of imaginative texts (e.g. novels), but also of informative texts on various topics (e.g. social sciences, arts, natural sciences, politics). The key point is that many different domains are represented. The spoken component is sub-divided into two main categories, the “demographically sampled” part (4 million words) and the “context-governed” part (6 million words). The former consists of everyday conversations recorded by randomly selected speakers of British English in the United Kingdom, and the latter consists of more formal speech. Examples of the latter include lectures, business meetings, political speeches, sermons, parliamentary proceedings, legal proceedings, and sports commentaries. Again, a broad range of contexts is covered.
The original BNC was tagged using an automatic tagger called CLAWS4 (Garside 1987) and an additional tagging tool called Template tagger (Fligelstone et al. 1997). The “BNC Basic Tagset” contains 57 tags and 4 additional punctuation tags. Note that the tagset itself is called “C5”, which stands for CLAWS5. For example, the tag <NN1> marks singular nouns, the tag <NN2> marks plural nouns, and the tag <NP0> marks proper nouns. Automatic taggers make mistakes, which is why it is necessary to estimate the error rate on the basis of a manually analysed sample. In the case of the BNC, the tagger has a precision of 96.25%, which means that it attributes a wrong tag only 3.75% of the time. In addition, the tagger may use ambiguity tags, which further increases the accuracy of the corpus. Indeed, when the tagger is unsure which tag to attribute, it can offer two tags, such as <VVG-AJ0>, which denotes ambiguity between a gerund and an adjective.
Another important aspect of the BNC is that different versions have been released. The subsequent studies use the first version, which was released in 1994. A second version called “BNC World” was released in 2001. It was fairly similar to the first and mostly improved upon the tagging of the corpus, using an enhanced version of CLAWS4. The BNC reference guide (Burnard 2007: section 6.1) notes however that “in most respects the word class information provided by the corpus now is identical to that provided with the first release of the BNC in 1994.” The choice between the first and the second version therefore makes little difference. Furthermore, a third version was released in 2007, which is known as the “BNC XML edition”. XML is a markup language that is widely used for encoding corpora. This third edition mostly differs from the previous ones in terms of formatting, especially when it comes to the tagging of multiword elements (e.g. as well as), but not so much in terms of content. In addition, the BNC-BYU interface developed by Davies (2004) constitutes yet another version of the BNC, because the corpus was entirely re-tagged using the CLAWS7 tagger. This tagger is the same one used by the Corpus of Historical American English, which is presented in the next section.
4.2.2 The Corpus of Historical American English (COHA)
The Corpus of Historical American English (COHA) is a 400-million-word diachronic corpus of American English compiled by Davies (2012) between 2008 and 2010. It contains data that ranges from 1810 to 2009. In a similar fashion to the BNC, the COHA is a general corpus and includes a large number of genres. However, these mostly consist of written genres. The COHA has four main genres, namely fiction, magazines, newspapers, and non-fiction. Fiction is the most represented genre, as it accounts for approximately 50% of the dataset, while magazines come second and represent around 25% of the dataset. Non-fiction comes third with 15%, and newspapers are the least-represented genre with around 10%, which is partly due to the fact that newspapers only appear from the 1860s onwards. It is therefore debatable whether the COHA is a balanced corpus, especially in the early years. This is why the interpretation of the results presented in chapter 6 will generally set these early years aside.
Another aspect of the corpus that makes the early years more difficult to analyse is that the amount of data varies from one decade to another. In particular, the early decades are much smaller than the rest, as can be observed in Table 17. Figure 3 provides a clearer visualization of the situation and shows that the first two decades have much less data than the others. Significant increases can be observed between 1860 and 1880, but also between 1910 and 1920. These aspects are relevant to sections 6.3 and 6.4, since the frequency of several items is studied there. A simple way to compensate for the varying sizes of the chunks is to calculate the frequency “per million words” for a given element.
Decade   Number of words   Decade   Number of words
1810s    1'181'022         1910s    22'655'252
1820s    6'927'005         1920s    25'632'411
1830s    13'773'987        1930s    24'413'247
1840s    16'046'854        1940s    24'144'478
1850s    16'493'826        1950s    24'398'180
1860s    17'125'102        1960s    23'927'982
1870s    18'610'160        1970s    23'769'305
1880s    20'872'855        1980s    25'178'952
1890s    21'183'383        1990s    27'877'340
1900s    22'541'232        2000s    29'479'451
Table 17. Number of words in each decade of the COHA.
For example, if a word occurs 5’000 times in the 1810s, one needs to divide this number by 1.181, as there are 1.181 million words in that chunk. This results in a frequency of 4’234 per million words. If this word occurred 5’000 times in the 1820s, which contain 6.927 million words, then its frequency per million words would be much lower, namely 722. Despite this adjustment, smaller collections of texts are less representative overall, as they involve a smaller variety of texts, which can skew the frequency of a given word. This was briefly touched upon earlier, when it was mentioned that the first decades do not include newspapers. This is why expressing word counts as frequencies per million words is only a partial solution to the problem.
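The normalization described above can be illustrated with a short Python sketch. This is only an illustration of the arithmetic, not the procedure used in chapter 6; the function name `per_million` and the dictionary `DECADE_SIZES` are introduced here for the example, and the decade sizes are copied from Table 17.

```python
# Sizes of the two decades used in the worked example (from Table 17).
DECADE_SIZES = {
    "1810s": 1_181_022,
    "1820s": 6_927_005,
}


def per_million(raw_count: int, corpus_size: int) -> float:
    """Return the frequency of an item per million words of a sub-corpus."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count / corpus_size * 1_000_000


# The same raw count (5'000) yields very different normalized frequencies
# depending on the size of the decade it comes from.
for decade, size in DECADE_SIZES.items():
    print(decade, round(per_million(5_000, size)))
```

The rounded values correspond to the figures given in the text (4’234 and 722), which shows how strongly a raw count depends on the size of the decade.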
Figure 3. Number of words in each decade of the COHA.
The corpus is tagged using CLAWS7. It is relatively similar to CLAWS4 and CLAWS5, which were used with the original BNC and introduced in the previous section. The main difference lies in the codes used for parts of speech. For example, the codes AJ0 (unmarked adjective), AJC (comparative), and AJS (superlative) were used for adjectives in the CLAWS5 tag set, whereas the CLAWS7 codes for these adjectives are JJ, JJR, and JJT respectively.
While the texts involved in the BNC can be accessed in full, the COHA includes copyrighted texts, which means that they cannot be legally distributed in full. To circumvent this problem, the texts were transformed by deleting ten words every 200 words. This makes it much more difficult to read those texts in full, while leaving most of the data unaffected (95% of the data is still there). Deleted words are replaced by the symbol @ in the corpus. Also, given that these deletions occur regularly and indiscriminately in the text, they affect all words equally. These deletions are therefore unproblematic for the broad quantitative analyses presented in chapter 6.
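The effect of this deletion scheme can be sketched in Python. The sketch rests on assumptions that the description above does not settle: it masks the last ten tokens of each complete block of 200 and replaces each masked token with its own @ symbol. The function name `mask_tokens` and its parameters are hypothetical, and the actual procedure used for the COHA may differ in these details.

```python
def mask_tokens(tokens, block=200, removed=10, symbol="@"):
    """Mask `removed` tokens in every complete block of `block` tokens.

    Assumption: the masked tokens are the last ones of each block, and
    each of them is replaced by one symbol.
    """
    masked = list(tokens)
    for start in range(0, len(masked) - block + 1, block):
        end = start + block
        for i in range(end - removed, end):
            masked[i] = symbol
    return masked


# A made-up text of 1'000 tokens, i.e. five complete blocks of 200.
tokens = [f"w{i}" for i in range(1000)]
masked = mask_tokens(tokens)

# Share of tokens that survive the masking.
retained = sum(token != "@" for token in masked) / len(masked)
print(retained)
```

Whatever the exact position of the masked tokens, 190 out of every 200 tokens are kept, which is where the figure of 95% comes from.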
Note that the COHA comes in different formats when one wants to access its full content. The study conducted in chapter 6 uses the “wlp” format of the corpus, which stands for word, lemma, part of speech. In this format, each word of the corpus is presented on one line that consists of three elements separated by tabs. The first element is the word itself as it appears in the text, the second element is its corresponding lemma, and the third is its CLAWS7 part of speech. Also, a sample of the COHA that uses the same format is freely available and consists of 3.6 million words, which roughly corresponds to one percent of the whole corpus. This sample was used to conduct some of the case studies presented in section 6.2, because its smaller size makes processing easier.
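As an illustration, lines in this three-column format could be read with a few lines of Python. This is a minimal sketch that assumes exactly three tab-separated fields per line, as described above; the names `Token` and `parse_wlp` are introduced for the example, and the sample lines are invented rather than taken from the COHA. The distributed files may contain additional columns or header lines that this sketch does not handle.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    word: str   # the word as it appears in the text
    lemma: str  # its corresponding lemma
    pos: str    # its part-of-speech tag


def parse_wlp(lines: Iterable[str]) -> Iterator[Token]:
    """Yield one Token per line made of three tab-separated fields."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        yield Token(*fields)


# Invented sample lines, for illustration only.
sample = [
    "The\tthe\tAT\n",
    "\n",
    "corpora\tcorpus\tNN2\n",
]

tokens = list(parse_wlp(sample))
for token in tokens:
    print(token.word, token.lemma, token.pos)
```

Since `parse_wlp` accepts any iterable of lines, an open file object can be passed to it directly, so that a large file does not need to be loaded into memory at once.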
Now that the main tools have been introduced, a concrete application regarding grammaticalization is presented in the next chapter. This chapter discusses a synchronic approach to measuring grammaticalization. A binary logistic regression model, as presented in section 4.1.2, is used to compute a grammaticalization score on the basis of the parameters introduced in chapter 3. This score ranges from zero (highly lexical) to one (highly grammatical). The model relies on data taken from the British National Corpus, which was introduced in section 4.2.1 above.