
Corpus and methodology

2. Corpora used in this study

This study uses three of the subcorpora of the Coruña Corpus: the Corpus of English Texts on Astronomy (CETA), the Corpus of English Philosophy Texts (CEPhiT) and the Corpus of English Life Sciences Texts (CELiST). These three corpora contain 122 samples of texts, and add up to 1,215,003 words. The texts which have been used are listed in Appendix 1 below. In what follows, the corpus used in the study is described in detail, organising the information according to the parameters used during the process of compilation (as explained in Section 1.2.2 above), which are also the parameters of the analysis of data. This analysis will review the distribution of words and samples according to each of them, starting with that of discipline, and continuing with the time of publication, the genre of the text, the geographical origin, and the sex of the author.

2.1. Disciplines

Samples from three subcorpora, dealing with astronomy, philosophy and life sciences, have been selected. These three disciplines are intended to give a representative overview of the scientific register as a whole, since they are sufficiently different in nature to represent different styles and approaches, as explained in Chapter 1. As can be seen in Figure 4.3 below, CETA, the Astronomy subcorpus, contains 42 samples and 409,909 words; CEPhiT, the Philosophy subcorpus, contains 40 samples and 401,129 words; and CELiST, the Life Sciences subcorpus, contains 40 samples and 403,965 words.

As explained in Section 1 above, the differences in word counts between the disciplines arise because each sample contains approximately, rather than exactly, 10,000 words, so the totals for each discipline differ slightly. The same variation will be present in all the parameters presented below.


Figure 4.3: Words per discipline

The subcorpus on astronomy, CETA, contains 42 texts instead of 40 because four texts (two each in the 1770s and 1880s) have been included in toto (see Section 1.1.2 above) despite containing fewer than 10,000 words each; taken in pairs, they add up to approximately 10,000 words.

This difference in the number of samples will also be present throughout the analysis of the different parameters, as will the difference in the length of the samples.
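The discipline figures reported in this section can be checked with a short script. What follows is only an illustrative sketch: the counts are those given in the text, whereas the names `DISCIPLINES` and `word_shares` are hypothetical and do not belong to any tool used in the compilation of the corpus.

```python
# Samples and words per discipline, as reported in Section 2.1.
DISCIPLINES = {
    "Astronomy (CETA)": {"samples": 42, "words": 409_909},
    "Philosophy (CEPhiT)": {"samples": 40, "words": 401_129},
    "Life Sciences (CELiST)": {"samples": 40, "words": 403_965},
}


def word_shares(counts):
    """Return each category's share of the total word count, as a whole percentage."""
    total = sum(entry["words"] for entry in counts.values())
    return {name: round(100 * entry["words"] / total) for name, entry in counts.items()}


total_samples = sum(entry["samples"] for entry in DISCIPLINES.values())
total_words = sum(entry["words"] for entry in DISCIPLINES.values())
discipline_shares = word_shares(DISCIPLINES)

print(total_samples, total_words)
print(discipline_shares)
```

The totals agree with the 122 samples and 1,215,003 words stated at the start of the section, and the rounded shares agree with the percentages shown in Figure 4.3.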

2.2. Period of study

The three subcorpora are used in this study in their entirety, so that the full period between 1700 and 1900 is analysed. As shown in Figure 4.4 below, the samples from the eighteenth and nineteenth centuries add up to a similar number of words: there are 61 samples in each century, with 608,644 words in the period between 1700 and 1800 and 606,359 words in the period between 1800 and 1900.

Figure 4.4: Distribution of words per century

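Figure 4.3 data (words; share of corpus): Astronomy 409,909 (34%); Philosophy 401,129 (33%); Life Sciences 403,965 (33%).

Figure 4.4 data (words; share of corpus): eighteenth century (XVIII) 608,644 (50%); nineteenth century (XIX) 606,359 (50%).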


As for the distribution of the data over the period, each decade⁹¹ features six samples, except for the 1770s and 1880s, which have seven each. As can be seen in Figure 4.5 below, the samples from each decade add up to approximately 60,000 words, ranging from 58,830 words in the decade with the fewest words (1840s) to 64,086 words in the one with the most (1850s).

Figure 4.5: Distribution of words per decade (N.B.: the y-axis starts at 58,000 words)

2.3. Genre of the samples

The samples selected provide a representative view of the different genres used in English scientific writing during the period under study. As can be seen below in Figure 4.6, treatises are by far the most common genre, accounting for half of the samples (61 out of 122, adding up to 610,183 words), in line with the common uses of the period. After treatises, the most frequently used genres are textbooks (20 samples, 206,277 words), reflecting the drive for popularisation of scientific knowledge in the period; essays (14 samples, 142,554 words); and lectures (12 samples, 120,538 words).

There are seven samples of articles, the first examples of the genre which would later, as explained in Chapter 1, dominate scientific writing; these add up to only 53,861 words, which reflects their shorter length. There are also some examples of genres which are nowadays not normally used in scientific writing, such as the five letters (51,555 words), a genre then very much in use as a way of communicating scientific knowledge, and the two dialogues (19,991 words), characteristic of the earlier scholastic paradigm but still in use (though receding) in scientific writing at the start of the period. Finally, there is also a dictionary, which appears under the label “others”⁹² and which contains 10,044 words.

91 Texts are selected at a rate of two samples per decade and subcorpus, irrespective of the actual year of writing. Consequently, the distribution of texts inside each decade is not regular, and it is by considering the diachronic distribution in terms of decades and not years that the best comparative view is offered.

Figure 4.6: Distribution of samples and words per genre.
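The observation that articles are shorter than the other genres can be made explicit by dividing the words of each genre by its number of samples. The following is a minimal sketch using only the figures given in this section; the names `GENRES` and `mean_sample_length` are hypothetical and introduced here purely for illustration.

```python
# (samples, words) per genre, as reported in Section 2.3.
GENRES = {
    "Treatise": (61, 610_183),
    "Textbook": (20, 206_277),
    "Essay": (14, 142_554),
    "Lecture": (12, 120_538),
    "Article": (7, 53_861),
    "Letter": (5, 51_555),
    "Dialogue": (2, 19_991),
    "Others": (1, 10_044),
}


def mean_sample_length(genres):
    """Return the mean number of words per sample for each genre."""
    return {genre: words / samples for genre, (samples, words) in genres.items()}


means = mean_sample_length(GENRES)
shortest = min(means, key=means.get)  # genre with the lowest mean sample length

for genre, mean in sorted(means.items(), key=lambda item: item[1]):
    print(f"{genre:<10}{mean:>10.0f}")
```

On these figures, articles average roughly 7,700 words per sample, whereas the samples of every other genre average close to 10,000 words.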

2.4. Geographical distribution

The samples included in the study were written not only by English authors, but also by authors from Scotland, Wales, Ireland and North America, thus representing the entirety of native English-speaking areas in the period.

The results presented below in Figure 4.7 show that the majority of texts (56 texts, 556,885 words) were written by English authors, particularly during the eighteenth century. There is also a sizeable number of samples (28, 276,331 words) written by Scottish authors, whilst Irish (10 samples, 101,723 words) and North American authors (16 samples, 158,170 words) appear less frequently.

It is also noticeable that there is an important number of authors (12 samples, adding up to 121,894 words) who are classified under the label “others”. This label includes two different types of authors. Some of them are authors about whose upbringing the compilers of the corpus have not found definite information, which makes their classification impossible. Another sizeable group is formed by authors who were educated in several countries during their lifetimes. These authors might thus have been influenced by more than one diatopic variety and might have used a mixed variety themselves.

92 This label includes more than dictionaries, as other subcorpora, not selected in this dissertation, include other genres, all of which share the characteristic of being less frequently used.

Figure 4.6 data (words; share of corpus): Treatise 610,183 (50%); Textbook 206,277 (17%); Essay 142,554 (12%); Lecture 120,538 (10%); Article 53,861 (4%); Letter 51,555 (4%); Dialogue 19,991 (2%); Others 10,044 (1%).


Figure 4.7: Distribution of words per provenance of the author

The decision to group these two types together may be questioned, but the second group is too heterogeneous (several combinations of countries at different periods resulting in particular and different idiolectal varieties) to be considered a definite group, and the first one is not a group as such, but rather an assortment of authors about whose upbringing there is no information⁹³.

2.5. Sex of the authors

The great majority of the samples were written by men (110 out of 122), and just twelve were written by women. As can be seen below in Figure 4.8, this means that only 10% of the words in the corpus (123,978) were of female authorship, whilst 90% (1,091,025 words) were written by men. As explained in Section 1 above, these figures can be considered to be in keeping with the reality of the time.

Figure 4.8: Distribution of words per sex of the author
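Since every parameter described in this chapter partitions the same set of texts, each breakdown should add up to the same totals of samples and words. The sketch below performs that consistency check on the figures reported in Sections 2.1 to 2.5; the names `BREAKDOWNS` and `totals_per_parameter` are hypothetical and serve only as an illustration.

```python
# Totals stated at the beginning of the chapter.
EXPECTED_SAMPLES = 122
EXPECTED_WORDS = 1_215_003

# (samples, words) per category for each parameter, as reported in Sections 2.1 to 2.5.
BREAKDOWNS = {
    "discipline": {
        "Astronomy": (42, 409_909),
        "Philosophy": (40, 401_129),
        "Life Sciences": (40, 403_965),
    },
    "century": {"XVIII": (61, 608_644), "XIX": (61, 606_359)},
    "genre": {
        "Treatise": (61, 610_183),
        "Textbook": (20, 206_277),
        "Essay": (14, 142_554),
        "Lecture": (12, 120_538),
        "Article": (7, 53_861),
        "Letter": (5, 51_555),
        "Dialogue": (2, 19_991),
        "Others": (1, 10_044),
    },
    "provenance": {
        "England": (56, 556_885),
        "Scotland": (28, 276_331),
        "Ireland": (10, 101_723),
        "North America": (16, 158_170),
        "Others": (12, 121_894),
    },
    "sex": {"Male": (110, 1_091_025), "Female": (12, 123_978)},
}


def totals_per_parameter(breakdowns):
    """Sum samples and words over the categories of each parameter."""
    return {
        parameter: (
            sum(samples for samples, _ in categories.values()),
            sum(words for _, words in categories.values()),
        )
        for parameter, categories in breakdowns.items()
    }


totals = totals_per_parameter(BREAKDOWNS)
mismatches = {
    parameter: total
    for parameter, total in totals.items()
    if total != (EXPECTED_SAMPLES, EXPECTED_WORDS)
}
print(mismatches or "All parameters add up to the stated totals.")
```

An empty `mismatches` dictionary indicates that all five breakdowns are consistent with the overall size of the corpus.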

93 Although the impossibility of obtaining information about an author is interesting per se, as it might be considered significant evidence of their social status, the fact that an author’s provenance is unknown is no ground for constituting a group, as each of these authors would probably have a different background and the result would be another hodgepodge.

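Figure 4.7 data (words; share of corpus): England 556,885 (46%); Scotland 276,331 (23%); Ireland 101,723 (8%); North America 158,170 (13%); Others 121,894 (10%).

Figure 4.8 data (words; share of corpus): Male 1,091,025 (90%); Female 123,978 (10%).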

