Chapter 3: Methodology

3.6. Keyword analyses

3.6.1. Choice of reference corpus: the BNC Sampler written component

corpus on the content of a keyword list generated by conducting a keyword analysis of a research corpus consisting of the text of Shakespeare’s Romeo and Juliet with two different reference corpora: all of Shakespeare’s plays and the British National Corpus (BNC) respectively. The former throws up the proper names of specific characters within Romeo and Juliet as well as words reflecting central themes, such as love, death and poison. The latter also throws up words reflecting specific characters and central themes, together with a set of words not present in the former list which reflect the specific nature of Shakespearean language, such as thou, thy, O and hath. Whilst underlining the impact of reference corpus choice in this way, they argue that this demonstration at the same time shows that ‘while the choice of reference corpus is important, above a certain size, the procedure throws up a robust core of KWs whichever the reference corpus used’ (ibid., p. 64). In a later article, Scott argues again for the robustness of the keywords procedure, maintaining that keywords ‘identified even by an obviously absurd [reference corpus] can be plausible indicators of aboutness’ (2010, p. 51). However, he also points out that, in terms of the content of the reference corpus, genre and whether texts are spoken or written have a significant impact (ibid.).

As discussed above in section 2.3.5. of the Literature Review, in the context of the usefulness of keywords to discourse analysis, Baker (2004) highlights the danger of overemphasising difference if two research corpora are compared with each other in generating keyword lists. Taylor (2013) also takes up the issue of excessive focus on difference in corpus work, arguing for a greater focus on similarity for two reasons. Firstly, she argues that ‘by focusing on difference, we effectively create a “blind spot”; this means that, rather than aiming for a 360-degree perspective on our data, we are actually starting out with the goal of achieving only 180-degree visualisation’ (p. 83). Secondly, she argues that ‘by setting out to look at difference, the analyst is likely to find and report on difference’, which creates ‘a significant threat to the balance of analysis’ (ibid.). She cites Baker’s idea of the ‘bottom drawer syndrome’, in which researchers who find similarity tend to file rather than publish such findings. As a result, published research comparing a particular set of discourses or language types presents a picture of greater difference than actually exists (Baker, 2010, p. 83, cited in Taylor, 2013, p. 83).

Taking into account Baker’s and Taylor’s arguments, I have aimed in my project for a research design which is able to account for both similarity and difference. For this reason, rather than compare the research corpora against each other, which would risk exaggerating disciplinary and/or institutional differences at the expense of possible similarities, I have chosen a larger, more general corpus, the BNC Written Sampler (discussed below), as the reference corpus in keyword analyses of all four sub-corpora. Using the keyword lists generated in this way, I have identified both similarities and differences in terms of the keywords present across all four corpora. Taking into account also the impact of whether texts in a reference corpus are spoken or written (Scott and Tribble, 2006), I have opted for a reference corpus made up solely of written texts. This is because I am concerned not so much with highlighting features which single out the texts in my research corpora as written as opposed to spoken texts, but rather with highlighting features by which those texts can be compared or contrasted as written texts belonging to specific disciplinary discourse communities. For this reason also, the BNC Written Sampler is a useful reference corpus for my research purposes.
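By way of illustration only, the step of identifying similarities and differences across the four keyword lists can be thought of as a set comparison. The short Python sketch below is not the procedure actually used in this study; the corpus labels and the words in the toy lists are hypothetical, and the function name `compare_keyword_lists` is introduced purely for the example.

```python
def compare_keyword_lists(keyword_lists):
    """Split keywords into those shared by every corpus and those unique to one.

    keyword_lists: dict mapping a corpus label to an iterable of keywords
    (assumed non-empty). Returns (shared, unique), where `unique` maps each
    label to the keywords found in that corpus's list and in no other.
    """
    sets = {label: set(words) for label, words in keyword_lists.items()}

    # Similarity: keywords present in all of the lists.
    shared = set.intersection(*sets.values())

    # Difference: keywords present in exactly one list.
    unique = {}
    for label, words in sets.items():
        others = set().union(
            *(other for name, other in sets.items() if name != label)
        )
        unique[label] = words - others
    return shared, unique


# Hypothetical labels and toy keyword lists, for illustration only.
toy_lists = {
    "History_A": ["the", "of", "was"],
    "History_B": ["the", "of", "had"],
    "PIR_A": ["the", "of", "should"],
    "PIR_B": ["the", "of", "should", "must"],
}
shared, unique = compare_keyword_lists(toy_lists)
```

Keywords that fall into neither group (here, a word key in two of the four lists) would still need to be inspected separately, since they may point to disciplinary or institutional groupings.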

The BNC was chosen because, in contrast to the ‘specialised’ nature of the four sub-corpora under examination, which include only written texts in the essay genre from the academic domain, it is a ‘general’ corpus whose purpose is the study of modern British English as a whole. The BNC is ‘a well-known general corpus’ (McEnery et al., p. 59) consisting of 4,124 texts and totalling 100,106,008 words of modern British English. Of this, 90% is written, comprising ‘samples from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, as well as school and university essays’ (ibid.), and 10% is spoken, comprising 863 transcripts of informal conversation, demographically balanced in terms of social class, region and age, and of speech in a range of different contexts, from ‘formal business and government meetings to radio shows and phone ins’ (pp. 59-60).

The BNC Sampler, compiled at Lancaster University, is a 2-million-word sub-corpus of the BNC, created both so that word class tagging could be manually checked and corrected and so that the written and spoken components would be approximately evenly balanced, with a roughly 50%-50% division (UCREL, Lancaster University, 1998). It contains ‘a wide and balanced sampling of texts from the BNC, so as to maintain the general text types and the proportions of general text types (apart from the unequal written/spoken division) of the BNC as a whole’ (ibid.). The written portion of the BNC Sampler can be drawn on as a reference corpus in Wmatrix, the programme I used to create my keyword lists (discussed below). Using the written portion of the BNC Sampler as a reference corpus for the History and Politics/IR corpora from both institutions will show words that are unusually frequent in these corpora in comparison with their frequency in ‘general’ written usage beyond the academic domain. Differences between the keyword lists generated for the four corpora may indicate disciplinary and/or institutional variation.
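To make concrete what ‘unusually frequent’ means in a keyword comparison, the following Python sketch computes keyness for each word in a research corpus against a reference corpus. It assumes the log-likelihood statistic, a measure commonly used for keyness; the function names, the toy frequencies and the 3.84 cut-off (the chi-square critical value for p < 0.05 with one degree of freedom) are assumptions made for illustration and do not reproduce Wmatrix’s implementation or the settings used in this study.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood (G2) for one word in a two-corpus comparison."""
    combined = freq_study + freq_ref
    total = size_study + size_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def keywords(study_counts, ref_counts, threshold=3.84):
    """Return (word, LL) pairs for words overused in the study corpus.

    study_counts / ref_counts: dicts mapping word -> raw frequency.
    threshold: illustrative cut-off only (an assumption, see lead-in).
    """
    size_study = sum(study_counts.values())
    size_ref = sum(ref_counts.values())
    results = []
    for word, freq in study_counts.items():
        ref_freq = ref_counts.get(word, 0)
        # Keep only words relatively MORE frequent in the study corpus.
        if freq / size_study <= ref_freq / size_ref:
            continue
        ll = log_likelihood(freq, size_study, ref_freq, size_ref)
        if ll >= threshold:
            results.append((word, ll))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy frequencies, invented for illustration only.
study = {"the": 50, "thus": 30, "cat": 20}
reference = {"the": 500, "thus": 10, "cat": 490}
print(keywords(study, reference))
```

In this toy example only ‘thus’ is returned: ‘the’ has the same relative frequency in both corpora, and ‘cat’ is relatively less frequent in the study corpus than in the reference corpus.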

There are arguments against using the BNC as a reference corpus in circumstances where the period in which it was compiled (the 1980s to 1993) could significantly skew the keyword lists generated: changes since then, for example in society, politics or technology, are likely to have affected language usage in terms of the salience of particular themes and, consequently, the frequency of particular content lexis or proper names in spoken and written texts (e.g. Johnson and Ensslin, 2006). However, although the BNC is arguably somewhat ‘dated’ at this point, this is likely to have little or no significant impact on the occurrence in the keyword lists of closed-class grammatical words, which are the focus of my study (discussed below). The BNC’s use can therefore be justified within the context of my project.

3.6.2. Keyness analysis procedure and identification of items for analysis

In document A corpus driven investigation into the semantic patterning of grammatical keywords in undergraduate History and PIR (Politics & International Relations) essays (Page 101-104)