Corpus tools - Data and method - A corpus-based discourse analysis of representations of people

3. Data and method

3.3. Corpus tools

To carry out the analysis I use a combination of corpus software. Different software packages make different tools available, and these may differ in quality and functionality. For this reason, it is not unusual for a corpus linguist to employ multiple tools for a single study. For instance, a study carried out by Baker et al. (2013) into the representation of Muslims in the British press used a combination of Sketch Engine and Wordsmith, as

James Balfour - May 2020 115

did Taylor’s (2014) study looking at the representation of migrants in the British and Italian press. The software used most frequently in this thesis (Chapters 4 and 7) is Sketch Engine, an online suite of corpus tools developed by Kilgarriff, Rychly and colleagues (e.g. Kilgarriff et al., 2014). Sketch Engine was chosen over other software primarily for practical reasons. First, I use a MacBook, and other corpus software able to process millions of words of text, e.g. Wordsmith (see below), is only compatible with a Windows operating system. Second, Sketch Engine, unlike other software, provides access to its unique ‘word sketch’ tool. This is used in Chapter 4 to examine the most frequent words that explicitly refer to schizophrenia. Once a corpus is uploaded to Sketch Engine, it is tagged for parts of speech by way of the TreeTagger tool (Schmid, 1994). The word sketch tool, described by its creators as offering ‘a feast of information on the word’ (Kilgarriff et al., 2014), then calculates collocates of words belonging to a specified lexeme and groups them into ‘frames’ based on their

grammatical relationship with the node word. For instance, the word sketch may group a lexeme’s collocates into verbs that predicate the word, collocates that are modifiers etc. In doing so, a word sketch offers a comprehensive semantic profile of a word, giving a good idea of its general usage in a corpus. To calculate collocates, the window span was set at +/-5, a span which has been used in similar CADS studies. As Baker et al. (2013:36) write, ‘the default span at five words either side of the search word […]

seems to offer a good balance between identifying words that actually do have a relationship with each other (longer spans can throw up unrelated cases) and giving enough words to analyse (shorter spans result in fewer collocates)’. The logDice statistic, the default score in Sketch Engine, was chosen to determine collocation


strength.¹⁹ While a logDice score has a theoretical maximum of 14, which denotes the highest collocation strength (i.e. the two words always occur together in the corpus), in practice it is unlikely to exceed 10 (Rychly, 2008). Sketch Engine’s collocates tool is also used in Chapter 7 to examine ways in which the press re-contextualise violence committed by people with schizophrenia using words relating to moral responsibility.
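The ±5 window and the logDice score described above can be illustrated with a minimal sketch. It is not Sketch Engine's implementation: the function names `collocate_counts` and `log_dice` are hypothetical, the input is assumed to be an already tokenised list of words, and the formula is the one given by Rychly (2008), logDice = 14 + log2(2·f(x,y) / (f(x) + f(y))).

```python
import math
from collections import Counter


def collocate_counts(tokens, node, span=5):
    """Count words occurring within `span` tokens either side of each hit of `node`.

    Assumption: `tokens` is a flat list of word forms; no lemmatisation or
    sentence-boundary handling is attempted here.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - span)
        end = min(len(tokens), i + span + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def log_dice(f_xy, f_x, f_y):
    """logDice from the co-occurrence frequency and the two word frequencies."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


if __name__ == "__main__":
    # Toy example only; the sentence is invented for illustration.
    text = "the paper said the man had schizophrenia and the man was jailed"
    tokens = text.split()
    freqs = Counter(tokens)
    for word, f_xy in collocate_counts(tokens, "schizophrenia").items():
        print(word, round(log_dice(f_xy, freqs["schizophrenia"], freqs[word]), 2))
```

Because the score depends only on the three frequencies, two words that only ever occur together (for example `log_dice(10, 10, 10)`) reach the maximum of 14, regardless of corpus size.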

One of the shortcomings of Sketch Engine is that, from concordance lines, it only allows the analyst to view a narrow strip of co-text; an analyst who wishes to view the article in its entirety needs to refer back to the original file. This is time-consuming when examining the usage of multiple collocates and, thus, the concordance tool available via Wordsmith 5.0 (Scott, 2008) was typically used to carry out

concordance analyses instead. Unlike Sketch Engine, Wordsmith’s concordance tool allows the analyst to view the entire text from the concordance line almost instantly.

This is helpful in cases where a word’s usage cannot be deduced from the narrow strip of context offered by Sketch Engine, or where the analyst wishes to examine something else in the text, for instance the article’s metadata.
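To make the notion of a ‘narrow strip of co-text’ concrete, the following is a minimal keyword-in-context (KWIC) sketch. It does not reproduce either Sketch Engine’s or Wordsmith’s concordancer; the function name `kwic` and the `width` parameter are hypothetical, and the input is assumed to be a tokenised text.

```python
def kwic(tokens, node, width=5):
    """Return (left co-text, node, right co-text) for every hit of `node`.

    `width` is the number of tokens shown on each side; anything outside
    that strip is invisible to the analyst unless the full text is opened.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    # Invented sentence, for illustration only.
    tokens = "he was diagnosed with schizophrenia in 2003 and later recovered".split()
    for left, node, right in kwic(tokens, "schizophrenia", width=3):
        print(f"{left:>30} | {node} | {right}")
```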

Another tool used via Wordsmith was the keywords tool, which is used in Chapter 5 to examine distinctive lexis in the tabloid and broadsheet subcorpora in stories in which schizophrenia and people with schizophrenia are mentioned. Keywords are words

19 Sketch Engine describes logDice as “a statistic measure for identifying collocations. It expresses the typicality of the co-occurrence of the node and the collocate. It is only based on the frequency of the node and the collocate and the frequency of the whole collocation. logDice is not affected by the size of the corpus and, therefore, can be used to compare the scores between different corpora. logDice is the preferred option when working with large corpora.” https://www.sketchengine.eu/my_keywords/logdice/


which are statistically significantly more frequent in one corpus relative to another. This is carried out by comparing two word lists (tokens listed in order of their frequency), derived for two corpora. Keywords are often used to identify salient topics in

contrasting corpora. As Kilgarriff (1997:233) claims, ‘any difference in the linguistic character of two corpora will leave its trace in differences between their word frequency lists.’ As it is not helpful to compare the raw frequencies of words in two corpora of different sizes, I used the log-likelihood significance metric to determine whether the difference in frequency of a word in the two word lists was statistically significant. Log-likelihood is one of two metrics (the other being the chi-square metric) available via Wordsmith 5.0. The log-likelihood metric tests the difference in frequency of a word against the null hypothesis, which stipulates that the difference between the two frequencies is due to random variation in the dataset. The p-value threshold was set at 0.000001, lower than is customary in the social sciences, in order to ensure that I examined the most statistically significant lexis. It is also customary for corpus analysts conducting keyword analyses to set cut-offs in order to limit the number of keywords so that each can be analysed in detail. As with many previous CADS studies, I chose to limit the number of keywords for each tabloid/broadsheet subcorpus to the top 100 in descending order of their keyness score (i.e. their statistical significance). This also ensured that the keywords obtained were distributed equally between the two subcorpora. The minimum frequency threshold was set at three, which is the default setting in Wordsmith 5.0. An advantage of setting the frequency threshold very low is that the tool is able to capture cases where a feature may be very infrequent in one corpus and very frequent in another, with the frequency difference being significant (Gabrielatos, 2018:239).
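The keyword procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than Wordsmith’s own implementation, which is not documented here: it uses the widely used two-corpus formulation of log-likelihood (observed against expected frequencies), converts the score to a p-value using the chi-square distribution with one degree of freedom, and the function names (`log_likelihood`, `p_value`, `keywords`) and the toy frequencies are hypothetical. The defaults mirror the settings reported in the text: p < 0.000001, a minimum frequency of three, and the top 100 keywords.

```python
import math


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood for one word given its frequency in two corpora."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    if freq_a > 0:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll


def p_value(ll):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(ll, 0.0) / 2))


def keywords(freqs_a, freqs_b, size_a, size_b,
             max_p=0.000001, min_freq=3, top_n=100):
    """Words significantly more frequent in corpus A than in corpus B."""
    scored = []
    for word, fa in freqs_a.items():
        fb = freqs_b.get(word, 0)
        if fa < min_freq:
            continue  # below the minimum frequency threshold
        if fa / size_a <= fb / size_b:
            continue  # not relatively more frequent in corpus A
        ll = log_likelihood(fa, fb, size_a, size_b)
        if p_value(ll) < max_p:
            scored.append((word, ll))
    # Highest keyness first, then apply the top-N cut-off.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # Invented frequencies, for illustration only.
    tabloid = {"killer": 50, "the": 1000, "rare": 2}
    broadsheet = {"killer": 5, "the": 1000}
    for word, score in keywords(tabloid, broadsheet, 10000, 10000):
        print(word, round(score, 2))
```

Running the function once in each direction (tabloid against broadsheet, then broadsheet against tabloid) would yield one list of up to 100 keywords per subcorpus, which is the even split the cut-off is meant to guarantee.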


Each research question and the methods used to answer each one are listed in Table 3.6 below.

Table 3.6 Parts of the methodological framework and how they correspond to each of my research questions

Chapter | RQ | Method

Chapter 4 | What do lexicogrammatical patterns around words referring to people with schizophrenia say about the way such people are typically represented in the British press? | […] 5.0 is then used to examine patterns in usage.

Chapter 5 | What distinctive words are used by the tabloids and broadsheets when reporting on stories that mention schizophrenia and people with schizophrenia? Do the ways such words are used in context shed light on differences in how people with schizophrenia are represented in the tabloids and broadsheets? | I use the keywords tool via Wordsmith 5.0 to calculate […]

Chapter 6 | How could one use corpus techniques to examine ways in which the press represent schizophrenic people as moral agents of violent crime? | […]

Chapter 7 | How do the British press use language to re-contextualise violence committed by people with schizophrenia? How is the press’ re-contextualisation of these crimes likely to shape a reader’s blame judgement? | I use Sketch Engine’s word list tool to identify the ten […]

The theoretical-methodological approach used in this thesis is loosely based on Fairclough’s three-tier approach (e.g. 1989, 1995). Fairclough breaks down critical discourse analysis into three stages: linguistic description, interpretation and

explanation. These roughly correspond to the concentric rectangles in his model representing the relationship between text, interaction and social context (see Section 2.1.1.1). The first phase of Fairclough’s model is description, where the linguist describes language patterns in the text. In the case of the corpus linguist, who is probably dealing with a corpus hundreds of times larger than the dataset of the traditional CDA practitioner, this refers to identifying and describing patterns in the data. This contrasts with the second phase, interpretation, where the analyst examines how the ‘member’s

In document A corpus-based discourse analysis of representations of people with schizophrenia in the British press between 2000 and 2015 (Page 114-119)