Addressing textual similarities using keyness software tools


2.6 From keyness to locking: investigating similarities between corpora

2.6.2 Addressing textual similarities using keyness software tools

Baker (2011) uses several frequency-based statistical methods to determine the words which remain consistently important over time in four diachronic corpora of 20th century British English, from which he puts forward the concept of "lockwords".

18 TESAS = TExt Source Alignment System.

These are high-frequency words which occur statistically with the most similar frequency when the corpora are compared. Baker explains that a lockword is:

a word which may change in its meaning or context of usage when we compare a set of diachronic corpora together, yet appears to be relatively static in terms of frequency. (2011:66)

and that lockwords provide a statistical contrast to keywords in corpus linguistics:

These words were so consistent in their frequencies that they appeared to be the opposite of Scott's (2000) concept of keywords [...] a new term, lockword, was thus invented to describe them. The term lock was chosen because it is related to key (key is the highest collocate of lock in the British National Corpus (using log likelihood)), and furthermore, lock is a good description of these words: they appear to be "locked" in place. (2011:73)

Baker (2011) successfully uses this method to identify words which are locked over the variable of time (for example money). The concept of statistical locking does not have to be limited to a diachronic perspective, however. It can equally be applied to synchronic corpora (Baker, personal communication, 25.10.11), where it shows, on an empirical basis, what the language of the texts has most strongly in common. In my study, therefore, it enables me to find out:

• how the language used by Shakespeare in constructing character dialogue is similar to that of his contemporaries; and

• how EModE plays are characterised by shared preferences for some language styles among playwrights

based on data from about 1,600,000 words of dramatic dialogue in 79 plays by 24 different playwrights (taking the two corpora overall; see further chapter 4).

As stated in 1.1, my study also extends Baker's (2011) lockwords method, in three ways:

(i) through its application to synchronic rather than diachronic corpora, as indicated above;

(ii) by testing it with historical texts; and

(iii) in using a different process which orients the statistical computations of the keyness tools in WordSmith and Wmatrix to similarity rather than difference, with relative simplicity and speed, as I explain below.

Baker (2011) extracts lists of the most frequent words in each of his four corpora using WordSmith, and then uses the statistical analysis software SPSS to compare them and identify those with the most similar frequencies (using standard deviation and coefficient of variance19). I have only two corpora, however, so I am able to use the keyness tools in WordSmith and Wmatrix to identify words and other linguistic items which occur statistically with the most similar frequency, based on log-likelihood statistical tests (Baker, Hardie and Wilson, personal communication, 25-27.10.11). WordSmith and Wmatrix are currently limited to comparisons of two corpora only.
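The coefficient of variance used in Baker's (2011) approach can be illustrated with a minimal Python sketch. This is not Baker's SPSS procedure: the word labels and frequencies are invented for illustration, the function name is hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which form was used. The frequencies are also assumed to be comparable across corpora (for example, normalised or drawn from corpora of equal size).

```python
from statistics import mean, pstdev


def coefficient_of_variance(freqs):
    """Standard deviation divided by the mean, multiplied by 100.

    Assumption: population standard deviation (pstdev); the source
    does not specify which form was used.
    """
    m = mean(freqs)
    if m == 0:
        raise ValueError("mean frequency is zero")
    return pstdev(freqs) / m * 100


# Invented frequencies for two placeholder items in four corpora.
frequencies = {
    "stable_word": [350.0, 352.0, 349.0, 351.0],
    "shifting_word": [600.0, 450.0, 300.0, 150.0],
}

# Items with the lowest coefficient of variance are the most "locked".
ranked = sorted(frequencies, key=lambda w: coefficient_of_variance(frequencies[w]))
```

Ranking in ascending order places the items whose frequencies vary least across the corpora at the top of the list.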

Setting the p value to 1.0 (in WordSmith) or the log-likelihood value to 0 (in Wmatrix, which does not provide p values) causes the keyness tools to identify words which are the least key, i.e. those for which the software finds the least information indicating there is a difference in frequency between the two corpora. Log-likelihood values at or near 0 occur either when frequencies in both corpora are low, or when frequencies are relatively similar (Rayson, personal communication, 19.12.11). Excluding low-frequency items restricts the results to the latter kind, which constitute the locked results (minimum frequency settings are discussed in 3.4.2).
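The logic of this procedure can be sketched in Python. The sketch does not reproduce the internals of WordSmith or Wmatrix: it assumes the two-corpus log-likelihood calculation widely used for keyness in corpus linguistics, and all function names, thresholds and counts are hypothetical. The p value is derived from the chi-square distribution with one degree of freedom, which is also an assumption about how a p threshold relates to a log-likelihood score.

```python
import math


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Two-corpus log-likelihood for one item (assumed standard form)."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return max(0.0, 2 * ll)  # guard against tiny negative rounding errors


def p_value(ll):
    """Upper-tail probability of chi-square (1 df) at the score ll."""
    return math.erfc(math.sqrt(ll / 2))


def locked_items(counts_a, counts_b, size_a, size_b, min_freq, max_ll):
    """Items at or above min_freq in both corpora with LL at or below max_ll."""
    results = []
    for word in counts_a.keys() & counts_b.keys():
        fa, fb = counts_a[word], counts_b[word]
        if fa < min_freq or fb < min_freq:
            continue  # low counts also give LL near 0, so exclude them
        ll = log_likelihood(fa, fb, size_a, size_b)
        if ll <= max_ll:
            results.append((word, ll))
    # Least key (most locked) first; ties broken alphabetically.
    return sorted(results, key=lambda r: (r[1], r[0]))
```

Because the expected frequencies are scaled by corpus size, corpora of slightly different sizes can be compared directly. A log-likelihood of 0 corresponds to p = 1.0 in this sketch, which is why the two settings described above select the same kind of result.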

Using specialised corpus linguistic keyness tools in WordSmith and Wmatrix to identify lockwords offers several advantages over Baker's (2011) method:

• speed and ease;

• access to contextual data through concordances (facilitating investigations of how words with the most similar frequencies are used in each corpus);

• computations which take into account the relative sizes of the corpora (Baker, personal communication, 25.10.11)20; and

• a statistical testing method which has been more rigorously tested in corpus linguistics (e.g. in Rayson et al. 2004b); log-likelihood values can also be looked up in tables of statistics, unlike coefficient of variance values (Rayson, personal communication, 19.12.11).

19 Standard deviation "measures the spread of data from the mean frequency of a word" and coefficient of variance (standard deviation divided by mean average, then multiplied by 100) controls for frequency (Baker 2011:72).

Baker's (2011:66) initial definition of a lockword, given above, seems to allow for diverse functions in words which are of similarly high frequency, statistically, in corpora. However, later in his study (2011:83) he mentions the distinction of "a true lockword": one which not only has a similarly high frequency in the corpora being examined, but which is also used in the same way(s). This raises the important question of whether or not all words with similarly high frequency which are matched orthographically by the computer software can be considered "locked", or whether there should be an additional linguistic criterion based on similarity of function.

A parallel question can of course be applied to keywords and other key results: does everything on the list of output generated by the software really count as a keyword, or only those items which are useful to the researcher? Based on Scott's (2010:46) comments regarding keywords and keyness, noted in 2.5.3 above, it would be difficult to get agreement on what should count as a "true" lockword, since researchers would undoubtedly make different judgments about similarity of function and/or other qualifying features. It seems better to set the criteria for locking and keyness according to statistical parameters only, then to assess all the results in terms of prototypicality and usefulness for the researcher's purposes. Key or locked results which are problematic can be diagnosed, and if necessary disqualified from further analysis, with the reasons for this made clear in relation to the aims of the study. This is a practical approach with a view to application, since my ultimate goal in obtaining locked and key results is to produce empirical data that will most usefully launch detailed investigations of potential style features in my corpora.

20 Although my corpora are of very similar size, they are not identical (see 4.4).

Having explained the principles of locking, and argued that it can usefully be applied in my study, I end this chapter with some further discussion of the kinds of results I anticipated from my corpus data, and my approach to interpreting them. This is based mainly on theory and background from keyness studies, but the principles extend equally to results generated with the locking method.

2.7 Issues surrounding the interpretation of corpus results in stylistic analysis

In document A Corpus Stylistic Investigation of the Language Style of Shakespeare's Plays in the Context of Other Contemporaneous Plays. (Page 63-67)