
4.4 Micro analysis

4.4.1 Computational identification

4.4.1.3 Node words and collocates

Looking at every keyword is beyond the limits of this research; moreover, keyword analysis can yield diminishing returns, with some keywords functioning in similar ways to others. Therefore, to identify those keywords which will then be subjected to a detailed collocational analysis, one needs to decide upon what Kennedy (1998: 251) has referred to as the ‘target term, node word or search item’. I shall use ‘node word’ as a technical term for those keywords that are elected to be investigated in terms of their collocates in each text. The question now is: what are the criteria for deciding upon node words?

Prior to setting any criteria for selecting the node words in the research data, it should be noted that among the keywords only lexical (not grammatical) words will be considered: more often than not the keyness of lexical words is due to their being inherently relational, because they cannot be established without referring to another text or set of data; and this relational aspect is crucial in terms of identifying (opposing) discourses. Indeed, this can be taken as a guideline for any criteria governing the selection of (lexical) keywords. In the present study, there are three criteria for selecting the node words.

51 According to Scott (2004), the p value is ‘that used in standard chi-square and other statistical tests. This value ranges from 0 to 1. A value of 0.01 suggests a 1% danger of being wrong in claiming a relationship, .05 would give a 5% danger of error. In the social sciences a 5% risk is usually considered acceptable. In case of key word analyses, where the notion of risk is less important than that of selectivity, you may often wish to set a comparatively low p value threshold such as 0.000001 (one in 1 million) (1E-6 in scientific notation) so as to obtain fewer key words’.

The first criterion is quantitative in nature. Using WordSmith 5, the elect node word should be identified as collocating with other words, i.e. collocates,52 in terms of certain collocation statistics, with the default settings: notably the span of ±5 (that is, five words on either side of the node word). It should be noted here that this span has not been arbitrarily set; rather, it was specified as a result of prior qualitative investigations of the concordances of all the designated collocations in this research data, where the range ±5 was identified to be the topmost span. Words which occurred at a span of 6 or 7 words away from the node were normally too far away to suggest that the two words had any meaningful relationship to each other. Here, the reference made to ‘meaningful relationship’ is rather technical, as it denotes words that are meaningfully related to each other in text, either on the syntagmatic plane or the paradigmatic one; so, these words must be meaningfully related in their co-textual environments.
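
By way of illustration, the ±5 span can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not WordSmith's own procedure: the function name `window_collocates`, the whitespace tokenisation and the toy sentence are all hypothetical, and the sketch ignores sentence boundaries, which a concordancer may or may not respect.

```python
from collections import Counter


def window_collocates(tokens, node, span=5):
    """Count the words found within +/-span tokens of each occurrence of node.

    Hypothetical helper for illustration only; matching is case-insensitive.
    """
    lowered = [t.lower() for t in tokens]
    node = node.lower()
    counts = Counter()
    for i, tok in enumerate(lowered):
        if tok != node:
            continue
        # Up to `span` tokens to the left and to the right of the node.
        counts.update(lowered[max(0, i - span):i])
        counts.update(lowered[i + 1:i + 1 + span])
    return counts


if __name__ == "__main__":
    # Toy sentence (invented): only words within five positions are counted.
    text = "one two three four five six NODE seven eight nine ten eleven twelve"
    for word, freq in window_collocates(text.split(), "node").most_common():
        print(word, freq)
```

In the toy sentence, the words six positions away from the node on either side fall outside the window, which mirrors the observation above that items at a distance of 6 or 7 words were normally too far from the node to be meaningfully related to it.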

Evert (2009: 1237) argues that the ‘association measures’ of collocation fall into two major groups: ‘effect-size measures (MI, Dice, odds-ratio) and significance measures (z-score, t-score, simple-ll, chi-squared, log-likelihood)’.53 Interestingly, in order to identify ‘strongly associated word pairs’, Evert (2005: 21f) applied the significance measure of log-likelihood to a case of the English verb + noun (direct object) co-occurrences in the British National Corpus (BNC). Many different phenomena were found: ‘fixed idiomatic expressions (take place and give rise (to)), support verb constructions and other lexically determined combinations (make sense, play (a) role, solve (a) problem [...]), stereotypes and formulaic expressions ([...] wait (a) minute)’. Further, free and compositional combinations, which reflect ‘facts of life, typical behaviour’, were also found, viz. ‘(ask (the) Secretary (of State) and write (a) letter)’. Actually, Evert (2005: 137) argues that the log-likelihood association measure offers ‘an excellent approximation of the p-values of Fisher’s test and has convenient mathematical and numerical properties’. However, as he continues to argue, ‘the statistical soundness of log-likelihood does not always translate into better performance’; and, as such,

‘[a] conclusive answer can therefore only come from a comparative empirical evaluation of association measures, which plugs different measures into the intended application’ (Evert 2005: 137).

52 The collocating words should be viewed as being a textual concept, i.e. as significant to the text at stake, in that they semantically come together with shared discourse prosody in relation to the node word assigned with those collocating words.

53 According to Evert (2009), effect-size measures aim to ‘quantify how strongly the words in a pair are attracted to each other, i.e. they measure statistical association between the cross-classifying factors in the contingency table’ (p. 1234); and statistical significance measures are ‘based on the same types of hypothesis tests as ... chi-squared tests ... and likelihood-ratio tests’ (p. 1235).
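
To make the log-likelihood measure more concrete, the following Python sketch computes the standard likelihood-ratio statistic (G2) from a 2x2 contingency table of observed and expected frequencies. It is offered as an illustration under stated assumptions only: the function name `log_likelihood`, its parameters and the figures used are hypothetical, and the sketch is not claimed to reproduce the exact computation in Evert's work or in WordSmith.

```python
import math


def log_likelihood(o11, f1, f2, n):
    """Likelihood-ratio statistic G2 for a word pair (illustrative sketch).

    o11: co-occurrence frequency of the pair
    f1:  total frequency of the first word (row total)
    f2:  total frequency of the second word (column total)
    n:   sample size
    """
    observed = [
        [o11, f1 - o11],
        [f2 - o11, n - f1 - f2 + o11],
    ]
    row_totals = [f1, n - f1]
    col_totals = [f2, n - f2]
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / n  # expected under independence
            if o > 0:  # a cell with zero observations contributes nothing
                total += o * math.log(o / e)
    return 2 * total


if __name__ == "__main__":
    # Invented frequencies: the statistic grows as the pair co-occurs
    # more often than independence would predict.
    for cooccurrences in (10, 20, 50):
        print(cooccurrences, log_likelihood(cooccurrences, 100, 100, 1000))
```

When the observed co-occurrence frequency equals the frequency expected under independence, the statistic is zero; larger values indicate stronger evidence of association, which is the sense in which log-likelihood counts as a significance measure rather than an effect-size measure.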

In the present study, in respect of significance measures, WordSmith 5 offered results of collocates that are equally significant in terms of the t score, z score and log-likelihood. Thus, it would be rather redundant to incorporate all three results as evidence for collocability. Only one of these results will therefore be chosen as a significance measure of the collocating items, with a view to giving us ‘confidence in claims about the data, so that we may claim statistical significance for our results’ (Oakes 1998: 9). Indeed, as I shall argue below, the MI and t scores can be suitable association measures of relevant ‘aspects of collocativity’ in the present study.

Collocational strength can be measured by the MI score. An ‘MI score of 3 or higher’ is proposed to be ‘taken as evidence that two items are collocates’ (Hunston 2002: 71).

Interestingly, the MI score can be said to best suit the present research purpose as it focuses on the ‘more idiosyncratic collocates of a node’; and this indicates that ‘the items that have MI values are idiosyncratic instances peculiar to [one] corpus’ (Clear 1993: 281). That is, as McEnery and Wilson (2001: 86) argue, if the collocating items are to have ‘high positive mutual information scores’, then they are ‘more likely to constitute characteristic collocations’ than others ‘with much lower mutual information scores’. Thus, the MI score asks the question ‘how strongly are the words attracted to each other?’ (Evert 2009: 1228). Indeed, following the tradition of Church, Hanks and Moon (1994), I shall intersect the two measures (MI and t scores) and look at pairs that have high scores in both measures. This may be explained on the grounds that: 1) ‘the t test measures the confidence with which we can claim that there is some association’ (Church and Hanks 1990, cited in McEnery et al. 2006: 57); 2) ‘t-scores tend to show high-frequency [collocating] pairs’ (McEnery et al.: ibid.). Thus, the t test asks the question ‘how much evidence is there for a positive association between the words, no matter how small effect size is?’ (Evert 2009: ibid.). Note that ‘[a] t score of 2 or higher is normally considered to be statistically significant’ (McEnery et al. 2006: 56). Nevertheless, ‘[f]rom a theoretical perspective’, Evert (2005: 82) argues, the t-test ‘is not applicable to cooccurrence frequency data’. ‘It may thus be more appropriate’, he (ibid.: 83) continues to argue, ‘to interpret t-score as a heuristic variant of z-score that avoids the characteristic overestimation bias of the latter’ (i.e. rather than strictly as a significance test).
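
The intersection of the two measures can be sketched in Python as follows, using the textbook formulations MI = log2(O/E) and t = (O - E)/sqrt(O), where O is the observed co-occurrence frequency and E = f(node) x f(collocate) / N is the frequency expected by chance. This is a minimal sketch under stated assumptions: the function names and the frequencies are hypothetical, the thresholds are the ones cited above (MI of 3 or higher, t of 2 or higher), and WordSmith's own computation (for example, any adjustment of E for the size of the span) may differ.

```python
import math


def expected(f_node, f_coll, n):
    """Co-occurrence frequency expected by chance (no span adjustment)."""
    return f_node * f_coll / n


def mi_score(o, f_node, f_coll, n):
    """Mutual information: how strongly the two words attract each other."""
    return math.log2(o / expected(f_node, f_coll, n))


def t_score(o, f_node, f_coll, n):
    """t score: how much evidence there is for a positive association."""
    return (o - expected(f_node, f_coll, n)) / math.sqrt(o)


def passes_both(o, f_node, f_coll, n, mi_min=3.0, t_min=2.0):
    """Keep a pair only if it reaches both thresholds."""
    return (mi_score(o, f_node, f_coll, n) >= mi_min
            and t_score(o, f_node, f_coll, n) >= t_min)


if __name__ == "__main__":
    # Invented figures: (observed, f_node, f_collocate, corpus size).
    pairs = {
        "rare pair": (2, 10, 10, 10_000),
        "frequent pair": (200, 1_000, 1_000, 10_000),
        "balanced pair": (8, 100, 100, 10_000),
    }
    for label, args in pairs.items():
        print(label, round(mi_score(*args), 2), round(t_score(*args), 2),
              passes_both(*args))
```

With these invented figures, the rare pair reaches the MI threshold but not the t threshold, the frequent pair does the reverse, and only the balanced pair satisfies both. This is the practical motivation for intersecting the measures: MI favours idiosyncratic low-frequency pairs, whereas the t score favours high-frequency ones.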

The second criterion for selecting node words, qualitative in nature, is based on the researcher’s intuition about thematic relevance - an intuition formed by looking at concordances prior to actual analysis - where the elect node words constitute a semantic configuration of one theme in each text. For instance, in Chapter 6, a set of node words (alongside their potential collocates) has served the themes of ‘Wahhabi Islam’ and ‘Saudi Wahhabism’ as being collocationally realized in each text, with different representations. Also, in Chapter 7, where gender representations across the two texts matter, only gender-specific node words (and their potential collocates) have been considered.

The third criterion is linguistically motivated: node words in this study should mainly share a semantic or grammatical connection between the two texts, such as the node words WAHHABI (used by Schwartz) and WAHHAB’S (used by DeLong-Bas) that have been analysed in Chapter 6. Sometimes they may even be cross-textually identical both in form and meaning, such as the node words JIHAD (Chapter 5) and SAUDI (Chapter 6).

Before coming to the second micro procedural stage of describing collocations, I would like to touch upon the general corpus of American English used in the present study, the Corpus of Contemporary American English (COCA).

In document Ideological Collocation in Meta-Wahhabi Discourse Post-911 : A Symbiosis of Critical Discourse Analysis and Corpus Linguistics. (Page 141-145)