Directional measures of collocation - High frequency collocations and second language learning

All of the measures of collocation discussed so far are non-directional, in the sense that it makes no difference which part of the word pair is taken as node and which as collocate. However, this may be misleading. As Stubbs points out, though the pair kith and kin have the same score on all of the measures regardless of which word is taken as the node, the relationship between the two words is clearly not symmetrical: kith predicts kin with around 100% certainty, whereas kin can be found in other contexts (1995, p. 35). The non-directionality of these measures may be particularly problematic for our task of predicting the psychological correlates of frequency data, since it seems highly likely that any associative links running from kith to kin will be stronger than those running in the opposite direction. It would therefore be useful to have a statistic which reflects this.

A simple way of achieving a directional score would be to calculate the conditional probability of one word, given another. This could be done by simply dividing the frequency of the word pair by the frequency of the node. Since the conditional probabilities are likely to be rather small, this figure can be multiplied by 100 for ease of reading:

$$P(w_2 \mid w_1) = 100 \times \frac{f(w_1 w_2)}{f(w_1)}$$

Thus, to return to our earlier example, the conditional probability of the collocate tea, given the node strong, is:

$$100 \times \frac{28}{15{,}768} = 0.178$$

while the conditional probability of the collocate strong, given the node tea, is:

$$100 \times \frac{28}{8{,}030} = 0.349$$

indicating that this collocation is rather more important for tea than it is for strong. This approach has not been widely used in corpus linguistics, though Handl (2008) has recently suggested a similar method, and psychologists have speculated that the formula described here may be related to word association norms (Anderson, 1990, p. 64).
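The calculation above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `directional_score` is hypothetical, and the three frequencies are the ones quoted in the strong/tea example above.

```python
def directional_score(pair_freq: int, node_freq: int) -> float:
    """Conditional probability of the collocate given the node, times 100.

    pair_freq: how often the word pair occurs.
    node_freq: how often the node occurs on its own.
    """
    if node_freq <= 0:
        raise ValueError("node frequency must be positive")
    if pair_freq < 0:
        raise ValueError("pair frequency cannot be negative")
    return 100 * pair_freq / node_freq


# Frequencies quoted in the worked example in the text.
PAIR_FREQ = 28        # strong + tea
STRONG_FREQ = 15_768  # strong
TEA_FREQ = 8_030      # tea

tea_given_strong = directional_score(PAIR_FREQ, STRONG_FREQ)
strong_given_tea = directional_score(PAIR_FREQ, TEA_FREQ)

print(round(tea_given_strong, 3))  # 0.178
print(round(strong_given_tea, 3))  # 0.349
```

Because only the denominator changes between the two calls, the score is always higher in the direction of the less frequent word, which is the asymmetry the measure is meant to capture.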

Variables

Using any of the above methods will involve the analyst in two important decisions which have not yet been addressed: how close together two words need to be to count as ‘co-occurring’ (the question of ‘span’); and whether we should pool the counts for each inflectional/derivational form of a word - so that, for example, argue strongly, argued strongly and strong argument would count as three occurrences of a single collocation - or whether separate counts should be made for each form (the question of ‘lemmatisation’).

With regard to span, Jones and Sinclair report that the vast majority of a word’s collocational influence is found within a span of four words to its left and right (1974, pp. 21-22). Though much longer-distance dependencies have been claimed to exist (Clear, 1993, p. 276), this ‘+/- 4 word’ guideline has been widely accepted (Hoey, 2005, pp. 4-5). A less satisfactorily resolved issue related to span selection is whether association measures should be adjusted to take account of the span used. We have seen that standard association measures are based on comparing the number of times we would expect to find two words together if they were selected at random with the number of times we actually find them together. Clearly, however, the number of times we would expect to find two words directly adjacent to each other is rather lower than the number of times we would expect to find those words somewhere within a span of +/- 4 words of each other. Specifically, if the probability that word2 is the word directly after word1 is given by the formula:

P(word1) x P(word2)

then the probability that word2 is one of the eight words falling within a +/- 4 word span of word1 is:

8 x P(word1) x P(word2)

To maintain the original logic of the association measures, therefore, we would need to make this adjustment when calculating the ‘expected frequency’ part of the equations. While some publicly-available software for calculating association measures allows this adjustment to be made (e.g. T/Z and Mutual Information Calculator (Klarskov Mortensen, 2003)), others (e.g. WordSmith Tools (Scott, 1996)) do not make any adjustment for span. It could be argued that the latter choice violates the logic of the original formulas, leading to artificially-inflated scores when wider spans are used and to non-comparability between studies using different spans. On the other hand, the author of WordSmith Tools argues against including any adjustment for span on the grounds that word pairs which frequently co-occur directly next to each other (e.g. rely-on) should not, for that reason alone, be considered stronger than pairs which frequently appear at a certain distance from each other (e.g. kith-kin). “[I]f one CASTS ASPERSIONS on something”, he asks, “is that more linked that when ASPERSIONS got CAST on it?” (Scott, personal communication). The issue of adjusting association measures for span remains, then, a moot one. Since much of the corpus-based work in this thesis depends on WordSmith Tools, I will follow Scott in not making any such adjustment.

On the question of lemmatisation, Halliday (1966, p. 151) has argued that collocation should be seen as existing between ‘words’ at a rather high level of abstraction. On this view, strong, strongly, strength and strengthened, for example, should all be regarded as “the same item”; and a strong argument, he argued strongly, the strength of his argument and his argument was strengthened are all “instances of the same syntagmatic relation”. Halliday’s argument is that restating the syntagmatic relationship for each form of the words involved would add complexity without a gain in descriptive power because, as far as the collocational pattern is concerned, differences between word forms are irrelevant. Since Halliday published these remarks, however, the assumption that differences between word forms are irrelevant to collocation has been widely questioned. Amongst others, Sinclair (1991, p. 8), Clear (1993, p. 277), Stubbs (1996, p. 38), and Hoey (2005, p. 5) have all argued that lemmatisation may disguise differences in the collocational preferences of different forms of a word. Clear, for example, notes that collocations such as vested interest, crying shame, and bodes ill are all restricted to particular inflected forms, a point that would be lost in a lemma-based analysis. Moreover, Clear points out, lack of lemmatisation rarely if ever disguises a collocation, since “one of the inflected forms will appear as a significant collocate, and the potential for the other forms in the paradigm to collocate will be apparent to the human analyst” (1993, p. 277). In the studies that follow, no lemmatisation is used in tallying collocations unless specifically noted.
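The practical effect of the two decisions discussed in this section (span and lemmatisation) can be sketched as follows. This is a minimal illustration rather than the procedure used by WordSmith Tools or any other package: the function `count_collocations`, the toy token list with its `pad` filler tokens, and the hand-written lemma table are all hypothetical.

```python
from collections import Counter


def count_collocations(tokens, span=4, lemma_of=None):
    """Count (node, collocate) pairs found within +/- span tokens.

    If lemma_of (a dict mapping word form -> lemma) is given, forms are
    pooled under their lemma before counting; otherwise each form is
    counted separately.
    """
    if lemma_of is not None:
        tokens = [lemma_of.get(tok, tok) for tok in tokens]
    counts = Counter()
    for i, node in enumerate(tokens):
        start = max(0, i - span)
        end = min(len(tokens), i + span + 1)
        for j in range(start, end):
            if j != i:
                counts[(node, tokens[j])] += 1
    return counts


# Toy data: three phrases separated by filler so their windows do not overlap.
PAD = ["pad"] * 5
TOKENS = ["argue", "strongly"] + PAD + ["argued", "strongly"] + PAD + ["strong", "argument"]

# Hypothetical lemma table pooling inflectional and derivational forms.
LEMMAS = {"argue": "ARGUE", "argued": "ARGUE", "argument": "ARGUE",
          "strong": "STRONG", "strongly": "STRONG"}

by_form = count_collocations(TOKENS, span=4)
by_lemma = count_collocations(TOKENS, span=4, lemma_of=LEMMAS)

print(by_form[("argue", "strongly")])  # 1
print(by_lemma[("ARGUE", "STRONG")])   # 3
```

Counting by form keeps argue strongly, argued strongly and strong argument apart, as in the studies that follow; pooling by lemma treats them as three occurrences of a single collocation, in line with Halliday's view.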
