The most extensive body of research on the processing of formulaic language by non-impaired speakers has focused on the comprehension of idioms. Early work in this area claimed that, since the meanings of idioms cannot be derived from their
component parts, they must be stored in the mental lexicon as individual items, akin to ‘big words’. Whereas the comprehension of literal strings of language is taken to involve decoding the component words and combining their meanings according to the general rules of the language, idioms are, researchers claimed, simply looked up as extended lexical items. Support for this picture came from the finding that idiomatic phrases (e.g. break the ice) are recognised as meaningful strings more rapidly than literal controls (e.g. break the cup) (Swinney & Cutler, 1979), which was taken to suggest the involvement of a rapid, holistic recognition mechanism. Models differed as to how and when literal and figurative readings became active. According to one
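model, a literal reading is attempted first and the figurative meaning is accessed only if the literal one fails (Bobrow & Bell, 1973); according to others, the two operate either simultaneously (Swinney & Cutler, 1979) or one after the other, with a literal reading invoked only if the figurative fails (Gibbs, 1980).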
Later research has largely rejected the idea that idioms are processed in an entirely holistic, word-like, manner (e.g., Cacciari & Tabossi, 1988; Gibbs, Nayak, & Cutting, 1989; Titone & Connine, 1999). Perhaps most influentially, Gibbs and Nayak (1989) note that approaches which see idioms as separate lexical entries fail to account for the productivity of certain idioms. That is, they cannot explain why some idioms can be syntactically altered while retaining their figurative meanings (e.g. ‘John laid down the law’ can become ‘The law was laid down by John’) but others cannot (‘John kicked the bucket’ cannot become ‘The bucket was kicked by John’ without losing its idiomatic meaning). To account for such cases, Gibbs and Nayak propose the idiom
decomposition hypothesis, according to which idioms may be classified as either decomposable or non-decomposable. If an idiom is decomposable, its constituent parts can be associated with the components of its literal referent. Thus, when pop the question is glossed as propose marriage, the noun question clearly refers to the proposal and the verb pop to the act of making it. In the same way, the law of lay down the law refers to rules, lay down to the act of invoking them. Non-decomposable idioms, on the other hand, do not display such correspondences: no part of kick the bucket can easily be analysed as referring to any part of dying, nor can chew the fat be broken down into components corresponding to parts of leisurely conversation. The idiom decomposition hypothesis claims that because the individual parts of decomposable idioms have recognisable meanings, those idioms will be syntactically productive (when the question is popped, the transformed components maintain their individual figurative meanings), while non-decomposable idioms cannot be altered without losing their figurative sense.
This division of idioms into decomposable and non-decomposable has been found to correspond to differences in processing. Examining reading times for the two types of phrase through a self-paced reading task, Gibbs, Nayak and Cutting (1989) found that decomposable idioms are read significantly faster than both similar literal phrases (e.g.
pop the question was read faster than ask the question) and non-decomposable idioms. Non-decomposable idioms, in turn, were read more slowly than similar literal phrases (kick the bucket took longer to read than fill the bucket). This led the authors to suggest a model on which readers always try to analyse phrases into component parts (an analysis which may, but need not, involve the activation of literal meanings). Decomposable idioms are still hypothesised to have directly stipulated meanings, but because their parts contribute systematically to these meanings, component analysis aids their recognition. Non-decomposable idioms, on the other hand, can only be processed holistically, the process of analysis doing nothing to aid their retrieval. It is this which makes their reading more difficult. While the reading of decomposable idioms is similar to that of literal strings, these idioms are read faster because they are more familiar to readers.
Peterson et al (2001) have further refined this picture of holistic vs. componential comprehension by separating the action of semantic from syntactic processing. Using a naming task, they found that, regardless of whether the preceding context biases a literal (e.g., the soccer player slipped when he tried to kick the) or a figurative (e.g., the man was very old and feeble and it was believed that he would soon kick the) reading, syntactically-congruent completions (e.g. town) were read faster than syntactically-incongruent completions (e.g. grow). This ‘syntactic priming’ effect held true regardless of the degree of decomposability of a phrase and suggests, they argue, that syntactic processing continues throughout the reading of an idiom. In a parallel study designed to detect ‘conceptual priming’, the same researchers found that literal phrases primed semantically congruent completions (e.g. after reading the stem the soccer player slipped when he tried to kick the, the concrete (and so kickable) noun shelf was named faster than the abstract (and so not kickable) truth), while matched figurative phrases (e.g. the man was very old and feeble and it was believed that he would soon kick the) did not. Peterson et al conclude that, though syntactic processing continues after a phrase has been recognised as idiomatic, processing of the literal semantics is halted.
The processing of corpus-derived formulas
Though the processing of idiomatic phrases has attracted the majority of research attention, it should be clear from Chapter 2 that such phrases make up only a small percentage of formulaic language. Moreover, it seems possible that the semantic idiosyncrasy of idioms sets them apart from other types of sequence. Indeed, the apparent influence of semantic type (i.e. decomposable vs. non-decomposable) on processing appears to indicate that the most relevant factor for these items is not their frequency but their meaning. It is only in recent years that the sorts of high frequency but semantically regular formulas identified through corpus analysis, which are our primary concern, have started to receive serious research attention.
An early study in this area is that of McKoon and Ratcliff (1992, described in more detail in Section 4.5), who looked for evidence of priming between high frequency collocates. Though they found a small priming effect, the authors acknowledged possible problems with their source items, both because of the potential unreliability of the small corpus used and because effects of frequency were not distinguished from those of psychological association (see Section 4.4), and so declined to draw strong conclusions. They do tentatively suggest, however, that co-occurrence statistics may have some applicability as predictors of priming.
Schmitt et al (2004) used a dictation task to determine whether recurrent word clusters identified by corpus analysis are stored in the mind as holistic formulas. The clusters used were sequences of between two and six words, some taken from published listings and others found by corpus analysis to be recurrent contexts of certain key words. A set of 25 clusters were selected which varied in length, frequency, transparency of meaning and according to how intuitively ‘holistic’ they appeared. These clusters were embedded in a story, which was recorded and played to native and non-native speakers of English in 20-25 word segments. After hearing each burst, native speakers were asked to perform a simple mental arithmetic task and then to repeat what they had heard. Non-natives were only asked to repeat the segment. The thinking behind this task was that participants’ working memories would be
overloaded and they would need to reproduce the segment using their own linguistic resources. Word-for-word reproduction of the target clusters, it was hypothesised, would indicate that the clusters were likely to be holistically stored. For native speakers, a great deal of variation was found between different clusters, with some (e.g. go away, I don’t know what to do) being faithfully reproduced by most
participants, and others (e.g. in the same way as, aim of this study) usually being either avoided or reproduced only partially. Non-natives, unsurprisingly, performed rather
less well overall than natives, and again there was much variation between phrases. For both groups, accuracy of reproduction was not found to correlate with either the frequency or the length of clusters; the researchers suggest that semantically more transparent items (e.g. go away) may have been better recalled, while sentence stems (e.g. in the same way as) were less well remembered, but strong correlations were not found. They conclude that frequency data from corpora do not appear to be particularly strong predictors of holistic storage. However, the somewhat eclectic mix of clusters used in the study, and the use of a methodology (the dictation task) whose ability to tap holistic processing had not been independently validated, render this conclusion rather speculative.
Schmitt and Underwood (2004) also failed to find evidence for the holistic storage of corpus-derived clusters. As in Schmitt et al (2004), some clusters were taken from published listings and others were identified by corpus analysis as recurrent contexts of certain key words. The final listing of 21 items included lexical phrases, transparent metaphors, sayings/proverbs, and idioms. Sequences were between four and eight words long and were deemed to be relatively predictable from their initial words; the authors also report that items which appeared with low frequency in the British National Corpus or CANCODE (a corpus of spoken English) were excluded, though they do not specify what level of frequency was required for inclusion. The phrases were embedded in short contexts, which were presented to participants word-by-word on a computer screen, with participants pressing a button to bring up each new word. The thinking behind this method was that the time taken for participants to press the button would indicate how long it had taken them to recognise and process the word. If formulaic sequences are stored holistically, the authors reasoned, recognition times for the latter parts of these phrases should be hastened once the phrase has been recognised. However, neither native nor non-native participants demonstrated any advantage in reading the final words of formulaic sequences over the same words in non-formulaic contexts. As with the previous study, the mix of items used, and the failure to provide specific frequency data, make these results rather difficult to interpret. Moreover, as the authors themselves suggest, the word-by-word presentation paradigm may have disrupted normal holistic processing strategies.
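The logic of the word-by-word self-paced reading measure can be made concrete with a minimal sketch. It assumes that each trial is logged as a list of cumulative button-press timestamps in milliseconds, with the first entry marking the onset of the first word; the function names and the logging format are illustrative assumptions, not details reported by Schmitt and Underwood (2004).

```python
from statistics import mean


def reading_times(press_times_ms):
    """Convert cumulative button-press timestamps into per-word reading times.

    Assumed format: press_times_ms[0] is the onset of the first word, and each
    later press both ends the current word and reveals the next one.
    """
    return [later - earlier
            for earlier, later in zip(press_times_ms, press_times_ms[1:])]


def final_word_advantage(formulaic_trials, control_trials):
    """Mean reading time of the final word in control contexts minus the mean
    reading time of the same word at the end of a formulaic sequence.

    A positive value would indicate faster reading inside the formula.
    """
    formulaic = mean(reading_times(trial)[-1] for trial in formulaic_trials)
    control = mean(reading_times(trial)[-1] for trial in control_trials)
    return control - formulaic
```

On the holistic storage account, `final_word_advantage` should be reliably positive; the null result reported above corresponds to a value not distinguishable from zero.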
Other recent studies do appear to provide support for a link between the frequency of clusters in corpora and holistic storage. Using the same materials as Schmitt and Underwood (2004), Underwood et al (2004) used an eye-tracking paradigm to study native and non-native speakers’ reading of formulas embedded in short contexts. They found that both natives and non-natives fixated on target words less often when they appeared as the final words of a formulaic sequence than when they appeared in other contexts. They also found that natives (but not non-natives) fixated on targets for shorter durations when they were found within formulas. Underwood et al conclude that these results are consistent with the idea that formulas are stored holistically by native speakers and suggest that the somewhat ambiguous results seen for non-natives (fewer, but not shorter, fixations) may indicate only partial knowledge of the formulas. The emergence of reliable effects here, using the same materials as those in Schmitt and Underwood (2004), suggests that the failure to find an advantage for formulaic sequences in that study may have been due to the acknowledged methodological problems.
Jiang and Nekrasova (2007) also claim to find evidence for holistic storage of formulaic sequences in both native and non-native speakers. They found grammaticality judgements for 26 formulaic sequences taken from previous corpus-based studies (again, no actual frequency data are provided) to be both faster and more accurate than judgements for matched control strings (in which formulas were changed by one word to create a more novel string). This suggests, they conclude, that the formulaic items are recognised holistically, obviating the need for the full syntactic analysis which must presumably take place for the novel strings.
In a series of studies, Tremblay et al (in preparation) used self-paced reading and memory tasks to determine whether high frequency lexical bundles are holistically stored in the mind. The lexical bundles were either four-word strings found in the spoken part of the British National Corpus (BNC) with a mean frequency of at least 10 occurrences per million words, or five-word strings appearing in the same corpus at least five times per million words. Control phrases were created by substituting one word in each string with a replacement which was individually more frequent and (on average) shorter than the original and such that the new phrase was less frequent in the BNC than the original. In a word-by-word self-paced reading task, the replaced word
in the original bundle was found to be read significantly faster than its replacement in the control phrase; in a segment-by-segment self-paced reading task, entire lexical bundles were read significantly faster than control phrases, and in a sentence-by-sentence task, sentences containing the original bundles were read significantly faster than those containing their replacements.
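The selection criteria described here lend themselves to a short illustration. The sketch below is not the authors' procedure: it assumes a corpus already tokenised into a flat list of words, normalises n-gram counts by the total token count to give occurrences per million words, and applies the two thresholds reported above (10 per million for four-word strings, 5 per million for five-word strings). The function names and the frequency dictionaries passed to `is_valid_control` are hypothetical.

```python
from collections import Counter

# Thresholds reported in the text: n-gram length -> minimum occurrences per million words.
BUNDLE_THRESHOLDS = {4: 10.0, 5: 5.0}


def ngrams(tokens, n):
    """Return every contiguous n-word sequence in a token list."""
    return zip(*(tokens[i:] for i in range(n)))


def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million words (assumed normalisation)."""
    return count * 1_000_000 / corpus_size


def lexical_bundles(tokens, n, min_per_million):
    """Map each n-gram at or above the threshold to its per-million frequency."""
    counts = Counter(ngrams(tokens, n))
    total = len(tokens)
    return {
        " ".join(gram): per_million(count, total)
        for gram, count in counts.items()
        if per_million(count, total) >= min_per_million
    }


def is_valid_control(bundle, control, word_freq, phrase_freq):
    """Check the per-item constraints on a control phrase: exactly one word is
    replaced, the replacement is individually more frequent than the original
    word, and the resulting phrase is less frequent than the bundle.

    The requirement that replacements be shorter on average is a property of
    the whole item set and is not checked here.
    """
    bundle_words, control_words = bundle.split(), control.split()
    if len(bundle_words) != len(control_words):
        return False
    changed = [(old, new) for old, new in zip(bundle_words, control_words) if old != new]
    if len(changed) != 1:
        return False
    old, new = changed[0]
    return (word_freq.get(new, 0) > word_freq.get(old, 0)
            and phrase_freq.get(control, 0) < phrase_freq.get(bundle, 0))
```

Calling `lexical_bundles(tokens, 4, BUNDLE_THRESHOLDS[4])` on a tokenised corpus would return the candidate four-word bundles; `is_valid_control` makes explicit why any reading-time advantage for the bundle cannot be attributed to the frequency of the individual replaced word.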
Working on the idea that holistically-stored bundles should take up less space in working memory than novel strings of the same length, which need to be represented word-by-word, Tremblay et al also tested the memory load of lexical bundles and their controls. They presented subjects with the lexical bundles or control phrases described above, along with a string of individual words, and then asked them to recall the phrase and the words. When the input was presented visually, they found significantly better recall for both lexical bundles and their following words than for control phrases and their following words, suggesting that lexical bundles may indeed place less strain on working memory. When input was presented in the auditory modality, better recall was again found for the lexical bundles than for the control phrases, but not for their following words. The authors speculate that this may have been because natural intonation features of the lexical bundles had been deliberately stripped out of the recordings by using synthesised speech.