Chapter 8 ‘Come to think of it’: a consideration of the issues
8.5 The nature of formulaic sequences
One of the main arguments for supposing that formulaic sequences would make an ideal marker of authorship was based on the fact that they are ubiquitous. This claim should therefore be assessed. In Chapter 7, some comparison was made between the findings from the present research and other
research, and the point was made that the count of formulaic sequences in the author corpus was lower than other researchers had found in their data. This is unsurprising given that the approaches to identification differed, along with definitions of what was counted. Therefore, drawing comparisons with incompatible research is less insightful. Instead, three claims based on the present research can be made:
1) Authors used an average of seven formulaic clusters in each of five texts. The average sub-corpus length is 3,256 words and the average formulaic cluster consists of 3.1 words equating to 6.6 formulaically-used words per 1,000 overall words;
2) Formulaic sequences using the core word way occur 103 times in the total corpus of 65,113 words. The average length of a way-phrase is 3.2 words, so way-phrases occur 1.58 times per 1,000 words with 5.06 formulaically-used words per 1,000 overall words; and
3) Items identified using the formulaic sequences reference list had an average length of 2.7 words. A total of 604 formulaic sequences were identified, equating to approximately 25.05 words per 1,000.
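The arithmetic behind these three figures can be sketched as follows. This is an illustrative reconstruction rather than the procedure used in the research: the function names are hypothetical, and the first calculation assumes that the seven clusters are counted across an author's whole sub-corpus of 3,256 words, which is the reading that comes closest to the reported 6.6.

```python
def per_thousand(count, corpus_words):
    """Rate of occurrence per 1,000 running words."""
    return count / corpus_words * 1000


def formulaic_words_per_thousand(sequence_count, mean_length, corpus_words):
    """Words belonging to formulaic sequences, per 1,000 running words.

    The number of formulaically-used words is estimated as
    sequence_count * mean_length (an approximation based on the mean).
    """
    return per_thousand(sequence_count * mean_length, corpus_words)


# 1) Formulaic clusters: ASSUMES seven clusters per 3,256-word sub-corpus.
clusters = formulaic_words_per_thousand(7, 3.1, 3256)

# 2) way-phrases: 103 occurrences in the 65,113-word corpus.
way_rate = per_thousand(103, 65113)
way_words = formulaic_words_per_thousand(103, 3.2, 65113)

# 3) Reference list: 604 sequences with a mean length of 2.7 words.
reference_list = formulaic_words_per_thousand(604, 2.7, 65113)

print(round(clusters, 2), round(way_rate, 2),
      round(way_words, 2), round(reference_list, 2))
```

Under these assumptions the way-phrase and reference list figures reproduce the reported 1.58, 5.06 and 25.05, while the cluster figure comes out at roughly 6.66 against the reported 6.6.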
Viewed in this light, it cannot be claimed that these particular formulaic sequences are ubiquitous in short written narratives. It can barely be argued that these formulaic sequences are even frequent, given that the most prominent measure calculates that a word which is part of a formulaic sequence occurs roughly 25 times per 1,000 words. The reason for these low frequencies has been acknowledged throughout this research: an automated approach was always expected to yield less data than the more intuitive approaches used by other researchers and described in Chapter 3. It should also be considered that identifying formulaic sequences which can be classed as formulaic clusters and identifying formulaic sequences through the use of just one core word are two very limited and very narrow approaches. These frequency scores are therefore understandably low. The frequency score for the formulaic sequences reference list is perhaps more surprising since, by virtue of its being a very large, inclusive list of formulaic sequences, more occurrences should have been identified. As previously stated, the automated nature of the approach bears the brunt of the blame. However, another contributory factor is likely to be that found by Moon (1998a): quite simply, some formulaic sequences, particularly idioms and fixed expressions (e.g. kick the bucket), just do not occur as frequently in the English language as intuition might suggest. Basing a reference list on such items will therefore have limitations. Nonetheless, the fact that this approach in particular was successful at establishing variation between authors and attributing Questioned Documents, together with the fact that some authors have a higher count of formulaic sequences than others, adds further support to the rationale behind this research; that is, if intuition suggests such formulaic sequences to be
common in English, the fact that they are not common for some authors, and that their counts are higher than average for other authors, indicates that this is a useful marker of authorship of which authors are unaware.
It is interesting that no author showed a preference for any one formulaic sequence, again with the exception of Rose and in a way. None of the three approaches revealed that an author has a favourite phrase which they use consistently, whether distinctive or not. The conclusion to reach from this is that some phrases are distinctive (e.g. the Unabomber’s use of cool-headed logicians and you can’t eat your cake and have it too) but rare, whilst others are consistently used but not distinctive (e.g. Rose’s use of in a way). This is an important point, since the three methods outlined in this research cannot identify distinctive and rare formulaic sequences, and qualitative analysis would be necessary to identify such admittedly important phrases. The approaches outlined in this research cannot, and should not, replace the traditional qualitative approach of close-reading a text to identify distinctiveness. It is also possible that focussing on specific patterns of formulaic sequences is not the only way to approach the data, and that an approach drawing on other lexical richness measures such as authorial pace, “the frequency with which he [the author] generates new words and allows them to enter his manuscript” (Baker, 1988: 36), may be developed. This is particularly appealing since authorial pace has been argued to be characteristic of authorial style regardless of text length or genre (Baker, 1988). Where authorial pace is expressed as Pace = 1 / type-token ratio, a new measure which calculates the rate at which new sequences of words enter a text may offer additional avenues to explore, particularly since formulaic sequences have been argued to constitute one lexical choice in this research.
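A minimal sketch of how such a measure might be computed is given below. It assumes the conventional definition of the type-token ratio as the number of distinct words (types) divided by the number of running words (tokens), so that Pace = 1 / type-token ratio reduces to tokens divided by types. The function sequence_pace is a purely hypothetical analogue that applies the same ratio to word n-grams; it was not part of the present research, and the simple tokeniser is likewise an assumption made for illustration.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into word tokens (simplified)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Distinct words (types) divided by running words (tokens)."""
    if not tokens:
        raise ValueError("cannot compute a ratio for an empty text")
    return len(set(tokens)) / len(tokens)


def pace(tokens):
    """Authorial pace expressed as 1 / type-token ratio."""
    return 1 / type_token_ratio(tokens)


def sequence_pace(tokens, n=2):
    """HYPOTHETICAL analogue of pace for word sequences (n-grams).

    Treats each run of n adjacent words as a single unit and applies
    the same tokens-over-types ratio to those units.
    """
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        raise ValueError("text is shorter than n words")
    return len(grams) / len(set(grams))


sample = tokenize("In a way it was, in a way, the only way")
print(pace(sample))              # 11 tokens over 7 types
print(sequence_pace(sample, 2))  # 10 bigrams over 8 distinct bigrams
```

On this formulation a higher value indicates more repetition, whether of single words or of word sequences, so a writer who relies heavily on recurring sequences would be expected to show a higher sequence-level value than one who does not.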
The fact that no other author appeared to have a set of preferred formulaic sequences is in keeping with Wray (2002) who argues that:
[E]ach person, in each unique situation, will apply slightly different selection criteria to a slightly different set of options, from those available to anyone else. Certainly there will be very many similarities between individuals, insofar as they share, within a given environment or speech community, an inventory of idiomatic forms and certain interactional expectations of, and towards, each other. But, just as it will be possible, through such similarities in formulaic speech patterns, to spot people who come from the same place, are the same age or share the same interests or beliefs, so it will rarely be possible fully to predict which formulaic sequences a given speaker will select, since the balance of priorities is constantly shifting, and with it, the relative usefulness of the stored sequences (Wray, 2002: 101–2).
Wray’s assertion that predictions about which formulaic sequences an author will select are not possible accounts for why the authors in this research did not appear to prefer certain formulaic sequences, since, if the priorities on the language user are constantly shifting, then texts composed
on different days at different times will inevitably not be fully comparable. This adds further support to the notion that the overall count of formulaic sequences may be more indicative of authorship than any single formulaic sequence—since the actual forms of formulaic sequences that authors use may vary, the degree of overall reliance on formulaic sequences may not.
But are the word sequences identified in this research really formulaic sequences?
In Chapter 3, Hoey’s (2005) theory of lexical priming was introduced. Hoey proposed that words are primed for other words, and so too are word sequences (‘nesting’), and crucially that primings can vary for individuals, depending on their experiences of the contexts and cotexts in which those words are used. Wray (2008), talking of her needs-only analysis model (also described in Chapter 3), highlights the difference between lexical priming and her model. Both offer a psychological explanation for how words co-occur to reduce processing effort, making the models “highly compatible” (Wray, 2008: 67). However, the crucial difference lies in the fact that for Hoey single words are primed first with word sequences being later primed whereas Wray places the emphasis on whichever lexical unit “constitutes the largest form-meaning mapping so far found adequate to handle the effective manipulation of input and output” (2008: 67). This is in many cases the word, but can also be larger or smaller components—morpheme equivalent units (as described in Chapter 3)—which provide “a layer of wrapping that protects the components from analysis under normal circumstances” (p. 67). Therefore, according to Wray’s model, words that occur in a morpheme equivalent unit can be considered to be collocation associations if they occur adjacently in text, but they “are not really ‘associates’ in his [Hoey’s] sense at all. Rather, they are sub-parts of a single large unit, much as ‘im-’ often occurs adjacent to ‘possible’” (p. 67).
What this means for the present research is that there are two potentially valid theories for why some authors use particular sequences of words, whereas others use different combinations. The present research focuses only on written output and so it is not possible to argue support for either theory. It might therefore be observed that Alan-3 used the word sequence besides the point which occurred only once in the entire author corpus and there were no occurrences of beside the point, yet in the BNC, beside the point occurs 74 times with besides the point occurring only once. This is therefore a rare word sequence which appears to be distinctive for Alan. What cannot be claimed is whether this is holistically processed as a single lexical item, for which Alan has no need to process the constituent words (as in Wray’s needs-only analysis model), or whether it was a low-level writing mistake, or whether Alan has a specific nested priming which was appropriate in this particular context at that particular moment of text creation, just as Hoey (2005) found around the
world to be five times more frequent in his corpus of newspaper articles than round the world (discussed in Section 6.3, p. 141). On this point, Hoey (2005) claims:
[I]t could be that for such co-occurrences one speaker has round the world (and not around the world) primed while a second speaker is primed to use around the world (and not round the world) (p. 74).
In short, it has only been possible to speculate on what should be formulaic, or what appears to be formulaic, either for an individual author or for the group. It has been helpful to characterise the items identified as formulaic since there is a rationale for why those sequences occur and because they are a low-level feature. But what evidence exists that they actually are formulaic? For each of the methods, the case has been made that the items are formulaic (albeit with a small measure of non-formulaic material such as the single words way and ways, for example). However, in strict evidential terms, there is no way of knowing whether these items actually are formulaic for these authors. Such theories are based on proposals for how the mental lexicon may operate, and not on how the mental lexicon actually operates for each of the 20 authors under investigation in this study. In reality, this means that whilst evidence can be claimed of formulaic sequences being processed differently to novel language (Conklin & Schmitt, 2008; Erman, 2007; Erman & Warren, 2000; Underwood, Schmitt, & Galpin, 2004), this evidence focuses on groups of people, rather than on individuals. This is clearly at odds with the current research, which rests on the notion of individual uses of formulaic sequences. Therefore, even if some find the label ‘formulaic’ in this context to be problematic, the value of the results still stands: authors do use different counts of items included in the reference list and they do use different forms of clusters, both of which have been shown to vary statistically between authors and in some cases enable the successful attribution of a Questioned Document. In short, the label may be wrong, but this marker of authorship still shows promise.