Chapter 3 ‘Breaking new ground’: formulaic language as a marker of authorship
3.4 Why might formulaic sequences be a reliable marker of authorship?
3.4.1 Theoretical basis for formulaic sequences as a marker of authorship
Beginning with the psycholinguistic theory of formulaic sequences, the central claim is that sequences of words are stored in the lexicon as single items (Bannard & Lieven, 2009; Ellis, 1996; Erman, 2007; Erman & Warren, 2000; Hoey, 2005; Pawley & Syder, 1983; Sinclair, 1991; Wray, 2000, 2002, 2008). This gives the speaker a processing advantage by reducing the cognitive burden of producing entirely novel language. If Wray (2002) is correct in her assertion that we only analyse those things which need to be analysed (needs-only analysis, cf. Section 3.2.1, p. 51), then as language users we will not necessarily focus on the internal constituents of these formulaic word sequences. Since sequences of words are stored in this pre-packaged holistic form, their occurrence in language may not be noticed by authors. Authors are therefore likely to produce sequences of words without necessarily thinking about each individual word, and it follows that if authors are unaware that they are using particular sequences of words, it will be much harder for them to disguise their style. This point is made by Lancashire (1998):
Word, phrase, and collocation frequencies … can be signatures of authorship because of the way the writer’s brain stores and creates speech. Even the author cannot imitate these features, simply because they are normally beyond recognition, unless the author has the same tools and expertise as stylometrists undertaking attribution research. Reliable markers arise from the unique, hidden clusters within the author’s long-term associative memory. (p. 299)
Additionally, there is growing support for the argument that even when there is opportunity for variability within a formulaic sequence, speakers holistically store a particular variant which works best for their needs. For example, Erman (2007) investigated the size of linguistic units in the mental lexicon, using pause distribution and pause duration as indicators. She hypothesized that pauses should be rare within prefabricated structures (“prefabs”, as defined in Section 3.1, p. 44) since they are stored, and therefore produced, holistically. Her data provided support for this hypothesis: pausing occurred only rarely between the component parts of prefabricated structures (p. 47). Erman also looked at prefabricated structures which allowed for some lexical variability (e.g. “that’s the big question in X”, where X can be filled with any discipline such as linguistics, history or science). She found no evidence that speakers paused more often or for longer at the point where one of the variable slots needed to be filled. This, she reasoned, suggests that no additional cognitive effort was required at the point where a single lexical item needed to be selected. Unfortunately, Erman provides only this one example, and it may well be argued that if a linguist is talking about linguistics, then filling this particular slot with the word “linguistics” probably would not require extra cognitive effort, since the speaker will have been primed by the context. However, other variable slots are evident in the author corpus data, and it is presumed that these are equally appropriate as examples: “X might say”, where X can be filled with “some” or “you”; and “It sounds ADV harsh”, where any adverb can be inserted, e.g. “very”, “really”, “quite”, “incredibly”. Reflecting on her finding, Erman suggests that
speakers may well make preferred choices, and the prefab may therefore be fixed and stored as a unit in the individual user’s lexicon. In other words, speakers make preferred choices also where the system allows sometimes considerable variation, which suggests that more combinations of words are presumably fixed in the individual speaker’s mental lexicon than will be indicated in dictionaries and corpora (p. 46).
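The idea of a preferred filler for a variable slot can be illustrated with a minimal sketch. It is not Erman's method, nor the procedure used in this study: the frame “it sounds ADV harsh” is taken from the examples above, but the function name, the toy texts and the author labels are all invented for illustration, and the pattern simply accepts any single word in the slot rather than checking that it is an adverb.

```python
import re
from collections import Counter, defaultdict

# Frame with one variable slot, taken from the example "It sounds ADV harsh".
# Assumption: any single word is accepted as the filler; no part-of-speech check.
FRAME = re.compile(r"\bit sounds (\w+) harsh\b", re.IGNORECASE)


def slot_fillers(texts_by_author):
    """Count, for each author, which words fill the slot in the frame."""
    counts = defaultdict(Counter)
    for author, texts in texts_by_author.items():
        for text in texts:
            for filler in FRAME.findall(text):
                counts[author][filler.lower()] += 1
    return counts


# Invented toy data, purely illustrative (not drawn from the author corpus).
toy_corpus = {
    "author_a": [
        "It sounds very harsh, I know.",
        "Maybe it sounds very harsh to you.",
    ],
    "author_b": [
        "It sounds quite harsh.",
        "It sounds incredibly harsh, but it sounds quite harsh either way.",
    ],
}

fillers = slot_fillers(toy_corpus)
for author, counter in fillers.items():
    # A single dominant filler would be consistent with a stored, preferred variant.
    print(author, counter.most_common())
```

On real data, a filler that dominates one author's use of a frame but not another's would be consistent with the suggestion that the preferred variant is stored as a unit, although frequency alone cannot show holistic storage.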
Whilst Erman (2007) was concerned with formulaic sequences that contained slots for variability, Peters (1983) highlighted that some sequences of words, which she called “idiosyncratic formulas”, may be stored holistically by an individual rather than by a particular speech community as a whole. Such sequences would be formulaic for the individual, even though the hearer would not necessarily recognise them as formulaic:
Thus, if I find an especially felicitous way of expressing an idea, I may store up that turn of phrase so that the next time I need it it will come forth as a prefabricated chunk, even though to my hearer it may not be distinguishable from newly generated speech. (p. 3)
An example of this is Tony Blair’s use of “entirely accepted”, amongst other collocations described in Section 2.1.4, which on the evidence is likely to be formulaic for him but less so for others. It is also unlikely that he is aware of his apparently idiolectal use of such expressions. Likewise, the Unabomber’s use of “cool-headed logicians” may be considered an idiosyncratic formula.
Turning next to the sociolinguistic aspects of formulaic language, Wray (2002, 2008) argues that formulaic language is a linguistic solution to a non-linguistic problem, namely how we get our needs met: “Formulaic choices will be made on the basis of this single agenda, by means of the drive to manipulate others’ actions, knowledge, or emotions to one’s own advantage” (Wray, 2008: 69). According to Wray, we store holistically those sequences of words for which we have a need. Therefore,
[w]hat ends up in the lexicon is a direct reflection of the way the language is operating for the individual in his or her speech community or communities. The nature of the lexicon is determined not by structural principles which decide whether an item is simple enough to be stored, but by the individual’s priorities in handling real language input (Wray, 2002: 267–8).
As such, there is potential for us all to have different inventories of formulaic sequences resulting from, amongst other things, differing needs and differing social and linguistic backgrounds.
Therefore, provided that there is an appropriate way to identify them (cf. Section 3.5), formulaic sequences should reliably mark out an individual author. In the following section, the limited research literature on formulaic sequences as a marker of authorship will be assessed to see whether this prediction is borne out.