Studies of advanced non-native language

Formulaic language in advanced non-native speech

A number of studies have used ‘learner corpus’ methodologies to study the occurrence of formulaic language in advanced non-native speech. An early example is DeCock et al (1998), who examined ‘recurrent word combinations’ in a 60,000 word corpus of informal English language interviews with advanced (L1 French) learners. These were compared with the recurrent combinations found in an 80,000 word corpus of similar interviews with native speakers of English. The researchers automatically extracted from the corpora all continuous two-, three-, four-, and five-word combinations occurring with frequencies of greater than nine, four, three and two respectively. They found as much use of such sequences in the non-native as in the native corpus. Indeed, non-natives were found to use significantly more of the longest (i.e. four- and five-word) combinations than their native counterparts. They also found that non-natives tended to repeat combinations more often, though the differences here were not large: the average log type-token ratio across different lengths of combination was 72.5 for natives and 70.6 for non-natives. DeCock et al also found that native and non-native combinations differed somewhat in character. The learners, they noted, made more use of hesitation phenomena, such as repetitions (the the; I I) and filled pauses (and er), while frequent native items such as you know, sort of, I mean were less prominent in non-native speech. Looking specifically at a defined sub-type of recurrent combinations – those used to mark vagueness – they found significant underuse by non-natives, and also misuse, in that combinations were used in different syntactic and pragmatic contexts from those seen in the native corpus.
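The extraction procedure described for DeCock et al can be illustrated with a minimal Python sketch. The function names and the whitespace-tokenised input are assumptions made for illustration, not the authors' own tooling, and the ratio computed here is a plain type-token ratio per 100 tokens rather than the log variant reported above, whose exact definition is not given in this section.

```python
from collections import Counter

# Frequency a combination must exceed, keyed by its length in words
# (the thresholds reported above: greater than 9, 4, 3 and 2).
THRESHOLDS = {2: 9, 3: 4, 4: 3, 5: 2}


def recurrent_combinations(tokens, thresholds=THRESHOLDS):
    """Return {n: Counter} of continuous n-word combinations above threshold."""
    result = {}
    for n, minimum in thresholds.items():
        # Count every continuous window of n tokens.
        counts = Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
        # Keep only combinations occurring more often than the threshold.
        result[n] = Counter({g: c for g, c in counts.items() if c > minimum})
    return result


def type_token_ratio(counts):
    """Distinct combinations (types) per 100 occurrences (tokens)."""
    tokens = sum(counts.values())
    return 100 * len(counts) / tokens if tokens else 0.0


# Hypothetical usage on a toy transcript.
sample = ("you know " * 10).split()
combos = recurrent_combinations(sample)
print(combos[2], type_token_ratio(combos[2]))
```

A lower ratio indicates that the same combinations are repeated more often, which is the sense in which the non-native figure above points to slightly more repetition.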

Oppenheim (2000) studied the use of ‘recurrent sequences’ (as identified by human raters) in a set of three-minute speeches given from notes by six advanced non-native speakers. Her subjects made extensive use of recurrent sequences, with a mean of 66.4% of words taking part in such sequences. Interestingly, it seems that these learners used recurrent sequences quite consciously. All reported that they made an effort to learn phrases and collocations and that, when preparing their speeches, “pieces of phrases, entire phrases, or strings of phrases came to mind, as opposed to individual words or pieces of words” (2000, p. 235). Oppenheim found only limited variation between individuals in their use of such sequences, and a high level of homogeneity within individual speakers’ language across two topics and two deliveries of each speech. She also found that different speakers tended to rely on different types of recurrent sequence, and that this reflected to a certain extent conscious strategies. Thus, speakers who reported that they concentrated on organisation in their speeches were found to use a high proportion of organisational sequences, while those who placed more focus on a fluent, nativelike delivery made greater use of immediate repetition, a device that appears to aid fluency. Like DeCock et al, Oppenheim also found that sequences were often not nativelike. Indeed, she notes that sequences were “almost exclusively… idiosyncratic” (2000, p. 235).

Foster (2001) looked at formula use in a 20,000 word corpus of spoken data produced by 32 native and 32 non-native speakers of English completing three different interactive tasks (a personal information exchange, a narrative, and a discussion), either with or without time allowed for planning. She asked a panel of seven native speakers to identify language which appeared to be produced as a ‘chunk’, rather than word by word, or which was part of a sentence ‘stem’ which had required morphological adjustment or lexical addition. Language identified by five or more informants was taken to be ‘lexicalised’. She found that a higher proportion of native than of non-native speech was lexicalised. Interestingly, when native speakers were given time to plan for the tasks, their reliance on lexicalised language decreased (from 32.29% to 25.08%). Non-natives did not show a similar tendency (16.87% in unplanned and 17.23% in planned conditions). This suggests that natives, but not non-natives, employed a higher level of lexicalised language to help cope with the pressures of coming up with both content and language simultaneously. Foster notes that much of this language consisted of time-filling phrases such as I don’t know, I

Foster also found that the diversity of native speaker phrases increased in the planned condition. Whereas in the unplanned condition 32.4% of their lexicalised language was made up of phrases repeated seven or more times each, in the planned condition only 20.8% consisted of such highly-repeated phrases. At the other end of the spectrum, the percentage of lexicalised language comprising phrases used only once increased from 31.9% in the unplanned to 55.6% in the planned condition. Non-natives showed much less diversity overall in their phraseology, and contrary to the native speaker pattern their usage became less diverse in the planned condition (unplanned: 42.5% repeated at least seven times, 24.9% used once; planned: 55.3% repeated at least seven times, 16% used once).

Foster also looked at the accuracy, complexity, and fluency (measured in number/length of pauses) of speech across conditions. For both natives and non-natives, complexity and fluency increased in the planned conditions, and for non-natives accuracy also increased (this was not measured for natives). She concludes that, under the less pressurised conditions of the planned task, native speakers used a “more fluent, open-choice, rule-based style of language” (2001, p. 89) than in the unplanned condition, which elicited a greater reliance on lexicalised language, less complexity and less fluency. For non-natives, on the other hand, the difference between the two conditions rested solely in their producing more complex, accurate and fluent language, with no corresponding change in reliance on lexicalised language. This, Foster suggests, indicates that they relied in both conditions on a rule-based approach to language, requiring either pausing or planning time to execute fully and accurately.

Adolphs and Durow (2004) studied the use of three-word sequences produced in informal English language interviews by two L1 Mandarin students enrolled on master’s degrees at a UK university. Two interviews were analysed for each student: one recorded when they were attending a pre-sessional English language course, and one recorded seven months later, when they were some way into their course of study. The two students were selected for analysis from a larger group on the grounds that they represented extremes of ‘social integration’ into the local community – one (‘Beth’) having joined social groups and made many native-speaking friends, while the other (‘Ann’) was much less integrated. The interviews ranged in length from 3,046 to 7,162 words (excluding the interviewer’s turns). In the first part of their analysis, Adolphs and Durow identified the ten most frequent three-word sequences in each interview. They found that, for each speaker, the total percentage of production made up of the ten most frequent sequences rose slightly (from 2.38% to 3.53% for Ann and from 1.34% to 1.48% for Beth) from the first to the second interview. In contrast, the total contribution to production of all recurrent three-word sequences produced more than twice decreased slightly (from 20.98% to 18.93% for Beth and from 12.66% to 9.55% for Ann). Repetition of sequences therefore decreased somewhat overall, but there was a simultaneous increase in the degree to which production relied on a small group of favoured items. Perhaps the most interesting finding from this analysis lies in the nature of the top-ten sequences themselves. For both speakers (but especially for Beth), sequences in the first interview consisted largely of hesitation markers (just I er; I I I; yeah just er), while in the second interview these were almost entirely supplanted by more meaningful items (a lot of; it’s very nice; I got some). This would appear to indicate that the learners were moving away from the ‘idiosyncratic’ sequences identified by DeCock et al and Oppenheim and towards more nativelike usage.

Similar conclusions can be drawn from the second part of Adolphs and Durow’s analysis. Here, they looked specifically at how closely the phraseology of lexical items matched native speaker usage. Identifying the 15 most frequent lexical words in each interview, they searched CANCODE, a five million word corpus of native speech, for three-word contexts in which these words were repeatedly used (i.e. more than once per million words). They then determined what percentage of the learners’ use of these words matched these sequences. While the numbers involved were too small for valid inferential analyses, the authors found a substantial overall increase in the percentage of Beth’s usage which matched frequent native phraseology (from 42.28% to 59.13%) but a small drop for Ann (from 55.72% to 52.99%). However, looking only at the words which appeared in both first and second interviews, both learners showed convergence with native phraseology for the majority of items. Adolphs and Durow conclude that, while Beth’s phraseology seems to have improved overall (partly as a result, the implication appears to be, of her higher level of social integration, and consequent exposure to native speech), Ann improved her usage of just those lexical items which she used most frequently.

Formulaic language in advanced non-native writing

An early study of formulas in advanced non-native writing is that of Yorio (1989), who found “extensive use of conventionalized language” (which he appears to identify intuitively) in his analysis of the writing of 25 ESL students who had been resident in the United States for 5-7 years. Yorio notes, however, that the learners had “no formal control” over the formulas they used. He found errors of grammar (*take advantages of; *are to blamed for) and of lexical choice (*made a great job; *on the meantime), mixed idioms (*give up their freedom of mobility, meaning ‘give up their freedom of movement’), phrases used with the wrong meaning (in this way, meaning ‘for this reason’; in addition to, meaning ‘in order to’) and what he terms “attempted idioms” (at the end of the road, meaning ‘ultimately’; they feel suspended upon their heads, the Damocles’ sword). Yorio also looked at the use of phrasal and prepositional verbs. Comparing the usage in his learner corpus with that in a similar corpus produced by 15 native writers, he found that natives and non-natives used these forms to a similar extent (they constituted 19.5% of conjugated verbs for native speakers, 14% for non-natives). However, natives used a far higher proportion of ‘idiomatic’ two word verbs (e.g. bring up) than non-natives. Idiomatic forms comprised 36% of all two word verbs for native speakers, but only 6.5% for non-natives. Moreover, non-natives showed a tendency to use two word verbs incorrectly, getting them right in only 59% of cases.

Yorio also compared writing produced under matched conditions by immigrant students (L1 Spanish) resident in the US for five to six years and by English majors (also L1 Spanish) at a university in Argentina who had never been part of an English-speaking community. Yorio found that the latter group produced more grammatically accurate language than the former and made more use of idioms. He also felt that their writing was “more authentic” than that of the immigrant group. Yorio speculates that the authentic nature of their language is the product of a greater use of frequent collocations. While this finding is suggestive, it suffers for not being quantified. In support of his judgement that the Argentinian group’s writing was more authentic, Yorio writes that “[t]his impression of greater idiomaticity was apparent to me and to other colleagues whose native speaker impressions I sought” (Yorio, 1989, p. 65). His assertion that they made greater use of collocations is similarly subjective: “After … that appeared more native-like contained many more ‘English phrases’” (Yorio, 1989, p. 66).

A more rigorous approach is taken by Granger (1998), who compared the use of two “productive speech formulas” in a 250,000 word corpus of essays by advanced (L1 French) learners of English with a similar corpus of native speaker writing. The constructions examined were the passive structure:

it + modal + passive verb (of saying/thinking) + that-clause

(e.g. it is said that; it can be claimed that)

and the active structure:

I or we/one/you (generalized pronoun) + (modal) + active verb (of saying/thinking) + that-clause

(e.g. I claim that; we can say that)

She found that, while the passive structure was used with approximately equal frequency by native and non-native writers, non-natives massively overused the active structure compared to native speaker norms. In particular, certain instantiations of this form were used far more frequently by non-natives: in 20,000 words, non-natives produced the construction with say 75 times, compared with four uses by native speakers, and they produced the sequence with think 72 times, compared with three by native speakers (1998, p. 155). This lack of diversity in non-native phraseology tallies with the findings of DeCock et al and Foster, reported above. Granger suggests that learners’ limited expressive repertoires may lead them to ‘cling on’ to certain fixed phrases – often L1 cognates – with which they feel confident, using them as (in Dechert’s words) “islands of reliability” (Granger, 1998, p. 156). She also notes in this context a possible transfer effect from the first language – the overused expressions tended to be those with direct L1 translation equivalents.

Granger also studied two-word collocations in the same corpora. Looking at the use of intensifying adverbs ending in –ly combined with adjectives (e.g. perfectly natural; closely linked), she found that ‘maximizers’ (e.g. absolutely; entirely; totally) were used with roughly the same frequency (in terms of both types and tokens) by natives and non-natives, but that native writers used far more (types and tokens of) ‘boosters’ (e.g. deeply; strongly; highly). As with the active and passive frames described above, learners tended to overuse a few favourite intensifiers. Granger again notes an apparent influence of the L1 here. Completely and totally, which have very high frequency direct translation equivalents in French (complètement and totalement), were significantly overused by these learners, while highly, whose literal French equivalent (hautement) is infrequent and reserved for formal language, was significantly underused. Moreover, the non-natives tended to adopt what Granger calls ‘stereotyped’ maximizer + adjective combinations (i.e. formulaic items like acutely aware, keenly felt, painfully clear) only when they had a direct translation equivalent or were “lexically congruent”. That is to say, restricted collocations were adopted only when similar to L1 phrases.

Lorenz (1999) has also investigated the use of intensifier-adjective collocations in advanced English learners’ writing. He compared four corpora of “expository-argumentative” texts: 155,000 words produced by L1 German 16-18 year olds in the Bundeswettbewerb Fremdsprachen, the German nation-wide foreign language competition; 145,000 words produced in writing classes by university students of English; 126,000 words of general-topic argumentative essays produced by 15-18 year old British students; and 92,000 words of argumentative essays produced by British undergraduates. Lorenz’s study provides further evidence for the ‘islands of reliability’ hypothesis, finding both that the non-natives overuse “a limited number of high frequency stock items” and that their overall repertoire of collocations (as measured by a ‘type-token ratio’) is much smaller than that of natives (Lorenz, 1999, pp. 168-170).

Lorenz attempts to quantify the ‘idiomaticity’ of collocation use in terms of the mutual information scores of intensifier-adjective combinations. He finds that the average mutual information score of the 920 combinations in his combined non-native corpora (MI = 7.41) is about 20% lower than that of the 626 combinations in his native corpora (MI = 9.22). Collocations which score highly on mutual information tend, we have seen, to be infrequent but strongly-associated pairs. On Lorenz’s account, learners make less use of such pairs and instead “show a preference for attestedly viable, recurrent combinations” (Lorenz, 1999, p. 181). On these grounds, Lorenz makes the bold claim that mutual information “is no more and no less than a statistical representation of a stylistic quality as elusive as ‘idiomaticity’” (Lorenz, 1999, p. 184).
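The behaviour of mutual information noted here can be made concrete with a short Python sketch. It assumes the definition commonly used in corpus linguistics, the base-2 logarithm of observed over expected co-occurrence frequency; the exact formula Lorenz applied is not given in this section, and all frequencies below are invented for illustration.

```python
import math


def mutual_information(pair_freq, freq_x, freq_y, corpus_size):
    """MI in bits: log2 of observed over expected co-occurrence frequency."""
    # Expected co-occurrences if the two words were independent.
    expected = freq_x * freq_y / corpus_size
    return math.log2(pair_freq / expected)


def relative_difference(lower, higher):
    """Proportion by which `lower` falls short of `higher`."""
    return (higher - lower) / higher


# Invented frequencies in a corpus of 2**20 words.
N = 2 ** 20
rare_but_tight = mutual_information(8, 16, 32, N)              # infrequent pair
frequent_but_loose = mutual_information(2048, 16384, 32768, N)  # frequent pair
print(rare_but_tight, frequent_but_loose)

# The reported averages, 7.41 against 9.22.
print(round(relative_difference(7.41, 9.22), 3))
```

In this toy case the pair attested only eight times scores far higher than the pair attested 2,048 times, because its component words rarely occur apart. The second calculation confirms that 7.41 falls short of 9.22 by a little under a fifth, in line with the "about 20%" stated above.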

Hyland (2008) compares 4-word clusters (defined as chunks appearing at least 20 times per million words, and in at least 10% of texts) found in a 730,000 word corpus of research articles in electrical engineering, business studies, applied linguistics and microbiology with those in a 1.9 million word corpus of PhD dissertations and an 825,000 word corpus of MA theses in the same disciplines written by university students in Hong Kong. He finds the two corpora of student writing to contain both a greater concentration and a wider variety of clusters than was found in the research article corpus. Clusters constituted 5.1% of the MA corpus, 3.8% of the PhD corpus and 3.1% of the research article corpus. The MA corpus included a total of 149 different clusters, the PhD corpus 95, and the research article corpus 71. The student genres, Hyland observes, appear to be “more phrasal” than published writing, suggesting a “considerably higher reliance on prefabricated patterns among the less experienced writers” (2008, p. 50). Hyland also finds differences between corpora in the actual clusters used and in their typical structures and functions. He warns, however, that these results need not indicate any “deficiencies” in the student writing. The differences in number and type of cluster between corpora could, he suggests, reflect the differing goals and audiences of the three text types.
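The two-part definition of a cluster used by Hyland (a normalised frequency floor plus a dispersion floor) can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not Hyland's procedure: the function name, the representation of each text as a list of tokens, and the choice to compute the frequency per million words over the whole corpus are all assumptions.

```python
from collections import Counter


def four_word_clusters(texts, min_per_million=20, min_text_share=0.10):
    """Return {cluster: (frequency per million words, share of texts)}.

    `texts` is a list of token lists, one per document (assumed input).
    """
    freq = Counter()        # total occurrences of each 4-word sequence
    text_range = Counter()  # number of texts containing each sequence
    total_words = sum(len(tokens) for tokens in texts)
    for tokens in texts:
        grams = [tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3)]
        freq.update(grams)
        text_range.update(set(grams))  # count each text at most once
    kept = {}
    for gram, count in freq.items():
        per_million = count * 1_000_000 / total_words
        share = text_range[gram] / len(texts)
        # A sequence must pass both the frequency and the dispersion test.
        if per_million >= min_per_million and share >= min_text_share:
            kept[gram] = (per_million, share)
    return kept
```

The dispersion criterion matters because a sequence repeated many times by a single writer would otherwise pass the frequency test alone and be counted as characteristic of the whole corpus.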

A number of researchers have looked specifically at the use of restricted collocations as they are defined on the semantic-syntactic criteria of ‘Russian school’ phraseologists (see Section 2.2). Howarth (1998) claims to find evidence of the underuse of verb + noun collocations and idioms in non-native English academic writing. He defines collocations as combinations in which there is some restriction on the substitutability of elements, and idioms as combinations with entirely figurative meanings. In two corpora of native writing (a 58,000 word compilation of 29 social science texts and a 180,000 word collection comprising “papers on law, chapters from books on language studies, and a complete book on social policy” (Howarth, 1998, p. 165)), the percentage of verb-noun combinations which were restricted collocations or idioms was 31% and 40% respectively. The figure for a non-native corpus (25,000 words produced by students on a master’s course in English Language Teaching), however, was only 25%. These figures lead Howarth to conclude that “native speakers employ about 50% more restricted collocations and idioms (of a particular structural pattern) than learners do, on average” (1998, p. 177). It is worth noting, however, that much of the difference depends on native writers’ greater use of idioms, rather than collocations. The difference between collocation use in the non-native corpus and the first of the native corpora (24% vs. 28%) is actually smaller than that between the two native-speaker corpora (28% vs. 35%).
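The proportions reported for Howarth's corpora can be compared directly. The short Python check below uses only the percentages given above; the corpus labels are shorthand introduced here, and taking an unweighted mean of the two native ratios is an assumption, since the way the published "about 50%" figure was averaged is not stated in this section.

```python
# Share of verb + noun combinations that were restricted collocations or idioms.
native_shares = {"social science texts": 0.31, "mixed academic texts": 0.40}
learner_share = 0.25


def ratio_to_learners(native_share, learner=learner_share):
    """How many times the learner share the native share amounts to."""
    return native_share / learner


ratios = {name: ratio_to_learners(s) for name, s in native_shares.items()}
mean_ratio = sum(ratios.values()) / len(ratios)  # unweighted mean (assumption)

# Restricted collocations only, in percentage points.
learner_vs_first_native = 28 - 24
between_native_corpora = 35 - 28

print({name: round(r, 2) for name, r in ratios.items()}, round(mean_ratio, 2))
print(learner_vs_first_native, between_native_corpora)
```

The native-to-learner ratio is 1.24 for the first native corpus and 1.60 for the second, with an unweighted mean of 1.42, so the size of the reported gap depends heavily on which native corpus serves as the baseline. For collocations alone, the four-point gap between learners and the first native corpus is smaller than the seven-point gap between the two native corpora.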

Kaszubski (2000) looks at intermediate and advanced English learners’ use of six high frequency verbs (be, do, have, make, take, give) in free combinations, restricted collocations, and ‘frozen uses’. The study compares argumentative essays from a range of corpora produced by intermediate Polish and Spanish learners, advanced Belgian-French and Polish learners, native college students, and native professional writers. Kaszubski reports that variation between the behaviour of different verbs makes it difficult to make absolute claims about the degree to which writers use restricted collocations in general, but that there appear to be three broad groups, comprising 1) intermediate learners, 2) advanced learners and native college students, and 3) native professional writers, with usage of free combinations decreasing from 1 to 3. The trend is far from emphatic though, and is even less so when considering the proportion (rather than number) of combinations which are free or restricted. One pattern which does emerge quite strongly is that of learners’ overuse of a few favoured collocations, generally either high frequency register-neutral items or items similar to L1 phrases.

The most comprehensive analysis of phraseologically-defined collocations in learner writing to date is that of Nesselhauf (2005). Like Howarth, Nesselhauf looks specifically at verb + noun combinations, defining collocations as combinations in which there is some arbitrary restriction on what nouns can appear with (a given sense of) the verb. She analyses some 2,000 collocations taken from a 150,000 word corpus
