3. Grammaticalization parameters quantified

3.1 Token frequency

Token frequency is the number of occurrences of a word in a corpus. For example, the word give occurs 43’488 times in the British National Corpus. Naturally, give also has other forms such as gives, gave, given and giving. When measuring token frequency, a choice has to be made as to whether the frequencies of all these forms are aggregated or kept separate. Aggregating different forms of the same word is known as lemmatization, and the element under which they are grouped is called a lemma. When give is lemmatized so that all of its word forms are counted together, its token frequency goes up to 123’617.⁸
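The difference between counting word forms separately and counting them under a lemma can be sketched as follows. This is only an illustration: the toy sentence and the hand-written form-to-lemma table (LEMMA_OF) are assumptions made for the example, and are not taken from the British National Corpus or from the tools used in this study.

```python
from collections import Counter

# Hypothetical form-to-lemma table, written by hand for this example only.
LEMMA_OF = {
    "give": "give", "gives": "give", "gave": "give",
    "given": "give", "giving": "give",
}

def form_frequencies(tokens):
    """Count every word form separately (no lemmatization)."""
    return Counter(tokens)

def lemma_frequencies(tokens):
    """Aggregate the counts of all forms that share a lemma.

    Forms missing from LEMMA_OF are counted under their own spelling.
    """
    return Counter(LEMMA_OF.get(token, token) for token in tokens)

# Toy "corpus" (invented), already split into lowercase tokens.
tokens = "they give what she gave and he gives what was given".split()

forms = form_frequencies(tokens)
lemmas = lemma_frequencies(tokens)

print(forms["give"], lemmas["give"])  # prints: 1 4
```

In this toy example the form give has a token frequency of 1, whereas the lemma give has a token frequency of 4, since gave, gives and given are counted with it.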

Lemmatization was not used in the subsequent studies, since there is a clear need to keep most entries separate. For example, go, went and going fulfil quite different grammatical functions. These differences could not be studied on the basis of lemmatized data. Furthermore, different word forms of the same lemma can show different degrees of grammaticalization. This is related to the principle of divergence (Hopper 1991) discussed in section 2.2.2, where a grammaticalizing form can have a lexical counterpart still in use. Similarly, while one verb form might grammaticalize, other forms of the same lemma might not participate in this process. To illustrate, going mostly occurs as an auxiliary verb, while go and went tend to be main verbs. It might be argued that lemmatization would have been appropriate in some cases: a and an, for example, have the same grammatical function and are simply two allomorphs conditioned by phonological context. However, such cases are uncommon, and choosing which items should be lemmatized would involve much manual processing and highly subjective decisions, which would be problematic, in particular for the reproducibility of the study.

In addition to lemmatization, another important practical aspect when measuring token frequency is how the corpora involved are parsed. It is common to use regular expressions to find words in a corpus, which involves several small decisions, such as which characters can constitute words (e.g. numbers, accented letters such as à or é). Corpora also tend to contain spelling mistakes and encoding errors, such as missing blank spaces between words or punctuation. Extra preprocessing steps can be added to correct some of these shortcomings automatically. All of these choices can change the results when measuring the token frequency of given items. This is why the regular expressions involved in chapter 5 are explicitly listed in section 5.3 and those in chapter 6 in section 6.2. Different preprocessing decisions also explain why different interfaces and different researchers might not obtain exactly the same numbers when computing token frequency values, despite using the same corpora.
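The effect of such parsing decisions can be illustrated with a small sketch. The two regular expressions and the sample string below are hypothetical and chosen only for illustration; they are not the expressions listed in sections 5.3 and 6.2.

```python
import re
from collections import Counter

# Two hypothetical word definitions (assumptions for this example).
ASCII_WORD = re.compile(r"[a-z]+")        # unaccented letters only
UNICODE_WORD = re.compile(r"[^\W\d_]+")   # any letter; no digits or underscores

def regex_counts(text, pattern):
    """Lowercase the text and count every match of the word pattern."""
    return Counter(pattern.findall(text.lower()))

def whitespace_counts(text):
    """Naive alternative: lowercase the text and split on whitespace only."""
    return Counter(text.lower().split())

# Invented sample with an accented letter, a number, and a missing
# blank space after the first full stop.
sample = "The café opened.The cafe closed in 1999."

ascii_counts = regex_counts(sample, ASCII_WORD)
unicode_counts = regex_counts(sample, UNICODE_WORD)
naive_counts = whitespace_counts(sample)

for word in ("the", "café", "caf"):
    print(word, ascii_counts[word], unicode_counts[word], naive_counts[word])
```

With the first pattern, café is truncated to caf because é is not treated as a word character. With whitespace splitting, the second occurrence of the is lost inside the token opened.the. The same text therefore yields different token frequencies depending on how words are defined.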

The relevance of frequency in grammaticalization can be approached from a naive angle, by observing that function words are generally more frequent than content words in most languages. This is easily seen by looking at the most frequent words in a corpus, which are overwhelmingly grammatical elements. The five most frequent words in the British National Corpus are the, of, and, to and a. More generally, the majority of the 100 most frequent words in this corpus are grammatical elements.

While token frequency is known to play a role in grammaticalization, views diverge on the extent of this role. For example, Hopper and Traugott (2003: 126-127) acknowledge that frequency “has long been recognized informally as a concomitant of grammaticalization”, which is a prudent assertion. On the other hand, Narrog and Heine (2011: 2-3) note that “in some of the definitions provided, frequency is portrayed as one of the driving forces, or the driving force of grammaticalization”. This is especially the case with Bybee (2011: 77), who takes increases in frequency to explain even the unidirectionality of grammaticalization (section 2.4): “as long as frequency is on the rise, changes will move in a consistent direction.” In Bybee’s view, this is because “when a grammaticalizing construction ceases to rise in frequency, various things happen, but none of them is the precise reverse of the process” (Bybee 2011: 77). There are thus views of grammaticalization in which frequency occupies a core position, as further illustrated in Diessel and Hilpert (2016).

An example of an effect of high token frequency is phonetic reduction, which is further discussed in the following section. Bell et al. (2009) have highlighted that there is a strong correlation between frequency of use and phonetic reduction, where frequent elements are more likely to be reduced in speech production. However, they report differences between lexical and grammatical items. Lexical items seem to display a straightforward correlation, where more frequent items are shorter, whereas grammatical items do not display such a correlation when controlling for frequency and predictability. Grammatical items tend to be reduced when they are highly predictable from the linguistic context. For instance, of often appears after kind or sort in the kind of and sort of constructions (i.e. there is strong mutual information between these elements), which are subject to phonological reduction (kinda/sorta). Lexical and grammatical elements might therefore be processed differently in speech production, and frequency may play a different role in each case, as pointed out by Bell et al. (2009).
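One common way to quantify how strongly two adjacent words are associated is pointwise mutual information (PMI), which compares the observed frequency of a word pair with the frequency expected if the two words were independent. The sketch below is a simplified illustration with invented counts; it is not necessarily the predictability measure used by Bell et al. (2009), and the function name and parameters are assumptions.

```python
import math

def pmi(pair_count, first_count, second_count, n_positions):
    """Pointwise mutual information (in bits) of an adjacent word pair.

    Simplifying assumption: all probabilities are estimated by dividing
    the raw counts by the same total, n_positions.
    """
    p_pair = pair_count / n_positions
    p_first = first_count / n_positions
    p_second = second_count / n_positions
    return math.log2(p_pair / (p_first * p_second))

# Invented counts for illustration only (not corpus figures):
# "kind" occurs 100 times, "of" 3000 times, "kind of" 80 times,
# in a toy corpus of 100000 positions.
score = pmi(pair_count=80, first_count=100, second_count=3000,
            n_positions=100000)

# If "kind of" occurred only as often as chance predicts
# (100 * 3000 / 100000 = 3 times), the PMI would be close to zero.
chance_score = pmi(pair_count=3, first_count=100, second_count=3000,
                   n_positions=100000)

print(round(score, 2), round(abs(chance_score), 2))
```

A clearly positive score indicates that the second word is much more likely after the first word than it is overall, which is the kind of contextual predictability associated with reduction in grammatical items.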

Measures based on token frequencies have also been used to show the emergence of syntactic patterns, such as the grammaticalization of verb-object word order in English (Fries 1940) or of epistemic parentheticals (Thompson and Mulac 1991). Furthermore, token frequency has also been used to investigate the percentages of different uses of elements undergoing grammaticalization. For instance, Barth-Weingarten and Couper-Kuhlen (2002) used the percentages of different uses of though to illustrate that it is developing a new discourse marking function in Present-Day English.

It should be noted that in the same way that the Lehmann parameters (section 2.2.1) are not relevant in all cases of grammaticalization, something similar can be said about token frequency. There are examples of low frequency grammaticalization (e.g. Hoffmann 2004, Brems 2007). For example, there are complex prepositions that are relatively infrequent in English, such as by dint of. Hoffmann (2004) argues that such low frequency complex prepositions can still involve high degrees of grammaticalization. There are cases where low frequency complex prepositions can grammaticalize by analogy to similar prepositions that are more frequent.⁹ Therefore, since by dint of has a structure and meaning similar to those of by means of, it may display similar grammaticalization features despite a much lower frequency. Furthermore, token frequency values have their limitations. For instance, certain concepts are rarer and so is the need to express them. If such a concept can only be expressed by one specific construction, then this construction is relatively prominent within this specific context, regardless of the overall frequency of the concept. This relates to Lehmann’s parameter of paradigmatic variability (section 2.2.1), where highly grammaticalized elements involve less freedom to choose alternatives. Hoffmann (2004: 191-193) mentions the preposition in front of, which became a prominent way to express a specific spatial relationship in Present-Day English, in contrast to the preposition before, which can also be used to express a similar locative meaning.

For example, sentences such as he is sitting before the fire and he is sitting in front of the fire are both possible in Present-Day English and have similar meanings. However, from a diachronic perspective, locative before was the only one of the two that was used to express this concept in the 17th century. The preposition in front of developed later on, and has a frequency close to that of locative before in the British National Corpus: Hoffmann (2004: 192) reports around 6’000 occurrences of in front of versus 7’000 of locative before. While the preposition in front of does not have a larger frequency than locative before, it is currently one of the main ways to express this specific concept of spatial relation. In short, whether a construction is a prominent way to express a specific concept (i.e. in comparison to similar realizations) is also a relevant factor that goes beyond mere token frequency.
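The idea of prominence relative to competing realizations can be made concrete with a small calculation based on the approximate figures cited above. The sketch makes a strong simplifying assumption, namely that in front of and locative before are the only two realizations of the concept; the function name is hypothetical.

```python
# Approximate British National Corpus counts as cited in the text
# from Hoffmann (2004: 192); both values are rounded.
IN_FRONT_OF = 6000
LOCATIVE_BEFORE = 7000

def share(count, *alternatives):
    """Proportion of one realization among all competing realizations."""
    total = count + sum(alternatives)
    return count / total

# Simplifying assumption: these two prepositions are the only alternatives.
in_front_of_share = share(IN_FRONT_OF, LOCATIVE_BEFORE)

print(f"{in_front_of_share:.0%}")  # prints: 46%
```

Under this assumption, in front of accounts for a little under half of the expressions of this spatial relation, which is a more informative figure than its raw token frequency taken on its own.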

This is why chapter 6 also includes some low frequency elements. The existence of such elements is not problematic because it just means that higher token frequency is not an obligatory aspect of grammaticalization. However, it remains a central one, especially since increases in token frequency can also lead to phonological reduction, which is discussed in the next section.

In document Measurements of grammaticalization: developing a quantitative index for the study of grammatical change (pages 62-65)