3. Grammaticalization parameters quantified
3.5 Dispersion
Dispersion refers to the way an element is distributed within a text. It can have an even distribution and occur regularly throughout the text, or it can have an uneven distribution and occur in specific parts of the text only. There are different ways to measure how a word is distributed in a text. The main approach used in this dissertation is called deviation of proportions (Gries 2008: 415-419). The general idea is that the actual distribution of the word in a text is compared to its theoretical uniform distribution in that text. This is achieved by splitting a corpus into a certain number of chunks. The complete algorithm to compute the deviation of proportions of a given item is as follows:
1. Divide a corpus into n chunks (n is chosen by the researcher). Determine which percentage of the corpus each chunk represents. These percentages are the expected percentages for a given item in a uniform distribution.
2. Find the actual frequencies of the desired item in each chunk and express them as percentages. These are the actual percentages of a given item.
3. Calculate the absolute difference between the actual percentage in each chunk and the expected percentage in a uniform distribution. Add all these differences up.
4. Divide this result by two.
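The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this dissertation: the function name deviation_of_proportions and its two parameters are assumptions made for the example, and the computation uses proportions (0 to 1) instead of percentages, so that the result is directly in decimal notation.

```python
def deviation_of_proportions(counts, chunk_sizes):
    """Return the deviation of proportions (DP) of one item.

    counts      -- occurrences of the item in each chunk (hypothetical input)
    chunk_sizes -- number of words in each chunk (hypothetical input)
    """
    if len(counts) != len(chunk_sizes):
        raise ValueError("counts and chunk_sizes must have the same length")
    total_count = sum(counts)
    total_size = sum(chunk_sizes)
    if total_count == 0 or total_size == 0:
        raise ValueError("the item must occur at least once in a non-empty corpus")

    # Step 1: expected proportions under a uniform distribution.
    expected = [size / total_size for size in chunk_sizes]
    # Step 2: actual proportions of the item in each chunk.
    observed = [count / total_count for count in counts]
    # Step 3: sum of the absolute differences.
    total_difference = sum(abs(o - e) for o, e in zip(observed, expected))
    # Step 4: divide by two.
    return total_difference / 2
```

For instance, an item that occurs five times in each of two equally sized chunks yields deviation_of_proportions([5, 5], [100, 100]) = 0.0, that is, a perfectly even distribution.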
To give a concrete example, let us imagine a corpus of 10’000 words. In this corpus, the word cat occurs 240 times. One can decide to divide the corpus into ten chunks of 1’000 words each. Thus, each chunk represents 10% of the corpus. It is then necessary to check how the word cat is distributed within these ten chunks. Table 4 shows how many instances of cat occur in each chunk. These are the actual instances, which are then expressed as percentages of the 240 total occurrences.
To compare the actual distribution with the theoretical uniform distribution, a subtraction must be made for each chunk. For example, the difference between the actual distribution and the uniform distribution in chunk seven is 19.167% (29.167 – 10). Only the absolute values matter (i.e. the sign of the result of the subtraction is ignored). These differences are computed in Table 5. Once these differences are obtained, the percentages need to be added up, which shows how much the actual distribution differs from the theoretical uniform distribution. In the present case, the total in Table 5 is 99.168%. This percentage must then be divided by two, to get the final value of 49.584%. Normally, this is expressed using the decimal notation 0.496 (i.e. the percentage notation divided by 100).
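| | Chunk 1 | Chunk 2 | Chunk 3 | Chunk 4 | Chunk 5 | Chunk 6 | Chunk 7 | Chunk 8 | Chunk 9 | Chunk 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Instances of cat | 0 | 2 | 0 | 65 | 55 | 23 | 70 | 25 | 0 | 0 |
| Percentage of all instances | 0% | 0.833% | 0% | 27.083% | 22.917% | 9.583% | 29.167% | 10.417% | 0% | 0% |

Table 4. Distribution of the word cat in an imaginary corpus. Each chunk contains 1’000 words. The theoretical uniform distribution is 10% per chunk.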
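| | Chunk 1 | Chunk 2 | Chunk 3 | Chunk 4 | Chunk 5 | Chunk 6 | Chunk 7 | Chunk 8 | Chunk 9 | Chunk 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Absolute difference (instances) | 24 | 22 | 24 | 41 | 31 | 1 | 46 | 1 | 24 | 24 |
| Absolute difference (percentage) | 10% | 9.167% | 10% | 17.083% | 12.917% | 0.417% | 19.167% | 0.417% | 10% | 10% |

Table 5. Absolute difference between the theoretical uniform distribution and the actual distribution of the word cat.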
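The cat example can be recomputed directly from the counts in Table 4. The following sketch is only an illustration (the variable names are chosen for the example); it works with unrounded proportions, so the sum of differences is about 99.167% rather than the 99.168% obtained from the rounded percentages in Table 5.

```python
# Counts of "cat" per chunk, taken from Table 4.
cat_counts = [0, 2, 0, 65, 55, 23, 70, 25, 0, 0]
# Ten chunks of 1'000 words each.
chunk_sizes = [1000] * 10

total = sum(cat_counts)  # 240 occurrences in the whole corpus
expected = [size / sum(chunk_sizes) for size in chunk_sizes]  # 0.1 per chunk
observed = [count / total for count in cat_counts]

# Absolute differences per chunk (the second data row of Table 5, as proportions).
differences = [abs(o - e) for o, e in zip(observed, expected)]

dp = sum(differences) / 2
print(round(dp, 3))  # 0.496
```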
Deviations of proportions range between 0 and 1. Indeed, if in the previous cat example the word occurred exactly 24 times per chunk (10%), then the total of the absolute differences would be 0, which divided by two is still 0. To give an example of the opposite extreme, if all 240 instances of cat occurred in the tenth chunk, then this would result in a 10% difference in each of the first nine chunks, and a 90% difference in the last chunk. This adds up to a total of 180%, which yields a deviation of proportions of 0.9. Note that in this specific example, the maximum value is 0.9, but if there were more chunks, for example 100 chunks, then each chunk would represent one percent. As a result, if all occurrences of cat were in the last chunk in that example, then there would be 99 chunks where the difference is 1% and one where the difference is 99%, which would result in a deviation of proportions of 0.99. This shows that 1 is only a theoretical maximum and that the attainable maximum varies with the number of chunks. This has been addressed by Lijffijt and Gries (2012), who propose to normalize the deviation of proportions (DP) measure by dividing it by its attainable maximum, that is, one minus the proportion of the corpus represented by the smallest chunk:
\[ DP_{norm} = \frac{DP}{1 - \min_i s_i} \]
where s_i is the proportion of the corpus represented by chunk i.
To go back to the previous example where the maximal deviation of proportions was 0.9, using 10 chunks, one gets 0.9 divided by 0.9, which amounts to 1, the actual maximal deviation of proportions. This is however a minor concern that is mostly important from a theoretical perspective. In the subsequent studies, what matters is to compare different values of deviation of proportions that are all computed on the same corpora, using the same chunks. The normalization process was therefore not necessary, as all values under comparison have the same potential maximum.
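Although the normalization is not applied in the subsequent studies, it can be illustrated briefly. The sketch below assumes that the normalization divides DP by its attainable maximum, i.e. one minus the share of the corpus represented by the smallest chunk, which is consistent with the 0.9 divided by 0.9 example above. The function names dp_max and normalized_dp are hypothetical.

```python
def dp_max(chunk_sizes):
    """Largest DP attainable for a given chunking.

    This value is reached when every occurrence falls into the smallest chunk.
    """
    total_size = sum(chunk_sizes)
    return 1 - min(chunk_sizes) / total_size


def normalized_dp(dp, chunk_sizes):
    """Rescale DP so that its attainable maximum becomes 1 (assumed formula)."""
    return dp / dp_max(chunk_sizes)
```

With ten equal chunks, dp_max([1000] * 10) is 0.9 up to floating-point precision, and with one hundred equal chunks it is 0.99, matching the two scenarios described above.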
In chapters 5 and 6, chunks have a size of 1’000 words (0.001% of a one hundred million word corpus), which provides a rather detailed chunking. Furthermore, as most corpora are divided into smaller files, one has to decide whether the chunking of the corpus takes this into account or not. In the present analysis, the chunking did not take the file structure of the corpus into account. For example, if a chunk consists of 500 items at the end of a file, then the first 500 items of the next file are included in this chunk. This seems to be the best option, since ending chunks at the end of files could theoretically result in several very small chunks, some as small as a single word, which is not optimal.
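The chunking strategy described above, in which chunks run across file boundaries, can be sketched as follows. This is an illustrative sketch only: the function name chunk_tokens and the representation of each file as a list of tokens are assumptions, and so is the decision to keep a final chunk that is shorter than the others, since the treatment of leftover tokens is not specified here.

```python
from itertools import chain


def chunk_tokens(files, chunk_size=1000):
    """Yield consecutive chunks of chunk_size tokens, ignoring file boundaries.

    files -- an iterable of token lists, one list per corpus file (hypothetical input)
    """
    chunk = []
    # Treat all files as one continuous stream of tokens.
    for token in chain.from_iterable(files):
        chunk.append(token)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    # Assumption: leftover tokens form a final, shorter chunk.
    if chunk:
        yield chunk


# Two small files: the second chunk spans the boundary between them.
files = [["a"] * 1500, ["b"] * 700]
chunks = list(chunk_tokens(files, chunk_size=1000))
print([len(c) for c in chunks])  # [1000, 1000, 200]
```

In this toy example, the second chunk contains the last 500 tokens of the first file followed by the first 500 tokens of the second file, as in the scenario described above.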
To give an actual example taken from the British National Corpus written data discussed in section 5.4, the articles the and a have the most even distributions, with deviations of proportions of 0.117 and 0.125 respectively. On the other hand, the maximal deviation of proportions value (0.998) is obtained by low-frequency elements such as the pronoun ya, which is rarer in the written portion of the corpus and which can be expected to occur specifically in more informal contexts.
Note that deviation of proportions is only one way of measuring dispersion and that other researchers might disagree that a uniform distribution such as the one presented above is an appropriate baseline for natural languages. Indeed, it has been suggested that many words tend to appear in bursts in certain contexts (Altmann et al. 2009) and that all words are bursty to some extent (Pierrehumbert 2012). An example would be the pronoun she, which might be expected to appear in a very even fashion, but will in fact mostly appear in parts of a text where there is a female referent. Therefore, the claim is that no actual word has a perfectly even distribution as shown above and that alternative methods which also take bursts into account should be used. This issue illustrates that dispersion can be measured in other ways than the one adopted in the subsequent studies.
Dispersion relates to grammaticalization because it can easily be observed that grammatical elements are more evenly distributed than lexical elements (Hilpert and Correia Saavedra 2017a). From an intuitive perspective, it seems plausible to assume that grammaticalized words are generally more evenly dispersed, since they are not related to a specific topic or domain and are thus more likely to occur in all parts of a text.
This has also been indirectly investigated by Pierrehumbert (2012), who studied the burstiness of words based on their semantic class. Four semantic classes were distinguished, ranging from less to more abstract: entities (e.g. Bible, Africa, Darwin), predicates/relations (e.g. blue, die, in, religion), modifiers/operators (e.g. believe, everyone, forty), and higher level operators (e.g. hence, let, supposedly, the) (Pierrehumbert 2012: 5). In terms of grammaticalization, these four classes would rank in the same way (although classes two and three might be shifted around), from less to more grammaticalized. Pierrehumbert’s main finding was that the more abstract forms tend to be less bursty. Therefore, it seems plausible that grammaticalized forms would also be generally less bursty, and therefore more evenly dispersed.
It therefore seems worthwhile to investigate whether dispersion can be used to measure the degree of grammaticalization of a given item. As shown above, the even distribution of grammatical elements might be due to semantic criteria as well. Furthermore, as pointed out in Hilpert and Correia Saavedra (2017a), where dispersion was modelled using polynomial regression, frequency outweighs semantic generality as a predictor in such a model. The even distribution of grammatical elements might therefore also be a frequency effect. Under this assumption, grammatical elements are evenly distributed because they are generally highly frequent and not because of their grammatical function resulting from the process of grammaticalization. This shows that the subsequent chapters need to carefully consider the relationships between all the parameters under discussion. These relationships are discussed from a synchronic perspective in chapter 5 and from a diachronic perspective in chapter 6.