2. Corpus Linguistics
2.6 Frequency Profiling
2.6.1 Word frequency profiling
We began our examination of word frequency profiles in section 1.3 with a basic description of what they contain, and by mentioning their widespread use in corpus linguistic and classroom studies. For foreign or second language teaching, information about the frequencies of words is important for vocabulary grading and selection. Frequency studies also have applications to language teaching in such areas as syllabus design, materials writing, grading and language testing. For a recent view of the state of the art, see Schmitt and McCarthy (1997), who bring together work on many of these vocabulary-related areas. Historically, education was the driving force for frequency lists: see Thorndike (1921, 1932), Thorndike and Lorge (1944) and Lorge (1949). Fries and Traver (1950) carried out an extensive survey of the English word lists available up to that time, discussed their various educational applications and compared seven of the major lists. In those early days, the source texts for the frequency lists were the ones used in the education of American children. Later counts included magazines and general reading material. A more modern and systematic project to obtain frequency counts from children’s reading materials resulted in the American Heritage Word Frequency Book (Carroll et al., 1971). An improved kind of count (taking account of meaning but with a smaller wordlist) led to the publication of the General Service List of English Words by West (1953). Below the word level, Ljung (1974) published a frequency list of morphemes based on 8,000 of the most frequent words in the Thorndike-Lorge lists.
Other frequency lists have been compiled for particular varieties of English. For example, James et al. (1994) is a frequency book of the vocabulary of computer science; Dahl (1979) is a frequency book for the English of psychiatric interviews. The latter is one of the few existing frequency lists for spoken English; others include an early list based on a limited corpus of 135,000 words (Jones and Sinclair, 1974), and one based on the spoken part of the BNC (Leech, Rayson and Wilson, 2001). The Michigan team are beginning work on a word frequency list for American academic spoken English, based on the MICASE corpus. If we consider languages other than English, Juilland has produced a series of frequency dictionaries for Spanish, Rumanian and French (Juilland et al., 1964, 1965 and 1970).
A third area of application for frequency-based word lists is that of natural language processing (NLP). Computer systems that process language need to know the probability of a word occurring in a text. This can be applied in, for example, machine translation or speech recognition software, where it is important to determine which of a set of possible words is the most likely to occur. Finally, we can identify a fourth application for these lists, that of psychological research, where the frequency of vocabulary is valuable evidence in understanding the human processing of language.
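To make the NLP use concrete, the following minimal sketch estimates the probability of a word as its relative frequency and then picks the most probable word from a set of candidates. It is an illustration only: the function names, the toy text and the candidate set are our own assumptions, and real systems condition on context and smooth the estimates for unseen words.

```python
from collections import Counter


def unigram_probabilities(tokens):
    """Estimate P(word) as the word's relative frequency in a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def most_likely(candidates, probs):
    """Return the candidate with the highest estimate (unseen words score 0)."""
    return max(candidates, key=lambda word: probs.get(word, 0.0))


# Toy text, invented for illustration only.
tokens = "the cat sat on the mat and the dog sat too".split()
probs = unigram_probabilities(tokens)

# Hypothetical candidate set, e.g. competing hypotheses from a recogniser.
best = most_likely(["sat", "mat", "sad"], probs)
print(best)
```

With unsmoothed relative frequencies, a candidate that never occurs in the source text receives probability zero, which is one reason large and well-sampled frequency lists matter for this application.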
Despite their usefulness as a starting point, there are problems with word frequency lists. The simple lists count inflectional variants of the same headword separately, so we may find the verb forms kicked and kicks high in our word count while the base form kick is lower down the list. In order to study the usage of the lemma KICK as a whole we need to reduce all variants to the base headword¹⁸ and count them together. This has to be done both for the verb lemma (kick, kicked, kicking, kicks) and for the corresponding noun lemma (kick, kicks). Leech, Rayson and Wilson (2001) have produced lemmatised word frequency lists to overcome this problem. Simple word frequency lists often do not show frequencies for multi-word units (MWUs). Including them usually relies on some automatic analysis to identify grammatical MWUs (e.g. the conjunction so that, the preposition in spite of, and at least as an adverb), or semantic MWUs (e.g. kick the bucket). Simple word lists also do not distinguish different words spelt the same (homographs), although this problem can be partly avoided if the lists are produced from a POS tagged corpus so that, for example, score as a noun is counted separately from score as a verb. Further ambiguities remain, such as the noun spring, which can refer to a metal coil, a water source, or a season. Resolving these would need a fully automated word-sense analysis of the text, and such techniques are not yet mature enough to be used in large-scale projects. Further practical problems of writing software that produces word frequency lists will be discussed in section 4.2. We will use real examples to illustrate the problems with using word frequency lists in section 4.4.
¹⁸ The headword is sometimes called the lexeme, and Sinclair (1999) and others call it the definiendum.
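The difference between a simple word-form count and a lemmatised, POS-aware count can be sketched as follows. This is a minimal illustration under stated assumptions: the LEMMA lookup table, the tag names and the tagged sentence are all invented here, and a real project would obtain tags and lemmas from a tagger and lemmatiser instead of a hand-written table.

```python
from collections import Counter

# Hypothetical lookup from (word form, POS) to lemma, for illustration only.
LEMMA = {
    ("kick", "VERB"): "kick", ("kicks", "VERB"): "kick",
    ("kicked", "VERB"): "kick", ("kicking", "VERB"): "kick",
    ("kick", "NOUN"): "kick", ("kicks", "NOUN"): "kick",
}


def form_frequencies(tagged):
    """Simple list: count each word form separately and ignore the POS tag."""
    return Counter(word.lower() for word, _ in tagged)


def lemma_frequencies(tagged):
    """Lemmatised list: count (lemma, POS) pairs, so that inflected forms are
    merged while the verb and noun lemmas are still kept apart."""
    return Counter(
        (LEMMA.get((word.lower(), pos), word.lower()), pos)
        for word, pos in tagged
    )


# Invented, hand-tagged example sentence.
tagged = [
    ("He", "PRON"), ("kicked", "VERB"), ("the", "DET"), ("ball", "NOUN"),
    ("and", "CONJ"), ("kicks", "VERB"), ("it", "PRON"), ("again", "ADV"),
    ("with", "PREP"), ("a", "DET"), ("hard", "ADJ"), ("kick", "NOUN"),
]

forms = form_frequencies(tagged)
lemmas = lemma_frequencies(tagged)
print(forms["kicked"], forms["kicks"], forms["kick"])
print(lemmas[("kick", "VERB")], lemmas[("kick", "NOUN")])
```

Keying the lemmatised count on the pair (lemma, POS) handles homographs across word classes only; sense distinctions within a word class, such as those of the noun spring, are not addressed by this sketch.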
Even in a large comprehensively sampled corpus such as the BNC, the word frequency counts themselves can be misleading. This is not because we may have miscounted the words, but because of how the frequencies relate to use in the English language as a whole. If a word has a high frequency count, we may reasonably infer, due to the nature of the BNC, that the word has a similarly high currency of usage in the language. However, it is possible that the word has a high frequency not because it is widely used in the language as a whole but because it has high frequency in a much smaller number of texts, or parts of texts, within the corpus. To reveal such cases, we can calculate range or dispersion statistics. These show how widely spread the use of a word is: whether it is frequent because it occurs in a lot of text samples in the corpus or whether it is frequent because of a very high usage in only a subset of domains or genres. Frequent words with high dispersion values may be considered to have high currency in the language as a whole; high frequencies associated with low dispersion values should, in contrast, be treated with caution. Just as statistics summarises a distribution by its mean and standard deviation, corpus linguistics summarises the distribution of a word by its frequency and dispersion. According to Fries and Traver (1950: 21), Thorndike was the first to introduce range values into frequency lists. Lyne (1985) surveys dispersion statistics in more detail and we will describe in section 4.2 how Matrix calculates the range and Juilland’s dispersion statistics.
An alternative approach to quoting separate dispersion and frequency statistics is to combine them into one value called adjusted frequency (or sometimes coefficient of usage). This is the method used by Francis and Kučera (1982: 464). They quote the dispersion measures by Juilland and Rosengren and describe how these can be combined with actual frequencies so that lemmas in the Brown corpus can be ranked by their adjusted frequencies. The American Heritage Word Frequency Book (Carroll et al., 1971: xl) used a measure of relative entropy from information theory as a dispersion statistic, but similarly calculated an adjusted frequency measure from the dispersion.
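As an illustration of these measures, the sketch below computes the range, Juilland’s D and an adjusted frequency U = D × f from the frequencies of one word in each part of a corpus. It uses the usual textbook formulation, D = 1 − V/√(n − 1), where V is the coefficient of variation of the per-part frequencies and n is the number of parts, and it assumes that the parts are of equal size. It is not the Matrix implementation described in section 4.2; the function names and example figures are our own.

```python
from math import sqrt


def word_range(part_freqs):
    """Range: the number of corpus parts in which the word occurs at all."""
    return sum(1 for f in part_freqs if f > 0)


def juilland_d(part_freqs):
    """Juilland's D = 1 - V / sqrt(n - 1), with V = sd / mean.

    Assumes n equally sized parts; sd is the population standard deviation.
    D is 1 for a perfectly even spread and 0 when all occurrences fall in
    a single part.
    """
    n = len(part_freqs)
    if n < 2:
        raise ValueError("need at least two corpus parts")
    mean = sum(part_freqs) / n
    if mean == 0:
        raise ValueError("word does not occur in the corpus")
    sd = sqrt(sum((f - mean) ** 2 for f in part_freqs) / n)
    return 1 - (sd / mean) / sqrt(n - 1)


def adjusted_frequency(part_freqs):
    """Adjusted frequency (coefficient of usage): U = D * total frequency."""
    return juilland_d(part_freqs) * sum(part_freqs)


# Invented figures: two words with the same total frequency of 40.
even = [10, 10, 10, 10]   # spread evenly over four parts
skewed = [40, 0, 0, 0]    # concentrated in a single part

print(word_range(even), juilland_d(even), adjusted_frequency(even))
print(word_range(skewed), juilland_d(skewed), adjusted_frequency(skewed))
```

The two invented words have identical raw frequencies, but the adjusted frequency separates them sharply, which is the motivation for ranking by U instead of by raw frequency.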
Zipf (1935, 1949) established a logarithmic connection between the rank frequency of a word and the number of words at that rank. He also proposed a ‘principle of least effort’ for human language use. Among other things, this means that the words that people use most often will also prove to be the shortest and simplest. In a frequency list, we can see this principle at work by looking at the lengths of words in terms of how many (spoken) syllables they contain. The BNC frequencies follow the pattern predicted by Zipf’s principle (Leech, Rayson and Wilson, 2001: 121).
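Zipf’s law is most often stated in the form that the frequency of a word is roughly inversely proportional to its rank in the frequency list, so that a plot of log frequency against log rank is close to a straight line with slope near −1. The sketch below, which assumes that common formulation, estimates the slope by ordinary least squares; the function name and the idealised input are our own, and real frequency lists deviate from the straight line, especially at the extremes.

```python
from math import log


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Frequencies must be positive; they are ranked here from most to least
    frequent, with rank 1 for the most frequent word.
    """
    freqs = sorted(frequencies, reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two frequencies")
    xs = [log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Idealised frequencies that follow f = 1000 / rank exactly (invented data).
ideal = [1000 / rank for rank in range(1, 51)]
slope = loglog_slope(ideal)
print(round(slope, 6))
```

Applied to the counts of a real frequency list, a slope in the neighbourhood of −1 would be consistent with the Zipfian pattern, although a single fitted slope is only a crude summary.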
A lot of progress has been made since Zipf’s early studies on word frequency distributions. Baayen (1993) compares three models (the lognormal law, the generalised inverse Gauss-Poisson law, and the extended generalised Zipf’s law) with regard to estimating the theoretical vocabulary size. Baayen (1993: 361) writes that “the main challenge for future research in this area is to construct linguistically less naïve models that do not build on the unrealistic assumption that in language words appear at random”. The three models presented are all large number of rare events (LNRE) models. Even large corpora with tens of millions of words are located in the LNRE zone (Baayen, 2001: 51). Following up his own challenge, Baayen (2001: 161) adjusts the LNRE models to take into account non-randomness in language.
Words are not selected at random in language. This has implications for carrying out statistical procedures on word frequencies, as we shall see in section 2.7. Choosing one word (or POS) constrains the choice of the following word (or POS), so that, for example, having chosen a determiner (e.g. the) the choices for what can grammatically follow are immediately limited (e.g. an adjective, adverb or noun). This constraint is what statistical part-of-speech taggers such as CLAWS rely on to assist prediction of the correct word-class tag (see section 3.2.1). Other factors influence word selection, such as author preference (related to language proficiency), collocations, topic, and text type. Church and Gale (1995) refer to the bunchiness or burstiness of words and show, as an example, the occurrences of the “very contagious” word “Kennedy” in the Brown corpus (Kennedy was president of the United States in 1961, the year from which the Brown corpus texts are drawn).
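One simple way to see burstiness is to compare the number of texts in which a word actually occurs with the number we would expect if its occurrences were scattered at random. The sketch below assumes a Poisson model with rate equal to the mean count per text, under which the expected number of texts containing the word is N(1 − e^(−rate)) for N texts of equal length. It is offered in the spirit of the Church and Gale discussion and is not their procedure; the function names and counts are invented.

```python
from math import exp


def document_frequency(doc_counts):
    """Observed number of texts that contain the word at least once."""
    return sum(1 for count in doc_counts if count > 0)


def poisson_expected_df(doc_counts):
    """Number of texts expected to contain the word if its occurrences were
    spread at random (Poisson) over equally long texts."""
    n_docs = len(doc_counts)
    rate = sum(doc_counts) / n_docs      # mean occurrences per text
    return n_docs * (1 - exp(-rate))


# Invented counts of one word in ten texts; both words occur 20 times in all.
bursty = [0, 0, 12, 0, 0, 0, 8, 0, 0, 0]   # clustered in two texts
spread = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]    # present in every text

for counts in (bursty, spread):
    print(document_frequency(counts), round(poisson_expected_df(counts), 2))
```

A word whose observed figure falls well below the random-scatter expectation, as with the clustered counts here, is behaving in the bunched way described above.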