Training data: the CMU pronouncing dictionary

4. THE LATIN STRESS RULE: A CASE OF UNDERMATCHING

4.5 Modeling English stress with MaxEnt

4.5.1 Training data: the CMU pronouncing dictionary

The CMU pronouncing dictionary (Weide, 1994) has been used throughout this dissertation to find the lexical frequencies of the trends discussed. In this chapter, it is used as the training data for fitting a MaxEnt model of the English stress system. The dictionary covers American English lexical items and contains about 134,000 entries, each phonetically transcribed, with every vowel annotated for primary, secondary, or no stress. The average adult vocabulary is much smaller than this, about 9,000 to 17,000 words (Zechmeister et al., 1995), so the pronouncing dictionary contains a great many entries that are too infrequent to be present in most adult native speakers’ vocabularies.
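As an illustration of the stress annotation just described, the following minimal sketch reads one entry in the plain-text CMU format, where each vowel symbol ends in a digit (1 for primary stress, 2 for secondary, 0 for none). The function names are hypothetical and are not taken from the scripts used in this dissertation.

```python
def parse_cmu_entry(line):
    """Split one CMU-style line into (word, phones, stress pattern)."""
    word, *phones = line.split()
    # Only vowels carry a trailing stress digit: 1 primary, 2 secondary, 0 none.
    stress = [int(p[-1]) for p in phones if p[-1].isdigit()]
    return word, phones, stress


def main_stress_from_right(stress):
    """Position of primary stress counted from the right edge (1 = ultima)."""
    if 1 not in stress:
        return None
    return len(stress) - stress.index(1)


word, phones, stress = parse_cmu_entry("BANANA  B AH0 N AE1 N AH0")
print(word, stress, main_stress_from_right(stress))  # BANANA [0, 1, 0] 2
```

Counting the stressed syllable from the right edge gives the ultima, penultimate, and antepenultimate positions directly as 1, 2, and 3.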

The CMU pronouncing dictionary also contains forms with inflectional morphology (both ‘banana’ and ‘bananas’, for example). English inflectional morphology does not affect a word’s stress, so including these entries inflates the counts of particular word shapes (especially words with final clusters). In order to avoid very low-frequency entries and entries with inflectional morphology, a ‘cleaned-up’ version of CMU was used, namely the input corpus for Hayes’s phonotactic probability calculator (Hayes, 2012). This input file contains 18,034 entries, all of which are frequent enough to be in English CELEX (Baayen et al., 1993). This dataset also avoids entries with inflectional morphology, and certain transcription ‘errors’ are corrected, such as multiple primary stresses on a single word.

A series of scripts was then used to annotate this lexicon further. Each word was first syllabified and annotated for its syllable structure (e.g. CVC, CVCC). The maximal onset principle was used, so that clusters which are legal onsets of English were assumed to be onsets in every case.² The CMU transcription system does distinguish between syllabic [ɹ̩] and [ɹ] as a coda, but it does not represent syllabic l’s and nasals, instead transcribing them as ə followed by a coda l or nasal. In order to prevent inflated counts of syllables with codas in stressless position, all schwa-l or schwa-nasal sequences in stressless position were assumed to be syllabic sonorants instead of syllables with codas, and were re-transcribed as such. Diphthongs [e͡ɪ, a͡ɪ, oʊ, a͡ʊ, ɔ͡ɪ] were counted as long, while the non-diphthongs [ɑ, a, æ, ɛ, ɪ, i, ʊ, u, ɹ

Additionally, each entry was cross-referenced with English CELEX for part of speech information, and with SUBTLEX (Brysbaert and New, 2009) for frequency information. Spelling was used as a proxy for annotating derivational morphology through the process described in Section 4.5.1.1. This information about each lexical item was then used to annotate further for a variety of factors, such as a word’s main stress, weight of the penultimate syllable, weight of the final syllable, final vowel, contents of the word-final coda, etc. Only words with at least two syllables were included in the calculations. The total number of words included in each search of the lexicon was 11,765.
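The syllabification step described above can be sketched as follows. This is a simplified illustration of the maximal onset principle over CMU-style phone symbols, not the algorithm actually used. The set LEGAL_ONSETS is a small hypothetical subset of English onset clusters, and the sketch assumes that any single consonant other than NG can serve as an onset.

```python
# Illustrative subset of legal English onset clusters (an assumption, not a full list).
LEGAL_ONSETS = {
    ("S", "T"), ("S", "P"), ("S", "K"), ("S", "T", "R"), ("S", "P", "R"),
    ("T", "R"), ("P", "R"), ("K", "R"), ("B", "R"), ("D", "R"), ("G", "R"),
    ("P", "L"), ("K", "L"), ("B", "L"), ("G", "L"), ("F", "L"), ("F", "R"),
}


def is_vowel(phone):
    # In CMU-style transcriptions only vowels end in a stress digit.
    return phone[-1].isdigit()


def is_legal_onset(onset):
    if not onset:
        return True
    if len(onset) == 1:
        return onset[0] != "NG"  # simplifying assumption
    return onset in LEGAL_ONSETS


def syllabify(phones):
    """Give each non-initial vowel the longest legal onset to its left."""
    vowel_idx = [i for i, p in enumerate(phones) if is_vowel(p)]
    boundaries = [0]
    for prev, nxt in zip(vowel_idx, vowel_idx[1:]):
        cluster = phones[prev + 1:nxt]
        split = len(cluster)
        # Try the whole cluster as an onset first, then shorter and shorter tails.
        for k in range(len(cluster) + 1):
            if is_legal_onset(tuple(cluster[k:])):
                split = k
                break
        boundaries.append(prev + 1 + split)
    boundaries.append(len(phones))
    return [phones[a:b] for a, b in zip(boundaries, boundaries[1:])]


def shape(syllable):
    """Syllable structure as a CV string, e.g. CVC."""
    return "".join("V" if is_vowel(p) else "C" for p in syllable)


print([shape(s) for s in syllabify("EH1 K S T R AH0".split())])  # ['VC', 'CCCV']
```

In the example, the medial cluster of ‘extra’ is split so that the longest legal onset, [str], begins the second syllable and only [k] is left as a coda.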

4.5.1.1 Morphology Matters

English stress assignment is conditioned by morphology in two ways. First, ‘neutral’ prefixes and suffixes can be added to a word without affecting its stress pattern, e.g. prímitive ∼ prímitiveness; cálibrate ∼ recálibrate. Second, many affixes enforce a specific main stress pattern. English has suffixes which enforce final stress, penultimate stress, antepenultimate stress, and even pre-antepenultimate stress (Teschner and Whitley, 2004). Also, as discussed by Chomsky and Halle (1968), Burzio (1994), Pater (2000), and Collie (2008), words derived via a stress-shifting affix can preserve the main stress of their base as a secondary stress, resulting in a stress pattern which is atypical for monomorphemic words.

Table 4.5. Examples of stress-shifting affixes. See the appendix for a full list of affixes, compiled from the list in Pronouncing English, Chapter 2.

Stress shifted to:     Base         Derived
ultima                 exámine      exàminée
penultimate            abólish      àbolítion
antepenultimate        sólid        solídifỳ
pre-antepenultimate    antíque      ántiquàry
no shift               compánion    compánionable

² Thanks are due to Robert Staubs for generously lending his syllabification algorithm.

In the corpus searches presented here, spelling was used as a proxy to detect derivational morphology. For example, words ending in ‘tion’ were considered to end in the ‘-tion’ affix. The list of suffixes and prefixes in Pronouncing English, Chapter 2, was taken to be exhaustive, and words in the corpus with any of these strings at the appropriate edge of the word were marked as morphologically complex.

Some affix strings were excluded because more simple words fit them than complex words. Examples are ‘ab-’, ‘ad-’, ‘re-’, ‘-y’ and ‘-o’. Ultimately, this method of marking words was variably successful depending on the length of the word, and it was relatively conservative, tending to err in the direction of marking simplex words as morphologically complex rather than the reverse.
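The spelling-based marking described above amounts to an edge match against affix lists, with the excluded strings filtered out. The sketch below assumes this reading. The short lists are illustrative only: the suffixes are those appearing in this section's examples, the prefix ‘un-’ is a hypothetical stand-in, and the real lists come from Pronouncing English, Chapter 2.

```python
# Illustrative affix lists (assumptions); the full lists are much longer.
SUFFIXES = ["tion", "ary", "ness", "able", "ee", "ify", "y", "o"]
PREFIXES = ["un", "ab", "ad", "re"]

# Strings excluded because more simple words than complex words fit them.
EXCLUDED_SUFFIXES = {"y", "o"}
EXCLUDED_PREFIXES = {"ab", "ad", "re"}


def is_marked_complex(spelling):
    """True if the spelling carries a listed, non-excluded affix string at an edge."""
    word = spelling.lower()
    suffix_hit = any(
        word.endswith(s) and len(word) > len(s)
        for s in SUFFIXES if s not in EXCLUDED_SUFFIXES
    )
    prefix_hit = any(
        word.startswith(p) and len(word) > len(p)
        for p in PREFIXES if p not in EXCLUDED_PREFIXES
    )
    return suffix_hit or prefix_hit


for w in ["abolition", "vary", "banana", "happy"]:
    print(w, is_marked_complex(w))
```

Note that ‘vary’ is marked as complex because it ends in the string ‘ary’, which is exactly the kind of false alarm this method produces for short words.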

The success of this morphological discrimination was assessed by randomly sampling 100 words from each category (morphologically simple, morphologically complex) for lengths of 2 syllables, 3 syllables, and 4 syllables. A native English speaker (the author) then checked these randomly sampled words (600 total) and noted the number of incorrect categorizations in each sample.
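The sampling design described above, 100 words from each of the six cells defined by category and syllable count, can be sketched as below. The input format (a tuple of word, syllable count, and complexity flag) and the fixed seed are assumptions made for the illustration, not details of the original procedure.

```python
import random
from collections import defaultdict


def stratified_sample(entries, per_cell=100, seed=0):
    """Draw up to per_cell words from each (syllable count, is_complex) cell.

    entries: iterable of (word, n_syllables, is_complex) tuples (hypothetical format).
    """
    cells = defaultdict(list)
    for word, n_syll, is_complex in entries:
        if n_syll in (2, 3, 4):
            cells[(n_syll, is_complex)].append(word)
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return {
        cell: rng.sample(words, min(per_cell, len(words)))
        for cell, words in cells.items()
    }


# Toy lexicon with 150 placeholder words in each of the six cells.
toy = [
    (f"w{n}{int(c)}_{i}", n, c)
    for n in (2, 3, 4) for c in (False, True) for i in range(150)
]
samples = stratified_sample(toy)
print(sum(len(words) for words in samples.values()))  # 600
```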

Because 2-syllable words of English are morphologically complex relatively rarely, a high percentage of words which end in strings that normally constitute an affix are false alarms (e.g. ‘vary’ ends in ‘-ary’). On the other hand, the majority of longer words of English are morphologically complex, so strings which are not usually a separate morpheme in shorter words often are in longer words (for example, a final ‘-y’).

Table 4.6. Number of incorrect morphology categorizations in each random sample of 100 words

               Categorized as:
               Simple    Complex    F1 score
2 syllables    8         72         0.43
3 syllables    12        13         0.87
4 syllables    26        2          0.84

The morphological marking was also used to check the ‘accuracy’ of stress-shifting affixes. Words containing affixes marked as stress-shifting in Teschner and Whitley (2004) did have the prescribed stress pattern the majority of the time. In the following table, all affixes of each class are grouped together.

Table 4.7. Rate of obedience in the lexicon of stress-shift demands

Claimed stress shift:    % obedience    no. suffixes
ultima                   71%            13
penultimate              97%            17
antepenultimate          77%            29
pre-antepenultimate      67%            1
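A rate of obedience of the kind reported in Table 4.7 can be computed by pooling all words carrying a suffix of a given class and counting how many bear main stress where the suffix demands it. The sketch below is a hedged illustration: the three suffix strings and their claimed positions are read off the examples in Table 4.5, the toy word list is invented for the demonstration, and none of this reproduces the figures in the table.

```python
from collections import defaultdict

# Claimed main stress position, counted from the right edge (1 = ultima).
# Illustrative entries only, inferred from the examples in Table 4.5.
CLAIMED_SHIFT = {"ee": 1, "tion": 2, "ify": 3}


def obedience_by_class(words):
    """words: iterable of (spelling, main stress position from the right)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for spelling, stress_pos in words:
        for suffix, claimed in CLAIMED_SHIFT.items():
            if spelling.endswith(suffix):
                totals[claimed] += 1
                hits[claimed] += int(stress_pos == claimed)
                break  # count each word under one suffix only
    return {claimed: hits[claimed] / totals[claimed] for claimed in totals}


toy_words = [
    ("examinee", 1),   # obeys the demand for final stress
    ("committee", 2),  # ends in the string 'ee' but has penultimate stress
    ("abolition", 2),
    ("solidify", 3),
]
print(obedience_by_class(toy_words))  # {1: 0.5, 2: 1.0, 3: 1.0}
```

Because suffixes are detected by spelling, a word such as ‘committee’ counts against the obedience of its class even if its final string is not the affix in question.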
