Gradient exceptionality in Maximum Entropy Grammar with lexically specific constraints

(1)

Gradient exceptionality in Maximum Entropy

Grammar with lexically specific constraints

Claire Moore-Cantwell Joe Pater

University of Massachusetts Amherst

Exceptionality Workshop OCP, Barcelona Jan. 27, 2015

Claire Moore-Cantwell, Joe Pater UMass Amherst OCP Barcelona Gradient exceptionality in Maximum Entropy Grammar with lexically specific constraints 1 / 30

(2)

The problem

The most general form of the problem: the inadequacy of a two-way distinction between rule-governed vs. exceptional (see e.g. Hayes 2008: ch. 9 on gradient productivity).

A specific example: degrees of regularity / exceptionality in stress placement

Peperkamp et al. (2010) provide psycholinguistic evidence for the following continuum of regularity, which matches relative degrees of regularity in the lexicons:

(3)

The problem

The most general form of the problem: the inadequacy of a two-way distinction between rule-governed vs. exceptional (see e.g. Hayes 2008: ch. 9 on gradient productivity). A specific example: degrees of regularity / exceptionality in stress placement

French, Finnish, Hungarian > Polish > Spanish

(4)

The problem

The most general form of the problem: the inadequacy of a two-way distinction between rule-governed vs. exceptional (see e.g. Hayes 2008: ch. 9 on gradient productivity). A specific example: degrees of regularity / exceptionality in stress placement

(5)

The problem

Fidelholtz (1979: 58):

It appears to be a problem for linguistic theory that there is nothing in the formal description of Polish stress which would indicate that Polish is a ‘penultimate-stress’ language, as compared with the similar rules in English, which is essentially a free-stress language.

When the penultimate syllable is light, stress falls on the penultimate syllable of some English and Polish words (e.g. English ban´ana), and on the antepenult on others (e.g. C´anada).

In English, both patterns are well-attested (Pater 1994), and each word’s pronunciation is stable.

In Polish, there are very few antepenultimately stressed words, and there is frequent regularization to penultimate stress.

(6)

The problem

(7)

The problem

(8)

The problem

(9)

The problem

Peperkamp et al.’s psycholinguistic evidence uses a memory task Sample item: númi númi num´ı num´ı númi OK (ISI 80 msec.)

Correct response: 1 1 2 2 1

Also tested using a segmental contrast: m´uku vs. m´unu

(10)

The problem

Peperkamp et al.’s psycholinguistic evidence uses a memory task Sample item: númi númi num´ı num´ı númi OK (ISI 80 msec.) Correct response: 1 1 2 2 1

(11)

The problem

Peperkamp et al.’s psycholinguistic evidence uses a memory task Sample item: númi númi num´ı num´ı númi OK (ISI 80 msec.) Correct response: 1 1 2 2 1

Also tested using a segmental contrast: m´uku vs. m´unu

(12)

(13)

The problem

The formal descriptions of both languages require lexical marking at least one of the patterns.

In English it’s hard to say which one - Pater 1994. Rates of attestation of the two patterns and productivity is affected by the quality of the final vowel - Moore-Cantwell 2014

The number of exceptions, few or many, does not affect the grammatical status of a a pattern at all in a standard generative grammar

This is true of the SPE formalism of Fidelholtz’s time, of metrical rule approaches, and of OT accounts with lexically specific constraints (posited independently by Kraska-Szlenk 1995 diss. for Polish and Pater 1995 ROA ms. [2000 Phonology] for English)

(14)

The problem

(15)

The problem

(16)

The problem

There have been a number of proposals addressing the general problem of gradient productivity in the recent generative literature – ours builds on those using:

Stochastic grammars: Zuraw (2000) using Boersma’s

Stochastic OT, Ernestus and Baayen (2003) using Stochastic OT and MaxEnt, Hayes and Wilson (2008) and Hayes, Zuraw,

Sipt´ar and Londe (2009) using MaxEnt

(17)

The problem

There have been a number of proposals addressing the general problem of gradient productivity in the recent generative literature – ours builds on those using:

Stochastic grammars: Zuraw (2000) using Boersma’s

Stochastic OT, Ernestus and Baayen (2003) using Stochastic OT and MaxEnt, Hayes and Wilson (2008) and Hayes, Zuraw,

Sipt´ar and Londe (2009) using MaxEnt

Indexed constraints: Pater (2005) and Becker (2009)

(18)

The problem

Challenges:

Stochastic grammar approaches sometimes conflate

within-word variation and generalizations across the lexicon (a challenge first noted in Zuraw 2000)

Indexed constraint approaches require a special calculation to get an influence of degree of regularity

We aim to avoid these problems by synthesizing the two approaches

(19)

The problem

Challenges:

(20)

The problem

Challenges:

(21)

Our proposal illustrated with stress

We will show that when lexically specific constraints are

incorporated into a MaxEnt framework (Goldwater and Johnson 2003), the outcome of learning does yield a grammatical difference between a language like Polish and one like English

H = weighted sum of violations (HG Harmony) p is proportional to exp(H)

(22)

Our proposal illustrated with stress

We will show that when lexically specific constraints are

incorporated into a MaxEnt framework (Goldwater and Johnson 2003), the outcome of learning does yield a grammatical difference between a language like Polish and one like English

H = weighted sum of violations (HG Harmony) p is proportional to exp(H)

(23)

Our proposal illustrated with stress

We assume that the grammar is composed of general constraints, and lexically specific versions of (some of) them, indexed to particular items

Given basic assumptions about how MaxEnt grammars are learned, weights of the general constraints and of the lexically specific ones will vary as a function of the lexical probability of a pattern

When a pattern is relatively general across the lexicon, the general constraint will wind up with a relatively high weigh

When there are relatively few exceptions, the lexically specific constraints encoding them need to have relatively high weight for the words to be pronounced faithfully

When a pattern is relatively rare, the general constraint will wind up with a relatively low weight

(24)

Our proposal illustrated with stress

(25)

Our proposal illustrated with stress

(26)

Our proposal illustrated with stress

(27)

Our proposal illustrated with stress

(28)

Our proposal illustrated with stress

An illustration: a set of simple stress ‘languages’ with varying degrees of regularity

Stress can fall in one of two positions, and there is a general constraint preferring each (e.g. Align-L and Align-R)

There are 100 words; languages have stress in one position in 100 - 50 words (fully regular to fully lexical), making 51 languages

(29)

Our proposal illustrated with stress

100 lexically specific versions of Align-L, and 100 of Align-R

(30)

Our proposal illustrated with stress

(31)

Our proposal illustrated with stress

100 lexically specific versions of Align-L, and 100 of Align-R

(32)

Our proposal illustrated with stress

Constraint weights began at zero, and were updated over 1000 epochs of Gradient Descent, with a learning rate of 0.1

Gradient Descent is similar to the Stochastic OT / HG Gradual Learning Algorithm except that it is run in batch -each epoch is one presentation of all the data (see Pater and Staubs 2013: NECPhon)

Used for convenience – closely approximates results of regular GLA

(33)

Our proposal illustrated with stress

(34)

Our proposal illustrated with stress

(35)

Our proposal illustrated with stress

Given lexical irregularity, how do we find out what the grammar does?

One increasingly popular answer: Wug test / nonce word judgment.

A wug test in our model means that the lexically specific constraints have no influence

(36)

Our proposal illustrated with stress

(37)

Our proposal illustrated with stress

(38)

(39)

Comparison with related approaches

MaxEnt without lexically specific constraints

Hayes and Wilson (2008): only applicable to phonotactics

Ernestus and Baayen (2003) and Hayes, Zuraw, Sipt´ar and

Londe (2009): grammars generate the same degree of variation for all words of a given morpho-phonological shape This works for modeling the nonce word data, but the grammar needs something else to generate the correct pronunciation of individual words

The learned grammars in our simulation always gave near 1 probability correct to individual words

(40)

Comparison with related approaches

(41)

Comparison with related approaches

Londe (2009): grammars generate the same degree of variation for all words of a given morpho-phonological shape

This works for modeling the nonce word data, but the grammar needs something else to generate the correct pronunciation of individual words

(42)

Comparison with related approaches

(43)

Comparison with related approaches

(44)

Comparison with related approaches

Lexically specific constraints without MaxEnt

Pater (2005): nonce probabilities generated by considering all possible indexations

Becker (2009): nonce probabilities derived from the number of words attached to each indexed constraint

(45)

Comparison with related approaches

(46)

Comparison with related approaches

(47)

Comparison with related approaches

Compared with prior work, this proposal is underdeveloped in terms of being applied to real cases of lexical irregularity and accompanied judgments

As a step forward, and as an illustration of how this model can be applied to alternations, we next turn to modeling of Ernestus and Baayen’s (2003) Dutch data

(48)

Simulating Dutch voicing alternations

The pattern (Ernestus and Baayen, 2003):

Word-internal obstruents contrast in voicing

[vErVEid@n] ’widen-inf’

[vErVEit@n] ’reproach-inf’

Word-final devoicing

[vErVEit] ’widen’

(49)

Simulating Dutch voicing alternations

The pattern (Ernestus and Baayen, 2003): Word-internal obstruents contrast in voicing

[vErVEit] ’reproach’

(50)

Simulating Dutch voicing alternations

The pattern (Ernestus and Baayen, 2003): Word-internal obstruents contrast in voicing

(51)

Simulating Dutch voicing alternations

Given a neutralized novel form, Dutch speakers can ’guess’ a voicing for the word-internal obstruent based on lexical trends

% voiced in lexicon % voiced in production

p/b 9% 4%

t/d 25% 9%

s/z 33% 23%

f/v 70% 49%

x/G 97% 80%

(52)

Simulating Dutch voicing alternations

p/b 9% 4%

t/d 25% 9%

s/z 33% 23%

f/v 70% 49%

(53)

Simulating Dutch voicing alternations

p/b 9% 4%

t/d 25% 9%

s/z 33% 23%

f/v 70% 49%

x/G 97% 80%

(54)

Simulating Dutch voicing alternations

We used six general constraints (a subset of those used by Ernestus and Baayen pg.20):

*VTV Intervocalic obstruents should be voiced

*VDV Intervocalic obstruents should be voiceless

*P[+voice] Labial stops should be voiceless

*T[+voice] Coronal stops should be voiceless

*S[+voice] Sibilants should be voiceless

*F[-voice] Labial fricatives should be voiced

*X[-voice] Velar fricatives should be voiced

(55)

Simulating Dutch voicing alternations

We used six general constraints (a subset of those used by Ernestus and Baayen pg.20):

*VTV Intervocalic obstruents should be voiced

*VDV Intervocalic obstruents should be voiceless

*P[+voice] Labial stops should be voiceless

*T[+voice] Coronal stops should be voiceless

*S[+voice] Sibilants should be voiceless

*F[-voice] Labial fricatives should be voiced

*X[-voice] Velar fricatives should be voiced

n.b. these constraints are shorthand

(56)

Simulating Dutch voicing alternations

Each lexical item gets a specific version of one of the most general constraints:

/vErVEid/ *VTV-widen ‘widen’ has intervocalic voicing

(57)

Simulating Dutch voicing alternations

Data used:

Trend % Voicing Total forms

p/b voiceless 9% 230

t/d voiceless 25% 719

s/z voiceless 33% 451

f/v voiced 70% 166

x/G voiced 97% 131

(58)

Simulating Dutch voicing alternations

Simulation details:

Gradient Descent, with 10,000 learning epochs (one epoch: one update over the entire input set) Learning rate: 0.01

Regularization: constraint weights based on a Gaussian prior

(59)

Simulating Dutch voicing alternations

Results summary:

Real words are correctly voiced/voiceless % voiced on novel words:

lexicon production (EB) Simulated

p/b 9% 4% 1%

t/d 25% 9% 9%

s/z 33% 23% 18%

f/v 70% 49% 84%

x/G 97% 80% 99%

(60)

Simulating Dutch voicing alternations

Results summary:

Real words are correctly voiced/voiceless % voiced on novel words:

lexicon production (EB) Simulated

p/b 9% 4% 1%

t/d 25% 9% 9%

s/z 33% 23% 18%

f/v 70% 49% 84%

(61)

Simulating Dutch voicing alternations

Weights for forms with x/G: 97% voiced (131 total)

0 2000 4000 6000 8000 10000 -5 0 5 Epochs Weight *VTV *X[-voice] *VTVi *VDVj (exceptions)

- High weight on *X[-voice] - Exceptions: high weights

⇒ conflict with *X[-voice] - Trend-followers: low weights

⇒ concord with *X[-voice]

(62)

Simulating Dutch voicing alternations

- High weight on *X[-voice]

- Exceptions: high weights ⇒ conflict with *X[-voice] - Trend-followers: low weights

(63)

Simulating Dutch voicing alternations

(64)

Simulating Dutch voicing alternations

⇒ conflict with *X[-voice]

- Trend-followers: low weights ⇒ concord with *X[-voice]

(65)

Simulating Dutch voicing alternations

(66)

Simulating Dutch voicing alternations

(67)

Simulating Dutch voicing alternations

Probabilities for forms with x/G: 97% voiced (131 total)

0 2000 4000 6000 8000 10000 0.0 0.2 0.4 0.6 0.8 1.0 Epochs P(vo ice d) EB wug-test (80%) Lexicon (97%) voiced in lexicon voiceless in lexicon wug

- Novel words: voiced

- Trend-followers: correctly voiced - Exceptions: errors towards voicing - Analogous to Polish stress: exceptions exist, but are prone to regularization

(68)

Simulating Dutch voicing alternations

(69)

Simulating Dutch voicing alternations

- Trend-followers: correctly voiced

- Exceptions: errors towards voicing - Analogous to Polish stress: exceptions exist, but are prone to regularization

(70)

Simulating Dutch voicing alternations

- Trend-followers: correctly voiced - Exceptions: errors towards voicing

- Analogous to Polish stress: exceptions exist, but are prone to regularization

(71)

Simulating Dutch voicing alternations

(72)

Simulating Dutch voicing alternations

Weights for forms with s/z: 33% voiced (451 total)

0 2000 4000 6000 8000 10000 -5 0 5 Epochs Weight *VTV *S[+voice] *VDVi *VTVj (exceptions)

- Low-ish weight on *S[+voice] - Exceptions: high weights - Trend-followers: lower weights

(73)

Simulating Dutch voicing alternations

- Low-ish weight on *S[+voice]

- Exceptions: high weights - Trend-followers: lower weights

⇒ but higher than *S[+voice]

(74)

Simulating Dutch voicing alternations

- Low-ish weight on *S[+voice] - Exceptions: high weights

- Trend-followers: lower weights ⇒ but higher than *S[+voice]

(75)

Simulating Dutch voicing alternations

⇒ but higher than *S[+voice]

(76)

Simulating Dutch voicing alternations

(77)

Simulating Dutch voicing alternations

Probabilities for forms with s/z: 33% voiced (451 total)

- Novel words: 20% voiced - Existing words: correct

- Analogous to English stress: each word’s pronounciation is stable

(78)

Simulating Dutch voicing alternations

- Novel words: 20% voiced

- Existing words: correct

(79)

Simulating Dutch voicing alternations

(80)

Simulating Dutch voicing alternations

(81)

Summary

Using lexically specific constraints together with more general constraints like *P[+voice], we can model:

1 Probabilistic lexical trends which are reflected in speakers’

treatment of novel words

2 Categorical behavior of extant words in a language’s lexicon

3 Difference between generalizations with few vs. many

exceptions

(82)

Summary

(83)

Summary

exceptions

(84)

Summary

(85)

Thank you!

Special thanks are due to Robert Staubs, Sharon Peperkamp, and Emmanuel Dupoux for useful discussion and comments, and to Robert for generously lending his gradient descent script. This material is based upon work supported by the National Science Foundation under Grant No. 1451512 to the first author, Grant BCS-424077 to the University of Massachusetts Amherst, and by the city of Paris through a Research in Paris fellowship to the first author.