CS388: Natural Language Processing Lecture 4: Sequence Models

(1)

CS388: Natural Language Processing Lecture 4: Sequence Models

Eunsol Choi

Parts of this lecture adapted from Greg DurreA, Yejin Choi, Yoav Artzi

(2)

LogisHcs: Time to think about ﬁnal project!

2

‣ There will be extra instructor oﬃce hour next week (sign up will be on piazza)

‣ 10 minute slots, sign up for one oﬃce hour slot per team

‣ Next lecture I’ll cover topic that could help you brainstorm about your ﬁnal project

‣ Project Proposal Due October 5th

‣ HW2 (implemenHng CRF) due September 30th — get started early!

(3)

CRFs Outline

‣ Model: P (y|x) = 1 Z

Yn i=2

exp( _t(y_{i 1}, y_i))

Yn i=1

exp( _e(y_i, i, x))

‣ Inference: argmax P(y|x) from Viterbi P (y|x) / exp w^>

" _n X

i=2

f_t(y_{i 1}, y_i) +

Xn

i=1

f_e(y_i, i, x)

#

(4)

CRF Learning

4

gold features — expected features under model

@

@w L(y

^⇤

, x) =

X

n i=1

f

_e

(y

_i^⇤

, i, x)

X

n i=1

X

s

P (y

_i

= s |x)f

^e

(s, i, x)

@

@w L(y^⇤, x) =

Xn i=2

f_t(y_{i 1}^⇤ , y_i^⇤) +

Xn i=1

f_e(y_i^⇤, i, x)

E^y

" _n X

i=2

f_t(y_{i 1}, y_i) +

Xn

i=1

f_e(y_i, i, x)

#

E^y

" _n X

i=1

f_e(y_i, i, x)

#

= X

y2Y

P (y|x)

" _n X

i=1

f_e(y_i, i, x)

#

=

Xn i=1

X

y2Y

P (y|x)f^e(y_i, i, x)

is annotated sequence for input sequence of length

y* x n

(5)

Forward Backward: RunHme

5

‣ Linear in sentence length

‣ Polynomial in the number of possible tags |K|

‣ Total RunHme:

‣ How does this compare to Viterbi?

t(s_t) = X

s_t+1

t+1(s_t+1) exp( _e(s_t+1, t + 1, x))

<latexit sha1_base64="o+YDBL44WMQt+0KfoSuWzIAvSK8=">AAAC3XicbVLLbtQwFHXCq4RHp7BkYzGqlIgwSgpS2SBVsGFZJKatmIwsx+N0rDqJZd+gjqJIbFiAEFv+ix3/wQfguClipr2SreNz7r3Hr1xJYSBJfnv+jZu3bt/Zuhvcu//g4fZo59GRqRvN+JTVstYnOTVciopPQYDkJ0pzWuaSH+dnb3v9+BPXRtTVB1gpPi/paSUKwShYioz+7GZUqiUlaWgi/Bpn/FyFmVoKwkMTp3FWUljmRXveRVFwmQuhIeCyTVOS1pAWnqddhwfZrcKBjdZbEohhs+k/HS6L4r5/hIMs53Ct3TNn51S3CAdy082RsZ3WPMlonEwSF/gqSAcwRkMcktGvbFGzpuQVMEmNmaWJgnlLNQgmeRdkjeGKsjN6ymcWVrTkZt661+nwrmUWuKi1HRVgx/5f0dLSmFWZ28x+j2ZT68nrtFkDxat5KyrVAK/YhVHRSAw17p8aL4TmDOTKAsq0sHvFbEk1ZWA/RGAvId088lVwtDdJX0z23r8cH7wZrmMLPUFPUYhStI8O0Dt0iKaIeR+9z95X75tP/C/+d//HRarvDTWP0Vr4P/8Cs27g5w==</latexit>

exp( _t(s_t, s_t+1))

<latexit sha1_base64="MMxuphUYCcsoCkHdOiUL0s5PNcs=">AAAC+nicbVLLbtQwFHVSHu3wmpYlG4tRpUSEUVKQYINUwYZlkZi20mRkOR6nY9VJLPsGOgr5FDYsQIgtX9Idf4PjZhAz7ZVsHZ9z7z1+ZUoKA3H8x/O3bt2+c3d7Z3Dv/oOHj4a7e8emqjXjE1bJSp9m1HApSj4BAZKfKs1pkUl+kp2/6/STT1wbUZUfYan4rKBnpcgFo2ApsusN91Mq1YKSJDAhfoNTfqGCVC0E4YGJkigtKCyyvLlow3CwyoXAEHDZpi5IY0gDz5O2xb3sVkHPhustCUSw2fSfDquiqOsfYqtlHG70e+b8nOoWQU9u2jkystO66bonRKvqEJPhKB7HLvB1kPRghPo4IsPLdF6xuuAlMEmNmSaxgllDNQgmeTtIa8MVZef0jE8tLGnBzaxxT9fifcvMcV5pO0rAjv2/oqGFMcsis5nd9s2m1pE3adMa8tezRpSqBl6yK6O8lhgq3P0DPBeaM5BLCyjTwu4VswXVlIH9LQN7Ccnmka+D44Nx8mJ88OHl6PBtfx3b6Al6igKUoFfoEL1HR2iCmPfZ++p99374X/xv/k//11Wq7/U1j9Fa+L//AuUc6iA=</latexit>

↵_t(s_t) = X

s_t ₁

↵_{t 1}(s_{t 1}) exp( _e(s_t, t, x))

<latexit sha1_base64="MsrARqnMVgv4lQtn8MktFrMi2DM=">AAAChHicbVFNT9wwEHVCS+n2g2175GKxWjWR0lUMreilCNFLjyB1AWmzshyvw1o4iWVPEKsov4R/1Rv/pk4IogsdydLze2/G45lUK2khju88f+PFy81XW68Hb96+e789/PDxzJaV4WLKS1Wai5RZoWQhpiBBiQttBMtTJc7Tq5+tfn4tjJVl8RtWWsxzdlnITHIGjqLD23HClF4ySgIb4h84ETc6SPRSUhHYiERJzmCZZvVNE4aD3gqBpdCZbZXT2tIavpCmwb3c3YKeDdcrUohgveb4UYeHpKitH2I6HMWTuAv8HJAejFAfJ3T4J1mUvMpFAVwxa2ck1jCvmQHJlWgGSWWFZvyKXYqZgwXLhZ3X3RAbPHbMAmelcacA3LH/ZtQst3aVp87Z9m+fai35P21WQfZ9XstCVyAKfv9QVikMJW43ghfSCA5q5QDjRrpeMV8ywzi4vQ3cEMjTLz8HZ3sTsj/ZO/06Ojrux7GFdtAuChBBB+gI/UInaIq453mfvdgj/qYf+fv+t3ur7/U5n9Ba+Id/AUKdvsc=</latexit>

exp( _t(s_{t 1}, s_t))

<latexit sha1_base64="+aaYS+pf0KkwlV5IpyMknO6c7lg=">AAAChHicbVFNT9wwEHUCpXT7tdBjLxYr1ERKVzG0gksr1F44gtQFpM3KcrwOa+Eklj1BrKL8kv4rbvybOiGoLDCSpef33ozHM6lW0kIc33n+2vqrjdebbwZv373/8HG4tX1my8pwMeGlKs1FyqxQshATkKDEhTaC5akS5+nV71Y/vxbGyrL4A0stZjm7LGQmOQNH0eHf3YQpvWCUBDbEP3AibnSQ6IWkIrARiZKcwSLN6psmDAcPXggshc5tq5zWltbwlTQN7uXuFvRsuFqSQgSrRf/L8JATteVDTIejeBx3gZ8D0oMR6uOEDm+TecmrXBTAFbN2SmINs5oZkFyJZpBUVmjGr9ilmDpYsFzYWd0NscG7jpnjrDTuFIA79nFGzXJrl3nqnG379qnWki9p0wqyw1ktC12BKPj9Q1mlMJS43QieSyM4qKUDjBvpesV8wQzj4PY2cEMgT7/8HJztjcn+eO/02+joVz+OTfQZ7aAAEXSAjtAxOkETxD3P++LFHvE3/Mjf97/fW32vz/mEVsL/+Q8xFr7H</latexit>

(6)

POS Tagging Performances

‣ Baseline: assign each word its most frequent tag: ~90% accuracy

‣ Trigram HMM: ~95% accuracy / 55% on unknown words

‣ TnT tagger (Brants 1998, tuned HMM): 96.2% accuracy / 86.0% on unks

Slide credit: Dan Klein

‣ MaxEnt : 93.7% accuracy / 82.6% on unks

‣ MEMM [Ratnaparkhi 1996]: 96.8% accuracy / 86.9% on unks

‣ Perceptron: 97.1% accuracy

‣ CRF: 97.3% accuracy

(7)

Errors

oﬃcial knowledge made up the story recently sold shares JJ/NN NN VBD RP/IN DT NN RB VBD/VBN NNS

Slide credit: Dan Klein / Toutanova + Manning (2000)

(NN NN: tax cut, art gallery, …)

(8)

Remaining Errors

‣ Underspeciﬁed / unclear, gold standard inconsistent / wrong: 58%

‣ Lexicon gap (word not seen with that tag in training) 4.5%

‣ Unknown word: 4.5%

‣ Could get right: 16%

‣ Diﬃcult linguisHcs: 20%

They set up absurd situa/ons, detached from reality VBD / VBP? (past or present?)

a $ 10 million fourth-quarter charge against discon/nued opera/ons adjecHve or verbal parHciple? JJ / VBN?

Manning 2011 “Part-of-Speech Tagging from 97% to 100%: Is It Time for Some LinguisHcs?”

(9)

Remaining QuesHons

9

‣ How to handle noisy text?

‣ How to keep high accuracy when we go to new domains?

‣ How can we combine unlabeled data eﬃciently to labeled data?

‣ Can we incorporate domain speciﬁc lexicon? (List of protein names?)

(10)

Pseudocode for Training CRF

for each epoch

for each example

extract features on each emission and transiHon (look up in cache)

compute marginal probabiliHes with forward-backward compute potenHals phi based on features + weights

accumulate gradient over all emissions and transiHons

(11)

ImplementaHon Tips for CRFs

‣ Caching is your friend! Cache feature vectors especially

‣ Try to reduce redundant computaHon, e.g. if you compute both the gradient and the objecHve value, don’t rerun the dynamic program

‣ If things are too slow, run a proﬁler and see where Hme is being spent.

Forward-backward should take most of the Hme

‣ Exploit sparsity in feature vectors where possible, especially in feature vectors and gradients

‣ Do all dynamic program computaHon in log space to avoid underﬂow

(12)

Debugging Tips for CRFs

‣ Hard to know whether inference, learning, or the model is broken!

‣ Compute the objecHve — is opHmizaHon working?

‣ Learning: is the objecHve going down? Try to ﬁt 1 example / 10 examples. Are you applying the gradient correctly?

‣ Inference: check gradient computaHon (most likely place for bugs)

‣ Is the same for all i?

‣ Do probabiliHes normalize correctly + look “reasonable”? (Nearly uniform when untrained, then slowly converging to the right thing)

‣ If objecHve is going down but model performance is bad:

‣ Inference: check performance if you decode the training set

X

s

forward_i(s)backward_i(s)

(13)

Can we learn the latent states without supervised training dataset?

13

(14)

Sequence Labeling: ParHally Observed

14 Unsupervised Learning

‣ We have a set X and Y, and a joint distribuHon

p(x, y|θ)

‣ So far, we had fully observable data

(x

_i

, y

_i

)

pairs,

L(θ) = ∑

i

log P(x

_i

, y

_i

|θ)

‣ When we have parHally observable data, x only, then,

L(θ) = ∑

i

log P(x_i |θ)

= ∑i log ∑

y∈𝒴

P(x_i, y|θ)

(15)

ExpectaHon MaximizaHon

15

‣ The EM (ExpectaHon MaximizaHon) algorithm is a method for ﬁnding θ_MLE = arg max

θ L(θ) = arg max

θ ∑

i log ∑

y∈𝒴

P(x_i, y|θ)

‣ When we have parHally observable data, x only, then,

L(θ) = ∑

i log ∑

y∈𝒴

P(x_i, y|θ)

‣ We will look into EM in HMM!

(16)

EM for HMM

16

‣ Maximum Likelihood Parameters (supervised):

‣ For unsupervised learning, replace actual count with expected counts

‣ Expected emission counts:

‣ Reminds you of something? _@w^@ ^L(y^⇤^{, x) =}

Xn i=1

f_e(y_i^⇤, i, x)

Xn i=1

X

s

P (y_i = s|x)f^e(s, i, x)

(17)

That’s what we computed for forward-backward!

17 slide credit: Dan Klein

P (y

₃

= 2 |x) =

sum of all paths through state 2 at time 3 sum of all paths

=

‣ Easiest and most ﬂexible to do one pass to compute and one to

compute

(18)

Forward Backward for HMM

18

‣ Forward:

‣ For i = 1 … n

‣ Two passes: one forward, one back

‣ Backward:

‣ For i = n-1 ... 0

(19)

Other Marginal Inference

19

‣ Can we compute this?

‣ Relation with forward quantity?

(20)

EM IntuiHon

20

‣ What we want is…

‣ We can compute:

‣ If we have….

‣ Then we can compute expected transiHon counts:

‣ Above marginals can be computed from followings:

(21)

ExpectaHon MaximizaHon

21

‣ IniHalize transiHon and emission parameters:

‣ random, uniform, or more informed iniHalizaHon

‣ Iterate unHl convergence

‣ M-step: compuHng new transiHon and emission parameters

‣ Convergence? Yes. Global OpHmum? No.

‣ E-step: compuHng expected counts

(22)

Equivalent to the procedure given in the textbook (J&M Appendix A) –

slightly different notations

(23)

How is this even possible?

23

‣ I water the garden everyday

‣ Saw a weird bug in that garden …

‣ While I was thinking of an equation …

Noun

S: (n) garden (a plot of ground where plants are cultivated)

S: (n) garden (the flowers or vegetables or fruits or herbs that are cultivated in a garden) S: (n) garden (a yard or lawn adjoining a house)

Verb

S: (v) garden (work in the garden) "My hobby is gardening"

Adjective

S: (adj) garden (the usual or familiar type) "it is a common or garden sparrow"

We can start with a tag dicHonary (a list of potenHal POSs per word)

(24)

Does EM learn good POS tagger?

24

HMMs estimated by EM generally assign a roughly equal number of word tokens to each hidden state, while the empirical distribution of tokens to POS tags is highly skewed

“Why doesn’t EM find good HMM POS-taggers”, Johnson, EMNLP 2007

(25)

P (y

_i

= s |x) = forward

_i

(s)backward

_i

(s) P

s⁰

forward

_i

(s

⁰

)backward

_i

(s

⁰

)

‣ Posterior is derived from the parameters and the data (condiHoned on x!)

HMM

CRF

Model parameter (usually mulHnomial distribuHon)

Inferred quanHty from forward-backward

Undeﬁned (model is by

deﬁniHon condiHoned on x)

P (x_i|yⁱ), P (y_i|y^{i 1})

P (y

_i

|x), P (y

^{i 1}

, y

_i

|x)

HMM vs. CRF

(26)

POS Results

26

‣ Baseline: assign each word its most frequent tag: ~90% accuracy

‣ Trigram HMM: ~95% accuracy / 55% on unknown words

‣ TnT tagger (Brants 1998, tuned HMM): 96.2% accuracy / 86.0% on unks

‣ MEMM [Ratnaparkhi 1996]: 96.8% accuracy / 86.9% on dunks

‣ EM for HMM: 74.7%

(27)

Quiz: p(S1) vs. p(S2)

27

‣ S1 = Colorless green ideas sleep furiously.

‣ S2 = Furiously sleep ideas green colorless

‣ “It is fair to assume that neither sentence (S1) nor (S2) had ever occurred in an English discourse.

Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally "remote" from English” (Chomsky 1957)

‣ How would p(S1) and p(S2) compare based on (smoothed) bigram language models?

‣ How would p(S1) and p(S2) compare based on marginal probability based on POS- tagging HMMs?

‣ i.e., marginalized over all possible sequences of POS tags

(28)

‣ Sequence Modeling Problems in NLP

‣ GeneraHve Model: Hidden Markov Models (HMM)

‣ DiscriminaHve Model:

Maximum Entropy Markov Models (MEMM) CondiHonal Random Fields

‣ Unsupervised Learning: ExpectaHon MaximizaHon

Summary: Sequence Models

(29)

29

‣ The higher accuracy of discriminaHve models comes at the price of much slower training

‣ GeneraHve vs. DiscriminaHve

‣ Structured or not

‣ Independent predicHons are eﬀecHve, but global structure helps

‣ But need to be careful about computaHonal overhead

‣ Model expressivity

‣ Expressive feature set can be beAer

‣ But more expensive, overﬁ}ng