CS388: Natural Language Processing Lecture 4: Sequence Models
Eunsol Choi
Parts of this lecture adapted from Greg Durrett, Yejin Choi, Yoav Artzi
Logistics: Time to think about the final project!
‣ There will be an extra instructor office hour next week (sign-up will be on Piazza)
‣ 10-minute slots; sign up for one office hour slot per team
‣ Next lecture I’ll cover a topic that could help you brainstorm about your final project
‣ Project proposal due October 5th
‣ HW2 (implementing a CRF) due September 30th: get started early!
CRFs Outline
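‣ Model:
$$P(y \mid x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))$$
$$P(y \mid x) \propto \exp w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]$$
‣ Inference: $\arg\max_y P(y \mid x)$ from Viterbi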
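As a rough illustration of Viterbi inference for this model, here is a minimal log-space sketch. It assumes the potentials are already computed and stored as plain lists: `emit[i][s]` stands for the emission log-potential at position i and `trans[r][s]` for the transition log-potential. These names and the toy numbers are made up for the example and are not from the lecture or the homework.

```python
from itertools import product

def viterbi(emit, trans):
    """Return (best_score, best_path) for a linear-chain model.

    emit[i][s]  : emission log-potential for tag s at position i (assumed layout)
    trans[r][s] : transition log-potential from tag r to tag s (assumed layout)
    """
    n, k = len(emit), len(emit[0])
    score = [emit[0][:]]   # score[i][s]: best log-score of a prefix ending in tag s
    back = []              # back[i-1][s]: best previous tag for tag s at position i
    for i in range(1, n):
        row, ptr = [], []
        for s in range(k):
            best_r = max(range(k), key=lambda r: score[i - 1][r] + trans[r][s])
            row.append(score[i - 1][best_r] + trans[best_r][s] + emit[i][s])
            ptr.append(best_r)
        score.append(row)
        back.append(ptr)
    last = max(range(k), key=lambda s: score[-1][s])
    path = [last]
    for ptr in reversed(back):   # follow back-pointers from the end
        path.append(ptr[path[-1]])
    path.reverse()
    return score[-1][last], path

def brute_force(emit, trans):
    """Enumerate all K^n tag sequences; only feasible for tiny inputs."""
    n, k = len(emit), len(emit[0])
    def total(y):
        return (sum(emit[i][y[i]] for i in range(n))
                + sum(trans[y[i - 1]][y[i]] for i in range(1, n)))
    best = max(product(range(k), repeat=n), key=total)
    return total(best), list(best)

# Toy potentials (made-up numbers): 3 positions, 2 tags.
emit = [[1.0, 0.0], [0.2, 0.9], [0.5, 0.4]]
trans = [[0.3, -0.2], [-0.5, 0.6]]
best_score, best_path = viterbi(emit, trans)
```

Comparing against `brute_force` on small inputs is a cheap way to catch indexing mistakes in the dynamic program.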
CRF Learning
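‣ Gradient: gold features minus expected features under the model
$$\frac{\partial}{\partial w} \mathcal{L}(y^*, x) = \sum_{i=2}^{n} f_t(y^*_{i-1}, y^*_i) + \sum_{i=1}^{n} f_e(y^*_i, i, x) - \mathbb{E}_{y}\left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]$$
‣ The expectation decomposes over positions:
$$\mathbb{E}_{y}\left[ \sum_{i=1}^{n} f_e(y_i, i, x) \right] = \sum_{y \in \mathcal{Y}} P(y \mid x) \left[ \sum_{i=1}^{n} f_e(y_i, i, x) \right] = \sum_{i=1}^{n} \sum_{y \in \mathcal{Y}} P(y \mid x)\, f_e(y_i, i, x)$$
‣ For the emission features, this gives:
$$\frac{\partial}{\partial w} \mathcal{L}(y^*, x) = \sum_{i=1}^{n} f_e(y^*_i, i, x) - \sum_{i=1}^{n} \sum_{s} P(y_i = s \mid x)\, f_e(s, i, x)$$
$y^*$ is the annotated sequence for an input sequence $x$ of length $n$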
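To make "gold minus expected" concrete, the sketch below checks the emission gradient on a toy chain small enough to enumerate. It assumes one indicator feature per (position, tag) pair, so each weight is simply the log-potential `emit[i][s]`. That simplification, the variable names, and the toy numbers are assumptions for the example, and the marginals are computed by brute force instead of forward-backward.

```python
import math
from itertools import product

def seq_score(y, emit, trans):
    """Unnormalized log-score: sum of emission and transition log-potentials."""
    n = len(y)
    return (sum(emit[i][y[i]] for i in range(n))
            + sum(trans[y[i - 1]][y[i]] for i in range(1, n)))

def log_likelihood(gold, emit, trans):
    """log P(gold | x) = score(gold) - log Z, with Z summed by brute force."""
    n, k = len(emit), len(emit[0])
    z = sum(math.exp(seq_score(y, emit, trans))
            for y in product(range(k), repeat=n))
    return seq_score(gold, emit, trans) - math.log(z)

def emission_gradient(gold, emit, trans):
    """d log P(gold | x) / d emit[i][s] = 1[gold_i == s] - P(y_i = s | x)."""
    n, k = len(emit), len(emit[0])
    seqs = list(product(range(k), repeat=n))
    weights = [math.exp(seq_score(y, emit, trans)) for y in seqs]
    z = sum(weights)
    grad = [[0.0] * k for _ in range(n)]
    for i in range(n):
        for s in range(k):
            # Marginal: total probability of all sequences with tag s at position i.
            marginal = sum(w for y, w in zip(seqs, weights) if y[i] == s) / z
            grad[i][s] = (1.0 if gold[i] == s else 0.0) - marginal
    return grad

# Toy potentials and gold tags (made-up numbers): 3 positions, 2 tags.
emit = [[1.0, 0.0], [0.2, 0.9], [0.5, 0.4]]
trans = [[0.3, -0.2], [-0.5, 0.6]]
gold = [0, 1, 1]
grad = emission_gradient(gold, emit, trans)

# Finite-difference check on a single coordinate.
eps = 1e-6
bumped = [row[:] for row in emit]
bumped[1][1] += eps
numeric = (log_likelihood(gold, bumped, trans)
           - log_likelihood(gold, emit, trans)) / eps
```

Here `numeric` should agree with `grad[1][1]` up to finite-difference error, and each row of `grad` should sum to roughly zero because both the gold indicators and the marginals sum to one.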
Forward-Backward: Runtime
‣ Linear in sentence length n
‣ Polynomial in the number of possible tags |K|
‣ Total runtime: O(n · |K|²)
‣ How does this compare to Viterbi?
$$\beta_t(s_t) = \sum_{s_{t+1}} \beta_{t+1}(s_{t+1}) \exp(\phi_e(s_{t+1}, t+1, x)) \exp(\phi_t(s_t, s_{t+1}))$$
$$\alpha_t(s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1}) \exp(\phi_e(s_t, t, x)) \exp(\phi_t(s_{t-1}, s_t))$$
POS Tagging Performances
‣ Baseline: assign each word its most frequent tag: ~90% accuracy
‣ Trigram HMM: ~95% accuracy / 55% on unknown words
‣ TnT tagger (Brants 1998, tuned HMM): 96.2% accuracy / 86.0% on unks
‣ MaxEnt: 93.7% accuracy / 82.6% on unks
‣ MEMM [Ratnaparkhi 1996]: 96.8% accuracy / 86.9% on unks
‣ Perceptron: 97.1% accuracy
‣ CRF: 97.3% accuracy
Slide credit: Dan Klein
Errors
official knowledge: JJ/NN NN (NN NN: tax cut, art gallery, …)
made up the story: VBD RP/IN DT NN
recently sold shares: RB VBD/VBN NNS
Slide credit: Dan Klein / Toutanova + Manning (2000)
Remaining Errors
‣ Underspecified / unclear, gold standard inconsistent / wrong: 58%
‣ Lexicon gap (word not seen with that tag in training): 4.5%
‣ Unknown word: 4.5%
‣ Could get right: 16%
‣ Difficult linguistics: 20%
They set up absurd situations, detached from reality: VBD / VBP? (past or present?)
a $ 10 million fourth-quarter charge against discontinued operations: adjective or verbal participle? JJ / VBN?
Manning 2011, “Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?”
Remaining Questions
‣ How to handle noisy text?
‣ How to keep high accuracy when we go to new domains?
‣ How can we efficiently combine unlabeled data with labeled data?
‣ Can we incorporate a domain-specific lexicon (e.g., a list of protein names)?
Pseudocode for Training CRF
for each epoch
    for each example
        extract features on each emission and transition (look up in cache)
        compute potentials phi based on features + weights
        compute marginal probabilities with forward-backward
        accumulate gradient over all emissions and transitions
Implementation Tips for CRFs
‣ Caching is your friend! Cache feature vectors especially
‣ Try to reduce redundant computation, e.g. if you compute both the gradient and the objective value, don’t rerun the dynamic program
‣ If things are too slow, run a profiler and see where time is being spent. Forward-backward should take most of the time
‣ Exploit sparsity where possible, especially in feature vectors and gradients
‣ Do all dynamic program computation in log space to avoid underflow
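The log-space tip can be illustrated with a small standalone sketch. The helper name `logsumexp` and the numbers are chosen for the example. The point is only that products of many small potentials underflow in floating point, while sums of logs and a max-shifted log-sum-exp do not.

```python
import math

def logsumexp(values):
    """Stable log(sum(exp(v))): factor out the max before exponentiating."""
    m = max(values)
    if m == float("-inf"):   # every term is log(0)
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

# Multiplying many small potentials underflows to exactly 0.0 ...
prob_product = 1.0
for _ in range(500):
    prob_product *= 1e-3     # 1e-1500 is far below the float64 range

# ... while the same quantity in log space is an ordinary number.
log_product = 500 * math.log(1e-3)

# exp(-1000) and exp(-1001) both underflow, but their log-sum is well defined.
stable = logsumexp([-1000.0, -1001.0])
```

In a forward or backward recurrence, each sum over previous tags becomes a `logsumexp` over log-scores, and each product of potentials becomes a sum.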
Debugging Tips for CRFs
‣ Hard to know whether inference, learning, or the model is broken!
‣ Compute the objective: is optimization working?
‣ Learning: is the objective going down? Try to fit 1 example / 10 examples. Are you applying the gradient correctly?
‣ Inference: check gradient computation (most likely place for bugs)
‣ Is $\sum_s \text{forward}_i(s)\, \text{backward}_i(s)$ the same for all i?
‣ Do probabilities normalize correctly + look “reasonable”? (Nearly uniform when untrained, then slowly converging to the right thing)
‣ If objective is going down but model performance is bad:
‣ Inference: check performance if you decode the training set
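The inference checks above can be sketched in a few lines. This is a minimal log-space forward-backward over precomputed potentials, assuming the same convention as the recurrences earlier in the lecture: the forward score includes the emission at the current position and the backward score does not. The list layout (`emit[i][s]`, `trans[r][s]`) and the toy numbers are assumptions for the example.

```python
import math

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_backward(emit, trans):
    """Return log-space forward and backward tables for a linear chain."""
    n, k = len(emit), len(emit[0])
    fwd = [emit[0][:]]
    for i in range(1, n):
        fwd.append([emit[i][s]
                    + logsumexp([fwd[i - 1][r] + trans[r][s] for r in range(k)])
                    for s in range(k)])
    bwd = [[0.0] * k for _ in range(n)]   # log(1) at the last position
    for i in range(n - 2, -1, -1):
        bwd[i] = [logsumexp([trans[s][r] + emit[i + 1][r] + bwd[i + 1][r]
                             for r in range(k)])
                  for s in range(k)]
    return fwd, bwd

# Toy potentials (made-up numbers): 3 positions, 2 tags.
emit = [[1.0, 0.0], [0.2, 0.9], [0.5, 0.4]]
trans = [[0.3, -0.2], [-0.5, 0.6]]
fwd, bwd = forward_backward(emit, trans)

# Check 1: log sum_s forward_i(s) * backward_i(s) should equal log Z at every i.
log_z_per_position = [logsumexp([f + b for f, b in zip(fwd[i], bwd[i])])
                      for i in range(len(emit))]

# Check 2: the marginals P(y_i = s | x) should sum to 1 at every position.
marginals = [[math.exp(f + b - lz) for f, b in zip(fwd[i], bwd[i])]
             for i, lz in enumerate(log_z_per_position)]
```

If the entries of `log_z_per_position` differ from one another, the bug is usually an off-by-one in which position's emission is included in the forward or backward table.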
Can we learn the latent states without a supervised training dataset?
Sequence Labeling: Partially Observed
Unsupervised Learning
‣ We have sets X and Y, and a joint distribution $p(x, y \mid \theta)$
‣ So far, we had fully observable data, $(x_i, y_i)$ pairs:
$$\mathcal{L}(\theta) = \sum_i \log P(x_i, y_i \mid \theta)$$
‣ When we have partially observable data, x only, then
$$\mathcal{L}(\theta) = \sum_i \log P(x_i \mid \theta) = \sum_i \log \sum_{y \in \mathcal{Y}} P(x_i, y \mid \theta)$$
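Expectation Maximization
‣ The EM (Expectation Maximization) algorithm is a method for finding
$$\theta_{\text{MLE}} = \arg\max_{\theta} \mathcal{L}(\theta) = \arg\max_{\theta} \sum_i \log \sum_{y \in \mathcal{Y}} P(x_i, y \mid \theta)$$
‣ When we have partially observable data, x only, then
$$\mathcal{L}(\theta) = \sum_i \log \sum_{y \in \mathcal{Y}} P(x_i, y \mid \theta)$$
‣ We will look into EM in HMM!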
EM for HMM
‣ Maximum Likelihood Parameters (supervised):
‣ For unsupervised learning, replace actual counts with expected counts
‣ Expected emission counts:
‣ Reminds you of something?
$$\frac{\partial}{\partial w} \mathcal{L}(y^*, x) = \sum_{i=1}^{n} f_e(y^*_i, i, x) - \sum_{i=1}^{n} \sum_{s} P(y_i = s \mid x)\, f_e(s, i, x)$$
That’s what we computed for forward-backward!
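For reference, the supervised estimates and their expected-count replacements are commonly written as below. The slide's own formulas did not survive extraction, so the notation here (count, word w, tags s and s') is an assumption and may differ from the lecture's.

```latex
% Supervised MLE: relative frequencies of observed counts
P(w \mid s) = \frac{\mathrm{count}(s, w)}{\sum_{w'} \mathrm{count}(s, w')}, \qquad
P(s' \mid s) = \frac{\mathrm{count}(s \to s')}{\sum_{s''} \mathrm{count}(s \to s'')}

% Unsupervised: the tag at position i is unknown, so each position contributes
% its posterior probability instead of a hard count of 1
\mathbb{E}[\mathrm{count}(s, w)] = \sum_{i=1}^{n} \mathbb{1}[x_i = w] \; P(y_i = s \mid x)
```

The posterior $P(y_i = s \mid x)$ is the same kind of marginal that appears in the expected-feature term of the CRF gradient.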
$$P(y_3 = 2 \mid x) = \frac{\text{sum of all paths through state 2 at time 3}}{\text{sum of all paths}}$$
‣ Easiest and most flexible to do one pass to compute the forward quantities and one to compute the backward quantities
Slide credit: Dan Klein
Forward-Backward for HMM
‣ Forward:
‣ For i = 1 … n
‣ Two passes: one forward, one back
‣ Backward:
‣ For i = n-1 ... 0
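The recurrences for the two passes did not survive extraction. One standard way to write them for an HMM is sketched below, using 1-based positions and the assumed notation $\alpha_i(s) = P(x_1, \dots, x_i, y_i = s)$ and $\beta_i(s) = P(x_{i+1}, \dots, x_n \mid y_i = s)$. The slide may index positions differently, for example with an explicit start state at position 0.

```latex
% Forward pass, i = 1 ... n
\alpha_1(s) = P(s)\, P(x_1 \mid s), \qquad
\alpha_i(s) = P(x_i \mid s) \sum_{s'} \alpha_{i-1}(s')\, P(s \mid s')

% Backward pass, i = n-1 ... 1
\beta_n(s) = 1, \qquad
\beta_i(s) = \sum_{s'} P(s' \mid s)\, P(x_{i+1} \mid s')\, \beta_{i+1}(s')

% Either table gives the likelihood of the observed sequence
P(x) = \sum_{s} \alpha_n(s) = \sum_{s} \alpha_1(s)\, \beta_1(s)
```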
Other Marginal Inference
‣ Can we compute this?
‣ Relation with forward quantity?
EM Intuition
‣ What we want is…
‣ We can compute:
‣ If we have….
‣ Then we can compute expected transition counts:
‣ The marginals above can be computed from the following:
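The formulas on this slide were lost in extraction. As a sketch of the usual construction, and not necessarily the slide's exact notation, the pairwise posterior and the expected transition counts can be written in terms of forward quantities $\alpha_i(s) = P(x_1, \dots, x_i, y_i = s)$ and backward quantities $\beta_i(s) = P(x_{i+1}, \dots, x_n \mid y_i = s)$:

```latex
% Posterior over adjacent tag pairs
P(y_i = s,\, y_{i+1} = s' \mid x)
  = \frac{\alpha_i(s)\, P(s' \mid s)\, P(x_{i+1} \mid s')\, \beta_{i+1}(s')}{P(x)},
\qquad P(x) = \sum_{s''} \alpha_n(s'')

% Expected transition counts: sum the pairwise posterior over positions
\mathbb{E}[\mathrm{count}(s \to s')] = \sum_{i=1}^{n-1} P(y_i = s,\, y_{i+1} = s' \mid x)

% Re-estimated transition probabilities
P_{\text{new}}(s' \mid s)
  = \frac{\mathbb{E}[\mathrm{count}(s \to s')]}{\sum_{s''} \mathbb{E}[\mathrm{count}(s \to s'')]}
```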
Expectation Maximization
‣ Initialize transition and emission parameters:
‣ random, uniform, or more informed initialization
‣ Iterate until convergence
‣ E-step: computing expected counts
‣ M-step: computing new transition and emission parameters
‣ Convergence? Yes. Global optimum? No.
Equivalent to the procedure given in the textbook (J&M Appendix A), with slightly different notation
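The E-step / M-step loop above can be sketched for a tiny discrete HMM as follows. This is an illustrative implementation under stated assumptions, not the textbook's or the lecture's code. It works in probability space, which is only safe for short toy sequences. The data, the number of states, and all function names are made up for the example.

```python
import math
import random

def forward_backward(obs, init, trans, emit):
    """Forward/backward tables in probability space, plus P(x)."""
    n, k = len(obs), len(init)
    fwd = [[init[s] * emit[s][obs[0]] for s in range(k)]]
    for i in range(1, n):
        fwd.append([emit[s][obs[i]]
                    * sum(fwd[i - 1][r] * trans[r][s] for r in range(k))
                    for s in range(k)])
    bwd = [[1.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = [sum(trans[s][r] * emit[r][obs[i + 1]] * bwd[i + 1][r]
                      for r in range(k))
                  for s in range(k)]
    return fwd, bwd, sum(fwd[-1])

def normalize(row):
    total = sum(row)
    return [c / total for c in row]

def em_step(data, init, trans, emit):
    """One EM iteration; also returns the log-likelihood under the OLD parameters."""
    k, v = len(init), len(emit[0])
    c_init = [0.0] * k
    c_trans = [[0.0] * k for _ in range(k)]
    c_emit = [[0.0] * v for _ in range(k)]
    loglik = 0.0
    for obs in data:                                   # E-step: expected counts
        fwd, bwd, px = forward_backward(obs, init, trans, emit)
        loglik += math.log(px)
        for i, o in enumerate(obs):
            for s in range(k):
                gamma = fwd[i][s] * bwd[i][s] / px     # P(y_i = s | x)
                c_emit[s][o] += gamma
                if i == 0:
                    c_init[s] += gamma
                if i + 1 < len(obs):
                    for r in range(k):                 # P(y_i = s, y_{i+1} = r | x)
                        c_trans[s][r] += (fwd[i][s] * trans[s][r]
                                          * emit[r][obs[i + 1]] * bwd[i + 1][r] / px)
    # M-step: renormalize expected counts into probabilities.
    return (normalize(c_init), [normalize(r) for r in c_trans],
            [normalize(r) for r in c_emit], loglik)

def random_dist(size, rng):
    return normalize([rng.random() + 0.1 for _ in range(size)])

rng = random.Random(0)
data = [[0, 1, 0, 2], [2, 2, 1, 0], [0, 0, 1, 2, 2]]   # toy word-id sequences (made up)
k, v = 2, 3                                            # 2 hidden states, 3 word types
init = random_dist(k, rng)
trans = [random_dist(k, rng) for _ in range(k)]
emit = [random_dist(v, rng) for _ in range(k)]
history = []
for _ in range(20):
    init, trans, emit, loglik = em_step(data, init, trans, emit)
    history.append(loglik)
```

The values in `history` should never decrease from one iteration to the next, which is the convergence guarantee mentioned above. Different random seeds can end at different local optima.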
How is this even possible?
‣ I water the garden every day
‣ Saw a weird bug in that garden …
‣ While I was thinking of an equation …
Noun
S: (n) garden (a plot of ground where plants are cultivated)
S: (n) garden (the flowers or vegetables or fruits or herbs that are cultivated in a garden)
S: (n) garden (a yard or lawn adjoining a house)
Verb
S: (v) garden (work in the garden) "My hobby is gardening"
Adjective
S: (adj) garden (the usual or familiar type) "it is a common or garden sparrow"
We can start with a tag dictionary (a list of potential POS tags per word)
Does EM learn a good POS tagger?
HMMs estimated by EM generally assign a roughly equal number of word tokens to each hidden state, while the empirical distribution of tokens to POS tags is highly skewed
“Why doesn’t EM find good HMM POS-taggers”, Johnson, EMNLP 2007
HMM vs. CRF
| Quantity | HMM | CRF |
|---|---|---|
| $P(x_i \mid y_i)$, $P(y_i \mid y_{i-1})$ | Model parameter (usually multinomial distribution) | Undefined (model is by definition conditioned on x) |
| $P(y_i \mid x)$, $P(y_{i-1}, y_i \mid x)$ | Inferred quantity from forward-backward | Inferred quantity from forward-backward |
$$P(y_i = s \mid x) = \frac{\text{forward}_i(s)\, \text{backward}_i(s)}{\sum_{s'} \text{forward}_i(s')\, \text{backward}_i(s')}$$
‣ Posterior is derived from the parameters and the data (conditioned on x!)
POS Results
‣ Baseline: assign each word its most frequent tag: ~90% accuracy
‣ Trigram HMM: ~95% accuracy / 55% on unknown words
‣ TnT tagger (Brants 1998, tuned HMM): 96.2% accuracy / 86.0% on unks
‣ MEMM [Ratnaparkhi 1996]: 96.8% accuracy / 86.9% on unks
‣ EM for HMM: 74.7%
Quiz: p(S1) vs. p(S2)
‣ S1 = Colorless green ideas sleep furiously.
‣ S2 = Furiously sleep ideas green colorless
‣ “It is fair to assume that neither sentence (S1) nor (S2) had ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally "remote" from English” (Chomsky 1957)
‣ How would p(S1) and p(S2) compare based on (smoothed) bigram language models?
‣ How would p(S1) and p(S2) compare under the marginal probability from POS-tagging HMMs?
‣ i.e., marginalized over all possible sequences of POS tags
‣ Sequence Modeling Problems in NLP
‣ Generative Model: Hidden Markov Models (HMM)
‣ Discriminative Models: Maximum Entropy Markov Models (MEMM), Conditional Random Fields (CRF)
‣ Unsupervised Learning: Expectation Maximization
Summary: Sequence Models
‣ Generative vs. Discriminative
‣ The higher accuracy of discriminative models comes at the price of much slower training
‣ Structured or not
‣ Independent predictions are effective, but global structure helps
‣ But need to be careful about computational overhead
‣ Model expressivity
‣ Expressive feature set can be better
‣ But more expensive, overfitting