• No results found

CS388: Natural Language Processing Lecture 4: Sequence Models

N/A
N/A
Protected

Academic year: 2021

Share "CS388: Natural Language Processing Lecture 4: Sequence Models"

Copied!
29
0
0

Loading.... (view fulltext now)

Full text

(1)

CS388: Natural Language Processing Lecture 4: Sequence Models

Eunsol Choi

Parts of this lecture adapted from Greg DurreA, Yejin Choi, Yoav Artzi

(2)

LogisHcs: Time to think about final project!

2

‣ There will be extra instructor office hour next week (sign up will be on piazza)

10 minute slots, sign up for one office hour slot per team

‣ Next lecture I’ll cover topic that could help you brainstorm about your final project

‣ Project Proposal Due October 5th

‣ HW2 (implemenHng CRF) due September 30th — get started early!

(3)

CRFs Outline

‣ Model: P (y|x) = 1 Z

Yn i=2

exp( t(yi 1, yi))

Yn i=1

exp( e(yi, i, x))

Inference: argmax P(y|x) from Viterbi P (y|x) / exp w>

" n X

i=2

ft(yi 1, yi) +

Xn

i=1

fe(yi, i, x)

#

(4)

CRF Learning

4

gold features — expected features under model

@

@w L(y

, x) =

X

n i=1

f

e

(y

i

, i, x)

X

n i=1

X

s

P (y

i

= s |x)f

e

(s, i, x)

@

@w L(y, x) =

Xn i=2

ft(yi 1 , yi) +

Xn i=1

fe(yi, i, x)

Ey

" n X

i=2

ft(yi 1, yi) +

Xn

i=1

fe(yi, i, x)

#

Ey

" n X

i=1

fe(yi, i, x)

#

= X

y2Y

P (y|x)

" n X

i=1

fe(yi, i, x)

#

=

Xn i=1

X

y2Y

P (y|x)fe(yi, i, x)

is annotated sequence for input sequence of length

y* x n

(5)

Forward Backward: RunHme

5

‣ Linear in sentence length

‣ Polynomial in the number of possible tags |K|

‣ Total RunHme:

‣ How does this compare to Viterbi?

t(st) = X

st+1

t+1(st+1) exp( e(st+1, t + 1, x))

<latexit sha1_base64="o+YDBL44WMQt+0KfoSuWzIAvSK8=">AAAC3XicbVLLbtQwFHXCq4RHp7BkYzGqlIgwSgpS2SBVsGFZJKatmIwsx+N0rDqJZd+gjqJIbFiAEFv+ix3/wQfguClipr2SreNz7r3Hr1xJYSBJfnv+jZu3bt/Zuhvcu//g4fZo59GRqRvN+JTVstYnOTVciopPQYDkJ0pzWuaSH+dnb3v9+BPXRtTVB1gpPi/paSUKwShYioz+7GZUqiUlaWgi/Bpn/FyFmVoKwkMTp3FWUljmRXveRVFwmQuhIeCyTVOS1pAWnqddhwfZrcKBjdZbEohhs+k/HS6L4r5/hIMs53Ct3TNn51S3CAdy082RsZ3WPMlonEwSF/gqSAcwRkMcktGvbFGzpuQVMEmNmaWJgnlLNQgmeRdkjeGKsjN6ymcWVrTkZt661+nwrmUWuKi1HRVgx/5f0dLSmFWZ28x+j2ZT68nrtFkDxat5KyrVAK/YhVHRSAw17p8aL4TmDOTKAsq0sHvFbEk1ZWA/RGAvId088lVwtDdJX0z23r8cH7wZrmMLPUFPUYhStI8O0Dt0iKaIeR+9z95X75tP/C/+d//HRarvDTWP0Vr4P/8Cs27g5w==</latexit>

exp( t(st, st+1))

<latexit sha1_base64="MMxuphUYCcsoCkHdOiUL0s5PNcs=">AAAC+nicbVLLbtQwFHVSHu3wmpYlG4tRpUSEUVKQYINUwYZlkZi20mRkOR6nY9VJLPsGOgr5FDYsQIgtX9Idf4PjZhAz7ZVsHZ9z7z1+ZUoKA3H8x/O3bt2+c3d7Z3Dv/oOHj4a7e8emqjXjE1bJSp9m1HApSj4BAZKfKs1pkUl+kp2/6/STT1wbUZUfYan4rKBnpcgFo2ApsusN91Mq1YKSJDAhfoNTfqGCVC0E4YGJkigtKCyyvLlow3CwyoXAEHDZpi5IY0gDz5O2xb3sVkHPhustCUSw2fSfDquiqOsfYqtlHG70e+b8nOoWQU9u2jkystO66bonRKvqEJPhKB7HLvB1kPRghPo4IsPLdF6xuuAlMEmNmSaxgllDNQgmeTtIa8MVZef0jE8tLGnBzaxxT9fifcvMcV5pO0rAjv2/oqGFMcsis5nd9s2m1pE3adMa8tezRpSqBl6yK6O8lhgq3P0DPBeaM5BLCyjTwu4VswXVlIH9LQN7Ccnmka+D44Nx8mJ88OHl6PBtfx3b6Al6igKUoFfoEL1HR2iCmPfZ++p99374X/xv/k//11Wq7/U1j9Fa+L//AuUc6iA=</latexit>

t(st) = X

st 1

t 1(st 1) exp( e(st, t, x))

<latexit sha1_base64="MsrARqnMVgv4lQtn8MktFrMi2DM=">AAAChHicbVFNT9wwEHVCS+n2g2175GKxWjWR0lUMreilCNFLjyB1AWmzshyvw1o4iWVPEKsov4R/1Rv/pk4IogsdydLze2/G45lUK2khju88f+PFy81XW68Hb96+e789/PDxzJaV4WLKS1Wai5RZoWQhpiBBiQttBMtTJc7Tq5+tfn4tjJVl8RtWWsxzdlnITHIGjqLD23HClF4ySgIb4h84ETc6SPRSUhHYiERJzmCZZvVNE4aD3gqBpdCZbZXT2tIavpCmwb3c3YKeDdcrUohgveb4UYeHpKitH2I6HMWTuAv8HJAejFAfJ3T4J1mUvMpFAVwxa2ck1jCvmQHJlWgGSWWFZvyKXYqZgwXLhZ3X3RAbPHbMAmelcacA3LH/ZtQst3aVp87Z9m+fai35P21WQfZ9XstCVyAKfv9QVikMJW43ghfSCA5q5QDjRrpeMV8ywzi4vQ3cEMjTLz8HZ3sTsj/ZO/06Ojrux7GFdtAuChBBB+gI/UInaIq453mfvdgj/qYf+fv+t3ur7/U5n9Ba+Id/AUKdvsc=</latexit>

exp( t(st 1, st))

<latexit sha1_base64="+aaYS+pf0KkwlV5IpyMknO6c7lg=">AAAChHicbVFNT9wwEHUCpXT7tdBjLxYr1ERKVzG0gksr1F44gtQFpM3KcrwOa+Eklj1BrKL8kv4rbvybOiGoLDCSpef33ozHM6lW0kIc33n+2vqrjdebbwZv373/8HG4tX1my8pwMeGlKs1FyqxQshATkKDEhTaC5akS5+nV71Y/vxbGyrL4A0stZjm7LGQmOQNH0eHf3YQpvWCUBDbEP3AibnSQ6IWkIrARiZKcwSLN6psmDAcPXggshc5tq5zWltbwlTQN7uXuFvRsuFqSQgSrRf/L8JATteVDTIejeBx3gZ8D0oMR6uOEDm+TecmrXBTAFbN2SmINs5oZkFyJZpBUVmjGr9ilmDpYsFzYWd0NscG7jpnjrDTuFIA79nFGzXJrl3nqnG379qnWki9p0wqyw1ktC12BKPj9Q1mlMJS43QieSyM4qKUDjBvpesV8wQzj4PY2cEMgT7/8HJztjcn+eO/02+joVz+OTfQZ7aAAEXSAjtAxOkETxD3P++LFHvE3/Mjf97/fW32vz/mEVsL/+Q8xFr7H</latexit>

(6)

POS Tagging Performances

‣ Baseline: assign each word its most frequent tag: ~90% accuracy

‣ Trigram HMM: ~95% accuracy / 55% on unknown words

‣ TnT tagger (Brants 1998, tuned HMM): 96.2% accuracy / 86.0% on unks

Slide credit: Dan Klein

‣ MaxEnt : 93.7% accuracy / 82.6% on unks

‣ MEMM [Ratnaparkhi 1996]: 96.8% accuracy / 86.9% on unks

‣ Perceptron: 97.1% accuracy

‣ CRF: 97.3% accuracy

(7)

Errors

official knowledge made up the story recently sold shares JJ/NN NN VBD RP/IN DT NN RB VBD/VBN NNS

Slide credit: Dan Klein / Toutanova + Manning (2000)

(NN NN: tax cut, art gallery, …)

(8)

Remaining Errors

Underspecified / unclear, gold standard inconsistent / wrong: 58%

‣ Lexicon gap (word not seen with that tag in training) 4.5%

‣ Unknown word: 4.5%

‣ Could get right: 16%

‣ Difficult linguisHcs: 20%

They set up absurd situa/ons, detached from reality VBD / VBP? (past or present?)

a $ 10 million fourth-quarter charge against discon/nued opera/ons adjecHve or verbal parHciple? JJ / VBN?

Manning 2011 “Part-of-Speech Tagging from 97% to 100%: Is It Time for Some LinguisHcs?”

(9)

Remaining QuesHons

9

‣ How to handle noisy text?

‣ How to keep high accuracy when we go to new domains?

‣ How can we combine unlabeled data efficiently to labeled data?

‣ Can we incorporate domain specific lexicon? (List of protein names?)

(10)

Pseudocode for Training CRF

for each epoch

for each example

extract features on each emission and transiHon (look up in cache)

compute marginal probabiliHes with forward-backward compute potenHals phi based on features + weights

accumulate gradient over all emissions and transiHons

(11)

ImplementaHon Tips for CRFs

‣ Caching is your friend! Cache feature vectors especially

‣ Try to reduce redundant computaHon, e.g. if you compute both the gradient and the objecHve value, don’t rerun the dynamic program

‣ If things are too slow, run a profiler and see where Hme is being spent.

Forward-backward should take most of the Hme

‣ Exploit sparsity in feature vectors where possible, especially in feature vectors and gradients

‣ Do all dynamic program computaHon in log space to avoid underflow

(12)

Debugging Tips for CRFs

‣ Hard to know whether inference, learning, or the model is broken!

‣ Compute the objecHve — is opHmizaHon working?

Learning: is the objecHve going down? Try to fit 1 example / 10 examples. Are you applying the gradient correctly?

Inference: check gradient computaHon (most likely place for bugs)

Is the same for all i?

‣ Do probabiliHes normalize correctly + look “reasonable”? (Nearly uniform when untrained, then slowly converging to the right thing)

‣ If objecHve is going down but model performance is bad:

Inference: check performance if you decode the training set

X

s

forwardi(s)backwardi(s)

(13)

Can we learn the latent states without supervised training dataset?

13

(14)

Sequence Labeling: ParHally Observed

14 Unsupervised Learning

‣ We have a set X and Y, and a joint distribuHon

p(x, y|θ)

‣ So far, we had fully observable data

(x

i

, y

i

)

pairs,

L(θ) = ∑

i

log P(x

i

, y

i

|θ)

‣ When we have parHally observable data, x only, then,

L(θ) = ∑

i

log P(xi |θ)

= ∑i log ∑

y∈𝒴

P(xi, y|θ)

(15)

ExpectaHon MaximizaHon

15

‣ The EM (ExpectaHon MaximizaHon) algorithm is a method for finding θMLE = arg max

θ L(θ) = arg max

θ

i log ∑

y∈𝒴

P(xi, y|θ)

‣ When we have parHally observable data, x only, then,

L(θ) = ∑

i log ∑

y∈𝒴

P(xi, y|θ)

‣ We will look into EM in HMM!

(16)

EM for HMM

16

‣ Maximum Likelihood Parameters (supervised):

‣ For unsupervised learning, replace actual count with expected counts

‣ Expected emission counts:

‣ Reminds you of something? @w@ L(y, x) =

Xn i=1

fe(yi, i, x)

Xn i=1

X

s

P (yi = s|x)fe(s, i, x)

(17)

That’s what we computed for forward-backward!

17 slide credit: Dan Klein

P (y

3

= 2 |x) =

sum of all paths through state 2 at time 3 sum of all paths

=

‣ Easiest and most flexible to do one pass to compute and one to

compute

(18)

Forward Backward for HMM

18

Forward:

For i = 1 … n

‣ Two passes: one forward, one back

Backward:

For i = n-1 ... 0

(19)

Other Marginal Inference

19

‣ Can we compute this?

‣ Relation with forward quantity?

(20)

EM IntuiHon

20

‣ What we want is…

‣ We can compute:

‣ If we have….

‣ Then we can compute expected transiHon counts:

‣ Above marginals can be computed from followings:

(21)

ExpectaHon MaximizaHon

21

‣ IniHalize transiHon and emission parameters:

‣ random, uniform, or more informed iniHalizaHon

Iterate unHl convergence

‣ M-step: compuHng new transiHon and emission parameters

‣ Convergence? Yes. Global OpHmum? No.

‣ E-step: compuHng expected counts

(22)

Equivalent to the procedure given in the textbook (J&M Appendix A) –

slightly different notations

(23)

How is this even possible?

23

‣ I water the garden everyday

‣ Saw a weird bug in that garden …

‣ While I was thinking of an equation …

Noun

S: (n) garden (a plot of ground where plants are cultivated)

S: (n) garden (the flowers or vegetables or fruits or herbs that are cultivated in a garden) S: (n) garden (a yard or lawn adjoining a house)

Verb

S: (v) garden (work in the garden) "My hobby is gardening"

Adjective

S: (adj) garden (the usual or familiar type) "it is a common or garden sparrow"

We can start with a tag dicHonary (a list of potenHal POSs per word)

(24)

Does EM learn good POS tagger?

24

HMMs estimated by EM generally assign a roughly equal number of word tokens to each hidden state, while the empirical distribution of tokens to POS tags is highly skewed

“Why doesn’t EM find good HMM POS-taggers”, Johnson, EMNLP 2007

(25)

P (y

i

= s |x) = forward

i

(s)backward

i

(s) P

s0

forward

i

(s

0

)backward

i

(s

0

)

Posterior is derived from the parameters and the data (condiHoned on x!)

HMM

CRF

Model parameter (usually mulHnomial distribuHon)

Inferred quanHty from forward-backward

Inferred quanHty from forward-backward

Undefined (model is by

definiHon condiHoned on x)

P (xi|yi), P (yi|yi 1)

P (y

i

|x), P (y

i 1

, y

i

|x)

HMM vs. CRF

(26)

POS Results

26

‣ Baseline: assign each word its most frequent tag: ~90% accuracy

‣ Trigram HMM: ~95% accuracy / 55% on unknown words

‣ TnT tagger (Brants 1998, tuned HMM): 96.2% accuracy / 86.0% on unks

‣ MEMM [Ratnaparkhi 1996]: 96.8% accuracy / 86.9% on dunks

‣ EM for HMM: 74.7%

(27)

Quiz: p(S1) vs. p(S2)

27

S1 = Colorless green ideas sleep furiously.

S2 = Furiously sleep ideas green colorless

“It is fair to assume that neither sentence (S1) nor (S2) had ever occurred in an English discourse.

Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally "remote" from English” (Chomsky 1957)

How would p(S1) and p(S2) compare based on (smoothed) bigram language models?

How would p(S1) and p(S2) compare based on marginal probability based on POS- tagging HMMs?

i.e., marginalized over all possible sequences of POS tags

(28)

‣ Sequence Modeling Problems in NLP

‣ GeneraHve Model: Hidden Markov Models (HMM)

‣ DiscriminaHve Model:

Maximum Entropy Markov Models (MEMM) CondiHonal Random Fields

‣ Unsupervised Learning: ExpectaHon MaximizaHon

Summary: Sequence Models

(29)

29

‣ The higher accuracy of discriminaHve models comes at the price of much slower training

‣ GeneraHve vs. DiscriminaHve

‣ Structured or not

‣ Independent predicHons are effecHve, but global structure helps

‣ But need to be careful about computaHonal overhead

‣ Model expressivity

‣ Expressive feature set can be beAer

‣ But more expensive, overfi}ng

Summary: Sequence Models

References

Related documents

Western Power is running this registration of interest to identify third parties who may provide a distributed response of a complete ‘50 Mega Watts (MW)’ solution or a part

He recently completed an Honors Degree at Auckland University focusing on jazz performance and is now a successful session musician and has been teaching itinerant guitar lesson

Tako su narodi skrenuli s puta prihvaćanja odgovornosti za plodo­ ve svega onoga što bi sami posijali, te su time ostali zakinuti i oroblje- ni za moć da se sami pozabave

All 3 vendors’ pricing models are similar ● Cloud compute charges based on. ○ number of virtual CPUs

Keywords: United States, Election, Presidential, America, Donald Trump, Hillary Clinton... TABLE

taken for it to reach 40 is 2 days. Therefore 2 days is the half life. After another 2 days 

The main issues in this study that different from the previous researches is to test reclaimed water samples in different percentage of blending with potable water 20%,50% and

This study demonstrated that the adapted mindfulness intervention is feasible in terms of recruitment, retention, attrition and acceptability, for people with mild to