Learning To Deal With Little Or No Annotated Data

(1)

Daniel Marcu

Information Sciences Institute and Department of Computer Science University of Southern California

4676 Admiralty Way, Suite 1001 Marina del Rey, CA 90292

(2)

Overview

• “There is no better data than more data.”

• Annotating data is more cost-effective than writing rules manually.

– Still, annotating data is expensive

• How can we annotate as little data as possible?

– Active Learning

– Bootstrapping

– Co-training

• Unsupervised Learning.

– Pattern Discovery

– Hidden Variables (the EM algorithm)

(3)

Choosing between confusables

[Banko and Brill, ACL-2001]

(4)

Base Noun Phrase Chunking

[Ngai and Yarowsky, ACL-2000]

• Asked human judges to write rules that can be used to identify base noun phrases, and automatically integrated those rules into a rule-based chunker.

• Asked human judges to annotate base noun phrases in naturally occurring text, and trained an ML-based system to recognize these phrases.

• Compared the performance of the rule-based and ML-based systems.

(5)

(6)

(7)

How can we do well while annotating less data?

• Active learning

– Active learning with one classifier

– Active learning with a committee of classifiers

• Bootstrapping

– Bootstrapping with one classifier

– Bootstrapping with a committee of classifiers

(8)

Active learning with one classifier

Input: small annotated corpus + large un-annotated corpus.

1. Train classifier on annotated data.

2. Apply classifier on unlabeled examples.

3. Elicit human judgments for examples on which classifier had the lowest confidence.

4. Add new labeled data to the annotated corpus.

5. Retrain classifier and test on held-out data.
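The selection step (picking the examples on which the classifier is least confident) can be sketched in a few lines. This is only an illustration: the function name `least_confident` and the representation of classifier output as a list of class probabilities per example are assumptions, not part of the slides.

```python
def least_confident(posteriors, k):
    """Return the indices of the k unlabeled examples to send to annotators.

    posteriors: one list of class probabilities per unlabeled example
                (hypothetical classifier output).
    Confidence is taken to be the probability of the top-scoring class.
    """
    confidence = [max(p) for p in posteriors]
    # Sort example indices from least to most confident and keep the first k.
    ranked = sorted(range(len(posteriors)), key=lambda i: confidence[i])
    return ranked[:k]


if __name__ == "__main__":
    probs = [[0.90, 0.10], [0.55, 0.45], [0.70, 0.30]]
    print(least_confident(probs, 2))
```

The returned examples would then be labeled by a human, added to the annotated corpus, and the classifier retrained.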

(9)

Active learning with multiple classifiers

Input: small annotated corpus + large un-annotated corpus.

1. Train multiple classifiers on annotated data.

2. Apply classifiers on unlabeled examples.

3. Elicit human judgments for examples on which classifiers agree the least.

4. Add new labeled data to the annotated corpus.

5. Retrain classifiers and test on held-out data.
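One way to quantify "agree the least" is the entropy of the committee's votes; the slides do not say which disagreement measure was used, so vote entropy here is an assumption, as are the names `vote_entropy` and `most_disputed`.

```python
from collections import Counter
from math import log


def vote_entropy(votes):
    """Entropy of the label distribution produced by the committee for one example."""
    n = len(votes)
    return -sum((c / n) * log(c / n) for c in Counter(votes).values())


def most_disputed(committee_votes, k):
    """Indices of the k examples with the highest vote entropy.

    committee_votes: one list of labels per example, one label per classifier.
    """
    scores = [vote_entropy(v) for v in committee_votes]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


if __name__ == "__main__":
    votes = [["A", "A", "A"], ["A", "B", "A"], ["A", "B", "C"]]
    print(most_disputed(votes, 2))
```

A unanimous committee yields entropy 0, so such examples are the last to be sent to annotators.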

(10)

Active learning helps

(11)

Active learning helps

(12)

Active learning worked in all cases that I know of.

(13)

Bootstrapping with one classifier

Input: small annotated corpus + large un-annotated corpus.

1. Train classifier on annotated data.

2. Apply classifier on unlabeled examples.

3. Add to the training corpus the examples that are labeled with high confidence.

4. Retrain classifier (and test on held-out data).
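The loop above can be sketched as follows. Everything concrete here is an assumption made for illustration: the toy one-dimensional nearest-centroid classifier, the confidence score, the threshold, and the names `fit_centroids` and `self_train`.

```python
def fit_centroids(labeled):
    """Toy classifier: one centroid per label over 1-D inputs (needs >= 2 labels).

    Returns a function x -> (label, confidence).
    """
    sums = {}
    for x, y in labeled:
        s, n = sums.get(y, (0.0, 0))
        sums[y] = (s + x, n + 1)
    cents = {y: s / n for y, (s, n) in sums.items()}

    def predict(x):
        ranked = sorted(cents, key=lambda y: abs(x - cents[y]))
        d0 = abs(x - cents[ranked[0]])
        d1 = abs(x - cents[ranked[1]])
        conf = 1.0 if d0 + d1 == 0 else d1 / (d0 + d1)
        return ranked[0], conf

    return predict


def self_train(fit, labeled, unlabeled, threshold=0.8, max_rounds=10):
    """Bootstrapping with one classifier: keep adding confident self-labels."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        predict = fit(labeled)                      # train on current corpus
        confident, rest = [], []
        for x in pool:                              # apply to unlabeled data
            label, conf = predict(x)
            (confident if conf >= threshold else rest).append((x, label))
        if not confident:                           # nothing new to add
            break
        labeled.extend(confident)                   # grow the training corpus
        pool = [x for x, _ in rest]
    return fit(labeled), labeled
```

For example, starting from `[(0.0, "A"), (10.0, "B")]` with unlabeled points `[1.0, 9.0, 4.0, 6.0]`, the points 1.0 and 9.0 are added in the first round, while 4.0 and 6.0 never reach the threshold and stay unlabeled.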

(14)

Bootstrapping with multiple classifiers

Input: small annotated corpus + large un-annotated corpus.

1. Train classifiers on annotated data.

2. Apply classifiers on unlabeled examples.

3. Add to the training corpus the examples that are given the same label by all (most of) the classifiers.

4. Retrain classifiers (and test on held-out data).
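The agreement test ("same label by all (most of) the classifiers") might look like the sketch below. The function name `agreed_label` and the `min_fraction` parameter are hypothetical; a value of 1.0 demands unanimity, and lower values accept a majority.

```python
from collections import Counter


def agreed_label(votes, min_fraction=1.0):
    """Return the committee's label if enough classifiers agree, else None.

    votes: the labels assigned to one unlabeled example, one per classifier.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_fraction else None


if __name__ == "__main__":
    print(agreed_label(["A", "A", "A"]))        # unanimous
    print(agreed_label(["A", "B", "A"]))        # rejected under unanimity
    print(agreed_label(["A", "B", "A"], 0.6))   # accepted under a 60% majority rule
```

Only examples for which this returns a label would be added to the training corpus before retraining.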

(15)

Bootstrapping example

[Yarowsky – ACL-95]

• Extract from a corpus all instances of a polysemous word (7538 instances of plant).

Sense  Training Examples
?      company said the plant is still operating
?      Although thousands of plant and animal species
?      zonal distribution of plant life
?      to strain microscopic plant life from the
?      Nissan car and truck plant in Japan
?      discovered at a St. Louis plant manufacturing
?      automated manufacturing plant in Fremont

(16)

Start with a simple classifier and create a seed corpus

Sense  Training Examples
A      zonal distribution of plant life
A      to strain microscopic plant life from the
A      …
?      Nissan car and truck plant in Japan
?      company said the plant is still operating
?      Although thousands of plant and animal species
?      …
B      discovered at a St. Louis plant manufacturing
B      automated manufacturing plant in Fremont
B      …

• Start with a simple classifier: plant → A; manufacturing → B

– 82 examples of living plants (1%)

– 106 examples of manufacturing plants (1%)

(17)

(18)

1. Train supervised classifier on seed corpus

Collocation                          Sense
plant life                           A
manufacturing plant                  B
life (within 2-10 words)             A
manufacturing (within 2-10 words)    B
animal (within 2-10 words)           A
equipment (within 2-10 words)        B

(19)

(20)

Rest of the algorithm

1. Optionally use one-sense-per-discourse filter and augment labeled data

2. Repeat steps 1, 2, 3 iteratively.

Evaluation:

• Baseline: 63.9%

• Supervised: 96.1%

• Bootstrapping: 96.5%

(21)

Bootstrapping does not work in all cases (than vs. then [Banko and Brill, ACL-2001])

Training data                          Test accuracy   % Total training data
10^6-word labeled seed corpus          0.9624          0.1
Seed + 5 x 10^6 words, unsupervised    0.9588          0.6
Seed + 10^7 words, unsupervised        0.9620          1.2
Seed + 10^8 words, unsupervised        0.9715          12.2
Seed + 5 x 10^8 words, unsupervised    0.9588          61.1

(22)

Co-training

[Blum and Mitchell, COLT-1998]

Professor John Smith

I teach computer courses and advise students including

Mary Kae

Bill Blue

I work on the following projects:

- machine learning for web classification - active learning for NLP

- software engineering My advisor

(23)

Co-training

• Input: L – set of labeled training examples

U – set of unlabeled examples

• Loop:

– Learn hyperlink-based classifier H from L.

– Learn full-text classifier F from L.

– Allow H to label p positive and n negative examples from U (same distribution as in L).

– Allow F to label p positive and n negative examples from U.

– Add these self-labeled examples to L.

• Why does this work?

– Examples that are easy to label by classifier X may be hard cases for the classifier Y. Classifier Y may learn something new from the

(24)

Example: Error rates for a web classifier

Problem: classify web pages as academic course (yes or no).

                      Page-based    Hyperlink-based    Combined classifier
                      classifier    classifier         (equal votes)
Supervised training   12.9          12.4               11.1
Co-training           6.2           11.6               5.0

(25)

Co-training does not work in all cases [Pierce and Cardie, EMNLP-2001]

(26)

Unsupervised learning

• Pattern discovery.

– Language modeling (text as sequence of words)

– Unsupervised induction of syntactic structure.

– Unsupervised induction of POS taggers and base noun identifiers for non-English.

(27)

Language modeling

• as soon as _______

• I would like ______

• P(w1, w2, w3, …, wn)

• Useful in

– Speech recognition

– Machine translation

– Summarization/Generation

(28)

N-gram models

• $P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2)\, P(w_4 \mid w_1, w_2, w_3) \cdots P(w_n \mid w_1, w_2, \ldots, w_{n-1})$

  $\approx P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2)\, P(w_4 \mid w_2, w_3) \cdots P(w_n \mid w_{n-2}, w_{n-1})$

• Estimation: P(c | a, b) = count(a, b, c)/count(a, b)

• Smoothing when count(a,b) or count(a,b,c) are 0.

• Still the most popular language model: never underestimate the power of n-grams.
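The count-based estimate P(c | a, b) = count(a, b, c)/count(a, b) can be sketched as below. The slides do not name a smoothing method; additive (add-alpha) smoothing is used here purely as a simple stand-in, and the function names are hypothetical.

```python
from collections import Counter


def train_trigrams(tokens):
    """Collect trigram counts and the counts of their two-word contexts."""
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    tri = Counter(trigrams)
    ctx = Counter((a, b) for a, b, _ in trigrams)
    return tri, ctx, len(set(tokens))


def prob(c, a, b, tri, ctx, vocab_size, alpha=1.0):
    """P(c | a, b) with additive smoothing.

    alpha=0 gives the raw relative-frequency estimate, which is undefined
    (division by zero) when the context (a, b) was never seen.
    """
    return (tri[(a, b, c)] + alpha) / (ctx[(a, b)] + alpha * vocab_size)


if __name__ == "__main__":
    tokens = "as soon as possible as soon as you can".split()
    tri, ctx, v = train_trigrams(tokens)
    print(prob("possible", "soon", "as", tri, ctx, v, alpha=0.0))  # unsmoothed
    print(prob("can", "soon", "as", tri, ctx, v))                  # unseen trigram, smoothed
```

In this toy corpus the context "soon as" occurs twice, once followed by "possible", so the unsmoothed estimate is 1/2; with alpha = 1 and a five-word vocabulary the unseen continuation "can" receives probability 1/7 instead of 0.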

(29)

ACL-Unsupervised induction of syntactic structures [van Zaanen – ICML-2000]

• [Harris, 1951] Methods in structural linguistics. University of Chicago Press: “Two constituents of the same type can be replaced”.

• IDEA:

– Find in a corpus parts of sentences that can be replaced and assume that these parts are syntactic constituents.

• Example:

– Show me (flights from Atlanta to Boston.)

– Show me (the rates for flight 1943.)

– (Book Delta 128) from Dallas to Boston.

– (Give me all flights) from Dallas to Boston.

(30)

Algorithm

1. Find overlapping segments in all sentence pairs (string edit distance).

• Dissimilar parts are considered possible constituents and are assigned unique types (labels: X1, X2…).

2. When multiple overlaps occur use various selection criteria

• First learned constituent is good.

• Constituent that occurs most often is good.
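The first step can be illustrated on one sentence pair. Note the substitution: this sketch uses Python's `difflib.SequenceMatcher`, which aligns by longest matching blocks rather than by string edit distance as the slide specifies, so it only approximates the method. The function name `candidate_constituents` is hypothetical.

```python
from difflib import SequenceMatcher


def candidate_constituents(sent_a, sent_b):
    """Return the dissimilar parts of two sentences as (part_a, part_b) pairs.

    The shared word sequences are treated as context; whatever differs
    between them is hypothesized to be a constituent.
    """
    a, b = sent_a.split(), sent_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return spans


if __name__ == "__main__":
    print(candidate_constituents(
        "Book Delta 128 from Dallas to Boston",
        "Give me all flights from Dallas to Boston",
    ))
```

On the pair above, the shared context is "from Dallas to Boston", and the hypothesized constituents are "Book Delta 128" and "Give me all flights", matching the bracketing in the slide example.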

(31)

Evaluation

• ATIS corpus: 716 sentences with 11,777 constituents.

• Examples:

– Corpus: What is (the name of (the airport in Boston)NP )NP

– Learned: What is the (name of the (airport in Boston)_C )_C

– Corpus: Explain classes ((QW)NP and (QX)NP and (Y)NP)NP

– Learned: Explain classes QW and (QX and (Y)C )C

• Non-crossing bracket precision: 86.47

• Non-crossing bracket recall: 86.78

• Lots of room for improvement:

(32)

Induction of POS taggers and base noun identifiers for non-English languages [Yarowsky and Ngai – NAACL-2001]

• For many languages, no NLP analyzers exist.

• Bottleneck: lack of labeled data.

• IDEA: use parallel corpora and existing statistical machine translation software/techniques to automatically label non-English texts.

(33)

Projecting POS tag and base noun-phrase structure across languages

(34)

Difficulties

• Statistical MT alignment programs yield relatively low accuracy word alignments.

• Very often translations are not literal.

• Mismatch between the annotation needs of two languages (gender in French and

(35)

POS tagger induction

• Run GIZA (www.isi.edu/natural-language/projects/rewrite) on parallel corpus of 2M words.

• Run POS tagger on English text.

• Automatically induce tags for the French.

• Train probabilistic noisy-channel tagger on automatically induced French tags.

– Downweight or exclude from the training data the segments that are likely to be aligned poorly.

– Train lexical priors P(t | w) and tag sequence models P(t2 | t1) using aggressive generalization techniques (most words have one possible core tag).

• Test performance on held-out data and out-of-domain manually annotated data (100k: U. Montreal)

(36)

Evaluation

• E-F Aligned French

– Direct transfer: 76%

– Standard noisy-channel: 86%

– Noise-robust noisy-channel: 96%

– Upperbound (trained on heldout goldstandard): 97%

• Out-of-domain data:

– Standard noisy-channel: 82%

– Noise-robust noisy-channel: 94%

(37)

NP bracketer induction

• Tag and bracket English text [Brill,CL-1999; Ramshaw and Marcus, VLC-1999]

• Induce maximal brackets in French/Chinese.

• Train transformation-based learning (TBL) bracketer on French/Chinese data.

• Test performance on small corpus of held out sentences (no French or Chinese NP bracketer exists).

(38)

Evaluation on 50 French sentences

                     Exact    Acceptable
Direct, F-measure    45%      59%
TBL, F-measure       81%      91%

(39)

Hidden variables: the EM algorithm

[Knight, AI Magazine, 1997]

1a. ok-voon ororok sprok .
1b. at-voon bichat dat .

2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .

3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .

4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .

5a. wiwok farok izok stok .
5b. totat jjat quat cat .

6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .

7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .

8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .

9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .

10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .

11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .

12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .

(40)

EM (Expectation Maximization)

[Dempster, Laird, Rubin; JRSS, 1977]

• EM is good for solving “chicken and egg” problems.

– Translation:

• If we knew the word-level alignments in a corpus, we would know how to estimate t(f | e).

• If we knew t(f | e), we would be able to find the word-level alignments in a corpus.

• Problem to solve: find the word-level alignments and the translation probabilities given this corpus:

1e: b c
1f: x y
---
2e: b
2f: y

$P(a, f \mid e) = \prod_{j=1}^{m} t(f_j \mid e_{a_j})$

(41)

The EM algorithm

[Knight, 1999 SMT Tutorial Book]

Step 1: Set parameters uniformly.

  t(x | b) = ½
  t(y | b) = ½
  t(x | c) = ½
  t(y | c) = ½

Step 2: Compute P(a, f | e) for all alignments.

  Pair 1, alignment b-x, c-y:  P(a, f | e) = ½ * ½ = ¼
  Pair 1, alignment b-y, c-x:  P(a, f | e) = ½ * ½ = ¼
  Pair 2, alignment b-y:       P(a, f | e) = ½

(42)

The EM algorithm

Step 3: Normalize P(a, f | e) values to yield P(a | e, f).

  Pair 1, alignment b-x, c-y:  P(a | e, f) = ¼ / (¼ + ¼) = ½
  Pair 1, alignment b-y, c-x:  P(a | e, f) = ¼ / (¼ + ¼) = ½
  Pair 2, alignment b-y:       P(a | e, f) = ½ / ½ = 1

Step 4: Collect fractional counts.

  tc(x | b) = ½
  tc(y | b) = ½ + 1 = 3/2
  tc(x | c) = ½
  tc(y | c) = ½

(43)

The EM algorithm

Step 5: Normalize fractional counts to get revised parameter values.

  t(x | b) = ½ / (3/2 + ½) = ¼
  t(y | b) = 3/2 / (3/2 + ½) = ¾
  t(x | c) = ½ / 1 = ½
  t(y | c) = ½ / 1 = ½

Repeat step 2: Compute P(a, f | e) for all alignments.

  Pair 1, alignment b-x, c-y:  P(a, f | e) = ¼ * ½ = 1/8
  Pair 1, alignment b-y, c-x:  P(a, f | e) = ¾ * ½ = 3/8
  Pair 2, alignment b-y:       P(a, f | e) = ¾

(44)

The EM algorithm

Repeat step 3: Normalize P(a, f | e) values to yield P(a | e, f).

  Pair 1, alignment b-x, c-y:  P(a | e, f) = 1/8 / (1/8 + 3/8) = ¼
  Pair 1, alignment b-y, c-x:  P(a | e, f) = 3/8 / (1/8 + 3/8) = ¾
  Pair 2, alignment b-y:       P(a | e, f) = 1

Repeat step 4: Collect fractional counts.

  tc(x | b) = ¼
  tc(y | b) = ¾ + 1 = 7/4
  tc(x | c) = ¾
  tc(y | c) = ¼

(45)

The EM algorithm

Step 5: Normalize fractional counts to get revised parameter values.

  t(x | b) = 1/8
  t(y | b) = 7/8
  t(x | c) = 3/4
  t(y | c) = 1/4

Repeat steps 2-5 many times:

  t(x | b) = 0.0001
  t(y | b) = 0.9999
  t(x | c) = 0.9999
  t(y | c) = 0.0001
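The worked example can be checked with a short script. This is a sketch under one explicit assumption taken from the example itself: only one-to-one alignments are enumerated (two for the first sentence pair, one for the second). The function name `em_step` is hypothetical, and exact fractions are used so the values match the slides digit for digit.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations

# Toy corpus from the slides: (English words, foreign words).
CORPUS = [(["b", "c"], ["x", "y"]), (["b"], ["y"])]


def em_step(t, corpus):
    """One pass of steps 2-5; t maps (f, e) to t(f | e)."""
    counts = defaultdict(Fraction)
    for e_words, f_words in corpus:
        # Step 2: P(a, f | e) for every one-to-one alignment a.
        alignments = list(permutations(range(len(e_words)), len(f_words)))
        scores = []
        for a in alignments:
            p = Fraction(1)
            for j, i in enumerate(a):
                p *= t[(f_words[j], e_words[i])]
            scores.append(p)
        total = sum(scores)
        # Steps 3 and 4: normalize to P(a | e, f), collect fractional counts.
        for a, p in zip(alignments, scores):
            for j, i in enumerate(a):
                counts[(f_words[j], e_words[i])] += p / total
    # Step 5: renormalize the counts per English word.
    totals = defaultdict(Fraction)
    for (f, e), c in counts.items():
        totals[e] += c
    return {(f, e): counts[(f, e)] / totals[e] for (f, e) in t}


if __name__ == "__main__":
    # Step 1: uniform initialization.
    t = {(f, e): Fraction(1, 2) for f in "xy" for e in "bc"}
    for iteration in range(1, 3):
        t = em_step(t, CORPUS)
        print(iteration, {f"t({f}|{e})": str(p) for (f, e), p in sorted(t.items())})
```

After the first pass t(x | b) = 1/4 and t(y | b) = 3/4; after the second, t(x | b) = 1/8 and t(y | b) = 7/8, in agreement with the values above. Further passes push t(y | b) and t(x | c) toward 1.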

(46)

EM allows one to make MLE under adverse circumstances

[Pedersen, EMNLP-2001 EM Panel]

• MLE (Maximum Likelihood Estimates)

– Parameters describe the characteristics of a population. Their values are estimated from samples collected from that population.

– A MLE is a parameter estimate that is most consistent with the sampled data. It maximizes the likelihood of the data P(X | Θ).

(47)

Trivial example: coin tossing

• 10 trials: h, t, t, t, h, t, t, h, t, t

• One parameter: θ = P(h)

• The MLE of θ is 3/10.

• Explanation:

– Given 10 tosses, how likely it is to get 3 heads?

$L(\theta) = \binom{10}{3}\, \theta^{3} (1 - \theta)^{7}$
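Why the maximum falls at 3/10 can be seen by maximizing the log-likelihood. The derivation below is a standard calculus step added for illustration, writing θ for P(h) and taking the likelihood of 3 heads in 10 tosses:

```latex
\begin{align*}
L(\theta) &= \binom{10}{3}\,\theta^{3}(1-\theta)^{7} \\
\log L(\theta) &= \log\binom{10}{3} + 3\log\theta + 7\log(1-\theta) \\
\frac{d}{d\theta}\log L(\theta) &= \frac{3}{\theta} - \frac{7}{1-\theta} = 0
  \;\Longrightarrow\; 3(1-\theta) = 7\theta
  \;\Longrightarrow\; \hat{\theta} = \frac{3}{10}
\end{align*}
```

The estimate is simply the observed fraction of heads, which is the closed-form case that EM generalizes when some of the data is hidden.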

(48)

EM: a more complex example

• Most often, for multinomial distributions it is not possible to find the MLE using closed-form formulas.

• E-step: Find the expected values of the complete data, given the incomplete data and the current parameter estimates (steps 2 and 3)

• M-step: Compute MLE as usual (steps 4 and 5).

1e: b c
1f: x y
---
2e: b
2f: y

Parameters: Θ = {t(x | b), t(x | c), t(y | b), t(y | c)}

$L(X \mid \Theta) = P(f \mid e) = \sum_{a} P(a, f \mid e)$