Learning To Deal With Little Or
No Annotated Data
Daniel Marcu
Information Sciences Institute and Department of Computer Science University of Southern California
4676 Admiralty Way, Suite 1001 Marina del Rey, CA 90292
Overview
• “There is no better data than more data.”
• Annotating data is more cost effective than writing rules
manually.
– Still, annotating data is expensive
• How can we annotate as little data as possible?
– Active Learning
– Bootstrapping
– Co-training
• Unsupervised Learning.
– Pattern Discovery
– Hidden Variables (the EM algorithm)
Choosing between confusables
[Banko and Brill, ACL-2001]
Base Noun Phrase Chunking
[Ngai and Yarowsky, ACL-2000]
• Asked human judges to write rules that can be
used to identify base noun phrases and
automatically integrated those rules into a
rule-based chunker.
• Asked human judges to annotate base noun
phrases in naturally occurring text and trained
a ML-based system to recognize these
phrases.
• Compared the performance of the rule-based
and ML-based systems.
How can we do well while
annotating less data?
• Active learning
– Active learning with one classifier
– Active learning with a committee of classifiers
• Bootstrapping
– Bootstrapping with one classifier
– Bootstrapping with a committee of classifiers
Active learning with one classifier
Input:
small annotated corpus +
large un-annotated corpus.
1. Train classifier on annotated data.
2. Apply classifier to unlabeled examples.
3. Elicit human judgments for examples on which
the classifier had the lowest confidence.
4. Add new labeled data to the annotated corpus.
5. Retrain classifier and test on held-out data.
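A minimal sketch of the selection step, assuming the classifier returns one probability per class for each unlabeled example. The function name `least_confident` and the probability values are hypothetical; any confidence measure could be substituted.

```python
def least_confident(probabilities, k):
    """Return indices of the k examples whose top-class probability is lowest."""
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:k]

# Hypothetical class-probability outputs for four unlabeled examples.
probs = [[0.95, 0.05], [0.55, 0.45], [0.20, 0.80], [0.51, 0.49]]

# These are the examples that would be sent to the human annotator.
to_annotate = least_confident(probs, 2)
```

Here the two examples closest to a coin flip are selected; the confidently classified ones are left unlabeled.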
Active learning with multiple
classifiers
Input:
small annotated corpus +
large un-annotated corpus.
1. Train multiple classifiers on annotated data.
2. Apply classifiers to unlabeled examples.
3. Elicit human judgments for examples on which
the classifiers agree the least.
4. Add new labeled data to the annotated corpus.
5. Retrain classifiers and test on held-out data.
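One way to measure how little a committee agrees is the entropy of its votes. The sketch below assumes each classifier casts one label per example; `vote_entropy` and `most_disputed` are illustrative names, and vote entropy is only one of several possible disagreement measures.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy (in bits) of the label distribution among committee votes."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def most_disputed(committee_votes, k):
    """Indices of the k examples with the highest vote entropy."""
    ranked = sorted(range(len(committee_votes)),
                    key=lambda i: vote_entropy(committee_votes[i]),
                    reverse=True)
    return ranked[:k]

# Hypothetical votes from a three-member committee on four examples.
votes = [["NP", "NP", "NP"], ["NP", "O", "NP"], ["O", "O", "O"], ["NP", "O", "VP"]]
disputed = most_disputed(votes, 2)
```

Examples on which every member votes the same way have zero entropy and are never selected before a disputed one.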
Active learning helps
Active learning worked in all
cases that I know of.
Bootstrapping with one classifier
Input:
small annotated corpus +
large un-annotated corpus.
1. Train classifier on annotated data.
2. Apply classifier to unlabeled examples.
3. Add to the training corpus the examples that
are labeled with high confidence.
4. Retrain classifier (and test on held-out
data).
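The loop above can be sketched as follows. The `train` and `predict` callables are placeholders for any supervised learner; the toy nearest-mean classifier, its confidence formula, and the 0.85 threshold are assumptions made only so that the example runs.

```python
def self_train(labeled, unlabeled, train, predict, threshold=0.85, rounds=5):
    """Grow the labeled set with the classifier's own confident predictions."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        confident, remaining = [], []
        for x in unlabeled:
            label, conf = predict(model, x)
            (confident if conf >= threshold else remaining).append((x, label))
        if not confident:          # nothing new to add: stop early
            break
        labeled.extend(confident)
        unlabeled = [x for x, _ in remaining]
    return train(labeled), labeled

# Toy stand-in classifier (an assumption): 1-D nearest class mean.
def train(examples):
    sums = {}
    for x, y in examples:
        s, n = sums.get(y, (0.0, 0))
        sums[y] = (s + x, n + 1)
    return {y: s / n for y, (s, n) in sums.items()}

def predict(means, x):
    # Confidence: 1 minus the nearest mean's share of the total distance.
    dists = {y: abs(x - m) for y, m in means.items()}
    best = min(dists, key=dists.get)
    total = sum(dists.values())
    conf = 1.0 if total == 0 else 1 - dists[best] / total
    return best, conf

model, grown = self_train([(0.0, "A"), (10.0, "B")], [1.0, 9.0, 5.2, 2.5],
                          train, predict)
```

Only the two points near a seed are absorbed; the ambiguous points stay unlabeled, which is the intended behavior of a confidence threshold.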
Bootstrapping with multiple
classifiers
Input:
small annotated corpus +
large un-annotated corpus.
1. Train classifiers on annotated data.
2. Apply classifiers to unlabeled examples.
3. Add to the training corpus the examples that
are given the same label by all (or most of) the
classifiers.
4. Retrain classifiers (and test on held-out
data).
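A small sketch of the agreement filter, assuming each classifier outputs one label per example. The name `agreed_labels` and the `min_agreement` parameter are hypothetical; 1.0 corresponds to "all" and a lower value to "most of" the classifiers.

```python
from collections import Counter

def agreed_labels(committee_votes, min_agreement=1.0):
    """Return {example_index: label} where enough classifiers agree."""
    accepted = {}
    for i, votes in enumerate(committee_votes):
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_agreement:
            accepted[i] = label
    return accepted

# Hypothetical votes from three classifiers on three unlabeled examples.
votes = [["A", "A", "A"], ["A", "B", "A"], ["B", "B", "B"]]
unanimous = agreed_labels(votes)
majority = agreed_labels(votes, min_agreement=2 / 3)
```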
Bootstrapping example
[Yarowsky – ACL-95]
• Extract from a corpus all instances of a
polysemous word (7538 instances of plant).
Sense   Training Examples
?       company said the plant is still operating
?       Although thousands of plant and animal species
?       zonal distribution of plant life
?       to strain microscopic plant life from the
?       Nissan car and truck plant in Japan
?       discovered at a St. Louis plant manufacturing
?       automated manufacturing plant in Fremont
Start with a simple classifier and
create a seed corpus
Sense   Training Examples
A       zonal distribution of plant life
A       to strain microscopic plant life from the
A       …
?       Nissan car and truck plant in Japan
?       company said the plant is still operating
?       Although thousands of plant and animal species
?       …
B       discovered at a St. Louis plant manufacturing
B       automated manufacturing plant in Fremont
B       …
• Start with a simple classifier based on seed collocations: life → A; manufacturing → B
– 82 examples of living plants (1%)
– 106 examples of manufacturing plants (1%)
1. Train supervised classifier on seed
corpus
Collocation                          Sense
plant life                           A
manufacturing plant                  B
life (within 2-10 words)             A
manufacturing (within 2-10 words)    B
animal (within 2-10 words)           A
equipment (within 2-10 words)        B
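The collocation table can be read as a decision list: the first rule that matches decides the sense. The sketch below is an illustration under assumptions: the rule encoding (`"right"`, `"left"`, `"window"`) and the function name `classify` are invented here, and the rules are applied in table order rather than ranked by a learned score.

```python
# Each rule: (kind, collocate, sense). Order matters: first match wins.
RULES = [
    ("right", "life", "A"),            # "plant life"
    ("left", "manufacturing", "B"),    # "manufacturing plant"
    ("window", "life", "A"),
    ("window", "manufacturing", "B"),
    ("window", "animal", "A"),
    ("window", "equipment", "B"),
]

def classify(tokens, i, rules=RULES, lo=2, hi=10):
    """Sense of the target word at position i, or None if no rule matches."""
    words = [w.lower() for w in tokens]
    for kind, word, sense in rules:
        if kind == "right" and i + 1 < len(words) and words[i + 1] == word:
            return sense
        if kind == "left" and i > 0 and words[i - 1] == word:
            return sense
        if kind == "window":
            # Words between lo and hi positions away, on either side.
            left = words[max(0, i - hi):max(0, i - lo + 1)]
            right = words[i + lo:i + hi + 1]
            if word in left + right:
                return sense
    return None

s1 = classify("Although thousands of plant and animal species".split(), 3)
s2 = classify("automated manufacturing plant in Fremont".split(), 2)
s3 = classify("company said the plant is still operating".split(), 3)
```

The third sentence matches no rule, so it stays unlabeled until a later iteration learns a collocation that covers it.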
Rest of the algorithm
1. Optionally use one-sense-per-discourse filter and
augment labeled data
2. Repeat steps 1, 2, 3 iteratively.
Evaluation:
• Baseline: 63.9%
• Supervised: 96.1%
• Bootstrapping: 96.5%
Bootstrapping does not work in all cases
(than vs. then) [Banko and Brill, ACL-2001]
Training data                      Test accuracy   % Total training data
10^6-word labeled seed corpus      0.9624          0.1
Seed + 5 x 10^6 unsupervised       0.9588          0.6
Seed + 10^7 unsupervised           0.9620          1.2
Seed + 10^8 unsupervised           0.9715          12.2
Seed + 5 x 10^8 unsupervised       0.9588          61.1
Co-training
[Blum and Mitchell, COLT-1998]
Example web page:
Professor John Smith
I teach computer courses and advise students including
– Mary Kae
– Bill Blue
I work on the following projects:
– machine learning for web classification
– active learning for NLP
– software engineering
Hyperlink pointing to the page: My advisor
Co-training
• Input:
– L: set of labeled training examples
– U: set of unlabeled examples
• Loop:
– Learn hyperlink-based classifier H from L.
– Learn full-text classifier F from L.
– Allow H to label p positive and n negative examples from U (same distribution as in L).
– Allow F to label p positive and n negative examples from U.
– Add these self-labeled examples to L.
• Why does this work?
– Examples that are easy to label for classifier X may be hard cases for classifier Y. Classifier Y may learn something new from the examples that X labels.
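A toy sketch of the co-training loop, assuming each page is represented by two word sets, `links` (hyperlink view) and `text` (full-text view). The overlap-based scorer in `make_learner` is a deliberately crude stand-in for the real classifiers H and F, and the example pages are made up.

```python
def make_learner(view):
    """Learner that scores a page by word overlap in one view (toy stand-in)."""
    def learn(labeled):
        pos = set().union(*(x[view] for x, y in labeled if y))
        neg = set().union(*(x[view] for x, y in labeled if not y))
        return lambda x: len(x[view] & pos) - len(x[view] & neg)
    return learn

def co_train(labeled, unlabeled, learners, p=1, n=1, rounds=5):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        # Both classifiers are learned from the same L before any labeling.
        scorers = [learn(labeled) for learn in learners]
        for score in scorers:
            ranked = sorted(unlabeled, key=score)
            rest = ranked[n:]
            picks = [(x, False) for x in ranked[:n]]                    # n most negative
            picks += [(x, True) for x in rest[len(rest) - min(p, len(rest)):]]  # p most positive
            for x, y in picks:
                labeled.append((x, y))
                unlabeled.remove(x)
    return labeled

L = [({"links": {"course"}, "text": {"syllabus", "homework"}}, True),
     ({"links": {"advisor"}, "text": {"research", "projects"}}, False)]
U = [{"links": {"course", "cs101"}, "text": {"lecture", "exam"}},
     {"links": {"advisor", "prof"}, "text": {"research", "students"}},
     {"links": {"cs101"}, "text": {"homework", "exam"}},
     {"links": {"prof"}, "text": {"projects", "students"}}]

result = co_train(L, U, [make_learner("links"), make_learner("text")])
```

In this toy run the hyperlink view labels the two pages it recognizes, and the text view labels the two pages whose links were uninformative, which is the complementarity the slide describes.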
Example: Error rates for a web classifier
Problem: classify web pages as academic course (yes or no).
                      Page-based   Hyperlink-based   Combined classifier
                      classifier   classifier        (equal votes)
Supervised training   12.9         12.4              11.1
Co-training           6.2          11.6              5.0
Co-training does not work in all cases
[Pierce and Cardie, EMNLP-2001]
Unsupervised learning
• Pattern discovery.
– Language modeling (text as a sequence of words)
– Unsupervised induction of syntactic structure.
– Unsupervised induction of POS taggers and base
noun identifiers for non-English.
Language modeling
• as soon as _______
• I would like ______
• P(w1, w2, w3, …, wn)
• Useful in
– Speech recognition
– Machine translation
– Summarization/Generation
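N-gram models
• Chain rule:
$$P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2)\, P(w_4 \mid w_1, w_2, w_3) \cdots P(w_n \mid w_1, w_2, \ldots, w_{n-1})$$
• Trigram approximation:
$$P(w_1, w_2, \ldots, w_n) \approx P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2)\, P(w_4 \mid w_2, w_3) \cdots P(w_n \mid w_{n-2}, w_{n-1})$$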
• Estimation: P(c | a, b) = count(a, b, c)/count(a, b)
• Smoothing is needed when count(a, b) or count(a, b, c) is 0.
• Still the most popular language model:
never underestimate
the power of n-grams.
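The count-based estimate P(c | a, b) = count(a, b, c)/count(a, b) can be sketched in a few lines. The padding tokens `<s>` and `</s>`, the function names, and the two training sentences are assumptions for illustration; no smoothing is applied, so unseen histories simply get probability 0.

```python
from collections import Counter

def train_trigrams(sentences):
    """Count trigrams and their two-word histories."""
    tri, bi = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1
    return tri, bi

def prob(c, a, b, tri, bi):
    """Unsmoothed estimate of P(c | a, b); 0.0 for an unseen history."""
    return tri[(a, b, c)] / bi[(a, b)] if bi[(a, b)] else 0.0

corpus = [["as", "soon", "as", "possible"],
          ["as", "soon", "as", "you", "can"]]
tri, bi = train_trigrams(corpus)
p_possible = prob("possible", "soon", "as", tri, bi)
```

With only two sentences, "possible" follows "soon as" half the time; any continuation not seen in training gets probability 0, which is exactly the case smoothing is meant to repair.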
Unsupervised induction of syntactic structures
[van Zaanen – ICML-2000]
• [Harris, 1951] Methods in structural linguistics. University of Chicago Press:
“Two constituents of the same type can be replaced”.
• IDEA:
– Find in a corpus parts of sentences that can be replaced and
assume that these parts are syntactic constituents.
• Example:
– Show me (flights from Atlanta to Boston.)
– Show me (the rates for flight 1943.)
– (Book Delta 128) from Dallas to Boston.
– (Give me all flights) from Dallas to Boston.
Algorithm
1. Find overlapping segments in all sentence
pairs (string edit distance).
• Dissimilar parts are considered possible
constituents and are assigned unique types (labels:
X1, X2…).
2. When multiple overlaps occur use various
selection criteria
• First learned constituent is good.
• Constituent that occurs most often is good.
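A rough sketch of the first step, finding the dissimilar parts of a sentence pair. It uses Python's `difflib.SequenceMatcher` on word lists as a convenient stand-in for a string-edit-distance alignment, so it should be read as an approximation and not as the alignment procedure of the cited paper. The function name `hypothesize_constituents` is hypothetical.

```python
from difflib import SequenceMatcher

def hypothesize_constituents(sent_a, sent_b):
    """Return pairs of differing word spans from two sentences."""
    a, b = sent_a.split(), sent_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":      # the shared parts are the context, not constituents
            pairs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return pairs

h1 = hypothesize_constituents("Show me flights from Atlanta to Boston",
                              "Show me the rates for flight 1943")
h2 = hypothesize_constituents("Book Delta 128 from Dallas to Boston",
                              "Give me all flights from Dallas to Boston")
```

Each returned pair holds two spans that occur in the same context and would therefore receive the same type label (X1, X2, …).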
Evaluation
• ATIS corpus: 716 sentences with 11,777 constituents.
• Examples:
– Corpus: What is (the name of (the airport in Boston)NP )NP
– Learned: What is the (name of the (airport in Boston)C )C
– Corpus: Explain classes ((QW)NP and (QX)NP and (Y)NP)NP
– Learned: Explain classes QW and (QX and (Y)C )C
• Non-crossing bracket precision: 86.47
• Non-crossing bracket recall: 86.78
• Lots of room for improvement.
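As a hedged illustration of the metric, the sketch below treats a learned bracket as correct when it does not cross any bracket in the treebank, which is the usual reading of non-crossing bracket precision. Spans are (start, end) word offsets with an exclusive end; the spans and the deliberately bad third bracket are made up for the example.

```python
def crosses(x, y):
    """True if spans x and y overlap without one containing the other."""
    return x[0] < y[0] < x[1] < y[1] or y[0] < x[0] < y[1] < x[1]

def non_crossing_precision(learned, gold):
    """Fraction of learned brackets that cross no gold bracket."""
    ok = [b for b in learned if not any(crosses(b, g) for g in gold)]
    return len(ok) / len(learned)

# "What is the name of the airport in Boston" (word offsets 0..8)
gold = [(2, 9), (5, 9)]              # the name of ..., the airport in Boston
learned = [(3, 9), (6, 9), (1, 4)]   # last span "is the name" is an invented error
score = non_crossing_precision(learned, gold)
```

The first two learned brackets differ from the treebank yet count as correct because they nest inside gold brackets, which shows how lenient the measure is.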
Induction of POS taggers and base noun identifiers for
non-English languages
[Yarowsky and Ngai – NAACL-2001]
• For many languages, no NLP analyzers exist.
• Bottleneck: lack of labeled data.
• IDEA: use parallel corpora and existing
statistical machine translation
software/techniques to automatically label
non-English texts.
Projecting POS tag and base
noun-phrase structure across
languages
Difficulties
• Statistical MT alignment programs yield
relatively low accuracy word alignments.
• Very often translations are not literal.
• Mismatch between the annotation needs of
two languages (gender in French and
POS tagger induction
• Run GIZA (www.isi.edu/natural-language/projects/rewrite) on a parallel
corpus of 2M words.
• Run POS tagger on English text.
• Automatically induce tags for the French.
• Train probabilistic noisy-channel tagger on automatically
induced French tags.
– Downweight or exclude from the training data the segments that are likely to be aligned poorly.
– Train lexical priors P(t | w) and tag sequence models P(t2 | t1) using aggressive generalization techniques (most words have one possible core tag).
• Test performance on held-out data and out-of-domain manually
annotated data (100k: U. Montreal)
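The tag-induction step, in its simplest "direct transfer" form, copies each English tag across the word alignment. The sketch below assumes the alignment is given as (English index, French index) pairs; the function name, the `UNK` default for unaligned words, and the example sentence are illustrative and do not come from the GIZA output format.

```python
def project_tags(english_tags, alignment, french_length, default="UNK"):
    """Copy each English tag to the French word it is aligned with."""
    french_tags = [default] * french_length
    for e_index, f_index in alignment:
        french_tags[f_index] = english_tags[e_index]
    return french_tags

# Hypothetical example: "the red house" / "la maison rouge"
english_tags = ["DT", "JJ", "NN"]
alignment = [(0, 0), (1, 2), (2, 1)]
french_tags = project_tags(english_tags, alignment, french_length=3)
```

Because alignments are noisy and translations are often not literal, tags projected this way are themselves noisy, which is why the projected data is used to train a separate, noise-robust tagger.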
Evaluation
• E-F Aligned French
– Direct transfer: 76%
– Standard noisy-channel: 86%
– Noise-robust noisy-channel: 96%
– Upperbound (trained on heldout goldstandard): 97%
• Out-of-domain data:
– Standard noisy-channel: 82%
– Noise-robust noisy-channel: 94%
NP bracketer induction
• Tag and bracket English text [Brill, CL-1999;
Ramshaw and Marcus, VLC-1999]
• Induce maximal brackets in French/Chinese.
• Train transformation-based learning (TBL)
bracketer on French/Chinese data.
• Test performance on small corpus of held out
sentences (no French or Chinese NP bracketer
exists).
Evaluation on 50 French
sentences
Method   F-measure (Exact)   F-measure (Acceptable)
Direct   45%                 59%
TBL      81%                 91%
Hidden variables: the EM
algorithm
[Knight, AI Magazine, 1997]
1a. ok-voon ororok sprok .
1b. at-voon bichat dat .
2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .
3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .
4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .
5a. wiwok farok izok stok .
5b. totat jjat quat cat .
6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .
7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .
8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .
9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .
10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .
11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .
12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .
EM (Expectation Maximization)
[Dempster, Laird, Rubin; JRSS, 1977]
• EM is good for solving “chicken and egg” problems.
– Translation:
• If we knew the word-level alignments in a corpus, we would know how to estimate t(f | e).
• If we knew t(f | e), we would be able to find the word-level alignments in a corpus.
• Problem to solve: find the word-level alignments and
the translation probabilities given this corpus:
1e: b c
1f: x y
---
2e: b
2f: y
$$P(a, f \mid e) = \prod_{j=1}^{m} t(f_j \mid e_{a_j})$$
The EM algorithm
[Knight, 1999 SMT Tutorial Book]
Step 1: Set parameters uniformly.
t(x | b) = ½   t(y | b) = ½   t(x | c) = ½   t(y | c) = ½
Step 2: Compute P(a, f | e) for all alignments.
Pair 1, alignment b-x, c-y:  P(a, f | e) = ½ * ½ = ¼
Pair 1, alignment b-y, c-x:  P(a, f | e) = ½ * ½ = ¼
Pair 2, alignment b-y:       P(a, f | e) = ½
Step 3: Normalize P(a, f | e) values to yield P(a | e, f).
Pair 1, alignment b-x, c-y:  P(a | e, f) = ¼ / (¼ + ¼) = ½
Pair 1, alignment b-y, c-x:  P(a | e, f) = ¼ / (¼ + ¼) = ½
Pair 2, alignment b-y:       P(a | e, f) = ½ / ½ = 1
Step 4: Collect fractional counts.
tc(x | b) = ½   tc(y | b) = ½ + 1 = 3/2   tc(x | c) = ½   tc(y | c) = ½
Step 5: Normalize fractional counts to get revised parameter values.
t(x | b) = ½ / (3/2 + ½) = ¼
t(y | b) = 3/2 / (3/2 + ½) = ¾
t(x | c) = ½ / 1 = ½
t(y | c) = ½ / 1 = ½
Repeat step 2: Compute P(a, f | e) for all alignments.
Pair 1, alignment b-x, c-y:  P(a, f | e) = ¼ * ½ = 1/8
Pair 1, alignment b-y, c-x:  P(a, f | e) = ¾ * ½ = 3/8
Pair 2, alignment b-y:       P(a, f | e) = ¾
Repeat step 3: Normalize P(a, f | e) values to yield P(a | e, f).
Pair 1, alignment b-x, c-y:  P(a | e, f) = 1/8 / (1/8 + 3/8) = ¼
Pair 1, alignment b-y, c-x:  P(a | e, f) = 3/8 / (1/8 + 3/8) = ¾
Pair 2, alignment b-y:       P(a | e, f) = 1
Repeat step 4: Collect fractional counts.
tc(x | b) = ¼   tc(y | b) = ¾ + 1 = 7/4   tc(x | c) = ¾   tc(y | c) = ¼
Repeat step 5: Normalize fractional counts to get revised parameter values.
t(x | b) = 1/8   t(y | b) = 7/8   t(x | c) = 3/4   t(y | c) = 1/4
Repeat steps 2-5 many times:
t(x | b) = 0.0001   t(y | b) = 0.9999   t(x | c) = 0.9999   t(y | c) = 0.0001
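The worked example can be reproduced with a short program. This is a sketch under the same simplifying assumption as the example, namely that alignments are one-to-one, so each sentence pair must have equal length. The function name `em_iteration` is hypothetical, and exact fractions are used so the printed values match the hand computation.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations

def em_iteration(corpus, t):
    """One pass of steps 2-5 over (english_words, foreign_words) pairs."""
    counts = defaultdict(Fraction)          # fractional counts tc(f | e)
    for e_words, f_words in corpus:
        # Step 2: P(a, f | e) = product over j of t(f_j | e_{a_j})
        scored = []
        for perm in permutations(range(len(e_words))):
            p = Fraction(1)
            for j, i in enumerate(perm):
                p *= t[(f_words[j], e_words[i])]
            scored.append((perm, p))
        total = sum(p for _, p in scored)
        # Steps 3 and 4: normalize to P(a | e, f), then collect counts
        for perm, p in scored:
            for j, i in enumerate(perm):
                counts[(f_words[j], e_words[i])] += p / total
    # Step 5: normalize the counts for each English word
    totals = defaultdict(Fraction)
    for (f, e), c in counts.items():
        totals[e] += c
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}

corpus = [(["b", "c"], ["x", "y"]), (["b"], ["y"])]
t0 = {(f, e): Fraction(1, 2) for f in "xy" for e in "bc"}   # Step 1: uniform
t1 = em_iteration(corpus, t0)
t2 = em_iteration(corpus, t1)
```

After one iteration `t1` holds t(x | b) = 1/4 and t(y | b) = 3/4; after two, `t2` holds t(x | b) = 1/8 and t(y | b) = 7/8, matching the values above. Further iterations push t(y | b) and t(x | c) toward 1.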
EM allows one to make MLE under adverse
circumstances
[Pedersen, EMNLP-2001 EM Panel]
• MLE (Maximum Likelihood Estimates)
– Parameters describe the characteristics of a
population. Their values are estimated from samples
collected from that population.
– An MLE is a parameter estimate that is most consistent
with the sampled data. It maximizes the likelihood of
the data, P(X | Θ).
Trivial example: coin tossing
• 10 trials: h, t, t, t, h, t, t, h, t, t
• One parameter: Θ = P(h)
• The MLE of Θ is 3/10.
• Explanation:
– Given 10 tosses, how likely is it to get 3 heads?
$$L(\Theta) = \binom{10}{3}\, \Theta^{3} (1 - \Theta)^{7}$$
EM: a more complex example
• Most often, for multinomial distributions it is not
possible to find the MLE using closed form formulas.
• E-step: Find the expected values of the complete data,
given the incomplete data and the current parameter
estimates (steps 2 and 3).
• M-step: Compute MLE as usual (steps 4 and 5).
1e: b c
1f: x y
---
2e: b
2f: y
Parameters: Θ = {t(x | b), t(x | c), t(y | b), t(y | c)}
$$L(X \mid \Theta) = P(f \mid e) = \sum_{a} P(a, f \mid e)$$