Learning from Labeled and Unlabeled Data
Tom Mitchell
Statistical Approaches to Learning and Discovery, 10-702 and 15-802
March 31, 2003
When can Unlabeled Data help supervised learning?
Consider setting:
• Set X of instances drawn from unknown P(X)
• f: X → Y target function (or, P(Y|X))
• Set H of possible hypotheses for f
Given:
• iid labeled examples
• iid unlabeled examples

Determine:
When can Unlabeled Data help supervised learning?
Important question! In many cases, unlabeled data is plentiful, labeled data expensive
• Medical outcomes (x=<patient,treatment>, y=outcome)
• Text classification (x=document, y=relevance)
• User modeling (x=user actions, y=user intent)
• …
Four Ways to Use Unlabeled Data for Supervised Learning
1. Use to reweight labeled examples
2. Use to help EM learn class-specific generative models
3. If problem has redundantly sufficient features, use CoTraining
4. Use to detect/preempt overfitting
1. Use U to reweight labeled examples
2. Use U with EM and Assumed Generative Model
[Figure: naive Bayes generative model — class variable Y with features X1, X2, X3, X4.]

X4 X3 X2 X1  Y
 1  0  1  0  ?
 1  0  0  0  ?
 0  1  1  0  0
 0  0  1  0  0
 1  1  0  0  1

Learn P(Y|X)
From [Nigam et al., 2000]
E Step: estimate the class membership of each unlabeled document,

P(y_j | x_i) ∝ P(y_j) ∏_t P(w_t | y_j)^{N(w_t, x_i)}

M Step: re-estimate the parameters from the expected labels,

P(w_t | y_j) = (1 + Σ_i N(w_t, x_i) P(y_j | x_i)) / (|V| + Σ_s Σ_i N(w_s, x_i) P(y_j | x_i))

where w_t is the t-th word in the vocabulary and N(w_t, x_i) is the number of times w_t occurs in document x_i.
Elaboration 1: Downweight the influence of unlabeled examples by factor λ

New M step: same as before, with the counts contributed by unlabeled examples multiplied by λ ∈ [0,1]; λ is chosen by cross validation.
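A minimal sketch of this EM procedure with the λ-weighting, assuming a two-class multinomial naive Bayes model over word counts; the function and variable names are mine, not from Nigam et al.:

```python
import numpy as np

# Semi-supervised naive Bayes via EM (sketch, after Nigam et al., 2000).
# Xl: (nl, V) word-count matrix for labeled docs, yl: their labels;
# Xu: (nu, V) word counts for unlabeled docs. lam downweights the
# unlabeled examples' influence in the M step.
def train_em_nb(Xl, yl, Xu, lam=1.0, n_iter=20, n_classes=2):
    nl, V = Xl.shape
    Rl = np.eye(n_classes)[yl]                      # labeled: fixed one-hot "responsibilities"
    Ru = np.full((Xu.shape[0], n_classes), 1.0 / n_classes)  # unlabeled: start uniform
    for _ in range(n_iter):
        # M step: priors and per-class word distributions with Laplace
        # smoothing; unlabeled responsibilities weighted by lam.
        w = np.vstack([Rl, lam * Ru])
        X = np.vstack([Xl, Xu])
        prior = (1 + w.sum(axis=0)) / (n_classes + w.sum())
        counts = w.T @ X                            # (C, V) expected word counts
        theta = (1 + counts) / (V + counts.sum(axis=1, keepdims=True))
        # E step: P(y_j | x_i) ∝ P(y_j) ∏_t P(w_t | y_j)^count
        log_post = np.log(prior) + Xu @ np.log(theta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        Ru = np.exp(log_post)
        Ru /= Ru.sum(axis=1, keepdims=True)
    return prior, theta
```

Setting lam=0 recovers supervised naive Bayes on the labeled data alone; lam=1 gives the unlabeled documents full weight.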
Experimental Evaluation
• Newsgroup postings
– 20 newsgroups, 1000/group
• Web page classification
– student, faculty, course, project
– 4199 web pages
• Reuters newswire articles
– 12,902 articles
– 90 topic categories
20 Newsgroups

[Results figures: classification accuracy as a function of the number of labeled documents, with and without unlabeled documents.]
Using one labeled example per class
• Can’t really get something for nothing…
• But unlabeled data useful to degree that assumed form for P(X,Y) is correct
• E.g., in text classification, useful despite obvious error in assumed form of P(X,Y)
2. Use U with EM and Assumed Generative Model
3. If Problem Setting Provides Redundantly Sufficient Features, use CoTraining
learn f : X → Y
where X = X1 × X2
where x is drawn from an unknown distribution
and ∃ g1, g2 : (∀x) g1(x1) = g2(x2) = f(x)
Redundantly Sufficient Features
[Figure: Professor Faloutsos' web page, linked by a hyperlink with anchor text "my advisor" — either the page words or the hyperlink words suffice to classify the page.]
CoTraining Algorithm #1
[Blum&Mitchell, 1998]
Given: labeled data L, unlabeled data U
Loop:
  Train g1 (hyperlink classifier) using L
  Train g2 (page classifier) using L
  Allow g1 to label p positive, n negative examples from U
  Allow g2 to label p positive, n negative examples from U
  Add these self-labeled examples to L
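The loop above can be sketched as follows. This is a toy rendering, not the original implementation: the two per-view classifiers are simple nearest-centroid stand-ins so the sketch stays self-contained, and any confidence-ranked classifiers could be substituted.

```python
import numpy as np

class Centroid:
    """Toy per-view classifier: nearest class centroid, with a score per class."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        return self
    def scores(self, X):
        # Negative squared distance to each centroid: higher = more confident.
        return -((X[:, None, :] - self.mu[None]) ** 2).sum(-1)
    def predict(self, X):
        return self.classes[self.scores(X).argmax(axis=1)]

def cotrain(X1, X2, y, U1, U2, p=1, n=1, rounds=5):
    """X1/X2: the two labeled views; U1/U2: the unlabeled views; y in {0,1}.
    Each round, each view's classifier labels its p most confident positive
    and n most confident negative unlabeled examples, which join L."""
    L1, L2, yl = X1.copy(), X2.copy(), y.copy()
    avail = np.arange(len(U1))                   # indices of still-unlabeled examples
    for _ in range(rounds):
        if len(avail) == 0:
            break
        g1 = Centroid().fit(L1, yl)
        g2 = Centroid().fit(L2, yl)
        picked = set()
        for g, U in ((g1, U1), (g2, U2)):
            s = g.scores(U[avail])
            conf_pos = avail[np.argsort(-s[:, 1])[:p]]   # most confident positives
            conf_neg = avail[np.argsort(-s[:, 0])[:n]]   # most confident negatives
            for idx, lab in [(i, 1) for i in conf_pos] + [(i, 0) for i in conf_neg]:
                if idx not in picked:
                    picked.add(idx)
                    L1 = np.vstack([L1, U1[idx]])
                    L2 = np.vstack([L2, U2[idx]])
                    yl = np.append(yl, lab)
        avail = np.array([i for i in avail if i not in picked])
    return Centroid().fit(L1, yl), Centroid().fit(L2, yl)
```

The key design point is that each classifier's self-labeled examples train *both* views, so confident predictions in one view transfer information to the other.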
CoTraining: Experimental Results
• begin with 12 labeled web pages (academic course)
• provide 1,000 additional unlabeled web pages
• average error: learning from labeled data 11.1%;
• average error: cotraining 5.0%
Typical run:
Co-Training Rote Learner

[Figures: bipartite graph linking hyperlink phrases (e.g. "my advisor") to pages; + and − labels propagate within connected components of the graph.]
Expected Rote CoTraining error given m examples:

E[error] = Σ_j P(x ∈ g_j) (1 − P(x ∈ g_j))^m

where g_j is the j-th connected component of the graph.
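The formula sums, over components, the probability that a test point lands in component g_j times the probability that none of the m training examples did. A quick numeric check (the component probabilities below are made up):

```python
import numpy as np

def expected_rote_error(component_probs, m):
    """E[error] = sum_j P(x in g_j) * (1 - P(x in g_j))^m."""
    p = np.asarray(component_probs, dtype=float)
    return float((p * (1 - p) ** m).sum())

# More examples -> fewer unseen components -> lower expected error.
probs = [0.5, 0.3, 0.2]
errs = [expected_rote_error(probs, m) for m in (1, 5, 20)]
```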
CoTraining setting:

learn f : X → Y
where X = X1 × X2
where x is drawn from an unknown distribution
and ∃ g1, g2 : (∀x) g1(x1) = g2(x2) = f(x)
How many unlabeled examples suffice?
Want to assure that the connected components in the underlying distribution, G_D, are also connected components in the observed sample, G_S.

O(log(N)/α) examples assure that, with high probability, G_S has the same connected components as G_D [Karger, 94].
CoTraining Setting
learn f : X → Y
where X = X1 × X2
where x is drawn from an unknown distribution
and ∃ g1, g2 : (∀x) g1(x1) = g2(x2) = f(x)
• If
– x1, x2 conditionally independent given y
– f is PAC learnable from noisy labeled data
• Then
– f is PAC learnable from weak initial classifier plus unlabeled data
PAC Generalization Bounds on CoTraining
[Dasgupta et al., NIPS 2001]
• Idea: Want classifiers that produce a maximally consistent labeling of the data
• If learning is an optimization problem, what function should we optimize?
What if CoTraining Assumption Not Perfectly Satisfied?
[Figure: + and − instances for which the two views' classifiers disagree.]
What Objective Function?
E = E1 + E2 + c3 E3 + c4 E4

E1 = Σ_{⟨x,y⟩ ∈ L} (y − ĝ1(x1))²        (error on labeled examples)
E2 = Σ_{⟨x,y⟩ ∈ L} (y − ĝ2(x2))²        (error on labeled examples)
E3 = Σ_{x ∈ U} (ĝ1(x1) − ĝ2(x2))²       (disagreement over unlabeled)
E4 = ( (1/|L|) Σ_{⟨x,y⟩ ∈ L} y − (1/|U∪L|) Σ_{x ∈ U∪L} (ĝ1(x1) + ĝ2(x2))/2 )²   (misfit to estimated class priors)
What Function Approximators?
• Same fn form as Naïve Bayes, Max Entropy
• Use gradient descent to simultaneously learn g1 and g2, directly minimizing E = E1 + E2 + E3 + E4
• No word independence assumption, use both labeled and unlabeled data
ĝ1(x1) = 1 / (1 + e^{Σ_j w_{1,j} x_{1,j}})

ĝ2(x2) = 1 / (1 + e^{Σ_j w_{2,j} x_{2,j}})
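A minimal sketch of this gradient co-training idea: two logistic functions, one per view, trained by gradient descent on E1 + E2 + c3·E3 (the class-prior term E4 is omitted for brevity). The learning rate, coefficient, and the standard sigmoid convention here are my illustrative choices, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_cotrain(X1, X2, y, U1, U2, c3=1.0, lr=0.1, steps=3000):
    """Jointly fit g1(x1)=sigmoid(w1.x1) and g2(x2)=sigmoid(w2.x2) by
    gradient descent on squared error plus squared disagreement on U."""
    w1 = np.zeros(X1.shape[1])
    w2 = np.zeros(X2.shape[1])
    for _ in range(steps):
        g1l, g2l = sigmoid(X1 @ w1), sigmoid(X2 @ w2)   # labeled predictions
        g1u, g2u = sigmoid(U1 @ w1), sigmoid(U2 @ w2)   # unlabeled predictions
        # Gradient of E1 = sum (y - g1)^2 plus the E3 disagreement term.
        d1 = -2 * (y - g1l) * g1l * (1 - g1l) @ X1 \
             + c3 * 2 * (g1u - g2u) * g1u * (1 - g1u) @ U1
        d2 = -2 * (y - g2l) * g2l * (1 - g2l) @ X2 \
             - c3 * 2 * (g1u - g2u) * g2u * (1 - g2u) @ U2
        w1 -= lr * d1
        w2 -= lr * d2
    return w1, w2
```

Because the disagreement term E3 is differentiable, the unlabeled data shapes both weight vectors directly, with no word-independence assumption.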
Classifying Jobs for FlipDog
X1: job title
X2: job description
Gradient CoTraining
Classifying FlipDog job descriptions: SysAdmin vs. WebProgrammer
Final Accuracy:
Labeled data alone: 86%
CoTraining: 96%
Gradient CoTraining
Classifying Upper Case sequences as Person Names
Error Rates:

                                          25 labeled,      2300 labeled,
                                          5000 unlabeled   5000 unlabeled
Using labeled data only                   .27              .13
Cotraining                                .15*             .11
Cotraining without fitting class priors   —                .24**
CoTraining Summary
• Unlabeled data improves supervised learning when example features are redundantly sufficient
– Family of algorithms that train multiple classifiers
• Theoretical results
– Expected error for rote learning
– If X1, X2 conditionally independent given Y
• PAC learnable from weak initial classifier plus unlabeled data
• error bounds in terms of disagreement between g1(x1) and g2(x2)
• Many real-world problems of this type
– Semantic lexicon generation [Riloff, Jones 99], [Collins, Singer 99]
– Web page classification [Blum, Mitchell 98]
– Word sense disambiguation [Yarowsky 95]
– Speech recognition [de Sa, Ballard 98]
4. Use U to Detect/Preempt Overfitting
• Definition of distance metric
– non-negative: d(f,g) ≥ 0
– symmetric: d(f,g) = d(g,f)
– triangle inequality: d(f,g) ≤ d(f,h) + d(h,g)
• Classification with zero-one loss: d(f,g) = P_x( f(x) ≠ g(x) ), estimated from unlabeled data
• Regression with squared loss: d(f,g) = ( E_x[ (f(x) − g(x))² ] )^{1/2}, estimated from unlabeled data
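A small sketch of estimating these distances from an unlabeled sample, as the TRI approach requires; the function names are mine, not from the paper. For zero-one loss the distance is the disagreement probability; for squared loss it is the root-mean-square difference (the square root is what makes the triangle inequality hold).

```python
import numpy as np

def d_zero_one(f, g, Xu):
    """Fraction of unlabeled points where two classifiers disagree."""
    return float(np.mean(f(Xu) != g(Xu)))

def d_squared(f, g, Xu):
    """Empirical L2 pseudo-metric between two regressors."""
    return float(np.sqrt(np.mean((f(Xu) - g(Xu)) ** 2)))
```

These unlabeled-data estimates satisfy the metric axioms above, which is what lets TRI flag hypotheses whose training-set error and inter-hypothesis distances violate the triangle inequality.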
Experimental Evaluation of TRI
[Schuurmans & Southey, MLJ 2002]
• Use it to select degree of polynomial for regression
• Compare to alternatives such as cross validation,
structural risk minimization, …
Generated y values contain zero-mean Gaussian noise: y = f(x) + ε
Cross validation (ten-fold); structural risk minimization
Approximation ratio:

  (true error of selected hypothesis) / (true error of best hypothesis considered)
Results using 200 unlabeled, t labeled
Worst performance in the top 50% of trials