Learning from Labeled and
Unlabeled Data: Semi-supervised
Learning and Ranking
Dengyong Zhou
[email protected]
Learning from Examples
•
Input space X , and output space
Y = {1, −1}.
•
Training set
S = {z
1
= (x
1
, y
1
), . . . , z
l
= (x
l
, y
l
)} in
Z = X × Y drawn i.i.d. from some unknown
distribution.
Transductive Setting
•
Input space X = {x
1
, . . . , x
n
}, and output
space Y = {1, −1}.
•
Training set
S = {z
1
= (x
1
, y
1
), . . . , z
l
= (x
l
, y
l
)}.
Intuition about classification: Manifold
•
Local consistency. Nearby points are likely
to have the same label.
•
Global consistency. Points on the same
structure (typically referred to as a cluster or
manifold) are likely to have the same label.
A Toy Dataset (Two Moons)
−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 11.5
(a) Toy Data (Two Moons)
unlabeled point labeled point −1 labeled point +1 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 1 1.5
(b) SVM (RBF Kernel)
−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 1 1.5(c) k−NN
−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 1 1.5(d) Ideal Classification
Algorithm
1. Form the affinity matrix W defined by W
ij
=
exp(−kx
i
− x
j
k
2
/2σ
2
)
if i 6= j and W
ii
= 0.
2. Construct the matrix S = D
−1/2
W D
−1/2
in which D is
a diagonal matrix with its (i, i)-element equal to the
sum of the i-th row of W.
3. Iterate f(t + 1) = αSf(t) + (1 − α)y until convergence,
where α is a parameter in (0, 1).
Convergence
Theorem. The sequence {f(t)} converges to
f
∗
= β(I − αS)
−1
y,
where β = 1 − α.
Proof. Suppose F (0) = Y. By the iteration equation, we
have
f (t) = (αS)
t−1
Y + (1 − α)
t−1
X
i=0
(αS)
i
Y.
(1)
Since 0 < α < 1 and the eigenvalues of S in [−1, 1],
lim
t→∞
(αS)
t−1
= 0,
and lim
t→∞
t−1
X
i=0
(αS)
i
= (I − αS)
−1
.
(2)
Regularization Framework
Cost function
Q(f) =
1
2
n
X
i,j=1
W
ij
1
√
D
ii
f
i
−
1
pD
jj
f
j
2
+ µ
n
X
i=1
f
i
− y
i
2
•
Smoothness term. Measure the changes between
nearby points.
•
Fitting term. Measure the changes from the initial
Regularization Framework
Theorem. f
∗
= arg min
f ∈F
Q(f).
Proof. Differentiating Q(f) with respect to f, we have
∂Q
∂f
f =f
∗
= f
∗
− Sf
∗
+ µ(f
∗
− y) = 0,
(1)
which can be transformed into
f
∗
−
1
1 + µ
Sf
∗
−
µ
1 + µ
y = 0.
(2)
Let α = 1/(1 + µ) and β = µ/(1 + µ). Then
Two Variants
•
Substitute P = D
−1
W
for S in the iteration
equation. Then f
∗
= (I − αP )
−1
y.
•
Replace S with P
T
,
the transpose of P. Then
f
∗
= (I − αP
T
)
−1
y,
which is equivalent to
Toy Problem
−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 1 1.5(a) t = 10
−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 1 1.5(b) t = 50
−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 1 1.5(c) t = 100
−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 1 1.5(d) t = 400
Handwritten Digit Recognition (USPS)
10 20 30 40 50 60 70 80 90 100 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 # labeled points test error k−NN (k = 1) SVM (RBF kernel) consistency variant (1) variant (2)Dimension: 16x16. Size: 9298. (α = 0.95)
Handwritten Digit Recognition (USPS)
0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 test error consistency variant (1) variant (2)Text Classification (20-newsgroups)
10 20 30 40 50 60 70 80 90 100 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8# labeled points
test error
k−NN (k = 1)
SVM (RBF kernel)
consistency
variant (1)
variant (2)
Dimension: 8014. Size: 3970. (α = 0.95)
Text Classification (20-newsgroups)
0.25 0.3 0.35 0.4 0.45 0.5 test error consistency variant (1) variant (2)Spectral Graph Theory
Normalized graph Laplacian ∆ = D
−1/2
(D − W )D
−1/2
.
Linear operator on the space of functions defined on the
Graph.
Theorem. P
i,j
W
ij
1
√
D
ii
f
i
−
1
√
D
jj
f
j
2
= hf, ∆fi.
Discrete analogy of Laplace-Beltrami operator on
Riemannian Manifold which satisfies
Z
M
k∇fk
2
=
Z
M
∆(f )f.
Discrete Laplace equation ∆f = y.
Reversible Markov Chains
Lazy random walk defined by the transition probability
matrix P
∗
= (1 − α)I + αD
−1
W, α ∈ (0, 1).
Hitting time H
ij
= E{ number of steps required for a
random walk to reach a position x
j
with an initial position
x
i
}.
Commute time C
ij
= H
ij
+ H
ji
.
Theorem. Let L = (D − αW )
−1
.
Then
Ranking Problem
Problem setting. Given a set of point
X = {x
1
, ..., x
q
, x
q+1
, ..., x
n
} ⊂ R
m
,
the first q
points are the queries. The task is to rank the
remaining points according to their relevances to
the queries.
Examples. Image, document, movie, book,
Intuition of Ranking: Manifold
(a) Two moons ranking problem
query