Learning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking

(1)


Dengyong Zhou

[email protected]

(2)

Learning from Examples

• Input space $X$, and output space $Y = \{1, -1\}$.

• Training set $S = \{z_1 = (x_1, y_1), \dots, z_l = (x_l, y_l)\}$ in $Z = X \times Y$ drawn i.i.d. from some unknown distribution.

(3)

Transductive Setting

• Input space $X = \{x_1, \dots, x_n\}$, and output space $Y = \{1, -1\}$.

• Training set $S = \{z_1 = (x_1, y_1), \dots, z_l = (x_l, y_l)\}$.

(4)

Intuition about classification: Manifold

• Local consistency. Nearby points are likely to have the same label.

• Global consistency. Points on the same structure (typically referred to as a cluster or manifold) are likely to have the same label.

(5)

A Toy Dataset (Two Moons)

[Figure: two moons toy data, four panels: (a) Toy Data (Two Moons), with legend: unlabeled point, labeled point −1, labeled point +1; (b) SVM (RBF Kernel); (c) k-NN; (d) Ideal Classification.]

(6)

Algorithm

1. Form the affinity matrix $W$ defined by $W_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$ if $i \neq j$ and $W_{ii} = 0$.

2. Construct the matrix $S = D^{-1/2} W D^{-1/2}$ in which $D$ is a diagonal matrix with its $(i, i)$-element equal to the sum of the $i$-th row of $W$.

3. Iterate $f(t + 1) = \alpha S f(t) + (1 - \alpha) y$ until convergence, where $\alpha$ is a parameter in $(0, 1)$.
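The three steps above can be sketched in NumPy as follows. This is a minimal illustration, not the author's implementation: the function name `propagate_labels`, the stopping tolerance, the value of σ, and the six-point demo data are all hypothetical, and the sketch assumes that unlabeled points enter the vector y as 0.

```python
import numpy as np

def propagate_labels(X, y, sigma=0.3, alpha=0.95, tol=1e-9, max_iter=10000):
    """Iterate f(t+1) = alpha * S f(t) + (1 - alpha) * y on an RBF affinity graph."""
    # Step 1: affinity matrix W with a zero diagonal.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Step 2: S = D^{-1/2} W D^{-1/2}, where D holds the row sums of W.
    d = W.sum(axis=1)
    inv_sqrt_d = 1.0 / np.sqrt(d)
    S = W * inv_sqrt_d[:, None] * inv_sqrt_d[None, :]

    # Step 3: iterate until successive iterates stop changing.
    f = y.astype(float)
    for _ in range(max_iter):
        f_next = alpha * S @ f + (1.0 - alpha) * y
        converged = np.max(np.abs(f_next - f)) < tol
        f = f_next
        if converged:
            break
    return f

# Hypothetical demo: two tight clusters, one labeled point in each.
# Assumption: unlabeled points are encoded as 0 in y.
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]])
y_demo = np.array([1.0, 0.0, 0.0, -1.0, 0.0, 0.0])
f_demo = propagate_labels(X_demo, y_demo)
labels_demo = np.sign(f_demo)
```

The sign of each entry of `f_demo` gives the predicted class, so in this demo each unlabeled point should inherit the label of the labeled point in its own cluster.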

(7)

Convergence

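Theorem. The sequence $\{f(t)\}$ converges to $f^* = \beta (I - \alpha S)^{-1} y$, where $\beta = 1 - \alpha$.

Proof. Suppose $f(0) = y$. By the iteration equation, we have

$$f(t) = (\alpha S)^{t} y + (1 - \alpha) \sum_{i=0}^{t-1} (\alpha S)^{i} y. \qquad (1)$$

Since $0 < \alpha < 1$ and the eigenvalues of $S$ lie in $[-1, 1]$,

$$\lim_{t \to \infty} (\alpha S)^{t} = 0, \quad \text{and} \quad \lim_{t \to \infty} \sum_{i=0}^{t-1} (\alpha S)^{i} = (I - \alpha S)^{-1}. \qquad (2)$$

Hence $f^* = \lim_{t \to \infty} f(t) = (1 - \alpha)(I - \alpha S)^{-1} y$.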
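The convergence claim can be checked numerically. The sketch below is only an illustration on a small random symmetric graph (the graph, the seed, α = 0.9 and the iteration counts are arbitrary choices, not taken from the slides). It compares the iterate after many steps with the closed form $(1 - \alpha)(I - \alpha S)^{-1} y$, and also checks that, starting from $f(0) = y$, unrolling $t$ steps gives $(\alpha S)^t y + (1 - \alpha) \sum_{i=0}^{t-1} (\alpha S)^i y$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Random symmetric affinity matrix with zero diagonal (illustrative only).
A = rng.random((n, n))
W = (A + A.T) / 2.0
np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))          # S = D^{-1/2} W D^{-1/2}

alpha = 0.9
y = np.array([1.0, -1.0] + [0.0] * (n - 2))

# Eigenvalues of the symmetric matrix S, expected to lie in [-1, 1].
eigvals = np.linalg.eigvalsh(S)

# Run the iteration f(t+1) = alpha * S f(t) + (1 - alpha) * y from f(0) = y.
f = y.copy()
for _ in range(500):
    f = alpha * S @ f + (1 - alpha) * y

# Closed-form limit.
f_star = (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, y)
gap = np.max(np.abs(f - f_star))

# Unrolled expression after t steps versus the iteration itself.
t = 5
f_t = y.copy()
for _ in range(t):
    f_t = alpha * S @ f_t + (1 - alpha) * y
aS = alpha * S
unrolled = np.linalg.matrix_power(aS, t) @ y + (1 - alpha) * sum(
    np.linalg.matrix_power(aS, i) @ y for i in range(t)
)
unrolled_gap = np.max(np.abs(f_t - unrolled))
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly, which is usually preferable when only $f^*$ is needed.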

(8)

Regularization Framework

Cost function

$$Q(f) = \frac{1}{2} \sum_{i,j=1}^{n} W_{ij} \left( \frac{1}{\sqrt{D_{ii}}} f_i - \frac{1}{\sqrt{D_{jj}}} f_j \right)^2 + \mu \sum_{i=1}^{n} (f_i - y_i)^2$$

• Smoothness term. Measures how much $f$ changes between nearby points.

• Fitting term. Measures how much $f$ changes from the initial label assignment $y$.

(9)

Regularization Framework

Theorem. $f^* = \arg\min_{f \in \mathcal{F}} Q(f)$.
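As a sanity check of this theorem, the sketch below evaluates the cost on a small random graph and compares the closed form with perturbed candidates. It rests on stated assumptions: the cost is read as $Q(f) = \frac{1}{2} \sum_{i,j} W_{ij} (f_i/\sqrt{D_{ii}} - f_j/\sqrt{D_{jj}})^2 + \mu \sum_i (f_i - y_i)^2$, and $f^* = \beta (I - \alpha S)^{-1} y$ with $\alpha = 1/(1 + \mu)$ and $\beta = \mu/(1 + \mu)$. The graph, the seed, the value of µ and the helper name `cost` are illustrative only.

```python
import numpy as np

def cost(f, W, y, mu):
    """Smoothness term plus mu times the fitting term, as read from the slide."""
    d = W.sum(axis=1)
    g = f / np.sqrt(d)
    smooth = 0.5 * np.sum(W * (g[:, None] - g[None, :]) ** 2)
    fit = mu * np.sum((f - y) ** 2)
    return smooth + fit

rng = np.random.default_rng(2)
n = 7
A = rng.random((n, n))
W = (A + A.T) / 2.0
np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))

mu = 0.5
alpha = 1.0 / (1.0 + mu)
beta = mu / (1.0 + mu)
y = rng.choice([-1.0, 0.0, 1.0], size=n)

# Candidate minimizer from the closed form.
f_star = beta * np.linalg.solve(np.eye(n) - alpha * S, y)

# Stationarity condition: f* - S f* + mu (f* - y) = 0.
residual = np.max(np.abs(f_star - S @ f_star + mu * (f_star - y)))

# Q is strictly convex for mu > 0, so every perturbation should cost more.
q_star = cost(f_star, W, y, mu)
worse = [cost(f_star + 0.1 * rng.standard_normal(n), W, y, mu) > q_star
         for _ in range(20)]
```

A random-perturbation test is not a proof, but it is a cheap way to catch a sign or scaling slip in the relation between µ, α and β.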

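Proof. Differentiating $Q(f)$ with respect to $f$ and dropping a constant factor, the stationarity condition at $f = f^*$ is

$$f^* - S f^* + \mu (f^* - y) = 0, \qquad (1)$$

which can be transformed into

$$f^* - \frac{1}{1 + \mu} S f^* - \frac{\mu}{1 + \mu} y = 0. \qquad (2)$$

Let $\alpha = 1/(1 + \mu)$ and $\beta = \mu/(1 + \mu)$. Then $(I - \alpha S) f^* = \beta y$, and since $I - \alpha S$ is invertible, $f^* = \beta (I - \alpha S)^{-1} y$.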

(10)

Two Variants

• Substitute $P = D^{-1} W$ for $S$ in the iteration equation. Then $f^* = (I - \alpha P)^{-1} y$.

• Replace $S$ with $P^T$, the transpose of $P$. Then $f^* = (I - \alpha P^T)^{-1} y$, which is equivalent to

(11)

Toy Problem

[Figure: four panels on the two moons data: (a) t = 10; (b) t = 50; (c) t = 100; (d) t = 400.]

(12)

(13)

Handwritten Digit Recognition (USPS)

[Figure: test error versus # labeled points. Curves: k-NN (k = 1), SVM (RBF kernel), consistency, variant (1), variant (2).]

Dimension: 16x16. Size: 9298. (α = 0.95)

(14)

Handwritten Digit Recognition (USPS)

[Figure: test error for consistency, variant (1), and variant (2).]

(15)

Text Classification (20-newsgroups)

[Figure: test error versus # labeled points. Curves: k-NN (k = 1), SVM (RBF kernel), consistency, variant (1), variant (2).]

Dimension: 8014. Size: 3970. (α = 0.95)

(16)

Text Classification (20-newsgroups)

[Figure: test error for consistency, variant (1), and variant (2).]

(17)

Spectral Graph Theory

Normalized graph Laplacian $\Delta = D^{-1/2} (D - W) D^{-1/2}$.

Linear operator on the space of functions defined on the graph.

Theorem. $\frac{1}{2} \sum_{i,j} W_{ij} \left( \frac{1}{\sqrt{D_{ii}}} f_i - \frac{1}{\sqrt{D_{jj}}} f_j \right)^2 = \langle f, \Delta f \rangle$.

Discrete analogue of the Laplace-Beltrami operator on a Riemannian manifold, which satisfies

$$\int_M \|\nabla f\|^2 = \int_M \Delta(f) f.$$

Discrete Laplace equation $\Delta f = y$.
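The quadratic-form identity for the normalized Laplacian is easy to check numerically. In the sketch below the double sum runs over all ordered pairs $(i, j)$, so each edge is counted twice and the sum carries a factor 1/2. The random graph, the seed and the test vector are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

# Random symmetric affinity matrix with zero diagonal (illustrative only).
A = rng.random((n, n))
W = (A + A.T) / 2.0
np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)

# Normalized Laplacian: Delta = D^{-1/2} (D - W) D^{-1/2}.
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
Delta = D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt

f = rng.standard_normal(n)
g = f / np.sqrt(d)                       # g_i = f_i / sqrt(D_ii)

# Half the sum over ordered pairs of W_ij * (g_i - g_j)^2.
pair_sum = 0.5 * np.sum(W * (g[:, None] - g[None, :]) ** 2)

# Inner product <f, Delta f>.
quad_form = f @ Delta @ f

# Delta also equals I - S with S = D^{-1/2} W D^{-1/2}.
S = D_inv_sqrt @ W @ D_inv_sqrt
laplacian_gap = np.max(np.abs(Delta - (np.eye(n) - S)))
```

Because the left-hand side is a sum of non-negative terms, the same check also illustrates that $\Delta$ is positive semi-definite.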

(18)

Reversible Markov Chains

Lazy random walk defined by the transition probability matrix $P^* = (1 - \alpha) I + \alpha D^{-1} W$, $\alpha \in (0, 1)$.

Hitting time $H_{ij} = E\{$number of steps required for a random walk to reach a position $x_j$ with an initial position $x_i\}$.

Commute time $C_{ij} = H_{ij} + H_{ji}$.
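To make these definitions concrete, the sketch below computes hitting and commute times of the lazy walk on a three-node path graph by solving the standard first-step equations $H_{ij} = 1 + \sum_{k \neq j} P^*_{ik} H_{kj}$ for $i \neq j$. The helper name `hitting_times`, the path graph and α = 0.5 are hypothetical choices for illustration, and the sketch sets $H_{jj} = 0$ by convention.

```python
import numpy as np

def hitting_times(P):
    """H[i, j] = expected steps for the walk with transition matrix P
    to first reach j from i, with H[j, j] = 0 by convention."""
    n = P.shape[0]
    H = np.zeros((n, n))
    for j in range(n):
        keep = [k for k in range(n) if k != j]
        # First-step analysis restricted to the states other than j:
        # (I - P_restricted) h = 1.
        A = np.eye(n - 1) - P[np.ix_(keep, keep)]
        H[keep, j] = np.linalg.solve(A, np.ones(n - 1))
    return H

# Path graph 0 - 1 - 2 (illustrative only).
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
alpha = 0.5
P = W / W.sum(axis=1, keepdims=True)          # D^{-1} W
P_lazy = (1 - alpha) * np.eye(3) + alpha * P  # P* = (1 - alpha) I + alpha D^{-1} W

H = hitting_times(P_lazy)
C = H + H.T                                   # commute times C_ij = H_ij + H_ji
```

On this graph the hitting times are asymmetric ($H_{01} = 2$ but $H_{10} = 6$ for α = 0.5), while the commute time is symmetric by construction.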

Theorem. Let $L = (D - \alpha W)^{-1}$. Then

(19)

Ranking Problem

Problem setting. Given a set of points $X = \{x_1, \dots, x_q, x_{q+1}, \dots, x_n\} \subset \mathbb{R}^m$, the first $q$ points are the queries. The task is to rank the remaining points according to their relevance to the queries.

Examples. Image, document, movie, book,

(20)

Intuition of Ranking: Manifold

[Figure: (a) two moons ranking problem, with the query marked; (b) ideal ranking.]

• The relevant degrees of points in the upper moon to

(21)

Toy Ranking

(a) Connected graph

• Simply ranking the data according to the shortest paths on the graph does not work well.

• A robust solution is to assemble all paths between two points: $f^* = \sum$

(22)

(23)

Connection to Google

Theorem. For the task of ranking data represented by a connected and undirected graph without queries,

(24)

Personalized Google: a variant

The ranking scores given by PageRank:

$$\pi(t + 1) = \alpha P^T \pi(t). \qquad (4)$$

Add a query term on the right-hand side for the query-based ranking,

$$\pi(t + 1) = \alpha P^T \pi(t) + (1 - \alpha) y. \qquad (5)$$
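The query-based update in equation (5) can be sketched as a plain power-style iteration. The code below is an illustration under assumptions, not the author's code: the function name `query_pagerank`, the four-node path graph, α = 0.85 and the iteration count are hypothetical, $P = D^{-1} W$ is taken as the row-stochastic transition matrix, and $y$ is taken to be an indicator of the query node normalized to sum to 1.

```python
import numpy as np

def query_pagerank(W, y, alpha=0.85, n_iter=200):
    """Iterate pi(t+1) = alpha * P^T pi(t) + (1 - alpha) * y with P = D^{-1} W."""
    P = W / W.sum(axis=1, keepdims=True)      # row-stochastic transition matrix
    pi = np.full(len(y), 1.0 / len(y))        # uniform starting scores
    for _ in range(n_iter):
        pi = alpha * P.T @ pi + (1 - alpha) * y
    return pi

# Path graph 0 - 1 - 2 - 3 with node 0 as the query (illustrative only).
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
y = np.array([1.0, 0.0, 0.0, 0.0])
alpha = 0.85

pi = query_pagerank(W, y, alpha=alpha)

# Fixed point of the update: (1 - alpha) (I - alpha P^T)^{-1} y.
P = W / W.sum(axis=1, keepdims=True)
pi_fixed = (1 - alpha) * np.linalg.solve(np.eye(4) - alpha * P.T, y)
fixed_gap = np.max(np.abs(pi - pi_fixed))
```

Since the columns of $P^T$ sum to 1, the scores keep summing to 1 at every step whenever both the starting vector and $y$ do, so `pi` can be read as a distribution over nodes.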

(25)

Image Ranking

The top-left digit in each panel is the query. The left panel shows the top 99 by our method, and the right panel shows the top 99 by the Euclidean distance.

(26)

Document Ranking

[Figure: inner product versus manifold ranking, four panels: (a) autos; (b) motorcycles; (c) baseball; (d) hockey.]