• No results found

Learning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking

N/A
N/A
Protected

Academic year: 2021

Share "Learning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking"

Copied!
31
0
0

Loading.... (view fulltext now)

Full text

(1)

Learning from Labeled and

Unlabeled Data: Semi-supervised

Learning and Ranking

Dengyong Zhou

[email protected]

(2)

Learning from Examples

Input space X , and output space

Y = {1, −1}.

Training set

S = {z

1

= (x

1

, y

1

), . . . , z

l

= (x

l

, y

l

)} in

Z = X × Y drawn i.i.d. from some unknown

distribution.

(3)

Transductive Setting

Input space X = {x

1

, . . . , x

n

}, and output

space Y = {1, −1}.

Training set

S = {z

1

= (x

1

, y

1

), . . . , z

l

= (x

l

, y

l

)}.

(4)

Intuition about classification: Manifold

Local consistency. Nearby points are likely

to have the same label.

Global consistency. Points on the same

structure (typically referred to as a cluster or

manifold) are likely to have the same label.

(5)

A Toy Dataset (Two Moons)

−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 1

1.5

(a) Toy Data (Two Moons)

unlabeled point labeled point −1 labeled point +1 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 1 1.5

(b) SVM (RBF Kernel)

−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 1 1.5

(c) k−NN

−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 1 1.5

(d) Ideal Classification

(6)

Algorithm

1. Form the affinity matrix W defined by W

ij

=

exp(−kx

i

− x

j

k

2

/2σ

2

)

if i 6= j and W

ii

= 0.

2. Construct the matrix S = D

−1/2

W D

−1/2

in which D is

a diagonal matrix with its (i, i)-element equal to the

sum of the i-th row of W.

3. Iterate f(t + 1) = αSf(t) + (1 − α)y until convergence,

where α is a parameter in (0, 1).

(7)

Convergence

Theorem. The sequence {f(t)} converges to

f

= β(I − αS)

−1

y,

where β = 1 − α.

Proof. Suppose F (0) = Y. By the iteration equation, we

have

f (t) = (αS)

t−1

Y + (1 − α)

t−1

X

i=0

(αS)

i

Y.

(1)

Since 0 < α < 1 and the eigenvalues of S in [−1, 1],

lim

t→∞

(αS)

t−1

= 0,

and lim

t→∞

t−1

X

i=0

(αS)

i

= (I − αS)

−1

.

(2)

(8)

Regularization Framework

Cost function

Q(f) =

1

2



n

X

i,j=1

W

ij



1

D

ii

f

i

1

pD

jj

f

j



2

+ µ

n

X

i=1

f

i

− y

i



2



Smoothness term. Measure the changes between

nearby points.

Fitting term. Measure the changes from the initial

(9)

Regularization Framework

Theorem. f

= arg min

f ∈F

Q(f).

Proof. Differentiating Q(f) with respect to f, we have

∂Q

∂f

f =f

= f

− Sf

+ µ(f

− y) = 0,

(1)

which can be transformed into

f

1

1 + µ

Sf

µ

1 + µ

y = 0.

(2)

Let α = 1/(1 + µ) and β = µ/(1 + µ). Then

(10)

Two Variants

Substitute P = D

−1

W

for S in the iteration

equation. Then f

= (I − αP )

−1

y.

Replace S with P

T

,

the transpose of P. Then

f

= (I − αP

T

)

−1

y,

which is equivalent to

(11)

Toy Problem

−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 1 1.5

(a) t = 10

−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 1 1.5

(b) t = 50

−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 1 1.5

(c) t = 100

−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 1 1.5

(d) t = 400

(12)
(13)

Handwritten Digit Recognition (USPS)

10 20 30 40 50 60 70 80 90 100 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 # labeled points test error k−NN (k = 1) SVM (RBF kernel) consistency variant (1) variant (2)

Dimension: 16x16. Size: 9298. (α = 0.95)

(14)

Handwritten Digit Recognition (USPS)

0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 test error consistency variant (1) variant (2)

(15)

Text Classification (20-newsgroups)

10 20 30 40 50 60 70 80 90 100 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8

# labeled points

test error

k−NN (k = 1)

SVM (RBF kernel)

consistency

variant (1)

variant (2)

Dimension: 8014. Size: 3970. (α = 0.95)

(16)

Text Classification (20-newsgroups)

0.25 0.3 0.35 0.4 0.45 0.5 test error consistency variant (1) variant (2)

(17)

Spectral Graph Theory

Normalized graph Laplacian ∆ = D

−1/2

(D − W )D

−1/2

.

Linear operator on the space of functions defined on the

Graph.

Theorem. P

i,j

W

ij



1

D

ii

f

i

1

D

jj

f

j



2

= hf, ∆fi.

Discrete analogy of Laplace-Beltrami operator on

Riemannian Manifold which satisfies

Z

M

k∇fk

2

=

Z

M

∆(f )f.

Discrete Laplace equation ∆f = y.

(18)

Reversible Markov Chains

Lazy random walk defined by the transition probability

matrix P

= (1 − α)I + αD

−1

W, α ∈ (0, 1).

Hitting time H

ij

= E{ number of steps required for a

random walk to reach a position x

j

with an initial position

x

i

}.

Commute time C

ij

= H

ij

+ H

ji

.

Theorem. Let L = (D − αW )

−1

.

Then

(19)

Ranking Problem

Problem setting. Given a set of point

X = {x

1

, ..., x

q

, x

q+1

, ..., x

n

} ⊂ R

m

,

the first q

points are the queries. The task is to rank the

remaining points according to their relevances to

the queries.

Examples. Image, document, movie, book,

(20)

Intuition of Ranking: Manifold

(a) Two moons ranking problem

query

(b) Ideal ranking

The relevant degrees of points in the upper moon to

(21)

Toy Ranking

(a) Connected graph

Simply ranking the data according to the shortest

paths on the graph does not work well.

Robust solution is to assemble all paths between two

points: f

=

P

(22)
(23)

Connection to Google

Theorem. For the task of ranking data represented

by a connected and undirected graph without queries,

(24)

Personalized Google: a variant

The ranking scores given by PageRank:

π(t + 1) = αP

T

π(t).

(4)

Add a query term on the right-hand side for the

query-based ranking,

π(t + 1) = αP

T

π(t) + (1 − α)y.

(5)

(25)

Image Ranking

The top-left digit in each panel is the query. The left panel

shows the top 99 by our method; and the right panel shows

the top 99 by the Euclidean distance.

(26)

Document Ranking

0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.3 0.4 0.5 0.6 0.7 0.8 0.9 (a) autos inner product manifold ranking 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.3 0.4 0.5 0.6 0.7 0.8 0.9 (b) motorcycles inner product manifold ranking 0.7 0.8 0.9 (c) baseball 0.7 0.8 0.9 (d) hockey

(27)

Related Work

Graph/disffusion/cluster kernel (Kondor et al

2002; Chapelle et al. 2002; Smola et al.

2003).

Spectral clustering (Shi et al. 1997; Ng et al.

2001).

Manifold learning (nonlinear data

reduction)(Tenenbaum et al. 2000; Roweis et

al. 2000)

(28)

Related Work

Random walks (Szummer et al. 2001).

Graph min-cuts (Blum et al. 2001)

Learning on manifolds (Belkin et al. 2001).

(29)

Conclusion

Proposed a general semi-supervised

learning algorithm.

Proposed a general example-based ranking

(30)

Next Work

Model selection.

Active learning.

Generalization theory of learning from

labeled and unlabeled data.

(31)

References

1. Zhou, D., Bousquet, O., Lal, T.N., Weston, J. and

Schölkopf, B.: Learning with Local and Global

Consistency. NIPS, 2003.

2. Zhou, D., Weston, J., Gretton, A., Bousquet, O. and

Schölkopf, B.: Ranking on Data Manifolds. NIPS,

2003.

3. Weston, J., C. Leslie, D. Zhou, A. Elisseeff and W. S.

Noble: Semi-Supervised Protein Classification

References

Related documents

The left panel of Figure 7 shows the log-likelihood value of the whole dataset (training + test sets) divided by the number of observations for the inductive and transductive

Here we highlight the major contributions of this paper: (1) a novel regularization frame- work for distance metric learning and a new semi-supervised metric learning technique,

Although Deep Learning has successfully been applied to many fields, it relies on large amounts of data. In this work we focus on two different research lines within the context

The upper left panel shows the wavelet approximation and the upper right panel the weighted smoothing spline.. Second and third row show the corresponding first and

Wilcoxon Signed-Rank test results for N trials of the semi-supervised method with near than average selection on the KDD data set using 10, 17, 25 and 34 labeled examples with

The panel on the left side shows the measurement of the stage-in (download) time in the different scenarios, while the panel on the right side shows the WallClock time per

15 shows (left panel) an experimental evidence of organized mesoscale water patterns, made up by roughly aligned “dots”. 15 right panel) a simulation of 99 phase points with

4.37 The left panel shows the original two-dimensional data space of two easily distinguishable classes and the right panel shows the reduced the data space transformed using