Supervised Learning of Non-binary Problems

Part II: Ranking Algorithms

Yoram Singer, HUJI

Based on joint work with: Koby Crammer (HUJI),

Rob Schapire (AT&T Labs), William Cohen (Consultant), Amit Singhal (Google),

Raj Iyer (Living Wisdom School), Yoav Freund (Banter)

NATO ASI on Learning Theory and Practice July 8-19 2002 - K.U. Leuven Belgium

(3)

Example - Meta-Searching

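[Figure: a query is sent to three search engines (AltaVista, Infoseek, Lycos); each returns a list, and the lists are to be combined into a single ranking.]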

(4)

Example - Information Retrieval

• Text Retrieval:

show me articles on Dolby C

• Query by Image:

show me pictures of Sunsets

• Audio Browsing:

find me radio programs on Who killed Paul M.

• Topic filtering:

return a ranked list of relevant topics for a given article (e.g., Reuters labels each article with one or more topics)

(8)

Elements of Ranking Problems

• Instances: things to be ranked (e.g., movies)

• Ranking features:

  ◦ Each feature is a partial ranking of the instances

  ◦ “Primitives” that will be combined into the final ranking

  ◦ E.g., ratings of previous viewers

• Feedback function:

  ◦ Encodes feedback on the target (desired) ranking

  ◦ E.g., a new viewer’s ratings

• Goal: produce a ranking of all instances, including those not observed in training. The ranking should agree, as much as possible, with the feedback function.

(12)

Elements of Ranking Problems - Text Retrieval

• Instances: Documents

• Ranking features:

Each word ranks documents based on the number of its occurrences

• Feedback function:

Relevant documents ranked above irrelevant

• Goal:

Order the documents such that the relevant documents are as close as possible to the top of the ordered list

(13)

Formal Description of Ranking Problems

Domain or instance space: $X$

• Ranking features: $f_1, \ldots, f_n$

  ◦ $f_i : X \to S$ where $S = \{1, \ldots, k\} \cup \{\phi\}$

  ◦ $f_i : X \to \bar{\mathbb{R}}$ where $\bar{\mathbb{R}} = \mathbb{R} \cup \{\phi\}$

  ◦ $f_i(x) = \phi$ means that $x$ is unranked by $f_i$

  ◦ If $f_i(x_j) > f_i(x_k)$ then $x_j$ is preferred over $x_k$

• Feedback function:

  ◦ $\Phi : X \times X \to \{-1, 0, +1\}$ or $\Phi : X \times X \to \mathbb{R}$

  ◦ $\Phi(x, x) = 0$

(14)

Formal Description of Ranking Problems (cont.)

• In many applications $\exists\, \Psi : X \to S$ (or $\Psi : X \to \mathbb{R}$)

$$\Phi(x_1, x_2) = \begin{cases} +1 & \Psi(x_1) > \Psi(x_2) \\ -1 & \Psi(x_1) < \Psi(x_2) \\ 0 & \text{otherwise} \end{cases}$$

• Goal:

Find a function $H : X \to S$ or $H : X \to \mathbb{R}$ which agrees as much as possible with the feedback on a set of examples $T = \{x_1, \ldots, x_m\}$
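As a small illustration of the construction above, the following Python sketch derives the pairwise feedback Φ from a scoring function Ψ. The function name `feedback_from_scores`, the dictionary representation of Ψ, and the sample ratings are all hypothetical choices made for this example; they do not come from the slides.

```python
def feedback_from_scores(psi, x1, x2):
    """Pairwise feedback Phi(x1, x2) induced by a score function psi (a dict here)."""
    if psi[x1] > psi[x2]:
        return +1
    if psi[x1] < psi[x2]:
        return -1
    return 0


# Hypothetical ratings on a 1..5 scale (illustrative only).
psi = {"movie_a": 5, "movie_b": 3, "movie_c": 3}
phi_ab = feedback_from_scores(psi, "movie_a", "movie_b")  # movie_a preferred
phi_ba = feedback_from_scores(psi, "movie_b", "movie_a")  # reversed pair
phi_bc = feedback_from_scores(psi, "movie_b", "movie_c")  # tie
```

Ties and identical arguments both map to 0, which matches the requirement Φ(x, x) = 0.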

(16)

Ranking Loss

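• If $\Phi : X \times X \to S$ then $\mathrm{rloss}_T(H)$ is

$$\big|\{(x_i, x_j) \in T \times T \mid H(x_i) \le H(x_j) \text{ and } \Phi(x_i, x_j) > 0\}\big|$$

• If $\Phi : X \times X \to \mathbb{R}$ then let $D(x_i, x_j) = c \cdot \max\{0, \Phi(x_i, x_j)\}$, where $c$ is a normalization constant

$$\mathrm{rloss}_D(H) = \sum_{(x_i, x_j) \in T \times T} D(x_i, x_j)\,[\![H(x_i) \le H(x_j)]\!]$$

• If $\Psi : X \to S$ and $H : X \to S$ where $S = \{1, \ldots, k\}$

$$\mathrm{rloss}_T(H) = \sum_{x \in T} \ldots$$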

(17)

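Ranking Loss: illustration for $\Phi : X \times X \to \mathbb{R}$

[Figure: preference graph over the instances a, b, c, d with feedback values 2, 4, 3, −1 and corresponding pair weights 0.2, 0.4, 0.3, 0.1. For the ranking H(a)=9, H(b)=8, H(d)=7, H(c)=1 the mis-ordered pairs give rloss = 0.1 + 0.3 = 0.4.]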
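The weighted ranking loss can be sketched in a few lines of Python. The function below sums the weight D(xi, xj) over every pair that H fails to order strictly. The text does not say which edge of the illustration carries which weight, so the pair assignment in `D` is an assumption, chosen so that the two mis-ordered pairs are the ones with weights 0.3 and 0.1.

```python
def ranking_loss(D, H):
    """Weighted ranking loss: sum of D(xi, xj) over pairs with H(xi) <= H(xj).

    D maps a pair (xi, xj) to a non-negative weight; a positive weight means
    xi should be ranked above xj. H maps each instance to its score.
    """
    return sum(w for (xi, xj), w in D.items() if H[xi] <= H[xj])


# Hypothetical assignment of the four pair weights to edges (an assumption:
# the graph itself is not recoverable from the text).
D = {("a", "b"): 0.2, ("b", "d"): 0.4, ("c", "d"): 0.3, ("c", "a"): 0.1}
H = {"a": 9, "b": 8, "d": 7, "c": 1}
loss = ranking_loss(D, H)  # pairs (c, d) and (c, a) are mis-ordered
```

A ranking that agrees with every pair, such as c above a above b above d, has zero loss under the same weights.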

(18)

Ranking Loss: illustration for $\Psi : X \to \{1, \ldots, 6\}$

[Figure: two arrangements of the ranks 1 to 6 (each shown in the order 2, 1, 3, 4, 5, 6); loss = 3.]

(19)

Learning Task

• Training data – examples (e.g., ratings of movies by users):

  ◦ Set of instances $T = \{x_1, \ldots, x_m\}$

  ◦ Feedback (typically partial) on $T$ via $\Phi$ or $\Psi$

• Find a subset of good ranking features

• Assign each feature an importance weight

• Combine the set of features and weights into an ordering $H$

(21)

Combining Ranking Features

Graph Representation

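$f_1(a)=2$, $f_1(b)=1$, $f_1(c)=0$, $f_1(d)=\phi$

$f_2(a)=0$, $f_2(b)=2$, $f_2(c)=1$, $f_2(d)=2$

[Figure: preference graphs over the instances a, b, c, d induced by $f_1$ and $f_2$, and a combined graph whose edges carry the weights 1/4, 3/4, and 1.]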

(22)

Combining Ranking Features

• An old problem with many faces and different views

• The problem of devising a “good” total ordering from a set of partial orderings or preferences is often intractable

• In our setting the problem is intractable (NP-complete) if $|S| \ge 3$ and we count weighted or unweighted disagreements

• The intractability stems from conflicting and circular preferences over pairs of instances

• Combining ranking features is simple if $|S| = 2$ (no $\phi$)

• In assessing preferences people are indeed inconsistent

(27)

Algorithmic Approaches

• Partial feedback over pairs with general ranking features:

  ◦ Derive simple (binary) features (ignore numerical rating)

  ◦ Devise a generalization of boosting to find a good subset of features and their weights

  ⇒ RankBoost

• Feedback is given via $\Psi$ in an online fashion:

  ◦ Perform rank-prediction by projecting and binning

  ◦ Devise efficient online algorithm with mistake-bound analysis

(28)

Brief Review of Boosting

• assume access to weak-learner (WL)

• maintain distribution $D$ over examples $\{(x_i, y_i)\}_{i=1}^{m}$ where $y_i \in \{-1, +1\}$

• for any distribution over examples WL returns $h : X \to \mathbb{R}$

• error of $h$ is better than random: $\left|\sum_i y_i h(x_i)\right| > 1/\mathrm{Poly}(m)$

• bound classification error ($\mathrm{sign}(h(x)) \ne y$) by $\exp(-y h(x))$

• modify distribution according to performance

Dt+1(i) ← 1

(29)

From Classification to Ranking

• Example $(x, y)$

  ⇒ Pair $(x_0, x_1)$ such that $\Phi(x_0, x_1) < 0$

• Classification error $\mathrm{sign}(f(x)) \ne y$

  ⇒ Ranking error $f(x_0) > f(x_1)$

• 0-1 bound: $\exp(-y f(x))$

(30)

RankBoost - Skeleton

• Maintain a distribution over pairs $D_t(x_0, x_1)$

• Given a set of features: $f_i : X \to S$

• Derive simpler feature: $h_t : X \to \{0, 1\}$

• Find importance weights: $\alpha_t$ w.r.t. $D_t$

• Combine features:

$$H(x) = \sum_t \alpha_t h_t(x)$$

(31)

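Receive: initial distribution $D_1$ over $X \times X$ (obtained from $\Phi$)

For $t = 1, \ldots, T$:

  Call weak learner with $D_t$ ⇒ weak ranker $h_t : X \to \{0, 1\}$

  Let $S_b = \{(x_0, x_1) \mid h_t(x_0) - h_t(x_1) = b\}$ for $b \in \{-1, 0, +1\}$

  Calculate $W_b = \sum_{(x_0, x_1) \in S_b} D_t(x_0, x_1)$

  Set $\alpha_t = \frac{1}{2} \log(W_{-1} / W_{+1})$

  Update $D_{t+1}(x_0, x_1) = \frac{1}{Z_t} D_t(x_0, x_1) \exp\big(\alpha_t (h_t(x_0) - h_t(x_1))\big)$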
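The loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a pair (x0, x1) in `pairs` means x1 should be ranked above x0; the weak learner is replaced by a search over a fixed list of candidate {0, 1}-valued rankers, picking the one with the largest |r|; and the small constant `eps` that keeps the logarithm finite when W_{+1} or W_{-1} is zero is an added smoothing term. All function and variable names are hypothetical.

```python
import math


def rankboost(pairs, weak_rankers, rounds, eps=1e-9):
    """Sketch of the RankBoost loop for weak rankers h: X -> {0, 1}.

    pairs: dict {(x0, x1): weight}, meaning x1 should be ranked above x0.
    weak_rankers: list of candidate functions h (stands in for the weak learner).
    Returns the combined scoring function H(x) = sum_t alpha_t * h_t(x).
    """
    total = sum(pairs.values())
    D = {p: w / total for p, w in pairs.items()}  # initial distribution D_1
    ensemble = []
    for _ in range(rounds):
        # Stand-in weak learner: candidate with the largest |r| under D_t.
        h = max(
            weak_rankers,
            key=lambda g: abs(sum(d * (g(x1) - g(x0)) for (x0, x1), d in D.items())),
        )
        # W_b = total weight of the pairs with h(x0) - h(x1) = b.
        W = {-1: 0.0, 0: 0.0, 1: 0.0}
        for (x0, x1), d in D.items():
            W[h(x0) - h(x1)] += d
        alpha = 0.5 * math.log((W[-1] + eps) / (W[1] + eps))  # eps: assumed smoothing
        ensemble.append((alpha, h))
        # Reweight: pairs that h orders correctly lose weight, then renormalize by Z_t.
        D = {(x0, x1): d * math.exp(alpha * (h(x0) - h(x1))) for (x0, x1), d in D.items()}
        Z = sum(D.values())
        D = {p: d / Z for p, d in D.items()}
    return lambda x: sum(a * h(x) for a, h in ensemble)


# Toy data (hypothetical): one ranking feature and the target order a > b > c.
f1 = {"a": 2, "b": 1, "c": 0}
weak_rankers = [lambda x, th=th: 1 if f1[x] > th else 0 for th in (0.5, 1.5)]
pairs = {("c", "b"): 1.0, ("b", "a"): 1.0, ("c", "a"): 1.0}
H = rankboost(pairs, weak_rankers, rounds=3)
```

On this toy input the first round separates c from the rest, and the reweighting then shifts the distribution onto the pair (b, a), so a later round picks the other threshold and the combined H orders all three instances.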

(32)

Deriving Simplified Features

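$$h(x) = \begin{cases} 1 & \text{if } f_i(x) > \theta \\ 0 & \text{if } f_i(x) \le \theta \\ q_{\text{def}} & \text{if } f_i(x) = \phi \end{cases}$$

• Ignores the numerical ratings (uses only relative ranking information)

• Search for good $f_i$, $q_{\text{def}}$, $\theta$ can be done efficiently for any feedback

[Figure: instances a, b, c, d with f(a)=7, f(b)=5, f(c)=3, f(d)=1, and an unranked instance e with f(e)=φ; a threshold splits the ranked instances into two groups.]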

(33)

Deriving Simplified Features - Details

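• Score of a feature: $r = \sum_{x_0, x_1} D(x_0, x_1)\,(h(x_1) - h(x_0))$

• Define $\pi(x) = \sum_{x'} \big(D(x', x) - D(x, x')\big)$

$$\begin{aligned}
r &= \sum_{x_0, x_1} D(x_0, x_1)\,(h(x_1) - h(x_0)) \\
  &= \sum_{x_0, x_1} D(x_0, x_1)\,h(x_1) - \sum_{x_0, x_1} D(x_0, x_1)\,h(x_0) \\
  &= \sum_{x} h(x) \sum_{x'} D(x', x) - \sum_{x} h(x) \sum_{x'} D(x, x') \\
  &= \sum_{x} h(x) \sum_{x'} \big(D(x', x) - D(x, x')\big) = \sum_{x} h(x)\,\pi(x)
\end{aligned}$$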

(36)

Deriving Simplified Features - Finishing Up

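• Rewrite score

$$\begin{aligned}
r &= \sum_{x : f_i(x) > \theta} h(x)\,\pi(x) + \sum_{x : f_i(x) \le \theta} h(x)\,\pi(x) + \sum_{x : f_i(x) = \phi} h(x)\,\pi(x) \\
  &= \sum_{x : f_i(x) > \theta} \pi(x) + q_{\text{def}} \sum_{x : f_i(x) = \phi} \pi(x).
\end{aligned}$$

• Note that $\sum_x \pi(x) = \sum_{x, x'} \big(D(x', x) - D(x, x')\big) = 0$

• Further simplify

$$r = \sum_{x : f_i(x) > \theta} \pi(x) - q_{\text{def}} \sum_{x : f_i(x) \ne \phi} \pi(x)$$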
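The simplified score makes the search over θ and q_def cheap, since π(x) is computed once per round. The Python sketch below illustrates this with a plain scan over candidate thresholds (a sorted sweep with running sums would be faster). The names `potentials` and `best_threshold`, the restriction of candidate thresholds to observed feature values, and the convention that unranked instances (φ) are simply absent from the feature dictionary are assumptions of this example.

```python
def potentials(D):
    """pi(x) = sum over x' of (D(x', x) - D(x, x')), for D given as {(x0, x1): weight}."""
    pi = {}
    for (x0, x1), d in D.items():
        pi[x1] = pi.get(x1, 0.0) + d
        pi[x0] = pi.get(x0, 0.0) - d
    return pi


def best_threshold(D, f):
    """Return (r, theta, q_def) maximizing |r| for one ranking feature f.

    f maps ranked instances to numbers; instances missing from f are unranked.
    Uses r = sum_{f(x) > theta} pi(x) - q_def * sum_{f(x) != phi} pi(x).
    """
    pi = potentials(D)
    ranked_sum = sum(p for x, p in pi.items() if x in f)
    best = (0.0, None, 0)
    for theta in sorted({f[x] for x in pi if x in f}):
        above = sum(p for x, p in pi.items() if x in f and f[x] > theta)
        for q_def in (0, 1):
            r = above - q_def * ranked_sum
            if abs(r) > abs(best[0]):
                best = (r, theta, q_def)
    return best


# Toy data (hypothetical): d is unranked by f; each pair (x0, x1) prefers x1.
D = {("c", "b"): 0.25, ("b", "a"): 0.25, ("c", "a"): 0.25, ("c", "d"): 0.25}
f = {"a": 2, "b": 1, "c": 0}
r, theta, q_def = best_threshold(D, f)

# Cross-check against the original definition r = sum D(x0, x1) (h(x1) - h(x0)).
h = lambda x: q_def if x not in f else (1 if f[x] > theta else 0)
r_direct = sum(d * (h(x1) - h(x0)) for (x0, x1), d in D.items())
```

Here the unranked instance d is preferred over c, so assigning it the default value q_def = 1 raises the score from 0.5 to 0.75.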

(40)

Bounds on Empirical Error and Generalization

• (Empirical) bound on number of mis-ordered pairs

$$\mathrm{rloss}_{D_1}(H) \le \prod_{t=1}^{T} Z_t$$

• Bound on generalization error using VC analysis is tricky since pairs are not independent

• Generalization error bound using two button model

$$\left|\mathrm{rloss}_{D_1}(H) - \mathrm{rloss}_{D}(H)\right| \le O\!\left(\sqrt{d' \log(m/d')/m}\right)$$

(42)

RankBoost - Meta-search Experiments

ML Domain           Top 1   Top 2   Top 5   Top 10   Top 20   Top 30   Avg Rank
RankBoost             102     144     173      184      194      202       4.38
Best (Top 1)          117     137     154      167      177      181       6.80
Best (Top 10)         112     147     172      179      185      187       5.33
Best (Top 30)          95     129     159      178      187      191       5.68

University Domain   Top 1   Top 2   Top 5   Top 10   Top 20   Top 30   Avg Rank
RankBoost              95     141     197      215      247      263       7.74

(43)

Online Framework for Ranking

• Algorithm works in rounds

• On round $t$ the ranking algorithm:

  ◦ Gets an input instance $z_t = (f_1(x_t), \ldots, f_n(x_t))$, with $f_i(x_t) \in \mathbb{R}$

  ◦ Predicts a rank $\hat{y}_t \in \{1, \ldots, k\}$

  ◦ Receives the correct rank $y_t \in \{1, \ldots, k\}$

  ◦ Computes loss $|y_t - \hat{y}_t|$

  ◦ Updates the rank-prediction rule

• Goal – minimize the cumulative loss: $\sum_t |y_t - \hat{y}_t|$

(50)

Ranking via Projections

[Figure: instances are projected onto a direction w; the thresholds b1, b2, b3 split the projection line into the ranks 1, 2, 3, 4.]

(53)

PRank - Update

[Figure sequence: the projection line along w, with thresholds b1, b2, b3, b4 separating rank levels 1 to 5. The frames mark the projection w · x of an instance, its correct rank, the error set E, and the thresholds being moved.]

• Direction: $w$; thresholds: $b_1, b_2, \ldots, b_{k-1}$

• Rank a new instance: $w \cdot x$

(58)

PRank - Summary of Update

• Compute predicted rank $\hat{y} = 1 + |\{r : w \cdot x < b_r\}|$

• Get the correct rank $y$

• If $y \ne \hat{y}$ do:

  ◦ Compute error set $E = \{r : y \le r < \hat{y}\}$

  ◦ Update direction: $w \leftarrow w + |E|\,x$

  ◦ Move thresholds: for all $r \in E$, $b_r \leftarrow b_r - 1$
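The update summarized above can be sketched in Python with NumPy. The sketch follows the slide's convention: the predicted rank is one plus the number of thresholds strictly above w · x, and when the prediction is too high the direction moves by |E| x once while each threshold in E drops by 1. The slide shows only that case; the mirrored branch for a prediction that is too low (subtract |E| x, raise the thresholds) is an assumption added here by symmetry. Function names and the toy input are hypothetical.

```python
import numpy as np


def prank_predict(w, b, x):
    """Predicted rank: 1 + number of thresholds b_r with w.x < b_r."""
    return 1 + int(np.sum(np.dot(w, x) < b))


def prank_update(w, b, x, y):
    """One online round; returns the new (w, b) and the rank predicted before updating."""
    y_hat = prank_predict(w, b, x)
    if y_hat == y:
        return w, b, y_hat
    b = b.copy()
    if y_hat > y:
        # Case shown on the slide: E = {r : y <= r < y_hat} (1-based indices).
        E = np.arange(y, y_hat)
        w = w + len(E) * x
        b[E - 1] -= 1
    else:
        # Mirrored case (assumed by symmetry): E = {r : y_hat <= r < y}.
        E = np.arange(y_hat, y)
        w = w - len(E) * x
        b[E - 1] += 1
    return w, b, y_hat


# Toy run with k = 4 ranks (k - 1 = 3 thresholds), starting from all zeros.
w, b = np.zeros(2), np.zeros(3)
x, y = np.array([1.0, 0.0]), 3
w, b, first_guess = prank_update(w, b, x, y)   # predicts 1, two thresholds move up
w, b, second_guess = prank_update(w, b, x, y)  # overshoots to 4, one threshold moves down
final_guess = prank_predict(w, b, x)
```

After the two updates the thresholds are still in non-increasing order, which is the order implied by this prediction rule, and the instance receives its correct rank.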

(60)

Consistency

[Figure: thresholds b1, b2, b3, b4 along the direction w, with b4 also shown moved out of order.]

• Can the above happen?

No ⇒ The order of the thresholds is preserved throughout the

(62)

Ranking Margin

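[Figure: projection line along w with thresholds b1, b2, b3, b4; for an instance x the distances b3 − w · x and w · x − b2 to the neighboring thresholds are marked.]

• The margin of an example

$$\mathrm{Margin}(x, y) = \min\Big\{ \min_{r \ge y} \{w \cdot x - b_r\},\; \min_{r < y} \{b_r - w \cdot x\} \Big\}$$

• The margin of a dataset

$\mathrm{Margin}\big(\{x_t, y_t\}_{t=1}^{T}\big)$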
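The margin of a single example, as defined above, can be computed directly. The Python sketch below is an illustration only: the function name and the numbers are hypothetical, and thresholds are stored in a 0-based array so that `b[r - 1]` holds b_r.

```python
import numpy as np


def example_margin(w, b, x, y):
    """Margin(x, y) = min( min_{r >= y} (w.x - b_r), min_{r < y} (b_r - w.x) )."""
    s = float(np.dot(w, x))
    k_minus_1 = len(b)
    terms = [s - b[r - 1] for r in range(y, k_minus_1 + 1)]  # thresholds r >= y
    terms += [b[r - 1] - s for r in range(1, y)]             # thresholds r < y
    return min(terms)


# Hypothetical ranker with k = 4 ranks; the projection of x is 0.
w = np.array([1.0])
b = np.array([3.0, 1.0, -1.0])
x = np.array([0.0])
margin_rank3 = example_margin(w, b, x, 3)  # b_1, b_2 above and b_3 below the projection
margin_rank2 = example_margin(w, b, x, 2)  # b_2 is on the wrong side for rank 2
```

With this definition the margin is positive for the rank on whose side every threshold falls correctly and negative as soon as one threshold is on the wrong side.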

(64)

Mistake Bound Theorem

• Assuming:

  ◦ Input sequence $(x_1, y_1), \ldots, (x_T, y_T)$

  ◦ The norm of instances is bounded: $\|x_t\| \le R$

  ◦ The sequence is correctly ranked by $w^*$ and $b^*_1, \ldots, b^*_{k-1}$, with

    $\|w^*\|^2 + (b^*_1)^2 + \ldots + (b^*_{k-1})^2 = 1$

  ◦ The margin achieved by $w^*, b^*_1, \ldots, b^*_{k-1}$ is $\gamma$

• Then:

$$\sum_{t=1}^{T} |y_t - \hat{y}_t| \le \frac{(k-1)R^2 + 1}{\gamma^2}$$

(66)

Proof Sketch for MB Theorem

• Instantaneous Pranker

$$v_t = (w_t, b^t_1, \ldots, b^t_{k-1}) \;\Rightarrow\; v_{t+1} = (w_{t+1}, b^{t+1}_1, \ldots, b^{t+1}_{k-1})$$

• Instantaneous error-set $E_t$ of size $n_t = |E_t|$

• Bound

  ◦ from below the increase in $v_t \cdot v^*$

  ◦ from above the increase in $v_t \cdot v_t$

• Assume ranking mistakes on all rounds

(67)

Proof Sketch for MB Theorem (cont.)

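$$\begin{aligned}
v^* \cdot v_{t+1} &= v^* \cdot v_t + |E_t|\, w^* \cdot x_t - \sum_{r \in E_t} b^*_r \\
 &= v^* \cdot v_t + \sum_{r \in E_t} \big(w^* \cdot x_t - b^*_r\big) \\
 &\ge v^* \cdot v_t + n_t \gamma
\end{aligned}$$

$$v^* \cdot v_{T+1} \ge \gamma \sum_{t=1}^{T} n_t$$

$$\|v_{T+1}\|^2 \underbrace{\|v^*\|^2}_{=1} \ge \big(v_{T+1} \cdot v^*\big)^2 \;\Rightarrow\; \|v_{T+1}\|^2 \ge \gamma^2 \Big(\sum_{t=1}^{T} n_t\Big)^2$$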

(69)

Proof Sketch for MB Theorem (cont.)

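$$\begin{aligned}
\|v_{t+1}\|^2 &\le \|v_t\|^2 + 2 \underbrace{\sum_{r \in E_t} \big(w_t \cdot x_t - b^t_r\big)}_{\le 0} + (n_t)^2 \|x_t\|^2 + n_t \\
 &\le \|v_t\|^2 + (n_t)^2 R^2 + n_t
\end{aligned}$$

$$\|v_{T+1}\|^2 \le R^2 \sum_{t=1}^{T} (n_t)^2 + \sum_{t=1}^{T} n_t$$

$$\gamma^2 \Big(\sum_{t=1}^{T} n_t\Big)^2 \le \|v_{T+1}\|^2 \le R^2 \sum_{t=1}^{T} (n_t)^2 + \sum_{t=1}^{T} n_t$$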

(71)

Proof Sketch for MB Theorem - Finishing up

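$$\gamma^2 \Big(\sum_{t=1}^{T} n_t\Big)^2 \le \|v_{T+1}\|^2 \le R^2 \sum_{t=1}^{T} (n_t)^2 + \sum_{t=1}^{T} n_t$$

$$\sum_t n_t \le \frac{R^2 \big[\sum_t (n_t)^2\big] / \big[\sum_t n_t\big] + 1}{\gamma^2}$$

Use $n_t \le k - 1$ and $n_t = |y_t - \hat{y}_t|$

$$\sum_{t=1}^{T} |\hat{y}_t - y_t| = \sum_{t=1}^{T} n_t \le \frac{(k-1)R^2 + 1}{\gamma^2} \le \frac{(k-1)(R^2 + 1)}{\gamma^2}$$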
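To get a feel for the size of the bound, here is a worked instance with hypothetical values that do not come from the slides: $k = 5$ ranks, instances of norm at most $R = 1$, and margin $\gamma = 0.1$.

$$\sum_{t=1}^{T} |\hat{y}_t - y_t| \;\le\; \frac{(k-1)R^2 + 1}{\gamma^2} = \frac{4 \cdot 1 + 1}{0.01} = 500 \;\le\; \frac{(k-1)(R^2 + 1)}{\gamma^2} = \frac{4 \cdot 2}{0.01} = 800$$

The bound does not depend on the sequence length $T$, so under these assumed values the cumulative rank loss stays below 500 however long the sequence runs.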

(72)

Experiments - EachMovie Database

• Database of 1648 movies

• 74,424 registered viewers

• Viewers rated subsets of movies

• Experimented with two subsets:

  ◦ 7451 viewers who rated > 100 movies

(73)

Results for 100 or more movies

[Plot: rank loss (vertical axis, 1 to 2.4) versus round (horizontal axis, 0 to 100) for PRank, WH, and MC-Perceptron.]

(74)

Results for 200 or more movies

[Plot: rank loss (vertical axis, 1 to 2) versus round (horizontal axis, 0 to 200) for PRank, WH, and MC-Perceptron.]

(75)

Summary

• Formal framework for learning of ranking problems

• Analysis of objective functions for ranking

• Two algorithmic paradigms for learning good rankers

• Empirical loss analysis and generalization bounds for RankBoost

• Mistake bound and consistency for PRank

• Research directions:

  ◦ Batch versions of PRank and Kernels for Ranking

  ◦ More applications (topic ranking, sound localization)

(82)

Based on

Y. Freund, R. Iyer, R. E. Schapire, Y. Singer. An efficient boosting algorithm for combining preferences. Machine Learning: Proc. of the Fifteenth Intl. Conf., 1998.

W. W. Cohen, R. E. Schapire, Y. Singer. Learning to order things. Journal of Artificial Intelligence Research, 10:243–270, 1999.

R. Iyer, D. Lewis, R. E. Schapire, Y. Singer, A. Singhal. Boosting for Document Routing. CIKM, 2000.

K. Crammer and Y. Singer. PRanking with Ranking. Advances in Neural Information Processing Systems 15, 2001.

K. Crammer and Y. Singer. A New Family of Online Algorithms for Category Ranking. SIGIR, 2002.
