Part II: Ranking Algorithms

(1)

Supervised Learning of Non-binary Problems

Yoram Singer, HUJI

Based on joint work with: Koby Crammer (HUJI),

Rob Schapire (AT&T Labs), William Cohen (Consultant), Amit Singhal (Google),

Raj Iyer (Living Wisdom School), Yoav Freund (Banter)

NATO ASI on Learning Theory and Practice July 8-19 2002 - K.U. Leuven Belgium

(2)

(3)

Example - Meta-Searching

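[Figure: a query is sent to the search engines AltaVista, Infoseek and Lycos; each returns a list, and how to combine the lists is marked with a "?".]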

(4)

Example - Information Retrieval

• Text Retrieval:
  show me articles on Dolby C
• Query by Image:
  show me pictures of Sunsets
• Audio Browsing:
  find me radio programs on Who killed Paul M.
• Topic filtering:
  return a ranked list of relevant topics for a given article. E.g. Reuters labels each article with one (or more) topics.

(8)

Elements of Ranking Problems

• Instances: Things to be ranked (e.g. movies)
• Ranking features:
  ◦ Each feature is a partial ranking of the instances
  ◦ “Primitives” that will be combined into the final ranking
  ◦ E.g. ratings of previous viewers
• Feedback function:
  ◦ Encodes feedback on the target (desired) ranking
  ◦ E.g. new viewer’s ratings
• Goal: produce a ranking of all instances, including those not observed in training. The ranking should agree, as much as possible, with the feedback function.

(12)

Elements of Ranking Problems - Text Retrieval

• Instances: Documents
• Ranking features: Each word ranks documents based on the number of its occurrences
• Feedback function: Relevant documents ranked above irrelevant ones
• Goal: Order the documents such that the relevant documents are as close as possible to the top of the ordered list
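A minimal sketch of how a single word could act as a ranking feature over documents. The function name `word_feature`, the toy documents, and the choice to treat a document with zero occurrences as unranked (`None`) are assumptions made for illustration, not something the slides specify.

```python
from collections import Counter


def word_feature(word, documents):
    """Build a ranking feature for `word`: doc id -> occurrence count.

    Assumption for this sketch: a document that never mentions the word
    is left unranked by the feature (None) instead of getting a score of 0.
    """
    feature = {}
    for doc_id, text in documents.items():
        count = Counter(text.lower().split())[word]
        feature[doc_id] = count if count > 0 else None
    return feature


# Hypothetical toy collection.
docs = {
    "d1": "dolby noise reduction and dolby c tape decks",
    "d2": "sunset photography tips",
    "d3": "tape decks with dolby",
}

f_dolby = word_feature("dolby", docs)

# The partial ranking induced by the feature: ranked documents only,
# most occurrences first.
ranked = sorted(
    (d for d in f_dolby if f_dolby[d] is not None),
    key=lambda d: f_dolby[d],
    reverse=True,
)
print(f_dolby)
print(ranked)
```

Each word yields one such partial ranking; the learning problem is then to weight and combine many of them.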

(13)

Formal Description of Ranking Problems

• Domain or instance space: X

• Ranking features: f₁, . . . , f_n
  ◦ f_i : X → S where S = {1, . . . , k} ∪ {φ}, or
  ◦ $f_i : X \to \bar{\mathbb{R}}$ where $\bar{\mathbb{R}} = \mathbb{R} \cup \{\varphi\}$
  ◦ f_i(x) = φ means that x is unranked by f_i
  ◦ If f_i(x_j) > f_i(x_k) then x_j is preferred over x_k
• Feedback function:
  ◦ Φ : X × X → {−1, 0, +1} or Φ : X × X → ℝ
  ◦ Φ(x, x) = 0

(14)

Formal Description of Ranking Problems (cont.)

• In many applications ∃ Ψ : X → S (or Ψ : X → ℝ) such that

\[
\Phi(x_1, x_2) =
\begin{cases}
+1 & \Psi(x_1) > \Psi(x_2) \\
-1 & \Psi(x_1) < \Psi(x_2) \\
0 & \text{otherwise}
\end{cases}
\]

• Goal:
  Find a function H : X → S or H : X → ℝ which agrees as much as possible with the feedback on a set of examples T = {x₁, . . . , x_m}
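The pairwise feedback induced by a per-instance rating can be sketched as follows. The helper name `phi_from_psi` and the toy ratings are illustrative assumptions.

```python
def phi_from_psi(psi):
    """Turn a rating function Psi (dict: instance -> rank value) into
    pairwise feedback Phi(x1, x2) in {-1, 0, +1}."""

    def phi(x1, x2):
        if psi[x1] > psi[x2]:
            return 1
        if psi[x1] < psi[x2]:
            return -1
        return 0  # equal ratings (including x1 == x2) give no preference

    return phi


# Hypothetical ratings on a 1..3 scale.
psi = {"a": 3, "b": 1, "c": 3}
phi = phi_from_psi(psi)

print(phi("a", "b"), phi("b", "a"), phi("a", "c"))
```

Note that Φ(x, x) = 0 holds automatically, and Φ built this way is antisymmetric.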

(16)

Ranking Loss

• If Φ : X × X → {−1, 0, +1} then rloss_T(H) is

\[
\left| \{ (x_i, x_j) \in T \times T \mid H(x_i) \le H(x_j) \text{ and } \Phi(x_i, x_j) > 0 \} \right|
\]

• If Φ : X × X → ℝ then let D(x_i, x_j) = c · max{0, Φ(x_i, x_j)}, where c is a normalization constant

\[
\mathrm{rloss}_D(H) = \sum_{(x_i, x_j) \in T \times T} D(x_i, x_j) \, [\![ H(x_i) \le H(x_j) ]\!]
\]

• If Ψ : X → S and H : X → S where S = {1, . . . , k}

\[
\mathrm{rloss}_T(H) = \sum_{x \in T} | H(x) - \Psi(x) |
\]
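A small sketch of the weighted ranking loss for real-valued feedback. The function name, the dictionary encoding of Φ, and the toy numbers are assumptions for illustration; the normalization constant c is taken to be one over the total positive feedback, so that D sums to 1.

```python
def weighted_ranking_loss(phi, H):
    """rloss_D(H) for real-valued pairwise feedback.

    phi: dict (x_i, x_j) -> real value; a positive value means x_i
         should be ranked above x_j.
    H:   dict instance -> score (higher score = ranked higher).
    """
    positive = {pair: max(0.0, v) for pair, v in phi.items()}
    total = sum(positive.values())
    if total == 0:
        return 0.0  # no positive feedback, nothing can be mis-ordered
    loss = 0.0
    for (xi, xj), weight in positive.items():
        if H[xi] <= H[xj]:  # pair is mis-ordered (ties count as errors)
            loss += weight / total
    return loss


# Hypothetical feedback and ranking.
phi = {("a", "b"): 2.0, ("b", "c"): 1.0, ("c", "a"): -1.0, ("a", "d"): 1.0}
H = {"a": 2, "b": 3, "c": 1, "d": 0}

loss = weighted_ranking_loss(phi, H)
print(loss)
```

Here only the pair (a, b) is mis-ordered, and it carries half of the total positive feedback.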

(17)

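Ranking Loss: illustration for Φ : X × X → ℝ

[Figure: feedback graph over the instances a, b, c, d with edge values 2, 4, 3, −1, normalized to the weights 0.2, 0.4, 0.3, 0.1. For the ranking H(a)=9, H(b)=8, H(d)=7, H(c)=1 the loss is rloss = 0.1 + 0.3 = 0.4.]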

(18)

Ranking Loss: illustration for Ψ : X → {1, . . . , 6}

[Figure: two columns listing the ranks 2, 1, 3, 4, 5, 6; loss = 3.]

(19)

Learning Task

• Training data: examples (e.g. ratings of movies by users):
  ◦ Set of instances T = {x₁, . . . , x_m}
  ◦ Feedback (typically partial) on T via Φ or Ψ
• Find a subset of good ranking features
• Assign each feature an importance weight
• Combine the set of features and weights into an ordering H

(21)

Combining Ranking Features

Graph Representation

f₁(a)=2, f₁(b)=1, f₁(c)=0, f₁(d)=φ
f₂(a)=0, f₂(b)=2, f₂(c)=1, f₂(d)=2

[Figure: each feature drawn as a preference graph over a, b, c, d, together with a combined graph whose edges carry the weights 1/4, 3/4 and 1.]

(22)

Combining Ranking Features

• An old problem with many faces and different views
• The problem of devising a “good” total ordering from a set of partial orderings or preferences is often intractable
• In our setting the problem is intractable (NP-complete) if |S| ≥ 3 and we count weighted or unweighted disagreements
• The intractability stems from conflicting and circular preferences over pairs of instances
• Combining ranking features is simple if |S| = 2 (no φ)
• In assessing preferences people are indeed inconsistent
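To make the point about circular preferences concrete, here is a brute-force sketch on a hypothetical three-instance example (the classic voting cycle). The three features and the function name `disagreements` are assumptions for illustration; exhaustive search over orderings is only feasible for tiny sets.

```python
from itertools import combinations, permutations

# Three hypothetical ranking features over a, b, c (higher = preferred).
# Pairwise majorities form a cycle: a beats b, b beats c, c beats a.
features = [
    {"a": 3, "b": 2, "c": 1},
    {"b": 3, "c": 2, "a": 1},
    {"c": 3, "a": 2, "b": 1},
]
items = ["a", "b", "c"]


def disagreements(order, features):
    """Count (feature, pair) combinations where the total order places
    x above y but the feature prefers y over x."""
    count = 0
    for f in features:
        # combinations() yields (x, y) with x earlier (higher) in `order`.
        for x, y in combinations(order, 2):
            if f[x] < f[y]:
                count += 1
    return count


costs = {order: disagreements(order, features) for order in permutations(items)}
best_cost = min(costs.values())
print(costs)
print(best_cost)
```

No total ordering agrees with all three features; the best one still disagrees on 4 of the 9 feature-pair preferences.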

(27)

Algorithmic Approaches

• Partial feedback over pairs with general ranking features:
  ◦ Derive simple (binary) features (ignore numerical rating)
  ◦ Devise a generalization of boosting to find a good subset of features and their weights
  ⇒ RankBoost
• Feedback is given via Ψ in an online fashion:
  ◦ Perform rank-prediction by projecting and binning
  ◦ Devise efficient online algorithm with mistake-bound analysis
  ⇒ PRank

(28)

Brief Review of Boosting

• assume access to weak-learner (WL)
• maintain distribution D over examples $\{(x_i, y_i)\}_{i=1}^{m}$ where y_i ∈ {−1, +1}
• for any distribution over examples WL returns h : X → ℝ
• error of h is better than random: |Σ_i y_i h(x_i)| > 1/Poly(m)
• bound classification error (sign(h(x)) ≠ y) by exp(−y h(x))
• modify distribution according to performance (Z_t is the normalization factor)

\[
D_{t+1}(i) \leftarrow \frac{1}{Z_t} D_t(i) \exp\left( -y_i h_t(x_i) \right)
\]

(29)

From Classification to Ranking

• Example (x, y)
  ⇒ Pair (x₀, x₁) such that Φ(x₀, x₁) < 0
• Classification error sign(f(x)) ≠ y
  ⇒ Ranking error f(x₀) > f(x₁)
• 0-1 bound: exp(−y f(x))
  ⇒ exp(f(x₀) − f(x₁))

(30)

RankBoost - Skeleton

• Maintain a distribution over pairs D_t(x₀, x₁)

• Given a set of features: f_i : X → S

• Derive simpler feature: h_t : X → {0, 1}

• Find importance weights: α_t w.r.t D_t

• Combine features:

\[
H(x) = \sum_t \alpha_t h_t(x)
\]

(31)

Receive: initial distribution D₁ over X × X (obtained from Φ)
For t = 1, . . . , T:
  Call weak learner with D_t ⇒ weak ranker h_t : X → {0, 1}
  Let S_b = {(x₀, x₁) | h_t(x₀) − h_t(x₁) = b} for b ∈ {−1, 0, +1}
  Calculate $W_b = \sum_{(x_0, x_1) \in S_b} D_t(x_0, x_1)$
  Set $\alpha_t = \frac{1}{2} \log(W_{-1} / W_{+1})$
  Update $D_{t+1}(x_0, x_1) = \frac{1}{Z_t} D_t(x_0, x_1) \exp\left( \alpha_t (h_t(x_0) - h_t(x_1)) \right)$
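The loop above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the weak learner here simply picks, from a fixed pool of binary rankers, the one with the largest |W₋₁ − W₊₁|, and a small smoothing constant `eps` guards against log(0). Both choices, as well as the toy data, are assumptions.

```python
import math


def rankboost(pairs, weak_rankers, rounds, eps=1e-6):
    """pairs: dict (x0, x1) -> initial weight, where x1 should rank above x0.
    weak_rankers: list of dicts instance -> 0 or 1 (candidate h_t).
    Returns a list of (alpha_t, h_t)."""
    total = sum(pairs.values())
    D = {p: w / total for p, w in pairs.items()}
    ensemble = []
    for _ in range(rounds):
        # Assumed weak learner: choose the candidate maximizing |W_-1 - W_+1|.
        best = None
        for h in weak_rankers:
            W = {-1: 0.0, 0: 0.0, 1: 0.0}
            for (x0, x1), d in D.items():
                W[h[x0] - h[x1]] += d
            gap = abs(W[-1] - W[1])
            if best is None or gap > best[0]:
                best = (gap, h, W)
        _, h, W = best
        # alpha_t = 1/2 log(W_-1 / W_+1), smoothed by eps (an assumption).
        alpha = 0.5 * math.log((W[-1] + eps) / (W[1] + eps))
        ensemble.append((alpha, h))
        # Reweight pairs and renormalize (the division by Z_t).
        D = {(x0, x1): d * math.exp(alpha * (h[x0] - h[x1]))
             for (x0, x1), d in D.items()}
        Z = sum(D.values())
        D = {p: d / Z for p, d in D.items()}
    return ensemble


def score(ensemble, x):
    """H(x) = sum_t alpha_t h_t(x)."""
    return sum(alpha * h[x] for alpha, h in ensemble)


# Hypothetical data: desired order a > b > c > d, uniform weight on all pairs.
pairs = {("b", "a"): 1, ("c", "a"): 1, ("d", "a"): 1,
         ("c", "b"): 1, ("d", "b"): 1, ("d", "c"): 1}
pool = [
    {"a": 1, "b": 0, "c": 0, "d": 0},
    {"a": 1, "b": 1, "c": 0, "d": 0},
    {"a": 1, "b": 1, "c": 1, "d": 0},
]
ensemble = rankboost(pairs, pool, rounds=3)
ranking = sorted("abcd", key=lambda x: score(ensemble, x), reverse=True)
print(ranking)
```

Correctly ordered pairs (h_t(x₀) − h_t(x₁) = −1) lose weight after each round, so later rounds focus on the pairs that are still tied or mis-ordered.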

(32)

Deriving Simplified Features

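\[
h(x) =
\begin{cases}
1 & \text{if } f_i(x) > \theta \\
0 & \text{if } f_i(x) \le \theta \\
q_{\mathrm{def}} & \text{if } f_i(x) = \varphi
\end{cases}
\]

• Ignores the numerical ratings (uses only relative ranking information)
• Search for good f_i, q_def, θ can be done efficiently for any feedback

[Figure: instances a, b, c, d with f(a)=7, f(b)=5, f(c)=3, f(d)=1, and an unranked instance e with f(e)=φ, mapped to the binary values of h.]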

(33)

Deriving Simplified Features - Details

• Score of a feature: $r = \sum_{x_0, x_1} D(x_0, x_1) \, (h(x_1) - h(x_0))$
• Define $\pi(x) = \sum_{x_0} \left( D(x_0, x) - D(x, x_0) \right)$

\[
\begin{aligned}
r &= \sum_{x_0, x_1} D(x_0, x_1) \, (h(x_1) - h(x_0)) \\
  &= \sum_{x_0, x_1} D(x_0, x_1) \, h(x_1) - \sum_{x_0, x_1} D(x_0, x_1) \, h(x_0) \\
  &= \sum_{x} h(x) \sum_{x_0} D(x_0, x) - \sum_{x} h(x) \sum_{x_0} D(x, x_0) \\
  &= \sum_{x} h(x) \sum_{x_0} \left( D(x_0, x) - D(x, x_0) \right) = \sum_{x} h(x) \, \pi(x)
\end{aligned}
\]

(36)

Deriving Simplified Features - Finishing Up

• Rewrite score

\[
\begin{aligned}
r &= \sum_{x : f_i(x) > \theta} h(x) \, \pi(x) + \sum_{x : f_i(x) \le \theta} h(x) \, \pi(x) + \sum_{x : f_i(x) = \varphi} h(x) \, \pi(x) \\
  &= \sum_{x : f_i(x) > \theta} \pi(x) + q_{\mathrm{def}} \sum_{x : f_i(x) = \varphi} \pi(x).
\end{aligned}
\]

• Note that $\sum_x \pi(x) = \sum_{x, x_0} \left( D(x_0, x) - D(x, x_0) \right) = 0$
• Further simplify

\[
r = \sum_{x : f_i(x) > \theta} \pi(x) - q_{\mathrm{def}} \sum_{x : f_i(x) \ne \varphi} \pi(x)
\]
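The simplified score suggests a direct search over θ and q_def for a single feature. The sketch below is a naive quadratic version for readability (a sorted sweep would be faster); the function names, the restriction of candidate thresholds to observed feature values, and the toy distribution are assumptions.

```python
def potentials(D):
    """pi(x) = sum over x0 of (D(x0, x) - D(x, x0))."""
    pi = {}
    for (x0, x1), d in D.items():
        pi[x1] = pi.get(x1, 0.0) + d  # x1 appears as second element
        pi[x0] = pi.get(x0, 0.0) - d  # x0 appears as first element
    return pi


def best_threshold(f, D):
    """Search theta and q_def maximizing |r| for one feature.

    f: dict instance -> number, or None when the feature leaves it unranked.
    D: dict (x0, x1) -> weight, where x1 should rank above x0.
    Uses r = sum_{f(x) > theta} pi(x) - q_def * sum_{f(x) ranked} pi(x).
    """
    pi = potentials(D)
    ranked = [x for x in pi if f.get(x) is not None]
    R = sum(pi[x] for x in ranked)
    best = None
    # Assumption: only observed feature values are tried as thresholds.
    for theta in sorted({f[x] for x in ranked}):
        L = sum(pi[x] for x in ranked if f[x] > theta)
        for q_def in (0, 1):
            r = L - q_def * R
            if best is None or abs(r) > abs(best[0]):
                best = (r, theta, q_def)
    return best


# Hypothetical feature (values as in the threshold illustration, e unranked)
# and a hypothetical distribution over pairs.
f = {"a": 7, "b": 5, "c": 3, "d": 1, "e": None}
D = {("b", "a"): 0.4, ("c", "b"): 0.3, ("d", "c"): 0.2, ("e", "a"): 0.1}

r, theta, q_def = best_threshold(f, D)
print(r, theta, q_def)
```

On this toy input the best weak ranker puts only a above the threshold and maps the unranked instance to 0, which orders the two heaviest pairs involving a correctly.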

(40)

Bounds on Empirical Error and Generalization

• (Empirical) bound on number of mis-ordered pairs

\[
\mathrm{rloss}_{D_1}(H) \le \prod_{t=1}^{T} Z_t
\]

• Bound on generalization error using VC analysis is tricky since pairs are not independent
• Generalization error bound using two button model

\[
\left| \mathrm{rloss}_{D_1}(H) - \mathrm{rloss}_{D}(H) \right| \le O\left( \sqrt{ d' \log(m / d') / m } \right)
\]

(42)

RankBoost - Meta-search Experiments

                   Top 1   Top 2   Top 5   Top 10   Top 20   Top 30   Avg Rank
ML Domain
  RankBoost          102     144     173      184      194      202       4.38
  Best (Top 1)       117     137     154      167      177      181       6.80
  Best (Top 10)      112     147     172      179      185      187       5.33
  Best (Top 30)       95     129     159      178      187      191       5.68
University Domain
  RankBoost           95     141     197      215      247      263       7.74

(43)

Online Framework for Ranking

• Algorithm works in rounds
• On round t the ranking algorithm:
  ◦ Gets an input instance z_t = (f₁(x_t), . . . , f_n(x_t)) (f_i(x_t) ∈ ℝ)
  ◦ Predicts a rank ŷ_t ∈ {1, . . . , k}
  ◦ Receives the correct rank y_t ∈ {1, . . . , k}
  ◦ Computes loss |y_t − ŷ_t|
  ◦ Updates the rank-prediction rule
• Goal: minimize the cumulative loss $\sum_t |y_t - \hat{y}_t|$

(50)

Ranking via Projections

[Figure: instances are projected onto a direction w; the thresholds b₁, b₂, b₃ split the projection line into the rank intervals 1, 2, 3, 4.]

(53)

PRank - Update

• Direction: w. Thresholds: b₁, b₂, . . . , b_k₋₁
• Rank a new instance: w · x

[Figure: the projection w · x on a line along w, with thresholds b₁, . . . , b₄ separating the rank levels 1 to 5. The correct rank is marked, and the thresholds b₂ and b₃ in the error set E are shown being moved.]

(58)

PRank - Summary of Update

• Compute predicted rank ŷ = 1 + |{r : w · x < b_r}|
• Get the correct rank y
• If y ≠ ŷ do:
  ◦ Compute error set E = {r : y ≤ r < ŷ}
  ◦ Update direction: w ← w + |E| x
  ◦ Move thresholds: for all r ∈ E do b_r ← b_r − 1
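A compact sketch of the update summarized above, in plain Python. The slide states the update for the case y < ŷ; the mirrored update for y > ŷ (with E = {r : ŷ ≤ r < y}, w moved in the opposite direction and thresholds raised) is filled in here by symmetry and should be read as an assumption, as should the class name and the zero initialization.

```python
class PRank:
    """Sketch of the PRank update using the slide's convention:
    y_hat = 1 + |{r : w.x < b_r}|, so rank 1 has the largest projection."""

    def __init__(self, n_features, k):
        self.w = [0.0] * n_features
        self.b = [0.0] * (k - 1)  # self.b[r - 1] stores threshold b_r

    def project(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def predict(self, x):
        p = self.project(x)
        return 1 + sum(1 for b_r in self.b if p < b_r)

    def update(self, x, y):
        """Process one (instance, correct rank) pair; return the loss."""
        y_hat = self.predict(x)
        if y_hat == y:
            return 0
        if y < y_hat:
            # Case from the slide: E = {r : y <= r < y_hat}.
            error_set, sign = range(y, y_hat), 1
        else:
            # Mirrored case (assumption): E = {r : y_hat <= r < y}.
            error_set, sign = range(y_hat, y), -1
        n = len(error_set)
        self.w = [wi + sign * n * xi for wi, xi in zip(self.w, x)]
        for r in error_set:
            self.b[r - 1] -= sign
        return abs(y - y_hat)


# Hypothetical one-dimensional walk-through with k = 3 ranks.
p = PRank(n_features=1, k=3)
first_prediction = p.predict([1.0])   # all thresholds are 0
loss_1 = p.update([1.0], 3)           # mirrored case
state_1 = (list(p.w), list(p.b))
loss_2 = p.update([1.0], 1)           # case from the slide
state_2 = (list(p.w), list(p.b))
print(first_prediction, loss_1, state_1, loss_2, state_2)
```

In both cases |E| equals the loss |y − ŷ|, so the size of the step taken by w is proportional to how far off the predicted rank was.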

(60)

Consistency

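[Figure: thresholds b₁, b₂, b₃, b₄ along the direction w, with b₄ shown a second time, displaced from its position in the order.]

• Can the above happen?
  No ⇒ The order of the thresholds is preserved throughout the run of the algorithm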

(62)

Ranking Margin

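[Figure: the projection w · x on the line of thresholds b₁, . . . , b₄, with its distances to the neighbouring thresholds b₂ and b₃ marked.]

• The margin of an example

\[
\mathrm{Margin}(x, y) = \min\left\{ \min_{r \ge y} \{ w \cdot x - b_r \}, \; \min_{r < y} \{ b_r - w \cdot x \} \right\}
\]

• The margin of a dataset

\[
\mathrm{Margin}\left( \{ (x_t, y_t) \}_{t=1}^{T} \right) = \min_{t} \mathrm{Margin}(x_t, y_t)
\]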

(64)

Mistake Bound Theorem

• Assuming:
  ◦ Input sequence (x₁, y₁), . . . , (x_T, y_T)
  ◦ The norm of instances is bounded: ‖x_t‖ ≤ R
  ◦ The sequence is correctly ranked by $w^*$ and $b^*_1, \ldots, b^*_{k-1}$, normalized so that $\|w^*\|^2 + (b^*_1)^2 + \ldots + (b^*_{k-1})^2 = 1$
  ◦ The margin achieved by $w^*, b^*_1, \ldots, b^*_{k-1}$ is γ
• Then:

\[
\sum_{t=1}^{T} | y_t - \hat{y}_t | \le \frac{(k - 1) R^2 + 1}{\gamma^2}
\]

(66)

Proof Sketch for MB Theorem

• Instantaneous PRanker: $v^t = (w^t, b^t_1, \ldots, b^t_{k-1}) \Rightarrow v^{t+1} = (w^{t+1}, b^{t+1}_1, \ldots, b^{t+1}_{k-1})$
• Instantaneous error-set $E^t$ of size $n^t = |E^t|$
• Bound
  ◦ from below the increase in $v^t \cdot v^*$
  ◦ from above the increase in $v^t \cdot v^t$
• Assume ranking mistakes on all rounds

(67)

Proof Sketch for MB Theorem (cont.)

\[
\begin{aligned}
v^* \cdot v^{t+1} &= v^* \cdot v^t + |E^t| \, w^* \cdot x_t - \sum_{r \in E^t} b^*_r \\
  &= v^* \cdot v^t + \sum_{r \in E^t} \left( w^* \cdot x_t - b^*_r \right) \\
  &\ge v^* \cdot v^t + n^t \gamma
\end{aligned}
\]

\[
v^* \cdot v^{T+1} \ge \gamma \sum_{t=1}^{T} n^t
\]

\[
\| v^{T+1} \|^2 \underbrace{\| v^* \|^2}_{=1} \ge \left( v^{T+1} \cdot v^* \right)^2
\;\Rightarrow\;
\| v^{T+1} \|^2 \ge \gamma^2 \left( \sum_{t=1}^{T} n^t \right)^2
\]

(69)

Proof Sketch for MB Theorem (cont.)

\[
\begin{aligned}
\| v^{t+1} \|^2 &\le \| v^t \|^2 + 2 \underbrace{\sum_{r \in E^t} \left( w^t \cdot x_t - b^t_r \right)}_{\le 0} + (n^t)^2 \| x_t \|^2 + n^t \\
  &\le \| v^t \|^2 + (n^t)^2 R^2 + n^t
\end{aligned}
\]

\[
\| v^{T+1} \|^2 \le R^2 \sum_{t=1}^{T} (n^t)^2 + \sum_{t=1}^{T} n^t
\]

(71)

Proof Sketch for MB Theorem - Finishing up

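\[
\gamma^2 \left( \sum_{t=1}^{T} n^t \right)^2 \le \| v^{T+1} \|^2 \le R^2 \sum_{t=1}^{T} (n^t)^2 + \sum_{t=1}^{T} n^t
\]

\[
\sum_{t} n^t \le \frac{ R^2 \left[ \sum_t (n^t)^2 \right] / \left[ \sum_t n^t \right] + 1 }{ \gamma^2 }
\]

Use $n^t \le k - 1$ and $n^t = | y_t - \hat{y}_t |$

\[
\sum_{t=1}^{T} | \hat{y}_t - y_t | = \sum_{t=1}^{T} n^t \le \frac{(k - 1) R^2 + 1}{\gamma^2} \le \frac{(k - 1)(R^2 + 1)}{\gamma^2}
\]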
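As a purely illustrative check of the arithmetic, consider a hypothetical one-dimensional problem with k = 3 ranks and instances x ∈ {3, 0, −3} of ranks 1, 2, 3, so that R = 3. Taking w∗ = 1/√5.5 and (b∗₁, b∗₂) = (1.5, −1.5)/√5.5 satisfies the normalization and ranks all three instances correctly with margin γ = 1.5/√5.5. None of these numbers come from the talk.

```latex
\[
\|w^*\|^2 + (b^*_1)^2 + (b^*_2)^2 = \frac{1 + 2.25 + 2.25}{5.5} = 1,
\qquad
\gamma^2 = \frac{2.25}{5.5} = \frac{9}{22}
\]
\[
\sum_{t=1}^{T} |y_t - \hat{y}_t|
\le \frac{(k-1)R^2 + 1}{\gamma^2}
= \frac{2 \cdot 9 + 1}{9/22}
= \frac{418}{9} \approx 46.4
\le \frac{(k-1)(R^2 + 1)}{\gamma^2}
= \frac{440}{9} \approx 48.9
\]
```

The bound does not depend on the length T of the sequence: however often these three instances are presented, the cumulative rank loss stays below about 46.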

(72)

Experiments - EachMovie Database

• Database of 1648 movies

• 74,424 registered viewers

• Viewers rated subsets of movies

• Experimented with two subsets:

  ◦ 7451 viewers who rated > 100 movies

(73)

Results for 100 or more movies

[Figure: Rank Loss (axis from 1 to 2.4) versus Round (0 to 100) for PRank, WH and MC-Perceptron.]

(74)

Results for 200 or more movies

[Figure: Rank Loss (axis from 1 to 2) versus Round (0 to 200) for PRank, WH and MC-Perceptron.]

(75)

Summary

• Formal framework for learning of ranking problems
• Analysis of objective functions for ranking
• Two algorithmic paradigms for learning good rankers
• Empirical loss analysis and generalization bounds for RankBoost
• Mistake bound and consistency for PRank
• Research directions:
  ◦ Batch versions of PRank and Kernels for Ranking
  ◦ More applications (topic ranking, sound localization)

(82)

Based on

Y. Freund, R. Iyer, R. E. Schapire, Y. Singer. An efficient boosting algorithm for combining preferences. Machine Learning: Proc. of the Fifteenth Intl. Conf., 1998.

W. W. Cohen, R. E. Schapire, Y. Singer. Learning to order things. Journal of Artificial Intelligence Research, 10:243–270, 1999.

R. Iyer, D. Lewis, R. E. Schapire, Y. Singer, A. Singhal. Boosting for Document Routing. CIKM, 2000.

K. Crammer and Y. Singer. PRanking with Ranking. Advances in Neural Information Processing Systems 15, 2001.

K. Crammer and Y. Singer. A New Family of Online Algorithms for Category Ranking. SIGIR, 2002.