Supervised Learning of Non-binary Problems
Part II: Ranking Algorithms
Yoram Singer, HUJI
Based on joint work with: Koby Crammer (HUJI),
Rob Schapire (AT&T Labs), William Cohen (Consultant), Amit Singhal (Google),
Raj Iyer (Living Wisdom School), Yoav Freund (Banter)
NATO ASI on Learning Theory and Practice July 8-19 2002 - K.U. Leuven Belgium
Example - Meta-Searching
[Figure: a query is sent to AltaVista, Infoseek, and Lycos; each returns a list, and the open question (?) is how to combine the lists]
Example - Information Retrieval
• Text Retrieval:
show me articles on Dolby C
• Query by Image:
show me pictures of Sunsets
• Audio Browsing:
find me radio programs on Who killed Paul M.
• Topic filtering:
return a ranked list of relevant topics for a given article
(e.g. Reuters labels each article with one or more topics)
Elements of Ranking Problems
• Instances: things to be ranked (e.g. movies)
• Ranking features:
  ◦ Each feature is a partial ranking of the instances
  ◦ “Primitives” that will be combined into the final ranking
  ◦ E.g. ratings of previous viewers
• Feedback function:
  ◦ Encodes feedback on the target (desired) ranking
  ◦ E.g. a new viewer’s ratings
• Goal: produce a ranking of all instances, including those not observed in training. The ranking should agree, as much as possible, with the feedback function.
Elements of Ranking Problems - Text Retrieval
• Instances: Documents
• Ranking features:
Each word ranks documents based on the number of its occurrences
• Feedback function:
Relevant documents ranked above irrelevant
• Goal:
Order the documents such that the relevant documents are as close as possible to the top of the ordered list
Formal Description of Ranking Problems
• Domain or instance space: $X$
• Ranking features: $f_1, \ldots, f_n$
  ◦ $f_i : X \to S$ where $S = \{1, \ldots, k\} \cup \{\phi\}$
  ◦ or $f_i : X \to \bar{\mathbb{R}}$ where $\bar{\mathbb{R}} = \mathbb{R} \cup \{\phi\}$
  ◦ $f_i(x) = \phi$ means that $x$ is unranked by $f_i$
  ◦ If $f_i(x_j) > f_i(x_k)$ then $x_j$ is preferred over $x_k$
• Feedback function:
  ◦ $\Phi : X \times X \to \{-1, 0, +1\}$ or $\Phi : X \times X \to \mathbb{R}$
  ◦ $\Phi(x, x) = 0$
Formal Description of Ranking Problems (cont.)
• In many applications $\exists\, \Psi : X \to S$ (or $\Psi : X \to \mathbb{R}$) such that
$$\Phi(x_1, x_2) = \begin{cases} +1 & \Psi(x_1) > \Psi(x_2) \\ -1 & \Psi(x_1) < \Psi(x_2) \\ 0 & \text{otherwise} \end{cases}$$
• Goal:
Find a function $H : X \to S$ or $H : X \to \mathbb{R}$ which agrees as much as possible with the feedback on a set of examples $T = \{x_1, \ldots, x_m\}$
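As a minimal sketch of how a pairwise feedback function can be derived from a per-instance score Ψ, the Python snippet below may help. The helper name `feedback_from_scores`, the example ratings, and the choice to return 0 when an instance has no score are assumptions made for illustration; they are not part of the slides.

```python
from typing import Callable, Hashable, Optional


def feedback_from_scores(psi: Callable[[Hashable], Optional[float]]):
    """Build Phi(x1, x2) in {-1, 0, +1} from a scoring function Psi."""
    def phi(x1, x2):
        s1, s2 = psi(x1), psi(x2)
        # Assumption: an instance without a score yields no feedback (0).
        if s1 is None or s2 is None:
            return 0
        # +1 if Psi(x1) > Psi(x2), -1 if Psi(x1) < Psi(x2), 0 otherwise.
        return (s1 > s2) - (s1 < s2)
    return phi


# Hypothetical viewer ratings used as Psi.
ratings = {"a": 5, "b": 3, "c": 3}
phi = feedback_from_scores(ratings.get)
```

With these ratings, `phi("a", "b")` is positive because a is rated above b, and `phi("b", "c")` is zero because the two ratings tie.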
Ranking Loss
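• If $\Phi : X \times X \to \{-1, 0, +1\}$ then $\mathrm{rloss}_T(H)$ is
$$\left| \{ (x_i, x_j) \in T \times T \mid H(x_i) \le H(x_j) \text{ and } \Phi(x_i, x_j) > 0 \} \right|$$
• If $\Phi : X \times X \to \mathbb{R}$ then let $D(x_i, x_j) = c \cdot \max\{0, \Phi(x_i, x_j)\}$, where $c$ is a normalization constant, and
$$\mathrm{rloss}_D(H) = \sum_{(x_i, x_j) \in T \times T} D(x_i, x_j)\, [[H(x_i) \le H(x_j)]]$$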
• If $\Psi : X \to S$ and $H : X \to S$ where $S = \{1, \ldots, k\}$
$$\mathrm{rloss}_T(H) = \sum_{x \in T} |H(x) - \Psi(x)|$$
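Ranking Loss: illustration for $\Phi : X \times X \to \mathbb{R}$
[Figure: feedback graph over instances a, b, c, d with edge values 2, 4, 3, −1, normalized to weights 0.2, 0.4, 0.3, 0.1. For the ranking H(a)=9, H(b)=8, H(d)=7, H(c)=1 the mis-ordered edges give rloss = 0.1 + 0.3 = 0.4]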
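The weighted ranking loss can be sketched in a few lines of Python. The weights and the scores H below are the ones shown in the illustration, but which pair carries which weight cannot be recovered from the slide, so the edge assignment in `D` is a hypothetical one chosen to reproduce the stated total.

```python
def rloss(D, H):
    """Weighted ranking loss: sum of D(xi, xj) over pairs with H(xi) <= H(xj).

    D maps a pair (xi, xj) to a non-negative weight meaning
    "xi should be ranked above xj"; H maps each instance to its score.
    """
    return sum(w for (xi, xj), w in D.items() if H[xi] <= H[xj])


# Hypothetical edge assignment (an assumption, see the text above).
D = {("a", "b"): 0.2, ("b", "d"): 0.4, ("c", "d"): 0.3, ("d", "a"): 0.1}
H = {"a": 9, "b": 8, "d": 7, "c": 1}
loss = rloss(D, H)  # pairs (c, d) and (d, a) are mis-ordered by H
```

Under this assignment the two mis-ordered pairs contribute 0.3 and 0.1, matching the total on the slide.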
Ranking Loss: illustration for $\Psi : X \to \{1, \ldots, 6\}$
[Figure: two rankings over the levels 1 to 6; loss = 3]
Learning Task
• Training data, i.e. examples (e.g. ratings of movies by users):
  ◦ Set of instances $T = \{x_1, \ldots, x_m\}$
  ◦ Feedback (typically partial) on $T$ via $\Phi$ or $\Psi$
• Find a subset of good ranking features
• Assign each feature an importance weight
• Combine the set of features and weights into an ordering $H$
Combining Ranking Features
Graph Representation
[Figure: each feature induces a preference graph over a, b, c, d. Feature 1: f1(a)=2, f1(b)=1, f1(c)=0, f1(d)=φ. Feature 2: f2(a)=0, f2(b)=2, f2(c)=1, f2(d)=2. The combined graph has edge weights 1/4, 3/4, and 1]
Combining Ranking Features
• An old problem with many faces and different views
• The problem of devising a “good” total ordering from a set of
partial orderings or preferences is often intractable
• In our setting the problem is intractable (NP-complete) if |S| ≥ 3 and
we count weighted or unweighted disagreements
• The intractability stems from conflicting and circular
preferences over pairs of instances
• Combining ranking features is simple if |S| = 2 (no φ)
• In assessing preferences people are indeed inconsistent
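To illustrate why circular preferences cause trouble, the following Python sketch enumerates every total order of three instances and counts disagreements. The preference cycle and the helper `disagreements` are hypothetical; brute-force enumeration is used only because the example is tiny and is not a practical algorithm.

```python
from itertools import permutations


def disagreements(order, prefs):
    """Count preference pairs (winner, loser) that the total order violates."""
    pos = {x: i for i, x in enumerate(order)}  # position 0 is the top
    return sum(1 for winner, loser in prefs if pos[winner] > pos[loser])


# Hypothetical circular preferences: a over b, b over c, c over a.
prefs = [("a", "b"), ("b", "c"), ("c", "a")]
best = min(disagreements(p, prefs) for p in permutations("abc"))
```

No total order satisfies all three preferences, so `best` is at least one; the learner can only trade disagreements off against each other.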
Algorithmic Approaches
• Partial feedback over pairs with general ranking features:
  ◦ Derive simple (binary) features (ignore numerical rating)
  ◦ Devise a generalization of boosting to find a good subset of features and their weights
  ⇒ RankBoost
• Feedback is given via $\Psi$ in an online fashion:
  ◦ Perform rank-prediction by projecting and binning
  ◦ Devise an efficient online algorithm with mistake-bound analysis
Brief Review of Boosting
• Assume access to a weak learner (WL)
• Maintain a distribution $D$ over examples $\{(x_i, y_i)\}_{i=1}^{m}$ where $y_i \in \{-1, +1\}$
• For any distribution over examples, WL returns $h : X \to \mathbb{R}$
• The error of $h$ is better than random: $\left| \sum_i y_i h(x_i) \right| > 1/\mathrm{Poly}(m)$
• Bound the classification error ($\mathrm{sign}(h(x)) \neq y$) by $\exp(-y\,h(x))$
• Modify the distribution according to performance:
$$D_{t+1}(i) \leftarrow \frac{1}{Z_t}\, D_t(i) \exp(-\alpha_t\, y_i\, h_t(x_i))$$
From Classification to Ranking
• Example $(x, y)$
⇒ Pair $(x_0, x_1)$ such that $\Phi(x_0, x_1) < 0$
• Classification error: $\mathrm{sign}(f(x)) \neq y$
⇒ Ranking error: $f(x_0) > f(x_1)$
• 0-1 bound: $\exp(-y f(x))$
RankBoost - Skeleton
• Maintain a distribution over pairs $D_t(x_0, x_1)$
• Given a set of features: $f_i : X \to S$
• Derive a simpler feature: $h_t : X \to \{0, 1\}$
• Find importance weights: $\alpha_t$ w.r.t. $D_t$
• Combine features:
$$H(x) = \sum_t \alpha_t h_t(x)$$
Receive: initial distribution $D_1$ over $X \times X$ (obtained from $\Phi$)
For $t = 1, \ldots, T$:
  Call the weak learner with $D_t$ ⇒ weak ranker $h_t : X \to \{0, 1\}$
  Let $S_b = \{(x_0, x_1) \mid h_t(x_0) - h_t(x_1) = b\}$ for $b \in \{-1, 0, +1\}$
  Calculate $W_b = \sum_{(x_0, x_1) \in S_b} D_t(x_0, x_1)$
  Set $\alpha_t = \frac{1}{2} \log(W_{-1} / W_{+1})$
  Update $D_{t+1}(x_0, x_1) = \frac{1}{Z_t}\, D_t(x_0, x_1) \exp(\alpha_t (h_t(x_0) - h_t(x_1)))$
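The loop above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the weak learner is replaced by a search over a fixed, hypothetical pool of binary rankers, and a small smoothing constant `eps` is added inside the logarithm (an assumption) so that W_{+1} = 0 does not cause a division by zero. A pair (x0, x1) with positive weight means x1 should be ranked above x0.

```python
import math


def rankboost(pairs, candidates, rounds, eps=1e-8):
    """Sketch of RankBoost with {0,1}-valued weak rankers.

    pairs:      dict {(x0, x1): weight}; weight > 0 means x1 should rank above x0
    candidates: list of functions h(x) -> 0 or 1 (hypothetical weak-ranker pool)
    eps:        smoothing inside the log (an assumption, avoids log of 0)
    Returns the combined ranker H and the normalizers Z_1, ..., Z_T.
    """
    total = sum(pairs.values())
    D = {p: w / total for p, w in pairs.items()}  # D_1 from the feedback
    ensemble, Zs = [], []

    def w_stats(h):
        """Return (W_-1, W_+1) for weak ranker h under the current D."""
        w_minus = sum(d for (x0, x1), d in D.items() if h(x0) - h(x1) == -1)
        w_plus = sum(d for (x0, x1), d in D.items() if h(x0) - h(x1) == 1)
        return w_minus, w_plus

    for _ in range(rounds):
        # Stand-in weak learner: pick the candidate maximizing |W_-1 - W_+1|.
        h = max(candidates, key=lambda c: abs(w_stats(c)[0] - w_stats(c)[1]))
        w_minus, w_plus = w_stats(h)
        alpha = 0.5 * math.log((w_minus + eps) / (w_plus + eps))
        # Pairs that h orders wrongly (h(x0) > h(x1)) gain weight.
        D = {(x0, x1): d * math.exp(alpha * (h(x0) - h(x1)))
             for (x0, x1), d in D.items()}
        Z = sum(D.values())
        D = {p: d / Z for p, d in D.items()}
        ensemble.append((alpha, h))
        Zs.append(Z)

    return (lambda x: sum(a * h(x) for a, h in ensemble)), Zs


# Hypothetical data: one feature, target order a above b above c.
f = {"a": 3.0, "b": 2.0, "c": 1.0}
pairs = {("b", "a"): 1.0, ("c", "b"): 1.0, ("c", "a"): 1.0}
candidates = [lambda x, th=th: int(f[x] > th) for th in (1.5, 2.5)]
H, Zs = rankboost(pairs, candidates, rounds=2)
```

Because the toy data is perfectly separable, the smoothed weights come out large; on noisy feedback W_{+1} is nonzero and the smoothing matters less.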
Deriving Simplified Features
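$$h(x) = \begin{cases} 1 & \text{if } f_i(x) > \theta \\ 0 & \text{if } f_i(x) \le \theta \\ q_{\mathrm{def}} & \text{if } f_i(x) = \phi \end{cases}$$
• Ignores the numerical ratings (uses only relative ranking information)
• Search for good $f_i$, $q_{\mathrm{def}}$, $\theta$ can be done efficiently for any feedback
[Figure: instances a, b, c, d with f(a)=7, f(b)=5, f(c)=3, f(d)=1 and an unranked instance e with f(e)=φ, shown before and after thresholding]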
Deriving Simplified Features - Details
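• Score of a feature: $r = \sum_{x_0, x_1} D(x_0, x_1)\, (h(x_1) - h(x_0))$
• Define $\pi(x) = \sum_{x'} (D(x', x) - D(x, x'))$
$$\begin{aligned}
r &= \sum_{x_0, x_1} D(x_0, x_1)\, (h(x_1) - h(x_0)) \\
  &= \sum_{x_0, x_1} D(x_0, x_1)\, h(x_1) - \sum_{x_0, x_1} D(x_0, x_1)\, h(x_0) \\
  &= \sum_{x} h(x) \sum_{x'} D(x', x) - \sum_{x} h(x) \sum_{x'} D(x, x') \\
  &= \sum_{x} h(x) \sum_{x'} (D(x', x) - D(x, x')) = \sum_{x} h(x)\, \pi(x)
\end{aligned}$$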
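The identity r = Σ_x h(x) π(x) can be checked numerically with a short Python sketch. The distribution `D` and the binary ranker `h` below are hypothetical values chosen for illustration.

```python
def potential(D):
    """pi(x) = sum over x' of D(x', x) - D(x, x')."""
    pi = {}
    for (x0, x1), d in D.items():
        pi[x1] = pi.get(x1, 0.0) + d  # x1 is the second element: +D(x', x)
        pi[x0] = pi.get(x0, 0.0) - d  # x0 is the first element:  -D(x, x')
    return pi


# Hypothetical distribution over pairs and hypothetical binary ranker.
D = {("b", "a"): 0.5, ("c", "a"): 0.25, ("c", "b"): 0.25}
h = {"a": 1, "b": 1, "c": 0}

# Score computed directly over pairs.
r_pairs = sum(d * (h[x1] - h[x0]) for (x0, x1), d in D.items())

# Score computed through the potential pi.
pi = potential(D)
r_pi = sum(h[x] * pi[x] for x in pi)
```

The practical benefit is that π is computed once per round, after which each candidate h costs one pass over the instances rather than one pass over all pairs.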
Deriving Simplified Features - Finishing Up
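• Rewrite score
$$\begin{aligned}
r &= \sum_{x : f_i(x) > \theta} h(x)\, \pi(x) + \sum_{x : f_i(x) \le \theta} h(x)\, \pi(x) + \sum_{x : f_i(x) = \phi} h(x)\, \pi(x) \\
  &= \sum_{x : f_i(x) > \theta} \pi(x) + q_{\mathrm{def}} \sum_{x : f_i(x) = \phi} \pi(x)
\end{aligned}$$
• Note that $\sum_x \pi(x) = \sum_{x, x'} (D(x', x) - D(x, x')) = 0$
• Further simplify
$$r = \sum_{x : f_i(x) > \theta} \pi(x) - q_{\mathrm{def}} \sum_{x : f_i(x) \neq \phi} \pi(x)$$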
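The simplified score suggests a single sweep over the distinct values of a feature. The Python sketch below is one way to do it, under the assumptions that q_def is restricted to {0, 1}, that unranked instances are encoded as `None`, and that the best weak ranker is the one maximizing |r|; the function name and the example values are hypothetical.

```python
def best_threshold(f, pi):
    """Search theta and q_def in {0, 1} maximizing |r| for one feature.

    f:  dict instance -> feature value, or None when the instance is
        unranked by the feature (None plays the role of phi)
    pi: dict instance -> potential pi(x), summing to zero over all instances
    Returns (r, theta, q_def).
    """
    # Sum of pi grouped by distinct feature value, over ranked instances only.
    by_value = {}
    for x, v in f.items():
        if v is not None:
            by_value[v] = by_value.get(v, 0.0) + pi[x]
    R = sum(by_value.values())  # sum of pi over instances with f(x) != phi

    best = None
    L = 0.0  # running sum of pi over instances with f(x) > theta
    thetas = sorted(by_value, reverse=True)
    for j, theta in enumerate(thetas + [float("-inf")]):
        for q in (0, 1):
            r = L - q * R  # the simplified score
            if best is None or abs(r) > abs(best[0]):
                best = (r, theta, q)
        if j < len(thetas):
            # Instances with f(x) == theta join the sum once theta drops lower.
            L += by_value[theta]
    return best


# Hypothetical potentials and feature values; b is unranked by the feature.
pi = {"a": 0.75, "b": -0.25, "c": -0.5}
f = {"a": 7.0, "b": None, "c": 3.0}
best = best_threshold(f, pi)
```

After grouping, the sweep costs one step per distinct feature value, which is where the efficiency claimed above comes from.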
Bounds on Empirical Error and Generalization
• (Empirical) bound on the number of mis-ordered pairs:
$$\mathrm{rloss}_{D_1}(H) \le \prod_{t=1}^{T} Z_t$$
• A bound on the generalization error using VC analysis is tricky since pairs are not independent
• Generalization error bound using two button model
$$\left| \mathrm{rloss}_{D_1}(H) - \mathrm{rloss}_{D}(H) \right| \le O\!\left( \sqrt{d' \log(m/d') / m} \right)$$
RankBoost - Meta-search Experiments
                   Top 1  Top 2  Top 5  Top 10  Top 20  Top 30  Avg Rank
ML Domain
  RankBoost          102    144    173     184     194     202      4.38
  Best (Top 1)       117    137    154     167     177     181      6.80
  Best (Top 10)      112    147    172     179     185     187      5.33
  Best (Top 30)       95    129    159     178     187     191      5.68
University Domain
  RankBoost           95    141    197     215     247     263      7.74
Online Framework for Ranking
• Algorithm works in rounds
• On round $t$ the ranking algorithm:
  ◦ Gets an input instance $z_t = (f_1(x_t), \ldots, f_n(x_t))$, with $f_i(x_t) \in \mathbb{R}$
  ◦ Predicts a rank $\hat{y}_t \in \{1, \ldots, k\}$
  ◦ Receives the correct rank $y_t \in \{1, \ldots, k\}$
  ◦ Computes the loss $|y_t - \hat{y}_t|$
  ◦ Updates the rank-prediction rule
• Goal: minimize the cumulative loss $\sum_t |y_t - \hat{y}_t|$
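The round structure above can be sketched as a small Python driver. The `predict`/`update` interface and the `ConstantRanker` baseline are hypothetical, introduced only to make the protocol concrete.

```python
def run_online(ranker, stream):
    """Run the online protocol; return the cumulative loss sum_t |y_t - yhat_t|.

    ranker: any object with predict(z) -> rank and update(z, y) methods
            (a hypothetical interface, not taken from the slides)
    stream: iterable of (z_t, y_t) pairs
    """
    total = 0
    for z, y in stream:
        y_hat = ranker.predict(z)  # predict before the true rank is revealed
        total += abs(y - y_hat)    # loss on this round
        ranker.update(z, y)        # then update the prediction rule
    return total


class ConstantRanker:
    """Trivial baseline that always predicts the same rank (illustration only)."""

    def __init__(self, rank):
        self.rank = rank

    def predict(self, z):
        return self.rank

    def update(self, z, y):
        pass


loss = run_online(ConstantRanker(3), [((0.1,), 1), ((0.5,), 3), ((0.9,), 5)])
```

Any learner that exposes the same two methods can be dropped into `run_online` and compared on the same stream.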
Ranking via Projections
[Figure: instances projected onto the direction w; thresholds b1, b2, b3 split the projection line into rank levels 1 to 4]
PRank - Update
[Figure: instances projected onto the direction w; thresholds b1, ..., b4 separate rank levels 1 to 5. When the predicted rank differs from the correct rank, the thresholds between them form the error set E (b2 and b3 are highlighted) and are moved]
Direction: $w$
Thresholds: $b_1, b_2, \ldots, b_{k-1}$
Rank a new instance: $w \cdot x$
PRank - Summary of Update
[Figure: update step showing w, x, and the moved thresholds. Move thresholds: ∀r ∈ E : b_r ← b_r − 1]
• Compute the predicted rank $\hat{y} = 1 + |\{r : w \cdot x < b_r\}|$
• Get the correct rank $y$
• If $y \neq \hat{y}$ do:
  ◦ Compute the error set $E = \{r : y \le r < \hat{y}\}$
  ◦ Update $w \leftarrow w + |E|\, x$
  ◦ For all $r \in E$ do: $b_r \leftarrow b_r - 1$
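A minimal Python sketch of this update is given below. It follows the convention of the summary, where rank 1 corresponds to the largest projection. The summary states only the case ŷ > y; the mirrored case ŷ < y (error set {r : ŷ ≤ r < y}, with the signs flipped) is filled in here by symmetry and is an assumption, as are the class name, the zero initialization, and the toy data.

```python
class PRank:
    """Sketch of the PRank update; rank 1 corresponds to the largest w.x."""

    def __init__(self, n_features, k):
        self.w = [0.0] * n_features
        self.b = [0.0] * (k - 1)  # b[r - 1] holds threshold b_r

    def score(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def predict(self, x):
        s = self.score(x)
        # y_hat = 1 + |{r : w.x < b_r}|
        return 1 + sum(1 for br in self.b if s < br)

    def update(self, x, y):
        y_hat = self.predict(x)
        if y_hat == y:
            return
        if y_hat > y:
            # Case stated in the summary: E = {r : y <= r < y_hat}.
            E, sign = range(y, y_hat), +1
        else:
            # Mirrored case, assumed by symmetry: E = {r : y_hat <= r < y}.
            E, sign = range(y_hat, y), -1
        n = len(E)
        self.w = [wi + sign * n * xi for wi, xi in zip(self.w, x)]
        for r in E:
            self.b[r - 1] -= sign


# Hypothetical one-dimensional data: larger x should receive rank 1.
data = [((3.0,), 1), ((2.0,), 2), ((1.0,), 3)]
model = PRank(n_features=1, k=3)
for _ in range(50):
    for x, y in data:
        model.update(x, y)
```

Note that w is updated once per mistake by |E| x, while each threshold in E moves by one unit; this matches the quantities used in the mistake-bound proof.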
Consistency
[Figure: thresholds b1, b2, b3, b4 along the direction w, with b4 also drawn out of order]
• Can the above happen?
No ⇒ The order of the thresholds is preserved throughout the run of the algorithm
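Ranking Margin
[Figure: thresholds b1, ..., b4 along the direction w, with the distances b3 − w·x and w·x − b2 from the projection of x to its neighboring thresholds marked]
• The margin of an example
$$\mathrm{Margin}(x, y) = \min\left\{ \min_{r \ge y} \{ w \cdot x - b_r \},\ \min_{r < y} \{ b_r - w \cdot x \} \right\}$$
• The margin of a dataset
$$\mathrm{Margin}\left( \{x_t, y_t\}_{t=1}^{T} \right) = \min_t \mathrm{Margin}(x_t, y_t)$$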
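The margin definitions translate directly into Python. In the sketch below the rule (w, b) and the data are hypothetical, and the rule is not normalized to unit norm, so the value is illustrative and cannot be plugged into the mistake bound as is.

```python
def margin(w, b, x, y):
    """Margin(x, y) = min(min_{r >= y} (w.x - b_r), min_{r < y} (b_r - w.x))."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    k = len(b) + 1
    gaps = [s - b[r - 1] for r in range(y, k)]   # thresholds r >= y lie below
    gaps += [b[r - 1] - s for r in range(1, y)]  # thresholds r < y lie above
    return min(gaps)


def dataset_margin(w, b, data):
    """Margin of a dataset: the smallest margin over its examples."""
    return min(margin(w, b, x, y) for x, y in data)


# Hypothetical rule and data (k = 3 ranks, one feature).
w, b = [2.0], [5.0, 3.0]
data = [((3.0,), 1), ((2.0,), 2), ((1.0,), 3)]
m = dataset_margin(w, b, data)
```

A positive margin means every example is ranked correctly by the rule; a negative value signals at least one threshold on the wrong side of some projection.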
Mistake Bound Theorem
• Assuming:
  ◦ Input sequence $(x_1, y_1), \ldots, (x_T, y_T)$
  ◦ The norm of the instances is bounded: $\|x_t\| \le R$
  ◦ The sequence is correctly ranked by $w^\star$ and $b^\star_1, \ldots, b^\star_{k-1}$, normalized so that $\|w^\star\|^2 + (b^\star_1)^2 + \ldots + (b^\star_{k-1})^2 = 1$
  ◦ The margin achieved by $w^\star, b^\star_1, \ldots, b^\star_{k-1}$ is $\gamma$
• Then:
$$\sum_{t=1}^{T} |y_t - \hat{y}_t| \le \frac{(k-1)R^2 + 1}{\gamma^2}$$
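To get a feel for the size of the bound, the right-hand side can be evaluated for concrete values. The numbers below (k = 5, R = 1, γ = 0.1) are hypothetical and chosen only for the arithmetic.

```python
def prank_loss_bound(k, R, gamma):
    """Right-hand side of the bound: ((k - 1) * R**2 + 1) / gamma**2."""
    return ((k - 1) * R ** 2 + 1) / gamma ** 2


# Hypothetical values: 5 rank levels, unit-norm instances, margin 0.1.
bound = prank_loss_bound(k=5, R=1.0, gamma=0.1)  # roughly (4 + 1) / 0.01
```

The bound does not depend on the length T of the sequence, and it grows quadratically as the margin shrinks.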
Proof Sketch for MB Theorem
• Instantaneous Pranker: $v^t = (w^t, b^t_1, \ldots, b^t_{k-1}) \Rightarrow v^{t+1} = (w^{t+1}, b^{t+1}_1, \ldots, b^{t+1}_{k-1})$
• Instantaneous error set $E_t$ of size $n_t = |E_t|$
• Bound:
  ◦ from below the increase in $v^t \cdot v^\star$
  ◦ from above the increase in $v^t \cdot v^t$
• Assume ranking mistakes on all rounds
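Proof Sketch for MB Theorem (cont.)
$$\begin{aligned}
v^\star \cdot v^{t+1} &= v^\star \cdot v^t + |E_t|\, w^\star \cdot x_t - \sum_{r \in E_t} b^\star_r \\
  &= v^\star \cdot v^t + \sum_{r \in E_t} (w^\star \cdot x_t - b^\star_r) \\
  &\ge v^\star \cdot v^t + n_t \gamma
\end{aligned}$$
$$v^\star \cdot v^{T+1} \ge \gamma \sum_{t=1}^{T} n_t$$
$$\|v^{T+1}\|^2 \underbrace{\|v^\star\|^2}_{=1} \ge (v^{T+1} \cdot v^\star)^2 \;\Rightarrow\; \|v^{T+1}\|^2 \ge \gamma^2 \left( \sum_{t=1}^{T} n_t \right)^2$$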
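Proof Sketch for MB Theorem (cont.)
$$\begin{aligned}
\|v^{t+1}\|^2 &\le \|v^t\|^2 + 2 \underbrace{\sum_{r \in E_t} (w^t \cdot x_t - b^t_r)}_{\le 0} + (n_t)^2 \|x_t\|^2 + n_t \\
  &\le \|v^t\|^2 + (n_t)^2 R^2 + n_t
\end{aligned}$$
$$\|v^{T+1}\|^2 \le R^2 \sum_{t=1}^{T} (n_t)^2 + \sum_{t=1}^{T} n_t$$
$$\gamma^2 \left( \sum_{t=1}^{T} n_t \right)^2 \le \|v^{T+1}\|^2 \le R^2 \sum_{t=1}^{T} (n_t)^2 + \sum_{t=1}^{T} n_t$$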
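Proof Sketch for MB Theorem - Finishing up
$$\gamma^2 \left( \sum_{t=1}^{T} n_t \right)^2 \le \|v^{T+1}\|^2 \le R^2 \sum_{t=1}^{T} (n_t)^2 + \sum_{t=1}^{T} n_t$$
$$\sum_t n_t \le \frac{R^2 \left[ \sum_t (n_t)^2 \right] / \left[ \sum_t n_t \right] + 1}{\gamma^2}$$
Use $n_t \le k - 1$ and $n_t = |y_t - \hat{y}_t|$
$$\sum_{t=1}^{T} |\hat{y}_t - y_t| = \sum_{t=1}^{T} n_t \le \frac{(k-1)R^2 + 1}{\gamma^2} \le \frac{(k-1)(R^2 + 1)}{\gamma^2}$$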
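The two inequalities used in the last step can be spelled out as follows. This is a short worked derivation under the slide's assumptions ($0 \le n_t \le k-1$, $k \ge 2$, and at least one mistake so that $\sum_t n_t > 0$).

```latex
% Step 0: divide the combined inequality by \gamma^2 \sum_t n_t > 0.
\sum_t n_t \;\le\; \frac{1}{\gamma^2}\left( R^2\,\frac{\sum_t n_t^2}{\sum_t n_t} + 1 \right)

% Step 1: since 0 \le n_t \le k-1, each term satisfies n_t^2 \le (k-1)\, n_t, hence
\frac{\sum_t n_t^2}{\sum_t n_t} \;\le\; k-1
\quad\Longrightarrow\quad
\sum_t n_t \;\le\; \frac{(k-1)R^2 + 1}{\gamma^2}

% Step 2: since k \ge 2 implies 1 \le k-1,
(k-1)R^2 + 1 \;\le\; (k-1)R^2 + (k-1) \;=\; (k-1)(R^2 + 1)
```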
Experiments - EachMovie Database
• Database of 1648 movies
• 74,424 registered viewers
• Viewers rated subsets of movies
• Experimented with two subsets:
  ◦ 7451 viewers who rated > 100 movies
Results for 100 or more movies
[Figure: rank loss (vertical axis, 1 to 2.4) versus round (horizontal axis, 0 to 100) for PRank, WH, and MC-Perceptron]
Results for 200 or more movies
[Figure: rank loss (vertical axis, 1 to 2) versus round (horizontal axis, 0 to 200) for PRank, WH, and MC-Perceptron]
Summary
• Formal framework for learning of ranking problems
• Analysis of objective functions for ranking
• Two algorithmic paradigms for learning good rankers
• Empirical loss analysis and generalization bounds for RankBoost
• Mistake bound and consistency for PRank
• Research directions:
  ◦ Batch versions of PRank and Kernels for Ranking
  ◦ More applications (topic ranking, sound localization)
Based on
Y. Freund, R. Iyer, R. E. Schapire, Y. Singer. An efficient boosting
algorithm for combining preferences. Machine Learning: Proc. of the
Fifteenth Intl. Conf., 1998.
W. W. Cohen, R. E. Schapire, Y. Singer. Learning to order things.
Journal of Artificial Intelligence Research, 10:243–270, 1999.
R. Iyer, D. Lewis, R. E. Schapire, Y. Singer, A. Singhal. Boosting for Document Routing. CIKM, 2000.
K. Crammer and Y. Singer. PRanking with Ranking. Advances in
Neural Information Processing Systems 15, 2001.
K. Crammer and Y. Singer. A New Family of Online Algorithms for Category Ranking. SIGIR, 2002.