Sham M. Kakade
Shai Shalev-Shwartz Ambuj Tewari
Efficient Bandit Algorithms for Online Multiclass Prediction
ICML, July 2008
Motivation
Online web advertisement systems User submits a query
System (the learner) places an ad User either “clicks” or ignores
Goal: Maximize number of “clicks”
Modeling ?
Not the common online learning setting --
If user ignores, we donʼt get the “correct” ad Not the common multi-armed bandit --
We are also provided with a query
Outline
Online Bandit Multi-class Categorization Background: The Multi-class Perceptron The Banditron
Analysis
Experiments
The Separable Case
Extensions and Open Problems
Online Bandit Multiclass Categorization
For t = 1, 2, . . . , T Receive x ∈ R
dPredict ˆy
t∈ {1, . . . , k}
Pay 1[y
t"= ˆy
t]
y
tis not revealed
(query) (ad)
(click feedback)
Linear Hypotheses
A hypothesis is a mapping h : R
d→ {1, . . . , k}
Linear hypothesis: Exists k × d matrix W s.t.
h(x) = argmax
r
(W x)
rThe Multiclass Perceptron
U
t=
0 . . . 0
...
0 . . . 0
. . . xt . . .
0 . . . 0
...
0 . . . 0
. . . −xt . . .
0 . . . 0
...
0 . . . 0
row ˆyt row yt
For t = 1, 2, . . . , T Receive x
t∈ R
dPredict ˆy
t= argmax
r
(W
tx
t)
rReceive y
tUpdate: W
t+1= W
t+ U
twhere
Perceptron in the Bandit Setting
row ˆyt row yt
Ut =
0 . . . 0
...
0 . . . 0
. . . xt . . .
0 . . . 0
...
0 . . . 0
. . . −xt . . .
0 . . . 0
...
0 . . . 0
Problem: We’re blind to value of y
tSolution: Randomization can help !
Exploration
Explore: instead of predicting ˆy
tguess some ˜y
tSuppose we get the feedback ’correct’, i.e. ˜y
t= y
tThen, we know that
ˆy
t!= y
ty
t= ˜y
tSo, we can update W using the matrix U
tExploration vs. Exploitation
But, if our current model is correct, i.e. ˆy
t= y
tAnd, we guess some other ˜y
tThen, we both suffer loss and do not know how to update W
In this case, it’s better to Exploit the quality of current model
We control the exploration-exploitation tradeoff
using randomization
For t = 1, 2, . . . , T Receive x
t∈ R
dSet ˆy
t= argmax
r
(W
tx
t)
rDefine: P (r) = (1 − γ)1[r = ˆy
t] +
γkRandomly sample ˜y
taccording to P
Predict ˜y
tand receive feedback 1[˜y
t= y
t] Update: W
t+1= W
t+ ˜ U
tThe Banditron
For t = 1, 2, . . . , T Receive x
t∈ R
dSet ˆy
t= argmax
r
(W
tx
t)
rDefine: P (r) = (1 − γ)1[r = ˆy
t] +
γkRandomly sample ˜y
taccording to P
Predict ˜y
tand receive feedback 1[˜y
t= y
t] Update: W
t+1= W
t+ ˜ U
tThe Banditron
0 . . . 0
...
0 . . . 0
. . . 1[yP (yt=˜yt]
t) xt . . .
0 . . . 0
...
0 . . . 0
. . . −xt . . .
0 . . . 0
...
0 . . . 0
row ˆyt row yt
The Banditron Expected Update
E[ ˜Ut] = !
r
P (r)
0 . . . 0
0 . . .... 0 . . . 1[yP (yt=r]
t) xt . . .
0 . . . 0
0 . . .... 0 . . . −xt . . .
0 . . . 0
0 . . .... 0
= Ut
Analysis: The Hinge-Loss
5 !
≥ 1
(W xt)yt (W xt)r
≥ 1[y
t"= ˆy
t]
!
t(W ) = max
r!=yt
1 − (W x
t)
yt+ (W x
t)
rAnalysis: The Hinge-Loss
5 !
≥ 1
(W xt)yt (W xt)r
≥ 1[y
t"= ˆy
t]
The Separable Case:
!
t(W ) = max
r!=yt
1 − (W x
t)
yt+ (W x
t)
rMistake Bounds
M ≤ L + D + √
L D
E[M] ≤ L + γ T + 3 max !
k D
γ , "
D γ T #
+ $
k D L γ
Perceptron:
Banditron:
Symbol Meaning
M # mistakes
L competitor loss !t !t(W !) D competitor margin !W !!2F
k # classes T # rounds
γ Exploration-Exploitation parameter
Mistake Bounds (cont.)
Symbol Meaning
L competitor loss !t !t(W !) D competitor margin !W !!2F
k # classes T # rounds
Perceptron Banditron No noise:
L = 0 D √
k D T Low noise:
L = O(√
k D T )
√k D T √
k D T Noisy: L + T 1/2 L + T 2/3
Experiments
Reuters RCV1
~700k documents
Bag-of-words (d ~ 350k)
4 labels {CCAT, ECAT, GCAT, MCAT}
Synthetic separable data set
9 classes, d=400, million instances
A simple simulation of generating text documents Synthetic non-separable data set
separable + 5% label noise
Experimental Results -- Reuters
102 103 104 105 106
10−2 10−1
100 γ = 0.050
number of examples
error rate
Perceptron Banditron
Similar Slope
Experimental Results -- Separable Data
102 103 104 105 106
10−5 10−4 10−3 10−2 10−1
100 γ = 0.014
number of examples
error rate
Perceptron Banditron
Slope -1
Slope -0.55
Experimental Results -- 5% label noise
102 103 104 105 106
10−0.9 10−0.7 10−0.5 10−0.3
γ = 0.006
number of examples
error rate
Perceptron Banditron
13%
10%
Exploration-Exploitation Parameter
100 −3 10−2 10−1 100
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
γ
error rate
Perceptron Banditron
100 −3 10−2 10−1 100
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
γ
error rate
Perceptron Banditron
Reuters
5% label noise
The Separable Case
Halving
Discretized hypothesis space Predict by majority vote
Remove ’wrong’ hypotheses
Note: can be applied in Bandit setting
Mistake Bound O(k2 d log(D d)) Using JL lemma we can also ob- tain O(k2 D log(T +kδ ) log(D))
The Separable Case
Halving
Discretized hypothesis space Predict by majority vote
Remove ’wrong’ hypotheses
Note: can be applied in Bandit setting
Mistake Bound O(k2 d log(D d)) Using JL lemma we can also ob- tain O(k2 D log(T +kδ ) log(D))
The Separable Case
Halving
Discretized hypothesis space Predict by majority vote
Remove ’wrong’ hypotheses
Note: can be applied in Bandit setting
Mistake Bound O(k2 d log(D d)) Using JL lemma we can also ob- tain O(k2 D log(T +kδ ) log(D))