ICML, July 2008

(1)

Sham M. Kakade

Shai Shalev-Shwartz Ambuj Tewari

Efficient Bandit Algorithms for Online Multiclass Prediction

ICML, July 2008

(2)

Motivation

Online web advertisement systems User submits a query

System (the learner) places an ad User either “clicks” or ignores

Goal: Maximize number of “clicks”

Modeling ?

Not the common online learning setting --

If user ignores, we donʼt get the “correct” ad Not the common multi-armed bandit --

We are also provided with a query

(3)

Outline

Online Bandit Multi-class Categorization Background: The Multi-class Perceptron The Banditron

Analysis

Experiments

The Separable Case

Extensions and Open Problems

(4)

Online Bandit Multiclass Categorization

For t = 1, 2, . . . , T Receive x ∈ R

^d

Predict ˆy

_t

∈ {1, . . . , k}

Pay 1[y

_t

"= ˆy

^t

]

y

_t

is not revealed

(query) (ad)

(click feedback)

(5)

Linear Hypotheses

A hypothesis is a mapping h : R

^d

→ {1, . . . , k}

Linear hypothesis: Exists k × d matrix W s.t.

h(x) = argmax

r

(W x)

_r

(6)

The Multiclass Perceptron

U

^t

=



 

 

0 . . . 0

...

0 . . . 0

. . . x_t . . .

0 . . . 0

...

0 . . . 0

. . . −x^t . . .

0 . . . 0

...

0 . . . 0



 

 

row ˆy_t row y_t

For t = 1, 2, . . . , T Receive x

_t

∈ R

^d

Predict ˆy

_t

= argmax

r

(W

^t

x

_t

)

_r

Receive y

_t

Update: W

^t+1

= W

^t

+ U

^t

where

(7)

Perceptron in the Bandit Setting

U^t =







0 . . . 0

...

0 . . . 0

. . . xt . . .

0 . . . 0

...

0 . . . 0

. . . −x^t . . .

0 . . . 0

...

0 . . . 0







Problem: We’re blind to value of y

_t

Solution: Randomization can help !

(8)

Exploration

Explore: instead of predicting ˆy

_t

guess some ˜y

_t

Suppose we get the feedback ’correct’, i.e. ˜y

_t

= y

_t

Then, we know that

ˆy

_t

!= y

^t

y

_t

= ˜y

_t

So, we can update W using the matrix U

^t

(9)

Exploration vs. Exploitation

But, if our current model is correct, i.e. ˆy

_t

= y

_t

And, we guess some other ˜y

_t

Then, we both suffer loss and do not know how to update W

In this case, it’s better to Exploit the quality of current model

We control the exploration-exploitation tradeoff

using randomization

(10)

For t = 1, 2, . . . , T Receive x

_t

∈ R

^d

Set ˆy

_t

= argmax

r

(W

^t

x

_t

)

_r

Define: P (r) = (1 − γ)1[r = ˆy

^t

] +

^γ_k

Randomly sample ˜y

_t

according to P

Predict ˜y

_t

and receive feedback 1[˜y

_t

= y

_t

] Update: W

^t+1

= W

^t

+ ˜ U

^t

The Banditron

(11)

For t = 1, 2, . . . , T Receive x

_t

∈ R

^d

Set ˆy

_t

= argmax

r

(W

^t

x

_t

)

_r

Define: P (r) = (1 − γ)1[r = ˆy

^t

] +

^γ_k

Randomly sample ˜y

_t

according to P

Predict ˜y

_t

and receive feedback 1[˜y

_t

= y

_t

] Update: W

^t+1

= W

^t

+ ˜ U

^t

The Banditron

_





0 . . . 0

...

0 . . . 0

. . . ^1[y_{P (y}^t^=˜^y^t^]

t) x_t . . .

0 . . . 0

...

0 . . . 0

. . . −x^t . . .

0 . . . 0

...

0 . . . 0







(12)

The Banditron Expected Update

E[ ˜U^t] = !

r

P (r)







0 . . . 0

0 . . .... 0 . . . ^1[y_{P (y}^t^=r]

t) x_t . . .

0 . . . 0

0 . . .... 0 . . . −x^t . . .

0 . . . 0

0 . . .... 0







= U^t

(13)

Analysis: The Hinge-Loss

5 ^!

≥ 1

(W x_t)_y_t (W x_t)_r

≥ 1[y

^t

"= ˆy

^t

]

!

_t

(W ) = max

r!=y^t

1 − (W x

^t

)

_y_t

+ (W x

_t

)

_r

(14)

Analysis: The Hinge-Loss

5 ^!

≥ 1

(W x_t)_y_t (W x_t)_r

≥ 1[y

^t

"= ˆy

^t

]

The Separable Case:

!

_t

(W ) = max

r!=y^t

1 − (W x

^t

)

_y_t

+ (W x

_t

)

_r

(15)

Mistake Bounds

M ≤ L + D + √

L D

E[M] ≤ L + γ T + 3 max !

k D

γ , "

D γ T #

+ $

k D L γ

Perceptron:

Banditron:

Symbol Meaning

M # mistakes

L competitor loss !_t !_t(W ^!) D competitor margin !W ^!!²_F

k # classes T # rounds

γ Exploration-Exploitation parameter

(16)

Mistake Bounds (cont.)

Symbol Meaning

L competitor loss !_t !_t(W ^!) D competitor margin !W ^!!²_F

k # classes T # rounds

Perceptron Banditron No noise:

L = 0 D √

k D T Low noise:

L = O(√

k D T )

√k D T √

k D T Noisy: L + T ^1/2 L + T ^2/3

(17)

Experiments

Reuters RCV1

~700k documents

Bag-of-words (d ~ 350k)

4 labels {CCAT, ECAT, GCAT, MCAT}

Synthetic separable data set

9 classes, d=400, million instances

A simple simulation of generating text documents Synthetic non-separable data set

separable + 5% label noise

(18)

Experimental Results -- Reuters

10² 10³ 10⁴ 10⁵ 10⁶

10⁻² 10⁻¹

10⁰ γ = 0.050

number of examples

error rate

Perceptron Banditron

Similar Slope

(19)

Experimental Results -- Separable Data

10² 10³ 10⁴ 10⁵ 10⁶

10⁻⁵ 10⁻⁴ 10⁻³ 10⁻² 10⁻¹

10⁰ γ = 0.014

number of examples

error rate

Slope -1

Slope -0.55

(20)

Experimental Results -- 5% label noise

10² 10³ 10⁴ 10⁵ 10⁶

10^−0.9 10^−0.7 10^−0.5 10^−0.3

γ = 0.006

number of examples

error rate

13%

10%

(21)

Exploration-Exploitation Parameter

100 ⁻³ 10⁻² 10⁻¹ 10⁰

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

γ

error rate

100 ⁻³ 10⁻² 10⁻¹ 10⁰

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

γ

error rate

Reuters

5% label noise

(22)

The Separable Case

Halving

Discretized hypothesis space Predict by majority vote

Remove ’wrong’ hypotheses

Note: can be applied in Bandit setting

Mistake Bound O(k² d log(D d)) Using JL lemma we can also ob- tain O(k² D log(^{T +k}_δ ) log(D))