Trading regret rate for computational efficiency in online learning with limited feedback
Shai Shalev-Shwartz
TTI-C Hebrew University
On-line Learning with Limited Feedback Workshop, 2009
June 2009
Online Learning with restricted computational power
Main Question
Given a runtime constraint τ , horizon T , reference class H:
What is the achievable regret of an algorithm whose (amortized) runtime is O(τ ) ?
Regret
τ
Online Learning with restricted computational power
Main Question
Given a runtime constraint τ , horizon T , reference class H:
What is the achievable regret of an algorithm whose (amortized) runtime is O(τ ) ?
Regret
τ
Online Learning with restricted computational power
Main Question
Given a runtime constraint τ , horizon T , reference class H:
What is the achievable regret of an algorithm whose (amortized) runtime is O(τ ) ?
Regret
τ
Non-stochastic Multi-armed bandit with side information
The prediction problem Arms: A = {1, . . . , k}
For t = 1, 2, . . . , T
Learner receives side information x t ∈ X
Environment chooses cost vector c t : A → [0, 1] (unknown to learner) Learner chooses action a t ∈ A
Learner pay cost c t (a t )
Goal
Low regret w.r.t. a reference hypothesis class H:
Regret def =
T
X
t=1
c t (a t ) − min
h∈H T
X
t=1
c t (h(x t ))
Non-stochastic Multi-armed bandit with side information
The prediction problem Arms: A = {1, . . . , k}
For t = 1, 2, . . . , T
Learner receives side information x t ∈ X
Environment chooses cost vector c t : A → [0, 1] (unknown to learner) Learner chooses action a t ∈ A
Learner pay cost c t (a t )
Goal
Low regret w.r.t. a reference hypothesis class H:
Regret def =
T
X c t (a t ) − min
T
X c t (h(x t ))
Outline
H is the class of linear hypotheses
Multiclass, separable
Regret
τ
Banditron Halving
2 arms, non-separable
Regret
τ
IDPK
EXP4
First example: Bandit Multi-class Categorization
For t = 1, 2, . . . , T
Learner receives side information x t ∈ X
Environment chooses ’correct’ arm y t ∈ A (unknown to learner) Learner chooses action a t ∈ A
Learner pay cost c t (a t ) = 1[a t 6= y t ]
Linear Hypotheses
H = {x 7→ argmax
r
(W x) r : W ∈ R k,d , kW k F ≤ 1}
R d
Large margin assumption
Assumption: Data is separable with margin µ:
∀t, ∀r 6= y t , (W x t ) y
t− (W x t ) r ≥ µ
R d
First approach – Halving
Halving for Bandit Multiclass categorization Initialize: V 1 = H
For t = 1, 2, . . . Receive x t
For all r ∈ [k] let V t (r) = {h ∈ V t : h(x t ) = r}
Predict ˆ y t ∈ arg max r |V t (r)|
If 1[ˆ y t 6= y t ] set V t+1 = V t \ V t (ˆ y t )
Analysis:
Whenever we err |V t+1 | ≤ 1 − 1 k |V t | ≤ exp(−1/k) |V t |
Therefore: M ≤ k log(|H|)
First approach – Halving
Halving for Bandit Multiclass categorization Initialize: V 1 = H
For t = 1, 2, . . . Receive x t
For all r ∈ [k] let V t (r) = {h ∈ V t : h(x t ) = r}
Predict ˆ y t ∈ arg max r |V t (r)|
If 1[ˆ y t 6= y t ] set V t+1 = V t \ V t (ˆ y t ) Analysis:
Whenever we err |V t+1 | ≤ 1 − 1 k |V t | ≤ exp(−1/k) |V t |
Therefore: M ≤ k log(|H|)
Using Halving
Step 1: Dimensionality reduction to d 0 = ˜ O( µ 1
2) Step 2: Discretize H to (1/µ) d
0hypotheses
Apply Halving on the resulting finite set of hypotheses
Analysis:
Mistake bound is ˜ O
k µ
2But runtime grows like (1/µ) 1/µ
2Using Halving
Step 1: Dimensionality reduction to d 0 = ˜ O( µ 1
2) Step 2: Discretize H to (1/µ) d
0hypotheses
Apply Halving on the resulting finite set of hypotheses Analysis:
Mistake bound is ˜ O
k µ
2But runtime grows like (1/µ) 1/µ
2How can we improve runtime?
Halving is not efficient because it does not utilize the structure of H In the full information case: Halving can be made efficient because each version space V t can be made convex !
The Perceptron is a related approach which utilizes convexity and works in the full information case
Next approach: Lets try to rely on the Perceptron
The Mutliclass Perceptron
For t = 1, 2, . . . , T Receive x t ∈ R d
Predict ˆ y t = arg max r (W t x t ) r
Receive y t
If ˆ y t 6= y t update: W t+1 = W t + U t
U t =
0 . . . 0
.. .
0 . . . 0
. . . x t . . .
0 . . . 0
.. .
0 . . . 0
. . . −x t . . .
0 . . . 0
.. .
0 . . . 0
Row y t
Row ˆ y t
Problem: In the bandit case, we’re blind to value of y t
The Mutliclass Perceptron
For t = 1, 2, . . . , T Receive x t ∈ R d
Predict ˆ y t = arg max r (W t x t ) r
Receive y t
If ˆ y t 6= y t update: W t+1 = W t + U t U t =
0 . . . 0
.. .
0 . . . 0
. . . x t . . .
0 . . . 0
.. .
0 . . . 0
. . . −x t . . .
0 . . . 0
.. .
0 . . . 0
Row y t
Row ˆ y t
Problem: In the bandit case, we’re blind to value of y t
The Banditron (Kakade, S, Tewari 08)
Explore: From time to time, instead of predicting ˆ y t guess some ˜ y t
Suppose we get the feedback ’correct’, i.e. ˜ y t = y t
Then, we have full information for Perceptron’s update:
(x t , ˆ y t , ˜ y t = y t )
Exploration-Exploitation Tradeoff:
When exploring we may have ˜ y t = y t 6= ˆ y t and can learn from this
When exploring we may have ˜ y t 6= y t = ˆ y t and then we had the right
answer in our hands but didn’t exploit it
The Banditron (Kakade, S, Tewari 08)
Explore: From time to time, instead of predicting ˆ y t guess some ˜ y t
Suppose we get the feedback ’correct’, i.e. ˜ y t = y t
Then, we have full information for Perceptron’s update:
(x t , ˆ y t , ˜ y t = y t )
Exploration-Exploitation Tradeoff:
When exploring we may have ˜ y t = y t 6= ˆ y t and can learn from this
When exploring we may have ˜ y t 6= y t = ˆ y t and then we had the right
answer in our hands but didn’t exploit it
The Banditron (Kakade, S, Tewari 08)
For t = 1, 2, . . . , T Receive x t ∈ R d
Set ˆ y t = arg max r (W t x t ) r
Define: P (r) = (1 − γ)1[r = ˆ y t ] + γ k Randomly sample ˜ y t according to P Predict ˜ y t
Receive feedback 1[˜ y t = y t ] Update: W t+1 = W t + ˜ U t
where U ˜ t =
0 . . . 0
.. .
0 . . . 0
. . . 1[y P (˜
t=˜ y y
t]
t
) x t . . .
0 . . . 0
.. .
0 . . . 0
. . . −x t . . .
0 . . . 0
.. .
0 . . . 0
Row ˜ y t
Row ˆ y t
The Banditron (Kakade, S, Tewari 08)
For t = 1, 2, . . . , T Receive x t ∈ R d
Set ˆ y t = arg max r (W t x t ) r
Define: P (r) = (1 − γ)1[r = ˆ y t ] + γ k Randomly sample ˜ y t according to P Predict ˜ y t
Receive feedback 1[˜ y t = y t ] Update: W t+1 = W t + ˜ U t where
U ˜ t =
0 . . . 0
.. .
0 . . . 0
. . . 1[y P (˜
t=˜ y y
t]
t
) x t . . .
0 . . . 0
.. .
0 . . . 0
. . . −x t . . .
0 . . . 0
.. .
0 . . . 0
Row ˜ y t
Row ˆ y t
The Banditron (Kakade, S, Tewari 08)
Theorem
Banditron’s regret is O(pk T /µ 2 ) Banditron’s runtime is O(k/µ 2 )
The crux of difference between Halving and Banditron:
Without having the full information, the version space is non-convex and therefore it is hard to utilize the structure of H
Because we relied on the Perceptron we did utilize the structure of H and got an efficient algorithm
We managed to obtain ’full-information examples’ by using exploration
The price of exploration is a higher regret
The Banditron (Kakade, S, Tewari 08)
Theorem
Banditron’s regret is O(pk T /µ 2 ) Banditron’s runtime is O(k/µ 2 )
The crux of difference between Halving and Banditron:
Without having the full information, the version space is non-convex and therefore it is hard to utilize the structure of H
Because we relied on the Perceptron we did utilize the structure of H and got an efficient algorithm
We managed to obtain ’full-information examples’ by using exploration
The price of exploration is a higher regret
Intermediate Summary – Trading Regret for Efficiency
Algorithm Regret runtime
Halving µ k
21 µ
1/µ
2Banditron
√ k T µ
k µ
2Regret
τ
Banditron
Halving
Second example: general costs, non-separable, 2 arms
Action set A = {0, 1}
For t = 1, 2, . . .
Learner receives x t
Environment chooses cost vector c t : A → [0, 1] (unknown to Learner) Learner chooses ˜ p t ∈ [0, 1]
Learner chooses action a t ∈ A according to Pr[a t = 1] = ˜ p t
Learner pays cost c t (a t )
Remark: Can be extended to k arms using e.g. the offset tree
(Beygelzimer and Langford ’09)
Second example: general costs, non-separable, 2 arms
Action set A = {0, 1}
For t = 1, 2, . . .
Learner receives x t
Environment chooses cost vector c t : A → [0, 1] (unknown to Learner) Learner chooses ˜ p t ∈ [0, 1]
Learner chooses action a t ∈ A according to Pr[a t = 1] = ˜ p t
Learner pays cost c t (a t )
Remark: Can be extended to k arms using e.g. the offset tree
(Beygelzimer and Langford ’09)
Hypothesis class and regret goal
H = {x 7→ φ(hw, xi) : kwk 2 ≤ 1}, φ(z) = 1+exp(−z/µ) 1
-1 1
1
Goal: bounded regret
T
X
t=1
E a∼ ˜ p
t[c t (a)] − min
h∈H T
X
t=1
E a∼h(x
t) [c t (a)]
Challenging: even with full information no known efficient algorithms
Hypothesis class and regret goal
H = {x 7→ φ(hw, xi) : kwk 2 ≤ 1}, φ(z) = 1+exp(−z/µ) 1
-1 1
1
Goal: bounded regret
T
X
t=1
E a∼ ˜ p
t[c t (a)] − min
h∈H T
X
t=1
E a∼h(x
t) [c t (a)]
First approach – EXP4
Step 1: Dimensionality reduction to d 0 = ˜ O( µ 1
2) Step 2: Discretize H to (1/µ) d
0hypotheses
Apply EXP4 on the resulting finite set of hypotheses (Auer, Cesa-Bianchi, Freund, Schapire 2002)
Analysis:
Regret bound is ˜ O q
k T µ
2Runtime grows like (1/µ) 1/µ
2First approach – EXP4
Step 1: Dimensionality reduction to d 0 = ˜ O( µ 1
2) Step 2: Discretize H to (1/µ) d
0hypotheses
Apply EXP4 on the resulting finite set of hypotheses (Auer, Cesa-Bianchi, Freund, Schapire 2002)
Analysis:
Regret bound is ˜ O q
k T µ
2Runtime grows like (1/µ) 1/µ
2Second Approach – IDPK
1
Reduction to weighted binary classification
Similar to Bianca Zadrozny 2003, Alina Beygelzimer and John Langford 2009
2
Learning fuzzy halfspaces using the
Infinite-Dimensional-Polynomial-Kernel (S, Shamir, Sridharan 2009)
Reduction to weighted binary classification
The expected cost of a strategy p is:
E a∼p [c(a)] = p c(1) + (1 − p) c(0)
Limited feedback: We don’t observe c t but only c t (a t ) Define the following loss function:
` t (p) = ν t |p − y t | = ( c
t
(1)
˜
p
t|p − 0| if a t = 1
c
t(0)
1− ˜ p
t|p − 1| if a t = 0 Note that ` t only depends on available information and that:
E a
t∼˜ p
t[` t (p)] = E a∼p [c t (a)]
The above almost works – we should slightly change the probabilities so that ν t will not explode.
Bottom line: regret bound w.r.t. ` t ⇒ regret bound w.r.t. c t
Reduction to weighted binary classification
The expected cost of a strategy p is:
E a∼p [c(a)] = p c(1) + (1 − p) c(0)
Limited feedback: We don’t observe c t but only c t (a t )
Define the following loss function:
` t (p) = ν t |p − y t | = ( c
t
(1)
˜
p
t|p − 0| if a t = 1
c
t(0)
1− ˜ p
t|p − 1| if a t = 0 Note that ` t only depends on available information and that:
E a
t∼˜ p
t[` t (p)] = E a∼p [c t (a)]
The above almost works – we should slightly change the probabilities so that ν t will not explode.
Bottom line: regret bound w.r.t. ` t ⇒ regret bound w.r.t. c t
Reduction to weighted binary classification
The expected cost of a strategy p is:
E a∼p [c(a)] = p c(1) + (1 − p) c(0)
Limited feedback: We don’t observe c t but only c t (a t ) Define the following loss function:
` t (p) = ν t |p − y t | = ( c
t
(1)
˜
p
t|p − 0| if a t = 1
c
t(0)
1− ˜ p
t|p − 1| if a t = 0
Note that ` t only depends on available information and that: E a
t∼˜ p
t[` t (p)] = E a∼p [c t (a)]
The above almost works – we should slightly change the probabilities so that ν t will not explode.
Bottom line: regret bound w.r.t. ` t ⇒ regret bound w.r.t. c t
Reduction to weighted binary classification
The expected cost of a strategy p is:
E a∼p [c(a)] = p c(1) + (1 − p) c(0)
Limited feedback: We don’t observe c t but only c t (a t ) Define the following loss function:
` t (p) = ν t |p − y t | = ( c
t
(1)
˜
p
t|p − 0| if a t = 1
c
t(0)
1− ˜ p
t|p − 1| if a t = 0 Note that ` t only depends on available information and that:
E a
t∼˜ p
t[` t (p)] = E a∼p [c t (a)]
The above almost works – we should slightly change the probabilities so that ν t will not explode.
Bottom line: regret bound w.r.t. ` t ⇒ regret bound w.r.t. c t
Reduction to weighted binary classification
The expected cost of a strategy p is:
E a∼p [c(a)] = p c(1) + (1 − p) c(0)
Limited feedback: We don’t observe c t but only c t (a t ) Define the following loss function:
` t (p) = ν t |p − y t | = ( c
t