Trading regret rate for computational efficiency in online learning with limited feedback

(1)

Trading regret rate for computational efficiency in online learning with limited feedback

Shai Shalev-Shwartz

TTI-C Hebrew University

On-line Learning with Limited Feedback Workshop, 2009

June 2009

(2)

Online Learning with restricted computational power

Main Question

Given a runtime constraint τ , horizon T , reference class H:

What is the achievable regret of an algorithm whose (amortized) runtime is O(τ ) ?

Regret

τ

(3)

Online Learning with restricted computational power

Main Question

Given a runtime constraint τ , horizon T , reference class H:

What is the achievable regret of an algorithm whose (amortized) runtime is O(τ ) ?

Regret

τ

(4)

Online Learning with restricted computational power

Main Question

Given a runtime constraint τ , horizon T , reference class H:

What is the achievable regret of an algorithm whose (amortized) runtime is O(τ ) ?

Regret

τ

(5)

Non-stochastic Multi-armed bandit with side information

The prediction problem Arms: A = {1, . . . , k}

For t = 1, 2, . . . , T

Learner receives side information x _t ∈ X

Environment chooses cost vector c _t : A → [0, 1] (unknown to learner) Learner chooses action a t ∈ A

Learner pay cost c t (a t )

Goal

Low regret w.r.t. a reference hypothesis class H:

Regret ^def =

T

X

t=1

c t (a t ) − min

h∈H T

X

t=1

c t (h(x t ))

(6)

Non-stochastic Multi-armed bandit with side information

The prediction problem Arms: A = {1, . . . , k}

For t = 1, 2, . . . , T

Learner receives side information x _t ∈ X

Environment chooses cost vector c _t : A → [0, 1] (unknown to learner) Learner chooses action a t ∈ A

Learner pay cost c t (a t )

Goal

Low regret w.r.t. a reference hypothesis class H:

Regret ^def =

T

X c t (a t ) − min

T

X c t (h(x t ))

(7)

Outline

H is the class of linear hypotheses

Multiclass, separable

Regret

τ

Banditron Halving

2 arms, non-separable

Regret

τ

IDPK

EXP4

(8)

First example: Bandit Multi-class Categorization

For t = 1, 2, . . . , T

Learner receives side information x _t ∈ X

Environment chooses ’correct’ arm y _t ∈ A (unknown to learner) Learner chooses action a _t ∈ A

Learner pay cost c t (a t ) = 1[a t 6= y _t ]

(9)

Linear Hypotheses

H = {x 7→ argmax

r

(W x) _r : W ∈ R ^k,d , kW k _F ≤ 1}

R ^d

(10)

Large margin assumption

Assumption: Data is separable with margin µ:

∀t, ∀r 6= y _t , (W x t ) y

t

− (W x _t ) r ≥ µ

R ^d

(11)

First approach – Halving

Halving for Bandit Multiclass categorization Initialize: V ₁ = H

For t = 1, 2, . . . Receive x _t

For all r ∈ [k] let V _t (r) = {h ∈ V _t : h(x _t ) = r}

Predict ˆ y _t ∈ arg max _r |V _t (r)|

If 1[ˆ y t 6= y _t ] set V t+1 = V t \ V _t (ˆ y t )

Analysis:

Whenever we err |V _t+1 | ≤ 1 − ¹ _k |V _t | ≤ exp(−1/k) |V _t |

Therefore: M ≤ k log(|H|)

(12)

First approach – Halving

Halving for Bandit Multiclass categorization Initialize: V ₁ = H

For t = 1, 2, . . . Receive x _t

For all r ∈ [k] let V _t (r) = {h ∈ V _t : h(x _t ) = r}

Predict ˆ y _t ∈ arg max _r |V _t (r)|

If 1[ˆ y t 6= y _t ] set V t+1 = V t \ V _t (ˆ y t ) Analysis:

Whenever we err |V _t+1 | ≤ 1 − ¹ _k |V _t | ≤ exp(−1/k) |V _t |

Therefore: M ≤ k log(|H|)

(13)

Using Halving

Step 1: Dimensionality reduction to d ⁰ = ˜ O( _µ ¹

2

) Step 2: Discretize H to (1/µ) ^d

⁰

hypotheses

Apply Halving on the resulting finite set of hypotheses

Analysis:

Mistake bound is ˜ O

k µ

²

But runtime grows like (1/µ) ^1/µ

²

(14)

Using Halving

Step 1: Dimensionality reduction to d ⁰ = ˜ O( _µ ¹

2

) Step 2: Discretize H to (1/µ) ^d

⁰

hypotheses

Apply Halving on the resulting finite set of hypotheses Analysis:

Mistake bound is ˜ O

k µ

²

But runtime grows like (1/µ) ^1/µ

²

(15)

How can we improve runtime?

Halving is not efficient because it does not utilize the structure of H In the full information case: Halving can be made efficient because each version space V _t can be made convex !

The Perceptron is a related approach which utilizes convexity and works in the full information case

Next approach: Lets try to rely on the Perceptron

(16)

The Mutliclass Perceptron

For t = 1, 2, . . . , T Receive x t ∈ R ^d

Predict ˆ y t = arg max r (W ^t x t ) r

Receive y _t

If ˆ y _t 6= y _t update: W ^t+1 = W ^t + U ^t

U ^t =







0 . . . 0

.. .

0 . . . 0

. . . x _t . . .

0 . . . 0

.. .

0 . . . 0

. . . −x t . . .

0 . . . 0

.. .

0 . . . 0







Row y t

Row ˆ y t

Problem: In the bandit case, we’re blind to value of y _t

(17)

The Mutliclass Perceptron

For t = 1, 2, . . . , T Receive x t ∈ R ^d

Predict ˆ y t = arg max r (W ^t x t ) r

Receive y _t

If ˆ y _t 6= y _t update: W ^t+1 = W ^t + U ^t U ^t =







0 . . . 0

.. .

0 . . . 0

. . . x _t . . .

0 . . . 0

.. .

0 . . . 0

. . . −x t . . .

0 . . . 0

.. .

0 . . . 0







Row y t

Row ˆ y t

Problem: In the bandit case, we’re blind to value of y _t

(18)

The Banditron (Kakade, S, Tewari 08)

Explore: From time to time, instead of predicting ˆ y t guess some ˜ y t

Suppose we get the feedback ’correct’, i.e. ˜ y t = y t

Then, we have full information for Perceptron’s update:

(x t , ˆ y t , ˜ y t = y t )

Exploration-Exploitation Tradeoff:

When exploring we may have ˜ y t = y t 6= ˆ y t and can learn from this

When exploring we may have ˜ y t 6= y t = ˆ y t and then we had the right

answer in our hands but didn’t exploit it

(19)

The Banditron (Kakade, S, Tewari 08)

Explore: From time to time, instead of predicting ˆ y t guess some ˜ y t

Suppose we get the feedback ’correct’, i.e. ˜ y t = y t

Then, we have full information for Perceptron’s update:

(x t , ˆ y t , ˜ y t = y t )

Exploration-Exploitation Tradeoff:

When exploring we may have ˜ y t = y t 6= ˆ y t and can learn from this

When exploring we may have ˜ y t 6= y t = ˆ y t and then we had the right

answer in our hands but didn’t exploit it

(20)

The Banditron (Kakade, S, Tewari 08)

For t = 1, 2, . . . , T Receive x t ∈ R ^d

Set ˆ y _t = arg max _r (W ^t x _t ) _r

Define: P (r) = (1 − γ)1[r = ˆ y _t ] + ^γ _k Randomly sample ˜ y t according to P Predict ˜ y t

Receive feedback 1[˜ y _t = y _t ] Update: W ^t+1 = W ^t + ˜ U ^t

where U ˜ ^t =







0 . . . 0

.. .

0 . . . 0

. . . ^1[y _{P (˜}

^t

^=˜ _y ^y

^t

^]

t

) x t . . .

0 . . . 0

.. .

0 . . . 0

. . . −x t . . .

0 . . . 0

.. .

0 . . . 0





 Row ˜ y _t

Row ˆ y _t

(21)

The Banditron (Kakade, S, Tewari 08)

For t = 1, 2, . . . , T Receive x t ∈ R ^d

Set ˆ y _t = arg max _r (W ^t x _t ) _r

Define: P (r) = (1 − γ)1[r = ˆ y _t ] + ^γ _k Randomly sample ˜ y t according to P Predict ˜ y t

Receive feedback 1[˜ y _t = y _t ] Update: W ^t+1 = W ^t + ˜ U ^t where

U ˜ ^t =







0 . . . 0

.. .

0 . . . 0

. . . ^1[y _{P (˜}

^t

^=˜ _y ^y

^t

^]

t

) x t . . .

0 . . . 0

.. .

0 . . . 0

. . . −x t . . .

0 . . . 0

.. .

0 . . . 0





 Row ˜ y _t

Row ˆ y _t

(22)

The Banditron (Kakade, S, Tewari 08)

Theorem

Banditron’s regret is O(pk T /µ ² ) Banditron’s runtime is O(k/µ ² )

The crux of difference between Halving and Banditron:

Without having the full information, the version space is non-convex and therefore it is hard to utilize the structure of H

Because we relied on the Perceptron we did utilize the structure of H and got an efficient algorithm

We managed to obtain ’full-information examples’ by using exploration

The price of exploration is a higher regret

(23)

The Banditron (Kakade, S, Tewari 08)

Theorem

Banditron’s regret is O(pk T /µ ² ) Banditron’s runtime is O(k/µ ² )

The crux of difference between Halving and Banditron:

Without having the full information, the version space is non-convex and therefore it is hard to utilize the structure of H

Because we relied on the Perceptron we did utilize the structure of H and got an efficient algorithm

We managed to obtain ’full-information examples’ by using exploration

The price of exploration is a higher regret

(24)

Intermediate Summary – Trading Regret for Efficiency

Algorithm Regret runtime

Halving _µ ^k

2

1 µ

1/µ

²

Banditron

√ k T µ

k µ

²

Regret

τ

Banditron

Halving

(25)

Second example: general costs, non-separable, 2 arms

Action set A = {0, 1}

For t = 1, 2, . . .

Learner receives x _t

Environment chooses cost vector c _t : A → [0, 1] (unknown to Learner) Learner chooses ˜ p _t ∈ [0, 1]

Learner chooses action a t ∈ A according to Pr[a _t = 1] = ˜ p t

Learner pays cost c t (a t )

Remark: Can be extended to k arms using e.g. the offset tree

(Beygelzimer and Langford ’09)

(26)

Second example: general costs, non-separable, 2 arms

Action set A = {0, 1}

For t = 1, 2, . . .

Learner receives x _t

Environment chooses cost vector c _t : A → [0, 1] (unknown to Learner) Learner chooses ˜ p _t ∈ [0, 1]

Learner chooses action a t ∈ A according to Pr[a _t = 1] = ˜ p t

Learner pays cost c t (a t )

Remark: Can be extended to k arms using e.g. the offset tree

(Beygelzimer and Langford ’09)

(27)

Hypothesis class and regret goal

H = {x 7→ φ(hw, xi) : kwk ₂ ≤ 1}, φ(z) = 1+exp(−z/µ) ¹

-1 1

1 Goal: bounded regret

T

X

t=1

E a∼ ˜ p

t

[c t (a)] − min

h∈H T

X

t=1

E a∼h(x

t

) [c t (a)]

Challenging: even with full information no known efficient algorithms

(28)

Hypothesis class and regret goal

H = {x 7→ φ(hw, xi) : kwk ₂ ≤ 1}, φ(z) = 1+exp(−z/µ) ¹

-1 1

1 Goal: bounded regret

T

X

t=1

E a∼ ˜ p

t

[c t (a)] − min

h∈H T

X

t=1

E a∼h(x

t

) [c t (a)]

(29)

First approach – EXP4

Step 1: Dimensionality reduction to d ⁰ = ˜ O( _µ ¹

2

) Step 2: Discretize H to (1/µ) ^d

⁰

hypotheses

Apply EXP4 on the resulting finite set of hypotheses (Auer, Cesa-Bianchi, Freund, Schapire 2002)

Analysis:

Regret bound is ˜ O q

k T µ

²

Runtime grows like (1/µ) ^1/µ

²

(30)

First approach – EXP4

Step 1: Dimensionality reduction to d ⁰ = ˜ O( _µ ¹

2

) Step 2: Discretize H to (1/µ) ^d

⁰

hypotheses

Apply EXP4 on the resulting finite set of hypotheses (Auer, Cesa-Bianchi, Freund, Schapire 2002)

Analysis:

Regret bound is ˜ O q

k T µ

²

Runtime grows like (1/µ) ^1/µ

²

(31)

Second Approach – IDPK

1

Reduction to weighted binary classification

Similar to Bianca Zadrozny 2003, Alina Beygelzimer and John Langford 2009

2

Learning fuzzy halfspaces using the

Infinite-Dimensional-Polynomial-Kernel (S, Shamir, Sridharan 2009)

(32)

Reduction to weighted binary classification

The expected cost of a strategy p is:

E a∼p [c(a)] = p c(1) + (1 − p) c(0)

Limited feedback: We don’t observe c t but only c t (a t ) Define the following loss function:

` t (p) = ν t |p − y _t | = ( _c

t

(1)

˜

p

t

|p − 0| if a t = 1

c

t

(0)

1− ˜ p

t

|p − 1| if a t = 0 Note that ` t only depends on available information and that:

E a

t

∼˜ p

t

[` t (p)] = E a∼p [c t (a)]

The above almost works – we should slightly change the probabilities so that ν t will not explode.

Bottom line: regret bound w.r.t. ` _t ⇒ regret bound w.r.t. c _t

(33)

Reduction to weighted binary classification

The expected cost of a strategy p is:

E a∼p [c(a)] = p c(1) + (1 − p) c(0)

Limited feedback: We don’t observe c t but only c t (a t )

Define the following loss function:

` t (p) = ν t |p − y _t | = ( _c

t

(1)

˜

p

t

|p − 0| if a t = 1

c

t

(0)

1− ˜ p

t

|p − 1| if a t = 0 Note that ` t only depends on available information and that:

E a

t

∼˜ p

t

[` t (p)] = E a∼p [c t (a)]

The above almost works – we should slightly change the probabilities so that ν t will not explode.

Bottom line: regret bound w.r.t. ` _t ⇒ regret bound w.r.t. c _t

(34)

Reduction to weighted binary classification

The expected cost of a strategy p is:

E a∼p [c(a)] = p c(1) + (1 − p) c(0)

Limited feedback: We don’t observe c t but only c t (a t ) Define the following loss function:

` t (p) = ν t |p − y _t | = ( _c

t

(1)

˜

p

t

|p − 0| if a t = 1

c

t

(0)

1− ˜ p

t

|p − 1| if a t = 0

Note that ` t only depends on available information and that: E a

t

∼˜ p

t

[` t (p)] = E a∼p [c t (a)]

The above almost works – we should slightly change the probabilities so that ν t will not explode.

Bottom line: regret bound w.r.t. ` _t ⇒ regret bound w.r.t. c _t

(35)

Reduction to weighted binary classification

The expected cost of a strategy p is:

E a∼p [c(a)] = p c(1) + (1 − p) c(0)

Limited feedback: We don’t observe c t but only c t (a t ) Define the following loss function:

` t (p) = ν t |p − y _t | = ( _c

t

(1)

˜

p

t

|p − 0| if a t = 1

c

t

(0)

1− ˜ p

t

|p − 1| if a t = 0 Note that ` t only depends on available information and that:

E a

t

∼˜ p

t

[` t (p)] = E a∼p [c t (a)]

The above almost works – we should slightly change the probabilities so that ν t will not explode.

Bottom line: regret bound w.r.t. ` _t ⇒ regret bound w.r.t. c _t

(36)

Reduction to weighted binary classification

The expected cost of a strategy p is:

E a∼p [c(a)] = p c(1) + (1 − p) c(0)

Limited feedback: We don’t observe c t but only c t (a t ) Define the following loss function:

` t (p) = ν t |p − y _t | = ( _c

t

(1)

˜

p

t

|p − 0| if a t = 1

c

t

(0)

1− ˜ p

t

|p − 1| if a t = 0 Note that ` t only depends on available information and that:

E a

t

∼˜ p

t

[` t (p)] = E a∼p [c t (a)]

The above almost works – we should slightly change the probabilities

so that ν t will not explode.

(37)

Step 2 – Learning fuzzy halfspaces with IDPK

Goal: regret bound w.r.t. class H = {x 7→ φ(hw, xi)}

Working with expected 0 − 1 loss: |φ(hw, xi) − y|

Problem: The above is non-convex w.r.t. w

Main idea: Work with a larger hypothesis class for which the loss becomes convex

x 7→ φ(hw, xi)

x 7→ hv, ψ(x )i

(38)

Step 2 – Learning fuzzy halfspaces with IDPK

Goal: regret bound w.r.t. class H = {x 7→ φ(hw, xi)}

Working with expected 0 − 1 loss: |φ(hw, xi) − y|

Problem: The above is non-convex w.r.t. w

Main idea: Work with a larger hypothesis class for which the loss becomes convex

x 7→ φ(hw, xi)

x 7→ hv, ψ(x )i

(39)

Step 2 – Learning fuzzy halfspaces with IDPK

Goal: regret bound w.r.t. class H = {x 7→ φ(hw, xi)}

Working with expected 0 − 1 loss: |φ(hw, xi) − y|

Problem: The above is non-convex w.r.t. w

Main idea: Work with a larger hypothesis class for which the loss becomes convex

x 7→ φ(hw, xi)

x 7→ hv, ψ(x )i

(40)

Step 2 – Learning fuzzy halfspaces with IDPK

Original class: H = {h _w (x) = φ(hw, xi) : kwk ≤ 1}

New class: H ⁰ = {h v (x) = hv, ψ(x)i : kvk ≤ B} where ψ : X → R ^N s.t. ∀j, ∀(i 1 , . . . , i j ), ψ(x) _(i

₁

_,...,i

_j

₎ = 2 ^j/2 x i

1

· · · x _i

_j

Lemma (S, Shamir, Sridharan 2009)

If B = exp( ˜ O(1/µ)) then for all h ∈ H exists h ⁰ ∈ H ⁰ s.t. for all x, h(x) ≈ h ⁰ (x).

Remark: The above is a pessimistic choice of B. In practice, smaller B suffices. Is it tight? Even if it is, are there natural assumptions under which a better bound holds ?

(e.g. Kalai, Klivans, Mansour, Servedio 2005)

(41)

Step 2 – Learning fuzzy halfspaces with IDPK

Original class: H = {h _w (x) = φ(hw, xi) : kwk ≤ 1}

New class: H ⁰ = {h v (x) = hv, ψ(x)i : kvk ≤ B} where ψ : X → R ^N s.t. ∀j, ∀(i 1 , . . . , i j ), ψ(x) _(i

₁

_,...,i

_j

₎ = 2 ^j/2 x i

1

· · · x _i

_j

Lemma (S, Shamir, Sridharan 2009)

If B = exp( ˜ O(1/µ)) then for all h ∈ H exists h ⁰ ∈ H ⁰ s.t. for all x, h(x) ≈ h ⁰ (x).

Remark: The above is a pessimistic choice of B. In practice, smaller B suffices. Is it tight? Even if it is, are there natural assumptions under which a better bound holds ?

(e.g. Kalai, Klivans, Mansour, Servedio 2005)

(42)

Step 2 – Learning fuzzy halfspaces with IDPK

Original class: H = {h _w (x) = φ(hw, xi) : kwk ≤ 1}

New class: H ⁰ = {h v (x) = hv, ψ(x)i : kvk ≤ B} where ψ : X → R ^N s.t. ∀j, ∀(i 1 , . . . , i j ), ψ(x) _(i

₁

_,...,i

_j

₎ = 2 ^j/2 x i

1

· · · x _i

_j

Lemma (S, Shamir, Sridharan 2009)

If B = exp( ˜ O(1/µ)) then for all h ∈ H exists h ⁰ ∈ H ⁰ s.t. for all x, h(x) ≈ h ⁰ (x).

Remark: The above is a pessimistic choice of B. In practice, smaller B suffices. Is it tight? Even if it is, are there natural assumptions under which a better bound holds ?

(e.g. Kalai, Klivans, Mansour, Servedio 2005)

(43)

Proof idea

Polynomial approximation: φ(z) ≈ P ∞ j=0 β j z ^j

Therefore:

φ(hw, xi) ≈

∞

X

j=0

β j (hw, xi) ^j

=

∞

X

j=0

X

k

1

,...,k

j

2 ^−j/2 β j 2 ^j/2 w k

1

· · · w _k

_j

x k

1

· · · · x _k

_j

= hv _w , ψ(x)i

To obtain a concrete bound we use Chebyshev approximation technique: Family of orthogonal polynomials w.r.t. inner product:

hf, gi = Z ₁

x=−1

f (x)g(x)

√ 1 − x ² dx

(44)

Proof idea

Polynomial approximation: φ(z) ≈ P ∞ j=0 β j z ^j Therefore:

φ(hw, xi) ≈

∞

X

j=0

β j (hw, xi) ^j

=

∞

X

j=0

X

k

1

,...,k

j

2 ^−j/2 β j 2 ^j/2 w k

1

· · · w _k

_j

x k

1

· · · · x _k

_j

= hv _w , ψ(x)i

To obtain a concrete bound we use Chebyshev approximation technique: Family of orthogonal polynomials w.r.t. inner product:

hf, gi = Z ₁

x=−1

f (x)g(x)

√ 1 − x ² dx

(45)

Proof idea

Polynomial approximation: φ(z) ≈ P ∞ j=0 β j z ^j Therefore:

φ(hw, xi) ≈

∞

X

j=0

β j (hw, xi) ^j

=

∞

X

j=0

X

k

1

,...,k

j

2 ^−j/2 β j 2 ^j/2 w k

1

· · · w _k

_j

x k

1

· · · · x _k

_j

= hv _w , ψ(x)i

To obtain a concrete bound we use Chebyshev approximation technique: Family of orthogonal polynomials w.r.t. inner product:

hf, gi = Z 1

x=−1

f (x)g(x)

√ 1 − x ² dx

(46)

Infinite-Dimensional-Polynomial-Kernel

Although the dimension is infinite, can be solved using the kernel trick The corresponding kernel (a.k.a. Vovk’s infinite polynomial) :

hψ(x), ψ(x ⁰ )i = K(x, x ⁰ ) = 1 1 − ¹ ₂ hx, x ⁰ i

Algorithm boils down to online regression with the above kernel Convex! Can be solved e.g. using Zinkevich’s OCP

Trading regret rate for computational efficiency in online learning with limited feedback