• No results found

Trading regret rate for computational efficiency in online learning with limited feedback

N/A
N/A
Protected

Academic year: 2021

Share "Trading regret rate for computational efficiency in online learning with limited feedback"

Copied!
48
0
0

Loading.... (view fulltext now)

Full text

(1)

Trading regret rate for computational efficiency in online learning with limited feedback

Shai Shalev-Shwartz

TTI-C Hebrew University

On-line Learning with Limited Feedback Workshop, 2009

June 2009

(2)

Online Learning with restricted computational power

Main Question

Given a runtime constraint τ , horizon T , reference class H:

What is the achievable regret of an algorithm whose (amortized) runtime is O(τ ) ?

Regret

τ

(3)

Online Learning with restricted computational power

Main Question

Given a runtime constraint τ , horizon T , reference class H:

What is the achievable regret of an algorithm whose (amortized) runtime is O(τ ) ?

Regret

τ

(4)

Online Learning with restricted computational power

Main Question

Given a runtime constraint τ , horizon T , reference class H:

What is the achievable regret of an algorithm whose (amortized) runtime is O(τ ) ?

Regret

τ

(5)

Non-stochastic Multi-armed bandit with side information

The prediction problem Arms: A = {1, . . . , k}

For t = 1, 2, . . . , T

Learner receives side information x t ∈ X

Environment chooses cost vector c t : A → [0, 1] (unknown to learner) Learner chooses action a t ∈ A

Learner pay cost c t (a t )

Goal

Low regret w.r.t. a reference hypothesis class H:

Regret def =

T

X

t=1

c t (a t ) − min

h∈H T

X

t=1

c t (h(x t ))

(6)

Non-stochastic Multi-armed bandit with side information

The prediction problem Arms: A = {1, . . . , k}

For t = 1, 2, . . . , T

Learner receives side information x t ∈ X

Environment chooses cost vector c t : A → [0, 1] (unknown to learner) Learner chooses action a t ∈ A

Learner pay cost c t (a t )

Goal

Low regret w.r.t. a reference hypothesis class H:

Regret def =

T

X c t (a t ) − min

T

X c t (h(x t ))

(7)

Outline

H is the class of linear hypotheses

Multiclass, separable

Regret

τ

Banditron Halving

2 arms, non-separable

Regret

τ

IDPK

EXP4

(8)

First example: Bandit Multi-class Categorization

For t = 1, 2, . . . , T

Learner receives side information x t ∈ X

Environment chooses ’correct’ arm y t ∈ A (unknown to learner) Learner chooses action a t ∈ A

Learner pay cost c t (a t ) = 1[a t 6= y t ]

(9)

Linear Hypotheses

H = {x 7→ argmax

r

(W x) r : W ∈ R k,d , kW k F ≤ 1}

R d

(10)

Large margin assumption

Assumption: Data is separable with margin µ:

∀t, ∀r 6= y t , (W x t ) y

t

− (W x t ) r ≥ µ

R d

(11)

First approach – Halving

Halving for Bandit Multiclass categorization Initialize: V 1 = H

For t = 1, 2, . . . Receive x t

For all r ∈ [k] let V t (r) = {h ∈ V t : h(x t ) = r}

Predict ˆ y t ∈ arg max r |V t (r)|

If 1[ˆ y t 6= y t ] set V t+1 = V t \ V t (ˆ y t )

Analysis:

Whenever we err |V t+1 | ≤ 1 − 1 k  |V t | ≤ exp(−1/k) |V t |

Therefore: M ≤ k log(|H|)

(12)

First approach – Halving

Halving for Bandit Multiclass categorization Initialize: V 1 = H

For t = 1, 2, . . . Receive x t

For all r ∈ [k] let V t (r) = {h ∈ V t : h(x t ) = r}

Predict ˆ y t ∈ arg max r |V t (r)|

If 1[ˆ y t 6= y t ] set V t+1 = V t \ V t (ˆ y t ) Analysis:

Whenever we err |V t+1 | ≤ 1 − 1 k  |V t | ≤ exp(−1/k) |V t |

Therefore: M ≤ k log(|H|)

(13)

Using Halving

Step 1: Dimensionality reduction to d 0 = ˜ O( µ 1

2

) Step 2: Discretize H to (1/µ) d

0

hypotheses

Apply Halving on the resulting finite set of hypotheses

Analysis:

Mistake bound is ˜ O

 k µ

2



But runtime grows like (1/µ) 1/µ

2

(14)

Using Halving

Step 1: Dimensionality reduction to d 0 = ˜ O( µ 1

2

) Step 2: Discretize H to (1/µ) d

0

hypotheses

Apply Halving on the resulting finite set of hypotheses Analysis:

Mistake bound is ˜ O

 k µ

2



But runtime grows like (1/µ) 1/µ

2

(15)

How can we improve runtime?

Halving is not efficient because it does not utilize the structure of H In the full information case: Halving can be made efficient because each version space V t can be made convex !

The Perceptron is a related approach which utilizes convexity and works in the full information case

Next approach: Lets try to rely on the Perceptron

(16)

The Mutliclass Perceptron

For t = 1, 2, . . . , T Receive x t ∈ R d

Predict ˆ y t = arg max r (W t x t ) r

Receive y t

If ˆ y t 6= y t update: W t+1 = W t + U t

U t =

0 . . . 0

.. .

0 . . . 0

. . . x t . . .

0 . . . 0

.. .

0 . . . 0

. . . −x t . . .

0 . . . 0

.. .

0 . . . 0

Row y t

Row ˆ y t

Problem: In the bandit case, we’re blind to value of y t

(17)

The Mutliclass Perceptron

For t = 1, 2, . . . , T Receive x t ∈ R d

Predict ˆ y t = arg max r (W t x t ) r

Receive y t

If ˆ y t 6= y t update: W t+1 = W t + U t U t =

0 . . . 0

.. .

0 . . . 0

. . . x t . . .

0 . . . 0

.. .

0 . . . 0

. . . −x t . . .

0 . . . 0

.. .

0 . . . 0

Row y t

Row ˆ y t

Problem: In the bandit case, we’re blind to value of y t

(18)

The Banditron (Kakade, S, Tewari 08)

Explore: From time to time, instead of predicting ˆ y t guess some ˜ y t

Suppose we get the feedback ’correct’, i.e. ˜ y t = y t

Then, we have full information for Perceptron’s update:

(x t , ˆ y t , ˜ y t = y t )

Exploration-Exploitation Tradeoff:

When exploring we may have ˜ y t = y t 6= ˆ y t and can learn from this

When exploring we may have ˜ y t 6= y t = ˆ y t and then we had the right

answer in our hands but didn’t exploit it

(19)

The Banditron (Kakade, S, Tewari 08)

Explore: From time to time, instead of predicting ˆ y t guess some ˜ y t

Suppose we get the feedback ’correct’, i.e. ˜ y t = y t

Then, we have full information for Perceptron’s update:

(x t , ˆ y t , ˜ y t = y t )

Exploration-Exploitation Tradeoff:

When exploring we may have ˜ y t = y t 6= ˆ y t and can learn from this

When exploring we may have ˜ y t 6= y t = ˆ y t and then we had the right

answer in our hands but didn’t exploit it

(20)

The Banditron (Kakade, S, Tewari 08)

For t = 1, 2, . . . , T Receive x t ∈ R d

Set ˆ y t = arg max r (W t x t ) r

Define: P (r) = (1 − γ)1[r = ˆ y t ] + γ k Randomly sample ˜ y t according to P Predict ˜ y t

Receive feedback 1[˜ y t = y t ] Update: W t+1 = W t + ˜ U t

where U ˜ t =

0 . . . 0

.. .

0 . . . 0

. . . 1[y P (˜

t

y y

t

]

t

) x t . . .

0 . . . 0

.. .

0 . . . 0

. . . −x t . . .

0 . . . 0

.. .

0 . . . 0

 Row ˜ y t

Row ˆ y t

(21)

The Banditron (Kakade, S, Tewari 08)

For t = 1, 2, . . . , T Receive x t ∈ R d

Set ˆ y t = arg max r (W t x t ) r

Define: P (r) = (1 − γ)1[r = ˆ y t ] + γ k Randomly sample ˜ y t according to P Predict ˜ y t

Receive feedback 1[˜ y t = y t ] Update: W t+1 = W t + ˜ U t where

U ˜ t =

0 . . . 0

.. .

0 . . . 0

. . . 1[y P (˜

t

y y

t

]

t

) x t . . .

0 . . . 0

.. .

0 . . . 0

. . . −x t . . .

0 . . . 0

.. .

0 . . . 0

 Row ˜ y t

Row ˆ y t

(22)

The Banditron (Kakade, S, Tewari 08)

Theorem

Banditron’s regret is O(pk T /µ 2 ) Banditron’s runtime is O(k/µ 2 )

The crux of difference between Halving and Banditron:

Without having the full information, the version space is non-convex and therefore it is hard to utilize the structure of H

Because we relied on the Perceptron we did utilize the structure of H and got an efficient algorithm

We managed to obtain ’full-information examples’ by using exploration

The price of exploration is a higher regret

(23)

The Banditron (Kakade, S, Tewari 08)

Theorem

Banditron’s regret is O(pk T /µ 2 ) Banditron’s runtime is O(k/µ 2 )

The crux of difference between Halving and Banditron:

Without having the full information, the version space is non-convex and therefore it is hard to utilize the structure of H

Because we relied on the Perceptron we did utilize the structure of H and got an efficient algorithm

We managed to obtain ’full-information examples’ by using exploration

The price of exploration is a higher regret

(24)

Intermediate Summary – Trading Regret for Efficiency

Algorithm Regret runtime

Halving µ k

2

 1 µ

 1/µ

2

Banditron

√ k T µ

k µ

2

Regret

τ

Banditron

Halving

(25)

Second example: general costs, non-separable, 2 arms

Action set A = {0, 1}

For t = 1, 2, . . .

Learner receives x t

Environment chooses cost vector c t : A → [0, 1] (unknown to Learner) Learner chooses ˜ p t ∈ [0, 1]

Learner chooses action a t ∈ A according to Pr[a t = 1] = ˜ p t

Learner pays cost c t (a t )

Remark: Can be extended to k arms using e.g. the offset tree

(Beygelzimer and Langford ’09)

(26)

Second example: general costs, non-separable, 2 arms

Action set A = {0, 1}

For t = 1, 2, . . .

Learner receives x t

Environment chooses cost vector c t : A → [0, 1] (unknown to Learner) Learner chooses ˜ p t ∈ [0, 1]

Learner chooses action a t ∈ A according to Pr[a t = 1] = ˜ p t

Learner pays cost c t (a t )

Remark: Can be extended to k arms using e.g. the offset tree

(Beygelzimer and Langford ’09)

(27)

Hypothesis class and regret goal

H = {x 7→ φ(hw, xi) : kwk 2 ≤ 1}, φ(z) = 1+exp(−z/µ) 1

-1 1

1

Goal: bounded regret

T

X

t=1

E a∼ ˜ p

t

[c t (a)] − min

h∈H T

X

t=1

E a∼h(x

t

) [c t (a)]

Challenging: even with full information no known efficient algorithms

(28)

Hypothesis class and regret goal

H = {x 7→ φ(hw, xi) : kwk 2 ≤ 1}, φ(z) = 1+exp(−z/µ) 1

-1 1

1

Goal: bounded regret

T

X

t=1

E a∼ ˜ p

t

[c t (a)] − min

h∈H T

X

t=1

E a∼h(x

t

) [c t (a)]

(29)

First approach – EXP4

Step 1: Dimensionality reduction to d 0 = ˜ O( µ 1

2

) Step 2: Discretize H to (1/µ) d

0

hypotheses

Apply EXP4 on the resulting finite set of hypotheses (Auer, Cesa-Bianchi, Freund, Schapire 2002)

Analysis:

Regret bound is ˜ O q

k T µ

2



Runtime grows like (1/µ) 1/µ

2

(30)

First approach – EXP4

Step 1: Dimensionality reduction to d 0 = ˜ O( µ 1

2

) Step 2: Discretize H to (1/µ) d

0

hypotheses

Apply EXP4 on the resulting finite set of hypotheses (Auer, Cesa-Bianchi, Freund, Schapire 2002)

Analysis:

Regret bound is ˜ O q

k T µ

2



Runtime grows like (1/µ) 1/µ

2

(31)

Second Approach – IDPK

1

Reduction to weighted binary classification

Similar to Bianca Zadrozny 2003, Alina Beygelzimer and John Langford 2009

2

Learning fuzzy halfspaces using the

Infinite-Dimensional-Polynomial-Kernel (S, Shamir, Sridharan 2009)

(32)

Reduction to weighted binary classification

The expected cost of a strategy p is:

E a∼p [c(a)] = p c(1) + (1 − p) c(0)

Limited feedback: We don’t observe c t but only c t (a t ) Define the following loss function:

` t (p) = ν t |p − y t | = ( c

t

(1)

˜

p

t

|p − 0| if a t = 1

c

t

(0)

1− ˜ p

t

|p − 1| if a t = 0 Note that ` t only depends on available information and that:

E a

t

∼˜ p

t

[` t (p)] = E a∼p [c t (a)]

The above almost works – we should slightly change the probabilities so that ν t will not explode.

Bottom line: regret bound w.r.t. ` t ⇒ regret bound w.r.t. c t

(33)

Reduction to weighted binary classification

The expected cost of a strategy p is:

E a∼p [c(a)] = p c(1) + (1 − p) c(0)

Limited feedback: We don’t observe c t but only c t (a t )

Define the following loss function:

` t (p) = ν t |p − y t | = ( c

t

(1)

˜

p

t

|p − 0| if a t = 1

c

t

(0)

1− ˜ p

t

|p − 1| if a t = 0 Note that ` t only depends on available information and that:

E a

t

∼˜ p

t

[` t (p)] = E a∼p [c t (a)]

The above almost works – we should slightly change the probabilities so that ν t will not explode.

Bottom line: regret bound w.r.t. ` t ⇒ regret bound w.r.t. c t

(34)

Reduction to weighted binary classification

The expected cost of a strategy p is:

E a∼p [c(a)] = p c(1) + (1 − p) c(0)

Limited feedback: We don’t observe c t but only c t (a t ) Define the following loss function:

` t (p) = ν t |p − y t | = ( c

t

(1)

˜

p

t

|p − 0| if a t = 1

c

t

(0)

1− ˜ p

t

|p − 1| if a t = 0

Note that ` t only depends on available information and that: E a

t

∼˜ p

t

[` t (p)] = E a∼p [c t (a)]

The above almost works – we should slightly change the probabilities so that ν t will not explode.

Bottom line: regret bound w.r.t. ` t ⇒ regret bound w.r.t. c t

(35)

Reduction to weighted binary classification

The expected cost of a strategy p is:

E a∼p [c(a)] = p c(1) + (1 − p) c(0)

Limited feedback: We don’t observe c t but only c t (a t ) Define the following loss function:

` t (p) = ν t |p − y t | = ( c

t

(1)

˜

p

t

|p − 0| if a t = 1

c

t

(0)

1− ˜ p

t

|p − 1| if a t = 0 Note that ` t only depends on available information and that:

E a

t

∼˜ p

t

[` t (p)] = E a∼p [c t (a)]

The above almost works – we should slightly change the probabilities so that ν t will not explode.

Bottom line: regret bound w.r.t. ` t ⇒ regret bound w.r.t. c t

(36)

Reduction to weighted binary classification

The expected cost of a strategy p is:

E a∼p [c(a)] = p c(1) + (1 − p) c(0)

Limited feedback: We don’t observe c t but only c t (a t ) Define the following loss function:

` t (p) = ν t |p − y t | = ( c

t

(1)

˜

p

t

|p − 0| if a t = 1

c

t

(0)

1− ˜ p

t

|p − 1| if a t = 0 Note that ` t only depends on available information and that:

E a

t

∼˜ p

t

[` t (p)] = E a∼p [c t (a)]

The above almost works – we should slightly change the probabilities

so that ν t will not explode.

(37)

Step 2 – Learning fuzzy halfspaces with IDPK

Goal: regret bound w.r.t. class H = {x 7→ φ(hw, xi)}

Working with expected 0 − 1 loss: |φ(hw, xi) − y|

Problem: The above is non-convex w.r.t. w

Main idea: Work with a larger hypothesis class for which the loss becomes convex

x 7→ φ(hw, xi)

x 7→ hv, ψ(x )i

(38)

Step 2 – Learning fuzzy halfspaces with IDPK

Goal: regret bound w.r.t. class H = {x 7→ φ(hw, xi)}

Working with expected 0 − 1 loss: |φ(hw, xi) − y|

Problem: The above is non-convex w.r.t. w

Main idea: Work with a larger hypothesis class for which the loss becomes convex

x 7→ φ(hw, xi)

x 7→ hv, ψ(x )i

(39)

Step 2 – Learning fuzzy halfspaces with IDPK

Goal: regret bound w.r.t. class H = {x 7→ φ(hw, xi)}

Working with expected 0 − 1 loss: |φ(hw, xi) − y|

Problem: The above is non-convex w.r.t. w

Main idea: Work with a larger hypothesis class for which the loss becomes convex

x 7→ φ(hw, xi)

x 7→ hv, ψ(x )i

(40)

Step 2 – Learning fuzzy halfspaces with IDPK

Original class: H = {h w (x) = φ(hw, xi) : kwk ≤ 1}

New class: H 0 = {h v (x) = hv, ψ(x)i : kvk ≤ B} where ψ : X → R N s.t. ∀j, ∀(i 1 , . . . , i j ), ψ(x) (i

1

,...,i

j

) = 2 j/2 x i

1

· · · x i

j

Lemma (S, Shamir, Sridharan 2009)

If B = exp( ˜ O(1/µ)) then for all h ∈ H exists h 0 ∈ H 0 s.t. for all x, h(x) ≈ h 0 (x).

Remark: The above is a pessimistic choice of B. In practice, smaller B suffices. Is it tight? Even if it is, are there natural assumptions under which a better bound holds ?

(e.g. Kalai, Klivans, Mansour, Servedio 2005)

(41)

Step 2 – Learning fuzzy halfspaces with IDPK

Original class: H = {h w (x) = φ(hw, xi) : kwk ≤ 1}

New class: H 0 = {h v (x) = hv, ψ(x)i : kvk ≤ B} where ψ : X → R N s.t. ∀j, ∀(i 1 , . . . , i j ), ψ(x) (i

1

,...,i

j

) = 2 j/2 x i

1

· · · x i

j

Lemma (S, Shamir, Sridharan 2009)

If B = exp( ˜ O(1/µ)) then for all h ∈ H exists h 0 ∈ H 0 s.t. for all x, h(x) ≈ h 0 (x).

Remark: The above is a pessimistic choice of B. In practice, smaller B suffices. Is it tight? Even if it is, are there natural assumptions under which a better bound holds ?

(e.g. Kalai, Klivans, Mansour, Servedio 2005)

(42)

Step 2 – Learning fuzzy halfspaces with IDPK

Original class: H = {h w (x) = φ(hw, xi) : kwk ≤ 1}

New class: H 0 = {h v (x) = hv, ψ(x)i : kvk ≤ B} where ψ : X → R N s.t. ∀j, ∀(i 1 , . . . , i j ), ψ(x) (i

1

,...,i

j

) = 2 j/2 x i

1

· · · x i

j

Lemma (S, Shamir, Sridharan 2009)

If B = exp( ˜ O(1/µ)) then for all h ∈ H exists h 0 ∈ H 0 s.t. for all x, h(x) ≈ h 0 (x).

Remark: The above is a pessimistic choice of B. In practice, smaller B suffices. Is it tight? Even if it is, are there natural assumptions under which a better bound holds ?

(e.g. Kalai, Klivans, Mansour, Servedio 2005)

(43)

Proof idea

Polynomial approximation: φ(z) ≈ P ∞ j=0 β j z j

Therefore:

φ(hw, xi) ≈

X

j=0

β j (hw, xi) j

=

X

j=0

X

k

1

,...,k

j

2 −j/2 β j 2 j/2 w k

1

· · · w k

j

x k

1

· · · · x k

j

= hv w , ψ(x)i

To obtain a concrete bound we use Chebyshev approximation technique: Family of orthogonal polynomials w.r.t. inner product:

hf, gi = Z 1

x=−1

f (x)g(x)

√ 1 − x 2 dx

(44)

Proof idea

Polynomial approximation: φ(z) ≈ P ∞ j=0 β j z j Therefore:

φ(hw, xi) ≈

X

j=0

β j (hw, xi) j

=

X

j=0

X

k

1

,...,k

j

2 −j/2 β j 2 j/2 w k

1

· · · w k

j

x k

1

· · · · x k

j

= hv w , ψ(x)i

To obtain a concrete bound we use Chebyshev approximation technique: Family of orthogonal polynomials w.r.t. inner product:

hf, gi = Z 1

x=−1

f (x)g(x)

√ 1 − x 2 dx

(45)

Proof idea

Polynomial approximation: φ(z) ≈ P ∞ j=0 β j z j Therefore:

φ(hw, xi) ≈

X

j=0

β j (hw, xi) j

=

X

j=0

X

k

1

,...,k

j

2 −j/2 β j 2 j/2 w k

1

· · · w k

j

x k

1

· · · · x k

j

= hv w , ψ(x)i

To obtain a concrete bound we use Chebyshev approximation technique: Family of orthogonal polynomials w.r.t. inner product:

hf, gi = Z 1

x=−1

f (x)g(x)

√ 1 − x 2 dx

(46)

Infinite-Dimensional-Polynomial-Kernel

Although the dimension is infinite, can be solved using the kernel trick The corresponding kernel (a.k.a. Vovk’s infinite polynomial) :

hψ(x), ψ(x 0 )i = K(x, x 0 ) = 1 1 − 1 2 hx, x 0 i

Algorithm boils down to online regression with the above kernel Convex! Can be solved e.g. using Zinkevich’s OCP

Regret bound is B √

T

(47)

Trading Regret for Efficiency

Algorithm Regret runtime

EXP4 pT /µ 2 exp( ˜ O(1/µ 2 ))

> <

IDPK T 3/4 exp( ˜ O(1/µ)) T

(48)

Summary

Trading regret rate for efficiency:

Multiclass, separable

Regret

τ

Banditron Halving

2 arms, non-separable

Regret

τ

IDPK EXP4

Open questions:

More points on the curve (new algorithms)

Lower bounds ???

References

Related documents

We generalize these (NASA specific) high level requirements into the following application independent requirements: support for adhoc ontology-based annotation of

Speaking a Java idiom, methods are synchronized, that is each method of the same object is executed in mutual exclusion, and method invocations are asynchronous, that is the

overspeed Verify switch Engine Monitoring and protection Maintenance clear switch Engine Monitoring and protection Maintenance overdue lamp Engine Monitoring and protection

This paper focuses on financial integration among stock exchange markets in four new EU member states (the Czech Republic, Hungary, Poland and Slovakia) in com- parison

Thank you for taking the time to complete this survey, which forms the basis for my MPH thesis at West Virginia University School of Public Health. Your answers will help

were evaluated for radiation-induced adverse side effects), 95 Norwegian women with no known history of cancer and 95 American breast cancer patients treated with radiation (44 of

The others (e.g. Playing Videos, adding the shutdown button) are not crucial to the camera project but can be done if you’re also interested in exploring these capabilities.