• No results found

compsci 514: algorithms for data science Andrew McGregor Lecture 23

N/A
N/A
Protected

Academic year: 2021

Share "compsci 514: algorithms for data science Andrew McGregor Lecture 23"

Copied!
50
0
0

Loading.... (view fulltext now)

Full text

(1)

compsci

514:

algorithms for data science

Andrew McGregor

Lecture 23

(2)

summary

Last Class:

• Analysis of gradient descent for optimizing convex functions. • Introduction to convex sets and projection functions.

• (The same) analysis of projected gradient descent for optimizing under convex functions under (convex) constraints.

This Class:

• Online learning, regret, and online gradient descent. • Application to stochastic gradient descent.

(3)

constrained convex optimization

Often want to performconvex optimization with convex constraints. ~

θ∗= arg min

~ θ∈S

f (~θ), where S is aconvex set.

Definition – Convex Set: A set S ⊆ Rd is convex if and only if,

for any ~θ1, ~θ2∈ S and λ ∈ [0, 1]:

(1 − λ)~θ1+ λ · ~θ2∈ S

For any convex set let PS(·) denote the projection function onto S:

PS(~y ) = arg min ~ θ∈S

(4)

constrained convex optimization

Often want to performconvex optimization with convex constraints. ~

θ∗= arg min

~ θ∈S

f (~θ), where S is aconvex set.

Definition – Convex Set: A set S ⊆ Rd is convex if and only if,

for any ~θ1, ~θ2∈ S and λ ∈ [0, 1]:

(1 − λ)~θ1+ λ · ~θ2∈ S

For any convex set let PS(·) denote the projection function onto S:

PS(~y ) = arg min ~ θ∈S

(5)

constrained convex optimization

Often want to performconvex optimization with convex constraints. ~

θ∗= arg min

~ θ∈S

f (~θ), where S is aconvex set.

Definition – Convex Set: A set S ⊆ Rd is convex if and only if,

for any ~θ1, ~θ2∈ S and λ ∈ [0, 1]:

(1 − λ)~θ1+ λ · ~θ2∈ S

For any convex set let PS(·) denote the projection function onto S:

PS(~y ) = arg min ~ θ∈S

(6)

projected gradient descent

Projected Gradient Descent

• Choose some initialization ~θ

1

and set η =

GR√t

.

• For i = 1, . . . , t − 1

• ~θi +1(out)= ~θi− η · ~∇f (~θi)

• ~θ

i +1= PS(~θ(out)i +1 ).

(7)

convex projections

Analysis of projected gradient descent is almost identifcal to gradient descent analysis!

Theorem – Projection to a convex set: For any convex set

S ⊆ Rd, ~

y ∈ Rd, and ~θ ∈ S,

(8)

convex projections

Analysis of projected gradient descent is almost identifcal to gradient descent analysis!

Theorem – Projection to a convex set: For any convex set

S ⊆ Rd, ~

y ∈ Rd, and ~θ ∈ S,

(9)

projected gradient descent analysis

Theorem – Projected GD: For convex G -Lipschitz function f ,

and convex set S, Projected GD run with t ≥ R2G22 iterations,

η = GR√

t, and starting point within radius R of ~θ∗= min~θ∈Sf (~θ),

outputs ˆθ satisfying:

f (ˆθ) ≤ f (~θ∗) + 

Recall: ~θ(out)i +1 = ~θi− η · ~∇f (~θi) and ~θi +1= PS(~θ(out)i +1 ).

Step 1: For all i , f (~θi) − f (~θ∗) ≤

k~θi−θ∗k2 2−k~θ (out) i +1 −~θ∗k 2 2 2η + ηG2 2 .

Step 1.a: For all i , f (~θi) − f (~θ∗) ≤ k~θi−~θ∗k 2 2−k~θi +1−~θ∗k 2 2 2η + ηG2 2 . Step 2: 1tPt i =1f (~θi) − f (~θ∗) ≤ R 2 2η·t + ηG2 2 =⇒ Theorem.

(10)

projected gradient descent analysis

Theorem – Projected GD: For convex G -Lipschitz function f ,

and convex set S, Projected GD run with t ≥ R2G22 iterations,

η = GR√

t, and starting point within radius R of ~θ∗= min~θ∈Sf (~θ),

outputs ˆθ satisfying:

f (ˆθ) ≤ f (~θ∗) + 

Recall: ~θ(out)i +1 = ~θi− η · ~∇f (~θi) and ~θi +1= PS(~θ(out)i +1 ).

Step 1: For all i , f (~θi) − f (~θ∗) ≤

k~θi−θ∗k2 2−k~θ (out) i +1 −~θ∗k 2 2 2η + ηG2 2 .

Step 1.a: For all i , f (~θi) − f (~θ∗) ≤ k~θi−~θ∗k 2 2−k~θi +1−~θ∗k 2 2 2η + ηG2 2 . Step 2: 1tPt i =1f (~θi) − f (~θ∗) ≤ R 2 2η·t + ηG2 2 =⇒ Theorem.

(11)

projected gradient descent analysis

Theorem – Projected GD: For convex G -Lipschitz function f ,

and convex set S, Projected GD run with t ≥ R2G22 iterations,

η = GR√

t, and starting point within radius R of ~θ∗= min~θ∈Sf (~θ),

outputs ˆθ satisfying:

f (ˆθ) ≤ f (~θ∗) + 

Recall: ~θ(out)i +1 = ~θi− η · ~∇f (~θi) and ~θi +1= PS(~θ(out)i +1 ).

Step 1: For all i , f (~θi) − f (~θ∗) ≤

k~θi−θ∗k2 2−k~θ (out) i +1 −~θ∗k 2 2 2η + ηG2 2 .

Step 1.a: For all i , f (~θi) − f (~θ∗) ≤ k~θi−~θ∗k 2 2−k~θi +1−~θ∗k 2 2 2η + ηG2 2 . Step 2: 1tPt i =1f (~θi) − f (~θ∗) ≤ R 2 2η·t + ηG2 2 =⇒ Theorem.

(12)

projected gradient descent analysis

Theorem – Projected GD: For convex G -Lipschitz function f ,

and convex set S, Projected GD run with t ≥ R2G22 iterations,

η = GR√

t, and starting point within radius R of ~θ∗= min~θ∈Sf (~θ),

outputs ˆθ satisfying:

f (ˆθ) ≤ f (~θ∗) + 

Recall: ~θ(out)i +1 = ~θi− η · ~∇f (~θi) and ~θi +1= PS(~θ(out)i +1 ).

Step 1: For all i , f (~θi) − f (~θ∗) ≤

k~θi−θ∗k2 2−k~θ (out) i +1 −~θ∗k 2 2 2η + ηG2 2 .

Step 1.a: For all i , f (~θi) − f (~θ∗) ≤ k~θi−~θ∗k 2 2−k~θi +1−~θ∗k 2 2 2η + ηG2 2 . Step 2: 1tPt i =1f (~θi) − f (~θ∗) ≤ R 2 2η·t + ηG2 2 =⇒ Theorem.

(13)

projected gradient descent analysis

Theorem – Projected GD: For convex G -Lipschitz function f ,

and convex set S, Projected GD run with t ≥ R2G22 iterations,

η = GR√

t, and starting point within radius R of ~θ∗= min~θ∈Sf (~θ),

outputs ˆθ satisfying:

f (ˆθ) ≤ f (~θ∗) + 

Recall: ~θ(out)i +1 = ~θi− η · ~∇f (~θi) and ~θi +1= PS(~θ(out)i +1 ).

Step 1: For all i , f (~θi) − f (~θ∗) ≤

k~θi−θ∗k2 2−k~θ (out) i +1 −~θ∗k 2 2 2η + ηG2 2 .

Step 1.a: For all i , f (~θi) − f (~θ∗) ≤ k~θi−~θ∗k 2 2−k~θi +1−~θ∗k 2 2 2η + ηG2 2 . Step 2: 1tPt i =1f (~θi) − f (~θ∗) ≤ R 2 2η·t + ηG2 2 =⇒ Theorem.

(14)

online gradient descent

In reality many learning problems are online.

• Websites optimize ads or recommendations to show users, given continuous feedback from these users.

• Spam filters are incrementally updated and adapt as they see more examples of spam over time.

• Face recognition systems, other classification systems, learn from mistakes over time.

Want to minimize some global loss L(~θ, X) =Pn

i =1`(~θ, ~xi), when data

points are presented in an online fashion ~x1, ~x2, . . . , ~xn(similar to

streaming algorithms)

Stochastic gradient descent is a special case: when data points are considered arandom orderfor computational reasons.

(15)

online gradient descent

In reality many learning problems are online.

• Websites optimize ads or recommendations to show users, given continuous feedback from these users.

• Spam filters are incrementally updated and adapt as they see more examples of spam over time.

• Face recognition systems, other classification systems, learn from mistakes over time.

Want to minimize some global loss L(~θ, X) =Pn

i =1`(~θ, ~xi), when data

points are presented in an online fashion ~x1, ~x2, . . . , ~xn(similar to

streaming algorithms)

Stochastic gradient descent is a special case: when data points are considered arandom orderfor computational reasons.

(16)

online gradient descent

In reality many learning problems are online.

• Websites optimize ads or recommendations to show users, given continuous feedback from these users.

• Spam filters are incrementally updated and adapt as they see more examples of spam over time.

• Face recognition systems, other classification systems, learn from mistakes over time.

Want to minimize some global loss L(~θ, X) =Pn

i =1`(~θ, ~xi), when data

points are presented in an online fashion ~x1, ~x2, . . . , ~xn(similar to

streaming algorithms)

Stochastic gradient descent is a special case: when data points are considered arandom orderfor computational reasons.

(17)

online optimization formal setup

Online Optimization: In place of a single function f , we see a

different objective function at each step:

f

1

, f

2

, . . . , f

t

: R

d

→ R

• At each step, first pick (play) a parameter vector ~θ

(i )

.

• Then are told f

i

and incur cost f

i

(~

θ

(i )

).

• Goal: Minimize total cost P

t

i =1

f

i

(~

θ

(i )

).

Our analysis will make no assumptions on how f

1

, . . . , f

t

are related

(18)

online optimization formal setup

Online Optimization: In place of a single function f , we see a

different objective function at each step:

f

1

, f

2

, . . . , f

t

: R

d

→ R

• At each step, first pick (play) a parameter vector ~θ

(i )

.

• Then are told f

i

and incur cost f

i

(~

θ

(i )

).

• Goal: Minimize total cost P

t

i =1

f

i

(~

θ

(i )

).

Our analysis will make no assumptions on how f

1

, . . . , f

t

are related

(19)

online optimization formal setup

Online Optimization: In place of a single function f , we see a

different objective function at each step:

f

1

, f

2

, . . . , f

t

: R

d

→ R

• At each step, first pick (play) a parameter vector ~θ

(i )

.

• Then are told f

i

and incur cost f

i

(~

θ

(i )

).

• Goal: Minimize total cost P

t

i =1

f

i

(~

θ

(i )

).

Our analysis will make no assumptions on how f

1

, . . . , f

t

are related

(20)

online optimization example

UI design via online optimization.

• Parameter vector ~θ

(i )

: some encoding of the layout at step i .

• Functions f

1

, . . . , f

t

: f

i

(~

θ

(i )

) = 1 if user does not click ‘add to

cart’ and f

i

(~

θ

(i )

) = 0 if they do click.

• Want to maximize number of purchases, i.e., minimize

P

t

(21)

online optimization example

Home pricing tools.

• Parameter vector ~θ

(i )

: coefficients of linear model at step i .

• Functions f

1

, . . . , f

t

: f

i

(~

θ

(i )

) = (h~

x

i

, ~

θ

(i )

i − price

i

)

2

revealed when

home

i

is listed or sold.

• Want to minimize total squared error P

t

i =1

f

i

(~

θ

(i )

) (same as

(22)

regret

In normal optimization, we seek ˆ

θ satisfying:

f (ˆ

θ) ≤ min

~ θ

f (~

θ) + .

In online optimization we will ask for the same.

t

X

i =1

f

i

(~

θ

(i )

) ≤ min

~ θ t

X

i =1

f

i

(~

θ) +  =

t

X

i =1

f

i

(~

θ

off

) + 

 is called the

regret

and /t is the average regret.

• This error metric is a bit

unusual

: Comparing online solution to

best fixed solution in hindsight.  can be negative!

(23)

regret

In normal optimization, we seek ˆ

θ satisfying:

f (ˆ

θ) ≤ min

~ θ

f (~

θ) + .

In online optimization we will ask for the same.

t

X

i =1

f

i

(~

θ

(i )

) ≤ min

~ θ t

X

i =1

f

i

(~

θ) +  =

t

X

i =1

f

i

(~

θ

off

) + 

 is called the

regret

and /t is the average regret.

• This error metric is a bit

unusual

: Comparing online solution to

best fixed solution in hindsight.  can be negative!

(24)

regret

In normal optimization, we seek ˆ

θ satisfying:

f (ˆ

θ) ≤ min

~ θ

f (~

θ) + .

In online optimization we will ask for the same.

t

X

i =1

f

i

(~

θ

(i )

) ≤ min

~ θ t

X

i =1

f

i

(~

θ) +  =

t

X

i =1

f

i

(~

θ

off

) + 

 is called the

regret

and /t is the average regret.

• This error metric is a bit

unusual

: Comparing online solution to

best fixed solution in hindsight.  can be negative!

(25)

intuition check

What if for i = 1, . . . , t, f

i

(θ) = |θ − 1000| or f

i

(θ) = |θ + 1000| in

an alternating pattern?

How small can the regret  be?

P

t

i =1

f

i

(~

θ

(i )

) ≤

P

ti =1

f

i

(~

θ

off

) + .

What if for i = 1, . . . , t, f

i

(θ) = |θ − 1000| or f

i

(θ) = |θ + 1000|

in

no particular pattern

? How can any online learning algorithm hope

(26)

intuition check

What if for i = 1, . . . , t, f

i

(θ) = |θ − 1000| or f

i

(θ) = |θ + 1000| in

an alternating pattern?

How small can the regret  be?

P

t

i =1

f

i

(~

θ

(i )

) ≤

P

ti =1

f

i

(~

θ

off

) + .

What if for i = 1, . . . , t, f

i

(θ) = |θ − 1000| or f

i

(θ) = |θ + 1000|

in

no particular pattern

? How can any online learning algorithm hope

(27)

online gradient descent

Assume that:

• f

1

, . . . , f

t

are all convex.

• Each f

i

is G -Lipschitz (i.e., k ~

∇f

i

(~

θ)k

2

≤ G for all ~

θ.)

• k~θ

(1)

− ~

θ

off

k

2

≤ R where θ

(1)

is the first vector chosen.

Online Gradient Descent

• Pick some initial ~θ

(1)

.

• Set step size η =

R G√t

.

• For i = 1, . . . , t

• Play ~θ(i ) and incur cost f i(~θ(i )).

• ~θ(i +1)= ~θ(i )− η · ~∇f i(~θ(i ))

(28)

online gradient descent

Assume that:

• f

1

, . . . , f

t

are all convex.

• Each f

i

is G -Lipschitz (i.e., k ~

∇f

i

(~

θ)k

2

≤ G for all ~

θ.)

• k~θ

(1)

− ~

θ

off

k

2

≤ R where θ

(1)

is the first vector chosen.

Online Gradient Descent

• Pick some initial ~θ

(1)

.

• Set step size η =

R G√t

.

• For i = 1, . . . , t

• Play ~θ(i ) and incur cost f i(~θ(i )).

(29)

online gradient descent analysis

Theorem – OGD on Convex Lipschitz Functions: For con-vex G -Lipschitz f1, . . . , ft, OGD initialized with starting point θ(1)

within radius R of θoff, using step size η = R

G√t, has regret bounded

by: " t X i =1 fi(θ(i )) − t X i =1 fi(θoff) # ≤ RG√t

Upper bound onaverage regret goes to 0 and t → ∞. No assumptions on f1, . . . , ft!

Step 1.1: For all i , ∇fi(θ(i ))T(θ(i )− θoff) ≤

kθ(i )−θoffk2 2−kθ (i +1)−θoffk2 2 2η + ηG2 2

Convexity =⇒ Step 1: For all i , fi(θ(i )) − fi(θoff) ≤ kθ(i )− θoffk2 2− kθ(i +1)− θoffk22 2η + ηG2 2 .

(30)

online gradient descent analysis

Theorem – OGD on Convex Lipschitz Functions: For con-vex G -Lipschitz f1, . . . , ft, OGD initialized with starting point θ(1)

within radius R of θoff, using step size η = R

G√t, has regret bounded

by: " t X i =1 fi(θ(i )) − t X i =1 fi(θoff) # ≤ RG√t

Upper bound onaverage regret goes to 0 and t → ∞.

No assumptions on f1, . . . , ft!

Step 1.1: For all i , ∇fi(θ(i ))T(θ(i )− θoff) ≤

kθ(i )−θoffk2 2−kθ (i +1)−θoffk2 2 2η + ηG2 2

Convexity =⇒ Step 1: For all i , fi(θ(i )) − fi(θoff) ≤ kθ(i )− θoffk2 2− kθ(i +1)− θoffk22 2η + ηG2 2 .

(31)

online gradient descent analysis

Theorem – OGD on Convex Lipschitz Functions: For con-vex G -Lipschitz f1, . . . , ft, OGD initialized with starting point θ(1)

within radius R of θoff, using step size η = R

G√t, has regret bounded

by: " t X i =1 fi(θ(i )) − t X i =1 fi(θoff) # ≤ RG√t

Upper bound onaverage regret goes to 0 and t → ∞. No assumptions on f1, . . . , ft!

Step 1.1: For all i , ∇fi(θ(i ))T(θ(i )− θoff) ≤

kθ(i )−θoffk2 2−kθ (i +1)−θoffk2 2 2η + ηG2 2

Convexity =⇒ Step 1: For all i , fi(θ(i )) − fi(θoff) ≤ kθ(i )− θoffk2 2− kθ(i +1)− θoffk22 2η + ηG2 2 .

(32)

online gradient descent analysis

Theorem – OGD on Convex Lipschitz Functions: For con-vex G -Lipschitz f1, . . . , ft, OGD initialized with starting point θ(1)

within radius R of θoff, using step size η = R

G√t, has regret bounded

by: " t X i =1 fi(θ(i )) − t X i =1 fi(θoff) # ≤ RG√t

Upper bound onaverage regret goes to 0 and t → ∞. No assumptions on f1, . . . , ft!

Step 1.1: For all i , ∇fi(θ(i ))T(θ(i )− θoff) ≤

kθ(i )−θoffk2

2−kθ(i +1)−θoffk22

2η +

ηG2 2

Convexity =⇒ Step 1: For all i , fi(θ(i )) − fi(θoff) ≤ kθ(i )− θoffk2 2− kθ(i +1)− θoffk22 2η + ηG2 2 .

(33)

online gradient descent analysis

Theorem – OGD on Convex Lipschitz Functions: For con-vex G -Lipschitz f1, . . . , ft, OGD initialized with starting point θ(1)

within radius R of θoff, using step size η = R

G√t, has regret bounded

by: " t X i =1 fi(θ(i )) − t X i =1 fi(θoff) # ≤ RG√t

Upper bound onaverage regret goes to 0 and t → ∞. No assumptions on f1, . . . , ft!

Step 1.1: For all i , ∇fi(θ(i ))T(θ(i )− θoff) ≤

kθ(i )−θoffk2

2−kθ(i +1)−θoffk22

2η +

ηG2 2

Convexity =⇒ Step 1: For all i , fi(θ(i )) − fi(θoff) ≤ kθ(i )− θoffk2 2− kθ(i +1)− θoffk22 2η + ηG2 2 .

(34)

online gradient descent analysis

Theorem – OGD on Convex Lipschitz Functions: For con-vex G -Lipschitz f1, . . . , ft, OGD initialized with starting point θ(1)

within radius R of θoff, using step size η = R

G√t, has regret bounded

by: " t X i =1 fi(θ(i )) − t X i =1 fi(θoff) # ≤ RG√t

Step 1: For all i , fi(θ(i )) − fi(θoff) ≤

kθ(i )−θoffk2 2−kθ(i +1)−θoffk22 2η + ηG2 2 =⇒ " t X i =1 fi(θ(i )) − t X i =1 fi(θoff) # ≤ t X i =1 kθ(i )− θoffk2 2− kθ (i +1)− θoffk2 2 2η + t · ηG2 2 .

(35)

online gradient descent analysis

Theorem – OGD on Convex Lipschitz Functions: For con-vex G -Lipschitz f1, . . . , ft, OGD initialized with starting point θ(1)

within radius R of θoff, using step size η = R

G√t, has regret bounded

by: " t X i =1 fi(θ(i )) − t X i =1 fi(θoff) # ≤ RG√t

Step 1: For all i , fi(θ(i )) − fi(θoff) ≤

kθ(i )−θoffk2 2−kθ(i +1)−θoffk22 2η + ηG2 2 =⇒ " t X i =1 fi(θ(i )) − t X i =1 fi(θoff) # ≤ t X i =1 kθ(i )− θoffk2 2− kθ (i +1)− θoffk2 2 2η + t · ηG2 2 .

(36)

stochastic gradient descent

Stochastic gradient descent is an efficient

offline optimization

method

, seeking ˆ

θ with

f (ˆ

θ) ≤ min

~ θ

f (~

θ) +  = f (~

θ

) + .

• The most popular optimization method in modern machine

learning.

(37)

stochastic gradient descent

Stochastic gradient descent is an efficient

offline optimization

method

, seeking ˆ

θ with

f (ˆ

θ) ≤ min

~ θ

f (~

θ) +  = f (~

θ

) + .

• The most popular optimization method in modern machine

learning.

(38)

stochastic gradient descent

Assume that:

• f is convex and decomposable as f (~θ) = P

n

j =1

f

j

(~

θ).

• E.g., L(~θ, X) = Pn

j =1`(~θ, ~xj).

• Each f

j

is

Gn

-Lipschitz (i.e., k ~

∇f

j

(~

θ)k

2

Gn

for all ~

θ.)

• What does this imply about how Lipschitz f is?

• Initialize with θ

(1)

satisfying k~

θ

(1)

− ~

θ

k

2

≤ R.

Stochastic Gradient Descent

• Pick some initial ~θ

(1)

.

• Set step size η =

R G√t

.

• For i = 1, . . . , t

• Pick random ji ∈ 1, . . . , n. • ~θ(i +1)= ~θ(i )− η · ~∇fji(~θ (i ))

• Return

θ =

ˆ

1 t

Pt

i =1

θ

~

(i )

.

(39)

stochastic gradient descent

Assume that:

• f is convex and decomposable as f (~θ) = P

n

j =1

f

j

(~

θ).

• E.g., L(~θ, X) = Pn

j =1`(~θ, ~xj).

• Each f

j

is

Gn

-Lipschitz (i.e., k ~

∇f

j

(~

θ)k

2

Gn

for all ~

θ.)

• What does this imply about how Lipschitz f is?

• Initialize with θ

(1)

satisfying k~

θ

(1)

− ~

θ

k

2

≤ R.

Stochastic Gradient Descent

• Pick some initial ~θ

(1)

.

• Set step size η =

R G√t

.

• For i = 1, . . . , t

• Pick random ji ∈ 1, . . . , n. • ~θ(i +1)= ~θ(i )− η · ~∇fji(~θ (i ))

• Return

θ =

ˆ

1 t

Pt

i =1

θ

~

(i )

.

(40)

stochastic gradient descent

Assume that:

• f is convex and decomposable as f (~θ) = P

n

j =1

f

j

(~

θ).

• E.g., L(~θ, X) = Pn

j =1`(~θ, ~xj).

• Each f

j

is

Gn

-Lipschitz (i.e., k ~

∇f

j

(~

θ)k

2

Gn

for all ~

θ.)

• What does this imply about how Lipschitz f is?

• Initialize with θ

(1)

satisfying k~

θ

(1)

− ~

θ

k

2

≤ R.

Stochastic Gradient Descent

• Pick some initial ~θ

(1)

.

• Set step size η =

R G√t

.

• For i = 1, . . . , t

• Pick random ji ∈ 1, . . . , n. • ~θ(i +1)= ~θ(i )− η ·∇fj~ i(~θ(i ))

• Return

θ =

ˆ

1 t

Pt

i =1

θ

~

(i )

.

(41)

stochastic gradient descent

Assume that:

• f is convex and decomposable as f (~θ) = P

n

j =1

f

j

(~

θ).

• E.g., L(~θ, X) = Pn

j =1`(~θ, ~xj).

• Each f

j

is

Gn

-Lipschitz (i.e., k ~

∇f

j

(~

θ)k

2

Gn

for all ~

θ.)

• What does this imply about how Lipschitz f is?

• Initialize with θ

(1)

satisfying k~

θ

(1)

− ~

θ

k

2

≤ R.

Stochastic Gradient Descent

• Pick some initial ~θ

(1)

.

• Set step size η =

R G√t

.

• For i = 1, . . . , t

• Pick random ji ∈ 1, . . . , n. • ~θ(i +1)= ~θ(i )− η · ~∇fji(~θ (i ))

• Return

θ =

ˆ

1 t

Pt

i =1

~

θ

(i )

.

(42)

stochastic gradient descent

~

θ

(i +1)

= ~

θ

(i )

− η · ~

∇f

ji

(~

θ

(i )

) vs. ~

θ

(i +1)

= ~

θ

(i )

− η · ~

∇f (~

θ

(i )

)

Note that: E[~

∇f

ji

(~

θ

(i )

)] =

1n

∇f (~

~

θ

(i )

).

Analysis extends to

any

algorithm that takes the gradient step

in

expectation

(minibatch SGD, randomly quantized, measurement

(43)

stochastic gradient descent analysis

Theorem – SGD on Convex Lipschitz Functions: SGD run with t ≥ R2G22 iterations, η =

R

G√t, and starting point within radius R

of θ∗, outputs ˆθ satisfying: E[f ( ˆθ)] ≤ f (θ∗) + .

Step 1: f (ˆθ) − f (θ∗) ≤1tPt i =1[f (θ (i )) − f (θ)] Step 2: E[f (ˆθ) − f (θ∗)] ≤ n t · E Pt i =1[fji(θ(i )) − fji(θ∗)] . Step 3: E[f (ˆθ) − f (θ∗)] ≤ nt · EPt i =1[fji(θ(i )) − fji(θoff)] . Step 4: E[f (ˆθ) − f (θ∗)] ≤ nt · R ·G n · √ t | {z } OGD bound = RG√ t.

(44)

stochastic gradient descent analysis

Theorem – SGD on Convex Lipschitz Functions: SGD run with t ≥ R2G22 iterations, η =

R

G√t, and starting point within radius R

of θ∗, outputs ˆθ satisfying: E[f ( ˆθ)] ≤ f (θ∗) + .

Step 1: f (ˆθ) − f (θ∗) ≤1tPt i =1[f (θ (i )) − f (θ)] Step 2: E[f (ˆθ) − f (θ∗)] ≤ n t · E Pt i =1[fji(θ(i )) − fji(θ∗)] . Step 3: E[f (ˆθ) − f (θ∗)] ≤ nt · EPt i =1[fji(θ(i )) − fji(θoff)] . Step 4: E[f (ˆθ) − f (θ∗)] ≤ nt · R ·G n · √ t | {z } OGD bound = RG√ t.

(45)

stochastic gradient descent analysis

Theorem – SGD on Convex Lipschitz Functions: SGD run with t ≥ R2G22 iterations, η =

R

G√t, and starting point within radius R

of θ∗, outputs ˆθ satisfying: E[f ( ˆθ)] ≤ f (θ∗) + .

Step 1: f (ˆθ) − f (θ∗) ≤1tPt i =1[f (θ (i )) − f (θ)] Step 2: E[f (ˆθ) − f (θ∗)] ≤ n t · E Pt i =1[fji(θ(i )) − fji(θ∗)] . Step 3: E[f (ˆθ) − f (θ∗)] ≤ nt · EPt i =1[fji(θ(i )) − fji(θoff)] . Step 4: E[f (ˆθ) − f (θ∗)] ≤ nt · R ·G n · √ t | {z } OGD bound = RG√ t.

(46)

stochastic gradient descent analysis

Theorem – SGD on Convex Lipschitz Functions: SGD run with t ≥ R2G22 iterations, η =

R

G√t, and starting point within radius R

of θ∗, outputs ˆθ satisfying: E[f ( ˆθ)] ≤ f (θ∗) + .

Step 1: f (ˆθ) − f (θ∗) ≤1tPt i =1[f (θ (i )) − f (θ)] Step 2: E[f (ˆθ) − f (θ∗)] ≤ n t · E Pt i =1[fji(θ(i )) − fji(θ∗)] . Step 3: E[f (ˆθ) − f (θ∗)] ≤ n t · E Pt i =1[fji(θ(i )) − fji(θoff)] . Step 4: E[f (ˆθ) − f (θ∗)] ≤ nt · R ·G n · √ t | {z } OGD bound = RG√ t.

(47)

stochastic gradient descent analysis

Theorem – SGD on Convex Lipschitz Functions: SGD run with t ≥ R2G22 iterations, η =

R

G√t, and starting point within radius R

of θ∗, outputs ˆθ satisfying: E[f ( ˆθ)] ≤ f (θ∗) + .

Step 1: f (ˆθ) − f (θ∗) ≤1tPt i =1[f (θ (i )) − f (θ)] Step 2: E[f (ˆθ) − f (θ∗)] ≤ n t · E Pt i =1[fji(θ(i )) − fji(θ∗)] . Step 3: E[f (ˆθ) − f (θ∗)] ≤ n t · E Pt i =1[fji(θ(i )) − fji(θoff)] . Step 4: E[f (ˆθ) − f (θ∗)] ≤ nt · R ·G n · √ t | {z } OGD bound = RG√ t.

(48)

sgd vs. gd

Stochastic gradient descent generally makes more iterations than

gradient descent.

Each iteration is much cheaper (by a factor of n).

~

n

X

j =1

f

j

(~

θ) vs. ~

∇f

j

(~

θ)

(49)

sgd vs. gd

When f (~θ) =Pn

j =1fj(~θ) and k ~∇fj(~θ)k2≤Gn:

Theorem – SGD: Aftert ≥ R2G22 iterations outputs ˆθ satisfying:

E[f ( ˆθ)] ≤ f (θ∗) + .

When k ~∇f (~θ)k2≤ ¯G :

Theorem – GD: Aftert ≥ R2G2¯2 iterations outputs ˆθ satisfying:

(50)

summary

• Introduced the online optimization problem and the notion of

regret to measure error in this setting.

• Introduced online gradient descent, which can solve online

convex optimization with average regret approaching 0.

• Introduced stochastic gradient descent, an offline optimization

method that can be analyzed as a special case of online gradient

descent.

References

Related documents

Telecommunications Opportunities Program, a Commerce Department initiative that provided grants for “for deploying broadband infrastructure in unserved and underserved areas in

In data analysis and machine learning, data points with many attributes are often stored, processed, and interpreted as high dimensional vectors , with real valued

[As described in the previous lecture.] – Stochastic gradient descent: faster than batch on large, redundant data sets.. [Whereas batch gradient descent walks downhill on one

The ideal of self-r eliance cer tainly does not exclude this other ideal of shar ing, of the g iv e-and-tak e, whether spontaneous or institutionalized, that

In the following, we describe a number of key features of our system. 1) Transparent virtual cloud computing environment: PaaS user needs to focus on the core functions

In this section, we illustrate how our methods can be used to quantify recovery thresholds for online SGD in inference tasks arising from various broad parametric families

We could also choose step to do the best we can along direction of negative gradient, called exact line search:.. t

On the contrary, the first-order algorithm GD, which it- eratively updates the solution using gradients, is efficient, but only guaranteed to converge to a first-order stationary