compsci 514: algorithms for data science Andrew McGregor Lecture 23

(1)

compsci

514:

algorithms for data science

Andrew McGregor

Lecture 23

(2)

summary

Last Class:

• Analysis of gradient descent for optimizing convex functions. • Introduction to convex sets and projection functions.

• (The same) analysis of projected gradient descent for optimizing under convex functions under (convex) constraints.

This Class:

• Online learning, regret, and online gradient descent. • Application to stochastic gradient descent.

(3)

constrained convex optimization

Often want to performconvex optimization with convex constraints. ~

θ∗= arg min

~ θ∈S

f (~θ), where S is aconvex set.

Definition – Convex Set: A set S ⊆ Rd is convex if and only if,

for any ~θ1, ~θ2∈ S and λ ∈ [0, 1]:

(1 − λ)~θ1+ λ · ~θ2∈ S

For any convex set let PS(·) denote the projection function onto S:

PS(~y ) = arg min ~ θ∈S

(4)

constrained convex optimization

θ∗= arg min

~ θ∈S

for any ~θ1, ~θ2∈ S and λ ∈ [0, 1]:

(1 − λ)~θ1+ λ · ~θ2∈ S

(5)

constrained convex optimization

θ∗= arg min

~ θ∈S

for any ~θ1, ~θ2∈ S and λ ∈ [0, 1]:

(1 − λ)~θ1+ λ · ~θ2∈ S

(6)

projected gradient descent

Projected Gradient Descent

• Choose some initialization ~θ

1

and set η =

_GR√_t

.

• For i = 1, . . . , t − 1

• ~θ_{i +1}(out)= ~θi− η · ~∇f (~θi)

• ~_θ

i +1= PS(~θ(out)i +1 ).

(7)

convex projections

Analysis of projected gradient descent is almost identifcal to gradient descent analysis!

Theorem – Projection to a convex set: For any convex set

S ⊆ Rd_{, ~}

y ∈ Rd_{, and ~}_{θ ∈ S,}

(8)

convex projections

Analysis of projected gradient descent is almost identifcal to gradient descent analysis!

Theorem – Projection to a convex set: For any convex set

S ⊆ Rd_{, ~}

y ∈ Rd_{, and ~}_{θ ∈ S,}

(9)

projected gradient descent analysis

Theorem – Projected GD: For convex G -Lipschitz function f ,

and convex set S, Projected GD run with t ≥ R2G22 iterations,

η = _GR√

t, and starting point within radius R of ~θ∗= min~θ∈Sf (~θ),

outputs ˆθ satisfying:

f (ˆθ) ≤ f (~θ∗) +

Recall: ~θ(out)_{i +1} = ~θi− η · ~∇f (~θi) and ~θi +1= PS(~θ(out)i +1 ).

Step 1: For all i , f (~θi) − f (~θ∗) ≤

k~θi−θ∗k2 2−k~θ (out) i +1 −~θ∗k 2 2 2η + ηG2 2 .

Step 1.a: For all i , f (~θi) − f (~θ∗) ≤ k~θi−~θ∗k 2 2−k~θi +1−~θ∗k 2 2 2η + ηG2 2 . Step 2: 1_tPt i =1f (~θi) − f (~θ∗) ≤ R 2 2η·t + ηG2 2 =⇒ Theorem.

(10)

projected gradient descent analysis

η = _GR√

f (ˆθ) ≤ f (~θ∗) +

(11)

projected gradient descent analysis

η = _GR√

f (ˆθ) ≤ f (~θ∗) +

(12)

projected gradient descent analysis

η = _GR√

f (ˆθ) ≤ f (~θ∗) +

(13)

projected gradient descent analysis

η = _GR√

f (ˆθ) ≤ f (~θ∗) +

(14)

online gradient descent

In reality many learning problems are online.

• Websites optimize ads or recommendations to show users, given continuous feedback from these users.

• Spam filters are incrementally updated and adapt as they see more examples of spam over time.

• Face recognition systems, other classification systems, learn from mistakes over time.

Want to minimize some global loss L(~θ, X) =Pn

i =1`(~θ, ~xi), when data

points are presented in an online fashion ~x1, ~x2, . . . , ~xn(similar to

streaming algorithms)

Stochastic gradient descent is a special case: when data points are considered arandom orderfor computational reasons.

(15)

online gradient descent

(16)

online gradient descent

(17)

online optimization formal setup

Online Optimization: In place of a single function f , we see a

different objective function at each step:

f

1

, f

2

, . . . , f

t

: R

d

→ R

• At each step, first pick (play) a parameter vector ~θ

(i )

_.

• Then are told f

_i

and incur cost f

i

(~

θ

(i )

).

• Goal: Minimize total cost P

t

i =1

f

i

(~

θ

(i )

).

Our analysis will make no assumptions on how f

1

, . . . , f

t

are related

(18)

online optimization formal setup

Online Optimization: In place of a single function f , we see a

different objective function at each step:

f

1

, f

2

, . . . , f

t

: R

d

→ R

• At each step, first pick (play) a parameter vector ~θ

(i )

_.

• Then are told f

_i

and incur cost f

i

(~

θ

(i )

).

• Goal: Minimize total cost P

t

i =1

f

i

(~

θ

(i )

).

Our analysis will make no assumptions on how f

1

, . . . , f

t

are related

(19)

online optimization formal setup

Online Optimization: In place of a single function f , we see a

different objective function at each step:

f

1

, f

2

, . . . , f

t

: R

d

→ R

• At each step, first pick (play) a parameter vector ~θ

(i )

_.

• Then are told f

_i

and incur cost f

i

(~

θ

(i )

).

• Goal: Minimize total cost P

t

i =1

f

i

(~

θ

(i )

).

Our analysis will make no assumptions on how f

1

, . . . , f

t

are related

(20)

online optimization example

UI design via online optimization.

• Parameter vector ~θ

(i )

_{: some encoding of the layout at step i .}

• Functions f

₁

, . . . , f

t

: f

i

(~

θ

(i )

) = 1 if user does not click ‘add to

cart’ and f

i

(~

θ

(i )

) = 0 if they do click.

• Want to maximize number of purchases, i.e., minimize

P

t

(21)

online optimization example

Home pricing tools.

• Parameter vector ~θ

(i )

_{: coefficients of linear model at step i .}

• Functions f

1

, . . . , f

t

: f

i

(~

θ

(i )

) = (h~

x

i

, ~

θ

(i )

i − price

i

)

2

revealed when

home

i

is listed or sold.

• Want to minimize total squared error P

t

i =1

f

i

(~

θ

(i )

) (same as

(22)

regret

In normal optimization, we seek ˆ

θ satisfying:

f (ˆ

θ) ≤ min

~ θ

f (~

θ) + .

In online optimization we will ask for the same.

t

X

i =1

f

i

(~

θ

(i )

) ≤ min

~ θ t

X

i =1

f

i

(~

θ) + =

t

X

i =1

f

i

(~

θ

off

) +

is called the

regret

and /t is the average regret.

• This error metric is a bit

unusual

: Comparing online solution to

best fixed solution in hindsight. can be negative!

(23)

regret

In normal optimization, we seek ˆ

θ satisfying:

f (ˆ

θ) ≤ min

~ θ

f (~

θ) + .

In online optimization we will ask for the same.

t

X

i =1

f

i

(~

θ

(i )

) ≤ min

~ θ t

X

i =1

f

i

(~

θ) + =

t

X

i =1

f

i

(~

θ

off

) +

is called the

regret

and /t is the average regret.

• This error metric is a bit

unusual

: Comparing online solution to

best fixed solution in hindsight. can be negative!

(24)

regret

In normal optimization, we seek ˆ

θ satisfying:

f (ˆ

θ) ≤ min

~ θ

f (~

θ) + .

In online optimization we will ask for the same.

t

X

i =1

f

i

(~

θ

(i )

) ≤ min

~ θ t

X

i =1

f

i

(~

θ) + =

t

X

i =1

f

i

(~

θ

off

) +

is called the

regret

and /t is the average regret.

• This error metric is a bit

unusual

: Comparing online solution to

best fixed solution in hindsight. can be negative!

(25)

intuition check

What if for i = 1, . . . , t, f

i

(θ) = |θ − 1000| or f

i

(θ) = |θ + 1000| in

an alternating pattern?

How small can the regret be?

P

t

i =1

f

i

(~

θ

(i )

) ≤

P

ti =1

f

i

(~

θ

off

) + .

What if for i = 1, . . . , t, f

i

(θ) = |θ − 1000| or f

i

(θ) = |θ + 1000|

in

no particular pattern

? How can any online learning algorithm hope

(26)

intuition check

What if for i = 1, . . . , t, f

i

(θ) = |θ − 1000| or f

i

(θ) = |θ + 1000| in

an alternating pattern?

How small can the regret be?

P

t

i =1

f

i

(~

θ

(i )

) ≤

P

ti =1

f

i

(~

θ

off

) + .

What if for i = 1, . . . , t, f

i

(θ) = |θ − 1000| or f

i

(θ) = |θ + 1000|

in

no particular pattern

? How can any online learning algorithm hope

(27)

online gradient descent

Assume that:

• f

₁

, . . . , f

t

are all convex.

• Each f

i

is G -Lipschitz (i.e., k ~

∇f

i

(~

θ)k

2

≤ G for all ~

θ.)

• k~θ

(1)

_{− ~}

_θ

off

_k

2

≤ R where θ

(1)

is the first vector chosen.

Online Gradient Descent

• Pick some initial ~θ

(1)

_.

• Set step size η =

R G√t

.

• For i = 1, . . . , t

• Play ~θ(i ) _{and incur cost f} i(~θ(i )).

• ~_θ(i +1)_{= ~}_θ(i )_{− η · ~}_∇f i(~θ(i ))

(28)

online gradient descent

Assume that:

• f

₁

, . . . , f

t

are all convex.

• Each f

i

is G -Lipschitz (i.e., k ~

∇f

i

(~

θ)k

2

≤ G for all ~

θ.)

• k~θ

(1)

_{− ~}

_θ

off

_k

2

≤ R where θ

(1)

is the first vector chosen.

Online Gradient Descent

• Pick some initial ~θ

(1)

_.

• Set step size η =

R G√t

.

• For i = 1, . . . , t

• Play ~θ(i ) _{and incur cost f} i(~θ(i )).

(29)

online gradient descent analysis

Theorem – OGD on Convex Lipschitz Functions: For con-vex G -Lipschitz f1, . . . , ft, OGD initialized with starting point θ(1)

within radius R of θoff_{, using step size η =} R

G√t, has regret bounded

by: " t X i =1 fi(θ(i )) − t X i =1 fi(θoff) # ≤ RG√t

Upper bound onaverage regret goes to 0 and t → ∞. No assumptions on f1, . . . , ft!

Step 1.1: For all i , ∇fi(θ(i ))T(θ(i )− θoff) ≤

kθ(i )_−θoff_k2 2−kθ (i +1)_−θoff_k2 2 2η + ηG2 2

Convexity =⇒ Step 1: For all i , fi(θ(i )) − fi(θoff) ≤ kθ(i )_{− θ}off_k2 2− kθ(i +1)− θoffk22 2η + ηG2 2 .

(30)

online gradient descent analysis

Upper bound onaverage regret goes to 0 and t → ∞.

No assumptions on f1, . . . , ft!

(31)

online gradient descent analysis

(32)

online gradient descent analysis

kθ(i )_−θoff_k2

2−kθ(i +1)−θoffk22

2η +

ηG2 2

(33)

online gradient descent analysis

kθ(i )_−θoff_k2

2−kθ(i +1)−θoffk22

2η +

ηG2 2

(34)

online gradient descent analysis

Step 1: For all i , fi(θ(i )) − fi(θoff) ≤

kθ(i )_−θoff_k2 2−kθ(i +1)−θoffk22 2η + ηG2 2 =⇒ " t X i =1 fi(θ(i )) − t X i =1 fi(θoff) # ≤ t X i =1 kθ(i )_{− θ}off_k2 2− kθ (i +1)_{− θ}off_k2 2 2η + t · ηG2 2 .

(35)

online gradient descent analysis

Step 1: For all i , fi(θ(i )) − fi(θoff) ≤

kθ(i )_−θoff_k2 2−kθ(i +1)−θoffk22 2η + ηG2 2 =⇒ " t X i =1 fi(θ(i )) − t X i =1 fi(θoff) # ≤ t X i =1 kθ(i )_{− θ}off_k2 2− kθ (i +1)_{− θ}off_k2 2 2η + t · ηG2 2 .

(36)

stochastic gradient descent

Stochastic gradient descent is an efficient

offline optimization

method

, seeking ˆ

θ with

f (ˆ

θ) ≤ min

~ θ

f (~

θ) + = f (~

θ

∗

) + .

• The most popular optimization method in modern machine

learning.

(37)

stochastic gradient descent

Stochastic gradient descent is an efficient

offline optimization

method

, seeking ˆ

θ with

f (ˆ

θ) ≤ min

~ θ

f (~

θ) + = f (~

θ

∗

) + .

• The most popular optimization method in modern machine

learning.

(38)

stochastic gradient descent

Assume that:

• f is convex and decomposable as f (~θ) = P

n

j =1

f

j

(~

θ).

• E.g., L(~θ, X) = Pn

j =1`(~θ, ~xj).

• Each f

_j

is

G_n

-Lipschitz (i.e., k ~

∇f

_j

(~

θ)k

2

≤

G_n

for all ~

θ.)

• What does this imply about how Lipschitz f is?

• Initialize with θ

(1)

_{satisfying k~}

_θ

(1)

_{− ~}

_θ

∗

_k

2

≤ R.

Stochastic Gradient Descent

• Pick some initial ~θ

(1)

_.

• Set step size η =

R G√t

.

• For i = 1, . . . , t

• Pick random ji ∈ 1, . . . , n. • ~θ(i +1)= ~θ(i )− η · ~∇fji(~θ (i )₎

• Return

_{θ =}

ˆ

1 t

Pt

i =1

θ

~

(i )

.

(39)

stochastic gradient descent

Assume that:

• f is convex and decomposable as f (~θ) = P

n

j =1

f

j

(~

θ).

• E.g., L(~θ, X) = Pn

j =1`(~θ, ~xj).

• Each f

_j

is

G_n

-Lipschitz (i.e., k ~

∇f

_j

(~

θ)k

2

≤

G_n

for all ~

θ.)

• Initialize with θ

(1)

_{satisfying k~}

_θ

(1)

_{− ~}

_θ

∗

_k

2

≤ R.

Stochastic Gradient Descent

• Pick some initial ~θ

(1)

_.

• Set step size η =

R G√t

.

• For i = 1, . . . , t

• Return

_{θ =}

ˆ

1 t

Pt

i =1

θ

~

(i )

.

(40)

stochastic gradient descent

Assume that:

• f is convex and decomposable as f (~θ) = P

n

j =1

f

j

(~

θ).

• E.g., L(~θ, X) = Pn

j =1`(~θ, ~xj).

• Each f

_j

is

G_n

-Lipschitz (i.e., k ~

∇f

_j

(~

θ)k

2

≤

G_n

for all ~

θ.)

• Initialize with θ

(1)

_{satisfying k~}

_θ

(1)

_{− ~}

_θ

∗

_k

2

≤ R.

Stochastic Gradient Descent

• Pick some initial ~θ

(1)

_.

• Set step size η =

R G√t

.

• For i = 1, . . . , t

• Pick random ji ∈ 1, . . . , n. • ~θ(i +1)= ~θ(i )− η ·∇fj~ _i(~θ(i ))

• Return

_{θ =}

ˆ

1 t

Pt

i =1

θ

~

(i )

.

(41)

stochastic gradient descent

Assume that:

• f is convex and decomposable as f (~θ) = P

n

j =1

f

j

(~

θ).

• E.g., L(~θ, X) = Pn

j =1`(~θ, ~xj).

• Each f

_j

is

G_n

-Lipschitz (i.e., k ~

∇f

_j

(~

θ)k

2

≤

G_n

for all ~

θ.)

• Initialize with θ

(1)

_{satisfying k~}

_θ

(1)

_{− ~}

_θ

∗

_k

2

≤ R.

Stochastic Gradient Descent

• Pick some initial ~θ

(1)

_.

• Set step size η =

R G√t

.

• For i = 1, . . . , t

• Return

_{θ =}

ˆ

1 t

Pt

i =1

~

θ

(i )

.

(42)

stochastic gradient descent

~

θ

(i +1)

= ~

θ

(i )

− η · ~

∇f

_j_i

(~

θ

(i )

) vs. ~

θ

(i +1)

= ~

θ

(i )

− η · ~

∇f (~

θ

(i )

)

Note that: E[~

∇f

_j_i

(~

θ

(i )

)] =

1_n

∇f (~

~

θ

(i )

).

Analysis extends to

any

algorithm that takes the gradient step

in

expectation

(minibatch SGD, randomly quantized, measurement

(43)

stochastic gradient descent analysis

Theorem – SGD on Convex Lipschitz Functions: SGD run with t ≥ R2G22 iterations, η =

R

G√t, and starting point within radius R

of θ∗, outputs ˆθ satisfying: _{E[f ( ˆ}θ)] ≤ f (θ∗) + .

Step 1: f (ˆθ) − f (θ∗) ≤1_tPt i =1[f (θ (i )_{) − f (θ}∗_)] Step 2: E[f (ˆθ) − f (θ∗_{)] ≤} n t · E Pt i =1[fji(θ(i )) − fji(θ∗)] . Step 3: E[f (ˆθ) − f (θ∗)] ≤ n_t · EPt i =1[fji(θ(i )) − fji(θoff)] . Step 4: E[f (ˆθ) − f (θ∗)] ≤ n_t · R ·G n · √ t | {z } OGD bound = RG√ t.

(44)

stochastic gradient descent analysis

R

(45)

stochastic gradient descent analysis

R

(46)

stochastic gradient descent analysis

R

Step 1: f (ˆθ) − f (θ∗) ≤1_tPt i =1[f (θ (i )_{) − f (θ}∗_)] Step 2: E[f (ˆθ) − f (θ∗_{)] ≤} n t · E Pt i =1[fji(θ(i )) − fji(θ∗)] . Step 3: E[f (ˆθ) − f (θ∗)] ≤ n t · E Pt i =1[fji(θ(i )) − fji(θoff)] . Step 4: E[f (ˆθ) − f (θ∗)] ≤ n_t · R ·G n · √ t | {z } OGD bound = RG√ t.

(47)

stochastic gradient descent analysis

R

Step 1: f (ˆθ) − f (θ∗) ≤1_tPt i =1[f (θ (i )_{) − f (θ}∗_)] Step 2: E[f (ˆθ) − f (θ∗_{)] ≤} n t · E Pt i =1[fji(θ(i )) − fji(θ∗)] . Step 3: E[f (ˆθ) − f (θ∗)] ≤ n t · E Pt i =1[fji(θ(i )) − fji(θoff)] . Step 4: E[f (ˆθ) − f (θ∗)] ≤ n_t · R ·G n · √ t | {z } OGD bound = RG√ t.