COMPSCI 514: Algorithms for Data Science
Andrew McGregor
Lecture 23
summary
Last Class:
• Analysis of gradient descent for optimizing convex functions.
• Introduction to convex sets and projection functions.
• (The same) analysis of projected gradient descent for optimizing convex functions under (convex) constraints.
This Class:
• Online learning, regret, and online gradient descent.
• Application to stochastic gradient descent.
constrained convex optimization
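Often want to perform convex optimization with convex constraints:
$$\vec\theta^* = \arg\min_{\vec\theta \in S} f(\vec\theta), \quad \text{where } S \text{ is a convex set.}$$
Definition – Convex Set: A set $S \subseteq \mathbb{R}^d$ is convex if and only if, for any $\vec\theta_1, \vec\theta_2 \in S$ and $\lambda \in [0, 1]$:
$$(1 - \lambda)\vec\theta_1 + \lambda \cdot \vec\theta_2 \in S$$
For any convex set let $P_S(\cdot)$ denote the projection function onto $S$:
$$P_S(\vec y) = \arg\min_{\vec\theta \in S} \|\vec\theta - \vec y\|_2$$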
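For simple sets the projection has a closed form. The following is a minimal sketch, assuming the projection returns the point of S nearest to the input in Euclidean distance. The two sets used here (an L2 ball centered at the origin and an axis-aligned box) and the function names are illustrative choices, not taken from the lecture.

```python
import math

def project_onto_ball(y, radius=1.0):
    """Euclidean projection of y onto {theta : ||theta||_2 <= radius}."""
    norm = math.sqrt(sum(v * v for v in y))
    if norm <= radius:
        return list(y)            # y already lies in S, so the projection is y itself
    scale = radius / norm         # otherwise rescale y onto the boundary of the ball
    return [scale * v for v in y]

def project_onto_box(y, lo, hi):
    """Euclidean projection of y onto the box {theta : lo <= theta_k <= hi for all k}."""
    return [min(max(v, lo), hi) for v in y]   # clip each coordinate independently

print(project_onto_ball([3.0, 4.0]))            # a point outside the unit ball
print(project_onto_box([2.0, -0.5], 0.0, 1.0))  # a point outside the unit box
```

Both sets are convex, so these projections fit the constrained setting described above.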
projected gradient descent
Projected Gradient Descent
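• Choose some initialization $\vec\theta_1$ and set $\eta = \frac{R}{G\sqrt{t}}$.
• For $i = 1, \ldots, t - 1$
  • $\vec\theta^{(out)}_{i+1} = \vec\theta_i - \eta \cdot \vec\nabla f(\vec\theta_i)$
  • $\vec\theta_{i+1} = P_S(\vec\theta^{(out)}_{i+1})$.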
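The update rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the lecture's implementation: the objective (a linear function), the constraint set (the unit L2 ball), and the choice to report the best iterate are all hypothetical.

```python
import math

def projected_gradient_descent(grad, project, theta1, R, G, t):
    """Run t - 1 projected gradient steps with eta = R / (G * sqrt(t)).
    Returns the list of iterates theta_1, ..., theta_t."""
    eta = R / (G * math.sqrt(t))
    theta = list(theta1)
    iterates = [theta]
    for _ in range(t - 1):
        g = grad(theta)
        theta_out = [th - eta * gk for th, gk in zip(theta, g)]  # plain gradient step
        theta = project(theta_out)                               # map back into S
        iterates.append(theta)
    return iterates

def project_unit_ball(y):
    """Projection onto the unit L2 ball (the assumed constraint set S)."""
    norm = math.sqrt(sum(v * v for v in y))
    return list(y) if norm <= 1.0 else [v / norm for v in y]

# Hypothetical objective: f(theta) = -theta_1 - theta_2, convex and sqrt(2)-Lipschitz.
f = lambda theta: -theta[0] - theta[1]
grad_f = lambda theta: [-1.0, -1.0]

R, G, t = 1.0, math.sqrt(2.0), 400   # the constrained minimizer lies at distance 1 from the start
iterates = projected_gradient_descent(grad_f, project_unit_ball, [0.0, 0.0], R, G, t)
theta_hat = min(iterates, key=f)     # assumption: report the best iterate seen
print(theta_hat, f(theta_hat))
```

Every iterate after projection lies in S, which is the only difference from unconstrained gradient descent.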
convex projections
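Analysis of projected gradient descent is almost identical to gradient descent analysis!
Theorem – Projection to a convex set: For any convex set $S \subseteq \mathbb{R}^d$, $\vec y \in \mathbb{R}^d$, and $\vec\theta \in S$,
$$\|P_S(\vec y) - \vec\theta\|_2 \le \|\vec y - \vec\theta\|_2.$$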
projected gradient descent analysis
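Theorem – Projected GD: For convex $G$-Lipschitz function $f$ and convex set $S$, Projected GD run with $t \ge \frac{R^2 G^2}{\epsilon^2}$ iterations, $\eta = \frac{R}{G\sqrt{t}}$, and starting point within radius $R$ of $\vec\theta^* = \arg\min_{\vec\theta \in S} f(\vec\theta)$, outputs $\hat\theta$ satisfying:
$$f(\hat\theta) \le f(\vec\theta^*) + \epsilon$$
Recall: $\vec\theta^{(out)}_{i+1} = \vec\theta_i - \eta \cdot \vec\nabla f(\vec\theta_i)$ and $\vec\theta_{i+1} = P_S(\vec\theta^{(out)}_{i+1})$.
Step 1: For all $i$,
$$f(\vec\theta_i) - f(\vec\theta^*) \le \frac{\|\vec\theta_i - \vec\theta^*\|_2^2 - \|\vec\theta^{(out)}_{i+1} - \vec\theta^*\|_2^2}{2\eta} + \frac{\eta G^2}{2}.$$
Step 1.a: For all $i$,
$$f(\vec\theta_i) - f(\vec\theta^*) \le \frac{\|\vec\theta_i - \vec\theta^*\|_2^2 - \|\vec\theta_{i+1} - \vec\theta^*\|_2^2}{2\eta} + \frac{\eta G^2}{2}.$$
Step 2:
$$\frac{1}{t}\sum_{i=1}^{t} f(\vec\theta_i) - f(\vec\theta^*) \le \frac{R^2}{2\eta \cdot t} + \frac{\eta G^2}{2} \implies \text{Theorem.}$$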
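To see why Step 2 gives the theorem, it may help to substitute the step size explicitly. The worked calculation below writes $\epsilon$ for the target accuracy and assumes that the output $\hat\theta$ is the best iterate, so that its value is at most the average over iterates; that choice of output is an assumption, since the algorithm's return step is not stated here.

```latex
\begin{gather*}
\frac{R^2}{2\eta t} + \frac{\eta G^2}{2}
  = \frac{R^2}{2t} \cdot \frac{G\sqrt{t}}{R} + \frac{R}{G\sqrt{t}} \cdot \frac{G^2}{2}
  = \frac{RG}{2\sqrt{t}} + \frac{RG}{2\sqrt{t}}
  = \frac{RG}{\sqrt{t}}
  \qquad \text{for } \eta = \frac{R}{G\sqrt{t}}, \\
\frac{RG}{\sqrt{t}} \le \epsilon \iff t \ge \frac{R^2 G^2}{\epsilon^2}, \\
\min_{1 \le i \le t} f(\vec\theta_i) - f(\vec\theta^*)
  \le \frac{1}{t}\sum_{i=1}^{t} f(\vec\theta_i) - f(\vec\theta^*)
  \le \frac{RG}{\sqrt{t}} \le \epsilon.
\end{gather*}
```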
online gradient descent
In reality many learning problems are online.
• Websites optimize ads or recommendations to show users, given continuous feedback from these users.
• Spam filters are incrementally updated and adapt as they see more examples of spam over time.
• Face recognition systems, other classification systems, learn from mistakes over time.
Want to minimize some global loss $L(\vec\theta, X) = \sum_{i=1}^{n} \ell(\vec\theta, \vec x_i)$, when data points are presented in an online fashion $\vec x_1, \vec x_2, \ldots, \vec x_n$ (similar to streaming algorithms).
Stochastic gradient descent is a special case: when data points are considered in a random order for computational reasons.
online optimization formal setup
Online Optimization: In place of a single function $f$, we see a different objective function at each step:
$$f_1, f_2, \ldots, f_t : \mathbb{R}^d \to \mathbb{R}$$
• At each step, first pick (play) a parameter vector $\vec\theta^{(i)}$.
• Then are told $f_i$ and incur cost $f_i(\vec\theta^{(i)})$.
• Goal: Minimize total cost $\sum_{i=1}^{t} f_i(\vec\theta^{(i)})$.
Our analysis will make no assumptions on how $f_1, \ldots, f_t$ are related.
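The order of events in each round (play first, then see the cost function) is the key feature of this setup. A minimal sketch of that protocol follows; the learner `numeric_gradient_step`, the 1-D quadratic costs, and all names are hypothetical and chosen only to make the loop runnable.

```python
def run_online_protocol(choose_next, cost_functions, theta1):
    """Play theta^(i), then observe f_i and pay f_i(theta^(i)).
    Returns the list of plays and the total cost incurred."""
    theta = theta1
    plays, total_cost = [], 0.0
    for f_i in cost_functions:
        plays.append(theta)               # commit to theta^(i) before seeing f_i
        total_cost += f_i(theta)          # f_i is revealed and its cost is incurred
        theta = choose_next(theta, f_i)   # f_i may only influence later plays
    return plays, total_cost

def numeric_gradient_step(theta, f_i, eta=0.25, h=1e-6):
    """Hypothetical 1-D learner: step against a central-difference slope of f_i."""
    slope = (f_i(theta + h) - f_i(theta - h)) / (2 * h)
    return theta - eta * slope

# Hypothetical costs f_i(theta) = (theta - c_i)^2 for targets c_i.
targets = [1.0, 3.0, 2.0]
cost_functions = [lambda th, c=c: (th - c) ** 2 for c in targets]
plays, total_cost = run_online_protocol(numeric_gradient_step, cost_functions, 0.0)
print(plays, total_cost)
```

Nothing in the loop assumes any relationship between successive cost functions.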
online optimization example
UI design via online optimization.
• Parameter vector $\vec\theta^{(i)}$: some encoding of the layout at step $i$.
• Functions $f_1, \ldots, f_t$: $f_i(\vec\theta^{(i)}) = 1$ if user does not click ‘add to cart’ and $f_i(\vec\theta^{(i)}) = 0$ if they do click.
• Want to maximize number of purchases, i.e., minimize $\sum_{i=1}^{t} f_i(\vec\theta^{(i)})$.
online optimization example
Home pricing tools.
• Parameter vector $\vec\theta^{(i)}$: coefficients of linear model at step $i$.
• Functions $f_1, \ldots, f_t$: $f_i(\vec\theta^{(i)}) = (\langle \vec x_i, \vec\theta^{(i)} \rangle - \text{price}_i)^2$, revealed when home $i$ is listed or sold.
• Want to minimize total squared error $\sum_{i=1}^{t} f_i(\vec\theta^{(i)})$ (same as
regret
In normal optimization, we seek $\hat\theta$ satisfying:
$$f(\hat\theta) \le \min_{\vec\theta} f(\vec\theta) + \epsilon.$$
In online optimization we will ask for the same.
$$\sum_{i=1}^{t} f_i(\vec\theta^{(i)}) \le \min_{\vec\theta} \sum_{i=1}^{t} f_i(\vec\theta) + \epsilon = \sum_{i=1}^{t} f_i(\vec\theta^{off}) + \epsilon$$
$\epsilon$ is called the regret and $\epsilon/t$ is the average regret.
• This error metric is a bit unusual: Comparing online solution to best fixed solution in hindsight. $\epsilon$ can be negative!
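Regret is easy to compute once all cost functions are known. The sketch below assumes hypothetical 1-D costs of the form (theta - c_i)^2, for which the best fixed point in hindsight is the mean of the c_i; the function name and the data are illustrative only.

```python
def regret(plays, targets):
    """Regret of 1-D plays against the best fixed point in hindsight,
    for the assumed costs f_i(theta) = (theta - c_i)^2."""
    theta_off = sum(targets) / len(targets)   # minimizer of the summed squared costs
    online_cost = sum((th - c) ** 2 for th, c in zip(plays, targets))
    offline_cost = sum((theta_off - c) ** 2 for c in targets)
    return online_cost - offline_cost

targets = [0.0, 4.0, 0.0, 4.0]
print(regret([2.0, 2.0, 2.0, 2.0], targets))   # → 0.0   (always plays theta_off)
print(regret([0.0, 4.0, 0.0, 4.0], targets))   # → -16.0 (a changing sequence beats any fixed point)
```

The second call uses plays that match each target exactly, which a real online learner could not know in advance; it is there only to show that a sequence of changing plays can cost less than the best fixed one.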
intuition check
What if for $i = 1, \ldots, t$, $f_i(\theta) = |\theta - 1000|$ or $f_i(\theta) = |\theta + 1000|$ in an alternating pattern?
How small can the regret $\epsilon$ be?
$$\sum_{i=1}^{t} f_i(\vec\theta^{(i)}) \le \sum_{i=1}^{t} f_i(\vec\theta^{off}) + \epsilon.$$
What if for $i = 1, \ldots, t$, $f_i(\theta) = |\theta - 1000|$ or $f_i(\theta) = |\theta + 1000|$ in no particular pattern? How can any online learning algorithm hope to achieve small regret?
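For the alternating pattern, the costs can be worked out by hand. The calculation below assumes $t$ is even so that both functions appear equally often, and writes $\epsilon$ for the regret; the two strategies compared are illustrative.

```latex
\begin{gather*}
\sum_{i=1}^{t} f_i(\theta)
  = \frac{t}{2}\bigl(|\theta - 1000| + |\theta + 1000|\bigr)
  \ge \frac{t}{2} \cdot 2000 = 1000\,t,
  \quad \text{with equality for every } \theta \in [-1000, 1000], \\
\text{always play } \theta^{(i)} = 0:
  \quad \sum_{i=1}^{t} f_i(\theta^{(i)}) = 1000\,t
  \;\Longrightarrow\; \epsilon = 0, \\
\text{play } \theta^{(i)} = \pm 1000 \text{ matching } f_i:
  \quad \sum_{i=1}^{t} f_i(\theta^{(i)}) = 0
  \;\Longrightarrow\; \epsilon = -1000\,t.
\end{gather*}
```

The last strategy requires knowing the pattern in advance, but it shows that the best fixed solution in hindsight can itself be expensive, so matching it is a weaker goal than it first appears.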
online gradient descent
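Assume that:
• $f_1, \ldots, f_t$ are all convex.
• Each $f_i$ is $G$-Lipschitz (i.e., $\|\vec\nabla f_i(\vec\theta)\|_2 \le G$ for all $\vec\theta$).
• $\|\vec\theta^{(1)} - \vec\theta^{off}\|_2 \le R$ where $\vec\theta^{(1)}$ is the first vector chosen.
Online Gradient Descent
• Pick some initial $\vec\theta^{(1)}$.
• Set step size $\eta = \frac{R}{G\sqrt{t}}$.
• For $i = 1, \ldots, t$
  • Play $\vec\theta^{(i)}$ and incur cost $f_i(\vec\theta^{(i)})$.
  • $\vec\theta^{(i+1)} = \vec\theta^{(i)} - \eta \cdot \vec\nabla f_i(\vec\theta^{(i)})$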
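The algorithm above translates directly into code. The following sketch assumes each cost function is supplied through its gradient; the 1-D absolute-value costs, the starting point, and the constants R and G are hypothetical choices for illustration.

```python
import math

def online_gradient_descent(gradients, theta1, R, G):
    """gradients[i] maps theta to the gradient of the (i+1)-th cost function.
    Returns the plays theta^(1), ..., theta^(t)."""
    t = len(gradients)
    eta = R / (G * math.sqrt(t))
    theta = list(theta1)
    plays = []
    for grad_i in gradients:
        plays.append(theta)                                  # play before f_i is revealed
        g = grad_i(theta)
        theta = [th - eta * gk for th, gk in zip(theta, g)]  # step using f_i's gradient
    return plays

# Hypothetical 1-D instance: f_i(theta) = |theta - c_i| with c_i alternating between +1 and -1.
t = 100
centers = [1.0 if i % 2 == 0 else -1.0 for i in range(t)]
sign = lambda v: (v > 0) - (v < 0)
gradients = [lambda th, c=c: [float(sign(th[0] - c))] for c in centers]

R, G = 1.0, 1.0   # the start is within distance 1 of theta_off = 0; each f_i is 1-Lipschitz
plays = online_gradient_descent(gradients, [0.5], R, G)
online_cost = sum(abs(th[0] - c) for th, c in zip(plays, centers))
offline_cost = sum(abs(0.0 - c) for c in centers)   # theta_off = 0 is one best fixed point
print(online_cost - offline_cost, R * G * math.sqrt(t))
```

The printed pair is the observed regret and the value R·G·√t for this instance.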
online gradient descent analysis
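Theorem – OGD on Convex Lipschitz Functions: For convex $G$-Lipschitz $f_1, \ldots, f_t$, OGD initialized with starting point $\theta^{(1)}$ within radius $R$ of $\theta^{off}$, using step size $\eta = \frac{R}{G\sqrt{t}}$, has regret bounded by:
$$\left[\sum_{i=1}^{t} f_i(\theta^{(i)}) - \sum_{i=1}^{t} f_i(\theta^{off})\right] \le RG\sqrt{t}$$
Upper bound on average regret goes to 0 as $t \to \infty$. No assumptions on $f_1, \ldots, f_t$!
Step 1.1: For all $i$,
$$\nabla f_i(\theta^{(i)})^T(\theta^{(i)} - \theta^{off}) \le \frac{\|\theta^{(i)} - \theta^{off}\|_2^2 - \|\theta^{(i+1)} - \theta^{off}\|_2^2}{2\eta} + \frac{\eta G^2}{2}$$
Convexity $\implies$ Step 1: For all $i$,
$$f_i(\theta^{(i)}) - f_i(\theta^{off}) \le \frac{\|\theta^{(i)} - \theta^{off}\|_2^2 - \|\theta^{(i+1)} - \theta^{off}\|_2^2}{2\eta} + \frac{\eta G^2}{2}.$$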
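Step 1.1 follows from expanding the squared distance after one update, and Step 1 from the first-order characterization of convexity. A short derivation, using only the OGD update $\theta^{(i+1)} = \theta^{(i)} - \eta \nabla f_i(\theta^{(i)})$ and the bound $\|\nabla f_i(\theta)\|_2 \le G$:

```latex
\begin{align*}
\|\theta^{(i+1)} - \theta^{off}\|_2^2
  &= \|\theta^{(i)} - \eta \nabla f_i(\theta^{(i)}) - \theta^{off}\|_2^2 \\
  &= \|\theta^{(i)} - \theta^{off}\|_2^2
     - 2\eta\, \nabla f_i(\theta^{(i)})^T(\theta^{(i)} - \theta^{off})
     + \eta^2 \|\nabla f_i(\theta^{(i)})\|_2^2 \\
\Longrightarrow\quad
\nabla f_i(\theta^{(i)})^T(\theta^{(i)} - \theta^{off})
  &= \frac{\|\theta^{(i)} - \theta^{off}\|_2^2 - \|\theta^{(i+1)} - \theta^{off}\|_2^2}{2\eta}
     + \frac{\eta \|\nabla f_i(\theta^{(i)})\|_2^2}{2} \\
  &\le \frac{\|\theta^{(i)} - \theta^{off}\|_2^2 - \|\theta^{(i+1)} - \theta^{off}\|_2^2}{2\eta}
     + \frac{\eta G^2}{2}, \\
\text{and by convexity}\quad
f_i(\theta^{(i)}) - f_i(\theta^{off})
  &\le \nabla f_i(\theta^{(i)})^T(\theta^{(i)} - \theta^{off}).
\end{align*}
```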
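Step 1: For all $i$,
$$f_i(\theta^{(i)}) - f_i(\theta^{off}) \le \frac{\|\theta^{(i)} - \theta^{off}\|_2^2 - \|\theta^{(i+1)} - \theta^{off}\|_2^2}{2\eta} + \frac{\eta G^2}{2}$$
$$\implies \left[\sum_{i=1}^{t} f_i(\theta^{(i)}) - \sum_{i=1}^{t} f_i(\theta^{off})\right] \le \sum_{i=1}^{t} \frac{\|\theta^{(i)} - \theta^{off}\|_2^2 - \|\theta^{(i+1)} - \theta^{off}\|_2^2}{2\eta} + \frac{t \cdot \eta G^2}{2}.$$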
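The sum on the right telescopes, which is what yields the final regret bound. A worked version of that last step, using $\|\theta^{(1)} - \theta^{off}\|_2 \le R$ and $\eta = \frac{R}{G\sqrt{t}}$:

```latex
\begin{align*}
\sum_{i=1}^{t} \frac{\|\theta^{(i)} - \theta^{off}\|_2^2 - \|\theta^{(i+1)} - \theta^{off}\|_2^2}{2\eta}
  &= \frac{\|\theta^{(1)} - \theta^{off}\|_2^2 - \|\theta^{(t+1)} - \theta^{off}\|_2^2}{2\eta}
   \le \frac{R^2}{2\eta}, \\
\frac{R^2}{2\eta} + \frac{t\,\eta G^2}{2}
  &= \frac{R^2}{2} \cdot \frac{G\sqrt{t}}{R} + \frac{t}{2} \cdot \frac{R}{G\sqrt{t}} \cdot G^2
   = \frac{RG\sqrt{t}}{2} + \frac{RG\sqrt{t}}{2}
   = RG\sqrt{t}.
\end{align*}
```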
stochastic gradient descent
Stochastic gradient descent is an efficient offline optimization method, seeking $\hat\theta$ with
$$f(\hat\theta) \le \min_{\vec\theta} f(\vec\theta) + \epsilon = f(\vec\theta^*) + \epsilon.$$
• The most popular optimization method in modern machine learning.
stochastic gradient descent
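Assume that:
• $f$ is convex and decomposable as $f(\vec\theta) = \sum_{j=1}^{n} f_j(\vec\theta)$.
  • E.g., $L(\vec\theta, X) = \sum_{j=1}^{n} \ell(\vec\theta, \vec x_j)$.
• Each $f_j$ is $\frac{G}{n}$-Lipschitz (i.e., $\|\vec\nabla f_j(\vec\theta)\|_2 \le \frac{G}{n}$ for all $\vec\theta$).
  • What does this imply about how Lipschitz $f$ is?
• Initialize with $\vec\theta^{(1)}$ satisfying $\|\vec\theta^{(1)} - \vec\theta^*\|_2 \le R$.
Stochastic Gradient Descent
• Pick some initial $\vec\theta^{(1)}$.
• Set step size $\eta = \frac{R}{G\sqrt{t}}$.
• For $i = 1, \ldots, t$
  • Pick random $j_i \in \{1, \ldots, n\}$.
  • $\vec\theta^{(i+1)} = \vec\theta^{(i)} - \eta \cdot \vec\nabla f_{j_i}(\vec\theta^{(i)})$
• Return $\hat\theta = \frac{1}{t}\sum_{i=1}^{t} \vec\theta^{(i)}$.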
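A minimal sketch of the procedure above, returning the average of the iterates. The decomposable objective used in the example (a mean of absolute deviations in one dimension), the seed, and the constants are hypothetical; the sketch illustrates the loop and is not a tuned implementation.

```python
import math
import random

def stochastic_gradient_descent(grad_terms, theta1, R, G, t, rng):
    """SGD on f = sum_j f_j, where grad_terms[j](theta) returns the gradient of f_j.
    Uses eta = R / (G * sqrt(t)) and returns the average of theta^(1), ..., theta^(t)."""
    eta = R / (G * math.sqrt(t))
    theta = list(theta1)
    running_sum = [0.0] * len(theta)
    for _ in range(t):
        running_sum = [s + th for s, th in zip(running_sum, theta)]
        j = rng.randrange(len(grad_terms))                   # uniformly random term j_i
        g = grad_terms[j](theta)
        theta = [th - eta * gk for th, gk in zip(theta, g)]  # step on that single term
    return [s / t for s in running_sum]

# Hypothetical objective: f(theta) = (1/n) * sum_j |theta - c_j|, so each term is (1/n)-Lipschitz
# and f itself is 1-Lipschitz; its minimizer is the median of the c_j.
centers = [0.0, 1.0, 2.0, 3.0, 10.0]
n = len(centers)
sign = lambda v: (v > 0) - (v < 0)
grad_terms = [lambda th, c=c: [sign(th[0] - c) / n] for c in centers]
f = lambda th: sum(abs(th[0] - c) for c in centers) / n

theta_hat = stochastic_gradient_descent(
    grad_terms, [0.0], R=2.0, G=1.0, t=20000, rng=random.Random(0))
print(theta_hat, f(theta_hat))   # the minimum value of f is 2.4, attained at theta = 2
```

Each iteration evaluates the gradient of only one term, which is what makes the method cheap per step.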
stochastic gradient descent
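$\vec\theta^{(i+1)} = \vec\theta^{(i)} - \eta \cdot \vec\nabla f_{j_i}(\vec\theta^{(i)})$ vs. $\vec\theta^{(i+1)} = \vec\theta^{(i)} - \eta \cdot \vec\nabla f(\vec\theta^{(i)})$
Note that: $E[\vec\nabla f_{j_i}(\vec\theta^{(i)})] = \frac{1}{n}\vec\nabla f(\vec\theta^{(i)})$.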
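The expectation identity can be checked exactly on a small instance by averaging over all n equally likely choices of the index. The squared-error terms and the data below are hypothetical.

```python
def stochastic_gradient_mean(grad_terms, theta):
    """Exact expectation of the sampled term's gradient when the index is uniform over n terms."""
    n = len(grad_terms)
    grads = [g(theta) for g in grad_terms]
    return [sum(col) / n for col in zip(*grads)]   # each term has probability 1/n

def full_gradient(grad_terms, theta):
    """Gradient of f = sum_j f_j at theta."""
    grads = [g(theta) for g in grad_terms]
    return [sum(col) for col in zip(*grads)]

def make_grad(x, y):
    """Gradient of the hypothetical term f_j(theta) = (<x, theta> - y)^2."""
    def grad(theta):
        residual = sum(xk * tk for xk, tk in zip(x, theta)) - y
        return [2.0 * residual * xk for xk in x]
    return grad

data = [([1.0, 2.0], 3.0), ([0.0, 1.0], -1.0), ([2.0, 0.5], 0.5)]
grad_terms = [make_grad(x, y) for x, y in data]
theta = [0.5, -1.0]

mean_grad = stochastic_gradient_mean(grad_terms, theta)
scaled_full = [g / len(grad_terms) for g in full_gradient(grad_terms, theta)]
print(mean_grad, scaled_full)   # the two vectors agree up to rounding
```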
Analysis extends to any algorithm that takes the gradient step in expectation (minibatch SGD, randomly quantized, measurement
stochastic gradient descent analysis
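Theorem – SGD on Convex Lipschitz Functions: SGD run with $t \ge \frac{R^2 G^2}{\epsilon^2}$ iterations, $\eta = \frac{R}{G\sqrt{t}}$, and starting point within radius $R$ of $\theta^*$, outputs $\hat\theta$ satisfying: $E[f(\hat\theta)] \le f(\theta^*) + \epsilon$.
Step 1: $f(\hat\theta) - f(\theta^*) \le \frac{1}{t}\sum_{i=1}^{t} [f(\theta^{(i)}) - f(\theta^*)]$
Step 2: $E[f(\hat\theta) - f(\theta^*)] \le \frac{n}{t} \cdot E\left[\sum_{i=1}^{t} [f_{j_i}(\theta^{(i)}) - f_{j_i}(\theta^*)]\right]$.
Step 3: $E[f(\hat\theta) - f(\theta^*)] \le \frac{n}{t} \cdot E\left[\sum_{i=1}^{t} [f_{j_i}(\theta^{(i)}) - f_{j_i}(\theta^{off})]\right]$.
Step 4: $E[f(\hat\theta) - f(\theta^*)] \le \frac{n}{t} \cdot \underbrace{R \cdot \frac{G}{n} \cdot \sqrt{t}}_{\text{OGD bound}} = \frac{RG}{\sqrt{t}}$.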
sgd vs. gd
Stochastic gradient descent generally takes more iterations than gradient descent.
Each iteration is much cheaper (by a factor of $n$):
$$\vec\nabla \sum_{j=1}^{n} f_j(\vec\theta) \quad \text{vs.} \quad \vec\nabla f_j(\vec\theta)$$
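The trade-off can be made concrete by counting single-term gradient evaluations. This is a small sketch with hypothetical iteration counts; it only encodes the per-iteration costs stated above (n term gradients for gradient descent, one for stochastic gradient descent).

```python
def term_gradient_evaluations(iterations, n, stochastic):
    """Number of single-term gradients computed over the given number of iterations."""
    per_iteration = 1 if stochastic else n   # SGD touches one term, GD touches all n
    return iterations * per_iteration

n = 1000
print(term_gradient_evaluations(500, n, stochastic=False))    # → 500000
print(term_gradient_evaluations(20000, n, stochastic=True))   # → 20000
```

In this made-up example SGD runs 40 times as many iterations yet computes 25 times fewer term gradients.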
sgd vs. gd
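When $f(\vec\theta) = \sum_{j=1}^{n} f_j(\vec\theta)$ and $\|\vec\nabla f_j(\vec\theta)\|_2 \le \frac{G}{n}$:
Theorem – SGD: After $t \ge \frac{R^2 G^2}{\epsilon^2}$ iterations outputs $\hat\theta$ satisfying:
$$E[f(\hat\theta)] \le f(\theta^*) + \epsilon.$$
When $\|\vec\nabla f(\vec\theta)\|_2 \le \bar{G}$:
Theorem – GD: After $t \ge \frac{R^2 \bar{G}^2}{\epsilon^2}$ iterations outputs $\hat\theta$ satisfying: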