Large-scale machine learning and convex optimization

(1)

Francis Bach

INRIA - Ecole Normale Supérieure, Paris, France

Eurandom - March 2014

Slides available at www.di.ens.fr/~fbach/gradsto_eurandom.pdf

(2)

Context

Machine learning for “big data”

• Large-scale machine learning: large d, large n, large k

– d : dimension of each observation (input)

– n : number of observations

– k : number of tasks (dimension of outputs)

• Examples: computer vision, bioinformatics, text processing

• Ideal running-time complexity: O(dn + kn)

• Going back to simple methods

– Stochastic gradient methods (Robbins and Monro, 1951)

– Mixing statistics and optimization

– Using smoothness to go beyond stochastic gradient descent

(3)

Object recognition

(4)

Learning for bioinformatics - Proteins

• Crucial components of cell life

• Predicting multiple functions and interactions

• Massive data: up to 1 million for humans!

• Complex data

– Amino-acid sequence

– Link with DNA

– Tri-dimensional molecule

(5)

Search engines - advertising

(6)

Advertising - recommendation

(9)

Outline

1. Large-scale machine learning and optimization

• Traditional statistical analysis

• Classical methods for convex optimization

2. Non-smooth stochastic approximation

• Stochastic (sub)gradient and averaging

• Non-asymptotic results and lower bounds

• Strongly convex vs. non-strongly convex

3. Smooth stochastic approximation algorithms

• Asymptotic and non-asymptotic results

4. Beyond decaying step-sizes

5. Finite data sets

(10)

Supervised machine learning

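• Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d.

• Prediction as a linear function $\theta^\top \Phi(x)$ of features $\Phi(x) \in \mathbb{R}^d$

• (regularized) empirical risk minimization: find $\hat\theta$ solution of

$$\min_{\theta \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \theta^\top \Phi(x_i)\big) + \mu \Omega(\theta)$$

convex data fitting term + regularizer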
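
A minimal sketch of how this objective might be evaluated numerically. The logistic loss, the choice $\Omega(\theta) = \|\theta\|_2^2$, the synthetic data and all function names are assumptions made for illustration; the slide leaves the loss and the regularizer generic.

```python
import numpy as np

def logistic_loss(y, pred):
    # log(1 + exp(-y * pred)), evaluated stably via logaddexp
    return np.logaddexp(0.0, -y * pred)

def regularized_empirical_risk(theta, Phi, y, mu, loss=logistic_loss):
    """(1/n) sum_i loss(y_i, theta^T Phi(x_i)) + mu * Omega(theta),
    with Omega(theta) = ||theta||_2^2 (an assumption of this sketch)."""
    data_fit = np.mean(loss(y, Phi @ theta))
    return data_fit + mu * np.dot(theta, theta)

# Synthetic data, purely illustrative: n = 200 observations, d = 5 features.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((200, 5))
y = np.sign(Phi @ np.ones(5) + 0.1 * rng.standard_normal(200))

# At theta = 0 every prediction is 0, so the risk is log(2) for the logistic loss.
risk_at_zero = regularized_empirical_risk(np.zeros(5), Phi, y, mu=0.1)
```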

(12)

Usual losses

• Regression: $y \in \mathbb{R}$, prediction $\hat y = \theta^\top \Phi(x)$

– quadratic loss $\frac{1}{2}(y - \hat y)^2 = \frac{1}{2}(y - \theta^\top \Phi(x))^2$

• Classification: $y \in \{-1, 1\}$, prediction $\hat y = \mathrm{sign}(\theta^\top \Phi(x))$

– loss of the form $\ell(y\, \theta^\top \Phi(x))$

– “True” 0-1 loss: $\ell(y\, \theta^\top \Phi(x)) = 1_{y\, \theta^\top \Phi(x) < 0}$

– Usual convex losses:

[Figure: 0-1, hinge, square and logistic losses]
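
The four classification losses can be written as functions of the margin $m = y\, \theta^\top \Phi(x)$. The sketch below is illustrative only: the function names are made up, the square loss uses the factor 1/2 from the regression bullet (with $y^2 = 1$), and the scaling used in the original figure is not known.

```python
import numpy as np

def zero_one(m):
    return (m < 0).astype(float)          # 1 if the sign of the prediction is wrong

def hinge(m):
    return np.maximum(0.0, 1.0 - m)

def square(m):
    # (1/2)(y - theta^T Phi(x))^2 = (1/2)(1 - m)^2 when y is in {-1, 1}
    return 0.5 * (1.0 - m) ** 2

def logistic(m):
    return np.logaddexp(0.0, -m)          # log(1 + exp(-m))

margins = np.linspace(-3.0, 4.0, 8)
table = np.column_stack([margins, zero_one(margins), hinge(margins),
                         square(margins), logistic(margins)])
```

Each row of `table` holds a margin followed by the four loss values. The hinge loss dominates the 0-1 loss everywhere, which is one reason it is used as a convex surrogate.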

(13)

Usual regularizers

• Main goal: avoid overfitting

• (squared) Euclidean norm: $\|\theta\|_2^2 = \sum_{j=1}^d |\theta_j|^2$

– Numerically well-behaved

– Representer theorem and kernel methods: $\theta = \sum_{i=1}^n \alpha_i \Phi(x_i)$

– See, e.g., Schölkopf and Smola (2001); Shawe-Taylor and Cristianini (2004)

• Sparsity-inducing norms

– Main example: $\ell_1$-norm $\|\theta\|_1 = \sum_{j=1}^d |\theta_j|$

– Perform model selection as well as regularization

– Non-smooth optimization and structured sparsity

– See, e.g., Bach, Jenatton, Mairal, and Obozinski (2011, 2012)

(15)

Supervised machine learning

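• (regularized) empirical risk minimization:

$$\min_{\theta \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \theta^\top \Phi(x_i)\big) + \mu \Omega(\theta)$$

• Empirical risk: $\hat f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \theta^\top \Phi(x_i))$ (training cost)

• Expected risk: $f(\theta) = \mathbb{E}_{(x,y)}\, \ell(y, \theta^\top \Phi(x))$ (testing cost)

• Two fundamental questions: (1) computing $\hat\theta$ and (2) analyzing $\hat\theta$

– May be tackled simultaneously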

(17)

Supervised machine learning

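$$\min_{\theta \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \theta^\top \Phi(x_i)\big) \quad \text{such that } \Omega(\theta) \le D$$

convex data fitting term + constraint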

(18)

Analysis of empirical risk minimization

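• Approximation and estimation errors: $\mathcal{C} = \{\theta \in \mathbb{R}^d, \ \Omega(\theta) \le D\}$

$$f(\hat\theta) - \min_{\theta \in \mathbb{R}^d} f(\theta) = \Big[ f(\hat\theta) - \min_{\theta \in \mathcal{C}} f(\theta) \Big] + \Big[ \min_{\theta \in \mathcal{C}} f(\theta) - \min_{\theta \in \mathbb{R}^d} f(\theta) \Big]$$

– NB: may replace $\min_{\theta \in \mathbb{R}^d} f(\theta)$ by best (non-linear) predictions

1. Uniform deviation bounds, with $\hat\theta \in \arg\min_{\theta \in \mathcal{C}} \hat f(\theta)$

$$f(\hat\theta) - \min_{\theta \in \mathcal{C}} f(\theta) \le 2 \sup_{\theta \in \mathcal{C}} |\hat f(\theta) - f(\theta)|$$

– Typically slow rate $O(1/\sqrt{n})$

2. More refined concentration results with faster rates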
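
The factor 2 in the uniform deviation bound comes from a standard three-term decomposition. A sketch, writing $\theta_{\mathcal{C}}$ for a minimizer of $f$ over $\mathcal{C}$ (a notation introduced here, not on the slide) and using that $\hat\theta$ minimizes $\hat f$ over $\mathcal{C}$:

```latex
\begin{aligned}
f(\hat\theta) - f(\theta_{\mathcal{C}})
&= \big[f(\hat\theta) - \hat f(\hat\theta)\big]
 + \big[\hat f(\hat\theta) - \hat f(\theta_{\mathcal{C}})\big]
 + \big[\hat f(\theta_{\mathcal{C}}) - f(\theta_{\mathcal{C}})\big] \\
&\le \sup_{\theta \in \mathcal{C}} |\hat f(\theta) - f(\theta)|
 \;+\; 0 \;+\;
 \sup_{\theta \in \mathcal{C}} |\hat f(\theta) - f(\theta)|
 \;=\; 2 \sup_{\theta \in \mathcal{C}} |\hat f(\theta) - f(\theta)| .
\end{aligned}
```

The middle term is non-positive because $\hat f(\hat\theta) \le \hat f(\theta)$ for every $\theta \in \mathcal{C}$.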

(19)

Slow rate for supervised learning

• Assumptions (f is the expected risk, $\hat f$ the empirical risk)

– $\Omega(\theta) = \|\theta\|_2$ (Euclidean norm)

– “Linear” predictors: $\theta(x) = \theta^\top \Phi(x)$, with $\|\Phi(x)\|_2 \le R$ a.s.

– G-Lipschitz loss: f and $\hat f$ are GR-Lipschitz on $\mathcal{C} = \{\|\theta\|_2 \le D\}$

– No assumptions regarding convexity

(20)

Slow rate for supervised learning

• Assumptions (f is the expected risk, $\hat f$ the empirical risk)

– $\Omega(\theta) = \|\theta\|_2$ (Euclidean norm)

– No assumptions regarding convexity

• With probability greater than $1 - \delta$

$$\sup_{\theta \in \mathcal{C}} |\hat f(\theta) - f(\theta)| \le \frac{GRD}{\sqrt{n}} \Big[ 2 + \sqrt{2 \log \frac{2}{\delta}} \Big]$$

• Expected estimation error: $\mathbb{E}\Big[ \sup_{\theta \in \mathcal{C}} |\hat f(\theta) - f(\theta)| \Big] \le \frac{4GRD}{\sqrt{n}}$

• Using Rademacher averages (see, e.g., Boucheron et al., 2005)

• Lipschitz functions ⇒ slow rate

(21)

Fast rate for supervised learning

• Assumptions (f is the expected risk, $\hat f$ the empirical risk)

– Same as before (bounded features, Lipschitz loss)

– Regularized risks: $f^\mu(\theta) = f(\theta) + \frac{\mu}{2}\|\theta\|_2^2$ and $\hat f^\mu(\theta) = \hat f(\theta) + \frac{\mu}{2}\|\theta\|_2^2$

– Convexity

• For any $a > 0$, with probability greater than $1 - \delta$, for all $\theta \in \mathbb{R}^d$,

$$f^\mu(\theta) - \min_{\eta \in \mathbb{R}^d} f^\mu(\eta) \le (1 + a) \Big( \hat f^\mu(\theta) - \min_{\eta \in \mathbb{R}^d} \hat f^\mu(\eta) \Big) + \frac{8 (1 + \frac{1}{a}) G^2 R^2 (32 + \log \frac{1}{\delta})}{\mu n}$$

• Results from Sridharan, Srebro, and Shalev-Shwartz (2008)

– see also Boucheron and Massart (2011) and references therein

• Strongly convex functions ⇒ fast rate

– Warning: µ should decrease with n to reduce approximation error

(22)

Complexity results in convex optimization

• Assumption: f convex on R^d

• Classical generic algorithms

– (sub)gradient method/descent

– Accelerated gradient descent

– Newton method

• Key additional properties of f

– Lipschitz continuity, smoothness or strong convexity

• Key insight from Bottou and Bousquet (2008)

– In machine learning, no need to optimize below estimation error

• Key reference: Nesterov (2004)

(23)

Lipschitz continuity

• Bounded gradients of f (Lipschitz-continuity): the function f is convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D:

$$\forall \theta \in \mathbb{R}^d, \ \|\theta\|_2 \le D \Rightarrow \|f'(\theta)\|_2 \le B$$

• Machine learning

– with $f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \theta^\top \Phi(x_i))$

– G-Lipschitz loss and R-bounded data: B = GR

(24)

Smoothness and strong convexity

• A function f : R^d → R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous

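$$\forall \theta_1, \theta_2 \in \mathbb{R}^d, \ \|f'(\theta_1) - f'(\theta_2)\|_2 \le L \|\theta_1 - \theta_2\|_2$$

• If f is twice differentiable: $\forall \theta \in \mathbb{R}^d, \ f''(\theta) \preccurlyeq L \cdot \mathrm{Id}$

[Figure: a smooth function and a non-smooth function]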

(25)

Smoothness and strong convexity

• A function f : R^d → R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous

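• If f is twice differentiable: $\forall \theta \in \mathbb{R}^d, \ f''(\theta) \preccurlyeq L \cdot \mathrm{Id}$

• Machine learning

– with $f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \theta^\top \Phi(x_i))$

– Hessian ≈ covariance matrix $\frac{1}{n} \sum_{i=1}^n \Phi(x_i) \Phi(x_i)^\top$

– $\ell$-smooth loss and R-bounded data: $L = \ell R^2$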

(26)

Smoothness and strong convexity

• A function f : R^d → R is µ-strongly convex if and only if

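$$\forall \theta_1, \theta_2 \in \mathbb{R}^d, \ f(\theta_1) \ge f(\theta_2) + f'(\theta_2)^\top (\theta_1 - \theta_2) + \frac{\mu}{2} \|\theta_1 - \theta_2\|_2^2$$

• If f is twice differentiable: $\forall \theta \in \mathbb{R}^d, \ f''(\theta) \succcurlyeq \mu \cdot \mathrm{Id}$

[Figure: a convex function and a strongly convex function]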

(28)

Smoothness and strong convexity

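• Machine learning

– with $f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \theta^\top \Phi(x_i))$

– Hessian ≈ covariance matrix $\frac{1}{n} \sum_{i=1}^n \Phi(x_i) \Phi(x_i)^\top$

– Data with invertible covariance matrix (low correlation/dimension)

• Adding regularization by $\frac{\mu}{2} \|\theta\|_2^2$

– creates additional bias unless µ is small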

(29)

Summary of smoothness/convexity assumptions

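• Bounded gradients of f (Lipschitz-continuity): the function f is convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D:

$$\forall \theta \in \mathbb{R}^d, \ \|\theta\|_2 \le D \Rightarrow \|f'(\theta)\|_2 \le B$$

• Smoothness of f: the function f is convex, differentiable with L-Lipschitz-continuous gradient $f'$:

$$\forall \theta_1, \theta_2 \in \mathbb{R}^d, \ \|f'(\theta_1) - f'(\theta_2)\|_2 \le L \|\theta_1 - \theta_2\|_2$$

• Strong convexity of f: the function f is strongly convex with respect to the norm $\|\cdot\|$, with convexity constant $\mu > 0$:

$$\forall \theta_1, \theta_2 \in \mathbb{R}^d, \ f(\theta_1) \ge f(\theta_2) + f'(\theta_2)^\top (\theta_1 - \theta_2) + \frac{\mu}{2} \|\theta_1 - \theta_2\|_2^2$$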

(30)

Subgradient method/descent

• Assumptions

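– f convex and B-Lipschitz-continuous on $\{\|\theta\|_2 \le D\}$

• Algorithm: $\theta_t = \Pi_D\Big( \theta_{t-1} - \frac{2D}{B\sqrt{t}} f'(\theta_{t-1}) \Big)$

– $\Pi_D$ : orthogonal projection onto $\{\|\theta\|_2 \le D\}$

• Bound:

$$f\Big( \frac{1}{t} \sum_{k=0}^{t-1} \theta_k \Big) - f(\theta_*) \le \frac{2DB}{\sqrt{t}}$$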

• Three-line proof

• Best possible convergence rate after O(d) iterations
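
The algorithm above can be sketched in a few lines of NumPy. The test objective $f(\theta) = \|\theta - c\|_1$, its parameters and the helper names are assumptions chosen for illustration; only the step size $2D/(B\sqrt{t})$, the projection and the uniform averaging come from the slide.

```python
import numpy as np

def project_ball(theta, D):
    """Orthogonal projection onto {||theta||_2 <= D}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= D else theta * (D / norm)

def projected_subgradient(subgrad, theta0, D, B, t):
    """Run t steps with gamma_u = 2D / (B sqrt(u)); return the average of theta_0..theta_{t-1}."""
    theta = theta0.copy()
    avg = np.zeros_like(theta0)
    for u in range(1, t + 1):
        avg += theta / t                              # accumulates theta_{u-1}
        gamma = 2.0 * D / (B * np.sqrt(u))
        theta = project_ball(theta - gamma * subgrad(theta), D)
    return avg

# Illustrative non-smooth objective: f(theta) = ||theta - c||_1, minimized at c with f(c) = 0.
d, D, t = 5, 1.0, 2000
c = np.full(d, 0.3)                  # ||c||_2 is about 0.67 <= D, so the minimizer lies in the ball
B = np.sqrt(d)                       # sign vectors have Euclidean norm at most sqrt(d)
f = lambda th: np.abs(th - c).sum()
subgrad = lambda th: np.sign(th - c)

theta_bar = projected_subgradient(subgrad, np.zeros(d), D, B, t)
gap = f(theta_bar)                   # equals f(theta_bar) - f(theta_*), since f(theta_*) = 0
print(gap, 2 * D * B / np.sqrt(t))   # observed gap next to the bound stated on the slide
```

Each iteration costs one subgradient evaluation and one projection, so the averaged iterate is obtained in a single pass without storing past iterates.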

(31)

Subgradient method/descent - proof - I

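• Iteration: $\theta_t = \Pi_D(\theta_{t-1} - \gamma_t f'(\theta_{t-1}))$ with $\gamma_t = \frac{2D}{B\sqrt{t}}$

• Assumption: $\|f'(\theta)\|_2 \le B$ and $\|\theta\|_2 \le D$

$$\begin{aligned}
\|\theta_t - \theta_*\|_2^2 &\le \|\theta_{t-1} - \theta_* - \gamma_t f'(\theta_{t-1})\|_2^2 && \text{by contractivity of projections} \\
&\le \|\theta_{t-1} - \theta_*\|_2^2 + B^2 \gamma_t^2 - 2\gamma_t (\theta_{t-1} - \theta_*)^\top f'(\theta_{t-1}) && \text{because } \|f'(\theta_{t-1})\|_2 \le B \\
&\le \|\theta_{t-1} - \theta_*\|_2^2 + B^2 \gamma_t^2 - 2\gamma_t \big[ f(\theta_{t-1}) - f(\theta_*) \big] && \text{(property of subgradients)}
\end{aligned}$$

• leading to

$$f(\theta_{t-1}) - f(\theta_*) \le \frac{B^2 \gamma_t}{2} + \frac{1}{2\gamma_t} \Big[ \|\theta_{t-1} - \theta_*\|_2^2 - \|\theta_t - \theta_*\|_2^2 \Big]$$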

(32)

Subgradient method/descent - proof - II

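• Starting from $f(\theta_{t-1}) - f(\theta_*) \le \frac{B^2 \gamma_t}{2} + \frac{1}{2\gamma_t} \Big[ \|\theta_{t-1} - \theta_*\|_2^2 - \|\theta_t - \theta_*\|_2^2 \Big]$

$$\begin{aligned}
\sum_{u=1}^t \big[ f(\theta_{u-1}) - f(\theta_*) \big]
&\le \sum_{u=1}^t \frac{B^2 \gamma_u}{2} + \sum_{u=1}^t \frac{1}{2\gamma_u} \Big[ \|\theta_{u-1} - \theta_*\|_2^2 - \|\theta_u - \theta_*\|_2^2 \Big] \\
&= \sum_{u=1}^t \frac{B^2 \gamma_u}{2} + \sum_{u=1}^{t-1} \|\theta_u - \theta_*\|_2^2 \Big( \frac{1}{2\gamma_{u+1}} - \frac{1}{2\gamma_u} \Big) + \frac{\|\theta_0 - \theta_*\|_2^2}{2\gamma_1} - \frac{\|\theta_t - \theta_*\|_2^2}{2\gamma_t} \\
&\le \sum_{u=1}^t \frac{B^2 \gamma_u}{2} + \sum_{u=1}^{t-1} 4D^2 \Big( \frac{1}{2\gamma_{u+1}} - \frac{1}{2\gamma_u} \Big) + \frac{4D^2}{2\gamma_1} \\
&= \sum_{u=1}^t \frac{B^2 \gamma_u}{2} + \frac{4D^2}{2\gamma_t} \ \le \ 2DB\sqrt{t} \quad \text{with } \gamma_t = \frac{2D}{B\sqrt{t}}
\end{aligned}$$

• Using convexity: $f\Big( \frac{1}{t} \sum_{k=0}^{t-1} \theta_k \Big) - f(\theta_*) \le \frac{2DB}{\sqrt{t}}$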

(33)

Subgradient descent for machine learning

• Assumptions (f is the expected risk, ˆf the empirical risk)

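– $\hat f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \Phi(x_i)^\top \theta)$

• Statistics: with probability greater than $1 - \delta$

$$\sup_{\theta \in \mathcal{C}} |\hat f(\theta) - f(\theta)| \le \frac{GRD}{\sqrt{n}} \Big[ 2 + \sqrt{2 \log \frac{2}{\delta}} \Big]$$

• Optimization: after t iterations of subgradient method

$$\hat f(\hat\theta) - \min_{\eta \in \mathcal{C}} \hat f(\eta) \le \frac{GRD}{\sqrt{t}}$$

• t = n iterations, with total running-time complexity of $O(n^2 d)$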

(34)

Subgradient descent - strong convexity

• Assumptions

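– f convex and B-Lipschitz-continuous on $\{\|\theta\|_2 \le D\}$

– f µ-strongly convex

• Algorithm: $\theta_t = \Pi_D\Big( \theta_{t-1} - \frac{2}{\mu(t+1)} f'(\theta_{t-1}) \Big)$

• Bound:

$$f\Big( \frac{2}{t(t+1)} \sum_{k=1}^t k\, \theta_{k-1} \Big) - f(\theta_*) \le \frac{2B^2}{\mu(t+1)}$$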

• Best possible convergence rate after O(d) iterations
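
A sketch of the strongly convex variant, with step size $2/(\mu(t+1))$ and the weighted average that puts weight proportional to $k$ on $\theta_{k-1}$. The objective $f(\theta) = \|\theta - c\|_1 + \frac{\mu}{2}\|\theta\|_2^2$ and its constants are assumptions chosen so that the minimizer is known in closed form ($\theta_* = c$ whenever $\mu |c_j| \le 1$ for all $j$).

```python
import numpy as np

def project_ball(theta, D):
    """Orthogonal projection onto {||theta||_2 <= D}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= D else theta * (D / norm)

def strongly_convex_subgradient(subgrad, theta0, D, mu, t):
    """Projected subgradient steps with gamma_k = 2 / (mu (k + 1)) and weighted averaging."""
    theta = theta0.copy()
    avg = np.zeros_like(theta0)
    for k in range(1, t + 1):
        avg += (2.0 * k / (t * (t + 1))) * theta      # weight proportional to k on theta_{k-1}
        theta = project_ball(theta - 2.0 / (mu * (k + 1)) * subgrad(theta), D)
    return avg

d, D, mu, t = 5, 1.0, 1.0, 2000
c = np.full(d, 0.3)                                   # mu * |c_j| <= 1, so theta_* = c
f = lambda th: np.abs(th - c).sum() + 0.5 * mu * th @ th
subgrad = lambda th: np.sign(th - c) + mu * th
B = np.sqrt(d) + mu * D                               # bound on subgradient norms over the ball

theta_bar = strongly_convex_subgradient(subgrad, np.zeros(d), D, mu, t)
gap = f(theta_bar) - f(c)
bound = 2 * B**2 / (mu * (t + 1))
print(gap, bound)
```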

(35)

Subgradient method - strong convexity - proof - I

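• Iteration: $\theta_t = \Pi_D(\theta_{t-1} - \gamma_t f'(\theta_{t-1}))$ with $\gamma_t = \frac{2}{\mu(t+1)}$

• Assumption: $\|f'(\theta)\|_2 \le B$ and $\|\theta\|_2 \le D$ and µ-strong convexity of f

$$\begin{aligned}
\|\theta_t - \theta_*\|_2^2 &\le \|\theta_{t-1} - \theta_* - \gamma_t f'(\theta_{t-1})\|_2^2 && \text{by contractivity of projections} \\
&\le \|\theta_{t-1} - \theta_*\|_2^2 + B^2 \gamma_t^2 - 2\gamma_t (\theta_{t-1} - \theta_*)^\top f'(\theta_{t-1}) && \text{because } \|f'(\theta_{t-1})\|_2 \le B \\
&\le \|\theta_{t-1} - \theta_*\|_2^2 + B^2 \gamma_t^2 - 2\gamma_t \Big[ f(\theta_{t-1}) - f(\theta_*) + \frac{\mu}{2} \|\theta_{t-1} - \theta_*\|_2^2 \Big]
\end{aligned}$$

(property of subgradients and strong convexity)

• leading to

$$\begin{aligned}
f(\theta_{t-1}) - f(\theta_*) &\le \frac{B^2 \gamma_t}{2} + \frac{1}{2} \Big[ \frac{1}{\gamma_t} - \mu \Big] \|\theta_{t-1} - \theta_*\|_2^2 - \frac{1}{2\gamma_t} \|\theta_t - \theta_*\|_2^2 \\
&\le \frac{B^2}{\mu(t+1)} + \frac{\mu}{2} \Big[ \frac{t-1}{2} \Big] \|\theta_{t-1} - \theta_*\|_2^2 - \frac{\mu(t+1)}{4} \|\theta_t - \theta_*\|_2^2
\end{aligned}$$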

(36)

Subgradient method - strong convexity - proof - II

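• From $f(\theta_{t-1}) - f(\theta_*) \le \frac{B^2}{\mu(t+1)} + \frac{\mu}{2} \Big[ \frac{t-1}{2} \Big] \|\theta_{t-1} - \theta_*\|_2^2 - \frac{\mu(t+1)}{4} \|\theta_t - \theta_*\|_2^2$

$$\begin{aligned}
\sum_{u=1}^t u \big[ f(\theta_{u-1}) - f(\theta_*) \big]
&\le \sum_{u=1}^t \frac{B^2 u}{\mu(u+1)} + \frac{\mu}{4} \sum_{u=1}^t \Big[ u(u-1) \|\theta_{u-1} - \theta_*\|_2^2 - u(u+1) \|\theta_u - \theta_*\|_2^2 \Big] \\
&\le \frac{B^2 t}{\mu} + \frac{\mu}{4} \Big[ 0 - t(t+1) \|\theta_t - \theta_*\|_2^2 \Big] \ \le \ \frac{B^2 t}{\mu}
\end{aligned}$$

• Using convexity: $f\Big( \frac{2}{t(t+1)} \sum_{u=1}^t u\, \theta_{u-1} \Big) - f(\theta_*) \le \frac{2B^2}{\mu(t+1)}$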

(37)

(smooth) gradient descent

• Assumptions

– f convex with L-Lipschitz-continuous gradient

– Minimum attained at $\theta_*$

• Algorithm:

$$\theta_t = \theta_{t-1} - \frac{1}{L} f'(\theta_{t-1})$$

• Bound:

$$f(\theta_t) - f(\theta_*) \le \frac{2L \|\theta_0 - \theta_*\|^2}{t + 4}$$

• Not best possible convergence rate after O(d) iterations

(38)

(smooth) gradient descent - strong convexity

• Assumptions

– f convex with L-Lipschitz-continuous gradient

– f µ-strongly convex

• Algorithm:

$$\theta_t = \theta_{t-1} - \frac{1}{L} f'(\theta_{t-1})$$

• Bound:

$$f(\theta_t) - f(\theta_*) \le (1 - \mu/L)^t \big[ f(\theta_0) - f(\theta_*) \big]$$

• Adaptivity of gradient descent to problem difficulty

• Line search
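
The linear rate can be checked numerically on a strongly convex quadratic. This is a sketch under assumptions: the quadratic, its spectrum (eigenvalues between $\mu = 0.1$ and $L = 1$) and the random seed are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
eigs = np.linspace(0.1, 1.0, d)                    # mu = 0.1, L = 1.0
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal basis
H = Q @ np.diag(eigs) @ Q.T                        # Hessian with the chosen spectrum
theta_star = rng.standard_normal(d)                # minimizer, chosen in advance
mu, L = eigs[0], eigs[-1]

f = lambda th: 0.5 * (th - theta_star) @ H @ (th - theta_star)   # f(theta_*) = 0
grad = lambda th: H @ (th - theta_star)

theta = np.zeros(d)
gap0 = f(theta)
gaps, bounds = [], []
for t in range(1, 51):
    theta = theta - grad(theta) / L                # constant step size 1/L
    gaps.append(f(theta))
    bounds.append((1 - mu / L) ** t * gap0)        # bound from the slide
```

On this example every entry of `gaps` stays below the matching entry of `bounds`. The step size uses only $L$, never $\mu$, which is the adaptivity mentioned above.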

(39)

Accelerated gradient methods (Nesterov, 1983)

• Assumptions

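– f convex with L-Lipschitz-continuous gradient, minimum attained at $\theta_*$

• Algorithm:

$$\theta_t = \eta_{t-1} - \frac{1}{L} f'(\eta_{t-1}), \qquad \eta_t = \theta_t + \frac{t-1}{t+2} (\theta_t - \theta_{t-1})$$

• Bound:

$$f(\theta_t) - f(\theta_*) \le \frac{2L \|\theta_0 - \theta_*\|^2}{(t+1)^2}$$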

• Ten-line proof (see, e.g., Schmidt, Le Roux, and Bach, 2011)

• Not improvable

• Extension to strongly convex functions
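
A sketch of the two-sequence recursion above on an ill-conditioned quadratic, with $\eta_0 = \theta_0$. The test problem, its spectrum and the seed are assumptions for illustration; the update and the bound are those of the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
eigs = np.linspace(1e-3, 1.0, d)                   # ill-conditioned: L = 1, smallest eigenvalue 1e-3
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(eigs) @ Q.T
theta_star = rng.standard_normal(d)                # minimizer, chosen in advance
L = eigs[-1]

f = lambda th: 0.5 * (th - theta_star) @ H @ (th - theta_star)   # f(theta_*) = 0
grad = lambda th: H @ (th - theta_star)

theta_prev = np.zeros(d)                           # theta_0
eta = theta_prev.copy()                            # eta_0 = theta_0
r0 = np.sum((theta_prev - theta_star) ** 2)        # ||theta_0 - theta_*||^2
gaps, bounds = [], []
for t in range(1, 201):
    theta = eta - grad(eta) / L                    # gradient step from the extrapolated point
    eta = theta + (t - 1) / (t + 2) * (theta - theta_prev)
    theta_prev = theta
    gaps.append(f(theta))
    bounds.append(2 * L * r0 / (t + 1) ** 2)
```

The extra cost over plain gradient descent is one vector addition per iteration; the function values are not guaranteed to decrease monotonically, only to stay below the $O(1/t^2)$ envelope.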

(41)

Optimization for sparsity-inducing norms

(see Bach, Jenatton, Mairal, and Obozinski, 2011)

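• Gradient descent as a proximal method (differentiable functions)

– $\theta_{t+1} = \arg\min_{\theta \in \mathbb{R}^d} \ f(\theta_t) + (\theta - \theta_t)^\top \nabla f(\theta_t) + \frac{L}{2} \|\theta - \theta_t\|_2^2$

– $\theta_{t+1} = \theta_t - \frac{1}{L} \nabla f(\theta_t)$

• Problems of the form: $\min_{\theta \in \mathbb{R}^d} f(\theta) + \mu \Omega(\theta)$

– $\theta_{t+1} = \arg\min_{\theta \in \mathbb{R}^d} \ f(\theta_t) + (\theta - \theta_t)^\top \nabla f(\theta_t) + \mu \Omega(\theta) + \frac{L}{2} \|\theta - \theta_t\|_2^2$

– $\Omega(\theta) = \|\theta\|_1$ ⇒ Thresholded gradient descent

• Similar convergence rates as smooth optimization

– Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009)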
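
For $\Omega(\theta) = \|\theta\|_1$ the proximal step has a closed form: a gradient step followed by soft-thresholding at level $\mu / L$. The sketch below applies it to a least-squares data-fitting term; the data, the function names and the choice $f(\theta) = \frac{1}{2n}\|X\theta - y\|_2^2$ are assumptions for illustration.

```python
import numpy as np

def soft_threshold(v, tau):
    """Closed-form solution of min_theta tau * ||theta||_1 + (1/2) ||theta - v||_2^2."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def objective(theta, X, y, mu):
    n = X.shape[0]
    return 0.5 * np.sum((X @ theta - y) ** 2) / n + mu * np.abs(theta).sum()

def thresholded_gradient_descent(X, y, mu, n_iter=500):
    n, d = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n).max()      # smoothness constant of the least-squares term
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - grad / L, mu / L)
    return theta

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))
theta_true = np.zeros(d)
theta_true[:3] = [2.0, -1.0, 0.5]                  # sparse ground truth (illustrative)
y = X @ theta_true + 0.1 * rng.standard_normal(n)

mu_max = np.max(np.abs(X.T @ y)) / n               # above this level the solution is exactly 0
theta_hat = thresholded_gradient_descent(X, y, mu=0.1 * mu_max)
```

Coordinates whose gradient step lands inside $[-\mu/L, \mu/L]$ are set exactly to zero, which is how the $\ell_1$-norm performs model selection.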

(43)

Summary: minimizing convex functions

• Assumption: f convex

• Gradient descent: $\theta_t = \theta_{t-1} - \gamma_t f'(\theta_{t-1})$

– $O(1/\sqrt{t})$ convergence rate for non-smooth convex functions

– $O(1/t)$ convergence rate for smooth convex functions

– $O(e^{-\rho t})$ convergence rate for smooth strongly convex functions

• Newton method: $\theta_t = \theta_{t-1} - f''(\theta_{t-1})^{-1} f'(\theta_{t-1})$

– $O\big(e^{-\rho 2^t}\big)$ convergence rate

• Key insights from Bottou and Bousquet (2008)

1. In machine learning, no need to optimize below statistical error

2. In machine learning, cost functions are averages

⇒ Stochastic approximation

(45)

Stochastic approximation

• Goal: Minimizing a function f defined on R^d

– given only unbiased estimates $f_n'(\theta_n)$ of its gradients $f'(\theta_n)$ at certain points $\theta_n \in \mathbb{R}^d$

• Stochastic approximation

– (much) broader applicability beyond convex optimization

$$\theta_n = \theta_{n-1} - \gamma_n h_n(\theta_{n-1}) \quad \text{with} \quad \mathbb{E}\big[ h_n(\theta_{n-1}) \,|\, \theta_{n-1} \big] = h(\theta_{n-1})$$

– Beyond convex problems, i.i.d. assumption, finite dimension, etc.

– Typically asymptotic results

– See, e.g., Kushner and Yin (2003); Benveniste et al. (2012)

(46)

Stochastic approximation

• Goal: Minimizing a function f defined on R^d

– given only unbiased estimates $f_n'(\theta_n)$ of its gradients $f'(\theta_n)$ at certain points $\theta_n \in \mathbb{R}^d$

• Machine learning - statistics

– loss for a single pair of observations: $f_n(\theta) = \ell(y_n, \theta^\top \Phi(x_n))$

– $f(\theta) = \mathbb{E} f_n(\theta) = \mathbb{E}\, \ell(y_n, \theta^\top \Phi(x_n))$ = generalization error

– Expected gradient: $f'(\theta) = \mathbb{E} f_n'(\theta) = \mathbb{E}\big\{ \ell'(y_n, \theta^\top \Phi(x_n))\, \Phi(x_n) \big\}$

– Non-asymptotic results

• Number of iterations = number of observations

(47)

Relationship to online learning

• Stochastic approximation

– Minimize $f(\theta) = \mathbb{E}_z\, \ell(\theta, z)$ = generalization error of θ

– Using the gradients of single i.i.d. observations

(49)

Relationship to online learning

• Batch learning

– Finite set of observations: $z_1, \dots, z_n$

– Empirical risk: $\hat f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(\theta, z_i)$

– Estimator $\hat\theta$ = Minimizer of $\hat f(\theta)$ over a certain class Θ

– Generalization bound using uniform concentration results

• Online learning

– Update $\hat\theta_n$ after each new (potentially adversarial) observation $z_n$

– Cumulative loss: $\frac{1}{n} \sum_{k=1}^n \ell(\hat\theta_{k-1}, z_k)$

– Online to batch through averaging (Cesa-Bianchi et al., 2004)

(50)

Convex stochastic approximation

• Key properties of f and/or $f_n$

– Smoothness: f B-Lipschitz continuous, $f'$ L-Lipschitz continuous

– Strong convexity: f µ-strongly convex

(51)

Convex stochastic approximation

• Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro)

$$\theta_n = \theta_{n-1} - \gamma_n f_n'(\theta_{n-1})$$

– Polyak-Ruppert averaging: $\bar\theta_n = \frac{1}{n} \sum_{k=0}^{n-1} \theta_k$

– Which learning rate sequence $\gamma_n$? Classical setting: $\gamma_n = C n^{-\alpha}$

(52)

Convex stochastic approximation

• Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro)

$$\theta_n = \theta_{n-1} - \gamma_n f_n'(\theta_{n-1})$$

– Polyak-Ruppert averaging: $\bar\theta_n = \frac{1}{n} \sum_{k=0}^{n-1} \theta_k$

– Which learning rate sequence $\gamma_n$? Classical setting: $\gamma_n = C n^{-\alpha}$

• Desirable practical behavior

– Applicable (at least) to classical supervised learning problems

– Robustness to (potentially unknown) constants (L, B, µ)

– Adaptivity to difficulty of the problem (e.g., strong convexity)

(53)

Stochastic subgradient descent/method

• Assumptions

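– $f_n$ convex and B-Lipschitz-continuous on $\{\|\theta\|_2 \le D\}$

– $(f_n)$ i.i.d. functions such that $\mathbb{E} f_n = f$

– $\theta_*$ global optimum of f on $\{\|\theta\|_2 \le D\}$

• Algorithm: $\theta_n = \Pi_D\Big( \theta_{n-1} - \frac{2D}{B\sqrt{n}} f_n'(\theta_{n-1}) \Big)$

• Bound:

$$\mathbb{E} f\Big( \frac{1}{n} \sum_{k=0}^{n-1} \theta_k \Big) - f(\theta_*) \le \frac{2DB}{\sqrt{n}}$$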

• “Same” three-line proof as in the deterministic case

• Minimax convergence rate

• Running-time complexity: O(dn) after n iterations
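
A single-pass sketch of the averaged stochastic subgradient method for linear classification with the hinge loss, so that $f_n(\theta) = \max(0, 1 - y_n \theta^\top x_n)$. The synthetic data, the radius $D = 5$ and the function names are assumptions; with unit-norm inputs and a 1-Lipschitz loss, $B = GR = 1$.

```python
import numpy as np

def project_ball(theta, D):
    """Orthogonal projection onto {||theta||_2 <= D}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= D else theta * (D / norm)

def averaged_sgd_hinge(X, y, D, B):
    """One pass over the data with gamma_k = 2D / (B sqrt(k)); returns the average of theta_0..theta_{n-1}."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for k in range(1, n + 1):
        theta_bar += theta / n                        # accumulates theta_{k-1}
        x_k, y_k = X[k - 1], y[k - 1]
        # Subgradient of the hinge loss at theta_{k-1}: -y x if the margin is below 1, else 0
        g = -y_k * x_k if y_k * (x_k @ theta) < 1 else np.zeros(d)
        theta = project_ball(theta - 2 * D / (B * np.sqrt(k)) * g, D)
    return theta_bar

rng = np.random.default_rng(0)
n, d, D = 5000, 5, 5.0
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)         # ||x||_2 = 1, so R = 1
w = np.ones(d) / np.sqrt(d)                           # hypothetical ground-truth direction
y = np.where(X @ w >= 0, 1.0, -1.0)

theta_bar = averaged_sgd_hinge(X, y, D, B=1.0)
# Hinge risk measured on the same sample, for illustration only
hinge_risk = np.mean(np.maximum(0.0, 1.0 - y * (X @ theta_bar)))
```

Each observation is touched once and each update costs $O(d)$, which gives the $O(dn)$ running time stated above.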

(54)

Stochastic subgradient method - proof - I

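• Iteration: $\theta_n = \Pi_D(\theta_{n-1} - \gamma_n f_n'(\theta_{n-1}))$ with $\gamma_n = \frac{2D}{B\sqrt{n}}$

• $\mathcal{F}_n$ : information up to time n

• $\|f_n'(\theta)\|_2 \le B$ and $\|\theta\|_2 \le D$, unbiased gradients/functions $\mathbb{E}(f_n | \mathcal{F}_{n-1}) = f$

$$\begin{aligned}
\|\theta_n - \theta_*\|_2^2 &\le \|\theta_{n-1} - \theta_* - \gamma_n f_n'(\theta_{n-1})\|_2^2 && \text{by contractivity of projections} \\
&\le \|\theta_{n-1} - \theta_*\|_2^2 + B^2 \gamma_n^2 - 2\gamma_n (\theta_{n-1} - \theta_*)^\top f_n'(\theta_{n-1}) && \text{because } \|f_n'(\theta_{n-1})\|_2 \le B \\
\mathbb{E}\big[ \|\theta_n - \theta_*\|_2^2 \,|\, \mathcal{F}_{n-1} \big] &\le \|\theta_{n-1} - \theta_*\|_2^2 + B^2 \gamma_n^2 - 2\gamma_n (\theta_{n-1} - \theta_*)^\top f'(\theta_{n-1}) \\
&\le \|\theta_{n-1} - \theta_*\|_2^2 + B^2 \gamma_n^2 - 2\gamma_n \big[ f(\theta_{n-1}) - f(\theta_*) \big] && \text{(subgradient property)} \\
\mathbb{E} \|\theta_n - \theta_*\|_2^2 &\le \mathbb{E} \|\theta_{n-1} - \theta_*\|_2^2 + B^2 \gamma_n^2 - 2\gamma_n \big[ \mathbb{E} f(\theta_{n-1}) - f(\theta_*) \big]
\end{aligned}$$

• leading to

$$\mathbb{E} f(\theta_{n-1}) - f(\theta_*) \le \frac{B^2 \gamma_n}{2} + \frac{1}{2\gamma_n} \Big[ \mathbb{E} \|\theta_{n-1} - \theta_*\|_2^2 - \mathbb{E} \|\theta_n - \theta_*\|_2^2 \Big]$$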

(55)

Stochastic subgradient method - proof - II

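• Starting from $\mathbb{E} f(\theta_{n-1}) - f(\theta_*) \le \frac{B^2 \gamma_n}{2} + \frac{1}{2\gamma_n} \Big[ \mathbb{E} \|\theta_{n-1} - \theta_*\|_2^2 - \mathbb{E} \|\theta_n - \theta_*\|_2^2 \Big]$

$$\begin{aligned}
\sum_{u=1}^n \big[ \mathbb{E} f(\theta_{u-1}) - f(\theta_*) \big]
&\le \sum_{u=1}^n \frac{B^2 \gamma_u}{2} + \sum_{u=1}^n \frac{1}{2\gamma_u} \Big[ \mathbb{E} \|\theta_{u-1} - \theta_*\|_2^2 - \mathbb{E} \|\theta_u - \theta_*\|_2^2 \Big] \\
&\le \sum_{u=1}^n \frac{B^2 \gamma_u}{2} + \frac{4D^2}{2\gamma_n} \ \le \ 2DB\sqrt{n} \quad \text{with } \gamma_n = \frac{2D}{B\sqrt{n}}
\end{aligned}$$

• Using convexity: $\mathbb{E} f\Big( \frac{1}{n} \sum_{k=0}^{n-1} \theta_k \Big) - f(\theta_*) \le \frac{2DB}{\sqrt{n}}$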

(56)

Stochastic subgradient descent - strong convexity - I

• Assumptions

– $f_n$ convex and B-Lipschitz-continuous

– $(f_n)$ i.i.d. functions such that $\mathbb{E} f_n = f$

– f µ-strongly convex on $\{\|\theta\|_2 \le D\}$

– $\theta_*$ global optimum of f over $\{\|\theta\|_2 \le D\}$

• Algorithm: $\theta_n = \Pi_D\Big( \theta_{n-1} - \frac{2}{\mu(n+1)} f_n'(\theta_{n-1}) \Big)$

• Bound:

$$\mathbb{E} f\Big( \frac{2}{n(n+1)} \sum_{k=1}^n k\, \theta_{k-1} \Big) - f(\theta_*) \le \frac{2B^2}{\mu(n+1)}$$

• “Same” three-line proof as in the deterministic case

(57)

Stochastic subgradient descent - strong convexity - II

• Assumptions

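– $f_n$ convex and B-Lipschitz-continuous

– $(f_n)$ i.i.d. functions such that $\mathbb{E} f_n = f$

– $\theta_*$ global optimum of $g = f + \frac{\mu}{2} \|\cdot\|_2^2$

– No compactness assumption - no projections

• Algorithm:

$$\theta_n = \theta_{n-1} - \frac{2}{\mu(n+1)} g_n'(\theta_{n-1}) = \theta_{n-1} - \frac{2}{\mu(n+1)} \big[ f_n'(\theta_{n-1}) + \mu \theta_{n-1} \big]$$

• Bound:

$$\mathbb{E} g\Big( \frac{2}{n(n+1)} \sum_{k=1}^n k\, \theta_{k-1} \Big) - g(\theta_*) \le \frac{2B^2}{\mu(n+1)}$$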

(59)

Convex stochastic approximation Existing work

• Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)

– Strongly convex: $O((\mu n)^{-1})$

Attained by averaged stochastic gradient descent with $\gamma_n \propto (\mu n)^{-1}$

– Non-strongly convex: $O(n^{-1/2})$

Attained by averaged stochastic gradient descent with $\gamma_n \propto n^{-1/2}$

– Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov and Vial (2008); Nemirovski et al. (2009)

(60)

Convex stochastic approximation Existing work

• Many contributions in optimization and online learning: Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov and Vial (2008); Nemirovski et al. (2009)

(61)

Convex stochastic approximation Existing work

• Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988)

– All step sizes $\gamma_n = C n^{-\alpha}$ with $\alpha \in (1/2, 1)$ lead to $O(n^{-1})$ for smooth strongly convex problems

A single algorithm with global adaptive convergence rate for smooth problems?

(62)

Convex stochastic approximation Existing work

• Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988)

– All step sizes $\gamma_n = C n^{-\alpha}$ with $\alpha \in (1/2, 1)$ lead to $O(n^{-1})$ for smooth strongly convex problems

• Non-asymptotic analysis for smooth problems? problems with convergence rate $O(\min\{1/(\mu n), 1/\sqrt{n}\})$

(63)

Smoothness/convexity assumptions

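• Iteration: $\theta_n = \theta_{n-1} - \gamma_n f_n'(\theta_{n-1})$

– Polyak-Ruppert averaging: $\bar\theta_n = \frac{1}{n} \sum_{k=0}^{n-1} \theta_k$

• Smoothness of $f_n$: For each $n \ge 1$, the function $f_n$ is a.s. convex, differentiable with L-Lipschitz-continuous gradient $f_n'$:

$$\forall \theta_1, \theta_2 \in \mathbb{R}^d, \ \|f_n'(\theta_1) - f_n'(\theta_2)\|_2 \le L \|\theta_1 - \theta_2\|_2$$

– Smooth loss and bounded data

• Strong convexity of f: The function f is strongly convex with respect to the norm $\|\cdot\|$, with convexity constant $\mu > 0$:

$$\forall \theta_1, \theta_2 \in \mathbb{R}^d, \ f(\theta_1) \ge f(\theta_2) + f'(\theta_2)^\top (\theta_1 - \theta_2) + \frac{\mu}{2} \|\theta_1 - \theta_2\|_2^2$$

– Invertible population covariance matrix

– or regularization by $\frac{\mu}{2} \|\theta\|_2^2$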

(64)

Summary of new results (Bach and Moulines, 2011)

• Stochastic gradient descent with learning rate $\gamma_n = C n^{-\alpha}$

• Strongly convex smooth objective functions

– Old: $O(n^{-1})$ rate achieved without averaging for α = 1

– New: $O(n^{-1})$ rate achieved with averaging for α ∈ [1/2, 1]

– Non-asymptotic analysis with explicit constants

– Forgetting of initial conditions

– Robustness to the choice of C

(65)

Summary of new results (Bach and Moulines, 2011)

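– Non-asymptotic analysis with explicit constants

– Forgetting of initial conditions

– Robustness to the choice of C

• Convergence rates for $\mathbb{E} \|\theta_n - \theta_*\|^2$ and $\mathbb{E} \|\bar\theta_n - \theta_*\|^2$

– no averaging: $O\Big( \frac{\sigma^2 \gamma_n}{\mu} \Big) + O(e^{-\mu n \gamma_n}) \|\theta_0 - \theta_*\|^2$

– averaging: $\frac{\mathrm{tr}\, H(\theta_*)^{-1}}{n} + \mu^{-1} O(n^{-2\alpha} + n^{-2+\alpha}) + O\Big( \frac{\|\theta_0 - \theta_*\|^2}{\mu^2 n^2} \Big)$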

(66)

Robustness to wrong constants for $\gamma_n = C n^{-\alpha}$

• $f(\theta) = \frac{1}{2} |\theta|^2$ with i.i.d. Gaussian noise (d = 1)

• Left: α = 1/2

• Right: α = 1

[Figure: two panels (α = 1/2 and α = 1) plotting $\log[f(\theta_n) - f_*]$ against $\log(n)$ for plain SGD ("sgd") and averaged SGD ("ave"), each with C = 1/5, 1 and 5]

• See also http://leon.bottou.org/projects/sgd
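
An experiment of this kind can be reproduced roughly as follows. This is a sketch under assumptions: the noise level, starting point, horizon, number of repetitions and seed are illustrative and not those behind the original figure.

```python
import numpy as np

def run_sgd(C, alpha, n, rng, sigma=1.0, theta0=1.0):
    """SGD on f(theta) = theta^2 / 2 (d = 1) with noisy gradients theta + sigma * eps.

    Returns (theta_n, theta_bar_n), where theta_bar_n averages theta_0..theta_{n-1}."""
    theta, theta_sum = theta0, 0.0
    for k in range(1, n + 1):
        theta_sum += theta                            # accumulates theta_{k-1}
        grad = theta + sigma * rng.standard_normal()
        theta -= C * k ** (-alpha) * grad             # gamma_k = C k^(-alpha)
    return theta, theta_sum / n

rng = np.random.default_rng(0)
n, reps = 2000, 50
results = {}
for alpha in (0.5, 1.0):
    for C in (0.2, 1.0, 5.0):
        runs = np.array([run_sgd(C, alpha, n, rng) for _ in range(reps)])
        # Mean of f over repetitions, as a pair (plain SGD, averaged SGD); here f_* = 0
        results[(alpha, C)] = 0.5 * np.mean(runs ** 2, axis=0)

for key, (f_sgd, f_ave) in results.items():
    print(key, f_sgd, f_ave)
```

Comparing the three values of C for each α gives a numerical counterpart of the two panels: the quantity to look at is how much the averaged column changes when C is off by a factor of 5 in either direction.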

(68)

Summary of new results (Bach and Moulines, 2011)

– Non-asymptotic analysis with explicit constants

• Non-strongly convex smooth objective functions

– Old: O(n^−1/2) rate achieved with averaging for α = 1/2

– New: O(max{n^1/2−3α/2, n^−α/2, n^α−1}) rate achieved without averaging for α ∈ [1/3, 1]

• Take-home message

– Use α = 1/2 with averaging to be adaptive to strong convexity