(1)

Machine learning and optimization for massive data

Francis Bach

INRIA - Ecole Normale Supérieure, Paris, France

Joint work with Eric Moulines - IHES, May 2015

(2)

“Big data” revolution?

A new scientific context

• Technical progress

– Computation, storage and sensor costs

(3)

Moore’s law - Processors

(4)

Moore’s law - Storage

(5)

DNA sequencing

(6)

“Big data” revolution?

A new scientific context

• Data everywhere: size does not (always) matter

• Science and industry

• Size and variety

• Learning from examples

– n observations in dimension p

(7)

Search engines - Advertising

(8)

Marketing - Personalized recommendation

(9)

Visual object recognition

(10)

Bioinformatics

• Protein: Crucial elements of cell life

• Massive data: 2 million for humans

• Complex data

(11)

Astronomy

Square Kilometer Array (2024): $10^9$ GB per day

(14)

Context

Machine learning for “big data”

• Large-scale machine learning: large p, large n
– p: dimension of each observation (input)
– n: number of observations

• Examples: computer vision, bioinformatics, advertising

• Ideal running-time complexity: O(pn)

• Going back to simple methods
– Stochastic gradient methods (Robbins and Monro, 1951)
– Mixing statistics and optimization
– Using smoothness to go beyond stochastic gradient descent

(16)

Scaling to large problems with convex optimization

“Retour aux sources”

• 1950’s: computers not powerful enough

IBM “1620”, 1959

CPU frequency: 50 kHz; price > 100 000 dollars

• 2010’s: data too massive

• One pass through the data (Robbins and Monro, 1951)
– Algorithm: $\theta_n = \theta_{n-1} - \gamma_n \, \ell'\big(y_n, \langle \theta_{n-1}, \Phi(x_n) \rangle\big) \, \Phi(x_n)$

(17)

Outline

• Introduction: stochastic approximation algorithms
– Supervised machine learning and convex optimization
– Stochastic gradient and averaging
– Strongly convex vs. non-strongly convex
– $O(1/(n\mu))$ vs. $O(1/\sqrt{n})$ for all non-smooth functions

• Least-squares regression
– Constant step-sizes
– $O(1/n)$ convergence rate for all quadratic functions

• Smooth losses
– Online Newton steps
– $O(1/n)$ convergence rate for all convex functions

(18)

Supervised machine learning

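• Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d.

• Prediction as a linear function $\langle \theta, \Phi(x) \rangle$ of features $\Phi(x) \in \mathbb{R}^p$

• (regularized) empirical risk minimization: find $\hat{\theta}$ solution of

$$\min_{\theta \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big) + \mu \Omega(\theta)$$

(convex data fitting term + regularizer)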

(20)

Usual losses

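• Regression: $y \in \mathbb{R}$, prediction $\hat{y} = \langle \theta, \Phi(x) \rangle$
– quadratic loss $\frac{1}{2}(y - \hat{y})^2 = \frac{1}{2}(y - \langle \theta, \Phi(x) \rangle)^2$

• Classification: $y \in \{-1, 1\}$, prediction $\hat{y} = \mathrm{sign}(\langle \theta, \Phi(x) \rangle)$
– loss of the form $\ell(y \langle \theta, \Phi(x) \rangle)$
– “True” 0-1 loss: $\ell(y \langle \theta, \Phi(x) \rangle) = 1_{y \langle \theta, \Phi(x) \rangle < 0}$
– Usual convex losses: hinge, square, logistic

[Figure: the 0-1, hinge, square and logistic losses]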
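The losses listed on this slide can be written down as functions of the margin $y \langle \theta, \Phi(x) \rangle$. The following is a minimal sketch, assuming NumPy; the function names are illustrative, and the square loss is scaled by 1/2 to match the quadratic loss used for regression above.

```python
import numpy as np

def zero_one_loss(margin):
    """1 if the margin y * <theta, Phi(x)> is negative, 0 otherwise."""
    return (np.asarray(margin) < 0).astype(float)

def hinge_loss(margin):
    """max(0, 1 - margin)."""
    return np.maximum(0.0, 1.0 - np.asarray(margin))

def square_loss(margin):
    """For y in {-1, 1}: 0.5 * (y - yhat)^2 = 0.5 * (1 - y * yhat)^2."""
    return 0.5 * (1.0 - np.asarray(margin)) ** 2

def logistic_loss(margin):
    """log(1 + exp(-margin)), computed in a numerically stable way."""
    return np.logaddexp(0.0, -np.asarray(margin))

# Evaluate each loss on a small grid of margins.
margins = np.linspace(-3.0, 4.0, 8)
for name, loss in [("0-1", zero_one_loss), ("hinge", hinge_loss),
                   ("square", square_loss), ("logistic", logistic_loss)]:
    print(name, np.round(loss(margins), 3))
```

The hinge, square and logistic losses are convex in the margin, while the 0-1 loss is not.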

(22)

Supervised machine learning

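• Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d.

• Prediction as a linear function $\langle \theta, \Phi(x) \rangle$ of features $\Phi(x) \in \mathbb{R}^p$

• (regularized) empirical risk minimization: find $\hat{\theta}$ solution of

$$\min_{\theta \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big) + \mu \Omega(\theta)$$

(convex data fitting term + regularizer)

• Empirical risk: $\hat{f}(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$ (training cost)

• Expected risk: $f(\theta) = \mathbb{E}_{(x,y)} \, \ell(y, \langle \theta, \Phi(x) \rangle)$ (testing cost)

• Two fundamental questions: (1) computing $\hat{\theta}$ and (2) analyzing $\hat{\theta}$
– May be tackled simultaneously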

(25)

Smoothness and strong convexity

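• A function $g : \mathbb{R}^p \to \mathbb{R}$ is L-smooth if and only if it is twice differentiable and

$$\forall \theta \in \mathbb{R}^p, \quad \text{eigenvalues}\big[g''(\theta)\big] \leq L$$

• Machine learning
– with $g(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$
– Hessian ≈ covariance matrix $\frac{1}{n} \sum_{i=1}^n \Phi(x_i) \otimes \Phi(x_i)$
– Bounded data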

(29)

Smoothness and strong convexity

• A twice differentiable function $g : \mathbb{R}^p \to \mathbb{R}$ is µ-strongly convex if and only if

$$\forall \theta \in \mathbb{R}^p, \quad \text{eigenvalues}\big[g''(\theta)\big] \geq \mu$$

• Machine learning
– with $g(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$
– Hessian ≈ covariance matrix $\frac{1}{n} \sum_{i=1}^n \Phi(x_i) \otimes \Phi(x_i)$
– Data with invertible covariance matrix (low correlation/dimension)

• Adding regularization by $\frac{\mu}{2} \|\theta\|^2$
– creates additional bias unless µ is small
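For the quadratic loss, the Hessian of the empirical risk is exactly the empirical covariance matrix, so the smoothness and strong-convexity constants can be read off its extreme eigenvalues. The sketch below assumes NumPy; `curvature_bounds` and the random data are illustrative choices, not part of the talk.

```python
import numpy as np

def curvature_bounds(Phi, reg=0.0):
    """Return (mu, L): smallest and largest eigenvalues of
    (1/n) * sum_i Phi_i Phi_i^T + reg * I (Hessian of the regularized
    least-squares empirical risk)."""
    n, p = Phi.shape
    H = Phi.T @ Phi / n + reg * np.eye(p)
    eigvals = np.linalg.eigvalsh(H)  # returned in ascending order
    return eigvals[0], eigvals[-1]

rng = np.random.default_rng(0)
Phi = rng.standard_normal((5, 10))          # n = 5 < p = 10: singular covariance
R2 = np.max(np.sum(Phi ** 2, axis=1))       # bound on ||Phi(x_i)||^2 for this sample

mu0, L0 = curvature_bounds(Phi)             # no regularization
mu1, L1 = curvature_bounds(Phi, reg=0.1)    # adding (0.1 / 2) * ||theta||^2
print("unregularized: mu = %.2e, L = %.3f (R^2 = %.3f)" % (mu0, L0, R2))
print("regularized:   mu = %.3f, L = %.3f" % (mu1, L1))
```

With fewer observations than dimensions the covariance matrix is singular, so the unregularized problem is not strongly convex; the regularizer shifts every eigenvalue up by its coefficient, and bounded data keeps L below R².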

(32)

Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on $\mathbb{R}^p$

• Gradient descent: $\theta_t = \theta_{t-1} - \gamma_t \, g'(\theta_{t-1})$
– $O(1/t)$ convergence rate for convex functions
– $O(e^{-\rho t})$ convergence rate for strongly convex functions

• Newton method: $\theta_t = \theta_{t-1} - g''(\theta_{t-1})^{-1} g'(\theta_{t-1})$
– $O\big(e^{-\rho 2^t}\big)$ convergence rate

• Key insights from Bottou and Bousquet (2008)
1. In machine learning, no need to optimize below statistical error
2. In machine learning, cost functions are averages

⇒ Stochastic approximation
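The two deterministic recursions above can be compared on a small problem. This is a minimal sketch, assuming NumPy and an $\ell_2$-regularized logistic regression objective on synthetic data; the step size 1/L and the problem sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def objective(theta, X, y, mu):
    """Mean logistic loss plus (mu / 2) * ||theta||^2."""
    return np.mean(np.logaddexp(0.0, -y * (X @ theta))) + 0.5 * mu * theta @ theta

def gradient(theta, X, y, mu):
    s = sigmoid(-y * (X @ theta))
    return -(X.T @ (y * s)) / len(y) + mu * theta

def hessian(theta, X, y, mu):
    m = y * (X @ theta)
    w = sigmoid(m) * sigmoid(-m)            # second derivative of the logistic loss
    return (X.T * w) @ X / len(y) + mu * np.eye(X.shape[1])

def gradient_descent(theta, X, y, mu, step, iters):
    for _ in range(iters):
        theta = theta - step * gradient(theta, X, y, mu)
    return theta

def newton(theta, X, y, mu, iters):
    for _ in range(iters):
        theta = theta - np.linalg.solve(hessian(theta, X, y, mu),
                                        gradient(theta, X, y, mu))
    return theta

rng = np.random.default_rng(0)
n, p, mu = 200, 5, 0.1
X = rng.standard_normal((n, p))
y = np.where(rng.random(n) < sigmoid(X @ np.ones(p)), 1.0, -1.0)

L = 0.25 * np.linalg.eigvalsh(X.T @ X / n)[-1] + mu   # smoothness constant
theta0 = np.zeros(p)
theta_gd = gradient_descent(theta0, X, y, mu, 1.0 / L, 10)
theta_nt = newton(theta0, X, y, mu, 10)
print("gradient norm after 10 GD steps:    ", np.linalg.norm(gradient(theta_gd, X, y, mu)))
print("gradient norm after 10 Newton steps:", np.linalg.norm(gradient(theta_nt, X, y, mu)))
```

Each Newton step costs a p × p linear solve on top of a full pass over the data, which is the price of the faster rate and part of the motivation for stochastic approximation.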

(34)

Stochastic approximation

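• Goal: Minimizing a function f defined on $\mathbb{R}^p$
– given only unbiased estimates $f_n'(\theta_n)$ of its gradients $f'(\theta_n)$ at certain points $\theta_n \in \mathbb{R}^p$

• Machine learning - statistics
– $f(\theta) = \mathbb{E} f_n(\theta) = \mathbb{E} \, \ell(y_n, \langle \theta, \Phi(x_n) \rangle)$ = generalization error
– Loss for a single pair of observations: $f_n(\theta) = \ell(y_n, \langle \theta, \Phi(x_n) \rangle)$
– Expected gradient: $f'(\theta) = \mathbb{E} f_n'(\theta) = \mathbb{E}\big[\ell'(y_n, \langle \theta, \Phi(x_n) \rangle) \, \Phi(x_n)\big]$

• Beyond convex optimization: see, e.g., Benveniste et al. (2012)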

(36)

Convex stochastic approximation

• Key assumption: smoothness and/or strong convexity

• Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro)

$$\theta_n = \theta_{n-1} - \gamma_n f_n'(\theta_{n-1})$$

– Polyak-Ruppert averaging: $\bar{\theta}_n = \frac{1}{n+1} \sum_{k=0}^n \theta_k$
– Which learning rate sequence $\gamma_n$? Classical setting: $\gamma_n = C n^{-\alpha}$

• Running-time = $O(np)$
– Single pass through the data
– One line of code among many
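The recursion and its Polyak-Ruppert average can be sketched in a few lines. This is a minimal illustration, assuming NumPy; `averaged_sgd`, `least_squares_grad` and the synthetic data stream are hypothetical names and choices, and the average is maintained as a running mean so that each step stays O(p).

```python
import numpy as np

def averaged_sgd(grad_fn, theta0, stream, C=1.0, alpha=0.5):
    """Single pass of SGD with step sizes gamma_n = C * n^(-alpha).

    Returns the last iterate theta_n and the average of theta_0, ..., theta_n.
    """
    theta = np.array(theta0, dtype=float)
    theta_bar = theta.copy()                       # average of theta_0 only
    for n, (phi, y) in enumerate(stream, start=1):
        gamma = C * n ** (-alpha)
        theta = theta - gamma * grad_fn(theta, phi, y)
        theta_bar += (theta - theta_bar) / (n + 1)  # running mean of n + 1 iterates
    return theta, theta_bar

def least_squares_grad(theta, phi, y):
    """Gradient of f_n(theta) = 0.5 * (y - <phi, theta>)^2."""
    return (phi @ theta - y) * phi

# Illustrative data: noisy linear model in dimension p = 3.
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0, 0.5])
Phi = rng.uniform(-1.0, 1.0, size=(5000, 3))
ys = Phi @ theta_star + 0.1 * rng.standard_normal(5000)

last, avg = averaged_sgd(least_squares_grad, np.zeros(3), zip(Phi, ys), C=0.5, alpha=0.5)
print("last iterate:    ", np.round(last, 3))
print("averaged iterate:", np.round(avg, 3))
```

Each observation is touched once, so the total cost is O(np) for a gradient that costs O(p) to evaluate.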

(39)

Convex stochastic approximation Existing work

• Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
– Strongly convex: $O((\mu n)^{-1})$, attained by averaged stochastic gradient descent with $\gamma_n \propto (\mu n)^{-1}$
– Non-strongly convex: $O(n^{-1/2})$, attained by averaged stochastic gradient descent with $\gamma_n \propto n^{-1/2}$

• Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988)
– All step sizes $\gamma_n = C n^{-\alpha}$ with $\alpha \in (1/2, 1)$ lead to $O(n^{-1})$ for smooth strongly convex problems

• A single algorithm for smooth problems with convergence rate $O(1/n)$ in all situations?

(41)

Least-mean-square algorithm

• Least-squares: $f(\theta) = \frac{1}{2} \mathbb{E}\big[(y_n - \langle \Phi(x_n), \theta \rangle)^2\big]$ with $\theta \in \mathbb{R}^p$
– SGD = least-mean-square algorithm (see, e.g., Macchi, 1995)
– usually studied without averaging and decreasing step-sizes
– with strong convexity assumption $\mathbb{E}[\Phi(x_n) \otimes \Phi(x_n)] = H \succeq \mu \cdot \mathrm{Id}$

• New analysis for averaging and constant step-size $\gamma = 1/(4R^2)$
– Assume $\|\Phi(x_n)\| \leq R$ and $|y_n - \langle \Phi(x_n), \theta_* \rangle| \leq \sigma$ almost surely
– No assumption regarding lowest eigenvalues of H
– Main result: $\mathbb{E} f(\bar{\theta}_n) - f(\theta_*) \leq \frac{4\sigma^2 p}{n} + \frac{4R^2 \|\theta_0 - \theta_*\|^2}{n}$

• Matches statistical lower bound (Tsybakov, 2003)
– Non-asymptotic robust version of Györfi and Walk (1996)
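The constant-step-size result can be checked numerically on one run. The sketch below assumes NumPy and a synthetic model chosen so that the boundedness assumptions hold exactly (uniform features, so R² = p, and uniform noise bounded by σ); these data choices and the name `averaged_lms` are illustrative. The bound holds in expectation, so a single run is only indicative.

```python
import numpy as np

def averaged_lms(Phi, y, gamma, theta0):
    """Constant-step-size LMS with Polyak-Ruppert averaging (one pass)."""
    theta = np.array(theta0, dtype=float)
    theta_bar = theta.copy()
    for n in range(Phi.shape[0]):
        phi = Phi[n]
        theta = theta - gamma * (phi @ theta - y[n]) * phi
        theta_bar += (theta - theta_bar) / (n + 2)   # mean of theta_0, ..., theta_{n+1}
    return theta, theta_bar

rng = np.random.default_rng(0)
n, p, sigma = 20000, 5, 0.5
theta_star = np.ones(p)
Phi = rng.uniform(-1.0, 1.0, size=(n, p))        # ||Phi(x)||^2 <= p, so R^2 = p
noise = rng.uniform(-sigma, sigma, size=n)       # |y - <Phi(x), theta_*>| <= sigma
y = Phi @ theta_star + noise

R2 = float(p)
theta0 = np.zeros(p)
_, theta_bar = averaged_lms(Phi, y, 1.0 / (4.0 * R2), theta0)

H = np.eye(p) / 3.0                              # E[Phi Phi^T] for uniform[-1, 1] features
diff = theta_bar - theta_star
excess = 0.5 * diff @ H @ diff                   # f(theta_bar) - f(theta_*)
bound = 4 * sigma ** 2 * p / n + 4 * R2 * np.sum((theta0 - theta_star) ** 2) / n
print("excess risk: %.2e   bound: %.2e" % (excess, bound))
```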

(45)

Markov chain interpretation of constant step sizes

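• LMS recursion for $f_n(\theta) = \frac{1}{2}\big(y_n - \langle \Phi(x_n), \theta \rangle\big)^2$

$$\theta_n = \theta_{n-1} - \gamma \big(\langle \Phi(x_n), \theta_{n-1} \rangle - y_n\big) \Phi(x_n)$$

• The sequence $(\theta_n)_n$ is a homogeneous Markov chain
– convergence to a stationary distribution $\pi_\gamma$
– with expectation $\bar{\theta}_\gamma \stackrel{\mathrm{def}}{=} \int \theta \, \pi_\gamma(d\theta)$

• For least-squares, $\bar{\theta}_\gamma = \theta_*$
– $\theta_n$ does not converge to $\theta_*$ but oscillates around it
– oscillations of order $\sqrt{\gamma}$

• Ergodic theorem:
– Averaged iterates converge to $\bar{\theta}_\gamma = \theta_*$ at rate $O(1/n)$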
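The oscillation of the iterates and the convergence of their average can be observed in a short simulation. This is a rough sketch, assuming NumPy and the same kind of synthetic least-squares data as a stand-in for the talk's experiments; the two step sizes and the helper names are illustrative.

```python
import numpy as np

def lms_iterates(Phi, y, gamma, theta0):
    """Return all iterates theta_0, ..., theta_n of the constant-step LMS recursion."""
    n, p = Phi.shape
    thetas = np.empty((n + 1, p))
    thetas[0] = theta0
    for k in range(n):
        phi = Phi[k]
        thetas[k + 1] = thetas[k] - gamma * (phi @ thetas[k] - y[k]) * phi
    return thetas

rng = np.random.default_rng(1)
n, p = 20000, 5
theta_star = np.ones(p)
Phi = rng.uniform(-1.0, 1.0, size=(n, p))
y = Phi @ theta_star + rng.uniform(-0.5, 0.5, size=n)

def tail_rms(thetas):
    """Root-mean-square distance to theta_* over the second half of the run."""
    tail = thetas[len(thetas) // 2:]
    return np.sqrt(np.mean(np.sum((tail - theta_star) ** 2, axis=1)))

path_large = lms_iterates(Phi, y, 0.05, np.zeros(p))
path_small = lms_iterates(Phi, y, 0.0125, np.zeros(p))

rms_large = tail_rms(path_large)
rms_small = tail_rms(path_small)
avg_error = np.linalg.norm(path_large.mean(axis=0) - theta_star)

print("oscillation, gamma = 0.05:  ", rms_large)
print("oscillation, gamma = 0.0125:", rms_small)   # roughly half, if the sqrt(gamma) scaling holds
print("error of averaged iterate:  ", avg_error)
```

Dividing the step size by 4 should roughly halve the size of the oscillations, while the averaged iterate ends up much closer to $\theta_*$ than any single late iterate.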

(46)

Simulations - synthetic examples

• Gaussian distributions - p = 20

[Figure: synthetic square loss; $\log_{10}[f(\theta) - f(\theta_*)]$ versus $\log_{10}(n)$ for step sizes $1/2R^2$, $1/8R^2$, $1/32R^2$ and $1/2R^2n^{1/2}$]

(47)

Simulations - benchmarks

• alpha (p = 500, n = 500 000), news (p = 1 300 000, n = 20 000)

[Figure: four panels (alpha square C=1 test, alpha square C=opt test, news square C=1 test, news square C=opt test); $\log_{10}[f(\theta) - f(\theta_*)]$ versus $\log_{10}(n)$; curves for step sizes $1/R^2$ and $1/R^2n^{1/2}$ (respectively $C/R^2$ and $C/R^2n^{1/2}$) and for SAG]

(50)

Beyond least-squares - Markov chain interpretation

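• Recursion $\theta_n = \theta_{n-1} - \gamma f_n'(\theta_{n-1})$ also defines a Markov chain
– Stationary distribution $\pi_\gamma$ such that $\int f'(\theta) \, \pi_\gamma(d\theta) = 0$
– When $f'$ is not linear, $f'\big(\int \theta \, \pi_\gamma(d\theta)\big) \neq \int f'(\theta) \, \pi_\gamma(d\theta) = 0$

• $\theta_n$ oscillates around the wrong value $\bar{\theta}_\gamma \neq \theta_*$
– moreover, $\|\theta_* - \theta_n\| = O_p(\sqrt{\gamma})$

• Ergodic theorem
– averaged iterates converge to $\bar{\theta}_\gamma \neq \theta_*$ at rate $O(1/n)$
– moreover, $\|\theta_* - \bar{\theta}_\gamma\| = O(\gamma)$ (Bach, 2013)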
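The stationarity condition on the gradient follows from taking expectations in the recursion. The short derivation below is a sketch under two assumptions that the slide leaves implicit: $\pi_\gamma$ has a finite first moment, and $(x_n, y_n)$ is independent of $\theta_{n-1}$, so that $\mathbb{E}[f_n'(\theta_{n-1}) \mid \theta_{n-1}] = f'(\theta_{n-1})$.

```latex
\begin{align*}
\mathbb{E}\,\theta_n
  &= \mathbb{E}\,\theta_{n-1} - \gamma\, \mathbb{E}\, f_n'(\theta_{n-1})
  && \text{(expectation of the recursion)} \\
\int \theta \, \pi_\gamma(d\theta)
  &= \int \theta \, \pi_\gamma(d\theta) - \gamma \int f'(\theta) \, \pi_\gamma(d\theta)
  && \text{(both } \theta_{n-1} \text{ and } \theta_n \text{ distributed as } \pi_\gamma\text{)} \\
\int f'(\theta) \, \pi_\gamma(d\theta) &= 0
  && \text{(since } \gamma > 0\text{)} \\
f'(\theta) = H(\theta - \theta_*)
  \ &\Rightarrow\ H(\bar{\theta}_\gamma - \theta_*) = 0
  \ \Rightarrow\ f'(\bar{\theta}_\gamma) = 0
  && \text{(least-squares: } f' \text{ is affine)}
\end{align*}
```

For least-squares the integral passes through the affine map $f'$, which is why the stationary mean is a minimizer; for a general smooth loss $f'$ is nonlinear and the last line no longer applies.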

(51)

Simulations - synthetic examples

• Gaussian distributions - p = 20

[Figure: synthetic logistic - 1; $\log_{10}[f(\theta) - f(\theta_*)]$ versus $\log_{10}(n)$ for step sizes $1/2R^2$, $1/8R^2$, $1/32R^2$ and $1/2R^2n^{1/2}$]

(53)

Restoring convergence through online Newton steps

• Known facts
1. Averaged SGD with $\gamma_n \propto n^{-1/2}$ leads to robust rate $O(n^{-1/2})$ for all convex functions
2. Averaged SGD with $\gamma_n$ constant leads to robust rate $O(n^{-1})$ for all convex quadratic functions ⇒ $O(n^{-1})$
3. Newton’s method squares the error at each iteration for smooth functions ⇒ $O((n^{-1/2})^2)$
4. A single step of Newton’s method is equivalent to minimizing the quadratic Taylor expansion

• Online Newton step
– Rate: $O((n^{-1/2})^2 + n^{-1}) = O(n^{-1})$
– Complexity: $O(p)$ per iteration

(55)

Restoring convergence through online Newton steps

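• The Newton step for $f = \mathbb{E} f_n(\theta) \stackrel{\mathrm{def}}{=} \mathbb{E} \, \ell(y_n, \langle \theta, \Phi(x_n) \rangle)$ at $\tilde{\theta}$ is equivalent to minimizing the quadratic approximation

$$\begin{aligned}
g(\theta) &= f(\tilde{\theta}) + \langle f'(\tilde{\theta}), \theta - \tilde{\theta} \rangle + \tfrac{1}{2} \langle \theta - \tilde{\theta}, f''(\tilde{\theta})(\theta - \tilde{\theta}) \rangle \\
&= f(\tilde{\theta}) + \langle \mathbb{E} f_n'(\tilde{\theta}), \theta - \tilde{\theta} \rangle + \tfrac{1}{2} \langle \theta - \tilde{\theta}, \mathbb{E} f_n''(\tilde{\theta})(\theta - \tilde{\theta}) \rangle \\
&= \mathbb{E}\Big[ f(\tilde{\theta}) + \langle f_n'(\tilde{\theta}), \theta - \tilde{\theta} \rangle + \tfrac{1}{2} \langle \theta - \tilde{\theta}, f_n''(\tilde{\theta})(\theta - \tilde{\theta}) \rangle \Big]
\end{aligned}$$

• Complexity of least-mean-square recursion for g is $O(p)$

$$\theta_n = \theta_{n-1} - \gamma \big[ f_n'(\tilde{\theta}) + f_n''(\tilde{\theta})(\theta_{n-1} - \tilde{\theta}) \big]$$

– $f_n''(\tilde{\theta}) = \ell''(y_n, \langle \tilde{\theta}, \Phi(x_n) \rangle) \, \Phi(x_n) \otimes \Phi(x_n)$ has rank one
– New online Newton step without computing/inverting Hessians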
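Because the per-sample Hessian has rank one, the update above only needs inner products with $\Phi(x_n)$. The sketch below assumes NumPy and the logistic loss $\ell(y, u) = \log(1 + e^{-yu})$ with $y \in \{-1, 1\}$; the function name `online_newton_update` is illustrative, and the dense comparison is included only to show that the O(p) form matches the formula with an explicit Hessian.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def online_newton_update(theta, theta_tilde, phi, y, gamma):
    """One O(p) step of
    theta_n = theta_{n-1} - gamma * [f_n'(tilde) + f_n''(tilde) (theta_{n-1} - tilde)]
    for the logistic loss, without forming the p x p Hessian."""
    u = phi @ theta_tilde
    d1 = -y * sigmoid(-y * u)               # l'(y, u), derivative in u
    d2 = sigmoid(u) * sigmoid(-u)           # l''(y, u), using y^2 = 1
    coeff = d1 + d2 * (phi @ (theta - theta_tilde))
    return theta - gamma * coeff * phi

# Compare with the same update written with an explicit rank-one Hessian.
rng = np.random.default_rng(0)
p = 4
theta = rng.standard_normal(p)
theta_tilde = rng.standard_normal(p)
phi = rng.standard_normal(p)
y, gamma = -1.0, 0.1

u = phi @ theta_tilde
grad = -y * sigmoid(-y * u) * phi
hess = sigmoid(u) * sigmoid(-u) * np.outer(phi, phi)
dense = theta - gamma * (grad + hess @ (theta - theta_tilde))
fast = online_newton_update(theta, theta_tilde, phi, y, gamma)
print(np.allclose(fast, dense))
```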

(57)

Choice of support point for online Newton step

• Two-stage procedure
(1) Run n/2 iterations of averaged SGD to obtain $\tilde{\theta}$
(2) Run n/2 iterations of averaged constant step-size LMS
– Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000)
– Provable convergence rate of $O(p/n)$ for logistic regression
– Additional assumptions but no strong convexity

• Update at each iteration using the current averaged iterate
– Recursion: $\theta_n = \theta_{n-1} - \gamma \big[ f_n'(\bar{\theta}_{n-1}) + f_n''(\bar{\theta}_{n-1})(\theta_{n-1} - \bar{\theta}_{n-1}) \big]$
– No provable convergence rate (yet) but best practical behavior
– Note (dis)similarity with regular SGD: $\theta_n = \theta_{n-1} - \gamma f_n'(\theta_{n-1})$
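The two-stage procedure can be sketched for logistic regression as follows. This is an illustration under several assumptions that the slide does not fix: the first stage uses step sizes $1/(2R^2\sqrt{k})$, the second stage uses the constant step $1/(2R^2)$ and starts from $\tilde{\theta}$, and $R^2$ is taken as the largest squared feature norm in the sample. The names `two_stage` and `logistic_risk` are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def logistic_risk(theta, Phi, y):
    return np.mean(np.logaddexp(0.0, -y * (Phi @ theta)))

def two_stage(Phi, y, R2):
    n, p = Phi.shape
    half = n // 2

    # Stage 1: averaged SGD with decaying steps on the first half of the data.
    theta = np.zeros(p)
    theta_tilde = np.zeros(p)
    for k in range(1, half + 1):
        phi, yk = Phi[k - 1], y[k - 1]
        grad = -yk * sigmoid(-yk * (phi @ theta)) * phi
        theta = theta - grad / (2.0 * R2 * np.sqrt(k))
        theta_tilde += (theta - theta_tilde) / (k + 1)

    # Stage 2: averaged constant-step LMS on the quadratic approximation
    # of the logistic loss around theta_tilde, on the second half.
    gamma = 1.0 / (2.0 * R2)
    theta = theta_tilde.copy()
    theta_bar = theta_tilde.copy()
    for j, k in enumerate(range(half, n), start=1):
        phi, yk = Phi[k], y[k]
        u = phi @ theta_tilde
        coeff = (-yk * sigmoid(-yk * u)
                 + sigmoid(u) * sigmoid(-u) * (phi @ (theta - theta_tilde)))
        theta = theta - gamma * coeff * phi
        theta_bar += (theta - theta_bar) / (j + 1)
    return theta_tilde, theta_bar

rng = np.random.default_rng(0)
n_obs, dim = 4000, 5
Phi_demo = rng.standard_normal((n_obs, dim))
y_demo = np.where(rng.random(n_obs) < sigmoid(Phi_demo @ np.ones(dim)), 1.0, -1.0)
R2_demo = np.max(np.sum(Phi_demo ** 2, axis=1))

stage1, stage2 = two_stage(Phi_demo, y_demo, R2_demo)
print("risk at zero:       ", logistic_risk(np.zeros(dim), Phi_demo, y_demo))
print("risk after stage 1: ", logistic_risk(stage1, Phi_demo, y_demo))
print("risk after stage 2: ", logistic_risk(stage2, Phi_demo, y_demo))
```

The variant that re-centers at the current averaged iterate replaces `theta_tilde` by the running average inside a single loop over all n observations.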

(58)

Simulations - synthetic examples

• Gaussian distributions - p = 20

[Figure: two panels, $\log_{10}[f(\theta) - f(\theta_*)]$ versus $\log_{10}(n)$. Left, synthetic logistic - 1: step sizes $1/2R^2$, $1/8R^2$, $1/32R^2$ and $1/2R^2n^{1/2}$. Right, synthetic logistic - 2: curves "every 2p", "every iter.", "2-step" and "2-step-dbl."]

(59)

Simulations - benchmarks

• alpha (p = 500, n = 500 000), news (p = 1 300 000, n = 20 000)

[Figure: four panels (alpha logistic C=1 test, alpha logistic C=opt test, news logistic C=1 test, news logistic C=opt test); $\log_{10}[f(\theta) - f(\theta_*)]$ versus $\log_{10}(n)$; curves for step sizes $1/R^2$ and $1/R^2n^{1/2}$ (respectively $C/R^2$ and $C/R^2n^{1/2}$), SAG, Adagrad and Newton]

(60)

Conclusions

• Constant-step-size averaged stochastic gradient descent
– Reaches convergence rate $O(1/n)$ in all regimes
– Improves on the $O(1/\sqrt{n})$ lower-bound of non-smooth problems
– Efficient online Newton step for non-quadratic problems
– Robustness to step-size selection

• Extensions and future work
– Going beyond a single pass
– Pre-conditioning
– Proximal extensions for non-differentiable terms
– Kernels and non-parametric estimation (Dieuleveut and Bach, 2014)
– Line-search
– Parallelization
– Non-convex problems

(61)

References

A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012.

F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Technical Report 00804431, HAL, 2013.

A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations. Springer, 2012.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

A. Dieuleveut and F. Bach. Non-parametric Stochastic Approximation with Large Step sizes. Technical report, ArXiv, 2014.

L. Györfi and H. Walk. On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization, 34(1):31–61, 1996.

O. Macchi. Adaptive processing: The least mean squares approach with applications in transmission. Wiley, West Sussex, 1995.

A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. Wiley & Sons, 1983.

B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

(62)

H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, 22:400–407, 1951. ISSN 0003-4851.

D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report 781, Cornell University Operations Research and Industrial Engineering, 1988.

A. B. Tsybakov. Optimal rates of aggregation. 2003.

A. W. Van der Vaart. Asymptotic statistics, volume 3. Cambridge Univ. press, 2000.
