CHAPTER 5 STOCHASTIC GRADIENT FORM OF STOCHASTIC APPROXIMATION

• Organization of chapter in ISSO
  – Stochastic gradient
    • Core algorithm
    • Basic principles
    • Nonlinear regression
    • Connections to LMS
  – Neural network training
  – Discrete event dynamic systems
  – Image processing

• Note: Some material in these slides relating to online stochastic gradient descent goes slightly beyond the coverage in Chapter 5 of ISSO

Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall

Stochastic Gradient Formulation

• For differentiable $L(\theta)$, recall familiar set of $p$ equations and $p$ unknowns for use in finding a minimum $\theta^*$:

$$g(\theta) = \frac{\partial L}{\partial\theta} = 0$$

• Above is special case of root-finding problem
• Suppose cannot observe $L(\theta)$ and $g(\theta)$ except in presence of noise
  – Adaptive control (target tracking)
  – Simulation-based optimization
  – Machine learning (ML)
  – Etc.
• Seek unbiased measurement of $\partial L/\partial\theta$ for optimization

Stochastic Gradient Formulation (Cont’d)

• Suppose $L(\theta) = E[Q(\theta,V)]$
  – $V$ represents all random effects
  – $Q(\theta,V)$ represents “observed” cost (noisy measurement of $L(\theta)$)
• Seek a representation where $\partial Q/\partial\theta$ is an unbiased measurement of $\partial L/\partial\theta$
  – Not true when distribution function for $V$ depends on $\theta$
• Above implies that desired representation is

$$E[Q(\theta,V)] = \int Q(\theta,\upsilon)\,p_V(\upsilon)\,d\upsilon,$$

not

$$E[Q(\theta,V)] = \int Q(\theta,\upsilon)\,p_V(\upsilon\mid\theta)\,d\upsilon,$$

where $p_V(\cdot)$ is density function for $V$
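The role of the density can be made explicit with the product rule. The following is a sketch of that standard calculation, assuming the derivative–integral interchange is valid and $p_V(\upsilon\mid\theta)$ is differentiable in $\theta$; it is not the specific alternative form given in ISSO.

```latex
\frac{\partial L}{\partial\theta}
= \frac{\partial}{\partial\theta}\int Q(\theta,\upsilon)\,p_V(\upsilon\mid\theta)\,d\upsilon
= \underbrace{\int \frac{\partial Q(\theta,\upsilon)}{\partial\theta}\,
  p_V(\upsilon\mid\theta)\,d\upsilon}_{E[\partial Q/\partial\theta]}
+ \underbrace{\int Q(\theta,\upsilon)\,
  \frac{\partial p_V(\upsilon\mid\theta)}{\partial\theta}\,d\upsilon}_{\text{extra term}}
```

The extra term vanishes when $p_V$ does not depend on $\theta$, which is the case where $\partial Q/\partial\theta$ by itself is an unbiased measurement of $\partial L/\partial\theta$.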

Stochastic Gradient Measurement and Algorithm

• When density $p_V(\cdot)$ is independent of $\theta$,

$$Y(\theta) \equiv \frac{\partial Q(\theta,V)}{\partial\theta}$$

is unbiased measurement of $\partial L/\partial\theta$ ($Y$ is “stochastic gradient”)
  – Requires derivative–integral interchange in $\partial L/\partial\theta = \partial E[Q(\theta,V)]/\partial\theta = E[\partial Q(\theta,V)/\partial\theta]$ to be valid (Theorem 5.1 in ISSO)
• Can use root-finding (Robbins-Monro) SA algorithm with stochastic gradient above to attempt to find $\theta^*$:

$$\hat\theta_{k+1} = \hat\theta_k - a_k Y_k(\hat\theta_k)$$

• Unbiased measurement satisfies key convergence conditions of SA (Section 4.3 in ISSO)
• Note: Popular stochastic gradient algorithm above is invalid if $p_V(\cdot)$ depends on $\theta$; see Exercise 5.2 or p. 415 of ISSO for alternative form when $p_V(\cdot)$ depends on $\theta$
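The Robbins-Monro recursion with a stochastic gradient can be sketched in a few lines. This is a minimal illustration, not code from ISSO: the cost $Q(\theta,V) = \tfrac{1}{2}(\theta - V)^2$, the Gaussian distribution for $V$, the gain $a_k = a/(k+1)$, and all function names are assumptions chosen so that the minimizer $\theta^* = E[V]$ is known.

```python
import random

def stochastic_gradient(theta, v):
    """Y(theta) = dQ/dtheta for the assumed cost Q(theta, V) = 0.5 * (theta - V)**2."""
    return theta - v

def robbins_monro(theta0, n_iter, a=1.0, mu=3.0, sigma=1.0, seed=0):
    """Run theta_{k+1} = theta_k - a_k * Y_k(theta_k) with a_k = a / (k + 1)."""
    rng = random.Random(seed)
    theta = theta0
    for k in range(n_iter):
        v = rng.gauss(mu, sigma)      # distribution of V does not depend on theta
        a_k = a / (k + 1)             # decaying gain, a_k -> 0
        theta = theta - a_k * stochastic_gradient(theta, v)
    return theta

theta_hat = robbins_monro(theta0=0.0, n_iter=20000)
print(theta_hat)  # should be near mu = 3.0, the minimizer of L(theta) = E[Q(theta, V)]
```

With $a = 1$ the recursion reduces to the running sample mean of the $V_k$, which is one way to see why a decaying gain averages out the noise.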

Stochastic Gradient and LMS Connections

• Recall basic linear model from Chapter 3:

$$z_k = h_k^T\theta + v_k$$

• Consider standard MSE loss:

$$L(\theta) = \tfrac{1}{2}E\big[(z_k - h_k^T\theta)^2\big]$$

  – Implies $Q_k = \tfrac{1}{2}(z_{k+1} - h_{k+1}^T\theta)^2$
• Recall basic LMS algorithm from Chapter 3:

$$\hat\theta_{k+1} = \hat\theta_k - a_k\underbrace{h_{k+1}\big(h_{k+1}^T\hat\theta_k - z_{k+1}\big)}_{Y_k(\hat\theta_k)\,=\,\partial Q_k/\partial\theta\,|_{\theta=\hat\theta_k}}$$

• Hence LMS is direct application of stochastic gradient SA
• Proposition 5.1 in ISSO shows how SA convergence theory applies to LMS
  – Implies convergence of LMS to $\theta^*$
• See Appendix in slides for Chap. 4 for discussion of iterate averaging (Sect. 4.5.3) in context of stochastic gradient for linear models (including LMS as special case)
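The LMS recursion for the linear model can be sketched as below. This is an illustrative sketch under assumptions not in the slides: the two-dimensional true parameter, Gaussian regressors and noise, the gain $a_k = 1/(k+10)$, and the helper names `lms` and `make_data` are all hypothetical.

```python
import random

def lms(data, theta0, gain):
    """LMS: theta_{k+1} = theta_k - a_k * h_{k+1} * (h_{k+1}^T theta_k - z_{k+1})."""
    theta = list(theta0)
    for k, (h, z) in enumerate(data):
        residual = sum(hi * ti for hi, ti in zip(h, theta)) - z   # h^T theta - z
        a_k = gain(k)
        theta = [ti - a_k * hi * residual for ti, hi in zip(theta, h)]
    return theta

def make_data(theta_true, n, noise_sd=0.1, seed=1):
    """Generate z = h^T theta + v with Gaussian h and v (an assumed data model)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        h = [rng.gauss(0.0, 1.0) for _ in theta_true]
        z = sum(hi * ti for hi, ti in zip(h, theta_true)) + rng.gauss(0.0, noise_sd)
        data.append((h, z))
    return data

theta_true = [2.0, -1.0]
data = make_data(theta_true, n=5000)
theta_hat = lms(data, theta0=[0.0, 0.0], gain=lambda k: 1.0 / (k + 10))
print(theta_hat)  # expected to be close to theta_true
```

Each pass through the loop uses one data pair and one evaluation of the stochastic gradient, which is the sense in which LMS is a direct application of stochastic gradient SA.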

Stochastic Gradient Descent in Machine Learning

• Stochastic gradient descent (SGD) is “go-to” method in vast number of ML applications, including deep learning in neural networks (NNs)
• SGD in basic form is application of standard stochastic gradient (SG) methods
• Basic SGD is relatively robust method that is easy and “works” (but not likely the fastest method in given problem)
• Many extensions possible: Momentum, iterate averaging (Sect. 4.5), mini-batches, Adagrad, Adam, RMSProp, natural gradient descent, and second-order (Newton-type) methods
  – Above may show improved performance in applications, but tend to need more tuning and be less robust than basic SGD
• Need to distinguish between “true” stochastic optimization problem with MSE metric and common ML framework of minimizing empirical risk function (ERF) (next slide)

MSE Relative to Empirical Risk Function

• Idealized loss function (MSE):

$$L(\theta) = \tfrac{1}{2}E\big[(z_k - h(\theta,x_k))^2\big] = \frac{1}{2n}\sum_{k=1}^{n}E\big[(z_k - h(\theta,x_k))^2\big]$$

where second equality is true for $\{x_k, z_k\}$ being i.i.d. input-output data pairs
• Minimizing MSE not feasible in practice because cannot compute expected value
• Feasible batch criterion, i.e., empirical risk function:

$$\mathrm{ERF}(\theta) = \frac{1}{2n}\sum_{k=1}^{n}\big(z_k - h(\theta,x_k)\big)^2$$

• For large $n$, minimum of ERF close to $\theta^* = \arg\min_\theta L(\theta)$
• Batch training algorithms use deterministic optimization to minimize ERF; in contrast, “standard” SGD applies to each summand one-at-a-time to minimize ERF
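The contrast between minimizing the ERF in batch and applying SGD to one summand at a time can be sketched as follows. This is a hedged illustration only: the scalar model $h(\theta,x) = \theta x$, the synthetic data, the gain schedule, and the function names are assumptions made so that the batch minimizer has a closed form.

```python
import random

def empirical_risk(theta, xs, zs):
    """ERF(theta) = (1 / (2n)) * sum_k (z_k - theta * x_k)**2 for h(theta, x) = theta * x."""
    n = len(xs)
    return sum((z - theta * x) ** 2 for x, z in zip(xs, zs)) / (2 * n)

def erf_minimizer(xs, zs):
    """Closed-form batch (least-squares) minimizer for the scalar linear model."""
    return sum(x * z for x, z in zip(xs, zs)) / sum(x * x for x in xs)

def sgd_on_summands(xs, zs, theta0=0.0, epochs=50, a=0.5):
    """Apply SGD to one summand at a time, cycling through the fixed data set."""
    theta, step = theta0, 0
    for _ in range(epochs):
        for x, z in zip(xs, zs):
            a_k = a / (1 + 0.01 * step)          # slowly decaying gain (assumed schedule)
            theta -= a_k * x * (theta * x - z)   # gradient of 0.5 * (z - theta * x)**2
            step += 1
    return theta

rng = random.Random(2)
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
zs = [1.5 * x + rng.gauss(0.0, 0.2) for x in xs]  # true theta = 1.5 (assumed)

theta_batch = erf_minimizer(xs, zs)
theta_sgd = sgd_on_summands(xs, zs)
print(theta_batch, theta_sgd)  # the two estimates should be close to each other
```

Both estimates target the minimizer of the ERF for this fixed data set, which is itself only an approximation to $\theta^*$ for finite $n$.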

Online Loss Function and Training

• Assume $\{x_k, z_k\}$ are input-output data pairs (typically assumed i.i.d.); data may be collected in batch or real-time
• Consider data modeled as $z_k = h(\theta,x_k)$ + noise, where $h(\theta,x_k)$ represents model with unknown parameters $\theta$
  – $h(\theta,x_k)$ may represent NN with connection weights $\theta$
• Typical $Q_k$ represents squared difference between actual outcome and prediction (its mean is the MSE loss): $Q_k(\theta,V_k) = \tfrac{1}{2}\big(z_{k+1} - h(\theta,x_{k+1})\big)^2$
• By processing data one at a time, then have “on-line” training, which is SGD
  – Note that random vector $\partial\big[\tfrac{1}{2}(z_{k+1} - h(\theta,x_{k+1}))^2\big]/\partial\theta$ represents “noisy” value of instantaneous true gradient $\partial L_k/\partial\theta$
• Contrast is batch training where all data are processed at each iteration via sum of squared errors using deterministic steepest descent or other nonlinear programming method
• SGD = online training algorithm in ML parlance

Stochastic Gradient for Online Training

• Online training uses instantaneous (stochastic) gradient; i.e., gradient corresponding to each measurement
• Recall

$$L(\theta) = \tfrac{1}{2}E\big[(z_k - h(\theta,x_k))^2\big];$$

then

$$g(\theta) = \frac{1}{2}\frac{\partial E\big[(z_k - h(\theta,x_k))^2\big]}{\partial\theta} = E\!\left[\frac{\partial h(\theta,x_k)}{\partial\theta}\big(h(\theta,x_k) - z_k\big)\right]$$

• Above requires derivative–integral interchange (valid for all popular NN architectures)
• From right-most expression above, unbiased stochastic gradient input for use in SA algorithm is:

$$Y_k(\theta) = \frac{\partial h(\theta,x_{k+1})}{\partial\theta}\big(h(\theta,x_{k+1}) - z_{k+1}\big)$$
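The form $Y_k(\theta) = (\partial h/\partial\theta)(h - z)$ can be checked numerically against a finite-difference gradient of the instantaneous cost. The sketch below assumes a hypothetical two-parameter model $h(\theta,x) = \theta_1\tanh(\theta_2 x)$, chosen only because its derivative is easy to write down; the function names are illustrative.

```python
import math

def h(theta, x):
    """Assumed nonlinear model: h(theta, x) = theta[0] * tanh(theta[1] * x)."""
    return theta[0] * math.tanh(theta[1] * x)

def dh_dtheta(theta, x):
    """Gradient of h with respect to theta."""
    t = math.tanh(theta[1] * x)
    return [t, theta[0] * x * (1.0 - t * t)]

def q(theta, x, z):
    """Instantaneous cost Q = 0.5 * (z - h(theta, x))**2."""
    return 0.5 * (z - h(theta, x)) ** 2

def stochastic_gradient(theta, x, z):
    """Y(theta) = dh/dtheta * (h(theta, x) - z)."""
    err = h(theta, x) - z
    return [g * err for g in dh_dtheta(theta, x)]

def finite_difference(theta, x, z, eps=1e-6):
    """Central-difference approximation to dQ/dtheta, used only as a check."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += eps
        dn = list(theta)
        dn[i] -= eps
        grad.append((q(up, x, z) - q(dn, x, z)) / (2 * eps))
    return grad

theta, x, z = [0.7, -1.2], 0.5, 0.3
print(stochastic_gradient(theta, x, z))
print(finite_difference(theta, x, z))  # should agree to several decimal places
```

A check of this kind is a common safeguard before running the SA recursion, since a sign error in $h - z$ turns descent into ascent.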

Some Comments on Stochastic Gradient Form of SA for Online Training

• White (1989) appears to be first to recognize connection of online training to SA
• For NNs, note use of backpropagation to get $\partial h/\partial\theta$ for each $x_k$ as shown in stochastic gradient above
  – Each iteration of SA requires one backpropagation calculation (vs. $n$ backpropagations in each batch iteration)
• The term “online” is frequently a misnomer, as training is often done offline
  – “Online” is used to refer to process of training with instantaneous stochastic gradient even when done offline
• Gain sequence $a_k \to 0$ required for formal convergence
  – But constant gain $a_k = a$ frequently used in practice
  – With $a_k = a$, have “small” asymptotic error in estimate (error does not vanish as $k \to \infty$)
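The difference between a decaying gain and a constant gain can be seen in a small simulation. The sketch below is an assumed toy setting, not from ISSO: the cost is $Q(\theta,V) = \tfrac{1}{2}(\theta - V)^2$ with Gaussian $V$, and the gains $a_k = 1/(k+1)$ and $a_k = 0.1$ are arbitrary illustrative choices.

```python
import random

def run_sgd(gain, n_iter=5000, mu=1.0, sigma=1.0, seed=3):
    """SGD on Q(theta, V) = 0.5 * (theta - V)**2; returns the list of iterates."""
    rng = random.Random(seed)
    theta, path = 0.0, []
    for k in range(n_iter):
        v = rng.gauss(mu, sigma)
        theta -= gain(k) * (theta - v)
        path.append(theta)
    return path

def tail_mse(path, target, tail=1000):
    """Mean squared error of the last `tail` iterates about the target."""
    last = path[-tail:]
    return sum((t - target) ** 2 for t in last) / len(last)

decaying = run_sgd(lambda k: 1.0 / (k + 1))   # a_k -> 0
constant = run_sgd(lambda k: 0.1)             # a_k = a

mse_decaying = tail_mse(decaying, target=1.0)
mse_constant = tail_mse(constant, target=1.0)
print(mse_decaying, mse_constant)  # the constant-gain value is expected to be the larger one
```

With the constant gain the iterates keep fluctuating around $\theta^*$ at a level set by $a$ and the noise variance, which is the “small” asymptotic error referred to above.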

Some Comments on Stochastic Gradient Form of SA for Online Training (cont’d)

• Formal SA theory assumes one pass through data (one “epoch” in NN literature)
  – Practical implementations of online training often assume multiple passes through fixed set of data
  – Such multiple epochs emphasized in Wilson and Martinez (2003)
    • Initial value for SA process in $m$th epoch is final value in $(m-1)$st epoch
• Formal SA convergence theory requires that sample size $n \to \infty$
  – Need constant infusion of new data

Some Comments on Stochastic Gradient Form of SA for Online Training (cont’d)

• Another popular modification of basic SGD is mini-batches of several data points per iteration: Compromise between batch gradient and SGD of one training sample per iteration
• Often leads to net computational improvement through vectorization and efficient matrix operations
• Introduces additional tuning parameter: what is good batch size?
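A mini-batch update averages the per-sample stochastic gradients over a small subset of the data before taking a step. The sketch below assumes the scalar model $h(\theta,x) = \theta x$, synthetic data, a constant gain, and random reshuffling each epoch; all of these, and the name `minibatch_sgd`, are illustrative choices rather than anything prescribed in the slides.

```python
import random

def minibatch_sgd(xs, zs, batch_size, epochs=40, a=0.1, theta0=0.0, seed=4):
    """Mini-batch SGD for the assumed scalar model h(theta, x) = theta * x."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    theta = theta0
    for _ in range(epochs):
        rng.shuffle(idx)                                  # new random order each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Average the per-sample gradients x * (theta * x - z) over the mini-batch.
            grad = sum(xs[i] * (theta * xs[i] - zs[i]) for i in batch) / len(batch)
            theta -= a * grad
    return theta

rng = random.Random(5)
xs = [rng.uniform(-2.0, 2.0) for _ in range(400)]
zs = [0.8 * x + rng.gauss(0.0, 0.1) for x in xs]          # true theta = 0.8 (assumed)

for b in (1, 16, 400):   # b = 1 is per-sample SGD; b = n is the full batch gradient
    print(b, minibatch_sgd(xs, zs, batch_size=b))
```

In this pure-Python form there is no speed benefit; the computational gain mentioned above comes from replacing the inner sum with vectorized matrix operations.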

Representation of Batch Training for Use in Comparing with Online

• Why use online SGD instead of batch?
  – Wilson and Martinez (2003) address this issue
• Aim to represent batch algorithm in form for direct comparison to online
• Recall $Y_k$ associated with $(k+1)$st data pair: $x_{k+1}, z_{k+1}$
• Let $\hat\theta_k$ denote estimate in iteration (epoch) $k$ for batch training with constant gain (step size) $a$:

$$\hat\theta_{k+1} = \hat\theta_k - a\,\frac{1}{n}\sum_{j=0}^{n-1} Y_j(\hat\theta_k)$$

• Batch algorithm can be written as type of recursive algorithm to expose difference with online (next slide)

Side-by-Side Representations of Batch and Online Training

• Let $\hat\theta_k^{(m)}$ denote estimate in iteration $k$ within epoch $m$; let $\hat\theta_0^{(m)}$ denote initial condition for $\hat\theta_k^{(m)}$ in epoch $m$
• Batch and online training may be viewed in following nested loop:

For $m$ = 1 to number of epochs
    For $k$ = 0 to $n-1$
        $\hat\theta_{k+1}^{(m)} = \hat\theta_k^{(m)} - a_k Y_k(\hat\theta_k^{(m)})$, online
        $\hat\theta_{k+1}^{(m)} = \hat\theta_k^{(m)} - a_k Y_k(\hat\theta_0^{(m)})$, batch with $a_k = a/n$
    end $k$ loop
end $m$ loop

• Epochs are linked by $\hat\theta_0^{(m)} = \hat\theta_n^{(m-1)}$ (initial value in epoch $m$ is final value in epoch $m-1$)
• Note the equivalence of above to basic algorithm for batch processing: $\hat\theta_n^{(m)}$ for batch is same as $\hat\theta_m$ on previous slide
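The nested-loop view of batch and online training can be sketched directly in code. The example below is a hedged illustration with made-up data and the assumed scalar model $h(\theta,x) = \theta x$; for simplicity the online branch uses a constant gain $a$, and the names `train` and `batch_direct` are hypothetical.

```python
def y_k(theta, x, z):
    """Stochastic gradient for the assumed scalar model h(theta, x) = theta * x."""
    return x * (theta * x - z)

def train(xs, zs, theta_init, a, epochs, mode):
    """Nested epoch/iteration loop; mode is 'online' or 'batch'."""
    n = len(xs)
    theta_start = theta_init                  # initial condition for the first epoch
    for _ in range(epochs):
        theta = theta_start
        for k in range(n):
            if mode == "online":
                # Gradient evaluated at the current within-epoch estimate.
                theta = theta - a * y_k(theta, xs[k], zs[k])
            else:
                # Gradient evaluated at the epoch's initial estimate, with gain a / n.
                theta = theta - (a / n) * y_k(theta_start, xs[k], zs[k])
        theta_start = theta                   # final value seeds the next epoch
    return theta

def batch_direct(xs, zs, theta_init, a, epochs):
    """Standard batch form: theta <- theta - a * (1/n) * sum_j Y_j(theta)."""
    theta, n = theta_init, len(xs)
    for _ in range(epochs):
        theta = theta - a * sum(y_k(theta, x, z) for x, z in zip(xs, zs)) / n
    return theta

xs = [0.5, -1.0, 1.5, 2.0]
zs = [1.1, -1.9, 3.2, 3.9]                    # roughly z = 2 * x (made-up data)
print(train(xs, zs, 0.0, a=0.1, epochs=5, mode="online"))
print(train(xs, zs, 0.0, a=0.1, epochs=5, mode="batch"))
print(batch_direct(xs, zs, 0.0, a=0.1, epochs=5))   # matches the 'batch' line up to rounding
```

The only difference between the two branches is the point at which the gradient is evaluated, which is the distinction the side-by-side representation is meant to expose.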

Is Online Training Better than Batch Training?

• Wilson and Martinez (2003) present semi-formal arguments for superiority of online training
  – Consistent with much numerical evidence from others
• Essential idea
  – In batch training, algorithm “stuck” with older estimate of $\theta$ for all summands in loss function at each epoch (i.e., for each data pair $x_k, z_k$)
  – Estimate of $\theta$ only updated once per epoch
  – In practice, effective gain $a_k$ (after adjusting for division by $n$ in batch average) must be lower in batch update (vs. online), causing slower batch convergence
• In contrast, online training updates gradient estimate for change in each data pair $x_k, z_k$
  – Allows for stochastic gradient that is unbiased around current value of $\theta$ (vs. around earlier value of $\theta$)
• Authors make statement about reducing batch gain by factor of root-$n$ to get same stability as online (p. 1437)

Neural Networks

• NNs are general function approximators
• Actual output $z_k$ represented by a NN according to standard model $z_k = h(\theta,x_k) + v_k$
  – $h(\theta,x_k)$ represents NN output for input $x_k$ and weight values $\theta$
  – $v_k$ represents noise
• Diagram of simple feedforward NN on next slide
• Most popular training method is SGD with backpropagation (mean-squared-type loss function)
• Backpropagation computes $\partial h/\partial\theta$ in SGD recursion:

$$\hat\theta_{k+1} = \hat\theta_k - a_k\left.\frac{\partial Q_k(\theta,V_k)}{\partial\theta}\right|_{\theta=\hat\theta_k} = \hat\theta_k - a_k\left.\frac{\partial h(\theta,x_{k+1})}{\partial\theta}\right|_{\theta=\hat\theta_k}\big(h(\hat\theta_k,x_{k+1}) - z_{k+1}\big)$$
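The SGD recursion with backpropagation can be sketched for a very small network. Everything specific here is an assumption for illustration: a scalar-input, scalar-output network with one hidden layer of five tanh units, a noisy sine target, a constant gain of 0.02, and the function names `forward`, `backprop`, and `sgd_step`.

```python
import math
import random

def forward(theta, x):
    """One-hidden-layer network: h = sum_j w2[j] * tanh(w1[j] * x + b1[j]) + b2."""
    w1, b1, w2, b2 = theta
    hidden = [math.tanh(w * x + b) for w, b in zip(w1, b1)]
    return sum(v * a for v, a in zip(w2, hidden)) + b2, hidden

def backprop(theta, x):
    """Return h(theta, x) and dh/dtheta in the same (w1, b1, w2, b2) layout as theta."""
    w1, b1, w2, b2 = theta
    out, hidden = forward(theta, x)
    delta = [v * (1.0 - a * a) for v, a in zip(w2, hidden)]   # dh / d(pre-activation j)
    return out, ([d * x for d in delta], delta, hidden, 1.0)

def sgd_step(theta, x, z, a_k):
    """theta <- theta - a_k * dh/dtheta * (h(theta, x) - z)."""
    out, (g_w1, g_b1, g_w2, g_b2) = backprop(theta, x)
    err = out - z
    w1, b1, w2, b2 = theta
    return ([w - a_k * g * err for w, g in zip(w1, g_w1)],
            [b - a_k * g * err for b, g in zip(b1, g_b1)],
            [v - a_k * g * err for v, g in zip(w2, g_w2)],
            b2 - a_k * g_b2 * err)

def mean_sq_error(theta, pairs):
    """Average squared prediction error over a list of (x, z) pairs."""
    return sum((z - forward(theta, x)[0]) ** 2 for x, z in pairs) / len(pairs)

rng = random.Random(6)
n_hidden = 5
theta = ([rng.uniform(-1, 1) for _ in range(n_hidden)],
         [rng.uniform(-1, 1) for _ in range(n_hidden)],
         [rng.uniform(-1, 1) for _ in range(n_hidden)],
         0.0)

test_pairs = [(x / 10.0, math.sin(x / 10.0)) for x in range(-20, 21)]
before = mean_sq_error(theta, test_pairs)
for k in range(20000):
    x = rng.uniform(-2.0, 2.0)
    z = math.sin(x) + rng.gauss(0.0, 0.05)     # noisy target (assumed data model)
    theta = sgd_step(theta, x, z, a_k=0.02)
after = mean_sq_error(theta, test_pairs)
print(before, after)   # the error after training is expected to be smaller
```

Each call to `sgd_step` performs one forward pass and one backward pass for a single data pair, matching the one-backpropagation-per-iteration cost of online training.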

Discrete-Event Dynamic Systems

• Many applications of stochastic gradient methods in simulation-based optimization
• Discrete-event dynamic systems frequently modeled by simulation
  – Trajectories of process are piecewise constant
• Derivative–integral interchange critical
  – Interchange not valid in many realistic systems
  – Interchange condition checked on case-by-case basis
• Overall approach requires knowledge of inner workings of simulation
  – Needed to obtain $\partial Q(\theta,V)/\partial\theta$
  – Chapters 14 and 15 of ISSO have extensive discussion of simulation-based optimization
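A toy case shows how a piecewise constant $Q$ can break the derivative–integral interchange. The setting is an assumption for illustration and is not taken from ISSO: $V$ is a scalar random variable with density $f_V$ and distribution function $F_V$, neither depending on $\theta$, and the cost is the indicator $Q(\theta,V) = \mathbf{1}\{V \le \theta\}$.

```latex
L(\theta) = E\big[\mathbf{1}\{V \le \theta\}\big] = F_V(\theta)
\;\Longrightarrow\;
\frac{\partial L}{\partial\theta} = f_V(\theta),
\qquad\text{whereas}\qquad
\frac{\partial Q(\theta,V)}{\partial\theta} = 0 \ \text{ for all } \theta \ne V
\;\Longrightarrow\;
E\!\left[\frac{\partial Q(\theta,V)}{\partial\theta}\right] = 0 .
```

Here $\partial Q/\partial\theta$ is a biased measurement of $\partial L/\partial\theta$ wherever $f_V(\theta) > 0$, so the basic stochastic gradient form cannot be applied without restructuring the problem.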

Image Restoration

• Aim is to recover true image subject to having recorded image corrupted by noise
• Common to construct least-squares type problem

$$\min_s \|Z - Hs\|^2$$

where $Z$ is the recorded image and $Hs$ represents a convolution of the measurement process ($H$) and the true pixel-by-pixel image ($s$)
• Can be solved by either batch linear regression methods or the LMS/RLS methods
• Nonlinear measurements need full power of stochastic gradient method
  – Measurements modeled as $Z = F(s,x,V)$

References

White, H. (1989), “Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Neural Networks,” J. Amer. Stat. Assoc., vol. 84, pp. 1003–1013.

Wilson, D. R. and Martinez, T. R. (2003), “The General Inefficiency of Batch Training for Gradient Descent Learning,” Neural Networks, vol. 16, pp. 1429–1451.