
CHAPTER 5

STOCHASTIC GRADIENT FORM OF STOCHASTIC APPROXIMATION

• Organization of chapter in ISSO
– Stochastic gradient
   • Core algorithm
   • Basic principles
   • Nonlinear regression
   • Connections to LMS
– Neural network training
– Discrete event dynamic systems
– Image processing
• Note: Some material in these slides relating to online stochastic gradient descent goes slightly beyond coverage in Chapter 5 of ISSO

Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall

Stochastic Gradient Formulation

• For differentiable $L(\theta)$, recall familiar set of $p$ equations and $p$ unknowns for use in finding a minimum $\theta^*$:

$$g(\theta) = \frac{\partial L}{\partial \theta} = 0$$

• Above is special case of root-finding problem
• Suppose cannot observe $L(\theta)$ and $g(\theta)$ except in presence of noise
– Adaptive control (target tracking)
– Simulation-based optimization
– Machine learning (ML)
– Etc.
• Seek unbiased measurement of $\partial L/\partial\theta$ for optimization


Stochastic Gradient Formulation (Cont’d)

• Suppose $L(\theta) = E[Q(\theta, V)]$
– $V$ represents all random effects
– $Q(\theta, V)$ represents “observed” cost (noisy measurement of $L(\theta)$)
• Seek a representation where $\partial Q/\partial\theta$ is an unbiased measurement of $\partial L/\partial\theta$
– Not true when distribution function for $V$ depends on $\theta$
• Above implies that desired representation is

$$E[Q(\theta, V)] = \int Q(\theta, \upsilon)\, p_V(\upsilon)\, d\upsilon,$$

not

$$E[Q(\theta, V)] = \int Q(\theta, \upsilon)\, p_V(\upsilon \mid \theta)\, d\upsilon,$$

where $p_V(\cdot)$ is density function for $V$
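A small worked example may make the distinction concrete. It is an illustration under assumed distributions, not an example taken from ISSO. Suppose the observed cost is $V^2$ with $V \sim N(\theta, 1)$, so the density of $V$ depends on $\theta$:

$$L(\theta) = E[V^2] = \theta^2 + 1, \qquad \frac{\partial L}{\partial \theta} = 2\theta .$$

Writing $Q(\theta, \upsilon) = \upsilon^2$ with density $p_V(\upsilon \mid \theta)$ gives $\partial Q/\partial\theta = 0$, which is a biased measurement of $\partial L/\partial\theta$ whenever $\theta \neq 0$. Re-expressing the randomness as $V = \theta + W$ with $W \sim N(0, 1)$ (density free of $\theta$) gives

$$Q(\theta, W) = (\theta + W)^2, \qquad \frac{\partial Q}{\partial \theta} = 2(\theta + W), \qquad E\!\left[\frac{\partial Q}{\partial \theta}\right] = 2\theta = \frac{\partial L}{\partial \theta},$$

so the second representation yields an unbiased gradient measurement while the first does not.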


Stochastic Gradient Measurement and Algorithm

• When density $p_V(\cdot)$ is independent of $\theta$,

$$Y(\theta) = \frac{\partial Q(\theta, V)}{\partial \theta}$$

is unbiased measurement of $\partial L/\partial\theta$ ($Y$ is “stochastic gradient”)
– Requires derivative–integral interchange in $\partial L/\partial\theta = \partial E[Q(\theta, V)]/\partial\theta = E[\partial Q(\theta, V)/\partial\theta]$ to be valid (Theorem 5.1 in ISSO)
• Can use root-finding (Robbins-Monro) SA algorithm with stochastic gradient above to attempt to find $\theta^*$:

$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k Y_k(\hat{\theta}_k)$$

• Unbiased measurement satisfies key convergence conditions of SA (Section 4.3 in ISSO)

Note: Popular stochastic gradient algorithm above is invalid if $p_V(\cdot)$ depends on $\theta$; see Exercise 5.2 or p. 415 of ISSO for alternative form when $p_V(\cdot)$ depends on $\theta$
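The recursion above can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are not in the slides: a scalar $\theta$, the made-up cost $Q(\theta, V) = \frac{1}{2}(\theta - V)^2$ with $V \sim N(\mu, 1)$ (so $\theta^* = \mu$), and the gain $a_k = 1/(k+1)$. The names `robbins_monro_sg`, `noisy_gradient`, and `MU` are hypothetical.

```python
import random

def robbins_monro_sg(grad_measurement, theta0, n_iter, gain):
    """Run theta_{k+1} = theta_k - a_k * Y_k(theta_k) for a scalar theta."""
    theta = theta0
    for k in range(n_iter):
        theta = theta - gain(k) * grad_measurement(theta)
    return theta

rng = random.Random(0)
MU = 3.0  # assumed true minimizer theta* for this toy problem

def noisy_gradient(theta):
    # Q(theta, V) = 0.5 * (theta - V)**2 with V ~ N(MU, 1). The distribution
    # of V does not depend on theta, so dQ/dtheta = theta - V is an unbiased
    # measurement of dL/dtheta = theta - MU.
    v = rng.gauss(MU, 1.0)
    return theta - v

theta_hat = robbins_monro_sg(
    noisy_gradient, theta0=0.0, n_iter=10_000, gain=lambda k: 1.0 / (k + 1)
)
print(theta_hat)
```

With this particular cost and gain, the recursion reduces to a running sample mean of the $V$ draws, which is one way to see why a decaying gain averages out the noise.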


Stochastic Gradient and LMS Connections

• Recall basic linear model from Chapter 3:

$$z_k = h_k^T \theta + v_k$$

• Consider standard MSE loss:

$$L(\theta) = \tfrac{1}{2} E\big[(z_k - h_k^T \theta)^2\big]$$

– Implies

$$Q_k = \tfrac{1}{2}\big(z_{k+1} - h_{k+1}^T \theta\big)^2$$

• Recall basic LMS algorithm from Chapter 3:

$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \underbrace{h_{k+1}\big(h_{k+1}^T \hat{\theta}_k - z_{k+1}\big)}_{\partial Q_k/\partial\theta \,=\, Y_k(\hat{\theta}_k)}$$

• Hence LMS is direct application of stochastic gradient SA
• Proposition 5.1 in ISSO shows how SA convergence theory applies to LMS
– Implies convergence of LMS to $\theta^*$
• See Appendix in slides for Chap. 4 for discussion of iterate averaging (Sect. 4.5.3) in context of stochastic gradient for linear models (including LMS as special case)
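The LMS recursion can be sketched as follows. This is an illustrative example, not code from ISSO: the two-dimensional `THETA_STAR`, the Gaussian regressors, the noise level, and the decaying gain $a_k = 1/(k + 20)$ are all assumptions chosen for the demonstration.

```python
import random

def lms(data, theta0, gain):
    """LMS recursion: theta <- theta - a_k * h * (h^T theta - z)."""
    theta = list(theta0)
    for k, (h, z) in enumerate(data):
        # Prediction error h^T theta - z for the current data pair.
        residual = sum(hi * ti for hi, ti in zip(h, theta)) - z
        a_k = gain(k)
        theta = [ti - a_k * hi * residual for ti, hi in zip(theta, h)]
    return theta

rng = random.Random(1)
THETA_STAR = [2.0, -1.0]  # assumed true parameter vector

def make_pair():
    # Linear model z = h^T theta* + v with small Gaussian noise v.
    h = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    z = sum(hi * ti for hi, ti in zip(h, THETA_STAR)) + rng.gauss(0.0, 0.1)
    return h, z

data = [make_pair() for _ in range(5000)]
theta_hat = lms(data, theta0=[0.0, 0.0], gain=lambda k: 1.0 / (k + 20))
print(theta_hat)
```

Each pass through the loop uses one data pair, so the update term is exactly the stochastic gradient $Y_k(\hat{\theta}_k)$ of the instantaneous squared error.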


Stochastic Gradient Descent in Machine Learning

• Stochastic gradient descent (SGD) is “go-to” method in vast number of ML applications, including deep learning in neural networks (NNs)
• SGD in basic form is application of standard SG methods
• Basic SGD is relatively robust method that is easy and “works” (but not likely the fastest method in given problem)
• Many extensions possible: Momentum, iterate averaging (Sect. 4.5), mini-batches, Adagrad, Adam, RMSProp, natural gradient descent, and second-order (Newton-type) methods
– Above may show improved performance in applications, but tend to need more tuning and be less robust than basic SGD
• Need to distinguish between “true” stochastic optimization problem with MSE metric and common ML framework of minimizing empirical risk function (ERF)…


MSE Relative to Empirical Risk Function

Idealized loss function (MSE):

$$L(\theta) = \tfrac{1}{2} E\big[(z_k - h(\theta, x_k))^2\big] = \frac{1}{2n} \sum_{k=1}^{n} E\big[(z_k - h(\theta, x_k))^2\big]$$

where second equality is true for $\{x_k, z_k\}$ being i.i.d. input-output data pairs

• Minimizing MSE not feasible in practice because cannot compute expected value

Feasible batch criterion, i.e., empirical risk function:

$$\mathrm{ERF}(\theta) = \frac{1}{2n} \sum_{k=1}^{n} \big(z_k - h(\theta, x_k)\big)^2$$

• For large $n$, minimum of ERF close to $\theta^* = \arg\min L(\theta)$
• Batch training algorithms use deterministic optimization to minimize ERF; in contrast, “standard” SGD applies to each summand one-at-a-time to minimize ERF
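The relationship between the ERF minimizer and $\theta^*$ can be checked numerically. The sketch below assumes a hypothetical scalar linear model $h(\theta, x) = \theta x$ and made-up i.i.d. data; for this special case the ERF minimizer has the usual least-squares closed form, which is used here in place of an iterative batch optimizer.

```python
import random

def empirical_risk(theta, pairs, h):
    """ERF(theta) = (1 / (2n)) * sum_k (z_k - h(theta, x_k))**2."""
    n = len(pairs)
    return sum((z - h(theta, x)) ** 2 for x, z in pairs) / (2 * n)

def linear_model(theta, x):
    # Assumed scalar model h(theta, x) = theta * x (illustration only).
    return theta * x

rng = random.Random(3)
THETA_STAR = 1.5  # assumed minimizer of the idealized MSE loss
pairs = []
for _ in range(2000):
    x = rng.uniform(0.0, 2.0)
    pairs.append((x, THETA_STAR * x + rng.gauss(0.0, 0.5)))

# For this linear model the ERF minimizer is the least-squares estimate.
theta_n = sum(x * z for x, z in pairs) / sum(x * x for x, _ in pairs)
print(theta_n, empirical_risk(theta_n, pairs, linear_model))
```

The ERF minimizer `theta_n` depends on the particular sample, so it differs slightly from `THETA_STAR`; the gap shrinks as the sample size grows.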


Online Loss Function and Training

• Assume {xk, zk} are input-output data pairs (typically

assumed i.i.d); data may be collected in batch or real-time • Consider data modeled as zk = h(,xk) + noise, where h(,xk)

represents model with unknown parameters  – h(,xk) may represent NN with connection weights 

• Typical Qkrepresents mean-squared difference between actual outcome and prediction:Qk(,Vk)=

½

(zk+1h(,xk+1))2 • By processing data one at a time, then have “on-line”

training, which is SGD

– Note that random vector ½

(

zk+1h(,xk+1)

)

2/represents

“noisy” value of instantaneous true gradient Lk/

• Contrast is batch training where all data are processed at each iteration via sumof squared errors using deterministic steepest descent or other nonlinear programming method

SGD = online trainingalgorithm in ML parlance

Stochastic Gradient for Online Training

• Online training uses instantaneous (stochastic) gradient; i.e., gradient corresponding to each measurement
• Recall

$$L(\theta) = \tfrac{1}{2} E\big[(z_k - h(\theta, x_k))^2\big];$$

then

$$g(\theta) = \frac{1}{2}\,\frac{\partial E\big[(z_k - h(\theta, x_k))^2\big]}{\partial \theta} = E\!\left[\frac{\partial h(\theta, x_k)}{\partial \theta}\big(h(\theta, x_k) - z_k\big)\right]$$

• Above requires derivative–integral interchange (valid for all popular NN architectures)
• From right-most expression above, unbiased stochastic gradient input for use in SA algorithm is:

$$Y_k(\theta) = \frac{\partial h(\theta, x_{k+1})}{\partial \theta}\big(h(\theta, x_{k+1}) - z_{k+1}\big)$$
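The stochastic gradient input above can be tried on a small nonlinear regression problem. The sketch below is an assumption-laden illustration: the one-parameter logistic model `h`, the value of `THETA_STAR`, the input and noise distributions, and the gain $a_k = 40/(k + 100)$ are hypothetical choices, not taken from the slides.

```python
import math
import random

def h(theta, x):
    """Assumed one-parameter model: a single logistic unit."""
    return 1.0 / (1.0 + math.exp(-theta * x))

def dh_dtheta(theta, x):
    """Derivative of the logistic unit with respect to theta."""
    s = h(theta, x)
    return x * s * (1.0 - s)

def stochastic_gradient(theta, x, z):
    """Y_k(theta) = dh/dtheta * (h(theta, x) - z) for one data pair."""
    return dh_dtheta(theta, x) * (h(theta, x) - z)

rng = random.Random(2)
THETA_STAR = 1.0  # assumed true parameter
theta_hat = 0.0
for k in range(20_000):
    # One new input-output pair per iteration (online training).
    x = rng.uniform(-2.0, 2.0)
    z = h(THETA_STAR, x) + rng.gauss(0.0, 0.1)
    a_k = 40.0 / (k + 100)  # decaying gain, a_k -> 0
    theta_hat -= a_k * stochastic_gradient(theta_hat, x, z)
print(theta_hat)
```

Only one evaluation of `dh_dtheta` is needed per iteration, which mirrors the one-backpropagation-per-iteration cost of online training for NNs.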


Some Comments on Stochastic Gradient Form of SA for Online Training

• White (1989) appears to be first to recognize connection of online training to SA
• For NNs, note use of backpropagation to get $\partial h/\partial\theta$ for each $x_k$ as shown in stochastic gradient above
– Each iteration of SA requires one backpropagation calculation (vs. $n$ backpropagations in each batch iteration)
• The term “online” is frequently a misnomer, as training is often done offline
– “Online” is used to refer to process of training with instantaneous stochastic gradient even when done offline
• Gain sequence $a_k \to 0$ required for formal convergence
– But constant gain $a_k = a$ frequently used in practice
– With $a_k = a$, have “small” asymptotic error in estimate


Some Comments on Stochastic Gradient Form of SA for Online Training (cont’d)

• Formal SA theory assumes one pass through data (one “epoch” in NN literature)
– Practical implementations of online training often assume multiple passes through fixed set of data
– Such multiple epochs emphasized in Wilson and Martinez (2003)
• Initial value for SA process in $m$th epoch is final value in $(m-1)$st epoch
• Formal SA convergence theory requires that sample size $n \to \infty$
– Need constant infusion of new data


Some Comments on Stochastic Gradient Form of SA for Online Training (cont’d)

• Another popular modification of basic SGD is mini-batches of several data points per iteration: Compromise between batch gradient and SGD of one training sample per iteration
• Often leads to net computational improvement through vectorization and efficient matrix operations
• Introduces additional tuning parameter: what is good batch size?
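A mini-batch update can be sketched as below. This is a minimal illustration assuming a hypothetical scalar linear model $h(\theta, x) = \theta x$, made-up data, a constant gain, and reshuffling at every epoch; the helper names `minibatches` and `minibatch_sgd` are not from the slides, and plain Python is used, so the vectorization benefit mentioned above is not demonstrated.

```python
import random

def minibatches(pairs, batch_size, rng):
    """Yield shuffled mini-batches that cover the data once (one epoch)."""
    order = list(range(len(pairs)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [pairs[i] for i in order[start:start + batch_size]]

def minibatch_sgd(pairs, theta0, gain, batch_size, n_epochs, rng):
    """SGD where each update averages the gradients of one mini-batch."""
    theta = theta0
    for _ in range(n_epochs):
        for batch in minibatches(pairs, batch_size, rng):
            # Average of per-sample gradients x * (theta * x - z).
            grad = sum(x * (theta * x - z) for x, z in batch) / len(batch)
            theta -= gain * grad
    return theta

rng = random.Random(4)
THETA_STAR = 2.0  # assumed true parameter
pairs = []
for _ in range(1000):
    x = rng.gauss(0.0, 1.0)
    pairs.append((x, THETA_STAR * x + rng.gauss(0.0, 0.1)))

theta_hat = minibatch_sgd(pairs, theta0=0.0, gain=0.1,
                          batch_size=10, n_epochs=5, rng=rng)
print(theta_hat)
```

Setting `batch_size=1` recovers one-sample SGD, and setting it to `len(pairs)` recovers a batch gradient step per epoch, which is the compromise described in the first bullet.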

Representation of Batch Training for Use in Comparing with Online

• Why use online SGD instead of batch?
– Wilson and Martinez (2003) address this issue
• Aim to represent batch algorithm in form for direct comparison to online
• Recall $Y_k$ associated with $(k+1)$st data pair: $x_{k+1}, z_{k+1}$
• Let $\hat{\theta}_k$ denote estimate in iteration (epoch) $k$ for batch training with constant gain (step size):

$$\hat{\theta}_{k+1} = \hat{\theta}_k - a\,\frac{1}{n} \sum_{j=0}^{n-1} Y_j(\hat{\theta}_k)$$

• Batch algorithm can be written as type of recursive algorithm to expose difference with online (next slide)


Side-by-Side Representations of Batch and Online Training

• Let $\hat{\theta}_k^{(m)}$ denote estimate in iteration $k$ within epoch $m$; let $\hat{\theta}_0^{(m)}$ denote initial condition for $\theta$ in epoch $m$
• Batch and online training may be viewed in following nested loop:

For $m$ = 1 to number of epochs
   $\hat{\theta}_0^{(m)} = \hat{\theta}_n^{(m-1)}$
   For $k$ = 0 to $n-1$
      $\hat{\theta}_{k+1}^{(m)} = \hat{\theta}_k^{(m)} - a_k Y_k\big(\hat{\theta}_k^{(m)}\big)$, online
      $\hat{\theta}_{k+1}^{(m)} = \hat{\theta}_k^{(m)} - a_k Y_k\big(\hat{\theta}_0^{(m)}\big)$, batch with $a_k = a/n$
   end $k$ loop
end $m$ loop

• Note the equivalence of above to basic algorithm for batch processing: $\hat{\theta}_n^{(m)}$ for batch is same as $\hat{\theta}_m$ on previous slide


Is Online Training Better than Batch Training?

• Wilson and Martinez (2003) present semi-formal arguments for superiority of online training
– Consistent with much numerical evidence from others
• Essential idea
– In batch training, algorithm “stuck” with older estimate of $\theta$ for all summands in loss function at each epoch (i.e., for each data pair $x_k, z_k$)
– Estimate of $\theta$ only updated at each epoch
– In practice, effective gain $a_k$ (after adjusting for division by $n$ in batch average) must be lower in batch update (vs. online), causing slower batch convergence
• In contrast, online training updates gradient estimate for change in each data pair $x_k, z_k$
– Allows for stochastic gradient that is unbiased around current value of $\theta$ (vs. around earlier value of $\theta$)
• Authors make statement about reducing batch gain by factor of $\sqrt{n}$ to get same stability as online (p. 1437)


Neural Networks

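• NNs are general function approximators
– Actual output $z_k$ represented by a NN according to standard model $z_k = h(\theta, x_k) + v_k$
– $h(\theta, x_k)$ represents NN output for input $x_k$ and weight values $\theta$
– $v_k$ represents noise
• Diagram of simple feedforward NN on next slide
• Most popular training method is SGD with backpropagation (mean-squared-type loss function)
• Backpropagation computes $\partial h/\partial\theta$ in SGD recursion:

$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \left.\frac{\partial Q_k(\theta, V_k)}{\partial \theta}\right|_{\theta = \hat{\theta}_k} = \hat{\theta}_k - a_k \left.\frac{\partial h(\theta, x_{k+1})}{\partial \theta}\right|_{\theta = \hat{\theta}_k} \big(h(\hat{\theta}_k, x_{k+1}) - z_{k+1}\big)$$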


Discrete-Event Dynamic Systems

• Many applications of stochastic gradient methods in simulation-based optimization
• Discrete-event dynamic systems frequently modeled by simulation
– Trajectories of process are piecewise constant
• Derivative–integral interchange critical
– Interchange not valid in many realistic systems
– Interchange condition checked on case-by-case basis
• Overall approach requires knowledge of inner workings of simulation
– Needed to obtain $\partial Q(\theta, V)/\partial\theta$
– Chapters 14 and 15 of ISSO have extensive discussion of simulation-based optimization


Image Restoration

• Aim is to recover true image subject to having recorded image corrupted by noise
• Common to construct least-squares type problem

$$\min_{s} \|Z - Hs\|^2$$

where $Hs$ represents a convolution of the measurement process ($H$) and the true pixel-by-pixel image ($s$)
• Can be solved by either batch linear regression methods or the LMS/RLS methods
• Nonlinear measurements need full power of stochastic gradient method
– Measurements modeled as $Z = F(s, x, V)$


References

White, H. (1989), “Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Neural Networks,” J. Amer. Stat. Assoc., vol. 84, pp. 1003–1013.

Wilson, D. R. and Martinez, T. R. (2003), “The General Inefficiency of Batch Training for Gradient Descent Learning,” Neural Networks, vol. 16, pp. 1429–1451.
