CHAPTER 5
STOCHASTIC GRADIENT FORM OF STOCHASTIC APPROXIMATION
• Organization of chapter in ISSO
  – Stochastic gradient
    • Core algorithm
    • Basic principles
    • Nonlinear regression
    • Connections to LMS
  – Neural network training
  – Discrete event dynamic systems
  – Image processing
• Note: Some material in these slides, as it relates to online stochastic gradient descent, goes slightly beyond the coverage in Chapter 5 of ISSO
Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall
Stochastic Gradient Formulation
• For differentiable $L(\theta)$, recall the familiar set of $p$ equations in $p$ unknowns used to find a minimum $\theta^*$:
    $$g(\theta) \equiv \frac{\partial L}{\partial \theta} = 0$$
• The above is a special case of the root-finding problem
• Suppose $L(\theta)$ and $g(\theta)$ cannot be observed except in the presence of noise
  – Adaptive control (target tracking)
  – Simulation-based optimization
  – Machine learning (ML)
  – Etc.
• Seek an unbiased measurement of $\partial L/\partial\theta$ for optimization
Stochastic Gradient Formulation (Cont’d)
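• Suppose $L(\theta) = E[Q(\theta, V)]$
  – $V$ represents all random effects
  – $Q(\theta, V)$ represents the “observed” cost (noisy measurement of $L(\theta)$)
• Seek a representation where $\partial Q/\partial\theta$ is an unbiased measurement of $\partial L/\partial\theta$
  – Not true when the distribution function for $V$ depends on $\theta$
• The above implies that the desired representation is
    $$E[Q(\theta, V)] = \int Q(\theta, \upsilon)\, p_V(\upsilon)\, d\upsilon,$$
  not
    $$E[Q(\theta, V)] = \int Q(\theta, \upsilon)\, p_V(\upsilon \mid \theta)\, d\upsilon,$$
  where $p_V(\cdot)$ is the density function for $V$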
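A short derivation may help show where the bias comes from. It assumes the derivative–integral interchange is valid and writes $p_V(\upsilon \mid \theta)$ for a density that is allowed to depend on $\theta$; the product rule then gives:

```latex
\frac{\partial L}{\partial \theta}
  = \frac{\partial}{\partial \theta} \int Q(\theta, \upsilon)\, p_V(\upsilon \mid \theta)\, d\upsilon
  = \underbrace{\int \frac{\partial Q(\theta, \upsilon)}{\partial \theta}\,
      p_V(\upsilon \mid \theta)\, d\upsilon}_{E[\partial Q / \partial \theta]}
  + \int Q(\theta, \upsilon)\,
      \frac{\partial p_V(\upsilon \mid \theta)}{\partial \theta}\, d\upsilon
```

The second integral vanishes when $p_V$ does not depend on $\theta$, which leaves $E[\partial Q/\partial\theta] = \partial L/\partial\theta$. When $p_V$ does depend on $\theta$, that term is generally nonzero, so $\partial Q/\partial\theta$ alone is generally a biased measurement of the gradient.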
Stochastic Gradient Measurement and Algorithm
• When the density $p_V(\cdot)$ is independent of $\theta$,
    $$Y(\theta) \equiv \frac{\partial Q(\theta, V)}{\partial \theta}$$
  is an unbiased measurement of $\partial L/\partial\theta$ ($Y$ is the “stochastic gradient”)
  – Requires the derivative–integral interchange in $\partial L/\partial\theta = \partial E[Q(\theta, V)]/\partial\theta = E[\partial Q(\theta, V)/\partial\theta]$ to be valid (Theorem 5.1 in ISSO)
• Can use the root-finding (Robbins-Monro) SA algorithm with the stochastic gradient above to attempt to find $\theta^*$:
    $$\hat\theta_{k+1} = \hat\theta_k - a_k Y_k(\hat\theta_k)$$
• Unbiased measurement satisfies key convergence conditions of SA (Section 4.3 in ISSO)
• Note: The popular stochastic gradient algorithm above is invalid if $p_V(\cdot)$ depends on $\theta$; see Exercise 5.2 or p. 415 of ISSO for an alternative form when $p_V(\cdot)$ depends on $\theta$
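The recursion $\hat\theta_{k+1} = \hat\theta_k - a_k Y_k(\hat\theta_k)$ can be sketched in a few lines. This is a minimal illustration, not code from ISSO: the cost $Q(\theta, V) = \tfrac{1}{2}\|\theta - V\|^2$, the Gaussian noise, the gain $a_k = a/(k+1)$, and the name `robbins_monro_sgd` are all assumptions chosen so that the minimizer is simply $E[V]$.

```python
import numpy as np

def robbins_monro_sgd(theta0, draws, a=1.0):
    """Iterate theta_{k+1} = theta_k - a_k * Y_k(theta_k) with gain a_k = a / (k + 1)."""
    theta = np.asarray(theta0, dtype=float).copy()
    for k, v in enumerate(draws):
        y = theta - v                      # Y_k(theta) = dQ/dtheta for Q = 0.5 * ||theta - V||^2
        theta = theta - (a / (k + 1)) * y  # Robbins-Monro update
    return theta

rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0])                   # illustrative minimizer, equal to E[V]
draws = theta_star + rng.standard_normal((5000, 2))  # density of V does not depend on theta
theta_hat = robbins_monro_sgd(np.zeros(2), draws)
print(theta_hat)
```

With $a = 1$ this particular recursion reduces to the running sample mean of the draws, which gives a convenient sanity check on the update.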
Stochastic Gradient and LMS Connections
• Recall the basic linear model from Chapter 3:
    $$z_k = h_k^T \theta + v_k$$
• Consider the standard MSE loss:
    $$L(\theta) = \tfrac{1}{2} E\big[(z_{k+1} - h_{k+1}^T \theta)^2\big]$$
  – Implies $Q_k = \tfrac{1}{2}(z_{k+1} - h_{k+1}^T \theta)^2$
• Recall the basic LMS algorithm from Chapter 3:
    $$\hat\theta_{k+1} = \hat\theta_k - a_k \underbrace{h_{k+1}\big(h_{k+1}^T \hat\theta_k - z_{k+1}\big)}_{Y_k = \partial Q_k/\partial\theta}$$
• Hence LMS is a direct application of stochastic gradient SA
• Proposition 5.1 in ISSO shows how SA convergence theory applies to LMS
  – Implies convergence of LMS to $\theta^*$
• See the Appendix in the slides for Chap. 4 for discussion of iterate averaging (Sect. 4.5.3) in the context of stochastic gradient for linear models (including LMS as a special case)
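The LMS update for the linear model can be written out as a sketch. The data, the constant gain `a=0.01`, and the function name `lms` are illustrative assumptions; each row of `H` plays the role of $h_{k+1}^T$.

```python
import numpy as np

def lms(H, z, a, theta0):
    """LMS: theta_{k+1} = theta_k - a * h_{k+1} * (h_{k+1}^T theta_k - z_{k+1})."""
    theta = np.asarray(theta0, dtype=float).copy()
    for h_next, z_next in zip(H, z):
        y = h_next * (h_next @ theta - z_next)  # Y_k = dQ_k/dtheta for Q_k = 0.5 * (z - h^T theta)^2
        theta = theta - a * y
    return theta

rng = np.random.default_rng(1)
theta_true = np.array([2.0, -1.0, 0.5])                 # hypothetical "true" parameters
H = rng.standard_normal((20000, 3))                     # rows are the regressors h_k^T
z = H @ theta_true + 0.1 * rng.standard_normal(20000)   # z_k = h_k^T theta + v_k
theta_lms = lms(H, z, a=0.01, theta0=np.zeros(3))
print(theta_lms)
```

A constant gain is used here only for simplicity; formal convergence results generally call for a decaying gain.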
Stochastic Gradient Descent in Machine Learning
• Stochastic gradient descent (SGD) is the “go-to” method in a vast number of ML applications, including deep learning in neural networks (NNs)
• SGD in basic form is an application of standard SG methods
• Basic SGD is a relatively robust method that is easy and “works” (but is not likely the fastest method in a given problem)
• Many extensions possible: momentum, iterate averaging (Sect. 4.5), mini-batches, Adagrad, Adam, RMSProp, natural gradient descent, and second-order (Newton-type) methods
  – The above may show improved performance in applications, but tend to need more tuning and be less robust than basic SGD
• Need to distinguish between the “true” stochastic optimization problem with MSE metric and the common ML framework of minimizing an empirical risk function (ERF)…
MSE Relative to Empirical Risk Function
• Idealized loss function (MSE):
    $$L(\theta) = \tfrac{1}{2} E\big[(z_k - h(\theta, x_k))^2\big] = \frac{1}{2n} \sum_{k=1}^{n} E\big[(z_k - h(\theta, x_k))^2\big],$$
  where the second equality is true for $\{x_k, z_k\}$ being i.i.d. input-output data pairs
• Minimizing the MSE is not feasible in practice because the expected value cannot be computed
• Feasible batch criterion, i.e., empirical risk function:
    $$\mathrm{ERF}(\theta) = \frac{1}{2n} \sum_{k=1}^{n} \big(z_k - h(\theta, x_k)\big)^2$$
• For large $n$, the minimum of the ERF is close to $\theta^* = \arg\min L(\theta)$
• Batch training algorithms use deterministic optimization to minimize the ERF; in contrast, “standard” SGD applies to each summand one at a time to minimize the ERF
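The claim that the ERF minimizer approaches $\theta^*$ for large $n$ can be checked numerically. The sketch below assumes a scalar linear model $h(\theta, x) = \theta x$ with unit-variance Gaussian inputs and noise (none of which comes from the slides), so the ERF minimizer has a closed form.

```python
import numpy as np

def empirical_risk(theta, x, z):
    """ERF(theta) = (1 / (2n)) * sum_k (z_k - h(theta, x_k))^2 with h(theta, x) = theta * x."""
    return 0.5 * np.mean((z - theta * x) ** 2)

def erf_minimizer(x, z):
    """Closed-form minimizer of the ERF for the scalar linear model."""
    return np.sum(x * z) / np.sum(x * x)

rng = np.random.default_rng(1)
theta_star = 1.5                                  # illustrative value of argmin L(theta)
for n in (10, 1000, 100000):
    x = rng.standard_normal(n)
    z = theta_star * x + rng.standard_normal(n)   # i.i.d. input-output pairs
    theta_n = erf_minimizer(x, z)
    print(n, theta_n, empirical_risk(theta_n, x, z))
```

The printed minimizer should drift toward `theta_star` as `n` grows, while for small `n` it can be noticeably off.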
Online Loss Function and Training
• Assume $\{x_k, z_k\}$ are input-output data pairs (typically assumed i.i.d.); data may be collected in batch or in real time
• Consider data modeled as $z_k = h(\theta, x_k) + \text{noise}$, where $h(\theta, x_k)$ represents a model with unknown parameters $\theta$
  – $h(\theta, x_k)$ may represent a NN with connection weights $\theta$
• Typical $Q_k$ represents the squared difference between actual outcome and prediction: $Q_k(\theta, V_k) = \tfrac{1}{2}\big(z_{k+1} - h(\theta, x_{k+1})\big)^2$
• Processing the data one pair at a time gives “on-line” training, which is SGD
  – Note that the random vector $\partial\big[\tfrac{1}{2}(z_{k+1} - h(\theta, x_{k+1}))^2\big]/\partial\theta$ represents a “noisy” value of the instantaneous true gradient $\partial L_k/\partial\theta$
• Contrast is batch training, where all data are processed at each iteration via the sum of squared errors using deterministic steepest descent or another nonlinear programming method
• SGD = online training algorithm in ML parlance
Stochastic Gradient for Online Training
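• Online training uses the instantaneous (stochastic) gradient; i.e., the gradient corresponding to each measurement
• Recall
    $$L(\theta) = \tfrac{1}{2} E\big[(z_k - h(\theta, x_k))^2\big];$$
  then
    $$g(\theta) = \frac{1}{2}\,\frac{\partial E\big[(z_k - h(\theta, x_k))^2\big]}{\partial\theta} = E\left[\frac{\partial h(\theta, x_k)}{\partial\theta}\,\big(h(\theta, x_k) - z_k\big)\right]$$
• The above requires the derivative–integral interchange (valid for all popular NN architectures)
• From the right-most expression above, the unbiased stochastic gradient input for use in the SA algorithm is:
    $$Y_k(\theta) = \frac{\partial h(\theta, x_{k+1})}{\partial\theta}\,\big(h(\theta, x_{k+1}) - z_{k+1}\big)$$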
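As a concrete instance of $Y_k(\theta) = \frac{\partial h}{\partial\theta}\,(h - z)$, the sketch below uses a hypothetical two-parameter model $h(\theta, x) = \theta_0 \tanh(\theta_1 x)$ (an assumption for illustration only) and compares the analytic stochastic gradient with central finite differences of $Q_k$.

```python
import numpy as np

def h(theta, x):
    """Illustrative nonlinear model h(theta, x) = theta[0] * tanh(theta[1] * x)."""
    return theta[0] * np.tanh(theta[1] * x)

def dh_dtheta(theta, x):
    """Gradient of h with respect to theta."""
    t = np.tanh(theta[1] * x)
    return np.array([t, theta[0] * x * (1.0 - t ** 2)])

def stochastic_gradient(theta, x_next, z_next):
    """Y_k(theta) = dh/dtheta * (h(theta, x_{k+1}) - z_{k+1})."""
    return dh_dtheta(theta, x_next) * (h(theta, x_next) - z_next)

def q(theta, x_next, z_next):
    """Q_k(theta) = 0.5 * (z_{k+1} - h(theta, x_{k+1}))^2."""
    return 0.5 * (z_next - h(theta, x_next)) ** 2

theta = np.array([0.8, -1.3])      # arbitrary evaluation point
x_next, z_next = 0.6, 0.25         # one made-up data pair
y = stochastic_gradient(theta, x_next, z_next)

# Central finite differences of Q_k as an independent check on Y_k
eps = 1e-6
fd = np.array([
    (q(theta + eps * e, x_next, z_next) - q(theta - eps * e, x_next, z_next)) / (2 * eps)
    for e in np.eye(2)
])
print(y, fd)
```

The two printed vectors should agree to several digits, which is a quick way to catch sign errors in a hand-derived gradient.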
Some Comments on Stochastic Gradient Form of SA for Online Training
• White (1989) appears to be the first to recognize the connection of online training to SA
• For NNs, note the use of backpropagation to get $\partial h/\partial\theta$ for each $x_k$, as shown in the stochastic gradient above
  – Each iteration of SA requires one backpropagation calculation (vs. $n$ backpropagations in each batch iteration)
• The term “online” is frequently a misnomer, as training is often done offline
  – “Online” is used to refer to the process of training with the instantaneous stochastic gradient even when done offline
• Gain sequence $a_k \to 0$ required for formal convergence
  – But constant gain $a_k = a$ is frequently used in practice
  – With $a_k = a$, have “small” asymptotic error in the estimate
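The difference between a decaying gain and a constant gain can be seen in a small experiment. This is a sketch under assumed settings: a scalar cost $Q(\theta, V) = \tfrac{1}{2}(\theta - V)^2$, decaying gain $a_k = 1/(k+1)$, and constant gain $a = 0.1$; the helper `run_sgd` is hypothetical.

```python
import numpy as np

def run_sgd(gains, draws, theta0=0.0):
    """Scalar SGD for Q(theta, V) = 0.5 * (theta - V)^2; returns every iterate."""
    theta = theta0
    path = np.empty(len(draws))
    for k, v in enumerate(draws):
        theta = theta - gains[k] * (theta - v)   # Y_k(theta) = theta - V_k
        path[k] = theta
    return path

rng = np.random.default_rng(2)
n = 5000
theta_star = 3.0                                  # illustrative minimizer, equal to E[V]
draws = theta_star + rng.standard_normal(n)
steps = np.arange(n)
path_decay = run_sgd(1.0 / (steps + 1), draws)    # a_k -> 0
path_const = run_sgd(np.full(n, 0.1), draws)      # a_k = a = 0.1

tail = slice(n - 1000, n)                         # look only at the last 1000 iterates
mse_decay = np.mean((path_decay[tail] - theta_star) ** 2)
mse_const = np.mean((path_const[tail] - theta_star) ** 2)
print(mse_decay, mse_const)
```

In runs like this the constant-gain iterates keep fluctuating around $\theta^*$ with a variance that scales roughly with $a$, while the decaying-gain error continues to shrink.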
Some Comments on Stochastic Gradient Form of SA for Online Training (cont’d)
• Formal SA theory assumes one pass through the data (one “epoch” in NN literature)
  – Practical implementations of online training often assume multiple passes through a fixed set of data
  – Such multiple epochs are emphasized in Wilson and Martinez (2003)
    • Initial value for the SA process in the $m$th epoch is the final value in the $(m-1)$st epoch
• Formal SA convergence theory requires that sample size $n \to \infty$
  – Need constant infusion of new data
Some Comments on Stochastic Gradient Form of SA for Online Training (cont’d)
• Another popular modification of basic SGD is mini-batches of several data points per iteration: a compromise between the batch gradient and SGD with one training sample per iteration
• Mini-batches often lead to a net computational improvement through vectorization and efficient matrix operations
• Mini-batches introduce an additional tuning parameter: what is a good batch size?
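A mini-batch update averages the per-sample stochastic gradients over a small block of data, which is where the vectorization benefit comes from. The sketch below assumes a linear model, a constant gain, and reshuffling at every epoch; the function `minibatch_sgd` and all numerical settings are illustrative choices, not values from the slides.

```python
import numpy as np

def minibatch_sgd(X, z, theta0, a, batch_size, n_epochs, rng):
    """Mini-batch SGD for the linear model h(theta, x) = x^T theta with constant gain a."""
    theta = np.asarray(theta0, dtype=float).copy()
    n = len(z)
    for _ in range(n_epochs):
        order = rng.permutation(n)                 # reshuffle each epoch (assumed convention)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            resid = X[idx] @ theta - z[idx]        # h(theta, x) - z for the whole mini-batch
            grad = X[idx].T @ resid / len(idx)     # average of the per-sample gradients Y_k
            theta = theta - a * grad
    return theta

rng = np.random.default_rng(4)
theta_true = np.array([1.0, -0.5, 2.0])            # hypothetical "true" parameters
X = rng.standard_normal((2000, 3))
z = X @ theta_true + 0.1 * rng.standard_normal(2000)
theta_mb = minibatch_sgd(X, z, np.zeros(3), a=0.05, batch_size=32, n_epochs=20, rng=rng)
print(theta_mb)
```

Setting `batch_size=1` gives one-sample SGD and `batch_size=len(z)` gives a full batch gradient step per epoch, so the batch size interpolates between the two extremes.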
Representation of Batch Training for Use in Comparing with Online
• Why use online SGD instead of batch?
  – Wilson and Martinez (2003) address this issue
• Aim to represent the batch algorithm in a form for direct comparison to online
• Recall $Y_k$ is associated with the $(k+1)$st data pair: $x_{k+1}, z_{k+1}$
• Let $\hat\theta_k$ denote the estimate in iteration (epoch) $k$ for batch training with constant gain (step size) $a$:
    $$\hat\theta_{k+1} = \hat\theta_k - a\,\frac{1}{n} \sum_{j=0}^{n-1} Y_j(\hat\theta_k)$$
• The batch algorithm can be written as a type of recursive algorithm to expose the difference with online (next slide)
Side-by-Side Representations of Batch and Online Training
• Let $\hat\theta_k^{(m)}$ denote the estimate in iteration $k$ within epoch $m$; let $\hat\theta_0^{(m)}$ denote the initial condition for $\hat\theta_k^{(m)}$ in epoch $m$
• Batch and online training may be viewed in the following nested loop:
    For $m$ = 1 to number of epochs
      For $k$ = 0 to $n-1$
        $\hat\theta_{k+1}^{(m)} = \hat\theta_k^{(m)} - a_k Y_k\big(\hat\theta_k^{(m)}\big)$, online
        $\hat\theta_{k+1}^{(m)} = \hat\theta_k^{(m)} - a_k Y_k\big(\hat\theta_0^{(m)}\big)$, batch with $a_k = a/n$
      end $k$ loop
    end $m$ loop
  where $\hat\theta_0^{(m)} = \hat\theta_n^{(m-1)}$
• Note the equivalence of the above to the basic algorithm for batch processing: $\hat\theta_n^{(m)}$ for batch is the same as $\hat\theta_m$ on the previous slide
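The nested loop can be transcribed almost line for line. The sketch below assumes a linear model so that $Y_k(\theta) = x_{k+1}(x_{k+1}^T\theta - z_{k+1})$, and uses a constant gain for both modes; `train`, `grad_sample`, and the data are hypothetical. The only difference between the two branches is the point at which the gradient is evaluated (and the division by $n$ in the batch gain).

```python
import numpy as np

def grad_sample(theta, x, z):
    """Y_k(theta) for the linear model h(theta, x) = x^T theta."""
    return x * (x @ theta - z)

def train(X, z, theta0, a, n_epochs, mode):
    """Nested epoch/sample loop; mode is 'online' or 'batch'."""
    if mode not in ("online", "batch"):
        raise ValueError("mode must be 'online' or 'batch'")
    n = len(z)
    theta = np.asarray(theta0, dtype=float).copy()
    for m in range(n_epochs):
        theta_start = theta.copy()   # initial condition for this epoch = final value of the previous one
        for k in range(n):           # row k of X holds the (k+1)st data pair
            if mode == "online":
                theta = theta - a * grad_sample(theta, X[k], z[k])              # gradient at current iterate
            else:
                theta = theta - (a / n) * grad_sample(theta_start, X[k], z[k])  # gradient at epoch start
    return theta

rng = np.random.default_rng(5)
theta_true = np.array([0.5, -1.0])                  # hypothetical "true" parameters
X = rng.standard_normal((200, 2))
z = X @ theta_true + 0.1 * rng.standard_normal(200)
theta_online = train(X, z, np.zeros(2), a=0.05, n_epochs=10, mode="online")
theta_batch = train(X, z, np.zeros(2), a=0.05, n_epochs=10, mode="batch")
print(theta_online, theta_batch)
```

One batch epoch reproduces a single full-gradient step with gain $a$, which matches the equivalence noted on the slide. With these particular settings the online estimate ends much closer to `theta_true` after 10 epochs, though that comparison depends on the gain chosen for each mode.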
Is Online Training Better than Batch Training?
• Wilson and Martinez (2003) present semi-formal arguments for the superiority of online training
  – Consistent with much numerical evidence from others
• Essential idea
  – In batch training, the algorithm is “stuck” with an older estimate of $\theta$ for all summands in the loss function at each epoch (i.e., for each data pair $x_k, z_k$)
  – Estimate of $\theta$ is only updated at each epoch
  – In practice, the effective gain $a_k$ (after adjusting for division by $n$ in the batch average) must be lower in the batch update (vs. online), causing slower batch convergence
• In contrast, online training updates the gradient estimate for the change in each data pair $x_k, z_k$
  – Allows for a stochastic gradient that is unbiased around the current value of $\theta$ (vs. around an earlier value of $\theta$)
• Authors make a statement about reducing the batch gain by a factor of root-$n$ to get the same stability as online (p. 1437)
Neural Networks
• NNs are general function approximators
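• Actual output $z_k$ is represented by a NN according to the standard model $z_k = h(\theta, x_k) + v_k$
  – $h(\theta, x_k)$ represents the NN output for input $x_k$ and weight values $\theta$
  – $v_k$ represents noise
• Diagram of a simple feedforward NN on next slide
• Most popular training method is SGD with backpropagation (mean-squared-type loss function)
• Backpropagation computes $\partial h/\partial\theta$ in the SGD recursion:
    $$\hat\theta_{k+1} = \hat\theta_k - a_k \left.\frac{\partial Q_k(\theta, V_k)}{\partial\theta}\right|_{\theta=\hat\theta_k} = \hat\theta_k - a_k \left.\frac{\partial h(\theta, x_{k+1})}{\partial\theta}\right|_{\theta=\hat\theta_k} \big(h(\hat\theta_k, x_{k+1}) - z_{k+1}\big)$$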
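For a small network the backpropagation step can be written by hand. The sketch below assumes a feedforward NN with one tanh hidden layer and a scalar linear output; the layer sizes, the parameter tuple `(W1, b1, w2, b2)`, and the helper names are illustrative assumptions rather than the network pictured in the slides.

```python
import numpy as np

def forward(params, x):
    """Return the scalar output h(theta, x) and the hidden-layer activations."""
    W1, b1, w2, b2 = params
    hidden = np.tanh(W1 @ x + b1)
    return w2 @ hidden + b2, hidden

def backprop(params, x):
    """Return h(theta, x) and dh/dtheta as a tuple with the same layout as params."""
    W1, b1, w2, b2 = params
    out, hidden = forward(params, x)
    d_pre = w2 * (1.0 - hidden ** 2)               # d(out) / d(hidden pre-activation)
    grads = (np.outer(d_pre, x), d_pre, hidden, 1.0)
    return out, grads

def sgd_step(params, x_next, z_next, a):
    """theta_{k+1} = theta_k - a * dh/dtheta * (h(theta_k, x_{k+1}) - z_{k+1})."""
    out, grads = backprop(params, x_next)
    return tuple(p - a * g * (out - z_next) for p, g in zip(params, grads))

rng = np.random.default_rng(3)
params = (rng.standard_normal((4, 2)), rng.standard_normal(4), rng.standard_normal(4), 0.0)
x_next, z_next = np.array([0.3, -0.7]), 0.5         # one made-up data pair
out, grads = backprop(params, x_next)
new_params = sgd_step(params, x_next, z_next, a=0.01)
print(out, forward(new_params, x_next)[0])
```

Each call to `sgd_step` performs exactly one backpropagation pass, in line with the one-backpropagation-per-iteration count for online training.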
Discrete-Event Dynamic Systems
• Many applications of stochastic gradient methods in simulation-based optimization
• Discrete-event dynamic systems are frequently modeled by simulation
  – Trajectories of the process are piecewise constant
• Derivative–integral interchange is critical
  – Interchange not valid in many realistic systems
  – Interchange condition checked on a case-by-case basis
• Overall approach requires knowledge of the inner workings of the simulation
  – Needed to obtain $\partial Q(\theta, V)/\partial\theta$
  – Chapters 14 and 15 of ISSO have extensive discussion of simulation-based optimization
Image Restoration
• Aim is to recover the true image given a recorded image corrupted by noise
• Common to construct a least-squares type problem
    $$\min_s \|Z - Hs\|^2,$$
  where $Z$ is the recorded image and $Hs$ represents a convolution of the measurement process ($H$) and the true pixel-by-pixel image ($s$)
• Can be solved by either batch linear regression methods or the LMS/RLS methods
• Nonlinear measurements need the full power of the stochastic gradient method
  – Measurements modeled as $Z = F(s, x, V)$
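For the linear case, the batch least-squares route can be illustrated on a one-dimensional stand-in for an image. Everything in this sketch is an assumption made for illustration: the 3-tap blur kernel, the zero-padded edges, the additive Gaussian noise, and the helper `blur_matrix`.

```python
import numpy as np

def blur_matrix(n, kernel):
    """Build the n x n matrix H so that H @ s convolves s with a centered 3-tap kernel (zero-padded edges)."""
    H = np.zeros((n, n))
    for offset, weight in zip((-1, 0, 1), kernel):
        H += weight * np.eye(n, k=offset)
    return H

rng = np.random.default_rng(6)
n = 32
s_true = np.zeros(n)
s_true[10:20] = 1.0                               # illustrative 1-D "image"
H = blur_matrix(n, kernel=(0.2, 0.6, 0.2))        # hypothetical blur kernel
Z = H @ s_true + 0.01 * rng.standard_normal(n)    # recorded signal with additive noise

s_hat, *_ = np.linalg.lstsq(H, Z, rcond=None)     # batch solution of min_s ||Z - H s||^2
print(np.max(np.abs(s_hat - s_true)))
```

This kernel keeps `H` well conditioned, so the plain least-squares solution is stable; a more realistic blur would usually be much less well conditioned.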
References
White, H. (1989), “Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Neural Networks,” J. Amer. Stat. Assoc., vol. 84, pp. 1003–1013.
Wilson, D. R. and Martinez, T. R. (2003), “The General Inefficiency of Batch Training for Gradient Descent Learning,” Neural Networks, vol. 16, pp. 1429–1451.