Efficient Implementation of Second-Order
Stochastic Approximation
March 2020
James C. Spall
The Johns Hopkins University Applied Physics Laboratory and
Department of Applied Mathematics and Statistics
Overview
• Present key extensions to basic simultaneous perturbation stochastic approximation (SPSA) algorithm for stochastic optimization problems
• SPSA in basic form is in general family of “first-order” SA methods; aim here is to capture “second-order” improvement
• Analogous to first-order steepest descent algorithm vs. second-order Newton-Raphson algorithm in classical (deterministic) optimization
Content*
• Part 1: Brief Review of SPSA
• Part 2: Core Adaptive (“Second-Order”) SPSA Method
• Part 3: Enhanced Adaptive SPSA Method: Optimal Weighting and Feedback
• Part 4: Matrix Factorizations for Efficient Implementation
• Part 5: SPSA Method Using Diagonalized Hessian Estimates
• Appendices
Key References
Spall, J. C. (2000), “Adaptive Stochastic Approximation by the Simultaneous Perturbation Method,” IEEE Transactions on Automatic Control, vol. 45, pp. 1839–1853. http://dx.doi.org/10.1109/TAC.2000.880982
Spall, J. C. (2009), “Feedback and Weighting Mechanisms for Improving Jacobian Estimates in the Adaptive Simultaneous Perturbation Algorithm,” IEEE Transactions on Automatic Control, vol. 54(6), pp. 1216–1229. http://dx.doi.org/10.1109/TAC.2009.2019793
Sun, S. and Spall, J. C. (2019), “SPSA Method Using Diagonalized Hessian Estimate,” Proc. of the IEEE Conf. on Decision and Control, Nice, France, 11–13 Dec. 2019, pp. 4922–4927. https://doi.org/10.1109/CDC40024.2019.9029707
Zhu, J., Wang, L., and Spall, J. C. (2019), “Efficient Implementation of Second-Order Stochastic Approximation Algorithms in High-Dimensional Problems,” IEEE Trans. on Neural Networks and Learning Systems, in press. http://dx.doi.org/10.1109/TNNLS.2019.2935455
Part 1: Brief Review of SPSA
Background for Stochastic Approximation
• Interested in finding $\theta = \theta^*$ such that $g(\theta) = 0$, where $\theta$ is $p$-dimensional vector of “adjustables”
• Common special case of minimization: find root $\theta = \theta^*$ to $g(\theta) \equiv \partial L / \partial \theta = 0$, where $L(\theta)$ is scalar-valued loss function
• Assume only (possibly noisy) measurements of $g(\theta)$ and/or $L(\theta)$ available; e.g., $y(\theta) = L(\theta) + \varepsilon(\theta)$
– Noisy measurements arise in areas such as Monte Carlo simulation, real-time control/estimation, machine learning, system identification, etc.
Brief Diversion (3 slides): Basic SPSA Algorithm (NIPS 2012 tutorial; Spall, 2003)
• Let $\hat{g}_k(\theta)$ denote SP estimate of $g(\theta)$ at kth iteration
• Let $\hat{\theta}_k$ denote estimate for $\theta$ at kth iteration
• SPSA algorithm has form
$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \hat{g}_k(\hat{\theta}_k)$$
where $\{a_k\}$ is nonnegative gain sequence
• Generic iterative form above is standard in SA; stochastic analogue to steepest descent
• Under conditions, $\hat{\theta}_k \to \theta^*$ in some stochastic sense as $k \to \infty$
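Computation of $\hat{g}_k(\cdot)$ (Heart of SPSA)
• Let $\Delta_k = [\Delta_{k1}, \Delta_{k2}, \ldots, \Delta_{kp}]^T$ be vector of $p$ independent random variables at kth iteration
• $\Delta_k$ typically generated by Monte Carlo
• Let $\{c_k\}$ be sequence of positive scalars
• For iteration $k \to k+1$, take measurements at design levels $\hat{\theta}_k \pm c_k \Delta_k$:
$$y(\hat{\theta}_k + c_k \Delta_k) = L(\hat{\theta}_k + c_k \Delta_k) + \varepsilon_k^{(+)}$$
$$y(\hat{\theta}_k - c_k \Delta_k) = L(\hat{\theta}_k - c_k \Delta_k) + \varepsilon_k^{(-)}$$
where $\varepsilon_k^{(\pm)}$ are measurement noise terms
• Common special case is when $\varepsilon_k^{(+)} = \varepsilon_k^{(-)} = 0$ (e.g., system identification with perfect measurements of the likelihood function)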
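Computation of $\hat{g}_k(\cdot)$ (cont’d)
• The standard SP form for $\hat{g}_k(\cdot)$:
$$\hat{g}_k(\hat{\theta}_k) = \left[ \frac{y(\hat{\theta}_k + c_k \Delta_k) - y(\hat{\theta}_k - c_k \Delta_k)}{2 c_k \Delta_{k1}}, \; \ldots, \; \frac{y(\hat{\theta}_k + c_k \Delta_k) - y(\hat{\theta}_k - c_k \Delta_k)}{2 c_k \Delta_{kp}} \right]^T$$
• Note $\hat{g}_k(\cdot)$ only requires two measurements of $L(\cdot)$, independent of $p$
• Above SP form contrasts with standard finite-difference approximations taking $2p$ (or $p+1$) measurements
• Intuitive reason why $\hat{g}_k(\cdot)$ is appropriate is that $E[\hat{g}_k(\hat{\theta}_k) \mid \hat{\theta}_k] \approx g(\hat{\theta}_k)$; formalized in Spall (1992)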
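The two-measurement gradient estimate and the basic recursion can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names, the Bernoulli ±1 perturbation distribution, and the gain forms `a / (k + 1 + A)**alpha` and `c / (k + 1)**gamma` with their default values are assumptions chosen for the example.

```python
import numpy as np

def spsa_gradient(y, theta, c_k, rng):
    """Simultaneous perturbation estimate of g(theta) from two loss measurements."""
    # Assumed perturbation law: independent Bernoulli +/-1 components
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    y_plus = y(theta + c_k * delta)
    y_minus = y(theta - c_k * delta)
    # The same scalar difference is divided by each component of delta
    return (y_plus - y_minus) / (2.0 * c_k * delta)

def spsa(y, theta0, n_iter, a=0.1, c=0.1, A=10.0, alpha=0.602, gamma=0.101, seed=0):
    """Basic SPSA recursion theta_{k+1} = theta_k - a_k * g_hat_k(theta_k)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for k in range(n_iter):
        a_k = a / (k + 1 + A) ** alpha   # assumed decaying step-size gain
        c_k = c / (k + 1) ** gamma       # assumed decaying difference interval
        theta = theta - a_k * spsa_gradient(y, theta, c_k, rng)
    return theta

if __name__ == "__main__":
    noise = np.random.default_rng(1)
    noisy_loss = lambda th: float(th @ th) + 0.01 * noise.standard_normal()
    print(spsa(noisy_loss, np.ones(5), n_iter=2000))
```

Each iteration calls the loss exactly twice, whatever the dimension of `theta`.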
Part 2: Core Adaptive (“Second-Order”) SPSA Method
Why Seek Second-Order Method?
• Standard SA algorithms (Robbins-Monro, etc.) show “1st-order” behavior
– Sharp initial decline (in optimization case)
– Slow convergence in final phase
– Sensitivity to units/scaling for components of $\theta$ (“non-transform invariant”)
• Long-standing interest in capturing “2nd-order” (Newton-like) effects in SA setting to cope with above shortcomings
• Two general methods:
─ Adaptive SA algorithms (Hessian/Jacobian estimation)
─ Iterate averaging
• Traditional shortcomings in both of above in implementation difficulty and practical efficiency
‒ E.g., difficult to get Hessian estimate
“Standard” Adaptive SPSA Algorithm
(Spall, 2000, IEEE Trans. Auto. Contr.)
• Let $G_k(\theta)$ denote direct (noisy) measurement of $g(\theta)$ or simultaneous perturbation estimate of $g(\theta)$ at kth iteration
• Let $\hat{\theta}_k$ denote estimate for $\theta$ at kth iteration
• “Standard” adaptive SPSA algorithm has parallel recursions form
$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \bar{\bar{H}}_k^{-1} G_k(\hat{\theta}_k), \quad \bar{\bar{H}}_k = f_k(\bar{H}_k) \qquad (*)$$
$$\bar{H}_k = \frac{k}{k+1} \bar{H}_{k-1} + \frac{1}{k+1} \hat{H}_k, \quad k = 0, 1, \ldots \qquad (**)$$
where $\{a_k\}$ is nonnegative gain sequence and $\hat{H}_k$ is efficient “per-iteration” Jacobian estimate
• Key is “per-iteration” estimate $\hat{H}_k$ in (**); formula for $\hat{H}_k$ on next slide
• Note 1: Convergence of $\hat{\theta}_k$ and $\bar{H}_k$ can be shown (Spall, 2000)
• Note 2: Efficient rank 1 (or 2) update (via matrix inversion lemma [Sherman–Morrison–Woodbury formula]) can be used in (*); see Appendix 1
Per-Iteration Jacobian (Hessian) Estimate
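$$\hat{H}_k = \frac{1}{2} \left[ \frac{\delta G_k}{2 c_k} \left[ \Delta_{k1}^{-1}, \Delta_{k2}^{-1}, \ldots, \Delta_{kp}^{-1} \right] + \left( \frac{\delta G_k}{2 c_k} \left[ \Delta_{k1}^{-1}, \Delta_{k2}^{-1}, \ldots, \Delta_{kp}^{-1} \right] \right)^T \right]$$
where $\delta G_k = G(\hat{\theta}_k + c_k \Delta_k) - G(\hat{\theta}_k - c_k \Delta_k)$ and $G(\theta)$ is direct noisy measurement of $g(\theta)$ or the simultaneous perturbation estimate of $g(\theta)$ built from noisy $L(\theta)$ measurements
and
$\Delta_k \equiv [\Delta_{k1}, \Delta_{k2}, \ldots, \Delta_{kp}]^T$ is mean-zero random vector such that the $\{\Delta_{kj}\}$ satisfy standard SPSA conditions (mean zero, symmetrically distributed random variables with finite inverse moments)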
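As an illustration of the per-iteration estimate and the averaging recursion (**), the following Python sketch assumes a callable `G` that returns a (possibly noisy) gradient value and uses Bernoulli ±1 perturbations. The names `hessian_estimate` and `running_average` are hypothetical, introduced only for this example.

```python
import numpy as np

def hessian_estimate(G, theta, c_k, rng):
    """Per-iteration estimate H_hat_k from two evaluations of G."""
    # Assumed perturbation law: independent Bernoulli +/-1 components
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    dG = G(theta + c_k * delta) - G(theta - c_k * delta)
    # Column vector dG / (2 c_k) times row vector of inverse perturbations
    M = np.outer(dG / (2.0 * c_k), 1.0 / delta)
    # Symmetrize: average of M and its transpose
    return 0.5 * (M + M.T)

def running_average(H_bar_prev, H_hat, k):
    """Recursion (**): H_bar_k = k/(k+1) * H_bar_{k-1} + 1/(k+1) * H_hat_k."""
    return (k / (k + 1.0)) * H_bar_prev + H_hat / (k + 1.0)

if __name__ == "__main__":
    H_true = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 3.0]])
    G = lambda th: H_true @ th          # noise-free gradient of a quadratic loss
    rng = np.random.default_rng(0)
    H_bar = np.zeros((3, 3))
    for k in range(5000):
        H_bar = running_average(H_bar, hessian_estimate(G, np.ones(3), 0.1, rng), k)
    print(np.round(H_bar, 2))
```

A single estimate is a poor approximation of the Hessian; it is the average over many iterations that becomes useful.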
Cost of Implementation
• For any $p$, the cost per iteration of adaptive method is
– Four noisy measurements of $L$, or
– Three noisy measurements of $g$
• Above costs compare very favorably with previous methods:
– $O(p^2)$ loss measurements per iteration in finite-difference setting (e.g., Fabian, 1971)
– $O(p)$ $g$ measurements per iteration in root-finding setting
Efficiency Analysis
• Use asymptotic normality of $\hat{\theta}_k$ to compare asymptotic RMS errors (as in basic SPSA) against best possible asymptotic RMS of SPSA and root-finding, say $\mathrm{RMS}_{\mathrm{SPSA}}$ and $\mathrm{RMS}_{\mathrm{root}}$
• Noisy values of $L$ (2SPSA): $a_k = 1/k$ and $c_k = c / k^{1/6}$ ($k \ge 1$, $c > 0$):
$$\frac{\text{RMS for adaptive}}{\mathrm{RMS}_{\mathrm{SPSA}}} < 2$$
• Noisy values of $g$ (root-finding or 2SG): $a_k = 1/k$ and any valid $c_k$:
$$\frac{\text{RMS for adaptive}}{\mathrm{RMS}_{\mathrm{root}}} = 1$$
• Interpretation: Adaptive method with measurements of $L$ does almost as well as unobtainable best SPSA; RMS error differs by < factor of 2. With measurements of $g$, does as well as analytically optimal root-finding method (rarely available).
Part 3: Enhanced Adaptive SPSA Method: Optimal Weighting and Feedback
Enhanced Adaptive SPSA Algorithm
(Spall, 2009, IEEE Trans. Auto. Contr.)
• Standard adaptive algorithm can be improved:
─ Introduce feedback term that helps remove error in per-iteration $H$ estimates
─ Optimal weighting to account for noisy measurements of loss function or of $g(\theta)$
• Enhanced adaptive SPSA algorithm has same general parallel recursions form as (*) and (**) of “standard” adaptive SPSA algorithm:
$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \bar{\bar{H}}_k^{-1} G_k(\hat{\theta}_k), \quad \bar{\bar{H}}_k = f_k(\bar{H}_k)$$
$$\bar{H}_k = (1 - w_k) \bar{H}_{k-1} + w_k (\hat{H}_k - \hat{\Psi}_k)$$
• New expressions here (highlighted in red on the original slide): $\hat{\Psi}_k$ is feedback term; $w_k$ is optimal weighting
Enhanced Algorithm (cont’d)
• Recursion for $\hat{\theta}_k$ unchanged from basic adaptive method
• Critical aspect of recursion for $H$ is the averaging of per-iteration Jacobian estimates
─ Each per-iteration $H$ estimate formed by simultaneous perturbation of $G_k(\theta)$ values
─ Per-iteration estimate formed from only two $G_k(\theta)$ values (vs. $2p$ values in former adaptive methods)
• Two changes to recursion for $H$ (feedback and weighting)
─ Feedback term uses current (cumulative) estimate of $H$ as true $H$: subtracts out error in $\hat{H}_k$ that depends on true $H$
─ Weighting accounts for increasing effective noise contribution across iterations
Feedback Contribution to Enhanced Algorithm
• Per-iteration Jacobian estimate can be decomposed into four parts:
$$\hat{H}_k = H(\hat{\theta}_k) + \Psi_k + O(c_k^2) + \text{noise}$$
where $\Psi_k = \Psi_k(H(\hat{\theta}_k))$ is error term and $c_k$ is positive scalar such that $c_k \to 0$ as $k \to \infty$ ($c_k$ is the “difference interval” for the per-iteration estimates)
• Feedback term is best estimate of error term (using current estimate of $H$) at each iteration:
$$\hat{\Psi}_k = \Psi_k(\bar{H}_{k-1})$$
• Overall (cumulative) estimate of $H$ (i.e., $\bar{H}_n$) is based on weighted sums of $\hat{H}_k - \hat{\Psi}_k$, $k = 0, 1, \ldots, n$
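To make the role of the feedback term concrete, the sketch below works out the direct-gradient (2SG) case. The slides do not give $\Psi_k$ explicitly; the form used here, $\Psi_k(H) = \tfrac{1}{2}(H D_k + D_k^T H)$ with $D_k = \Delta_k (\Delta_k^{-1})^T - I$, is an assumption obtained by substituting an exactly quadratic, noise-free gradient $G(\theta) = H\theta$ into the per-iteration estimate. Function names are hypothetical.

```python
import numpy as np

def feedback_term(H_bar_prev, delta):
    """Assumed 2SG form: Psi_k(H) = 0.5 * (H D + D^T H), D = delta (1/delta)^T - I."""
    D = np.outer(delta, 1.0 / delta) - np.eye(delta.size)
    return 0.5 * (H_bar_prev @ D + D.T @ H_bar_prev)

def enhanced_update(H_bar_prev, H_hat, delta, w_k):
    """H_bar_k = (1 - w_k) H_bar_{k-1} + w_k (H_hat_k - Psi_hat_k)."""
    psi_hat = feedback_term(H_bar_prev, delta)   # error term evaluated at current estimate
    return (1.0 - w_k) * H_bar_prev + w_k * (H_hat - psi_hat)

if __name__ == "__main__":
    H = np.array([[2.0, 0.5], [0.5, 1.0]])
    rng = np.random.default_rng(0)
    delta = rng.choice([-1.0, 1.0], size=2)
    c_k = 0.1
    dG = 2.0 * c_k * (H @ delta)                 # G(theta + c d) - G(theta - c d) for G = H theta
    M = np.outer(dG / (2.0 * c_k), 1.0 / delta)
    H_hat = 0.5 * (M + M.T)
    # With the true H plugged in, the feedback removes the perturbation error exactly
    print(H_hat - feedback_term(H, delta))
```

In this idealized setting the corrected estimate equals the true Hessian, which suggests why feedback matters most once the cumulative estimate is already close.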
Optimal Weighting
• Recall:
$$\hat{H}_k = H(\hat{\theta}_k) + \Psi_k + O(c_k^2) + \text{noise}$$
• The noise term is of stochastic order $O(c_k^{-2})$ in the case of noisy loss measurements (optimization) and order $O(c_k^{-1})$ in the case of noisy $g$ measurements (root-finding)
• Above implies that noise variance grows with $k$
• Optimal weighting is such that noise growth is “damped down” as $k$ gets large
• Can solve for optimal weights in terms of $c_k$ using Lagrange multipliers; see Appendix 2 at end of slides
Comments on Theory
• Former theory on convergence of $\hat{\theta}_k$ and asymptotic efficiency continues to apply
• New theory required for convergence of $\bar{H}_k$ due to modified form
• Under various “standard” smoothness and bounded moments conditions, can show
$$\bar{H}_k \to H(\theta^*) \text{ a.s.}$$
• For each entry in $\bar{H}_k$, have
$$\frac{\text{variance for enhanced adaptive SPSA}}{\text{variance for standard adaptive SPSA}} \le 1$$
• In special case of noise-free measurements, can show that rate of convergence of $\bar{H}_k$ to $H(\theta^*)$ is almost $O(e^{-ck})$, $c > 0$
– Feedback is critical to fast convergence
– Applies in case of direct loss or direct $g$ measurements
Comments on Numerical Issues in
Practical Implementation
• In practical problems, basic update
$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \bar{\bar{H}}_k^{-1} G_k(\hat{\theta}_k)$$
may need modest modifications:
A. For small $k$, must ensure $\bar{H}_k$ is replaced by related guaranteed positive-definite matrix $\bar{\bar{H}}_k$ (want $\bar{\bar{H}}_k \approx \bar{H}_k$)
B. Stable and efficient computation of $\bar{\bar{H}}_k^{-1} G_k(\hat{\theta}_k)$
• Standard method to solve A is to use matrix square root (e.g., Cholesky), as in
$$\bar{\bar{H}}_k = (\bar{H}_k^T \bar{H}_k + \delta_k I_p)^{1/2}, \quad \delta_k \to 0$$
• Standard method to solve B is to use Gaussian elimination
• Both standard solutions above require $O(p^3)$ calculations; not efficient for large $p$
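A minimal Python sketch of the two standard steps follows. It computes the matrix square root through a symmetric eigendecomposition rather than the Cholesky route mentioned above (a substitution made here for brevity), and it solves the linear system with NumPy's general solver. Function names and the default `delta_k` are assumptions.

```python
import numpy as np

def precondition(H_bar, delta_k):
    """Return (H_bar^T H_bar + delta_k I)^(1/2), a positive-definite matrix."""
    p = H_bar.shape[0]
    S = H_bar.T @ H_bar + delta_k * np.eye(p)   # symmetric positive definite
    w, V = np.linalg.eigh(S)                    # O(p^3) eigendecomposition
    return (V * np.sqrt(w)) @ V.T               # V diag(sqrt(w)) V^T

def newton_like_step(theta, H_bar, G_k, a_k, delta_k=1e-4):
    """theta_{k+1} = theta_k - a_k * H_bb^{-1} G_k, solving instead of inverting."""
    H_bb = precondition(H_bar, delta_k)
    return theta - a_k * np.linalg.solve(H_bb, G_k)   # LU-based solve, O(p^3)

if __name__ == "__main__":
    H_bar = np.diag([2.0, -1.0])                # indefinite cumulative estimate
    print(precondition(H_bar, 1e-4))            # close to diag(2, 1)
```

For a symmetric `H_bar` and small `delta_k`, the mapping roughly replaces each eigenvalue by its absolute value, so the preconditioned matrix stays close to `H_bar` whenever `H_bar` is already positive definite.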
Numerical Study
• Consider minimization problem based on 4th-order polynomial loss function; $p = \dim(\theta) = 10$
• Compare standard adaptive and enhanced adaptive SPSA in terms of terminal loss function values and $H$ estimates
• Used stochastic gradient setting (direct noisy measurements of gradient $g$) (2SG) and setting with only loss values (2SPSA; see next slide)
• Considered 50 independent runs for each experiment; normalized loss is from 0 (best) to 1 (initial condition)

Number of iterations | Normalized loss from standard 2SG | Normalized loss from enhanced 2SG | P-value for comparing loss functions
2000                 | 0.019                             | 0.012                             | 0.0061
10,000               | 0.015                             | 0.0034                            | 0.00049
Relative Convergence Rates for Hessian Estimate in 2SPSA in Noise-Free Problem
[Figure: Hessian estimate $\bar{H}_k$ compared with true $H$ as function of $k$]
Part 4: Matrix Factorizations for Efficient Implementation
• Adaptive (second-order) SPSA highly efficient relative to use of input information. Only:
‒ 4 noisy loss measurements per iteration for 2SPSA
‒ 3 noisy gradient measurements per iteration for 2SG
• Above fixed costs (any $p$) are separate from per-iteration computing cost for update step (FLOPS cost), which grows with $p$
‒ Primary driver of computing cost is matrix manipulation for large $p$
• We discuss below easy-to-implement method for reducing computing cost from $O(p^3)$ (in general) to $O(p^2)$
• Overall update of $\hat{\theta}_k$ requires $O(p^3)$ in general. Two main issues:
‒ “Preconditioning”: need $\bar{\bar{H}}_k$ to be positive definite
   • Map $\bar{H}_k$ to some positive-definite matrix $\bar{\bar{H}}_k \equiv f_k(\bar{H}_k)$
   • Straightforward approach in Spall (2000 or 2009), $\bar{\bar{H}}_k = (\bar{H}_k^T \bar{H}_k + \delta_k I)^{1/2}$, $\delta_k \to 0$, requires $O(p^3)$ in general
‒ Gaussian elimination for $\bar{\bar{H}}_k^{-1} G_k$ (standard method) takes $O(p^3)$ computations
• Matrix factorizations can be used to simultaneously handle both preconditioning and update step $\bar{\bar{H}}_k^{-1} G_k$
• Alternative method (Rastogi et al., 2016) based on matrix inversion lemma (Appendix 1) useful if preconditioning not relevant (i.e., know $\bar{H}_k$ is positive definite)
Matrix Factorizations (cont’d)
• Recall: Update of $\hat{\theta}_k$ via $\bar{\bar{H}}_k^{-1} G_k$ generally requires $O(p^3)$. Two expensive steps:
‒ Preconditioning: maintain positive-definiteness of $\bar{\bar{H}}_k$
‒ Solving $\bar{\bar{H}}_k d_k = G_k$ for descent direction $d_k$ (via Gaussian elimination or other standard method) costs $O(p^3)$
• Goal: Achieve $O(p^2)$ per-iteration cost, a factor of $p$ improvement over other second-order methods
• Tool is symmetric matrix factorization, applicable for possibly indefinite matrices
$$P_k \bar{H}_k P_k^T = L_k B_k L_k^T$$
‒ $P_k$ is permutation; $L_k$ is lower-triangular; $B_k$ is block-diagonal with blocks of size 1 or 2
‒ Value of factorization is that all calculations required to obtain $d_k = \bar{\bar{H}}_k^{-1} G_k$ are now $O(p^2)$
Matrix Factorizations (cont’d)
• Preconditioning achievable at $O(p^2)$ per-iteration cost
‒ Eigen-decomposition of block-diagonal $B_k = Q_k \Lambda_k Q_k^T$ is cheap, at most $O(p)$
‒ Modification to make diagonal $\bar{\Lambda}_k \succ 0$ guarantees $\bar{B}_k \succ 0$ and hence $\bar{\bar{H}}_k \succ 0$ (by Sylvester's law of inertia*)
‒ Hence, $P_k \bar{\bar{H}}_k P_k^T = L_k \bar{B}_k L_k^T = L_k Q_k \bar{\Lambda}_k Q_k^T L_k^T$
‒ Overall: Preconditioning changes indefinite $\bar{H}_k$ to $\bar{\bar{H}}_k \succ 0$
• Descent direction $d_k$ obtainable at $O(p^2)$ per-iteration cost
$$\bar{\bar{H}}_k d_k = P_k^T L_k \bar{B}_k L_k^T P_k d_k = G_k$$
‒ Lower-triangular $L_k$ allows backward/forward substitution
‒ Multiplying by permutation $P_k$ is cheap
• Next slide shows overall flow
*If $D = S A S^T$ is diagonal with $S$ nonsingular, $A$ has the same number of positive/negative/zero eigenvalues as $D$
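Given factors $P_k$, $L_k$, $B_k$, the preconditioning and the solve for $d_k$ can be sketched as below. The factors are assumed to be supplied (for instance by a Bunch–Kaufman style $LBL^T$ routine, or by the updating scheme of Zhu et al.); the eigenvalue rule `max(|lambda|, eps)` is one possible modification chosen for illustration, and all function names are hypothetical.

```python
import numpy as np

def modify_blocks(B, sizes, eps=1e-6):
    """Eigendecompose each 1x1 or 2x2 diagonal block of B and force eigenvalues >= eps.

    Returns a list of (Q, lam_bar) pairs, one per block; total cost is O(p).
    """
    out, i = [], 0
    for s in sizes:
        lam, Q = np.linalg.eigh(B[i:i + s, i:i + s])
        out.append((Q, np.maximum(np.abs(lam), eps)))   # assumed modification rule
        i += s
    return out

def solve_direction(perm, L, blocks, G):
    """Solve P^T L B_bar L^T P d = G in O(p^2); perm encodes P via (P x) = x[perm]."""
    p = G.size
    rhs = G[perm]                                   # P G
    u = np.zeros(p)
    for i in range(p):                              # forward substitution: L u = P G
        u[i] = (rhs[i] - L[i, :i] @ u[:i]) / L[i, i]
    v, i = np.zeros(p), 0
    for Q, lam_bar in blocks:                       # v = B_bar^{-1} u, block by block
        s = lam_bar.size
        v[i:i + s] = Q @ ((Q.T @ u[i:i + s]) / lam_bar)
        i += s
    w = np.zeros(p)
    for i in range(p - 1, -1, -1):                  # back substitution: L^T w = v
        w[i] = (v[i] - L[i + 1:, i] @ w[i + 1:]) / L[i, i]
    d = np.zeros(p)
    d[perm] = w                                     # d = P^T w
    return d
```

No $p \times p$ matrix is inverted or multiplied here: the only dense operations are two triangular substitutions, each costing $O(p^2)$.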
Matrix Factorizations: Overall Flowchart
• Recall: Can achieve both preconditioning and computation of descent direction in $O(p^2)$
Numerical Study: Evaluation of $O(p^2)$
• Consider skewed-quartic loss function with Gaussian noise*
• Plot shows ratio of run times of conventional 2SPSA over efficient 2SPSA as function of $p$
‒ Both algorithms have same accuracy in estimation of $\theta$
• Note observed linear trend: $O(p^3)/O(p^2) = O(p)$ difference in run times, as predicted by theory
‒ Approx. $0.56p$ = 2800-fold reduction in run time
*$L(\theta) = \theta^T B^T B \theta + 0.1 \sum_i (B\theta)_i^3 + 0.01 \sum_i (B\theta)_i^4$, where $B$ is an upper triangular matrix with all nonzero entries equal to $1/p$
Numerical Study: Comparison with SGD and ADAM
• Compare efficient 2SG with standard SG descent (SGD) and ADAM on neural network with empirical risk function (ERF)
– ERF is standard machine learning formulation (summation over data set)
– Used NASA airfoil self-noise data set (NACA 0012) with 1503 data points
– ADAM is popular adaptive learning method
Some Future Directions
• Above symmetric matrix factorization is versatile
• Factorization applicable for general quasi-Newton (QN) methods, such as BFGS and DFP
‒ Standard update formulas contain two rank-one modifications that can be replaced by symmetric matrix factorization with no extra cost
‒ Standard QN methods are $O(p^2)$ but….
• Factorization allows extra preconditioning step at $O(p^2)$ to improve performance for ill-conditioned problems
‒ Without factorization, preconditioning only available at $O(p^3)$
• Applications where need for preconditioning arises include:
‒ Machine learning problems (logistic regression, ridge regression, soft-margin support vector machines), where feature vectors have high dimension (relevant in big-data era)
‒ Inverse problems (various fields within medical imaging and geophysics including mineral and oil exploration)
SPSA Using Diagonalized Hessian Estimate
(Sun and Spall, 2019)
• Recall first-order methods (SPSA, SGD):
‒ Pros: Comparatively low computational complexity per iteration (usually $O(p)$)
‒ Cons: Slow convergence in final stage, etc. (see “Why Seek Second-Order Method?” slide)
• Second-order methods (2SPSA, 2SG):
‒ Pros: High efficiency in convergence and transform invariant
‒ Cons: High computational complexity per iteration (usually $O(p^3)$; optimally $O(p^2)$ per Part 4); inefficient for high-dimensional problems
• Core concept:
‒ Find something between first- and second-order
‒ Apply part of Hessian estimates for better convergence
‒ Apply only part of Hessian to avoid the high computational complexity per iteration
Diagonalized Hessian Estimate, cont’d
• Modify adaptive algorithm correspondingly:
─ Estimate only diagonal components of Hessian in recursion
─ Propose diagSPSA and diagSG
• diagSPSA and diagSG algorithms have same general parallel recursions form as (*) and (**) of “standard” adaptive SPSA algorithm:
$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \bar{\bar{D}}_k^{-1} G_k(\hat{\theta}_k), \quad \bar{\bar{D}}_k = \mathrm{abs}(\bar{D}_k) + \eta I_p$$
• As only diagonal components are estimated, the computational cost per iteration is $O(p)$
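A small Python sketch of the diagonal variant in the direct-gradient setting follows. It takes the diagonal of the per-iteration Hessian estimate, which reduces to an elementwise ratio, and stores the diagonal matrices as vectors. The function names and the value of `eta` are assumptions for illustration; how $\bar{D}_k$ is averaged across iterations is left to the caller.

```python
import numpy as np

def diag_hessian_estimate(G, theta, c_k, rng):
    """Diagonal of the per-iteration Hessian estimate: dG_i / (2 c_k delta_i)."""
    # Assumed perturbation law: independent Bernoulli +/-1 components
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    dG = G(theta + c_k * delta) - G(theta - c_k * delta)
    return dG / (2.0 * c_k * delta)

def diag_step(theta, D_bar, G_k, a_k, eta=1e-8):
    """theta_{k+1} = theta_k - a_k * D_bb^{-1} G_k with D_bb = abs(D_bar) + eta * I_p."""
    D_bb = np.abs(D_bar) + eta          # diagonal kept as a length-p vector
    return theta - a_k * G_k / D_bb     # elementwise division, O(p)

if __name__ == "__main__":
    A = np.array([1.0, 4.0, 9.0])       # diagonal Hessian of a separable quadratic
    G = lambda th: A * th
    rng = np.random.default_rng(0)
    print(diag_hessian_estimate(G, np.ones(3), 0.1, rng))   # recovers A exactly here
```

Memory use is also $O(p)$, since no $p \times p$ matrix is ever formed.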
DiagSG in ERF Deep Learning
• Some Existing Algorithms in Deep Learning:
─ Apply diagonal components of empirical Fisher information matrix, sum of outer products of gradients of log-likelihood (not Hessian estimates)
─ Examples: AdaGrad, Adam, etc.
• Comparison with DiagSG (requires knowledge in Chap. 13)
─ Finite Stage: Difference of empirical Fisher information matrices and Hessian estimates under ERF setting
─ Asymptotic Behavior: Share very similar asymptotic distribution because of equivalent definition of FIM
• Numerical Experiment:
─ ResNet on CIFAR-10 with $p = 2 \times 10^7$
─ Note: ResNet stands for “Residual Network,” a powerful structure for computer vision tasks; CIFAR-10 is a standard image dataset
Concluding Remarks
• Design of efficient and (relatively) easy to use adaptive search algorithms is long-standing problem
– Especially difficult in setting of noisy measurements
– Knowledge of Jacobian/Hessian matrix also useful in contexts outside of improved search (e.g., Cramér-Rao bound)
• Simultaneous perturbation idea can be used to dramatically enhance efficiency in multivariate problems
• Improvements here to “standard” adaptive SPSA are simple to use and further enhance efficiency
─ Feedback to reduce error in “per-iteration” Jacobian/Hessian estimates
─ Optimal weighting to reduce effects of noise
─ Matrix factorizations to enhance applicability in large-scale (e.g., machine learning) problems
• Ongoing research at JHU and elsewhere in applications to deep learning
Two appendices to follow:
Appendix 1: Use of Matrix Inversion Lemma to
Remove Explicit Inversion in SA Update
• Idea below began as part of final report by student
Pushpendre Rastogi for stochastic optimization class, spring 2014
• Consider special case of “2SPSA” setting, where $G_k(\theta)$ is SP estimate of $g(\theta)$ built from noisy $L(\theta)$ measurements
• Special case of no feedback in Spall (2009) leads to:
$$\bar{H}_k = (1 - w_k) \bar{H}_{k-1} + w_k \hat{H}_k$$
$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \bar{\bar{H}}_k^{-1} G_k(\hat{\theta}_k)$$
• For ease of discussion, ignore distinction between $\bar{H}_k$ and $\bar{\bar{H}}_k$
• Modify form above slightly to work directly with inverses
• Consider “pre-symmetrization” process when forming per-iteration estimate of $H$ (symmetrize after creating appropriate inverse matrix):
$$\frac{\delta G_k}{2 c_k} \left[ \Delta_{k1}^{-1}, \Delta_{k2}^{-1}, \ldots, \Delta_{kp}^{-1} \right],$$
so $\hat{H}_k$ is average of this matrix and its transpose
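Appendix 1: Use of Matrix Inversion Lemma (cont’d)
• Let $J_k$ serve as (non-symmetric) estimate of inverse $H$
• After a little algebra, can write:
$$J_k^{-1} = (1 - w_k) J_{k-1}^{-1} + w_k \frac{\delta y_k}{2 c_k \tilde{c}_k} \tilde{\Delta}_k^{-1} (\Delta_k^{-1})^T, \qquad (*)$$
where
$$\delta y_k = [y(\hat{\theta}_k + c_k \Delta_k + \tilde{c}_k \tilde{\Delta}_k) - y(\hat{\theta}_k + c_k \Delta_k)] - [y(\hat{\theta}_k - c_k \Delta_k + \tilde{c}_k \tilde{\Delta}_k) - y(\hat{\theta}_k - c_k \Delta_k)]$$
and $\Delta_k^{-1}$ and $\tilde{\Delta}_k^{-1}$ represent vectors of inverses of components in corresponding vectors $\Delta_k$ and $\tilde{\Delta}_k$
• Note from (*) above that recursion for estimate of $H$ is rank-one update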
Appendix 1: Use of Matrix Inversion Lemma (cont’d)
• Recall (previous slide):
$$J_k^{-1} = (1 - w_k) J_{k-1}^{-1} + w_k \frac{\delta y_k}{2 c_k \tilde{c}_k} \tilde{\Delta}_k^{-1} (\Delta_k^{-1})^T$$
• From a straightforward substitution into the matrix inversion lemma,
$$J_k = \frac{1}{1 - w_k} \left[ J_{k-1} - \frac{s_k J_{k-1} \tilde{\Delta}_k^{-1} (\Delta_k^{-1})^T J_{k-1}}{1 - w_k + s_k (\Delta_k^{-1})^T J_{k-1} \tilde{\Delta}_k^{-1}} \right],$$
where $s_k = w_k \delta y_k / (2 c_k \tilde{c}_k)$
• Since inverse of symmetric matrix is symmetric, can impose symmetry constraint directly in terms of inverses
• That is, instead of update in Spall (2009), use
$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \tfrac{1}{2} (J_k + J_k^T) G_k(\hat{\theta}_k),$$
where $G_k(\theta)$ is SP estimate of $g(\theta)$
• Similar ideas can be applied to all variations: Adaptive SPSA and SG (Spall, 2000) and enhanced adaptive methods with feedback (Spall, 2009)
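The rank-one inverse update can be sketched in Python as follows, treating `J_prev` as the current (non-symmetric) estimate of the inverse Hessian. The function and argument names are hypothetical; `delta_tilde_inv` and `delta_inv` stand for the vectors of inverse perturbation components. The formula is the Sherman–Morrison identity applied to a scaled matrix plus a rank-one term.

```python
import numpy as np

def inverse_rank_one_update(J_prev, w_k, s_k, delta_tilde_inv, delta_inv):
    """Given J_prev = A^{-1}, return the inverse of
    (1 - w_k) * A + s_k * outer(delta_tilde_inv, delta_inv) in O(p^2)."""
    Ju = J_prev @ delta_tilde_inv                 # matrix-vector product, O(p^2)
    vJ = delta_inv @ J_prev                       # vector-matrix product, O(p^2)
    denom = (1.0 - w_k) + s_k * (delta_inv @ Ju)  # scalar; must be nonzero
    return (J_prev - s_k * np.outer(Ju, vJ) / denom) / (1.0 - w_k)

def theta_step(theta, J_k, G_k, a_k):
    """theta_{k+1} = theta_k - a_k * 0.5 * (J_k + J_k^T) G_k."""
    return theta - a_k * 0.5 * (J_k + J_k.T) @ G_k

if __name__ == "__main__":
    A_prev = 3.0 * np.eye(4) + 0.1 * np.arange(16.0).reshape(4, 4) / 16.0
    u = np.array([1.0, -1.0, 1.0, 1.0])
    v = np.array([1.0, 1.0, -1.0, 1.0])
    J_k = inverse_rank_one_update(np.linalg.inv(A_prev), 0.2, 0.3, u, v)
    # Compare with a direct O(p^3) inverse of the updated matrix
    print(np.allclose(J_k, np.linalg.inv(0.8 * A_prev + 0.3 * np.outer(u, v))))
```

Only matrix-vector products and one outer product appear, so no linear system is solved once an initial inverse is available.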
Appendix 1: Use of Matrix Inversion Lemma (cont’d)
• Let us discuss efficiency of the above idea when applied to four main variations of adaptive methods (Spall, 2000, 2009)
• Standard method based on Gaussian elimination for computing $\bar{\bar{H}}_k^{-1} G_k(\hat{\theta}_k)$ uses an approximate number of operations (adds and multiplies) that is cubic in $p$
• In contrast, as shown in Rastogi et al. (2016), method here uses an approximate number of operations that is quadratic in $p$
• Bottom line: above avoids explicit inverse or Gaussian elimination: Computations now $O(p^2)$ instead of $O(p^3)$
Appendix 2: Optimal Weighting for
Enhanced Adaptive SPSA Algorithm
• Recall:
$$\bar{H}_k = (1 - w_k) \bar{H}_{k-1} + w_k (\hat{H}_k - \hat{\Psi}_k)$$
• For the case of noisy loss functions, optimal weights are:
$$w_k = \frac{c_k^4}{\sum_{i=0}^{k} c_i^4}$$
• For the case of noisy root-finding ($g$) measurements, optimal weights are:
$$w_k = \frac{c_k^2}{\sum_{i=0}^{k} c_i^2}$$
• Both of above based on “downweighting” later per-iteration Jacobian estimates to compensate for increased noise contribution
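The two weight formulas differ only in the exponent on $c_k$, so they can be computed with one cumulative sum. The sketch below is illustrative; the function name and the `noisy_loss` flag are assumptions introduced here.

```python
import numpy as np

def optimal_weights(c, noisy_loss=True):
    """w_k = c_k^q / sum_{i=0}^{k} c_i^q, with q = 4 for noisy loss
    measurements and q = 2 for noisy g measurements."""
    q = 4 if noisy_loss else 2
    cq = np.asarray(c, dtype=float) ** q
    return cq / np.cumsum(cq)

if __name__ == "__main__":
    k = np.arange(1, 6)
    print(optimal_weights(np.ones(5)))                 # constant c_k: plain averaging
    print(optimal_weights(1.0 / k ** (1 / 6)))         # decaying c_k: later terms downweighted
```

With constant $c_k$ the weights reduce to $1/(k+1)$, which is the equal-weight averaging of the standard adaptive recursion (**); with decaying $c_k$ each later weight falls below $1/(k+1)$.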
Partial List of References
• Rastogi, P., Zhu, J., and Spall, J. C. (2016), “Efficient Implementation of Enhanced Adaptive Simultaneous Perturbation Algorithms,” Proc. 50th Annual Conference on Information Sciences and Systems, Princeton, NJ, 16–18 March 2016, pp. 298–303. http://dx.doi.org/10.1109/CISS.2016.7460518
• Spall, J. C. (1992), “Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation,” IEEE Trans. on Automatic Control, vol. 37(3), pp. 332–341. http://dx.doi.org/10.1109/9.119632
• Spall, J. C. (2000), “Adaptive Stochastic Approximation by the Simultaneous Perturbation Method,” IEEE Trans. on Automatic Control, vol. 45, pp. 1839–1853. http://dx.doi.org/10.1109/TAC.2000.880982
• Spall, J. C. (2003), Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley.
• Spall, J. C. (2009), “Feedback and Weighting Mechanisms for Improving Jacobian Estimates in the Adaptive Simultaneous Perturbation Algorithm,” IEEE Trans. on Automatic Control, vol. 54(6), pp. 1216–1229. http://dx.doi.org/10.1109/TAC.2009.2019793
• Sun, S. and Spall, J. C. (2019), “SPSA Method Using Diagonalized Hessian Estimate,” Proc. of the IEEE Conf. on Decision and Control, Nice, France, 11–13 Dec. 2019, pp. 4922–4927.