
Efficient Implementation of Second-Order Stochastic Approximation

March 2020

James C. Spall

The Johns Hopkins University Applied Physics Laboratory and Department of Applied Mathematics and Statistics


Overview

• Present key extensions to basic simultaneous perturbation stochastic approximation (SPSA) algorithm for stochastic optimization problems

• Basic SPSA belongs to the general family of “first-order” SA methods; the aim here is to capture “second-order” improvement

• Analogous to first-order steepest descent algorithm vs. second-order Newton-Raphson algorithm in classical (deterministic) optimization

Content

• Part 1: Brief Review of SPSA
• Part 2: Core Adaptive (“Second-Order”) SPSA Method
• Part 3: Enhanced Adaptive SPSA Method: Optimal Weighting and Feedback
• Part 4: Matrix Factorizations for Efficient Implementation
• Part 5: SPSA Method Using Diagonalized Hessian Estimates
• Appendices

3

Key References

Spall, J. C. (2000), “Adaptive Stochastic Approximation by the Simultaneous Perturbation Method,” IEEE Transactions on Automatic Control, vol. 45, pp. 1839–1853. http://dx.doi.org/10.1109/TAC.2000.880982

Spall, J. C. (2009), “Feedback and Weighting Mechanisms for Improving Jacobian Estimates in the Adaptive Simultaneous Perturbation Algorithm,” IEEE Transactions on Automatic Control, vol. 54(6), pp. 1216–1229. http://dx.doi.org/10.1109/TAC.2009.2019793

Sun, S. and Spall, J. C. (2019), “SPSA Method Using Diagonalized Hessian Estimate,” Proc. of the IEEE Conf. on Decision and Control, Nice, France, 11–13 Dec. 2019, pp. 4922–4927. https://doi.org/10.1109/CDC40024.2019.9029707

Zhu, J., Wang, L., and Spall, J. C. (2019), “Efficient Implementation of Second-Order Stochastic Approximation Algorithms in High-Dimensional Problems,” IEEE Trans. on Neural Networks and Learning Systems, in press. http://dx.doi.org/10.1109/TNNLS.2019.2935455

Part 1: Brief Review of SPSA

5

Background for Stochastic Approximation

• Interested in finding $\theta = \theta^*$ such that $g(\theta^*) = 0$, where $\theta$ is a p-dimensional vector of “adjustables”
• Common special case of minimization: find root $\theta^*$ to

$$g(\theta) \equiv \frac{\partial L(\theta)}{\partial \theta} = 0,$$

where $L(\theta)$ is scalar-valued loss function
• Assume only (possibly noisy) measurements of $g(\theta)$ and/or $L(\theta)$ available; e.g., $y(\theta) = L(\theta) + \varepsilon(\theta)$
– Noisy measurements arise in areas such as Monte Carlo simulation, real-time control/estimation, machine learning, system identification, etc.

7

Brief Diversion (3 slides): Basic SPSA Algorithm (NIPS 2012 tutorial; Spall, 2003)

• Let $\hat{g}_k(\theta)$ denote SP estimate of $g(\theta)$ at kth iteration
• Let $\hat{\theta}_k$ denote estimate for $\theta$ at kth iteration
• SPSA algorithm has form

$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k\,\hat{g}_k(\hat{\theta}_k),$$

where $\{a_k\}$ is nonnegative gain sequence
• Generic iterative form above is standard in SA; stochastic analogue to steepest descent
• Under conditions, $\hat{\theta}_k \to \theta^*$ in some stochastic sense as $k \to \infty$

Computation of $\hat{g}_k(\bullet)$ (Heart of SPSA)

• Let $\Delta_k$ be vector of p independent random variables at kth iteration:

$$\Delta_k = [\Delta_{k1}, \Delta_{k2}, \ldots, \Delta_{kp}]^T$$

• $\Delta_k$ typically generated by Monte Carlo
• Let $\{c_k\}$ be sequence of positive scalars
• For iteration $k \to k+1$, take measurements at design levels $\hat{\theta}_k \pm c_k\Delta_k$:

$$y(\hat{\theta}_k + c_k\Delta_k) = L(\hat{\theta}_k + c_k\Delta_k) + \varepsilon_k^{(+)}$$
$$y(\hat{\theta}_k - c_k\Delta_k) = L(\hat{\theta}_k - c_k\Delta_k) + \varepsilon_k^{(-)}$$

where $\varepsilon_k^{(\pm)}$ are measurement noise terms
• Common special case is when $\varepsilon_k^{(\pm)} = 0$ (e.g., system identification with perfect measurements of the likelihood function)

9

Computation of $\hat{g}_k(\bullet)$ (cont’d)

• The standard SP form for $\hat{g}_k(\bullet)$:

$$\hat{g}_k(\hat{\theta}_k) = \begin{bmatrix} \dfrac{y(\hat{\theta}_k + c_k\Delta_k) - y(\hat{\theta}_k - c_k\Delta_k)}{2c_k\Delta_{k1}} \\ \vdots \\ \dfrac{y(\hat{\theta}_k + c_k\Delta_k) - y(\hat{\theta}_k - c_k\Delta_k)}{2c_k\Delta_{kp}} \end{bmatrix}$$

• Note that $\hat{g}_k(\bullet)$ only requires two measurements of $L(\bullet)$, independent of p
• Above SP form contrasts with standard finite-difference approximations taking 2p (or p+1) measurements
• Intuitive reason why $\hat{g}_k(\bullet)$ is appropriate is that $E[\hat{g}_k(\hat{\theta}_k) \mid \hat{\theta}_k] \approx g(\hat{\theta}_k)$; formalized in Spall (1992)
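The basic SPSA recursion and its two-measurement gradient estimate can be sketched as follows. This is a minimal illustration, not the author's code: the function names, the Bernoulli ±1 perturbations, and the gain forms `a / (k + 1 + A)**alpha` and `c / (k + 1)**gamma` with their default values are assumptions chosen for the example.

```python
import numpy as np

def spsa_gradient(y, theta, c_k, rng):
    """Simultaneous perturbation estimate of g(theta) from two measurements of y."""
    # Assumed perturbation law: Bernoulli +/-1 (mean zero, symmetric, finite inverse moments).
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    y_plus = y(theta + c_k * delta)
    y_minus = y(theta - c_k * delta)
    # The numerator is shared by all p components; only the divisor delta_i changes.
    return (y_plus - y_minus) / (2.0 * c_k * delta)

def spsa_minimize(y, theta0, n_iter, a=0.1, c=0.1, A=10.0,
                  alpha=0.602, gamma=0.101, seed=0):
    """Run theta_{k+1} = theta_k - a_k * g_hat_k(theta_k) for n_iter iterations."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for k in range(n_iter):
        a_k = a / (k + 1 + A) ** alpha   # assumed gain sequence (hypothetical defaults)
        c_k = c / (k + 1) ** gamma       # assumed perturbation-size sequence
        theta = theta - a_k * spsa_gradient(y, theta, c_k, rng)
    return theta
```

For example, `spsa_minimize(lambda t: float(t @ t), np.ones(5), 2000)` drives a noise-free quadratic loss toward its minimizer at the origin using two loss evaluations per iteration, regardless of p.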

Part 2: Core Adaptive (“Second-Order”) SPSA Method

11

Why Seek Second-Order Method?

• Standard SA algorithms (Robbins-Monro, etc.) show “1st-order” behavior
– Sharp initial decline (in optimization case)
– Slow convergence in final phase
– Sensitivity to units/scaling for components of $\theta$ (“non-transform invariant”)
• Long-standing interest in capturing “2nd-order” (Newton-like) effects in SA setting to cope with above shortcomings
• Two general methods:
─ Adaptive SA algorithms (Hessian/Jacobian estimation)
─ Iterate averaging
• Both of above traditionally suffer from implementation difficulty and limited practical efficiency
‒ E.g., difficult to get Hessian estimate

12

“Standard” Adaptive SPSA Algorithm (Spall, 2000, IEEE Trans. Auto. Contr.)

• Let $G_k(\theta)$ denote direct (noisy) measurement of $g(\theta)$ or simultaneous perturbation estimate of $g(\theta)$ at kth iteration
• Let $\hat{\theta}_k$ denote estimate for $\theta$ at kth iteration
• “Standard” adaptive SPSA algorithm has parallel recursions form

$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k\,\bar{\bar{H}}_k^{-1} G_k(\hat{\theta}_k), \quad \bar{\bar{H}}_k = f_k(\bar{H}_k) \qquad (*)$$
$$\bar{H}_k = \frac{k}{k+1}\,\bar{H}_{k-1} + \frac{1}{k+1}\,\hat{H}_k, \quad k = 0, 1, 2, \ldots \qquad (**)$$

where $\{a_k\}$ is nonnegative gain sequence and $\hat{H}_k$ is efficient “per-iteration” Jacobian estimate
• Key is “per-iteration” estimate $\hat{H}_k$ in (**); formula for $\hat{H}_k$ on next slide
• Note 1: Convergence of $\hat{\theta}_k$ and $\bar{H}_k$ can be shown (Spall, 2000)
• Note 2: Efficient rank 1 (or 2) update (via matrix inversion lemma [Sherman-Morrison-Woodbury formula]) can be used in (*); see Appendix 1

13

Per-Iteration Jacobian (Hessian) Estimate

$$\hat{H}_k = \frac{1}{2}\left[\frac{\delta G_k}{2c_k}\left[\Delta_{k1}^{-1}, \Delta_{k2}^{-1}, \ldots, \Delta_{kp}^{-1}\right] + \left(\frac{\delta G_k}{2c_k}\left[\Delta_{k1}^{-1}, \Delta_{k2}^{-1}, \ldots, \Delta_{kp}^{-1}\right]\right)^T\right]$$

where $\delta G_k = G(\hat{\theta}_k + c_k\Delta_k) - G(\hat{\theta}_k - c_k\Delta_k)$ and $G(\theta)$ is direct noisy measurement of $g(\theta)$ or the simultaneous perturbation estimate of $g(\theta)$ built from noisy $L(\theta)$ measurements

and

$\Delta_k = [\Delta_{k1}, \Delta_{k2}, \ldots, \Delta_{kp}]^T$ is mean-zero random vector such that the $\{\Delta_{kj}\}$ satisfy standard SPSA conditions (mean zero, symmetrically distributed random variables with finite inverse moments)

Cost of Implementation

• For any p, the cost per iteration of adaptive method is
– Four noisy measurements of L, or
– Three noisy measurements of g
• Above costs compare very favorably with previous methods:
– $O(p^2)$ loss measurements per iteration in finite-difference setting (e.g., Fabian, 1971)
– $O(p)$ g measurements per iteration in root-finding setting
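The four-measurement count for the loss-only (2SPSA) case can be made concrete with a short sketch. It assumes, as an illustration, that the perturbed gradient values are one-sided SP estimates built with a second perturbation vector $\tilde{\Delta}_k$ and step $\tilde{c}_k$, and that both perturbation vectors are Bernoulli ±1; the function names are hypothetical.

```python
import numpy as np

def one_sided_sp_gradient(y, theta, c_tilde, delta_tilde):
    """One-sided SP gradient estimate: [y(theta + c~ D~) - y(theta)] / (c~ D~_i)."""
    return (y(theta + c_tilde * delta_tilde) - y(theta)) / (c_tilde * delta_tilde)

def hessian_estimate_2spsa(y, theta, c_k, c_tilde_k, rng):
    """Per-iteration Hessian estimate from four loss measurements, for any p."""
    p = theta.size
    delta = rng.choice([-1.0, 1.0], size=p)        # assumed Bernoulli +/-1
    delta_tilde = rng.choice([-1.0, 1.0], size=p)  # second, independent perturbation
    # Two loss values at theta + c*delta and two at theta - c*delta.
    G_plus = one_sided_sp_gradient(y, theta + c_k * delta, c_tilde_k, delta_tilde)
    G_minus = one_sided_sp_gradient(y, theta - c_k * delta, c_tilde_k, delta_tilde)
    dG = G_plus - G_minus
    # Outer product of dG / (2 c_k) with the vector of inverses of delta, then symmetrize.
    M = np.outer(dG / (2.0 * c_k), 1.0 / delta)
    return 0.5 * (M + M.T)
```

A single estimate is very noisy (it has rank at most two), which is why the recursion (**) averages these estimates across iterations.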

15

Efficiency Analysis

• Use asymptotic normality of $\hat{\theta}_k$ to compare asymptotic RMS errors (as in basic SPSA) against best possible asymptotic RMS of SPSA and root-finding, say $\mathrm{RMS}_{\mathrm{SPSA}}$ and $\mathrm{RMS}_{\mathrm{root}}$
• Noisy values of L (2SPSA): $a_k = 1/k$ and $c_k = c/k^{1/6}$ ($k \ge 1$):

$$\frac{\text{RMS for adaptive}}{\mathrm{RMS}_{\mathrm{SPSA}}} < 2 \quad \text{as } c \to 0$$

• Noisy values of g (root-finding or 2SG): $a_k = 1/k$ and any valid $c_k$:

$$\frac{\text{RMS for adaptive}}{\mathrm{RMS}_{\mathrm{root}}} = 1$$

• Interpretation: Adaptive method with measurements of L does almost as well as unobtainable best SPSA; RMS error differs by < factor of 2. With measurements of g, does as well as analytically optimal root-finding method (rarely available).

Part 3: Enhanced Adaptive SPSA Method: Optimal Weighting and Feedback

17

Enhanced Adaptive SPSA Algorithm (Spall, 2009, IEEE Trans. Auto. Contr.)

• Standard adaptive algorithm can be improved:
─ Introduce feedback term that helps remove error in per-iteration H estimates
─ Optimal weighting to account for noisy measurements of loss function or of $g(\theta)$
• Enhanced Adaptive SPSA algorithm has same general parallel recursions form as (*) and (**) (slide 12):

$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k\,\bar{\bar{H}}_k^{-1} G_k(\hat{\theta}_k), \quad \bar{\bar{H}}_k = f_k(\bar{H}_k)$$
$$\bar{H}_k = (1 - w_k)\,\bar{H}_{k-1} + w_k\,(\hat{H}_k - \hat{\Psi}_k)$$

• New expressions here (highlighted in red on the original slide): $\hat{\Psi}_k$ is feedback term; $w_k$ is optimal weighting

Enhanced Algorithm (cont’d)

• Recursion for $\theta$ unchanged from basic adaptive method
• Critical aspect of recursion for H is the averaging of per-iteration Jacobian estimates
─ Each per-iteration H estimate formed by simultaneous perturbation of $G_k(\theta)$ values
─ Per-iteration estimate formed from only two $G_k(\theta)$ values (vs. 2p values in former adaptive methods)
• Two changes to recursion for H (feedback and weighting)
─ Feedback term uses current (cumulative) estimate of H as true H: subtracts out error in $\hat{H}_k$ that depends on true H
─ Weighting accounts for increasing effective noise contribution across iterations

19

Feedback Contribution to Enhanced Algorithm

• Per-iteration Jacobian estimate can be decomposed into four parts:

$$\hat{H}_k = H(\hat{\theta}_k) + \Psi_k + O(c_k^2) + \text{noise},$$

where $\Psi_k = \Psi_k(H(\hat{\theta}_k))$ is error term and $c_k$ is positive scalar such that $c_k \to 0$ as $k \to \infty$ ($c_k$ is the “difference interval” for the per-iteration estimates)
• Feedback term is best estimate of error term (using current estimate of H) at each iteration:

$$\hat{\Psi}_k = \Psi_k(\bar{H}_{k-1})$$

• Overall (cumulative) estimate of H (i.e., $\bar{H}_n$) is based on weighted sums of $\hat{H}_k - \hat{\Psi}_k$, $k = 0, 1, \ldots, n$
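To see what the feedback term removes, consider the simplest case: direct, noise-free gradient measurements $G(\theta) = H\theta$ of a quadratic loss with symmetric H. There the per-iteration estimate satisfies $\hat{H}_k = H + \tfrac{1}{2}(HD_k + D_k^T H)$ exactly, with $D_k = \Delta_k(\Delta_k^{-1})^T - I$. The sketch below checks this numerically. The closed form for $\Psi_k$ is derived here under those assumptions rather than quoted from Spall (2009), and the function names are hypothetical.

```python
import numpy as np

def per_iteration_jacobian_2sg(G, theta, c_k, delta):
    """Symmetrized per-iteration estimate built from two gradient measurements."""
    dG = G(theta + c_k * delta) - G(theta - c_k * delta)
    M = np.outer(dG / (2.0 * c_k), 1.0 / delta)
    return 0.5 * (M + M.T)

def feedback_term_2sg(H, delta):
    """Psi_k(H) = (H D + D^T H) / 2 with D = delta (delta^{-1})^T - I (assumed 2SG form)."""
    D = np.outer(delta, 1.0 / delta) - np.eye(delta.size)
    return 0.5 * (H @ D + D.T @ H)

def enhanced_update(H_bar_prev, H_hat, w_k, delta):
    """H_bar_k = (1 - w_k) H_bar_{k-1} + w_k (H_hat_k - Psi_hat_k)."""
    # Feedback: evaluate the error term at the current cumulative estimate.
    psi_hat = feedback_term_2sg(H_bar_prev, delta)
    return (1.0 - w_k) * H_bar_prev + w_k * (H_hat - psi_hat)
```

In this idealized setting, once the cumulative estimate equals the true H, `enhanced_update` leaves it unchanged for any weight, whereas the plain average in (**) would keep injecting the perturbation-induced error.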

20

Optimal Weighting

• Recall: $\hat{H}_k = H(\hat{\theta}_k) + \Psi_k + O(c_k^2) + \text{noise}$
• The noise term is of stochastic order $O(c_k^{-2})$ in the case of noisy loss measurements (optimization) and order $O(c_k^{-1})$ in the case of noisy g measurements (root-finding)
• Above implies that noise variance grows with k
• Optimal weighting is such that noise growth is “damped down” as k gets large
• Can solve for optimal weights in terms of $c_k$ using Lagrange multipliers; see Appendix 2

21

Comments on Theory

• Former theory on convergence of $\hat{\theta}_k$ and asymptotic efficiency continues to apply
• New theory required for convergence of $\bar{H}_k$ due to modified form
• Under various “standard” smoothness and bounded moments conditions, can show $\bar{H}_k \to H(\theta^*)$ a.s.
• For each entry in $\bar{H}_k$, have

$$\frac{\text{variance for enhanced adaptive SPSA}}{\text{variance for standard adaptive SPSA}} \le 1$$

• In special case of noise-free measurements, can show that rate of convergence of $\bar{H}_k$ to $H(\theta^*)$ is almost $O(e^{-ck})$, $c > 0$
– Feedback is critical to fast convergence
– Applies in case of direct loss or direct g measurements

Comments on Numerical Issues in Practical Implementation

• In practical problems, basic update

$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k\,\bar{H}_k^{-1} G_k(\hat{\theta}_k)$$

may need modest modifications:
A. For small k, must ensure $\bar{H}_k$ is replaced by related guaranteed positive-definite matrix $\bar{\bar{H}}_k$ (want $\bar{\bar{H}}_k$ close to $\bar{H}_k$)
B. Stable and efficient computation of $\bar{\bar{H}}_k^{-1} G_k(\hat{\theta}_k)$
• Standard method to solve A is to use matrix square root (e.g., Cholesky), as in

$$\bar{\bar{H}}_k = \left(\bar{H}_k^T \bar{H}_k + \delta_k I_p\right)^{1/2}, \quad \delta_k \to 0$$

• Standard method to solve B is to use Gaussian elimination
• Both standard solutions above require $O(p^3)$ calculations; not
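The two standard remedies above can be sketched in a few lines. This is an illustration under assumptions: the square root is taken through a symmetric eigendecomposition (one of several ways to form it), the linear solve uses NumPy's general dense solver in place of hand-written Gaussian elimination, and the function names are hypothetical. Both steps cost $O(p^3)$.

```python
import numpy as np

def precondition_sqrt(H_bar, delta_k):
    """Return (H_bar^T H_bar + delta_k I)^{1/2}, which is symmetric positive definite."""
    p = H_bar.shape[0]
    S = H_bar.T @ H_bar + delta_k * np.eye(p)  # symmetric positive definite for delta_k > 0
    lam, Q = np.linalg.eigh(S)                 # O(p^3) eigendecomposition
    return (Q * np.sqrt(lam)) @ Q.T            # Q diag(sqrt(lam)) Q^T

def newton_like_step(theta, H_bar, G_k, a_k, delta_k):
    """One update theta - a_k * H_pd^{-1} G_k using the preconditioned matrix."""
    H_pd = precondition_sqrt(H_bar, delta_k)
    d = np.linalg.solve(H_pd, G_k)             # O(p^3) dense solve
    return theta - a_k * d
```

For a symmetric $\bar{H}_k$ and small $\delta_k$, this mapping keeps the eigenvectors and replaces each eigenvalue by approximately its absolute value.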

23

Numerical Study

• Consider minimization problem based on 4th-order polynomial loss function; $p = \dim(\theta) = 10$
• Compare standard adaptive and enhanced adaptive SPSA in terms of terminal loss function values and H estimates
• Used stochastic gradient setting (direct noisy measurements of gradient g) (2SG) and setting with only loss values (2SPSA; see next slide)
• Considered 50 independent runs for each experiment; normalized loss is from 0 (best) to 1 (initial condition)

Number of     Normalized loss      Normalized loss      P-values for comparing
iterations    from standard 2SG    from enhanced 2SG    loss functions
2000          0.019                0.012                0.0061
10,000        0.015                0.0034               0.00049

24

Relative Convergence Rates for Hessian Estimate in 2SPSA in Noise-Free Problem

[Figure: $\bar{H}_k$ relative to true H as a function of k]

Part 4: Matrix Factorizations for Efficient Implementation

25

• Adaptive (second-order) SPSA is highly efficient in its use of input information. It needs only:
‒ 4 noisy loss measurements per iteration for 2SPSA
‒ 3 noisy gradient measurements per iteration for 2SG
• Above fixed costs (any p) are separate from per-iteration computing cost for update step (FLOPS cost), which grows with p
‒ Primary driver of computing cost is matrix manipulation for large p
• We discuss below easy-to-implement method for reducing computing cost from $O(p^3)$ (in general) to $O(p^2)$

• Overall update of $\hat{\theta}_k$ requires $O(p^3)$ in general. Two main issues:
‒ “Preconditioning”: need $\bar{H}_k$ to be positive definite
  • Map $\bar{H}_k$ to some positive-definite matrix $\bar{\bar{H}}_k \equiv f_k(\bar{H}_k)$
  • Straightforward approach in Spall (2000 or 2009), $\bar{\bar{H}}_k = (\bar{H}_k^T\bar{H}_k + \delta_k I)^{1/2}$, $\delta_k \to 0$, requires $O(p^3)$ in general
‒ Gaussian elimination for $\bar{\bar{H}}_k^{-1} G_k$ (standard method) takes $O(p^3)$ computations
• Matrix factorizations can be used to simultaneously handle both preconditioning and update step $\bar{\bar{H}}_k^{-1} G_k$
• Alternative method (Rastogi et al., 2016) based on matrix inversion lemma (Appendix 1) useful if preconditioning not relevant (i.e., know $\bar{H}_k$ is positive definite)

27

Matrix Factorizations (cont’d)

• Recall: Update of $\hat{\theta}_k$ via $\bar{\bar{H}}_k^{-1} G_k$ generally requires $O(p^3)$. Two expensive steps:
‒ Preconditioning: maintain positive-definiteness of $\bar{\bar{H}}_k$
‒ Solving $\bar{\bar{H}}_k d_k = G_k$ for descent direction $d_k$ (via Gaussian elimination or other standard method) costs $O(p^3)$
• Goal: Achieve $O(p^2)$ per-iteration cost, a factor of p improvement over other second-order methods
• Tool is symmetric matrix factorization, applicable for possibly indefinite matrices:

$$P\,\bar{H}_k\,P^T = L B L^T$$

‒ P is permutation; L is lower-triangular; B is block-diagonal with blocks of size 1 or 2
‒ Value of factorization is that all calculations required to obtain $d_k = \bar{\bar{H}}_k^{-1} G_k$ are now $O(p^2)$

Matrix Factorizations (cont’d)

• Preconditioning achievable at $O(p^2)$ per-iteration cost
‒ Eigen-decomposition of block diagonal $B = Q\Lambda Q^T$ is cheap, at most $O(p)$
‒ Modification to make diagonal $\bar{\Lambda} \succ 0$ guarantees $\bar{B} \succ 0$ and hence $\bar{\bar{H}}_k \succ 0$ (by Sylvester's law of inertia*)
‒ Hence, $P\,\bar{\bar{H}}_k\,P^T = L\bar{B}L^T = LQ\bar{\Lambda}Q^TL^T$
‒ Overall: Preconditioning changes indefinite $\bar{H}_k$ to $\bar{\bar{H}}_k \succ 0$
• Descent direction $d_k$ obtainable at $O(p^2)$ per-iteration cost

$$\bar{\bar{H}}_k d_k = P^T L \bar{B} L^T P\, d_k = G_k$$

‒ Lower-triangular L allows backward/forward substitution
‒ Multiplying by permutation P is cheap
• Next slide shows overall flow

*If $D = SAS^T$ is diagonal with S nonsingular, A has the same number of positive/negative/zero eigenvalues as D

Matrix Factorizations: Overall Flowchart

• Recall: Can achieve both preconditioning and computation of descent direction in $O(p^2)$
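The preconditioning and solve steps of this flow can be sketched with SciPy's `scipy.linalg.ldl`, which computes a Bunch-Kaufman-type $LBL^T$ factorization. Several points are assumptions of this sketch: the function name, the eigenvalue modification (absolute value, floored at a hypothetical `eps`), and the dense product used for the block-diagonal solve. Note also that `ldl` is called from scratch here, which is itself an $O(p^3)$ operation; reaching the $O(p^2)$ cost quoted above would require updating the factors from one iteration to the next, which is not shown.

```python
import numpy as np
from scipy.linalg import ldl, solve_triangular

def precondition_and_solve(H_bar, G, eps=1e-4):
    """Return (d, H_pd): H_pd is a positive-definite modification of H_bar, H_pd @ d = G."""
    H_bar = np.asarray(H_bar, dtype=float)
    lu, B, perm = ldl(H_bar, lower=True)   # H_bar = lu @ B @ lu.T
    L = lu[perm, :]                        # lower-triangular: P H_bar P^T = L B L^T
    p = B.shape[0]
    B_pd = np.zeros_like(B)
    B_pd_inv = np.zeros_like(B)
    i = 0
    while i < p:                           # walk the 1x1 and 2x2 diagonal blocks of B
        two = i + 1 < p and (B[i + 1, i] != 0.0 or B[i, i + 1] != 0.0)
        size = 2 if two else 1
        lam, Q = np.linalg.eigh(B[i:i + size, i:i + size])
        lam = np.maximum(np.abs(lam), eps) # assumed modification making Lambda positive
        B_pd[i:i + size, i:i + size] = (Q * lam) @ Q.T
        B_pd_inv[i:i + size, i:i + size] = (Q / lam) @ Q.T
        i += size
    z = solve_triangular(L, G[perm], lower=True)   # forward substitution on P G
    w = B_pd_inv @ z                               # block-diagonal solve (dense for brevity)
    u = solve_triangular(L.T, w, lower=False)      # backward substitution, u = P d
    d = np.empty_like(u)
    d[perm] = u                                    # undo the permutation
    H_pd = lu @ B_pd @ lu.T                        # formed only for checking; not needed for d
    return d, H_pd
```

When `H_bar` is already safely positive definite, every block of B has eigenvalues above `eps`, so the modification is inactive and `H_pd` coincides with `H_bar`.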

Numerical Study: Evaluation of $O(p^2)$

• Consider skewed-quartic loss function with Gaussian noise*
• Plot shows ratio of run times of conventional 2SPSA over efficient 2SPSA as function of p
‒ Both algorithms have same accuracy in estimation of $\theta$
• Note observed linear trend: $O(p^3)/O(p^2) = O(p)$ difference in run times, as predicted by theory
‒ Approx. 0.56p-fold reduction in run time (2800-fold at p = 5000)

*$L(\theta) = \theta^T B^T B\,\theta + 0.1\sum_{i=1}^{p}(B\theta)_i^3 + 0.01\sum_{i=1}^{p}(B\theta)_i^4$, where B is an upper triangular matrix with all nonzero entries equal to 1/p
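For reference, the skewed-quartic loss in the footnote can be written directly in code. The noisy wrapper and its default noise level `sigma` are assumptions added for illustration; the source does not state the noise variance used in the study.

```python
import numpy as np

def skewed_quartic(theta):
    """L(theta) = theta^T B^T B theta + 0.1 sum (B theta)_i^3 + 0.01 sum (B theta)_i^4."""
    theta = np.asarray(theta, dtype=float)
    p = theta.size
    B = np.triu(np.ones((p, p))) / p   # upper triangular, nonzero entries all 1/p
    b = B @ theta
    return float(b @ b + 0.1 * np.sum(b ** 3) + 0.01 * np.sum(b ** 4))

def noisy_skewed_quartic(theta, rng, sigma=0.1):
    """Loss measurement y = L + Gaussian noise; sigma is a hypothetical noise level."""
    return skewed_quartic(theta) + sigma * rng.standard_normal()
```

The minimum is at the origin, where the loss is zero, so `skewed_quartic(np.zeros(p))` returns `0.0` for any p.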

Numerical Study: Comparison with SGD and ADAM

32

• Compare efficient 2SG with standard SG descent (SGD) and ADAM on neural network with empirical risk function (ERF)

– ERF is standard machine learning formulation (summation over data set)

– Used NASA airfoil self-noise data set (NACA 0012) with 1503 data points

– ADAM is popular adaptive learning method

Some Future Directions

• Above symmetric matrix factorization is versatile
• Factorization applicable for general quasi-Newton (QN) methods, such as BFGS and DFP
‒ Standard update formulas contain two rank-one modifications that can be replaced by symmetric matrix factorization with no extra cost
‒ Standard QN methods are $O(p^2)$ but…
• Factorization allows extra preconditioning step at $O(p^2)$ to improve performance for ill-conditioned problems
‒ W/o factorization, preconditioning only available at $O(p^3)$
• Applications where need for preconditioning arises include:
‒ Machine learning problems (logistic regression, ridge regression, soft-margin support vector machines), where feature vectors have high dimension (relevant in big-data era)
‒ Inverse problems (various fields within medical imaging and geophysics including mineral and oil exploration)

33


SPSA Using Diagonalized Hessian Estimate (Sun and Spall, 2019)

• Recall first-order methods (SPSA, SGD):
‒ Pros: Comparatively low computational complexity per iteration (usually $O(p)$)
‒ Cons: Slow convergence in final stage, etc. (slide 11)
• Second-order methods (2SPSA, 2SG):
‒ Pros: High efficiency in convergence and transform invariant
‒ Cons: High computational complexity per iteration (usually $O(p^3)$; optimally $O(p^2)$ per Part 4); inefficient for high-dimensional problems
• Core concept:
‒ Find something between first- and second-order
‒ Apply part of Hessian estimates for better convergence
‒ Apply only part of Hessian to avoid the high computational complexity per iteration

35

36

Diagonalized Hessian Estimate, cont’d

• Modify adaptive algorithm correspondingly:
─ Estimate only diagonal components of Hessian in recursion
─ Propose diagSPSA and diagSG
• diagSPSA and diagSG algorithms have same general parallel recursions form as (*) and (**) in slide 12:

$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k\,\bar{\bar{D}}_k^{-1} G_k(\hat{\theta}_k), \quad \bar{\bar{D}}_k = \mathrm{abs}(\bar{D}_k) + \eta I_p$$

• As only diagonal components are estimated, the computational cost per iteration is $O(p)$
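A single diagSG-style iteration can be sketched with vectors only, which is where the $O(p)$ cost comes from. The recursion for $\bar{D}_k$ is not shown on the slide; the sketch assumes it is the equal-weight average from (**) applied to the diagonal of the per-iteration estimate, and assumes Bernoulli ±1 perturbations. The function name and argument list are hypothetical.

```python
import numpy as np

def diag_sg_step(theta, D_bar_prev, k, G, a_k, c_k, eta, rng):
    """One O(p) update; the diagonal Hessian estimate is stored as a length-p vector."""
    delta = rng.choice([-1.0, 1.0], size=theta.size)
    dG = G(theta + c_k * delta) - G(theta - c_k * delta)
    d_hat = dG / (2.0 * c_k * delta)            # diagonal of the per-iteration estimate
    D_bar = (k * D_bar_prev + d_hat) / (k + 1)  # assumed running average, as in (**)
    D_pd = np.abs(D_bar) + eta                  # abs(D_bar) + eta * I_p, elementwise
    theta_next = theta - a_k * G(theta) / D_pd  # dividing replaces the matrix solve
    return theta_next, D_bar
```

Each call uses three gradient evaluations (at $\hat{\theta}_k$ and at the two perturbed points), matching the 2SG measurement count given earlier.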

DiagSG in ERF Deep Learning

• Some Existing Algorithms in Deep Learning:

─ Apply diagonal components of empirical Fisher information matrix, sum of outer products of gradients of log-likelihood (not Hessian estimates)

─ Examples: AdaGrad, Adam, etc.

• Comparison with DiagSG (requires knowledge in Chap.13)

─ Finite Stage: Difference of empirical Fisher information matrices and Hessian estimates under ERF setting

─ Asymptotic Behavior: Share very similar asymptotic distribution because of equivalent definition of FIM

• Numerical Experiment:

─ ResNet on CIFAR-10 with p = 2107

─ Note: ResNet stands for “Residual Network,” a powerful structure for computer vision tasks; CIFAR-10 is a standard image dataset

39

Concluding Remarks

• Design of efficient and (relatively) easy to use adaptive search algorithms is long-standing problem

– Especially difficult in setting of noisy measurements

– Knowledge of Jacobian/Hessian matrix also useful in contexts outside of improved search (e.g., Cramér-Rao bound)

• Simultaneous perturbation idea can be used to dramatically enhance efficiency in multivariate problems

• Improvements here to “standard” adaptive SPSA are simple to use and further enhance efficiency

─ Feedback to reduce error in “per-iteration” Jacobian/Hessian estimates

─ Optimal weighting to reduce effects of noise

─ Matrix factorizations to enhance applicability in large-scale (e.g., machine learning) problems

• Ongoing research at JHU and elsewhere in applications to deep learning

40

Two appendices to follow:

Appendix 1: Use of Matrix Inversion Lemma to Remove Explicit Inversion in SA Update

Appendix 2: Optimal Weighting for Enhanced Adaptive SPSA Algorithm

41

Appendix 1: Use of Matrix Inversion Lemma to Remove Explicit Inversion in SA Update

• Idea below began as part of final report by student Pushpendre Rastogi for stochastic optimization class, spring 2014
• Consider special case of “2SPSA” setting, where $G_k(\theta)$ is SP estimate of $g(\theta)$ built from noisy $L(\theta)$ measurements
• Special case of no feedback in Spall (2009) leads to:

$$\bar{H}_k = (1 - w_k)\,\bar{H}_{k-1} + w_k\,\hat{H}_k$$
$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k\,\bar{\bar{H}}_k^{-1} G_k(\hat{\theta}_k)$$

• For ease of discussion, ignore distinction between $\bar{H}_k$ and $\bar{\bar{H}}_k$
• Modify form above slightly to work directly with inverses
• Consider “pre-symmetrization” process when forming per-iteration estimate of H (symmetrize after creating appropriate inverse matrix):

$$\frac{\delta G_k}{2c_k}\left[\Delta_{k1}^{-1}, \Delta_{k2}^{-1}, \ldots, \Delta_{kp}^{-1}\right],$$

so $\hat{H}_k$ is average of the matrix above and its transpose

Appendix 1: Use of Matrix Inversion Lemma (cont’d)

• Let $J_k$ serve as (non-symmetric) estimate of inverse of H
• After a little algebra, can write:

$$J_k^{-1} = (1 - w_k)\,J_{k-1}^{-1} + w_k\,\frac{\delta y_k}{2c_k\tilde{c}_k}\,\tilde{\Delta}_k^{-1}\left(\Delta_k^{-1}\right)^T, \qquad (*)$$

where

$$\delta y_k = \left[y(\hat{\theta}_k + c_k\Delta_k + \tilde{c}_k\tilde{\Delta}_k) - y(\hat{\theta}_k + c_k\Delta_k)\right] - \left[y(\hat{\theta}_k - c_k\Delta_k + \tilde{c}_k\tilde{\Delta}_k) - y(\hat{\theta}_k - c_k\Delta_k)\right]$$

and $\Delta_k^{-1}$ and $\tilde{\Delta}_k^{-1}$ represent vector of inverses of components in corresponding vector $\Delta_k$ and $\tilde{\Delta}_k$
• Note from (*) above that recursion for estimate of H is rank-one update

43

Appendix 1: Use of Matrix Inversion Lemma (cont’d)

• Recall (previous slide):

$$J_k^{-1} = (1 - w_k)\,J_{k-1}^{-1} + w_k\,\frac{\delta y_k}{2c_k\tilde{c}_k}\,\tilde{\Delta}_k^{-1}\left(\Delta_k^{-1}\right)^T$$

• From a straightforward substitution into the matrix inversion lemma,

$$J_k = \frac{1}{1 - w_k}\left[J_{k-1} - \frac{s_k\,J_{k-1}\tilde{\Delta}_k^{-1}\left(\Delta_k^{-1}\right)^T J_{k-1}}{1 - w_k + s_k\left(\Delta_k^{-1}\right)^T J_{k-1}\tilde{\Delta}_k^{-1}}\right],$$

where $s_k = w_k\,\delta y_k / (2c_k\tilde{c}_k)$
• Since inverse of symmetric matrix is symmetric, can impose symmetry constraint directly in terms of inverses
• That is, instead of Spall (2009),

$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k\,\tfrac{1}{2}\left(J_k + J_k^T\right) G_k(\hat{\theta}_k),$$

where $G_k(\theta)$ is SP estimate of $g(\theta)$
• Similar ideas can be applied to all variations: Adaptive SPSA and SG (Spall, 2000) and enhanced adaptive methods with feedback (Spall, 2009)
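The rank-one inverse update above is an instance of the Sherman-Morrison formula and can be checked numerically against a direct inversion. The sketch below uses only matrix-vector and outer products, so each update costs $O(p^2)$. The function and argument names are hypothetical.

```python
import numpy as np

def inverse_rank_one_update(J_prev, w_k, s_k, delta_inv, delta_tilde_inv):
    """Return J_k given J_k^{-1} = (1 - w_k) J_{k-1}^{-1} + s_k * u v^T.

    Here u = delta_tilde_inv and v = delta_inv are the vectors of componentwise
    inverses of the two perturbation vectors.
    """
    Ju = J_prev @ delta_tilde_inv              # J_{k-1} u, O(p^2)
    vJ = delta_inv @ J_prev                    # v^T J_{k-1}, O(p^2)
    denom = 1.0 - w_k + s_k * (delta_inv @ Ju) # scalar; must be nonzero
    return (J_prev - s_k * np.outer(Ju, vJ) / denom) / (1.0 - w_k)
```

The update then enters the parameter recursion through the symmetrized matrix `0.5 * (J_k + J_k.T)`, with no linear system to solve.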

44

Appendix 1: Use of Matrix Inversion Lemma (cont’d)

• Let us discuss efficiency of the above idea when applied to four main variations of adaptive methods (Spall, 2000, 2009)
• Standard method based on Gaussian elimination for computing $\bar{\bar{H}}_k^{-1} G_k(\hat{\theta}_k)$ uses an approximate number of operations (adds and multiplies) that is cubic in p
• In contrast, as shown in Rastogi et al. (2016), method here uses an approximate number of operations that is quadratic in p
• Bottom line: above avoids explicit inverse or Gaussian elimination: Computations now $O(p^2)$ instead of $O(p^3)$

45

Appendix 2: Optimal Weighting for Enhanced Adaptive SPSA Algorithm

• Recall:

$$\bar{H}_k = (1 - w_k)\,\bar{H}_{k-1} + w_k\,(\hat{H}_k - \hat{\Psi}_k)$$

• For the case of noisy loss functions, optimal weights are:

$$w_k = \frac{c_k^4}{\sum_{i=0}^{k} c_i^4}$$

• For the case of noisy root-finding (g) measurements, optimal weights are:

$$w_k = \frac{c_k^2}{\sum_{i=0}^{k} c_i^2}$$

• Both of above based on “downweighting” later per-iteration Jacobian estimates to compensate for increased noise contribution
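The weights above are simple to compute from the sequence $\{c_k\}$. The sketch below returns the whole weight sequence; the function name and the boolean switch are hypothetical.

```python
import numpy as np

def optimal_weights(c, noisy_loss=True):
    """w_k = c_k^q / sum_{i=0}^{k} c_i^q, with q = 4 (noisy L) or q = 2 (noisy g)."""
    q = 4 if noisy_loss else 2
    cq = np.asarray(c, dtype=float) ** q
    return cq / np.cumsum(cq)
```

Unrolling the recursion shows the effect of these weights: the cumulative estimate is a weighted average of the per-iteration terms with weights proportional to $c_i^q$, so terms formed with smaller $c_i$ (and hence more noise) count less. With constant $c_k$ the weights reduce to $w_k = 1/(k+1)$, the equal-weight averaging of (**).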

Partial List of References

• Rastogi, P., Zhu, J., and Spall J. C. (2016), “Efficient Implementation of Enhanced Adaptive Simultaneous Perturbation Algorithms,” Proc. 50th Annual Conference on Information Sciences and Systems, Princeton, NJ, 16−18 March 2016, pp. 298–303.

http://dx.doi.org/10.1109/CISS.2016.7460518

• Spall, J. C. (1992), “Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation,” IEEE Trans. on Automatic Control, vol. 37(3), pp.

332–341. http://dx.doi.org/10.1109/9.119632

• Spall, J. C. (2000), “Adaptive Stochastic Approximation by the Simultaneous Perturbation Method,” IEEE Trans. on Automatic Control, vol. 45, pp. 1839−1853.

http://dx.doi.org/10.1109/TAC.2000.880982

• Spall, J. C. (2003), Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley.

• Spall, J. C. (2009), “Feedback and Weighting Mechanisms for Improving Jacobian Estimates in the Adaptive Simultaneous Perturbation Algorithm,” IEEE Trans. on Auto. Control, vol. 54(6), pp. 1216–1229. http://dx.doi.org/10.1109/TAC.2009.2019793

• Sun, S. and Spall, J. C. (2019), “SPSA Method Using Diagonalized Hessian Estimate,”

Proc. of the IEEE Conf. on Decision and Control, Nice, France, 11–13 Dec. 2019, pp.

4922–4927.