
CHAPTER 7
SIMULTANEOUS PERTURBATION STOCHASTIC APPROXIMATION (SPSA)

• Organization of chapter in ISSO
– Problem setting
– SPSA algorithm
– Theoretical foundation
– Asymptotic normality and efficiency
– Practical guidelines and MATLAB code
– Numerical examples
– Extensions and further results
– Adaptive simultaneous perturbation method

Additional information available at www.jhuapl.edu/SPSA (selected references, background articles, MATLAB code, and video)

Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall

A. PROBLEM SETTING AND SPSA ALGORITHM

• Consider standard minimization setting, i.e., find root $\theta^*$ of

$$g(\theta) \equiv \frac{\partial L(\theta)}{\partial \theta} = 0,$$

where $L(\theta)$ is scalar-valued loss function to be minimized and $\theta$ is p-dimensional vector

• Assume only (possibly noisy) measurements of $L(\theta)$ available

– No direct measurements of $g(\theta)$ used, as are required in stochastic gradient methods

• Noisy measurements of $L(\theta)$ arise in areas such as Monte Carlo simulation, real-time control/estimation, etc.

• Interested in p > 1 setting (including p >> 1)

SPSA Algorithm

• Let $\hat g_k(\hat\theta_k)$ denote SP estimate of $g(\theta)$ at kth iteration

• Let $\hat\theta_k$ denote estimate for $\theta$ at kth iteration

• SPSA algorithm has form

$$\hat\theta_{k+1} = \hat\theta_k - a_k\,\hat g_k(\hat\theta_k)$$

where {a_k} is nonnegative gain sequence

• Generic iterative form above is standard in SA; stochastic analogue to steepest descent

• Under conditions, $\hat\theta_k \to \theta^*$ in “almost sure” (a.s.) stochastic sense as $k \to \infty$

Computation of $\hat g_k(\cdot)$ (Heart of SPSA)

• Let $\Delta_k$ be perturbation vector of p independent random variables at kth iteration:

$$\Delta_k = [\Delta_{k1}, \Delta_{k2}, \ldots, \Delta_{kp}]^T$$

• $\Delta_k$ typically generated by Monte Carlo

• Let {c_k} be sequence of positive scalars

• For iteration k → k+1, take measurements at design levels $\hat\theta_k \pm c_k\Delta_k$:

$$y(\hat\theta_k + c_k\Delta_k) = L(\hat\theta_k + c_k\Delta_k) + \varepsilon_k^{(+)}$$

$$y(\hat\theta_k - c_k\Delta_k) = L(\hat\theta_k - c_k\Delta_k) + \varepsilon_k^{(-)}$$

where $\varepsilon_k^{(\pm)}$ are measurement noise terms

• Common special case is when $\varepsilon_k^{(\pm)} = 0$ (e.g., system identification with perfect measurements of the likelihood function)

Computation of $\hat g_k(\cdot)$ (cont’d)

• The standard SP form for $\hat g_k(\cdot)$:

$$\hat g_k(\hat\theta_k) = \begin{bmatrix} \dfrac{y(\hat\theta_k + c_k\Delta_k) - y(\hat\theta_k - c_k\Delta_k)}{2c_k\Delta_{k1}} \\ \vdots \\ \dfrac{y(\hat\theta_k + c_k\Delta_k) - y(\hat\theta_k - c_k\Delta_k)}{2c_k\Delta_{kp}} \end{bmatrix}$$

• Note that $\hat g_k(\cdot)$ only requires two measurements of L(•), independent of p

• Above SP form contrasts with standard finite-difference approximations taking 2p (or p+1) measurements (used in “FDSA”)

• Intuitive reason why $\hat g_k(\cdot)$ is appropriate is that

$$E[\hat g_k(\hat\theta_k) \mid \hat\theta_k] \approx g(\hat\theta_k)$$
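The SP gradient form can be sketched in a few lines of Python. This is only an illustrative sketch, not code from ISSO: the names `sp_gradient` and `CountingLoss`, the quadratic test loss, and the values p = 50 and c_k = 0.1 are all assumptions made for the example. The point is that the number of loss measurements stays at two whatever the dimension p.

```python
import random


def sp_gradient(loss, theta, ck, rng):
    """Two-measurement SP gradient estimate with Bernoulli +/-1 perturbations."""
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]
    theta_plus = [t + ck * d for t, d in zip(theta, delta)]
    theta_minus = [t - ck * d for t, d in zip(theta, delta)]
    diff = loss(theta_plus) - loss(theta_minus)
    # Same numerator for every component; only the divisor Delta_ki changes.
    return [diff / (2.0 * ck * d) for d in delta]


class CountingLoss:
    """Hypothetical wrapper that counts how often the loss is measured."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, theta):
        self.calls += 1
        return self.fn(theta)


# Assumed test problem: noise-free quadratic loss in p = 50 dimensions.
quadratic = CountingLoss(lambda th: sum(t * t for t in th))
ghat = sp_gradient(quadratic, [1.0] * 50, 0.1, random.Random(0))
print(len(ghat), quadratic.calls)
```

A two-sided finite-difference estimate at the same point would need 2p = 100 calls to the loss, one pair per coordinate.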

Essential Conditions for SPSA

• To use SPSA, there are regularity conditions on $L(\theta)$, choice of $\Delta_k$, the gain sequences {a_k}, {c_k}, and the measurement noise

– Sections 7.3 and 7.4 of ISSO present essential conditions

• Roughly speaking the conditions are:

A. $L(\theta)$ smoothness: $L(\theta)$ is thrice differentiable function (can be relaxed; see Section 7.3 of ISSO)

B. Choice of $\Delta_k$ distribution: For all k, $\Delta_k$ has independent components, symmetrically distributed around 0, and $E(\Delta_{ki}^{2}) < \infty$, $E(\Delta_{ki}^{-2}) < \infty$

– Bounded inverse moments condition is critical (excludes $\Delta_{ki}$ being normally or uniformly distributed)

– Symmetric Bernoulli $\Delta_{ki} = \pm 1$ (prob = ½ for each outcome) is allowed; asymptotically optimal (see Section E(iii) or Section 7.7 of ISSO)

Essential Conditions for SPSA (cont’d)

C. Gain sequences: standard SA conditions:

$$a_k > 0,\ c_k > 0; \quad a_k \to 0,\ c_k \to 0 \text{ as } k \to \infty; \quad \sum_{k=0}^{\infty} a_k = \infty; \quad \sum_{k=0}^{\infty} \left(\frac{a_k}{c_k}\right)^2 < \infty$$

(better to violate some of these gain conditions in certain practical problems; e.g., nonstationary tracking and control where a_k = a > 0, c_k = c > 0 for all k, i.e., constant gains)

D. Measurement noise: Martingale difference

$$E[\varepsilon_k^{(+)} - \varepsilon_k^{(-)} \mid \hat\theta_k, \Delta_k] = 0$$

for all k sufficiently large. (Noises not required to be independent of each other or of current/previous $\hat\theta_k$ and $\Delta_k$ values.) Alternative condition (no martingale mean 0 assumption needed) is that $\varepsilon_k^{(\pm)}$ be bounded for all k
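As a worked check of condition C (an illustration added here, not a statement from ISSO), assume power-law gains of the form used later in these slides, $a_k = a/(k+1+A)^{\alpha}$ and $c_k = c/(k+1)^{\gamma}$ with $a, c, \alpha, \gamma > 0$ and $A \ge 0$. Both gains then decay to zero, and the two summation conditions reduce to simple inequalities on the exponents:

```latex
\sum_{k=0}^{\infty} a_k
  = \sum_{k=0}^{\infty} \frac{a}{(k+1+A)^{\alpha}} = \infty
  \iff \alpha \le 1,
\qquad
\sum_{k=0}^{\infty} \left(\frac{a_k}{c_k}\right)^{2}
  = \frac{a^{2}}{c^{2}} \sum_{k=0}^{\infty} \frac{(k+1)^{2\gamma}}{(k+1+A)^{2\alpha}} < \infty
  \iff 2(\alpha - \gamma) > 1 .
```

For the values $\alpha = 0.602$ and $\gamma = 0.101$ quoted in Section C of these slides, $2(\alpha - \gamma) = 1.002 > 1$ and $\alpha \le 1$, so both conditions hold, though only barely for the second one.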

B. THEORETICAL FOUNDATION

Three Questions

Question 1: Is $\hat g_k(\cdot)$ a valid estimator for g(•)?
Answer: Yes, under modest conditions.

Question 2: Will the algorithm converge to $\theta^*$?
Answer: Yes, under reasonable conditions.

Question 3: Do savings in data/iteration lead to a corresponding savings in converging to optimum?
Answer: Yes, under reasonable conditions.

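Near Unbiasedness of $\hat g_k(\cdot)$

• SPSA is stochastic analogue to deterministic algorithms if $\hat g_k(\theta)$ is “on average” same as $g(\theta)$ for any $\theta$

• Suppressing iteration index k, mth component of $\hat g(\theta)$ is:

$$\hat g_m(\theta) = \frac{L(\theta + c\Delta) - L(\theta - c\Delta)}{2c\Delta_m} + \text{noise}$$

$$\approx \frac{\left(L(\theta) + c\,g(\theta)^T\Delta\right) - \left(L(\theta) - c\,g(\theta)^T\Delta\right)}{2c\Delta_m} + \text{noise}$$

$$= \frac{\sum_i g_i(\theta)\Delta_i}{\Delta_m} + \text{noise}$$

$$= g_m(\theta) + \sum_{i \ne m} g_i(\theta)\frac{\Delta_i}{\Delta_m} + \text{noise}$$

• With $E(\Delta_i/\Delta_m) = 0$ for $i \ne m$, we have for any m:

$$E[\hat g_m(\theta)] = g_m(\theta) + \text{negligible terms}$$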

Illustration of Near-Unbiasedness for $\hat g_k(\cdot)$ with p = 2 and Bernoulli Perturbations

(Figure not reproduced.)
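The p = 2 Bernoulli case can also be checked numerically by enumerating the four equally likely perturbation vectors. The sketch below is an assumption-laden illustration: the quadratic loss, the point θ = (1, −2), and c = 0.5 are made up for the example, and the measurements are noise-free. For a quadratic loss the higher-order terms vanish, so the average of the four SP estimates matches the gradient exactly, while each individual estimate can be far from it.

```python
from itertools import product


def loss(theta):
    # Hypothetical noise-free quadratic loss in two variables.
    t1, t2 = theta
    return t1 * t1 + t1 * t2 + 2.0 * t2 * t2


def gradient(theta):
    # Analytic gradient of the loss above, used only for comparison.
    t1, t2 = theta
    return [2.0 * t1 + t2, t1 + 4.0 * t2]


def sp_estimate(theta, c, delta):
    plus = [t + c * d for t, d in zip(theta, delta)]
    minus = [t - c * d for t, d in zip(theta, delta)]
    diff = loss(plus) - loss(minus)
    return [diff / (2.0 * c * d) for d in delta]


theta = [1.0, -2.0]
c = 0.5
outcomes = list(product((-1.0, 1.0), repeat=2))  # four outcomes, prob 1/4 each
estimates = [sp_estimate(theta, c, d) for d in outcomes]
mean_estimate = [sum(e[m] for e in estimates) / len(estimates) for m in range(2)]

print("individual estimates:", estimates)
print("mean of estimates:   ", mean_estimate)
print("true gradient:       ", gradient(theta))
```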

Theoretical Basis (Sects. 7.3–7.4 of ISSO)

• Under appropriate regularity conditions (e.g., $E(\Delta_{ki}^{-2}) < \infty$, $L(\theta)$ thrice continuously differentiable, $\varepsilon_k^{(\pm)}$ is martingale difference noise, etc.), we have:

• Near Unbiasedness:

$$E[\hat g_k(\hat\theta_k) \mid \hat\theta_k] = g(\hat\theta_k) + O(c_k^2) \text{ a.s.,} \quad \text{where } c_k \to 0$$

• Convergence:

$$\hat\theta_k \to \theta^* \text{ a.s. as } k \to \infty$$

• Asymptotic Normality:

$$k^{\beta/2}(\hat\theta_k - \theta^*) \xrightarrow{\text{dist.}} N(\mu, \Sigma), \quad 0 < \beta \le \tfrac{2}{3}$$

where $\beta$, $\mu$, and $\Sigma$ depend on SA gains, $\Delta_k$ distribution, and shape of $L(\theta)$

Efficiency Analysis

• Can use asymptotic normality to analyze relative efficiency of SPSA and FDSA (Spall, 1992; Sect. 7.4 of ISSO)

• Analogous to SPSA asymptotic normality result, FDSA is also asymptotically normal (Chap. 6 of ISSO)

The critical cost in comparing relative efficiency of SPSA and FDSA is number of loss function measurements y(•), not number of iterations per se

• Loss function measurements represent main cost (by far)—other costs are trivial

• Full efficiency story is fairly complex; see Sect. 7.4 of ISSO and references therein

Efficiency Analysis (cont’d)

• Will compare SPSA and FDSA by looking at relative mean square error (MSE) of $\theta$ estimate

• Consider relative MSE for same no. of measurements, n (not same no. of iterations). Under regularity conditions above:

$$\frac{E\|\hat\theta_{\text{SPSA},n} - \theta^*\|^2}{E\|\hat\theta_{\text{FDSA},n} - \theta^*\|^2} \to \left(\frac{1}{p}\right)^{\beta}, \quad 0 < \beta \le \tfrac{2}{3}, \quad \text{as } n \to \infty \qquad (*)$$

• Equivalently, to achieve same asymptotic MSE

$$\frac{\#\text{ meas. } y(\cdot) \text{ in SPSA}}{\#\text{ meas. } y(\cdot) \text{ in FDSA}} \to \frac{1}{p} \quad \text{as } n \to \infty \qquad (**)$$

• Results (*) and (**) are main theoretical results justifying SPSA

Paraphrase of (**) above:

• SPSA and FDSA converge in same number of iterations despite p-fold savings in cost/iteration for SPSA

- or -

• One properly generated simultaneous random change of all variables in a problem contains as much information for optimization as a full set of one-at-a-time changes of each variable
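A short worked count makes the p-fold saving concrete. This is an added illustration that assumes two-sided FDSA (2p measurements per iteration), the same number of iterations K for both methods, and borrows p = 412 from the wastewater example in Section D of these slides:

```latex
\frac{\#\,\text{meas. in SPSA}}{\#\,\text{meas. in FDSA}}
  = \frac{2K}{2pK} = \frac{1}{p},
\qquad
p = 412:\quad \frac{2}{824} = \frac{1}{412} \approx 0.0024 .
```

Under these assumptions each FDSA iteration costs 824 loss measurements against 2 for SPSA, so a budget that pays for one FDSA iteration pays for 412 SPSA iterations.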

C. PRACTICAL GUIDELINES AND MATLAB CODE

• Practical gain selection (a_k and c_k) discussed in Sect. 7.5, ISSO

• The code below implements SPSA iterations k = 1, 2, ..., n

– Initialization for program variables theta, alpha, etc. not shown since that can be handled in numerous ways (e.g., file read, direct inclusion, input during execution)

– Components of $\Delta_k$ are generated as Bernoulli ±1

– Program calls external function loss to obtain values y(•)

• Simple enhancements possible to increase algorithm stability and/or speed convergence

– Check for simple constraint violation (shown at bottom of sample code)

– Reject iteration $\hat\theta_k \to \hat\theta_{k+1}$ if $y(\hat\theta_{k+1})$ is too much greater than $y(\hat\theta_k)$ (requires extra loss measurement per iteration)

– Reject iteration $\hat\theta_k \to \hat\theta_{k+1}$ if $\|\hat\theta_{k+1} - \hat\theta_k\|$ is too large (does not require extra loss measurement)

Selection of Gain Sequences a_k and c_k

• Effective practical implementation requires “intelligent” selection of coefficients in gain sequences in SA algorithm and SP gradient estimate:

$$a_k = \frac{a}{(k+1+A)^{\alpha}} \quad \text{and} \quad c_k = \frac{c}{(k+1)^{\gamma}}$$

where coefficients a, c, $\alpha$, and $\gamma$ are strictly positive and stability constant A ≥ 0 is same as in Sect. 4.4

• Asymptotically optimal $\alpha$ = 1, $\gamma$ = 1/6 not always best

• “Trial and error” sometimes used for gain selection

• Semi-automatic method (Sect. 7.5): $\alpha$ = 0.602, $\gamma$ = 0.101, c ≈ standard deviation of noise, A ≈ 10% (or less) of total number of iterations, and a chosen such that change in SA estimate does not exceed desired magnitude of change in early iterations

– Choosing a requires sample SP gradient estimates at initial $\theta$
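The semi-automatic recipe above can be sketched as follows. This is a rough illustration under stated assumptions, not the procedure as printed in ISSO: the function names `select_gains` and `gains_at`, the choice of averaging the absolute gradient components over 10 sample SP estimates, and the test loss are all hypothetical. The coefficient a is set so that the first step a_0 times the typical gradient magnitude equals the desired early change.

```python
import random


def select_gains(loss, theta0, n_iter, noise_std, desired_step,
                 n_samples=10, seed=0):
    """Pick gain coefficients following the semi-automatic guidelines."""
    alpha, gamma = 0.602, 0.101
    A = 0.1 * n_iter          # stability constant: 10% of total iterations
    c = noise_std             # c roughly equal to noise standard deviation
    rng = random.Random(seed)
    p = len(theta0)
    total = 0.0
    for _ in range(n_samples):
        # Sample SP gradient estimate at the initial theta.
        delta = [rng.choice((-1.0, 1.0)) for _ in range(p)]
        y_plus = loss([t + c * d for t, d in zip(theta0, delta)])
        y_minus = loss([t - c * d for t, d in zip(theta0, delta)])
        ghat = [(y_plus - y_minus) / (2.0 * c * d) for d in delta]
        total += sum(abs(g) for g in ghat) / p
    mean_abs_g = total / n_samples
    # Choose a so that a_0 * mean|ghat| equals the desired early change.
    a = desired_step * (A + 1.0) ** alpha / mean_abs_g
    return {"a": a, "c": c, "A": A, "alpha": alpha, "gamma": gamma,
            "mean_abs_g": mean_abs_g}


def gains_at(k, g):
    """Return (a_k, c_k) for iteration index k = 0, 1, 2, ..."""
    a_k = g["a"] / (k + 1 + g["A"]) ** g["alpha"]
    c_k = g["c"] / (k + 1) ** g["gamma"]
    return a_k, c_k


# Assumed example: quadratic loss, 1000 iterations, noise std 0.1,
# and a desired early change of 0.1 per element of theta.
g = select_gains(lambda th: sum(t * t for t in th), [1.0, 1.0], 1000, 0.1, 0.1)
a0, c0 = gains_at(0, g)
print(g["a"], a0, c0)
```

With noise-free measurements the rule c ≈ noise standard deviation gives c = 0, so in that case some small positive c would have to be supplied instead.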

MATLAB Code

for k=1:n
  ak=a/(k+A)^alpha;                   % gain a_k
  ck=c/k^gamma;                       % gain c_k
  delta=2*round(rand(p,1))-1;         % Bernoulli +/-1 perturbation vector
  thetaplus=theta+ck*delta;
  thetaminus=theta-ck*delta;
  yplus=loss(thetaplus);              % two loss measurements per iteration
  yminus=loss(thetaminus);
  ghat=(yplus-yminus)./(2*ck*delta);  % SP gradient estimate
  theta=theta-ak*ghat;                % theta update
end
theta

If maximum and minimum values on elements of theta can be specified, say thetamax and thetamin, then two lines can be added below theta update line to impose constraints:

  theta=min(theta,thetamax);
  theta=max(theta,thetamin);
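For readers without MATLAB, the same iteration can be sketched in plain Python. This is a loose port for illustration only: the function name `spsa`, the noisy quadratic test loss, the seeds, and the gain values in the usage lines are assumptions, not values taken from ISSO. The optional bounds play the role of thetamin and thetamax in the listing above.

```python
import random


def spsa(loss, theta, n, a, c, A, alpha, gamma,
         theta_min=None, theta_max=None, seed=0):
    """Basic SPSA with Bernoulli +/-1 perturbations and optional box bounds."""
    rng = random.Random(seed)
    theta = list(theta)
    p = len(theta)
    for k in range(1, n + 1):
        ak = a / (k + A) ** alpha
        ck = c / k ** gamma
        delta = [rng.choice((-1.0, 1.0)) for _ in range(p)]
        yplus = loss([t + ck * d for t, d in zip(theta, delta)])
        yminus = loss([t - ck * d for t, d in zip(theta, delta)])
        ghat = [(yplus - yminus) / (2.0 * ck * d) for d in delta]
        theta = [t - ak * g for t, g in zip(theta, ghat)]
        # Simple constraint handling: clip each element to its bounds.
        if theta_max is not None:
            theta = [min(t, hi) for t, hi in zip(theta, theta_max)]
        if theta_min is not None:
            theta = [max(t, lo) for t, lo in zip(theta, theta_min)]
    return theta


noise_rng = random.Random(1)


def noisy_quadratic(theta):
    # Hypothetical loss: quadratic bowl plus small Gaussian measurement noise.
    return sum(t * t for t in theta) + noise_rng.gauss(0.0, 0.01)


start = [1.0, -1.0, 0.5]
final = spsa(noisy_quadratic, start, n=2000, a=0.1, c=0.1, A=100.0,
             alpha=0.602, gamma=0.101)
bounded = spsa(noisy_quadratic, start, n=200, a=0.1, c=0.1, A=100.0,
               alpha=0.602, gamma=0.101,
               theta_min=[0.2, 0.2, 0.2], theta_max=[2.0, 2.0, 2.0])
print(final, bounded)
```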

D. APPLICATION OF SPSA

• Numerical Study: SPSA vs. FDSA

• Consider problem of developing neural net controller (wastewater treatment plant where objectives are clean water and methane gas production)

• Neural net is function approximator that takes current information about the state of system and produces control action

• $L_k(\theta)$ = tracking error, $\theta$ = neural net weights

• Need to estimate $\theta$ in real-time; used nondecaying a_k = a, c_k = c due to nonstationary dynamics

• p = dim($\theta$) = 412

• More information in Example 7.4 of ISSO

RMS Error for Controller in Wastewater Treatment Model

(Figure not reproduced.)

E. EXTENSIONS AND FURTHER RESULTS

• There are variations and enhancements to “standard” SPSA of Section A

• Section 7.7 of ISSO discusses most of:

(i) Enhanced convergence through gradient averaging/smoothing

(ii) Constrained optimization

(iii) Optimal choice of $\Delta_k$ distribution

(iv) One-measurement form of SPSA

(v) Cyclic methods

(vi) Global optimization

(vii) Noncontinuous (discrete) optimization

(i) Gradient Averaging and Gradient Smoothing

• These approaches may yield improved convergence in some cases

• In gradient averaging $\hat g_k(\hat\theta_k)$ is simply replaced by the average of several (say, q) SP gradient estimates

– This approach uses 2q values of y(•) per iteration

– Spall (1992) establishes theoretical conditions for when this is advantageous, i.e., when lower MSE compensates for greater per-iteration cost (2q vs. 2, q > 1)

– Essentially, beneficial in a high-noise environment (consistent with intuition!)

• In gradient smoothing, gradient estimates averaged across iterations according to scheme that carefully balances past estimates with current estimate

– Analogous to “momentum” in neural net/backpropagation literature
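Gradient averaging can be sketched as below. The helper names, the noisy quadratic loss with unit-variance noise, and q = 8 are assumptions chosen for illustration. Each of the q SP estimates draws its own perturbation vector, so one averaged estimate costs 2q loss measurements.

```python
import random


def sp_gradient(loss, theta, ck, rng):
    """Single two-measurement SP gradient estimate."""
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]
    y_plus = loss([t + ck * d for t, d in zip(theta, delta)])
    y_minus = loss([t - ck * d for t, d in zip(theta, delta)])
    return [(y_plus - y_minus) / (2.0 * ck * d) for d in delta]


def averaged_sp_gradient(loss, theta, ck, q, rng):
    """Average of q independent SP gradient estimates at the same theta."""
    total = [0.0] * len(theta)
    for _ in range(q):
        estimate = sp_gradient(loss, theta, ck, rng)
        total = [s + g for s, g in zip(total, estimate)]
    return [s / q for s in total]


calls = {"count": 0}
noise_rng = random.Random(1)


def noisy_loss(theta):
    # Hypothetical high-noise measurement of a quadratic loss.
    calls["count"] += 1
    return sum(t * t for t in theta) + noise_rng.gauss(0.0, 1.0)


ghat_avg = averaged_sp_gradient(noisy_loss, [1.0, -1.0, 2.0], 0.1, 8,
                                random.Random(0))
print(ghat_avg, calls["count"])
```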

(ii) Constrained Optimization

• Most practical problems involve constraints on $\theta$

• Numerous possible ways to treat constraints (simple constraints discussed in Section C)

• One approach based on projections (exploits well-known Kuhn-Tucker framework)

• Projection approach keeps $\hat\theta_k$ and $\hat\theta_k \pm c_k\Delta_k$ in valid region for all k by projecting $\hat\theta_k$ into a region interior to the valid region

– Desirable in real systems to keep $\hat\theta_k \pm c_k\Delta_k$ (in addition to $\hat\theta_k$) inside valid region to ensure physically achievable solution while iterating

• Penalty functions are general approach that may be easier to use than projections

– However, penalty functions require care for efficient implementation

(iii) Optimal Choice of $\Delta_k$ Distribution

• Sections 7.3 and 7.4 of ISSO discuss sufficient conditions for $\Delta_k$ distribution (see also Sections A and B here)

– These conditions guide user since user typically has full control over distribution

– Uniform and normal distributions do not satisfy conditions

• Asymptotic distribution theory shows that symmetric Bernoulli distribution is asymptotically optimal

– Optimal in both an MSE and nearness-probability sense

– Symmetric Bernoulli is trivial to generate by Monte Carlo

• Symmetric Bernoulli seems optimal in many practical (finite-sample) problems

– One exception mentioned in Section 7.7 of ISSO (robot control problem): segmented uniform distribution

(iv) One-Measurement SPSA

• Standard SPSA uses two loss function measurements/iteration

• One-measurement SPSA based on gradient approximation:

$$\hat g_k(\hat\theta_k) = \begin{bmatrix} \dfrac{y(\hat\theta_k + c_k\Delta_k)}{c_k\Delta_{k1}} \\ \vdots \\ \dfrac{y(\hat\theta_k + c_k\Delta_k)}{c_k\Delta_{kp}} \end{bmatrix}$$

• As with two-measurement SPSA this form is unbiased estimate of $g(\hat\theta_k)$ to within $O(c_k^2)$

• Theory shows standard two-measurement form generally preferable in terms of total measurements needed for effective convergence

– However, in some settings, one-measurement form is preferable

– One such setting: control problems with significant nonstationarities
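The one-measurement form can be checked on a small case by enumerating the Bernoulli outcomes. The sketch below assumes a noise-free linear loss in p = 2 dimensions, the point θ = (1, 2), and c_k = 0.5, all chosen only for illustration. For a linear loss the estimate is exactly unbiased, since the term L(θ)/(c_k Δ_ki) averages to zero over the symmetric perturbations.

```python
from itertools import product


def one_measurement_gradient(y, theta, ck, delta):
    """One-measurement SP gradient estimate for a given perturbation."""
    value = y([t + ck * d for t, d in zip(theta, delta)])
    return [value / (ck * d) for d in delta]


def linear_loss(theta):
    # Hypothetical noise-free loss with gradient (3, -2) everywhere.
    return 3.0 * theta[0] - 2.0 * theta[1] + 5.0


theta = [1.0, 2.0]
ck = 0.5
outcomes = list(product((-1.0, 1.0), repeat=2))  # each with probability 1/4
estimates = [one_measurement_gradient(linear_loss, theta, ck, d)
             for d in outcomes]
mean_estimate = [sum(e[m] for e in estimates) / len(estimates)
                 for m in range(2)]

print("individual estimates:", estimates)
print("mean of estimates:   ", mean_estimate)
```

The individual estimates are spread widely around the gradient even without noise, which is in line with the two-measurement form being generally preferable.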

(v) Cyclic Methods

• Well-known method is cyclic optimization where $\theta$ divided into two or more subvectors: optimize $L(\theta)$ with respect to each subvector while holding other subvectors fixed

– Generalization of Gauss-Seidel method, where $\theta$ sequentially optimized along independent coordinates

• Prior convergence results only known for deterministic setting (Spall, 2012)

• Hernández and Spall (2014, 2016) give generalization to noisy settings: SPSA and stochastic gradient (SG)

• Various applications for stochastic setting. For example:

– Multi-agent control where $L(\theta)$ depends on collection of agent-specific parameters

– Each agent has incomplete info. about environment (= noise) and can only update own contribution to minimization process (Botts et al., 2016)

(v) Cyclic Methods (Cont’d)

• Special case is two subvectors (“seesaw” method)

– Generalization to M > 2 subvectors is straightforward

• Estimate at iteration k in seesaw approach has form

$$\hat\theta_k = \begin{bmatrix} \hat\theta_k^{(1)} \\ \hat\theta_k^{(2)} \end{bmatrix}$$

with $\hat\theta_k^{(1)}$ a function of $\hat\theta_{k-1}$ and $\hat\theta_k^{(2)}$ a function of $\hat\theta_k^{(1)}$ and $\hat\theta_{k-1}^{(2)}$

• Hernández and Spall (2014) give conditions under which SPSA or SG process converges a.s. in seesaw (or M > 1) case

• Hernández and Spall (2016) give corresponding conditions for asymptotic normality

• Above show formally that cyclic methods converge, but that convergence may be slower than standard SPSA or SG

– Slower convergence not surprising given reduced info.

• Conditions very similar to standard convergence and asymptotic normality conditions for SPSA and SG

(vi) Global Optimization

• SPSA has demonstrated significant effectiveness in global optimization where there may be multiple (local) minima

• One approach is to inject Gaussian noise to right-hand side of standard SPSA recursion:

$$\hat\theta_{k+1} = \hat\theta_k - a_k\,\hat g_k(\hat\theta_k) + b_k w_k \qquad (*)$$

where $b_k \to 0$ and $w_k \sim N(0, I_{p \times p})$

• Injected noise w_k generated by Monte Carlo

• Eqn. (*) has theoretical basis for formal convergence (Section 8.4 of ISSO)
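One step of recursion (*) can be sketched as follows. The function name `global_spsa_step` is hypothetical, and the sketch takes the gradient estimate and the gains a_k and b_k as given; how b_k should decay is specified in Section 8.4 of ISSO and is not reproduced here.

```python
import random


def global_spsa_step(theta, ghat, ak, bk, rng):
    """One update theta - ak * ghat + bk * w with w drawn from N(0, I)."""
    return [t - ak * g + bk * rng.gauss(0.0, 1.0)
            for t, g in zip(theta, ghat)]


rng = random.Random(0)
theta = [1.0, 2.0]
ghat = [0.5, -1.0]  # assumed SP gradient estimate at theta

# With bk = 0 the step reduces to the standard SPSA update.
plain = global_spsa_step(theta, ghat, 0.5, 0.0, rng)
# With bk > 0 the same update is perturbed by injected Gaussian noise.
noisy = global_spsa_step(theta, ghat, 0.5, 0.1, rng)
print(plain, noisy)
```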

(vi) Global Optimization (Cont’d)

• Recent results show that b_k = 0 is sufficient for global convergence in many cases (Section 8.4 of ISSO); more detail in Maryak and Chin (2008), IEEE Trans. Auto. Cont.

– No injected noise needed for global convergence

– Implies standard SPSA is global optimizer under appropriate conditions

• Numerical demos on some tough global problems with many local minima yield global solution

– Neither genetic algorithms nor simulated annealing able to find global minima in test suite

– No guarantee of analogous relative behavior on other problems

(vii) Noncontinuous (Discrete) Optimization

• Basic SPSA framework for $L(\theta)$ differentiable in $\theta$

• Many important problems have elements in $\theta$ taking only discrete (e.g., integer) values

• There have been extensions to SPSA to allow for discrete $\theta$

– See references at SPSA Web site (Hill, Gerencser, Vago, Q. Wang, etc.)

• SP estimate $\hat g_k(\hat\theta_k)$ produces descent information although gradient not formally defined

• Key issue in implementation is to control iterations $\hat\theta_k$ and perturbations $\hat\theta_k \pm c_k\Delta_k$ to ensure they are valid $\theta$ values

(vii) One Approach to Discrete Problems: Discrete SPSA (DSPSA)

• DSPSA is modification of SPSA; used for discrete stochastic optimization problems (Wang and Spall, 2011 and 2013)

• In each iteration, an analogue to a gradient is calculated by using loss function measurements at two multivariate integer points in a randomly picked direction

• Algorithm has standard recursive form with analogue to usual gradient approximation

• Assume domain is $\mathbb{Z}^p$ (p-fold integers); analogue is:

$$\hat g_k(\hat\theta_k) = \left[ y\!\left(\pi(\hat\theta_k) + \tfrac{1}{2}\Delta_k\right) - y\!\left(\pi(\hat\theta_k) - \tfrac{1}{2}\Delta_k\right) \right] \Delta_k^{-1},$$

where $\pi(\hat\theta_k) = \lfloor \hat\theta_k \rfloor + \tfrac{1}{2}\mathbf{1}_p$ and $\Delta_k^{-1} = [\Delta_{k1}^{-1}, \ldots, \Delta_{kp}^{-1}]^T$
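The DSPSA gradient analogue can be sketched as below. The function name `dspsa_gradient` and the integer test loss are assumptions for illustration. The sketch applies the floor operator elementwise, shifts to the cell midpoint, and then steps half a unit in each direction, so both measurement points are integer vectors even when the current estimate is not.

```python
import math
import random


def dspsa_gradient(y, theta, rng):
    """Gradient analogue from two loss measurements at integer points."""
    # pi(theta): midpoint of the unit cell that contains theta.
    mid = [math.floor(t) + 0.5 for t in theta]
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]
    upper = [int(m + 0.5 * d) for m, d in zip(mid, delta)]
    lower = [int(m - 0.5 * d) for m, d in zip(mid, delta)]
    diff = y(upper) - y(lower)
    # Multiplying by Delta^{-1} is elementwise division by Delta_ki.
    return [diff / d for d in delta], upper, lower


def integer_loss(point):
    # Hypothetical noise-free loss defined on integer points.
    return sum(v * v for v in point)


ghat, upper, lower = dspsa_gradient(integer_loss, [2.3, -1.7],
                                    random.Random(0))
ghat_1d, _, _ = dspsa_gradient(integer_loss, [2.3], random.Random(0))
print(ghat, upper, lower, ghat_1d)
```

In the one-dimensional call the two measurement points are 2 and 3 whichever sign is drawn, so the estimate equals the difference 9 − 4 = 5.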

(vii) Comments on DSPSA Algorithm

• Initial guess $\hat\theta_0$ and sequence $\{\hat\theta_k\}$ do not need to be multivariate integer points

– But y values are only collected at valid discrete points via the floor operator (previous slide)

• Under some general conditions (including some conditions required for basic SPSA), sequence generated by DSPSA converges to optimal solution

• Ongoing work on rate of convergence analysis is based on evaluating rate at which $P(\hat\theta_k = \theta^*) \to 1$

– Useful for comparing DSPSA with other methods in some formal sense

– Comparison to stochastic ruler, certain types of random search, etc.; methods need to formally handle noisy loss values

• Recent work integrates DSPSA and SPSA to handle $\theta$ with mixed discrete and continuous parameters (Wang et al., 2018)

F. ADAPTIVE SIMULTANEOUS PERTURBATION METHOD

• Standard SPSA exhibits common “1st-order” behavior

– Sharp initial decline

– Slow convergence in final phase

– Sensitivity to units/scaling for elements of $\theta$

• “2nd-order” form of SPSA exists for speeding convergence, especially in final phase (analogous to Newton-Raphson)

– Adaptive simultaneous perturbation (ASP) method (details in Section 7.8 of ISSO)

• ASP based on adaptively estimating Hessian matrix

$$H(\theta) = \frac{\partial^2 L}{\partial\theta\,\partial\theta^T}$$

• Addresses long-standing problem of finding “easy” method for Hessian estimation

• Also has uses in nonoptimization applications (e.g., Fisher information matrix in Subsection 13.3.5 of ISSO)

Overview of ASP

• ASP applies in either

(i) Standard SPSA setting where only $L(\theta)$ measurements are available (as considered earlier) (“2SPSA” algorithm)

- or -

(ii) Stochastic gradient (SG) setting where $L(\theta)$ and $g(\theta)$ measurements are available (“2SG” algorithm)

• Advantages of 2nd-order approach

– Potential for speedier convergence

– Transform invariance (algorithm performance unaffected by relative magnitude of $\theta$ elements)

• Transform invariance is unique to 2nd-order algorithms

– Allows for arbitrary scaling of $\theta$ elements

– Implies ASP automatically adjusts to chosen units for $\theta$

Cost of Implementation

• For any p, the cost per iteration of ASP is

– Four loss measurements for 2SPSA, or

– Three gradient measurements for 2SG

• Above costs for ASP compare very favorably with previous methods:

– $O(p^2)$ loss measurements y(•) per iteration in FDSA setting (e.g., Fabian, 1971)

– $O(p)$ gradient measurements per iteration in SG setting (e.g., Ruppert, 1985)

• If gradient/Hessian averaging or y(•)-based iterate blocking is

Efficiency Analysis for ASP

• Can use asymptotic normality of 2SPSA and 2SG to compare asymptotic RMS errors (as in basic SPSA) against best possible asymptotic RMS of SPSA and SG, say $\text{RMS}^*_{\text{SPSA}}$ and $\text{RMS}^*_{\text{SG}}$

• 2SPSA: With a_k = 1/k and c_k = c/k^{1/6} (k ≥ 1)

$$\frac{\text{RMS of 2SPSA}}{\text{RMS}^*_{\text{SPSA}}} < 2 \quad \text{as } c \to 0$$

• 2SG: With a_k = 1/k and any valid c_k

$$\frac{\text{RMS of 2SG}}{\text{RMS}^*_{\text{SG}}} = 1$$

• Interpretation: 2SPSA (with a_k = 1/k) does almost as well as unobtainable best SPSA; RMS error differs by < factor of 2

• 2SG (with a_k = 1/k) does as well as the analytically optimal SG (rarely available)

Concluding Remarks

• SPSA widely used for its power in solving difficult problems

– Especially appropriate for high-dimensional problems and noisy measurements

• Many comparisons with other methods (GAs, simulated annealing, etc.) in literature

– Not surprisingly, studies show varying relative results depending on problems and algorithm “tuning” (recall NFL!)

– SPSA designed explicitly for noisy measurements, unlike most other methods, including “vanilla” GAs, simulated annealing, random search, etc.

• Ongoing research continues to extend the range of applications and/or further enhance efficiency

Partial List of References

• Botts, C. H., Spall, J. C., and Newman, A. J. (2016), “Multi-Agent Surveillance and Tracking Using Cyclic Stochastic Gradient,” Proceedings of the American Control Conference, Boston, MA, 6–8 July 2016, pp. 270–275. http://dx.doi.org/10.1109/ACC.2016.7524927

• Hernández, K. and Spall, J. C. (2014), “Cyclic Stochastic Optimization with Noisy Function Measurements,” Proceedings of the American Control Conference, 4–6 June 2014, Portland, OR, pp. 5204–5209. http://dx.doi.org/10.1109/ACC.2014.6859444

• Hernández, K. and Spall, J. C. (2016), “Asymptotic Normality and Efficiency Analysis of the Cyclic Seesaw Stochastic Optimization Algorithm,” Proceedings of the American Control Conference, Boston, MA, 6–8 July 2016, pp. 7255–7260. http://dx.doi.org/10.1109/ACC.2016.7526818

• Maryak, J. L., and Chin, D. C. (2008), “Global Random Optimization by Simultaneous Perturbation Stochastic Approximation,” IEEE Transactions on Automatic Control, vol. 53, pp. 780–783.

• Spall, J. C. (1992), “Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation,” IEEE Transactions on Automatic Control, vol. 37(3), pp. 332–341. http://dx.doi.org/10.1109/9.119632

• Spall, J. C. (1997), “A One-Measurement Form of Simultaneous Perturbation Stochastic Approximation,” Automatica, vol. 33, pp. 109–112.

• Spall, J. C. (2000), “Adaptive Stochastic Approximation by the Simultaneous Perturbation Method,” IEEE Transactions on Automatic Control, vol. 45, pp. 1839–1853. http://dx.doi.org/10.1109/TAC.2000.880982

• Spall, J. C. (2009), “Feedback and Weighting Mechanisms for Improving Jacobian Estimates in the Adaptive Simultaneous Perturbation Algorithm,” IEEE Transactions on Automatic Control, vol. 54(6), pp. 1216–1229. http://dx.doi.org/10.1109/TAC.2009.2019793

• Spall, J. C. (2012), “Cyclic Seesaw Process for Optimization and Identification,” Journal of Optimization Theory and Applications, vol. 154(1), pp. 187–208. http://dx.doi.org/10.1007/s10957-012-0001-1

• Wang, L., Zhu, J., and Spall, J. C. (2018), “Mixed Simultaneous Perturbation Stochastic Approximation for Gradient-Free Optimization with Noisy Measurements,” Proceedings of the American Control Conference,