CHAPTER 6 STOCHASTIC APPROXIMATION AND THE FINITE-DIFFERENCE METHOD

• Organization of chapter in ISSO
– Contrast of gradient-based and gradient-free algorithms
– Motivating examples
– Finite-difference algorithm
– Convergence theory
– Asymptotic normality
– Selection of gain sequences
– Numerical examples
– Extensions and segue to SPSA in Chapter 7

Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall

Motivation for Algorithms Not Requiring Gradient of Loss Function

• Primary interest here is in optimization problems for which we cannot obtain direct measurements of $\partial L / \partial \theta$
⇒ cannot use techniques such as Robbins-Monro SA, steepest descent, etc.
⇒ can (in principle) use techniques such as Kiefer and Wolfowitz SA (Chapter 6), genetic algorithms (Chapters 9–10),…
• Many such “gradient-free” problems arise in practice
– Generic difficult parameter estimation

Model-Free Control Setup (Example 6.2 in ISSO)

• As usual, want to minimize $L(\theta)$ in presence of noisy measurements of $L(\theta)$: $y(\theta) = L(\theta) + \text{noise}$
• Here, noisy measurements $y(\theta) = Q(\theta, V)$ represent simulation output
• Want to optimize parameters $\theta$ in simulation, where $\theta$ also has direct physical meaning in real system
– Run simulation to determine best $\theta$ for use in real system
• Cannot easily use stochastic gradient methods due to inability to calculate $\partial Q / \partial \theta$ ⇒ need gradient-free method
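The measurement model above can be illustrated with a toy stand-in for the simulation. This is only a sketch: the function `simulate_Q`, the quadratic loss inside it, and the noise level are hypothetical and are not taken from ISSO. The point is that the optimizer sees only noisy outputs $y(\theta) = Q(\theta, V)$, never the gradient.

```python
import random

def simulate_Q(theta, rng):
    """Hypothetical simulation output Q(theta, V): loss plus random effect V."""
    true_loss = sum(t * t for t in theta)  # assumed L(theta), illustration only
    v = rng.gauss(0.0, 0.1)                # V: randomness inside the simulation
    return true_loss + v

rng = random.Random(0)
theta = [1.0, -0.5]

# Each call returns one noisy measurement y(theta); averaging many calls
# approaches L(theta) = 1.25 for this toy loss.
samples = [simulate_Q(theta, rng) for _ in range(2000)]
mean_y = sum(samples) / len(samples)
```

Averaging is used here only to show that the noise has mean zero in this toy model; an SA method works directly with the individual noisy values.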

Simulation-Based Optimization (Example 6.3 in ISSO)

[Figure: diagram relating the SA method, the Monte Carlo simulation, its inputs $\theta$, and $y(\theta)$]

Finite Difference SA (FDSA) Method

• FDSA has standard “first-order” form of root-finding (Robbins-Monro) SA

– Finite difference approximation replaces direct gradient measurement (Chap. 5)

– Resulting algorithm sometimes called Kiefer-Wolfowitz SA

• Let $\hat{g}_k(\hat{\theta}_k)$ denote FD estimate of $g(\theta)$ at kth iteration (next slide)
• Let $\hat{\theta}_k$ denote estimate for $\theta$ at kth iteration
• FDSA algorithm has form

$$
\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \hat{g}_k(\hat{\theta}_k)
$$

where $a_k$ is nonnegative gain value
• Under conditions, $\hat{\theta}_k \to \theta^*$ in stochastic sense (a.s.)

Finite Difference Gradient Approximation

• Classical method for approximating gradients in Kiefer-Wolfowitz SA is by finite differences

• FD gradient approximation used in SA recursion as gradient measurement (previous slide)

• Standard two-sided gradient approximation at iteration k is

$$
\hat{g}_k(\hat{\theta}_k) =
\begin{bmatrix}
\dfrac{y(\hat{\theta}_k + c_k \xi_1) - y(\hat{\theta}_k - c_k \xi_1)}{2 c_k} \\
\vdots \\
\dfrac{y(\hat{\theta}_k + c_k \xi_p) - y(\hat{\theta}_k - c_k \xi_p)}{2 c_k}
\end{bmatrix}
$$

where $\xi_j$ is p-dimensional with 1 in jth entry, 0 elsewhere
• Each computation of FD approximation takes 2p measurements of y
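The two-sided approximation and the FDSA recursion can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in ISSO: the names `fd_gradient`, `fdsa`, and `noisy_loss`, the quadratic test loss, and the gain coefficients are all assumptions chosen for the demonstration.

```python
import random

def fd_gradient(y, theta, c_k):
    """Two-sided FD gradient estimate; makes 2p calls to y for p = len(theta)."""
    g_hat = []
    for j in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[j] += c_k    # theta + c_k * (jth unit vector)
        minus[j] -= c_k   # theta - c_k * (jth unit vector)
        g_hat.append((y(plus) - y(minus)) / (2.0 * c_k))
    return g_hat

def fdsa(y, theta0, a_k, c_k, n_iter):
    """FDSA recursion: theta <- theta - a_k * g_hat, with gains given as functions of k."""
    theta = list(theta0)
    for k in range(n_iter):
        g_hat = fd_gradient(y, theta, c_k(k))
        theta = [t - a_k(k) * g for t, g in zip(theta, g_hat)]
    return theta

rng = random.Random(1)

def noisy_loss(theta):
    """Hypothetical noisy measurement: quadratic loss plus Gaussian noise."""
    return sum(t * t for t in theta) + rng.gauss(0.0, 0.01)

theta_hat = fdsa(
    noisy_loss,
    [1.0, 1.0],
    a_k=lambda k: 0.5 / (k + 1),             # assumed gain, decay rate 1
    c_k=lambda k: 0.1 / (k + 1) ** (1 / 6),  # assumed gain, decay rate 1/6
    n_iter=500,
)
```

With p = 2, each iteration costs 4 loss measurements, so 500 iterations use 2000 measurements in total.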

Selection of Gain Sequences $a_k$ and $c_k$

• Effective practical implementation requires “intelligent” selection of coefficients in gain sequences in SA algorithm and FD gradient estimate:

$$
a_k = \frac{a}{(k + 1 + A)^{\alpha}} \quad \text{and} \quad c_k = \frac{c}{(k + 1)^{\gamma}}
$$

where coefficients a, c, $\alpha$, and $\gamma$ are strictly positive and stability constant $A \ge 0$ is same as in Sect. 4.4
• Asymptotically optimal $\alpha = 1$, $\gamma = 1/6$ not always best
• “Trial and error” sometimes used for gain selection
• Semi-automatic method (Sect. 6.6): $\alpha = 0.602$, $\gamma = 0.101$, $c \approx$ standard deviation of noise, $A \approx$ 10% (or less) of total number of iterations, and a chosen such that change in SA estimate does not exceed desired magnitude of change in early iterations
– Choosing a requires sample gradient estimates at initial $\theta$
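The gain forms on this slide translate directly into code. The sketch below is hedged: `make_gains` and `choose_a` are hypothetical helper names, and `choose_a` reflects one plausible reading of "a chosen such that change in SA estimate does not exceed desired magnitude", namely solving $a_0 \cdot |\hat{g}_0| = \text{desired change}$ for a. The numeric inputs are placeholders.

```python
def make_gains(a, c, A, alpha=0.602, gamma=0.101):
    """Return a_k(k) = a/(k+1+A)^alpha and c_k(k) = c/(k+1)^gamma for k = 0, 1, ..."""
    def a_k(k):
        return a / (k + 1 + A) ** alpha

    def c_k(k):
        return c / (k + 1) ** gamma

    return a_k, c_k

def choose_a(desired_change, g0_magnitude, A, alpha=0.602):
    """Assumed rule: pick a so that the first step a_0 * |g_0| equals desired_change."""
    return desired_change * (1 + A) ** alpha / g0_magnitude

n_iter = 1000
A = 0.10 * n_iter     # about 10% of the total number of iterations
c = 0.1               # placeholder for the noise standard deviation
g0_magnitude = 2.0    # placeholder for a typical |g_hat| at the initial theta

a = choose_a(desired_change=0.1, g0_magnitude=g0_magnitude, A=A)
a_k, c_k = make_gains(a, c, A)
```

In practice `g0_magnitude` would come from averaging several FD gradient estimates at the initial $\theta$, which is why the slide notes that choosing a requires sample gradient estimates.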

Example: Wastewater Treatment Problem (Example 6.5 in ISSO)

• Small-scale problem with p = 2

– Aim is to optimize water cleanliness and methane gas byproduct

– Evaluated algorithms with 50 realizations of N = 2000 measurements
• Used FDSA with gains $a_k = a/(1 + k)$ and $c_k = 1/(1 + k)^{1/6}$
– Asymptotically optimal decay rates found “best”

• Gain tuning chooses a; naïve gain sets a = 1

• Also compared with random search algorithm B from Chapter 2

• Algorithms use noisy loss measurements (same noise

Mean values of $L(\hat{\theta}_k)$ with 95% Confidence Intervals

                         FDSA with “naïve” gains    FDSA with tuned gains
N = 100 (25 iters.)      0.11 [0.087, 0.140]        0.083 [0.057, 0.108]
N = 2000 (500 iters.)    0.023 [0.017, 0.028]       0.021 [0.016, 0.026]

• Above numbers much lower than random search algorithm B: best value at N = 2000 is 0.38
• Shows value of approximating gradient in FDSA

Example: Skewed-Quartic Loss Function (Examples 6.6 and 6.7 in ISSO)

• Larger-scale problem with p = 10:

$$
L(\theta) = \theta^T B^T B \theta + 0.1 \sum_{i=1}^{p} (B\theta)_i^3 + 0.01 \sum_{i=1}^{p} (B\theta)_i^4
$$

where $(B\theta)_i$ is the ith component of $B\theta$, and $pB$ is an upper triangular matrix of ones
• Used N = 1000 measurements; 50 replicates
• Used FDSA with gains $a_k = a/(1 + k + A)^{\alpha}$ and $c_k = c/(1 + k)^{\gamma}$
• “Semi-automatic” and manual gain tuning
• Also compared with random search algorithm B
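The skewed-quartic loss is simple to evaluate without forming B explicitly. Because $pB$ is upper triangular with ones, the ith component of $B\theta$ is the sum of $\theta_i, \ldots, \theta_p$ divided by p. The function name `skewed_quartic_loss` below is an illustrative choice, and the sketch covers only the noise-free loss.

```python
def skewed_quartic_loss(theta):
    """L(theta) = theta'B'B theta + 0.1*sum((B theta)_i^3) + 0.01*sum((B theta)_i^4)."""
    p = len(theta)
    # (B theta)_i = (1/p) * (theta_i + ... + theta_p), since p*B is upper triangular ones
    b_theta = [sum(theta[i:]) / p for i in range(p)]
    quadratic = sum(v ** 2 for v in b_theta)       # theta' B' B theta = ||B theta||^2
    cubic = 0.1 * sum(v ** 3 for v in b_theta)
    quartic = 0.01 * sum(v ** 4 for v in b_theta)
    return quadratic + cubic + quartic

loss_at_zero = skewed_quartic_loss([0.0] * 10)
loss_at_ones = skewed_quartic_loss([1.0] * 10)
```

Each component contributes $v^2 (1 + 0.1 v + 0.01 v^2)$ with $v = (B\theta)_i$, and the quadratic factor has no real roots, so the loss is nonnegative and is zero only at $\theta = 0$.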

Algorithm Comparison with Skewed-Quartic Loss Function (p = 10) (Example 6.6 in ISSO)

Example with Skewed-Quartic Loss: Mean Terminal Values and 95% Confidence Intervals for $\|\hat{\theta}_k - \theta^*\| / \|\hat{\theta}_0 - \theta^*\|$

FDSA: semi-automatic gains    FDSA: manually tuned gains    Random search B
0.427 [0.411, 0.443]          0.531 [0.502, 0.561]          1.285 [1.190, 1.378]

• FDSA semi-automatic is best with respect to $\theta$ error
• Random search algorithm B produces solution further from $\theta^*$ than initial condition!
– But loss value is better than initial condition
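A normalized error and its interval over replicates could be computed along the following lines. This is a sketch under stated assumptions: the slides do not say how the intervals were formed, so the normal approximation with z = 1.96 is an assumption, the helper names are hypothetical, and the replicate values are made-up placeholders rather than results from ISSO.

```python
import math
import statistics

def normalized_error(theta_hat, theta0, theta_star):
    """||theta_hat - theta*|| / ||theta0 - theta*|| using the Euclidean norm."""
    return math.dist(theta_hat, theta_star) / math.dist(theta0, theta_star)

def mean_with_ci(values, z=1.96):
    """Sample mean with an approximate 95% interval (normal approximation assumed)."""
    m = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return m, (m - half_width, m + half_width)

# Placeholder terminal errors from three hypothetical replicates.
errors = [
    normalized_error([0.5, 0.0], [1.0, 0.0], [0.0, 0.0]),
    normalized_error([0.0, 0.4], [1.0, 0.0], [0.0, 0.0]),
    normalized_error([0.3, 0.0], [1.0, 0.0], [0.0, 0.0]),
]
mean_err, (lower, upper) = mean_with_ci(errors)
```

A value above 1 for this ratio means the terminal estimate is further from $\theta^*$ than the initial condition, which is the situation reported for random search algorithm B.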