CHAPTER 6
STOCHASTIC APPROXIMATION AND THE FINITE-DIFFERENCE METHOD
• Organization of chapter in ISSO
– Contrast of gradient-based and gradient-free algorithms
– Motivating examples
– Finite-difference algorithm
– Convergence theory
– Asymptotic normality
– Selection of gain sequences
– Numerical examples
– Extensions and segue to SPSA in Chapter 7
Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall
Motivation for Algorithms Not Requiring Gradient of Loss Function
• Primary interest here is in optimization problems for which we cannot obtain direct measurements of $\partial L/\partial\theta$
– $\Rightarrow$ cannot use techniques such as Robbins-Monro SA, steepest descent, etc.
– $\Rightarrow$ can (in principle) use techniques such as Kiefer and Wolfowitz SA (Chapter 6), genetic algorithms (Chapters 9–10),…
• Many such “gradient-free” problems arise in practice
– Generic difficult parameter estimation
Model-Free Control Setup (Example 6.2 in ISSO)
• As usual, want to minimize $L(\theta)$ in presence of noisy measurements of $L(\theta)$: $y(\theta) = L(\theta) + \text{noise}$
• Here, noisy measurements $y(\theta) = Q(\theta, V)$ represent simulation output
• Want to optimize parameters $\theta$ in simulation, where $\theta$ also has direct physical meaning in real system
– Run simulation to determine best $\theta$ for use in real system
• Cannot easily use stochastic gradient methods due to inability to calculate $\partial Q/\partial\theta$ $\Rightarrow$ need gradient-free method
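The measurement model above can be mimicked with a toy simulation. The sketch below is illustrative only: the simulation `Q`, the distribution of `V`, and all names are assumptions made for this example and are not taken from ISSO. It shows that each run returns a noisy value y(θ) = Q(θ, V) whose Monte Carlo average approaches L(θ) = E[Q(θ, V)].

```python
import random

def Q(theta, v):
    """Hypothetical simulation output for input theta and random effect v."""
    return (theta - v) ** 2

def y(theta, rng):
    """One noisy loss measurement: run the simulation once with a fresh V."""
    v = rng.gauss(1.0, 1.0)  # assumed V ~ N(1, 1)
    return Q(theta, v)

def L(theta):
    """Exact loss for this toy case: E[(theta - V)^2] = (theta - 1)^2 + 1."""
    return (theta - 1.0) ** 2 + 1.0

rng = random.Random(0)  # fixed seed so the run is reproducible
theta = 0.0
runs = [y(theta, rng) for _ in range(20000)]
mean_y = sum(runs) / len(runs)  # Monte Carlo average of y(theta)
print(mean_y, L(theta))
```

In this toy case the gradient of `Q` is easy to write down; the point of the setup in the slides is that for a realistic simulation it usually is not, so only values of `y` are available.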
Simulation-Based Optimization (Example 6.3 in ISSO)
[Figure: block diagram linking "SA method" and "Monte Carlo simulation"; labels: simulation inputs, output $y(\theta)$]
Finite Difference SA (FDSA) Method
• FDSA has standard “first-order” form of root-finding (Robbins-Monro) SA
– Finite difference approximation replaces direct gradient measurement (Chap. 5)
– Resulting algorithm sometimes called Kiefer-Wolfowitz SA
• Let $\hat{g}_k(\hat{\theta}_k)$ denote FD estimate of $g(\theta)$ at kth iteration (next slide)
• Let $\hat{\theta}_k$ denote estimate for $\theta$ at kth iteration
• FDSA algorithm has form
$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \hat{g}_k(\hat{\theta}_k)$$
where $a_k$ is nonnegative gain value
• Under conditions, $\hat{\theta}_k \to \theta^*$ in stochastic sense (a.s.)
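A minimal sketch of the FDSA recursion follows, assuming a two-sided finite-difference gradient estimate and simple decaying gains. The function name `fdsa`, the default gain values, and the noisy quadratic test loss are assumptions chosen for illustration, not the settings used in ISSO.

```python
import random

def fdsa(y, theta0, n_iter, a=0.5, c=0.1):
    """Run n_iter FDSA steps using only noisy loss measurements y(theta)."""
    theta = list(theta0)
    p = len(theta)
    for k in range(n_iter):
        a_k = a / (k + 1)                 # assumed step-size gain
        c_k = c / (k + 1) ** (1.0 / 6.0)  # assumed difference interval
        g_hat = []
        for j in range(p):
            plus, minus = list(theta), list(theta)
            plus[j] += c_k    # perturb jth component upward
            minus[j] -= c_k   # perturb jth component downward
            g_hat.append((y(plus) - y(minus)) / (2.0 * c_k))
        # theta_{k+1} = theta_k - a_k * g_hat_k(theta_k)
        theta = [t - a_k * g for t, g in zip(theta, g_hat)]
    return theta

rng = random.Random(0)

def noisy_quadratic(theta):
    """Hypothetical loss L(theta) = sum(theta_i^2) plus Gaussian noise."""
    return sum(t * t for t in theta) + rng.gauss(0.0, 0.01)

theta_hat = fdsa(noisy_quadratic, [1.0, -1.0], n_iter=500)
print(theta_hat)  # should be close to the minimizer [0, 0]
```

Each iteration spends 2p calls to `y`, so the 500 iterations here use 2000 loss measurements for p = 2.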
Finite Difference Gradient Approximation
• Classical method for approximating gradients in Kiefer-Wolfowitz SA is by finite differences
• FD gradient approximation used in SA recursion as gradient measurement (previous slide)
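• Standard two-sided gradient approximation at iteration k is
$$\hat{g}_k(\hat{\theta}_k) = \begin{bmatrix} \dfrac{y(\hat{\theta}_k + c_k \xi_1) - y(\hat{\theta}_k - c_k \xi_1)}{2c_k} \\ \vdots \\ \dfrac{y(\hat{\theta}_k + c_k \xi_p) - y(\hat{\theta}_k - c_k \xi_p)}{2c_k} \end{bmatrix}$$
where $\xi_j$ is p-dimensional with 1 in jth entry, 0 elsewhere
• Each computation of FD approximation takes 2p measurements $y(\cdot)$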
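The two-sided approximation can be sketched in a few lines. The function name `fd_gradient` and the noise-free quadratic used to check it are assumptions for illustration; in practice `y` would return noisy loss measurements.

```python
def fd_gradient(y, theta, c_k):
    """Two-sided finite-difference gradient estimate from 2p calls to y."""
    g_hat = []
    for j in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[j] += c_k    # theta + c_k * xi_j
        minus[j] -= c_k   # theta - c_k * xi_j
        g_hat.append((y(plus) - y(minus)) / (2.0 * c_k))
    return g_hat

def quadratic(theta):
    """Hypothetical noise-free loss with known gradient 2 * theta."""
    return sum(t * t for t in theta)

g = fd_gradient(quadratic, [1.0, -2.0, 0.5], c_k=0.1)
print(g)  # close to the true gradient [2, -4, 1]
```

For a quadratic loss the two-sided difference has no bias, which makes it a convenient check; for general losses the bias shrinks as `c_k` decreases, while the effect of measurement noise grows.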
Selection of Gain Sequences $a_k$ and $c_k$
• Effective practical implementation requires “intelligent” selection of coefficients in gain sequences in SA algorithm and FD gradient estimate:
$$a_k = \frac{a}{(k+1+A)^{\alpha}} \quad \text{and} \quad c_k = \frac{c}{(k+1)^{\gamma}}$$
where coefficients $a$, $c$, $\alpha$, and $\gamma$ are strictly positive and stability constant $A \ge 0$ is same as in Sect. 4.4
• Asymptotically optimal $\alpha = 1$, $\gamma = 1/6$ not always best
• “Trial and error” sometimes used for gain selection
• Semi-automatic method (Sect. 6.6): $\alpha = 0.602$, $\gamma = 0.101$, $c \approx$ standard deviation of noise, $A \approx$ 10% (or less) of total number of iterations, and $a$ chosen such that change in SA estimate does not exceed desired magnitude of change in early iterations
‒ Choosing $a$ requires sample gradient estimates at initial $\hat{\theta}_0$
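The gain forms and the semi-automatic guidelines can be sketched as follows. The helper `choose_a` is one possible reading of the rule for `a` (scale the first step to a desired size given a typical gradient magnitude), and the numbers in the usage lines are hypothetical.

```python
def gain_sequences(a, c, A, alpha=0.602, gamma=0.101):
    """Return functions k -> a_k and k -> c_k for the standard gain forms."""
    def a_k(k):
        return a / (k + 1 + A) ** alpha
    def c_k(k):
        return c / (k + 1) ** gamma
    return a_k, c_k

def choose_a(desired_change, grad_magnitude, A, alpha=0.602):
    """Pick a so the first step a_0 * |g_hat| equals desired_change (assumed rule)."""
    return desired_change * (A + 1) ** alpha / grad_magnitude

n_iter = 1000                 # hypothetical iteration budget
A = 0.1 * n_iter              # about 10% of the total number of iterations
c = 0.1                       # hypothetical noise standard deviation
grad_magnitude = 2.0          # hypothetical size of sample gradient estimates
desired_change = 0.1          # hypothetical allowed early change in theta
a = choose_a(desired_change, grad_magnitude, A)
a_k, c_k = gain_sequences(a, c, A)
print(a_k(0) * grad_magnitude)  # size of the first step
```

The stability constant `A` keeps the early steps from being too large while still allowing a fairly large `a`, so later steps do not shrink too quickly.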
Example: Wastewater Treatment Problem (Example 6.5 in ISSO)
• Small-scale problem with p = 2
– Aim is to optimize water cleanliness and methane gas byproduct
– Evaluated algorithms with 50 realizations of N = 2000 measurements
• Used FDSA with gains $a_k = a/(1+k)$ and $c_k = 1/(1+k)^{1/6}$
– Asymptotically optimal decay rates found “best”
• Gain tuning chooses $a$; naïve gain sets $a = 1$
• Also compared with random search algorithm B from Chapter 2
• Algorithms use noisy loss measurements (same noise
Mean values of $L(\hat{\theta}_k)$ with 95% Confidence Intervals
                        FDSA with “naïve” gains   FDSA with tuned gains
N = 100 (25 iters.)     0.11 [0.087, 0.140]       0.083 [0.057, 0.108]
N = 2000 (500 iters.)   0.023 [0.017, 0.028]      0.021 [0.016, 0.026]
Above numbers much lower than random search
algorithm B: best value at N = 2000 is 0.38
Shows value of approximating gradient in FDSA
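The iteration counts in the table follow from the cost of two-sided FDSA, which uses 2p loss measurements per iteration. A small sketch of that bookkeeping (the function name is an assumption for illustration):

```python
def fdsa_iterations(n_measurements, p):
    """Number of full FDSA iterations affordable with a measurement budget."""
    return n_measurements // (2 * p)  # each iteration costs 2p measurements

print(fdsa_iterations(100, 2))    # budget N = 100 with p = 2
print(fdsa_iterations(2000, 2))   # budget N = 2000 with p = 2
```

By the same count, a budget of N = 1000 measurements in a p = 10 problem allows only 50 FDSA iterations, which is why comparisons are made at equal numbers of measurements and not at equal numbers of iterations.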
Example: Skewed-Quartic Loss Function (Examples 6.6 and 6.7 in ISSO)
• Larger-scale problem with p = 10:
$$L(\theta) = \theta^T B^T B \theta + 0.1 \sum_{i=1}^{p} (B\theta)_i^3 + 0.01 \sum_{i=1}^{p} (B\theta)_i^4$$
where $(\cdot)_i$ is the ith component of the argument vector $B\theta$, and $pB$ is an upper triangular matrix of ones
• Used N = 1000 measurements; 50 replicates
• Used FDSA with gains $a_k = a/(1+k+A)^{\alpha}$ and $c_k = c/(1+k)^{\gamma}$
• “Semi-automatic” and manual gain tuning
• Also compared with random search algorithm B
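A noise-free version of the skewed-quartic loss can be sketched directly from its definition. The function name and the test points are assumptions for illustration; because pB is an upper triangular matrix of ones, the ith component of Bθ is (1/p) times the sum of θ_i through θ_p.

```python
def skewed_quartic(theta):
    """Skewed-quartic loss with B = (1/p) * upper triangular matrix of ones."""
    p = len(theta)
    # (B theta)_i = (1/p) * (theta_i + theta_{i+1} + ... + theta_p)
    b_theta = [sum(theta[i:]) / p for i in range(p)]
    quadratic = sum(v * v for v in b_theta)   # theta^T B^T B theta
    cubic = sum(v ** 3 for v in b_theta)
    quartic = sum(v ** 4 for v in b_theta)
    return quadratic + 0.1 * cubic + 0.01 * quartic

loss_at_zero = skewed_quartic([0.0] * 10)
loss_at_ones = skewed_quartic([1.0] * 10)
print(loss_at_zero, loss_at_ones)
```

A noisy measurement y(θ) for use with FDSA would add a random term to this value; the noise level is left unspecified here.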
Algorithm Comparison with Skewed-Quartic Loss Function (p = 10) (Example 6.6 in ISSO)
Example with Skewed-Quartic Loss: Mean Terminal Values and 95% Confidence Intervals for $\|\hat{\theta}_k - \theta^*\| / \|\hat{\theta}_0 - \theta^*\|$
FDSA: semi-automatic gains   FDSA: manually tuned gains   Random search B
0.427 [0.411, 0.443]         0.531 [0.502, 0.561]         1.285 [1.190, 1.378]
FDSA semi-automatic is best with respect to error
Random search algorithm B produces solution further from $\theta^*$ than initial condition!
But loss value is better than initial condition
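Assuming the tabulated error is the distance of the terminal estimate from the optimum, normalized by the distance of the initial condition from the optimum, a value above 1 means the terminal estimate is further from the optimum than the starting point. A minimal sketch with hypothetical two-dimensional points:

```python
import math

def normalized_error(theta_k, theta_0, theta_star):
    """Ratio ||theta_k - theta_star|| / ||theta_0 - theta_star|| (Euclidean norm)."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return dist(theta_k, theta_star) / dist(theta_0, theta_star)

theta_star = [0.0, 0.0]  # hypothetical optimum
theta_0 = [1.0, 1.0]     # hypothetical initial condition
closer = normalized_error([0.5, 0.5], theta_0, theta_star)   # below 1
farther = normalized_error([1.5, 1.0], theta_0, theta_star)  # above 1
print(closer, farther)
```

An estimate can move further from the optimum in this norm and still have a lower loss, because the loss need not decrease at the same rate in every direction.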