Recursive Discrete Stochastic
Optimization with Noisy Loss
Measurements
James C. Spall
The Johns Hopkins University Applied Physics Laboratory and
Department of Applied Mathematics and Statistics
Acknowledgments: Some of the slides here are due to contributions of Dr. Stacy D. Hill of Johns Hopkins University/APL, Dr. Qi Wang of Johns Hopkins University and Morgan Stanley, and AMS graduate student Long Wang.
Background
Contrast with Statistical Methods for Discrete Stochastic Optimization
• Two general classes of methods for discrete stochastic optimization from a countable (finite or infinite) number of options in $\Theta$:
– Statistical methods from Chapter 12
– Recursive methods here
• Methods from Chapter 12 used in Sect. 14.5 on simulation-based optimization
• Statistical methods provide alternative to recursive methods for discrete problems:
– Tukey–Kramer test
– Multiple comparisons
– Many-to-one test
Background
Contrast with Statistical Methods for Discrete Stochastic Optimization (cont’d)
• Recursive methods here are distinct from the statistical methods of Chapter 12 for discrete stochastic optimization
• Recursive methods are generally set in a random search or stochastic approximation framework
• Relative to statistical methods above:
– Advantage of recursive methods is allowance for unlimited number of discrete options (statistical methods assume only small number [~20 or fewer] of elements in $\Theta$)
– Disadvantage of recursive methods is that they do not provide the “nice” statistical insight and (possibly) P-values of the methods of Chapter 12
Organization for Remainder of Slides
• PART 1: Neighborhood-Based Methods for Stochastic Discrete Optimization
• PART 2: Optimization with Discrete Version of SPSA Using Noisy Loss Function Measurements
PART 1
Neighborhood-Based Methods for
Stochastic Discrete Optimization
• Several slides below (Part 1) give “high level” summary of neighbor-based optimization for discrete problems with noise
• Some of slides in Part 1 due to S. D. Hill at JHU/APL
• Specific instances of neighbor-based optimization are stochastic ruler and stochastic comparison
Reference for Part 1:
Hill, S. D. (2014), “Discrete Optimization with Noisy Objective Function
Measurements,” in Wiley Encyclopedia of Operations Research and Management Science, Wiley, Hoboken, NJ; published online 7 March 2014. (Available at course website under “Materials” link.)
See also references cited therein for further information on specific methods.
Generic Neighborhood-Based Random Search
• Iterative random search methods that solve
$$\min_{\theta \in \Theta} L(\theta)$$
• Random search involves some Monte Carlo-based selection of search direction in generating iterates $\hat\theta_0, \hat\theta_1, \hat\theta_2, \ldots, \hat\theta_k, \ldots$
• Want sequence to converge to a value in $\Theta^*$ (set of minimizing points)
• Generic form of methods is, at each k, to randomly generate (by Monte Carlo) candidate point in some neighborhood $N(\hat\theta_k)$ around current iterate $\hat\theta_k$
– Algorithm determines whether to accept/reject candidate point at each iteration based on evidence of improvement
– If accept, then next iterate takes value of candidate point; if reject, then next iterate stays at value of current iterate
General Algorithm for
Neighborhood-Based Random Search
Step 1: Set k = 0 and pick initial value $\hat\theta_0 \in \Theta$
Step 2: Given $\hat\theta_k$, randomly generate candidate $\theta'_k \in N(\hat\theta_k)$
Step 3: Collect one or more values $y(\theta'_k)$ (and possibly also $y(\hat\theta_k)$)
Step 4: Using samples from step 3, accept or reject candidate value with some “selection probability”; let
$$\hat\theta_{k+1} = \begin{cases} \theta'_k & \text{if candidate accepted} \\ \hat\theta_k & \text{if candidate rejected} \end{cases}$$
Step 5: Increment k → k+1 and return to step 2; terminate when appropriate
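The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the acceptance rule (compare sample means of a few noisy measurements) is a hypothetical stand-in for the "selection probability" of step 4, and the names `neighborhood_search`, `neighbors`, and `noisy_loss`, as well as the toy loss, are not from the slides.

```python
import random

def neighborhood_search(noisy_loss, theta0, neighbors, n_iter=200, n_samples=5, rng=None):
    """Generic neighborhood-based random search (illustrative sketch only)."""
    rng = rng or random.Random(0)
    theta = theta0                                    # Step 1: initial value
    for _ in range(n_iter):
        candidate = rng.choice(neighbors(theta))      # Step 2: Monte Carlo candidate
        # Step 3: collect noisy measurements at candidate and at current iterate
        y_cand = sum(noisy_loss(candidate, rng) for _ in range(n_samples)) / n_samples
        y_curr = sum(noisy_loss(theta, rng) for _ in range(n_samples)) / n_samples
        # Step 4: hypothetical acceptance rule (assumption): accept on lower sample mean
        if y_cand < y_curr:
            theta = candidate
        # Step 5: continue with next k
    return theta

# Toy problem (assumed): integers in [-10, 10], L(theta) = (theta - 3)^2, N(0, 0.5^2) noise
def toy_loss(theta, rng):
    return (theta - 3) ** 2 + rng.gauss(0.0, 0.5)

def toy_neighbors(theta):
    return [t for t in (theta - 1, theta + 1) if -10 <= t <= 10]

best = neighborhood_search(toy_loss, 10, toy_neighbors)
print(best)
```

Stochastic ruler and stochastic comparison differ from this sketch mainly in how step 4 decides acceptance.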
Random Search by Stochastic Ruler Method
• Convert core minimization problem into solvable maximization problem:
$$\max_{\theta \in \Theta} P\big(y(\theta) \le U\big)$$
where U is independent, uniformly distributed random variable such that $0 < P(y(\theta) \le U) < 1$
• Random variable U represents “stochastic ruler” that $y(\theta)$ is measured against
• Yan and Mukai (1992) show that solution(s) to above maximization problem are in $\Theta^*$ (set of minimizing points), as desired
– SR method exploits randomness of Monte Carlo simulation and generates sequence of estimates for $\theta^*$
– Sequence is nonstationary Markov chain, but process converges to stationary value with limiting probability = 1 that solution estimate is in $\Theta^*$
Some Rationale for Stochastic Ruler Method
• Assume $E[y(\theta)] = L(\theta)$ for all $\theta$ (i.e., noise has mean 0)
• Main “tuning knob” is endpoints of uniform distribution
– If $y(\theta)$ is bounded, can choose endpoints of uniform distribution as max and min values for $y(\theta)$ for all $\theta$
– If $y(\theta)$ is unbounded, need to choose endpoints such that $0 < P(y(\theta) \le U) < 1$ for all $\theta$
• Yan and Mukai (1992) show that
If $P(y(\theta) \le U) > P(y(\theta') \le U)$, then $E[y(\theta)] < E[y(\theta')]$
• Above implies that $\theta^*$ is value of $\theta$ that maximizes $P(y(\theta) \le U)$
• Combine above rationale with Markov chains to show formal convergence of iteration process in SR
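A short worked calculation may help show why the ordering result holds in the bounded case. It assumes (these are assumptions for illustration, not statements from the slides) that U is uniform on [a, b], independent of the measurement, and that the measurement lies in [a, b] with probability 1:

```latex
% Assumes U ~ Uniform[a, b], independent of y(\theta), and a \le y(\theta) \le b almost surely.
\begin{align*}
P\bigl(y(\theta) \le U\bigr)
  &= E\bigl[\,P\bigl(U \ge y(\theta) \mid y(\theta)\bigr)\bigr]
   = E\!\left[\frac{b - y(\theta)}{b - a}\right]
   = \frac{b - L(\theta)}{b - a}.
\end{align*}
```

Under these assumptions the probability is a decreasing linear function of the mean loss, so maximizing it is equivalent to minimizing the loss.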
Stochastic Ruler Method:
Acceptance/Rejection Rule
• “Ruler” U determines whether candidate values are accepted or rejected
– Candidate values generated according to specified transition probability $R(\theta, \theta')$ from previous value
• Given current value $\hat\theta_k$ and candidate value $\theta'_k$ (generated via $R(\hat\theta_k, \cdot)$), accept candidate with probability
$$\big[P\big(y(\theta'_k) \le U\big)\big]^{M_k},$$
where $M_k$ is positive integer
• $M_k$ represents sample size (maximum number of values y) used to estimate probability above; see next slides
• For formal convergence of SR algorithm to optimum, need $M_k \to \infty$ as $k \to \infty$
Steps of Stochastic Ruler Algorithm
• Comment: $M_k$ is upper bound to number of noisy loss evaluations y required at each iteration (main cost of algorithm)
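A minimal sketch of one way the stochastic ruler iteration could be coded follows. The candidate passes only if every one of up to M_k noisy measurements falls at or below a fresh uniform draw, which gives the acceptance probability described on the previous slide. The schedule for `m_k`, the ruler endpoints, the function names, and the toy loss are assumptions made for illustration.

```python
import math
import random

def stochastic_ruler(noisy_loss, theta0, neighbors, a, b, n_iter=300, rng=None):
    """Stochastic ruler sketch; returns final iterate and number of loss evaluations."""
    rng = rng or random.Random(0)
    theta = theta0
    n_evals = 0
    for k in range(n_iter):
        candidate = rng.choice(neighbors(theta))   # uniform transition over neighbors (assumed)
        m_k = 1 + int(math.log(k + 2))             # hypothetical schedule with M_k -> infinity
        accepted = True
        for _ in range(m_k):
            u = rng.uniform(a, b)                  # fresh draw of the "ruler" U
            n_evals += 1
            if noisy_loss(candidate, rng) > u:     # one failed test rejects the candidate
                accepted = False
                break
        if accepted:
            theta = candidate
    return theta, n_evals

# Toy problem (assumed): integers in [-10, 10], L(theta) = (theta - 3)^2, N(0, 0.5^2) noise
def toy_loss(theta, rng):
    return (theta - 3) ** 2 + rng.gauss(0.0, 0.5)

def toy_neighbors(theta):
    return [t for t in (theta - 1, theta + 1) if -10 <= t <= 10]

theta_sr, evals_sr = stochastic_ruler(toy_loss, 10, toy_neighbors, a=-2.0, b=175.0)
print(theta_sr, evals_sr)
```

Because a rejected candidate stops the inner loop early, the evaluation count per iteration is at most `m_k`, consistent with the comment above.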
Convergence Properties of SR
• SR process can be characterized as Markov chain with finite state space (given finite options in $\Theta$)
• Markov chain has stationary probability distribution (Theorem E.1) with probabilities corresponding to $\Theta^*$ summing to 1
– Key assumption is that $M_k \to \infty$ according to $M_k \sim \log(k)$
• Markov chain result leads to convergence in probability:
$$\lim_{k \to \infty} P(\hat\theta_k \in \Theta^*) = 1$$
• Also possible to characterize rate of convergence in terms of $P(\hat\theta_k \ne \theta^*)$ when $\Theta^* = \{\theta^*\}$
– Discussed in Part 2 below via comparison with DSPSA
PART 2
Optimization with Discrete Version of SPSA
Using Noisy Loss Function Measurements
Discrete and Mixed Continuous/Discrete Variables
Note: Most discussion is for discrete-only case; comments at end of
Part 2 on mixed-variable case (see also Wang et al., 2018, below)
References for Part 2:
Wang, Q. and Spall, J. C. (2011), “Discrete Simultaneous Perturbation Stochastic Approximation on Loss Functions with Noisy Measurements,” Proc. of American Control Conf., 29 June–1 July 2011, San Francisco, CA, pp. 4520–4525. http://dx.doi.org/10.1109/ACC.2011.5991407
Wang, Q. and Spall, J. C. (2013), “Rate of Convergence Analysis of Discrete Simultaneous Perturbation Stochastic Approximation Algorithm,” Proc. of American Control Conf., 17–19 June 2013, Washington, DC, pp. 4778–4783. http://dx.doi.org/10.1109/ACC.2013.6580576
Wang, Q. (2013), “Optimization with Discrete Simultaneous Perturbation Stochastic Approximation Using Noisy Loss Function Measurements,” Ph.D. thesis, JHU Dept. of Applied Mathematics and Statistics. http://arxiv.org/abs/1311.0042
Discrete Problem and Basic SA Form
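• Consider real-valued loss function $L(\theta): \mathbb{Z}^p \to \mathbb{R}$, where $\mathbb{Z}$ is set of integers
• So, $\theta$ assumed to lie on p-dimensional integer grid
• Want to solve problem
$$\min_{\theta \in \mathbb{Z}^p} L(\theta)$$
• But we only know the noisy measurements $y(\theta) = L(\theta) + \varepsilon(\theta)$, with $E[\varepsilon(\theta)] = 0$
• Use standard SA-type form for update of $\hat\theta_k$:
$$\hat\theta_{k+1} = \hat\theta_k - a_k\, \hat g_k(\hat\theta_k)$$
where $\hat g_k(\hat\theta_k)$ is appropriate discrete version of “gradient” and $a_k = a/(1 + k + A)^{\alpha}$ is standard gain sequence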
DSPSA Algorithm Description
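• Step 0: Pick initial guess $\hat\theta_0 \in \mathbb{Z}^p$, where $\mathbb{Z}^p$ is p-dimensional multivariate integer space.
• Step 1: Generate $\Delta_k = [\Delta_{k1}, \Delta_{k2}, \ldots, \Delta_{kp}]^T$, where $\Delta_k$ has user-specified distribution satisfying conditions. Special case is when $\Delta_{ki}$ are independent Bernoulli random variables taking values ±1 with probability ½
• Step 2: Set $\pi(\hat\theta_k) = \lfloor \hat\theta_k \rfloor + \mathbf{1}_p/2$, where $\lfloor \hat\theta_k \rfloor = [\lfloor \hat\theta_{k1} \rfloor, \ldots, \lfloor \hat\theta_{kp} \rfloor]^T$
• Step 3: Evaluate y at $\pi(\hat\theta_k) + \Delta_k/2$ and $\pi(\hat\theta_k) - \Delta_k/2$
• Step 4: Construct “gradient” approximation:
$$\hat g_k(\hat\theta_k) = \big[\, y(\pi(\hat\theta_k) + \tfrac{1}{2}\Delta_k) - y(\pi(\hat\theta_k) - \tfrac{1}{2}\Delta_k) \big]\, [\Delta_{k1}^{-1}, \ldots, \Delta_{kp}^{-1}]^T$$
• Step 5: After M iterations of recursion $\hat\theta_{k+1} = \hat\theta_k - a_k\, \hat g_k(\hat\theta_k)$, set $[\hat\theta_M]$ (nearest integer point to $\hat\theta_M$) as solution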
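The DSPSA steps can be sketched with NumPy as below. This is a minimal illustration for the unconstrained integer grid with Bernoulli ±1 perturbations; the gain values (`a`, `A`, `alpha`), the function name `dspsa`, and the toy quadratic loss are assumptions chosen for the example, not values from the slides.

```python
import numpy as np

def dspsa(noisy_loss, theta0, n_iter=2000, a=0.5, A=10.0, alpha=0.7, seed=0):
    """DSPSA sketch on the unconstrained integer grid (illustrative gains)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)                # Step 0: initial guess
    p = theta.size
    for k in range(n_iter):
        a_k = a / (k + 1 + A) ** alpha                     # standard gain sequence
        delta = rng.choice([-1.0, 1.0], size=p)            # Step 1: Bernoulli +/-1
        pi = np.floor(theta) + 0.5                         # Step 2: center of unit hypercube
        y_plus = noisy_loss(pi + delta / 2, rng)           # Step 3: two measurements,
        y_minus = noisy_loss(pi - delta / 2, rng)          #         both at integer points
        g_hat = (y_plus - y_minus) / delta                 # Step 4: "gradient" approximation
        theta = theta - a_k * g_hat                        # recursion
    return np.rint(theta).astype(int)                      # Step 5: nearest integer point

# Toy problem (assumed): separable quadratic with integer minimizer and N(0,1) noise
target = np.array([2.0, -1.0, 3.0])

def toy_loss(theta, rng):
    return float(np.sum((theta - target) ** 2)) + rng.normal(0.0, 1.0)

estimate = dspsa(toy_loss, np.zeros(3))
print(estimate)
```

Note that the two measurement points always have integer coordinates, because the hypercube center is a half-integer vector and each perturbation component is ±½.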
Algorithm Description (Cont’d)
• Remarks:
– Simple implementation
– Two loss function measurements in each iteration
– Makes use of function structure implicitly (gradient-like quantity; connects to notion of “subgradient”)
– Handles noisy measurements
• Convergence property: under some general conditions, sequence generated by DSPSA converges almost surely to optimal solution
Rate of Convergence Results for DSPSA
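• We discuss MSE of DSPSA (Wang and Spall, 2013)
• Under some general conditions for $0.5 < \alpha < 1$, we have
$$E\|\hat\theta_{k+1} - \theta^*\|^2 \le (1 - 2\mu a_k)\, E\|\hat\theta_k - \theta^*\|^2 + a_k^2\, p\, b \qquad (*)$$
where $\mu$ is determined by loss function and b is uniform upper bound for
$$E\Big[\big(L(\pi(\hat\theta_k) + \tfrac{1}{2}\Delta_k) - L(\pi(\hat\theta_k) - \tfrac{1}{2}\Delta_k)\big)^2\Big] + E\big[(\varepsilon_k^{+} - \varepsilon_k^{-})^2\big]$$
(with $\varepsilon_k^{\pm}$ the noises in the two measurements)
• Note: (*) is a difference-equation-like recursion
• Similar recursion to (*) exists for case of $\alpha = 1$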
Rate of Convergence Results for DSPSA (Cont’d)
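• Solving the above difference-equation-like recursion,
$$E\|\hat\theta_k - \theta^*\|^2 = O\big(e^{-c k^{1-\alpha}}\big)\, E\|\hat\theta_0 - \theta^*\|^2 + O\Big(\frac{1}{k^{\alpha}}\Big)$$
where c > 0 and $0.5 < \alpha < 1$
• First big-O term is function of $\mu, a, \alpha, A, k$ and second big-O term is function of $\mu, a, \alpha, A, b, k$
– $\mu$ and b relate to loss function and noise; $a, \alpha, A$ relate to gain sequence
• Analogous result exists for $\alpha = 1$, leading to:
$$E\|\hat\theta_k - \theta^*\|^2 = O\Big(\frac{1}{k}\Big)$$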
Rate of Convergence Results for DSPSA (Cont’d)
• Result on MSE can be used to give rate of convergence of $P([\hat\theta_k] \ne \theta^*)$ to 0 in big-O sense, where $[\hat\theta_k]$ indicates the nearest multivariate-integer point of $\hat\theta_k$
• Convergence rate of $P([\hat\theta_k] \ne \theta^*)$ can be used to compare DSPSA with other discrete stochastic algorithms
• The few other such algorithms include stochastic ruler (SR) algorithm (Yan and Mukai, 1992) and stochastic comparison (SC) algorithm (Gong et al., 1999)
– Comparison with SR and SC below
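One way to see how an MSE bound transfers to the probability of a wrong rounded estimate is the Markov-type argument below. It is a sketch under the assumptions of a unique integer-valued minimizer and componentwise rounding to the nearest integer; the cited papers may proceed differently.

```latex
% Assumes a unique integer minimizer \theta^* and componentwise rounding to the nearest integer.
% If the rounded estimate differs from \theta^*, some component of \hat\theta_k is at least
% 1/2 away from the corresponding component of \theta^*.
\begin{align*}
P\bigl([\hat\theta_k] \ne \theta^*\bigr)
  &\le P\bigl(\|\hat\theta_k - \theta^*\| \ge \tfrac12\bigr)
   \le \frac{E\|\hat\theta_k - \theta^*\|^2}{(1/2)^2}
   = 4\,E\|\hat\theta_k - \theta^*\|^2 .
\end{align*}
```

Any big-O rate for the mean-squared error therefore carries over directly to the error probability.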
Comparison of SR, SC and DSPSA
Takeaway messages from table above and implementation:
• Best possible rate (may not be actual rate) for SR is the “big-O” expression for $P([\hat\theta_k] \ne \theta^*)$ given for SR in table above; likewise for SC
• Results show that rate of convergence of SR and SC cannot be better than that of DSPSA
• Parameters (coefficients) involved in SR and SC are harder to tune than coefficients in DSPSA
Numerical Comparison
(one of many, with similar results)
• Numerical comparison of SR, SC, and DSPSA
• Used “skewed quartic” loss function (Spall 2003, Example 6.6) with p = 200:
$$L(\theta) = \theta^T B^T B\, \theta + 0.1 \sum_{i=1}^{p} (B\theta)_i^3 + 0.01 \sum_{i=1}^{p} (B\theta)_i^4,$$
where pB is upper triangular matrix of 1’s; note $\theta^* = 0$
• Algorithms use measurements $y(\theta)$ that include additive N(0,1) noise
• Domain $\Theta = \{-10, -9, \ldots, 9, 10\}^{200}$
– Search space is large; number of elements in $\Theta$ is $21^{200} \approx 2.8 \times 10^{264}$
• Algorithms initialized at $10 \times \mathbf{1}_{200}$ (a corner of $\Theta$)
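For readers who want to reproduce the test setup, the skewed quartic loss and its noisy measurement can be written as follows. The function names are illustrative; the loss follows the standard form from Spall (2003, Example 6.6), where B is the upper triangular matrix of 1's divided by p.

```python
import numpy as np

def skewed_quartic(theta):
    """Skewed quartic loss; B is the upper triangular matrix of 1's divided by p."""
    theta = np.asarray(theta, dtype=float)
    p = theta.size
    B = np.triu(np.ones((p, p))) / p
    b_theta = B @ theta
    return float(b_theta @ b_theta
                 + 0.1 * np.sum(b_theta ** 3)
                 + 0.01 * np.sum(b_theta ** 4))

def noisy_skewed_quartic(theta, rng):
    """Measurement y(theta) = L(theta) + N(0, 1) noise."""
    return skewed_quartic(theta) + rng.normal(0.0, 1.0)

rng = np.random.default_rng(0)
theta_init = 10 * np.ones(200)               # initial condition used in the study
print(skewed_quartic(np.zeros(200)))         # loss at theta* = 0
print(skewed_quartic(theta_init), noisy_skewed_quartic(theta_init, rng))
```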
Numerical Results for Comparison of
SR, SC, and DSPSA: Loss Values
• Plot below shows comparison in terms of normalized loss value
$$\frac{L([\hat\theta_k]) - L(\theta^*)}{L([\hat\theta_0]) - L(\theta^*)}$$
• SR and SC provide overall lower values of loss than DSPSA
– Better performance possibly due to large flat region in domain of L surrounding $\theta^*$
• Plot in terms of accuracy of estimate of $\theta$ is on next slide
Numerical Results for Comparison of
SR, SC, and DSPSA: Values of $\theta$
• Plot below shows comparison in terms of normalized estimate of $\theta$,
$$\frac{\|[\hat\theta_k] - \theta^*\|}{\|[\hat\theta_0] - \theta^*\|},$$
for skewed-quartic loss function with p = 200
• DSPSA provides greater accuracy than SR or SC in estimate of $\theta$
[Plot: normalized estimate error (vertical axis, 0.4 to 1) versus Number of Measurements (0 to 2×10^4) for Stochastic Ruler, Stochastic Comparison, and DSPSA]
Extension of DSPSA to Mixed Discrete and Continuous Variables
• DSPSA can be extended to minimization of loss function whose variables include both continuous and discrete components
– As usual, assume only noisy measurements of the loss function
– Reference: Wang et al. (2018), American Control Conf.
• The mixed variable version unifies two practical algorithms: basic SPSA and DSPSA
• Can consider alternating use of SPSA and DSPSA by updating only continuous variables or only discrete variables during each iteration
– Not as efficient as updating all variables at each iteration
• Method based on one “pseudo gradient” estimate that captures both continuous and discrete aspects
• Can show formal convergence of simultaneous update of continuous and discrete variables
• Next slide: excerpt from presentation at 2018 ACC summarizing theory
Gradient Estimate in Mixed SPSA
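• Mixed-integer space: $\Theta \subseteq \mathbb{Z}^d \times \mathbb{R}^{p-d}$
• Standard gain sequences
$$a_k = \frac{a}{(k + 1 + A)^{\alpha}}, \quad c_k = \frac{c}{(k + 1)^{\gamma}}, \quad C_k = [\tfrac{1}{2}, \ldots, \tfrac{1}{2}, c_k, \ldots, c_k]^T$$
• Simultaneous perturbation estimate
$$\hat\theta_k^{\pm} = m(\hat\theta_k) \pm C_k \odot \Delta_k$$
where operator ⨀ is matrix Hadamard product, $m(\theta) = [\lfloor t_1 \rfloor + 0.5, \ldots, \lfloor t_d \rfloor + 0.5, t_{d+1}, \ldots, t_p]^T$ for $\theta = [t_1, \ldots, t_p]^T$, and $\Delta_k$ needs to satisfy $\hat\theta_k^{\pm} \in \Theta$
• Gradient estimate (division is elementwise)
$$\hat g_k(\hat\theta_k) = \frac{y(\hat\theta_k^{+}) - y(\hat\theta_k^{-})}{2\, C_k \odot \Delta_k}$$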
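A rough sketch of a mixed-variable simultaneous perturbation gradient estimate is given below. It assumes the first d components are integer-valued and the remaining p − d are continuous, uses a half-unit perturbation for the discrete part and a gain `c_k` for the continuous part, and draws Bernoulli ±1 perturbations. The function name and argument layout are hypothetical; see Wang et al. (2018) for the actual algorithm and conditions.

```python
import numpy as np

def mixed_sp_gradient(noisy_loss, theta, d, c_k, rng):
    """Pseudo-gradient for d discrete + (p - d) continuous components (sketch)."""
    theta = np.asarray(theta, dtype=float)
    p = theta.size
    delta = rng.choice([-1.0, 1.0], size=p)                       # Bernoulli +/-1 (assumed)
    C = np.concatenate([np.full(d, 0.5), np.full(p - d, c_k)])    # perturbation sizes
    m = theta.copy()
    m[:d] = np.floor(theta[:d]) + 0.5                             # hypercube center, discrete part
    theta_plus = m + C * delta                                    # Hadamard product C * delta
    theta_minus = m - C * delta
    diff = noisy_loss(theta_plus, rng) - noisy_loss(theta_minus, rng)
    g_hat = diff / (2.0 * C * delta)                              # elementwise division
    return g_hat, theta_plus, theta_minus

rng = np.random.default_rng(0)
# Purely continuous, noise-free linear loss: the estimate recovers the slope
g_lin, _, _ = mixed_sp_gradient(lambda t, r: 4.0 * t[0], [1.2], d=0, c_k=0.1, rng=rng)
# One discrete and one continuous component
g_mix, t_plus, t_minus = mixed_sp_gradient(
    lambda t, r: float(np.sum(t ** 2)), [2.3, 0.7], d=1, c_k=0.1, rng=rng)
print(g_lin, g_mix, t_plus, t_minus)
```

In this sketch the discrete coordinates of both measurement points are always integers, while the continuous coordinates are perturbed by ±`c_k`.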
Mixed SPSA Convergence Results
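Algo | Convergence | Rate of convergence ($E\|\hat\theta_k - \theta^*\|^2 \to 0$) | Constants
SPSA | Almost surely; Spall (1992) | Asymptotically normal; Spall (1992); big-O rate in powers of 1/k | $a_k$ and $c_k$ decay as powers of 1/k
DSPSA | Almost surely; Wang and Spall (2011) | Wang and Spall (2013); O(1/k) | $a_k \sim 1/k$, $\mu > 0$
MSPSA | Almost surely; Wang et al. (2018) | Sum of big-O terms for initial condition, discrete part (factor d), and continuous part (factor p − d) [ongoing work] | $a_k$ and $c_k$ decay as powers of 1/k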
• Implementation: only two noisy measurements in each iteration
• Set of convergence conditions synthesizes those for SPSA and DSPSA
• Utilizes function structure implicitly (gradient-like quantity)
• Rate of convergence shows a combination of initial condition, discrete component and continuous component
Concluding Remarks for Part 2
(Discrete SPSA: DSPSA)
• We described DSPSA for discrete stochastic optimization problem and showed the almost sure convergence property
• We discussed rate of convergence of DSPSA, which is $O(1/k^{\alpha})$
• Rate of convergence results allow for objective comparison with other stochastic discrete optimization methods (e.g., stochastic ruler and stochastic comparison)
– Can show that rate of convergence of SR and SC cannot be better than that of DSPSA
• Natural extension of DSPSA to mixed variable case (MSPSA) unifies two practical algorithms: basic SPSA and DSPSA
PART 3
Illustration of DSPSA in Simulation-Based
Optimization for a Discrete Problem:
Resource Allocation in Public Health
Note: Some of these slides were shown at 2014 American Control
Conference.
Note: Some ongoing work being carried out by current AMS master’s
student (academic year 2019–2020)
Main reference for this part:
Wang, Q. and Spall, J. C. (2014), “Discrete Simultaneous Perturbation Stochastic Approximation for Resource Allocation in Public Health,” Proc. of American Control Conf., 4–6 June 2014, Portland, OR, pp. 3639–3644. http://dx.doi.org/10.1109/ACC.2014.6859387
Contents for Part 3
• Public Health Problem
• Short Review of Discrete Simultaneous Perturbation Stochastic Approximation (DSPSA)
• Implementation of DSPSA
• Numerical Results
• Conclusions for Part 3
Public Health Problem
• Seasonal influenza epidemics and worldwide
epidemics (pandemics) have caused many deaths and much economic loss throughout history
• For example, in 2009 the H1N1 virus was responsible for millions of confirmed infections and thousands of deaths
throughout the world
– H1N1 has also caused billions of dollars of loss to the economy
Public Health Problem (cont’d)
• Vaccinations, antiviral agents, and school closure are three main methods to control epidemics and reduce costs incurred from hospitalizations and treatment in ICU
– Note: In COVID-19, some of above not available
• Policymakers want to find optimal intervention strategy to achieve best result in terms of loss to society
• Generally, researchers set up base case, such as no intervention
– Then do sensitivity analysis by adding interventions (e.g. Halder et al., 2011, Khazeni et al., 2009)
– Provides ad hoc means of determining optimal strategy
Public Health Problem (cont’d)
• We use DSPSA to find best combination of interventions to achieve least economic loss to society
– DSPSA is alternative to doing sensitivity analysis as means of finding “optimal” strategy
• Open source simulator FluTE (based on a person-by-person model) simulates spread of influenza in large population
• FluTE used to help policymakers prepare for future
influenza seasonal epidemics or pandemics (Chao et al., 2010, Chao et al., 2011)
Reprise (Part 2): DSPSA Description
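• Step 0: Pick initial guess $\hat\theta_0 \in \mathbb{Z}^p$, where $\mathbb{Z}^p$ is p-dimensional multivariate integer space.
• Step 1: Generate $\Delta_k = [\Delta_{k1}, \Delta_{k2}, \ldots, \Delta_{kp}]^T$, where $\Delta_k$ has user-specified distribution satisfying conditions. Special case is when $\Delta_{ki}$ are independent Bernoulli random variables taking values ±1 with probability ½
• Step 2: Set $\pi(\hat\theta_k) = \lfloor \hat\theta_k \rfloor + \mathbf{1}_p/2$, where $\lfloor \hat\theta_k \rfloor = [\lfloor \hat\theta_{k1} \rfloor, \ldots, \lfloor \hat\theta_{kp} \rfloor]^T$
• Step 3: Evaluate y at $\pi(\hat\theta_k) + \Delta_k/2$ and $\pi(\hat\theta_k) - \Delta_k/2$
• Step 4: Construct “gradient” approximation:
$$\hat g_k(\hat\theta_k) = \big[\, y(\pi(\hat\theta_k) + \tfrac{1}{2}\Delta_k) - y(\pi(\hat\theta_k) - \tfrac{1}{2}\Delta_k) \big]\, [\Delta_{k1}^{-1}, \ldots, \Delta_{kp}^{-1}]^T$$
• Step 5: After M iterations of recursion $\hat\theta_{k+1} = \hat\theta_k - a_k\, \hat g_k(\hat\theta_k)$, set $[\hat\theta_M]$ (nearest integer point to $\hat\theta_M$) as solution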
Algorithm Description for Bounded Domain
• For bounded domain, steps 2, 3, 5 should be modified. Suppose $\Psi$ is projection that maps $\hat\theta_k$ to feasible unit hypercubes:
$$\Psi(\hat\theta_k) = [\Psi_1(\hat\theta_{k1}), \ldots, \Psi_p(\hat\theta_{kp})]^T$$
• Step 2: Set $\pi(\Psi(\hat\theta_k)) = \lfloor \Psi(\hat\theta_k) \rfloor + \mathbf{1}_p/2$
• Step 3: Evaluate y at $\pi(\Psi(\hat\theta_k)) \pm \Delta_k/2$
• Step 5: After M iterations, set $[\Psi(\hat\theta_M)]$ as approximated optimal solution
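One possible realization of the projection step for a box-shaped integer domain is sketched below. Clipping componentwise and capping the lower corner of the unit hypercube at `upper - 1` are assumptions made for this illustration so that both measurement points stay feasible; the cited paper may define the projection differently.

```python
import numpy as np

def project(theta, lower, upper):
    """Componentwise projection onto the box [lower, upper] (assumed form of Psi)."""
    return np.clip(np.asarray(theta, dtype=float), lower, upper)

def perturbation_points(theta, delta, lower, upper):
    """Return the two feasible integer points at which y is evaluated."""
    psi = project(theta, lower, upper)
    # Keep the whole unit hypercube inside the box: lower corner at most upper - 1
    corner = np.minimum(np.floor(psi), upper - 1)
    pi = corner + 0.5
    delta = np.asarray(delta, dtype=float)
    return pi + delta / 2, pi - delta / 2

plus, minus = perturbation_points([10.7, -12.3, 3.2], [1, -1, 1], lower=-10, upper=10)
print(plus, minus)
```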
Simulation for Use with
Implementation of DSPSA
• FluTE based on new stochastic model of epidemics within a large population
• Three intervention methods considered in FluTE: vaccinations, antiviral agents, and school closure
• Use FluTE to simulate results of different intervention strategies
• Due to Monte Carlo randomness in simulation, outputs of FluTE represent noisy measurements of loss function L (i.e., y = L + noise)
Implementation of DSPSA with FluTE
• Loss function: sum of vaccination cost, antivirals cost, school closure cost, hospitalization cost, ICU cost, and death cost
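• $\theta$ is vector of input parameters related to intervention strategies
[Diagram: loop between DSPSA and FluTE. DSPSA perturbation converts $\hat\theta_k$ into the two simulator inputs $\pi(\hat\theta_k) \pm \Delta_k/2$; FluTE Simulator returns the noisy losses $y(\pi(\hat\theta_k) + \Delta_k/2)$ and $y(\pi(\hat\theta_k) - \Delta_k/2)$; DSPSA update uses them to produce $\hat\theta_{k+1}$]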
Numerical Results Based on FluTE: Sanity Check
• Extreme cases: free interventions and expensive interventions (free hospitalization and ICU; $L(\theta) > 0$ due to cost of ill people)
– Serves as “sanity check” on approach
• Optimal solutions below are full intervention (vaccinate 100% of population) and no intervention (vaccinate 0% of population), respectively
– Overall costs (loss value) approach 0 due to idealized assumptions (not realistic); note different scale on axes
Numerical Study Based on FluTE:
Realistic Case
• We pick 20 tracts of Los Angeles (LA) containing total population of 100,096
– Set this as synthetic city having similar census properties to whole nation
• There are 9 different vaccines produced by 5 manufacturers and detailed supply information is shown in Web Table 3 of Chao et al. (2011).
– Suppose supply for synthetic city is proportional to national supply by population
Numerical Study Based on FluTE:
Realistic Case (cont’d)
• We start simulation for H1N1 on September 1, 2009, and total length of simulation set to 175 days (25 weeks)
• In reality, vaccines for H1N1 start to be available near peak of infections
• We are also interested in situation when vaccines available earlier than in “real” situation because we want to check effect of early intervention
• In results below, we do not know true optimal solution and/or true value of loss
– Can only do “intuition check” on solution and
observe noisy loss
Numerical Study Based on FluTE:
Realistic Case (cont’d)
• For early vaccination case, we assume that the vaccines are available 30 days earlier than in actual situation
• Note that in early vaccine case, overall costs tend to be slightly lower than in non-early vaccine case due to increased flexibility in solution
• Numerical results on next two slides
Numerical Results Based on FluTE:
Early Vaccination Case
• Early vaccination case provides idealized solution since most people not willing to accept early vaccination
[Plot: Total Cost to the Economy (dollars), vertical axis 0 to 6×10^6, versus Number of Iterations, 0 to 10000]
Numerical Results Based on FluTE:
Non-Early Vaccination Case
• Solution below shows improvement from initial condition but overall loss (cost) values slightly above early vaccine case
– Steady state values now around $2.5M vs. $1.6M on previous slide
[Plot: Total Cost to the Economy (dollars), vertical axis 0 to 7×10^6, versus Number of Iterations, 0 to 10000]
Interpretation of Numerical Results
Based on FluTE
• Optimal intervention strategy for early vaccination case is: give all family members antiviral agents if one
member is ascertained, and vaccinate 30% of “assigned groups”
• Assigned groups = members of families containing infants, high risk young adults, high risk older adults, all preschoolers, and all school-age children
• Members of families containing infants and all school-age children have highest priority. High risk older adults have second priority. Young adults and preschoolers have lowest priority
• Results here consistent with results in Yang et al. (2009) and Chao et al. (2011)
Interpretation of Numerical Results
Based on FluTE
• Optimal intervention strategy for non-early vaccine case is:
Give all family members antiviral agents if one member is ascertained, and vaccinate 20% of assigned groups (high risk young adults, high risk older adults, and all school-age
children)
• High risk young adults have highest priority. High risk older adults have lower priority. All school-age children have lowest priority
• Result indicates that when vaccines available near peak time of H1N1, better to first vaccinate high risk adults, rather than students (consistent with Chao et al., 2011)
Conclusions for Part 3
• DSPSA is powerful general method for discrete stochastic optimization
– Theory for convergence and rate of convergence shown previously (Wang and Spall, 2011, 2013)
• Application of DSPSA towards developing optimal public health strategies for containing spread of influenza given limited societal resources
– Study makes use of open source software for intervention strategies (FluTE: Chao et al., 2010) to generate “noisy” loss values for use in DSPSA
• Numerical results consistent with conclusions from sensitivity studies of other researchers
• Useful in COVID-19?
Additional References for Part 3
(beyond opening slide for Part 3)
• D. L. Chao, M. E. Halloran, V. J. Obenchain, and I. M. Longini Jr. (2010), “FluTE, a Publicly Available Stochastic Influenza Epidemic Simulation Model,” PLoS Computational Biology, Vol. 6, Issue 1, e1000656.
• N. Halder, J. K. Kelso, and G. J. Milne (2011), “Cost-Effective Strategies for Mitigating a Future Influenza Pandemic with H1N1 2009 Characteristics,” PLoS ONE, Vol. 6, Issue 7, e22087.
• N. Khazeni, D. W. Hutton, A. M. Garber, N. Hupert, and D. Owens (2009), “Effectiveness and Cost-Effectiveness of Vaccination against Pandemic (H1N1),” Annals of Internal Medicine, Vol. 151, No. 12, pp. 829–839.
• Y. Yang, J. D. Sugimoto, M. E. Halloran, N. E. Basta, D. L. Chao, L. Matrajt, G. Potter, E. Kenah, and I. M. Longini Jr. (2009), “The Transmissibility and Control of Pandemic Influenza A (H1N1) Virus,” Science, Vol. 326, pp. 729–733.