CHAPTER 1
S
TOCHASTIC
S
EARCH AND
O
PTIMIZATION:
M
OTIVATION AND
S
UPPORTING
R
ESULTS
•Organization of chapter in ISSO
–Introduction
–Some principles of stochastic search and optimization
•Key points in implementation and analysis
•No free lunch theorems
–Gradients, Hessians, etc.
–Steepest descent and Newton-Raphson search
Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall
Search and Optimization Algorithms as
Part of Problem Solving
• There exist many deterministic and stochastic algorithms • Algorithms are partof the broader solution
• Need clear understanding of problem structure, constraints, data characteristics, political and social context, limits of algorithms, etc.
• “Imagine how much money could be saved if truly
appropriate techniques were applied that go beyond simple linear programming.” (Michalewicz and Fogel, 2000)
– Deeper understanding required to provide truly appropriate solutions; “COTS” software usually not enough!
1-3
Potpourri of Problems Using Stochastic
Search and Optimization
• Minimize the costs of shipping from production facilities to warehouses
• Maximize the probability of detecting an incoming warhead (vs. decoy) in a missile defense system
• Place sensors in manner to maximize useful information • Determine the times to administer a sequence of drugs for
maximum therapeutic effect
• Find the best red-yellow-green signal timings in an urban traffic network
• Determine the best schedule for use of laboratory facilities to serve an organization’s overall interests
1-4
Two Fundamental Problems of Interest
• Let be the domain of allowable values for a vector
• represents a vector of “adjustables”
– may be continuous or discrete (or both)
• Two fundamental problems of interest:
Problem 1.Find the value(s) of a vector that minimize a scalar-valued loss function L()
— or —
Problem 2.Find the value(s) of that solve the equation g() = 0for some vector-valued function g()
• Frequently (but not necessarily) g() =
• Let denote solution to Problem 1 or 2 (may or may not be unique)
1-5 Continuous Discrete/
Continuous
Discrete
Three Common Types of Loss Functions
Classical Calculus-Based Optimization
• Classical optimization setting of interest
Find that minimizes the differentiable loss L() subject tosatisfying relevant constraints ( )
Standard nonlinear unconstrained optimization setting: Find
such that = 0
–Lagrange multipliers useful for some types of constraints
L 0
L( )
*
1-7
Global vs. Local Solutions
• Typically the solution to = 0 may only be a local
(vs. global) optimal
• General global optimization problem is very difficult
• Sometimes local optimization is “good enough” given limited resources available
• Global methods include: genetic algorithms, evolutionary strategies, simulated annealing, etc.
L
1-8
Global vs. Local Solutions (cont’d)
• Global methods tend to have following characteristics:
– Inefficient, especially for high-dimensional
– Relatively difficult to use (e.g., require very careful selection of algorithm coefficients)
– Sometimes questionable theoretical foundation for global convergence
– Multiple runs usually required to have confidence in reaching global optimum
• Some “hype” with many methods; e.g., genetic algorithm (GA) software advertisement:
– “…uses GAs to solve anyoptimization problem!”
• But there are somesound methods
1-9
Examples of Easy and Hard Problems for
Global Optimization
Note: “Hard problem” below (plot (b)) is indicative of relative volume of region aroundin practical high-dimensional problems
Stochastic Search and Optimization
• Focus here is stochasticsearch and optimization:
A. Random noise in input information (e.g., noisy measurements of L(): y() L() + noise)
—and/or—
B. Injected randomness (Monte Carlo) in choice of algorithm iteration magnitude/direction
• Contrasts with deterministic methods
– E.g., steepest descent, Newton-Raphson, etc.
– Assume perfect information about L() (and its gradients) – Search magnitude/direction deterministic at each iteration
• Injected randomness (B) in search magnitude/direction can offer benefits in efficiency and robustness
1-11
Some Popular Stochastic Search and
Optimization Techniques
• Random search
• Stochastic approximation
– Robbins-Monro and Kiefer-Wolfowitz – Stochastic gradient
• Neural network training
• Infinitesimal perturbation analysis • Recursive least squares
– SPSA
• Simulated annealing • Genetic algorithms
• Evolutionary programs and strategies • Reinforcement learning
• Markov chain Monte Carlo (MCMC) • Etc.
1-12
Effects of Noise on Simple Optimization Problem
1-13
• Consider tracking problem where controller and/or system depend on design parameters
– E.g.: Missile guidance, robot arm manipulation, attaining macroeconomic target values, etc.
• Aim is to pick to minimize mean-squared error (MSE):
• In general nonlinear and/or non-Gaussian systems, not possibleto compute L()
• Get observedsquared error by running system
• Note that
– Values of y(), not L(), used in optimization of
Example 1 of Noisy Loss Measurements:
Tracking Problem (Example 1.5 in
ISSO
)
2( )
actual output
desired output
L
E
2 ( ) y
2
( )
( )
noise
y
L
• Have credible Monte Carlo simulation of real system
• Parameters in simulation have physical meaning in system
– E.g.: is machine locations in plant layout, timing settings in traffic control, resource allocation in military operations, etc.
• Run simulation to determine best for use in real system
• Want to minimize averagemeasure of performance L()
– Let y() represent onesimulation output (y() = L() + noise)
Example 2 of Noisy Loss Measurements:
Simulation-Based Optimization
(Example 1.6 in
ISSO
)
Stochasticoptimizer
y
(
)
Monte Carlo Simulation
1-15
Example 3 of Noisy Loss Measurements:
Online Loss Function and Training
• Assume {xk, zk} are i.i.d. input-output data pairs on some system; data may be collected in batch or real-time
• Consider data modeled as zk= h(,xk) + noise, where h(,xk) represents model with unknown parameters
– Typical function h() is a neural network
• Typical Lrepresents mean-squared difference between actual outcome and prediction: L()= E[(zk h(,xk))2] • If process data one at a time, then have “on-line” training,
such as stochastic gradient algorithm
– Note that (zkh(,xk))2/, which is random, represents “noisy” value of unknown true gradient L/
• Contrast is batch training where all data are processed at each iteration via sumof squared errors using deterministic steepest descent or other nonlinear programming method • Stochastic gradient = online trainingin machine learning
1-16
•Suppose loss measurements ygenerated by some random process V
•Common simplification is to replace V by its mean
– Makes problem easier to solve and interpret due to lack of randomness (e.g., no multiple samples required)
– Can work with constants instead of probability distributions – No need to specify forms of distributions
– No need to generate random outcomes Vif using simulation to generate y(may be difficult for non-standard distributions; see, e.g, Appendix D and Chap. 16 of Spall, 2003)
•However, simplification can lead to seriously incorrect results: “Plans based on average assumptions are wrong on average” (see The Flaw of Averages: Why We Underestimate Risk in the Face of Uncertaintyby Sam Savage, 2009)
•Example: If y() = Q(,V) and L() = E[Q(,V)] (e.g., Chaps. 5 and 15) and = E(V), then very wrongto minimize Q(,) instead of L()
1-17
• Algorithm comparisons via number of measurements of L() or g() (not iterations)
– Function measurements typically represent major cost
• Curse of dimensionality
– E.g.: If dim() = 10, each element of can take on 10 values. Taking 10,000 random samples:
Prob(finding at least one of 500 best ) = 0.0005 – Above example would be even much harder with only noisy
function measurements
• Constraints
• Limits of numerical comparisons
– Avoid broad claims based on numerical studies – Best to combine theory andnumerical analysis
Some Key Properties in Implementation and
Evaluation of Stochastic Algorithms
• = 0 setting usually associated with unconstrained
optimization
• Most real problems include constraints
• Many constrained problems can also be converted to = 0
– Penalty functions, projection methods, ad hoc methods and common sense, etc
• Considerations for constraints:
– Hard vs. soft
– Explicit vs. implicit
Constrained vs. Unconstrained
L
1-19
No Free Lunch Theorems
• Notion of “You can’t get something for nothing” can be formalized
• Wolpert and Macready (1997) establish several “No Free Lunch” (NFL) Theorems for optimization
• NFL Theorems apply to settings where parameter set
and set of loss function values are finite, discrete sets
– Relevant for continuous problem when considering digital computer implementation
– Results are valid for deterministic and stochastic settings
• Number of optimization problems—mappings from to set
of loss values—is finite
• NFL Theorems state, in essence, that no one search algorithm is “best” for all problems
1-20
No Free Lunch Theorems—Basic Formulation
• Suppose that
N=number of possible values of
NL= number of possible values of loss function
• Then
NL Nθ = number of loss functions• There is a finite (but possibly huge) number of loss functions
1-21
Illustration of No Free Lunch Theorems
(Example 1.7 in
ISSO
)
• Three values of , two outcomes for noise free loss L
– Eight possible mappings, hence eight optimization problems
• Mean loss across all problems is same regardless of ;
entries 1 or 2 in table below represent two possible L
outcomes
1 2 3 4 5 6 7 8
1 1 1 1 2 2 2 1 2
2 1 1 2 1 1 2 2 2
3 1 2 2 1 2 1 1 2
Map
No Free Lunch Theorems—Basic Formulation
• Assume algorithm Ais applied to L()
• Let represent value of a loss function after n unique
function evaluations
• Consider probability that given algorithm Aand
loss function L
• Proof of NFL Theorems consider this probability for
choices of algorithms Aand loss functions L
• Conclusion of NFL based on averaging above probability w.r.t. either L or A
• Most important special case of is = L()
ˆn L
ˆn L
ˆn ,
1-23
No Free Lunch Theorems (cont’d)
• NFL Theorems state, in essence:
• In particular, if algorithm 1 performs better than algorithm 2 over some set of problems, then algorithm 2 performs better than algorithm 1 on another set of problems
• NFL theorems say nothing about specificalgorithms on
specificproblems
– NFL provides no direct guidance for implementation, but useful as conceptual device to avoid exaggerated claims of power of specific algorithms
Averaging (uniformly) over all possible problems (loss functions L), all algorithms perform equally well
Overall relative efficiency of two algorithms cannot be inferred from a few sample problems
1-24
• Often used directly in deterministic methods like steepest
descent and Newton-Raphson; indirectlyin stochastic
methods
– Exact gradients and Hessians generally notavailable in stochastic optimization
• Gradient g() of L() is the vector of 1st partial derivatives
• Hessian of L() is the matrix H() consisting the 2nd
partial derivatives
• Hessian useful in characterizing shape of L and in
providing search direction for Newton-Raphson algorithm
Gradients and Hessians
2
( ) LT
H
( ) L
1-25
Rationale Behind Steepest Descent Update
Direction for
i
th Element of
1-27
Noisy and Noise-Free Gradient Input (cont’d):
Relative Loss Values (Example 1.8 in
ISSO
)
1-28
Deterministic Optimization: 1
st-order (Steepest
Descent) and 2
nd-order (Newton-Raphson)
1-29
Final Remarks on Contrast of Deterministic
and Stochastic Optimization:
Relative Theoretical Rates of Convergence
• Theoretical analysis based on convergence rates of
iterates where k is iteration counter
• Recall is optimal value of
• For deterministicoptimization, a standard rate result is:
• Corresponding rate with noisy measurements
• Stochastic rate inherently slower in theory and practice
ˆ ,
k
O
ˆ
(
k), 0
1
k
c
c
O
121
ˆ
, 0
k
k
Concluding Remarks
• Stochastic search and optimization very widely used
– Handles noise in the function evaluations – Generally better for global optimization
– Broader applicability to “non-nice” problems (robustness) – Basis for modern “deep learning” methods
• Some challenges in practical problems
– Noise dramatically affects convergence
– Distinguishing global from local minima not generally easy – Curse of dimensionality
– Choosing algorithm “tuning coefficients”
• Rarely sufficient to use theory for standard deterministic methods to characterize stochastic methods
• “No free lunch” theorems are barrier to exaggerated claims of power and efficiency of any specific algorithm
• Algorithms should be implemented in context: “Better a
1-31
Selected References on Stochastic Optimization
• Bhatnagar, S., Prasad, H. L., and Prashanth, L. A. (2013),Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods, Springer.
• Fogel, D. B. (2000), Evolutionary Computation: Toward a New Philosophy of Machine Intelligence(2nd ed.), IEEE Press.
• Fu, M. C. (2002), “Optimization for Simulation: Theory vs. Practice” (with discussion by S. Andradóttir, P. Glynn, and J. P. Kelly), INFORMS Journal on Computing, vol. 14, pp. 192227.
• Goldberg, D. E. (1989), Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley.
• Gosavi, A. (2015), Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning(2nded.), Springer.dx.doi.org/10.1007/978-1-4899-7491-4
• Holland, J. H. (1975),Adaptation in Natural and Artificial Systems, University of Michigan Press.
• Kleijnen, J. P. C. (2008),Design and Analysis of Simulation Experiments, Springer. • Kushner, H. J. and Yin, G. G. (2003),Stochastic Approximation and Recursive Algorithms
and Applications(2nd ed.), Springer-Verlag.
• Michalewicz, Z. and Fogel, D. B. (2000), How to Solve It: Modern Heuristics, Springer-Verlag.
• Spall, J. C. (2003), Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley.