Hill_EORMS_2014_Discrete_stochastic_optimization.pdf

NOISY OBJECTIVE FUNCTION MEASUREMENTS

Stacy D. Hill

The Johns Hopkins University, Applied Physics Laboratory, Laurel, Maryland

Discrete optimization with “noisy” objective function measurements, also referred to as discrete (parameter) stochastic optimization, is concerned with maximizing (or minimizing) real-valued functions defined on discrete sets, where the values of the objective function are unknown but can be estimated to within some measurement noise.

Some applications that give rise to the problem of optimizing noisy objective functions are: message transmission in communications networks [1–4]; locating a service or resource facility, such as a manufacturing warehouse, supply distribution center, school, or hospital [5,6]; scheduling machine usage in a production plant [7,8]; allocating people to evacuation routes [9]; and, more generally, allocating a finite number of units of a fixed resource to users or activities so as to minimize a given performance measure [10–12].

The salient feature of such optimization problems, and a major source of difficulty, is the discreteness of the feasible region on which the objective function is defined. In contrast to continuous parameter optimization, objective functions defined on discrete sets lack derivatives that provide information about local optimality and guide the search for the optimum. The optimization of discrete functions relies on comparing differences of the objective function values, which must be estimated when those values contain noise.

This article will describe the general problem of stochastic discrete optimization and discuss characteristics and properties of several main approaches to solving it.

Preliminaries

Although there is no generally accepted definition of what is meant by a discrete feasible region in optimization, usages of the term imply that the set is at most countable. Cardinality alone, however, does not suffice to describe discrete sets. (For example, the rational numbers are at most countable but are not discrete.) In the case of optimization over subsets of real $d$-dimensional space, the common interpretation of a discrete set is that its points are separated by some fixed positive distance. An essential property of such sets is that functions defined on them lack derivatives. One way to guarantee that a derivative is undefinable is to require that the set have no accumulation points. This observation motivates the following topological definition of discreteness (see, e.g., [13]), which suffices for the discussion in this article and is sufficiently general to cover most problem settings.

A set in a topological space is discrete if it has no accumulation points. Equivalently, a discrete set is one that contains only isolated points. For example, the integers $\mathbb{Z}$ form a discrete set in the reals $\mathbb{R}$, whereas the set of rational numbers does not.

The introduction of a topology, aside from its usefulness in defining discreteness, simplifies the discussion of convergence concepts. In particular, a sequence of points $\{\theta_n\}$ in a discrete set $\Theta$ converges to a point $\theta$ in $\Theta$ if $\theta_n = \theta$ for $n$ sufficiently large. (Take any neighborhood of $\theta$ containing no point of $\Theta$ other than $\theta$. Such neighborhoods exist because discrete sets contain only isolated points.) Similarly, $\theta_n$ converges to a subset $E$ of $\Theta$, consisting of finitely many points, if $\theta_n \in E$ for sufficiently large $n$.

The Optimization Problem

The discrete stochastic optimization problem has the following general formulation. Let $L$ be a real-valued objective function defined on a discrete, at most countable feasible region $\Theta$, the problem domain. The value $L(\theta)$ is unknown, but it is assumed that, for each $\theta \in \Theta$, there is available a random variable $Y = Y(\theta)$ such that

$$Y = L(\theta) + \varepsilon(\theta), \quad \theta \in \Theta, \tag{1}$$

where $\varepsilon(\theta)$, the measurement or estimation noise, is a random variable with mean zero. Thus, $Y(\theta)$ satisfies $L(\theta) = E[Y(\theta)]$ for each $\theta$. The problem is to find $\theta^* \in \Theta$ such that

$$L(\theta^*) = \min_{\theta \in \Theta} L(\theta) \tag{2}$$

using only measurements of $Y$. To avoid trivialities, it is assumed throughout this article that $\Theta$ is nonempty and, furthermore, that $L$ is nonconstant on $\Theta$.

The objective function will sometimes be interpreted as measuring some type of loss, such as cost. The problem, then, is to find points in the feasible region that minimize loss. Note that the problem of minimizing $L(\theta)$ is equivalent to the problem of maximizing $-L(\theta)$. Any method for finding a maximum is, therefore, easily modified to yield a method for finding a minimum. The choice of problem, minimization or maximization, is a matter of convenience and, usually, is dictated by the particular application. The choice here is to view optimization as minimization (of loss) unless otherwise stated.

Let $\Theta^*$ denote the set consisting of all points in $\Theta$ that solve Equation (2), that is, the set of points in $\Theta$ that optimize $L$. When $\Theta$ is finite, $\Theta^*$ is nonempty (because it is assumed that $\Theta$ is nonempty). When $\Theta$ is infinite, it will be assumed that $\Theta^*$ is nonempty. In either instance, the minimum value of $L(\theta)$ exists and will be denoted by $L^*$.

It is tempting to view discrete optimization problems as optimization over $\mathbb{Z}$, as any function defined on a discrete set can be transformed into a function on a subset of the integers and minimized there. (For example, if $h$ is a one-to-one correspondence from a subset $J$ of $\mathbb{Z}$ onto $\Theta$, define the loss function on $J$ to be $\tilde L(z) = L(h(z))$ for each $z$ in $J$.) However, there may be little or no benefit in doing so. In fact, the transformed problem might eliminate some inherent problem structure (e.g., a property of the objective function) that aids in the problem solution. A simple example of this is the discrete resource allocation problem [11]. The problem is to distribute a finite resource to a finite number of users so as to minimize some allocation cost that is assumed to be discretely convex (see, e.g., [14]). If $d$ is the number of users, then the feasible region is a subset of $\mathbb{Z}^d_+$, the set of points in $\mathbb{R}^d$ with nonnegative integer coordinates, where the coordinates represent the amount of resource allocated to each user. There are infinitely many mappings that define a one-to-one correspondence between $\mathbb{Z}^d_+$ and $\mathbb{Z}$, but there is no guarantee that any one of them will preserve the convexity of the cost function in the original problem. Thus, the formulation and representation of a problem require some care. (For discussion of issues associated with problem formulation, see, e.g., [15,16].)

OPTIMIZATION APPROACHES

It is worth noting that deterministic optimization algorithms do not generally apply to noisy objective functions, as they rely on the values of $L(\theta)$ in the search for an optimum. Such algorithms assume that $Y$ contains no measurement noise and, therefore, assume (erroneously) that a small value of $Y(\theta)$ necessarily corresponds to a small value of $L(\theta)$. Thus, deterministic methods seek points in the feasible region that minimize $Y(\theta)$ and, therefore, do not apply without some modification.
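The point can be illustrated with a minimal Python sketch. It compares minimizing a single noisy measurement of each point with minimizing sample means over many measurements. The quadratic loss, the Gaussian noise model, and all function names are assumptions made for illustration and are not taken from the cited references.

```python
import random

def noisy_measurement(loss, theta, sigma, rng):
    """One sample of Y(theta) = L(theta) + eps, with eps ~ N(0, sigma^2) (assumed noise model)."""
    return loss(theta) + rng.gauss(0.0, sigma)

def argmin_of_sample_means(loss, feasible, sigma, n_samples, rng):
    """Return the feasible point with the smallest sample mean of n_samples measurements."""
    def mean_y(theta):
        total = sum(noisy_measurement(loss, theta, sigma, rng) for _ in range(n_samples))
        return total / n_samples
    return min(feasible, key=mean_y)

def loss(theta):
    # Hypothetical loss on the integers 0..6 with minimizer theta = 3.
    return (theta - 3) ** 2

rng = random.Random(0)
feasible = range(7)
# With one measurement per point the choice is driven largely by noise.
single_sample_choice = argmin_of_sample_means(loss, feasible, sigma=5.0, n_samples=1, rng=rng)
# Averaging many measurements shrinks the noise in each estimate of L(theta).
averaged_choice = argmin_of_sample_means(loss, feasible, sigma=5.0, n_samples=5000, rng=rng)
```

Brute-force averaging of this kind is only practical for small feasible regions, which is one motivation for the approaches discussed in this article.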


Statistical Approaches

Statistical optimization methods make decisions about which parameter values are the best, typically in terms of hypothesis testing procedures or confidence intervals. The methods include ranking and selection (R&S) and multiple comparisons (e.g., [17–22]) and adaptive allocation or sequential design of experiments (see, e.g., [23–25]). Spall [Chap. 12, Ref. 26] discusses the application of R&S and multiple comparisons to discrete stochastic optimization, with worked examples.

Statistical procedures treat the objective function values as the unknown parameters corresponding to $k$ statistical populations, $1 < k < \infty$. The $i$th population, $i = 1, 2, \ldots, k$, has probability density function $f(y; \theta^{(i)})$ (with respect to some dominating probability measure $\nu$, say) and mean $L_i = L(\theta^{(i)})$. The form of $f(\cdot\,; \theta^{(i)})$ is assumed known, where $\theta^{(i)}$ is a fixed but unknown parameter. Thus, $\Theta$ is the set consisting of the points $\theta^{(1)}, \theta^{(2)}, \theta^{(3)}, \ldots, \theta^{(k)}$. The populations are ranked according to the $L_i$ values, where the smaller values determine the better ranks, with $L^*$ being the smallest of the $L_i$'s.

The problem is to construct procedures for identifying the best-ranked population by taking observations of the random variables $Y(\theta^{(i)})$.

Ranking and Selection, Multiple Comparison. These methods make decisions regarding the best in terms of independent observations of $Y$, where $n_i$ independent identically distributed (i.i.d.) observations are taken of $Y(\theta^{(i)})$, $i = 1, 2, 3, \ldots, k$. Inference is based on the sample means of the $Y(\theta^{(i)})$'s.

R&S methods are decision rules for selecting a single population or a subset of populations (of fixed or random size) such that the selection includes the best-ranked population with probability at least equal to $P^*$. The probability of obtaining a correct selection, $P^*$, is specified before selection. A variant of such procedures is the selection of a subset that includes the best population, with probability of at least $P^*$, when the parameters satisfy some other condition (such as the condition that the difference between the best and the next best value be greater than or equal to some specified value).

Multiple comparison procedures rely on simultaneous confidence intervals to select the best population. An example of this type of procedure is the construction of simultaneous confidence intervals about the differences between the values of the objective function, $L_i - L_j$, where $i \ne j$, such that each interval is of the same confidence level. Confidence intervals about the differences between objective function values facilitate inference about the magnitudes and directions of the differences and, consequently, enable inferences about the best value of $L(\theta)$. Similar to R&S, the confidence intervals are constructed by taking i.i.d. samples of $Y(\theta^{(i)})$, for each $i$.

Some useful measures of performance of these statistical procedures are the expected size and the expected sum of ranks of the selected subset (for R&S procedures) and the expected width of the confidence intervals (for the multiple comparison procedures). For example, consider the problem of selecting a set that contains the best-ranked population with probability $P^*$. The size of the selected set is a random variable $S$ such that $1 \le S \le k$. Suppose that the decision rule selects population $i$ for inclusion in the set by comparing estimates $\hat L(\theta)$ of the $L(\theta)$'s, defined to be the sample means of the observed values of $Y(\theta)$. Suppose that population $i$ is included if

$$\hat L(\theta^{(i)}) \le \min\{\hat L(\theta) : \theta \in \Theta\} + d, \tag{3}$$

where $d$ is chosen so that the selected set contains the population with mean $L^*$ with probability $P^*$. The expected value of $S$ is given by

$$E(S) = \sum_{i=1}^{k} P(\text{selecting the population of rank } i). \tag{4}$$

It is also of interest to consider, for example, the conditions on the $L_i$ values under which the maximum value of $E(S)$ is achieved.
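The subset selection rule (3) and the expected subset size in (4) can be sketched in Python as follows. The Gaussian noise model, the fixed number of observations per population, and the function names are assumptions for illustration. How to choose $d$ to achieve a given $P^*$ is not addressed here.

```python
import random

def select_subset(sample_means, d):
    """Indices whose sample mean is within d of the smallest sample mean, as in rule (3)."""
    best = min(sample_means.values())
    return {i for i, m in sample_means.items() if m <= best + d}

def estimate_expected_size(true_means, sigma, n_obs, d, n_reps, rng):
    """Monte Carlo estimate of E(S) for Gaussian populations (assumed noise model)."""
    total = 0
    for _ in range(n_reps):
        # Sample mean of n_obs i.i.d. observations from each population.
        means = {
            i: sum(rng.gauss(mu, sigma) for _ in range(n_obs)) / n_obs
            for i, mu in true_means.items()
        }
        total += len(select_subset(means, d))
    return total / n_reps
```

For example, `select_subset({1: 0.0, 2: 0.5, 3: 2.0}, d=1.0)` keeps populations 1 and 2, and the estimated expected size always lies between 1 and $k$.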

Sequential Design. The method of optimization by sequential design (initiated in Ref. [23]) sequentially samples $y_1, y_2, y_3, \ldots$ from $k$ populations, where $y_i$ is drawn from

observations so as to optimize the expected value of the sum $S_n = y_1 + y_2 + y_3 + \cdots + y_n$ as $n \to \infty$. (In this section, optimization will be taken to be maximization, in keeping with the literature on this problem. Thus, $L^*$, here, denotes the maximum value of $L$.)

The choice of population from which to sample the $n$th observation $y_n$ is a “decision” $\phi_n$ such that the event $\{\phi_n = i\}$, $1 \le i \le k$, “sample the $n$th observation $y_n$ from population $i$,” depends only on $\phi_1, y_1, \ldots, \phi_{n-1}, y_{n-1}$, $n = 1, 2, 3, \ldots$. The sequence of decisions defines a sequential experimental design $\phi = (\phi_1, \phi_2, \phi_3, \ldots)$. If

$$\lim_{n \to \infty} \frac{E(S_n)}{n} = L^*, \tag{5}$$

then $\phi$ is said to be consistent.

Robbins [23] developed consistent sequential experimental designs for the case $k = 2$. Lai and Robbins [24] consider the general case, $k \ge 2$.

The problem of maximizing $E(S_n)$ is equivalent to minimizing the regret [24]

$$R_n = nL^* - E(S_n) = \sum_{i:\,L_i < L^*} (L^* - L_i)\, E\,T_n(i), \tag{6}$$

where $T_n(i)$ is the number of times, within the first $n$ observations, that $\phi$ samples from population $i$.

Lai and Robbins [24] call a decision rule asymptotically efficient if its regret satisfies $R_n = o(n^{\beta})$ for every $\beta > 0$. They show how to construct asymptotically efficient rules that asymptotically satisfy

$$E\,T_n(i) \sim \frac{\log n}{I(\theta^{(i)}; \theta^*)} \tag{7}$$

for each $\theta^{(i)}$ such that $L_i < L^*$, where $I(\theta, \lambda)$ denotes the Kullback-Leibler number

$$I(\theta, \lambda) = \int \left[\log \frac{f(y; \theta)}{f(y; \lambda)}\right] f(y; \theta)\, d\nu(y).$$

The interpretation of (7) is that an asymptotically efficient rule takes about $(\log n)/I(\theta^{(i)}; \theta^*)$ observations from an inferior population. A consequence of (6) and (7) is that such rules satisfy

$$R_n \sim \left\{ \sum_{i:\,L_i < L^*} \frac{L^* - L_i}{I(\theta^{(i)}; \theta^*)} \right\} \log n, \tag{8}$$

which provides a measure of expected long run performance.
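As a worked illustration of the constant multiplying $\log n$ in (8), the following Python sketch assumes Gaussian populations with a common known standard deviation. In that case the Kullback-Leibler number between $N(\mu_a, \sigma^2)$ and $N(\mu_b, \sigma^2)$ is $(\mu_a - \mu_b)^2/(2\sigma^2)$. The Gaussian assumption and the function names are illustrative and are not part of the general result.

```python
import math

def kl_gaussian(mu_a, mu_b, sigma):
    """Kullback-Leibler number between N(mu_a, sigma^2) and N(mu_b, sigma^2)."""
    return (mu_a - mu_b) ** 2 / (2.0 * sigma ** 2)

def regret_constant(means, sigma):
    """Sum over inferior populations of (L* - L_i) / I(theta_i; theta*), as in (8).

    Maximization convention: L* is the largest mean.
    """
    best = max(means)
    return sum((best - m) / kl_gaussian(m, best, sigma) for m in means if m < best)

def approximate_regret(means, sigma, n):
    """Leading-order regret after n observations: constant * log n."""
    return regret_constant(means, sigma) * math.log(n)
```

With means 1.0, 0.5, and 0.0 and unit standard deviation, the two inferior populations contribute 0.5/0.125 = 4 and 1.0/0.5 = 2, so the constant is 6. Populations whose means are close to the best contribute the most, since they are the hardest to distinguish from it.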

Lai [25] discusses generalizations of the problem studied in Ref. [24] and discusses applications to engineering and economics.

Random Search

Optimization by random or stochastic search is similar in spirit to the method of sequential design. Random search algorithms produce a sequence $\hat\theta_1, \hat\theta_2, \hat\theta_3, \ldots, \hat\theta_n$ of estimated solutions to Equation (2) such that the next estimate $\hat\theta_{n+1}$ is obtained by searching for candidate points “near” $\hat\theta_n$, according to some specific random procedure. The goal is to obtain estimates $\hat\theta_n$ that converge, in some sense, to the optimizing set $\Theta^*$.

In general terms, random search iteratively selects a candidate value for updating the current solution estimate. Candidate values are chosen from the set of “neighbors” of the current value. The candidate value is compared to the current solution estimate in some particular manner and, depending on the outcome of the comparison, the candidate value either replaces the current solution estimate (and becomes the new value of the estimate) or is discarded, in which instance the current value of the estimate is retained. The procedure is repeated to obtain the next estimate and, thereby, generate a sequence of solution estimates.

Search methods differ in several ways: the set of points that are considered neighbors of a given point, the random mechanism for selecting candidate values from among the neighbors of a given point, the procedure for comparing the candidate and current values of the estimates, and the criterion for accepting the candidate as the new solution estimate. The convergence properties of the solution estimates vary depending on the method.

For each $\theta$ in $\Theta$, let $N(\theta)$ denote the set of neighbors of $\theta$, from which solution candidates are chosen when the current solution estimate is $\theta$. A candidate value $\theta'$ is chosen from $N(\theta)$ with (conditional) probability $R(\theta, \theta')$. The next step is to decide whether to accept the candidate as the new solution estimate or reject it, in which instance there is no change and the current estimate is taken to be the new estimate.

The sets of neighbors, it is assumed, are constructed in such a way that all points in $\Theta$ are reachable from one another. (This assumption ensures that every point in $\Theta$ can be chosen as a candidate during the search for the optimum.) For a point $\theta'$ to be reachable from $\theta$, there must be a finite sequence $\theta_0, \theta_1, \ldots, \theta_l$ in $\Theta$, for some integer $l \ge 1$, such that $\theta_0 = \theta$, $\theta_l = \theta'$ and, for $i = 0, 1, \ldots, l-1$, the point $\theta_{i+1}$ belongs to the set of neighbors of $\theta_i$. The smallest such $l$ is the distance between $\theta$ and $\theta'$.

The function $R(\cdot, \cdot)$, called the transition probability for $\Theta$, satisfies

$$\sum_{\theta' \in N(\theta)} R(\theta, \theta') = 1, \tag{9}$$

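and $R(\theta, \theta') > 0$ if and only if $\theta' \in N(\theta)$. The transition probability for $\Theta$ and the neighbor sets are assumed to be symmetric, that is, $R(\theta, \theta') = R(\theta', \theta)$ and $\theta' \in N(\theta)$ if and only if $\theta \in N(\theta')$, for all $\theta, \theta'$ belonging to $\Theta$. It is sometimes convenient (as in Refs. [27–29]) to relax the requirement that $R$ be symmetric and define it in terms of an unnormalized symmetric function $R'$, where

$$R(\theta, \theta') = \frac{R'(\theta, \theta')}{D(\theta)} \tag{10}$$

and the denominator $D(\theta) = \sum_{\theta' \in N(\theta)} R'(\theta, \theta') > 0$ for each $\theta$ in $\Theta$. A simple example of an unnormalized function is $R'(\theta, \theta') = 1$ if $\theta'$ belongs to $N(\theta)$ and $R'(\theta, \theta') = 0$ otherwise. The denominator $D(\theta)$ is then equal to the number of points in $N(\theta)$.

The estimates $\hat\theta_n$ in general random search algorithms are obtained as follows: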

General Random Search Algorithm

Step 1: Pick an initial value $\hat\theta_0$ for the estimate. Initialize counter: $n = 0$.

Step 2: Given $\hat\theta_n$, choose a candidate new value for the solution, $\theta'_n \in N(\hat\theta_n)$, with probability $R(\hat\theta_n, \theta'_n)$.

Step 3: Using samples of $Y(\theta'_n)$ and possibly $Y(\hat\theta_n)$, given $\theta'_n$ and $\hat\theta_n$, decide whether to accept or reject $\theta'_n$ as the new estimate. If accepted, put $\hat\theta_{n+1} = \theta'_n$; otherwise $\hat\theta_{n+1} = \hat\theta_n$.

Step 4: Increment counter, $n = n + 1$, and go to step 2.
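The four steps above can be sketched in Python as a generic loop with a pluggable acceptance rule. This is a minimal sketch under stated assumptions: candidates are drawn uniformly from the neighbor set, the feasible region in the example is the integers 0 to 10 with neighbors at distance 1, and the names `random_search`, `line_neighbors`, and `make_mean_comparison` are hypothetical.

```python
import random

def random_search(theta0, neighbors, accept, n_iter, rng):
    """Generic random search: propose a neighbor, then accept or keep the current estimate."""
    theta_hat = theta0                                  # Step 1
    for n in range(n_iter):
        candidate = rng.choice(neighbors(theta_hat))    # Step 2 (uniform over N(theta_hat))
        if accept(candidate, theta_hat, n):             # Step 3
            theta_hat = candidate
    return theta_hat                                    # Step 4 is the loop increment

def line_neighbors(theta, lo=0, hi=10):
    """Neighbors of an integer point on the segment lo..hi (assumed example structure)."""
    return [t for t in (theta - 1, theta + 1) if lo <= t <= hi]

def make_mean_comparison(sample_y, n_samples):
    """Acceptance rule that compares sample means of noisy measurements (one simple choice)."""
    def accept(candidate, current, n):
        mean_cand = sum(sample_y(candidate) for _ in range(n_samples)) / n_samples
        mean_curr = sum(sample_y(current) for _ in range(n_samples)) / n_samples
        return mean_cand < mean_curr
    return accept
```

The stochastic ruler, stochastic comparison, and simulated annealing methods discussed next differ mainly in the acceptance rule used in step 3.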

Several methods that highlight random search approaches are the stochastic ruler, stochastic comparison, and simulated annealing (SAN) with noise.

Stochastic Ruler. Yan and Mukai [30] convert problem (2) into a maximization problem in which candidate solutions $\theta$ are evaluated by comparing estimates of their loss function values $L(\theta)$ with an absolute scale. Candidate values are accepted if, with high probability, their loss function estimates are small compared with the scale value. The “scale” against which comparisons are made is a uniformly distributed random variable.

Yan and Mukai [30] showed that if $U$ is uniformly distributed over $(a, b)$, then every solution to the following maximization problem solves the original minimization problem (2):

$$\max\; P\{Y(\theta) \le U\}, \quad \theta \in \Theta, \tag{11}$$

provided that $(a, b)$ is sufficiently large. The only assumption here, in addition to $\Theta$ being a finite set, is that, for each $\theta \in \Theta$, $Y(\theta)$ is independent of $U$ and has a finite second moment. In addition, if the $Y(\theta)$'s are uniformly bounded random variables, then $a$ and $b$ can be taken to be any pair of lower and upper bounds.

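In step 3 of the general random search algorithm, the original stochastic ruler method [30] uses estimates of the probability that $Y(\theta') \le U$, given the candidate $\theta'_n = \theta'$ and the current estimate $\hat\theta_n = \theta$, in the decision to accept or reject the candidate. More precisely, given $\hat\theta_n = \theta$ and $\theta'_n = \theta'$, define $\hat\theta_{n+1}$ so that

$$\hat\theta_{n+1} = \begin{cases} \theta', & \text{with probability } p_n(\theta'), \\ \theta, & \text{with probability } 1 - p_n(\theta'), \end{cases} \tag{12}$$

where

$$p_n(\theta') = \left(P\{Y(\theta') \le U\}\right)^{M_n} \tag{13}$$

and $M_n$ is a fixed positive integer depending on $n$. The probability $p_n$ is estimated using up to $M_n$ samples of $Y$.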

The sequence $\{\hat\theta_n\}$ is a nonstationary, irreducible Markov chain with state space $\Theta$. Yan and Mukai [30] showed that $\hat\theta_n$ converges to the set $\Theta^*$ in probability, that is,

$$\lim_{n \to \infty} P\{\hat\theta_n \in \Theta^*\} = 1, \tag{14}$$

if the sequence of positive integers $M_n \to \infty$ at rate $\log n$. In fact [Thm. 7.1 of Ref. 30],

$$M_n \sim C \log n \quad \text{as } n \to \infty, \tag{15}$$

where the constant $C > 0$ depends on the distances between points and the comparison probabilities $P\{Y(\theta) \le U\}$, $\theta \in \Theta$.
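One way to realize an acceptance probability of the form (13) without knowing $P\{Y(\theta') \le U\}$ is to run up to $M_n$ independent comparisons of a fresh measurement against a fresh uniform draw and to accept only if every comparison succeeds. The Python sketch below follows that idea. The function names and the particular logarithmic schedule for $M_n$ are assumptions for illustration, not the constants of Ref. [30].

```python
import math
import random

def ruler_test(sample_y, theta, a, b, m, rng):
    """Accept theta only if m independent measurements each fall at or below a fresh U ~ Uniform(a, b).

    Each comparison succeeds with probability P{Y(theta) <= U}, so all m succeed
    with that probability raised to the power m. Stops at the first failure,
    so at most m measurements are used.
    """
    for _ in range(m):
        if sample_y(theta) > rng.uniform(a, b):
            return False
    return True

def num_tests(n, c=1.0):
    """Hypothetical schedule for M_n that grows like c * log n."""
    return max(1, math.ceil(c * math.log(n + 2)))
```

A function such as `ruler_test` can serve as the acceptance rule in step 3 of the general random search algorithm, with `m = num_tests(n)` at iteration `n`.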

An obvious drawback of the stochastic ruler method is that $M_n$ increases with $n$, which implies that the estimate of $p_n$ in (13) requires an increasing number of samples of $Y$ as the iteration count grows. Alrefaei and Andradottir [27–29] developed modified versions of the stochastic ruler method in which $M_n$ in (13) is constant. (In Ref. [28], the feasible region $\Theta$ is allowed to be countably infinite.)

The modified stochastic ruler methods differ from the original method of Yan and Mukai [30] in two key ways. First, the acceptance probability in the modified method is

$$p_n(\theta') = \left(P\{Y(\theta') \le U\}\right)^{M} \tag{16}$$

where $M$ is a fixed positive integer. Thus, unlike the acceptance probability (13) in the original method, the number of samples of $Y$ required to estimate the probability $p_n$ does not grow with $n$.

Second, the modified methods introduce an intermediate step in the decision to accept or reject a candidate solution. This decision generates a sequence $\{\tilde\theta_n\}$, which is then used to define the estimates $\hat\theta_n$ as a subsequence of $\{\tilde\theta_n\}$. To be explicit, initialize the algorithm with $\tilde\theta_0 \in \Theta$ and suppose that $\tilde\theta_n$ has been defined. Given $\tilde\theta_n$, choose $\theta'_n$ from $N(\tilde\theta_n)$ with probability $R(\tilde\theta_n, \theta'_n)$. Given $\theta'_n = \theta'$ and $\tilde\theta_n = \tilde\theta$, the next value $\tilde\theta_{n+1}$ is defined by putting

$$\tilde\theta_{n+1} = \begin{cases} \theta', & \text{with probability } p_n(\theta'), \\ \tilde\theta, & \text{with probability } 1 - p_n(\theta'). \end{cases} \tag{17}$$

The estimates $\hat\theta_n$ are obtained as an embedded Markov chain that is defined by extracting a particular subsequence of $\{\tilde\theta_n\}$. Put $\hat\theta_0 = \tilde\theta_0$. Suppose that $\hat\theta_n$ has been defined. The decision whether to choose $\tilde\theta_{n+1}$ as the next value $\hat\theta_{n+1}$ or to take $\hat\theta_{n+1}$ to be the current value $\hat\theta_n$ is based on the number of times that the Markov chain visits the two state values through the first $n$ iterations. To be more precise, let $V_n(\theta)$ denote the number of times the Markov chain $\{\tilde\theta_n\}$ “visits” state $\theta$ in the first $n$ steps. Thus, $V_n(\theta) = \#(\{i : \tilde\theta_i = \theta,\ i = 1, 2, 3, \ldots, n\})$. Put

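$$\hat\theta_{n+1} = \begin{cases} \tilde\theta_{n+1}, & \text{if } V_{n+1}(\tilde\theta_{n+1})/D(\tilde\theta_{n+1}) > V_{n+1}(\hat\theta_n)/D(\hat\theta_n), \\ \hat\theta_n, & \text{otherwise.} \end{cases} \tag{18}$$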

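The function $D$, recall, is defined in Equation (10). State visits $V_n$ are defined in terms of $\{\tilde\theta_n\}$ as follows. For $n = 0$, set $V_0(\theta) = 1$ if $\theta = \tilde\theta_0$; otherwise, $V_0(\theta) = 0$. Having defined $V_n(\theta)$, for $n \ge 1$, let

$$V_{n+1}(\theta) = \begin{cases} V_n(\theta) + 1, & \text{if } \theta = \tilde\theta_{n+1}, \\ V_n(\theta), & \text{otherwise.} \end{cases}$$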

Alrefaei and Andradottir [29] considered variants of (18), which differ depending on how $V_n$ is defined. Each of the modified stochastic ruler methods defines $\{\hat\theta_n\}$ to be some subsequence of $\{\tilde\theta_n\}$.

The modified stochastic ruler has stronger convergence properties than the original. The modified method produces a stationary Markov chain such that [29]

$$P\left\{\lim_{n \to \infty} \hat\theta_n \in \Theta^*(a, b)\right\} = 1. \tag{19}$$

section. In addition, see [29] for rate of convergence results.

Stochastic Comparison. Stochastic comparison [32] is similar to the stochastic ruler method in that it replaces the minimization problem [Equation (2)] with a maximization problem that involves comparing random variables. In this method, each $Y(\theta)$ is compared, in probability, with all other $Y(\theta')$, $\theta' \ne \theta$. This comparison probability, denoted $sp(\theta)$, is called the sigma-probability corresponding to $\theta$. Any $\theta$ that maximizes the sigma-probability over $\Theta$ solves the minimization problem [Equation (2)].

The sigma-probability is defined by

$$sp(\theta) = \sum_{\theta' \ne \theta} P\{Y(\theta) < Y(\theta')\}, \quad \theta \in \Theta. \tag{20}$$

Gong et al. [32] showed that, if the measurement noises $\varepsilon(\theta)$ in (1) are i.i.d., zero mean, and have symmetric continuous probability density functions for every $\theta$ in $\Theta$, then any point $\theta$ that maximizes $sp(\theta)$ over $\Theta$ solves the minimization problem [Equation (2)] and conversely.
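The equivalence can be checked numerically in a simple special case. The Python sketch below assumes independent Gaussian noise with a common standard deviation $\sigma$, for which $P\{Y(\theta) < Y(\theta')\} = \Phi\big((L(\theta') - L(\theta))/(\sigma\sqrt{2})\big)$, where $\Phi$ is the standard normal distribution function. The Gaussian assumption, the small example, and the function names are illustrative only.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sigma_probability(theta, losses, sigma):
    """sp(theta) as in (20) for independent N(0, sigma^2) noise (assumed noise model).

    losses maps each feasible point to its true loss L(theta).
    """
    return sum(
        normal_cdf((losses[other] - losses[theta]) / (sigma * math.sqrt(2.0)))
        for other in losses
        if other != theta
    )

# Hypothetical three-point feasible region.
losses = {"a": 0.0, "b": 1.0, "c": 2.0}
sp = {theta: sigma_probability(theta, losses, sigma=1.0) for theta in losses}
best_by_sp = max(sp, key=sp.get)
best_by_loss = min(losses, key=losses.get)
```

In this example the point with the smallest loss also has the largest sigma-probability, as the result of Gong et al. predicts. In practice the probabilities in (20) are unknown and must be estimated from samples.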

The implementation steps of stochastic comparison are the same as those of the stochastic ruler but with a different acceptance probability. Given $\hat\theta_n = \theta$ and $\theta'_n = \theta'$, the following acceptance probability replaces (13):

$$p_n(\theta') = \left(P\{Y(\theta') < Y(\theta)\}\right)^{M_n}. \tag{21}$$

As in (13), this probability is estimated using up to $M_n$ samples of $Y(\theta')$ and $Y(\theta)$.

Stochastic comparison yields a sequence of estimates $\hat\theta_n$ that converge in probability to $\Theta^*$. The proof of convergence [32] is along the same lines as that for the stochastic ruler. (See [31] for rate of convergence analysis.)

Simulated Annealing. SAN for objective functions with noise is considered in [Section 8.2.2 of Ref. 26] and Refs. [33–37]. In terms of the general random search algorithm, SAN for noisy loss functions takes the following form. Given $\hat\theta_n = \theta$, choose $\theta'_n$ from $N(\theta)$ with probability $R(\theta, \theta'_n)$ in step 2. The decision step 3 is, given $\hat\theta_n = \theta$ and $\theta'_n = \theta'$,

$$\hat\theta_{n+1} = \begin{cases} \theta', & \text{with probability } p_n(\theta, \theta'), \\ \theta, & \text{with probability } 1 - p_n(\theta, \theta'). \end{cases} \tag{22}$$

In this instance, the acceptance probability is

$$p_n(\theta, \theta') = \exp\left(-\frac{[\hat L(\theta') - \hat L(\theta)]^+}{T_n}\right), \tag{23}$$

where $T_n$ is the usual temperature parameter, $\hat L(\theta')$ and $\hat L(\theta)$ are sample means of $Y(\theta')$'s and $Y(\theta)$'s, respectively, based on samples of sizes, say, $k'_n$ and $k_n$, respectively, and $\alpha^+ = \max\{\alpha, 0\}$. The acceptance probability (23) can be written as

$$p_n(\theta, \theta') = \exp\left(-\frac{[L(\theta') - L(\theta) + w_n]^+}{T_n}\right), \tag{24}$$

where $w_n$ is the difference of the noise terms in the estimates of $L(\theta')$ and $L(\theta)$.

The assumption in Refs. [33–35] on the transition probability $R(\cdot, \cdot)$ and temperature sequence $\{T_n\}$ is that convergence holds for the “undisturbed” SAN sequence, that is, the sequence of estimates $\hat\theta_n$ obtained by setting $w_n = 0$ in (24).

SAN algorithms differ depending on assumptions made about the temperature sequence $\{T_n\}$, the noise in (1) [and, hence, in (24)], and the estimates of the objective function values. For example, Gelfand and Mitter [33] showed that, if $T_n \to 0$ and if the noise $w_n$ in (24) is Gaussian with mean zero and standard deviation $\sigma_n = o(T_n)$, then $\hat\theta_n$ converges to $\Theta^*$ in probability if and only if convergence in probability holds for the “undisturbed” SAN sequence. A consequence of the assumption $\sigma_n = o(T_n)$ is that $k_n, k'_n \to \infty$, so that the number of samples required for the estimates $\hat L$ in (23) increases with iteration.
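The acceptance step with estimated losses can be sketched in Python as follows. The acceptance probability is the exponential of minus the positive part of the estimated loss increase divided by the temperature. The function names are hypothetical, and the sketch makes no attempt to choose the temperature or sample sizes so as to satisfy the convergence conditions discussed above.

```python
import math
import random

def acceptance_probability(l_hat_candidate, l_hat_current, temperature):
    """exp(-[L_hat(candidate) - L_hat(current)]^+ / T), with [x]^+ = max(x, 0)."""
    increase = max(l_hat_candidate - l_hat_current, 0.0)
    return math.exp(-increase / temperature)

def san_step(current, candidate, sample_mean, temperature, rng):
    """One accept/reject decision; sample_mean(theta) returns an estimate of L(theta)."""
    p = acceptance_probability(sample_mean(candidate), sample_mean(current), temperature)
    return candidate if rng.random() < p else current
```

A candidate whose estimated loss is no larger than the current one is always accepted, while a worse candidate is accepted with a probability that shrinks as the temperature decreases.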


In other words, Thm. 4.1(ii) of Ref. 35 assumes that

$$P\{|w_n| \ge t\} \le 1 - \int_{-t}^{t} \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left(-\frac{x^2}{2\sigma_n^2}\right) dx$$

for each $t \ge 0$ (i.e., the tail probabilities of the $N(0, \sigma_n^2)$ distribution dominate those of $w_n$).

The SAN algorithms by Alrefaei and Andradottir [37] seem to be the most practical in terms of implementation and yield strong convergence. They introduced modified SAN algorithms that relax the assumption that $T_n \to 0$, requiring only that the temperature $T$ be a (strictly) positive constant. Their algorithms are similar to their embedded Markov chain algorithms for the stochastic ruler in Refs. [27–29]. The modified SAN algorithms in Ref. [37] produce estimates that converge a.s. to the set of loss function optimizers. The rate of convergence of this algorithm is considered in Ref. [36], which also examines the convergence rate for the modified stochastic ruler method.

Stochastic Approximation

Stochastic approximation (SA) refers to a class of recursive algorithms for solving two types of problems: finding roots and minima (or maxima) of noisy objective functions, where $\Theta$ is a subset of $\mathbb{R}^d$. There is a well-established SA literature for continuous parameter problems, beginning with the seminal works of Robbins and Monro [38] and Kiefer and Wolfowitz [39]. (See Chaps. 4–7 of Ref. 26 or Ref. 40 for surveys of SA methods.)

The general form of SA is

$$\hat\theta_{n+1} = \hat\theta_n - a_n \hat g_n(\hat\theta_n). \tag{25}$$

In continuous parameter minimization, the loss function $L$ is assumed to be differentiable and $\hat g_n(\theta)$ is an estimate of its gradient $g(\theta)$. For the problem of finding the root of an unknown function, which is the classical SA setting of Robbins and Monro [38], the term $\hat g_n(\theta)$ is an estimate of the unknown function and is assumed to be unbiased. In all SA procedures, the gain sequence $\{a_n\}$ is a sequence of positive numbers such that $a_n \to 0$ and

$$\sum_n a_n = \infty. \tag{26}$$

In Robbins-Monro SA, the gain sequence also satisfies

$$\sum_n a_n^2 < \infty. \tag{27}$$

Kiefer-Wolfowitz SA, where the goal is to minimize an unknown function, imposes additional assumptions on the gain sequence. In this latter SA setting, the gradient estimate $\hat g_n(\theta)$ is biased and is obtained from noisy estimates of the loss.
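The recursion (25) with a gain sequence satisfying (26) and (27) can be sketched in Python for a scalar root-finding problem. The gains $a_n = a/(n+1)$ are one standard choice that meets both conditions. The linear test function, the Gaussian noise, and the function name are assumptions for illustration.

```python
import random

def robbins_monro(noisy_g, theta0, n_iter, a=1.0):
    """Iterate theta <- theta - a_n * g_hat(theta) with gains a_n = a / (n + 1).

    These gains tend to zero, their sum diverges, and the sum of their squares converges.
    """
    theta = theta0
    for n in range(n_iter):
        gain = a / (n + 1)
        theta = theta - gain * noisy_g(theta)
    return theta

# Hypothetical example: find the root of g(theta) = theta - 2 from noisy evaluations.
rng = random.Random(0)
estimate = robbins_monro(lambda t: (t - 2.0) + rng.gauss(0.0, 1.0), theta0=10.0, n_iter=10000)
```

For this particular linear function and gain, the iterate after $n$ steps equals the root minus the average of the noise terms, so the error decays like the error of a sample mean.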

If the loss function has a unique minimizer, then the minimizing point $\theta^*$ is a root of the gradient. The goal, then, is to obtain a sequence $\{\hat\theta_n\}$ that converges a.s. or, more generally, in probability to the minimizing value of the loss.

Dupač and Herkenrath [41] considered the problem of finding the root of an unknown vector-valued function $g(\theta)$, defined on a lattice type discrete set $\Theta$ in $\mathbb{R}^d$ such as $\mathbb{Z}^d$. (A discrete subset of $\mathbb{R}^d$ is a lattice if it is an infinite set and the distance between pairs of points that are nearest neighbors is bounded by a fixed, positive constant.) The problem is to find a point in $\Theta$ that minimizes $|g(\theta)|$ using only noisy measurements of $g(\theta)$. (In the continuous parameter setting, the problem is to find a root of $g(\theta)$.) They apply their root finding procedure to the problem of optimizing real-valued functions defined on a lattice.

The general root-finding approach in Ref. [41] is to transform the discrete problem into a continuous one by interpolating the discrete function to a continuous function defined over $\mathbb{R}^d$ and, then, apply the Robbins-Monro SA procedure to find a root of the interpolated function. The resulting SA sequence $\{\hat\theta_n\}$ does not necessarily lie in the lattice $\Theta$. From this sequence, Ref. [41] constructs a lattice-valued sequence $\{\tilde\theta_n\}$ that has convergence properties similar to the original. In particular, when $\{\hat\theta_n\}$ converges to $\theta^*$, so does $\{\tilde\theta_n\}$ and in the same mode.

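The lattice-valued sequence $\{\tilde\theta_n\}$ has a simple form when $\Theta$ is the lattice $\mathbb{Z}$. Each $\hat\theta_n$ satisfies the inequality $\lfloor\hat\theta_n\rfloor \le \hat\theta_n \le \lceil\hat\theta_n\rceil$. Moreover, there are unique scalars $\lambda_1, \lambda_2 \ge 0$ (depending on $\hat\theta_n$) such that $\lambda_1 + \lambda_2 = 1$ and $\hat\theta_n = \lambda_1 \lfloor\hat\theta_n\rfloor + \lambda_2 \lceil\hat\theta_n\rceil$. Put

$$\tilde\theta_n = \begin{cases} \lfloor\hat\theta_n\rfloor, & \text{with probability } \lambda_1, \\ \lceil\hat\theta_n\rceil, & \text{with probability } \lambda_2. \end{cases} \tag{28}$$

In other words, $\tilde\theta_n$ is a random “projection” of $\hat\theta_n$ onto the lattice.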
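A random projection of this kind onto $\mathbb{Z}$ amounts to randomized rounding: round down with probability $\lambda_1$ and up with probability $\lambda_2$, where $\lambda_2$ is the fractional part of the point. The Python sketch below assumes this reading. The function name is hypothetical.

```python
import math
import random

def random_projection(x, rng):
    """Round x down with probability lambda_1 and up with probability lambda_2.

    lambda_2 = x - floor(x) and lambda_1 = 1 - lambda_2, so that
    lambda_1 * floor(x) + lambda_2 * (floor(x) + 1) = x.
    """
    lower = math.floor(x)
    lam2 = x - lower
    return lower + 1 if rng.random() < lam2 else lower
```

The projected value is an integer whose expectation equals the original point, and an integer input is returned unchanged.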

The interpolations in Ref. [41] are piecewise linear interpolations constructed so that they satisfy the conditions of Robbins-Monro SA and have the same set of solutions as the functions that they interpolate. In the root finding problem, in particular, the estimate $\hat g_n(\theta)$ in (25) is defined so that it is an unbiased estimate of the interpolation $\bar g$ of the unknown function $g$. Thus,

$$E[\hat g_n(\hat\theta_n) \mid \hat\theta_n] = \bar g(\hat\theta_n). \tag{29}$$

Dupač and Herkenrath [41] showed that their root finding procedure converges exponentially fast when $\Theta$ is a lattice type set in $\mathbb{R}$. In the real discrete setting, their Corollary 1 implies that there is a positive constant $C$ such that

$$P(|\hat\theta_n - \theta^*| > 1) \le \exp(-Cn). \tag{30}$$

In terms of the lattice-valued projections $\tilde\theta_n$ of $\hat\theta_n$,

$$P(\tilde\theta_n \ne \theta^*) \le \exp(-Cn). \tag{31}$$

There are other examples of the use of continuous interpolation in discrete optimization, for example, Gokbayrak and Cassandras [42] (the “surrogate function” method) and Wang and Schmeiser [43]. Both Refs. [42] and [43] assume that the feasible region $\Theta$ is a subset of $\mathbb{R}^d$ and both linearly interpolate functions over simplexes. The resulting continuous optimization problem is then solved by means of standard SA methods. Finally, the estimates in the continuous parameter problem are projected onto the closest point in the discrete region to obtain a sequence of estimates for the solution of the original discrete problem. Thus, if $\hat\theta_n$ is the estimated solution to the continuous problem, then its projection $\tilde\theta_n$ onto $\Theta$ is taken to be the estimated solution to the original discrete problem.

The linear interpolations in Refs. [42] and [43] are essentially Lovász extensions of a function (Refs. [44] and [45]), which depend on the representation of points in a simplex. The Lovász extension of $L$ on the unit cube in $\mathbb{R}^d$ is obtained by linearly interpolating $L$ piecewise at the vertices of each simplex in a particular collection of simplexes. Each simplex has $d + 1$ vertices and is defined in terms of a permutation on $d$ integers. If $\tau$ is such a permutation (there are $d!$ unique permutations on $d$ integers), then the simplex $S_\tau$ defined by $\tau$ is the set consisting of all points $x = (x_1, x_2, \ldots, x_d)$ in the unit cube such that $x_{\tau(1)} \le x_{\tau(2)} \le x_{\tau(3)} \le \cdots \le x_{\tau(d)}$. Each point in a simplex can be uniquely expressed as a convex combination of the vertices. To be more precise, if the vertices of $S_\tau$ are denoted $v_0, v_1, v_2, \ldots, v_d$, then, for each $x$ in $S_\tau$, there are $d + 1$ scalars $\lambda_i \ge 0$, $i = 0, 1, 2, \ldots, d$, called weights, such that $\sum_{i=0}^{d} \lambda_i = 1$ and

$$x = \sum_{i=0}^{d} \lambda_i v_i. \tag{32}$$

Such representations are unique. Each point in the unit cube has a representation, as the $d!$ simplexes form a covering for the unit cube. The representation is extended to any point in $\mathbb{Z}^d$ in the obvious manner. The representation (32) leads to a natural extension of functions on $\mathbb{Z}^d$. If $f$ is a function defined on the vertices of the unit cube, then the simplex representation of points leads to the following natural extension:

$$\bar f(x) = \sum_{i=0}^{d} \lambda_i f(v_i), \tag{33}$$

where $x$ is given by (32).
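The representation (32) and the extension (33) on the unit cube can be computed by sorting the coordinates of $x$. The Python sketch below orders coordinates from largest to smallest, which is the reverse of the ordering used to define $S_\tau$ above and identifies the same simplex. The function names are hypothetical, and the sketch is restricted to points of the unit cube.

```python
def lovasz_representation(x):
    """Vertices v_0..v_d and weights lambda_0..lambda_d with x = sum_i lambda_i * v_i.

    x is a point of the unit cube [0, 1]^d, given as a sequence of floats.
    """
    d = len(x)
    order = sorted(range(d), key=lambda j: x[j], reverse=True)  # largest coordinate first
    vertices = [tuple([0] * d)]              # v_0 is the origin
    weights = [1.0 - x[order[0]]]            # lambda_0
    current = [0] * d
    for rank, j in enumerate(order):
        current[j] = 1                       # switch on the next-largest coordinate
        next_value = x[order[rank + 1]] if rank + 1 < d else 0.0
        vertices.append(tuple(current))
        weights.append(x[j] - next_value)    # gap between consecutive sorted coordinates
    return vertices, weights

def lovasz_extension(f, x):
    """Piecewise linear extension of f from the cube vertices to x, as in (33)."""
    vertices, weights = lovasz_representation(x)
    return sum(w * f(v) for v, w in zip(vertices, weights))
```

For example, the point $(0.2, 0.7)$ is represented by the vertices $(0,0)$, $(0,1)$, and $(1,1)$ with weights $0.3$, $0.5$, and $0.2$. At a vertex of the cube the extension returns the value of $f$ at that vertex.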

Gerencsér et al. [46,47], Hill et al. [48,49], and Hill [50] introduced an optimization approach similar to Refs. [2] and [43], where the discrete region $\Theta$ is $\mathbb{Z}^d$. In Refs. [48–50], the objective function is assumed to be integer convex and separable [14]. The implication of this assumption is that the discrete loss function can be viewed as the restriction of a piecewise linear, separable convex function defined on all of $\mathbb{R}^d$. Thus, the resulting SA algorithm is a Robbins-Monro type procedure for convex optimization (see Sec. 5.6 of Ref. [51] and Ref. [52]).

used in the search for the optimum. References [48–50] use efficient estimates of the subgradient to obtain a computationally efficient SA algorithm. The subgradient estimates use the computationally efficient simultaneous perturbation (SP) approximation developed by Spall [53] in his simultaneous perturbation stochastic approximation (SPSA) algorithm for continuous parameter optimization. This type of subgradient estimate, which leads to discrete SPSA (DSPSA), is the key difference between Refs. [48–50] and other works that rely on continuous interpolation of the loss function.

For scalar $\theta$, the SP subgradient estimate is [see Equation 16 of Ref. 50, making the obvious substitutions for $\theta$, $c$, and $\Delta$]

$$\hat g_n(\hat\theta_n) = \frac{\bar Y(\hat\theta_n + c_n \Delta_n) - \bar Y(\hat\theta_n - c_n \Delta_n)}{2 c_n \Delta_n}, \tag{34}$$

where $\bar Y$ is a piecewise linear interpolation of $Y$, $c_n > 0$, and $\Delta_n$ is a $\pm 1$ Bernoulli random variable independent of the measurement noise. For small enough $c_n$, this subgradient estimate reduces to

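$$\hat g_n(\hat\theta_n) = \begin{cases} \dfrac{Y(\hat\theta_n + 1) - Y(\hat\theta_n - 1)}{2}, & \text{when } \hat\theta_n \text{ is an integer}, \\[4pt] Y(\lfloor \hat\theta_n \rfloor + 1) - Y(\lfloor \hat\theta_n \rfloor), & \text{otherwise}, \end{cases} \tag{35}$$

where $\lfloor \theta \rfloor$ is the vector consisting of the floor of each component of $\theta$. The subgradient estimate in (34) has the obvious extension to vector-valued $\theta$: replace the piecewise linear interpolation of $Y$ by its Lovász extension. Then (34) becomes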

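$$\hat g_n(\hat\theta_n) = \frac{\bar Y(\hat\theta_n + c_n \Delta_n) - \bar Y(\hat\theta_n - c_n \Delta_n)}{2 c_n}\, \Delta_n^{-1}, \tag{36}$$

where $\bar Y$ denotes the Lovász extension of $Y$ and $\Delta_n^{-1}$ is formed by taking inverses component-wise.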
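As an illustration, the estimate (36) might be coded as follows. This is a sketch under stated assumptions, not the implementation used in the cited works: the helper names `lovasz_extension` and `sp_subgradient` are hypothetical, the extension is computed by sorting fractional parts in decreasing order, and the loss in the demonstration is a noise-free separable quadratic chosen only so that the result can be checked by hand.

```python
from itertools import product

import numpy as np

def lovasz_extension(Y, theta):
    """Piecewise linear extension of Y (defined on integer points) to real theta."""
    theta = np.asarray(theta, dtype=float)
    base = np.floor(theta)
    frac = theta - base
    order = np.argsort(-frac, kind="stable")
    f = np.concatenate(([1.0], frac[order], [0.0]))
    weights = f[:-1] - f[1:]              # d + 1 weights, nonnegative, summing to one
    v = base.copy()
    total = weights[0] * Y(v.copy())
    for j, w in zip(order, weights[1:]):
        v[j] += 1.0                       # move to the next vertex of the simplex
        total += w * Y(v.copy())
    return total

def sp_subgradient(Y, theta, c, delta):
    """Simultaneous perturbation estimate in the form of (36)."""
    theta = np.asarray(theta, dtype=float)
    delta = np.asarray(delta, dtype=float)
    diff = lovasz_extension(Y, theta + c * delta) - lovasz_extension(Y, theta - c * delta)
    return diff / (2.0 * c) / delta       # division by delta is the component-wise inverse

if __name__ == "__main__":
    # Hypothetical noise-free separable convex loss, for checking only.
    loss = lambda v: float((v[0] - 3.0) ** 2 + (v[1] + 1.0) ** 2)
    theta = np.array([0.5, 2.25])
    estimates = [sp_subgradient(loss, theta, 0.125, np.array(d, dtype=float))
                 for d in product([-1.0, 1.0], repeat=2)]
    # Averaging over all perturbation directions gives the slopes of the
    # interpolated loss in the cell containing theta.
    print(np.mean(estimates, axis=0))
```

A single estimate generally differs from the subgradient because of cross terms between components of the perturbation; those terms cancel on average when the components are independent and symmetric, which is what the exhaustive average in the demonstration shows.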

It should be noted here that the subgradient estimate (36) differs from that in Refs. [48–50]. The SA algorithm in those works contains an incorrect subgradient estimate, which can lead to the nonconvergence of $\hat\theta_n$ to $\Theta^*$ that was noted in Ref. [54]. The subgradient estimate in Refs. [48–50] for scalar $\theta$ reduces to $\hat g(\hat\theta) = Y(\lfloor \hat\theta \rfloor) - Y(\lfloor \hat\theta \rfloor - 1)$, which has expected value $L(\lfloor \hat\theta \rfloor) - L(\lfloor \hat\theta \rfloor - 1)$. At noninteger values of $\hat\theta$, the difference $L(\lfloor \hat\theta \rfloor) - L(\lfloor \hat\theta \rfloor - 1)$ is not a subgradient of the (linear) interpolation of $L$. The correct subgradient at noninteger $\hat\theta$ is given by $L(\lfloor \hat\theta \rfloor + 1) - L(\lfloor \hat\theta \rfloor)$, as the interpolation of $L$ is differentiable between successive integer points and subgradients reduce to the derivative where the latter exists.

The subgradient approximation (36) requires $d+1$ evaluations of the loss for each $\theta$. An alternative is to use a randomized Lovász extension to define $\bar Y(\theta)$ for arbitrary $\theta$. (The idea is similar to that used in Ref. [41] to project the estimates $\hat\theta_n$ onto the lattice.) For each $\theta$ in $\mathbb{R}^d$, let $v_0, v_1, v_2, \ldots, v_d$ be the vertices and $\lambda_0, \lambda_1, \lambda_2, \ldots, \lambda_d$ the weights defined by the Lovász extension. Let $\tilde\theta = v_i$ with probability $\lambda_i$, and set $\bar Y(\theta) = Y(\tilde\theta)$. Thus, the random variable $\bar Y(\theta)$ requires only a single evaluation of $Y$ for each $\theta$, which implies that the subgradient estimate in Equation 36 requires only two evaluations of $Y$. Note that the randomized extension has a mixture probability distribution given by

$$P\{\bar Y(\theta) \in A\} = \sum_{i=0}^{d} \lambda_i P\{Y(v_i) \in A\}, \tag{37}$$

where $A$ is a measurable set of real numbers. It follows that

$$E[\bar Y(\theta)] = L(\theta). \tag{38}$$
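The randomized extension and the identity (38) can be illustrated with a short simulation. The sketch below is illustrative only: `simplex_vertices_and_weights` and `randomized_extension` are hypothetical names, the vertices and weights are obtained by sorting fractional parts in decreasing order, and the loss is a noise-free toy function, so that the sample mean can be compared with the weighted sum of vertex values, that is, with $L$ extended to noninteger $\theta$ as in (33).

```python
import numpy as np

def simplex_vertices_and_weights(theta):
    """Vertices v_0..v_d and weights lambda_0..lambda_d for a real-valued theta."""
    theta = np.asarray(theta, dtype=float)
    base = np.floor(theta)
    frac = theta - base
    order = np.argsort(-frac, kind="stable")
    f = np.concatenate(([1.0], frac[order], [0.0]))
    weights = f[:-1] - f[1:]
    steps = np.zeros((theta.size + 1, theta.size))
    for k, j in enumerate(order, start=1):
        steps[k:, j] = 1.0
    return base + steps, weights

def randomized_extension(Y, theta, rng):
    """Draw one vertex with probability lambda_i and evaluate Y there (one evaluation)."""
    vertices, weights = simplex_vertices_and_weights(theta)
    i = rng.choice(len(weights), p=weights)
    return Y(vertices[i])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    loss = lambda v: float(v[0] ** 2 + 2.0 * v[1])   # hypothetical toy loss
    theta = [0.25, 0.5]
    verts, lam = simplex_vertices_and_weights(theta)
    exact = sum(w * loss(v) for v, w in zip(verts, lam))
    draws = [randomized_extension(loss, theta, rng) for _ in range(5000)]
    print(exact, np.mean(draws))   # the sample mean should be close to the exact value
```

The trade-off is the usual one: each call costs a single evaluation of $Y$ instead of $d+1$, at the price of additional variance from the random choice of vertex.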


discrete SA procedure and compares its rate of convergence with the stochastic ruler and stochastic comparison methods.

Sample Path Methods

Sample path methods refer to a class of techniques that use sample average estimates of the loss function in the optimization (see, e.g., Refs. [43,56–60]). The estimates $\hat\theta_n$ are derived from a sample average estimate $\hat L$ of the loss function, and the optimization is performed by optimizing $\hat L$. Thus, any applicable deterministic procedure can be used in the sample path approach.

There are two variants of sample path methods, depending on whether the sample path in the estimation is “fixed” or “variable.” To describe these estimates, let $\omega$ denote the random effects in the noise in (1). The observation corresponding to this value of $\omega$ is $Y(\theta, \omega)$. Let $\omega_1, \omega_2, \omega_3, \ldots, \omega_m$ be an i.i.d. sample of $\omega$, where $m \ge 1$ is fixed. This random sample defines a fixed sample path estimate of the loss

$$\hat L_m(\theta) = \frac{1}{m} \sum_{i=1}^{m} Y(\theta, \omega_i). \tag{39}$$

The sample of $\omega$’s in this estimate of the loss function remains fixed throughout the estimation to obtain $\hat\theta_n$. A fixed sample path estimate of the loss yields a fixed sample path estimate of the optimum. This definition assumes that the parameter $\theta$ in $Y(\theta, \omega)$ can be varied for a fixed value of $\omega$. This is possible, in particular, when the observations $Y$ are obtained by Monte Carlo.

Variable sample path estimates are obtained by letting the random effects vary with each sample of observations and iteration to obtain the estimate $\hat\theta_n$. In other words, consider a sample of $k_n$ i.i.d. observations $\omega_1^n, \omega_2^n, \omega_3^n, \ldots, \omega_{k_n}^n$ of $\omega$ at each iteration $n$. These observations define a variable sample path estimate of the loss

$$\hat L_{k_n}(\theta) = \frac{1}{k_n} \sum_{i=1}^{k_n} Y(\theta, \omega_i^n). \tag{40}$$

The $\omega_i^j$’s are assumed to be independent of the $\omega_i^l$’s, for $j \ne l$. In addition, in the most general form of (40), the sampling distribution for $\omega_i^n$ depends on the current estimate $\hat\theta_n$.

The key distinction between the two types of loss function estimates is that variable sample path estimates vary with the iteration index for $\hat\theta_n$, whereas fixed sample path estimates do not.
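The difference between the estimates (39) and (40) can be made concrete with a small Monte Carlo sketch. Everything in it is an assumption made for illustration: the observation model `Y(theta, omega) = (theta - 3 - omega)**2` with standard normal `omega`, the feasible set `{0, ..., 10}`, and the function names. Under this model the expected loss is $(\theta - 3)^2 + 1$, so the optimum is $\theta = 3$.

```python
import numpy as np

def Y(theta, omega):
    # Hypothetical observation model: E[Y(theta, omega)] = (theta - 3)^2 + 1.
    return (theta - 3.0 - omega) ** 2

def fixed_sample_path_loss(omegas):
    """Estimate (39): the same omegas are reused for every theta."""
    def L_hat(theta):
        return float(np.mean(Y(theta, omegas)))
    return L_hat

def variable_sample_path_loss(theta, k_n, rng):
    """Estimate (40): a fresh sample of size k_n is drawn at each call."""
    omegas = rng.standard_normal(k_n)
    return float(np.mean(Y(theta, omegas)))

rng = np.random.default_rng(0)
feasible = range(0, 11)

# Fixed sample path: L_hat_m is an ordinary deterministic function once the
# sample is drawn, so it can be minimized by any deterministic method
# (plain enumeration here).
L_hat_m = fixed_sample_path_loss(rng.standard_normal(200))
theta_fixed = min(feasible, key=L_hat_m)

# Variable sample path: repeated evaluations at the same theta differ.
first = variable_sample_path_loss(5, 50, rng)
second = variable_sample_path_loss(5, 50, rng)
print(theta_fixed, first, second)
```

Reusing the same sample across all values of $\theta$ is what allows a deterministic optimizer to be applied to the fixed sample path estimate.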

The convergence of sample path methods has been extensively studied (see, e.g., Refs. [56–59]). Reference [59], for example, showed that the optimal value of $\hat L_m$, denoted $\hat L_m^*$, converges a.s. to the optimal value $L^*$ of the loss. It also studied approximate optimal solutions called $\delta$-optimal solutions, $\delta \ge 0$. A point $\theta$ is a $\delta$-optimal solution of $L$ if $|L(\theta) - L^*| \le \delta$. The loss function values of such points are within $\delta$ of the optimum $L^*$. The set of such points is denoted $\Theta^*_\delta$. When $\delta = 0$, the sets of optimal and $\delta$-optimal solutions coincide. Approximate optimal solutions of sample path estimates are similarly defined. Thus, a $\delta$-optimal solution of $\hat L_m$ is any point $\theta$ such that $|\hat L_m(\theta) - \hat L_m^*| \le \delta$. The set of such approximate optimal solutions is denoted $\Theta^*_{m,\delta}$. Convergence a.s. holds for approximate solutions ([59]): the set $\Theta^*_\delta$ contains $\Theta^*_{m,\delta}$ a.s. for $m$ sufficiently large.

The rates of convergence for sample path estimates are also known ([59]): for $\delta \ge 0$,

$$\limsup_{m \to \infty} \frac{1}{m} \log\left[1 - P\left(\Theta^*_{m,\delta} \subset \Theta^*_\delta\right)\right] \le -C. \tag{41}$$

In other words, the probability that $\Theta^*_\delta$ contains $\Theta^*_{m,\delta}$ converges to 1 exponentially fast as $m \to \infty$. See Ref. [59] for performance bounds on sample path estimates. Reference [56] studied the convergence properties of variable sample path methods.
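The behavior described by (41) can be observed empirically on a toy problem. The sketch below is only an illustration under assumed ingredients: the loss $(\theta - 3)^2 + 1$ on the feasible set `{0, ..., 10}`, standard normal noise entering as `(theta - 3 - omega)**2`, and the hypothetical helpers `delta_optimal_set` and `inclusion_frequency`. It estimates, by repeated sampling, how often the set of $\delta$-optimal solutions of the sample average loss is contained in the set of $\delta$-optimal solutions of the true loss.

```python
import numpy as np

def delta_optimal_set(feasible, loss, delta):
    """All feasible points whose loss is within delta of the minimum loss."""
    values = np.array([loss(t) for t in feasible])
    best = values.min()
    return {t for t, v in zip(feasible, values) if v - best <= delta}

def true_loss(theta):
    # Hypothetical loss; it equals E[(theta - 3 - omega)^2] for standard normal omega.
    return (theta - 3.0) ** 2 + 1.0

def sample_average_loss(omegas):
    return lambda theta: float(np.mean((theta - 3.0 - omegas) ** 2))

def inclusion_frequency(m, delta, feasible, reps, rng):
    """Fraction of replications in which the sample delta-optimal set is contained in the true one."""
    target = delta_optimal_set(feasible, true_loss, delta)
    hits = 0
    for _ in range(reps):
        L_hat = sample_average_loss(rng.standard_normal(m))
        if delta_optimal_set(feasible, L_hat, delta) <= target:
            hits += 1
    return hits / reps

feasible = list(range(11))
rng = np.random.default_rng(1)
for m in (1, 4, 16, 64):
    print(m, inclusion_frequency(m, 1.5, feasible, 500, rng))
```

In runs of this kind the inclusion frequency should approach 1 as $m$ grows; the simulation does not estimate the constant $C$ in (41).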

CONCLUSION


Random search methods and sample path methods are alternatives when the search space is large and there is little or no structure in the loss function that can be exploited in the optimization. For the random search methods, the modified stochastic ruler or modified SAN seem to be computationally efficient choices, as they do not require an increasing number of function evaluations at each iteration. In situations where the search space is large and there are properties of the loss function that can be used in the search for an optimum, such as convexity, stochastic approximation methods are available.

ACKNOWLEDGMENT

Sincere thanks are extended to the reviewers and the Topical Editor, Dr. James C. Spall. Their comments and suggestions for improving this manuscript were invaluable during the review process and were greatly appreciated.

REFERENCES

1. Wieselthier JE, Barnhart CM, Ephremides A. Optimal admission control in circuit-switched multihop radio networks. Proceedings of the 31st Conference on Decision and Control, Tucson, AZ; 1992. p 1011–1013.

2. Cassandras CG, Julka V. Scheduling policies using marked/phantom slot algorithms. Queueing Syst Theory Appl 1995;20:207–254.

3. Krishnamurthy V, Wang X, Yin G. Spreading code optimization and adaptation in CDMA via discrete stochastic approximation. IEEE Trans Infor Theory 2004;50(9):1927–1949.

4. Mishra V, Bhatnagar S, Hemachandra N. Discrete parameter simulation optimization algorithms with application to admission control with dependent service times. Proceedings of the 46th Conference on Decision and Control, New Orleans, LA; 2007.

5. Ermoliev YM, Leonardi G. Some proposals for stochastic facility location models. Math Model 1982;3(5):407–420.

6. Ermoliev YM. Facility location problem. In: Ermoliev Y, Wets RJ-B, editors. Numerical techniques for stochastic optimization. New York: Springer; 1988. p 413–434.

7. Ho YC, Eyler MA, Chien TT. A gradient technique for general buffer storage design in a production line. Int J Prod Res 1979;17(6):557–580.

8. Spinellis D, Papadopoulos C, MacGregor Smith J. Large production line optimization using simulated annealing. Int J Prod Res 2000;38(3):509–541.

9. Francis RL. A “uniformity principle” for evacuation route allocation. J Res Natl Bur Stand 1981;86(5):509–513.

10. Ibaraki T, Katoh N. Resource allocation problems. Cambridge (MA): MIT Press; 1988.

11. Cassandras CG, Dai L, Panayiotou CG. Ordinal optimization for a class of deterministic and stochastic discrete resource allocation problems. IEEE Trans Auto Contr 1998;43(7):881–900.

12. Castañón DA, Wohletz JM. Model predictive control for stochastic resource allocation. IEEE Trans Auto Contr 2009;54(8):1739–1750.

13. Weisstein EW. Discrete set. Available at http://mathworld.wolfram.com/DiscreteSet.html, From MathWorld–A Wolfram Web Resource, Accessed 2014 Jan 24, copyright date (1999–2014).

14. Favati P, Tardella F. Convexity in nonlinear integer programming. Ric Oper 1990;53:3–44.

15. Hoffman KL. Combinatorial optimization: current successes and directions for the future. J Comput Appl Math 2000;124:341–360.

16. Sherali HD, Driscoll PJ. Evolution and state-of-the-art in integer programming. J Comput Appl Math 2000;124:319–340.

17. Gupta SS. On some multiple decision (selection and ranking) rules. Technometrics 1965;7(2):225–238.

18. Robbins H, Sobel M, Starr N. A sequential procedure for selecting the largest of k means. Ann Math Stat 1968;39(1):88–92.

19. Goldsman D, Nelson BL. Ranking, selection and multiple comparisons in computer simulation. Proceedings of 1994 Winter Simulation Conference, Lake Buena Vista, FL; 1994. p 192–198.

20. Hsu JC. Multiple comparisons theory and methods. London: Chapman and Hall; 1996.


22. Kim S-H, Nelson BL. Recent advances in ranking and selection. Proceedings of the 2007 Winter Simulation Conference, Washington, DC; 2007. p 162–172.

23. Robbins H. Some aspects of the sequential design of experiments. Bull Am Math Soc 1952;55:527–535.

24. Lai TL, Robbins H. Asymptotically efficient adaptive allocation rules. Adv Appl Math 1985;6:4–22.

25. Lai TL. Sequential analysis: some classical problems and new challenges. Stat Sin 2001;11(2):303–350.

26. Spall JC. Introduction to stochastic search and optimization. Hoboken (NJ): John Wiley and Sons; 2003.

27. Alrefaei MH, Andradóttir S. Discrete stochastic optimization via a modification of the stochastic ruler method. Proceedings of 1996 Winter Simulation Conference, Coronado, CA; 1996. p 406–411.

28. Alrefaei MH, Andradóttir S. A modification of the stochastic ruler method for discrete stochastic optimization. Eur J Oper Res–Theo Meth 2001;133:160–182.

29. Alrefaei MH, Andradóttir S. Discrete stochastic optimization using variants of the stochastic ruler method. Naval Res Log 2005;52:344–360.

30. Yan D, Mukai H. Stochastic discrete optimization. SIAM J Contr Optim 1992;30(3):594–612.

31. Wang Q. Optimization with Discrete Simultaneous Perturbation Stochastic Approximation Using Noisy Loss Function Measurements [PhD dissertation]. The Johns Hopkins University, Department of Applied Mathematics and Statistics; Oct 2013. ArXiv e-prints, arXiv:1311.0042.

32. Gong W-B, Ho YC, Zhai W. Stochastic discrete optimization. SIAM J Optim 1999;10(2):384–404.

33. Gelfand SB, Mitter SK. Simulated annealing with noisy or imprecise energy measurements. J Optim Theory Appl 1989;62:49–62.

34. Fox BL, Heine GW. Probabilistic search with overrides. Ann Appl Prob 1995;5(4):1087–1094.

35. Gutjahr WJ, Pflug GCh. Simulated annealing for noisy cost functions. J Glob Optim 1996;8:1–13.

36. Andradóttir S. Accelerating the convergence of random search methods for discrete stochastic optimization. J Assoc Comput Mach 1999;9(4):349–380.

37. Alrefaei MH, Andradóttir S. A simulated annealing algorithm with constant temperature for discrete stochastic optimization. Manage Sci 1999;45(5):748–764.

38. Robbins H, Monro S. A stochastic approximation method. Ann Math Stat 1951;22:400–407.

39. Kiefer J, Wolfowitz J. A stochastic approximation method. Ann Math Stat 1952;23:426–466.

40. Kushner HJ. Stochastic approximation: a survey. Wiley Interdiscip Rev Comput Stat 2010;2(1):87–96.

41. Dupač V, Herkenrath U. Stochastic approximation on a discrete set and the multi-armed bandit problem. Commun Stat Sequen Anal 1982;1(1):1–25.

42. Gokbayrak K, Cassandras CG. Stochastic discrete optimization using a surrogate problem methodology. Proceedings of the 38th Conference on Decision and Control, Phoenix, AZ; 1999. p 1779–1784.

43. Wang H, Schmeiser BW. Discrete stochastic optimization using linear interpolation. Proceedings of the 2008 Winter Simulation Conference, WSC’08, Winter Simulation Conference; 2008. p 502–508.

44. Lovász L. Submodular functions and convexity. In: Bachem A, Grötschel M, Korte B, editors. Mathematical Programming the State of the Art. Berlin: Springer-Verlag; 1983. p 235–257.

45. Marichal J-L, Mathonet P. Approximations of Lovász extensions and their induced interaction index. Discrete Appl Math 2008;156(1):11–24.

46. Gerencsér L, Hill SD, Vágó Z. Optimization over discrete sets via SPSA. Proceedings of the 38th Conference on Decision and Control, Phoenix, AZ; 1999. p 1791–1795.

47. Gerencsér L, Hill SD, Vágó Z. Discrete optimization via SPSA. Proceedings of 2001 American Control Conference, Arlington, VA; 2001. p 1503–1504.

48. Hill SD, Gerencsér L, Vágó Z. Stochastic approximation on discrete sets using simultaneous difference approximations. Proceedings of 2004 American Control Conference, Boston, MA; 2004. p 2795–2798.


50. Hill SD. Discrete stochastic approximation with application to resource allocation. Johns Hopkins APL Tech. Dig. 2005;26(1):15–21.

51. Kushner HJ, Yin GG. Stochastic approximation algorithms and applications. New York: Springer; 1997.

52. He Y, Fu MC, Marcus SI. Convergence of simultaneous perturbation stochastic approximation for nondifferentiable optimization. IEEE Trans Automat Contr 2003;48(8):1459–1463.

53. Spall JC. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans Automat Contr 1992;37(3):332–341.

54. Wang Q, Spall JC. Discrete simultaneous perturbation stochastic approximation on loss function with noisy measurements. Proceedings of 2011 American Control Conference, San Francisco, CA; 2011. p 4520–4525.

55. Wang Q, Spall JC. Rate of convergence analysis of discrete simultaneous perturbation stochastic approximation algorithm. Proceedings of 2013 American Control Conference, Washington, DC; 2013. p 4778–4783.

56. Homem-de-Mello T. Monte Carlo methods for discrete stochastic optimization. In: Uryasev S, Pardalos PM, editors. Stochastic optimization: algorithms and applications. Norwell (MA): Kluwer Academic Publishers; 2001. p 97–119.

57. Homem-de-Mello T. Variable-sample methods for stochastic optimization. ACM Trans Model Comput Simul (TOMACS) 2003;13(2):108–133.

58. Homem-de-Mello T. On rates of convergence for stochastic optimization problems under non-independent and identically distributed sampling. SIAM J Optim 2008;19(2):524–551.

59. Kleywegt AJ, Shapiro A, Homem-de-Mello T. The sample average approximation method for stochastic discrete optimization. SIAM J Optim 2002;12(2):479–502.