Convergence of PADS Methods - Global Optimization Of Computationally Expensive Blackbox Problem

Under some technical conditions on the algorithm parameters, LMSRBF has been shown to converge to the global minimum almost surely [95]. In this section we prove the global convergence of the PADS(1) algorithm. The method can be extended to prove the convergence of PADS(J) when J > 1.

To prove the convergence of serial PADS(1), we will apply the theorem that was proved in [95]. Using the notation introduced in [95], we will show that PADS(1) satisfies all conditions required for the theorem.

Definition 3.3. For $1 \le j \le N_{cand}$ and $n \ge n_0$,

• $Y_{n,j}$ is the random vector representing the random candidate point $y_{n,j}$ before it was forced to be in the domain $D$ (see function Perturb_x in Figure 3.2).

• $Y_D$ is a transformation of a random vector $Y$ (whose realization is in $\mathbb{R}^d$) so that $Y_D$ is always in $D$, i.e. $T : \mathbb{R}^d \to D$ is a deterministic function and $Y_D = T(Y) \in D$.

• $X_{n+1}$ is the random vector representing the $(n+1)$th function evaluation point $x_{n+1}$.

Note that the generated candidate points of PADS1 are always in the domain $D$, and thus $Y_{n,j} = (Y_{n,j})_D$ in this case. Therefore, the transformation $Y_D$ defined above in fact occurs only in PADS2.

Fix $n \ge n_0$, and let $x_n^*$ be the best point found so far (the point with the lowest objective function value). Recall that in PADS, each coordinate of $x_n^*$ has probability $\phi(n) \in [0, 1]$ of being selected for perturbation when a candidate point $y$ is generated from $x_n^*$. If no variable of $x_n^*$ is selected for perturbation with probability $\phi(n)$, one variable is chosen at random, so at least one variable of $x_n^*$ is always perturbed to obtain $y$.

Let $Q$ be a random vector in $\{0, 1\}^d \setminus \{(0, \dots, 0)\}$ that determines which coordinates of $x_n^*$ will be perturbed to obtain $y$, i.e. $Q^{(j)} = 1$ if the $j$th coordinate of $x_n^*$ is selected for perturbation, and $Q^{(j)} = 0$ otherwise. Then $Q$ can be modeled as follows through the random vectors $R$ and $E$:

Define $R = (R^{(1)}, \dots, R^{(d)})$ to be a random vector in $\{0, 1\}^d$ that follows the multivariate Bernoulli distribution with parameter $p$, i.e. each coordinate $i$ of $R$ is such that $P(R^{(i)} = r) = p^r (1 - p)^{1 - r}$, $r \in \{0, 1\}$. If at least one of the coordinates is selected for perturbation, then the distribution of $Q$ simply follows $R$. However, if none of the coordinates is selected, the algorithm uniformly picks exactly one of the coordinates for perturbation. We introduce another random vector, $E = (E^{(1)}, \dots, E^{(d)})$, to capture the latter scenario.

Let $e_i = (0, 0, \dots, 0, 1, 0, \dots, 0)$ be the $i$th row of the $d \times d$ identity matrix. Then $E$ is a random vector such that $P\left(E \in \bigcup_{i=1}^{d} \{e_i\}\right) = 1$ and $P(E = e_i) = 1/d$ for $i = 1, \dots, d$.

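Combining the two random vectors $R$ and $E$ gives a correct representation of $Q$:

$$Q = R \, 1_{\{R \ne \mathbf{0}\}} + E \, 1_{\{R = \mathbf{0}\}}, \tag{3.3.1}$$

where for any set $A$,

$$1_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \end{cases} \tag{3.3.2}$$

is an indicator function and $\mathbf{0}$ is the zero vector, i.e. $\mathbf{0}^{(i)} = 0$ for all $i = 1, \dots, d$.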
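The construction in (3.3.1) can be sketched as a small sampler. This is only an illustration, not the thesis implementation: the function name `sample_q`, the use of NumPy, and the values of `d` and `p` are hypothetical, and the coordinates of R are assumed to be drawn independently.

```python
import numpy as np

def sample_q(d, p, rng):
    """Draw one nonzero perturbation mask Q in {0,1}^d following (3.3.1)."""
    # R: Bernoulli(p) coordinates (drawn independently; an assumption here).
    r = (rng.random(d) < p).astype(int)
    if r.any():
        return r  # R != 0, so Q = R
    # R == 0: E selects exactly one coordinate uniformly at random.
    e = np.zeros(d, dtype=int)
    e[rng.integers(d)] = 1
    return e

rng = np.random.default_rng(0)
# Illustrative values only: d = 5 coordinates, p = 0.1.
masks = np.array([sample_q(5, 0.1, rng) for _ in range(1000)])
```

Every row of `masks` has at least one entry equal to 1, which mirrors the requirement that Q never takes the value (0, ..., 0).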

Definition 3.4. For $1 \le j \le N_{cand}$ and $n \ge n_0$, let the random vectors $Q_{n,j}$, $R_{n,j}$, $E_{n,j}$ have the same distributions as $Q$, $R$ and $E$ defined above, with parameter $p = \phi(n) > 0$. Then $Q_{n,j}$ is the random vector that determines which coordinates of $X_n^*$ are perturbed to obtain $Y_{n,j}$.

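For each $n \ge n_0$, let

• $\mathcal{F}_n := \{X_1, \dots, X_{n_0}, Y_{n_0,1}, \dots, Y_{n_0,N_{cand}}, \dots, Y_{n,1}, \dots, Y_{n,N_{cand}}\}$,

• $\mathcal{Q}_n := \{Q_{n_0,1}, \dots, Q_{n_0,N_{cand}}, \dots, Q_{n,1}, \dots, Q_{n,N_{cand}}\}$.

Then, define

• $\mathcal{E}_n := \mathcal{F}_n \cup \mathcal{Q}_n$ for $n \ge n_0$, and $\mathcal{E}_{n_0 - 1} := \{X_1, \dots, X_{n_0}\}$.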

So $\mathcal{F}_n$ is the set of points that were used to build the initial surrogate model for PADS(1) (in Step 2) together with all candidate points generated (in Step 3c of Algorithm 3.2) in all iterations up to $n$. $\mathcal{Q}_n$ is the set of vectors describing which coordinates of $X_n^*$ are perturbed to obtain each of the candidate points $Y_{n,j}$ in all iterations up to $n$.

Remark 3.5. In PADS(1), for each $n > n_0$, the value of $X_n$ is selected deterministically from the values of the random vectors $(Y_{n-1,1})_D, \dots, (Y_{n-1,N_{cand}})_D$ (see Step 3d of Algorithm 3.2). Therefore, after the $n$th function evaluation, the entire path of the algorithm is completely determined by $\sigma(\mathcal{E}_{n-1})$, the $\sigma$-algebra generated by the random vectors in $\mathcal{E}_{n-1}$.

To show that PADS(1) converges to the global minimum almost surely, we will apply the following theorem which was presented in [95].

Theorem 3.6. Let $f$ be a function defined on $D \subseteq \mathbb{R}^d$ and suppose that $x^*$ is the unique global minimizer of $f$ in $D$, with $f(x^*) = \min_{x \in D} f(x) > -\infty$, such that $\min_{x \in D,\ \|x - x^*\| \ge \eta} f(x) > f(x^*)$ for all $\eta > 0$. Suppose further that the SRS method generates the random vectors $\{X_n\}_{n \ge 1}$ and $\{Y_{n,1}, \dots, Y_{n,N_{cand}}\}_{n \ge n_0}$ satisfying the following two conditions:

[C1] For each $n \ge n_0$, $Y_{n,1}, \dots, Y_{n,N_{cand}}$ are conditionally independent given the random vectors in $\mathcal{E}_{n-1}$.

[C2] For any $j = 1, \dots, N_{cand}$, $x \in D$ and $\delta > 0$, there exists $\nu_j(x, \delta) > 0$ such that

$$P\left[Y_{n,j} \in B(x, \delta) \cap D \mid \sigma(\mathcal{E}_{n-1})\right] \ge \nu_j(x, \delta)$$

for all $n \ge n_0$, where $B(x, \delta)$ is the open ball of radius $\delta$ centered at $x$.

If the sequence of random vectors $\{X_n^*\}_{n \ge 1}$ is defined by $X_1^* = X_1$ and

$$X_n^* = \begin{cases} X_n & \text{if } f(X_n) < f(X_{n-1}^*) \\ X_{n-1}^* & \text{otherwise,} \end{cases}$$

then $X_n^* \xrightarrow{\text{a.s.}} x^*$.

Proof. After replacing $\mathcal{E}_n$ defined in [95] with our (larger) $\mathcal{E}_n$ defined in Definition 3.4, a straightforward replication of the proof in [95] shows that this theorem also holds for the PADS(1) framework.
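The best-so-far sequence used in Theorem 3.6 can be illustrated with a short sketch. The function name `best_so_far`, the toy objective, and the sample points are hypothetical and serve only to show the update rule, in particular that the incumbent changes only on strict improvement.

```python
def best_so_far(points, f):
    """Return X*_1, X*_2, ... for the evaluated points X_1, X_2, ..."""
    best = [points[0]]  # X*_1 = X_1
    for x in points[1:]:
        # Replace the incumbent only on strict improvement, as in Theorem 3.6.
        best.append(x if f(x) < f(best[-1]) else best[-1])
    return best

# Toy objective with minimizer at 1.0 (illustrative only).
f = lambda x: (x - 1.0) ** 2
trace = best_so_far([3.0, 0.0, 2.0, 1.5, -1.0], f)
```

Here the third point ties with the incumbent (both have objective value 1), so the incumbent is kept.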

To apply this theorem, one needs to show that PADS(1) satisfies conditions [C1] and [C2] of Theorem 3.6. The condition [C1] is trivial. The condition [C2] says that the algorithm is able to sample points in any region of D. The following two lemmas will show that PADS(1) indeed satisfies [C2].

Lemma 3.7. For a fixed $j \in \{1, \dots, N_{cand}\}$, let $H$ be the event that all coordinates of $x_n^*$ are selected for perturbation (to generate $y_{n,j}$), i.e. $H = \{R_{n,j}^{(i)} = 1 \text{ for all } i = 1, \dots, d\}$. Let $g_{n,j}$ be the conditional density of $Y_{n,j}$ given $\sigma(\mathcal{E}_{n-1})$ and $H$. Then there is a constant $C > 0$ such that $g_{n,j}(y) \ge C$ for all $y \in D = [lb, ub] \subset \mathbb{R}^d$ and $n \ge n_0$.

Proof. First note that $Y_{n,j}$ is the random vector before a candidate point is forced into the domain $D$ (see Definition 3.3). Under the assumption that all coordinates of $x_n^*$ are selected for perturbation and that all the information up to function evaluation $n - 1$ is known, the conditional density $g_{n,j}$ of each version of PADS(1) can be written in one of the following forms:

PADS1(1): $g_{n,j}(y) = A_1 \exp\left(-\dfrac{\|y - x_n^*\|^2}{2\sigma_n^2}\right)$ for $y \in D$ and $0$ otherwise (truncated normal density);

PADS2(1): $g_{n,j}(y) = A_2 \exp\left(-\dfrac{\|y - x_n^*\|^2}{2\sigma_n^2}\right)$ for $y \in \mathbb{R}^d$ (normal density).

In either case, $A_1, A_2 > 0$ are normalizing constants. Since $\|y - x_n^*\| \le \|ub - lb\|$ for all $y, x_n^* \in D$ and $\sigma_n \ge \inf_{n \ge n_0} \sigma_n$, the constant

$$C := A_i \exp\left(-\frac{\|ub - lb\|^2}{2\left(\inf_{n \ge n_0} \sigma_n\right)^2}\right) > 0$$

satisfies $g_{n,j}(y) \ge C$ for all $y \in D = [lb, ub]$.

Lemma 3.8. If $\inf_{n \ge n_0} \phi(n) > 0$, then Condition [C2] holds for PADS(1).

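Proof. Assume that $\inf_{n \ge n_0} \phi(n) > 0$. Let $j \in \{1, \dots, N_{cand}\}$, $x \in D$ and $\delta > 0$ be given. We continue with the notation of Lemma 3.7; in particular, recall that $H = \{R_{n,j}^{(i)} = 1 \text{ for all } i = 1, \dots, d\}$ and that $g_{n,j}$ is the conditional density of $Y_{n,j}$ given $\sigma(\mathcal{E}_{n-1})$ and $H$. Then

$$\begin{aligned}
P\left[Y_{n,j} \in B(x, \delta) \cap D \mid \sigma(\mathcal{E}_{n-1})\right]
&\ge P\left[(Y_{n,j} \in B(x, \delta) \cap D) \cap H \mid \sigma(\mathcal{E}_{n-1})\right] \\
&= P\left[Y_{n,j} \in B(x, \delta) \cap D \mid \sigma(\mathcal{E}_{n-1}), H\right] \times P(H) \\
&= \left(\int_{B(x, \delta) \cap D} g_{n,j}(y)\, dy\right) \times \phi(n)^d \\
&\ge C \mu\left(B(x, \delta) \cap D\right) \times \left(\inf_{n \ge n_0} \phi(n)\right)^d =: \nu_j(x, \delta)
\end{aligned}$$

for any $n \ge n_0$, where $C > 0$ is the constant whose existence is guaranteed by Lemma 3.7 and $\mu$ denotes Lebesgue measure on $\mathbb{R}^d$. By Lemma 3.7, the fact that $D$ is a compact hyperrectangle, and our assumption that $\inf_{n \ge n_0} \phi(n) > 0$, one can easily see that $\nu_j(x, \delta) > 0$. Note also that $\nu_j(x, \delta)$ is independent of $n$. Thus, condition [C2] is verified.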
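The bound from Lemmas 3.7 and 3.8 can be checked numerically in a one-dimensional toy case. The sketch below assumes d = 1, D = [lb, ub], a PADS1-style truncated normal density with a fixed σ (so that the infimum over σ_n equals σ), and a fixed value of ϕ; the function names and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def check_c2_bound(x, delta, x_best, sigma, lb, ub, phi):
    """1-D toy check: compare P[Y in B(x, delta) and D | H] * P(H) with nu."""
    # Probability mass of N(x_best, sigma^2) inside D = [lb, ub].
    mass = norm_cdf((ub - x_best) / sigma) - norm_cdf((lb - x_best) / sigma)
    a, b = max(x - delta, lb), min(x + delta, ub)  # B(x, delta) intersected with D
    # Exact probability under the normal density truncated to D.
    prob = (norm_cdf((b - x_best) / sigma) - norm_cdf((a - x_best) / sigma)) / mass
    # Lemma 3.7 constant: A * exp(-(ub - lb)^2 / (2 sigma^2)), with A the normalizer.
    A = 1.0 / (sigma * math.sqrt(2.0 * math.pi) * mass)
    C = A * math.exp(-((ub - lb) ** 2) / (2.0 * sigma ** 2))
    nu = C * (b - a) * phi  # d = 1, so phi**d = phi
    return prob * phi, nu

lhs, nu = check_c2_bound(x=0.9, delta=0.05, x_best=0.2, sigma=0.2,
                         lb=0.0, ub=1.0, phi=0.3)
```

In this setting `nu` is strictly positive and no larger than `lhs`, in line with the chain of inequalities in the proof; the bound is typically far from tight, which does not matter for the convergence argument.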

Since the two conditions [C1] and [C2] hold for PADS(1), we can apply Theorem 3.6 and conclude that PADS(1) converges to the global minimum with probability 1.

Remark 3.9. By arguments analogous to those given above for serial PADS(1), parallel PADS(J) with J > 1 can also be shown to converge to the global minimum in a probabilistic sense.

Example 3.10. One example of $\phi$ that satisfies the sufficient condition given in Lemma 3.8 is

$$\phi(n) = \begin{cases} \phi_0 \times \left[1 - \ln(n - n_0 + 1) / \ln(M - n_0)\right] & \text{if } n_0 \le n \le M - 2 \\ \phi(M - 2) & \text{if } n > M - 2, \end{cases}$$

where $\phi_0 > 0$ and $M \gg 0$. This is an extension of $\phi(n)$, which was defined in [97] up to $n = N_{max}$, to the set of natural numbers larger than $n_0$.
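The schedule in Example 3.10 can be written out directly to see why its infimum is positive. This is a minimal sketch; the function name `phi` and the values of `n0`, `M` and `phi0` are illustrative assumptions, not values taken from the thesis.

```python
import math

def phi(n, n0, M, phi0):
    """Perturbation probability from Example 3.10 (held constant for n > M - 2)."""
    n_eff = min(n, M - 2)  # freeze the schedule after n = M - 2
    return phi0 * (1.0 - math.log(n_eff - n0 + 1) / math.log(M - n0))

# Illustrative values only.
n0, M, phi0 = 10, 200, 1.0
values = [phi(n, n0, M, phi0) for n in range(n0, 2 * M)]
```

The values start at `phi0`, decrease up to n = M - 2, and then stay at ϕ(M - 2), which is strictly positive because ln(M - n0 - 1) < ln(M - n0).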

Remark 3.11. Since the DYCORS framework is equivalent to PADS2(1), the argument above also proves the convergence of DYCORS, which was not done in [97].
