Bayesian optimization with derivatives - Scalable GPs with derivatives

3.2 Scalable GPs with derivatives

3.3.7 Bayesian optimization with derivatives

Prior work examines Bayesian optimization (BO) with derivative information in low-dimensional spaces to optimize model hyperparameters [81]. Wang et al. consider high-dimensional BO (without gradients) with random projections uncovering low-dimensional structure [70]. We propose BO with derivatives and dimensionality reduction via active subspaces, detailed in Algorithm 1.

Algorithm 3: BO with derivatives and active subspace learning 1: whileBudget not exhausted do

2: Calculate active subspace projection P ∈ RD×d

using sampled gradients 3: Optimize acquisition function, un+1 = arg max A(u) with xn+1 = Pun+1

4: Sample point xn+1, value fn+1, and gradient ∇ fn+1

5: Update data Di+1= Di∪ {xn+1, fn+1, ∇ fn+1}

6: Update hyperparameters of GP with gradient defined by kernel

k(PTx, PT_x0

) 7: end

Algorithm 1 estimates the active subspace and fits a GP with derivatives in the reduced space. Kernel learning, fitting, and optimization of the acquisition function all occur in this low-dimensional subspace. In our tests, we use the expected improvement (EI) acquisition function, which involves both the mean and predictive variance. We consider two approaches to rapidly evaluate the predictive variance v(x) = k(x, x) − KxXK˜−1XXKX x at a test point x. In the first ap-

proach, which provides a biased estimate of the predictive variance, we replace ˜

K−1_XX with the preconditioner solve computed by pivoted Cholesky; using the stable QR-based evaluation algorithm, we have

v(x) ≈ ˆv(x) ≡ k(x, x) − σ−2(kKX xk2− kQT₁KX xk2).

In the second approach, we use a randomized estimator as in [3] to compute the predictive variance at many points X0

simultaneously, and use the pivoted Cholesky approximation as a control variate to reduce the estimator variance:

vX0 = diag(K_X0_X0) − E_z h z (KX0_XK˜−1 XXKXX0z − K_X0_XM−1K_XX0z) i − ˆvX0.

The latter approach is unbiased, but gives very noisy estimates unless many probe vectors z are used. Both the pivoted Cholesky approximation to the predictive variance and the randomized estimator resulted in similar optimizer performance in our experiments.

To test this algorithm, we consider five instances of the 5D Ackley and 5D Rastrigin functions randomly embedded in [−10, 15]50_{and [−4, 5]}50_{, respectively.}

In Figure 3.8(a) and Figure 3.8(b), we show the performance of our algorithm using the D-SKI kernel and the EI acquisition function. We fix d= 2, and at each iteration we pick two directions in the estimated active subspace at random. We also compare to three other methods: BO with EI and no gradients in the

0 100 200 300 400 500 -20 -15 -10 -5 BO exact BO D-SKI BFGS Random sampling (a) BO on Ackley 0 100 200 300 400 500 -40 -20 0 20 BO exact BO SKI BFGS Random sampling (b) BO on Rastrigin

Figure 3.8: In the following experiments, 5D Ackley and 5D Rastrigin are embedded into 50 a dimensional space. We run Algorithm 1, comparing it with BO exact, multi-start BFGS, and random sampling. D-SKI with active subspace learning clearly outperforms the other methods.

examples, the BO variants perform better than the alternatives, and our method outperforms standard BO.

3.4 Conclusion

The work in this chapter extended the work from Chapter 2 to the case where we observe both values and gradients. Gradients are a valuable additional source of information for GP regression, but inclusion of d extra pieces of information per point naturally leads to new scaling issues. We introduced two methods to deal with these scaling issues: D-SKI and D-SKIP. Both are structured interpolation methods, and the latter also uses kernel product structure. We have discussed practical details — preconditioning is necessary to guarantee convergence of it- erative methods and active subspace calculation reveals low-dimensional structure when gradients are available. We presented several experiments with ker-

nel learning, dimensionality reduction, terrain reconstruction, implicit surface fitting, and scalable Bayesian optimization with gradients. For simplicity, these examples all possessed full gradient information; however, our methods triv- ially extend if only partial gradient information is available. Future work and potential extensions are discussed in Chapter 6.

CHAPTER 4

GLOBAL OPTIMIZATION IN RBF NATIVE SPACES

In this chapter we consider the global optimization problem of minimizing f : Ω → R where Ω is compact. Given only continuity of f , an optimization method will converge to a global minimum if and only if it eventually samples densely inΩ [67]. Given further information about the function, such as a Lip- schitz constant, one can find a global minimum more efficiently. While prior surrogate-based optimization methods achieve global convergence by dense sampling, standard approximation theory results for these surrogates often re- quire more structure than simple continuity. In particular, RBF approximations converge only when f belongs to a certain reproducing kernel Hilbert space (the native space for the RBF). In this chapter, we introduce a new global optimization method based on this approximation theory. Given only an upper bound on the native space norm of f , we use RBF approximation theory to develop a new algorithm that converges globally without dense sampling.

4.1 Background

The inequality in (1.8) provides a lower bound for f (x) anywhere in the domain:

f(x) ≥ `f,X(x) := sf,X(x) − PX,ϕ(x) q | f |2 Nϕ − |sX| 2 Nϕ, (4.1)

and this lower bound only depends on | f |Nϕ and the set X. In this section, we develop a greedy global optimization algorithm based on minimizing this lower bound at each step. For this to be feasible we assume that f ∈ Nϕ and that we

when we are working with e.g., polyharmonic kernels since the native spaces are instances of Beppo-Levi spaces as was discussed on page 6. In particular, we make the following assumptions:

1. f ∈ Nϕ.

2. | f |Nϕ or an upper bound ρ is known.

3. X is (ν − 1)-unisolvent (if ϕ is conditionally positive definite of order ν). 4. We can find the global minimum of `f,Xn inΩ.

In document Scalable kernel methods and their use in black-box optimization (Page 77-82)