Approximating gradients by Gaussian process


3.1.3 Approximating gradients by Gaussian process

The next method uses a Gaussian process (section 2.2) to approximate gradients. The idea is that, by Theorem 2, the larger the number of gradients evaluated, the higher the probability that we have an accurate subspace. Note that algorithms 1 and 2 suggest that $\hat{C}_k = (\nabla_x f_k)(\nabla_x f_k)^T$. This, however, is equivalent to saying that the active subspace $W_{k1}$ is always a single column vector equal to the gradient $\nabla_x f_k$. We prove this property as follows.

Theorem 8. For non-zero $\nabla_x f_k$, if $\hat{C}_k = (\nabla_x f_k)(\nabla_x f_k)^T$, the active subspace for region $k$ is simply the gradient; in other words,

$W_{k1} = \nabla_x f_k.$

Proof. From section 2, we know that $W_{k1}$ is the matrix whose columns are the $n$ eigenvectors $w_{k1}, w_{k2}, \ldots, w_{kn}$ corresponding to the $n$ largest eigenvalues $\lambda_{k1}, \lambda_{k2}, \ldots, \lambda_{kn}$ of the matrix $\hat{C}_k$. Set $n$ such that the gap between $\lambda_{kn}$ and $\lambda_{k,n+1}$ is the largest. In other words, $n = \arg\max_i |\lambda_{ki} - \lambda_{k,i+1}|$ for $i = 1, \ldots, m-1$.

Therefore, we first need to prove that $\nabla_x f_k$ is one of the eigenvectors of the matrix $\hat{C}_k$ and then prove that the largest gap lies between the first eigenvalue and the second eigenvalue.

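First, let us prove that $\nabla_x f_k$ is one of the eigenvectors of the matrix $\hat{C}_k$. We have $\hat{C}_k = (\nabla_x f_k)(\nabla_x f_k)^T$. For non-zero $\nabla_x f_k$, we have

$\hat{C}_k \nabla_x f_k = (\nabla_x f_k)(\nabla_x f_k)^T \nabla_x f_k = \nabla_x f_k \, \|\nabla_x f_k\|^2 = \|\nabla_x f_k\|^2 \, \nabla_x f_k,$

so $\nabla_x f_k$ is an eigenvector of $\hat{C}_k$ with eigenvalue $\|\nabla_x f_k\|^2 > 0$.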

Next, we need to prove that the largest gap lies between the first eigenvalue and the second eigenvalue.

First, we know that $\hat{C}_k$ is a symmetric matrix. We can show that it is also positive semi-definite because, for any $v \in \mathbb{R}^m$,

$v^T \hat{C}_k v = v^T (\nabla_x f_k)(\nabla_x f_k)^T v$ (3.8)

$= (v^T \nabla_x f_k)^2 \geq 0.$ (3.9)

Therefore, $\hat{C}_k$ permits an eigenvalue decomposition with eigenvalues that are all greater than or equal to 0 and eigenvectors that are orthogonal to each other. In particular, the remaining eigenvectors are orthogonal to $\nabla_x f_k$. Thus, let $p_i$ denote any of the other eigenvectors; we have

$\hat{C}_k p_i = (\nabla_x f_k)(\nabla_x f_k)^T p_i = 0.$

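Therefore, all other eigenvectors have eigenvalue 0. The eigenvalues of $\hat{C}_k$ are thus $\|\nabla_x f_k\|^2, 0, \ldots, 0$: the gap between the first and second eigenvalues is $\|\nabla_x f_k\|^2 > 0$, while the differences between all other consecutive eigenvalues are 0. Hence $n = 1$ and $W_{k1}$ is spanned by $\nabla_x f_k$, which completes the proof.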
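The statement of Theorem 8 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the random vector `grad` is only a stand-in for a non-zero gradient, and the dimension `m = 5` is an arbitrary choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                                  # arbitrary dimension for illustration
grad = rng.normal(size=m)              # stand-in for a non-zero gradient
C = np.outer(grad, grad)               # rank-one matrix (grad)(grad)^T

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # reorder to descending

gaps = np.abs(np.diff(eigvals))        # |lambda_i - lambda_{i+1}|, i = 1..m-1
n = int(np.argmax(gaps)) + 1           # dimension chosen by the largest gap

unit_grad = grad / np.linalg.norm(grad)
alignment = abs(eigvecs[:, 0] @ unit_grad)   # close to 1 if parallel to grad
```

Here `n` should come out as 1, the leading eigenvalue should match the squared norm of `grad`, and the leading eigenvector should be parallel to `grad` up to sign and normalisation, since `eigh` returns unit-length eigenvectors.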

Therefore, if the original active subspace method works well for nearly ridge functions, and constructing local active subspaces from a single gradient point works well for local ridge functions, then using a Gaussian process to model gradients should work for more general functions that have a smooth gradient space. Although one can argue that, for sufficiently small regions, the surface of any continuously differentiable function is well approximated by its tangent hyperplane, we do not know beforehand how small the regions should be or how many regions we should have. Therefore, the Gaussian process should still contribute valuable information.

Therefore, we choose to fit a Gaussian process to the points $\{x_i\}$ and their gradients $\{\nabla_x f(x_i)\}$. Moreover, the Gaussian process is relatively quick to evaluate; hence, we can use more gradient information.
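One simple way to realise such a gradient model is sketched below. It is not necessarily the construction of section 2.2: it assumes that each gradient component is an independent zero-mean Gaussian process sharing one squared-exponential kernel, and it only computes the posterior mean. The names `rbf_kernel`, `fit_gradient_gp`, `length_scale` and `jitter`, as well as the test function, are hypothetical and chosen for illustration.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel matrix between rows of A and rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)

def fit_gradient_gp(X, G, length_scale=1.0, jitter=1e-8):
    """Return a function giving the GP posterior mean of the gradient.

    Assumption: every gradient component is an independent zero-mean GP
    with the same RBF kernel; a simplification for illustration only.
    """
    K = rbf_kernel(X, X, length_scale) + jitter * np.eye(len(X))
    alpha = np.linalg.solve(K, G)      # (M, m): one weight column per component
    return lambda X_new: rbf_kernel(X_new, X, length_scale) @ alpha

# Hypothetical smooth test function f(x) = sin(x_1) + x_2^2 and its gradient.
def grad_f(X):
    return np.column_stack([np.cos(X[:, 0]), 2.0 * X[:, 1]])

rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(60, 2))       # points x_i
predict_grad = fit_gradient_gp(X_train, grad_f(X_train))

X_new = rng.uniform(-0.8, 0.8, size=(200, 2))        # cheap to evaluate here
max_err = np.abs(predict_grad(X_new) - grad_f(X_new)).max()
```

Once `alpha` has been computed, each new gradient estimate costs one kernel evaluation against the training points and one matrix product, which is why many estimated gradients can be generated without further evaluations of the true gradient.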

Therefore, we propose the following:

Algorithm 5: Randomly selected points, random regression points, KD tree, Gaussian process

1. Uniformly and independently draw $M^*$ samples $\{x_k\}$ from the $m$-dimensional parameter space.
2. For each $k$, evaluate $\nabla_x f(x_k)$.
3. Construct a Gaussian process using $\{x_k\}$ and $\{\nabla_x f(x_k)\}$.
4. Uniformly and independently draw $N^*$ regression samples $\{x_i\}$ from the $m$-dimensional parameter space.
5. Assign each $x_i$ to its nearest neighbour among $\{x_k\}$ using the KD tree, resulting in $\{\{x_{ki}\}\}$.
6. For each $k$, use $\{x_{ki}\}$ to generate $N_k$ gradients $\nabla_x \hat{f}_{ki}$ from the Gaussian process.
7. For each $k$, evaluate $\hat{C}_k = \frac{1}{N_k} \sum_{i=1}^{N_k} (\nabla_x \hat{f}_{ki})(\nabla_x \hat{f}_{ki})^T$.
8. For each $k$, apply eigenvalue decomposition to $\hat{C}_k$ and obtain $\hat{C}_k = \hat{W}_k \hat{\Lambda}_k \hat{W}_k^T$.
9. For each $k$, find the largest eigenvalue gap in $\hat{\Lambda}_k$ and hence determine $\hat{W}_{k1}$.
10. For each $k$ and each $i$, compute $y_{ki} = \hat{W}_k^T x_{ki}$.
11. For each $k$, construct $\hat{g}_k(y_k) = R(y_k; \{\hat{g}_k(y_{ki})\})$.
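Steps 1 to 10 of Algorithm 5 can be sketched as follows. This is an illustrative sketch under several assumptions, not the implementation used here: the parameter space is taken to be $[-1, 1]^m$, the Gaussian process is replaced by the posterior mean of independent per-component processes with a shared squared-exponential kernel, step 10 is assumed to project onto the active directions $\hat{W}_{k1}$ found in step 9, and the regression step 11 is omitted. The names `rbf`, `local_active_subspaces` and `ell` are hypothetical. The KD tree is `scipy.spatial.cKDTree`.

```python
import numpy as np
from scipy.spatial import cKDTree

def rbf(A, B, ell=1.0):
    """Squared-exponential kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def local_active_subspaces(grad_f, m, M, N, ell=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: draw region centres x_k and evaluate the true gradients.
    centres = rng.uniform(-1.0, 1.0, size=(M, m))
    G = grad_f(centres)
    # Step 3: GP posterior-mean weights (shared RBF kernel is an assumption).
    alpha = np.linalg.solve(rbf(centres, centres, ell) + 1e-8 * np.eye(M), G)
    # Step 4: draw regression samples x_i.
    X = rng.uniform(-1.0, 1.0, size=(N, m))
    # Step 5: assign each x_i to its nearest centre via the KD tree.
    _, region = cKDTree(centres).query(X)
    subspaces = {}
    for k in range(M):
        X_k = X[region == k]
        if len(X_k) == 0:
            continue                    # no regression samples in this region
        # Step 6: GP-estimated gradients at the N_k points of region k.
        G_hat = rbf(X_k, centres, ell) @ alpha
        # Step 7: C_k = (1/N_k) * sum_i g_i g_i^T.
        C_k = G_hat.T @ G_hat / len(X_k)
        # Step 8: eigendecomposition, reordered to descending eigenvalues.
        lam, W = np.linalg.eigh(C_k)
        lam, W = lam[::-1], W[:, ::-1]
        # Step 9: the largest eigenvalue gap fixes the active dimension n.
        n = int(np.argmax(np.abs(np.diff(lam)))) + 1
        W1 = W[:, :n]
        # Step 10: reduced coordinates (assumes projection onto W1).
        subspaces[k] = (W1, X_k @ W1)
    return subspaces

# Hypothetical ridge function f(x) = sin(a . x), with gradient cos(a . x) a.
a = np.array([0.6, 0.3, 0.1])
grad_f = lambda X: np.cos(X @ a)[:, None] * a
result = local_active_subspaces(grad_f, m=3, M=20, N=500)
```

For this ridge function every region should return a one-dimensional subspace parallel to `a`. The dictionary `result` maps a region index to the pair of active directions and reduced coordinates, which is what a regression step such as step 11 would consume.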

The difference between this algorithm and the first algorithm is that we first evaluate the gradients at all samples $x_k$ and use them to fit a Gaussian process. Afterwards, once the training points $x_{ki}$ have been assigned, we can pass all of these points to the Gaussian process and obtain the estimated gradients $\nabla_x \hat{f}_{ki}$ for $i = 1, \ldots, N_k$. Because evaluating $\nabla_x \hat{f}_{ki}$ is relatively fast, we would expect an algorithm that is faster and more accurate than the active subspace method, provided that the function is smooth.

In document Improve the Active Subspace Method by Partitioning the Parameter Space (Page 57-59)