3.1.3 Approximating gradients by Gaussian process
The next method uses a Gaussian process (section 2.2) to approximate gradients. The idea is that, by Theorem 2, the larger the number of gradients evaluated, the higher the probability that we obtain an accurate subspace. Note that algorithms 1 and 2 suggest that $\hat{C}_k = (\nabla_x f_k)(\nabla_x f_k)^T$. This, however, is equivalent to saying that the active subspace $W_{k1}$ is always one-dimensional and equal to the gradient $\nabla_x f_k$. We prove this property as follows.
Theorem 8. For non-zero $\nabla_x f_k$, if $\hat{C}_k = (\nabla_x f_k)(\nabla_x f_k)^T$, the active subspace for region $k$ is simply the gradient; in other words,
$$W_{k1} = \nabla_x f_k.$$
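Proof. From section 2, we know that $W_{k1}$ is the matrix whose columns are the $n$ eigenvectors $w_{k1}, w_{k2}, \ldots, w_{kn}$ associated with the $n$ largest eigenvalues $\lambda_{k1}, \lambda_{k2}, \ldots, \lambda_{kn}$ of the matrix $\hat{C}_k$, where $n$ is set such that the gap between $\lambda_{kn}$ and $\lambda_{k,n+1}$ is the largest. In other words, $n = \arg\max_i |\lambda_{ki} - \lambda_{k,i+1}|$ for $i = 1, \ldots, m-1$.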
Therefore, we first need to prove that $\nabla_x f_k$ is an eigenvector of the matrix $\hat{C}_k$ and then prove that the largest gap lies between the first eigenvalue and the second eigenvalue.
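First, let us prove that $\nabla_x f_k$ is an eigenvector of the matrix $\hat{C}_k$. We have $\hat{C}_k = (\nabla_x f_k)(\nabla_x f_k)^T$. For non-zero $\nabla_x f_k$, we have
$$\hat{C}_k \nabla_x f_k = (\nabla_x f_k)(\nabla_x f_k)^T \nabla_x f_k = \nabla_x f_k \, \|\nabla_x f_k\|^2 = \|\nabla_x f_k\|^2 \, \nabla_x f_k.$$
Hence $\nabla_x f_k$ is an eigenvector of $\hat{C}_k$ with eigenvalue $\|\nabla_x f_k\|^2 > 0$.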
Next, we need to prove that the largest gap lies between the first eigenvalue and the second eigenvalue.
First, we know that $\hat{C}_k$ is a symmetric matrix. We can show that it is also positive semi-definite because, for any $v \in \mathbb{R}^m$,
$$v^T \hat{C}_k v = v^T (\nabla_x f_k)(\nabla_x f_k)^T v \qquad (3.8)$$
$$= (v^T \nabla_x f_k)^2 \geq 0. \qquad (3.9)$$
Therefore, $\hat{C}_k$ admits an eigenvalue decomposition in which all eigenvalues are greater than or equal to 0 and the eigenvectors are orthogonal to each other. In particular, the remaining eigenvectors are orthogonal to $\nabla_x f_k$. Let $p_i$ denote any of these other eigenvectors; since $(\nabla_x f_k)^T p_i = 0$, we have
$$\hat{C}_k p_i = (\nabla_x f_k)(\nabla_x f_k)^T p_i = 0.$$
Therefore, the eigenvalues corresponding to all other eigenvectors are 0. The gap between the first and second eigenvalues is thus $\|\nabla_x f_k\|^2 > 0$, whereas the differences between all other consecutive eigenvalues are 0. Hence $n = 1$, which completes the proof.
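The statement of Theorem 8 can be checked numerically. The following is a minimal sketch, assuming NumPy and a randomly drawn vector standing in for the gradient; the dimension `m = 5` and the variable names are illustrative and not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                                  # dimension of the parameter space (arbitrary choice)
grad = rng.standard_normal(m)          # stand-in for a non-zero gradient
C = np.outer(grad, grad)               # C_k = (grad)(grad)^T

# eigh returns eigenvalues in ascending order, so reverse to descending.
eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Gaps |lambda_i - lambda_{i+1}| for i = 1, ..., m-1.
gaps = np.abs(eigvals[:-1] - eigvals[1:])
n = int(np.argmax(gaps)) + 1           # dimension of the active subspace

# Compare the leading eigenvector with the normalised gradient.
w1 = eigvecs[:, 0]
alignment = abs(w1 @ grad / np.linalg.norm(grad))   # 1.0 means parallel

print("n =", n)
print("largest eigenvalue equals squared gradient norm:",
      np.isclose(eigvals[0], grad @ grad))
print("alignment with gradient:", alignment)
```

Because numerical eigenvectors are returned with unit length and arbitrary sign, the check compares directions rather than the vectors themselves.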
Therefore, if we say that the original active subspace works well for nearly ridge functions and that constructing local active subspaces using one gradient point works well for local ridge functions, then utilising a Gaussian process to model gradients should work for more general functions that have a smooth gradient space. Although one can argue that, for sufficiently small regions, the surface of any continuously differentiable function can be well approximated by its tangent hyperplane, we do not know beforehand how small the regions should be or how many regions we should have. Therefore, the Gaussian process should still contribute valuable information.
Therefore, we choose to fit a Gaussian process with points $\{x_i\}$ and their gradients $\{\nabla_x f(x_i)\}$. Also, the Gaussian process is evaluated relatively quickly; hence, we can use more gradient information.
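To illustrate what fitting a Gaussian process to gradient samples can look like, here is a minimal NumPy sketch. It rests on assumptions that are not from the text: a squared-exponential kernel with a fixed length scale, a small jitter term, and each gradient component treated as an independent zero-mean Gaussian process sharing that kernel. The formulation in section 2.2 may differ. The names `rbf_kernel`, `fit_gradient_gp` and the test function are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)

def fit_gradient_gp(X, G, length_scale=1.0, jitter=1e-8):
    """Fit on points X (M, m) and gradients G (M, m); return a predictor.

    Assumption of this sketch: every gradient component is an independent
    zero-mean GP with the same kernel, so one linear solve serves all columns.
    """
    K = rbf_kernel(X, X, length_scale) + jitter * np.eye(len(X))
    alpha = np.linalg.solve(K, G)                 # (M, m) weights

    def predict(X_new):
        # Posterior mean of the gradient at the new points.
        return rbf_kernel(X_new, X, length_scale) @ alpha

    return predict

# Hypothetical test function f(x) = sin(x_1) + x_2^2 with a known gradient.
def grad_f(X):
    return np.column_stack([np.cos(X[:, 0]), 2.0 * X[:, 1]])

rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(40, 2))    # points where gradients are evaluated
predict_grad = fit_gradient_gp(X_train, grad_f(X_train))

X_new = rng.uniform(-1.0, 1.0, size=(200, 2))     # points that only query the GP
max_err = np.max(np.abs(predict_grad(X_new) - grad_f(X_new)))
print("max abs error of predicted gradients:", max_err)
```

Once the linear system is solved, each additional gradient estimate costs only a kernel evaluation and a matrix product, which is why many estimated gradients can be used per region.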
Therefore, we propose the following:
Algorithm 5: Randomly selected points, random regression points, KD tree, Gaussian process
1. Uniformly and independently draw $M^*$ samples $\{x_k\}$ from the $m$-dimensional parameter space.
2. For each $k$, evaluate $\nabla_x f(x_k)$.
3. Construct a Gaussian process using $\{x_k\}$ and $\{\nabla_x f(x_k)\}$.
4. Uniformly and independently draw $N^*$ regression samples $\{x_i\}$ from the $m$-dimensional parameter space.
5. Assign each $x_i$ to its nearest neighbour among $\{x_k\}$ using the KD tree, resulting in $\{\{x_{ki}\}\}$.
6. For each $k$, use $\{x_{ki}\}$ to generate $N_k$ estimated gradients $\nabla_x \hat{f}_{ki}$ from the Gaussian process.
7. For each $k$, evaluate $\hat{C}_k = \frac{1}{N_k} \sum_{i=1}^{N_k} (\nabla_x \hat{f}_{ki})(\nabla_x \hat{f}_{ki})^T$.
8. For each $k$, apply eigenvalue decomposition to $\hat{C}_k$ and obtain $\hat{C}_k = \hat{W}_k \hat{\Lambda}_k \hat{W}_k^T$.
9. For each $k$, find the largest eigenvalue gap in $\hat{\Lambda}_k$ and hence determine $\hat{W}_{k1}$.
10. For each $k$ and each $i$, compute $y_{ki} = \hat{W}_k^T x_{ki}$.
11. For each $k$, construct $\hat{g}_k(y_k) = R(y_k; \{\hat{g}_k(y_{ki})\})$.
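Steps 5 to 10 of the algorithm can be sketched as follows, assuming NumPy and SciPy's `cKDTree`. The function name `local_active_subspaces`, the callable `grad_hat` and the ridge test function are hypothetical. In the example the exact gradient stands in for the Gaussian process prediction of step 6, and step 10 is read here as a projection onto the active directions $\hat{W}_{k1}$, which is an interpretation rather than something the listing states. The response surface of step 11 is left out.

```python
import numpy as np
from scipy.spatial import cKDTree

def local_active_subspaces(centres, X_reg, grad_hat):
    """Sketch of steps 5 to 10 for given centres x_k and regression points x_i.

    centres:  (M, m) points x_k that define the regions.
    X_reg:    (N, m) regression points x_i.
    grad_hat: callable mapping (n, m) points to (n, m) estimated gradients;
              in the text this role is played by the fitted Gaussian process.
    Returns a dict k -> (W_k1, Y_k) for every region that received points.
    """
    # Step 5: assign each regression point to its nearest centre.
    _, owner = cKDTree(centres).query(X_reg, k=1)
    result = {}
    for k in range(len(centres)):
        X_k = X_reg[owner == k]
        if len(X_k) == 0:
            continue                          # no regression point fell in region k
        G_k = grad_hat(X_k)                   # step 6: estimated gradients
        C_k = G_k.T @ G_k / len(X_k)          # step 7: mean of outer products
        lam, W = np.linalg.eigh(C_k)          # step 8: eigenvalues in ascending order
        lam, W = lam[::-1], W[:, ::-1]        # reorder to descending
        n = int(np.argmax(lam[:-1] - lam[1:])) + 1   # step 9: largest gap
        W_k1 = W[:, :n]
        result[k] = (W_k1, X_k @ W_k1)        # step 10: reduced coordinates
    return result

# Hypothetical ridge function f(x) = sin(a . x), so grad f(x) = cos(a . x) a.
a = np.array([1.0, 0.5, -0.5])

def exact_grad(X):
    # Stand-in for the Gaussian process prediction (assumption of this example).
    return np.cos(X @ a)[:, None] * a[None, :]

rng = np.random.default_rng(1)
centres = rng.uniform(-1.0, 1.0, size=(8, 3))     # step 1 (gradient fitting omitted)
X_reg = rng.uniform(-1.0, 1.0, size=(400, 3))     # step 4
regions = local_active_subspaces(centres, X_reg, exact_grad)

for k, (W_k1, Y_k) in regions.items():
    print(k, W_k1.shape, Y_k.shape)
```

For a ridge function every local matrix $\hat{C}_k$ has rank one, so each region should report a one-dimensional active subspace aligned with `a`.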
The difference between this algorithm and the first algorithm is that we first evaluate the gradients at all samples $x_k$ and use them to fit a Gaussian process. Afterwards, once the training points $x_{ki}$ have been assigned, we can pass all these points to the Gaussian process and obtain the estimated gradients $\nabla_x \hat{f}_{ki}$ for $i = 1, \ldots, N_k$. Because the evaluation of $\nabla_x \hat{f}_{ki}$ is relatively fast, we would expect an algorithm that is faster and more accurate than the active subspace method, provided that the function is smooth.