Computation Time Comparison - Improve the Active Subspace Method by Partitioning the Parameter

In this section, we compare the computation times of algorithms 3, 5, 6 and 7 on the four test functions. The time displayed is the total time used to evaluate M∗ gradients, train M∗ regions with 8000 regression points and then evaluate 8000 testing points. We run each algorithm 10 times for each of the four values of M∗ and report the median as the result.
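The timing protocol described above (10 repetitions, median reported) can be sketched as follows. This is a minimal illustration and not the code used to produce the reported timings; `median_runtime` and `dummy_pipeline` are hypothetical names, and the dummy workload merely stands in for gradient evaluation, training and prediction.

```python
import statistics
import time


def median_runtime(run_once, repeats=10):
    """Return the median wall-clock time in seconds of `repeats` calls to `run_once`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()  # one full run: gradient evaluation, training, prediction
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def dummy_pipeline():
    # Hypothetical stand-in workload; replace with the algorithm under test.
    sum(i * i for i in range(10_000))


elapsed = median_runtime(dummy_pipeline, repeats=10)
print(f"median time over 10 runs: {elapsed:.4f} s")
```

The median is less sensitive than the mean to occasional slow runs caused by other processes on the machine.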

As shown in table 4.1, the original active subspace method is fast but not accurate. Its MSE is reduced to 3655.77 when evaluating 100 gradient points, but this is still far larger than the result of any of our proposed methods when evaluating only 5 gradient points. Additionally, the original method reaches the limit of its predictive ability after approximately 20 gradients, at an MSE that remains substantially higher than that of the other methods with far fewer gradients evaluated. Therefore, we cannot compare the time spent at the same level of MSE, and it is not fair to compare the MSE at the same level of time spent. Note that the computation times of the two adaptive methods (algorithms 6 and 7) increase significantly as more gradients are evaluated. This occurs because the code was not optimised for speed, owing to the time constraint on this thesis.

FIGURE 4.15: Points selected by algorithm 3 (randomly) and algorithm 6 (adaptively) on f(x, y) = x exp(−x^2 − y^2)

Table 4.2 shows the computation times for algorithms 3, 5, 6 and 7 on the 7D Rosenbrock function. Again, our methods have lower MSEs than the original active subspace method. For the same amount of time spent, the random methods yield better results than the original method. Specifically, evaluating 100 gradients takes the original method 0.0394 seconds and produces an MSE of 17208.43, while algorithms 3 and 5 evaluating 10 gradients take a similar amount of time (0.0360 and 0.0415 seconds) but produce MSEs of 14135.01 and 13421.18, respectively.

| Algorithm | MSE (M∗ = 5) | Time in s (M∗ = 5) | MSE (M∗ = 10) | Time in s (M∗ = 10) | MSE (M∗ = 50) | Time in s (M∗ = 50) | MSE (M∗ = 100) | Time in s (M∗ = 100) |
|---|---|---|---|---|---|---|---|---|
| Original active subspace method | 4362.88 | 0.0016 | 5596.29 | 0.0018 | 3977.90 | 0.0031 | 3655.77 | 0.0045 |
| Algorithm 3 | 891.14 | 0.0319 | 213.32 | 0.036 | 3.96 | 0.0634 | 0.09 | 0.0929 |
| Algorithm 5 | 1121.46 | 0.0356 | 111.51 | 0.0411 | 2.14 | 0.0695 | 0.573 | 0.1051 |
| Algorithm 6 | 176.91 | 0.0745 | 39.21 | 0.1748 | 0.68 | 1.5047 | 0.14 | 4.4932 |
| Algorithm 7 | 413.25 | 0.0755 | 127.95 | 0.1926 | 0.67 | 1.7253 | 0.14 | 5.1222 |

TABLE 4.1: Computation time for 2D Rosenbrock function

| Algorithm | MSE (M∗ = 10) | Time in s (M∗ = 10) | MSE (M∗ = 20) | Time in s (M∗ = 20) | MSE (M∗ = 50) | Time in s (M∗ = 50) | MSE (M∗ = 100) | Time in s (M∗ = 100) |
|---|---|---|---|---|---|---|---|---|
| Original active subspace method | 29391.51 | 0.0050 | 25240.27 | 0.0082 | 19819.90 | 0.0239 | 17208.43 | 0.0394 |
| Algorithm 3 | 14135.01 | 0.0360 | 10764.49 | 0.0470 | 7499.85 | 0.0831 | 5171.59 | 0.1377 |
| Algorithm 5 | 13421.18 | 0.0415 | 10577.92 | 0.0540 | 6405.61 | 0.0897 | 4491.19 | 0.1506 |
| Algorithm 6 | 12509.53 | 0.0876 | 7708.56 | 0.3819 | 4647.69 | 1.9171 | 3355.28 | 6.7182 |
| Algorithm 7 | 12923.87 | 0.0955 | 7402.69 | 0.4378 | 4522.21 | 2.2146 | 3121.39 | 7.6900 |

TABLE 4.2: Computation time for the 7D Rosenbrock function

| Algorithm | MSE (M∗ = 10) | Time in s (M∗ = 10) | MSE (M∗ = 20) | Time in s (M∗ = 20) | MSE (M∗ = 50) | Time in s (M∗ = 50) | MSE (M∗ = 100) | Time in s (M∗ = 100) |
|---|---|---|---|---|---|---|---|---|
| Original active subspace method | 0.42 | 0.0056 | 0.28 | 0.0079 | 0.26 | 0.0149 | 0.24 | 0.0286 |
| Algorithm 3 | 0.22 | 0.0312 | 0.19 | 0.0564 | 0.14 | 0.0848 | 0.11 | 0.1255 |
| Algorithm 5 | 0.22 | 0.0416 | 0.19 | 0.0633 | 0.14 | 0.0941 | 0.10 | 0.1329 |
| Algorithm 6 | 0.22 | 0.701 | 0.16 | 0.4411 | 0.11 | 2.0519 | 0.09 | 6.6763 |
| Algorithm 7 | 0.22 | 0.725 | 0.16 | 0.5080 | 0.11 | 2.3730 | 0.08 | 7.5203 |

TABLE 4.3: Computation time for the robot arm function

On the robot arm function (table 4.3), we see behaviour similar to that of the algorithms on the 7D Rosenbrock function. The MSEs of the adaptive methods are the best, but these methods take a long time to run. Evaluating 100 gradient points using the original method takes a similar amount of time as algorithms 3 and 5 evaluating 10 gradients, but the latter algorithms provide better MSE results. We can push this even further by evaluating 1000 gradient points for the original method, which provides an MSE of 0.22 and takes 0.24 seconds. This is longer in computation time and higher in MSE than algorithms 3 and 5 evaluating 100 gradients. For the same level of MSE (0.22), algorithms 3 and 5 with 10 gradients use 0.0312 and 0.0416 seconds, respectively, which is roughly 13% to 17% of the time used by the original method evaluating 1000 gradients.
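The time comparison at equal MSE can be checked with a few lines of arithmetic. The numbers below are copied from table 4.3 and from the 1000-gradient run quoted in the text; the variable names are only illustrative.

```python
# Times in seconds at an MSE of 0.22 on the robot arm function.
time_original_1000 = 0.24  # original method, 1000 gradient points
time_alg3_10 = 0.0312      # algorithm 3, 10 gradient points
time_alg5_10 = 0.0416      # algorithm 5, 10 gradient points

ratio_alg3 = time_alg3_10 / time_original_1000
ratio_alg5 = time_alg5_10 / time_original_1000
print(f"algorithm 3: {ratio_alg3:.2f}, algorithm 5: {ratio_alg5:.2f}")
```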

Finally, we evaluate the computation times of the algorithms on the OTL-circuit function (table 4.4). As before, for the computation time comparison, we evaluate 300 gradient points for the original method and obtain an MSE of 0.0297 and an evaluation time of 0.0491 seconds. This is comparable to the results of algorithms 3 and 5 with 20 gradient points, for which we obtain MSEs of 0.0263 and 0.0265 (table 4.5).

| Algorithm | MSE (M∗ = 10) | Time in s (M∗ = 10) | MSE (M∗ = 20) | Time in s (M∗ = 20) | MSE (M∗ = 50) | Time in s (M∗ = 50) | MSE (M∗ = 100) | Time in s (M∗ = 100) |
|---|---|---|---|---|---|---|---|---|
| Original active subspace method | 0.0616 | 0.0030 | 0.0343 | 0.0048 | 0.0322 | 0.0149 | 0.0310 | 0.0169 |
| Algorithm 3 | 0.0378 | 0.0341 | 0.0263 | 0.0430 | 0.0197 | 0.0604 | 0.0157 | 0.1113 |
| Algorithm 5 | 0.0403 | 0.0390 | 0.0265 | 0.0493 | 0.0172 | 0.0662 | 0.0152 | 0.1270 |
| Algorithm 6 | 0.0420 | 0.0985 | 0.0336 | 0.3510 | 0.0183 | 1.6238 | 0.0216 | 5.1221 |
| Algorithm 7 | 0.0427 | 0.1125 | 0.0312 | 0.4023 | 0.0259 | 1.9091 | 0.0207 | 5.8737 |

TABLE 4.4: Computation time for the OTL-circuit 7D function

| Algorithm | MSE | Time (seconds) |
|---|---|---|
| Original active subspace method | 0.0297 | 0.0491 |
| Algorithm 3 | 0.0263 | 0.0430 |
| Algorithm 5 | 0.0265 | 0.0493 |

TABLE 4.5: Original method evaluating 300 gradient points vs. algorithms 3 and 5 evaluating 20 gradient points

4.5 Conclusion

In conclusion, we test algorithms 3 to 7 on four different functions. We find that our algorithms outperform the original method on all the test functions in terms of MSE. Generally, methods with a Gaussian-process-estimated gradient perform better than their non-Gaussian counterparts. This is because the Gaussian methods capture potential active subspaces with higher dimensions. The adaptive point selection methods (algorithms 6 and 7) outperform the random point selection methods on the first three test functions. We believe that one of the reasons for this is that these three test functions all have local ridge or nearly ridge structures. In other words, these functions exhibit different ridge behaviours in different regions. Therefore, the adaptive algorithms that capture such structures perform better than purely random algorithms.

We also discussed why algorithms 6 and 7 are less effective on the last test function, together with examples and a reproduction of the problem on a different function. We found that the adaptive algorithms became 'stuck' in regions that admit a lower-dimensional active subspace but in which the function still varies steadily in the other directions as well. For example, given m = 6, we may have eigenvalues for region k of λ(C_k) = (10, 9.9, 9.8, 8, 7.9, 7.8). This gives an active subspace of dimension 3, as the gap between the third and fourth largest eigenvalues is 1.8, and the gaps between all other consecutive eigenvalues are 0.1. Let us also assume that all the other regions have eigenvalues of (10, 9.9, 9.8, 0.3, 0.2, 0.1), which also suggests a three-dimensional active subspace. Consequently, if we test the response surfaces generated by the active subspace, we generally obtain a substantially higher total squared error for region k than for the other regions. By the same logic, the point with the highest squared error also lies in region k. Therefore, our adaptive algorithm always chooses points from that region, but the active subspace for that region is at most three-dimensional, so the algorithms end up not considering other regions. To address this problem, we introduce a new algorithm in the next chapter. Unfortunately, because of the time constraint on this thesis, we are unable to test this new algorithm.

Moreover, we compare the running times of all of our methods on the four test functions. We find that our algorithms all achieve a lower MSE than the original active subspace method. In particular, all the algorithms achieve better results than the original active subspace method using 8000 gradient points. This result suggests that the MSE limit of the original method is high, that is, its best predictability is limited. Therefore, if one requires higher approximation accuracy, in other words, a lower MSE, then our algorithms are superior to the original method. Algorithms 6 and 7 have the lowest MSE for the first three test functions. However, these algorithms take longer to run. One reason is inherent to the algorithms: in each iteration, they must go through the candidate points to find the point associated with the highest squared error. Another important reason is that we have not optimised the code because of time constraints on this thesis. The running speed can easily be improved by storing the calculated gradients, function values and estimated values in each iteration and modifying them only when needed. This would be much faster than recalculating all the gradients and other values in each iteration, which is what our code currently does.
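The eigenvalue example in this section can be checked numerically. The sketch below picks the active subspace dimension from the largest gap between consecutive eigenvalues and reports the share of the spectrum left in the discarded directions. That share is used here only as a rough proxy for the approximation error, and the helper names `largest_gap_dimension` and `inactive_fraction` are hypothetical.

```python
import numpy as np


def largest_gap_dimension(eigenvalues):
    """Return (dimension, gaps): the dimension sits before the largest spectral gap."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    gaps = lam[:-1] - lam[1:]
    return int(np.argmax(gaps)) + 1, gaps


def inactive_fraction(eigenvalues, dim):
    """Share of the eigenvalue sum carried by the discarded (inactive) directions."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return lam[dim:].sum() / lam.sum()


region_k = [10, 9.9, 9.8, 8, 7.9, 7.8]       # the problematic region
other = [10, 9.9, 9.8, 0.3, 0.2, 0.1]        # all other regions

dim_k, gaps_k = largest_gap_dimension(region_k)
dim_o, gaps_o = largest_gap_dimension(other)
frac_k = inactive_fraction(region_k, dim_k)
frac_o = inactive_fraction(other, dim_o)

print(dim_k, dim_o)                           # both gap criteria select dimension 3
print(f"inactive share: region k {frac_k:.3f}, other regions {frac_o:.3f}")
```

Both spectra select a three-dimensional subspace, but region k keeps a much larger share of its spectrum in the discarded directions, which is consistent with its squared error staying the highest however many points are added there.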

Chapter 5

Discussion

To address the problem found in testing the OTL-circuit function, we propose a potentially better algorithm: algorithm 9.

To improve the adaptive methods that we proposed in the last chapter, we add two criteria. The first criterion is the angle between the gradient at the point with the highest squared error and the gradient at the point that creates the region in which the point with the highest squared error lies. The second criterion is the ratio of the squared error to the largest function value.

The first criterion is useful in many cases. For example, we can address the issue found in the OTL-circuit function by examining the angle mentioned above. In other words, if the point associated with the highest error provides a gradient that is sufficiently similar to the gradient of the point that creates the corresponding region, then we can say that this region has a one-dimensional active subspace, i.e., the region achieves the best result that the active subspace method can obtain. We evaluate

$$\frac{\nabla f(x_{k+1})^{T}\,\nabla f(x_k)}{\|\nabla f(x_k)\|\,\|\nabla f(x_{k+1})\|},$$

which is the cosine of the angle between the two gradients, and compare it with some threshold. If the value is within a tolerance, say, 0.1, of 1, then the angle between them is small. This is similar to the test used in the restart Fletcher-Reeves method. Then, we can either skip searching for the next gradient point in this region or increase the dimension of the corresponding active subspace if we want higher accuracy in this particular region. This is where the second criterion comes in: it sets the threshold that decides whether the accuracy is acceptable. We calculate the ratio of the chosen point's squared error to the largest function value evaluated on the N∗ points. If the ratio is smaller than the value that we set, we can stop searching that region; otherwise, we increase the dimension of the corresponding active subspace.
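The two criteria can be combined into a single decision rule, sketched below. This is an illustration under stated assumptions rather than a tested implementation: the function names, the default tolerances `cos_tol` and `ratio_tol`, and the choice to call the angle small when the cosine is within `cos_tol` of 1 are all assumptions made for the example.

```python
import numpy as np


def gradient_cosine(g_new, g_region):
    """Cosine of the angle between two gradients (both assumed nonzero)."""
    g_new = np.asarray(g_new, dtype=float)
    g_region = np.asarray(g_region, dtype=float)
    return float(g_new @ g_region / (np.linalg.norm(g_new) * np.linalg.norm(g_region)))


def next_action(g_new, g_region, squared_error, largest_f, cos_tol=0.1, ratio_tol=0.01):
    """Decide what to do with the region containing the worst-fitted point."""
    if 1.0 - gradient_cosine(g_new, g_region) > cos_tol:
        return "new_region"            # gradients differ: the region needs splitting
    if squared_error / largest_f < ratio_tol:
        return "stop_region"           # gradients agree and the error is acceptable
    return "increase_dimension"        # gradients agree but the error is still too large


print(next_action([1.0, 0.0], [1.0, 0.01], squared_error=0.001, largest_f=10.0))
print(next_action([1.0, 0.0], [0.0, 1.0], squared_error=0.001, largest_f=10.0))
print(next_action([2.0, 0.0], [1.0, 0.0], squared_error=5.0, largest_f=10.0))
```

The angle test is applied first because the error ratio only matters once the region's gradient direction is known to be representative.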

This addresses the issue that the adaptive methods are not effective for functions with regions that vary significantly in one direction and vary less significantly but constantly in other directions because the new algorithm will skip searching the next points from regions that have reached their maximum predictability.

This new algorithm is also more efficient for ridge or nearly ridge functions. For example, let us consider the function f(x, y) = x^2, which is a perfect ridge function, and the gradient of the function is simply [2x, 0]. Therefore, the second point chosen by the adaptive algorithms will have the same normalised gradient, up to sign, as the first randomly chosen point (in theory; in practice, there are numerical errors). This suggests that if we use a one-dimensional active subspace, then the active subspace constructed from the first point is sufficient to capture all the information; in other words, it reaches the limit of the active subspace method. Therefore, using the new algorithm with a reasonable error threshold, we only need to evaluate 2 gradient points to capture all the information that we need. The situation is the same for nearly ridge functions.
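The ridge example can be verified directly. The sketch below evaluates the gradient of f(x, y) = x^2 at a few random points and shows that the rank-one matrix built from each gradient always yields the same active direction up to sign. The sampling range and the random seed are arbitrary choices for the illustration.

```python
import numpy as np


def grad_f(point):
    """Gradient of the perfect ridge function f(x, y) = x^2."""
    x, _y = point
    return np.array([2.0 * x, 0.0])


rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(5, 2))

directions = []
for p in points:
    g = grad_f(p)
    C = np.outer(g, g)                      # rank-one matrix from a single gradient
    _eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    directions.append(eigvecs[:, -1])       # eigenvector of the largest eigenvalue

for p, w in zip(points, directions):
    print(p, np.abs(w))                     # the direction is (1, 0) up to sign
```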

The original active subspace method is a powerful tool in computationally intensive fields such as uncertainty quantification and inverse problems. It can also be considered a dimensionality reduction method. However, approximating the whole parameter space using only one subspace brings limitations. As we find for all the test functions, the MSE limit that the original active subspace method can achieve is high. We believe that there remains plenty of room for the active subspace method to improve.

There are also other possible directions that future research may pursue. First, there are many possible ways to develop an adaptive algorithm apart from the algorithms that we introduced. For example, following the setting of the adaptive methods proposed in this thesis, one may choose to use criteria other than the squared error. If one is interested only in the correctness of the approximated active subspace, then the largest gap between eigenvalues can be used as a criterion. Alternatively, the adaptive methods can be developed in a completely different manner. One possibility would be to develop expectation-maximisation-like methods that assign gradients to the optimal (although possibly not globally optimal) cluster of gradients. Another possibility would be to use a sparse-grid-like method to select the points and construct regions.

Second, the code efficiency can be significantly increased. Because of time constraints on this thesis, we did not optimise the efficiency of our code. However, we know that the code can be made to run much faster than it currently does. Therefore, we expect better performance in terms of computation time after some modifications to the code.
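One such modification is to store each gradient the first time it is computed and reuse it in later iterations. The sketch below is a hypothetical illustration of that idea, not part of the code used in this thesis; the class name `GradientCache` and the toy gradient function are assumptions.

```python
import numpy as np


class GradientCache:
    """Store gradients by point index so each one is computed only once."""

    def __init__(self, grad_fn):
        self._grad_fn = grad_fn
        self._store = {}
        self.evaluations = 0   # number of real gradient evaluations performed

    def gradient(self, index, point):
        if index not in self._store:
            self._store[index] = self._grad_fn(point)
            self.evaluations += 1
        return self._store[index]


# Toy gradient of f(x) = ||x||^2, used only to exercise the cache.
cache = GradientCache(lambda p: 2.0 * np.asarray(p, dtype=float))
points = np.array([[0.5, 1.0], [1.5, -1.0], [0.0, 2.0]])

for _iteration in range(4):            # four adaptive iterations
    for idx, p in enumerate(points):   # each iteration asks for every gradient
        cache.gradient(idx, p)

print(cache.evaluations)               # 3 evaluations instead of 12
```

The same pattern applies to function values and fitted values, which only need updating for the regions that change in a given iteration.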

Third, the active subspace method can be seen as applying PCA to the gradient space. If the gradients have a certain structure in the high-dimensional space, it would be interesting to apply manifold approximation techniques to the gradients and modify the active subspace method accordingly.
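The connection with PCA can be made concrete. With the gradients stacked as rows of a matrix G, the eigendecomposition of the sample matrix C = GᵀG / M coincides with the singular value decomposition of G, as in PCA without mean-centring. The sketch below uses synthetic gradients, so the data, the sizes `M` and `m` and the column scales are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
M, m = 200, 4                                   # number of gradients, input dimension
scales = np.array([3.0, 1.0, 0.3, 0.1])
G = rng.normal(size=(M, m)) * scales            # synthetic gradients, one per row

# Active subspace route: eigendecomposition of C = (1/M) * sum of g g^T.
C = G.T @ G / M
eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort in descending order

# PCA route (uncentred): singular value decomposition of G.
_, s, Vt = np.linalg.svd(G, full_matrices=False)
svd_eigvals = s**2 / M

print(np.allclose(eigvals, svd_eigvals))
print(np.round(np.abs(eigvecs.T @ Vt.T), 6))    # identity: same directions up to sign
```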

Fourth, methods of identifying functions that do not have an active subspace need to be developed. For example, the function f(x, y) = x^2 + y^2 does not have an active subspace. An attempt to find the active subspace will always produce the same amount of error, irrespective of the direction chosen in the parameter space. Therefore, we need a method to identify such functions quickly. More generally, we need methods that predict how well active-subspace-like methods will perform, rather than having to run the method first and inspect the result.
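One simple diagnostic, offered here only as a sketch of what such a method might look at, is the ratio of the smallest to the largest eigenvalue of the estimated gradient matrix. For f(x, y) = x^2 + y^2 with inputs uniform on [−1, 1]^2 the two eigenvalues are nearly equal, whereas a ridge function gives a ratio near zero. The sample size, input range and function names are assumptions.

```python
import numpy as np


def estimate_C(grad_fn, samples):
    """Monte Carlo estimate of C = E[grad f grad f^T] from gradients at the samples."""
    grads = grad_fn(samples)                 # shape (n_samples, m)
    return grads.T @ grads / len(samples)


def spectral_ratio(C):
    """Smallest eigenvalue divided by the largest; close to 1 means no clear gap."""
    lam = np.linalg.eigvalsh(C)              # ascending order
    return lam[0] / lam[-1]


rng = np.random.default_rng(0)
samples = rng.uniform(-1.0, 1.0, size=(20_000, 2))

# f(x, y) = x^2 + y^2 has gradient (2x, 2y).
ratio_iso = spectral_ratio(estimate_C(lambda X: 2.0 * X, samples))
# f(x, y) = x^2 has gradient (2x, 0).
ridge_grad = lambda X: np.column_stack([2.0 * X[:, 0], np.zeros(len(X))])
ratio_ridge = spectral_ratio(estimate_C(ridge_grad, samples))

print(f"x^2 + y^2: {ratio_iso:.3f}, x^2: {ratio_ridge:.3f}")
```

Such a check still requires gradient samples, so it does not remove the cost of gradient evaluation; it only avoids fitting and testing a response surface.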

Moreover, one may ask if using only multivariate regression techniques to construct the response surface is always the best approach. We know that constructing regional response surfaces using piecewise regressions introduces discontinuities from region to region. This may cause problems for gradient methods in optimisation, for example. One approach to address this issue is to smooth the response surface around the edges. Another option is to use Kriging or a Gaussian process to construct the response surface as a whole. However, simply applying these methods introduces problems. For example, let us assume that we have an active variable y_i = 1 with g_i(y_i) = 10 in one region and y_j = 1 with g_j(y_j) = −10 in another region. This is possible because we have different active subspaces in different regions; hence, the active variables from different active subspaces represent different parts of the underlying function. In other words, we cannot compare the active variables across regions. This means that we cannot construct an overall response surface by simply using all the y and their corresponding g(y).

Algorithm 9: New adaptive method

1. Initialise the total number of candidate points N∗.
2. Initialise the number of required gradient evaluations M∗.
3. Initialise the value of the required accuracy ratio E.
4. Initialise the threshold α for the angle between two gradients.
5. Uniformly and independently draw N∗ points {x_i} from the m-dimensional parameter space.
6. Evaluate {f(x_i)}.
7. Randomly select one point from {x_i} as x_j.
8. Evaluate ∇_x f(x_j).
9. Construct the active subspace by decomposing Ĉ_j = (∇_x f(x_j))(∇_x f(x_j))^T.
10. Construct the response surface using the remaining N∗ − 1 points.
11. Find the point x̂ in {x_i} \ {x_j} that is associated with the highest squared error.
12. Evaluate the gradient at x̂.
13. Find the region that x̂ belongs to and hence the point x_k ∈ {x_i} used to create that region.
14. Calculate the angle between ∇_x f(x̂) and ∇_x f(x_k).
15. If the angle between the two gradients is less than α and the ratio of the largest squared error to max({f(x_i)}) is less than E, end the algorithm and return the response surface.
16. If the angle between the two gradients is less than α and the ratio of the largest squared error to max({f(x_i)}) is larger than E, increase the dimension of the active subspace by one by adding the eigenvector corresponding to the (d + 1)-th largest eigenvalue, where d is the current dimension of the active subspace.
17. If the angle between the two gradients is larger than α, create a new region.
18. Repeat until the algorithm is stopped or the number of gradient evaluations M∗ is reached.
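The opening steps of the algorithm, from drawing the candidate points to locating the point with the highest squared error, can be sketched as follows. This is an untested illustration under several assumptions: the test function is the one from figure 4.15, the input range is [−2, 2]^2, and the response surface is a quadratic polynomial in the single active variable fitted with `np.polyfit`. None of these choices is prescribed by the algorithm itself.

```python
import numpy as np


def f(X):
    """f(x, y) = x exp(-x^2 - y^2), evaluated row-wise."""
    return X[:, 0] * np.exp(-X[:, 0] ** 2 - X[:, 1] ** 2)


def grad_f(p):
    """Analytic gradient of f at a single point p = (x, y)."""
    x, y = p
    e = np.exp(-x ** 2 - y ** 2)
    return np.array([(1.0 - 2.0 * x ** 2) * e, -2.0 * x * y * e])


rng = np.random.default_rng(0)
N_star = 200
X = rng.uniform(-2.0, 2.0, size=(N_star, 2))    # draw the candidate points
fX = f(X)                                       # evaluate f on all candidates

j = int(rng.integers(N_star))                   # randomly select the first point
g = grad_f(X[j])                                # evaluate its gradient
C_hat = np.outer(g, g)                          # rank-one matrix to decompose
_eigvals, eigvecs = np.linalg.eigh(C_hat)
w = eigvecs[:, -1]                              # one-dimensional active direction

mask = np.arange(N_star) != j                   # the remaining N* - 1 points
y_active = X[mask] @ w                          # active variable for each point
coeffs = np.polyfit(y_active, fX[mask], deg=2)  # assumed quadratic response surface
sq_err = (np.polyval(coeffs, y_active) - fX[mask]) ** 2

worst = int(np.flatnonzero(mask)[np.argmax(sq_err)])  # highest squared error
print("next gradient point:", X[worst], "squared error:", sq_err.max())
```

With a single gradient, the decomposition returns the normalised gradient itself as the active direction, so the first region always starts with a one-dimensional active subspace.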

Chapter 6

Conclusion

In conclusion, we study the effect of an active subspace applied to a partitioned input space. We introduce two families of algorithms that implement two partitioning approaches. The first approach divides the parameter space by creating Voronoi regions from samples generated randomly from a uniform distribution. The second approach also uses Voronoi regions, but the points are chosen adaptively using the squared error as a criterion. We also improve the predictability of our algorithms by using a Gaussian process to model the gradients. Our results on four test functions reveal that the response surfaces created by our proposed algorithms are more accurate (with a lower mean squared error on separately generated test points) than the response surface created by the original active subspace method.
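The Voronoi partition used by both families of algorithms amounts to assigning every point to its nearest region-generating point. The following sketch shows that assignment with plain NumPy under the Euclidean distance; the function name `assign_regions` and the sample sizes are assumptions made for the illustration.

```python
import numpy as np


def assign_regions(points, centres):
    """Return, for each point, the index of the nearest centre (its Voronoi region)."""
    diff = points[:, None, :] - centres[None, :, :]   # shape (n_points, n_centres, m)
    squared_dist = (diff ** 2).sum(axis=2)
    return squared_dist.argmin(axis=1)


rng = np.random.default_rng(0)
centres = rng.uniform(-1.0, 1.0, size=(5, 2))      # points that generate the regions
points = rng.uniform(-1.0, 1.0, size=(1000, 2))    # regression or test points

labels = assign_regions(points, centres)
print(np.bincount(labels, minlength=len(centres)))  # number of points per region
```

For large numbers of points or regions, a spatial index such as a k-d tree would avoid forming the full distance matrix.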

The original active subspace method works on models that have a ridge structure. In other words, one can find a set of directions such that the function values vary only along these directions. However, models may not always have a global ridge structure, and they may instead have local ridge structures. This means that one may find different sets of active directions in different regions of the function input space. The algorithms that we proposed aim to address these types of functions. The algorithms that use randomly selected points partition the space randomly and thereby capture more of the local ridge structure than the original active subspace method. The adaptive algorithms identify local ridge regions adaptively and hence perform better than the random methods. However, we also find that the adaptive algorithms are outperformed by the random algorithms on the fourth test function, which is the OTL-circuit function. We analyse the cause of this result by first plotting the points chosen by both algorithms with the help of PCA. We find that the points chosen by the adaptive method are more clustered than the points chosen by the random method. We believe that this result occurs because there are regions in the OTL-circuit function with relatively large eigenvalues in the inactive directions. In other words, in some regions, function values vary significantly along the active direction and less significantly but constantly in the inactive directions. This renders our adaptive algorithms ineffective because the algorithms choose only the points in this type of region. We then reproduced this problem using another function, which supports our hypothesis. We also propose a new algorithm to address this problem. Moreover, we also note a few aspects that future
