Prediction times - Flexible and efficient Gaussian process models for machine learning

2.5 Results

2.5.2 Prediction times

In section 2.1 we discussed how we may require different strategies depending on whether prediction cost or training cost is the more important factor to the user. In this section we test one of these extremes by looking at performance purely as a function of prediction time. The aim is to assess firstly whether the FIC approximation of section 2.3.4 performs better than the baseline SD method of section 2.1.1, and secondly whether the SPGP method of finding pseudo-inputs (section 2.2.1) performs better than random subset selection.

To make this comparison we use a ‘ground truth’ set of hyperparameters, obtained by maximising the SD marginal likelihood on a large subset of size 2,048, for each of the three data sets. Since we are assessing pure prediction times here, we do not worry about the training cost of the hyperparameter learning. Section 2.5.3 looks at this cost.
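As a rough illustration of how such ‘ground truth’ hyperparameters could be obtained, the Python sketch below evaluates the negative log marginal likelihood of a GP on one random subset and picks the best lengthscale from a small grid. Everything in it is an assumption made for illustration: the squared-exponential covariance, the function names, and in particular the grid search, which stands in for a gradient-based optimiser over all hyperparameters.

```python
import numpy as np


def se_kernel(A, B, lengthscale, signal_var):
    """Squared-exponential covariance between the rows of A and the rows of B."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return signal_var * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)


def neg_log_marginal_likelihood(X, y, lengthscale, signal_var, noise_var):
    """Negative log marginal likelihood of a zero-mean GP fitted to (X, y)."""
    n = X.shape[0]
    K = se_kernel(X, X, lengthscale, signal_var) + noise_var * np.eye(n)
    alpha = np.linalg.solve(K, y)
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * y @ alpha + 0.5 * logdet + 0.5 * n * np.log(2.0 * np.pi)


def fit_on_random_subset(X, y, subset_size, lengthscales,
                         signal_var, noise_var, seed=0):
    """Pick the grid lengthscale that minimises the objective on one random subset.

    Only the lengthscale is searched, to keep the sketch short.
    """
    rng = np.random.default_rng(seed)
    size = min(subset_size, X.shape[0])
    idx = rng.choice(X.shape[0], size=size, replace=False)
    scores = [
        neg_log_marginal_likelihood(X[idx], y[idx], ell, signal_var, noise_var)
        for ell in lengthscales
    ]
    return lengthscales[int(np.argmin(scores))]
```

With `subset_size=2048` this mirrors the subset size quoted above. The cost is cubic in the subset size for every candidate setting, which is why this step is treated as offline here.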

Figure 2.7 shows plots of test MSE or NLPD vs. total test prediction time (after any precomputations) on the three data sets of section 2.5.1, for three different methods. The first method is SD, with points plotted as blue triangles. We vary the size of the subset M in powers of two from M = 16 to M = 4,096 (footnote 10), and the subset is chosen randomly from the training data. The FIC approximation is plotted as black stars, and uses exactly the same random subset as SD for its inducing inputs. Finally we show the pseudo-input optimisation of the SPGP as red squares, initialised on the random subset (footnote 11). This pseudo-input optimisation does not include the joint hyperparameter optimisation discussed in section 2.2.1: here we want to separate out these two effects and show only how the pseudo-input optimisation improves on FIC with random subset selection. In this section all three methods rely on the same ‘ground truth’ hyperparameters.
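The SD part of this protocol can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: the squared-exponential covariance, all function names, and the use of an explicit matrix inverse (chosen for brevity; a Cholesky factor with triangular solves would be numerically preferable) are ours. The timer is started only after the precomputations, and the subset size is capped at the training set size so that the last point can be the whole training set.

```python
import time

import numpy as np


def se_kernel(A, B, lengthscale, signal_var):
    """Squared-exponential covariance between the rows of A and the rows of B."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return signal_var * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)


def subset_sizes(smallest=16, largest=4096):
    """Powers of two from `smallest` up to and including `largest`."""
    sizes = []
    m = smallest
    while m <= largest:
        sizes.append(m)
        m *= 2
    return sizes


def sd_fit(X, y, idx, lengthscale, signal_var, noise_var):
    """Precompute what SD needs from the chosen subset; the rest is discarded."""
    Xs, ys = X[idx], y[idx]
    K = se_kernel(Xs, Xs, lengthscale, signal_var) + noise_var * np.eye(len(idx))
    K_inv = np.linalg.inv(K)
    return Xs, K_inv, K_inv @ ys


def sd_predict(model, X_test, lengthscale, signal_var, noise_var):
    """Predictive mean and variance (including noise) at each test input."""
    Xs, K_inv, alpha = model
    Ks = se_kernel(X_test, Xs, lengthscale, signal_var)
    mean = Ks @ alpha
    var = signal_var + noise_var - np.sum((Ks @ K_inv) * Ks, axis=1)
    return mean, var


def sweep_sd(X, y, X_test, y_test, sizes, lengthscale, signal_var, noise_var, seed=0):
    """Return (M, prediction time in seconds, test MSE) for each subset size."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    results = []
    for M in sizes:
        M = min(M, N)  # the last point may be the whole training set
        idx = rng.choice(N, size=M, replace=False)
        model = sd_fit(X, y, idx, lengthscale, signal_var, noise_var)
        start = time.perf_counter()  # time prediction only, after precomputation
        mean, _ = sd_predict(model, X_test, lengthscale, signal_var, noise_var)
        elapsed = time.perf_counter() - start
        results.append((M, elapsed, float(np.mean((y_test - mean) ** 2))))
        if M == N:
            break
    return results
```

Calling `sweep_sd(X, y, X_test, y_test, subset_sizes(), ...)` with fixed hyperparameters yields the kind of (time, error) pairs that make up one curve of such a plot.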

Figure 2.7 helps us assess which method one should choose to obtain the best performance for a given test time. Curves lying towards the bottom left of the plots are therefore better. This type of requirement would occur if you had unlimited offline training time but needed to make a rapid series of online predictions. Since prediction time for all these methods depends only on the subset/inducing set size, and is O(M^2) per test case, these plots also essentially show how performance varies with subset size.
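To see where the O(M^2) cost per test case comes from, the following Python sketch implements the FIC/SPGP predictive distribution in the form it is usually stated in the literature, with all work involving the N training points done once in the constructor. The notation, the class name, the squared-exponential covariance and the jitter term are assumptions made for illustration, and may differ from the presentation in section 2.3.4. After precomputation, only a length-M vector and an M x M matrix are kept, so each test case costs O(M) for the mean and O(M^2) for the variance.

```python
import numpy as np


def se_kernel(A, B, lengthscale, signal_var):
    """Squared-exponential covariance between the rows of A and the rows of B."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return signal_var * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)


class FICPredictor:
    """FIC predictive distribution with the training-set work done up front."""

    def __init__(self, X, y, Xu, lengthscale, signal_var, noise_var, jitter=1e-8):
        self.Xu = Xu
        self.lengthscale = lengthscale
        self.signal_var = signal_var
        self.noise_var = noise_var
        M = Xu.shape[0]
        Kmm = se_kernel(Xu, Xu, lengthscale, signal_var) + jitter * np.eye(M)
        Kmn = se_kernel(Xu, X, lengthscale, signal_var)
        # Diagonal of K_N - K_NM K_M^{-1} K_MN, clipped at zero, plus the noise.
        lam = signal_var - np.sum(Kmn * np.linalg.solve(Kmm, Kmn), axis=0)
        d = np.maximum(lam, 0.0) + noise_var
        Q = Kmm + (Kmn / d) @ Kmn.T
        # Quantities kept for prediction: a length-M vector and an M x M matrix.
        self.alpha = np.linalg.solve(Q, Kmn @ (y / d))
        self.B = np.linalg.inv(Kmm) - np.linalg.inv(Q)

    def predict(self, X_test):
        """Predictive mean and variance (including noise) at each test input."""
        Ks = se_kernel(X_test, self.Xu, self.lengthscale, self.signal_var)
        mean = Ks @ self.alpha                      # O(M) per test case
        quad = np.sum((Ks @ self.B) * Ks, axis=1)   # O(M^2) per test case
        var = self.signal_var + self.noise_var - quad
        return mean, var
```

Passing a random subset of the training inputs as `Xu` corresponds to FIC with random subset selection; passing optimised pseudo-inputs would correspond to the SPGP, at exactly the same prediction cost for a given M. The explicit inverses are used for brevity only.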

10. For Abalone the final point is the entire training set of size 3,133.

11. For FIC and SPGP we stop at a smaller maximum subset size than SD, depending on the data set,

[Figure 2.7, six panels: (a) SARCOS, MSE; (b) SARCOS, NLPD; (c) KIN40K, MSE; (d) KIN40K, NLPD; (e) Abalone, MSE; (f) Abalone, NLPD. In each panel the horizontal axis is prediction time/s on a logarithmic scale and the vertical axis is MSE or NLPD.]

Figure 2.7: Test error vs. prediction time for the three data sets SARCOS, KIN40K, and Abalone. Blue triangles: SD + random subset. Black stars: FIC + random subset. Red squares: SPGP (with fixed hyperparameters). The different plotted points were obtained by varying M, the size of the subset or number of inducing points.
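For reference, the two error measures on the vertical axes can be computed as in the Python sketch below. It assumes a Gaussian predictive distribution for each test case and averages over test cases; the function names are ours, and the exact normalisation used in section 2.5.1 may differ.

```python
import numpy as np


def mse(y_true, mean):
    """Mean squared error of the predictive means."""
    return float(np.mean((y_true - mean) ** 2))


def nlpd(y_true, mean, var):
    """Mean negative log predictive density under Gaussian predictions.

    `var` is the predictive variance of each test target, noise included.
    """
    per_case = 0.5 * np.log(2.0 * np.pi * var) + 0.5 * (y_true - mean) ** 2 / var
    return float(np.mean(per_case))
```

Unlike MSE, NLPD depends on the predictive variances, so a method that predicts the means well but is overconfident is still penalised.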

Figures 2.7 (a)–(d) show qualitatively similar trends for the highly nonlinear data sets SARCOS and KIN40K. Firstly, the FIC approximation is significantly better than SD for the same random subset: using information from all the training data clearly helps compared with throwing it away. Secondly, optimising the inducing inputs with the SPGP gives a further significant improvement in accuracy for a given cost. The SPGP is able to achieve very high accuracy with only a small inducing set. The complexity of these data sets means that there is not much of a saturation effect: as the subset size is increased, or more time is spent, performance keeps improving. Using the SPGP, however, helps drive towards saturation much earlier.

Figures 2.7 (e) and (f) show slightly different effects for Abalone. The main difference is that there is a definite saturation effect. This data set is very likely to be much simpler than SARCOS and KIN40K, and so a maximum performance level is reached. However, SD still requires a relatively large subset to reach this saturation. This is in contrast to FIC and the SPGP, which reach saturation with only a tiny inducing set of M = 32. Here the full GP performance is matched, and so there is no point increasing the inducing set size further. Prediction times are consequently extremely fast. In this case optimising the inducing points with the SPGP does not improve much on random selection, apart from a slight improvement in NLPD.
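One simple way to read a saturation point off such a curve is sketched below in Python. The function name and the relative tolerance are hypothetical choices for illustration: the function returns the smallest M whose error (MSE or NLPD, lower being better) is within a given tolerance of a reference level such as the full GP error.

```python
def smallest_saturating_size(sizes, errors, reference_error, rel_tol=0.01):
    """Smallest size whose error is within rel_tol of the reference, else None.

    The tolerance is relative to the magnitude of the reference, since NLPD
    can be negative.
    """
    threshold = reference_error + rel_tol * abs(reference_error)
    for size, err in sorted(zip(sizes, errors)):
        if err <= threshold:
            return size
    return None
```

A result of `None` means that the curve has not yet reached the reference level at any of the sizes tried.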

In summary, if you want to get the most out of a set of inducing inputs, then you should use the FIC approximation and optimise them with the SPGP. The caveat of course is that the SPGP optimisation is a much more expensive operation than random subset selection, and we also have hyperparameter training to take into account. These issues are addressed in the next section.
