Regression over gene expression data - Convolved Gaussian process priors for multivariate regre

We now present an example of the approximations described above, for perform- ing multiple output regression over gene expression data. The setup was described in section 2.3. The difference with that example, is that instead of using D = 50

4_{As mentioned in chapter 2, full Bayesian approaches are also possible.}

5_{Essentially, it would be possible to use any of the methods just mentioned above together}

with the multiple-output GP regression models presented in chapter 2. In the software ac- companying this thesis, though, we follow Snelson and Ghahramani (2006) and optimize the locations of the inducing variables using the approximated marginal likelihoods.

4.2. REGRESSION OVER GENE EXPRESSION DATA

outputs, here we use D = 1000 outputs. We then have access to gene expression data for D = 1000 outputs and N = 12 time points per output.

We do multiple output regression using DTC, FITC and PITC fixing the number of inducing points to K = 8, equally spaced in the interval [−0.5, 11.5]. Since it is a 1-dimensional input dataset, we do not optimize the location of the inducing points, but fix them to equally spaced initial positions. As for the full GP model in the example of section 2.3, we set Q = 1 and Rq = 1. Covariances Kf,f and Kf,uused in the approximations, are computed using the kernels kfd,fd0(x, x

0_{) and} kfd,uq(x, x

0_{), derived in equations (2.17) and (2.18), respectively. For the diagonal} blocks of Ku,u, we use a Gaussian kernel as in equation (2.15). All these kernels have the appropriate normalization constants.

Train set Test set Method Average SMSE Average MSLL

Replicate 1 Replicate 2 FITCDTC 0.5421 ± 0.0085 −0.2493 ± 0.01830.5469 ± 0.0125 −0.3124 ± 0.0200 PITC 0.5537 ± 0.0136 −0.3162 ± 0.0206 Replicate 2 Replicate 1 FITCDTC 0.5454 ± 0.01730.5565 ± 0.0425 −0.3024 ± 0.02940.6499 ± 0.7961 PITC 0.5713 ± 0.0794 −0.3128 ± 0.0138

Table 4.1: Standardized mean square error (SMSE) and mean standardized log loss (MSLL) for the gene expression data for 1000 outputs using the efficient approximations for the convolved multiple output GP. The experiment was repeated ten times with a different set of 1000 genes each time. Table includes the value of one standard deviation over the ten repetitions.

Again we use a scaled conjugate gradient to find the parameters that maximize the marginal likelihood in each approximation using the data from replicate 1. Once we have estimated the hyperparameters of the covariance functions, we test all the approximation methods over the outputs in replicate 2, by conditioning on the outputs of replicate 1. We also repeated the experiment selecting the genes for training from replicate 2 and the corresponding genes of replicate 1 for testing. The optimization procedure runs for 100 iterations.

Table 4.1 shows the results of applying the approximations in terms of SMSE and MSLL. DTC and FITC slightly outperforms PITC in terms of SMSE, but PITC outperforms both DTC and FITC in terms of MSLL. This pattern repeats itself when the training data comes from replicate 1 or from replicate 2.

In figure 4.2 we show the performance of the approximations over the same two genes of figure 2.1, these are FBgn0038617 and FBgn0032216. The non- instantaneous mixing effect of the model can still be observed. Performances for

0 2 4 6 8 10 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1 1.2 FBgn0038617 MSLL −0.70153 SMSE 0.21628 Time

Gene expression level

(a) DTC, short length scale

0 2 4 6 8 10 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1 1.2 FBgn0038617 MSLL −0.68869 SMSE 0.22408 Time

Gene expression level

(b) FITC, short length scale

0 2 4 6 8 10 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1 1.2 FBgn0038617 MSLL −0.86002 SMSE 0.1625 Time

Gene expression level

0 2 4 6 8 10 −1 −0.5 0 0.5 1 1.5 2 FBgn0032216 MSLL −0.30786 SMSE 0.18454 Time

Gene expression level

(d) DTC, long length scale

0 2 4 6 8 10 −1 −0.5 0 0.5 1 1.5 2 FBgn0032216 MSLL −0.50863 SMSE 0.36391 Time

Gene expression level

(e) FITC, long length scale

0 2 4 6 8 10 −1 −0.5 0 0.5 1 1.5 2 FBgn0032216 MSLL −0.83681 SMSE 0.16135 Time

Gene expression level

(f) PITC, long length scale

Figure 4.2: Predictive mean and variance for genes FBgn0038617 (first row) and FBgn0032216 (second row) using the different approximations. In the first column DTC (figures 4.2(a) and 4.2(d)), second column FITC (figures 4.2(b) and 4.2(e)) and in the third column PITC (figures 4.2(c) and 4.2(f)). The training data comes from replicate 1 and the testing data from replicate 2. The solid line corresponds to the predictive mean, the shaded region corresponds to 2 standard deviations of the prediction. Performances in terms of SMSE and MSLL are given in the title of each figure. The crosses in the bottom of each figure indicate the positions of the inducing points.

these particular genes are highlighted in table 4.2.

Notice that the performances are between the actual performances for the LMC and the CMOC appearing in table 2.2. We include these figures only for illustra- tive purposes, since both experiments use a different number of outputs. Figures 2.1 and 2.2 were obtained as part of multiple output regression problem of D = 50 outputs, while figures 4.2 and 4.3 were obtained in a multiple output regression problem with D = 1000 outputs.

In figure 4.3, we replicate the same exercise for the genes FBgn0010531 and FBgn0004907, that also appeared in figure 2.2. Performances for DTC, FITC and PITC are shown in table 4.2 (last six rows), which compare favourably with the performances for the linear model of coregionalization in table 2.2 and close to the performances for the CMOC. In average, PITC outperforms the other methods for the specific set of genes in both figures.

4.2. REGRESSION OVER GENE EXPRESSION DATA

Test replicate Test genes Method SMSE MSLL

Replicate 2 FBgn0038617 FITCDTC 0.2162 −0.70150.2240 −0.6886 PITC 0.1625 −0.8600 FBgn0032216 FITCDTC 0.1845 −0.30780.3639 −0.5086 PITC 0.1613 −0.8368 Replicate 1 FBgn0010531 FITCDTC 0.0774 −1.01710.1707 −0.7423 PITC 0.0872 −0.9899 FBgn0004907 FITCDTC 0.6057 −0.21920.1512 −0.8426 PITC 0.2468 −0.7176

Table 4.2: Standardized mean square error (SMSE) and mean standardized log loss (MSLL) for the genes in figures 4.2 and 4.3 for DTC, FITC and PITC with K = 8.

0 2 4 6 8 10 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1 1.2 FBgn0010531 MSLL −1.0171 SMSE 0.077407 Time

Gene expression level

(a) DTC, short length scale

0 2 4 6 8 10 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1 1.2 FBgn0010531 MSLL −0.74235 SMSE 0.1707 Time

Gene expression level

(b) FITC, short length scale

0 2 4 6 8 10 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1 1.2 FBgn0010531 MSLL −0.98993 SMSE 0.087275 Time

Gene expression level

0 2 4 6 8 10 −0.2 −0.1 0 0.1 0.2 0.3 FBgn0004907 MSLL −0.21923 SMSE 0.60572 Time

Gene expression level

(d) DTC, long length scale

0 2 4 6 8 10 −0.2 −0.1 0 0.1 0.2 0.3 FBgn0004907 MSLL −0.84269 SMSE 0.15124 Time

Gene expression level

(e) FITC, long length scale

0 2 4 6 8 10 −0.2 −0.1 0 0.1 0.2 0.3 FBgn0004907 MSLL −0.71762 SMSE 0.24687 Time

Gene expression level

(f) PITC, long length scale

Figure 4.3: Predictive mean and variance for genes FBgn0010531 (first row) and FBgn0004907 (second row) using the different approximations. In the first column DTC (figures 4.3(a) and 4.3(d)), second column FITC (figures 4.3(b) and 4.3(e)) and in the third column PITC (figures 4.3(c) and 4.3(f)). The training data comes now from replicate 2 and the testing data from replicate 1. The solid line corresponds to the predictive mean, the shaded region corresponds to 2 standard deviations of the prediction. Performances in terms of SMSE and MSLL are given in the title of each figure. The crosses in the bottom of each figure indicate the positions of the inducing points, which remain fixed during the training procedure.

Train set Test set Method Average TTPI Replicate 1 Replicate 2 FITCDTC 2.042.31

PITC 2.59

Replicate 2 Replicate 1 FITCDTC 2.102.32

PITC 2.58

Table 4.3: Training time per iteration (TTPI) for the gene expression data for 1000 outputs using the efficient approximations for the convolved multiple output GP. The experiment was repeated ten times with a different set of 1000 genes each time.

With respect to the training times, table 4.3 shows the average training time per iteration (average TTPI) for each approximation. To have an idea of the saving times, one iteration of the full GP model for the same 1000 genes would take around 4595.3 seconds. This gives a speed up factor of 1780, approximately.6

In document Convolved Gaussian process priors for multivariate regression with applications to dynamical systems (Page 86-90)