Offset Peaks and Differing Variances

3.5 Peak Offsetting

3.5.2 Offset Peaks and Differing Variances

In the following results section we consider two possibilities. Firstly, we maintain the restriction that the ω parameters are all equal to 2. Secondly, to make the model as general as possible, we allow these variance scaling parameters to vary at each peak. For this new model the ω parameters have prior distribution ω ∼ Unif(1.5, 10). We do not wish the variances of the two peaks at the same location to become equal, so the lower bound was set to 1.5. The upper bound was set to 10 to allow locations to mainly consist of a single peak.

The MCMC algorithm is altered to incorporate another Metropolis-Hastings step for the variance scaling parameters $\omega_j$. When updating an $\omega_j$ the posterior only changes through the likelihood. The prior contribution remains the same since the prior on $\omega_j$ is uniform. Assume that the current value of $\omega_j$ is $\omega_{j,t}$ and the proposed value is $\omega_{j,t+1} \sim N(\omega_{j,t}, \sigma_\omega^2)$. Then the acceptance probability α is calculated as in equation (3.11) and the move from $\omega_{j,t}$ to $\omega_{j,t+1}$ is accepted with this probability. Note that values of $\omega_j$ outside the range (1.5, 10) have zero prior and are not considered. This step is repeated for each of the peaks.
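The update described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used for the results: it assumes that equation (3.11) reduces to the usual Metropolis ratio for a symmetric proposal, so that with a flat prior only the likelihood ratio remains. The names `update_omega`, `run_chain` and `toy_log_likelihood` are hypothetical, and the toy likelihood is a made-up stand-in for the mixture likelihood of the model.

```python
import math
import random

OMEGA_LOWER, OMEGA_UPPER = 1.5, 10.0  # support of the Unif(1.5, 10) prior


def update_omega(omega_current, log_likelihood, proposal_sd, rng):
    """One random-walk Metropolis-Hastings update for a single omega_j.

    log_likelihood is a hypothetical callable giving the log-likelihood as a
    function of omega_j with all other parameters held fixed.
    """
    proposal = rng.gauss(omega_current, proposal_sd)
    # Proposals outside the prior support have zero prior density: reject.
    if not (OMEGA_LOWER < proposal < OMEGA_UPPER):
        return omega_current
    # Flat prior and symmetric proposal: only the likelihood ratio remains.
    log_alpha = log_likelihood(proposal) - log_likelihood(omega_current)
    if log_alpha >= 0.0 or rng.random() < math.exp(log_alpha):
        return proposal
    return omega_current


def run_chain(log_likelihood, omega_start, proposal_sd, n_iter, seed=0):
    """Repeat the update n_iter times and return the visited values."""
    rng = random.Random(seed)
    omega = omega_start
    chain = []
    for _ in range(n_iter):
        omega = update_omega(omega, log_likelihood, proposal_sd, rng)
        chain.append(omega)
    return chain


def toy_log_likelihood(omega):
    """Made-up stand-in for the real likelihood, peaked at omega = 2."""
    return -0.5 * ((omega - 2.0) / 0.3) ** 2


if __name__ == "__main__":
    chain = run_chain(toy_log_likelihood, omega_start=2.0,
                      proposal_sd=0.3, n_iter=5000)
    print(min(chain), max(chain), sum(chain) / len(chain))
```

In the full sampler this update would be applied to each peak j in turn within every iteration, with `log_likelihood` re-evaluated using the current values of the other parameters.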

In summary, the four models considered in this chapter are:

model   description                              section
I       single Gaussian peaks                    3.2
II      double peaks mixture model               3.4.1
III     offset peaks mixture model               3.5.1
IV      offset peaks with differing variances    3.5.2

Table 3.10: A summary of the four models considered in this chapter.

3.5.3 Results for the Breast Cancer Dataset

To enable suitable comparisons to be drawn with the previous models, 138 peaks have again been modelled. Figure 3.15 shows the same section of data as before but with the MAP estimates of θ under all four models considered in this chapter. The black line shows the data, the red line the MAP estimate of θ under the single peaks model, the green line the MAP under the double peaks model, the turquoise line the MAP under the offset peaks model and the blue line the MAP under the differing variances model. Table 3.11 shows the parameter values that give the MAP estimate under the offset peaks model (turquoise) and table 3.12 the parameter values under the differing variances model (blue).

[Figure 3.15: horizontal axis m/z value (7,600 to 8,200), vertical axis data (0 to 15)]

Figure 3.15: A section of the MAP after 5,000 iterations using all the models and original data for one spectrum in the adcon group for day 4. The red line shows the MAP under the single peaks model, the green line the double peaks model, the turquoise line the offset peaks model and the blue line the differing variances model.

parameter    ξ          τ
MAP value    0.000022   8.671

parameter    µ1      µ2      µ3      µ4      µ5
MAP value    8106    7834    7664    7931    8265

parameter    µ1+δ1   µ2+δ2   µ3+δ3   µ4+δ4   µ5+δ5
MAP value    8141    7920    7711    7962    8266

Table 3.11: The MAP parameter estimates for the section of the breast cancer data shown in figure 3.15 under the offset peaks model (turquoise).

parameter    ξ          τ
MAP value    0.000017   9.703

parameter    µ1      µ2      µ3      µ4      µ5
MAP value    8110    7843    7671    7951    8264

parameter    µ1+δ1   µ2+δ2   µ3+δ3   µ4+δ4   µ5+δ5
MAP value    8137    7922    7720    8151    8265

parameter    ω1      ω2      ω3      ω4      ω5
MAP value    1.50    1.74    2.03    3.42    1.68

Table 3.12: The MAP parameter estimates for the section of the breast cancer data shown in figure 3.15 under the differing variances model (blue).

From figure 3.15 we can see that the offset peaks model again provides a better fit to the data than the previous models for the peak at 8,100 Da. The fitted heights are increased further, better matching the data, although some difference still remains. For the purposes of model comparison we recalculate the AIC. The model has 40,054 parameters: 138 locations, 39,744 heights, 138 offset parameters, 17 residual variances and 17 proportionality constants. For the model used in this section the AIC is thus (−2 × 2,182,023) + (2 × 40,054) = −4,283,938. This value is lower than that of the double peaks model and so the offset peaks model is a further improvement. The BIC statistic for the offset peaks model is −3,782,705, which is also lower than that of the double peaks model and again indicates an improvement.

The difference between the data and the fitted model is slightly reduced by the use of the more complicated model with differing variances. However, this new model does not appear to give a much better fit to the data shown than the offset peaks model. The more complicated model must fit better in other places, as we see a reduction in both the AIC and the BIC. The model has 40,192 parameters: 138 locations, 39,744 heights, 138 offset parameters, 138 variance scaling parameters, 17 residual variances and 17 proportionality constants. For the model used in this section the AIC is thus (−2 × 2,188,279) + (2 × 40,192) = −4,296,174 and the BIC is −3,793,243. Trace plots for all the modelled parameters were checked in both models. There were no evident patterns, except that for one peak proposals were only accepted in one direction due to its proximity to the end of the section.
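The arithmetic behind these criteria can be checked with a short sketch. The parameter counts and log-likelihood are the values quoted above for the differing variances model on the breast cancer data. The BIC is computed with the standard definition, −2 log L + k log n, taking n to be the 2,009,088 datapoints reported for this dataset in the chapter summary; that choice of n is an assumption here, although it reproduces the reported BIC to within rounding. The function names are illustrative.

```python
import math


def count_parameters(counts):
    """Total number of modelled parameters from a breakdown by type."""
    return sum(counts.values())


def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 log L + 2k."""
    return -2 * log_likelihood + 2 * n_params


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: -2 log L + k log n."""
    return -2 * log_likelihood + n_params * math.log(n_obs)


# Differing variances model, breast cancer dataset (values from the text).
CANCER_COUNTS = {
    "locations": 138,
    "heights": 39_744,
    "offsets": 138,
    "variance_scalings": 138,
    "residual_variances": 17,
    "proportionality_constants": 17,
}
CANCER_LOG_LIKELIHOOD = 2_188_279
CANCER_N_OBS = 2_009_088  # assumed sample size for the BIC

if __name__ == "__main__":
    k = count_parameters(CANCER_COUNTS)
    print(k)                                            # 40192
    print(aic(CANCER_LOG_LIKELIHOOD, k))                # -4296174
    print(bic(CANCER_LOG_LIKELIHOOD, k, CANCER_N_OBS))  # close to the reported BIC
```

Because log n is roughly 14.5 for this dataset, each additional parameter costs about seven times more under the BIC than under the AIC, which is why the two criteria can rank heavily parameterised models differently.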

When the MAP parameter estimates in tables 3.11 and 3.12 are compared with those for the previous models we see that the peak locations have again not changed much. The offset parameters for visible peaks are all positive and the variance scaling parameters are all around the value 2. For the fifth peak the value of δ5 is close to zero, as the offset peak is fitted to effectively the same location. The value of the proportionality parameter has decreased, although not by as much as last time.

The t-statistics using the MAP estimates of θ between pairs of control groups and between related control/treated groups did not show any significantly different peak locations compared with those of the original single peaks model for either of the offset models, although the absolute value of a large number of the t-statistics has increased. Comparisons which ignored day information all remained insignificant. The significant peak locations after correcting for multiple testing can be seen in tables 3.4 and 3.5.


3.5.4 Results for the Melanoma Dataset

The same small section of the dataset as examined previously is shown in figure 3.16 along with the MAP estimates of θ for each of the four models. Figure 3.16 shows one spectrum from the melanoma dataset: the black line shows the data and the red, green, turquoise and blue lines show the MAP estimates of θ under the single peaks, double peaks, offset peaks and differing variances models respectively. Tables 3.13 and 3.14 show the parameter values that are used to construct the MAP estimates of θ under the offset peaks model (turquoise) and differing variances model (blue).

[Figure 3.16: horizontal axis m/z value (7,400 to 8,400), vertical axis data (0 to 50)]

Figure 3.16: A section of the MAP after 5,000 iterations using offset peaks and original data for one spectrum in the melanoma dataset. The red line shows the MAP under the single peaks model, the green line the double peaks model, the turquoise line the offset peaks model and the blue line the differing variances model.

parameter    ξ          τ
MAP value    0.000012   1.526

parameter    µ1      µ2      µ3      µ4      µ5      µ6      µ7      µ8
MAP value    7789    8163    8267    7989    7665    8369    7889    7453

parameter    µ1+δ1   µ2+δ2   µ3+δ3   µ4+δ4   µ5+δ5   µ6+δ6   µ7+δ7   µ8+δ8
MAP value    7869    8196    8267    8035    7748    8372    7954    7544

Table 3.13: The MAP parameter estimates for the section of the melanoma data shown in figure 3.16 under the offset peaks model (turquoise).

parameter    ξ          τ
MAP value    0.000005   2.422

parameter    µ1      µ2      µ3      µ4      µ5      µ6      µ7      µ8
MAP value    7775    7875    8149    7996    7669    8363    8249    7473

parameter    µ1+δ1   µ2+δ2   µ3+δ3   µ4+δ4   µ5+δ5   µ6+δ6   µ7+δ7   µ8+δ8
MAP value    7814    7940    8175    8048    7723    8392    8290    7502

parameter    ω1      ω2      ω3      ω4      ω5      ω6      ω7      ω8
MAP value    2.04    2.69    3.40    3.06    1.60    1.89    2.71    2.39

Table 3.14: The MAP parameter estimates for the section of the melanoma data shown in figure 3.16 under the differing variances model (blue).

From figure 3.16 we can see that the offset peaks model does not appear to show much difference in the range shown compared with the double peaks model. However, when considering the AIC we see a reduction, so the model provides a better fit in some areas of the data using this criterion. The number of parameters is 46,182 (112 locations, 45,920 heights, 112 offset parameters, 19 residual variances and 19 proportionality constants). For the model used in this section the AIC is thus (−2 × 3,136,830) + (2 × 46,182) = −6,181,296. This lower value indicates that the offset peaks model is an improvement upon the double peaks model. The BIC statistic is −5,587,103, which is lower than that of both the single and double peaks models.

When comparing the MAP estimates of θ from the differing variances model with those from the simpler offset peaks model it can be seen that the more complicated model gives a much better fit to the data shown. The height of the peak at around 7,800 Da is much more closely fitted and the fit in the range between 7,900 and 8,000 Da is an improvement over all the previous models. This is borne out by a reduction in the AIC. The number of parameters in the differing variances model is 46,294 (112 locations, 45,920 heights, 112 offset parameters, 112 variance scaling parameters, 19 residual variances and 19 proportionality constants). For this model the AIC is therefore (−2 × 3,146,468) + (2 × 46,294) = −6,200,348 and the BIC is −5,604,714. All the offset parameters applying to visible peaks are non-zero so none of the peaks in the range are symmetrical. This was also true for the majority of the peaks in the other parts of the dataset. This suggests that the assumption of peaks having longer right hand tails was sensible. The value of the offset peak µ8+δ8 given in table 3.13 has moved 100 Da to the right compared with µ8. This is the upper limit of the permissible values for a δ. We can see from figure 3.16 that the relative intensity at both the m/z values is negligible and so this causes no problem. In the final model the variance scaling parameters ω seem to be around the value 2, which suggests our previous model using a standard scaling of 2 was not unreasonable.

No patterns were visible in any of the trace plots for the parameters so the chains were acceptable. To check if any peaks were different between the two stages of melanoma, t-statistics were again calculated using the MAP estimates of θ at each peak location, and the results were then corrected for multiple testing. The locations remain essentially the same as in the previous two models and the majority of the t-statistics are positive, showing that the heights of the peaks in stage IV are lower than in stage I. However, at around 11,500 to 11,900 Da the t-statistics show the opposite. In Mian et al. (2005) the area around 11,701 Da was identified as one showing significant variability in the data. The absolute value of the majority of t-statistics increases under the two offset peaks models.

3.6 Summary

In this chapter we have shown how MCMC algorithms can successfully be used to simulate from a model for mass spectrometry data and also how we can incorporate the peak finding procedure from chapter 2 to provide a suitable starting point. The use of these methods greatly reduces the dimension of the datasets to a relatively small number of parameters. The numbers of datapoints are 2,009,088 and 2,859,955 for the breast cancer and melanoma datasets respectively. For the final model in this chapter the respective numbers of modelled parameters were 40,192 and 46,294, around 2% of the original number of datapoints in each case.

It has been shown that it is important to consider the data not as a combination of single peaks but as a combination of double peaks with offset locations. Using the AIC calculations it is seen that using offsets to model the peaks gives much better results. The MAP curves match the data more closely for only a slight increase in the number of parameters. A summary of the AIC results is shown in table 3.15. We conclude that although the AIC is lowest for the different variances model, it is not a large amount lower than the AIC for the simpler offset peaks model for either dataset when compared with the reductions for the previous model alterations. To model the data we should use offset peaks with different variances.

            cancer                                   melanoma
model       AIC          BIC          # parameters   AIC          BIC          # parameters
single      -2,594,632   -2,343,817   20,044         -5,576,214   -5,278,873   23,110
double      -3,090,994   -2,591,523   39,916         -5,725,350   -5,132,598   46,070
offset      -4,283,938   -3,782,705   40,054         -6,181,296   -5,587,103   46,182
variance    -4,296,174   -3,793,243   40,192         -6,200,348   -5,604,714   46,294

Table 3.15: The AIC and BIC statistics for the four models considered in this chapter.

From the BIC results we see that, in agreement with the previous conclusions using AIC, the model including offset peaks with differing variances is deemed to be the most suitable. When moving from the single peaks to the double peaks model, the BIC heavily penalised the introduction of nearly double the number of parameters for the melanoma dataset, which led to an increase in the statistic. However, when the more complex models were analysed the BIC statistic fell again.

In this chapter we split the datasets into sections and modelled each of the sections separately. This resulted in around 20 different estimates for the proportionality constant ξ, one in each section, instead of the single overall parameter that we would prefer to model. When checking the values of ξ obtained for each section it was found that, for all sections that contained visible peaks, the value of ξ converged to roughly the same value in each section. For the sections without visible peaks the value of ξ was larger but the peaks had negligible height. It seems reasonable to assume that the value of ξ is approximately constant over all sections.

The analysis of the breast cancer and melanoma datasets using the MCMC methods discussed in this chapter requires a large amount of computational time. This is primarily because each section of the data must be run sequentially so as not to split processor time between tasks. The High Performance Computing (GRID) system was used to compare the times taken to analyse the datasets. Using a desktop computer to carry out the analysis of the complete datasets resulted in total analysis times of 563.5 minutes and 706.5 minutes for the breast cancer and melanoma datasets respectively. When using the GRID, the time taken for the analysis of each section of data was reduced by approximately 25% in both datasets. However, since there are multiple processors on the GRID, each section of the data can be submitted to a different one and the analysis carried out in parallel. This reduces the time taken for each dataset to the time taken for the largest section. The times taken when running the analysis in parallel are 74.9 minutes and 78.4 minutes for the breast cancer and melanoma datasets respectively.

In Dryden et al. (2005) the breast cancer dataset was analysed using a variant of the Hotelling $T^2$ test (Hotelling, 1931). The day information can be taken into account when using the Hotelling test as vectors can be constructed of the peak heights over all four days. A brief description of the technique now follows.

Let $\bar{x}_{Ai}$ and $\bar{x}_{Bi}$ be the q-vectors of means in groups A and B respectively at m/z value i (i = 1, ..., p), with sample sizes $n_A$, $n_B$. Let $S_{xi}$ be the unbiased pooled within-group q × q covariance matrix at m/z value i. For the breast cancer data q = 4 as there are 4 days of information available and p = 13951 is the number of recorded m/z values between 2,000 and 30,000 Da. The two sample Hotelling $T^2$ test of $H_0: \mu_{Ai} = \mu_{Bi}$ versus $H_1: \mu_{Ai} \neq \mu_{Bi}$ at m/z value i is

$$T^2_{x,i} = (\bar{x}_{Ai} - \bar{x}_{Bi})^T S_{xi}^{-1} (\bar{x}_{Ai} - \bar{x}_{Bi})$$

under certain assumptions (see Dryden et al., 2005) and we reject $H_0$ in favour of $H_1$ at the 100α% level if

$$T^2_{x,i} > T_{crit}(\alpha) = \frac{(n_A + n_B)(n_A + n_B - 2)\,q}{n_A n_B (n_A + n_B - q - 1)} \, F_{q,\,n_A+n_B-q-1}(1 - \alpha)$$

where $F_{\nu_1,\nu_2}(1 - \alpha)$ is the 1 − α quantile of the $F_{\nu_1,\nu_2}$ distribution.
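As an illustration, the statistic and its critical value can be computed at a single m/z value with the following sketch, which uses only the Python standard library. It is not the code of Dryden et al. (2005), and all function names are hypothetical. Each group is a list of q-vectors, one per replicate. The optional `sigma2` argument, which defaults to 0, adds a constant to the diagonal of the pooled covariance matrix and so anticipates the noise-adjusted statistic described next. The F quantile is passed in by the caller, for instance from statistical tables or from `scipy.stats.f.ppf`.

```python
def mean_vector(rows):
    """Component-wise mean of a list of q-vectors."""
    n, q = len(rows), len(rows[0])
    return [sum(row[k] for row in rows) / n for k in range(q)]


def pooled_covariance(group_a, group_b):
    """Unbiased pooled within-group covariance (divisor nA + nB - 2)."""
    q = len(group_a[0])
    scatter = [[0.0] * q for _ in range(q)]
    for group in (group_a, group_b):
        centre = mean_vector(group)
        for row in group:
            dev = [row[k] - centre[k] for k in range(q)]
            for r in range(q):
                for k in range(q):
                    scatter[r][k] += dev[r] * dev[k]
    dof = len(group_a) + len(group_b) - 2
    return [[value / dof for value in row] for row in scatter]


def solve(matrix, rhs):
    """Solve matrix @ y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[r]) + [rhs[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    y = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][k] * y[k] for k in range(r + 1, n))
        y[r] = (aug[r][n] - tail) / aug[r][r]
    return y


def hotelling_t2(group_a, group_b, sigma2=0.0):
    """d^T (S + sigma2 * I)^{-1} d, where d is the difference in group means."""
    mean_a, mean_b = mean_vector(group_a), mean_vector(group_b)
    diff = [a - b for a, b in zip(mean_a, mean_b)]
    cov = pooled_covariance(group_a, group_b)
    for k in range(len(diff)):
        cov[k][k] += sigma2
    y = solve(cov, diff)
    return sum(d * v for d, v in zip(diff, y))


def critical_value(n_a, n_b, q, f_quantile):
    """T_crit(alpha), given the 1 - alpha quantile of F(q, nA + nB - q - 1)."""
    n = n_a + n_b
    return n * (n - 2) * q / (n_a * n_b * (n - q - 1)) * f_quantile
```

Solving the linear system rather than inverting the covariance matrix explicitly keeps the sketch short. For q = 4, as in the breast cancer data, either approach is computationally trivial.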

The Dryden et al. (2005) method tries to account for the extra noise which would be inherent in further repetitions of the experiment. The noise is considered to be iid Gaussian with mean zero and variance $\sigma^2$ and thus the unobserved noisy vector is $w_i = x_i + \epsilon_i$ where $\epsilon_i \sim N_q(0, \sigma^2 I_q)$ independently. The offset test statistic is then

$$T_i^2(\sigma^2) = (\bar{x}_{Ai} - \bar{x}_{Bi})^T (S_{xi} + \sigma^2 I_q)^{-1} (\bar{x}_{Ai} - \bar{x}_{Bi}).$$

Given $\sigma^2$, $T_i^2(\sigma^2)$ can be observed and these statistics can be used for inference. A suitable value of $\sigma^2$ is determined by a calibration method and subsequently $H_0$ is rejected if $T_i^2(\sigma^2)$ …


The significant results obtained from this analysis are shown in table 3.16. When comparing these results with the MCMC result tables presented in this chapter we see that most of the values in table 3.16 are identified by the MCMC method. The exceptions are the rows including 7,687 Da, 11,381 Da and 15,377 Da for the control/treated comparisons and the rows including 6,231 Da, 6,552 Da, 10,169 Da and 13,811 Da for the control/control comparisons.

adc/adt tdc/tdt mcc/mct adc/tdc adc/mcc tdc/mcc

3839 4127 4120 4396 4396 4648 4813 4798 5364 5692 5653 5661 6231 6282 6552 6552 7029 7017 7019 7687 7685 8094 10169 10265 10248 11381 11351 11369 11340 13854 13811 13831 14028 14048 14055 15377 15390 15402

Table 3.16: Significant m/z values in the breast cancer dataset from Dryden et al. (2005). Similar values are listed on the same line.

There are some differences between the results from the two methods. In the MCMC analysis the significant result at 7,029 Da is between mccon and mctax, whereas in table 3.16 it is between the adcon and adtax groups. Also, the last row in table 3.16 shows differences for the adc/tdc and adc/mcc comparisons, whereas in the MCMC analysis the difference at this m/z value is for the tdc/mcc comparison.

It should be noted that the Hotelling analysis does not reveal any significant differences between any of the spectra for m/z values higher than 15,500 Da. The MCMC approach provides many such m/z values, of which the majority are for mcc/mct and tdc/mcc comparisons.

In Mian et al. (2003) the breast cancer dataset was studied using artificial neural networks (ANNs). This research highlighted the m/z values 10,518 Da, 11,100 Da, 11,687 Da and 13,239 Da as showing good classification ability between control and treated cell-lines. Only m/z values between 10 kDa and 15 kDa were considered in that analysis. The models described in this chapter do

In document Statistical analysis of proteomic mass spectrometry data (Page 130-143)