Discussion - Analysing gene expression data using parametric regression models

3. Analysing gene expression data using parametric regression models

3.10. Discussion

In this analysis, a regression approach was used to fit a selection of models to gene expression profiles, and thus obtain biologically interpretable parameters to aid in the identification of functionally related genes. Eight distinct shapes were used to fit the expression profiles, and these shapes were able to fit a large proportion of the genes. Ordinarily with nonlinear regression, starting values for the regression would be estimated using a graphical exploration, or through the use of a grid search of potential parameter values (Ritz and Streibig, 2008). However, in this case, there were 2 datasets with over 23 000 gene expression profiles in each. Thus a more automated approach was needed. Self-starter functions were developed to estimate starting values, and were integrated into an analysis pipeline to fit each of the selected models to each gene expression profile, and determine the best fits. All the relevant data was stored in a database for further analysis.
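As an illustration of what a self-starter function does, the following is a minimal sketch for a three-parameter logistic curve. It is not the implementation used in this analysis: the function names, the `headroom` parameter, and the logit-linearisation strategy are assumptions chosen for the example, and the profile is assumed to be positive and increasing. The resulting values would be passed to a nonlinear least-squares routine as starting values, not used as final estimates.

```python
import math


def logistic(t, asym, xmid, scal):
    """Three-parameter logistic curve: asym / (1 + exp((xmid - t) / scal))."""
    return asym / (1.0 + math.exp((xmid - t) / scal))


def logistic_start_values(times, values, headroom=0.05):
    """Rough starting values (asym, xmid, scal) for a positive, increasing profile.

    Hypothetical helper: the asymptote is guessed as slightly above the
    largest observation, then the logit of values / asym is regressed on
    time by ordinary least squares.
    """
    asym = max(values) * (1.0 + headroom)
    # The logit transform linearises the curve:
    # log(y / (asym - y)) = (t - xmid) / scal
    z = [math.log(y / (asym - y)) for y in values]
    n = len(times)
    t_mean = sum(times) / n
    z_mean = sum(z) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxz = sum((t - t_mean) * (zi - z_mean) for t, zi in zip(times, z))
    slope = sxz / sxx
    intercept = z_mean - slope * t_mean
    scal = 1.0 / slope          # slope of the linearised curve is 1 / scal
    xmid = -intercept / slope   # time at which the logit crosses zero
    return asym, xmid, scal


# Toy profile (not real data): 11 time points from a known logistic curve.
times = list(range(1, 12))
values = [logistic(t, 10.0, 5.0, 1.0) for t in times]
asym0, xmid0, scal0 = logistic_start_values(times, values)
print(asym0, xmid0, scal0)
```

Because the asymptote guess is deliberately placed above the largest observation, the estimates are only approximate, which is all that is required of starting values.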

Through the use of goodness-of-fit statistics, the quality of the fits was determined. These statistics included the $R^2_a$, $R^2_{LoF}$, and F-test p-value, and each of these provided a different indication of the fit. Investigating the overall trends in these statistics, threshold values were determined in order to filter out the models with poor fits. The thresholds determined were $R^2_a > 0.6$, $R^2_{LoF} > 0.6$, and F-test p-value < 0.05, although these can easily be changed to increase or decrease the stringency as desired. The number of genes that fitted each shape was calculated, and it was found in both the senescence and Botrytis datasets that the predominant shape was the linear response. Investigation of some of these fits revealed that the genes exhibited a low level of expression, and had a flat, unchanging response over time, thus having both a gradient

[Figure 3.20. Panel A: counts (0 to 2000) against time point in days (1 to 11). Panel B: counts (0 to 700) against time point in hours (2 to 48).]

Figure 3.20: Figures showing the distribution of the spikes in the senescence (A) and Botrytis (B) datasets. Shown are the number of spikes detected at each of the time points.

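[Figure 3.21. Region counts as extracted, in reading order: 1609, 4714, 3501 (circles labelled "Differentially expressed" and "Good regression fits"); 933, 4821, 4369 (circles labelled "Good regression fits" and "Differentially expressed").]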

Figure 3.21: Venn diagrams showing the degree of overlap between the genes identified as being differentially expressed through statistical means (blue circles), and the genes with a good regression fit (orange circles). The senescence dataset is shown in (A) and the Botrytis dataset in (B).

and intercept close to 0. The other commonly occurring shapes were the logistic and the two forms of the Gompertz curve, i.e. the sigmoid shapes. This is to be expected as they follow the anticipated change in gene expression, where a gene is activated in response to some stimulus, which results in an increase (or decrease) in expression until a new steady state level is achieved. With regard to the exponential type curves, a similar situation to the linear fits was found where the expression profiles were flat and unchanging, so the fitted parameters were close to 0. Finally, the hyperbolic shape did not fit many genes at all, most likely due to the exponential curve providing a better fit.

These results suggest that while many of the gene expression profiles could be adequately described by the selected shapes, there were still some that could not. Further investigation would be needed to identify and parameterise the missing model shapes. Here, techniques such as splines have the advantage, as they are more flexible and thus able to handle unusual profile shapes. However, the purpose of this analysis is to obtain more information from the expression profiles than merely their shapes, namely additional information regarding the underlying mechanisms for the given expression profile. Through the use of the fitted parameters and goodness-of-fit statistics, a more exploratory approach was developed to aid in the analysis of the data.

The thresholds described above were used to identify genes that had a good fit to the data, and were selected based on the number of genes that passed a given threshold. This was in an attempt to maximise the number of genes included in further analyses, while still maintaining a level of stringency. However, these thresholds are still ultimately arbitrary, and may be raised or lowered to make them more or less stringent, respectively. In addition, the thresholds do not inform about the amount of error in the parameter estimates. This means that although the fit may be of high quality, the standard error of the parameters may be relatively high, indicating that there is insufficient data to accurately predict the parameter value. For example, a gene expression profile may look like half of a Gaussian curve, and the Gaussian profile would fit it reasonably well. However, there would be high errors associated with the asymptote parameter estimate, indicating that some components of the shape were obtained by extrapolating beyond the dataset. Thus, fitted curves should be further filtered by investigating the errors associated with the parameter values.
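The two-stage filtering suggested above can be sketched as follows. This is an assumption-laden illustration, not the pipeline's code: the `FitSummary` record, the parameter names, the relative standard error cut-off of 0.5, and all numerical values are invented for the example. Only the three goodness-of-fit thresholds are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class FitSummary:
    """Hypothetical record of one model fitted to one expression profile."""
    gene: str
    shape: str
    r2_a: float
    r2_lof: float
    f_pvalue: float
    estimates: dict   # parameter name -> estimate
    std_errors: dict  # parameter name -> standard error


def passes_thresholds(fit, r2_a_min=0.6, r2_lof_min=0.6, p_max=0.05):
    """Apply the goodness-of-fit thresholds quoted in the text."""
    return (fit.r2_a > r2_a_min
            and fit.r2_lof > r2_lof_min
            and fit.f_pvalue < p_max)


def poorly_determined(fit, max_rel_se=0.5):
    """Names of parameters whose |SE / estimate| exceeds max_rel_se.

    The cut-off of 0.5 is an arbitrary illustrative choice.
    """
    flagged = []
    for name, est in fit.estimates.items():
        se = fit.std_errors[name]
        if est == 0 or abs(se / est) > max_rel_se:
            flagged.append(name)
    return flagged


# Invented values for illustration only.
fits = [
    FitSummary("geneA", "logistic", 0.92, 0.90, 0.001,
               {"asym": 10.0, "xmid": 5.0, "scal": 1.0},
               {"asym": 0.4, "xmid": 0.2, "scal": 0.15}),
    FitSummary("geneB", "gaussian", 0.85, 0.80, 0.004,
               {"asym": 8.0, "peak": 12.0, "width": 3.0},
               {"asym": 6.5, "peak": 1.0, "width": 0.6}),
    FitSummary("geneC", "linear", 0.20, 0.15, 0.40,
               {"intercept": 0.02, "gradient": 0.001},
               {"intercept": 0.05, "gradient": 0.004}),
]

kept = [f for f in fits if passes_thresholds(f)]
well_determined = [f.gene for f in kept if not poorly_determined(f)]
print([f.gene for f in kept], well_determined)
```

A relative standard error is unstable when the estimate itself is close to zero, so in practice a confidence interval width or an absolute error bound might be a more suitable second-stage criterion for some parameters.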

The inclusion of spikes provided a means of extending the fitting process to include more unusual shapes. At present only one spike is permitted per expression profile. An extension would be to allow multiple spikes, particularly those that are adjacent, indicative of a dip as opposed to a spike. Other extensions could include the use of piecewise regression or a broken stick model, where different portions of the expression profile are fitted by multiple shapes. Using a leave-one-out methodology to find the spikes biases the analysis towards identifying genes with spikes at the beginning or end of the time series, that is, at the time points which possess a large amount of leverage on the regression fit. Of particular interest are the genes with spikes in the middle of the time series. It would be interesting to determine if there is some biological, or possibly technical, reason for a set of genes to have spikes at a particular time point.
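The link between leave-one-out spike detection and leverage can be illustrated with a small sketch. A straight-line fit stands in for the nonlinear shapes, and the criterion used here (the residual sum of squares after dropping each point) is an assumption, since the exact criterion is not restated in this section. All function names are hypothetical and the profile is a toy example.

```python
def fit_line(times, values):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, values))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


def residual_ss(times, values):
    """Residual sum of squares of the straight-line fit."""
    a, b = fit_line(times, values)
    return sum((y - (a + b * t)) ** 2 for t, y in zip(times, values))


def leave_one_out_rss(times, values):
    """Residual sum of squares of the refit after dropping each point in turn."""
    return [residual_ss(times[:i] + times[i + 1:], values[:i] + values[i + 1:])
            for i in range(len(times))]


def leverages(times):
    """Leverage of each time point in a straight-line fit with intercept."""
    n = len(times)
    t_mean = sum(times) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    return [1.0 / n + (t - t_mean) ** 2 / sxx for t in times]


# Toy profile: a linear trend with a spike at the sixth (middle) time point.
times = list(range(1, 12))
values = [2.0 + 0.5 * t for t in times]
values[5] += 4.0

rss = leave_one_out_rss(times, values)
spike_index = rss.index(min(rss))   # dropping the spike gives the best refit
h = leverages(times)
print(spike_index, h[0], h[5])
```

The leverages are largest at the first and last time points, so removing an end point changes the fitted curve more than removing a middle point does. One possible correction, offered only as a suggestion, would be to scale the leave-one-out criterion by the leverage of the removed point.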

While it was possible to fit a variety of models to a large number of expression profiles, many of these fits were of poor quality. As mentioned previously, it was found that many of the profiles with poor fits were largely unresponsive across the time series. Thus it may be possible to use the regression and goodness-of-fit assessments as a means of identifying differentially expressed genes. When the list of good fits was compared to the differentially expressed gene lists from statistical analyses such as MAANOVA, it was found to be largely consistent. In the case of senescence, many more genes were found to be differentially expressed by the regression analysis, although this could be adjusted by making the default thresholds more stringent. In the Botrytis dataset, the regression analysis found half as many genes as the differential expression analyses. However, the majority of the genes were found by both methods. Thus, the regression analysis could act as a means of filtering out genes for further investigation. The genes that were found to be differentially expressed but did not possess a good model fit may include circadian genes, which cannot currently be detected accurately by this regression approach. A possible way to identify these genes would be to attach a sine term to the regression models as an additional parameter to overlay oscillatory behaviour. Alternatively, a Fourier analysis could be used to identify the diurnal signal. Nonetheless, in both datasets most of the differentially expressed genes were identified as having good fits, so the goodness of fit could be used as a simple means of identifying differentially expressed genes.
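The Fourier approach to detecting a diurnal signal could be sketched as below. This is a minimal illustration under stated assumptions: samples are evenly spaced, the series spans a whole number of 24-hour cycles, and the function name and toy profiles are invented for the example. The sampling scheme of every 2 hours up to 48 hours follows the time axis of the Botrytis dataset.

```python
import cmath
import math


def amplitude_at_period(values, step_hours, period_hours):
    """Amplitude of the discrete Fourier component with the given period.

    Assumes evenly spaced samples whose total span is a whole multiple
    of the period, so the period matches a single Fourier frequency.
    """
    n = len(values)
    span = n * step_hours
    k = round(span / period_hours)  # number of full cycles in the series
    coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                for i, v in enumerate(values))
    return 2.0 * abs(coeff) / n


# Toy profiles sampled every 2 hours from 2 to 48 hours (24 points).
times = list(range(2, 50, 2))
oscillating = [3.0 + 1.5 * math.sin(2.0 * math.pi * t / 24.0) for t in times]
flat = [3.0 for _ in times]

amp_osc = amplitude_at_period(oscillating, step_hours=2, period_hours=24)
amp_flat = amplitude_at_period(flat, step_hours=2, period_hours=24)
print(amp_osc, amp_flat)
```

A trend in the profile leaks into the low-frequency components, so it would be sensible to apply such a test to the residuals of the fitted shape, not to the raw profile.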

The chapters that follow will build on the use of the fitted models, demonstrating a variety of applications.

4. Using fitted parameter values to

In document Quantitative analysis of time series microarray data, with application to investigating responses to environmental stresses in arabidopsis (Page 86-90)