Chapter 6 Model Exploration
6.3 Multiple linear regression
Multiple linear regression (MLR) was chosen as the next model-building method because of the number of potential explanatory variables available (Table 5.3). MLR considers the inclusion of a combination of terms, as described in Section 2.5.11. In some cases, additional explanatory variables were calculated as the product of original explanatory variables. Including these combination terms allows interaction effects to be assessed within the statistical model, and may highlight potentially interesting biological combinations for further assessment. However, including all possible combination terms is computationally expensive; it was feasible only for a relatively small subset of the biochemical metabolites, and not for the full range of biochemical metabolites shown in Table 5.3.
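As a minimal sketch of how such product terms might be built, the Python fragment below expands a small table of explanatory variables with one product column per pair of variables, and counts how quickly the number of candidate terms grows. It is an illustration only and not the MATLAB code used for the analysis: the function names and the toy values are hypothetical, and it assumes that "combination terms" means pairwise products.

```python
from itertools import combinations
from math import comb
from typing import Dict, List


def add_pairwise_interactions(data: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Return the original columns plus one product column per pair of columns."""
    expanded = dict(data)
    for a, b in combinations(data, 2):
        # Element-wise product of the two original explanatory variables.
        expanded[f"{a}*{b}"] = [u * v for u, v in zip(data[a], data[b])]
    return expanded


def n_pairwise_terms(k: int) -> int:
    """Number of candidate terms (k originals plus all pairwise products)."""
    return k + comb(k, 2)


# Toy values, invented for illustration only (not study data).
toy = {
    "bicarbonate": [24.0, 26.0, 22.0],
    "creatinine": [0.9, 1.1, 0.8],
    "IL6": [1.2, 0.7, 2.5],
}
expanded = add_pairwise_interactions(toy)
print(sorted(expanded))
print(n_pairwise_terms(10), n_pairwise_terms(50))
```

Under this pairwise assumption, 10 explanatory variables give 55 candidate terms and 50 give 1275, which illustrates why the full set of combination terms quickly becomes expensive to search.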
As a large number of potential explanatory variables were available, an automated selection algorithm was used to assess potential relationships between each dependent variable and the explanatory variables. Stepwise regression was chosen because it selects a model for the observed data quickly, minimising the time required to fit each model. Here, the best model is the one that best describes the observed data, as measured by the adjusted R² value (Section 2.5.12). The stepwisefit algorithm was run in MATLAB to produce the best model for each of the physiological measurements listed in Table 5.2, as described in Section 2.5.11. All of the models were formed using MLR alone; no biological information was used to determine the inclusion or exclusion of specific explanatory variables at this stage.
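The idea of automated selection scored by adjusted R² can be sketched in a few lines of Python. This is a simplified, forward-only greedy search written for illustration; it is not MATLAB's stepwisefit, which adds and removes terms according to p-values. The function names and the toy data are hypothetical, and the adjusted R² is assumed to take the usual form 1 − [RSS/(n − p − 1)] / [TSS/(n − 1)] for n observations and p explanatory variables.

```python
from typing import Dict, List, Sequence, Tuple


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def adjusted_r2(y: Sequence[float], columns: List[Sequence[float]]) -> float:
    """Least-squares fit of y on an intercept plus the columns; return adjusted R^2."""
    n, p = len(y), len(columns)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = p + 1
    # Normal equations (X'X) beta = X'y.
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))


def forward_select(y: Sequence[float],
                   candidates: Dict[str, Sequence[float]]) -> Tuple[List[str], float]:
    """Greedily add the variable that most improves adjusted R^2; stop when none does."""
    chosen: List[str] = []
    best = adjusted_r2(y, [])  # intercept-only model
    while True:
        trials = [(adjusted_r2(y, [candidates[c] for c in chosen + [name]]), name)
                  for name in candidates if name not in chosen]
        if not trials:
            break
        score, name = max(trials)
        if score <= best:
            break
        chosen.append(name)
        best = score
    return chosen, best


# Toy data, invented for illustration: y depends on x1 but not on x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [5.0, 3.0, 8.0, 1.0, 9.0, 2.0, 7.0, 4.0]
y = [5.1, 6.8, 9.05, 11.0, 12.9, 15.2, 16.95, 19.0]
print(forward_select(y, {"x1": x1, "x2": x2}))
```

Because the search only ever adds terms, it can settle on a different model from a procedure that also removes terms, so it should be read as a sketch of the selection principle rather than a reproduction of the fitted models.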
[Figure 6.2 panels; plots not reproduced]
(a) O2 consumption at VO2max (normalised for body weight, /kg) vs. bicarbonate (mmol/L): y = 1.1374x + 15.3542, R² = 0.32787
(b) Work rate at VO2max vs. creatinine (mg/dl): y = 45.6883x + 227.8103, R² = 0.0096168
(c) Work rate at LaT vs. adipsin (pg/mL): y = −6.3979e−05x + 142.5772, R² = −0.050665
(d) O2 consumption at LaT vs. IL-6 (pg/mL): y = −0.0063251x + 1.773, R² = −0.0029593
Figure 6.2: Simple linear regression figures, each regressing a physiological measurement against a biochemical metabolite. (a) shows a very weak linear relationship between oxygen consumption at VO2max (normalised for body weight) and bicarbonate. The R² value for this model is 0.32787, indicating that bicarbonate explains 32.8% of the variability in the dependent variable. Panels (b), (c) and (d) were more typical of the regression results: (b) work rate at VO2max against creatinine, (c) work rate at LaT against adipsin and (d) oxygen consumption at LaT against IL-6. These panels show no linear relationship between the biochemical metabolites and their respective dependent variables.
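For readers who want to reproduce this kind of panel, the following Python sketch fits a straight line by least squares and reports both the ordinary and the adjusted R². The function name and the toy numbers are hypothetical, and the adjusted form is assumed to be 1 − (1 − R²)(n − 1)/(n − 2) for a single explanatory variable. The ordinary R² of a least-squares line with an intercept cannot be negative, so negative values such as those in panels (c) and (d) are consistent with an adjusted R² being reported; that reading is an inference, not something stated in the caption.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float, float]:
    """Least-squares line y = intercept + slope * x.

    Returns (slope, intercept, r2, adjusted_r2).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - rss / tss
    # One explanatory variable, so the residual degrees of freedom are n - 2.
    adjusted = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)
    return slope, intercept, r2, adjusted


# Toy data, invented for illustration: x carries no linear information about y.
slope, intercept, r2, adjusted = fit_line([1.0, 2.0, 3.0, 4.0, 5.0],
                                          [2.0, 1.0, 3.0, 1.0, 2.0])
print(slope, intercept, r2, adjusted)
```

In the toy example the fitted slope is essentially zero, the ordinary R² is essentially zero and the adjusted R² is negative, which is the pattern seen when a metabolite carries no linear information about the dependent variable.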
6.3.1 Assessment of model stability and fitting
All models produced were assessed to determine how well each model fitted the observed data, and to examine the relationships between the dependent variable and each of the explanatory variables.
6.3.2 Uncertainty, variability and residuals
For a given model, all predicted values will exhibit some degree of uncertainty. This uncertainty (or error) should be random, so that predicting higher or lower than the actual response carries the same probability (i.e. the error is symmetric). The magnitude of error should also be independent of the time when the observation occurred. The uncertainty represents any incomplete knowledge about a system, and the level of uncertainty may be reduced with more or better data and a greater understanding of the system. Error is distinct from variability, which is the range in values of naturally occurring parameters. This variability is inherent to the data, and cannot be reduced by increasing the amount of data, gaining better data or a greater understanding of the system.
Residuals are defined as the difference between values fitted from a model and the observed values. Residuals can be used as an estimate of the error associated with a model, and can be used to assess the appropriateness of each parameter included in a model. Examination of the residual values allows the assessment of whether the assumptions made during the modelling process are reasonable, and whether the choice of each of the explanatory variables, or even the entire model choice, is appropriate for the intended use. The overall pattern of residuals should follow a normal distribution and be homoscedastic (i.e. show a homogeneity of variance). Any departure from these assumptions generally indicates that there is some structure to the residuals, which is not accounted for within the model.
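A small Python sketch may make these quantities concrete. It computes residuals, the residual mean square, and a crude comparison of residual spread between the lower and upper halves of the fitted values. The function names and toy numbers are hypothetical; the sign convention (observed minus fitted) and the divisor n − p − 1 for p explanatory variables plus an intercept are assumptions; and the spread ratio is an informal illustration of homoscedasticity, not a formal test.

```python
from typing import List, Sequence


def residuals(observed: Sequence[float], fitted: Sequence[float]) -> List[float]:
    """Residuals under the (assumed) convention observed minus fitted."""
    return [o - f for o, f in zip(observed, fitted)]


def residual_mean_square(res: Sequence[float], n_explanatory: int) -> float:
    """Residual sum of squares divided by the residual degrees of freedom."""
    dof = len(res) - n_explanatory - 1  # assumes the model includes an intercept
    if dof <= 0:
        raise ValueError("not enough observations for this many explanatory variables")
    return sum(r * r for r in res) / dof


def spread_ratio(res: Sequence[float], fitted: Sequence[float]) -> float:
    """Mean squared residual for the larger fitted values over that for the smaller.

    A ratio far from 1 hints at heteroscedasticity; this is only a rough guide.
    """
    order = sorted(range(len(res)), key=lambda i: fitted[i])
    half = len(order) // 2
    low = [res[i] for i in order[:half]]
    high = [res[i] for i in order[-half:]]
    mean_sq = lambda values: sum(v * v for v in values) / len(values)
    return mean_sq(high) / mean_sq(low)


# Toy values, invented for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.5, 1.5, 3.5, 3.5]
res = residuals(observed, fitted)
print(res, residual_mean_square(res, n_explanatory=1), spread_ratio(res, fitted))
```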
For each model, the residual values were plotted against each explanatory variable, and against values fitted by the model, and assessed for heteroscedasticity. 95% confidence intervals (CI) for the residuals were added to each figure to aid in the assessment of the residual values. They were calculated from the residual mean square (ResMS, shown in equation 2.21) as follows:
$$95\%\ \mathrm{CI} = \sqrt{\mathrm{ResMS}} \times (t\text{-value}), \qquad (6.1)$$
where the t-value is the 5% critical value of the t-distribution for the number of degrees of freedom for that model, calculated as n − (the number of explanatory variables used in the model) − 1, with n the number of observations.
6.3.3 Hypothesis testing
To determine the appropriateness of each fitted model, univariate correlations were performed between (a) the residual values and the fitted values for the model and (b) the residual values and each of the explanatory variables used in the model, as described in Section 2.5.7. A significant correlation between the residuals and the fitted values suggests that there is structure in the residuals, and that the model is not describing the variability associated with the dependent variable. A significant correlation between the residuals and an explanatory variable suggests that this explanatory variable may not be suitable for use in the model, as it is not removing any of the unexplained variability from the model.
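The two residual checks, the band from equation 6.1 and the correlation test, can be sketched in Python as follows. The function names and toy numbers are hypothetical. The sketch assumes a Pearson correlation tested with the usual t statistic on n − 2 degrees of freedom, and the critical t-value is passed in by the caller rather than computed. Which correlation Section 2.5.7 actually uses is not shown here.

```python
from math import sqrt
from typing import Sequence


def residual_band_half_width(res_ms: float, t_crit: float) -> float:
    """Half-width from equation 6.1: sqrt(ResMS) times the critical t-value."""
    return sqrt(res_ms) * t_crit


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for the null hypothesis of no correlation (n - 2 d.o.f.)."""
    return r * sqrt((n - 2) / (1.0 - r * r))


# Toy values, invented for illustration only.
res = [0.4, -0.3, 0.1, -0.5, 0.6, -0.2, 0.3, -0.4, 0.2, -0.2]
variable = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
t_crit = 2.306  # two-sided 5% critical value of t on 8 degrees of freedom

res_ms = sum(v * v for v in res) / 8  # one explanatory variable: 10 - 1 - 1 = 8
half_width = residual_band_half_width(res_ms, t_crit)
r = pearson_r(res, variable)
t_stat = correlation_t_statistic(r, len(res))
print(half_width, r, t_stat, abs(t_stat) > t_crit)
```

Plotting the band as plus and minus the half-width around zero is an assumption about how the intervals were drawn. Note also that for a least-squares fit with an intercept, the Pearson correlation between the residuals and the fitted values (or any explanatory variable included in the model) is zero by construction, so a check of this form is most informative with a rank-based correlation or with variables not included in the model.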
6.3.4 Model refinement
Initial MLR models were formed for each physiological measurement, for the combined data over all altitudes. These initial models were then refined through the following steps:
1. Apply multiple linear regression for the dependent variable for the combined data over all altitudes, using a selection of biochemical metabolites as explanatory variables;
2. Produce a summary for each dependent variable, including figures showing observed vs. fitted values and the related adjusted R² value, residual values vs. fitted values and residual values vs. each explanatory variable used;
3. Use summary figures to assess the suitability of each of the explanatory variables in the model (assess for heteroscedasticity and outliers);
4. Remove any observations with high leverage, or the metabolite containing the observation with high leverage, from the analysis and refit the MLR to assess the effect of removing the values on the model formed;
5. Compare initial and refined models formed, and determine the best model for each of the dependent variables. This may not always be the best fitting model, but may be the most reliable model, or the model that predicts a dependent variable of biological interest.
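Step 4 of the list above can be illustrated for the single-variable case with the Python sketch below. It computes the leverage of each observation, flags those above a cutoff, and refits without them so the two fits can be compared. The function names and toy data are hypothetical, and the cutoff of 2(p + 1)/n is a common rule of thumb assumed here; the source does not state which threshold defined "high leverage".

```python
from typing import List, Sequence, Tuple


def leverages(x: Sequence[float]) -> List[float]:
    """Leverage of each observation in a straight-line fit with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


def high_leverage_indices(x: Sequence[float], factor: float = 2.0) -> List[int]:
    """Indices whose leverage exceeds factor * (p + 1) / n, with p = 1 here."""
    cutoff = factor * 2.0 / len(x)
    return [i for i, h in enumerate(leverages(x)) if h > cutoff]


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def refit_without(x: Sequence[float], y: Sequence[float],
                  drop: Sequence[int]) -> Tuple[float, float]:
    """Refit the line after removing the observations at the given indices."""
    keep = [i for i in range(len(x)) if i not in set(drop)]
    return fit_line([x[i] for i in keep], [y[i] for i in keep])


# Toy data, invented for illustration: the last observation lies far from the rest in x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 10.0]
flagged = high_leverage_indices(x)
print(flagged, fit_line(x, y), refit_without(x, y, flagged))
```

Comparing the two fits shows how strongly a single high-leverage observation can pull the estimated slope, which is the effect the refinement step is designed to expose.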