Multiple regression is one of the most widely used statistical techniques (Kottegoda and Rosso, 2008, Hair, 2010). It models a single quantitative response (or dependent) variable as a function of multiple explanatory (or independent) variables. In the particular case of a single explanatory variable, it is termed simple or bivariate regression. In hydrology, the use of these models to establish a link between a hydrological parameter or response signature and a set of catchment descriptors is a long-established practice (Kjeldsen and Jones, 2009).
The independent variables thought to provide information on the behaviour of the dependent variable are included in the regression model and regression parameters are estimated given observation data. Frequently, when no knowledge is available about the relationship between the variables, for convenience, the models are assumed linear in the parameters (note that the predictors may or may not enter the model as first-order terms). Such models are referred to as multiple linear regression (or simple linear regression in the case of a single independent variable), which in its basic formulation can be written as:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$    Eq. 5.1
where $Y$ is an observable random variable, $X_1, X_2, \dots, X_p$ are observable nonrandom variables assumed to be measurable without error, $\beta_0, \beta_1, \dots, \beta_p$ are unknown model parameters also known as regression coefficients (or partial regression coefficients), and $\varepsilon$ is an unobservable random variable, referred to as the error term, that represents the discrepancy between $Y$ and the predicted values of the dependent variable. Statistical assumptions about $\varepsilon$ have to be made for model formulation. It is usually assumed that $\varepsilon$ has zero mean, $E[\varepsilon] = 0$, constant variance, $\sigma^2$, and that its terms are independent of each other and of the values of the independent variables. Although these error assumptions are modest, additional assumptions such as normality in distribution have to be added for the purposes of making confidence statements and hypothesis testing.
Nevertheless, for large samples the normality assumption is less critical due to the central limit theorem (Allison, 1999, Kottegoda and Rosso, 2008).
Linear regression models are not always adequate and non-linear models can sometimes be more realistic. Significantly, some non-linear models can be easily linearised through some transformation of the original variables (e.g. by taking the square root or logarithm of the variables) in order to improve the ease of subsequent model fitting. For example, the relationship $Y = aX^{b}$ can be fitted as $Y' = \beta_0 + \beta_1 X'$ with $Y' = \ln Y$, $X' = \ln X$, $\beta_0 = \ln a$ and $\beta_1 = b$. Non-linear models that can be transformed into a linear equation are called intrinsically linear models (Rawlings et al., 1998). There are cases, however, in which such transformations do not exist, and techniques of non-linear regression have to be used instead (non-linear models are outside the scope of this thesis, but for a review of different methods for fitting these models see Rawlings et al., 1998, Chapter 15).
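As an illustration of an intrinsically linear model, the sketch below fits a power law by taking logarithms of both variables and applying simple least squares to the transformed data. The power-law form, the data values and all variable names are assumptions chosen for illustration; they are not taken from the thesis.

```python
import math

# Hypothetical noise-free data following y = a * x**b
# (a = 2.0 and b = 1.5 are illustrative values only).
a_true, b_true = 2.0, 1.5
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [a_true * xi ** b_true for xi in x]

# Log-transform both variables: ln(y) = ln(a) + b * ln(x) is linear in the parameters.
lx = [math.log(xi) for xi in x]
ly = [math.log(yi) for yi in y]

# Simple least squares fit of ly on lx (closed-form slope and intercept).
n = len(lx)
mean_lx = sum(lx) / n
mean_ly = sum(ly) / n
sxy = sum((u - mean_lx) * (v - mean_ly) for u, v in zip(lx, ly))
sxx = sum((u - mean_lx) ** 2 for u in lx)
slope = sxy / sxx
intercept = mean_ly - slope * mean_lx

# Back-transform the intercept to recover a; the slope estimates b directly.
a_hat = math.exp(intercept)
b_hat = slope
print(a_hat, b_hat)
```

With real, noisy data the least squares criterion is applied on the transformed scale, so the fitted curve minimises errors in $\ln Y$ rather than in $Y$.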
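When the variables to include in the model are known, as well as the form they should take, the first goal is to estimate the model parameters $\beta_0, \beta_1, \dots, \beta_p$, which can be more concisely represented in the form of a $[(p+1) \times 1]$ vector $\boldsymbol{\beta}$. For $n$ observations, a $[n \times (p+1)]$ matrix $\mathbf{X}$ containing the observations on the independent variables⁴ is built as
$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$    Eq. 5.2
The response variable is represented by the $[n \times 1]$ vector
$\mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$    Eq. 5.3
The multiple linear regression model shown in Equation 5.1 can then be written in matrix notation as
$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$    Eq. 5.4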
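The matrix layout described above can be sketched with NumPy as follows. The descriptor names (`area`, `rainfall`), the response name (`flow`) and all values are hypothetical and serve only to show the shapes involved.

```python
import numpy as np

# Hypothetical observations of p = 2 catchment descriptors at n = 5 sites.
area = np.array([12.0, 45.5, 8.3, 102.0, 60.1])
rainfall = np.array([850.0, 1200.0, 640.0, 990.0, 1430.0])

# Hypothetical response observed at the same 5 sites.
flow = np.array([3.1, 14.8, 1.6, 27.5, 22.0])

# Matrix of independent variables: a leading column of ones (intercept),
# followed by one column per predictor, giving shape [n x (p + 1)].
X = np.column_stack([np.ones(len(area)), area, rainfall])

# Response as an [n x 1] column vector.
Y = flow.reshape(-1, 1)

print(X.shape, Y.shape)
```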
Assuming that $n$ sets of observations are available for all variables ($Y, X_1, \dots, X_p$), the set of model parameters is estimated based on these observation data. As an error exists in $\mathbf{Y}$, the model parameters $\boldsymbol{\beta}$ cannot be determined exactly and a method is necessary to find the ‘best’ model based on a certain performance criterion. The simplest and the most commonly used estimation procedure is the least squares method (Draper and Smith, 1998, Johnson and Wichern, 2007). The regression coefficients determined based on the least squares procedure are given by
$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$    Eq. 5.5
where $\hat{\boldsymbol{\beta}}$ is the $[(p+1) \times 1]$ vector of fitted parameters given by the least squares analysis. Detailed background to Equation 5.5 and, in addition, Equations 5.6 to 5.12 is given in Appendix A. Here only the most relevant equations are shown. The vector of estimated mean values of $\mathbf{Y}$ is then given by
$\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$    Eq. 5.6
The residuals are given by the difference between the observed and the fitted values of $\mathbf{Y}$
$\hat{\boldsymbol{\varepsilon}} = \mathbf{Y} - \hat{\mathbf{Y}}$    Eq. 5.7
where $\hat{\boldsymbol{\varepsilon}}$, a $[n \times 1]$ vector, is an estimator of the (unknown) model errors $\boldsymbol{\varepsilon}$.
4 The first column of the matrix $\mathbf{X}$, filled with ones, is necessary so that the model has a constant term, $\beta_0$, termed the intercept.
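A minimal numerical sketch of the least squares fit, the fitted values and the residuals is given below. The data are hypothetical, and the normal equations are solved with `np.linalg.solve` rather than by forming the inverse explicitly, which is a common numerical choice and not something prescribed by the text.

```python
import numpy as np

# Hypothetical data: n = 6 observations of p = 2 predictors (illustrative values only).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([5.1, 5.9, 11.2, 11.8, 17.1, 17.9])

# Matrix of independent variables with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares estimate: solve the normal equations (X'X) beta = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values and residuals.
y_hat = X @ beta_hat
residuals = y - y_hat

print(beta_hat)
print(residuals)
```

A useful check on any such fit is that the residuals are orthogonal to every column of the matrix of independent variables, so that `X.T @ residuals` is zero up to rounding error.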
The error variance, $\sigma^2$, is not known and is approximated based on the estimation of residuals (Equation 5.7). Given that $p + 1$ degrees of freedom are lost in the estimation of the model parameters, the unbiased estimator of the variance of the errors is (Kottegoda and Rosso, 2008)
$\hat{\sigma}^2 = \dfrac{\hat{\boldsymbol{\varepsilon}}^T \hat{\boldsymbol{\varepsilon}}}{n - p - 1}$    Eq. 5.8
where $T$ denotes transpose.
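The error variance estimate can be computed directly from the residual sum of squares and the residual degrees of freedom, as sketched below. The data are hypothetical, and `np.linalg.lstsq` is used here only as a convenient way of obtaining the least squares coefficients.

```python
import numpy as np

# Hypothetical data: n = 6 observations, p = 2 predictors plus an intercept column.
X = np.column_stack([
    np.ones(6),
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
])
y = np.array([5.1, 5.9, 11.2, 11.8, 17.1, 17.9])

# Least squares coefficients and residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Degrees of freedom: n observations minus the (p + 1) estimated parameters.
n, n_params = X.shape
dof = n - n_params

# Unbiased estimate of the error variance: residual sum of squares / dof.
sigma2_hat = float(residuals @ residuals) / dof
print(dof, sigma2_hat)
```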
For a set of $X$-values $\mathbf{x}_0 = [1, x_{01}, x_{02}, \dots, x_{0p}]^T$, the mean response is $E[Y_0] = \mathbf{x}_0^T\boldsymbol{\beta}$ and its estimated value is
$\hat{Y}_0 = \mathbf{x}_0^T\hat{\boldsymbol{\beta}}$    Eq. 5.9
It can be shown (see Appendix A, Equations A.12 and A.13, for proof) that the estimator of the mean response $\hat{Y}_0$ is unbiased, such that $E[\hat{Y}_0] = E[Y_0]$. The confidence interval on the predicted value can be established using this estimated mean response value, in combination with its variance and the $t$ statistic (see, for example, Kottegoda and Rosso, 2008, for more details). The variance of the mean response is estimated by (see Kottegoda and Rosso, 2008, Equation 6.2.32)
$\mathrm{Var}[\hat{Y}_0] = \sigma^2[\mathbf{x}_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_0]$    Eq. 5.10
For an unobserved value of $Y$, for example $Y_0$, and assuming that the new observation is independent of the previous ones, i.e. $\mathrm{Cov}[\hat{Y}_0, \varepsilon_0] = 0$, the variance is calculated as:
$\mathrm{Var}[Y_0 - \hat{Y}_0] = \mathrm{Var}[\hat{Y}_0] + \mathrm{Var}[\varepsilon_0]$    Eq. 5.11
The first term on the right-hand side of Equation 5.11 describes the error introduced by using a sample, as opposed to the entire population, to estimate the regression coefficients. This reflects the fact that if another sample from the same population were used, slightly different values would be obtained for the parameters. The second term relates to the discrepancy between the observed values and the underlying true population value of the dependent variable.
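Substituting from Equation 5.10, and given that $\mathrm{Var}[\varepsilon_0] = \sigma^2$, $\mathrm{Var}[Y_0 - \hat{Y}_0]$ becomes
$\mathrm{Var}[Y_0 - \hat{Y}_0] = \sigma^2[1 + \mathbf{x}_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_0]$    Eq. 5.12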
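The two variances discussed above differ only by the error variance itself, which the following sketch makes explicit. The data and the new point `x0` are hypothetical, and the unknown error variance is replaced by its residual-based estimate, which is the usual practical substitution.

```python
import numpy as np

# Hypothetical data: n = 6 observations, p = 2 predictors plus an intercept column.
X = np.column_stack([
    np.ones(6),
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
])
y = np.array([5.1, 5.9, 11.2, 11.8, 17.1, 17.9])
n, n_params = X.shape                      # n_params = p + 1

# Least squares fit and estimated error variance.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
residuals = y - X @ beta_hat
sigma2_hat = float(residuals @ residuals) / (n - n_params)

# Hypothetical new set of X-values, with the leading 1 for the intercept.
x0 = np.array([1.0, 3.5, 3.5])
y0_hat = float(x0 @ beta_hat)              # estimated mean response at x0

# Quadratic form x0' (X'X)^-1 x0 shared by both variances.
q = float(x0 @ XtX_inv @ x0)
var_mean = sigma2_hat * q                  # variance of the estimated mean response
var_pred = sigma2_hat * (1.0 + q)          # variance for a new, unobserved value

print(y0_hat, var_mean, var_pred)
```

The variance for a new observation is always the larger of the two, and the gap between them does not shrink as the sample size grows, because it reflects the scatter of individual observations rather than uncertainty in the coefficients.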
The computation of least squares regression is relatively straightforward. However, it has been assumed thus far that the model is known with respect to which variables to include, as well as the form those variables should take and the functional form of the model itself. In most regression problems, however, it is not known in advance which variables should be part of the model. Rather, it is necessary to decide on which independent variables to include in the final regression equation from a pool of candidate predictors.
The fit of a regression model to the data necessarily improves, even if only marginally, as more predictors are included in the model. Therefore, at first glance, a full (or global) model, which incorporates all available independent variables (irrespective of whether or not they are statistically significant), might be thought to be the best option. However, this type of model would be overparameterised, reducing its value for subsequent analyses due to the costs involved in the estimation of each predictor. Regression models with fewer variables, called partial models, are therefore often seen as a more attractive option. Clearly though, a balance is needed between model parsimony and concise explanation, ensuring that any model remains sufficiently detailed to guarantee that an important variable is not omitted. One option is to calculate all possible regressions for the different combinations of explanatory variables. This procedure often shows that after initially adding a small number of explanatory variables, the additional improvement in the solution when more variables are added is minimal. Listing all regression models can, however, be cumbersome when more than a few predictors are available. For $p$ possible predictors, without even considering any data transformation such as $X^2$, $\log(X)$, etc., it is necessary to analyse $2^p$ possible models (one of these models does not include any independent variable and corresponds to the mean of the observed dependent variable). Even if as few as 10 predictors are available, 1,024 candidate models would have to be analysed. Undoubtedly, a procedure that could shorten this task would be of great value. In this regard, the use of algorithms such as the leaps-and-bounds algorithm of Furnival and Wilson (1974) offers an alternative to fitting all possible regressions.
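The growth in the number of candidate models can be checked by enumerating every subset of a pool of predictors, as in the short sketch below. The function name `candidate_models` and the predictor labels are hypothetical, and the enumeration is plain brute force, not the leaps-and-bounds algorithm.

```python
from itertools import combinations


def candidate_models(predictors):
    """Yield every subset of the predictor names, including the empty
    (intercept-only) model."""
    for size in range(len(predictors) + 1):
        for subset in combinations(predictors, size):
            yield subset


# 10 hypothetical candidate predictors, labelled X1 to X10.
predictors = [f"X{i}" for i in range(1, 11)]
models = list(candidate_models(predictors))

print(len(models))  # 1024
```

Each additional candidate predictor doubles the count, which is why exhaustive fitting quickly becomes impractical.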
With the leaps-and-bounds approach, the best regression is found for each subset of regressions with the same number of independent variables, thus avoiding the need to examine all possible models individually. The value of these types of combinatorial approaches is, however, limited by a failure to account for multicollinearity, observations with a disproportionate impact on the regression results (i.e. influential points), and the physical interpretability of the results (Hair, 2010).
An alternative automated variable selection procedure is stepwise regression. While stepwise regression suffers from some of the same problems as combinatorial approaches, it is more efficient in selecting the subset of independent variables that maximises the predictive accuracy. Stepwise regression decides on which variables to include in the final model by testing whether the parameters are significantly different from zero. Even though the stepwise methodology does not guarantee that the best subset for each subset size is found, it reduces the list of predictors to a manageable number with lower computational demands than the methods previously described. This is particularly advantageous for large data sets. The stepwise selection procedure (for a description of the stepwise regression procedure see Appendix B) is almost automatic and, with a minimum of personal judgement, allows the predictive ability of the regression model to be maximised using only those independent variables that contribute in a statistically significant way. For this reason, stepwise multiple regression is often preferred over other procedures. However, one should not rely too heavily on the automatic selection performed by the computer and results should be analysed with critical expert judgement. Preferably, the variables included in the model should be judged not only in terms of statistical criteria, but also in the light of the model’s physical plausibility. For example, in the FEH, hydrological judgement was used to determine whether models were physically meaningful and to help to choose between very highly correlated variables. Models thought to be hydrologically unrealistic were rejected (Robson and Reed, 1999). Although more faith can be placed in predictions based on causal relationships, mainly due to the security provided by such models against inadvertent extrapolations and unrecognised changes in the correlational structure of the system, Rawlings et al. (1998) point out that as long as the regressions are used to predict and estimate the mean responses within the $X$-space of the data, it is somewhat irrelevant whether the variables are selected on theoretical grounds or not.
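To make the idea of significance-based variable selection concrete, the sketch below implements a simplified forward-only selection driven by a partial F statistic. It is not the procedure described in Appendix B: there is no removal step, the entry threshold `f_to_enter = 4.0` is an assumed rule-of-thumb value, and the function names and synthetic data are hypothetical.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares of the least squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def forward_selection(candidates, y, f_to_enter=4.0):
    """Greedy forward selection using a partial F statistic.

    candidates: dict mapping a predictor name to a 1-D array of observations.
    f_to_enter: assumed entry threshold (not a value taken from the source).
    """
    n = len(y)
    selected = []
    X_current = np.ones((n, 1))            # start from the intercept-only model
    rss_current = rss(X_current, y)
    remaining = list(candidates)
    while remaining:
        best = None
        for name in remaining:
            X_trial = np.column_stack([X_current, candidates[name]])
            df_resid = n - X_trial.shape[1]
            if df_resid <= 0:
                continue                   # no residual degrees of freedom left
            rss_trial = rss(X_trial, y)
            if rss_trial <= 0.0:
                f_stat = float("inf")      # perfect fit: treat as maximally significant
            else:
                # Partial F for adding one predictor to the current model.
                f_stat = (rss_current - rss_trial) / (rss_trial / df_resid)
            if best is None or f_stat > best[1]:
                best = (name, f_stat, X_trial, rss_trial)
        if best is None or best[1] < f_to_enter:
            break                          # no remaining predictor passes the threshold
        name, _, X_current, rss_current = best
        selected.append(name)
        remaining.remove(name)
    return selected


# Synthetic example: y depends on x1 and x3 only; x2 is an irrelevant candidate.
rng = np.random.default_rng(0)
n_obs = 50
candidates = {name: rng.normal(size=n_obs) for name in ("x1", "x2", "x3")}
y = (1.0 + 2.0 * candidates["x1"] - 3.0 * candidates["x3"]
     + 0.1 * rng.normal(size=n_obs))

selected = forward_selection(candidates, y)
print(selected)
```

Because the selection is purely statistical, an irrelevant candidate can occasionally pass the threshold by chance, which is one reason the selected variables should still be reviewed for physical plausibility.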