Chapter 5: Model calibration/validation methodology
5.6 Cross-validation
Model validation usually consists of the testing of the model on a sample of independent data that has been held back during the development of the forecast equation (Wilks, 1995). Hence the fit of a model to observed data can be tested.
Cross-validation is a technique used in model validation. It is especially useful when the dataset is small, so that reserving a portion of it for testing would leave too little data to formulate a good model (Wilks 1995, Efron and Gong 1983, Miller 2002). It is also said to give a more independent and robust assessment of the forecast skill (Tootle and Piechota, 2004). The simplest form of cross-validation involves dividing the data into two portions, which may or may not be equal in size. In climatological datasets this usually involves separating the time series into two portions. One part of the data is used to train, or calibrate, the model, and the other part is used to test, or validate, the model.
An improvement on this method is the k-fold cross-validation method. The data are divided into k portions, and the calibration/validation exercise is repeated with each portion in turn held back as the validation set (Miller, 2002). The logical extreme is the “leave-one-out” cross-validation method (Breiman and Spector, 1992), where k is equal to n, the number of data points in the set. This means that n separate model calibrations are carried out, with one data point held back as the validation dataset in each instance. Breiman and Spector (1992) suggest that leaving out about 20% of the data each time is a sensible fraction.
The possibility of using bootstrapping to tackle the problem of small sample size and to construct confidence limits was briefly considered. Cross-validation and bootstrapping are both resampling methods for estimating generalisation error. Leave-one-out cross-validation is very similar to another method, known as jackknifing, and this was also considered as a possible method (Efron and Gong 1983, Hjorth 1994). However, cross-validation is used to estimate generalisation error, while the jackknife is used to estimate the bias of a statistic. It was decided to use cross-validation, and the standard error of the resulting model validation residuals (if they were found to be normally distributed), for constructing confidence limits. It was felt that this methodology captured any possible outliers, in the tails of the normal distribution, better than bootstrapping, which resamples the current distribution only.
An example application is that of Mullan and Renwick (1996), who used a three-fold cross-validation method. They divided the data into thirds, and used each third in turn as the independent validation set. Forecast skill was averaged over the three independent sets, on the assumption that this would give a more robust indication of forecast skill. Breiman and Spector (1992) found that ten-fold and five-fold cross-validation gave better error estimates than leave-one-out, and it was decided to use five-fold cross-validation in this study.
A 50-season time series was used in most of the model calibrations in this study. The data were therefore divided into five periods of approximately ten years each, and five different models were constructed, each calibrated on a different forty-year period and validated on the remaining ten-year period.
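The five-fold division described above can be sketched as follows. This is a minimal illustration only, assuming the periods are contiguous blocks of the time series; the function names `contiguous_folds` and `train_validation_splits` are hypothetical and are not taken from the software used in this study.

```python
def contiguous_folds(n_samples, k=5):
    """Split indices 0..n_samples-1 into k contiguous blocks of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` blocks take one additional season each.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_validation_splits(n_samples, k=5):
    """Yield (calibration_indices, validation_indices), one pair per held-back block."""
    for held_back in contiguous_folds(n_samples, k):
        held = set(held_back)
        calibration = [i for i in range(n_samples) if i not in held]
        yield calibration, held_back


# A 50-season series gives five models, each calibrated on 40 seasons
# and validated on the remaining 10.
for calibration, validation in train_validation_splits(50, k=5):
    print(len(calibration), "calibration seasons; validation seasons",
          validation[0], "to", validation[-1])
```

When the series length is not a multiple of five, the blocks differ in length by at most one season in this sketch.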
In a location with a climate that was completely stationary over time, with five calibration/validation periods that contained the same variability within their training and validation datasets, and with a perfect model, all five of the predictive equations created from these models would be identical. In reality, these three conditions do not all hold, and therefore the five models have slightly different predictive equations, with different associated skill levels, and will produce different predictions of inflows or rainfall (or raindays).
If the models did display fairly similar equations, a reasonable assumption would be that any of the period models could be taken forward as the model to predict future, unseen data, and the cross-validation exercise would then serve only to estimate “true” model error. Another possible methodology is that, once model error is estimated using the standard error of all cross-validation period residuals, a completely new model is calibrated using all the available data, with none held back for validation (Wilks 1995). There is also precedent in the literature for using an average of a range of model predictions as the one prediction to be used (Georgakakos 2001, Mo 2003, Mullan and Renwick 1996, Sharma and Lall 2003, Sivillo et al. 1997, Stephenson et al. 1999). This is especially true in the recent use of Global Circulation Model (GCM) outputs, where predictions from several GCMs from around the globe are averaged to give the final prediction (Palmer et al. 2004, Rowell 1998). Indeed, ensemble model forecasting, where a mean or some other combination of model outputs is used as the actual prediction, is in regular use as a technique intended to minimise the errors associated with each of the individual models. Zhang and Casey (2000) discuss averaging models calibrated using different techniques, and report that an optimal combination of forecasts from those separate schemes provides higher skill than that achieved by any of the individual models. The idea that the individual errors are minimised by using an ensemble only holds true if the individual models are independent and unbiased.
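The role of independence and unbiasedness can be illustrated with a short worked derivation. The notation here is introduced for illustration only and is not used elsewhere in this chapter: $m$ is the number of models, $e_i$ the error of model $i$, $b_i$ its bias, $\sigma^2$ a common error variance, and $\rho$ a common correlation between the errors of any two models. For the simple (unweighted) ensemble mean error $\bar{e} = \frac{1}{m}\sum_{i=1}^{m} e_i$:

\[
E[\bar{e}] = \frac{1}{m}\sum_{i=1}^{m} b_i ,
\qquad
\mathrm{Var}(\bar{e}) = \frac{\sigma^2}{m} + \frac{m-1}{m}\,\rho\,\sigma^2 .
\]

If the models are unbiased ($b_i = 0$) and independent ($\rho = 0$), the ensemble error has zero mean and variance $\sigma^2/m$, which shrinks as models are added. If the errors are perfectly correlated ($\rho = 1$), the variance remains $\sigma^2$ and averaging gives no reduction. A bias shared by all the models is not reduced by averaging at all.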
However, although some of the individual period models in this study had very good apparent skill levels, some performed very badly. Intuitively, to include all these models in the final model average required some faith in the badly performing models. Whether they would perform well on future data was unknown, but the period models with higher skill felt intuitively more trustworthy.
It was therefore decided to use a weighted average of all five period models, weighted on their skill levels (explained variance, calculated on the 9 or 10 residuals from each validation period), as the model for future predictions. There is some precedent for this in the literature (Tootle and Piechota 2004, Colman 2004, Zhang and Casey 2000). Zhang and Casey (2000) weighted the combination of different model predictions in their average by finding the combination that minimised the mean-square error. In this study, for future applications of the model, all five period models are run to give one prediction each. A weighted average of these five predictions is then formed, using the explained variance of each period (calculated on model validation for that period) as the weighting for that period. More trust is thus placed in period models whose validation gave a high explained variance, while those with a lower validation goodness of fit are not necessarily excluded entirely, but make a proportionally smaller contribution to the final prediction.
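A minimal sketch of this skill-weighted combination is given below. The function name `weighted_prediction` and the numerical values are hypothetical. The treatment of a negative explained variance (given zero weight here) is an assumption made for the sketch, as the text does not state how such cases are handled.

```python
def weighted_prediction(predictions, skills):
    """Combine period-model predictions, weighting each by its validation skill.

    predictions: one prediction per period model, all for the same target season.
    skills: explained variance of each period model on its own validation period.
    """
    if len(predictions) != len(skills):
        raise ValueError("need exactly one skill value per prediction")
    # Assumption (not from the text): a negative skill contributes zero weight.
    weights = [max(s, 0.0) for s in skills]
    total = sum(weights)
    if total == 0:
        raise ValueError("no period model has positive skill")
    return sum(w * p for w, p in zip(weights, predictions)) / total


# Hypothetical predictions and skill levels, for illustration only.
period_predictions = [120.0, 135.0, 110.0, 128.0, 140.0]
period_skill = [0.60, 0.30, 0.05, 0.45, 0.10]
print(weighted_prediction(period_predictions, period_skill))
```

With equal skill levels the result reduces to the simple ensemble mean.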
Examples of these new weighted models applied to recent “new” data are shown in chapter 6, as is the methodology applied for constructing confidence limits around these new weighted average models.
5.7 Goodness of fit
Techniques outlined in section 5.5 were employed to determine a stopping criterion, which would choose between models of different numbers of parameters, but still with the same dependent variable and similar independent variables. An area of focus of this research, however, was to examine which dependent variables and seasons were easier to predict than others, and therefore models with different dependent variables needed to be compared by some generic measure of model skill.
There are many available goodness of fit criteria to choose from, but all focus on slightly different aspects of model fit. Anderson and Bates (2001) state that different goodness of fit criteria will result in different fit scores, and hence different conclusions about model validity. They state that it is critical, therefore, to apply a number of model fit tests, to justify the reasons for choosing those tests, and to acknowledge the aspects of fit that are highlighted by each test. The different foci of the goodness of fit criteria used in this study have been outlined in chapter 2, and they are used in conjunction with each other, as can be seen in section 6.4.
A simple linear measure of goodness of fit or model skill is the correlation coefficient, as used by Mo (2003), but this is not correctly utilised in a goodness of fit (non-regression) setting, and can lead to erroneously high goodness of fit estimates. Adjusted r-squared and Mallows' Cp, already mentioned as stopping criteria, have also been used as goodness of fit criteria in this study, but further criteria were also examined as measures of the success of the models.
5.7.1 Explained variance
Explained variance has already been discussed in chapter 2, and is used widely in seasonal hydro-climatic forecasting both in New Zealand and internationally as a measure of model skill. The term “explained variance” also has a different application in least squares regression, and this should not be confused with this application of the term. The explained variance of a predictive model used in this thesis is defined as 1 minus the mean squared error of prediction divided by the variance of the observed values. “Explained variance” can be a misleading term, as a regression analysis can quantify the nature and strength of a relationship between two variables, but can say nothing about which variable causes the
in a multiple regression context, which is misleading. However, this terminology is used widely in climatological literature (Compagnucci & Vargas 1998, Eshel et al. 2000, Garcia et al. 2005, Hastenrath & Greischar 1993, Hawkins et al. 2002, Kabanda & Jury 1999, Kidson & Gordon 1986, Paegle & Mo 2002, Renwick 2002, Tomlinson 1981, Whiting et al. 2004, Widmann et al. 2003), and to be consistent with this it was decided to use it in this study. To avoid confusion, this explained variance is denoted Ev for the remainder of this text.
The resulting equation is:
\[ Ev = 1 - \frac{\sum (y - \hat{y})^2}{(n - 2)\,\mathrm{Var}(obs)} \qquad (5.9) \]
The explained variance measure describes the proportion of variability within the predictand dataset captured by the model prediction, and is widely used in seasonal climate research.
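The explained variance calculation can be sketched as below, following the definition given in this section (1 minus the mean squared error of prediction divided by the variance of the observed values). Two points are assumptions rather than statements from the text: the mean squared error is taken to use n − 2 degrees of freedom, which is how equation 5.9 appears to read, and the variance of the observations is taken to be the sample variance. The function name `explained_variance` and the parameter `dof_lost` are hypothetical.

```python
from statistics import variance


def explained_variance(observed, predicted, dof_lost=2):
    """Ev = 1 - MSE of prediction / variance of the observed values."""
    n = len(observed)
    if n != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if n <= dof_lost:
        raise ValueError("too few values for the chosen degrees of freedom")
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Assumption: mean squared error uses n - dof_lost degrees of freedom.
    mse = sse / (n - dof_lost)
    # Assumption: statistics.variance is the sample variance (divisor n - 1).
    return 1.0 - mse / variance(observed)


# Hypothetical validation-period values, for illustration only.
obs = [10.0, 12.0, 9.0, 15.0, 11.0, 13.0, 8.0, 14.0, 12.0, 10.0]
pred = [11.0, 12.5, 9.5, 13.5, 11.0, 12.0, 9.0, 13.0, 11.5, 10.5]
print(explained_variance(obs, pred))
```

Under these assumptions Ev can be negative when the prediction errors are larger than the spread of the observations themselves.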
5.7.2 Nash Sutcliffe Criterion
The Nash-Sutcliffe criterion (N-S) has been discussed in chapter 2. The optimal value of N-S is one, and N-S should be above zero to indicate minimally acceptable model performance (Knapp et al., 2004). A value of zero indicates that the model predicts no better than the mean of the observed values, and a negative value indicates that the mean value of the dependent variable is a better predictor than the model.
The formula for calculating the Nash-Sutcliffe criterion is defined as:
\[ \text{N-S} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \qquad (5.10) \]
where $y_i$ and $\hat{y}_i$ are the observed and model predicted values and $\bar{y}$ is the average observed value.
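A minimal sketch of the Nash-Sutcliffe calculation follows, using the sums of squared differences described for equation 5.10. The function name `nash_sutcliffe` and the example values are hypothetical.

```python
def nash_sutcliffe(observed, predicted):
    """N-S = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    mean_obs = sum(observed) / len(observed)
    numerator = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    denominator = sum((o - mean_obs) ** 2 for o in observed)
    if denominator == 0:
        raise ValueError("observed values are constant; N-S is undefined")
    return 1.0 - numerator / denominator


# Hypothetical validation-period values, for illustration only.
obs = [10.0, 12.0, 9.0, 15.0, 11.0]
pred = [11.0, 12.5, 9.5, 13.5, 11.0]
print(nash_sutcliffe(obs, pred))
```

Under this formula, predicting the observed mean for every season returns exactly zero, and a perfect prediction returns one.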
5.8 Residual analysis
Before model predictions could be applied to error evaluation, some residual analysis needed to be carried out, to ascertain the distribution characteristics of the residuals from the model validations. Results of regression analysis can be misleading if residuals do not comply with some underlying assumptions, and trends and bias in the model predictions can be shown up through residual analysis. To apply confidence intervals using the standard error of the residuals, a Gaussian distribution of residuals is required.
For the above reasons, four main residual analyses were undertaken. Tests were carried out to examine residual distributions for a) heteroscedasticity, b) trends in residuals, c) normality and d) mutual independence.
Residuals were examined for heteroscedasticity and trends by plotting the residuals against their corresponding predicted values. Any “fanning out” of residuals for either low, middle or high predicted values would imply that variance of the residuals is not constant as the predicted value increases, and that the residuals are therefore heteroscedastic, and evenly spaced confidence limits around these predictions would be misleading.
Residuals were tested for normality both by graphical interpretation of histograms and by using an adaptation of the Shapiro-Wilk test, following Royston (1982), as mentioned in section 4.7.1 in relation to exploratory data analysis. Any skewness or kurtosis in the distributions could signal outliers in the residuals, which would imply that not all values of the predictand were being predicted equally well.
When underlying data are serially correlated, which is a common condition for atmospheric variables (Wilks, 1995), it is useful to investigate whether the residuals are also serially correlated, or if they are mutually independent. This was examined in this study initially by plotting residuals against time, and then examining the residuals for serial correlation using the autocorrelation function in the Statistica software package. Generally, if groups of positive or negative residuals are clustered together, rather than occurring irregularly, then time correlation can be suspected (Wilks, 1995).
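The serial correlation check can be sketched with a sample autocorrelation function. The study itself used the autocorrelation function in Statistica; the code below is a stand-in illustration only, using the common estimator in which the lagged products of deviations are divided by the total sum of squared deviations. The function name `autocorrelation` is hypothetical.

```python
def autocorrelation(residuals, lag=1):
    """Sample autocorrelation of a residual series at the given lag."""
    n = len(residuals)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(residuals) - 1")
    mean = sum(residuals) / n
    deviations = [r - mean for r in residuals]
    denominator = sum(d * d for d in deviations)
    if denominator == 0:
        raise ValueError("residuals are constant; autocorrelation is undefined")
    numerator = sum(deviations[t] * deviations[t + lag] for t in range(n - lag))
    return numerator / denominator


# Hypothetical residual series, for illustration only.
clustered = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]     # runs of same-signed residuals
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # sign changes every step
print(autocorrelation(clustered), autocorrelation(alternating))
```

Clustered positive or negative residuals give a positive lag-one value, whereas irregularly occurring residuals give a value near zero.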
5.8.1 Model bias
Model bias, as discussed in section 5.5.3, is defined as being the correspondence between the average forecast and the average observed value of the predictand. In an unbiased forecast model, these values would be the same.
Although the model error in cross-validation techniques in this study is calculated from all five sets of cross-validation period residuals for each season and dependent variable, a weighted average of these models is actually used to predict future scenarios. For this averaging to result in reduced model error, there must be independence between the component parts, and no bias in predictions (i.e. model error, as given by the residuals, is randomly distributed about the observed value).
An examination of possible bias can be undertaken by plotting Cp against p together with the Cp=p line, where Cp is Mallows' Cp for the model validation period, and p is the number of parameters in the model (including the intercept). When Cp is plotted against p for any model, those with minimal bias will have data points very close to the Cp=p line. A plot where points are above the Cp=p line reflects bias in the parameter estimates in the regression equation. Models with values below the line are considered to have no bias (Myers, 1986). This was undertaken for all best models in this study.
The examination of model bias using the small sample size of 9 or 10 from each validation period model could lead to spurious results. For the above two reasons, generally all forty-eight residuals from all five validation period models are utilised in bias analyses. Notwithstanding this, it is useful to look at the individual cross-validation period models, to ascertain whether bias is present in all individual period models or not, and to what extent.
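The comparison of pooled and per-period bias can be sketched as follows, taking bias as the difference between the average forecast and the average observed value. The sign convention (forecast minus observed), the function names `mean_residual` and `bias_summary`, and the example values are all assumptions made for illustration.

```python
def mean_residual(observed, predicted):
    """Average forecast minus average observation; zero for an unbiased model."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)


def bias_summary(periods):
    """Return (bias of each validation period, bias of all residuals pooled).

    periods: list of (observed, predicted) pairs, one per validation period.
    """
    per_period = [mean_residual(obs, pred) for obs, pred in periods]
    pooled_obs = [o for obs, _ in periods for o in obs]
    pooled_pred = [p for _, pred in periods for p in pred]
    return per_period, mean_residual(pooled_obs, pooled_pred)


# Hypothetical validation periods (shortened), for illustration only.
periods = [
    ([10.0, 12.0, 9.0], [11.0, 13.0, 10.0]),   # forecasts consistently too high
    ([15.0, 11.0], [15.0, 11.0]),              # no bias in this period
]
print(bias_summary(periods))
```

The pooled value draws on more residuals than any single period, while the per-period values show whether bias is confined to particular periods.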