5. Discussion
case for a well-calibrated model, this deficiency of QM becomes more significant (Madadgar et al., 2014). The evaluation results of the flow forecasts improve over the training period after post-processing (appendix 6), which shows the potential of QM when a consistent bias is present. With a longer period of data, individual deviating years (such as 2010) would probably have less influence, and a more consistent bias could be obtained over the whole cumulative probability domain. In addition, the data could be separated by weather pattern, which may yield a more consistent bias within each pattern. A further problem, however, is that the joint distribution of observations and forecasts is often nonhomogeneous in time, for example because forecasting systems improve over time (Verkade et al., 2013).
QM is a relatively simple pre- and post-processing technique. Several previous medium-term meteorological and flow forecast studies have applied more sophisticated pre- and post-processing techniques, which are based on other kinds of relationships between forecasts and observations and on additional predictors. To ascribe realistic spatial and temporal patterns, the Schaake shuffle can be used (Clark et al., 2004). Wetterhall et al. (2012) state that the most appropriate method also depends on the objective of the study and that it is important to test different bias correction techniques before using one. It should be realized, however, that the application of all techniques is constrained by the short period of available forecast data and the inconsistent bias over this period.
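As an illustration of the technique discussed above, the following is a minimal sketch of empirical quantile mapping. It is a generic textbook variant, not necessarily the exact implementation used in this study. The function names (`quantile_map`, `_interp`) and the plotting positions i/(n-1) are assumptions made for the example. A forecast value is converted to its cumulative probability in the training forecasts and then replaced by the observed value at the same cumulative probability.

```python
from bisect import bisect_right


def _interp(x, xs, ys):
    """Piecewise-linear interpolation of y at x; clamps outside [xs[0], xs[-1]]."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)  # guarantees xs[i - 1] <= x < xs[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def quantile_map(value, train_forecasts, train_observations):
    """Map a forecast value onto the observed distribution (hypothetical helper).

    Both training samples need at least two values. Values outside the
    training range are clamped to the smallest or largest observation.
    """
    fc = sorted(train_forecasts)
    ob = sorted(train_observations)
    # Assumed plotting positions: 0 for the smallest value, 1 for the largest.
    p_fc = [i / (len(fc) - 1) for i in range(len(fc))]
    p_ob = [i / (len(ob) - 1) for i in range(len(ob))]
    p = _interp(value, fc, p_fc)   # empirical CDF of the forecasts
    return _interp(p, p_ob, ob)    # inverse empirical CDF of the observations


if __name__ == "__main__":
    # Toy training data in which forecasts are consistently twice the observations.
    forecasts = [2, 4, 6, 8, 10]
    observations = [1, 2, 3, 4, 5]
    print(quantile_map(7, forecasts, observations))
```

The correction is only as good as the consistency of the bias across the training sample. The clamping at the tails also shows why a short training period limits the correction of extreme values.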
5.3 Evaluation methodology
Evaluation of the ensemble flow forecasts is an important part of this study, but several reservations should be taken into account. First, discharge observations are treated as ‘truth’, but they also contain measurement and sampling errors (WMO, 2015). This is the direct effect of observation errors (see also section 5.1) on the evaluation of flow forecasts.
Second, the period of available meteorological forecast data is quite short (6 years). For this reason the defined thresholds for low and high flows are not very extreme, because a sufficient number of events is required to evaluate the results. Flows below Q75 and above Q25 are not necessarily very extreme events. If a longer period of data were available, the thresholds for low and high flows could be more extreme. In addition, a longer period of data would provide more confidence in the conclusions.
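The trade-off between threshold extremity and sample size can be sketched as follows. The sketch assumes the common hydrological exceedance convention, in which Qx is the discharge exceeded x% of the time, so that Q75 is a low-flow threshold and Q25 a high-flow threshold. That convention, the linear interpolation and the function names are assumptions made for this example.

```python
def exceedance_quantile(flows, exceed_pct):
    """Flow that is exceeded exceed_pct percent of the time (assumed convention)."""
    ordered = sorted(flows)
    # Position of the corresponding non-exceedance quantile in the sorted sample.
    pos = (100 - exceed_pct) * (len(ordered) - 1) / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def select_events(flows, low_pct=75, high_pct=25):
    """Return both thresholds and the indices of low-flow and high-flow days."""
    q_low = exceedance_quantile(flows, low_pct)
    q_high = exceedance_quantile(flows, high_pct)
    low_days = [i for i, q in enumerate(flows) if q < q_low]
    high_days = [i for i, q in enumerate(flows) if q > q_high]
    return q_low, q_high, low_days, high_days


if __name__ == "__main__":
    daily_flows = list(range(1, 102))  # synthetic record: 1, 2, ..., 101
    _, _, low, high = select_events(daily_flows, 75, 25)
    _, _, low_x, high_x = select_events(daily_flows, 95, 5)
    # More extreme thresholds leave fewer events to evaluate.
    print(len(low), len(high), len(low_x), len(high_x))
```

With a short record, moving from Q75/Q25 towards thresholds such as Q95/Q5 quickly reduces the number of events on which the scores are based.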
Third, several assumptions have been made to establish quantitative rules for classifying the processes that produce low flows and high flows. It has been assumed that the snow accumulation history before an event is embedded in the initial snowpack storage on the day before the event; if snow is involved, the event is classified as a snowmelt flood or a snow accumulation low flow. This is a strong simplification: a snowpack in the model does not necessarily mean a snowpack over the whole catchment, and snow accumulation is not necessarily the most important process in these situations (see also the examples described below). Snowmelt floods and rain-on-snow floods are treated as one category, because both processes are related to snowmelt and because snowmelt and rainfall are strongly related in the HBV model (both depend on temperature), so it is not possible to distinguish reliably between them. If no snowpack is present, it has been assumed that the low flow or high flow event is caused by low or high rainfall, respectively. To distinguish between short-rain floods and long-rain floods, a semi-arbitrary rainfall threshold of 10 mm is used. This threshold is a subjective element in the characterization (see section 3.9.1 for the motivation). When rainfall was below 10 mm and there
was no snowpack present on the day before the event, it has been assumed that the high flow must have been caused by long-rain. This has large consequences for the classification of high flows that are caused by a combination of processes. For example, snowmelt may have saturated the catchment, and once the snowpack has disappeared, a small amount of rainfall can cause high flows. Such an event will be classified as a long-rain flood because a small amount of rain is the direct driver of the flood, although the wet initial conditions are caused by snowmelt. Another example is a period of rain that causes wet conditions, followed by an extreme daily rainfall of at least 10 mm: the event will be classified as a short-rain flood, although the long-rain certainly plays a role in this flood. These kinds of combined processes often occur in practice and cause the highest floods, but they are difficult to classify properly. Another point is that only recent information (one day before the flow event) is used to classify the processes, although it has been assumed that the snow accumulation history is embedded in the snowpack storage from the HBV model. The lag time between precipitation peaks and discharge peaks is not always 1 day, as is assumed in the HBV model and the characterization rules. As a consequence, the high flow on the day after a high rainfall amount is, for example, classified as a short-rain high flow, while the discharge peak might come one day later and that day will be classified as a long-rain high flow (see for example Figure 6 and Figure 37). Using only recent information is an even larger simplification for low flow classification, because low flow processes usually develop over a longer term. These assumptions must be kept in mind when the evaluation results are interpreted.
Nevertheless, the classification is based on the data that are available and, although it has many limitations, it is considered appropriate for the objective of this study. Another option for examining the performance for different processes would be to use seasons. However, using these process categories instead of seasons provides more insight into the underlying processes; for example, a high flow event in spring is not necessarily generated by snowmelt (see also Figure 24).
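The classification rules discussed above can be summarized in a short sketch. It encodes only the decision logic described in this section. The function name, argument names and category labels are hypothetical, and the use of the rainfall amount on the day before the event is an assumption based on the 1-day lag described above.

```python
RAIN_THRESHOLD_MM = 10.0  # semi-arbitrary threshold between short-rain and long-rain floods


def classify_event(kind, snowpack_prev_day_mm, rainfall_prev_day_mm):
    """Classify a flow event from the conditions on the day before the event.

    kind: "high" or "low".
    snowpack_prev_day_mm: modelled snowpack storage on the previous day.
    rainfall_prev_day_mm: rainfall on the previous day (assumed 1-day lag).
    """
    if kind not in ("high", "low"):
        raise ValueError("kind must be 'high' or 'low'")
    if snowpack_prev_day_mm > 0:
        # Any modelled snowpack puts the event in the snow category.
        if kind == "high":
            return "snowmelt or rain-on-snow flood"
        return "snow accumulation low flow"
    if kind == "low":
        return "low-rainfall low flow"
    if rainfall_prev_day_mm >= RAIN_THRESHOLD_MM:
        return "short-rain flood"
    return "long-rain flood"


if __name__ == "__main__":
    # Catchment saturated by earlier snowmelt, snowpack gone, small rainfall:
    # the rules cannot see the snowmelt history and report a long-rain flood.
    print(classify_event("high", 0.0, 3.0))
    # Wet period followed by 12 mm in one day: reported as a short-rain flood.
    print(classify_event("high", 0.0, 12.0))
```

Because the rules look only one day back, the two usage cases reproduce the misclassification of combined processes described in the text.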
Fourth, the statistical significance of the evaluation scores has not been tested in this study. This involves the risk of type I errors, which means that it is concluded that a supposed effect exists when in fact it does not (Davis, 2002). Several statistical tests for the significance of evaluation scores have been published, such as tests for the non-uniformity of rank histograms (Hamill, 2001), confidence intervals around ROC curves (Fawcett, 2006; Wilks, 2006), the statistical significance of the AUC (Mason & Graham, 2002) and confidence intervals around reliability diagrams (Bröcker & Smith, 2007; Wilks, 2006). Including these tests would provide more confidence in the evaluation results, but it would require extending this study with a more statistical perspective.
In this study a combination of 5 evaluation scores is used to evaluate the different properties of forecast quality. Each of these scores has its own limitations, which are mentioned in section 3.7. The CRPSS of all flows generally seems to follow the pattern of the CRPSS of high flows, so the high flows probably have the largest influence on its value. Other scores could be used instead of or in addition to these, which might have led to different conclusions. Examples are the Brier (Skill) Score, percentile-based evaluation, Probability Integral Transform curves, discrimination diagrams, cost-loss functions and the error-spread score (a new score developed by Christensen et al. (2015)).
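To indicate what a significance test of one of the scores could look like, the following sketch estimates a p-value for the AUC with a label permutation test. This is a generic resampling approach chosen for illustration and is not the specific procedure of any of the references cited above. The function names, the number of permutations and the toy data are assumptions.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve as the fraction of (event, non-event) pairs
    in which the event received the higher forecast probability (ties count 0.5)."""
    events = [s for s, y in zip(scores, labels) if y == 1]
    non_events = [s for s, y in zip(scores, labels) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(events) * len(non_events))


def auc_permutation_pvalue(scores, labels, n_perm=2000, seed=0):
    """One-sided p-value for the null hypothesis of no discrimination."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    observed = auc(scores, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between forecasts and outcomes
        if auc(scores, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (at_least_as_good + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy forecast probabilities of exceeding a threshold, and observed outcomes.
    probs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1, 0.05]
    occurred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
    print(auc_permutation_pvalue(probs, occurred))
```

A small p-value indicates that an AUC this high would rarely arise if the forecasts had no discrimination. With only a few events, as in a 6-year record, such a test has limited power, and the serial dependence between consecutive days is ignored here.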