3 Long-term inflow forecast model
3.1 Method
3.1.1 General methodology
The forecast model is based on supervised pattern recognition, which relies on a priori knowledge about the classes into which an unknown pattern may be classified. Normally this information is given in the form of a training set X that consists of patterns whose correct classes are known. In supervised learning, this information is used to build a classifier that assigns an unknown pattern to one of the classes. For streamflow forecasting the following method is used:
1. The training set is generated. All the years in the data set are classified into the different wetness categories based on the discharge sum distribution of the forecast period.
2. Feature vectors describing the hydrological state of the basin on a forecasting day are constructed for each year. A feature vector consists of a combination of the measurements on ground water levels, soil moisture, snow water equivalents, frost, discharges, precipitation, NAO indices and water levels. Weather forecasts are not used.
3. A supervised learning algorithm is used to classify a forthcoming period into one of the constructed wetness classes based on its feature vector.
4. The discharge forecast is calculated. The forecast is based on the discharge series of the years that belong to the class into which the new pattern was classified.
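Step 1 can be illustrated with a minimal sketch. The source does not state the number of wetness categories or how their boundaries are drawn, so the equal-frequency split into three classes, the function name `classify_wetness` and the discharge sums below are all assumptions made only for illustration.

```python
def classify_wetness(discharge_sums, n_classes=3):
    """Assign each year a wetness class 0..n_classes-1 (0 = driest).

    discharge_sums: dict mapping year -> discharge sum of the forecast period.
    Equal-frequency classes are an assumption, not taken from the source.
    """
    years = sorted(discharge_sums, key=discharge_sums.get)  # driest year first
    n = len(years)
    return {year: min(rank * n_classes // n, n_classes - 1)
            for rank, year in enumerate(years)}


# Made-up discharge sums, for illustration only.
sums = {1990: 410.0, 1991: 520.0, 1992: 300.0,
        1993: 610.0, 1994: 450.0, 1995: 380.0}
classes = classify_wetness(sums)
```

The resulting year-to-class mapping plays the role of the known classes in the training set X.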
In principle, the approach chosen and the one used by Piechota et al. (1998) and Piechota and Dracup (1999) in categorical streamflow forecasting differ in two ways. Firstly, in step 3, Piechota et al. give the occurrence of streamflow in one of the categories in the form of probability. In the present study, it is only important whether the classification is correct or not. Secondly, Piechota et al. used features individually while approximating the occurrence probabilities of the forthcoming class and combined the results afterwards by using a linear combination of the probabilities. In this study, the feature vector consists of several variables simultaneously and features are equally weighted in classification.
Several algorithms are available in supervised pattern recognition. In this study, two algorithms are applied to classify new patterns into the constructed classes: the k-nearest neighbour rule and the minimum distance classifier. These classifiers were chosen because of their simplicity. Multi-parameter classifiers were not considered because of the restricted amount of data available. The Euclidean distance was used as a similarity measure in each of the case studies and all the data were standardized before the classification to avoid problems related to the different scales of the features.
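The standardization and the distance measure can be sketched as follows. The source does not specify how the data were standardized; scaling each feature to zero mean and unit variance with training-set statistics (population standard deviation) is an assumption, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def standardize(training, new_pattern):
    """Scale every feature to zero mean and unit variance.

    The means and standard deviations come from the training set only and
    are then applied to the new pattern as well (an assumed convention).
    """
    columns = list(zip(*training))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard constant features

    def scale(row):
        return [(v - m) / s for v, m, s in zip(row, means, stds)]

    return [scale(row) for row in training], scale(new_pattern)


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Without the scaling step, a feature measured in large units (for example discharge) would dominate the distance over a feature measured in small units (for example an NAO index).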
The k-nearest neighbour rule (k-NNR) is popular and probably the best known of the nonlinear classification algorithms. This algorithm is strongly dependent on the training set X and thus the training set should be large and represent all the classes. When the k-NNR is used, an unknown pattern is classified into the class to which most of its k nearest neighbours belong. The simplest case is the nearest neighbour rule (k = 1), in which an unknown pattern is classified into the class that contains the pattern most similar to the new object. Usually, an odd number is selected as k to avoid ties between the classes.
The algorithm can be presented as follows:
1) Choose the parameter k and the similarity measure.
2) Calculate the similarity between the new pattern and each of the patterns in the training set X.
3) Find the k patterns in the training set that are most similar to the new pattern and identify their classes.
4) Classify the new pattern into the class to which most of the k nearest training set patterns belong.
The limited amount of data sets an upper limit to the parameter value k. Three different values are tested: 1, 3 and 5. In the case of a tie, the new pattern is classified based on the nearest neighbour. It can be theoretically proven (e.g. Schalkoff, 1992) that, for an infinite training set, the classification error probability of the NN classifier is at most twice that of an optimal classifier. Thus, the NN classifier is not optimal but is often used, because it is practical and simple to execute.
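The four steps and the tie rule can be sketched as below. This is a minimal illustration, not the implementation used in the study: the function name `knn_classify` is hypothetical, the patterns are assumed to be standardized already, and the tie rule is read literally as "take the class of the single nearest neighbour".

```python
import math
from collections import Counter


def knn_classify(training, labels, new_pattern, k=3):
    """Classify new_pattern by majority vote among its k nearest neighbours."""
    # Step 2: Euclidean distance from the new pattern to each training pattern.
    distances = [math.sqrt(sum((a - b) ** 2 for a, b in zip(row, new_pattern)))
                 for row in training]
    # Step 3: indices of the k most similar training patterns, nearest first.
    nearest = sorted(range(len(training)), key=distances.__getitem__)[:k]
    # Step 4: majority vote over the classes of those patterns.
    ranked = Counter(labels[i] for i in nearest).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        # Tie between the leading classes: use the nearest neighbour's class.
        return labels[nearest[0]]
    return ranked[0][0]
```

With k = 1 the function reduces to the nearest neighbour rule. With three or more classes a tie is possible even for odd k, which is why the fallback is needed.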
The other classifier applied is based on statistical pattern recognition. By using the Bayes rule

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\,P(\omega_i)}{p(\mathbf{x})} \qquad \text{(3-1)}$$

the object $\mathbf{x}$ is classified into a class whose (posterior) probability $P(\omega_i \mid \mathbf{x})$ is largest.
By assuming (a priori) equiprobable classes with the same covariance matrices, the new pattern is classified into the class whose mean vector it resembles the most. This is a linear classifier called the minimum distance classifier (MDC). Instead of comparing the new pattern with every object in the training set, the comparisons are made only between the mean of each class and the new pattern. The Euclidean distance is used as a similarity measure.
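A minimal sketch of the minimum distance classifier follows. The function name `mdc_classify` is hypothetical and the patterns are assumed to be standardized already. As general background not stated in the source, nearest-mean classification with the Euclidean distance coincides with the Bayes rule when the class densities are Gaussian with equal priors and a shared covariance matrix proportional to the identity.

```python
import math
from collections import defaultdict


def mdc_classify(training, labels, new_pattern):
    """Assign new_pattern to the class whose mean vector is nearest."""
    groups = defaultdict(list)
    for row, label in zip(training, labels):
        groups[label].append(row)
    # Mean vector of each class, computed feature by feature.
    means = {label: [sum(col) / len(rows) for col in zip(*rows)]
             for label, rows in groups.items()}

    def distance_to_mean(label):
        return math.sqrt(sum((m - v) ** 2
                             for m, v in zip(means[label], new_pattern)))

    return min(means, key=distance_to_mean)
```

Only one distance per class is computed, compared with one per training pattern for the k-NNR.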
The real-time decisions about the operation of Lake Päijänne are based on the forecasts about the wetness category of the forthcoming inflow. However, to ease the release planning and to compare the accuracy of the model with other models, daily inflow forecasts and mean forecast of the accumulated inflow are needed. The mean forecast of the accumulated inflow is based on the inflow time-series of the training set. When a pattern is classified into a class i, the daily inflow forecast $f_t$ is calculated by using the average

$$f_t = \frac{1}{n} \sum_{j \in \omega_i} q_{j,t} \qquad \text{(3-2)}$$

where t stands for the date and j for the patterns (years) in the training set. In Equation 3-2, n is the number of the patterns in the class i and $q_{j,t}$ is the observed daily inflow.
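Equation 3-2 can be sketched in a few lines. The function name, the dictionary layout and the inflow values used in the note below are hypothetical and serve only to show the averaging.

```python
def daily_inflow_forecast(inflows, labels, predicted_class):
    """Average the observed daily inflows of the years in the predicted class.

    inflows: dict year -> list of daily inflows q[j][t] over the forecast period.
    labels:  dict year -> class of that year in the training set.
    Returns the list of daily forecasts f_t, one value per date t.
    """
    members = [inflows[year] for year in inflows
               if labels[year] == predicted_class]
    n = len(members)  # number of patterns in the predicted class
    return [sum(day_values) / n for day_values in zip(*members)]
```

For instance, if the predicted class contains two years with daily inflows [20, 22, 24] and [30, 30, 30], the daily forecast is [25, 26, 27] and the mean forecast of the accumulated inflow is their sum, 78.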
As in the current study, Grantz et al. (2005), for example, used the k-nearest neighbour rule to find the years in the historical records whose characteristics most closely resemble those of the forecast year. Their final long-term forecasts were based, however, on the locally weighted polynomials of the streamflows of the nearest neighbours and thus the simplicity of the model was lost. As the final forecast is now based on each of the observations in the chosen class and weighting is not used, parameter calibration is not needed and the model remains simple.
As a consequence, however, the new method has two obvious weaknesses. Firstly, the forecasts of the accumulated streamflow given by the model never exceed the largest observation and are never lower than the smallest observation. Therefore, the forecast errors concerning very wet and very dry years may be relatively large even if the forecast period has been classified correctly. Secondly, the theoretical confidence limits of the method are not estimated. The classification error probabilities are estimated, but their conversion into the confidence limits of the accumulated discharge is not straightforward. Empirical confidence limits based on the validation can be estimated, however.
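The first weakness follows directly from Equation 3-2. As a short worked step, write $Q_j$ for the accumulated inflow of training year j over the forecast period (a symbol introduced here only for illustration). Summing the daily forecasts over the period gives the mean of the $Q_j$ in the chosen class, and a mean always lies between the smallest and the largest of the values averaged:

```latex
Q_j = \sum_{t} q_{j,t}, \qquad
\sum_{t} f_t = \frac{1}{n} \sum_{j \in \omega_i} Q_j, \qquad
\min_{j \in \omega_i} Q_j \;\le\; \sum_{t} f_t \;\le\; \max_{j \in \omega_i} Q_j .
```

The bound holds within the chosen class, so it also holds with respect to the whole set of observations.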