Learning Uncertainty Models from Weather Forecast Performance Databases Using Quantile Regression

Ashkan Zarnani

Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada

Petr Musilek

Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada

ABSTRACT

Forecast uncertainty information is not available in the immediate output of numerical weather prediction (NWP) models. Such important information is required for optimal decision making processes in many domains. Prediction intervals are a prominent form of reporting the forecast uncertainty. In this paper, a series of learning methods is investigated to obtain prediction interval models by a statistical post-processing procedure involving the historical performance of an NWP system. The article investigates the application of a number of different quantile regression algorithms, including kernel quantile regression, to compute prediction intervals for target weather attributes. These quantile regression methods, along with a recently proposed fuzzy clustering-based distribution fitting model, are practically benchmarked in a set of experiments involving a three-year database of hourly NWP forecast and observation records. The role of different feature sets and parameters in the models is studied as well. The forecast skills of the obtained prediction intervals are evaluated not only by means of classical cross-validation experiments, but also subject to a new sampling variation process to assess the uncertainty of skill score measurements. The results also show how the different methods compare in terms of various quality aspects of prediction interval forecasts, such as sharpness and reliability.

Keywords

Prediction interval, quantile regression, numerical weather forecast, uncertainty modeling

1. INTRODUCTION

The degree of uncertainty in the forecasts of a Numerical Weather Prediction (NWP) model can potentially have an enormous impact on the decisions that are made based on these forecasts. Wind power production and marketing [22], Dynamic Thermal Rating (DTR) systems used in power delivery networks [11] and extreme weather event hazard systems are some applications in which the forecast uncertainty is regarded as being as significant as the expected forecast value itself.

An NWP model simulates various atmospheric phenomena by means of deterministic physical processes which provide real-valued outputs for expected weather attributes during a forecast horizon in a three-dimensional spatial grid [28]. Hence, the system would not provide any information about the uncertainty of its forecasts. However, there is always some level of error associated with these forecasts and the degree of this inaccuracy is known to be variable for different predictions [15]. Imprecision of initial conditions, parameterization of sub-grid scale processes, and various simplifying assumptions incorporated in the NWP system are regarded as some major reasons for forecast inaccuracies [20]. Although the raw outputs of an NWP system provided as point predictions can be easily understood and evaluated, the probabilistic nature of the forecasts (which can represent the prediction uncertainty) is dismissed. Prediction Intervals (PI) are a dominant form of forecast uncertainty communication. They are defined as a value interval accompanied by a confidence level for actual observations to be inside this interval (e.g. T = [-3°C, 10°C], conf = 95%) [6][9][23].

There is a large body of literature on obtaining such uncertainty information from NWP models using ensemble forecasting systems [7][25][19]. However, ensemble predictions may incur large computational costs, making them infeasible in some cases. Additionally, the ready availability of historical performance databases for many existing forecasting systems has made post-processing approaches to uncertainty modeling an increasingly attractive topic in forecast uncertainty research [4][18][24][23][1]. Many different statistical models have been developed to learn a forecast uncertainty model from a historical system accuracy database. Error distribution fitting and clustering methods have been studied recently as major methods in this regard [23][33]. These rely on the known fact that different forecast situations typically exhibit different levels of forecast uncertainty, and that such patterns can potentially be found from the system record [15]. Hence, the forecasts of the system are first clustered into similar groups using related attributes regarded as influential variables. Next, the historical error distribution of each cluster is modeled by fitting either a parametric (e.g. Gaussian) or a non-parametric (e.g. empirical) distribution. The desired quantiles of a new forecast are then calculated from the fitted error distribution of the cluster to which the new forecast case belongs. Hence, these approaches can provide the uncertainty information of a forecast in the form of a full probability distribution, which can then be used to obtain prediction intervals with any desired level of confidence.

In quantile regression based methods, on the other hand, each individual quantile is modeled independently and there are no assumptions on the distribution of the forecast error [18][32]. These methods learn the direct relationship between the target quantile and the set of available influential attributes through an optimization process. Various quantile regression methods have been proposed and applied to forecast uncertainty modeling. Bremnes [3] proposed the application of local quantile regression to obtain non-linear models of quantiles for wind power forecasts.

In another study, Nielsen et al. [18] applied an additive quantile model by using spline basis functions. In both works, the resulting prediction intervals were evaluated in terms of their inter-quantile range and actual observation frequencies, as compared to the forecasted quantile. Yet, the skill of the prediction interval forecasting system was not evaluated in an objective framework. A few statistical models including local quantile regression were compared in a study by Bremnes [4] as quantile forecasting models for wind power from NWP outputs. Sharpness and reliability measures (not the forecast skill) are evaluated for different setups of these models. Pinson and Kariniotakis [23] demonstrated a novel fuzzy inference model based on grouping of forecasts and adapted resampling for distribution fit. A detailed comparative study of this method and one quantile regression-based method was provided in Pinson et al. [24]. An improved version of this approach using fuzzy clustering and error distribution fitting was introduced by Zarnani et al. [33][34]. Additionally in the domain of statistical methods, time-adaptive kernel density estimation methods were proposed by Bessa et al. [1]. In this paper, we aim for a comprehensive comparative study between some major quantile regression methods and distribution fitting methods. In particular, the relatively modern kernel quantile regression method [16] is also investigated in the context of weather forecast prediction interval modeling in this study for the first time. This is performed using a large real-world data set and with a focus on forecast skill as a significant measure for essential conclusive comparisons between prediction interval forecasting systems. This approach offers a good foundation to investigate the role of different parameters involved in these methods as well.

Due to the limited availability of test samples, such skill score measurements are subject to sampling variations. Therefore, it is crucial to assess whether the observed skill is due to chance or if it is a true attribute of the forecasting system. Jolliffe and Stephenson [12] point out: “It has been unusual in weather forecast verification studies for any attempt to be made to assess this sampling uncertainty, although without some such attempt it is not possible to be sure those apparent differences in skill are real and not just due to random fluctuations”. By decomposition and statistical analysis of the skill score, we can consider such variations in the evaluations, hence offering a more reliable and fair comparison for the user.

In Section 2, we describe the basics of prediction intervals and the distributional models. Various quantile regression methods are explained in Section 3. The basic quality measures and the evaluation framework of prediction interval forecasts are shown in Section 4. Section 5 provides experimental results and analysis of the quality of prediction intervals obtained using different methods and parameter setups. Finally, the paper concludes with summarizing remarks and future directions in Section 6.

2. PREDICTION INTERVALS: BASICS AND SOME MODELS

2.1 Prediction Intervals

Due to the random nature of forecast error, a forecast can naturally be represented by a full probability distribution, denoted $\hat{f}_t(y \mid \mathbf{x})$, for the target attribute y at time t. Note that this distribution is conditional on x, which represents the information available at the time of forecast. The uncertainty information of the forecast is chiefly represented by the spread of this distribution, with more uncertain predictions exhibiting a wider spread. Any desired quantile ($\hat{q}_t^{(\alpha)}$) can then be obtained from this distribution [9]:

(1) $\hat{q}_t^{(\alpha)} = \hat{F}_t^{-1}(\alpha), \quad \alpha \in [0, 1],$

where $\hat{F}_t$ is the Cumulative Distribution Function (CDF) of $\hat{f}_t$. A $(1-\beta)$-confidence level prediction interval ($\hat{I}_t^{(\beta)}$) is then defined as a pair of quantiles that represent the range $[\hat{q}_t^{(\alpha_l)}, \hat{q}_t^{(\alpha_u)}]$, where $\alpha_l = \beta/2$ and $\alpha_u = 1 - \beta/2$ [24]. The choice of $\alpha$ depends on the characteristics of the optimization procedure utilizing the forecasts. For instance, some strategies are introduced in [3][22] for selection of wind power quantiles for placing optimal bids in the power market.
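As a minimal sketch of this definition, the following Python snippet derives a prediction interval from an empirical sample of the target, assuming the lower and upper quantile levels are β/2 and 1 − β/2. The function names and the linear-interpolation quantile rule are illustrative assumptions, not part of the original method.

```python
import math

def empirical_quantile(samples, alpha):
    """Return the alpha-quantile of samples using linear interpolation."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    xs = sorted(samples)
    pos = alpha * (len(xs) - 1)          # fractional index into the sorted sample
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] * (1 - frac) + xs[hi] * frac

def prediction_interval(samples, beta):
    """Return the (1 - beta)-confidence interval as a pair of quantiles."""
    alpha_l = beta / 2
    alpha_u = 1 - beta / 2
    return empirical_quantile(samples, alpha_l), empirical_quantile(samples, alpha_u)

# Hypothetical sample of the target: the integers 0..100.
samples = list(range(101))
lower, upper = prediction_interval(samples, beta=0.10)
print(lower, upper)
```

With β = 0.10 the interval is bounded by the 5% and 95% quantiles of the sample, so a narrower sample spread directly yields a narrower interval.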

2.2 Fuzzy Clustering and Forecast Error Fit

It is well known that different forecast situations sustain various levels of error [15]. Once a grouping of past forecasts is defined, the error distribution of each group can be estimated using the historical records of that specific group. These distributions can then be used in a second stage to calculate any desired quantile and prediction interval. Pinson and Kariniotakis [23] used two forecast attributes to manually group forecast records into four fuzzy sets, and used resampling to estimate the empirical error distribution of a new forecast. Results from a wind power forecasting data set show that this method yields sharp and reliable prediction intervals [23].

To obtain groupings of forecasts where forecast attributes have similar values inside a group and rather dissimilar values when compared to the other groups, data clustering algorithms are applied in [33]. This automated approach can handle a larger number of influential variables. Once the clusters are obtained, any density estimation approach (parametric or non-parametric) can be used in the second phase. In crisp clustering algorithms, each sample forecast is assigned to exactly one cluster. Therefore, the membership of a forecast case in each cluster is characterized by a binary value. In a more natural approach proposed in [23], the forecast cases can be associated with multiple clusters (weather situations) with different levels of membership supported by fuzzy sets. Such partial membership of samples in forecast groups can possibly improve the modeling of forecast situations. Many weather conditions, such as transitional phases of weather, can be better explained using this approach. Hence, methods for modeling forecast uncertainties using fuzzy clustering are proposed in [33]. Results from this study confirmed the high skill of prediction interval forecasts obtained from error density estimation using kernel density smoothing on fuzzy clustering outputs. An advantage of all methods described in this section is their ability to provide the full probability distribution of the target forecast using a single model. Algorithmic and analytical details of clustering-based prediction interval models are presented in [33] and [34].
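The two-stage idea can be sketched as follows for the simpler crisp case, where each forecast belongs to exactly one group. This is only an illustration under stated assumptions: the error is taken as observation minus forecast, the grouping function and the "calm"/"windy" regimes are hypothetical, and an empirical distribution stands in for whichever density estimator is preferred.

```python
from collections import defaultdict

def quantile(sorted_xs, alpha):
    """Empirical alpha-quantile of a sorted list with linear interpolation."""
    pos = alpha * (len(sorted_xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (pos - lo) * (sorted_xs[hi] - sorted_xs[lo])

def fit_group_errors(records, group_of):
    """Collect the historical errors (observation - forecast) of each group."""
    errors = defaultdict(list)
    for forecast, observation, features in records:
        errors[group_of(features)].append(observation - forecast)
    return {group: sorted(errs) for group, errs in errors.items()}

def interval_for(forecast, features, group_errors, group_of, beta):
    """Shift a new point forecast by the error quantiles of its group."""
    errs = group_errors[group_of(features)]
    return (forecast + quantile(errs, beta / 2),
            forecast + quantile(errs, 1 - beta / 2))

# Hypothetical history: errors are small in "calm" and large in "windy" cases.
records = []
for k in range(21):
    records.append((10.0, 10.0 + (k - 10) / 10, {"regime": "calm"}))
    records.append((10.0, 10.0 + (k - 10) / 2, {"regime": "windy"}))

def by_regime(features):
    return features["regime"]

group_errors = fit_group_errors(records, by_regime)
calm = interval_for(0.0, {"regime": "calm"}, group_errors, by_regime, beta=0.1)
windy = interval_for(0.0, {"regime": "windy"}, group_errors, by_regime, beta=0.1)
print(calm, windy)
```

The same point forecast receives a wider interval in the group whose historical errors were larger, which is the behaviour the grouping is meant to capture.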

3. QUANTILE REGRESSION METHODS

3.1 Linear and Non-Linear Quantile Regression

In linear regression tasks, a target variable is estimated by a linear combination of a set of related features. The unknown coefficients of this linear equation are tuned by an optimization process using an objective function (e.g. squared error in real value regression). The same approach can be used to find a linear relationship between a set of features and a specific “quantile” of a target variable. The $\alpha$-quantile of the target variable y, denoted as $\hat{q}^{(\alpha)}$, is formulated as [13][17]:

(2) $\hat{q}^{(\alpha)} = \theta_0^{(\alpha)} + \theta_1^{(\alpha)} x_1 + \theta_2^{(\alpha)} x_2 + \dots + \theta_d^{(\alpha)} x_d$

where $x_j$, $j = 1..d$, are the d influential variables for modeling the $\alpha$-quantile of y, and $\theta_j^{(\alpha)}$, $j = 0..d$, constitute the $\theta^{(\alpha)}$ vector of coefficients for the target $\alpha$-quantile. This vector is estimated using the following optimization objective [13][32]:

(3) $\theta^{(\alpha)} = \arg\min_{\theta} \sum_{i=1}^{N} \rho_\alpha\left(y_i - (\theta_0 + \theta_1 x_{i1} + \dots + \theta_d x_{id})\right)$

where $i = 1..N$ indexes the N recorded pairs $(y_i, \mathbf{x}_i)$ in the data set, and $\rho_\alpha$ is the loss function of an $\alpha$-quantile target defined as:

(4) $\rho_\alpha(r_i) = \begin{cases} \alpha\, r_i & \text{if } r_i \ge 0 \\ (\alpha - 1)\, r_i & \text{if } r_i < 0 \end{cases}$ and (5) $r_i = y_i - \hat{q}_i^{(\alpha)}$

The optimization task formulated in (3) is then solved using linear programming techniques [13][17]. In order to obtain the prediction interval $\hat{I}_{new}^{(\beta)}$, the quantiles of the target error, $\hat{q}_e^{(\alpha_l)}$ and $\hat{q}_e^{(\alpha_u)}$ (with $\alpha_l = \beta/2$ and $\alpha_u = 1 - \beta/2$), are separately modeled by linear quantile regression using the data set of $(e_i, \mathbf{x}_i)$, where $e_i$ represents the recorded error of forecast case i and $\mathbf{x}_i$ is the vector of explanatory influential variables. This yields the optimizer vectors $\theta_e^{(\alpha_l)}$ and $\theta_e^{(\alpha_u)}$, which are then used to compute the $(1-\beta)$-confidence level prediction interval of target y for any new forecast $\mathbf{x}_{new}$:

(6) $\hat{q}_y^{(\alpha_l)} = \langle \theta_e^{(\alpha_l)}, \mathbf{x}_{new} \rangle + \hat{y}_{new}, \quad \hat{q}_y^{(\alpha_u)} = \langle \theta_e^{(\alpha_u)}, \mathbf{x}_{new} \rangle + \hat{y}_{new},$

where $\langle \cdot, \cdot \rangle$ represents the dot product of the two vectors and $\hat{y}_{new}$ is the point forecast to which the modeled error quantiles are added. Note that an entry of 1 should be added as the leftmost element of $\mathbf{x}_{new}$ to be multiplied by the $\theta_0$ term as the intercept. As opposed to the methods described in Section 2, quantile regression requires no distribution fitting process in further steps. This also means that a new model has to be trained for any new quantile of interest.
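To make the role of the loss function in (4) concrete, the sketch below evaluates the quantile (pinball) loss and checks, on a hypothetical sample, that the constant minimizing the average loss is an empirical α-quantile. It covers only the intercept-only special case of objective (3), searched by brute force rather than by the linear programming solver mentioned above; all names are illustrative.

```python
def pinball_loss(residual, alpha):
    """Quantile loss for one residual r = y - q."""
    if residual >= 0:
        return alpha * residual          # under-prediction is weighted by alpha
    return (alpha - 1) * residual        # over-prediction is weighted by (1 - alpha)

def mean_pinball(y, q, alpha):
    """Average quantile loss of a constant quantile estimate q."""
    return sum(pinball_loss(yi - q, alpha) for yi in y) / len(y)

# Hypothetical target sample: 1, 2, ..., 100.
y = [float(v) for v in range(1, 101)]
alpha = 0.9

# Intercept-only model: search the sample values for the best constant.
best_q = min(y, key=lambda q: mean_pinball(y, q, alpha))
print(best_q)
```

Because the loss penalizes the two sides asymmetrically, the minimizer moves away from the median towards the requested quantile, here into the range between the 90th and 91st ordered values.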

The model optimized in $\theta$ describes a linear relationship between the error quantile and the influential features in x. However, by using a non-linear transformation basis function $\phi(\mathbf{x})$ to derive new features from the currently available features, one can in effect learn non-linear relationships using the same formulation in (3). For instance, to learn nth-degree polynomial functions to model quantiles, one can extend the features in x by adding $\phi(\mathbf{x}) = \mathbf{x}^2, \mathbf{x}^3, \dots, \mathbf{x}^n$ and then perform the same process of optimizing the linear relationship between the new feature set and the target quantile. More complex forms of non-linear relationships can also be represented and optimized by using other basis functions such as $\sqrt{\mathbf{x}}$, $\sin(\mathbf{x})$, etc. We use NLQR to refer to the prediction interval modeling methods that use these transformed features, while LQR refers to the methods that use the original vector x.
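A feature expansion of this kind could look like the following sketch, which appends element-wise powers and optional extra basis functions to a raw feature vector. The function name, its parameters, and the choice of extra basis functions are hypothetical.

```python
import math

def expand_features(x, degree=3, extra=(math.sin,)):
    """Append powers 2..degree and extra basis functions of each raw feature."""
    expanded = list(x)                        # keep the original (linear) features
    for p in range(2, degree + 1):
        expanded.extend(v ** p for v in x)    # polynomial terms x^2, ..., x^degree
    for fn in extra:
        expanded.extend(fn(v) for v in x)     # e.g. sin(x)
    return expanded

row = expand_features([2.0, 3.0], degree=3, extra=())
print(row)
```

The expanded row can then be passed to the same linear quantile regression solver, since the model remains linear in the new features.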

3.2 Quantile Regression with Spline-basis Functions

Additive quantile models are another technique, used by Nielsen et al. [18], to learn non-linear models of weather forecast quantiles. Spline basis functions are the most frequently used basis functions [1][17]. Since the relationships between forecast values and forecast error are expected to be non-linear, a spline basis can provide a suitable basis transformation. This relationship can then be approximated by a linear combination of basis functions of the influential attributes [10]:

(7) $\hat{q}^{(\alpha)} = \theta_0 + \sum_{j=1}^{d} \sum_{k=1}^{df_j} \theta_{jk}\, b_{jk}(x_j),$

where $b_{jk}$ is the cubic B-spline basis function used for feature j using $df_j$ degrees of freedom. Note that, assuming a constant number of degrees of freedom df for all basis functions, there will be $d \times df$ features in the final model to be optimized by the linear optimization task formulated in equation (3). The appropriate value for this parameter has to be determined based on experiments involving the training data set.

3.3 Local Quantile Regression

In local quantile regression (LocQR), no effort is made to learn complex non-linear models for a quantile. Instead, it is assumed that, in the close neighborhood of a given x, the relationship between x and the target quantile is simple enough to be modeled linearly. Naturally, data points that are closer to x should have more impact on this linear model than those further away from it. This task is formulated as the following optimization problem [31]:

(8) $\theta^{(\alpha)}(\mathbf{x}) = \arg\min_{\theta} \sum_{i=1}^{N} \rho_\alpha\left(y_i - \langle \theta, \mathbf{x}_i - \mathbf{x} \rangle\right) w(\mathbf{x}_i, \mathbf{x})$

where $\theta^{(\alpha)}(\mathbf{x})$ is determined for input x by considering a set of train samples that are centered around x and weighted using [3]:

(9) $w(\mathbf{x}_i, \mathbf{x}) = \begin{cases} \left(1 - \left(\dfrac{\|\mathbf{x}_i - \mathbf{x}\|}{d_\lambda(\mathbf{x})}\right)^3\right)^3 & \text{if } \|\mathbf{x}_i - \mathbf{x}\| < d_\lambda(\mathbf{x}) \\ 0 & \text{otherwise} \end{cases}$

where $d_\lambda(\mathbf{x})$ is the distance from x to its $(\lambda N)$th nearest neighbour among the training samples $\mathbf{x}_{1..N}$. Based on this definition, $(1-\lambda)N$ data samples have zero weight and hence no impact on the quantiles optimized at point x.

Note that using LocQR, two new models have to be optimized for each new forecast (x) to compute upper and lower quantiles of the prediction interval for that specific forecast. This is in contrast with the scenarios of applying other regression methods described above, in which these models are learned only once and then utilized to provide prediction intervals for any future forecasts.
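The neighbourhood weighting used by LocQR can be sketched as below, assuming a tricube weight function and a bandwidth equal to the distance to the (λN)-th nearest training sample. The function name, the rounding of λN, and the toy data are assumptions made for illustration.

```python
import math

def tricube_weights(train_x, x, lam):
    """Weight each training point by its distance to x (zero outside the neighbourhood)."""
    n = len(train_x)
    dists = [math.dist(xi, x) for xi in train_x]
    k = max(1, math.ceil(lam * n))         # rank of the neighbour that sets the bandwidth
    d_lambda = sorted(dists)[k - 1]        # distance to the k-th nearest neighbour
    weights = []
    for d in dists:
        if d < d_lambda:
            weights.append((1 - (d / d_lambda) ** 3) ** 3)
        else:
            weights.append(0.0)            # points at or beyond the bandwidth are ignored
    return weights

# Hypothetical one-dimensional training inputs 0, 1, ..., 9 and a query at 0.
train_x = [(float(i),) for i in range(10)]
weights = tricube_weights(train_x, (0.0,), lam=0.5)
print(weights)
```

Only the nonzero-weight samples enter the weighted quantile regression solved for the query point, which is why a separate pair of models is needed for every new forecast.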

3.4 Kernel Quantile Regression

To learn arbitrarily complex non-linear models, the optimization process can be performed in reproducing kernel Hilbert spaces, leading to kernel quantile regression (KQR) [16]:

(10) $\theta^{(\alpha)} = \arg\min_{\theta}\; C \sum_{i=1}^{N} \rho_\alpha\left(y_i - \langle \theta, \mathbf{x}_i \rangle\right) + \frac{1}{2}\|\theta\|^2,$

where the last term is the regularizer, which penalizes more complex models represented by $\theta$, and C is the cost factor, which balances the total loss against this penalization. This also allows obtaining a dual form of the optimization task using Lagrange multipliers [27]. Since the dual form only uses the dot products of the input vectors, we need only consider the kernel function (k), which provides an inherent $\phi$-mapping of inputs into a new feature space:

(11) $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$

where the resulting kernel matrix K is positive semidefinite. A common choice for the kernel function is the Gaussian kernel:

(12) $k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma^2\right)$

where $\sigma > 0$ is a tunable parameter.
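For illustration, a Gaussian kernel matrix can be computed as in the sketch below, assuming the common parameterization exp(−‖a − b‖² / (2σ²)). The function names and the sample points are hypothetical, and the quantile regression dual itself is not solved here.

```python
import math

def gaussian_kernel(a, b, sigma):
    """Gaussian (RBF) kernel between two feature vectors."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2 * sigma ** 2))

def kernel_matrix(xs, sigma):
    """N x N matrix of pairwise kernel values."""
    return [[gaussian_kernel(a, b, sigma) for b in xs] for a in xs]

# Three hypothetical two-dimensional inputs.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = kernel_matrix(xs, sigma=1.0)
print(K)
```

The matrix grows quadratically with the number of training samples, which is the practical reason a large data set calls for some form of matrix approximation.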

4. EVALUATION OF PREDICTION INTERVAL FORECASTS

4.1 Basic Verification Measures

Due to the probabilistic nature of prediction interval forecasts, evaluation of such forecasts is a more complex process compared to point predictions [29]. Many evaluation methods have been developed that consider the full probability distribution of forecasts as opposed to a prediction interval of interest [26]. However, there are some basic quality measures for prediction interval forecasts that are widely utilized in this field. They address two major properties: reliability and sharpness [4][24][33]. The reliability measure gauges the ability of the forecasting system to provide prediction intervals that practically adhere to their associated confidence level in test scenarios. Hence, this measure is evaluated by the fraction of the T test samples for which the actual observation falls inside the prediction interval [23]:

(13) $Rel = \bar{\xi}$, where $\bar{\xi} = \frac{1}{T} \sum_{i=1}^{T} \xi_i$ and $\xi_i = \begin{cases} 1 & \text{if } \hat{q}_i^{(\alpha_l)} \le y_i \le \hat{q}_i^{(\alpha_u)} \\ 0 & \text{otherwise} \end{cases}$

The sharpness measure, on the other hand, is gauged by the average width of prediction intervals output by a forecasting system. This measure demonstrates the ability of the system to make predictions with lower uncertainty [18][23]:

(14) $Shp = \overline{Width} = \frac{1}{T} \sum_{i=1}^{T} \left(\hat{q}_i^{(\alpha_u)} - \hat{q}_i^{(\alpha_l)}\right)$

Additionally, a forecasting system that can better distinguish low vs. high uncertainty forecasts can be identified by its resolution, which is defined as the standard deviation, rather than the mean, of the widths of the output prediction intervals.
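The three measures can be computed together, as in this small sketch. The function name and the toy intervals are hypothetical, and the population standard deviation is assumed for the resolution since the text does not say which variant is used.

```python
import statistics

def verify_intervals(observations, lowers, uppers):
    """Return reliability, sharpness and resolution of a set of prediction intervals."""
    hits = [lo <= y <= up for y, lo, up in zip(observations, lowers, uppers)]
    widths = [up - lo for lo, up in zip(lowers, uppers)]
    return {
        "reliability": sum(hits) / len(hits),        # fraction of observations inside
        "sharpness": statistics.fmean(widths),       # mean interval width
        "resolution": statistics.pstdev(widths),     # spread of interval widths
    }

# Four hypothetical test cases.
measures = verify_intervals(
    observations=[1.0, 5.0, 12.0, 3.0],
    lowers=[0.0, 0.0, 0.0, 0.0],
    uppers=[2.0, 4.0, 10.0, 6.0],
)
print(measures)
```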

4.2 Skill of a Prediction Interval Forecaster

To compare different prediction interval forecasting systems, a simple RMSE would be misleading, as it only considers the center of the interval and ignores the interval boundaries (forecast uncertainty magnitudes). The following skill score, on the other hand, penalizes more uncertain forecasts (wide intervals) and also penalizes a miss by a magnitude equal to the distance of the missed observation from the interval boundary. Hence, for a straightforward objective comparison between different prediction interval forecasting systems, the following skill score is utilized [3][18][30][24]:

(15) $SScore = -\frac{1}{T} \sum_{i=1}^{T} \left[ \left(\xi_i^{(\alpha_l)} - \alpha_l\right)\left(y_i - \hat{q}_i^{(\alpha_l)}\right) + \left(\xi_i^{(\alpha_u)} - \alpha_u\right)\left(y_i - \hat{q}_i^{(\alpha_u)}\right) \right]$

where $\xi_i^{(\alpha)}$ is equal to one if $y_i \le \hat{q}_i^{(\alpha)}$, and zero otherwise. Note that a minus sign is added to the measure to make it negatively oriented (i.e. smaller values are preferred, as in error measures). When considering a single quantile, this score is equivalent (up to sign) to the loss function used in the quantile regression objective. The skill score as defined above is proven to be “strictly proper” [8], which is a key attribute for fair judgment between different forecasters [5]. A strictly proper score would give the maximum score to a forecast that is actually the true belief of the forecaster and hence cannot be “hedged” [5]. This means that only a prediction interval that follows the true distribution of the target will obtain the maximum score. Details of mathematical definitions and proofs can be found in [8]. Another critical aspect of accurate prediction interval evaluation is the role of sampling variations, which has been mostly ignored in weather forecast verification studies [12]. Due to the limited availability of test cases, the uncertainty of skill score measurements will also be taken into consideration in this study. For this purpose, the skill score (15) is transformed into the following equation by considering three possible scenarios: two for a miss (either left or right of the prediction interval) and one for a hit [33]:

(16) $SScore = \alpha_l\, \overline{Width} + \bar{\delta} = \sum_{j=1}^{K} \frac{n_j}{T} \left( \alpha_l\, \overline{Width}_j + \bar{\delta}_j \right)$

where $\alpha_l = \beta/2$ and $\bar{\delta}$ is the mean of the $\delta_i$ values for test cases i = 1..T. For test case i, $\delta_i$ is equal to zero for a hit, and to $|r_i|$ (the distance of the observation from the prediction interval boundary) for a miss. The first part of the above equation demonstrates the fact that the skill score is essentially a weighted sum of two major aspects of prediction interval quality, namely sharpness and reliability. However, reliability is measured by $\bar{\delta}$ rather than $Rel$ as in (13). The widths of test prediction intervals are not subject to sampling variations, yet the delta values are measured using a limited number of test samples.
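The equivalence between the quantile-based form of the score and its width-plus-miss-distance decomposition can be checked numerically. The sketch below assumes the score is averaged over the test cases and that the indicator equals one when the observation lies at or below the quantile; the function names and the three test cases are hypothetical.

```python
def skill_score(observations, lowers, uppers, beta):
    """Negatively oriented interval skill score built from the two quantile terms."""
    alpha_l, alpha_u = beta / 2, 1 - beta / 2
    total = 0.0
    for y, q_l, q_u in zip(observations, lowers, uppers):
        xi_l = 1.0 if y <= q_l else 0.0    # observation at or below the lower quantile
        xi_u = 1.0 if y <= q_u else 0.0    # observation at or below the upper quantile
        total += (xi_l - alpha_l) * (y - q_l) + (xi_u - alpha_u) * (y - q_u)
    return -total / len(observations)

def decomposed_score(observations, lowers, uppers, beta):
    """Same score written as (beta / 2) * mean width + mean miss distance."""
    n = len(observations)
    mean_width = sum(u - l for l, u in zip(lowers, uppers)) / n
    mean_delta = sum(max(l - y, y - u, 0.0)
                     for y, l, u in zip(observations, lowers, uppers)) / n
    return beta / 2 * mean_width + mean_delta

# One hit, one miss to the left, one miss to the right of a [0, 10] interval.
obs, low, up = [5.0, -2.0, 14.0], [0.0, 0.0, 0.0], [10.0, 10.0, 10.0]
direct = skill_score(obs, low, up, beta=0.1)
decomposed = decomposed_score(obs, low, up, beta=0.1)
print(direct, decomposed)
```

Both forms give the same value on this example, which illustrates why the sampling uncertainty of the score can be attributed to the mean miss distance alone.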

In the case of prediction interval models based on clustering (cf. Subsection 2.2), each cluster (j = 1..K) is independently evaluated using the $n_j$ test cases that actually belong only to that cluster. Finally, a weighted average of the K skill scores yields the method’s overall score. It is plausible that the sample statistic $\bar{\delta}_1$ measured in a cluster with fewer test cases $n_1$ has higher uncertainty than the same statistic $\bar{\delta}_2$ measured in a cluster with more test cases $n_2$. Thus, to get a $\gamma$-confidence bound on the skill score, the sampling distribution of $\bar{\delta}_j$ is non-parametrically estimated by bootstrapping for each j. The $\gamma$-quantile of this distribution is then stored in $\bar{\delta}_j^{\gamma}$ and later applied in equation (16) to obtain the $\gamma$-confidence bound over the skill score, denoted as $SScore^{\gamma}$:

(17) $\Pr\left(\bar{\delta}_j \le \bar{\delta}_j^{\gamma}\right) = \gamma,\; j = 1..K \;\xrightarrow{\text{in (16)}}\; \Pr\left(SScore \le SScore^{\gamma}\right) = \gamma$

In the case of quantile regression models, such grouping of test cases is not available by default. Therefore, the test cases are clustered using the same features employed in the regression model. In this way, major neighborhoods of test cases are identified in the input feature space and their variant densities can then be reflected in the skill score evaluations as performed in the clustering based models. We consider a 95% confidence bound for skill score comparisons using 2000 bootstrap samples. An example of a sampling distribution of $\bar{\delta}_j$ and its confidence bound for a sample cluster of test cases for a quantile regression model using spline-basis functions is shown in Figure 1.

Figure 1. Bootstrap distribution of average delta for a sample cluster - #test cases=588, #misses=26

The $SScore^{0.95}$ measure is preferred because it performs a more accurate and fair comparison between prediction interval forecasting systems by considering sampling uncertainties in test experiments.
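A minimal sketch of the bootstrap step is given below. It resamples the per-case miss distances of each cluster, takes the requested quantile of the resampled means, and plugs that bound into the weighted sum of cluster scores. The function names, the fixed seed, and the unit miss distances are assumptions; only the counts (588 test cases, 26 misses) and the 2000 resamples come from the text.

```python
import random

def bootstrap_quantile_of_mean(values, gamma=0.95, n_boot=2000, seed=0):
    """gamma-quantile of the bootstrap distribution of the sample mean."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    return means[min(n_boot - 1, int(gamma * n_boot))]

def skill_score_bound(clusters, beta, gamma=0.95, n_boot=2000, seed=0):
    """Confidence bound on the skill score from (mean width, deltas) pairs per cluster."""
    total = sum(len(deltas) for _, deltas in clusters)
    bound = 0.0
    for mean_width, deltas in clusters:
        delta_bound = bootstrap_quantile_of_mean(deltas, gamma, n_boot, seed)
        bound += len(deltas) / total * (beta / 2 * mean_width + delta_bound)
    return bound

# One cluster with 588 test cases and 26 misses; miss distances of 1.0 are hypothetical.
deltas = [0.0] * 562 + [1.0] * 26
delta_bound = bootstrap_quantile_of_mean(deltas)
score_bound = skill_score_bound([(9.0, deltas)], beta=0.05)
print(delta_bound, score_bound)
```

A cluster with fewer test cases produces a wider bootstrap distribution, so its bound sits further above the observed mean and it contributes a more conservative term to the overall score.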

5. PRACTICAL EVALUATION STUDY

5.1 Data and Models

For empirical comparison of the described methods, a three-year data set (2007, 2008 and 2009) of hourly WRF model forecasts (WRF Website) at two stations in BC, Canada is used. These forecasts are matched with observations from the National Center for Atmospheric Research (NCAR) data repository. The WRF v3 simulations were run in three nested grids with resolutions of 10.8 km, 3.6 km and 1.2 km. The outermost domain covered an area of about 15,595 km² on a 38×38 grid. The grid point closest to the observation station was assigned as the point of the associated forecast.

In total, ten features were recorded for about 51,000 NWP forecasts. These include temperature (t2, at 2m), wind speed (ws) and direction (wd, at 10m), surface pressure (sp), dew point temperature (dt), relative humidity (rh), hour of day (h), day (d) and month of year (m), and station. To take into account basic temporal aspects, new features were derived by calculating the change of the currently forecasted surface pressure compared to forecasts of 1, 3, 6 and 12 hours earlier for the same location (denoted as pg1, pg3, pg6 and pg12). Eight different feature sets (Table 1) were defined to select the best combination of predictors for 95% ($\beta = 0.05$) prediction interval modeling of temperature forecasts.

Table 1. Definition of feature sets used in PI models

Feat Set m d h t2 ws wd sp pg C1 C2 C3 C4 C5 C6 C7

C8 All the above features+(dt,rh,station)

Three-fold cross validation was used by splitting different years into folds. For instance, the 2007 and 2008 data were used to train the prediction interval model, and then the prediction intervals obtained by this model for 2009 were evaluated for their quality and skill. Random 5-fold cross validation experiments were also conducted, but due to the similarity of the obtained results only year-based cross validation results are reported here.
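The year-based splitting can be expressed as a small generator, sketched here under the assumption that each record carries a `year` field (a hypothetical name for illustration).

```python
def yearly_folds(records, years=(2007, 2008, 2009)):
    """Yield (test_year, train, test), holding out one full year at a time."""
    for test_year in years:
        train = [r for r in records if r["year"] != test_year]
        test = [r for r in records if r["year"] == test_year]
        yield test_year, train, test

# Hypothetical records: four forecasts per year.
records = [{"year": year, "hour": hour}
           for year in (2007, 2008, 2009) for hour in range(4)]
folds = list(yearly_folds(records))
for test_year, train, test in folds:
    print(test_year, len(train), len(test))
```

Holding out whole years keeps temporally adjacent, and therefore correlated, hours from appearing on both sides of the split.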

To achieve a relative comparison between prediction interval models, some basic naïve models were also defined. These include a climatological model that applies distribution fitting methods on the whole available body of past forecasts (K=1). Also two models of manual categorization of forecast records based on month (K=12) and predicted temperature (using equi-distant bins K=10) are considered. These baseline models are simplified versions of the approach described in section 2.2. Based on a number of experiments, kernel density smoothing was determined as the best distribution fitting strategy for these methods. Also in the fuzzy clustering method (FCM) the fuzzification factor and K were tuned to values of 1.1 and 45 [33].

5.2 Prediction Interval Forecast Skill Results

For evaluation of quantile regression models, the forecasts are clustered using alternative numbers of clusters in the range of 2 to 100 for projection of the skill score confidence bound. The role of the number of degrees of freedom in the spline basis, an important parameter of the spline quantile regression (SPQR) models, is depicted in Figure 2. The curves show the change of $SScore^{0.95}$ using K=50 clusters for models using different feature sets. The number of degrees of freedom at which a model achieves its best score is encircled. Note that K=50 was the number of clusters which best represented the forecast groups in experiments involving clustering-based methods. This value of K is used everywhere unless mentioned otherwise.

Figure 2. Projection of $SScore^{0.95}$ for spline quantile regression models over different degrees of freedom using various feature sets

Feature set C8 provides the best model using the SPQR method. This model achieves its best skill using 9 degrees of freedom. The figure demonstrates a steady improvement in the forecast skill of SPQR models as the feature set grows to include more attributes. In local quantile regression (LocQR) models, the role of $\lambda$ is investigated. As previously mentioned, there is essentially no offline training phase involved in LocQR, and two quantile models have to be optimized for every single test case. Experiments revealed that, because of this characteristic, a rather long computational time is required for the forecast and evaluation of the whole test data set. As an alternative approach, the LocQR model was trained for a limited number of knot points randomly selected from the train samples. In the test phase, rather than training a new model for every test case, the model already trained for the nearest knot to the current test case is applied to compute the prediction interval. Different numbers of knots were tried: 10, 100, 1000, 3000 and 20000 (indicating original LocQR with no knot selection). The skill scores of prediction interval models using the C8 feature set are plotted in Figure 3. This figure shows that a relatively small neighborhood ($\lambda = 0.3$) is preferred by the LocQR model. Results in this set of experiments also confirm that by using a limited number of knots (e.g. 3000), the training and evaluation phase can be performed much faster without significantly compromising the model’s accuracy.
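The knot-based shortcut can be sketched as two small helpers: one that draws knots at random from the training inputs, and one that looks up the knot closest to a test case. Euclidean distance, the function names, and the toy inputs are assumptions; the per-knot quantile models themselves are omitted.

```python
import math
import random

def select_knots(train_x, n_knots, seed=0):
    """Randomly pick knot points (without replacement) from the training inputs."""
    rng = random.Random(seed)
    return rng.sample(train_x, min(n_knots, len(train_x)))

def nearest_knot_index(knots, x):
    """Index of the knot closest to x; its pre-trained model would be reused for x."""
    return min(range(len(knots)), key=lambda i: math.dist(knots[i], x))

# Hypothetical two-dimensional training inputs (all distinct).
train_x = [(float(i), float(i % 7)) for i in range(100)]
knots = select_knots(train_x, n_knots=10)
idx = nearest_knot_index(knots, knots[3])
print(len(knots), idx)
```

With this arrangement the number of local models to fit is bounded by the number of knots rather than by the number of test cases.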

To explore the impact of different feature sets on the trained prediction interval models, Figure 4 illustrates the distribution of skill scores obtained by different models using these predictor sets. The C8 feature set which includes all available forecast attributes provided by the NWP model clearly provides higher skilled prediction interval models on average. This confirms that the quantile regression algorithms have been able to utilize the information contained in all of these attributes to better model the relationship between prediction situation and forecast uncertainty. Also among other basic combinations C3 which has the three attributes of temperature, wind speed and hour-of-day attains the best score which highlights the influence of these attributes on the temperature forecast uncertainty.

Figure 3. Skill score diagrams of LocQR models as a function of lambda and number of knots

Figure 4. Box plot of skill score for different feature sets used by the various quantile regression methods

The details of prediction interval quality measures from yearly cross validation of different methods are reported in Table 2. The first four columns identify the best setup among each of the different quantile regression, clustering-based and baseline methods. Basic quality measures are reported in the next five columns. The 95% confidence lower bound of the coverage measure is also calculated using a one-sided Binomial test. Thus, a cluster with 90% coverage (hit rate) in 1000 test cases would have a higher lower bound (i.e. Coverage0.95 = 88.3%) than a cluster with the same 90% coverage yet with only 200 test cases (i.e. Coverage0.95 = 85.8%). Moreover, Root Mean Squared Error (RMSE) is reported as a key measure for point based forecast evaluations. Note that the median of each prediction interval is considered as the new point forecast of the trained model. It should also be noted that since the upper and lower quantile models are learned independently in quantile regression approaches, they may cross one another in some cases. Although there were rather few of these cases (e.g. about 61 for 20% confidence level in NLQR), they are substituted by the climatological baseline prediction interval to keep a balanced judgment between different models.
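The effect of the number of test cases on the coverage lower bound can be reproduced with the following sketch. It assumes the bound is the exact one-sided binomial (Clopper-Pearson style) limit, which the text does not spell out, and finds it by bisection; all function names are illustrative.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.exp(log_binom_pmf(i, n, p)) for i in range(k, n + 1))

def coverage_lower_bound(hits, n, confidence=0.95):
    """One-sided lower confidence bound on the true coverage probability."""
    if hits == 0:
        return 0.0
    lo, hi = 0.0, hits / n
    for _ in range(60):                     # bisection on the monotone tail probability
        mid = (lo + hi) / 2
        if upper_tail(hits, n, mid) < 1 - confidence:
            lo = mid                        # mid is still rejected; the bound lies above it
        else:
            hi = mid
    return lo

bound_1000 = coverage_lower_bound(900, 1000)
bound_200 = coverage_lower_bound(180, 200)
print(round(100 * bound_1000, 1), round(100 * bound_200, 1))
```

Both clusters have the same observed 90% hit rate, yet the smaller one receives a visibly lower bound, in line with the two values quoted above.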

The best prediction interval model is LocQR with m = 0.3 using the C8 feature set. It is closely followed by KQR, with its Gaussian kernel parameter tuned by grid search to 0.035 and a cost factor of 1.3 (Figure 5). All of the quantile regression models other than LQR outperform the best fuzzy clustering-based method, which uses 50 clusters and kernel density smoothing (in terms of SScore0.95). In addition, all of these learning-based models surpass the baseline methods (p<0.0005, paired t-test). The quantile regression models provide significantly sharper prediction interval forecasts (lower average PI width), but with about 2% less coverage than the expected 95%. However, by referring to the skill score, which summarizes the overall performance of a model, one can conclude that these models, especially LocQR and KQR, produce higher-quality prediction interval forecasts. A fan chart showing sample temperature prediction intervals with a range of confidence levels for a specific time frame and station is provided in Figure 6. One can notice how the estimated forecast uncertainty changes dynamically with the forecast situation. Note that, due to the very large size (N×N) of the kernel matrix used in KQR, a Cholesky decomposition is needed to compress this matrix into a feasibly computable size [1].

Figure 5. Tuning the parameter used in the Gaussian kernel function of KQR method

Figure 6. Trends of various confidence level prediction

Table 2. Prediction interval verification measures for top models of different methods based on 3-fold (yearly) cross validation

Algorithm   K     Features  Fit/Params             Sharpness (°C)  Coverage %  Coverage0.95 %  Resolution  RMSE  SScore  SScore0.95
LocQR       (50)  C8        m=0.3                  9.04            94.84       92.51           1.56        2.40  0.2739  0.2977
KQR         (50)  C8        kernel=0.035, C=1.3    8.30            93.17       90.58           1.45        2.32  0.2725  0.3006
SPQR        (50)  C8        df=9                   9.70            93.69       91.22           1.79        2.66  0.3077  0.3359
NLQR        (50)  C8        -                      9.94            94.14       91.73           1.77        2.70  0.3107  0.3382
FCM         50    C8        Kernel                 10.86           94.52       92.16           1.60        2.86  0.3286  0.3558
LQR         (50)  C8        -                      10.70           94.45       92.12           0.90        2.78  0.3310  0.3595
Base-Month  12    Month     Kernel                 12.21           95.12       94.10           1.91        3.12  0.3601  0.3704
Base-Temp.  10    Temp.     Normal                 11.70           94.44       93.57           0.98        3.04  0.3620  0.3725
Base-Clim.  1     -         Normal                 12.17           94.78       94.49           0.00        3.11  0.3740  0.3774
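The trade-off between sharpness and coverage discussed above can be made concrete with a minimal sketch. It assumes sharpness is the average interval width and coverage is the hit rate, as the text describes them; the helper names and the toy numbers are hypothetical and are not data from the experiments.

```python
from typing import Sequence


def sharpness(lower: Sequence[float], upper: Sequence[float]) -> float:
    """Average prediction interval width (smaller means sharper)."""
    return sum(u - l for l, u in zip(lower, upper)) / len(lower)


def coverage(lower: Sequence[float], upper: Sequence[float],
             observed: Sequence[float]) -> float:
    """Fraction of observations that fall inside their interval (hit rate)."""
    hits = sum(1 for l, u, y in zip(lower, upper, observed) if l <= y <= u)
    return hits / len(observed)


# Toy observations and two hypothetical sets of intervals.
observed = [1.0, 2.0, 3.0, 4.0]
narrow_lower, narrow_upper = [0.5, 1.5, 3.25, 3.5], [1.5, 2.5, 4.25, 4.5]
wide_lower, wide_upper = [-1.0, 0.0, 1.0, 2.0], [3.0, 4.0, 5.0, 6.0]

narrow = (sharpness(narrow_lower, narrow_upper),
          coverage(narrow_lower, narrow_upper, observed))
wide = (sharpness(wide_lower, wide_upper),
        coverage(wide_lower, wide_upper, observed))
print("narrow intervals (width, hit rate):", narrow)
print("wide intervals   (width, hit rate):", wide)
```

In this toy case the narrow intervals are four times sharper but miss one observation, which is why a combined skill score is needed to rank such models.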

In Figure 7, curves of SScore0.95 with an increasing number of clusters are depicted. These curves confirm that the differences between the skills of the top models are not due to chance and that the prediction intervals obtained by LocQR and KQR are truly superior to those of the other models. As a counterexample, this is not the case between the LocQR models with m = 1 and m = 0.3. Although the first model has a better skill score in the ordinary test results (K=1), its skill score confidence bound becomes increasingly worse than that of the second model when sampling variations are taken into account with an increasing number of neighborhood clusters. In other words, the higher skill of the first model in the initial test results is most probably due to chance (it provides good prediction intervals in areas where too few samples are available to evaluate the reliability of the model).

This example underlines the role of skill score uncertainty analysis in real-world evaluations and decisions. It is also important to note that, by modeling the median forecast error as dependent on the available set of attributes, the point forecast performance is considerably improved as a side effect of prediction interval modeling. This can be viewed as a dynamic elimination of forecast bias in these models. The results of this study also agree with those obtained in [24]. Yet, the improvement achieved by quantile regression models over clustering-based models is considerably larger in the experiments conducted here.

Figure 7. Trends of SScore0.95 for the top quantile regression models

For a more comprehensive evaluation of the prediction interval modeling methods described here, we provide results for a range of major confidence levels, i.e. 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 0.95. Figure 8 and Figure 9 depict the trends of Reliability and Reliability0.95 (which uses Coverage0.95 rather than the simple measured Coverage), respectively.

Figure 8. Comparison of Reliability between various methods


Figure 9. Comparison of Reliability0.95 between various methods over confidence levels

Figure 10. Comparison of SScore0.95 between various methods over confidence levels

In Figure 10, the overall skill of the various methods is compared over the given range of confidence levels. The results across confidence levels confirm the superior skill of the LocQR- and KQR-based prediction interval forecasts. KQR performs poorly in terms of reliability; however, its sharper forecasts make the overall skill of KQR intervals competitive. A possible explanation for the superior performance of quantile regression models over clustering methods is that the forecast error information is used directly (in the objective loss function) by the single-phase optimization involved in these methods. In contrast, clustering methods (including FCM) determine the clusters of forecast cases through an optimization procedure that does not exploit the forecast error and is based on the predicted weather attributes only. The forecast error information enters only in the second phase, the distribution fitting.
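To illustrate how forecast errors enter a quantile regression objective directly, the following sketch uses the standard pinball (check) loss described in [13]. It is assumed here that this is the loss meant by the objective loss function above; the function names and the error values are hypothetical. The sketch fits only a constant quantile, whereas the models in this study condition the quantile on the forecast attributes.

```python
from typing import Sequence


def pinball_loss(residuals: Sequence[float], tau: float) -> float:
    """Mean pinball loss of residuals (observed error minus predicted quantile)."""
    return sum(tau * r if r >= 0 else (tau - 1.0) * r for r in residuals) / len(residuals)


def best_constant_quantile(errors: Sequence[float], tau: float) -> float:
    """Constant quantile estimate minimizing the pinball loss.

    The search is restricted to the sample values, which is sufficient
    because the loss is piecewise linear with kinks at the samples.
    """
    return min(errors, key=lambda q: pinball_loss([e - q for e in errors], tau))


# Hypothetical forecast errors (observation minus forecast), in °C.
errors = [-3.0, -1.0, 0.0, 0.5, 1.0, 2.0, 4.0, 6.0, 9.0, 12.0, 15.0]

lower_q = best_constant_quantile(errors, 0.1)
median_q = best_constant_quantile(errors, 0.5)
upper_q = best_constant_quantile(errors, 0.9)
print("10% / 50% / 90% error quantiles:", lower_q, median_q, upper_q)
```

Because the loss weights under- and over-prediction asymmetrically through tau, minimizing it pulls each model toward the requested quantile of the forecast error, with no separate distribution-fitting step.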

6. CONCLUSIONS

Major quantile regression methods, including kernel quantile regression, and recently proposed fuzzy clustering-based methods were applied to prediction interval modeling on a data set of NWP forecasts. These models extend the raw point predictions of the prediction system into interval forecasts, which intrinsically communicate the expected forecast uncertainty to users. A key analysis for skill score evaluation is proposed to account for sampling variations in test experiments. The roles of the parameters and of the various available features used in the quantile regression models are investigated in the experiments. The results demonstrate the superior performance of the local linear and kernel quantile regression models. Yet, the results do not provide a clear distinction between these two methods in terms of performance. All quantile regression models except LQR outperform the clustering-based models in terms of forecast skill. However, one should note that the clustering models have somewhat better reliability and can model the whole probabilistic distribution of a forecast in a single model. A more comprehensive investigation of the role of kernel functions and parameters in KQR is a direction for future research. In addition, time series-based models for prediction interval modeling will be developed and evaluated against the current models.

7. REFERENCES

[1] Bach, FR., and Jordan, MI. 2002. Kernel independent component analysis. Journal of Machine Learning Research. 3:1-48.

[2] Bessa, RJ., Miranda, V., Botterud, A., Wang, J., and Constantinescu, EM. 2012. Time adaptive conditional kernel density estimation for wind power forecasting. IEEE Transactions on Sustainable Energy. 3(4): 660-669.

[3] Bremnes, JB. 2004. Probabilistic wind power forecasts using local quantile regression. Wind Energy. 7: 47–54.

[4] Bremnes, JB. 2006. A comparison of a few statistical models for making quantile wind power forecasts. Wind Energy. 9: 3–11.

[5] Brocker, J., and Smith, LA. 2007. Scoring probabilistic forecasts: on the importance of being proper. Weather and Forecasting. 22: 382–388.

[6] Chatfield, C. 1993. Calculating interval forecasts. Business and Economics Statistics. 11: 121-135.

[7] Ehrendorfer, M. 1997. Predicting the uncertainty of numerical weather forecasts: a review. Meteorologische Zeitschrift, Neue Folge. 6: 147-183.

[8] Gneiting, T., and Raftery, AE. 2007. Strictly proper scoring rules, prediction and estimation. Journal of the American Statistical Association. 102: 359–378.

[9] Hahn, G.J., and Meeker, W.Q. 1991. Statistical intervals: A guide for practitioners. New York: John Wiley.

[10] Hastie, TJ., and Tibshirani, RJ. 1990. Generalized additive models. Chapman and Hall: London.

[11] Hosek, J., Musilek, P., Lozowski, E., and Pytlak, P. 2011. Effect of time resolution of meteorological inputs on dynamic thermal rating calculations. IET Generation, Transmission and Distribution. 5(9): 941-947.

[12] Jolliffe, I.T., and Stephenson, D.B. 2003. Forecast verification: A practitioner’s guide in atmospheric science. Chichester, U.K.: Wiley.

[13] Koenker, R. 2005. Quantile regression. Cambridge University Press.

[14] Lange, M. 2005. On the uncertainty of wind power predictions: Analysis of the forecast accuracy and statistical distribution of errors. Journal of Solar Energy Engineering. 127: 177–184.

[15] Lange, M. 2003. Analysis of the uncertainty of wind power predictions. PhD thesis, University of Oldenburg.

[16] Li, Y., Liu, Y., and Zhu, J. 2007. Quantile regression in reproducing kernel Hilbert spaces. Journal of the American Statistical Association. 102: 255–268.

[17] Møller, JK., Nielsen, HA., and Madsen, H. 2008. Time-adaptive quantile regression. Computational Statistics & Data Analysis. 52(3): 1292–1303.

[18] Nielsen, H., Madsen, H., and Nielsen, TS. 2006. Using quantile regression to extend an existing wind power forecasting system with probabilistic forecasts. Wind Energy. 9: 95–108.

[19] Nipen, T., and Stull, R. 2011. Calibrating probabilistic forecasts from an NWP ensemble. Tellus. 63A: 858–875. DOI: 10.1111/j.1600-0870.2011.00535.x

[20] Orrell, D., Smith, L., Barkmeijer, J., and Palmer, T. 2001. Model error in weather forecasting. Nonlinear Proc. Geophys. 8: 357–371.

[21] Pierolo, R. 2011. Information gain as a score for probabilistic forecasts. Meteorological Applications. 18: 9-17.

[22] Pinson, P. 2006. Estimation of the uncertainty in wind power forecasting. PhD Dissertation, Ecole des Mines de Paris, [Online] Available: www.imm.dtu.dk/~pp, ww.pastel.paristech.org/bib

[23] Pinson, P., and Kariniotakis, G. 2010. Conditional prediction intervals of wind power generation. IEEE Transactions on Power Systems. 25(4): 1845-1856.

[24] Pinson, P., Nielsen, H.Aa., Møller, J.K., Madsen, H., and Kariniotakis, G.N. 2007. Nonparametric probabilistic forecasts of wind power: required properties and evaluation. Wind Energy. 10: 497–516.

[25] Richardson, D.S. 2000. Skill and relative economic value of the ECMWF ensemble prediction system. Quart. J. Roy. Meteor. Soc. 126: 649–667.

[26] Roulston, M.S., and Smith, LA. 2001. Evaluating probabilistic forecasts using information theory. Mon. Wea. Rev. 130: 1653–1660.

[27] Takeuchi, I., Le, Q.V., Sears, T., and Smola, A.J. 2006. Nonparametric quantile estimation. J. Mach. Learn. Res. 7: 1231-1264.

[28] The Weather Research and Forecasting (WRF) Model: http://www.wrf-model.org/index.php, accessed on December 2011.

[29] Wilks, D.S. 2006. Statistical methods in the atmospheric sciences. Academic Press: New York; 627.

[30] Winkler, RL. 1972. A decision-theoretic approach to interval estimation. Journal of the American Statistical Association. 67: 187–191.

[31] Yu, K., and Jones, MC. 1998. Local linear quantile regression. Journal of the American Statistical Association. 93: 228–238.

[32] Yu, K., and Zudi, L. 2003. Quantile regression: applications and current research areas. Journal of the Royal Statistical Society. 52(3): 331-350.

[33] Zarnani, A., Musilek, P., and Heckenbergerova, J. 2013. Clustering numerical weather forecasts to obtain statistical prediction intervals. Meteorological Applications. n/a. doi: 10.1002/met.1383.

[34] Zarnani, A., Musilek, P. 2013. Modeling Forecast Uncertainty Using Fuzzy Clustering. 7th International Conference on Soft Computing Models in Industrial and Environmental Applications. Advances in Intelligent Systems and Computing. 188: 287-296.