Gridded Forecast Evaluation Procedure - Coupling Data Science Techniques and Numerical Weather


5.1.3 Gridded Forecast Evaluation Procedure

Generating calibrated gridded solar irradiance forecasts requires determining the best estimate of irradiance at a location where irradiance is not observed. To simulate this condition and still score the different procedures, half of the 120 Oklahoma Mesonet sites are selected at random as training sites, and the other half are used as testing sites. In addition, testing days are withheld during the period of the experiment to prevent temporal contamination of the training data. Every third day during the training period is used as a testing day to reduce the impact of seasonality bias on the evaluation.
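The site and day split described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the site identifiers, the random seed, the 90-day period, and the choice of which day in each group of three is withheld (`offset`) are all assumptions.

```python
import random

def split_sites(site_ids, seed=0):
    """Randomly assign half of the sites to training and half to testing."""
    rng = random.Random(seed)          # seed is an assumption, used for repeatability
    shuffled = list(site_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return sorted(shuffled[:half]), sorted(shuffled[half:])

def split_days(days, step=3, offset=2):
    """Withhold every `step`-th day for testing; the rest are training days."""
    test = [d for i, d in enumerate(days) if i % step == offset]
    train = [d for i, d in enumerate(days) if i % step != offset]
    return train, test

# Placeholder identifiers, not real Oklahoma Mesonet station names
sites = [f"site_{i:03d}" for i in range(120)]
train_sites, test_sites = split_sites(sites)

# Day indices within a hypothetical 90-day experiment period
train_days, test_days = split_days(list(range(90)))
```

Withholding whole days, rather than random samples, keeps observations from the same day out of both sets at once.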

Two procedures are tested for generalized gridded forecasting. In the “Single Site” approach, separate statistical learning models are trained at each training site, predictions are made at each of these sites, and then the predictions are interpolated to the testing sites using the Cressman (1959) successive correction interpolation method. For each interpolation point, the interpolated value $f_i$ is a distance-weighted average of the predictions $f_{s_j}$ at the $J$ stations whose distances $d_j$ from the point lie within a radius of influence $R$, such that

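$$f_i = \frac{\sum_{j=1}^{J} w_j f_{s_j}}{\sum_{j=1}^{J} w_j}, \qquad w_j = \begin{cases} \dfrac{R^2 - d_j^2}{R^2 + d_j^2}, & d_j < R \\ 0, & d_j \ge R \end{cases} \tag{5.3}$$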

The test sites were initialized with the mean of the predictions at the training sites, and then four passes of the Cressman filter were performed, with the radius decreasing on each pass to capture local effects. The Cressman interpolation method was chosen because the NWS gridded MOS system (Glahn et al. 2009) also uses it for interpolation from training sites to grid points.
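A minimal sketch of the interpolation step follows, assuming planar station coordinates and the standard residual-based form of successive correction, in which each pass spreads the remaining station errors with a smaller radius. The text does not spell out how the passes were implemented, so the update rule, the radii, and the function names here are illustrative assumptions.

```python
import math

def cressman_weight(d, radius):
    """Cressman (1959) weight: positive inside the radius of influence, zero outside."""
    if d >= radius:
        return 0.0
    return (radius**2 - d**2) / (radius**2 + d**2)

def cressman_value(point, stations, values, radius):
    """Weighted average of station values around `point`; None if no station is in range."""
    num = den = 0.0
    for (x, y), v in zip(stations, values):
        w = cressman_weight(math.hypot(point[0] - x, point[1] - y), radius)
        num += w * v
        den += w
    return num / den if den > 0 else None

def successive_correction(targets, stations, values, radii):
    """Start from the mean of the station values, then apply one correction pass per radius."""
    first_guess = sum(values) / len(values)
    analysis = [first_guess] * len(targets)       # estimates at the target (testing) points
    station_est = [first_guess] * len(stations)   # current estimates at the stations
    for radius in radii:
        # Residual between each station value and the current estimate at that station
        residuals = [v - e for v, e in zip(values, station_est)]
        new_analysis = []
        for t, a in zip(targets, analysis):
            corr = cressman_value(t, stations, residuals, radius)
            new_analysis.append(a + (corr or 0.0))   # no stations in range: no correction
        new_station = []
        for s, e in zip(stations, station_est):
            corr = cressman_value(s, stations, residuals, radius)
            new_station.append(e + (corr or 0.0))
        analysis, station_est = new_analysis, new_station
    return analysis
```

For example, `successive_correction(test_points, train_points, train_predictions, radii=[20, 10, 5, 2])` would run four passes with a decreasing radius; the radii are hypothetical and are in the same units as the coordinates.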

In the “Multi Site” approach, the data from all training sites are aggregated and used to train one statistical learning model. This model is then applied at the testing sites using the NWP model and clear sky model output at that location. This approach requires training only a single statistical model, which can therefore draw on a larger training set than any model in the Single Site method. Applying a single model at each grid point also eliminates discontinuities that may be found in approaches that use separate statistical models for different regions. However, this approach is less able to correct for local biases and conditions.
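The pooling idea can be illustrated with a deliberately simplified sketch. A one-predictor least-squares line stands in for the statistical learning models used in the study, and the site names and data values are invented for the example.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Hypothetical samples per training site: (NWP clearness index, observed clearness index)
training = {
    "site_a": [(0.2, 0.25), (0.5, 0.55), (0.8, 0.85)],
    "site_b": [(0.3, 0.35), (0.6, 0.65), (0.9, 0.95)],
}

# Multi Site: pool the samples from every training site and fit one model
pooled = [pair for samples in training.values() for pair in samples]
a, b = fit_line([p[0] for p in pooled], [p[1] for p in pooled])

# Apply the single model at a testing site using that site's own NWP output
test_site_nwp = [0.4, 0.7]
predictions = [a + b * x for x in test_site_nwp]
```

Because the model is driven by the predictors at each location, no spatial interpolation of predictions is needed in this approach.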

Forecasts from each machine learning model are evaluated based on their accuracy, systematic bias, reliability, discrimination, and sharpness. Forecast accuracy is assessed using the mean absolute error, which is less sensitive to outlier errors than squared-error measures. Systematic bias is evaluated through the mean error, which determines whether the models tend to over- or underforecast the clearness index. The reliability, or conditional bias, of the forecasts is assessed through a reliability diagram in which the forecasts are binned and the mean observed value is calculated for each bin. Reliable forecasts should have an average observed value close to the forecast value in each bin. Discrimination is assessed by binning observations and calculating the average forecast value for a given observation. If the models show good discrimination, then observations of higher clearness index should have a higher forecast clearness index on average than observations of lower clearness index. Sharpness is assessed by examining the distribution of forecasts and comparing it with the distribution of observations.
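The accuracy, bias, reliability, and discrimination measures described above can be sketched as follows. The bin edges and sample values are invented for illustration, and the helper names are not taken from the study.

```python
def mean_absolute_error(forecasts, observations):
    """Average magnitude of the forecast errors."""
    return sum(abs(f - o) for f, o in zip(forecasts, observations)) / len(forecasts)

def mean_error(forecasts, observations):
    """Average signed error; positive values indicate overforecasting on average."""
    return sum(f - o for f, o in zip(forecasts, observations)) / len(forecasts)

def conditional_means(keys, values, edges):
    """Bin `keys` by `edges` and return the mean of `values` in each bin (None if empty)."""
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for k, v in zip(keys, values):
        for b in range(n_bins):
            is_last = b == n_bins - 1
            # Bins are half-open, except the last bin, which includes its upper edge
            if edges[b] <= k < edges[b + 1] or (is_last and k == edges[-1]):
                sums[b] += v
                counts[b] += 1
                break
    return [s / c if c else None for s, c in zip(sums, counts)]

# Hypothetical clearness index forecasts and observations
forecasts = [0.2, 0.3, 0.6, 0.7, 0.9]
observations = [0.1, 0.3, 0.5, 0.8, 0.9]
edges = [0.0, 0.5, 1.0]

mae = mean_absolute_error(forecasts, observations)
me = mean_error(forecasts, observations)
# Reliability: mean observation given the forecast bin
reliability = conditional_means(forecasts, observations, edges)
# Discrimination: mean forecast given the observation bin
discrimination = conditional_means(observations, forecasts, edges)
```

Reliability and discrimination use the same binning routine with the roles of forecast and observation swapped, which mirrors how the two diagnostics are defined in the text.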

The statistical significance of the verification scores is assessed with bootstrap confidence intervals and permutation tests. A bootstrap replicate size of 10000 is used. Independent bootstrap confidence intervals are used to assess the uncertainty of the verification scores. The models are also ranked by score in each bootstrap replicate, and the frequency of each rank for each model is counted to assess whether certain models consistently outperform others and how much that ranking varies. Finally, permutation tests are used to assess whether the difference in scores between two models is statistically significant and to obtain the p-value of that difference. A global p-value threshold of 0.05 is used to determine statistical significance and is adjusted with the false discovery rate correction to account for multiple comparisons (Benjamini and Hochberg 1995).
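The three significance tools named above can be sketched with the standard library alone. Several choices here are assumptions not stated in the text: the confidence interval is a percentile interval, the permutation test is a paired sign-flip test on per-case error differences, and the default replicate counts are kept small for speed.

```python
import random

def bootstrap_ci(errors, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `errors`."""
    rng = random.Random(seed)
    n = len(errors)
    # Resample with replacement and record the mean of each replicate
    means = sorted(sum(rng.choices(errors, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

def paired_permutation_pvalue(errors_a, errors_b, n_perm=1000, seed=0):
    """Two-sided p-value for the difference in mean error between two models."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs) / len(diffs))
    count = 0
    for _ in range(n_perm):
        # Randomly swap the model labels for each case (flip the sign of the difference)
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(permuted) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

def benjamini_hochberg(p_values, q=0.05):
    """Mark which hypotheses are rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff = rank          # largest rank that passes its threshold
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected
```

Passing `n_boot=10000` would match the replicate count quoted in the text. The p-values from all pairwise model comparisons would be collected and passed together to `benjamini_hochberg` so that the correction sees the full set of tests.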

5.2 Results

[Figure 5.2, image not reproduced. Title: GFS Clearness Index Prediction Models. Panels: Mean Absolute Error (axis range 0.085 to 0.110) and Mean Error (axis range 0.00 to 0.08). Each panel compares Multi Site and Single Site versions of GFS, Linear Regression, Random Forest, Random Forest All Features, Random Forest Short Trees, Gradient Boosting, Gradient Boosting Least Squares, Gradient Boosting Big Trees, Gradient Boosting All Features, and Gradient Boosting Slow Learning Rate.]

Figure 5.2: Bootstrap confidence intervals for the mean absolute errors and mean errors for each statistical learning model configuration.

In document Coupling Data Science Techniques and Numerical Weather Prediction Models for High-Impact Weather Prediction (Page 162-166)