Predicting the Solar Resource and Power Load
David Sehloff, Celso Torres
Supervisor: Alex Cassidy, Dr. Arye Nehorai
Department of Electrical and Systems Engineering
Washington University in St. Louis
Spring 2015

Figure 1. The net load, or power load minus wind and solar generation, for the state of California on March 31.
Abstract—The value of solar energy is recognized for a variety of reasons, such as environmental and public health benefits and security of supply. Inherently associated with these sources, however, is a degree of uncertainty. Knowing a day in advance how much power renewable sources will deliver, along with how much power will be demanded, would help suppliers determine the type and quantities of other resources to use. The goal of this research is to build predictors for both the power load and the solar resource using the machine learning technique known as support vector regression. Given weather, solar, and power load data from several years, the support vector algorithm builds a predictor model that outputs day-ahead predicted power load and solar resource based on weather forecasts. This allows analysis of the net power demands from the other sources of electric power for various levels of penetration of photovoltaic installations. The results of this analysis could be used by utilities or balancing authorities to plan which power plants should run at certain times and when energy storage or demand-shifting incentives should be utilized.
I. INTRODUCTION
The electric power grid that we know and rely on every day is an engineering feat. Many challenges had to be overcome to provide electric power reliably whenever there is demand for it. Providing this electric power was even called "the greatest engineering achievement of the 20th century" by the National Academy of Engineering. But the grid is facing new challenges in the 21st century. Renewable energy sources offer value in many areas, such as environmental and public health benefits and security of energy supply. In addition, their operating costs can be extremely small. Inherently associated with these sources, however, is a degree of uncertainty. Unlike coal, nuclear, or natural gas power plants, operators cannot directly control the power that renewable sources output. It would be advantageous to know in advance how much power renewable sources will deliver, along with how much power consumers will demand, so that suppliers know when other resources should be used.
An example from the California Independent System Operator shows what the changing load profile might look like. Due to an increasing presence of solar resources connected to the grid, the power required from other sources will significantly drop during the peak solar generation hours in the middle of the day. It will then steeply increase as the sun sets, lights turn on, and people come home to their kitchens, laundry, and other appliances and electronics.
The risk of generating too much power at the bottom of the curve and not generating enough at the sharp increase grows with the uncertainty of renewable sources. Figure 1, from the California ISO, shows an example of the actual power demand on March 31 minus the power from solar and wind resources. Two curves of actual measurements are shown, and the rest are projections based on the expected increase in solar and wind installations. This graph, often called a "duck curve" in the industry because of its shape, helps depict how the solar resource affects the power grid and the effects that uncertainty can have. When a power grid has a large amount of solar power capacity, the middle of this curve can fluctuate dramatically from day to day.
Accurate forecasting of the hourly solar resource, along with the hourly power demand, one day in advance would help with power grid planning and decrease the costs associated with the integration of solar power with its inherent uncertainty. Our goal is to build predictors of the solar resource and power load profile that will reduce this uncertainty.
II. BUILDING THE PREDICTOR MODELS
To make these predictions, we studied the theory and implementation of machine learning, focusing on support vector machines. This type of learning algorithm was first developed in the 1960s and became the subject of much research in the 1990s. Vapnik and colleagues, including Chervonenkis, Schölkopf, and Smola, were behind these foundational developments. The algorithm, which has quickly become ubiquitous in diverse practical applications, is named for support vectors, the data points that are used to build the model. Support vector machines, or SVMs, can be used for either classification or regression. Classification groups each instance into a certain class based on its attributes, and it uses as support vectors only the data points that are close to the classification boundary. Regression, on the other hand, uses the attributes to fit a function to the data, and it uses as support vectors only the points that are greater than a specified distance from this function. In this project, we use support vector regression to build our solar resource and power demand predictors.
The support vector regression algorithm is an optimization problem in which we choose a hyperplane through the data that is close to as many points as possible, in the sense that these points lie within a specified distance ε of the hyperplane. It is also desirable that this hyperplane be relatively flat. This leads to the minimization of a function with two terms: one describing the sum of the distances by which each point falls outside the specified ε-insensitive band, and a second characterizing the flatness of the function by the squared norm of w, the hyperplane's normal. This quadratic optimization is subject to the accuracy constraints that each point lie within the ε band plus or minus slack variables ξ or ξ*, which allow points to fall outside the range. These ξ and ξ* slack variables make up the first term in the objective function. This term is multiplied by a cost parameter, C, which can be tuned to balance the model's accuracy against its simplicity: a large value emphasizes keeping points within the ε band, while a small value emphasizes a flatter, simpler hyperplane to characterize the data. Vapnik's 1995 formulation of the problem is

    minimize    (1/2)‖w‖² + C Σᵢ (ξᵢ + ξᵢ*)
    subject to  yᵢ − ⟨w, xᵢ⟩ − b ≤ ε + ξᵢ
                ⟨w, xᵢ⟩ + b − yᵢ ≤ ε + ξᵢ*
                ξᵢ, ξᵢ* ≥ 0,

which can be solved by formulating its dual. For this, a kernel trick is used, which implicitly maps the data to a higher-dimensional space to find a solution. A common kernel, and the one we used, is the radial basis function,

    K(x, x′) = exp(−γ ‖x − x′‖²),

which has an adjustable parameter γ.
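As a small illustration (a Python sketch, not the project's MATLAB/Weka code), the two quantities above translate directly: the ε-insensitive loss ignores residuals inside the band, and the radial basis kernel depends only on the squared distance between points and the parameter γ.

```python
import math

def rbf_kernel(x1, x2, gamma):
    """Radial basis function kernel: K(x, x') = exp(-gamma * ||x - x'||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x1, x2)))

def eps_insensitive_loss(residual, eps):
    """Zero inside the eps band, linear in the excess outside it."""
    return max(0.0, abs(residual) - eps)
```

Note that the loss grows only with the distance by which a point leaves the band, which is exactly the quantity the slack variables measure in the formulation above.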
We can change the fit of the model to the data set by varying the parameters ε, C, and γ. According to Müller et al., these parameters, though "powerful means for regularization and adaptation to the noise in the data," are difficult to select. The lack of a simple, general method or theoretical bounds for their selection shows that this is an area of support vector machines with potential for growth. In our project, we used a simple grid search algorithm, which built models while varying C and γ through a specified range and compared the error of each model in cross-validation. Parameter selection aims to strike a balance between over- and under-fitting. Under-fitting the data results in a model that does not capture all the characteristics of the set, while over-fitting builds a model that captures the individual peculiarities or noise of the given data so well that it is not useful for predicting future behavior. In our tests for power load profile parameters, the best results came from C = 1024 and γ = 7. For the solar resource, we found C = 256 and γ = 4 to perform well.
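Our actual models were trained with LibSVM through Weka; purely as an illustration of the grid-search-with-cross-validation pattern, the following Python sketch tunes C and γ for a lightweight kernel regressor (a kernel ridge model standing in for SVR; the toy data, fold scheme, and parameter grids are assumptions for the example, not the project's pipeline).

```python
import math

def rbf(x1, x2, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x1, x2)))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_predict(train_x, train_y, test_x, C, gamma):
    """Kernel ridge regression as an SVR stand-in: alpha = (K + I/C)^-1 y."""
    n = len(train_x)
    K = [[rbf(train_x[i], train_x[j], gamma) + ((1.0 / C) if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, train_y)
    return [sum(a * rbf(x, xi, gamma) for a, xi in zip(alpha, train_x))
            for x in test_x]

def grid_search(xs, ys, Cs, gammas, k=5):
    """Pick the (C, gamma) pair minimizing mean absolute error in k-fold CV."""
    best = None
    for C in Cs:
        for gamma in gammas:
            errs = []
            for fold in range(k):
                test_idx = set(range(fold, len(xs), k))
                tr_x = [x for i, x in enumerate(xs) if i not in test_idx]
                tr_y = [y for i, y in enumerate(ys) if i not in test_idx]
                te_x = [x for i, x in enumerate(xs) if i in test_idx]
                te_y = [y for i, y in enumerate(ys) if i in test_idx]
                preds = fit_predict(tr_x, tr_y, te_x, C, gamma)
                errs.extend(abs(p - y) for p, y in zip(preds, te_y))
            mae = sum(errs) / len(errs)
            if best is None or mae < best[0]:
                best = (mae, C, gamma)
    return best

# Toy data: a one-dimensional sinusoid standing in for an hourly profile.
xs = [(0.2 * i,) for i in range(30)]
ys = [math.sin(x[0]) for x in xs]
best_mae, best_C, best_gamma = grid_search(xs, ys, [1, 64, 1024], [0.1, 1, 4])
```

The outer loops over the parameter grid and the inner cross-validation loop are the essential shape of the procedure; its cost grows with the product of the grid sizes, which is why we note below that grid search is time consuming.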
Figure 2. The weather stations. The red markers indicate locations for solar forecasting, and the blue markers indicate temperature data sources for power load forecasting.
Support vector regression can be implemented in several ways. A common package available for MATLAB and Java is LibSVM. We implemented it both directly in MATLAB and in Java through a programming interface called Weka, which organizes the training and testing process and provides the capability to filter the data, such as normalizing the attributes, which is important to ensure that each attribute is given equal weight.
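The attribute normalization described can be sketched as a simple min-max rescaling of each column to [0, 1] (illustrative Python; Weka applies an equivalent filter internally):

```python
def normalize_columns(rows):
    """Min-max scale each attribute column to [0, 1] so that no single
    attribute dominates the kernel distance computation."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)]
            for row in rows]
```

Without this step, an attribute measured in large units (e.g., load in MW) would swamp one measured in small units (e.g., a binary weekend flag) inside the RBF kernel's distance term.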
In the preliminary stages of experimenting with regression techniques, we worked with several data sets, including detailed weather information readily available from the National Renewable Energy Laboratory for a typical meteorological year at many locations around the country and historical hourly data from a network of weather stations in the northwest U.S. available from the Bureau of Reclamation. The latter source of data proved to be the most useful, as it is continuously updated and includes many attributes that are important for predicting solar radiation. Detailed power load information from southern Washington, northern Oregon, and western Idaho was also available from the Bonneville Power Administration Balancing Authority. For making predictions, we obtained forecasted weather attributes from the National Weather Service.
To build a predictor for solar radiation, we obtained hourly weather data from January 2010 through March 2015. The attributes in this data were the day of the year, time, temperature, relative humidity, wind gust speed, and wind speed. The target value was the global horizontal irradiance, or GHI, which describes the solar radiation incident on a flat photovoltaic panel. We took data from weather stations at five locations: Imbler, Powell Butte, Echo, and Baker Valley in Oregon, and George in Washington. These locations are shown by the red markers in Figure 2. The solar radiation showed notable variance across locations and hour-to-hour in each location. At a glance, it seemed that this variance was greater than that of the sunniest region of the country, southern California and Arizona, which suggests that a reliable solar predictor could be important in an area such as this one.
To build the power load predictor, we used hourly weather and power load data from January 2007 to March 2015. The power load does not correlate strongly with as many features as the solar radiation. Successful peak-load predictors have been built using only date and time as attributes. For this project's aim of predicting the hourly load, we found it beneficial to include temperature data from two locations: Hood River, OR, and Boise, ID, shown by the blue markers in Figure 2. It is important to note that the power load generally follows a slightly different pattern on weekends, so we included a binary attribute that describes whether or not the day is a weekend. In addition, the temperature tends to have the opposite effect on the power load depending on whether the date falls roughly between the beginning of April and the end of September, and we added a binary attribute describing this. The other attributes were the year, day, time, temperature at Hood River, and temperature at Boise. The target value was the power load in MW.
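A feature vector of the kind described might be assembled as follows (illustrative Python; the exact attribute ordering and the helper name load_features are our assumptions, not the project's actual code):

```python
from datetime import datetime

def load_features(ts, temp_hood_river, temp_boise):
    """Build one training instance from a timestamp and the two
    temperature readings, following the attributes described above."""
    is_weekend = 1 if ts.weekday() >= 5 else 0   # Saturday or Sunday
    is_summer = 1 if 4 <= ts.month <= 9 else 0   # roughly April-September
    return [ts.year, ts.timetuple().tm_yday, ts.hour,
            temp_hood_river, temp_boise, is_weekend, is_summer]
```

The two binary flags let a single model capture the weekend load pattern and the seasonal reversal of temperature's effect without training separate models.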
III. TESTING AND VERIFICATION
After building the predictor models for power load and solar radiation at each location, we performed the first predictions using as inputs historical weather data for five days in April 2015. We then compared the predictions of the model to the actual measurements of power load and solar radiation. For these tests, which do not depend on weather forecasts, we could test our model on as many days as were available from the weather database. Note that we did not test our models on any data points that we used for training. Figure 3 shows our test of the power load predictor on the five days in April 2015, identified on the horizontal axis by day number (from 1 to 365). Figure 4 shows our test of the solar predictor on the same days. We processed the solar model's output to ensure plausible predictions: any negative value was set to zero, and all values between 10:00 p.m. and 3:00 a.m. were set to zero.
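This post-processing step can be sketched as follows (illustrative Python; the 10 p.m.-3 a.m. cutoff follows the rule stated above, treated here as inclusive of 10 p.m. and exclusive of 3 a.m.):

```python
def postprocess_solar(predictions, hours):
    """Clamp implausible solar outputs: no negative radiation, and
    zero during the nighttime window (10 p.m. through 3 a.m.)."""
    return [0.0 if (p < 0 or h >= 22 or h < 3) else p
            for p, h in zip(predictions, hours)]
```

Because the regression model has no built-in knowledge that radiation is nonnegative or that the sun is down at night, this cheap physical sanity check removes errors the model cannot avoid on its own.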
Error in load forecasts is most commonly characterized by the Mean Absolute Percent Error (MAPE). This is defined as the mean of the absolute difference between the actual and forecasted values at each point, divided by the actual value at that point, as shown by the expression below:

    MAPE = (100% / n) Σₜ |Aₜ − Fₜ| / |Aₜ|,

where Aₜ and Fₜ are the actual and forecasted values at point t.
Error for the solar prediction is more difficult to characterize in a way that can be easily compared across data sets. Many of the values are zero, and the percent error cannot be well determined for those cases. We characterize the error as the Mean Absolute Error (MAE), defined as the mean of the absolute difference between actual and forecasted values at each point, as shown below:

    MAE = (1/n) Σₜ |Aₜ − Fₜ|.
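Both metrics are straightforward to compute (illustrative Python):

```python
def mape(actual, forecast):
    """Mean Absolute Percent Error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - f) / a)
                       for a, f in zip(actual, forecast)) / len(actual)

def mae(actual, forecast):
    """Mean Absolute Error, in the units of the data (here W/m^2 or MW)."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
```

The division by the actual value is exactly why MAPE breaks down for solar radiation, where many nighttime values are zero, and why MAE is used there instead.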
[Figure 3: "Test of Load Predictor Using Observed Weather" — actual vs. predicted power load (MW) by day of year.]
Figure 3. Test of the load predictor, with observed weather from Hood River, OR and Boise, ID as inputs. The actual values are also shown. The Mean Absolute Percent Error is 3.96%.

[Figure 4: "Test of Solar Predictor Using Observed Weather" — actual vs. predicted solar radiation (W/m²) by day of year.]
Figure 4. Test of the solar predictor, with observed weather from Baker Valley, OR as inputs. The actual values are also shown. The Mean Absolute Error is 70.93 W/m².
The load predictor had a MAPE of 3.96%, and the solar predictor had an MAE of 70.93 W/m² in these tests.
Next, we input forecasted weather data for April 12-14 to our models to create 48-hour-ahead predictions of the solar radiation in each of the five locations and the power load for the overall area. Figure 5 shows the load prediction, which has a MAPE of 4.12%. Figure 6 shows the best solar forecast, for Imbler, OR, which has an MAE of 36.52 W/m². Table 1 shows the MAE for each of the five locations.
[Figure 5: "48-Hour Load Forecast" — actual vs. predicted power load (MW) by date.]
Figure 5. Output of the load predictor using forecasted temperature as an input. The actual values are also shown. The Mean Absolute Percent Error is 4.12%.

[Figure 6: "48-Hour Solar Forecast for Imbler, OR" — actual vs. predicted solar radiation (W/m²) by date.]
Figure 6. Output of the solar predictor using forecasted weather as inputs. The actual values are also shown. The Mean Absolute Error is 36.52 W/m².
Note: The days covered by Figures 5 and 6 (the 48-hour forecasts) overlap with those of Figures 3 and 4 (the tests on historical data) because the 48-hour weather forecast data was collected before the historical-data tests were performed. The 48-hour forecasts used only forecasted inputs and represent the prediction results that could be expected up to 48 hours in advance.
These forecasting results can be applied to predict the change in demand from conventional generation sources when the power grid has a certain capacity of solar installations. In our case, the chosen locations do not have large photovoltaic plants, but we can see the effect of our results by modeling plants of a certain capacity at the locations of our predictors. For this, we assume that the power output of a plant varies linearly with incident solar radiation, neglecting the effects of temperature and other factors on photovoltaic cell efficiency. For example, if a 30 MW-capacity photovoltaic plant were installed in each of the five locations, the approximate total power that these five plants would produce is shown in Figure 7.
Subtracting this generation from the predicted load for the same time period gives the predicted net load. Comparing the predicted load to the predicted net load, as in Figure 8, shows the predicted effect that the solar resource will have on the power grid. We can compare this with the actual net load, which comes from the actual load minus the calculated solar output. The mean absolute percent deviation of the prediction from the calculated net load is 4.15%. This deviation clearly comes largely from the power load; the solar resource has a small effect on this value since the modeled capacity is a small fraction of the load. The analysis could easily be repeated for additional locations or a larger installation to see a more dramatic impact on the grid. Ultimately, this method is intended to be applied for a given installed capacity, and its impacts will be larger for larger capacities of solar generation.
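The net-load calculation under the linear photovoltaic model can be sketched as follows (illustrative Python; the 1000 W/m² rating basis and the helper names are assumptions for the example):

```python
STC_IRRADIANCE = 1000.0  # W/m^2; assumed irradiance at which nameplate capacity is rated

def pv_output_mw(ghi, capacity_mw):
    """Linear PV model: output scales with incident radiation and is
    capped at nameplate capacity (temperature effects neglected)."""
    return min(capacity_mw * max(ghi, 0.0) / STC_IRRADIANCE, capacity_mw)

def predicted_net_load(load_mw, site_ghis, capacity_mw=30.0):
    """Net load for one hour: predicted load minus the total output of
    one modeled plant per forecast site (five sites in the scenario above)."""
    solar = sum(pv_output_mw(g, capacity_mw) for g in site_ghis)
    return load_mw - solar
```

Scaling capacity_mw up immediately shows the deeper midday trough and steeper evening ramp of the duck curve discussed in the introduction.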
[Figure 7: "Total Power Output from Five 30 MW Installations" — calculated vs. forecasted output (MW) by date and time.]
Figure 7. Predicted total output, modeling one 30 MW photovoltaic plant at each forecast location, compared to the calculated output based on actual solar radiation.

Location            MAE (W/m²)
Imbler, OR          36.52
Powell Butte, OR    66.92
Echo, OR            91.89
Baker Valley, OR    98.23
George, WA          98.25
Table 1. Error in solar prediction at each of the five locations, characterized by MAE.
IV. DISCUSSION & CONCLUSIONS
Our results show that support vector regression can be a viable means to solve the forecasting problems presented. We have found several important factors for successful power load and solar resource forecasts. The first is to include a large amount of data for training the model. By including five to eight years of hourly data, our models were able to capture the periodic behavior based on the hour of the day and the day of the year. We performed tests with smaller data sets of several months to several years, but the largest data sets gave the best results. With large data sets, it is also important to have a procedure for missing values. The algorithm we used interprets missing values as zero, which would falsely represent data from most of our attributes. Since fewer than about 0.1% of our data points had missing values, we deleted all such points from the training sets.
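The missing-value policy amounts to keeping only complete rows (illustrative Python; None stands in for a missing sensor reading):

```python
def drop_incomplete(rows):
    """Discard any training instance containing a missing value (None);
    the SVR tool would otherwise silently treat the gap as zero."""
    return [r for r in rows if all(v is not None for v in r)]
```

Dropping whole rows is only safe because so few points (under roughly 0.1%) are affected; with more gaps, imputation would be preferable to discarding data.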
Another important factor is the selection of parameters. Certain choices of C and γ gave us predictions that were rather flat, staying near the average value and not reaching the minima or maxima, while others varied widely, at times predicting a dramatic change where the actual value remained nearly constant. These are examples of under- and over-fitting. We used the parameters that we found to give the best balance between these.
Much improvement could be made to our results with better methods of parameter selection for C and γ. We used a grid search method, which is time consuming and can be inaccurate. Parameter selection is an area of ongoing research in machine learning, and a variety of methods, such as particle swarm optimization, have been shown to lead to improved selection of parameters. With more research into these methods and additional computing power, our models could be made more robust and give more accurate predictions.
In addition, with more time and computing power, we could incorporate more years of data into our model building. We found that adding several years of data improved the accuracy of our results, but the time required for computing with any additional data quickly became prohibitive.
Another improvement could come in the data that we use as our attributes. More sophisticated weather measurements, such as percent sky cover or even sky images, could be used as attributes. In-depth study of the dynamics of cloud behavior could lead to a better understanding of the changes in solar radiation and an improved model for prediction. Weather stations could be built to take the measurements that are found to be most important. It would also be beneficial to obtain data for the actual power output of a photovoltaic installation. This data would be more useful than the incident radiation as the target value of the prediction.

[Figure 8: "Forecasted Electric Power Load" — forecasted load without solar vs. forecasted net load with solar (MW), April 12-14.]
Figure 8. Comparison of the load with and without solar generation. A peak capacity of 150 MW has a small effect. The deviation from the calculations using actual load and radiation is 4.15%. Actual values are not shown.
Finally, in addition to improving the performance of support vector regression models, it would be worthwhile to investigate other methods of machine learning for this problem. One option, if several working models were built, would be to combine them into a hybrid model, as Wu et al. showed very successfully for photovoltaic output. 
A combination of these improvements applied at a specific location could result in accurate prediction of the power that a photovoltaic installation at that specific location will generate. This information would be valuable for the sale of the power or the planning of grid operations, allowing better utilization of power plants, from base- to peak-load, along with storage and demand response, ultimately reducing the cost of harnessing energy from the sun.
ACKNOWLEDGMENTS
We would like to thank Alex Cassidy and Dr. Arye Nehorai for their support and encouragement and Professor Ed Richter for his coordination of the undergraduate research.
REFERENCES
[1] "The Duck Curve: Managing a Green Grid," Flexible Resources Help Renewables, California ISO, 2013. Web. Accessed 20 Feb. 2015.
[2] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199–222, 2004.
[3] K.-R. Müller, A. J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik, "Predicting time series with support vector machines," in Artificial Neural Networks — ICANN'97, vol. 1327, W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds. Springer Berlin Heidelberg, 1997, pp. 999–1004.
[4] B.-J. Chen, M.-W. Chang, and C.-J. Lin, "Load forecasting using support vector machines: a study on EUNITE competition 2001," IEEE Transactions on Power Systems, vol. 19, no. 4, pp. 1821–1830, Nov. 2004.
[5] W.-C. Hong, "Chaotic particle swarm optimization algorithm in a support vector regression electric load forecasting model," Energy Conversion and Management, vol. 50, no. 1, pp. 105–117, 2009.
[6] I. Rojas, O. Valenzuela, F. Rojas, A. Guillen, L. J. Herrera, H. Pomares, L. Marquez, and M. Pasadas, "Soft-computing techniques and ARMA model for time series prediction," Neurocomputing, vol. 71, no. 4–6, pp. 519–537, 2008.
[7] Y.-K. Wu, C.-R. Chen, and H. A. Rahman, "A novel hybrid model for short-term forecasting in PV power generation," International Journal of Photoenergy, vol. 2014, Article ID 569249, 9 pages, 2014.
[8] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," SIGKDD Explorations, vol. 11, no. 1, 2009.
[9] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[10] "AgriMet Historical Hourly (Dayfile) Data Access," Bureau of Reclamation. Web. Accessed 14 Apr. 2015.
[11] "Wind Generation & Total Load in the BPA Balancing Authority," Bonneville Power Administration. Web. Accessed 14 Apr. 2015.
[12] "Weather Forecasts," National Forecast Maps, National Weather Service, 12 Apr. 2015. Web. Accessed 12 Apr. 2015. <http://www.weather.gov/forecastmaps>.