Demand Forecasting in Smart Grids


Published online in Wiley Online Library (wileyonlinelibrary.com) • DOI: 10.1002/bltj.21650


Piotr Mirowski, Sining Chen, Tin Kam Ho, and Chun-Nam Yu

Data analytics in smart grids can be leveraged to channel the data downpour from individual meters into knowledge valuable to electric power utilities and end-consumers. Short-term load forecasting (STLF) can address issues vital to a utility, but it has traditionally been done mostly at the system (city or country) level. In this case study, we exploit rich, multi-year, and high-frequency annotated data collected via a metering infrastructure to perform STLF on aggregates of power meters in a mid-sized city. For smart meter aggregates complemented with geo-specific weather data, we benchmark several state-of-the-art forecasting algorithms, including kernel methods for nonlinear regression, seasonal and temperature-adjusted auto-regressive models, exponential smoothing and state-space models. We show how STLF accuracy improves at larger meter aggregation (at feeder, substation, and system-wide level). We provide an overview of our algorithms for load prediction and discuss system performance issues that impact real time STLF.

Introduction

Smart grid deployments carry the promise of allowing better control and balance of energy supply and demand through near real time, continuous visibility into detailed energy generation and consumption patterns. Methods to extract knowledge from near real time and accumulated observations are hence critical to the extraction of value from the infrastructure investment. On the demand side, widespread deployment of smart meters that provide frequent readings allows insight into continuous traces of usage patterns that are unique to each premise and each aggregate at different levels of the distribution hierarchy. This in turn enables better designs and triggers of demand response actions and pricing strategies, and provides input to the planning for growth and changes in the distribution network. Customers may also gain better awareness of their own consumption patterns.

In this paper we report a study on demand prediction, where we analyzed near real time power consumption monitored by tens of thousands of smart meters in a medium-size U.S. city. We developed and adapted several short-term forecasting methods for predicting the load at several levels of aggregation. In this context, short-term load forecasting (STLF) refers to the prediction of power consumption levels in the next hour, next day, or up to a week ahead. Within this time scope, one can have reliable weather forecasts, which provide important input to the prediction, as historically the load in this city is highly influenced by weather because electricity is used for both heating and cooling.


Scenario of the Study

In our study, the meters are deployed at customer locations and their readings are sampled every 15 minutes. The meter’s network description includes its geographical location (latitude, longitude); date of installation and planned removal; type of customer served; as well as which pole, which feeder section, and which substation the meter is connected to.

Weather data is collected by the utility company at the substation level, and consists of hourly temperature, wind speed, and wind chill temperature. Supplementary weather data, made available by the National Climatic Data Center (NCDC) and the National Oceanic and Atmospheric Administration (NOAA), provide further measurements, such as humidity or sky cover at the location of the city airport, and hourly weather forecasts up to seven days ahead.

The load prediction algorithms that we have investigated and implemented are embedded in a module of a data analytic system being developed for the utility company. The module receives meter measurements and converts them to power usage values, and aggregates usage at different levels: individual meters, feeder sections, distribution substations, and at the system level. It then generates load forecasts at prediction horizons that range from 60 minutes (next-hour predictions) to 24 hours (next-day predictions), or even 168 hours (next-week predictions).

As we will detail in a later section, the load forecasts operate independently for each meter and meter aggregate, communicating through a limited set of inputs/outputs with a database to read the latest weather forecasts and per-meter usage history and return corresponding load forecasts. This procedure can be parallelized, which enables some degree of asynchronous behavior within the prediction timeframe (the load forecasts are made with a granularity of one hour).

Short-term load forecasts generated at the meter (customer premise) level will provide the utility company with customer-level smart grid capabilities and help the company communicate with the customer about energy saving and billing issues. STLF generated at higher levels of aggregation (from feeder section to city-wide) will help in planning and operation of the relevant components of the electric grid.

Panel 1. Abbreviations, Acronyms, and Terms

ACF: Auto-correlation function
ARIMA: Auto-regressive integrated moving average
ARIMAX: Auto-regressive integrated moving average with external inputs
ARMA: Auto-regressive moving average
DASARIMA: Dummy-adjusted seasonal auto-regressive integrated moving average
ENEL: Ente Nazionale d’Electricita
GARCH: Generalized auto-regressive conditional heteroscedastic
GDP: Gross domestic product
HWT: Holt-Winters model
i.i.d.: Independently and identically distributed
KPSS: Kwiatkowski, Phillips, Schmidt, and Shin
LOESS: Locally-weighted scatterplot smoothing
LSE: Least square error
LTLF: Long-term load forecast
MAPE: Mean absolute percentage error
ML: Machine learning
MTLF: Medium-term load forecast
NCDC: National Climatic Data Center
NOAA: National Oceanic and Atmospheric Administration
PACF: Partial ACF
RAM: Random access memory
SARIMA: Seasonal auto-regressive integrated moving average
SARIMAX: Seasonal auto-regressive integrated moving average with external inputs
SSM: State-space model
sSVR: Sigma SVR
STLF: Short-term load forecast
SVM: Support vector machine
SVR: Support vector regression
Wh: Watt hour
WKR: Weighted kernel regression


State-of-the-Art in Load Forecasting

Electric load forecasting is a mature field of investigation and the statistical methodologies have been implemented and deployed in industrial applications. Several meta-review papers provide a good overview of the demand prediction literature [13, 17, 25, 27] and identify three sub-fields, depending on the prediction horizon.

The prominent sub-field of investigation, short-term load forecasting (STLF), handles prediction horizons of one hour up to one week and typically relies on time series analysis and modeling. Daily, weekly, and sometimes yearly seasonality can be explicitly modeled. These methods consider variables such as date (e.g., day of week and hour of the day), temperature (including weather forecasts), humidity, temperature-humidity index, wind-chill index and, most importantly, historical load. Residential versus commercial or industrial uses are rarely specified.

Representative algorithms for STLF include time series models of linear dynamic systems involving load and weather regressors, typically relying on auto-regressive models such as the auto-regressive moving average (ARMA) [15] and the seasonal auto-regressive integrated moving average (SARIMA) [35]. State-space models offer further refinement to linear dynamics by defining additional (so-called “hidden” or “latent”) state variables representing underlying load dynamics and seasonality, either by explicit variables as in the exponential smoothing methods [37] or in spline representations of daily load [19]. An alternative approach to modeling load and weather dynamics is to consider nonlinear models and a machine learning approach. A popular class of algorithms for STLF, which we do not report here but which has been used by several electric companies for system-wide predictions, is neural networks [22]. We focused instead on so-called kernel methods, starting from simple weighted kernel regression [6] all the way up to support vector machines [7] and kernel ridge regression. The section on Short-Term Load Forecasting Methodology provides more details about the methods that we implemented and investigated for this comparative study.

The remaining two fields of investigation, not covered in this paper, are medium-term load forecasting (MTLF), handling horizons of one week up to one year, and long-term load forecasting (LTLF), with predictions at horizons of multiple years. These methods typically proceed by regression on input variables, which, in addition to historic load and climate forecasts, typically incorporate demographic and economic factors such as the gross domestic product (GDP), real estate statistics, or population growth projections, as well as estimated demands of electric equipment.

Our key finding was that most of the research focused on large aggregated load data, typically at city level or even at country level, where most individual variations are averaged out by the effect of the law of large numbers. These methods were seldom tried on individual meters or at meter aggregate levels such as distribution feeders and substations, with a few exceptions such as recent work on STLF [3] in non-residential buildings or a clustering analysis of individual meters and aggregate load forecasting on feeder sections in a neighborhood of Seoul [32].

In this paper, we propose to continue bridging this gap by systematically evaluating when the state-of-the-art STLF algorithms break down, i.e., how the performance degrades when the number of considered meters goes down.

Other Datasets With Individual Meters

The specificity of the unique dataset that we investigated is that it contains energy consumption data from the system-wide (city-wide) level down to the level of individual meters. To our knowledge, few such complex datasets [32] have been investigated for short-term load forecasting, even though a few localized (e.g., building-specific) smart meter datasets have been studied [3].

The Italian energy provider Enel has deployed over 32 million smart meters. Remote monitoring is done by sending the readings from each customer’s location [31] through a low-bandwidth network to data aggregators located at substations. Data is sampled and stored at 15-minute intervals. The readings are sent about every two weeks or every month. The motivation for the utility is the ability to leverage customized hourly-based tariffs [11] to price services for its customers. Although the individual meter data collected by ENEL has been used in studies on grouping individual customers based on clustering load profiles [16], we are unaware of analyses of these data from the perspective of load aggregates.

At the individual home level, there are several studies on peak load prediction [33] and on energy disaggregation of individual appliances in households [23]. However, these datasets are much smaller (typically < 100 meters) than our current dataset. The frequency of load measurement is also much higher (one measurement every few seconds or milliseconds) and is not typical of smart meters currently under deployment.

The rest of the paper is organized as follows. We begin by explaining the structure of our unique, hierarchical dataset of load consumption coming from a mid-size U.S. city. We then provide an overview of key algorithms for short-term load forecasting that exploit both historical load and weather data. The section titled “Short-Term Load Forecasting Results” details the essentially state-of-the-art STLF results that we obtain at the system (city) level and how STLF performance depends on the size of the load aggregate. We conclude with a discussion of the performance, parallelism, and runtime issues raised by performing STLF at all levels of load aggregation; that discussion also introduces ensemble prediction, which leverages multiple STLF algorithms for improved predictions.

Smart Grid Data

The specificity of our study on short-term load forecasting is in its unique dataset consisting of hundreds of thousands of individual meters interconnected in a hierarchy of feeders and substations. We provide details on how the meter data is aggregated and how we associate it with weather data.

System Hierarchy of Meters, Feeders and Substations

This study exploits two sets of data, collected in a mid-sized U.S. city (population of about 200,000 inhabitants) over the course of several years:

• System-wide data representing total city consumption (residential and industrial), collected over the course of 2007, 2008, and 2009 at hourly intervals. This dataset is typical of classic STLF studies.

• Individual meter readings coming from over a hundred thousand meters installed at customer locations. Out of this rich dataset, we use 32,000 mostly residential meters that satisfied a number of conditions detailed below. This data was collected between January 2011 and June 2012.

The individual meters measure consumption (in Watt hours, Wh) at 15-minute intervals and are referenced in a meter−pole−feeder section−substation−district hierarchy. Meter measurements included in our analysis are those from the residential and small business customers (with contract demand under 5000 kW). Load predictions are made at these levels:

1. Single customer. Load measurements are derived from meter measurements by differencing: the increment of the meter value within a sampling time interval is divided by the duration of that interval. Single-customer STLF performance and methods are not the object of this paper.

2. Feeder section. We define a feeder section as a subset of transformers connected to a feeder (such as serving a neighborhood). The system network topology considered in this paper consists of about 300 unique feeders employed in the time period 2011−2012. The historical load measurements at a given time and at the level of a feeder section are based on aggregating the load derived from all the meters connected to that feeder section.

3. Substation. Each substation serves a small geographical area. There were about 100 unique substations in the distribution network over the time period we consider. Aggregation at the substation level works in the same way as aggregation at the feeder section level.

4. System-wide. This highest level of aggregation comprises all the residential and small business meters indexed in the distribution hierarchy.

As explained below, weather data are geo-located at the average (center) location of the meters connected to each feeder section or substation.


From Consumer Meters to Consistent Load Aggregates

The main problem in combining meter data into aggregates is the lack of consistency, across time, of the constituents of each aggregate. Fortunately, our dataset contained, in addition to meter readings, periodically updated metadata that described each meter, its connection to the feeder and substation, its geographical (latitude, longitude) coordinates as well as customer-specific data. To obtain a consistent dataset for method evaluation, we used this metadata to discard meters that were disconnected and reconnected, keeping only meters that satisfied several consistency requirements: same owner, same feeder connection, and same geographical location throughout the evaluation period.

Although our per-meter dataset lists nearly a hundred thousand meters, only a subset of 32,000 meters satisfies the consistency requirements and contains non-zero meter readings. We concentrate on the load aggregates derived from these 32,000 meters. Load aggregates at feeder level are obtained by summing the 15-minute load from all the meters connected to that feeder. Similarly, the load aggregates at a substation are determined by adding up the loads at all the feeders connected to that substation. Each aggregate’s load is then down-sampled to hourly time intervals.

While aggregating the loads at individual meters, we had to handle non-aligned time stamps, meter reading resets, missing values or repeated readings, sometimes resorting to linear interpolation of the load.
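The conversion from meter readings to hourly aggregate load can be sketched as follows. This is a minimal illustration, not the Perl and shell pipeline used in the study: the function names, the `(timestamp in seconds, cumulative Wh)` input layout, and the rule of skipping an interval after a reset or repeated reading are all assumptions made for the example, and timestamps are assumed to be already aligned across meters (no interpolation is shown).

```python
from collections import defaultdict

def readings_to_load(readings):
    """Turn cumulative readings [(t_seconds, Wh), ...] into average power (W).

    Each load value is the energy increment over an interval divided by the
    interval duration, and is stamped with the end time of that interval.
    """
    loads = []
    for (t0, e0), (t1, e1) in zip(readings, readings[1:]):
        hours = (t1 - t0) / 3600.0
        if hours <= 0 or e1 < e0:
            # Repeated timestamp or meter reset: skip (assumed policy).
            continue
        loads.append((t1, (e1 - e0) / hours))
    return loads

def aggregate_hourly(per_meter_loads):
    """Sum the load of all meters per timestamp, then average within each hour."""
    total = defaultdict(float)
    for loads in per_meter_loads:
        for t_end, watts in loads:
            total[t_end] += watts
    buckets = defaultdict(list)
    for t_end, watts in total.items():
        # An interval ending at t_end belongs to the hour that contains it.
        buckets[(t_end - 1) // 3600].append(watts)
    return {hour: sum(v) / len(v) for hour, v in buckets.items()}
```

For two meters drawing a constant 1000 W and 500 W over one hour of 15-minute readings, `aggregate_hourly` returns a single hourly value of 1500 W for hour 0.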

All processing for this 18-month dataset, representing about 100 GB of data, was done using Perl* and shell scripts. As shown in Figure 1, we were able to reconstruct a smooth load profile at the system-wide level.

Geo-Specific Weather Data

The area covered by the individual meters in our mid-sized U.S. city dataset encompasses a gently hilly area of about 40 km by 60 km, traversed by a river and subject to micro-climatic variations. The weather data (temperature and wind speed) are measured hourly at 22 substations across that area. It is common to measure a difference of 15 degrees Fahrenheit in temperature between weather substations.

Because the STLF methods detailed in the next section are temperature-dependent, we interpolate the temperatures at the locations of all meters and all meter aggregates. This interpolation is done through the simple Kriging algorithm [14]. “Kriging” refers here to temperature interpolation based on regression against observed temperature values at a set of surrounding locations, each of them weighted according to spatial covariance. We employed the mGstat Matlab* toolbox for geo-statistics [18] and performed simple Kriging for each hour independently, using the latitude and longitude coordinates of about 400 feeder and substation meter load aggregates and the geographical coordinates and temperatures of 22 weather substations. A similar procedure was adopted for wind speed. The final result is illustrated in Figure 2, which shows the temperature interpolation at feeder and substation aggregate levels at two times of the year and illustrates the large temperature variations.
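As an illustration of the idea, simple Kriging for one hour can be sketched in a few lines. This is not the mGstat implementation used in the study: the exponential covariance model, its `sill` and `length` parameters, the use of plain Euclidean distance on (latitude, longitude) in degrees, and the station coordinates in the usage note are all assumptions chosen for the example.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def covariance(p, q, sill=1.0, length=0.2):
    """Assumed exponential covariance as a function of distance (in degrees)."""
    return sill * math.exp(-math.dist(p, q) / length)

def simple_kriging(stations, temps, target, mean):
    """Estimate the temperature at `target` from station values and a known mean."""
    C = [[covariance(p, q) for q in stations] for p in stations]
    c0 = [covariance(p, target) for p in stations]
    weights = solve(C, c0)  # weights reflect spatial covariance with the target
    return mean + sum(w * (z - mean) for w, z in zip(weights, temps))
```

With hypothetical stations at `(35.0, -85.3)`, `(35.2, -85.1)` and `(35.4, -85.4)`, the estimate reproduces the observed value at a station location and falls back to the assumed mean far away from all stations.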

Short-Term Load Forecasting Methodology

Time series modeling for short-term load forecasting (STLF) has been widely used over the last 30 years and a myriad of approaches have been developed. Kyriakides and Polycarpou [25] summarized these methods as follows:

1. Regression models that represent electricity load as a linear combination of variables related to weather factors, day type, and customer class.

2. Linear time series-based methods including the ARMA model, auto-regressive integrated moving average (ARIMA) model, auto-regressive integrated moving average with external inputs (ARIMAX) model, generalized auto-regressive conditional heteroscedastic (GARCH) model and state-space models.

3. State-space models (SSMs) typically relying on a filtering-based (e.g., Kalman) technique and a characterization of dynamical systems.

4. Nonlinear time series modeling through machine learning methods such as nonlinear regression.

Principles of Statistical Learning for Time Series

In the sections that follow, we will discuss the temperature regression and load residual, linear time series approaches, state-space models, and nonlinear time series models. Before delving into more detailed descriptions of learning algorithms, we begin by outlining their commonalities in this section on the principles of statistical learning for time series.

Supervised learning of the predictor. Supervised learning consists of fitting a predictive model to a training dataset $(X; L)$, which consists of pairs $(x_i; L_i)$ of data points or samples $x_i$ and of associated target values $L_i$. In the case of load forecasting, the samples $x_i$ represent historical values of electric load, weather, or other types of data, collected over a short time interval (e.g., one day). The target labels $L_i$ correspond to the electric load at the prediction horizon.

The objective is to optimize a function $f$ such that for each data point $x_i$, the prediction $f(x_i)$ is as close as possible to the ground truth target $L_i$. The discrepancy between all the predictions and the target labels is quantified here by the mean absolute percentage error (MAPE), whose formula is given in “Short-Term Load Forecasting Results.”
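For orientation, the usual textbook definition of MAPE can be sketched as below. The exact formula used in this paper is the one given in the results section, so this block is only an assumption about its standard form; the function name is illustrative and the code assumes that no actual load value is zero.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (standard definition)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Relative absolute error of each prediction; assumes nonzero actual loads.
    terms = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(terms) / len(terms)
```

A forecast that is off by 10 percent at every hour therefore has a MAPE of 10, regardless of the size of the load aggregate, which is what makes the measure comparable across aggregation levels.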

Figure 1. System load aggregated from about 32,000 individual meters over 18 months. Panel (a): January 2011 through June 2012. Panel (b): August 2011 through September 2011. Vertical axis in both panels: system load (MW per 15 min).


Training, validation and test sets. Good statistical learning algorithms are capable of extrapolating knowledge and of generalizing it to unseen data points. For this reason, we separate the known data points into a training (in-sample) set, used to define model f, and a test (out-of-sample) set, used exclusively to quantify the predictive power of f.

In the experiments reported in our section on short-term load forecasting results, we use one year of data for training and we test the model on the calendar month immediately following. When evaluating STLF on the 2007−2009 data, we retrain the model 24 times and provide predictions for January 2008 through December 2009. Using the 18-month aggregate dataset from January 2011 through June 2012, we trained six different STLF models for predicting results for January through June 2012.
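The retraining schedule described above can be enumerated with a short sketch. It assumes that the 12-month training window slides forward by one month at each retraining, which the text suggests but does not state explicitly; the function name and the `(year, month)` tuple representation are illustrative.

```python
def rolling_splits(first_train_month, n_train_months, n_tests):
    """Return a list of (train_months, test_month), months as (year, month)."""
    def add(ym, k):
        # Shift a (year, month) pair by k months.
        idx = ym[0] * 12 + (ym[1] - 1) + k
        return (idx // 12, idx % 12 + 1)

    splits = []
    for i in range(n_tests):
        train = [add(first_train_month, i + j) for j in range(n_train_months)]
        test = add(first_train_month, i + n_train_months)
        splits.append((train, test))
    return splits

# 2007-2009 system-wide data: 24 retrainings, tests from Jan 2008 to Dec 2009.
system_splits = rolling_splits((2007, 1), 12, 24)
# 2011-2012 meter aggregates: 6 retrainings, tests from Jan 2012 to Jun 2012.
meter_splits = rolling_splits((2011, 1), 12, 6)
```

Each test month always lies strictly after its training window, so no test data leaks into model fitting.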

Direct prediction versus iterated prediction in time series. In a time series prediction problem, as represented in Figure 3, the variable of interest (here, the load) might be present at the same time in the targets (output predictions) of the system and in the inputs, particularly when that variable is serially correlated or when it is produced by a dynamic system (e.g., the weather/climate model or a model for the human activities). Knowing the immediately preceding time samples of that variable helps in that prediction.

In our study, we consider hourly load and weather data, and are interested in making load forecasts at prediction horizons ranging from h = 1 hour (next hour) to h = 168 hours (next week). Predictions at all these different horizons can be achieved in two different ways, through direct prediction and iterated prediction. Let t denote the current time and assume that we have access to historical load up to time t, as well as to weather forecasts up to time t + 168.

• Direct prediction. This predictor takes all the data known up to time t, for instance load values in the past 24 hours ($L_{t-23}, L_{t-22}, \ldots, L_{t-1}, L_t$) and the temperature forecast at horizon h, namely $T_{t+h}$, and directly predicts the load $L_{t+h}$ that will occur h hours ahead (see Figure 3b). Direct prediction has a high computational cost, because a different predictor needs to be trained for each prediction horizon (168 in our case).

• Iterated prediction. This predictor is simply designed to make one-step-ahead predictions, at horizon h = 1. As the predictive model moves forward in time, the outputs of the predictor (here, load at time t + h) can in turn become its inputs (see Figure 3a), although this introduces the prediction error directly into the model. This iterated prediction can be seen as the discretization of a dynamic system.

Figure 2. Example of temperature variations and of spatial temperature interpolation using Kriging, at two different times of the year. Panel (a): temperatures (21k meters and weather stations) on 01-Jan-2011 05:45:00. Panel (b): temperatures (21k meters and weather stations) on 30-Jun-2011 00:00:00. Axes: longitude and latitude.
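The difference between the two strategies can be made concrete with a small sketch. The function names and the call signature of the predictors (a load history and one temperature forecast) are assumptions for illustration; the predictors themselves stand for any trained model from the following sections.

```python
def iterated_forecast(one_step, history, temps_ahead):
    """Apply a one-step-ahead predictor repeatedly.

    Each prediction is appended to the input window, so later horizons are
    computed from earlier predictions (and inherit their errors).
    """
    window = list(history)
    forecasts = []
    for temp in temps_ahead:            # temps_ahead[k] is the forecast for t+k+1
        nxt = one_step(window, temp)
        forecasts.append(nxt)
        window = window[1:] + [nxt]     # slide the window over the prediction
    return forecasts

def direct_forecast(predictors, history, temps_ahead):
    """Use one separately trained predictor per horizon.

    predictors[k] targets horizon k+1 and only ever sees observed history.
    """
    return [predictors[k](history, temp) for k, temp in enumerate(temps_ahead)]
```

Iterated forecasting needs a single model but compounds errors over the horizon; direct forecasting avoids the feedback at the price of training as many models as there are horizons.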

Temperature Regression and Load Residual

The simplest method for load forecasting relates the load to temperature. This is particularly relevant for residential and business-related consumption, where a significant portion of power usage might be due to electric heating in the winter and/or air conditioning in the summer.

In our data set, electricity was used to both heat and cool many buildings, in addition to gas heating. The total load first decreases with temperature and then increases, with the minimum occurring at or around 66 degrees Fahrenheit. We observed that this relationship varies slightly throughout the day. We investigated two approaches for load regression.

Figure 3. Direct prediction versus iterated prediction in a time series. Panel (a): iterated prediction on load with a 24-hour history of load values and the temperature at the prediction horizon. Panel (b): direct prediction on load at horizon h = 3.

The first used local polynomial regression, locally-weighted scatterplot smoothing (LOESS) [8], to fit a surface of load on temperature and time of day (see Figure 4). Specifically, for the fit at point x, a polynomial surface of degree 1 or 2 is fitted using points in a neighborhood of x, weighted by their distance from x, to minimize the least square error (LSE). The size of the neighborhood is controlled by a parameter α, chosen to be 0.2 in this situation for a balance between smoothness and goodness of fit. The MAPE for this fit is between six and seven percent for system-wide prediction with an average load of approximately 0.7M kWh (2007−2009 system-wide load) when the surface is fitted to the previous full year’s data. The fitted model is:

$$\log(L) = s(T, H) + \varepsilon$$

where $L$ is the hourly load, $T$ is the temperature, $H$ is the hour of day, $s$ is the fitted surface, and $\varepsilon$ is the residual. The log transformation is used here to make the distribution more Gaussian-like and to stabilize the variance, such that the subsequent modeling assumptions hold. Note that the residuals $\varepsilon$ are not independently and identically distributed (i.i.d.) and continue to exhibit a daily cyclic pattern. The SARIMA, SSM, and Holt-Winters (HWT) methods detailed in the following sections are applied to the residuals $\varepsilon$, not to the load time series.

A second method relies on fitting a cubic polynomial directly on the temperature values, using 24 sets of coefficients $\{a_i^{(H)}\}_{i=0}^{3}$, one for each hour $H$ of the day. Temperature regression using cubic polynomials is a simple benchmark for STLF [21].

$$L = a_0^{(H)} + a_1^{(H)} T + a_2^{(H)} T^2 + a_3^{(H)} T^3 + \varepsilon$$

Note that we may use the apparent temperature, or the wind-chill temperature, or an average of both, instead of the raw temperature. The apparent temperature (temperature taking into account the nonlinear “heat index” due to humidity) may improve the fit in some cases, particularly during the hot and humid summer season [36]. Similarly, the wind-speed dependent wind chill temperature may help for winter load forecasts. We make our choices based on cross-validation performance.
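The hourly cubic regression can be sketched with ordinary least squares, as below. This is an illustrative implementation rather than the one used in the study: the function names are hypothetical, and the polynomial is fitted in a rescaled temperature x = (T − 66) / 30, an assumption made purely for numerical conditioning, so the returned coefficients are those of the cubic in x, not the $a_i^{(H)}$ of the raw-temperature formula.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_cubic(temps, loads, center=66.0, scale=30.0):
    """Least-squares cubic in x = (T - center) / scale via the normal equations."""
    xs = [(t - center) / scale for t in temps]
    A = [[sum(x ** (i + j) for x in xs) for j in range(4)] for i in range(4)]
    b = [sum(l * x ** i for x, l in zip(xs, loads)) for i in range(4)]
    return solve(A, b)

def fit_hourly_cubics(samples):
    """samples: iterable of (hour, temperature, load); one cubic per hour."""
    by_hour = {}
    for hour, temp, load in samples:
        ts, ls = by_hour.setdefault(hour, ([], []))
        ts.append(temp)
        ls.append(load)
    return {hour: fit_cubic(ts, ls) for hour, (ts, ls) in by_hour.items()}

def predict(coeffs, hour, temp, center=66.0, scale=30.0):
    """Evaluate the fitted cubic for the given hour at one temperature."""
    x = (temp - center) / scale
    return sum(a * x ** i for i, a in enumerate(coeffs[hour]))
```

Each hour needs at least four distinct temperature values for its cubic to be identifiable; in practice a full year of hourly data provides several hundred samples per hour.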

Hobby et al. [20] study the residential energy consumption measured at an aggregate of all residential meters by separating the weather- and illumination-dependent load consumption from the residual consumption. To fit the weather- and illumination-dependent component, they use 24 cubic spline surfaces, one per hour of the day, indexed by apparent temperature and illumination. They observe a strong cubic dependency of load on temperature and an almost negligible linear term due to illumination.

Linear Time Series Approaches

Linear time series models directly exploit the historical values of the load, and enable us to make iterated load forecasts thanks to previously observed load values. Gross and Galiana [15] wrote the reference paper on short-term load forecasting using statistical linear time series models, in particular the auto-regressive moving average (ARMA) model. These models were later extended to cope with seasonality and non-stationarity in so-called seasonal auto-regressive integrated moving average (SARIMA) models. Further extensions have been made in the work of Soares and Medeiros [35], who compared a two-level seasonal auto-regressive model and a dummy-adjusted seasonal auto-regressive integrated moving average (DASARIMA) model on Brazilian electric load data.

Figure 4. Dependency among the temperature, the time of the day and the load modeled as a smooth surface. Load is expressed on the logarithmic scale and the temperature is taken one hour prior to the load value. Axes: time of day, lag 1 hr temperature (F), and log(load + 1).

Seasonal Auto-Regressive Integrated Moving Average Models

In seasonal auto-regressive integrated moving average (SARIMA) models, the seasonality component comes from the daily load cyclic pattern. In this paper we apply the SARIMA model to residuals from the LOESS fit (we refer to this method as “residual SARIMA”). We also considered the SARIMAX model, i.e., SARIMA with “exogenous” variables, namely temperature. However, the temperature coefficient is difficult to interpret and the model offers poor prediction accuracy compared to residual SARIMA. In contrast, residual SARIMA explicitly models the relationship between the time series and the exogenous variable. It is especially appealing when changes in exogenous variable(s) are concurrent with changes in the original time series, which is the case with temperature and power usage.

A SARIMA model has seven order parameters. We can write the model as:

$$\mathrm{SARIMA}(p, d, q) \times (P, D, Q)_S$$

$$\Phi_P(B^S)\,\phi_p(B)\,(1 - B)^d\,(1 - B^S)^D X_t = \Theta_Q(B^S)\,\theta_q(B)\,\varepsilon_t$$

where $B$ is the lag operator that satisfies:

$$B^i X_t = X_{t-i}$$

and $\Phi_P(B^S)$, $\Theta_Q(B^S)$ and $(1 - B^S)^D$ are the autoregressive, moving average and differencing parts for the seasonal component, while $\phi_p(B)$, $\theta_q(B)$ and $(1 - B)^d$ are the corresponding autoregressive, moving average and differencing parts for the non-seasonal component. $S$ is the period length ($S = 24$ with hourly load readings and a daily cyclic pattern).

The procedure of determining the order parameters follows Box-Jenkins procedures by examining the auto-correlation function (ACF) and partial ACF (PACF) of the differenced and original time series.

Investigating the order parameters on the one-year training data, we concluded that d = 1, D = 1, p = 0, P = 0, while q = 1 and Q = 1, essentially ignoring the auto-regressive component. Stationarity of the differenced data was checked using the Kwiatkowski, Phillips, Schmidt, and Shin (KPSS) test [2, 24]. The p-value was greater than 0.1, suggesting stationarity in the differenced data.

Note that for the residual SARIMA, shortening the training period for estimating the parameters of the model, from one year down to the last month immediately preceding the prediction (test) period, offered a better fit.
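As a worked example, the selected orders (p = 0, d = 1, q = 1, P = 0, D = 1, Q = 1, S = 24) can be expanded into an explicit recursion. The sign convention $\theta_1(B) = 1 + \theta_1 B$ and $\Theta_1(B^{24}) = 1 + \Theta_1 B^{24}$ is an assumption, since the text does not fix one; with the opposite convention the signs of $\theta_1$ and $\Theta_1$ flip.

```latex
% SARIMA(0,1,1) x (0,1,1)_{24}, assuming theta_1(B) = 1 + theta_1 B
\begin{align*}
(1 - B)(1 - B^{24})\, X_t &= (1 + \Theta_1 B^{24})(1 + \theta_1 B)\, \varepsilon_t \\
X_t - X_{t-1} - X_{t-24} + X_{t-25}
  &= \varepsilon_t + \theta_1 \varepsilon_{t-1}
   + \Theta_1 \varepsilon_{t-24} + \theta_1 \Theta_1 \varepsilon_{t-25} \\
% one-step-ahead forecast: the future innovation has zero expectation
\hat{X}_{t+1} &= X_t + X_{t-23} - X_{t-24}
   + \theta_1 \varepsilon_t + \Theta_1 \varepsilon_{t-23}
   + \theta_1 \Theta_1 \varepsilon_{t-24}
\end{align*}
```

The one-step forecast thus combines the latest value, the change observed at the same hour one day earlier, and three past innovations; here $X_t$ denotes the LOESS residual series to which the model is applied.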

State-Space Models

The state-space model (SSM) is an online adap- tive method for forecasting. SSMs introduce hidden (unknown) variables representing the quantity to be estimated. The main state-space model used across scientifi c disciplines is the Kalman fi lter. In their review paper, Pigazo and Moreno [30] described how the Kalman fi lter can predict electric load values from the previous load measurements, and then update that prediction using other regressors such as temperature data. Harvey and Koopman [19] modeled load time series through cubic spline interpolation on intra-daily and intra-weekly patterns, where the spline coeffi cients were time-varying and updated using a Kalman fi lter. Dordonnat et al. [10] defi ned a custom state-space that took into account calendar days and used it to predict nationwide French electric load. Taylor and McSharry [37] reformulated the state-space model as a multi-level linear time series model, which can handle weekly and daily seasonality in electric load.

State-space model on the spline fit of load residuals. The SSM in [19] does not require offline training and updates the model parameters in real time as each reading comes in. This method has been successfully applied to the online monitoring of time-varying network streams [4].

In that SSM, the computation for each update is inexpensive thanks to Kalman filtering, making it an ideal method for online forecasting. It uses B-splines to model the daily cyclic pattern, as the nonlinear trends in the load time series can be transformed into a linear model with respect to the spline basis.

Moreover, a cyclic spline basis ensures the periodic constraint (namely, the daily cyclic pattern of the load). We place K equally spaced knots, or K−1 spline bases to cover a full day (here K = 8 for 24 hourly load readings on a given day).

The state space model consists of two equations: the observation equation, which generates the load data from the hidden variable, and the state equation, which explains dynamics in the hidden (spline coefficient) data. The observation equation is:

$$\varepsilon_t = B\alpha_t + u_t, \qquad u_t \sim \mathcal{N}(0, \sigma_u I)$$

ε_t is the time series of load residuals over one day (day t); B is a 24 by K matrix of B-spline bases, each column corresponding to one spline; α_t is the vector of coefficients for the splines; u_t is a vector of i.i.d. Gaussian white noise with standard deviation σ_u. The vector α_t characterizes the daily pattern on day t.

To accommodate day-to-day variations in the daily pattern α_t, we use a random walk for the spline coefficients, specified by the state equation:

$$\alpha_t = \alpha_{t-1} + v_t, \qquad v_t \sim \mathcal{N}(0, \sigma_v I)$$

where the spline coefficients on day t are equal to those on day t−1, plus i.i.d. white noise of variance σ_v.

The above SSM is fitted online with a Kalman filter, such that the updating is done for each incoming data point. This ensures that forecasts are done in an online fashion. Hyper-parameters are estimated empirically by fitting them to spline coefficients for individual days.
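The online update described above can be sketched in Python as a Kalman filter that processes one hourly residual at a time. This is a simplified illustration under several assumptions that are not from the study: it uses a cyclic piecewise-linear (degree-1) B-spline basis rather than the basis used by the authors, the names `basis_row` and `SplineKalman` are hypothetical, and the variances are arbitrary placeholder values.

```python
def basis_row(hour, n_knots=8, period=24):
    """Row of B for one hour, using a cyclic piecewise-linear spline basis (assumption)."""
    spacing = period / n_knots
    position = (hour % period) / spacing
    left = int(position)
    frac = position - left
    row = [0.0] * n_knots
    row[left % n_knots] += 1.0 - frac
    row[(left + 1) % n_knots] += frac  # wraps around to enforce the daily cycle
    return row


class SplineKalman:
    """Random-walk state alpha_t = alpha_{t-1} + v_t, scalar observations b' alpha + u."""

    def __init__(self, n_knots=8, obs_var=1.0, state_var=0.01, init_var=1e3):
        self.k = n_knots
        self.obs_var = obs_var      # variance of the observation noise u
        self.state_var = state_var  # variance of the random-walk noise v
        self.alpha = [0.0] * n_knots
        self.P = [[init_var if i == j else 0.0 for j in range(n_knots)]
                  for i in range(n_knots)]

    def new_day(self):
        """State equation: the coefficients take one random-walk step per day."""
        for i in range(self.k):
            self.P[i][i] += self.state_var

    def predict(self, hour):
        """Predicted residual at the given hour from the current coefficients."""
        return sum(b * a for b, a in zip(basis_row(hour, self.k), self.alpha))

    def update(self, hour, residual):
        """Observation equation: correct the coefficients with one new reading."""
        b = basis_row(hour, self.k)
        Pb = [sum(self.P[i][j] * b[j] for j in range(self.k)) for i in range(self.k)]
        innovation_var = sum(b[i] * Pb[i] for i in range(self.k)) + self.obs_var
        gain = [v / innovation_var for v in Pb]
        innovation = residual - self.predict(hour)
        self.alpha = [a + g * innovation for a, g in zip(self.alpha, gain)]
        # P <- P - gain (b' P); P is symmetric, so b' P equals (P b)'.
        self.P = [[self.P[i][j] - gain[i] * Pb[j] for j in range(self.k)]
                  for i in range(self.k)]
```

Because the observation noise is assumed independent across hours, processing the 24 readings of a day one by one gives the same posterior as a single batch update, which is what makes the per-reading update inexpensive.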

We also applied this approach directly to the log-transformed load without the regression on temperature (results not reported here). The performance is slightly worse than using the residuals but still reasonable. This approach would work well if temperature forecasts were unavailable or unreliable.

Holt-Winters double seasonal exponential smoothing. The HWT model [37] is a variation on the state-space model designed specifically for data that have two seasonalities: an intra-day (24 h) seasonality, and an intra-week (168 h) seasonality. The state equations involve three state variables, essentially corresponding to the smoothing, daily and weekly effect in the data.

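$$
\begin{aligned}
\hat{y}_t(k) &= l_t + d_{t-m_1+k_1} + w_{t-m_2+k_2} + \phi^k e_t \\
e_t &= y_t - \left(l_{t-1} + d_{t-m_1} + w_{t-m_2}\right) \\
l_t &= l_{t-1} + \alpha e_t \\
d_t &= d_{t-m_1} + \delta e_t \\
w_t &= w_{t-m_2} + \omega e_t
\end{aligned}
$$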

In the above equations, ŷ is the estimated value of the load, l is the exponentially smoothed first-order auto-regressive component of the load, d is the intra-day seasonal component of the load (m₁ = 24 hours) and w is the intra-week seasonal component of the load (m₂ = 168 hours); finally, e is the exponentially decaying error term. The values of the state variables are initialized by running the model on about one month of data. The four coefficients α, δ, ω and φ are fitted by least-squares optimization (i.e., by minimizing the error between the actual observed load and the predicted load), and we use a simple heuristic search based on genetic algorithms to find their optimal values.
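The recursions above can be sketched in Python as follows. This is a minimal illustration, not the authors' Matlab implementation: the function name `hwt_forecast` is hypothetical, the state is initialized from the first week of data (a simplifying assumption, whereas the study uses about one month), and the seasonal indices k₁ and k₂ are taken to wrap around the daily and weekly cycles, which is also an assumption.

```python
def hwt_forecast(load, horizon, alpha, delta, omega, phi, m1=24, m2=168):
    """Run the double-seasonal recursions over `load`, then forecast `horizon` hours."""
    if len(load) <= m2:
        raise ValueError("need more than one full week of hourly history")

    # Assumed initialization: level, daily and weekly effects from the first week.
    first_week = load[:m2]
    level = sum(first_week) / m2
    daily = [sum(first_week[h::m1]) / (m2 // m1) - level for h in range(m1)]
    weekly = [first_week[h] - level - daily[h % m1] for h in range(m2)]

    error = 0.0
    for t in range(m2, len(load)):
        # e_t = y_t - (l_{t-1} + d_{t-m1} + w_{t-m2})
        error = load[t] - (level + daily[t % m1] + weekly[t % m2])
        level += alpha * error             # l_t = l_{t-1} + alpha e_t
        daily[t % m1] += delta * error     # d_t = d_{t-m1} + delta e_t
        weekly[t % m2] += omega * error    # w_t = w_{t-m2} + omega e_t

    last = len(load) - 1
    # y_hat_t(k) = l_t + d + w + phi^k e_t, using the latest seasonal states.
    return [
        level + daily[(last + k) % m1] + weekly[(last + k) % m2] + phi ** k * error
        for k in range(1, horizon + 1)
    ]
```

Storing the seasonal states in circular buffers indexed by `t % m1` and `t % m2` keeps only the most recent value for each hour of the day and each hour of the week, which is all the recursions need.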

Nonlinear Time Series Models

Machine learning (ML) techniques focus on learning a prediction function that takes as input the historical load and other data such as weather, and outputs the predicted load. Unlike the statistical methods reviewed in the previous section, the ML methods chosen in our study enable us to learn a nonlinear prediction function. Parametric machine learning techniques focus on tuning the parameters of the load prediction function. Khotanzad et al. [22] described a state-of-the-art implementation of neural networks for load forecasting, which has been used by several electrical companies. Fan and Chen [12] employed self-organizing maps to cluster the load and weather data into several regimes, before using them as inputs to a nonlinear regression function.


We focus in this paper on kernel-based methods, learning the relationship between data samples: in this case, each sample corresponds to a pair of historical load and weather data, taken over a short time interval, and the electric load at the next time point. We compared three standard, proven techniques: weighted kernel regression (WKR) [6], support vector regression (SVR) [7], and kernel ridge regression with learnable feature coefficients.

In addition to kernel methods, we investigated simple neural network models with one hidden layer. Although the latter achieved good performance at one-hour prediction horizons, they would perform poorly on iterated forecasts and the error would rapidly increase after a few iterations of the neural network predictor (results not reported).

Research on modeling dynamic systems using one hidden-layer neural networks showed indeed that these nonlinear models are very sensitive to noise and that they can generate predictions that diverge from the training set patterns. More complex neural network models that provide stable iterated predictions and are capable of learning long-term dependencies [1] are beyond the scope of this paper. In parallel, it has been proven experimentally that kernel methods such as SVR provide more stable iterated predictions on highly nonlinear time series than the basic embodiment of neural networks [26].

While they do not model long-term dependencies, they at least provide a solution that is bounded and stays within the patterns seen in the training set. This statement does not apply to more complex neural network architectures (that involve state space models and learning hidden representations of time series).

Weighted kernel regression. Weighted kernel regression (WKR) [28] is the simplest among the non-parametric regression algorithms. It consists of computing the Euclidean distance metric between the input sample x and each data point sample y(t) at time t in the training set and then using it in a Gaussian kernel function k(x,y(t)) that can be seen as a measure of symmetric “similarity” between the two samples x and y(t). The Gaussian kernel takes a value equal to one when x and y(t) are identical and therefore when their distance is equal to zero. The kernel function takes decreasing values down to zero as the input sample x becomes “dissimilar” from the training point y(t) and therefore as their distance increases.

$$k(x, y(t)) = \exp\left(-\frac{1}{2}\sum_{k=1}^{K}\frac{1}{\sigma^2}\left(x_k - y_k(t)\right)^2\right)$$

The kernel function is used as the weight of data point y(t) in the decision function (equation 2). The decision function is a weighted interpolation over the entire training dataset.

$$\hat{L} = \frac{\sum_{t} L_t \, k(x, y(t))}{\sum_{t} k(x, y(t))}$$

WKR assumes smoothness within the input data, controlled through a “spread” coefficient σ that depends on the dataset and is fitted by n-fold cross-validation on the training data. We resorted to five-fold cross-validation on five non-overlapping sets. More specifically, for each choice of hyperparameters, we used 80 percent of the training data to fit the model and the remaining 20 percent to compute the prediction performance, and repeated that step five times.
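The weighted interpolation and the five-fold selection of σ described above can be sketched in Python as follows. The function names, the use of contiguous folds, and the absolute-error criterion are assumptions made for this illustration; the study does not specify these details.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian similarity between two samples; equals 1 when they are identical."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * squared_distance / sigma ** 2)


def wkr_predict(x, train_inputs, train_loads, sigma):
    """Kernel-weighted average of the training loads."""
    weights = [gaussian_kernel(x, y, sigma) for y in train_inputs]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed to zero: fall back to the plain average.
        return sum(train_loads) / len(train_loads)
    return sum(w * load for w, load in zip(weights, train_loads)) / total


def cross_validate_sigma(inputs, loads, candidates, n_folds=5):
    """Pick the spread with the lowest held-out absolute error over the folds."""
    n = len(inputs)
    best_sigma, best_error = None, float("inf")
    for sigma in candidates:
        error = 0.0
        for fold in range(n_folds):
            lo, hi = fold * n // n_folds, (fold + 1) * n // n_folds
            fit_inputs = inputs[:lo] + inputs[hi:]   # about 80 percent of the data
            fit_loads = loads[:lo] + loads[hi:]
            for x, load in zip(inputs[lo:hi], loads[lo:hi]):  # remaining 20 percent
                error += abs(wkr_predict(x, fit_inputs, fit_loads, sigma) - load)
        if error < best_error:
            best_sigma, best_error = sigma, error
    return best_sigma
```

A very small σ makes the prediction follow the nearest training sample, while a very large σ drives it toward the average training load; cross-validation chooses a value between these extremes.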

Support vector regression. Support vector machines (SVMs) [9, 34] are a popular and efficient statistical learning tool that can be qualified as mostly non-parametric. SVMs are also called maximum margin classifiers, because their decision boundary is, by construction, as far as possible from the training data points, so that they remain well separated according to their labels. Maximum margin training enables better generalization of the classifier to unseen examples.

The work on support vector regression (SVR) by Chen et al. [7] was the winning entry to a competition on the prediction of electric load and can be considered a state-of-the-art method. SVR relies on the definition of a kernel function k(x,y(t)) and on a decision function f(x) for a sample x that is defined in terms of the kernel function between x and the data points in the training set, but involves only a minimal, sparse set of support vectors S = {y(t)} that are each given a weight αt. Learning in SVM corresponds to finding a minimal set S of support vectors that minimizes the error on the training labels.

$$\hat{L} = \sum_{t} L_t \, \alpha_t \, k(x, y(t))$$


We cross-validated the SVM’s regularization coefficient C as well as the Gaussian spread coefficient using five-fold cross-validation.

Kernel ridge regression. Kernel ridge regression is a generalized version of support vector regression. One can see it as a trivial extension of SVR, where the Gaussian spread coefficient is tuned for each input regressor (feature) separately using a gradient-descent optimization procedure and cross-validation [5]. This method, which we call sigma-SVR, differs from SVR by this simple equation:

$$k(x, y(t)) = \exp\left(-\frac{1}{2}\sum_{k=1}^{K}\frac{1}{\sigma_k^2}\left(x_k - y_k(t)\right)^2\right)$$
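A Gaussian kernel with one spread coefficient per input feature can be sketched in Python as follows. The function name `per_feature_gaussian_kernel` is hypothetical, and the sketch only evaluates the kernel for given spreads; the gradient-descent tuning of the spreads described above is not shown.

```python
import math


def per_feature_gaussian_kernel(x, y, sigmas):
    """Gaussian kernel where feature k is scaled by its own spread sigmas[k]."""
    if not (len(x) == len(y) == len(sigmas)):
        raise ValueError("x, y and sigmas must have the same length")
    return math.exp(-0.5 * sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, sigmas)))


# Made-up example with two features, e.g. a lagged load and a temperature forecast.
# A very large spread on the second feature makes the kernel nearly ignore it.
similarity = per_feature_gaussian_kernel([0.0, 0.0], [0.0, 5.0], [1.0, 1e6])
```

With identical spreads for all features, this reduces to the single-σ Gaussian kernel used for WKR and SVR; a large spread on one feature effectively down-weights that regressor.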

Short-Term Load Forecasting Results

In our investigations, we used the standard demand prediction metric, mean absolute percentage error (MAPE), which, for a set of N load values L_t (e.g., in watt-hours, Wh) and associated load forecasts L̂_t, is defined as:

$$\mathrm{MAPE} = \frac{1}{N}\sum_{t=1}^{N}\frac{\left|\hat{L}_t - L_t\right|}{\left|L_t\right|}$$

In the previously published STLF studies on city-wide and country-wide load forecasting, the MAPE was typically expected to range from one to three percent at next-hour forecast horizons to about four percent at next-day horizons.
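The MAPE computation can be sketched in a few lines of Python. The function name `mape` and the sample values are illustrative assumptions; the result is multiplied by 100 so that it reads as a percentage, matching how the values are reported in the text.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent, between loads and forecasts."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    ratios = [abs(f - a) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(ratios) / len(ratios)


# Made-up loads (Wh) and forecasts: errors of 10 percent and 5 percent.
example = mape([100.0, 200.0], [110.0, 190.0])
```

Note that the metric is undefined when an actual load value is zero, which can matter for small meter aggregates.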

System-Wide Predictions

In a first series of experiments, we compared the performance of three iterated predictors relying on nonlinear time series models based on kernel methods: weighted kernel regression (WKR), support vector regression (SVR), and sigma-SVR (sSVR) on system-wide load from 2007 to 2009. We would train the predictors on one year of load and weather forecasts, and make predictions for the following month, repeating this procedure 24 times for January 2008 through December 2009, averaging the MAPE performance, for each prediction horizon, over all 24 months. Our approach essentially simulated an STLF system retrained every month to fit mid- to long-term evolutions of the city-wide load consumption and of the climate.
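The monthly retraining schedule described above can be sketched in Python as a list of training and test windows. The function names are hypothetical, and the windows are expressed as half-open date ranges, which is a convention chosen for this example.

```python
from datetime import date


def add_months(day, months):
    """First day of the month that lies `months` months after `day`'s month."""
    index = day.year * 12 + (day.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def monthly_retraining_windows(first_test_month, n_months, train_months=12):
    """Return (train_start, test_start, test_end) tuples.

    Training covers [train_start, test_start); testing covers [test_start, test_end).
    """
    windows = []
    for i in range(n_months):
        test_start = add_months(first_test_month, i)
        windows.append(
            (add_months(test_start, -train_months), test_start, add_months(test_start, 1))
        )
    return windows


# 24 test months, January 2008 through December 2009, each preceded by one year of training.
windows = monthly_retraining_windows(date(2008, 1, 1), 24)
```

Averaging the per-month MAPE over these 24 windows, separately for each prediction horizon, gives the curves reported for the system-wide experiments.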

Unsurprisingly, as reported in Figure 5, the more complex kernel method, namely sSVR, which enabled us both to weigh each input feature (e.g., load at a specific time, time of day, temperature or humidity forecast) individually and to select the support vectors, achieved the best results: MAPE = 1.2 percent at the one-hour horizon and MAPE = 4.7 percent at h = 24 hours. Using the Steadman apparent temperature slightly outperformed using the raw temperature (decreasing the MAPE).

We then compared the performance of iterated sSVR to the direct prediction using sSVR, as well as to the remaining, linear, models, namely Holt-Winters double-exponential smoothing (HWT), state-space models with B-spline fit on load residue (SSM) and seasonal auto-regressive integrated moving average (SARIMA), all operating on the load residue after fitting the load on temperature and hour of the day (see “Temperature Regression and Load Residual”).

As can be seen in Figure 6 and in Figure 7 (the latter providing details on the system-wide aggregated load from 2012), the overall best algorithms were HWT and sSVR. HWT achieved MAPE = 4 percent performance at h = 24 on the 2008−2009 dataset, slightly outperforming sSVR. The performance on the aggregated (2012) dataset was worse, because the set of meters considered (32,000) was only a subset of the total city load. Figure 8 and Figure 9 show how these predictions actually look, at h = 1 and at h = 24 respectively.

Performance on Meter Aggregates

We observed that the load forecasting performance seemed to worsen for lower level aggregates and tried to verify the hypothesis that, independently of the method, aggregates with large forecast errors are those with very few meters. As can be seen in Figure 10, we trained about 400 STLF predictors on different meter aggregates (feeders, substations, and system-wide) and plotted the performance (MAPE at h = 1) versus the size of the meter aggregate (which we can measure, for instance, as the number of meters interconnected to that aggregate, or as the peak hourly load measured at that meter aggregate).

The MAPE would decrease as a function of meter aggregate size (the more meters in an aggregate, the better the MAPE). We hypothesize that aggregates connected to more meters tend to behave in a more predictable way: the effect of weather (temperature) is prominent and there is an averaging effect due to the large sample (hundreds or thousands) of meters.

Some meter aggregates (see Figure 11) can nevertheless be predicted relatively well, despite their small size (here 12 meters).

At the substation or system level, accurate forecasts can be useful input to strategic cost-saving decisions. At the level of individual meters, the utility is not interested in predicting precisely how much electricity will be used every hour, but rather in detecting large spikes of abnormal activity. Such abnormal usage spikes could be indicative of a system failure in the home (e.g., a malfunctioning heat pump), and could be useful information to the customer. Accurate forecasts can serve as baselines for detecting such anomalies.

Discussion

In this section, we discuss the practical considerations for the implementation and deployment of a load forecasting system, including modularity and parallelization, running time considerations, and robustness of the forecasts.

Independent STLF for Each Meter Aggregate

As explained previously, the meters, feeders, and substations considered in this study of a mid-sized U.S. city are interconnected in a hierarchical distribution network.

Figure 5. System-wide load forecasts using kernel methods for nonlinear time series modeling. These curves are the average of monthly MAPE performance over two years (2008–2009). [Plot: system-wide load predictions for Jan 2008 – Dec 2009, per-month MAPE averages; x-axis: prediction horizon (h); y-axis: MAPE (%). Series: sigmaSVR 24h load + temperature; sigmaSVR 24h load + Steadman temperature; sigmaSVR 24h load + temperature + humidity; WKR 8h load + temperature; WKR 8h load + Steadman temperature; SVR 8h load + temperature; SVR 8h load + Steadman temperature. MAPE: mean absolute percentage error; SVR: support vector regression; WKR: weighted kernel regression.]

Such a rich hierarchy invites a study of the correlations or even interdependencies among all metered electrical components. The obvious advantage is in exploiting redundancies among all the meters (as households in the same urban area and under identical climatic conditions might present similar load consumption profiles).

From a systems perspective, it may be desirable to make the load prediction component as modular as possible and to forecast load independently for each meter or load aggregate. In this study, all the predictions at the same level of aggregation are considered independent from the point of view of load forecasting, despite the correlations between each feeder connected to a given substation and the substation itself.

There are several justifications for this approach. First of all, the meters in our system often are updated asynchronously or even suffer downtimes, not necessarily related to power outages. It could therefore be very detrimental, for the operation of the entire system, to make it wait for synchronous meter updates. Here, we allow for asynchronous data updates and load forecasts within the prediction timeframe, which happens at a granularity of one hour.

Figure 6. System-wide load forecasts using various families of prediction algorithms, using the total load consumption of a mid-sized U.S. city. The curves represent average monthly MAPE performance over two years from 2008 to 2009. [Plot: system-wide load predictions for Jan 2008 – Dec 2009, per-month MAPE averages; x-axis: prediction horizon (h); y-axis: MAPE (%). Series: HWT on residual load from apparent temperature fit; iterated sigma-SVR using app. temp. and wind chill temp. and load; direct sigma-SVR using app. temp. and wind chill temp. and load; SSM on residual load from temperature fit at h−1; SARIMA on residual load from temperature fit at h−1. HWT: Holt-Winters model; SARIMA: seasonal auto-regressive integrated moving average; SSM: state-space model; SVR: support vector regression.]

Secondly, enforcing independence at each level of aggregation enables us to trivially parallelize the operation of the STLF modules for all the aggregates. Each module’s only points of input/output are database accesses to read the latest meter historical load data as well as the associated geo-specific weather data and to return forecasts at different horizons.

Running Time and Performance

The parallelism that is enabled by the independent STLF operations facilitates the implementation of our system in a multi-threaded environment.

Essentially, the process for generating hourly load forecasts at an aggregate level can be run as soon as all the hourly weather and meter data for the aggregate components have been collected. The system does not need to wait for the completion of all the prediction processes, each of which takes care of updating the database with forecasts independently.
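The independent, per-aggregate forecasting described above can be sketched with a thread pool in Python. Everything in this example is an assumption made for illustration: the in-memory `ForecastStore` stands in for the database, the aggregate names are invented, and the "forecast" is a trivial persistence rule rather than any of the models in this study (which run in Matlab and R).

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ForecastStore:
    """In-memory stand-in for the forecast database (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rows = {}

    def write(self, aggregate_id, forecasts):
        with self._lock:
            self._rows[aggregate_id] = forecasts

    def read(self, aggregate_id):
        with self._lock:
            return self._rows.get(aggregate_id)


def forecast_and_store(aggregate_id, history, store, horizons=(1, 24)):
    """One independent task: forecast a single aggregate and write its own results."""
    # Placeholder model: repeat the last observed load at every horizon.
    forecasts = {h: history[-1] for h in horizons}
    store.write(aggregate_id, forecasts)


def run_hourly_forecasts(histories, store, max_workers=4):
    """Submit one task per aggregate; tasks share nothing except the store."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(forecast_and_store, aggregate_id, history, store)
            for aggregate_id, history in histories.items()
        ]
    # Leaving the `with` block waits for the pool; re-raise any task failure here.
    for future in futures:
        future.result()
```

Because each task writes its own forecasts as soon as it finishes, a reader of the store can see results for fast aggregates before slow ones complete; only this demonstration function waits for the whole batch.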

For model development purposes, we have been using a 16-core 2.3 GHz Intel Xeon* Linux* server with 24 GB random access memory (RAM), running Ubuntu*. The deployment system is a 32-core, 128 GB RAM, Linux system running Red Hat. The sigma-SVR and HWT algorithms are implemented in Matlab (or its open source clone, Octave) and the SARIMA and SSM methods run in R.

Figure 7. Comparison of different load forecasting algorithms for 32,000 meter load aggregates. The curves represent average monthly MAPE performance over the six months from January to June 2012. [Plot: system-wide load predictions for Jan – Jun 2012, per-month MAPE averages; x-axis: prediction horizon (h); y-axis: MAPE (%). Series: HWT on residual load from apparent temperature fit at h−1; iterated sigma-SVR using app. temp. and wind chill temp. and load; direct sigma-SVR using app. temp. and wind chill temp. and load; SSM on residual load from temperature fit at h−1; SARIMA on residual load from temperature fit at h−1. SVR: support vector regression.]

Our system avoids major computational bottlenecks at runtime. The SARIMA, SSM, and HWT methods can make essentially instantaneous forecasts on the 400 or so meter aggregates. The kernel methods-based predictions by the sigma-SVR algorithm require, for each meter aggregate and for each prediction horizon (up to 168), a few matrix multiplications, with matrix dimensions on the order of 10,000. The latter can bring the computational time to several minutes, once per hour.

The largest computational requirements are due to training the prediction algorithms, which, as we explained, happens once a month. While the SARIMA and SSM methods are, again, negligible in terms of training time, it typically takes a few hours to cross-validate the state parameters of the HWT model and about one day to learn the feature and Lagrange coefficients of the sigma-SVR predictor. This is currently handled by scheduling learning for all the models over several days.

Ensemble Prediction

Given that we have four different prediction algorithms (HWT, sSVR, SARIMA, SSM), we can study methods for combining their predictions for potentially better accuracy and robustness to noise and random errors. We conjecture that possibility after observing that the predictions of the four algorithms have largely uncorrelated errors, as visible in Figure 8 for an example of system-wide load forecasts over one week at the one-hour horizon and in Figure 9, on the same data and time period, at a 24-hour horizon.

Figure 8. Predictions and prediction errors by four algorithms and a simple weather fit model (in gray) over one week in 2008. These plots show the predictions at horizon h = 1 hour. [Panels: (a) predictions, 1-hour ahead; (b) prediction errors, 1-hour ahead; x-axis: time, 02-05-2008 to 02-11-2008; y-axis: kWh. Series: Observed, Regression on Temp, HWT, sSVR, SARIMA, SSM. sSVR: sigma SVR; SVR: support vector regression.]

Systematically generated ensembles are used extensively in numerical weather forecasting [29].

Our approach, on the other hand, needs to work with a small ensemble, each of which has independent ability to achieve a certain level of accuracy. In this case, simple combination strategies are desirable.

We considered five simple schemes for combining the predictions:

1. Mean of four predictions,

2. Median of four predictions,

3. Switching among four predictions, using the one with the smallest absolute error at the time when the prediction is made,

4. Mean of HWT and sSVR, and

5. Switching between HWT and sSVR, using the one with the smallest absolute error at the time when the prediction is made.
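The five schemes above can be sketched in Python with a single helper. The function name `combine`, its arguments, and the numeric values are illustrative assumptions; in particular, "smallest absolute error at the time when the prediction is made" is interpreted here as the most recently observed absolute error of each predictor.

```python
from statistics import mean, median


def combine(predictions, recent_abs_errors, scheme, members=None):
    """Combine forecasts from several predictors for one target hour.

    predictions:       dict mapping predictor name to its forecast
    recent_abs_errors: dict mapping predictor name to its latest observed absolute error
    scheme:            "mean", "median" or "switch"
    members:           optional subset of predictor names to combine
    """
    names = list(members) if members else list(predictions)
    if scheme == "mean":
        return mean(predictions[name] for name in names)
    if scheme == "median":
        return median(predictions[name] for name in names)
    if scheme == "switch":
        best = min(names, key=lambda name: recent_abs_errors[name])
        return predictions[best]
    raise ValueError("unknown scheme: " + scheme)


# Made-up forecasts and latest absolute errors for one hour.
preds = {"HWT": 100.0, "sSVR": 104.0, "SARIMA": 96.0, "SSM": 110.0}
errors = {"HWT": 3.0, "sSVR": 1.0, "SARIMA": 5.0, "SSM": 4.0}

scheme_1 = combine(preds, errors, "mean")
scheme_2 = combine(preds, errors, "median")
scheme_3 = combine(preds, errors, "switch")
scheme_4 = combine(preds, errors, "mean", members=("HWT", "sSVR"))
scheme_5 = combine(preds, errors, "switch", members=("HWT", "sSVR"))
```

None of these schemes requires additional training, which keeps the ensemble as inexpensive at runtime as the individual predictors.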

We summarize in Figure 12 the performance of these algorithms and the combined predictions for the system-wide aggregates from 2008 to 2009. On the system-wide data, the mean of the HWT and sSVR predictions reaches about MAPE = 3 percent at a 24-hour prediction horizon, down from about four percent achieved by HWT alone. We can see that by most of the performance criteria considered, either the mean or the median of the four predictors gives the best performance, and it is better than the best individual method except for the horizon of one hour ahead (which is best done by sSVR). Further investigation will examine to what extent this observation generalizes to smaller meter aggregates.

Figure 9. Predictions and prediction errors by four algorithms and a simple weather fit model (in gray) over one week in 2008. These plots show the predictions at horizon h = 24 hours. [Panels: (a) predictions, 24-hour ahead; (b) prediction errors, 24-hour ahead; x-axis: time, 02-05-2008 to 02-11-2008; y-axis: kWh. Series: Observed, Regression on Temp, HWT, sSVR, SARIMA, SSM. sSVR: sigma SVR; SVR: support vector regression.]

Conclusion

We methodically evaluated state-of-the-art STLF methods on a unique dataset consisting of load aggregates from individual meters, and showed a dependency of the load forecasting performance on the size of the aggregate. In this study, we considered load forecasting at each meter aggregate as an independent task, and did not fully exploit the pyramidal structure of the meter-feeder-substation network. Future investigations could explore such hierarchical time series prediction.

Figure 10. Relationship between load forecasting accuracy and the size of the load aggregate (i.e., the number of meters connected to the electrical structure). The monthly MAPE performance has been averaged over six months, from January to June 2012. [Plot: 1-hour ahead predictions at aggregate level; x-axis: number of meters connected to the aggregate (log10 scale); y-axis: MAPE (%) (log10 scale). Series: selected feeder, system-wide, HWT, sigma-SVR. MAPE: mean absolute percentage error; SVR: support vector regression.]

Acknowledgements

The authors wish to acknowledge the help and contribution of former and current members of Alcatel-Lucent Bell Labs: Gary Atkinson, Kenneth Budka, Jayant Deshpande, Frank Feather, Zhi He, Marina Thottan and Kim Young Jin, as well as the