Badar ul Islam
*, Umer Ijaz , Muhammad Tahir Raza Department of Electrical Engineering
NFC Institute of Engineering and Fertilizer Research, Faisalabad
*
Corresponding Author: [email protected]
Abstract—Selection of best model parameters is a key to building an effective neural network based prediction model. This paper focuses on short term load forecasting (STLF) framework emphasizing on a new hybrid data selection preprocessing and selection technique to enhance forecast accuracy. A Multilayer perceptron neural network (MLPNN) is used to predict the load demand of one week ahead. Because of the mapping and memorizing ability of the artificial neural network (ANN) towards the non-linear relations between input and output variables, they are widely used in predictive modeling, such as, electrical load demand prediction models. The non-linear behaviour of the load can only be narrated effectively by ANN, if a group of appropriate inputs is identified prior to the training process. A new input variable selection method based on statistical methods and evolutionary computation technique genetic algorithm (GA) is proposed in this research. A case study is developed to verify the efficacy of the optimized training dataset and input variable selection method. In this approach, we trained the ANN with conventional back propagation (BP) training algorithm with and without using the optimized and reduced dataset. Results show that this dataset selection method is effective and forecasts electrical load demand with reduced mean absolute percentage error (MAPE) and reduced training time.
Keywords — Artificial Neural Network , Correlation Analysis, Genetic Algorithm, Hybrid Intelligent System, Particle Swarm, Performance Analysis
I. I NTRODUCTION
Electrical load demand forecasting is critically important because of its significance in energy management systems for providing a better environment for future planning and decisions [1] [2]. These forecasts are classified into short-term, medium- term and long-term forecasts with respect to planning horizon’s duration. Among the others, short term load forecast is important because it helps the operation managers in their day to day operational decisions, such as economic scheduling of generating capacity, arrangement of fuel purchase, estimation of peak demand and system security assessments [3] [4] [5].
Although a lot of statistical and computationally intelligent methods are adopted by the researchers for these forecasts, yet ANN are most frequently used in the development of predictive models for short term load forecast as reported in the published work during the past few years [1] [6] [7]. The effectiveness of the forecast model is based on multiple factors, such as selection of ANN architecture, choice of training algorithm and input dataset used for the training and testing of the network [7]. Besides the other factors, the selection of suitable input variables and training dataset for ANN is focused in this paper.
The objective of this research is to find a technique for the selection of the input variables and training and testing dataset that leads to the minimum forecast error and reduced computational time for ANN based load forecast models. In general, an increase in the number of inputs of a model network may result in the form of a wider training capacity and better forecasting accuracy. However, the erroneous choice of additional input variables and bad data would unnecessarily increase the forecast error and computational time [8] [9].
Although, ANN has the ability to map and store the non-linear relations between inputs and outputs, but if a group of non- relevant variables are selected as an input variable set, the training time of ANN is increased and the error rates are also enhanced [9]. The nonlinear relation of the load can be effectively explicated only when a group of appropriate input variables is identified prior to the training process of ANN. In the first stage of the research, correlation coefficient method and data dispersion analysis are employed for the selection of suitable input variables for ANN based 168 hours-ahead forecast model.
The GA is used to select the best individuals among the remaining samples in the second stage on the basis of minimum mean absolute error (MAE) used as an objective function.
GA is a guided heuristic search method [10], which was originally launched by John Holland in 1970s at the University of Michigan. The main inspiration is adopted from the natural system of evolution for the designing of an artificial system that
Hybrid Input Data Selection Technique for
Neural Network Based Load Forecast Model
retains its robustness and adoption properties [11]. Since the discovery of GA, these techniques are consistently improved by other researchers and are now frequently used for solving many optimization problems related with engineering, business and social science etc. [10] [11] [12]. GA implements the biological processes to perform a random search in a defined N-dimensional search space [11]. It is extremely important to explore and suggest the best available solution among the available set of all the possibilities in an optimization problem. GA work with a population that is represented by a set of artificial elements. In the process of GA based optimization, the major entities include, an individual element or a chromosome [12], a single bit or gene, new population or offsprings, an old population or parents and genetic operators. In an iterative process, the genetic operators are applied in a systematic manner on the parents to produce offsprings in a new generation [10] [11].
Several attempts have been made for the identification of appropriate input dataset for the training and testing of ANN based models. Most of these efforts emphasize on the correlation analysis of the data, however some of the researchers also focused on the combination of mathematical formulation to resort the derived variables by squaring, averaging, adding or differencing the data sequences to determine the appropriate input variables [9] [13]. These mathematical and statistical techniques for the manipulation of data have a common attribute that they attempt to identify almost completely linear relationships among a set of input variables and their dependent variables [14]. Because of their linear and consistent approach, these methods are unable to track the unusual and brisk variations occurring in the real time input data [13] [14].
A variety of input variables have been tried out for ANN based STLF, such as a historical load of multiple intervals, meteorological variables and economic variables [15] [16].
Normally, hit and trial methods of deploying multiple combinations of above mentioned variables are reported in the literature in conjunction with the multiple neural network topologies [13], however some researchers implemented computational intelligent techniques in the data selection process [14] [17] [18] [19] [20]. In addition to the techniques used for load forecasting models many STLF methods have been developed and applied for these forecasts. Some of these efforts include; bagged neural networks based STLF [21], the integration of multiple intelligent methods for STLF [22], incorporation of fractal interpretation and wavelet transforms [23] [24]. In [25], the authors introduced and compared multiple hybrid techniques based on artificial intelligence (AI) for STLF application.
Moreover, AI based methods for micro-grid environment are also designed and reported by some researchers [26] [27].
The proposed research provides not only a way to identify a set of suitable inputs for STLF models but also highlights the appropriateness of the individual input variables in that set. The improved forecast results in terms of reasonable error and less
training time of the model show that the input variable selection approach is better than the existing methods.
II. D ATA A NALYSIS AND P REPROCESSING
Five year historical data of system load and weather variables of the State of Victoria and New South Wales of Australia (from year 2006 to 2010) at half hour sampling frequency are deployed in the experimentation. The data is comprised of electrical load and four meteorological variables including dry bulb temperature, wet bulb temperature, dew point and humidity. To preprocess the data an indigenous data analysis is required. Besides the quantitative analysis, the analysis of data distribution in certain ranges and correlation analysis are executed to determine the importance and criticality of the data.
A. Quantitative Analysis
A comprehensive quantitative analysis is conducted to study different tendencies of the data during this first stage of experimentation. The analysis reports statistical details of the data such as, minimum and maximum values, mean, mode, median, variance and standard deviation as shown in Table 1.
B. Data Dispersion Analysis
System load values are divided into nineteen different slabs to analyze the frequency of occurrence of each sample in a certain range. The analysis returns a percentage of individual and cumulative frequency of occurrence of samples from a certain system load range. This dispersion and repetition analysis of the system load is beneficial for finding the data samples which are repeated frequently and have the maximum influence in the prediction process. This analysis of the system load is depicted in Table 2. Fig. 1 reflects the frequency of occurrence of a certain load data range along with its cumulative frequency in graphical order.
This dispersion and repetition analysis is conducted for all other variables and the criteria of selecting the dataset is same as that of load. The information of data selection is extracted from the load series and implemented on other variables.
TABLE1:
Q
UANTITATIVEA
NALYSIS OFD
ATAS
ETParameters
Input Variables
Sys.
Load
Dry Bulb
Dew Point
Wet Bulb
Humidit y Min Value 5397.36 6 2.6 -8.4 3.8 Max Value 13289.15 102 24.1 24.2 43.9 Mean 8924 68.90 14.87 11.92 18.25 Median 8875.32 69 14.2 12.45 18.5
Mode 9320.38 68 17.9 12.3 22
Range 8347.71 93 24.1 32.6 41.9
Variance 1985410 214.12 18.42 29.88 23.93
TABLE2:
F
REQUENCY OFO
CCURRENCEO
FL
OADR
ANGESLoad Ranges Frequency of Occurrence
Cumulative Percentage
5500-5999 1 0.00%
6000-6499 235 0.27%
6500-6999 3393 4.14%
7000-7499 6653 11.73%
7500-7999 6162 18.76%
8000-8499 7507 27.33%
8500-8999 10355 39.14%
9000-9499 9655 50.16%
9500-9999 12887 64.86%
10000-10499 12398 79.01%
10500-10999 7860 87.97%
11000-11499 4815 93.47%
11500-11999 2820 96.68%
12000-12499 1570 98.48%
12500-12999 795 99.38%
13000-13499 364 99.80%
13500-13999 134 99.95%
14000-14499 37 99.99%
14500-14999 6 100.00%
Fig. 1.Load data ranges, frequency of occurrence for each range and their cumulative percentages.
C. Correlation Analysis
For the selection of suitable inputs variables for ANN based STLF models, correlation coefficient method has been frequently used by the researchers [9] [16]. The method establishes an authenticated relationship among all input variables in the available dataset. Generally, a higher value of correlation coefficient indicates a strong relationship among these variables.
The correlation coefficient r can be determined as follows:
(1)
where, x and y are two variables, n stands for the number of data pairs, the outcome r lies in the range of -1 and +1. In case, the x and y vectors are closely related, there will be a high correlation coefficient value and r will be closer to 1. The positive and negative signs are used to indicate positive linear coefficient and negative linear coefficient respectively. Moreover, if the value of r is near +1, it shows a strong positive correlation and if this value is near -1, it represents negative correlation. If there is no correlation or weak correlation, r is close to 0. To find the significance of the input variables, the correlation analysis is carried out among all the expected input variables and load data.
These results are shown as the following:
Load Related Variables:
Previous week (w-1), same time and day: 0.877 Previous day (d-1), same time: 0.883
Previous hour (h-1), same day: 0.913 Two hours earlier (h-2), same Day: 0.910 Weather Related Variables:
Temperature (dry bulb) : 0.125 Temperature (wet bulb) : -0.034 Dew point : -0.145
Humidity : -0.129
It can be seen from these results, that the load related variables have higher and positive correlation; in particular, h-1 and h-2 have higher values of r. Whereas, the weather variables have considerably lesser correlation values and some of them are negatively correlated. On the other hand, humidity, dew point and wet bulb temperature depict a negative correlation. Among these meteorological variables, temperature (dry bulb) shows its significance as a positively correlated variable with a reasonable value of r.
D. Data Normalization
To restrict the values of all the variables in the range of 0 and 1, they are normalized prior to applying them in the forecast model.
E. Performance Analysis
Multiple performance indices are used to evaluate the
authenticity of the model. Among them, the coefficient of
determination, R
2, provides the values of the variance of one
variable that can be predicted by another variable. The value of
R
2ranges between 0 and 1 and the forecast accuracy is
considered to be high as it approaches to 1 [16]. This performance index can be calculated as:
2 ( - )( - )
2 1
2 2
( - ) ( - )
1 1
N
P P A A t t R t
N N
P P t t A A
t t
(2)
Other performance evaluation measures used in this research include, mean absolute percentage error (MAPE), root mean square error (RMSE), and mean absolute error (MAE). These performance indices can be determined as:
1 2
( ) 1 N RMSE A t t P
N i (3)
1 100%
1 N A P t t MAPE N i At
(4)
1 1 N MAE A t t P
N t (5) where, A
tand P
tare the actual and forecasting values at time point t, whereas, A and P are the mean of actual and forecasting values respectively.
As MAE, MAPE and RMSE are used to measure the difference between actual and forecasted load values, the smaller values of theses performance indicators will lead to better forecasting performance [16]. However, if MAPE is employed as a benchmark the inconsistency problem in the results can be avoided, because MAPE is relatively more stable as compared to the other performance measures [16] [17].
III. M ETHODOLOGY A. Probability and Discard Information
All the input variables present in the data set are classified in multiple ranges as shown in Table 2. The selection process of data starts on the basis of frequency of occurrence and effect of correlation with respect to the forecast error. Higher correlation and frequency of occurrence of the data samples is the primary information which is used for data selection based on probability and discard information provided in Table 3 during the first phase of data selection. The maximum discard percentage relates with the range 1 that has maximum number of samples. The discard percentage decreases by 10 % in next range and stops at after nine stages. This inverse probability and selection technique ensures that the data samples from each range are present in the
final data set. The presence of data samples related to all ranges is necessarily required for effective mapping between inputs and target outputs during the ANN training procedure. Another important consideration is to discard the data samples having the repeated values because the similar data play an insignificant role in the training process of ANN. The flow diagram of the hybrid approach for the proposed input variable selection technique is depicted in the Fig. 3
B. Genetic Manipulation of Data
The remaining data samples are subjected to ANN to formulate a set of factors based on various input combinations that contain the correlation and data distribution impact of inputs and output that is constituted on the basis of forecast error. The forecast load values, which have a higher correlation coefficient and maximum number of occurrences with expectation output values, are chosen from this factor set as input variables. By means of this method, a preferable input variables set can be determined. The accuracy of the forecast results depends on the correlation along with the frequency of repeated occurrence of data point in a certain range.
T
ABLE3: P ROBABILITY AND D ISCARD I NFORMATION FOR D ATA S ELECTION
Load Data Ranges Discard
Percentage Range 1 (maximum number of samples) 90 % Range 2 (number of samples less than range 1) 80 %
. .
. .
Range 9 (number of samples less than range 8)
. Range N (minimum number of samples)
10 % 10%
10%
=
=
=
=
START
Divide all the input variable data samples into multiple ranges according to their frequency of occurrence (Table 2)
Find correlation impact of all data ranges. Select the data samples from each range according to the probability and
discard information provided (Table3)
Apply genetic algorithm on the remaining samples for the selection of most influential data set, choose 20% best
samples where fitness function is MAE
Training & Testing of ANN using BP Algorithm with actual and optimized data set
Compare the results of both data sets
Performance with optimized data set is better?
END
NO
YES Historical Load and
Weather Data