Hybrid Input Data Selection Technique for Neural Network Based Load Forecast Model

Badar ul Islam*, Umer Ijaz, Muhammad Tahir Raza

Department of Electrical Engineering, NFC Institute of Engineering and Fertilizer Research, Faisalabad

*Corresponding Author: [email protected]

Abstract—Selecting the best model parameters is key to building an effective neural network based prediction model. This paper focuses on a short term load forecasting (STLF) framework, with emphasis on a new hybrid data preprocessing and selection technique that enhances forecast accuracy. A multilayer perceptron neural network (MLPNN) is used to predict the load demand one week ahead. Because artificial neural networks (ANN) can map and memorize the non-linear relations between input and output variables, they are widely used in predictive modeling, such as electrical load demand prediction. The non-linear behaviour of the load can be captured effectively by an ANN only if a group of appropriate inputs is identified prior to the training process. A new input variable selection method based on statistical methods and an evolutionary computation technique, the genetic algorithm (GA), is proposed in this research. A case study is developed to verify the efficacy of the optimized training dataset and input variable selection method. In this approach, the ANN is trained with the conventional back propagation (BP) training algorithm with and without the optimized and reduced dataset. Results show that this dataset selection method is effective and forecasts electrical load demand with reduced mean absolute percentage error (MAPE) and reduced training time.

Keywords — Artificial Neural Network, Correlation Analysis, Genetic Algorithm, Hybrid Intelligent System, Particle Swarm, Performance Analysis

I. INTRODUCTION

Electrical load demand forecasting is critically important in energy management systems because it supports future planning and decisions [1] [2]. These forecasts are classified into short-term, medium-term and long-term forecasts according to the duration of the planning horizon. Among these, the short term load forecast is important because it helps operation managers in their day to day operational decisions, such as economic scheduling of generating capacity, arrangement of fuel purchases, estimation of peak demand and system security assessments [3] [4] [5].

Although many statistical and computationally intelligent methods have been adopted by researchers for these forecasts, ANN are the most frequently used in the development of predictive models for short term load forecasting, as reported in work published during the past few years [1] [6] [7]. The effectiveness of the forecast model depends on multiple factors, such as the selection of the ANN architecture, the choice of training algorithm and the input dataset used for training and testing the network [7]. Among these factors, this paper focuses on the selection of suitable input variables and the training dataset for the ANN.

The objective of this research is to find a technique for selecting the input variables and the training and testing dataset that leads to the minimum forecast error and reduced computational time for ANN based load forecast models. In general, increasing the number of inputs of a network may provide a wider training capacity and better forecasting accuracy. However, an erroneous choice of additional input variables and bad data would unnecessarily increase the forecast error and computational time [8] [9].

Although an ANN has the ability to map and store the non-linear relations between inputs and outputs, if a group of irrelevant variables is selected as the input variable set, the training time of the ANN increases and the error rates also rise [9]. The nonlinear relation of the load can be effectively captured only when a group of appropriate input variables is identified prior to the training process of the ANN. In the first stage of the research, the correlation coefficient method and data dispersion analysis are employed to select suitable input variables for the ANN based 168 hours-ahead forecast model.

In the second stage, the GA selects the best individuals among the remaining samples, using the minimum mean absolute error (MAE) as the objective function.

GA is a guided heuristic search method [10], which was originally introduced by John Holland in the 1970s at the University of Michigan. The main inspiration is taken from the natural system of evolution, for the design of an artificial system that retains its robustness and adaptation properties [11]. Since the introduction of GA, these techniques have been consistently improved by other researchers and are now frequently used for solving many optimization problems in engineering, business, social science, etc. [10] [11] [12]. GA implements biological processes to perform a random search in a defined N-dimensional search space [11]. In an optimization problem, it is extremely important to explore and suggest the best solution among the set of all possibilities. GA works with a population that is represented by a set of artificial elements. In GA based optimization, the major entities include an individual element or chromosome [12], a single bit or gene, the new population or offspring, the old population or parents, and the genetic operators. In an iterative process, the genetic operators are applied in a systematic manner to the parents to produce offspring in a new generation [10] [11].

Several attempts have been made to identify an appropriate input dataset for the training and testing of ANN based models. Most of these efforts emphasize the correlation analysis of the data; however, some researchers have also focused on mathematical formulations that derive new variables by squaring, averaging, adding or differencing the data sequences to determine the appropriate input variables [9] [13]. These mathematical and statistical techniques for the manipulation of data share a common attribute: they attempt to identify almost completely linear relationships between a set of input variables and their dependent variables [14]. Because of their linear and consistent approach, these methods are unable to track the unusual and abrupt variations occurring in real time input data [13] [14].

A variety of input variables have been tried out for ANN based STLF, such as historical load at multiple intervals, meteorological variables and economic variables [15] [16].

Normally, trial and error methods that deploy multiple combinations of the above mentioned variables are reported in the literature in conjunction with multiple neural network topologies [13]; however, some researchers have implemented computationally intelligent techniques in the data selection process [14] [17] [18] [19] [20]. In addition to these data selection techniques, many STLF methods have been developed and applied for these forecasts. Some of these efforts include bagged neural networks based STLF [21], the integration of multiple intelligent methods for STLF [22], and the incorporation of fractal interpretation and wavelet transforms [23] [24]. In [25], the authors introduced and compared multiple hybrid techniques based on artificial intelligence (AI) for STLF applications.

Moreover, AI based methods for micro-grid environment are also designed and reported by some researchers [26] [27].

The proposed research not only provides a way to identify a set of suitable inputs for STLF models but also highlights the appropriateness of the individual input variables in that set. The improved forecast results, in terms of reasonable error and shorter training time of the model, show that the input variable selection approach is better than the existing methods.

II. DATA ANALYSIS AND PREPROCESSING

Five years of historical system load and weather data for the Australian states of Victoria and New South Wales (2006 to 2010), sampled every half hour, are used in the experiments. The data comprise the electrical load and four meteorological variables: dry bulb temperature, wet bulb temperature, dew point and humidity. Preprocessing the data requires an in-depth data analysis. Besides the quantitative analysis, an analysis of the data distribution in certain ranges and a correlation analysis are carried out to determine the importance and criticality of the data.

A. Quantitative Analysis

A comprehensive quantitative analysis is conducted to study different tendencies of the data during this first stage of experimentation. The analysis reports statistical details of the data, such as minimum and maximum values, mean, mode, median, variance and standard deviation, as shown in Table 1.
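The statistics listed above can be computed with a few lines of standard-library Python. This is a minimal sketch rather than the authors' procedure: the function name `summarize` and the sample readings are hypothetical, and the use of the population variance (rather than the sample variance) is an assumption, since the paper does not say which one is reported.

```python
import statistics

def summarize(values):
    """Return the descriptive statistics named in the quantitative analysis."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        # Assumption: population variance and standard deviation.
        "variance": statistics.pvariance(values),
        "std_dev": statistics.pstdev(values),
    }

# Hypothetical half-hourly load readings in MW (illustrative values only).
sample_load = [8000.0, 8500.0, 8500.0, 9000.0, 9500.0]
stats = summarize(sample_load)
```

The same function can be applied to each weather variable in turn to fill one column of a table such as Table 1.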

B. Data Dispersion Analysis

System load values are divided into nineteen ranges to analyze how frequently samples occur in each range. The analysis returns the individual frequency of occurrence and the cumulative percentage of samples for each system load range. This dispersion and repetition analysis of the system load is beneficial for finding the data samples which are repeated frequently and have the maximum influence in the prediction process. This analysis of the system load is given in Table 2. Fig. 1 shows the frequency of occurrence of each load data range along with its cumulative frequency in graphical form.

This dispersion and repetition analysis is conducted for all other variables, and the criteria for selecting the dataset are the same as those for the load. The data selection information is extracted from the load series and applied to the other variables.
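The dispersion analysis amounts to binning the load values and accumulating the counts. The sketch below assumes 500 MW wide ranges, as in Table 2; the function name `dispersion_table` and the sample values are hypothetical.

```python
from collections import Counter

def dispersion_table(loads, bin_width=500):
    """Count samples per load range and accumulate the cumulative percentage.

    Returns rows of (lower bound, upper bound, frequency, cumulative %).
    """
    # Map every value to the lower edge of its range, e.g. 5600 -> 5500.
    counts = Counter(int(v // bin_width) * bin_width for v in loads)
    total = len(loads)
    rows, running = [], 0
    for lower in sorted(counts):
        running += counts[lower]
        rows.append((lower, lower + bin_width - 1, counts[lower],
                     100.0 * running / total))
    return rows

# Hypothetical load samples in MW (illustrative values only).
sample_load = [5600, 6100, 6200, 6700, 6900, 6950, 7100, 7400]
table = dispersion_table(sample_load)
```

Each returned row corresponds to one line of a frequency table in the style of Table 2.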

TABLE 1: QUANTITATIVE ANALYSIS OF DATA SET

Parameters    Sys. Load    Dry Bulb    Dew Point    Wet Bulb    Humidity
Min Value     5397.36      6           2.6          -8.4        3.8
Max Value     13289.15     102         24.1         24.2        43.9
Mean          8924         68.90       14.87        11.92       18.25
Median        8875.32      69          14.2         12.45       18.5
Mode          9320.38      68          17.9         12.3        22
Range         8347.71      93          24.1         32.6        41.9
Variance      1985410      214.12      18.42        29.88       23.93


TABLE 2: FREQUENCY OF OCCURRENCE OF LOAD RANGES

Load Ranges    Frequency of Occurrence    Cumulative Percentage
5500-5999      1                          0.00%
6000-6499      235                        0.27%
6500-6999      3393                       4.14%
7000-7499      6653                       11.73%
7500-7999      6162                       18.76%
8000-8499      7507                       27.33%
8500-8999      10355                      39.14%
9000-9499      9655                       50.16%
9500-9999      12887                      64.86%
10000-10499    12398                      79.01%
10500-10999    7860                       87.97%
11000-11499    4815                       93.47%
11500-11999    2820                       96.68%
12000-12499    1570                       98.48%
12500-12999    795                        99.38%
13000-13499    364                        99.80%
13500-13999    134                        99.95%
14000-14499    37                         99.99%
14500-14999    6                          100.00%

Fig. 1. Load data ranges, frequency of occurrence for each range and their cumulative percentages.

C. Correlation Analysis

For the selection of suitable input variables for ANN based STLF models, the correlation coefficient method has been frequently used by researchers [9] [16]. The method quantifies the relationship among the input variables in the available dataset. Generally, a higher value of the correlation coefficient indicates a stronger relationship among these variables.

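The correlation coefficient r can be determined as follows:

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} \qquad (1) $$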

where x and y are the two variables and n is the number of data pairs. The outcome r lies between -1 and +1. The sign of r indicates a positive or negative linear correlation. If the value of r is near +1, it shows a strong positive correlation, and if it is near -1, it represents a strong negative correlation. If there is no correlation or only a weak correlation, r is close to 0. To find the significance of the input variables, the correlation analysis is carried out between each candidate input variable and the load data.

These results are as follows:

Load Related Variables:
Previous week (w-1), same time and day: 0.877
Previous day (d-1), same time: 0.883
Previous hour (h-1), same day: 0.913
Two hours earlier (h-2), same day: 0.910

Weather Related Variables:
Temperature (dry bulb): 0.125
Temperature (wet bulb): -0.034
Dew point: -0.145
Humidity: -0.129

It can be seen from these results that the load related variables have high positive correlations; in particular, h-1 and h-2 have the highest values of r. The weather variables have considerably lower correlation values: humidity, dew point and wet bulb temperature show a negative correlation, whereas dry bulb temperature is the only meteorological variable that is positively correlated with the load (r = 0.125).
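To illustrate how such coefficients can be obtained, the sketch below computes r between a load series and a lagged copy of itself, assuming the standard Pearson product-moment form. The helper names `pearson_r` and `lagged_correlation` are hypothetical. With half-hourly sampling, the lags for h-1, d-1 and w-1 would be 2, 48 and 336 samples respectively.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) *
                            (n * sum_yy - sum_y ** 2))
    return numerator / denominator

def lagged_correlation(series, lag):
    """Correlate each value with the value `lag` samples earlier (lag >= 1)."""
    return pearson_r(series[lag:], series[:-lag])

# Hypothetical, perfectly linear series used only to check the helpers.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3], [3, 2, 1])
r_lag1 = lagged_correlation([1, 2, 3, 4, 5, 6], 1)
```

For a weather variable, `pearson_r` would be called directly on the load series and the weather series of the same length.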

D. Data Normalization

To restrict the values of all the variables in the range of 0 and 1, they are normalized prior to applying them in the forecast model.
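The paper does not state the normalization formula. A common choice that maps values to the range 0 to 1 is min-max scaling, sketched below as an assumption; the function names are hypothetical.

```python
def min_max_normalize(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant series")
    return [(v - lo) / (hi - lo) for v in values]

def denormalize(scaled, lo, hi):
    """Map normalized values back to the original units (e.g. MW)."""
    return [s * (hi - lo) + lo for s in scaled]

# Hypothetical readings (illustrative values only).
scaled = min_max_normalize([0.0, 5.0, 10.0])
restored = denormalize(scaled, 0.0, 10.0)
```

The minimum and maximum found on the training data would normally be stored so that forecasts can be converted back to load values with `denormalize`.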

E. Performance Analysis

Multiple performance indices are used to evaluate the validity of the model. Among them, the coefficient of determination, R², gives the proportion of the variance in one variable that can be predicted from another variable. The value of R² ranges between 0 and 1, and the forecast accuracy is considered high as it approaches 1 [16]. This performance index can be calculated as:

$$ R^2 = \frac{\left[\sum_{t=1}^{N} (P_t - \bar{P})(A_t - \bar{A})\right]^2}{\sum_{t=1}^{N} (P_t - \bar{P})^2 \, \sum_{t=1}^{N} (A_t - \bar{A})^2} \qquad (2) $$

Other performance evaluation measures used in this research include the mean absolute percentage error (MAPE), root mean square error (RMSE), and mean absolute error (MAE). These performance indices can be determined as:

$$ RMSE = \sqrt{\frac{1}{N}\sum_{t=1}^{N} (A_t - P_t)^2} \qquad (3) $$

$$ MAPE = \frac{1}{N}\sum_{t=1}^{N} \left|\frac{A_t - P_t}{A_t}\right| \times 100\% \qquad (4) $$

$$ MAE = \frac{1}{N}\sum_{t=1}^{N} \left|A_t - P_t\right| \qquad (5) $$

where $A_t$ and $P_t$ are the actual and forecast values at time point t, and $\bar{A}$ and $\bar{P}$ are the means of the actual and forecast values, respectively.

As MAE, MAPE and RMSE measure the difference between actual and forecast load values, smaller values of these performance indicators indicate better forecasting performance [16]. However, if MAPE is employed as the benchmark, the inconsistency problem in the results can be avoided, because MAPE is relatively more stable than the other performance measures [16] [17].
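The four indices of Eqs. (2) to (5) can be sketched in plain Python as follows. This assumes R² is taken as the squared correlation between actual and forecast values and that all actual values are non-zero (required by MAPE); the function names and the sample values are hypothetical.

```python
import math

def r_squared(actual, predicted):
    """Squared correlation between actual and forecast values."""
    n = len(actual)
    a_bar = sum(actual) / n
    p_bar = sum(predicted) / n
    cov = sum((p - p_bar) * (a - a_bar) for a, p in zip(actual, predicted))
    var_p = sum((p - p_bar) ** 2 for p in predicted)
    var_a = sum((a - a_bar) ** 2 for a in actual)
    return cov ** 2 / (var_p * var_a)

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) \
        / len(actual)

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical actual and forecast loads (illustrative values only).
actual = [100.0, 200.0, 300.0, 400.0]
forecast = [110.0, 190.0, 310.0, 390.0]
scores = {
    "R2": r_squared(actual, forecast),
    "RMSE": rmse(actual, forecast),
    "MAPE": mape(actual, forecast),
    "MAE": mae(actual, forecast),
}
```

In this toy case every error has magnitude 10, so MAE and RMSE coincide, while MAPE weights the errors on small loads more heavily.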

III. METHODOLOGY

A. Probability and Discard Information

All the input variables present in the data set are classified into multiple ranges, as shown in Table 2. The selection of data is based on the frequency of occurrence of the samples and on the effect of correlation with respect to the forecast error. In the first phase of data selection, samples with higher correlation and frequency of occurrence are selected using the probability and discard information provided in Table 3. The maximum discard percentage applies to range 1, which has the maximum number of samples. The discard percentage decreases by 10 percentage points for each subsequent range and remains at 10 % from the ninth range onward. This inverse probability and selection technique ensures that data samples from every range are present in the final data set, which is necessary for effective mapping between inputs and target outputs during the ANN training procedure. Another important consideration is to discard data samples with repeated values, because similar data play an insignificant role in the training process of the ANN. The flow diagram of the hybrid approach for the proposed input variable selection technique is depicted in Fig. 3.
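A rough sketch of this first selection phase is given below. It follows the discard schedule of Table 3 (90 % for the most populated range, 10 percentage points less for each following range, never below 10 %). Several details are assumptions not stated in the paper: repeated values are dropped before sampling, the retained samples are drawn at random, and at least one sample is kept per range. All names are hypothetical.

```python
import random

def discard_percentages(n_ranges, start=90, step=10, floor=10):
    """Discard percentage per range, ordered from most to least populated."""
    return [max(start - step * k, floor) for k in range(n_ranges)]

def select_by_inverse_probability(ranges, seed=0):
    """ranges: dict mapping a range label to its list of samples."""
    rng = random.Random(seed)
    # Rank the ranges by how many samples they contain (largest first).
    ordered = sorted(ranges.items(), key=lambda kv: len(kv[1]), reverse=True)
    kept = {}
    for (label, samples), pct in zip(ordered,
                                     discard_percentages(len(ordered))):
        unique = list(dict.fromkeys(samples))  # assumption: drop repeats first
        n_keep = max(1, round(len(unique) * (100 - pct) / 100))
        kept[label] = rng.sample(unique, n_keep)  # assumption: random draw
    return kept

# Hypothetical ranges with 100, 50 and 10 distinct samples.
ranges = {
    "A": list(range(100)),
    "B": list(range(100, 150)),
    "C": list(range(150, 160)),
}
kept = select_by_inverse_probability(ranges)
```

Because every range keeps at least one sample, sparsely populated load ranges remain represented in the reduced dataset.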

B. Genetic Manipulation of Data

The remaining data samples are presented to the ANN to formulate a set of factors, based on various input combinations, that captures the correlation and data distribution impact of the inputs and output and is constituted on the basis of forecast error. The forecast load values that have a higher correlation coefficient and the maximum number of occurrences with the expected output values are chosen from this factor set as input variables. In this way, a preferable set of input variables can be determined. The accuracy of the forecast results depends on the correlation along with the frequency of repeated occurrence of data points in a certain range.

TABLE 3: PROBABILITY AND DISCARD INFORMATION FOR DATA SELECTION

Load Data Ranges                                  Discard Percentage
Range 1 (maximum number of samples)               90 %
Range 2 (number of samples less than range 1)     80 %
...                                               ...
Range 9 (number of samples less than range 8)     10 %
...                                               10 %
Range N (minimum number of samples)               10 %

[Fig. 3 flowchart steps: START; Historical Load and Weather Data; Divide all the input variable data samples into multiple ranges according to their frequency of occurrence (Table 2); Find correlation impact of all data ranges and select the data samples from each range according to the probability and discard information provided (Table 3); Apply genetic algorithm on the remaining samples for the selection of the most influential data set, choosing the 20% best samples where the fitness function is MAE; Training and testing of ANN using BP algorithm with actual and optimized data set; Compare the results of both data sets; Decision: performance with optimized data set is better? (YES / NO); END]

Fig. 3. Flowchart of the proposed data selection technique.

In this section, the reduced dataset obtained from the statistical handling in stage 1, as explained above, is subjected to genetic manipulation of chromosomes for further optimization and reduction, to constitute the most appropriate input variable dataset. The shortlisted samples are passed to the genetic algorithm for the final selection of the dataset. The most crucial part of a genetic algorithm is the definition of the fitness function. In this case, the fitness function is based on the mean absolute error (MAE).

The general form of various steps involved in GA is explained below.

1. Create a random population composed of n number of chromosomes.

2. Determine the fitness f(x) of each chromosome x in the created population.

3. Apply the following operators (steps 4 to 6) repeatedly to produce a new population.

4. The selection operator selects two parent chromosomes in the population in accordance with their fitness values.

5. The crossover operator swaps particular sections of the parents to generate new offspring.

6. The mutation operator flips one or more bits in the new chromosome, which alters its fitness value.

7. Select and replace the best individuals in the new population for further iterations.

8. Test for the end condition, if it is satisfied, stop, and identify the best solution in present population.

9. Loop back to the fitness function evaluation (step 2) again.

In the first step of this genetic algorithm, the chromosome length and the index of the input variables are defined. A real coded GA is implemented with a fixed chromosome length of 8, and the value of each gene in a particular chromosome is the index of an input variable. In the second stage, the genetic algorithm is used for the selection of the final dataset. The elitism selection technique for a real coded GA is adopted with one point crossover and two point mutation operations. The probabilities of crossover and mutation are set to 0.75% and 0.02%, respectively.
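The GA loop described above can be sketched as follows. This is an illustrative skeleton, not the authors' implementation. Assumptions: the quoted rates are read as probabilities 0.75 and 0.02; parents are drawn from the better half of the population; the error is converted to a fitness as 1 / (1 + error); and `toy_error` is a hypothetical stand-in for the forecast error that the trained ANN would return for a candidate chromosome.

```python
import random

CHROMOSOME_LENGTH = 8   # from the text: fixed-length chromosome of 8 genes
P_CROSSOVER = 0.75      # assumption: quoted rate read as probability 0.75
P_MUTATION = 0.02       # assumption: quoted rate read as probability 0.02

def fitness(chromosome, error_fn):
    """Higher is better; assumed mapping from error to fitness."""
    return 1.0 / (1.0 + error_fn(chromosome))

def one_point_crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def two_point_mutation(chromosome, n_candidates, rng):
    child = list(chromosome)
    for pos in rng.sample(range(len(child)), 2):   # alter two gene positions
        child[pos] = rng.randrange(n_candidates)
    return child

def run_ga(error_fn, n_candidates, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randrange(n_candidates) for _ in range(CHROMOSOME_LENGTH)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(pop, key=lambda c: fitness(c, error_fn), reverse=True)
        history.append(fitness(ranked[0], error_fn))
        new_pop = [ranked[0]]                      # elitism: keep the best
        parents = ranked[: pop_size // 2]          # assumption: top half breeds
        while len(new_pop) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            if rng.random() < P_CROSSOVER:
                c1, c2 = one_point_crossover(p1, p2, rng)
            else:
                c1, c2 = list(p1), list(p2)
            for child in (c1, c2):
                if rng.random() < P_MUTATION:
                    child = two_point_mutation(child, n_candidates, rng)
                if len(new_pop) < pop_size:
                    new_pop.append(child)
        pop = new_pop
    best = max(pop, key=lambda c: fitness(c, error_fn))
    history.append(fitness(best, error_fn))
    return best, history

def toy_error(chromosome):
    """Hypothetical error: pretends lower indices give lower forecast error."""
    return sum(chromosome) / len(chromosome)

best, history = run_ga(toy_error, n_candidates=12)
```

Because the best chromosome is copied unchanged into each new generation, the best fitness recorded in `history` can never decrease from one generation to the next. Note that this sketch allows the same index to appear more than once in a chromosome.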

The minimum MSE is employed as the fitness function for the GA and other evolutionary methods within this hybrid model, as given in Eq. (6) and Eq. (7).

(6)

$$ fitness = \frac{1}{1 + err} \qquad (7) $$

A significantly reduced data set is formulated after implementing the statistical and genetic algorithm based optimization techniques described above. The optimized data set is subjected to multiple case studies to verify its efficacy. The results of these experiments are discussed in the next section.

IV. RESULTS AND DISCUSSION

The data dispersion and frequency of occurrence of each data range are determined, and a proportion of the data samples is selected on the basis of the inverse probability and discard scheme discussed in the previous section. The correlation coefficient method is applied to further shortlist the data samples. Finally, after these statistical methods have been applied, the selected data samples are subjected to genetic manipulation of chromosomes. The mean absolute error (MAE) is used as the fitness function, and the best samples constitute the final data set for the training and testing of the artificial neural network using the back propagation training mechanism. An 8-10-1 NN architecture trained with the back-propagation (BP) algorithm is used to compare the results obtained with the optimized dataset and the original dataset. The scatter plot of learning and predicted residuals is shown in Fig. 4 (a). The autocorrelation of all the historical load data samples is depicted in Fig. 4 (b), and the number of occurrences of the error residuals is shown in Fig. 4 (c).
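For orientation, an 8-10-1 network trained by back propagation can be sketched in plain Python as below. The paper does not specify the activation functions, learning rate or update scheme, so the tanh hidden layer, linear output, learning rate of 0.02 and per-sample updates are all assumptions, and the training data are synthetic values in the range 0 to 1.

```python
import math
import random

def init_network(n_in=8, n_hidden=10, seed=0):
    """Create an n_in - n_hidden - 1 network with small random weights."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def forward(net, x):
    """Return hidden activations (tanh) and the linear output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["w1"], net["b1"])]
    output = sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"]
    return hidden, output

def train_step(net, x, target, lr=0.02):
    """One back propagation update for the loss 0.5 * (output - target)**2."""
    hidden, output = forward(net, x)
    err = output - target
    for j, h in enumerate(hidden):
        # Gradient at hidden unit j, computed before w2[j] is updated.
        grad_h = err * net["w2"][j] * (1.0 - h * h)
        net["w2"][j] -= lr * err * h
        for i, xi in enumerate(x):
            net["w1"][j][i] -= lr * grad_h * xi
        net["b1"][j] -= lr * grad_h
    net["b2"] -= lr * err
    return 0.5 * err * err

# Synthetic, already normalized data: the target is the mean of the 8 inputs.
data_rng = random.Random(1)
samples = [[data_rng.random() for _ in range(8)] for _ in range(50)]
targets = [sum(x) / len(x) for x in samples]

net = init_network()
epoch_losses = []
for _ in range(100):
    total = sum(train_step(net, x, t) for x, t in zip(samples, targets))
    epoch_losses.append(total / len(samples))
```

Training the same network once on the full dataset and once on the reduced dataset, and timing both runs, would mirror the comparison described in this section.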


Fig. 4. (a) Scatter plot of actual and forecast error (b) Auto correlation of load data (c) Frequency of occurrence of error residual

The BP based ANN forecast results for the original dataset, which is composed of half-hourly samples recorded 24 hours a day over a five year period, are presented in Fig. 5.

Fig. 5. Forecast results based on five years data

The graph plots both actual and forecast loads in megawatts (MW) against the half-hourly data samples. A MAPE of 3.35 % is obtained with this dataset.

The results obtained from testing the neural network on the new data set, which has significantly fewer samples than the original dataset, are better in terms of both forecast accuracy and network training time. The MAPE is 2.85 % with this optimized data set, and the decrease of 0.5 percentage points in error shows the superiority of the new data set.

Forecast results based on this optimally selected data set are depicted in Fig. 6. These results are in agreement with the research studies reported by Adepoju et al. in [4].

Fig. 6. Forecast results based on optimized dataset.

Moreover, the training time of the ANN model is reduced by 35% because of the considerably smaller number of training samples. A similar declining trend in processing time is reported by Khazaee et al. in [14]. The reduced training time and enhanced forecast accuracy validate the proposed input variable selection technique.

V. CONCLUSION

Selecting the most appropriate set of input variables not only enhances forecast accuracy but also saves computational effort and time. In this paper, a new hybrid technique based on statistical methods and the genetic algorithm is used to select the optimized dataset of input variables. To verify the efficacy of this data selection method, an ANN trained with the BP algorithm is examined using this optimized dataset and the original dataset containing all the recorded inputs.

Training and testing results of these two data sets show that this data selection method is efficient as a considerable reduction in error (0.5 % MAPE) and computational time (35% training time) is observed.

VI. ACKNOWLEDGMENT

The authors wish to thank NFC-IEFR administration for providing the opportunity to conduct this research.

REFERENCES

[1] A. D. Papalexopoulos, H. Shangyou, and T. M. Peng, "Short-term system load forecasting using an artificial neural network," in Neural Networks to Power Systems, 1993. ANNPS '93., Proceedings of the Second International Forum on Applications of, 1993, pp. 239-244.

[2] G. Gross and F. D. Galiana, "Short-term load forecasting," Proceedings of the IEEE, vol. 75, pp. 1558-1573, 1987.

[3] H. K. Alfares and M. Nazeeruddin, "Electric load forecasting: Literature survey and classification of methods," International Journal of Systems Science, vol. 33, pp. 23-34, 2002.

[4] G. Adepoju, S. Ogunjuyigbe, and K. Alawode, "Application of neural network to load forecasting in Nigerian electrical power system," The Pacific Journal of Science and Technology, vol. 8, pp. 68-72, 2007.

[5] M. Rothe, D. A. Wadhwani, and D. Wadhwani, "Short term load forecasting using multi parameter regression," arXiv preprint arXiv:0912.1015, 2009.

[6] D. Morinigo-Sotelo, O. Duque-Perez, L. Garcia-Escudero, M. Fernandez-Temprano, P. Fraile-Llorente, M. Riesco-Sanz, et al., "Short-term hourly load forecasting of a hospital using an artificial neural network," in International Conference on Renewable Energies and Power Quality, 2011.

[7] D. Srinivasan, "Evolving artificial neural networks for short term load forecasting," Neurocomputing, vol. 23, pp. 265-276, 1998.

[8] I. Drezga and S. Rahman, "Input variable selection for ANN-based short-term load forecasting," Power Systems, IEEE Transactions on, vol. 13, pp. 1238-1244, 1998.

[9] K.-h. Yang, G.-l. Shan, and L.-L. Zhao, "Correlation coefficient method for support vector machine input samples," in Machine Learning and Cybernetics, 2006 International Conference on, 2006, pp. 2857-2861.

[10] S. Mishra and S. K. Patra, "Short term load forecasting using neural network trained with genetic algorithm & particle swarm optimization," in Emerging Trends in Engineering and Technology, 2008. ICETET'08. First International Conference on, 2008, pp. 606-611.

[11] L. Tian and A. Noore, "Short-term load forecasting using optimized neural network with genetic algorithm," in Probabilistic Methods Applied to Power Systems, 2004 International Conference on, 2004, pp. 135-140.

[12] S. Nayak, B. Misra, and H. Behera, "Index prediction with neuro-genetic hybrid network: A comparative analysis of performance," in Computing, Communication and Applications (ICCCA), 2012 International Conference on, 2012, pp. 1-6.

[13] V. Shrivastava and R. Misra, "A Novel Approach of Input Variable Selection for ANN Based Load Forecasting," in Power System Technology and IEEE Power India Conference, 2008. POWERCON 2008. Joint International Conference on, 2008, pp. 1-5.

[14] P. R. Khazaee, N. Mozayani, and M.-R. J. Motlagh, "A genetic-based input variable selection algorithm using mutual information and wavelet network for time series prediction," in Systems, Man and Cybernetics, 2008. SMC 2008. IEEE International Conference on, 2008, pp. 2133-2137.

[15] I. Drezga and S. Rahman, "Short-term load forecasting with local ANN predictors," Power Systems, IEEE Transactions on, vol. 14, pp. 844-850, 1999.

[16] M. Ghayekhloo, M.B. Menhaj, "Hybrid short-term load forecasting with a new data preprocessing framework", Electric Power Systems Research 119 (2015) 138–148.

[17] J. Wanga and ZhouWanga " Stock index forecasting based on a hybrid model", Omega 40 (2012) 758–766.

[18] Eiben, A.E., Smith, J.E., 2003. Introduction to Evolutionary Computing. Natural Computing Series. Springer, Berlin.

[19] Badar ul Islam, Zuhairi Baharudin and Perumal Nallagownden, "Effect Of Input Variables Selection On Energy Demand Prediction Based On Intelligent Hybrid Neural Networks" ARPN Journal of Engineering and Applied Sciences, Vol. 10, No. 21, November, 2015.

[20] Badar ul Islam, Baharudin, Z., Nallagownden, P. and Raza, M.Q., A hybrid neuro-genetic approach for STLF: A comparative analysis of model parameter variations. In Power Engineering and Optimization (PEOCO), 2014 IEEE 8th International Conference (pp. 526-531). IEEE.

[21] A.S. Khwaja, M. Naeem, A. Anpalagan, A. Venetsanopoulos, B. Venkatesh, Improved short-term load forecasting using bagged neural networks, Electric Power Systems Research, Volume 125, August 2015, Pages 109-115.

[22] Rahmat-Allah Hooshmand, Habib Amooshahi, Moein Parastegari, A hybrid intelligent algorithm based short-term load forecasting approach, International Journal of Electrical Power & Energy Systems, Volume 45, Issue 1, February 2013, Pages 313-324.

[23] Ming-Yue Zhai, A new method for short-term load forecasting based on fractal interpretation and wavelet analysis, International Journal of Electrical Power & Energy Systems, Volume 69, July 2015, Pages 241-245.

[24] Song Li, Peng Wang, Lalit Goel, Short-term load forecasting by wavelet transform and evolutionary extreme learning machine, Electric Power Systems Research, Volume 122, May 2015, Pages 96-103.

[25] N. Amjady, "Short-Term Bus Load Forecasting of Power Systems by a New Hybrid Method," in IEEE Transactions on Power Systems, vol. 22, no. 1, pp. 333-341, Feb. 2007, doi: 10.1109/TPWRS.2006.889130.

[26] Luis Hernández, Carlos Baladrón, Javier M. Aguiar, Belén Carro, Antonio Sánchez-Esguevillas, Jaime Lloret, Artificial neural networks for short-term load forecasting in microgrids environment, Energy, Volume 75, 1 October 2014, Pages 252-264.

[27] Badar Ul Islam, Zuhairi Baharudin, Perumal Nallagownden, "Development of chaotically improved meta-heuristics and modified BP neural network-based model for electrical energy demand prediction in smart grid" Neural Computing and Applications, pp. 1-15, 2016.