
Deep Neural Networks for Short-Term Load Forecasting in ERCOT System


Mitchell Easley, Luke Haney, Jose Paul, Kim Fowler, IEEE Fellow, Hongyu Wu*, IEEE Senior Member
Department of Electrical and Computer Engineering
Kansas State University, Manhattan, KS, United States
(* corresponding author: hongyuwu@ksu.edu)

Abstract—Short-term load forecasting plays a major role in the operation of electric power systems to ensure instantaneous balance between electricity generation and demand. The accuracy of a forecast generated by a neural network (NN) depends on several factors, including but not limited to the algorithm used to train the network, how much and what kind of data are used in the network's training set, how many hidden layers are in the NN, and the size of the hidden layer(s). We investigate the best combination of these factors to decrease the mean absolute percent error (MAPE) and to give the best forecast possible. Based on system load data from the Electric Reliability Council of Texas (ERCOT), this paper focuses on a comprehensive understanding of the accuracy of forecasts generated by NNs with different algorithms, while varying the length of the network's training set, the hidden layer size, the number of hidden layers, and the data sets added during training, to create a forecast with the highest accuracy possible.

Index Terms—Load forecast, deep learning, neural network, Levenberg-Marquardt, Scaled Conjugate Gradient, Bayesian Regularization

978-1-5386-1006-0/18/$31.00 ©2018 IEEE

I. INTRODUCTION

Load forecasting has become one of the major research fields in electrical engineering. With the frequent changes in weather conditions, electricity prices, and demand-side participation, load forecasting is more necessary now than ever [1]. Forecasting divides into long-term, medium-term, and short-term forecasting. Short-term prediction is based more on weather conditions, such as dry bulb temperature and dew point temperature, and usually covers a one-hour to one-week period. Medium-term and long-term predictions depend more on economic factors and political decisions. Medium-term lasts from a week to a year, and long-term is usually more than a year [1].

There are various methods used for load forecasting these days [2]. Some common ones are fuzzy logic, expert systems, and neural networks [3]. The fuzzy logic method gathers similarities from large amounts of data. Instead of saying two values are the same, it determines the degree to which they are similar, and it uses this logic to predict the load. The biggest advantage of this approach is the absence of mathematical models and specific inputs [1]. Expert systems are computer algorithms that have the power to define and prescribe actions [3]. They are based on rules and steps created by experienced engineers, which are converted into software to forecast load [1].

The most common method, the neural network (NN), is a hybrid method that uses time series and regression [3]. The NN looks at previous load data and finds trends in those data; it uses that knowledge to predict the load with weather forecast data. There are different NN structures used to predict load, such as Hopfield, backpropagation, and Boltzmann machines. The most common NN design used for load forecasting is backpropagation. Unlike statistical forecasting techniques, a mathematical model does not need to be defined a priori for a NN; one can easily include any relevant parameter as a node in the input layer, and the network will "learn" the relationship. Reference [4] completed an extensive literature review on NNs' place in short-term load forecasting (STLF) and found that, although the individual conclusions of most papers on the subject are not sufficiently convincing by themselves, the great number of positive results demonstrates that it is a superior forecasting method. Hippert, Pedreira, and Souza suggest that many early researchers on this topic likely overparameterized their NNs, leading to overfitting of the training set and ultimately less promising results.

The different algorithms used in this paper for training NNs are Bayesian Regularization (BR), Scaled Conjugate Gradient (SCG), and Levenberg-Marquardt (LM). BR trains the network by reducing the sum of squared errors; it uses a Jacobian matrix for its calculations and takes the most time of the three methods [5]. LM also minimizes a sum-of-squares error measure and uses a Jacobian matrix for its calculations; it takes more memory to produce its results [6]. SCG is different from the previous two methods; it uses conjugate directions as a basis for training [7].

There are some challenges associated with load forecasting. First, load forecasting is based on predicted weather conditions [8]. These predicted values are not always accurate, creating a skewed forecast [9]. Second, different seasons affect the forecast, which sometimes requires multiple models to properly perform the forecast. Based on system load data from the Electric Reliability Council of Texas (ERCOT), this paper focuses on a comprehensive understanding of the accuracy of forecasts generated by NNs with different algorithms, while varying the length of the network's training set, hidden layer size, and number of hidden layers, and while
including atypical predictor variables. The conclusions are used to create a forecast with the highest accuracy possible. To the best of our knowledge, this paper is the first of its kind to study short-term load forecasting by using deep NNs in the ERCOT system and to comprehensively investigate the impact of various factors on forecasting accuracy and computational efficiency.

II. LOAD FORECASTING METHOD

We developed the load forecasting code in MATLAB to analyze the impacts of different factors on load forecasts by using NNs. For predicting the future load on a system, we used a time series NN with nonlinear autoregression with external input (NARX). NARX requires large amounts of data, which contain past values of the system load along with weather data from the area of the transmission system. The weather data typically consist of, but are not limited to, dry bulb temperature and dew point; this second series can include anything that one may think could affect the load of a system. We collected data in hourly increments. Along with system load and weather data, the code accounts for holidays, which the NN treats like weekdays. Once the data were gathered in Microsoft Excel, we exported them into the MATLAB environment. We concatenated the load data of the system with the weather information and stored the aggregated data in a large data structure. For a short-term forecast, these data inputs comprise the following: weather data, hour of day, day of the week, a flag for holidays, the previous day's average load, the load from the same hour the previous day, and the load from the same hour and same day of the previous week. The data are then split into two separate datasets. One dataset is designated for training the NN, while the other is saved for later testing to evaluate the accuracy of the forecast. The training dataset configures the NN.

Once the NN has been trained, we ran a simulation to predict the load, using the created network and the dataset reserved for testing. The testing dataset contains the series information consisting of system load and weather conditions. The NN looks at this series data and forms a load prediction. Once the simulation is completed, the remaining testing data, which consist of the system load, test the accuracy of the simulated load forecast. Plots of the weekly load can also be produced; these show the actual load, the forecasted load, and a separate graph of the residuals between the two.

III. TRAINING METHODS

A. Levenberg-Marquardt

The LM algorithm finds the minimum of a sum of squares of a nonlinear function. It is a mix of the Gauss-Newton method (GN) and the gradient descent algorithm (GD). The GN method uses a quadratic convergence technique to find the minimum, which is fast. This convergence technique depends mainly on the initial values of the weights. However, finding good initial values is sometimes impossible, due to the complexity of the application. The GD algorithm, in contrast, is less dependent on the initial values of the weights. It uses a first-order approach to converge to the minimum value. Due to this, it takes a longer time to converge compared to the GN algorithm, but it can be used in applications that the GN method does not support [10].

The LM algorithm is a hybrid method that combines the positive characteristics of the GN and GD algorithms. Far from a minimum, LM behaves like GD, which tolerates poor initial guesses; as it approaches the minimum, it changes to GN, whose quadratic convergence is fast. This hybrid behavior gives it efficient convergence performance [10].

B. Bayesian Regularization

The BR technique converts nonlinear systems into well-posed problems. It reduces the sum of the squares of the output errors [11]. The method starts by predicting an initial (prior) distribution for the weights of the NN. An exponential model is used for this initial distribution, and its exponent is minimized to find the best weight vector. A posterior distribution is then formed using Bayes' theorem [3].

C. Scaled Conjugate Gradient

The scaled conjugate gradient method is one of the conjugate gradient algorithms. A conjugate gradient algorithm modifies the weights so that the error converges in the downhill direction, which is the negative of the gradient. The search is usually done along conjugate directions, which gives faster convergence than following the negative of the gradient alone [6]. The conjugate gradient method increases the calculation difficulty by performing a line search to find the correct step size; this line search requires several evaluations of the global error function or its derivative, which increases the computational cost. This is where the scaled conjugate gradient method comes into play: it skips the line search and instead uses a Levenberg-Marquardt-style scaling of the step size [7]. Due to the complexity and length of the algorithm, its equations are not discussed in this paper.

IV. NUMERICAL RESULTS

We acquired new data to run simulations and perform our testing. These data contained new hourly transmission system load and weather data. The system load data were accessed from the Electric Reliability Council of Texas (ERCOT), which governs South and Central Texas power flow. ERCOT has a large system, with more than 46,500 miles of transmission lines broken up into different regions. The specific region chosen for testing was the North Central Region.

Several years of hourly weather data were used to train an accurate NN. The hourly weather information for the North Central Region of ERCOT's system was obtained from the National Oceanic and Atmospheric Administration (NOAA).
The data were received from a land weather radar at Dallas Fort Worth airport.

A. Effect of Malfitting on Forecasting Accuracy

The test in this section addresses something different from what is known as overfitting. Overfitting is the result of a network that produces low error on its training set but develops a larger error when tested with independent data (either test data or validation data). Such a network fails to generalize patterns in the training data and picks up spurious correlations [12]. Overfitting can occur from overtraining or overparameterization. When overparameterization occurs, the network is too complex (too many nodes or synapses) for the amount of data used to train it. In overtraining, the network is not necessarily too complex, but through extensive training it fails to generalize from the data.

The stopping criterion used here was a cap of 1000 iterations. When concerned with overfitting, using a validation set is preferable to such a cap. However, a network with one output node for the entire forecast (as opposed to multi-model forecasting) is less sensitive to overparameterization [4]. In this section, we looked at how large a training set is useful, using thirteen years of ERCOT load data and nearby weather data from NOAA. Load data from several years ago would not produce an appropriate model; as older load data accumulate in the training set, the network is fit with less relevant data, which should increase the forecast's error. To distinguish this from the causes of overfitting, we refer to this problem as malfitting.

We used ERCOT's load data from 2015 as our testing data. We trained first with 2014, then 2013-2014, then 2012-2014, and so on; the largest training set had load data from 2003-2014. The three NARX algorithms in MATLAB's ANN toolbox were tested, with the default hidden layer size of twenty neurons.

Fig. 1 shows the MAPE for all three algorithms for the different sizes of training sets. The LM and BR algorithms both produced a less accurate forecast for training set sizes beyond three years, and the error continued to increase as the training set grew. The scaled conjugate gradient algorithm was less sensitive to the training interval but consistently produced a less accurate forecast.

Fig. 1. The MAPE of each training set size for the three available NARX algorithms.

The increased error is due to unfit load data from previous years. From 2003 to 2015, the average load on the system increased by 18.41%. If the magnitude of the average load on the system had remained relatively close over this interval, the model would likely have had more success in generalizing trends and would have continued to improve in accuracy. This finding is likely to change for a different system [13], but it is important to look at how the overall magnitude of load in a transmission system changes over the years before selecting an appropriate training interval.

B. Effect of Hidden Layer Size on Forecasting Accuracy

With added neurons in the hidden layer of a single hidden-layer network (SLN), one can generally expect a more accurate forecast. However, a more complex NN takes longer to train. To study the impact of hidden layer size on load forecasting accuracy, three years of system load and weather data (2012-2014) were used to train all three algorithms, with ERCOT's 2015 load data again used as the testing data.

Fig. 2 shows a boxplot of the MAPE of the forecast for different hidden layer sizes. The MAPE consistently decreases as the hidden layer size increases; however, the plot suggests that the marginal decrease in MAPE quickly diminishes. Fig. 3 shows the time taken for training one epoch with the LM algorithm, where a scaled line of the average MAPE for each hidden layer size overlays a bar graph showing how the training time increases. As seen in Fig. 3, increasing the size of the network's hidden layer from 20 to 40 neurons results in a relative decrease in MAPE of 4.84%, while the training time would be expected to increase by 195%. These two relationships with hidden layer size present a tradeoff between forecasting accuracy and computational efficiency.

The other two NARX algorithms, BR and SCG, were tested the same way. BR had a training time per epoch similar to the LM algorithm, but often ran many more epochs. The average number of epochs to train a NN with the LM algorithm was just under 314, while the average for BR was just under 738, meaning the overall training time for a NN trained with the BR algorithm will likely be over twice that of one trained with the LM algorithm. The scaled conjugate gradient algorithm, while much quicker than the other two, did not show the exponentially decaying trend beyond about 15 neurons. Fig. 4 compares the three algorithms in terms of MAPE.

Although significantly faster to train, the SCG fitting method is much less accurate than the other two. Its forecast, on average, produced a MAPE 1.535% larger than that of the LM algorithm and 1.625% larger than that of BR. The LM and BR algorithms produce similar forecasts; the Bayesian Regularization algorithm, on average, produces a MAPE 0.086% lower than that of the LM algorithm.
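All of the accuracy comparisons in this section are in terms of MAPE. For reference, a minimal sketch of the metric follows; the paper's evaluation is done in MATLAB, so this Python version is only illustrative:

```python
def mape(actual, forecast):
    """Mean absolute percent error, in percent, between two hourly series."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    return 100.0 * sum(abs(a - f) / abs(a)
                       for a, f in zip(actual, forecast)) / len(actual)

# A forecast off by 10% and 5% on two hours averages to a 7.5% MAPE.
print(mape([100.0, 200.0], [110.0, 190.0]))  # 7.5
```

Averaging this quantity over the hourly forecasts of the 2015 test year yields the figures reported for each algorithm.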
Fig. 2. Boxplot of MAPE vs. hidden layer size when trained with the Levenberg-Marquardt algorithm.

Fig. 3. The time taken for one training epoch of a SLN with the LM algorithm vs. the hidden layer size (the red line shows the average MAPE over iterations).

Fig. 4. The MAPE of the three NARX algorithms.

Fig. 5. MAPE vs. number of hidden layers for the BR algorithm.

Fig. 6. MAPE vs. number of hidden layers for the scaled conjugate gradient algorithm.

C. Effects of Deep NN on Forecast Accuracy

To study the impact of the number of hidden layers on load forecasting accuracy, we conducted testing on deep NNs. Figs. 5, 6, and 7 summarize the accuracy found with the three NARX algorithms. The hidden layers we tested were set to the default size of 20 neurons.

We notice that, for the LM and BR algorithms in the first year, the error increases with the NN's complexity (number of layers). This is caused by the overparameterization mentioned in Section IV.A. The error increases beyond three years of training data for these two algorithms, for each NN size. However, multi-hidden-layer networks gave us the best results as the training interval's size increased: a 2-hidden-layer NN trained with BR forecasted the load with the smallest MAPE of 2.61%. Eventually, as the quantity of data increases, the multi-hidden-layer NNs outperform the SLN. The results in Figs. 5-7 suggest that, when deciding on the size of the NN used to forecast, one must consider how much useful data is available.
Fig. 7. MAPE vs. number of hidden layers for the LM algorithm.

D. Effects of Deep NN on Training Time

For a fixed hidden layer size, the number of synapse weights adjusted in each epoch increases by n² each time a hidden layer is added, where n is the number of neurons in the hidden layer. (This holds for a NN without feedback synapses.) As a result, we can expect a significant increase in training time as hidden layers are added to the NN. Fig. 8 outlines the results of training time versus number of training years for LM; similar results can be obtained for the other two algorithms.

Fig. 8. Training time versus number of training years for different numbers of hidden layers using the Levenberg-Marquardt algorithm.

E. Effects of Additional Predictor Variables

The original predictor variables used for training the NN were hourly readings of 1) dry bulb temperature, 2) dew point, 3) hour of day, 4) day of the week, 5) a holiday/weekend indicator, 6) the previous day's average load, 7) the load from the same hour the previous day, and 8) the load from the same hour and same day of the previous week. As mentioned earlier in the paper, extra predictor variables in the time series used to train the NN can help reduce the MAPE, depending on their relevance. To test the effect of adding more predictor variables, we looked at the data obtained from NOAA for additional variables, acquiring hourly data for relative humidity, visibility, and wind speed as additional predictors.

To test how each extra variable affected the MAPE, a NN had to be created for each scenario: a new NN for the original variables, for the addition of relative humidity, for the addition of visibility, for the addition of wind speed, and for the addition of all three. For each new NN, a forecasting simulation ran eleven times, each iteration using a different amount of training data, ranging from one year to twelve. These iterations were then averaged together to form a single resulting MAPE for each amount of training data and each combination of fields. Without loss of generality, only the LM algorithm was tested.

Fig. 9 compares the accuracy results of the added fields: a bar graph along with several stem plots shows the MAPE for different combinations of fields and different amounts of training data when using the LM algorithm. The bar graph in Fig. 9 represents the original data, which consisted of the 8 predictor variables mentioned above. The stem plots give the resulting MAPE when the original data plus the specified extra data were used; the black stem plot gives the MAPE when the original data were used along with all three extra variables (humidity, visibility, and wind speed).

Fig. 9. Change in MAPE as different combinations of additional fields are used to train the NN.

When performing a forecast with both the original and all extra variables, the MAPE is reduced to a value just under 2.75%. As seen in Fig. 9, some variables alone play a larger role in reducing the MAPE than others. For example, the addition of relative humidity alone had a small, if not negative, effect on the overall MAPE, while wind speed continually lowered it. The results in Fig. 9 suggest that, when using additional variables, one must be aware of their effects: some additional fields may reduce the MAPE, while others may take a toll on accuracy due to their irrelevance. The consistently lower error with an increasing training set interval suggests that the malfitting caused by old load data is mitigated by the additional data fields.
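The n² growth argument in Section IV.D can be checked with a quick count of fully connected weights (biases and any feedback synapses ignored, as in the text); this is an illustrative sketch, not the paper's MATLAB code:

```python
def weight_count(n_inputs, hidden_sizes, n_outputs=1):
    """Number of synapse weights in a fully connected feedforward NN
    (biases excluded)."""
    sizes = [n_inputs] + list(hidden_sizes) + [n_outputs]
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

# With 20-neuron hidden layers, each added layer contributes 20*20 = 400
# extra weights, independent of the number of inputs.
one_layer = weight_count(8, [20])       # 8*20 + 20*1 = 180
two_layers = weight_count(8, [20, 20])  # 8*20 + 20*20 + 20*1 = 580
print(two_layers - one_layer)  # 400
```

Since every weight is adjusted in every epoch, this quadratic growth per added layer is what drives the training-time increase observed in Fig. 8.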
V. OBSERVATIONS

The following observations are made based on our numerical testing results:
1. The most useful training set was 2012-2014 (three years). Load data from before this interval were no longer representative enough for forecasting the ERCOT load in 2015.
2. Forecast accuracy increased with hidden layer size for a SLN. However, for certain training intervals, bigger is not always better for deep NNs.
3. Iterative overtraining is a larger concern for NNs with multiple hidden layers.
4. Of the additional data we retrieved from NOAA, wind speed and visibility were the two variables that had a positive effect on forecast accuracy.
5. While BR and LM produced similar error, BR was consistently superior. The SCG algorithm, while much quicker in training the NN, produced less accurate results.

We used our findings to determine what would produce the most accurate short-term load forecast for the ERCOT system. The most accurate forecast was found with a NN with five hidden layers, each containing eight neurons, with the synapse weights determined by the BR algorithm. This network forecasted the load with a MAPE of 2.468%. Wind speed and visibility were included as additional predictor variables, and relative humidity was left out. Fig. 10 shows the forecast for the weeks of highest and lowest load in 2015 in the ERCOT system.

Fig. 10. The forecast developed from a NN with five hidden layers, eight neurons each, trained with the BR algorithm. The load is compared to the forecast during the weeks of maximum (a) and minimum (b) load of 2015.

VI. CONCLUSION

This paper studied how different factors affect the accuracy of a short-term load forecast. Specifically, we tested how accuracy is affected by the hidden layer size of a SLN, how network training time and accuracy are affected by the addition of hidden layers, how some additional weather data affect forecast accuracy, and how the length of the training interval affects forecast accuracy. We also compared the three NARX algorithms available for NNs. Accurate forecasts are valuable for independent system operators, as they are a necessity both for efficient operation of generation and for consistently meeting customer electricity demand.

REFERENCES
[1] E. A. Feinberg, "Load forecasting." [Online]. Available: http://www.almozg.narod.ru/bible/lf.pdf
[2] A. K. Srivastava, A. S. Pandey, and D. Singh, "Short-term load forecasting methods: A review," presented at ICETEESES, Sultanpur, India, Mar. 11-12, 2016.
[3] H. K. Alfares and M. Nazeeruddin, "Electric load forecasting: Literature survey and classification of methods," International Journal of Systems Science, vol. 33, no. 1, pp. 23-34, 2002.
[4] H. S. Hippert, C. E. Pedreira, and R. C. Souza, "Neural networks for short-term load forecasting: A review and evaluation," IEEE Trans. Power Syst., vol. 16, no. 1, pp. 44-55, Feb. 2001.
[5] MathWorks, "Bayesian regularization backpropagation - MATLAB trainbr," 2017. [Online]. Available: https://www.mathworks.com/help/nnet/ref/trainbr.html
[6] MathWorks, "Levenberg-Marquardt backpropagation - MATLAB trainlm," 2017. [Online]. Available: https://www.mathworks.com/help/nnet/ref/trainlm.html
[7] MathWorks, "Scaled conjugate gradient backpropagation - MATLAB trainscg," 2017. [Online]. Available: https://www.mathworks.com/help/nnet/ref/trainscg.html
[8] M. Khodayar and H. Wu, "Demand forecasting in the smart grid paradigm: Features and challenges," The Electricity Journal, vol. 28, no. 6, pp. 51-62, Jul. 2015.
[9] R. P. Schulte, K. D. Le, R. H. Vierra, G. D. Nagel, and R. T. Jenkins, "Problems associated with unit commitment in uncertainty," IEEE Trans. Power App. Syst., vol. PAS-104, no. 8, Aug. 1985.
[10] B. Kermani, S. Schiffman, and H. Nagle, "Performance of the Levenberg-Marquardt neural network training method in electronic nose applications," Sensors and Actuators B: Chemical, vol. 110, no. 1, pp. 13-22, Sep. 2005.
[11] J. Ticknor, "A Bayesian regularized artificial neural network for stock market forecasting," Expert Systems with Applications, vol. 40, no. 14, pp. 5501-5506, Oct. 2013.
[12] R. Reed, "Pruning algorithms - a survey," IEEE Trans. Neural Netw., vol. 4, no. 5, pp. 740-747, Sep. 1993.
[13] C. N. Lu, H. T. Wu, and S. Vemuri, "Neural network based short term load forecasting," IEEE Trans. Power Syst., vol. 7, no. 1, pp. 336-342, Feb. 1993.
