Machine Learning Approaches to Energy Consumption Forecasting in Households
Riccardo Bonetto, Michele Rossi
Department of Information Engineering (DEI), University of Padova, Via G. Gradenigo 6/B, 35131 Padova (PD), Italy
{bonettor, rossi}@dei.unipd.it

Abstract—We consider the problem of power demand forecasting in residential micro-grids. Several approaches using ARMA models, support vector machines, and recurrent neural networks that perform one-step ahead predictions have been proposed in the literature. Here, we extend them to perform multi-step ahead forecasting and we compare their performance. Toward this end, we implement a parallel and efficient training framework, using power demand traces from real deployments to gauge the accuracy of the considered techniques. Our results indicate that machine learning schemes achieve smaller prediction errors than ARMA in both the mean and the variance, but there is no clear algorithm of choice among them. Pros and cons of these approaches are discussed, and the solution of choice is found to depend on the specific use case requirements. A hybrid approach, driven by the prediction interval, the target error, and its uncertainty, is then recommended.
I. INTRODUCTION
In the last few years, the rising concern for greenhouse gas emissions, the growth in the electrical power demand, the diffusion of domestic generation plants based on renewables, and the integration of sensing and metering devices into power distribution grids have led to the deployment of several smart grids around the world. As noted in [1], they are one of the key enablers for the development of smart cities.

As for smart grid technology, a great effort has been devoted to developing distributed control techniques that boost the efficiency of electrical grids in the presence of end users with power generation capabilities (prosumers), see for instance [2]–[4]. Moreover, demand response policies that influence the energy consumption profile of the prosumers, providing economic and power efficiency benefits, are being investigated [5].

Efficient power consumption forecasting algorithms can provide further benefits to the smart grid control process. For example, in [6], [7] forecasting is utilized to assess what fraction of the generated power has to be locally stored for later use and what fraction of it can instead be fed to the loads or injected into the grid. Moreover, in [4] prosumers' power generation and consumption forecasts are used to determine the amount of energy that has to be injected in an isolated power grid to stabilize the aggregated power consumption.

Lately, several techniques have been developed and increasing attention is being paid to machine learning approaches such as Artificial Neural Networks (ANNs) [8] and Support Vector Machines (SVMs) [9]. These methods, however, are known to be computationally intensive [10], [11]. For this reason, lightweight forecasting solutions are much needed to utilize them in prosumers' installations featuring off-the-shelf computing hardware.

In recent work, ANNs and SVMs have been successfully (and increasingly) exploited to forecast power consumption data, see for example [12], [13]. In this paper, we provide a comparison between four different forecasting methods. Each of these can be executed on off-the-shelf computing hardware and can perform day-ahead and multi-step ahead predictions, i.e., the output is a vector of forecast power demands into the future.

The first technique that we consider is an Auto-Regressive Moving Average (ARMA) model, whose results are used as a baseline to quantify the forecast accuracy gain provided by machine learning algorithms. The second method that we investigate is based on $h$ SVMs which are trained and executed in a parallel fashion, where $h$ is the number of time steps into the future to be forecast. The last two methods employ ANNs: the third one is based on a Nonlinear Auto-Regressive (NAR) recurrent ANN, while the fourth features a Long Short-Term Memory (LSTM) ANN. For each of the considered ANN approaches, a single network topology is defined and is then trained $h$ times (one training per time step). The weights and biases from each of the $h$ training phases are then utilized to generate an $h$-steps ahead forecast vector. This approach allows performing the training and forecast processes in a parallel fashion and requires storing only $h$ weight matrices.

The main contribution of the present work consists of carrying out a performance comparison of machine learning solutions from the state of the art, which are seldom compared against one another. Besides, we also extend our comparison to LSTM neural networks, which to the best of our knowledge had never been used for energy demand forecasting in smart grids.

The rest of this work is structured as follows.
In Section II we briefly introduce the three considered machine learning forecasting techniques, namely, support vector machines, nonlinear auto-regressive neural networks, and long short-term memory neural networks. In Section III we discuss a parallel framework that reduces the computational time required by the training phase and makes it possible to implement the considered forecasting solutions on off-the-shelf computing devices. Moreover, we describe the experimental setup that we used to assess its performance, and in Section IV we test the machine learning forecasting approaches against an ARMA model. Finally, in Section V we draw our conclusions.

[Fig. 1. Example of NAR network with $n$ inputs, one hidden layer with four sigmoidal neurons, and one linear output neuron.]

II. MACHINE LEARNING TECHNIQUES FOR FORECASTING
In this section, we briefly describe the considered machine learning techniques, along with the adopted forecasting architectures.
A. Support Vector Machines (SVM)
SVMs [9] can be used for classification and regression tasks. When used for regression, the common approach is called "$\epsilon$-insensitive support vector regression" [9]. Let $h$ be the future horizon of the forecast and let $X_n^\tau$ be the last $n$ values of the time series $X = (x_1, x_2, \ldots)$ at the current time step $\tau \geq n$. This approach seeks a function $f(X_n^\tau)$ such that a suitable distance $\| f(X_n^\tau) - x_{\tau+h} \|$ is minimized and in any case is no greater than the parameter $\epsilon$. To do so, a convex optimization problem is set up and solved, which guarantees that, if the problem is feasible, the solution found is globally optimal. In the case where the optimization problem does not have any feasible solution, a tolerance parameter on the $\epsilon$ threshold is introduced. The use of SVMs for regression tasks is appealing since they guarantee that the forecast error is bounded by $\epsilon$. A minimal sketch of this regression setup is given below.
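As a concrete illustration, the following Python sketch trains one $\epsilon$-insensitive regressor that maps $X_n^\tau$ to $x_{\tau+h}$. It uses scikit-learn's SVR on a synthetic trace; the window length, horizon offset, and $\epsilon$ value are example choices, not the paper's settings.

# A minimal epsilon-insensitive SVR forecaster: learn to map the last n
# samples X_n^tau to the value h steps ahead, x_{tau+h}.
import numpy as np
from sklearn.svm import SVR

def make_windows(series, n, h):
    """Inputs: n past samples; target: the sample h steps after the window."""
    idx = range(n, len(series) - h + 1)
    X = np.array([series[t - n:t] for t in idx])
    y = np.array([series[t + h - 1] for t in idx])
    return X, y

series = np.sin(np.linspace(0, 100, 5000))  # stand-in for a real demand trace
n, h = 30, 10                               # example window length and horizon offset
X, y = make_windows(series, n, h)

svr = SVR(kernel="rbf", epsilon=0.01)       # epsilon is the tolerated error band
svr.fit(X[:4000], y[:4000])
print(svr.predict(X[4000:4001]))            # one h-step-ahead point forecast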
B. Nonlinear Auto-Regressive (NAR) Neural Networks

Nonlinear auto-regressive (NAR) neural networks are recurrent ANNs performing regression tasks on time series [14]. A NAR network operates on a time series $X$ by processing, at each time step $\tau$, the subsequence $X_n^\tau$ (i.e., the last $n$ values of $X$, $(x_{\tau-n+1}, \ldots, x_\tau)$) and the previous NAR output $\hat{x}_\tau$. The parameter $n$ determines the ANN's memory, i.e., how far in the past the NAR is required to track the correlation structure of the input data. In order to capture the nonlinear structure of the considered time series, the neurons in the hidden layers must have nonlinear activation functions. The output layer, instead, is composed of neurons with linear activation functions, so that the output is not confined to a bounded range.

Fig. 1 shows an example NAR network: it takes as input the sequence $X_n^\tau$ and the output generated by the same network at the previous time step. These inputs are processed by a hidden layer composed of four neurons with sigmoidal activation functions ($\sigma$ in the figure). The output of the hidden layer is processed by a linear neuron to produce the desired result; a numpy sketch of this forward pass follows.
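To make the recurrence concrete, here is a minimal numpy sketch of the forward pass of a NAR network shaped like the one in Fig. 1; the random weights simply stand in for trained ones, and the hidden size matches the figure's example.

# Forward pass of a small NAR network (cf. Fig. 1): at each step tau the
# inputs are the last n samples plus the network's previous output x_hat.
import numpy as np

rng = np.random.default_rng(0)
n, hidden = 30, 4                      # window length and hidden size (example values)
W1 = rng.normal(size=(hidden, n + 1))  # hidden weights (n inputs + recurrent output)
b1 = np.zeros(hidden)
W2 = rng.normal(size=hidden)           # weights of the linear output neuron
b2 = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nar_step(window, x_hat_prev):
    """One NAR step: window holds (x_{tau-n+1}, ..., x_tau)."""
    z = np.concatenate([window, [x_hat_prev]])
    return W2 @ sigmoid(W1 @ z + b1) + b2

series = np.sin(np.linspace(0, 20, 200))
x_hat = 0.0
for tau in range(n, len(series)):
    x_hat = nar_step(series[tau - n:tau], x_hat)  # one-step-ahead estimate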
C. Long Short-Term Memory (LSTM) Neural Networks
[Fig. 2. Example of LSTM Memory Cell (MC).]

Long Short-Term Memory (LSTM) networks [15] are a particular class of recurrent ANNs where the neurons in the hidden layers are replaced by so-called memory cells (MCs). An MC is a structure that allows storing or forgetting information about past network states. This is made possible by structures called gates. Gates are composed of a cascade of a neuron with sigmoidal activation function and a pointwise multiplication block.

Fig. 2 shows a typical MC structure. The input gate is a neuron with sigmoidal activation function ($\sigma$). Its output determines the fraction of the MC input that is fed to the cell state block. Similarly, the forget gate processes the information that is recurrently fed back into the cell state block. The output gate, instead, determines the fraction of the output of the cell state that is to be used as the output of the MC at each time step. The gates' neurons usually have sigmoidal activation functions ($\sigma$), while the input and cell state use the hyperbolic tangent ($\tanh$) activation function. All the internal connections of the MC have unitary weight. Thanks to this architecture, the output of each memory cell possibly depends on the entire sequence of past states. This makes LSTM ANNs particularly suitable for processing time series with long time dependencies (i.e., long-range inter-sample correlation).

Fig. 3 shows an example of the LSTM ANN architecture that we used in this work. As done with the NAR networks of Section II-B, we consider the sequence $X_n^\tau$ as the network input vector. These $n$ time samples are fed as input to the memory cells (four MCs are shown in Fig. 3). As time ($\tau$) advances, the output of the memory cells depends on the current input sequence $X_n^\tau$ as well as on the previous ones $X_n^{\tau-i}$, $i = 1, \ldots, \tau - n$. As with the NAR network, the output of the hidden layer is filtered through a linear neuron to obtain the network output. A sketch of a single memory cell's update is given below.
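The following is a minimal numpy rendering of one memory cell step. It implements the standard LSTM gating equations sketched in Fig. 2, with scalar weights and omitted biases purely for readability; the random weights stand in for trained ones.

# One forward step of a single (scalar-state) LSTM memory cell, cf. Fig. 2:
# sigmoidal gates modulate what enters, stays in, and leaves the cell state.
import numpy as np

rng = np.random.default_rng(1)
Wi, Wf, Wo, Wc = rng.normal(size=4)  # input weights (untrained, for illustration)
Ui, Uf, Uo, Uc = rng.normal(size=4)  # recurrent weights (biases omitted for brevity)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mc_step(x, c_prev, h_prev):
    i = sigmoid(Wi * x + Ui * h_prev)                   # input gate
    f = sigmoid(Wf * x + Uf * h_prev)                   # forget gate
    o = sigmoid(Wo * x + Uo * h_prev)                   # output gate
    c = f * c_prev + i * np.tanh(Wc * x + Uc * h_prev)  # cell state update (tanh input)
    h = o * np.tanh(c)                                  # gated cell output
    return c, h

c = h = 0.0
for x in np.sin(np.linspace(0, 6, 20)):                 # toy input sequence
    c, h = mc_step(x, c, h)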
III. EXPERIMENTAL SETUP

In this section, we describe the framework that we used to perform the power forecasts with off-the-shelf hardware. Moreover, we introduce the experimental setup (power demand traces and configuration parameters for the considered schemes) for the numerical assessments of Section IV.
A. Parallel Framework
[Fig. 3. Example of LSTM network with $n$ inputs, one hidden layer with Memory Cells (MC), and one linear output neuron.]

Our parallel approach splits the forecasting problem into embarrassingly parallel subproblems. Each subproblem corresponds to the forecast of one of the values of the multi-step forecast vector. With respect to the SVM approach, for an $h$-step ahead forecast, $h$ SVMs are trained in parallel, exploiting the multi-core architecture of commercial CPUs and GPUs. Each SVM is trained to solve one of the $h$ subproblems of the $h$-step ahead forecast. Similarly, for the considered ANN architectures a single network topology is trained multiple times; each training run returns a set of weights and biases corresponding to the solution of one of the $h$ subproblems. Upon completion of the parallel training phase, $h$-step ahead forecast vectors can be obtained in a parallel fashion as well, as in the sketch below.
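The sketch below illustrates this embarrassingly parallel scheme for the SVM case: one regressor per forecast offset, trained concurrently. joblib is used here only as one convenient way to parallelize; the paper does not prescribe a specific library, and all numeric values are example choices.

# Train h independent SVRs, one per forecast offset, in parallel; the
# h-step forecast vector is then assembled from their point predictions.
import numpy as np
from joblib import Parallel, delayed
from sklearn.svm import SVR

def windows(series, n, k):
    """(X, y) pairs where the target lies k steps after the input window."""
    idx = range(n, len(series) - k + 1)
    X = np.array([series[t - n:t] for t in idx])
    y = np.array([series[t + k - 1] for t in idx])
    return X, y

def fit_one(series, n, k):
    X, y = windows(series, n, k)
    return SVR(kernel="rbf", epsilon=0.01).fit(X, y)

series = np.sin(np.linspace(0, 100, 3000))  # stand-in demand trace
n, h = 30, 12                               # example window length and horizon
models = Parallel(n_jobs=-1)(delayed(fit_one)(series, n, k) for k in range(1, h + 1))

last = series[-n:].reshape(1, -1)
forecast = np.array([m.predict(last)[0] for m in models])  # h-step forecast vector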
B. Parameters Setup
In order to assess the performance of the considered forecasting methods, we utilized the dataset in [16]. It contains active power consumption measurements for a single household. Measurements were taken every 60 seconds over a period of almost 4 years, resulting in more than 2 million time samples. Part of this dataset has been used as the training set for the considered forecasting approaches, while the remaining time samples were used to assess the accuracy of the obtained forecasts.

For all the approaches, we performed 120-step ahead forecasts ($h = 120$, one step corresponds to one minute) using sequences of $n = 30$ past time samples. Given the time scale of the considered dataset, this corresponds to two-hour-ahead forecasts using measurements from the past 30 minutes (a sketch of how such forecasting windows can be derived from the trace follows the NAR configuration list below). For the ARMA, SVM, and NAR approaches we used a comparatively small training set: our experimental results have shown that increasing its size beyond the chosen value leads to marginal accuracy improvements at the cost of a significantly increased training time. For LSTM, instead, we used a substantially larger training set, which led to the best accuracy for this scheme while keeping the training time acceptable on off-the-shelf hardware.

The NAR network that we used for the results in the next section is configured as follows:
• it takes the subsequence $X_{30}^\tau$ as input (i.e., a time horizon of 30 minutes is used to forecast the 120 future values);
• it has one hidden layer whose neurons have sigmoidal activation functions;
• it has one output neuron with linear activation function.
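Following up on the windowing just described, here is a minimal pandas/numpy sketch of how the (input, target) pairs could be assembled. The file name and column name match the UCI distribution of the dataset [16] but should be treated as assumptions of this sketch, and the trace is trimmed to keep the example light.

# Derive (input, target) windows from the household trace [16]: n = 30 past
# minutes in, h = 120 future minutes out.
import numpy as np
import pandas as pd

df = pd.read_csv("household_power_consumption.txt", sep=";", na_values="?")
series = df["Global_active_power"].astype(float).ffill().to_numpy()
series = series[:50_000]  # trim for the example; the full trace exceeds 2M samples

n, h = 30, 120
idx = range(n, len(series) - h + 1)
X = np.array([series[t - n:t] for t in idx])   # model inputs, shape (m, 30)
Y = np.array([series[t:t + h] for t in idx])   # targets, shape (m, 120)
# Column k of Y is the training target of the (k+1)-step-ahead subproblem.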
[Fig. 4. Mean absolute error [kW] vs. forecast span [minutes] for the SVM, LSTM, NAR, and ARMA forecasting schemes.]
The training algorithm chosen for the NAR approach is Levenberg-Marquardt with Bayesian weight regularization [17]. This algorithm is particularly suited for time series exhibiting a noisy behavior.

The LSTM network that we used for the results in the next section is configured as follows:
• it takes the subsequence $X_{30}^\tau$ as input (i.e., a time horizon of 30 minutes is used to forecast the 120 future values);
• it has one hidden layer composed of memory cells with softsign activation functions [18];
• it has one output neuron with linear activation function.

The LSTM network has been trained using the ADAGRAD algorithm [19]. This training method exhibits an improved convergence rate over standard gradient descent schemes thanks to a dynamically adjusted learning rate. A possible realization of one such training run is sketched below.
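The sketch below shows one of the $h$ per-offset LSTM trainings, using Keras for illustration. The paper does not tie itself to a specific framework, and the hidden layer size and training hyperparameters shown here are placeholders, not the paper's values.

# One of the h per-offset trainings: a softsign-activated LSTM hidden layer
# followed by a linear output neuron, optimized with ADAGRAD.
import numpy as np
import tensorflow as tf

n, k = 30, 1                                   # window length; forecast offset
model = tf.keras.Sequential([
    tf.keras.Input(shape=(n, 1)),
    tf.keras.layers.LSTM(32, activation="softsign"),    # hidden size: example value
    tf.keras.layers.Dense(1, activation="linear"),      # linear output neuron
])
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01),
              loss="mse")

series = np.sin(np.linspace(0, 60, 2000)).astype("float32")  # toy trace
idx = range(n, len(series) - k + 1)
X = np.stack([series[t - n:t] for t in idx])[..., None]      # shape (m, n, 1)
y = np.array([series[t + k - 1] for t in idx])
model.fit(X, y, epochs=2, batch_size=64, verbose=0)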
IV. RESULTS

Next, we present the experimental results obtained through our parallel forecasting framework and the parameter setup of Section III. We compare the performance of the considered machine learning approaches against that of a state-of-the-art ARMA model in terms of mean absolute error and error variance, both computed per forecast span as in the sketch below.

Fig. 4 shows the mean absolute error of the forecasts obtained by the ARMA model and those obtained with the SVM, NAR, and LSTM approaches. The mean absolute error has been computed for each forecast as the arithmetic average of the distance between the points estimated by the models and the target values of the input time series. A first observation is that all the machine learning approaches outperform ARMA. Also, NAR and LSTM have similar average performance; however, LSTM requires a substantially larger training set than that used for the NAR network. The SVM approach exhibits a slightly higher error than NAR and LSTM in the early time steps, but for longer time spans it achieves the best forecast accuracy. These results suggest the adoption of a hybrid forecasting approach where the first forecasts are computed through a NAR ANN and the later ones are obtained via SVM.
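Both reported metrics reduce to simple per-span statistics over the set of test forecasts. The short sketch below (array names are illustrative) computes one mean absolute error and one error variance per forecast span, i.e., the quantities plotted in Figs. 4 and 5.

# Per-span error statistics over a test set: forecasts and targets are
# (num_forecasts, h) arrays; one MAE and one variance per forecast span.
import numpy as np

def per_span_stats(forecasts, targets):
    err = forecasts - targets
    mae = np.mean(np.abs(err), axis=0)  # mean absolute error vs. span (Fig. 4)
    var = np.var(err, axis=0)           # error variance vs. span (Fig. 5)
    return mae, var

# Toy example: 100 test forecasts over a 120-minute horizon.
rng = np.random.default_rng(2)
f, t = rng.normal(size=(100, 120)), rng.normal(size=(100, 120))
mae, var = per_span_stats(f, t)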
[Fig. 5. Error variance vs. forecast span [minutes] for the SVM, LSTM, NAR, and ARMA forecasting schemes.]

The point that determines the transition between the NAR network and the SVM model depends on the dataset characteristics and should be determined for every use case.

In Fig. 5, we show the variance of the prediction errors of Fig. 4, which is related to the prediction uncertainty. As a first result, we note that for the ARMA and NAR approaches the error variances exhibit the same growth trend as the average errors of Fig. 4. This confirms that a wider forecast window corresponds to a correspondingly increasing uncertainty in the prediction accuracy. Nevertheless, especially in the first part of the forecast span, ARMA's error variance grows much faster than that of NAR, making the latter the better approach. The SVM and LSTM techniques exhibit a considerably lower error variance than ARMA and NAR. The error variance of SVM grows as fast as NAR's within the first prediction steps and then drops, reaching a minimum around $h = 60$ minutes. SVM turns out to be the algorithm of choice when predicting far ahead in time, as it obtains the smallest error while also showing the second-smallest uncertainty (error variance). This is due to the SVM parameter $\epsilon$, which sets a bound on the maximum forecasting error. The aforementioned hybrid scheme, i.e., using NAR for short prediction intervals and then switching to SVM, is also supported by the results of Fig. 5. Finally, LSTM exhibits the smallest variance of all methods. This means that, despite not being the most accurate forecasting scheme, it guarantees that the error fluctuations are small.

V. CONCLUSIONS
In this work we have presented a comparison of the performance of different machine learning approaches in terms of multi-step ahead forecasting error and error variance. After briefly reviewing machine learning approaches for forecasting from the literature, namely SVM, NAR, and LSTM ANNs, we described their forecasting architectures and the dataset that has been used for their experimental evaluation. The obtained results have been compared to those obtained by an ARMA model. All the machine learning approaches outperform ARMA, whereas no single algorithm of choice exists among SVM, NAR, and LSTM. The LSTM network exhibits a slightly worse prediction accuracy than the others, but it has the smallest error variance. NAR exhibits the best forecasting accuracy for short prediction windows. SVM shows a complementary behavior, guaranteeing the best accuracy for the estimation of power demands far ahead in time. Our results suggest the adoption of a hybrid approach, which entails the use of NAR for short time horizons and SVM for long prediction intervals.

REFERENCES
[1] K. Geisler, "The Relationship Between Smart Grids and Smart Cities," IEEE Smart Grid Newsletter Compendium, May 2013.
[2] P. Tenti, A. Costabeber, P. Mattavelli, and D. Trombetti, "Distribution loss minimization by token ring control of power electronic interfaces in residential microgrids," IEEE Trans. Industrial Electronics, vol. 59, no. 10, pp. 3817–3826, Oct. 2012.
[3] R. Bonetto, M. Rossi, S. Tomasin, and M. Zorzi, "On the interplay of distributed power loss reduction and communication in low voltage microgrids," IEEE Trans. Industrial Informatics, vol. 12, no. 1, pp. 322–337, Feb. 2016.
[4] R. Bonetto, T. Caldognetto, S. Buso, M. Rossi, S. Tomasin, and P. Tenti, "Lightweight energy management of islanded operated microgrids for prosumer communities," in IEEE International Conference on Industrial Technology (ICIT), Seville, Spain, March 2015, pp. 1323–1328.
[5] P. B. Luh, L. D. Michel, P. Friedland, C. Guan, and Y. Wang, "Load forecasting and demand response," in IEEE PES General Meeting, Minneapolis, MN, USA, July 2010.
[6] B. Narayanaswamy, T. S. Jayram, and V. N. Yoong, "Hedging strategies for renewable resource integration and uncertainty management in the smart grid," in IEEE PES Innovative Smart Grid Technologies Europe (ISGT Europe), Berlin, Germany, Oct. 2012.
[7] R. Haque, T. Jamal, M. N. I. Maruf, S. Ferdous, and S. F. H. Priya, "Smart management of PHEV and renewable energy sources for grid peak demand energy supply," in International Conference on Electrical Engineering and Information Communication Technology (ICEEICT), Dhaka, Bangladesh, May 2015.
[8] G. K. Venayagamoorthy, "Potentials and promises of computational intelligence for smart grids," in IEEE PES General Meeting, Calgary, Canada, July 2009.
[9] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag, 1995.
[10] L. Bottou and C.-J. Lin, "Support Vector Machine Solvers," in Large Scale Kernel Machines, L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, Eds. Cambridge, MA, USA: MIT Press, 2007, pp. 301–320.
[11] P. Orponen, "Computational complexity of neural networks: A survey," Nordic Journal of Computing, vol. 1, no. 1, pp. 94–110, Mar. 1994.
[12] P. Qingle and Z. Min, "Very short-term load forecasting based on neural network and rough set," in IEEE International Conference on Intelligent Computation Technology and Automation (ICICTA), Changsha, China, May 2010.
[13] K. Gajowniczek and T. Ząbkowski, "Short term electricity forecasting using individual smart meter data," Procedia Computer Science, vol. 35, pp. 589–597, 2014.
[14] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Upper Saddle River, NJ, USA: Prentice Hall PTR, 1998.
[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[16] K. Bache and M. Lichman, "UCI Machine Learning Repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[17] C. M. Bishop, Neural Networks for Pattern Recognition. New York, NY, USA: Oxford University Press, 1995.
[18] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[19] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.