Forecasting residential gas demand: machine learning approaches and seasonal role of temperature forecasts
Emanuele Fabbiani, Andrea Marziali, and Giuseppe De Nicolao
Department of Electrical, Computer and Biomedical Engineering, University of Pavia
14th August 2020
Abstract
Gas demand forecasting is a critical task for energy providers as it impacts on pipe reservation and stock planning. In this paper, the one-day-ahead forecasting of residential gas demand at country level is investigated by implementing and comparing five models: Ridge Regression, Gaussian Process (GP), k-Nearest Neighbour, Artificial Neural Network (ANN), and Torus Model. Italian demand data from 2007 to 2017 are used for training and testing the proposed algorithms. The choice of the relevant covariates and the most significant aspects of the pre-processing and feature extraction steps are discussed in depth, lending particular attention to the role of one-day-ahead temperature forecasts. Our best model, in terms of Root Mean Squared Error (RMSE), is the ANN, closely followed by the GP. If the Mean Absolute Error (MAE) is taken as an error measure, the GP becomes the best model, although by a narrow margin. A main novel contribution is the development of a model describing the propagation of temperature errors to gas forecasting errors that is successfully validated on experimental data. Being able to predict the quantitative impact of temperature forecasts on gas forecasts could be useful in order to assess potential improvement margins associated with more sophisticated weather forecasts. On the Italian data, it is shown that temperature forecast errors account for some 18% of the mean squared error of gas demand forecasts provided by ANN.
Keywords— natural gas; time series forecasting; statistical learning; Gaussian Process; neural networks.

1 Introduction
Forecasting natural gas demand is a crucial task for energy companies for several reasons. First, it provides relevant information to reserve pipe capacity and plan stocks effectively. Furthermore, regulations impose the balance of the network by charging providers with a fee proportional to their unbalanced quantity. Finally, demand is a critical input to forecast gas price, which is, in turn, a driver for business decisions.

Two comprehensive reviews of the literature about gas demand forecasting are [17] and [18]. According to [17], papers can be classified along four dimensions: the prediction horizon can range from hourly to yearly; the reference area from single nodes of the network to a whole country; adopted models include time series, mathematical and statistical approaches, neural networks, and others; input features can be demand history, temperature, calendar, and other minor ones.

Several studies focused on country- or regional-level daily forecasting. Mathematical and statistical models based on non-linear parametric functions were used in [5] to explain the factors which affect the demand. A different multi-factor approach was developed in [16], and a model based on the physical relation between gas demand and temperature was presented in [7]. An adaptive-network-based fuzzy inference system (ANFIS) was described in [3], where the authors showed better performance of their model compared to ANN and conventional time series methods. A statistical learning model, based on support vector machine (SVM), was developed in [26] for UK demand, and compared to ANN and an autoregressive moving average (ARMA) predictor. A hybrid model, exploiting many different techniques, such as wavelet transform, genetic algorithm, ANFIS, and ANN, was used in [15]. Neural networks were applied in [21, 6, 20, 19, 22] to perform hourly and daily forecasts on cities and regions. Moreover, [24] showed how ANNs, combined with Principal Components Correlation Analysis (PCCA), provide robust and precise forecasts on regional demand. Baldacci et al. [4] used nearest neighbors and local regression to forecast the gas demand of small villages. They also presented an investigation of the effects of temperature forecast errors, concluding that the influence on model accuracy is negligible.

Concerning long-term forecasting, [23] discussed gas demand in Bangladesh, showing how population growth and Gross Domestic Product (GDP) are essential drivers of the demand. Similar conclusions were reached in [11], where a breeder model was proven superior to other approaches in forecasting Turkish demand.

The present work focuses on day-ahead forecasting of Residential Gas Demand (RGD) at country level. A companion paper focuses on Industrial (IGD) and Thermoelectric Gas Demand (TGD) [13]. The analysis addresses two issues, not adequately covered in the existing literature, with Italian RGD used as a case study.

The first issue is the comparison between regression and machine learning methods, in order to understand which technique is more suitable when performance is assessed over several years. Five forecasting methods are considered, two based on linear regression and three on machine learning techniques, paying particular attention to their accurate tuning. This involves a detailed discussion on the selection of the relevant covariates, among which a primary role is played by the weather temperature. The second issue has in fact to do with the influence of weather forecast errors on natural gas demand models.
Despite being critical in industrial applications, previous works seldom specify whether the predictors use forecasted or observed temperature, maybe due to the belief that temperature errors have negligible impact. In contrast, we assess the influence of weather forecasting errors, both theoretically and experimentally. A novel easy-to-compute bound is derived that predicts the best achievable RGD root mean square error (RMSE) as a function of the temperature RMSE. This bound is then validated on experimental data: Italian RGD forecasts are obtained using both observed and predicted temperatures, thus allowing for a quantitative assessment of accuracy degradation.

The paper is organized as follows. In Section 2, we formulate the problem and present the available data. In Section 3, we provide a statistical characterization of target and input variables, discussing both preprocessing and feature selection. Section 4 describes the models, including the training process and hyperparameter tuning. In Section 5, we derive the performance limit, which is used as the ultimate benchmark in Section 6, where the results are presented and discussed. Finally, Section 7 is devoted to some concluding remarks.

2 Problem statement and available data

In Italy, natural gas is the most common fuel for both power plants and domestic heating. Moreover, several industrial facilities burn gas for either heating or powering productive processes. According to SNAM Rete Gas [1], the Italian Transmission System Operator (TSO), in 2017 about 70.59 billion cubic meters of natural gas were consumed, with an increase of 5.6% over the previous year. Overall, the increase in demand between 2015 and 2017 was 11%. Out of the total gas demand in 2017, 35.9% was due to thermoelectric power plants, 22.4% to industrial facilities, and 41.7% to residential users.

Residential Gas Demand (RGD) represents the main part of the overall Italian gas consumption, accounting for household usage for cooking, water heating, and, most importantly, environment heating.

The available dataset covers 11 years, from 2007 to 2017, and is made of 3 fields: date (t), forecasted average temperature (T̂), and residential gas demand (RGD). Forecasted temperature is relative to the Northern regions of Italy. In the preliminary analysis, a weighted average of the temperatures in different zones of Italy was also considered, but it was dropped because of its weaker correlation with RGD. This is explained by the role of domestic heating in Northern Italy, where winters are colder than in other regions.

The profile of RGD from 2007 to 2017 is displayed in Fig. 1.
Figure 1: Italian Residential Gas Demand (RGD), in MSCM: years 2007-2017.
3 Data analysis and feature selection

RGD magnitude greatly oscillates with the season: during the cold months, from October to March, it represents about 56% of the overall Italian demand, while it drops to about 28% during the warm months, from April to September. In fact, when the temperature climbs above 17-18 °C, domestic heating is typically switched off. Thus, during the cold period lower temperatures cause a larger RGD, while, during summer, weather influence is negligible and a weekly pattern becomes evident, with lower RGD during weekends compared to working days. Due to the lack of dependence on weather conditions, the profile of summer RGD is remarkably repeatable from year to year. All these features are visible in Fig. 2, which displays eleven years of Italian RGD, overlapped with a proper shift to align weekdays.
As expected, the autocorrelation function, estimated on the whole dataset, exhibits a clear yearly seasonality and a much smaller weekly periodicity, see Fig. 3. Most of the spectral density, see Fig. 4, is concentrated at a period of 365.25 days. A smaller yet relevant peak can be found at a period of 7 days, accounting for the weekly periodicity. In both cases, smaller peaks at lower periods are ascribable to harmonics located at multiples of the main harmonic.
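These seasonalities can be checked numerically. The sketch below, assuming the demand series is available as a one-dimensional array rgd (a hypothetical variable name), estimates the sample autocorrelation and the periodogram in the spirit of Figs. 3 and 4:

```python
import numpy as np
from scipy.signal import periodogram

# rgd: hypothetical 1-D array of daily demand values (2007-2017)

def acf(x, max_lag=400):
    """Sample autocorrelation up to max_lag days."""
    x = x - x.mean()
    c = np.correlate(x, x, mode="full")[x.size - 1:]
    return c[:max_lag + 1] / c[0]

# Periodogram with a sampling frequency of 1 sample/day:
# peaks are expected near periods of 365.25 and 7 days.
freqs, power = periodogram(rgd, fs=1.0, detrend="linear")
periods = 1.0 / freqs[1:]  # skip the zero frequency
```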
Figure 3: RGD autocorrelation function estimated on 2007-2017 data. The 365-day yearly periodicity is evident. In the inset, weekly waves witness the presence of a 7-day periodicity of smaller amplitude.
Figure 4: RGD periodogram. Left panel: periods up to 8 days; right panel: periods up to 500 days. The yearly periodicity is highlighted by the peak at 365.25 days, the weekly one by the smaller spike at a period of 7 days. Other spikes correspond to harmonics located at multiples of the main harmonic.
The autocorrelation at lag 1 can be assessed through the scatter plot in Fig. 5a, where RGD at time t is plotted against RGD at time t − 1. The correlation coefficient computed on the entire dataset is 0.988, and it increases to 0.995 if the pairs Saturday-Friday and Monday-Sunday are discarded. This reflects a different behavior between working days and weekends, visually confirmed in the plot, where Monday's RGD stays in the upper part of the cloud whereas Saturday's RGD lies in the lower part. As for the lag-7 autocorrelation, in Fig. 5b the scatter plot of RGD at times t and t − 7 is displayed. The scatter plot in Fig. 5b is narrower when the demand is low, that is during warm months, while it gets more dispersed in winter, when the demand is high. This is possibly due to the variability of weather from one week to the next one.

In order to characterize the yearly seasonality, it is convenient to introduce the notion of similar day. The following definitions hold:

• year(t) is the year to which day t belongs;
• weekday(t) is the weekday of day t, e.g. Monday, Tuesday, etc.;
• yearday(t) is the day number within year(t), starting from January 1, whose yearday is equal to 1.

Definition 1 (Similar Day). If t is not a holiday, its similar day τ* = sim(t) is

\[ \tau^* = \arg\min_{\tau} \; |\mathrm{yearday}(\tau) - \mathrm{yearday}(t)| \]

subject to

• year(τ) = year(t) − 1;
• weekday(τ) = weekday(t);
• τ is not a holiday.

If t is a holiday, its similar day τ* = sim(t) is the same holiday in the previous year.

According to the Italian calendar, holidays are 1 January, 6 January, 25 April, 1 May, 2 June, 15 August, 1 November, 8, 25 and 26 December, Easter and Easter Monday. The relationship between RGD and RGD in the similar day is shown in Fig. 5c: again, the correlation is higher when the demand is lower, due to the smaller influence of temperature. It can also be of some interest to take into account the similar day of t − 1. The scatter plot in Fig. 5d shows that the difference RGD(t − 1) − RGD(sim(t − 1)) is a good proxy to the difference RGD(t) − RGD(sim(t)).

Due to these considerations, RGD(t − 1), RGD(t − 7), RGD(sim(t)), and RGD(sim(t − 1)) were selected as inputs to forecast RGD at time t.
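A minimal sketch of Definition 1 in Python follows; the helper is_holiday and the nearest same-weekday search are illustrative choices of ours (the paper does not specify an implementation), and the moving holidays rely on dateutil for the Easter date:

```python
from datetime import date, timedelta
from dateutil.easter import easter

FIXED_HOLIDAYS = {(1, 1), (1, 6), (4, 25), (5, 1), (6, 2),
                  (8, 15), (11, 1), (12, 8), (12, 25), (12, 26)}

def is_holiday(d: date) -> bool:
    """Italian holidays: fixed dates plus Easter and Easter Monday."""
    e = easter(d.year)
    return (d.month, d.day) in FIXED_HOLIDAYS or d in (e, e + timedelta(days=1))

def similar_day(t: date) -> date:
    """Similar day sim(t) of Definition 1 (illustrative sketch)."""
    if is_holiday(t):
        # Same holiday in the previous year; Easter-based holidays move.
        e, e_prev = easter(t.year), easter(t.year - 1)
        if t in (e, e + timedelta(days=1)):
            return e_prev + (t - e)
        return date(t.year - 1, t.month, t.day)
    # Start from the same calendar date one year earlier (a proxy for
    # equal yearday) and move to the closest non-holiday same weekday.
    day = min(t.day, 28) if t.month == 2 else t.day  # guard for Feb 29
    base = date(t.year - 1, t.month, day)
    for k in range(60):
        for tau in (base + timedelta(days=k), base - timedelta(days=k)):
            if tau.weekday() == t.weekday() and not is_holiday(tau):
                return tau
    raise ValueError("no similar day found")
```

For instance, similar_day(date(2017, 3, 15)) returns the closest Wednesday in March 2016 that is not a holiday.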
Figure 5: Scatter plots involving RGD and potential features to be used for its prediction. (a) RGD(t) vs RGD(t − 1); (b) RGD(t) vs RGD(t − 7); (c) RGD(t) vs RGD(sim(t)); (d) RGD(t) − RGD(sim(t)) vs RGD(t − 1) − RGD(sim(t − 1)). All quantities in MSCM.
The RGD time series shows a strong relationship with temperature, especially during the winter season, when the temperature falls below 18 °C and household heating becomes relevant. As shown in the left panel of Fig. 6, with good approximation the relationship is piecewise linear: a line with a negative slope below 18 °C, followed by a constant line above 18 °C. In order to transform the piecewise linear dependence into a linear one, it is useful to refer to the so-called Heating Degree Days (HDD):
Definition 2 (Heating Degree Days (HDD)).

\[ \mathrm{HDD}(T) = \max(18°C - T,\; 0) \tag{1} \]

In the right panel of Fig. 6, the scatter plot of RGD vs. HDD highlights an approximately linear relationship with high positive correlation. The correlation between HDD and RGD is even more evident when we look at the time series of RGD and HDD during 2017, see Fig. 7.
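In code, the transform of Eq. (1) is a one-liner; a sketch applying it element-wise to an array of temperatures:

```python
import numpy as np

def hdd(temp_c):
    """Heating Degree Days, Eq. (1): max(18 degC - T, 0), element-wise."""
    return np.maximum(18.0 - np.asarray(temp_c), 0.0)

# Example: hdd([21.0, 10.5, -2.0]) -> array([ 0. ,  7.5, 20. ])
```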
Figure 6: Left panel: scatter plot of daily RGD vs average daily temperature. Right panel: scatter plot of daily RGD vs HDD. Inset: HDD as a function of the temperature.

Figure 7: Time series of RGD (MSCM) and HDD (°C) in 2017. The instantaneous correlation between the two series is apparent.
As shown in Fig. 6, RGD is better correlated to HDD than to plain temperatures. Thus, HDD(T̂(t)) and the lagged values HDD(T̂(t − 1)), HDD(T̂(t − 7)), and HDD(T̂(sim(t))) were selected as features, where T̂(t) denotes the one-day-ahead forecast of T(t). Moreover, the forecasted plain temperature T̂(t) and its lagged values T̂(t − 1), T̂(t − 7), and T̂(sim(t)) were also included.

It has already been observed that weekdays and holidays have a great influence on RGD. To capture this phenomenon, the following categorical calendar features were taken into account.
Weekday. Given the weekly periodicity, the seven days of the week were taken as explanatory features. By resorting to the dummy encoding method, they were transformed into 6 dichotomic time series.
Holiday. A binary feature which takes value 1 in correspondence of holidays.

Day after holiday. A binary feature which takes value 1 on the first working day after a holiday. A working day is defined as a day different from Saturday and Sunday that is not a holiday.

Bridge holiday. A binary feature which takes value 1 on isolated working days, that is working days where both the day before and the day after are either Saturday, Sunday or a holiday.

The complete set of features is summarized in Table 1.

Feature                   Time        Type
RGD                       t − 1       continuous
RGD                       t − 7       continuous
RGD                       sim(t)      continuous
RGD                       sim(t − 1)  continuous
Forecasted temperature    t           continuous
Forecasted temperature    t − 1       continuous
Forecasted temperature    t − 7       continuous
Forecasted temperature    sim(t)      continuous
Forecasted HDD            t           continuous
Forecasted HDD            t − 1       continuous
Forecasted HDD            t − 7       continuous
Forecasted HDD            sim(t)      continuous
Weekday                   t           categorical
Holiday                   t           binary
Day after holiday         t           binary
Bridge holiday            t           binary

Table 1: List of features.

4 Forecasting models

The classical methods used for time series forecasting are linear Box-Jenkins models such as SARIMA, where the forecast is based only on past values of the time series, and SARIMAX, which accounts also for exogenous variables. A drawback of classical linear models is the difficulty in handling the discontinuities due to holidays and the possible presence of other nonlinear phenomena. Below, RGD forecasting is formulated as a statistical learning problem.

Based on the availability of n data pairs (x_i, y_i), i = 1, ..., n, known as the training data, the goal is to design a prediction rule f(·) with the objective of using f(x*) as prediction of y*, where (x*, y*) is any novel input-output pair. In this context, x_i ∈ R^p, p < n, is a vector whose entries are given by the p features associated with the target y_i. Herein, the p features are the 21 covariates discussed in the previous section and listed in Table 1. In the following, with reference to the training data, y = {y_i} ∈ R^n will denote the vector of the targets and X = {x_ij} ∈ R^{n×p} will denote the matrix of the training input data, where x_ij is the j-th feature of the i-th training pair (x_i, y_i).

Below, five approaches are presented:

• ridge regression;
• Gaussian Process (GP);
• k-nearest neighbour (KNN);
• artificial neural network (ANN);
• torus model.

All the models, except the torus one, were implemented in Python, using the scikit-learn and keras packages; automated hyperparameter tuning exploited the GridSearchCV function of scikit-learn. The torus model was implemented in MATLAB, as well as its hyperparameter tuning routine.

4.1 Ridge regression

Ridge regression [10] is a technique to identify a linear model in the form

\[ \hat{f}(\mathbf{x}) = \sum_{j=1}^{p} x_j \beta_j = \mathbf{x}^T \boldsymbol{\beta}, \qquad \mathbf{x}, \boldsymbol{\beta} \in \mathbb{R}^p \]

To prevent overfitting, besides the standard squared sum of the residuals, the loss function includes also the squared norm of the parameter vector β:

\[ \boldsymbol{\beta}^{\mathrm{ridge}} := \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|^2 \tag{2} \]

where λ is the so-called regularization parameter, a hyperparameter which controls the flexibility of the learning algorithm. Assuming that X is full rank, the solution of (2) is

\[ \boldsymbol{\beta}^{\mathrm{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y} \tag{3} \]

which highlights the shrinking effect with respect to the standard least squares estimator β^LS = (XᵀX)⁻¹Xᵀy. Since the parameters are obtained in closed form from (3), the ridge regression model is entirely specified by the choice of λ, which can be calibrated following different approaches [10].
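As a concrete illustration, a cross-validated choice of λ (called alpha in scikit-learn) can be obtained with the GridSearchCV routine mentioned above; the grid bounds below are placeholders, not necessarily those used in the paper:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# X_train, y_train, X_test: hypothetical feature matrices and target vector
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": np.logspace(-4, 4, 50)},  # assumed grid bounds
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)
best_lambda = search.best_params_["alpha"]
y_pred = search.predict(X_test)  # refit on the whole training set by default
```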
A normalized assessment of the amount of regularization associated with a given λ is provided by the so-called effective degrees of freedom

\[ \mathrm{df}(\lambda) = \mathrm{tr}\left( \mathbf{X} (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \right) \]

In fact, df(λ) ranges from p to 0 as λ goes from 0 to infinity [10].

4.2 Gaussian Process

Let ȳ = [y* yᵀ]ᵀ, x̄ = [x*ᵀ x₁ᵀ ... xₙᵀ]ᵀ and assume that, conditional on x̄, the vector ȳ is normally distributed as follows:

\[ \bar{\mathbf{y}} \mid \bar{\mathbf{x}} \sim \mathcal{N}\left( \mathbf{0},\; \boldsymbol{\Sigma}(\bar{\mathbf{x}}) + \sigma^2 \mathbf{I}_{n+1} \right), \qquad [\boldsymbol{\Sigma}(\bar{\mathbf{x}})]_{ij} = \kappa(\bar{\mathbf{x}}_i, \bar{\mathbf{x}}_j) \]

where the kernel κ(·,·) is a suitable function that specifies the correlation between two target variables ȳ_i, ȳ_j as a function of the corresponding input vectors x̄_i, x̄_j. The choice of the kernel by the designer should reflect the available prior knowledge on the characteristics of the prediction rule. It is worth noting that the previous assumption is equivalent to assuming that

\[ y_i = f(\mathbf{x}_i) + \epsilon_i, \qquad i = 1, \ldots, n \]

where ε_i ~ N(0, σ²) are independent errors and f(·) is the realization of a zero-mean continuous-time Gaussian Process (GP) with autocovariance κ(x̄_i, x̄_j) [25, 14]. The estimation of a new target value y* relies on the following well known property of normally distributed random vectors.

Lemma 1 (Conditional distribution of jointly Gaussian variables). Let z* and z be zero-mean jointly Gaussian random variables distributed as follows:

\[ \begin{bmatrix} z_* \\ \mathbf{z} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ \mathbf{0} \end{bmatrix}, \begin{bmatrix} \Sigma_{z_* z_*} + \sigma^2 & \Sigma_{z_* \mathbf{z}} \\ \Sigma_{\mathbf{z} z_*} & \Sigma_{\mathbf{z}\mathbf{z}} + \sigma^2 \mathbf{I}_n \end{bmatrix} \right) \]

Then, the posterior distribution of z* conditional on z is:

\[ z_* \mid \mathbf{z} \sim \mathcal{N}\left( \Sigma_{z_* \mathbf{z}} \left( \Sigma_{\mathbf{z}\mathbf{z}} + \sigma^2 \mathbf{I}_n \right)^{-1} \mathbf{z},\;\; \Sigma_{z_* z_*} + \sigma^2 - \Sigma_{z_* \mathbf{z}} \left( \Sigma_{\mathbf{z}\mathbf{z}} + \sigma^2 \mathbf{I}_n \right)^{-1} \Sigma_{\mathbf{z} z_*} \right) \]

In view of the previous lemma, it is possible to use the posterior expectation as prediction rule:

\[ f(\mathbf{x}_*) = E[y_* \mid \mathbf{x}_*, \mathbf{y}, \mathbf{x}] = \sum_{i=1}^{n} c_i \, \kappa(\mathbf{x}_*, \bar{\mathbf{x}}_i), \qquad \mathbf{c} = \left( \boldsymbol{\Sigma}(\mathbf{x}) + \sigma^2 \mathbf{I}_n \right)^{-1} \mathbf{y} \]

The main distinctive feature of GP models is the learning process, which aims directly at obtaining the predictive function rather than inferring its parameters. A zero-mean GP is completely defined by its covariance function κ(x_i, x_j), also called kernel. When it is a function of the distance r = ‖x_i − x_j‖ between x_i and x_j, i.e. κ(x_i, x_j) = κ(r), the kernel is said to be stationary and isotropic. Within this class, a popular and flexible choice is the family of Matérn kernels, defined by:

\[ \kappa_{\mathrm{Matern}}(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, r}{l} \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}\, r}{l} \right) \]

where ν and l are hyperparameters to be tuned and K_ν is a modified Bessel function [2]. The parameter l defines the characteristic length-scale of the process, whereas ν defines the specific covariance function in the Matérn class. Different approaches are possible in order to tune the hyperparameters ν, l, and σ². According to an empirical Bayes approach, the hyperparameter vector η is chosen as the maximizer of the marginal likelihood p(y | x, η).
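A sketch with scikit-learn (named in the implementation note above) follows; the additive WhiteKernel standing in for the noise variance σ² and the initial hyperparameter values are assumptions of ours, while the marginal-likelihood maximization is performed internally by fit:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

# Matérn kernel with length-scale l and smoothness nu, plus a white-noise
# term playing the role of sigma^2; l and the noise level are optimized
# by maximizing the marginal likelihood, while nu is kept fixed per fit.
kernel = Matern(length_scale=1.0, nu=1.5) + WhiteKernel(noise_level=1.0)

gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_train, y_train)  # empirical-Bayes hyperparameter tuning happens here

y_mean, y_std = gp.predict(X_test, return_std=True)  # posterior mean and std
```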
4.3 K-nearest neighbours

K-Nearest neighbours (KNN) relies on the distance between samples in the feature space: given a test sample x*, the prediction of y* is computed by averaging K training samples y_i, i ∈ C, where C denotes the set identified by the K feature vectors x_i that are closest to x*, according to some distance measure, e.g., the Euclidean norm (herein adopted). In order to specify a KNN estimator, one has to choose the distance metric, e.g., Euclidean, Minkowski, Manhattan, etc., and the type of weighted average, e.g., uniform or inverse distance, and to calibrate one hyperparameter, viz. the number K of neighbors. Too small values of K lead to overfitting to the training data, while including too many neighbors reduces the variance at the cost of jeopardizing model flexibility.

4.4 MLP Neural Network

Artificial Neural Networks (ANN) are complex non-linear models, capable of capturing non-linear patterns and relations. A comprehensive explanation of their structure and the most common training algorithms can be found in [8]. In this study, we focused on the Multi-Layer Perceptron (MLP), or fully connected ANN. The Rectified Linear Unit (ReLU) activation function and the Mean Squared Error (MSE) loss function were adopted. The training was performed using gradient descent, as implemented in the Adaptive Moment Estimation (ADAM) algorithm [12]. The hyperparameters to be tuned include the number of neurons in each layer, the parameters entering the definition of the activation functions, and optimization parameters such as the number of epochs, batch size, and learning rate.
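A keras sketch of the architecture eventually selected in Section 6 (a 24-12-4-1 fully connected network with ReLU activations, MSE loss, and ADAM) is shown below; the tensorflow.keras import path and the validation split are assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# 21 input features (Table 1); layers of 24, 12, 4 neurons; linear output.
model = Sequential([
    Dense(24, activation="relu", input_shape=(21,)),
    Dense(12, activation="relu"),
    Dense(4, activation="relu"),
    Dense(1),
])
model.compile(optimizer=Adam(learning_rate=0.001), loss="mse")

# Hyperparameters reported in Section 6: 1000 epochs, batch size 32.
model.fit(X_train, y_train, epochs=1000, batch_size=32,
          validation_split=0.2, verbose=0)
```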
4.5 Torus model

The torus model [9] is a linear model based on sinusoidal functions, initially developed to predict electrical power load. Its short-term version was adapted in order to forecast RGD. Following [9], a logarithmic transformation of the RGD was performed in order to mitigate the effect of its skewness. The long-term model is

\[ \ln \widehat{\mathrm{RGD}}_{\mathrm{long}}(t) = L(t) + F(t) + \sum_i H_i(t) \]

where the log-forecast is the sum of three elements: the trend or level L, the multiperiodic term F, which accounts for seasonality, and the effect of holidays Σ_i H_i. The multiperiodic term F is modeled by a linear combination of sinusoidal functions:

\[ F(t) = \sum_{i=1}^{(1+2N_d)(1+2N_w)} \theta_i h_i(t) \]

where the functions h_i are given by the product of the j-th element in D with the k-th element in W, for suitable j and k, with

\[ D = \{ \cos(j \Psi t),\; j \in [0, N_d] \} \cup \{ \sin(j \Psi t),\; j \in [1, N_d] \} \]
\[ W = \{ \cos(k \Omega t),\; k \in [0, N_w] \} \cup \{ \sin(k \Omega t),\; k \in [1, N_w] \} \]

The frequencies of the sinusoidal functions are
\[ \Psi = \frac{2\pi}{365.25} \qquad \text{and} \qquad \Omega = \frac{2\pi}{7} \]

The numbers of harmonics, respectively N_d for the 365.25-day and N_w for the 7-day periodicity, are hyperparameters of the model. The dependency on temperature, expressed through HDD(t) and its daily difference HDD(t) − HDD(t − 1), was accounted for by including these two features in the set of regressors. The terms related to trend and holidays were the same as in [9]. Finally, to get a short-term predictor, the long-term model was corrected with the gas demand of the previous day:

\[ \widehat{\mathrm{RGD}}(t) = \widehat{\mathrm{RGD}}_{\mathrm{long}}(t) \, \frac{\mathrm{RGD}(t-1)}{\widehat{\mathrm{RGD}}_{\mathrm{long}}(t-1)} \]

The numbers of harmonics N_w and N_d were tuned by minimizing the AIC index.
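The multiperiodic regressors can be generated as follows; this is a sketch of the harmonic products defining F(t) only, with the trend and holiday terms of [9] omitted:

```python
import numpy as np

PSI = 2 * np.pi / 365.25  # yearly angular frequency
OMEGA = 2 * np.pi / 7.0   # weekly angular frequency

def torus_harmonics(t, n_d=1, n_w=3):
    """Products of yearly and weekly harmonics for the multiperiodic term F(t).

    t: array of day indices; n_d, n_w: numbers of harmonics (defaults are
    the values selected by AIC in Section 6).
    Returns a matrix with (1 + 2*n_d) * (1 + 2*n_w) columns.
    """
    t = np.asarray(t, dtype=float)
    D = [np.cos(j * PSI * t) for j in range(n_d + 1)] + \
        [np.sin(j * PSI * t) for j in range(1, n_d + 1)]
    W = [np.cos(k * OMEGA * t) for k in range(n_w + 1)] + \
        [np.sin(k * OMEGA * t) for k in range(1, n_w + 1)]
    return np.column_stack([d * w for d in D for w in W])
```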
5 Effects of temperature forecast errors

As shown in Section 3, temperature is the most important exogenous variable. Obviously, the actual temperature cannot practically be used when forecasting future RGD: only a forecast is available, affected by a small yet non-negligible error, which inevitably impacts also the performance of the gas demand forecast. The scope of this section is to assess the influence of the temperature error on the precision of the RGD forecast. For this purpose, we resort to an idealized error propagation model that, despite its simplicity, provides an accurate description, as confirmed by the subsequent experimental validation.

Let RGD be a deterministic function g of the true temperature T and some other factors x = (x₁, x₂, ...): RGD = g(T, x). In view of the analysis and the charts presented in Section 3, a first-level approximation of the relationship between RGD and T is a linear function of HDD, while the dependence on the other factors can be represented as an additive term ḡ(x):

\[ \mathrm{RGD} = g(T, \mathbf{x}) = \bar{g}(\mathbf{x}) + \alpha \, \mathrm{HDD}(T) \]

where α is the sensitivity of the gas demand to HDD. The formula is of general validity and applies to both regional and national gas markets. Indeed, α depends on the size of the considered market and can be estimated from historical data, e.g. those displayed in Fig. 6.

Consider now the ideal case when α and also the function ḡ are perfectly known, yet only a forecast T̂ = T + ε of the correct temperature T is available, where ε is a zero-mean error with variance σ_ε². The optimal forecast, given T̂, is therefore:

\[ \widehat{\mathrm{RGD}} = \bar{g}(\mathbf{x}) + \alpha \, \mathrm{HDD}(\hat{T}) \]

In order to obtain the mean squared error of the forecast, we first compute the conditional variance of RGD̂:

\[ \mathrm{Var}\big[\widehat{\mathrm{RGD}} \mid T \geq 18°C\big] = \mathrm{Var}[\bar{g}(\mathbf{x}) + \alpha \cdot 0] = 0 \]
\[ \mathrm{Var}\big[\widehat{\mathrm{RGD}} \mid T < 18°C\big] = \mathrm{Var}\big[\bar{g}(\mathbf{x}) + \alpha (18°C - \hat{T})\big] = \alpha^2 \mathrm{Var}[\epsilon] = \alpha^2 \sigma_\epsilon^2 \]

Since E[ε] = 0, it follows that E[RGD̂] = RGD. Thus:

\[ \begin{aligned} E\big[(\widehat{\mathrm{RGD}} - \mathrm{RGD})^2\big] &= E\big[(\widehat{\mathrm{RGD}} - \mathrm{RGD})^2 \mid T \geq 18°C\big]\, P(T \geq 18°C) \\ &\quad + E\big[(\widehat{\mathrm{RGD}} - \mathrm{RGD})^2 \mid T < 18°C\big]\, P(T < 18°C) \\ &= 0 + \mathrm{Var}\big[\widehat{\mathrm{RGD}} \mid T < 18°C\big]\, P(T < 18°C) \\ &= P(T < 18°C)\, \alpha^2 \sigma_\epsilon^2 \end{aligned} \tag{4} \]

This last equation provides an estimate of the mean squared error due to the temperature forecasting error. Since it has been derived under an ideal setting, i.e. α and ḡ(·) perfectly known, it provides a lower limit to the precision that can be achieved by the best possible forecaster. The quantities entering the bound are easily obtainable as follows:

• Estimate P(T < 18 °C) by computing the ratio between the number of samples such that T < 18 °C and the total number of available data.
• Compute α through a least squares fit of RGD vs T.
• Estimate σ_ε² as the sample variance of T̂ − T.

Considering the Italian RGD data, in the 3-year period 2015-2017, P(T < 18 °C) ranges from 54% to 67%, while σ_ε² ranges from 0.05 to 0.09, and α from 9.85 to 10.96. Considering altogether the years 2015-2017, we have P(T < 18 °C) = 63%, with the corresponding estimates of σ_ε² and α yielding a best achievable Root Mean Squared Error

\[ \mathrm{RMSE} = \sqrt{E\big[(\widehat{\mathrm{RGD}} - \mathrm{RGD})^2\big]} = \sqrt{P(T < 18°C)\, \alpha^2 \sigma_\epsilon^2} = 2.22 \; \mathrm{MSCM} \]
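The three estimation steps above translate directly into code. A sketch, assuming aligned arrays of observed temperatures, forecasted temperatures, and demand; the slope is fitted against HDD rather than raw temperature, which gives α directly:

```python
import numpy as np

def rgd_rmse_bound(temp_obs, temp_fcst, rgd, threshold=18.0):
    """Best achievable RGD RMSE due to temperature errors, Eq. (4)."""
    temp_obs = np.asarray(temp_obs, dtype=float)
    temp_fcst = np.asarray(temp_fcst, dtype=float)
    rgd = np.asarray(rgd, dtype=float)

    p_cold = np.mean(temp_obs < threshold)        # P(T < 18 degC)
    hdd = np.maximum(threshold - temp_obs, 0.0)
    alpha = np.polyfit(hdd, rgd, 1)[0]            # sensitivity of RGD to HDD
    sigma2_eps = np.var(temp_fcst - temp_obs)     # temperature error variance

    return np.sqrt(p_cold * alpha**2 * sigma2_eps)
```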
Finally, we consider the more realistic case in which the forecasting mean square error is different from zero even in the absence of temperature errors, that is

\[ \mathrm{Var}[\bar{g}(\mathbf{x})] = \sigma_0^2 > 0 \]

Then, under a statistical independence assumption, it is possible to obtain the forecasting RMSE as a function of σ_ε:

\[ \mathrm{RMSE}(\sigma_\epsilon) = \sqrt{\mathrm{Var}\big[\widehat{\mathrm{RGD}}\big]} = \sqrt{\sigma_0^2 + P(T < 18°C)\, \alpha^2 \sigma_\epsilon^2} \tag{5} \]

In Fig. 8 this relationship is displayed assuming P(T < 18 °C) = 63% and σ₀² = 13.3 (this last value is the test MSE achieved by the ANN forecaster trained with true temperature data instead of the forecasted ones, see Section 6). Notably, the sensitivity of the gas forecasting error tends to increase as the temperature forecast error grows. In particular, if we define the threshold

\[ \bar{\sigma}_\epsilon = \frac{\sigma_0}{\sqrt{P(T < 18°C)}\; \alpha} \]

the influence of temperature errors is negligible as far as σ_ε ≪ σ̄_ε, while the temperature errors have a linear influence on the gas RMSE for σ_ε ≫ σ̄_ε.

6 Results and discussion

As mentioned in Section 2, the available data range from 2007 to 2017. Three one-year-long test sets were defined, associated with the years 2015, 2016, and 2017. The corresponding three training sets spanned from 2007 to the day before the start of the test set: 2007-2014, 2007-2015, 2007-2016. In the following, each training set is identified by the year of the corresponding test set, e.g., we will write "training set 2016" to indicate the second training set, spanning from 2007 to 2015.

On each test set, the performance of the five models was measured using the Mean Absolute Error (MAE):
\[ \mathrm{MAE} = \frac{1}{N} \sum_{j=1}^{N} \big| \mathrm{RGD}_j - \widehat{\mathrm{RGD}}_j \big| \]

MAE is preferred over MAPE due to the highly non-stationary behavior of the RGD series. Using MAPE would attribute undue importance to errors during the summer period, when RGD is small, see Fig. 2. Moreover, MAE is proportional to the monetary loss sustained by energy companies because of errors in nomination due to inaccurate forecasts. Nevertheless, in order to allow a comparison with forecasting performances achieved in the UK market, we will also refer to the MAPE:

\[ \mathrm{MAPE} = \frac{100}{N} \sum_{j=1}^{N} \frac{\big| \mathrm{RGD}_j - \widehat{\mathrm{RGD}}_j \big|}{\mathrm{RGD}_j} \]

Indeed, the comparison between two different markets calls for the use of a relative metric. To avoid the confounding effect of small absolute errors that are amplified by MAPE during the Italian summer, the comparison between Italy and the UK was limited to the cold months, when gas demand is relatively high, see Section 6.3. Finally, as the performance limit derived in Section 5 poses a lower bound to the mean squared error, the Root Mean Square Error (RMSE) was also used as a comparison metric.
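For reference, the three metrics are a few lines of numpy each; a sketch:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error, in MSCM."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / y_true)

def rmse(y_true, y_pred):
    """Root Mean Square Error, in MSCM."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
```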
All five models include hyperparameters that were tuned by cross-validation. For ridge regression, the regularization parameter λ was tuned by 5-fold cross-validation on a logarithmically spaced grid. In the training set 2015, line search selected a small positive λ; in the other two sets, 2016 and 2017, cross-validation selected the smallest λ of the grid, with effective degrees of freedom practically equal to the number of parameters. This means that regularization plays a very marginal role.

For KNN, we optimized the number of neighbors, as well as the weighting strategy, choosing between uniform and inverse of the distance. We obtained 7 neighbors for training set 2015 and 6 for the two remaining ones. In all three cases, the "inverse of distance" weights were selected.

As for the Gaussian Process, the maximization of the marginal likelihood yielded stable estimates of ν, l, and σ², with minimal variations among all training sets.

For the ANN models, a trial and error procedure led to an architecture with an input layer of 24 neurons, two hidden layers of 12 and 4 neurons, and an output layer of a single neuron. More complex structures led to overfitting and loss of predictive performance. By 5-fold cross-validation, we obtained a learning rate of 0.001 and a batch size of 32. We selected 1000 epochs for training by observing the evolution of the loss function on train and validation sets.

For what concerns the torus model, the minimization of AIC led to the choice of N_w = 3 and N_d = 1 for all the training sets.

A first assessment of the performances of the adopted methods was carried out in terms of RMSE. In order to validate the formula that models the propagation of temperature errors (Section 5), two sessions were performed. In the first one, the models were trained and tested using historical records of true temperatures, assuming that the one-day-ahead exact temperature is available as a feature. Then, we used Eq. (5) in order to predict how much the forecasting RMSE would increase in the more realistic scenario in which one-day-ahead temperature forecasts are employed in place of the true temperatures. In the second session, the models were trained and tested using historical records of forecasted temperatures. In this way, it was possible to validate the error propagation model against the real errors.

The results of the first session are summarized in Table 2. It can be seen that the smallest RMSE is obtained by GP and ANN, the latter being marginally better.

Year    2015  2016  2017  2015-2017
Ridge   4.24  4.11  4.12  4.16
GP      3.81  3.68  3.64  3.71
KNN     7.29  8.49  8.38  8.07
Torus   4.21  4.23  3.70  4.05
ANN     3.89  3.60  3.44  3.65

Table 2: Performance on test sets: yearly RMSE (MSCM) of the five forecasters trained and tested assuming that one-day-ahead true temperatures are available.
Obviously, in a real-world context, only temperature forecasts are available for the day ahead. In order to account for the performance degradation due to the use of forecasted temperatures, Eq. (5) was used to predict the RMSE of the RGD forecast in correspondence of the temperature forecast variance σ_ε² of our meteorological data. The results are summarized in Table 3. In the first line, the theoretical performance limits computed according to Eq. (4) are reported. These values were added (in the mean-square sense, i.e., summing the squared terms) to the RMSE's of Table 2 to obtain predictions of the RGD forecasting RMSE in a real-world situation in which one-day-ahead temperature forecasts are used.

Year               2015  2016  2017  2015-2017
Performance limit  2.15  2.02  1.98  2.05
Ridge              4.75  4.58  4.57  4.63
GP                 4.37  4.20  4.15  4.24
KNN                7.60  8.73  8.61  8.33
Torus              4.73  4.69  4.20  4.55
ANN                4.45  4.13  3.97  4.19

Table 3: Predicted performance on test sets when temperature forecasts are used: yearly RMSE (MSCM) of the five forecasters.

In the second session, the predictions of Table 3 were validated by comparing them with the RGD forecasting RMSE achieved using temperature forecasts. As can be seen in Table 4, the actual RMSE are in good agreement with their predictions. This can also be visually appreciated in Fig. 9, where theoretical predictions are plotted against the actual RMSE.
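The quadrature combination can be checked at a glance; the sketch below reproduces the 2015-2017 column of Table 3 from Table 2 and the performance limit:

```python
import numpy as np

# RMSE (MSCM) with true temperatures, years 2015-2017 (Table 2), and the
# theoretical performance limit of Eq. (4) (Table 3, first line).
rmse_true_temp = {"Ridge": 4.16, "GP": 3.71, "KNN": 8.07,
                  "Torus": 4.05, "ANN": 3.65}
limit = 2.05

# Under the independence assumption of Eq. (5), the MSEs add, so the
# predicted RMSE is the quadrature sum of the two terms.
rmse_predicted = {model: round(float(np.hypot(rmse, limit)), 2)
                  for model, rmse in rmse_true_temp.items()}
print(rmse_predicted)  # e.g. ANN -> 4.19, matching Table 3 up to rounding
```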
A second assessment of the models was made in terms of their MAE. Hereafter, one-day-ahead forecasted temperatures are employed in the features. Results on the test sets are shown in Table 5. Now, GP is the best performer, achieving an average MAE of 2.53 MSCM over the three test years. ANN, Torus, and Ridge Regression follow in the order. KNN is again the worst model, with an average MAE of 5.05 MSCM.

The differences between the RMSE- and MAE-based rankings are possibly explained by the non-Gaussianity of the prediction errors. In case of zero-mean prediction errors that are perfectly Gaussian, it should be MAE/RMSE = √(2/π) ≈ 0.80, yielding identical rankings, irrespective of the adopted metric. As a matter of fact, MAE/RMSE falls well below this value for all the models: the ratio is about 0.61 for GP and Torus, 0.62 for KNN, 0.65 for ANN, and 0.72 for Ridge. This is explained by the non-Gaussianity of the prediction errors, possibly associated with the presence of "fat tails" in their distributions. In particular, from Fig. 10 it is apparent that different error variances are observed in the cold and warm seasons. This means that the overall error distribution is akin to a mixture, which can produce fat tails when the variances in the two seasons are much different. The error distributions in 2017 are displayed in Fig. 11.

Due to the seasonal behavior of RGD, it is of interest to disaggregate data at a monthly level. In Table 6, the monthly averages of MAE and MAPE are reported throughout the 2015-2017 test years. It appears that GP is the best performer during the warm period, especially from June to October, whereas in the cold months, from December to February, ANN is more accurate. A possible explanation is that the GP model is better at capturing the effects of the weekly seasonality, which explains most of the summer variability, while ANN accounts better for the non-linear effect of temperature, mostly relevant during the cold months.

To the best of our knowledge, there are no published benchmarks for the forecasting task addressed in this paper. A somehow similar problem was studied by Zhu et al. [26] relative to UK gas demand in 2012. Still, their results are not entirely comparable to ours, for two main reasons: first, the authors considered the total UK demand and not just the residential one; second, the UK climate is colder than the Italian one. Nonetheless, we can use a relative error metric, such as the MAPE, in order to obtain a first-level comparison, limited to 6 cold months (from October to March). Our best model in terms of average MAPE over 2015-2017, i.e., the GP, achieves 3.11%, while Zhu's false neighbours filtered-support vector regression local predictor (FNF-SVRLP) achieves 3.88% on the same six cold months of 2012. Although no definite conclusion can be drawn, these numbers suggest some degree of consistency between forecasting performances at country level.

              MAPE (%)                         MAE (MSCM)
Month      Ridge  GP    KNN   Torus  ANN    Ridge  GP    KNN    Torus  ANN
January    3.10   3.01  6.45  3.21   2.93   5.79   5.67  12.44  5.96   –
February   2.75   2.84  4.77  3.33   2.62   4.52   4.59  7.60   5.48   –
March      3.94   4.20  7.13  3.83   4.27   4.59   4.89  8.01   –      –
June       4.57   1.32  6.02  3.37   1.92   1.46   –     –      –      –
…

Table 6: Monthly MAPE and MAE (MSCM) on test sets 2015-2017: best performers in terms of MAE are highlighted in boldface.
Figure 10: Out-of-sample model residuals (MSCM) in 2017.
Figure 11: Distribution of out-of-sample residuals (MSCM) in 2017, by model.
7 Conclusions

In this paper, one-day-ahead forecasting of the residential gas demand was addressed at the country level. Five different models were considered: Ridge regression, Gaussian Process, K-nearest neighbour, Artificial Neural Network, and the Torus model. The choice of the relevant covariates and the most important aspects of the preprocessing and feature extraction have been discussed, lending particular attention to the role of one-day-ahead temperature forecasts. In particular, a simple model describing the propagation of temperature errors to gas forecasting errors was derived.

The proposed methodology was tested on daily Italian gas demand data from 2007 to 2017. Although a specific benchmark is not available, a comparison with UK data restricted to cold months showed a substantial consistency between the performances achieved in the two countries.

Our best model, in terms of RMSE, was the Artificial Neural Network, closely followed by the Gaussian Process. If the MAE is taken as an error measure, the GP became the best model, although by a narrow margin. From the analysis of monthly performance, GP was found to be more accurate in tracking the weekly periodicity, which is predominant in the summer period, while the ANN accounted better for the non-linear influence of temperature, whose contribution is more significant during the winter period.

An interesting question is how much of the forecasting mean squared error is ascribable to temperature forecasting errors. On the Italian data, we found that the MSE for the ANN model passed from
MSE = 3.65² = 13.3 (using true temperatures, see Table 2) to about 16.2 (using temperature forecasts, see Table 4). This means that temperature forecast errors account for some 18% of the MSE of RGD forecasts.

As demonstrated in Fig. 9, our error propagation model successfully predicted the quantitative impact of temperature forecast errors on gas forecast errors, a capability that could prove useful in order to assess the extent and convenience of improvement margins associated with more sophisticated (and possibly more expensive) weather forecasts.

Future developments may span along two main directions: first, the study and assessment of more complex architectures of neural networks, such as Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM), which have already been successfully applied to power demand forecasting; second, a further refinement of the propagation model of temperature forecasting errors, to remove the restrictive hypothesis of a piecewise linear relation between gas demand and temperature.

References

[1] Italian natural gas demand report. http://pianodecennale.snamretegas.it/it/domanda-offerta-di-gas-in-italia/domanda-di-gas-naturale.html. Accessed: 2019-01-31.

[2] Milton Abramowitz and Irene A Stegun.
Handbook of mathematical functions with formulas, graphs, and mathematical tables, volume 55. US Government Printing Office, 1948.

[3] A Azadeh, SM Asadzadeh, and A Ghanbari. An adaptive network-based fuzzy inference system for short-term natural gas demand estimation: uncertain and complex environments. Energy Policy, 38(3):1529–1536, 2010.

[4] Lorenzo Baldacci, Matteo Golfarelli, Davide Lombardi, and Franco Sami. Natural gas consumption forecasting for anomaly detection. Expert Systems with Applications, 62:190–201, 2016.

[5] Marek Brabec, Ondřej Konár, Emil Pelikán, and Marek Malý. A nonlinear mixed effects model for the prediction of natural gas consumption by individual customers. International Journal of Forecasting, 24(4):659–678, 2008.

[6] Ömer Fahrettin Demirel, Selim Zaim, Ahmet Çalişkan, and Pinar Özuyar. Forecasting natural gas consumption in Istanbul using neural networks and multivariate time series methods. Turkish Journal of Electrical Engineering & Computer Sciences, 20(5):695–711, 2012.

[7] Salvador Gil and J Deferrari. Generalized model of prediction of natural gas consumption. Journal of Energy Resources Technology, 126(2):90–98, 2004.

[8] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[9] … In Control Conference (ECC), 2015 European, pages 2768–2773. IEEE, 2015.

[10] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics. Springer New York, 2009.

[11] Yusuf Karadede, Gultekin Ozdemir, and Erdal Aydemir. Breeder hybrid algorithm approach for natural gas demand forecasting model. Energy, 141:1269–1284, 2017.

[12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[13] Andrea Marziali, Emanuele Fabbiani, and Giuseppe De Nicolao. Ensembling methods for countrywide short term forecasting of gas demand. arXiv preprint arXiv:1902.00097, 2019.

[14] K.P. Murphy and F. Bach. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[15] Ioannis P Panapakidis and Athanasios S Dagoumas. Day-ahead natural gas demand forecasting based on the combination of wavelet transform and ANFIS/genetic algorithm/neural network model. Energy, 118:231–245, 2017.

[16] Primož Potočnik, Marko Thaler, Edvard Govekar, Igor Grabec, and Alojz Poredoš. Forecasting risks of natural gas consumption in Slovenia. Energy Policy, 35(8):4271–4282, 2007.

[17] Dario Šebalj, Josip Mesarić, and Davor Dujak. Predicting natural gas consumption – a literature review. 2017.

[18] Božidar Soldo. Forecasting natural gas consumption. Applied Energy, 92:26–37, 2012.

[19] Božidar Soldo, Primož Potočnik, Goran Šimunović, Tomislav Šarić, and Edvard Govekar. Improving the residential natural gas consumption forecasting models by using solar radiation. Energy and Buildings, 69:498–506, 2014.

[20] Jolanta Szoplik. Forecasting of natural gas consumption with artificial neural networks. Energy, 85:208–220, 2015.

[21] Fatih Taşpınar, Numan Celebi, and Nedim Tutkun. Forecasting of daily natural gas consumption on regional basis in Turkey using various computational methods. Energy and Buildings, 56:23–31, 2013.

[22] Zlatko Tonković, Marijana Zekić-Sušac, and Marija Somolanji. Predicting natural gas consumption by neural networks. Tehnički vjesnik, 16(3):51–61, 2009.

[23] Zia Wadud, Himadri S. Dey, Md. Ashfanoor Kabir, and Shahidul I. Khan. Modeling and forecasting natural gas demand in Bangladesh. Energy Policy, 39(11):7372–7380, 2011.

[24] Nan Wei, Changjun Li, Jiehao Duan, Jinyuan Liu, and Fanhua Zeng. Daily natural gas load forecasting based on a hybrid deep learning model. Energies, 12(2):218, 2019.

[25] Christopher KI Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

[26] Lixing Zhu, MS Li, QH Wu, and L Jiang. Short-term natural gas demand prediction based on support vector regression with false neighbours filtered. Energy, 80:428–436, 2015.