Day-ahead electricity price prediction applying hybrid models of LSTM-based deep learning methods and feature selection algorithms under consideration of market coupling
Wei Li a,∗, Denis Mike Becker a

a NTNU Business School, Norwegian University of Science and Technology, 7491 Trondheim, Norway
Abstract
The availability of accurate day-ahead electricity price forecasts is pivotal for electricity market participants. In the context of trade liberalisation and market harmonisation in the European markets, accurate price forecasting becomes even more difficult to obtain. The increasing power market integration has complicated the forecasting process, where electricity forecasting requires considering features from both the local market and ever-growing coupling markets. In this paper, we apply state-of-the-art deep learning models, combined with feature selection algorithms, for electricity price prediction under the consideration of market coupling. We propose three hybrid architectures of long short-term memory (LSTM) deep neural networks and compare the prediction performance in terms of various feature selections. In our empirical study, we construct a broad set of features from the Nord Pool market and its six coupling countries for forecasting the Nord Pool system price. The results show that feature selection is essential to achieving accurate prediction. Superior feature selection algorithms filter meaningful information, eliminate irrelevant information, and further improve the forecasting accuracy of LSTM-based deep neural networks. The proposed models obtain considerably accurate results.

Keywords:
Deep learning, Electricity price forecasting (EPF), Electricity market coupling, Feature selection, Long short-term memory (LSTM), The Nord Pool system price
1. Introduction
Over the last two decades, worldwide energy markets have experienced a transition towards deregulation and globalisation [1]. Under trade liberalisation, the traditional vertically integrated power utilities are replaced with decentralised business entities whose targets are to maximise their profits. Consequently, a growing number of market participants are exposed to intense competition, and their need for suitable decision support models to increase margins and reduce risk has significantly increased [2]. The availability of accurate day-ahead electricity price forecasts is vital for market participants to adjust production plans and perform effective bidding strategies to make an economic profit. In addition, accurate electricity price forecasting (EPF) can contribute to the stability of the electrical grid with an increase in renewable and remote generation. In particular, volatile prices may overstress power infrastructures, force the use of strategic reserves, and create an urgent need to reinforce the grid, resulting in an increased risk of blackouts and voltage collapse.

Due to the distinctive structure and characteristics of electricity prices, highly accurate forecasting is quite challenging [3, 4]. In this context, price prediction tools are essential for all electricity market participants, to enable them to maximise their profits, mitigate risks, and stabilise the grid under a liberalised and harmonised environment. Numerous research efforts

∗ Corresponding author
Email address: [email protected] (Wei Li)

have contributed to exploiting and developing advanced technologies for day-ahead EPF, aiming at highly accurate forecasting results [5, 6]. A considerable amount of literature has been devoted to EPF methods, which can be classified into five areas [5]: multi-agent [7, 8], fundamental [9, 5], reduced-form [10, 11], statistical [12, 13, 14], and computational intelligence models [15, 16, 17]. In general, computational intelligence (CI) models are state-of-the-art techniques. Compared with other traditional models, their superior performance contributes to the prevalence of CI-based models in EPF.

In recent years, deep neural networks (DNNs) have gradually entered the scientific research related to electricity price forecasting. They are regarded as the most avant-garde CI approach in various other disciplines [18, 19, 20]. DNNs are often categorised into three main classes: Feed-forward Neural Networks (FNNs), Recurrent Neural Networks (RNNs), and Convolutional Neural Networks (CNNs). For time series and sequence prediction, RNNs achieve superior performance by building extra mappings to hold relevant information from past inputs. The long short-term memory (LSTM) and gated recurrent unit (GRU) networks are important variants of this kind of network which overcome the vanishing gradient problem [21]. Due to their superiority in forecasting, researchers gradually pay attention to their application in EPF [22, 23, 24]. However, with the increasing integration of electricity markets, making accurate forecasts becomes even more difficult in the complex and integrated system. A large number of explanatory variables from an ever-growing number of interconnected, neighbouring power systems need to be considered when forecasting electricity prices.

Preprint submitted to Elsevier, January 14, 2021
When DNNs are applied to high-dimensional data, a critical issue known as the curse of dimensionality arises [25]. It means that, with a large number of features, the performance of DNNs will degrade because of overfitting [26]. To the best of our knowledge, no existing study considers how to apply DNNs with huge amounts of high-dimensional data in the ever-growing integrated market. In particular, the curse of dimensionality in the application of state-of-the-art LSTM deep learning models in EPF, considering market coupling, has yet to be addressed. To fill this scientific gap, we propose three hybrid architectures of LSTM-based deep learning predictive models combined with advanced feature selection algorithms: the two-step hybrid architecture, the autoencoder hybrid architecture and the two-stage hybrid architecture. We employ five feature selection algorithms for selecting features from Nord Pool and its neighbouring, interconnected countries. They are: Pearson's correlation (PC), particle swarm optimisation combined with the extreme learning machine method (PSO-ELM), the genetic algorithm combined with the extreme learning machine method (GA-ELM), recursive feature elimination together with the support vector regression method (RFE-SVR) and the lasso regression method.

While some current research attempts to involve explanatory variables from integrated markets to predict electricity prices [27, 28, 29], no existing research has investigated state-of-the-art LSTM-based neural networks, which achieve superior performance for time series forecasting. Besides, some research has started to pay attention to market integration in Nord Pool [30, 31, 32]. However, efficient ways to utilise the ever-growing information from its integrated markets for EPF have yet to be explored. We therefore propose a collection of LSTM-based models with various feature selection algorithms and compare their forecasting performance. The main contributions are as follows:
1. We introduce three architectures of hybrid LSTM-based deep neural networks for EPF, and conclude that different feature selection algorithms collect divergent subsets of features, which in turn affect the proposed LSTM models' prediction accuracy.

2. The EPF of the Nord Pool market under consideration of market coupling with the application of deep learning methods remains unexplored. We compare and analyse the forecasting performance of the proposed models in a case study of Nord Pool system price forecasting, considering six integrated markets (sixty-two features).

3. We employ a SHAP game-theoretic approach to explore the relevance of various cross-border features in EPF.

The remainder of this paper is organised as follows. Section 2 describes the dataset used in this research. In Section 3, we present the architectures of the hybrid models. In Section 4, we describe the proposed theoretical concepts of the feature selection methods. The autoencoder LSTM models are described in Section 5. Section 6 describes the model training and introduces the evaluation criteria applied in our analysis. Section 7 reports the forecasting results of the implemented models for the Nord Pool market day-ahead price. Finally, Section 8 concludes the paper and proposes future research developments.

Figure 1: A map of the overview of the Nord Pool market coupling.
2. Data description
This paper discusses and evaluates several hybrid LSTM-based approaches for the prediction of the Nordic hourly and daily system prices. Each hourly system price is calculated as a market-clearing price without taking into account any congestion restrictions. The daily system price represents the arithmetic average of the 24 hourly prices. It is used as a settlement price for the derivatives market.

Previous empirical research on the prediction of electricity prices has considered both market-related information and supply/demand relations. The former type includes spot prices and exchange rates. The latter concerns production, consumption, and their prognoses. To find out what matters when predicting the day-ahead Nordic system price in coupling markets, we also include the electricity exchange between Nord Pool and its integrated countries. The Russian electricity market is excluded because it differs significantly from European models. To consider the influence of the correspondence between electricity flow and capacity, we introduce a new daily feature, namely the cross-border flow deviation. It can be calculated as

σ = sqrt( Σ_{i=1}^{N} (X_i − µ_i)² / N )

where X_i is the hourly electricity flow, µ_i is the hourly expected exchange capacity, and N stands for 24 hours.

We collected data from Nord Pool, Thomson Reuters Eikon and Entsoe. The available time series ranges from ..., which is sufficient for the application of DL models. Besides, the electricity exchange between SE4 and LT started at ...

Figure 2 shows the electricity exports from Germany, the Netherlands, Lithuania, Poland, and Russia in 2019. The exports to the Nord Pool comprised 16.03% of the whole exports from these coupling countries. In Figure 3, we can see that the electricity exports of the Nord Pool comprised 4.82% of its total production in 2019.

Nord Pool: https://
Thomson Reuters Eikon: https://eikon.thomsonreuters.com/
Entsoe: https://transparency.entsoe.eu/
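The cross-border flow deviation formula can be sketched as follows (a minimal numpy illustration with hypothetical hourly values, not the paper's actual data):

```python
import numpy as np

def flow_deviation(hourly_flow, hourly_capacity):
    """Daily cross-border flow deviation: root mean squared gap between
    the hourly electricity flow X_i and the expected exchange capacity mu_i."""
    x = np.asarray(hourly_flow, dtype=float)
    mu = np.asarray(hourly_capacity, dtype=float)
    n = len(x)  # N = 24 hours for one daily feature value
    return np.sqrt(np.sum((x - mu) ** 2) / n)

# Example: flow runs 10 MWh below a flat 100 MWh capacity all day
sigma = flow_deviation([90.0] * 24, [100.0] * 24)  # -> 10.0
```

One such value is computed per border and per day, giving features F55 to F62.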
The EU aims at achieving 15% interconnection capacity in 2030 for each EU country [33].

Table 1: The features included in the dataset.

Feature | Description (Units) | Data Source
F1 | System day-ahead price, 1-lag (EUR/MWh) | Nord Pool
F2 | SE1 day-ahead price (EUR/MWh) | Nord Pool
F3 | SE2 day-ahead price (EUR/MWh) | Nord Pool
F4 | SE3 day-ahead price (EUR/MWh) | Nord Pool
F5 | SE4 day-ahead price (EUR/MWh) | Nord Pool
F6 | FI day-ahead price (EUR/MWh) | Nord Pool
F7 | DK1 day-ahead price (EUR/MWh) | Nord Pool
F8 | DK2 day-ahead price (EUR/MWh) | Nord Pool
F9 | NO1 day-ahead price (EUR/MWh) | Nord Pool
F10 | NO2 day-ahead price (EUR/MWh) | Nord Pool
F11 | NO3 day-ahead price (EUR/MWh) | Nord Pool
F12 | NO4 day-ahead price (EUR/MWh) | Nord Pool
F13 | NO5 day-ahead price (EUR/MWh) | Nord Pool
F14 | EE day-ahead price (EUR/MWh) | Nord Pool
F15 | LT day-ahead price (EUR/MWh) | Nord Pool
F16 | PL day-ahead price (PLN/MWh) | Thomson Reuters Eikon
F17 | DE day-ahead price (EUR/MWh) | Thomson Reuters Eikon
F18 | NL day-ahead price (EUR/MWh) | Thomson Reuters Eikon
F19 | Nordic production (MWh) | Nord Pool
F20 | EE production (MWh) | Nord Pool
F21 | LT production (MWh) | Nord Pool
F22 | PL production (MWh) | Entsoe
F23 | DE production (MWh) | Entsoe
F24 | NL production (MWh) | Entsoe
F25 | Nordic production prognosis (MWh) | Nord Pool
F26 | EE production prognosis (MWh) | Nord Pool
F27 | LT production prognosis (MWh) | Nord Pool
F28 | PL production prognosis (MWh) | Entsoe
F29 | DE production prognosis (MWh) | Entsoe
F30 | NL production prognosis (MWh) | Entsoe
F31 | Nordic consumption (MWh) | Nord Pool
F32 | EE consumption (MWh) | Nord Pool
F33 | LT consumption (MWh) | Nord Pool
F34 | PL consumption (MWh) | Entsoe
F35 | DE consumption (MWh) | Entsoe
F36 | NL consumption (MWh) | Entsoe
F37 | Nordic consumption prognosis (MWh) | Nord Pool
F38 | EE consumption prognosis (MWh) | Nord Pool
F39 | LT consumption prognosis (MWh) | Nord Pool
F40 | PL consumption prognosis (MWh) | Entsoe
F41 | DE consumption prognosis (MWh) | Entsoe
F42 | NL consumption prognosis (MWh) | Entsoe
F43 | EUR/NOK | Nord Pool
F44 | EUR/SEK | Nord Pool
F45 | EUR/DKK | Nord Pool
F46 | EUR/PLN | Thomson Reuters Eikon
F47 | NO2 ↔ NL flow (MWh) | Nord Pool
F48 | DK1 ↔ DE flow (MWh) | Nord Pool
F49 | DK2 ↔ DE flow (MWh) | Nord Pool
F50 | SE4 ↔ DE flow (MWh) | Nord Pool
F51 | SE4 ↔ PL flow (MWh) | Nord Pool
F52 | SE4 ↔ LT flow (MWh) | Nord Pool
F53 | FI ↔ EE flow (MWh) | Nord Pool
F54 | FI ↔ Russia flow (MWh) | Nord Pool
F55 | NO2 ↔ NL flow deviation | Calculation
F56 | DK1 ↔ DE flow deviation | Calculation
F57 | DK2 ↔ DE flow deviation | Calculation
F58 | SE4 ↔ DE flow deviation | Calculation
F59 | SE4 ↔ PL flow deviation | Calculation
F60 | SE4 ↔ LT flow deviation | Calculation
F61 | FI ↔ EE flow deviation | Calculation
F62 | FI ↔ Russia flow deviation | Calculation

Fraunhofer ISE provides the electricity exchange data of Germany/Europe: https://
3. Architecture of a hybrid model
The architecture of a typical hybrid model for EPF, shown in Figure 4, consists of two steps. The first step includes data processing and feature selection, and the second step consists of training the predictive models and making predictions. When working with recurrent neural networks, there is another type of hybrid model that turns the input data into a compressed representation rather than explicitly showing which features are selected. It is called the autoencoder architecture, shown in Figure 5. A third network architecture combines the two aforementioned architectures and is referred to as two-stage feature selection. Here the explanatory variables are selected by some feature selection method in the first stage. The selected features then become the input for the autoencoder models in the second stage. Figure 6 shows this architecture.

Figure 2: The electricity cross-border transmission from the coupling countries to Nord Pool.

Figure 3: The percentage of the Nord Pool production for exporting.
The LSTM architecture was initially introduced by [34] and has since been enhanced by other researchers to achieve better performance [35, 36, 37]. An LSTM network is a special kind of recurrent neural network that is capable of learning long-term dependencies. Unlike simple RNNs, an LSTM network has built-in mechanisms that control how information is memorised or abandoned throughout time. The architecture of the LSTM network is shown in Figure 7 and is defined by the following system of equations [38]:

f_t = σ_g(W_xf x_t + W_hf h_{t−1} + W_cf c_{t−1} + b_f)   (1)
i_t = σ_g(W_xi x_t + W_hi h_{t−1} + W_ci c_{t−1} + b_i)   (2)
o_t = σ_g(W_xo x_t + W_ho h_{t−1} + W_co c_{t−1} + b_o)   (3)
c_t = f_t ⊗ c_{t−1} + i_t ⊗ σ_h(W_xc x_t + W_hc h_{t−1} + b_c)   (4)
h_t = o_t ⊗ σ_h(c_t)   (5)

where f_t, i_t, o_t, c_t and h_t indicate the values of the forget gate state, input gate state, output gate state, memory cell and hidden state at time t in the sequence, respectively. σ_g and σ_h are the sigmoid function and the hyperbolic tangent function, and ⊗ denotes the element-wise product. Like all RNNs, LSTM neural networks process data sequentially. Hence, they take the form of a chain structure, as shown in Figure 8.
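Equations (1)-(5) can be sketched directly in numpy. The following is a minimal single-step illustration with random weights (the weight names and sizes are our own assumptions, not the paper's trained model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Eqs. (1)-(5), including the peephole
    terms W_c* on the gates as written in the text."""
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] @ c_prev + b['f'])
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] @ c_prev + b['i'])
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] @ c_prev + b['o'])
    c_t = f_t * c_prev + i_t * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4  # toy sizes; the paper's model uses 300 hidden units
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in if k[0] == 'x' else n_hid))
     for k in ['xf', 'hf', 'cf', 'xi', 'hi', 'ci', 'xo', 'ho', 'co', 'xc', 'hc']}
b = {k: np.zeros(n_hid) for k in ['f', 'i', 'o', 'c']}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
```

Chaining `lstm_step` over a sequence, feeding each step's (h, c) into the next, produces the chain structure of Figure 8.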
4. Feature selection
Feature selection is the process of selecting a subset of relevant features when developing a predictive model. It can reduce the computation time, improve model prediction performance, and help to get a better understanding of the dataset [39]. The current research on feature selection algorithms can be categorised into filter, wrapper and embedded methods. In this paper, we will explore the following feature selection algorithms: PC (filter method), PSO-ELM (wrapper method), GA-ELM (wrapper method), RFE-SVR (wrapper method) and LASSO regression (embedded method). In addition, we will introduce three autoencoder LSTM methods: the LSTM-LSTM Encoder-Decoder, the CNN-LSTM Encoder-Decoder and the Convolutional-LSTM (ConvLSTM) model.
The PC coefficient is a statistic used to measure the linear relationship between two data samples. Given two variables (X, Y), the formula of the PC coefficient ρ is given by the following:

ρ(X, Y) = cov(X, Y) / (σ_X σ_Y)   (6)

Figure 4: The flowchart of a two-step hybrid model. The green nodes stand for the first step and the red nodes for the second step.
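As a filter method, the PC-based selection simply ranks all candidate features by the magnitude of their correlation with the target price, per Eq. (6), and keeps the strongest ones. A minimal numpy sketch on synthetic data (our own illustration):

```python
import numpy as np

def top_k_by_pearson(X, y, k):
    """Rank candidate feature columns by |Pearson correlation| with the
    target and return the indices of the k strongest (filter selection)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    rho = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(-np.abs(rho))[:k]

rng = np.random.default_rng(1)
y = rng.normal(size=200)
X = np.column_stack([y + 0.1 * rng.normal(size=200),   # strongly related
                     rng.normal(size=200),              # pure noise
                     -y + 0.5 * rng.normal(size=200)])  # related, weaker
picked = top_k_by_pearson(X, y, 2)  # indices of the two correlated columns
```

In the paper, the same ranking is applied to all 62 features and the first 30 are retained.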
Figure 5: The flowchart of an autoencoder hybrid model. The orange nodes stand for the autoencoder process.
Figure 6: The flowchart of a two-stage hybrid model. The green nodes stand for the first stage and the orange nodes for the second stage.

Figure 7: LSTM cell.
Figure 8: LSTM chain.

where cov is the covariance, σ_X is the standard deviation of X, and σ_Y is the standard deviation of Y.

The PSO-ELM and GA-ELM are wrapper-based hybrid methods. PSO and GA are different types of optimisation algorithms which provide the optimised subsets of features as the input to the ELM to detect the optimal feature selection. The two wrapper-based methods have been widely used for various feature selection problems [40, 41, 42, 43, 44, 45].

The ELM is a single-hidden-layer feedforward neural network. Its fast training [46] contributes to the popularity of its employment as a predictive model in wrapper-based feature selection [47, 48, 49]. The output of the ELM is calculated as follows:

f_L(x_j) = Σ_{i=1}^{L} β_i g(w_i x_j + b_i),   j = 1, ..., N   (7)

where L is the number of hidden units, N is the number of training samples, β is the weight vector between the hidden layer and the output, w is the weight vector between the input and the hidden layer, g(·) denotes an activation function, b is a bias vector and x is the input vector.

The basic idea of PSO is that a population of particles moves through the search space. Each particle has knowledge about its current velocity, its own past best configuration p(t), and the current global best solution g(t). Based on this information, each particle's velocity is updated such that it moves closer to the global best and its past best solution at the same time. The velocity update is performed according to the following equation:

v(t+1) = ω v(t) + c_1 r_1 (p(t) − x(t)) + c_2 r_2 (g(t) − x(t))   (8)

where c_1 and c_2 are constants defined beforehand that determine the significance of p(t) and g(t), v(t) is the velocity of the particle, x(t) is the current particle position, r_1 and r_2 are random numbers from the interval [0, 1], and ω is a constant (0 ≤ ω ≤ 1). The position of each particle is then updated as:

x(t+1) = x(t) + v(t+1)   (9)

Figure 9: The workflow of the two-step wrapper-based feature selection model.
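The PSO update rules of Eqs. (8) and (9) can be sketched as follows. This is our own toy illustration: a simple quadratic objective stands in for the ELM validation error, while the constants ω, c_1 and c_2 match the configuration reported later in the experiment details:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy objective standing in for the ELM prediction error (assumption:
# we minimise a quadratic instead of training an ELM on feature subsets).
def cost(x):
    return np.sum((x - 3.0) ** 2)

n_particles, dim = 20, 5
w, c1, c2 = 0.7, 0.5, 0.3                              # as in the experiments
x = rng.uniform(-10, 10, (n_particles, dim))            # positions
v = np.zeros_like(x)                                    # velocities
p = x.copy()                                            # personal bests p(t)
p_cost = np.array([cost(xi) for xi in x])
g = p[np.argmin(p_cost)]                                # global best g(t)

for _ in range(200):
    r1, r2 = rng.random((n_particles, 1)), rng.random((n_particles, 1))
    v = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)   # Eq. (8)
    x = x + v                                           # Eq. (9)
    c_now = np.array([cost(xi) for xi in x])
    better = c_now < p_cost
    p[better], p_cost[better] = x[better], c_now[better]
    g = p[np.argmin(p_cost)]

# g should now lie close to the optimum at (3, ..., 3)
```

For feature selection, each particle position is mapped to a 0/1 feature mask and `cost` becomes the ELM's error on the features that mask selects.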
This iterative process is repeated until a stopping criterion is satisfied.

The genetic algorithm is a search metaheuristic that was inspired by Darwin's theory of natural selection. In general, GAs search for the optimal solution from a set of possible solutions, called a population. A solution is referred to as a chromosome or an individual. These chromosomes evolve over a number of generations by recombination (cross-over) and mutation [50].

Figure 9 shows the workflow of the PSO-ELM and GA-ELM models for feature selection. The process flow of GA-ELM can be described as follows:
Step 1: Initialise the population with a set of random individuals, each individual representing a particular subset of features. For a specific individual (feature set), the features are encoded as "1" or "0", as shown in Figure 9. "1" means that the feature is selected and "0" means that it is not selected.

Step 2: The selected features are the input for the ELM. The prediction results of the ELM are used to evaluate the fitness value of the individuals. The fitness value is calculated based on the MSE.

Step 3: Select the best individual with regard to the fitness value. If its fitness is higher than the lowest value in the existing mating pool, it will replace the individual with the worst fitness. Furthermore, the global optimum will be updated accordingly.

Step 4: The child individuals are generated by crossover and mutation. The new generation is composed of a set of new individuals that are encoded and prepared to be evaluated. The whole process continues until the iteration terminal is met. The best feature subset in the mating pool is the optimal selection.
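The four steps above can be sketched as a toy bitmask GA. This is our own illustration on synthetic data: a least-squares fit replaces the ELM as the wrapped predictor, and for brevity the fitness is the training MSE rather than a validation error:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data (assumption): only the first three of ten candidate
# features actually drive the target.
X = rng.normal(size=(300, 10))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.05 * rng.normal(size=300)

def fitness(mask):
    """MSE of a least-squares fit on the selected columns; a cheap
    stand-in for the ELM. Lower is better; empty masks are rejected."""
    if not mask.any():
        return np.inf
    Xs = X[:, mask]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return float(np.mean((y - Xs @ beta) ** 2))

pop = rng.random((30, 10)) < 0.5                      # Step 1: 0/1 encodings
best_mask, best_score = None, np.inf
for _ in range(40):
    scores = np.array([fitness(ind) for ind in pop])  # Step 2: evaluate
    i = int(np.argmin(scores))
    if scores[i] < best_score:                        # Step 3: update optimum
        best_score, best_mask = scores[i], pop[i].copy()
    parents = pop[np.argsort(scores)[:10]]            # mating pool
    mom = parents[rng.integers(10, size=30)]
    dad = parents[rng.integers(10, size=30)]
    cut = rng.integers(1, 10, size=(30, 1))           # Step 4: crossover ...
    pop = np.where(np.arange(10) < cut, mom, dad)
    pop ^= rng.random((30, 10)) < 0.02                # ... and bit-flip mutation
```

After the loop, `best_mask` should include the three informative features.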
RFE-SVR is another wrapper-based feature selection method. The core idea of this algorithm is to search for the best subset of features by starting with all features and discarding the less important features.

To minimise the forecasting errors, SVR individualises the hyperplane by maximising the margin. To solve a nonlinear regression problem, the following linear estimation function is considered [51]:

f(x) = (w × Φ(x)) + b   (10)

where w is the parameter vector, Φ(x) is a kernel function and b is a constant. The SVR model can be transformed into the following convex minimisation problem:

min (1/2)||w||² + C Σ_{i=1}^{n} (ξ_i + ξ*_i)   (11)

subject to the constraints:

(w × Φ(x_i) + b) − y_i ≤ ε + ξ_i
y_i − (w × Φ(x_i) + b) ≤ ε + ξ*_i
ξ_i, ξ*_i ≥ 0,   i = 1, 2, ..., n

where C is a regularisation constant. ξ_i and ξ*_i are slack variables, which are used to handle the situation where no function f(x) exists to satisfy the constraint |y_i − (w × x_i + b)| ≤ ε for all points. They are regarded as the soft margin that allows regression errors to exist up to ξ_i and ξ*_i and still satisfy the constraint. Only the points outside the ε-radius contribute to the final cost. The error parameter ε represents the region of the tube located around the regression function f(x), as shown in Figure 10.

Figure 10: Fitted SVR.
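The recursive elimination loop itself is simple: fit, drop the feature with the smallest importance, refit. A minimal sketch on synthetic data (our own illustration; a linear least-squares fit with |coefficient| as the importance score stands in for the SVR):

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data (assumption): only columns 0 and 5 matter.
X = rng.normal(size=(400, 8))
y = 3 * X[:, 0] - 2 * X[:, 5] + 0.1 * rng.normal(size=400)

def rfe(X, y, n_keep):
    """Recursive feature elimination: repeatedly fit the model, discard
    the feature with the smallest |coefficient|, and refit, until
    n_keep features remain. (Least squares stands in for the SVR.)"""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        beta, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        active.pop(int(np.argmin(np.abs(beta))))
    return active

kept = rfe(X, y, 2)  # the two informative columns survive
```

In the paper, the same loop is run with an SVR as the fitted model until 30 of the 62 features remain.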
The Lasso regression aims at increasing the prediction accuracy of regression models by adding a penalty λ Σ_{i=1}^{n} |β_i| to the loss function. This means that, instead of minimising the loss function Σ_{j=1}^{m} (y_j − Σ_{i=1}^{n} x_{ji} β_i)², the loss function to minimise becomes Σ_{j=1}^{m} (y_j − Σ_{i=1}^{n} x_{ji} β_i)² + λ Σ_{i=1}^{n} |β_i|, where β is the vector of model coefficients. The algorithm has the advantage that it shrinks some of the less critical coefficients of features to zero. Therefore, it removes less relevant features.

Figure 11: A structure of the Encoder-Decoder model.
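The shrink-to-zero behaviour can be demonstrated with a small coordinate-descent solver (a minimal sketch of the penalised loss above on synthetic data; our own illustration, not the paper's implementation):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent with soft-thresholding:
    minimises sum((y - X beta)^2) + lam * sum(|beta|)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ r
            # soft-threshold at lam/2 (from the subgradient of |beta_j|)
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0) / col_sq[j]
    return beta

rng = np.random.default_rng(5)
# Synthetic data (assumption): only the first column is relevant.
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] + 0.05 * rng.normal(size=200)
beta = lasso_cd(X, y, lam=20.0)
# The penalty drives the three irrelevant coefficients exactly to zero.
```

The surviving non-zero coefficients identify the selected features; in the paper λ is tuned so that 30 features remain.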
5. Autoencoder LSTM
An autoencoder is typically a neural network that aims at filtering and compressing the representation of its input. It consists of two components: an encoder and a decoder, shown in Figure 11. The encoder typically accepts a set of the input data and compresses the information into an intermediate vector. The decoder is typically a predictive model. In our case, an LSTM network is the decoder, and the encoders are LSTM, CNN, and convolutional layers.
In an LSTM-LSTM Encoder-Decoder model, an LSTM model is used in the encoder to process the raw input time series and transform it into an intermediate vector. The difference with two-stage methods is that the encoder compresses all of the information into a vector rather than creating a vector of selected features. The decoder is an LSTM model and is the same as the predictive model in the two-stage methods.

In a CNN-LSTM Encoder-Decoder model, a convolutional neural network (CNN) is the encoder that filters the input data. CNNs were originally, and successfully, used to process image input data in image recognition tasks [52] or sequences of input data in natural language processing problems [53]. The convolutional layers are usually followed by a pooling layer, which extracts certain values from the convolved features and produces a lower-dimensional output. Then, the output values are flattened into one long intermediate vector. For instance, when passing input image data through a convolutional layer, the input becomes an abstracted feature map by applying and sliding a convolution kernel (filter) over the input matrix (a two-dimensional structure with width and height). Unlike for images, for time series the convolutional layer can be seen as applying and sliding a filter over the series (a one-dimensional structure).
The computational mechanism of the convolutional LSTM (ConvLSTM) is similar to that of the CNN-LSTM [54]. Unlike the CNN-LSTM, where the CNN model generates the input for the LSTM model, in the ConvLSTM model the LSTM neural network processes the extracted information directly from the preceding convolutional layers.
6. Experiment details
In this section, we introduce the concepts and methods employed in the process of training and evaluating the constructed models.
In this paper, we employ several indicators to evaluate the accuracy of predictions: the mean absolute error (MAE), the root mean squared error (RMSE), the mean absolute percentage error (MAPE) and the symmetric mean absolute percentage error (SMAPE). They are commonly adopted in EPF research [1]. Given a predicted output vector ŷ = [ŷ_1, ..., ŷ_N] and a real output vector y = [y_1, ..., y_N], the MAE, RMSE, MAPE and SMAPE can be calculated as follows:

MAE = (1/N) Σ_{k=1}^{N} |y_k − ŷ_k|   (12)

RMSE = sqrt( (1/N) Σ_{k=1}^{N} (y_k − ŷ_k)² )   (13)

MAPE = (1/N) Σ_{k=1}^{N} |(y_k − ŷ_k) / y_k|   (14)

SMAPE = (1/N) Σ_{k=1}^{N} |y_k − ŷ_k| / ((|y_k| + |ŷ_k|) / 2)   (15)

The metrics for assessing the forecasting accuracy mentioned above cannot guarantee that the observed difference between two predictive models is statistically significant. In this context, the Diebold-Mariano (DM) test is typically used for evaluating the performance of two models [55, 56]. Given the actual values of a time series [y_t; t = 1, ..., T], two forecasts from two models [ŷ1_t; t = 1, ..., T] and [ŷ2_t; t = 1, ..., T], and the associated forecast errors e1_t = ŷ1_t − y_t and e2_t = ŷ2_t − y_t, the DM test defines the loss differential between the two forecasts by:

d^{F1,F2}_t = g(e1_t) − g(e2_t)   (16)

where g(·) stands for a loss function. In a one-sided DM test, the hypotheses are:

H0: E(d^{F1,F2}_t) ≥ 0,   H1: E(d^{F1,F2}_t) < 0.   (17)

This one-sided DM test is used to detect whether F1 is better than F2. If H0 is rejected, the test suggests that the accuracy of F1 is, statistically, significantly better than that of F2. The complementary one-sided DM test can be expressed as follows:

H0: E(d^{F1,F2}_t) ≤ 0,   H1: E(d^{F1,F2}_t) > 0.   (18)

If H0 is rejected, the test suggests that the accuracy of F2 is, statistically, significantly better than that of F1. In this study, we employ a one-sided DM test to assess the forecasting performance of the proposed models. We choose d^{F1,F2}_t = |e1_t| − |e2_t| as the loss function.

To avoid over-fitting, it is common to include a validation set to evaluate the generalisation ability of the training model. Cross-validation is a method for tuning the hyperparameters and producing robust measurements of model performance. In [57], a nested cross-validation procedure was introduced, which considerably reduces the bias and provides an almost unbiased estimate of the true error. Because new observations become available over time, in time series modelling we implement a walk-forward nested cross-validation in which the forecast rolls forward in time. More specifically, we successively consider each day as the test set and assign all previous data to the training set (outer loop). The training set is split into a training subset and a validation set. The validation set data comes chronologically after the training subset (inner loop). Walk-forward validation involves moving along the time series one time step at a time. The process is shown in Figure 12.

Figure 12: Walk-forward nested cross-validation.
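Equations (12)-(15) translate directly into code. A minimal numpy sketch with hypothetical price vectors (not the paper's results):

```python
import numpy as np

def mae(y, yhat):
    return float(np.mean(np.abs(y - yhat)))                  # Eq. (12)

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))          # Eq. (13)

def mape(y, yhat):
    return float(np.mean(np.abs((y - yhat) / y)))            # Eq. (14)

def smape(y, yhat):
    return float(np.mean(np.abs(y - yhat)
                         / ((np.abs(y) + np.abs(yhat)) / 2)))  # Eq. (15)

y = np.array([40.0, 50.0, 60.0])      # hypothetical actual prices (EUR/MWh)
yhat = np.array([44.0, 45.0, 60.0])   # hypothetical forecasts
# mae -> 3.0; rmse -> sqrt(41/3)
```

Note that MAPE is undefined when an actual price y_k is zero, which is why SMAPE, with its symmetric denominator, is also reported.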
We divide the whole database into two subsets: a training set and a test set. The training set includes a training subset and a validation subset, as shown in the dashed box in Figure 12. We initially apportion the data set into training, validation, and test sets with an 80-10-10 split. The sizes of the validation and test sets are kept fixed during the walk-forward test.
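The outer/inner indexing of the walk-forward procedure can be sketched as follows (our own illustration with made-up sizes; the paper rolls over days with an 80-10-10 initial split):

```python
def walk_forward_splits(n_obs, val_size, test_size):
    """Yield (train, validation, test) index ranges that roll one step at a
    time: each test day uses all earlier data (outer loop), with the
    chronologically last val_size days held out for validation (inner loop)."""
    first_test = n_obs - test_size
    for t in range(first_test, n_obs):
        train = range(0, t - val_size)
        val = range(t - val_size, t)
        yield train, val, range(t, t + 1)

splits = list(walk_forward_splits(n_obs=10, val_size=2, test_size=3))
# First step: train on days 0-4, validate on days 5-6, test on day 7.
```

Each successive step moves the test day forward by one and grows the training window, so no future observation ever leaks into training.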
For the training of neural network models, the input data is usually normalised to the interval [0, 1]. This is not only done because normalised data requires less time to train, but also because the prediction performance increases. In addition, we linearly interpolate the missing data and eliminate duplicates due to daylight saving time.
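The normalisation step is a standard min-max scaling; a minimal sketch (our own illustration, with the scaler fitted on the training range only so the test set is not peeked at):

```python
import numpy as np

def fit_minmax(x):
    """Learn the scaling range from the training data only."""
    return float(x.min()), float(x.max())

def scale(x, lo, hi):
    return (x - lo) / (hi - lo)      # maps the training range onto [0, 1]

def inverse(z, lo, hi):
    return z * (hi - lo) + lo        # back to EUR/MWh after prediction

prices = np.array([20.0, 35.0, 50.0])   # hypothetical training prices
lo, hi = fit_minmax(prices)
z = scale(prices, lo, hi)                # -> [0.0, 0.5, 1.0]
```

Predictions made in the scaled space are passed through `inverse` before the error metrics are computed.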
Training algorithms for DL models have usually required theinitialisation of the weights of neural networks from which tobegin the iterative training [58]. The random initial conditionsfor an LSTM network can result in di ff erent performance eachtime a given configuration is trained. Thus, we employ ten ex-periments for each model to reduce the impact of the variabilityon performance evaluation. Models are evaluated after takingthe average of experiments. The NARMAX (Nonlinear AutoRegressive Moving Averagewith eXogenous input) model is the benchmark (trained withthe optimal structure) for our case study. The statistical modelis widely used in energy price forecasting to handle multiplenonlinear inputs [32, 59]. The equation is represented as: y ( t ) = F (cid:96) (cid:104) y ( t − , . . . , y (cid:16) t − N y (cid:17) , u ( t ) , . . . , u ( t − N u ) ,ε ( t − , . . . ε ( t − N ε )] + ε ( t ) (19)where u ( t ) is the input and y ( t ) is the output time-series, ε ( t ) isthe prediction error, N u , N y and N ε are the input, output and pre-diction error lags, respectively, and F (cid:96) is a nonlinear function. SHAP (SHapley Additive exPlanations) is a game theo-retic method to explain the output of machine learning models[60, 61, 62]. The Shapley value is used to assess the featurerelevance relative to the expectation of the output [63]. In thisstudy, we use SHAP values to interpret the impact of certainvalues of a given feature on the expected price prediction. Inparticular, a Kernel SHAP is used for explaining an optimalSVR model obtained by grid-search on the dataset. The applied configuration of PSO is [ c : 0.5, c : 0.3, w : 0.7]and the stop condition is satisfied after 10,000 iterations. ForGA, the crossover possibility and mutation possibility are 0.5and 0.2, respectively. The population size is 100 and the maxi-mum number of generations is 10,000. 
On the basis of the pre-dictive ELM, the amount of the selected features by PSO-ELMand GA-ELM is automatically set to 30. For the sake of inputconsistency, the magnitude of the selected features of the restof the models is set to 30 as well. For the PC method, we rankall features attributable to the correlation coe ffi cients and selectthe first 30 features. In terms of RFE-SVR, we rank features byimportance, discard the least important features, and refit themodel until 30 features remain. The regularisation parameter λ in Lasso regression is 0.02. The Python package SHAP is available at https://github . com/slundberg/shap able 2 The proposed models.
Model   Explanation
M0      Benchmark: NARMAX model
M1      Filter-based method: PC-LSTM model
M2      Wrapper-based method: PSO-ELM-LSTM model
M3      Wrapper-based method: GA-ELM-LSTM model
M4      Wrapper-based method: RFE-SVR-LSTM model
M5      Embedded-based method: LASSO-LSTM model
M6      Autoencoder method: LSTM-LSTM Encoder-Decoder model
M7      Autoencoder method: CNN-LSTM Encoder-Decoder model
M8      Autoencoder method: ConvLSTM Encoder-Decoder model
M9      Two-stage method: PC-LSTM-LSTM Encoder-Decoder model
M10     Two-stage method: PSO-ELM-LSTM-LSTM Encoder-Decoder model
M11     Two-stage method: GA-ELM-LSTM-LSTM Encoder-Decoder model
M12     Two-stage method: RFE-SVR-LSTM-LSTM Encoder-Decoder model
M13     Two-stage method: LASSO-LSTM-LSTM Encoder-Decoder model
Our study aims at investigating the application and impact of different types of feature selection methods within a predictive LSTM architecture. We use a coherent configuration of a specific LSTM model for comparison and do not perform an extensive hyperparameter optimisation to search for the optimal configuration. After a non-exhaustive grid search, we constructed our prediction model from an LSTM with a single hidden layer of 300 units, followed by a fully connected dense layer with 100 neurons that precedes the output layer. The LSTM encoder has a hidden layer with 300 units. In the CNN-LSTM Encoder-Decoder model, the CNN encoder has two convolutional layers with 96 units to amplify any salient features, followed by a max-pooling layer. In the ConvLSTM Encoder-Decoder model, the encoder is a convolutional layer with 64 units. The input sequence length is 14 days (two weeks, commonly used in EPF). The optimiser is the Adam algorithm, and the loss function is the mean squared error (MSE).
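A back-of-the-envelope parameter count for the described predictor can be checked with the standard LSTM gate arithmetic; the 30 selected input features and the single-value output are our assumptions here:

```python
def lstm_params(n_inputs: int, n_units: int) -> int:
    # Four gates (input, forget, cell, output), each with a kernel over the
    # inputs, a recurrent kernel over the hidden state, and a bias vector.
    return 4 * (n_units * (n_inputs + n_units) + n_units)

def dense_params(n_inputs: int, n_units: int) -> int:
    # Weight matrix plus one bias per unit.
    return n_inputs * n_units + n_units

n_features = 30                       # inputs after feature selection (assumed)
lstm = lstm_params(n_features, 300)   # single hidden LSTM layer, 300 units
fc = dense_params(300, 100)           # fully connected layer, 100 neurons
out = dense_params(100, 1)            # single-step price output (assumed)
print(lstm, fc, out, lstm + fc + out)
```

Most of the capacity sits in the recurrent layer, which is typical for LSTM architectures of this shape.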
7. Results
In this section, we report the empirical results obtained by applying the introduced models. For simplicity of presentation, the list of models and their acronyms is shown in Table 2.
The results of the feature selection are shown in Table 3. In Table 4, the DM test results indicate that the proposed LSTM models are overwhelmingly better than the benchmark M0. Table 5 exhibits the prediction results of all the models, and Figure 13 shows those of the LSTM models over ten experiments, measured in terms of SMAPE. As seen in Figure 13, M4, M5 and M6 perform better than the other models. The SMAPE results for M1, M2, M3, M4, M5, M6, M7 and M8 are shown in Table 6; the MAD, RMSE and MAPE results are given in Appendix Tables A.8, A.9 and A.10. From Table 3, we can see that M1 selects all the day-ahead prices. It is not surprising that the day-ahead prices from different bidding areas are more relevant to the Nord Pool system price than the other feature variables. However, this over-selection results in information redundancy. Compared to M1, the wrapper-based methods eliminate several price variables rather than variables from the other categories. We can see that the two computationally intensive wrapper-based methods, M2 and M3, do not perform better than the others. The optimisation methods tend to get trapped in the same local optima. Although re-initialising and repeating the experiments can increase the chance of escaping such traps, when dealing with high-dimensional data sets the optimisation methods still cannot guarantee a globally optimal solution, and they are not suitable for all cases [64]. The straightforward concept and fast computation of the ELM have contributed to its widespread application in exhaustive grid searches, but, like other traditional neural networks, it does not consider the sequential relationships in time series data. This could be the reason why the two methods eliminate the lag-1 system price as an input. For the Lasso-regression method M5, we can summarise that it puts more emphasis on electricity transmission. Thus, the results show that the nonlinear methods M4 and M5 perform better on account of their feature selections in EPF.
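For reference, the error metrics can be computed as follows. This is one common SMAPE variant (absolute error scaled by the mean of |actual| and |forecast|); since the paper does not restate its exact formula, treat the definitions as assumptions:

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return 100.0 * np.mean(np.abs(forecast - actual) / np.abs(actual))

def smape(actual, forecast):
    """Symmetric MAPE, in percent: |F-A| scaled by the mean of |A| and |F|."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return 100.0 * np.mean(np.abs(forecast - actual)
                           / ((np.abs(actual) + np.abs(forecast)) / 2.0))

actual = np.array([30.0, 40.0, 50.0])    # hypothetical prices, EUR/MWh
forecast = np.array([33.0, 36.0, 50.0])
print(f"MAPE  = {mape(actual, forecast):.2f}%")
print(f"SMAPE = {smape(actual, forecast):.2f}%")
```

SMAPE is preferred in EPF when prices approach zero, where MAPE's division by the actual price blows up.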
In addition, we find that M6 performs better than M7 and M8. Our results indicate that LSTM is a better feature filter than convolutional neural networks for time series problems.

Table 3
The results of feature selection.
Feature   M1   M2   M3   M4   M5
F1        ✓    ✗    ✗    ✓    ✓
F2        ✓    ✓    ✓    ✓    ✓
F3        ✓    ✗    ✗    ✓    ✓
F4        ✓    ✓    ✗    ✓    ✗
F5        ✓    ✗    ✗    ✗    ✗
F6        ✓    ✗    ✗    ✗    ✗
F7        ✓    ✗    ✗    ✗    ✗
F8        ✓    ✓    ✗    ✗    ✗
F9        ✓    ✗    ✗    ✓    ✗
F10       ✓    ✗    ✗    ✓    ✓
F11       ✓    ✓    ✗    ✓    ✓
F12       ✓    ✗    ✓    ✓    ✓
F13       ✓    ✓    ✓    ✓    ✓
F14       ✓    ✗    ✓    ✗    ✗
F15       ✓    ✗    ✗    ✗    ✗
F16       ✓    ✗    ✗    ✗    ✗
F17       ✓    ✗    ✓    ✗    ✗
F18       ✓    ✓    ✗    ✓    ✓
F19       ✗    ✓    ✗    ✓    ✗
F20       ✗    ✓    ✓    ✓    ✗
F21       ✗    ✗    ✓    ✗    ✗
F22       ✗    ✓    ✓    ✓    ✗
F23       ✗    ✓    ✗    ✓    ✗
F24       ✓    ✓    ✓    ✗    ✗
F25       ✗    ✗    ✗    ✓    ✓
F26       ✗    ✗    ✓    ✓    ✓
F27       ✓    ✓    ✓    ✗    ✓
F28       ✗    ✓    ✓    ✗    ✗
F29       ✗    ✗    ✗    ✓    ✓
F30       ✗    ✓    ✗    ✗    ✓
F31       ✗    ✓    ✗    ✓    ✗
F32       ✗    ✓    ✓    ✓    ✓
F33       ✓    ✓    ✗    ✗    ✗
F34       ✓    ✗    ✓    ✓    ✓
F35       ✗    ✗    ✗    ✓    ✓
F36       ✗    ✓    ✓    ✓    ✓
F37       ✗    ✗    ✗    ✓    ✗
F38       ✓    ✓    ✓    ✓    ✗
F39       ✓    ✓    ✓    ✓    ✓
F40       ✗    ✗    ✓    ✗    ✗
F41       ✓    ✓    ✗    ✓    ✓
F42       ✗    ✗    ✗    ✗    ✗
F43       ✓    ✗    ✓    ✓    ✗
F44       ✓    ✓    ✓    ✓    ✓
F45       ✓    ✗    ✓    ✗    ✗
F46       ✗    ✓    ✗    ✗    ✓
F47       ✗    ✗    ✗    ✗    ✗
F48       ✗    ✗    ✗    ✓    ✓
F49       ✗    ✗    ✓    ✗    ✓
F50       ✗    ✓    ✓    ✗    ✓
F51       ✗    ✓    ✗    ✗    ✓
F52       ✗    ✓    ✓    ✗    ✗
F53       ✗    ✗    ✗    ✓    ✓
F54       ✗    ✗    ✓    ✗    ✓
F55       ✗    ✗    ✗    ✗    ✗
F56       ✗    ✗    ✓    ✗    ✗
F57       ✗    ✓    ✓    ✗    ✗
F58       ✗    ✗    ✓    ✗    ✓
F59       ✓    ✓    ✓    ✗    ✓
F60       ✓    ✓    ✓    ✗    ✗
F61       ✗    ✓    ✗    ✗    ✓
F62       ✗    ✗    ✗    ✗    ✗

Note: ✓ denotes that the feature is selected; ✗ denotes that the feature is not selected.
Figure 13
The SMAPEs of 10 experiments for M1, M2, M3, M4, M5, M6, M7 and M8.
We compare the SMAPE of the two-step and the two-stage models in Figure 14, from which it can be seen that M3 and M4 are improved by applying LSTM-LSTM as the predictor. We use a one-sided DM test to detect whether the two-stage LSTM-LSTM models are statistically better than the two-step LSTM models; the results are shown in Table 7. The superior feature selection from M4 (RFE-SVR) makes it possible for the autoencoder models to process it further and obtain more meaningful information.
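A minimal sketch of the one-sided DM test on two series of forecast errors follows. It is a simplification (squared-error loss, no autocorrelation or small-sample correction, synthetic errors); the paper's exact variant may differ:

```python
import math
import numpy as np

def dm_test_one_sided(err_a, err_b):
    """One-sided Diebold-Mariano test with squared-error loss.
    H1: forecast B is more accurate than forecast A."""
    d = np.asarray(err_a) ** 2 - np.asarray(err_b) ** 2    # loss differentials
    n = d.size
    dm = d.mean() / math.sqrt(d.var(ddof=1) / n)           # asymptotically N(0,1)
    p = 0.5 * (1.0 - math.erf(dm / math.sqrt(2.0)))        # P(Z > dm)
    return dm, p

rng = np.random.default_rng(0)
e_bench = rng.normal(0, 2.0, size=365)   # larger errors: the benchmark
e_lstm = rng.normal(0, 1.0, size=365)    # smaller errors: the candidate model
dm, p = dm_test_one_sided(e_bench, e_lstm)
print(f"DM = {dm:.2f}, one-sided p = {p:.4f}")
```

A large positive DM statistic (small p) rejects equal accuracy in favour of the second model, which is how the stars in Tables 4 and 7 should be read.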
Figure 14
The comparison of SMAPEs between two-step LSTM models and two-stage LSTM-LSTM models.
Additionally, we examine the forecasting performance for the 24 hourly system prices. Figures 15, 16 and 17 show the results for three peak hours, H8 (07-08), H12 (11-12) and H18 (17-18), respectively, measured in terms of SMAPE. We observe that the feature selections influence the forecasting accuracy, and that M4 and M5 remain relatively stable and perform better than the other models.
Figure 18 shows the features' ranking and impact on the predicted price, in terms of the selected features of the RFE-SVR model (M4), which is the model with the best performance. From Figure 18a, we can observe that the features from the

Table 4
The results of the one-sided DM test.

      M0         M1        M2       M3       M4       M5       M6       M7       M8
M0               5.83***   4.99***  3.90***  6.65***  7.09***  6.60***  4.61***  4.27***
M1    -5.83***             -1.15    -1.67*   2.31**   2.98***  1.59

Note: ***, **, and * denote significance at the 1%, 5%, and 10% levels, respectively.
Table 5
The SMAPE of M0, M1, M2, M3, M4, M5, M6, M7 and M8.

Model   M0     M1    M2    M3    M4    M5    M6    M7    M8
SMAPE   10.07  6.25  6.58  7.06  5.29  4.89  5.20  6.14  6.53
Table 6
The SMAPE (%) results for M1, M2, M3, M4, M5, M6, M7 and M8.

       M1    M2    M3    M4    M5    M6    M7    M8
count  10    10    10    10    10    10    10    10
mean   7.42  7.58  8.46  6.76  6.85  6.87  9.51  8.38
std    0.50  0.67  0.95  0.65  0.72  0.60  0.44  0.90
min    6.49  6.63  6.77  5.76  5.46  5.71  8.49  7.38
25%    7.22  7.10  7.95  6.33  6.43  6.77  9.38  7.59
50%    7.34  7.48  8.30  6.87  6.91  6.85  9.67  8.22
75%    7.81  8.10  9.05  7.01  7.37  7.27  9.77  9.13
max    8.13  8.70  10.09 7.87  7.74  7.75  9.95  9.93

Note: 25%, 50%, and 75% denote the 25%, 50%, and 75% percentiles.
Table 7
The results of the one-sided DM test when comparing two-step models (F1) and two-stage models (F2).

F1         M1        M2       M3          M4          M5
F2         M9        M10      M11         M12         M13
DM stat.   -0.6039   0.2222   2.4556***   2.4524***   -1.7053***

Note: ***, **, and * denote significance at the 1%, 5%, and 10% levels, respectively.
Nordic market with the most significant impact are the Nordic production prognosis (F25) and the consumption prognosis (F37). The features from the DE market (F35, F23, F41 and F29) are more important than those from the other cross-border markets. This can be explained by the fact that the German market has the most electricity cables and the highest electricity exports to the Nordic market. This indicates that it is critical to consider features from cross-border markets, with increasing interconnections across Europe, for EPF.

Figure 15
The SMAPEs of 10 experiments for predicting H8.

Figure 16
The SMAPEs of 10 experiments for predicting H12.

Figure 17
The SMAPEs of 10 experiments for predicting H18.

Figure 18
The feature ranking and feature impact of the selected features of RFE-SVR. (a) Bar chart of the average SHAP value magnitudes, showing the features' importance. (b) A set of beeswarm plots, where each dot corresponds to an individual day-ahead price. The dot's position on the x-axis shows the impact that the feature has on the model's prediction for that price. Multiple dots landing at the same x position pile up to show density.

From Figure 18b, we can see that a high value of the Nordic production prognosis (F25) or consumption prognosis (F37) elevates the predicted price, and that a low value of DE consumption (F35), with a long right tail, raises the predicted price. The long tail reaching to the right, but not to the left, indicates that extreme values of DE consumption can significantly raise the predicted price but cannot significantly lower it.

Figure 19a demonstrates the negative relationship between DE consumption and the conditional expectation of the predicted price. If DE consumption takes a high value, it tends to revert to its expectation (E[F35]). Thus, a downward expectation of DE consumption leads to an expected decline in the import demand from the Nordic market, which further decreases the predicted price relative to its expectation (E[f(x)]). Figure 19b represents the change in the predicted price as DE consumption changes; vertical dispersion at a single value of DE consumption represents interaction effects with other features. For example, the interaction effect of DE consumption with Nordic production (F19) is shown in Figure 19c. The SHAP dependence plot highlights that the impact differs with different values of Nordic production. The results reveal that the price prediction is less sensitive to DE consumption when the value of Nordic production is high.
In such a case, the information from the Nordic market, rather than from the cross-border countries, drives the system-price prediction.

Figure 19
The SHAP partial dependence (a) and dependence (b, c) plots of DE consumption (F35): (a) the SHAP partial dependence plot of DE consumption, in which the grey histogram shows the distribution of the feature in the test dataset and the x-axis is the normalised DE consumption; (b) the SHAP dependence plot of DE consumption; (c) the SHAP dependence plot of the interaction effect of DE consumption with Nordic production (F19). The x-axes in (b) and (c) are the real values of the consumption.

In Figure 20, we can see that the majority of the EUR/NOK (F43) dots make no contribution to the changes in the predicted price (their y-axis value is zero). Besides, there is no obvious interaction effect between Nordic production and the EUR/NOK exchange rate on the predicted price; we see this because the red points are randomly distributed below and above a SHAP value of zero. However, more red dots lie on the right side than on the left. This indicates that a depreciation of the NOK can stimulate electricity exports and further prompt Nordic production.

Figure 20
The SHAP dependence plot of EUR/NOK (F43), showing the interaction effect of EUR/NOK with Nordic production (F19).

From Figure 21, we find that the predicted price is expected to increase when we observe a high DK1→DE flow (indicating that the Nordic electricity price is relatively low). Meanwhile, a high flow from DE to DK1 means the price is high, and the price is expected to decline. In addition, an interaction effect of the DK1↔DE flow with the DE production prognosis shows that the flow has less impact on the predicted price when the expected production in Germany is high. Thus, a high production prognosis from the cross-border countries leads to a sharp decline in the expected cross-border transmission. Its impact, consequently, leads to a decrease in the price formation on the following day.
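The Shapley values underlying these plots can be computed exactly for a toy model by brute force over feature coalitions; this is the quantity that Kernel SHAP approximates. The three-feature linear "price model" below is purely illustrative:

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values for one instance x, replacing absent features
    by baseline values (the interventional convention Kernel SHAP uses)."""
    n = len(x)
    phi = np.zeros(n)

    def v(S):  # coalition value: features in S from x, the rest from baseline
        z = baseline.copy()
        z[list(S)] = x[list(S)]
        return f(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                wgt = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += wgt * (v(S + (i,)) - v(S))
    return phi

w = np.array([2.0, -1.0, 0.5])
mu = np.array([10.0, 5.0, 1.0])      # feature means (the baseline)
f = lambda z: float(w @ z)           # toy linear "price model"
x = np.array([12.0, 4.0, 1.0])
phi = shapley_values(f, x, mu)
print(phi)  # for a linear model this equals w_i * (x_i - mu_i)
assert np.allclose(phi, w * (x - mu))
```

For a linear model with independent features the Shapley value reduces to the weight times the deviation from the mean; for nonlinear models such as the SVR, sampling approximations like Kernel SHAP are needed because the exact sum is exponential in the number of features.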
Figure 21
The SHAP dependence plot of the DK1↔DE flow (F48), showing the interaction effect of the DK1↔DE flow with the DE production prognosis (F29).
Last but not least, not all of the cross-border electricity flows and flow deviations are helpful for forecasting. The reason is that, in many cases, the flow capacity is fully occupied. This lack of variability means they cannot provide useful information for forecasting. An example of the flow and flow deviation between FI and Russia can be seen in Figure 22. However, this implies the potential for more electrical power transmission across the Europe-wide market to increase the overall socio-economic benefits.
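One simple way to screen out such capacity-bound, near-constant flow features is a variance threshold; the flow values and the threshold below are hypothetical:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(2)
varying_flow = rng.normal(500, 120, size=1000)    # flow with spare capacity
saturated_flow = np.full(1000, 1400.0)            # flow pinned at capacity
saturated_flow[:20] = 1350.0                      # only rare deviations
X = np.column_stack([varying_flow, saturated_flow])

sel = VarianceThreshold(threshold=100.0)          # drop near-constant columns
sel.fit(X)
print(sel.get_support())  # the saturated flow is filtered out
```

A feature whose variance is (near) zero cannot correlate with price movements, so such a filter removes it before any of the more expensive selection methods run.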
Figure 22
The FI↔Russia flow (F54) (blue dots) versus the FI↔Russia flow deviation (F62) (red line).
An accurate prediction can be highly beneficial for electricity market participants in practice. A power market firm that is capable of forecasting the volatile electricity price with a reasonable level of accuracy can reduce trading risk and maximise profits in the day-ahead market by adjusting its bidding strategy and its schedule for production or consumption. More specifically, a 1% improvement in the MAPE of forecast accuracy (within a 5% to 14% range) leads to about a 0.1%-0.35% cost reduction [65]. On average, a 1% reduction in the MAPE of short-term price forecasts can result in savings of $1.5 million per year for a typical medium-sized utility company with a 5-GW peak load [66, 29]. Furthermore, electricity is economically non-storable, and an imbalance between production and consumption can result in power system instability [67]. Accurate electricity forecasting allows energy firms to organise production or consumption efficiently, and this improves the stability of the power system. In addition, the Nord Pool system price is an index reflecting the Nordic day-ahead market for electricity, and it is both a commodity benchmark and a regulated data benchmark [68]. Accurate forecasts can therefore serve as a reference for the Nord Pool market and its integrated markets to perform operation and regulation.

8. Conclusion

In this paper, we present three LSTM-based DL architectures for EPF. This study puts emphasis on the influence of feature selection methods in the proposed hybrid models. In particular, we compare the prediction performance of the two-step feature selection, autoencoder, and two-stage feature selection models in an empirical study on the Nord Pool day-ahead system price. In addition, we employ the SHAP method to evaluate the features' importance and impact on predicting this price. The main findings are: (1) Different feature selection methods lead to different feature selections.
As input, diverse features have a comparably significant impact on the performance of the LSTM-based predictive models. (2) The two-stage models can improve the forecasting accuracy of the two-step models to some extent. The superior feature selection from the RFE-SVR model allows the autoencoder model to detect more meaningful information for a more accurate prediction. (3) The features from the German market (which has the most power cables linking to Nord Pool) are more significant for EPF than the others. This indicates that more interconnections will increase the cross-border influence on EPF. (4) Compared to the other features, the exchange rates are relatively unimportant. (5) Flow deviations cannot significantly contribute to the price prediction because of their lack of variability; in many cases, the expected flow capacity is fully occupied. This implies that more interconnections are needed for an efficient Europe-wide electricity market.

For future studies, several extensions of the current work can be developed. Although the forecasting performance of the proposed models is considerable, we did not conduct an extensive grid search to optimise the hyperparameters. It is reasonable to believe that LSTM-based models with a more comprehensive architecture will achieve better forecasting performance. The results will benefit spot electricity traders and policymakers, who make decisions based on accurate price predictions. Moreover, we envision that testing further feature selection models can produce different feature selection subsets; these would give researchers and industry more possibilities to understand how different features affect prediction accuracy. Finally, the study was carried out using data from the Nord Pool market, but the generality of the proposed models allows a possible application to other integrated markets, such as EPEX and OMIE.
9. Acknowledgement
This work acknowledges research support by COST Action CA19130 "Fintech and Artificial Intelligence in Finance - Towards a transparent financial industry" (FinAI). This research has been performed within the +CityxChange (Positive City ExChange) project (https://cityxchange.eu/) under the Smart Cities and Communities topic, which has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No. 824260. Critical comments and advice from Florentina Paraschiv and Frode Kjærland are gratefully acknowledged.
Appendix A.
Table A.8
The MAD results for M1, M2, M3, M4, M5, M6, M7 and M8.

       M1    M2    M3    M4    M5    M6    M7    M8
count  10    10    10    10    10    10    10    10
mean   2.78  2.99  3.31  2.54  2.61  2.67  3.67  3.28
std    0.17  0.29  0.40  0.25  0.30  0.25  0.15  0.38
min    2.42  2.58  2.64  2.16  2.04  2.20  3.30  2.84
25%    2.72  2.77  3.10  2.34  2.45  2.59  3.65  2.99
50%    2.77  2.95  3.25  2.59  2.65  2.65  3.69  3.18
75%    2.92  3.24  3.56  2.63  2.74  2.85  3.73  3.60
max    2.98  3.41  3.93  2.99  3.00  3.09  3.89  3.95
Note: 25%, 50%, and 75% denote 25%, 50%, and 75% percentiles.
Table A.9
The RMSE results for M1, M2, M3, M4, M5, M6, M7 and M8.

       M1    M2    M3    M4    M5    M6    M7    M8
count  10    10    10    10    10    10    10    10
mean   3.60  3.79  4.22  3.25  3.46  3.33  4.74  3.99
std    0.24  0.42  0.55  0.33  0.43  0.32  0.29  0.46
min    3.17  3.15  3.35  2.61  2.79  2.85  4.13  3.50
25%    3.51  3.45  3.87  3.04  3.27  3.13  4.62  3.60
50%    3.64  3.78  4.16  3.30  3.50  3.30  4.90  3.92
75%    3.71  4.05  4.55  3.45  3.71  3.47  4.91  4.33
max    3.96  4.51  5.11  3.73  4.08  3.91  4.97  4.77
Note: 25%, 50%, and 75% denote 25%, 50%, and 75% percentiles.
Table A.10
The MAPE (%) results for M1, M2, M3, M4, M5, M6, M7 and M8.

       M1    M2    M3    M4    M5    M6    M7    M8
count  10    10    10    10    10    10    10    10
mean   7.24  7.83  8.75  6.66  6.79  7.01  9.31  8.73
std    0.46  0.73  1.05  0.65  0.78  0.68  0.47  1.06
min    6.28  6.82  6.89  5.72  5.32  5.67  8.56  7.50
25%    7.05  7.22  8.19  6.18  6.35  6.84  9.08  7.90
50%    7.24  7.71  8.62  6.73  6.85  6.95  9.16  8.44
75%    7.61  8.49  9.35  6.87  7.15  7.45  9.50  9.59
max    7.80  8.93  10.37 7.86  7.91  8.09  10.27 10.66
Note: 25%, 50%, and 75% denote the 25%, 50%, and 75% percentiles.

References

[1] Weron R. Modeling and forecasting electricity loads and prices: A statistical approach. Wiley; 2006.
[2] Bunn D. Modelling prices in competitive electricity markets. Wiley; 2004.
[3] Nogales FJ, Contreras J, Conejo AJ, Espinola R. Forecasting next-day electricity prices by time series models. IEEE Transactions on Power Systems 2002;17(2):342-348.
[4] Bunn DW. Forecasting loads and prices in competitive power markets. Proceedings of the IEEE 2000;88(2):163-169.
[5] Weron R. Electricity price forecasting: A review of the state-of-the-art with a look into the future. International Journal of Forecasting 2014;30(4):1030-1081.
[6] Nowotarski J, Weron R. Recent advances in electricity price forecasting: A review of probabilistic forecasting. Renewable and Sustainable Energy Reviews 2018;81:1548-1568.
[7] Ventosa M, Baillo A, Ramos A, Rivier M. Electricity market modeling trends. Energy Policy 2005;33(7):897-913.
[8] Kiose D, Voudouris V. The ACEWEM framework: An integrated agent-based and statistical modelling laboratory for repeated power auctions. Expert Systems with Applications 2015;42(5):2731-2748.
[9] Burger M, Schindlmayr G, Graeber B. Managing energy risk: An integrated view on power and other energy markets. Wiley; 2007.
[10] Islyaev S, Date P. Electricity futures price models: Calibration and forecasting. European Journal of Operational Research 2015;247(1):144-154.
[11] Weron R, Misiorek A. Forecasting spot electricity prices: A comparison of parametric and semiparametric time series models. International Journal of Forecasting 2008;24(4):744-763.
[12] Conejo AJ, Contreras J, Espínola R, Plazas MA. Forecasting electricity prices for a day-ahead pool-based electric energy market. International Journal of Forecasting 2005;21(3):435-462.
[13] Misiorek A, Trueck S, Weron R. Point and interval forecasting of spot electricity prices: Linear vs. non-linear time series models. Studies in Nonlinear Dynamics & Econometrics 2006;10(3).
[14] Gonzalez JP, Roque AMSM, Pérez EA. Forecasting functional time series with a new Hilbertian ARMAX model: Application to electricity price forecasting. IEEE Transactions on Power Systems 2018;33(1):545-556.
[15] Catalao J, Mariano S, Mendes V, Ferreira L. Short-term electricity prices forecasting in a competitive market: A neural network approach. Electric Power Systems Research 2007;77(10):1297-1304.
[16] Keles D, Scelle J, Paraschiv F, Fichtner W. Extended forecast methods for day-ahead electricity spot prices applying artificial neural networks. Applied Energy 2016;162:218-230.
[17] Peter S, Raglend I. Sequential wavelet-ANN with embedded ANN-PSO hybrid electricity price forecasting model for Indian energy exchange. Neural Computing & Applications 2017;28:2277-2292.
[18] Hinton G, Deng L, Yu D, Dahl GE, Mohamed A, Jaitly N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 2012;29:82-97.
[19] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. 2014. arXiv:1409.0473 [cs.CL].
[20] Li L, Yuan Z, Gao Y. Maximization of energy absorption for a wave energy converter using the deep machine learning. Energy 2018;165:340-349.
[21] Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 1994;5(2):157-166.
[22] Lago J, Ridder FD, Schutter BD. Forecasting spot electricity prices: Deep learning approaches and empirical comparison of traditional algorithms. Applied Energy 2018;221:386-405.
[23] Chang Z, Zhang Y, Chen W. Electricity price prediction based on hybrid model of Adam optimized LSTM neural network and wavelet transform. Energy 2019;187:115804.
[24] Kuo PH, Huang CJ. An electricity price forecasting model by hybrid structured deep neural networks. Sustainability 2018;10(4).
[25] Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: Data mining, inference and prediction. 2nd ed. Springer; 2009.
[26] Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, et al. Feature selection: A data perspective. ACM Computing Surveys 2017;50(6).
[27] Ziel F, Steinert R, Husmann S. Forecasting day ahead electricity spot prices: The impact of the EXAA to other European electricity markets. Energy Economics 2015;51:430-444.
[28] Panapakidis IP, Dagoumas AS. Day-ahead electricity price forecasting via the application of artificial neural network based models. Applied Energy 2016;172:132-151.
[29] Lago J, De Ridder F, Vrancx P, De Schutter B. Forecasting day-ahead electricity prices in Europe: The importance of considering market integration. Applied Energy 2018;211:890-903.
[30] Uribe JM, Mosquera-López S, Guillen M. Characterizing electricity market integration in Nord Pool. Energy 2020;208:118368.
[31] Marcjasz G, Lago J, Weron R. Neural networks in day-ahead electricity price forecasting: Single vs. multiple outputs. 2020. arXiv:2008.08006 [stat.AP].
[32] Johannesen NJ, Kolhe M, Goodwin M. Deregulated electric energy price forecasting in NordPool market using regression techniques. In: 2019 IEEE Sustainable Power and Energy Conference (iSPEC). 2019, p. 1932-1938.
[33] Greenfish. Shaping our electrical future: Moving towards an integrated European network. greenfish.eu/shaping-our-electrical-future-moving-towards-an-integrated-european-network/; 2019 [accessed 13 September 2020].
[34] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation 1997;9(8):1735-1780.
[35] Gers FA, Schmidhuber J. Recurrent nets that time and count. In: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), vol. 3. 2000, p. 189-194.
[36] Graves A, Schmidhuber J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 2005;18(5):602-610.
[37] Cho K, van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. 2014. arXiv:1406.1078 [cs.CL].
[38] Graves A. Generating sequences with recurrent neural networks. 2013. arXiv:1308.0850 [cs.NE].
[39] Chandrashekar G, Sahin F. A survey on feature selection methods. Computers & Electrical Engineering 2014;40(1):16-28.
[40] Chen X, Zeng X, van Alphen D. Multi-class feature selection for texture classification. Pattern Recognition Letters 2006;27(14):1685-1691.
[41] Nguyen HB, Xue B, Liu I, Andreae P, Zhang M. New mechanism for archive maintenance in PSO-based multi-objective feature selection. Soft Computing 2016;20:3927-3946.
[42] Shang L, Zhou Z, Liu X. Particle swarm optimization-based feature selection in sentiment classification. Soft Computing 2016;20:3821-3834.
[43] Zhou Y, Zhou N, Gong L, Jiang M. Prediction of photovoltaic power output based on similar day analysis, genetic algorithm and extreme learning machine. Energy 2020;204:117894.
[44] Krishnan GS, Sowmya Kamath S. A novel GA-ELM model for patient-specific mortality prediction over large-scale lab event data. Applied Soft Computing 2019;80:525-533.
[45] Luo P, Zhu S, Han L, Chen Q. Short-term photovoltaic generation forecasting based on similar day selection and extreme learning machine. 2018, p. 1-5.
[46] Huang GB, Zhu QY, Siew CK. Extreme learning machine: Theory and applications. Neurocomputing 2006;70(1):489-501.
[47] Saraswathi S, Sundaram S, Sundararajan N, Zimmermann M, Nilsen-Hamilton M. ICGA-PSO-ELM approach for accurate multiclass cancer classification resulting in reduced gene sets in which genes encoding secreted proteins are highly represented. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011;8(2):452-463.
[48] Chyzhyk D, Savio A, Graña M. Evolutionary ELM wrapper feature selection for Alzheimer's disease CAD on anatomical brain MRI. Neurocomputing 2014;128:73-80.
[49] Ahila R, Sadasivam V, Manimala K. An integrated PSO for parameter determination and feature selection of ELM and its application in classification of power system disturbances. Applied Soft Computing 2015;32:23-37.
[50] Whitley D. A genetic algorithm tutorial. Statistics and Computing 1994;4(2):65-85.
[51] Herceg S, Ujević Andrijić Ž, Bolf N. Development of soft sensors for isomerization process based on support vector machine regression and dynamic polynomial models. Chemical Engineering Research and Design 2019;149:95-103.
[52] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015, p. 1-9.
[53] Sutskever I, Vinyals O, Le QV. Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems 27. Curran Associates, Inc.; 2014, p. 3104-3112.
[54] Shi X, Chen Z, Wang H, Yeung DY, Wong WK, Woo WC. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems 28. Curran Associates, Inc.; 2015, p. 802-810.
[55] Diebold FX, Mariano RS. Comparing predictive accuracy. Journal of Business & Economic Statistics 2002;20(1):134-144.
[56] Harvey D, Leybourne S, Newbold P. Testing the equality of prediction mean squared errors. International Journal of Forecasting 1997;13(2):281-291.
[57] Varma S, Simon R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 2006;7:91.
[58] Goodfellow I, Bengio Y, Courville A. Deep learning. The MIT Press; 2016.
[59] McHugh C, Coleman S, Kerr D, McGlynn D. Daily energy price forecasting using a polynomial NARMAX model. In: Lotfi A, Bouchachia H, Gegov A, Langensiepen C, McGinnity M, editors. Advances in Computational Intelligence Systems. Cham: Springer International Publishing; 2019, p. 71-82.
[60] Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence 2020;2(1):56-67.
[61] Janzing D, Minorics L, Blöbaum P. Feature relevance quantification in explainable AI: A causal problem. 2019. arXiv:1910.13413.
[62] Sundararajan M, Najmi A. The many Shapley values for model explanation. 2020. arXiv:1908.08474.
[63] Lundberg S, Lee SI. A unified approach to interpreting model predictions. 2017. arXiv:1705.07874.
[64] Jamian JJ, Abdullah MN, Mokhlis H, Mustafa MW, Bakar AHA. Global particle swarm optimization for high dimension numerical functions analysis. Journal of Applied Mathematics 2014.
[65] Zareipour H, Canizares CA, Bhattacharya K. Economic impact of electricity market price forecasting errors: A demand-side analysis. IEEE Transactions on Power Systems 2010;25(1):254-262.
[66] Uniejewski B, Nowotarski J, Weron R. Automated variable selection and shrinkage for day-ahead electricity price forecasting. Energies 2016;9(8).
[67] Kaminski V. Energy markets. Risk Books; 2013.
[68] Nord Pool. Nordic system price. nordpoolgroup.com/4a7544/globalassets/download-center/day-ahead/methodology-for-calculating-nordic-system-price.pdf; 2020 [accessed 13 September 2020].