A Machine Learning approach to Risk Minimisation in Electricity Markets with Coregionalized Sparse Gaussian Processes
Daniel Poh, Stephen Roberts, and Martin Tegnér
Department of Engineering Science, University of Oxford, Parks Road, Oxford, OX1 2JD, United Kingdom; Oxford-Man Institute of Quantitative Finance, Eagle House, Walton Well Road, OX2 6ED, United Kingdom. * Corresponding author. March 2019
Abstract
The non-storability of electricity makes it unique among commodity assets, and it is an important driver of its price behaviour in secondary financial markets. The instantaneous and continuous matching of power supply with demand is a key factor explaining its volatility. During periods of high demand, costlier generation capabilities are utilised since electricity cannot be stored, and this has the impact of driving prices up very quickly. Furthermore, the non-storability also complicates physical hedging. Owing to these features, the problem of joint price-quantity risk in electricity markets is a commonly studied theme.

We propose using Gaussian Processes (GPs) to tackle this problem since GPs provide a versatile and elegant non-parametric approach for regression and time-series modelling. However, GPs scale poorly with the amount of training data due to a cubic complexity. These considerations suggest that knowledge transfer between price and load is vital for effective hedging, and that a computationally efficient method is required. To this end, we use coregionalized (or multi-task) sparse GPs, which address both issues.

To gauge the performance of our model, we use an average-load strategy as comparator. The latter is a robust approach commonly used by industry: if the spot price and load are uncorrelated and Gaussian, then hedging with the expected load will result in the minimum-variance position.

Our main contributions are twofold. Firstly, we develop a coregionalized sparse GP-based approach for hedging. Secondly, we demonstrate that our model-based strategy outperforms the comparator, and can thus be employed for effective hedging in electricity markets.

Keywords:
Machine Learning, Gaussian Processes, Energy Risk Management, Electricity Markets
The de-regulation of electricity markets has led to increased transparency and the widespread use of risk management products. On top of playing a vital role in balancing demand and supply, power markets fulfil an important function in managing and distributing risk (Bessembinder and Lemmon (2002) and Botterud, Kristiansen, and Ilic (2010)). Unlike other commodity assets, electricity is non-storable, a unique feature that has an impact on the term structure: the spot and forward prices are no longer linked via the cost of storage. This feature is thus a key factor not only complicating the fundamentals of hedging and pricing, but also in contributing to volatile spot markets with structural price jumps (see Botterud, Kristiansen, and Ilic (2010) and Benth, Benth, and Koekebakker (2008), for example). While market participants hedge with derivatives such as futures or forwards to minimise portfolio risk, the hedging itself may not be executed in an effective fashion.

We propose a hedging approach that flexibly incorporates the dependency between electricity price and consumption load as well as temporal correlation. We apply our method to data from the UK power market and demonstrate its empirical performance.
One studied theme within the literature on electricity markets is the coupling of risks arising from quantity and price. This joint risk underscores the importance of modelling the covariation between price and quantity. For instance, Bessembinder and Lemmon (2002) construct an equilibrium-based market model in which correlation plays a significant part in the optimal hedging strategies in the forward markets.

Oum, Oren, and Deng (2006) adopt the position of an agent with access to exogenous spot and forward markets. Using this perspective as a starting point, the authors derive optimal hedging strategies from a utility maximisation approach. These strategies take advantage of correlation between price and load to manage joint risk in a single period. Boroumand et al. (2015) tackle the problem of managing joint risk on an intraday scale. By employing a simulation-based approach, they demonstrate that hedging with shorter time-frequencies can outperform portfolios with a long-term focus.

Close to our work is Tegnér et al. (2017), who assume the retailer's perspective for risk management, a position similar to that adopted by Oum, Oren, and Deng (2006). The former jointly model the electricity spot price and consumption load with a two-dimensional Ornstein-Uhlenbeck process and a seasonality component.

There are more sophisticated approaches such as the three-factor model of Coulon, Powell, and Sircar (2013), which involves load-based regime switching and options that are evaluated with closed-form expressions. Here we focus on an operational risk management strategy and use exogenous data from the OTC forward markets.

GPs are not new to electricity markets, although their focus has primarily been on producing single-output forecasts, usually either consumption load or price. As mentioned earlier in this section, incorporating correlation between variables is among the most important aspects of price and risk models for electricity.
The coregionalized or multi-task GP framework provides a natural way of going about this task. Interestingly, it has yet to be addressed in the literature on GP-based models applied to the power markets.
The remainder of the paper is structured as follows: Section 2 begins with necessary background on the UK electricity market and then goes on to cover the hedging problem as well as a primer on GPs. In Section 3 we describe the tools and data used, and explain the methodology and experimental setup. Following this, Sections 4 and 5 present the results of the model construction and the hedging experiment respectively. Finally, Section 6 recaps and summarises key insights, and highlights possible directions for future work.
The power market in the UK can be broadly thought of as having two layers: the wholesale market and the retail market. In the wholesale market, power is produced by generators and sold to suppliers (or retailers). The retail market is the next layer, where suppliers resell electricity to end-users.

The wholesale electricity market can be further divided into three 'sub-markets' that have a temporal ordering. The future and forward markets typically occupy the longer end of this spectrum, dealing with deliveries over a longer span of time such as a month or a quarter. These contracts obligate participants to either consume or deliver a fixed amount of electricity during a specific time period in the future at some mutually agreed price set at inception. The day-ahead market has a shorter horizon: electricity in this market is transacted one day prior to actual delivery. Additionally, the term 'spot price' used in most electricity markets (including the UK's) refers to the price in the day-ahead market. The intra-day market and/or the balancing mechanism has an even shorter time horizon; electricity traded on these markets is delivered on the same day. Trading activities within these sub-markets usually take place either over-the-counter (OTC) or on exchanges.
While there exist numerous financial products that can be used to manage power risk, we consider only two in this paper: the base load and the peak load forward contract for UK power. Both are OTC instruments that require settling the difference between the spot price and some agreed price, over some period of time (a month in our context). While the base contract pays this difference for every hour of the month, the peak contract applies only to 'peak hours', which run from 7 a.m. to 7 p.m. on weekdays. By combining base and peak load contracts, we are able to create a portfolio that resembles the load profile.

The setup of the following hedging problem, where the aim is to accurately replicate an uncertain future financial obligation, is similar to Tegnér et al. (2017). This is usually achieved by using financial derivatives such as futures and forwards. Specifically, we attempt to minimize the expected loss by determining the portfolio of base and peak load forwards to hold.

To this end, let the (positive) payoff of the hedged portfolio at time T be π_T. The loss is then −π_T. Additionally, for some loss function u(·), a measure of its risk (see Artzner et al. (1999) for more details on risk measures) is given by E[u(−π_T)], where E(·) is the expectation operator. There are many options for the loss metric u(·), such as the maximum loss, the quadratic loss and the 'hockey-stick' loss (also known as the rectified linear unit). In order to penalize losses more heavily, we use the exponential loss u(·) = exp(·).

Suppose that at time t, we enter an agreement to deliver at some future time T > t an unspecified amount L_T of some commodity at a (constant) price C contracted by a fixed price agreement. If we take a naked position, this would mean having to purchase the commodity from the market at the prevailing price S_T and volume L_T. The payoff at time T is thus

π_T = (C − S_T) L_T
In order to reduce the exposure to price fluctuations, we initiate a static hedge at t. This can be achieved by purchasing a forward contract at price F with expiry T, although the volume V needs to be established at the time of purchase. If the fixed price agreement differs from the forward by some margin δ ≥ 0, i.e. C = F + δ, then the hedged position has payoff

π_T = (S_T − F)(V − L_T) + δ L_T    (1)

where we would need to determine the optimal V at the point when we set up our hedge. For this purpose, we use the exponential loss function and minimize the expected loss with respect to V. In other words, we want to find the optimal V* that minimizes

f(V) = E[ u( −((S_T − F)(V − L_T) + δ L_T) ) ]

Bringing the UK power markets back into focus, suppose we are contracted to deliver based on the terms of some fixed price agreement for the month M. We denote by T_i the i-th hour from the point of initiating the hedge to delivery. Additionally, for a given month M we refer to the set of peak hours as M_p, and the set of off-peak hours as M_o. Hence from Equation (1), we can write the off-peak payoff for T_i ∈ M_o as

π_{T_i} = (S_{T_i} − F_b)(V_b − L_{T_i}) + δ_b L_{T_i} = (S_{T_i} − F_b)(V_b − L_{T_i}),  T_i ∈ M_o    (2)

where V_b is understood to be the base load forward position with price F_b. For convenience, we set the differential δ_b between the fixed price agreement and the base load forward to 0.

To obtain the payoff for a peak hour, we combine the peak load forward position V_p at price F_p with a base load forward position. It is usually safe to assume that F_p ≥ F_b. The peak payoff for a hedged position is then

π_{T_i} = (S_{T_i} − F_b) V_b + (S_{T_i} − F_p) V_p + (F_fpa − S_{T_i}) L_{T_i},  T_i ∈ M_p

where F_fpa denotes the price of the fixed price agreement and, similar to the off-peak case, we set the margin δ_p = 0.
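As a concrete illustration of minimizing the expected exponential loss for the single-forward hedge, the sketch below runs a Monte Carlo grid search for V*. All numbers are hypothetical (correlated Gaussian draws stand in for the spot and load scenarios that the paper later samples from the fitted GP model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint draws of spot price S_T and load L_T (correlated
# Gaussians; the paper samples these from the fitted GP model instead).
n = 10_000
mean = np.array([50.0, 1.0])             # price in GBP/MWh, load rebased
cov = np.array([[25.0, 0.40],
                [0.40, 0.04]])
S, L = rng.multivariate_normal(mean, cov, size=n).T

F, delta = 50.0, 0.0                     # hypothetical forward price and margin

def expected_exp_loss(V):
    """f(V) = E[exp(-pi_T)] with pi_T = (S_T - F)(V - L_T) + delta * L_T."""
    pi = (S - F) * (V - L) + delta * L
    return np.mean(np.exp(-pi))

# Grid search for the loss-minimising volume V*.
grid = np.linspace(0.0, 2.0, 201)
V_star = grid[np.argmin([expected_exp_loss(V) for V in grid])]
```

Because the loss is exponential, the optimal volume is pulled away from the naive expected load by the price-load correlation, which is the effect the hedging strategy exploits.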
If we define

F̃ ≡ F_p − (V_b / (V_b + V_p)) (F_p − F_b)

such that F_b ≤ F̃ ≤ F_p, we can compact the expression for the peak payoffs as

π_{T_i} = (S_{T_i} − F̃)(V_b + V_p − L_{T_i}),  T_i ∈ M_p

If the goal is to hedge in an optimal fashion for the month M, this means having to determine the V_b* and V_p* that minimize the combined loss function

f(V_b, V_p) = Σ_{T_i ∈ M_o} E[ u( −(S_{T_i} − F_b)(V_b − L_{T_i}) ) ] + Σ_{T_i ∈ M_p} E[ u( −(S_{T_i} − F̃)(V_b + V_p − L_{T_i}) ) ]    (3)

where, similar to Equation (2), both differentials δ_b and δ_p of the respective forwards to the fixed price agreement are set to 0 for convenience. A terser formulation of the above problem is

argmin_{V_b, V_p} f(V_b, V_p)

There are two approaches to solving this minimization problem: we can either determine an expression for the expectation given by (3) (with δ L_{T_i} = 0 for simplicity), or we can arrive at an approximate value via Monte Carlo simulation. We adopt the latter approach.

Gaussian processes are an extension of multivariate Gaussian distributions. We briefly go through the underlying fundamentals in this section; a detailed treatment can be found in Rasmussen and Williams (2006).

A GP defines a probability distribution over functions. For a given input space X, a GP is defined by a mean function m(x) and a covariance function κ(x, x') as

f(x) ∼ GP(m(x), κ(x, x'))

where

m(x) = E[f(x)]
κ(x, x') = E[(f(x) − m(x))(f(x') − m(x'))]

The mean function m(x) is often set to zero since the GP is flexible enough to model the mean arbitrarily well.

Suppose we now have a training dataset (X, y) as well as a test dataset (X_*, y_*), where X ∈ R^{N×D}, y ∈ R^N, X_* ∈ R^{N_*×D} and y_* ∈ R^{N_*}.
Denoting the function outputs of the training and test data by f(X) and f(X_*), which we shorten to f and f_* for brevity, we can make use of the training observations to make predictions on the test set through the joint distribution

(f, f_*)^⊤ ∼ N( 0, [ K(X, X)  K(X, X_*) ; K(X_*, X)  K(X_*, X_*) ] )    (4)

where K(X, X) is the covariance matrix whose (i, j)-th element is κ(x_i, x_j). The posterior then has the form

p(f_* | X_*, X, f) = N(f_* | μ_*, Σ_*)
μ_* = K_*^⊤ K^{−1} f    (5)
Σ_* = K_{**} − K_*^⊤ K^{−1} K_*    (6)

where K ≡ K(X, X), K_* ≡ K(X, X_*) and K_{**} ≡ K(X_*, X_*).

The GP's covariance functions can be used to encode prior domain knowledge about f. Intuitively, these functions allow the model to generalize by correlating new inputs with existing observations.

For this subsection, we let r ≡ |x − x'|. The most commonly used covariance function is the squared exponential (SE) or radial basis function, which has the form

κ_SE(r) = σ² exp( −r² / (2ℓ²) )

The parameters σ and ℓ control the amplitude and characteristic length scale respectively. The continuous, differentiable and stationary properties of the SE kernel make it a popular choice for generic modelling.

As the SE is infinitely differentiable, it yields smooth sample paths which might be unsuitable for real-world phenomena. An alternative to the SE kernel is the Matérn family, for which the general form is

κ_Matérn(r) = σ² (2^{1−ν} / Γ(ν)) (√(2ν) r / ℓ)^ν K_ν(√(2ν) r / ℓ)

where σ and ℓ are both positive parameters, while K_ν is a modified Bessel function. The form of the Matérn class simplifies when ν is a half-integer, that is, ν = p + 1/2; common choices are ν = 3/2 and ν = 5/2.

To capture repetitive structure, we use the periodic kernel

κ_Periodic(r) = σ² exp( −2 sin²(π r / p) / ℓ² )

where σ, p and ℓ are parameters for the amplitude, period and length scale respectively. It is well known that electricity price and load exhibit periodicity on multiple levels.
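Equations (5) and (6) can be sketched directly in a few lines of numpy. This is a minimal noise-free example on toy one-dimensional data (all inputs and the sine target are illustrative, not the paper's data):

```python
import numpy as np

def k_se(a, b, sigma=1.0, ell=1.0):
    """Squared-exponential kernel between 1-D input vectors a and b."""
    r2 = (a[:, None] - b[None, :]) ** 2
    return sigma**2 * np.exp(-r2 / (2.0 * ell**2))

# Toy noise-free training data and a test grid (illustrative values only).
X = np.linspace(0.0, 5.0, 20)
f = np.sin(X)
X_star = np.linspace(0.0, 5.0, 50)

K = k_se(X, X) + 1e-8 * np.eye(len(X))   # jitter for numerical stability
K_star = k_se(X, X_star)                 # K(X, X*)
K_ss = k_se(X_star, X_star)              # K(X*, X*)

# Posterior mean and covariance, Equations (5) and (6).
mu_star = K_star.T @ np.linalg.solve(K, f)
Sigma_star = K_ss - K_star.T @ np.linalg.solve(K, K_star)
```

Note that the `solve` call is the O(N³) operation that motivates the sparse approximations discussed later.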
As we explain later, a combination of these kernels parameterized with different periods will be used to capture these seasonalities.

The rational quadratic is the last covariance considered for our model. It has the form

κ_RQ(r) = σ² ( 1 + r² / (2αℓ²) )^{−α}

where the parameters σ, α and ℓ are respectively the amplitude, shape parameter and length scale. The shape parameter determines the diffuseness of the length scales. The rational quadratic is generally used to model small- to medium-term irregularities.

One way of constructing new kernels is by affine transformations. We restrict the scope of our study to composite kernels formed by addition. A wider list of stationary and non-stationary covariance functions can be found in Rasmussen and Williams (2006).

GPs are flexible but perform poorly on larger data sets due to the matrix inversion operation (see Equations (5) and (6)), which scales as O(n³). This limitation has motivated work on various computationally efficient approaches that aim to approximate the exact GP solution (Gal, Wilk, and Rasmussen (2014) and Hensman, Fusi, and Lawrence (2013)). The comprehensive review by Liu et al. (2018) classifies scalable GPs by first grouping them into those that produce global approximations and those that produce local approximations. Methods falling within global approximations can be further sub-divided into those that (i) operate on a subset of the training data, (ii) use sparse kernels, and (iii) employ sparse approximations. Our GP model uses a variant of the latter known as the Deterministic Training Conditional (DTC). The rest of this section outlines the general idea of sparse approximations; readers are encouraged to refer to Quiñonero-Candela and Rasmussen (2005) for details.

We start by modifying the joint distribution given by Equation (4) to reduce the computational load due to matrix inversion in the posterior distribution. This step involves introducing latent or inducing variables u = (u_1, u_2,
…, u_m)^⊤. These inducing variables correspond to a set of input locations, hence they are also known as inducing inputs. Sparse algorithms vary in their approach to selecting inducing variables. By the consistency property of GPs, we can recover p(f, f_*) from p(f, f_*, u) by integrating out u in the latter:

p(f, f_*) = ∫ p(f, f_* | u) p(u) du

where u ∼ N(0, K_{u,u}). By assuming that both f and f_* are conditionally independent given u, the joint prior is approximated as

p(f, f_*) ≈ q(f, f_*) = ∫ q(f | u) q(f_* | u) p(u) du

Different choices of the approximate training conditional q(f | u) and approximate test conditional q(f_* | u) give rise to different algorithms. The exact training and test conditionals are given respectively as

p(f | u) = N( K_{f,u} K_{u,u}^{−1} u, K_{f,f} − Q_{f,f} )
p(f_* | u) = N( K_{f_*,u} K_{u,u}^{−1} u, K_{f_*,f_*} − Q_{f_*,f_*} )

where Q_{a,b} ≡ K_{a,u} K_{u,u}^{−1} K_{u,b}.

Another approach to sparse approximation, by Seeger, Williams, and Lawrence (2003), makes use of an estimation of the likelihood via the projection f = K_{f,u} K_{u,u}^{−1} u, which gives

p(y | f) ≈ q(y | u) = N( K_{f,u} K_{u,u}^{−1} u, σ² I )

The DTC achieves an equivalent model but makes use of a deterministic training conditional and an exact test conditional, given as

q_DTC(f | u) = N( K_{f,u} K_{u,u}^{−1} u, 0 )
q_DTC(f_* | u) = p(f_* | u)

The posterior or predictive distribution under the DTC is

q_DTC(f_* | y) = N( Q_{f_*,f} (Q_{f,f} + σ² I)^{−1} y, K_{f_*,f_*} − Q_{f_*,f} (Q_{f,f} + σ² I)^{−1} Q_{f,f_*} )
              = N( σ^{−2} K_{f_*,u} Ψ K_{u,f} y, K_{f_*,f_*} − Q_{f_*,f_*} + K_{f_*,u} Ψ K_{f_*,u}^⊤ )

where Ψ ≡ ( σ^{−2} K_{u,f} K_{f,u} + K_{u,u} )^{−1}.

Coregionalized GPs, also known as multi-task GPs, essentially extend the concept of correlating data to GPs.
The idea is that information gained from one process can be generalized to another; in other words, knowledge is transferred from one process to another. We cover the main ideas underlying coregionalization in this section; further details can be found in Alvarez, Rosasco, and Lawrence (2011).

We can motivate this idea by first supposing a set of models, M. Let the dataset used by some model m ∈ M be denoted by the vector x containing P elements. For simplicity, assume that all datasets are of size P. We can then denote a model and its dataset by a tuple (x, m). To allow knowledge transfer among models, we introduce a covariance kernel K that describes the correlation between the models m and m' using some matrix B_{m,m'}. This can be formulated as

K((x, m), (x', m')) = B_{m,m'} × K(x, x')    (7)

where it is understood that K also encodes the parameters of K(·,·). The kernel matrix corresponding to (7) can be written as

K(x, x') = [ B_{1,1} K(x, x')  …  B_{1,D} K(x, x') ; ⋮ ⋱ ⋮ ; B_{D,1} K(x, x')  …  B_{D,D} K(x, x') ] = B ⊗ K(x, x')

where D is the number of elements in M, and B ∈ R^{D×D} is a coregionalization matrix. K is DP × DP.

In order for the multiple-output kernel K to qualify as a valid kernel, we require that both K and B are valid covariance matrices. However, if K is already valid, then we only require that B be positive definite.

This section details a hedging approach with the GP model. We first introduce and describe the datasets and the tools used. We then describe how we construct the kernel. Following that, we explain how model estimation is carried out, and conclude by detailing the setup of the hedging problem.
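The coregionalized kernel of Equation (7) can be sketched numerically with a Kronecker product. In this toy example, D = 2 outputs (price and load in the paper's setting) share one squared-exponential base kernel; the low-rank-plus-diagonal form of B is a common construction and an assumption of this sketch, not a detail from the paper:

```python
import numpy as np

def k_se(a, b, ell=1.0):
    """Squared-exponential base kernel K(x, x') on 1-D inputs."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * ell**2))

# Toy inputs shared by D = 2 outputs (price and load in the paper's setting).
x = np.linspace(0.0, 3.0, 10)
K_x = k_se(x, x)

# A positive-definite coregionalization matrix B; the form B = W W^T + diag(kappa)
# with low-rank W is one common construction (an assumption of this sketch).
W = np.array([[1.0], [0.8]])
kappa = np.array([0.1, 0.1])
B = W @ W.T + np.diag(kappa)

# Full multi-output kernel matrix: B kron K(x, x'), of size DP x DP.
K_multi = np.kron(B, K_x)
```

The off-diagonal entries of B are what carry information between the two outputs; setting them to zero recovers two independent GPs.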
Hourly datasets for the UK day-ahead electricity spot price are obtained from Nord Pool (n.d.). Prices for the OTC base and peak load forward contracts are obtained from Bloomberg LP (n.d.).

To the best of our knowledge, there is no publicly available hourly consumption load data for the UK from 2016 to 2018. We worked around this issue by estimating consumption load from power demand data sourced from the National Grid ESO (n.d.). In order to do this, we use the fact that load is a flow of power over some period while demand is a snapshot at a single point in time, with units of measurement MWh (megawatt hour) and MW (megawatt) respectively. The available demand data from National Grid ESO are snapshots recorded at thirty-minute intervals. By averaging two thirty-minute readings (with the first starting exactly on the hour) and then assuming that this mean is constant over the hour, we obtain an approximate measure of consumption load.

[Table 1: Distribution of next-day spot price and power load, 19 September 2015 to 31 December 2018.]
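The conversion from half-hourly MW demand snapshots to approximate hourly MWh load described above amounts to averaging consecutive pairs of readings. A minimal sketch with hypothetical demand values:

```python
import numpy as np

# Hypothetical half-hourly demand snapshots in MW, two readings per hour
# with the first taken exactly on the hour (as in the National Grid ESO data).
demand_mw = np.array([30_000.0, 31_000.0,   # hour 0
                      33_000.0, 35_000.0])  # hour 1

# Average each pair of readings and treat the mean as constant over the hour,
# giving an approximate hourly consumption load in MWh.
load_mwh = demand_mw.reshape(-1, 2).mean(axis=1)
```

Because the averaged MW figure is assumed constant over one hour, it equals the energy in MWh for that hour, which is why no further unit conversion is needed.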
Both electricity demand and consumption load within the UK exhibit repetitive behaviour on multiple time-scales: on a yearly/seasonal basis, across the week, and over a day (Gavin (2014)). While price generally moves in tandem with load, its jumps make it far more volatile. This is evident from Table 1, where we can see a number of peak and off-peak prices lying more than three standard deviations from their respective means. While there also appears to be a fair number of load points between two and three deviations, this should be expected given the seasonality of the data.

For data pre-processing, we smooth the spikes in price by setting them to be no more than three standard deviations from the mean. When this condition is met for price, we also apply the operation to load. This is an important step to ensure that the optimization produces posteriors that reasonably fit the data. An alternative would have been to fix the length scale parameter in each of the covariance functions, although this would be both cumbersome and less intuitive.
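One reading of this pre-processing rule can be sketched as follows: cap price at three standard deviations from its mean, and at the hours where that cap binds, apply the analogous cap to load. The series below are synthetic stand-ins, not the UK data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic hourly series: a calm baseline with one large price spike
# (illustrative stand-in for the UK day-ahead data).
price = rng.normal(50.0, 2.0, size=200)
load = rng.normal(0.9, 0.05, size=200)
price[50], load[50] = 300.0, 1.2        # one spiky hour

cap_p = price.mean() + 3.0 * price.std()
cap_l = load.mean() + 3.0 * load.std()
spike = price > cap_p                   # hours where price breaches 3 sigma

# Cap price spikes at three standard deviations from the mean; where the
# price condition is met, apply the same capping operation to load
# (one interpretation of the paper's rule, stated here as an assumption).
price_s = np.where(spike, cap_p, price)
load_s = np.where(spike, np.minimum(load, cap_l), load)
```

Note that with very short series a single outlier inflates the standard deviation so much that the 3-sigma condition can never trigger; the rule is only meaningful on reasonably long windows like the monthly training sets used here.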
1. The window starts from 19 September 2015 because some of the hedging models for January 2016 (i.e. the CSGP-3M variants) were trained on 3 months of earlier data.

3.3 Kernel construction

We assembled the composite kernel by summing four types of kernels: the squared exponential (SE), Matérn with ν = 5/
2, periodic and rational quadratic kernels. The SE is included to capture the broader trend underlying the data sets, while the Matérn kernel incorporates the non-smooth nature of both price and load. We model the repetitive nature of the dataset over various time frequencies with three separate periodic kernels with periods of 12, 24 and 168 hours. These settings are based on domain knowledge and are confirmed in the plots and discussions in Section 3.2. Finally, the rational quadratic kernel is added to model the noise term in the datasets. Taking these together, the final composite kernel is

κ_composite = κ_SE + κ_Mat5/2 + κ_Per12 + κ_Per24 + κ_Per168 + κ_RQ    (8)

We restrict ourselves to adding simple kernel components to maintain a high degree of explainability. While not explicit, note that a white noise kernel is added to Equation (8) to account for observation variance.

We define our primary model to be the coregionalized sparse GP (CSGP) with the kernel given by (8), trained on 30 days' (approximately 1 month) of hourly data with 10% sparsity (10% of training data). While we recognize that sparsity could be a tunable hyper-parameter, we fix it here to control the computational complexity of the algorithm. We tune the hyper-parameters of the respective covariance functions listed in Section 2.4; note that there are only six such variables in total for the three periodic functions since we have fixed the periodicities. In order to find the optimal set of parameters for each hedging month, we run the optimization several times to avoid local optima.
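The composite kernel of Equation (8) can be sketched as a plain sum of the individual stationary kernels. Amplitudes and length scales are set to one here purely for illustration; in the paper they are tuned per hedging month:

```python
import numpy as np

# Stationary kernels as functions of r = |x - x'|, with unit amplitudes and
# length scales for illustration (the paper tunes these per hedging month).
def k_se(r):
    return np.exp(-r**2 / 2.0)

def k_mat52(r):
    return (1.0 + np.sqrt(5.0) * r + 5.0 * r**2 / 3.0) * np.exp(-np.sqrt(5.0) * r)

def k_per(r, p):
    return np.exp(-2.0 * np.sin(np.pi * r / p) ** 2)

def k_rq(r, alpha=1.0):
    return (1.0 + r**2 / (2.0 * alpha)) ** -alpha

def k_composite(r):
    """Equation (8): SE + Matern-5/2 + periodics (12h, 24h, 168h) + RQ."""
    return (k_se(r) + k_mat52(r) + k_per(r, 12.0)
            + k_per(r, 24.0) + k_per(r, 168.0) + k_rq(r))

x = np.arange(48.0)                      # 48 hourly inputs
R = np.abs(x[:, None] - x[None, :])
K = k_composite(R) + 1e-6 * np.eye(48)   # white-noise term for observation variance
```

Since a sum of valid kernels is itself a valid kernel, the resulting Gram matrix stays positive definite, which is what makes this additive construction safe.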
For a particular hedging month, the portfolio delivering the optimal hedge with respect to our model is

argmin_{V_b, V_p} f(V_b, V_p | θ)  s.t.  0 < F_b ≤ F̃ ≤ F_p,  where F̃ ≡ F_p − (V_b / (V_b + V_p)) (F_p − F_b)    (9)

where θ is the vector of optimized parameters of the fitted CSGP, while F_b and F_p are the prices of the base load and peak load forward contracts for that month.

To hedge a given month, we purchase some combination of base and peak load contracts around two weeks before the start of the month for liquidity considerations. Therefore, to hedge an exposure for the whole of January 2018, we would buy the appropriate amount of contracts on 18 December 2017, with the model trained on the hourly data from the previous thirty days. We assume that the base and peak load forward contracts are purchased at the closing price on the hedging initialization date. To ensure numerical stability, each hourly load observation in the training month is rebased against the maximum load over the length of the entire study. This has the effect of converting the absolute optimum base and peak positions to percentages of the overall maximum load. The actual payoff for t_i, the i-th hour, is given as

π_{t_i} = (S_{t_i} − F_b)(V_b* − L_{t_i})  for t_i ∈ M_o    (10)
π_{t_i} = (S_{t_i} − F̃)(V_b* + V_p* − L_{t_i})  for t_i ∈ M_p    (11)

The table in Appendix A lists the dates at which we initiated our monthly positions for the entire period of our study.

In the following set of studies, we refer to CSGPs trained on one month of data as the CSGP-1M, on two months as the CSGP-2M, and so on. We compare our primary model, the CSGP-1M with 10% sparsity using the full kernel given by Equation (8), in three different settings.
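A minimal Monte Carlo sketch of the optimization in Equation (9) is given below. All scenario draws and prices are hypothetical (the paper samples spot and load jointly from the fitted CSGP), the peak-hour mask here ignores the weekday restriction, and a grid search stands in for the numerical optimiser:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical scenario draws for hourly spot price and (rebased) load over
# two illustrative days; the paper samples these from the fitted CSGP.
n_sims, n_hours = 200, 48
hour = np.arange(n_hours) % 24
peak = (hour >= 7) & (hour < 19)         # crude peak mask (no weekday logic)
S = rng.normal(50.0, 8.0, size=(n_sims, n_hours))
L = rng.normal(0.8, 0.1, size=(n_sims, n_hours)) + 0.15 * peak

F_b, F_p = 48.0, 55.0                    # hypothetical forward prices

def expected_loss(V_b, V_p):
    """Equation (3) with exponential loss and delta_b = delta_p = 0."""
    F_tilde = F_p - V_b / (V_b + V_p) * (F_p - F_b)
    off = np.exp(-(S[:, ~peak] - F_b) * (V_b - L[:, ~peak])).sum(axis=1)
    on = np.exp(-(S[:, peak] - F_tilde) * (V_b + V_p - L[:, peak])).sum(axis=1)
    return off.mean() + on.mean()

# Grid search over admissible positive positions; the constraint
# F_b <= F_tilde <= F_p holds automatically whenever V_b, V_p > 0.
grid = np.linspace(0.05, 1.5, 15)
V_b_star, V_p_star = min(((vb, vp) for vb in grid for vp in grid),
                         key=lambda v: expected_loss(*v))
```

Because F̃ is a position-weighted average of F_b and F_p, any positive pair (V_b, V_p) satisfies the constraint in (9), so an unconstrained search over positive volumes suffices here.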
Fig. 1 shows the posterior predictions on price and load at different degrees of sparsity while fixing all other features of the model. These models are trained on 720 hours (one month) of hourly data for hedging December 2018, and prediction takes place from the 720th hour onwards, i.e., from midnight on 1 December 2018. The dashed vertical line marks the point beyond which we start generating forecasts for the hedging month of December itself.

[Figure 1: Price and load forecasts for December 2018 from models trained on 1M of data with different degrees of sparsity. The top pair shows the posterior on spot price (left) and consumption load (right) using 10% sparsity; the middle pair uses 1% sparsity; the bottom pair is trained on the full data set and does not use sparsity at all.]

As seen in Fig. 1, the posterior mean (the dark blue line) becomes less smooth as more points are used for training. Using fewer points and ensuring that they are sufficiently far apart from one another produces longer correlation lengths between points. This allows us to learn a stable longer-term trend while avoiding over-fitting to the short-term noise contained in the data. For example, the mean does not attempt to fit the price spikes in Fig. 1c (using 1% of training data), unlike Fig. 1e (100% of data). This has important implications for hedging performance, as we shall see later in the paper.

We find that the set of periodic functions is the largest contributor to forecast performance. Models that incorporate these components but lack any of the other components in Equation (8) generate predictions that are similar to those of the base model.
Among these models, it is difficult to say definitively which is better since their differences are marginal.
Fig. 2 compares models trained on longer periods of two and three months. Upon inspection, it appears that the CSGP-1M with 10% sparsity (the primary model) produces posterior predictions that are superior to those of the CSGP-2M and CSGP-3M, both trained with 10% sparsity. This is likely due to the recurrent nature of the data, so introducing more data from the past has only marginal benefits. Additionally, the repetitive nature of the data means that its truly unique segment is actually a subset that is 'replicated' many times. This has the effect of cramming more points (144 and 216 for the CSGP-2M and CSGP-3M respectively) into the unique segment of the data, which shortens the correlation length across points. The model consequently learns less of the broader trend but more of the random noise (see Section 4.1). For instance, the posterior means in both Fig. 2a and Fig. 2c miss the cluster of points at around the 1480th and 2200th hour respectively. As shown in Fig. 1a, the primary model does not have this problem.
2. To clarify the use of sparsity in the context of this paper: for two models trained on the same data, the model with 1% sparsity uses less data than a model using, say, 10% of the data.

[Figure 2: The effects of training on different lengths of data while keeping all other features fixed. The top pair trains on two months of hourly data while the bottom pair trains on three. Both are trained with 10% sparsity.]
For the average load model, we assume an oracle that is able to look forward in time to know the exact load for the hedging month. To this end, V_b* is obtained by taking the mean of the actual hourly load during off-peak hours. V_p* is obtained in a similar manner with a slight modification: first by taking the mean of the hourly load for peak hours and then subtracting V_b* from the result. Assuming spot price and consumption load are uncorrelated, hedging at the expected load results in the minimum-variance position; this is a robust approach used by industry (Tegnér et al. (2017)). While it is entirely plausible that non-public, proprietary models exist that perform better than this benchmark, we do not concern ourselves with what those might be.

The coregionalized sparse GP models outperform the average load comparator across the duration of the hedging program. Appendix B provides a monthly breakdown with the corresponding optimal V_b and V_p (scaled by the average load for the comparator, and the maximum load across the data for the model). Fig. 3 illustrates the result: the chart on the left compares absolute monthly performance, while the right shows the cumulative performance over the average load hedge. Both charts are plotted over the duration of the experiment.

[Figure 3: Hedging performance of various GP models against the average load hedge. Left compares absolute monthly payoff; right shows the cumulative payoff in excess of the comparator. Both charts span the empirical hedging experiment (January 2016 to December 2018).]
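The oracle average-load positions described above reduce to two means over the realized load. A minimal sketch on a synthetic monthly load profile (the repeating daily peak pattern here ignores the weekday restriction and is purely illustrative):

```python
import numpy as np

# Hypothetical hourly load for one 720-hour month, with a simplified daily
# peak pattern (7 a.m. to 7 p.m.; the weekday restriction is ignored here).
hours = np.arange(720)
peak = (hours % 24 >= 7) & (hours % 24 < 19)
load = np.where(peak, 0.95, 0.75) + 0.01 * np.sin(hours / 24.0)

# Oracle average-load hedge: the base position is the mean off-peak load,
# the peak position is the mean peak load minus the base position.
V_b = load[~peak].mean()
V_p = load[peak].mean() - V_b
```

Since the base contract also pays during peak hours, subtracting V_b from the mean peak load avoids double-covering that part of the exposure.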
[Figure 4: Comparing the posterior on normalized load across different levels of sparsity. The model on the left uses 1% of the training data; the model on the right uses the entire training set.]

On a monthly basis, the CSGP-1M with 10% sparsity (our primary model) outperforms the average-load hedge in the majority of months over the three-year period from January 2016 to December 2018. The relative performances of the various GP models in Fig. 3a are difficult to discern, but the cumulative plot on the right makes things clear. From Fig. 3b, it appears that hedging performance is correlated with sparsity, since it is the CSGP-1M with 1% sparsity that delivers the highest payoff in excess of the average load. Over the three years spanning 2016-2018, the CSGP-1M with 1% and 10% sparsity has payoffs of 5.42 and -2.91 mio GBP respectively, both higher than the average load's -3.61 mio GBP (see the 'Total payoffs' row in Appendix B).

As explained in Section 4.1, decreasing the number of data points increases the length scale; this focuses the learning on longer-term trends rather than on the noise in the data. As an example, we use the forecasts made by the CSGP-1M with 1% sparsity (using 1% of the data) and the CGP-1M (using the full data set) for September 2018. The results are shown in Fig. 4. Both forecasts appear reasonably similar, although we argue that the CSGP-1M is superior since its posterior mean is closer to the actual data points. Furthermore, most if not all points are well enclosed in its confidence interval. The CGP-1M uses all 720 hours for training. This produces a shorter length scale, which ends up fitting noise and generates overly confident predictions, as can be seen in Fig. 4b from the cluster of points around the 950th hour and the upper cluster of points in the 975-1040 hour range that fall outside the confidence interval.

There are some caveats to keep in mind. Firstly, we recognize the myriad intricacies pertaining to the actual hedging of electricity.
However, we had to simplify certain aspects owing to the lack of public data, and also to balance real-life applicability against keeping to a reasonable scope of work. Examples include our approach of deriving consumption load from demand, as well as the pre-processing step of smoothing out price spikes by capping them at three standard deviations from the mean. As we had no access to the typical level of load that retailers aim to meet, another crucial set of simplifications we make in order to calculate payoffs is the following: (i) we adopt the position of a firm supplying 1.5% of the market (assuming around 70 domestic suppliers (Ofgem (2018)) and equal market share across participants, this is a share of approximately 1.42% per retailer, which we round up to 1.5% for convenience), (ii) the area that this retailer supplies has a load profile similar to the national-level load profile, and thus (iii) the profile that this retailer supplies is a constant 1.5% 'strip' of the national-level load.
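The two data simplifications above (spike capping and the constant 1.5% strip) amount to a short pre-processing step. A sketch under the stated assumptions; the toy price series and the 30 GW national load are illustrative values only:

```python
import numpy as np

def cap_spikes(prices, k=3.0):
    """Cap prices at k standard deviations from the sample mean,
    mirroring the spike-smoothing step described above."""
    mu, sigma = prices.mean(), prices.std()
    return np.clip(prices, mu - k * sigma, mu + k * sigma)

def retailer_load(national_load, share=0.015):
    """Constant 1.5% 'strip' of the national-level load (simplification (iii))."""
    return share * national_load

rng = np.random.default_rng(0)
prices = rng.normal(60.0, 5.0, 720)  # toy GBP/MWh series for one month
prices[100] = 400.0                  # inject a spike
capped = cap_spikes(prices)          # spike pulled back to mean + 3 std
strip = retailer_load(np.full(720, 30000.0))  # 30,000 MW national -> 450 MW strip
```

Note that the capping bounds are computed on the series including the spike; a robust estimator (e.g. the median) would be a natural variation, but the text specifies the mean.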
Conclusion
We demonstrate that GPs can play a significant role in the risk-management pipeline for power markets. Additionally, coregionalization shows that there is knowledge transfer between the GPs trained on spot price and load. Lastly, sparsity allows us to produce, at a smaller computational cost, forecasts comparable to those of similar models trained on far more data. We highlight some potentially interesting directions that future research can take.

A straightforward extension would be to further pre-process the data in order to improve accuracy. For instance, we could segment the data, fit additional separate models for different day types (holidays, weekends), and then combine their respective predictions. This approach is likely to improve the fit of the GP model and hence the hedging efficacy.

Another possible extension would be to re-cast the problem as a dynamic hedging program. In other words, V b and V p would no longer be fixed at the start of the month but would instead be adjusted over the course of the hedging period. This approach requires a dynamic model of the various price and load variables, as well as other modifications to the existing framework.

Regarding kernel design, our approach in this paper is not exhaustive. It is entirely likely that other sophisticated, high-performing compositions exist. With the recent focus on AI explainability, there is a risk that such 'black box' kernels may be challenging to interpret. To this end, Duvenaud et al. (2013) offer a method that might serve as a helpful starting point.

Lastly, another avenue for further work is modifying the kernel parameter estimation task to give it a more Bayesian flavor. In this form, we move from estimating a posterior for the processes alone to also estimating posteriors for the parameters. We can do this by first introducing a prior belief on the values each parameter may take.
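This prior-plus-update procedure can be made concrete on a one-dimensional grid. The snippet below is our own illustration, not the paper's method: it assumes a single RBF length-scale hyperparameter and a log-normal prior (both choices made purely for illustration), weights the GP marginal likelihood by that prior, and averages over the resulting posterior rather than keeping a single maximising value:

```python
import numpy as np

def rbf(x1, x2, ls):
    """Squared-exponential kernel with unit signal variance."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def log_marginal_likelihood(y, X, ls, noise=0.1):
    """Log evidence of a zero-mean GP with an RBF kernel and fixed noise."""
    K = rbf(X, X, ls) + noise * np.eye(len(X))
    _, logdet = np.linalg.slogdet(K)
    alpha = np.linalg.solve(K, y)
    return -0.5 * (y @ alpha + logdet + len(X) * np.log(2 * np.pi))

# Toy data from a smooth function standing in for a load series
X = np.linspace(0.0, 10.0, 40)
y = np.sin(X)

# Posterior over the length scale on a grid:
# log-likelihood plus log-prior, then normalize.
grid = np.linspace(0.2, 5.0, 50)
log_post = np.array([log_marginal_likelihood(y, X, ls) for ls in grid])
log_post += -0.5 * (np.log(grid) ** 2)  # log-normal prior (assumed choice)
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Point estimate by averaging over the posterior
ls_posterior_mean = float((grid * post).sum())
```

In a full treatment one would place priors over all kernel parameters jointly and integrate with MCMC or variational methods; the grid is only to show the mechanics.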
Subsequently, by updating on observed data, we obtain a joint posterior distribution over the parameters. A final point estimate for each parameter can then be obtained by averaging (summing or integrating) over its posterior, rather than by taking a single maximising value.

References
Alvarez, Mauricio A., Lorenzo Rosasco, and Neil D. Lawrence. 2011. "Kernels for Vector-Valued Functions: a Review." arXiv:1106.6251 [cs, math, stat]. http://arxiv.org/abs/1106.6251.

Artzner, Philippe, Freddy Delbaen, Jean-Marc Eber, and David Heath. 1999. "Coherent Measures of Risk." Mathematical Finance 9 (3): 203-228. http://doi.wiley.com/10.1111/1467-9965.00068.

Benth, F.E., J.S. Benth, and S. Koekebakker. 2008. Stochastic Modelling of Electricity and Related Markets. Advanced Series on Statistical Science & Applied Probability. World Scientific. https://books.google.co.uk/books?id=MHNpDQAAQBAJ.

Bessembinder, Hendrik, and Michael L. Lemmon. 2002. "Equilibrium Pricing and Optimal Hedging in Electricity Forward Markets." The Journal of Finance 57 (3): 1347-1382. doi:10.1111/1540-6261.00463.

Bloomberg LP. n.d. UK Power Baseload and Peakload Forward Month Specific. Accessed March 4, 2019.

Boroumand, Raphaël Homayoun, Stéphane Goutte, Simon Porcher, and Thomas Porcher. 2015. "Hedging strategies in energy markets: The case of electricity retailers." Energy Economics 51: 503-509. https://linkinghub.elsevier.com/retrieve/pii/S0140988315002170.

Botterud, Audun, Tarjei Kristiansen, and Marija D. Ilic. 2010. "The relationship between spot and futures prices in the Nord Pool electricity market." Energy Economics 32 (5): 967-978. http://linkinghub.elsevier.com/retrieve/pii/S0140988309002278.

Coulon, Michael, Warren B. Powell, and Ronnie Sircar. 2013. "A model for hedging load and price risk in the Texas electricity market." Energy Economics 40: 976-988. https://linkinghub.elsevier.com/retrieve/pii/S0140988313001084.

Duvenaud, David, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. 2013. "Structure Discovery in Nonparametric Regression through Compositional Kernel Search." arXiv:1302.4922 [cs, stat]. http://arxiv.org/abs/1302.4922.

Gal, Yarin, Mark van der Wilk, and Carl E. Rasmussen. 2014. "Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models." arXiv:1402.1389 [cs, stat]. http://arxiv.org/abs/1402.1389.

Gavin, Claire. 2014. "Seasonal variations in electricity demand." In Energy Trends: March 2014, Special feature articles. https://bit.ly/2Uv8MqY.

Hensman, James, Nicolo Fusi, and Neil D. Lawrence. 2013. "Gaussian processes for big data." In Conference on Uncertainty in Artificial Intelligence. auai.org.

Liu, Haitao, Yew-Soon Ong, Xiaobo Shen, and Jianfei Cai. 2018. "When Gaussian Process Meets Big Data: A Review of Scalable GPs." arXiv:1807.01065 [cs, stat]. http://arxiv.org/abs/1807.01065.

National Grid ESO. n.d. Data Explorer, Historic Demand Data. Database. Accessed March 4, 2019. https://demandforecast.nationalgrid.com/efs_demand_forecast/faces/DataExplorer.

Nord Pool. n.d. N2EX Day Ahead Auction Prices. Database. Accessed March 4, 2019.

Ofgem. 2018. Number of active domestic suppliers by fuel type (GB). Accessed January 15, 2019.

Oum, Yumi, Shmuel Oren, and Shijie Deng. 2006. "Hedging quantity risks with standard power options in a competitive wholesale electricity market." Naval Research Logistics 53 (7): 697-712. http://doi.wiley.com/10.1002/nav.20184.

Quinonero-Candela, Joaquin, and Carl Rasmussen. 2005. "A Unifying View of Sparse Approximate Gaussian Process Regression." Journal of Machine Learning Research. http://dl.acm.org/citation.cfm?id=1046920.1194909.

Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. Cambridge, Mass.: MIT Press.

Seeger, Matthias, Christopher K. I. Williams, and Neil D. Lawrence. 2003. "Fast Forward Selection to Speed Up Sparse Gaussian Process Regression." In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, edited by Christopher M. Bishop and Brendan J. Frey. Key West, FL. http://inverseprobability.com/publications/seeger-fast03.html.

Tegnér, Martin, Rune Ramsdal Ernstsen, Anders Skajaa, and Rolf Poulsen. 2017. "Risk-minimisation in electricity markets: Fixed price, unknown consumption." Energy Economics 68: 423-439. https://linkinghub.elsevier.com/retrieve/pii/S0140988317303559.

Appendix A: Dates at which the hedging for the month was initiated
Contract   Init. Date    Contract   Init. Date
Jan-16     18-Dec-15     Jul-17     16-Jun-17
Feb-16     18-Jan-16     Aug-17     18-Jul-17
Mar-16     16-Feb-16     Sep-17     18-Aug-17
Apr-16     21-Mar-16     Oct-17     15-Sep-17
May-16     18-Apr-16     Nov-17     18-Oct-17
Jun-16     19-May-16     Dec-17     17-Nov-17
Jul-16     20-Jun-16     Jan-18     18-Dec-17
Aug-16     18-Jul-16     Mar-18     15-Feb-18
Sep-16     18-Aug-16     Apr-18     16-Mar-18
Oct-16     19-Sep-16     May-18     17-Apr-18
Nov-16     18-Oct-16     Jun-18     18-May-18
Dec-16     17-Nov-16     Jul-18     18-Jun-18
Jan-17     19-Dec-16     Aug-18     18-Jul-18
Feb-17     18-Jan-17     Sep-18     17-Aug-18
Mar-17     15-Feb-17     Oct-18     17-Sep-18
Apr-17     20-Mar-17     Nov-18     18-Oct-18
May-17     18-Apr-17     Dec-18     16-Nov-18
Jun-17     18-May-17

To hedge for a particular month, we hold the appropriate amounts of base and peak load contracts approximately two weeks before the start of that month, as given by the above dates. The base and peak load prices used for a particular month are the closing prices of the respective forward contracts.

Appendix B: Comparing results of the model hedge and the average load

[Table: Optimal holdings of base and peak load contracts (%) and the corresponding payoffs (mio GBP) for the Average Load hedge, the CSGP-1M with 1% sparsity, and the CSGP-1M with 10% sparsity; monthly rows from January 2016 to June 2017.]
[Table, continued: Optimal holdings of base and peak load contracts (%) and the corresponding payoffs (mio GBP) for the Average Load hedge, the CSGP-1M with 1% sparsity, and the CSGP-1M with 10% sparsity; monthly rows from July 2017 to December 2018, together with the total payoffs for each strategy.]

The CSGP-1M with 1% sparsity is the only model that ends the hedging program with a positive payoff.