A Deep Reinforcement Learning Framework for Continuous Intraday Market Bidding
Ioannis Boukas, Damien Ernst, Thibaut Théate, Adrien Bolland, Alexandre Huynen, Martin Buchwald, Christelle Wynants, Bertrand Cornélusse
Abstract
The large integration of variable energy resources is expected to shift a large part of the energy exchanges closer to real time, where more accurate forecasts are available. In this context, the short-term electricity markets, and in particular the intraday market, are considered a suitable trading floor for these exchanges to occur. A key component for the successful integration of renewable energy sources is the use of energy storage. In this paper, we propose a novel modelling framework for the strategic participation of energy storage in the European continuous intraday market, where exchanges occur through a centralized order book. The goal of the storage device operator is the maximization of the profits received over the entire trading horizon, while taking into account the operational constraints of the unit. The sequential decision-making problem of trading in the intraday market is modelled as a Markov Decision Process. An asynchronous distributed version of the fitted Q iteration algorithm is chosen for solving this problem due to its sample efficiency. The large and variable number of orders in the order book motivates the use of high-level actions and an alternative state representation. Historical data are used for the generation of a large number of artificial trajectories in order to address exploration issues during the learning process. The resulting policy is back-tested and compared against a benchmark strategy that is the current industrial standard. Results indicate that the agent converges to a policy that achieves on average higher total revenues than the benchmark strategy.

Department of Electrical Engineering and Computer Science, University of Liège, Liège, Belgium; Market Modeling and Market View, ENGIE, Brussels, Belgium. Correspondence to: Ioannis Boukas <[email protected]>.
1. Introduction
The vast integration of renewable energy resources (RES) into (future) power systems, as directed by the recent worldwide energy policy drive (The European Commission, 2017), has given rise to challenges related to the security, sustainability and affordability of the power system ("The Energy Trilemma"). The impact of high RES penetration on modern short-term electricity markets has been the subject of extensive research over the last few years. Short-term electricity markets in Europe are organized as a sequence of trading opportunities where participants can trade energy in the day-ahead market and can later adjust their schedule in the intraday market until the physical delivery. Deviations from this schedule are then corrected by the transmission system operator (TSO) in real time, and the responsible parties are penalized for their imbalances (Meeus & Schittekatte, 2017).

Imbalance penalties serve as an incentive for all market participants to accurately forecast their production and consumption and to trade based on these forecasts (Scharff & Amelin, 2016). Due to the variability and the lack of predictability of RES, the output planned in the day-ahead market may differ significantly from the actual RES output in real time (Karanfil & Li, 2017). Since the RES forecast error decreases substantially with a shorter prediction horizon, the intraday market allows RES operators to trade these deviations whenever an improved forecast is available (Borggrefe & Neuhoff, 2011). As a consequence, intraday trading is expected to reduce the costs related to the reservation and activation of capacity for balancing purposes. The intraday market is therefore a key aspect towards cost-efficient RES integration and enhanced system security of supply.

Owing to the fact that commitment decisions are taken close to real time, the intraday market is a suitable market floor for the participation of flexible resources (i.e. units able to rapidly increase or decrease their generation/consumption). However, fast-ramping thermal units (e.g. gas power plants) incur a high cost when forced to modify their output, to operate in part load, or to frequently start up and shut down. The increased cost related to the cycling of these units will be reflected in their offers in the intraday market (Pérez Arriaga & Knittel et al., 2016). Alternatively, flexible storage devices (e.g. pumped hydro storage units or batteries) with low cycling and zero fuel cost can offer their flexibility at a comparatively low price, close to the gate closure. Hence, they are expected to play a key role in the intraday market.
In Europe, the intraday markets are organized in two distinct designs, namely auction-based or continuous trading.

In auction-based intraday markets, participants can submit their offers to produce or consume energy at a certain time slot until gate closure. After the gate closure, the submitted offers are used to form the aggregate demand and supply curves. The intersection of the aggregate curves defines the clearing price and quantity (Neuhoff et al., 2016). The clearing rule is uniform pricing, according to which there is only one clearing price at which all transactions occur. Participants are incentivized to bid at their marginal cost since they are paid at the uniform price. This mechanism increases price transparency, although it leads to inefficiencies, since imbalances after the gate closure can no longer be traded (Hagemann, 2015).

In continuous intraday (CID) markets, participants can submit orders to buy or to sell energy at any point during the trading session. The orders are treated according to the first come first served (FCFS) rule. A transaction occurs as soon as the price of a new "Buy" ("Sell") order is equal to or higher (lower) than the price of an existing "Sell" ("Buy") order. Each transaction is settled following the pay-as-bid principle, stating that the transaction price is specified by the older of the two orders present in the order book. Unmatched orders are stored in the order book and are accessible to all market participants. The energy delivery resolution offered by the CID market in Europe ranges between hourly, 30-minute and 15-minute products, and the gate closure takes place between five and 60 minutes before actual delivery. Continuous trading gives market participants the opportunity to trade imbalances as soon as they appear (Hagemann, 2015). However, the FCFS rule is inherently associated with lower allocative efficiency compared to auction rules. This implies that, depending on the time of arrival of the orders, some trades with a positive welfare contribution may not occur while others with a negative welfare contribution may be realised (Henriot, 2014). It is observed that a combination of continuous and auction-based intraday markets can increase the market efficiency in terms of liquidity and market depth, and results in reduced price volatility (Neuhoff et al., 2016).

In practice, the available contracts ("Sell" and "Buy" orders) can be categorized into three types:

• The market order, where no price limit is specified (the order is matched at the best price).

• The limit order, which contains a price limit and can only be matched at that or at a better price.

• The market sweep order, which is executed immediately (fully or partially) or gets cancelled.

Limit orders may appear with restrictions related to their execution and their validity. For instance, an order that carries the specification Fill or Kill should either be fully and immediately executed or cancelled. An order that is specified as All or Nothing remains in the order book until it is entirely executed (Balardy, 2017a).

The European Network Codes, and specifically the capacity allocation and congestion management guidelines (CACM GL) (Meeus & Schittekatte, 2017), suggest that continuous trading should be the main intraday market mechanism. Complementary regional intraday auctions can also be put in place if they are approved by the regulatory authorities (Meeus & Schittekatte, 2017). In that direction, the Cross-Border Intraday (XBID) Initiative (Spot, 2018) has enabled continuous cross-border intraday trading across Europe. Participants of each country have access to orders placed by participants of any other country in the consortium through a centralized order book, provided that there is available cross-border capacity.
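To make the order taxonomy above concrete, the sketch below represents the three contract types and the execution restrictions as a small data structure. The class and field names are our own illustration and do not correspond to the interface of any particular exchange.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OrderType(Enum):
    MARKET = "market"        # no price limit, matched at the best available price
    LIMIT = "limit"          # matched only at the limit price or better
    MARKET_SWEEP = "sweep"   # executed immediately (fully or partially) or cancelled


class Restriction(Enum):
    NONE = "none"
    FILL_OR_KILL = "FoK"     # fully and immediately executed, otherwise cancelled
    ALL_OR_NOTHING = "AoN"   # rests in the book until it can be executed entirely


@dataclass
class Order:
    product: str                     # e.g. "Q1" for the 00:00-00:15 delivery slot
    side: str                        # "Buy" or "Sell"
    volume_mw: float
    price_eur_mwh: Optional[float]   # None for a market order
    order_type: OrderType = OrderType.LIMIT
    restriction: Restriction = Restriction.NONE


# example: a limit order to buy 2.5 MW of product Q1 at 15.9 EUR/MWh
bid = Order(product="Q1", side="Buy", volume_mw=2.5, price_eur_mwh=15.9)
```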
The strategic participation of power producers in short-term electricity markets has been extensively studied in the literature. In order to co-optimise the decisions made in the sequential trading floors from day-ahead to real time, the problem has traditionally been addressed using multi-stage stochastic optimisation. Each decision stage corresponds to a trading floor (i.e. day-ahead, capacity markets, real-time), where the final decisions take into account uncertainty using stochastic processes. In particular, the influence that the producer may have on the market price formation leads to the distinction between "price-maker" and "price-taker" and results in a different modelling of the uncertainty.

In (Baillo et al., 2004), the optimisation of a portfolio of generating assets over three trading floors (i.e. the day-ahead, the adjustment and the reserves market) is proposed, where the producer is assumed to be a "price-maker". The offering strategy of the producer is a result of the stochastic residual demand curve as well as the behaviour of the rest of the market players. On the contrary, a "price-taker" producer is considered in (Plazas et al., 2005) for the first two stages of the problem studied, namely the day-ahead and the automatic generation control (AGC) market. However, since the third-stage (balancing market) traded volumes are small, the producer can negatively affect the prices with its participation. Price scenarios are generated using ARIMA models for the two first stages, whereas for the third stage a linear curve with negative slope is used to represent the influence of the producer's offered capacity on the market price.

Hydro-power plant participation in short-term markets, accounting for the technical constraints and several reservoir levels, is formulated and solved in (Fleten & Kristoffersen, 2007). Optimal bidding curves for the participation of a "price-taker" hydro-power producer in the Nordic spot market are derived accounting for price uncertainty. In (Boomsma et al., 2014), the bidding strategy of a two-level reservoir plant is cast as a multi-stage stochastic program in order to represent the different sequential trading floors, namely the day-ahead spot market and the hour-ahead balancing market. The effects of coordinated bidding and of the "price-maker" versus "price-taker" assumptions on the generated profits are evaluated. In (Pandžić et al., 2013), bidding strategies for a virtual power plant (VPP) buying and selling energy in the day-ahead and the balancing market are investigated in the form of a multi-stage stochastic optimisation. The VPP aggregates a pumped hydro energy storage (PHES) unit as well as a conventional generator with stochastic intermittent power production and consumption. The goal of the VPP operator is the maximization of the expected profits under price uncertainty.

In these approaches, the intraday market is considered as auction-based and is modelled as a single recourse action. For each trading period, the optimal offered quantity is derived according to the realization of various stochastic variables. However, in reality, for most European countries, according to the EU Network Codes (Meeus & Schittekatte, 2017), modern intraday market trading will primarily be a continuous process.

The strategic participation in the CID market is investigated for the case of an RES producer in (Henriot, 2014) and (Garnier & Madlener, 2015). In both works, the problem is formulated as a sequential decision-making process, where the operator adjusts its offers during the trading horizon according to the RES forecast updates for the physical delivery of power. Additionally, in (Gönsch & Hassler, 2016) the use of a PHES unit is proposed to undertake energy arbitrage and to offset potential deviations. The trading process is formulated as a Markov Decision Process (MDP) where the future commitment decision in the market is based on the stochastic realization of the intraday price, the imbalance penalty, the RES production and the storage availability. The volatility of the CID prices, along with the quality of the forecast updates, are found to be key factors that influence the degree of activity and success of the deployed bidding strategies (Henriot, 2014). Therefore, the CID prices and the forecast errors are considered as correlated stochastic processes in (Garnier & Madlener, 2015). Alternatively, in (Henriot, 2014), the CID price is constructed as a linear function of the offered quantity with an increasing slope as the gate closure approaches. In this way, the scarcity of conventional units approaching real time is reflected.
In (Gönsch & Hassler, 2016), real weather data and market data are used to simulate the forecast error and CID price processes.

For the sequential decision-making problem in the CID market, the offered quantity of energy is the decision variable to be optimised (Garnier & Madlener, 2015). The optimisation is carried out using Approximate Dynamic Programming (ADP) methods, where a parameterised policy is obtained based on the observed stochastic processes for the price, the RES error and the level of the reservoir (Gönsch & Hassler, 2016). The ADP approach presented in (Gönsch & Hassler, 2016) is compared in (Hassler, 2017) to some threshold-based heuristic decision rules. The parameters are updated according to simulation-based experience and the obtained performance is comparable to the ADP algorithm. The obtained decision rules are intuitively interpretable and are derived efficiently through simulation-based optimisation.

The bidding strategy deployed by a storage device operator participating in a slightly different real-time market organized by NYISO is presented in (Jiang & Powell, 2014). In this market, the commitment decision is taken one hour ahead of real time and the settlements occur intra-hour every five minutes. In this setting, the storage operator selects two price thresholds at which the intra-hour settlements occur. The problem is formulated as an MDP and is solved using an ADP algorithm that exploits a particular monotonicity property. A distribution-free variant that assumes no knowledge of the price distribution is proposed. The optimal policy is trained using historical real-time price data.

Even though the focus of the mentioned articles lies on the CID market, the trading decisions are considered to take place in discrete time-steps. A different approach is presented in (Aïd et al., 2016), where the CID market participation is modelled as a continuous-time process using stochastic differential equations (SDE). The Hamilton-Jacobi-Bellman (HJB) equation is used for the determination of the optimal trading strategy. The goal is the minimization of the imbalance cost faced by a power producer arising from the residual error between the RES production and demand. The optimal trading rate is derived assuming a stochastic process for the market price using real market data and the residual error.

In the approaches presented so far, the CID price is modelled as a stochastic process assuming that the participating agent is a "price-taker". However, in the CID market, this assumption implies that the market is liquid and that the prices at which one can buy or sell energy at a given time are similar or the same. This assumption does not always hold, since the mean bid-ask spread in a trading session in the German intraday market for 2015 was several hundred times larger than the tick-size (i.e. the minimum price movement of a trading instrument) (Balardy, 2017b). It is also reported in the same study that the spread decreases as trading approaches the gate closure.

An approach that explicitly considers the order book is presented in (Bertrand & Papavasiliou, 2019). A threshold-based policy is used to optimise the bid acceptance for storage units participating in the CID market. A collection of different factors, such as the time of the day, are used for the adaptation of the price thresholds. The threshold policy is trained using a policy gradient method (REINFORCE) and the results show improved performance against the "rolling intrinsic" benchmark. In this paper, we present a methodology that serves as a wrapper around the existing industrial state of the art. We provide a generic modelling framework for the problem and we elaborate on all the assumptions that allow the formulation of the problem as an MDP. We solve the resulting problem using a value function approximation method.
In this paper, we focus on the sequential decision-making problem of a storage device operator participating in the CID market. Firstly, we present a novel modelling framework for the CID market, where the trading agents exchange energy via a centralized order book. Each trading agent is assumed to dynamically select the orders that maximize its benefits throughout the trading horizon. In contrast to the existing literature, all the available orders are considered explicitly with the intention to uncover more information at each trading decision. The liquidity of the market and the influence of each agent on the price are directly reflected in the observed order book, and the developed strategies can adapt accordingly. Secondly, we model explicitly the dynamics of the storage system. Finally, the resulting problem is cast as an MDP.

The intraday trading problem of a storage device is solved using deep reinforcement learning techniques, specifically an asynchronous distributed variant of the fitted Q iteration RL algorithm with deep neural networks as function approximators (Ernst et al., 2005). Due to the high dimensionality and the dynamically evolving size of the order book, we motivate the use of high-level actions and a constant-size, low-dimensional state representation. The agent can select between trading and idling. The goal of the selected actions is the identification of the opportunity cost of trading given the state of the order book observed and the maximization of the total return over the trading horizon. The resulting optimal policy is evaluated using real data from the German ID market (EPEXSPOT, 2017). In summary, the contributions of this work are the following:

• We model the CID market trading process as an MDP where the energy exchanges occur explicitly through a centralized order book. The trading policy is adapted to the state of the order book.

• The operational constraints of the storage device are considered explicitly.

• A state space reduction is proposed in order to deal with the size of the order book.

• A novel representation of high-level actions is used to identify the opportunity cost between trading and idling.

• The fitted Q iteration algorithm is used to find a time-variant policy that maximizes the total cumulative profits collected.

• Artificial trajectories are produced using historical data from the German CID market in order to address exploration issues.
The rest of the paper is organized as follows. In Section 2, the CID market trading framework is presented. The interaction of the trading agents via a centralized order book is formulated as a dynamic process. All the available information for an asset-trading agent is detailed and the objective is defined as the cumulative profits. In Section 3, all the assumptions necessary to formulate the bidding process in the CID market as an MDP are listed. The methodology utilised to find an optimal policy that maximizes the cumulative profits of the proposed MDP is detailed in Section 4. A case study using real data from the German CID market is performed in Section 5. The results, as well as considerations about limitations of the developed methodology, are discussed in Section 6. Finally, conclusions of this work are drawn and future recommendations are provided in Section 7. A detailed nomenclature is provided in Appendix A.
2. Continuous Intraday Bidding process
The participation in the CID market is a continuous process similar to the stock exchange. Each market product $x \in X$, where $X$ is the set of all available products, is defined as the physical delivery of energy in a pre-defined time slot. The time slot corresponding to product $x$ is defined by its starting point $t_{delivery}(x)$ and its duration $\lambda(x)$. The trading process for time slot $x$ opens at $t_{open}(x)$ and closes at $t_{close}(x)$.

Figure 1. Trading (continuous and discrete) and delivery timelines for products $Q_1$ to $Q_4$.
During the time interval $t \in [t_{open}(x), t_{close}(x)]$, a participant can exchange energy with other participants for the lagged physical delivery during the interval $\delta(x)$, with:

$\delta(x) = [t_{delivery}(x), t_{delivery}(x) + \lambda(x)]$.

The exchange of energy takes place through a centralized order book that contains all the unmatched orders $o_j$, where $j \in N_t$ corresponds to a unique index that every order receives upon arrival. The set $N_t \subseteq \mathbb{N}$ gathers all the unique indices of the orders available at time $t$. We denote the status of the order book at time $t$ by $O_t = (o_j, \forall j \in N_t)$. As time progresses, new orders appear and existing ones are either accepted or cancelled.

Trading for a set of products is considered to start at the gate opening of the first product and to finish at the gate closure of the last product. More formally, considering an ordered set of available products $X = \{Q_1, \dots, Q_{|X|}\}$, the corresponding trading horizon is defined as $T = [t_{open}(Q_1), t_{close}(Q_{|X|})]$. For instance, in the German CID market, trading of hourly (quarterly) products for day $D$ opens at 3 pm (4 pm) of day $D-1$. For each product $x$, the gate closes 30 minutes before the actual energy delivery at $t_{delivery}(x)$. The timeline for trading products $Q_1$ to $Q_4$, which correspond to the physical delivery in 15-minute time slots from 00:00 until 01:00, is presented in Figure 1. It can be observed that the agent can trade for all products until 23:30. After each subsequent gate closure the number of available products decreases and the commitment for the corresponding time slot is defined. Potential deviations during the physical delivery of energy are penalized in the imbalance market.

As its name indicates, the CID market is a continuous environment. In order to solve the trading problem presented in this paper, it has been decided to perform a relevant discretisation operation. As shown in Figure 1, the trading timeline is discretised into a high number of time-steps of constant duration $\Delta t$. Each discretised trading interval for product $x$ can be denoted by the set of time-steps $T(x) = \{t_{open}(x), t_{open}(x) + \Delta t, ..., t_{close}(x) - \Delta t, t_{close}(x)\}$. Then, the discrete-time trading opportunities for the entire set of products $X$ can be modelled such that the time-steps are defined as $t \in T = \bigcup_{x \in X} T(x)$. In the following, for the sake of clarity, the increment (decrement) operation $t+$ ($t-$) refers to time-step $t + \Delta t$ ($t - \Delta t$).

It is important to note that, in theory, the discretisation operation leads to suboptimalities in the decision-making process. However, these vanish as the discretisation becomes finer ($\Delta t \to 0$). Let $X_t$ denote the set of available products at time-step $t \in T$, such that:

$X_t = \{x \,|\, x \in X, \, t \leq t_{close}(x)\}$.

We define the state of the CID market environment at time-step $t$ as $s^{OB}_t = O_t \in S^{OB}$. The state contains the observation of the order book at time-step $t \in T$, i.e. the unmatched orders for all the available products $x \in X_t \subset X$.

A set of $n$ agents $I = \{1, 2, ..., n\}$ are continuously interacting in the CID environment exchanging energy.
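As an illustration of the product definitions and the discretisation above, the following sketch builds the trading steps $T(x)$ and the set of tradeable products $X_t$ for four quarterly products. The concrete dates and the value of $\Delta t$ are placeholders chosen for the example only.

```python
from datetime import datetime, timedelta

DT = timedelta(minutes=5)  # trading-step duration (placeholder value)


def trading_steps(t_open, t_close, dt=DT):
    """Discretised trading interval T(x) = {t_open, t_open + dt, ..., t_close}."""
    steps, t = [], t_open
    while t <= t_close:
        steps.append(t)
        t += dt
    return steps


# four quarterly products Q1..Q4 delivered between 00:00 and 01:00 on day D
day_d = datetime(2017, 1, 2)
products = {}
for k in range(4):
    t_delivery = day_d + timedelta(minutes=15 * k)
    products[f"Q{k + 1}"] = {
        "t_open": day_d - timedelta(hours=8),           # 4 pm of day D-1 for quarterly products
        "t_close": t_delivery - timedelta(minutes=30),  # gate closure 30 minutes before delivery
        "t_delivery": t_delivery,
        "duration": timedelta(minutes=15),
    }


def available_products(t):
    """X_t: products that can still be traded at time-step t."""
    return [x for x, p in products.items() if t <= p["t_close"]]


q1 = products["Q1"]
print(len(trading_steps(q1["t_open"], q1["t_close"])))                # 91 trading steps for Q1
print(available_products(day_d - timedelta(minutes=20)))              # ['Q2', 'Q3', 'Q4']
```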
Each agent $i \in I$ can express its willingness to buy or sell energy by posting at instant $t$ a set of new orders $a_{i,t} \in A_i$ in the order book, which results in the joint action $a_t = (a_{1,t}, ..., a_{n,t}) \in \prod_{i=1}^{n} A_i$. The process of designing the set of new orders $a_{i,t}$ for agent $i$ at instant $t$ consists, for each new order, in determining the product $x \in X_t$, the side of the order $y \in \{"Sell", "Buy"\}$, the volume $v \in \mathbb{R}_+$, the price level $p \in [p_{min}, p_{max}]$ of each unit offered to be produced or consumed, and the various validity and execution specifications $e \in E$. The index of each new order $j$ belongs to the set $N'_t$. The set of new orders is defined as $a_{i,t} = ((x_j, y_j, v_j, p_j, e_j), \forall j \in N'_t \subseteq \mathbb{N})$. We will use the notation $a_t = (a_{i,t}, a_{-i,t})$ for the joint action, to refer to the action $a_{i,t}$ that agent $i$ selects and the joint action $a_{-i,t} = (a_{1,t}, ..., a_{i-1,t}, a_{i+1,t}, ..., a_{n,t})$ that all other agents use.
Table 1. Order book for product $Q_1$ and time slot 00:00-00:15.

 i   Side     v [MW]   p [€/MWh]
     "Sell"                          ← ask
 1   "Buy"    3.15     33.8          ← bid
 3   "Buy"    1.125    29.35
     "Buy"    2.5      15.9

The orders are treated according to the first come first served (FCFS) rule. Table 1 presents an observation of the order book for product $Q_1$. The difference between the most expensive "Buy" order ("bid") and the cheapest "Sell" order ("ask") defines the bid-ask spread of the product. A deal between two counter-parties is struck when the price $p^{buy}$ of a "Buy" order and the price $p^{sell}$ of a "Sell" order satisfy the condition $p^{buy} \geq p^{sell}$. This condition is tested at the arrival of each new order. The volume of the transaction is defined as the minimum quantity between the "Buy" and "Sell" order ($\min(v^{buy}, v^{sell})$). The residual volume remains available in the market at the same price. As mentioned in the previous section, each transaction is settled following the pay-as-bid principle, at the price indicated by the oldest order.

Finally, at each time-step $t$, every agent $i$ observes the state of the order book $s^{OB}_t$ and performs certain actions (posting a set of new orders) $a_{i,t}$, inducing a transition which can be represented by the following equation:

$s^{OB}_{t+} = f(s^{OB}_t, a_{i,t}, a_{-i,t})$.  (1)

An asset-optimizing agent participating in the CID market can adjust its position for product $x$ until the corresponding gate closure $t_{close}(x)$. However, the physical delivery of power is decided at $t_{delivery}(x)$. An additional amount of information (potentially valuable for certain players) is received during the period $\{t_{close}(x), ..., t_{delivery}(x)\}$, from the gate closure until the delivery of power. Based on this updated information, an asset-trading agent may need to or have an incentive to deviate from the net contracted power in the market.

Let $v^{con}_{i,t} = (v^{con}_{i,t}(x), \forall x \in X_t) \in \mathbb{R}^{|X_t|}$ gather the volumes of power contracted by agent $i$ for the available products $x \in X_t$ at each time-step $t \in T$. In the following, we will adopt the convention that $v^{con}_{i,t}(x)$ is positive when agent $i$ contracts the net volume to sell (produce) and negative when the agent contracts the volume to buy (consume) energy for product $x$ at time-step $t$.

Following each market transition as indicated by equation (1), the contracted volumes $v^{con}_{i,t}$ are determined based on the transactions that have occurred. The contracted volumes $v^{con}_{i,t}$ are derived according to the FCFS rule that is detailed in (EPEXSPOT, 2019). The mathematical formulation of the clearing algorithm is provided in (Le et al., 2019). The objective function of the clearing algorithm is comprised of two terms, namely the social welfare and a penalty term modelling the price-time priority rule. The orders that maximize this objective are matched, provided that they satisfy the balancing equations and constraints related to their specifications. The clearing rule is implicitly given by:

$v^{con}_{i,t} = clear(i, s^{OB}_t, a_{i,t}, a_{-i,t})$.  (2)

We denote as $P^{mar}_{i,t}(x) \in \mathbb{R}$ the net contracted power in the market by agent $i$ for each product $x \in X$, which is updated at every time-step $t \in T$ according to:

$P^{mar}_{i,t+}(x) = P^{mar}_{i,t}(x) + v^{con}_{i,t}(x), \quad \forall x \in X_t$.  (3)
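A minimal sketch of the matching mechanics described above, for a single product: a new order is matched against the resting orders under the FCFS rule and settled pay-as-bid at the price of the older, resting order. It ignores execution restrictions and the price-time priority refinement of the actual clearing algorithm of (EPEXSPOT, 2019), so it is an illustration rather than the exchange's clearing rule.

```python
def match_new_order(book, new):
    """Match a new order against the resting orders of one product (FCFS, pay-as-bid).

    `book` is a list of dicts sorted by arrival time, each with keys 'side',
    'volume', 'price'; `new` has the same keys. Returns the transactions and the
    updated book. Restrictions, cancellations and multiple products are omitted.
    """
    trades = []
    remaining = new["volume"]
    for resting in list(book):
        if remaining <= 0:
            break
        crosses = (new["side"] == "Buy" and resting["side"] == "Sell"
                   and new["price"] >= resting["price"]) or \
                  (new["side"] == "Sell" and resting["side"] == "Buy"
                   and new["price"] <= resting["price"])
        if not crosses:
            continue
        qty = min(remaining, resting["volume"])
        trades.append({"volume": qty, "price": resting["price"]})  # older order sets the price
        remaining -= qty
        resting["volume"] -= qty
        if resting["volume"] == 0:
            book.remove(resting)
    if remaining > 0:                        # residual volume rests in the book
        book.append({**new, "volume": remaining})
    return trades, book


book = [{"side": "Sell", "volume": 2.0, "price": 35.0},
        {"side": "Buy", "volume": 3.15, "price": 33.8}]
trades, book = match_new_order(book, {"side": "Buy", "volume": 3.0, "price": 36.0})
print(trades)  # [{'volume': 2.0, 'price': 35.0}]; the remaining 1.0 MW rests as a bid at 36.0
```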
The discretisation of the delivery timeline $\bar{T}$ is done with time-steps of duration $\Delta\tau$, equal to the minimum duration of delivery for the products considered. The discrete delivery timeline $\bar{T}$ is considered to start at the beginning of delivery of the first product, $\tau_{init}$, and to finish at the end of the delivery of the last product, $\tau_{term}$. For the simple case where only four quarterly products are considered, as shown in Figure 1, the delivery time-step is $\Delta\tau = 15$ min and the delivery timeline is $\bar{T} = \{00{:}00, 00{:}15, ..., 01{:}00\}$, where $\tau_{init} = 00{:}00$ and $\tau_{term} = 01{:}00$.
In general, when only one type of product is considered (e.g. quarterly), there is a straightforward relation between the time of delivery $\tau$ and the product $x$, since $\tau = t_{delivery}(x)$ and $\Delta\tau = \lambda(x)$. Thus, the terms $x$ or $\tau$ can be used interchangeably. For the sake of keeping the notation relatively simple, we will only consider quarterly products in the rest of the paper. In such a context, the terms $P^{mar}_{i,t}(\tau)$ or $P^{mar}_{i,t}(x)$ can be used interchangeably to denote the net contracted power in the market by agent $i$ at trading step $t$ for delivery time-step $\tau$ (product $x$).

As the trading process evolves, the set of delivery time-steps $\tau$ for which the asset-optimizing agent can make decisions decreases, as trading time $t$ crosses the delivery time $\tau$. Let $\bar{T}(t) \subseteq \bar{T}$ be a function that yields the subset of delivery time-steps $\tau \in \bar{T}$ that follow time-step $t \in T$, such that:

$\bar{T}(t) = \{\tau \,|\, \tau \in \bar{T} \setminus \{\tau_{term}\}, \, t \leq \tau\}$.

The participation of an asset-optimizing agent in the CID market is composed of two coupled decision processes with different timescales. First, the trading process, where a decision is taken at each time-step $t$ about the energy contracted until the gate closure $t_{close}(x)$. During this process, the agent can decide about its position in the market and create scenarios/make projections about the actual delivery plan based on its position. Second, the physical delivery decision that is taken at the time of the delivery $\tau$ or $t_{delivery}(x)$, based on the total net contracted power in the market during the trading process.

An agent $i$ participating in the CID market is assumed to monitor the state of the order book $s^{OB}_t$ and its net contracted power in the market $P^{mar}_{i,t}(x)$ for each product $x \in X$, which becomes fixed once the gate closure occurs at $t_{close}(x)$. Depending on the role it presumes in the market, an asset-optimizing agent is assumed to monitor all the available information about its assets. We distinguish the three following cases among the many different roles that can be played by an agent in the CID market:

• The agent controls a physical asset that can generate and/or consume electricity. We define as $G_{i,t}(\tau) \in [\underline{G}_i, \bar{G}_i]$ the power production level for agent $i$ at delivery time-step $\tau$ as computed at trading step $t$. In a similar way, we define the power consumption level $C_{i,t}(\tau) \in [\underline{C}_i, \bar{C}_i]$, where $\underline{C}_i, \bar{C}_i, \underline{G}_i, \bar{G}_i \in \mathbb{R}_+$. We further assume that the actual production $g_{i,t}(t')$ and consumption level $c_{i,t}(t')$ during the time-period of delivery $t' \in [\tau, \tau + \Delta\tau)$ are constant for each product $x$, such that:

$g_{i,t}(t') = G_{i,t}(\tau)$,  (4)
$c_{i,t}(t') = C_{i,t}(\tau)$,  (5)
$\forall t' \in [\tau, \tau + \Delta\tau)$.

At each time-step $t$ during the trading process, agent $i$ can decide to adjust its generation level by $\Delta G_{i,t}(\tau)$ or its consumption level by $\Delta C_{i,t}(\tau)$. According to these adjustments, the generation and consumption levels can be updated at each time-step $t$ according to:

$G_{i,t+}(\tau) = G_{i,t}(\tau) + \Delta G_{i,t}(\tau)$,  (6)
$C_{i,t+}(\tau) = C_{i,t}(\tau) + \Delta C_{i,t}(\tau)$,  (7)
$\forall \tau \in \bar{T}(t)$.

Let $w^{exog}_{i,t}$ denote any other relevant exogenous information to agent $i$, such as the RES forecast, a forecast of the actions of other agents, or the imbalance prices. The computation of $\Delta G_{i,t}(\cdot)$ and $\Delta C_{i,t}(\cdot)$ depends on the market position, the technical limits of the assets, the state of the order book and the exogenous information $w^{exog}_{i,t}$.
We define the residual production $P^{res}_{i,t}(\tau) \in \mathbb{R}$ at delivery time-step $\tau$ as the difference between the production and the consumption levels, which can be computed by:

$P^{res}_{i,t}(\tau) = G_{i,t}(\tau) - C_{i,t}(\tau)$.  (8)

We note that the amount of residual production $P^{res}_{i,t}(\tau)$ aggregates the combined effects that $G_{i,t}(\tau)$ and $C_{i,t}(\tau)$ have on the revenues made by agent $i$ through interacting with the markets (intraday/imbalance).

The level of generation and consumption for a market period $\tau$ can be adjusted at any time-step $t$ before the physical delivery $\tau$, but it becomes binding when $t = \tau$. We denote as $\Delta_{i,t}(\tau)$ the deviation from the market position for each time-step $\tau$, as scheduled at time $t$, after having computed the variables $G_{i,t}(\tau)$ and $C_{i,t}(\tau)$, as follows:

$P^{mar}_{i,t}(\tau) + \Delta_{i,t}(\tau) = P^{res}_{i,t}(\tau), \quad \forall \tau \in \bar{T}(t)$.  (9)

The term $\Delta_{i,t}(\tau)$ represents the imbalance for market period $\tau$ as estimated at time $t$. This imbalance may evolve up to time $t = \tau$. We denote by $\Delta_i(\tau) = \Delta_{i,t=\tau}(\tau)$ the final imbalance for market period $\tau$.

The power balance of equation (9) written for time-step $t+$ is:

$P^{mar}_{i,t+}(\tau) + \Delta_{i,t+}(\tau) = G_{i,t+}(\tau) - C_{i,t+}(\tau), \quad \forall \tau \in \bar{T}(t+)$.  (10)
It can be observed that by substituting equations (3), (6) and (7) in equation (10) we have:

$P^{mar}_{i,t}(\tau) + v^{con}_{i,t}(\tau) + \Delta_{i,t+}(\tau) = G_{i,t}(\tau) + \Delta G_{i,t}(\tau) - (C_{i,t}(\tau) + \Delta C_{i,t}(\tau)), \quad \forall \tau \in \bar{T}(t)$.  (11)

The combination of equations (8) and (9) with equation (11) yields the update of the imbalance vector according to:

$\Delta_{i,t+}(\tau) = \Delta_{i,t}(\tau) + \Delta G_{i,t}(\tau) - \Delta C_{i,t}(\tau) - v^{con}_{i,t}(\tau), \quad \forall \tau \in \bar{T}(t)$.  (12)

• The agent does not own any physical asset (market maker). This is equivalent to the first case with $\underline{C}_i = \bar{C}_i = \underline{G}_i = \bar{G}_i = 0$. The net imbalance $\Delta_{i,t}(\tau)$ is updated at every time-step $t \in T$ according to:

$P^{mar}_{i,t}(\tau) + \Delta_{i,t}(\tau) = 0, \quad \forall \tau \in \bar{T}(t)$.  (13)

• The agent controls a storage device that can produce, store and consume energy. We can consider an agent controlling a storage device as an agent that controls generation and consumption assets with specific constraints on the generation and the consumption level related to the nature of the storage device. Following this argument, let $G_{i,t}(\tau)$ ($C_{i,t}(\tau)$) refer to the level of discharging (charging) of the storage device for delivery time-step $\tau$, updated at time $t$. Obviously, if $G_{i,t}(\tau) > 0$ then $C_{i,t}(\tau) = 0$ (and if $C_{i,t}(\tau) > 0$ then $G_{i,t}(\tau) = 0$), since a battery cannot charge and discharge energy at the same time. In this case, agent $i$ can decide to adjust its discharging (charging) level by $\Delta G_{i,t}(\tau)$ ($\Delta C_{i,t}(\tau)$). Let $SoC_{i,t}(\tau)$ denote the state of charge of the storage unit at delivery time-step $\tau \in \bar{T}$ as it is computed at time-step $t$, where $SoC_{i,t}(\tau) \in [\underline{SoC}_i, \bar{SoC}_i]$. The evolution of the state of charge during the delivery timeline can be updated at decision time-step $t$ as:

$SoC_{i,t}(\tau + \Delta\tau) = SoC_{i,t}(\tau) + \Delta\tau \cdot \left(\eta \, C_{i,t}(\tau) - \frac{G_{i,t}(\tau)}{\eta}\right), \quad \forall \tau \in \bar{T}(t)$.  (14)

Parameter $\eta$ represents the charging and discharging efficiencies of the storage unit which, for simplicity, we assume to be equal. We note that for batteries, charging and discharging efficiencies may be different and depend on the charging/discharging speeds. As can be observed from equation (14), time-coupling constraints are imposed on $C_{i,t}(\tau)$ and $G_{i,t}(\tau)$ in order to ensure that the amount of energy that can be discharged during some period already exists in the storage device. Additionally, constraints associated with the maximum charging power $\bar{C}_i$ and discharging power $\bar{G}_i$, as well as the maximum and minimum energy level ($\underline{SoC}_i$, $\bar{SoC}_i$), are considered in order to model the operation of the storage device.

Equation (14) can be written for time-step $t+$ as:

$SoC_{i,t+}(\tau + \Delta\tau) = SoC_{i,t+}(\tau) + \Delta\tau \cdot \left(\eta \, C_{i,t+}(\tau) - \frac{G_{i,t+}(\tau)}{\eta}\right), \quad \forall \tau \in \bar{T}(t+)$.  (15)

Combining equations (14) and (15), we can derive the updated vector of the state of charge at time-step $t+$ (as a function of the adjustments $\Delta G_{i,t}(\tau)$ and $\Delta C_{i,t}(\tau)$) as:

$SoC_{i,t+}(\tau + \Delta\tau) - SoC_{i,t+}(\tau) = SoC_{i,t}(\tau + \Delta\tau) - SoC_{i,t}(\tau) + \Delta\tau \cdot \left(\eta \, \Delta C_{i,t}(\tau) - \frac{\Delta G_{i,t}(\tau)}{\eta}\right), \quad \forall \tau \in \bar{T}(t)$.  (16)
The state of charge $SoC_{i,t}(\tau)$ at delivery time-step $\tau$ can be updated until $t = \tau$. Let us also observe that there is a bijection between $P^{res}_{i,t}(\tau)$ and the terms $C_{i,t}(\tau)$ and $G_{i,t}(\tau)$ or, in other words, determining $P^{res}_{i,t}$ is equivalent to determining $C_{i,t}(\tau)$ and $G_{i,t}(\tau)$ and vice versa.

All the information available at each time-step $t$ to an asset-optimizing agent $i$ (controlling a storage device), including the deviation from the committed schedule $\Delta_{i,t}(\tau)$ at each delivery time-step $\tau$, is gathered in the variable:

$s_{i,t} = (s^{OB}_t, (P^{mar}_{i,t}(\tau), \Delta_{i,t}(\tau), G_{i,t}(\tau), C_{i,t}(\tau), SoC_{i,t}(\tau), \forall \tau \in \bar{T}), w^{exog}_{i,t}) \in S_i$.

The control action applied by an asset-optimizing agent $i$ trading in the CID market at time-step $t$ consists of posting new orders in the CID market and adjusting its production/consumption level or, equivalently, its charging/discharging level in the case of the storage device.
The control actions can be summarised in the variable $u_{i,t} = (a_{i,t}, (\Delta C_{i,t}(\tau), \Delta G_{i,t}(\tau), \forall \tau \in \bar{T}))$.

In this paper, we consider that the trading agent adopts a simple strategy for determining, at each time-step $t$, the variables $\Delta C_{i,t}(\tau)$, $\Delta G_{i,t}(\tau)$ once the trading actions $a_{i,t}$ have been selected. In this case, the decision regarding the trading actions $a_{i,t}$ fully defines the action $u_{i,t}$ and thus the notation $u_{i,t}$ will not be further used. This strategy will be referred to in the rest of the paper as the "default" strategy for managing the storage device. According to this strategy, the agent aims at minimizing any imbalances ($\Delta_{i,t+}(\tau)$) and therefore we use the following decision rule:

$(\Delta C_{i,t}(\tau), \Delta G_{i,t}(\tau), \forall \tau \in \bar{T}) = \arg\min \sum_{\tau \in \bar{T}} |\Delta_{i,t+}(\tau)|$, s.t. (2), (3), (8), (9), (12), (14).  (17)

One can easily see from equation (11) that this decision rule is equivalent to imposing $P^{res}_{i,t+}(\tau)$ as close as possible to $P^{mar}_{i,t+}(\tau)$, given the operational constraints of the device. We will elaborate later in this paper on the fact that adopting such a strategy is not suboptimal in a context where the agent needs to be balanced for every market period while being an aggressor in the CID market.

For the sake of simplicity, we assume that the decision process of an asset-optimizing agent terminates at the gate closure $t_{close}(x)$ along with the trading process. Thus, the final residual production $P^{res}_i(\tau)$ for delivery time-step $\tau$ is given by $P^{res}_i(\tau) = P^{res}_{i,t=t_{close}(x)}(\tau)$. Similarly, the final imbalance is provided by $\Delta_i(\tau) = \Delta_{i,t=t_{close}(x)}(\tau)$.

Although this approach can be used for the optimisation of a portfolio of assets, in this paper the focus lies on the case where the agent is operating a storage device. We note that this case is particularly interesting in the context of the energy transition, where storage devices are expected to play a key role in the energy market.

The instantaneous reward signal collected after each transition for agent $i$ is given by:

$r_{i,t} = R_i(t, s_{i,t}, a_{i,t}, a_{-i,t})$,  (18)

where $R_i : T \times S_i \times A_1 \times ... \times A_n \to \mathbb{R}$.

The reward function $R_i$ is composed of the following terms:

i. The trading revenues obtained from the matching process of orders at time-step $t$, given by $\rho$, where $\rho$ is a stationary function $\rho : S^{OB} \times A_1 \times ... \times A_n \to \mathbb{R}$;

ii. The imbalance penalty for the deviation $\Delta_i(\tau)$ from the market position for delivery time-step $\tau$ at the imbalance price $I(\tau)$. The imbalance settlement process for product $x \in X$ (delivery time-step $\tau$) takes place at the end of the physical delivery $t_{settle}(x)$ (i.e. at $\tau + \Delta\tau$), as presented in Figure 1. We define the imbalance settlement timeline $T^{Imb}$ as $T^{Imb} = \{\tau + \Delta\tau, \forall \tau \in \bar{T}\}$. The imbalance penalty is only applied when the time instance $t$ is an element of the imbalance settlement timeline.

The function $R_i$ is defined as:

$R_i(t, s_{i,t}, a_{i,t}, a_{-i,t}) = \rho(s^{OB}_t, a_{i,t}, a_{-i,t}) + \begin{cases} \Delta_i(\tau) \cdot I(\tau), & \text{if } t \in T^{Imb}, \\ 0, & \text{otherwise}. \end{cases}$  (19)

All the relevant information that summarises the past and that can be used to optimise the market participation is assumed to be contained in the history vector $h_{i,t} = (s_{i,0}, a_{i,0}, r_{i,0}, ..., s_{i,t-1}, a_{i,t-1}, r_{i,t-1}, s_{i,t}) \in H_i$.
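As an illustration of equation (19), the following sketch accumulates the trading revenue at every step and adds the imbalance term only at the settlement steps in $T^{Imb}$. The numbers and the sign convention of the imbalance term are purely illustrative.

```python
def step_reward(t, trade_revenue, settle_steps, imbalance, imbalance_price):
    """Equation (19): trading revenue plus, at settlement steps, the imbalance term."""
    reward = trade_revenue
    if t in settle_steps:
        reward += imbalance * imbalance_price
    return reward


# toy trajectory: revenues from three trades, settlement (with zero imbalance) at the last step
rewards = [step_reward(t, rev, settle_steps={3}, imbalance=0.0, imbalance_price=45.0)
           for t, rev in enumerate([12.5, -3.0, 7.25, 0.0])]
print(sum(rewards))  # 16.75, the undiscounted return of equation (20)
```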
Trading agent $i$ is assumed to select its actions following a non-anticipative history-dependent policy $\pi_i(h_{i,t}) \in \Pi$ from the set of all admissible policies $\Pi$, according to:

$a_{i,t} \sim \pi_i(\cdot \,|\, h_{i,t})$.

The return collected by agent $i$ in a single trajectory $\zeta = (s_{i,0}, a_{i,0}, ..., a_{i,K-1}, s_{i,K})$ starting from $s_{i,0} = s_i \in S_i$, which is the sum of the rewards accumulated over this trajectory, is given by:

$G_\zeta(s_i) = \sum_{t=0}^{K-1} R_i(t, s_{i,t}, a_{i,t}, a_{-i,t}) \,|\, s_{i,0} = s_i$.  (20)

The expected sum of rewards collected by agent $i$, where each agent is following a randomized policy $\pi_i \in \Pi$, is consequently given by:

$V^{\pi_i}(s_i) = \mathbb{E}_{a_{i,t} \sim \pi_i, \, a_{-i,t} \sim \pi_{-i}} \left\{ \sum_{t=0}^{K-1} R_i(t, s_{i,t}, a_{i,t}, a_{-i,t}) \,|\, s_{i,0} = s_i \right\}$.  (21)

The goal of the trading agent $i$ is to identify an optimal policy $\pi^*_i \in \Pi$ that maximizes the expected sum of rewards collected along a trajectory. An optimal policy is obtained by:

$\pi^*_i = \arg\max_{\pi_i \in \Pi} V^{\pi_i}(s_i)$.  (22)

The imbalance price $I(\tau)$ is defined by a process that depends on a plethora of factors, among which is the net system imbalance during delivery period $\tau$, defined by the imbalance volumes of all the market players ($\sum_{i \in I} \Delta_i(\tau)$). For the sake of simplicity, we will assume that it is randomly sampled from a known distribution over prices that is not conditioned on any variable.
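Before moving to the reinforcement learning formulation, the storage-side bookkeeping can be illustrated with a minimal sketch of the "default" strategy of equation (17) for a single delivery period: once a market position has been contracted, charging or discharging is set so that the residual production tracks that position within the power and state-of-charge limits of equation (14). Variable names and numbers are our own illustration.

```python
def default_strategy(p_mar, soc, g_max, c_max, soc_max, dtau=0.25, eta=0.9):
    """Set discharging G and charging C so that G - C tracks the market position
    p_mar (MW, positive = net sell), subject to the power limits and the
    state-of-charge update SoC' = SoC + dtau * (eta * C - G / eta) of equation (14).
    Returns (G, C, new_soc, imbalance)."""
    if p_mar >= 0:
        # discharge: limited by the power rating and by the energy left in the storage
        g = min(p_mar, g_max, soc * eta / dtau)
        c = 0.0
    else:
        # charge: limited by the power rating and by the remaining storage capacity
        c = min(-p_mar, c_max, (soc_max - soc) / (eta * dtau))
        g = 0.0
    new_soc = soc + dtau * (eta * c - g / eta)
    imbalance = (g - c) - p_mar          # deviation Delta from the market position, eq. (9)
    return g, c, new_soc, imbalance


# a 1 MWh / 1 MW storage, half full, contracted to sell 1 MW for one 15-minute slot
print(default_strategy(p_mar=1.0, soc=0.5, g_max=1.0, c_max=1.0, soc_max=1.0))
```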
3. Reinforcement Learning Formulation
In this section, we propose a series of assumptions that allow us to formulate the previously introduced problem of a storage device operator trading in the CID market using a reinforcement learning (RL) framework. Based on these assumptions, the decision-making problem is cast as an MDP; the action space is tailored in order to represent a particular market player, and additional restrictions on the operation of the storage device are introduced.
Assumption 1 (Behaviour of the other agents). The other agents $-i$ interact with the order book in between two discrete time-steps in such a way that agent $i$ can be considered independent when interacting with the CID market at each time-step $t$. Moreover, it is assumed that these actions $a_{-i,t}$ depend strictly on the history of order book states $s^{OB}_{t-}$ and thus, by extension, on the history $h_{i,t-}$ for every time-step $t$:

$a_{-i,t} \sim P_{a_{-i,t}}(\cdot \,|\, h_{i,t-})$.  (23)

Assumption (1) suggests that the agents engage in a way that is very similar to a board game like chess, in which player $i$ can make a move only after players $-i$ have made their move. This behaviour is illustrated in Figure 1 (magnified area). Given this assumption, the notation $a_{-i,t}$ can also be seen as referring to actions selected during the interval $(t - \Delta t, t)$.
Assumption 2 (Exogenous information). The exogenous information $w^{exog}_{i,t}$ is given by a stochastic model that depends solely on $k$ past values, where $0 < k \leq t$, and a random disturbance $e_{i,t}$, according to:

$w^{exog}_{i,t} = b(w^{exog}_{i,t-}, ..., w^{exog}_{i,t-k}, e_{i,t})$,  (24)
$e_{i,t} \sim P_{e_{i,t}}(\cdot \,|\, h_{i,t})$.  (25)
Assumption 3 (Strategy for storage control). The control decisions related to the charging ($\Delta C_{i,t}(\tau)$) or discharging ($\Delta G_{i,t}(\tau)$) power to/from the storage device are made based on the "default" strategy described in Section 2.3.

It can be observed that, with such an assumption, the storage control decisions ($\Delta C_{i,t}(\tau)$ and $\Delta G_{i,t}(\tau)$) are obtained as a direct consequence of the trading decisions $a_{i,t}$. Indeed, after the trading decisions are submitted and the market position is updated, the storage control decisions are subsequently derived following the "default" strategy. Assumption (3) results in reducing the dimensionality of the action space and consequently the complexity of the decision-making problem.

Following Assumptions (1), (2) and (3), one can simply observe that the decision-making problem faced by an agent $i$ operating a storage device and trading energy in the CID market can be formalised as a fully observable finite-time MDP with the following characteristics:

• Discrete time-step $t \in T$, where $T$ is the optimisation horizon.

• State space $H_i$, where the state of the system $h_{i,t} \in H_i$ at time $t$ summarises all past information that is relevant for future optimisation.

• Action space $A_i$, where $a_{i,t} \in A_i$ is the set of new orders posted by agent $i$ at time-step $t$.

• Transition probabilities $h_{i,t+} \sim P(\cdot \,|\, h_{i,t}, a_{i,t})$, which can be inferred from the following processes:

1. $a_{-i,t} \in A_{-i}$ is drawn according to equation (23);
2. the state of the order book $s^{OB}_{t+}$ follows the transition given by equation (1);
3. the exogenous information $w^{exog}_{i,t}$ is given by equation (24) and the noise by equation (25);
4. the variable $s_{i,t+}$ that summarises the information of the storage-device-optimizing agent follows the transitions given by equations (1), (6)-(12), (16), (24) and (25);
5. the instantaneous reward $r_{i,t}$ collected after each transition is given by equations (18) and (19).

The elements resulting from these processes can be used to construct $h_{i,t+}$ in a straightforward way.
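The transition structure listed above can be summarised in a short sketch; each of the callables below is a placeholder for the corresponding process (equations (23), (1), (6)-(12), (16), (24)-(25) and (18)-(19)), not an implementation of it.

```python
def mdp_step(history, a_i, draw_opponents, update_book, update_private, reward_fn):
    """One transition of the trading MDP, following the processes listed above.

    The four callables are placeholders for: drawing the other agents' orders from
    the history (eq. 23), the order-book transition (eq. 1), the update of the
    agent's private and exogenous information (eqs. 6-12, 16, 24-25), and the
    reward (eqs. 18-19)."""
    s_ob, s_priv = history[-1]                 # current order book and private state
    a_minus_i = draw_opponents(history)        # other agents act first (Assumption 1)
    next_ob = update_book(s_ob, a_i, a_minus_i)
    next_priv = update_private(s_priv, s_ob, a_i, a_minus_i)
    r = reward_fn(s_ob, a_i, a_minus_i, next_priv)
    return history + [(next_ob, next_priv)], r


# toy usage with trivial stand-ins for the four processes
h, r = mdp_step(
    history=[({"asks": [], "bids": []}, {"soc": 0.5})],
    a_i=[],
    draw_opponents=lambda hist: [],
    update_book=lambda ob, a, a_o: ob,
    update_private=lambda priv, ob, a, a_o: priv,
    reward_fn=lambda ob, a, a_o, priv: 0.0,
)
print(r)  # 0.0
```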
Assumption 4 (Aggressor). The trading agent can only submit new orders that match already existing orders at their price (i.e. it acts as an aggressor or liquidity taker).

Let $A^{red}_i$ be the space that contains only actions that match pre-existing orders in the order book. According to Assumption (4), the $i$-th agent, at time-step $t$, is restricted to selecting actions $a_{i,t} \in A^{red}_i \subset A_i$. Let $s^{OB}_t = ((x^{OB}_j, y'^{OB}_j, v^{OB}_j, p^{OB}_j, e^{OB}_j), \forall j \in N_t)$ be the order book observation at trading time-step $t$. We use $y'^{OB}$ to denote that the new orders have the opposite side ("Buy" or "Sell") to the existing orders. We denote as $a^j_{i,t} \in [0, 1]$ the fraction of the volume accepted from order $j$. The reduced action space $A^{red}_i$ is then defined as:

$A^{red}_i = \{(x^{OB}_j, y'^{OB}_j, a^j_{i,t} \cdot v^{OB}_j, p^{OB}_j, e^{OB}_j), \, a^j_{i,t} \in [0, 1], \, \forall j \in N_t\}$.

At this point, posting a new set of orders $a_{i,t} \in A^{red}_i$ boils down to simply specifying the vector of fractions

$\bar{a}_{i,t} = (a^j_{i,t}, \forall j \in N_t) \in \bar{A}^{red}_i$

that defines the partial or full acceptance of the existing orders. The action $a_{i,t}$ submitted by an aggressor is a function $l$ of the observed order book $s^{OB}_t$ and the vector of fractions $\bar{a}_{i,t}$, and is given by:

$a_{i,t} = l(s^{OB}_t, \bar{a}_{i,t})$.  (26)
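A minimal sketch of the reduced action space of equation (26): under Assumption (4) an action is just a vector of acceptance fractions $\bar{a}$ over the resting orders, mapped back to concrete counter-orders at the resting prices. The dictionary-based order representation is our own simplification.

```python
def aggressor_action(order_book, fractions):
    """Equation (26): build new orders that (partially) accept existing ones.

    `order_book` is a list of resting orders (dicts with 'product', 'side',
    'volume', 'price'); `fractions` is the vector a_bar with one entry in [0, 1]
    per resting order. The new orders take the opposite side at the resting price."""
    assert len(order_book) == len(fractions)
    opposite = {"Buy": "Sell", "Sell": "Buy"}
    new_orders = []
    for o, frac in zip(order_book, fractions):
        if frac > 0.0:
            new_orders.append({"product": o["product"],
                               "side": opposite[o["side"]],
                               "volume": frac * o["volume"],
                               "price": o["price"]})
    return new_orders


book = [{"product": "Q1", "side": "Sell", "volume": 2.0, "price": 35.0},
        {"product": "Q1", "side": "Buy", "volume": 3.15, "price": 33.8}]
print(aggressor_action(book, fractions=[0.5, 0.0]))
# [{'product': 'Q1', 'side': 'Buy', 'volume': 1.0, 'price': 35.0}]
```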
Assumption 5 (No imbalances permitted). The trading agent can accept an order to buy or sell energy only if it does not result in any imbalance for the remaining delivery periods.

According to Assumption (5), the agent is completely risk-averse in the sense that, even if it stops trading at any given point, its position in the market can be covered without causing any imbalance. This assumption is quite restrictive with respect to the full potential of an asset-optimizing agent in the CID market. We note that, according to the German regulation policies (see (Braun & Hoffmann, 2016)), the imbalance market should not be considered as an optimisation floor and the storage device should always be balanced at each trading time-step $t$ ($\Delta_{i,t}(\tau) = 0, \forall \tau \in \bar{T}$). In this respect, we can view Assumption (5) as a way to comply with the German regulation policies in a risk-free context where each new trade should not create an imbalance that would have to be covered later.
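Assumption (5) can be read as a feasibility filter on candidate positions: a trade is admissible only if the storage device can still cover the resulting net position of every remaining delivery period without creating an imbalance. A simplified sketch, ignoring efficiency losses and using our own variable names:

```python
def position_is_coverable(positions, soc, g_max, c_max, soc_max, dtau=0.25):
    """Check that the net market positions (MW, positive = sell) for the remaining
    delivery periods can be met exactly by the storage device, i.e. that accepting
    them creates no imbalance (Assumption 5). Efficiency losses are ignored here."""
    for p in positions:
        if p > g_max or -p > c_max:        # power rating violated
            return False
        soc -= p * dtau                    # selling discharges, buying charges
        if soc < 0.0 or soc > soc_max:     # energy limits violated
            return False
    return True


# a 1 MWh / 1 MW storage, half full: selling 1 MW for three consecutive quarters is infeasible
print(position_is_coverable([1.0, 1.0, 1.0], soc=0.5, g_max=1.0, c_max=1.0, soc_max=1.0))  # False
print(position_is_coverable([1.0, -1.0], soc=0.5, g_max=1.0, c_max=1.0, soc_max=1.0))      # True
```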
Assumption 6 (Optimization decoupling). The storage device has a given initial value for the storage level, $SoC^{init}_i$, at the beginning of the delivery timeline. Moreover, it is constrained to terminate at a given level $SoC^{term}_i$ at the end of the delivery timeline.

Under Assumption (6), the optimisation of the storage unit over a long trading horizon can be decomposed into shorter optimisation windows (e.g. of one day). In the simulation results reported later in this paper, we will choose $SoC^{init}_i = SoC^{term}_i$.
4. Methodology
In this section, we describe the methodology that has been applied for tackling the MDP problem described in Section 3. We consider that, in reality, an asset-optimizing agent has at its disposal a set of trajectories (one per day) from participating in the CID market in the past years. The process of collecting these trajectories and their structure is presented in Section 4.1. Based on this dataset, we propose in Section 4.2 the deployment of the fitted Q iteration algorithm as introduced in (Ernst et al., 2005). This algorithm belongs to the class of batch-mode RL algorithms that make use of all the available samples at once for updating the policy. This class of algorithms is known to be very sample efficient.

Despite the different assumptions made on the operation of the storage device and the way it is restricted to interact with the market, the dimensionality of the action space still remains very high. Due to limitations related to the function approximation architecture used to implement the fitted Q iteration algorithm, a low-dimensional and discrete action space is necessary, as discussed in Section 4.3. Therefore, as part of the methodology, in Section 4.4 we propose a way of reducing the action space. Afterwards, in Section 4.5, a more compact representation of the state space is proposed in order to reduce the computational complexity of the training process and increase the sample efficiency of the algorithm.

Finally, the low number of available samples (one trajectory per day) gives rise to issues related to the limited exploration of the agent. In order to address these issues, we generate a large number of trading trajectories of our MDP according to an $\varepsilon$-greedy policy, using historical trading data. In the last part of this section, we elaborate on the strategy that is used in this paper for generating the trajectories and the limitations of this procedure.

As previously mentioned, an asset-optimizing agent can collect a set of trajectories from previous interactions with the CID market. Based on Assumption (6), each day can be optimised separately and thus trading for one day corresponds to one trajectory. We consider that the trading horizon defined in Section 2.2 consists of $K$ discrete trading time-steps such that $T = \{0, ..., K\}$. A single trajectory sampled from the MDP described in Section 3 is defined as:

$\zeta^m = (h^m_{i,0}, a^m_{i,0}, r^m_{i,0}, ..., h^m_{i,K-1}, a^m_{i,K-1}, r^m_{i,K-1}, h^m_{i,K})$.

A set of $M$ trajectories can then be defined as:

$F = \{\zeta^m, m = 1, ..., M\}$.

The set of trajectories $F$ can be used to generate the set of sampled one-step system transitions $F'$ defined as:

$F' = \{(h^1_{i,0}, a^1_{i,0}, r^1_{i,0}, h^1_{i,1}), ..., (h^1_{i,K-1}, a^1_{i,K-1}, r^1_{i,K-1}, h^1_{i,K}), \; ..., \; (h^M_{i,0}, a^M_{i,0}, r^M_{i,0}, h^M_{i,1}), ..., (h^M_{i,K-1}, a^M_{i,K-1}, r^M_{i,K-1}, h^M_{i,K})\}$.

The set $F'$ is split into $K$ sets of one-step system transitions $F'_t$ defined as:

$F'_t = \{(h^m_{i,t}, a^m_{i,t}, r^m_{i,t}, h^m_{i,t+1}), \, m = 1, ..., M\}, \quad \forall t \in \{0, ..., K-1\}$.
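A small sketch of the construction of the per-time-step transition sets $F'_t$, assuming each trajectory is stored as a flat list $[h_0, a_0, r_0, ..., h_K]$ as in the definition of $\zeta^m$ above:

```python
def split_into_transition_sets(trajectories):
    """Build the sets F'_t of one-step transitions (h_t, a_t, r_t, h_{t+1}) from a
    set of trajectories, one set per trading time-step t.

    Each trajectory is a list [h_0, a_0, r_0, h_1, a_1, r_1, ..., h_K]."""
    K = (len(trajectories[0]) - 1) // 3
    F_t = {t: [] for t in range(K)}
    for traj in trajectories:
        for t in range(K):
            h_t, a_t, r_t = traj[3 * t], traj[3 * t + 1], traj[3 * t + 2]
            h_next = traj[3 * (t + 1)]
            F_t[t].append((h_t, a_t, r_t, h_next))
    return F_t


# two toy trajectories with K = 2 decision steps (states and actions are plain strings here)
F = [["h0", "a0", 1.0, "h1", "a1", 0.5, "h2"],
     ["g0", "b0", -0.2, "g1", "b1", 0.0, "g2"]]
F_t = split_into_transition_sets(F)
print(F_t[1])  # [('h1', 'a1', 0.5, 'h2'), ('g1', 'b1', 0.0, 'g2')]
```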
A batch-mode RL algorithm can be implemented for extracting a relevant trading policy from these one-step system transitions. In the following subsection, the type of RL algorithm used for inferring a high-quality policy from this set of one-step system transitions is explained in detail.

In this section, the fitted Q iteration algorithm is proposed for the optimisation of the MDP defined in Section 3, using a set of collected trajectories. In order to solve the problem, we first define the $Q$-function for each state-action pair $(h_{i,t}, a_{i,t})$ at time $t$, as proposed in (Bertsekas, 2005), as:

$Q_t(h_{i,t}, a_{i,t}) = \mathbb{E}_{a_{-i,t}, e_{i,t}} \{r_{i,t} + V_{t+1}(h_{i,t+1})\}, \quad \forall t \in \{0, ..., K-1\}$.  (27)

A time-variant policy $\pi = \{\mu_0, ..., \mu_{K-1}\} \in \Pi$ consists in a sequence of functions $\mu_t$, where $\mu_t : H_i \to A^{red}_i$. An action $a_{i,t}$ is selected from this policy at each time-step $t$ according to $a_{i,t} = \mu_t(h_{i,t})$. We denote as $\pi_{t+1} = \{\mu_{t+1}, ..., \mu_{K-1}\}$ the sequence of functions $\mu_t$ from time-step $t+1$ onwards, and as $V_{t+1}$ the optimal expected cumulative reward from stage $t+1$ to $K$, given by:

$V_{t+1}(h_i) = \max_{\pi_{t+1} \in \Pi} \mathbb{E}_{(a_{-i,t+1}, e_{i,t+1}) \cdots (a_{-i,K-1}, e_{i,K-1})} \left\{ \sum_{k=t+1}^{K-1} R_{i,k}(h_{i,k}, \mu_k(h_{i,k}), a_{-i,k}) \,|\, h_{i,t+1} = h_i \right\}$.  (28)

We observe that $Q_t(h_{i,t}, a_{i,t})$ is the value attained by taking action $a_{i,t}$ at state $h_{i,t}$ and subsequently using an optimal policy. Using the dynamic programming algorithm (Bertsekas, 2005) we have:

$V_t(h_{i,t}) = \max_{a_{i,t} \in A^{red}_i} Q_t(h_{i,t}, a_{i,t})$.  (29)

Equation (27) can be written in the following form that relates $Q_t$ and $Q_{t+1}$:

$Q_t(h_{i,t}, a_{i,t}) = \mathbb{E}_{a_{-i,t}, e_{i,t}} \left\{ r_{i,t} + \max_{a_{i,t+1} \in A^{red}_i} Q_{t+1}(h_{i,t+1}, a_{i,t+1}) \right\}$.  (30)

An optimal time-variant policy $\pi^* = \{\mu^*_0, ..., \mu^*_{K-1}\}$ can be identified using the $Q$-functions as follows:

$\mu^*_t = \arg\max_{a_{i,t} \in A^{red}_i} Q_t(h_{i,t}, a_{i,t}), \quad \forall t \in \{0, ..., K-1\}$.  (31)

Computing the Q-functions from a set of one-step system transitions: In order to obtain the optimal time-variant policy $\pi^*$, the effort is focused on computing the Q-functions defined in equation (30). However, two aspects render the use of the standard value iteration algorithm impossible for solving the MDP defined in Section 3. First, the transition probabilities of the MDP defined in Section 3 are not known. Instead, we can exploit the set of collected historical trajectories to compute the exact Q-functions using an algorithm such as Q-learning (presented in (Watkins & Dayan, 1992)). Q-learning is designed to work only with trajectories, without any knowledge of the transition probabilities. Optimality is guaranteed given that all state-action pairs are observed infinitely often within the set of the historical trajectories and that the successor states are independently sampled at each occurrence of a state-action pair (Bertsekas, 2005). In Section 4.6 we discuss the validity of this condition and we address the problem of limited exploration by generating additional artificial trajectories. Second, due to the continuous nature of the state and action spaces, a tabular representation of the Q-functions used in Q-learning is not feasible. In order to overcome this issue, we use a function approximation architecture to represent the Q-functions (Busoniu et al., 2017).

The computation of the approximate Q-functions is performed using the fitted Q iteration algorithm (Ernst et al., 2005).
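The backward recursion can be sketched as follows, with a generic scikit-learn-style regressor standing in for the function approximator; the feature encoding of state-action pairs and the enumeration of candidate actions are problem-specific placeholders. The parametric, neural-network version actually used in this paper, with the targets of equations (32)-(33), is presented next.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor


def fitted_q_iteration(F_t, K, candidate_actions, features):
    """Backward pass of fitted Q iteration over the per-time-step transition sets F_t.

    `features(h, a)` maps a state-action pair to a feature vector and
    `candidate_actions(h)` enumerates the admissible actions in state h; both are
    problem-specific placeholders. Returns one fitted regressor Q_t per time-step."""
    Q = {K: None}                                       # terminal Q-function is identically zero
    for t in reversed(range(K)):
        X, y = [], []
        for h, a, r, h_next in F_t[t]:
            if Q[t + 1] is None:
                target = r
            else:
                q_next = [Q[t + 1].predict([features(h_next, a2)])[0]
                          for a2 in candidate_actions(h_next)]
                target = r + max(q_next)
            X.append(features(h, a))
            y.append(target)
        Q[t] = ExtraTreesRegressor(n_estimators=50).fit(np.array(X), np.array(y))
    return Q


# toy usage: one transition set, scalar states, two possible actions
F_t = {0: [((0.0,), 0, 1.0, (1.0,)), ((0.5,), 1, 0.0, (0.2,))]}
Q = fitted_q_iteration(F_t, K=1, candidate_actions=lambda h: [0, 1],
                       features=lambda h, a: [h[0], float(a)])
print(Q[0].predict([[0.0, 0]]))
```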
We present the algorithm for the case where a parametric function approximation architecture Q_t(h_{i,t}, a_t; θ_t) is used (e.g. neural networks). In this case, the algorithm is used to compute, recursively, the parameter vectors θ_t starting from t = K-1. However, it should be emphasized that the fitted Q iteration algorithm can be adapted in a straightforward way to the case in which a non-parametric function approximation architecture is selected.

The set of M samples of quadruples F′_t = {(h^m_{i,t}, a^m_{i,t}, r^m_t, h^m_{i,t+1}), m = 1, ..., M} obtained from previous experience is exploited in order to update the parameter vectors θ_t by solving the supervised learning problem presented in equation (32). The target values y^m_t are computed using the Q-function approximation of the next stage, Q_{t+1}(h_{i,t+1}, a_{t+1}; θ_{t+1}), according to equation (33). The Q-function of the terminal stage is set to zero (Q̂_K ≡ 0), and the recursion proceeds backwards until t = 0, producing a sequence of approximate Q-functions denoted by Q̂ = {Q̂_0, ..., Q̂_{K-1}}:

θ_t = arg min_{θ_t} Σ_{m=1}^{M} ( Q_t(h^m_{i,t}, a^m_t; θ_t) - y^m_t )²,   (32)

y^m_t = r^m_t + max_{a_{i,t+1} ∈ A^red_i} Q_{t+1}(h^m_{i,t+1}, a_{i,t+1}; θ_{t+1}).   (33)

Once the parameters θ_t are computed, the time-variant policy π̂* = {μ̂*_0, ..., μ̂*_{K-1}} is obtained as:

μ̂*_t(h_{i,t}) = arg max_{a_{i,t} ∈ A^red_i} Q_t(h_{i,t}, a_{i,t}; θ_t),   (34)

∀t ∈ {0, ..., K-1}.

In practice, a new trajectory is collected after each trading day. The set of collected trajectories F is consequently augmented. Thus, the fitted Q iteration algorithm can be used to compute a new optimal policy when new data arrive.
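For concreteness, the following is a hedged sketch of the backward fitted Q iteration recursion of equations (32)-(33). It uses a scikit-learn regressor in place of the paper's neural networks, assumes states are numeric feature vectors and actions are numerically encoded elements of a small discrete set; none of the names below are taken from the paper's code.

```python
# Backward fitted Q iteration over the sets F'_t (eqs. 32-33),
# with a generic regressor standing in for the Q-function approximator.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(F_t, K, actions):
    Q = {}                                    # Q[t] approximates Q_t(h, a)
    for t in reversed(range(K)):              # t = K-1, ..., 0
        X, y = [], []
        for (h, a, r, h_next) in F_t[t]:
            if t + 1 in Q:                    # max_a' Q_{t+1}(h', a'), eq. (33)
                q_next = max(Q[t + 1].predict([np.append(h_next, a2)])[0]
                             for a2 in actions)
            else:                             # terminal stage: Q_K = 0
                q_next = 0.0
            X.append(np.append(h, a))
            y.append(r + q_next)
        Q[t] = ExtraTreesRegressor(n_estimators=50).fit(X, y)
    return Q

def greedy_action(Q, t, h, actions):          # greedy policy, eq. (34)
    return max(actions, key=lambda a: Q[t].predict([np.append(h, a)])[0])

# Toy usage: two-step horizon, scalar states, actions encoded as 0/1.
F_t = {0: [([0.2], 1, 5.0, [0.4]), ([0.1], 0, 0.0, [0.1])],
       1: [([0.4], 1, 3.0, [0.5]), ([0.1], 0, 0.0, [0.1])]}
Q = fitted_q_iteration(F_t, K=2, actions=[0, 1])
print(greedy_action(Q, 0, [0.2], [0, 1]))
```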
The fitted Q iteration algorithm, described in the previous section, can be used to provide a trading policy based on the set of past trajectories at the disposal of the agent. Even though this approach is theoretically sound, in practice there are several limitations to overcome. The efficiency of the described fitted Q iteration algorithm is overshadowed by the high dimensionality of the state and the action spaces. The state variable h_{i,t} = (s_{i,0}, a_{i,0}, r_{i,0}, ..., s_{i,t-1}, a_{i,t-1}, r_{i,t-1}, s_{i,t}) ∈ H_i is composed of:

• the entire history of actions (a_{i,0}, ..., a_{i,t-1}) before time t,
• the entire history of rewards (r_{i,0}, ..., r_{i,t-1}) before time t,
• the history of order book states (s^OB_0, ..., s^OB_t) up to time t, and
• the history of the private information (s^private_{i,0}, ..., s^private_{i,t}) up to time t, where s^private_{i,t} = ((P^mar_{i,t}(τ), Δ_{i,t}(τ), G_{i,t}(τ), C_{i,t}(τ), SoC_{i,t}(τ), ∀τ ∈ T̄), w^exog_{i,t}).

The state space H_i as well as the action space A^red_i, as described in Section 3.2, depend explicitly on the content of the order book s^OB_t. The dimension of these spaces at each time-step t depends on the total number of available orders |N_t| in the order book. However, the total number of orders changes at each step t. Thus, both the state and the action spaces are high-dimensional spaces of variable size. In order to reduce the complexity of the decision-making problem, we have chosen to reduce these spaces so as to work with a small action space of constant size and a compact state space. In the following, we describe the procedure that was carried out for the reduction of the state and action spaces.

In this section, we elaborate on the design of a small and discrete set of actions that approximates the original action space. Based on Assumptions (1), (2), (3), (4), (5) and (6), a new action space A′_i is proposed, defined as A′_i = {"Trade", "Idle"}. The new action space is composed of two high-level actions a′_{i,t} ∈ A′_i. These high-level actions are transformed into an original action through a mapping p : A′_i → A^red_i, from space A′_i to the reduced action space A^red_i. The high-level actions are defined as follows.

4.4.1. "Trade"

At each time-step t, agent i selects orders from the order book with the objective of maximizing the instantaneous reward, under the constraint that the storage device can remain balanced for every delivery period even if no further interaction with the CID market occurs. As a reminder, this constraint was imposed by Assumption (5). Under this assumption, the instantaneous reward signal r_{i,t}, presented in equation (19), consists only of the trading revenues obtained from the matching process of orders at time-step t. We further define a mapping u : R_+ × {"Sell", "Buy"} → R that adjusts the sign of the volume v^OB of each order according to its side y^OB. Orders posted for buying energy are associated with positive volume and orders posted for selling energy with negative volume, or equivalently:

u(v^OB, y^OB) = { v^OB, if y^OB = "Buy";  -v^OB, if y^OB = "Sell" }.   (35)

Consequently, the reward function ρ defined in Section 2.4 is adapted according to the proposed modifications. The new reward function ρ, where ρ : S^OB × Ā^red_i → R, is a stationary function of the orders observed at each time-step t and of the agent's response to the observed orders. An analytical expression for the instantaneous reward collected is given by:

r_{i,t} = ρ(s^OB_t, ā_{i,t}) = Σ_{j ∈ N_t} a^j_{i,t} · u(v^OB_j, y^OB_j) · p^OB_j.   (36)

The high-level action "Trade" amounts to solving the bid-acceptance optimisation problem presented in Algorithm 1. The objective function of the problem, formulated in equation (37), consists of the revenues arising from trading. It is important to note that the operational constraints guarantee that no order will be accepted if it causes any imbalance. We denote by N_τ ⊂ N the set of unique indices of the available orders that correspond to delivery time-step τ, with N_t = ∪_{τ ∈ T̄} N_τ. In equation (38), the energy purchased and sold (Σ_{j ∈ N_τ} a^j_{i,t} u(v^OB_j, y^OB_j)), the past net energy trades (P^mar_{i,t}(τ)) and the energy discharged by the storage (G_{i,t+1}(τ)) must match the energy charged by the storage (C_{i,t+1}(τ)) for every delivery time-step τ. The energy balance of the storage device, presented in equation (39), is responsible for the time-coupling and the arbitrage between two products x (delivery time-steps τ). The technical limits of the storage level and of the charging and discharging processes are described in equations (40) to (44). The binary variables k_{i,t} = (k_{i,t}(τ), ∀τ ∈ T̄) restrict the operation of the unit, for each delivery period, to only one mode, either charging or discharging.

The optimal solution to this problem yields the vector of fractions ā_{i,t} = (a^j_{i,t}, ∀j ∈ N_t) ∈ Ā^red_i, which is used in equation (26) to construct the action a_{i,t} ∈ A^red_i. The optimal solution also defines, at each time-step t, the adjustments in the level of the production (discharge) ΔG_{i,t} = (ΔG_{i,t}(τ), ∀τ ∈ T̄(t)) and of the consumption (charge) ΔC_{i,t} = (ΔC_{i,t}(τ), ∀τ ∈ T̄(t)). The evolution of the state of charge SoC_{i,t+1} = (SoC_{i,t+1}(τ), ∀τ ∈ T̄(t)) of the unit, as well as the production G_{i,t+1} = (G_{i,t+1}(τ), ∀τ ∈ T̄(t)) and consumption C_{i,t+1} = (C_{i,t+1}(τ), ∀τ ∈ T̄(t)) levels, are computed for each delivery period.
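The small snippet below illustrates the sign convention of equation (35) and the trading reward of equation (36). The order and acceptance containers are illustrative assumptions for the example only.

```python
# Sign convention u (eq. 35) and trading reward (eq. 36): accepting a
# "Buy" order means the agent sells (positive revenue), accepting a
# "Sell" order means the agent buys (negative revenue).
def u(volume, side):
    return volume if side == "Buy" else -volume

def trading_reward(orders, fractions):
    """orders: list of (volume, side, price); fractions: accepted a^j in [0, 1]."""
    return sum(a * u(v, y) * p for a, (v, y, p) in zip(fractions, orders))

book = [(10.0, "Sell", 42.0), (5.0, "Buy", 55.0)]    # visible orders
print(trading_reward(book, [1.0, 0.4]))              # -310.0: net cost of buying
```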
4.4.2. "Idle"

No transactions are executed, and no adjustment is made to the previously scheduled quantities. Under this action, the vector of fractions ā_{i,t} is a zero vector. The discharge and charge as well as the state of charge of the storage device remain unchanged (ΔG_{i,t} ≡ ΔC_{i,t} ≡ 0) and we have:

G_{i,t+1}(τ) = G_{i,t}(τ), ∀τ ∈ T̄(t),   (49)
C_{i,t+1}(τ) = C_{i,t}(τ), ∀τ ∈ T̄(t),   (50)
SoC_{i,t+1}(τ) = SoC_{i,t}(τ), ∀τ ∈ T̄(t).   (51)

With such a reduction of the action space, the agent can choose at every time-step t between the two described high-level actions (a′_{i,t} ∈ A′_i = {"Trade", "Idle"}).

Algorithm 1 "Trade"
Input: t, s^OB_t, P^mar_{i,t}, SoC_i^min, SoC_i^max, C_i^min, C_i^max, G_i^min, G_i^max, SoC_i^init, SoC_i^term, τ^init, τ^term, G_{i,t}, C_{i,t}
Output: ā_{i,t}, SoC_{i,t+1}, G_{i,t+1}, C_{i,t+1}, ΔG_{i,t}, ΔC_{i,t}, k_{i,t+1}, r_{i,t}
Solve:
max_{ā_{i,t}, SoC_{i,t+1}, G_{i,t+1}, C_{i,t+1}, ΔG_{i,t}, ΔC_{i,t}, k_{i,t+1}}  Σ_{j ∈ N_t} a^j_{i,t} · u(v^OB_j, y^OB_j) · p^OB_j   (37)
s.t.
Σ_{j ∈ N_τ} a^j_{i,t} u(v^OB_j, y^OB_j) + P^mar_{i,t}(τ) + C_{i,t+1}(τ) = G_{i,t+1}(τ), ∀τ ∈ T̄(t)   (38)
SoC_{i,t+1}(τ + Δτ) = SoC_{i,t+1}(τ) + Δτ · ( η · C_{i,t+1}(τ) - G_{i,t+1}(τ)/η ), ∀τ ∈ T̄(t)   (39)
SoC_i^min ≤ SoC_{i,t+1}(τ) ≤ SoC_i^max, ∀τ ∈ T̄(t)   (40)
SoC_i^init = SoC_{i,t+1}(τ^init)   (41)
SoC_i^term = SoC_{i,t+1}(τ^term)   (42)
C_i^min ≤ C_{i,t+1}(τ) ≤ k_{i,t+1}(τ) · C_i^max, ∀τ ∈ T̄(t)   (43)
G_i^min ≤ G_{i,t+1}(τ) ≤ (1 - k_{i,t+1}(τ)) · G_i^max, ∀τ ∈ T̄(t)   (44)
G_{i,t+1}(τ) = G_{i,t}(τ) + ΔG_{i,t}(τ), ∀τ ∈ T̄(t)   (45)
C_{i,t+1}(τ) = C_{i,t}(τ) + ΔC_{i,t}(τ), ∀τ ∈ T̄(t)   (46)
k_{i,t+1}(τ) ∈ {0, 1}, ∀τ ∈ T̄(t)   (47)
a^j_{i,t} ∈ [0, 1], ∀j ∈ N_t   (48)

Note that when the agent learns to idle in a given situation, this does not necessarily mean that, had it chosen to "Trade" instead, it would not have made a positive immediate reward. Indeed, the agent chooses "Idle" if it believes that a better market state may emerge, i.e. the agent learns to wait for the ideal opportunity of orders appearing in the order book at subsequent time-steps. We compare this approach to an alternative, which we refer to as the "rolling intrinsic" policy. According to this policy, at every time-step t of the trading horizon the agent selects the combination of orders that optimises its operation and profits based on the current information, assuming that the storage device must remain balanced for every delivery period, as presented in (Lohndorf & Wozabal, 2015). The "rolling intrinsic" policy is thus equivalent to sequentially selecting the action "Trade" (Algorithm 1), as defined in this framework. The algorithm proposed later in this paper exploits the experience that the agent can gain through (artificial) interaction with its environment, in order to learn the value of trading or idling at every state that the agent may encounter.
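To make the bid-acceptance problem of Algorithm 1 more tangible, the following is a hedged, self-contained sketch for a toy storage unit, written with the PuLP MILP library. All data and parameter values are illustrative assumptions, the prior market position P^mar is set to zero, and the sign convention of equation (35) is used; this is a sketch of the same family of problem, not the paper's implementation.

```python
# Toy instance of the "Trade" bid-acceptance MILP (cf. eqs. 37-48).
import pulp

periods = [0, 1, 2, 3]                       # delivery time-steps tau
dt = 0.25                                    # delivery interval (hours)
eta = 0.9                                    # charge/discharge efficiency
soc_min, soc_max, soc_init, soc_term = 0.0, 10.0, 5.0, 5.0   # MWh
c_max = g_max = 8.0                          # charge/discharge limits (MW)
orders = [(0, 6.0, "Sell", 20.0),            # (period, volume MW, side, price)
          (2, 6.0, "Buy", 60.0)]
u = lambda v, y: v if y == "Buy" else -v     # eq. (35): agent's signed sales

m = pulp.LpProblem("trade", pulp.LpMaximize)
a = [pulp.LpVariable(f"a_{j}", 0, 1) for j in range(len(orders))]        # eq. (48)
C = {t: pulp.LpVariable(f"C_{t}", 0, c_max) for t in periods}
G = {t: pulp.LpVariable(f"G_{t}", 0, g_max) for t in periods}
k = {t: pulp.LpVariable(f"k_{t}", cat="Binary") for t in periods}
soc = {t: pulp.LpVariable(f"soc_{t}", soc_min, soc_max)                  # eq. (40)
       for t in periods + [len(periods)]}

# Objective (37): revenue of the accepted order fractions.
m += pulp.lpSum(a[j] * u(v, y) * p * dt for j, (_, v, y, p) in enumerate(orders))
m += soc[0] == soc_init                                   # eq. (41)
m += soc[len(periods)] == soc_term                        # eq. (42)
for t in periods:
    sold = pulp.lpSum(a[j] * u(v, y)
                      for j, (tau, v, y, _) in enumerate(orders) if tau == t)
    m += sold + C[t] == G[t]                              # balance, cf. eq. (38)
    m += soc[t + 1] == soc[t] + dt * (eta * C[t] - (1 / eta) * G[t])   # eq. (39)
    m += C[t] <= k[t] * c_max                             # eq. (43)
    m += G[t] <= (1 - k[t]) * g_max                       # eq. (44)

m.solve(pulp.PULP_CBC_CMD(msg=False))
print([pulp.value(x) for x in a], pulp.value(m.objective))
```

In this toy instance the unit buys cheap energy for the first delivery period and sells a reduced fraction back for the third period, the scaling being dictated by the efficiency losses and the terminal state-of-charge constraint.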
In this section, we propose a more compact and low-dimensional representation of the state space H_i. The state h_{i,t}, as explained in Section 4.3, contains the entire history of all the relevant information available for the decision-making up to time t. We consider each one of the components of the state h_{i,t}, namely the entire history of actions, order book states and private information, and we provide a simplified form.

First, the vector containing the entire history of actions is reduced to a vector of binary variables after the modifications introduced in Section 4.4.

Second, the vector containing the history of order book states is reduced to a vector of engineered features. We start from the order book state s^OB_t = ((x^OB_j, y′^OB_j, v^OB_j, p^OB_j, e^OB_j), ∀j ∈ N_t ⊆ N) ∈ S^OB, defined in Sections 2.1 and 2.2 as a high-dimensional continuous vector used to describe the state of the CID market. Owing to the variable (non-constant) and large number of orders |N_t|, the space S^OB has a non-constant size and a high dimensionality.

In order to overcome this issue, we proceed as follows. First, we consider the market depth curves for each product x. The market depth of each side ("Sell" or "Buy") at a time-step t is defined as the total volume available in the order book per price level for product x. The market depth for the "Sell" ("Buy") side is computed by stacking the existing orders in ascending (descending) price order and accumulating the available volume. The market depth of a set of quarter-hourly products at a time instant t is illustrated in Figure 2a, using data from the German CID market. The market depth curves serve as a visualization of the order book that provides information about the liquidity of the market. Moreover, they provide information about the maximum (minimum) price that a trading agent would have to pay in order to buy (sell) a certain volume of energy. If we assume a fixed-price discretisation, certain upper and lower bounds on the prices and an interpolation of the data in this price range, the market depth curve of each product x can be approximated by a finite and constant set of values.

Even though this set of values has a constant size, it can still be extremely large. Its dimension is no longer a function of the number of existing orders, but it depends on the resolution of the price discretisation, the price range considered, and the total number of products in the market. Instead of an individual market depth curve for each product x, we therefore consider a single market depth curve for all the available products, i.e. we stack the existing orders of all products in ascending (descending) price order and accumulate the available volumes. In this way we construct the aggregated market depth curve, presented in Figure 2b. The aggregated market depth curve illustrates the total available volume ("Sell" or "Buy") per price level across all products.
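A minimal sketch of how such aggregated depth curves could be built is given below; it assumes the order book is available as a flat list of (side, price, volume) tuples pooled over all products, which is an illustrative simplification.

```python
# Aggregated market-depth curves: "Buy" orders stacked in descending price
# order, "Sell" orders in ascending order, volumes accumulated.
import numpy as np

def aggregated_depth(orders):
    buys  = sorted([o for o in orders if o[0] == "Buy"],  key=lambda o: -o[1])
    sells = sorted([o for o in orders if o[0] == "Sell"], key=lambda o:  o[1])
    curve = lambda side: (np.array([o[1] for o in side]),          # prices
                          np.cumsum([o[2] for o in side]))         # cum. volume
    return curve(buys), curve(sells)

book = [("Buy", 60, 5), ("Buy", 40, 10), ("Sell", 35, 4), ("Sell", 55, 8)]
(buy_p, buy_v), (sell_p, sell_v) = aggregated_depth(book)
# Here the best "Buy" price (60) exceeds the best "Sell" price (35), so the
# curves intersect and an arbitrage opportunity exists.
print(buy_p, buy_v, sell_p, sell_v)
```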
The motivation for considering the aggregated curves comes from the very nature of a storage device. The main profit-generating mechanism of a storage device is the arbitrage between two delivery periods: it purchases (charges) electricity during periods of low prices and sells (discharges) during periods of high prices. For instance, in Figure 2a, a storage device would buy volume for one product and sell volume back for another. The intersection of the "Sell" and "Buy" curves in Figure 2b defines the maximum volume that could be arbitraged by the storage device if no operational constraints were considered, and it serves as an upper bound on the profits at each step t. Alternatively, the market depth for the same products at a different time-step of the trading horizon is presented in Figure 3a. As illustrated in Figure 3b, there is no arbitrage opportunity between the products, and hence the aggregated curves do not intersect. Thus, we assume that the aggregated curves provide a sufficient representation of the order book.

At this point, considering a fixed-price discretisation and a fixed price range would yield a constant set of values able to describe the aggregated curves. However, in order to further decrease the size of this set of values while retaining sufficient price resolution, we use as state variables a set of distance measures between the two aggregated curves that capture the arbitrage potential at each trading time-step t, as presented in Figures 2b and 3b. For instance, one such feature is the signed difference between the 75th percentile of the "Buy" prices and the 25th percentile of the "Sell" prices; the full set of features is listed in Table 2.
Figure 2. (a) Market depth per product at a time-step t with arbitrage potential. (b) The corresponding aggregated curves for a profitable order book.

Figure 3. (a) Market depth per product at a time-step t with no arbitrage potential. (b) The corresponding aggregated curves for a non-profitable order book.

The resulting vector of features s′^OB_t ∈ S′^OB = {D_1, ..., D_10}, detailed in Table 2, is used to represent the state of the order book and, in particular, its profit potential. It is important to note that, in contrast to s^OB_t ∈ S^OB, the new order book observation s′^OB_t ∈ S′^OB does not depend on the number of orders in the order book and therefore has a constant size, i.e. the cardinality of S′^OB is constant over time.

Finally, the history of the private information of agent i, which is not publicly available, is a vector that contains the high-dimensional continuous variables s^private_{i,t} related to the operation of the storage device. As described in Section 4.3, s^private_{i,t} is defined as:

s^private_{i,t} = ((P^mar_{i,t}(τ), Δ_{i,t}(τ), G_{i,t}(τ), C_{i,t}(τ), SoC_{i,t}(τ), ∀τ ∈ T̄), w^exog_{i,t}).

According to Assumption (5), the trading agent cannot perform any transaction if it results in imbalances. Therefore, it is not relevant to consider the vector Δ_{i,t}, since it will always be zero according to the way the high-level actions are defined in Section 4.4. Additionally, Assumption (3) regarding the default strategy for storage control, in combination with Assumption (5), yields a direct correlation between the vector P^mar_{i,t} and the vectors G_{i,t}, C_{i,t} and SoC_{i,t}. Thus, it is considered that P^mar_{i,t} contains all the required information, and the vectors G_{i,t}, C_{i,t} and SoC_{i,t} can be dropped.

Following the previous analysis, we can define the low-dimensional pseudo-state z_{i,t} = (s′_{i,0}, a′_{i,0}, r_{i,0}, ..., a′_{i,t-1}, r_{i,t-1}, s′_{i,t}) ∈ Z_i, where s′_{i,t} = (s′^OB_t, P^mar_{i,t}, w^exog_{i,t}) ∈ S′_i. This pseudo-state can be seen as the result of applying an encoder e : H_i → Z_i, which maps a true state h_{i,t} to the pseudo-state z_{i,t}.
Assumption 7 (Pseudo-state). The pseudo-state z_{i,t} ∈ Z_i contains all the relevant information for the optimisation of the CID market trading of an asset-optimizing agent.

Under Assumption (7), using the pseudo-state z_{i,t} instead of the true state h_{i,t} is equivalent and does not lead to a sub-optimal policy. The resulting decision process after the state and action space reductions is illustrated in Figure 4.
Figure 4. Schematic of the decision process. The original MDP is highlighted in a gray background. The state of the original MDP, h_{i,t}, is encoded into the pseudo-state z_{i,t}. Based on z_{i,t}, agent i takes a high-level action a′_{i,t} according to its policy π_i. This action a′_{i,t} is mapped to an original action a_{i,t} and submitted to the CID market. The CID market makes a transition based on the action of agent i and the actions of the other agents a_{-i,t}. After this transition, the market position of agent i is defined and the control actions for the storage device are derived according to the "default" strategy. Each transition yields a reward r_{i,t} and a new state h_{i,t+1}.
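The sketch below walks through one trading day under the decision process of Figure 4: encode the true state into a pseudo-state, let the policy pick a high-level action, map it to an order-book action, and step the environment. The encoder, the "Trade" mapping and the environment are purely illustrative stubs, not the paper's implementation.

```python
# One-day rollout of the reduced decision process (cf. Figure 4).
import random

HIGH_LEVEL_ACTIONS = ("Trade", "Idle")

def run_trading_day(env, policy, K):
    history, total_reward = [], 0.0
    h = env.reset()
    for t in range(K):
        z = encode(h, history)                         # e: H_i -> Z_i
        a_high = policy(t, z)                          # high-level action in A'_i
        a = trade(h) if a_high == "Trade" else None    # p: A'_i -> A_i^red
        r, h = env.step(a)                             # CID market + "default" strategy
        history.append((a_high, r))
        total_reward += r
    return total_reward

# Stubs so the sketch runs end to end (all purely illustrative):
encode = lambda h, hist: (h, tuple(hist[-3:]))
trade = lambda h: "accept-best-orders"
policy = lambda t, z: random.choice(HIGH_LEVEL_ACTIONS)

class ToyEnv:
    def reset(self): return 0.0
    def step(self, a): return (random.uniform(0, 5) if a else 0.0, random.random())

print(run_trading_day(ToyEnv(), policy, K=4))
```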
Figure 5. Schematic of the neural network architecture. The input sequence s̄_{i,t-h̄}, ..., s̄_{i,t} (263 features per trading step, input shape (batch size, h̄, 263)) is processed by an LSTM layer whose output has size 128, followed by five fully connected layers (FC1: 128 → 36, FC2 to FC4: 36 → 36, FC5: 36 → 2) that produce the two outputs Q_Trade and Q_Idle.
Table 2. Order book features used for the state reduction.

Symbol   Definition                        Description
D_1      p^Buy_max - p^Sell_min            Signed diff. between the maximum "Buy" price and the minimum "Sell" price
D_2      p^Buy_mean - p^Sell_mean          Signed diff. between the mean "Buy" price and the mean "Sell" price
D_3      p^Buy_25 - p^Sell_75              Signed diff. between the 25th percentile "Buy" price and the 75th percentile "Sell" price
D_4      p^Buy_50 - p^Sell_50              Signed diff. between the 50th percentile "Buy" price and the 50th percentile "Sell" price
D_5      p^Buy_75 - p^Sell_25              Signed diff. between the 75th percentile "Buy" price and the 25th percentile "Sell" price
D_6      |v^Buy_min - v^Sell_min|          Abs. diff. between the minimum "Buy" cum. volume and the maximum "Sell" cum. volume
D_7      |v^Buy_mean - v^Sell_mean|        Abs. diff. between the mean "Buy" cum. volume and the mean "Sell" cum. volume
D_8      |v^Buy_25 - v^Sell_25|            Abs. diff. between the 25th percentile "Buy" cum. volume and the 25th percentile "Sell" cum. volume
D_9      |v^Buy_50 - v^Sell_50|            Abs. diff. between the 50th percentile "Buy" cum. volume and the 50th percentile "Sell" cum. volume
D_10     |v^Buy_75 - v^Sell_75|            Abs. diff. between the 75th percentile "Buy" cum. volume and the 75th percentile "Sell" cum. volume
In this section, the generation of additional artificial trajectories for addressing exploration issues is discussed. Indeed, if we were to implement an agent that selects at every time-step between the "Idle" and "Trade" actions, we would collect a certain number of trajectories (one per day) over a certain period of interactions. The collected dataset could be used to train a policy with a batch-mode RL algorithm, as described in Section 4.2. Every time a new trajectory arrived, it would be appended to the previous set of trajectories and the entire dataset could be used to improve the trading policy.

This approach requires a large number of days in order to acquire a sufficient amount of data. Additionally, as discussed in Section 4.2, sufficient exploration of the state and action spaces is a key requirement for converging to a near-optimal policy: all parts of the state and action spaces must be visited sufficiently often. As a result, the RL agent needs to explore unknown grounds in order to discover interesting policies (exploration), while it should also apply the learned policies in order to collect high rewards (exploitation). However, since the set of collected trajectories would come from a real agent, the visitation of many different states is expected to be limited.

In the RL context, exploration is performed when the agent selects a different action than the one that, according to its experience, will yield the highest rewards. In real life, it is unlikely that a trader would select such actions, and potentially bear negative revenues, for the sake of gaining more experience. This leads to limited exploration during the learning process and would result in a suboptimal policy.
Assumption 8 (No impact on the behaviour of the rest of the agents). The actions of trading agent i do not influence the future actions of the rest of the agents -i in the CID market. In this way, agent i is not capable of influencing the market.

Assumption (8) implies that each of the agents -i entering the market posts orders solely based on its individual needs. Furthermore, their actions are not considered as a reaction to the actions of the other market players. Leveraging Assumption (8) allows one to tackle the exploration issues discussed previously by generating several artificial trajectories using historical order book data. We denote by E the number of episodes (times) each day from historical data is repeated and by L_train the set of trading days used to train the agent. We then obtain the total number of trajectories M as M = E · |L_train|.

The generation of artificial trajectories is performed according to the process described in (Ernst et al., 2005), which interleaves the generation of trajectories with the computation of an approximate Q-function using the trajectories already generated. As shown in Algorithm 2, for a number of episodes ep, we randomly select days from the training set, which we simulate using an ε-greedy policy. According to this policy, an action is chosen at random with probability ε and according to the available Q-functions with probability (1 - ε). The generated trajectories are added to the set of trajectories. The second step consists of updating the Q-function approximation using the set of collected trajectories. This process is terminated when the total number of episodes has reached the specified number E.

This process introduces the parameters L_train, E, ep, ε and decay. The selection of these parameters impacts the training progress and the quality of the resulting policy.
Algorithm 2: Generation of artificial trajectories
Input: L_train, E, ep, ε, decay
Output: Q̂, F
Initialize Q̂ ≡ 0
M ← E · |L_train|
m ← 1
repeat
  for iter j ← 1 to ep do
    d ← rand(L_train)                ▷ Randomly pick a day d from the train set L_train
    ζ^m ← simulate(d, ε-greedy(Q̂))   ▷ Generate trajectory ζ^m by simulating day d with an ε-greedy policy
    F.add(ζ^m)                       ▷ Append the trajectory from day d to the set F
    ε ← anneal(ε, decay, iter j)     ▷ Anneal the value of ε based on the decay parameter
    m ← m + 1
  end for
  Fit new Q̂ functions using the set F according to equations (32), (33)
until m ≥ M
return Q̂, F

The set of days considered for training (L_train) is typically selected as a proportion (e.g. 70%) of the total set of days available. The total number of episodes E should be large enough so that convergence is achieved and is typically tuned based on the application. The frequency with which the trajectory generation and the updates are interleaved is controlled by the parameter ep; a small value of ep results in a large number of updates. The parameter ε is used to address the exploration-exploitation trade-off during the training process. As the training evolves, this parameter is annealed based on a predefined parameter decay, in order to gradually reduce exploration and to favour exploitation along the (near-)optimal trajectories. In practice, the size of the buffer F cannot grow infinitely due to memory limitations, so a limit on the number of trajectories stored in the buffer is typically imposed. Once this limit is reached, the oldest trajectories are removed as new ones arrive: the buffer is a double-ended queue of fixed size.

As described in Section 4.5, the pseudo-state z_{i,t} contains a sequence of variables whose length is proportional to t. This motivates the use of Recurrent Neural Networks (RNNs), which are known to efficiently process variable-length sequences of inputs. In particular, we use Long Short-Term Memory (LSTM) networks (Goodfellow et al., 2016), a type of RNN in which a gating mechanism is introduced to regulate the flow of information to the memory state.

All the networks in this study have the architecture presented in Figure 5. It is composed of one LSTM layer with 128 neurons followed by five fully connected layers with 36 neurons, where "ReLU" was selected as the activation function. The structure of the network (number of layers and neurons) was selected after cross-validation.

Theoretically, the length of the sequence of features provided as input to the neural network can be as large as the total number of trading steps in the optimisation horizon. In practice, though, there are limitations with respect to the memory required to store a tensor of this size. As we can observe in Figure 5, each sample in the batch contains a vector of size 263 for each time-step. Assuming a certain batch size, there is a limit to the number of steps that can be stored in memory. Therefore, for practical reasons and due to hardware limitations, we assume a history length h̄ and define z_{i,t} = (a′_{i,t-h̄-1}, r_{i,t-h̄-1}, s′_{i,t-h̄}, a′_{i,t-h̄}, r_{i,t-h̄}, ..., a′_{i,t-1}, r_{i,t-1}, s′_{i,t}) ∈ Z_i. At each step t, the history length h̄ takes the minimum value between the time-step t and h̄_max, i.e. h̄ = min(t, h̄_max). Additionally, we provide the variable s̄_t = (a′_{i,t-1}, r_{i,t-1}, s′_{i,t}) as a fixed-size input for each step t of the LSTM. Consequently, the pseudo-state can be written as z_{i,t} = (s̄_{t-h̄}, ..., s̄_t).
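A hedged PyTorch sketch of the network of Figure 5 is shown below. The layer sizes follow the figure (263 input features per step, an LSTM with 128 units, five fully connected layers of 36 units with ReLU, and two outputs Q_Trade and Q_Idle); everything else, including how the last hidden step is read out, is an assumption for illustration.

```python
# Q-network sketch: LSTM over the pseudo-state sequence followed by a
# fully connected head producing the two Q-values (Trade, Idle).
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_features=263, lstm_size=128, fc_size=36, n_actions=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, lstm_size, batch_first=True)
        layers, width = [], lstm_size
        for _ in range(4):                               # FC1-FC4 with ReLU
            layers += [nn.Linear(width, fc_size), nn.ReLU()]
            width = fc_size
        layers += [nn.Linear(fc_size, n_actions)]        # FC5 -> (Q_Trade, Q_Idle)
        self.head = nn.Sequential(*layers)

    def forward(self, seq):                              # seq: (batch, h_bar, 263)
        out, _ = self.lstm(seq)
        return self.head(out[:, -1, :])                  # Q-values from last step

q_net = QNetwork()
print(q_net(torch.zeros(8, 10, 263)).shape)              # torch.Size([8, 2])
```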
The exploration requirements of the continuous state space, as defined previously, introduce the necessity of collecting a large number of trajectories M. The total time required for gathering these trajectories heavily depends on the simulation time needed for one episode. In the particular setting developed here, the simulation time can be quite long since, at each decision step, if the selected action is "Trade", an optimisation model is constructed and solved.

In order to address this issue, we resort to an asynchronous architecture, similar to the one proposed in (Horgan et al., 2018) and presented in Figure 6. The two processes described in Section 4.6, namely the generation of trajectories and the computation of the Q-functions, run concurrently with no high-level synchronization.
Figure 6.
Schematic of the asynchronous distributed architecture. Each actor runs on a different thread and contains a copy of the environment, an individual ε-greedy policy based on the latest version of the network parameters, and a local buffer. The actors generate trajectories that are stored in their local buffers. When the local buffer of an actor is filled, it is appended to the global buffer and the actor collects the latest network parameters from the learner. A single learner runs on a separate thread and is continuously training using experiences from the global buffer.

Multiple actors running on different threads are used to generate trajectories. Each actor contains a copy of the environment, an individual ε-greedy policy based on the latest version of the Q-functions, and a local buffer. The actors use their ε-greedy policy to perform transitions in the environment. The transitions are stored in the local buffer. When the local buffer of an actor is filled, it is appended to the global buffer, the actor collects the latest Q-functions from the learner, and the simulation continues. A single learner continuously updates the Q-functions using the simulated trajectories from the global buffer.

The benefits of asynchronous methods in Deep Reinforcement Learning (DRL) are elaborated in (Mnih et al., 2016). Each actor can use a different exploration policy (different initial ε value and decay) in order to enhance the diversity of the collected samples, which leads to a more stable learning process. Additionally, it is shown that the total computational time scales linearly with the number of threads considered. Another major advantage is that distributed techniques were shown to have a super-linear speedup for one-step methods that is not only related to computational gains: it is argued that having multiple threads reduces the bias of one-step methods (Mnih et al., 2016). In this way, these algorithms are shown to be much more data efficient than their original versions.
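The following is a schematic, hedged sketch of such an actor/learner setup using Python threads and a bounded global buffer. The environment interaction, the ε-greedy rollout and the fitting step are stand-in stubs, so the sketch only illustrates the communication pattern, not the paper's implementation.

```python
# Asynchronous actors feeding a shared buffer consumed by a single learner.
import random, threading, time
from collections import deque

global_buffer = deque(maxlen=10_000)             # shared experience buffer
buffer_lock = threading.Lock()
stop = threading.Event()

def actor(epsilon):
    local = []
    while not stop.is_set():
        transition = (random.random(), random.random() < epsilon)  # stub rollout
        local.append(transition)
        if len(local) >= 50:                     # flush the local buffer
            with buffer_lock:
                global_buffer.extend(local)
            local.clear()

def learner():
    while not stop.is_set():
        with buffer_lock:
            batch = list(global_buffer)[-256:]
        if batch:
            pass                                 # fit the Q-functions on `batch` here
        time.sleep(0.01)

threads = [threading.Thread(target=actor, args=(eps,)) for eps in (0.9, 0.5, 0.1)]
threads.append(threading.Thread(target=learner))
for th in threads: th.start()
time.sleep(0.2); stop.set()
for th in threads: th.join()
print(f"collected {len(global_buffer)} transitions")
```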
5. Case study
The proposed methodology is applied to the case of a PHES unit. First, the parameters and the exogenous information used for the optimisation of the CID market participation of a PHES operator are described. Second, the benchmark strategy used for comparison purposes and the process that was carried out for validation are presented. Finally, performance results of the obtained policy are presented and discussed.
The proposed methodology is applied to a PHES unit participating in the German CID market with the following characteristics:

• maximum state of charge SoC_i^max = 200 MWh,
• minimum state of charge SoC_i^min,
• initial and terminal states of charge SoC_i^init = SoC_i^term, set as a fixed fraction of (SoC_i^max - SoC_i^min),
• maximum consumption and production levels C_i^max = G_i^max = 200 MW,
• minimum consumption and production levels C_i^min = G_i^min,
• charging/discharging efficiency η.

The discrete trading horizon has been selected to be ten hours, i.e. T = {17:00, ..., 03:00}. The trading time interval is selected to be Δt = 15 min. Thus, the trading process takes K = 40 steps until termination. Moreover, all 96 quarter-hourly products of the day, X = {Q_1, ..., Q_96}, are considered. Consequently, the delivery timeline is T̄ = {00:00, ..., 23:45}, with τ^init = 00:00 and τ^term = 24:00, and the delivery time interval is Δτ = 15 min. Each product can be traded until 30 minutes before the physical delivery of electricity begins (e.g. t^close(Q) = 23:30).

The number of days selected for training was |L_train| = 252 days. For the construction of the training and test sets, the days were sampled uniformly without replacement from a pool of 362 days. The number of simulated episodes for each training day was set to a fixed value E. Exploration during training follows an ε-greedy policy: as described in Section 4.8, each of the actor threads is provided with a different exploration parameter ε, initialised by uniform random sampling from a fixed interval. The parameter ε is then annealed exponentially until a zero value is reached.

The pseudo-state z_{i,t} = (s′_{i,0}, a′_{i,0}, r_{i,0}, ..., a′_{i,t-1}, r_{i,t-1}, s′_{i,t}) ∈ Z_i is composed of the entire history of observations and actions up to time-step t, as described in Section 4.5. For the sake of memory requirements, as explained in Section 4.7, we assume that the last ten trading steps contain sufficient information about the past. Thus, the pseudo-state is transformed into sequences of fixed length h̄_max = 10.

The exogenous variable w^exog_{i,t} represents any relevant information available to agent i about the system. In this case study, we assumed that the variable w^exog_{i,t} contains:

• the 24 values of the day-ahead price for the entire trading day,
• the imbalance price and the system imbalance for the four quarters preceding each time-step t,
• time features: i) the hour of the day, ii) the month, and iii) whether the traded day is a weekday or a weekend day.
The strategy selected for comparison purposes is the "rolling intrinsic" policy, denoted by π^RI (Lohndorf & Wozabal, 2015). According to this policy, the agent selects at each trading time-step t the action "Trade", as described in Section 4.4. This benchmark is selected because it represents the current state of the art used in the industry for the optimisation of the market participation of a PHES unit.

The performance of the policy obtained using the fitted Q iteration algorithm, denoted by π^FQ, is evaluated on a test set L_test that contains historical data from 110 days. These days are not used during the training process. This process of back-testing a strategy on historical data is widely used because it can provide a measure of how successful a strategy would have been if it had been executed in the past. However, there is no guarantee that this performance can be expected in the future. This validation process relies heavily on Assumption (8) about the inability of the agent to influence the behaviour of the other players in the market. It can still provide an approximation of the results of the obtained policy before deploying it in real life; however, the only way to evaluate the exact viability of a strategy is to deploy it in real life.

We compare the performance of the policy obtained by the fitted Q iteration algorithm, π^FQ, and of the "rolling intrinsic" policy, π^RI. The comparison is always based on the computation of the return of the policies on each day. For a given policy, the return over a day is simply computed by running the policy on the day and summing up the rewards obtained.

The learning algorithm that we have designed for learning a policy has two sources of variance, namely those related to the generation of the new trajectories and those related to the learning of the Q-functions from the set of trajectories. To evaluate the performance of our learning algorithm we perform several runs and average the performance of the learned policies. In the following, when we report the performance of a fitted Q iteration policy over one day, we actually report the average performance of ten learned policies over this day.

We now describe the different indicators that are used afterwards to assess the performance of our method. These indicators are computed for both the training set and the test set, but are detailed hereafter as computed for the test set; it is straightforward to adapt the procedure for the training set. Let V^{πFQ}_d and V^{πRI}_d denote the total return of the fitted Q and of the "rolling intrinsic" policy for day d, respectively. We gather the obtained returns of each policy for each day d ∈ L_test. We sort the returns in ascending order and obtain an ordered set containing |L_test| values for each policy. We provide descriptive statistics about the distribution of the returns of each policy, V^{πFQ}_d and V^{πRI}_d, on the test set L_test. In particular, we report the mean, the minimum and the maximum values achieved for the set considered. Moreover, we provide the values obtained for each of the quartiles (25%, 50% and 75%) of the set. Additionally, we compute the sum of returns over the entire set of days as follows:

V^{πFQ} = Σ_{d ∈ L_test} V^{πFQ}_d,   (52)
V^{πRI} = Σ_{d ∈ L_test} V^{πRI}_d.   (53)
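A small sketch of these evaluation indicators is given below: the descriptive statistics of the daily returns, the total returns of equations (52)-(53), and the profitability ratios introduced next in equations (54)-(55). It assumes the return vectors simply hold one value per test day.

```python
# Descriptive statistics, total returns and profitability ratios.
import numpy as np

def describe(returns):
    r = np.sort(np.asarray(returns, dtype=float))
    return {"mean": r.mean(), "min": r.min(),
            "25%": np.percentile(r, 25), "50%": np.percentile(r, 50),
            "75%": np.percentile(r, 75), "max": r.max(), "sum": r.sum()}

def profitability(v_fq, v_ri):
    v_fq, v_ri = np.asarray(v_fq, float), np.asarray(v_ri, float)
    r_d = (v_fq - v_ri) / v_ri * 100.0                         # eq. (54), per day
    r_sum = (v_fq.sum() - v_ri.sum()) / v_ri.sum() * 100.0     # eq. (55)
    return r_d, r_sum

v_ri = [1000.0, 800.0, 1200.0]     # toy daily returns of the benchmark
v_fq = [1030.0, 790.0, 1250.0]     # toy daily returns of the fitted Q policy
print(describe(v_fq), profitability(v_fq, v_ri)[1])
```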
An alternative performance indicator considered is the discrepancy of the returns of the fitted Q policy with respect to the risk-averse rolling intrinsic policy. We define the profitability ratio r_d for each day d ∈ L_test as the signed percentage difference between the two policies:

r_d = (V^{πFQ}_d - V^{πRI}_d) / V^{πRI}_d · 100.   (54)

In a similar fashion, we sort the profitability ratios obtained for each day in the test set and we provide descriptive statistics about their distribution across the set. The mean, minimum and maximum values of the profitability ratio, as well as the values of each quartile, are reported. Finally, we compute the profitability ratio for the sum of returns over the entire set between the two policies as:

r_sum = (V^{πFQ} - V^{πRI}) / V^{πRI} · 100.   (55)

The performance indicators described previously are computed for both the training and the test set. The results obtained are summarised in Tables 3 and 4, which present descriptive statistics about the distribution of the returns from both policies as well as the profitability ratios for each dataset.

It can be observed that on average π^FQ yields better returns than π^RI, both on the training and on the test set. More specifically, on the training set, the obtained policy performs on average 2.56% better than the "rolling intrinsic" policy. For the top 25% of the training days the profitability ratio is higher than 3.69%, and in some cases it even exceeds 10%. Overall, the total profits coming from the fitted Q policy amount to a 1.5% greater profit on the test set with respect to the returns of the "rolling intrinsic" policy. It is important to highlight that for 50% of the test days the profits from the fitted Q policy are more than 1% higher than those of the "rolling intrinsic" policy. The difference between the total profits resulting from the two policies over the set of 110 days considered corresponds to a profitability ratio of 1.5%. The distribution of the daily profitability ratio on the test set is shown in Figure 7, where it can be observed that the distribution has a positive skew.
Table 3. Descriptive statistics (mean, minimum, quartiles, maximum and sum) of the returns (in €) obtained on the days of the training set for policies π^FQ and π^RI. The last column also provides the corresponding profitability ratios r (%).

Table 4. Descriptive statistics (mean, minimum, quartiles, maximum and sum) of the returns (in €) obtained on the days of the test set for policies π^FQ and π^RI. The last column also provides the corresponding profitability ratios r (%).
6. Discussion
In this section, we provide some remarks related to the practical challenges encountered and the validity of the assumptions considered throughout this paper.
In this paper, we assumed (Assumption 1) that the rest of the agents -i post orders in the market based on their needs and on some historical information about the state of the order book. In reality, the information that the other agents possess is not accessible to agent i. This fact gives rise to issues related to the validity of the assumption that the process is Markovian.
Figure 7.
Profitability ratio.
We further assumed (Assumption 8, Section 4.6) that the behaviour of agent i does not influence the strategy of the other agents -i. Based on this assumption, the training and the validation process were performed using historical data. However, the strategy of each market participant is highly dependent on the actions of the other participants, especially in a market with limited liquidity such as the CID market. These assumptions, although slightly unrealistic and optimistic, provide us with a meaningful testing protocol for a trading strategy. The actual profitability of a strategy can only be obtained by deploying the strategy in real time; however, it is important to first show that the strategy is able to obtain substantial profits in back-testing.

In Section 3, the decision-making problem studied in this paper was framed as an MDP after considering certain assumptions. Theoretically, this formulation is very convenient, but it does not hold in practice. This is especially true for Assumption 7, where the reduced pseudo-state is assumed to contain all the relevant information required. Indeed, the trading agents do not have access to all the information required. For instance, a real agent does not know how many other agents are active in the market, nor does it know the strategy of each agent. There is also a lot of information, gathered in w^exog, that is not available to the agent. Finally, the fact that the state space was reduced results in an inevitable loss of information. Therefore, it would be more accurate to consider a Partially Observable Markov Decision Process (POMDP) instead. In a POMDP, the real state is hidden and the agent only has access to observations. For an RL algorithm to work properly with a POMDP, the observations have to be representative of the real hidden states.

There are two main issues related to the state space exploration that result in the somewhat limited performance of the obtained policy. First, in the described setting, the way in which we generate the artificial trajectories is very important for the success of the method. The generated states must be "representative", in the sense that the areas around these states are visited often under a near-optimal policy (Bertsekas, 2005). In particular, the frequency of appearance of these areas of states in the training process should be proportional to their probability of occurrence under the optimal policy. However, in practice, we are not in a position to know which areas are visited by the optimal policy. In that respect, the asynchronous distributed algorithm used in this paper was found to successfully address the issue of state exploration.

Second, the assumptions related to the operation of the storage device according to the "default" strategy without any imbalances allowed (Assumptions 3, 4, 5), as well as the participation of the agent as an aggressor, are restrictive with respect to the set of all admissible policies. Additionally, the adoption of the reduced discrete action space described in Section 4.4 introduces further restrictions on the set of available actions. Although having a small and discrete space is convenient for the optimisation process, it leads to limited state exploration. For instance, the evolution of the state of charge of the storage device is always given as the output of the optimisation model based on the order book data. Thus, in this configuration, it is not possible to explore all areas of the state space (storage levels), but only a certain area driven by the historical order book data. However, evaluating the policy on a different dataset might lead to areas of the state space (e.g. storage levels) that are never visited during training, leading to poor performance. Potential mitigations of this issue involve diverse data augmentation techniques and/or a different representation of the action space.
7. Conclusions and future work
In this paper, a novel RL framework for the participation of a storage device operator in the CID market is proposed. The energy exchanges between market participants occur through a centralized order book. A series of assumptions related to the behaviour of the market agents and the operation of the storage device are considered. Based on these assumptions, the sequential decision-making problem is cast as an MDP. The high dimensionality of both the action and the state spaces increases the computational complexity of finding a policy. Thus, we motivate the use of discrete high-level actions that map into the original action space. We further propose a more compact state representation. The resulting decision process is solved using fitted Q iteration, a batch-mode reinforcement learning algorithm. The results illustrate that the obtained policy is a low-risk policy that is able to outperform, on average, the state-of-the-art industry benchmark strategy ("rolling intrinsic") by 1.5% on the test set. The proposed method can serve as a wrapper around current industrial practices that provides decision support to energy trading activities with low risk.

The main limitations of the developed strategy originate from: i) the insufficient amount of relevant information contained in the state variable, either because the proposed state reduction leads to a loss of information or because some information is unavailable, and ii) the limited state space exploration resulting from the proposed high-level actions in combination with the use of historical data. To this end, and as future work, a more detailed and accurate representation of the state should be devised. This can be accomplished by increasing the amount of information considered, such as RES forecasts, and by improving the order book representation. We also propose the use of continuous high-level actions in an effort to improve state exploration without leading to a very complex and high-dimensional action space.
References
A¨ıd, R., Gruet, P., and Pham, H. An optimal trading prob-lem in intraday electricity markets.
Mathematics andFinancial Economics , 10(1):49–85, 2016.Baillo, A., Ventosa, M., Rivier, M., and Ramos, A. Optimaloffering strategies for generation companies operating inelectricity spot markets.
IEEE Transactions on PowerSystems , 19(2):745–753, May 2004. ISSN 0885-8950.doi: 10.1109/TPWRS.2003.821429.Balardy, C. German continuous intraday market: Ordersbook’s behavior over the trading session. In
Meetingthe Energy Demands of Emerging Economies, 40th IAEEInternational Conference, June 18-21, 2017 . InternationalAssociation for Energy Economics, 2017a.Balardy, C. An analysis of the bid-ask spread in the germanpower continuous market. In
Heading Towards Sustain-able Energy Systems: Evolution or Revolution?, 15thIAEE European Conference, Sept 3-6, 2017 . InternationalAssociation for Energy Economics, 2017b.Bertrand, G. and Papavasiliou, A. Adaptive trading in con-tinuous intraday electricity markets for a storage unit.
IEEE Transactions on Power Systems , pp. 1–1, 2019.ISSN 1558-0679. doi: 10.1109/TPWRS.2019.2957246.Bertsekas, D. P.
Dynamic programming and optimal control ,volume 1. Athena scientific Belmont, MA, 2005.Boomsma, T. K., Juul, N., and Fleten, S.-E. Bid-ding in sequential electricity markets: The Nordic case.
European Journal of Operational Research ,238(3):797 – 809, 2014. ISSN 0377-2217. doi:http://dx.doi.org/10.1016/j.ejor.2014.04.027. URL .Borggrefe, F. and Neuhoff, K. Balancing and intradaymarket design: Options for wind integration. DIWDiscussion Papers 1162, Berlin, 2011. URL http://hdl.handle.net/10419/61319 .Braun, S. and Hoffmann, R. Intraday Optimization ofPumped Hydro Power Plants in the German ElectricityMarket. In
Energy Procedia , volume 87, pp. 45–52, 2016.doi: 10.1016/j.egypro.2015.12.356.Busoniu, L., Babuska, R., De Schutter, B., and Ernst, D.
Reinforcement learning and dynamic programming usingfunction approximators . CRC press, 2017.EPEXSPOT. Market data intraday continuous,2017. URL .EPEXSPOT. EPEXSPOT Operational rules, 2019.URL .Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batchmode reinforcement learning.
Journal of Machine Learn-ing Research , 6(Apr):503–556, 2005.Fleten, S. E. and Kristoffersen, T. K. Stochastic program-ming for optimizing bidding strategies of a Nordic hy-dropower producer.
European Journal of OperationalResearch , 181(2):916 – 928, 2007. ISSN 0377-2217. doi:http://dx.doi.org/10.1016/j.ejor.2006.08.023.Garnier, E. and Madlener, R. Balancing forecast errors incontinuous-trade intraday markets.
Energy Systems , 6(3):361–388, 2015.G¨onsch, J. and Hassler, M. Sell or store? An ADPapproach to marketing renewable energy.
OR Spec-trum , 38(3):633–660, 2016. ISSN 14366304. doi:10.1007/s00291-016-0439-x.Goodfellow, I., Bengio, Y., and Courville, A.
DeepLearning . MIT Press, 2016. .Hagemann, S. Price determinants in the German intradaymarket for electricity: an empirical analysis.
Journal ofEnergy Markets , 2015.Hassler, M. Heuristic decision rules for short-term trad-ing of renewable energy with co-located energy storage.
Computers and Operations Research , 83, 2017. ISSN03050548. doi: 10.1016/j.cor.2016.12.027.
Deep Reinforcement Learning Framework for Continuous Intraday Market Bidding
Henriot, A. Market design with centralized wind powermanagement: Handling low-predictability in intradaymarkets.
Energy Journal , 35(1):99–117, 2014. ISSN01956574. doi: 10.5547/01956574.35.1.6.Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel,M., Van Hasselt, H., and Silver, D. Distributed priori-tized experience replay. arXiv preprint arXiv:1803.00933 ,2018.Jiang, D. R. and Powell, W. B. Optimal hour-ahead biddingin the real-time electricity market with battery storageusing Approximate Dynamic Programming. pp. 1–28,2014. ISSN 1091-9856. doi: 10.1287/ijoc.2015.0640.URL http://arxiv.org/abs/1402.3575 .Karanfil, F. and Li, Y. The role of continuous intradayelectricity markets: The integration of large-share windpower generation in Denmark.
Energy Journal , 38(2):107–130, 2017. ISSN 01956574. doi: 10.5547/01956574.38.2.fkar.Le, H. L., Ilea, V., and Bovo, C. Integrated Euro-pean intra-day electricity market: Rules, modeling andanalysis.
Applied Energy , 238(June 2018):258–273,2019. ISSN 03062619. doi: 10.1016/j.apenergy.2018.12.073. URL https://doi.org/10.1016/j.apenergy.2018.12.073 .Lohndorf, N. and Wozabal, D. Optimal gas storage valua-tion and futures trading under a high-dimensional priceprocess. Technical report, Technical report, 2015.Meeus, B. L. and Schittekatte, T. The EU Electricity Net-work Codes: Course text for the Florence School of Reg-ulation online course. (October), 2017.Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap,T., Harley, T., Silver, D., and Kavukcuoglu, K. Asyn-chronous methods for deep reinforcement learning. In
International conference on machine learning , pp. 1928–1937, 2016.Neuhoff, K., Ritter, N., Salah-Abou-El-Enien, A., and Vas-silopoulos, P. Intraday markets for power: discretizingthe continuous trading? 2016.Pand ˇZi´c, H., Morales, J. M., Conejo, A. J., and Kuzle,I. Offering model for a virtual power plant based onstochastic programming.
Applied Energy , 105:282–292,2013. ISSN 03062619. doi: 10.1016/j.apenergy.2012.12.077.P´erez Arriaga, I. and Knittel et al, C.
Utility of the fu-ture. An MIT Energy Initiative response . 2016. ISBN9780692808245. URL energy.mit.edu/uof . Plazas, M. A., Conejo, A. J., and Prieto, F. J. Multimarketoptimal bidding for a power producer.
IEEE Transactionson Power Systems , 20(4):2041–2050, Nov 2005. ISSN0885-8950. doi: 10.1109/TPWRS.2005.856987.Scharff, R. and Amelin, M. Trading behaviour on the contin-uous intraday market Elbas.
Energy Policy , 88:544–557,2016. ISSN 03014215. doi: 10.1016/j.enpol.2015.10.045.Spot, N. Xbid cross-border intra day market project,2018. URL .The European Commission. 2030 en-ergy strategy, 2017. URL https://ec.europa.eu/energy/en/topics/energy-strategy-and-energy-union/2030-energy-strategy .Watkins, C. J. and Dayan, P. Q-learning.
Machine learning ,8(3-4):279–292, 1992.
A. Nomenclature
Acronyms
ADP   Approximate Dynamic Programming.
CID   Continuous Intraday.
DRL   Deep Reinforcement Learning.
FCFS  First Come First Served.
MDP   Markov Decision Process.
OB    Order Book.
PHES  Pumped Hydro Energy Storage.
RES   Renewable Energy Sources.
Sets and indexes
Name Description i Index of an agent. − i Index of all the agents except agent i . j Index of an order. m Index of a sample of quadruples. d Index of a day in a set. t Trading time-step. τ Discrete time-step of delivery. A Joint action space for all the agents. A i Action space of agent i . A − i Action space of the rest of the agents − i . A redi Reduced action space of agent i . A (cid:48) i Set of high-level actions for agent i .¯ A i Set of all factors for the partial/full acceptanceof orders by agent i . Deep Reinforcement Learning Framework for Continuous Intraday Market Bidding E Set of conditions that can apply to an order. F Set of all sampled trajectories. F (cid:48) Set of sampled one-step transitions. F (cid:48) t Set of sampled one-step transitions for time t . H i Set of all histories for agent i . I Set of agents. L train Set of trading days used to train the agent. L test Set of trading days used to evaluate the agent. N t Set of all available order unique indexes at time t . N (cid:48) t Set of all the unique indexes of new ordersposted at time t . N τ Set of all the unique indexes of orders for deliv-ery at τ . O t Set of all available orders in the order book attime t . S OB Set of all available orders in the order book. S (cid:48) OB Low dimensional set of all available orders inthe order book. S i State space of agent i . T Trading horizon, i.e. time interval between firstpossible trade and last possible trade. T ( x ) Discretization of the trading timeline for product x .¯ T Discretization of the delivery timeline.¯ T ( t ) Discretization of the delivery timeline at tradingstep t . T Imb
Discretization of the imbalance settlement time-line. X Set of all available products. X t Set of all available products at time t . Z i Set of pseudo-states for agent i . Π Set of all admissible policies.
Parameters
Name Description C i Maximum consumption level for the asset ofagent i . C i Minimum consumption level for the asset ofagent i . E Number of episodes. e Conditions applying on an order other than vol-ume and price. ep Number of simulations between two successiveQ function updates. decay
Parameter for the annealing of ε . G i Maximum production level for the asset of agent i . G i Minimum production level for the asset of agent i .¯ h Sequence length of past information. ¯ h max Maximum sequence length of past information. I ( τ ) Imbalance price for delivery period δ ( x ) . K Number of steps in the trading period. M Number of samples of quadruples. n Number of agents. o t Market order. p Price of an order. p max Maximum price of an order. p min Minimum price of an order.
SoC_i^max   Maximum state of charge of the storage device.
SoC_i^min   Minimum state of charge of the storage device.
SoC_i^init  State of charge of the storage device at the beginning of the delivery timeline.
SoC_i^term  State of charge of the storage device at the end of the delivery timeline.
Variables
Name Description a t Joint action from all the agents at time t . a i , t Action of posting orders by agent i at time t . a − i , t Action of posting orders by the rest of the agents − i at time t . a (cid:48) i , t High-level action by agent i at time t . a ji , t Acceptance (partial/full) factor for order j byagent i at time t .¯ a i , t Factors for the partial/full acceptance of all or-ders by agent i at time t . C i , t ( τ ) Consumption level at delivery time-step τ com-puted at time t . c i , t ( t (cid:48) ) Consumption level during the delivery interval. e i , t Random disturbance for agent i at time t . Deep Reinforcement Learning Framework for Continuous Intraday Market Bidding G i , t ( τ ) Generation level at delivery time-step τ com-puted at t . g i , t ( t (cid:48) ) Generation level during the delivery interval. h i , t History vector of agent i at time t . k i , t ( τ ) Binary variable that enforces either charging ordischarging of the storage device. P mari , t ( x ) Net contracted power of agent i for product x (delivery time-step τ ) at time t . P resi , t ( τ ) Residual production of agent i delivery time-step τ (for product x ) at time t . P resi ( τ ) Final residual production of agent i for product r i , t Instantaneous reward of agent i at time t . r d Profitability ratio at day d . r sum Profitability ratio for the sum of returns over set. s i , t State of agent i at time t . SoC i , t ( τ ) State of charge of device at delivery time-step τ computed at t . s OBt
State of the order book at time t . s (cid:48) OBt
Low dimensional state of the order book at time t . s privatei , t Private information of agent i at time t .¯ s t Triplet of fixed size, part of pseudo-state z i , t thatserves as an input at LSTM at time t . u i , t Aggregate (trading and asset) control action ofthe asset trading agent i at time t . v coni , t ( x ) Volume of product x contracted by agent i attime t . w exogi , t Exogenous information of agent i at time t . z i , t Pseudo-state for agent i at time t . ∆ i , t ( τ ) Imbalance for delivery time τ for agent i com-puted at time t . ∆ i ( τ ) Final imbalance for delivery time τ for agent i . ∆ G i , t Change in the production level for the asset ofagent i at time t . ∆ C i , t : Change in the consumption level for the assetof agent i at time t . Functions
Name Description clear ( · ) Market clearing function. b ( · ) Univariate stochastic model for exogenous in-formation. e ( · ) Encoder that maps from the original state space H i to pseudo-state space Z i . f ( · ) Order book transition function. G ζ ( · ) Revenue collected over a trajectory. g ( · ) System dynamics of the MDP. k ( · ) System dynamics of asset trading process. l ( · ) Reduced action space construction function. P a − i , t ( · ) Probability distribution function for the actionsof the rest of the agents − i . P e t ( · ) Random disturbance probability distributionfunction. P ( · ) Transition probabilities of the MDP. P FQ ( · ) The stochastic process (algorithm) of fitted Qiteration. P θ t , ( · ) Distribution of the initial parameters θ t , . p ( · ) Mapping from high-level actions A (cid:48) i to the re-duced action space A redi . Q t ( · , · ) State-action value function at time t .ˆ Q ( · , · ) Sequence of Q-function approximations. R ( · ) Reward function. u ( · ) Signing convention for the volume wrt. the side(‘Buy” or ‘Sell”) of each order. V π i ( · ) Total expected reward function for policy π i . V π FQ d ( · ) Return of the fitted Q policy π FQi for day d . V π RI d ( · ) Return of the “rolling” intrinsic policy π RIi forday d . µ t ( · ) Policy function at time t . π i ( · ) Policy followed by agent i . ρ ( · ))