Neural Fitted Q Iteration based Optimal Bidding Strategy in Real Time Reactive Power Market
JAHNVI PATEL, Department of Computer Science and Engineering & Robert Bosch Centre for Data Science and AI, IIT Madras
DEVIKA JAY, Department of Electrical Engineering, IIT Madras
BALARAMAN RAVINDRAN, Department of Computer Science and Engineering & Robert Bosch Centre for Data Science and AI, IIT Madras
K. SHANTI SWARUP, Department of Electrical Engineering, IIT Madras

In real-time electricity markets, the objective of generation companies (GENCOs) while bidding is to maximise their profit. Strategies for learning optimal bids have been formulated through game theoretical approaches and stochastic optimisation problems. In game theoretical approaches, the payoffs of rivals are assumed to be known, which is unrealistic. In stochastic optimisation methods, a suitable distribution function for rivals' bids is assumed, and the uncertainty in the network is handled by assuming the process to be a Markov Decision Process with known state transition probabilities. Similar studies in reactive power markets have not been reported so far, because network voltage operating conditions have a greater impact on reactive power markets than on active power markets. Contrary to active power markets, the bids of rivals are not directly related to fuel costs in reactive power markets. Hence, the assumption of a suitable probability distribution function is unrealistic, making the strategies adopted in active power markets unsuitable for learning optimal bids in reactive power market mechanisms. Therefore, a bidding strategy has to be learnt from market observations and experience in imperfect oligopolistic competition based markets. In this paper, a pioneering work on learning optimal bidding strategies from observation and experience in a three-stage reactive power market is reported. The proposed learning method does not assume any probability distribution to model the uncertainties that affect one's bidding strategy. For learning optimal bidding strategies, a variant of Neural Fitted Q Iteration with prioritized experience replay and a target network (NFQ-TP) is proposed. The total reactive power requirement is estimated by the learning agent through a Long Short Term Memory (LSTM) network to suitably define the state space for the learning agent. The learning technique is tested on the three-stage reactive power mechanism of the IEEE 30-bus power system test bed under different scenarios and NFQ network configurations. The simulation results show that the technique is suitable for learning optimal bidding strategies from one's own experience and market observations.

CCS Concepts: • Applied computing → Engineering; • Computing methodologies → Reinforcement learning; Model development and analysis.

Additional Key Words and Phrases: Reactive Power Market, Optimal Bidding Strategy, Deep Reinforcement Learning, Neural Fitted Q Iteration, Prioritized Experience Replay, LSTM
Authors' addresses: Jahnvi Patel, [email protected], Department of Computer Science and Engineering & Robert Bosch Centre for Data Science and AI, IIT Madras; Devika Jay, [email protected], Department of Electrical Engineering, IIT Madras; Balaraman Ravindran, [email protected], Department of Computer Science and Engineering & Robert Bosch Centre for Data Science and AI, IIT Madras; K. Shanti Swarup, [email protected], Department of Electrical Engineering, IIT Madras.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2021 Association for Computing Machinery.
XXXX-XXXX/2021/1-ART $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
ACM Reference Format:
Jahnvi Patel, Devika Jay, Balaraman Ravindran, and K. Shanti Swarup. 2021. Neural Fitted Q Iteration based Optimal Bidding Strategy in Real Time Reactive Power Market. 1, 1 (January 2021), 23 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
With the advent of advanced information technology and the integration of small-scale renewable energy sources into power supply systems, significant changes in power grid operations and planning are being implemented worldwide. Deregulation of vertically-integrated-utility based grid operations is considered a major revolution in the power sector. It has increased the participation of private companies in maintaining grid operation levels within permissible limits. The grid operator functions as an Independent System Operator (ISO) that procures energy from private producers and schedules generation and demand in real time. This forms an electricity market. Ancillary services, like network voltage support through reactive power procurement, spinning reserve maintenance, etc., are also being achieved through market mechanisms. Among these services, reactive power procurement has been regarded as one of the most important ancillary service markets [5]. However, implementing a real-time pricing mechanism for reactive power has been challenging due to the localised nature of reactive power. Reactive power supply has to be provided locally to support reactive power demand and maintain system-wide bus voltage magnitudes within permissible limits. This local requirement, however, enables the exercise of market power by participants. Under critical operating conditions, participants may submit high bids, resulting in highly volatile price signals in the market. Hence, mechanisms that preserve incentive compatibility and individual rationality are essential for reactive power markets [8]. Such incentive-compatible markets provide limited information to market participants, and hence learning optimal bidding strategies is a challenge.

Most of the work so far on optimal bidding strategies for reactive power markets focuses on game theoretic approaches. The optimal bidding strategy for a reactive power market was studied under conjectured supply function models in a multi-leader-follower game setting [4]. Estimation of electricity suppliers' cost functions in the day-ahead real power market, using an inverse optimization method based on historical bidding data, is discussed in [3]. In [10], the behaviour of generation companies participating in a reactive power market is simulated as a multi-leader-follower game. In [7], Q-learning is used to strategize bidding in market models based on DC-OPF. Deep Q Networks (DQNs) are used to find the optimal bidding strategy for generators in single-sided energy markets in [19].

Bidding strategies discussed in the literature so far considered a simple pricing mechanism where, in most cases, uniform price signals were issued to generation companies without considering location features and real-time reactive power requirements. However, these bidding strategies fail when the reactive power pricing mechanism issues nodal prices, i.e., prices that vary with the location of the generation companies. This makes price signals highly correlated with preceding operating conditions like active power demand from loads, network topology changes, and voltage magnitude profiles. Also, the actual cost function for reactive power generation of rival generation companies cannot be estimated directly from fuel costs.
The cost of reactive power generation constitutes only a small percentage of the fuel cost, and fuel costs are directly related to active power generation.

Hence, learning agents, i.e., generation companies, are exposed to an environment that provides limited access to information regarding the market pricing mechanism. The imperfect information setup in the market, and the dependency of the current state of the market on the preceding sequence of states, make learning bidding strategies a challenge. The assumption of a suitable probability distribution function to model system uncertainties and rivals' bids is unrealistic in reactive power markets, due to the inherent features of the market discussed above. Learning bidding strategies in an incentive-compatible three-stage reactive power market mechanism [8] has not been reported so far. This paper presents a pioneering work in this regard by proposing a deep reinforcement learning technique for a single learning agent in an environment (a three-stage reactive power market mechanism) constituting competing generation companies that do not share information. The proposed learning technique is summarised in Fig. 1.
Fig. 1. Learning optimal bidding strategies in reactive power market

The learning agent (a GENCO) acts on the environment (the reactive power market mechanism) by submitting its optimal bids. The feedback that it receives from the environment is the payment, which serves as the reward signal. The additional inputs that the learning agent needs to learn an optimal bidding strategy are market observations and experience. This defines the framework of the learning technique detailed in this paper.

The main contributions of this work are summarised below:
(1) Design of an optimal bidding strategy suitable for incentive-compatible reactive power markets, without assuming an unrealistic probability distribution function for rivals' bids and system uncertainties. The imperfect oligopolistic competition among participants adds further complexity to the market environment. A Neural Fitted Q Iteration with Target network and Prioritized experience replay (NFQ-TP) learning algorithm is proposed to learn an optimal bidding strategy from market observations and experience, such that the profit of the reactive power producer is maximized. The proposed NFQ-TP increases learning stability using a target network and improves sampling efficiency using prioritized experience replay.
(2) The market environment, which can be formulated as a higher-order Markov Decision Process, is converted to a first-order Markov Decision Process by a suitable definition of the state space, in which the total reactive power requirement in the system is predicted using an LSTM network.
(3) An approach using Long Short Term Memory (LSTM) networks to estimate the current reactive power requirement based on the total quantity generated by GENCOs in past episodes.

The rest of the paper is organised as follows: A review of the application of deep learning techniques in the electricity market is presented in Section 2. Section 3 provides a brief background on the reactive power market model in which the bidding strategies are to be learned by the market participant. The learning environment of the market participant (agent) is proposed in Section 4.
Section 5 presents the proposed Neural Fitted Q Iteration with Target Network and Prioritized Experience Replay algorithm for learning optimal bidding strategies. Experimental results and observations are discussed in Section 6. In Section 7, brief conclusions are drawn and future work is discussed.
2 LITERATURE REVIEW
Participants like generation companies, customers, etc., have been learning bidding strategies since the implementation of electricity markets based on competitive bidding. The participants aim to utilise varying load scenarios to earn profit by submitting optimal bids. Several works formulated optimal bidding strategies as optimisation problems solved using heuristic methods [1, 21]. To consider the uncertainty of a rival's behaviour while determining bidding strategies, minimisation of a normal probability distribution function of the rival's bid is generally adopted; this minimisation problem has been solved using a gravitational search algorithm [18]. Determining monthly bidding strategies was formulated as a bi-level problem in [9]. Based on conditional value at risk, a stochastic bi-level optimization was proposed in [14] for coordinated wind power and gas turbine units in the real-time market.

The day-ahead bidding problem can also be considered as a Markov decision process (MDP). In [22], a variant of the stochastic dual dynamic programming algorithm was used to solve the MDP. With advancements in artificial intelligence, reinforcement learning has found significant application in electricity market modeling [12, 13]. Deep Deterministic Policy Gradient with prioritized experience replay was proposed for complex energy markets [23]. The asynchronous advantage actor-critic (A3C) method was proposed in [2] to determine optimal bidding strategies for wind farms in short-term electricity markets.

The aforementioned methodologies are suitable for participants in energy markets. There has been no extensive work on the formulation of an optimal bidding strategy for participants in reactive power markets, mainly because of the complexity associated with the reactive power market. These complexities arise from the difficulty in determining the actual requirement of reactive power in the system. In energy markets, the quantity to be produced by the participant can be determined directly from load forecasting techniques, owing to active power load pattern features. In reactive power markets, by contrast, the requirement of reactive power is determined by the reactive power loading pattern and other features such as active power loading, network topology, system operating conditions like voltage and frequency, and the location of the participant. Also, estimating the cost function of rivals through probability distribution functions or supply function methods is not easy in reactive power markets, because the cost functions of market participants are not directly related to fuel, contrary to energy markets.

Hence, the general approach of modelling system uncertainties and rivals' bids with a suitable probability distribution function, and then defining the optimal bidding strategy as the solution to a stochastic optimisation problem, is unrealistic in reactive power market mechanisms. Treating the reactive power market as a first-order Markov Decision Process (MDP) is also not valid, as state variables in the reactive power market are highly correlated not only with the current state but also with the past sequence of states.
Also, incentive compatibility and the limited information setup in the market, introduced to mitigate the exercise of market power by participants, make learning optimal bidding strategies a challenge. These issues lead to the requirement of formulating a new framework for learning optimal bidding strategies for participants in the reactive power market, based on market observation and experience.
3 REACTIVE POWER MARKET MODEL
The design of reactive power markets involves defining a mechanism that effectively coordinates the interaction between several generation companies (GENCOs) and the Independent System Operator (ISO). In this work, a three-stage reactive power market mechanism is considered [8], in which GENCOs submit their bids in the following format:
• Operation cost ($/MVAr)
• Lost opportunity cost (LOC, $/MVAr²)
Based on the bids received, the ISO issues a price signal to each participant. The participants then respond with their optimal generation schedules. The stages of the market mechanism are summarised below and depicted in Fig. 2.

Fig. 2. Three stage reactive power market model

• At the start of every hour, reactive power generators, i.e., market participants, submit their price bids to the ISO for the next hour. The two-dimensional bids have the form $(b1_i, b2_i)$: $b1_i$ is the operation cost (the linear term in the cost function of the participant) and $b2_i$ is the lost opportunity cost (the quadratic term in the cost function of the participant).
• The ISO responds with an individual price signal to each GENCO, such that the total payment is minimized while meeting the demand and other criteria.
• The GENCOs respond by submitting the quantities they would generate at the next time step.
The last two steps are repeated until all the market conditions have been satisfied.

As mentioned, each GENCO receives a separate price signal, also known as nodal pricing, and the generation scheduling is done based on the cost function of each GENCO. The cost function of each GENCO is private information and is not revealed. Also, estimating the cost function of rivals from fuel costs, as in energy markets, is not suitable in the reactive power market, because the cost of reactive power is not directly related to fuel costs. These market features pose challenges in learning due to partial observability. The GENCOs also have limited knowledge of the network topology and the system operating conditions that result in different price signals at different nodes of the network. The reactive power requirement in the system depends on the voltage profile, in addition to the network topology and the apparent power demand from loads. Hence, the conditional probability distribution of future states depends on a sequence of preceding states. Thus, the agents must learn their optimal bidding strategy in a partially observable environment with an imperfect competition based market mechanism.
4 LEARNING ENVIRONMENT
As described in earlier sections, learning an optimal bidding strategy in the reactive power market becomes practical only when the learning process accounts for the following features of the market environment:
(1) Higher-order Markov Decision Process (MDP)
(2) Imperfect information setup
(3) Imperfect competition
The higher-order MDP formulation and the imperfect information in the market are to be handled so that the market environment translates to a reinforcement learning setup. The feature of imperfect competition is discussed in detail in the IEEE 30-bus system simulation studies.

The modelling of the reactive power market environment to suit a reinforcement learning setup is presented in this section. This section describes key definitions crucial to the discussion of the methodologies presented in the paper and formulates a first-order MDP for the optimal bidding problem.
4.1 Market Analysis
In a reinforcement learning setting, an MDP is characterized by a set of states $S$, a set of actions $A$, a transition probability function $p(s_{t+1} | s_t, a_t)$ satisfying the Markov property, which describes the probability of the learning agent being in state $s_{t+1}$ on taking action $a_t$ in state $s_t$, and an immediate reward function $S \times A \to R$. At any time step, the goal of the learning agent is to choose an action $a_t$ according to a policy $\pi^*(s_t)$ that maximizes the expected return $G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$, where $\gamma$ is the discount factor that determines the trade-off between long-term and short-term rewards. This is done by following a policy that maximizes the action value function $Q(s_t, a_t) = E[G_t | s_t, a_t, \pi]$ for the given state $s_t$.

Translating the above to the proposed market setting, the GENCO for which the optimal bidding strategy is to be learnt from market observations constitutes the learning agent. Stages 2 and 3 of the three-stage market mechanism (Fig. 2) can be encapsulated into an optimization model, such that when any GENCO submits its price bids to the model (stage 1), it receives the price signal and generation schedule for the next hour. This optimization model is the environment in our RL setting. Next, we analyze the market properties in order to define the state features accurately. Under the assumption that the reactive power requirement at any given time step remains almost the same irrespective of the GENCOs' behaviour, we fix the bids of all GENCOs to an arbitrary value and calculate the total quantity generated by all GENCOs in the market at each time step. Figure 3 shows how the total quantity generated by all reactive power producers (a measure of the reactive power requirement) varies with time across an episode. The dotted red lines are separated temporally by 24 hours. We observe a heavy correlation in the loads at time steps separated by 24 and 48 hours: the next state (at time $t$) depends not only on the current state, but also on states up to
48 hours in the past in this case. This violates the first-order Markov assumption that the future is independent of the past given the present, which presents a major hurdle in formulating the problem as an MDP. However, in seemingly non-Markov cases such as ours, where the next state depends on a bounded sequence of past steps, i.e., $p(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \dots) = p(s_{t+1} | s_t, a_t, \dots, s_{t-48}, a_{t-48})$, the higher-order Markov process can be translated to a first-order Markov problem by augmenting the state vector with an embedding of the past (48 hours) as an additional feature. Doing so makes the next state depend only on the current (augmented) state and the action taken, thereby making it possible to formulate an MDP for the problem.

Fig. 3. Total quantity generated by all GENCOs across an episode
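To make the state augmentation concrete, the following minimal Python sketch (not from the paper; the function and its argument names are hypothetical) appends a 48-hour window of past total-quantity signals to the instantaneous features:

import numpy as np

HISTORY = 48  # hours of past total-quantity signals embedded in the state

def augment_state(current_features, quantity_history):
    """Concatenate the instantaneous market features with the past 48 hourly
    totals, so that the augmented state satisfies the first-order Markov
    property; the split into 'current' and 'history' is illustrative."""
    quantity_history = np.asarray(quantity_history)
    assert quantity_history.size >= HISTORY
    return np.concatenate([np.asarray(current_features), quantity_history[-HISTORY:]])

# Example: 5 instantaneous features plus the 48-hour window -> 53-dim state
s_t = augment_state(np.zeros(5), np.random.rand(100))
print(s_t.shape)  # (53,)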
4.2 Reactive Power Requirement Prediction using LSTM
As mentioned in earlier sections, the reactive power requirement in the system depends not only on the reactive power loads in the network but also on the network topology. This creates an imperfect information setup in the learning environment. Load prediction techniques and probability-based assumptions cannot serve as a direct input to determine the total quantity of reactive power generation required in the system. Thus, the reactive power requirement at each time step is to be estimated from the total quantity of reactive power generated in the network in previous time steps, so that the requirement at the current time step is tied to the market experience gained so far.

For this, consider the time series data of the total quantity generated by all GENCOs in the IEEE 30-bus system from an earlier episode (month) as the raw training input. From the market analysis in Section 4.1, it is known that this data is periodic in nature, which makes it suitable for Recurrent Neural Networks (RNNs), a class of neural networks that allow information to persist and hence make it easier to learn sequential data. An unrolled RNN can be thought of as multiple sequential copies of an artificial neural network. However, this poses an issue when it comes to long-term memory: RNNs consider only recent information, while forgetting temporally separated events that could link to the current output. This is where Long Short Term Memory (LSTM) networks, a special class of recurrent neural networks, come in to learn long-term dependencies. LSTMs maintain a cell state that is constantly updated using the inputs, a hidden state that can be used for computing the output, and a forget gate that determines what information to discard from the cell state; together, the hidden state, input, and cell state are used to compute the next hidden state. Many-to-one LSTM models take a series of recent observations as input and generate a single output corresponding to the prediction for the next time step. Hence, the learning agent can use an LSTM to forecast the reactive power requirement for the next hour as a sequence prediction.

The raw reactive power requirement (total quantity) signal is transformed into the training set for the LSTM by splitting the input into sequences of length 24, corresponding to the reactive power requirement observations over the past day. The target is defined as the total quantity for the next hour. Inputs from the training set are then fed into an LSTM network with $n$ units followed by a dense layer. The network is trained by back-propagating the mean squared loss between the predicted and true reactive power requirements. The pre-trained LSTM network can then be used to estimate the total quantity required in the current state, for the state representation of the learning agent. The LSTM reactive power requirement prediction procedure is outlined in Algorithm 1.
Algorithm 1 Reactive Power Requirement Prediction using LSTM
  Let $S$ ← total quantity time series data for a month
  Initialize $X, y$ ← split $S$ into sequences of length 24, with the reactive power requirement for the next hour as the corresponding output
  Define model $M$ ← [LSTM($n$), Dense]
  for iteration = 1, 2, ... do
    Sample a batch $B$ from the training set
    Use $M$ to generate the predicted outputs $\hat{y}_i$ for $i = 1..|B|$
    Train the network using the Adam optimizer on $\sum_i (y_i - \hat{y}_i)^2$ for $i = 1..|B|$ and update the LSTM weights
  end for
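A hedged PyTorch sketch of Algorithm 1 follows; the paper does not fix a framework, so the framework choice, hidden width, and training-loop details here are assumptions, while the 24-step input window and single-output dense head follow the text:

import torch
import torch.nn as nn

class QtyLSTM(nn.Module):
    """Many-to-one LSTM: past 24 hourly totals in, next-hour requirement out."""
    def __init__(self, n_units=100):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=n_units, batch_first=True)
        self.head = nn.Linear(n_units, 1)   # dense layer producing one prediction

    def forward(self, x):                   # x: (batch, 24, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])     # use only the last hidden state

def make_dataset(series, window=24):
    """Split a monthly total-quantity series into (24-step input, next-hour target)."""
    X = torch.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X.unsqueeze(-1), y.unsqueeze(-1)

series = torch.rand(720)                    # stand-in for one 30-day episode
X, y = make_dataset(series)
model = QtyLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):                          # a few illustrative epochs
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()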
Having performed the translation of the higher-order MDP to a first-order MDP, and handled the imperfect information setup in the market environment, the learning problem is described as an MDP $(S, A, P, R, \gamma)$.

State Space: The state features are designed so that the agent can use them to determine the consequences of its actions, while also integrating current market conditions like the current reactive power requirement estimate. The state features are described below:
(1) Previous bids: This captures the past experience of the learning agent in the market. We consider the ratio of the bids sent to the ISO to the actual cost function of the learning agent $i$: $\langle a1_{i,t'}, a2_{i,t'} \rangle$ for $t'$ over the four preceding timestamps.
(2) Reward signals: The rewards received at the same four preceding timesteps act as feedback signals and help the learner judge the optimality of its bids. A lower reward indicates that the learner bid too high, resulting in a smaller payment, because the generation scheduled through stages 2 and 3 of the market mechanism was reduced due to the high bid.
(3) Total Quantity Estimate: We use the previous quantity signals to estimate the total generation at the current time step, which is correlated with the reactive power requirement at that timestep. The LSTM network discussed in Section 4.2 is used to predict this periodic quantity from the time-series data.
Together, these features (eight bid ratios, four rewards, and one quantity estimate) form the 13-dimensional state vector.
Action Space: The bids that we send to the optimization module constitute the action for that state. For ease of analysis, instead of considering bids directly as action signals, we consider the action to be the bid magnification, defined as the ratio of the bid sent to the ISO to the actual production cost. Thus, for the $i$-th learner at timestep $t$, with actual operation cost $c1_i$ and lost opportunity cost $c2_i$, the action is $(a1_{i,t}, a2_{i,t})$ defined as:

$\langle a1_{i,t}, a2_{i,t} \rangle = \langle b1_{i,t} / c1_i,\; b2_{i,t} / c2_i \rangle$,   (1)

where $b1_{i,t}$ and $b2_{i,t}$ represent the bids submitted to the ISO as the claimed operation and lost opportunity costs respectively. An advantage of using ratios over actual bid values is that the actions remain the same across all GENCOs, leading to easier analysis. Because the action space is low-dimensional, it is possible to discretize it at the required granularity and treat it as a discrete space. In order to bound the action space, we restrict the bid magnification coefficients to the range $[1, 5]$ and discretize with a step size of 0.5. This results in a total of 81 discrete actions: $(1, 1), (1, 1.5), \dots, (5, 4.5), (5, 5)$.
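As a quick illustration (a minimal sketch, not code from the paper), the 81-action grid can be enumerated directly:

import itertools

grid = [1.0 + 0.5 * k for k in range(9)]        # 9 values in [1, 5], step 0.5
ACTIONS = list(itertools.product(grid, grid))   # 81 (a1, a2) magnification pairs
assert len(ACTIONS) == 81
print(ACTIONS[0], ACTIONS[-1])                  # (1.0, 1.0) (5.0, 5.0)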
Reward Function: The reward is measured in terms of the profit made by the reactive power producer as compared to bidding its true cost. Let $c1_i$ and $c2_i$ represent the cost coefficients for agent $i$, $qg_{i,t}$ be the quantity generated by the producer, and $bg_i$ be the base generation. The base generation of a generation company is the amount of reactive power required for the auxiliary services within the plant and for the shipment of the base active power produced. Thus, we express the profit of a GENCO as follows:

$p_{i,t} = price_{i,t} \times qg_{i,t} - c1_i \times (qg_{i,t} - bg_i) - c2_i \times (qg_{i,t} - bg_i)^2$   (2)

The reward signal is computed as $r_{i,t} = p_{i,t} - p^b_{i,t}$, where $p^b_{i,t}$ is computed using the price and generation returned from the environment upon sending the bids $(b1_{i,t}, b2_{i,t}) = (c1_i, c2_i)$ to the ISO optimization model. By using a reward measured relative to this baseline, the learner has an incentive to learn actions that perform better than bidding its true costs, and is penalized by a negative reward when it receives a payment less than the baseline.
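A direct transcription of Eq. (2) and the baseline-relative reward into Python (a sketch; the baseline price and quantity are assumed to come from re-running the ISO optimization with true-cost bids):

def profit(price, qg, c1, c2, bg):
    """Eq. (2): payment minus the quadratic production cost above base generation."""
    return price * qg - c1 * (qg - bg) - c2 * (qg - bg) ** 2

def reward(price, qg, baseline_price, baseline_qg, c1, c2, bg):
    # r_{i,t} = p_{i,t} - p^b_{i,t}; the baseline arguments are hypothetical
    # inputs obtained by bidding the true costs (c1, c2) to the ISO model.
    return profit(price, qg, c1, c2, bg) - profit(baseline_price, baseline_qg, c1, c2, bg)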
5 PROPOSED LEARNING ALGORITHM
In this section, a variant of Neural Fitted Q Iteration (NFQ), i.e., NFQ with a Target network and Prioritized experience replay (NFQ-TP), is proposed for the single learning agent, to handle the imperfect competition and incomplete information of the market environment. A brief background on classical Q-learning and NFQ is given first, before presenting the proposed learning workflow and the NFQ-TP algorithm.

5.1 Q-Learning
In order to solve the reinforcement learning problem, the agent should learn the expected return for each state-action pair. Q-learning [20] is a model-free, off-policy, temporal difference based learning algorithm that uses the Bellman optimality equation (3) to recursively update the Q-values until convergence is reached:

$Q(s_t, a_t) = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')$   (3)

Q-learning updates estimates based on learned estimates using bootstrapping. Let the learning rate be $\alpha$; then the classical Q-learning update rule after $k$ iterations is given by (4):

$Q_{k+1}(s_t, a_t) = (1 - \alpha) Q_k(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a'} Q_k(s_{t+1}, a') \right)$   (4)

The term $\delta_t = r_{t+1} + \gamma \max_{a'} Q_k(s_{t+1}, a') - Q_k(s_t, a_t)$ is known as the TD-error; it measures the error in the current estimate of the Q-values. In spite of its theoretical convergence properties, this tabular approach is suitable only for small, finite state spaces, because the algorithm becomes computation and memory intensive as the number of states increases. The 13-dimensional state space defined for the learning agent, however, is continuous, and discretizing it leads to an exponential increase in the state space size. Thus, classical Q-learning is unsuitable for the learning agent in an imperfect, oligopolistic reactive power market, and deep reinforcement learning techniques like Neural Fitted Q Iteration are required for learning optimal bidding strategies.

5.2 Neural Fitted Q Iteration
The classical Q-learning algorithm suffers from the curse of dimensionality. Also, updating the Q-function based on only one point can lead to slow and noisy convergence. These challenges can be overcome with Fitted Q Iteration (FQI) [6], a batch-mode reinforcement learning algorithm suitable for continuous state spaces and an extension of the traditional Q-learning approach. Basis functions map the input states into a derived feature space in which the optimal Q-values are approximated. Any supervised learning technique, such as regression or an SVM, can then be used to fit the training set and generalize this information to unseen states. At each step, the learning agent has access to an experience buffer from which it samples a mini-batch as its training set. The advantage is that the same experience tuple can be used multiple times for learning, while the temporal correlations in sequentially generated data are broken, thereby satisfying the i.i.d. (independently and identically distributed) assumption. The training set is of the form $(\langle s, a \rangle, Q)$, with the state-action pairs as input and the target defined as the action-value $Q(s, a)$ for that pair.

Neural Fitted Q Iteration (NFQ) [15] is a data-efficient, deep learning based algorithm belonging to the FQI family. It builds on FQI by using the global generalization of a neural network as a regressor to approximate the Q-values from offline data. The principle is the same: using a single point to update the weights can create unintended changes in the weights for other state-action pairs, whereas a batch-based approach stabilises learning. NFQ uses RPROP as an optimizer to minimize oscillations in back-propagation. The advantage of resilient back-propagation [16] is that it adapts the step size for each weight independently, based on the sign changes of the partial derivative of the loss function.
Summarizing, if $\theta$ are the weights of the NFQ network, then for an input state-action pair $\langle s, a \rangle$ the output of the network is the action value $Q(s, a; \theta)$. Since NFQ has been shown to perform well in control problems with continuous state spaces, it is suitable for learning the Q-value function for the optimal bidding problem.
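As a concrete but hypothetical instantiation, a PyTorch network in the NFQ-2 layout used later in Section 6 (13 state features in, 81 action values out; the hidden width is an assumption, since the paper reports only the input/output dimensions) could be:

import torch.nn as nn

class NFQNet(nn.Module):
    """Q-network in the NFQ-2 layout: 13 state features in, one Q-value per
    discrete action out. The hidden width (64) is assumed for illustration."""
    def __init__(self, state_dim=13, n_actions=81, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):          # s: (batch, 13) -> (batch, 81)
        return self.net(s)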
5.3 NFQ with Target Network and Prioritized Experience Replay (NFQ-TP)
This section presents a variation of the NFQ algorithm, NFQ with a target network and prioritized experience replay (NFQ-TP); the proposed algorithm leads to faster convergence in the bidding environment.
Prioritized experience replay:
The original NFQ algorithm works on a pre-generated offline training data sample of size $N$ obtained using random trajectories. However, populating the buffer sufficiently with all patterns of state-action pairs is difficult using random actions, because of the high-dimensional state space and the large number of actions. Hence, in NFQ-TP, experiences are iteratively added to the buffer at each step of training. Initially, the buffer is populated with $N_I$ initial episodes generated using random trajectories. During the training phase, to balance exploitation with exploration, the agent adds an experience tuple by stepping in the environment according to an $\epsilon$-greedy exploration strategy: with probability $(1 - \epsilon)$ the agent chooses the action that maximizes the action-value $Q(s, a; \theta)$ predicted by NFQ-TP for the given state, and with probability $\epsilon$ it chooses an action at random. The exploration factor $\epsilon$ is initially set to a large value to encourage taking rare actions, and is decayed as the algorithm approaches convergence.

The sampling efficiency of experience replay can be further improved, compared to uniform sampling, by picking the important transitions with high TD-error more frequently using Prioritized Experience Replay [17]. A higher magnitude of TD-error means the network does not estimate the Q-value for that experience tuple accurately; hence, such samples should be picked more frequently. Each experience in the buffer is stored along with a priority $p_i = |\delta_i| + \epsilon_P$, where $|\delta_i|$ is the magnitude of the TD-error and $\epsilon_P$ is a small constant added to ensure that each experience is picked with a non-zero probability. The sampling probability of a transition $P(i)$ is defined as:

$P(i) = \dfrac{p_i^\beta}{\sum_{j=1}^{N} p_j^\beta}$   (5)

where $\beta$ controls the amount of prioritization, with 0 corresponding to uniform sampling and 1 corresponding to fully priority-based sampling.
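A minimal proportional-prioritization buffer implementing Eq. (5) might look as follows; this is a sketch under the definitions above, and the class and method names are ours:

import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity, eps_p=1e-3, beta=0.7):
        self.data, self.prio = [], []
        self.capacity, self.eps_p, self.beta = capacity, eps_p, beta

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:     # simple FIFO eviction
            self.data.pop(0); self.prio.pop(0)
        self.data.append(transition)
        self.prio.append(abs(td_error) + self.eps_p)   # p_i = |delta_i| + eps_P

    def sample(self, batch_size):
        p = np.asarray(self.prio) ** self.beta         # Eq. (5) numerator
        p /= p.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=p)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        for i, d in zip(idx, td_errors):
            self.prio[i] = abs(d) + self.eps_p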
Updating target network: NFQ uses the current Q estimates to predict the target Q-value for loss computation, and since these weights are updated at each step of the learning process, the target values are prone to oscillations. To stabilize learning further, a target network is used, with its parameters updated slowly from those of the local network. For each training step, once the local weights $\theta$ are updated, the weights of the target network are updated as:

$\theta_T \leftarrow (1 - \tau)\theta_T + \tau\theta$   (6)

where $\theta_T$ and $\theta$ are the parameters of the target and local networks respectively, and $\tau \ll 1$. The TD target for a transition is computed using the target network:

$y_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta_T)$   (7)

The TD-error $\delta_t$ for any transition is then defined as:

$\delta_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta_T) - Q(s_t, a_t; \theta)$   (8)

In each learning iteration, the prioritized experience replay is replenished with new experiences by stepping in the environment in an $\epsilon$-greedy manner, using the Q-value estimates from the target network to find the maximizing action. The weights of the local network are updated by sampling a mini-batch, using the local network to estimate the Q-values for each experience tuple in the batch, and performing back-propagation on the mean squared TD loss, with the TD target computed using the target network. The weights of the target network are updated slowly based on the updates to the local network. This process is repeated until convergence. The pseudo-code for the NFQ-TP implementation is given in Algorithm 2.
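In code, Eqs. (6) and (7) amount to the following hedged PyTorch helpers, assuming a Q-network that returns one value per discrete action (the NFQ-2 layout):

import torch

@torch.no_grad()
def soft_update(target_net, local_net, tau):
    # theta_T <- (1 - tau) * theta_T + tau * theta, Eq. (6)
    for t_p, l_p in zip(target_net.parameters(), local_net.parameters()):
        t_p.mul_(1.0 - tau).add_(tau * l_p)

@torch.no_grad()
def td_target(target_net, r, s_next, gamma):
    # y_t = r_{t+1} + gamma * max_a' Q(s_{t+1}, a'; theta_T), Eq. (7)
    return r + gamma * target_net(s_next).max(dim=1).values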
Algorithm 2 Neural Fitted Q Iteration (NFQ-TP variation)
  Initialize prioritized replay memory $D$ to capacity $N$ by taking random steps in the environment
  Initialize the local network with random weights $\theta$
  Initialize the target network with weights $\theta_T \leftarrow \theta$
  Initialize the PER parameter $\beta$ and initial priority $p_m$
  for iteration = 1, 2, ... do
    Step in the environment using policy $\pi_{\theta_T}$ for $K$ time steps with $\epsilon$-greedy noise
    Add the $K$ experience tuples of the form $(s_t, a_t, s_{t+1}, r_{t+1})$ to $D$ and initialize their priorities to $p_m$
    Sample a mini-batch of size $M$ from the PER
    Compute $y_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta_T)$ for each tuple in the batch
    Train the network using Rprop on $(y_t - Q(s_t, a_t; \theta))^2$ and update the local network weights $\theta$
    Update the priorities of the sampled experiences in $D$
    Update the target network: $\theta_T \leftarrow (1 - \tau)\theta_T + \tau\theta$
  end for

The learning workflow for the overall algorithm is depicted in Figure 4.
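For readers who prefer code, a compressed Python/PyTorch sketch of Algorithm 2 follows. The environment and its reset/step interface are hypothetical, the replay buffer is the PER sketch above, the epsilon schedule is illustrative, and only the control flow mirrors the pseudo-code:

import copy, random
import numpy as np
import torch

def nfq_tp(env, q_net, replay, n_actions, iters=1000, K=24, M=64,
           gamma=0.3, tau=1e-3, eps=1.0, eps_decay=0.99):
    target_net = copy.deepcopy(q_net)
    opt = torch.optim.Rprop(q_net.parameters())     # Rprop, as in NFQ
    s = env.reset()
    for _ in range(iters):
        for _ in range(K):                          # collect K epsilon-greedy steps
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    q = target_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))
                a = int(q.argmax())
            s2, r, done = env.step(a)
            replay.add((s, a, r, s2), td_error=1.0) # new samples: max priority
            s = env.reset() if done else s2
        idx, batch = replay.sample(M)
        S  = torch.as_tensor(np.stack([b[0] for b in batch]), dtype=torch.float32)
        A  = torch.as_tensor([b[1] for b in batch], dtype=torch.int64)
        R  = torch.as_tensor([b[2] for b in batch], dtype=torch.float32)
        S2 = torch.as_tensor(np.stack([b[3] for b in batch]), dtype=torch.float32)
        with torch.no_grad():                       # TD target, Eq. (7)
            y = R + gamma * target_net(S2).max(dim=1).values
        q_sa = q_net(S).gather(1, A.unsqueeze(1)).squeeze(1)
        loss = torch.nn.functional.mse_loss(q_sa, y)
        opt.zero_grad(); loss.backward(); opt.step()
        replay.update_priorities(idx, (y - q_sa).detach().tolist())
        with torch.no_grad():                       # soft target update, Eq. (6)
            for tp, lp in zip(target_net.parameters(), q_net.parameters()):
                tp.mul_(1 - tau).add_(tau * lp)
        eps *= eps_decay                            # decay exploration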
Fig. 4. Optimal Bidding Strategy Training Flow
6 SIMULATION RESULTS AND DISCUSSION
The three-stage reactive power market model was tested on the IEEE 30-bus system [24], shown in Fig. 5. There are generator buses at nodes 1, 2, 13, 22, 23, and 27. The peak load in the system is 189.5 MW.

Fig. 5. IEEE 30-bus system

The actual cost coefficients $c1$ and $c2$ and the base MVAr generation of these GENCOs are tabulated in Tables 1 and 2. Each episode is a maximum of 720 (24 × 30) steps long, where each step corresponds to an hourly bid in the 30-day window.
Table 1. Actual cost coefficients for all 6 GENCOs
GENCO   c1 ($/100 MVAr)   c2 ($/(100 MVAr)²)
1       0.73              0.30
2       0.68              0.39
3       0.75              0.43
4       0.60              0.5
5       0.75              0.9
6       0.73              0.38
GENCO   Base MVAr
1       7.5
2       3
3       3.125
4       2.435
5       2
6       2.235

The following studies have been performed to understand the performance of the proposed NFQ-TP based learning algorithm for optimal bidding in the reactive power market of the IEEE 30-bus system:
(1) Single-agent NFQ-TP based learning considering two cases of rival bidding strategies:
• Imperfect competition, where GENCOs bid higher than their actual cost (Bidding Strategy B-1)
• Perfect competition, where GENCOs bid their actual cost (Bidding Strategy B-2)
The goal is to analyze the robustness of the learning approach in multiple market scenarios against differing rival bidding strategies.
(2) Comparison of two configurations of the NFQ-TP network:
• NFQ-1: Input — state features concatenated with a discrete action. Output — a single Q-value for the state-action pair.
• NFQ-2: Input — state features. Output — 81 Q-values, one per action.
Comparison of the two variants gives insight into how to represent the Q-values effectively while permitting as much generalization as possible without impacting the representational strength of the network.
(3) Comparison of the learning benefits for low-cost, intermediate-cost, and high-cost agents from the proposed NFQ-TP based learning algorithm, in order to analyze how the learning approach improves over the naive strategy for different learners.
LSTM Reactive Power Requirement Prediction.
Before generating the experience tuples for training the NFQ-TP network, the LSTM network must be pre-trained to predict the current-state reactive power requirement. We use the procedure specified in Section 4.2 and feed as input the total quantity generation data for a 30-day episode. We employed a network with 100 LSTM units followed by a dense layer, whose output is a single value corresponding to the reactive power requirement prediction for the next time step. Since the algorithm splits the time series data into segments lasting 24 hours, the input to the LSTM network is of size 24, corresponding to the total quantities produced by all GENCOs for each hour over the previous day.
The bidding strategies adopted by the agents other than the learner are described below:
(1) Imperfect competition (Bidding Strategy B-1). Optimal bidding strategies for producers in the reactive power market have not been studied extensively. Hence, for GENCOs other than the learning agent, we adapt the strategies adopted for bidding in active power markets. The optimal bidding strategy in active power markets is formulated as a stochastic optimisation problem by assuming suitable probability distribution functions to model the uncertainties in the network and the bids of rivals. Such a basic optimisation problem formulation for the active power market, discussed in [11], is rewritten in the context of the reactive power market as follows:

$\max_{b1_{i,t},\, b2_{i,t}} E[\pi_{i,t}]$   (9)

such that

$\pi_{i,t} = (b1_{i,t} + b2_{i,t} \cdot qg_{i,t}) \cdot qg_{i,t} - (c1_{i,t} + c2_{i,t} \cdot qg_{i,t}) \cdot qg_{i,t}$   (10)

$qg^{min}_{i,t} \le qg_{i,t} \le qg^{max}_{i,t}$   (11)

$R^{min}_i \le (b1_{i,t} + b2_{i,t} \cdot qg_{i,t}) \le R^{max}_i$   (12)

where $R^{min}_i$ and $R^{max}_i$ are limits set on the bidding function, which can be assumed based on the capacity of each GENCO. The optimisation problem can be solved through heuristic methods like Monte Carlo methods, genetic algorithms, etc., provided the assumptions on the probability distribution functions for system uncertainties and rivals' bids can be validated. However, as detailed in previous sections, assuming such probability distribution functions for reactive power markets is unrealistic. Hence, in this work, we consider the results reported in [11] for the IEEE 30-bus system under peak load conditions. The reported results are applied to the reactive power market by incorporating demand fluctuations into the bids. For a more competitive environment, the bids in the reactive power market are made higher than the optimal bids in the active power market. This is a fair baseline, as the chance of exercising market power is higher in reactive power markets, and the uncertainty in network operating conditions is handled by utilising demand fluctuation signals instead of assuming a probability distribution function. Thus, for this bidding strategy, we assume that all agents except the learner send bids equal to two times the actual operation cost and five times the lost opportunity cost, multiplied by a bounded step-wise variable $d_t$ corresponding to the demand fluctuations:

$a1_{i,t} = 2 \times d_t, \qquad a2_{i,t} = 5 \times d_t$   (13)

Note that we use the actual demand and some random noise to generate the multiplier, to ensure that the other GENCOs do not have perfect information.
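A minimal sketch of strategy B-1, following Eq. (13); the 5% noise scale is an assumption, since the paper states only that the actual demand plus some random noise drives the multiplier:

import random

def b1_rival_bids(c1, c2, demand_factor):
    """Bidding strategy B-1: magnify true costs by demand-driven factors.
    The noise keeps rivals from having perfect demand information."""
    d_t = demand_factor * (1.0 + random.uniform(-0.05, 0.05))  # assumed noise scale
    return 2.0 * d_t * c1, 5.0 * d_t * c2                      # (b1, b2) sent to the ISO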
(2) Perfect competition (Bidding Strategy B-2). All agents except the learner send bids equal to their actual operation and lost opportunity costs.
Two variants of the NFQ network are considered to study the performance of the proposed NFQ-TP based learning. Both variants use ReLU activations, the mean squared TD-error for updating the parameters, and RPROP as the optimizer.
(1) NFQ-1 takes the state features along with the discrete action as input and outputs a single Q-value for the state-action pair. Since the state space is 13-dimensional and the action space is 1-dimensional, the input to the NFQ-1 network is a vector of size 14. Figure 6 summarizes the NFQ-1 implementation.

Fig. 6. Visualization of NFQ network used in Variant 1

(2) NFQ-2 takes the state features as input and outputs 81 Q-values, one per action. The input to NFQ-2 is a vector of size 13. Figure 7 depicts the NFQ-2 network implementation.

Fig. 7. Visualization of NFQ network used in Variant 2
The Adam optimizer is used for learning the NFQ network weights with a learning rate of 0.001. A decay rate of 0.01 is used for decaying the exploration noise, and a soft update rate $\tau \ll 1$ is used for updating the target network. A mini-batch of size 64 is sampled from the Prioritized Experience Replay, and the prioritization factor $\beta$ is set to 0.7. The buffer has capacity $N$ and is seeded with $N_I$ initial experiences. A more thorough discussion on choosing the hyper-parameters can be found in Appendix A; the experiments are carried out in accordance with the results obtained from that analysis.
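For reference, the reported settings can be collected in a small configuration sketch; the soft update rate below is an assumed placeholder value (the text specifies only $\tau \ll 1$), and the buffer sizes are left out because their values are not stated here:

CONFIG = {
    "lr": 1e-3,                 # Adam learning rate for NFQ weights
    "exploration_decay": 0.01,  # decay rate for the exploration noise
    "tau": 1e-3,                # soft target update rate (assumption; tau << 1)
    "batch_size": 64,           # mini-batch sampled from prioritized replay
    "per_beta": 0.7,            # prioritization factor beta
    "gamma": 0.3,               # discount factor chosen in Appendix A
}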
LSTM Reactive Power Requirement Prediction.
In order to evaluate the model performance, we define a baseline model that predicts the current quantity as the average of the total quantity generated at the same hour on the two preceding days, i.e., $\hat{Q}_t = \frac{Q_{t-24} + Q_{t-48}}{2}$. On evaluation, the mean squared error between the predicted and actual demand was 0.019 for the baseline, whereas for the LSTM model the error was 0.01. Thus, the LSTM outperforms the baseline, and using it leads to a more accurate reactive power requirement prediction.

Case 1: Rivals adopt bidding strategy B-1. We consider each of the GENCOs in turn as the learner; the bidding strategies of the remaining 5 competing GENCOs are determined by B-1. This section presents the improvements achieved by each implementation of NFQ-TP over the baseline true-cost bidding.
Performance of NFQ-1 variant
Figure 8 and Table 3 show the average and standard deviation of episodic rewards across 10 random runs of the NFQ-TP algorithm, with each of the GENCOs as the learning agent, using the NFQ-1 network implementation and B-1 rival bidding. The table values are computed considering only the converged values of the episodic returns for each run.
Fig. 8. Learnt episodic rewards averaged across 10 random seeds with each of the GENCOs as a learner, using the NFQ-1 model and B-1 rival bidding strategy

Table 3. Mean $\mu$ and standard deviation $\sigma$ over 10 random runs using the NFQ-1 model and B-1 rival bidding strategy

Performance of NFQ-2 variant
Figure 9 and Table 4 show the average and standard deviation of episodic rewards across 10 random runs of the NFQ-TP algorithm, with each of the GENCOs as the learning agent, using the NFQ-2 network implementation and B-1 rival bidding.
Case 2: Rivals adopt bidding strategy B-2. We consider each of the GENCOs in turn as the learner; the bidding strategies of the remaining 5 competing GENCOs are determined by B-2. This section presents the improvements achieved by each implementation of NFQ-TP over the baseline true-cost bidding.
Fig. 9. Learnt episodic rewards averaged across 10 random seeds with each of the GENCOs as a learner, using the NFQ-2 model and B-1 rival bidding strategy

Table 4. Mean $\mu$ and standard deviation $\sigma$ over 10 random runs using the NFQ-2 model and B-1 rival bidding strategy

Performance of NFQ-1 variant
Figure 10 and Table 5 show the average and standard deviation of episodic rewards across 10 random runs of the NFQ-TP algorithm, with each of the GENCOs as the learning agent, using the NFQ-1 network implementation and B-2 rival bidding.
Fig. 10. Learnt episodic rewards averaged across 10 random seeds with each of the GENCOs as a learner, using the NFQ-1 model and B-2 rival bidding strategy
Table 5. Mean $\mu$ and standard deviation $\sigma$ over 10 random runs using the NFQ-1 model and B-2 rival bidding strategy

Performance of NFQ-2 variant
Figure 11 and Table 6 show the average and standard deviation of episodic rewards across 10 random runs of the NFQ-TP algorithm, with each of the GENCOs as the learning agent, using the NFQ-2 network implementation and B-2 rival bidding.
Fig. 11. Learnt episodic rewards averaged across 10 random seeds with each of the GENCOs as a learner, using the NFQ-2 model and B-2 rival bidding strategy

Table 6. Mean $\mu$ and standard deviation $\sigma$ over 10 random runs using the NFQ-2 model and B-2 rival bidding strategy

Comparison of NFQ-1 and NFQ-2.
On examining the mean episodic rewards for the two bidding strategies, we see that both implementations perform at least as well as baseline bidding. However, NFQ-2 achieves far superior results, even with a smaller network, because of the way the network is structured. NFQ-1 maps a state-action pair to a Q-value; hence, an update for one pair can also affect the Q-values of other actions because of spatial locality. This also explains why the variance in episodic rewards remains high even towards convergence with the NFQ-1 implementation.
Comparison of episodic rewards for B1 v/s B2.
The mean episodic rewards achieved by the GENCOs are comparatively higher when the rival bidding strategy is B-2. This is because B-1 lets the competitors follow an adaptive strategy based on the reactive power requirement of the market; it is therefore more difficult for the learner to beat the rivals by a large margin in such a scenario.
Observation and inference.
The performance improvement achieved by the NFQ-TP algorithm using the NFQ-2 network implementation is analysed. The variance in episodic rewards is high at the start of learning, indicating that the network is exploring to find better bids. After around 1000 episodes, the variance reduces and the episodic rewards converge, indicating that the algorithm has learnt the optimal bidding strategy. The episodic reward tables indicate that the approach achieves a significant improvement for intermediate- and low-cost GENCOs, while performing at least as well as the baseline for the high-cost GENCOs. The use of multiple rival bidding strategies further shows that the method is suitable in different market scenarios. Therefore,
NFQ-TP provides an effective way for generation companies to improve their bidding strategies.
The learning benefits of GENCOs in each of the cost categories are also analysed. For ease of comparison, the analysis is performed on the NFQ-2 implementation with the B-1 rival bidding strategy.
Fig. 12. Quantity generated for all GENCOs across a 120-timestep window of the optimal bidding trajectory with GENCO-2 as learner
The generation curve for a 120-timestep window of the optimal bidding trajectory for GENCO-2 (Figure 12) shows that the low-cost GENCO-2 not only peaks in high-demand intervals by increasing its bid cost coefficients, but also adjusts its bids when the demand is lower, so that its generation does not dip as much in low-demand intervals. The episodic reward plots further show that NFQ-TP achieves significant improvements in episodic rewards for such GENCOs.
Fig. 13. Quantity generated for all GENCOs across a 120-timestep window of the optimal bidding trajectory with GENCO-1 as learner

The generation curve for a 120-timestep window of the optimal bidding trajectory for GENCO-1 (Figure 13) reveals that the learning is especially useful when demand is low: because the learner is predicting the market demand, the dips in its generation quantity are smaller than those of its competitors. In such scenarios, the learner can exploit its knowledge to bid in such a way that the drop in its generated quantity due to low demand is smaller than that of the others. The episodic reward plots further show that NFQ-TP achieves good improvements in episodic rewards for such GENCOs when the rivals are not as market-aware (or optimal), and achieves baseline bidding rewards against strong market-aware competitors.
Fig. 14. Quantity generated for all GENCOs across a 120-timestep window of the optimal bidding trajectory with GENCO-5 as learner

The generation curve for a 120-timestep window of the optimal bidding trajectory for GENCO-5 (Figure 14) shows that high-cost GENCOs try to maximize their profits by increasing the quantity they can generate. A comparison of this quantity generation curve with the earlier curves (with other GENCOs as learners) shows that the generation changed from about 0.025 to above 0.2 for GENCO-5, around a tenfold improvement. It is observed that, in general, the higher the actual costs of a GENCO, the closer its bid values are to its actual costs of generation, so as to achieve higher profits. This is justified because the higher the cost a GENCO bids, the greater the chance that the ISO assigns a higher quantity to competing GENCOs.
7 CONCLUSION AND FUTURE WORK
The three-stage reactive power market model is designed to encourage GENCOs to bid closer to their true costs and to prevent any single agent from gaining market power. The invisibility of rival bids, and the difficulty of estimating bid magnification from the price signals issued by the ISO, make it challenging to estimate the optimal bid value directly. The environment proposed here takes features directly from market observations, such as bids and the corresponding rewards, and learns others, such as the demand, from experience. This work presented techniques to successfully compute optimal bidding strategies for GENCOs in different scenarios: imperfect competition, where the other agents follow a time-dependent strategy, and perfect competition, where they follow a regular pattern of bidding their true costs.

It was shown that by using a Q-value estimation technique like Neural Fitted Q Iteration, coupled with its own experience and market observations, a GENCO can bid a magnification of its actual costs and, in some scenarios, still make more profit than by telling the truth. This is because the learners are more aware of the market behaviour and hence adapt their bids to the market requirements. Because learning becomes more difficult when the bid values of other GENCOs are hidden, this work also shows why the restriction on the visibility of bid values prevents GENCOs from exercising market power.

The work presented in this paper considered rivals of the learning agent that follow a stochastic-optimisation bidding strategy. This can be extended by letting the rivals use similar NFQ-TP learning algorithms, making the setup a multi-agent learning environment after a few relaxations. Another interesting experiment is to analyse how market behaviour and the optimal bidding strategy are affected if separate market mechanisms are used for low-cost GENCOs (i.e., renewable energy sources) and high-cost GENCOs (i.e., conventional or non-renewable sources). This would help determine whether separate markets lead to a lower total cost for the ISO, as the low-cost GENCOs are otherwise free to bid higher in the common market and take more advantage of the ISO.
REFERENCES
[1] A. Badri and M. Rashidinejad. 2013. Security constrained optimal bidding strategy of GenCos in day ahead oligopolistic power markets: a Cournot-based model. Electrical Engineering 95, 2 (2013), 63–72.
[2] Di Cao, Weihao Hu, Xiao Xu, Tomislav Dragičević, Qi Huang, Zhou Liu, Zhe Chen, and Frede Blaabjerg. 2020. Bidding strategy for trading wind energy and purchasing reserve of wind power producer – A DRL based approach. International Journal of Electrical Power & Energy Systems 117 (2020), 105648.
[3] Ruidi Chen, Ioannis Ch. Paschalidis, Michael C. Caramanis, and Panagiotis Andrianesis. 2019. Learning from past bids to participate strategically in day-ahead electricity markets. IEEE Transactions on Smart Grid 10, 5 (2019), 5794–5806.
[4] Puneet Chitkara, Jin Zhong, and Kankar Bhattacharya. 2009. Oligopolistic competition of gencos in reactive power ancillary service provisions. IEEE Transactions on Power Systems 24, 3 (2009), 1256–1265.
[5] US Federal Energy Regulatory Commission et al. 1996. Promoting wholesale competition through open access non-discriminatory transmission services by public utilities; recovery of stranded costs by public utilities and transmitting utilities. Order 888 (1996), 24.
[6] Damien Ernst, Pierre Geurts, and Louis Wehenkel. 2005. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 6, Apr (2005), 503–556.
[7] L. Gallego, O. Duarte, and A. Delgadillo. 2008. Strategic bidding in Colombian electricity market using a multi-agent learning approach. IEEE, 1–7.
[8] D. Jay and K. S. Swarup. 2020. Game Theoretical Approach to Novel Reactive Power Ancillary Service Market Mechanism. IEEE Transactions on Power Systems (2020), 1–1.
[9] Y. Jiang, J. Hou, Z. Lin, F. Wen, J. Li, C. He, C. Ji, Z. Lin, Y. Ding, and L. Yang. 2019. Optimal Bidding Strategy for a Power Producer Under Monthly Pre-Listing Balancing Mechanism in Actual Sequential Energy Dual-Market in China. IEEE Access (2019).
[10] Preprint ANL/MCS-P1243-0405 4, 04 (2005).
[11] Somendra P. S. Mathur, Anoop Arya, and Manisha Dubey. 2017. Optimal bidding strategy for price takers and customers in a competitive electricity market. Cogent Engineering 4, 1 (2017), 1358545.
[12] Vishnuteja Nanduri and Tapas K. Das. 2007. A reinforcement learning model to assess market power under auction-based energy pricing. IEEE Transactions on Power Systems 22, 1 (2007), 85–95.
[13] Morteza Rahimiyan and Habib Rajabi Mashhadi. 2010. An Adaptive Q-Learning Algorithm Developed for Agent-Based Computational Modeling of Electricity Market. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40, 5 (2010), 547–556.
[14] Mohammad Rayati, Hamed Goodarzi, and AliMohammad Ranjbar. 2019. Optimal bidding strategy of coordinated wind power and gas turbine units in real-time market using conditional value at risk. International Transactions on Electrical Energy Systems 29, 1 (2019), e2645.
[15] Martin Riedmiller. 2005. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning. Springer, 317–328.
[16] Martin Riedmiller and Heinrich Braun. 1993. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In IEEE International Conference on Neural Networks. IEEE, 586–591.
[17] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2015. Prioritized experience replay. arXiv preprint arXiv:1511.05952 (2015).
[18] Satyendra Singh and Manoj Fozdar. 2019. Optimal bidding strategy with the inclusion of wind power supplier in an emerging power market. IET Generation, Transmission & Distribution 13, 10 (2019), 1914–1922.
[19] Easwar Subramanian, Yogesh Bichpuriya, Avinash Achar, Sanjay Bhat, Abhay Pratap Singh, Venkatesh Sarangan, and Akshaya Natarajan. 2019. lEarn: A Reinforcement Learning Based Bidding Strategy for Generators in Single sided Energy Markets. In Proceedings of the Tenth ACM International Conference on Future Energy Systems. 121–127.
[20] Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-learning. Machine Learning 8, 3-4 (1992), 279–292.
[21] Fushuan Wen and A. Kumar David. 2001. A genetic algorithm based method for bidding strategy coordination in energy and spinning reserve markets. Artificial Intelligence in Engineering 15, 1 (2001), 71–79.
[22] David Wozabal and Gunther Rameseder. 2020. Optimal bidding of a virtual power plant on the Spanish day-ahead and intraday market for electricity. European Journal of Operational Research.
[23] IEEE Transactions on Smart Grid 11, 2 (2019), 1343–1355.
[24] Ray Daniel Zimmerman, Carlos Edmundo Murillo-Sánchez, and Robert John Thomas. 2010. MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education. IEEE Transactions on Power Systems 26, 1 (2010), 12–19.
A HYPERPARAMETER TUNING
This section presents the selection process behind the crucial hyper-parameters. Because of the large number of parameters involved in the NFQ-TP algorithm, it is difficult to perform a grid search; hence, the parameters were chosen by manual search. For comparison, all experiments use the NFQ-2 implementation with the B-1 rival bidding strategy. We first look at how the discount factor $\gamma$ affects the episodic rewards for an arbitrarily chosen GENCO (say, GENCO-2) in Figure 15a. The second parameter analyzed is the size of the mini-batch sampled from the Prioritized Experience Replay (Figure 15b). Finally, we examine the rate at which the exploration factor $\epsilon$ should be decayed (Figure 15c).

Observation and inference
A high value of $\gamma$ weights long-term rewards heavily, whereas a very low value ($\gamma = 0.05$) makes the learner myopic, and hence it learns a less than optimal bidding strategy. A discount factor of 0.3 gives the best observed episodic rewards.

A high batch size (256) means slower learning and poor generalization, leading to lower episodic rewards. A low batch size (say, 16), on the other hand, can lead to overfitting on the batch and hence convergence to local optima. A batch size of 64 gives the best observed episodic rewards.

A very high exploration decay rate leads quickly to high exploitation and little exploration, and hence the learner converges to a sub-optimal policy. A very low exploration decay rate can lead to excessive exploration and hence slow learning. An exploration decay rate of 0.1 gives the best observed episodic rewards.

Fig. 15. Effect of hyper-parameters on episodic rewards achieved by the learning agent: (a) discount factor, (b) mini-batch size, (c) exploration decay