A Multi-Agent Deep Reinforcement Learning Approach for a Distributed Energy Marketplace in Smart Grids
Arman Ghasemi, Amin Shojaeighadikolaei, Kailani Jones, Morteza Hashemi, Alexandru G. Bardas, Reza Ahmadi
Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS, USA
E-mails: {arman.ghasemi, amin.shojaei, kailanij, mhashemi, alexbardas, ahmadi}@ku.edu

Abstract—This paper presents a Reinforcement Learning (RL) based energy market for a prosumer-dominated microgrid. The proposed market model facilitates a real-time and demand-dependent dynamic pricing environment, which reduces grid costs and improves the economic benefits for prosumers. Furthermore, this market model enables the grid operator to leverage prosumers' storage capacity as a dispatchable asset for grid support applications. Simulation results based on the Deep Q-Network (DQN) framework demonstrate significant improvements of the 24-hour accumulative profit for both prosumers and the grid operator, as well as major reductions in grid reserve power utilization.
I. INTRODUCTION
Small-scale power generation and storage technologies, also known as Distributed Energy Resources (DERs), are changing the operational landscape of the power grid in a substantial way. Many traditional power consumers adopting a DER technology are starting to produce energy, thus morphing from a consumer to a prosumer (produces and consumes energy) [1]. The most common prosumer installations are residential solar photovoltaic (PV) systems [2]. Although DER integration has the potential to provide multiple benefits to prosumers as well as grid operators [3], current grid operating strategies fail to leverage DER capabilities at a large scale, mostly due to the lack of modern and intelligent grid control strategies.

Residential PV systems likely have excess power generation during peak sun hours, which usually do not coincide with peak demand hours [4]. In other words, current residential PV systems are likely to generate excess power during off-peak demand hours when electricity is not a valuable grid commodity, and this excess generation can even contribute to grid instability. Integration of energy storage into prosumer setups can potentially rectify this situation by allowing the prosumers to store their excess energy during the peak sun hours and inject it into the grid during the peak demand hours. Furthermore, proper coordination and aggregation of this dispatchable prosumer generation capacity can be leveraged for various grid support services/applications [5], [6].

Nevertheless, current popular net-metering compensation schemes do not properly incentivize the prosumers to engage in grid support applications [7]. The electricity meter in a net-metered household runs backwards when the prosumer injects power into the grid [8]. At the end of a billing cycle, the customer is billed for the "net" energy use, i.e., the difference between the overall consumed and produced energy, regardless of the actual schedule of injecting energy into the grid.
Moreover, prosumers are compensated for the generated electricity at the same fixed retail price irrespective of the time of the day or any grid contingency at hand. Therefore, there is little incentive for prosumers to engage in any sort of grid support service.

In this paper, we propose a distributed energy marketplace framework that realizes a real-time, demand-dependent, dynamic pricing environment for prosumers and the grid operator. The proposed marketplace framework offers a number of vital properties to incentivize prosumers' engagement in grid support applications while providing improved economic benefits to prosumers as well as the grid operator, resulting in a "win-win" scenario. The contributions of the framework proposed in this paper can be summarized as follows:
• The proposed marketplace framework enables the grid operator to leverage prosumers' storage capacity as a dispatchable asset, while reducing grid cost through offsetting reserve power with prosumer generation.
• It incentivizes the prosumers to engage in grid support applications by providing higher economic benefits when supporting grid activities.
• Founded on reinforcement learning (RL)-based decision-making, our framework handles the high-dimensional, non-stationary, and stochastic nature of the problem without the need for the abstract explicit modeling and deterministic rules used in traditional approaches.
• It models prosumers with generation, storage capacity, and bidirectional grid injection capability. This yields a high degree of freedom for cost versus profit optimization and leads to improved overall benefits for all parties.

To enable all these properties, the proposed energy market leverages a multi-agent RL framework with a single grid operator agent and a network of distributed prosumer agents. The grid agent's goal is to maximize its economic benefit.
To this end, the agent makes decisions on the optimal share of power purchased from a fleet of conventional generation facilities versus a cohort of prosumers with dispatchable generation capability, by considering the incremental cost of the generation facilities versus the retail price of purchasing electricity from prosumers. In order to dispatch the prosumers' generation, the grid agent dynamically sets the retail electricity price to incentivize prosumers to adjust their generation level. On the other hand, the prosumer agents aim to maximize their own economic benefit by deciding on the level of grid support participation according to various factors such as electricity retail price, State of Charge (SoC) of the storage device, PV generation level, household consumption level, etc. We demonstrate the efficiency of this marketplace through a simulation on a small-scale microgrid as shown in Fig. 1. The microgrid [9] is under the management of a single grid operator entity and contains loads, distributed energy resources and/or storage devices that can be operated in a controlled and coordinated way.

[Fig. 1. Proposed electricity market model – The proposed energy marketplace includes several generation sources, household prosumers, and household consumers. By leveraging a reinforcement learning (RL) framework, our system enables a dynamic buy and sell pricing scheme handled by the grid as well as a dynamic strategy for the prosumers to maximize benefits.]

This paper is structured as follows: Section II covers background and related work, while Section III provides the physical and learning system models for the proposed energy marketplace. Next, the simulation results for the small-scale microgrid case study are presented in Section IV. Finally, Section V concludes this paper.

II. BACKGROUND AND RELATED WORK
A brief survey of traditional energy marketplace models and dynamic pricing methods for smart grid applications is provided in [10]–[12]. On the other hand, research has explored RL-based energy market frameworks and dynamic pricing schemes that bring economic benefits to both customers and grid operators. The authors in [13] proposed an RL algorithm that allows service providers and customers to learn pricing and energy consumption strategies without a priori knowledge, leading to reduced system costs. Furthermore, [14] investigated an RL-based dynamic pricing scheme for achieving an optimal "price policy" in the presence of fast-charging electric vehicles on the grid. In order to reduce the electricity bills of residential customers, a mathematical model using RL for load scheduling was developed in [15], assuming that residential loads include schedulable loads, non-schedulable loads, and local PV generation.

More closely aligned to our paper are the works in [16] and [17]. [16] described an RL-based dynamic pricing, demand response algorithm using a Q-learning approach for a hierarchical electricity market that considers both service providers' and customers' profits, and shows improvements in profitability and reduced costs. However, this work only examines regular customers without generation or storage capacity. The authors in [17] proposed an RL-based home energy management (HEM) framework which considers real-time electricity prices and PV generation, and the framework achieves superior performance and cost-effective schedules for demand response in a HEM system. Nonetheless, the households in this work are modeled as traditional loads unable to sell back their excess power to the grid. Although Electric Vehicle (EV) charging is modeled, the storage capacity of EVs is not leveraged for cost optimization, meaning the households do not have any energy storage capacity. A demand response dynamic pricing framework is also provided in [18], which is highly related to our work.
III. SYSTEM MODEL
The proposed electricity market model is shown in Fig. 1. As pictured, this model encompasses a grid agent (GA) and several prosumer agents (PAs). The learning environment is a combination of the governing equations of the grid and prosumer physical systems, the operational limitations of the power grid and the prosumers, and external factors such as the time of day or PV generation level, as explained in the physical model subsection below. Although consumers are depicted in Fig. 1, we do not consider them as individual agents due to their constant consumption of energy.
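To make this interaction concrete, the following is a minimal sketch of one market time slot; the function name and all numeric values are our own illustration, not part of the paper's implementation. Each prosumer injects its PV output minus battery charging minus local consumption, and the grid's generators supply the residual demand.

```python
# Minimal sketch of one market time slot in the proposed model (our own
# illustration): each prosumer injects PV minus battery charging minus
# local consumption, and the grid's generators cover the residual demand.

def market_step(demand, pv, battery_power, consumption):
    """Return per-prosumer grid injections and the power the grid must
    supply from its own generation to keep the system balanced."""
    injections = [p - b - c for p, b, c in zip(pv, battery_power, consumption)]
    grid_supply = demand - sum(injections)
    return injections, grid_supply
```

For example, with a 10 kW demand and two prosumers, `market_step(10.0, [2.0, 1.0], [0.5, -0.5], [1.0, 1.0])` yields injections of 0.5 kW each and 9.0 kW of grid supply.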
Notations:
We use the following notations throughout the paper. Bold letters are used for vectors, while non-bold letters are scalars. Sets are denoted by calligraphic fonts (e.g., $\mathcal{S}$). The grid and household variables are denoted by $(\cdot)^G$ and $(\cdot)^H$, respectively.

A. Physical System Model
Grid Operation:
We assume a power system with $K$ generators, each with a power output level of $P_{G_i}$ such that $i \in \{1, \ldots, K\}$, and $M$ prosumers, each with a power injection level of $P_{H_j}$ where $j \in \{1, \ldots, M\}$. In the context of an energy marketplace, the goal of the grid is to maximize its profit over a time horizon of $T$, which is denoted by $\psi_G(T)$. The accumulative grid profit is then equal to the total grid revenue minus the total cost of operation, i.e.,

$$\psi_G(T) = \Upsilon_G(T) - \left[ \sum_{i=1}^{K} \Omega_{G_i}(T) + \sum_{j=1}^{M} \Omega_{H_j}(T) \right]. \tag{1}$$

In this case, $\Upsilon_G(\cdot)$ denotes the accumulative grid revenue as a result of selling $P_D(t)$ of electricity to the loads at the selling price of $\rho_s(t)$ \$/kWh. Therefore, the accumulative revenue over a time horizon of $T$ is defined as:

$$\Upsilon_G(T) = \int_0^T P_D(t)\, \rho_s(t)\, dt. \tag{2}$$

Moreover, $\Omega_{G_i}(T)$ denotes the accumulative cost of buying electricity from the $i$th generation facility. $\Omega_{G_i}(T)$ is typically estimated using the incremental cost curves of the generation facilities. In addition to the cost of buying electricity from generation facilities, the grid is able to buy electricity from prosumers. Thus, the accumulative cost of buying electricity from the $j$th prosumer is equal to:

$$\Omega_{H_j}(T) = \int_0^T P_{H_j}(t)\, \rho_b(t)\, dt \quad \text{for } P_{H_j}(t) > 0, \tag{3}$$

where $\rho_b(t)$ (in the unit of \$/kWh) is the price of purchasing electricity from prosumers, referred to as the buy price hereinafter. The GA's goal is to maximize (1) subject to the fundamental grid power balance equation,

$$P_D(t) - \sum_{i=1}^{K} P_{G_i}(t) - \sum_{j=1}^{M} P_{H_j}(t) = 0, \quad \forall t. \tag{4}$$

It should be noted that due to heterogeneous generation facilities, we assume that the output of the $i$th facility is constrained by practical limitations such as:

$$P_{G_i}^{\min} \le P_{G_i}(t) \le P_{G_i}^{\max}, \quad \text{for } i = 1, \ldots, K. \tag{5}$$

Prosumer's Operation:
A typical prosumer setup with a PV deployment and energy storage is shown in Fig. 1. According to this figure, the goal of the $j$th prosumer agent is to maximize its own accumulative profit $\psi_{H_j}(T)$, defined as:

$$\psi_{H_j}(T) = \Upsilon_{H_j}(T) - \Omega_{H_j}(T), \tag{6}$$

where $\Upsilon_{H_j}(T)$ is the accumulative revenue of the $j$th prosumer for selling electricity to the grid, and $\Omega_{H_j}(T)$ is the accumulative cost of buying electricity from the grid, defined by:

$$\Upsilon_{H_j}(T) = \int_0^T P_{H_j}(t)\, \rho_b(t)\, dt \quad \text{for } P_{H_j}(t) > 0, \tag{7}$$

$$\Omega_{H_j}(T) = \int_0^T P_{H_j}(t)\, \rho_s(t)\, dt \quad \text{for } P_{H_j}(t) \le 0. \tag{8}$$

Assuming that for the $j$th prosumer, $P_{PV_j}(t)$ is the PV generation, $P_{b_j}(t)$ is the battery charge/discharge power, and $P_{C_j}(t)$ is the consumption power, the internal power balance is then described as follows:

$$P_{H_j}(t) = P_{PV_j}(t) - P_{b_j}(t) - P_{C_j}(t). \tag{9}$$

In order to model realistic scenarios, we also pose the following constraints on each of these parameters:
(i) If $P_{H_j}^{\max}$ is the maximum allowable power injection, then we have $|P_{H_j}(t)| \le P_{H_j}^{\max}$.
(ii) $P_{PV_j}^{\max}$ denotes the peak PV generation such that $0 \le P_{PV_j}(t) \le P_{PV_j}^{\max}$.
(iii) Given that $P_{b_j}^{\max}$ is the maximum allowable battery charge/discharge power, then $|P_{b_j}(t)| \le P_{b_j}^{\max}$.
(iv) Assuming that $\phi_j$ is the State of Charge (SoC) of the battery, and $\phi_j^{\min}$ and $\phi_j^{\max}$ are the minimum and maximum allowable states of charge of the battery, we have $\phi_j^{\min} \le \phi_j \le \phi_j^{\max}$. The state of charge of the battery for the $j$th prosumer is calculated from

$$\phi_j(t) = \phi_j(0) + \frac{1}{C_{B_j}} \int_0^t P_{b_j}(\tau)\, d\tau, \tag{10}$$

where $C_{B_j}$ is the battery capacity and $\phi_j(0)$ represents the initial SoC of the battery.

Next we describe a deep reinforcement learning framework to enable the grid and prosumers to dynamically take optimal actions at each time slot.

B. Reinforcement Learning Model
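As a concrete illustration of the prosumer storage model above, the following sketch discretizes the SoC update and enforces the charge-power and SoC limits. All parameter values are illustrative, and the SoC is tracked directly in kWh here rather than as a fraction of capacity.

```python
# Sketch of the battery model above (illustrative values): the battery
# command is clipped to the allowed charge/discharge power, the SoC is
# integrated over one time slot, and then clipped to its allowed window.

P_B_MAX = 2.0                # max charge/discharge power [kW]
SOC_MIN, SOC_MAX = 1.0, 9.0  # allowable SoC window [kWh]

def soc_step(soc, p_battery, dt_hours=0.25):
    """Advance the state of charge by one slot (charging is positive)."""
    p = max(-P_B_MAX, min(P_B_MAX, p_battery))   # charge-power limit
    soc_next = soc + p * dt_hours                # discretized SoC update
    return max(SOC_MIN, min(SOC_MAX, soc_next))  # SoC limits
```

For instance, `soc_step(5.0, 3.0)` clips the 3 kW command to 2 kW and returns an SoC of 5.5 kWh after a 15-minute slot.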
In this work, the dynamic pricing problem is formulated as a Markov Decision Process (MDP) such that given a state $s_t$ at time $t$, the goal is choosing the optimal action for transitioning to a new state $s_{t+1}$ at time $t+1$, where $s_t, s_{t+1} \in \mathcal{S}$ such that $\mathcal{S}$ is the set of all possible environment states. This problem can be viewed as an instance of Reinforcement Learning (RL), which is concerned with studying how an agent or a group of agents learn(s) the environment by collecting observations, choosing actions, and receiving rewards. Assuming that $\mathcal{A}$ is the set of feasible actions available to each agent, as a result of taking an action $a_t \in \mathcal{A}$, the agent receives an immediate reward $r_t$, and the environment transitions from the state $s_t$ to $s_{t+1}$.

In the proposed energy marketplace, we have a set of agents denoted by $\mathcal{N} = \{GA, PA_1, \ldots, PA_M\}$ in which $GA$ is the grid agent and $PA_j$ is the agent for prosumer $j$. Next, we provide details on the observations, actions, and rewards for each agent type (i.e., grid agent or prosumer agent). In this framework, all the continuous variables are discretized using a zero-order hold to find the values at each time slot $t$.

Grid Agent:
The GA observes the following state variables:
(i) the cost of buying electricity from the $K$ generation facilities at time $t$, denoted by $\boldsymbol{\omega}_G^t = [\omega_{G_1}^t, \ldots, \omega_{G_K}^t]$,
(ii) the cost to the grid operator of buying electricity from the $M$ prosumers, denoted by $\boldsymbol{\omega}_H^t = [\omega_{H_1}^t, \ldots, \omega_{H_M}^t]$,
(iii) the total grid demand $P_D^t$.
We use the notation $s_{GA}^t$ to represent all observations of the grid agent at time $t$. Thus, based on the observations of the grid at time $t$, the grid agent's action is to determine the electricity buy price. As described in the physical model, the buy price is denoted by $\rho_b^t \in \mathcal{A}_{GA}$, where $\mathcal{A}_{GA}$ is the finite set of available actions to the GA (i.e., all possible buy prices).

The reward function for the grid at time $t$ is defined as the grid profit, i.e.,

$$r_{GA}^t = \upsilon_G^t - \left[ \sum_{i=1}^{K} \omega_{G_i}^t + \sum_{j=1}^{M} \omega_{H_j}^t \right], \tag{11}$$

where $\upsilon_G^t$ denotes the grid revenue at time slot $t$ as a result of selling $P_D^t$ electricity, which is obtained by $\upsilon_G^t = P_D^t \times \rho_s^t$. In addition, $\omega_{G_i}^t$ is the grid cost to buy $P_{G_i}^t$ from the $i$th generation facility at time slot $t$. The value of $P_{G_i}^t$ is obtained using the incremental cost curve of the $i$th generation facility. Finally, the grid cost to buy $P_{H_j}^t$ from prosumer $j$ at time slot $t$ is denoted by $\omega_{H_j}^t$, which can be calculated as

$$\omega_{H_j}^t = P_{H_j}^t \times \rho_b^t \quad \text{for } P_{H_j}^t > 0. \tag{12}$$

Given the definition of the immediate reward $r_{GA}^t$, the ultimate goal is to maximize the agent's cumulative reward over an infinite time horizon, also known as the expected return:

$$\Gamma_{GA}^t = \sum_{k=0}^{\infty} \gamma^k\, r_{GA}^{t+k+1}, \tag{13}$$

where $0 \le \gamma \le 1$ is the discount rate for the grid agent.

Prosumer Agent:
The prosumer agent $j$ observes the following state variables:
(i) the state of charge of the battery, denoted by $\phi_j^t$,
(ii) the PV generation, denoted by $P_{PV_j}^t$,
(iii) the buy price $\rho_b^t$ determined by the grid agent,
(iv) the local power consumption, denoted by $P_{C_j}^t$.
Based on this set of observations, the charge/discharge command to the energy storage in prosumer $j$ is the action determined by $PA_j$, which is shown by $\sigma_j^t \in \mathcal{A}_{PA_j}$. In this case, $\mathcal{A}_{PA_j}$ is the finite set of available actions to $PA_j$. The reward function for $PA_j$ is defined as

$$r_{PA_j}^t = \upsilon_{H_j}^t - \omega_{H_j}^t, \tag{14}$$

where $\upsilon_{H_j}^t = P_{H_j}^t \times \rho_b^t$ for $P_{H_j}^t > 0$ is the $j$th prosumer's revenue from selling $P_{H_j}^t$ to the grid at time slot $t$, and $\omega_{H_j}^t = P_{H_j}^t \times \rho_s^t$ for $P_{H_j}^t \le 0$ is the $j$th prosumer's cost from buying $P_{H_j}^t$ from the grid at time slot $t$. Similar to the grid agent, the $j$th prosumer tries to maximize its infinite-horizon accumulative reward defined as:

$$\Gamma_{PA_j}^t = \sum_{k=0}^{\infty} \tilde{\gamma}_j^k\, r_{PA_j}^{t+k+1}, \tag{15}$$

where $0 \le \tilde{\gamma}_j \le 1$ is the discount rate for $PA_j$.

C. Q-Learning Framework
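The reward and return definitions in (11)-(15) can be sketched as follows. Prices and powers are illustrative, and the prosumer cost term is taken with its sign so that buying from the grid yields a negative reward.

```python
# Sketch of the per-slot rewards (11)-(14) and the discounted return
# (13)/(15). All numeric arguments are illustrative.

def grid_reward(demand, rho_s, gen_costs, injections, rho_b):
    """Grid profit for one slot: revenue from selling demand at rho_s,
    minus generation costs, minus payments to injecting prosumers."""
    revenue = demand * rho_s
    prosumer_cost = sum(p * rho_b for p in injections if p > 0)
    return revenue - sum(gen_costs) - prosumer_cost

def prosumer_reward(p_h, rho_s, rho_b):
    """Prosumer profit for one slot: paid rho_b per kWh when injecting
    (p_h > 0), charged rho_s per kWh when drawing (p_h <= 0)."""
    return p_h * rho_b if p_h > 0 else p_h * rho_s

def discounted_return(rewards, gamma=0.95):
    """Finite-trace version of the infinite-horizon expected return."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))
```

For instance, with a discount rate of 0.5, a two-slot trace of unit rewards gives `discounted_return([1.0, 1.0], 0.5)` = 1.5.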
In this work, the agents use a Deep Q-Network (DQN) to solve their respective MDPs and maximize their accumulative rewards in (13) and (15). The DQN algorithm uses deep learning for each agent based on the Bellman iterative equation. In particular, for the grid agent we have

$$Q(s_{GA}^t, \rho_b^t) \leftarrow Q(s_{GA}^t, \rho_b^t) + \alpha \left[ r_{GA}^{t+1} + \gamma \max_{\rho_b^{t+1}} Q(s_{GA}^{t+1}, \rho_b^{t+1}) - Q(s_{GA}^t, \rho_b^t) \right], \tag{16}$$

and similarly, for the prosumer agent we have

$$Q(s_{PA_j}^t, \sigma_j^t) \leftarrow Q(s_{PA_j}^t, \sigma_j^t) + \tilde{\alpha}_j \left[ r_{PA_j}^{t+1} + \tilde{\gamma}_j \max_{\sigma_j^{t+1}} Q(s_{PA_j}^{t+1}, \sigma_j^{t+1}) - Q(s_{PA_j}^t, \sigma_j^t) \right], \tag{17}$$

where $\alpha$ and $\tilde{\alpha}_j$ are the learning rates for $GA$ and $PA_j$, respectively. The estimated Q-values are used to find the optimal policy that maximizes the accumulative rewards. The DQN framework for the grid and prosumer agents is illustrated in Algorithms 1 and 2, respectively.

Algorithm 1: Q-learning Algorithm for the Grid Agent
  Initialize $Q(s_{GA}^t, \rho_b^t)$ to zero
  for each Episode do
    for each Iteration do
      $t := t + 1$
      Set buy price $\rho_b^t$ according to policy $\pi_{GA}$
      Observe reward $r_{GA}^{t+1}$ at new state $s_{GA}^{t+1}$
      Update $Q(s_{GA}^t, \rho_b^t)$ using (16)
      $s_{GA}^t := s_{GA}^{t+1}$
    end for
  end for

Algorithm 2: Q-learning Algorithm for the $j$th Prosumer Agent
  Initialize $Q(s_{PA_j}^t, \sigma_j^t)$ to zero
  for each Episode do
    for each Iteration do
      $t := t + 1$
      Set charge/discharge $\sigma_j^t$ according to policy $\pi_{PA_j}$
      Observe reward $r_{PA_j}^{t+1}$ at new state $s_{PA_j}^{t+1}$
      Update $Q(s_{PA_j}^t, \sigma_j^t)$ using (17)
      $s_{PA_j}^t := s_{PA_j}^{t+1}$
    end for
  end for

In this framework, to balance exploration versus exploitation, the epsilon-greedy strategy $\pi$ is used for the GA and PAs as follows [19]:

$$\pi = \begin{cases} \arg\max_{a_t} \mathbb{E}[Q(s_t, a_t)] & \text{with probability } 1 - \varepsilon, \\ \text{random action} & \text{with probability } \varepsilon. \end{cases}$$

The probability of random actions $\varepsilon$ starts at 1 for the first 300 episodes, and then decays to 0.01 over the training episodes.

IV. CASE STUDY AND NUMERICAL RESULTS
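The exploration schedule just described, with $\varepsilon$ fixed at 1.0 for the first 300 episodes and then decaying to a floor of 0.01, can be sketched as follows; the exponential decay rate is our assumption, since only the endpoints are stated.

```python
import random

# Sketch of the epsilon-greedy strategy described above. Epsilon is held
# at 1.0 for a 300-episode warm-up, then decays toward a floor of 0.01;
# the exponential decay rate (0.995) is an assumption of ours.

def epsilon(episode, warmup=300, eps_min=0.01, decay=0.995):
    if episode < warmup:
        return 1.0
    return max(eps_min, decay ** (episode - warmup))

def select_action(q_values, episode, rng=random):
    """Greedy action with probability 1 - epsilon, random otherwise."""
    if rng.random() < epsilon(episode):
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Late in training, epsilon sits at its floor, so `select_action` almost always returns the index of the largest Q-value.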
The proposed energy marketplace model is implemented on a small-scale microgrid system, illustrated in Fig. 1, to demonstrate the operation of the agents and their effectiveness for improving the economic benefit of the grid operator and the prosumers. As pictured, the system under study is comprised of two generation facilities ($K = 2$), three prosumers ($M = 3$) that host the $PA_1$ to $PA_3$ agents, the grid operator that hosts the grid agent (GA), and one nongenerational household (a.k.a. consumer, $N = 1$). The parameters of the system are tabulated in Table I. The PV generation and local consumption profiles employed for the last episode of the three prosumers are illustrated in Fig. 2. These waveforms are constructed to be representative of real-world data available from the California ISO website [4]. The peak values of generation and consumption for each prosumer are listed in Table I. The demand profile for the last episode for the nongenerational household is also shown in Fig. 2, and its peak value is listed in Table I. Each prosumer is equipped with an energy storage system (ESS) with a constant charge/discharge rate and a capacity provided in Table I.

[Fig. 2. Generation and consumption waveform sample for prosumers and consumer.]

TABLE I. Simulation parameters used for the proposed energy marketplace model on a small-scale microgrid.

Parameter | Description | Value
$P_{PV_j}^{\max}$ | Max. PV generation | [2–2.5] kW
$P_{b_j}^{\max}$ | Max. allowable charge/discharge | 2/−2 kW
$P_{H_j}^{\max}$ | Max. allowable power injection | 10 kW
$\phi_j^{\max}$ | Max. state of charge | · × $C_{b_j}$
$\phi_j^{\min}$ | Min. state of charge | · × $C_{b_j}$
$C_{b_j}$ | Energy storage capacity | [8–10] kWh
$\phi_j(0)$ | Initial state of charge | [3–4] kWh
$\rho_s$ | Sell price [before 11am, after 11am] | [0.05, 0.095] $/kWh
$\rho_b^t$ | Buy price for agent-based scenario | {·, ·, ·, ·, ·, ·} $/kWh
$\rho_b^t$ | Buy price for conventional scenario | 0.05 $/kWh
$[P_{G_1}^{\min}, P_{G_1}^{\max}]$ | Limits of base generation | [5, 20] kW
$[P_{G_2}^{\min}, P_{G_2}^{\max}]$ | Limits of reserve generation | [0, 50] kW
$[\beta_1, \beta_2]$ | Incremental cost of the two generators | [0.03, 0.3] $/kWh

In order to establish a baseline for the economic benefit of the grid operator and the households, a conventional system with a fixed buy price and no intelligent prosumer agents is simulated. In this scenario, the prosumers only sell electricity to the grid when their generation exceeds their local consumption and their ESS is fully charged, which is likely to happen during peak sun hours [20]. The described microgrid model for trading electricity between the grid and residential loads is shown in Fig. 1. This scenario is referred to as the conventional scenario.

In the next scenario, we leverage the grid and prosumer agents to implement the proposed market model, and these results are compared with the conventional scenario to demonstrate the economic improvements. This scenario is referred to as the agent-based scenario.

TABLE II. DQN hyperparameters.

Hyperparameter | Value for $GA$ | Value for $PA_j$
Batch size | 64 | 64
Discount factor | $\gamma$ = [0.95–0.99] | $\tilde{\gamma}_j$ = [0.95–0.99]
Learning rate | $\alpha$ = 1e-3 | $\tilde{\alpha}_j$ = 1e-3
Soft update interpolation | 1e-5 | 1e-5
Hidden layers–nodes | 1–[1000] | 2–[1000, 1000]
Activation | Sigmoid | Sigmoid
Optimizer | Adam | Adam

In this work, we use the PyTorch framework (v. 1.5.0 with Python 3) to implement the DQN agents [21]. For training and testing the neural networks, we use an Intel Xeon processor running at 3 GHz with 16 GB of RAM. The DQN algorithm hyperparameters used for the simulations are provided in Table II. The simulations for both the conventional and agent-based scenarios are carried out via episodic iterations for 10,000 episodes. Each episode represents a 24-hour cycle and consists of 96 iterations, meaning that the simulation time slots are 15 minutes.

The action space for all prosumer agents (i.e., the set $\mathcal{A}_{PA}$) includes three options: charge, no charge or discharge, and discharge. As a result, these actions command the battery power to one of the following three levels at each time slot $t$:

$$P_{b_j}^t = \begin{cases} P_{b_j}^{\max} & \text{Charge action}, \\ 0 & \text{No charge or discharge action}, \\ -P_{b_j}^{\max} & \text{Discharge action}. \end{cases} \tag{18}$$

The action space for the GA (i.e., the buy price) is defined by the set of buy-price levels $\mathcal{A}_{GA}$ listed in Table I, in which all numbers represent $/kWh values. The sell price $\rho_s^t$ is defined at a constant rate in this work, as provided in Table I. The incremental costs of the two generators in terms of $/kWh are defined as

$$\begin{cases} \omega_{G_1}^t = \beta_1 & \text{for } P_{G_1}^{\min} \le P_{G_1}^t \le P_{G_1}^{\max}, \\ \omega_{G_2}^t = \beta_2 & \text{for } P_{G_2}^{\min} \le P_{G_2}^t \le P_{G_2}^{\max}, \end{cases} \tag{19}$$

where $\beta_2 > \beta_1$ (see Table I). Consequently, $P_{G_1}$ provides baseline generation capacity at a lower incremental cost while $P_{G_2}$ provides reserve capacity at a much higher cost.

The simulation results comparing the conventional and agent-based scenarios throughout the 10,000 episodes are illustrated in Fig. 3 (a)-(c), where we compare the daily bill of the three prosumers over a 24-hour period.
From the results, we note that while the daily bill resulting from the conventional scenario remains fairly constant throughout the episodes, the prosumer agents start converging to a lower bill as the agents explore the environment further and learn the optimal policy. As shown, the daily bills for households 1-3 are lowered by 1400%, 27%, and 13%, respectively. The unusually high daily bill reduction for household 1 is attributable to its conventional daily bill being close to zero from the beginning (i.e., high PV generation) and the household's smaller peak consumption according to Fig. 2.

[Fig. 3. Simulation results for conventional vs. agent-based scenarios over 10,000 episodes: (a)-(c) 24-hour accumulative reward comparison for the three prosumers, (d) grid 24-hour accumulative reward comparison, (e) grid reserve power utilization.]

Fig. 3 (d)-(e) compare the accumulative grid profit and the use of costly reserve power ($P_{G_2}$) over a 24-hour period. The agent-based scenario starts with a lower profit than the conventional scenario but converges to a much higher profit level as the agent learns the optimal policy. In this case, the grid profit improved by around 15%. According to Fig. 3(e), the grid profit improvement is mostly attributable to the lower usage of costly reserve power in the agent-based scenario. In fact, in this experiment, the grid agent learns to rely on the prosumers' generation for balancing the grid's power rather than using the more expensive reserve power. The use of reserve power is decreased by 10% in this experiment.

V. CONCLUSIONS
This paper proposes an RL-based distributed energy marketplace framework that enables a real-time, demand-dependent, dynamic pricing environment to incentivize prosumers' grid support engagement while improving the economic benefit of both prosumers and the grid operator. Simulation results from implementing the proposed market model show major economic improvements for the prosumers and the grid (including reduced reserve power utilization by the grid).

REFERENCES
[5] IEEE Transactions on Energy Conversion, vol. 34, no. 1, pp. 468–477, 2019.
[6] O. Ciftci, M. Mehrtash, F. Safdarian, and A. Kargarian, "Chance-constrained microgrid energy management with flexibility constraints provided by battery storage," in , 2019, pp. 1–6.
[7] G. C. Christoforidis, I. P. Panapakidis, T. A. Papadopoulos, G. K. Papagiannis, I. Koumparou, M. Hadjipanayi, and G. E. Georghiou, "A model for the assessment of different net-metering policies," Energies, vol. 9, no. 4, 2016.
[8] A. Poullikkas, "A comparative assessment of net metering and feed in tariff schemes for residential PV systems," Sustainable Energy Technologies and Assessments, vol. 3, pp. 1–8, 2013.
[9] B. Nordman, "Local grid definitions," Smart Grid Interoperability Panel and Lawrence Berkeley National Laboratory, Berkeley, USA, Tech. Rep., 2016.
[10] M. Khoshjahan, M. Soleimani, and M. Kezunovic, "Optimal participation of PEV charging stations integrated with smart buildings in the wholesale energy and reserve markets," in IEEE Power & Energy Society Innovative Smart Grid Technologies, 2020, pp. 1–5.
[11] I. S. Bayram, M. Z. Shakir, M. Abdallah, and K. Qaraqe, "A survey on energy trading in smart grid," in , 2014, pp. 258–262.
[12] A. R. Khan, A. Mahmood, A. Safdar, Z. A. Khan, and N. A. Khan, "Load forecasting, dynamic pricing and DSM in smart grid: A review," Renewable and Sustainable Energy Reviews, vol. 54, 2016.
[13] B. Kim, Y. Zhang, M. van der Schaar, and J. Lee, "Dynamic pricing and energy consumption scheduling with reinforcement learning," IEEE Transactions on Smart Grid, vol. 7, no. 5, pp. 2187–2198, 2016.
[14] C. Fang, H. Lu, Y. Hong, S. Liu, and J. Chang, "Dynamic pricing for electric vehicle extreme fast charging," IEEE Transactions on Intelligent Transportation Systems, pp. 1–11, 2020.
[15] T. Remani, E. A. Jasmin, and T. P. I. Ahamed, "Residential load scheduling with renewable generation in the smart grid: A reinforcement learning approach," IEEE Systems Journal, vol. 13, no. 3, pp. 3283–3294, 2019.
[16] R. Lu, S. H. Hong, and X. Zhang, "A dynamic pricing demand response algorithm for smart grid: Reinforcement learning approach," Applied Energy, vol. 220, pp. 220–230, 2018.
[17] X. Xu, Y. Jia, Y. Xu, Z. Xu, S. Chai, and C. S. Lai, "A multi-agent reinforcement learning based data-driven method for home energy management," IEEE Transactions on Smart Grid, pp. 1–1, 2020.
[18] A. Shojaeighadikolaei, A. Ghasemi, K. R. Jones, A. G. Bardas, M. Hashemi, and R. Ahmadi, "Demand responsive dynamic pricing framework for prosumer dominated microgrids using multiagent reinforcement learning," in The 52nd North American Power Symposium.
[19] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau, "An introduction to deep reinforcement learning," Foundations and Trends in Machine Learning, vol. 11, no. 3-4, pp. 219–354, 2018.
[20] Q. Sun, M. E. Cotterell, Z. Wu, and S. Grijalva, "An economic model for distributed energy prosumers," in Proceedings of the 46th Annual Hawaii International Conference on System Sciences, 2013.
[21] N. Naderializadeh and M. Hashemi, "Energy-aware multi-server mobile edge computing: A deep reinforcement learning approach," in 53rd Asilomar Conference on Signals, Systems, and Computers.