PowerNet: Multi-agent Deep Reinforcement Learning for Scalable Powergrid Control
Dong Chen, Zhaojian Li*, Tianshu Chu, Rui Yao, Feng Qiu, Kaixiang Lin

Dong Chen and Zhaojian Li are with the Department of Mechanical Engineering, Michigan State University, Lansing, MI 48824, USA. Email: {chendon9, lizhaoj1}@msu.edu. Tianshu Chu is with the Department of Civil and Environmental Engineering, Stanford University, Stanford, CA 94305, USA. Email: [email protected]. Rui Yao and Feng Qiu are with Argonne National Laboratory, Lemont, IL 60439, USA. Email: [email protected]; [email protected]. Kaixiang Lin is with the Department of Computer Science, Michigan State University, Lansing, MI 48824, USA. Email: [email protected]. *Zhaojian Li is the corresponding author.
Abstract—This paper develops an efficient multi-agent deep reinforcement learning algorithm for cooperative control in power grids. Specifically, we consider the decentralized inverter-based secondary voltage control problem in distributed generators (DGs), which is first formulated as a cooperative multi-agent reinforcement learning (MARL) problem. We then propose a novel on-policy MARL algorithm, PowerNet, in which each agent (DG) learns a control policy based on a (sub-)global reward but local states from its neighboring agents. Motivated by the fact that a local control action from one agent has limited impact on agents distant from it, we exploit a novel spatial discount factor to reduce the effect of remote agents, expedite the training process, and improve scalability. Furthermore, a differentiable, learning-based communication protocol is employed to foster collaboration among neighboring agents. In addition, to mitigate the effects of system uncertainty and random noise introduced during on-policy learning, we utilize an action smoothing factor to stabilize the policy execution. To facilitate training and evaluation, we develop PGSim, an efficient, high-fidelity power grid simulation platform. Experimental results in two microgrid setups show that the developed PowerNet outperforms a conventional model-based control as well as several state-of-the-art MARL algorithms. The decentralized learning scheme and high sample efficiency also make it viable for large-scale power grids.
Index Terms—Multi-agent deep reinforcement learning, secondary voltage control, distributed generators (DGs), inverter-based microgrid.
I. INTRODUCTION
Renewable energy such as solar and wind has attracted significant research interest in recent decades due to its great potential to reduce greenhouse gas emissions and subsequently slow global warming [1]. In a modern power grid, these renewable energy sources are present as distributed generators (DGs), along with other traditional electricity sources such as fossil-fuel and nuclear stations. In particular, microgrids are localized subgrids that can operate either connected to or isolated from the main grid. As microgrids can still operate when the main grid is down, they can strengthen grid resilience and service reliability. There are two levels of control in a microgrid when disconnected from the main grid: primary control and secondary control [2]. The primary control refers to the lower-level control within a DG to maintain a desired voltage reference, whereas the secondary control is concerned with the cooperative generation of local references across the DGs to achieve grid-wise control objectives [3]–[7]. As the secondary control has great potential to improve the overall grid efficiency, it is the focus of this paper.

Existing secondary control methods can be categorized into two main classes: centralized and distributed. A centralized controller gathers information from all DGs and then makes a collective control decision that is sent to the respective DGs [8]–[10]. While these centralized voltage control schemes show promising results, they incur heavy communication overheads and often suffer from a single point of failure [11] and the curse of dimensionality, which make them impractical to deploy in today's large-scale microgrid systems. Inspired by the idea of cooperative control in multi-agent systems, distributed control methods have also been developed [2], [7], [12]–[16], which consider a more practical setup where each DG communicates with its neighboring DGs and makes a decentralized control decision based on its own state as well as the states of its neighbors, shared via local communication networks. Conventional model-based methods generally first approximate the nonlinear, heterogeneous dynamics of microgrids with simplified linear dynamics and then develop distributed feedback controllers for the formulated tracking synchronization problem [2], [13], [14], [17]. Since the underlying microgrid dynamics is subject to complex nonlinearity, system and disturbance uncertainty, and high dimensionality, model simplifications have to be made to enable such model-based control designs, which inevitably degrades control performance.

Meanwhile, multi-agent reinforcement learning (MARL) has advanced greatly and has been successfully applied to a variety of complex systems in the past few years, including strategy games such as StarCraft and Dota 2 [18], [19], traffic light control [20], and autonomous driving [21].
Applications of MARL to voltage control of microgrids also exist [10], [11], [22]–[24], which focus on autonomous voltage control (AVC) with the objective of regulating the voltage of all buses across the power system. In this paper, we treat the secondary voltage control of isolated microgrids with the goal of stabilizing the output voltages of the DGs at a pre-defined reference value [13]. We formulate the secondary voltage control of inverter-based microgrid systems as a partially observable Markov decision process (POMDP) and develop an on-policy MARL algorithm, PowerNet, to efficiently coordinate the DGs for cooperative control. The resulting decentralized, on-policy MARL approach is both stable in training and efficient in policy learning. Comprehensive experiments are conducted, and the results show that the proposed PowerNet outperforms several state-of-the-art MARL algorithms, as well as a model-based approach, in terms of performance, learning efficiency, and scalability.

The main contributions and technical advancements of this paper are summarized as follows.
1) The secondary voltage control of inverter-based microgrids is modeled as a decentralized, cooperative MARL problem, and a corresponding power grid simulation platform, PGSim, is developed and open-sourced (see https://github.com/Derekabc/PGSIM).
2) We develop an efficient on-policy, decentralized MARL algorithm, PowerNet, to effectively learn a control policy by introducing a novel spatial discount factor, a learning-based communication protocol, and an action smoothing mechanism.
3) Comprehensive experiments are performed, which show that the proposed PowerNet outperforms a conventional model-based control as well as 5 state-of-the-art MARL algorithms in terms of sample efficiency and voltage regulation performance.

The remainder of the paper is organized as follows. The background of reinforcement learning and state-of-the-art MARL algorithms is presented in Section II. In Section III, we formulate the secondary voltage control as a MARL problem, and our developed MARL algorithm, PowerNet, is detailed in Section IV. Experiments, results, and discussions are presented in Section V, whereas concluding remarks and future work are given in Section VI.

II. BACKGROUND

In this section, we review the preliminaries of reinforcement learning (RL) and several state-of-the-art MARL algorithms that will be used as benchmarks for comparison in Section V.

A. Preliminaries of Reinforcement Learning (RL)
The objective of an RL agent is to learn an optimal policy π* that maximizes the accumulated reward R_t = \sum_{k=0}^{T} γ^k r_{t+k} via continuous interactions with the environment. Here r_{t+k} is the (random) instantaneous reward at time step t+k and γ ∈ (0, 1] is a discount factor that places more emphasis on near-term rewards. More specifically, at each discrete time step t, the agent takes an observation of the environment and itself, s_t ∈ R^n, executes an (exploratory) action a_t ∈ R^m, and receives a scalar reward r_t ∈ R. In general, the environment is also impacted by some unknown disturbance w_t. A policy, π : S → Pr(A), is a mapping from states to a probability distribution over the action space, and it describes the agent's behavior. If the agent can only observe o_t, a partial description of the state s_t, the underlying dynamics becomes a partially observable Markov decision process (POMDP), and the goal is then to learn a policy that maps from the observation o_t to an appropriate action.

The value function of a state s under policy π, V^π(s), is defined as the expected return starting from s and following policy π, i.e., V^π(s_t) = E_π[R_t | s_t = s]. The state-action value function (or Q-function), Q^π(s, a), is defined as the expected return starting from state s, taking first action a, and then following policy π afterwards. The optimal Q-function for a state-action pair (s, a) is given by Q*(s, a) = max_π Q^π(s, a). Note that knowledge of the optimal Q-function induces an optimal policy via π*(s) = arg max_a Q*(s, a). The agent's policy is often parameterized by some parameters θ, and the goal is to learn the values of θ that maximize the reward. In actor-critic (A2C) algorithms [25], [26], two networks are employed: a critic network parameterized by ω to learn the value function V^{π_θ}_ω(s_t), and an actor network π_θ(a_t | s_t) parameterized by θ to update the policy distribution in the direction suggested by the critic network:

  θ ← θ + E_{π_θ}[ (∇_θ log π_θ(a_t | s_t)) A_t ],   (1)

where A_t = Q^{π_θ}(s_t, a_t) − V^{π_θ}_ω(s_t) [25] is the advantage function used to reduce the sample variance. The value function parameters are then updated by minimizing the following squared error loss:

  min_ω E_D [ ( R_t + γ V_{ω'}(s_{t+1}) − V_ω(s_t) )^2 ],   (2)

where D denotes the experience replay buffer that collects previously encountered trajectories and ω' denotes the parameters obtained from earlier iterations, used as a target network [10].
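To make the actor-critic update concrete, the following is a minimal, illustrative PyTorch sketch of the updates in Eqns. (1) and (2) for a single agent. The module names, tensor shapes, and optimizer setup are assumptions made for illustration and do not reflect the exact architecture used later by PowerNet.

import torch
import torch.nn.functional as F

def a2c_update(actor, critic, target_critic, actor_opt, critic_opt,
               s_t, a_t, r_t, s_next, gamma=0.99):
    # Critic target R_t + gamma * V_{omega'}(s_{t+1}) using a frozen target network (Eqn. (2)).
    with torch.no_grad():
        v_target = r_t + gamma * target_critic(s_next).squeeze(-1)
    v_t = critic(s_t).squeeze(-1)
    critic_loss = F.mse_loss(v_t, v_target)

    # Advantage A_t and policy-gradient step of Eqn. (1); the advantage is detached
    # so that the actor loss does not backpropagate into the critic.
    advantage = (v_target - v_t).detach()
    log_prob = torch.distributions.Categorical(logits=actor(s_t)).log_prob(a_t)
    actor_loss = -(log_prob * advantage).mean()

    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()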
B. MARL and State-of-the-Art

MARL is an extension of single-agent RL and is concerned with systems of multiple agents/players. Great progress has been made in the past decade with many successful applications [18]–[21]. Independent Q-learning (IQL) [27] is a natural extension of single-agent RL, allowing each agent to learn independently and simultaneously while viewing other agents as part of the environment. Although fully scalable, IQL suffers from non-stationarity and partial observability. To address the non-stationarity issue of IQL, FPrint [28] exploits the technique of importance sampling and includes low-dimensional fingerprints to distinguish obsolete data in the experience buffer. Furthermore, ConseNet [29] proposes an actor-critic-based MARL algorithm with fully decentralized critic networks that performs a consensus update over the communication network. An emergent paradigm in MARL is learning to communicate, in which agents automatically learn a communication protocol to maximize their shared utility. For example, in DIAL [30], agents receive messages from their neighbors through a differentiable communication protocol, in which the message is encoded and summed with other input signals at the receiver side. Another framework is CommNet [31], which learns the communication protocol by calculating the mean of all messages instead of encoding them. Note that for both DIAL and CommNet, information loss is inevitable as they involve summation of the input signals [32].

Inspired by the aforementioned works, in this paper we develop a new on-policy MARL algorithm, PowerNet, for secondary voltage control. It exploits the centralized learning, decentralized execution structure as in [33]. To facilitate cooperation among neighboring agents, instead of summing the messages we concatenate and encode the communication messages. Novel techniques of a spatial discount factor and action smoothing are also incorporated to improve learning efficiency and control performance. More details are presented in Section IV.
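To illustrate the aggregation difference noted above, the toy snippet below contrasts summing two neighbor messages (as in DIAL and CommNet) with concatenating them (as done in PowerNet); the message dimensions are arbitrary placeholders used purely for illustration.

import torch

m1, m2 = torch.randn(8), torch.randn(8)   # hypothetical 8-dim messages from two neighbors
summed = m1 + m2                          # DIAL/CommNet-style aggregation: stays 8-dim, mixes the messages
concatenated = torch.cat([m1, m2])        # concatenation: 16-dim, keeps each neighbor's message intact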
Fig. 1: Schematic diagram of the decentralized control of microgrids and the inverter-based DG: (a) diagram of the decentralized control of microgrids; (b) diagram of an inverter-based DG.

III. PROBLEM FORMULATION
A voltage-controlled voltage source inverter (VCVSI) is widely used in DGs to provide fast voltage/frequency support [13]. A typical VCVSI with a decentralized control architecture is illustrated in Fig. 1a, where a secondary controller is employed for each DG to coordinate with neighboring DGs and dynamically generate voltage references. The obtained reference is then used by the lower-level primary controller as the reference to track. The overall control objective is to maintain the voltage and frequency of all DGs at the reference values despite disruptions in the power network and tracking imperfections in the primary control [13], [14], [34]. More specifically, as shown in Fig. 1b, the primary controller of each DG i, i = 1, ..., N, consumes the frequency and voltage references, ω_{ni} and V_{ni}, from the secondary controller and regulates the output voltage V_{oi} and frequency to the desired references. This is typically achieved via the active- and reactive-power droop techniques [4], [13] without communications among DGs. The reader is referred to [13], [14] for the underlying system dynamics, which is exploited to develop our power grid simulation platform, PGSIM. The objective of the secondary voltage control is to coordinate with other DGs and generate reference signals V_{ni} to synchronize the voltage magnitude of each DG i to the reference value, in the presence of power disturbances and primary control imperfections.

While model-based secondary control approaches exist [13], [14], their performance is generally unsatisfactory due to system simplifications made to address nonlinearity and uncertain disturbances. In this paper, we develop a model-free, MARL-based approach. Specifically, we model a microgrid as a multi-agent network G = (ν, ε), where each agent i ∈ ν communicates with its neighbors N_i := {j | ε_{ij} ∈ ε}. Let S := ×_{i∈ν} S_i and A := ×_{i∈ν} A_i be the global state space and action space that represent, respectively, the aggregated state information and the concatenated controls of all DGs. The underlying dynamics of the microgrid can then be characterized by the state transition distribution P : S × A × S → [0, 1].

We consider a decentralized MARL framework to achieve scalable power grid control. More specifically, each DG only communicates with its neighbors and makes a control decision based on these observations. As each agent i (DG i) only observes part of the environment (the states of itself and its neighbors), this leads to a POMDP [35]. At each time step t, each agent i makes an observation o_{i,t} ∈ O_i, takes an action a_{i,t}, and then receives the next observation o_{i,t+1} and a reward signal r_{i,t} : O × A → R. The goal is to learn an optimal decentralized policy π_i : O_i × A_i → [0, 1] such that the expected future rewards are maximized. We solve the above problem through MARL and define the key elements of the considered POMDP in the following.
• Action space: the control action for each DG is the secondary voltage control set point V_n. In this paper, we use 10 discrete actions that are evenly spaced over a fixed per-unit voltage range. The overall action of the microgrid is the joint action of all DGs, i.e., a = v_{n1} × v_{n2} × ... × v_{nN}.
• State space: the state of each DG i is chosen as s_{i,t} = (δ_i, P_i, Q_i, i_{odi}, i_{oqi}, i_{bdi}, i_{bqi}, v_{bdi}, v_{bqi}) to characterize the operation of the DG, where δ_i is the measured reference frame angle; P_i and Q_i denote the active and reactive power, respectively; i_{odi}, i_{oqi}, i_{bdi}, and i_{bqi} represent the output d-q currents of DG i and of the directly connected bus, respectively; and v_{bdi} and v_{bqi} are the output d-q voltages of the connected bus. The entire state of the microgrid system is the Cartesian product of the individual states, i.e., S(t) = s_{1,t} × ... × s_{N,t}.
• Observation space: we assume that each DG can only observe its local state as well as messages from its neighbors, i.e., o_{i,t} = s_{i,t} ∪ m_{i,t}, where m_{i,t} is the communication message received from its neighboring agents j ∈ N_i and will be detailed in Section IV.
• Transition probabilities: the transition probability T(s' | s, a) characterizes the dynamics of the microgrid. We follow the models in [13], [14] to build our simulation platform, but we do not exploit any prior knowledge of the transition probability as our MARL approach is model-free.
• Reward function: we design the following reward function to promote the DGs to quickly converge to the reference voltage (e.g., 1 pu):

  r_{i,t} ≜ 0.05 − |1 − v_i|,  if v_i ∈ [0.95, 1.05],
            −|1 − v_i|,        if v_i ∈ [0.8, 0.95) ∪ (1.05, 1.25],
            −10,               otherwise,                              (3)

where r_{i,t} is the reward of agent i at time step t. More specifically, we divide the voltage range into three operation zones similar to [11]: the normal zone ([0.95, 1.05] pu), the violation zone ([0.8, 0.95) ∪ (1.05, 1.25] pu), and the diverged zone ([0, 0.8) ∪ (1.25, ∞) pu). With this reward, DGs with diverged voltages receive a large penalty, while DGs with voltages close to 1 pu obtain positive rewards.
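For concreteness, a minimal Python transcription of the reward in Eqn. (3) is given below. It assumes the zone thresholds stated above and is intended purely as an illustration of the three operation zones, not as the exact implementation in PGSim.

def voltage_reward(v_i: float) -> float:
    """Per-agent reward r_{i,t} of Eqn. (3) for an output voltage v_i in per unit."""
    if 0.95 <= v_i <= 1.05:                      # normal zone: positive reward near 1 pu
        return 0.05 - abs(1.0 - v_i)
    if 0.8 <= v_i < 0.95 or 1.05 < v_i <= 1.25:  # violation zone: mild penalty
        return -abs(1.0 - v_i)
    return -10.0                                 # diverged zone: large penalty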
IV. POWERNET FOR SECONDARY VOLTAGE CONTROL

In this section, we present a novel decentralized MARL algorithm, PowerNet, to efficiently solve the POMDP formulated above. PowerNet extends the independent actor-critic (IA2C) to multiple cooperative agents by fostering collaboration between neighboring agents, enabled by the following three novel features: 1) a learning-based, differentiable communication protocol, to efficiently promote cooperation between neighboring agents; 2) a novel spatial discount factor, to mitigate partial observability and enhance learning stability; and 3) an action smoothing scheme, to mitigate the effects of system uncertainty and random noise introduced during on-policy learning. We next explain these features in more detail.
Fig. 2: Overview of the proposed communication protocol.
1) Differentiable communication protocol:
We consider a decentralized MARL framework where each agent (DG) can communicate with its neighbors and exchange necessary information such as their states. In contrast to non-communicative MARL algorithms (e.g., IA2C, FPrint, and ConseNet) that generally suffer from slow convergence and unsatisfactory performance [32], we incorporate information from neighboring agents to enhance learning efficiency. As illustrated in Fig. 2, at each time step t, agent i updates its hidden state h_{i,t} as

  h_{i,t} = f_i( h_{i,t−1}, q_o(e_s(o_{i,t})), q_h(h_{N_i,t−1}) ),   (4)

where h_{i,t−1} is the encoded hidden state from the last time step; o_{i,t} is agent i's observation at time t, i.e., its internal state and the states of its neighbors; h_{N_i,t−1} is the concatenated hidden state of its neighbors; e_s, q_o, and q_h are differentiable message encoding and extracting functions, implemented as one-layer fully connected layers with 64 neurons; and f_i is the encoding function for the hidden states and communication information, implemented as a long short-term memory (LSTM [36]) network to improve observability [35] and better utilize the past hidden state h_{i,t−1}.

Instead of using low-dimensional indicators as in [28], we include the neighbors' complete states in the local observation o_{i,t} = s_{i,t} ∪ s_{N_i,t} to improve the agent's observability and use a network to learn an appropriate representation automatically. To improve scalability and robustness, we regroup the observation o_{i,t} according to its physical units. Specifically, the observation is divided into four groups, o_{i,t} = o^1_{i,t} ∪ o^2_{i,t} ∪ o^3_{i,t} ∪ o^4_{i,t}, according to their units, i.e., frequency, power, voltage, and current. These regrouped sub-observations are encoded independently and then concatenated as e_s(o_{i,t}) = cat( e^1_s(o^1_{i,t}), e^2_s(o^2_{i,t}), e^3_s(o^3_{i,t}), e^4_s(o^4_{i,t}) ), where e^j_s, j = 1, 2, 3, 4, are one-layer fully connected encoding layers. The received communication message m_{i,t} of the i-th agent is a combination of the internal states and hidden states of its neighbors, i.e., m_{i,t} = h_{N_i,t−1} ∪ s_{N_i,t}, with h_{N_i,t−1} being the hidden states of agent i's neighbors at time t−1. The encoded observation e_s(o_{i,t}) and the neighbors' hidden states h_{N_i,t−1} are extracted by q_o and q_h, respectively. We then concatenate the encoded messages as õ_{i,t} = cat( q_o(e_s(o_{i,t})), q_h(h_{N_i,t−1}) ). The concatenation operation is shown in [32] to outperform the summation operation used in DIAL and CommNet in reducing information loss. The hidden state is then obtained by using the LSTM f_i to encode õ_{i,t} and h_{i,t−1}.

The hidden state h_{i,t} obtained from (4) is then used in the actor and critic networks to generate stochastic actions and predict the value function, respectively, i.e., π_{θ_i}(· | h_{i,t}) and V_{ω_i}(h_{i,t}). In this paper, we use a discrete action space, and the action is sampled from the last Softmax layer as a_{i,t} ∼ π_{θ_i}(· | h_{i,t}). We adopt the centralized training and decentralized execution scheme [20], [32], where each agent has its own actor and critic networks, and their policies are updated independently instead of in a consensus manner [29], which may hurt the convergence speed.
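The following is a compact PyTorch sketch of the per-agent hidden-state update in Eqn. (4). The class name, layer sizes, and the use of a single observation encoder (rather than the four per-unit-group encoders described above) are simplifying assumptions made for illustration.

import torch
import torch.nn as nn

class CommCell(nn.Module):
    """Illustrative per-agent cell: h_{i,t} = f_i(h_{i,t-1}, q_o(e_s(o_{i,t})), q_h(h_{N_i,t-1}))."""
    def __init__(self, obs_dim: int, nbr_hidden_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.e_s = nn.Linear(obs_dim, 64)        # observation encoder e_s (per-group encoders omitted)
        self.q_o = nn.Linear(64, 64)             # extractor q_o for the encoded observation
        self.q_h = nn.Linear(nbr_hidden_dim, 64) # extractor q_h for the neighbors' hidden states
        self.f_i = nn.LSTMCell(128, hidden_dim)  # LSTM encoder f_i of Eqn. (4)

    def forward(self, o_it, h_nbr, state):
        # o_it: local observation (own state plus neighbors' states), shape (batch, obs_dim)
        # h_nbr: concatenated neighbor hidden states from step t-1, shape (batch, nbr_hidden_dim)
        # state: (h_{i,t-1}, c_{i,t-1}) of the LSTM cell
        o_tilde = torch.cat([self.q_o(torch.relu(self.e_s(o_it))),
                             self.q_h(h_nbr)], dim=-1)
        h_it, c_it = self.f_i(o_tilde, state)    # h_{i,t} then feeds the actor and critic heads
        return h_it, c_it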
2) Spatial discount factor:
As mentioned above, the goal of cooperative MARL is to maximize the shared global reward R_{g,t} = Σ_{i∈ν} R_{i,t}, where R_{i,t} = Σ_{k=0}^{T} γ^k r_{i,t+k} denotes the cumulative reward of agent i. For each agent, a natural choice of the reward is the instantaneous global reward, i.e., Σ_{i=1}^{N} r_{i,t}. However, this scheme can lead to several issues. First, aggregating the global reward can cause large latency and increase communication overheads. Second, a single global reward leads to the credit assignment problem [37], which significantly impedes learning efficiency and limits the number of agents to a small size. Third, as an agent is typically only slightly impacted by agents distant from it, using the global reward to train the policy of each agent can lead to slow convergence. To address these issues, we employ a novel spatial discount factor. Specifically, each agent i, i = 1, ..., N, utilizes the following reward:

  R_{i,t} = Σ_{k=0}^{T} γ^k Σ_{j∈ν} α(d_{i,j}) r_{j,t+k},   (5)

where α(d_{i,j}) ∈ [0, 1] is the spatial discount function, with d_{i,j} being the distance between agents i and j. The distance d_{i,j} can be a Euclidean distance characterizing the physical distance between the two agents, or the distance between two vertices in a graph (i.e., the number of edges on the shortest connecting path). Note that the reward defined in (5) characterizes a whole spectrum of reward correlation, from local greedy control (when α(d_{i,j}) = 0 for j ≠ i and α(d_{i,i}) = 1) to the global reward (when α ≡ 1). There are different choices of the spatial discount function. For example, one can choose

  α(d_{i,j}) = 1 if d_{i,j} ≤ D, and 0 otherwise,   (6)

where D is a distance threshold that defines an "effective distance" around the considered agent. The threshold D can incorporate factors such as communication speed and overhead. In this paper, we adopt α(d_{i,j}) = α^{d_{i,j}}, with α ∈ (0, 1] being a constant scalar and a hyperparameter to be tuned.

As a result, the gradient computation in Eqn. (1) becomes

  ∇_{θ_i} J(θ_i) = E_{π_θ}[ (∇_{θ_i} log π_{θ_i}(a_{i,t} | o_{ν_i,t})) A_{i,t} ],   (7)

where A_{i,t} = Q_{π_{θ_i}}(s, a) − V_{π_{θ_i}}(s) is the advantage function, Q_{π_{θ_i}}(s, a) = E_{π_{θ_i}}(R_{i,t} | s_t = s, a_{ν_i,t} = a_{ν_i}) is the state-action value function, and V_{π_{θ_i}}(s) = V_{π_i}(s, a_{N_i}) is the state value function. We optimize the parameters of the critic network V_{ω_i} as follows:

  min_{ω_i} E_D [ ( Σ_{j∈ν} α^{d_{i,j}} r_{j,t} + γ V_{ω'_i}(o_{ν_i,t+1}) − V_{ω_i}(o_{ν_i,t}) )^2 ].   (8)

Minibatches of sampled trajectories are used to update the network parameters via Eqn. (7) and Eqn. (8) to reduce the variance [20].
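A small NumPy sketch of the spatially discounted per-step reward in Eqn. (5), with α(d_{i,j}) = α^{d_{i,j}}, is given below; the distance matrix and the per-agent step rewards are assumed to be provided by the environment.

import numpy as np

def spatially_discounted_rewards(r_t: np.ndarray, dist: np.ndarray, alpha: float) -> np.ndarray:
    """r_t: (N,) instantaneous rewards of all agents at one step.
    dist: (N, N) pairwise distances d_ij (e.g., shortest-path hop counts).
    Returns the (N,) vector whose i-th entry is sum_j alpha**d_ij * r_{j,t}."""
    weights = alpha ** dist      # spatial discount alpha(d_ij) = alpha**d_ij
    return weights @ r_t

# With alpha = 0 only the agent's own reward survives (0**0 = 1 on the diagonal),
# while alpha = 1 recovers the undiscounted global reward sum_j r_{j,t} for every agent.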
3) Action smoothing:
Note that the proposed PowerNet is an on-policy MARL algorithm that samples stochastic actions at each time step, which can sometimes lead to noisy action samples with large fluctuations, even after the algorithm converges. These action fluctuations can cause undesirable system perturbations. To this end, we introduce an action smoothing scheme to smooth the sampled action a_{i,t} ∼ π_{θ_i}(· | h_{i,t}) before execution:

  a_{i,t} ← a_{i,t},                        t = 1,
            ρ a_{i,t−1} + (1 − ρ) a_{i,t},  t > 1,                (9)

where ρ ∈ [0, 1] is the smoothing factor and a_{i,t−1} is the action taken at time t−1. This scheme is also known as an exponential moving average [38]. When ρ is chosen as 0, there is no smoothing and the agent directly takes actions sampled from the policy π_{θ_i}. When ρ = 1, the new action a_{i,t} is never updated and the agent keeps using the previous action. We specify an action smoothing window T_w in which past actions are buffered and used for smoothing. This window size needs to be chosen properly: a too large T_w includes too many obsolete actions, which may cause the agent to react slowly to abrupt changes, whereas a too small T_w may have limited smoothing effect. In this paper, ρ and T_w are treated as hyperparameters and are tuned using cross-validation.
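A minimal sketch of the smoothing rule in Eqn. (9) is shown below, treating the executed action value (the voltage set point) as a scalar; the fixed-length buffer stands in for the window T_w and is a simplification of the windowing described above.

from collections import deque

class ActionSmoother:
    """Exponential moving average of executed actions, following Eqn. (9)."""
    def __init__(self, rho: float, window: int):
        self.rho = rho
        self.buffer = deque(maxlen=window)   # keeps the last T_w executed actions

    def smooth(self, a_t: float) -> float:
        # At t = 1 (empty buffer) the sampled action is executed directly.
        if self.buffer:
            a_t = self.rho * self.buffer[-1] + (1.0 - self.rho) * a_t
        self.buffer.append(a_t)
        return a_t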
The complete PowerNet algorithm is shown in Algorithm 1. The hyperparameters include the spatial discount factor α, the action smoothing factor ρ, the (time-)discount factor γ, the learning rates of the actor network η_θ and the critic network η_ω, the total number of training steps M, the episode length T, the action smoothing window T_w, and the batch size N.

Algorithm 1: PowerNet for Secondary Voltage Control
  Parameters: α, ρ, γ, η_ω, η_θ, T, M, T_w, N. Output: θ_i, ω_i, i ∈ ν.
  initialize s_0, h_{−1}, t, k ← 0, D ← ∅
  for j = 0 to M − 1 do
    for t = 0 to T − 1 do
      for i ∈ ν do
        send m_{i,t} = s_{i,t} ∪ h_{i,t−1}
      end
      for i ∈ ν do
        get õ_{i,t} = cat(q_o(e_s(o_{i,t})), q_h(h_{N_i,t−1}))
        get h_{i,t} = f_i(h_{i,t−1}, õ_{i,t}), π_{i,t} ← π_{θ_i}(· | h_{i,t})
        sample a_{i,t} ∼ π_{i,t}
      end
      for i ∈ ν do
        update v_{i,t} ← V_{ω_i}(h_{i,t}, a_{N_i,t})
        update a_{i,t} according to Eqn. (9)
        execute a_{i,t}
      end
      simulate {s_{i,t+1}, r_{i,t}}_{i∈ν}
      update D ← D ∪ {(o_{i,t}, a_{i,t}, r_{i,t}, v_{i,t})}_{i∈ν}
      update t ← t + 1, j ← j + 1, k ← k + 1
      if k = N then
        for i ∈ ν do
          update θ_i ← θ_i + η_θ ∇_{θ_i} J(θ_i)
          update ω_i ← ω_i + η_ω ∇_{ω_i} V(ω_i)
        end
        initialize D ← ∅, k ← 0
      end
    end
    reset s_0, h_{−1}, t ← 0
  end

More specifically, the agents interact with the environment for thousands of episodes (Lines 2–27). In Lines 4–6, each agent collects and sends the communication message m_{i,t} to its neighbors. After that, each agent combines, regroups, and encodes its observations (Line 8). The agents then update their hidden states and the actor networks used to sample actions (Lines 9–10). In Lines 12–15, the critic network is evaluated and the actions are smoothed before execution. After interacting with the environment (Line 16), the agents transition to the next state and receive an immediate reward (Line 17), which is added to an on-policy experience buffer (Line 18). In Lines 20–26, the parameters of the actor and critic networks are updated using the trajectories collected in the on-policy experience buffer. After each episode, all agents are reset to their initial states to start a new episode (Line 27).

V. EXPERIMENTS, RESULTS, AND DISCUSSIONS

In this section, we apply the developed PowerNet to two microgrid systems: an IEEE 34-bus test feeder with 6 distributed DGs (microgrid-6, Fig. 3a) and a larger-scale microgrid system with 20 DGs (microgrid-20, Fig. 3b). The simulation platform we developed is based on the line and load specifications detailed in [39] and [40], respectively. Furthermore, to better represent real-world power systems, we add random load variations across the entire microgrid as perturbations of the nominal values specified in [40]. We also add random disturbances around the nominal value of each load at every simulation step to simulate random disruptions in a real-world power grid. The DGs are controlled with a sampling time of 0.05 s, and each DG can communicate with its neighbors via local communication edges. The lower-level primary control is realized following [13].

Through cross-validation, the spatial discount factor α is chosen as 1.0 and 0.7 for microgrid-6 and microgrid-20, respectively. It is reasonable to have a larger spatial discount factor in the microgrid-6 system, as the agents are strongly and tightly coupled in that small microgrid system.
A smaller α in the larger-scale microgrid-20 system is advantageous to reduce the effect of remote agents, but a too small α will cause the agents to ignore their effect on neighboring agents. For example, if we choose α = 0, the expected reward of agent i becomes R_i = Σ_{t=1}^{T} γ^t r_{i,t}, which cannot ensure a maximum global reward as each agent updates its own policy greedily. For the action smoothing factor ρ, we choose ρ = 0. for microgrid-6 and ρ = 0. for microgrid-20. As microgrid-20 is more complicated than microgrid-6, we choose a larger ρ, considering that a small ρ will result in a slow response to sudden voltage violations. We choose the sample window T_w = 5 for both microgrid systems.

We then compare PowerNet with several state-of-the-art benchmark MARL algorithms (IA2C [33], FPrint [28], ConseNet [29], DIAL [30], and CommNet [31]), as well as a conventional model-based method [13], to demonstrate its effectiveness. We train each model over 10,000 episodes, with γ = 0., minibatch size N = 20, actor learning rate η_θ = 5 × 10^{−}, and critic learning rate η_ω = 2.5 × 10^{−}. To ensure a fair comparison, each episode generates a different random seed, and within each episode the same random seed is shared across the different algorithms to guarantee the same training/testing environment. We control the agents every ΔT = 0.05 seconds, and one episode lasts T = 20 steps.

Fig. 4 shows the training curves of all MARL algorithms on the microgrid-6 and microgrid-20 systems. To better visualize the training process, we only show the first one thousand and two thousand training episodes for the microgrid-6 and microgrid-20 systems, respectively. It is clear that in both microgrid-6 and microgrid-20, PowerNet outperforms all state-of-the-art MARL algorithms in terms of convergence speed. This is due to the proposed communication protocol structure and a suitable spatial discount factor, which improve sample efficiency and speed up learning. In the more challenging microgrid-20 system, PowerNet shows an even greater advantage in sample efficiency, as seen from the fastest convergence speed and the best average episode reward compared to the other algorithms (see Fig. 4b).

After 10 thousand episodes of training, we evaluate the trained algorithms over 20 episodes, using the same random seed for every agent within an episode while using different test seeds across episodes. Besides the state-of-the-art MARL algorithms, we also compare PowerNet with a conventional model-based method [13] in terms of voltage control performance. The comparison results are summarized in Table I, which shows the average episode return over 20 test episodes for the MARL algorithms and the conventional method (ConvOP). It is clear that PowerNet achieves the best execution performance in both scenarios. ConvOP performs poorly due to its slow response, as it is sensitive to random disturbances and relies heavily on accurate system models. IA2C also does not perform well in either scenario, which is consistent with the training curves in Fig. 4. The other state-of-the-art MARL algorithms achieve similar, reasonable performance, but PowerNet outperforms them by a significant margin.

To better show the voltage control performance, we compare the voltage control results under heavy load conditions, as shown in Fig. 5 (for microgrid-6) and Fig. 6 (for microgrid-20). The control objective is to regulate all DG voltages to the reference value 1 pu.
A red cross represents a voltage violation at a DG, whereas a black dot means the DG's voltage is within the normal range. Here we use r to denote the average control reward over the last 5 control steps according to Eqn. (3). As shown in Fig. 5, all algorithms can regulate the voltages to the normal range in the small-size microgrid-6 environment.

Fig. 6 shows the performance comparison for the more challenging microgrid-20 case, which features dense network connections and complicated couplings among agents. It is thus not wise to control the voltages independently as IA2C does, which indeed leads to poor performance. ConvOP also fails to regulate the voltages of some DGs, and the voltages diverge further from the reference 1 pu, as it is not designed to deal with the system uncertainties and random noise present in real-world microgrids. FPrint, DIAL, and CommNet also incur voltage violations, which shows that simply including neighbors' policies is not sufficient for a complicated test case such as microgrid-20. The failures of DIAL and CommNet imply that simply summing all communication messages and computing the immediate reward from the undiscounted global reward can cause agents to fail to learn effective communication protocols in a large-scale, cooperative microgrid environment. ConseNet fails to regulate the voltages, and the controlled voltages remain more dispersed than those of PowerNet.

VI. CONCLUSIONS
In this paper, we formulated the secondary voltage control of inverter-based microgrid systems as a MARL problem. A novel on-policy, cooperative MARL algorithm, PowerNet, was developed by incorporating a differentiable, learning-based communication protocol, a spatial discount factor, and an action smoothing scheme. Comprehensive experiments were conducted, showing that the proposed PowerNet outperforms other state-of-the-art approaches in terms of convergence speed and voltage control performance. In future work, we will develop a more realistic simulation environment by incorporating data from real-world power systems. We will also investigate privacy-preserving schemes, as the decentralized implementation requires extensive information exchange among neighboring agents, which can lead to privacy breaches.
Fig. 3: System diagrams of the two microgrid simulation platforms: (a) microgrid-6 and (b) microgrid-20.
Fig. 4: MARL training curves for (a) the microgrid-6 and (b) the microgrid-20 systems. The lines show the average reward per training episode, smoothed over the past 100 episodes.

TABLE I: Average execution performance comparison between the trained MARL policies and the conventional model-based method. The reward is the average reward over 20 evaluation episodes. Best values are in bold.
(Table I columns: PowerNet, ConvOP [13], IA2C, FPrint, ConseNet, CommNet, DIAL; rows: Microgrid-6, Microgrid-20.)
Fig. 5: Execution performance on voltage control in microgrid-6.

Fig. 6: Execution performance on voltage control in microgrid-20.
REFERENCES

[1] V. K. Sood and H. Abdelgawad, "Chapter 1 - Microgrids architectures," in Distributed Energy Resources in Microgrids.
[2] IEEE Transactions on Industrial Informatics, vol. 10, no. 3, pp. 1785–1798, 2014.
[3] J. Lai, H. Zhou, X. Lu, X. Yu, and W. Hu, "Droop-based distributed cooperative control for microgrids with time-varying delays," IEEE Transactions on Smart Grid, vol. 7, no. 4, pp. 1775–1789, 2016.
[4] J. M. Guerrero, J. C. Vasquez, J. Matas, L. G. De Vicuña, and M. Castilla, "Hierarchical control of droop-controlled AC and DC microgrids—a general approach toward standardization," IEEE Transactions on Industrial Electronics, vol. 58, no. 1, pp. 158–172, 2010.
[5] A. Mehrizi-Sani and R. Iravani, "Secondary control for microgrids using potential functions: modeling issues," Proceedings of the Conseil International des Grands Réseaux Électriques (CIGRE), vol. 182, 2009.
[6] M. Savaghebi, A. Jalilian, J. C. Vasquez, and J. M. Guerrero, "Secondary control scheme for voltage unbalance compensation in an islanded droop-controlled microgrid," IEEE Transactions on Smart Grid, vol. 3, no. 2, pp. 797–807, 2012.
[7] Q. Shafiee, J. M. Guerrero, and J. C. Vasquez, "Distributed secondary control for islanded microgrids—a novel approach," IEEE Transactions on Power Electronics, vol. 29, no. 2, pp. 1018–1031, 2013.
[8] P. N. Vovos, A. E. Kiprakis, A. R. Wallace, and G. P. Harrison, "Centralized and distributed voltage control: Impact on distributed generation penetration," IEEE Transactions on Power Systems, vol. 22, no. 1, pp. 476–483, 2007.
[9] A. Mehrizi-Sani and R. Iravani, "Potential-function based control of a microgrid in islanded and grid-connected modes," IEEE Transactions on Power Systems, vol. 25, no. 4, pp. 1883–1891, 2010.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[11] S. Wang, J. Duan, D. Shi, C. Xu, H. Li, R. Diao, and Z. Wang, "A data-driven multi-agent autonomous voltage control framework using deep reinforcement learning," IEEE Transactions on Power Systems, 2020.
[12] H. Xin, Z. Qu, J. Seuss, and A. Maknouninejad, "A self-organizing strategy for power flow control of photovoltaic generators in a distribution network," IEEE Transactions on Power Systems, vol. 26, no. 3, pp. 1462–1473, 2010.
[13] A. Bidram, A. Davoudi, F. L. Lewis, and J. M. Guerrero, "Distributed cooperative secondary control of microgrids using feedback linearization," IEEE Transactions on Power Systems, vol. 28, no. 3, pp. 3462–3470, 2013.
[14] A. Bidram, A. Davoudi, F. L. Lewis, and Z. Qu, "Secondary control of microgrids based on distributed cooperative control of multi-agent systems," IET Generation, Transmission & Distribution, vol. 7, no. 8, pp. 822–831, 2013.
[15] L. Ding, Q.-L. Han, and X.-M. Zhang, "Distributed secondary control for active power sharing and frequency regulation in islanded microgrids using an event-triggered communication mechanism," IEEE Transactions on Industrial Informatics, vol. 15, no. 7, pp. 3910–3922, 2018.
[16] X. Lu, X. Yu, J. Lai, J. M. Guerrero, and H. Zhou, "Distributed secondary voltage and frequency control for islanded microgrids with uncertain communication links," IEEE Transactions on Industrial Informatics, vol. 13, no. 2, pp. 448–460, 2016.
[17] J. Lai, X. Lu, X. Yu, W. Yao, J. Wen, and S. Cheng, "Distributed voltage control for DC microgrids with coupling delays & noisy disturbances," in IECON 2017 - 43rd Annual Conference of the IEEE Industrial Electronics Society. IEEE, 2017, pp. 2461–2466.
[18] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev et al., "Grandmaster level in StarCraft II using multi-agent reinforcement learning," Nature, vol. 575, no. 7782, pp. 350–354, 2019.
[19] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse et al., "Dota 2 with large scale deep reinforcement learning," arXiv preprint arXiv:1912.06680, 2019.
[20] T. Chu, J. Wang, L. Codecà, and Z. Li, "Multi-agent deep reinforcement learning for large-scale traffic signal control," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 3, pp. 1086–1095, 2019.
[21] S. Shalev-Shwartz, S. Shammah, and A. Shashua, "Safe, multi-agent, reinforcement learning for autonomous driving," arXiv preprint arXiv:1610.03295, 2016.
[22] J. Duan, D. Shi, R. Diao, H. Li, Z. Wang, B. Zhang, D. Bian, and Z. Yi, "Deep-reinforcement-learning-based autonomous voltage control for power grid operations," IEEE Transactions on Power Systems, vol. 35, no. 1, pp. 814–817, 2019.
[23] H. Liu and W. Wu, "Online multi-agent reinforcement learning for decentralized inverter-based volt-var control," arXiv preprint arXiv:2006.12841, 2020.
[24] D. Cao, W. Hu, J. Zhao, Q. Huang, Z. Chen, and F. Blaabjerg, "A multi-agent deep reinforcement learning based voltage regulation using coordinated PV inverters," IEEE Transactions on Power Systems, vol. 35, no. 5, pp. 4120–4123, 2020.
[25] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.
[26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[27] M. Tan, "Multi-agent reinforcement learning: Independent vs. cooperative agents," in Proceedings of the Tenth International Conference on Machine Learning, 1993, pp. 330–337.
[28] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. Torr, P. Kohli, and S. Whiteson, "Stabilising experience replay for deep multi-agent reinforcement learning," arXiv preprint arXiv:1702.08887, 2017.
[29] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar, "Fully decentralized multi-agent reinforcement learning with networked agents," arXiv preprint arXiv:1802.08757, 2018.
[30] J. Foerster, I. A. Assael, N. De Freitas, and S. Whiteson, "Learning to communicate with deep multi-agent reinforcement learning," in Advances in Neural Information Processing Systems, 2016, pp. 2137–2145.
[31] S. Sukhbaatar, R. Fergus et al., "Learning multiagent communication with backpropagation," in Advances in Neural Information Processing Systems, 2016, pp. 2244–2252.
[32] T. Chu, S. Chinchali, and S. Katti, "Multi-agent reinforcement learning for networked system control," arXiv preprint arXiv:2004.01339, 2020.
[33] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, "Multi-agent actor-critic for mixed cooperative-competitive environments," in Advances in Neural Information Processing Systems, 2017, pp. 6379–6390.
[34] B. Ning, Q.-L. Han, and L. Ding, "Distributed secondary control of AC microgrids with external disturbances and directed communication topologies: A full-order sliding-mode approach," IEEE/CAA Journal of Automatica Sinica, 2020.
[35] M. Hausknecht and P. Stone, "Deep recurrent Q-learning for partially observable MDPs," arXiv preprint arXiv:1507.06527, 2015.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[38] E. S. Gardner Jr, "Exponential smoothing: The state of the art," Journal of Forecasting, vol. 4, no. 1, pp. 1–28, 1985.
[39] T. Li and J.-F. Zhang, "Consensus conditions of multi-agent systems with time-varying topologies and stochastic communication noises," IEEE Transactions on Automatic Control, vol. 55, no. 9, pp. 2043–2057, 2010.
[40] A. Mustafa, B. Poudel, A. Bidram, and H. Modares, "Detection and mitigation of data manipulation attacks in AC microgrids,"