Deep Reinforcement Learning for DER Cyber-Attack Mitigation
Ciaran Roberts, Sy-Toan Ngo, Alexandre Milesi, Sean Peisert, Daniel Arnold, Shammya Saha, Anna Scaglione, Nathan Johnson, Anton Kocheturov, Dmitriy Fradkin
Ciaran Roberts, Sy-Toan Ngo, Alexandre Milesi, Sean Peisert, Daniel Arnold
Lawrence Berkeley National Laboratory
{cmroberts,sytoanngo,amilesi,sppeisert,dbarnold}@lbl.gov

Shammya Saha, Anna Scaglione, Nathan Johnson
Arizona State University
{sssaha,ascaglio,nathanjohnson}@asu.edu

Anton Kocheturov, Dmitriy Fradkin
Siemens Corporation, Corporate Technology
{anton.kocheturov,dmitriy.fradkin}@siemens.com

Abstract -- The increasing penetration of DER with smart-inverter functionality is set to transform the electrical distribution network from a passive system, with fixed injection/consumption, to an active network with hundreds of distributed controllers dynamically modulating their operating setpoints as a function of system conditions. This transition is being achieved through standardization of functionality through grid codes and/or international standards. DER, however, are unique in that they are typically neither owned nor operated by distribution utilities and, therefore, represent a new emerging attack vector for cyber-physical attacks. Within this work we consider deep reinforcement learning as a tool to learn the optimal parameters for the control logic of a set of uncompromised DER units to actively mitigate the effects of a cyber-attack on a subset of network DER.

This research was supported in part by the Director, Cybersecurity, Energy Security, and Emergency Response, Cybersecurity for Energy Delivery Systems program, of the U.S. Department of Energy, under contract DE-AC02-05CH11231. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors of this work.
I. INTRODUCTION
The increasing penetration of distributed energy resources (DER) in electrical distribution systems is causing a paradigm shift in how these networks are managed. While these systems were historically passive, distributed power generation is forcing distribution grids to become more dynamic as DER are expected to provide grid services, e.g. voltage control. This transition presents several challenges, particularly in the area of cyber-physical security [1], [2].

DER are especially unique when it comes to cyber-physical security. These devices are typically neither utility owned nor directly controlled and, therefore, present a new attack vector for adversaries seeking to disrupt normal grid operating conditions. Additionally, many manufacturers and/or aggregators remotely control large populations of these devices via cellular networks, customers' WiFi routers, or wired internet connections [3]. This makes ensuring the integrity of commands significantly more difficult. While recent standards (e.g. the IEEE 1547 standard) seek to specify minimal control requirements for these devices, they do not explicitly address the associated cyber-physical security challenges [4]. Inverter manufacturers and aggregators have the ability to remotely monitor and control the settings for inverters/DER deployed in the field. Once access to the central system is gained, that system can be used to push malicious control logic back to all DER. Thus, utilities have already expressed concerns about how a single cyber intrusion into an aggregator's/manufacturer's internal network could be exploited to compromise the aggregator's/manufacturer's entire DER fleet [3]. In regions with high penetration of these devices, this could have devastating effects.

In this work we adopt a purely physics-based approach for the mitigation of cyber-physical attacks on DER (specifically, solar photovoltaic inverters). We assume that the adversary has already gained access to a subset of DER on a given network and seeks to maliciously re-configure the control settings of smart inverters to disrupt distribution grid operations. Our approach does not focus on detecting the cyber-intrusion but rather on mitigating the resulting physical manifestation of the attack on the grid. To develop optimal control policies that mitigate the impact of these attacks, we train a deep reinforcement learning (DRL) policy that re-configures the control settings of uncompromised DER. This trained policy is then deployed locally on controllable DER and determines smart inverter parameter updates based on locally observed information.

DRL has been gaining increasing attention in recent years, including in power systems, for determining control policies for highly complex non-linear systems. In [5], the authors use Deep Q-Network (DQN) learning, a reinforcement learning (RL) algorithm that combines Q-Learning with deep neural networks, to control both generator dynamic braking and load shedding in the event of a contingency to ensure post-fault recovery. In [6], the authors consider the problem of coordinated voltage regulation using capacitors and smart inverters. Exploiting the timescale separation of these devices, they solve a convex optimization problem to determine the control policies of the smart inverters while using a DQN network to learn an optimal policy for capacitor bank switching. In [7], a deep deterministic policy gradient (DDPG) RL agent is used to coordinate across DER and directly modulate active and reactive power to regulate the grid voltage during normal operations.

This work differs from those described above in that we focus on developing a supervisory control policy that continuously monitors system conditions and takes action during sustained abnormal behavior. This controller, therefore, should not impact an inverter's response to normal disturbances, e.g. line-to-ground faults. While the proposed controller design is motivated by the need to respond to cyber-physical attacks, it is agnostic to the cause of the abnormal conditions. Consequently, it can also serve to autonomously re-configure controller settings in the event that an intended action has resulted in abnormalities, for instance, when connecting different microgrids with independently optimized controllers. This paper presents a framework for DRL for smart grid applications and explores the use case of a cyber-physical attack intended to induce oscillatory behavior in the grid voltage.

The remainder of the paper is organized as follows.
Section II gives a brief introduction to DRL and the terms that will be used throughout the paper. Section III gives an overview of the power system models and networks used in the study. Finally, Section IV presents the results and Section V summarizes some of the key conclusions.

II. DEEP REINFORCEMENT LEARNING
A. Reinforcement Learning
RL is a branch of machine learning focused on optimal decision making in stochastic environments. The goal of RL techniques is to train an agent (i.e. the decision-maker) to interact with an environment in such a way as to maximize a cumulative reward. The environment is usually cast as a Markov Decision Process (MDP), which consists of the following elements:
• A state space, $\mathcal{S}$, containing states observable by the agent;
• An action space, $\mathcal{A}$, containing all the possible actions the agent can execute;
• A state transition function, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$, specifying the probability distribution over the next state $s'$ when an action $a$ is taken in state $s$;
• A reward function, $R : \mathcal{A} \times \mathcal{S} \times \mathcal{S} \to \mathbb{R}$, specifying the reward received by the agent when the environment transitions from state $s$ to state $s'$ with action $a$;
• A discount factor, $\gamma \in [0, 1]$, representing the trade-off between immediate and future rewards.

An RL agent learns optimal actions by repeatedly interacting with the environment and assessing the value of the resulting rewards, $R_t \in \mathbb{R}$, dependent on the actions taken, $a_t \in \mathcal{A}$, and the states of the environment, $s_t \in \mathcal{S}$. The agent-environment interaction is visualized in Fig. 1. As shown in the figure, the agent takes action $a_t$ following policy $\pi$, causing a state transition in the environment. The new state, $s_t$, and subsequent reward, $R_t$, are observed by the agent and can then be used to update the policy $\pi$. The objective of the agent is to maximize the discounted reward $J(\pi) = \mathbb{E}_\pi \left[ \sum_{t=0}^{T} \gamma^t R_t \right]$, where $T$ is the terminal time step, by following a policy $\pi$, which can be deterministic or stochastic in nature.

Fig. 1: Reinforcement learning loop.
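To make the loop of Fig. 1 concrete, the following is a minimal Python sketch of a single episode rollout and the computation of the discounted return $J(\pi)$. The toy environment and policy here are illustrative placeholders, not part of this work's simulator.

```python
import random

def run_episode(env_step, policy, s0, T, gamma):
    """Roll out one episode and return the discounted return.

    env_step(s, a) -> (s_next, reward) stands in for the MDP's transition
    and reward functions; policy(s) -> a samples an action from pi(.|s).
    """
    s, ret = s0, 0.0
    for t in range(T):
        a = policy(s)              # a_t ~ pi(.|s_t)
        s, r = env_step(s, a)      # environment transition to s_{t+1}, reward R_t
        ret += (gamma ** t) * r    # accumulate gamma^t * R_t
    return ret

# Toy example: a scalar state performing a noisy walk, rewarded for staying
# near zero, with a crude proportional "stabilizing" policy.
env_step = lambda s, a: (s + a + random.gauss(0, 0.1), -abs(s))
policy = lambda s: -0.5 * s
print(run_episode(env_step, policy, s0=1.0, T=50, gamma=0.9))
```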
B. Deep Reinforcement Learning
Classical RL relies on feature engineering and is difficult to apply to environments with high dimensional, continuous action and/or state spaces [8]. Such spaces typically must be discretized first, leading to a combinatorial explosion in complexity and unreasonable training time (the so-called curse of dimensionality). In addition, classical RL has trouble capturing patterns in the presence of noisy or incomplete data. DRL solves these issues by leveraging neural networks with multiple hidden layers that take the agent observations as input and output a policy that determines what action to take in a given state.

With DRL, the inputs to the neural network can be structured data (tabular data), unstructured data (images, text, video), or both. The weights of these neural networks are efficiently learned end-to-end via gradient-based optimization to find the best intermediate features and an optimal output policy. The need for precise feature engineering is then greatly reduced, thanks to the automatic high-dimensional feature extraction of the hidden layers. In DRL, one can use the networks to explicitly approximate an optimal policy distribution, $\pi$, over possible actions. This distribution is then sampled by the agent to determine the next action, as in policy gradient methods. They may also be used to approximate either a value function, $V^\pi(s)$, or an action-value function, $Q^\pi(s, a)$, from gathered data, leading to an action decision based on inferred values for all possible future states, as in DQN. The value function, $V^\pi(s)$, is the expected discounted reward when starting in state $s$ and following the policy $\pi$, whereas the action-value function, $Q^\pi(s, a)$, is defined as the expected discounted reward when starting in state $s$, taking action $a$, and then following the policy $\pi$ thereafter.

Thanks to its flexibility, DRL has been successfully applied to robotic control [9], video games [10], [11] and board game playing [12], [13].
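Written out explicitly, these are the standard definitions, consistent with the discounted objective introduced above:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{T-t} \gamma^{k} R_{t+k} \,\middle|\, s_t = s\right], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{T-t} \gamma^{k} R_{t+k} \,\middle|\, s_t = s,\ a_t = a\right].$$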
C. Policy Gradient and PPO

Policy gradient methods employ a policy modeled by a neural network which is trained directly by gradient ascent on the expected return. The most basic method (vanilla policy gradient) is simple to implement but has the drawback of high gradient variance. In response, Actor-Critic (AC) methods were proposed [14], where another, possibly shared, neural network approximates the value function.

Let $\pi_\theta(a|s)$ be a stochastic policy, parameterized by $\theta$, modeling the probability distribution of action $a \in \mathcal{A}$ given the state $s \in \mathcal{S}$. Let $V^{\pi}_\phi(s)$ be a value function, parameterized by $\phi$, estimating the cumulative discounted reward from the current state to the terminal state. The gradient of $J(\theta)$ is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, A^{\pi}_\phi(s_t, a_t)\right], \quad (1)$$

where $\tau$ is the trajectory generated by policy $\pi_\theta$ and $A^{\pi}_\phi(s_t, a_t) = R_t + \gamma V^{\pi}_\phi(s_{t+1}) - V^{\pi}_\phi(s_t)$ is the advantage estimate, representing how much better taking action $a_t$ is, as opposed to following the policy $\pi$ when in state $s_t$. The policy and value function are updated by gradient ascent/descent:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\theta), \quad (2)$$
$$\phi_{k+1} = \phi_k - \beta \nabla_\phi \left(R_t + \gamma V^{\pi}_\phi(s_{t+1}) - V^{\pi}_\phi(s_t)\right)^2. \quad (3)$$

As the training of AC methods can be unstable when the data distribution changes due to a large policy update, Trust Region Policy Optimization (TRPO) was introduced [15]. TRPO limits the updates in the policy space by enforcing a Kullback-Leibler divergence constraint on the size of each update. Proximal Policy Optimization (PPO) [16], using a clipped surrogate objective, simplifies the aforementioned method and yields similar performance:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$

where $r_t(\theta) \triangleq \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ and $\hat{A}_t \triangleq A^{\pi}_\phi(s_t, a_t)$. The clip operation encourages more gradual updates to the policy rather than large changes, and the minimum between the unclipped and the clipped objective is used so that the final objective is a lower bound on the unclipped objective [16]. The hat over the expectation means that we compute a Monte Carlo estimate of it.

PPO is a state-of-the-art method that was successfully used in video games [11] and robotics in simulation [17]. We consider here its application to the control of smart inverters.
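As an illustration only, the clipped surrogate objective can be evaluated as in the NumPy sketch below. This is didactic code, not the PPO implementation used in this work (training here relies on RLlib; see Section III-D).

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.1):
    """Monte Carlo estimate of the PPO clipped surrogate objective L^CLIP.

    r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed from log-probs;
    taking min(unclipped, clipped) keeps the objective a lower bound on
    the unclipped surrogate.
    """
    ratios = np.exp(log_prob_new - log_prob_old)            # r_t(theta)
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()            # E_hat[...]

# Example batch: a large positive-advantage ratio is clipped at 1 + eps,
# preventing an overly aggressive policy update.
lp_new = np.log(np.array([0.9, 0.2]))
lp_old = np.log(np.array([0.5, 0.4]))
adv = np.array([1.0, -0.5])
print(ppo_clip_objective(lp_new, lp_old, adv, eps=0.1))
```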
Before we map the specific problem onto the RL formalism, the following remark is in order.

Remark 1:
In many applications, the state of the entire system, $s_t$, is not directly observed. In this case, the problem falls in the class of Partially Observable MDPs (POMDP). In a POMDP, the additional element in the model is:
• An observation transition function (also called perceptual distribution or emission probability), $V : \mathcal{S} \times \mathcal{O} \to [0, 1]$, that specifies the probability distribution of the observation $o_t$ given the state $s_t$.

The policy function in this case takes as input the observation rather than the state, i.e. the goal is to find the optimum $\pi(a_t|o_t)$. As mentioned later, the formulation in this paper falls in the class of POMDP. Also, we note that neither the state transition function nor the perceptual distribution are explicitly given; hence the policy neural network is trained through a Monte Carlo method.

III. METHODOLOGY
A. Modeling the DER action space
In response to evolving standards and requirements, DER are increasingly being deployed with the ability to modulate their real and reactive power injection/consumption in response to locally measured grid conditions. In this work we focus specifically on smart inverter Volt-VAR (VV) and Volt-Watt (VW) control functionality, as these operating modes are designed to help regulate distribution system voltages in the presence of large amounts of renewable generation. Under VV/VW control schemes, each inverter seeks to modulate active and reactive power injections in response to measured system voltage. The amount by which reactive and active power injections are modulated is governed according to piece-wise linear functions of voltage, often referred to as "droop" curves. Different parameterizations of VV and VW curves exist; however, existing guidelines often depict the shapes shown in Figs. 2 - 3, which are parameterized by the five parameters that define the piece-wise linear curves shown, referred to as the components of the setpoint vector $\eta = [\eta_1, \ldots, \eta_5]$. In this work, the action is a $5 \times 1$ vector $a = \Delta\eta \triangleq \eta - \eta_o$, where $\eta_o$ is the default set of parameters. Note that, even though in principle the action is continuous, we quantize the possible range for the action and search directly for the categorical vector $a$.

The VV curve injects reactive power when voltages in the system are low and transitions to VAR consumption as voltages increase. The VW curve provides maximum real power injection under most voltage levels, but curtails PV output as voltage levels increase. The additional capacity resulting from active power curtailment can then be used for additional reactive power consumption.

Fig. 2: Inverter Volt-VAR curve. Positive percent of VAR injection.

Fig. 3: Inverter Volt-Watt curve. Positive percent of watt injection.
Without loss of generality, we assume all inverters in the subsequent analysis possess both VV and VW functionality. Let $p^{\max}$ be the maximum output of the PV unit under presently available solar insolation, and $q^{\text{avail}}$ the limit for reactive power in absolute value. In some instances, the amount of reactive power available for injection/consumption may be fixed (in the case of an inverter oversized relative to the capacity of the PV panels), while in others, $q^{\text{avail}}$ may depend on the amount of real power being generated by the PV system:

$$q^{\text{avail}} \leq \sqrt{s^2 - f_p^2(\bar{v})}, \quad (4)$$

where $s$ is the inverter capacity. Let $u_p^i$ and $u_q^i$ denote the active and reactive power control signals of inverter $i$. They are functions of the averaged measured voltage magnitude at the bus (cf. (7a)). Rather than considering completely arbitrary VV and VW mappings $u_p^i$ and $u_q^i$ that respect the limits $p^{\max}$ and $q^{\text{avail}}$, we seek policies that are expressed as:

$$u_p^i = f_p^i(\bar{v}) \triangleq \begin{cases} p^{\max} & \bar{v} \in [0, \eta_4] \\ \left(\frac{\eta_5 - \bar{v}}{\eta_5 - \eta_4}\right) p^{\max} & \bar{v} \in (\eta_4, \eta_5] \\ 0 & \bar{v} \in (\eta_5, \infty) \end{cases} \quad (5)$$

$$u_q^i = f_q^i(\bar{v}) \triangleq \begin{cases} q^{\text{avail}} & \bar{v} \in [0, \eta_1] \\ \left(\frac{\eta_2 - \bar{v}}{\eta_2 - \eta_1}\right) q^{\text{avail}} & \bar{v} \in (\eta_1, \eta_2] \\ 0 & \bar{v} \in (\eta_2, \eta_3) \\ -\left(\frac{\bar{v} - \eta_3}{\eta_4 - \eta_3}\right) q^{\text{avail}} & \bar{v} \in [\eta_3, \eta_4] \\ -q^{\text{avail}} & \bar{v} \in (\eta_4, \infty) \end{cases} \quad (6)$$

The scheme of (5) - (6) illustrates the combined use of VV and VW control with VW precedence [18]. Under VW precedence, priority is given to the VW controller to determine any needed curtailment before determining the VARs available ($q^{\text{avail}}$). After $q^{\text{avail}}$ is fixed via (4), $u_q^i$ is computed from (6).

In the event of a cyber-physical attack we assume that an adversary has the capability to re-dispatch the set of voltage breakpoints $\eta = [\eta_1, \ldots, \eta_5]$ that parametrize the droop curves in Figs. 2 - 3 for a subset of DER on the network. Within the context of this work, the remaining set of non-compromised DER can then be updated with a new parameter vector $\eta' = a + \eta_o$ to re-shape their own local droop curves and transition the system voltages to a safe region, devoid of oscillatory behavior.

Finally, the structure of the DER VV and VW control dynamic response, similarly to [18]-[20], includes the following first-order low-pass filters that average the input voltage and determine the active and reactive power injections:

$$\bar{v}_{i,t} = \bar{v}_{i,t-1} + \tau_i^m \left(v_{i,t} - \bar{v}_{i,t-1}\right), \quad (7a)$$
$$p_{i,t} = p_{i,t-1} + \tau_i^o \left(f_p^i(\bar{v}_{i,t}) - p_{i,t-1}\right), \quad (7b)$$
$$q_{i,t} = q_{i,t-1} + \tau_i^o \left(f_q^i(\bar{v}_{i,t}) - q_{i,t-1}\right), \quad (7c)$$

where $\bar{v}_i$ denotes a low-pass filtered measurement of the voltage magnitude, $v_i$, at node $i$, $\tau_i^m$ is its associated measurement time constant, $\tau_i^o$ is the output filter time constant, and $f_p^i(\bar{v}_{i,t})$ and $f_q^i(\bar{v}_{i,t})$ are the piecewise linear functions of the measured nodal voltages for node $i$ given by (5) and (6) respectively. Note the equilibrium of (7b) - (7c) is given by (5) - (6). The stability of (7a) - (7c) has been studied in [19] and [21], where it has been observed that instabilities manifest as oscillations in inverter power injections and nodal voltages.

As said before, the RL agent indirectly manipulates the outputs of the inverter by modifying the vector of parameters $\eta_t = a_t + \eta_o$. Next we define a component of the observation $o_t$ used as an input to the DRL controller in our POMDP formulation. The quantity is a local measure of the presence and severity of the aforementioned voltage magnitude oscillations.
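Equations (4)-(7) translate directly into code. The following single-inverter sketch is our own illustration of the VV/VW curves with VW precedence and the first-order filter updates; the breakpoints, filter gains, and capacity values are arbitrary examples, not the values used in the experiments.

```python
import math

def volt_watt(v_bar, eta, p_max):
    """Eq. (5): active power setpoint as a function of the filtered voltage."""
    e1, e2, e3, e4, e5 = eta
    if v_bar <= e4:
        return p_max
    if v_bar <= e5:
        return (e5 - v_bar) / (e5 - e4) * p_max
    return 0.0

def volt_var(v_bar, eta, q_avail):
    """Eq. (6): reactive power setpoint as a function of the filtered voltage."""
    e1, e2, e3, e4, e5 = eta
    if v_bar <= e1:
        return q_avail
    if v_bar <= e2:
        return (e2 - v_bar) / (e2 - e1) * q_avail
    if v_bar < e3:
        return 0.0
    if v_bar <= e4:
        return -(v_bar - e3) / (e4 - e3) * q_avail
    return -q_avail

def step(v, state, eta, p_max, s_inv, tau_m=0.2, tau_o=0.2):
    """One update of the dynamics (7a)-(7c) with VW precedence."""
    v_bar, p, q = state
    v_bar += tau_m * (v - v_bar)                       # (7a) voltage filter
    f_p = volt_watt(v_bar, eta, p_max)                 # VW acts first ...
    q_avail = math.sqrt(max(s_inv**2 - f_p**2, 0.0))   # (4) remaining VAR headroom
    f_q = volt_var(v_bar, eta, q_avail)                # ... then VV uses q_avail
    p += tau_o * (f_p - p)                             # (7b) output filter
    q += tau_o * (f_q - q)                             # (7c) output filter
    return v_bar, p, q

eta0 = [0.98, 1.01, 1.02, 1.05, 1.07]                  # illustrative breakpoints (p.u.)
state = (1.0, 0.0, 0.0)
for v in [1.0, 1.03, 1.06, 1.06, 1.04]:
    state = step(v, state, eta0, p_max=0.5, s_inv=0.55)
print(state)
```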
B. A Measure of Unstable Oscillations

We propose the use of a simple filter to determine the "energy" associated with voltage oscillations in the distribution grid. The filter consists of the series connection of a highpass filter and an energy detector, the latter consisting of a square-law followed by a lowpass filter. A discrete-time block diagram of the process is shown in Fig. 4.

Fig. 4: Block diagram of the instability detector ($v_{i,t} \to H_{HP}(z) \to \Delta v_{i,t} \to c \cdot (\cdot)^2 \to H_{LP}(z) \to y_{i,t}$) using a transfer function representation of high and low pass filters.

Here $H_{HP}$ and $H_{LP}$ are high-pass and low-pass filters respectively, realized using a bilinear transform equivalent of a first-order high/low-pass filter, and $c$ is a positive gain. The high-pass filter removes DC content from $v_{i,t}$, yielding $\Delta v_{i,t}$. This signal is then squared to produce a DC term which is then extracted via low-pass filtering. The output signal, $y_{i,t}$, is a measure of the intensity of the instability. The filter parameters should be chosen such that the filter does not attenuate oscillations due to inverter instabilities.
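A minimal discrete-time realization of the Fig. 4 detector is sketched below. The one-pole filter coefficients and the gain c are illustrative stand-ins for the bilinear-transform realizations described in the text, not the values used in the experiments.

```python
import math

def oscillation_energy(v_series, a_hp=0.9, a_lp=0.95, c=1.0):
    """High-pass -> square-law -> low-pass, as in Fig. 4.

    a_hp and a_lp are one-pole filter coefficients; returns the sequence
    of oscillation-energy estimates y_{i,t}.
    """
    y = []
    hp_prev_in, hp_prev_out, lp = v_series[0], 0.0, 0.0
    for v in v_series[1:]:
        dv = a_hp * (hp_prev_out + v - hp_prev_in)   # remove DC content
        hp_prev_in, hp_prev_out = v, dv
        lp = a_lp * lp + (1 - a_lp) * c * dv**2      # extract DC of squared signal
        y.append(lp)
    return y

# A steady voltage yields roughly zero energy; a sustained oscillation
# settles at a positive level proportional to its amplitude squared.
steady = [1.0] * 200
ringing = [1.0 + 0.05 * math.sin(0.8 * t) for t in range(200)]
print(oscillation_energy(steady)[-1], oscillation_energy(ringing)[-1])
```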
C. DER Cyber-Attack Mitigation as an RL Problem

The primary goal of the DRL controller is to mitigate instabilities introduced by DER smart inverter VV/VW controllers due to maliciously chosen set-points. Let the graph $\mathcal{G} = (\mathcal{N}, \mathcal{L})$ represent the topology of the distribution feeder considered, where $\mathcal{N}$ is the set of nodes of the feeder (with 0 indexing the feeder head) and $\mathcal{L}$ is the set of lines. For simplicity of presentation, we assume the presence of a VV/VW-capable smart inverter at every node in the system, so that the total number of inverters in the system is $|\mathcal{N}|$. We suppose the set $\mathcal{N}$ is partitioned into two sets, $\mathcal{H}$ and $\mathcal{U}$, where $\mathcal{H} \cup \mathcal{U} = \mathcal{N}$, which represent the "compromised" and "uncompromised" inverters respectively. Furthermore, we assume that $\mathcal{U} \neq \emptyset$, i.e. we have some controllable resources to mitigate the effects of the cyber-physical attack. Given $\mathcal{U} \subsetneq \mathcal{N}$ and the temporal dependency of load and solar irradiance, as mentioned in Remark 1, the model is a POMDP where we wish to determine the optimum stochastic policy, $\pi_\theta(a|o)$, parameterized by the neural network parameters $\theta$, modeling the probability distribution of action $a \in \mathcal{A}$ given the observation $o \in \mathcal{O}$.

Training: Rather than training multiple agents simultaneously, we adopt the following heuristics to aid convergence:
1) For agent training, we define a single agent whose input observation vector is the mean of the input observation vectors of all controllable inverters $\in \mathcal{U}$ and whose action, $a_t$, is a deviation/offset, $\Delta\eta$, from the default VV/VW control curves that applies across inverters.
2) Once the single agent has been trained, its optimal policy is deployed locally on each individual inverter and only acts on local observations.
3) Rather than optimize over arbitrarily shaped VV/VW curves ($f_q(\bar{v})$ and $f_p(\bar{v})$), we optimize over the deviation, i.e. $a = \Delta\eta$, from the default parameters defining the curves in Figs. 2 - 3. An example of this is shown in Fig. 5, and a small sketch of the resulting discrete action set is given after Fig. 5. The translation ranges from -0.05 pu to 0.05 pu around an inverter's default VV/VW curve, with the action space being discretized into $k$ bins.
4) New parameterizations of the VV/VW functions are chosen so that measurement and power injection dynamics evolve on a faster timescale. This choice preserves the Markov property between actions taken by the RL controller.

Fig. 5: Action example (the RL agent translates the Volt-VAR curve by $\Delta\eta$).
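As an illustration of item 3, a discrete offset action set can be built as follows. The 0.01 p.u. step is taken from the Appendix; applying the same offset to every breakpoint translates the whole curve, as in Fig. 5.

```python
# Discretized offsets for the curve-translation action (a = delta_eta).
step = 0.01                                                 # p.u., per the Appendix
offsets = [round(-0.05 + i * step, 3) for i in range(11)]   # -0.05 ... 0.05 p.u.

def apply_offset(eta_default, delta):
    """Translate all VV/VW breakpoints by the same offset (Fig. 5)."""
    return [b + delta for b in eta_default]

print(offsets)
print(apply_offset([0.98, 1.01, 1.02, 1.05, 1.07], offsets[2]))
```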
Observation: The observation vectors $o_{i,t}$, $i \in \mathcal{U}$, at each RL agent (i.e. the input to the neural network that learns the optimum policy $\pi(a|o)$), consist of:
1) $y_{i,t}$: the mean of the estimated voltage oscillation energy at node $i$ since the last agent-environment interaction.
2) $y^{\max}_{i,t}$: the maximum of $y_{i,t}$ over the previous $n$ environment interactions. This is a tunable parameter that stores information on the recent oscillation energy.
3) $q^{\text{avail,nom}}_{i,t}$: the available reactive power capacity without active power curtailment.
4) $a^{\text{one-hot}}_{i,t-1}$: one-hot encoding of the previous action taken by the agent.

Reward: At a timestep $t$, the reward function, $R_t(a_t, o_t)$, is:

$$R_t = -\left(\frac{1}{|\mathcal{U}|}\sum_{i=1}^{|\mathcal{U}|} \sigma_y\, y_{i,t} + \sigma_a\, \mathbb{1}\!\left[a_t \neq a_{t-1}\right] + \sigma \left\|a_t\right\| + \frac{1}{|\mathcal{U}|}\sum_{i=1}^{|\mathcal{U}|} \sigma_p \left(1 - \frac{p_{i,t}}{p^{\max}_{i,t}}\right)\right). \quad (8)$$

The first component seeks to minimize the voltage oscillation $y$; the second penalizes configuration changes on inverters; the third component encourages the agent to use the default inverter configurations in the absence of voltage oscillations; and the final component penalizes any active power curtailment.
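The reward (8) maps directly to a few lines of code. In the sketch below the penalty weights are taken from the Appendix, and the variable names are ours.

```python
import numpy as np

def reward(y, p, p_max, a, a_prev,
           sig_y=15.0, sig_a=0.05, sig_d=18.0, sig_p=80.0):
    """Eq. (8): negative weighted sum of mean oscillation energy, an action-change
    indicator, deviation from the default curve, and mean active power curtailment.

    y, p, p_max: arrays over the uncompromised inverters in U.
    a, a_prev: current and previous action vectors (delta_eta).
    sig_d plays the role of the unsubscripted sigma in (8).
    """
    osc = sig_y * np.mean(y)
    switch = sig_a * float(not np.array_equal(a, a_prev))
    deviation = sig_d * np.linalg.norm(a)
    curtail = sig_p * np.mean(1.0 - p / p_max)
    return -(osc + switch + deviation + curtail)

print(reward(y=np.array([0.0, 0.01]), p=np.array([0.45, 0.5]),
             p_max=np.array([0.5, 0.5]), a=np.zeros(5), a_prev=np.zeros(5)))
```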
Any learning method requires sufficient training over a variety of scenarios. As is done in other applications of deep learning in the context of critical infrastructure systems, such training can be performed through realistic Monte Carlo simulations that cover a variety of operating conditions and cyber-physical attacks. We named the modular software architecture we designed to train the DRL agent described in the previous section PyCIGAR (Python-based Cybersecurity via Inverter-Grid Automatic Reconfiguration).
Fig. 6: PyCIGAR Architecture.
PyCIGAR is a Python library for distributed reinforcement learning for electric power distribution grids on quasi-static time scales. The library provides a link between power system simulators and a reinforcement learning library, RLlib [22]. PyCIGAR is a unified API that can interface different power system simulators (e.g. OpenDSS), while on the RL side PyCIGAR uses RLlib in order to deploy large-scale experiments on a server, a machine cluster, or a cloud.

A diagram of the PyCIGAR architecture is shown in Fig. 6. In addition to RL-based controllers, PyCIGAR also includes rule-based (RB) control devices (e.g. tap-changing transformers) and can easily be extended to support the integration of other more complicated DER (e.g. electric vehicle charging and battery storage systems). PyCIGAR provides a foundation for the rapid development of learning-based control algorithms for heterogeneous classes of DER in electric power distribution grids.
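The paper does not list PyCIGAR's API, so the sketch below only illustrates the general pattern it describes: a Gym-style environment with the observation/action structure of Section III-C, trained with RLlib's classic PPO Trainer interface. RLlib's API has changed across versions, and all environment details here (names, shapes, placeholder reward) are illustrative assumptions.

```python
import gym
import numpy as np
import ray
from ray.rllib.agents.ppo import PPOTrainer  # classic RLlib (~1.x) interface

class InverterEnv(gym.Env):
    """Toy stand-in for a PyCIGAR environment: observations are
    [y, y_max, q_avail, one-hot of previous action]; actions are
    11 discretized curve offsets (-0.05 ... 0.05 p.u.)."""
    def __init__(self, config=None):
        self.action_space = gym.spaces.Discrete(11)
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(14,))
        self.t = 0
    def reset(self):
        self.t = 0
        return np.zeros(14, dtype=np.float32)
    def step(self, action):
        self.t += 1
        obs = np.zeros(14, dtype=np.float32)
        obs[3 + action] = 1.0                    # one-hot of previous action
        reward = -abs(action - 5) * 0.01         # placeholder reward only
        return obs, reward, self.t >= 20, {}

if __name__ == "__main__":
    ray.init(ignore_reinit_error=True)
    trainer = PPOTrainer(env=InverterEnv,
                         config={"num_workers": 1, "framework": "tf"})
    print(trainer.train()["episode_reward_mean"])
```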
IV. RESULTS
We conduct experiments on the IEEE 37-bus feeder with all load buses having an active power generation of 50% of the nominal load, with an additional 10% inverter over-sizing for reactive power headroom. The agent training environment consists of 700 one-second timesteps per simulation. At the end of each experiment, the training environment is reset with randomized load and solar generation profiles and percentage of compromised inverters. This diversity creates a rich environment that exposes the RL agent to attacks that could occur anytime throughout the day under a variety of loading and solar conditions and proportions of compromised inverters. For each case, all inverters start with their default VV/VW settings, and at a particular time in the simulation the attacker gains control of a portion of the installed inverter capacity at each node to create a voltage instability. It does so by translating the VV/VW curves and steepening the slopes to induce an oscillation. This attack vector represents a subset of possible attack vectors. The agent is allowed to reconfigure the VV/VW curves of non-compromised DER to mitigate oscillations that result from the cyber-attack. We consider two types of action: 1) translating the entire VV/VW piecewise functions from their default configuration (offset action) and 2) adjusting the slope of the piecewise functions in the regions $(\eta_1, \eta_2]$ and $[\eta_3, \eta_4]$ (slope action). Within the simulation, the agent receives observations and updates the inverters' functions $f_q^i$ and $f_p^i$ every 35 seconds. The training is conducted on a server with an Intel Xeon E5-2623 v3 processor and 64 GB of RAM, and takes 1 hour of training time to converge.

Fig. 7 shows the baseline case caused by a 45% attack around noon with no action taken to mitigate the result of the attack. The attack creates oscillations in system voltages that are detected by the oscillation detector (see Section III-B). The malicious re-dispatch of settings is shown in the action subplot, and the components of the reward function, (8), are shown at the bottom. In the absence of any control, the reward is solely composed of the penalty for the oscillation.

Figs. 8 - 9 show the behavior of the trained RL agent at a random node in the network in mitigating instabilities from compromised DER at two different times of day and different percentages of compromised DER. At simulation time t = 200 s the attack is introduced in a portion of DER. This can be seen in the action subplot, which shows the breakpoints of the piecewise linear curves of Figs. 2 - 3 being suddenly moved to a new configuration. This triggers an oscillation in grid voltages. The output of the oscillation observer is both fed into the RL agent as an observation and included as a negative penalty in the reward function. The agent, therefore, should control non-compromised assets to minimize the oscillation. This is what occurs, as the agent changes the breakpoints of non-compromised units just after t = 250 s by translating the default VV/VW curves. This action almost immediately stops the oscillation in the system voltages, resulting in a defeat of the original cyber-attack. Fig. 8 features an attack in the morning, around 9 AM, where there is significant excess capacity for reactive power compensation available for the agent to mitigate the cyber-attack. The agent, therefore, does not need to curtail active power generation to successfully mitigate the attack. This, however, is not the case in Fig. 9, where the attack occurs around midday and the agent is forced to curtail active power in order to have enough controllability to defeat the attack. Across numerous training configurations it was observed that the RL offset was the preferred action of the agent.

Also worth highlighting is the behavior of the agent after the compromised DER have been identified and returned to their original settings, at t = 450 s. Shortly after, we observe the agent also returning to its original configuration. In demonstrating this behavior, it can be seen that the RL agent will take steps to ameliorate the effects of the cyber-attack, but will return to a state of inactivity once the threat of the attack has passed.
Fig. 7: Result of an evaluation episode at 45% attack without agent defense (subplots: voltage (p.u.), oscillation energy, actions, and reward components vs. time).
Fig. 8: Result of an evaluation episode at 20% attack around 9 AM.
Fig. 9: Result of an evaluation episode at 45% attack around noon.

V. CONCLUSIONS
This paper has proposed a reinforcement learning approach for mitigating the oscillations due to unstable smart inverter settings by training an agent to translate the VV/VW curves. The resultant policy successfully mitigated adversary-induced voltage oscillatory behavior for the cases considered.

Future work will investigate the value of this approach for larger networks and the sensitivity of the trained agents to specific network topologies/configurations. Additionally, we will explore different neural network architectures, including Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks, which have proven to be the state of the art in solar and load forecasting and may improve the performance of the agent. Additional types of attacks will also be considered, including, but not limited to, voltage imbalance attacks. An adversary may seek to exploit DER interaction with utility voltage regulation systems to create system voltage imbalances, leading to device trips and possible system collapse.
REFERENCES

[1] J. Qi, A. Hahn, X. Lu, J. Wang, and C.-C. Liu, "Cybersecurity for distributed energy resources and smart inverters," IET Cyber-Physical Systems: Theory & Applications, vol. 1, no. 1, pp. 28–39, 2016.
[2] S. Sahoo, T. Dragičević, and F. Blaabjerg, "Cyber security in control of grid-tied power electronic converters: challenges and vulnerabilities," IEEE Journal of Emerging and Selected Topics in Power Electronics, 2019.
[3] "Modernizing Hawaii's Grid For Our Customers," Tech. Rep., 2017.
[4] "IEEE Standard 1547 Communications and Interoperability: New Requirements Mandate Open Communications Interface and Interoperability for Distributed Energy Resources," Tech. Rep., 2017.
[5] Q. Huang, R. Huang, W. Hao, J. Tan, R. Fan, and Z. Huang, "Adaptive power system emergency control using deep reinforcement learning," IEEE Transactions on Smart Grid, 2019.
[6] Q. Yang, G. Wang, A. Sadeghi, G. B. Giannakis, and J. Sun, "Two-timescale voltage control in distribution grids using deep reinforcement learning," IEEE Transactions on Smart Grid, 2019.
[7] C. Li, C. Jin, and R. K. Sharma, "Coordination of PV smart inverters using deep reinforcement learning for grid voltage regulation," pp. 1930–1937, 2019.
[8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[9] O. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba, "Learning dexterous in-hand manipulation," The International Journal of Robotics Research, vol. 39, no. 1, pp. 3–20, 2019.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[11] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse et al., "Dota 2 with large scale deep reinforcement learning," arXiv preprint arXiv:1912.06680, 2019.
[12] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.
[13] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[14] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.
[15] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning, 2015, pp. 1889–1897.
[16] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[17] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. Eslami et al., "Emergence of locomotion behaviours in rich environments," arXiv preprint arXiv:1707.02286, 2017.
[18] B. Seal, "Common Functions for Smart Inverters, 4th Ed.," Electric Power Research Institute, Tech. Rep. 3002008217, 2017.
[19] M. Farivar, L. Chen, and S. Low, "Equilibrium and dynamics of local voltage control in distribution systems," in IEEE Conference on Decision and Control, Dec. 2013, pp. 4329–4334.
[20] J. H. Braslavsky, L. D. Collins, and J. K. Ward, "Voltage stability in a grid-connected inverter with automatic volt-watt and volt-var functions," IEEE Transactions on Smart Grid, vol. PP, no. 99, pp. 1–1, 2017.
[21] K. Baker, A. Bernstein, E. Dall'Anese, and C. Zhao, "Network-cognizant voltage droop control for distribution grids," IEEE Transactions on Power Systems, vol. 33, no. 2, pp. 2098–2108, 2018.
[22] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. Gonzalez, M. Jordan, and I. Stoica, "RLlib: Abstractions for distributed reinforcement learning," in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm, Sweden: PMLR, Jul. 2018, pp. 3053–3062.

APPENDIX
Hyperparameter                                          Value
α (learning rate)
γ (reward discount factor)                              0.5
λ (GAE parameter)                                       0.95
ε (PPO clip parameter)                                  0.1
batch size                                              420
activation function                                     tanh
network hidden layers                                   dense (64, 64, 32)
σ_y (oscillation penalty)                               15
σ_a (action penalty)                                    0.05
σ (penalty for deviation from default VV/VW curve)      18
σ_p (penalty for curtailing active power)               80
action range                                            -0.05 pu to 0.05 pu
k (action range discretization)                         0.01 p.u.