Deep Reinforcement Learning for DER Cyber-Attack Mitigation
Ciaran Roberts, Sy-Toan Ngo, Alexandre Milesi, Sean Peisert, Daniel Arnold, Shammya Saha, Anna Scaglione, Nathan Johnson, Anton Kocheturov, Dmitriy Fradkin
Ciaran Roberts, Sy-Toan Ngo, Alexandre Milesi, Sean Peisert, Daniel Arnold
Lawrence Berkeley National Laboratory
{cmroberts,sytoanngo,amilesi,sppeisert,dbarnold}@lbl.gov

Shammya Saha, Anna Scaglione, Nathan Johnson
Arizona State University
{sssaha,ascaglio,nathanjohnson}@asu.edu

Anton Kocheturov, Dmitriy Fradkin
Siemens Corporation, Corporate Technology
{anton.kocheturov,dmitriy.fradkin}@siemens.com

Abstract -- The increasing penetration of DER with smart-inverter functionality is set to transform the electrical distribution network from a passive system, with fixed injection/consumption, to an active network with hundreds of distributed controllers dynamically modulating their operating setpoints as a function of system conditions. This transition is being achieved through standardization of functionality through grid codes and/or international standards. DER, however, are unique in that they are typically neither owned nor operated by distribution utilities and, therefore, represent a new emerging attack vector for cyber-physical attacks. Within this work we consider deep reinforcement learning as a tool to learn the optimal parameters for the control logic of a set of uncompromised DER units to actively mitigate the effects of a cyber-attack on a subset of network DER.

This research was supported in part by the Director, Cybersecurity, Energy Security, and Emergency Response, Cybersecurity for Energy Delivery Systems program, of the U.S. Department of Energy, under contract DE-AC02-05CH11231. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors of this work.
I. INTRODUCTION
The increasing penetration of distributed energy resources (DER) in electrical distribution systems is causing a paradigm shift in how these networks are managed. While these systems were historically passive, distributed power generation is forcing distribution grids to become more dynamic as DER are expected to provide grid services, e.g. voltage control. This transition presents several challenges, particularly in the area of cyber-physical security [1], [2].

DER are especially unique when it comes to cyber-physical security. These devices are typically neither utility owned nor directly controlled and, therefore, present a new attack vector for adversaries seeking to disrupt normal grid operating conditions. Additionally, many manufacturers and/or aggregators remotely control large populations of these devices via cellular networks, customers' WiFi routers, or wired internet connections [3]. This makes ensuring the integrity of commands significantly more difficult. While recent standards (e.g. the IEEE 1547 standard) seek to specify minimal control requirements for these devices, they do not explicitly address the associated cyber-physical security challenges [4]. Inverter manufacturers and aggregators have the ability to remotely monitor and control the settings for inverters/DER deployed in the field. Once access to the central system is gained, that system can be used to push malicious control logic back to all DER. Thus, utilities have already expressed concerns about how a single cyber intrusion into an aggregator's/manufacturer's internal network could be exploited to compromise the aggregator's/manufacturer's entire DER fleet [3]. In regions with high penetration of these devices, this could have devastating effects.

In this work we adopt a purely physics-based approach for the mitigation of cyber-physical attacks on DER (specifically, solar photovoltaic inverters). We assume that the adversary has already gained access to a subset of DER on a given network and seeks to maliciously re-configure the control settings of smart inverters to disrupt distribution grid operations. Our approach does not focus on detecting the cyber-intrusion but rather on mitigating the resulting physical manifestation of the attack on the grid. To develop optimal control policies that mitigate the impact of these attacks, we train a deep reinforcement learning (DRL) policy that re-configures the control settings of uncompromised DER. This trained policy is then deployed locally on controllable DER and determines smart inverter parameter updates based on locally observed information.

DRL has been gaining increasing attention in recent years, including in power systems, for determining control policies for highly complex non-linear systems. In [5], the authors use Deep Q-Network (DQN) learning, a reinforcement learning (RL) algorithm that combines Q-Learning with deep neural networks, to control both generator dynamic braking and load shedding in the event of a contingency to ensure post-fault recovery. In [6], the authors consider the problem of coordinated voltage regulation using capacitors and smart inverters. Exploiting the timescale separation of these devices, they solve a convex optimization problem to determine the control policies of the smart inverters while using a DQN network to learn an optimal policy for capacitor bank switching. In [7], a deep deterministic policy gradient (DDPG) RL agent is used to coordinate across DER and directly modulate active and reactive power to regulate the grid voltage during normal operations.

This work differs from those described above in that we focus on developing a supervisory control policy that continuously monitors system conditions and takes action during sustained abnormal behavior. This controller, therefore, should not impact an inverter's response to normal disturbances, e.g. line-to-ground faults. While the proposed controller design is motivated by the need to respond to cyber-physical attacks, it is agnostic to the cause of the abnormal conditions. Consequently, it can also serve to autonomously re-configure controller settings in the event that an intended action has resulted in abnormalities, for instance, when connecting different microgrids with independently optimized controllers. This paper presents a framework for DRL for smart grid applications and explores the use case of a cyber-physical attack intended to induce oscillatory behavior in the grid voltage.

The remainder of the paper is organized as follows.
Section II gives a brief introduction to DRL and the terms that will be used throughout the paper. Section III gives an overview of the power system models and networks used in the study. Finally, Section IV presents the results and Section V summarizes some of the key conclusions.

II. DEEP REINFORCEMENT LEARNING
A. Reinforcement Learning
RL is a branch of machine learning focused on optimal decision making in stochastic environments. The goal of RL techniques is to train an agent (i.e. the decision-maker) to interact with an environment in such a way as to maximize a cumulative reward. The environment is usually cast as a Markov Decision Process (MDP), which consists of the following elements:
• A state space, $\mathcal{S}$, containing states observable by the agent;
• An action space, $\mathcal{A}$, containing all the possible actions the agent can execute;
• A state transition function, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$, specifying the probability distribution over the next state $s'$ when an action $a$ is taken in state $s$;
• A reward function, $R : \mathcal{A} \times \mathcal{S} \times \mathcal{S} \to \mathbb{R}$, specifying the reward received by the agent when the environment transitions from state $s$ to state $s'$ with action $a$;
• A discount factor, $\gamma \in [0, 1]$, representing the trade-off between immediate and future rewards.

An RL agent learns optimal actions by repeatedly interacting with the environment and assessing the value of the resulting rewards, $R_t \in \mathbb{R}$, dependent on the actions taken, $a_t \in \mathcal{A}$, and the states of the environment, $s_t \in \mathcal{S}$. The agent-environment interaction is visualized in Fig. 1. As shown in the figure, the agent takes action $a_t$ following policy $\pi$, causing a state transition in the environment. The new state, $s_t$, and subsequent reward, $R_t$, are observed by the agent and can then be used to update the policy $\pi$. The objective of the agent is to maximize the discounted reward $J(\pi) = \mathbb{E}_\pi \left[ \sum_{t=0}^{T} \gamma^t R_t \right]$, where $T$ is the terminal time step, by following a policy $\pi$, which can be deterministic or stochastic in nature.

Fig. 1: Reinforcement learning loop.
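To make the loop of Fig. 1 concrete, the following is a minimal Python sketch of a single episode rollout and the computation of the discounted return $J(\pi)$. The toy environment and policy here are illustrative placeholders, not part of this work's simulator.

```python
import random

def run_episode(env_step, policy, s0, T, gamma):
    """Roll out one episode and return the discounted return.

    env_step(s, a) -> (s_next, reward) stands in for the MDP's transition
    and reward functions; policy(s) -> a samples an action from pi(.|s).
    """
    s, ret = s0, 0.0
    for t in range(T):
        a = policy(s)              # a_t ~ pi(.|s_t)
        s, r = env_step(s, a)      # environment transition to s_{t+1}, reward R_t
        ret += (gamma ** t) * r    # accumulate gamma^t * R_t
    return ret

# Toy example: a scalar state performing a noisy walk, rewarded for staying
# near zero, with a crude proportional "stabilizing" policy.
env_step = lambda s, a: (s + a + random.gauss(0, 0.1), -abs(s))
policy = lambda s: -0.5 * s
print(run_episode(env_step, policy, s0=1.0, T=50, gamma=0.9))
```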
B. Deep Reinforcement Learning
Classical RL relies on feature engineering and is difficult to apply to environments with high dimensional, continuous action and/or state spaces [8]. Such spaces typically must be discretized first, leading to a combinatorial explosion in complexity and unreasonable training time (the so-called curse of dimensionality). In addition, classical RL has trouble capturing patterns in the presence of noisy or incomplete data. DRL solves these issues by leveraging neural networks with multiple hidden layers that take the agent observations as input and output a policy that determines what action to take in a given state.

With DRL, the inputs to the neural network can be structured data (tabular data), unstructured data (images, text, video), or both. The weights of these neural networks are efficiently learned end-to-end via gradient-based optimization to find the best intermediate features and an optimal output policy. The need for precise feature engineering is then greatly reduced, thanks to the automatic high-dimensional feature extraction of the hidden layers. In DRL, one can use the networks to explicitly approximate an optimal policy distribution, $\pi$, over possible actions. This distribution is then sampled by the agent to determine the next action, as in policy gradient methods. They may also be used to approximate either a value function, $V^\pi(s)$, or an action-value function, $Q^\pi(s, a)$, from gathered data, leading to an action decision based on inferred values for all possible future states, as in DQN. The value function, $V^\pi(s)$, is the expected discounted reward when starting in state $s$ and following the policy $\pi$, whereas the action-value function, $Q^\pi(s, a)$, is defined as the expected discounted reward when starting in state $s$, taking action $a$, and then following the policy $\pi$ thereafter.

Thanks to its flexibility, DRL has been successfully applied to robotic control [9], video games [10], [11] and board game playing [12], [13].
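Written out explicitly, these are the standard definitions, consistent with the discounted objective introduced above:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{T-t} \gamma^{k} R_{t+k} \,\middle|\, s_t = s\right], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{T-t} \gamma^{k} R_{t+k} \,\middle|\, s_t = s,\ a_t = a\right].$$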
C. Policy Gradient and PPO

Policy gradient methods employ a policy modeled by a neural network which is trained directly by gradient ascent on the expected return. The most basic method (vanilla policy gradient) is simple to implement but has the drawback of high gradient variance. In response, Actor-Critic (AC) methods were proposed [14], where another, possibly shared, neural network approximates the value function.

Let $\pi_\theta(a|s)$ be a stochastic policy, parameterized by $\theta$, modeling the probability distribution of action $a \in \mathcal{A}$ given the state $s \in \mathcal{S}$. Let $V^{\pi}_\phi(s)$ be a value function, parameterized by $\phi$, estimating the cumulative discounted reward from the current state to the terminal state. The gradient of $J(\theta)$ is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, A^{\pi}_\phi(s_t, a_t)\right], \quad (1)$$

where $\tau$ is the trajectory generated by policy $\pi_\theta$ and $A^{\pi}_\phi(s_t, a_t) = R_t + \gamma V^{\pi}_\phi(s_{t+1}) - V^{\pi}_\phi(s_t)$ is the advantage estimate, representing how much better taking action $a_t$ is, as opposed to following the policy $\pi$ when in state $s_t$. The policy and value function are updated by gradient ascent/descent:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\theta), \quad (2)$$
$$\phi_{k+1} = \phi_k - \beta \nabla_\phi \left(R_t + \gamma V^{\pi}_\phi(s_{t+1}) - V^{\pi}_\phi(s_t)\right)^2. \quad (3)$$

As the training of AC methods can be unstable when the data distribution changes due to a large policy update, Trust Region Policy Optimization (TRPO) was introduced [15]. TRPO limits the updates in the policy space by enforcing a Kullback-Leibler divergence constraint on the size of each update. Proximal Policy Optimization (PPO) [16], using a clipped surrogate objective, simplifies the aforementioned method and yields similar performance:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$

where $r_t(\theta) \triangleq \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ and $\hat{A}_t \triangleq A^{\pi}_\phi(s_t, a_t)$. The clip operation encourages more gradual updates to the policy rather than large changes, and the minimum between the unclipped and the clipped objective is used so that the final objective is a lower bound on the unclipped objective [16]. The hat over the expectation means that we compute a Monte Carlo estimate of it.

PPO is a state-of-the-art method that was successfully used in video games [11] and robotics in simulation [17]. We consider here its application to the control of smart inverters.
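As an illustration only, the clipped surrogate objective can be evaluated as in the NumPy sketch below. This is didactic code, not the PPO implementation used in this work (training here relies on RLlib; see Section III-D).

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.1):
    """Monte Carlo estimate of the PPO clipped surrogate objective L^CLIP.

    r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed from log-probs;
    taking min(unclipped, clipped) keeps the objective a lower bound on
    the unclipped surrogate.
    """
    ratios = np.exp(log_prob_new - log_prob_old)            # r_t(theta)
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()            # E_hat[...]

# Example batch: a large positive-advantage ratio is clipped at 1 + eps,
# preventing an overly aggressive policy update.
lp_new = np.log(np.array([0.9, 0.2]))
lp_old = np.log(np.array([0.5, 0.4]))
adv = np.array([1.0, -0.5])
print(ppo_clip_objective(lp_new, lp_old, adv, eps=0.1))
```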
Before we map the specific problem onto the RL formalism, the following remark is in order.

Remark 1:
In many applications, the state of the entire system, $s_t$, is not directly observed. In this case, the problem falls in the class of Partially Observable MDPs (POMDP). In a POMDP, the additional element in the model is:
• An observation transition function (also called perceptual distribution or emission probability), $V : \mathcal{S} \times \mathcal{O} \to [0, 1]$, that specifies the probability distribution of the observation $o_t$ given the state $s_t$.

The policy function in this case takes as input the observation rather than the state, i.e. the goal is to find the optimum $\pi(a_t|o_t)$. As mentioned later, the formulation in this paper falls in the class of POMDP. Also, we note that neither the state transition function nor the perceptual distribution are explicitly given; hence the policy neural network is trained through a Monte Carlo method.

III. METHODOLOGY
A. Modeling the DER action space
In response to evolving standards and requirements, DER are increasingly being deployed with the ability to modulate their real and reactive power injection/consumption in response to locally measured grid conditions. In this work we focus specifically on smart inverter Volt-VAR (VV) and Volt-Watt (VW) control functionality, as these operating modes are designed to help regulate distribution system voltages in the presence of large amounts of renewable generation. Under VV/VW control schemes, each inverter seeks to modulate active and reactive power injections in response to measured system voltage. The amount by which reactive and active power injections are modulated is governed according to piece-wise linear functions of voltage, often referred to as "droop" curves. Different parameterizations of VV and VW curves exist; however, existing guidelines often depict the shapes shown in Figs. 2 - 3, which are parameterized by the five parameters that define the piece-wise linear curves shown, referred to as the components of the setpoint vector $\eta = [\eta_1, \ldots, \eta_5]$. In this work, the action is a $5 \times 1$ vector $a = \Delta\eta \triangleq \eta - \eta_o$, where $\eta_o$ is the default set of parameters. Note that, even though in principle the action is continuous, we quantize the possible range for the action and search directly for the categorical vector $a$.

The VV curve injects reactive power when voltages in the system are low and transitions to VAR consumption as voltages increase. The VW curve provides maximum real power injection under most voltage levels, but curtails PV output as voltage levels increase. The additional capacity resulting from active power curtailment can then be used for additional reactive power consumption.

Fig. 2: Inverter Volt-VAR curve. Positive percent of VAR injection.

Fig. 3: Inverter Volt-Watt curve. Positive percent of watt injection.
Without loss of generality, we assume all inverters in the subsequent analysis possess both VV and VW functionality. Let $p^{\max}$ be the maximum output of the PV unit under presently available solar insolation, and $q^{\text{avail}}$ the limit for reactive power in absolute value. In some instances, the amount of reactive power available for injection/consumption may be fixed (in the case of an inverter oversized relative to the capacity of the PV panels), while in others, $q^{\text{avail}}$ may depend on the amount of real power being generated by the PV system:

$$q^{\text{avail}} \leq \sqrt{s^2 - f_p^2(\bar{v})}, \quad (4)$$

where $s$ is the inverter capacity. Let $u_p^i$ and $u_q^i$ denote the active and reactive power control signals of inverter $i$. They are functions of the averaged measured voltage magnitude at the bus (cf. (7a)). Rather than considering completely arbitrary VV and VW mappings $u_p^i$ and $u_q^i$ that respect the limits $p^{\max}$ and $q^{\text{avail}}$, we seek policies that are expressed as:

$$u_p^i = f_p^i(\bar{v}) \triangleq \begin{cases} p^{\max} & \bar{v} \in [0, \eta_4] \\ \left(\frac{\eta_5 - \bar{v}}{\eta_5 - \eta_4}\right) p^{\max} & \bar{v} \in (\eta_4, \eta_5] \\ 0 & \bar{v} \in (\eta_5, \infty) \end{cases} \quad (5)$$

$$u_q^i = f_q^i(\bar{v}) \triangleq \begin{cases} q^{\text{avail}} & \bar{v} \in [0, \eta_1] \\ \left(\frac{\eta_2 - \bar{v}}{\eta_2 - \eta_1}\right) q^{\text{avail}} & \bar{v} \in (\eta_1, \eta_2] \\ 0 & \bar{v} \in (\eta_2, \eta_3) \\ -\left(\frac{\bar{v} - \eta_3}{\eta_4 - \eta_3}\right) q^{\text{avail}} & \bar{v} \in [\eta_3, \eta_4] \\ -q^{\text{avail}} & \bar{v} \in (\eta_4, \infty) \end{cases} \quad (6)$$

The scheme of (5) - (6) illustrates the combined use of VV and VW control with VW precedence [18]. Under VW precedence, priority is given to the VW controller to determine any needed curtailment before determining the VARs available ($q^{\text{avail}}$). After $q^{\text{avail}}$ is fixed via (4), $u_q^i$ is computed from (6).

In the event of a cyber-physical attack we assume that an adversary has the capability to re-dispatch the set of voltage breakpoints $\eta = [\eta_1, \ldots, \eta_5]$ that parametrize the droop curves in Figs. 2 - 3 for a subset of DER on the network. Within the context of this work, the remaining set of non-compromised DER can then be updated with a new parameter vector $\eta' = a + \eta_o$ to re-shape their own local droop curves and transition the system voltages to a safe region, devoid of oscillatory behavior.

Finally, the structure of the DER VV and VW control dynamic response, similarly to [18]-[20], includes the following first-order low-pass filters that average the input voltage and determine the active and reactive power injections:

$$\bar{v}_{i,t} = \bar{v}_{i,t-1} + \tau_i^m \left(v_{i,t} - \bar{v}_{i,t-1}\right), \quad (7a)$$
$$p_{i,t} = p_{i,t-1} + \tau_i^o \left(f_p^i(\bar{v}_{i,t}) - p_{i,t-1}\right), \quad (7b)$$
$$q_{i,t} = q_{i,t-1} + \tau_i^o \left(f_q^i(\bar{v}_{i,t}) - q_{i,t-1}\right), \quad (7c)$$

where $\bar{v}_i$ denotes a low-pass filtered measurement of the voltage magnitude, $v_i$, at node $i$, $\tau_i^m$ is its associated measurement time constant, $\tau_i^o$ is the output filter time constant, and $f_p^i(\bar{v}_{i,t})$ and $f_q^i(\bar{v}_{i,t})$ are the piecewise linear functions of the measured nodal voltages for node $i$ given by (5) and (6) respectively. Note the equilibrium of (7b) - (7c) is given by (5) - (6). The stability of (7a) - (7c) has been studied in [19] and [21], where it has been observed that instabilities manifest as oscillations in inverter power injections and nodal voltages.

As said before, the RL agent indirectly manipulates the outputs of the inverter by modifying the vector of parameters $\eta_t = a_t + \eta_o$. Next we define a component of the observation $o_t$ used as an input to the DRL controller in our POMDP formulation. The quantity is a local measure of the presence and severity of the aforementioned voltage magnitude oscillations.
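Equations (4)-(7) translate directly into code. The following single-inverter sketch is our own illustration of the VV/VW curves with VW precedence and the first-order filter updates; the breakpoints, filter gains, and capacity values are arbitrary examples, not the values used in the experiments.

```python
import math

def volt_watt(v_bar, eta, p_max):
    """Eq. (5): active power setpoint as a function of the filtered voltage."""
    e1, e2, e3, e4, e5 = eta
    if v_bar <= e4:
        return p_max
    if v_bar <= e5:
        return (e5 - v_bar) / (e5 - e4) * p_max
    return 0.0

def volt_var(v_bar, eta, q_avail):
    """Eq. (6): reactive power setpoint as a function of the filtered voltage."""
    e1, e2, e3, e4, e5 = eta
    if v_bar <= e1:
        return q_avail
    if v_bar <= e2:
        return (e2 - v_bar) / (e2 - e1) * q_avail
    if v_bar < e3:
        return 0.0
    if v_bar <= e4:
        return -(v_bar - e3) / (e4 - e3) * q_avail
    return -q_avail

def step(v, state, eta, p_max, s_inv, tau_m=0.2, tau_o=0.2):
    """One update of the dynamics (7a)-(7c) with VW precedence."""
    v_bar, p, q = state
    v_bar += tau_m * (v - v_bar)                       # (7a) voltage filter
    f_p = volt_watt(v_bar, eta, p_max)                 # VW acts first ...
    q_avail = math.sqrt(max(s_inv**2 - f_p**2, 0.0))   # (4) remaining VAR headroom
    f_q = volt_var(v_bar, eta, q_avail)                # ... then VV uses q_avail
    p += tau_o * (f_p - p)                             # (7b) output filter
    q += tau_o * (f_q - q)                             # (7c) output filter
    return v_bar, p, q

eta0 = [0.98, 1.01, 1.02, 1.05, 1.07]                  # illustrative breakpoints (p.u.)
state = (1.0, 0.0, 0.0)
for v in [1.0, 1.03, 1.06, 1.06, 1.04]:
    state = step(v, state, eta0, p_max=0.5, s_inv=0.55)
print(state)
```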
B. A Measure of Unstable Oscillations

We propose the use of a simple filter to determine the "energy" associated with voltage oscillations in the distribution grid. The filter consists of the series connection of a highpass filter and an energy detector, the latter consisting of a square-law followed by a lowpass filter. A discrete-time block diagram of the process is shown in Fig. 4.

Fig. 4: Block diagram of the instability detector ($v_{i,t} \to H_{HP}(z) \to \Delta v_{i,t} \to c \cdot (\cdot)^2 \to H_{LP}(z) \to y_{i,t}$) using a transfer function representation of high and low pass filters.

Here $H_{HP}$ and $H_{LP}$ are high-pass and low-pass filters respectively, realized using a bilinear transform equivalent of a first-order high/low-pass filter, and $c$ is a positive gain. The high-pass filter removes DC content from $v_{i,t}$, yielding $\Delta v_{i,t}$. This signal is then squared to produce a DC term which is then extracted via low-pass filtering. The output signal, $y_{i,t}$, is a measure of the intensity of the instability. The filter parameters should be chosen such that the filter does not attenuate oscillations due to inverter instabilities.
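A minimal discrete-time realization of the Fig. 4 detector is sketched below. The one-pole filter coefficients and the gain c are illustrative stand-ins for the bilinear-transform realizations described in the text, not the values used in the experiments.

```python
import math

def oscillation_energy(v_series, a_hp=0.9, a_lp=0.95, c=1.0):
    """High-pass -> square-law -> low-pass, as in Fig. 4.

    a_hp and a_lp are one-pole filter coefficients; returns the sequence
    of oscillation-energy estimates y_{i,t}.
    """
    y = []
    hp_prev_in, hp_prev_out, lp = v_series[0], 0.0, 0.0
    for v in v_series[1:]:
        dv = a_hp * (hp_prev_out + v - hp_prev_in)   # remove DC content
        hp_prev_in, hp_prev_out = v, dv
        lp = a_lp * lp + (1 - a_lp) * c * dv**2      # extract DC of squared signal
        y.append(lp)
    return y

# A steady voltage yields roughly zero energy; a sustained oscillation
# settles at a positive level proportional to its amplitude squared.
steady = [1.0] * 200
ringing = [1.0 + 0.05 * math.sin(0.8 * t) for t in range(200)]
print(oscillation_energy(steady)[-1], oscillation_energy(ringing)[-1])
```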
C. DER Cyber-Attack Mitigation as an RL Problem

The primary goal of the DRL controller is to mitigate instabilities introduced by DER smart inverter VV/VW controllers due to maliciously chosen set-points. Let the graph $\mathcal{G} = (\mathcal{N}, \mathcal{L})$ represent the topology of the distribution feeder considered, where $\mathcal{N}$ is the set of nodes of the feeder (with 0 indexing the feeder head) and $\mathcal{L}$ is the set of lines. For simplicity of presentation, we assume the presence of a VV/VW-capable smart inverter at every node in the system, so that the total number of inverters in the system is $|\mathcal{N}|$. We suppose the set $\mathcal{N}$ is partitioned into two sets, $\mathcal{H}$ and $\mathcal{U}$, where $\mathcal{H} \cup \mathcal{U} = \mathcal{N}$, which represent the "compromised" and "uncompromised" inverters respectively. Furthermore, we assume that $\mathcal{U} \neq \emptyset$, i.e. we have some controllable resources to mitigate the effects of the cyber-physical attack. Given $\mathcal{U} \subsetneq \mathcal{N}$ and the temporal dependency of load and solar irradiance, as mentioned in Remark 1, the model is a POMDP where we wish to determine the optimum stochastic policy, $\pi_\theta(a|o)$, parameterized by the neural network parameters $\theta$, modeling the probability distribution of action $a \in \mathcal{A}$ given the observation $o \in \mathcal{O}$.

Training: Rather than training multiple agents simultaneously, we adopt the following heuristics to aid convergence:
1) For agent training, we define a single agent whose input observation vector is the mean of the input observation vectors of all controllable inverters $\in \mathcal{U}$ and whose action, $a_t$, is a deviation/offset, $\Delta\eta$, from the default VV/VW control curves that applies across inverters.
2) Once the single agent has been trained, its optimal policy is deployed locally on each individual inverter and only acts on local observations.
3) Rather than optimize over arbitrarily shaped VV/VW curves ($f_q(\bar{v})$ and $f_p(\bar{v})$), we optimize over the deviation, i.e. $a = \Delta\eta$, from the default parameters defining the curves in Figs. 2 - 3. An example of this is shown in Fig. 5, and a small sketch of the resulting discrete action set is given after Fig. 5. The translation ranges from -0.05 pu to 0.05 pu around an inverter's default VV/VW curve, with the action space being discretized into $k$ bins.
4) New parameterizations of the VV/VW functions are chosen so that measurement and power injection dynamics evolve on a faster timescale. This choice preserves the Markov property between actions taken by the RL controller.

Fig. 5: Action example (the RL agent translates the Volt-VAR curve by $\Delta\eta$).
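As an illustration of item 3, a discrete offset action set can be built as follows. The 0.01 p.u. step is taken from the Appendix; applying the same offset to every breakpoint translates the whole curve, as in Fig. 5.

```python
# Discretized offsets for the curve-translation action (a = delta_eta).
step = 0.01                                                 # p.u., per the Appendix
offsets = [round(-0.05 + i * step, 3) for i in range(11)]   # -0.05 ... 0.05 p.u.

def apply_offset(eta_default, delta):
    """Translate all VV/VW breakpoints by the same offset (Fig. 5)."""
    return [b + delta for b in eta_default]

print(offsets)
print(apply_offset([0.98, 1.01, 1.02, 1.05, 1.07], offsets[2]))
```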
Observation: The observation vectors $o_{i,t}$, $i \in \mathcal{U}$, at each RL agent (i.e. the input to the neural network that learns the optimum policy $\pi(a|o)$), consist of:
1) $y_{i,t}$: the mean of the estimated voltage oscillation energy at node $i$ since the last agent-environment interaction.
2) $y^{\max}_{i,t}$: the maximum of $y_{i,t}$ over the previous $n$ environment interactions. This is a tunable parameter that stores information on the recent oscillation energy.
3) $q^{\text{avail,nom}}_{i,t}$: the available reactive power capacity without active power curtailment.
4) $a^{\text{one-hot}}_{i,t-1}$: one-hot encoding of the previous action taken by the agent.

Reward: At a timestep $t$, the reward function, $R_t(a_t, o_t)$, is:

$$R_t = -\left(\frac{1}{|\mathcal{U}|}\sum_{i=1}^{|\mathcal{U}|} \sigma_y\, y_{i,t} + \sigma_a\, \mathbb{1}\!\left[a_t \neq a_{t-1}\right] + \sigma \left\|a_t\right\| + \frac{1}{|\mathcal{U}|}\sum_{i=1}^{|\mathcal{U}|} \sigma_p \left(1 - \frac{p_{i,t}}{p^{\max}_{i,t}}\right)\right). \quad (8)$$

The first component seeks to minimize the voltage oscillation $y$; the second penalizes configuration changes on inverters; the third component encourages the agent to use the default inverter configurations in the absence of voltage oscillations; and the final component penalizes any active power curtailment.
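The reward (8) maps directly to a few lines of code. In the sketch below the penalty weights are taken from the Appendix, and the variable names are ours.

```python
import numpy as np

def reward(y, p, p_max, a, a_prev,
           sig_y=15.0, sig_a=0.05, sig_d=18.0, sig_p=80.0):
    """Eq. (8): negative weighted sum of mean oscillation energy, an action-change
    indicator, deviation from the default curve, and mean active power curtailment.

    y, p, p_max: arrays over the uncompromised inverters in U.
    a, a_prev: current and previous action vectors (delta_eta).
    sig_d plays the role of the unsubscripted sigma in (8).
    """
    osc = sig_y * np.mean(y)
    switch = sig_a * float(not np.array_equal(a, a_prev))
    deviation = sig_d * np.linalg.norm(a)
    curtail = sig_p * np.mean(1.0 - p / p_max)
    return -(osc + switch + deviation + curtail)

print(reward(y=np.array([0.0, 0.01]), p=np.array([0.45, 0.5]),
             p_max=np.array([0.5, 0.5]), a=np.zeros(5), a_prev=np.zeros(5)))
```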
Any learning method requires sufficient training over a variety of scenarios. As is done in other applications of deep learning in the context of critical infrastructure systems, such training can be performed through realistic Monte Carlo simulations that cover a variety of operating conditions and cyber-physical attacks. We named the modular software architecture we designed to train the DRL agent described in the previous section PyCIGAR (Python-based Cybersecurity via Inverter-Grid Automatic Reconfiguration).
Fig. 6: PyCIGAR Architecture.
PyCIGAR is a Python library for distributed reinforcement learning for electric power distribution grids on quasi-static time scales. The library provides a link between power system simulators and a reinforcement learning library, RLlib [22]. PyCIGAR is a unified API that can interface different power system simulators (e.g. OpenDSS), while on the RL side PyCIGAR uses RLlib in order to deploy large-scale experiments on a server, a machine cluster, or a cloud.

A diagram of the PyCIGAR architecture is shown in Fig. 6. In addition to RL-based controllers, PyCIGAR also includes rule-based (RB) control devices (e.g. tap-changing transformers) and can easily be extended to support the integration of other more complicated DER (e.g. electric vehicle charging and battery storage systems). PyCIGAR provides a foundation for the rapid development of learning-based control algorithms for heterogeneous classes of DER in electric power distribution grids.
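The paper does not list PyCIGAR's API, so the sketch below only illustrates the general pattern it describes: a Gym-style environment with the observation/action structure of Section III-C, trained with RLlib's classic PPO Trainer interface. RLlib's API has changed across versions, and all environment details here (names, shapes, placeholder reward) are illustrative assumptions.

```python
import gym
import numpy as np
import ray
from ray.rllib.agents.ppo import PPOTrainer  # classic RLlib (~1.x) interface

class InverterEnv(gym.Env):
    """Toy stand-in for a PyCIGAR environment: observations are
    [y, y_max, q_avail, one-hot of previous action]; actions are
    11 discretized curve offsets (-0.05 ... 0.05 p.u.)."""
    def __init__(self, config=None):
        self.action_space = gym.spaces.Discrete(11)
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(14,))
        self.t = 0
    def reset(self):
        self.t = 0
        return np.zeros(14, dtype=np.float32)
    def step(self, action):
        self.t += 1
        obs = np.zeros(14, dtype=np.float32)
        obs[3 + action] = 1.0                    # one-hot of previous action
        reward = -abs(action - 5) * 0.01         # placeholder reward only
        return obs, reward, self.t >= 20, {}

if __name__ == "__main__":
    ray.init(ignore_reinit_error=True)
    trainer = PPOTrainer(env=InverterEnv,
                         config={"num_workers": 1, "framework": "tf"})
    print(trainer.train()["episode_reward_mean"])
```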
IV. RESULTS
We conduct experiments on the IEEE 37-bus feeder with all load buses having an active power generation of 50% of the nominal load, with an additional 10% inverter over-sizing for reactive power headroom. The agent training environment consists of 700 one-second timesteps per simulation. At the end of each experiment, the training environment is reset with randomized load and solar generation profiles and percentage of compromised inverters. This diversity creates a rich environment that exposes the RL agent to attacks that could occur anytime throughout the day under a variety of loading and solar conditions and proportions of compromised inverters. For each case, all inverters start with their default VV/VW settings, and at a particular time in the simulation the attacker gains control of a portion of the installed inverter capacity at each node to create a voltage instability. It does so by translating the VV/VW curves and steepening the slopes to induce an oscillation. This attack vector represents a subset of possible attack vectors. The agent is allowed to reconfigure the VV/VW curves of non-compromised DER to mitigate oscillations that result from the cyber-attack. We consider two types of action: 1) translating the entire VV/VW piecewise functions from their default configuration (offset action) and 2) adjusting the slope of the piecewise functions in the regions $(\eta_1, \eta_2]$ and $[\eta_3, \eta_4]$ (slope action). Within the simulation, the agent receives observations and updates the inverters' functions $f_q^i$ and $f_p^i$ every 35 seconds. The training is conducted on a server with an Intel Xeon E5-2623 v3 processor and 64 GB of RAM, and takes 1 hour of training time to converge.

Fig. 7 shows the baseline case caused by a 45% attack around noon with no action taken to mitigate the result of the attack. The attack creates oscillations in system voltages that are detected by the oscillation detector (see Section III-B). The malicious re-dispatch of settings is shown in the action subplot, and the components of the reward function, (8), are shown at the bottom. In the absence of any control, the reward is solely composed of the penalty for the oscillation.

Figs. 8 - 9 show the behavior of the trained RL agent at a random node in the network in mitigating instabilities from compromised DER at two different times of day and different percentages of compromised DER. At simulation time t = 200 s the attack is introduced in a portion of DER. This can be seen in the action subplot, which shows the breakpoints of the piecewise linear curves of Figs. 2 - 3 being suddenly moved to a new configuration. This triggers an oscillation in grid voltages. The output of the oscillation observer is both fed into the RL agent as an observation and included as a negative penalty in the reward function. The agent, therefore, should control non-compromised assets to minimize the oscillation. This is what occurs, as the agent changes the breakpoints of non-compromised units just after t = 250 s by translating the default VV/VW curves. This action almost immediately stops the oscillation in the system voltages, resulting in a defeat of the original cyber-attack. Fig. 8 features an attack in the morning, around 9 AM, where there is significant excess capacity for reactive power compensation available for the agent to mitigate the cyber-attack. The agent, therefore, does not need to curtail active power generation to successfully mitigate the attack. This, however, is not the case in Fig. 9, where the attack occurs around midday and the agent is forced to curtail active power in order to have enough controllability to defeat the attack. Across numerous training configurations it was observed that the RL offset was the preferred action of the agent.

Also worth highlighting is the behavior of the agent after the compromised DER have been identified and returned to their original settings, at t = 450 s. Shortly after, we observe the agent also returning to its original configuration. In demonstrating this behavior, it can be seen that the RL agent will take steps to ameliorate the effects of the cyber-attack, but will return to a state of inactivity once the threat of the attack has passed.
Fig. 7: Result of an evaluation episode at 45% attack without agent defense (subplots: voltage (p.u.), oscillation energy, actions, and reward components vs. time).
Fig. 8: Result of an evaluation episode at 20% attack around 9 AM.
Fig. 9: Result of an evaluation episode at 45% attack around noon.

V. CONCLUSIONS
This paper has proposed a reinforcement learning approach for mitigating the oscillations due to unstable smart inverter settings by training an agent to translate the VV/VW curves. The resultant policy successfully mitigated adversary-induced voltage oscillatory behavior for the cases considered.

Future work will investigate the value of this approach for larger networks and the sensitivity of the trained agents to specific network topologies/configurations. Additionally, we will explore different neural network architectures, including Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks, which have proven to be the state of the art in solar and load forecasting and may improve the performance of the agent. Additional types of attacks will also be considered, including, but not limited to, voltage imbalance attacks. An adversary may seek to exploit DER interaction with utility voltage regulation systems to create system voltage imbalances, leading to device trips and possible system collapse.
REFERENCES

[1] J. Qi, A. Hahn, X. Lu, J. Wang, and C.-C. Liu, "Cybersecurity for distributed energy resources and smart inverters," IET Cyber-Physical Systems: Theory & Applications, vol. 1, no. 1, pp. 28–39, 2016.
[2] S. Sahoo, T. Dragičević, and F. Blaabjerg, "Cyber security in control of grid-tied power electronic converters: challenges and vulnerabilities," IEEE Journal of Emerging and Selected Topics in Power Electronics, 2019.
[3] "Modernizing Hawaii's Grid For Our Customers," Tech. Rep., 2017.
[4] "IEEE Standard 1547 Communications and Interoperability: New Requirements Mandate Open Communications Interface and Interoperability for Distributed Energy Resources," Tech. Rep., 2017.
[5] Q. Huang, R. Huang, W. Hao, J. Tan, R. Fan, and Z. Huang, "Adaptive power system emergency control using deep reinforcement learning," IEEE Transactions on Smart Grid, 2019.
[6] Q. Yang, G. Wang, A. Sadeghi, G. B. Giannakis, and J. Sun, "Two-timescale voltage control in distribution grids using deep reinforcement learning," IEEE Transactions on Smart Grid, 2019.
[7] C. Li, C. Jin, and R. K. Sharma, "Coordination of PV smart inverters using deep reinforcement learning for grid voltage regulation," pp. 1930–1937, 2019.
[8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[9] O. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba, "Learning dexterous in-hand manipulation," The International Journal of Robotics Research, vol. 39, no. 1, pp. 3–20, 2019.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[11] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse et al., "Dota 2 with large scale deep reinforcement learning," arXiv preprint arXiv:1912.06680, 2019.
[12] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.
[13] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[14] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.
[15] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning, 2015, pp. 1889–1897.
[16] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[17] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. Eslami et al., "Emergence of locomotion behaviours in rich environments," arXiv preprint arXiv:1707.02286, 2017.
[18] B. Seal, "Common Functions for Smart Inverters, 4th Ed.," Electric Power Research Institute, Tech. Rep. 3002008217, 2017.
[19] M. Farivar, L. Chen, and S. Low, "Equilibrium and dynamics of local voltage control in distribution systems," in IEEE Conference on Decision and Control, Dec. 2013, pp. 4329–4334.
[20] J. H. Braslavsky, L. D. Collins, and J. K. Ward, "Voltage stability in a grid-connected inverter with automatic volt-watt and volt-var functions," IEEE Transactions on Smart Grid, vol. PP, no. 99, pp. 1–1, 2017.
[21] K. Baker, A. Bernstein, E. Dall'Anese, and C. Zhao, "Network-cognizant voltage droop control for distribution grids," IEEE Transactions on Power Systems, vol. 33, no. 2, pp. 2098–2108, 2018.
[22] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. Gonzalez, M. Jordan, and I. Stoica, "RLlib: Abstractions for distributed reinforcement learning," in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm, Sweden: PMLR, Jul. 2018, pp. 3053–3062.

APPENDIX
Hyperparameter                                          Value
α (learning rate)
γ (reward discount factor)                              0.5
λ (GAE parameter)                                       0.95
ε (PPO clip parameter)                                  0.1
batch size                                              420
activation function                                     tanh
network hidden layers                                   dense (64, 64, 32)
σ_y (oscillation penalty)                               15
σ_a (action penalty)                                    0.05
σ (penalty for deviation from default VV/VW curve)      18
σ_p (penalty for curtailing active power)               80
action range                                            -0.05 pu to 0.05 pu
k (action range discretization)                         0.01 p.u.