A Secure Learning Control Strategy via Dynamic Camouflaging for Unknown Dynamical Systems under Attacks
Sayak Mukherjee, Veronica Adetola

Abstract—This paper presents a secure reinforcement learning (RL) based control method for unknown linear time-invariant cyber-physical systems (CPSs) that are subjected to compositional attacks such as eavesdropping and covert attack. We consider the attack scenario where the attacker learns about the dynamic model during the exploration phase of the learning conducted by the designer to learn a linear quadratic regulator (LQR), and thereafter uses such information to conduct a covert attack on the dynamic system; we refer to this as the doubly learning-based control and attack (DLCA) framework. We propose a dynamic camouflaging based attack-resilient reinforcement learning (ARRL) algorithm which can learn the desired optimal controller for the dynamic system and, at the same time, inject sufficient misinformation into the attacker's estimate of the system dynamics. The algorithm is accompanied by theoretical guarantees and extensive numerical experiments on a consensus multi-agent system and on a benchmark power grid model.
Index Terms—Cyber-physical systems, CPS security, reinforcement learning, covert attacks, attack-resilient learning control.
I. INTRODUCTION
Security of Cyber-Physical Systems (CPSs) is becoming one of the fundamental requirements for safeguarding various infrastructure and control systems against malicious attacks that can lead to catastrophic failures if left unattended. References such as [1], [2] present overviews of various theoretical and computational aspects of attack detection, prevention, and resilient control design. Extensive research on the detection and identification of attacks can be found in [3]–[5]. Various attack scenarios are considered in the literature; [6] categorizes these scenarios based on CPS model knowledge, disclosure of resources, and disruption of resources. More specifically, recent works have investigated attacks such as denial-of-service attacks [7], false data-injection attacks [8], replay attacks [9], and covert attacks [10], [11]. The different types of attacks have led to several prevention and mitigation techniques, involving secure state estimation [12], watermarking certain pre-specified signals in the CPS loop [13], [14], and moving target defense and its variants [15]–[17], to name a few. Most of the literature on CPS security focuses on the setting where the designer knows the dynamic model of the system, and the attacker possesses knowledge about the dynamics with varying degrees of availability. However, with the ever-increasing complexity and dimensionality of dynamic systems, the designer may not have explicit model information, and may need to learn the dynamics or the feedback controller from the state, control, and output trajectories. In this paper, we propose a new formulation and mitigation technique for a compositional attack performed in a setting where the designer is tasked to learn an optimal controller in a data-driven way.

S. Mukherjee and V. Adetola are with the Optimization and Control Group, Pacific Northwest National Laboratory (PNNL), Richland, WA, USA. Emails: (sayak.mukherjee, veronica.adetola)@pnnl.gov.
We consider both eavesdropping and the covert attack, performed sequentially: the attacker can manipulate both the controls and the measurements to remain undetected while, at the same time, harming the CPS by injecting malicious inputs. Recently, feedback control research for partially or fully unknown dynamic systems has seen tremendous growth with the advancement of data-driven learning techniques such as reinforcement learning (RL) [18]. In recent years, several papers such as [19]–[23] have used RL for linear optimal control using a variety of solution techniques such as adaptive dynamic programming (ADP), Q-learning, actor-critic methods, and model reduction based RL. More variants of data-driven control research, such as data-dependent linear matrix inequalities [24], various distributed control designs [25], [26], and sample complexity analyses [27], have also been reported. In this paper, we consider the linear quadratic regulator problem solved using ADP/RL as in [20], and then investigate scenarios with malicious attacks. In [28], a learning-based secure control framework in the presence of sensor and actuator attacks is discussed; that work first uses the model to perform the detection task. [29] considers learning-based attacks. We consider a very generic attack scenario where the attacker initially does not possess any information about the system dynamics; the attacker therefore first eavesdrops and, after gathering sufficient dynamic information, conducts a covert attack.
As we consider the problem of learning the optimal control in a secure way from the perspective of the designer, we refer to our framework as a doubly learning-based control and attack (DLCA) scenario. We propose an attack-resilient reinforcement learning (ARRL) algorithm which can learn the desired optimal controller for the dynamic system and, at the same time, inject sufficient misinformation into the attacker's estimate of the system dynamics to delude the attacker.

The contributions of this paper are as follows:
• We propose a secure learning control framework, namely DLCA, where the attacker tries to exploit the learning methodology, such as the system exploration, to gather important dynamic information and conduct malicious covert attacks. The vulnerability of learning-based designs for CPSs in the presence of such attackers therefore needs to be addressed.
• Thereafter, we propose a retrofitted reinforcement learning design, namely ARRL, where the designer tries to misguide the attacker during the exploration phase of the learning. We dynamically couple the CPS with a nonlinear time-variant static map satisfying input-to-state stability (ISS) conditions, which we term dynamic camouflaging.
• The proposed method is accompanied by sufficient guarantees and numerical experiments conducted on a consensus multi-agent system and on a benchmark power grid model.

The rest of the paper is organized as follows. The secure learning control problem is introduced in Section II. The nominal ADP/RL based control design is discussed in Section III. The attack model is discussed in Section IV. Section V proposes the attack-resilient design and its advantages. Numerical experiments are performed in Section VI, and concluding remarks are provided in Section VII.

II. SECURE LEARNING CONTROL PROBLEM
We first formalize the problem of performing secure learning control design for the CPS. We consider a linear time-invariant dynamic system with the dynamics:
$\dot{x} = Ax + Bu, \quad x(0) = x_0$, (1)
where $x \in \mathbb{R}^n$ denotes the states and $u \in \mathbb{R}^m$ denotes the control inputs. The designer is interested in computing the optimal controller for this dynamic system with unknown state matrices. Therefore, we make the following assumptions from the designer's perspective:

Assumption 1: The model matrices $A$ and $B$ are unknown.

Assumption 2: The pair $(A, B)$ is stabilizable.

Assumption 3: The state and control measurements $x(t)$ and $u(t)$ are available to the designer.

The designer is tasked with formulating a learner to solve the following linear quadratic regulator (LQR) problem.

P. Under Assumptions 1, 2, and 3, learn the state feedback controller $u = -Kx$ such that the following objective is minimized in closed loop:
minimize $J(x_0, u) = \int_0^\infty \big(x(t)^T Q x(t) + u(t)^T R u(t)\big)\, dt$, (2)
where $Q \succeq 0$ and $R \succ 0$ denote the state and control penalty weights.

The standard model-based solution to this problem is found by solving the well-known algebraic Riccati equation (ARE). In recent times, research on adaptive dynamic programming and reinforcement learning has looked into techniques that solve this problem without knowledge of the model matrices, using only state and input trajectories (we briefly recapitulate such a nominal learning control design in the next section). However, standard learning control designs do not consider any adversarial behavior from malicious entities. In this paper, on the other hand, we consider a scenario where the dynamic system is under the influence of an attacker.

The attacker initially does not possess any dynamic information about the system. Therefore, the attacker tries to extract the dynamic information of the system during the learning, and then conducts covert attacks, which will be discussed shortly. We can formalize the activities of the attacker in two phases, described as follows.

Attacker Act A1 (Eavesdropping): The attacker conducts the first phase of the attack during the exploration phase of the learning control design. The attacker intends to learn the dynamic matrices $A, B$ during this phase by gathering the exploration input $u(t)$ and the resultant state measurements $x(t)$. Although the designer is conducting the exploration of the dynamic system to learn the optimal control $K$, the designer is unaware that the attacker is also using this trajectory information to learn the system's dynamics.

Attacker Act A2 (Covert Attack): In the next phase, the adversary conducts a covert attack on the dynamical system. In covert attacks, the attacker injects malicious signals at the inputs, and then compensates for the impact of the injected attack in the measurements to conceal the attack, thereby making it covert to the system operator. Covert attacks are very difficult to identify, and their impact on the system can be catastrophic. The attacker can keep injecting malicious signals at the actuators such that the system continues to operate inefficiently without being captured by the sensors.

Therefore, the designer now needs to propose modifications to the learning control design. The attacker should not be able to eavesdrop and accurately capture the dynamic information that could allow a successful covert attack. At the same time, the designer should learn the optimal control solutions corresponding to the dynamical system (1) and objective (2). We enumerate these considerations as follows:

• Learner consideration 1:
The learner needs to make the exploration phase of the learning control secure. That is, the measurements of $u(t)$ and $x(t)$ should not be an accurate representation of the dynamical system (1).

• Learner consideration 2: The attacker should be unable to keep its attack on the system covert. Therefore, without any external disturbance, when the dynamic system operates under steady-state conditions, any malicious injection should create considerable perturbations at the state measurement channels which can be easily detected by the operator.

• Learner consideration 3: Although the learner camouflages the exploration trajectory measurements, it is still required to compute the optimal control $u = -Kx$ corresponding to the original learning problem P. Therefore, we do not compromise the learning accuracy in order to satisfy considerations 1 and 2, which is important from the perspective of the implemented feedback control for the dynamic system.

We next recapitulate the nominal adaptive dynamic programming (ADP) based optimal control learning strategy that can solve problem P in the absence of any malicious attack.

III. RECAPITULATION OF ADP-BASED NOMINAL LEARNING CONTROL
The problem P with unknown model matrices can be solved using ADP/RL based approaches. We use the off-policy RL based gain computation framework from [20]. Here, the system is excited with an exploration signal $u_0$, and thereafter the state measurements $x(t)$ are gathered for a sufficient number of time samples, described shortly. The control input $u_0$ should be such that the state trajectory remains sufficiently bounded. Considering a quadratic Lyapunov function $x^T P x$, $P \succ 0$, the time derivative along the state trajectories is computed, and subsequently, using Kleinman's algorithm [30], the following model-independent trajectory relationship can be obtained for the interval $[t, t+T]$:

$x^T(t+T) P_k x(t+T) - x^T(t) P_k x(t) - 2\int_t^{t+T} \big((K_k x + u_0)^T R K_{k+1} x\big)\, d\tau = -\int_t^{t+T} \big(x^T \bar{Q}_k x\big)\, d\tau$, (3)

where $\bar{Q}_k = Q + K_k^T R K_k$. We can solve (3) by constructing a data-driven iteration framework that uses the time-sampled measurements of the states and the controls. The learner gathers the data matrices $D = \{\mathcal{N}_{xx}, \mathcal{M}_{xx}, \mathcal{M}_{xu}\}$, where

$\mathcal{N}_{xx} = \big[\, x \otimes x\,\big|_{t_1}^{t_1+T}, \cdots, x \otimes x\,\big|_{t_l}^{t_l+T} \,\big]^T$, (4)
$\mathcal{M}_{xx} = \big[\, \textstyle\int_{t_1}^{t_1+T} (x \otimes x)\, d\tau, \cdots, \int_{t_l}^{t_l+T} (x \otimes x)\, d\tau \,\big]^T$, (5)
$\mathcal{M}_{xu} = \big[\, \textstyle\int_{t_1}^{t_1+T} (x \otimes u_0)\, d\tau, \cdots, \int_{t_l}^{t_l+T} (x \otimes u_0)\, d\tau \,\big]^T$. (6)

Algorithm 1 summarizes the steps to compute the optimal control solutions with unknown state dynamics.
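As an illustration of how the data matrices (4)–(6) and the least-squares step described below can be assembled from sampled trajectories, the following sketch (our own, not the authors' code) builds the matrices with Riemann sums and solves one policy-iteration pass. A row-major vec(·) convention is used, which places the Kronecker identity factor on the right, i.e., $(M \otimes I_n)$ instead of the column-major $(I_n \otimes M)$ of the text, and stores $u_0 \otimes x$ rows in $\mathcal{M}_{xu}$.

```python
# Illustrative sketch of the data matrices (4)-(6) and the least-squares
# policy-iteration step; not the authors' code. Row-major vec(.) is used,
# so Kronecker factors appear as (M kron I_n), and M_xu holds u kron x rows.
import numpy as np

def build_data(xs, us, starts, steps, dt):
    """xs: (N+1, n) state samples, us: (N+1, m) exploration inputs,
    sampled every dt seconds; each interval covers `steps` samples."""
    Nxx, Mxx, Mxu = [], [], []
    for s in starts:
        e = s + steps
        Nxx.append(np.kron(xs[e], xs[e]) - np.kron(xs[s], xs[s]))
        # Riemann sums stand in for the integrals in (5)-(6)
        Mxx.append(dt * sum(np.kron(xs[k], xs[k]) for k in range(s, e)))
        Mxu.append(dt * sum(np.kron(us[k], xs[k]) for k in range(s, e)))
    return np.array(Nxx), np.array(Mxx), np.array(Mxu)

def pi_step(Nxx, Mxx, Mxu, Kk, Q, R):
    """One data-driven pass: solve for P_k and K_{k+1} without A or B."""
    m, n = Kk.shape
    Theta = np.hstack([Nxx,
                       -2.0 * Mxx @ np.kron(Kk.T @ R, np.eye(n))
                       - 2.0 * Mxu @ np.kron(R, np.eye(n))])
    Qbar = Q + Kk.T @ R @ Kk
    sol, *_ = np.linalg.lstsq(Theta, -Mxx @ Qbar.reshape(-1), rcond=None)
    Pk = sol[:n * n].reshape(n, n)
    return 0.5 * (Pk + Pk.T), sol[n * n:].reshape(m, n)
```

Starting from a stabilizing $K_0$ ($K_0 = 0$ when $A$ is Hurwitz, cf. Remark 1), repeated calls to `pi_step` converge to the LQR gain using trajectory data only.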
Algorithm 1: Nominal Reinforcement Learning Control

1. Data gathering: Measure the states and controls ($x(t)$ and $u_0$) over the intervals $(t_1, t_2, \cdots, t_l)$, $t_i - t_{i-1} = T$. Then construct $D = \{\mathcal{N}_{xx}, \mathcal{M}_{xx}, \mathcal{M}_{xu}\}$ such that rank$([\mathcal{M}_{xx}\ \mathcal{M}_{xu}]) = n(n+1)/2 + nm$.

2. Iteratively update the control: Starting with a stabilizing $K_0$, update the feedback control gain $K_k$ iteratively ($k = 0, 1, \cdots$) by solving the least-squares equation, where vec$(\cdot)$ denotes the standard vectorization operation:
for $k = 0, 1, 2, \ldots$
A. Solve for $P_k$ and $K_{k+1}$:
$\big[\, \mathcal{N}_{xx} \quad -2\mathcal{M}_{xx}(I_n \otimes K_k^T R) - 2\mathcal{M}_{xu}(I_n \otimes R) \,\big] \begin{bmatrix} \mathrm{vec}(P_k) \\ \mathrm{vec}(K_{k+1}) \end{bmatrix} = -\mathcal{M}_{xx}\, \mathrm{vec}(\bar{Q}_k)$. (7)
B. Break the iterations when $\|P_k - P_{k-1}\| < \varsigma$, where $\varsigma$ is a small positive threshold.
end for

3. Applying $K$ on the system: Finally, apply $u = -Kx$, and remove $u_0$.

Remark 1:
With a nominally stable system, i.e., when $A$ is Hurwitz, the iterative update in (7) does not require any initial stabilizing control. Otherwise, policy iteration based RL techniques require a stabilizing $K_0$ [31], mainly because of their inception from the Newton-Kleinman updates [30].

Remark 2: In order to get unique convergent solutions for the optimal gain, the data sample requirement is converted into the following rank condition: rank$([\mathcal{M}_{xx}\ \mathcal{M}_{xu}]) = n(n+1)/2 + nm$. This is dictated by the number of unknown variables in the least-squares problem, and is also analogous to the persistency of excitation condition of the adaptive control literature. Practically, one can use twice the number of data samples required by the rank condition for guaranteed convergence.

Theorem 1 [20]: When the rank condition of Remark 2 is satisfied, the iterates $P_k$, $K_k$ from Algorithm 1 converge to the optimal $P$ and $K$ as $k \to \infty$. □

IV. ATTACKER MODELING

We consider an attacker which acts in two phases, as follows.
A. Attacker Act 1 - Eavesdropping:
The attacker becomes active during the exploration phase of the learning algorithm. In this phase, the attacker eavesdrops on the input and state channels $u(t)$ and $x(t)$ to gather the measurements for the time instants $t_1, \ldots, t_l$. The attacker gathers as much dynamic information as possible to perform the identification of $A, B$, and can employ any of the subspace based identification approaches. We assume that the attacker knows the dimensionality of the system. The attacker constructs a surrogate model of the original dynamical system as follows:
$\dot{\tilde{x}} = \tilde{A}\tilde{x} + \tilde{B}u, \quad \tilde{x}(0) = \tilde{x}_0$. (8)
It can be expected that the model matrices $\tilde{A}, \tilde{B}$ are identified with high accuracy. Practically, the attacker can identify a model which is similar to the original state space, i.e., the two are related by a similarity transformation $T$:
$\tilde{A} = TAT^{-1}, \quad \tilde{B} = TB$. (9)
In order to consider the worst-case scenario, we assume that the attacker can successfully identify the actual state-space model. Therefore, $\tilde{A} = A$, $\tilde{B} = B$, i.e., $T = I$.

Remark 3 (Why eavesdropping during exploration?): The reason behind this consideration is that the exploration is the most vulnerable phase for the system to be compromised, creating a worst-case scenario. During exploration, the designer tries to persistently excite the system such that the input-output data is sufficiently rich in dynamic information. Therefore, if the attacker can get access to such exploration data, the attacker can easily perform system identification to obtain accurate model estimates.
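To make Remark 3 concrete, here is a minimal sketch (our illustration; the text leaves the attacker's method open as "any subspace based identification approach") of how an eavesdropper could fit $(\tilde{A}, \tilde{B})$ to sampled exploration data with a simple finite-difference least-squares regression:

```python
# Minimal eavesdropper sketch: fit x_dot = A x + B u by least squares on
# finite differences of the eavesdropped samples. This is a crude stand-in
# for the subspace identification methods mentioned in the text; it is our
# illustration, not the paper's attacker implementation.
import numpy as np

def eavesdrop_identify(xs, us, dt):
    """xs: (N, n) eavesdropped states, us: (N, m) eavesdropped inputs,
    sampled every dt seconds. Returns the estimates (A_hat, B_hat)."""
    n = xs.shape[1]
    dX = (xs[1:] - xs[:-1]) / dt            # forward-difference derivatives
    Z = np.hstack([xs[:-1], us[:-1]])       # regressors [x_k, u_k]
    Theta, *_ = np.linalg.lstsq(Z, dX, rcond=None)
    return Theta.T[:, :n], Theta.T[:, n:]   # (A_hat, B_hat)
```

On richly excited data from the unprotected plant (1), such a fit recovers $(A, B)$ accurately, which is exactly the worst case ($T = I$) assumed above.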
B. Attacker Act 2 - Covert Attack:
Once the attacker has access to the system dynamic model information, a covert attack using the manipulated input and state channels is conducted. An attack on the control input results in:
$\tilde{u}(t) = u(t) + \zeta(t)$, (10)
where $\zeta(t)$ is the malicious input signal. Thereafter, the attacker compensates for the effect of the input attack at the state measurement sensors of the system by manipulating:
$\bar{x}(t) = x(t) - \tilde{x}(t)$, (11)
where $\bar{x}(t)$ is the measured state and $\tilde{x}(t)$ is the compensation signal added by the attacker. Therefore, the attacker needs to generate $\zeta(t)$ and $\tilde{x}(t)$. We now state the following lemma, which characterizes the covert nature of the attack.

Lemma 1:
Suppose the attack begins at $t = T_a$. If the attacker uses (10), (11), and runs the dynamics
$\mathcal{A}: \dot{\tilde{x}} = A\tilde{x} + B\zeta$, (12)
with $\tilde{x}(T_a) = 0$, then the attack will remain covert at the measured states.

Proof: The solution of $x(t)$ starting from $t = T_a$ is given by
$x(t) = e^{A(t-T_a)} x(T_a) + \int_{T_a}^{t} e^{A(t-\tau)} \big[B(u + \zeta)\big]\, d\tau$, (13)
and the output of the attacker's internal model (12) is given by
$\tilde{x}(t) = e^{A(t-T_a)} \tilde{x}(T_a) + \int_{T_a}^{t} e^{A(t-\tau)} \big[B\zeta\big]\, d\tau$. (14)
Therefore, the modified output seen by the designer, following from (11), is
$\bar{x}(t) = e^{A(t-T_a)} \big(x(T_a) - \tilde{x}(T_a)\big) + \int_{T_a}^{t} e^{A(t-\tau)} \big[Bu\big]\, d\tau$. (15)
Setting $\tilde{x}(T_a) = 0$, it is evident that $\bar{x}(t)$ can be treated as legitimate state measurements, making the attack covert. □

V. RETROFITTING THE LEARNING TO MAKE IT ATTACK-RESILIENT
The designer has to make sure that the system identification by the attacker is incorrect and, at the same time, that the desired feedback control gain is computed without sacrificing performance, as enumerated in Section II. To this end, we propose a solution in this section. The main idea is to modify the dynamic model in such a way that the attacker cannot perfectly identify the system during the exploration phase of the learning, which we refer to as dynamic camouflaging. On the other hand, the designer has full knowledge of this extraneous modification, such that the optimal gain $K$ can be correctly learned as in P.

As the designer starts the learning under the assumption of unknown $A$ and $B$ matrices, system parameters or actuator gains cannot be modified. Instead, we suggest adding an input-to-state stable (ISS) coupling, dependent on the states of the dynamic system, at an internal actuation location of the system, which is assumed to be safe with respect to external attacks. To check the applicability of such a coupling, operators can use plant simulators or historic measurements to estimate whether the state trajectories will remain bounded. The underlying dynamic system is therefore modified to:
$\dot{x} = Ax + B(u + \psi)$, (16)
$\psi = \phi(t, x(t))$. (17)
Here $\psi$ is assumed to be a nonlinear time-varying static map; one can, however, also make this a dynamic map. We make the following boundedness assumption on the coupling function $\psi = \phi(t, x(t))$.

Assumption 4: The coupling function $\psi = \phi(t, x(t))$ satisfies
$\|\psi\| = \|\phi(t, x(t))\| \leq \gamma \|x(t)\|, \quad \gamma > 0$. (18)

Thereafter, we characterize the ISS stability of the interconnection during the exploration, such that the state trajectories remain bounded.

Lemma 2:
With bounded exploration inputs and under Assumption 4, the system (16) will be input-to-state stable (ISS) with respect to $\psi$.

Proof: For a generic dynamic system, the exploration control can be written in the form $u = -K_s x + u_0$, where $(A - BK_s)$ is stable and $u_0$ is a bounded exploration input such that the state measurements remain within a stable neighborhood of the operating point. Therefore, we can write
$\dot{x} = (A - BK_s)x + Bu_0 + B\psi(t), \quad x(t_0) = x_0$. (19)
The resulting state trajectories become
$x(t) = e^{(A - BK_s)(t - t_0)} x_0 + \int_{t_0}^{t} e^{(A - BK_s)(t - \tau)} B u_0\, d\tau + \int_{t_0}^{t} e^{(A - BK_s)(t - \tau)} B \psi(\tau)\, d\tau$. (20)
As $A - BK_s$ is Hurwitz, we have $\|e^{(A - BK_s)(t - t_0)}\| \leq k e^{-\lambda (t - t_0)}$, $k > 0$, $\lambda > 0$. Thereafter, we can bound $\|x(t)\|$ as
$\|x(t)\| \leq k e^{-\lambda (t - t_0)} \|x(t_0)\| + \frac{k\|B\|}{\lambda} \Big( \sup_{\tau \in [t_0, t]} \|u_0(\tau)\| + \sup_{\tau \in [t_0, t]} \|\psi(\tau)\| \Big)$. (21)
Therefore, we can conclude that with bounded $\|u_0(t)\|$ and Assumption 4, the $x(t)$ dynamics remain ISS. □

This type of dynamics has recently been studied in our papers [32], [33] in the context of robustness of structured control designs. Motivated by them, the formalism is found to be suitable for making the nominal RL algorithm attack-resilient. When the attacker eavesdrops on the input and state channels, $u(t)$ and $x(t)$ correspond to the underlying dynamics (16), and therefore performing a system identification results in erroneous state-space representations, i.e.,
$\tilde{A} \neq TAT^{-1}, \quad \tilde{B} \neq TB$. (22)
In our numerical experiments we set $\phi(t, x(t)) = f(t)x(t)$ with $\|f(t)\| \leq \gamma$, $f(t) \neq 0\ \forall t$, and, to simulate the worst case, we assume that the identified state matrix is $\tilde{A} = A + Bf(t_0)$, with $f(t)$ frozen at time $t = t_0$. We assume that the attacker can perfectly estimate $B$, to create a worst-case scenario.

To this end, the learning algorithm needs to be retrofitted due to the modifications in (16).
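To illustrate the effect of the coupling (16)–(17) on the eavesdropper, the following sketch (our illustration: the plant, the scalar camouflaging gain $f(t)$, and the least-squares identifier are all assumptions for this example, with $B$ square so that $\psi = f(t)x$ has the input dimension) simulates the camouflaged plant and fits an LTI model to the eavesdropped data:

```python
# Sketch of dynamic camouflaging (16)-(17) with the illustrative choice
# psi = f(t) * x (B square). The attacker's least-squares LTI fit of the
# eavesdropped (x, u) data is then biased away from the true A, as in (22).
# Plant, f(t), and identifier are our assumptions, not the paper's values.
import numpy as np

def camo_rollout(A, B, f, u_of_t, dt, N, x0):
    """Euler rollout of dx/dt = A x + B (u + psi), psi = f(t) * x."""
    xs = np.zeros((N, len(x0))); us = np.zeros((N, B.shape[1]))
    x = np.asarray(x0, float)
    for k in range(N):
        t = k * dt
        u = u_of_t(t)
        xs[k], us[k] = x, u
        x = x + dt * (A @ x + B @ (u + f(t) * x))
    return xs, us

def lti_fit(xs, us, dt):
    """Attacker's LTI least-squares fit on finite differences."""
    dX = (xs[1:] - xs[:-1]) / dt
    Z = np.hstack([xs[:-1], us[:-1]])
    Theta, *_ = np.linalg.lstsq(Z, dX, rcond=None)
    n = xs.shape[1]
    return Theta.T[:, :n], Theta.T[:, n:]
```

With the coupling switched off, the fit recovers $A$; with the coupling active, the same fit lands on a camouflaged state matrix, which is the misinformation the designer intends to inject.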
As we have intentionally connected the coupling $\psi$, the designer has access to the full measurements of $\psi(t)$. At this stage, recall Kleinman's algorithm:

Theorem 2 [30]: Let $K_0$ be such that $A - BK_0$ is Hurwitz. Then, for $k = 0, 1, \ldots$:
1. Solve for $P_k$ (Policy evaluation):
$A_k^T P_k + P_k A_k + K_k^T R K_k + Q = 0, \quad A_k = A - BK_k$. (23)
2. Update the control gain (Policy update):
$K_{k+1} = R^{-1} B^T P_k$. (24)
Then $A - BK_k$ is Hurwitz for every $k$, and $K_k$ and $P_k$ converge to the optimal $K$ and $P$ as $k \to \infty$. □

Fig. 1: Overview of the attack-resilient learning methodology

The modified state dynamics (16) incorporating $u = -K_k x$ is given by
$\dot{x} = (A - BK_k)x + B(K_k x + u_0) + B\psi$. (25)
We use similar exploration inputs $u = u_0$ as before. As in the derivation of Algorithm 1, we consider a similar Lyapunov function candidate $x^T P_k x$, compute its derivative along the closed-loop trajectories, and use Theorem 2 to replace the dependency on the model matrices as follows:
$\frac{d}{dt}(x^T P_k x) = x^T (A_k^T P_k + P_k A_k) x + 2(K_k x + u_0)^T B^T P_k x + 2\psi^T B^T P_k x$
$= -x^T \bar{Q}_k x + 2(K_k x + u_0)^T R K_{k+1} x + 2\psi^T B^T P_k x$, (26)
where $\bar{Q}_k = Q + K_k^T R K_k$. Rearranging and taking integrals on both sides, we have
$x^T(t+T) P_k x(t+T) - x^T(t) P_k x(t) - 2\int_t^{t+T} \big((K_k x + u_0)^T R K_{k+1} x\big)\, d\tau = -\int_t^{t+T} \big(x^T \bar{Q}_k x - 2\psi^T B^T P_k x\big)\, d\tau$. (27)
Equation (27) is independent of the model matrices, and is constructed from the trajectories of the system states $x(t)$, the exploration control $u_0(t)$, and the measurements of the coupling $\psi(t)$. We can use the properties of the Kronecker product (denoted by $\otimes$) to write $x^T \bar{Q}_k x = (x^T \otimes x^T)\, \mathrm{vec}(\bar{Q}_k)$ and $\psi^T B^T P_k x = (x^T \otimes \psi^T)\, \mathrm{vec}(B^T P_k)$. The attack-resilient RL algorithm can then be written by formulating an iterative version of (27), using measurements of $x(t)$, $u_0(t)$, and $\psi(t)$, as given in Algorithm 2. We append the data matrix $D$ of Algorithm 1 with $\mathcal{M}_{x\psi}$, where
$\mathcal{M}_{x\psi} = \big[\, \textstyle\int_{t_1}^{t_1+T} (x \otimes \psi)\, d\tau, \cdots, \int_{t_l}^{t_l+T} (x \otimes \psi)\, d\tau \,\big]^T$. (28)

Theorem 3:
Performing Algorithm 2 with the modified system (16) recovers the optimal control $u = -Kx$ for the actual system (1).

Algorithm 2: Attack-Resilient RL (ARRL) Control

1. Data gathering: Measure the states, controls, and coupling variables ($x(t)$, $u_0(t)$, and $\psi(t)$) over the intervals $(t_1, t_2, \cdots, t_l)$, $t_i - t_{i-1} = T$. Then construct $D = \{\mathcal{N}_{xx}, \mathcal{M}_{xx}, \mathcal{M}_{xu}, \mathcal{M}_{x\psi}\}$ such that rank$([\mathcal{M}_{xx}\ \mathcal{M}_{xu}\ \mathcal{M}_{x\psi}]) = n(n+1)/2 + 2nm$.

2. Iteratively update the control: Starting with a stabilizing $K_0$, update the feedback control gain $K_k$ iteratively ($k = 0, 1, \cdots$) by solving the least-squares equation:
for $k = 0, 1, 2, \ldots$
A. Solve for $P_k$ and $K_{k+1}$:
$\underbrace{\big[\, \mathcal{N}_{xx} \quad -2\mathcal{M}_{xx}(I_n \otimes K_k^T R) - 2\mathcal{M}_{xu}(I_n \otimes R) \quad -2\mathcal{M}_{x\psi} \,\big]}_{\Theta_k} \begin{bmatrix} \mathrm{vec}(P_k) \\ \mathrm{vec}(K_{k+1}) \\ \mathrm{vec}(B^T P_k) \end{bmatrix} = \underbrace{-\mathcal{M}_{xx}\, \mathrm{vec}(\bar{Q}_k)}_{\Phi_k}$. (29)
B. Break the iterations when $\|P_k - P_{k-1}\| < \varsigma$, where $\varsigma$ is a small positive threshold.
end for

3. Applying $K$ on the system: Finally, apply $u = -Kx$, and remove $u_0$.

Proof: Performing Algorithm 2 (ARRL) using $x(t)$, $u_0(t)$, and $\psi(t)$ is equivalent to solving the trajectory relationship (27). As (27) has been constructed using Theorem 2, any solution from Theorem 2 will satisfy the $k$-th iteration of the following equation:
$\Theta_k \begin{bmatrix} \mathrm{vec}(P_k) \\ \mathrm{vec}(K_{k+1}) \\ \mathrm{vec}(B^T P_k) \end{bmatrix} = \Phi_k$. (30)
With sufficient data, the rank condition rank$([\mathcal{M}_{xx}\ \mathcal{M}_{xu}\ \mathcal{M}_{x\psi}]) = n(n+1)/2 + 2nm$ is satisfied, and therefore $\Theta_k$ has full column rank. As such, equation (30) has a unique solution $P_k$, $K_{k+1}$. Since this solution is unique, it is also the solution $P_k$, $K_{k+1}$ of Theorem 2. Considering this equivalence of Algorithm 2 with the modified Kleinman update in Theorem 2, we conclude that the $K$ and $P$ corresponding to Algorithm 1 can be recovered. □

The attacker then tries to launch a covert attack once the closed-loop system is in its operating condition. However, the measured state seen by the designer is now $\bar{x}(t) = x(t) - \tilde{x}(t)$:
$\bar{x}(t) = e^{A(t-T_a)} x(T_a) - e^{\tilde{A}(t-T_a)} \tilde{x}(T_a) + \int_{T_a}^{t} \big( e^{A(t-\tau)} [B(u + \zeta)] - e^{\tilde{A}(t-\tau)} [\tilde{B}\zeta] \big)\, d\tau$, (31)
which is not a legitimate system response, even if $\tilde{B} = B$, and therefore creates undesired perturbations during the normal operational mode of the plant, depending on the energy of the attack inputs and the error $\|A - \tilde{A}\|$. Therefore, a pre-tuned set-point based detector can alert the system operator to take further necessary actions. The designer can also develop sophisticated detectors using the nominal statistical properties of the system, which we leave as future research.

VI. NUMERICAL SIMULATIONS
A. A Multi-agent System
We consider a multi-agent network with six agents taken from [33], where each agent follows the consensus dynamics:
$\dot{x}_i = \sum_{j \in \mathcal{N}_i, j \neq i} \alpha_{ij}(x_j - x_i) + u_i, \quad x_i(0) = x_{i0}$, (32)
where $\alpha_{ij} > 0$ are the coupling coefficients. The state matrix $A$ in (33) follows a Laplacian structure, and the input matrix is $B = I$. With $n = 6$, the Laplacian structure results in one zero eigenvalue, with the remaining eigenvalues being stable. We choose nonzero initial conditions. The learned controller is tasked to improve the damping of the slow eigenvalues. We set $Q = 10I$, $R = I$. We first experiment with the nominal system without any retrofitting in the learning design. As $n = 6$, $m = 6$, the rank condition for Algorithm 1 requires $2 \times 57 = 114$ samples. The exploration has been performed with these samples at a small time step, exciting the system with a sum-of-sinusoids exploration input. We assume that this phase is not secured against attackers, and therefore the attacker can eavesdrop and easily estimate the state matrices. Once the data matrix $D$ is constructed, the control gains are computed via the iterations given in Algorithm 1, and then implemented on the system. Fig. 2 shows the actual state trajectories during the initial 2 s exploration and with the control implementation. Fig. 3 shows the convergence of the learning control using Algorithm 1, resulting in a gain $K_{\mathrm{Alg.1}}$ (34) which matches closely with the model-based solution. At 5 s, the attacker starts injecting malicious signals into the system; as the learning control was not secured, the attacker can launch a covert attack, and therefore the state measurements fail to capture any of these malicious signals, as shown in Fig. 4. However, the actual state trajectories are heavily impacted, as shown in Fig. 2.
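The covert-attack mechanism of Lemma 1, and how a camouflage-induced model error breaks it, can be sketched on a small consensus network as follows (the 3-agent Laplacian, the attack signal, and the model perturbation are our illustrative choices; the 6-agent coupling coefficients of the experiment are not reproduced here):

```python
# Sketch of the covert attack of Lemma 1 on a small consensus network.
# The 3-agent Laplacian A, the attack zeta, and the model error are our
# illustrative assumptions, not the experiment's 6-agent values.
import numpy as np

def covert_run(A, B, A_tilde, zeta_of_t, dt, N, x0):
    """Plant receives u + zeta (u = 0 at steady state); the attacker runs
    its internal model (12) from x_tilde(T_a) = 0 and subtracts it from
    the sensors as in (11). Returns actual x and measured x_bar."""
    n = len(x0)
    x = np.asarray(x0, float); xt = np.zeros(n)
    xs = np.zeros((N, n)); xbars = np.zeros((N, n))
    for k in range(N):
        z = zeta_of_t(k * dt)
        xs[k], xbars[k] = x, x - xt
        x = x + dt * (A @ x + B @ z)            # plant under attack
        xt = xt + dt * (A_tilde @ xt + B @ z)   # attacker's compensation model
    return xs, xbars
```

With a perfect model ($\tilde{A} = A$), the measured $\bar{x}$ stays at the pre-attack equilibrium while the true states drift; with a camouflage-induced error in $\tilde{A}$, the compensation no longer cancels and the injection shows up at the sensors.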
This kind of covert attack can cause expensive state excursions and make the system less efficient; the quadratic cost incurred from 5 s to 10 s is substantial.

Thereafter, we show the efficacy of the retrofitted secure learning design. During the exploration phase of the learning, we add a functional coupling to the control inputs in the form $u = u_{\mathrm{control}} - f(t)x(t)$, with $f(t)$ chosen as a scaled $(\sin(t) + \cos(t) + 0.02)$ signal. As the attacker is not aware of this modification, the system identification performed by the attacker using the input and state measurements during the exploration is erroneous. The modified Algorithm 2, however, is tasked with computing the optimal control $u = -Kx$ associated with the actual system dynamics $A, B$, and not that of the modified state dynamics. The control computed with Algorithm 2, with the convergence shown in Fig. 6, yields a gain $K_{\mathrm{Alg.2}}$ (35) which shows that the modified algorithm does not suffer any loss of accuracy in computing the desired optimal control. As the attacker is not able to capture the accurate state dynamics, the attack does not remain covert anymore. To simulate a worst-case identification scenario, we assume that the model identified by the attacker uses $\tilde{A} = A - \epsilon_{sc}\, f(0)\, I$, with $f$ frozen at $t = 0$ and the scaling factor $\epsilon_{sc} = 2$. Thereafter, when the attacker injects the malicious attack signal at $t = 5$ s, the state trajectories at the measurement ports capture this behavior, as shown in Fig. 7. If the state trajectories hit a pre-calibrated set point, the system operator is alerted to remove the malicious enterprise.

B. A Power Grid Benchmark
B. A Power Grid Benchmark

We consider a 13-bus Kundur power system model as shown in Fig. 8. To numerically simulate the model and generate data, we model the generator dynamics by the flux-decay model. The state variables are δ(t), ω(t), E(t), and E_fd(t), which are the generator internal angle in radians, the speed deviation from nominal ( π × radians/sec), the internal voltage, and the excitation voltage in per unit, respectively. The supplementary control signal is added to the voltage reference of the automatic voltage regulator (AVR) in the excitation dynamics. With a MVA base, the rated active power generation capacities of the generators G –G are p.u., p.u., . p.u., and p.u., respectively. The grid contains one inter-area mode ( . Hz) and two intra-area modes ( . and . Hz). The control design objective is to improve the inter-area and oscillatory dynamic performance of the grid. We have designed Q such that it penalizes the relative differences between the generator angles (inter-area power flows depend on angular differences) and the energy associated with the other generator states.

Fig. 2: Actual states during exploration (till 2 s), control implementation, and attack at 5 s for the nominal design.
Fig. 3: Convergence of controller K in Algorithm 1 for the nominal design.
Fig. 4: Measured state trajectories during exploration, control implementation, and attack at 15 s without the resilient design.
Fig. 5: Actual states during exploration (till 2 s), control implementation, and attack at 5 s for the resilient design.
Fig. 6: Convergence of the retrofitted controller following ARRL in Algorithm 2.
Fig. 7: Measured state trajectories during exploration, control implementation, and attack at 5 s with ARRL.
Fig. 8: Kundur 13-bus multi-machine power system model.

In this model, we have n = 16 and m = 4, and therefore we explore for s with . s time steps. The attack starts at s, and we simulate till s. Fig. 9 shows the actual frequency excursions during the learning and the attack, which exhibit considerable perturbations caused by the adversary; however, in the measured frequencies, as in Fig. 10, the impact of the attack is not visible, making the attack covert. The fast convergence of the RL control update iterations is shown in Fig. 11. The convergence is reached with high accuracy with respect to the model-based solution, as shown in Fig. 12. The dynamic performance improvement from the nominal controller is shown in Fig. 13, where the angular difference between generators and characterizes the oscillations in the tie-lines connecting buses and . Subsequently, we test the attack-resilient design with the malicious signal injection starting at s. The trajectory measurements are gathered for s, and thereafter the control iterations of Algorithm 2 (ARRL) are performed, which results in the fast convergence shown in Fig. 15. Most of the entries of K match closely with the model-based solutions, and only a few of the entries are − off. However, when we test the dynamic performance of the learned ARRL control, we can see from Fig. 17 that the optimal performance of the nominal RL design has been recovered. Moreover, in this scenario, the measured frequencies now capture the impact of the malicious attack starting from s, as in Fig. 14; therefore, the attacker cannot remain covert anymore.
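For reference, the convergence metric ||K_k − K_{k−1}|| plotted in the figures tracks a policy-iteration loop of the Kleinman type [30]: given a stabilizing gain K_k, solve a Lyapunov equation for P_k (policy evaluation), then update K_{k+1} = R^{-1}B^T P_k (policy improvement). The data-driven Algorithms 1 and 2 perform these steps from trajectory data without knowing A and B; the sketch below is only the model-based version on a hypothetical 2-state system, with all numerical values assumed for illustration.

```python
import numpy as np

def lyap(Ac, Qbar):
    """Solve Ac^T P + P Ac + Qbar = 0 via Kronecker vectorization."""
    n = Ac.shape[0]
    M = np.kron(np.eye(n), Ac.T) + np.kron(Ac.T, np.eye(n))
    P = np.linalg.solve(M, -Qbar.flatten(order="F")).reshape((n, n), order="F")
    return 0.5 * (P + P.T)  # symmetrize against round-off

# Hypothetical open-loop-stable 2-state system, so K0 = 0 is a stabilizing initial gain
A = np.array([[0.0, 1.0], [-1.0, -1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])

K = np.zeros((1, 2))
for _ in range(12):
    Ac = A - B @ K                       # closed-loop matrix for the current gain
    P = lyap(Ac, Q + K.T @ R @ K)        # policy evaluation (Lyapunov equation)
    K_new = np.linalg.solve(R, B.T @ P)  # policy improvement
    gap = np.linalg.norm(K_new - K)      # the ||K_k - K_{k-1}|| convergence metric
    K = K_new

# At convergence, P satisfies the continuous-time algebraic Riccati equation
res = A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T @ P) + Q
print(f"final gain gap {gap:.2e}, Riccati residual {np.linalg.norm(res):.2e}")
```

The iteration converges quadratically, which is consistent with the small number of update steps visible in the convergence plots.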
VII. CONCLUSIONS

This paper discussed a secure learning control methodology that is resilient to adversarial actions from malicious agents. The attacker eavesdrops on the learning process and estimates the dynamic information of the CPS to subsequently conduct a covert attack. For such scenarios, we have presented a dynamic camouflaging technique that misguides the attacker during learning without compromising the accuracy of the learned optimal control. We have shown that coupling the dynamic system with nonlinear static time-varying functions can provide such dynamic camouflaging with adequate guarantees. Numerical simulations conducted on a consensus multi-agent system and on a power system model validate the algorithmic and theoretical considerations.

REFERENCES

[1] F. Pasqualetti, F. Dorfler, and F. Bullo, “Control-theoretic methods for cyber-physical security: Geometric principles for optimal cross-layer resilient control systems,”
IEEE Control Systems Magazine, vol. 35, no. 1, pp. 110–127, 2015.
[2] S. M. Dibaji, M. Pirani, D. B. Flamholz, A. M. Annaswamy, K. H. Johansson, and A. Chakrabortty, “A systems and control perspective of CPS security,” Annual Reviews in Control, vol. 47, pp. 394–411, 2019.
[3] H. Li, Z. Chen, L. Wu, H.-K. Lam, and H. Du, “Event-triggered fault detection of nonlinear networked systems,” IEEE Transactions on Cybernetics, vol. 47, no. 4, pp. 1041–1052, 2016.
[4] E. Mousavinejad, F. Yang, Q.-L. Han, and L. Vlacic, “A novel cyber attack detection method in networked control systems,” IEEE Transactions on Cybernetics, vol. 48, no. 11, pp. 3254–3264, 2018.
[5] K. G. Vamvoudakis, J. P. Hespanha, B. Sinopoli, and Y. Mo, “Detection in adversarial environments,” IEEE Transactions on Automatic Control, vol. 59, no. 12, pp. 3209–3223, 2014.
[6] M. S. Chong, H. Sandberg, and A. M. Teixeira, “A tutorial introduction to security and privacy for cyber-physical systems,” in . IEEE, 2019, pp. 968–978.
[7] Y. Yuan, Q. Zhu, F. Sun, Q. Wang, and T. Başar, “Resilient control of cyber-physical systems against denial-of-service attacks,” in . IEEE, 2013, pp. 54–59.
[8] C.-Z. Bai, F. Pasqualetti, and V. Gupta, “Data-injection attacks in stochastic control systems: Detectability and performance tradeoffs,” Automatica, vol. 82, pp. 251–260, 2017.
[9] M. Zhu and S. Martinez, “On the performance analysis of resilient networked control systems under replay attacks,” IEEE Transactions on Automatic Control, vol. 59, no. 3, pp. 804–808, 2013.
Fig. 9: Actual frequencies during probing (till 5 s), control implementation, and attack at 15 s without ARRL.
Fig. 10: Measured frequencies during probing (till 5 s), control implementation, and attack at 15 s without ARRL.
Fig. 11: Convergence of K using nominal RL (Algorithm 1).
Fig. 12: Error % of the RL-based solution using Alg. 1 with respect to the model-based solution.
Fig. 13: Improving the inter-area dynamic performance with the nominal RL control.
Fig. 14: Measured frequencies during probing (till 7 s), control implementation, and attack at 15 s with ARRL.
Fig. 15: Convergence of the retrofitted controller following ARRL in Algorithm 2.
Fig. 16: Error % of the RL-based solution using ARRL (Alg. 2) with respect to the model-based solution.
Fig. 17: Improving the inter-area dynamic performance with the ARRL control.

[10] A. O. de Sa, L. F. R. da Costa Carmo, and R. C. Machado, “Covert attacks in cyber-physical control systems,”
IEEE Transactions on Industrial Informatics, vol. 13, no. 4, pp. 1641–1651, 2017.
[11] A. Barboni, H. Rezaee, F. Boem, and T. Parisini, “Detection of covert cyber-attacks in interconnected systems: A distributed model-based approach,” IEEE Transactions on Automatic Control, 2020.
[12] Y. Mao, S. Diggavi, C. Fragouli, and P. Tabuada, “Secure state-reconstruction over networks subject to attacks,” IEEE Control Systems Letters, vol. 5, no. 1, pp. 157–162, 2021.
[13] B. Satchidanandan and P. R. Kumar, “Dynamic watermarking: Active defense of networked cyber-physical systems,” Proceedings of the IEEE, vol. 105, no. 2, pp. 219–240, 2016.
[14] Y. Mo, S. Weerakkody, and B. Sinopoli, “Physical authentication of control systems: Designing watermarked control inputs to detect counterfeit sensor outputs,” IEEE Control Systems Magazine, vol. 35, no. 1, pp. 93–109, 2015.
[15] A. Kanellopoulos and K. G. Vamvoudakis, “A moving target defense control framework for cyber-physical systems,” IEEE Transactions on Automatic Control, vol. 65, no. 3, pp. 1029–1043, 2019.
[16] C. Schellenberger and P. Zhang, “Detection of covert attacks on cyber-physical systems by extending the system dynamics with an auxiliary system,” in . IEEE, 2017, pp. 1374–1379.
[17] P. Griffioen, S. Weerakkody, and B. Sinopoli, “A moving target defense for securing cyber-physical systems,” IEEE Transactions on Automatic Control, 2020.
[18] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
[19] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, pp. 477–484, 2009.
[20] Y. Jiang and Z.-P. Jiang, “Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics,” Automatica, vol. 48, pp. 2699–2704, 2012.
[21] K. Vamvoudakis, “Q-learning for continuous-time linear systems: A model-free infinite horizon optimal control approach,” Systems & Control Letters, vol. 100, pp. 14–20, 2017.
[22] B. Kiumarsi, K. Vamvoudakis, H. Modares, and F. Lewis, “Optimal and autonomous control using reinforcement learning: A survey,” IEEE Trans. on Neural Networks and Learning Systems, 2018.
[23] S. Mukherjee, H. Bai, and A. Chakrabortty, “Reduced-dimensional reinforcement learning control using singular perturbation approximations,”
Automatica.
[24] IEEE Transactions on Automatic Control, vol. 65, no. 3, pp. 909–924, 2019.
[25] S. Mukherjee and T. L. Vu, “Reinforcement learning of structured control for linear systems with unknown state matrix,” arXiv preprint arXiv:2011.01128, 2020.
[26] S. Fattahi, N. Matni, and S. Sojoudi, “Efficient learning of distributed linear-quadratic control policies,” SIAM Journal on Control and Optimization, vol. 58, no. 5, pp. 2927–2951, 2020.
[27] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “On the sample complexity of the linear quadratic regulator,” Foundations of Computational Mathematics, pp. 1–47, 2019.
[28] Y. Zhou, K. G. Vamvoudakis, W. M. Haddad, and Z. P. Jiang, “A secure control learning framework for cyber-physical systems under sensor and actuator attacks,” IEEE Transactions on Cybernetics, pp. 1–13, 2020.
[29] A. Rangi, M. J. Khojasteh, and M. Franceschetti, “Learning-based attacks in cyber-physical systems: Exploration, detection, and control cost trade-offs,” arXiv preprint arXiv:2011.10718, 2020.
[30] D. Kleinman, “On an iterative technique for Riccati equation computations,” IEEE Trans. on Automatic Control, vol. 13, no. 1, pp. 114–115, 1968.
[31] Y. Jiang and Z.-P. Jiang, Robust Adaptive Dynamic Programming. Wiley-IEEE Press, 2017.
[32] S. Mukherjee, H. Bai, and A. Chakrabortty, “On robust model-free reduced-dimensional reinforcement learning control for singularly perturbed systems,” in . IEEE, 2020, pp. 3914–3919.
[33] S. Mukherjee and T. L. Vu, “Imposing robust structured control constraint on reinforcement learning of linear quadratic regulator,” arXiv preprint arXiv:2011.07011.