Safety-Critical Online Control with Adversarial Disturbances
Bhaskar Ramasubramanian, Baicen Xiao, Linda Bushnell, Radha Poovendran
Abstract — This paper studies the control of safety-critical dynamical systems in the presence of adversarial disturbances. We seek to synthesize state-feedback controllers to minimize a cost incurred due to the disturbance while respecting a safety constraint. The safety constraint is given by a bound on an H_∞ norm, while the cost is specified as an upper bound on an H_2 norm of the system. We consider an online setting where costs at each time are revealed only after the controller at that time is chosen. We propose an iterative approach to the synthesis of the controller by solving a modified discrete-time Riccati equation. Solutions of this equation enforce the safety constraint. We compare the cost of this controller with that of the optimal controller when one has complete knowledge of disturbances and costs in hindsight. We show that the regret function, defined as the difference between these costs, varies logarithmically with the time horizon. We validate our approach on a process control setup that is subject to two kinds of adversarial attacks.

I. INTRODUCTION
The recent advances and successes of reinforcement learning (RL) [1] in robotics, games, and mobile networks [2]–[6] have spurred its use in other areas where RL algorithms interact with the physical environment over long periods of time [7]–[9]. An increasingly popular domain where RL methods are being deployed is safety-critical systems like large-scale power systems [10], which are susceptible to attacks by an intelligent adversary [11], [12]. Since these systems have an underlying dynamic model, actions of the system are typically a function of (a history of) the system states. Rules governing these actions can be designed so that the overall system behaves in a desired way. At the same time, the actions may have to be chosen to minimize a cost.

We consider a safety-critical linear time invariant (LTI) system affected by adversarial inputs. Our goal is to design state-feedback controllers to minimize the cost incurred due to this input while satisfying a safety constraint. This is also called the 'combined H_2/H_∞ problem' [13]–[15]. The H_∞ safety constraint enforces a bound on the ratio of the magnitude of the output to that of the adversarial disturbance. The H_2 cost is the expected mean square output value when the disturbance input is a white noise process. The H_∞ constraint is embedded in the optimization process by solving a modified discrete-time Riccati equation, whose solution yields an upper bound on the H_2 cost.

In this paper, we study an online scenario of the combined H_2/H_∞ problem. At each time, the adversary inserts a disturbance, after which the cost incurred is revealed to the system.

The authors are with the Network Security Lab, Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195, USA. {bhaskarr, bcxiao, lb2, rp3}@uw.edu
This work was supported by the U.S. Army Research Office and the Office of Naval Research via Grants W911NF-16-1-0485 and N00014-17-S-B001 respectively.
Our aim is to iteratively design controllers to minimize this cost, while satisfying the safety constraint. We compare the cost of this controller with that of the optimal controller if all adversarial inputs and costs were known a priori. The difference between these costs is termed the regret faced by the system.

Regret bounds for partially and fully adversarial disturbances for LTI dynamics and convex costs were presented in [16]–[20], where the authors considered a richer class of 'disturbance action policies'. These policies depend not only on the current state, but also on a history of disturbances, and provide stronger regret bounds (poly-logarithmic) than we present (logarithmic), since policies in this paper depend only on the current state. Also, while the analyses in [16]–[20] fix a stabilizing controller at the start of their algorithms, we adopt a different approach and iteratively solve a set of Riccati equations to update the controller at each time step.
A. Contributions
We aim to minimize an upper bound on the H_2 cost for an LTI system with adversarial inputs, with an H_∞ constraint on the disturbance-output map, in an online setting. At each time, the adversary inserts a disturbance input. The cost functions are revealed to the system only after it has determined a controller. We make the following contributions:
• We introduce strongly stable disturbance attenuating (S2DA) policies. This generalizes strongly stable policies from [21]. S2DA policies are strongly stable and satisfy an additional condition on the H_∞ norm.
• We show that initializing our procedure with a stabilizing disturbance attenuating policy will yield stabilizing disturbance attenuating policies at successive time steps. If solutions of the Riccati recursion are bounded, we show that these policies will also be strongly stable.
• We establish bounds on the difference between solutions to the Riccati equation at successive time steps. We use the above results to show that the regret bound is O(log T), where T is the time horizon of interest.
• We validate our method on a model of the Tennessee Eastman control challenge [22] that is subject to arbitrary adversarial inputs, and a denial-of-service attack.
B. Outline of Paper
The rest of this paper is organized as follows: Section II summarizes related work. We state our problem and detail our solution in Sections III and IV. Section V illustrates our approach on a model of the Tennessee Eastman control challenge. Section VI concludes the paper.

II. RELATED WORK
Simultaneous H_2/H_∞ policy synthesis for discrete-time linear systems is a well-studied problem. The structure of an upper bound for the LQG cost, the minimization of which would solve the mixed problem, was first proposed in [13]. The authors of that paper also presented a closed-form controller that would minimize this upper bound. Two other upper bounds for the LQG cost were proposed in [14], and it was shown that the same controller would be optimal in each case when restricted to static, full-state feedback. In this case, a static, time-invariant state feedback is sufficient for optimal performance [15]. In discrete time, this does not hold for the full-information feedback (states and disturbances available) or partial information cases. This problem was studied for nonlinear discrete-time systems in [23]. An orthogonal approach to solving disturbance attenuation and rejection problems using geometric control theory was presented in [24]–[29]. However, these works considered the H_2 and H_∞ cases separately. These problems have also been studied in robust and model predictive control [30]–[32].

There has been renewed interest in the use of RL techniques in learning to control linear dynamical systems. Recent developments in this field are surveyed in [33]. An online version of LQ control with Gaussian disturbances was presented in [21], where the authors presented a regret bound for known LTI dynamics and adversarial quadratic costs. In [16], the authors considered strongly convex costs, and presented stronger regret bounds for a larger class of policies they termed disturbance action policies. This was generalized to semi-adversarial disturbance inputs and convex costs in [17]. The authors of [34] adopted a different approach to determine regret bounds for the LQR. They iteratively solved a sequence of Riccati equations to generate a sequence of stabilizing controllers, and compared the cost of this sequence of controllers with that of the optimal static controller if all costs were known in hindsight.

Minimizing the regret over sequentially revealed adversarial convex costs against the class of linear policies when a model of the dynamics is unknown was studied in [18]. This was generalized to partially observed systems with semi-adversarial disturbances in [19], and to the Kalman filter in [35]. More recently, [20] presented regret bounds for the case of fully adversarial disturbances.

Sample complexity bounds for the LQR for an LTI system with unknown dynamics were given in [36], and for the Kalman filter in [37]. The convergence of policy-gradient methods for the LQR was studied in [38]–[40]. Convergence guarantees for policy gradient methods for the mixed H_2/H_∞ problem were reported in [41]. The authors of [42] studied a trade-off between exploration for learning and safety for the LQR under bounded disturbances and constraints on the state and input sets.

III. PRELIMINARIES AND PROBLEM FORMULATION
For a matrix M, we write M ≥ 0 when M is positive semi-definite, and M_1 ≥ M_2 when M_1 − M_2 ≥ 0. Tr(M) and λ_max(M) denote the trace and maximum eigenvalue of M. Consider a discrete-time linear system:

x_{t+1} = A x_t + B u_t + D w_t,   (1)
z_t = C x_t + E u_t,   (2)

where x_t, u_t, w_t, z_t denote the state, control, disturbance, and controlled output. In this case, the optimal stabilizing control will be a static state feedback u_t = −K x_t [15].

Let T_zw denote the input-output map from the disturbance to the measured output. We want the H_∞ norm of the closed-loop system, ||T_zw||_∞ := sup{ ||z|| / ||w|| : 0 < ||w|| < ∞ }, to remain below a desired threshold γ. When ||T_zw||_∞ < γ, the controller is deemed to have attenuated the disturbance; we will say the disturbance has been attenuated if ||T_zw||_∞ < γ holds even when γ > 1. In the time-invariant case, ||T_zw||_∞ can be computed as the maximum singular value of the transfer matrix from w to z, restricted to the boundary of the unit circle.

The H_2 norm of a linear system is the expected root mean square value of z_t when w_t is a white noise process [43]. In this case, the H_2 cost will be given by lim_{t→∞} E[z_t^T z_t]. Since w_t in this paper can be a more general adversarial input, we will choose to minimize an upper bound on the (squared) H_2 norm of the closed-loop system [14].

Assumption 1. Assume the following:
• (A, B) is stabilizable. This will ensure the existence of a controller K such that (A − BK) is stable.
• C^T C = Q ≥ 0, E^T C = 0, and E^T E = R > 0. This will ensure elimination of cross-weighting terms between state and control variables [14].

Since we are interested in stabilizing controllers that additionally attenuate the adversarial input, we define the valid set of controllers as:

K := { K : |λ_max(A − BK)| < 1, ||T^K_zw||_∞ < γ },   (3)

where T^K_zw is the input-output map from w to z under the controller K. An upper bound on the infinite-horizon H_2 cost that we seek to minimize in this paper is [13], [14]:

J(K) := Tr(P D D^T), where P solves   (4)
(A − BK)^T P̃ (A − BK) + Q + K^T R K − P = 0,   (5)
P̃ := P + P D (γ²I − D^T P D)^{-1} D^T P.   (6)

A typical objective to achieve mixed H_2/H_∞ goals can then be stated as: 'For the system in Equations (1)-(2), determine a sequence of controls {u_t}_{t>0} so that the cost in Equation (4) is minimized, subject to u_t = −K x_t, K ∈ K, where K is as in Equation (3).'

If P ≥ 0 is a solution to Equation (5), then (A − BK) is stable if and only if (A, (Q + K^T R K)^{1/2}) is detectable [13]. If P* ≥ 0 is a solution and P* ≤ P for all other solutions P, then P* is a minimal solution. The controller that minimizes the cost while achieving ||T^{K*}_zw||_∞ < γ is [13], [41]:

K* = (R + B^T P̃* B)^{-1} B^T P̃* A.   (7)

We focus on an online setting of Equations (1)-(2). At each time t, the adversary chooses w_t. The learner chooses u_t = −K_t x_t, and suffers a loss determined as a function of the matrices C_t and E_t. We assume that the sequence of matrices {C_t, E_t} is determined before the start of the learning process.
However, they are revealed to the learner only after it chooses u_t. Therefore, the learner faces a regret, defined as the difference between the cost when using the aforementioned controller and the cost of the optimal controller from the set K. We aim to minimize this regret, and ensure that it grows sub-linearly with the time horizon T. Formally:

Problem 1.
At each time t, the learner observes state x_t, and commits to a controller u_t = −K_t x_t. After this, cost matrices Q_t := C_t^T C_t, R_t := E_t^T E_t such that C_t^T E_t = 0 are revealed to the learner. The cost incurred by the learner is upper-bounded by J_t(K_t) = Tr(P_t D D^T), P_t being the solution of Equation (5). With J_0(·) := 0, determine a sequence of policies {K_t} such that for some large enough time T, a regret term, defined as

R(T) := J_T(K_T) − min_{K ∈ K} J_T(K),

grows sub-linearly with T.

IV. SOLUTION METHOD
We briefly summarize our solution approach. First, we introduce strongly stable disturbance attenuating policies, motivated by the strongly stable policies introduced in [21] to quantify the stability of a stabilizing policy. Then, we show that if we initialize our procedure with a stable disturbance attenuating policy, successive iterates will continue to yield policies that are stable and disturbance attenuating. When solutions of the Riccati recursion are uniformly bounded, we show that the sequence of stabilizing and disturbance attenuating policies is also strongly stable for an appropriate choice of parameters. In order to establish our regret bounds, we determine upper bounds on the difference between solutions to the Riccati recursion at successive time steps. We put these together to get the final regret bound. The regret bound comprises a burn-in cost, a cost that is incurred before we start to obtain meaningful bounds, while a second term gives the bound for a large enough time horizon T.

A. Strong Stability
We leverage the notion of a strongly stable controller first proposed in [21] for the LQR problem. This was subsequently used in [17] for the more general case.
Definition 1 (Strongly Stable Policies [21]). A policy K is stable if |λ_max(A − BK)| < 1. It is (κ, ε)-strongly stable for κ > 0, ε ∈ (0, 1) if ||K|| ≤ κ, and there exist matrices L, H such that A − BK = H L H^{-1}, with ||L|| ≤ 1 − ε and ||H|| ||H^{-1}|| ≤ κ.

Sequentially strongly stable controllers were used in [21], [34] to reason about a sequence of strongly stable policies.
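When A − BK is diagonalizable, Definition 1 can be checked numerically by taking H to be the eigenvector matrix and L the (diagonal) eigenvalue matrix. A minimal sketch in Python with numpy; the matrices below are illustrative and not taken from the paper:

```python
import numpy as np

def strong_stability_params(A, B, K):
    """Certify (kappa, eps)-strong stability of K via A - BK = H L H^{-1},
    assuming A - BK is diagonalizable. Returns None if A - BK is unstable."""
    M = A - B @ K
    lam, H = np.linalg.eig(M)          # M = H diag(lam) H^{-1}
    rho = np.max(np.abs(lam))          # ||L|| for diagonal L
    if rho >= 1.0:
        return None
    eps = 1.0 - rho
    cond_H = np.linalg.norm(H, 2) * np.linalg.norm(np.linalg.inv(H), 2)
    kappa = max(np.linalg.norm(K, 2), cond_H)
    return kappa, eps

A = np.array([[1.1, 0.3], [0.0, 0.9]])
B = np.eye(2)
K = np.array([[0.6, 0.3], [0.0, 0.4]])  # A - BK = diag(0.5, 0.5), stable
kappa, eps = strong_stability_params(A, B, K)
```

Here K = 0 would fail the check, since the open-loop A above has an eigenvalue outside the unit circle.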
Definition 2 (Sequentially Strongly Stable Policies [21]). A sequence of policies {K_t}_{t≥1} is sequentially (κ, ε)-strongly stable for κ > 0, ε ∈ (0, 1) if there exist sequences of matrices {L_t}_{t≥1}, {H_t}_{t≥1} such that for all t ≥ 1, A − BK_t = H_t L_t H_t^{-1}, and: ||L_t|| ≤ 1 − ε, ||K_t|| ≤ κ, ||H_t|| ≤ β, ||H_t^{-1}|| ≤ 1/α, where κ = β/α, β ≥ α > 0, and ||H_{t+1}^{-1} H_t|| ≤ 1 + ε.

In the above, observe that |λ_max(A − BK_t)| = |λ_max(L_t)| ≤ ||L_t|| ≤ 1 − ε. Since we are interested in stable policies that will also achieve disturbance attenuation, we introduce the notions of strongly stable and sequentially strongly stable disturbance attenuating policies.

Definition 3 (Strongly Stable Disturbance Attenuating (S2DA) Policies). A policy K is (κ, ε, γ)-S2DA if it is (κ, ε)-strongly stable and ||T^K_zw||_∞ < γ.

Definition 4 (Sequentially Strongly Stable Disturbance Attenuating (S3DA) Policies). A sequence of policies {K_t}_{t≥1} is sequentially (κ, ε, γ)-S3DA if {K_t}_{t≥1} is sequentially (κ, ε)-strongly stable and ||T^{K_t}_zw||_∞ < γ for all t ≥ 1.

B. Set K is Invariant

In this part, we will show that if K_1 ∈ K, then K_t ∈ K for all t > 1. That is, if we start with a stabilizing and disturbance attenuating controller, then successive updates of the controller will retain this property. The sequence of controllers is then said to be regularized [41]. We adapt the Riccati recursion update procedure in [44] to the setting of disturbance attenuation. We use representations of solutions to Lyapunov and Riccati equations to establish stability and disturbance attenuation for the updated controllers. Further, since matrices C_t and E_t are fixed at time t, we can use results specific to the time-invariant case. Before proving our result (Theorem 1), we state a useful result from robust control [30] that transforms the constraints in Equation (3) to the solution of a Riccati inequality.
Lemma 1 ([30]). For a discrete-time linear time-invariant system, the following conditions are equivalent:
1) The controller gain K ∈ K.
2) There exists P > 0 such that: (i) γ²I − D^T P D > 0, and (ii) Q + K^T R K − P + (A − BK)^T (P + P D (γ²I − D^T P D)^{-1} D^T P)(A − BK) < 0.
3) The Riccati equation (5) admits a unique stabilizing solution P ≥ 0 such that: (i) γ²I − D^T P D > 0, and (ii) (I − γ^{-2} D D^T P)^{-1} (A − BK) is stable.

In the sequel, for t ≥ 1, define:

R̄_t := ((t − 1)/t) R̄_{t−1} + (1/t) R_t,   (8)
Q̄_t := ((t − 1)/t) Q̄_{t−1} + (1/t) Q_t.   (9)

We perform the update in this manner in order to obtain useful bounds on differences between successive updates as a function of the time index t.

Theorem 1.
Let Assumption 1 hold, K_1 ∈ K, and suppose there is a solution P_1 ≥ 0 to Equation (5). Suppose at time t,

(A − BK_t)^T P̃_t (A − BK_t) + Q̄_t + K_t^T R̄_t K_t = P_t,
P̃_t := P_t + P_t D (γ²I − D^T P_t D)^{-1} D^T P_t,   (10)

and K_t is updated as:

K_{t+1} = (R̄_t + B^T P̃_t B)^{-1} B^T P̃_t A.   (11)

Then K_t ∈ K for all t > 1.

Proof. We will begin by showing that K_t ∈ K ensures that the solution to Equation (10) is bounded. To do this, we use the fact that for a stabilizing K_t, the solution to an associated Lyapunov equation is bounded. Then, we will show that K_{t+1} is stabilizing, and finally that K_{t+1} is also disturbance attenuating. We use induction.

A. Base Case: Since (A, B) is stabilizable, there exists a stabilizing controller K_1. Since there exists a solution P_1 to Equation (5), K_1 is also disturbance attenuating (from Lemma 2.1 of [13]), which establishes the base case of our induction.

B. P_t is Bounded: Let K_t ∈ K for some t > 1. Since K_t is stabilizing, there is a unique solution P̄_t ≥ 0 to the Lyapunov equation (12), given by [45]:

(A − BK_t)^T P̄_t (A − BK_t) − P̄_t = −(Q̄_t + K_t^T R̄_t K_t),   (12)
P̄_t = Σ_{i≥0} ((A − BK_t)^T)^i (Q̄_t + K_t^T R̄_t K_t)(A − BK_t)^i.

Now consider P_t given by Equation (10). Subtracting Equation (12) from Equation (10), we get:

P_t − P̄_t = (A − BK_t)^T (P_t − P̄_t)(A − BK_t) + (A − BK_t)^T (P_t D (γ²I − D^T P_t D)^{-1} D^T P_t)(A − BK_t).   (13)

Since K_t ∈ K, from Lemma 1, (γ²I − D^T P_t D) > 0. Therefore, the second term of Equation (13) is positive definite, which means that (13) is a Lyapunov equation for P_t − P̄_t. The (unique) solution to this equation is given by:

P_t − P̄_t = Σ_{i≥0} ((A − BK_t)^T)^i ((A − BK_t)^T (P_t D (γ²I − D^T P_t D)^{-1} D^T P_t)(A − BK_t))(A − BK_t)^i.   (14)

Since (A − BK_t) is stable, both P̄_t and P_t − P̄_t are bounded. Therefore, P_t = P̄_t + (P_t − P̄_t) is bounded.

C. K_{t+1} is Stabilizing: Expanding (A − BK_t)^T P̃_t (A − BK_t) + K_t^T R̄_t K_t + Q̄_t and using Equation (11) to write B^T P̃_t A = (R̄_t + B^T P̃_t B) K_{t+1}:

A^T P̃_t A − K_t^T B^T P̃_t A − A^T P̃_t B K_t + K_t^T (B^T P̃_t B + R̄_t) K_t + Q̄_t
= A^T P̃_t A + (K_{t+1} − K_t)^T (R̄_t + B^T P̃_t B)(K_{t+1} − K_t) − K_{t+1}^T B^T P̃_t A − A^T P̃_t B K_{t+1} + K_{t+1}^T (R̄_t + B^T P̃_t B) K_{t+1} + Q̄_t
= (A − BK_{t+1})^T P̃_t (A − BK_{t+1}) + K_{t+1}^T R̄_t K_{t+1} + Q̄_t + (K_{t+1} − K_t)^T (R̄_t + B^T P̃_t B)(K_{t+1} − K_t)
⇒ P_t = (A − BK_{t+1})^T P_t (A − BK_{t+1}) + M.   (15)

In Equation (15), M is a positive definite matrix defined as:

M := K_{t+1}^T R̄_t K_{t+1} + Q̄_t + (A − BK_{t+1})^T (P_t D (γ²I − D^T P_t D)^{-1} D^T P_t)(A − BK_{t+1}) + (K_{t+1} − K_t)^T (R̄_t + B^T P̃_t B)(K_{t+1} − K_t).

The terms in the first and third lines in the above equation are positive definite by assumption, and (γ²I − D^T P_t D) > 0 from Lemma 1. Since P_t is bounded, and we can write P_t = Σ_{i≥0} ((A − BK_{t+1})^T)^i M (A − BK_{t+1})^i, (A − BK_{t+1}) must be stable so that the sum on the right hand side does not diverge. Therefore, K_{t+1} is stabilizing.

D. K_{t+1} is Disturbance Attenuating: Since (A − BK_{t+1}) is stable, there exists P ≥ 0 that solves (A − BK_{t+1})^T P (A − BK_{t+1}) − P = −V, where V > 0. Choose V to be:

V := K_{t+1}^T R̄_t K_{t+1} + Q̄_t + ρI + (A − BK_{t+1})^T (P D (γ²I − D^T P D)^{-1} D^T P)(A − BK_{t+1}),

where ρ > 0 is chosen so that V is positive definite, and

ρI + (A − BK_{t+1})^T (P D (γ²I − D^T P D)^{-1} D^T P)(A − BK_{t+1})
≤ (A − BK_{t+1})^T (P_t D (γ²I − D^T P_t D)^{-1} D^T P_t)(A − BK_{t+1}) + (K_{t+1} − K_t)^T (R̄_t + B^T P̃_t B)(K_{t+1} − K_t).   (16)

Rearranging these equations gives us (A − BK_{t+1})^T P̃ (A − BK_{t+1}) − P + K_{t+1}^T R̄_t K_{t+1} + Q̄_t = −ρI < 0, where P̃ is according to Equation (6). This satisfies the second part of the second condition in Lemma 1.

When K_t ∈ K, γ²I − D^T P_t D > 0 from Lemma 1. We can write γ²I − D^T P D = γ²I − D^T P_t D + D^T (P_t − P) D. Now, P_t − P = (A − BK_{t+1})^T (P_t − P)(A − BK_{t+1}) + N, where N is obtained by subtracting the term on the left of the inequality in (16) from the term on the right. This is a Lyapunov equation in P_t − P. Since K_{t+1} is stabilizing and N > 0, there is a positive semi-definite solution, which gives us P_t − P ≥ 0. Therefore, 0 < γ²I − D^T P_t D ≤ γ²I − D^T P D, which satisfies the first part of the second condition in Lemma 1. Then, from Lemma 1, K_{t+1} is also such that ||T^{K_{t+1}}_zw||_∞ < γ, and therefore K_{t+1} ∈ K, which completes the proof.

C. S2DA and S3DA Policies

In this part, we present results quantifying the stability and disturbance attenuation of a sequence of valid policies. We begin by showing that there exist values of parameters κ, ε such that any stable and disturbance attenuating policy is S2DA. The proofs are omitted due to space constraints.
Proposition 1.
Assume that K ∈ K. Then, there exist values κ, ε such that K is (κ, ε, γ)-S2DA.

Suppose that a sequence of positive definite matrices P_t is generated according to Equation (10), where K_{t+1} is given by Equation (11), and K_1 is an initial stable and disturbance attenuating policy. Then, we have the following result, assuming that the updates P_t are uniformly bounded.

Proposition 2. Let Q_t, R_t ≥ µI, P_t ≤ νI, and K_t ∈ K. Then, {K_t}_{t≥1} is (κ̄, 1/(2κ̄²), γ)-S2DA, where κ̄ := √(ν/µ). Additionally, if ||P_t − P_{t+1}|| ≤ p ≤ µ²/ν, then {K_t}_{t≥1} is sequentially (κ̄, 1/(2κ̄²), γ)-S3DA.

In the sequel, we will use K_{κ,ε} to denote the set of (κ, ε, γ)-S2DA or (κ, ε, γ)-S3DA policies.
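The invariance claim of Theorem 1 can be exercised numerically: starting from a gain in K, one step of the recursion (10)-(11) should again give a stabilizing, disturbance attenuating gain. A sketch with numpy, on a toy system with illustrative values (B = I, γ = 5); Eq. (10) is solved here by simple fixed-point iteration and ||T_zw||_∞ is estimated by a frequency sweep over the unit circle, neither of which is prescribed by the paper:

```python
import numpy as np

def hinf_norm(Acl, Ccl, D, grid=400):
    """Estimate ||T_zw||_inf as max over the unit circle of
    sigma_max(Ccl (e^{jw} I - Acl)^{-1} D)."""
    n = Acl.shape[0]
    return max(
        np.linalg.svd(Ccl @ np.linalg.solve(np.exp(1j * w) * np.eye(n) - Acl, D),
                      compute_uv=False)[0]
        for w in np.linspace(0.0, np.pi, grid))

def riccati_update(A, B, D, Qbar, Rbar, K, gamma, iters=500):
    """Solve Eq. (10) for P by fixed-point iteration, then update K via Eq. (11)."""
    m = D.shape[1]
    Acl = A - B @ K
    P = np.zeros_like(A)
    for _ in range(iters):
        Ptil = P + P @ D @ np.linalg.solve(
            gamma**2 * np.eye(m) - D.T @ P @ D, D.T @ P)
        P = Acl.T @ Ptil @ Acl + Qbar + K.T @ Rbar @ K
    Ptil = P + P @ D @ np.linalg.solve(
        gamma**2 * np.eye(m) - D.T @ P @ D, D.T @ P)
    K_next = np.linalg.solve(Rbar + B.T @ Ptil @ B, B.T @ Ptil @ A)
    return P, K_next

A = np.array([[0.9, 0.1], [0.0, 0.8]])   # open-loop stable, so K1 = 0 is valid
B, D, gamma = np.eye(2), 0.1 * np.eye(2), 5.0
P1, K2 = riccati_update(A, B, D, np.eye(2), np.eye(2), np.zeros((2, 2)), gamma)
rho = np.max(np.abs(np.linalg.eigvals(A - B @ K2)))   # should be < 1
Ccl = np.vstack([np.eye(2), -K2])                     # z = [x; u] under u = -K2 x
```

Consistent with Theorem 1, the updated gain K2 is again stabilizing and satisfies the H_∞ constraint in this toy instance.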
D. Bound on Riccati Recursion Updates
Our next result yields a bound on the difference between successive updates of the Riccati recursion (10). We achieve this by reducing our framework to the form of the recursive updates for the traditional LQR shown in [34], and assuming that parameter values are chosen so that an inequality in the proof will not depend on a constant term.
Theorem 2.
Let Q_t, R_t ≥ µI, Tr(Q_t), Tr(R_t) < σ, P_t ≤ νI, and {K_t}_{t≥1} be (κ, ε, γ)-S2DA. Then, there exist constants p* and t* such that ||P_{t+1} − P_t|| ≤ p*/t for all t > t*.

Proof. From Equations (10), (15), and (11),

P_{t+1} − P_t = (A − BK_{t+1})^T (P̃_{t+1} − P̃_t)(A − BK_{t+1}) + (Q̄_{t+1} − Q̄_t) + K_{t+1}^T (R̄_{t+1} − R̄_t) K_{t+1} − (K_{t+1} − K_t)^T (R̄_t + B^T P̃_t B)(K_{t+1} − K_t),   (17)

K_{t+1} − K_t = (B^T P̃_t B + R̄_t)^{-1} (B^T (P̃_t − P̃_{t−1})(A − BK_t) + (R̄_{t−1} − R̄_t) K_t),   (18)

where Equation (18) uses the fact that R̄_{t−1} K_t + B^T P̃_{t−1} B K_t − B^T P̃_{t−1} A = 0. Therefore,

P_{t+1} − P_t = (A − BK_{t+1})^T (P_{t+1} − P_t)(A − BK_{t+1}) + M_t,   (19)

where M_t := M_{1t} + M_{2t} + M_{3t}, and

M_{1t} := (A − BK_{t+1})^T (P_{t+1} D (γ²I − D^T P_{t+1} D)^{-1} D^T P_{t+1} − P_t D (γ²I − D^T P_t D)^{-1} D^T P_t)(A − BK_{t+1}),   (20)

M_{2t} := (Q̄_{t+1} − Q̄_t) + K_{t+1}^T (R̄_{t+1} − R̄_t) K_{t+1},   (21)

−M_{3t} := (B^T (P̃_t − P̃_{t−1})(A − BK_t) + (R̄_{t−1} − R̄_t) K_t)^T (B^T P̃_t B + R̄_t)^{-1} (B^T (P̃_t − P̃_{t−1})(A − BK_t) + (R̄_{t−1} − R̄_t) K_t).   (22)

Equation (19) is a Lyapunov equation. Therefore,

P_{t+1} − P_t = Σ_{i≥0} ((A − BK_{t+1})^T)^i M_t (A − BK_{t+1})^i ≤ ||M_t|| Σ_{i≥0} ((A − BK_{t+1})^T)^i (A − BK_{t+1})^i.

Now, ||M_t|| ≤ ||M_{1t}|| + ||M_{2t}|| + ||M_{3t}||. Since K_t is (κ, ε, γ)-S2DA, ||K_t|| ≤ κ, (A − BK_{t+1}) = H_{t+1} L_{t+1} H_{t+1}^{-1}, and we have:

||Σ_{i≥0} ((A − BK_{t+1})^T)^i (A − BK_{t+1})^i|| ≤ Σ_{i≥0} κ²(1 − ε)^{2i} = κ²/(ε(2 − ε)) ≤ κ²/ε.   (23)

From Tr(Q_t), Tr(R_t) ≤ σ, we can write:

||Q̄_{t+1} − Q̄_t|| = (1/(t+1)) ||Q_{t+1} − Q̄_t|| ≤ 2σ/(t+1),
||R̄_{t+1} − R̄_t|| = (1/(t+1)) ||R_{t+1} − R̄_t|| ≤ 2σ/(t+1),
⇒ ||M_{2t}|| ≤ 2σ(1 + κ²)/(t+1).   (24)

Now, consider M_{1t}. We can write ||A − BK_{t+1}|| ≤ κ(1 − ε) ≤ κ, since ε ∈ (0, 1). (Note that |λ_max(A − BK)| ≤ ||A − BK||, where the (two-)norm of a matrix is given by its maximum singular value.) Since 0 < P_t ≤ νI, we can write (γ²I − D^T P_t D)^{-1} ≤ (γ²I − νD^T D)^{-1}. Therefore,

||(γ²I − D^T P_t D)^{-1}|| ≤ ||(γ²I − νD^T D)^{-1}|| = ||γ^{-2}(I − νγ^{-2} D^T D)^{-1}|| ≤ γ^{-2}(1 + νγ^{-2} ||D^T D||).

An upper bound on the norm of the middle term of M_{1t} then gives ||M_{1t}|| ≤ κ² m_D, where   (25)

m_D := 2ν²γ^{-2}(1 + νγ^{-2} ||D^T D||) ||D^T D||.

Since R_t ≥ µI, ||(B^T P̃_t B + R̄_t)^{-1}|| ≤ 1/µ. Then, we have:

||M_{3t}|| ≤ (1/µ)(2σκ/t + κ||B||(||P_t − P_{t−1}|| + m_D))².   (26)

Using the bounds in Equations (23)-(26), we have:

||P_{t+1} − P_t|| ≤ (κ²/ε)(κ² m_D + 2σ(1 + κ²)/(t+1)) + (κ²/(εµ))(2σκ/t + κ||B||(||P_t − P_{t−1}|| + m_D))²
= 2κ²σ(1 + κ²)/(ε(t+1)) + (κ²/(εµ))(2σκ/t + κ||B|| ||P_t − P_{t−1}||)²   (27)
  + (κ⁴ m_D/ε)((2||B||/µ)(2σ/t + ||B|| ||P_t − P_{t−1}||) + ||B||² m_D/µ + 1).

To complete the proof, we make the following assumption.

Assumption 2. µ, ν, γ are chosen so that the inequality (27) holds independent of the last term of (27) for all t > t*.

Future work will examine the relaxation of this assumption in greater detail. This setting is now similar to that in Lemma A.6 in [34]. Therefore, if there are some p* and t* such that ||P_t − P_{t−1}|| ≤ p*/t for all t > t*, then ||P_{t+1} − P_t|| ≤ p*/(t+1). Specifically, this will be true for:

t > t* = (8σκ²||B||/(εµ))(1 + κ||B||(1 + κ²)/ε),
p* ≤ σ||B|| + 4κ²σ(1 + κ²)/ε.

The base case of the induction can be shown as in [34].
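The 1/t decay in Theorem 2 can be observed numerically. The sketch below (numpy) fixes a stabilizing gain, feeds bounded, time-varying costs through the running average of Eq. (9), and re-solves Eq. (10) at each step; the choices B = I, alternating Q_t, R_t = I, and γ = 5 are illustrative, not from the paper:

```python
import numpy as np

A = np.array([[0.9, 0.1], [0.0, 0.8]])
K = 0.3 * np.eye(2)            # a fixed stabilizing gain (B = I assumed)
Acl = A - K
D, gamma = 0.1 * np.eye(2), 5.0

def solve_P(Qeff, iters=400):
    """Fixed point of Eq. (10) for a fixed gain: P = Acl^T Ptil Acl + Qeff."""
    P = np.zeros((2, 2))
    for _ in range(iters):
        Ptil = P + P @ D @ np.linalg.solve(
            gamma**2 * np.eye(2) - D.T @ P @ D, D.T @ P)
        P = Acl.T @ Ptil @ Acl + Qeff
    return P

Qbar, prev, diffs = np.zeros((2, 2)), None, []
for t in range(1, 200):
    Qt = np.eye(2) if t % 2 else 2.0 * np.eye(2)  # bounded time-varying costs
    Qbar = (t - 1) / t * Qbar + Qt / t            # Eq. (9); R_t = I absorbed below
    P = solve_P(Qbar + K.T @ K)
    if prev is not None:
        diffs.append((t, np.linalg.norm(P - prev, 2)))
    prev = P
# t * ||P_t - P_{t-1}|| stays bounded, i.e. the differences decay like 1/t
```

Because the averaged cost changes by O(1/t) per step and the Lyapunov-type map amplifies it by a constant, t·||P_t − P_{t−1}|| remains bounded here, as Theorem 2 predicts.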
E. Online Algorithm
Algorithm 1
Safety-Critical Online Controller Synthesis

procedure GENERATE {K_t}_{t>1}
  Input: system x_{t+1} = A x_t + B u_t + D w_t; initial state; parameters µ, ν, κ := √(ν/µ), ε = 1/(2κ²), γ, σ; K_1 ∈ K_{κ,ε}; time horizon T
  Output: {K_t}_{t>1}, such that K_t ∈ K_{κ,ε}
  for t = 1, 2, ..., T do
    obtain current state x_t
    generate u_t = −K_t x_t
    adversary plays w_t
    adversary generates C_t, E_t (Assumption 1)
    Q_t := C_t^T C_t; R_t := E_t^T E_t
    update R̄_t, Q̄_t acc. to Eqns. (8)-(9)
    update P_t according to Eqn. (10)
    if t = ⌈(8σκ²||B||/(εµ))(1 + κ||B||(1 + κ²)/ε)⌉ then
      d := 0, P_0 := P_{t*}, K_0 := K_{t*}
      d ← d + 1; successively solve Eqn. (10) as long as ||P_d − P_{d−1}|| > p*/t*; update K_d according to Eqn. (11)
    end if
    return K_{t+1} according to Eqn. (11)
  end for
end procedure

From Assumption 1 and Theorem 1, if we start at t = 1 from a stabilizing policy that attenuates the disturbance, then our update procedure will continue to yield stabilizing, disturbance attenuating policies for all t > 1. At each step, we compute u_t = −K_t x_t, and the output and cost are revealed in terms of the matrices C_t and E_t, where C_t and E_t satisfy Assumption 1. The update is carried out according to Equations (8)-(9) by averaging over previous values of Q_t := C_t^T C_t and R_t := E_t^T E_t.

These thresholds can be obtained by expanding the quadratic term on the right-hand side of Equation (27) and using Assumption 2 to get a quadratic inequality in p*. That is, we get a quadratic a(p*)² + b p* + c ≤ 0, where a, b, c are terms involving t and the constants in Equation (27). The bound on t is obtained by recognizing that (t+1)/t² ≈ 1/t, and requiring that the roots of this quadratic inequality be real, that is, b² − 4ac > 0. The bound on p* is then obtained by requiring that p* ∈ [p_1, p_2], where p_1 and p_2 are the roots of the quadratic equation a(p*)² + b p* + c = 0. Specifically, we set p* ≤ −b/(2a) so that the quadratic inequality will be satisfied.
From Theorem 2, ||P_{t+1} − P_t|| < p*/t for t > t* (Lines 12-16). Algorithm 1 formally presents this procedure.
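A compact sketch of the main loop of Algorithm 1 (omitting the inner re-solve at t = t*), with B = I, bounded random disturbances and costs standing in for the adversary, and Eq. (10) again solved by fixed-point iteration; every numeric choice below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B, D, gamma, T = np.eye(2), 0.1 * np.eye(2), 5.0, 50
x, K = np.zeros(2), np.zeros((2, 2))          # K1 = 0 is valid: A is stable
Qbar, Rbar = np.zeros((2, 2)), np.zeros((2, 2))

for t in range(1, T + 1):
    u = -K @ x                                # play the current policy
    w = rng.uniform(-1.0, 1.0, size=2)        # bounded adversarial disturbance
    x = A @ x + B @ u + D @ w
    Qt = rng.uniform(0.5, 1.5) * np.eye(2)    # costs revealed only after u_t
    Rt = rng.uniform(0.5, 1.5) * np.eye(2)
    Qbar = (t - 1) / t * Qbar + Qt / t        # Eq. (9)
    Rbar = (t - 1) / t * Rbar + Rt / t        # Eq. (8)
    P, Acl = np.zeros((2, 2)), A - B @ K      # solve Eq. (10) by fixed point
    for _ in range(300):
        Ptil = P + P @ D @ np.linalg.solve(
            gamma**2 * np.eye(2) - D.T @ P @ D, D.T @ P)
        P = Acl.T @ Ptil @ Acl + Qbar + K.T @ Rbar @ K
    Ptil = P + P @ D @ np.linalg.solve(
        gamma**2 * np.eye(2) - D.T @ P @ D, D.T @ P)
    K = np.linalg.solve(Rbar + B.T @ Ptil @ B, B.T @ Ptil @ A)   # Eq. (11)
```

On this toy instance the gains stay stabilizing throughout the horizon and the state remains bounded, matching the invariance guaranteed by Theorem 1.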
F. Regret Bounds
Iterative solutions to the Riccati equation in the LTI case exhibit quadratic convergence to an optimal solution P* [44]. In [41], this convergence rate was also shown to hold for the variant of the Riccati equation that we use in this paper. Specifically, for some c > 0, ||P_t − P*|| ≤ c ||P_{t−1} − P*||², and ||P_{t+1} − P_t|| ≤ c ||P_t − P_{t−1}||². Further, observe that R(T) in Problem 1 can be written as R(T) = Tr(P_T D D^T) − Tr(P*_T D D^T), where P*_T corresponds to the solution of the Riccati equation that yields the optimal controller from the set K. We use these results to establish a bound on the growth of the regret for sufficiently large T.

Theorem 3.
Let the conditions of Assumption 1 hold, and let Q_t, R_t ≥ µI, Tr(Q_t), Tr(R_t) < σ, P_t ≤ νI, κ = √(ν/µ), ε = 1/(2κ²). Let the controllers {K_t}_{t≥1} be (κ, ε, γ)-S2DA, and D D^T > 0. Then, for T ≥ t* = (8σκ²||B||/(εµ))(1 + κ||B||(1 + κ²)/ε), p* ≤ σ||B|| + 4κ²σ(1 + κ²)/ε, and some constant m > 0,

R(T) ≤ Tr(D D^T)(p* log(T) + m p*/(t* + 1) − p* log(t*)).

Proof. With J_0(·) = 0, we can express R(T) as:

R(T) = Σ_{t=1}^{T} (Tr(P_t D D^T) − Tr(P_{t−1} D D^T)) − Tr(P*_T D D^T)
= Σ_{t=1}^{t*} (Tr(P_t D D^T) − Tr(P_{t−1} D D^T)) − Tr(P*_T D D^T) + Σ_{t=t*+1}^{T} (Tr(P_t D D^T) − Tr(P_{t−1} D D^T))
= Tr(P_{t*} D D^T) − Tr(P* D D^T) + Tr(P* D D^T) − Tr(P*_T D D^T) + Σ_{t=t*+1}^{T} (Tr(P_t D D^T) − Tr(P_{t−1} D D^T))
≤ Tr(D D^T) ||P_{t*} − P*|| + Tr(D D^T) Σ_{t=t*+1}^{T} ||P_t − P_{t−1}||,

where P* is the optimal solution to the time-invariant, infinite-horizon Riccati equation. In the above, the first term can be interpreted as a burn-in cost, that is, the cost incurred before the procedure starts to yield meaningful regret bounds, while the second term gives the bound for large enough T. From Theorem 2, Σ_{t=t*+1}^{T} ||P_t − P_{t−1}|| ≤ Σ_{t=t*+1}^{T} p*/t ≤ p* log(T/t*), while for the first term, we use the quadratic convergence to P* to obtain ||P_{t*} − P*|| ≤ m p*/(t* + 1). Here, m is a constant associated with lim_{t→∞} Σ_{i=0}^{t} ||P_{t*+i} − P_{t*+i+1}||. Therefore, R(T) ≤ Tr(D D^T)(p* log(T) + m p*/(t* + 1) − p* log(t*)).

The regret bound in our case differs from those shown in related work (e.g. [17], [19], [34]) due to the nature of the cost function that we seek to optimize in this paper.
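The telescoping step in the proof can be sanity-checked numerically: once ||P_t − P_{t−1}|| ≤ p*/t for t > t*, the tail of the regret sum is at most p* log(T/t*). A small check with illustrative constants (p* = 2, t* = 10):

```python
import math

p_star, t_star, T = 2.0, 10, 10_000
# worst-case tail of the telescoped sum, per Theorem 2
tail = sum(p_star / t for t in range(t_star + 1, T + 1))
# the logarithmic bound used in Theorem 3: p* (log T - log t*)
bound = p_star * math.log(T / t_star)
```

The gap between `tail` and `bound` stays below a constant, consistent with the O(log T) growth of the regret in Theorem 3.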
Since we are interested in the minimization of an (upper bound on the) H_2 cost, given by lim_{t→∞} E[z_t^T z_t] when w_t is white noise, our regret term of the form in Problem 1 can be recast in the form on the first line of the above proof.

V. EXPERIMENTAL EVALUATION
We validate our method on a well-studied problem from process control, the Tennessee Eastman control challenge [22]. The irreversible and exothermic process (Figure 1) produces two products (G, H) from four reactants (A, C, D, E); component F represents other products formed from side reactions in the process. The open-loop process is unstable, which necessitates the use of feedback control. This model has been adapted to demonstrate the use of machine learning methods to study resilience to attacks [46], fault detection [47], and impacts of advanced persistent threats [48]. A continuous-time LTI model of the plant presented in [49] consists of eight states, four inputs, and ten outputs. We use values of the A, B, C matrices from [49], and discretize the model assuming a zero-order hold. We additionally assume w_t ∈ R^8, D = I_{8×8}, and E chosen to satisfy Assumption 1. We use these values of C and E to determine the (optimal) counterfactual static controller K ∈ K.

Fig. 1: Flow diagram of the Tennessee Eastman Process [50]

We consider two attack scenarios. In the first, at each time step, the adversary injects an arbitrary input w_t. We make no assumptions on the nature of this input (except that it is bounded, for the purpose of simulation). In the second, we simulate a denial-of-service attack by setting w_t = −B u_t for t ∈ [t_a, t'_a], and arbitrary at other times. [t_a, t'_a] is the attack duration, with 0 < t_a ≤ t'_a. In this case, the impact of the controller on the evolution of the state is canceled during the attack, but the learner still incurs a cost associated with the control (as a result of the E_t term). In each case, the matrices C_t and E_t are perturbed versions of C and E.

The normalized regret for the two attacks is shown in Figure 2. In particular, we observe that the regret of a sequence of controllers computed according to Algorithm 1, with respect to the optimal, counterfactual, time-invariant static controller obtained by solving the Riccati equation for the time-invariant case, satisfies the bounds determined in Theorem 3.
For the denial-of-service attack, although the effect of the controller is canceled for the duration of the attack, as long as the attack starts at t_a > 0, Algorithm 1 will continue to produce stabilizing, disturbance-attenuating controllers if we start from an initial controller that is stabilizing and disturbance attenuating (Theorem 1).

Fig. 2: Normalized regret for two types of adversarial input. (a) Arbitrary adversarial input: the adversary input w_t is arbitrary. (b) Denial-of-service attack: w_t cancels the effect of the control u_t on the state evolution during the attack. The blue curves show the normalized regret of a controller chosen according to Algorithm 1 with respect to the optimal, counterfactual, time-invariant static controller. The red curves denote the (normalized) right-hand side of the regret bound of Theorem 3.

VI. CONCLUSION
This paper presented an iterative solution to an online control problem in the presence of bounded adversarial disturbances. In this setting, costs incurred by the system at each time due to an adversarial disturbance input were revealed only after the input was given. We synthesized controllers to minimize (an upper bound of) a quadratic cost while simultaneously satisfying a safety constraint. This was achieved by solving a Riccati equation in an iterative manner. Solutions to the Riccati equation enforced the safety constraint. We showed that initializing the procedure with a stabilizing and disturbance-attenuating controller ensured that controllers at successive time steps retained this property. We showed that the regret of this controller, compared to the optimal controller when all costs and disturbances were known in hindsight, varied logarithmically with the time horizon. We validated our approach on a model of the Tennessee Eastman chemical process that was subject to arbitrary adversarial inputs and a denial-of-service attack.

Future work will study the partial information setting, where one will have to synthesize dynamic output feedback controllers, and the more general problem of minimizing the H∞ norm of the output-to-disturbance map. We will also extend our analysis to the case of unknown system dynamics.

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[2] R. Hafner and M. Riedmiller, "Reinforcement learning in feedback control," Machine Learning, vol. 84, pp. 137-169, 2011.
[3] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, 2015.
[4] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," in International Conference on Learning Representations, 2016.
[5] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, 2016.
[6] C. Zhang, P. Patras, and H. Haddadi, "Deep learning in mobile and wireless networking: A survey," IEEE Communications Surveys & Tutorials, vol. 21, no. 3, pp. 2224-2287, 2019.
[7] D. Sadigh, S. Sastry, S. A. Seshia, and A. D. Dragan, "Planning for autonomous cars that leverage effects on human actions," in Robotics: Science and Systems, 2016.
[8] Z. Yan and Y. Xu, "Data-driven load frequency control for stochastic power systems: A deep reinforcement learning method with continuous action search," IEEE Transactions on Power Systems, vol. 34, no. 2, 2018.
[9] C. You, J. Lu, D. Filev, and P. Tsiotras, "Advanced planning for autonomous vehicles using reinforcement learning and deep inverse RL," Robotics and Autonomous Systems, vol. 114, pp. 1-18, 2019.
[10] F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, "Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers," IEEE Control Systems Magazine, vol. 32, no. 6, pp. 76-105, 2012.
[11] A. Banerjee, K. K. Venkatasubramanian, T. Mukherjee, and S. K. Gupta, "Ensuring safety, security, and sustainability of mission-critical cyber-physical systems," Proceedings of the IEEE, vol. 100, no. 1, 2012.
[12] J. E. Sullivan and D. Kamensky, "How cyber-attacks in Ukraine show the vulnerability of the US power grid," The Electricity Journal, vol. 30, no. 3, pp. 30-35, 2017.
[13] W. M. Haddad, D. S. Bernstein, and D. Mustafa, "Mixed-norm H2/H∞ regulation and estimation: The discrete-time case," Systems & Control Letters, vol. 16, no. 4, pp. 235-247, 1991.
[14] D. Mustafa and D. S. Bernstein, "LQG cost bounds in discrete-time H2/H∞ control," Transactions of the Institute of Measurement and Control, vol. 13, no. 5, pp. 269-275, 1991.
[15] I. Kaminer, P. P. Khargonekar, and M. A. Rotea, "Mixed H2/H∞ control for discrete-time systems via convex optimization," Automatica, vol. 29, no. 1, pp. 57-70, 1993.
[16] N. Agarwal, E. Hazan, and K. Singh, "Logarithmic regret for online control," in Advances in Neural Information Processing Systems, 2019.
[17] N. Agarwal, B. Bullins, E. Hazan, S. Kakade, and K. Singh, "Online control with adversarial disturbances," in International Conference on Machine Learning, 2019, pp. 111-119.
[18] E. Hazan, S. M. Kakade, and K. Singh, "The nonstochastic control problem," in Algorithmic Learning Theory, 2020, pp. 408-421.
[19] M. Simchowitz, K. Singh, and E. Hazan, "Improper learning for non-stochastic control," in Conference on Learning Theory, 2020.
[20] D. J. Foster and M. Simchowitz, "Logarithmic regret for adversarial online control," in International Conference on Machine Learning, 2020.
[21] A. Cohen, A. Hasidim, T. Koren, N. Lazic, Y. Mansour, and K. Talwar, "Online linear quadratic control," in International Conference on Machine Learning, 2018, pp. 1029-1038.
[22] J. J. Downs and E. F. Vogel, "A plant-wide industrial process control problem," Computers & Chemical Engineering, vol. 17, no. 3, 1993.
[23] M. Aliyu and E. Boukas, "Discrete-time mixed H2/H∞ nonlinear filtering," in Proc. American Control Conference, 2008.
[24] W. M. Wonham, Linear Multivariable Control: A Geometric Approach. Springer, 1974.
[25] J. Willems, "Almost invariant subspaces: An approach to high gain feedback design - Part I: Almost controlled invariant subspaces," IEEE Transactions on Automatic Control, vol. 26, no. 1, pp. 235-252, 1981.
[26] G. Basile and G. Marro, Controlled and Conditioned Invariants in Linear System Theory. Prentice Hall, Englewood Cliffs, NJ, 1992.
[27] K. Furuta and M. Wongsaisuwan, "Closed-form solutions to discrete-time LQ optimal control and disturbance attenuation," Systems & Control Letters, vol. 20, no. 6, pp. 427-437, 1993.
[28] A. Saberi, Z. Lin, and A. A. Stoorvogel, "H2 and H∞ almost disturbance decoupling problem with internal stability," International Journal of Robust and Nonlinear Control, vol. 6, no. 8, 1996.
[29] Z. Lin and B. M. Chen, "Solutions to general H∞ almost disturbance decoupling problem with measurement feedback and internal stability for discrete-time systems," Automatica, vol. 36, no. 8, 2000.
[30] K. Zhou, J. C. Doyle, and K. Glover, Robust and Optimal Control. Prentice Hall, New Jersey, 1996, vol. 40.
[31] A. Bemporad and M. Morari, "Robust model predictive control: A survey," in Robustness in Identification and Control. Springer, 1999.
[32] S. V. Raković and W. S. Levine, Handbook of Model Predictive Control. Springer, 2018.
[33] B. Recht, "A tour of reinforcement learning: The view from continuous control," Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, pp. 253-279, 2019.
[34] M. Akbari, B. Gharesifard, and T. Linder, "An iterative Riccati algorithm for online linear quadratic control," arXiv preprint arXiv:1912.09451, 2019.
[35] A. Tsiamis and G. Pappas, "Online learning of the Kalman filter with logarithmic regret," arXiv preprint arXiv:2002.05141, 2020.
[36] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, "On the sample complexity of the linear quadratic regulator," Foundations of Computational Mathematics, pp. 1-47, 2019.
[37] A. Tsiamis, N. Matni, and G. J. Pappas, "Sample complexity of Kalman filtering for unknown systems," in Learning for Dynamics and Control, 2020, pp. 435-444.
[38] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, "Global convergence of policy gradient methods for the linear quadratic regulator," in International Conference on Machine Learning, 2018, pp. 1467-1476.
[39] S. Tu and B. Recht, "The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint," in Conference on Learning Theory, 2019, pp. 3036-3083.
[40] B. Gravell, P. M. Esfahani, and T. Summers, "Learning robust control for linear quadratic systems with multiplicative noise via policy gradient," arXiv preprint arXiv:1905.13547, 2019.
[41] K. Zhang, B. Hu, and T. Başar, "Policy optimization for H2 linear control with H∞ robustness guarantee: Implicit regularization and global convergence," in Learning for Dynamics and Control, 2020, pp. 179-190.
[42] S. Dean, S. Tu, N. Matni, and B. Recht, "Safely learning to control the constrained linear quadratic regulator," in Proc. American Control Conference, 2019, pp. 5582-5588.
[43] M. Green and D. J. Limebeer, Linear Robust Control. Dover, 2012.
[44] G. Hewer, "An iterative technique for the computation of the steady state gains for the discrete optimal regulator," IEEE Transactions on Automatic Control, vol. 16, no. 4, pp. 382-384, 1971.
[45] W. J. Rugh, Linear System Theory. Prentice Hall, 1996.
[46] A. Keliris, H. Salehghaffari, B. Cairl, P. Krishnamurthy, M. Maniatakos, and F. Khorrami, "Machine learning-based defense against process-aware attacks on industrial control systems," in IEEE International Test Conference, 2016, pp. 1-10.
[47] W. Zou, Y. Xia, and H. Li, "Fault diagnosis of Tennessee-Eastman process using orthogonal incremental extreme learning machine based on driving amount," IEEE Transactions on Cybernetics, vol. 48, no. 12, 2018.
[48] L. Huang and Q. Zhu, "A dynamic games approach to proactive defense strategies against advanced persistent threats in cyber-physical systems," Computers & Security, vol. 89, p. 101660, 2020.
[49] N. L. Ricker, "Model predictive control of a continuous, nonlinear, two-phase reactor," Journal of Process Control, vol. 3, no. 2, 1993.
[50] I. A. Udugama, K. V. Gernaey, M. A. Taube, and C. Bayer, "A novel use for an old problem: The Tennessee Eastman challenge process as an activating teaching tool,"