Distributional Robustness in Minimax Linear Quadratic Control with Wasserstein Distance∗

Kihyun Kim    Insoon Yang†

Abstract
To address the issue of inaccurate distributions in practical stochastic systems, a minimax linear-quadratic control method is proposed using the Wasserstein metric. Our method aims to construct a control policy that is robust against errors in an empirical distribution of underlying uncertainty, by adopting an adversary that selects the worst-case distribution. The opponent receives a Wasserstein penalty proportional to the amount of deviation from the empirical distribution. A closed-form expression of the finite-horizon optimal policy pair is derived using a Riccati equation. The result is then extended to the infinite-horizon average cost setting by identifying conditions under which the Riccati recursion converges to the unique positive semi-definite solution to an algebraic Riccati equation. Our method is shown to possess several salient features, including closed-loop stability and an out-of-sample performance guarantee. We also discuss how to optimize the penalty parameter for enhancing the distributional robustness of our control policy. Last but not least, a theoretical connection to the classical $H_\infty$-method is identified from the perspective of distributional robustness.

1 Introduction

Ambiguity, or uncertainty about uncertainty, in stochastic systems is one of the most fundamental challenges in the practical implementation of stochastic optimal controllers [1, 2]. The true probability distribution of underlying uncertainty is unknown in ambiguous stochastic systems. In practice, we often only have access to samples generated according to the distribution. Estimating an accurate distribution from such observations is challenging due to insufficient data and imperfect statistical models, among others. Using inaccurate distributions in the construction of an optimal policy may significantly decrease the control performance [3, 4] and can even cause unwanted system behaviors, such as unsafe operation [5]. The focus of this work is to develop a discrete-time minimax control method using the Wasserstein metric and to analyze its robustness against uncertainties or errors in such distributional information.

Our work is closely related to the literature in distributionally robust control (DRC). DRC methods seek to design a control policy that minimizes an expected cost of interest under the worst-case distribution in a so-called ambiguity set.
Several types of ambiguity sets have been employed in DRC using moment constraints [6, 7], confidence sets [8], relative entropy [1, 9], total variation distance [2, 10], and Wasserstein distance [11, 12]. Such choices of ambiguity sets have largely been motivated by the literature in distributionally robust optimization (DRO) [19–24]. In particular, DRO and DRC with the Wasserstein ambiguity set possess salient features such as a probabilistic out-of-sample performance guarantee and computational tractability [12, 22–26].

In this paper, we propose a minimax linear-quadratic control method for ambiguous stochastic systems, inspired by Wasserstein DRC. To pursue distributional robustness, our method adopts a hypothetical opponent selecting the worst-case distribution to maximize a cost of interest, while the controller aims to minimize the same cost. To limit the conservativeness of the resulting control policy, our method penalizes the opponent by the amount (measured in the Wasserstein metric) of deviation from an empirical distribution.

The minimax control problem is challenging to solve due to the infinite-dimensionality of the inner maximization problem in the Bellman equations.

∗ This work was supported in part by the Creative-Pioneering Researchers Program through SNU, the National Research Foundation of Korea funded by the MSIT (2020R1C1C1009766), and Samsung Electronics.
† Department of Electrical and Computer Engineering, Automation and Systems Research Institute, Seoul National University, Seoul 08826, Korea. {hahakhkim, insoonyang}@snu.ac.kr
(Footnote) This paper focuses on distributionally robust extensions of stochastic optimal control problems, although distributionally robust techniques have also been studied in other control methods such as model predictive control [13–16] and learning-based control [17, 18], among others.
In the finite-horizon setting, we derive a Riccati equation and a closed-form expression of the unique optimal policy and the opponent's policy generating the worst-case distribution. In the infinite-horizon setting, we identify a nontrivial stabilizability condition under which the solution to the Riccati equation converges to a symmetric positive semi-definite (PSD) solution to an algebraic Riccati equation (ARE). Taking a generalized eigenvalue approach, our result is strengthened so that the converged solution corresponds to a unique symmetric PSD solution to the ARE under an additional observability condition. We also show that the resulting steady-state policy pair is an optimal solution to the infinite-horizon average cost minimax problem. The stability properties of the closed-loop system are further studied regarding the expected value of the system state.

We examine the distributional robustness of the resulting control policy, using Wasserstein ambiguity sets, motivated by the DRC formulation [12]. Specifically, we evaluate our policy under the worst-case distribution in the ambiguity set. A simple upper bound on this worst-case cost is derived using the optimal value function of our minimax problem. A penalty parameter minimizing the upper bound can be computed by solving a convex optimization problem, which is obtained by exploiting the structural property of the value function. This study of our minimax method under a DRC lens yields another salient feature: our policy attains a performance guarantee evaluated under a new sample, independent of the data used in the controller design. The probabilistic out-of-sample performance guarantee is shown using the measure concentration inequality for the Wasserstein metric.

Another interesting observation is a theoretical connection between our minimax method and the $H_\infty$-method.
Our method with Wasserstein distance can be understood as a distributional generalization of the $H_\infty$-method, thereby bridging the gap between stochastic and robust control. This connection yields the robust stability property of our minimax controller. Conversely, our stochastic interpretation of the $H_\infty$-method enables us to analyze the $H_\infty$-controller from the perspective of distributional robustness.

This paper is significantly expanded from its preliminary conference version [27]. The study of our minimax method using the DRC formulation with a Wasserstein ambiguity set is newly presented, along with the out-of-sample performance guarantee. Furthermore, the infinite-horizon total cost results in [27] are extended to the average cost setting, identifying optimality conditions and the guaranteed-cost property. Last but not least, this paper contains the results regarding the bounded-input, bounded-output stability and the robust stability of the closed-loop system.

Let $\mathbb{S}^n_+$ (resp. $\mathbb{S}^n_{++}$) denote the set of symmetric positive semi-definite (resp. positive definite) matrices in $\mathbb{R}^{n \times n}$. Given a Borel set $\mathcal{W}$, let $\mathcal{P}(\mathcal{W})$ denote the set of Borel probability measures on $\mathcal{W}$. Moreover, $\|\cdot\|$ represents the standard Euclidean norm.

Consider a discrete-time linear stochastic system of the form
$x_{t+1} = A x_t + B u_t + \Xi w_t,$  (2.1)
where $x_t \in \mathbb{R}^n$ and $u_t \in \mathbb{R}^m$ represent the system state and input, respectively. Here, $w_t \in \mathbb{R}^k$ is a random disturbance vector with probability distribution $\mu_t \in \mathcal{P}(\mathbb{R}^k)$. In addition, $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times m}$, and $\Xi \in \mathbb{R}^{n \times k}$ are time-invariant system matrices.

In practice, it is challenging to obtain the true probability distribution $\mu_t$ of $w_t$. One of the most straightforward ways to estimate the distribution is to construct the following empirical distribution from sample data $\{\hat{w}_t^{(1)}, \ldots, \hat{w}_t^{(N)}\}$ of $w_t$:
$\nu_t := \frac{1}{N}\sum_{i=1}^{N} \delta_{\hat{w}_t^{(i)}},$  (2.2)
where $\delta_{\hat{w}_t^{(i)}}$ denotes the Dirac measure concentrated at $\hat{w}_t^{(i)}$. However, it is undesirable to use this empirical distribution in controller design because the control performance would deteriorate as the true distribution deviates from $\nu_t$.

Let $\pi := (\pi_0, \pi_1, \ldots)$ denote a deterministic Markov control policy, where $\pi_t$ maps the current state $x_t$ to an input $u_t$. More precisely, the set of admissible control policies is given by $\Pi := \{\pi \mid \pi_t(x_t) = u_t \in \mathbb{R}^m,\ \pi_t \text{ is measurable } \forall t\}$. To design a controller that is robust against errors in the empirical distributions, we employ a (hypothetical) opponent that selects the probability distribution $\mu_t$ in an adversarial way. The opponent policy $\gamma := (\gamma_0, \gamma_1, \ldots)$ is also assumed to be deterministic and Markov, where $\gamma_t$ maps the current state $x_t$ to a probability distribution $\mu_t$. Specifically, the set of admissible opponent policies is defined by $\Gamma := \{\gamma \mid \gamma_t(x_t) = \mu_t \in \mathcal{P}(\mathbb{R}^k)\ \forall t\}$. We first consider the finite-horizon case and later extend our results to the infinite-horizon case.

Suppose for a moment that the controller aims to minimize the standard quadratic cost function
$J_x(\pi,\gamma) = J_{x,T}(\pi,\gamma) := \frac{1}{T}\,\mathbb{E}^{\pi,\gamma}\Big[x_T^\top Q_f x_T + \sum_{t=0}^{T-1}\big(x_t^\top Q x_t + u_t^\top R u_t\big)\ \Big|\ x_0 = x\Big],$  (2.3)
with $Q, Q_f \in \mathbb{S}^n_+$ and $R \in \mathbb{S}^m_{++}$, while the opponent determines $\gamma$ to maximize the same cost. If this were the case, however, that would give too much freedom to the opponent, thereby causing the optimal controller to be overly conservative. To systematically adjust conservativeness, we penalize the opponent according to the degree of deviation from the empirical distributions $\nu_t$. By doing so, we can also incorporate the prior information provided by the sample data directly into the controller design.
Specifically, the penalty is measured by the Wasserstein distance $W(\mu_t,\nu_t)$ between $\mu_t$ and $\nu_t$. The Wasserstein metric of order 2 between two distributions $\mu$ and $\nu$ is defined as
$W(\mu,\nu) := \inf_{\eta \in \mathcal{P}(\mathcal{W}^2)} \Big\{ \Big(\int_{\mathcal{W}^2} \|x-y\|^2 \, \mathrm{d}\eta(x,y)\Big)^{1/2} \ \Big|\ \Pi^1\eta = \mu,\ \Pi^2\eta = \nu \Big\},$
where $\Pi^i\eta$ is the $i$th marginal distribution of $\eta$.

(For ease of exposition, we focus on deterministic Markov policies. However, all the results in this paper remain valid even when considering randomized history-dependent policies for both players, by the optimality result in [12].)

The cost function is then modified by adding a Wasserstein penalty term as follows:
$J^\lambda_x(\pi,\gamma) = J^\lambda_{x,T}(\pi,\gamma) := \frac{1}{T}\,\mathbb{E}^{\pi,\gamma}\Big[x_T^\top Q_f x_T + \sum_{t=0}^{T-1}\big(x_t^\top Q x_t + u_t^\top R u_t - \lambda W(\mu_t,\nu_t)^2\big)\ \Big|\ x_0 = x\Big],$  (2.4)
where $\lambda > 0$ is a penalty parameter; in the limit $\lambda \to \infty$, $J^\lambda_x(\pi,\gamma)$ reduces to $J_x(\pi,\gamma)$. By tuning the parameter $\lambda$, we can adjust the conservativeness of our control policy, which is obtained by solving the following minimax stochastic control problem:
$\min_{\pi \in \Pi} \max_{\gamma \in \Gamma} J^\lambda_x(\pi,\gamma).$  (2.5)
The inner maximization problem yields a worst-case distribution policy given $\pi$. Thus, an optimal solution $\pi^\star$ to the outer problem minimizes the worst-case cost. Our first goal is to develop a Riccati equation-based solution to (2.5) and analyze the properties of $\pi^\star$, such as closed-loop stability.

A closely related minimax stochastic control formulation is the distributionally robust control problem [12]. This formulation uses Wasserstein ambiguity sets instead of the Wasserstein penalty term. Specifically, the Wasserstein ambiguity set is defined as
$\mathbb{D}_t := \{\mu_t \in \mathcal{P}(\mathbb{R}^k) \mid W(\mu_t,\nu_t) \le \theta\}.$  (2.6)
The set $\mathbb{D}_t$ is a statistical ball centered at the empirical distribution $\nu_t$, where the distance between any two elements is measured by the Wasserstein metric.
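Since the distributions of interest here are finitely supported (empirical distributions and their perturbations), the order-2 Wasserstein distance reduces to a finite linear program over transport plans. The following sketch (an illustrative helper, not part of the paper's method) solves that program with SciPy:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein2(xs, ys, p, q):
    """Order-2 Wasserstein distance between discrete distributions:
    atoms xs with weights p, atoms ys with weights q (optimal transport LP)."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    # squared-Euclidean cost between every pair of support points
    C = ((xs[:, None, :] - ys[None, :, :]) ** 2).sum(axis=2)
    n, m = C.shape
    # marginal constraints: rows of the plan sum to p, columns sum to q
    A_eq = []
    for i in range(n):
        row = np.zeros((n, m)); row[i, :] = 1.0; A_eq.append(row.ravel())
    for j in range(m):
        col = np.zeros((n, m)); col[:, j] = 1.0; A_eq.append(col.ravel())
    res = linprog(C.ravel(), A_eq=np.array(A_eq),
                  b_eq=np.concatenate([p, q]), bounds=(0, None))
    return np.sqrt(res.fun)
```

An exact LP is fine for small supports such as the $N$-atom empirical distributions above; for large $N$, specialized optimal-transport solvers are the usual choice.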
The opponent's policy $\gamma$ is then restricted to the following space: $\Gamma_{\mathbb{D}} := \{\gamma \in \Gamma \mid \gamma_t(x_t) \in \mathbb{D}_t\ \forall t\}$. In words, the probability distribution produced by the opponent's policy must be contained in the Wasserstein ambiguity set. To achieve distributional robustness, it is desirable to design a controller that minimizes the expected cost under the worst-case distribution policy in $\Gamma_{\mathbb{D}}$. Such a control policy can be obtained by solving the following Wasserstein distributionally robust control problem:
$\min_{\pi \in \Pi} \max_{\gamma \in \Gamma_{\mathbb{D}}} J_x(\pi,\gamma),$  (2.7)
which can be solved by dynamic programming (DP). Unfortunately, the DP solution is not scalable due to the curse of dimensionality, unlike our Riccati equation-based method. We claim that the optimal policy $\pi^\star$ of (2.5) is a reasonable suboptimal solution to the DR control problem since it has the following guaranteed-cost property:
$\sup_{\gamma \in \Gamma_{\mathbb{D}}} J_x(\pi^\star(\lambda^\star),\gamma) \le \lambda^\star\theta^2 + V(x;\lambda^\star),$  (2.8)
where $\pi^\star(\lambda)$ denotes the optimal policy of (2.5) with $\lambda$, $V(x;\lambda) := \inf_{\pi \in \Pi}\sup_{\gamma \in \Gamma} J^\lambda_x(\pi,\gamma)$ denotes the optimal value function of (2.5), and $\lambda^\star \in \arg\min_{\lambda \ge 0}\,[\lambda\theta^2 + V(x;\lambda)]$. Note that the objective function of the minimization problem on the right-hand side can be evaluated by solving (2.5). Thus, the right-hand side provides a provable upper bound on the worst-case cost of employing $\pi^\star(\lambda^\star)$. This upper bound can be used to assess the distributional robustness of $\pi^\star(\lambda^\star)$ and to quantify a probabilistic out-of-sample performance guarantee of $\pi^\star(\lambda^\star)$, as discussed in Section 4.

In the following section, we first study problem (2.5) to obtain an explicit solution in both finite-horizon and infinite-horizon cases and identify useful properties. These results will then be used to analyze the distributional robustness of $\pi^\star(\lambda^\star)$ in Section 4.
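The minimization defining $\lambda^\star$ is one-dimensional, so a coarse search already works once the value function can be evaluated. The sketch below is purely illustrative: it assumes a scalar system with made-up numbers and a zero-mean empirical distribution (so the linear terms of the value function vanish), and evaluates $V(x;\lambda)$ with the Riccati recursion developed in Section 3.

```python
import numpy as np

# Illustrative scalar problem data (all values assumed, not from the paper)
A, B, Xi, Q, Qf, R = 0.9, 1.0, 1.0, 1.0, 1.0, 1.0
Sigma = 0.2            # second moment of the (zero-mean) empirical distribution
T, theta, x0 = 50, 0.1, 1.0

def V(lam):
    """V(x0; lam) via the scalar Riccati recursion (zero-mean data: r_t = 0)."""
    P, z = Qf, 0.0
    for _ in range(T):
        if lam <= P * Xi ** 2:     # penalty too small: inner sup is unbounded
            return np.inf
        Phi = B ** 2 / R - Xi ** 2 / lam
        z += (Xi ** 2 * P * Sigma) / (1.0 - Xi ** 2 * P / lam)
        P = Q + A ** 2 * P / (1.0 + P * Phi)
    return (P * x0 ** 2 + z) / T

# coarse 1-D search for the penalty parameter minimizing the upper bound
lams = np.linspace(0.5, 50.0, 400)
objs = np.array([lam * theta ** 2 + V(lam) for lam in lams])
lam_star = lams[np.argmin(objs)]
```

The objective is infinite for $\lambda$ below a threshold and finite above it, so the search naturally ignores the infeasible range; Section 4 shows the finite part is in fact convex in $\lambda$, so any one-dimensional convex solver can replace the grid.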
To begin with, we consider the regularized problem (2.5) in the finite-horizon setting with cost function $J^\lambda_x(\pi,\gamma)$ defined in (2.4). Later, we establish the connection between the finite-horizon and infinite-horizon cases by letting $T \to \infty$.

We use dynamic programming to solve the finite-horizon problem. Let the optimal value function be defined by
$V_t(x) = V_t(x;\lambda) := \inf_{\pi \in \Pi}\sup_{\gamma \in \Gamma} \mathbb{E}^{\pi,\gamma}\Big[\sum_{s=t}^{T-1}\big(x_s^\top Q x_s + u_s^\top R u_s - \lambda W(\mu_s,\nu_s)^2\big) + x_T^\top Q_f x_T \ \Big|\ x_t = x\Big],$
which represents the optimal worst-case expected cost-to-go from stage $t$ given $x_t = x$. By definition, $V(x;\lambda) = V_0(x;\lambda)/T$. The dynamic programming principle yields
$V_t(x) = x^\top Q x + \inf_{u \in \mathbb{R}^m}\sup_{\mu \in \mathcal{P}(\mathbb{R}^k)}\Big[u^\top R u - \lambda W(\mu,\nu_t)^2 + \int_{\mathbb{R}^k} V_{t+1}(Ax + Bu + \Xi w)\,\mathrm{d}\mu(w)\Big]$
for $t = 0,\ldots,T-1$, and $V_T(x) := x^\top Q_f x$. Note that the inner maximization problem is an infinite-dimensional optimization problem over $\mathcal{P}(\mathbb{R}^k)$. For a tractable reformulation, we use a modern DRO technique based on Kantorovich duality [24], which yields
$V_t(x) = x^\top Q x + \inf_{u \in \mathbb{R}^m}\Big[u^\top R u + \frac{1}{N}\sum_{i=1}^{N}\sup_{w \in \mathbb{R}^k}\big\{V_{t+1}(Ax + Bu + \Xi w) - \lambda\|\hat{w}_t^{(i)} - w\|^2\big\}\Big].$  (3.1)
Let the mean and the covariance matrix of the empirical distribution be denoted by
$\bar{w}_t := \mathbb{E}^{\nu_t}[w_t], \qquad \Sigma_t := \mathbb{E}^{\nu_t}[w_t w_t^\top].$
We also let
$\Phi := B R^{-1} B^\top - \frac{1}{\lambda}\,\Xi\Xi^\top.$  (3.2)
We now consider the following ansatz for the value function: $V_t(x) = x^\top P_t x + 2 r_t^\top x + z_t$, where $P_t \in \mathbb{S}^n_+$, $r_t \in \mathbb{R}^n$, and $z_t \in \mathbb{R}$. Our goal is to identify an explicit solution to the minimax optimization problem in (3.1). In what follows, we show that the quadratic structure of the value function is preserved through the Bellman recursion, and the proposed parameterization is thus exact if the matrices $P_t$ satisfy a Riccati equation.

Lemma 1.
Suppose that $V_{t+1}(x) = x^\top P_{t+1} x + 2 r_{t+1}^\top x + z_{t+1}$ for some $P_{t+1} \in \mathbb{S}^n_+$, $r_{t+1} \in \mathbb{R}^n$, and $z_{t+1} \in \mathbb{R}$. We further assume that the penalty parameter satisfies $\lambda > \bar\lambda_{t+1}$, where $\bar\lambda_{t+1}$ is the maximum eigenvalue of $\Xi^\top P_{t+1}\Xi$. Then, the inner maximization problem $\sup_{w \in \mathbb{R}^k}\{V_{t+1}(Ax + Bu + \Xi w) - \lambda\|\hat{w}_t^{(i)} - w\|^2\}$ in (3.1) has a unique maximizer $w_t^\star := (w_t^{\star,(1)},\ldots,w_t^{\star,(N)})$, defined as
$w_t^{\star,(i)} := (\lambda I - \Xi^\top P_{t+1}\Xi)^{-1}\big(\Xi^\top P_{t+1}(Ax + Bu) + \Xi^\top r_{t+1} + \lambda\hat{w}_t^{(i)}\big).$  (3.3)
Furthermore, the outer minimization problem in (3.1) has a unique minimizer
$u^\star := K_t x + L_t,$  (3.4)
where
$K_t := -R^{-1}B^\top(I + P_{t+1}\Phi)^{-1}P_{t+1}A,$
$L_t := -R^{-1}B^\top(I + P_{t+1}\Phi)^{-1}(P_{t+1}\Xi\bar{w}_t + r_{t+1}).$

Note that $w_t^\star$ is linear in $(x,u)$ and $u^\star$ is linear in $x$. The explicit derivation with this linear structure yields the following Riccati equation:
$P_t = Q + A^\top(I + P_{t+1}\Phi)^{-1}P_{t+1}A$
$r_t = A^\top(I + P_{t+1}\Phi)^{-1}(P_{t+1}\Xi\bar{w}_t + r_{t+1})$
$z_t = z_{t+1} + \mathrm{tr}\big[(I - \Xi^\top P_{t+1}\Xi/\lambda)^{-1}\Xi^\top P_{t+1}\Xi\,\Sigma_t\big] + \bar{w}_t^\top\Xi^\top\big[(I + P_{t+1}\Phi)^{-1} - (I - P_{t+1}\Xi\Xi^\top/\lambda)^{-1}\big]P_{t+1}\Xi\bar{w}_t + (2\bar{w}_t^\top\Xi^\top - r_{t+1}^\top\Phi)(I + P_{t+1}\Phi)^{-1}r_{t+1}$  (3.5)
with the terminal conditions $P_T = Q_f$, $r_T = 0$, and $z_T = 0$. Note that $P_t$, $t = 0,\ldots,T-1$, are symmetric since $P_T$ is symmetric. For the well-definedness of the recursion, we make the following assumption:

Assumption 1.
The penalty parameter satisfies $\lambda > \bar\lambda_t$ for all $t \ge 0$, where $\bar\lambda_t$ is the maximum eigenvalue of $\Xi^\top P_t\Xi$.

Theorem 1 (Optimal policy). Suppose that Assumption 1 holds. Then, the matrices $P_t$ are well-defined and the value function can be expressed as $V_t(x) = x^\top P_t x + 2 r_t^\top x + z_t$, $t = 0,\ldots,T$. Furthermore, the regularized problem (2.5) in the finite-horizon case has a unique optimal policy, defined as
$\pi_t^\star(x) := K_t x + L_t, \quad t = 0,\ldots,T-1.$  (3.6)

As in the standard LQG, the optimal policy is linear in the system state, and the gain matrix $K_t$ can be obtained by solving a Riccati equation. Note that the Riccati equation in the standard LQG is given by (e.g., [28])
$P_t = Q + A^\top(I + P_{t+1}BR^{-1}B^\top)^{-1}P_{t+1}A$
$r_t = A^\top(I + P_{t+1}BR^{-1}B^\top)^{-1}(P_{t+1}\Xi\bar{w}_t + r_{t+1})$
$z_t = z_{t+1} + \mathrm{tr}\big[\Xi^\top P_{t+1}\Xi\,\Sigma_t\big] - \bar{w}_t^\top\Xi^\top P_{t+1}BR^{-1}B^\top(I + P_{t+1}BR^{-1}B^\top)^{-1}P_{t+1}\Xi\bar{w}_t + (2\bar{w}_t^\top\Xi^\top - r_{t+1}^\top BR^{-1}B^\top)(I + P_{t+1}BR^{-1}B^\top)^{-1}r_{t+1},$  (3.7)
and it can be obtained by letting $\lambda \to \infty$ in (3.5). Increasing $\lambda$ encourages the opponent not to deviate much from the empirical distribution $\nu_t$. Thus, in the limit, our minimax method is equivalent to the standard LQG. This shows that our proposed framework is a generalization of LQG.

Another immediate consequence of Lemma 1 and Theorem 1 is that one of the worst-case distributions can be explicitly obtained with a finite support, as follows:

Corollary 1 (Worst-case distribution). Suppose that Assumption 1 holds. Let
$w_t^{\star,(i)}(x) := (\lambda I - \Xi^\top P_{t+1}\Xi)^{-1}\big(\Xi^\top P_{t+1}(Ax + BK_t x + BL_t) + \Xi^\top r_{t+1} + \lambda\hat{w}_t^{(i)}\big).$
Then, the policy $\gamma^\star$ defined by
$\gamma_t^\star(x) := \frac{1}{N}\sum_{i=1}^{N}\delta_{w_t^{\star,(i)}(x)}$
generates the worst-case distribution, i.e., $(\pi^\star,\gamma^\star)$ is an optimal minimax solution to (2.5) in the finite-horizon case.

In this subsection, we investigate an optimal controller for the infinite-horizon case when the number of stages $T$ increases to $\infty$. We consider the following infinite-horizon average cost criterion:
$J^\lambda_{x,\infty}(\pi,\gamma) := \limsup_{T \to \infty}\frac{1}{T}\,\mathbb{E}^{\pi,\gamma}\Big[\sum_{t=0}^{T-1}\big(x_t^\top Q x_t + u_t^\top R u_t - \lambda W(\mu_t,\nu)^2\big)\ \Big|\ x_0 = x\Big].$  (3.8)
Based on the results in the finite-horizon case, we begin by identifying the steady-state policy that our optimal policy converges to. Our specific goal is to derive an algebraic Riccati equation (ARE) and characterize the condition under which the recursion (3.5) converges to a unique symmetric PSD solution of the ARE.

Throughout this subsection, we assume the following for the stationarity of the problem.

Assumption 2.
The random disturbance process $\{w_t\}_{t=0}^\infty$ is i.i.d., and its empirical distribution is constructed as $\nu \equiv \nu_t := \frac{1}{N}\sum_{i=1}^{N}\delta_{\hat{w}^{(i)}}$ from the dataset $\{\hat{w}^{(1)},\ldots,\hat{w}^{(N)}\}$.

Under Assumption 2, we denote the mean value and the covariance matrix of $\nu$ by $\bar{w}$ and $\Sigma$. Based on the iteration (3.5), our focus is on finding a solution to the following algebraic Riccati equation (ARE):
$P = Q + A^\top\Big[I + P\Big(BR^{-1}B^\top - \frac{1}{\lambda}\Xi\Xi^\top\Big)\Big]^{-1}PA.$  (3.9)
Note that the ARE (3.9) has an equivalent form to the ARE in the classical $H_\infty$-optimal control (see [29, Section 3.2]). The specific relationship between our minimax method and the $H_\infty$-method will be discussed in Section 5.

We first show that $P_t$ updated by (3.5) converges to a unique PSD solution of the ARE (3.9) under suitable nontrivial stabilizability and observability conditions. Recall that the symmetric matrix $\Phi$ is defined in (3.2). We make the following assumption on $\Phi$:

Assumption 3. $\Phi \succeq 0$, and $(A, \Phi^{1/2})$ is stabilizable.

Proposition 1.
Suppose that Assumptions 1–3 hold. Then, a bounded limiting solution $P_{ss} := \lim_{T \to \infty} P_t$ to the Riccati equation (3.5) exists for any $P_T \in \mathbb{S}^n_+$. Furthermore, $P_{ss}$ is a symmetric PSD solution to the ARE (3.9).

To solve the ARE (3.9), we use the method proposed in [30], considering the generalized eigenvalue problem of $F$ and $G$:
$Fv = \gamma Gv,$  (3.10)
where
$F := \begin{bmatrix} A & 0 \\ -Q & I \end{bmatrix}$ and $G := \begin{bmatrix} I & \Phi \\ 0 & A^\top \end{bmatrix}.$

Lemma 2.
Any solution of the ARE (3.9) can be expressed as $P = U_2 U_1^{-1}$, where each column of $\begin{bmatrix} U_1 \\ U_2 \end{bmatrix} \in \mathbb{R}^{2n \times n}$ solves the generalized eigenvalue problem (3.10) of $F$ and $G$.

Lemma 2 shows that all solutions of the ARE (3.9) can be obtained from the generalized eigenvalue problem of $F$ and $G$. Unfortunately, most of them are unstabilizing solutions. However, we are only interested in the symmetric PSD solution $P_{ss}$ to which the Riccati recursion (3.5) converges. To identify the steady-state solution, we need the following assumption and lemma:

Assumption 4. $(A, Q^{1/2})$ is observable.

Lemma 3.
Suppose that Assumptions 3 and 4 hold. Then, $P = U_2 U_1^{-1}$ is a symmetric PSD solution to the ARE (3.9) if and only if each column of $\begin{bmatrix} U_1 \\ U_2 \end{bmatrix} \in \mathbb{R}^{2n \times n}$ solves the generalized eigenvalue problem (3.10) of $F$ and $G$ with a stable generalized eigenvalue.

Lemma 3 motivates us to investigate the condition on $F$ and $G$ under which (3.10) has $n$ stable generalized eigenvalues. Note that the following symplectic property holds:
$F\Omega F^\top = G\Omega G^\top = \begin{bmatrix} 0 & A \\ -A^\top & 0 \end{bmatrix}, \quad \text{where } \Omega = \begin{bmatrix} 0 & I_n \\ -I_n & 0 \end{bmatrix}.$
Thus, if $\gamma$ is a generalized eigenvalue, so is $1/\gamma$ with the same multiplicity. This implies that if no generalized eigenvalue lies on the unit circle, then exactly $n$ generalized eigenvalues are stable, and there exists a unique symmetric PSD solution to the ARE by Lemma 3.

Lemma 4.
Under Assumptions 3 and 4, $(F,G)$ does not have any generalized eigenvalue on the unit circle.

Proof. The existence of generalized eigenvalues on the unit circle contradicts Assumptions 3 and 4. See [30, Theorem 3] for details.

(Here, a generalized eigenvalue is called stable if its absolute value is less than 1.)
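Numerically, the construction in Lemmas 2–4 amounts to extracting the stable deflating subspace of the pencil $(F,G)$, which an ordered QZ decomposition provides directly. A minimal sketch on assumed scalar data (all numbers illustrative), using SciPy's `ordqz` with the inside-unit-circle ordering:

```python
import numpy as np
from scipy.linalg import ordqz

# Illustrative data (assumed): A, Q, and Phi = B R^{-1} B^T - (1/lambda) Xi Xi^T
A = np.array([[0.9]])
Q = np.array([[1.0]])
Phi = np.array([[0.9]])
n = A.shape[0]

F = np.block([[A, np.zeros((n, n))], [-Q, np.eye(n)]])
G = np.block([[np.eye(n), Phi], [np.zeros((n, n)), A.T]])

# Order the QZ form so that generalized eigenvalues inside the unit circle
# come first; the leading n columns of Z then span the stable deflating
# subspace, from which P_ss = U2 U1^{-1} as in Lemma 3.
_, _, alpha, beta, _, Z = ordqz(F, G, sort='iuc', output='real')
U1, U2 = Z[:n, :n], Z[n:, :n]
P_ss = U2 @ np.linalg.inv(U1)
```

For this scalar example the two generalized eigenvalues come in a reciprocal pair, consistent with the symplectic property above, and exactly one lies inside the unit circle.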
By Lemma 4, there exist $U_1, U_2 \in \mathbb{R}^{n \times n}$ and $\Lambda \in \mathbb{R}^{n \times n}$ such that
$FU = GU\Lambda$  (3.11)
with $U = \begin{bmatrix} U_1 \\ U_2 \end{bmatrix}$, where the columns of $U$ solve (3.10) with $n$ stable generalized eigenvalues, and $\Lambda$ is the corresponding Jordan normal form. We obtain the following lemma, which enables us to construct a solution of the ARE (3.9) from $U_1$ and $U_2$.

Lemma 5.
Under Assumptions 3 and 4, $U_1$ is nonsingular.

Proof. This can be shown directly using the proof of [30, Theorem 6].

Using the previous lemmas, we finally obtain the following conclusion, which connects the Riccati equation (3.5) in the finite-horizon case and the ARE (3.9) in the infinite-horizon case.
Theorem 2.
Suppose that Assumptions 1–4 hold. Then, the recursion (3.5) converges to the unique symmetric PSD solution $P_{ss} := U_2 U_1^{-1}$ of the ARE (3.9) as $T \to \infty$.

This result can be further simplified when the system matrix $A$ is nonsingular. In this particular case, we let
$H := G^{-1}F = \begin{bmatrix} A + \Phi A^{-\top} Q & -\Phi A^{-\top} \\ -A^{-\top} Q & A^{-\top} \end{bmatrix}$
and construct $\hat{U}_1, \hat{U}_2 \in \mathbb{R}^{n \times n}$ so that each column of $\begin{bmatrix} \hat{U}_1 \\ \hat{U}_2 \end{bmatrix} \in \mathbb{R}^{2n \times n}$ is an eigenvector of $H$ associated with a stable eigenvalue. We then obtain the following result:

Corollary 2.
Suppose that Assumptions 1–4 hold and that $A$ is nonsingular. Then, the recursion (3.5) converges to the unique symmetric PSD solution $P_{ss} := \hat{U}_2\hat{U}_1^{-1}$ of the ARE (3.9).

The convergence of $r_t$ in the recursion (3.5) directly follows from the convergence of $P_t$.

Proposition 2.
Suppose that Assumptions 1–4 hold. Then, $r_t$ in the recursion (3.5) converges to
$r_{ss} := \big[I - A^\top(I + P_{ss}\Phi)^{-1}\big]^{-1}A^\top(I + P_{ss}\Phi)^{-1}P_{ss}\Xi\bar{w}$  (3.12)
as $T \to \infty$.

The steady-state control policy in the infinite-horizon case can be obtained using the symmetric PSD solution to the ARE (3.9), as in the finite-horizon case.
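In practice, $P_{ss}$ can also be approximated by simply iterating the recursion (3.5) until it settles, after which (3.12) gives the limit of $r_t$ in closed form. A sketch on assumed scalar data (all numbers illustrative):

```python
import numpy as np

# Illustrative data (assumed, scalar for readability)
A = np.array([[0.9]]); B = np.array([[1.0]]); Xi = np.array([[1.0]])
Q = np.array([[1.0]]); R = np.array([[1.0]])
lam, w_bar = 10.0, np.array([0.5])

n = A.shape[0]
Phi = B @ np.linalg.inv(R) @ B.T - Xi @ Xi.T / lam

# Backward Riccati recursion (3.5) from P_T = 0, r_T = 0 until convergence
P, r = np.zeros((n, n)), np.zeros(n)
for _ in range(1000):
    M = np.linalg.inv(np.eye(n) + P @ Phi)      # (I + P_{t+1} Phi)^{-1}
    P, r = Q + A.T @ M @ P @ A, A.T @ M @ (P @ Xi @ w_bar + r)

# Closed form (3.12) for the limit of r_t, evaluated at the converged P
M = np.linalg.inv(np.eye(n) + P @ Phi)
r_ss = np.linalg.solve(np.eye(n) - A.T @ M, A.T @ M @ P @ Xi @ w_bar)
```

The iterated $r$ and the closed-form $r_{ss}$ agree, since (3.12) is exactly the fixed point of the $r_t$ update evaluated at $P_{ss}$.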
Corollary 3.
Suppose that Assumptions 1–4 hold. Then, the optimal policy $\pi_t^\star(x)$ converges to the steady-state policy $\pi_{ss}^\star(x) := K_{ss}x + L_{ss}$ as $T \to \infty$, where
$K_{ss} := -R^{-1}B^\top(I + P_{ss}\Phi)^{-1}P_{ss}A,$
$L_{ss} := -R^{-1}B^\top(I + P_{ss}\Phi)^{-1}(P_{ss}\Xi\bar{w} + r_{ss}).$
Furthermore, the policy generating the worst-case distribution converges to the steady-state policy
$\gamma_{ss}^\star(x) := \frac{1}{N}\sum_{i=1}^{N}\delta_{w^{\star,(i)}(x)}$  (3.13)
with $w^{\star,(i)}(x) := (\lambda I - \Xi^\top P_{ss}\Xi)^{-1}\big(\Xi^\top P_{ss}(Ax + BK_{ss}x + BL_{ss}) + \Xi^\top r_{ss} + \lambda\hat{w}^{(i)}\big).$

Proposition 3.
Suppose that Assumptions 1–4 hold. Then, the steady-state average cost
$\rho := \limsup_{T \to \infty}\min_{\pi \in \Pi}\max_{\gamma \in \Gamma} J^\lambda_{x,T}(\pi,\gamma)$
is given by
$\rho = \mathrm{tr}\big[(I - \Xi^\top P_{ss}\Xi/\lambda)^{-1}\Xi^\top P_{ss}\Xi\,\Sigma\big] + \bar{w}^\top\Xi^\top\big[(I + P_{ss}\Phi)^{-1} - (I - P_{ss}\Xi\Xi^\top/\lambda)^{-1}\big]P_{ss}\Xi\bar{w} + (2\bar{w}^\top\Xi^\top - r_{ss}^\top\Phi)(I + P_{ss}\Phi)^{-1}r_{ss},$
independent of the initial state $x$.

We now examine the optimality of the stationary policy pair $(\pi_{ss}^\star,\gamma_{ss}^\star)$ using the average cost criterion (3.8). Consider the following average cost problem with a Wasserstein penalty:
$\min_{\pi \in \Pi}\max_{\gamma \in \Gamma} J^\lambda_{x,\infty}(\pi,\gamma).$  (3.14)
The optimality equation for this problem can be obtained as follows:

Proposition 4.
Suppose that Assumptions 1–4 hold. Then, the following Bellman equation holds:
$\rho + h(x) = x^\top Q x + \inf_{u \in \mathbb{R}^m}\sup_{\mu \in \mathcal{P}(\mathbb{R}^k)}\Big[u^\top R u - \lambda W(\mu,\nu)^2 + \int_{\mathbb{R}^k} h(Ax + Bu + \Xi w)\,\mathrm{d}\mu(w)\Big],$  (3.15)
where $h(x) := x^\top P_{ss}x + 2r_{ss}^\top x$ and $\rho$ is the steady-state average cost defined in Proposition 3. Moreover, $(\pi_{ss}^\star(x),\gamma_{ss}^\star(x))$ is an optimal minimax solution of the problem on the right-hand side of (3.15).

In the Bellman equation (or the average cost optimality equation), $h$, called the bias, represents the transient cost, whereas $\rho$, called the gain, represents the stationary cost. We now introduce an extended average cost function including the bias $h$ as
$\tilde{J}^\lambda_{x,\infty}(\pi,\gamma;h) := \limsup_{T \to \infty}\frac{1}{T}\,\mathbb{E}^{\pi,\gamma}\Big[\sum_{t=0}^{T-1}\big(x_t^\top Q x_t + u_t^\top R u_t - \lambda W(\mu_t,\nu)^2\big) + h(x_T)\ \Big|\ x_0 = x\Big].$
Using the extended cost, we can show the average cost optimality of the stationary policy pair $(\pi_{ss}^\star,\gamma_{ss}^\star)$ in a way similar to the average cost LQG (e.g., [31, Section 5.6.5]).

Theorem 3.
Suppose that Assumptions 1–4 hold. Consider the steady-state policy pair $(\pi_{ss}^\star,\gamma_{ss}^\star)$ defined in Corollary 3. Then, the following properties hold:

(a) For any $(\pi,\gamma) \in \Pi \times \Gamma$,
$\tilde{J}^\lambda_{x,\infty}(\pi_{ss}^\star,\gamma;h) \le \rho \le \tilde{J}^\lambda_{x,\infty}(\pi,\gamma_{ss}^\star;h),$  (3.16)
where $\rho$ is the stationary cost defined in Proposition 3.

(b) The stationary policy pair $(\pi_{ss}^\star,\gamma_{ss}^\star)$ is optimal to
$\min_{\pi \in \Pi}\max_{\gamma \in \Gamma}\tilde{J}^\lambda_{x,\infty}(\pi,\gamma;h).$  (3.17)
Moreover, the optimal value of this problem is equal to $\rho$.

(It follows from the definition of $\Pi$ that $(\pi_{ss}^\star,\pi_{ss}^\star,\ldots) \in \Pi$. However, with a slight abuse of notation, we simply denote it as $\pi_{ss}^\star$ and regard the stationary policy $\pi_{ss}^\star$ (resp. $\gamma_{ss}^\star$) as an element of $\Pi$ (resp. $\Gamma$).)

(c) The stationary policy pair $(\pi_{ss}^\star,\gamma_{ss}^\star)$ is optimal to
$\min_{\pi \in \bar\Pi}\max_{\gamma \in \bar\Gamma} J^\lambda_{x,\infty}(\pi,\gamma)$  (3.18)
for any policy spaces $\bar\Pi \subset \Pi$ and $\bar\Gamma \subset \Gamma$ satisfying
$\limsup_{T \to \infty}\frac{1}{T}\,\mathbb{E}^{\pi_{ss}^\star,\gamma}[h(x_T) \mid x_0 = x] = 0 \quad \forall \gamma \in \bar\Gamma$  (3.19a)
$\limsup_{T \to \infty}\frac{1}{T}\,\mathbb{E}^{\pi,\gamma_{ss}^\star}[h(x_T) \mid x_0 = x] = 0 \quad \forall \pi \in \bar\Pi.$  (3.19b)
Moreover, the optimal value of this problem is equal to $\rho$.

Theorem 3 guarantees the optimality of $(\pi_{ss}^\star,\gamma_{ss}^\star)$ in the average cost case under conditions (3.19a) and (3.19b). Note that if the mean state is bounded under $(\pi_{ss}^\star,\gamma)$ so that $\limsup_{T \to \infty}\|\mathbb{E}^{\pi_{ss}^\star,\gamma}[x_T]\| < \infty$, then condition (3.19a) is automatically satisfied. In the following subsection, we show that $\pi_{ss}^\star$ is BIBO stable, and thereby any $\gamma$ generating a bounded-mean distribution can be contained in $\bar\Gamma$.

Condition (3.19b) is a generalization of the one required in the standard average cost LQG to guarantee the optimality of such a steady-state policy. If $h(x_T)$ is bounded uniformly over all stages under some policy $\pi$, it clearly satisfies the condition.
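The bounded-mean condition invoked above is easy to check numerically: with the steady-state gains, the mean-state recursion is driven by a stable closed-loop matrix, so trajectories of $\mathbb{E}[x_t]$ remain bounded for bounded-mean disturbances. A sketch on assumed scalar data (all numbers illustrative), driving the mean dynamics with the empirical mean $\bar{w}$:

```python
import numpy as np

# Illustrative data (assumed); steady-state quantities via the recursion
A = np.array([[0.9]]); B = np.array([[1.0]]); Xi = np.array([[1.0]])
Q = np.array([[1.0]]); R = np.array([[1.0]])
lam, w_bar = 10.0, np.array([0.5])
n = A.shape[0]
Phi = B @ np.linalg.inv(R) @ B.T - Xi @ Xi.T / lam

P = np.zeros((n, n))
for _ in range(1000):
    P = Q + A.T @ np.linalg.solve(np.eye(n) + P @ Phi, P @ A)
M = np.linalg.inv(np.eye(n) + P @ Phi)
r = np.linalg.solve(np.eye(n) - A.T @ M, A.T @ M @ P @ Xi @ w_bar)

# Steady-state gains (Corollary 3) and the closed-loop mean dynamics (3.21)
K = -np.linalg.inv(R) @ B.T @ M @ P @ A
L = -np.linalg.inv(R) @ B.T @ M @ (P @ Xi @ w_bar + r)
rho_cl = max(abs(np.linalg.eigvals(A + B @ K)))   # spectral radius

x = np.array([5.0])                               # E[x_0]
for _ in range(300):
    x = (A + B @ K) @ x + Xi @ w_bar + B @ L      # mean state stays bounded
```

Since the spectral radius of $A + BK_{ss}$ is below one here, the mean state converges to a fixed point, so $\tfrac{1}{T}h(\mathbb{E}[x_T]) \to 0$ along this trajectory.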
We now discuss the stability properties of the closed-loop system
$x_{t+1}^\star = (A + BK_{ss})x_t^\star + \Xi w_t + BL_{ss}$  (3.20)
when the optimal policy $\pi_{ss}^\star$ is employed. Our first result concerns the expected value of the closed-loop system state $\mathbb{E}[x_t^\star]$, which evolves according to
$\mathbb{E}[x_{t+1}^\star] = (A + BK_{ss})\,\mathbb{E}[x_t^\star] + \Xi\,\mathbb{E}[w_t] + BL_{ss}.$  (3.21)

Theorem 4.
Suppose that Assumptions 1–4 hold. Under the optimal policy pair $(\pi_{ss}^\star,\gamma_{ss}^\star)$, the expected state of the closed-loop system (3.20) converges to the following value:
$\big[I - (I + \Phi P_{ss})^{-1}A\big]^{-1}\big[I - \Phi(I + P_{ss}\Phi - A^\top)^{-1}P_{ss}\big]\,\Xi\bar{w}.$
Thus, if in addition $\bar{w} = 0$, then the system is linear and $\pi_{ss}^\star$ stabilizes the expected state under $\gamma_{ss}^\star$.

We can further show that $\pi_{ss}^\star$ guarantees bounded-input, bounded-state (BIBO) stability when viewing the disturbance as the input.

Theorem 5.
Suppose that Assumptions 1–4 hold. Then, the closed-loop gain $A + BK_{ss}$ is a stable matrix. Therefore, the mean-state system (3.21) with $\pi_{ss}^\star$ is BIBO stable.

In the previous section, the minimax control problem with a Wasserstein penalty was studied in both finite- and infinite-horizon settings. These results can be used to design a guaranteed-cost controller in the distributionally robust control setting with Wasserstein ambiguity sets (2.6), as previewed in Section 2.3. We first show that the total cost under the worst-case distribution in the ambiguity set is bounded as follows:
Lemma 6.
For any $\pi \in \Pi$, we have
$\sup_{\gamma \in \Gamma_{\mathbb{D}}} J_x(\pi,\gamma) \le \inf_{\lambda \ge 0}\sup_{\gamma \in \Gamma}\big(\lambda\theta^2 + J^\lambda_x(\pi,\gamma)\big) \quad \forall x \in \mathbb{R}^n.$  (4.1)

It is nontrivial to compute the upper bound for an arbitrary $\pi$. However, if the optimal policy $\pi^\star$ of our minimax control problem (2.5) is employed, this bound has a tractable form, which is evaluated using the optimal value function of (2.5).

Theorem 6.
Let $\pi^{\star,\lambda}$ be the optimal policy of (2.5) with $\lambda \ge 0$. Then, the cost incurred by $\pi^{\star,\lambda}$ under the worst-case distribution policy in $\Gamma_{\mathbb{D}}$ is bounded as follows:
$\sup_{\gamma \in \Gamma_{\mathbb{D}}} J_x(\pi^{\star,\lambda},\gamma) \le \lambda\theta^2 + V(x;\lambda) \quad \forall x \in \mathbb{R}^n.$

The dependence of the upper bound on the penalty parameter $\lambda$ indicates that the distributional robustness of our policy $\pi^{\star,\lambda}$ can be controlled by tuning $\lambda$. This theorem can be used to select an optimal penalty parameter $\lambda^\star$ that provides the least upper bound. Given $x \in \mathbb{R}^n$, let $\lambda^\star$ be defined by
$\lambda^\star \in \arg\min_{\lambda \ge 0}\big(\lambda\theta^2 + V(x;\lambda)\big).$  (4.2)
Then, the cost incurred by $\pi^{\star,\lambda^\star}$ under the worst-case distributions in the ambiguity sets is bounded as follows:
$\sup_{\gamma \in \Gamma_{\mathbb{D}}} J_x(\pi^{\star,\lambda^\star},\gamma) \le \lambda^\star\theta^2 + V(x;\lambda^\star).$
To solve the minimization problem in (4.2), we first identify some structural properties of $V(x;\lambda)$ using the results in Section 3.1.

Lemma 7.
Let $P^\lambda_t$, $r^\lambda_t$ and $z^\lambda_t$, $t = 0, \ldots, T$, be obtained by the Riccati equation (3.5) with $P^\lambda_T = Q_f$, $r^\lambda_T = 0$, and $z^\lambda_T = 0$, given $\lambda \ge 0$. Let $\hat{\lambda} := \inf\{\lambda \mid \lambda I - \Xi^\top P^\lambda_t \Xi \succ 0, \; t = 1, \ldots, T\}$. Then,
\[
V(x; \lambda) = \begin{cases} +\infty & \text{if } \lambda \in [0, \hat{\lambda}) \\ c & \text{if } \lambda = \hat{\lambda} \\ c(\lambda) & \text{if } \lambda \in (\hat{\lambda}, \infty), \end{cases}
\]
where $c(\lambda) := (x^\top P^\lambda_0 x + 2 (r^\lambda_0)^\top x + z^\lambda_0)/T$ and $c$ is a constant satisfying the boundary condition $c \ge c(\hat{\lambda} + \epsilon)$ for all $\epsilon > 0$.

This structural property of the optimal value function yields the following simple way to find a minimizer $\lambda^\star$ of (4.2).

Proposition 5.
Suppose that
\[
\lambda^* \in \arg\min_{\lambda > \hat{\lambda}} \Big[\lambda \theta + \frac{1}{T}\big(x^\top P^\lambda_0 x + 2 (r^\lambda_0)^\top x + z^\lambda_0\big)\Big]. \tag{4.3}
\]
Then, $\lambda^*$ is a minimizer of the optimization problem (4.2), i.e., $\lambda^* = \lambda^\star$. Moreover, the optimization problem (4.3) is convex.

As shown in the proof of Lemma 7, $\hat{\lambda}$ is the unique boundary point separating the values of $\lambda$ that satisfy Assumption 1 from those that do not. Therefore, $\hat{\lambda}$ can be obtained by binary search. An optimal $\lambda^\star$ can then be computed by solving (4.3) with existing convex optimization algorithms.

In this subsection, we examine the distributional robustness of the steady-state optimal policy using the following average cost criterion:
\[
J_{x,\infty}(\pi, \gamma) := \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi,\gamma}\Big[\sum_{t=0}^{T-1} (x_t^\top Q x_t + u_t^\top R u_t) \,\Big|\, x_0 = x\Big]. \tag{4.4}
\]
Fix any penalty parameter $\lambda > 0$. Let $P^\lambda_{ss}$ be the unique symmetric PSD solution of the ARE (3.9), and let $r^\lambda_{ss}$ be defined as in (3.12) with the given $\lambda$. The corresponding stationary cost in Proposition 3 is denoted by $\rho(\lambda)$. As in Section 3.2, we use an extended average cost function including the bias $h^\lambda(x) := x^\top P^\lambda_{ss} x + 2 (r^\lambda_{ss})^\top x$, defined as
\[
\tilde{J}_{x,\infty}(\pi, \gamma; h^\lambda) := \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi,\gamma}\Big[\sum_{t=0}^{T-1} (x_t^\top Q x_t + u_t^\top R u_t) + h^\lambda(x_T) \,\Big|\, x_0 = x\Big].
\]
Consider $\pi^{\star,\lambda}_{ss}$ obtained in Corollary 3 with the penalty parameter $\lambda$. The average cost incurred by this policy under the worst-case distribution in the Wasserstein ambiguity set $D := \{\mu \in \mathcal{P}(\mathbb{R}^k) \mid W(\mu, \nu) \le \theta\}$ is $\sup_{\gamma \in \Gamma_D} J_{x,\infty}(\pi^{\star,\lambda}_{ss}, \gamma)$. The worst-case cost is uniformly bounded by a constant depending on $\lambda$.

Theorem 7.
Suppose that Assumptions 1–4 hold for some penalty parameter $\lambda$. If the stationary policy $\pi^{\star,\lambda}_{ss}$ defined in Corollary 3 is employed, then the worst-case extended average cost is bounded as follows:
\[
\sup_{\gamma \in \Gamma_D} \tilde{J}_{x,\infty}(\pi^{\star,\lambda}_{ss}, \gamma; h^\lambda) \le \lambda \theta + \rho(\lambda) \quad \forall x \in \mathbb{R}^n. \tag{4.5}
\]
Moreover, for any policy space $\bar{\Gamma}_D \subset \Gamma_D$ satisfying
\[
\limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi^{\star,\lambda}_{ss}, \gamma}\big[h^\lambda(x_T) \mid x_0 = x\big] = 0 \quad \forall \gamma \in \bar{\Gamma}_D,
\]
the worst-case average cost has the same upper bound:
\[
\sup_{\gamma \in \bar{\Gamma}_D} J_{x,\infty}(\pi^{\star,\lambda}_{ss}, \gamma) \le \lambda \theta + \rho(\lambda) \quad \forall x \in \mathbb{R}^n. \tag{4.6}
\]
This theorem establishes the robustness of our policy $\pi^{\star,\lambda}_{ss}$ against any distribution errors within the Wasserstein ball. Since the upper bound depends on the penalty parameter $\lambda$, it is important to select an appropriate value of $\lambda$. Here, we present a way to obtain a suboptimal $\lambda$ that minimizes the upper bound over a certain range of $\lambda$.

Lemma 8.
Suppose that Assumptions 2 and 4 hold, and $(A, B)$ is stabilizable. Then, there exists $\hat{\lambda}_1 > 0$ such that Assumption 3 holds for any $\lambda > \hat{\lambda}_1$.

Proposition 6.
Suppose that Assumptions 2 and 4 hold, and $(A, B)$ is stabilizable. Let $\hat{\lambda}_1$ be the constant defined in Lemma 8, and assume that
\[
\hat{\lambda}_2 := \inf\{\lambda \mid \lambda I - \Xi^\top P^\lambda_t \Xi \succ 0 \;\; \forall t \ge 0, \; \forall T \ge 0\} \tag{4.7}
\]
is finite. Let $\hat{\lambda}_\infty := \max_{i=1,2} \{\hat{\lambda}_i\}$. Then, $\rho(\lambda)$ defined in Proposition 3 is a monotonically nonincreasing convex function on $(\hat{\lambda}_\infty, \infty)$. Moreover,
\[
\lim_{\lambda \to \infty} \rho(\lambda) = \mathrm{tr}[\Xi^\top \tilde{P}_{ss} \Xi \Sigma] - \bar{w}^\top \Xi^\top \tilde{P}_{ss} B R^{-1} B^\top (I + \tilde{P}_{ss} B R^{-1} B^\top)^{-1} \tilde{P}_{ss} \Xi \bar{w} + (2 \bar{w}^\top \Xi^\top - \tilde{r}_{ss}^\top B R^{-1} B^\top)(I + \tilde{P}_{ss} B R^{-1} B^\top)^{-1} \tilde{r}_{ss}. \tag{4.8}
\]
Here, $\tilde{P}_{ss} := \lim_{T \to \infty} \tilde{P}_t$ and $\tilde{r}_{ss} := \lim_{T \to \infty} \tilde{r}_t$, where $\tilde{P}_t$ and $\tilde{r}_t$ are generated by the Riccati equation (3.7) for the standard LQG with $\tilde{P}_T = Q_f$, $\tilde{r}_T = 0$, and $\tilde{z}_T = 0$.

Under the assumptions in Proposition 6, one may consider the convex optimization problem
\[
\min_{\lambda > \hat{\lambda}_\infty} \big[\lambda \theta + \rho(\lambda)\big] \tag{4.9}
\]
to find a reasonably tight upper bound on the average cost. The values of $\hat{\lambda}_1$ in Lemma 8 and $\hat{\lambda}_2$ in Proposition 6 are required for computing $\hat{\lambda}_\infty$. The first parameter $\hat{\lambda}_1$ can be obtained by examining the eigenvalues of the relevant closed-loop matrices: reducing $\lambda$ from a sufficiently large value, one can compute the infimum of the values of $\lambda$ for which $\Phi \succeq 0$ and the closed-loop eigenvalues lie inside the unit circle. The second parameter $\hat{\lambda}_2$ can be considered as the infinite-horizon extension of $\hat{\lambda}$ in Lemma 7; as in the finite-horizon case, $\hat{\lambda}_2$ can be obtained via binary search. Note that the optimization problem (4.9) provides the least upper bound over the range of $\lambda$ on which the existence of $\rho(\lambda)$ is guaranteed. Nevertheless, when $\hat{\lambda}_\infty = \hat{\lambda}_2 \ge \hat{\lambda}_1$, (4.9) provides the least bound over nearly the entire range $\lambda \in [0, \hat{\lambda}_2) \cup (\hat{\lambda}_2, \infty)$, since the optimal value is $+\infty$ for $\lambda < \hat{\lambda}_2$.

An advantage of using the Wasserstein metric in distributionally robust control is that it yields a performance guarantee measured under new samples, independent of the sample $\hat{w} = (\hat{w}^{(1)}, \ldots, \hat{w}^{(N)})$ used in the controller design. Such an out-of-sample performance guarantee has been studied in the infinite-horizon discounted cost setting [12]. In this section, we extend this result to the finite-horizon and the infinite-horizon average cost cases. Throughout this subsection, we fix $\lambda > 0$.

Let $(\pi^\star_{\hat{w}}, \gamma^\star_{\hat{w}})$ denote the optimal policy pair of the finite-horizon minimax control problem (2.5) with sample $\hat{w} = (\hat{w}^{(1)}, \ldots, \hat{w}^{(N)})$ and the penalty parameter $\lambda$. The out-of-sample performance of $\pi^\star_{\hat{w}}$ is defined as
\[
\frac{1}{T}\, \mathbb{E}^{\pi^\star_{\hat{w}}}_{w \sim \mu}\Big[\sum_{t=0}^{T-1} (x_t^\top Q x_t + u_t^\top R u_t) + x_T^\top Q_f x_T \,\Big|\, x_0 = x\Big], \tag{4.10}
\]
where $\mu := \mu_0 \times \cdots \times \mu_{T-1} \in \mathcal{P}(\mathbb{R}^{k \times T})$ represents the true but unknown distribution of $w$. Note that it is intractable to evaluate the out-of-sample performance explicitly, since $\mu$ is unknown. As an alternative, the following probabilistic guarantee can be considered:
\[
\mu^N\Big\{\hat{w} : \frac{1}{T}\, \mathbb{E}^{\pi^\star_{\hat{w}}}_{w \sim \mu}\big[C_T(x, u) \mid x_0 = x\big] \le \lambda \theta + V(x; \lambda) \;\; \forall x \in \mathbb{R}^n\Big\} \ge 1 - \beta, \tag{4.11}
\]
where
\[
C_T(x, u) := x_T^\top Q_f x_T + \sum_{t=0}^{T-1} (x_t^\top Q x_t + u_t^\top R u_t)
\]
denotes the total cost, and $\beta \in (0, 1)$ represents an acceptable error probability. Our goal is to identify a condition on the size of the Wasserstein ambiguity set for the probabilistic out-of-sample performance guarantee to hold. To begin, we assume that $\mu$ has a light tail.

Assumption 5.
There exist $p > 0$ and $q > 0$ satisfying
\[
\int_{\mathbb{R}^k} \exp(p \|w\|^q)\, \mathrm{d}\mu_t(w) < \infty \quad \text{for } t = 0, \ldots, T-1.
\]
Under this assumption, the following measure concentration inequality holds for the Wasserstein metric [32, Theorem 2].
Lemma 9.
Suppose that Assumption 5 holds. Let $\nu_{\hat{w}_t}$ denote the empirical distribution (2.2) obtained using the sample $\hat{w}_t$. Then, for all $N \ge 1$, $\theta > 0$ and $t = 0, \ldots, T-1$, we have
\[
\mu^N_t\big\{\hat{w}_t : W(\mu_t, \nu_{\hat{w}_t}) \ge \theta\big\} \le c_1\big[b_1(N, \theta)\,\mathbb{1}_{\{\theta \le 1\}} + b_2(N, \theta)\,\mathbb{1}_{\{\theta > 1\}}\big],
\]
where
\[
b_1(N, \theta) := \begin{cases} \exp(-c_2 N \theta^2) & \text{if } k < 4 \\ \exp\big(-c_2 N (\theta / \log(2 + 1/\theta))^2\big) & \text{if } k = 4 \\ \exp(-c_2 N \theta^{k/2}) & \text{if } k > 4, \end{cases}
\]
and $b_2(N, \theta) := \exp(-c_2 N \theta^{q/2})$. The positive constants $c_1$ and $c_2$ depend only on $k$, $p$ and $q$.

This lemma provides a sufficient condition for the probabilistic out-of-sample performance guarantee (4.11).
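The decay of the Wasserstein distance between the empirical and true distributions, which underlies this concentration inequality, can be observed numerically. The sketch below is an illustration only (not part of the analysis): in one dimension, the order-2 Wasserstein distance between an empirical distribution and $\mathcal{N}(0,1)$ can be approximated by matching the sorted sample atoms to standard normal quantiles.

```python
import numpy as np
from scipy.stats import norm

def w2_to_normal(sample):
    """Approximate 1-D W2 distance from the empirical distribution of
    `sample` to N(0,1): the optimal coupling in 1-D is monotone, so the
    sorted atoms are matched to equally spaced standard-normal quantiles."""
    n = len(sample)
    q = norm.ppf((np.arange(n) + 0.5) / n)
    return np.sqrt(np.mean((np.sort(sample) - q) ** 2))

rng = np.random.default_rng(0)
# average distance over 50 repetitions for two sample sizes
avg = {N: np.mean([w2_to_normal(rng.normal(size=N)) for _ in range(50)])
       for N in (10, 1000)}
```

With the larger sample size, the empirical distribution sits much closer to the truth, consistent with the exponential concentration rates stated in Lemma 9.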
Theorem 8.
Suppose that Assumptions 1 and 5 hold. Let the radius $\theta$ be chosen as
\[
\theta(N, \beta) = \begin{cases} c^{2/q} & \text{if } c > 1 \\ c^{1/2} & \text{if } c \le 1 \wedge k < 4 \\ c^{2/k} & \text{if } c \le 1 \wedge k > 4 \\ \bar{\theta} & \text{if } c \le 1/(\log 3)^2 \wedge k = 4, \end{cases}
\]
where
\[
c := \frac{1}{N c_2} \log\Big(\frac{c_1}{1 - (1 - \beta)^{1/T}}\Big),
\]
$c_1$ and $c_2$ are the positive constants defined in Lemma 9, and $\bar{\theta}$ satisfies $\bar{\theta}/\log(2 + 1/\bar{\theta}) = c^{1/2}$. Then, the probabilistic out-of-sample performance guarantee (4.11) holds.

If each $\mu_t$ is compactly supported rather than having a light tail, a simpler concentration inequality holds, as proposed in [33, Proposition 3.2]. Define the diameter of a set $S \subset \mathbb{R}^k$ as $\mathrm{diam}(S) := \sup\{\|x - y\|_\infty \mid x, y \in S\}$, and let $\mathrm{supp}(\mu)$ denote the smallest closed set that has measure 1 under $\mu$. Then, using the same argument as in the proof of Theorem 8, we can show that the following guarantee holds.

Corollary 4.
Suppose that Assumption 1 holds and the $\mu_t$'s are compactly supported for all $t = 0, \ldots, T-1$, with $\zeta := \max\{\mathrm{diam}(\mathrm{supp}(\mu_t)) \mid t = 0, \ldots, T-1\}$. Let the radius $\theta$ be chosen as
\[
\theta(N, \beta) := \begin{cases} c^{1/2}\, \zeta & \text{if } k < 4 \\ c^{2/k}\, \zeta & \text{if } k > 4 \\ \bar{\theta} & \text{if } k = 4, \end{cases}
\]
where
\[
c := \frac{1}{N c_2} \log\Big(\frac{c_1}{1 - (1 - \beta)^{1/T}}\Big),
\]
$c_1$ and $c_2$ are positive constants depending only on $k$, and $\bar{\theta}$ satisfies $\bar{\theta}/(\zeta \log(2 + \zeta/\bar{\theta})) = c^{1/2}$. Then, the probabilistic out-of-sample performance guarantee (4.11) holds.
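In practice, the constants $c_1$ and $c_2$ are rarely available in closed form, but the case analysis in Theorem 8 is mechanical once they are treated as inputs. The following sketch computes a radius of this type; the function and its bisection tolerance are illustrative assumptions, with $c_1, c_2$ supplied by the user rather than derived.

```python
import math

def radius(N, beta, T, k, q, c1, c2):
    """Ambiguity-set radius theta(N, beta) in the style of Theorem 8.
    c1, c2 are the (distribution-dependent) concentration constants of
    Lemma 9, passed in as assumptions."""
    c = math.log(c1 / (1.0 - (1.0 - beta) ** (1.0 / T))) / (N * c2)
    if c > 1.0:
        return c ** (2.0 / q)          # large-radius branch (theta > 1)
    if k < 4:
        return c ** 0.5
    if k > 4:
        return c ** (2.0 / k)
    # k == 4: solve theta / log(2 + 1/theta) = sqrt(c) by bisection on (0, 1]
    lo, hi = 1e-12, 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid / math.log(2.0 + 1.0 / mid) < math.sqrt(c):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

As expected from the formula, the radius shrinks as the sample size $N$ grows and expands as the acceptable error probability $\beta$ decreases.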
A potential disadvantage of directly employing the radius suggested in Theorem 8 or Corollary 4 is that the guaranteed upper bound $\lambda \theta(N, \beta) + V(x; \lambda)$ grows with the number of stages $T$. Specifically, $c$ increases logarithmically with $T$, since $1 - (1 - \beta)^{1/T} \approx \beta/T$ when $T$ is sufficiently large. The stationarity assumption on $\mu_t$ can be used to alleviate this issue. For instance, if $T' \in [1, T]$ stages share a stationary probability distribution and we therefore use only one sample set for those $T'$ stages, then we can reduce $T$ to $T - T' + 1$ in our formula for $\theta(N, \beta)$. Thus, under Assumption 2, we can simply replace $T$ by 1.

We now consider the infinite-horizon case with the average cost criterion (4.4). Under Assumptions 1–4, let $\pi^\star_{ss,\hat{w}}$ denote the optimal policy obtained in Corollary 3 with sample $\hat{w} = (\hat{w}^{(1)}, \ldots, \hat{w}^{(N)})$. We assume that the true distribution is stationary, i.e., $\mu_t \equiv \mu$ for all $t$. In this average cost setting, our interest is in the following probabilistic bound on the out-of-sample performance of $\pi^\star_{ss,\hat{w}}$:
\[
\mu^N\Big\{\hat{w} : \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi^\star_{ss,\hat{w}}}_{w \sim \mu}\Big[\sum_{t=0}^{T-1} (x_t^\top Q x_t + u_t^\top R u_t) \,\Big|\, x_0 = x\Big] \le \lambda \theta + \rho(\lambda) \;\; \forall x \in \mathbb{R}^n\Big\} \ge 1 - \beta, \tag{4.12}
\]
where $\lambda \theta + \rho(\lambda)$ is the upper bound on the average cost in Theorem 7.

Corollary 5.
Suppose that Assumptions 1–5 hold. Let the radius $\theta$ be chosen as
\[
\theta(N, \beta) := \begin{cases} c^{2/q} & \text{if } c > 1 \\ c^{1/2} & \text{if } c \le 1 \wedge k < 4 \\ c^{2/k} & \text{if } c \le 1 \wedge k > 4 \\ \bar{\theta} & \text{if } c \le 1/(\log 3)^2 \wedge k = 4, \end{cases}
\]
where
\[
c := \frac{1}{N c_2} \log\Big(\frac{c_1}{\beta}\Big),
\]
$c_1$ and $c_2$ are the positive constants defined in Lemma 9, and $\bar{\theta}$ satisfies $\bar{\theta}/\log(2 + 1/\bar{\theta}) = c^{1/2}$. Then, the probabilistic out-of-sample performance guarantee (4.12) holds.

If $\mu$ is contained in $D$, then the policy $\gamma_t \equiv \mu$ must be contained in $\bar{\Gamma}_D$, implying that the guaranteed-cost property (4.6) holds. Since the rest of the proof is similar to that of Theorem 8, we omit it. Note that the radius in Corollary 5 can be obtained by letting $T = 1$ in Theorem 8, due to the stationarity of $\mu$ and $\nu$. The case of compactly supported distributions can be treated similarly to Corollary 4.

$H_\infty$-Optimal Control

In this section, we discuss the relationship between our minimax control method and the $H_\infty$-method. For comparison, we consider the classical dynamic game formulation for minimizing the $H_\infty$-norm of the cost function with respect to the disturbance (e.g., [29]).

We first examine the finite-horizon case with the initial condition $x_0 = 0$. For $H_\infty$-control, we consider a modified dynamic game problem, where the opponent's policy $\tilde{\gamma}_t$ now maps the current state $x_t$ to a disturbance vector $w_t$ rather than to its distribution [29]. Note that the disturbance vector is no longer random in the $H_\infty$-setting. The set of admissible opponent policies is accordingly modified and is denoted by $\tilde{\Gamma}$. Consider the following quadratic cost function:
\[
\tilde{J}(\pi, \tilde{\gamma}) := \mathbb{E}^{\pi,\tilde{\gamma}}\Big[\sum_{t=0}^{T-1} (x_t^\top Q x_t + u_t^\top R u_t) + x_T^\top Q_f x_T \,\Big|\, x_0 = 0\Big].
\]
Given a control policy $\pi$, we seek the infimum of $\lambda > 0$ such that
\[
\sup_{\tilde{\gamma} \in \tilde{\Gamma} : \|w\| \le 1} \tilde{J}(\pi, \tilde{\gamma}) = \sup_{\tilde{\gamma} \in \tilde{\Gamma}} \frac{\tilde{J}(\pi, \tilde{\gamma})}{\|w\|^2} \le \lambda,
\]
where $\|w\|^2 := \sum_{t=0}^{T-1} \|w_t\|^2$.
The first equality holds since $\tilde{J}(\pi, \tilde{\gamma})$ is homogeneous with respect to $\|w\|^2$ when $x_0 = 0$. Note also that $\tilde{J}(\pi, \tilde{\gamma}) / \sum_{t=0}^{T-1} \|w_t\|^2 \le \lambda$ for all $\tilde{\gamma} \in \tilde{\Gamma}$ if and only if $\tilde{J}(\pi, \tilde{\gamma}) - \lambda \sum_{t=0}^{T-1} \|w_t\|^2 \le 0$ for all $\tilde{\gamma} \in \tilde{\Gamma}$. Thus, the inequality above can be rewritten as
\[
\sup_{\tilde{\gamma} \in \tilde{\Gamma}} \Big[\tilde{J}(\pi, \tilde{\gamma}) - \lambda \sum_{t=0}^{T-1} \|w_t\|^2\Big] \le 0.
\]
This motivates us to consider the following augmented cost function with an additional disturbance-norm term:
\[
\tilde{J}^\lambda(\pi, \tilde{\gamma}) := \mathbb{E}^{\pi,\tilde{\gamma}}\Big[\sum_{t=0}^{T-1} (x_t^\top Q x_t + u_t^\top R u_t - \lambda \|w_t\|^2) + x_T^\top Q_f x_T \,\Big|\, x_0 = 0\Big], \qquad \tilde{J}^{\lambda,\star} := \inf_{\pi \in \Pi} \sup_{\tilde{\gamma} \in \tilde{\Gamma}} \tilde{J}^\lambda(\pi, \tilde{\gamma}).
\]
Let $\Lambda := \{\lambda \mid \tilde{J}^{\lambda,\star} \le 0\}$. The desired $\lambda^\star$ can be obtained as $\lambda^\star := \inf\{\lambda \mid \lambda \in \Lambda\}$. More details on the dynamic game formulation of $H_\infty$-control can be found in [29, Section 1.4]. Let $\tilde{V}_t : \mathbb{R}^n \to \mathbb{R}$ denote the value function of this problem. The dynamic programming principle gives the following Bellman equation:
\[
\tilde{V}_t(x) = x^\top Q x + \inf_{u \in \mathbb{R}^m}\Big[u^\top R u + \sup_{w \in \mathbb{R}^k}\big\{\tilde{V}_{t+1}(A x + B u + \Xi w) - \lambda \|w\|^2\big\}\Big]
\]
with $\tilde{V}_T(x) := x^\top Q_f x$. If we parameterize $\tilde{V}_t(x) = x^\top P_t x$ under Assumption 1, the Riccati equation and the control gain $K_t$ are obtained as in Section 3.1. Note that there are no $r_t$, $z_t$, and $L_t$ terms in $H_\infty$-control. The worst-case disturbance policy is then given by
\[
\tilde{\gamma}^\star_t(x) = (\lambda I - \Xi^\top P_{t+1} \Xi)^{-1} \Xi^\top P_{t+1} (A + B K_t)\, x.
\]
Since $x_0 = 0$, we deduce that $\tilde{J}^{\lambda,\star} = \tilde{V}_0(0) = 0$. In this case, any $\lambda$ satisfying Assumption 1 must be contained in $\Lambda$. However, if $\lambda$ does not satisfy Assumption 1, then the cost value would be $+\infty$, and thus $\lambda$ cannot belong to $\Lambda$.
Thus, we conclude that $\lambda^\star$ is the infimum of the values of $\lambda$ satisfying Assumption 1. See [29, Section 3.3] for further details on the optimal disturbance attenuation level $\lambda^\star$ of $H_\infty$-control.

The worst-case disturbance $\tilde{\gamma}^\star_t(x)$ in the $H_\infty$-method is related to the support elements $w^{\star,(i)}_t(x)$ of the worst-case distribution in Corollary 1 of our method as follows:
\[
w^{\star,(i)}_t(x) = \tilde{\gamma}^\star_t(x) + (\lambda I - \Xi^\top P_{t+1} \Xi)^{-1}\big(\Xi^\top P_{t+1} B L_t + \Xi^\top r_{t+1} + \lambda \hat{w}^{(i)}_t\big).
\]
This indicates that each support element of the worst-case distribution in Corollary 1 can be viewed as $\tilde{\gamma}^\star_t(x)$ shifted by terms generated from the sample data $\hat{w}^{(i)}_t$ and from $L_t$ and $r_{t+1}$. Thus, our minimax control method with the Wasserstein distance can be understood as a distributional generalization of the $H_\infty$-method.

In the infinite-horizon case, the corresponding $H_\infty$-controller can be obtained using a limiting solution of the Riccati equation. This yields the same ARE as (3.9) for our minimax control method [29, Section 3.4]. Under Assumptions 1–4, the ARE has a symmetric PSD solution $P_{ss}$, from which we obtain the same control gain $K_{ss}$. Regarding the worst-case disturbance, we have
\[
\tilde{\gamma}^\star(x) = (\lambda I - \Xi^\top P_{ss} \Xi)^{-1} \Xi^\top P_{ss} (A + B K_{ss})\, x.
\]
Thus, the worst-case disturbance in the $H_\infty$-method is related to our worst-case distribution $\gamma^\star_{ss}(x) := \frac{1}{N} \sum_{i=1}^N \delta_{w^{\star,(i)}(x)}$ through
\[
w^{\star,(i)}(x) = \tilde{\gamma}^\star(x) + (\lambda I - \Xi^\top P_{ss} \Xi)^{-1}\big(\Xi^\top P_{ss} B L_{ss} + \Xi^\top r_{ss} + \lambda \hat{w}^{(i)}\big).
\]
If the sample mean is zero, i.e., $\frac{1}{N}\sum_{i=1}^N \hat{w}^{(i)}_t = 0$ for all stages, then $\tilde{\gamma}^\star_t(x)$ corresponds to the mean value of the worst-case distribution.

This connection to the $H_\infty$-method enables us to analyze our controller using classical robust stability results.
Consider a dynamical system of the form
\[
x_{t+1} = A x_t + B u_t + \Xi w_t, \qquad z_t = \begin{bmatrix} Q^{1/2} \\ 0 \end{bmatrix} x_t + \begin{bmatrix} 0 \\ R^{1/2} \end{bmatrix} u_t,
\]
where $z_t$ is the error output and $w_t$ is the disturbance input. Let $T_\pi$ denote the closed-loop transfer function from the input $w$ to the output $z$ under the control policy $\pi$. As mentioned above, our minimax controller is equivalent to the $H_\infty$-controller when $x_0 = 0$ and the empirical distribution has zero mean, i.e., $\bar{w} = 0$. This equivalence yields the following robust stability property of our controller (see, e.g., [29] and [34] for further details on robust stability).

Proposition 7.
Suppose that Assumptions 1–4 hold, $x_0 = 0$, and $\bar{w} = 0$. Then, the optimal minimax control policy $\pi^\star_{ss}$ in Corollary 3 satisfies the following robust stability property:
\[
\|T_{\pi^\star_{ss}}\|_\infty := \sup_{\omega \in \mathbb{R}} \bar{\sigma}\big(T_{\pi^\star_{ss}}(j\omega)\big) \le \sqrt{\lambda},
\]
where $\bar{\sigma}(T(j\omega))$ denotes the largest singular value of $T(j\omega)$.

Conversely, our stochastic interpretation of the classical $H_\infty$-method enables us to analyze the $H_\infty$-controller from the distributional perspective, particularly when samples of $w_t$ are available. For instance, in such data-driven scenarios, one can obtain a probabilistic performance guarantee for the $H_\infty$-controller using the out-of-sample performance results of the previous section. More precisely, for the $H_\infty$-controller with a fixed $\lambda$, the out-of-sample performance satisfies the probabilistic bound (4.12).

In this section, the performance of our minimax control method is demonstrated through a power system frequency regulation problem. Stability is an important issue in power transmission systems, as the penetration of variable renewable energy sources and the potential for data integrity attacks increase. We apply the minimax control method to the IEEE 39-bus system, which models the New England power grid and has frequently been used to evaluate frequency control methods (e.g., [35, 36]). This model consists of 39 buses, 46 lines, and 10 generators. For simplicity, we use a classical generator model without an excitation system, such as a power system stabilizer or an automatic voltage regulator.

Let $\delta_i$ and $\omega_i$ denote the rotor angle and the frequency of the $i$th generator. Then, $\dot{\delta}_i = \omega_i - \omega_s$, where $\omega_s$ is a constant synchronous speed.
The electromechanical swing equation for the $i$th generator is given by the following damped oscillator:
\[
\frac{2 H_i}{\omega_s} \dot{\omega}_i = P_i - d_i \omega_i - \sum_{j \ne i} |Y_{ij}| E_i E_j \sin(\delta_i - \delta_j),
\]
where $H_i$, $P_i$, $d_i$, and $E_i$ denote the inertia, the power injection, the damping coefficient, and the voltage of the $i$th generator, and $Y$ denotes the admittance matrix of the power network. Linearizing the swing equations at an operating point $(\delta^*, \omega^*)$ yields
\[
M \Delta\ddot{\delta} + D \Delta\dot{\delta} + L \Delta\delta = \Delta P,
\]
where $M := (2/\omega_s)\,\mathrm{diag}(H)$, $D := \mathrm{diag}(d)$, and the Kron-reduced Laplacian matrix $L$ is defined by $L_{ij} := -|Y_{ij}| E_i E_j \cos(\delta^*_i - \delta^*_j)$ for $i \ne j$ and $L_{ii} := -\sum_{j \ne i} L_{ij}$. The second-order ordinary differential equation can be expressed in the following state-space form:
\[
\begin{bmatrix} \Delta\dot{\delta} \\ \Delta\dot{\omega} \end{bmatrix} = \underbrace{\begin{bmatrix} 0 & I \\ -M^{-1} L & -M^{-1} D \end{bmatrix}}_{=:A} \begin{bmatrix} \Delta\delta \\ \Delta\omega \end{bmatrix} + \underbrace{\begin{bmatrix} 0 \\ M^{-1} \end{bmatrix}}_{=:B} \Delta P,
\]
with system state $x(t) := (\Delta\delta^\top(t), \Delta\omega^\top(t))^\top \in \mathbb{R}^{20}$ and control input $u(t) := \Delta P(t) \in \mathbb{R}^{10}$.

[Figure 1: Box plots (1,000 test cases) of $\Delta\omega$, controlled by (a) the standard LQG method under the worst-case distribution generated with $\theta = 0.5$, (b) our minimax method under the worst-case distribution generated with $\theta = 0.5$, (c) the standard LQG method under the worst-case distribution generated with $\theta = 1$, and (d) our minimax method under the worst-case distribution generated with $\theta = 1$.]

To model uncertainty in power injection or net demand, a disturbance $w(t)$ is assumed to be added to the input $u(t)$; thus, $\Xi = B$. For the quadratic cost function, we set $x^\top Q x = \Delta\delta^\top (I - \frac{1}{10}\mathbb{1}\mathbb{1}^\top) \Delta\delta + \Delta\omega^\top \Delta\omega$ and $R = I$, where $I$ denotes the $10 \times 10$ identity matrix and $\mathbb{1}$ denotes the 10-dimensional vector of all ones. The system is discretized by a zero-order-hold method with sample time 0.1 s. The frequency $\Delta\omega$ is initially perturbed, 10 samples of disturbances are generated according to a normal distribution, and the worst-case distribution in Corollary 1 is applied to the system in the finite-horizon setting with the number of stages $T = 150$.

Fig. 1 shows the box plots of the frequency $\Delta\omega$, controlled by the standard LQG and the proposed minimax control methods. The finite-horizon optimal policy (3.6) is used, where the optimal penalty coefficient $\lambda^\star$ is obtained using Proposition 5 with the ambiguity set radius $\theta = 0.5$ and $\theta = 1$. All the simulation codes and data can be downloaded at https://github.com/hahakhkim/WassersteinLQ.
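The construction of the state-space matrices from the swing-equation data can be written down directly. The following sketch builds $A$ and $B$ from inertia, damping, and Kron-reduced Laplacian data; the three-generator example values are hypothetical, and $\omega_s = 120\pi$ rad/s assumes a 60 Hz grid.

```python
import numpy as np

def swing_state_space(H, d, L, omega_s=120 * np.pi):
    """Linearized swing dynamics M ddot(delta) + D dot(delta) + L delta = dP,
    rewritten as x_dot = A x + B u with state x = (delta, omega)."""
    n = len(H)
    Minv = np.linalg.inv((2.0 / omega_s) * np.diag(H))
    A = np.block([[np.zeros((n, n)), np.eye(n)],
                  [-Minv @ L,        -Minv @ np.diag(d)]])
    B = np.vstack([np.zeros((n, n)), Minv])
    return A, B

# hypothetical three-generator example data
H = np.array([5.0, 4.0, 3.0])        # inertia constants
d = np.array([1.0, 1.0, 1.0])        # damping coefficients
L = np.array([[ 2.0, -1.0, -1.0],    # Kron-reduced Laplacian (rows sum to 0)
              [-1.0,  1.0,  0.0],
              [-1.0,  0.0,  1.0]])
A, B = swing_state_space(H, d, L)
```

With 10 generators, the same construction yields the $20$-dimensional state model used in the experiments, and $\Xi = B$ models disturbances entering through the power injections.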
[Figure 2: (a) Optimal value of $\lambda$ depending on $\theta$, (b) average control energy depending on $\theta$, and (c) reliability depending on $\theta$.]

As shown in Fig. 1, the proposed minimax method attenuates the frequency deviations compared to the standard LQG method. Additionally, the proposed control policy successfully drives the expected value of the system state to zero, while the standard LQG fails to do so. The results also show that the size of the Wasserstein ambiguity set, or equivalently the value of $\theta$, plays an important role in the performance of our method. As $\theta$ increases, the resulting policy with the penalty $\lambda^\star$ must guarantee the upper bound on the cost function over a larger ambiguity set; it is therefore robust against a wider range of distributions, and the worst-case distribution is selected as a more extreme one. As $\theta$ decreases, the worst-case distribution converges to the empirical distribution, and the robustness advantage of our policy over the standard LQG diminishes. Table 1 reports the settling time required for each generator to keep the mean frequency deviation below 3% of the initial deviation under the worst-case distribution with $\theta = 0.5$.

Fig. 2(a) shows the optimal penalty parameter $\lambda^\star$, obtained using Proposition 5, depending on the radius $\theta$. The value of $\lambda^\star$ decreases as $\theta$ increases and eventually converges to the infimum of the values of $\lambda$ satisfying Assumption 1. This observation is consistent with our intuition that the distributional robustness of the control policy can be tuned using the penalty parameter instead of $\theta$.

Fig. 2(b) shows the average control energy required by our method depending on the value of $\theta$. The control energy is measured over the first 5 seconds, i.e., $\sum_{t=0}^{49} \|u_t\|^2 / 50$, and is averaged over 1,000 test cases. As shown in Fig. 2(b), the required energy increases with $\theta$. If $\theta$ decreases, the required energy declines and eventually converges to the energy required by the standard LQG method. This implies that a tradeoff between robustness and control energy exists in our method. Therefore, the value of $\theta$ should be selected based on the reliability of the available data to balance robustness and control energy.

To test the out-of-sample performance of our control policy, the reliability
\[
\mu^N\Big\{\hat{w} : \frac{1}{T}\, \mathbb{E}^{\pi^\star_{\hat{w}}}_{w \sim \mu}\Big[\sum_{t=0}^{T-1} (x_t^\top Q x_t + u_t^\top R u_t) + x_T^\top Q_f x_T \,\Big|\, x_0 = x\Big] \le \lambda \theta + V(x; \lambda) \;\; \forall x \in \mathbb{R}^n\Big\}
\]
is computed using 10,000 independent simulations with sample size $N = 10$. As shown in Fig. 2(c), the reliability increases with $\theta$, as expected. More specifically, the reliability sharply increases over an intermediate range of $\theta$ and then saturates as $\theta$ increases further. Given that the control energy also increases with $\theta$, it may be reasonable to choose $\theta$ near the beginning of this saturation region to attain a sufficiently robust policy that is not overly conservative.

We have presented a minimax LQ control method with a Wasserstein penalty to address the issue of ambiguity inherent in practical stochastic systems. Our method has several salient features, including (i) a closed-form expression for the optimal policy pair, (ii) the convergence of a Riccati equation to the unique symmetric PSD solution of the corresponding ARE, (iii) closed-loop stability, (iv) distributional robustness, and (v) an out-of-sample performance guarantee. The relation to the $H_\infty$-method indicates that our method may open an exciting avenue for future research connecting stochastic and robust control from the perspective of distributional robustness. Moreover, it remains future work to address partial observability and extensions to continuous-time settings.

A Proofs
A.1 Proof of Lemma 1
Proof.
The function $w \mapsto V_{t+1}(A x + B u + \Xi w) - \lambda \|\hat{w}^{(i)}_t - w\|^2$ is a strictly concave quadratic under the assumption $\lambda I - \Xi^\top P_{t+1} \Xi \succ 0$. Differentiating it with respect to $w$, we obtain the following optimality condition:
\[
w^{\star,(i)}_t = \frac{1}{2\lambda} \Xi^\top V'_{t+1}(A x + B u + \Xi w^{\star,(i)}_t) + \hat{w}^{(i)}_t, \tag{A.1}
\]
which directly yields (3.3). To solve the outer minimization problem in (3.1), we first differentiate the outer objective function with respect to $u$ to obtain
\[
2 R u + \frac{1}{N} \sum_{i=1}^N \Big[\Big(B + \Xi \frac{\partial w^{\star,(i)}_t}{\partial u}\Big)^\top V'_{t+1}(A x + B u + \Xi w^{\star,(i)}_t) + 2\lambda \Big(\frac{\partial w^{\star,(i)}_t}{\partial u}\Big)^\top (\hat{w}^{(i)}_t - w^{\star,(i)}_t)\Big] = 2 R u + 2 B^\top g_t(u),
\]
where
\[
g_t(u) := \frac{1}{2N} \sum_{i=1}^N V'_{t+1}(A x + B u + \Xi w^{\star,(i)}_t(u)) = P_{t+1}\Big(A x + B u + \frac{1}{N} \sum_{i=1}^N \Xi w^{\star,(i)}_t(u)\Big) + r_{t+1}. \tag{A.2}
\]
The Hessian of the outer objective function with respect to $u$ is then given by
\[
2\big[R + B^\top P_{t+1} B + B^\top P_{t+1} \Xi (\lambda I - \Xi^\top P_{t+1} \Xi)^{-1} \Xi^\top P_{t+1} B\big],
\]
which is positive definite under the assumption on the penalty parameter. Thus, the outer objective function is strictly convex and has a unique minimizer $u^\star$. Equating the derivative to zero yields
\[
u^\star = -R^{-1} B^\top g^\star_t, \tag{A.3}
\]
where $g^\star_t := g_t(u^\star)$. By the definition of $g_t$ and (A.1),
\[
g^\star_t = P_{t+1}\Big(A x - B R^{-1} B^\top g^\star_t + \frac{1}{\lambda} \Xi \Xi^\top g^\star_t + \Xi \sum_{i=1}^N \frac{\hat{w}^{(i)}_t}{N}\Big) + r_{t+1},
\]
which yields the following expression for $g^\star_t$:
\[
g^\star_t = \Big(I + P_{t+1} B R^{-1} B^\top - \frac{1}{\lambda} P_{t+1} \Xi \Xi^\top\Big)^{-1} \Big(P_{t+1} A x + P_{t+1} \Xi \sum_{i=1}^N \frac{\hat{w}^{(i)}_t}{N} + r_{t+1}\Big). \tag{A.4}
\]
Note that $I + P_{t+1} B R^{-1} B^\top - \frac{1}{\lambda} P_{t+1} \Xi \Xi^\top$ must be invertible by the uniqueness of $u^\star$.

A.2 Proof of Theorem 1
Proof.
We use mathematical induction to show that $V_t(x) = x^\top P_t x + 2 r_t^\top x + z_t$. For $t = T$, the statement is true by the definition of $P_T$, $r_T$, and $z_T$. Suppose that the induction hypothesis is true for $t+1$, i.e., $V_{t+1}(x) = x^\top P_{t+1} x + 2 r_{t+1}^\top x + z_{t+1}$. Recall that $g^\star_t := g_t(u^\star)$, where $g_t$ is given in (A.2). Differentiating (3.1) with respect to $x$ and using (A.1) and (A.3), we obtain
\[
\begin{aligned}
V'_t(x) &= 2 Q x + 2 \Big(\frac{\partial u^\star}{\partial x}\Big)^\top R u^\star + \frac{1}{N} \sum_{i=1}^N \Big[\Big(A + B \frac{\partial u^\star}{\partial x} + \Xi \frac{\partial w^{\star,(i)}}{\partial x}\Big)^\top V'_{t+1}(A x + B u^\star + \Xi w^{\star,(i)}) + 2\lambda \Big(\frac{\partial w^{\star,(i)}}{\partial x}\Big)^\top (\hat{w}^{(i)}_t - w^{\star,(i)})\Big] \\
&= 2 Q x + \frac{A^\top}{N} \sum_{i=1}^N V'_{t+1}(A x + B u^\star + \Xi w^{\star,(i)}) + \Big(\frac{\partial u^\star}{\partial x}\Big)^\top \Big[2 R u^\star + \frac{B^\top}{N} \sum_{i=1}^N V'_{t+1}(A x + B u^\star + \Xi w^{\star,(i)})\Big] \\
&= 2 Q x + 2 A^\top g^\star_t,
\end{aligned}
\]
where $u^\star$ is given by (3.4) and $w^{\star,(i)}$ is given by (3.3) with $u := u^\star$. Replacing $g^\star_t$ with (A.4) yields
\[
Q x + A^\top g^\star_t = P_t x + r_t \tag{A.5}
\]
by the recursions for $P_t$ and $r_t$ in the Riccati equation (3.5). Thus, $\frac{1}{2} V'_t(x) = P_t x + r_t$, which implies that $V_t(x) = x^\top P_t x + 2 r_t^\top x + z'_t$ for some constant $z'_t \in \mathbb{R}$. Plugging $u^\star = K_t x + L_t$ into (3.1) yields
\[
z'_t = L_t^\top R L_t + z_{t+1} + \frac{1}{N} \sum_{i=1}^N \big[(B L_t + \Xi w^{\star,(i)})^\top P_{t+1} (B L_t + \Xi w^{\star,(i)}) + 2 r_{t+1}^\top (B L_t + \Xi w^{\star,(i)}) - \lambda \|\hat{w}^{(i)}_t - w^{\star,(i)}\|^2\big]. \tag{A.6}
\]
For simplicity, let $\alpha_i := w^{\star,(i)} - \hat{w}^{(i)}_t$ and $\beta_i := B L_t + \Xi \hat{w}^{(i)}_t$.
Then, each term in the summation can be written as
\[
(\Xi \alpha_i + \beta_i)^\top P_{t+1} (\Xi \alpha_i + \beta_i) + 2 r_{t+1}^\top (\Xi \alpha_i + \beta_i) - \lambda \alpha_i^\top \alpha_i = \beta_i^\top P_{t+1} \beta_i + 2 \alpha_i^\top \Xi^\top (P_{t+1} \beta_i + r_{t+1}) + 2 r_{t+1}^\top \beta_i - \alpha_i^\top (\lambda I - \Xi^\top P_{t+1} \Xi) \alpha_i.
\]
It follows from (3.3) that the constant part of $\alpha_i$ (with respect to $x$) is given by $(\lambda I - \Xi^\top P_{t+1} \Xi)^{-1} \Xi^\top (P_{t+1} \beta_i + r_{t+1})$. Plugging it into the equality above, we have
\[
\beta_i^\top P_{t+1} \beta_i + (P_{t+1} \beta_i + r_{t+1})^\top \Xi (\lambda I - \Xi^\top P_{t+1} \Xi)^{-1} \Xi^\top (P_{t+1} \beta_i + r_{t+1}) + 2 r_{t+1}^\top \beta_i = \beta_i^\top P_{t+1} \Big(I - \frac{1}{\lambda} \Xi \Xi^\top P_{t+1}\Big)^{-1} \beta_i + 2 r_{t+1}^\top \Big[I - \frac{1}{\lambda} \Xi \Xi^\top P_{t+1}\Big]^{-1} \beta_i + r_{t+1}^\top \Xi (\lambda I - \Xi^\top P_{t+1} \Xi)^{-1} \Xi^\top r_{t+1}.
\]
Substituting $\beta_i$ with $B L_t + \Xi \hat{w}^{(i)}_t$, and $L_t$ with $-R^{-1} B^\top (I + P_{t+1} \Phi)^{-1} (P_{t+1} \Xi \bar{w}_t + r_{t+1})$, (A.6) can be expressed as
\[
z'_t = z_{t+1} + \mathrm{tr}\big[(I - \Xi^\top P_{t+1} \Xi / \lambda)^{-1} \Xi^\top P_{t+1} \Xi \Sigma_t\big] + \bar{w}_t^\top \Xi^\top \big[(I + P_{t+1} \Phi)^{-1} - (I - P_{t+1} \Xi \Xi^\top / \lambda)^{-1}\big] P_{t+1} \Xi \bar{w}_t + (2 \bar{w}_t^\top \Xi^\top - r_{t+1}^\top \Phi)(I + P_{t+1} \Phi)^{-1} r_{t+1},
\]
where we have omitted the detailed algebra. Finally, by the recursion for $z_t$ in (3.5), we deduce that $z'_t = z_t$. This completes the inductive argument. Lastly, it follows from Lemma 1 that the optimal policy is unique and is obtained as (3.6).

A.3 Proof of Proposition 1
Proof.
In the standard LQG setting, it is well known that if $(A, B)$ is stabilizable, the Riccati equation (3.7) has a bounded limiting solution, which coincides with a symmetric PSD solution of an associated ARE [37, Theorem 2.4-1]. Note that the ARE (3.9) can be rewritten as
\[
P = Q + A^\top (I + P \Phi)^{-1} P A = Q + A^\top \big(I + P \underbrace{\sqrt{\Phi}}_{B'} \underbrace{I^{-1}}_{R'^{-1}} \sqrt{\Phi}^\top\big)^{-1} P A = Q + A^\top P A - A^\top P B' (R' + B'^\top P B')^{-1} B'^\top P A,
\]
which is in the form of the standard ARE. Thus, our ARE (3.9) is obtained by replacing $(A, B)$ with $(A, \sqrt{\Phi})$ in the ARE for the standard LQG, and the result follows.
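This substitution also gives a practical way to solve the ARE (3.9) numerically with an off-the-shelf solver. The sketch below assumes $\Phi \succeq 0$ so that a real symmetric square root exists (true in the example because $\lambda$ is large); it feeds $(A, \sqrt{\Phi}, Q, I)$ to SciPy's discrete ARE solver and verifies the fixed-point equation.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def solve_minimax_are(A, B, Q, R, Xi, lam):
    """Solve P = Q + A^T (I + P Phi)^{-1} P A with
    Phi = B R^{-1} B^T - Xi Xi^T / lam, by rewriting it as the standard
    discrete ARE with input matrix sqrt(Phi) and input weight R' = I."""
    n = A.shape[0]
    Phi = B @ np.linalg.inv(R) @ B.T - (Xi @ Xi.T) / lam
    w, V = np.linalg.eigh(Phi)          # symmetric square root (needs Phi PSD)
    sqrt_Phi = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    P = solve_discrete_are(A, sqrt_Phi, Q, np.eye(n))
    return P, Phi

# small illustrative system (hypothetical data)
A = np.array([[1.1, 0.1], [0.0, 0.9]])
B = np.eye(2); Q = np.eye(2); R = np.eye(2); Xi = 0.5 * np.eye(2)
P, Phi = solve_minimax_are(A, B, Q, R, Xi, lam=10.0)
residual = P - (Q + A.T @ np.linalg.inv(np.eye(2) + P @ Phi) @ P @ A)
```

The residual check confirms that the DARE solution satisfies (3.9), and one can additionally verify the penalty condition $\lambda I - \Xi^\top P \Xi \succ 0$ at the chosen $\lambda$.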
A.4 Proof of Lemma 2
Proof.
Let $P$ be a solution to the equation $P - Q = A^\top P (I + \Phi P)^{-1} A$. Let $E := (I + \Phi P)^{-1} A$ be decomposed as $E = U_1 D U_1^{-1}$, where $D$ is a Jordan normal form. Then, we have $P - Q = A^\top P U_1 D U_1^{-1}$. Let $U_2 := P U_1$. Then, we obtain
\[
U_2 - Q U_1 = A^\top U_2 D.
\]
Since $A = (I + \Phi P) E = (I + \Phi U_2 U_1^{-1}) U_1 D U_1^{-1}$, we have
\[
A U_1 = U_1 D + \Phi U_2 D.
\]
Therefore, we obtain
\[
F \begin{bmatrix} U_1 \\ U_2 \end{bmatrix} = G \begin{bmatrix} U_1 \\ U_2 \end{bmatrix} D.
\]
This implies that a solution to the ARE (3.9) can be expressed as $P = U_2 U_1^{-1}$, where $\begin{bmatrix} U_1 \\ U_2 \end{bmatrix}$ solves the generalized eigenvalue problem.

A.5 Proof of Lemma 3
Proof.
Suppose first that $F\begin{bmatrix} U_1 \\ U_2 \end{bmatrix} = G\begin{bmatrix} U_1 \\ U_2 \end{bmatrix}\Lambda$, where $\Lambda$ is a Jordan normal form. Then, $A = (I + \Phi U_2 U_1^{-1})U_1\Lambda U_1^{-1}$. It follows from the ARE (3.9) that
$$U_2 U_1^{-1} = Q + A^\top U_2 U_1^{-1}(I + \Phi U_2 U_1^{-1})^{-1}A = Q + \bigl(U_1^{-H}\Lambda^H U_1^H + U_1^{-H}\Lambda^H U_2^H\Phi^H\bigr)U_2\Lambda U_1^{-1}.$$
This is a discrete-time Lyapunov equation of the form
$$P = \bar A^H P \bar A + \bar Q, \qquad (A.7)$$
where $P = U_2 U_1^{-1}$, $\bar A := U_1\Lambda U_1^{-1}$, and $\bar Q := Q + (U_2\Lambda U_1^{-1})^H\Phi\,U_2\Lambda U_1^{-1}$. Note that $\bar Q \succeq Q \succeq 0$; if $P \succeq Q \succ 0$, the Lyapunov equation (A.7) implies that $\bar A$ is stable.

We now assume that $P \succeq 0$. Suppose that $\bar Q \succeq 0$, and $\bar A$ has an unstable eigenvalue, i.e., $\bar A v = \gamma v$, where $|\gamma| \ge 1$. Pre-multiplying $v^H$ and post-multiplying $v$ on both sides of the Lyapunov equation (A.7), we obtain
$$(\gamma^*\gamma - 1)\,v^H P v + v^H \bar Q v = 0.$$
Then, $\bar Q^{1/2}v = 0$, which leads to $Q^{1/2}v = \Phi^{1/2}U_2\Lambda U_1^{-1}v = 0$ and $Av = (I + \Phi P)\bar A v = \gamma v$. This contradicts Assumption 4. Therefore, if $P \succeq Q \succeq 0$, then $\bar A$ must be stable. Since $\bar A$ and $\Lambda$ have the same spectrum, the result follows.

A.6 Proof of Proposition 2
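As a numerical preview of the claims proved below (hypothetical scalar data; a sanity check, not part of the argument): the map $A^\top(I+P_{ss}\Phi)^{-1}$ is stable, so the backward recursion for $r_t$ converges to the fixed point given in (3.12).

```python
import numpy as np

# Hypothetical scalar data: A = 1.1, Q = 1, Phi = 0.75, Xi = 1, w_bar = 0.3.
A = np.array([[1.1]]); Q = np.eye(1); Phi = np.array([[0.75]])
Xi = np.eye(1); w_bar = np.array([[0.3]]); I = np.eye(1)

P = Q.copy()
for _ in range(200):                       # Riccati recursion converging to P_ss
    P = Q + A.T @ np.linalg.inv(I + P @ Phi) @ P @ A

M = A.T @ np.linalg.inv(I + P @ Phi)       # the map driving the r-recursion
r = np.zeros((1, 1))
for _ in range(200):                       # r_t = M (P Xi w_bar + r_{t+1})
    r = M @ (P @ Xi @ w_bar + r)

r_ss = np.linalg.inv(I - M) @ M @ P @ Xi @ w_bar   # fixed point, cf. (3.12)
spectral_radius = np.max(np.abs(np.linalg.eigvals(M)))
```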
Proof.
The recursion of $r_t$ in (3.5) becomes $r_t = A^\top(I + P_{ss}\Phi)^{-1}(P_{ss}\Xi\bar w + r_{t+1})$ as $P_t$ converges to $P_{ss}$. When $A^\top(I + P_{ss}\Phi)^{-1}$ is stable, $r_t$ must converge to (3.12). Thus, it suffices to show that $A^\top(I + P_{ss}\Phi)^{-1}$ is stable. It follows from (3.11) that $A = U_1\Lambda U_1^{-1} + \Phi U_2\Lambda U_1^{-1}$ and
$$(I + \Phi P_{ss})^{-1}A = (I + \Phi U_2 U_1^{-1})^{-1}(U_1\Lambda U_1^{-1} + \Phi U_2\Lambda U_1^{-1}) = U_1\Lambda U_1^{-1},$$
which implies that $(I + \Phi P_{ss})^{-1}A$ and $\Lambda$ have the same spectrum. Since $\Lambda$ has $n$ stable eigenvalues, $(I + \Phi P_{ss})^{-1}A$ is stable, and so is its transpose $A^\top(I + P_{ss}\Phi)^{-1}$.

A.7 Proof of Proposition 3
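Before the formal proof, the steady-state average cost established below can be evaluated numerically. The scalar data, including the empirical covariance $\Sigma$, are hypothetical; the expression for $\rho$ is the per-stage increment of $z_t$ evaluated at the steady state.

```python
import numpy as np

# Hypothetical scalar data: A=1.1, Q=1, lambda=4 (so Phi=0.75 with B=R=Xi=1),
# w_bar=0.3, and an assumed empirical covariance Sigma=0.1.
A = np.array([[1.1]]); Q = np.eye(1); Xi = np.eye(1); I = np.eye(1)
lam = 4.0; Phi = np.array([[0.75]]); w_bar = np.array([[0.3]])
Sigma = np.array([[0.1]])

P = Q.copy()
for _ in range(200):                       # Riccati recursion converging to P_ss
    P = Q + A.T @ np.linalg.inv(I + P @ Phi) @ P @ A
M = A.T @ np.linalg.inv(I + P @ Phi)
r = np.linalg.inv(I - M) @ M @ P @ Xi @ w_bar   # r_ss

# Per-stage increment of z_t evaluated at (P_ss, r_ss): this is rho.
rho = (np.trace(np.linalg.inv(I - Xi.T @ P @ Xi / lam) @ Xi.T @ P @ Xi @ Sigma)
       + (w_bar.T @ Xi.T @ (np.linalg.inv(I + P @ Phi)
                            - np.linalg.inv(I - P @ Xi @ Xi.T / lam)) @ P @ Xi @ w_bar)[0, 0]
       + ((2 * w_bar.T @ Xi.T - r.T @ Phi) @ np.linalg.inv(I + P @ Phi) @ r)[0, 0])
```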
Proof.
The steady-state average cost is computed as
$$\limsup_{T\to\infty}\frac{1}{T}\bigl(x_0^\top P_0 x_0 + 2r_0^\top x_0 + z_0\bigr) = \limsup_{T\to\infty}\frac{1}{T}\bigl(x_0^\top P_{ss} x_0 + 2r_{ss}^\top x_0 + z_0\bigr) = \limsup_{T\to\infty}\frac{z_0}{T}.$$
By the recursion for $z_t$ in (3.5), this cost can be expressed as
$$\limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\Bigl[\mathrm{tr}\bigl[(I - \Xi^\top P_{t+1}\Xi/\lambda)^{-1}\Xi^\top P_{t+1}\Xi\Sigma\bigr] + \bar w^\top\Xi^\top\bigl[(I + P_{t+1}\Phi)^{-1} - (I - P_{t+1}\Xi\Xi^\top/\lambda)^{-1}\bigr]P_{t+1}\Xi\bar w + (2\bar w^\top\Xi^\top - r_{t+1}^\top\Phi)(I + P_{t+1}\Phi)^{-1}r_{t+1}\Bigr],$$
which is equal to the value given in the statement.

A.8 Proof of Proposition 4
Proof.
We use Lemma 1 with $V_{t+1} \equiv h$, setting $P_{t+1} = P_{ss}$, $r_{t+1} = r_{ss}$ and $z_{t+1} = 0$. Then, $V_t(x) = x^\top P_t x + 2r_t^\top x + z_t$ in (3.1) satisfies (3.5) with $P_{t+1} = P_{ss}$, $r_{t+1} = r_{ss}$ and $z_{t+1} = 0$:
$$\begin{aligned} P_t &= Q + A^\top(I + P_{ss}\Phi)^{-1}P_{ss}A,\\ r_t &= A^\top(I + P_{ss}\Phi)^{-1}(P_{ss}\Xi\bar w + r_{ss}),\\ z_t &= \mathrm{tr}\bigl[(I - \Xi^\top P_{ss}\Xi/\lambda)^{-1}\Xi^\top P_{ss}\Xi\Sigma\bigr] + \bar w^\top\Xi^\top\bigl[(I + P_{ss}\Phi)^{-1} - (I - P_{ss}\Xi\Xi^\top/\lambda)^{-1}\bigr]P_{ss}\Xi\bar w + (2\bar w^\top\Xi^\top - r_{ss}^\top\Phi)(I + P_{ss}\Phi)^{-1}r_{ss}. \end{aligned}$$
It follows from the ARE (3.9) that $P_t = P_{ss}$. By the definition of $r_{ss}$ in (3.12),
$$P_{ss}\Xi\bar w + r_{ss} = P_{ss}\Xi\bar w + \bigl[I - A^\top(I + P_{ss}\Phi)^{-1}\bigr]^{-1}A^\top(I + P_{ss}\Phi)^{-1}P_{ss}\Xi\bar w = \bigl[I - A^\top(I + P_{ss}\Phi)^{-1}\bigr]^{-1}P_{ss}\Xi\bar w.$$
Thus, we deduce that $r_t = r_{ss}$. Moreover, Proposition 3 implies that $z_t = \rho$. Putting these results together, we conclude that $V_t(x) = x^\top P_{ss}x + 2r_{ss}^\top x + \rho = h(x) + \rho$. Therefore, the equality (3.15) holds. The optimality of $(\pi_{ss}^\star(x), \gamma_{ss}^\star(x))$ also follows from Lemma 1.

A.9 Proof of Theorem 3
Proof. (a)
Consider any single-stage policy pair $(\pi_t, \gamma_t)$. We define the mapping $T^{\pi_t,\gamma_t}$ as
$$(T^{\pi_t,\gamma_t}h)(x) := x^\top Q x + u_t^\top R u_t - \lambda W(\mu_t, \nu) + \int_{\mathbb{R}^k} h(Ax + Bu_t + \Xi w)\,\mathrm{d}\mu_t(w), \qquad u_t = \pi_t(x),\ \mu_t = \gamma_t(x).$$
It follows from Proposition 4 that
$$(T^{\pi_t,\gamma_{ss}^\star}h)(x) \ge (T^{\pi_{ss}^\star,\gamma_{ss}^\star}h)(x) = \rho + h(x).$$
Fix a policy $\pi := (\pi_0, \pi_1, \ldots) \in \Pi$ and an arbitrary positive integer $T$. We first note that
$$(T^{\pi_{T-1},\gamma_{ss}^\star}h)(x) \ge \rho + h(x).$$
By the monotonicity of the mapping $T^{\pi_{T-2},\gamma_{ss}^\star}$, we have
$$(T^{\pi_{T-2},\gamma_{ss}^\star}T^{\pi_{T-1},\gamma_{ss}^\star}h)(x) \ge \rho + (T^{\pi_{T-2},\gamma_{ss}^\star}h)(x) \ge 2\rho + h(x).$$
Recursively applying this inequality yields
$$(T^{\pi_0,\gamma_{ss}^\star}\cdots T^{\pi_{T-1},\gamma_{ss}^\star}h)(x) \ge T\rho + h(x).$$
Dividing both sides by $T$ and letting $T$ tend to $\infty$, we have
$$\tilde J_{x_0,\infty}^\lambda(\pi, \gamma_{ss}^\star; h) \ge \rho, \qquad (A.8)$$
which holds for any $\pi \in \Pi$. By the same argument with $\pi_t := \pi_{ss}^\star$, we can also show that $\tilde J_{x_0,\infty}^\lambda(\pi_{ss}^\star, \gamma; h) \le \rho$ for all $\gamma \in \Gamma$.

(b) Inequalities (3.16) imply that $(\pi_{ss}^\star, \gamma_{ss}^\star)$ is an optimal policy pair of the minimax problem (3.17). Furthermore, we deduce that $\tilde J_{x_0,\infty}^\lambda(\pi_{ss}^\star, \gamma_{ss}^\star; h) = \rho$.

(c) Under the conditions (3.19a) and (3.19b), the $h$ term in (3.16) can be ignored, and thus the following inequalities hold:
$$J_{x_0,\infty}^\lambda(\pi_{ss}^\star, \gamma) \le \rho \le J_{x_0,\infty}^\lambda(\pi, \gamma_{ss}^\star) \quad \forall (\pi, \gamma) \in \bar\Pi \times \bar\Gamma.$$
Using the argument in Part (b), we conclude that $(\pi_{ss}^\star, \gamma_{ss}^\star)$ is an optimal policy pair and $\rho$ is the optimal value of the problem (3.18).

A.10 Proof of Theorem 4
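The convergence claim proved below can be checked numerically beforehand. The scalar data are hypothetical; the unsimplified mean-state recursion (built from $K_{ss}$, $L_{ss}$ and the mean of the worst-case disturbance) is compared against the closed-form limit derived in the proof.

```python
import numpy as np

# Hypothetical scalar data: A=1.1, B=1, Q=R=1, Xi=1, lambda=4, w_bar=0.3.
A = np.array([[1.1]]); B = np.eye(1); Q = np.eye(1); R = np.eye(1)
Xi = np.eye(1); lam = 4.0; w_bar = np.array([[0.3]]); I = np.eye(1)
Phi = B @ np.linalg.inv(R) @ B.T - Xi @ Xi.T / lam

P = Q.copy()
for _ in range(300):                           # Riccati iteration to P_ss
    P = Q + A.T @ np.linalg.inv(I + P @ Phi) @ P @ A
M = A.T @ np.linalg.inv(I + P @ Phi)
r_ss = np.linalg.inv(I - M) @ M @ P @ Xi @ w_bar

K_ss = -np.linalg.inv(R) @ B.T @ np.linalg.inv(I + P @ Phi) @ P @ A
L_ss = -np.linalg.inv(R) @ B.T @ np.linalg.inv(I + P @ Phi) @ (P @ Xi @ w_bar + r_ss)

x_bar = np.zeros((1, 1))
for _ in range(500):                           # unsimplified mean-state recursion
    g = np.linalg.inv(I + P @ Phi) @ (P @ A @ x_bar + P @ Xi @ w_bar + r_ss)
    x_bar = (A + B @ K_ss) @ x_bar + B @ L_ss + (Xi @ Xi.T / lam) @ g + Xi @ w_bar

# Closed-form limit of the expected state
x_lim = (np.linalg.inv(I - np.linalg.inv(I + Phi @ P) @ A)
         @ (I - Phi @ np.linalg.inv(I + P @ Phi - A.T) @ P) @ Xi @ w_bar)
```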
Proof.
Let $\bar x_0 := x_0^\star$ and $\bar x_t := \mathbb{E}[x_t^\star]$ for $t = 1, 2, \ldots$, where $x_t^\star$ denotes the closed-loop system state under the optimal policy in Corollary 3. Then, the mean-state system is given by
$$\bar x_{t+1} = (A + BK_{ss})\bar x_t + BL_{ss} + \frac{\Xi}{N}\sum_{i=1}^N w^{\star,(i)}(\bar x_t).$$
It follows from (A.1) and (A.4) that
$$\frac{\Xi}{N}\sum_{i=1}^N w^{\star,(i)}(\bar x_t) = \frac{1}{\lambda}\Xi\Xi^\top g_t^\star + \Xi\bar w = \frac{1}{\lambda}\Xi\Xi^\top(I + P_{ss}\Phi)^{-1}(P_{ss}A\bar x_t + P_{ss}\Xi\bar w + r_{ss}) + \Xi\bar w.$$
By this equality and the definition of $L_{ss}$ and $r_{ss}$, the mean-state dynamics can be rewritten as
$$\bar x_{t+1} = (I + \Phi P_{ss})^{-1}A\bar x_t + \bigl(I - \Phi(I + P_{ss}\Phi - A^\top)^{-1}P_{ss}\bigr)\Xi\bar w.$$
It follows from the proof of Proposition 2 that the gain $(I + \Phi P_{ss})^{-1}A$ is stable, and therefore the expected state converges to
$$\bigl[I - (I + \Phi P_{ss})^{-1}A\bigr]^{-1}\bigl[I - \Phi(I + P_{ss}\Phi - A^\top)^{-1}P_{ss}\bigr]\Xi\bar w.$$

A.11 Proof of Theorem 5
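As a numerical preview of the stability result proved below (hypothetical scalar data): even though the open-loop system is unstable, the steady-state minimax gain $K_{ss}$ places the closed-loop eigenvalues of $A + BK_{ss}$ inside the unit circle.

```python
import numpy as np

# Hypothetical scalar data: A=1.1 is open-loop unstable; B=Q=R=Xi=1, lambda=4.
A = np.array([[1.1]]); B = np.eye(1); Q = np.eye(1); R = np.eye(1)
Xi = np.eye(1); lam = 4.0; I = np.eye(1)
Phi = B @ np.linalg.inv(R) @ B.T - Xi @ Xi.T / lam

P = Q.copy()
for _ in range(300):                           # Riccati iteration to P_ss
    P = Q + A.T @ np.linalg.inv(I + P @ Phi) @ P @ A

K_ss = -np.linalg.inv(R) @ B.T @ np.linalg.inv(I + P @ Phi) @ P @ A
rho_cl = np.max(np.abs(np.linalg.eigvals(A + B @ K_ss)))   # closed-loop spectral radius
```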
Proof.
Since $A + BK_{ss}$ is independent of $\nu$, without loss of generality, we let $\nu \equiv \delta_0$. Consider the policy $\gamma'$ that selects $\mu_t = \delta_0 = \nu$ for all $t \ge 0$. Under the policy pair $(\pi_{ss}^\star, \gamma')$, the closed-loop system is given by $x_{t+1} = (A + BK_{ss})x_t$ for all $t \ge 0$. Since $r_{ss} = 0$ in this case, we obtain that $h(x) = x^\top P_{ss}x$ and $\rho = 0$. Since $h(x) \ge 0$, $\bar\Gamma$ is equivalent to $\Gamma$, and thus $\gamma' \in \bar\Gamma$. It follows from Theorem 3 that
$$\tilde J_{x_0,\infty}^\lambda(\pi_{ss}^\star, \gamma'; h) \le \rho = 0.$$
Recall that $h(x) = x^\top P_{ss}x$ and $W(\mu_t, \nu) = 0$ in the above setting. Thus,
$$\limsup_{T\to\infty}\frac{1}{T}\mathbb{E}^{\pi_{ss}^\star,\gamma'}\Bigl[\sum_{t=0}^{T-1}(x_t^\top Q x_t + u_t^\top R u_t) + x_T^\top P_{ss}x_T \,\Big|\, x_0 = x\Bigr]$$
is less than or equal to $\rho = 0$. Since $Q, P_{ss} \succeq 0$ and $R \succ 0$, $\tilde J_{x_0,\infty}^\lambda(\pi_{ss}^\star, \gamma'; h) = 0$ and $\limsup$ can be replaced by $\lim$. This implies that
$$\lim_{t\to\infty}(x_t^\top Q x_t + u_t^\top R u_t) = 0, \quad \text{i.e.,} \quad Q^{1/2}x_t \to 0,\ u_t \to 0 \text{ as } t \to \infty.$$
Note that the linear system can be expressed as
$$x_{t+k} = A^k x_t + \sum_{l=0}^{k-1}A^{k-1-l}Bu_{t+l}, \quad k \ge 1.$$
It follows from the triangle inequality that
$$\|Q^{1/2}A^k x_t\| \le \|Q^{1/2}x_{t+k}\| + \sum_{l=0}^{k-1}\|Q^{1/2}A^{k-1-l}Bu_{t+l}\|.$$
Recall that $Q^{1/2}x_t$ and $u_t$ converge to 0 as $t \to \infty$. Thus, for any $\epsilon > 0$, there exists $T(\epsilon)$ such that
$$\sum_{k=0}^{n-1}\|Q^{1/2}A^k x_t\|^2 \le \epsilon^2, \quad t > T(\epsilon).$$
The left-hand side is the squared Euclidean norm of the product of the observability matrix and $x_t$. By the observability of $(A, \sqrt{Q})$, the observability matrix has full rank. Thus, $\|x_t\| \le \epsilon/\sigma_{\min}$ and $x_t$ converges to 0, where $\sigma_{\min}$ is the smallest singular value of the observability matrix. Hence, the closed-loop system $x_{t+1} = (A + BK_{ss})x_t$ is asymptotically stable, and the mean-state system with $\pi_{ss}^\star$ is BIBO stable.

A.12 Proof of Lemma 6
Proof.
Fix $\pi \in \Pi$. Let $p^\star := \sup_{\gamma\in\Gamma_D} J_x(\pi, \gamma)$ and $d^\star := \inf_{\lambda\ge 0}\sup_{\gamma\in\Gamma}\bigl(\lambda\theta + J_x^\lambda(\pi, \gamma)\bigr)$. For any $\varepsilon > 0$, there exists $\gamma_\varepsilon \in \Gamma_D$ such that
$$p^\star - \varepsilon < J_x(\pi, \gamma_\varepsilon). \qquad (A.9)$$
Since $\gamma_\varepsilon \in \Gamma_D$, we have
$$\lambda\theta + J_x^\lambda(\pi, \gamma_\varepsilon) \ge \frac{1}{T}\mathbb{E}^{\pi,\gamma_\varepsilon}\Bigl[\sum_{t=0}^{T-1}\lambda W(\mu_t, \nu_t) + x_T^\top Q_f x_T + \sum_{t=0}^{T-1}\bigl(x_t^\top Q x_t + u_t^\top R u_t - \lambda W(\mu_t, \nu_t)\bigr)\,\Big|\, x_0 = x\Bigr] = J_x(\pi, \gamma_\varepsilon).$$
Minimizing both sides with respect to $\lambda \ge 0$ yields
$$d^\star \ge J_x(\pi, \gamma_\varepsilon). \qquad (A.10)$$
Combining inequalities (A.9) and (A.10), we obtain $p^\star - \varepsilon < d^\star$. Since this inequality holds for any $\varepsilon > 0$, we conclude that $p^\star \le d^\star$.

A.13 Proof of Theorem 6
Proof.
It follows from Lemma 6 that
$$\sup_{\gamma\in\Gamma_D} J_x(\pi^{\star,\lambda}, \gamma) \le \inf_{\lambda'\ge 0}\sup_{\gamma\in\Gamma}\bigl(\lambda'\theta + J_x^{\lambda'}(\pi^{\star,\lambda}, \gamma)\bigr) \le \lambda\theta + \sup_{\gamma\in\Gamma} J_x^\lambda(\pi^{\star,\lambda}, \gamma).$$
By the optimality of $\pi^{\star,\lambda}$, we have
$$V_0(x; \lambda) = \inf_{\pi\in\Pi}\sup_{\gamma\in\Gamma} J_x^\lambda(\pi, \gamma) = \sup_{\gamma\in\Gamma} J_x^\lambda(\pi^{\star,\lambda}, \gamma).$$
Therefore, the result follows.
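Since the bound above holds for every admissible penalty parameter, the tightest guarantee is obtained by minimizing $\lambda\theta + V_0(x;\lambda)$ over $\lambda$. A numerical sketch follows; the scalar data are hypothetical, the steady-state quadratic term $x^\top P_{ss}(\lambda)x$ stands in for the value function, and penalty values violating Assumption 1 are discarded (cf. Lemma 7).

```python
import numpy as np

# Hypothetical scalar data: A=1.1, B=Q=R=Xi=1, initial state x=1, radius theta=0.1.
A = np.array([[1.1]]); Q = np.eye(1); I = np.eye(1)
x = np.array([[1.0]]); theta = 0.1

def P_ss(lam, iters=300):
    """Steady-state Riccati solution; None if lam violates Assumption 1
    (lam * I - Xi' P Xi must stay positive definite; here Xi = I)."""
    Phi = I - I / lam              # B R^{-1} B' - Xi Xi'/lam with B=R=Xi=1
    P = Q.copy()
    for _ in range(iters):
        if lam - P[0, 0] <= 0:
            return None            # inner sup over w is +inf
        P = Q + A.T @ np.linalg.inv(I + P @ Phi) @ P @ A
    return P

grid = np.linspace(2.0, 20.0, 50)
bounds = {lam: lam * theta + (x.T @ P_ss(lam) @ x)[0, 0]
          for lam in grid if P_ss(lam) is not None}
lam_best = min(bounds, key=bounds.get)
```

For this data, small penalties are infeasible and very large penalties over-pay the $\lambda\theta$ term, so the minimizer sits at an interior value of the grid.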
A.14 Proof of Lemma 7
Proof.
Fix an arbitrary $\lambda \in [0, \hat\lambda)$ and consider any $\lambda' \in (\lambda, \hat\lambda)$. Since $\lambda' < \hat\lambda$, there exists $t \ge 0$ such that $\lambda' \le \bar\lambda_t(\lambda')$, where $\bar\lambda_t(\lambda)$ is defined as the maximum eigenvalue of $\Xi^\top P_t^\lambda\Xi$. Let $t_0$ denote the largest time index in $\arg\max\{t \mid \lambda' \le \bar\lambda_t(\lambda')\}$. Since $\lambda' > \bar\lambda_\tau(\lambda')$ for $\tau = t_0 + 1, \ldots, T$, the optimal value functions for $\tau = t_0, \ldots, T$ are characterized as $V_\tau(x; \lambda') = x^\top P_\tau^{\lambda'}x + 2(r_\tau^{\lambda'})^\top x + z_\tau^{\lambda'}$ by using the inductive argument in the proof of Theorem 1.

Now consider the optimal value functions with the penalty parameter $\lambda$. The Bellman recursion at $t_0 - 1$ is
$$V_{t_0-1}(x; \lambda) = x^\top Q x + \inf_{u\in\mathbb{R}^m}\Bigl[u^\top R u + \frac{1}{N}\sum_{i=1}^N\sup_{w\in\mathbb{R}^k}\bigl\{V_{t_0}(Ax + Bu + \Xi w; \lambda) - \lambda\|\hat w_{t_0}^{(i)} - w\|^2\bigr\}\Bigr].$$
Since $\lambda < \lambda'$, we have $V_{t_0}(x; \lambda) \ge V_{t_0}(x; \lambda')$ for all $x \in \mathbb{R}^n$. Therefore,
$$V_{t_0-1}(x; \lambda) \ge x^\top Q x + \inf_{u\in\mathbb{R}^m}\Bigl[u^\top R u + \frac{1}{N}\sum_{i=1}^N\sup_{w\in\mathbb{R}^k}\bigl\{V_{t_0}(Ax + Bu + \Xi w; \lambda') - \lambda\|\hat w_{t_0}^{(i)} - w\|^2\bigr\}\Bigr].$$
Note that the $w$-dependent part of the inner maximization problem is
$$\sup_{w\in\mathbb{R}^k}\bigl\{w^\top(\Xi^\top P_{t_0}^{\lambda'}\Xi - \lambda I)w + 2\bigl[\Xi^\top P_{t_0}^{\lambda'}(Ax + Bu) + \Xi^\top r_{t_0}^{\lambda'} + \lambda\hat w_{t_0}^{(i)}\bigr]^\top w\bigr\}.$$
This is a strictly convex quadratic function with respect to $w$ since $\lambda < \lambda' \le \bar\lambda_{t_0}(\lambda')$. Thus, the supremum must be $+\infty$ and $V_{t_0-1}(x; \lambda) = +\infty$. It follows from the Bellman recursion that $V_0(x; \lambda) = +\infty$.

Next, we consider the case where $\lambda \in (\hat\lambda, \infty)$. It suffices to show that any $\lambda$ in this range satisfies Assumption 1. Suppose that $\lambda > \hat\lambda$ does not satisfy Assumption 1. By the definition of $\hat\lambda$, there exists at least one $\tilde\lambda \in [\hat\lambda, \lambda)$ that satisfies Assumption 1. Then, we have
$$\inf_{\pi\in\Pi}\sup_{\gamma\in\Gamma} J_x^{\tilde\lambda}(\pi, \gamma) = c(\tilde\lambda), \qquad (A.11)$$
which is finite, since $\tilde\lambda$ satisfies Assumption 1. On the other hand, $\lambda$ does not satisfy Assumption 1, and thus we can take the largest time index $t_0$ in $\arg\max\{t \mid \lambda \le \bar\lambda_t(\lambda)\}$ in the same way as in the previous case. Switching the role of $(\lambda, \lambda')$ in the previous case to that of $(\tilde\lambda, \lambda)$, we deduce that
$$V_{t_0-1}(x; \tilde\lambda) \ge x^\top Q x + \inf_{u\in\mathbb{R}^m}\Bigl[u^\top R u + \frac{1}{N}\sum_{i=1}^N\sup_{w\in\mathbb{R}^k}\bigl\{V_{t_0}(Ax + Bu + \Xi w; \lambda) - \tilde\lambda\|\hat w_{t_0}^{(i)} - w\|^2\bigr\}\Bigr],$$
since $V_{t_0}(x; \tilde\lambda) \ge V_{t_0}(x; \lambda)$. The supremum must be $+\infty$ since $\tilde\lambda < \lambda \le \bar\lambda_{t_0}(\lambda)$. Thus, we have
$$\inf_{\pi\in\Pi}\sup_{\gamma\in\Gamma} J_x^{\tilde\lambda}(\pi, \gamma) = \infty,$$
which is a contradiction to (A.11). Therefore, we conclude that any $\lambda > \hat\lambda$ must satisfy Assumption 1.

Finally, when $\lambda = \hat\lambda$, the value of the objective function can be either finite or infinite depending on the initial states, samples, and system matrices. However, the boundary condition is guaranteed by the monotonically decreasing property of the objective function. Precisely, $c(\hat\lambda)$ should be greater than or equal to $c(\hat\lambda + \epsilon)$ for any $\epsilon > 0$.

A.15 Proof of Proposition 5
Proof.
It is clear that $\lambda^\star$ minimizes the objective function of (4.2) in the range $(\hat\lambda, \infty)$. Moreover, it follows from Lemma 7 that $\inf_{\pi\in\Pi}\sup_{\gamma\in\Gamma} J_x^\lambda(\pi, \gamma) = \infty$ for all $\lambda < \hat\lambda$. Thus, $\lambda^\star$ minimizes the objective function in the range $[0, \hat\lambda) \cup (\hat\lambda, \infty)$, and it suffices to show that $\hat\lambda\theta + c(\hat\lambda) \ge \lambda^\star\theta + c(\lambda^\star)$. Suppose that $\hat\lambda\theta + c(\hat\lambda) < \lambda^\star\theta + c(\lambda^\star)$. Let
$$\Delta := \bigl(\lambda^\star\theta + c(\lambda^\star)\bigr) - \bigl(\hat\lambda\theta + c(\hat\lambda)\bigr) > 0.$$
Since $\lambda^\star$ is a minimizer of (4.3),
$$\lambda^\star\theta + c(\lambda^\star) \le (\hat\lambda + \epsilon)\theta + c(\hat\lambda + \epsilon) \quad \forall \epsilon > 0,$$
which is equivalent to
$$c(\hat\lambda) + \Delta \le \epsilon\theta + c(\hat\lambda + \epsilon) \quad \forall \epsilon > 0.$$
Now, let $\epsilon$ be sufficiently small so that $\epsilon\theta < \Delta$. Then, $c(\hat\lambda) < c(\hat\lambda + \epsilon)$, which is a contradiction to the boundary condition in Lemma 7. Thus, we conclude that $\hat\lambda\theta + c(\hat\lambda) \ge \lambda^\star\theta + c(\lambda^\star)$, and the result follows.

We now show that the optimal value function $V_t(x; \lambda) = x^\top P_t^\lambda x + 2(r_t^\lambda)^\top x + z_t^\lambda$ is jointly convex in $(\lambda, x) \in (\hat\lambda, \infty) \times \mathbb{R}^n$ using mathematical induction. For $t = T$, it is clear that $V_T(x; \lambda) = x^\top Q_f x$ satisfies the joint convexity. Suppose now that the induction hypothesis is valid for $t$. Recall the Bellman equation for $t - 1$:
$$V_{t-1}(x; \lambda) = x^\top Q x + \inf_{u\in\mathbb{R}^m}\Bigl[u^\top R u + \frac{1}{N}\sum_{i=1}^N\sup_{w\in\mathbb{R}^k}\bigl\{V_t(Ax + Bu + \Xi w; \lambda) - \lambda\|\hat w_t^{(i)} - w\|^2\bigr\}\Bigr].$$
For each $w \in \mathbb{R}^k$, $V_t(Ax + Bu + \Xi w; \lambda) - \lambda\|\hat w_t^{(i)} - w\|^2$ is convex in $(\lambda, x) \in (\hat\lambda, \infty) \times \mathbb{R}^n$. Thus, the convexity is preserved through the point-wise supremum, and $u^\top R u + \frac{1}{N}\sum_{i=1}^N\sup_{w\in\mathbb{R}^k}\{V_t(Ax + Bu + \Xi w; \lambda) - \lambda\|\hat w_t^{(i)} - w\|^2\}$ is jointly convex in $(\lambda, x, u)$ on $(\hat\lambda, \infty) \times \mathbb{R}^n \times \mathbb{R}^m$, which is a convex set. Thus, its infimum over $u \in \mathbb{R}^m$ is convex in $(\lambda, x) \in (\hat\lambda, \infty) \times \mathbb{R}^n$. This completes our mathematical induction, and the result follows.

A.16 Proof of Theorem 7
Proof.
Fix an arbitrary infinite-horizon policy $\gamma \in \Gamma_D$. Let $\gamma^{T-1}$ be defined as the marginal of $\gamma$ from stage 0 to $T - 1$. Then, $\gamma^{T-1}$ is an admissible policy of the opponent in the finite-horizon setting. Therefore,
$$\tilde J_x(\pi_{ss}^{\star,\lambda}, \gamma^{T-1}; h^\lambda) \le \sup_{\gamma\in\Gamma_D}\tilde J_x(\pi_{ss}^{\star,\lambda}, \gamma; h^\lambda),$$
where, with a slight abuse of notation, $\pi_{ss}^{\star,\lambda}$ represents a finite-horizon policy using $\pi_{ss}^{\star,\lambda}$ at every stage, and
$$\tilde J_x(\pi, \gamma; h) := \frac{1}{T}\mathbb{E}^{\pi,\gamma}\Bigl[\sum_{t=0}^{T-1}(x_t^\top Q x_t + u_t^\top R u_t) + h(x_T)\,\Big|\, x_0 = x\Bigr].$$
Let
$$\tilde J_x^\lambda(\pi, \gamma; h) := \frac{1}{T}\mathbb{E}^{\pi,\gamma}\Bigl[\sum_{t=0}^{T-1}\bigl(x_t^\top Q x_t + u_t^\top R u_t - \lambda W(\mu_t, \nu)\bigr) + h(x_T)\,\Big|\, x_0 = x\Bigr].$$
Using the argument used in the proof of Lemma 6, we can deduce that
$$\sup_{\gamma\in\Gamma_D}\tilde J_x(\pi_{ss}^{\star,\lambda}, \gamma; h^\lambda) \le \inf_{\lambda'\ge 0}\sup_{\gamma\in\Gamma}\bigl(\lambda'\theta + \tilde J_x^{\lambda'}(\pi_{ss}^{\star,\lambda}, \gamma; h^\lambda)\bigr) \le \lambda\theta + \sup_{\gamma\in\Gamma}\tilde J_x^\lambda(\pi_{ss}^{\star,\lambda}, \gamma; h^\lambda). \qquad (A.12)$$
Combining the two inequalities above yields
$$\tilde J_{x_0,\infty}(\pi_{ss}^{\star,\lambda}, \gamma; h^\lambda) = \limsup_{T\to\infty}\tilde J_x(\pi_{ss}^{\star,\lambda}, \gamma^{T-1}; h^\lambda) \le \lambda\theta + \limsup_{T\to\infty}\sup_{\gamma\in\Gamma}\tilde J_x^\lambda(\pi_{ss}^{\star,\lambda}, \gamma; h^\lambda) \le \lambda\theta + \rho(\lambda),$$
where the last inequality follows from Theorem 3 (a). Since $\gamma$ was arbitrarily chosen from $\Gamma_D$, the first bound holds. The second bound can be obtained using the same argument.

A.17 Proof of Lemma 8
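The mechanism of the result proved below can be illustrated numerically: for the standard LQG gain $K'$, the matrix $A + \Phi(\lambda)K'$ is stable once $\lambda$ is large enough, which certifies stabilizability of $(A, \Phi^{1/2})$. The scalar data are hypothetical.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Hypothetical scalar data: A=1.1, B=1, Q=R=1, Xi=1.
A = np.array([[1.1]]); B = np.eye(1); Q = np.eye(1); R = np.eye(1); Xi = np.eye(1)

# Standard LQG ARE solution P_tilde and the gain K' from the proof.
P_tilde = solve_discrete_are(A, B, Q, R)
BRB = B @ np.linalg.inv(R) @ B.T
K_prime = -np.linalg.inv(np.eye(1) + P_tilde @ BRB) @ P_tilde @ A

def radius(lam):
    """Spectral radius of A + Phi(lam) K', with Phi(lam) = B R^{-1} B' - Xi Xi'/lam."""
    Phi = BRB - Xi @ Xi.T / lam
    return np.max(np.abs(np.linalg.eigvals(A + Phi @ K_prime)))
```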
Proof.
There exists $\hat\lambda_1 > 0$ such that
$$BR^{-1}B^\top - \Xi\Xi^\top/\lambda \succeq 0 \quad \forall \lambda > \hat\lambda_1,$$
since $BR^{-1}B^\top \succ 0$. It follows from the stabilizability of $(A, B)$ and the observability of $(A, C)$ that the ARE of the standard LQG has a unique PSD solution $\tilde P$. Moreover, the LQG control gain $\tilde K := -R^{-1}B^\top(I + \tilde P BR^{-1}B^\top)^{-1}\tilde P A$ stabilizes the closed-loop system, such that
$$A + B\tilde K = A - BR^{-1}B^\top(I + \tilde P BR^{-1}B^\top)^{-1}\tilde P A$$
is stable and all eigenvalues of this matrix lie inside the unit circle. Then, there exists $\hat\lambda_2$ such that all eigenvalues of
$$A - (BR^{-1}B^\top - \Xi\Xi^\top/\lambda)(I + \tilde P BR^{-1}B^\top)^{-1}\tilde P A = A - \Phi(I + \tilde P BR^{-1}B^\top)^{-1}\tilde P A =: A + \Phi K'$$
lie inside the unit circle for any $\lambda > \hat\lambda_2$, since $BR^{-1}B^\top - \Xi\Xi^\top/\lambda$ is continuous in $\lambda$ and converges to $BR^{-1}B^\top$ as $\lambda \to +\infty$. Since $A + \Phi K'$ is stable for any $\lambda > \hat\lambda_2$, we can conclude that $(A, \Phi^{1/2})$ is stabilizable for any $\lambda > \hat\lambda_2$. Letting $\hat\lambda_\infty := \max\{\hat\lambda_1, \hat\lambda_2\}$, the result follows.

A.18 Proof of Proposition 6
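The monotonicity claim proved below can be previewed numerically. The scalar data are hypothetical, and $\rho(\lambda)$ is evaluated from the steady-state quantities $P_{ss}$ and $r_{ss}$ with $\Xi = I$.

```python
import numpy as np

# Hypothetical scalar data: A=1.1, Q=1, B=R=Xi=1, w_bar=0.3, Sigma=0.1.
A = np.array([[1.1]]); Q = np.eye(1); I = np.eye(1)
w_bar = np.array([[0.3]]); Sigma = np.array([[0.1]])

def rho(lam):
    """Steady-state average cost for Xi = I, evaluated at (P_ss, r_ss)."""
    Phi = I - I / lam              # B R^{-1} B' - Xi Xi'/lam with B=R=Xi=1
    P = Q.copy()
    for _ in range(300):           # Riccati recursion converging to P_ss
        P = Q + A.T @ np.linalg.inv(I + P @ Phi) @ P @ A
    M = A.T @ np.linalg.inv(I + P @ Phi)
    r = np.linalg.inv(I - M) @ M @ P @ w_bar
    return (np.trace(np.linalg.inv(I - P / lam) @ P @ Sigma)
            + (w_bar.T @ (np.linalg.inv(I + P @ Phi) - np.linalg.inv(I - P / lam)) @ P @ w_bar)[0, 0]
            + ((2 * w_bar.T - r.T @ Phi) @ np.linalg.inv(I + P @ Phi) @ r)[0, 0])

values = [rho(lam) for lam in np.linspace(4.0, 40.0, 10)]
```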
Proof.
Fix $\lambda \in (\hat\lambda_\infty, \infty)$. Then, Assumption 3 holds by Lemma 8, and Assumption 1 also holds since $\lambda > \hat\lambda$. Thus, Assumptions 1–4 hold, and the steady-state average cost $\rho(\lambda)$ exists as defined in Proposition 3. Recall that $\rho(\lambda) = \limsup_{T\to\infty}\min_{\pi\in\Pi}\max_{\gamma\in\Gamma} J_{x_0,T}^\lambda(\pi, \gamma)$, and $J_{x_0,T}^\lambda(\pi, \gamma)$ is monotonically decreasing with respect to $\lambda$ for any $T > 0$. Thus, $\rho$ is a monotonically nonincreasing function. The limit (4.8) directly follows from the definition of $\rho(\lambda)$ in Proposition 3.

We now show that $\rho(\lambda)$ is convex on $(\hat\lambda_\infty, \infty)$ using the convexity result in the finite-horizon case. Fix any $\lambda_1, \lambda_2 \in (\hat\lambda_\infty, \infty)$ and $\alpha \in (0, 1)$. Then,
$$\alpha\rho(\lambda_1) + (1 - \alpha)\rho(\lambda_2) = \limsup_{T\to\infty}\frac{\alpha}{T}V_0(x; \lambda_1) + \limsup_{T\to\infty}\frac{1 - \alpha}{T}V_0(x; \lambda_2) \ge \limsup_{T\to\infty}\frac{1}{T}\bigl[\alpha V_0(x; \lambda_1) + (1 - \alpha)V_0(x; \lambda_2)\bigr] \ge \limsup_{T\to\infty}\frac{1}{T}V_0(x; \alpha\lambda_1 + (1 - \alpha)\lambda_2) = \rho(\alpha\lambda_1 + (1 - \alpha)\lambda_2),$$
where the last inequality comes from the convexity of $V_0(x; \lambda)$ shown in Proposition 5. Therefore, we conclude that $\rho(\lambda)$ is convex on $(\hat\lambda_\infty, \infty)$.

A.19 Proof of Theorem 8
Proof.
If the true probability measure $\mu_t$ is contained in the Wasserstein ambiguity set $\mathcal{D}$ for all $t = 0, 1, \ldots, T - 1$, it follows from Theorem 6 that
$$\frac{1}{T}\mathbb{E}_{w\sim\mu}^{\pi_{\hat w}^\star}\bigl[C_T(x, u) \mid x_0 = x\bigr] \le \lambda\theta + V_0(x; \lambda) \quad \forall x \in \mathbb{R}^n.$$
Therefore, the probability of the expected cost being no greater than $\lambda\theta + V_0(x; \lambda)$ is greater than or equal to the probability that $\mu_t \in \mathcal{D}$ for all $t = 0, 1, \ldots, T - 1$. We then have
$$\mu^N\Bigl\{\hat w : \frac{1}{T}\mathbb{E}_{w\sim\mu}^{\pi_{\hat w}^\star}\bigl[C_T(x, u) \mid x_0 = x\bigr] \le \lambda\theta + V_0(x; \lambda)\ \forall x \in \mathbb{R}^n\Bigr\} \ge \prod_{t=0}^{T-1}\mu_t^N\bigl\{\hat w_t : W(\mu_t, \nu_{\hat w_t}) \le \theta(N, \beta)\bigr\} \ge \bigl(1 - c_1\bigl[b_1(N, \theta)\mathbf{1}_{\{\theta\le 1\}} + b_2(N, \theta)\mathbf{1}_{\{\theta>1\}}\bigr]\bigr)^T.$$
The radius $\theta(N, \beta)$ stated in the theorem satisfies $1 - \beta = \bigl(1 - c_1\bigl[b_1(N, \theta)\mathbf{1}_{\{\theta\le 1\}} + b_2(N, \theta)\mathbf{1}_{\{\theta>1\}}\bigr]\bigr)^T$, and therefore the probabilistic guarantee (4.11) holds.

References

[1] I. R. Petersen, M. R. James, and P. Dupuis, "Minimax optimal control of stochastic uncertain systems with relative entropy constraints," IEEE Transactions on Automatic Control, vol. 45, no. 3, pp. 398–412, 2000.
[2] I. Tzortzis, C. D. Charalambous, and T. Charalambous, "Dynamic programming subject to total variation distance ambiguity," SIAM Journal on Control and Optimization, vol. 53, no. 4, pp. 2040–2075, 2015.
[3] A. Nilim and L. El Ghaoui, "Robust control of Markov decision processes with uncertain transition matrices," Operations Research, vol. 53, no. 5, pp. 780–798, 2005.
[4] S. Samuelson and I. Yang, "Data-driven distributionally robust control of energy storage to manage wind power fluctuations," in Proceedings of the 1st IEEE Conference on Control Technology and Applications, 2017.
[5] I. Yang, "A dynamic game approach to distributionally robust safety specifications for stochastic systems," Automatica, vol. 94, pp. 94–101, 2018.
[6] H. Xu and S. Mannor, "Distributionally robust Markov decision processes," Mathematics of Operations Research, vol. 37, no. 2, pp. 288–300, 2012.
[7] B. P. G. Van Parys, D. Kuhn, P. J. Goulart, and M. Morari, "Distributionally robust control of constrained stochastic systems," IEEE Transactions on Automatic Control, vol. 61, no. 2, pp. 430–442, 2016.
[8] I. Yang, "Distributionally robust stochastic control with conic confidence sets," in Proceedings of the 56th IEEE Conference on Decision and Control, 2017.
[9] V. A. Ugrinovskii and I. R. Petersen, "Minimax LQG control of stochastic partially observed uncertain systems," SIAM Journal on Control and Optimization, vol. 40, no. 4, pp. 1189–1226, 2002.
[10] I. Tzortzis, C. D. Charalambous, T. Charalambous, C. K. Kourtellaris, and C. N. Hadjicostis, "Robust linear quadratic regulator for uncertain systems," in Proceedings of the 55th IEEE Conference on Decision and Control, 2016.
[11] I. Yang, "A convex optimization approach to distributionally robust Markov decision processes with Wasserstein distance," IEEE Control Systems Letters, vol. 1, no. 1, pp. 164–169, 2017.
[12] ——, "Wasserstein distributionally robust stochastic control: A data-driven approach," IEEE Transactions on Automatic Control, 2020.
[13] J. Coulson, J. Lygeros, and F. Dörfler, "Regularized and distributionally robust data-enabled predictive control," in Proceedings of the 58th IEEE Conference on Decision and Control, 2019.
[14] C. Mark and S. Liu, "Stochastic MPC with distributionally robust chance constraints," in Proceedings of the 21st IFAC World Congress, 2020.
[15] M. Schuurmans and P. Patrinos, "Learning-based distributionally robust model predictive control of Markovian switching systems with guaranteed stability and recursive feasibility," in Proceedings of the 59th IEEE Conference on Decision and Control, 2020.
[16] C. Ning and F. You, "Online learning based risk-averse stochastic MPC of constrained linear uncertain systems," arXiv preprint arXiv:2011.11441, 2020.
[17] M. Schuurmans, P. Sopasakis, and P. Patrinos, "Safe learning-based control of stochastic jump linear systems: a distributionally robust approach," in Proceedings of the 58th IEEE Conference on Decision and Control, 2019.
[18] A. Hakobyan and I. Yang, "Learning-based distributionally robust motion control with Gaussian processes," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2020.
[19] E. Delage and Y. Ye, "Distributionally robust optimization under moment uncertainty with application to data-driven problems," Operations Research, vol. 58, no. 3, pp. 595–612, 2010.
[20] A. Ben-Tal, D. Den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen, "Robust solutions of optimization problems affected by uncertain probabilities," Management Science, vol. 59, no. 2, pp. 341–357, 2013.
[21] W. Wiesemann, D. Kuhn, and M. Sim, "Distributionally robust convex optimization," Operations Research, vol. 62, no. 6, pp. 1358–1376, 2014.
[22] P. Mohajerin Esfahani and D. Kuhn, "Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations," Mathematical Programming, vol. 171, no. 1–2, pp. 115–166, 2018.
[23] C. Zhao and Y. Guan, "Data-driven risk-averse stochastic optimization with Wasserstein metric," Operations Research Letters, vol. 46, no. 2, 2018.
[24] R. Gao and A. J. Kleywegt, "Distributionally robust stochastic optimization with Wasserstein distance," arXiv preprint arXiv:1604.02199, 2016.
[25] J. Blanchet, K. Murthy, and F. Zhang, "Optimal transport based distributionally robust optimization: Structural properties and iterative schemes," arXiv preprint arXiv:1810.02403, 2018.
[26] D. Kuhn, P. M. Esfahani, V. A. Nguyen, and S. Shafieezadeh-Abadeh, "Wasserstein distributionally robust optimization: Theory and applications in machine learning," Operations Research & Management Science in the Age of Analytics, pp. 130–166, 2019.
[27] K. Kim and I. Yang, "Minimax control of ambiguous linear stochastic systems using the Wasserstein metric," in Proceedings of the 59th IEEE Conference on Decision and Control, 2020.
[28] K. J. Åström, Introduction to Stochastic Control Theory. Courier Corporation, 2012.
[29] T. Başar and P. Bernhard, H-Infinity Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach. Springer Science & Business Media, 2008.
[30] T. Pappas, A. J. Laub, and N. R. Sandell, "On the numerical solution of the discrete-time algebraic Riccati equation," IEEE Transactions on Automatic Control, vol. AC-25, pp. 631–641, 1980.
[31] D. P. Bertsekas, Dynamic Programming and Optimal Control, 4th ed. Athena Scientific, 2012, vol. 2.
[32] N. Fournier and A. Guillin, "On the rate of convergence in Wasserstein distance of the empirical measure," Probability Theory and Related Fields, vol. 162, no. 3–4, pp. 707–738, 2015.
[33] D. Boskos, J. Cortés, and S. Martínez, "Data-driven ambiguity sets with probabilistic guarantees for dynamic processes," IEEE Transactions on Automatic Control, 2020.
[34] K. Glover and J. C. Doyle, "State-space formulae for all stabilizing controllers that satisfy an $H_\infty$-norm bound and relations to risk sensitivity," Systems & Control Letters, pp. 167–172, 1988.
[35] F. Dörfler, M. R. Jovanović, M. Chertkov, and F. Bullo, "Sparsity-promoting optimal wide-area control of power networks," IEEE Transactions on Power Systems, vol. 29, no. 5, pp. 2281–2291, 2014.
[36] A. F. Dizche, A. Chakrabortty, and A. Duel-Hallen, "Sparse wide-area control of power systems using data-driven reinforcement learning," in Proceedings of 2019 American Control Conference, 2019.
[37] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal Control, 3rd ed. John Wiley & Sons, 2012.