High-Confidence Data-Driven Ambiguity Sets for Time-Varying Linear Systems
DIMITRIS BOSKOS†, JORGE CORTÉS‡, AND SONIA MARTÍNEZ‡

Abstract.
This paper builds Wasserstein ambiguity sets for the unknown probability distribution of dynamic random variables leveraging noisy partial-state observations. The constructed ambiguity sets contain the true distribution of the data with quantifiable probability and can be exploited to formulate robust stochastic optimization problems with out-of-sample guarantees. We assume the random variable evolves in discrete time under uncertain initial conditions and dynamics, and that noisy partial measurements are available. All random elements have unknown probability distributions, and we make inferences about the distribution of the state vector using several output samples from multiple realizations of the process. To this end, we leverage an observer to estimate the state of each independent realization and exploit the outcome to construct the ambiguity sets. We illustrate our results in an economic dispatch problem involving distributed energy resources over which the scheduler has no direct control.
Key words.
Distributional uncertainty, Wasserstein ambiguity sets, stochastic systems, state estimation
AMS subject classifications.
1. Introduction.
Decisions under uncertainty are ubiquitous in a wide range of engineering applications. Faced with complex systems that include components with probabilistic models, such decisions seek to provide rigorous solutions with quantifiable guarantees in hedging against uncertainty. In practice, the designer makes inferences about uncertain elements based on collected data and exploits them to formulate data-driven stochastic optimization problems. This decision-making paradigm has found applications in finance, communications, control, medicine, and machine learning. Recent research focuses on how to retain high-confidence guarantees for the optimization problems under plausible variations of the data. To this end, distributionally robust optimization (DRO) formulations evaluate the optimal worst-case performance over an ambiguity set of probability distributions that contains the true one with high confidence. Such ambiguity sets are typically constructed under the assumption that data are generated from a static distribution and can be measured in a direct manner. In this paper we significantly expand the class of scenarios for which reliable ambiguity sets can be constructed. We consider scenarios where the random variable is dynamic and partial measurements, corrupted by noise, are progressively collected from its evolving distribution. In our analysis, we exploit the underlying dynamics and study how the probabilistic properties of the noise affect the ambiguity set size while maintaining the same guarantees.
Literature review:
Optimal decision problems in the face of uncertainty, like expected-cost minimization and chance-constrained optimization, are the cornerstones of stochastic programming [37]. Distributionally robust versions of stochastic optimization problems [2, 5, 36] carry out a worst-case optimization over all possibilities from an ambiguity set of probability distributions. This is of particular importance in data-driven scenarios where the unknown distributions of the random variables are inferred in an approximate manner using a finite amount of data [3]. To hedge this uncertainty, optimal transport ambiguity sets have emerged as a promising tool. These sets typically group all distributions up to some distance from the empirical approximation in the Wasserstein metric [40]. There are several reasons that make this metric a popular choice among the distances between probability distributions, particularly for data-driven problems. Most notably, the Wasserstein metric penalizes horizontal dislocations between distributions and provides ambiguity sets that have finite-sample guarantees of containing the true distribution and lead to tractable optimization problems. This has rendered the convergence of empirical measures in the Wasserstein distance an ongoing active research area [14, 15, 17, 23, 41, 42].

∗ This work was supported by the DARPA Lagrange program through award N66001-18-2-4027. A preliminary version of this paper appeared as [7] at the American Control Conference.
† Delft Center for Systems and Control, Delft University of Technology ([email protected]).
‡ Department of Mechanical and Aerospace Engineering, University of California, San Diego (cortes,[email protected]).
Towards the exploitation of Wasserstein ambiguity sets for DRO problems, the work [16] introduces tractable reformulations with finite-sample guarantees, further exploited in [10, 22] to deal with distributionally robust chance-constrained programs. The work [12] develops distributed optimization algorithms using Wasserstein balls, while optimal transport ambiguity sets have recently been connected to regularization for machine learning [4, 18, 34]. The paper [27] exploits Wasserstein balls to robustify data-driven online optimization algorithms, and [35] leverages them for the design of distributionally robust Kalman filters. Further applications of Wasserstein ambiguity sets include the synthesis of robust control policies for Markov decision processes [44] and their data-driven extensions [45], and regularization for stochastic predictive control algorithms [13]. Several recent works have also devoted attention to distributionally robust problems in power systems control, including optimal power flow [21, 24] and economic dispatch [30, 33, 43]. Time-varying aspects of Wasserstein ambiguity sets are considered in [25] for dynamic traffic models, in [26] for online learning of unknown dynamical environments, and in [8], which constructs ambiguity balls using progressively assimilated dynamic data for processes with random initial conditions that evolve under deterministic dynamics. In contrast, in the present work, the state distribution does not evolve deterministically due to the presence of random disturbances, which, together with output measurements that are corrupted by noise, generate additional stochastic elements that make the quantification of the ambiguity set guarantees challenging.
Statement of contributions:
Our contributions revolve around building Wasserstein ambiguity sets with probabilistic guarantees for dynamic random variables when we have no knowledge of the probability distributions of their initial condition, the disturbances in their dynamics, and the measurement noise. To this end, our first contribution estimates the states of several process realizations from output samples and exploits these estimates to build a suitable empirical distribution as the center of an ambiguity ball. Our second contribution is the exploitation of concentration of measure results to quantify the radius of this ambiguity ball so that it provably contains the true state distribution with high probability. To achieve this, we break the radius into nominal and noise components. The nominal component captures the deviation between the true distribution and the empirical distribution formed by the state realizations. The noise component captures the deviation between the empirical distribution and the center of our ambiguity ball. To quantify the latter, we carefully evaluate the impact of the estimation error, which, due to the measurement noise, does not have a compactly supported distribution like the internal uncertainty and requires a separate analysis. The third contribution is the generalization of a concentration inequality around the mean of sufficiently light-tailed independent random variables, which enables us to obtain tighter results when analyzing the effect of the estimation error. The fourth contribution is the identification of explicit constants for some of the presented concentration of measure inequalities, which, to the best of our knowledge, are not available in the literature.
2. Preliminaries.
Here we present general notation and concepts from probability theory used throughout the paper.
Notation:
We denote by $\|\cdot\|_p$ the $p$th norm in $\mathbb{R}^n$, $p \in [1,\infty]$, using also the notation $\|\cdot\| \equiv \|\cdot\|_2$ for the Euclidean norm. The inner product of two vectors $a, b \in \mathbb{R}^n$ is denoted by $\langle a, b\rangle$, and the Khatri-Rao product [31] of $a \equiv (a_1,\ldots,a_d) \in \mathbb{R}^d$ and $b \equiv (b_1,\ldots,b_d) \in \mathbb{R}^{dn}$, with each $b_i$ belonging to $\mathbb{R}^n$, is $a * b := (a_1 b_1, \ldots, a_d b_d) \in \mathbb{R}^{dn}$. We use the notation $B_p^n(\rho)$ for the ball of center zero and radius $\rho$ in $\mathbb{R}^n$ with the $p$th norm, and $[n_1 : n_2]$ for the set of integers $\{n_1, n_1+1, \ldots, n_2\} \subset \mathbb{N} \cup \{0\} =: \mathbb{N}_0$. The diameter of a set $S \subset \mathbb{R}^n$ with the $p$th norm is defined as $\mathrm{diam}_p(S) := \sup\{\|x - y\|_p \mid x, y \in S\}$, and for $z \in \mathbb{R}^n$, $S + z := \{x + z \mid x \in S\}$. We denote the induced Euclidean norm of a matrix $A \in \mathbb{R}^{m \times n}$ by $\|A\| := \max_{\|x\|=1} \|Ax\|$. Given $B \subset \Omega$, $\mathbf{1}_B$ is the indicator function of $B$ on $\Omega$, with $\mathbf{1}_B(x) = 1$ for $x \in B$ and $\mathbf{1}_B(x) = 0$ for $x \notin B$.

Probability Theory:
We denote by $\mathcal{B}(\mathbb{R}^d)$ the Borel $\sigma$-algebra on $\mathbb{R}^d$, and by $\mathcal{P}(\mathbb{R}^d)$ the probability measures on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$. For any $p \ge 1$, $\mathcal{P}_p(\mathbb{R}^d) := \{\mu \in \mathcal{P}(\mathbb{R}^d) \mid \int_{\mathbb{R}^d} \|x\|^p \, d\mu < \infty\}$ is the set of probability measures in $\mathcal{P}(\mathbb{R}^d)$ with finite $p$th moment. The Wasserstein distance between $\mu, \nu \in \mathcal{P}_p(\mathbb{R}^d)$ is
$$W_p(\mu, \nu) := \Big(\inf_{\pi \in \mathcal{H}(\mu,\nu)} \Big\{\int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^p \, \pi(dx, dy)\Big\}\Big)^{1/p},$$
where $\mathcal{H}(\mu, \nu)$ is the set of all probability measures on $\mathbb{R}^d \times \mathbb{R}^d$ with marginals $\mu$ and $\nu$, respectively. For any $\mu \in \mathcal{P}(\mathbb{R}^d)$, its support is the closed set $\mathrm{supp}(\mu) := \{x \in \mathbb{R}^d \mid \mu(U) > 0 \text{ for each neighborhood } U \text{ of } x\}$, or equivalently, the smallest closed set with measure one. Given a measurable space $(\Omega, \mathcal{F})$, an exponent $p \ge 1$, the convex function $\mathbb{R} \ni x \mapsto \psi_p(x) := e^{x^p} - 1$, and the linear space of scalar random variables $L_{\psi_p} := \{X \mid \mathbb{E}[\psi_p(|X|/t)] < \infty \text{ for some } t > 0\}$ on $(\Omega, \mathcal{F})$, the $\psi_p$-Orlicz norm (cf. [39, Section 2.7.1]) of $X \in L_{\psi_p}$ is $\|X\|_{\psi_p} := \inf\{t > 0 \mid \mathbb{E}[\psi_p(|X|/t)] \le 1\}$. When $p = 1$ and $p = 2$, each random variable in $L_{\psi_p}$ is sub-exponential and sub-Gaussian, respectively. We also denote by $\|X\|_p \equiv \big(\mathbb{E}[|X|^p]\big)^{1/p}$ the norm of a scalar random variable with finite $p$th moment, i.e., the classical norm in $L^p(\Omega)$. The interpretation of $\|\cdot\|_p$ as the $p$th norm of a vector in $\mathbb{R}^n$ or a random variable in $L^p$ should be clear from the context throughout the paper. Given a set $\{X_i\}_{i \in I}$ of random variables, we denote by $\sigma(\{X_i\}_{i \in I})$ the $\sigma$-algebra generated by them. We conclude with a useful technical result which follows from Fubini's theorem [1, Theorem 2.6.5].

Lemma 1 (Expectation inequality). Consider the independent random vectors $X$ and $Y$, taking values in $\mathbb{R}^{n_1}$ and $\mathbb{R}^{n_2}$, respectively, and let $(x, y) \mapsto g(x, y)$ be integrable. Assume that $\mathbb{E}[g(x, Y)] \ge k(x)$ for some integrable function $k$ and all $x \in K$ with $\mathrm{supp}(X) \subset K \subset \mathbb{R}^{n_1}$. Then, $\mathbb{E}[g(X, Y)] \ge \mathbb{E}[k(X)]$.
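On the real line, the Wasserstein distance between two empirical measures with equally many atoms reduces to a comparison of order statistics, which makes the metric easy to experiment with numerically. The following sketch is our own illustration (the function name is hypothetical, not from the paper):

```python
import numpy as np

def wasserstein_p_empirical(x, y, p=2):
    """W_p between two empirical measures on R with equally many atoms,
    via the order-statistics formula W_p^p = (1/N) * sum_i |x_(i) - y_(i)|^p,
    which is valid for the 1-D case."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    assert x.shape == y.shape, "equal numbers of atoms assumed"
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

# Two empirical measures with four atoms each: every sorted atom of the
# second is the corresponding atom of the first shifted by 1.
d1 = wasserstein_p_empirical([0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0], p=1)
```

Here `d1` equals 1, reflecting the horizontal dislocation by one unit that the metric is designed to penalize.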
3. Problem formulation.
Consider a stochastic optimization problem where the objective function $x \mapsto f(x, \xi)$ depends on a random variable $\xi$ with an unknown distribution $P_\xi$. To hedge this uncertainty, rather than using the empirical distribution
$$P_\xi^N := \frac{1}{N} \sum_{i=1}^N \delta_{\xi^i}, \tag{1}$$
formed by $N$ i.i.d. samples $\xi^1, \ldots, \xi^N$ of $P_\xi$ to optimize a sample average approximation of the expected value of $f$, one can instead consider the DRO problem
$$\inf_{x \in \mathcal{X}} \sup_{P \in \mathcal{P}^N} \mathbb{E}_P[f(x, \xi)], \tag{2}$$
of evaluating the worst-case expectation over some ambiguity set $\mathcal{P}^N$ of probability measures. This helps the designer robustify the decision against plausible variations of the data, which can play a significant role when the number of samples is limited. Different approaches exist to construct the ambiguity set $\mathcal{P}^N$ so that it contains the true distribution $P_\xi$ with high confidence. We are interested in approaches that employ data, and in particular the empirical distribution $P_\xi^N$, to construct them. In the present setup, the data is generated by a dynamical system subject to disturbances, and we only collect partial (instead of full) measurements that are distorted by noise. Therefore, it is no longer obvious how to build a candidate state distribution as in (1) from the collected samples. Further, we seek to address this in a distributionally robust way, i.e., by finding a suitable replacement $\widehat{P}_\xi^N$ for (1) together with an associated ambiguity set, by exploiting the dynamics of the underlying process.

To make things precise, consider data generated by a discrete-time system
$$\xi_{k+1} = A_k \xi_k + G_k w_k, \quad \xi_k \in \mathbb{R}^d, \ w_k \in \mathbb{R}^q, \tag{3a}$$
with linear output
$$\zeta_k = H_k \xi_k + v_k, \quad \zeta_k \in \mathbb{R}^r. \tag{3b}$$
The initial condition $\xi_0$ and the noises $w_k$ and $v_k$, $k \in \mathbb{N}_0$, in the dynamics and the measurements, respectively, are random variables with an unknown distribution. We seek to build an ambiguity set for the state distribution at a certain time $\ell \in \mathbb{N}$, by collecting data up to time $\ell$ from multiple independent realizations of the process, denoted by $\xi^i$, $i \in [1:N]$. This can occur, for instance, when the same process is executed repeatedly, or in multi-agent scenarios where identical entities are subject to the same dynamics, see e.g. [46]. The time-dependent matrices in the dynamics (3) widen the applicability of the results, since they can capture the linearization of nonlinear systems along trajectories or the sampled-data analogues of continuous-time systems under irregular sampling, even if the latter are linear and time invariant. To formally describe the problem, we consider a probability space $(\Omega, \mathcal{F}, P)$ containing all random elements from these realizations, and make the following sampling assumption.

Assumption 2 (Sampling schedule). For each realization $i$ of system (3), output samples $\zeta_0^i, \ldots, \zeta_\ell^i$ are collected over the discrete time instants of the sampling horizon $[0 : \ell]$.

To obtain quantifiable characterizations of the ambiguity sets, we require some further hypotheses on the classes of the distributions $P_{\xi_0}$ of the initial condition, $P_{w_k}$ of the dynamics noise, and $P_{v_k}$ of the measurement errors (cf. Figure 1). These assumptions are made for individual realizations and allow us to consider non-identical observation error distributions; in this way, we allow for the case where each realization is measured by a non-identical sensor of variable precision.

Assumption 3 (Distribution classes). Consider a finite sequence of realizations $\xi^i$, $i \in [1:N]$, of (3a) with associated outputs given by (3b), and noise elements $w_k^i$, $v_k^i$, $k \in \mathbb{N}_0$. We assume the following:

H1:
The distributions $P_{\xi_0^i}$, $i \in [1:N]$, are identical; further, $P_{w_k^i}$, $i \in [1:N]$, are identical for all $k \in \mathbb{N}_0$.

H2:
The $\sigma$-algebras $\sigma\big(\{\xi_0^i\} \cup \{w_k^i\}_{k \in \mathbb{N}_0}\big)$, $\sigma\big(\{v_k^i\}_{k \in \mathbb{N}_0}\big)$, $i \in [1:N]$, are independent.

H3:
The supports of the distributions $P_{\xi_0^i}$ and $P_{w_k^i}$, $k \in \mathbb{N}_0$, are compact, centered at the origin, and have diameters $\rho_\xi$ and $\rho_w$, respectively, for all $i$.

H4:
The components of the random vectors $v_k^i$ have uniformly bounded $L^p$ and $\psi_p$-Orlicz norms, as follows:
$$0 < m_v \le \|v_{k,l}^i\|_p \le M_v, \qquad \|v_{k,l}^i\|_{\psi_p} \le C_v,$$
for all $k \in \mathbb{N}_0$, $i \in [1:N]$, and $l \in [1:r]$, where $p \ge 1$.

Remark 4 (Bounded $\psi_p$-Orlicz/$L^p$-norm ratio). By definition, $\psi_p$-Orlicz norms can become significantly larger than $L^p$ norms for random variables with heavier tails. Thus, over an infinite sequence of random variables $\{X_k\}$, the ratio $\|X_k\|_{\psi_p}/\|X_k\|_p$ may grow unbounded. We exclude this by assuming that $C_v$ and $m_v$ are either positive or zero simultaneously, in which case we set $C_v/m_v := 0$.

Fig. 1. Illustration of the probabilistic models for the random variables in the dynamics and observations according to Assumption 3.
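To make the distribution classes above concrete, the following sketch (our own illustration; all matrices, dimensions, and constants are hypothetical example data) draws $N$ realizations of (3a)-(3b) with zero-centered uniform initial conditions and process noise, which are compactly supported as in H3, and Gaussian, hence sub-Gaussian as in H4, measurement noise:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, r, N, ell = 2, 1, 1, 200, 10   # state/noise/output dims, realizations, horizon

# Time-varying system matrices (hypothetical example data).
A = [np.array([[0.9, 0.1], [0.0, 0.8 + 0.01 * k]]) for k in range(ell)]
G = [np.array([[0.0], [1.0]]) for _ in range(ell)]
H = [np.array([[1.0, 0.0]]) for _ in range(ell + 1)]

rho_xi, rho_w, sigma_v = 1.0, 0.2, 0.05

def realize():
    """One realization of (3a)-(3b): uniform initial condition and process
    noise with supports of diameter rho_xi, rho_w centered at the origin,
    Gaussian measurement noise; returns the state at time ell and the
    output samples zeta_0, ..., zeta_ell."""
    xi = rng.uniform(-rho_xi / 2, rho_xi / 2, size=d)
    outputs = []
    for k in range(ell):
        outputs.append(H[k] @ xi + sigma_v * rng.standard_normal(r))
        w = rng.uniform(-rho_w / 2, rho_w / 2, size=q)
        xi = A[k] @ xi + G[k] @ w
    outputs.append(H[ell] @ xi + sigma_v * rng.standard_normal(r))
    return xi, outputs

states, output_trajs = zip(*(realize() for _ in range(N)))
```

Each call to `realize` produces one independent trajectory; the `states` are the (unobserved) random variables $\xi_\ell^i$ whose distribution the ambiguity set targets, while only `output_trajs` would be available to the designer.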
Since the collected samples do not measure the full state, we aim to leverage the dynamics and estimate it from the assimilated output values. To guarantee some boundedness notion for the state estimation errors over arbitrary evolution horizons, we make the following assumption on the dynamics.
Assumption 5 (Detectability/uniform observability). System (3) satisfies one of the following properties:
(i) It is time invariant and the pair $(A, H)$ (with $A \equiv A_k$ and $H \equiv H_k$) is detectable.
(ii) It is uniformly observable, i.e., for some $t \in \mathbb{N}$, the observability Gramian
$$\mathcal{O}_{k+t,k} := \sum_{i=k}^{k+t} \Phi_{i,k}^\top H_i^\top H_i \Phi_{i,k}$$
satisfies $\mathcal{O}_{k+t,k} \succeq b I$ for certain $b > 0$ and all $k \in \mathbb{N}_0$, where we denote $\Phi_{k+s,k} := A_{k+s-1} \cdots A_{k+1} A_k$. Further, all system matrices are uniformly bounded, and the singular values of $A_k$ and the norms $\|H_k\|$ are uniformly bounded below.

Problem statement: Under Assumptions 2 and 3 on the measurements and distributions of $N$ realizations of the system (3), we seek to construct an estimator $\widehat{\xi}_\ell^i(\zeta_0^i, \ldots, \zeta_\ell^i)$ for the state of each realization and build an ambiguity set for the state distribution at time $\ell$ with probabilistic guarantees. Further, under Assumption 5 on the system's detectability/uniform observability properties, we aim to characterize the effect of the estimation precision on the accuracy of the ambiguity sets.

We proceed to address the problem in Section 4 by exploiting a Luenberger observer to estimate the states of the collected data and using them to replace the classical empirical distribution (1) in the construction of the ambiguity set. To obtain the probabilistic guarantees, we leverage concentration inequalities to bound the distance between the updated empirical distribution and the true state distribution with high confidence. To this end, we further quantify the increase of the ambiguity radius due to the noise. We also study the beneficial effect on the ambiguity radius of detectability/uniform observability for arbitrarily long evolution horizons in Section 5.
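For intuition on condition (ii), the observability Gramian over a window of length $t$ can be evaluated directly from the matrix products in its definition. The sketch below is our own illustration with a hypothetical time-invariant pair; it computes $\mathcal{O}_{k+t,k}$ and extracts a constant $b$ with $\mathcal{O}_{k+t,k} \succeq bI$:

```python
import numpy as np

def transition(A, k, s):
    """State transition matrix Phi_{k+s,k} = A_{k+s-1} ... A_{k+1} A_k."""
    Phi = np.eye(A[0].shape[0])
    for i in range(k, k + s):
        Phi = A[i] @ Phi
    return Phi

def observability_gramian(A, H, k, t):
    """O_{k+t,k} = sum_{i=k}^{k+t} Phi_{i,k}^T H_i^T H_i Phi_{i,k}."""
    d = A[0].shape[0]
    O = np.zeros((d, d))
    for i in range(k, k + t + 1):
        Phi = transition(A, k, i - k)
        O += Phi.T @ H[i].T @ H[i] @ Phi
    return O

# Hypothetical time-invariant pair, used only for illustration.
A = [np.array([[1.0, 1.0], [0.0, 1.0]])] * 5
H = [np.array([[1.0, 0.0]])] * 6
O = observability_gramian(A, H, k=0, t=1)
b = min(np.linalg.eigvalsh(O))  # O >= b*I holds with this b > 0
```

For this pair the window $t = 1$ already yields a positive definite Gramian, i.e., uniform observability of the (time-invariant) example over every window of length one.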
4. State-estimator based ambiguity sets.
We address here the question of how to construct an ambiguity set at a certain time instant $\ell$, when samples are collected from (3) according to Assumption 2. If we had access to $N$ independent full-state samples $\xi_\ell^1, \ldots, \xi_\ell^N$ from the distribution of $\xi$ at $\ell$, we could construct an ambiguity ball in the Wasserstein metric $W_p$ centered at the empirical distribution (1) with $\xi^i \equiv \xi_\ell^i$ and containing the true distribution with high confidence. In particular, for any confidence $1 - \beta > 0$, it is possible, cf. [16, Theorem 3.5], to specify an ambiguity ball radius $\varepsilon_N(\beta)$ so that the true distribution of $\xi_\ell$ is in this ball with confidence $1 - \beta$, i.e.,
$$P\big(W_p(P_{\xi_\ell}^N, P_{\xi_\ell}) \le \varepsilon_N(\beta)\big) \ge 1 - \beta.$$
Instead, since we can only collect noisy partial measurements of the state, we use a Luenberger observer to estimate $\xi$ at time $\ell$. The dynamics of the observer, initialized at zero, is given by
$$\widehat{\xi}_{k+1} = A_k \widehat{\xi}_k + K_k (H_k \widehat{\xi}_k - \zeta_k), \quad \widehat{\xi}_0 = 0, \tag{4}$$
where each $K_k$ is a nonzero gain matrix. Using the corresponding estimates from system (4) for the independent realizations of (3a), we define the (dynamic) estimator-based empirical distribution
$$\widehat{P}_{\xi_k}^N := \frac{1}{N} \sum_{i=1}^N \delta_{\widehat{\xi}_k^i}. \tag{5}$$
Denoting by $e_k := \xi_k - \widehat{\xi}_k$ the error between (3a) and the observer (4), the error dynamics is $e_{k+1} = F_k e_k + G_k w_k + K_k v_k$ with $e_0 = \xi_0$ and $F_k := A_k + K_k H_k$. Iterating this recursion yields
1, where Ψ k + s,k := F k + s − · · · F k +1 F k , Ψ k,k := I and Ψ k := Ψ k, . Tobuild the ambiguity set at time (cid:96) , we set its center at the estimator-based empiricaldistribution (cid:98) P Nξ (cid:96) given by (5). In what follows, we leverage concentration of measureresults to identify an ambiguity radius ψ N ( β ) so that the resulting Wasserstein ballcontains the true distribution with a given confidence 1 − β .Note that the random variable ξ ik of a system realization at time k is a func-tion ξ ik ( ξ i , w ik ) of the random initial condition ξ i and the dynamics noise w ik ≡ ( w i , . . . , w ik − ). Analogously, the estimated state (cid:98) ξ ik of each observer realizationis a stochastic variable (cid:98) ξ ik ( ξ i , w ik , v ik ) with additional randomness induced by theoutput noise v ik ≡ ( v i , . . . , v ik − ). Using the compact notation ξ ≡ ( ξ , . . . , ξ N ), w k ≡ ( w k , . . . , w Nk ), and v k ≡ ( v k , . . . , v Nk ) for the corresponding initial condi-tions, dynamics noise, and output noise of all realizations, respectively, we can denotethe true- and estimator-based-empirical distributions at time (cid:96) as P Nξ (cid:96) ( ξ , w (cid:96) ) and (cid:98) P Nξ (cid:96) ( ξ , w (cid:96) , v (cid:96) ). If we view the initial conditions and the corresponding internal noiseof the realizations ξ i over the whole time horizon as deterministic quantities, weuse the alternative notation P Nξ (cid:96) ( z , ω ) and (cid:98) P Nξ (cid:96) ( z , ω , v (cid:96) ) for the corresponding dis-tributions, where z = ( z , . . . , z N ), z ≡ ξ , . . . , z N ≡ ξ N , and ω = ( ω , . . . , ω N ), ω ≡ w (cid:96) , . . . , ω N ≡ w N(cid:96) . 
We also denote by $P_{\xi_\ell}$ the true distribution of the data at discrete time $\ell$, where from (3a),
$$\xi_\ell = \Phi_\ell \xi_0 + \sum_{k=1}^{\ell} \Phi_{\ell,\ell-k+1} G_{\ell-k} w_{\ell-k}, \tag{7}$$
where $\Phi_\ell := \Phi_{\ell,0}$ and $\Phi_{\ell,\ell} := I$ (and with $\Phi_{k+s,k}$ defined in Assumption 5). Then, it follows from H1 and H2 in Assumption 3 that the random states $\xi_\ell^i$ of the system realizations are independent and identically distributed. Leveraging this, our goal is to associate to each confidence $1 - \beta$ an ambiguity radius $\psi_N(\beta)$ so that
$$P\big(W_p(\widehat{P}_{\xi_\ell}^N, P_{\xi_\ell}) \le \psi_N(\beta)\big) \ge 1 - \beta. \tag{8}$$
To achieve this, we decompose the confidence as the product of two factors:
$$1 - \beta = (1 - \beta_{\mathrm{nom}})(1 - \beta_{\mathrm{ns}}). \tag{9}$$
The first factor (the nominal component "nom") is exploited to control the Wasserstein distance between the true empirical distribution and the true state distribution $P_{\xi_\ell}$. The purpose of the second factor (the noise component "ns") is to bound the Wasserstein distance between the true- and the estimator-based-empirical distributions, which is affected by the measurement noise. Using this decomposition, our strategy to get (8) builds on further breaking the ambiguity radius as
$$\psi_N(\beta) := \varepsilon_N(\beta_{\mathrm{nom}}) + \widehat{\varepsilon}_N(\beta_{\mathrm{ns}}). \tag{10}$$
We exploit what is known [8] for the no-noise case to bound the nominal ambiguity radius $\varepsilon_N(\beta_{\mathrm{nom}})$ with confidence $1 - \beta_{\mathrm{nom}}$. Furthermore, we bound the noise ambiguity radius $\widehat{\varepsilon}_N(\beta_{\mathrm{ns}})$ with confidence $1 - \beta_{\mathrm{ns}}$. This latter radius corresponds to the impact on distributional uncertainty of the internal and measurement noise. In the next sections we present the precise individual bounds for these terms and then combine them to get the main result on the overall ambiguity radius.
4.1. Nominal ambiguity radius. According to Assumption 3, the initial condition and internal noise distributions are compactly supported, and hence the same holds for the state distribution along time. We will therefore use the following result, which focuses on compactly supported distributions and bounds the distance between the true and the empirical distribution for any fixed confidence level.
Proposition 6 (Nominal ambiguity radius [8, Corollary 3.3]). Consider a sequence $\{X_i\}_{i \in \mathbb{N}}$ of i.i.d. $\mathbb{R}^d$-valued random variables with a compactly supported distribution $\mu$. Then for any $p \ge 1$, $N \ge 1$, and confidence $1 - \beta$ with $\beta \in (0, 1)$, we have $P(W_p(\mu^N, \mu) \le \varepsilon_N(\beta, \rho)) \ge 1 - \beta$, where
$$\varepsilon_N(\beta, \rho) := \begin{cases} \Big(\dfrac{\ln(C\beta^{-1})}{c}\Big)^{\frac{1}{2p}} \dfrac{\rho}{N^{\frac{1}{2p}}}, & \text{if } p > d/2, \\[2mm] h^{-1}\Big(\dfrac{\ln(C\beta^{-1})}{cN}\Big)^{\frac{1}{p}} \rho, & \text{if } p = d/2, \\[2mm] \Big(\dfrac{\ln(C\beta^{-1})}{c}\Big)^{\frac{1}{d}} \dfrac{\rho}{N^{\frac{1}{d}}}, & \text{if } p < d/2, \end{cases} \tag{11}$$
$\mu^N := \frac{1}{N} \sum_{i=1}^N \delta_{X_i}$, $\rho := \mathrm{diam}_\infty(\mathrm{supp}(\mu))$, $h(x) := \frac{x^2}{(\ln(2 + 1/x))^2}$, $x > 0$, and the constants $C$ and $c$ depend only on $p$ and $d$.

This result shows how the nominal ambiguity radius depends on the size of the distribution's support, the confidence level, and the number of samples, and is based on recent concentration of measure inequalities from [17]. The determination of the constants $C$ and $c$ in (11) for the whole spectrum of data dimensions $d$ and Wasserstein exponents $p$ is a particularly cumbersome task. In Section 8.2, we use some alternative concentration of measure results, which enable us to provide explicit formulas for these constants when $d > p$.

4.2. Noise ambiguity radius. In this section, we quantify the noise ambiguity radius $\widehat{\varepsilon}_N(\beta_{\mathrm{ns}})$ for any prescribed confidence $1 - \beta_{\mathrm{ns}}$. We first give a result that uniformly bounds the distance between the true- and estimator-based-empirical distributions with prescribed confidence for all values of the initial condition and the internal noise from the set $B_\infty^{Nd}(\rho_\xi) \times B_\infty^{N\ell q}(\rho_w)$, which contains the support of their joint distribution (and hence all their possible realizations). For the results of this section, the initial condition and the internal noise are interpreted as deterministic quantities, as discussed above.

Lemma 7 (Distance between true- and estimator-based-empirical distributions).
Let $(\mathbf{z}, \boldsymbol{\omega}) \in B_\infty^{Nd}(\rho_\xi) \times B_\infty^{N\ell q}(\rho_w)$ and consider the discrete distribution $P_{\xi_\ell}^N \equiv P_{\xi_\ell}^N(\mathbf{z}, \boldsymbol{\omega})$ and the empirical distribution $\widehat{P}_{\xi_\ell}^N \equiv \widehat{P}_{\xi_\ell}^N(\mathbf{z}, \boldsymbol{\omega}, \mathbf{v}_\ell)$, where $\mathbf{v}_\ell$ is the measurement noise of the realizations. Then,
$$W_p(\widehat{P}_{\xi_\ell}^N, P_{\xi_\ell}^N) \le 2^{\frac{p-1}{p}} M_w + 2^{\frac{p-1}{p}} \Big(\frac{1}{N} \sum_{i=1}^N (E^i)^p\Big)^{\frac{1}{p}}, \tag{12}$$
where
$$M_w := \sqrt{d}\, \|\Psi_\ell\| \rho_\xi + \sqrt{q} \sum_{k=1}^{\ell} \|\Psi_{\ell,\ell-k+1} G_{\ell-k}\| \rho_w, \tag{13a}$$
$$E^i \equiv E(\mathbf{v}^i) := \sum_{k=1}^{\ell} \|\Psi_{\ell,\ell-k+1} K_{\ell-k}\| \|v_{\ell-k}^i\|. \tag{13b}$$

The next result bounds the Orlicz and $L^p$ norms of the random variables $E^i$ in Lemma 7.

Lemma 8 (Orlicz- and $L^p$-norm bounds for $E^i$). The random variables $E^i$ in (13b) satisfy
$$\|E^i\|_p \le \overline{M}_v := M_v \sqrt{r} \sum_{k=1}^{\ell} \|\Psi_{\ell,\ell-k+1} K_{\ell-k}\|, \tag{14a}$$
$$\|E^i\|_{\psi_p} \le \overline{C}_v := C_v \sqrt{r} \sum_{k=1}^{\ell} \|\Psi_{\ell,\ell-k+1} K_{\ell-k}\|, \tag{14b}$$
$$\|E^i\|_p \ge \underline{m}_v := m_v r^{\frac{1}{p}} \Big(\sum_{k=1}^{\ell} \|\Psi_{\ell,\ell-k+1} K_{\ell-k}\|^p\Big)^{\frac{1}{p}}, \tag{14c}$$
with $m_v$, $M_v$, and $C_v$ as given in H4.

The proofs of both results above are given in the Appendix. We further rely on the following concentration of measure result around the mean of nonnegative independent random variables, whose proof is also in the Appendix, to bound the term $\big(\frac{1}{N} \sum_{i=1}^N (E^i)^p\big)^{1/p}$ and control the Wasserstein distance between the true- and the estimator-based-empirical distribution.

Proposition 9 (Concentration around $p$th mean). Let $X_1, \ldots, X_N$ be scalar, nonnegative, independent random variables with finite $\psi_p$ norm and $\mathbb{E}[X_i^p] = 1$. Then,
$$P\bigg(\Big(\frac{1}{N} \sum_{i=1}^N X_i^p\Big)^{\frac{1}{p}} - 1 \ge t\bigg) \le 2 \exp\Big(-\frac{c' N}{R^2}\, \alpha_p(t)\Big), \tag{15}$$
for every $t \ge 0$, with a universal constant $c' > 0$, $R := \max_{i \in [1:N]} \|X_i\|_{\psi_p} + 1/\ln 2$, and
$$\alpha_p(s) := \begin{cases} s^2, & \text{if } s \in [0, 1], \\ s^p, & \text{if } s \in (1, \infty). \end{cases} \tag{16}$$

Combining the results above, we obtain the main result of this section regarding the ambiguity center difference.

Proposition 10 (Distance guarantee between the true- and estimator-based-empirical distributions).
Consider a confidence $1 - \beta_{\mathrm{ns}}$ and let
$$\widehat{\varepsilon}_N(\beta_{\mathrm{ns}}) := 2^{\frac{p-1}{p}} \bigg(M_w + \overline{M}_v + \overline{M}_v\, \alpha_p^{-1}\Big(\frac{\bar{R}^2}{c' N} \ln \frac{2}{\beta_{\mathrm{ns}}}\Big)\bigg), \tag{17}$$
with $M_w$, $\overline{M}_v$ given by (13a), (14a),
$$\bar{R} := \overline{C}_v / \underline{m}_v + 1/\ln 2, \tag{18}$$
and $\overline{C}_v$, $\underline{m}_v$ as in (14b), (14c). Then, for all $(\mathbf{z}, \boldsymbol{\omega}) \in B_\infty^{Nd}(\rho_\xi) \times B_\infty^{N\ell q}(\rho_w)$, we have
$$P\big(W_p(\widehat{P}_{\xi_\ell}^N(\mathbf{z}, \boldsymbol{\omega}, \mathbf{v}_\ell), P_{\xi_\ell}^N(\mathbf{z}, \boldsymbol{\omega})) \le \widehat{\varepsilon}_N(\beta_{\mathrm{ns}})\big) \ge 1 - \beta_{\mathrm{ns}}. \tag{19}$$
Proof.
For each $i$, the random variable $X_i := E^i / \|E^i\|_p$ satisfies $\|X_i\|_p = 1$. Thus, we obtain from Proposition 9 that
$$P\bigg(\Big(\frac{1}{N} \sum_{i=1}^N \Big(\frac{E^i}{\|E^i\|_p}\Big)^p\Big)^{\frac{1}{p}} - 1 \ge t\bigg) \le 2 \exp\Big(-\frac{c' N}{R^2}\, \alpha_p(t)\Big),$$
where $R = \max_{i \in [1:N]} \big\|E^i / \|E^i\|_p\big\|_{\psi_p} + 1/\ln 2$. From (14b), (14c), and (18), we deduce $\bar{R} \ge R$, and thus,
$$P\bigg(\Big(\frac{1}{N} \sum_{i=1}^N \Big(\frac{E^i}{\|E^i\|_p}\Big)^p\Big)^{\frac{1}{p}} - 1 \ge t\bigg) \le 2 \exp\Big(-\frac{c' N}{\bar{R}^2}\, \alpha_p(t)\Big).$$
Now, it follows from (14a) that
$$\overline{M}_v \Big(\frac{1}{N} \sum_{i=1}^N \Big(\frac{E^i}{\|E^i\|_p}\Big)^p\Big)^{\frac{1}{p}} - \overline{M}_v \ge \Big(\frac{1}{N} \sum_{i=1}^N (E^i)^p\Big)^{\frac{1}{p}} - \overline{M}_v.$$
Thus, we deduce
$$P\bigg(\Big(\frac{1}{N} \sum_{i=1}^N (E^i)^p\Big)^{\frac{1}{p}} - \overline{M}_v \ge \overline{M}_v t\bigg) \le 2 \exp\Big(-\frac{c' N}{\bar{R}^2}\, \alpha_p(t)\Big),$$
or, equivalently, that
$$P\bigg(\Big(\frac{1}{N} \sum_{i=1}^N (E^i)^p\Big)^{\frac{1}{p}} \ge \overline{M}_v + s\bigg) \le 2 \exp\Big(-\frac{c' N}{\bar{R}^2}\, \alpha_p\Big(\frac{s}{\overline{M}_v}\Big)\Big). \tag{20}$$
To establish (19), it suffices by Lemma 7 to show that
$$P\bigg(2^{\frac{p-1}{p}} M_w + 2^{\frac{p-1}{p}} \Big(\frac{1}{N} \sum_{i=1}^N (E^i)^p\Big)^{\frac{1}{p}} \le \widehat{\varepsilon}_N(\beta_{\mathrm{ns}})\bigg) \ge 1 - \beta_{\mathrm{ns}}.$$
By the definition of $\widehat{\varepsilon}_N$ and exploiting that it is strictly decreasing in $\beta_{\mathrm{ns}}$, it suffices to prove that
$$P\bigg(\Big(\frac{1}{N} \sum_{i=1}^N (E^i)^p\Big)^{\frac{1}{p}} < \overline{M}_v + \overline{M}_v\, \alpha_p^{-1}\Big(\frac{\bar{R}^2}{c' N} \ln \frac{2}{\beta_{\mathrm{ns}}}\Big)\bigg) \ge 1 - \beta_{\mathrm{ns}}.$$
Setting $\tau = \alpha_p^{-1}\big(\frac{\bar{R}^2}{c' N} \ln \frac{2}{\beta_{\mathrm{ns}}}\big)$, we equivalently need to show
$$P\bigg(\Big(\frac{1}{N} \sum_{i=1}^N (E^i)^p\Big)^{\frac{1}{p}} \ge \overline{M}_v + \tau \overline{M}_v\bigg) \le \beta_{\mathrm{ns}},$$
which follows from (20) with $s = \tau \overline{M}_v$.

Here we combine the results from Sections 4.1 and 4.2 to obtain the ambiguity set of the state distribution in the following result.
Theorem 11 (Ambiguity set under noisy dynamics & observations). Consider data collected from $N$ realizations of system (3) in accordance with Assumptions 2 and 3, a confidence $1 - \beta$, and let $\beta_{\mathrm{nom}}, \beta_{\mathrm{ns}} \in (0, 1)$ satisfy (9). Then the guarantee (8) holds, where $\psi_N(\beta)$ is given in (10) and its components $\varepsilon_N(\beta_{\mathrm{nom}}) \equiv \varepsilon_N(\beta_{\mathrm{nom}}, \rho_{\xi_\ell})$ and $\widehat{\varepsilon}_N(\beta_{\mathrm{ns}})$ are given by (11) and (17), respectively, with
$$\rho_{\xi_\ell} := \sqrt{d}\, \|\Phi_\ell\| \rho_\xi + \sqrt{q} \sum_{k=1}^{\ell} \|\Phi_{\ell,\ell-k+1} G_{\ell-k}\| \rho_w. \tag{21}$$

Proof.
Due to (10) and the triangle inequality for W p , { W p ( (cid:98) P Nξ (cid:96) , P ξ (cid:96) ) ≤ ψ N ( β ) } ⊃ { W p ( (cid:98) P Nξ (cid:96) , P Nξ (cid:96) ) ≤ (cid:98) ε N ( β ns ) }∩ { W p ( P Nξ (cid:96) , P ξ (cid:96) ) ≤ ε N ( β nom , ρ ξ (cid:96) ) } . Thus, to show (8), it suffices to show that E (cid:104) { W p ( (cid:98) P Nξ(cid:96) ,P Nξ(cid:96) ) − (cid:98) ε N ( β ns ) ≤ } × { W p ( P Nξ(cid:96) ,P ξ(cid:96) ) − ε N ( β nom ,ρ ξ(cid:96) ) ≤ } (cid:105) ≥ − β. (22)We therefore exploit Lemma 1 with the random variable X ≡ ( ξ , w (cid:96) ), taking valuesin the compact set K ≡ B Nd ∞ ( ρ ξ ) × B N(cid:96)q ∞ ( ρ w ), the random variable Y ≡ v (cid:96) ∈ R N(cid:96)r ,and g ( X, Y ) ≡ g ( ξ , w (cid:96) , v (cid:96) ), where g ( ξ , w (cid:96) , v (cid:96) ) := { W p ( P Nξ(cid:96) ( ξ , w (cid:96) ) ,P ξ(cid:96) ) − ε N ( β nom ,ρ ξ(cid:96) ) ≤ } × { W p ( (cid:98) P Nξ(cid:96) ( ξ , w (cid:96) , v (cid:96) ) ,P Nξ(cid:96) ( ξ , w (cid:96) )) − (cid:98) ε N ( β ns ) ≤ } . Due to (19), we have E (cid:104) { W p ( (cid:98) P Nξ(cid:96) ( z , ω , v (cid:96) ) ,P Nξ(cid:96) ( z , ω )) − (cid:98) ε N ( β ns ) ≤ } (cid:105) ≥ − β ns for any x =( z , ω ) ∈ K and thus E [ g ( x, Y )] ≥ { W p ( P Nξ(cid:96) ( x ) ,P ξ(cid:96) ) − ε N ( β nom ,ρ ξ(cid:96) ) ≤ } × (1 − β ns ) =: k ( x )for all x ∈ K . Hence, since X ≡ ( ξ , w (cid:96) ) and Y ≡ v (cid:96) are independent by H2 , wededuce from Lemma 1 that E [ g ( X, Y )] ≥ E (cid:104) { W p ( P Nξ(cid:96) ( ξ , w (cid:96) ) ,P ξ(cid:96) ) − ε N ( β nom ,ρ ξ(cid:96) ) ≤ } (1 − β ns ) (cid:105) = (1 − β ns ) P ( W p ( P Nξ (cid:96) ( ξ , w (cid:96) ) , P ξ (cid:96) ) ≤ ε N ( β nom , ρ ξ (cid:96) )) . From (7) and H3 in Assumption 3, it follows that P ξ (cid:96) is supported on the com-pact set B d ∞ ( ρ ξ (cid:96) ) with diam ∞ ( B d ∞ ( ρ ξ (cid:96) )) = 2 ρ ξ (cid:96) and ρ ξ (cid:96) given in (21). 
In addition, due to H1 and H2 in Assumption 3, the random states $\xi^i_\ell$ in the empirical distribution $P^N_{\xi_\ell}(\xi_0, w_\ell) = \frac{1}{N}\sum_{i=1}^N \delta_{\xi^i_\ell}$ are i.i.d. Thus, we get from Proposition 6 that $\mathbb{P}(W_p(P^N_{\xi_\ell}(\xi_0, w_\ell), P_{\xi_\ell}) \le \varepsilon_N(\beta_{nom}, \rho_{\xi_\ell})) \ge 1 - \beta_{nom}$, which implies $\mathbb{E}[g(X, Y)] \ge (1 - \beta_{ns})(1 - \beta_{nom}) = 1 - \beta$. Finally, (22) follows from this and the definition of $g$.

With this result at hand, we deduce from the expressions (11) and (19) for the components of the ambiguity radius that it decreases as we exploit a larger number $N$ of independent trajectories and relax our confidence choices, i.e., reduce $1 - \beta_{nom}$ and $1 - \beta_{ns}$. Notice further that, no matter how many trajectories we use, the noise ambiguity radius decreases to a strictly positive value. It is also worth observing that $\psi_N$ generalizes the nominal ambiguity radius $\varepsilon_N$ in the DRO literature (even when dynamic random variables are considered [8]) and reduces to $\varepsilon_N$ in the noise-free case where $\hat\varepsilon_N = 0$.

Drawing conclusions about how the ambiguity radius behaves as we simultaneously allow the horizon $[0:\ell]$ and the number $N$ of sampled trajectories to increase is a more delicate matter. The value of the nominal component depends essentially on $N$ and the support of the distribution at $\ell$, with the latter in turn depending on the system's stability properties and the supports of the initial condition and internal noise distributions. On the other hand, the noise component depends on $N$ and the quality of the estimation error. We quantify in the next section how the latter guarantees uniform boundedness of the noise radius under detectability-type assumptions.
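The structure just described, an overall radius $\psi_N(\beta)$ obtained by summing a shrinking nominal component and a noise component with a strictly positive floor, under a confidence split $1-\beta = (1-\beta_{nom})(1-\beta_{ns})$, can be sketched numerically. All constants below are illustrative placeholders, not the paper's explicit values from (11) and (17).

```python
import numpy as np

def nominal_radius(N, beta_nom, rho=0.35, d=6, p=2):
    """Schematic nominal component: a concentration-of-measure rate that
    shrinks to zero as N grows (placeholder constants, not eq. (11))."""
    return rho * (np.log(1.0 / beta_nom) / N) ** (p / d)

def noise_radius(N, beta_ns, floor=0.47, C=0.8):
    """Schematic noise component: an O(N^{-1/2}) deviation term on top of a
    strictly positive floor caused by the estimation error (cf. (17))."""
    return floor + C * np.sqrt(np.log(2.0 / beta_ns) / N)

def overall_radius(N, beta):
    # Split the joint confidence as 1 - beta = (1 - beta_nom)(1 - beta_ns).
    beta_nom = beta_ns = 1.0 - np.sqrt(1.0 - beta)
    return nominal_radius(N, beta_nom) + noise_radius(N, beta_ns)
```

As the discussion above indicates, `overall_radius` decreases in $N$ but never drops below the noise floor.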
5. Sufficient conditions for uniformly bounded noise ambiguity radii.
In this section we leverage Assumption 5 to establish that the noise ambiguity radius remains uniformly bounded as the sampling horizon increases. We first provide uniform bounds for the matrices involved in the system and observer error dynamics.
Proposition 12 (Bounds on system/observer matrices).
Under Assumption 5, the gain matrices $K_k$ can be selected so that the following properties hold:
(i) There exist $K_\star, K^\star, G^\star > 0$ and $\Psi^\star_s > 0$, $s \in \mathbb{N}$, so that $\|G_k\| \le G^\star$, $K_\star \le \|K_k\| \le K^\star$, and $\|\Psi_{k+s,k}\| \le \Psi^\star_s$ for all $s, k \in \mathbb{N}$.
(ii) There exists $s_0 \in \mathbb{N}$ so that $\|\Psi_{k+s,k}\| \le \frac{1}{2}$ for all $k \in \mathbb{N}$ and $s \ge s_0$.

Proof. Note that we only need to verify part (i) for the time-varying case. Since all $G_k$ are uniformly bounded, we directly obtain the bound $G^\star$. Let
$$K_k := -A_k \Phi_{k,k-t-1} O^{-1}_{k,k-t-1} \Phi^\top_{k,k-t-1} H^\top_k \quad (\text{for } k > t+1),$$
as selected in [32, Page 574] (but with a minus sign in front, so as to obtain the plus sign in $F_k = A_k + K_k H_k$), and with the observability Gramian $O_{k,k-t-1}$ as defined in Assumption 5(ii). Then the upper bound $K^\star$ follows from the fact that the system matrices are uniformly bounded, combined with the uniform observability property of Assumption 5, which implies that all $O^{-1}_{k,k-t-1}$ are also uniformly bounded. On the other hand, the lower bound $K_\star$ follows from the assumption that the system matrices are uniformly bounded, which imposes a uniform lower bound on the smallest singular value of $O^{-1}_{k,k-t-1}$; the uniform lower bound on the smallest singular value of $A_k$, hence also on those of $\Phi_{k,k-t-1}$ and $\Phi^\top_{k,k-t-1}$; and the uniform lower bound on $\|H_k\|$ (all found in Assumption 5). Finally, the bounds $\Psi^\star_s$ are obtained using the assumed uniform bounds for all $A_k$ and $H_k$ and the derived bound $K^\star$ for all $K_k$.

To show part (ii), assume first that Assumption 5(i) holds, i.e., the system is time invariant and $(A, H)$ is detectable. Then we can choose a nonzero gain matrix $K$ so that $F = A + KH$ is convergent (cf. [38, Theorem 31]), namely $\lim_{s\to\infty}\|F^s\| = 0$. Consequently, there is $s_0 \in \mathbb{N}$ with $\|F^s\| \le \frac{1}{2}$ for all $s \ge s_0$, and the result follows by taking into account that $\Psi_{k+s,k} = F^s$. In case Assumption 5(ii) holds, let
$$\tilde e_{k+1} = F_k \tilde e_k \qquad (23)$$
be the recursive noise-free version of the error equation (6). Then, from [32, Page 577], there exist a quadratic time-varying Lyapunov function $V(k, \tilde e) := \tilde e^\top Q_k \tilde e$ with each $Q_k$ positive definite, constants $a_1, a_2 > 0$, $a_3 \in (0,1)$, and $m \in \mathbb{N}$ so that
$$a_1 \le \lambda_{\min}(Q_k) \le \lambda_{\max}(Q_k) \le a_2, \qquad (24a)$$
$$V(k+m, \tilde e_{k+m}) - V(k, \tilde e_k) \le -a_3 V(k, \tilde e_k) \qquad (24b)$$
for any $k$ and any solution of (23) with state $\tilde e_k$ at time $k$. The latter implies $\Psi^\top_{k+m,k} Q_{k+m} \Psi_{k+m,k} \preceq (1-a_3) Q_k$. Then $\Psi^\top_{k+\nu m,k} Q_{k+\nu m} \Psi_{k+\nu m,k} \preceq (1-a_3)^\nu Q_k$, which is verified inductively, since
$$\Psi^\top_{k+(\nu+1)m,k} Q_{k+(\nu+1)m} \Psi_{k+(\nu+1)m,k} = \Psi^\top_{k+m,k} \Psi^\top_{k+(\nu+1)m,k+m} Q_{k+(\nu+1)m} \Psi_{k+(\nu+1)m,k+m} \Psi_{k+m,k} \preceq (1-a_3)^\nu \Psi^\top_{k+m,k} Q_{k+m} \Psi_{k+m,k} \preceq (1-a_3)^{\nu+1} Q_k.$$
Next, pick $\tilde e$ with $\|\tilde e\| = 1$ and $\|\Psi_{k+\nu m,k}\tilde e\| = \|\Psi_{k+\nu m,k}\|$. Taking into account that $\tilde e^\top \Psi^\top_{k+\nu m,k} Q_{k+\nu m} \Psi_{k+\nu m,k} \tilde e \le (1-a_3)^\nu \tilde e^\top Q_k \tilde e$, we get $\lambda_{\min}(Q_{k+\nu m})\|\Psi_{k+\nu m,k}\tilde e\|^2 \le (1-a_3)^\nu \lambda_{\max}(Q_k)$. Using (24a),
$$\|\Psi_{k+\nu m,k}\| \le (1-a_3)^{\nu/2}\Big(\frac{a_2}{a_1}\Big)^{1/2}. \qquad (25)$$
Now select $\nu_0$ so that $(1-a_3)^{\nu'/2}(a_2/a_1)^{1/2} \le 1/(2\max_{s\in[1:m]}\Psi^\star_s)$ for all $\nu' \ge \nu_0$. Let $s_0 := \nu_0 m$ and pick $s \ge s_0$. Then $s = s' + m'$ for some $s' = \nu' m$, $\nu' \ge \nu_0$, and $m' \in [0:m-1]$, and we get from (25), part (i), and the selection of $\nu_0$ that
$$\|\Psi_{k+s,k}\| = \|\Psi_{k+s'+m',k+s'}\Psi_{k+s',k}\| \le \|\Psi_{k+s'+m',k+s'}\|\|\Psi_{k+\nu'm,k}\| \le \Psi^\star_{m'}\,\frac{1}{2\max_{s\in[1:m]}\Psi^\star_s} \le \frac{1}{2},$$
which establishes the result.

Based on this result and Assumption 5 about the system's detectability/uniform observability properties, we proceed to provide a uniform bound on the size of the noise radius for arbitrarily long evolution horizons.

Proposition 13 (Uniform bounds for noise ambiguity radius).
Consider data collected from $N$ realizations of system (3), a confidence $1-\beta$ as in (9), and let Assumptions 2, 3, and 5 hold. Then there exist observer gain matrices $K_k$ so that the noise ambiguity radius $\hat\varepsilon_N$ in (17) is uniformly bounded with respect to the sampling-horizon size. In particular, there exists $\ell_0 \in \mathbb{N}$ so that, for each $\ell \ge \ell_0$, the constants $M_w \equiv M_w(\ell)$, $M_v \equiv M_v(\ell)$, and $R \equiv R(\ell)$ given by (13a), (14a), and (18) are uniformly upper bounded as
$$M_w \le \sqrt{d}\,\rho_\xi + 3\sqrt{q}\sum_{j=0}^{\ell_0-1}\Psi^\star_j G^\star \rho_w, \qquad M_v \le 3\overline{M}_v r\sum_{j=0}^{\ell_0-1}\Psi^\star_j K^\star, \qquad R \le \frac{C_v}{\overline{m}_v}\, r^{\frac{p-1}{p}}\,\frac{3\sum_{j=0}^{\ell_0-1}\Psi^\star_j K^\star}{K_\star}.$$
Proof.
Consider gain matrices $K_k$ and the time $s_0$ as given in Proposition 12, and let $\ell_0 := s_0$. Then, for any $\ell \ge \ell_0$, we can write $\ell = n\ell_0 + r'$ with $0 \le r' < \ell_0$, and we have
$$\sum_{k=1}^{\ell}\|\Psi_{\ell,\ell-k+1}G_{\ell-k}\| \le \sum_{k=1}^{\ell}\|\Psi_{\ell,\ell-k+1}\|G^\star = \Big(\sum_{k=1}^{r'}\|\Psi_{\ell,\ell-k+1}\| + \sum_{k=r'+1}^{\ell}\|\Psi_{\ell,\ell-k+1}\|\Big)G^\star$$
$$\le \Big(\sum_{s=0}^{r'-1}\Psi^\star_s + \sum_{k=r'+1}^{n\ell_0+r'}\|\Psi_{n\ell_0+r',\,n\ell_0+r'-k+1}\|\Big)G^\star \qquad (k \mapsto (\nu-1)\ell_0 + j + r')$$
$$= \Big(\sum_{s=0}^{r'-1}\Psi^\star_s + \sum_{\nu=1}^{n}\sum_{j=1}^{\ell_0}\|\Psi_{n\ell_0+r',\,(n-\nu)\ell_0+r'+\ell_0-j+1}\|\Big)G^\star \qquad (\ell_0+1-j \mapsto j)$$
$$= \Big(\sum_{s=0}^{r'-1}\Psi^\star_s + \sum_{\nu=1}^{n}\sum_{j=1}^{\ell_0}\|\Psi_{n\ell_0+r',\,(n-\nu)\ell_0+r'+j}\|\Big)G^\star \le \Big(\sum_{s=0}^{r'-1}\Psi^\star_s + \sum_{\nu=1}^{n}\|\Psi_{n\ell_0+r',\,(n-\nu+1)\ell_0+r'}\|\sum_{j=1}^{\ell_0}\|\Psi_{(n-\nu)\ell_0+r'+\ell_0,\,(n-\nu)\ell_0+r'+j}\|\Big)G^\star$$
$$\le \Big(\sum_{s=0}^{r'-1}\Psi^\star_s + \sum_{\nu=1}^{n}\Big(\prod_{\kappa=1}^{\nu-1}\|\Psi_{(n+1-\kappa)\ell_0+r',\,(n-\kappa)\ell_0+r'}\|\Big)\sum_{j=1}^{\ell_0}\Psi^\star_{\ell_0-j}\Big)G^\star \le \Big(\sum_{s=0}^{\ell_0-1}\Psi^\star_s + \sum_{\nu=1}^{n}\Big(\frac{1}{2}\Big)^{\nu-1}\sum_{j=0}^{\ell_0-1}\Psi^\star_j\Big)G^\star \le 3\sum_{j=0}^{\ell_0-1}\Psi^\star_j G^\star,$$
where empty sums are zero and the empty product $\prod_{\kappa=1}^{0} \equiv 1$; in the last steps we used Proposition 12(ii), which gives $\|\Psi_{(n+1-\kappa)\ell_0+r',\,(n-\kappa)\ell_0+r'}\| \le \frac12$, and $\sum_{\nu=1}^{n}(1/2)^{\nu-1} \le 2$. From this and the fact that, by Proposition 12(ii), $\|\Psi_\ell\| \le \frac12 \le 1$ for all $\ell \ge \ell_0$, we get the upper bound for $M_w$. The upper bound for $M_v$ is obtained in exactly the same way. Finally, to upper bound $R$, we bound $\sum_{k=1}^{\ell}\|\Psi_{\ell,\ell-k+1}K_{\ell-k}\|$ in the same way as for $M_w$, and exploit Proposition 12(i) to get the lower bound
$$m_v = \overline{m}_v\, r^{\frac{1}{p}}\Big(\sum_{k=1}^{\ell}\|\Psi_{\ell,\ell-k+1}K_{\ell-k}\|^p\Big)^{\frac{1}{p}} \ge \overline{m}_v\, r^{\frac{1}{p}}\|\Psi_{\ell,\ell}K_{\ell-1}\| \ge \overline{m}_v\, r^{\frac{1}{p}}K_\star,$$
which is also independent of $\ell$.

Remark (Noise ambiguity radius for time-invariant systems).
For time-invariant systems, it is possible to improve the bounds of Proposition 13 for $M_w$, $M_v$, and $R$ by exploiting the fact that the system and observer gain matrices are constant. The precise bounds in this case (see also [7, Proposition 5.5]) are
$$M_w \le \sqrt{d}\,\rho_\xi + 2\sqrt{q}\sum_{k=0}^{\ell_0-1}\|\Psi_k G\|\rho_w, \qquad M_v \le \overline{M}_v r\sum_{k=0}^{\ell_0-1}\|\Psi_k K\|, \qquad R \le \frac{C_v}{\overline{m}_v}\, r^{\frac{p-1}{p}}\,\frac{\sum_{k=0}^{\ell_0-1}\|\Psi_k K\|}{\big(\sum_{k=0}^{\ell_0-1}\|\Psi_k K\|^p\big)^{\frac{1}{p}}},$$
with $\ell_0$ as in the time-invariant case of Proposition 13, and where $G$ and $K$ denote the constant values of the internal noise and observer gain matrices, respectively. The superiority of these bounds can be checked using the definition of the matrix bounds in Proposition 12(i); their derivation is based on a simplified version of the arguments employed in the proof of Proposition 13. $\Box$
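For a time-invariant instance, the two facts used above, namely that a convergent $F = A + KH$ satisfies $\|F^s\| \le \frac12$ beyond some $s_0$, and that sums of the form $\sum_k \|F^k G\|$ (and hence $M_w$, $M_v$) stay bounded uniformly in the horizon, can be checked numerically. The matrices and gain below are hypothetical, chosen only so that $(A, H)$ is detectable; they are not from the paper.

```python
import numpy as np

# Hypothetical detectable pair (A, H); K makes F = A + K H Schur stable.
A = np.array([[1.1, 0.0], [0.0, 0.5]])
H = np.array([[1.0, 0.0]])            # only the unstable mode is measured
K = np.array([[-0.7], [0.0]])
F = A + K @ H                          # eigenvalues {0.4, 0.5}
assert max(abs(np.linalg.eigvals(F))) < 1.0

def s0_half(F, smax=500):
    # Smallest s with ||F^s|| <= 1/2; exists since F is convergent.
    Fs = np.eye(F.shape[0])
    for s in range(1, smax + 1):
        Fs = Fs @ F
        if np.linalg.norm(Fs, 2) <= 0.5:
            return s
    raise RuntimeError("F does not appear to be convergent")

def mw_sum(ell, G=np.eye(2)):
    # Partial sums of sum_k ||F^k G||; geometric decay => uniformly bounded.
    Fk, total = np.eye(2), 0.0
    for _ in range(ell):
        total += np.linalg.norm(Fk @ G, 2)
        Fk = Fk @ F
    return total
```

The partial sums of `mw_sum` stabilize after a few terms, mirroring the uniform horizon-independent bounds of Proposition 13.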
6. Application to economic dispatch with distributed energy resources.
In this section we illustrate the usefulness of our results.

Fig. 2. (a) The equivalent circuit model of a lithium-ion battery cell in discharging mode (cf. [29, Figure 2], [28, Figure 1]). (b) Taken from [28, Figure 3]: the nonlinear dependence of the open-circuit voltage on the state of charge, together with its affine approximation.

We take advantage of the ambiguity sets constructed from noisy partial measurements, cf. Theorem 11, to hedge against the uncertainty in an optimal economic dispatch problem. This is a problem where uncertainty is naturally present due to (dynamic) energy resources that the scheduler cannot directly control or measure, such as storage or renewable energy elements. Consider a network with distributed energy resources [11] comprising $n$ generator units and $n$ storage (battery) units. The network needs to operate as close as possible to a prescribed power demand $D$ at the end of the time horizon $[0:\ell]$, corresponding to a uniform discretization with step size $\delta t$ of the continuous-time domain. To this end, each generator and storage unit supplies the network with positive power $P^j$ and $S^\iota$, respectively, at time $\ell$. We assume we can control the power of the generators, which additionally needs to lie between the lower and upper thresholds $P^j_{\min}$ and $P^j_{\max}$, respectively. Each battery is modeled as an uncertain dynamic element with an unknown initial-state distribution, and we can decide whether it is connected ($\eta_\iota = 1$) or not ($\eta_\iota = 0$) to the network at time $\ell$. Our goal is to minimize the overall energy cost while remaining as close as possible to the prescribed power demand. Thus, we minimize the overall cost
$$C(P, \eta) := \sum_{j=1}^{n} g_j(P^j) + \sum_{\iota=1}^{n}\eta_\iota h_\iota(S^\iota) + c\Big(\sum_{j=1}^{n}P^j + \sum_{\iota=1}^{n}\eta_\iota S^\iota - D\Big)^2, \qquad (26)$$
where $P := (P^1, \ldots, P^n)$, $\eta := (\eta_1, \ldots$
$, \eta_n)$, and $g_j$ and $h_\iota$ are cost functions for the power provided by generator $j$ and storage unit $\iota$, respectively. We treat the deviation of the injected power from its prescribed demand as a soft constraint by assigning it a quadratic cost with weight $c$ and augmenting the overall cost function (26). Due to the uncertainty about the batteries' states and, hence, their injected powers $S^\iota$, minimizing (26) is a stochastic optimization problem.

Each battery is modeled as a single-cell dynamic element and we consider its current $I^\iota$, discharging over the operation interval (if connected to the network), as a fixed and a priori known function of time. Its dynamics is conveniently approximated by the equivalent circuit in Figure 2(a) (see, e.g., [28, 29]), where $z^\iota$ is the state of charge (SoC) of the cell and $\mathrm{Ocv}(z^\iota)$ is its corresponding open-circuit voltage, which we approximate by the affine function $\alpha_\iota z^\iota + \beta_\iota$ in Figure 2(b). The associated discrete-time cell model is
$$\chi^\iota_{k+1} \equiv \begin{pmatrix} I^{\iota,1}_{k+1} \\ z^\iota_{k+1} \end{pmatrix} = \begin{pmatrix} a_\iota & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} I^{\iota,1}_{k} \\ z^\iota_{k} \end{pmatrix} + \begin{pmatrix} 1-a_\iota \\ -\delta t/Q_\iota \end{pmatrix} I^\iota_k, \qquad \theta^\iota_k \equiv V^\iota_k = \alpha_\iota z^\iota_k + \beta_\iota - I^\iota_k R_{\iota,0} - I^{\iota,1}_k R_{\iota,1},$$
where $a_\iota := e^{-\delta t/(R_{\iota,1}C_\iota)}$, $\delta t$ is the time discretization step, and $Q_\iota$ is the cell capacity. Here we assume that for all $k \in [0:\ell]$ the cell is neither fully charged nor discharged (by, e.g., requiring that $0 < z^\iota_0 - \sum_{k'=0}^{k-1}\delta t I^\iota_{k'}/Q_\iota < 1$ for all $k$ and any candidate initial conditions and input currents), so the evolution of its voltage is accurately represented by the above difference equation. The initial condition, comprising the SoC $z^\iota_0$ and the current $I^{\iota,1}_0$ through $R_{\iota,1}$, is random with an unknown probability distribution. We also consider additive measurement noise with an unknown distribution; namely, we measure $\theta^\iota_k = \alpha_\iota z^\iota_k + \beta_\iota - I^\iota_k R_{\iota,0} - I^{\iota,1}_k R_{\iota,1} + v_k$.

To track the evolution of each random element through a linear system of the form (3), we consider for each battery a nominal state trajectory $\chi^{\iota,\star}_k = (I^{\iota,1,\star}_k, z^{\iota,\star}_k)$ initiated from the center of the support of its initial-state distribution. Thus, setting $\xi^\iota_k = \chi^\iota_k - \chi^{\iota,\star}_k$ and $\zeta^\iota_k = \theta_k(\chi^\iota_k) - \theta_k(\chi^{\iota,\star}_k)$, we get
$$\xi^\iota_{k+1} = A^\iota_k \xi^\iota_k, \qquad \zeta^\iota_k = H^\iota_k \xi^\iota_k + v_k,$$
where $A^\iota_k := \mathrm{diag}(a_\iota, 1)$ and $H^\iota_k := (-R_{\iota,1}, \alpha_\iota)$. Denoting $\xi := (\xi^1, \ldots, \xi^n)$ and $\zeta := (\zeta^1, \ldots, \zeta^n)$, we obtain a system of the form (3) for the dynamic random variable $\xi$. Although the state distribution of the batteries across time is unknown, we assume access to output data from $N$ independent realizations of their dynamics over the horizon $[0:\ell]$. Using these samples, we exploit the results of the paper to build an ambiguity ball $\mathcal{P}^N$ of radius $\psi_N$ in the 2-Wasserstein distance (i.e., with $p = 2$) that contains the batteries' state distribution $P_{\xi_\ell}$ at time $\ell$ with prescribed probability $1-\beta$. In particular, we take the samples from each realization $i \in [1:N]$ and use an observer to estimate its state $\hat\xi^i_\ell$ at time $\ell$. The ambiguity set is centered at the estimator-based empirical distribution $\hat P^N_{\xi_\ell} = \frac{1}{N}\sum_{i=1}^N \delta_{\hat\xi^i_\ell}$, and its radius can be determined using Theorem 11 and Proposition 10.

To solve the decision problem of whether or not to connect the batteries for economic dispatch, we formulate a distributionally robust optimization problem for the cost (26) using the ambiguity set $\mathcal{P}^N$. To do this, we derive an explicit expression of how the cost function $C$ depends on the stochastic argument $\xi_\ell$.
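A minimal end-to-end sketch of this pipeline, simulating the error dynamics $\xi_{k+1} = A\xi_k$ with noisy scalar outputs $\zeta_k = H\xi_k + v_k$, running a Kalman filter per realization (as in the simulations later in this section), and collecting the resulting estimates $\hat\xi^i_\ell$ that center the empirical distribution, is given below. All numerical values (noise levels, supports, and the entries of $H$) are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([0.945, 1.0])          # A_k = diag(a_iota, 1)
H = np.array([[-0.17, 0.7]])       # H_k = (-R_{iota,1}, alpha_iota), hypothetical values

def estimate_states(N=20, ell=30, q=1e-6, r=1e-4, noise_std=0.01):
    """Kalman-filter each realization's noisy outputs; return (true, estimated)
    states at time ell, whose empirical distribution centers the ambiguity ball."""
    true, est = [], []
    for _ in range(N):
        xi = rng.uniform(-0.15, 0.15, size=2)   # unknown initial condition
        xhat, P = np.zeros(2), np.eye(2)
        for _ in range(ell):
            y = H @ xi + noise_std * rng.standard_normal(1)
            S = H @ P @ H.T + r                 # innovation variance
            K = P @ H.T / S                     # Kalman gain
            xhat = xhat + (K * (y - H @ xhat)).ravel()
            P = (np.eye(2) - K @ H) @ P
            xhat, P = A @ xhat, A @ P @ A.T + q * np.eye(2)
            xi = A @ xi                         # noise-free error dynamics
        true.append(xi)
        est.append(xhat)
    return np.array(true), np.array(est)
```

The estimates track the unobserved states closely after a short horizon, which is what keeps the noise component of the ambiguity radius moderate.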
Notice first that the power injected by each battery at time $\ell$ is
$$S^\iota = I^\iota_\ell V^\iota_\ell = I^\iota_\ell\big(\alpha_\iota z^\iota_\ell + \beta_\iota - I^\iota_\ell R_{\iota,0} - I^{\iota,1}_\ell R_{\iota,1}\big) = \big\langle(-I^\iota_\ell R_{\iota,1}, \alpha_\iota I^\iota_\ell), \chi^\iota_\ell\big\rangle + \beta_\iota I^\iota_\ell - (I^\iota_\ell)^2 R_{\iota,0} = \langle\hat\alpha_\iota, \xi^\iota_\ell\rangle + \hat\beta_\iota \equiv (\hat\alpha_\iota)^\top\xi^\iota_\ell + \hat\beta_\iota,$$
with $\hat\alpha_\iota := (-I^\iota_\ell R_{\iota,1}, \alpha_\iota I^\iota_\ell)$ and
$$\hat\beta_\iota := \langle\hat\alpha_\iota, \chi^{\iota,\star}_\ell\rangle + I^\iota_\ell\beta_\iota - (I^\iota_\ell)^2 R_{\iota,0} = -I^\iota_\ell I^{\iota,1,\star}_\ell R_{\iota,1} + \alpha_\iota I^\iota_\ell z^{\iota,\star}_\ell + I^\iota_\ell\beta_\iota - (I^\iota_\ell)^2 R_{\iota,0}.$$
Assuming affine costs $h_\iota(S) := \bar\alpha_\iota S + \bar\beta_\iota$ for the power provided by the batteries, the overall cost $C$ becomes
$$C(P, \eta) = g(P) + (\eta * \tilde\alpha)^\top\xi_\ell + \eta^\top\tilde\beta + c\big(\mathbf{1}^\top P + (\eta * \hat\alpha)^\top\xi_\ell + \eta^\top\hat\beta - D\big)^2, \qquad (27)$$
where $*$ denotes the Khatri--Rao product (cf. Section 2) and
$$g(P) := \sum_{j=1}^{n} g_j(P^j), \quad \hat\alpha := (\hat\alpha_1, \ldots, \hat\alpha_n), \quad \hat\beta := (\hat\beta_1, \ldots, \hat\beta_n), \quad \tilde\alpha := (\bar\alpha_1\hat\alpha_1, \ldots, \bar\alpha_n\hat\alpha_n), \quad \tilde\beta := (\bar\alpha_1\hat\beta_1 + \bar\beta_1, \ldots, \bar\alpha_n\hat\beta_n + \bar\beta_n).$$
Using the equivalent description (27) for $C$ and recalling the lower and upper bounds $P^j_{\min}$ and $P^j_{\max}$ for the power injected by the generators, we formulate the DRO power dispatch problem
$$\inf_{\eta, P}\Big\{f_\eta(P) + \sup_{P_{\xi_\ell}\in\mathcal{P}^N}\mathbb{E}_{P_{\xi_\ell}}\big[h_\eta(P, \xi_\ell)\big]\Big\}, \qquad (28a)$$
$$\text{s.t. } P^j_{\min} \le P^j \le P^j_{\max} \quad \forall j \in [1:n], \qquad (28b)$$
with the ambiguity set $\mathcal{P}^N$ introduced above and
$$f_\eta(P) := g(P) + cP^\top\mathbf{1}\mathbf{1}^\top P + 2c(\eta^\top\hat\beta - D)\mathbf{1}^\top P + c(\eta^\top\hat\beta - D)^2 + \eta^\top\tilde\beta,$$
$$h_\eta(P, \xi_\ell) := c\,\xi_\ell^\top(\eta*\hat\alpha)(\eta*\hat\alpha)^\top\xi_\ell + \big(2c(\mathbf{1}^\top P + \eta^\top\hat\beta - D)(\eta*\hat\alpha)^\top + (\eta*\tilde\alpha)^\top\big)\xi_\ell.$$
This formulation minimizes the worst-case expected cost with respect to the plausible distributions of $\xi$ at time $\ell$.

Our next goal is to obtain a tractable reformulation of the optimization problem (28). To this end, we first provide an equivalent description for the inner maximization in (28), which is carried out over a space of probability measures. Exploiting strong duality (see [19, Corollary 2(i)] or [5, Remark 1]) and recalling that our ambiguity set $\mathcal{P}^N$ is based on the 2-Wasserstein distance, we equivalently write the inner maximization problem $\sup_{P_{\xi_\ell}\in\mathcal{P}^N}\mathbb{E}_{P_{\xi_\ell}}[h_\eta(P, \xi_\ell)]$ as
$$\inf_{\lambda \ge 0}\Big\{\lambda\psi_N^2 + \frac{1}{N}\sum_{i=1}^N\sup_{\xi_\ell\in\Xi}\big\{h_\eta(P, \xi_\ell) - \lambda\|\xi_\ell - \hat\xi^i_\ell\|^2\big\}\Big\}, \qquad (29)$$
where $\psi_N \equiv \psi_N(\beta)$ is the radius of the ambiguity ball, $\Xi \subset \mathbb{R}^{2n}$ is the support of the batteries' unknown state distribution, and the $\hat\xi^i_\ell$ are the estimated states of their realizations. We slightly relax the problem by allowing the ambiguity ball to contain all distributions within distance $\psi_N$ from $\hat P^N_{\xi_\ell}$ that are supported on $\mathbb{R}^{2n}$ and not necessarily on $\Xi$. Thus, we first look to solve, for each estimated state $\hat\xi^i_\ell$, the optimization problem $\sup_{\xi_\ell\in\mathbb{R}^{2n}}\{h_\eta(P, \xi_\ell) - \lambda\|\xi_\ell - \hat\xi^i_\ell\|^2\}$, which is written
$$\sup_{\xi_\ell\in\mathbb{R}^{2n}}\big\{\xi_\ell^\top A\xi_\ell + \big(2c(\mathbf{1}^\top P + \eta^\top\hat\beta - D)(\eta*\hat\alpha)^\top + (\eta*\tilde\alpha)^\top\big)\xi_\ell - \lambda(\xi_\ell - \hat\xi^i_\ell)^\top(\xi_\ell - \hat\xi^i_\ell)\big\}$$
$$= -\lambda(\hat\xi^i_\ell)^\top\hat\xi^i_\ell + \sup_{\xi_\ell\in\mathbb{R}^{2n}}\big\{\xi_\ell^\top(A - \lambda I_{2n})\xi_\ell + \big(2c(\mathbf{1}^\top P + \eta^\top\hat\beta - D)(\eta*\hat\alpha)^\top + (\eta*\tilde\alpha)^\top + 2\lambda(\hat\xi^i_\ell)^\top\big)\xi_\ell\big\} = -\lambda(\hat\xi^i_\ell)^\top\hat\xi^i_\ell + \sup_{\xi_\ell\in\mathbb{R}^{2n}}\big\{\xi_\ell^\top(A - \lambda I_{2n})\xi_\ell + (r^i)^\top\xi_\ell\big\},$$
where $r^i \equiv r^i_\eta(P, \lambda) := 2c(\mathbf{1}^\top P + \eta^\top\hat\beta - D)(\eta*\hat\alpha) + \eta*\tilde\alpha + 2\lambda\hat\xi^i_\ell$, and $A \equiv A_\eta := c(\eta*\hat\alpha)(\eta*\hat\alpha)^\top$ is a symmetric positive semidefinite matrix with diagonalization $A = Q^\top DQ$, where the eigenvalues decrease along the diagonal.
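The closed form used in the next step, namely that for $\lambda > \lambda_{\max}(A)$ the concave quadratic $\xi^\top(A - \lambda I)\xi + r^\top\xi$ attains its supremum $\frac14 r^\top(\lambda I - A)^{-1}r$ at $\xi^\star = \frac12(\lambda I - A)^{-1}r$, can be verified numerically on a random rank-one $A$ of the form $c(\eta*\hat\alpha)(\eta*\hat\alpha)^\top$ (the data below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
a = rng.standard_normal(n)
A = 0.9 * np.outer(a, a)                 # c*(eta*alpha)(eta*alpha)^T: rank-one, PSD
r = rng.standard_normal(n)
lam = np.linalg.eigvalsh(A)[-1] + 0.5    # lambda > lambda_max(A)

M = lam * np.eye(n) - A                  # positive definite for this lambda
xi_star = 0.5 * np.linalg.solve(M, r)    # stationary point of the concave objective
closed = 0.25 * r @ np.linalg.solve(M, r)

def obj(xi):
    # The inner objective xi^T (A - lam I) xi + r^T xi.
    return xi @ (A - lam * np.eye(n)) @ xi + r @ xi
```

Since the objective is strictly concave for $\lambda > \lambda_{\max}(A)$, any perturbation of `xi_star` can only decrease `obj`, matching the case distinction in (30).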
Hence, we get that
$$\sup_{\xi_\ell\in\mathbb{R}^{2n}}\big\{\xi_\ell^\top(A - \lambda I_{2n})\xi_\ell + (r^i)^\top\xi_\ell\big\} = \sup_{\xi_\ell\in\mathbb{R}^{2n}}\big\{\xi_\ell^\top(Q^\top DQ - Q^\top\lambda I_{2n}Q)\xi_\ell + (r^i)^\top\xi_\ell\big\} = \sup_{\xi\in\mathbb{R}^{2n}}\big\{\xi^\top(D - \lambda I_{2n})\xi + (\hat r^i)^\top\xi\big\},$$
with $\hat r^i := Qr^i$. Denoting by $\lambda_{\max}(A)$ the maximum eigenvalue of $A$, we have
$$\sup_{\xi\in\mathbb{R}^{2n}}\big\{\xi^\top(D - \lambda I_{2n})\xi + (\hat r^i)^\top\xi\big\} = \begin{cases} \infty & \text{if } 0 \le \lambda < \lambda_{\max}(A), \\ \frac14(\hat r^i)^\top(\lambda I_{2n} - D)^{-1}\hat r^i & \text{if } \lambda > \lambda_{\max}(A). \end{cases} \qquad (30)$$
To obtain this, we exploited that $\mathcal{Q}(\xi) := \xi^\top(D - \lambda I_{2n})\xi + (\hat r^i)^\top\xi$ is maximized when
$$\nabla\mathcal{Q}(\xi^\star) = 0 \iff 2(D - \lambda I_{2n})\xi^\star + \hat r^i = 0 \iff \xi^\star = \tfrac12(\lambda I_{2n} - D)^{-1}\hat r^i,$$
which gives the optimal value $\mathcal{Q}(\xi^\star) = \frac14(\hat r^i)^\top(\lambda I_{2n} - D)^{-1}\hat r^i$. Note that we do not need to specify the value of the expression in (30) for $\lambda = \lambda_{\max}(A)$. In particular, since the function we minimize in (29) is convex in $\lambda$, the inner part of the DRO problem is equivalently written
$$\inf_{\lambda > \lambda_{\max}(A)}\Big\{\lambda\Big(\psi_N^2 - \frac{1}{N}\sum_{i=1}^N(\hat\xi^i_\ell)^\top\hat\xi^i_\ell\Big) + \frac{1}{4N}\sum_{i=1}^N\hat r^i_\eta(P, \lambda)^\top(\lambda I_{2n} - D)^{-1}\hat r^i_\eta(P, \lambda)\Big\}.$$
Taking further into account that
$$(\lambda I_{2n} - D)^{-1} = \mathrm{diag}\Big(\frac{1}{\lambda - \lambda_{\max}(A)}, \ldots, \frac{1}{\lambda - \lambda_{\min}(A)}\Big),$$
as well as the constraints (28b) on the decision variable $P$, the overall DRO problem is reformulated as
$$\min_\eta\inf_{P, \lambda}\Big\{f_\eta(P) + \lambda\Big(\psi_N^2 - \frac{1}{N}\sum_{i=1}^N(\hat\xi^i_\ell)^\top\hat\xi^i_\ell\Big) + \frac{1}{4N}\sum_{i=1}^N\hat r^i_\eta(P, \lambda)^\top\mathrm{diag}\Big(\frac{1}{\lambda - \lambda_{\max}(A)}, \ldots, \frac{1}{\lambda - \lambda_{\min}(A)}\Big)\hat r^i_\eta(P, \lambda)\Big\} \qquad (31a)$$
$$\text{subject to } P^j_{\min} \le P^j \le P^j_{\max}\ \forall j \in [1:n], \qquad \lambda > \lambda_{\max}(A). \qquad (31b)$$

Fig. 3. Results from 100 realizations of the power dispatch problem with $N = 10$ independent samples used for each realization. We compute the optimizers of the SAA and DRO problems, plot their corresponding optimal values (termed "SAA cost" and "DRO cost"), and also evaluate their expected performance with respect to the true distribution ("expected cost with SAA optimizer" and "expected cost with DRO optimizer"). With the exception of two realizations (whose DRO and associated expected cost are framed inside black boxes), the DRO value is above the expected cost of the DRO optimizer, namely, this happens with high probability. From the plot, it is also clear that the SAA solution tends to over-promise, since its value is most frequently below the expected cost of the SAA optimizer.

For the simulations we consider $n = 4$ generators and $3$ batteries with the same characteristics. We assume that the distributions of each initial SoC $z^\iota$ and current $I^{\iota,1}$ are known to be supported on the intervals [0 . , .
9] and [1 . , . P z = P z = U [0 . , .
65] ( U denotes uniformdistribution). On the other hand, the provider of battery 1 has access to the distinctbatteries 1A and 1B and selects randomly one among them with probabilities 0.9and 0.1, respectively. The SoC distribution of battery 1A at time zero is P z A = U [0 . , . P z B = U [0 . , . P z = 0 . U [0 . , .
65] + 0 . U [0 . , . I ι, of all batteries are fixed to 1.6308, namely, P I , = P I , = P I , = δ . . For the measurements, we consider the Gaussian mixture noise model P v k =0 . N (0 . , . )+0 . N ( − . , . ) with N ( µ, σ ) denoting the normal distributionwith mean µ and variance σ .To compute the ambiguity radius for the reformulated DRO problem (31), wespecify its nominal and noise components ε N ( β nom , ρ ξ (cid:96) ) and (cid:98) ε N ( β ns ), where due toProposition 6, ρ ξ (cid:96) can be selected as half the diameter of any set containing the sup-port of P ξ (cid:96) in the infinity norm. It follows directly from the specific dynamics of thebatteries that ρ ξ (cid:96) does not exceed half the diameter of the initial conditions’ distri-bution support, which is isometric to [0 . , . × [1 . , . ⊂ R . Hence, using (37)and Proposition 19 with p = 2, d = 6, and ρ ξ (cid:96) = 0 . ε N ( β nom , ρ ξ (cid:96) ) = 4 . N − + 1 . β − ) N − . D. BOSKOS, J. CORT´ES, AND S. MART´INEZ (a)(b)
Fig. 4 . Analogous results to those of Figure 3, from 100 realizations with (a) N = 40 and (b) N = 160 independent samples, and the ambiguity radius tuned so that the same confidence levelis preserved. In both cases, the DRO value is above the expected cost of the DRO optimizer withhigh probability (in fact, always). Furthermore, the expected cost of the DRO optimizer (red star)is strictly better than the expected cost of the SAA one (green circle) for a considerable number ofrealizations (highlighted in the illustrated boxes). To determine the noise radius, we first compute lower and upper bounds m v and M v for the L norm of the Gaussian mixture noise v k and an upper bound C v forits ψ norm. Denoting by E P the integral with respect to the distribution P , wehave for P v k = 0 . N ( µ , σ ) + 0 . N ( µ , σ ) that (cid:107) v k (cid:107) = E ( P + P ) (cid:2) v k (cid:3) = ( µ + σ + µ + σ ), where P = N ( µ , σ ), P = N ( µ , σ ) and we used the fact that E P i (cid:2) v k (cid:3) = µ i + E P i (cid:2) ( v k − µ i ) (cid:3) = µ i + σ i . Hence, in our case, where µ i = σ i = 0 . m v = M v = 0 . √
2. Further, using Proposition 21, we can select C v =0 . (cid:112) / √ ln 2). To perform the state estimation from the output samples we useda Kalman filter. Its initial condition covariance matrix corresponds to independentGaussian distributions for each SoC z ι and current I ι, with a standard deviationof the order of their assumed support. We also select the same covariance as inthe components of the Gaussian mixture noise to model the measurement noise ofthe Kalman filter. Using the dynamics of the filter and the values of m v , M v , and C v above, we obtain from (13a), (14a)-(14c), and (18) the constants M w = 0 . M v = 0 . R = 2 .
72 for the expression of the noise radius. In particular, wehave from Proposition 10 that (cid:98) ε N ( β ns ) = 0 .
47 + 0 . (cid:112) . /N ln(2 /β ns ) and theoverall radius is ψ N ( β ) = 0 .
47 + 4 . N − + 1 . β − ) N − + 0 . β − )) N − . (32)We assume that the energy cost of the generators is lower than that of the batteriesand select the quadratic power generation cost g ( P ) = 0 . (cid:80) j =1 ( P j − . and thesame lower/upper power thresholds P j min = 0 . P j max = 0 . ATA-DRIVEN AMBIGUITY SETS FOR LINEAR SYSTEMS R ι, = 0 .
34 and R ι, = 0 .
17, and wetake a ι = 0 .
945 and I ιk = 8 for all times. We nevertheless use different linear costs h ι ( S ) = ¯ α ι S for their injected powers, with ¯ α = 1 and ¯ α = ¯ α = 1 .
3, since battery 1 is less reliable due to the large SoC fluctuation between its two modes.

We solve 100 independent realizations of the overall economic dispatch problem. For each of them, we generate independent samples from the batteries' initial-condition distributions and solve the associated sample average approximation (SAA) and DRO problems for $N = 10$, $N = 40$, and $N = 160$ samples, respectively, using CVX [20]. It is worth noting that the radius $\psi_N$ given by (32) is rather conservative. The main reasons for this are: 1) conservativeness of the concentration-of-measure results used for the derivation of the nominal radius; 2) lack of homogeneity of the distribution's support (the a priori known support of the $I^{\iota,1}$ components is much smaller than that of the $z^\iota$ ones); 3) independence of the batteries' individual distributions, which we have not exploited; and 4) conservative upper bounds for the estimation error. Although there is room to sharpen all these aspects, doing so requires multiple additional contributions and lies beyond the scope of the paper. Nevertheless, the formula (32) gives qualitative intuition about the decay rates of the ambiguity radius. In particular, it indicates that, under the same confidence level and for small sample sizes, an ambiguity radius proportional to a small negative power of $N$ is a reasonable choice. Based on this, we selected the ambiguity radii 0.
05, 0 . .
025 for $N = 10$, $N = 40$, and $N = 160$, respectively. The associated simulation results are shown in Figures 3, 4(a), and 4(b). There we plot the optimal values of the SAA and DRO problems (termed "SAA cost" and "DRO cost") and report the expected performance of their respective decisions with respect to the true distribution ("expected cost with SAA optimizer" and "expected cost with DRO optimizer"). We observe that in all three cases the DRO value is above the expected cost of the DRO optimizer for nearly all realizations (and for all of them when $N$ is 40 or 160), which verifies the out-of-sample guarantees that we seek in DRO formulations [16, Theorem 3.5]. In addition, when solving the problem with 40 or 160 samples, we witness a clear superiority of the DRO decision over that of the non-robust SAA, as it considerably improves the expected cost for a significant number of realizations (cf. Figure 4). In fact, the SAA solution tends to consistently promise a better outcome than what the true distribution reveals for the same decision (e.g., the magenta circle usually lies below the green circle in all figures). This rarely happens for the DRO solution, and when it does, it is only by a small margin. This makes the DRO approach preferable to the SAA one in the context of power systems operations, where honoring commitments at a much higher cost than anticipated might result in significant losses, and not fulfilling commitments may lead to penalties from the system operator.
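The qualitative behavior described above, with the SAA value under-estimating its optimizer's true expected cost about half the time while a Wasserstein radius restores an upper bound with high probability, can be reproduced in a stylized one-dimensional example. For a 1-Lipschitz linear loss $h(\xi) = \xi$, the worst case over a $W_1$-ball of radius $\varepsilon$ around the empirical distribution equals the empirical mean plus $\varepsilon$ (Kantorovich duality for linear losses), so the comparison reduces to simple statistics. The distribution and radius below are arbitrary choices, unrelated to the battery model:

```python
import numpy as np

rng = np.random.default_rng(2)

def coverage(N=10, eps=0.05, reps=2000):
    """Fraction of trials where the SAA / DRO values upper-bound the true
    expected cost E[xi] = 0.5 of the loss h(xi) = xi."""
    saa_cov = dro_cov = 0
    for _ in range(reps):
        sample = rng.uniform(0.3, 0.7, size=N)   # truth unknown to the designer
        saa_val = sample.mean()                  # sample average approximation
        dro_val = sample.mean() + eps            # worst case over the W1 ball
        saa_cov += saa_val >= 0.5
        dro_cov += dro_val >= 0.5
    return saa_cov / reps, dro_cov / reps
```

The SAA value exceeds the true expected cost in only about half the trials, while the DRO value does so with high probability, mirroring Figures 3 and 4.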
7. Conclusions.
We have constructed high-confidence ambiguity sets for dynamic random variables using partial-state measurements from independent realizations of their evolution. In our model, both the dynamics and the measurements are subject to disturbances with unknown probability distributions. The ambiguity sets are built by using an observer to estimate the full state of each realization and leveraging concentration-of-measure inequalities. For systems that are either time-invariant and detectable, or uniformly observable, we have established uniform boundedness of the ambiguity radius. To aid the associated probabilistic guarantees, we also provided auxiliary concentration-of-measure results. Future research will include the consideration of robust state-estimation criteria to mitigate the noise effect on the ambiguity radius, the extension of the results to nonlinear dynamics, and the construction of ambiguity sets that incorporate information about the moments.
8. Appendix.
Here we give proofs of various results of the paper and provide explicit constants for the ambiguity radius.
Proof of Lemma 7.
Using [8, Lemma A.2] to bound the Wasserstein distance of two discrete distributions, we get
\[
W_p(\widehat P^N_{\xi_\ell}, P^N_{\xi_\ell}) \le \Big( \frac{1}{N} \sum_{i=1}^N \| \widehat\xi^i_\ell - \xi^i_\ell \|^p \Big)^{1/p} = \Big( \frac{1}{N} \sum_{i=1}^N \| e^i_\ell \|^p \Big)^{1/p}.
\]
From (6), we have
\begin{align*}
\| e^i_\ell \| &= \Big\| \Psi_\ell z^i + \sum_{k=1}^{\ell} \big( \Psi_{\ell,\ell-k+1} G_{\ell-k}\, \omega^i_{\ell-k} + \Psi_{\ell,\ell-k+1} K_{\ell-k}\, v^i_{\ell-k} \big) \Big\| \\
&\le \| \Psi_\ell \| \| z^i \| + \sum_{k=1}^{\ell} \| \Psi_{\ell,\ell-k+1} G_{\ell-k} \| \| \omega^i_{\ell-k} \| + \sum_{k=1}^{\ell} \| \Psi_{\ell,\ell-k+1} K_{\ell-k} \| \| v^i_{\ell-k} \| =: M(z^i, \omega^i) + E(v^i),
\end{align*}
with $E(v^i) \equiv E^i$ given in the statement. Using that $(a+b)^p \le 2^{p-1}(a^p + b^p)$ for $a, b \ge 0$ and $p \ge 1$,
\[
W_p(\widehat P^N_{\xi_\ell}, P^N_{\xi_\ell}) \le \Big( \frac{2^{p-1}}{N} \sum_{i=1}^N \big( M(z^i, \omega^i)^p + (E^i)^p \big) \Big)^{1/p}.
\]
Next, using $(a+b)^{1/p} \le a^{1/p} + b^{1/p}$ for $a, b \ge 0$ and $p \ge 1$, we have
\[
W_p(\widehat P^N_{\xi_\ell}, P^N_{\xi_\ell}) \le \Big( \frac{2^{p-1}}{N} \sum_{i=1}^N M(z^i, \omega^i)^p \Big)^{1/p} + \Big( \frac{2^{p-1}}{N} \sum_{i=1}^N (E^i)^p \Big)^{1/p}. \tag{33}
\]
Finally, since $(z, \omega) \in B^{Nd}_\infty(\rho_\xi) \times B^{N\ell q}_\infty(\rho_w)$, we get $M(z^i, \omega^i) \le \| \Psi_\ell \| \sqrt{d}\, \| z^i \|_\infty + \sum_{k=1}^{\ell} \| \Psi_{\ell,\ell-k+1} G_{\ell-k} \| \sqrt{q}\, \| \omega^i_{\ell-k} \|_\infty \le M_w$. This, combined with (33), yields (12).

Proof of Lemma 8.
From H4 in Assumption 3, we obtain for each summand in (13b)
\[
\big\| \| \Psi_{\ell,\ell-k+1} K_{\ell-k} \| \| v^i_{\ell-k} \| \big\|_{\psi_p} \le \| \Psi_{\ell,\ell-k+1} K_{\ell-k} \| \big( \| v^i_{\ell-k,1} \|_{\psi_p} + \cdots + \| v^i_{\ell-k,r} \|_{\psi_p} \big) \le C_v r \| \Psi_{\ell,\ell-k+1} K_{\ell-k} \|.
\]
Hence, we deduce that
\[
\| E^i \|_{\psi_p} \le \sum_{k=1}^{\ell} \big\| \| \Psi_{\ell,\ell-k+1} K_{\ell-k} \| \| v^i_{\ell-k} \| \big\|_{\psi_p} \le C_v r \sum_{k=1}^{\ell} \| \Psi_{\ell,\ell-k+1} K_{\ell-k} \|.
\]
For the $L^p$ bounds, note that $\| E^i \|_{L^p} = \big\| \sum_{k \in [1:\ell],\, l \in [1:r]} \| \Psi_{\ell,\ell-k+1} K_{\ell-k} \| \, | v^i_{\ell-k,l} | \big\|_{L^p}$. Thus, from the inequality $\| \sum_i c_i X_i \|_{L^p} \le \sum_i c_i \| X_i \|_{L^p}$, which holds for any nonnegative $c_i$ and $X_i$ in $L^p$, we get
\[
\| E^i \|_{L^p} \le \sum_{k \in [1:\ell],\, l \in [1:r]} \| \Psi_{\ell,\ell-k+1} K_{\ell-k} \| \, \| v^i_{\ell-k,l} \|_{L^p},
\]
which, combined with H4 of Assumption 3, implies (14a). For the other bound, we exploit linearity of the expectation and the inequality $(\sum_i c_i)^p \ge \sum_i c_i^p$, which holds for any nonnegative $c_i$, to get
\[
\big( \mathbb{E}\big[ (E^i)^p \big] \big)^{1/p} \ge \bigg( \sum_{k \in [1:\ell],\, l \in [1:r]} \| \Psi_{\ell,\ell-k+1} K_{\ell-k} \|^p\, \mathbb{E}\big[ | v^i_{\ell-k,l} |^p \big] \bigg)^{1/p}.
\]
Thus, from the lower bound in H4 of Assumption 3 we also obtain (14c).

We next prove Proposition 9, along the lines of the proof of [39, Theorem 3.1.1], which considers the special case of sub-Gaussian distributions. We rely on the following concentration inequality [39, Corollary 2.8.3].

Proposition 15 (Bernstein inequality).
Let $X_1, \ldots, X_N$ be scalar, mean-zero, sub-exponential, independent random variables. Then, for every $t \ge 0$ we have
\[
P\Big( \Big| \frac{1}{N} \sum_{i=1}^N X_i \Big| \ge t \Big) \le 2 \exp\Big( -c' \min\Big\{ \frac{t^2}{R^2}, \frac{t}{R} \Big\} N \Big),
\]
where $c' = 1/10$ and $R := \max_{i \in [1:N]} \| X_i \|_{\psi_1}$.

The precise constant $c'$ above is not specified in [39]. We therefore give an independent proof of this explicit result in Section 8.2 of the Appendix.

Proof of Proposition 9. Note that each random variable $X_i^p - 1$ satisfies
\[
\| X_i^p - 1 \|_{\psi_1} \le \| X_i^p \|_{\psi_1} + \| 1 \|_{\psi_1} = \| X_i \|_{\psi_p}^p + 1/\ln 2 \le R,
\]
where we took into account that $\mathbb{E}[\psi_1(X_i^p / t^p)] = \mathbb{E}[\psi_p(X_i / t)]$, so that $\| X_i^p \|_{\psi_1} = \| X_i \|_{\psi_p}^p$, and the following fact, shown after the proof.

Fact I. For any constant scalar random variable $X = \mu$, it holds that $\| X \|_{\psi_p} = |\mu| / (\ln 2)^{1/p}$.

Thus, we get from Proposition 15 that
\[
P\Big( \Big| \frac{1}{N} \sum_{i=1}^N X_i^p - 1 \Big| \ge t \Big) \le 2 \exp\Big( -\frac{c' N}{R^2} \min\{ t^2, t \} \Big), \tag{34}
\]
where we used the fact that $R > 1$. We will further leverage the following facts, shown after the proof of the proposition.
Fact II.
For all $p \ge 1$ and $z \ge 0$: $|z - 1| \ge \delta \ \Rightarrow\ |z^p - 1| \ge \max\{ \delta, \delta^p \}$.

Fact III. For any $\delta \ge 0$, if $u = \max\{ \delta, \delta^p \}$, then $\min\{ u^2, u \} = \alpha_p(\delta)$, with $\alpha_p$ as given by (16).

By exploiting Fact II, we get
\begin{align*}
P\bigg( \bigg| \Big( \frac{1}{N} \sum_{i=1}^N X_i^p \Big)^{1/p} - 1 \bigg| \ge t \bigg) &\le P\Big( \Big| \frac{1}{N} \sum_{i=1}^N X_i^p - 1 \Big| \ge \max\{ t, t^p \} \Big) \\
&\le 2 \exp\Big( -\frac{c' N}{R^2} \min\big\{ \max\{ t, t^p \}^2, \max\{ t, t^p \} \big\} \Big).
\end{align*}
Thus, since $P(|Y| \ge t) \ge P(Y \ge t)$ for any random variable $Y$, we obtain (15) from Fact III and conclude the proof.
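Facts II and III, used above, can be spot-checked numerically. A minimal sketch follows; the function `alpha_p` encodes our reconstructed piecewise reading of $\alpha_p$ from (16) (quadratic for small deviations, $p$-th power for large ones), which is an assumption on our part:

```python
import numpy as np

def alpha_p(delta, p):
    # Assumed piecewise form of alpha_p from (16): quadratic regime for
    # delta <= 1, p-th power regime for delta > 1.
    return delta ** 2 if delta <= 1 else delta ** p

for p in [1, 2, 3, 5]:
    for delta in np.linspace(0.0, 3.0, 61):
        # Fact III: min{u^2, u} with u = max{delta, delta^p} equals alpha_p(delta).
        u = max(delta, delta ** p)
        assert np.isclose(min(u ** 2, u), alpha_p(delta, p))
        # Fact II: |z - 1| >= delta implies |z^p - 1| >= max{delta, delta^p}.
        for z in np.linspace(0.0, 4.0, 81):
            if abs(z - 1) >= delta:
                assert abs(z ** p - 1) >= u - 1e-9
```

The check exercises both cases of the proofs given below ($\delta \le 1$ and $\delta > 1$) on a grid.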
Proof of Fact I.
From the $\psi_p$ norm definition,
\[
\| X \|_{\psi_p} = \inf\big\{ t > 0 \,\big|\, \mathbb{E}\big[ e^{(|X|/t)^p} \big] \le 2 \big\} = \inf\big\{ t > 0 \,\big|\, t \ge |\mu| / (\ln 2)^{1/p} \big\} = |\mu| / (\ln 2)^{1/p},
\]
which establishes the result.

Proof of Fact II.
Assume first that $z < 1$. Then, we have that $|z^p - 1| = 1 - z^p \ge 1 - z \ge \delta \ge \delta^p$. Next, let $z \ge 1$. Then, we get $|z^p - 1| = z^p - 1 \ge z - 1 \ge \delta$. In addition, when $\delta^p \ge \delta$, namely, when $\delta \ge 1$, we have that $z^p - 1 \ge (z-1)^p$, and hence, $|z^p - 1| = z^p - 1 \ge (z-1)^p \ge \delta^p$.

Proof of Fact III.
We consider two cases. Case (i): $0 \le \delta \le 1 \Rightarrow \delta \ge \delta^p \Rightarrow u = \max\{\delta, \delta^p\} = \delta$. Then $\min\{u^2, u\} = \min\{\delta^2, \delta\} = \delta^2$. Case (ii): $\delta > 1 \Rightarrow \delta \le \delta^p \Rightarrow u = \max\{\delta, \delta^p\} = \delta^p$. Then $\min\{u^2, u\} = \min\{\delta^{2p}, \delta^p\} = \delta^p$. Thus, we get that $\min\{u^2, u\} = \alpha_p(\delta)$ for all $\delta \ge 0$.

We first give an independent proof of the norm concentration inequality in Proposition 15. This proof entails the explicit derivation of the constant $c' = 1/10$ therein, which is the same as that in Proposition 9. We note that a general concentration result with the same decay rates as Proposition 9 can also be found in [9, Exercise 2.27, Page 51], however, without the explicit characterization of the involved constants. We exploit an equivalent characterization of sub-exponential random variables, stated next. This characterization can be found in [39, Proposition 2.7.1 and Exercise 2.7.2], but here we give the exact constants and the necessary modifications of the corresponding proofs.
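As a concrete instance of the quantities in this characterization, consider $X \sim \mathrm{Exp}(1)$ (our own illustrative choice), for which the $\psi_1$ norm is available in closed form: $\mathbb{E}[e^{X/t}] = t/(t-1)$ for $t > 1$, which equals $2$ exactly at $t = 2$, so $\| X \|_{\psi_1} = 2$. A minimal sketch checking the tail and moment properties of the lemma below against this example:

```python
import math

# For X ~ Exp(1): E[exp(X/t)] = t/(t-1) for t > 1, so the psi_1 norm
# (the smallest t with E[exp(|X|/t)] <= 2) is exactly 2.
psi1 = 2.0
assert abs(psi1 / (psi1 - 1) - 2.0) < 1e-12

# Tail property: P(X >= s) = e^{-s} <= 2 exp(-s/psi1) for all s >= 0.
for s in [0.0, 0.5, 1.0, 5.0, 20.0]:
    assert math.exp(-s) <= 2 * math.exp(-s / psi1) + 1e-15

# Moment property: E[X^p] = p! for Exp(1), bounded by 2 * p! * psi1**p.
for p in range(1, 9):
    assert math.factorial(p) <= 2 * math.factorial(p) * psi1 ** p
```

The example is only a sanity check under the stated distributional assumption, not part of the proof.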
Lemma 16 (Properties of sub-exponential random variables).
Let $X$ be sub-exponential. Then:
(i) The tails of $X$ satisfy $P(|X| \ge t) \le 2 \exp(-t / \| X \|_{\psi_1})$ for all $t \ge 0$.
(ii) The moments of $X$ satisfy $\| X \|_{L^p}^p = \mathbb{E}[|X|^p] \le 2\, p!\, \| X \|_{\psi_1}^p$ for all integers $p \ge 1$.
(iii) If additionally $\mathbb{E}[X] = 0$, the moment generating function of $X$ satisfies $\mathbb{E}[\exp(\lambda X)] \le \exp\big( 2a \| X \|_{\psi_1}^2 \lambda^2 \big)$ for all $a > 1$ and $\lambda$ with $|\lambda| \le \frac{a-1}{a \| X \|_{\psi_1}}$.

Proof.
To show (i), we use Markov's inequality and the definition of the $\psi_1$ norm. In particular, we have
\[
P(|X| \ge t) = P\big( \exp(|X| / \| X \|_{\psi_1}) \ge \exp(t / \| X \|_{\psi_1}) \big) \le \mathbb{E}\big[ \exp(|X| / \| X \|_{\psi_1}) \big] \exp(-t / \| X \|_{\psi_1}) \le 2 \exp(-t / \| X \|_{\psi_1}).
\]
To show (ii), note that
\[
\mathbb{E}[|X|^p] = \int_0^\infty P(|X|^p \ge u)\, du = \int_0^\infty P(|X| \ge t)\, p t^{p-1}\, dt \overset{(i)}{\le} \int_0^\infty 2 \exp(-t / \| X \|_{\psi_1})\, p t^{p-1}\, dt = 2 \| X \|_{\psi_1}^p\, p \int_0^\infty e^{-s} s^{p-1}\, ds = 2 \| X \|_{\psi_1}^p\, \Gamma(p+1),
\]
where we substituted $s = t / \| X \|_{\psi_1}$. Since $\Gamma(p+1) = p!$ for integer values of $p$, we get $\mathbb{E}[|X|^p] \le 2\, p!\, \| X \|_{\psi_1}^p$.

To show (iii), note first that since $\mathbb{E}[X] = 0$,
\[
\mathbb{E}[\exp(\lambda X)] = \mathbb{E}\Big[ 1 + \lambda X + \sum_{p=2}^\infty \frac{(\lambda X)^p}{p!} \Big] = 1 + \sum_{p=2}^\infty \frac{\lambda^p \mathbb{E}[X^p]}{p!} \le 1 + \sum_{p=2}^\infty \frac{|\lambda|^p \mathbb{E}[|X|^p]}{p!}.
\]
Thus, we get from (ii) that
\[
\mathbb{E}[\exp(\lambda X)] \le 1 + 2 \sum_{p=2}^\infty \big( \| X \|_{\psi_1} |\lambda| \big)^p = 1 + \frac{2 \| X \|_{\psi_1}^2 \lambda^2}{1 - \| X \|_{\psi_1} |\lambda|},
\]
for all $|\lambda| < 1 / \| X \|_{\psi_1}$. Further, when $|\lambda| \le \frac{a-1}{a \| X \|_{\psi_1}} \iff 1 - |\lambda| \| X \|_{\psi_1} \ge 1/a$, we have
\[
1 + \frac{2 \| X \|_{\psi_1}^2 \lambda^2}{1 - \| X \|_{\psi_1} |\lambda|} \le 1 + 2a \| X \|_{\psi_1}^2 \lambda^2 \le \exp(2a \| X \|_{\psi_1}^2 \lambda^2).
\]

We next use this result to specify the constant $c'$ in Proposition 15, giving its proof along the lines of Theorem 2.8.1–Corollary 2.8.3 in [39].

Proof of Proposition 15.
We denote $S = \frac{1}{N} \sum_{i=1}^N X_i$ and consider the real parameter $\lambda$. Using independence of the $X_i$'s, we get from Markov's inequality
\[
P(S \ge t) = P(\exp(\lambda S) \ge \exp(\lambda t)) \le \exp(-\lambda t)\, \mathbb{E}[\exp(\lambda S)] = \exp(-\lambda t) \prod_{i=1}^N \mathbb{E}\Big[ \exp\Big( \frac{\lambda}{N} X_i \Big) \Big].
\]
Hence, from Lemma 16(iii) applied to the random variables $\frac{1}{N} X_i$, for any $a > 1$ and
\[
|\lambda| \le \frac{(a-1) N}{a \max_{i \in [1:N]} \| X_i \|_{\psi_1}} \tag{35}
\]
it holds that $\mathbb{E}\big[ \exp\big( \frac{\lambda}{N} X_i \big) \big] \le \exp\big( 2a \frac{\lambda^2}{N^2} \| X_i \|_{\psi_1}^2 \big)$ for each $i \in [1:N]$. Consequently,
\[
P(S \ge t) \le \exp\Big( -\lambda t + 2a \frac{\lambda^2}{N^2} \sum_{i=1}^N \| X_i \|_{\psi_1}^2 \Big).
\]
Minimizing with respect to $\lambda$ under the constraint (35), we get the optimizer
\[
\lambda^* = \min\bigg\{ \frac{t N^2}{4a \sum_{i=1}^N \| X_i \|_{\psi_1}^2},\ \frac{(a-1) N}{a \max_{i \in [1:N]} \| X_i \|_{\psi_1}} \bigg\}.
\]
Combining this with the elementary inequality $\alpha \hat\lambda^2 - \beta \hat\lambda \le -\frac{\beta}{2} \hat\lambda$, which holds for any $\alpha, \beta > 0$ and $\hat\lambda \in [0, \frac{\beta}{2\alpha}]$, we have
\begin{align*}
P(S \ge t) &\le \exp\bigg( -\min\bigg\{ \frac{t^2 N^2}{8a \sum_{i=1}^N \| X_i \|_{\psi_1}^2},\ \frac{t (a-1) N}{2a \max_{i \in [1:N]} \| X_i \|_{\psi_1}} \bigg\} \bigg) \\
&\le \exp\bigg( -\min\bigg\{ \frac{t^2}{8a \big( \max_{i \in [1:N]} \| X_i \|_{\psi_1} \big)^2},\ \frac{t (a-1)}{2a \max_{i \in [1:N]} \| X_i \|_{\psi_1}} \bigg\} N \bigg) \\
&\le \exp\bigg( -\min\Big\{ \frac{1}{8a},\ \frac{a-1}{2a} \Big\} \min\bigg\{ \frac{t^2}{\big( \max_{i \in [1:N]} \| X_i \|_{\psi_1} \big)^2},\ \frac{t}{\max_{i \in [1:N]} \| X_i \|_{\psi_1}} \bigg\} N \bigg)
\end{align*}
for all $a > 1$. Taking into account that $\min\big\{ \frac{1}{8a}, \frac{a-1}{2a} \big\}$ is $\frac{a-1}{2a}$ for $1 < a < 5/4$ and $\frac{1}{8a}$ for $a \ge 5/4$, we can select $a = 5/4$, which maximizes this term, and obtain the optimal decay rate
\[
P(S \ge t) \le \exp\bigg( -\frac{1}{10} \min\bigg\{ \frac{t^2}{\big( \max_{i \in [1:N]} \| X_i \|_{\psi_1} \big)^2},\ \frac{t}{\max_{i \in [1:N]} \| X_i \|_{\psi_1}} \bigg\} N \bigg).
\]
Repeating the above arguments for the random variables $-\frac{1}{N} X_i$, we derive the same bound for $P(-S \ge t)$ and establish the result with $c' = 1/10$.

We now provide explicit constants $C$ and $c$ for the nominal ambiguity radius $\varepsilon_N$ given by (11) when $p < d/2$. Note that any other case can also be reduced to this at the cost of increased conservativeness by embedding the distribution in a higher-dimensional space. Further, the most typical values of $p$ are $p = 1$, where general DRO problems admit the tractable reformulations provided in [16], and $p = 2$, where the dual optimization problem admits certain convenient quadratic terms, which for instance facilitate taking gradients [12]. Thus, one can use the precise concentration results for reasonably low-dimensional data. To obtain the desired constants, we exploit results from [6, 15]. In particular, from [6, Proposition A.2], we have the following concentration inequality, which quantifies how the Wasserstein distance between the true and the empirical distribution concentrates around its expected value.

Proposition 17 (Concentration around empirical Wasserstein mean).
Assume that the probability measure $\mu$ is supported on the compact subset $B$ of $\mathbb{R}^d$ (with the Euclidean norm). Then,
\[
P\big( W_p(\mu_N, \mu) \ge \mathbb{E}[W_p(\mu_N, \mu)] + t \big) \le e^{-N t^{2p} / (2 \tilde\rho^{2p})} \quad \forall t \ge 0, \tag{36}
\]
where $\tilde\rho = \mathrm{diam}(B)$.

We will also use [15, Proposition 1 and Remark 4], which give the following bound for the expected Wasserstein distance between the empirical and actual distribution.
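Before stating it, here is a small Monte Carlo sketch of (36) for $p = 1$ and $\mu$ the uniform distribution on $[0,1]$ (so $\tilde\rho = 1$), using the one-dimensional identity $W_1(\mu_N, \mu) = \int |F_N - F|$; the sample sizes, seed, and threshold below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials, t = 50, 2000, 0.1
grid = np.linspace(0.0, 1.0, 1001)

def w1_uniform(samples):
    # In 1-D, W_1(mu_N, U[0,1]) equals the integral over [0,1] of
    # |F_N(x) - x|, approximated here on a uniform grid.
    F_N = np.searchsorted(np.sort(samples), grid, side="right") / len(samples)
    return np.mean(np.abs(F_N - grid))

dists = np.array([w1_uniform(rng.uniform(size=N)) for _ in range(trials)])
mean_w1 = dists.mean()

# Empirical tail P(W_1 >= E[W_1] + t) versus the bound exp(-N t^2 / 2)
# from (36) with p = 1 and diam(B) = 1.
empirical_tail = np.mean(dists >= mean_w1 + t)
assert empirical_tail <= np.exp(-N * t ** 2 / 2)
```

The empirical tail is in fact far below the bound here, consistent with (36) being a worst-case guarantee.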
Proposition 18 (Decay of empirical Wasserstein mean).
For any probability measure $\mu$ supported on $[0,1]^d$ and $p < d/2$, $\mathbb{E}[W_p(\mu_N, \mu)] \le C_\star N^{-1/d}$, with
\[
C_\star := \frac{\sqrt{d}}{2^{(d-2)/(2p)}} \bigg( \frac{1}{1 - 2^{p - d/2}} + \frac{1}{1 - 2^{-p}} \bigg)^{1/p}. \tag{37}
\]
Combining Propositions 17 and 18, we get the following explicit characterization of the nominal ambiguity radius.

Proposition 19 (Explicit concentration inequality constants).
Assume that the probability measure $\mu$ is supported on $B \subset \mathbb{R}^d$ with $\rho := \mathrm{diam}_\infty(B) < \infty$ and that $p < d/2$. Then, we can select the nominal ambiguity radius
\[
\varepsilon_N(\beta, \rho) := 2\rho \Big( C_\star N^{-\frac{1}{d}} + \sqrt{d}\, (2 \ln \beta^{-1})^{\frac{1}{2p}} N^{-\frac{1}{2p}} \Big).
\]
Proof.
Since the Wasserstein distance of the dilation of two distributions in a vector space by a factor is equal to this factor times their original Wasserstein distance (as exploited, e.g., in [8, Proposition 3.2]), and $B$ is contained in a cube of side $2\rho$, we have from Proposition 18 that $\mathbb{E}(W_p(\mu_N, \mu)) \le 2\rho C_\star N^{-1/d}$. Substituting the latter in (36) and taking into account that $\mathrm{diam}(B) \le \sqrt{d}\, \mathrm{diam}_\infty(B)$, i.e., that $\tilde\rho \le 2\sqrt{d}\rho$, we get
\[
P\big( W_p(\mu_N, \mu) \ge 2\rho C_\star N^{-\frac{1}{d}} + t \big) \le e^{-N t^{2p} / (2 d^p (2\rho)^{2p})} \quad \forall t \ge 0.
\]
Set $\varepsilon := 2\rho C_\star N^{-\frac{1}{d}} + t$ and $\beta := e^{-N t^{2p} / (2 d^p (2\rho)^{2p})} \iff t = 2\sqrt{d}\rho\, (2 \ln \beta^{-1})^{\frac{1}{2p}} N^{-\frac{1}{2p}}$. Then $P(W_p(\mu_N, \mu) \le \varepsilon) \ge 1 - \beta$ for all $\beta \in (0,1)$ and $\varepsilon \equiv \varepsilon_N(\beta, \rho) = 2\rho \big( C_\star N^{-\frac{1}{d}} + \sqrt{d} (2 \ln \beta^{-1})^{\frac{1}{2p}} N^{-\frac{1}{2p}} \big)$.

We also give an explicit ambiguity radius expression in terms of a single exponential inequality as in (11) in the following result.
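Before that, a quick numerical sanity check of Proposition 19 (with $C_\star$ as reconstructed in (37); the parameter values below are arbitrary illustrative choices) confirms that the radius shrinks as samples accumulate:

```python
import numpy as np

def C_star(d, p):
    # Constant from (37), as reconstructed here; requires p < d/2.
    assert p < d / 2
    return (np.sqrt(d) / 2 ** ((d - 2) / (2 * p))) * (
        1 / (1 - 2 ** (p - d / 2)) + 1 / (1 - 2 ** (-p))
    ) ** (1 / p)

def eps_N(N, beta, rho, d, p):
    # Nominal ambiguity radius of Proposition 19.
    return 2 * rho * (
        C_star(d, p) * N ** (-1 / d)
        + np.sqrt(d) * (2 * np.log(1 / beta)) ** (1 / (2 * p)) * N ** (-1 / (2 * p))
    )

radii = [eps_N(N, beta=0.05, rho=1.0, d=5, p=2) for N in (10, 100, 1000)]
assert radii[0] > radii[1] > radii[2]  # radius shrinks with more data
```

Both terms of the radius decay with $N$, the first at the dimension-dependent rate $N^{-1/d}$ and the second at $N^{-1/(2p)}$.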
Corollary (Alternative explicit constants).
Under the assumptions of Proposition 19, we can select the nominal ambiguity radius
\[
\varepsilon_N(\beta, \rho) := 2\rho \bigg( \frac{\ln\big( C_{\star\star} \beta^{-1} \big)}{c_{\star\star}} \bigg)^{\frac{1}{2d}} N^{-\frac{1}{2d}}, \quad \text{with } C_{\star\star} := e^{C_\star^{2d} / (2 d^d)} \text{ and } c_{\star\star} := \frac{1}{2^{2d} d^d}.
\]
Proof. Note first that $e^{-N t^{2p} / (2 d^p (2\rho)^{2p})} \le e^{-N t^{2d} / (2 d^d (2\rho)^{2d})}$ when $t \in [0, 2\sqrt{d}\rho]$ (for $t > 2\sqrt{d}\rho$ the probability of interest is zero). Thus, using the inequality $a^{1/q} + b^{1/q} \le \big( 2^{q-1}(a+b) \big)^{1/q}$ for $q \ge 1$, we get in analogy to the proof of Proposition 19 that
\begin{align*}
\varepsilon &= 2\rho \big( C_\star N^{-\frac{1}{d}} + \sqrt{d}\, (2 \ln \beta^{-1})^{\frac{1}{2d}} N^{-\frac{1}{2d}} \big) \le 2\rho \big( C_\star + \sqrt{d}\, (2 \ln \beta^{-1})^{\frac{1}{2d}} \big) N^{-\frac{1}{2d}} \\
&= 2\rho \Big( \big( C_\star^{2d} \big)^{\frac{1}{2d}} + \big( 2 d^d \ln \beta^{-1} \big)^{\frac{1}{2d}} \Big) N^{-\frac{1}{2d}} \le 2\rho \Big( 2^{2d-1} \big( C_\star^{2d} + 2 d^d \ln \beta^{-1} \big) \Big)^{\frac{1}{2d}} N^{-\frac{1}{2d}} \\
&= 2\rho \bigg( 2^{2d} d^d \Big( \frac{C_\star^{2d}}{2 d^d} + \ln \beta^{-1} \Big) \bigg)^{\frac{1}{2d}} N^{-\frac{1}{2d}} = 2\rho \bigg( \frac{\ln\big( e^{C_\star^{2d} / (2 d^d)} \beta^{-1} \big)}{2^{-2d} d^{-d}} \bigg)^{\frac{1}{2d}} N^{-\frac{1}{2d}} \equiv 2\rho \bigg( \frac{\ln\big( C_{\star\star} \beta^{-1} \big)}{c_{\star\star}} \bigg)^{\frac{1}{2d}} N^{-\frac{1}{2d}},
\end{align*}
with $C_{\star\star}$ and $c_{\star\star}$ as given in the statement.

Here we discuss how to compute the $\psi_2$ norm, i.e., the sub-Gaussian norm, of a random variable with a Gaussian mixture distribution.

Fact IV.
For $X \sim \mathcal{N}(0, 1)$, it holds that $\| X \|_{\psi_2} = \sqrt{8/3}$.

Proof.
By definition, $\| X \|_{\psi_2} = \inf\{ t > 0 \,|\, \mathbb{E}[\exp(X^2 / t^2)] \le 2 \}$. Therefore, we seek to determine $\inf\big\{ t > 0 \,\big|\, \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} \exp\big( -x^2 \big( \frac{1}{2} - \frac{1}{t^2} \big) \big)\, dx \le 2 \big\}$. Setting $\frac{1}{2\tilde\sigma^2} = \frac{1}{2} - \frac{1}{t^2}$, namely, $\tilde\sigma \equiv \tilde\sigma(t) = \sqrt{\frac{t^2}{t^2 - 2}}$, the expression becomes
\[
\inf\bigg\{ t > 0 \,\bigg|\, \tilde\sigma\, \frac{1}{\sqrt{2\pi}\tilde\sigma} \int_{\mathbb{R}} \exp\Big( -\frac{x^2}{2\tilde\sigma^2} \Big)\, dx \le 2 \bigg\} = \inf\bigg\{ t > 0 \,\bigg|\, \sqrt{\frac{t^2}{t^2 - 2}} \le 2 \bigg\} = \sqrt{8/3}.
\]
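A quadrature check of this value (grid and truncation bounds below are arbitrary numerical choices):

```python
import numpy as np

# Verify E[exp(X^2/t^2)] = 2 for X ~ N(0,1) at t^2 = 8/3: the integrand
# decays like exp(-x^2/8), so truncating at |x| = 30 is harmless.
t2 = 8.0 / 3.0
x = np.linspace(-30.0, 30.0, 600_001)
f = np.exp(x ** 2 / t2) * np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
dx = x[1] - x[0]
val = np.sum((f[1:] + f[:-1]) / 2) * dx  # trapezoidal rule
assert abs(val - 2.0) < 1e-6
```

Any $t$ below $\sqrt{8/3}$ makes the same integral exceed 2, matching the infimum in the proof.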
Fact V.
For $X \sim \mathcal{N}(\mu, \sigma^2)$, it holds that $\| X \|_{\psi_2} \le \sigma \sqrt{8/3} + |\mu| / \sqrt{\ln 2}$.

Proof.
Note that $X = Y + \sigma Z$, with $Y = \delta_\mu$ and $Z \sim \mathcal{N}(0, 1)$. Since $\| \cdot \|_{\psi_2}$ is a norm, we get from Fact I in the proof of Proposition 9 and Fact IV above that $\| X \|_{\psi_2} \le \| Y \|_{\psi_2} + \sigma \| Z \|_{\psi_2} = |\mu| / \sqrt{\ln 2} + \sigma \sqrt{8/3}$.

Fact VI.
Given arbitrary distributions $\nu_i$, let $X_i \sim \nu_i$, $i = 1, \ldots, n$, and $X \sim \sum_{i=1}^n c_i \nu_i$, with $\sum_{i=1}^n c_i = 1$, $c_i \ge 0$. Then $\| X \|_{\psi_2} \le \max_{i = 1, \ldots, n} \| X_i \|_{\psi_2}$.

Proof.
From the definition of the $\psi_2$ norm,
\begin{align*}
\| X \|_{\psi_2} &= \inf\bigg\{ t > 0 \,\bigg|\, \sum_{i=1}^n c_i \int_{\mathbb{R}} \exp(x^2 / t^2)\, \nu_i(dx) \le 2 \sum_{i=1}^n c_i \bigg\} \\
&\le \inf\bigg\{ t > 0 \,\bigg|\, \int_{\mathbb{R}} \exp(x^2 / t^2)\, \nu_i(dx) \le 2 \ \ \forall i = 1, \ldots, n \bigg\} \\
&= \max_{i = 1, \ldots, n} \inf\bigg\{ t > 0 \,\bigg|\, \int_{\mathbb{R}} \exp(x^2 / t^2)\, \nu_i(dx) \le 2 \bigg\} = \max_{i = 1, \ldots, n} \| X_i \|_{\psi_2}.
\end{align*}
The following result is a consequence of Facts V and VI.
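This combination can be verified directly for specific mixtures, since $\mathbb{E}[\exp(X^2/t^2)]$ has a closed form for Gaussian components, namely $\exp\big( \mu^2 / (t^2 - 2\sigma^2) \big) / \sqrt{1 - 2\sigma^2 / t^2}$ for $t^2 > 2\sigma^2$ (a standard Gaussian integral); the example mixture below is our own illustrative choice:

```python
import numpy as np

def mgf_sq(mu, sigma, t):
    # E[exp(X^2/t^2)] for X ~ N(mu, sigma^2); finite when t^2 > 2 sigma^2.
    a = 1.0 / t ** 2
    return np.exp(a * mu ** 2 / (1 - 2 * a * sigma ** 2)) / np.sqrt(1 - 2 * a * sigma ** 2)

# Mixture 0.5 N(0.5, 1) + 0.5 N(-0.3, 4); candidate psi_2 bound obtained by
# combining Facts V and VI.
comps = [(0.5, 1.0, 0.5), (-0.3, 2.0, 0.5)]  # (mu_i, sigma_i, c_i)
t_bound = max(s * np.sqrt(8 / 3) + abs(m) / np.sqrt(np.log(2)) for m, s, _ in comps)

mix = sum(c * mgf_sq(m, s, t_bound) for m, s, c in comps)
assert mix <= 2.0  # certifies ||X||_{psi_2} <= t_bound for this mixture
```

The check confirms that the mixture's $\psi_2$-defining expectation stays below 2 at the candidate bound, as the proposition below asserts.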
Proposition (Sub-Gaussian norm of Gaussian mixture).
Let $X \sim \sum_{i=1}^n c_i \mathcal{N}(\mu_i, \sigma_i^2)$, with $\sum_{i=1}^n c_i = 1$, $c_i \ge 0$. Then, $\| X \|_{\psi_2} \le \max_{i = 1, \ldots, n} \big\{ \sigma_i \sqrt{8/3} + |\mu_i| / \sqrt{\ln 2} \big\}$.

REFERENCES
[1] R. B. Ash, Real Analysis and Probability, Academic Press, 1972.
[2] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski, Robust Optimization, Princeton University Press, 2009.
[3] D. Bertsimas, V. Gupta, and N. Kallus, Robust sample average approximation, Mathematical Programming, 171 (2018), pp. 217–282.
[4] J. Blanchet, Y. Kang, and K. Murthy, Robust Wasserstein profile inference and applications to machine learning, Journal of Applied Probability, 56 (2019), pp. 830–857.
[5] J. Blanchet and K. Murthy, Quantifying distributional model risk via optimal transport, Mathematics of Operations Research, 44 (2019), pp. 565–600.
[6] E. Boissard and T. Le Gouic, On the mean speed of convergence of empirical and occupation measures in Wasserstein distance, Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 50 (2014), pp. 539–563.
[7] D. Boskos, J. Cortés, and S. Martínez, Data-driven ambiguity sets for linear systems under disturbances and noisy observations, in American Control Conference, Denver, CO, July 2020, pp. 4491–4496.
[8] D. Boskos, J. Cortés, and S. Martínez, Data-driven ambiguity sets with probabilistic guarantees for dynamic processes, IEEE Transactions on Automatic Control, 66 (2021). To appear. Available at https://arxiv.org/abs/1909.11194.
[9] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence, Oxford University Press, 2013.
[10] Z. Chen, D. Kuhn, and W. Wiesemann, Data-driven chance constrained programs over Wasserstein balls, arXiv preprint arXiv:1809.00210, (2018).
[11] A. Cherukuri and J. Cortés, Distributed coordination of DERs with storage for dynamic economic dispatch, IEEE Transactions on Automatic Control, 63 (2018), pp. 835–842.
[12] A. Cherukuri and J. Cortés, Cooperative data-driven distributionally robust optimization, IEEE Transactions on Automatic Control, 65 (2020), pp. 4400–4407.
[13] J. Coulson, J. Lygeros, and F. Dörfler, Regularized and distributionally robust data-enabled predictive control, in IEEE Int. Conf. on Decision and Control, Nice, France, December 2019, pp. 2696–2701.
[14] J. Dedecker and F. Merlevède, Behavior of the empirical Wasserstein distance in R^d under moment conditions, Electronic Journal of Probability, 24 (2019).
[15] S. Dereich, M. Scheutzow, and R. Schottstedt, Constructive quantization: Approximation by empirical measures, Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 49 (2013), pp. 1183–1203.
[16] P. M. Esfahani and D. Kuhn, Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations, Mathematical Programming, 171 (2018), pp. 115–166.
[17] N. Fournier and A. Guillin, On the rate of convergence in Wasserstein distance of the empirical measure, Probability Theory and Related Fields, 162 (2015), pp. 707–738.
[18] R. Gao, X. Chen, and A. J. Kleywegt, Wasserstein distributional robustness and regularization in statistical learning, arXiv preprint arXiv:1712.06050, (2017).
[19] R. Gao and A. Kleywegt, Distributionally robust stochastic optimization with Wasserstein distance, arXiv preprint arXiv:1604.02199, (2016).
[20] M. Grant and S. Boyd, CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, Mar. 2014.
[21] Y. Guo, K. Baker, E. Dall'Anese, Z. Hu, and T. H. Summers, Data-based distributionally robust stochastic optimal power flow–Part I: Methodologies, IEEE Transactions on Power Systems, 34 (2018), pp. 1483–1492.
[22] A. Hota, A. Cherukuri, and J. Lygeros, Data-driven chance constrained optimization under Wasserstein ambiguity sets, in American Control Conference, Philadelphia, PA, USA, 2019, pp. 1501–1506.
[23] B. Kloeckner, Empirical measures: regularity is a counter-curse to dimensionality, arXiv preprint arXiv:1802.04038, (2019), https://doi.org/10.1051/ps/2019025.
[24] B. Li, J. Mathieu, and R. Jiang, Distributionally robust chance constrained optimal power flow assuming log-concave distributions, in Power Systems Computation Conference, 2018, pp. 1–7.
[25] D. Li, D. Fooladivanda, and S. Martínez, Data-driven variable speed limit design for highways via distributionally robust optimization, in European Control Conference, Napoli, Italy, June 2019, pp. 1055–1061.
[26] D. Li, D. Fooladivanda, and S. Martínez, Online learning of parameterized uncertain dynamical environments with finite-sample guarantees, preprint arXiv:0000.00000, (2020).
[27] D. Li and S. Martínez, Online data assimilation in distributionally robust optimization, in IEEE Int. Conf. on Decision and Control, Miami, FL, USA, December 2018, pp. 1961–1966.
[28] M. Li, Li-ion dynamics and state of charge estimation, Renewable Energy, 100 (2017), pp. 44–52.
[29] B. Y. Liaw, G. Nagasubramanian, R. G. Jungst, and D. H. Doughty, Modeling of lithium ion cells–A simple equivalent-circuit model approach, Solid State Ionics, 175 (2004), pp. 835–839.
[30] J. Liu, Y. Chen, C. Duan, J. Lin, and J. Lyu, Distributionally robust optimal reactive power dispatch with Wasserstein distance in active distribution network, Journal of Modern Power Systems and Clean Energy, 8 (2020), pp. 426–436.
[31] S. Liu, Matrix results on the Khatri-Rao and Tracy-Singh products, Linear Algebra and its Applications, 289 (1999), pp. 267–277.
[32] J. B. Moore and B. D. O. Anderson, Coping with singular transition matrices in estimation and control stability theory, International Journal of Control, 31 (1980), pp. 571–586.
[33] B. K. Poolla, A. R. Hota, S. Bolognani, D. S. Callaway, and A. Cherukuri, Wasserstein distributionally robust look-ahead economic dispatch, arXiv preprint arXiv:2003.04874, (2020).
[34] S. Shafieezadeh-Abadeh, D. Kuhn, and P. M. Esfahani, Regularization via mass transportation, Journal of Machine Learning Research, 20 (2019), pp. 1–68.
[35] S. Shafieezadeh-Abadeh, V. A. Nguyen, D. Kuhn, and P. M. Esfahani, Wasserstein distributionally robust Kalman filtering, in Advances in Neural Information Processing Systems, 2018, pp. 8474–8483.
[36] A. Shapiro, Distributionally robust stochastic programming, SIAM Journal on Optimization, 27 (2017), pp. 2258–2275.
[37] A. Shapiro, D. Dentcheva, and A. Ruszczyński, Lectures on Stochastic Programming: Modeling and Theory, vol. 16, SIAM, Philadelphia, PA, 2014.
[38] E. D. Sontag, Mathematical Control Theory: Deterministic Finite Dimensional Systems, Springer, 1998.
[39] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47, Cambridge University Press, 2018.
[40] C. Villani, Topics in Optimal Transportation, no. 58 in Graduate Studies in Mathematics, American Mathematical Society, 2003.
[41] J. Weed and F. Bach, Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance, Bernoulli, 25 (2019), pp. 2620–2648.
[42] J. Weed and Q. Berthet, Estimation of smooth densities in Wasserstein distance, arXiv preprint arXiv:1902.01778, (2019).
[43] F. Xin, B.-M. Hodge, L. Fangxing, D. Ershun, and K. Chongqing, Adjustable and distributionally robust chance-constrained economic dispatch considering wind power uncertainty, Journal of Modern Power Systems and Clean Energy, 7 (2019), pp. 658–664.
[44] I. Yang, A convex optimization approach to distributionally robust Markov decision processes with Wasserstein distance, IEEE Control Systems Letters, 1 (2017), pp. 164–169.
[45] I. Yang, Wasserstein distributionally robust stochastic control: A data-driven approach, arXiv preprint arXiv:1812.09808, (2018).
[46]