Bridging direct & indirect data-driven control formulations via regularizations and relaxations
Florian Dörfler, Jeremy Coulson, and Ivan Markovsky
Abstract—We discuss connections between sequential system identification and control for linear time-invariant systems, which we term indirect data-driven control, as well as a direct data-driven control approach seeking an optimal decision compatible with recorded data assembled in a Hankel matrix and robustified through suitable regularizations. We formulate these two problems in the language of behavioral systems theory and parametric mathematical programs, and we bridge them through a multi-criteria formulation trading off system identification and control objectives. We illustrate our results with two methods from subspace identification and control: namely, subspace predictive control and low-rank approximation, which constrain trajectories to be consistent with a non-parametric predictor derived from (respectively, the column span of) a data Hankel matrix. In both cases we conclude that direct and regularized data-driven control can be derived as convex relaxations of the indirect approach, and the regularizations account for an implicit identification step. Our analysis further reveals a novel regularizer and sheds light on the remarkable performance of direct methods on nonlinear systems.
I. INTRODUCTION
The vast realm of data-driven control methods can be classified into indirect data-driven control approaches, consisting of sequential system identification and model-based control, as well as direct data-driven control approaches seeking an optimal decision compatible with recorded data. Both approaches have a rich history, and they have received renewed interest cross-fertilized by novel methods and widespread interest in machine learning. Representative recent surveys are [1]–[6]. The pros and cons of both paradigms have often been elaborated on: e.g., modeling and identification is cumbersome, its results are often not useful for control (due to, e.g., incompatible uncertainty quantifications), and practitioners generally prefer end-to-end approaches. While direct data-driven control promises to resolve these problems by learning control policies directly from data, the available methods often do not (yet) lend themselves to real-time and safety-critical control systems due to lack of certificates and overburdening computational and sample complexity, among others. Quite a few approaches have tried to bridge the two paradigms. Of relevance to this article, we note the literature on identification for control [6]–[9] and control-oriented regularized identification [10], which propose that the control objective should bias the identification task.
FD and JC are with the Department of Information Technology and Electrical Engineering, ETH Zurich, 8092 Zurich, Switzerland. Email: {dorfler,coulson}@control.ee.ethz.ch. IM is with the Department of Electrical Engineering (ELEC) of the Vrije Universiteit Brussel, 1050 Brussels, Belgium. Email: [email protected]. This work was supported by ETH Zurich funds. IM was supported by the Fonds for Scientific Research Vlaanderen (FWO) project G090117N and the Excellence of Science Project 30468160.

We take a similar perspective here: the sequential identification and control tasks can be abstracted as a nested bi-level optimization problem: find the best control subject to a model, where the model is the best fit to a data set within some hypothesis class. This approach is modular, and both steps admit tractable formulations, but generally it is also suboptimal: there is no separation principle for these two nested optimization problems, aside from special cases; see [6, Section 4]. An end-to-end direct algorithmic approach may thus outperform indirect methods if a tractable formulation were available. For the latter we resort to a paradigm squarely in between behavioral systems theory and subspace system identification methods. Behavioral systems theory [11]–[13] takes an abstract view on dynamical systems as sets of trajectories, and it does not require parametric representations, which makes it appealing from a data-centric perspective. For example, linear time-invariant (LTI) systems are characterized as shift-invariant subspaces within an ambient space of time series. The role of identification is to find such a low-dimensional feature from data. Subspace methods take a similar (albeit more algorithmic) viewpoint [14]–[16] and extract parametric models from the range and null spaces of a low-rank data Hankel matrix. Both lines of work come together in a result known as the Fundamental Lemma [17]; see also [18], [19] for recent extensions.
It states that, under some assumptions, the set of all finite-length trajectories (the restricted behavior) of an LTI system equals the range space of a data Hankel matrix. This result serves as the theoretic underpinning for work in subspace identification [19]–[21] and data-driven control, in particular subspace predictive control based on non-parametric models [22]–[24], explicit feedback policies parametrized by data matrices [25]–[27], and data-enabled predictive control seeking compatibility of predicted trajectories with the range space of a data Hankel matrix. The latter methods have first been established for deterministic LTI systems in [28], [29] and have recently been extended by suitably regularizing the optimal control problems. Closed-loop stability was certified in [30]. The regularizations were first mere heuristics [31] but have later been constructively derived by robust control and optimization [32]–[35]. These approaches, albeit recent, have proved themselves in practical nonlinear problems [35]–[38]. We also note the recent maximum-likelihood perspective [39]. In this paper, we explore the following questions: how does data-enabled predictive control relate to a prior system identification? What are principled regularizations? And why does it work so well in the nonlinear case? We start our investigations from indirect data-driven control formulated as a bi-level optimization problem in the general output feedback setting. As a vehicle to transition between indirect and direct approaches, we consider a multi-criteria problem trading off identification and control objectives.
We formally show that one tail of its Pareto front corresponds to the bi-level problem, and a convex relaxation results in the regularized data-enabled predictive control formulations used in [30]–[38]. Most of our results are formulated in the abstract language of behavioral systems theory and parametric mathematical programs, but we also specialize our treatment to two concrete methods: subspace predictive control [22]–[24] and low-rank approximation [29]. In both cases we conclude that direct regularized data-driven control can be derived as a convex relaxation of the indirect approach, where the projection of the data on the set of LTI systems is replaced by regularizations accounting for implicit identification. Our analysis reveals a novel regularizer promoting a least-squares fit of the data by projecting on the null space of the Hankel matrix. We illustrate our results with numerical studies illustrating the role of regularization, the superiority of the new regularizer, and comparisons. Informed by our analysis, we hypothesize and numerically confirm that the indirect approach is superior in case of “variance” error, e.g., for LTI stochastic systems, and the direct approach wins in terms of “bias” error, e.g., for nonlinear systems, supporting the observations in [35]–[38]. The remainder of this paper is organized as follows: Section II reviews representations of LTI systems. Section III formulates the direct and indirect data-driven control problems, and Section IV bridges them. Section V contains our numerical studies. Finally, Section VI concludes the paper.

II. LTI SYSTEMS AND THEIR REPRESENTATIONS
We adopt a behavioral perspective which allows for system theory independent of parametric system representations. We aim at a concise exposition and refer to [11]–[13] for details.
A. Behavioral Perspective on Discrete-Time LTI Systems
Consider the discrete time axis Z, the signal space R^q, and the associated space of trajectories (R^q)^Z consisting of all q-variate sequences ( …, w(−1), w(0), w(1), … ) with w(i) ∈ R^q. Consider a permutation matrix P partitioning each w(i) = P [u(i); y(i)], where u(i) ∈ R^m and y(i) ∈ R^{q−m} are free and dependent variables that will later serve as inputs and outputs. The behavior B is defined as a subset of the space of trajectories, B ⊂ (R^q)^Z, and a system as the triple (Z, R^q, B). In what follows, we denote a system merely by its behavior B, keeping the signal space (R^q)^Z fixed throughout. A system is linear if B is a subspace of (R^q)^Z. Let σ denote the shift operator with action σw(t) = w(t+1). A system is time-invariant if B is shift-invariant: σB = B. Finally, let B_L be the restriction of the behavior to R^{qL}, i.e., to trajectories in a finite time interval of length L ∈ Z_{>0}. Due to time-invariance, we may take the interval [1, L] for simplicity.

B. Kernel Representations and Parametric Models
Rather than a mere set-theoretic description, one typically works with explicit parametric representations (colloquially termed models) of LTI systems, the conventional ones falling in the class of kernel representations. Namely, a kernel representation with lag ℓ specifies an LTI behavior as

B = kernel(R(σ)) = { w ∈ (R^q)^Z : R(σ)w = 0 },

where R(σ) = R_0 + R_1 σ + ⋯ + R_ℓ σ^ℓ is a polynomial matrix of degree ℓ, and the matrices R_0, R_1, …, R_ℓ take values in R^{(q−m)×q}. Alternatively, one can unfold the kernel representation by revealing a latent variable: the state x(t) ∈ R^n. The input/state/output (or state-space) representation is

B = { w = P [u; y] ∈ (R^q)^Z : ∃ x ∈ (R^n)^Z such that σx = Ax + Bu, y = Cx + Du },

where A ∈ R^{n×n}, B ∈ R^{n×m}, C ∈ R^{(q−m)×n}, and D ∈ R^{(q−m)×m}. We assume that the lag ℓ (resp., the state dimension n) is minimal, i.e., there is no other kernel (resp., state-space) representation with smaller lag (resp., state dimension). The dimension n of a minimal state-space representation manifests itself in a minimal kernel representation as n = Σ_{i=1}^{q−m} ℓ_i, where ℓ_i is the lag of the i-th row of R(σ).

C. Representation-Free Estimation and Behavior Dimension
Given an input/state/output representation of B ⊂ (R^q)^Z with m inputs, order n, and lag ℓ, recall the extended observability and convolution (impulse-response) matrices

O_L = [ C ; CA ; ⋮ ; CA^{L−1} ]   and   G_L = [ D 0 ⋯ 0 ; CB D ⋯ 0 ; CAB CB D ⋯ ; ⋮ ⋱ ; CA^{L−2}B ⋯ CAB CB D ],

which parametrize all length-L trajectories in B_L as

[ u ; y ] = [ I 0 ; G_L O_L ] [ u ; x_ini ],   (1)

where x_ini ∈ R^n is the initial state. Recall the observability problem: given a length-L time series of inputs and outputs, is it possible to reconstruct x_ini? The formulation (1) gives a succinct answer: namely, x_ini can be reconstructed if and only if O_L has full column rank. The minimum L so that O_L has full rank n equals the lag ℓ of a minimal kernel representation, i.e., rank(O_L) = n for L ≥ ℓ. As readily deducible from (1) and formalized in [28, Lemma 1], in a representation-free setting, the initial condition x_ini for a trajectory w ∈ B_L can be estimated via a prefix trajectory w_ini = ( w(−T_ini+1), …, w(−1), w(0) ) of length T_ini ≥ ℓ so that the concatenation w_ini ∧ w ∈ B_{T_ini+L} is a valid trajectory. Hence, an LTI system is characterized by the complexity parameters (q, m, n, ℓ), and we denote the corresponding class of LTI systems by L^{q,n}_{m,ℓ}: namely, LTI systems with m inputs, q − m outputs, minimal state dimension n, and minimal lag ℓ. The following lemma characterizes the dimension of B_L for B ∈ L^{q,n}_{m,ℓ} in terms of the complexity parameters (q, m, n, ℓ).

Lemma 2.1 (Characterization of B_L): Let B ∈ L^{q,n}_{m,ℓ}. Then B_L is a subspace of R^{qL}, and for L ≥ ℓ its dimension is equal to mL + n.

Proof: The fact that B_L ⊂ R^{qL} is a subspace follows since B is a linear system. To show that the dimension of B_L is equal to mL + n for L ≥ ℓ, we appeal to a minimal state-space parametrization of B; a state-space-independent proof is in [19, Sec. 3]. Then we have w = P [u; y] ∈ B_L if and only if (1) holds for some x_ini ∈ R^n. By assumption, the representation is minimal, and O_L ∈ R^{(q−m)L×n} is of full column rank for L ≥ ℓ. Therefore, the matrix [ I 0 ; G_L O_L ] ∈ R^{qL×(mL+n)} is of full rank mL + n for L ≥ ℓ and forms a basis for B_L. Thus, B_L has dimension equal to mL + n. ∎
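As a quick numerical sanity check of Lemma 2.1 (not part of the paper's development), the sketch below builds O_L and G_L for a randomly drawn state-space system, which is minimal with probability one, and verifies that the basis matrix in (1) has rank mL + n. All names and dimensions are illustrative choices.

```python
import numpy as np

# Randomly drawn system with m = 1 input, p = q - m = 1 output, n = 2 states;
# a generic choice like this is minimal with probability one.
rng = np.random.default_rng(0)
n, m, p, L = 2, 1, 1, 5
A = 0.5 * rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
C = rng.standard_normal((p, n))
D = rng.standard_normal((p, m))

# Extended observability matrix O_L = col(C, CA, ..., CA^{L-1}).
O_L = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(L)])

# Block lower-triangular convolution matrix G_L (Markov parameters).
G_L = np.zeros((p * L, m * L))
for i in range(L):
    for j in range(i + 1):
        blk = D if i == j else C @ np.linalg.matrix_power(A, i - j - 1) @ B
        G_L[i * p:(i + 1) * p, j * m:(j + 1) * m] = blk

# The basis matrix [I 0; G_L O_L] of (1); its rank is dim B_L.
M = np.block([[np.eye(m * L), np.zeros((m * L, n))], [G_L, O_L]])
dim_BL = np.linalg.matrix_rank(M)
print(dim_BL)  # expect m*L + n = 7
```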
Remark 2.2 (Complexity bounds): All forthcoming results assume known complexity (q, m, n, ℓ). When only mere time series and no further information is available, it is reasonable to assume only upper bounds on these parameters. Thus, the anticipated dimension of B_L is at most mL + n. In this case, the various rank equalities in this paper quantifying the behavior dimension should be replaced by inequalities. □

D. Image Representation of Restricted Behavior
The restricted behavior B_L, the set of all trajectories of length L, can be described by a kernel or state-space representation. As an interesting alternative, we recall the image representation of B_L by a data matrix of a time series. Consider the sequence w = ( w(1), w(2), …, w(T) ) with elements w(i) ∈ R^q, and define the (block) Hankel matrix H_L(w) ∈ R^{qL×(T−L+1)} of depth L, for some L ≤ T, as

H_L(w) = [ w(1) w(2) ⋯ w(T−L+1) ; w(2) w(3) ⋯ w(T−L+2) ; ⋮ ⋱ ; w(L) w(L+1) ⋯ w(T) ].

A result due to [17] that became known as the Fundamental Lemma offers an image representation of the restricted behavior in terms of the column span of a data Hankel matrix. We present a necessary and sufficient version here assuming:

(A.1) rank(H_L(w)) = mL + n.

Lemma 2.3 ([19, Corollary 19]): Consider an LTI system B ∈ L^{q,n}_{m,ℓ} and an associated trajectory w = ( w(1), w(2), …, w(T) ) ∈ R^{qT}. The following are equivalent for L > ℓ:

colspan(H_L(w)) = B_L ⟺ Assumption (A.1).

In words, the Hankel matrix H_L(w) composed of a single T-length trajectory parametrizes all L-length trajectories if and only if rank(H_L(w)) = mL + n. A plausible reasoning leading up to Lemma 2.3 is that every column of H_L(w) is a trajectory of length L, and the set of all such trajectories has dimension at most mL + n; see Lemma 2.1. Lemma 2.3 extends the original Fundamental Lemma [17, Theorem 1], which requires an input/output partitioning, controllability, and persistency of excitation of order L + n (i.e., H_{L+n}(u) must have full row rank) as sufficient conditions. Lemma 2.3 also extends to mosaic Hankel, Page, and trajectory matrices [19].
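Lemma 2.3 can be illustrated numerically: simulate one long trajectory of a minimal LTI system under a random (hence, with probability one, persistently exciting) input and check the rank condition (A.1) on the resulting block Hankel matrix. The system matrices and the helper `block_hankel` below are hypothetical illustrations, not from the paper.

```python
import numpy as np

# Minimal system: m = 1 input, 1 output (q = 2), n = 2 states.
rng = np.random.default_rng(1)
n, m, L, T = 2, 1, 4, 60
A = np.array([[0.8, 0.1], [0.0, 0.7]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])

x = np.zeros(n)
w = []  # samples w(t) = (u(t), y(t)) in R^q
for _ in range(T):
    u = rng.standard_normal(m)          # random input: persistently exciting
    y = C @ x + D @ u
    w.append(np.concatenate([u, y]))
    x = A @ x + B @ u
w = np.array(w)  # shape (T, q)

def block_hankel(w, depth):
    """Block Hankel matrix H_depth(w) with q*depth rows."""
    T, q = w.shape
    return np.column_stack([w[j:j + depth].reshape(-1)
                            for j in range(T - depth + 1)])

H = block_hankel(w, L)
print(np.linalg.matrix_rank(H))  # expect m*L + n = 6
```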
Remark 2.4 (Models vs. data): It is debatable whether the image representation via the Hankel matrix H_L(w) should be called a “model”: the restricted behavior is readily available from raw data, and no parametric representation needs to be identified. Hence, we call colspan(H_L(w)) a data-driven representation of B_L and reserve the term “model” for kernel or state-space representations. Models are useful for many reasons: first and foremost the availability of powerful analysis and design methods. Another readily discernible advantage is that models are vastly compressed compared to the image representation, and the latter holds only on finite horizons unless trajectories are weaved together [21]; see also Remark 3.8. □
III. DIRECT AND INDIRECT DATA-DRIVEN CONTROL
We present different data-driven control formulations along with assumptions under which the formulations are consistent. These assumptions are used only for consistency statements and not for our main results, but they will prove insightful.
A. Optimal Control Problem
Given a plant with plant behavior B^P ∈ L^{q,n}_{m,ℓ}, a T_ini-length prefix trajectory w_ini = ( w(−T_ini+1), …, w(0) ) ∈ B^P_{T_ini}, and an L-length reference trajectory w_r ∈ R^{qL} in a reference behavior B^R, consider the finite-time optimal control problem

C:  minimize_w  c_ctrl(w − w_r)
    subject to  w_ini ∧ w ∈ B^P_{T_ini+L}.   (2)

For T_ini ≥ ℓ the prefix trajectory w_ini implicitly sets the initial condition for the optimal control problem (2); see Section II-C. In case of an uncertain initial condition, the prefix w_ini can be made a decision variable and included via a penalty term in the cost; cf. [29]–[33]. We refrain from such extensions here. We denote a minimizer of the optimization problem C in (2) by w*_C. Typically, the cost c_ctrl(·) takes the form of a running and a terminal cost. We make the following assumption:

(A.2) c_ctrl : R^{qL} → R_{≥0} achieves its minimum for w*_C = w_r.

The feasible set of problem (2) is convex since B^P_{T_ini+L} is a subspace; see Lemma 2.1. Note that further convex constraints on admissible trajectories (e.g., capturing input saturation) can be added. This will not change our developments, but we prefer to keep the focus on the subspace constraint. For the control problem (2), we do not necessarily assume B^P = B^R, since we often ask systems to track non-plant behavior (e.g., steps). However, B^P = B^R will lead to some simplifications and connects to model reference control:

(A.3) w_ini ∧ w_r ∈ B^P_{T_ini+L}, i.e., the reference w_r ∈ B^R_L is compatible with the plant B^P and the prefix trajectory.

Fact 3.1:
Under Assumptions (A.2) and (A.3), the minimum of the control problem C in (2) is achieved for w*_C = w_r.

Fact 3.1 (and similar consistency results later on) follows since w*_C = w_r is feasible and achieves the minimum of the cost. Fact 3.1 serves as ground truth for comparison. Problem (2) becomes a “classical” control problem if a parametric model for the plant B^P is available. The latter is usually obtained from data through system identification. Here, we are interested in the end-to-end problem formulation: namely, to derive the optimal control directly from data.

B. Indirect Data-Driven Control via System Identification
Given a T-length trajectory w_d ∈ R^{qT} as identification data, conventional system identification and control consists of three steps. The first step, model class selection, amounts to choosing the set of candidate models, e.g., L^{q,n}_{m,ℓ} specified by the complexity (q, m, n, ℓ). The second step, model fitting, chooses an element from the model class that fits the data best in some specified sense, e.g., the distance between the data w_d and the model B. This step is often synonymous with learning a parametric model (e.g., PEM), though some classic (e.g., ETFE) and modern (e.g., kernel-based) methods are non-parametric and bypass the model class selection; see [2] for a review (and the acronyms). However, for control design the non-parametric models again have to be projected on a behavior in L^{q,n}_{m,ℓ}. Both approaches can be abstracted as

ID:  minimize_{ŵ_d, B̂}  c_id(ŵ_d − w_d)
     subject to  ŵ_d ∈ B̂_T,  B̂ ∈ L^{q,n}_{m,ℓ}.   (3)

It is useful to think of the identification loss c_id(·) as a distance. Given the data w_d, problem (3) seeks the closest LTI behavior within the class L^{q,n}_{m,ℓ}, i.e., the closest subspace with dimensions as in Lemma 2.1. We denote a minimizer of (3) by (ŵ*_{d,ID}, B̂*_ID) and assume the following on c_id(·):

(A.4) c_id : R^{qT} → R_{≥0} achieves its minimum for ŵ*_{d,ID} = w_d.

Exact identification of the true system requires exact data w_d ∈ B^P_T and an identifiability assumption [19, Theorem 15] which assures that B^P can be recovered from w_d:

(A.5) w_d ∈ B^P_T, i.e., w_d is a valid trajectory of B^P; and
(A.6) rank(H_{ℓ+1}(w_d)) = m(ℓ+1) + n.

Fact 3.2:
Under Assumptions (A.4)–(A.6), the minimum value of the system identification problem ID in (3) is achieved for ŵ*_{d,ID} = w_d and B̂*_ID = B^P.

Finally, equipped with an identified behavior B̂* ∈ L^{q,n}_{m,ℓ}, the third step is certainty-equivalence control: solve the optimal control problem (2) subject to the identified model:

minimize_w  c_ctrl(w − w_r)
subject to  w_ini ∧ w ∈ B̂*_{T_ini+L}.   (4)

In (4), c_ctrl(w − w_r) is merely a surrogate (predicted) control error since w ∈ B̂*, the identified model, rather than w ∈ B^P. Putting both the system identification (3) and certainty-equivalence control (4) together, we arrive at indirect data-driven control formulated as the bi-level problem

BL:  minimize_w  c_ctrl(w − w_r)
     subject to  w_ini ∧ w ∈ B̂*_{T_ini+L}
     where  B̂* ∈ argmin_{ŵ_d, B̂}  c_id(ŵ_d − w_d)
            subject to  ŵ_d ∈ B̂_T,  B̂ ∈ L^{q,n}_{m,ℓ}.   (5)

The bi-level problem structure in (5) reflects the sequential system identification and control tasks; that is, first a model is fitted to the data in the inner identification problem before the model is used for control in the outer problem. We denote a minimizer for the inner problem of (5) by (ŵ*_{d,BL}, B̂*_BL) and a minimizer for the outer problem of (5) by w*_BL.

Remark 3.3 (Further inner and outer optimization problems):
The bi-level formulation (5) is only the tip of the iceberg, and the overall design may feature further nested levels, e.g., optimization of the model selection hyper-parameters (n, ℓ), uncertainty quantification, etc. We deliberately neglect these levels here and focus on identification and control. □

Remark 3.4 (Value of models):
As we are interested in optimal control and not in the model per se, we treat models and identification in a slightly disregarding manner, i.e., they serve merely an auxiliary purpose here. Of course, models are desired for plenty of other reasons: system design, system analysis, fault detection, the reasons in Remark 2.4, etc. □
Under suitable consistency assumptions, the sequential system identification and control approach in (5) is optimal.
Fact 3.5:
Consider the optimal control problem C in (2) and the bi-level problem BL in (5). Then
1) under Assumptions (A.4)–(A.6), the bi-level problem BL reduces to the optimal control problem C; and
2) under the additional Assumptions (A.2) and (A.3), the minimum value of the bi-level problem BL is achieved for ŵ*_{d,BL} = w_d, B̂*_BL = B^P, and w*_BL = w_r.

The first statement echoes the “model as well as possible” paradigm and a separation of control and identification, albeit in a simple setting; see [6, Section 4.2] for further reading.

C. Direct Data-Driven Control via the Image Representation
The direct data-driven control approach pursued here hinges upon the Fundamental Lemma 2.3. A direct corollary is that the prediction and estimation trajectories have to lie within the column span of the data Hankel matrix.
Corollary 3.6 (Direct data-driven control):
Assume that Assumptions (A.1) and (A.5) hold with L replaced by T_ini + L. Then the optimal control problem C in (2) is equivalent to

D:  minimize_w  c_ctrl(w − w_r)
    subject to  [w_ini; w] ∈ colspan(H_{T_ini+L}(w_d)),   (6)

i.e., the minimizers and minima of (2) and (6) coincide.

Fact 3.7:
Under Assumptions (A.1) with L replaced by T_ini + L, (A.2), (A.3), and (A.5), the minimum value of (6) is achieved for w*_D = w_r.

Remark 3.8 (Data lengths):
It is instructive to compare the sample complexity of the direct and indirect approaches (6) and (5). Due to Assumption (A.1), Corollary 3.6 requires more data points than the identification Assumption (A.6). This discrepancy is due to Corollary 3.6 seeking a multi-step predictor, whereas system identification (3) gives a single-step predictor that can be applied recursively. By weaving multiple trajectories of length ℓ + 1 together, Assumption (A.1) can be eased so that the data lengths coincide; see [21, Lemma 3]. □
In comparison with system identification, the model order selection step is implicit in Assumption (A.1) and encoded in the rank of the Hankel matrix H_{T_ini+L}(w_d), at least for clean data w_d ∈ B^P_T. If the identification data w_d is noisy, then the Hankel matrix very likely has full rank, and the constraint of (6), col(w_ini, w) ∈ colspan(H_{T_ini+L}(w_d)), is vacuous. Thus, w = w_r uniquely minimizes the surrogate control error, but the realized control error may be arbitrarily different. In short, certainty equivalence can fail arbitrarily poorly in direct data-driven control, and we need to robustify the direct approach. This is a major difference with the indirect (first identify, then control) approach (5): the purpose of identification is also to filter noisy data by projecting on a deterministic behavior. To go beyond certainty equivalence, [30]–[35] reformulate the constraint in (6) as col(w_ini, w) = H_{T_ini+L}(w_d) g for some g and robustify problem (6) by means of regularization:

D_λ:  minimize_{w, g}  c_ctrl(w − w_r) + λ · h(g)
      subject to  [w_ini; w] = H_{T_ini+L}(w_d) g.   (7)

To provide an intuition, every column of H_{T_ini+L}(w_d) is a trajectory of B^P_{T_ini+L}, and the decision variable g linearly combines these columns to form the optimal trajectory w, consistent with the prefix trajectory w_ini and regularized by h(g). The regularization function h(·) and the parameter λ are nonnegative. Choices for h(·) are one-norms [31], two-norms [34], squared two-norms [30], [35], or arbitrary p-norms [32], [33]. The regularizers can be related to min-max optimization formulations either in deterministic [34], [35] or stochastic contexts [32], [33].
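A minimal sketch of how a program like (7) can be solved for the common choices c_ctrl = ‖w − w_r‖² and h(g) = ‖g‖² (the squared two-norm regularizer of [30], [35]): for brevity we enforce the prefix constraint softly with a large weight ρ, in the spirit of the penalty treatment of w_ini mentioned after (2), which reduces the program to a single ridge-regression solve. The function name and the splitting of the Hankel matrix into H_ini/H_f block rows are illustrative assumptions.

```python
import numpy as np

def regularized_ddc(H_ini, H_f, w_ini, w_r, lam, rho=1e6):
    """Ridge-regression solve of a soft-constrained version of (7):
    minimize_g  rho*||H_ini g - w_ini||^2 + ||H_f g - w_r||^2 + lam*||g||^2."""
    n_g = H_ini.shape[1]
    # Normal equations of the stacked regularized least-squares problem.
    A = rho * (H_ini.T @ H_ini) + H_f.T @ H_f + lam * np.eye(n_g)
    b = rho * (H_ini.T @ w_ini) + H_f.T @ w_r
    g = np.linalg.solve(A, b)
    return g, H_f @ g  # g and the planned trajectory w = H_f g

# Tiny synthetic instance, dimensions chosen only for shape-checking.
rng = np.random.default_rng(2)
H_ini = rng.standard_normal((4, 10))   # first block rows of H_{T_ini+L}(w_d)
H_f = rng.standard_normal((6, 10))     # remaining block rows
w_ini = rng.standard_normal(4)
w_r = rng.standard_normal(6)
g, w = regularized_ddc(H_ini, H_f, w_ini, w_r, lam=1.0)
```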
For instance, if one seeks robustness with respect to all distributions that could have generated the data w_d within a p-Wasserstein ball centered around the empirical distribution, then one has to (i) regularize with the dual norm h(g) = ‖g‖_p* and (ii) choose λ at least as large as the radius of the Wasserstein ball times the Lipschitz constant of c_ctrl(·) [32], [33]. The regularized formulation (7) has proved itself in practical (nonlinear) control problems; see, e.g., [35]–[38].
IV. BRIDGING DIRECT & INDIRECT APPROACHES
A. Multi-Objective Data-Driven Control
From an optimization perspective, it is natural to lift the bi-level problem (5) to a multi-criteria problem simultaneously optimizing for identification and control objectives. Using weighted-sum scalarization, the multi-criteria problem is
MC_γ:  minimize_{w, ŵ_d, B̂}  γ · c_id(ŵ_d − w_d) + c_ctrl(w − w_r)
       subject to  w_ini ∧ w ∈ B̂_{T_ini+L},  ŵ_d ∈ B̂_T,  B̂ ∈ L^{q,n}_{m,ℓ},   (8)

where the trade-off parameter γ ≥ 0 traces the Pareto front between the identification and optimal control objectives. The multi-criteria problem (8) can be interpreted as fitting a model B̂ simultaneously to two data sets: the identification data w_d and the reference w_r. From a control perspective, the identification criterion biases the solution w ∈ B̂ to adhere to the observed data w_d rather than merely matching the to-be-tracked reference w_r. Likewise, from the other side, the identification criterion is biased by the control objective. In short, control and identification regularize each other, in the spirit of identification for control [7], [8]. A similar formulation has been proposed in [10], interpolating between PEM identification and a model-reference control objective. We denote a minimizer of (8) by (w*_MC, ŵ*_{d,MC}, B̂*_MC).

Fact 4.1:
Under Assumptions (A.2)–(A.6), for any γ ≥ 0 the minimum of the parametric multi-criteria problem MC_γ is achieved for ŵ*_{d,MC} = w_d, B̂*_MC = B^P, and w*_MC = w_r.

Different points on the Pareto front of (8) place different emphasis on the control and identification objectives. Below we formalize that for γ sufficiently large, the multi-criteria problem (8) recovers the bi-level problem (5) corresponding to sequential system identification and control. We follow standard penalty arguments from bi-level optimization [40], [41], which are particularly tractable here since (5) is only weakly coupled: the inner problem does not depend on the decision variable w of the outer problem. The minimum (also termed the value function) of the inner problem is

φ = minimize_{ŵ_d, B̂}  c_id(ŵ_d − w_d)
    subject to  ŵ_d ∈ B̂_T,  B̂ ∈ L^{q,n}_{m,ℓ}.   (9)

The bi-level problem (5) then reads equivalently as

minimize_{w, ŵ_d, B̂}  c_ctrl(w − w_r)
subject to  w_ini ∧ w ∈ B̂_{T_ini+L},  ŵ_d ∈ B̂_T,  B̂ ∈ L^{q,n}_{m,ℓ},
            c_id(ŵ_d − w_d) − φ = 0.   (10)

At this point the reader is encouraged to review the definition and salient properties of a constraint qualification termed partial calmness [40], [41]; see the appendix. If problem (10) is partially calm at a local minimizer and c_ctrl(·) is continuous, then there is γ* > 0 so that, for all γ > γ*, (10) equals

minimize_{w, ŵ_d, B̂}  γ · |c_id(ŵ_d − w_d) − φ| + c_ctrl(w − w_r)
subject to  w_ini ∧ w ∈ B̂_{T_ini+L},  ŵ_d ∈ B̂_T,  B̂ ∈ L^{q,n}_{m,ℓ},   (11)

that is, the local minimizers of (10) and (11) coincide; see Proposition A.2. We now drop the absolute value (since c_id(ŵ_d − w_d) − φ ≥ 0) and the constant φ (which in our case does not depend on the variable w of the outer problem) from the objective of (11) to recover problem (8).
We have thus established a chain of equivalences relating the bi-level and multi-criteria problems. We summarize our discussion below.
M C γ ): Consider the parametric multi-criteria problem
M C γ in (8).Assume that (10) is partially calm at any local minimizer and c ctrl ( · ) is continuous. Then there is γ (cid:63) > so that for γ > γ (cid:63) the problem M C γ is equivalent to the bi-level problem BL in(5), i.e., w (cid:63)MC = w (cid:63)BL , ˆ w (cid:63)d,MC = ˆ w (cid:63)d,BL , and (cid:98) B (cid:63)MC = (cid:98) B (cid:63)BL .Moreover, the optimal values of M C γ and BL coincide upto the constant γ · ϕ with ϕ defined in (9).The following comments are in order regarding partialcalmness. As discussed in Proposition A.2, partial calmnessis equivalent to the constraint c id ( ˆ w d − w d ) − ϕ ≥ servingas an exact penalty. Partial calmness is satisfied, for instance,appealing to Proposition A.3, if the identification cost c id ( · ) can be phrased as a distance (see the discussion following theidentification problem (3)) and c ctrl ( · ) is Lipschitz continuousover the feasible set, e.g., the feasible set is either compact(due to constraints) or the control performance is measured bya norm or Huber loss. The Lipschitz constant then serves as aower estimate for γ (cid:63) . A non-Lipschitz cost requires γ → ∞ as a sufficient condition. Note that for γ → ∞ Proposition 4.2holds without assumptions, since (11) is merely an indicatorfunction reformulation of (10). Our relaxations in the next sec-tions will, among others, drop the requirement on γ sufficientlylarge as well as the LTI complexity specification (cid:98) B ∈ L q,nm,(cid:96) .The multi-criteria problem (8) is generally not convex sincewe simultaneously optimize over the to-be-identified model B and the to-be-designed control w . This can be spotted directlyin a kernel representation: the constraint w ∈ (cid:98) B L takes theform (cid:98) R ( σ ) w = 0 , where both (cid:98) R and w are variables. Though,other representations lead to the same conclusions. Proposition 4.3:
Consider the multi-criteria problem (8) and a kernel representation of the to-be-identified behavior: $\hat{\mathcal B} = \mathrm{kernel}(\hat R(\sigma))$. Then the feasible set of (8) is not convex.

We believe that the multi-criteria problem is interesting in its own right: studying its Pareto front and choosing an optimal trade-off parameter may yield superior performance.

Our problem setup thus far was conceptual rather than practically useful. Below, we consider concrete problem formulations and turn our conceptual insights into concise results.

B. Bridging Towards Subspace Predictive Control (SPC)
We explain SPC from the perspective of the Fundamental Lemma 2.3 stating that any trajectory $w_{\mathrm{ini}} \wedge w \in \mathcal B^P_{T_{\mathrm{ini}}+L}$ lies in $\mathrm{colspan}(\mathcal H_{T_{\mathrm{ini}}+L}(w_d))$. Recall that $w_{\mathrm{ini}}$ is a prefix trajectory of length $T_{\mathrm{ini}} \geq \ell$ setting the initial condition, and $w$ is a future trajectory of length $L > 0$ to be designed via optimal control. Accordingly, permute and partition $w$ and the Hankel matrix as
$$\begin{bmatrix} w_{\mathrm{ini}} \\ w \end{bmatrix} \sim \begin{bmatrix} u_{\mathrm{ini}} \\ u \\ y_{\mathrm{ini}} \\ y \end{bmatrix}, \qquad \mathcal H_{T_{\mathrm{ini}}+L}(w_d) = \begin{bmatrix} \mathcal H_{T_{\mathrm{ini}}+L}(u_d) \\ \mathcal H_{T_{\mathrm{ini}}+L}(y_d) \end{bmatrix} \sim \begin{bmatrix} U_{\mathrm p} \\ U_{\mathrm f} \\ Y_{\mathrm p} \\ Y_{\mathrm f} \end{bmatrix},$$
where $u_{\mathrm{ini}} \in \mathbb R^{m T_{\mathrm{ini}}}$, $y_{\mathrm{ini}} \in \mathbb R^{(q-m) T_{\mathrm{ini}}}$, and $\sim$ denotes similarity under a coordinate permutation. The subscripts "p" and "f" are synonymous with "past" and "future". We seek a linear model, i.e., a matrix $K$, relating past and future as
$$y = \underbrace{[\,K_{\mathrm p} \ \ K_{\mathrm f}\,]}_{=\,K} \cdot \mathrm{col}(u_{\mathrm{ini}}, y_{\mathrm{ini}}, u). \qquad (12)$$
The multi-step predictor $K$ is found from Hankel matrix data by means of the least-squares criterion [24, Section 3.4]
$$\operatorname*{minimize}_{K} \ \big\| Y_{\mathrm f} - K \cdot \mathrm{col}(U_{\mathrm p}, Y_{\mathrm p}, U_{\mathrm f}) \big\|_F, \qquad (13)$$
where $\|\cdot\|_F$ is the Frobenius norm. Via the Moore-Penrose inverse, the solution of (13) is the classic SPC predictor [22]
$$K = Y_{\mathrm f}\, \mathrm{col}(U_{\mathrm p}, Y_{\mathrm p}, U_{\mathrm f})^{\dagger}. \qquad (14)$$
It is insightful to compare equation (12) and the matrices $K_{\mathrm p}$, $K_{\mathrm f}$ to equation (1) and the extended observability and impulse response matrices $\mathcal O_L$ and $\mathcal G_L$, respectively. One realizes that for clean data, (12) is an ARX model with $\mathrm{rank}(K_{\mathrm p}) = n$ assuring LTI behavior of desired complexity and a lower block-triangular zero pattern of $K_{\mathrm f}$ assuring causality. In the absence of clean data, the LTI behavior is enforced by low-rank approximation (typically via singular-value thresholding of $K_{\mathrm p}$) [22], and the desired zero pattern of $K_{\mathrm f}$ is enforced separately [24, Remark 10.1], [23, Section 3].
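To make the constructions above concrete, the following sketch (Python with NumPy; the first-order plant, horizons, and data length are illustrative assumptions, not the benchmark system used later in the paper) builds the partitioned Hankel matrices, computes the SPC predictor (14) via the pseudoinverse, and verifies on noise-free data that the multi-step prediction is exact and coincides with the least-norm preimage underlying the projector $\Pi$ of (16).

```python
import numpy as np

rng = np.random.default_rng(0)

def hankel(w, depth):
    """Hankel matrix with `depth` rows built from a scalar signal w."""
    return np.column_stack([w[i:i + depth] for i in range(len(w) - depth + 1)])

# Toy first-order SISO plant y(t+1) = 0.5*y(t) + u(t), an illustrative
# stand-in for the data-generating system; T_ini = 1, L = 2.
a, T, T_ini, L = 0.5, 20, 1, 2
u_d = rng.standard_normal(T)
y_d = np.zeros(T)
for t in range(T - 1):
    y_d[t + 1] = a * y_d[t] + u_d[t]

U, Y = hankel(u_d, T_ini + L), hankel(y_d, T_ini + L)
U_p, U_f = U[:T_ini], U[T_ini:]
Y_p, Y_f = Y[:T_ini], Y[T_ini:]
Z = np.vstack([U_p, Y_p, U_f])   # stacked regressor col(U_p, Y_p, U_f)

# SPC predictor (14): least-squares solution of (13) via the pseudoinverse.
K = Y_f @ np.linalg.pinv(Z)

# On clean, persistently exciting data the multi-step predictor is exact.
u_ini, y_ini, u = 1.0, -2.0, np.array([0.3, -0.7])
b = np.concatenate([[u_ini], [y_ini], u])
y_pred = K @ b
y1 = a * y_ini + u_ini           # direct simulation for comparison
y2 = a * y1 + u[0]
assert np.allclose(y_pred, [y1, y2])

# The same prediction arises from the least-norm preimage g* = Z^† b,
# the construction behind the projector regularizer of (16).
g_star = np.linalg.pinv(Z) @ b
assert np.allclose(Y_f @ g_star, y_pred)
Pi = np.linalg.pinv(Z) @ Z       # Π: orthogonal projector onto rowspace(Z)
assert np.allclose((np.eye(Z.shape[1]) - Pi) @ g_star, 0)
print(y_pred)
```

The final assertions mirror the identity $K\, \mathrm{col}(u_{\mathrm{ini}}, y_{\mathrm{ini}}, u) = Y_{\mathrm f}\, g^\star$ with $g^\star$ the least-norm solution of the data-consistency equations, which is the pivot of the equivalences derived below.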
The causality requirement can also be omitted, as it is not needed for offline or receding-horizon optimal control, but it is useful for conditioning the data on the set of causal models. Hence, in this case the identification problem (3), stated as a single monolithic (thus non-convex) program, reads as
$$\begin{aligned} \operatorname*{minimize}_{K} \ & \big\| Y_{\mathrm f} - K \cdot \mathrm{col}(U_{\mathrm p}, Y_{\mathrm p}, U_{\mathrm f}) \big\|_F \\ \text{subject to} \ & K = [\,K_{\mathrm p} \ \ K_{\mathrm f}\,], \quad K_{\mathrm f} \ \text{lower block-triangular}, \quad \mathrm{rank}(K_{\mathrm p}) = n, \end{aligned}$$
where the lower block-triangular specification means that all entries above the diagonal $(q-m) \times m$ blocks equal zero. We obtain a parametric version of the indirect approach (5) with the constraint $w_{\mathrm{ini}} \wedge w \in \hat{\mathcal B}^\star_{T_{\mathrm{ini}}+L}$ replaced by (12):
$$\begin{aligned} \operatorname*{minimize}_{u,\,y} \ & c_{\mathrm{ctrl}}\!\left(\begin{bmatrix} y - y_r \\ u - u_r \end{bmatrix}\right) \\ \text{subject to} \ & y = K^\star \cdot \mathrm{col}(u_{\mathrm{ini}}, y_{\mathrm{ini}}, u) \\ \text{where} \ & K^\star \in \operatorname*{arg\,min}_{K} \ \big\| Y_{\mathrm f} - K \cdot \mathrm{col}(U_{\mathrm p}, Y_{\mathrm p}, U_{\mathrm f}) \big\|_F \\ & \quad \text{subject to} \ K = [\,K_{\mathrm p} \ \ K_{\mathrm f}\,], \ K_{\mathrm f} \ \text{lower block-triangular}, \ \mathrm{rank}(K_{\mathrm p}) = n. \end{aligned} \qquad (15)$$
For comparison, consider also an instance of the direct regularized problem (7) with regularizer $h(g) = \|(I - \Pi)\, g\|$:
$$\begin{aligned} \operatorname*{minimize}_{u,\,y,\,g} \ & c_{\mathrm{ctrl}}\!\left(\begin{bmatrix} y - y_r \\ u - u_r \end{bmatrix}\right) + \lambda \cdot \|(I - \Pi)\, g\| \\ \text{subject to} \ & \begin{bmatrix} U_{\mathrm p} \\ Y_{\mathrm p} \\ U_{\mathrm f} \\ Y_{\mathrm f} \end{bmatrix} g = \begin{bmatrix} u_{\mathrm{ini}} \\ y_{\mathrm{ini}} \\ u \\ y \end{bmatrix}. \end{aligned} \qquad (16)$$
Here $\Pi = \mathrm{col}(U_{\mathrm p}, Y_{\mathrm p}, U_{\mathrm f})^{\dagger}\, \mathrm{col}(U_{\mathrm p}, Y_{\mathrm p}, U_{\mathrm f})$, and $(I - \Pi)$ is an orthogonal projector onto the kernel of the first three block-constraint equations. We state the following consistency result.

Fact 4.4:
Under Assumptions (A.1) with $L$ replaced by $T_{\mathrm{ini}}+L$, (A.2), (A.3), and (A.5), for any $\lambda \geq 0$ the minimum of the regularized problem (16) is achieved for $y^\star = Y_{\mathrm f}\, g^\star = y_r$ and $u^\star = U_{\mathrm f}\, g^\star = u_r$, where $\|(I - \Pi)\, g^\star\| = 0$.

The consistency Fact 4.4 may appear trivial, but it highlights that mere $p$-norm regularizers $h(g) = \|g\|_p$ are not consistent. The following is the main result of this subsection.

Theorem 4.5 (SPC relaxation): Consider the indirect data-driven control problem (15) and the direct data-driven control problem (16) parameterized by $\lambda \geq 0$. Assume that $c_{\mathrm{ctrl}}(\cdot)$ is Lipschitz continuous. For $\lambda$ sufficiently small, problem (16) is a convex relaxation of (15), that is,
1) (16) is convex,
2) any feasible $(u, y)$ in (15) is feasible for (16), and
3) the optimal value of (16) lower-bounds that of (15).

Proof:
First, we perform a convex relaxation by dropping the rank and block-triangularity constraints in (15). Second, observe that the explicit solution of the inner problem, the predictor (14), is equivalently derived as the least-norm solution
$$y = Y_{\mathrm f}\, g^\star \quad \text{where} \quad g^\star = \operatorname*{arg\,min}_{g}\ \|g\| \quad \text{subject to} \quad \mathrm{col}(U_{\mathrm p}, Y_{\mathrm p}, U_{\mathrm f})\, g = \mathrm{col}(u_{\mathrm{ini}}, y_{\mathrm{ini}}, u).$$
We now insert this reformulation in the relaxation of (15):
$$\begin{aligned} \operatorname*{minimize}_{u,\,y} \ & c_{\mathrm{ctrl}}\!\left(\begin{bmatrix} y - y_r \\ u - u_r \end{bmatrix}\right) \\ \text{subject to} \ & y = Y_{\mathrm f}\, g^\star \\ \text{where} \ & g^\star \in \operatorname*{arg\,min}_{g}\ \|g\| \ \ \text{subject to} \ \ \mathrm{col}(U_{\mathrm p}, Y_{\mathrm p}, U_{\mathrm f})\, g = \mathrm{col}(u_{\mathrm{ini}}, y_{\mathrm{ini}}, u). \end{aligned} \qquad (17)$$
We now follow the arguments from Section IV-A to reduce the bi-level problem (17) to a single-level multi-criteria problem. As in (10), the inner problem can be replaced by a constraint assuring that it achieves its minimum. In this case, we can go one step further and compute the explicit inner minimizer as
$$g^\star = \mathrm{col}(U_{\mathrm p}, Y_{\mathrm p}, U_{\mathrm f})^{\dagger}\, \mathrm{col}(u_{\mathrm{ini}}, y_{\mathrm{ini}}, u).$$
The optimality constraint assuring that the inner problem of (17) achieves its minimum can now be formulated as follows: inserting the block-constraint equations of the inner problem into $g^\star$ and using the definition of $\Pi$ gives $g^\star = \Pi g$, so the constraint $g = g^\star$ becomes
$$g - \Pi g = 0 \quad \Longleftrightarrow \quad \|(I - \Pi)\, g\| = 0.$$
The final reformulation poses the inner optimality constraint as the distance to the subspace containing the minimizers of the inner problem. Retaining all constraints, (17) can then be formulated as the single-level problem
$$\begin{aligned} \operatorname*{minimize}_{u,\,y,\,g} \ & c_{\mathrm{ctrl}}\!\left(\begin{bmatrix} y - y_r \\ u - u_r \end{bmatrix}\right) \\ \text{subject to} \ & \begin{bmatrix} U_{\mathrm p} \\ Y_{\mathrm p} \\ U_{\mathrm f} \\ Y_{\mathrm f} \end{bmatrix} g = \begin{bmatrix} u_{\mathrm{ini}} \\ y_{\mathrm{ini}} \\ u \\ y \end{bmatrix}, \quad \|(I - \Pi)\, g\| = 0. \end{aligned} \qquad (18)$$
We now apply Proposition A.3, lift the distance constraint $\|(I - \Pi)\, g\| = 0$ to the objective, and recover problem (16) with $\lambda$ larger than the Lipschitz constant of $c_{\mathrm{ctrl}}(\cdot)$.
Hence, (16) is equivalent to (18) for $\lambda$ sufficiently large. Our final convex relaxation is to choose $\lambda$ small rather than large. Namely, from the viewpoint of the objective, it lowers the cost; from the bi-level viewpoint, it turns the inner optimality constraint $g = g^\star$ into a weaker sub-optimality constraint. The conclusions follow since (16) is convex, and starting from (15) we have only enlarged the feasible set.

The following four remarks are in order. First, we summarize the salient arguments to pass from indirect to direct data-driven control: we relaxed problem (15) by dropping causality (block-triangularity) and LTI complexity (rank) specifications, replaced the least-squares criterion (13) by the equivalent least-norm formulation (17), and lifted the problem from bi-level to multi-criteria, where the least-squares objective results in the regularization $\|(I - \Pi)\, g\|$. For equivalence to the least-squares objective, the proof requires $\lambda$ larger than the (global) Lipschitz constant of $c_{\mathrm{ctrl}}(\cdot)$, similar to robustification-induced regularizations [32], [33]. If $c_{\mathrm{ctrl}}(\cdot)$ is only locally Lipschitz, e.g., in the case of a quadratic cost, then choosing a finite (small) $\lambda$ is a relaxation that allows the predicted trajectory to deviate from the least-squares fit of the data. As we will see in Section V-B, this effect is minor for $\lambda$ not overly small.

Second, the regularization based on the projector, $\|(I - \Pi)\, g\|$, differs from the standard two-norm regularizers $h(g) = \|g\|$ [32]-[34] (or the square thereof [30], [35]). It is precisely this projection which recovers the original least-squares criterion (13). In contrast, the standard norm regularizer $\|g\|$ penalizes also the homogeneous solution of the constraint equations and thus also $(u, y)$. The latter seems undesirable from the identification perspective: the regularization should rather make the predicted trajectory adhere to a least-squares fit of the data. While for small values of $\lambda$ both regularizations have a similarly negligible effect, for large $\lambda$ the identification-induced regularization $\|(I - \Pi)\, g\|$ demonstrates more robust performance; see Figure 2 later. Finally, $h(g) = \|(I - \Pi)\, g\|$ is consistent, unlike norm-based regularizers.

Third, our proof strategy reveals an entire class of regularizers. In fact, we can choose any norm $\|(I - \Pi)\, g\|$ in the proof, use more general penalty functions such as the (squared) merit functions in [41], or attack problem (18) with other penalty or augmented Lagrangian methods. These degrees of freedom reflect the intuition that the Pareto front of (16) is invariant under certain (e.g., monotone) transformations of the objectives, such as taking squares; see [42, Appendix A] for a formal reasoning. For our later simulations in Section V-B, we choose the computationally attractive regularization $\|(I - \Pi)\, g\|_2^2$.

Fourth, and finally, our proof arguments are admittedly "qualitative", crossing out constraints and using non-quantifiable "sufficiently large" reasoning. Hence, the convex relaxation (16) of (15) should not be expected to be tight. Nevertheless, the formulation (16) (without projector) has proved itself and often outperforms (15), as testified in [35]-[38]. Section V-B will compare the different formulations.

C. Bridging Towards Structured Low-Rank Approximation
We now present an entirely non-parametric problem formulation, namely a version of subspace identification based on structured low-rank approximation [29], and we relate the resulting bi-level problem to direct data-driven control (7).

Given the model class $\mathcal L^{q,n}_{m,\ell}$, we project the identification data $w_d \in \mathbb R^{qT}$ on $\hat{\mathcal B}_{T_{\mathrm{ini}}+L} \in \mathcal L^{q,n}_{m,\ell}$. By Lemma 2.3, the latter set is characterized by all trajectories $\hat w \in \mathbb R^{q(T_{\mathrm{ini}}+L)}$ so that the associated Hankel matrix satisfies $\mathrm{rank}(\mathcal H_{T_{\mathrm{ini}}+L}(\hat w)) \leq m(T_{\mathrm{ini}}+L)+n$ for $(T_{\mathrm{ini}}+L) > \ell$. An implicit assumption is, of course, $T \gg T_{\mathrm{ini}}+L$: the identification data is much longer than the estimation plus control prediction horizons. In the presence of noise, $\mathcal H_{T_{\mathrm{ini}}+L}(w_d)$ will not be of low rank and has to be approximated by a low-rank matrix in an identification step. Thus, the identification problem (3) reads as
$$\begin{aligned} \operatorname*{minimize}_{\hat w_d} \ & c_{\mathrm{id}}(\hat w_d - w_d) \\ \text{subject to} \ & \mathrm{rank}(\mathcal H_{T_{\mathrm{ini}}+L}(\hat w_d)) \leq m(T_{\mathrm{ini}}+L)+n. \end{aligned} \qquad (19)$$
Problem (19) is to be read as a low-rank approximation problem: given the identification data assorted in a Hankel matrix $\mathcal H_{T_{\mathrm{ini}}+L}(w_d)$, we seek the closest sequence $\hat w_d$ so that the Hankel matrix $\mathcal H_{T_{\mathrm{ini}}+L}(\hat w_d)$ has rank no more than $m(T_{\mathrm{ini}}+L)+n$.

Since $\hat w_d \in \hat{\mathcal B}_T$, we have that $\mathrm{rank}(\mathcal H_{T_{\mathrm{ini}}+L}(\hat w_d)) \leq m(T_{\mathrm{ini}}+L)+n$. Since also $w_{\mathrm{ini}} \in \hat{\mathcal B}_{T_{\mathrm{ini}}}$ and $w \in \hat{\mathcal B}_L$, we conclude $\mathrm{rank}([\,\mathcal H_{T_{\mathrm{ini}}+L}(\hat w_d) \ \ \mathrm{col}(w_{\mathrm{ini}}, w)\,]) \leq m(T_{\mathrm{ini}}+L)+n$. Assuming that $\mathrm{rank}(\mathcal H_{T_{\mathrm{ini}}+L}(\hat w_d)) = m(T_{\mathrm{ini}}+L)+n$, which is generically the case, $\mathcal H_{T_{\mathrm{ini}}+L}(\hat w_d)\, g = \mathrm{col}(w_{\mathrm{ini}}, w)$ for some vector $g$. Hence, the bi-level problem (5) takes the form
$$\begin{aligned} \operatorname*{minimize}_{w,\,g} \ & c_{\mathrm{ctrl}}(w - w_r) \\ \text{subject to} \ & \begin{bmatrix} w_{\mathrm{ini}} \\ w \end{bmatrix} = \mathcal H_{T_{\mathrm{ini}}+L}(\hat w^\star_d)\, g \\ & \hat w^\star_d \in \operatorname*{arg\,min}_{\hat w_d}\ c_{\mathrm{id}}(\hat w_d - w_d) \\ & \quad \text{subject to} \ \mathrm{rank}(\mathcal H_{T_{\mathrm{ini}}+L}(\hat w_d)) = m(T_{\mathrm{ini}}+L)+n. \end{aligned} \qquad (20)$$

Theorem 4.6 (Low-rank relaxation):
Consider the indirect data-driven control problem (20) and the direct data-driven control problem (7) for $h(g) = \|g\|_1$, parameterized by $\lambda \geq 0$. Let Assumption (A.4) hold and assume that $c_{\mathrm{ctrl}}(\cdot)$ is convex. For $\lambda$ sufficiently small, problem (7) is a convex relaxation of (20), that is,
1) (7) is convex,
2) any feasible $(w, g)$ in (20) is also feasible for (7), and
3) the optimal value of (7) lower-bounds that of (20).

Proof:
To prove the claim, one can resort to a proof strategy via the multi-criteria problem (8), as in the previous section. Instead, we present a more direct approach here.

We start by massaging the rank constraint in (20). First, since $\mathrm{rank}(\mathcal H_{T_{\mathrm{ini}}+L}(\hat w_d)) = m(T_{\mathrm{ini}}+L)+n$, we may without loss of generality add the constraint $\|g\|_0 \leq n + m(T_{\mathrm{ini}}+L)$ to the outer problem, where $\|g\|_0$ denotes the cardinality (number of nonzero entries) of $g$. Second, we perform a convex relaxation and drop the rank constraint. Third, another convex relaxation (popular in LASSO problems [43]) is to replace $\|g\|_0 \leq n + m(T_{\mathrm{ini}}+L)$ by $\|g\|_1 \leq \alpha$ for $\alpha > 0$ sufficiently large. As a result of these three steps, (20) is relaxed to
$$\begin{aligned} \operatorname*{minimize}_{w,\,g} \ & c_{\mathrm{ctrl}}(w - w_r) \\ \text{subject to} \ & \begin{bmatrix} w_{\mathrm{ini}} \\ w \end{bmatrix} = \mathcal H_{T_{\mathrm{ini}}+L}(\hat w^\star_d)\, g, \quad \|g\|_1 \leq \alpha \\ \text{where} \ & \hat w^\star_d \in \operatorname*{arg\,min}_{\hat w_d}\ c_{\mathrm{id}}(\hat w_d - w_d). \end{aligned} \qquad (21)$$
Observe that under Assumption (A.4) the inner problem admits a trivial solution: $\hat w^\star_d = w_d$. Thus, (21) reduces to
$$\begin{aligned} \operatorname*{minimize}_{w,\,g} \ & c_{\mathrm{ctrl}}(w - w_r) \\ \text{subject to} \ & \begin{bmatrix} w_{\mathrm{ini}} \\ w \end{bmatrix} = \mathcal H_{T_{\mathrm{ini}}+L}(w_d)\, g, \quad \|g\|_1 \leq \alpha. \end{aligned} \qquad (22)$$
Next, we lift the 1-norm constraint to the objective:
$$\begin{aligned} \operatorname*{minimize}_{w,\,g} \ & c_{\mathrm{ctrl}}(w - w_r) + \lambda \cdot \|g\|_1 \\ \text{subject to} \ & \begin{bmatrix} w_{\mathrm{ini}} \\ w \end{bmatrix} = \mathcal H_{T_{\mathrm{ini}}+L}(w_d)\, g, \end{aligned} \qquad (23)$$
where $\lambda \geq 0$ is a scalar weight. In particular, for each value of $\alpha$ in (22), there is $\lambda \geq 0$ so that the solution of (23) coincides with that of (22), and vice versa. These equivalences are standard in $\ell_1$-regularized problems and follow from strong duality (applicable since $c_{\mathrm{ctrl}}(\cdot)$ is convex and Slater's condition holds) [43]. The precise value of $\lambda$ depends on the Lagrange multiplier of the constraint $\|g\|_1 \leq \alpha$ and thus on the data.
In either case, there is a selection of parameters so that both problems are equivalent, and choosing $\lambda$ sufficiently small is a relaxation. Thus, we arrive at the direct data-driven control problem (7) for $\lambda$ sufficiently small and $h(g) = \|g\|_1$. The conclusions of the theorem follow since (7) is convex, and starting from (20) we have only enlarged the feasible set to arrive at (7).

In summary, to pass from indirect data-driven control (20) to direct data-driven control (7), we performed a sequence of convex relaxations, effectively replacing the rank constraint of the system identification by a 1-norm regularizer. Hence, the 1-norm regularizer accounts for selecting the model complexity. Similar comments to those following Theorem 4.5 on the tightness of the relaxation apply to Theorem 4.6 as well.

D. Hybrid relaxations
Theorems 4.5 and 4.6 reveal the roles of the two regularizers: $\|g\|_1$ controls the behavior complexity, whereas $\|(I - \Pi)\, g\|$ accounts for least-squares fitting of the data. To blend the two, consider a hybrid formulation of (16) and (20):
$$\begin{aligned} \operatorname*{minimize}_{w,\,g} \ & c_{\mathrm{ctrl}}(w - w_r) + \lambda_1 \cdot \|(I - \Pi)\, g\| \\ \text{subject to} \ & \begin{bmatrix} w_{\mathrm{ini}} \\ w \end{bmatrix} = \mathcal H_{T_{\mathrm{ini}}+L}(\hat w^\star_d)\, g \\ & \hat w^\star_d \in \operatorname*{arg\,min}_{\hat w_d}\ c_{\mathrm{id}}(\hat w_d - w_d) \\ & \quad \text{subject to} \ \mathrm{rank}(\mathcal H_{T_{\mathrm{ini}}+L}(\hat w_d)) = m(T_{\mathrm{ini}}+L)+n, \end{aligned} \qquad (24)$$
where $\lambda_1 \geq 0$. Observe that this formulation is consistent:

Fact 4.7:
Under Assumptions (A.1) with $L$ replaced by $T_{\mathrm{ini}}+L$, (A.2), (A.3), and (A.5), for any $\lambda_1 \geq 0$ the minimum of (24) is achieved for $w^\star = w_r$ and $\|(I - \Pi)\, g^\star\| = 0$.

By following exactly the arguments in the proof of Theorem 4.6, we arrive at the following convex relaxation of (24):
$$\begin{aligned} \operatorname*{minimize}_{w,\,g} \ & c_{\mathrm{ctrl}}(w - w_r) + \lambda_1 \cdot \|(I - \Pi)\, g\| + \lambda_2 \cdot \|g\|_1 \\ \text{subject to} \ & \begin{bmatrix} w_{\mathrm{ini}} \\ w \end{bmatrix} = \mathcal H_{T_{\mathrm{ini}}+L}(w_d)\, g, \end{aligned} \qquad (25)$$
where $\lambda_2 \geq 0$. We will validate the performance of the hybrid regularizer in Section V-B below; see specifically Figure 3.

E. Possible pitfalls of relaxations
Note that the two convex relaxation results in Theorems 4.5 and 4.6 are trivially true in the limit $\lambda = 0$. In fact, even the abstract multi-criteria formulation (8) can be related to a convex relaxation of the abstract bi-level problem (5) in the limit $\gamma = 0$. Namely, for $\gamma = 0$, (8) reduces to
$$\begin{aligned} \operatorname*{minimize}_{w,\,\hat w_d,\,\hat{\mathcal B}} \ & c_{\mathrm{ctrl}}(w - w_r) \\ \text{subject to} \ & w_{\mathrm{ini}} \wedge w \in \hat{\mathcal B}_{T_{\mathrm{ini}}+L}, \quad \hat w_d \in \hat{\mathcal B}_T, \quad \hat{\mathcal B} \in \mathcal L^{q,n}_{m,\ell}. \end{aligned} \qquad (26)$$

Corollary 4.8:
Consider the indirect data-driven control problem (5) and the multi-criteria problem (26) in the limit $\gamma = 0$. Problem (26) is a convex relaxation of problem (5), that is,
1) (26) is convex,
2) any feasible $(w, \hat w_d, \hat{\mathcal B})$ in (5) is also feasible for (26), and
3) the optimal value of (26) lower-bounds that of (5).

Proof:
Consider the equivalent formulation (10) of (5), and note that (26) equals (10) when the inner optimality constraint $c_{\mathrm{id}}(\hat w_d - w_d) - \varphi = 0$ is dropped.

Analogous corollaries can be stated for Theorems 4.5 and 4.6 for $\lambda = 0$. Given such results, one may wonder whether Theorems 4.5 and 4.6 are vacuous since they are trivially true for $\lambda = 0$. We offer several answers. First, the limit $\lambda = 0$ clearly leads to a better solution $w^\star$ (i.e., a lower surrogate tracking error) for the open-loop optimal control problem. However, this solution merely matches the reference $w_r$ and does not adhere to the identification data $w_d$ in the sense of meeting any fitting criterion. Hence, the optimal solution $w^\star$ may not be a trajectory of the true system behavior, and the actual realized control performance can be arbitrarily poor. Obviously, such a situation is not desirable, and one may want to regularize with a small but non-zero $\lambda$, an observation consistent with [31]-[35] albeit derived from a different perspective. Second, Theorems 4.5 and 4.6 require $\lambda$ to be sufficiently small, but not zero. According to the proofs, depending on Lipschitz constants and multipliers of the respective problems, there is an exact smallest value of $\lambda$ so that the behavior $\hat{\mathcal B}$ matches the plant behavior $\mathcal B^P$ according to the $c_{\mathrm{id}}(\cdot)$ fitting criterion. In [31]-[35] the coefficient $\lambda$ relates to a desired robustness level. In either case, $\lambda$ can hardly be quantified a priori and without cross-validation. We follow up on this set of questions in the next section.

V. NUMERICAL ANALYSIS AND COMPARISONS
We now numerically investigate the effect of the hyper-parameter $\lambda$, confirm the superiority of the regularizer $h(g) = \|(I - \Pi)\, g\|$, and compare the direct and indirect approaches.

A. Choice of Regularization Parameter
We first study the parameter $\lambda$ regularizing direct data-driven control (7). Consider the benchmark single-input, single-output, 5th-order, linear time-invariant system [44]. Denoting the $t$-th element of the concatenated input and output by $w(t) = (u(t), y(t))$, the control cost was chosen as $c_{\mathrm{ctrl}}(w - w_r) = (w - w_r)^{\top} W (w - w_r)$ with reference $w_r(t) = (u_r(t), y_r(t)) = (0, \sin(2\pi t/(L-1)))$ for $t \in \{0, 1, \ldots, L-1\}$, prediction horizon $L = 20$, and weight $W = I_L \otimes \mathrm{diag}(w_u, w_y)$ penalizing input and output deviations, where $I_L$ is the $L \times L$ identity and $\otimes$ denotes the Kronecker product. We used a 1-norm regularizer $h(g) = \|g\|_1$ in (7) and a prefix trajectory of length $T_{\mathrm{ini}} = 5$ (see Section II-C).

We collected one noise-free input/output time series of length $T = 74$ by applying a random Gaussian input. From this noise-free data set, 100 independent noisy data sets were constructed by adding Gaussian noise with a noise-to-signal ratio of 5%. For each data set and each value of $\lambda$ on a logarithmically spaced grid, optimal control inputs were computed from (7). We define the predicted error as $c_{\mathrm{ctrl}}(w^\star - w_r)$, where $w^\star$ is an optimizer of (7), and the realized error as $c_{\mathrm{ctrl}}(w_{\mathrm{true}} - w_r)$, where $w_{\mathrm{true}}$ is the realized trajectory of the system after applying the computed optimal inputs. The predicted and realized errors were converted to a percentage increase in error with respect to the best possible performance (i.e., if the deterministic system model were exactly known) and averaged over the 100 independent data sets. The results are plotted in Figure 1.

It is apparent that choosing $\lambda$ too small leads to an optimistic predicted error but very poor realized performance. Furthermore, the performance is poor for large values of $\lambda$, indicating that the regularization parameter should be chosen carefully (though a wide range delivers equally good results). These observations are consistent with those in [31]-[35] and the hypotheses discussed at the end of Section IV-E.

B. Role of Projection in Two-Norm Regularization
Theorem 4.5 suggests that the identification-induced regularizer $h(g) = \|(I - \Pi)\, g\|$ is superior to a two-norm regularizer $h(g) = \|g\|$ if one is interested in consistency and in the predicted trajectory adhering to a least-squares fit of the data.

Fig. 1. Predicted and realized errors with 1-norm regularizer as a function of $\lambda$.

Fig. 2. Comparison of the realized performance for the two-norm $\|g\|$ and identification-induced regularization $\|(I - \Pi)\, g\|$ as a function of $\lambda$.

Fig. 3. Realized error for the hybrid regularizer $\lambda_1 \cdot \|(I - \Pi)\, g\| + \lambda_2 \cdot \|g\|_1$.

To test this hypothesis, we consider the same case study from Section V-A and report the averaged cost in Figure 2. Both regularizers perform similarly for small $\lambda$, but the identification-induced regularizer shows a superior and surprisingly constant performance for sufficiently large $\lambda$. By the proof of Theorem 4.5, for $\lambda$ sufficiently large, the direct and indirect problems (16) and (15) are equivalent, up to causality and complexity constraints. Thus, a sufficiently large $\lambda$ forces the least-squares fit (13) and results in excellent performance independent of the specific value of $\lambda$. While there is a small window where the two-norm excels, the identification-induced regularizer shows overall much more robust performance.

Next we study the merits of the hybrid regularization (25). For the same case study, Figure 3 shows the realized performance plotted over the regularization parameters. The $\{\lambda_1 = 0\}$ and $\{\lambda_2 = 0\}$ slices recover Figures 1 and 2, respectively. As before, the regularizer $\|(I - \Pi)\, g\|$ is more robust, though a hybrid regularization yields a minor albeit robust improvement.

C. Comparison and Bias-Variance Hypotheses
We now compare the direct and indirect approaches through two case studies. The first study evaluates the performance of both methods on the basis of "variance" error, i.e., on a linear system with noisy measurements. The second study evaluates the performance on the basis of "bias" error, i.e., on a nonlinear system with noise-free measurements.

We expect the direct method to perform better on the nonlinear system since the indirect method erroneously selects a linear model class, thus leading to a larger "bias" error. On the other hand, we expect the indirect method to perform better on the linear system with noisy outputs since the identification step filters noise, thus leading to a lower "variance" error.
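The "variance" mechanism can be made tangible with a small sketch (Python with NumPy; the first-order plant and the 5% noise level are illustrative assumptions, not the benchmark of Section V-A): noise-free data yields a block-Hankel matrix satisfying the rank bound $m(T_{\mathrm{ini}}+L)+n$ of Lemma 2.3, whereas measurement noise generically inflates the matrix to full row rank, and restoring the low-rank structure is precisely the de-noising performed by the identification step (19).

```python
import numpy as np

rng = np.random.default_rng(2)

def block_hankel(w, depth):
    """Block-Hankel matrix of a q-variate signal w (shape T x q), `depth` block rows."""
    T = w.shape[0]
    return np.column_stack([w[i:i + depth].reshape(-1) for i in range(T - depth + 1)])

# Noise-free trajectory of a toy first-order SISO system (m = n = 1, q = 2).
a, T, depth = 0.5, 40, 3            # depth plays the role of T_ini + L
u = rng.standard_normal(T)
y = np.zeros(T)
for t in range(T - 1):
    y[t + 1] = a * y[t] + u[t]
w = np.column_stack([u, y])

H_clean = block_hankel(w, depth)    # q*depth = 6 rows
H_noisy = block_hankel(w + 0.05 * rng.standard_normal(w.shape), depth)

# Rank bound of Lemma 2.3: rank <= m*depth + n = 4 for clean data; additive
# noise generically restores full row rank (6), which (19) must undo.
print(np.linalg.matrix_rank(H_clean), np.linalg.matrix_rank(H_noisy))
```

The rank gap between the two matrices is exactly the structure that the low-rank approximation step exploits to filter noise, which is the hypothesized source of the indirect method's advantage in the first case study.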
D. Comparison: Stochastic Linear System
Consider the same case study as in Section V-A, i.e., the same LTI system, cost, and reference. We collected data for varying levels of noise-to-signal ratio, i.e., we considered measurements affected by Gaussian noise with a range of increasing noise-to-signal ratios. For each noise-to-signal ratio, $T = 74$ input/output data samples were collected by applying a random Gaussian input. This data was then used for both the direct and indirect methods.

For the indirect method, the inner system identification problem (3) is solved using the subspace approach N4SID [16] with prefix horizon $T_{\mathrm{ini}} = 5$ and prediction horizon $L = 20$. Equipped with a 5th-order identified model, optimal control inputs are computed by solving (4). The indirect method was compared to the direct method (7) with $h(g) = \|g\|_1$, $T_{\mathrm{ini}} = 5$, and $\lambda = 27$. The hyper-parameters of both methods were kept constant for all simulations below and chosen to give good realized control performance for all noise-to-signal ratios.

For both methods we recorded the realized performance after applying the open-loop inputs and converted it to a percentage error with respect to the best possible performance (i.e., if the deterministic model were exactly known). For each noise-to-signal ratio, 100 simulations were conducted with different random data sets. The results are displayed in the box plot in Figure 4 and show that both methods perform well for low noise-to-signal ratios. As the data becomes noisier, the performance of the direct method degrades significantly, while the performance of the indirect method remains relatively constant. We remark that a slightly better albeit qualitatively similar result is obtained with the regularizer $\|(I - \Pi)\, g\|$. We attribute these observations to the fact that identification de-noises the data. These results confirm our hypothesis that the indirect method is superior in terms of "variance" error.

E. Comparison: Deterministic Nonlinear System
Fig. 4. Comparison of direct and indirect methods for varying noise.

We now consider the scenario where the direct and indirect methods are subject to a "bias" error, but not a "variance" error. Consider the discrete-time nonlinear Lotka-Volterra dynamics considered for direct data-driven control in [45]:
$$x(t_{k+1}) = f_{\mathrm{nonlinear}}(x(t_k), u(t_k)) = \begin{bmatrix} x_1(t_k) + \Delta t\, \big( a\, x_1(t_k) - b\, x_1(t_k)\, x_2(t_k) \big) \\ x_2(t_k) + \Delta t\, \big( d\, x_1(t_k)\, x_2(t_k) - c\, x_2(t_k) + u(t_k) \big) \end{bmatrix},$$
where $t_{k+1} - t_k = \Delta t$, the parameters $a, b, c, d > 0$ are fixed constants, and $x(t_k) = [\,x_1(t_k) \ \ x_2(t_k)\,]^{\top}$. Here, $(x_1(t_k), x_2(t_k))$ denote the prey and predator populations, and $u(t_k)$ is the input. A linearization about the equilibrium $(\bar u, \bar x_1, \bar x_2) = (0, c/d, a/b)$ yields the affine system
$$x(t_{k+1}) = f_{\mathrm{linear}}(x(t_k), u(t_k), \bar x_1, \bar x_2) = \begin{bmatrix} x_1(t_k) + \Delta t\, \big( (a - b \bar x_2)(x_1(t_k) - \bar x_1) - b \bar x_1 (x_2(t_k) - \bar x_2) \big) \\ x_2(t_k) + \Delta t\, \big( d \bar x_2 (x_1(t_k) - \bar x_1) + (d \bar x_1 - c)(x_2(t_k) - \bar x_2) + u(t_k) \big) \end{bmatrix}.$$
We compare the direct and indirect methods for varying degrees of nonlinearity by interpolating between $f_{\mathrm{nonlinear}}$ and $f_{\mathrm{linear}}$, i.e., we study the interpolated system
$$x(t_{k+1}) = \epsilon \cdot f_{\mathrm{linear}}(x(t_k), u(t_k), \bar x_1, \bar x_2) + (1 - \epsilon) \cdot f_{\mathrm{nonlinear}}(x(t_k), u(t_k)) \qquad (27)$$
for $\epsilon \in [0, 1]$. For $\epsilon = 1$ (resp. $\epsilon = 0$), the dynamics are purely affine (resp. nonlinear). For each $\epsilon \in \{0, 0.1, \ldots, 1\}$, $T = 2415$ data points were collected by applying a noisy sinusoidal input $u(t_k) = 2(\sin(t_k) + \sin(0.1\, t_k)) + v(t_k)$ with $v(t_k)$ sampled from a Gaussian random variable. Full state measurement was assumed. The data collection was repeated for 100 different initial conditions. For each degree of nonlinearity $\epsilon \in \{0, 0.1, \ldots, 1\}$ and each initial condition, the data was used to compute optimal open-loop control inputs using the direct and indirect methods.
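The interpolated benchmark (27) is easy to reproduce in simulation. The sketch below (plain Python; the step size and coefficient values are illustrative placeholders, not the exact values used in [45]) implements $f_{\mathrm{nonlinear}}$, $f_{\mathrm{linear}}$, and their interpolation, and checks that the equilibrium $(\bar u, \bar x_1, \bar x_2) = (0, c/d, a/b)$ is a fixed point for every $\epsilon$.

```python
# Illustrative parameters (placeholders, not the benchmark values of [45]).
dt, a, b, c, d = 0.1, 0.5, 0.025, 0.5, 0.005
x1_bar, x2_bar = c / d, a / b       # equilibrium (u, x1, x2) = (0, c/d, a/b)

def f_nonlinear(x1, x2, u):
    """Discrete-time Lotka-Volterra step (prey x1, predator x2, input u)."""
    return (x1 + dt * (a * x1 - b * x1 * x2),
            x2 + dt * (d * x1 * x2 - c * x2 + u))

def f_linear(x1, x2, u):
    """Affine dynamics obtained by linearization about the equilibrium."""
    return (x1 + dt * ((a - b * x2_bar) * (x1 - x1_bar) - b * x1_bar * (x2 - x2_bar)),
            x2 + dt * (d * x2_bar * (x1 - x1_bar) + (d * x1_bar - c) * (x2 - x2_bar) + u))

def f_eps(x1, x2, u, eps):
    """Interpolated system (27): eps = 1 purely affine, eps = 0 purely nonlinear."""
    n1, n2 = f_nonlinear(x1, x2, u)
    l1, l2 = f_linear(x1, x2, u)
    return (eps * l1 + (1 - eps) * n1, eps * l2 + (1 - eps) * n2)

# Sanity check: the equilibrium is a fixed point for every interpolation weight.
for eps in (0.0, 0.5, 1.0):
    x1, x2 = f_eps(x1_bar, x2_bar, 0.0, eps)
    assert abs(x1 - x1_bar) < 1e-9 and abs(x2 - x2_bar) < 1e-9
```

A data-collection run then simply iterates `f_eps` from a random initial condition under the multi-sine input described above.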
The control cost was chosen as $c_{\mathrm{ctrl}}(w - w_r) = \|w - w_r\|$ with equilibrium reference $w_r = (0, \bar x_1, \bar x_2, \ldots, 0, \bar x_1, \bar x_2)$, horizon $L = 600$, and $w = (u, x)$.

For the indirect method, the inner system identification problem (3) is solved using the subspace approach N4SID [16] with initial condition horizon $T_{\mathrm{ini}} = 4$ and prediction horizon $L = 600$. A model order of 4 was chosen, as it produced the best performance as measured by the realized control cost. Optimal control inputs were then computed by solving (4). For comparison, we chose the direct method (7) with $h(g) = \|g\|_1$, $T_{\mathrm{ini}} = 4$, and $\lambda = 8000$. The performance was measured by the realized control cost after applying the open-loop inputs to system (27). As before, the hyper-parameters of both the direct and indirect methods were judiciously chosen and kept constant for all simulations.

The results displayed in Figure 5 show that both methods perform well for low levels of nonlinearity, i.e., for $\epsilon$ close to 1.

Fig. 5. Comparison of direct and indirect methods for varying nonlinearity.

As the system becomes increasingly nonlinear, the performance of the indirect method degrades significantly, while the performance of the direct method remains relatively constant. We attribute this observation to the fact that the indirect method incurs a "bias" error from (erroneously) selecting a linear model class and applying certainty-equivalence control, while the direct method uses data from the nonlinear system without bias.

VI. DISCUSSION AND CONCLUSIONS
We studied the relationship between indirect and direct data-driven control, formulated as bi-level (first identify, then control) and single-level regularized (based on the Fundamental Lemma) optimization problems, respectively. An intermediate multi-criteria problem allowed us to efficiently transition between both formulations. We concluded that the regularized direct approach can be viewed as a convex relaxation of the indirect approach, where the choice of regularizer depends on the problem formulation and accounts for an implicit identification step. We also discovered a novel regularizer that is consistent and accounts for least-squares identification.

Our results suggest the use of the indirect method in the case of "variance" errors and the use of the direct method in the presence of "bias" errors (i.e., a nonlinear system). These insights shed some partial light on the remarkable performance of data-enabled predictive control applied to nonlinear systems.

As a limitation, our results concern only the open-loop predictive control problem, though we ultimately care about the realized performance, especially in a receding-horizon closed-loop implementation. Moreover, we believe that the proposed multi-criteria data-driven control formulation is important in its own right and may deliver excellent performance if one were to find a convex formulation and an appropriate trade-off parameter. Both of these are formidable tasks for future work. Finally, we believe that our approach is also applicable to other identification and control formulations and may deliver interesting and novel direct data-driven control formulations.
ACKNOWLEDGEMENTS
The authors acknowledge their colleagues at ETH Zürich, in particular Miguel Picallo Cruz, for fruitful discussions.

APPENDIX
Consider the mathematical program
$$MP: \quad \operatorname*{minimize}_{x \in C} \ f(x) \quad \text{subject to} \quad g(x) \leq 0, \ h(x) = 0, \qquad (28)$$
where $C \subset \mathbb R^n$ is closed, and $f, g, h$ are lower semicontinuous maps from $\mathbb R^n$ to $\mathbb R$, $\mathbb R^m$, and $\mathbb R$, respectively. Consider the perturbation
$$MP_\epsilon: \quad \operatorname*{minimize}_{x \in C} \ f(x) \quad \text{subject to} \quad g(x) \leq 0, \ h(x) = \epsilon \qquad (29)$$
for $\epsilon \in \mathbb R$. We recall the definition of partial calmness [41]:

Definition A.1:
Let $x^\star$ solve $MP$, and let $\mathbb B_n$ denote the open unit ball in $\mathbb R^n$. Then $MP$ is said to be partially calm at $x^\star$ provided that there are $\mu > 0$ and $\delta > 0$ such that, for all $\epsilon \in \delta \mathbb B_1$ and all $x \in x^\star + \delta \mathbb B_n$ feasible for $MP_\epsilon$, one has
$$f(x) + \mu\, |h(x)| \geq f(x^\star). \qquad (30)$$
Partial calmness is equivalent to exact penalization. In particular, consider for $\mu \geq 0$ the penalized mathematical program
$$PMP_\mu: \quad \operatorname*{minimize}_{x \in C} \ f(x) + \mu \cdot |h(x)| \quad \text{subject to} \quad g(x) \leq 0. \qquad (31)$$
We summarize [40, Proposition 3.3], [41, Proposition 2.2]:

Proposition A.2:
Assume that $f$ is continuous, and let $x^\star$ be a local minimizer of $MP$. Then $MP$ is partially calm at $x^\star$ if and only if there is $\mu^\star > 0$ such that $x^\star$ is a local minimizer of $PMP_\mu$ for all $\mu \ge \mu^\star$. Moreover, any local minimum of $PMP_\mu$ with $\mu > \mu^\star$ is also a local minimum of $MP$.

Partial calmness has been studied for a range of problems, particularly bi-level problems [40], [41], [46]. Of specific importance to us is a related result due to Clarke [47, Proposition 2.4.3], which allows for exact penalization (or, equivalently, partial calmness) and reads in our notation as follows.
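As a minimal numerical sketch of the exact-penalization equivalence in Proposition A.2: the scalar instance $f(x) = (x-2)^2$, $h(x) = x - 1$, and the grid-search routine below are illustrative choices of ours, not from the paper. The constrained minimizer is $x^\star = 1$, and since $|f'(x^\star)| = 2$, the penalized program recovers it exactly once $\mu$ exceeds 2, while a smaller $\mu$ lets the minimizer drift away.

```python
import numpy as np

# Toy instance of MP: minimize f(x) = (x - 2)^2 subject to h(x) = x - 1 = 0.
# Constrained minimizer: x* = 1 with f(x*) = 1.
f = lambda x: (x - 2.0) ** 2
h = lambda x: x - 1.0

def penalized_minimizer(mu):
    """Minimize the exact-penalty objective f(x) + mu * |h(x)| on a fine grid."""
    grid = np.linspace(-2.0, 4.0, 600001)  # step 1e-5
    vals = f(grid) + mu * np.abs(h(grid))
    return grid[np.argmin(vals)]

# Below the threshold mu* = 2, the penalized minimizer is not feasible for MP;
# above it, the penalized and constrained minimizers coincide.
x_small = penalized_minimizer(mu=1.0)  # drifts to x = 1.5
x_large = penalized_minimizer(mu=3.0)  # recovers x* = 1.0
print(x_small, x_large)
```

The drift for small $\mu$ is consistent with Proposition A.2: exact penalization is only guaranteed for $\mu$ above the finite threshold $\mu^\star$.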
Proposition A.3:
Consider the mathematical program $MP$ and its penalized version $PMP_\mu$. Assume that $f$ is Lipschitz continuous with Lipschitz constant $L$, that the equality constraint takes the form of a distance to a closed set $S \subset C$,
\[
h(x) = \mathrm{distance}(x, S) = \inf_{y \in S} \|x - y\|,
\]
and that $MP$ attains a minimum. Then for $\mu > \mu^\star = L$, any local minimum of $PMP_\mu$ is also a local minimum of $MP$.

We note that $\|\cdot\|$ in Proposition A.3 can be an arbitrary norm. For a more general problem setup with a parametric set $S$ depicting the value function of an (inner) optimization problem, the reader is referred to [41, Theorem 2.5]. Additionally, the setup can be extended to (squared) merit functions as penalty functions [41, Theorems 2.6 and 2.9]. These generalize the notion of distance but are easier to formulate and compute.

REFERENCES

[1] L. Hewing, K. P. Wabersich, M. Menner, and M. N. Zeilinger, “Learning-based model predictive control: Toward safe learning in control,”
Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, pp. 269–296, 2020.
[2] G. Pillonetto, F. Dinuzzo, T. Chen, G. De Nicolao, and L. Ljung, “Kernel methods in system identification, machine learning and function estimation: A survey,” Automatica, vol. 50, no. 3, pp. 657–682, 2014.
[3] A. Chiuso and G. Pillonetto, “System identification: A machine learning perspective,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, pp. 281–304, 2019.
[4] Z.-S. Hou and Z. Wang, “From model-based control to data-driven control: Survey, classification and perspective,” Information Sciences, vol. 235, pp. 3–35, 2013.
[5] B. Recht, “A tour of reinforcement learning: The view from continuous control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, pp. 253–279, 2019.
[6] H. Hjalmarsson, “From experiment design to closed-loop control,” Automatica, vol. 41, no. 3, pp. 393–438, 2005.
[7] H. Hjalmarsson, M. Gevers, and F. De Bruyne, “For model-based control design, closed-loop identification gives better performance,” Automatica, vol. 32, no. 12, pp. 1659–1673, 1996.
[8] M. Gevers, “Identification for control: From the early achievements to the revival of experiment design,” European Journal of Control, vol. 11, pp. 1–18, 2005.
[9] R. J. Schrama, “Accurate identification for control: The necessity of an iterative scheme,” IEEE Transactions on Automatic Control, vol. 37, no. 7, pp. 991–994, 1992.
[10] S. Formentin and A. Chiuso, “CoRe: control-oriented regularization for system identification,” IEEE, 2018, pp. 2253–2258.
[11] J. C. Willems, “The behavioral approach to open and interconnected systems,” IEEE Control Systems Magazine, vol. 27, no. 6, pp. 46–99, 2007.
[12] ——, “Paradigms and puzzles in the theory of dynamical systems,” IEEE Transactions on Automatic Control, vol. 36, no. 3, pp. 259–294, 1991.
[13] J. C. Willems and J. W. Polderman, Introduction to Mathematical Systems Theory: A Behavioral Approach. Springer, 1997, vol. 26.
[14] P. Van Overschee and B. De Moor, Subspace Identification for Linear Systems: Theory, Implementation, Applications. Springer Science & Business Media, 2012.
[15] T. Katayama, Subspace Methods for System Identification. Springer Science & Business Media, 2006.
[16] P. Van Overschee and B. De Moor, “N4SID: Subspace algorithms for the identification of combined deterministic-stochastic systems,” Automatica, vol. 30, no. 1, pp. 75–93, 1994.
[17] J. C. Willems, P. Rapisarda, I. Markovsky, and B. L. De Moor, “A note on persistency of excitation,” Systems & Control Letters, vol. 54, no. 4, pp. 325–329, 2005.
[18] H. J. van Waarde, C. De Persis, M. K. Camlibel, and P. Tesi, “Willems’ fundamental lemma for state-space systems and its extension to multiple datasets,” IEEE Control Systems Letters, vol. 4, no. 3, pp. 602–607, 2020.
[19] I. Markovsky and F. Dörfler, “Identifiability in the behavioral setting,” September 2020, submitted. Available at http://homepages.vub.ac.be/ imarkovs/publications/identifiability.pdf.
[20] I. Markovsky, J. C. Willems, S. Van Huffel, and B. De Moor, Exact and Approximate Modeling of Linear Systems: A Behavioral Approach. SIAM, 2006, vol. 11.
[21] I. Markovsky, J. C. Willems, P. Rapisarda, and B. L. De Moor, “Algorithms for deterministic balanced subspace identification,” Automatica, vol. 41, no. 5, pp. 755–766, 2005.
[22] W. Favoreel, B. De Moor, and M. Gevers, “SPC: subspace predictive control,” IFAC Proceedings Volumes, vol. 32, no. 2, pp. 4004–4009, 1999.
[23] S. J. Qin, W. Lin, and L. Ljung, “A novel subspace identification approach with enforced causal models,” Automatica, vol. 41, no. 12, pp. 2043–2053, 2005.
[24] B. Huang and R. Kadali, Dynamic Modeling, Predictive Control and Performance Monitoring: A Data-Driven Subspace Approach. Springer, 2008.
[25] J. Berberich, C. W. Scherer, and F. Allgöwer, “Combining prior knowledge and data for robust controller design,” arXiv preprint arXiv:2009.05253, 2020.
[26] H. J. van Waarde, M. K. Camlibel, and M. Mesbahi, “From noisy data to feedback controllers: non-conservative design via a matrix S-lemma,” arXiv preprint arXiv:2006.00870, 2020.
[27] C. De Persis and P. Tesi, “Formulas for data-driven control: Stabilization, optimality, and robustness,” IEEE Transactions on Automatic Control, vol. 65, no. 3, pp. 909–924, 2019.
[28] I. Markovsky and P. Rapisarda, “Data-driven simulation and control,” International Journal of Control, vol. 81, no. 12, pp. 1946–1959, 2008.
[29] I. Markovsky, “A missing data approach to data-driven filtering and control,” IEEE Transactions on Automatic Control, vol. 62, no. 4, pp. 1972–1978, 2016.
[30] J. Berberich, J. Köhler, M. A. Müller, and F. Allgöwer, “Data-driven model predictive control with stability and robustness guarantees,” IEEE Transactions on Automatic Control, 2020.
[31] J. Coulson, J. Lygeros, and F. Dörfler, “Data-enabled predictive control: In the shallows of the DeePC,” in European Control Conference, 2019, pp. 307–312.
[32] ——, “Regularized and distributionally robust data-enabled predictive control,” in IEEE Conference on Decision and Control, 2019.
[33] ——, “Distributionally robust chance constrained data-enabled predictive control,” 2020, submitted. Available at https://arxiv.org/abs/2006.01702.
[34] A. Xue and N. Matni, “Data-driven system level synthesis,” arXiv preprint arXiv:2011.10674, 2020.
[35] L. Huang, J. Zhen, J. Lygeros, and F. Dörfler, “Quadratic regularization of data-enabled predictive control: Theory and application to power converter experiments,” 2020.
[36] L. Huang, J. Coulson, J. Lygeros, and F. Dörfler, “Decentralized data-enabled predictive control for power system oscillation damping,” arXiv preprint arXiv:1911.12151, 2019.
[37] P. G. Carlet, A. Favato, S. Bolognani, and F. Dörfler, “Data-driven predictive current control for synchronous motor drives,” 2020, pp. 5148–5154.
[38] E. Elokda, J. Coulson, P. Beuchat, J. Lygeros, and F. Dörfler, “Data-enabled predictive control for quadcopters,” International Journal of Robust and Nonlinear Control, arXiv preprint arXiv:2011.00925, 2020.
[40] J. Ye and D. Zhu, “Optimality conditions for bilevel programming problems,” Optimization, vol. 33, no. 1, pp. 9–27, 1995.
[41] J. Ye, D. Zhu, and Q. J. Zhu, “Exact penalization and necessary optimality conditions for generalized bilevel programming problems,” SIAM Journal on Optimization, vol. 7, no. 2, pp. 481–507, 1997.
[42] H. Xu, C. Caramanis, and S. Mannor, “Robust regression and lasso,” IEEE Transactions on Information Theory, vol. 56, no. 7, pp. 3561–3574, 2010.
[43] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.
[44] I. Landau, D. Rey, A. Karimi, A. Voda, and A. Franco, “A flexible transmission system as a benchmark for robust digital control,” European Journal of Control, vol. 1, no. 2, pp. 77–96, 1995.
[45] E. Kaiser, J. N. Kutz, and S. L. Brunton, “Sparse identification of nonlinear dynamics for model predictive control in the low-data limit,” Proceedings of the Royal Society A, vol. 474, no. 2219, p. 20180335, 2018.
[46] P. Mehlitz, L. I. Minchenko, and A. B. Zemkoho, “A note on partial calmness for bilevel optimization problems with linearly structured lower level,” Optimization Letters, pp. 1–15, 2020.
[47] F. H. Clarke, Optimization and Nonsmooth Analysis. SIAM, 1990.
Florian Dörfler (S’09–M’13) is an Associate Professor at the Automatic Control Laboratory at ETH Zürich and the Associate Head of the Department of Information Technology and Electrical Engineering. He received his Ph.D. degree in Mechanical Engineering from the University of California at Santa Barbara in 2013, and a Diplom degree in Engineering Cybernetics from the University of Stuttgart in 2008. From 2013 to 2014 he was an Assistant Professor at the University of California, Los Angeles. His primary research interests are centered around control, optimization, and system theory with applications in network systems, in particular electric power grids. He is a recipient of the distinguished young research awards by IFAC (Manfred Thoma Medal 2020) and EUCA (European Control Award 2020). His students were winners or finalists for Best Student Paper awards at the European Control Conference (2013, 2019), the American Control Conference (2016), the Conference on Decision and Control (2020), the PES General Meeting (2020), and the PES PowerTech Conference (2017). He is furthermore a recipient of the 2010 ACC Student Best Paper Award, the 2011 O. Hugo Schuck Best Paper Award, the 2012–2014 Automatica Best Paper Award, the 2016 IEEE Circuits and Systems Guillemin-Cauer Best Paper Award, and the 2015 UCSB ME Best PhD Award.
Jeremy Coulson is a PhD student with the Automatic Control Laboratory at ETH Zürich. He received his Master of Applied Science in Mathematics & Engineering from Queen’s University, Canada, in August 2017. He received his B.Sc.Eng degree in Mechanical Engineering & Applied Mathematics from Queen’s University in 2015. His research interests include data-driven control methods and stochastic optimization.