Analysis of the least sum-of-minimums estimator for switched systems
Laurent Bako
Abstract—This paper considers a particular parameter estimator for switched systems and analyzes its properties. The estimator in question is defined as the map from the data set to the solution set of an optimization problem where the to-be-optimized cost function is a sum of pointwise infima over a finite set of sub-functions. This is a hard nonconvex problem. The paper studies some fundamental properties of this problem, such as uniqueness of the solution or boundedness of the estimation error, regardless of computational considerations. The interest of the analysis is to lay out the main influential properties of the data on the performance of this (ideal) estimator.
Index Terms—System identification, switched systems, sparsity, data richness, robustness to outliers.
I. INTRODUCTION
A switched system is defined by a finite set of dynamic systems together with a map, called the switching law, which selects over time which system (subsystem) is activated [10], [17]. The switching law may be time-driven, event-driven or state-driven. Such systems can be viewed as formal descriptions of physical phenomena taking place in, for example, power converters [11] or video sequences (from a segmentation perspective) [18]. Finding mathematical representations of switched systems is fundamental for the purpose of control, analysis or diagnosis. In this paper we discuss the theoretical performances/properties of a particular method for identifying a switched model from measurements.
The problem of identifying switched systems directly from input-output data has been largely investigated in the recent literature. Examples of contributions include the works reported in [19], [1], [12], [16], [5], most of which rely on numerical optimization. Some surveys of the topic can be found in [9], [4], [13] (see the references therein). It is fair to remark that a large number of computational methods have been proposed for estimating the parameters of switched systems. However, an important aspect that is not well understood yet is how the properties of the data quantitatively impact the performance of estimation methods operating on those data. In other words, the necessary properties of informativity of the data which favor correct estimation are still to be further investigated. In the current work we take a step forward in the study of such informativity properties. Note that so far, only very few works have considered the fundamental question of characterizing data informativity (richness) in the context of switched system identification [14], [18]. [14] sketches a broad-purpose condition of persistence of excitation for estimating switched state-space realizations.
As to the characterization formulated in [18], it can be interpreted as a rank condition in a lifted space (resulting from a polynomial embedding of the regressors). However, neither of these contributions proposed a characterization of the parametric estimation error bound as an explicit function of the informativity degree of the regression data.
The goal of this paper is to analyze the properties of a particular estimator which we call here the Least Sum-of-Minimums (LSM) estimator for switched system identification. This estimator maps the data to the parameter space (of the constituent subsystems) by
L. Bako is with Laboratoire Ampère (UMR CNRS 5005) – Ecole Centrale de Lyon – Université de Lyon, 69134, Ecully, France. E-mail: [email protected].
associating to a given data set the minimizing set of some data-dependent cost function. The cost function is formed as a sum of pointwise infima of the prediction errors associated to each subsystem. While the prediction errors may be measured in the LSM framework with multiple different loss functions, we focus specifically on the case of the absolute deviation loss function. We note that the LSM estimator is neither analytically expressible, nor numerically solvable directly at a reasonable computational price. Heuristics exist, however, that allow one to approach the solution with, sometimes, guarantees of optimality. For a numerical approach to this problem we refer, for example, to [8]. The perspective taken here is formal rather than computational, the goal being to lay out the properties the data should enjoy to allow for an adequate retrieval of the system parameters, at least in principle. In the wake of our previous work reported in [1], we first derive conditions on the data that guarantee exact recoverability of the true parameter matrix in the hypothetical scenario where the measurements would be essentially noise-free. A striking property of the absolute deviation loss (used in the framework of the LSM estimator) is that it allows for exact recovery even in the face of a sparse noise, provided that the number of nonzero values in the sparse noise sequence does not exceed a certain threshold prescribed by the informativity degree of the data. In the more realistic situations where the data are affected by both dense and sparse noise, we provide parametric error bounds for the estimates delivered by the estimator. The interest of our results resides in the fact that they reveal the impact of the data informativity on the attainable performance of the (ideal) switched system estimator.
This feature makes them potentially useful for optimal experiment design, that is, the process of defining adequately the data-generating experimental conditions that would lead to the smallest (estimation) uncertainty bound.
Outline.
We state the switched system identification problem in Section II and define the LSM estimator. We start the analysis by considering essentially the noiseless scenario in Section III and then the noisy one in Section IV. The main conclusions of our study are recapitulated in Section V.
Notation. R denotes the set of real numbers; R_+ is the set of nonnegative real numbers. For a matrix A = [a_1 ⋯ a_s] ∈ R^{n×s}, we use Set(A) to denote the finite set formed with the columns of A, i.e., Set(A) = {a_1, …, a_s}. If S is a finite set, then |S| denotes the cardinality of S. If x ∈ R then |x| is the absolute value of x. For x = [x_1 ⋯ x_n]^⊤ ∈ R^n, ∥x∥_0 will refer to the ℓ_0 norm of x (namely the number of nonzero entries in the vector x); and ∥x∥_1 = Σ_i |x_i| will denote the ℓ_1 norm of x. If X ∈ R^{n×N} is a matrix and I ⊂ {1, …, N} is a subset of the column indices of X, then X_I denotes the submatrix of X formed with the columns of X which are indexed by I. Similarly, for a vector v ∈ R^N, v_I refers to the subvector of v consisting of the entries of v indexed by I.
II. THE SWITCHED SYSTEM IDENTIFICATION PROBLEM
A. The data-generating system
Consider a (possibly nonlinear) switched system described by an equation of the form

y_t = x_t^⊤ a◦_{σ(t)} + v_t,   (1)

where t ∈ Z_+ refers to discrete time, y_t ∈ R is the output of the system at time t, and x_t ∈ R^n is the regressor. As to v_t, it refers to a potential additive noise component. σ : Z_+ → S ≜ {1, …, s} defines a switching signal and the a◦_i ∈ R^n, i ∈ S, denote some distinct parameter vectors. Eq. (1) describes a switched system composed of s dynamical subsystems, each of which is activated one after another in time by the switching signal σ.
The model (1) captures the situations where the regressor x_t is directly observed or obtained through an intermediary nonlinear mapping of some observable signal z_t ∈ R^d. We will assume that

x_t = ϕ(z_t)   (2)

where ϕ : R^d → R^n is some (known) linear or nonlinear map. Hence, depending on the choice of the mapping ϕ, the model (1) can describe both linear and nonlinear switched systems.
We further observe that the system represented by (1) can be static, in which case z_t is an unstructured multivariate input vector, or dynamic. In this latter case, z_t in (2) may assume the form

z_t = [y_{t−1} ⋯ y_{t−n_a} u_t^⊤ u_{t−1}^⊤ ⋯ u_{t−n_b}^⊤]^⊤ ∈ R^d   (3)

with n_a and n_b being some integers and u_t ∈ R^{n_u} the input of the system. Note that n_a can be taken equal to zero, in which case x_t reduces to x_t = [u_t^⊤ u_{t−1}^⊤ ⋯ u_{t−n_b}^⊤]^⊤ (hence yielding a switched nonlinear system of Finite Impulse Response type).
B. The least sum-of-minimums estimator
For convenience we collect the true parameter vectors a◦_i ∈ R^n from (1) in a matrix A◦ = [a◦_1 ⋯ a◦_s] ∈ R^{n×s} which we call the true parameter matrix. Given a collection of N data points

̟_N = ((x_1, y_1), …, (x_N, y_N))   (4)

generated by the switched system (1), the estimation problem of interest here is to estimate the parameter matrix A◦.
The focus of this paper is this estimation problem. We consider that the number s of subsystems and the structural parameters (n_a, n_b) entering the definition of x_t in (2)-(3) are known a priori. Our goal is to design a map, called an estimator, which maps the data ̟_N to the set of parameters describing the constituent subsystems of the switched system (1). To begin with the approach taken in this paper to such an estimation problem, let T and S denote the index sets of the data and the subsystems respectively, i.e., T = {1, …, N} and S = {1, …, s}. Use the notation Σ to denote the set of all maps σ : T → S (called here switching signals). Consider the cost function J◦ : R^{n×s} × Σ → R_+ defined by

J◦(A, σ) = Σ_{t=1}^N |y_t − a_{σ(t)}^⊤ x_t|

where A ∈ R^{n×s} and σ ∈ Σ. Then a natural estimator of A◦ can be defined as the set-valued map Ψ : (R^n × R)^N → R^{n×s},

Ψ(̟_N) = { Set(Â) : ∃ σ̂ ∈ Σ, (Â, σ̂) ∈ arg min_{A,σ} J◦(A, σ) }.

Ψ(̟_N) is the set of all sets Set(Â) for all Â ∈ R^{n×s} such that (Â, σ̂) is a minimizer of J◦(A, σ) for some switching signal σ̂. If we let

J(A) = Σ_{t=1}^N min_{i=1,…,s} |y_t − a_i^⊤ x_t|   (5)

then it can be easily shown that

Ψ(̟_N) = { Set(Â) : Â ∈ arg min_A J(A) }.   (6)

Hence, minimizing J◦(A, σ) is equivalent to minimizing J(A) in (5). The so-defined Ψ will be called the least sum-of-minimums (LSM) estimator.
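Although the analysis in this paper is purely formal, the cost (5) is easy to evaluate for a candidate parameter matrix. The sketch below (plain Python; the function name, the representation of A as a list of candidate vectors a_i, and the representation of ̟_N as a list of (x_t, y_t) pairs are our own illustrative choices, not notation from the paper) computes J(A):

```python
def J(A, data):
    # Sum-of-minimums cost (5): J(A) = sum_t min_i |y_t - a_i^T x_t|.
    # A is a list of candidate parameter vectors a_i; data is a list of
    # (x_t, y_t) pairs as in the dataset (4).
    total = 0.0
    for x, y in data:
        total += min(abs(y - sum(xj * aj for xj, aj in zip(x, a))) for a in A)
    return total
```

Evaluating J at the true parameter matrix on noiseless data returns 0, in line with the exact-recovery discussion of Section III.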
Because the prediction error is measured here in terms of the absolute value loss function, we may also refer to Ψ in the sequel as the absolute deviation LSM estimator. We start by observing that solving numerically any of these formulations of the switched identification problem is quite hard. The focus of this paper is not on this computational aspect but on the formal properties of the map Ψ. More precisely, we are interested in characterizing conditions (on the data-generating system (1) and on the properties of the data) under which Ψ(̟_N) may reduce to a singleton (unique solution) or may be located at a bounded distance from the true parameter matrix A◦. The primary interest of such conditions is to emphasize the main influential factors of the estimator's performance. From this perspective, we do not expect the intended properties to be necessarily numerically verifiable but to have a rather qualitative flavor which may serve for experiment design, for instance.
III. BASIC PROPERTIES OF THE ESTIMATOR
We start by introducing some definitions. For any matrix A = [a_1 ⋯ a_s] ∈ R^{n×s}, let σ_A ∈ Σ be a switching signal satisfying

σ_A(t) ∈ arg min_{i∈S} |y_t − x_t^⊤ a_i|   (7)

for all t ∈ T. The defining constraint (7) of the switching signal σ_A allows indeed for multiple choices of σ_A(t) whenever arg min_{i∈S} |y_t − x_t^⊤ a_i| ⊂ S is not a singleton. One simple choice to solve this issue would be to set arbitrarily σ_A(t) to be equal to the smallest element of arg min_{i∈S} |y_t − x_t^⊤ a_i|. However, for the purpose of our analysis we will define such a σ_A(t) in a more specific way. Consider the index set

I_i(A) = { t ∈ T : σ_A(t) = i }.   (8)

Then for all A ∈ R^{n×s}, we have I_i(A) ∩ I_j(A) = ∅ for i ≠ j and T = ∪_{i=1}^s I_i(A). For reasons that will become clear in the rest of the paper, it is desired here that min_{i∈S} |I_i(A)| be as large as possible. That is, we want the partition {I_i(A)}_{i∈S} of T to be as balanced as possible in terms of the cardinalities of its members. Hence, it is of interest to use the possible extra degree of freedom offered by Eq. (7) to select σ_A so as to maximize min_{i∈S} |I_i(A)| subject to the constraint (7). In case the maximizing σ_A is still not unique, we can make it unique for a given A by selecting the one which assigns to each t the smallest admissible index i ∈ S. To sum up, given A ∈ R^{n×s}, σ_A can be selected uniquely by following the process described above.
Given σ_A, let us now define the vector φ(A) collecting the errors of the form y_t − x_t^⊤ a_{σ_A(t)} for t ∈ T,

φ(A) = [y_1 − x_1^⊤ a_{σ_A(1)} ⋯ y_N − x_N^⊤ a_{σ_A(N)}]^⊤.   (9)

Then the cost function J(A) in (5) is the ℓ_1 norm of φ(A), J(A) = ∥φ(A)∥_1. Note in passing that J(A) is invariant under column permutation of the matrix A.
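The constructions σ_A and φ(A) can be sketched in a few lines. The helper below (our naming) uses the simple smallest-index tie-break mentioned above rather than the balance-maximizing rule adopted for the analysis:

```python
def assign_and_residuals(A, data):
    # Builds a switching signal sigma_A satisfying (7) (ties broken by the
    # smallest subsystem index, a simplification of the balance-maximizing
    # tie-break used in the text) and the residual vector phi(A) of (9).
    # J(A) is then the l1 norm of the returned phi.
    sigma, phi = [], []
    for x, y in data:
        preds = [sum(xj * aj for xj, aj in zip(x, a)) for a in A]
        i = min(range(len(A)), key=lambda k: abs(y - preds[k]))
        sigma.append(i)
        phi.append(y - preds[i])
    return sigma, phi
```

Reordering the candidate vectors in A leaves the multiset of attainable residual magnitudes unchanged, which is exactly the column-permutation invariance of J just noted.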
The invariance of J under column permutations implies that J(A) is indeed a function of Set(A). Note that this is an intrinsic property of the multiple-regression problem. In other words, the invariance property of J(A) does not constitute any restriction on the switching mechanism of the to-be-identified data-generating system (1).
A. Informativity measure of data and exact recovery
For any r ∈ {0, …, N}, denote with S_r ⊂ R^N the set of r-sparse vectors in R^N, i.e.,

S_r = { w ∈ R^N : ∥w∥_0 ≤ r }.   (10)

For A ∈ R^{n×s}, define the distance δ_r(A) from φ(A) to the set S_r by

δ_r(A) = inf_w { ∥φ(A) − w∥_1 : w ∈ S_r }.   (11)

The so-defined δ_r(A) represents in fact the sum of the N − r smallest entries (in absolute value) of φ(A). In particular, δ_0(A) = ∥φ(A)∥_1 and δ_N(A) = 0.
For any subset T of T, let φ_T(A) refer to the subvector of φ(A) formed with the entries indexed by T.

Definition 1 (Concentration ratio). Consider the dataset ̟_N and the associated map φ defined in (9). Let r ∈ {0, …, N}. We call the r-th concentration ratio of φ on the dataset ̟_N expressed in (4) the number defined by

ξ_r(̟_N) = sup_{(A,A′) ∈ (R^{n×s})^2, T ⊂ T} { ∥φ_T(A) − φ_T(A′)∥_1 / ∥φ(A) − φ(A′)∥_1 : φ(A) ≠ φ(A′), |T| ≤ r }.   (12)

The supremum is taken here with respect to any pair (A, A′) ∈ (R^{n×s})^2 such that φ(A) ≠ φ(A′) and over all subsets T of T whose cardinality does not exceed r. The supremum exists because it is applied to a set which is upper-bounded by 1.
We interpret the concentration ratio as a function which measures quantitatively different levels r of informativity of the data. For a given level r, the data ̟_N are all the more informative as ξ_r(̟_N) is small. Ideally, we would like ξ_r(̟_N) to be as small as possible for the largest possible level r.
Computing numerically ξ_r(̟_N) would require in general solving a hard combinatorial optimization problem, the complexity of which might not be affordable in practice. It can however be more cheaply over-approximated thanks to the direct observation that ξ_r(̟_N) ≤ r ξ_1(̟_N). This is because searching for ξ_1(̟_N) instead of ξ_r(̟_N) alleviates considerably the combinatorial nature of the problem.
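In contrast to ξ_r, the quantity δ_r(A) in (11) is cheap to evaluate once the residual vector φ(A) is available: it is the sum of the N − r smallest residual magnitudes. A minimal sketch (the function name is ours):

```python
def delta_r(phi, r):
    # delta_r(A) from (11): the l1 distance from phi(A) to the set S_r of
    # r-sparse vectors, i.e. the sum of the N - r smallest entries of
    # phi(A) in absolute value (the best r-sparse approximation of phi(A)
    # keeps its r largest-magnitude entries and zeroes the rest).
    mags = sorted(abs(e) for e in phi)
    return sum(mags[:len(phi) - r])
```

In particular, delta_r(phi, 0) returns ∥φ(A)∥_1 and delta_r(phi, N) returns 0, matching the two special cases noted above.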
Note in passing that ξ_r(̟_N) is an increasing function of r and satisfies ξ_0(̟_N) = 0 and ξ_N(̟_N) = 1.

Remark 1.
In the special case where s = 1 (i.e., the situation where (1) reduces to a single subsystem), the matrix A reduces to a single vector, say A = a ∈ R^n. We recover the classical linear regression problem. Then

φ(A) = [(y_1 − x_1^⊤ a) (y_2 − x_2^⊤ a) ⋯ (y_N − x_N^⊤ a)]^⊤ = y − X^⊤ a,

where X = [x_1 ⋯ x_N] ∈ R^{n×N} is a matrix collecting all the regressors {x_t}_{t∈T} generated by (1) and y = [y_1 ⋯ y_N]^⊤ is the vector of all output samples. In this case, ξ_r(̟_N) in (12) takes the form

ξ◦_r(̟_N) = sup_{η ∈ R^n, T ⊂ T} { ∥X_T^⊤ η∥_1 / ∥X^⊤ η∥_1 : η ≠ 0, |T| ≤ r }   (13)

where it is assumed that rank(X) = n, that is, X is full row rank. The notation X_T refers to the matrix formed with the columns of X indexed by T. We observe that in this case, ξ◦_r(̟_N) depends only on the regressor matrix X. Moreover, it can be overestimated through the solution of a convex optimization, see [2].
Using the concentration ratio introduced in (12), we can now state a fundamental lemma for our analysis (see Lemma 2 below, which can be viewed as a special reformulation of Lemma 4.2 in [3]). To ease the proof, we start with a preliminary technical lemma.
Lemma 1.
Let r ∈ {0, …, N} and S_r be defined as in (10). Consider an arbitrary vector v ∈ R^N and define T_r(v) ⊂ T to be the index set of the r largest entries in absolute value of v. Then for all v′ ∈ R^N,

∥v′ − v∥_1 − 2 ∥(v′ − v)_{T_r(v)}∥_1 ≤ ∥v′∥_1 − ∥v∥_1 + 2 inf_{w∈S_r} ∥w − v∥_1.   (14)

Proof:
See Appendix A.
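Since the constant 2 multiplying the truncated term in (14) is easy to misplace, the inequality can also be checked numerically. The sketch below (our naming) evaluates both sides of (14) for given v, v′ and r, using the fact that the infimum equals the sum of the N − r smallest entries of v in absolute value:

```python
def check_lemma1(v, vp, r):
    # Numerical check of inequality (14): with T_r(v) the indices of the
    # r largest-magnitude entries of v,
    #   ||v' - v||_1 - 2 ||(v' - v)_{T_r(v)}||_1
    #     <= ||v'||_1 - ||v||_1 + 2 * inf_{w in S_r} ||w - v||_1.
    l1 = lambda u: sum(abs(e) for e in u)
    N = len(v)
    Tr = sorted(range(N), key=lambda i: -abs(v[i]))[:r]
    lhs = l1([a - b for a, b in zip(vp, v)]) - 2 * sum(abs(vp[i] - v[i]) for i in Tr)
    tail = sum(sorted(abs(e) for e in v)[:N - r])  # inf over S_r
    rhs = l1(vp) - l1(v) + 2 * tail
    return lhs <= rhs + 1e-12
```

For v = (1), v′ = (0) and r = 1, both sides equal −1; without the factor 2 the left-hand side would be 0 and the inequality would fail, which confirms that the constant cannot be dropped.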
Lemma 2.
Let r ∈ {0, …, N}. Consider the dataset ̟_N as in (4) and ξ_r(̟_N) as defined in (12). If ξ_r(̟_N) < 1/2, then

∥φ(A′) − φ(A)∥_1 ≤ (1 − 2 ξ_r(̟_N))^{-1} (J(A′) − J(A) + 2 δ_r(A))   ∀(A, A′) ∈ R^{n×s} × R^{n×s}   (15)

with φ(A), J(A) and δ_r(A) defined in (9), (5) and (11) respectively.
Proof: Let T be a subset of T containing the indices of the r largest entries of φ(A) in absolute value. We apply the result of Lemma 1 with v = φ(A) and v′ = φ(A′), which leads immediately to

∥φ(A′) − φ(A)∥_1 − 2 ∥φ_T(A′) − φ_T(A)∥_1 ≤ ∥φ(A′)∥_1 − ∥φ(A)∥_1 + 2 δ_r(A)   (16)

where δ_r(A) is defined as in (11). From the definition (12) of ξ_r, it can further be observed that ∥φ_T(A′) − φ_T(A)∥_1 ≤ ξ_r(̟_N) ∥φ(A′) − φ(A)∥_1, which in turn implies that (1 − 2 ξ_r(̟_N)) ∥φ(A′) − φ(A)∥_1 is smaller than the left-hand side term of (16). We therefore get

(1 − 2 ξ_r(̟_N)) ∥φ(A′) − φ(A)∥_1 ≤ ∥φ(A′)∥_1 − ∥φ(A)∥_1 + 2 δ_r(A)

and, since ∥φ(A′)∥_1 = J(A′) and ∥φ(A)∥_1 = J(A), the result follows.

Remark 2.
In the scenario of Remark 1, the result of Lemma 2 would read as

∥X^⊤(a′ − a)∥_1 ≤ (1 − 2 ξ◦_r(̟_N))^{-1} (J(a′) − J(a) + 2 δ_r(a))   (17)

with ξ◦_r(̟_N) as in (13). Hence if X is full row rank then the left-hand side constitutes a data-dependent norm on the error a′ − a. If we let λ = inf_{∥η∥=1} ∥X^⊤ η∥_1, then ∥a′ − a∥ ≤ [λ (1 − 2 ξ◦_r(̟_N))]^{-1} (J(a′) − J(a) + 2 δ_r(a)).

By interchanging the roles of A and A′ in the inequality (15) one can obtain

∥φ(A) − φ(A′)∥_1 ≤ (1 − 2 ξ_r(̟_N))^{-1} (J(A) − J(A′) + 2 δ_r(A′)).

Summing this with (15) then yields the following inequality:

∥φ(A′) − φ(A)∥_1 ≤ (1 − 2 ξ_r(̟_N))^{-1} (δ_r(A′) + δ_r(A)).   (18)

Another immediate consequence of Lemma 2 can be stated as follows:

Lemma 3. If ξ_r(̟_N) < 1/2 for some r ∈ {0, …, N}, then for all Â ∈ arg min_A J(A) and for all A ∈ R^{n×s},

∥φ(A) − φ(Â)∥_1 ≤ 2 (1 − 2 ξ_r(̟_N))^{-1} δ_r(A)   (19)

with the convention that T_r(v) = ∅ for r = 0. Moreover, if there exists a matrix Ã such that ∥φ(Ã)∥_0 ≤ r, then

arg min_A J(A) = { A ∈ R^{n×s} : ∥φ(A)∥_0 ≤ r } = { A ∈ R^{n×s} : φ(A) = φ(Ã) }.

Proof:
By Eq. (15), we have

∥φ(A) − φ(Â)∥_1 ≤ (1 − 2 ξ_r(̟_N))^{-1} (J(Â) − J(A) + 2 δ_r(A))

for all A ∈ R^{n×s}. Because J(Â) − J(A) ≤ 0, this yields immediately (19). The second statement follows from the fact that if ∥φ(Ã)∥_0 ≤ r, then δ_r(Ã) = 0. Therefore, replacing A with Ã in (19) shows that φ(Ã) = φ(Â) and so, J(Ã) = J(Â). Hence such an Ã is necessarily in arg min_A J(A). On the other hand, since φ(Ã) = φ(Â), any Â ∈ arg min_A J(A) satisfies ∥φ(Â)∥_0 ≤ r, hence concluding the proof.
An interpretation of Lemma 3 is that if the data ̟_N used to construct the map φ in (9) are generated by the switched system (1), if the data are sufficiently informative in the sense that ξ_r(̟_N) < 1/2 for some r, and if the system parameter vectors are such that ∥φ(A◦)∥_0 ≤ r over the data, with A◦ denoting the true parameter matrix (see Eq. (1)), then Set(A◦) ∈ Ψ(̟_N). At this step, a question that needs to be discussed further is whether Set(A◦) may be the unique member of Ψ(̟_N). For this purpose we need a property of uniform rank on the data X.

Definition 2 (An integer measure of genericity). [1] Let X ∈ R^{n×N} be a data matrix satisfying rank(X) = n. The n-genericity index of X, denoted ν_n(X), is defined as the minimum integer m such that any n × m submatrix of X has rank n,

ν_n(X) = min { m : ∀S ⊂ T with |S| = m, rank(X_S) = n }.   (20)

Here, X_S is a matrix formed with the columns of X indexed by S. This definition implies that any submatrix of X ∈ R^{n×N} having at least ν_n(X) columns (with n ≤ ν_n(X) ≤ N) has full row rank. The smaller ν_n(X), the more generic the regression data X are said to be. According to this rough criterion, the most generic data X achieve ν_n(X) = n. This is typically the case when the regressors {x_t}_{t∈T} are in general position in R^n.
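For small datasets, ν_n(X) can be computed by brute force directly from Definition 2: scan m = n, n+1, … and test every n × m column submatrix for full row rank. A sketch in plain Python (function names are ours; the enumeration is exponential in N, so this is for illustration only):

```python
from itertools import combinations

def rank(M, tol=1e-9):
    # Rank of a matrix (list of rows) via Gaussian elimination.
    M = [row[:] for row in M]
    r = 0
    for c in range(len(M[0]) if M else 0):
        piv = next((i for i in range(r, len(M)) if abs(M[i][c]) > tol), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(len(M)):
            if i != r and abs(M[i][c]) > tol:
                f = M[i][c] / M[r][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

def genericity_index(X):
    # nu_n(X) from (20): the smallest m such that EVERY n x m column
    # submatrix of X has full row rank n.
    n, N = len(X), len(X[0])
    for m in range(n, N + 1):
        if all(rank([[row[j] for j in S] for row in X]) == n
               for S in combinations(range(N), m)):
            return m
    return None  # X itself is not full row rank
```

On regressors in general position the function returns n, the most generic case discussed above; a single repeated direction among the columns pushes the index up.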
Under some minimality conditions [15] on the data-generating system (1), if the input signal {u_t} is generated at random, then ν_n(X) = n with probability one.
Equipped with this notation and the definition of the genericity index ν_n(X), we can now characterize uniqueness of the minimizer of J(A) based on the following lemma.

Lemma 4.
Consider a dataset ̟_N of the form (4) and the notation I_i(A) introduced at the beginning of Section III. Assume that there exists a matrix Ã = [ã_1 ⋯ ã_s] ∈ R^{n×s} with distinct columns ã_i such that

min_{i∈S} |I_i(Ã)| ≥ s ν_n(X)   (21)

on the data ̟_N. Then the following holds:

∀A ∈ R^{n×s}, φ(A) = φ(Ã) ⇒ Set(A) = Set(Ã).   (22)

Proof:
Let A be such that φ(A) = φ(Ã). Then for all t ∈ T, y_t − x_t^⊤ a_{σ_A(t)} = y_t − x_t^⊤ ã_{σ_Ã(t)}, which is equivalent to x_t^⊤(ã_{σ_Ã(t)} − a_{σ_A(t)}) = 0 for all t ∈ T.
The next step of the proof is to show that for any i ∈ S there exists j⋆ ∈ S such that I_{ij⋆} ≜ I_i(Ã) ∩ I_{j⋆}(A) has a cardinality larger than or equal to ν_n(X). For this purpose we proceed by contradiction. Take an arbitrary i ∈ S and assume that |I_{ij}| < ν_n(X) for all j ∈ S. Noting that I_i(Ã) = I_i(Ã) ∩ T = I_i(Ã) ∩ (∪_{j=1}^s I_j(A)) = ∪_{j=1}^s I_{ij}, we obtain |I_i(Ã)| ≤ Σ_{j=1}^s |I_{ij}| < s ν_n(X). But this constitutes a contradiction to the assumption (21). In conclusion, for all i ∈ S, there exists a j⋆ such that |I_{ij⋆}| ≥ ν_n(X). Now we observe that for all t ∈ I_{ij⋆}, x_t^⊤(ã_i − a_{j⋆}) = 0 and so, X_{I_{ij⋆}}^⊤ (ã_i − a_{j⋆}) = 0. But since |I_{ij⋆}| ≥ ν_n(X), we have rank(X_{I_{ij⋆}}) = n, which implies that ã_i = a_{j⋆}. Since all columns of Ã are distinct (no repetition), we conclude that Ã and A have the same columns up to a permutation, which is equivalent to saying that Set(Ã) = Set(A).
It is interesting to note that in the absence of noise in (1), having the true parameter matrix A◦ obey (21) is a sufficient condition for exact recovery of that matrix from the data. What this means is that if v_t = 0 for all t and if all the subsystems have been sufficiently excited in the sense that condition (21) holds for Ã = A◦, then Ψ(̟_N) = {Set(A◦)}.
The following theorem recapitulates the discussion of this section.

Theorem 1.
Consider the dataset ̟_N in (4), generated by the switched system (1). Assume that:
• ̟_N is informative enough in the sense that ξ_r(̟_N) < 1/2 for some r ∈ {0, …, N}; let then r∗(̟_N) = max { r : ξ_r(̟_N) < 1/2 }.
• There exists a matrix Ã ∈ R^{n×s} satisfying the condition (21) and ∥φ(Ã)∥_0 ≤ r∗(̟_N).
Then the estimator Ψ defined in (6) satisfies

Ψ(̟_N) = { Set(Ã) }.   (23)

Proof:
To begin with, note that for r∗ defined as in the statement of the theorem, it holds that δ_{r∗}(Ã) = 0 (see Eq. (11) for the definition of δ_r). Now, since the conditions of Lemma 3 are satisfied, we can apply it to infer that if Â ∈ arg min_A J(A), then φ(Â) = φ(Ã), so that J(Ã) = min_A J(A). Conversely, it is immediate to see that any A′ ∈ R^{n×s} which satisfies φ(A′) = φ(Ã) lies necessarily in arg min_A J(A). Hence we can write

arg min_A J(A) = { A ∈ R^{n×s} : φ(A) = φ(Ã) }.

Applying Lemma 4, we can then write

arg min_A J(A) = { A ∈ R^{n×s} : Set(A) = Set(Ã) }

and so, from (6) we see that Ψ(̟_N) = {Set(Ã)}.
An interpretation of Theorem 1 is that if the data are sufficiently informative, then the set-valued estimator Ψ(̟_N) returns only a singleton. We would of course like this singleton to coincide with the true set of parameter vectors {a◦_i}_{i∈S}. For this to hold, it suffices that the true parameter matrix A◦ satisfies the second condition of the theorem. Note that such a condition is readily satisfied (with at least r∗ = 0) when there is no noise in the data (i.e., v_t = 0 in (1) for all t ∈ T), provided that each subsystem generates enough data. Moreover, by the second condition of the theorem, exact recovery of the true parameter matrix A◦ is still achievable by the estimator Ψ when {v_t} is a sparse noise sequence containing at most r∗ nonzero instances, regardless of the magnitude of these nonzero values. Hence, the larger r∗ (i.e., the richer the regression data ̟_N), the more outliers the least absolute deviation LSM estimator can handle. In contrast, the condition is unlikely to hold generally when dense noise is present in the data.
IV. ERROR BOUNDS IN THE PRESENCE OF NOISE
As mentioned above, we cannot hope for an exact recovery of the true parameter matrix A◦ by the estimator Ψ from data affected by a dense noise sequence {v_t}. We need instead to search for a possible bound on the estimation error as a function of the magnitude of the noise and the richness properties of the data. Indeed, (19) almost provides such a bound. The remaining question to be investigated is under which conditions we can lower-bound ∥φ(Â) − φ(A◦)∥_1 by a norm applying directly to Â − A◦.
A. A key step towards the obtention of an error bound
To begin with the analysis, we introduce some useful technical tools, the first of which is the class of K∞ functions (see, e.g., [6]). This class of functions will be used to measure the increasing rate of the estimation error.

Definition 3 (class-K∞ functions). A function α : R_+ → R_+ is said to be of class-K∞ if it is continuous, zero at zero, strictly increasing and satisfies lim_{s→+∞} α(s) = +∞.

Using this definition we can state a technical lemma which will play an important role in the analysis.

Lemma 5 ([7]). Let f : R^n → R_+ be a positive continuous function satisfying the following properties:
• Positive definiteness: f(x) = 0 if and only if x = 0.
• Relaxed homogeneity: there exists a K∞ function q such that f(λx) ≥ q(λ) f(x) for all λ > 0.
Then for any norm ∥·∥ on R^n, there exists a constant α > 0 such that f(x) ≥ α q(∥x∥).

Our goal now is to derive a bound on a certain measure of the parametric estimation error between the true parameter matrix A◦ and the estimated ones Â ∈ arg min_A J(A). Recalling that J(A) is invariant under column permutation of the matrix A, for this metric to be pertinent, it needs to be specified in terms of distance between the sets Set(A◦) and Set(Â). Hence we consider a metric d of the form d(A, A′) = ∥A − A′_π∥, where ∥·∥ is a norm on R^{n×s} and π : S → S is a permutation depending on the matrices A and A′. Here, the notation A′_π is used to refer to the matrix obtained by permuting the columns of A′ as prescribed by π, A′_π = [a′_{π(1)} ⋯ a′_{π(s)}]. The existence of a permutation π such that d(A, A′) is upper-bounded by ∥φ(A) − φ(A′)∥_1 will depend here on the partitions {I_i(A)}_{i∈S} and {I_i(A′)}_{i∈S} achieved by A and A′ respectively on the data set ̟_N.

Definition 4.
Consider the data set ̟_N in (4), generated by the s-mode switched system (1). We say that two matrices A ∈ R^{n×s} and A′ ∈ R^{n×s} are comparable over the data set ̟_N if there exists a permutation π : S → S such that |I_i(A) ∩ I_{π(i)}(A′)| ≥ ν_n(X) for all i ∈ S.

Note, from Lemma 4 above, that any matrix A ∈ R^{n×s} such that min_{i∈S} |I_i(A)| ≥ s ν_n(X) is comparable to any other matrix A′ with distinct columns satisfying φ(A) = φ(A′). In that case, it even holds that A = A′_π for some permutation π on S. We state hereafter a sufficient condition for comparability.

Lemma 6.
Consider a set ̟_N of input-output data generated by system (1) as defined in (4). Let A ∈ R^{n×s} be a matrix obeying (21). Then any matrix A′ ∈ R^{n×s} satisfying

|I_i(A)| + |I_j(A)| ≥ max_{ℓ∈S} [ |I_i(A) ∩ I_ℓ(A′)| + |I_j(A) ∩ I_ℓ(A′)| ] + 2(s − 1) ν_n(X)   ∀(i, j) ∈ S², i ≠ j,   (24)

is comparable to A over ̟_N in the sense of Definition 4.

Proof: See Appendix B.

To illustrate the condition (24), consider the simple case where |S| = s = 2. Then, under the assumption that A is subject to (21), A and A′ are comparable over ̟_N if N ≥ max(|I_1(A′)|, |I_2(A′)|) + 2 ν_n(X). Noting that max(|I_1(A′)|, |I_2(A′)|) = N/2 + (1/2) | |I_1(A′)| − |I_2(A′)| |, with the outer bars denoting the absolute value, (24) reduces to N ≥ 4 ν_n(X) + | |I_1(A′)| − |I_2(A′)| |. This relation identifies three factors which promote comparability: (i) the data X must be generic enough (i.e., ν_n(X) small); (ii) A′ must partition the data into sets of balanced cardinalities; (iii) the number N of data must be large enough.

Theorem 2.
Consider the dataset ̟_N in (4), generated by the switched system (1), and assume that ξ_r(̟_N) < 1/2 for some r ∈ {0, …, N}. Let (A, A′) ∈ R^{n×s} × R^{n×s} be a pair of comparable matrices with respect to ̟_N (as defined in Eq. (4)). Let π denote the associated permutation. Then for any norm ∥·∥ on R^{n×s}, there exists a strictly positive number D such that

∥A′_π − A∥ ≤ [D (1 − 2 ξ_r(̟_N))]^{-1} (J(A′) − J(A) + 2 δ_r(A)).   (25)

Proof:
We start by observing that all the conditions of Lemma 2 are satisfied. As a consequence, Eq. (15) holds. Departing from this equation, we just need to find an appropriate underestimate of ∥φ(A) − φ(A′)∥_1. To this end, note that

∥φ(A) − φ(A′)∥_1 = Σ_{t∈T} |x_t^⊤ (a_{σ_A(t)} − a′_{σ_{A′}(t)})| = Σ_{(i,j)∈S²} Σ_{t ∈ I_i(A) ∩ I_j(A′)} |x_t^⊤ (a_i − a′_j)| ≥ Σ_{i∈S} Σ_{t ∈ I_i(A) ∩ I_{π(i)}(A′)} |x_t^⊤ η_i|

where η_i = a_i − a′_{π(i)}, with π : S → S denoting the permutation defining the comparability of A and A′ (see Definition 4). Recall that |I_i(A) ∩ I_{π(i)}(A′)| ≥ ν_n(X), i = 1, …, s. Let g : R^{n×s} → R_+ be the function defined by

g(Λ) = inf_{{J_i}_{i∈S}, |J_i| ≥ ν_n(X)} Σ_{i∈S} ∥X_{J_i}^⊤ η_i∥_1   (26)

where the infimum is taken over all s-tuples (J_1, …, J_s) of disjoint subsets of T with cardinality larger than or equal to ν_n(X), and the η_i denote the columns of the argument Λ. Then, by letting Λ = A − A′_π, it follows from the inequality above that

∥φ(A) − φ(A′)∥_1 ≥ g(Λ).   (27)

Since the infimum in (26) operates here on a finite set, it is reached by a certain (J⋆_1, …, J⋆_s). As a consequence, g can be expressed by g(Λ) = Σ_{i∈S} ∥X_{J⋆_i}^⊤ η_i∥_1. The rest of the proof consists in showing that the function g satisfies the conditions of Lemma 5. Clearly, g is positive. If for some E = [e_1 ⋯ e_s] ∈ R^{n×s}, g(E) = 0, then X_{J⋆_i}^⊤ e_i = 0 for all i = 1, …, s. It follows, by the fact that |J⋆_i| ≥ ν_n(X), that e_i = 0. Hence E = 0 and, consequently, g is positive-definite. Moreover, g is continuous as a consequence of the ℓ_1 norm being continuous. Finally, g satisfies the relaxed homogeneity property with the K∞ function q defined by q(x) = x.
We can therefore apply Lemma 5 to conclude that $g(\Lambda) \geq D \|\Lambda\|$, with $D$ being the strictly positive number defined by
$$D = \inf_{\|\Lambda\| = 1} g(\Lambda). \qquad (28)$$
This concludes the proof.

The theorem establishes a bound on the metric $d(A, A')$ in case $A$ and $A'$ are comparable in the sense of Definition 4. For a given $r$, it is interesting to note that the bound displayed in (25) is all the smaller as the data are more generic (i.e., $\xi_r(\varpi_N)$ defined in (12) is small for a relatively large $r$). We also note that if $A$ and $A'$ are not comparable as required in the statement of the theorem, then $\|A - A'_\pi\|$ can grow arbitrarily for any permutation $\pi$ while $\|\phi(A) - \phi(A')\|_1$ remains small. To see this, take for example $s = 2$ and $A = [\tilde{a}_1 \; \tilde{a}_2]$, $A' = [\tilde{a}'_1 \; \beta \tilde{a}'_2]$, with the $\tilde{a}_i$ and $\tilde{a}'_i$ being unit $\ell_2$-norm vectors and $\beta \in \mathbb{R}$. Then, for a given dataset $\varpi_N$, one can choose $\beta$ sufficiently large that $\sigma_{A'}(t) = 1$ for all $t \in T$, i.e., $I_1(A') = T$. For such values of $\beta$, $A$ and $A'$ are not comparable in the sense of Definition 4. We can see, however, that $\|\phi(A) - \phi(A')\|_1$ is independent of $\beta$ while $\|A - A'_\pi\|$ increases arbitrarily as $\beta$ increases, for any permutation $\pi$ on $S = \{1, 2\}$. Remark 3.
Note that, in the scope of Theorem 2, it is in principle possible to restrict the defining supremum of $\xi_r(\varpi_N)$ in (12) to only the pairs $(A, A')$ of comparable matrices. The interest of such a slight reformulation is that it would produce a smaller value of $\xi_r(\varpi_N)$ and hence a potentially tighter bound in (25).

B. Estimation error bound for the switched system

An interesting situation is when $(A, A')$ in Theorem 2 is taken to be equal to $(A^\circ, \hat{A})$ with $\hat{A} \in \arg\min_A J(A)$. In this specific case, invoking the trick used to establish (19) yields the following statement. Corollary 1.
Consider the data $\varpi_N$ generated by system (1) and assume that $\xi_r(\varpi_N) < 1/2$ for some $r \geq 0$. Let $\hat{A} \in \arg\min_A J(A)$. If $\hat{A}$ and the true parameter matrix $A^\circ$ are comparable in the sense of Definition 4, with $\pi \colon S \to S$ denoting the associated comparability permutation, then for any norm $\|\cdot\|$ on $\mathbb{R}^{n \times s}$, there exists a number $D > 0$ such that
$$\|\hat{A}_\pi - A^\circ\| \leq \frac{2\,\delta_r(A^\circ)}{D\,\big(1 - \xi_r(\varpi_N)\big)}. \qquad (29)$$
Since $r$ can be any integer in $\{0, \ldots, N\}$ such that $\xi_r(\varpi_N) < 1/2$, we can, at least formally, optimize the error bound over all such $r$'s. Hence, whenever the comparability condition holds true, a better bound can, in principle, be obtained as
$$\big\| \hat{A}_\pi - A^\circ \big\| \leq \min_{r = 0, \ldots, N} \left\{ \frac{2\,\delta_r(A^\circ)}{D\,\big(1 - \xi_r(\varpi_N)\big)} \; : \; \xi_r(\varpi_N) < 1/2 \right\}. \qquad (30)$$
As already remarked, $\delta_r(A^\circ)$ measures how far $\phi(A^\circ)$ is from the set $\mathcal{S}_r$ of all $r$-sparse signals in $\mathbb{R}^N$. This is essentially a measure of the amount of noise $\{v_t\}$ in the system (1) which generates the data $\varpi_N$. More specifically, $\delta_r(A^\circ)$ equals the sum of the $N - r$ smallest elements in absolute value of the sequence $\{v_t^\circ\}_{t \in T}$ defined by
$$v_t^\circ = v_t + x_t^\top \big( a^\circ_{\sigma(t)} - a^\circ_{\sigma_{A^\circ}(t)} \big), \qquad (31)$$
with $\sigma$ denoting the true switching signal from (1). From the definition of $\sigma_{A^\circ} \in \Sigma$ (see Eq. (7)), it is not hard to see that $|v_t^\circ| \leq |v_t|$ for all $t \in T$ and so $\delta_r(A^\circ) \leq \|v\|_{1,r}$, with $\|v\|_{1,r}$ denoting the sum, in absolute value, of the $N - r$ smallest entries of $\{v_t\}_{t \in T}$. It follows that, under the conditions of Corollary 1, $\|\hat{A}_\pi - A^\circ\| \leq 2 \|v\|_{1,r} / \big( D (1 - \xi_r(\varpi_N)) \big)$. Hence, by considering the special case where $r$ is taken equal to $0$ (this is a reasonable choice, e.g., when there is no outlier in the data), we get $\|\hat{A}_\pi - A^\circ\| \leq (2/D) \|v\|_1$. Note that an underestimate $\hat{D}$ of the number $D$ can be numerically found as suggested in Appendix E.
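To make the computation suggested in Appendix E concrete, the following Python sketch enumerates all index sets of a given cardinality and takes the minimum of $\lambda_{\min}^{1/2}(X_I X_I^\top)$, as in the bound (36) of Lemma 8. The helper name `lsm_D_underestimate` is ours, and the regressor matrix `X` (columns $x_t$) and the integer $m = \nu_n(X)$ are assumed given; this is only an illustration of the combinatorial procedure, not part of the estimator itself.

```python
import itertools
import numpy as np

def lsm_D_underestimate(X, m):
    """Underestimate of D in (28) via the bound (36):
    minimize sqrt(lambda_min(X_I X_I^T)) over all index sets |I| = m."""
    n, N = X.shape
    best = np.inf
    for I in itertools.combinations(range(N), m):
        XI = X[:, list(I)]                      # n x m block of regressors
        lam = np.linalg.eigvalsh(XI @ XI.T)[0]  # smallest eigenvalue
        best = min(best, np.sqrt(max(lam, 0.0)))
    return best

# Toy illustration: two orthonormal regressors, m = n = 2
X = np.eye(2)
print(lsm_D_underestimate(X, 2))  # 1.0
```

The loop visits $\binom{N}{m}$ subsets, which reflects the combinatorial complexity acknowledged in Appendix E; for large $N$ one would only run this on modest datasets.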
Using $\hat{D}$ (instead of $D$) in the expression of the bound yields, however, a more pessimistic value of the bound. A question we ask now is under which condition we may have $v_t^\circ = v_t$ in (31). Such a condition is given in the following proposition. Proposition 1.
Consider the switched system (1) driven by the switching signal $\sigma$ and the noise $\{v_t\}$. Then a necessary and sufficient condition for $\sigma_{A^\circ} = \sigma$ (irrespective of the values of $\sigma$ and those of the noise) is
$$|v_t| < \frac{1}{2} \min_{\substack{(i,j) \in S^2 \\ i \neq j}} \big| x_t^\top (a^\circ_i - a^\circ_j) \big| \quad \forall t \in T. \qquad (32)$$
Proof:
See Appendix C.

The term on the right-hand side of (32) can be interpreted as a measure of how distinguishable the subsystems are from one another. Hence, what the proposition says is that if the noise level is below a certain threshold (which depends on the parametric distinguishability of the subsystems and on some genericity condition on the regressors), then the true switching signal coincides with $\sigma_{A^\circ}$. Finally, an interesting consequence of Proposition 1 is that, under condition (32), we obtain from (31) that $v_t^\circ = v_t$ for all $t \in T$, with the consequence that $\delta_r(A^\circ)$ reduces to $\|v\|_{1,r}$.

C. On the comparability of $\hat{A}$ and $A^\circ$

According to Corollary 1, a sufficient condition for the estimation error induced by the estimator $\Psi$ to be bounded as in (29) is the comparability of $\hat{A}$ and $A^\circ$ over $\varpi_N$ for all $\hat{A}$ such that $\operatorname{Set}(\hat{A}) \in \Psi(\varpi_N)$ (see Definition 4). Lemma 6 suggests that, to favor the comparability of $A^\circ$ and $\hat{A}$, the data $\varpi_N$ and the true parameter matrix $A^\circ$ should satisfy (21) and (24). Indeed, these conditions impose, though in a nontrivial way, some constraints on the distinguishability of the modes composing the switched system, the magnitude of the noise, the excitation signal $\{u_t\}$ and the switching signal $\sigma$. Intuitively, if the level of the noise $\{v_t\}$ is low and if the constituent subsystems are distinguishable enough, then the true parameter matrix $A^\circ$ and its estimate $\hat{A}$ should be comparable. We formalize this as follows. Lemma 7.
Assume that the input-output data $\varpi_N$ in (4), generated by the $s$-mode switched system (1), is such that $A^\circ$ obeys $\min_{i \in S} |I_i(A^\circ)| \geq s m$ with $m \geq \nu_n(X)$. Introduce the notation
$$\gamma_m = \inf_{\substack{\|\eta\|_2 = 1 \\ |I| \geq m}} \big\| X_I^\top \eta \big\|_1, \qquad (33)$$
where the infimum is taken over all subsets $I$ of $T$ with cardinality at least $m$ and over all $\eta \in \mathbb{R}^n$ with unit $\ell_2$ norm. If the subsystems of the switched system (1) are parametrically distinguishable enough in the sense that
$$\min_{i \neq j} \big\| a^\circ_i - a^\circ_j \big\|_2 > \frac{2\,\delta_r(A^\circ)}{\gamma_m \big(1 - \xi_r(\varpi_N)\big)} \qquad (34)$$
for some $r \in \{0, \ldots, N\}$ such that $\xi_r(\varpi_N) < 1/2$, then $A^\circ$ and $\hat{A}$ are comparable over $\varpi_N$ in the sense of Definition 4 for any $\hat{A} \in \arg\min_A J(A)$.

Proof: See Appendix D.

V. CONCLUSION
In this paper we have studied some properties of the least sum-of-minimums (LSM) absolute deviation estimator for switched system identification. Although this estimator is hard to implement numerically, it serves here as a reference estimator for analyzing the degree of richness the data must possess for the identification scheme to be successful. In particular, we have proposed a bound on the estimation error induced by this estimator. Interestingly, the expression of the proposed bound explicitly involves some informativity measures of the training data. The message of that expression, in essence, is that the richer the data, the smaller the estimation error. This opens a nice perspective for identification experiment design for switched systems. In effect, one can form an experiment design problem by searching for the input signal which optimizes the derived information-theoretic measures and thereby the error bound delivered by the estimator. To further pave the avenue towards optimal experiment design, an intermediary step would perhaps be to complement the current analysis with one of the LSM estimator when used with the classical quadratic loss. Another important direction of research is to devise efficient numerical routines for estimating the informativity indices derived in this paper.

APPENDIX
A. Proof of Lemma 1
For the sake of notational simplicity we use $T_r$ in place of $T_r(v)$. Let $T_r^c = T \setminus T_r$ be the complement of $T_r$ in $T$. Then
$$\|v' - v\|_1 = \|(v' - v)_{T_r}\|_1 + \|(v' - v)_{T_r^c}\|_1 \leq \|(v' - v)_{T_r}\|_1 + \|v'_{T_r^c}\|_1 + \|v_{T_r^c}\|_1 = \|(v' - v)_{T_r}\|_1 + \|v'_{T_r^c}\|_1 + \inf_{w \in \mathcal{S}_r} \|w - v\|_1.$$
The inequality is derived from the triangle inequality property of the $\ell_1$ norm. The last equality relies on the fact that $\inf_{w \in \mathcal{S}_r} \|w - v\|_1 = \|v_{T_r^c}\|_1$ (the sum of the $N - r$ smallest entries in absolute value of $v$). Considering the term $\|v'_{T_r^c}\|_1$, we can write
$$\|v'_{T_r^c}\|_1 = \|v'\|_1 - \|v'_{T_r}\|_1 = \|v_{T_r}\|_1 - \|v'_{T_r}\|_1 + \|v'\|_1 - \big( \|v\|_1 - \|v_{T_r^c}\|_1 \big) \leq \|(v' - v)_{T_r}\|_1 + \|v'\|_1 - \|v\|_1 + \inf_{w \in \mathcal{S}_r} \|w - v\|_1.$$
Here, the second equality follows by adding and subtracting $\|v_{T_r}\|_1$, while the last line is obtained by applying again the triangle inequality, which gives $\|v_{T_r}\|_1 - \|v'_{T_r}\|_1 \leq \|(v' - v)_{T_r}\|_1$. The result follows by combining the second inequality with the first one above.

B. Proof of Lemma 6
By reasoning as in the proof of Lemma 4, thanks to the fact that $A$ satisfies condition (21), we easily reach the conclusion that for all $i \in S$ there exists $i^* \in S$ such that $|I_i(A) \cap I_{i^*}(A')| \geq \nu_n(X)$. Let us define a map $\pi \colon S \to S$ by posing $\pi(i) = i^*$. We need to show that $\pi$ can be selected to be a permutation under condition (24) of the lemma. For this purpose, we proceed by contradiction. Recall that $\pi$ is a permutation here if and only if it is injective. And there is no injective map $\pi$ satisfying $|I_i(A) \cap I_{\pi(i)}(A')| \geq \nu_n(X)$ for all $i \in S$ if and only if there are a pair $(i, j)$, $i \neq j$, and an index $\ell \in S$ such that
$$|I_i(A) \cap I_\ell(A')| \geq \nu_n(X), \qquad |I_j(A) \cap I_\ell(A')| \geq \nu_n(X), \qquad (35a)$$
and, for all $k \neq \ell$,
$$|I_i(A) \cap I_k(A')| < \nu_n(X), \qquad |I_j(A) \cap I_k(A')| < \nu_n(X). \qquad (35b)$$
Assume for contradiction that (35) holds. Then, because $\{I_r(A')\}_{r \in S}$ forms a partition of $T$,
$$|I_i(A)| = \sum_{r=1}^{s} |I_i(A) \cap I_r(A')| < (s - 1)\,\nu_n(X) + |I_i(A) \cap I_\ell(A')|.$$
Similarly, we can write $|I_j(A)| < (s - 1)\,\nu_n(X) + |I_j(A) \cap I_\ell(A')|$. Hence
$$|I_i(A)| + |I_j(A)| < 2(s - 1)\,\nu_n(X) + |I_i(A) \cap I_\ell(A')| + |I_j(A) \cap I_\ell(A')|.$$
This is in contradiction with (24). We therefore conclude on the existence of an injective map (and hence of a permutation) $\pi \colon S \to S$.

C. Proof of Proposition 1
If (32) holds true, then for all $t \in T$ and all $i \in S$ with $i \neq \sigma(t)$,
$$\big| y_t - x_t^\top a^\circ_{\sigma(t)} \big| = |v_t| < \tfrac{1}{2} \big| x_t^\top (a^\circ_{\sigma(t)} - a^\circ_i) \big| \leq \tfrac{1}{2} \big| y_t - x_t^\top a^\circ_i \big| + \tfrac{1}{2} \big| y_t - x_t^\top a^\circ_{\sigma(t)} \big|,$$
where the last inequality is derived from the triangle inequality property of $|\cdot|$. It follows that $|y_t - x_t^\top a^\circ_{\sigma(t)}| < |y_t - x_t^\top a^\circ_i|$, which implies that $\sigma_{A^\circ}(t) = \sigma(t)$ for all $t$. Conversely, if $\sigma_{A^\circ} = \sigma$, then for all $(j, t) \in S \times T$ such that $j \neq \sigma(t)$, we get immediately that $|v_t| < |y_t - x_t^\top a^\circ_j| = |x_t^\top (a^\circ_{\sigma(t)} - a^\circ_j) + v_t|$. Taking the square and dividing by $|x_t^\top (a^\circ_{\sigma(t)} - a^\circ_j)|$ gives
$$\big| x_t^\top (a^\circ_{\sigma(t)} - a^\circ_j) \big| > -2\, v_t\, s_j(t),$$
with $s_j(t)$ denoting the sign of $x_t^\top (a^\circ_{\sigma(t)} - a^\circ_j)$. The last inequality holds for any possible values of $\sigma$ if and only if $|x_t^\top (a^\circ_i - a^\circ_j)| > 2 |v_t|$ for all $(i, j) \in S^2$ with $i \neq j$.

D. Proof of Lemma 7
To begin with, let us observe that, by relying on Lemma 5, it can be shown that the number $\gamma_m$ in (33) is well defined and satisfies $\gamma_m > 0$. By the same reasoning as in the proof of Lemma 4, we know that there exists a map $\pi \colon S \to S$ such that $|I_i(A^\circ) \cap I_{\pi(i)}(\hat{A})| \geq m \geq \nu_n(X)$. We just need to establish that such a $\pi$ is bijective under the conditions of the lemma, a property which is equivalent here to injectivity of $\pi$. We proceed by contradiction. Suppose that $\pi$ is not injective, that is, we can find $(i, j) \in S^2$ with $i \neq j$ such that $\pi(i) = \pi(j)$. Let $J_i = I_i(A^\circ) \cap I_{\pi(i)}(\hat{A})$. By applying Lemma 3, we can write
$$\sum_{i \in S} \big\| X_{J_i}^\top (a^\circ_i - \hat{a}_{\pi(i)}) \big\|_1 \leq \big\| \phi(A^\circ) - \phi(\hat{A}) \big\|_1 \leq d,$$
where we have posed $d = 2\,\delta_r(A^\circ) / \big(1 - \xi_r(\varpi_N)\big)$ for conciseness. On the other hand, it follows from the definition (33) of $\gamma_m$ that $\sum_{i \in S} \| X_{J_i}^\top (a^\circ_i - \hat{a}_{\pi(i)}) \|_1 \geq \gamma_m \sum_{i \in S} \| a^\circ_i - \hat{a}_{\pi(i)} \|_2$. As a consequence, we can write $\sum_{i \in S} \| a^\circ_i - \hat{a}_{\pi(i)} \|_2 \leq d / \gamma_m$. Hence, if $\pi(i) = \pi(j)$, then by virtue of the triangle inequality, $\| a^\circ_i - a^\circ_j \|_2 \leq \| a^\circ_i - \hat{a}_{\pi(i)} \|_2 + \| a^\circ_j - \hat{a}_{\pi(j)} \|_2 \leq d / \gamma_m$. This is in contradiction with the assumption (34). We therefore conclude that the claim of the lemma holds true.

E. On the estimation of the number $D$ in (28)

The following lemma provides a method for computing an underestimate of the parameter $D$ in (28), for a particular choice of the norm involved in its definition, though at the price of a combinatorial complexity. Lemma 8.
Assume that the norm used in the definition of the number $D$ in (28) is $\|\cdot\|_{2,\mathrm{col}}$ defined by $\|\Lambda\|_{2,\mathrm{col}} = \sum_{i=1}^{s} \|\eta_i\|_2$ for $\Lambda = [\eta_1 \cdots \eta_s]$. Let $m = \nu_n(X)$. Then
$$D \geq \gamma_m \geq \inf_{|I| = m} \lambda_{\min}^{1/2}\big(X_I X_I^\top\big), \qquad (36)$$
where $\gamma_m$ is the number defined in (33) and $\lambda_{\min}^{1/2}(\cdot)$ denotes the square root of the minimum eigenvalue. The infimum is taken over all subsets $I$ of $T$ with cardinality equal to $m$.

Proof: Recall from (26) and the proof of Theorem 2 the expression $g(\Lambda) = \sum_{i \in S} \| X_{J_i^\star}^\top \eta_i \|_1$ of the function $g$, where the $J_i^\star$ are subsets of $T$ satisfying $|J_i^\star| \geq m = \nu_n(X)$. Then, by substituting $\|\Lambda\|_{2,\mathrm{col}}$ for the norm $\|\Lambda\|$ in Eq. (28), we have
$$D = \inf_{\|\Lambda\|_{2,\mathrm{col}} = 1} g(\Lambda) = \inf_{\|\eta_1\|_2 + \cdots + \|\eta_s\|_2 = 1} \sum_{i \in S} \big\| X_{J_i^\star}^\top \eta_i \big\|_1 \geq \inf_{\|\eta_1\|_2 + \cdots + \|\eta_s\|_2 = 1} \sum_{i \in S} \gamma_m \|\eta_i\|_2 = \gamma_m.$$
The inequality follows as a consequence of the definition of $\gamma_m$, by which $\| X_{J_i^\star}^\top \eta_i \|_1 \geq \gamma_m \|\eta_i\|_2$ since $|J_i^\star| \geq m$. Now, to prove the last inequality in (36), it suffices to notice that $\| X_I^\top \eta \|_1 \geq \| X_I^\top \eta \|_2$. As a result,
$$\gamma_m = \inf_{\substack{\|\eta\|_2 = 1 \\ |I| \geq m}} \big\| X_I^\top \eta \big\|_1 \geq \inf_{\substack{\|\eta\|_2 = 1 \\ |I| \geq m}} \big\| X_I^\top \eta \big\|_2 = \inf_{|I| = m} \lambda_{\min}^{1/2}\big(X_I X_I^\top\big).$$
Given $I \subset T$, it is easy to obtain $\lambda_{\min}^{1/2}(X_I X_I^\top)$. Hence, to obtain an (under)estimate of $D$, we need to compute $\binom{N}{m}$ such values and take the minimum of them. Here, the notation $\binom{N}{m}$ refers to the binomial coefficient. If we let $\hat{D} = \inf_{|I| = m} \lambda_{\min}^{1/2}(X_I X_I^\top)$, then it follows from (29) that $\|\hat{A}_\pi - A^\circ\| \leq (2/\hat{D}) \|v\|_1$ in the particular case where $r$ is taken equal to $0$.

REFERENCES

[1] L. Bako. Identification of switched linear systems via sparse optimization.
Automatica, 47:668–677, 2011.
[2] L. Bako. On a class of optimization-based robust estimators. IEEE Transactions on Automatic Control, 62:5990–5997, 2017.
[3] I. Daubechies, R. DeVore, M. Fornasier, and C. S. Güntürk. Iteratively reweighted least squares minimization for sparse recovery. Communications on Pure and Applied Mathematics, 63:1–38, 2010.
[4] A. Garulli, S. Paoletti, and A. Vicino. A survey on switched and piecewise affine system identification. In IFAC Symposium on System Identification, Brussels, Belgium, 2012.
[5] A. Goudjil, M. Pouliquen, E. Pigeon, and O. Gehan. A real-time identification algorithm for switched linear systems with bounded noise. In European Control Conference, Aalborg, Denmark, 2016.
[6] C. M. Kellett. A compendium of comparison function results. Mathematics of Control, Signals, and Systems, 26:339–374, 2014.
[7] A. Kircher, L. Bako, E. Blanco, and M. Benallouch. An optimization framework for resilient batch estimation in cyber-physical systems. Technical report, Ecole Centrale de Lyon (arxiv.org/abs/1906.01714), 2019.
[8] F. Lauer. Global optimization for low-dimensional switching linear regression and bounded-error estimation. Automatica, 89:73–82, 2018.
[9] F. Lauer and G. Bloch. Hybrid System Identification: Theory and Algorithms for Learning Switching Models. Springer International Publishing, 2019.
[10] D. Liberzon. Switching in Systems and Control. Birkhäuser Boston Inc., 2003.
[11] J. Lunze and F. Lamnabhi-Lagarrigue (Eds). Handbook of Hybrid Systems Control: Theory, Tools, Applications. Cambridge University Press, 2009.
[12] N. Ozay, M. Sznaier, C. Lagoa, and O. Camps. A sparsification approach to set membership identification of switched affine systems. IEEE Transactions on Automatic Control, 57:634–648, 2012.
[13] S. Paoletti, A. Juloski, G. Ferrari-Trecate, and R. Vidal. Identification of hybrid systems: A tutorial. European Journal of Control, 13:242–260, 2007.
[14] M. Petreczky and L. Bako. On the notion of persistence of excitation for linear switched systems. In IEEE Conference on Decision and Control and European Control Conference, Orlando, FL, USA, 2011.
[15] M. Petreczky, L. Bako, S. Lecoeuche, and K. Motchon. Minimality and identifiability of discrete-time SARX systems. To appear in International Journal of Robust and Nonlinear Control, 2020.
[16] G. Pillonetto. A new kernel-based approach to hybrid system identification. Automatica, 70:21–31, 2016.
[17] Z. Sun. Switched Linear Systems: Control and Design. Springer-Verlag London, 2005.
[18] R. Vidal. Recursive identification of switched ARX systems. Automatica, 44:2274–2287, 2008.
[19] R. Vidal, S. Soatto, Y. Ma, and S. Sastry. An algebraic geometric approach to the identification of a class of linear hybrid systems. In