Analysis of the least sum-of-minimums estimator for switched systems
Laurent Bako
Abstract—This paper considers a particular parameter estimator for switched systems and analyzes its properties. The estimator in question is defined as the map from the data set to the solution set of an optimization problem where the to-be-optimized cost function is a sum of pointwise infima over a finite set of sub-functions. This is a hard nonconvex problem. The paper studies some fundamental properties of this problem, such as uniqueness of the solution or boundedness of the estimation error, regardless of computational considerations. The interest of the analysis is to lay out the main influential properties of the data on the performance of this (ideal) estimator.
Index Terms—System identification, switched systems, sparsity, data richness, robustness to outliers.
I. INTRODUCTION
A switched system is defined by a finite set of dynamic systems together with a map, called the switching law, which selects over time which system (subsystem) is activated [10], [17]. The switching law may be time-driven, event-driven or state-driven. Such systems can be viewed as formal descriptions of physical phenomena taking place in, for example, power converters [11] or video sequences (from a segmentation perspective) [18]. Finding mathematical representations of switched systems is fundamental for the purpose of control, analysis or diagnosis. In this paper we discuss the theoretical performances/properties of a particular method for identifying a switched model from measurements.
The problem of identifying switched systems directly from input-output data has been largely investigated in the recent literature. Examples of contributions include the works reported in [19], [1], [12], [16], [5], most of which rely on numerical optimization. Some surveys of the topic can be found in [9], [4], [13] (see the references therein). It is fair to remark that a large number of computational methods have been proposed for estimating the parameters of switched systems. However, an important aspect that is not well understood yet is how the properties of the data quantitatively impact the performance of estimation methods operating on those data. In other words, the necessary properties of informativity of the data which favor correct estimation are still to be further investigated. In the current work we take a step forward in the study of such informativity properties. Note that so far, only very few works have considered the fundamental question of characterizing data informativity (richness) in the context of switched system identification [14], [18]. [14] sketches a broad-purpose condition of persistence of excitation for estimating switched state-space realizations.
As to the characterization formulated in [18], it can be interpreted as a rank condition in a lifted space (resulting from a polynomial embedding of the regressors). However, neither of these contributions proposed a characterization of the parametric estimation error bound as an explicit function of the informativity degree of the regression data.
The goal of this paper is to analyze the properties of a particular estimator which we call here the Least Sum-of-Minimums (LSM) estimator for switched system identification. This estimator maps the data to the parameter space (of the constituent subsystems) by
L. Bako is with Laboratoire Ampère (UMR CNRS 5005) – Ecole Centrale de Lyon – Université de Lyon, 69134, Ecully, France. E-mail: [email protected].
associating to a given data set the minimizing set of some data-dependent cost function. The cost function is formed as a sum of pointwise infima of the prediction errors associated to each subsystem. While the prediction errors may be measured in the LSM framework with multiple different loss functions, we focus specifically on the case of the absolute deviation loss function. We note that the LSM estimator is neither analytically expressible, nor numerically solvable directly at a reasonable computational price. Heuristics exist, however, that allow one to approach the solution with, sometimes, guarantees of optimality. For a numerical approach to this problem we refer, for example, to [8]. The perspective taken here is formal rather than computational, the goal being to lay out the properties the data should enjoy to allow for an adequate retrieval of the system parameters, at least in principle. In the wake of our previous work reported in [1], we first derive conditions on the data that guarantee exact recoverability of the true parameter matrix in the hypothetical scenario where the measurements would be essentially noise-free. A striking property of the absolute deviation loss (used in the framework of the LSM estimator) is that it allows for exact recovery even in the face of a sparse noise, provided that the number of nonzero values in the sparse noise sequence does not exceed a certain threshold prescribed by the informativity degree of the data. In the more realistic situations where the data are affected by both dense and sparse noise, we provide parametric error bounds for the estimates delivered by the estimator. The interest of our results resides in the fact that they reveal the impact of the data informativity on the attainable performance of the (ideal) switched system estimator.
This feature makes them potentially useful for optimal experiment design, that is, the process of defining adequately the data-generating experimental conditions that would lead to the smallest (estimation) uncertainty bound.
Outline.
We state the switched system identification problem in Section II and define the LSM estimator. We start the analysis by considering essentially the noiseless scenario in Section III and then the noisy one in Section IV. The main conclusions of our study are recapitulated in Section V.
Notation. R denotes the set of real numbers; R_+ is the set of nonnegative real numbers. For a matrix A = [a_1 ⋯ a_s] ∈ R^{n×s}, we use Set(A) to denote the finite set formed with the columns of A, i.e., Set(A) = {a_1, …, a_s}. If S is a finite set, then |S| denotes the cardinality of S. If x ∈ R then |x| is the absolute value of x. For x = [x_1 ⋯ x_n]^⊤ ∈ R^n, ∥x∥_0 will refer to the ℓ_0 norm of x (namely the number of nonzero entries in the vector x); and ∥x∥_1 = Σ_i |x_i| will denote the ℓ_1 norm of x. If X ∈ R^{n×N} is a matrix and I ⊂ {1, …, N} is a subset of the column indices of X, then X_I denotes the submatrix of X formed with the columns of X which are indexed by I. Similarly, for a vector v ∈ R^N, v_I refers to the subvector of v consisting of the entries of v indexed by I.
II. THE SWITCHED SYSTEM IDENTIFICATION PROBLEM
A. The data-generating system
Consider a (possibly nonlinear) switched system described by an equation of the form

y_t = x_t^⊤ a◦_{σ(t)} + v_t,   (1)

where t ∈ Z_+ refers to discrete time, y_t ∈ R is the output of the system at time t, and x_t ∈ R^n is the regressor. As to v_t, it refers to a potential additive noise component. σ : Z_+ → S ≜ {1, …, s} defines a switching signal and the a◦_i ∈ R^n, i ∈ S, denote some distinct parameter vectors. Eq. (1) describes a switched system composed of s dynamical subsystems, each of which is activated one after another in time by the switching signal σ.
The model (1) captures the situations where the regressor x_t is directly observed or obtained through an intermediary nonlinear mapping of some observable signal z_t ∈ R^d. We will assume that

x_t = ϕ(z_t)   (2)

where ϕ : R^d → R^n is some (known) linear or nonlinear map. Hence, depending on the choice of the mapping ϕ, the model (1) can describe both linear and nonlinear switched systems.
We further observe that the system represented by (1) can be static, in which case z_t is an unstructured multivariate input vector, or dynamic. In this latter case, z_t in (2) may assume the form

z_t = [y_{t−1} ⋯ y_{t−n_a} u_t^⊤ u_{t−1}^⊤ ⋯ u_{t−n_b}^⊤]^⊤ ∈ R^d   (3)

with n_a and n_b being some integers and u_t ∈ R^{n_u} the input of the system. Note that n_a can be taken equal to zero, in which case x_t reduces to x_t = [u_t^⊤ u_{t−1}^⊤ ⋯ u_{t−n_b}^⊤]^⊤ (hence yielding a switched nonlinear system of Finite Impulse Response type).
B. The least sum-of-minimums estimator
For convenience we collect the true parameter vectors a◦_i ∈ R^n from (1) in a matrix A◦ = [a◦_1 ⋯ a◦_s] ∈ R^{n×s} which we call the true parameter matrix. Given a collection of N data points

̟_N = ((x_1, y_1), …, (x_N, y_N))   (4)

generated by the switched system (1), the estimation problem of interest here is to estimate the parameter matrix A◦.
The focus of this paper is this estimation problem. We consider that the number s of subsystems and the structural parameters (n_a, n_b) entering the definition of x_t in (2)-(3) are known a priori. Our goal is to design a map, called an estimator, which maps the data ̟_N to the set of parameters describing the constituent subsystems of the switched system (1). To begin with the approach taken in this paper to such an estimation problem, let T and S denote the index sets of the data and the subsystems respectively, i.e., T = {1, …, N} and S = {1, …, s}. Use the notation Σ to denote the set of all maps σ : T → S (called here switching signals). Consider the cost function J◦ : R^{n×s} × Σ → R_+ defined by

J◦(A, σ) = Σ_{t=1}^N |y_t − a_{σ(t)}^⊤ x_t|

where A ∈ R^{n×s} and σ ∈ Σ. Then a natural estimator of A◦ can be defined as the set-valued map Ψ : (R^n × R)^N → R^{n×s},

Ψ(̟_N) = { Set(Â) : ∃ σ̂ ∈ Σ, (Â, σ̂) ∈ arg min_{A,σ} J◦(A, σ) }.

Ψ(̟_N) is the set of all sets Set(Â) for all Â ∈ R^{n×s} such that (Â, σ̂) is a minimizer of J◦(A, σ) for some switching signal σ̂. If we let

J(A) = Σ_{t=1}^N min_{i=1,…,s} |y_t − a_i^⊤ x_t|   (5)

then it can be easily shown that

Ψ(̟_N) = { Set(Â) : Â ∈ arg min_A J(A) }.   (6)

Hence, minimizing J◦(A, σ) is equivalent to minimizing J(A) in (5). The so-defined Ψ will be called the least sum-of-minimums (LSM) estimator.
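Although the analysis in this paper is purely formal, the cost (5) is easy to evaluate for a candidate parameter matrix. The sketch below (plain Python; the function name, the representation of A as a list of candidate vectors a_i, and the representation of ̟_N as a list of (x_t, y_t) pairs are our own illustrative choices, not notation from the paper) computes J(A):

```python
def J(A, data):
    # Sum-of-minimums cost (5): J(A) = sum_t min_i |y_t - a_i^T x_t|.
    # A is a list of candidate parameter vectors a_i; data is a list of
    # (x_t, y_t) pairs as in the dataset (4).
    total = 0.0
    for x, y in data:
        total += min(abs(y - sum(xj * aj for xj, aj in zip(x, a))) for a in A)
    return total
```

Evaluating J at the true parameter matrix on noiseless data returns 0, in line with the exact-recovery discussion of Section III.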
Because the prediction error is measured here in terms of the absolute value loss function, we may also refer to Ψ in the sequel as the absolute deviation LSM estimator. We start by observing that solving numerically any of these formulations of the switched identification problem is quite hard. The focus of this paper is not on this computational aspect but on the formal properties of the map Ψ. More precisely, we are interested in characterizing conditions (on the data-generating system (1) and on the properties of the data) under which Ψ(̟_N) may reduce to a singleton (unique solution) or may be located at a bounded distance from the true parameter matrix A◦. The primary interest of such conditions is to emphasize the main influential factors of the estimator's performance. From this perspective, we do not expect the intended properties to be necessarily numerically verifiable but to have a rather qualitative flavor which may serve for experiment design, for instance.
III. BASIC PROPERTIES OF THE ESTIMATOR
We start by introducing some definitions. For any matrix A = [a_1 ⋯ a_s] ∈ R^{n×s}, let σ_A ∈ Σ be a switching signal satisfying

σ_A(t) ∈ arg min_{i∈S} |y_t − x_t^⊤ a_i|   (7)

for all t ∈ T. The defining constraint (7) of the switching signal σ_A allows indeed for multiple choices of σ_A(t) whenever arg min_{i∈S} |y_t − x_t^⊤ a_i| ⊂ S is not a singleton. One simple choice to solve this issue would be to set arbitrarily σ_A(t) to be equal to the smallest element of arg min_{i∈S} |y_t − x_t^⊤ a_i|. However, for the purpose of our analysis we will define such a σ_A(t) in a more specific way. Consider the index set

I_i(A) = { t ∈ T : σ_A(t) = i }.   (8)

Then for all A ∈ R^{n×s}, we have I_i(A) ∩ I_j(A) = ∅ for i ≠ j and T = ∪_{i=1}^s I_i(A). For reasons that will become clear in the rest of the paper, it is desired here that min_{i∈S} |I_i(A)| be as large as possible. That is, we want the partition {I_i(A)}_{i∈S} of T to be as balanced as possible in terms of the cardinalities of its members. Hence, it is of interest to use the possible extra degree of freedom offered by Eq. (7) to select σ_A so as to maximize min_{i∈S} |I_i(A)| subject to the constraint (7). In case the maximizing σ_A is still not unique, we can make it unique for a given A by selecting the one which assigns to each t the smallest admissible index i ∈ S. To sum up, given A ∈ R^{n×s}, σ_A can be selected uniquely by following the process described above.
Given σ_A, let us now define the vector φ(A) collecting the errors of the form y_t − x_t^⊤ a_{σ_A(t)} for t ∈ T,

φ(A) = [y_1 − x_1^⊤ a_{σ_A(1)} ⋯ y_N − x_N^⊤ a_{σ_A(N)}]^⊤.   (9)

Then the cost function J(A) in (5) is the ℓ_1 norm of φ(A), J(A) = ∥φ(A)∥_1. Note in passing that J(A) is invariant under column permutation of the matrix A.
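The constructions σ_A and φ(A) can be sketched in a few lines. The helper below (our naming) uses the simple smallest-index tie-break mentioned above rather than the balance-maximizing rule adopted for the analysis:

```python
def assign_and_residuals(A, data):
    # Builds a switching signal sigma_A satisfying (7) (ties broken by the
    # smallest subsystem index, a simplification of the balance-maximizing
    # tie-break used in the text) and the residual vector phi(A) of (9).
    # J(A) is then the l1 norm of the returned phi.
    sigma, phi = [], []
    for x, y in data:
        preds = [sum(xj * aj for xj, aj in zip(x, a)) for a in A]
        i = min(range(len(A)), key=lambda k: abs(y - preds[k]))
        sigma.append(i)
        phi.append(y - preds[i])
    return sigma, phi
```

Reordering the candidate vectors in A leaves the multiset of attainable residual magnitudes unchanged, which is exactly the column-permutation invariance of J just noted.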
The invariance of J under column permutations implies that J(A) is indeed a function of Set(A). Note that this is an intrinsic property of the multiple-regression problem. In other words, the invariance property of J(A) does not constitute any restriction on the switching mechanism of the to-be-identified data-generating system (1).
A. Informativity measure of data and exact recovery
For any r ∈ {0, …, N}, denote with S_r ⊂ R^N the set of r-sparse vectors in R^N, i.e.,

S_r = { w ∈ R^N : ∥w∥_0 ≤ r }.   (10)

For A ∈ R^{n×s}, define the distance δ_r(A) from φ(A) to the set S_r by

δ_r(A) = inf_w { ∥φ(A) − w∥_1 : w ∈ S_r }.   (11)

The so-defined δ_r(A) represents in fact the sum of the N − r smallest entries (in absolute value) of φ(A). In particular, δ_0(A) = ∥φ(A)∥_1 and δ_N(A) = 0.
For any subset T of T, let φ_T(A) refer to the subvector of φ(A) formed with the entries indexed by T.

Definition 1 (Concentration ratio). Consider the dataset ̟_N and the associated map φ defined in (9). Let r ∈ {0, …, N}. We call the r-th concentration ratio of φ on the dataset ̟_N expressed in (4) the number defined by

ξ_r(̟_N) = sup_{(A,A′) ∈ (R^{n×s})^2, T ⊂ T} { ∥φ_T(A) − φ_T(A′)∥_1 / ∥φ(A) − φ(A′)∥_1 : φ(A) ≠ φ(A′), |T| ≤ r }.   (12)

The supremum is taken here with respect to any pair (A, A′) ∈ (R^{n×s})^2 such that φ(A) ≠ φ(A′) and over all subsets T of T whose cardinality does not exceed r. The supremum exists because it is applied to a set which is upper-bounded by 1.
We interpret the concentration ratio as a function which measures quantitatively different levels r of informativity of the data. For a given level r, the data ̟_N are all the more informative as ξ_r(̟_N) is small. Ideally, we would like ξ_r(̟_N) to be as small as possible for the largest possible level r.
Computing numerically ξ_r(̟_N) would require in general solving a hard combinatorial optimization problem, the complexity of which might not be affordable in practice. It can however be more cheaply over-approximated thanks to the direct observation that ξ_r(̟_N) ≤ r ξ_1(̟_N). This is because searching for ξ_1(̟_N) instead of ξ_r(̟_N) alleviates considerably the combinatorial nature of the problem.
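In contrast to ξ_r, the quantity δ_r(A) in (11) is cheap to evaluate once the residual vector φ(A) is available: it is the sum of the N − r smallest residual magnitudes. A minimal sketch (the function name is ours):

```python
def delta_r(phi, r):
    # delta_r(A) from (11): the l1 distance from phi(A) to the set S_r of
    # r-sparse vectors, i.e. the sum of the N - r smallest entries of
    # phi(A) in absolute value (the best r-sparse approximation of phi(A)
    # keeps its r largest-magnitude entries and zeroes the rest).
    mags = sorted(abs(e) for e in phi)
    return sum(mags[:len(phi) - r])
```

In particular, delta_r(phi, 0) returns ∥φ(A)∥_1 and delta_r(phi, N) returns 0, matching the two special cases noted above.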
Note in passing that ξ_r(̟_N) is an increasing function of r and satisfies ξ_0(̟_N) = 0 and ξ_N(̟_N) = 1.

Remark 1.
In the special case where s = 1 (i.e., the situation where (1) reduces to a single subsystem), the matrix A reduces to a single vector, say A = a ∈ R^n. We recover the classical linear regression problem. Then

φ(A) = [(y_1 − x_1^⊤ a) (y_2 − x_2^⊤ a) ⋯ (y_N − x_N^⊤ a)]^⊤ = y − X^⊤ a,

where X = [x_1 ⋯ x_N] ∈ R^{n×N} is a matrix collecting all the regressors {x_t}_{t∈T} generated by (1) and y = [y_1 ⋯ y_N]^⊤ is the vector of all output samples. In this case, ξ_r(̟_N) in (12) takes the form

ξ◦_r(̟_N) = sup_{η ∈ R^n, T ⊂ T} { ∥X_T^⊤ η∥_1 / ∥X^⊤ η∥_1 : η ≠ 0, |T| ≤ r }   (13)

where it is assumed that rank(X) = n, that is, X is full row rank. The notation X_T refers to the matrix formed with the columns of X indexed by T. We observe that in this case, ξ◦_r(̟_N) depends only on the regressor matrix X. Moreover, it can be overestimated through the solution of a convex optimization, see [2].
Using the concentration ratio introduced in (12), we can now state a fundamental lemma for our analysis (see Lemma 2 below, which can be viewed as a special reformulation of Lemma 4.2 in [3]). To ease the proof, we start with a preliminary technical lemma.
Lemma 1.
Let r ∈ {0, …, N} and S_r be defined as in (10). Consider an arbitrary vector v ∈ R^N and define T_r(v) ⊂ T to be the index set of the r largest entries in absolute value of v. Then for all v′ ∈ R^N,

∥v′ − v∥_1 − 2 ∥(v′ − v)_{T_r(v)}∥_1 ≤ ∥v′∥_1 − ∥v∥_1 + 2 inf_{w∈S_r} ∥w − v∥_1.   (14)

Proof:
See Appendix A.
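Since the constant 2 multiplying the truncated term in (14) is easy to misplace, the inequality can also be checked numerically. The sketch below (our naming) evaluates both sides of (14) for given v, v′ and r, using the fact that the infimum equals the sum of the N − r smallest entries of v in absolute value:

```python
def check_lemma1(v, vp, r):
    # Numerical check of inequality (14): with T_r(v) the indices of the
    # r largest-magnitude entries of v,
    #   ||v' - v||_1 - 2 ||(v' - v)_{T_r(v)}||_1
    #     <= ||v'||_1 - ||v||_1 + 2 * inf_{w in S_r} ||w - v||_1.
    l1 = lambda u: sum(abs(e) for e in u)
    N = len(v)
    Tr = sorted(range(N), key=lambda i: -abs(v[i]))[:r]
    lhs = l1([a - b for a, b in zip(vp, v)]) - 2 * sum(abs(vp[i] - v[i]) for i in Tr)
    tail = sum(sorted(abs(e) for e in v)[:N - r])  # inf over S_r
    rhs = l1(vp) - l1(v) + 2 * tail
    return lhs <= rhs + 1e-12
```

For v = (1), v′ = (0) and r = 1, both sides equal −1; without the factor 2 the left-hand side would be 0 and the inequality would fail, which confirms that the constant cannot be dropped.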
Lemma 2.
Let r ∈ {0, …, N}. Consider the dataset ̟_N as in (4) and ξ_r(̟_N) as defined in (12). If ξ_r(̟_N) < 1/2, then

∥φ(A′) − φ(A)∥_1 ≤ (1 − 2 ξ_r(̟_N))^{-1} (J(A′) − J(A) + 2 δ_r(A))   ∀(A, A′) ∈ R^{n×s} × R^{n×s}   (15)

with φ(A), J(A) and δ_r(A) defined in (9), (5) and (11) respectively.
Proof: Let T be a subset of T containing the indices of the r largest entries of φ(A) in absolute value. We apply the result of Lemma 1 with v = φ(A) and v′ = φ(A′), which leads immediately to

∥φ(A′) − φ(A)∥_1 − 2 ∥φ_T(A′) − φ_T(A)∥_1 ≤ ∥φ(A′)∥_1 − ∥φ(A)∥_1 + 2 δ_r(A)   (16)

where δ_r(A) is defined as in (11). From the definition (12) of ξ_r, it can further be observed that ∥φ_T(A′) − φ_T(A)∥_1 ≤ ξ_r(̟_N) ∥φ(A′) − φ(A)∥_1, which in turn implies that (1 − 2 ξ_r(̟_N)) ∥φ(A′) − φ(A)∥_1 is smaller than the left-hand side term of (16). We therefore get

(1 − 2 ξ_r(̟_N)) ∥φ(A′) − φ(A)∥_1 ≤ ∥φ(A′)∥_1 − ∥φ(A)∥_1 + 2 δ_r(A)

and, since ∥φ(A′)∥_1 = J(A′) and ∥φ(A)∥_1 = J(A), the result follows.

Remark 2.
In the scenario of Remark 1, the result of Lemma 2 would read as

∥X^⊤(a′ − a)∥_1 ≤ (1 − 2 ξ◦_r(̟_N))^{-1} (J(a′) − J(a) + 2 δ_r(a))   (17)

with ξ◦_r(̟_N) as in (13). Hence if X is full row rank then the left-hand side constitutes a data-dependent norm on the error a′ − a. If we let λ = inf_{∥η∥=1} ∥X^⊤ η∥_1, then ∥a′ − a∥ ≤ [λ (1 − 2 ξ◦_r(̟_N))]^{-1} (J(a′) − J(a) + 2 δ_r(a)).

By interchanging the roles of A and A′ in the inequality (15) one can obtain

∥φ(A) − φ(A′)∥_1 ≤ (1 − 2 ξ_r(̟_N))^{-1} (J(A) − J(A′) + 2 δ_r(A′)).

Summing this with (15) then yields the following inequality:

∥φ(A′) − φ(A)∥_1 ≤ (1 − 2 ξ_r(̟_N))^{-1} (δ_r(A′) + δ_r(A)).   (18)

Another immediate consequence of Lemma 2 can be stated as follows:

Lemma 3. If ξ_r(̟_N) < 1/2 for some r ∈ {0, …, N}, then for all Â ∈ arg min_A J(A) and for all A ∈ R^{n×s},

∥φ(A) − φ(Â)∥_1 ≤ 2 (1 − 2 ξ_r(̟_N))^{-1} δ_r(A)   (19)

with the convention that T_r(v) = ∅ for r = 0. Moreover, if there exists a matrix Ã such that ∥φ(Ã)∥_0 ≤ r, then

arg min_A J(A) = { A ∈ R^{n×s} : ∥φ(A)∥_0 ≤ r } = { A ∈ R^{n×s} : φ(A) = φ(Ã) }.

Proof:
By Eq. (15), we have

∥φ(A) − φ(Â)∥_1 ≤ (1 − 2 ξ_r(̟_N))^{-1} (J(Â) − J(A) + 2 δ_r(A))

for all A ∈ R^{n×s}. Because J(Â) − J(A) ≤ 0, this yields immediately (19). The second statement follows from the fact that if ∥φ(Ã)∥_0 ≤ r, then δ_r(Ã) = 0. Therefore, replacing A with Ã in (19) shows that φ(Ã) = φ(Â) and so, J(Ã) = J(Â). Hence such an Ã is necessarily in arg min_A J(A). On the other hand, since φ(Ã) = φ(Â), any Â ∈ arg min_A J(A) satisfies ∥φ(Â)∥_0 ≤ r, hence concluding the proof.
An interpretation of Lemma 3 is that if the data ̟_N used to construct the map φ in (9) are generated by the switched system (1), if the data are sufficiently informative in the sense that ξ_r(̟_N) < 1/2 for some r, and if the system parameter vectors are such that ∥φ(A◦)∥_0 ≤ r over the data, with A◦ denoting the true parameter matrix (see Eq. (1)), then Set(A◦) ∈ Ψ(̟_N). At this step, a question that needs to be discussed further is whether Set(A◦) may be the unique member of Ψ(̟_N). For this purpose we need a property of uniform rank on the data X.

Definition 2 (An integer measure of genericity). [1] Let X ∈ R^{n×N} be a data matrix satisfying rank(X) = n. The n-genericity index of X, denoted ν_n(X), is defined as the minimum integer m such that any n × m submatrix of X has rank n,

ν_n(X) = min { m : ∀S ⊂ T with |S| = m, rank(X_S) = n }.   (20)

Here, X_S is a matrix formed with the columns of X indexed by S. This definition implies that any submatrix of X ∈ R^{n×N} having at least ν_n(X) columns (with n ≤ ν_n(X) ≤ N) has full row rank. The smaller ν_n(X), the more generic the regression data X are said to be. According to this rough criterion, the most generic data X achieve ν_n(X) = n. This is typically the case when the regressors {x_t}_{t∈T} are in general position in R^n.
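For small datasets, ν_n(X) can be computed by brute force directly from Definition 2: scan m = n, n+1, … and test every n × m column submatrix for full row rank. A sketch in plain Python (function names are ours; the enumeration is exponential in N, so this is for illustration only):

```python
from itertools import combinations

def rank(M, tol=1e-9):
    # Rank of a matrix (list of rows) via Gaussian elimination.
    M = [row[:] for row in M]
    r = 0
    for c in range(len(M[0]) if M else 0):
        piv = next((i for i in range(r, len(M)) if abs(M[i][c]) > tol), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(len(M)):
            if i != r and abs(M[i][c]) > tol:
                f = M[i][c] / M[r][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

def genericity_index(X):
    # nu_n(X) from (20): the smallest m such that EVERY n x m column
    # submatrix of X has full row rank n.
    n, N = len(X), len(X[0])
    for m in range(n, N + 1):
        if all(rank([[row[j] for j in S] for row in X]) == n
               for S in combinations(range(N), m)):
            return m
    return None  # X itself is not full row rank
```

On regressors in general position the function returns n, the most generic case discussed above; a single repeated direction among the columns pushes the index up.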
Under some minimality conditions [15] on the data-generating system (1), if the input signal {u_t} is generated at random, then ν_n(X) = n with probability one.
Equipped with this notation and the definition of the genericity index ν_n(X), we can now characterize uniqueness of the minimizer of J(A) based on the following lemma.

Lemma 4.
Consider a dataset ̟_N of the form (4) and the notation I_i(A) introduced at the beginning of Section III. Assume that there exists a matrix Ã = [ã_1 ⋯ ã_s] ∈ R^{n×s} with distinct columns ã_i such that

min_{i∈S} |I_i(Ã)| ≥ s ν_n(X)   (21)

on the data ̟_N. Then the following holds:

∀A ∈ R^{n×s}, φ(A) = φ(Ã) ⇒ Set(A) = Set(Ã).   (22)

Proof:
Let A be such that φ(A) = φ(Ã). Then for all t ∈ T, y_t − x_t^⊤ a_{σ_A(t)} = y_t − x_t^⊤ ã_{σ_Ã(t)}, which is equivalent to x_t^⊤(ã_{σ_Ã(t)} − a_{σ_A(t)}) = 0 for all t ∈ T.
The next step of the proof is to show that for any i ∈ S there exists j⋆ ∈ S such that I_{ij⋆} ≜ I_i(Ã) ∩ I_{j⋆}(A) has a cardinality larger than or equal to ν_n(X). For this purpose we proceed by contradiction. Take an arbitrary i ∈ S and assume that |I_{ij}| < ν_n(X) for all j ∈ S. Noting that I_i(Ã) = I_i(Ã) ∩ T = I_i(Ã) ∩ (∪_{j=1}^s I_j(A)) = ∪_{j=1}^s I_{ij}, we obtain |I_i(Ã)| ≤ Σ_{j=1}^s |I_{ij}| < s ν_n(X). But this constitutes a contradiction to the assumption (21). In conclusion, for all i ∈ S, there exists a j⋆ such that |I_{ij⋆}| ≥ ν_n(X). Now we observe that for all t ∈ I_{ij⋆}, x_t^⊤(ã_i − a_{j⋆}) = 0 and so, X_{I_{ij⋆}}^⊤ (ã_i − a_{j⋆}) = 0. But since |I_{ij⋆}| ≥ ν_n(X), we have rank(X_{I_{ij⋆}}) = n, which implies that ã_i = a_{j⋆}. Since all columns of Ã are distinct (no repetition), we conclude that Ã and A have the same columns up to a permutation, which is equivalent to saying that Set(Ã) = Set(A).
It is interesting to note that in the absence of noise in (1), having the true parameter matrix A◦ obey (21) is a sufficient condition for exact recovery of that matrix from the data. What this means is that if v_t = 0 for all t and if all the subsystems have been sufficiently excited in the sense that condition (21) holds for Ã = A◦, then Ψ(̟_N) = {Set(A◦)}.
The following theorem recapitulates the discussion of this section.

Theorem 1.
Consider the dataset ̟_N in (4), generated by the switched system (1). Assume that:
• ̟_N is informative enough in the sense that ξ_r(̟_N) < 1/2 for some r ∈ {0, …, N}; let then r∗(̟_N) = max { r : ξ_r(̟_N) < 1/2 }.
• There exists a matrix Ã ∈ R^{n×s} satisfying the condition (21) and ∥φ(Ã)∥_0 ≤ r∗(̟_N).
Then the estimator Ψ defined in (6) satisfies

Ψ(̟_N) = { Set(Ã) }.   (23)

Proof:
To begin with, note that for r∗ defined as in the statement of the theorem, it holds that δ_{r∗}(Ã) = 0 (see Eq. (11) for the definition of δ_r). Now, since the conditions of Lemma 3 are satisfied, we can apply it to infer that if Â ∈ arg min_A J(A), then φ(Â) = φ(Ã), so that J(Ã) = min_A J(A). Conversely, it is immediate to see that any A′ ∈ R^{n×s} which satisfies φ(A′) = φ(Ã) lies necessarily in arg min_A J(A). Hence we can write

arg min_A J(A) = { A ∈ R^{n×s} : φ(A) = φ(Ã) }.

Applying Lemma 4, we can then write

arg min_A J(A) = { A ∈ R^{n×s} : Set(A) = Set(Ã) }

and so, from (6) we see that Ψ(̟_N) = {Set(Ã)}.
An interpretation of Theorem 1 is that if the data are sufficiently informative, then the set-valued estimator Ψ(̟_N) returns only a singleton. We would of course like this singleton to coincide with the true set of parameter vectors {a◦_i}_{i∈S}. For this to hold, it suffices that the true parameter matrix A◦ satisfies the second condition of the theorem. Note that such a condition is readily satisfied (with at least r∗ = 0) when there is no noise in the data (i.e., v_t = 0 in (1) for all t ∈ T), provided that each subsystem generates enough data. Moreover, by the second condition of the theorem, exact recovery of the true parameter matrix A◦ is still achievable by the estimator Ψ when {v_t} is a sparse noise sequence containing at most r∗ nonzero instances, regardless of the magnitude of these nonzero values. Hence, the larger r∗ (i.e., the richer the regression data ̟_N), the more outliers the least absolute deviation LSM estimator can handle. In contrast, the condition is unlikely to hold generally when dense noise is present in the data.
IV. ERROR BOUNDS IN THE PRESENCE OF NOISE
As mentioned above, we cannot hope for an exact recovery of the true parameter matrix A◦ by the estimator Ψ from data affected by a dense noise sequence {v_t}. We need instead to search for a possible bound on the estimation error as a function of the magnitude of the noise and the richness properties of the data. Indeed, (19) almost provides such a bound. The remaining question to be investigated is under which conditions we can lower-bound ∥φ(Â) − φ(A◦)∥_1 by a norm applying directly to Â − A◦.
A. A key step towards the obtention of an error bound
To begin with the analysis, we introduce some useful technical tools, the first of which is the class of K∞ functions (see, e.g., [6]). This class of functions will be used to measure the increasing rate of the estimation error.

Definition 3 (class-K∞ functions). A function α : R_+ → R_+ is said to be of class-K∞ if it is continuous, zero at zero, strictly increasing and satisfies lim_{s→+∞} α(s) = +∞.

Using this definition we can state a technical lemma which will play an important role in the analysis.

Lemma 5 ([7]). Let f : R^n → R_+ be a positive continuous function satisfying the following properties:
• Positive definiteness: f(x) = 0 if and only if x = 0.
• Relaxed homogeneity: there exists a K∞ function q such that f(λx) ≥ q(λ) f(x) for all λ > 0.
Then for any norm ∥·∥ on R^n, there exists a constant α > 0 such that f(x) ≥ α q(∥x∥).

Our goal now is to derive a bound on a certain measure of the parametric estimation error between the true parameter matrix A◦ and the estimated ones Â ∈ arg min_A J(A). Recalling that J(A) is invariant under column permutation of the matrix A, for this metric to be pertinent, it needs to be specified in terms of distance between the sets Set(A◦) and Set(Â). Hence we consider a metric d of the form d(A, A′) = ∥A − A′_π∥, where ∥·∥ is a norm on R^{n×s} and π : S → S is a permutation depending on the matrices A and A′. Here, the notation A′_π is used to refer to the matrix obtained by permuting the columns of A′ as prescribed by π, A′_π = [a′_{π(1)} ⋯ a′_{π(s)}]. The existence of a permutation π such that d(A, A′) is upper-bounded by ∥φ(A) − φ(A′)∥_1 will depend here on the partitions {I_i(A)}_{i∈S} and {I_i(A′)}_{i∈S} achieved by A and A′ respectively on the data set ̟_N.

Definition 4.
Consider the data set ̟_N in (4), generated by the s-mode switched system (1). We say that two matrices A ∈ R^{n×s} and A′ ∈ R^{n×s} are comparable over the data set ̟_N if there exists a permutation π : S → S such that |I_i(A) ∩ I_{π(i)}(A′)| ≥ ν_n(X) for all i ∈ S.

Note, from Lemma 4 above, that any matrix A ∈ R^{n×s} such that min_{i∈S} |I_i(A)| ≥ s ν_n(X) is comparable to any other matrix A′ with distinct columns satisfying φ(A) = φ(A′). In that case, it even holds that A = A′_π for some permutation π on S. We state hereafter a sufficient condition for comparability.

Lemma 6.
Consider a set ̟_N of input-output data generated by system (1) as defined in (4). Let A ∈ R^{n×s} be a matrix obeying (21). Then any matrix A′ ∈ R^{n×s} satisfying

|I_i(A)| + |I_j(A)| ≥ max_{ℓ∈S} [ |I_i(A) ∩ I_ℓ(A′)| + |I_j(A) ∩ I_ℓ(A′)| ] + 2(s − 1) ν_n(X)   ∀(i, j) ∈ S², i ≠ j,   (24)

is comparable to A over ̟_N in the sense of Definition 4.

Proof: See Appendix B.

To illustrate the condition (24), consider the simple case where |S| = s = 2. Then, under the assumption that A is subject to (21), A and A′ are comparable over ̟_N if N ≥ max(|I_1(A′)|, |I_2(A′)|) + 2 ν_n(X). Noting that max(|I_1(A′)|, |I_2(A′)|) = N/2 + (1/2) | |I_1(A′)| − |I_2(A′)| |, with the outer bars denoting the absolute value, (24) reduces to N ≥ 4 ν_n(X) + | |I_1(A′)| − |I_2(A′)| |. This relation identifies three factors which promote comparability: (i) the data X must be generic enough (i.e., ν_n(X) small); (ii) A′ must partition the data into sets of balanced cardinalities; (iii) the number N of data must be large enough.

Theorem 2.
Consider the dataset ̟_N in (4), generated by the switched system (1), and assume that ξ_r(̟_N) < 1/2 for some r ∈ {0, …, N}. Let (A, A′) ∈ R^{n×s} × R^{n×s} be a pair of comparable matrices with respect to ̟_N (as defined in Eq. (4)). Let π denote the associated permutation. Then for any norm ∥·∥ on R^{n×s}, there exists a strictly positive number D such that

∥A′_π − A∥ ≤ [D (1 − 2 ξ_r(̟_N))]^{-1} (J(A′) − J(A) + 2 δ_r(A)).   (25)

Proof:
We start by observing that all the conditions of Lemma 2 are satisfied. As a consequence, Eq. (15) holds. Departing from this equation, we just need to find an appropriate underestimate of ∥φ(A) − φ(A′)∥_1. To this end, note that

∥φ(A) − φ(A′)∥_1 = Σ_{t∈T} |x_t^⊤ (a_{σ_A(t)} − a′_{σ_{A′}(t)})| = Σ_{(i,j)∈S²} Σ_{t ∈ I_i(A) ∩ I_j(A′)} |x_t^⊤ (a_i − a′_j)| ≥ Σ_{i∈S} Σ_{t ∈ I_i(A) ∩ I_{π(i)}(A′)} |x_t^⊤ η_i|

where η_i = a_i − a′_{π(i)}, with π : S → S denoting the permutation defining the comparability of A and A′ (see Definition 4). Recall that |I_i(A) ∩ I_{π(i)}(A′)| ≥ ν_n(X), i = 1, …, s. Let g : R^{n×s} → R_+ be the function defined by

g(Λ) = inf_{{J_i}_{i∈S}, |J_i| ≥ ν_n(X)} Σ_{i∈S} ∥X_{J_i}^⊤ η_i∥_1   (26)

where the infimum is taken over all s-tuples (J_1, …, J_s) of disjoint subsets of T with cardinality larger than or equal to ν_n(X), and the η_i denote the columns of the argument Λ. Then, by letting Λ = A − A′_π, it follows from the inequality above that

∥φ(A) − φ(A′)∥_1 ≥ g(Λ).   (27)

Since the infimum in (26) operates here on a finite set, it is reached by a certain (J⋆_1, …, J⋆_s). As a consequence, g can be expressed by g(Λ) = Σ_{i∈S} ∥X_{J⋆_i}^⊤ η_i∥_1. The rest of the proof consists in showing that the function g satisfies the conditions of Lemma 5. Clearly, g is positive. If for some E = [e_1 ⋯ e_s] ∈ R^{n×s}, g(E) = 0, then X_{J⋆_i}^⊤ e_i = 0 for all i = 1, …, s. It follows, by the fact that |J⋆_i| ≥ ν_n(X), that e_i = 0. Hence E = 0 and, consequently, g is positive-definite. Moreover, g is continuous as a consequence of the ℓ_1 norm being continuous. Finally, g satisfies the relaxed homogeneity property with the K∞ function q defined by q(x) = x.
We can therefore apply Lemma 5 to conclude that $g(\Lambda) \geq D \|\Lambda\|$, with $D$ being the strictly positive number defined by
$$D = \inf_{\|\Lambda\| = 1} g(\Lambda). \qquad (28)$$
This concludes the proof.

The theorem establishes a bound on the metric $d(A, A')$ in case $A$ and $A'$ are comparable in the sense of Definition 4. For a given $r$, it is interesting to note that the bound displayed in (25) is all the smaller as the data are more generic (i.e., $\xi_r(\varpi_N)$ defined in (12) is small for a relatively large $r$). We also note that if $A$ and $A'$ are not comparable as required in the statement of the theorem, then $\|A - A'_\pi\|$ can grow arbitrarily for any permutation $\pi$ while $\|\phi(A) - \phi(A')\|_1$ remains small. To see this, take for example $s = 2$ and $A = [\tilde{a}_1 \; \tilde{a}_2]$, $A' = [\tilde{a}'_1 \; \beta \tilde{a}'_2]$, with the $\tilde{a}_i$ and $\tilde{a}'_i$ being unit $\ell_2$-norm vectors and $\beta \in \mathbb{R}$. Then, for a given dataset $\varpi_N$, one can choose $\beta$ sufficiently large that $\sigma_{A'}(t) = 1$ for all $t \in T$, i.e., $I_1(A') = T$. For such values of $\beta$, $A$ and $A'$ are not comparable in the sense of Definition 4. We can see, however, that $\|\phi(A) - \phi(A')\|_1$ is independent of $\beta$ while $\|A - A'_\pi\|$ increases arbitrarily as $\beta$ increases, for any permutation $\pi$ on $S = \{1, 2\}$. Remark 3.
Note that, in the scope of Theorem 2, it is in principle possible to restrict the defining supremum of $\xi_r(\varpi_N)$ in (12) to only the pairs $(A, A')$ of comparable matrices. The interest of such a slight reformulation is that it would produce a smaller value of $\xi_r(\varpi_N)$ and hence a potentially tighter bound in (25).

B. Estimation error bound for the switched system

An interesting situation is when $(A, A')$ in Theorem 2 is taken to be equal to $(A^\circ, \hat{A})$ with $\hat{A} \in \arg\min_A J(A)$. In this specific case, invoking the trick used to establish (19) yields the following statement. Corollary 1.
Consider the data $\varpi_N$ generated by system (1) and assume that $\xi_r(\varpi_N) < 1/2$ for some $r \geq 0$. Let $\hat{A} \in \arg\min_A J(A)$. If $\hat{A}$ and the true parameter matrix $A^\circ$ are comparable in the sense of Definition 4, with $\pi \colon S \to S$ denoting the associated comparability permutation, then for any norm $\|\cdot\|$ on $\mathbb{R}^{n \times s}$, there exists a number $D > 0$ such that
$$\|\hat{A}_\pi - A^\circ\| \leq \frac{2\,\delta_r(A^\circ)}{D\,\big(1 - \xi_r(\varpi_N)\big)}. \qquad (29)$$
Since $r$ can be any integer in $\{0, \ldots, N\}$ such that $\xi_r(\varpi_N) < 1/2$, we can, at least formally, optimize the error bound over all such $r$'s. Hence, whenever the comparability condition holds true, a better bound can, in principle, be obtained as
$$\big\| \hat{A}_\pi - A^\circ \big\| \leq \min_{r = 0, \ldots, N} \left\{ \frac{2\,\delta_r(A^\circ)}{D\,\big(1 - \xi_r(\varpi_N)\big)} \; : \; \xi_r(\varpi_N) < 1/2 \right\}. \qquad (30)$$
As already remarked, $\delta_r(A^\circ)$ measures how far $\phi(A^\circ)$ is from the set $\mathcal{S}_r$ of all $r$-sparse signals in $\mathbb{R}^N$. This is essentially a measure of the amount of noise $\{v_t\}$ in the system (1) which generates the data $\varpi_N$. More specifically, $\delta_r(A^\circ)$ equals the sum of the $N - r$ smallest elements in absolute value of the sequence $\{v_t^\circ\}_{t \in T}$ defined by
$$v_t^\circ = v_t + x_t^\top \big( a^\circ_{\sigma(t)} - a^\circ_{\sigma_{A^\circ}(t)} \big), \qquad (31)$$
with $\sigma$ denoting the true switching signal from (1). From the definition of $\sigma_{A^\circ} \in \Sigma$ (see Eq. (7)), it is not hard to see that $|v_t^\circ| \leq |v_t|$ for all $t \in T$ and so $\delta_r(A^\circ) \leq \|v\|_{1,r}$, with $\|v\|_{1,r}$ denoting the sum, in absolute value, of the $N - r$ smallest entries of $\{v_t\}_{t \in T}$. It follows that, under the conditions of Corollary 1, $\|\hat{A}_\pi - A^\circ\| \leq 2 \|v\|_{1,r} / \big( D (1 - \xi_r(\varpi_N)) \big)$. Hence, by considering the special case where $r$ is taken equal to $0$ (this is a reasonable choice, e.g., when there is no outlier in the data), we get $\|\hat{A}_\pi - A^\circ\| \leq (2/D) \|v\|_1$. Note that an underestimate $\hat{D}$ of the number $D$ can be numerically found as suggested in Appendix E.
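To make the computation suggested in Appendix E concrete, the following Python sketch enumerates all index sets of a given cardinality and takes the minimum of $\lambda_{\min}^{1/2}(X_I X_I^\top)$, as in the bound (36) of Lemma 8. The helper name `lsm_D_underestimate` is ours, and the regressor matrix `X` (columns $x_t$) and the integer $m = \nu_n(X)$ are assumed given; this is only an illustration of the combinatorial procedure, not part of the estimator itself.

```python
import itertools
import numpy as np

def lsm_D_underestimate(X, m):
    """Underestimate of D in (28) via the bound (36):
    minimize sqrt(lambda_min(X_I X_I^T)) over all index sets |I| = m."""
    n, N = X.shape
    best = np.inf
    for I in itertools.combinations(range(N), m):
        XI = X[:, list(I)]                      # n x m block of regressors
        lam = np.linalg.eigvalsh(XI @ XI.T)[0]  # smallest eigenvalue
        best = min(best, np.sqrt(max(lam, 0.0)))
    return best

# Toy illustration: two orthonormal regressors, m = n = 2
X = np.eye(2)
print(lsm_D_underestimate(X, 2))  # 1.0
```

The loop visits $\binom{N}{m}$ subsets, which reflects the combinatorial complexity acknowledged in Appendix E; for large $N$ one would only run this on modest datasets.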
Using $\hat{D}$ (instead of $D$) in the expression of the bound yields, however, a more pessimistic value of the bound. A question we ask now is under which condition we may have $v_t^\circ = v_t$ in (31). Such a condition is given in the following proposition. Proposition 1.
Consider the switched system (1) driven by the switching signal $\sigma$ and the noise $\{v_t\}$. Then a necessary and sufficient condition for $\sigma_{A^\circ} = \sigma$ (irrespective of the values of $\sigma$ and those of the noise) is
$$|v_t| < \frac{1}{2} \min_{\substack{(i,j) \in S^2 \\ i \neq j}} \big| x_t^\top (a^\circ_i - a^\circ_j) \big| \quad \forall t \in T. \qquad (32)$$
Proof:
See Appendix C.

The term on the right-hand side of (32) can be interpreted as a measure of how distinguishable the subsystems are from one another. Hence, what the proposition says is that if the noise level is below a certain threshold (which depends on the parametric distinguishability of the subsystems and on some genericity condition on the regressors), then the true switching signal coincides with $\sigma_{A^\circ}$. Finally, an interesting consequence of Proposition 1 is that, under condition (32), we obtain from (31) that $v_t^\circ = v_t$ for all $t \in T$, with the consequence that $\delta_r(A^\circ)$ reduces to $\|v\|_{1,r}$.

C. On the comparability of $\hat{A}$ and $A^\circ$

According to Corollary 1, a sufficient condition for the estimation error induced by the estimator $\Psi$ to be bounded as in (29) is the comparability of $\hat{A}$ and $A^\circ$ over $\varpi_N$ for all $\hat{A}$ such that $\operatorname{Set}(\hat{A}) \in \Psi(\varpi_N)$ (see Definition 4). Lemma 6 suggests that, to favor the comparability of $A^\circ$ and $\hat{A}$, the data $\varpi_N$ and the true parameter matrix $A^\circ$ should satisfy (21) and (24). Indeed, these conditions impose, though in a nontrivial way, some constraints on the distinguishability of the modes composing the switched system, the magnitude of the noise, the excitation signal $\{u_t\}$ and the switching signal $\sigma$. Intuitively, if the level of the noise $\{v_t\}$ is low and if the constituent subsystems are distinguishable enough, then the true parameter matrix $A^\circ$ and its estimate $\hat{A}$ should be comparable. We formalize this as follows. Lemma 7.
Assume that the input-output data $\varpi_N$ in (4), generated by the $s$-mode switched system (1), is such that $A^\circ$ obeys $\min_{i \in S} |I_i(A^\circ)| \geq s m$ with $m \geq \nu_n(X)$. Introduce the notation
$$\gamma_m = \inf_{\substack{\|\eta\|_2 = 1 \\ |I| \geq m}} \big\| X_I^\top \eta \big\|_1, \qquad (33)$$
where the infimum is taken over all subsets $I$ of $T$ with cardinality at least $m$ and over all $\eta \in \mathbb{R}^n$ with unit $\ell_2$ norm. If the subsystems of the switched system (1) are parametrically distinguishable enough in the sense that
$$\min_{i \neq j} \big\| a^\circ_i - a^\circ_j \big\|_2 > \frac{2\,\delta_r(A^\circ)}{\gamma_m \big(1 - \xi_r(\varpi_N)\big)} \qquad (34)$$
for some $r \in \{0, \ldots, N\}$ such that $\xi_r(\varpi_N) < 1/2$, then $A^\circ$ and $\hat{A}$ are comparable over $\varpi_N$ in the sense of Definition 4 for any $\hat{A} \in \arg\min_A J(A)$.

Proof: See Appendix D.

V. CONCLUSION
In this paper we have studied some properties of the least sum-of-minimums (LSM) absolute deviation estimator for switched system identification. Although this estimator is hard to implement numerically, it serves here as a reference estimator for analyzing the degree of richness the data must possess for the identification scheme to be successful. In particular, we have proposed a bound on the estimation error induced by this estimator. Interestingly, the expression of the proposed bound explicitly involves some informativity measures of the training data. The message of that expression, in essence, is that the richer the data, the smaller the estimation error. This opens a nice perspective for identification experiment design for switched systems. In effect, one can form an experiment design problem by searching for the input signal which optimizes the derived information-theoretic measures and thereby the error bound delivered by the estimator. To further pave the avenue towards optimal experiment design, an intermediary step would perhaps be to complement the current analysis with one of the LSM estimator when used with the classical quadratic loss. Another important direction of research is to devise efficient numerical routines for estimating the informativity indices derived in this paper.

APPENDIX
A. Proof of Lemma 1
For the sake of notational simplicity we use $T_r$ in place of $T_r(v)$. Let $T_r^c = T \setminus T_r$ be the complement of $T_r$ in $T$. Then
$$\|v' - v\|_1 = \|(v' - v)_{T_r}\|_1 + \|(v' - v)_{T_r^c}\|_1 \leq \|(v' - v)_{T_r}\|_1 + \|v'_{T_r^c}\|_1 + \|v_{T_r^c}\|_1 = \|(v' - v)_{T_r}\|_1 + \|v'_{T_r^c}\|_1 + \inf_{w \in \mathcal{S}_r} \|w - v\|_1.$$
The inequality is derived from the triangle inequality property of the $\ell_1$ norm. The last equality relies on the fact that $\inf_{w \in \mathcal{S}_r} \|w - v\|_1 = \|v_{T_r^c}\|_1$ (the sum of the $N - r$ smallest entries in absolute value of $v$). Considering the term $\|v'_{T_r^c}\|_1$, we can write
$$\|v'_{T_r^c}\|_1 = \|v'\|_1 - \|v'_{T_r}\|_1 = \|v_{T_r}\|_1 - \|v'_{T_r}\|_1 + \|v'\|_1 - \big( \|v\|_1 - \|v_{T_r^c}\|_1 \big) \leq \|(v' - v)_{T_r}\|_1 + \|v'\|_1 - \|v\|_1 + \inf_{w \in \mathcal{S}_r} \|w - v\|_1.$$
Here, the second equality follows by adding and subtracting $\|v_{T_r}\|_1$, while the last line is obtained by applying again the triangle inequality, which gives $\|v_{T_r}\|_1 - \|v'_{T_r}\|_1 \leq \|(v' - v)_{T_r}\|_1$. The result follows by combining the second inequality with the first one above.

B. Proof of Lemma 6
By reasoning as in the proof of Lemma 4, thanks to the fact that $A$ satisfies condition (21), we easily reach the conclusion that for all $i \in S$ there exists $i^* \in S$ such that $|I_i(A) \cap I_{i^*}(A')| \geq \nu_n(X)$. Let us define a map $\pi \colon S \to S$ by posing $\pi(i) = i^*$. We need to show that $\pi$ can be selected to be a permutation under condition (24) of the lemma. For this purpose, we proceed by contradiction. Recall that $\pi$ is a permutation here if and only if it is injective. And there is no injective map $\pi$ satisfying $|I_i(A) \cap I_{\pi(i)}(A')| \geq \nu_n(X)$ for all $i \in S$ if and only if there are a pair $(i, j)$, $i \neq j$, and an index $\ell \in S$ such that
$$|I_i(A) \cap I_\ell(A')| \geq \nu_n(X), \qquad |I_j(A) \cap I_\ell(A')| \geq \nu_n(X), \qquad (35a)$$
and, for all $k \neq \ell$,
$$|I_i(A) \cap I_k(A')| < \nu_n(X), \qquad |I_j(A) \cap I_k(A')| < \nu_n(X). \qquad (35b)$$
Assume for contradiction that (35) holds. Then, because $\{I_r(A')\}_{r \in S}$ forms a partition of $T$,
$$|I_i(A)| = \sum_{r=1}^{s} |I_i(A) \cap I_r(A')| < (s - 1)\,\nu_n(X) + |I_i(A) \cap I_\ell(A')|.$$
Similarly, we can write $|I_j(A)| < (s - 1)\,\nu_n(X) + |I_j(A) \cap I_\ell(A')|$. Hence
$$|I_i(A)| + |I_j(A)| < 2(s - 1)\,\nu_n(X) + |I_i(A) \cap I_\ell(A')| + |I_j(A) \cap I_\ell(A')|.$$
This is in contradiction with (24). We therefore conclude on the existence of an injective map (and hence of a permutation) $\pi \colon S \to S$.

C. Proof of Proposition 1
If (32) holds true, then for all $t \in T$ and all $i \in S$ with $i \neq \sigma(t)$,
$$\big| y_t - x_t^\top a^\circ_{\sigma(t)} \big| = |v_t| < \tfrac{1}{2} \big| x_t^\top (a^\circ_{\sigma(t)} - a^\circ_i) \big| \leq \tfrac{1}{2} \big| y_t - x_t^\top a^\circ_i \big| + \tfrac{1}{2} \big| y_t - x_t^\top a^\circ_{\sigma(t)} \big|,$$
where the last inequality is derived from the triangle inequality property of $|\cdot|$. It follows that $|y_t - x_t^\top a^\circ_{\sigma(t)}| < |y_t - x_t^\top a^\circ_i|$, which implies that $\sigma_{A^\circ}(t) = \sigma(t)$ for all $t$. Conversely, if $\sigma_{A^\circ} = \sigma$, then for all $(j, t) \in S \times T$ such that $j \neq \sigma(t)$, we get immediately that $|v_t| < |y_t - x_t^\top a^\circ_j| = |x_t^\top (a^\circ_{\sigma(t)} - a^\circ_j) + v_t|$. Taking the square and dividing by $|x_t^\top (a^\circ_{\sigma(t)} - a^\circ_j)|$ gives
$$\big| x_t^\top (a^\circ_{\sigma(t)} - a^\circ_j) \big| > -2\, v_t\, s_j(t),$$
with $s_j(t)$ denoting the sign of $x_t^\top (a^\circ_{\sigma(t)} - a^\circ_j)$. The last inequality holds for any possible values of $\sigma$ if and only if $|x_t^\top (a^\circ_i - a^\circ_j)| > 2 |v_t|$ for all $(i, j) \in S^2$ with $i \neq j$.

D. Proof of Lemma 7
To begin with, let us observe that, by relying on Lemma 5, it can be shown that the number $\gamma_m$ in (33) is well defined and satisfies $\gamma_m > 0$. By the same reasoning as in the proof of Lemma 4, we know that there exists a map $\pi \colon S \to S$ such that $|I_i(A^\circ) \cap I_{\pi(i)}(\hat{A})| \geq m \geq \nu_n(X)$. We just need to establish that such a $\pi$ is bijective under the conditions of the lemma, a property which is equivalent here to injectivity of $\pi$. We proceed by contradiction. Suppose that $\pi$ is not injective, that is, we can find $(i, j) \in S^2$ with $i \neq j$ such that $\pi(i) = \pi(j)$. Let $J_i = I_i(A^\circ) \cap I_{\pi(i)}(\hat{A})$. By applying Lemma 3, we can write
$$\sum_{i \in S} \big\| X_{J_i}^\top (a^\circ_i - \hat{a}_{\pi(i)}) \big\|_1 \leq \big\| \phi(A^\circ) - \phi(\hat{A}) \big\|_1 \leq d,$$
where we have posed $d = 2\,\delta_r(A^\circ) / \big(1 - \xi_r(\varpi_N)\big)$ for conciseness. On the other hand, it follows from the definition (33) of $\gamma_m$ that $\sum_{i \in S} \| X_{J_i}^\top (a^\circ_i - \hat{a}_{\pi(i)}) \|_1 \geq \gamma_m \sum_{i \in S} \| a^\circ_i - \hat{a}_{\pi(i)} \|_2$. As a consequence, we can write $\sum_{i \in S} \| a^\circ_i - \hat{a}_{\pi(i)} \|_2 \leq d / \gamma_m$. Hence, if $\pi(i) = \pi(j)$, then by virtue of the triangle inequality, $\| a^\circ_i - a^\circ_j \|_2 \leq \| a^\circ_i - \hat{a}_{\pi(i)} \|_2 + \| a^\circ_j - \hat{a}_{\pi(j)} \|_2 \leq d / \gamma_m$. This is in contradiction with the assumption (34). We therefore conclude that the claim of the lemma holds true.

E. On the estimation of the number $D$ in (28)

The following lemma provides a method for computing an underestimate of the parameter $D$ in (28), for a particular choice of the norm involved in its definition, though at the price of a combinatorial complexity. Lemma 8.
Assume that the norm used in the definition of the number $D$ in (28) is $\|\cdot\|_{2,\mathrm{col}}$ defined by $\|\Lambda\|_{2,\mathrm{col}} = \sum_{i=1}^{s} \|\eta_i\|_2$ for $\Lambda = [\eta_1 \cdots \eta_s]$. Let $m = \nu_n(X)$. Then
$$D \geq \gamma_m \geq \inf_{|I| = m} \lambda_{\min}^{1/2}\big(X_I X_I^\top\big), \qquad (36)$$
where $\gamma_m$ is the number defined in (33) and $\lambda_{\min}^{1/2}(\cdot)$ denotes the square root of the minimum eigenvalue. The infimum is taken over all subsets $I$ of $T$ with cardinality equal to $m$.

Proof: Recall from (26) and the proof of Theorem 2 the expression $g(\Lambda) = \sum_{i \in S} \| X_{J_i^\star}^\top \eta_i \|_1$ of the function $g$, where the $J_i^\star$ are subsets of $T$ satisfying $|J_i^\star| \geq m = \nu_n(X)$. Then, by substituting $\|\Lambda\|_{2,\mathrm{col}}$ for the norm $\|\Lambda\|$ in Eq. (28), we have
$$D = \inf_{\|\Lambda\|_{2,\mathrm{col}} = 1} g(\Lambda) = \inf_{\|\eta_1\|_2 + \cdots + \|\eta_s\|_2 = 1} \sum_{i \in S} \big\| X_{J_i^\star}^\top \eta_i \big\|_1 \geq \inf_{\|\eta_1\|_2 + \cdots + \|\eta_s\|_2 = 1} \sum_{i \in S} \gamma_m \|\eta_i\|_2 = \gamma_m.$$
The inequality follows as a consequence of the definition of $\gamma_m$, by which $\| X_{J_i^\star}^\top \eta_i \|_1 \geq \gamma_m \|\eta_i\|_2$ since $|J_i^\star| \geq m$. Now, to prove the last inequality in (36), it suffices to notice that $\| X_I^\top \eta \|_1 \geq \| X_I^\top \eta \|_2$. As a result,
$$\gamma_m = \inf_{\substack{\|\eta\|_2 = 1 \\ |I| \geq m}} \big\| X_I^\top \eta \big\|_1 \geq \inf_{\substack{\|\eta\|_2 = 1 \\ |I| \geq m}} \big\| X_I^\top \eta \big\|_2 = \inf_{|I| = m} \lambda_{\min}^{1/2}\big(X_I X_I^\top\big).$$
Given $I \subset T$, it is easy to obtain $\lambda_{\min}^{1/2}(X_I X_I^\top)$. Hence, to obtain an (under)estimate of $D$, we need to compute $\binom{N}{m}$ such values and take the minimum of them. Here, the notation $\binom{N}{m}$ refers to the binomial coefficient. If we let $\hat{D} = \inf_{|I| = m} \lambda_{\min}^{1/2}(X_I X_I^\top)$, then it follows from (29) that $\|\hat{A}_\pi - A^\circ\| \leq (2/\hat{D}) \|v\|_1$ in the particular case where $r$ is taken equal to $0$.

REFERENCES

[1] L. Bako. Identification of switched linear systems via sparse optimization.
Automatica, 47:668–677, 2011.
[2] L. Bako. On a class of optimization-based robust estimators. IEEE Transactions on Automatic Control, 62:5990–5997, 2017.
[3] I. Daubechies, R. DeVore, M. Fornasier, and C. S. Güntürk. Iteratively reweighted least squares minimization for sparse recovery. Communications on Pure and Applied Mathematics, 63:1–38, 2010.
[4] A. Garulli, S. Paoletti, and A. Vicino. A survey on switched and piecewise affine system identification. In IFAC Symposium on System Identification, Brussels, Belgium, 2012.
[5] A. Goudjil, M. Pouliquen, E. Pigeon, and O. Gehan. A real-time identification algorithm for switched linear systems with bounded noise. In European Control Conference, Aalborg, Denmark, 2016.
[6] C. M. Kellett. A compendium of comparison function results. Mathematics of Control, Signals, and Systems, 26:339–374, 2014.
[7] A. Kircher, L. Bako, E. Blanco, and M. Benallouch. An optimization framework for resilient batch estimation in cyber-physical systems. Technical report, Ecole Centrale de Lyon (arxiv.org/abs/1906.01714), 2019.
[8] F. Lauer. Global optimization for low-dimensional switching linear regression and bounded-error estimation. Automatica, 89:73–82, 2018.
[9] F. Lauer and G. Bloch. Hybrid System Identification: Theory and Algorithms for Learning Switching Models. Springer International Publishing, 2019.
[10] D. Liberzon. Switching in Systems and Control. Birkhäuser Boston Inc., 2003.
[11] J. Lunze and F. Lamnabhi-Lagarrigue (Eds). Handbook of Hybrid Systems Control: Theory, Tools, Applications. Cambridge University Press, 2009.
[12] N. Ozay, M. Sznaier, C. Lagoa, and O. Camps. A sparsification approach to set membership identification of switched affine systems. IEEE Transactions on Automatic Control, 57:634–648, 2012.
[13] S. Paoletti, A. Juloski, G. Ferrari-Trecate, and R. Vidal. Identification of hybrid systems: A tutorial. European Journal of Control, 13:242–260, 2007.
[14] M. Petreczky and L. Bako. On the notion of persistence of excitation for linear switched systems. In IEEE Conference on Decision and Control and European Control Conference, Orlando, FL, USA, 2011.
[15] M. Petreczky, L. Bako, S. Lecoeuche, and K. Motchon. Minimality and identifiability of discrete-time SARX systems. To appear in International Journal of Robust and Nonlinear Control, 2020.
[16] G. Pillonetto. A new kernel-based approach to hybrid system identification. Automatica, 70:21–31, 2016.
[17] Z. Sun. Switched Linear Systems: Control and Design. Springer-Verlag London, 2005.
[18] R. Vidal. Recursive identification of switched ARX systems. Automatica, 44:2274–2287, 2008.
[19] R. Vidal, S. Soatto, Y. Ma, and S. Sastry. An algebraic geometric approach to the identification of a class of linear hybrid systems. In