Identification of nonlinear controllers from data: theory and computation
L. Fagiano and C. Novara

This manuscript contains technical details and proofs of recent results developed by the authors, pertaining to the design of nonlinear controllers from the experimental data measured on an existing feedback control system.
The setting we consider in this work is the following. A single-input, discrete-time, nonlinear dynamical system of interest operates in closed loop with an existing controller. Neither the system nor the controller is known. The system's input variable $u(t)$, i.e. the controller's output, is known and can be measured at discrete time instants $t \in \mathbb{Z}$. Moreover, $u$ is limited in a compact set $U = [\underline{u}, \overline{u}]$. The system's output variable $y(t)$, i.e. the controller's input, is not known a priori, but the control designer can rely on sensors to acquire measurements of different "candidate" feedback variables, based on her/his intuition and experience with the physical process under study. The output $y$ is assumed to belong to a compact set $Y \subset \mathbb{R}^{n_y}$. After a choice of $y(t)$ has been made, we assume that the controller is a static function of this variable:
$$u(t) = \kappa(y(t)),\qquad \kappa : Y \to U. \tag{1}$$
Moreover, we assume that a disturbance variable $e_s(t)$ is acting on the dynamical system. The variable $e_s$ accounts for (a) exogenous disturbances, (b) neglected and time-varying dynamics, and (c) the approximation error induced by choosing the input of the controller to be equal to $y$. The value of $e_s(t)$ is also assumed to belong to a compact set $E_s \subset \mathbb{R}^{n_e}$. We then assume that the chosen output variable evolves in time as follows:
$$y(t+1) = f(y(t), u(t), e_s(t)),\qquad f : Y \times U \times E_s \to Y. \tag{2}$$
Let us now introduce three assumptions on the functions $f$ and $\kappa$. In the following, we will make use of the function classes $\mathcal{K}$ and $\mathcal{KL}$: to this end, we recall that $\mathcal{K}$ is the set of all strictly increasing functions $\alpha : \mathbb{R}^+ \to \mathbb{R}^+$ such that $\alpha(0) = 0$, while $\mathcal{KL}$ is the set of all functions $\beta : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}^+$ such that, for fixed $t$, $\beta(\cdot, t) \in \mathcal{K}$ and, for fixed $x$, $\lim_{t \to \infty} \beta(x, t) = 0$.

Assumption 1
The function $f$ is Lipschitz continuous over the compact set $Y \times U \times E_s$. In particular, it holds that
$$\exists \gamma_f \in (0, +\infty) : \forall e_s \in E_s,\ \forall y \in Y,\ \forall u_1, u_2 \in U,\quad \|f(y, u_1, e_s) - f(y, u_2, e_s)\|_\infty \le \gamma_f |u_1 - u_2|. \tag{3}$$ □

Assumption 2
The function $\kappa$ is Lipschitz continuous over the compact set $Y$. □

(∗ This research has received funding from the California Energy Commission under the EISG grant n. 56983A/10-15 "Autonomous flexible wings for high-altitude wind energy generation", and from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n. PIOF-GA-2009-252284, Marie Curie project "Innovative Control, Identification and Estimation Methodologies for Sustainable Energy Technologies". † L. Fagiano is with the Automatic Control Laboratory, Swiss Federal Institute of Technology, Zurich, Switzerland. E-mail: [email protected]. ‡ C. Novara is with the Dipartimento di Automatica e Informatica, Politecnico di Torino, Torino, Italy. E-mail: [email protected].)

Under Assumptions 1-2, the closed-loop system
$$y(t+1) = g(y(t), e_s(t)) \doteq f(y(t), \kappa(y(t)), e_s(t)),\qquad g : Y \times E_s \to Y, \tag{4}$$
is also described by a Lipschitz continuous function $g$. In particular, by construction, the function $g$ enjoys the following properties:
$$\exists \gamma_{g,y} \in (0, +\infty) : \forall e_s \in E_s,\ \forall y_1, y_2 \in Y,\quad \|g(y_1, e_s) - g(y_2, e_s)\|_\infty \le \gamma_{g,y}\|y_1 - y_2\|_\infty, \tag{5}$$
$$\exists \gamma_{g,e} \in (0, +\infty) : \forall y \in Y,\ \forall e_s^1, e_s^2 \in E_s,\quad \|g(y, e_s^1) - g(y, e_s^2)\|_\infty \le \gamma_{g,e}\|e_s^1 - e_s^2\|_\infty. \tag{6}$$
Assumptions 1-2 are quite standard in nonlinear control analysis and design, and they are reasonable: in practice, the inputs, disturbances and outputs of the process under study are often bounded in compact sets, and the functions describing the system and the controller are assumed to be differentiable on such compact sets, hence Lipschitz continuous.

The dynamical system described by $g$ has $e_s$ as input and $y$ as output. We denote with $\bar g \doteq g(0, 0)$ the value of $g$ evaluated at $y = 0$, $e_s = 0$. The properties of the closed-loop system clearly depend on the controller $\kappa$, which is assumed to be stabilizing. In particular, we consider the following notion of stability:

Definition 1
A nonlinear system with input $e_s$ and output $y$ is finite-gain $\ell_\infty$ stable if a function $\alpha \in \mathcal{K}$, a function $\beta \in \mathcal{KL}$ and a scalar $\delta > 0$ exist such that
$$\forall t \ge 0,\quad \|y(t)\|_\infty \le \alpha(\|e_s\|_\infty) + \beta(\|y(0)\|_\infty, t) + \delta. \tag{7}$$ □

In Definition 1, a generic signal $v \doteq \{v(0), v(1), \ldots\}$ is given by the infinite sequence of values of the variable $v(t)$, $t \ge 0$, and $\|v\|_\infty \doteq \max_{t \ge 0} \|v(t)\|_\infty$ is the $\ell_\infty$-norm of the signal $v$, with the underlying norm taken to be the vector $\infty$-norm $\|v(t)\|_\infty$. The stabilizing properties of $\kappa$ are formalized by the following assumption:

Assumption 3
The functions $\kappa$ and $f$ are such that property (5) holds with $\gamma_{g,y} < 1$. □

Assumption 3 implies that the closed-loop system (4) enjoys finite-gain $\ell_\infty$ stability as given in Definition 1; in particular, we have
$$\forall t \ge 0,\quad \|y(t)\|_\infty \le \underbrace{\frac{\gamma_{g,e}}{1-\gamma_{g,y}}\|e_s\|_\infty}_{\alpha(\|e_s\|_\infty)} + \underbrace{\gamma_{g,y}^t\|y(0)\|_\infty}_{\beta(\|y(0)\|_\infty, t)} + \underbrace{\frac{1}{1-\gamma_{g,y}}\|\bar g\|_\infty}_{\delta}, \tag{8}$$
see the Appendix for a derivation of this inequality.

Overall, Assumptions 1-3 are quite common in the context of system identification, function approximation and learning, since a stable system is needed to collect data and carry out identification experiments. In particular, in this work we will consider a finite number $N$ of input and output measurements, indicated as $\tilde u(k)$, $\tilde y(k)$, $k = 0, \ldots, N-1$, collected from the system operating in closed loop with the unknown controller $\kappa$. These data points are assumed to be affected by additive noise variables, indicated as $e_u(t)$ and $e_y(t)$, respectively:
$$\tilde u(t) = u(t) + e_u(t),\qquad \tilde y(t) = y(t) + e_y(t). \tag{9}$$
Note that $e_u(t)$ may include both measurement noise and errors arising in the application of the control law. The latter can be present, for example, if the aim is to learn a controller from the behavior of a human operator, who might be subject to fatigue and mistakes.

The noise variables are assumed to satisfy the following boundedness properties, where, for a generic variable $q \in \mathbb{R}^{n_q}$ and scalar $\rho \in (0, +\infty)$, we denote the $n_q$-dimensional $\infty$-norm ball of radius $\rho$ as $B_\rho \doteq \{q \in \mathbb{R}^{n_q} : \|q\|_\infty \le \rho\}$:

Assumption 4
The following boundedness properties hold: (a) $e_u(t) \in B_{\varepsilon_u}$, $\forall t \ge 0$; (b) $e_y(t) \in B_{\varepsilon_y}$, $\forall t \ge 0$. □

According to (1), with straightforward manipulations, the measured data can be described by the following set of equations:
$$\tilde u(k) = \kappa(\tilde y(k)) + d(k),\quad k = 0, \ldots, N-1,$$
where $d(k)$ accounts for the noises $e_u(t)$ and $e_y(t)$ in (9). Since $e_u(t)$ and $e_y(t)$ are bounded and $\kappa$ is Lipschitz continuous, it follows that $d(k)$ is also bounded:
$$d(k) \in B_\varepsilon,\quad \forall k \ge 0. \tag{10}$$
The following assumption on the pairs $(\tilde y(k), d(k))$ is considered.

Assumption 5
The set of points $D^N_{yd} \doteq \{(\tilde y(k), d(k))\}_{k=0}^{N-1}$ becomes dense on $Y \times B_\varepsilon$ as $N \to \infty$. That is, for any $(y, d) \in Y \times B_\varepsilon$ and any $\lambda \in \mathbb{R}^+$, a finite value $N_\lambda \in \mathbb{N}$ and a pair $(\tilde y(k), d(k)) \in D^{N_\lambda}_{yd}$ exist such that
$$\|(y, d) - (\tilde y(k), d(k))\|_\infty \le \lambda. \qquad\square$$

Assumption 5 essentially ensures that the controller domain $Y$ is "well explored" by the data $\tilde y(k)$ and, at the same time, that the noise $d(k)$ covers its domain $B_\varepsilon$, hitting the bounds $-\varepsilon$ and $\varepsilon$ with arbitrary closeness after a sufficiently long time. This latter noise property is called tightness; see [4] and, for a probabilistic version, [1]. In the described setting, the problem we want to address can be stated as follows:

Problem
1: Learn a controller $\hat\kappa$ from $N$ measurements $\tilde y$ and $\tilde u$, obtained from the system operating in closed loop with an unknown controller $\kappa$, such that:
1. asymptotically, i.e. as $N \to \infty$, $\hat\kappa$ renders the closed-loop system finite-gain $\ell_\infty$ stable;
2. the trajectory deviation induced by the use of $\hat\kappa$ instead of $\kappa$ is "small";
3. $\hat\kappa$ has "low" complexity, so as to be easily implementable on real-time processors. □

In this section, we present an approach that is able to solve Problem 1. In order to do so, we first derive a sufficient condition for a generic controller $\hat\kappa \approx \kappa$ to stabilize the closed-loop system, and then we propose a technique, based on convex optimization, that is able to learn a controller $\hat\kappa$ which asymptotically enjoys the derived stability condition.

Our first aim is to derive a sufficient condition on the controller $\hat\kappa$ such that the obtained closed-loop system is finite-gain $\ell_\infty$ stable. The controller $\hat\kappa$ is chosen to be a Lipschitz continuous function over the compact set $Y$, with constant $\gamma_{\hat\kappa}$:
$$\exists \gamma_{\hat\kappa} \in (0, +\infty) : \forall y_1, y_2 \in Y,\quad |\hat\kappa(y_1) - \hat\kappa(y_2)| \le \gamma_{\hat\kappa}\|y_1 - y_2\|_\infty. \tag{11}$$
Let us define the error function $\Delta : Y \to \mathbb{R}$:
$$\Delta(y) \doteq \kappa(y) - \hat\kappa(y). \tag{12}$$
We denote with $\bar\Delta \doteq \Delta(0)$ the error function evaluated at $y = 0$. By construction, the error function is Lipschitz continuous with some constant $\gamma_\Delta$:
$$\exists \gamma_\Delta \in (0, +\infty) : \forall y_1, y_2 \in Y,\quad |\Delta(y_1) - \Delta(y_2)| \le \gamma_\Delta\|y_1 - y_2\|_\infty. \tag{13}$$
We indicate with $\hat g$ the closed-loop system obtained by using the controller $\hat\kappa$. In particular, $\hat g$ is defined as follows:
$$y(t+1) = \hat g(y(t), e_s(t), e_y(t)) \doteq f(y(t), \hat\kappa(y(t) + e_y(t)), e_s(t)),\qquad \hat g : Y \times E_s \times B_{\varepsilon_y} \to Y. \tag{14}$$
Note that the feedback variable used by the learned controller $\hat\kappa$ is the noise-corrupted measurement of the output $y$. The next result provides a sufficient condition for the controller $\hat\kappa$ to stabilize the closed-loop system.

Theorem 1  Let Assumptions 1-3 and 4-(b) hold. If
$$\gamma_\Delta < \frac{1-\gamma_{g,y}}{\gamma_f}, \tag{15}$$
then the closed-loop system $\hat g$ is finite-gain $\ell_\infty$ stable. More precisely, it holds that
$$\forall t \ge 0,\quad \|y(t)\|_\infty \le \underbrace{\frac{\gamma_{g,e}}{1-\gamma}\|e_s\|_\infty}_{\alpha(\|e_s\|_\infty)} + \underbrace{\gamma^t\|y(0)\|_\infty}_{\beta(\|y(0)\|_\infty, t)} + \underbrace{\frac{1}{1-\gamma}\left(\|\bar g\|_\infty + \gamma_f|\bar\Delta| + \gamma_f\gamma_{\hat\kappa}\varepsilon_y\right)}_{\delta}, \tag{16}$$
with $\gamma \doteq \gamma_\Delta\gamma_f + \gamma_{g,y} < 1$.

Proof. See the Appendix. □
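As a numerical sanity check of Theorem 1, the following sketch simulates a hypothetical scalar plant in closed loop with a perturbed controller, for which all the constants entering (15)-(16) are available in closed form. The plant, both controllers and all numerical values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical plant f(y,u,e) = 0.3*y + 0.5*u + 0.3*e (so gamma_f = 0.5),
# true controller kappa(y) = -0.2*y and learned controller
# kappa_hat(y) = -0.2*y + 0.1*sin(y). Then Delta(y) = -0.1*sin(y), so
# gamma_Delta = 0.1, Delta_bar = 0, gamma_gy = |0.3 - 0.5*0.2| = 0.2,
# gamma_ge = 0.3, g_bar = 0, gamma_kappa_hat <= 0.3.
g_f, g_gy, g_ge, g_D, g_kh = 0.5, 0.2, 0.3, 0.1, 0.3
gamma = g_D * g_f + g_gy                     # = 0.25 < 1
assert g_D < (1 - g_gy) / g_f                # condition (15) holds

f = lambda y, u, e: 0.3 * y + 0.5 * u + 0.3 * e
k_hat = lambda y: -0.2 * y + 0.1 * np.sin(y)

rng = np.random.default_rng(0)
T, eps_y = 300, 0.05
e_s = rng.uniform(-1, 1, T)                  # bounded disturbance
e_y = rng.uniform(-eps_y, eps_y, T)          # bounded output noise

y = np.empty(T + 1)
y[0] = 2.0
for t in range(T):                           # closed loop (14) with kappa_hat
    y[t + 1] = f(y[t], k_hat(y[t] + e_y[t]), e_s[t])

# bound (16): alpha(||e_s||) + beta(|y(0)|, t) + delta  (g_bar = Delta_bar = 0)
e_norm = np.max(np.abs(e_s))
bound = (g_ge / (1 - gamma) * e_norm
         + gamma ** np.arange(T + 1) * abs(y[0])
         + g_f * g_kh * eps_y / (1 - gamma))
assert np.all(np.abs(y) <= bound + 1e-12)
```

For this linear example, the per-step contraction used in the proof holds with the constants above, so the trajectory never exceeds the bound (16).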
It is worth commenting on the result of Theorem 1. Roughly speaking, the quantity $\gamma_\Delta$ gives an indication of the regularity of the error function $\Delta = \kappa - \hat\kappa$. Assuming, for example, that $\Delta$ is differentiable, a low $\gamma_\Delta$ means that the quantity $\|d\Delta/dy\|$ is bounded by a small value, i.e. the variability of the error over the set $Y$ is low. This happens, e.g., when the functions $\kappa$ and $\hat\kappa$ differ by some offset but have similar shapes. A large value of $\gamma_\Delta$, on the other hand, indicates that the control error can have high variability, as can happen, e.g., when the controller $\hat\kappa$ is over-fitting the available measured data. Theorem 1 states that the quantity $\gamma_\Delta$ should be sufficiently small in order to guarantee closed-loop stability, and how small depends on the features of the plant to be controlled and of the unknown controller $\kappa$. In particular, the more "sensitive" the plant is to input perturbations, i.e. the larger the Lipschitz constant $\gamma_f$, and the worse the stabilizing properties of the controller $\kappa$, i.e. the closer the Lipschitz constant $\gamma_{g,y}$ is to 1, the smaller $\gamma_\Delta$ has to be in order to meet the sufficient condition. In other words, the Theorem indicates that the quality of the learned controller $\hat\kappa$, in terms of low variability of the control error, should be higher if the uncontrolled system is more sensitive to input perturbations and if the closed-loop system obtained with $\kappa$ is closer to being unstable.

The value of $\gamma_\Delta$ also influences the decay rate of the term related to the initial condition $\|y(0)\|_\infty$, compare eq. (16), as well as the gain in the additive term $\delta$. As to the latter, a comparison with the analogous term in (8) reveals the effects of the absolute value of the control error and of the presence of output noise. The former is represented by the quantity $|\bar\Delta|$, i.e. the magnitude of the control error evaluated at $y = 0$.
Note that the choice of $y = 0$ to evaluate this term is not restrictive, since a simple coordinate change can be used to refer all the results to a different output value. According to the result, the smaller the value of $|\bar\Delta|$, the closer the term $\delta$ is to the one obtained with the unknown controller $\kappa$. This aspect, coupled with the condition (15) on the value of $\gamma_\Delta$, basically states that, in order to better replicate the behavior obtained with the controller $\kappa$, the control error function has to be small in absolute value and have low variability, as intuition would suggest. As to the noise term, it contributes to $\delta$ in (16) proportionally to its maximum norm $\varepsilon_y$, and the gain depends on how sensitive the controller $\hat\kappa$ is to perturbations of its input argument, as indicated by the value of $\gamma_{\hat\kappa}$. Finally, note that the effects of $|\bar\Delta|$ and $\varepsilon_y$ are proportional to $\gamma_f$, i.e. to how sensitive the uncontrolled plant is to input perturbations, and inversely proportional to $1-\gamma$, i.e. to how close the closed-loop system is to being unstable in the sense of Definition 1.

We next extend the stability analysis to the deviation between the output trajectory obtained by using the controller $\kappa$ and the one obtained by using $\hat\kappa$. In order to do so, we rename as $\hat y(t)$ the output trajectory of the system $\hat g$ (14), and we define the deviation $\xi(t) \doteq \hat y(t) - y(t)$, where $y$ is the output trajectory of the system $g$ defined in (4). Then, let us consider the following dynamical system:
$$\begin{aligned}\xi(t+1) = g_\Delta(\xi(t), y(t), e_s(t), e_y(t)) &\doteq f(y(t)+\xi(t), \hat\kappa(y(t)+\xi(t)+e_y(t)), e_s(t)) - f(y(t), \kappa(y(t)), e_s(t))\\ &= f(\hat y(t), \hat\kappa(\hat y(t)+e_y(t)), e_s(t)) - f(y(t), \kappa(y(t)), e_s(t)),\end{aligned}\qquad g_\Delta : \Xi \times Y \times E_s \times B_{\varepsilon_y} \to Y, \tag{17}$$
where $\Xi \subset \mathbb{R}^{n_y}$ is a compact set containing the values of $\xi$, which is guaranteed to exist if the assumptions of Theorem 1 hold, thanks to the combination of (8) and (16).
Corollary 1  Let Assumptions 1-3 and 4-(b) hold. If (15) holds, then the system $g_\Delta$ is finite-gain $\ell_\infty$ stable. More precisely, it holds that
$$\forall t \ge 0,\quad \|\xi(t)\|_\infty \le \underbrace{\frac{\gamma_{g,e}}{1-\gamma}\|e_s\|_\infty}_{\alpha(\|e_s\|_\infty)} + \underbrace{\gamma^t\|\xi(0)\|_\infty}_{\beta(\|\xi(0)\|_\infty, t)} + \underbrace{\frac{\gamma_f\gamma_\Delta}{1-\gamma}\gamma_{g,y}^t\|y(0)\|_\infty}_{\beta_y(\|y(0)\|_\infty, t)} + \underbrace{\frac{1}{1-\gamma}\left(\|\bar g\|_\infty + \gamma_f|\bar\Delta| + \gamma_f\gamma_{\hat\kappa}\varepsilon_y\right)}_{\delta}. \tag{18}$$

Proof. See the Appendix. □
A comparison between the results (16) and (18) shows that the trajectory deviation $\xi$ enjoys a closed-loop behavior, in the sense of Definition 1, similar to that of the output of the closed-loop system obtained with the learned controller $\hat\kappa$, as far as the effects of $e_s(t)$, $e_y(t)$ and $\Delta$ are concerned. The main difference with respect to (16) is the presence of a second exponentially decaying term, given by the function $\beta_y \in \mathcal{KL}$, which depends on the initial condition $\|y(0)\|_\infty$. This term can be interpreted as the relative effect of the magnitude of the initial output on the magnitude of the trajectory deviation. The practical meaning of the dependence of $\beta_y$ on the Lipschitz constants $\gamma_f$, $\gamma_\Delta$, $\gamma_{g,y}$ and $\gamma$ can be deduced along the same lines as the comments on Theorem 1.

The results presented so far serve as a theoretical justification of the learning algorithm that we present in the next section, which indeed is able to satisfy condition (15) in the limit, hence providing a solution to Problem 1.

A parametric representation is considered for the controller $\hat\kappa$:
$$\hat\kappa(y) = \sum_{i=1}^{M} \hat a_i \varphi_i(y), \tag{19}$$
where $\varphi_i : Y \to U$ are Lipschitz continuous basis functions. The coefficients $\hat a_i \in \mathbb{R}$ are identified by means of the following Algorithm 1.

Algorithm 1
Controller learning.
1. Take a set of basis functions $\{\varphi_i\}_{i=1}^{M}$. The choice of this set can be carried out by means of Procedure 1 below.
2. Using the data set $D^N \doteq \{\tilde u(k), \tilde y(k)\}_{k=0}^{N-1}$ and the basis functions chosen at step 1), define the following quantities:
$$\Phi \doteq \begin{bmatrix} \varphi_1(\tilde y(0)) & \cdots & \varphi_M(\tilde y(0)) \\ \vdots & \ddots & \vdots \\ \varphi_1(\tilde y(N-1)) & \cdots & \varphi_M(\tilde y(N-1)) \end{bmatrix} \in \mathbb{R}^{N \times M},\qquad \tilde u \doteq (\tilde u(0), \ldots, \tilde u(N-1))^T \in \mathbb{R}^{N \times 1}.$$
3. Using Algorithms 2 and 3 below, obtain an estimate $\hat\varepsilon$ of the noise bound $\varepsilon$ in (10), and estimates $\hat\gamma_f$ and $\hat\gamma_{g,y}$ of the Lipschitz constants $\gamma_f$ and $\gamma_{g,y}$ in (3) and (5). Choose $\gamma'_\Delta \simeq (1-\hat\gamma_{g,y})/\hat\gamma_f$ such that $\gamma'_\Delta < (1-\hat\gamma_{g,y})/\hat\gamma_f$.
4. Solve the following convex optimization problem:
$$a^* = \arg\min_{a \in \mathbb{R}^M} \|a\|_1 \ \text{ subject to }\ \begin{aligned} &(a)\ \|\tilde u - \Phi a\|_\infty \le \alpha\hat\varepsilon,\\ &(b)\ |\tilde u(l) - \tilde u(k) + (\Phi^r_k - \Phi^r_l)a| \le \gamma'_\Delta\|\tilde y(l) - \tilde y(k)\|_\infty + 2\hat\varepsilon,\quad \begin{cases} l = 0, \ldots, N-1,\\ k = l+1, \ldots, N-1,\end{cases}\end{aligned} \tag{20}$$
where $\Phi^r_k \doteq [\varphi_1(\tilde y(k)) \cdots \varphi_M(\tilde y(k))]$ and $\alpha \ge 1$ is a number slightly larger than the minimum value for which constraint (a) is feasible.
5. Obtain the coefficient vector $\hat a = (\hat a_1, \ldots, \hat a_M)$ from the following convex optimization problem:
$$(\hat a, \gamma^s_\Delta) = \arg\min_{a \in \mathbb{R}^M,\ \gamma''_\Delta \in \mathbb{R}^+} \gamma''_\Delta \ \text{ subject to }\ \begin{aligned} &(a)\ \|\tilde u - \Phi a\|_\infty \le \alpha\hat\varepsilon,\\ &(b)\ |\tilde u(l) - \tilde u(k) + (\Phi^r_k - \Phi^r_l)a| \le \gamma''_\Delta\|\tilde y(l) - \tilde y(k)\|_\infty + 2\hat\varepsilon,\quad \begin{cases} l = 0, \ldots, N-1,\\ k = l+1, \ldots, N-1,\end{cases}\\ &(c)\ a_i = 0,\ \forall i \notin \mathrm{supp}(a^*),\end{aligned} \tag{21}$$
where $\mathrm{supp}(a^*)$ is the support of $a^*$, i.e. the set of indices at which $a^*$ is not null. □

The $\ell_1$ norm of the coefficient vector $a$ is minimized in step 4), leading to a sparse coefficient vector $a^*$, i.e. a vector with a "small" number of non-zero elements. Constraint (a) in (20) ensures the consistency between the measured data and the prior information on the noise affecting these data (assuming that $\hat\varepsilon$ is a reliable estimate of $\varepsilon$ and $\alpha$ is close to 1). Constraints (b) allow us to guarantee closed-loop stability when a sufficiently large number of data is used, see Theorem 4 below. Step 5) aims at reducing the Lipschitz constant of the error function, while maintaining the same sparsity level obtained in step 4) and satisfying the constraints for closed-loop stability.
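Once slack variables are introduced, problems (20) and (21) are linear programs. As an illustration only (not the authors' implementation), step 4) can be sketched with `scipy.optimize.linprog`; the function name, the demo data and the linear basis below are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def l1_step4(Phi, u_t, y_t, eps_hat, gamma_d, alpha=1.2):
    """Sketch of step 4) of Algorithm 1 as an LP: minimize ||a||_1 subject
    to consistency constraint (a) and pairwise stability constraints (b).
    Slack variables s_i >= |a_i| make the objective linear; z = [a; s]."""
    N, M = Phi.shape
    c = np.concatenate([np.zeros(M), np.ones(M)])       # minimize sum(s)
    rows, rhs = [], []
    for k in range(N):                   # (a): |u~(k) - Phi_k a| <= alpha*eps
        rows.append(np.concatenate([-Phi[k], np.zeros(M)]))
        rhs.append(alpha * eps_hat - u_t[k])
        rows.append(np.concatenate([Phi[k], np.zeros(M)]))
        rhs.append(alpha * eps_hat + u_t[k])
    for l in range(N):                   # (b): pairwise stability constraints
        for k in range(l + 1, N):
            d = Phi[k] - Phi[l]
            lim = gamma_d * np.abs(y_t[l] - y_t[k]) + 2 * eps_hat
            rows.append(np.concatenate([d, np.zeros(M)]))
            rhs.append(lim - (u_t[l] - u_t[k]))
            rows.append(np.concatenate([-d, np.zeros(M)]))
            rhs.append(lim + (u_t[l] - u_t[k]))
    I = np.eye(M)                        # |a_i| <= s_i
    rows.extend(np.hstack([I, -I])); rhs.extend(np.zeros(M))
    rows.extend(np.hstack([-I, -I])); rhs.extend(np.zeros(M))
    res = linprog(c, A_ub=np.asarray(rows), b_ub=np.asarray(rhs),
                  bounds=[(None, None)] * M + [(0, None)] * M)
    return res.x[:M] if res.success else None

# Hypothetical demo: data from a linear "controller" u = 0.5*y, basis {y, 1}.
rng = np.random.default_rng(1)
y_t = rng.uniform(-1, 1, 15)
u_t = 0.5 * y_t + rng.uniform(-0.02, 0.02, 15)
Phi = np.column_stack([y_t, np.ones(15)])
a_star = l1_step4(Phi, u_t, y_t, eps_hat=0.02, gamma_d=0.6)
assert a_star is not None
assert np.max(np.abs(Phi @ a_star - u_t)) <= 1.2 * 0.02 + 1e-6
```

Step 5) is the same LP with the objective replaced by the scalar $\gamma''_\Delta$ (appearing linearly in constraints (b)) and the zero constraints (c) on the complement of the support of $a^*$.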
Indeed, the magnitude of this constant is linked to the maximal deviation from the trajectory achieved by the unknown controller $\kappa$, see Corollary 1; hence, step 5) of the algorithm accounts for requirement 2) of Problem 1. Sparsity of the coefficient vector is sought for two reasons. First, a sparse controller involves few basis functions, hence it has "low" complexity, which accounts for requirement 3) of Problem 1. Second, sparse functions have nice regularity properties and are thus able to provide good accuracy on new data, limiting well-known issues such as over-fitting and the curse of dimensionality. A sparse function is a linear combination of many basis functions where the vector of linear combination coefficients is sparse, i.e. it has only a few non-zero elements. The sparsity of a vector is typically measured by the $\ell_0$ quasi-norm, defined as the number of its non-zero elements. Sparse identification can thus be performed by looking for a coefficient vector with a "small" $\ell_0$ quasi-norm. However, the $\ell_0$ quasi-norm is a non-convex function and its minimization is in general an NP-hard problem. Two main approaches are commonly adopted to deal with this issue: convex relaxation and greedy algorithms [5], [3], [6], [2]. In convex relaxation, a suitable convex function, e.g. the $\ell_1$ norm, is minimized instead of the $\ell_0$ quasi-norm [3], [6], [2]. In greedy algorithms, the sparse solution is obtained iteratively [5]. Algorithm 1 is essentially an improved $\ell_1$ algorithm: in step 4), an optimization problem is solved where the $\ell_0$ quasi-norm is replaced by the $\ell_1$ norm and additional constraints for closed-loop stability are enforced (i.e. (b) in (20)). In step 5), a vector $\hat a$ is obtained, with the same support as $a^*$, which minimizes the estimated Lipschitz constant of the error function and satisfies the closed-loop stability condition evaluated on the available data.

If a small number of data is used for control design, it may happen that $(1-\hat\gamma_{g,y})/\hat\gamma_f \le 0$, thus not allowing a feasible choice of the Lipschitz constant $\gamma'_\Delta$ in step 3 of Algorithm 1. In this case, our indication is to collect a larger number of data in order to let the estimated Lipschitz constants $\hat\gamma_f$ and $\hat\gamma_{g,y}$ get closer to the true ones, which by assumption satisfy $(1-\gamma_{g,y})/\gamma_f > 0$.
If collecting more data is not possible, our indication is to choose $\gamma'_\Delta$ slightly larger than the minimum value for which the optimization problem (20) is feasible. Similar indications hold for the case where $(1-\hat\gamma_{g,y})/\hat\gamma_f > 0$ but the chosen $\gamma'_\Delta$ is too small, so that constraint (b) in (20) is not feasible.

All the parameters involved in Algorithm 1 (i.e. the noise bound $\hat\varepsilon$ and the Lipschitz constants $\hat\gamma_f$ and $\hat\gamma_{g,y}$) can be estimated in a systematic way by means of the following algorithms. Suppose that a set of data $\{\tilde w(k), \tilde z(k)\}_{k=0}^{N-1}$ is available, described by
$$\tilde z(k) = f(\tilde w(k)) + e(k),\quad k = 0, \ldots, N-1, \tag{22}$$
where $f : W \to \mathbb{R}$ is a generic unknown function, $W \subset \mathbb{R}^{n_w}$, and $e(k)$ is an unknown noise. Assume that $e(k) \in B_\varepsilon$, $\forall k$, and that $f$ is Lipschitz continuous with constant $\gamma_f$. The noise bound $\varepsilon$ and the Lipschitz constant $\gamma_f$ can be estimated as follows.

Algorithm 2
Noise bound estimation.
1. Choose a "small" $\rho > 0$. For example: $\rho = 0.01 \max_{k,l=0,\ldots,N-1} \|\tilde w(k) - \tilde w(l)\|_\infty$.
2. For $k = 0, \ldots, N-1$, compute
$$\delta\tilde z_k = \max_{i,j \in J_k} |\tilde z(i) - \tilde z(j)|,\qquad J_k \doteq \{l : \|\tilde w(k) - \tilde w(l)\|_\infty \le \rho\}.$$
If $J_k = \emptyset$, set $\delta\tilde z_k = \infty$. If $J_k = \emptyset$ for all $k = 0, \ldots, N-1$, go to step 1) and choose a larger $\rho$.
3. Obtain the estimate $\hat\varepsilon$ of the noise bound $\varepsilon$ as
$$\hat\varepsilon = \frac{1}{2\hat N}\sum_{k \in Q} \delta\tilde z_k,$$
where $Q \doteq \{k \in \{0, \ldots, N-1\} : \delta\tilde z_k < \infty\}$ and $\hat N \doteq \mathrm{card}(Q)$. □
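Algorithm 2 can be sketched directly in NumPy; the test function, the noise level and the handling of uninformative singleton neighbourhoods below are illustrative assumptions.

```python
import numpy as np

def noise_bound(w_t, z_t, rho=None):
    """Sketch of Algorithm 2 for scalar or vector w~(k). A point belongs to
    its own neighbourhood J_k, so neighbourhoods with a single element
    (which carry no oscillation information) are simply skipped."""
    w = np.asarray(w_t, float).reshape(len(w_t), -1)
    z = np.asarray(z_t, float)
    dists = np.max(np.abs(w[:, None, :] - w[None, :, :]), axis=-1)
    if rho is None:                        # step 1: rho = 1% of the data spread
        rho = 0.01 * dists.max()
    deltas = []
    for k in range(len(w)):                # step 2: local oscillation of z~
        J = np.flatnonzero(dists[k] <= rho)
        if len(J) > 1:
            deltas.append(z[J].max() - z[J].min())
    return 0.5 * float(np.mean(deltas))    # step 3: eps_hat = mean(delta)/2

# Hypothetical check: f(w) = sin(w) on [0, 3], uniform noise with eps = 0.1.
rng = np.random.default_rng(2)
w_t = rng.uniform(0.0, 3.0, 2000)
z_t = np.sin(w_t) + rng.uniform(-0.1, 0.1, 2000)
eps_hat = noise_bound(w_t, z_t)
assert 0.05 < eps_hat < 0.15               # recovers roughly the true bound
```

With dense enough data, the local oscillation $\delta\tilde z_k$ approaches $2\varepsilon$, consistently with Theorem 2 below.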
Lipschitz constant estimation.
1. For $k, l = 0, \ldots, N-1$ and $\tilde w(k) \ne \tilde w(l)$, compute
$$\tilde\gamma_{lk} = \begin{cases} \dfrac{\tilde z(k) - \tilde z(l) - 2\hat\varepsilon}{\|\tilde w(k) - \tilde w(l)\|_\infty}, & \text{if } \tilde z(k) > \tilde z(l) + 2\hat\varepsilon,\\[1ex] \dfrac{\tilde z(l) - \tilde z(k) - 2\hat\varepsilon}{\|\tilde w(k) - \tilde w(l)\|_\infty}, & \text{if } \tilde z(l) > \tilde z(k) + 2\hat\varepsilon,\\[1ex] 0, & \text{otherwise,}\end{cases}$$
where $\hat\varepsilon$ is the noise bound estimated by Algorithm 2.
2. Obtain the estimate $\hat\gamma$ of the Lipschitz constant $\gamma_f$ as
$$\hat\gamma = \max_{\substack{k,l = 0,\ldots,N-1\\ \tilde w(k) \ne \tilde w(l)}} \tilde\gamma_{lk}. \qquad\square$$

The two algorithms above allow the estimation of the noise bound and the Lipschitz constant of a generic function $f$. We now discuss how they can be applied to estimate the Lipschitz constants of the functions $f$ in (2) and $g$ in (4), which are required by the learning Algorithm 1. The Lipschitz constant $\gamma_f$ of the function $f$ (with respect to $u(k)$) can be estimated considering that
$$\tilde y(k+1) = f^*(\tilde u(k)) + v_f(k),\quad k = 0, \ldots, N-1,$$
where $f^*(\tilde u(k)) \doteq f(y^*, \tilde u(k), e_s^*)$ is an unknown function with Lipschitz constant $\gamma_f$, the quantities $y^*$ and $e_s^*$ are defined as
$$(y^*, e_s^*) = \arg\max_{(y,e) \in Y \times E_s} L_f(y,e),\qquad L_f(y,e) \doteq \max_{u_1, u_2 \in U} \frac{\|f(y, u_1, e) - f(y, u_2, e)\|_\infty}{|u_1 - u_2|},$$
and
$$v_f(k) \doteq f(y(k), u(k), e_s(k)) - f(y^*, \tilde u(k), e_s^*) + e_y(k+1) \tag{23}$$
is an unknown noise. Analogously, the Lipschitz constant $\gamma_{g,y}$ of the function $g$ (with respect to $y(k)$) can be estimated considering that
$$\tilde y(k+1) = g^*(\tilde y(k)) + v_g(k),\quad k = 0, \ldots, N-1,$$
where $g^*(\tilde y(k)) \doteq g(\tilde y(k), e_s^*)$ is an unknown function with Lipschitz constant $\gamma_{g,y}$, the quantity $e_s^*$ is defined as
$$e_s^* = \arg\max_{e \in E_s} L_g(e),\qquad L_g(e) \doteq \max_{y_1, y_2 \in Y} \frac{\|g(y_1, e) - g(y_2, e)\|_\infty}{\|y_1 - y_2\|_\infty},$$
and
$$v_g(k) \doteq g(y(k), e_s(k)) - g(\tilde y(k), e_s^*) + e_y(k+1) \tag{24}$$
is an unknown noise.
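Algorithm 3 above admits a compact vectorized sketch along the same lines; the demo function and noise bound are again illustrative assumptions.

```python
import numpy as np

def lipschitz_estimate(w_t, z_t, eps_hat):
    """Sketch of Algorithm 3: for every data pair whose output difference
    exceeds the noise band 2*eps_hat, a validated slope lower bound
    (|z~(k) - z~(l)| - 2*eps_hat) / ||w~(k) - w~(l)||_inf is computed;
    the estimate is the maximum over all such pairs."""
    w = np.asarray(w_t, float).reshape(len(w_t), -1)
    z = np.asarray(z_t, float)
    dz = np.abs(z[:, None] - z[None, :])
    dw = np.max(np.abs(w[:, None, :] - w[None, :, :]), axis=-1)
    mask = (dw > 0) & (dz > 2 * eps_hat)   # pairs carrying slope information
    if not mask.any():
        return 0.0
    return float(np.max((dz[mask] - 2 * eps_hat) / dw[mask]))

# Hypothetical check: f(w) = 2*w, so gamma_f = 2; known noise bound 0.05.
rng = np.random.default_rng(3)
w_t = rng.uniform(0.0, 1.0, 500)
z_t = 2 * w_t + rng.uniform(-0.05, 0.05, 500)
g_hat = lipschitz_estimate(w_t, z_t, eps_hat=0.05)
assert 1.7 <= g_hat <= 2.0 + 1e-9
```

Note that, by construction, the estimate never exceeds the true constant when the noise bound is exact, and it approaches the true constant as the data become dense, consistently with Theorem 3 below.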
Note that the noises $v_f(k)$ and $v_g(k)$ in (23) and (24) are bounded, due to the boundedness of $u(k)$, $e_s(k)$ and $y(k)$ and to the Lipschitz continuity of $f$ and $g$: $v_f(k) \in B_{\varepsilon_f}$, $v_g(k) \in B_{\varepsilon_g}$, $\forall k \ge 0$.

Another important step of Algorithm 1 is the choice of the basis functions $\varphi_i$ (as is well known, this aspect is crucial for any identification method relying on a basis function representation). An inappropriately chosen family of functions can force the identification algorithm to retain many terms, or can lead to large approximation errors. In these situations, several problems may arise, such as high controller complexity, closed-loop instability and/or large deviations from the ideal trajectory. The following procedure can be used to address this issue.

Procedure 1
Choice of basis function family.
1. Run Algorithm 1 using a given family of basis functions (e.g. Gaussian, sigmoidal, wavelet, polynomial, trigonometric).
2. Consider the following cases:
(a) If $\alpha$ is small (close to 1) and $\hat a$ is sparse, it may be concluded that the basis functions have been correctly chosen, since a small number of them is able to "explain" the measured data. Then, stop the procedure.
(b) If $\alpha$ is small and $\hat a$ is not sparse, it may be guessed that the basis function choice is not appropriate. Indeed, using a large number of basis functions may lead to over-fitting problems and possibly to closed-loop instability. Then, go back to step 1), choosing a different family of basis functions.
(c) If $\alpha$ is not small, it may be guessed that the basis function choice is not appropriate, since it leads to a large approximation error on the measured data. Then, go back to step 1), choosing a different family of basis functions. □

The quality of the derived approximation can also be assessed by testing it on data that were not used in the learning algorithm, as is commonly done in identification problems.
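For concreteness, a candidate Gaussian family and the corresponding regressor matrix $\Phi$ of step 2) of Algorithm 1 can be generated as follows; the centering of the bases at the measured data and the unit width are illustrative assumptions.

```python
import numpy as np

def gaussian_basis_matrix(y_t, centers, width=1.0):
    """Phi[k, i] = phi_i(y~(k)) = exp(-width * ||y~(k) - c_i||_2^2),
    for scalar or vector outputs y~(k)."""
    y = np.asarray(y_t, float).reshape(len(y_t), -1)
    c = np.asarray(centers, float).reshape(len(centers), -1)
    sq = ((y[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-width * sq)

y_t = np.linspace(-3, 3, 50)                 # measured outputs y~(k)
Phi = gaussian_basis_matrix(y_t, y_t[::5])   # M = 10 centers placed on the data
assert Phi.shape == (50, 10)
assert np.all((Phi > 0) & (Phi <= 1))        # Gaussian bases take values in (0, 1]
```

Swapping `gaussian_basis_matrix` for a sigmoidal, wavelet, polynomial or trigonometric family changes only this construction; the rest of Algorithm 1 is unaffected.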
In this subsection, the asymptotic properties of Algorithms 1, 2 and 3 are analyzed. In particular, the following theorems show that the noise bound estimate $\hat\varepsilon$ and the Lipschitz constant estimate $\hat\gamma$ provided by Algorithms 2 and 3, respectively, converge to the true values when the number $N$ of data tends to infinity.

Theorem 2
Let the set $\{(\tilde w(k), e(k))\}_{k=0}^{N-1}$ appearing in (22) be dense on $W \times B_\varepsilon$ as $N \to \infty$. Then,
$$\lim_{N \to \infty} \hat\varepsilon = \varepsilon,$$
where $\hat\varepsilon$ is the noise bound estimated by Algorithm 2.

Proof. See the Appendix. □
Theorem 3
Let the set $\{(\tilde w(k), e(k))\}_{k=0}^{N-1}$ appearing in (22) be dense on $W \times B_\varepsilon$ as $N \to \infty$. Then,
$$\lim_{N \to \infty} \hat\gamma = \gamma_f,$$
where $\hat\gamma$ is the Lipschitz constant estimated by Algorithm 3.

Proof. See the Appendix. □
A result is now presented, showing that the controller $\hat\kappa$ identified by means of Algorithm 1 satisfies the stability condition (15) when the number of data $N$ tends to infinity. Before stating the result, let us introduce two technical assumptions regarding the noises $v_f(k)$ and $v_g(k)$ defined in (23) and (24), respectively.

Assumption 6
The set of points $D^N_{uv} \doteq \{(\tilde u(k), v_f(k))\}_{k=0}^{N-1}$ is dense on $U \times B_{\varepsilon_f}$ as $N \to \infty$. □

Assumption 7
The set of points $D^N_{yv} \doteq \{(\tilde y(k), v_g(k))\}_{k=0}^{N-1}$ is dense on $Y \times B_{\varepsilon_g}$ as $N \to \infty$. □

In Assumptions 6-7, density of the sets $D^N_{uv}$, $D^N_{yv}$ is intended in the same sense as in Assumption 5.
Theorem 4
Let the optimization problem (20) be feasible for any $N \ge 1$. Let Assumptions 1-2, 4, and 5-7 hold. Then, the error function $\Delta \doteq \kappa - \hat\kappa$ is Lipschitz continuous on $Y$, with constant $\gamma_\Delta$ such that
$$\limsup_{N \to \infty} \gamma_\Delta \le \gamma^s_\Delta < \frac{1-\gamma_{g,y}}{\gamma_f}.$$

Proof.
See the Appendix. (cid:3)
We illustrate next the convergence result of Theorem 4 through a simple numerical example.
Example: asymptotic behavior of the estimated Lipschitz constant.
We have considered the function
$$u = \kappa(y) \doteq 2y\,e^{-y^2}\cos(8y),$$
shown in Fig. 2, and increasing values of the data length $N$, from $N = 10$ upward. For each of these values, we generated a data set $D^N \doteq \{\tilde u(k), \tilde y(k)\}_{k=0}^{N-1}$ according to
$$\tilde u(k) = \kappa(\tilde y(k)) + d(k),\quad k = 0, \ldots, N-1,$$
where $d(k)$ is a white uniform noise with amplitude 0.05. Then, we applied Algorithm 1 to obtain an estimate $\hat\kappa^N$ of $\kappa$ of the form (19), where the $\varphi_i$ are Gaussian basis functions centered at the measured data:
$$\varphi_i(y) = e^{-(y-\tilde y(i))^2},\quad i = 0, \ldots, N-1.$$
For comparison, we computed another estimate $\hat\kappa^N_{nc}$ of the same form (19), using the same Gaussian basis functions, by means of Algorithm 1 but without the constraints (b) in (20). We recall that, according to Theorem 4, these constraints yield the convergence of the Lipschitz constant of the error function to a value that ensures closed-loop stability.

Figure 1: Continuous: $\gamma^N_\Delta$. Dot-dashed: $\gamma^N_{\Delta,nc}$. Dashed: hypothetical stability bound.

Figure 2: Continuous: true function $\kappa$. Dashed: Algorithm 1 estimate $\hat\kappa$. Dots: measurements.

The Lipschitz constant $\gamma^N_\Delta$ of the error function $\Delta^N \doteq \kappa - \hat\kappa^N$ and the Lipschitz constant $\gamma^N_{\Delta,nc}$ of the error function $\Delta^N_{nc} \doteq \kappa - \hat\kappa^N_{nc}$ are shown in Fig. 1. It can be noted that $\gamma^N_\Delta$ decreases quite rapidly as $N$ becomes large, soon taking values below a hypothetical threshold that is sufficient for closed-loop stability. $\gamma^N_{\Delta,nc}$ seems to have a similar behavior; however, its decrease is slower and less regular than that of $\gamma^N_\Delta$ and, in any case, satisfaction of the stability condition is not ensured theoretically. In Fig. 2, the estimate $\hat\kappa$ is compared with the true function $\kappa$.

Appendix

Derivation of eq. (8). Consider $t = 0$.
We have
$$\|y(1)\|_\infty = \|g(y(0), e_s(0))\|_\infty = \|g(y(0), e_s(0)) - g(0, e_s(0)) + g(0, e_s(0)) - \bar g + \bar g\|_\infty \le \|g(y(0), e_s(0)) - g(0, e_s(0))\|_\infty + \|g(0, e_s(0)) - \bar g\|_\infty + \|\bar g\|_\infty.$$
Using properties (5)-(6), we obtain
$$\|y(1)\|_\infty \le \gamma_{g,y}\|y(0)\|_\infty + \gamma_{g,e}\|e_s(0)\|_\infty + \|\bar g\|_\infty \le \gamma_{g,y}\|y(0)\|_\infty + \gamma_{g,e}\|e_s\|_\infty + \|\bar g\|_\infty.$$
Analogously,
$$\|y(2)\|_\infty \le \gamma_{g,y}^2\|y(0)\|_\infty + \gamma_{g,y}\gamma_{g,e}\|e_s\|_\infty + \gamma_{g,e}\|e_s\|_\infty + \gamma_{g,y}\|\bar g\|_\infty + \|\bar g\|_\infty.$$
The result is then established by generalizing to any $t \ge 0$:
$$\|y(t)\|_\infty \le \gamma_{g,y}^t\|y(0)\|_\infty + \sum_{k=0}^{t-1}\gamma_{g,y}^k\gamma_{g,e}\|e_s\|_\infty + \sum_{k=0}^{t-1}\gamma_{g,y}^k\|\bar g\|_\infty \le \frac{\gamma_{g,e}}{1-\gamma_{g,y}}\|e_s\|_\infty + \gamma_{g,y}^t\|y(0)\|_\infty + \frac{1}{1-\gamma_{g,y}}\|\bar g\|_\infty,\quad \forall t \ge 0.$$

Proof of Theorem 1. Consider $t = 0$. We have
$$\|y(1)\|_\infty = \|\hat g(y(0), e_s(0), e_y(0))\|_\infty = \|\hat g(y(0), e_s(0), e_y(0)) - g(y(0), e_s(0)) + g(y(0), e_s(0))\|_\infty.$$
Using properties (3) and (5)-(6),
$$\begin{aligned}&\|\hat g(y(0), e_s(0), e_y(0)) - g(y(0), e_s(0)) + g(y(0), e_s(0))\|_\infty\\ &\le \|\hat g(y(0), e_s(0), e_y(0)) - g(y(0), e_s(0))\|_\infty + \|g(y(0), e_s(0))\|_\infty\\ &= \|f(y(0), \hat\kappa(y(0)+e_y(0)), e_s(0)) - f(y(0), \kappa(y(0)), e_s(0))\|_\infty + \|g(y(0), e_s(0)) - g(0, e_s(0)) + g(0, e_s(0)) - \bar g + \bar g\|_\infty\\ &\le \gamma_f|\hat\kappa(y(0)+e_y(0)) - \kappa(y(0))| + \gamma_{g,y}\|y(0)\|_\infty + \gamma_{g,e}\|e_s(0)\|_\infty + \|\bar g\|_\infty,\end{aligned}$$
where we recall that $g(0,0) \doteq \bar g$. Using properties (11) and (13),
$$\begin{aligned}&\gamma_f|\hat\kappa(y(0)+e_y(0)) - \kappa(y(0))| + \gamma_{g,y}\|y(0)\|_\infty + \gamma_{g,e}\|e_s(0)\|_\infty + \|\bar g\|_\infty\\ &= \gamma_f|\hat\kappa(y(0)+e_y(0)) - \hat\kappa(y(0)) + \hat\kappa(y(0)) - \kappa(y(0))| + \gamma_{g,y}\|y(0)\|_\infty + \gamma_{g,e}\|e_s(0)\|_\infty + \|\bar g\|_\infty\\ &\le \gamma_f|\Delta(y(0)) - \Delta(0) + \bar\Delta| + \gamma_f\gamma_{\hat\kappa}\|e_y(0)\|_\infty + \gamma_{g,y}\|y(0)\|_\infty + \gamma_{g,e}\|e_s(0)\|_\infty + \|\bar g\|_\infty\\ &\le \gamma_f\gamma_\Delta\|y(0)\|_\infty + \gamma_f|\bar\Delta| + \gamma_f\gamma_{\hat\kappa}\|e_y(0)\|_\infty + \gamma_{g,y}\|y(0)\|_\infty + \gamma_{g,e}\|e_s\|_\infty + \|\bar g\|_\infty\\ &\le (\gamma_f\gamma_\Delta + \gamma_{g,y})\|y(0)\|_\infty + \gamma_{g,e}\|e_s\|_\infty + \|\bar g\|_\infty + \gamma_f|\bar\Delta| + \gamma_f\gamma_{\hat\kappa}\varepsilon_y\\ &= \gamma\|y(0)\|_\infty + \gamma_{g,e}\|e_s\|_\infty + \|\bar g\|_\infty + \gamma_f|\bar\Delta| + \gamma_f\gamma_{\hat\kappa}\varepsilon_y,\end{aligned}$$
where we recall that $\Delta(0) \doteq \bar\Delta$ and $\gamma \doteq \gamma_f\gamma_\Delta + \gamma_{g,y} < 1$. Analogously, for $t = 1$ we have
$$\|y(2)\|_\infty \le \gamma\|y(1)\|_\infty + \gamma_{g,e}\|e_s\|_\infty + \|\bar g\|_\infty + \gamma_f|\bar\Delta| + \gamma_f\gamma_{\hat\kappa}\varepsilon_y \le \gamma^2\|y(0)\|_\infty + \gamma\gamma_{g,e}\|e_s\|_\infty + \gamma\left(\|\bar g\|_\infty + \gamma_f|\bar\Delta| + \gamma_f\gamma_{\hat\kappa}\varepsilon_y\right) + \gamma_{g,e}\|e_s\|_\infty + \|\bar g\|_\infty + \gamma_f|\bar\Delta| + \gamma_f\gamma_{\hat\kappa}\varepsilon_y.$$
Generalizing to any $t \ge 0$, we obtain
$$\|y(t)\|_\infty \le \gamma^t\|y(0)\|_\infty + \sum_{k=0}^{t-1}\gamma^k\gamma_{g,e}\|e_s\|_\infty + \sum_{k=0}^{t-1}\gamma^k\left(\|\bar g\|_\infty + \gamma_f|\bar\Delta| + \gamma_f\gamma_{\hat\kappa}\varepsilon_y\right).$$
Considering (15) and the convergence of the geometric series, it follows that
$$\|y(t)\|_\infty \le \frac{\gamma_{g,e}}{1-\gamma}\|e_s\|_\infty + \gamma^t\|y(0)\|_\infty + \frac{1}{1-\gamma}\left(\|\bar g\|_\infty + \gamma_f|\bar\Delta| + \gamma_f\gamma_{\hat\kappa}\varepsilon_y\right),\quad \forall t \ge 0,$$
which establishes the result.

Proof of Corollary 1. Consider a generic time $t \ge 0$.
We have
\[
\begin{aligned}
\|\xi(t+1)\|_\infty &= \|f(\hat y(t),\hat\kappa(\hat y(t)+e_y(t)),e_s(t)) - f(y(t),\kappa(y(t)),e_s(t))\|_\infty \\
&= \|f(\hat y(t),\hat\kappa(\hat y(t)+e_y(t)),e_s(t)) - f(\hat y(t),\kappa(\hat y(t)),e_s(t)) + f(\hat y(t),\kappa(\hat y(t)),e_s(t)) - f(y(t),\kappa(y(t)),e_s(t))\|_\infty \\
&= \|f(\hat y(t),\hat\kappa(\hat y(t)+e_y(t)),e_s(t)) - f(\hat y(t),\kappa(\hat y(t)),e_s(t)) + g(\hat y(t),e_s(t)) - g(y(t),e_s(t))\|_\infty \\
&\le \gamma_f |\hat\kappa(\hat y(t)+e_y(t)) - \kappa(\hat y(t))| + \gamma_{g,y}\|\xi(t)\|_\infty ,
\end{aligned}
\]
where properties (3) and (5) have been used in the last inequality. Moreover,
\[
\begin{aligned}
\gamma_f |\hat\kappa(\hat y(t)+e_y(t)) - \kappa(\hat y(t))| + \gamma_{g,y}\|\xi(t)\|_\infty
&= \gamma_f |\hat\kappa(\hat y(t)+e_y(t)) - \hat\kappa(\hat y(t)) + \hat\kappa(\hat y(t)) - \kappa(\hat y(t))| + \gamma_{g,y}\|\xi(t)\|_\infty \\
&= \gamma_f |\Delta(\hat y(t)) - \bar\Delta + \bar\Delta + \hat\kappa(\hat y(t)+e_y(t)) - \hat\kappa(\hat y(t))| + \gamma_{g,y}\|\xi(t)\|_\infty .
\end{aligned}
\]
Using properties (11), (13), and the equality $\hat y(t) = \xi(t) + y(t)$, we obtain that
\[
\begin{aligned}
&\gamma_f |\Delta(\hat y(t)) - \bar\Delta + \bar\Delta + \hat\kappa(\hat y(t)+e_y(t)) - \hat\kappa(\hat y(t))| + \gamma_{g,y}\|\xi(t)\|_\infty \\
&\quad \le \gamma_f\gamma_\Delta\|\xi(t)\|_\infty + \gamma_f\gamma_\Delta\|y(t)\|_\infty + \gamma_f|\bar\Delta| + \gamma_f\gamma_{\hat\kappa}\|e_y(t)\|_\infty + \gamma_{g,y}\|\xi(t)\|_\infty \\
&\quad = \gamma\|\xi(t)\|_\infty + \gamma_f\gamma_\Delta\|y(t)\|_\infty + \gamma_f|\bar\Delta| + \gamma_f\gamma_{\hat\kappa}\|e_y(t)\|_\infty ,
\end{aligned}
\]
where $\gamma \doteq \gamma_f\gamma_\Delta + \gamma_{g,y} < 1$. Using (8), iterating backwards to $t=0$ and considering inequality (15) and the convergence of the geometric series,
\[
\|\xi(t+1)\|_\infty \le \gamma^t\|\xi(0)\|_\infty + \frac{\gamma_f}{1-\gamma}\left(\gamma_{\hat\kappa}\varepsilon_y + |\bar\Delta|\right) + \frac{\gamma_f\gamma_\Delta}{1-\gamma}\left(\frac{\gamma_{g,e}}{1-\gamma_{g,y}}\|e_s\|_\infty + \gamma_{g,y}^t\|y(0)\|_\infty + \frac{1}{1-\gamma_{g,y}}\|\bar g\|_\infty\right) ,
\]
which establishes the result.

Proof of Theorem 2. Consider step 2) of Algorithm 2. Equations (22) imply that
\[
\delta_{\tilde z_k} = \max_{i,j\in J_k} |\tilde z(i) - \tilde z(j)| = \max_{i,j\in J_k} |f(\tilde w(i)) - f(\tilde w(j)) + e(i) - e(j)|
\ge \max_{i,j\in J_k}\left(|e(i) - e(j)| - |f(\tilde w(i)) - f(\tilde w(j))|\right) .
\]
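As an aside, the quantity $\delta_{\tilde z_k}$ analyzed in this proof is exactly what steps 2)-3) of Algorithm 2 compute from data. The following is a minimal numerical sketch with hypothetical choices not taken from the paper: a scalar $f(w) = 0.3\sin(w)$, uniform noise with true bound $\varepsilon = 0.1$, and $J_k$ taken as the indices of samples within radius $\rho = 0.05$ of $\tilde w(k)$.

```python
import numpy as np

# Hypothetical illustration of steps 2)-3) of Algorithm 2: estimate the
# noise bound eps from samples z_tilde(k) = f(w_tilde(k)) + e(k).
# Assumed setup (not from the paper): f(w) = 0.3*sin(w), |e| <= eps = 0.1,
# J_k = indices of samples within radius rho = 0.05 of w_tilde(k).
rng = np.random.default_rng(0)
N, eps, rho = 4000, 0.1, 0.05

w = rng.uniform(0.0, 2.0 * np.pi, N)             # regressors w_tilde(k)
z = 0.3 * np.sin(w) + rng.uniform(-eps, eps, N)  # noisy samples z_tilde(k)

deltas = []
for k in range(N):
    Jk = np.abs(w - w[k]) <= rho                 # neighbourhood J_k
    if Jk.sum() > 1:
        # step 2): delta_{z_tilde,k} = max over i,j in J_k of |z(i) - z(j)|
        deltas.append(z[Jk].max() - z[Jk].min())

eps_hat = float(np.mean(deltas)) / 2.0           # step 3): mean of delta/2
print(eps_hat)  # close to eps = 0.1 for large N and small rho
```

On finite data the averaged estimate is only approximate; the asymptotic analysis in this proof shows that $\delta_{\tilde z_k}/2 \to \varepsilon$ as $N \to \infty$ and $\rho \to 0$.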
From Assumption 5, it follows that, for any $\lambda > 0$, there exist a sufficiently large $N$ and two pairs $(\tilde w(i), e(i)) \in \{\tilde w(k)\}_{k=0}^{N-1} \times B_e$ and $(\tilde w(j), e(j)) \in \{\tilde w(k)\}_{k=0}^{N-1} \times B_e$ with $i,j \in J_k$ such that
\[
|\varepsilon - e(i)| \le \lambda , \quad |-\varepsilon - e(j)| \le \lambda ,
\]
thus yielding the following inequality:
\[
|e(i) - e(j)| \ge 2\varepsilon - 2\lambda .
\]
Moreover, due to the Lipschitz continuity property, we have
\[
|f(\tilde w(i)) - f(\tilde w(j))| \le \gamma_f \|\tilde w(i) - \tilde w(j)\|_\infty \le 2\gamma_f\rho .
\]
The above inequalities imply that
\[
\delta_{\tilde z_k} \ge \max_{i,j\in J_k}\left(|e(i) - e(j)| - |f(\tilde w(i)) - f(\tilde w(j))|\right) \ge 2\varepsilon - 2\lambda - 2\gamma_f\rho . \quad (25)
\]
On the other hand,
\[
\delta_{\tilde z_k} = \max_{i,j\in J_k} |\tilde z(i) - \tilde z(j)| = \max_{i,j\in J_k} |f(\tilde w(i)) - f(\tilde w(j)) + e(i) - e(j)|
\le \max_{i,j\in J_k}\left(|e(i) - e(j)| + |f(\tilde w(i)) - f(\tilde w(j))|\right) \le 2\varepsilon + 2\gamma_f\rho . \quad (26)
\]
Since $\lambda$ and $\rho$ can be chosen arbitrarily small, from (25) and (26) it follows that $\delta_{\tilde z_k} \to 2\varepsilon$ as $N \to \infty$, i.e. that $\delta_{\tilde z_k}/2 \to \varepsilon$. In step 3) of Algorithm 2, the operation of taking the mean over all $\delta_{\tilde z_k}/2$ is inessential in this asymptotic analysis. It can be effective in the finite data case in order not to under-estimate or over-estimate $\varepsilon$.

Proof of Theorem 3. Define
\[
(\bar w_1, \bar w_2) \doteq \arg\max_{w_1,w_2\in W} \frac{|f(w_1) - f(w_2)|}{\|w_1 - w_2\|_\infty}
\]
and, without loss of generality, suppose that $f(\bar w_1) > f(\bar w_2)$. From Assumption 5, it follows that, for any $\lambda > 0$, there exist a sufficiently large $N$ and two pairs $(\tilde w(i), e(i)) \in \{\tilde w(k)\}_{k=0}^{N-1} \times B_e$ and $(\tilde w(j), e(j)) \in \{\tilde w(k)\}_{k=0}^{N-1} \times B_e$ with $i,j \in \{0,\ldots,N-1\}$ such that
\[
\|\bar w_1 - \tilde w(i)\|_\infty \le \lambda , \quad \|\bar w_2 - \tilde w(j)\|_\infty \le \lambda , \quad
|\varepsilon - e(i)| \le \lambda , \quad |-\varepsilon - e(j)| \le \lambda . \quad (27)
\]
Moreover,
\[
\|\bar w_1 - \bar w_2\|_\infty = \|\bar w_1 - \tilde w(i) - \bar w_2 + \tilde w(j) + \tilde w(i) - \tilde w(j)\|_\infty
\ge \|\tilde w(i) - \tilde w(j)\|_\infty - \|\bar w_1 - \tilde w(i)\|_\infty - \|\bar w_2 - \tilde w(j)\|_\infty
\ge \|\tilde w(i) - \tilde w(j)\|_\infty - 2\lambda . \quad (28)
\]
Also, from the Lipschitz continuity property of $f$ and from (22), we have that
\[
f(\bar w_1) - f(\bar w_2) \le f(\tilde w(i)) - f(\tilde w(j)) + 2\gamma_f\lambda
= \tilde z(i) - e(i) - \tilde z(j) + e(j) + 2\gamma_f\lambda
\le \tilde z(i) - \tilde z(j) - 2\varepsilon + 2\lambda + 2\gamma_f\lambda , \quad (29)
\]
where the last inequality follows from (27). Considering that $f(\bar w_1) > f(\bar w_2)$, inequalities (28) and (29) imply that
\[
\frac{|f(\bar w_1) - f(\bar w_2)|}{\|\bar w_1 - \bar w_2\|_\infty} = \frac{f(\bar w_1) - f(\bar w_2)}{\|\bar w_1 - \bar w_2\|_\infty}
\le \frac{\tilde z(i) - \tilde z(j) - 2\varepsilon + 2\lambda + 2\gamma_f\lambda}{\|\tilde w(i) - \tilde w(j)\|_\infty - 2\lambda} .
\]
Since $\lambda$ is arbitrarily small, we have that, as $N \to \infty$,
\[
\frac{|f(\bar w_1) - f(\bar w_2)|}{\|\bar w_1 - \bar w_2\|_\infty} \le \frac{\tilde z(i) - \tilde z(j) - 2\varepsilon}{\|\tilde w(i) - \tilde w(j)\|_\infty} .
\]
But $\frac{|f(\bar w_1) - f(\bar w_2)|}{\|\bar w_1 - \bar w_2\|_\infty} = \gamma_f$ and $\frac{\tilde z(i) - \tilde z(j) - 2\varepsilon}{\|\tilde w(i) - \tilde w(j)\|_\infty} \le \hat\gamma$, where it has been considered that, from Theorem 2, $\lim_{N\to\infty}\hat\varepsilon = \varepsilon$. It follows that, as $N \to \infty$,
\[
\gamma_f \le \hat\gamma . \quad (30)
\]
On the other hand, since $|e(k)| \le \varepsilon$, $\forall k$,
\[
\tilde\gamma_{ij} = \frac{\tilde z(i) - \tilde z(j) - 2\varepsilon}{\|\tilde w(i) - \tilde w(j)\|_\infty}
= \frac{f(\tilde w(i)) - f(\tilde w(j)) + e(i) - e(j) - 2\varepsilon}{\|\tilde w(i) - \tilde w(j)\|_\infty}
\le \frac{f(\tilde w(i)) - f(\tilde w(j))}{\|\tilde w(i) - \tilde w(j)\|_\infty} \le \gamma_f , \quad \forall i,j .
\]
It follows that $\gamma_f \ge \hat\gamma$, which, together with (30), proves the claim.

Proof of Theorem 4.
Following the same lines of the proof of Theorem 3 in [4], it can be shown that
\[
\limsup_{N\to\infty} \gamma_\Delta \le \gamma_\Delta^s < \frac{1 - \hat\gamma_{g,y}}{\hat\gamma_f} ,
\]
where $\hat\gamma_f$ and $\hat\gamma_{g,y}$ are estimates of the Lipschitz constants $\gamma_f$ and $\gamma_{g,y}$ in (3) and (5), see step 3) of Algorithm 1. The claim follows from Assumptions 1-2, 4-7 and Theorem 2.

References

[1] E.W. Bai, H. Cho, and R. Tempo. Convergence properties of the membership set. Automatica, 34(10):1245-1249, 1998.
[2] D.L. Donoho, M. Elad, and V.N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6-18, 2006.
[3] J.J. Fuchs. Recovery of exact sparse representations in the presence of bounded noise. IEEE Transactions on Information Theory, 51(10):3601-3608, 2005.
[4] C. Novara, L. Fagiano, and M. Milanese. Direct feedback control design for nonlinear systems. Automatica, 49(4):849-860, 2013.
[5] J.A. Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231-2242, 2004.
[6] J.A. Tropp. Just relax: convex programming methods for identifying sparse signals in noise.