GENERIC COLORIZED JOURNAL, VOL. XX, NO. XX, XXXX 2017
Koopman based Data-driven Simulation and Control
Yingzhao Lian†, Renzi Wang†, and Colin N. Jones*

Abstract — Sparked by the Willems' fundamental lemma, a class of data-driven control methods has been developed for LTI systems. At the same time, Koopman operator theory attempts to cast a nonlinear control problem into a standard, albeit infinite-dimensional, linear one. Motivated by these two ideas, a data-driven control scheme for nonlinear systems is proposed in this work. The proposed scheme is compatible with most differentiable regressors, enabling offline learning. In particular, model uncertainty is considered, which enables a novel data-driven simulation framework based on the Wasserstein distance. Numerical experiments are performed with Bayesian neural networks to show the effectiveness of both the proposed control and simulation schemes.
Index Terms — Predictive control, data-driven control, Koopman operator
I. INTRODUCTION

The recent trend of digitalization motivates research interest in data-driven control [1] because of the wide access to data collected by sensors. Instead of resorting to a first-principles model, the collected data are used either to identify a model [2] or to construct a controller directly. The former approach is compatible with most control theory and therefore results in successful applications [3], [4]. Featuring a controller design without any intermediate stage, the latter data-driven scheme attracts more research interest and finds successful application in linear systems [5] and iterative control [6], [7]. It is worth mentioning that in the reinforcement learning community [8], learning schemes can likewise be categorized into model-based and model-free methods.

This work is developed on the basis of the Willems' fundamental lemma and Koopman operator theory. In particular, the Willems' fundamental lemma characterizes the responses of deterministic linear time-invariant (LTI) systems with measured trajectories under reasonable assumptions of controllability and persistent excitation. Based on the data-driven prediction enabled by this lemma, predictive control schemes have been developed [9], [10]. Beyond predictive control, the Willems' fundamental lemma has also been adopted in feedback controller design [11], [12]. Within the LTI framework,
This work has received support from the Swiss National Science Foundation under the RISK project (Risk Aware Data-Driven Demand Response), grant number 200021 175627. †: The first two authors contributed equally to this work. *: Corresponding author. Yingzhao Lian, Renzi Wang and Colin N. Jones are with the Automatic Control Laboratory, Ecole Polytechnique Federale de Lausanne, Switzerland. {yingzhao.lian, renzi.wang, colin.jones}@epfl.ch

the Willems' fundamental lemma is further extended to incorporate measurement noise [13], [14] and process noise [11]. At the same time, a significant collection of works has tried to extend the applications of the Willems' fundamental lemma to nonlinear systems. In [15], an extension to Hammerstein and Wiener systems is proposed based on a priori knowledge of basis functions. [16]–[18] attempt to apply the Willems' fundamental lemma to the class of polynomial systems. As pointed out in [17], the necessary and sufficient condition in the fundamental lemma breaks down in this setting. A promising viewpoint based on the quotient space is proposed in [19], which clusters trajectories into equivalence classes and extends the Willems' fundamental lemma to reproducing kernel Hilbert spaces. The functional-space viewpoint in [19] motivates us to apply the Willems' fundamental lemma in function space. In particular, Koopman operator theory is used.

The Koopman operator was first introduced in the study of forward-complete autonomous systems [20], [21]; it is a linear composition operator even when the system is nonlinear. [22], [23] later introduced applications of the Koopman operator in controller and observer design, followed by a wide range of research ranging from model reduction [24] to global optimal control [25]. Even though most algorithms based on the Koopman operator have numerical implementations similar to those studied in [15]–[18], the Koopman operator establishes a fundamentally different theoretical framework.
In particular, the Koopman operator corresponds to a Heisenberg picture, which models the evolution of the observables, while the other aforementioned methods model the evolution of the state, corresponding to a Schrödinger picture [26]. Koopman operator theory can alleviate the theoretical issues in studying the lifting function without resorting to the quotient space, and enables convergence analysis as done in [27].

The key component of a Koopman operator based method is the learning of the eigenfunctions, or of lifting functions lying within the subspace spanned by the eigenfunctions. The standard framework of extended dynamic mode decomposition (EDMD) spans the lifting functions with a dictionary of basis functions [28], which suffers from the curse of dimensionality. To overcome this challenge, [29], [30] apply kernel methods to learn the Koopman operator in a non-parametric way, which is still not scalable to large datasets. In [31], [32], the lifting functions are approximated by neural networks. However, these aforementioned methods mainly consider one-step-ahead prediction, either due to the formulation of the learning problem or due to numerical stability, which results in relatively inaccurate long-term predictions. A link between the Koopman operator and subspace identification is observed in [33], which enables a learning scheme with long-term prediction. However, this method still suffers from a lack of scalability and the numerical limitations of subspace identification [34].

In this work, we propose to incorporate the learning of a Koopman operator into the framework of the Willems' fundamental lemma. Applications of the Koopman operator within the Willems' fundamental lemma are mentioned in [9], [15] but have not been detailed.
In this work, we show that by maximizing the linearity of a finite-order approximation, a Koopman operator can be learned based on the sensitivity analysis of a parametric programming problem. The proposed learning scheme is capable of uncertainty quantification, where a new objective function is derived to account for the uncertainty. Meanwhile, we propose a control scheme that solves a bi-level optimization problem by a transformation into a single-level structure. The bi-level optimization formulation has been discussed in both [9] and [35], where a bi-level problem is relaxed to a multi-objective problem.

The remainder of this paper is organized as follows: the preliminary knowledge is introduced in Section II, after which the training and the prediction based on the proposed scheme are elaborated in Section III. We explain the proposed control framework in Section IV along with presenting the numerical simulation results of prediction and control in Section V.

NOTATIONS

$\|x\|_p$ indicates the $\ell_p$ norm of a vector $x$ and $\|x\|_Q^2 := x^\top Q x$ is the weighted norm with $Q$ positive semi-definite. $\|A\|_F$ and $\|A\|_*$ denote the Frobenius norm and the nuclear norm of a matrix $A$ respectively. $\mathcal{N}(\mu, \Sigma)$ is a Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$. We use $\mathbb{Z}_{\geq 0}$ to represent the non-negative integers. $\mathbf{w} := \{w_k\}_{k=a}^{b}$ is a sequence of signals $\{w_a, \dots, w_b\}$ indexed by $k$. Specifically, boldface is used to denote a sequence while lightface denotes a measurement, e.g. $\mathbf{w}$ and $w_k$. Meanwhile, the subscript $d$ is reserved to denote data collected offline. The superscript $*$ is used to denote the optimal solution of an optimization problem. $\otimes$ denotes the Kronecker product.

II. PRELIMINARY
In this section, the Willems' fundamental lemma and Koopman operator theory are introduced. Then, the sensitivity analysis of a parametric optimization problem, which is the enabler of the learning scheme, is discussed.
A. Willems’ Fundamental Lemma
Given a sequence of measurements $\{w_k\}_{k=0}^{T-1}$, its Hankel matrix of depth $L$ is defined as
$$\mathcal{H}_L(\mathbf{w}) := \begin{bmatrix} w_0 & w_1 & \dots & w_{T-L} \\ w_1 & w_2 & \dots & w_{T-L+1} \\ \vdots & \vdots & \ddots & \vdots \\ w_{L-1} & w_L & \dots & w_{T-1} \end{bmatrix}. \quad (1)$$
Regarding a Hankel matrix $\mathcal{H}_L(\mathbf{w})$, the signal sequence $\mathbf{w}$ is persistently exciting of order $L$ if $\mathcal{H}_L(\mathbf{w})$ has full row rank. The Willems' fundamental lemma utilizes the Hankel matrices to characterize the response of the following deterministic linear time-invariant (LTI) system, dubbed $\mathfrak{B}(A,B,C,D)$,
$$x_{k+1} = A x_k + B u_k, \qquad y_k = C x_k + D u_k, \quad (2)$$
where $A \in \mathbb{R}^{n_x \times n_x}$, $B \in \mathbb{R}^{n_x \times n_u}$, $C \in \mathbb{R}^{n_y \times n_x}$, $D \in \mathbb{R}^{n_y \times n_u}$ parametrize the system dynamics, and the order of this system is denoted by $\mathcal{O}(\mathfrak{B}(A,B,C,D)) := n_x$. The Willems' fundamental lemma is stated as follows.
Lemma 1 ([36, Theorem 1], [12, Lemma 2]): Consider a controllable and observable system (2). If the input sequence $\mathbf{u}_d = \{u_{d,k}\}_{k=0}^{T_d-1}$ is persistently exciting of order $\mathcal{O}(\mathfrak{B}(A,B,C,D)) + L$, then
1) any $L$-step input/output trajectory of system (2) can be expressed as
$$\begin{bmatrix} \mathcal{H}_L(\mathbf{u}_d) \\ \mathcal{H}_L(\mathbf{y}_d) \end{bmatrix} g = \begin{bmatrix} \mathbf{u} \\ \mathbf{y} \end{bmatrix};$$
2) any linear combination of the columns of the Hankel matrices, that is,
$$\begin{bmatrix} \mathcal{H}_L(\mathbf{u}_d) \\ \mathcal{H}_L(\mathbf{y}_d) \end{bmatrix} g,$$
is an $L$-step input/output trajectory of (2).

This lemma enables data-driven simulation and control [9], [37]. To make an $N$-step prediction, the Hankel matrices composed of offline data are partitioned as
$$\begin{bmatrix} U_p \\ U_f \end{bmatrix} := \mathcal{H}_{T_{ini}+N}(\mathbf{u}_d), \qquad \begin{bmatrix} Y_p \\ Y_f \end{bmatrix} := \mathcal{H}_{T_{ini}+N}(\mathbf{y}_d),$$
where the first $T_{ini}$ row blocks are used to construct $U_p$, $Y_p$ while the remaining row blocks are assigned to $U_f$, $Y_f$. In the remainder of this paper, $n_c$ is reserved to denote the number of columns of the Hankel matrix. In particular, $T_{ini}$ is chosen to ensure the uniqueness of the prediction, i.e. the observability matrix
$$\mathcal{O}_{T_{ini}}(A,C) := \begin{bmatrix} C^\top & (CA)^\top & \dots & (CA^{T_{ini}-1})^\top \end{bmatrix}^\top$$
is of rank $\mathcal{O}(\mathfrak{B}(A,B,C,D)) = n_x$ [37]. Without measurement noise, the $N$-step output prediction $\mathbf{y}$ is defined by
$$\mathbf{y} = Y_f g \quad \text{s.t.} \quad \begin{bmatrix} U_p \\ Y_p \\ U_f \end{bmatrix} g = \begin{bmatrix} \mathbf{u}_{ini} \\ \mathbf{y}_{ini} \\ \mathbf{u} \end{bmatrix}, \quad (3)$$
where $\mathbf{u}_{ini}$ and $\mathbf{y}_{ini}$ are the $T_{ini}$-step previous measurements of the inputs and outputs. Accordingly, $\mathbf{y}$ is the $N$-step response driven by the input sequence $\mathbf{u}$. Built on this prediction scheme, the data-enabled predictive control (DeePC) problem [9] is
$$\min_{g, \sigma_y, \mathbf{u}, \mathbf{y}} \; \sum_{k=0}^{N-1} \left( \|y_k - r_{t+k}\|_Q^2 + \|u_k\|_R^2 \right) + \lambda_g \|g\|_2^2 + \lambda_y \|\sigma_y\|_2^2$$
$$\text{s.t.} \quad \begin{bmatrix} U_p \\ Y_p \\ U_f \\ Y_f \end{bmatrix} g = \begin{bmatrix} \mathbf{u}_{ini} \\ \mathbf{y}_{ini} + \sigma_y \\ \mathbf{u} \\ \mathbf{y} \end{bmatrix}, \quad u_k \in \mathcal{U}, \; y_k \in \mathcal{Y}, \; \forall k \in \{0, \dots, N-1\}, \quad (4)$$
where $Q$ and $R$ are the weights penalizing outputs and inputs respectively and $\sigma_y$ is introduced to deal with measurement noise. $r \in \mathbb{R}^{pN}$ is the reference trajectory and $\mathcal{U}$, $\mathcal{Y}$ are the feasible sets of inputs and outputs. $\|g\|_2^2$ and $\|\sigma_y\|_2^2$ are regularization terms, and $\lambda_y, \lambda_g \in \mathbb{R}_{>0}$ are regularization parameters.
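As a self-contained illustration (our own NumPy sketch, not the authors' code), the Hankel construction (1), the persistency-of-excitation check, and the noise-free prediction (3) fit in a few lines. The scalar plant $x^+ = 0.9x + u$, $y = x$ and all numerical values are illustrative assumptions.

```python
import numpy as np

def hankel(w, L):
    """Depth-L Hankel matrix (1): column i is the window w[i], ..., w[i+L-1]."""
    return np.column_stack([w[i:i + L] for i in range(len(w) - L + 1)])

def persistently_exciting(w, L):
    H = hankel(w, L)
    return np.linalg.matrix_rank(H) == H.shape[0]   # full row rank

# offline data from the (assumed) scalar plant x+ = 0.9 x + u, y = x
rng = np.random.default_rng(0)
ud = rng.standard_normal(60)
yd, x = [], 0.0
for u in ud:
    x = 0.9 * x + u
    yd.append(x)
yd = np.array(yd)

T_ini, N = 2, 5
L = T_ini + N
Up, Uf = np.vsplit(hankel(ud, L), [T_ini])   # past / future row blocks
Yp, Yf = np.vsplit(hankel(yd, L), [T_ini])

# prediction (3): pick any g consistent with (u_ini, y_ini, u); the min-norm
# least-squares solution suffices in the noise-free case
u_ini, y_ini, u_fut = ud[-T_ini:], yd[-T_ini:], np.ones(N)
g = np.linalg.lstsq(np.vstack([Up, Yp, Uf]),
                    np.concatenate([u_ini, y_ini, u_fut]), rcond=None)[0]
y_pred = Yf @ g   # matches the true rollout of the plant
```

Because the random input is generically persistently exciting of order $n_x + L$, the constraint in (3) is consistent and any solution $g$ yields the exact continuation.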
These regularization terms have been interpreted under a distributionally robust optimization framework [38] and a maximum-likelihood framework [14].

Remark: When a single long input-output sequence is not available, the Hankel matrix $\mathcal{H}_L(\mathbf{w})$ can be replaced by a mosaic Hankel matrix [39]. Given $M$ trajectories $\mathbf{w} = [\mathbf{w}_1, \dots, \mathbf{w}_M]$, where each trajectory $\mathbf{w}_i = (w_{i,1}, \dots, w_{i,T_i})$ with $w_{i,k} \in \mathbb{R}^q$, the mosaic Hankel matrix is defined as
$$\mathcal{H}_L(\mathbf{w}) = [\mathcal{H}_L(\mathbf{w}_1), \dots, \mathcal{H}_L(\mathbf{w}_M)].$$

B. Koopman Operator
Given a discrete-time autonomous system
$$x_{k+1} = f(x_k), \quad (5)$$
where $f$ models the nonlinear dynamics, the Koopman operator is the composition operator
$$\mathcal{K}\psi := \psi \circ f,$$
where $\psi: \mathbb{R}^{n_x} \to \mathbb{R}$ is called an observable. Unlike a standard state-space model, the Koopman operator models the evolution of a function driven by the system dynamics $f$, and its existence is guaranteed for forward-complete systems [40]. As the Koopman operator acts on a function space, $\mathcal{K}$ is in general infinite-dimensional, but critically it is linear even when the dynamics $f$ are nonlinear. As such, an observable $\phi$ is an eigenfunction associated with the eigenvalue $\lambda \in \mathbb{C}$ if $\mathcal{K}\phi = \lambda\phi$. From this we can see that the eigenfunctions (or linear combinations of the eigenfunctions) evolve linearly along the trajectories of our nonlinear system (5):
$$\phi(x_{k+1}) = \phi(f(x_k)) = (\mathcal{K}\phi)(x_k) = \lambda\phi(x_k). \quad (6)$$
Given a collection of eigenfunctions $\{\phi_i\}_{i=1}^{n_\phi}$, any observable lying within the span of these eigenfunctions can be decomposed as $\psi = \sum_i c_i(\psi)\phi_i$, where the $c_i(\psi)$ are called the Koopman modes of $\psi$. Then, we have
$$\mathcal{K}\psi = \sum_i c_i(\psi)\lambda_i\phi_i,$$
with $\lambda_i$ denoting the eigenvalue of $\phi_i$.

In the sequel, the subscript $u$ is used to denote the components corresponding to a system with control inputs. Given nonlinear dynamics with control input
$$x_{k+1} = f_u(x_k, u_k),$$
the Koopman operator can be defined in different ways [22], [41], [42]. In this work, we consider the framework in [22]. More specifically, denote the infinite control sequence $\mathbf{u} := \{u_k\}_{k=0}^{\infty} \in l(\mathcal{U})$, where $l(\mathcal{U})$ represents the space of all control sequences. The augmented state is
$$\chi = \begin{bmatrix} x \\ \mathbf{u} \end{bmatrix},$$
upon which the system dynamics are augmented as $F: \mathbb{R}^{n_x} \times l(\mathcal{U}) \to \mathbb{R}^{n_x} \times l(\mathcal{U})$,
$$F(\chi_k) = \begin{bmatrix} f_u(x_k, \mathbf{u}_k(0)) \\ \mathcal{S}\mathbf{u}_k \end{bmatrix}. \quad (7)$$
$\mathcal{S}$ is the left shift operator with $\mathcal{S}\mathbf{u}(i) := \mathbf{u}(i+1)$, and $\mathbf{u}(i)$ is the evaluation of the $i$-th element of $\mathbf{u}$.
In this setup, $\mathbf{u}$ can be considered a sequence of mappings from the index $i$ to the actual input $u_i$. It is noteworthy that the dynamical system (7) is infinite-dimensional but autonomous. Hence, the aforementioned definition of the Koopman operator applies directly, and the corresponding eigenfunctions are assumed to be spanned by the following dictionary of basis functions:
$$\{\phi_u(x, \mathbf{u})\}_{i=1}^{n_{\phi_u}+n_u} := \{\phi_{u,1}(x), \dots, \phi_{u,n_{\phi_u}}(x), \mathbf{u}(0)^\top\}.$$
If the evolution of this dictionary of basis functions is closed under the system dynamics, then we have
$$z_{k+1} = \mathcal{A}z_k + \mathcal{B}\mathbf{u}_k(0), \qquad \mathbf{u}_{k+1}(0) = \mathbf{u}_k(1), \quad (8)$$
where $z_k := [\phi_{u,1}(x_k), \dots, \phi_{u,n_{\phi_u}}(x_k)]^\top$ and $\mathcal{A}$, $\mathcal{B}$ capture the Koopman operator. Similarly, any function within the span of these basis functions can be recovered via the Koopman modes as
$$\psi_u(x, \mathbf{u}(0)) = c_u^\top \begin{bmatrix} z \\ \mathbf{u}(0) \end{bmatrix}, \quad (9)$$
with $c_u := [c_{u,1}, c_{u,2}, \dots, c_{u,n_{\phi_u}+n_u}]^\top$ the vector of Koopman modes. In particular, we are interested in the Koopman modes of the identity functions evaluated on the system outputs. Assuming that we have $n_y$ outputs, the evaluation of the $i$-th output is $I_{y,i}(y_k) := y_{k,i}$. With a slight abuse of notation, the Koopman mode decomposition of the output evaluations is
$$y_k = \begin{bmatrix} I_{y,1}(y_k) \\ I_{y,2}(y_k) \\ \vdots \\ I_{y,n_y}(y_k) \end{bmatrix} = \begin{bmatrix} c_{u,1}^\top \\ c_{u,2}^\top \\ \vdots \\ c_{u,n_y}^\top \end{bmatrix} \begin{bmatrix} z \\ \mathbf{u}(0) \end{bmatrix} := C_u \begin{bmatrix} z \\ \mathbf{u}(0) \end{bmatrix}, \quad (10)$$
where $C_u$ stacks the Koopman modes of the output evaluations.
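The closure condition behind (8) can be made concrete with a classic toy example of our own (not from the paper): for $x_1^+ = a x_1$, $x_2^+ = b x_2 + c x_1^2$, the dictionary $\{x_1, x_2, x_1^2\}$ is closed under the dynamics, so the lifted state evolves exactly linearly.

```python
# Toy illustration (assumed example): the lifted state z = (x1, x2, x1^2)
# satisfies z+ = K z exactly, even though the underlying map is nonlinear.
import numpy as np

a, b, c = 0.9, 0.5, 1.0
K = np.array([[a,   0.0, 0.0],
              [0.0, b,   c  ],
              [0.0, 0.0, a**2]])   # finite-dimensional Koopman matrix

def step(x):
    """Nonlinear dynamics x1+ = a*x1, x2+ = b*x2 + c*x1^2."""
    return np.array([a * x[0], b * x[1] + c * x[0]**2])

def lift(x):
    """The dictionary {x1, x2, x1^2}, closed under the dynamics."""
    return np.array([x[0], x[1], x[0]**2])

x = np.array([1.3, -0.7])
for _ in range(20):
    assert np.allclose(lift(step(x)), K @ lift(x))   # exact linearity
    x = step(x)
```

For a general system the dictionary is only approximately closed, which is exactly what the learning scheme of Section III optimizes.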
C. Differential Parametric Optimization
Sensitivity analysis investigates the smoothness of a parametric optimization problem, where the implicit function theorem [43] is applied to the KKT system. This idea has been applied in deep learning [44] and reinforcement learning [45]. Though the solution map is barely differentiable, the optimal value function is smoother than the solution map [46], which is the only tool used in this work. In general, the continuity of a convex parametric optimization problem is guaranteed by uniform level boundedness [47, Theorem 1.17], while a general nonlinear parametric optimization problem only admits a lower semicontinuous value function under the assumption of local compactness [48].

For the sake of clarity, we elaborate this derivative for a standard quadratic program (QP); please refer to [49] for a general conic form. We use the subscript $q$ to avoid confusion. Consider a parametric QP, $Q(e_q): e_q \mapsto z_q^*$, with parameters $\{Q_q, q_q, H_q, h_q, E_q\}$ and $Q_q$ positive definite:
$$\min_{z_q} \; z_q^\top Q_q z_q + q_q^\top z_q \quad \text{s.t.} \quad H_q z_q \leq h_q, \; E_q z_q = e_q. \quad (11)$$
The KKT conditions for the QP are
$$Q_q z_q^* + q_q + H_q^\top \lambda^* + E_q^\top \nu^* = 0, \qquad \mathrm{diag}(\lambda^*)(H_q z_q^* - h_q) = 0, \qquad E_q z_q^* - e_q = 0, \quad (12)$$
where $z_q^*$, $\nu^*$, $\lambda^*$ are the optimal primal and dual variables, and $\mathrm{diag}(x)$ builds a diagonal matrix from $x$. The differentials of the KKT conditions can then be computed as
$$\begin{bmatrix} Q_q & H_q^\top & E_q^\top \\ \mathrm{diag}(\lambda^*) H_q & \mathrm{diag}(H_q z_q^* - h_q) & 0 \\ E_q & 0 & 0 \end{bmatrix} \begin{bmatrix} dz_q \\ d\lambda \\ d\nu \end{bmatrix} = - \begin{bmatrix} dQ_q\, z_q^* + dq_q + dH_q^\top \lambda^* + dE_q^\top \nu^* \\ \mathrm{diag}(\lambda^*)\, dH_q\, z_q^* - \mathrm{diag}(\lambda^*)\, dh_q \\ dE_q\, z_q^* - de_q \end{bmatrix}. \quad (13)$$
The derivatives of $z_q^*$ with respect to the parameters $(Q_q, q_q, H_q, h_q, E_q)$ and the input $e_q$ are given by the solution of the linear system (13). For example, the solution $dz_q$ of (13) gives $\partial z_q^* / \partial Q_q$ if we set $dQ_q = I$ and the differentials of all other parameters to 0. The gradient of a loss $L(z_q^*)$ with respect to $Q_q$ is calculated accordingly as $\frac{\partial L(z_q^*)}{\partial z_q^*} \frac{\partial z_q^*}{\partial Q_q}$.

III. KOOPMAN BASED DATA-DRIVEN PREDICTION
In this section, the fundamental lemma is first introduced within Koopman operator theory, which enables a training scheme that minimizes the prediction error on the training dataset. A stochastic prediction scheme is then introduced to show compatibility with probabilistic models, such as Bayesian neural networks and Gaussian processes.
A. Koopman Operator with the Fundamental Lemma
As discussed in Section II, the key component of a Koopman operator based method is the eigenfunctions, or a linear subspace containing the eigenfunctions. Therefore, learning a Koopman operator is equivalent to finding functions whose evaluations evolve like a linear system (2). The following corollary is the enabler of the proposed learning scheme.
Corollary 1: A dynamical system of order $n_x$ can be parametrized as a linear system (2) if and only if the fundamental lemma holds.

Proof: The necessary condition holds by Lemma 1 and the sufficient condition holds by the definition of linear systems.

As discussed in Section II-B, the output evaluations are assumed to be spanned by the following basis functions:
$$\{\phi_u(x, \mathbf{u})\}_{i=1}^{n_{\phi_u}+n_u} := \{\phi_{u,1}(x), \dots, \phi_{u,n_{\phi_u}}(x), \mathbf{u}(0)^\top\}.$$
Then, given a sequence of state evolutions $\mathbf{x}_d$ with its corresponding input-output sequences $\mathbf{u}_d$, $\mathbf{y}_d$, whose inputs are persistently exciting of order $n_{\phi_u} - n_u + L$, Corollary 1 implies that $\{\phi_u(x, \mathbf{u})\}_{i=1}^{n_{\phi_u}+n_u}$ is the desired collection of basis functions if and only if, for all $x \in \mathbb{R}^{n_x}$ and every output sequence $\mathbf{y}$ driven by $\mathbf{u}$, there exists $g \in \mathbb{R}^{n_c}$ such that
$$\begin{bmatrix} Z \\ \mathcal{H}_L(\mathbf{u}_d) \\ \mathcal{H}_L(\mathbf{y}_d) \end{bmatrix} g = \begin{bmatrix} z \\ \mathbf{u} \\ \mathbf{y} \end{bmatrix}, \quad (14)$$
where $z := [\phi_{u,1}(x), \dots, \phi_{u,n_{\phi_u}}(x)]^\top$ and
$$Z := \begin{bmatrix} \phi_{u,1}(x_1) & \phi_{u,1}(x_2) & \dots & \phi_{u,1}(x_{n_c}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{u,n_{\phi_u}}(x_1) & \phi_{u,n_{\phi_u}}(x_2) & \dots & \phi_{u,n_{\phi_u}}(x_{n_c}) \end{bmatrix}.$$

B. Learning the Koopman Basis Functions
Following the previous discussion, learning a Koopman operator is converted into learning basis functions that maximize the satisfaction of condition (14). In practice, the underlying state of the nonlinear system is not necessarily measured; we therefore make the following assumption.
Assumption 1: $x_k$ is measurable with respect to the previous $T_{ini}$-step input-output sequence $\{u_i, y_i\}_{i=k-T_{ini}+1}^{k}$.

This assumption implies that $x_k$ can be determined from $\{u_i, y_i\}_{i=k-T_{ini}+1}^{k}$, and it therefore plays a similar role as the matrices $U_p$, $Y_p$ in problems (4) and (3). Assume that we have input-output data sequences $\mathbf{u}_d := \{u_{d,i}\}_{i=0}^{n_d}$ and $\mathbf{y}_d := \{y_{d,i}\}_{i=0}^{n_d}$ consisting of $n_d$ measurements; each of them is partitioned into two subsets: $\mathbf{u}_{d,l} := \{u_{d,i}\}_{i=0}^{n_{d,t}}$, $\mathbf{y}_{d,l} := \{y_{d,i}\}_{i=0}^{n_{d,t}}$, $\mathbf{u}_{d,t} := \{u_{d,i}\}_{i=n_{d,t}+1}^{n_d}$ and $\mathbf{y}_{d,t} := \{y_{d,i}\}_{i=n_{d,t}+1}^{n_d}$, where $n_{d,t} = n_c + T_{ini} + L - 1$ is the number of datapoints in the first two sets. The subsets with subscript $d,l$ are used to build the Hankel matrices characterizing the Koopman operator, while the remaining two subsets are used to learn the basis functions.

Under Assumption 1, a differentiable learner is used to learn the basis functions, dubbed $\{\phi_{u,\theta}\}_{i=1}^{n_{\phi_u}}$, whose parameters are denoted by $\theta$. Neural networks [50] and
Gaussian processes [51] are recommended learners with strong representation power. In particular, inducing variables can be considered trainable parameters of a Gaussian process; please refer to [52], [53] for more details. Enforcing condition (14) for $L$-step sequences, the learning problem is formulated as follows:
$$\min_\theta \; \sum_{i=n_c}^{n_d} l(\mathbf{y}_{d,i} - \mathcal{H}_L(\mathbf{y}_{d,l}) g_i) \quad \text{s.t.} \quad g_i = \arg\min_g P(\mathbf{u}_{d,i}, \mathbf{y}_{d,i}), \quad (15)$$
with
$$P(\mathbf{u}_{d,i}, \mathbf{y}_{d,i}) := \lambda_g \|g\|_2^2 + \lambda_y \|Zg - z\|_2^2 \quad \text{s.t.} \quad \mathcal{H}_L(\mathbf{u}_{d,l}) g = \mathbf{u}_{d,i}, \; z = \phi_{u,\theta}(\{u_k, y_k\}_{k=i}^{i+T_{ini}-1}).$$
In particular, $\mathbf{u}_{d,i} := [u_i, \dots, u_{i+L+T_{ini}-1}]$ and $\mathbf{y}_{d,i} := [y_i, \dots, y_{i+L+T_{ini}-1}]$, $i \geq n_{d,t}+1$, are sequences of inputs and outputs of length $L + T_{ini}$. The matrix $Z$ is the evaluation of the basis functions
$$Z := [\phi_{u,\theta}(\{u_i, y_i\}_{i=0}^{T_{ini}}), \dots, \phi_{u,\theta}(\{u_i, y_i\}_{i=n_c-1}^{T_{ini}+n_c-1})].$$
The constraint in the learning problem (15) is effectively a prediction problem similar to (3), and $l(\cdot)$ therefore penalizes the prediction error. As one may notice, there are two relaxations in the learning problem (15):
1) To recover an output evaluation, an infinite set of basis functions may be required. The learning problem learns a finite-order approximation of this possibly infinite set.
2) Condition (14) is required to hold for all states; the learning problem, however, relaxes this condition to a set of sampled states. Therefore, the training sets $\mathbf{u}_{d,t}$ and $\mathbf{y}_{d,t}$ should be large enough to represent condition (14).

Figure 1 shows the flow of the learning problem, where the dashed line indicates the direction of back-propagation. More specifically, the prediction problem is considered a parametric optimization problem, whose differentiation is discussed in Section II-C.

Finally, we end this subsection by listing further benefits of the proposed scheme:
• Unlike general EDMD methods, which learn the matrices $\mathcal{A}$, $\mathcal{B}$, $C_u$ in (8) and (10), the proposed scheme avoids learning these parameters directly. Learning $\mathcal{A}$, $\mathcal{B}$, $C_u$ is ill-conditioned because the solution is not unique.
• The learning problem optimizes a multi-step forward prediction; meanwhile, the utilization of the Willems' fundamental lemma provides good numerical stability in training, which is a key challenge in training fully-connected recurrent neural networks [54].
• The proposed scheme is scalable and can be parallelized.
• Unlike other nonlinear extensions of the Willems' fundamental lemma, the proposed scheme does not require a nonlinear mapping of the future inputs and outputs in the prediction problem.
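The inner problem of (15) has only an equality constraint, so its minimizer solves a small KKT linear system and is differentiable in $z$ (and hence in $\theta$), which is what the back-propagation in Figure 1 exploits. The following NumPy sketch (our own simplified illustration, with made-up dimensions and no inequality constraints) shows the closed-form solve:

```python
# Sketch (assumed, simplified): the equality-constrained inner problem of (15)
# reduces to one KKT linear solve, through which gradients can be propagated.
import numpy as np

def predict_g(Z, z, U, u, lam_g=1e-3, lam_y=1.0):
    """argmin_g lam_g*||g||^2 + lam_y*||Z g - z||^2  s.t.  U g = u."""
    n = Z.shape[1]
    Q = 2 * (lam_g * np.eye(n) + lam_y * Z.T @ Z)   # Hessian of the objective
    q = -2 * lam_y * Z.T @ z                         # linear term
    # stationarity + feasibility: [Q U^T; U 0] [g; nu] = [-q; u]
    KKT = np.block([[Q, U.T],
                    [U, np.zeros((U.shape[0], U.shape[0]))]])
    rhs = np.concatenate([-q, u])
    return np.linalg.solve(KKT, rhs)[:n]

rng = np.random.default_rng(2)
Z = rng.standard_normal((4, 10))    # lifted evaluations (illustrative sizes)
z = rng.standard_normal(4)
U = rng.standard_normal((3, 10))    # H_L(u_d,l)
u = rng.standard_normal(3)
g = predict_g(Z, z, U, u)
```

In a full implementation the same solve would be wrapped in an autodiff framework (e.g. a differentiable optimization layer) so that the outer loss in (15) can be back-propagated to $\theta$.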
C. Stochastic Prediction
As shown in Section III-B, the prediction problem plays a key role in the proposed scheme. If the chosen learner is deterministic, then the predicted output sequence $\tilde{\mathbf{y}}$ driven by inputs $\tilde{\mathbf{u}}$ is calculated by
$$\tilde{\mathbf{y}} = \mathcal{H}_L(\mathbf{y}_{d,l})\tilde{g}, \quad (16)$$
$$\tilde{g} = \arg\min_g \; \lambda_g \|g\|_2^2 + \lambda_y \|Zg - z\|_2^2 \quad \text{s.t.} \quad \mathcal{H}_L(\mathbf{u}_{d,l}) g = \tilde{\mathbf{u}}, \; z = \phi_{u,\theta}(\tilde{\mathbf{u}}_p, \tilde{\mathbf{y}}_p),$$
which may result in overfitting. A probabilistic learner is one way to avoid overfitting, such as the aforementioned Gaussian processes and Bayesian neural networks. The output of a probabilistic learner is a distribution rather than a deterministic point. In this section, we show how the distributional output of a probabilistic learner can be used for prediction, which is also essential for the training of the Koopman operator. Two methods will be discussed: one is based on Monte-Carlo sampling, while the other generates predictions by bounding the Wasserstein distance.
1) Monte-Carlo Prediction:
Assuming the output distribution of the probabilistic learner is $P_{\phi_u}$, a Monte-Carlo method is applied to calculate the distribution of the prediction. In particular, the matrix $Z$ in problem (16) is sampled from $P_{\phi_u}$, which gives a sample from the output distribution. The output distribution can then be approximated by the sampled outputs.

Meanwhile, the loss function in the learning problem (15) is modified to an expected cost. In conclusion, we have the following learning problem:
$$\min_\theta \; \sum_{i=n_c}^{n_d} \mathbb{E}\, l(\mathbf{y}_{d,i} - \mathcal{H}_L(\mathbf{y}_{d,l}) g_i) \quad \text{s.t.} \quad g_i \sim \arg\min_g P(\mathbf{u}_{d,i}, \mathbf{y}_{d,i}),$$
with
$$P(\mathbf{u}_{d,i}, \mathbf{y}_{d,i}) := \lambda_g \|g\|_2^2 + \lambda_y \|Zg - z\|_2^2 \quad \text{s.t.} \quad \mathcal{H}_L(\mathbf{u}_{d,l}) g = \mathbf{u}_{d,i}, \; z \sim P_{\phi_{u,\theta}}(\{u_k, y_k\}_{k=i}^{i+T_{ini}-1}),$$
where the $k$-th column of the matrix $Z$ follows the distribution $P_{\phi_{u,\theta}}(\{u_i, y_i\}_{i=k-1}^{T_{ini}+k-1})$. The gradient of this learning problem is also approximated by a Monte-Carlo method.
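The Monte-Carlo prediction can be sketched as follows (our own illustration with made-up data and a Gaussian stand-in for $P_{\phi_u}$; the solver reuses the KKT form of the equality-constrained prediction problem):

```python
# Sketch (assumed setup): each Monte-Carlo sample draws (Z, z) from the
# learner's output distribution, solves the deterministic prediction problem,
# and the sampled outputs approximate the predictive distribution.
import numpy as np

rng = np.random.default_rng(3)
Hu = rng.standard_normal((3, 12))      # stands in for H_L(u_d,l)
Hy = rng.standard_normal((5, 12))      # stands in for H_L(y_d,l)
u_t = rng.standard_normal(3)           # input sequence to simulate
mu_Z, sig_Z = rng.standard_normal((4, 12)), 0.1   # learner's mean / spread
mu_z, sig_z = rng.standard_normal(4), 0.1
lam_g, lam_y = 1e-3, 1.0

def solve_pred(Z, z):
    """One deterministic prediction (16) for a sampled (Z, z)."""
    n = Z.shape[1]
    Q = 2 * (lam_g * np.eye(n) + lam_y * Z.T @ Z)
    KKT = np.block([[Q, Hu.T], [Hu, np.zeros((3, 3))]])
    rhs = np.concatenate([2 * lam_y * Z.T @ z, u_t])
    g = np.linalg.solve(KKT, rhs)[:n]
    return Hy @ g

samples = np.array([solve_pred(mu_Z + sig_Z * rng.standard_normal(mu_Z.shape),
                               mu_z + sig_z * rng.standard_normal(4))
                    for _ in range(200)])
y_mean, y_std = samples.mean(axis=0), samples.std(axis=0)   # empirical stats
```

The per-sample solves are independent, which is one reason the scheme parallelizes well.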
2) Wasserstein Distance based Prediction:
Intuitively, the regularization term $\|Zg - z\|_2^2$ in the prediction problem (16) can be considered the distance between the mean of $Zg$ and $z$. To formulate a more rigorous scheme based on a probabilistic learner, we propose to minimize the Wasserstein distance between $Zg$ and $z$. First, each entry of the probabilistic learner output is approximated by a Gaussian distribution. Specifically, the $i$-th column of the matrix $Z$ is approximated
Fig. 1: Illustration of the lifting-function learning framework.

by $\mathcal{N}(\mu_{i,l}, \Sigma_{i,l})$, whose covariance matrix is diagonal,
$$\Sigma_{i,l} = \mathrm{diag}(\sigma_{i,1,l}, \dots, \sigma_{i,n_{\phi_u},l}),$$
and we denote the vector composed of the diagonal elements by $\sigma_{i,l}$. Accordingly, the $i$-th element of the vector $z$ is approximated by $\mathcal{N}(\mu_i, \sigma_i)$ and we denote $z \sim \mathcal{N}(\mu_z, \sigma_z)$.

Remark 1:
• It is noteworthy that if a Gaussian process is used to learn the basis functions, no approximation is required as the output is already Gaussian.
• To better satisfy the condition of a diagonal covariance matrix $\Sigma_{i,l}$, it is recommended to replace the Hankel matrices with Page matrices. In comparison with the definition in (1), a depth-$L$ Page matrix of a sequence $\mathbf{w}$ is defined as
$$\mathcal{P}_L(\mathbf{w}) := \begin{bmatrix} w_0 & w_L & \dots & w_{(M-1)L} \\ w_1 & w_{L+1} & \dots & w_{(M-1)L+1} \\ \vdots & \vdots & \ddots & \vdots \\ w_{L-1} & w_{2L-1} & \dots & w_{ML-1} \end{bmatrix}.$$

Based on this approximation, $Zg$ is again Gaussian, which is denoted $Zg \sim \mathcal{N}(\mu_{Zg}, \Sigma_{Zg})$ for compactness. To enable a prediction scheme, we state the following lemma.

Lemma 2:
The squared second Wasserstein distance between $Zg$ and $z$ is bounded by
$$W_2(Zg, z)^2 \leq \|\mu_{Zg} - \mu_z\|_2^2 + \|\Sigma_{Zg} - \Sigma_z\|_*. \quad (17)$$

Proof: The second Wasserstein distance between two Gaussian distributions [55] satisfies
$$W_2(\mathcal{N}(\mu_{Zg}, \Sigma_{Zg}), \mathcal{N}(\mu_z, \Sigma_z))^2 = \|\mu_{Zg} - \mu_z\|_2^2 + \mathrm{Tr}(\Sigma_{Zg} + \Sigma_z) - 2\,\mathrm{Tr}\!\left( \big( \Sigma_z^{1/2} \Sigma_{Zg} \Sigma_z^{1/2} \big)^{1/2} \right), \quad (18)$$
where $\mu_{Zg} = \sum_{i=1}^{n_c} \mu_{i,l}\, g_i$ with $g_i$ being the $i$-th entry of $g$. The covariance matrix of $Zg$ is calculated as
$$\Sigma_{Zg} = (g^\top \otimes I)\,\mathrm{Cov}(\mathrm{vec}(Z))\,(g \otimes I), \quad \text{with} \quad \mathrm{Cov}(\mathrm{vec}(Z)) = \mathrm{blkdiag}(\Sigma_{1,l}, \dots, \Sigma_{n_c,l}),$$
therefore we conclude
$$\Sigma_{Zg} = \sum_{i=1}^{n_c} g_i^2\, \Sigma_{i,l}. \quad (19)$$
Since $\Sigma_z$ and all $\Sigma_{i,l}$ are diagonal, $\Sigma_{Zg}\Sigma_z = \Sigma_z\Sigma_{Zg}$, and (18) can be reformulated as
$$W_2(\mathcal{N}(\mu_{Zg}, \Sigma_{Zg}), \mathcal{N}(\mu_z, \Sigma_z))^2 = \|\mu_{Zg} - \mu_z\|_2^2 + \mathrm{Tr}\!\left(\Sigma_{Zg} + \Sigma_z - 2(\Sigma_{Zg}\Sigma_z)^{1/2}\right) = \|\mu_{Zg} - \mu_z\|_2^2 + \left\| \Sigma_{Zg}^{1/2} - \Sigma_z^{1/2} \right\|_F^2, \quad (20)$$
where $\|\cdot\|_F$ denotes the Frobenius norm. This objective function has a clear interpretation: the first term quantifies the distance between the means of the two Gaussian distributions, while the second measures the discrepancy between the covariance matrices. If the derived $\Sigma_{Zg}$ in (19) is substituted into (20), the evaluation of the resulting metric is numerically ill-conditioned. The Frobenius norm term is therefore further relaxed with the Powers-Størmer inequality [56]:
$$\mathrm{Tr}(A^\alpha B^{1-\alpha}) \geq \tfrac{1}{2}\,\mathrm{Tr}(A + B - |A - B|), \quad 0 \leq \alpha \leq 1, \quad (21)$$
where $A$, $B$ are positive semidefinite and $|A|$ is the positive square root of the matrix $A^* A$. We then have
$$W_2(\mathcal{N}(\mu_{Zg}, \Sigma_{Zg}), \mathcal{N}(\mu_z, \Sigma_z))^2 \leq \|\mu_{Zg} - \mu_z\|_2^2 + \mathrm{Tr}(|\Sigma_{Zg} - \Sigma_z|) = \|\mu_{Zg} - \mu_z\|_2^2 + \|\Sigma_{Zg} - \Sigma_z\|_*,$$
with $\|\cdot\|_*$ denoting the nuclear norm.

Based on this lemma, the prediction problem with a probabilistic learner is reformulated as
$$\tilde{\mathbf{y}} = \mathcal{H}_L(\mathbf{y}_{d,l})\tilde{g}, \quad (22)$$
$$\tilde{g} = \arg\min_g \; \|\mu_{Zg} - \mu_z\|_2^2 + \|\Sigma_{Zg} - \Sigma_z\|_* \quad \text{s.t.} \quad \mathcal{H}_L(\mathbf{u}_{d,l}) g = \tilde{\mathbf{u}}, \; z = \phi_{u,\theta}(\tilde{\mathbf{u}}_p, \tilde{\mathbf{y}}_p).$$
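The bound (17) is easy to verify numerically in the diagonal-covariance case that the scheme assumes. The check below is our own (random illustrative data); for diagonal covariances the nuclear norm reduces to the sum of absolute diagonal differences.

```python
# Numerical check of the bound (17) for diagonal covariances:
# W2^2 = ||mu1-mu2||^2 + ||S1^{1/2}-S2^{1/2}||_F^2 <= ||mu1-mu2||^2 + ||S1-S2||_*
import numpy as np

rng = np.random.default_rng(4)
mu1, mu2 = rng.standard_normal(5), rng.standard_normal(5)
s1 = rng.uniform(0.1, 2.0, 5)   # diagonal of Sigma_Zg
s2 = rng.uniform(0.1, 2.0, 5)   # diagonal of Sigma_z

mean_term = np.sum((mu1 - mu2) ** 2)
w2_sq = mean_term + np.sum((np.sqrt(s1) - np.sqrt(s2)) ** 2)   # exact (20)
bound = mean_term + np.sum(np.abs(s1 - s2))                    # relaxation (17)
```

The scalar inequality $(\sqrt{a}-\sqrt{b})^2 \leq |a-b|$ applied entrywise is exactly the Powers-Størmer relaxation in the diagonal case.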
The upper bound (17) is non-smooth, as the absolute value evaluation is ill-conditioned around 0 [57, Chapter 3]. The absolute value is smoothed by an approach similar to the Huber loss [47, Chapter 2], which is defined as
$$L_\delta(a) = \begin{cases} \tfrac{1}{2} a^2 & \text{for } |a| < \delta, \\ \delta\left(|a| - \tfrac{1}{2}\delta\right) & \text{otherwise.} \end{cases}$$
The evaluation of the $i$-th diagonal element of $|\Sigma_{Zg} - \Sigma_z|$ is then reformulated as
$$(|\Sigma_{Zg} - \Sigma_z|)_{ii} = \begin{cases} \tfrac{1}{2}(\sigma_{Zg,i} - \sigma_{z,i})^2 & |\sigma_{Zg,i} - \sigma_{z,i}| < \delta, \\ \delta\left(|\sigma_{Zg,i} - \sigma_{z,i}| - \tfrac{1}{2}\delta\right) & \text{otherwise.} \end{cases}$$
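The smoothing can be sketched in a few lines (our own sketch, assuming the standard Huber form: quadratic near zero, linear in the tails, continuous at $|a| = \delta$):

```python
# Sketch of the Huber-style smoothing of |a| from Remark 2 (assumed form).
import numpy as np

def huber(a, delta=0.5):
    """0.5*a^2 for |a| < delta, delta*(|a| - 0.5*delta) otherwise."""
    a = np.asarray(a, dtype=float)
    small = np.abs(a) < delta
    return np.where(small, 0.5 * a ** 2, delta * (np.abs(a) - 0.5 * delta))
```

Both branches agree at $|a| = \delta$ (value $\tfrac{1}{2}\delta^2$) and have matching slopes, so the surrogate is continuously differentiable, which removes the kink that makes the nuclear-norm term hard for smooth solvers.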
IV. KOOPMAN-BASED DATA-DRIVEN PREDICTIVE CONTROL
The DeePC framework (4) can be extended to nonlinear systems by integrating the equality constraints with (14). However, the DeePC formulation suffers from the problem that the prediction step interweaves with the control step. In other words, the algorithm may use a non-optimal prediction for control. When the penalty factor $\lambda_y$ is not sufficiently large and the system is initialized away from the reference, the algorithm tends to compensate the difference with a relatively large $\sigma_y$, which results in control failure. To tackle this problem, we propose a bi-level programming formulation [58], where the prediction step is independent of the control step:
$$\min_{\mathbf{u}, \mathbf{y}, g} \; \sum_{k=0}^{N-1} \left( \|y_k - r_{t+k}\|_Q^2 + \|u_k\|_R^2 \right)$$
$$\text{subject to} \quad u_k \in \mathcal{U}, \; y_k \in \mathcal{Y}, \; \forall k \in \{0, \dots, N-1\}, \quad Y_f g = \mathbf{y} \; \text{ for some } g \in \Phi(\mathbf{u}), \quad (23a)$$
$$\Phi(\mathbf{u}) = \arg\min_g \; \lambda_g \|g\|_2^2 + \lambda_y \|Zg - z\|_2^2 \quad \text{subject to} \quad \begin{bmatrix} U_p \\ U_f \end{bmatrix} g = \begin{bmatrix} \mathbf{u}_{ini} \\ \mathbf{u} \end{bmatrix}, \; \text{with parameter } z = \phi_{u,\theta}(\mathbf{u}_{ini}, \mathbf{y}_{ini}). \quad (23b)$$
The bi-level problem introduces a hierarchical structure where the upper-level problem (23a) represents the control step and the lower-level problem (23b) functions as the prediction step. Note that, compared to DeePC (4), in (23b) the squares of the two norms are used so that the objective function of the lower-level problem remains smooth.

A commonly used approach to solving a bi-level problem is to transform it into a single-level problem. Applying optimality conditions and introducing the optimal value function are the two main categories of transformation approaches if the bi-level problem fulfills certain conditions [58, Chapter 5]. Here we present the result of replacing the lower-level problem (23b) with its KKT conditions:
$$\min_{g, \mathbf{u}, \mathbf{y}, \mu_1, \mu_2} \; \sum_{k=0}^{N-1} \left( \|y_k - r_{t+k}\|_Q^2 + \|u_k\|_R^2 \right)$$
$$\text{subject to} \quad \begin{bmatrix} U_p \\ U_f \\ Y_f \end{bmatrix} g = \begin{bmatrix} \mathbf{u}_{ini} \\ \mathbf{u} \\ \mathbf{y} \end{bmatrix}, \qquad 2\lambda_g g^\top + 2\lambda_y (Zg - z)^\top Z + \mu_1^\top U_p + \mu_2^\top U_f = 0,$$
$$u_k \in \mathcal{U}, \; y_k \in \mathcal{Y}, \; \forall k \in \{0, \dots, N-1\}. \quad (24)$$
This equivalent single-level problem is solvable by many optimization toolboxes.
Remark 3: Solving a bi-level problem is in general NP-hard. To the best of our knowledge, there is no valid approach to solving a general bi-level optimization problem whose lower-level problem is non-convex. It is worth mentioning that the Wasserstein distance based prediction problem presented in Section III-C is non-convex. We therefore leave the control formulation integrated with Wasserstein distance based prediction as future work.

The algorithm is summarized as follows:
Algorithm 1: Koopman-based DeePC
for t = T_ini, T_ini + 1, ... do
    Set z = φ_{u,θ}(u_ini, y_ini);
    Solve (24) to obtain an optimal input sequence u*;
    Set u_t = u*(0);
    Apply u_t to the system and measure y_t;
end
where u*(0) denotes the first element of u*.

V. SIMULATION RESULTS
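Before turning to the examples, the structure of the prediction step can be made concrete. The lower-level problem (23b) is an equality-constrained strongly convex QP, so its KKT conditions reduce to a single linear system. The sketch below solves such a problem with randomly generated stand-ins for $Z$, $z$, and the stacked Hankel blocks; all names, dimensions, and weights here are illustrative assumptions, not the values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: decision variable g, lifted target z, equality constraints.
n_g, n_z, n_c = 8, 4, 3
lam_g, lam_y = 0.1, 10.0       # regularization weights (placeholders)

Z = rng.standard_normal((n_z, n_g))   # stand-in for the lifted Hankel block Z
z = rng.standard_normal(n_z)          # stand-in for phi_{u,theta}(u_ini, y_ini)
A = rng.standard_normal((n_c, n_g))   # stand-in for the stacked [U_p; U_f]
b = rng.standard_normal(n_c)          # stand-in for the stacked [u_ini; u]

# KKT system of  min_g lam_g ||g||^2 + lam_y ||Zg - z||^2  s.t.  A g = b:
#   [ 2(lam_g I + lam_y Z^T Z)   A^T ] [ g  ]   [ 2 lam_y Z^T z ]
#   [ A                           0  ] [ mu ] = [ b             ]
H = 2.0 * (lam_g * np.eye(n_g) + lam_y * Z.T @ Z)
KKT = np.block([[H, A.T], [A, np.zeros((n_c, n_c))]])
rhs = np.concatenate([2.0 * lam_y * Z.T @ z, b])
sol = np.linalg.solve(KKT, rhs)
g_opt, mu = sol[:n_g], sol[n_g:]

print(np.allclose(A @ g_opt, b))  # -> True: equality constraints hold at the optimum
```

The same stationarity row reappears, with multipliers split into $\mu_1$ and $\mu_2$, as the nonlinear equality constraint of the single-level problem (24).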
In this section, we first illustrate prediction results on a Van der Pol oscillator using both Monte-Carlo prediction and Wasserstein-distance-based prediction. We then present a numerical experiment on controlling a bilinear motor. Finally, we demonstrate the potential of the proposed scheme on large-scale problems with an example of controlling the nonlinear Korteweg-de Vries equation. The source code of the numerical examples can be accessed at https://github.com/RencciW/DataDrivenControlCode.

A. Stochastic prediction
We show prediction results for trajectories of a Van der Pol oscillator,

$$\dot{x} = \begin{bmatrix} x_2 \\ \mu (1 - x_1^2)\, x_2 - x_1 + u \end{bmatrix}$$

with $\mu = 1$. We train a 5-layer network (layer widths 2, 12, 22, 12 and 12) on 1100 data points sampled from 100 random trajectories generated by the Van der Pol oscillator. Every layer except the input layer is followed by a dropout layer. We choose ReLU as the activation function for each hidden layer and Adam as the optimizer. The code is implemented in PyTorch [59]. From each of the 100 trajectories, we sample 3 trajectory fragments for the construction of the Hankel matrix. For a better comparison with the following results, in the test phase we choose 3 trajectory fragments from each of 24 trajectories to formulate the Hankel matrix with $T_{\mathrm{ini}} = 1$ and $N = 10$. The trained network is tested on 50 data points sampled from trajectories that are independent of the data used for training and Hankel matrix formulation. We forward the same data through the network 120 times and compute the mean and standard deviation of the prediction. The results are shown in Fig. 2; the light blue region indicates two standard deviations.

B. Prediction based on Wasserstein distance
Fig. 2: Prediction result using the Koopman operator learned by MC dropout (panels (a) and (b): $x_1$ and $x_2$).

We test the new loss function with Van der Pol data. From each of 24 different trajectories, we choose one trajectory fragment to formulate the Page matrix. We first lift the data for the Page matrix construction with the dropout neural network trained in the previous subsection. The mean and the standard deviation of the lifted data are computed for estimation. The data are then passed to the prediction problem to obtain an optimizer $g^*$. The future trajectory is predicted as $\hat{x} = X_f g^*$, where $X_f$ is the Hankel matrix block used for prediction.

We compute the mean squared error (MSE) of the prediction based on the proposed Wasserstein loss and the original quadratic loss at a single time step. The comparison is summarized in the following table:

                    x_1      x_2
Wasserstein loss    0.0817   0.0855
Original loss       3.8707   1.3909
TABLE I: MSE comparison between prediction results computed with different loss functions.

The Wasserstein loss clearly outperforms the original quadratic loss when the uncertainty of the lifting function is taken into account.

Fig. 3: Prediction result (panels (a) and (b): $x_1$ and $x_2$) using the loss function derived from the Wasserstein distance.
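The Wasserstein loss compares predicted distributions rather than point predictions, which is why it penalizes a mismatch in spread even when the means agree. For intuition, between two Gaussians the 2-Wasserstein distance has a closed form; the sketch below shows the 1-D case only, as a toy illustration rather than the multivariate loss actually minimized in Section III-C.

```python
import numpy as np

def w2_gaussian_1d(m1, s1, m2, s2):
    """Closed-form 2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2)."""
    return np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

# Matching means but mismatched standard deviations still incur a cost,
# unlike a quadratic loss on the means alone.
print(w2_gaussian_1d(0.0, 1.0, 0.0, 3.0))  # -> 2.0
print(w2_gaussian_1d(1.0, 1.0, 4.0, 5.0))  # -> 5.0
```

This is the property exploited in Table I: a predictor trained under lifting-function uncertainty is rewarded for matching the whole predictive distribution.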
C. Control with Koopman-based DeePC
1) Control of a bilinear motor:
We first compare the control algorithm with the Koopman operator-based MPC controller (K-MPC) proposed in [22] by controlling a bilinear model of a DC motor [60]:

$$\begin{aligned}
\dot{x}_1 &= -(R_a/L_a)\, x_1 - (k_m/L_a)\, x_2 u + u_a/L_a \\
\dot{x}_2 &= -(B/J)\, x_2 + (k_m/J)\, x_1 u - \tau_l/J \\
y &= x_2
\end{aligned}$$

where $x_1$ is the rotor current, $x_2$ the angular velocity; the control input $u$ is the stator current and the output $y$ is the angular velocity. The parameter values, with $u_a = 60$, are those reported in [60]. The physical constraint on the control input is a symmetric interval bound on $u$.

We use 40 trajectories to construct a mosaic Hankel matrix. All trajectories are randomly initialized on the unit box, and the control inputs are drawn from a uniform distribution. We choose 40 thin plate spline radial basis functions, with centers selected uniformly at random, as lifting functions.
Since the system states are not directly measurable, we choose the number of delays $n_d = 1$. We define $C = [1, 0, \dots, 0]$, with output weight $Q = Q_{N_p} = 10$ and a small input weight $R$. The prediction horizon is $N = 10$. Since the system is linear in the lifted space, we choose $T_{\mathrm{ini}} = 1$. The reference $r(t)$ is a scaled cosine signal, and box constraints are imposed on the output $y$. We simulate the closed loop and compare the result with the model-based K-MPC method proposed in [22].

Fig. 4: Feedback control input of a bilinear motor.
Fig. 5: Angular velocity of a bilinear motor.

As shown in Figs. 4 and 5, the algorithm is capable of following the reference without violating the constraints, although, compared to K-MPC, the input computed by our method oscillates gently where the input trajectory is non-smooth.
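The bilinear motor dynamics above can be rolled out with a standard ODE integrator. The sketch below is a minimal open-loop simulation stub, not the closed-loop controller; the parameter values here are placeholders for illustration and are not the (truncated) values of [60].

```python
import numpy as np
from scipy.integrate import solve_ivp

# Placeholder parameter values (illustrative only, NOT the values of [60]).
La, Ra, km, J, B, tau_l, ua = 0.3, 12.0, 0.25, 0.02, 0.01, 1.0, 60.0

def motor(t, x, u):
    """Bilinear DC motor: x1 = rotor current, x2 = angular velocity, u = stator current."""
    x1, x2 = x
    dx1 = -(Ra / La) * x1 - (km / La) * x2 * u + ua / La
    dx2 = -(B / J) * x2 + (km / J) * x1 * u - tau_l / J
    return [dx1, dx2]

# Roll out one sampling interval under a constant stator current.
u = 0.5
sol = solve_ivp(motor, (0.0, 0.01), [0.0, 0.0], args=(u,), rtol=1e-8)
y = sol.y[1, -1]  # measured output: angular velocity at the end of the interval
print(sol.success)  # -> True
```

Repeating such rollouts under random initial states and uniformly sampled inputs is how the mosaic Hankel matrix data described above would be generated.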
2) Control of nonlinear Korteweg–de Vries equation:
Our next simulation controls the nonlinear Korteweg-de Vries (KdV) equation, which models the propagation of acoustic waves in a plasma or shallow-water waves [61]. The equation is

$$\frac{\partial y(t,x)}{\partial t} + y(t,x)\, \frac{\partial y(t,x)}{\partial x} + \frac{\partial^3 y(t,x)}{\partial x^3} = u(t,x),$$

where $y(t,x)$ is the unknown function and $u(t,x)$ the control input, with $x \in [-\pi, \pi]$. The space is discretized into 128 points. The input is assumed to be of the form $u(t,x) = \sum_{i=1}^{3} u_i(t)\, v_i(x)$, where the $v_i$ are three Gaussian spatial basis functions centered at $c_2 = 0$ and at two points $c_1 = -c_3$ placed symmetrically in the domain. The input is box-constrained. We initialize the system by convexly combining three fixed spatial profiles: two Gaussian bumps and a negative squared sine. We choose the state itself, the elementwise square of the state, and the elementwise product of the state with its periodic shift as the lifting functions. $Q$ is the identity matrix and $R$ is the zero matrix. The prediction horizon is $N = 5$, and $T_{\mathrm{ini}}$ remains equal to 1. The Hankel matrix is formulated with 63 trajectories.

Fig. 6: Feedback control input of KdV.
Fig. 7: Tracking result.

Despite the large dimension of the problem, the algorithm is still capable of computing the optimal input in acceptable time and tracking the reference.
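The lifting used in the KdV example (the state, its elementwise square, and its elementwise product with a periodic shift) is straightforward to write down. Below is a minimal NumPy sketch; the shift direction and the test profile are illustrative assumptions.

```python
import numpy as np

def lift_kdv(y):
    """Lifting for the KdV example: the state itself, its elementwise square,
    and its elementwise product with a one-step periodic shift of itself."""
    return np.concatenate([y, y ** 2, y * np.roll(y, 1)])

# 128 spatial grid points on [-pi, pi), matching the discretization above.
x = np.linspace(-np.pi, np.pi, 128, endpoint=False)
y0 = np.exp(-(x - np.pi / 2) ** 2)  # a Gaussian bump as an illustrative profile
phi = lift_kdv(y0)
print(phi.shape)  # -> (384,)
```

With 128 grid points, the lifted state has dimension 384, which illustrates why the KdV example stresses the scalability of the approach.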
VI. CONCLUSION
In this work, we extend a data-driven predictive control method to nonlinear systems. The underlying idea is to lift the system, via the Koopman operator, into an infinite-dimensional space in which the dynamics evolve linearly along the nonlinear system trajectories. We propose a purely data-driven framework for approximating the nonlinear lifting functions, together with an account of the uncertainty of the approximation, which enables a novel data-driven simulation scheme based on the Wasserstein distance.

REFERENCES

[1] Z.-S. Hou and Z. Wang, "From model-based control to data-driven control: Survey, classification and perspective,"
Information Sciences, vol. 235, pp. 3–35, 2013.
[2] L. Ljung, "System identification," Wiley Encyclopedia of Electrical and Electronics Engineering, pp. 1–19, 1999.
[3] N. Lanzetti, Y. Z. Lian, A. Cortinovis, L. Dominguez, M. Mercangöz, and C. Jones, "Recurrent neural network based MPC for process industries," IEEE, 2019, pp. 1005–1010.
[4] J. Kocijan, Modelling and Control of Dynamic Systems Using Gaussian Process Models. Springer, 2016.
[5] M. C. Campi, A. Lecchini, and S. M. Savaresi, "Virtual reference feedback tuning: a direct method for the design of feedback controllers," Automatica, vol. 38, no. 8, pp. 1337–1346, 2002.
[6] H. Hjalmarsson, "Iterative feedback tuning—an overview," International Journal of Adaptive Control and Signal Processing, vol. 16, no. 5, pp. 373–395, 2002.
[7] D. A. Bristow, M. Tharayil, and A. G. Alleyne, "A survey of iterative learning control," IEEE Control Systems Magazine, vol. 26, no. 3, pp. 96–114, 2006.
[8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[9] J. Coulson, J. Lygeros, and F. Dörfler, "Data-enabled predictive control: In the shallows of the DeePC," IEEE, 2019, pp. 307–312.
[10] I. Markovsky and P. Rapisarda, "On the linear quadratic data-driven control," IEEE, 2007, pp. 5313–5318.
[11] J. Berberich, A. Koch, C. W. Scherer, and F. Allgöwer, "Robust data-driven state-feedback design," IEEE, 2020, pp. 1532–1538.
[12] C. De Persis and P. Tesi, "Formulas for data-driven control: Stabilization, optimality, and robustness," IEEE Transactions on Automatic Control, vol. 65, no. 3, pp. 909–924, 2019.
[13] D. Alpago, F. Dörfler, and J. Lygeros, "An extended Kalman filter for data-enabled predictive control," IEEE Control Systems Letters, vol. 4, no. 4, pp. 994–999, 2020.
[14] M. Yin, A. Iannelli, and R. S. Smith, "Maximum likelihood estimation in data-driven modeling and control," arXiv preprint arXiv:2011.00925, 2020.
[15] J. Berberich and F. Allgöwer, "A trajectory-based framework for data-driven system analysis and control," IEEE, 2020, pp. 1365–1370.
[16] A. Bisoffi, C. De Persis, and P. Tesi, "Data-based stabilization of unknown bilinear systems with guaranteed basin of attraction," Systems & Control Letters, vol. 145, p. 104788, 2020.
[17] J. G. Rueda-Escobedo and J. Schiffer, "Data-driven internal model control of second-order discrete Volterra systems," IEEE, 2020, pp. 4572–4579.
[18] M. Guo, C. De Persis, and P. Tesi, "Data-driven stabilization of nonlinear polynomial systems with noisy data," arXiv preprint arXiv:2011.07833, 2020.
[19] Y. Lian and C. N. Jones, "Nonlinear data-enabled prediction and control," arXiv preprint arXiv:2101.03187, 2021.
[20] B. O. Koopman and J. v. Neumann, "Dynamical systems of continuous spectra," Proceedings of the National Academy of Sciences.
[21] Proceedings of the National Academy of Sciences.
[22] M. Korda and I. Mezić, "Linear predictors for nonlinear dynamical systems: Koopman operator meets model predictive control," Automatica, vol. 93, pp. 149–160, 2018.
[23] A. Surana and A. Banaszuk, "Linear observer synthesis for nonlinear systems using Koopman operator framework," IFAC-PapersOnLine, vol. 49, no. 18, pp. 716–723, 2016.
[24] S. Peitz and S. Klus, "Koopman operator-based model reduction for switched-system control of PDEs," Automatica, vol. 106, pp. 184–191, 2019.
[25] M. E. Villanueva, C. Jones, and B. Houska, "Towards global optimal control via Koopman lifts," arXiv preprint arXiv:2003.01265, 2020.
[26] L. D. Landau and E. M. Lifshitz, Quantum Mechanics: Non-Relativistic Theory. Elsevier, 2013, vol. 3.
[27] M. Korda and I. Mezić, "On convergence of extended dynamic mode decomposition to the Koopman operator," Journal of Nonlinear Science, vol. 28, no. 2, pp. 687–710, 2018.
[28] M. O. Williams, I. G. Kevrekidis, and C. W. Rowley, "A data-driven approximation of the Koopman operator: Extending dynamic mode decomposition," Journal of Nonlinear Science, vol. 25, no. 6, pp. 1307–1346, 2015.
[29] Y. Kawahara, "Dynamic mode decomposition with reproducing kernels for Koopman spectral analysis," in Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016, pp. 919–927.
[30] S. Klus, I. Schuster, and K. Muandet, "Eigendecompositions of transfer operators in reproducing kernel Hilbert spaces," Journal of Nonlinear Science, vol. 30, no. 1, pp. 283–315, 2020.
[31] N. Takeishi, Y. Kawahara, and T. Yairi, "Learning Koopman invariant subspaces for dynamic mode decomposition," arXiv preprint arXiv:1710.04340, 2017.
[32] B. Lusch, J. N. Kutz, and S. L. Brunton, "Deep learning for universal linear embeddings of nonlinear dynamics," Nature Communications, vol. 9, no. 1, pp. 1–10, 2018.
[33] Y. Lian and C. N. Jones, "Learning feature maps of the Koopman operator: A subspace viewpoint," IEEE, 2019, pp. 860–866.
[34] S. J. Qin, "An overview of subspace identification," Computers & Chemical Engineering, vol. 30, no. 10–12, pp. 1502–1513, 2006.
[35] F. Dörfler, J. Coulson, and I. Markovsky, "Bridging direct & indirect data-driven control formulations via regularizations and relaxations," arXiv preprint arXiv:2101.01273, 2021.
[36] J. C. Willems, P. Rapisarda, I. Markovsky, and B. L. De Moor, "A note on persistency of excitation," Systems & Control Letters, vol. 54, no. 4, pp. 325–329, 2005.
[37] I. Markovsky and P. Rapisarda, "Data-driven simulation and control," International Journal of Control, vol. 81, no. 12, pp. 1946–1959, 2008.
[38] J. Coulson, J. Lygeros, and F. Dörfler, "Regularized and distributionally robust data-enabled predictive control," IEEE, 2019, pp. 2696–2701.
[39] H. J. van Waarde, C. De Persis, M. K. Camlibel, and P. Tesi, "Willems' fundamental lemma for state-space systems and its extension to multiple datasets," IEEE Control Systems Letters, vol. 4, no. 3, pp. 602–607, 2020.
[40] A. Bittracher, P. Koltai, and O. Junge, "Pseudogenerators of spatial transfer operators," SIAM Journal on Applied Dynamical Systems, vol. 14, no. 3, pp. 1478–1517, 2015.
[41] M. O. Williams, M. S. Hemati, S. T. Dawson, I. G. Kevrekidis, and C. W. Rowley, "Extending data-driven Koopman analysis to actuated systems," IFAC-PapersOnLine, vol. 49, no. 18, pp. 704–709, 2016.
[42] J. L. Proctor, S. L. Brunton, and J. N. Kutz, "Generalizing Koopman theory to allow for inputs and control," SIAM Journal on Applied Dynamical Systems, vol. 17, no. 1, pp. 909–930, 2018.
[43] S. G. Krantz and H. R. Parks, "Introduction to the implicit function theorem," in The Implicit Function Theorem. Springer, 2003, pp. 1–12.
[44] L. El Ghaoui, F. Gu, B. Travacca, A. Askari, and A. Y. Tsai, "Implicit deep learning," arXiv preprint arXiv:1908.06315, vol. 2, 2019.
[45] M. Zanon and S. Gros, "Safe reinforcement learning using robust MPC," IEEE Transactions on Automatic Control, 2020.
[46] A. V. Fiacco, Mathematical Programming with Data Perturbations. CRC Press, 2020.
[47] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis. Springer Science & Business Media, 2009, vol. 317.
[48] B. Bank, J. Guddat, D. Klatte, B. Kummer, and K. Tammer, Non-Linear Parametric Optimization. Springer, 1982.
[49] A. Agrawal, S. Barratt, S. Boyd, E. Busseti, and W. M. Moursi, "Differentiating through a cone program," arXiv preprint arXiv:1904.09043, 2019.
[50] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, vol. 1, no. 2.
[51] C. E. Rasmussen, "Gaussian processes in machine learning," in Summer School on Machine Learning. Springer, 2003, pp. 63–71.
[52] M. Titsias, "Variational learning of inducing variables in sparse Gaussian processes," in Artificial Intelligence and Statistics. PMLR, 2009, pp. 567–574.
[53] M. Titsias and N. D. Lawrence, "Bayesian Gaussian process latent variable model," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2010, pp. 844–851.
[54] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[55] C. Villani, Optimal Transport: Old and New. Springer Science & Business Media, 2008, vol. 338.
[56] R. T. Powers and E. Størmer, "Free states of the canonical anticommutation relations," Communications in Mathematical Physics, vol. 16, no. 1, pp. 1–33, 1970.
[57] A. Beck, First-Order Methods in Optimization. SIAM, 2017.
[58] S. Dempe, Foundations of Bilevel Programming. Springer Science & Business Media, 2002.
[59] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," 2017.
[60] S. Daniel-Berhe and H. Unbehauen, "Parameter estimation of the nonlinear dynamics of a thyristor driven DC-motor experimental set-up using HMF-method," IFAC Proceedings Volumes