Continuous-time system identification with neural networks: model structures and fitting criteria
Marco Forgione, Dario Piga
IDSIA Dalle Molle Institute for Artificial Intelligence
SUPSI-USI, Manno, Switzerland
{marco.forgione, dario}@idsia.ch
June 5, 2020
Abstract
This paper presents tailor-made neural model structures and two custom fitting criteria for learning dynamical systems. The proposed framework is based on a representation of the system behavior in terms of continuous-time state-space models. The sequence of hidden states is optimized along with the neural network parameters in order to minimize the difference between measured and estimated outputs, and at the same time to guarantee that the optimized state sequence is consistent with the estimated system dynamics. The effectiveness of the approach is demonstrated through three case studies, including two public system identification benchmarks based on experimental data.
In recent years, deep learning has advanced at a tremendous pace and is now the core methodology behind cutting-edge technologies such as speech recognition, image classification and captioning, language translation, and autonomous driving (Bengio et al., 2009; LeCun, Bengio, & Hinton, 2015; Schmidhuber, 2015). These impressive achievements are attracting ever increasing investments both from the private and the public sector, fueling further research in this field.

A good deal of the advancement in the deep learning area is publicly accessible, both in terms of scientific publications and software tools. For instance, highly optimized and user-friendly deep learning frameworks are readily available (Abadi et al., 2015; Paszke et al., 2017), and are often distributed under permissive open-source licenses. Using the high-level functionalities of a deep learning framework and following good practice, even a novice user can tackle standard machine learning tasks (once considered extremely hard) such as image classification with moderate effort.

An experienced practitioner can employ the same deep learning framework at a lower level to tackle non-standard learning problems, by defining customized models and objective functions to be optimized, and using operators such as neural networks as building blocks. The practitioner is free from the burden of writing optimization code from scratch for every particular problem, which would be tedious and error-prone. In fact, as a built-in feature, modern deep learning engines can compute the derivatives of a supplied objective function with respect to free tunable parameters by implementing the celebrated back-propagation algorithm (Rumelhart, Hinton, Williams, et al., 1988). In turn, this enables convenient setup of any gradient-based optimization method.

An exciting, challenging, and yet partially unexplored application field is system identification with tailor-made model structures and fitting criteria.
In this context, neural networks can be used to describe uncertain components of the dynamics, while retaining structural (physical) knowledge, if available. Furthermore, the fitting criterion can be specialized to take into account the modeler's ultimate goal, which could be prediction, failure detection, state estimation, control design, simulation, etc.

The choice of the cost function may also be influenced by computational considerations. In this paper, in particular, models are assessed in terms of their simulation performance. In this setting, from a theoretical perspective, simulation error minimization is generally the best fitting criterion. However, evaluating the simulation error loss and its derivative may be prohibitively expensive from a computational perspective for dynamical models containing neural networks. Furthermore, simulation over time has an intrinsically sequential nature and offers limited opportunities for parallelization. In this paper, we present two fitting algorithms whose runtime is significantly faster than full simulation error minimization, but which still provide models with high simulation performance.

In the first approach, called truncated simulation error minimization, the neural dynamical model is simulated over (mini)batches of subsequences extracted from the training dataset. This allows parallelization of the model simulation over the different subsequences in a batch and also results in reduced back-propagation cost with respect to a full open-loop simulation. Special care is taken to provide consistent initial conditions for all the simulated subsequences. In fact, these initial conditions are optimized along with the neural network parameters according to a dual objective.
Specifically, the batch cost function takes into account both the distance between the simulated output and the measured output (the fitting objective) and the consistency of all the initial condition variables with the neural model equation (the initial state consistency objective). This cost function is iteratively minimized over the randomly extracted batches through a gradient-based optimization algorithm.

In the second approach, called soft-constrained integration, the neural dynamical model is enforced by a regularization term in the cost function penalizing the violation of a numerical ODE integration scheme applied to the system's (hidden) state variables. These state variables, together with the neural network parameters, are tuned with the dual objective of fitting the measured data and minimizing the penalty term associated with the violation of the numerical integration scheme. In the soft-constrained integration method, simulation through time is thus completely circumvented and the loss function splits up into independent contributions for each time step. This enables a fully parallel implementation of gradient-based optimization.

The use of neural networks in system identification has a long history. For instance, neural AutoRegressive Moving Average with eXogenous inputs models were discussed in S. Chen, Billings, and Grant (1990). Training was performed with the one-step prediction error method (Ljung, 1978) previously developed in the context of linear dynamical systems. In (Horne & Giles, 1995), several
Recurrent Neural Network structures trained by Back-Propagation Through Time are evaluated on a nonlinear system identification task. Although the overall reasoning in these earlier works is similar to ours, their results are hardly comparable with our current contribution, given the huge gap in hardware/software technology.

More recently, a few interesting approaches exploiting modern deep learning concepts and tools for system identification have been proposed. For instance, (Wang, 2017) and (Andersson, Ribeiro, Tiels, Wahlström, & Schön, 2019) discuss the use of Long Short-Term Memory networks and 1-D Convolutional Neural Networks, respectively. Compared to these recent contributions, our work focuses on specialized neural model structures for the identification task at hand. Furthermore, we pose the identification problem directly in a continuous-time setting, which offers several advantages over the discrete-time counterpart (Garnier & Young, 2014; Piga, 2018). First, the majority of physical systems are naturally modeled in a continuous-time framework. Embedding physical knowledge in continuous-time model structures is thus more intuitive, and inspecting a continuous-time identified model is more insightful as some parameters may retain a physical meaning. Second, continuous-time models can handle the case of non-uniformly sampled data. Last, continuous-time identification is generally immune from the numerical issues affecting discrete-time methodologies in the case of high sampling frequency.

The connection between deep learning and dynamical system theory is currently under intensive investigation, see e.g. (Weinan, 2017), and cross-contamination is yielding substantial advances in both fields. On the one hand, modern deep learning architectures are interpreted as discrete-time approximations of an underlying continuous-time neural dynamical system. Exploiting this parallel, (Haber & Ruthotto, 2017) and (Ruthotto & Haber, 2019) analyze the stability properties of existing deep neural architectures through the lens of system theoretic tools. Modified architectures guaranteeing stability by design are also proposed. On the other hand, contributions such as Long, Lu, and Dong (2019); Rackauckas et al. (2020); Raissi, Perdikaris, and Karniadakis (2019) showcase the potential of neural networks for data-driven modeling of dynamical systems described by ordinary and partial differential equations.
With respect to these contributions, our aim is to devise computationally efficient fitting strategies for neural dynamical models that are robust to measurement noise.

The rest of this paper is structured as follows. The overall settings and problem statement are outlined in Section 2. The neural dynamical model structures are introduced in Section 3, and criteria for fitting these model structures to training data are described in Section 4. Simulation results are presented in Section 5 and can be replicated using the codes available as on-line supplementary material. Conclusions and directions for future research are discussed in Section 6.
We are given a dataset D consisting of N input samples U = {u_0, u_1, ..., u_{N-1}} and output samples Y = {y_0, y_1, ..., y_{N-1}}, gathered at time instants T = {t_0 = 0, t_1, ..., t_{N-1}} from an experiment on a dynamical system S. The data-generating system S is assumed to have the continuous-time state-space representation

    ẋ(t) = f(x(t), u(t))      (1a)
    y°(t) = g(x(t)),          (1b)

where x(t) ∈ R^{n_x} is the system state at time t; ẋ(t) denotes the time derivative of x(t); y°(t) ∈ R^{n_y} is the noise-free output; u(t) ∈ R^{n_u} is the system input; f(·,·): R^{n_x} × R^{n_u} → R^{n_x} and g(·): R^{n_x} → R^{n_y} are the state and output mappings, respectively. The measured output y_k at time instant t_k, k = 0, 1, ..., N-1, is corrupted by a noise term η_k, i.e., y_k = y°(t_k) + η_k. We assume that the input u(t) can be reconstructed (or reasonably approximated) for all time instants t ∈ [t_0, t_{N-1}] ⊂ R from the samples U.

In this paper, we introduce flexible neural model structures that are suitable to represent generic dynamical systems as (1), allowing the modeler to embed domain knowledge to various degrees and to exploit neural networks to describe unknown model components. Furthermore, we present fitting criteria and algorithms to train these neural dynamical model structures. Overall, we aim at fitting a neural network model that exploits the dynamic constraints and different forms of available prior knowledge of the data-generating system (1).

Let us consider a model structure M = {M(θ), θ ∈ R^{n_θ}}, where M(θ) represents a dynamical model parametrized by a real-valued vector θ. We refer to neural model structures as structures M where some components of the model M(θ) are described by neural networks. In the following, we introduce possible neural model structures M for dynamical systems.
A general state-space neural model structure has the form

    ẋ = N_f(x, u; θ)      (2a)
    y° = N_g(x; θ),       (2b)

where N_f and N_g are feedforward neural networks of compatible size, parametrized by θ ∈ R^{n_θ}. For notational simplicity, the time dependence of all signals in (2) is omitted. The general structure (2) can be tailored for the identification task at hand. Examples are illustrated in the remainder of this section.
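Such a structure maps naturally onto a deep learning framework. The sketch below is a minimal PyTorch rendition of (2); the class name, layer sizes, and tanh activations are illustrative assumptions, not the exact architecture used in the paper:

```python
import torch
import torch.nn as nn

class StateSpaceModel(nn.Module):
    """Sketch of the general neural state-space structure (2):
    dx/dt = N_f(x, u; theta),  y = N_g(x; theta)."""
    def __init__(self, n_x=2, n_u=1, n_y=1, hidden=64):
        super().__init__()
        # N_f maps the concatenated (state, input) to the state derivative
        self.N_f = nn.Sequential(
            nn.Linear(n_x + n_u, hidden), nn.Tanh(), nn.Linear(hidden, n_x))
        # N_g maps the state to the output
        self.N_g = nn.Sequential(
            nn.Linear(n_x, hidden), nn.Tanh(), nn.Linear(hidden, n_y))

    def f(self, x, u):
        return self.N_f(torch.cat([x, u], dim=-1))  # state mapping

    def g(self, x):
        return self.N_g(x)                          # output mapping

model = StateSpaceModel()
x = torch.randn(10, 2)   # batch of states
u = torch.randn(10, 1)   # batch of inputs
dx = model.f(x, u)       # estimated state derivative
y = model.g(x)           # estimated output
```

Any parametrization with the two mappings f and g exposed separately would serve equally well; the split mirrors equations (2a) and (2b).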
If a linear approximation of the system is available, an appropriate model structure could be

    ẋ = A_L x + B_L u + N_f(x, u; θ)      (3a)
    y° = C_L x + N_g(x, u; θ),            (3b)

where A_L, B_L, and C_L are matrices of compatible size describing the linear system approximation. For example, the values of these matrices can be estimated from the available training dataset D using well-established algorithms for linear system identification (Ljung, 1999; Van Overschee & De Moor, 1994). Although model (3) is not more general than (2), it could be easier to train, as the neural networks N_f and N_g are supposed to capture only residual (nonlinear) dynamics.

If the system state is known to be fully observed, the most convenient representation is

    ẋ = N_f(x, u; θ)      (4a)
    y° = x,               (4b)

where only the state mapping neural network N_f is learned, while the output mapping is fixed to the identity.

Tailor-made architectures can also be used to embed specific physical knowledge in the neural model structure. For instance, let us consider the
Cascaded Tanks System (CTS) schematized in Figure 1. The CTS is a fluid level control system consisting of two tanks with free outlets fed by a pump. Water is pumped from a bottom reservoir into the upper tank by a controlled pump. The water in the upper tank flows through a small opening into the lower tank, and from another small opening from the lower tank to the reservoir.

The system input u is the water flow from the bottom reservoir feeding the upper tank. The state variables x_1 and x_2 are the water levels in the upper and lower tank, respectively. (With some abuse of notation, the subscripts in x_1 and x_2 simply denote the variable name and not a time index; in the rest of the paper, the time dependence of signals will be specified only when necessary.) As in the illustrative example further discussed in Section 5, we consider the case where only the second state x_2 is measured. The following dynamical model for the CTS can be derived from conservation laws and Bernoulli's principle (Schoukens & Noël, 2017):

    ẋ_1 = -k_1 √x_1 + k_2 u        (5a)
    ẋ_2 = k_3 √x_1 - k_4 √x_2      (5b)
    y = x_2,                       (5c)

where k_1, k_2, k_3, and k_4 are fixed coefficients. Based on this physical knowledge, an appropriate neural dynamical model for the CTS is

    ẋ_1 = N_f1(x_1, u)             (6a)
    ẋ_2 = N_f2(x_1, x_2)           (6b)
    y = x_2,                       (6c)

capturing the information that: (i) the system dynamics can be described by a two-dimensional state-space model; (ii) the state x_2 is measured; (iii) the state x_1 does not depend on x_2; and (iv) the state x_2 does not depend directly on u, given the current value of x_1. This neural model takes advantage of the available process knowledge, while leaving representational capabilities to describe unmodeled effects such as fluid viscosity, nonlinearities of the actuators, and water overflow that may happen when the tanks are completely filled.

Embedding physical knowledge in neural model structures is a very active and promising trend in deep learning (Raissi et al., 2019). For instance, recent contributions propose specialized structures that are suitable to describe systems satisfying general physical principles such as energy conservation.
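Returning to the CTS structure (6), the constraints (i)-(iv) can be encoded directly in a model's forward pass: each state derivative gets its own sub-network that only receives the allowed arguments. A minimal sketch (class name, hidden sizes, and activations are our assumptions):

```python
import torch
import torch.nn as nn

class CTSModel(nn.Module):
    """Sketch of the physics-structured CTS model (6):
    x1' = N_f1(x1, u),  x2' = N_f2(x1, x2),  y = x2."""
    def __init__(self, hidden=32):
        super().__init__()
        self.N_f1 = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.N_f2 = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def f(self, x, u):
        x1, x2 = x[..., 0:1], x[..., 1:2]
        dx1 = self.N_f1(torch.cat([x1, u], dim=-1))   # (iii): x1' does not see x2
        dx2 = self.N_f2(torch.cat([x1, x2], dim=-1))  # (iv): x2' does not see u
        return torch.cat([dx1, dx2], dim=-1)

    def g(self, x):
        return x[..., 1:2]                            # (ii): only x2 is measured

model = CTSModel()
x = torch.randn(5, 2)
u = torch.randn(5, 1)
dx = model.f(x, u)
y = model.g(x)
```

Because the structural priors are hard-wired into which inputs each sub-network receives, they hold by construction for any value of the trained parameters.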
In these cases, a physics-based neural network may be used to learn the system's Hamiltonian or Lagrangian function, instead of the individual components of its ODE description.

In this section, we present algorithms aimed at fitting the model structures introduced in Section 3 to the training dataset D. For fixed values of the neural network parameters θ, for a given initial condition x_0 = x(0), and under the model structure (2), the open-loop state simulation x_sim(t) is the solution of the Cauchy problem

    ẋ_sim(t) = N_f(x_sim(t), u(t); θ)      (7a)
    x_sim(0) = x_0,                        (7b)

and the simulated output y_sim(t) is

    y_sim(t; θ, x_0) = N_g(x_sim(t; θ, x_0); θ).      (8)

Different ODE solution schemes (Quarteroni, Sacco, & Saleri, 2010) may be applied to numerically solve problem (7). Formally,

    x_sim(t; θ, x_0) = ODEINT(t; N_f(·,·; θ), u(·), x_0)      (9)

will represent the solution of the Cauchy problem (7) obtained using a numerical scheme of choice (explicit or implicit, single-step or multi-step, single-stage or multi-stage) denoted as ODEINT.

The neural network parameters θ can be obtained by minimizing the simulation error norm

    J(θ, x_0) = (1/N) Σ_{k=0}^{N-1} ‖ y_sim(t_k; θ, x_0) - y_k ‖²      (10)

with respect to both the network parameters θ and the state initial condition x_0, with y_sim defined by Equations (8)-(9).

It is worth remarking that, when ODEINT is an explicit ODE solver, the full computational graph producing the cost function (10) can be constructed using standard differentiable blocks.
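As a concrete illustration of such a graph, the sketch below rolls out (9) with a forward Euler step and back-propagates the simulation error (10) through every solver step. The network size, the output map (first state component), and the synthetic data are assumptions made only for the example:

```python
import torch
import torch.nn as nn

n_x, n_u, N, dt = 2, 1, 50, 0.1

# Stand-in for N_f; any differentiable state mapping would do here.
N_f = nn.Sequential(nn.Linear(n_x + n_u, 32), nn.Tanh(), nn.Linear(32, n_x))
u = torch.randn(N, n_u)       # synthetic input sequence
y_meas = torch.randn(N, 1)    # synthetic measured output
x0 = torch.zeros(n_x, requires_grad=True)  # initial condition, also optimized

x = x0
x_sim = [x]
for k in range(N - 1):        # x_{k+1} = x_k + dt * N_f(x_k, u_k): one graph node per step
    x = x + dt * N_f(torch.cat([x, u[k]]))
    x_sim.append(x)
x_sim = torch.stack(x_sim)    # (N, n_x)
y_sim = x_sim[:, 0:1]         # assumed output map: first state component

J = torch.mean((y_sim - y_meas) ** 2)  # simulation error loss (10)
J.backward()                  # back-propagation through all solver steps
```

After `backward()`, gradients with respect to both θ (the parameters of `N_f`) and x_0 are available, so any gradient-based optimizer can be attached; the chain of Euler steps is exactly the residual-network-like graph discussed next.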
For example, Figure 2 represents the computational graph obtained by applying a forward Euler scheme with constant step size Δt, assuming that the measurements in D are collected at the same constant rate Δt.

Figure 2: Computational graph associated with the forward Euler ODE integration scheme with constant step size Δt applied to system (2). Measurements are assumed to be equally spaced with the same constant rate Δt.

It is interesting to note the similarity between this computational graph and the one of the residual network structure (He, Zhang, Ren, & Sun, 2016). Therefore, in the case of an explicit ODE solver, the derivatives of the loss J(θ, x_0) with respect to θ and x_0 can be obtained using standard back-propagation through the elementary solver steps. Thus, a procedure minimizing J(θ, x_0) can be implemented using available deep learning software.

Recently, an alternative approach to differentiate through the ODE solution, based on backward-in-time integration of adjoint sensitivities, has been proposed in T. Q. Chen, Rubanova, Bettencourt, and Duvenaud (2018). Following this approach, implicit ODE solvers may be adopted as well.

In either case, from a computational perspective, simulating over time has an intrinsically sequential nature and offers scarce opportunities for parallelization. Thus, in practice, minimizing the simulation error with a gradient-based method over the entire training dataset D may be inconvenient or even unfeasible in terms of memory allocation and computational time.

Remark 1
In many cases, optimizing the initial condition x_0 is not necessary in simulation error minimization, as this quantity may be available from physical considerations. For instance, in the cascaded tanks system presented above, the initial values of the tank levels may be known. Even when the initial condition is unknown, its effect may be negligible, as is the case for measurements collected from a fading memory system S on a sufficiently long time horizon [t_0, t_{N-1}].

In order to reduce the computational burden and the wall-clock execution time of the full simulation error minimization approach previously discussed, the simulated output y_sim(t; θ, x_0) can be obtained by simulating the system state x_sim(t; θ, x_0) in (9) on several smaller portions of the training set D. For efficient implementation, the truncated simulation error minimization algorithm presented in this section processes batches containing q subsequences extracted from D. In principle, the simulations can be carried out simultaneously for all the subsequences in the batch by exploiting parallel computing.

Figure 3: Representation of two subsequences of length m extracted from the training dataset D. s_j and s_ℓ represent the starting indexes of subsequences j and ℓ, respectively.

A batch is completely specified by a batch start index vector s ∈ N^q defining the initial sample of each subsequence and a sequence duration m ∈ N defining the number of samples contained in each subsequence, where each element s_j of s satisfies s_j ≤ N - m - 1. Thus, for instance, the j-th output subsequence in a batch contains the measured output samples {y_{s_j}, y_{s_j+1}, ..., y_{s_j+m-1}} (see Figure 3 for a graphical representation).

For notational convenience, we can arrange the batch data in the following tensors:

• output y ∈ R^{q×m×n_y}, where y_{j,h} = y_{s_j+h};
• input u ∈ R^{q×m×n_u}, where u_{j,h} = u_{s_j+h};
• relative time τ ∈ R^{q×m}, where τ_{j,h} = t_{s_j+h} - t_{s_j},

for j = 0, 1, ..., q-1 and h = 0, 1, ..., m-1. Furthermore, we define the tensors

• simulated state x_sim ∈ R^{q×m×n_x};
• simulated output y_sim ∈ R^{q×m×n_y},

that will be used to store simulated state and output values at the corresponding batch time instants. Note that the third dimension of the output tensor y ∈ R^{q×m×n_y} corresponds to the output channel. With some abuse of notation, the tensor y is addressed by only two indexes because we do not need to specify a particular output channel. The same notation is used for the tensors u, x_sim, and y_sim.

For the subsequences j = 0, 1, ..., q-1, the state evolution in the relative time intervals [0, τ_{j,m-1}] (which correspond to the absolute time intervals [t_{s_j}, t_{s_j+m-1}]) is the solution of the Cauchy problem

    ẋ_sim_j(τ) = N_f(x_sim_j(τ), u(t_{s_j} + τ); θ)      (11)

for a given initial condition x_sim_j(0). The solution of the Cauchy problem (11) is numerically approximated by

    x_sim_j(τ) = ODEINT(τ; N_f(·,·; θ), u(·), x_sim_j(0)),      (12a)

and the simulated output for the j-th subsequence is then given by

    y_sim_j(τ) = N_g(x_sim_j(τ); θ).      (12b)

We can now arrange the simulated output and state at the measurement time instants in the tensors y_sim and x_sim, respectively:

    y_sim_{j,h} = y_sim_j(τ_{j,h})      (13a)
    x_sim_{j,h} = x_sim_j(τ_{j,h}),     (13b)

for j = 0, 1, ..., q-1 and h = 0, 1, ..., m-1.

The choice of the initial conditions x_sim_j(0) in the Cauchy problem (11) is here critical. Indeed, the effect of these initial conditions cannot in general be neglected, as the duration of a subsequence is typically much shorter than the whole dataset D. Furthermore, the system state is unlikely to be known a priori at all the different time instants. For this reason, we introduce an additional set of free variables X = {x_0, x_1, ..., x_{N-1}}, where x_k ∈ R^{n_x} represents the (unmeasured) system state at the measurement time t_k. For a given batch, the set of subsequences' initial conditions x_sim_j(0) is then constructed as x_sim_j(0) = x_{s_j} (with j = 0, ..., q-1) and optimized along with the neural network parameters θ in order to minimize the fitting cost

    J_fit(θ, X) = (1/(qm)) Σ_{j=0}^{q-1} Σ_{h=0}^{m-1} ‖ y_sim_{j,h}(θ, X) - y_{j,h} ‖².      (14)

It is important to remark that the simulated output y_sim_{j,h}(θ, X) is a function of both the neural network parameters θ and the initial conditions x_{s_j} ∈ X.

Since the fitting cost J_fit(θ, X) defined above has N additional degrees of freedom w.r.t. the full simulation error minimization case, a price could be paid in terms of lack of generalization of the estimated model. In order to reduce the degrees of freedom in the minimization of the loss J_fit, the variables X used to construct the initial conditions for the Cauchy problem (11) can be enforced to represent the unknown system state, and thus to be consistent with the neural model structure (2a) (namely, the state sequence X should satisfy the neural ODE equation (2a)). To this aim, we introduce a regularization term J_reg(θ, X) penalizing the distance between the state x_sim (simulated through (12a)) and the optimization variables in X. Specifically, the regularization term J_reg(θ, X) is defined as

    J_reg(θ, X) = (1/(qm)) Σ_{j=0}^{q-1} Σ_{h=0}^{m-1} ‖ x_sim_{j,h}(θ, X) - x̄_{j,h}(X) ‖²,      (15)

where x̄ is a tensor with the same size and structure as x_sim, but populated with samples from X, i.e., x̄_{j,h} = x_{s_j+h}.

The overall loss is then constructed as a weighted sum of the two objectives, namely:

    J_tot(θ, X) = J_fit(θ, X) + α J_reg(θ, X),      (16)

with regularization weight α ≥ 0.

The training procedure is summarized in Algorithm 1. In Step 1, the neural network parameters θ and the "hidden" state variables in X are initialized. For instance, the initial values of θ and X can be set to small random numbers. Alternatively, if a measurement/estimate of some of the system's state variables is available, it can be used to initialize certain components of X. Then, at each iteration i = 0, ..., n-1, a batch start index vector s ∈ N^q is selected, with s_j ∈ [0, N-m-1], j = 0, 1, ..., q-1 (Step 2.1). The elements of s can be either (pseudo)randomly generated, or chosen deterministically. Then, the tensors y, x̄, and τ are populated with the corresponding values in D and X, and the initial conditions x_sim_j(0), with j = 0, 1, ..., q-1, are taken from X (Step 2.2). Subsequently, for each subsequence j, the system state and output are simulated within the time interval [0, τ_{j,m-1}], starting from the initial condition x_sim_j(0) (Step 2.3). Then, the values of the simulated state and output at the (relative) measurement times τ_{j,h} (with h = 0, ..., m-1) are collected in the tensors x_sim and y_sim (Step 2.4), and used to construct the loss J_tot(θ, X) (Step 2.5). Next, the gradients of the cost with respect to the optimization variables θ, X are obtained (Step 2.6), either by back-propagation through the solver steps or by exploiting the adjoint sensitivity method suggested in (T. Q. Chen et al., 2018). Lastly, the optimization variables are updated via gradient descent with learning rate λ (Step 2.7). Improved variants of gradient descent such as Adam (Kingma & Ba, 2015) can alternatively be adopted at Step 2.7.

Note that, at each iteration i of the gradient descent algorithm, the cost J_tot(θ, X) may depend on just a subset of the hidden state variables X. Thus, at each iteration, the gradient vector ∇_X J_tot = ∂J_tot/∂X is non-zero only for those components, and only they are updated.

Remark 2
The computational cost of evaluating the gradient of J_tot is proportional to the number of solver steps executed in (12a). In the case where the solver step is equal to the sampling interval t_{k+1} - t_k (which corresponds to m solver steps per subsequence), running truncated simulation error minimization with m ≪ N is thus significantly faster than full simulation error minimization. Furthermore, the computations for the q subsequences can be carried out independently, and thus parallel computing can be exploited.

For an efficient use of the training data D, s has to be chosen in such a way that all samples are visited with equal frequency during the iterations of Algorithm 1.

Algorithm 1: Truncated simulation error minimization
Inputs: training dataset D; number of iterations n; batch size q; length of subsequences m; learning rate λ > 0; regularization weight α ≥ 0.

1. initialize the neural network parameters θ and the hidden state sequence X;
2. for i = 0, ..., n-1 do
   2.1. select a batch start index vector s ∈ N^q;
   2.2. populate the tensors
            y_{j,h} = y_{s_j+h},  x̄_{j,h} = x_{s_j+h},  τ_{j,h} = t_{s_j+h} - t_{s_j},
        for j = 0, 1, ..., q-1 and h = 0, 1, ..., m-1, and set the initial conditions
        x_sim_j(0) = x_{s_j}, for j = 0, 1, ..., q-1;
   2.3. simulate the state and output
            x_sim_j(τ) = ODEINT(τ; N_f(·,·; θ), u(·), x_sim_j(0))
            y_sim_j(τ) = N_g(x_sim_j(τ); θ)
        for j = 0, 1, ..., q-1 and τ ∈ [0, τ_{j,m-1}];
   2.4. populate the tensors x_sim and y_sim as
            x_sim_{j,h} = x_sim_j(τ_{j,h}),  y_sim_{j,h} = y_sim_j(τ_{j,h}),
        for j = 0, 1, ..., q-1 and h = 0, 1, ..., m-1;
   2.5. compute the cost
            J_tot(θ, X) = (1/(qm)) Σ_{j=0}^{q-1} Σ_{h=0}^{m-1} ‖ y_sim_{j,h}(θ, X) - y_{j,h} ‖²     [J_fit]
                        + α (1/(qm)) Σ_{j=0}^{q-1} Σ_{h=0}^{m-1} ‖ x_sim_{j,h}(θ, X) - x̄_{j,h}(X) ‖²;     [J_reg]
   2.6. evaluate the gradients ∇_θ J_tot = ∂J_tot/∂θ and ∇_X J_tot = ∂J_tot/∂X at the current values of θ and X;
   2.7. update the optimization variables θ and X:
            θ ← θ - λ ∇_θ J_tot
            X ← X - λ ∇_X J_tot;

Output: neural network parameters θ.
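A single iteration of the scheme above can be sketched in PyTorch as follows, using forward Euler as the ODEINT of choice. The sizes, the stand-in networks, the synthetic data, and the use of Adam for the update are all illustrative assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_x, n_u, n_y, N, dt = 2, 1, 1, 200, 0.05
q, m, alpha, lr = 8, 16, 0.1, 1e-3

N_f = nn.Sequential(nn.Linear(n_x + n_u, 32), nn.Tanh(), nn.Linear(32, n_x))
N_g = nn.Linear(n_x, n_y)
U = torch.randn(N, n_u)                           # synthetic input data
Y = torch.randn(N, n_y)                           # synthetic output data
X_hid = torch.zeros(N, n_x, requires_grad=True)   # hidden state sequence X

opt = torch.optim.Adam(list(N_f.parameters()) + list(N_g.parameters()) + [X_hid], lr=lr)

# Step 2.1-2.2: batch start indices and batch tensors (q, m, .)
s = torch.randint(0, N - m, (q,)).tolist()
u_b = torch.stack([U[i:i + m] for i in s])
y_b = torch.stack([Y[i:i + m] for i in s])
x_b = torch.stack([X_hid[i:i + m] for i in s])    # samples of X, i.e. the x-bar tensor

# Step 2.3-2.4: forward Euler rollout for all q subsequences in parallel
x = x_b[:, 0, :]                                  # initial conditions taken from X
x_sim = [x]
for h in range(m - 1):
    x = x + dt * N_f(torch.cat([x, u_b[:, h, :]], dim=-1))
    x_sim.append(x)
x_sim = torch.stack(x_sim, dim=1)                 # (q, m, n_x)
y_sim = N_g(x_sim)                                # (q, m, n_y)

# Step 2.5: total loss (16) = fitting cost (14) + alpha * consistency cost (15)
J_fit = torch.mean((y_sim - y_b) ** 2)
J_reg = torch.mean((x_sim - x_b) ** 2)
J_tot = J_fit + alpha * J_reg

# Steps 2.6-2.7: gradients and update of theta and X
opt.zero_grad()
J_tot.backward()
opt.step()
```

Note that the rollout loop runs over the m time steps only; the q subsequences advance together as one batched tensor operation, which is the parallelism the algorithm is designed to expose.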
In the next paragraph, we introduce an alternative method for fitting neural dynamical models that does not require simulation through time, and is thus even more convenient from a computational perspective, allowing full parallelization at the time-step level.
The optimization variables X previously introduced for truncated simulation error minimization are regularized to be consistent with the fitted system dynamics through the cost J_reg (15). Therefore, X implicitly provides another estimate of the unknown system state, and thus of the output Y for a given N_g. This estimate can be compared with the measured samples Y to define an alternative fitting objective. This suggests the following variant for the fitting term J_fit:

    J_fit(X) = (1/(qm)) Σ_{j=0}^{q-1} Σ_{h=0}^{m-1} ‖ ȳ_{j,h} - y_{j,h} ‖²,      (17)

where ȳ_{j,h} = N_g(x̄_{j,h}).

Experimentally, using (17) instead of (14) as the fitting term does not significantly alter the properties of truncated simulation error minimization. Furthermore, the wall-clock execution time of Algorithm 1 with the modified fitting term (17) is still dominated by the m-step simulation (and back-propagation) still required to compute the gradient of the regularization term J_reg in (15). In order to formulate a faster learning algorithm, an alternative regularization term J_reg promoting consistency of the hidden state variables, but not requiring time simulation, should be devised.

The consistency-promoting regularizer J_reg considered in the soft-constrained integration method penalizes the violation of a numerical ODE integration scheme applied to the hidden state variables X, independently at each integration step. For instance, the forward Euler scheme can be enforced by means of the regularization term

    J_reg(X, θ) = Σ_{j=0}^{q-1} Σ_{h=1}^{m-1} ‖ x̄_{j,h} - x̄_{j,h-1} - Δt N_f(x̄_{j,h-1}, u_{j,h-1}) ‖²,      (18)

assuming for notational simplicity a constant step size Δt.
If the regularization term (18) is reduced to a "small value" through optimization, then the hidden variables X will (approximately) satisfy the forward Euler scheme.

Algorithm 2 details the training steps when the forward Euler numerical scheme is used to enforce consistency of the hidden state variables X with the model dynamics. In Step 1, the neural network parameters θ and the sequence of hidden variables X are initialized. Then, at each iteration i = 0, 1, ..., n-1, a batch start index vector s ∈ N^q is selected, with s_j ∈ [0, N-m-1], j = 0, 1, ..., q-1 (Step 2.1). The tensors y, x̄, u are populated with the corresponding samples in D and X (Step 2.2), similarly as in Algorithm 1.

Algorithm 2: Soft-constrained integration for the forward Euler scheme
Inputs: training dataset D; number of iterations n; batch size q; length of subsequences m; learning rate λ > 0; regularization weight α ≥ 0.

1. initialize the neural network parameters θ and the hidden state sequence X;
2. for i = 0, ..., n-1 do
   2.1. select a batch start index vector s ∈ N^q;
   2.2. populate the tensors y_{j,h} = y_{s_j+h}, x̄_{j,h} = x_{s_j+h}, u_{j,h} = u_{s_j+h}, for j = 0, 1, ..., q-1 and h = 0, 1, ..., m-1;
   2.3. compute the cost
            J_tot(θ, X) = (1/(qm)) Σ_{j=0}^{q-1} Σ_{h=0}^{m-1} ‖ ȳ_{j,h} - y_{j,h} ‖²     [J_fit]
                        + α (1/(qm)) Σ_{j=0}^{q-1} Σ_{h=1}^{m-1} ‖ x̄_{j,h} - x̄_{j,h-1} - Δt N_f(x̄_{j,h-1}, u_{j,h-1}) ‖²,     [J_reg]
        with ȳ_{j,h} = N_g(x̄_{j,h});
   2.4. evaluate the gradients ∇_θ J_tot = ∂J_tot/∂θ and ∇_X J_tot = ∂J_tot/∂X at the current values of θ and X;
   2.5. update the optimization variables θ and X:
            θ ← θ - λ ∇_θ J_tot
            X ← X - λ ∇_X J_tot;

Output: neural network parameters θ.

The total cost J_tot is computed (Step 2.3) as a weighted sum of the fitting cost J_fit in (17) and the regularizer J_reg in (18). Note that, unlike Algorithm 1, Algorithm 2 does not require m-step simulation. The gradients of J_tot are obtained using standard back-propagation and used to perform a gradient-based minimization step (Steps 2.4 and 2.5).

The potential advantage of the proposed soft-constrained integration method over truncated simulation error minimization is twofold. Firstly, implicit integration schemes can be enforced with no additional computational burden with respect to explicit ones. For instance, the backward Euler integration scheme can be implemented simply by modifying the consistency term J_reg to

    J_reg(X, θ) = Σ_{j=0}^{q-1} Σ_{h=1}^{m-1} ‖ x̄_{j,h} - x̄_{j,h-1} - Δt N_f(x̄_{j,h}, u_{j,h}) ‖²,      (19)

while the Crank-Nicolson scheme corresponds to

    J_reg(X, θ) = Σ_{j=0}^{q-1} Σ_{h=1}^{m-1} ‖ x̄_{j,h} - x̄_{j,h-1} - (Δt/2) (N_f(x̄_{j,h}, u_{j,h}) + N_f(x̄_{j,h-1}, u_{j,h-1})) ‖².      (20)

Other implicit schemes such as the multi-step Backward Differentiation Formula (BDF) and Adams-Moulton (AM) methods, or multi-stage implicit Runge-Kutta (RK) methods, may be similarly implemented, leading to a potential increase in the accuracy of the ODE numerical solution (Quarteroni et al., 2010). Secondly, the resulting cost function splits up as a sum of independent contributions for each time step, thus enabling a fully parallel implementation of gradient-based optimization. This leads to significant computational advantages and a reduced wall-clock execution time.

On the other hand, in the proposed soft-constrained integration method, the numerical scheme is only approximately satisfied at each solver step, and the degree of violation eventually depends on the weighting constant α in the cost function. The tuning of the weight α is thus more critical as compared to truncated simulation error minimization.

The performance of the model structures and fitting algorithms introduced in this paper is tested on three case studies considering the identification of a nonlinear RLC circuit; a
Cascaded Tanks System (CTS); and an
Elecro-Mechanical Positioning System (EMPS).
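As an illustration, Algorithm 2 can be sketched in a few lines of PyTorch. The sketch below uses synthetic random data and illustrative sizes, takes $\mathcal{N}_g$ as the identity (fully observed state), and is a minimal rendition of the procedure, not the authors' reference implementation. Swapping the forward Euler residual for the backward Euler (19) or Crank-Nicolson (20) variants only changes the line computing `residual`.

```python
import torch
import torch.nn as nn

# Synthetic placeholder data; shapes and hyper-parameters are illustrative only.
torch.manual_seed(0)
N, nx, nu, ny = 200, 2, 1, 2
dt = 0.1
u = torch.randn(N, nu)       # input sequence
y_meas = torch.randn(N, ny)  # measured output sequence

# N_f: neural state derivative; N_g is taken as the identity here (y_hat = x)
f_net = nn.Sequential(nn.Linear(nx + nu, 64), nn.ReLU(), nn.Linear(64, nx))

# Hidden state sequence X, optimized jointly with the network parameters theta
x_hidden = torch.zeros(N, nx, requires_grad=True)

alpha, lr = 1.0, 1e-3
opt = torch.optim.Adam(list(f_net.parameters()) + [x_hidden], lr=lr)

for itr in range(500):
    opt.zero_grad()
    # J_fit: discrepancy between measured outputs and (identity-mapped) hidden states
    J_fit = torch.mean((y_meas - x_hidden) ** 2)
    # J_reg: forward Euler consistency residual between consecutive hidden states
    f_val = f_net(torch.cat((x_hidden[:-1], u[:-1]), dim=1))
    residual = x_hidden[1:] - x_hidden[:-1] - dt * f_val
    J_reg = torch.mean(residual ** 2)
    J_tot = J_fit + alpha * J_reg
    J_tot.backward()  # gradients w.r.t. both theta and X via back-propagation
    opt.step()        # simultaneous update of theta and X (Step 2.5)
```

Note that no sequential simulation is performed inside the loop: every time step contributes an independent term to the cost, which is what enables the parallel evaluation discussed above.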
Code availability
The software implementation is based on the Python programming language and the PyTorch Deep Learning Framework (Paszke et al., 2017). All the code required to reproduce the results reported in the paper is available in the GitHub repository https://github.com/forgi86/sysid-neural-continuous

Dataset availability

The RLC circuit dataset is simulated and available together with the code in the on-line repository. Experimental datasets for the CTS and the EMPS case studies are obtained from the nonlinear system identification benchmark website, which hosts a collection of public benchmarks widely used in system identification.
Metrics
For all case studies, the performance of the fitting algorithms is assessed in terms of the $R^2$ index of the model simulation:
$$R^2 = 1 - \frac{\sum_{k=0}^{N-1} \big( y_k - y^{\rm sim}(t_k) \big)^2}{\sum_{k=0}^{N-1} \big( y_k - y^{\rm mean} \big)^2}, \qquad \text{where } y^{\rm mean} = \frac{1}{N} \sum_{k=0}^{N-1} y_k.$$
For the CTS, the Root Mean Squared Error (RMSE) of the model simulation is also provided, as this is the performance index suggested in the description of the benchmark (Schoukens & Noël, 2017):
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{k=0}^{N-1} \big( y_k - y^{\rm sim}(t_k) \big)^2}.$$
The performance indexes are evaluated both on the training dataset and on a separate test dataset. In the case of systems with multiple output channels, the metrics are computed channel-wise.
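For reference, the two metrics are a direct transcription of the formulas above; a minimal NumPy implementation (function names are illustrative):

```python
import numpy as np

def r_squared(y, y_sim):
    """R^2 index of a simulated output (single channel)."""
    y, y_sim = np.asarray(y, float), np.asarray(y_sim, float)
    return 1.0 - np.sum((y - y_sim) ** 2) / np.sum((y - np.mean(y)) ** 2)

def rmse(y, y_sim):
    """Root Mean Squared Error of the simulated output."""
    y, y_sim = np.asarray(y, float), np.asarray(y_sim, float)
    return np.sqrt(np.mean((y - y_sim) ** 2))

# A perfect simulation yields R^2 = 1 and RMSE = 0, while predicting the
# constant mean of the data yields R^2 = 0.
y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y), rmse(y, y))
```

For multi-channel outputs, the functions are simply applied to each channel separately, as stated above.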
Algorithm settings
In truncated simulation error minimization (Algorithm 1), the batch size $q$ and the subsequence length $m$ are chosen within the integer range [32, 128], while the regularization weight $\alpha$ in (16) is always set to 1.

In the soft-constrained integration method (Algorithm 2), we consider instead a single subsequence containing all the dataset samples, i.e., $q = 1$ and $m = N$. The weight constant $\alpha$ is tuned based on the simulation performance of the identified model on the training dataset.

In all the examples, the Adam optimizer (Kingma & Ba, 2015) is used for gradient-based optimization. The learning rate $\lambda$ is adjusted through a rough trial-and-error procedure, while the other optimizer parameters are left at their default values. The number of training steps $n$ is chosen sufficiently high to reach a cost function plateau.

The neural networks' weight parameters are initialized to random Gaussian variables with zero mean and a small standard deviation, while the bias terms are initialized to zero. The hidden state variables $X$ are initialized differently in the three examples, exploiting available process knowledge.

Figure 4: Nonlinear series RLC circuit used in the example (left) and nonlinear dependence of the inductance $L$ on the inductor current $i_L$ (right).

Hardware configuration
All computations are carried out on a PC equipped with an Intel i5-7300U 2.60 GHz processor and 32 GB of RAM.
We consider the nonlinear RLC circuit in Figure 4 (left). The circuit behavior is described by the continuous-time state-space equation
$$\begin{bmatrix} \dot v_C \\ \dot i_L \end{bmatrix} = \begin{bmatrix} 0 & \tfrac{1}{C} \\ -\tfrac{1}{L(i_L)} & -\tfrac{R}{L(i_L)} \end{bmatrix} \begin{bmatrix} v_C \\ i_L \end{bmatrix} + \begin{bmatrix} 0 \\ \tfrac{1}{L(i_L)} \end{bmatrix} v_{in}, \qquad (21)$$
where $v_{in}$ (V) is the input voltage; $v_C$ (V) is the capacitor voltage; and $i_L$ (A) is the inductor current. The circuit parameters $R = 3~\Omega$ and $C = 270$ nF are fixed, while the inductance $L$ depends on $i_L$ as shown in Figure 4 (right), according to the saturating arctangent characteristic
$$L(i_L) = L_0 \Big[ 0.9 \Big( \tfrac{1}{\pi} \arctan\big( -5 (|i_L| - 5) \big) + 0.5 \Big) + 0.1 \Big],$$
with $L_0 = 50~\mu$H. This dependence is typically encountered in ferrite inductors operating in partial saturation (Di Capua et al., 2017).

In this case study, we assume both state variables $v_C$ and $i_L$ to be measured. A training dataset $D$ with $N = 4000$ samples is built by simulating the system for 2 ms with a fixed step $\Delta t = 0.5~\mu$s. The input $v_{in}$ is a filtered white noise with bandwidth 150 kHz and standard deviation 80 V. An independent test dataset is generated using as input $v_{in}$ a filtered white noise with bandwidth 200 kHz and standard deviation 60 V. In the training dataset, the observations of $v_C$ and $i_L$ are corrupted by additive white Gaussian noise with zero mean and standard deviation 10 V and 1 A, respectively. This corresponds to a Signal-to-Noise Ratio (SNR) of 20 dB and 13 dB on $v_C$ and $i_L$, respectively.

Neural model structure

By considering as state vector $x = [v_C~~i_L]^\top$ and input $u = v_{in}$, we adopt for this system the fully observed state model structure (4) introduced in Section 3, reported for convenience below:
$$\dot x = \mathcal{N}_f(x, u), \qquad y = x.$$
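A minimal PyTorch sketch of this fully observed state structure and of its forward Euler simulation follows; sizes (2 states, 1 input, 64 hidden units) match this example, but class and function names are illustrative and not taken from the released code.

```python
import torch
import torch.nn as nn

class FullyObservedStateModel(nn.Module):
    """Neural state equation x_dot = N_f(x, u), with fully observed state y = x."""
    def __init__(self, nx=2, nu=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(nx + nu, hidden), nn.ReLU(), nn.Linear(hidden, nx)
        )

    def forward(self, x, u):
        # estimated state derivative from the current state and input
        return self.net(torch.cat((x, u), dim=-1))

def simulate_forward_euler(model, x0, u_seq, dt):
    """Integrate the neural ODE with forward Euler; the output is y = x."""
    x, x_sim = x0, [x0]
    for u in u_seq[:-1]:
        x = x + dt * model(x, u)
        x_sim.append(x)
    return torch.stack(x_sim)

model = FullyObservedStateModel()
x_sim = simulate_forward_euler(model, torch.zeros(2), torch.randn(100, 1), dt=0.5e-6)
```

The same `model` object can be plugged into either fitting criterion, since both only require evaluating $\mathcal{N}_f$ on (state, input) pairs.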
This model structure embeds the knowledge that (i) the system has a second-order state-space representation and (ii) the whole state vector is measured. The neural network $\mathcal{N}_f$ used in this example has a feed-forward structure with three input units (corresponding to $v_C$, $i_L$, and $v_{in}$); a hidden layer with 64 linear units followed by a ReLU nonlinearity; and two linear output units, namely the components of the state equation to be learned.

Algorithm 1 is executed for $n = 10000$ iterations, with batches containing $q = 64$ subsequences, each one of length $m = 64$. The hidden state variables $X$ are initialized to the values of the noisy output measurements. Model equations (11) are numerically integrated using the forward Euler scheme. The total run time of Algorithm 1 is 142 seconds.

Time trajectories of the true and model output are reported in Figure 5. For the sake of visualization, only a portion of the test dataset is shown. The fitted model describes the system dynamics with high accuracy. On both the training and the test datasets, the $R^2$ index in simulation is above 0.99 for $v_C$ and 0.98 for $i_L$. For the sake of comparison, a second-order linear Output Error model estimated using the System Identification Toolbox (Ljung & Singh, 2012) achieves an $R^2$ index of 0.96 for $v_C$ and 0.77 for $i_L$ on the training dataset, and 0.94 for $v_C$ and 0.76 for $i_L$ on the test dataset.

Full simulation error minimization is also tested. This method yields the same performance as truncated simulation error minimization in terms of the $R^2$ index of the fitted model. However, the runtime required to execute $n = 10000$ iterations and reach a cost function plateau is about two hours.

Figure 5: RLC circuit: true output (black) and estimated output (red) obtained by the state-space model trained by truncated simulation error minimization.

For the fully observed neural model structure, a straightforward fitting criterion may be defined by taking the noisy output measurements as state estimates and minimizing the one-step prediction error loss of the AutoRegressive with eXogenous input (ARX) model structure:
$$J_{\rm pred}(\theta) = \sum_{t=1}^{N-1} \big\| y_t - y_{t-1} - \Delta t\, \mathcal{N}_f(y_{t-1}, u_{t-1}) \big\|^2. \qquad (22)$$
Minimization of (22) corresponds to the training of a standard feedforward neural network with target $(y_t - y_{t-1})/\Delta t$ and features $y_{t-1}$, $u_{t-1}$. Since the neural network is fed with noisy input data and only the one-step-ahead prediction error is minimized, this approach is not robust to measurement noise. On this RLC example, the fully observed state neural model structure trained by minimizing $J_{\rm pred}(\theta)$ achieves an $R^2$ index of 0.73 for $v_C$ and 0.03 for $i_L$ on the test dataset (see time trajectories in Figure 6). By repeating the fitting procedure on a noise-free RLC training dataset, one-step prediction error minimization recovers the same performance as simulation error minimization ($R^2$ index of 0.99 and 0.98 for $v_C$ and $i_L$, respectively).

Figure 6: RLC circuit: true output (black) and estimated output (red) obtained by the state-space model trained by one-step prediction error minimization.

We consider the CTS already introduced in Section 3 and described in detail in (Schoukens & Noël, 2017). The training and test datasets contain 1024 points each, collected at a constant sampling time $\Delta t = 5$ s. In both datasets, the input is a multisine signal with the same power spectrum, but a different realization. Input and output values are expressed in Volts, as they correspond to the actuator commands and the raw sensor readings, respectively.

The initial state of the system is not provided, but it is known to be the same for both datasets. Thus, as suggested in (Schoukens & Noël, 2017), we use the initial state estimated on the training dataset (a by-product of the proposed fitting procedures) as the initial state for model simulation in test.
The neural model structure used for this system is
$$\dot x_1 = \mathcal{N}_{f1}(x_1, u) \qquad (23a)$$
$$\dot x_2 = \mathcal{N}_{f2}(x_1, x_2, u) \qquad (23b)$$
$$y = x_2. \qquad (23c)$$
Compared to (6), model (23) also includes a direct dependency on $u$ in the second state equation. This dependency is added to take into account that, in this experimental setup, in case of water overflow from the upper tank, part of the overflowing water may pour directly into the lower tank (Schoukens & Noël, 2017).

The neural networks $\mathcal{N}_{f1}$ and $\mathcal{N}_{f2}$ have two and three input units, respectively. Both networks have one hidden layer with 100 linear units followed by a ReLU nonlinearity, and a linear output unit.

Algorithm 1 is executed for $n = 10000$ iterations, with batch size $q = 64$ and subsequence length $m = 128$. The components of the hidden state variables $X$ associated with the state variable $x_1$ are initialized to 0, while the components associated with $x_2$ are initialized to the noisy output measurements $y$. Model equations (9) are numerically integrated using the forward Euler scheme. The total runtime of Algorithm 1 is 533 seconds.

Time trajectories of the simulated and true output are reported in Figure 7. The achieved $R^2$ index is 0.99 and 0.97 on the training and on the test dataset, respectively. The RMSE index is 0.08 V and 0.33 V on the identification and on the test dataset, respectively. These results compare favorably with state-of-the-art black-box nonlinear identification methods applied to this benchmark (Birpoutsoukis, Csurcsia, & Schoukens, 2018; Relan, Tiels, Marconato, & Schoukens, 2017; Svensson & Schön, 2017). To the best of our knowledge, the best previously published result was obtained in (Svensson & Schön, 2017) using a state-space model with priors for basis function expansion inspired by Gaussian Processes, trained using a sequential Monte Carlo method. The authors of (Svensson & Schön, 2017) report an RMSE index of 0.45 V on the test dataset. A higher performance was reported only in (Rogers, Holmes, Cross, & Worden, 2017) for grey-box models including an explicit physical description of the water overflow phenomenon (RMSE index of 0.18 V).

Figure 7: CTS: measured output $y$ (black) and model simulation $y^{\rm sim}$ (red) obtained by the neural model trained by truncated simulation error minimization.

Note that the largest discrepancies between measured and simulated output are noticeable towards the end of the test experiment, where the measured output $y$ is close to 2 V. This condition is not encountered in the training dataset, and therefore a mismatch can be expected for a black-box nonlinear model such as a neural network.

Algorithm 2 is executed with the regularization term $J_{\rm reg}$ in (20) enforcing the Crank-Nicolson integration scheme and a regularization constant $\alpha = 50000$. For this small dataset, all time steps fit into memory and can be processed together in a single batch. Thus, we consider batches with a single subsequence ($q = 1$) containing the whole training dataset $D$ ($m = 1024$). Optimization is performed over $n = 50000$ iterations of the Adam algorithm. The total runtime of Algorithm 2 is 271 seconds, approximately half the runtime of Algorithm 1.

Time trajectories of the output are reported in Figure 8 for both the training and the test dataset. The $R^2$ index of the model is 0.99 and 0.96 on the training and on the test dataset, respectively. The RMSE index is 0.18 V and 0.40 V on the training and on the test dataset, respectively. The results are thus in line with the ones achieved by truncated simulation error minimization.
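For concreteness, the structured model (23) can be sketched as a PyTorch module with two subnetworks, wired as described in the text; class and attribute names are illustrative and this is not the released code.

```python
import torch
import torch.nn as nn

class CascadedTanksModel(nn.Module):
    """Sketch of structure (23): N_f1 drives the upper-tank state x1 from
    (x1, u); N_f2 drives the lower-tank state x2 from (x1, x2, u), with the
    direct u term accounting for overflow; only x2 is measured (y = x2)."""
    def __init__(self, hidden=100):
        super().__init__()
        self.f1 = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))  # inputs: x1, u
        self.f2 = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 1))  # inputs: x1, x2, u

    def forward(self, x, u):
        x1, x2 = x[..., 0:1], x[..., 1:2]
        dx1 = self.f1(torch.cat((x1, u), dim=-1))      # upper tank dynamics (23a)
        dx2 = self.f2(torch.cat((x1, x2, u), dim=-1))  # lower tank dynamics (23b)
        return torch.cat((dx1, dx2), dim=-1)

model = CascadedTanksModel()
dx = model(torch.zeros(5, 2), torch.ones(5, 1))  # batch of 5 (state, input) pairs
```

Splitting the state equation into two subnetworks is what encodes the cascade structure: the upper tank cannot depend on the lower-tank level by construction.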
Figure 8: CTS: measured output $y$ (black) and model simulation $y^{\rm sim}$ (red) obtained by the neural model trained using the soft-constrained integration method.

As a last case study, we consider the identification of the Electro-Mechanical Positioning System (EMPS) described in (Janot, Gautier, & Brunot, 2019). The system is a controlled prismatic joint, a common component of robots and machine tools. In the benchmark, the system input is the motor force $\tau$ (N) expressed on the load side, and the measured output is the prismatic joint position $p$ (m). A physical state-space model for the system is
$$\dot p = v \qquad (24a)$$
$$\dot v = \frac{\tau}{M} - \frac{f_v}{M} v - \frac{F_c(v)}{M}, \qquad (24b)$$
where $M$ (kg) is the joint mass, $f_v$ (Ns/m) is the dynamic friction coefficient, and $F_c$ (N) is the static friction. The benchmark is challenging due to (i) the unknown friction behavior and (ii) the marginally stable (integral) system dynamics.

The identification and test datasets are constructed from closed-loop experiments performed with the same reference position trajectory. A force disturbance acts on the system in the test experiment only. The two datasets have the same duration (approximately 25 seconds) and are collected at a sampling frequency of 1 kHz. In this paper, the original EMPS signals are decimated by a factor 5. Thus, each dataset contains $N = 4968$ points with sampling time $\Delta t = 5$ ms.

According to the physical model (24), the neural model structure used to fit the EMPS system is
$$\dot x_1 = x_2 \qquad (25a)$$
$$\dot x_2 = \mathcal{N}_f(x_2, u) \qquad (25b)$$
$$y = x_1, \qquad (25c)$$
with state variables $x_1 = p$ and $x_2 = v$, and input $u = \tau$.

The neural model structure (25) captures the physical knowledge that (i) the system states are position and velocity; (ii) the derivative of position is velocity; and (iii) the velocity dynamics do not depend on the position. The neural network is thus used to describe the velocity dynamics (24b), which can be rather complex due to the presence of static friction. Indeed, static friction is highly nonlinear and hard to describe with first-principles formulas. On the other hand, there is no need to use a black-box model to describe the position dynamics (24a): this equation simply states that velocity is the time derivative of position.

The neural network $\mathcal{N}_f$ used for this benchmark has 2 input units; 64 hidden linear units followed by a ReLU nonlinearity; and one linear output unit.

Algorithm 1 is executed for $n = 10000$ iterations, with batch size $q = 32$ and subsequence length $m = 64$. The components of $X$ associated with $x_1$ are initialized to the measured position sequence, while the components associated with $x_2$ are initialized to the forward difference approximation of its time derivative. Model equations (9) are numerically integrated using the fourth-order Runge-Kutta scheme RK44 (Ralston, 1962). The total runtime of Algorithm 1 is 710 seconds.

Time trajectories of the input and of the output are reported in Figure 9. The achieved $R^2$ index is above 0.99 on both the identification and the test dataset. For comparison, reference results obtained with physics-based models are reported in Janot et al. (2019).

Figure 9: EMPS: measured position $p$ (top panels, black) and model simulation $p^{\rm sim}$ (top panels, red) obtained by the neural model trained by truncated simulation error minimization. The input force $\tau$ is shown in the bottom panels.

Soft-constrained integration method

Algorithm 2 is executed with the regularization term $J_{\rm reg}$ in (19) enforcing the backward Euler integration scheme and regularization weight $\alpha = 1000$. As for the CTS benchmark, we consider batches with a single subsequence ($q = 1$) containing the whole training dataset $D$ ($m = 4968$). Optimization is performed over $n = 40000$ iterations of the Adam algorithm. The identified model achieves an $R^2$ index above 0.99 both in identification and in test, as for Algorithm 1. The total runtime of Algorithm 2 is 342 seconds (around 2x faster than Algorithm 1).
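The physics-informed structure (25) can be sketched as a PyTorch module together with a fourth-order Runge-Kutta step. The classic RK4 scheme is shown here for simplicity (the paper uses Ralston's RK44 variant, which differs only in its coefficients); class and function names are illustrative, not the released code.

```python
import torch
import torch.nn as nn

class EMPSModel(nn.Module):
    """Sketch of structure (25): the position x1 integrates the velocity x2
    exactly, and only the velocity dynamics is learned by N_f, which takes
    the velocity and the force as its 2 inputs (64 hidden units)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x, u):
        x2 = x[..., 1:2]
        dx1 = x2                                  # (25a): position derivative is velocity
        dx2 = self.f(torch.cat((x2, u), dim=-1))  # (25b): learned velocity dynamics
        return torch.cat((dx1, dx2), dim=-1)

def rk4_step(model, x, u, dt):
    """One classic fourth-order Runge-Kutta step, with zero-order hold on u."""
    k1 = model(x, u)
    k2 = model(x + 0.5 * dt * k1, u)
    k3 = model(x + 0.5 * dt * k2, u)
    k4 = model(x + dt * k3, u)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

model = EMPSModel()
x_next = rk4_step(model, torch.zeros(1, 2), torch.zeros(1, 1), dt=5e-3)
```

Hard-coding the integrator chain $\dot x_1 = x_2$ removes one source of model error entirely, which is especially valuable given the marginally stable dynamics of this benchmark.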
In this paper, we have presented neural model structures and two novel methodologies for the identification of continuous-time dynamical systems.

The main strengths of the presented framework are: (i) its versatility in describing complex and structured nonlinear systems, thanks to the flexibility of neural networks and the possibility of exploiting physical model structures; (ii) its robustness to measurement noise, thanks to the minimization of a (truncated) simulation error criterion and to regularization terms that enforce the hidden state variables to be consistent with the estimated neural model; (iii) the possibility of exploiting parallel computing to train the network and optimize the initial conditions, thanks to the division of the dataset into small-size subsequences.

The proposed case studies have shown the effectiveness of the presented methodologies on well-known system identification benchmarks. Current and future research activities are devoted to the application of the proposed framework to the identification of systems described by partial differential equations, as well as to the formulation of alternative fitting criteria that directly take into account the final usage of the estimated dynamical models, such as fault detection and control system design.

References
Abadi, M., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Andersson, C., Ribeiro, A. H., Tiels, K., Wahlström, N., & Schön, T. B. (2019). Deep convolutional networks in system identification. In (pp. 3670–3676).

Bengio, Y., et al. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, (1), 1–127.

Birpoutsoukis, G., Csurcsia, P. Z., & Schoukens, J. (2018). Efficient multidimensional regularization for Volterra series estimation. Mechanical Systems and Signal Processing, 896–914.

Chen, S., Billings, S., & Grant, P. (1990). Non-linear system identification using neural networks. International Journal of Control, (6), 1191–1214.

Chen, T. Q., Rubanova, Y., Bettencourt, J., & Duvenaud, D. K. (2018). Neural ordinary differential equations. In Advances in Neural Information Processing Systems (pp. 6571–6583).

Di Capua, G., Femia, N., Stoyka, K., Lodi, M., Oliveri, A., & Storace, M. (2017). Ferrite inductor models for switch-mode power supplies analysis and design. In (pp. 1–4).

Garnier, H., & Young, P. C. (2014). The advantages of directly identifying continuous-time transfer function models in practical applications. International Journal of Control, (7), 1319–1338.

Greydanus, S., Dzamba, M., & Yosinski, J. (2019). Hamiltonian neural networks. In Advances in Neural Information Processing Systems (pp. 15353–15363).

Haber, E., & Ruthotto, L. (2017). Stable architectures for deep neural networks. Inverse Problems, (1), 014004.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).

Horne, B. G., & Giles, C. L. (1995). An experimental comparison of recurrent neural networks. In Advances in Neural Information Processing Systems (pp. 697–704).

Janot, A., Gautier, M., & Brunot, M. (2019, April). Data set and reference models of EMPS. In Nonlinear System Identification Benchmarks. Eindhoven, Netherlands. Retrieved from https://hal.archives-ouvertes.fr/hal-02178709

Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Y. Bengio & Y. LeCun (Eds.), 3rd International Conference on Learning Representations (ICLR).

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, (7553), 436–444.

Ljung, L. (1978). Convergence analysis of parametric identification methods. IEEE Transactions on Automatic Control, (5), 770–783.

Ljung, L. (Ed.). (1999). System identification: Theory for the user (2nd ed.). Upper Saddle River, NJ, USA: Prentice Hall PTR.

Ljung, L., & Singh, R. (2012). Version 8 of the MATLAB System Identification Toolbox. IFAC Proceedings Volumes, (16), 1826–1831.

Long, Z., Lu, Y., & Dong, B. (2019). PDE-Net 2.0: Learning PDEs from data with a numeric-symbolic hybrid deep network. Journal of Computational Physics, 108925.

Lutter, M., Ritter, C., & Peters, J. (2019). Deep Lagrangian networks: Using physics as model prior for deep learning. arXiv preprint arXiv:1907.04490.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., … Lerer, A. (2017). Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

Piga, D. (2018). Finite-horizon integration for continuous-time identification: Bias analysis and application to variable stiffness actuators. International Journal of Control, 1–14.

Quarteroni, A., Sacco, R., & Saleri, F. (2010). Numerical mathematics (Vol. 37). Springer Science & Business Media.

Rackauckas, C., Ma, Y., Martensen, J., Warner, C., Zubov, K., Supekar, R., … Ramadhan, A. (2020). Universal differential equations for scientific machine learning. arXiv preprint arXiv:2001.04385.

Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 686–707.

Ralston, A. (1962). Runge-Kutta methods with minimum error bounds. Mathematics of Computation, (80), 431–437.

Relan, R., Tiels, K., Marconato, A., & Schoukens, J. (2017). An unstructured flexible nonlinear model for the cascaded water-tanks benchmark. IFAC-PapersOnLine, (1), 452–457.

Rogers, T., Holmes, G., Cross, E., & Worden, K. (2017). On a grey box modelling framework for nonlinear system identification. In Special Topics in Structural Dynamics, Volume 6 (pp. 167–178). Springer.

Rumelhart, D. E., Hinton, G. E., Williams, R. J., et al. (1988). Learning representations by back-propagating errors. Cognitive Modeling, (3), 1.

Ruthotto, L., & Haber, E. (2019). Deep neural networks motivated by partial differential equations. Journal of Mathematical Imaging and Vision, 1–13.

Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 85–117.

Schoukens, M., & Noël, J. P. (2017). Three benchmarks addressing open challenges in nonlinear system identification. IFAC-PapersOnLine, (1), 446–451.

Svensson, A., & Schön, T. B. (2017). A flexible state–space model for learning nonlinear dynamical systems. Automatica, 189–199.

Van Overschee, P., & De Moor, B. (1994). N4SID: Subspace algorithms for the identification of combined deterministic-stochastic systems. Automatica, (1), 75–93.

Wang, Y. (2017). A new concept using LSTM neural networks for dynamic system identification. In (pp. 5324–5329).

Weinan, E. (2017). A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5.