UNCECOMP 2019
3rd ECCOMAS Thematic Conference on Uncertainty Quantification in Computational Sciences and Engineering
M. Papadrakakis, V. Papadopoulos, G. Stefanou (eds.)
Crete, Greece, 24-26 June 2019
MODEL INFERENCE FOR ORDINARY DIFFERENTIAL EQUATIONS BY PARAMETRIC POLYNOMIAL KERNEL REGRESSION
David K. E. Green and Filip Rindler

The Alan Turing Institute, London, United Kingdom
e-mail: [email protected]

Mathematics Institute, University of Warwick, Coventry, United Kingdom
e-mail: [email protected]
Keywords:
Inverse Problems, Model Inference, Machine Learning, Dynamical Systems, Time Series Analysis, Artificial Neural Networks, Polynomial Kernel Methods
Abstract.
Model inference for dynamical systems aims to estimate the future behaviour of a system from observations. Purely model-free statistical methods, such as Artificial Neural Networks, tend to perform poorly for such tasks. They are therefore not well suited to many questions from applications, for example in Bayesian filtering and reliability estimation. This work introduces a parametric polynomial kernel method that can be used for inferring the future behaviour of Ordinary Differential Equation models, including chaotic dynamical systems, from observations. Using numerical integration techniques, parametric representations of Ordinary Differential Equations can be learnt using Backpropagation and Stochastic Gradient Descent. The polynomial technique presented here is based on a nonparametric method, kernel ridge regression. However, the time complexity of nonparametric kernel ridge regression scales cubically with the number of training data points. Our parametric polynomial method avoids this manifestation of the curse of dimensionality, which becomes particularly relevant when working with large time series data sets. Two numerical demonstrations are presented. First, a simple regression test case is used to illustrate the method and to compare the performance with standard Artificial Neural Network techniques. Second, a more substantial test case is the inference of a chaotic spatio-temporal dynamical system, the Lorenz–Emanuel system, from observations. Our method was able to successfully track the future behaviour of the system over time periods much larger than the training data sampling rate. Finally, some limitations of the method are presented, as well as proposed directions for future work to mitigate these limitations.

Dynamical systems play a crucial role in mathematical modelling across all areas of physics, engineering and applied mathematics. The equations used in some particular application domain are typically derived either phenomenologically [23] or from first principles such as the conservation of energy, mass or momentum (as in mechanics [27]). The structure of the equations should describe the fundamental aspects of the system in question as much as possible. On the other hand, constitutive parameters are often hard to know explicitly and need to be learnt from data. As such, it is necessary to balance rigidity and flexibility when modelling a system.

This paper considers the problem of finding a model of a dynamical system, represented by coupled Ordinary Differential Equations (ODEs), from observations. This is a particular form of inverse problem (as in [25]). The time evolution of many dynamical systems is described by polynomial equations in the system variables and their derivatives. We introduce a form of parametric polynomial kernel regression (related to Radial Basis Function networks [21]). This technique was developed during the search for an algorithm that can be trained continuously on streaming data, as opposed to on complete trajectories. Hidden parameter models (with unobserved variables) are not addressed, but the techniques shown here could be extended to such cases in the future, augmenting probabilistic Bayesian filtering methods (as in [16]).

Kernel ridge regression is a nonparametric method for fitting polynomials to data without explicitly calculating all polynomial terms of a set of variables [18, 21]. There are two limitations of this approach when fitting models to time series data.
First, as a nonparametric method, the computational time complexity scales cubically with the number of observation points. This is a significant issue when dealing with time series data. Second, it is difficult to compute kernel ridge regression efficiently using streaming data. While it is possible to continually update the inverse of a matrix (see [9]), the roughly cubic scaling of the required matrix operations is not well suited to monitoring high-dimensional systems in a real-time data setting. Here, to optimise our parametric polynomial kernel function representations, Stochastic Gradient Descent (SGD) is used along with the Backpropagation method (see [3]). This combination of techniques helps to minimise computational complexity and the amount of explicit feature engineering required to find a good representation of an unknown ODE.

We represent ODE models parametrically as compute graphs. Compute graphs are used in Artificial Neural Network (ANN) theory to model complicated nonlinear structures by the composition of simple functions and are well suited to gradient descent optimisation via the Backpropagation method. It is demonstrated that numerical integration (both explicit and implicit) can be used to discretise ODE time integrals in a way that allows for the inference of continuous-time dynamical system models by gradient descent. This is an extension of an approach that appeared at least as early as [6]. The discretisation procedure is related to the Backpropagation Through Time method [29], which is used for modelling discrete time series with so-called Recurrent Neural Networks.

To demonstrate the findings of this paper, two numerical case studies were carried out. The first is a simple analysis that contrasts the performance of standard ANN techniques with the proposed kernel method. It is shown that our method had the best extrapolation performance. A more extensive analysis of the chaotic spatio-temporal Lorenz–Emanuel dynamical system is also presented. The proposed method is able to recover a maximum likelihood estimate of the hidden polynomial model. For comparison, a parametric model constructed by direct summation of polynomial features (without kernels, of the form used in [26]) was also tested. The parametric polynomial kernel method was able to outperform the direct polynomial expansion, accurately predicting the future evolution of a chaotic dynamical system over periods many times greater than the training interval.

The primary advantage of the technique presented in this paper is that the model representation in parametric form can avoid the curse of dimensionality and poor scaling with training set size associated with nonparametric kernel regression. Further, polynomial kernels avoid the combinatorial explosion that occurs when explicitly computing polynomial series expansions. Interestingly, the accuracy of the proposed parametric kernel method can be tuned by adjusting the dimension of a set of intermediate parameters. The trade-off for increased accuracy is additional training time.
The parametric polynomial regression technique introduced in this paper is built on the framework of so-called compute graphs. This section provides the background theory required for later parts of this work. Compute graphs are very general structures which define the flow of information over a topology and as such provide a convenient parametric representation of nonlinear functions. In particular, compute graphs can be coupled with Automatic Differentiation [20] and the Backpropagation algorithm (an application of the chain rule) to allow for gradient-based optimisation. Stochastic Gradient Descent is the most common form of optimiser used in this context and is briefly described in this section.

Artificial Neural Networks (ANNs) are a subset of compute graphs (in the sense of discrete mathematics [7]). Common ANN terminology such as Deep Neural Networks, Boltzmann Machines, Convolutional Neural Networks and Multilayer Perceptrons refers to different ANN connectivity, training and subcomponent patterns [3, 8]. The choice of an appropriate ANN type depends on the problem being solved. This section works with general compute graph terminology, rather than specific ANN design patterns, as these principles are appropriate for all ANN architectures.

A (real-valued) compute graph consists of a weighted directed graph, i.e. an ordered pair $G = (V, E)$ with the following properties:

• $V$ is the finite set of vertices (or nodes) $v_i$. Vertices specify an activation function $\sigma_i : \mathbb{R} \to \mathbb{R}$ and an output (or activation) value $a_i \in \mathbb{R}$.
• $E$ is the set of edges $e_{ij}$. Each edge $e_{ij}$ specifies a start vertex, defined to be $v_i$, and an end vertex, defined to be $v_j$. That is, edges are said to start at $v_i$ and terminate at $v_j$. Edges also specify a weight, $W_{ij} \in \mathbb{R}$.

Edges $e_{ij}$ can be understood as 'pointing' from $v_i$ to $v_j$. Incoming edges to a node $v_i$ are all $e_{jk} \in E$ with $k = i$. Similarly, outgoing edges from a node $v_i$ are all $e_{jk} \in E$ with $j = i$. Parents of a node $v_i$ are all nodes $v_j$ such that there is an edge starting at $v_j$ and terminating at $v_i$. Similarly, children of a node $v_i$ are all nodes $v_j$ such that there is an edge starting at $v_i$ and terminating at $v_j$. A valid path of length $m$ starting at $v_1$ and terminating at $v_m$ is a set $\{v_1, v_2, \dots, v_m\}$ of at least two nodes such that there exists an edge in $E$ from $v_i$ to $v_{i+1}$ for all $i \in [1, m-1]$. A recurrent edge in a compute graph is an edge that lies on a valid path from a node $v_i$ to any of its parents. A graph with recurrent edges is said to be a recurrent graph. An example of a (recurrent) compute graph is shown in fig 1.

Inputs to the compute graph are all those nodes with no incoming edges (i.e. no parents), $\{v_i \mid v_i \in V \wedge \nexists\, e_{ki} \in E\}$. The activation values $a_i$ for input nodes $v_i$ must be assigned.

Figure 1: Example of a compute graph. The subscript inside each node denotes the node number. Arrowheads indicate the direction of the graph edges. The function inside each node refers to the output function to be applied at the node. Note that node 1 is an input (with value $a_1$) as it has no parents.
Further note that the graph contains a recurrent edge, as a cycle is formed between a node and one of its parents.

The values at all other nodes, $v_i$, in the compute graph are calculated by

$$z_i = \sum_{k \,:\, v_k \text{ parent of } v_i} W_{ki} a_k, \qquad (1)$$
$$a_i = \sigma_i(z_i), \qquad (2)$$

where $z_i$ represents the weighted inputs to a node from all parent nodes and $a_i$ represents the output from a node.

Note that ANNs often define so-called bias units. Bias units allow for inputs to a node to have their mean easily shifted. A bias input to some node $v_i$ can be represented in a compute graph by creating a set of nodes $b_i \in B$, with no parents, that always output a value of $1$. Further, each $b_i$ is assigned to be an additional parent of $v_i$ by creating an edge from $b_i$ to $v_i$ with weight $B_i$ so that

$$a_i = \sigma_i\left(\sum_{k \,:\, v_k \text{ parent of } v_i} W_{ki} a_k + B_i\right). \qquad (3)$$

Bias units will not, however, be explicitly indicated in the rest of this section as they can be assumed to be implicitly defined in eqn (1).

The composition of simple functions with a compute graph structure allows for complicated nonlinear functions to be represented parametrically [3].

Optimisation over very large compute graphs representing highly nonlinear functions has become possible using Stochastic Gradient Descent (SGD) coupled with Backpropagation of errors [3]. Advanced forms of SGD such as the Adam optimisation technique [15] are useful for optimising complicated compute graphs. The basic SGD method is described here. Stochastic Gradient Descent finds a locally optimal set of parameters, $\theta$, by iteratively updating the current estimate for the optimal parameters, $\theta_i$. It does so by moving the current estimate in the direction of greatest decreasing error, given by the derivative $\nabla_\theta J(\theta_i)$:

$$\theta_{i+1} := \theta_i - \eta \nabla_\theta J(\theta_i), \qquad (4)$$

where $\eta$ is a small parameter that gives the distance to move in the direction defined by $\nabla_\theta J(\theta)$. Iterations are repeated until a specified error tolerance $\epsilon > 0$ is reached, i.e. until

$$J(\theta_i) \leq \epsilon. \qquad (5)$$

Consider the case of approximating some unknown function $f(x)$ by a compute graph that outputs the function $\tilde{f}_\theta(x)$. The weights $\theta$ are taken to be the values of the edge weights $W_{ij}$ for all $e_{ij} \in E$. Let the loss functional in this example be given by

$$J(\theta) := \sum_x |f(x) - \tilde{f}_\theta(x)|^2, \qquad (6)$$

for $x$ in some finite set. Thus, $J(\theta)$ is also representable as a compute graph. The graph for $J(\theta)$ contains the graph for $\tilde{f}_\theta(x)$ as a subset. To apply SGD to a compute graph, extended to contain the terms computing the loss functional, the Backpropagation method (an application of the chain rule) can be used if two conditions are met:

• All nodal activation functions, $\sigma_i$, must be differentiable.
• The graph must be directed and acyclic, meaning the graph cannot contain any valid paths from a node to any of its parents, i.e. the graph must not have any recurrent edges.

If the above conditions are satisfied, Backpropagation can compute $\nabla_\theta J(\theta)$ via the chain rule. The basic procedure is outlined here, but a more detailed treatment can be found in [3]. In the case that the graph is not acyclic, it can be unrolled via a technique referred to as Backpropagation Through Time [29].

Backwards error derivatives must be computed at all nodes, $v_i$, in the network:

$$\delta_i := \frac{\partial J}{\partial z_i}. \qquad (7)$$

For nodes $v_i$ in the graph that compute the loss functional $J(\theta)$, the derivative $\delta_i$ can be computed directly. Otherwise, assume that node $v_i$ has children $\{w_j\}_{j=1}^N$.
Using the chain rule, the error derivative $\delta_i$ can be calculated by pushing the error derivatives backwards through the graph from children to parents:

$$\delta_i = \sum_{j=1}^{N} \delta_j \frac{\partial z_j}{\partial a_i} \frac{\partial a_i}{\partial z_i} = \sum_{j=1}^{N} \delta_j W_{ij} \sigma_i'(z_i). \qquad (8)$$

Given the error derivative terms, the desired error gradients $\nabla_\theta J(\theta)$ for $\theta = \{W_{ij}\}_{ij}$ can be computed at node $v_j$ with parents $\{w_k\}_{k=1}^M$ by

$$\frac{\partial J}{\partial W_{ij}} = \delta_j \frac{\partial z_j}{\partial W_{ij}} = \delta_j \frac{\partial}{\partial W_{ij}} \left( \sum_{k=1}^{M} W_{kj} a_k \right) = \delta_j a_i. \qquad (9)$$

Automatic Differentiation [20] can be used to write efficient computer code for Backpropagation. Specifically, Backpropagation is a form of 'reverse accumulation mode' Automatic Differentiation. The above calculations can be organised efficiently by going through the compute graph from output to input nodes. At the time of writing, Tensorflow [1] is a popular implementation of the algorithms described above. Although other (including gradient-free) optimisation procedures suitable for general compute graphs can be used, SGD with Backpropagation is typically very computationally efficient when applicable.
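As an illustration of eqns (1) to (9), the following minimal sketch implements forward evaluation, Backpropagation and the SGD update of eqn (4) for a small two-layer compute graph with sigmoid activations. The layer sizes, toy data and learning rate are illustrative choices only and are not taken from the paper.

import numpy as np

# Minimal dense two-layer compute graph: z1 = W1 x + b1, a1 = sigmoid(z1),
# z2 = W2 a1 + b2, with the identity activation at the output node.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 1)), np.zeros((8, 1))
W2, b2 = rng.normal(size=(1, 8)), np.zeros((1, 1))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-1.0, 1.0, 32).reshape(1, -1)   # inputs, one sample per column
y = np.sin(np.pi * x)                           # toy target function to fit

eta = 0.1                                       # learning rate, eqn (4)
for step in range(5000):
    # Forward pass, eqns (1)-(2).
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    # Squared-error loss as in eqn (6), averaged over the batch.
    loss = np.mean((z2 - y) ** 2)
    # Backward pass: delta terms as in eqns (7)-(8).
    d2 = 2.0 * (z2 - y) / x.shape[1]            # dJ/dz2 at the output node
    d1 = (W2.T @ d2) * a1 * (1.0 - a1)          # dJ/dz1 = (W2^T d2) * sigma'(z1)
    # Weight gradients as in eqn (9), followed by the SGD update of eqn (4).
    W2 -= eta * d2 @ a1.T;  b2 -= eta * d2.sum(axis=1, keepdims=True)
    W1 -= eta * d1 @ x.T;   b1 -= eta * d1.sum(axis=1, keepdims=True)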
Before discussing model inference for ODEs in particular, a parametric polynomial kernel function representation is introduced. Although ANNs and compute graphs are very effective at fitting arbitrary functions, standard ANN methods are poorly suited to polynomial function representation. As typical ANN architectures fit a very large number of parameters, they are unable to perform sensible extrapolation for even low-dimensional polynomial regression problems. Polynomial kernel ridge regression using the so-called kernel trick [21] works well for fitting polynomials but suffers from cubic (that is, $O(N^3)$) computational time complexity. Gradient-descent compute graph optimisation, as it is a parametric method, provides a way to optimise over large data sets without the computational difficulties faced by nonparametric methods. While it is possible to build a compute graph that explicitly includes polynomial basis features, this scales factorially with the number of polynomial features included. In this paper it is shown that polynomial kernels can be inserted into compute graph structures and optimised by SGD, avoiding both the combinatorial explosion of polynomial series expansions and the poor time scaling of nonparametric kernel ridge regression.

Polynomial kernels, typically associated with kernel regression and Support Vector Machines [21, 18], are functions of the form

$$K(x, y) = (b \langle x, y \rangle + c)^d \qquad (10)$$

for some $b, c \in \mathbb{R}$, $d \geq 1$. If the values of $y$ are assumed to be some parameters, the expansion of the polynomial kernel (for $d \in \mathbb{N}$) will, implicitly, yield all polynomial combinations up to order $d$.

Kernel ridge regression is a nonparametric method in the sense that the number of parameters grows with the amount of training data [18]. By contrast, in this paper 'parametric model' refers to a model with a fixed number of parameters. Adopting the notation in [28], the standard form of ridge regression is as follows. Given observations of an unknown function $f : \mathbb{R}^D \to \mathbb{R}^E$ at $N$ locations, $\{(x_i, f(x_i))\}_{i=1}^N$, kernel ridge regression finds an approximation, $f_k(x)$, by

$$f(x) \approx f_k(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i), \qquad (11)$$

where the values $\alpha_i$ are termed weights and $K(x, x_i)$ is a kernel function. Kernel functions are a form of generalisation of positive definite matrices (see [18] for additional details). Only the (real-valued) polynomial kernel in eqn (10) will be discussed in this paper. The weights $\alpha = (\alpha_1, \dots, \alpha_N)$ are calculated using $f(x) = (f(x_1), \dots, f(x_N))$ as follows:

$$\alpha = (K + \lambda I)^{-1} f(x), \qquad (12)$$

where $K \in \mathbb{R}^{N \times N}$ is the matrix with entries $K_{ji} = K(x_j, x_i)$ and $I$ is the $N$ by $N$ identity matrix. The term $\lambda \in \mathbb{R}$ is a regularisation term that controls overfitting. Note that if $K + \lambda I$ is not invertible, then the inverse must be replaced by a pseudo-inverse. In the sense of Bayesian regression, the term $\lambda$ represents the scale of Gaussian noise added to observations $f(x_i)$ as a part of the approximation procedure.

The use of kernels for regression as in eqn (11) has the effect of mapping a low-dimensional problem implicitly into a high-dimensional space. This is a very powerful technique for projecting data onto high-dimensional basis functions. Unfortunately, as a (typically) dense matrix must be inverted to calculate $\alpha$, the computational complexity of standard kernel ridge regression scales cubically with the number of data points, $N$. This is a severe limitation when considering large data sets such as the time series data considered in later sections of this paper.
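The following is a minimal NumPy sketch of the nonparametric estimator in eqns (10) to (12). The toy data, kernel constants and regularisation value are illustrative; note that forming and solving the $N \times N$ linear system is precisely the cubic-cost step discussed above.

import numpy as np

def poly_kernel(X, Y, b=1.0, c=1.0, d=3):
    """Polynomial kernel of eqn (10): K(x, y) = (b <x, y> + c)^d."""
    return (b * X @ Y.T + c) ** d

def fit_kernel_ridge(X, y, lam=1e-3, **kw):
    """Weights alpha = (K + lam I)^(-1) y, as in eqn (12); O(N^3) in the number of points."""
    K = poly_kernel(X, X, **kw)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_new, X_train, alpha, **kw):
    """Prediction f_k(x) = sum_i alpha_i K(x, x_i), as in eqn (11)."""
    return poly_kernel(X_new, X_train, **kw) @ alpha

# Toy usage: recover a cubic from noisy samples (data are illustrative only).
rng = np.random.default_rng(1)
X = rng.uniform(-2.0, 2.0, size=(50, 1))
y = X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=0.01, size=50)
alpha = fit_kernel_ridge(X, y)
print(predict(np.array([[1.5]]), X, alpha))     # approximately 1.5**3 - 1.5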
Instead of calculating an inner product between known values of $x$ and $y$ as in eqn (10) and inverting a matrix as in eqn (12), this paper demonstrates that a kernel representation can be found in an efficient way using compute graphs and SGD. Consider the following parametric representation of a function $f : \mathbb{R}^D \to \mathbb{R}^E$ with parameters $\theta \in \Theta$:

$$f_\theta(x) = W_2 \left[ (W_1 x + B_1) \circ (W_1 x + B_1) \right] + B_2, \qquad (13)$$

where $\circ$ denotes elementwise matrix multiplication (or Hadamard product), i.e. $A = B \circ C$ means $a_{ij} = b_{ij} c_{ij}$ for the corresponding matrix entries [12]. The remaining terms are defined by $W_1 \in \mathbb{R}^{M \times D}$, $B_1 \in \mathbb{R}^{M}$, $W_2 \in \mathbb{R}^{E \times M}$ and $B_2 \in \mathbb{R}^{E}$. The parameters $B_1$, $B_2$ are known as bias weights in the ANN literature [3]. The full set of parameters for this model is $\theta = \{W_1, B_1, W_2, B_2\}$. The dimension $M$ is an intermediate representation dimension and is discussed below.

Eqn (13) is a parametric representation of a second-order polynomial kernel. Expanding eqn (13) explicitly would yield a set of second-order polynomials in terms of the $x_i$. However, using SGD the unknown polynomial expression can be found without the need to know the expanded polynomial form. The elementwise matrix product acts like the $d$-th power in eqn (10). The parameters $\theta$ can be trained by SGD and function as parametric representations of Support Vectors. The term $M$ required to complete the definition of eqn (13) is a hyperparameter representing a choice of intermediate representation dimension and is related to the number of Support Vectors required to represent the system (as in Support Vector Regression, see [21]). Increasing the size of $M$ increases the number of parameters but can improve the fit of the regressor (as is demonstrated empirically in Section 5).

An $n$-th order polynomial could be fit by taking a larger number of Hadamard products. Denote the composition of Hadamard products by $A \circ_n A := A \circ A \circ \cdots \circ A$ ($n$ times). Then, our approach consists of expressing an $n$-th order representation of $f_\theta : \mathbb{R}^D \to \mathbb{R}^E$ as follows:

$$f_\theta(x) = W_2 \left[ (W_1 x + B_1) \circ_n (W_1 x + B_1) \right] + B_2, \qquad (14)$$

or some similar variation on this theme. The expression in eqn (14) is differentiable in the sense of compute graphs since all of the operations in eqn (14) are differentiable. Comparing with eqns (11) and (12), the parametric form of polynomial kernel regression can be thought of as an approximation to both the $\alpha_i$ and $K(x, x_i)$ terms in a single expression. As the parametric regression form can be optimised by SGD, the cubic scaling of nonparametric kernel ridge regression is avoided.
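As a minimal sketch of the representation in eqns (13) and (14), the forward evaluation below writes the repeated Hadamard product of a vector with itself as an elementwise power. The initialisation, dimensions and order are illustrative assumptions; training (by SGD on a loss such as eqn (16)) is not shown here.

import numpy as np

def init_params(D, E, M, n=2, seed=0):
    """Parameters theta = {W1, B1, W2, B2} of eqn (13); M is the intermediate dimension."""
    rng = np.random.default_rng(seed)
    return {"W1": 0.1 * rng.normal(size=(M, D)), "B1": np.zeros(M),
            "W2": 0.1 * rng.normal(size=(E, M)), "B2": np.zeros(E),
            "n": n}

def f_theta(params, x):
    """Order-n parametric polynomial kernel of eqn (14):
    f_theta(x) = W2 [(W1 x + B1) o ... o (W1 x + B1)] + B2, with n Hadamard factors."""
    h = params["W1"] @ x + params["B1"]
    return params["W2"] @ (h ** params["n"]) + params["B2"]

# Forward evaluation for a single input x in R^D.
params = init_params(D=1, E=1, M=20, n=3)
print(f_theta(params, np.array([0.5])))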
This section demonstrates the proposed method via the approximation of a simple cubic function, namely

$$f(x) := (x^2 - x + 1)(x + 0.\ldots). \qquad (15)$$

The goal of this analysis is to infer the hidden function $f(x)$. Given a set of training data, $N$ pairs $\{(x_i, f(x_i))\}_{i=1}^N$, the problem is to minimise the loss functional

$$J(\theta) := \frac{1}{N} \sum_{i=1}^{N} |f(x_i) - f_\theta(x_i)|^2. \qquad (16)$$

For this test problem, $N = 25$ training data points were sampled uniformly between $x = -\ldots$ and $x = 2$.

First, a standard ANN 'Multilayer Perceptron' (specifically a three-layer deep, 100 unit wide perceptron network) was tested. The reader unfamiliar with these terms can see [21] for definitions, but it is sufficient for the purposes of this paper to understand that this perceptron model computes the function

$$f_\theta(x) = W_4 \,\sigma\!\left(W_3 \,\sigma\!\left(W_2 \,\sigma\!\left(W_1 x + B_1\right) + B_2\right) + B_3\right) + B_4, \qquad (17)$$

where $W_1 \in \mathbb{R}^{100 \times 1}$, $W_2, W_3 \in \mathbb{R}^{100 \times 100}$, $W_4 \in \mathbb{R}^{1 \times 100}$, $B_1, B_2, B_3 \in \mathbb{R}^{100}$ and $B_4 \in \mathbb{R}$, such that the parameters of this network are $\theta = \{W_i, B_i\}_{i=1}^4$. Additionally, $\sigma(x)$ denotes the sigmoid function:

$$\sigma(x) := \frac{1}{1 + e^{-x}}. \qquad (18)$$

In eqn (17), $\sigma$ is applied to vectors componentwise.

Second, the parametric polynomial method in eqn (14) was tested for polynomial orders $n = 2, 3, 4$. The parameter $M$ was fixed to the same value for all comparisons.

Both the perceptron model and the parametric polynomial kernel model were trained in two stages: the Adam optimiser [15] was first run for 1000 iterations and then for an additional 1000 iterations, each stage with its own fixed learning rate. All ANNs and SGD optimisers were implemented using the Tensorflow software library [1].

Finally, a nonparametric kernel ridge regression estimator of the form in eqn (11) was tested. This was implemented using the SciKit-learn 'KernelRidge' function [19] with a third-order polynomial kernel. Note that this function has additional hyperparameters, $\alpha$, coef0 and $\gamma$; these were set to fixed values, with $\gamma$ left as 'None', and the SciKit-learn documentation describes these parameters in detail. As with the parametric estimator, the choice of maximum polynomial degree ($d$ in eqn (10)) is another hyperparameter. For this demonstration, only the known true value ($d = 3$) was tested with the nonparametric regression estimator.
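For reference, the following is a minimal sketch of how such a third-order polynomial KernelRidge baseline can be set up with scikit-learn [19]. The hyperparameter values and the toy target function are illustrative stand-ins, not the exact settings or data used for the reported results.

import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Third-order polynomial kernel ridge baseline, eqn (11) with d = 3.
# alpha (the ridge term lambda) and coef0 below are illustrative values only.
krr = KernelRidge(kernel="polynomial", degree=3, alpha=0.1, coef0=1.0, gamma=None)

rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 2.0, size=(25, 1))             # toy training inputs
y_train = x_train[:, 0] ** 3 - 2.0 * x_train[:, 0] + 1.0    # toy cubic target, for illustration
krr.fit(x_train, y_train)
y_out = krr.predict(np.array([[4.0]]))                      # query outside the training range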
The values of $J(\theta)$ after running SGD are shown in table 1. The third-order parametric polynomial loss is ten orders of magnitude lower than the regression loss of the perceptron network. The lower loss of the $n = 3$ parametric polynomial method compared to $n = 2$ and $n = 4$ is (of course) expected, as the hidden function is a third-order polynomial. This indicates that several polynomial orders should be tested when applying the proposed technique to other problems.

Function representation              $J(\theta)$
Multilayer Perceptron                . × 10^-
Parametric kernel with n = 2         1. × 10^
Parametric kernel with n = 3         5. × 10^-
Parametric kernel with n = 4         2. × 10^-
Nonparametric polynomial kernel      . × 10^-

Table 1: Values of $J(\theta)$, defined in eqn (16), after optimisation by SGD for the simple regression task.

The results of the analysis are shown in figs 2 and 3. Each model tested was able to recover the true form of $f(x)$ in the region of the training data. Relative errors for each method are shown in fig 4. Both the parametric and nonparametric polynomial methods were also able to extrapolate well beyond the range of the original data for the $n = 3$ model. This can be best seen in fig 3. The perceptron model, by contrast, almost immediately fails to predict values of the hidden function outside of the range of the training data. For inferring hidden polynomial dynamical systems from observations, where the ability to extrapolate beyond the training data is essential, the analysis in this section suggests that the parametric polynomial kernel method can be expected to have performance superior to standard ANN methods.

This analysis also indicates that the loss $J(\theta)$ is an effective indicator of extrapolation performance for polynomial kernel methods (at least in this test case). This is not true for the Multilayer Perceptron model, which had a low $J(\theta)$ value but poor extrapolation performance. One must however take care when making assertions about extrapolation performance, as it is easy to make incorrect inferences in the absence of data.

Figure 2: Comparison of performance of the parametric polynomial kernel method on a simple regression task. Note that the true hidden function, from eqn (15), is underneath the function inferred by the $n = 3$ parametric polynomial. The two coincide because of the virtually perfect fit. The nonparametric polynomial kernel ridge estimator also closely coincides with the true $f(x)$. The 25 regression training data points were calculated by sampling uniformly between $x = -\ldots$ and $x = 2$.

Figure 3: Comparison of performance of the parametric polynomial kernel method on a simple regression task. This is a zoomed out view of fig 2 and shows that the polynomial kernel estimators (both parametric for $n = 3$ and nonparametric) are able to recover the true hidden function in eqn (15) outside of the range of the training data.

Figure 4: Comparison of pointwise relative errors for the simple regression task. Errors are computed as $\left| \frac{y - f(x)}{f(x)} \right|$, where $f(x)$ is the true hidden function defined in eqn (15). The parametric polynomial kernel method has the best performance, followed by the nonparametric polynomial kernel ridge method. Note that the training data was restricted to lie between $x = -\ldots$ and $x = 2$.

Dynamical systems are classified into either difference equations (discrete-time systems) or differential equations (continuous-time systems) [17]. In this paper, only continuous-time dynamical systems are investigated, although the numerical methods presented could be applied to both continuous-time and discrete-time systems. Continuous-time dynamical systems of the form considered in this paper can be expressed as coupled first-order Ordinary Differential Equations (ODEs):

$$\frac{d}{dt} u(t) = f(t, u(t)), \qquad (19)$$

where:

• $t \in [0, \infty)$ represents time;
• $u(t) \in \mathbb{R}^n$ is the vector of values representing the $n$ variables of the system at time $t$;
• $f(t, u(t)) \in \mathbb{R}^n$ represents the prescribed time derivatives of $u(t)$.

A trajectory of a dynamical system refers to a parameterised path $u(t)$ which returns a value of $u$ for all values of the parameter $t$. The value of $u(t)$ in eqn (19) can be computed given some initial value, $u(0)$, by integrating $f(t, u(t))$ forward in time:

$$u(t) = u(0) + \int_0^t \frac{d}{d\tau} u(\tau) \, d\tau = u(0) + \int_0^t f(\tau, u(\tau)) \, d\tau. \qquad (20)$$

To simplify the solution of ODEs and the implementation of the learning algorithm presented in this paper, we only consider first-order systems.
A differential equation of order $m$ of the form

$$\frac{d^m}{dt^m} u(t) = f(t, u(t)) \qquad (21)$$

can be converted into a system of first-order coupled ODEs. This is also the standard approach employed in numerical implementations of ODE solvers; for an example, see the SciPy function solve_ivp [14]. The conversion can be achieved by introducing new variables for the higher derivatives. Consider an $m$-th order equation of the form

$$\frac{d^m u}{dt^m} = g\!\left(t, u, \frac{du}{dt}, \frac{d^2 u}{dt^2}, \cdots, \frac{d^{m-1} u}{dt^{m-1}}\right). \qquad (22)$$

This can be rewritten by replacing the $\frac{d^i u}{dt^i}$ terms by new variables $v_i$ ($i \in [1, m-1]$) such that:

$$\frac{d}{dt} \begin{bmatrix} u \\ v_1 \\ \vdots \\ v_{m-1} \end{bmatrix} = \begin{bmatrix} v_1 \\ \vdots \\ v_{m-1} \\ g(t, u, v_1, v_2, \dots, v_{m-1}) \end{bmatrix}. \qquad (23)$$

As the value of $u$ at some time depends on the values at infinitesimally earlier times through the derivatives of $u$, there is a recursive structure present in the equations (this would be even clearer for difference equations or after a discretisation). The model inference technique presented in this paper uses loop unrolling to simplify the derived optimisation problem.
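As an illustration of the reduction in eqns (22) and (23), the sketch below rewrites a second-order equation as a first-order system and integrates it with the SciPy solve_ivp routine [14]. The particular right-hand side is an arbitrary damped oscillator chosen purely for illustration.

import numpy as np
from scipy.integrate import solve_ivp

# Second-order example u'' = g(t, u, u') reduced to first order as in eqn (23):
# with v1 := du/dt, the state is (u, v1) and d/dt (u, v1) = (v1, g(t, u, v1)).
def g(t, u, v1):
    return -0.5 * v1 - 4.0 * u          # illustrative damped oscillator, not from the paper

def first_order_rhs(t, state):
    u, v1 = state
    return [v1, g(t, u, v1)]

sol = solve_ivp(first_order_rhs, t_span=(0.0, 10.0), y0=[1.0, 0.0],
                method="RK45", t_eval=np.linspace(0.0, 10.0, 200))
u_trajectory = sol.y[0]                 # samples of u(t) on the requested grid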
Model inference, in this context, is the problem of recovering the form of $f(t, u(t))$ (as in eqn (19)) given observations of $u(t)$ at times from $0$ to $T$. Model inference can be expressed as an optimisation problem:

$$\text{Minimise } J(\theta) := \int_0^T \left| u(t) - \left( u(0) + \int_0^t f_\theta(\tau, u(\tau)) \, d\tau \right) \right|^2 dt, \qquad (24)$$

where $J(\theta)$ is a loss functional over some unknown parameters $\theta \in \Theta$. The function $f_\theta(\tau, u(\tau))$ denotes a parametric approximation to the true latent function $f(t, u(t))$. For the purposes of this paper, the parametric representation of $f_\theta$ can be assumed to be a directed acyclic compute graph. Denote the trajectories computed using the integral of $f_\theta$ by

$$\tilde{u}_\theta(t) := u(0) + \int_0^t f_\theta(\tau, u(\tau)) \, d\tau. \qquad (25)$$

Then the loss functional in eqn (24) can be expressed as

$$J(\theta) = \int_0^T |u(t) - \tilde{u}_\theta(t)|^2 \, dt. \qquad (26)$$

In this form, it is clear that $J(\theta)$ measures how closely the observed trajectories $u(t)$ match the predicted trajectories $\tilde{u}_\theta(t)$ for each value of $\theta$. Additionally, although the $L^2$ norm has been used above, this norm could be changed to any other norm as appropriate. For simplicity, only the $L^2$ norm will be used in this paper.

If observations of $\frac{du}{dt}$ are available, the optimisation problem can be expressed in an alternative, but not exactly equivalent, differential form:

$$\text{Minimise } K(\theta) := \int_0^T \left| \frac{d}{dt} u(t) - f_\theta(t, u(t)) \right|^2 dt. \qquad (27)$$

The loss functional surface of $J(\theta)$ will tend to be smoother over $\theta$ when compared to the differential form (since there is an additional integration), potentially altering the behaviour of various optimisation methods. However, the exact minimisers $\theta^*$ of both $J(\theta)$ and $K(\theta)$, if they exist so that $J(\theta^*) = K(\theta^*) = 0$, are the same, as can be seen by differentiation.

The choice to optimise over $K(\theta)$ or $J(\theta)$ depends on the chosen representation of $f_\theta$ and the availability of observations. Assume that only observations of $u(t)$ are available and not direct observations of $\frac{du}{dt}$. Then it is necessary to either introduce some way to approximate $\frac{du}{dt}$ or to approximate $\int_0^t f_\theta(\tau, u(\tau)) \, d\tau$. In the remainder of this section, it is shown that a discretised form of $J(\theta)$, denoted by $\hat{J}(\theta)$, can be derived. The discretised objective $\hat{J}(\theta)$ can be trained using SGD and Backpropagation as long as $f_\theta(t, u(t))$ can be represented by an acyclic compute graph. The derivation of $\hat{J}(\theta)$ proceeds by first approximating the outer integral in eqn (26) using a finite set of observations of $u(t)$. The derivation of the discretisation is completed by approximating the integral $\int_0^t f_\theta(\tau, u(\tau)) \, d\tau$ using standard numerical time integration techniques.

The continuous form of the integral in eqn (25) is not amenable to numerical computation and requires discretisation. In particular, if $f_\theta$ is to be represented by a compute graph and learnt by SGD, then the entire loss functional $J(\theta)$ must be represented by a differentiable, directed acyclic compute graph. To achieve this, it is useful to first note that the integral in eqn (25) can be decomposed into a series of integrals over smaller time domains. Consider the trajectories from times $0$ to $t$ and $0$ to $t + h$:

$$\tilde{u}_\theta(t) := u(0) + \int_0^t f_\theta(\tau, u(\tau)) \, d\tau. \qquad (28)$$

Then,

$$\tilde{u}_\theta(t + h) = u(0) + \int_0^t f_\theta(\tau, u(\tau)) \, d\tau + \int_t^{t+h} f_\theta(\tau, u(\tau)) \, d\tau \qquad (29)$$
$$= \tilde{u}_\theta(t) + \int_t^{t+h} f_\theta(\tau, u(\tau)) \, d\tau, \qquad (30)$$

giving the trajectory predicted by $f_\theta$ from $\tilde{u}_\theta(t)$ to $\tilde{u}_\theta(t + h)$.

The required discretisation can be completed using standard numerical integration techniques. Numerical integration methods such as Euler, Runge-Kutta and Backwards Differentiation (see [13] for an overview) work, roughly, by assuming some functional form for $f(x)$ and analytically integrating this approximation. Numerical integration methods can be expressed as a function of the integrand evaluated at some finite set of $m$ points $\{x_j\}_{j=1}^m$:

$$\int_a^b f(x) \, dx \approx G\left(a, b, f, \{x_j\}_{j=1}^m\right). \qquad (31)$$

Note that the points $a \leq x_j \leq b$ are defined as a part of the specification of a particular numerical integration scheme. The function to be integrated, $f$, must be able to be evaluated at each $x_j$. The trajectories in eqn (30) can then be approximated with a numerical approximation scheme as in eqn (31):

$$\tilde{u}_\theta(t + h) \approx \hat{u}_\theta(t + h) := \hat{u}_\theta(t) + G\left(t, t + h, f_\theta, \{(\tau_j, u(\tau_j))\}_{j=1}^m\right), \qquad (32)$$
$$\hat{u}_\theta(0) := u(0). \qquad (33)$$

Here $\hat{u}_\theta(t)$ refers to a trajectory $\tilde{u}_\theta(t)$ with continuous integrals replaced by approximate numerical integrals. The values $\tau_j$ are evaluation points and correspond to the values $x_j$ in eqn (31). In general, the smaller the value of $h$ the greater the accuracy of the approximation. Small values of $h$, however, increase the computational burden required to compute approximate trajectories.

For practical problems, observations of $u(t)$ will not be available for all times between $0$ and $T$. Typically, the trajectory $u(t)$ will be known only at a finite set of times $t \in \{t_i\}_{i=1}^N$, so that $u(t)$ is known at $\{u(t_i)\}_{i=1}^N$. The finite set $\{(t_i, u(t_i))\}_{i=1}^N$ will be referred to as 'training data' and can be used to discretise the optimisation problem in eqn (24) by the following approximation:

$$\text{Minimise } \tilde{J}(\theta) := \frac{1}{N} \sum_{i=1}^{N} \left| u(t_i) - \left( u(0) + \int_0^{t_i} f_\theta(\tau, u(\tau)) \, d\tau \right) \right|^2 \qquad (34)$$
$$= \frac{1}{N} \sum_{i=1}^{N} \left| u(t_i) - \tilde{u}_\theta(t_i) \right|^2. \qquad (35)$$

However, the terms $\tilde{u}_\theta(t)$ must also be replaced by a discretisation, as in eqn (32).
Assume that a numerical integration scheme is selected that evaluates the integrand at $m$ points. It is convenient to decompose the trajectory integrals $\tilde{u}_\theta(t)$ into a series of integrals over finite subsets of the training data, $t_i$ to $t_{i+p}$ for the window size $p \in \mathbb{N}$ (typically either $m$ or $m - 1$), such that

$$\tilde{u}_\theta(t_{i+p}) = \tilde{u}_\theta(t_i) + \int_{t_i}^{t_{i+p}} f_\theta(\tau, u(\tau)) \, d\tau. \qquad (36)$$

With reference to eqn (32), this can be further approximated by numerical integration:

$$\hat{u}_\theta(t_{i+p}) = \hat{u}_\theta(t_i) + G\left(t_i, t_{i+p}, f_\theta, \{(\tau_j, u(\tau_j))\}_{j=1}^m\right) \qquad (37)$$

such that the value of $u(\tau_j)$ is known (given the training data) for all evaluation points $\tau_j$, $j \in [1, m]$.

Finally, eqn (37) can be modified by using the known value (from the training data) of $u(t_i)$ in place of $\hat{u}_\theta(t_i)$:

$$\hat{u}(t_{i+p}) := u(t_i) + G\left(t_i, t_{i+p}, f_\theta, \{(\tau_j, u(\tau_j))\}_{j=1}^m\right). \qquad (38)$$

Eqn (35) can be approximated by the discretised loss functional $\hat{J}(\theta)$ by inserting $\hat{u}(t)$:

$$\hat{J}(\theta) := \frac{1}{N - p} \sum_{i=1}^{N-p} \left| u(t_{i+p}) - \hat{u}(t_{i+p}) \right|^2 \qquad (39)$$
$$= \frac{1}{N - p} \sum_{i=1}^{N-p} \left| u(t_{i+p}) - \left( u(t_i) + G\left(t_i, t_{i+p}, f_\theta, \{(\tau_j, u(\tau_j))\}_{j=1}^m\right) \right) \right|^2. \qquad (40)$$

As $\hat{J}(\theta)$ is a discrete approximation to $J(\theta)$, the model inference problem in eqn (26) is approximately solved by minimisation of $\hat{J}(\theta)$ over a training data set:

$$\theta^* = \operatorname*{argmin}_\theta J(\theta) \approx \operatorname*{argmin}_\theta \hat{J}(\theta). \qquad (41)$$

The inferred ODE model is then $f_{\theta^*}(t, u(t))$.

Note that in the above derivation, loss functionals have been computed for time-dependent models of the form $f(t, u(t))$. In practice, optimisation over a single trajectory will only provide useful estimates of $f_\theta$ very close to $(t, u(t))$. To find estimates of $f_\theta$ away from those points, one would have to observe multiple trajectories and modify $J(\theta)$ to average over these trajectories. Alternatively, in the autonomous case, where $f$ is of the form $f(u(t))$, one trajectory may be enough to infer $f_\theta$, depending on the number of sampling points available.

To demonstrate concretely how eqn (40) gives a loss functional discretisation, $\hat{J}(\theta)$, for an ODE model that can be optimised by SGD and Backpropagation, an example using simple numerical integration techniques is discussed in this section. Forward Euler (see [13]) computes an approximation to a dynamical system trajectory time integral as follows ($h > 0$):

$$u(t + h) \approx u(t) + h f(t, u(t)). \qquad (42)$$

With reference to eqn (31), Forward Euler is a numerical integration scheme with $m = p = 1$, $\tau_1 = a$ and

$$G(a, b, f, \{(a, u(a))\}) = |b - a| \, f(a, u(a)). \qquad (43)$$

Forward Euler is a so-called explicit method, as the approximation of $u(t + h)$ depends only on functions evaluated at times earlier than $t + h$. Backward Euler, conversely, is an implicit method:

$$u(t + h) \approx u(t) + h f(t + h, u(t + h)). \qquad (44)$$

With reference to eqn (31), Backward Euler is a numerical integration scheme with $m = p = 1$, $\tau_1 = b$, and

$$G(a, b, f, \{(b, u(b))\}) = |b - a| \, f(b, u(b)). \qquad (45)$$

Forward time integration using Backward Euler requires solving a system of equations (typically by Newton-Raphson iterations [13]), as $u(t + h)$ appears on both sides of eqn (44). This is characteristic of implicit integration methods.
The choice of when to use explicit or implicit integration methods for simulation of a system depends on the form of the dynamical system to be approximated [13]. Implicit methods are more efficient and accurate for so-called 'stiff' problems [10, 11].

However, either method can be used to discretise an ODE into a compute graph representation. For example, assume that $f_\theta(t, u(t))$ is represented by an acyclic compute graph. Then, given training data $\{(t_i, u(t_i))\}_{i=1}^N$, the model inference loss functional, $\hat{J}(\theta)$, in eqn (40) can be approximated using Forward Euler as follows:

$$\hat{J}_F(\theta) := \frac{1}{N - 1} \sum_{i=1}^{N-1} \left| u(t_{i+1}) - \left( u(t_i) + |t_{i+1} - t_i| \, f_\theta(t_i, u(t_i)) \right) \right|^2. \qquad (46)$$

Implicit integration schemes can be used in essentially the same way as shown above for Forward Euler. As an example, the Backward Euler scheme in eqn (44) can be used to set the model inference loss functional, $\hat{J}(\theta)$, from eqn (40) as follows:

$$\hat{J}_B(\theta) := \frac{1}{N - 1} \sum_{i=1}^{N-1} \left| u(t_{i+1}) - \left( u(t_i) + |t_{i+1} - t_i| \, f_\theta(t_{i+1}, u(t_{i+1})) \right) \right|^2. \qquad (47)$$

Note that for the explicit Euler scheme, as in eqn (46), up to time $t_N$ we can infer $f_\theta$ only up to time $t_{N-1}$. Hence, there is a time lag in the learning which is not observed for the implicit Euler scheme.

The loss functionals in eqns (46) and (47) are trivially differentiable and acyclic (as the values of $t_i$ and $u(t_i)$ are just constants that have been taken from observations) as long as the graph representation of $f_\theta$ is differentiable and acyclic. Thus, if $f_\theta$ is represented by a differentiable and acyclic compute graph, the loss functionals $\hat{J}(\theta)$ can be optimised by SGD.

More sophisticated integration schemes than Backward or Forward Euler can be used to find a differentiable parametric representation of $\hat{J}(\theta)$. Linear multistep integral approximation schemes are briefly described here as they will be used for the numerical simulations presented in the next section of this paper. Any numerical scheme that is differentiable and representable by a directed acyclic compute graph when inserted into the loss functional could be used. Linear multistep methods are a convenient choice when the training data consists of observations of $u(t)$ that have been sampled at a constant frequency.

From [10], Adams-Moulton linear multistep integration of order $s = 2$ can be used to approximate a trajectory of a dynamical system from time $a$ to time $b = a + 2h$ for some $h \in \mathbb{R}$ as follows:

$$\hat{u}(b) = u(a + h) + h \left( \frac{5}{12} f_\theta(b, u(b)) + \frac{2}{3} f_\theta(a + h, u(a + h)) - \frac{1}{12} f_\theta(a, u(a)) \right). \qquad (48)$$

To derive the loss functional $\hat{J}(\theta)$, assume that training data observations of $u(t)$ are given by $\{t_i, u(t_i)\}_{i=1}^N$ and that the times $t_i$ are evenly spaced such that $t_i = (i - 1)h$. Inserting eqn (48) into eqn (40) gives the Adams-Moulton approximate loss functional ($m = 3$, $p = 2$):

$$\hat{J}_A(\theta) := \frac{1}{N - 2} \sum_{i=1}^{N-2} \left| u(t_{i+2}) - \hat{u}(t_{i+2}) \right|^2. \qquad (49)$$

Note that the full Adams-Moulton integrator (defined in [10]) could also be used to derive a loss functional that approximates a trajectory discretisation using a series of interpolation points between the observations in the training set. For simplicity, only the method shown above (placing the evaluation points at the values in the training data set) is used in this paper.
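The sketch below shows how the simpler Forward Euler loss $\hat{J}_F(\theta)$ of eqn (46) could be minimised by gradient descent when $f_\theta$ is the order-2 parametric kernel of eqn (13). The paper used Tensorflow [1] and the Adam optimiser [15]; the eager-mode TensorFlow 2 interface, the synthetic placeholder data and the dimensions used here are our own illustrative assumptions, not the authors' implementation.

import numpy as np
import tensorflow as tf

D = 8; M = 60                                    # state and intermediate dimensions (illustrative)
W1 = tf.Variable(0.1 * np.random.randn(M, D), dtype=tf.float64)
B1 = tf.Variable(np.zeros(M), dtype=tf.float64)
W2 = tf.Variable(0.1 * np.random.randn(D, M), dtype=tf.float64)
B2 = tf.Variable(np.zeros(D), dtype=tf.float64)
theta = [W1, B1, W2, B2]

def f_theta(u):                                  # eqn (13), applied row-wise to u of shape (batch, D)
    h = tf.matmul(u, tf.transpose(W1)) + B1
    return tf.matmul(h * h, tf.transpose(W2)) + B2

def j_hat_forward_euler(t, u):                   # eqn (46): one-step Forward Euler residuals
    dt = t[1:] - t[:-1]
    pred = u[:-1] + dt[:, None] * f_theta(u[:-1])
    return tf.reduce_mean(tf.reduce_sum((u[1:] - pred) ** 2, axis=1))

# t_obs, u_obs stand in for the training observations {(t_i, u(t_i))}.
t_obs = tf.constant(np.linspace(0.0, 20.0, 401))
u_obs = tf.constant(np.random.randn(401, D))     # placeholder data for the sketch only

opt = tf.keras.optimizers.Adam(learning_rate=1e-2)
for step in range(2000):
    with tf.GradientTape() as tape:
        loss = j_hat_forward_euler(t_obs, u_obs)
    grads = tape.gradient(loss, theta)           # Backpropagation through the discretised loss
    opt.apply_gradients(zip(grads, theta))       # SGD (Adam) update as in eqn (4)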
This section demonstrates the application of the parametric polynomial kernel regression technique to the model inference problem for a dynamical system, using the discretisation detailed in the previous section of this paper. Simulations of the Lorenz–Emanuel system (see [23]) were used as test data. The system consists of $N$ variables, $u_i$ for $1 \leq i \leq N$, arranged periodically such that $u_{N+1} = u_1$, $u_0 = u_N$ and $u_{-1} = u_{N-1}$. Let the full set of variables be denoted by $u := \{u_i\}_{i=1}^N$. The Lorenz–Emanuel system can be highly chaotic, displaying sensitive dependence on initial conditions. The equations of motion for this system are:

$$\frac{du_i}{dt} = (u_{i+1} - u_{i-2}) \, u_{i-1} - u_i + F. \qquad (50)$$

For the analysis in this section, the following parameters were adopted:

$$N = 8, \quad F = 5. \qquad (51)$$

The parameter $F$ represents an external forcing term that prevents the energy in the system from decaying to zero. The value $F = 5$ was chosen to be high enough to cause sensitive dependence on initial conditions.

Model inference was performed given the training data shown in fig 5. The training data was generated using the SciPy solve_ivp method [14] with the 'RK45' algorithm (variable 4th-5th order Runge-Kutta [5]) and sampled at a constant rate for times $t = 0$ to $t = 20$. The initial values for the data were generated by sampling each $u_i$ independently from a normal distribution with mean $0$ and standard deviation $3$:

$$u_i(t = 0) \sim \mathcal{N}(\mu = 0, \sigma = 3). \qquad (52)$$

Figure 5: Lorenz–Emanuel system training data, generated using the model defined in eqn (50).

The performance of the proposed method was tested by resampling new initial conditions from the same distribution in eqn (52) and comparing the outputs from the true simulation to simulations generated using an inferred model. All test simulations were again carried out using the SciPy solve_ivp method with 'RK45' integration [14, 5].

We used the Adams-Moulton loss functional, $\hat{J}_A(\theta)$, in eqn (49) to define the model inference task. The specific form of the inferred models is given in Section 5.3. All models were implemented using Tensorflow [1] and optimised with the Adam variant of SGD (see [15] for implementation details). A fixed optimisation training schedule was adopted in all cases and consisted of three phases, $P_1$, $P_2$, $P_3$. Each phase is described by an ordered pair $(I_i, \eta_i)$, where $I_i$ is the number of gradient descent iterations for that phase and $\eta_i$ is the 'learning rate' parameter as in eqn (4). The phases consisted of 1000, 2000 and 200 iterations respectively, each with its own fixed learning rate. It was found that this schedule was sufficient to minimise $\hat{J}_A(\theta)$ to approximately the maximum achievable precision for all models tested.

Note that the integrator used to generate trajectories (RK45) and that used for discretisation of the ODE trajectories (Adams-Moulton) are not the same. This was to demonstrate that any ODE solver can be used to generate simulations from the inferred model.
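The following is a minimal sketch of how training trajectories of the kind shown in fig 5 can be generated from eqn (50) with the SciPy solve_ivp routine [14]. The sampling grid below is an illustrative choice.

import numpy as np
from scipy.integrate import solve_ivp

N, F = 8, 5.0                                   # parameters of eqn (51)

def lorenz_emanuel(t, u):
    """Right-hand side of eqn (50) with periodic indexing handled by np.roll."""
    return (np.roll(u, -1) - np.roll(u, 2)) * np.roll(u, 1) - u + F

rng = np.random.default_rng(0)
u0 = rng.normal(loc=0.0, scale=3.0, size=N)     # initial condition as in eqn (52)

t_eval = np.arange(0.0, 20.0, 0.05)             # illustrative sampling grid on t in [0, 20]
sol = solve_ivp(lorenz_emanuel, t_span=(0.0, 20.0), y0=u0, method="RK45", t_eval=t_eval)
t_train, u_train = sol.t, sol.y.T               # training pairs {(t_i, u(t_i))}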
To complete the specification of the problem, the basic form of $f_\theta$ must be provided. If the form of the dynamical system equations is known beforehand, this information can be used to simplify the analysis. If no information is available, a search over different types of compute graph architectures must be conducted (as in [24]). For this demonstration, only a polynomial structure is assumed. This is a reasonable assumption that one could make when investigating general interdependent data observations from a dynamical system without any other prior knowledge, as a number of systems have such a structure [23].

For this inference task, the exact form of the polynomial couplings between the various $u_i$ was not provided to the compute graph. Instead, two types of polynomial nonlinearities were tested. First, a linear combination of all second-order polynomial terms that could be constructed using each of the $u_i$ terms was considered, that is, equations of the form

$$\frac{d\hat{u}_i}{dt} = f^i_\theta(u) = \sum_{k=1}^{N} \sum_{j=1}^{k} \alpha^i_{kj} u_k u_j + \beta^i_k u_k + \gamma^i \qquad (54)$$

for each $i \in [1, \dots, N]$. The parameters are $\gamma^i \in \mathbb{R}$, $\beta^i \in \mathbb{R}^N$, $\alpha^i_{kj} \in \mathbb{R}$ for $i \in [1, \dots, N]$, $k \in [1, \dots, N]$, $j \in [1, \dots, k]$. This sort of polynomial is of the traditional form used for polynomial chaos expansions (see [26]).

Second, the parametric polynomial kernel method introduced in this paper and defined in eqn (13), with dimensions $D = E = 8$, was used to represent $f_\theta$. Values of $M = 60$, $80$ and $100$ were tried to test the effect of this parameter on the accuracy of the results.

Stochastic Gradient Descent, combined with ODE trajectory discretisation, was successfully applied to model inference for the Lorenz–Emanuel system in eqn (50). Our parametric kernel model gave the best accuracy on the inference task. Importantly, the kernel model was able to be tuned to higher accuracies by increasing the number of weights used, $M$. Although increasing $M$ increases the number of total parameters to be optimised, this trade-off may be worthwhile depending on the particular problem.

Figure 6: Lorenz–Emanuel system error vs time. Errors are calculated as per eqn (56).

The performance of the different models is shown in fig 6. The accumulated error, $\epsilon(t)$, was calculated as the sum of squared errors from the true model:

$$\epsilon(t = 0) = 0, \qquad (55)$$
$$\epsilon(t + h) = \sqrt{\left( u(t + h) - \hat{u}(t + h) \right)^2} + \epsilon(t), \qquad (56)$$

where $h$ matches the training data sampling interval. The errors were calculated for the polynomial feature model in eqn (54) and the polynomial kernel model in eqn (13) for $M = 60$, $M = 80$ and $M = 100$.

From fig 6, the direct polynomial feature mapping had the worst accuracy. The parametric kernel method was able to track the system evolution more accurately. In all cases, the inferred models were able to maintain a small inference error at times up to at least an order of magnitude greater than the training data sampling rate.

Performance on the model inference task for the polynomial kernel method defined by eqn (13) with $M = 100$ is demonstrated in fig 7. This pair of figures shows a comparison between the true model output, $u(t)$, and the inferred model output, $\hat{u}(t)$. From fig 7, it can be seen that the overall structure of the equations is captured by the inferred model. Due to the chaotic nature of the system being analysed, once a few errors accumulate, the true and inferred models diverge rapidly.

The parametric polynomial kernel method was able to infer the hidden ODE model with good accuracy given a fixed set of training data. The accumulated errors grow quickly with time. This is reasonable considering the chaotic nature of the Lorenz–Emanuel system.
A more mathematically rigorous stability analysis of the numerical scheme would be interesting but is beyond the scope of this paper. A number of possible variations on the numerical example presented could be analysed in future work. For instance, the type of integration method used, the sampling rate of the data, and the effect of different amounts of training data would all be interesting to investigate.
This paper presented a parametric form of polynomial kernel regression, as well as numerical case studies. In particular, the proposed method was applied to the model inference problem for a chaotic dynamical system. Our parametric polynomial kernel method was able to harness the power of kernelised regression without the cubic computational complexity typically incurred by nonparametric polynomial regression, thereby avoiding the curse of dimensionality.

Although the method was successfully applied to a test problem, more work will be required to fully understand how best to apply parametric polynomial kernels to real world (rather than simulated) data. As is the case in all regression models, some form of regularisation would need to be included to address overfitting and observational noise.

It was assumed for the analysis in this paper that it was known a priori that only certain polynomial couplings are present. Using the wrong polynomial order in the model expansion was found to cause convergence difficulties. This is also the case in nonparametric kernel regression (see [18] and the example in fig 2). As such, this is not considered a serious limitation of the method, in that it is possible to test a few different sets of model forms when attempting to find a good fit to a data set. Bayesian model selection methods could be applied to formally assess the quality of different polynomial kernel model dimensions.

It is worth noting that direct projection onto polynomial features was found to perform poorly compared to the polynomial kernel method. Although stochasticity was not considered in this paper, it is quite possible that this finding will impact standard techniques frequently employed for Uncertainty Quantification. A kernel representation of the type introduced in this paper, applied to Gaussian and other stochastic features, may be useful for improving standard polynomial chaos methods (which are described in [26]).

The search for effective compute graph architectures remains a problem that plagues all methods attempting to learn hidden function structures without inserting large amounts of prior knowledge into the inverse problem. Scaling to very high-dimensional problems would be an interesting challenge. Given the partial decoupling from the curse of dimensionality that gradient descent methods can provide, it is hoped that the techniques presented in this paper will be suitable for model inference on large scale dynamical systems in the future.
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No 757254 (SINGULARITY), and a Lloyds Register Foundation grant for the Data-Centric Engineering programme at the Alan Turing Institute.

Figure 7: Comparison of output traces for the Lorenz–Emanuel system, defined in eqn (50): (a) true system simulation, $u(t)$, and (b) most accurate inferred model, $\hat{u}(t)$. The inferred model structure is given by the parametric polynomial kernel in eqn (13) for $M = 100$.

REFERENCES
[1] M. Abadi, A. Agarwal, P. Barham and others, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015.
[2] J. Adler, O. Öktem, Solving ill-posed inverse problems using iterative deep neural networks, Inverse Problems, 2017.
[3] C.M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[4] B. Carpenter, M.D. Hoffman, M. Brubaker, D. Lee, P. Li, M. Betancourt, The Stan math library: Reverse-mode automatic differentiation in C++. ArXiv, arXiv:1509.07164, 2015.
[5] R. Dormand, P.J. Prince, A family of embedded Runge-Kutta formulae, Journal of Computational and Applied Mathematics, 19–26, 1980.
[6] P. Eberhard, C. Bischof, Automatic differentiation of numerical integration algorithms, 717–732, 1999.
[7] S. Epp, Discrete Mathematics with Applications. Cengage Learning, 2010.
[8] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning. MIT Press, 2016.
[9] W.W. Hager, Updating the Inverse of a Matrix, SIAM Review, 221–239, 1989.
[10] E. Hairer, S.P. Nørsett, G. Wanner, Solving Ordinary Differential Equations I: Nonstiff Problems. Springer Science & Business Media, 1993.
[11] E. Hairer, S.P. Nørsett, G. Wanner, Solving Ordinary Differential Equations II: Stiff and Differential-Algebraic Problems. Springer Science & Business Media, 1996.
[12] R.A. Horn, C.R. Johnson, Matrix Analysis. Cambridge University Press, 2012.
[13] A. Iserles, A First Course in the Numerical Analysis of Differential Equations. Cambridge University Press, 2009.
[14] E. Jones, T. Oliphant, P. Peterson, SciPy: Open source scientific tools for Python, 2018.
[15] D.P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
[16] H.G. Matthies, E. Zander, B.V. Rosić, A. Litvinenko, O. Pajonk, Inverse Problems in a Bayesian Setting, Computational Methods for Solids and Fluids, 245–286, 2016.
[17] J.D. Meiss, Differential Dynamical Systems, Revised Edition. SIAM, 2017.
[18] K.P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[19] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 2825–2830, 2011.
[20] L.B. Rall, Automatic Differentiation: Techniques and Applications, Lecture Notes in Computer Science, 1981.
[21] S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach. Pearson, 2016.
[22] M. Schmidt, H. Lipson, Distilling free-form natural laws from experimental data, Science, 81–85, 2009.
[23] J.C. Sprott, Elegant Chaos: Algebraically Simple Chaotic Flows. World Scientific Publishing Company, 2010.
[24] K.O. Stanley, R. Miikkulainen, Evolving Neural Networks through Augmenting Topologies, Evolutionary Computation, 99–127, 2002.
[25] A.M. Stuart, Inverse problems: A Bayesian perspective, Acta Numerica, 451–559, 2005.
[26] B. Sudret, Uncertainty propagation and sensitivity analysis in mechanical models: Contributions to structural reliability and stochastic spectral methods. Habilitation à diriger des recherches, Université Blaise Pascal, 2007.
[27] J.R. Taylor, Classical Mechanics. University Science Books, 2005.
[28] K. Vu, J.C. Snyder, L. Li, M. Rupp, B.F. Chen, T. Khelif, K.R. Müller, K. Burke, Understanding kernel ridge regression: Common behaviors from simple functions to density functionals, International Journal of Quantum Chemistry, 1115–1128, 2015.
[29] P. Werbos, Generalisation of Backpropagation with application to a recurrent gas market model, Neural Networks, 1(4), 1988.