CINDy: Conditional gradient-based Identification of Non-linear Dynamics -- Noise-robust recovery
Alejandro Carderera [email protected]
School of Industrial and Systems Engineering, Georgia Institute of Technology, USA
Zuse Institute Berlin, Germany
Sebastian Pokutta [email protected]
Institute of Mathematics, Technische Universität Berlin, Germany
Zuse Institute Berlin, Germany
Christof Schütte [email protected]
Institute of Mathematics, Freie Universität Berlin, Germany
Zuse Institute Berlin, Germany
Martin Weiser [email protected]
Zuse Institute Berlin, Germany
Abstract
Governing equations are essential to the study of nonlinear dynamics, often enabling the prediction of previously unseen behaviors as well as inclusion into control strategies. The discovery of governing equations from data thus has the potential to transform data-rich fields where well-established dynamical models remain unknown. This work contributes to the recent trend in data-driven sparse identification of nonlinear dynamics of finding the best sparse fit to observational data in a large library of potential nonlinear models. We propose an efficient first-order Conditional Gradient algorithm for solving the underlying optimization problem. In comparison to the most prominent alternative algorithms, the new algorithm shows significantly improved performance on several essential issues like sparsity-induction, structure-preservation, noise robustness, and sample efficiency. We demonstrate these advantages on several dynamics from the fields of synchronization, particle dynamics, and enzyme chemistry.
1. Introduction
Many of the developments of physics have stemmed from our ability to describe natural phenomena in terms of differential equations. These equations have helped build our understanding of natural phenomena in fields as wide-ranging as classical mechanics, electromagnetism, fluid dynamics, neuroscience and quantum mechanics. They have also enabled key technological advances such as the combustion engine, the laser, or the transistor.

The modern age of Machine Learning and Big Data has heralded an age of data-driven models, in which the phenomena we explain are described in terms of statistical relationships and static data. Given sufficient data, we are able to train neural networks to classify, or to predict, with high accuracy, without the underlying model having any apparent knowledge of how the data was generated, or of its structure. This makes classifying or predicting on out-of-sample data particularly challenging. On the other hand, there has been a recent surge of interest in recovering the differential equations with which the data, often coming from a physical system, have been generated. This enables us to better understand how the data is generated, and to better predict on out-of-sample data, as opposed to using other learning approaches. Moreover, learning governing equations also permits understanding the mechanisms underlying the observed dynamical behavior; this is key to further scientific progress.

The seminal work of (Schmidt & Lipson, 2009) used symbolic regression to search the space of mathematical expressions, in order to find one that adequately fits the data. This entails randomly combining mathematical operations, analytical functions, state variables and constants, and selecting those that show promise. These are later randomly expanded and combined in search of an expression that represents the data sufficiently well. Related to this approach is the Approximate Vanishing Ideal
Algorithm (Heldt et al., 2009), based on the combination of Gröbner and border bases with total least-squares regression, where a set of polynomials over (arbitrary) basis functions is successively expanded to capture all relations approximately satisfied by the data. A more recent algorithm, known as the
Sparse Identification of Nonlinear Dynamics (SINDy) algorithm, assumes that we have access to a library of predefined basis functions, and the problem becomes that of finding a linear combination of basis functions that best predicts the data at hand. This is done using sequentially-thresholded least squares, in order to recover a sparse linear combination of basis functions (and potentially the coordinate system) that is able to represent the underlying phenomenon well (Brunton et al., 2016; Champion et al., 2019). This algorithm works extremely well when using noise-free data, but often produces dense solutions when the data is contaminated with noise. There have been several suggestions to deal with this, from more noise-robust non-convex problem formulations (Schaeffer & McCalla, 2017), to problem formulations that involve learning both the dynamic and the noise contaminating the underlying data (Rudy et al., 2019; Kaheman et al., 2020). Neither of these approaches is computationally efficient for high-dimensional problems, and the former has the additional drawback that the problem formulation is non-convex.
In this paper we present a Conditional gradient-based Identification of Non-linear Dynamics algorithm, dubbed CINDy in homage to the influential SINDy algorithm presented in (Brunton et al., 2016): a sparsity-inducing optimization algorithm that can be used to solve convex formulations of the sparse recovery problem. CINDy is a first-order convex optimization algorithm based on the Conditional Gradient (CG) algorithm (Levitin & Polyak, 1966), also known as the Frank-Wolfe algorithm (Frank & Wolfe, 1956), that brings together many of the advantages of existing sparse recovery techniques into a single algorithm. As documented in detail below, we compared CINDy to the most prominent alternative algorithms for solving the respective learning problem (SINDy, FISTA, IPM) with the following results:
1. Sparsity-inducing. The CG-based algorithm has an implicit bias for sparse solutions through the way it builds its iterates. Other existing approaches are forced to ensure sparsity through thresholding, or through problem formulations that encourage sparsity. This has a major impact on the structural generalization behavior, where CINDy significantly outperforms other methods, leading to much more accurate trajectory predictions in the presence of noise.
2. Structure-preserving dynamics. The CINDy algorithm can easily incorporate underlying symmetries and conservation laws into the learning problem, resulting in learned dynamics consistent with the true physics, with minimal impact on the running time of the algorithm, while significantly reducing the sample complexity (due to the reduced number of degrees of freedom) and improving generalization performance.
3. Noise robustness. When it comes to the recovery of dynamics in the presence of noise, we demonstrate a significant advantage of CINDy of about one to two orders of magnitude in recovery error with respect to the true dynamic, rather than just in out-of-sample errors. This is largely due to the sparsity induced by the underlying CG method.
4. Sample efficiency and large-scale learning. We demonstrate that, for a given noise level, CINDy requires significantly fewer samples to recover the dynamic to a given accuracy. Moreover, being a first-order method, our approach naturally allows for the learning of large-scale dynamics, and even admits the use of stochastic first-order information in the extremely large-scale regime.
5. Black-box implementation. We provide an implementation of CINDy that can be used as a black box, not requiring any specialized knowledge of CG methods. The source code is made available under https://github.com/ZIB-IOL. We hope that this stimulates research in the use of CG-based algorithms for sparse recovery.

1.2 Preliminaries
We denote vectors using bold lower-case letters, and matrices using upper-case letters. We will use $x_i$ to refer to the $i$-th element of the vector $\mathbf{x}$, and $X_{i,j}$ to refer to the element in the $i$-th row and $j$-th column of the matrix $X$. Let $\|\mathbf{x}\|_1$ and $\|\mathbf{x}\|_2$ denote the $\ell_1$ and $\ell_2$ norm of $\mathbf{x}$, respectively; furthermore, let $\|\mathbf{x}\|_0$ denote the $\ell_0$ norm (which, technically, is not a norm), that is, the number of non-zero elements in $\mathbf{x}$. Moreover, given a matrix $X \in \mathbb{R}^{n \times m}$, for $p, q \geq 1$ let $\|X\|_{p,q} = \big( \sum_{j=1}^{m} \big( \sum_{i=1}^{n} |X_{i,j}|^p \big)^{q/p} \big)^{1/q}$ denote the $\ell_{p,q}$ norm of $X$. We will use $\|X\|_F = \|X\|_{2,2}$ to refer to the familiar Frobenius norm of a matrix, and $\|X\|_0$ to refer to the number of non-zero elements in $X$. Given a matrix $X \in \mathbb{R}^{m \times n}$, let $\operatorname{vec}(X) \in \mathbb{R}^{mn}$ denote the vectorization of the matrix $X$, that is, the column-wise stacking $\operatorname{vec}(X) = [X_{1,1}, \cdots, X_{m,1}, X_{1,2}, \cdots, X_{m,2}, \cdots, X_{1,n}, \cdots, X_{m,n}]^T$. Given a non-empty set $\mathcal{S} \subset \mathbb{R}^n$ we refer to its convex hull as $\operatorname{conv}(\mathcal{S})$. The trace of a square matrix $X \in \mathbb{R}^{n \times n}$ will be denoted by $\operatorname{trace}(X)$. We use $\dot{\mathbf{x}}(t)$ to denote the derivative of $\mathbf{x}(t)$ with respect to time $t$, that is, $\dot{\mathbf{x}}(t) = \frac{d\mathbf{x}(t)}{dt}$. Given two integers $i \in \mathbb{Z}$ and $j \in \mathbb{Z}$ with $i \leq j$, we use $[\![i, j]\!]$ to denote the set $\{k \in \mathbb{Z} \mid i \leq k \leq j\}$. The vector with all entries equal to one is denoted by $\mathbf{1}_d \in \mathbb{R}^d$. Lastly, we use $\Delta_d$ to denote the unit probability simplex of dimension $d$, that is, the set $\Delta_d = \{\mathbf{x} \in \mathbb{R}^d \mid \mathbf{1}_d^T \mathbf{x} = 1,\ \mathbf{x} \geq 0\}$.
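As a quick numerical sanity check of this notation (a small NumPy sketch of our own; the variable names are ours), the $\ell_{p,q}$ norm, the identity $\|X\|_F = \|X\|_{2,2}$, and the column-stacking convention of $\operatorname{vec}(\cdot)$ can be verified directly:

```python
import numpy as np

def lpq_norm(X, p, q):
    # ||X||_{p,q}: inner l_p norm over each column, outer l_q norm over those column norms.
    col_norms = np.sum(np.abs(X) ** p, axis=0) ** (1.0 / p)
    return np.sum(col_norms ** q) ** (1.0 / q)

X = np.arange(6.0).reshape(2, 3)                    # a small 2 x 3 example matrix
print(np.isclose(lpq_norm(X, 2, 2), np.linalg.norm(X, "fro")))  # True: ||X||_F = ||X||_{2,2}
print(lpq_norm(X, 1, 1) == np.abs(X).sum())                     # entrywise l_{1,1} norm
print(X.flatten(order="F"))                          # vec(X): columns stacked on top of each other
```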
2. Learning sparse dynamics
Many physical systems can be described in terms of ordinary differential equations of the form $\dot{\mathbf{x}}(t) = F(\mathbf{x}(t))$, where $\mathbf{x}(t) \in \mathbb{R}^d$ denotes the state of the system at time $t$ and $F : \mathbb{R}^d \to \mathbb{R}^d$ can usually be expressed as a linear combination of simpler ansatz functions $\psi_i : \mathbb{R}^d \to \mathbb{R}$ belonging to a dictionary $\mathcal{D} = \{\psi_i \mid i \in [\![1, n]\!]\}$. This allows us to express the dynamic followed by the system as $\dot{\mathbf{x}}(t) = F(\mathbf{x}(t)) = \Xi^T \boldsymbol{\psi}(\mathbf{x}(t))$, where $\Xi \in \mathbb{R}^{n \times d}$ is a -- typically sparse -- matrix $\Xi = [\xi_1, \cdots, \xi_d]$ formed by column vectors $\xi_i \in \mathbb{R}^n$ for $i \in [\![1, d]\!]$, and $\boldsymbol{\psi}(\mathbf{x}(t)) = [\psi_1(\mathbf{x}(t)), \cdots, \psi_n(\mathbf{x}(t))]^T \in \mathbb{R}^n$. We can therefore write:
\[ \dot{\mathbf{x}}(t) = \begin{bmatrix} \xi_1^T \\ \vdots \\ \xi_d^T \end{bmatrix} \begin{bmatrix} \psi_1(\mathbf{x}(t)) \\ \vdots \\ \psi_n(\mathbf{x}(t)) \end{bmatrix}. \quad (2.1) \]
Alternatively, one could also consider that for any $t \geq t_1$ we can write $\mathbf{x}(t) = \mathbf{x}(t_1) + \int_{t_1}^{t} \dot{\mathbf{x}}(\tau)\, d\tau = \mathbf{x}(t_1) + \int_{t_1}^{t} \Xi^T \boldsymbol{\psi}(\mathbf{x}(\tau))\, d\tau = \mathbf{x}(t_1) + \Xi^T \int_{t_1}^{t} \boldsymbol{\psi}(\mathbf{x}(\tau))\, d\tau$. In matrix form this results in:
\[ \mathbf{x}(t) - \mathbf{x}(t_1) = \begin{bmatrix} \xi_1^T \\ \vdots \\ \xi_d^T \end{bmatrix} \begin{bmatrix} \int_{t_1}^{t} \psi_1(\mathbf{x}(\tau))\, d\tau \\ \vdots \\ \int_{t_1}^{t} \psi_n(\mathbf{x}(\tau))\, d\tau \end{bmatrix}. \quad (2.2) \]
In the absence of noise, if we are given a series of data points from the physical system $\{\mathbf{x}(t_i), \dot{\mathbf{x}}(t_i)\}_{i=1}^{m}$, then we know that:
\[ \begin{bmatrix} \dot{\mathbf{x}}(t_1) & \cdots & \dot{\mathbf{x}}(t_m) \end{bmatrix} = \begin{bmatrix} \xi_1^T \\ \vdots \\ \xi_d^T \end{bmatrix} \begin{bmatrix} \boldsymbol{\psi}(\mathbf{x}(t_1)) & \cdots & \boldsymbol{\psi}(\mathbf{x}(t_m)) \end{bmatrix}. \]
Or alternatively, viewing the dynamic from an integral perspective, we have that:
\[ \begin{bmatrix} \mathbf{x}(t_2) - \mathbf{x}(t_1) & \cdots & \mathbf{x}(t_m) - \mathbf{x}(t_1) \end{bmatrix} = \begin{bmatrix} \xi_1^T \\ \vdots \\ \xi_d^T \end{bmatrix} \begin{bmatrix} \int_{t_1}^{t_2} \psi_1(\mathbf{x}(\tau))\, d\tau & \cdots & \int_{t_1}^{t_m} \psi_1(\mathbf{x}(\tau))\, d\tau \\ \vdots & \ddots & \vdots \\ \int_{t_1}^{t_2} \psi_n(\mathbf{x}(\tau))\, d\tau & \cdots & \int_{t_1}^{t_m} \psi_n(\mathbf{x}(\tau))\, d\tau \end{bmatrix}. \]
If we collect the data in matrices $\delta X = [\mathbf{x}(t_2) - \mathbf{x}(t_1), \cdots, \mathbf{x}(t_m) - \mathbf{x}(t_1)] \in \mathbb{R}^{d \times (m-1)}$, $\dot{X} = [\dot{\mathbf{x}}(t_1), \cdots, \dot{\mathbf{x}}(t_m)] \in \mathbb{R}^{d \times m}$, $\Psi(X) = [\boldsymbol{\psi}(\mathbf{x}(t_1)), \cdots, \boldsymbol{\psi}(\mathbf{x}(t_m))] \in \mathbb{R}^{n \times m}$, and $\Gamma(X) \in \mathbb{R}^{n \times (m-1)}$ with $\Gamma(X)_{i,j} = \int_{t_1}^{t_{j+1}} \psi_i(\mathbf{x}(\tau))\, d\tau$, we can view the dynamic from two perspectives:
\[ \text{Differential approach: } \dot{X} = \Xi^T \Psi(X). \qquad \text{Integral approach: } \delta X = \Xi^T \Gamma(X). \]
Consequently, when we try to recover the sparsest dynamic that fits these data, we can attempt to solve one of two problems, which we present in tandem:
\[ \text{Differential approach: } \operatorname*{argmin}_{\substack{\Omega \in \mathbb{R}^{n \times d} \\ \dot{X} = \Omega^T \Psi(X)}} \|\Omega\|_0. \qquad \text{Integral approach: } \operatorname*{argmin}_{\substack{\Omega \in \mathbb{R}^{n \times d} \\ \delta X = \Omega^T \Gamma(X)}} \|\Omega\|_0. \quad (2.3) \]
Note that in the previous problem formulation we are implicitly assuming that we can compute $\Gamma(X)_{i,j}$ exactly, which is usually not the case. In practice we have to resort to approximating the integrals using quadrature over the given data, for example $\Gamma(X)_{i,j} \approx \sum_{k=1}^{j} \frac{t_{k+1} - t_k}{2} \left( \psi_i(\mathbf{x}(t_k)) + \psi_i(\mathbf{x}(t_{k+1})) \right)$. If we have access to $\dot{X}$ and $X$, it makes sense to attack the problem from a differential perspective, but if we only have access to $X$, and we have to estimate $\dot{X}$ from data, there are occasions where we can benefit from the integral approach, as we can potentially estimate $\Gamma(X)$ more accurately than $\dot{X}$; this can be true in particular in the presence of noise. Henceforth, we use $\dot{X}$ and $\Gamma(X)$ to denote the approximate matrices computed using numerical rules, as opposed to the exact differential and integral matrices. Unfortunately, the problems shown in Equation (2.3) are notoriously difficult NP-hard combinatorial problems, due to the presence of the $\ell_0$ norm in the objective function of both optimization problems (Juditsky & Nemirovski, 2020). Moreover, if the data points are contaminated by noise, leading to noisy matrices $\dot{Y}$, $\delta Y$, $\Psi(Y)$ and $\Gamma(Y)$, then, depending on the expressive power of the basis functions $\psi_i$ for $i \in [\![1, n]\!]$, it may not even be possible (or desirable) to satisfy $\dot{Y} = \Omega^T \Psi(Y)$ or $\delta Y = \Omega^T \Gamma(Y)$ for any $\Omega \in \mathbb{R}^{n \times d}$. Thus one can attempt to solve, for a suitably chosen $\varepsilon > 0$:
\[ \text{Differential approach: } \operatorname*{argmin}_{\substack{\Omega \in \mathbb{R}^{n \times d} \\ \|\dot{Y} - \Omega^T \Psi(Y)\|_F \leq \varepsilon}} \|\Omega\|_0. \qquad \text{Integral approach: } \operatorname*{argmin}_{\substack{\Omega \in \mathbb{R}^{n \times d} \\ \|\delta Y - \Omega^T \Gamma(Y)\|_F \leq \varepsilon}} \|\Omega\|_0. \quad (2.4) \]
The most popular sparse recovery algorithm, dubbed SINDy (Brunton et al., 2016), solves a component-wise relaxation of a problem very closely related to the differential problem shown in Equation (2.4) (Zhang & Schaeffer, 2019). Each step of the SINDy algorithm consists of a least-squares step and a thresholding step. The coefficients that have been thresholded are discarded in future iterations, making the least-squares problem progressively smaller. More specifically, this process, when applied to one of the components of the problem, converges to (one of) the local minimizers of:
\[ \operatorname*{argmin}_{\xi_j \in \mathbb{R}^{n}} \sum_{i=1}^{m} \left( \dot{x}_j(t_i) - \xi_j^T \boldsymbol{\psi}(\mathbf{x}(t_i)) \right)^2 + \alpha \|\xi_j\|_0, \]
for a suitably chosen $\alpha \geq 0$ (Zhang & Schaeffer, 2019) and for $j \in [\![1, d]\!]$. This methodology was later extended to partial differential equations by alternating between ridge-regression steps (as opposed to least-squares steps) and thresholding steps in (Rudy et al., 2017). In another seminal paper, Schaeffer & McCalla (2017) framed the sparse recovery problem from an integral perspective for the first time, using the Douglas-Rachford algorithm (Combettes & Pesquet, 2011) to solve the non-convex integral problem in Equation (2.4).
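For illustration, the sketch below (our own, with a placeholder polynomial dictionary and threshold, not the reference implementation) builds the library matrix $\Psi(Y)$, approximates $\Gamma(Y)$ with the trapezoidal rule from above, and runs a sequentially-thresholded least-squares loop of the kind described for SINDy:

```python
import numpy as np

def build_library(Y, degree=2):
    """Monomial dictionary up to a given degree. Y has shape (d, m): one column per sample."""
    d, m = Y.shape
    cols = [np.ones(m)]                                   # constant term
    cols += [Y[i] for i in range(d)]                      # linear terms
    if degree >= 2:
        cols += [Y[i] * Y[j] for i in range(d) for j in range(i, d)]
    return np.vstack(cols)                                # Psi(Y), shape (n, m)

def trapezoid_integrals(Psi, t):
    """Gamma(Y)_{i,j} ~ integral of psi_i from t_1 to t_{j+1}, trapezoidal rule."""
    dt = np.diff(t)                                       # t_{k+1} - t_k
    increments = 0.5 * dt * (Psi[:, :-1] + Psi[:, 1:])
    return np.cumsum(increments, axis=1)                  # shape (n, m - 1)

def stlsq(Psi, Ydot, threshold=0.1, iterations=10):
    """Sequentially-thresholded least squares; one column of Omega per state variable."""
    Omega = np.linalg.lstsq(Psi.T, Ydot.T, rcond=None)[0]   # initial least-squares fit, (n, d)
    for _ in range(iterations):
        small = np.abs(Omega) < threshold
        Omega[small] = 0.0
        for j in range(Ydot.shape[0]):                      # refit each state on surviving terms
            big = ~small[:, j]
            if big.any():
                Omega[big, j] = np.linalg.lstsq(Psi[big].T, Ydot[j], rcond=None)[0]
    return Omega
```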
Schaeffer & McCalla (2017) showed experimentally that, when the data is contaminated with noise and information about the derivatives has to be computed numerically, it can be advantageous to use the integral approach, as opposed to the differential approach, since numerical integration is more robust to noise than numerical differentiation. However, both problem formulations in Equation (2.4) remain non-convex, and so, as is often done in optimization, we can attempt to convexify the problematic term in the optimization problem, namely substituting the $\ell_1$ norm for the $\ell_0$ norm. Note that the smallest value of $p \geq 0$ that results in the norm $\|\cdot\|_{p,p}$ being convex is $p = 1$. This leads us to a problem known as basis pursuit denoising (BPD) (Chen et al., 1998), which can be written as:
\[ \text{BPD Differential approach: } \operatorname*{argmin}_{\substack{\Omega \in \mathbb{R}^{n \times d} \\ \|\dot{Y} - \Omega^T \Psi(Y)\|_F \leq \epsilon}} \|\Omega\|_{1,1}, \qquad \text{BPD Integral approach: } \operatorname*{argmin}_{\substack{\Omega \in \mathbb{R}^{n \times d} \\ \|\delta Y - \Omega^T \Gamma(Y)\|_F \leq \epsilon}} \|\Omega\|_{1,1}, \quad (2.5) \]
for appropriately chosen $\epsilon > 0$. The formulation shown in Equation (2.5), initially developed by the signal processing community, is intimately tied to the Least Absolute Shrinkage and Selection Operator (LASSO) regression formulation (Tibshirani, 1996), developed in the statistics community, which takes the form:
\[ \text{LASSO Differential approach: } \operatorname*{argmin}_{\substack{\Omega \in \mathbb{R}^{n \times d} \\ \|\Omega\|_{1,1} \leq \alpha}} \|\dot{Y} - \Omega^T \Psi(Y)\|_F. \qquad \text{LASSO Integral approach: } \operatorname*{argmin}_{\substack{\Omega \in \mathbb{R}^{n \times d} \\ \|\Omega\|_{1,1} \leq \alpha}} \|\delta Y - \Omega^T \Gamma(Y)\|_F. \quad (2.6) \]
In fact, the differential approach to the LASSO problem shown in Equation (2.6) was used in (Schaeffer, 2017) in conjunction with the Douglas-Rachford algorithm (Combettes & Pesquet, 2011) to solve the sparse recovery problem. A variation of the LASSO problem was also used to recover the governing equations in chemical reaction systems (Hoffmann et al., 2019) using a sequential quadratic optimization algorithm. The following proposition formalizes the relationship between the BPD and the LASSO problems.

Proposition 2.1 (Foucart & Rauhut (2017), Proposition 3.2).
1. If $\Xi$ is the unique minimizer of the BPD problem shown in Equation (2.5) with $\epsilon > 0$, then there exists an $\alpha \geq 0$ such that $\Xi$ is the unique minimizer of the LASSO problem shown in Equation (2.6).
2. If $\Xi$ is a minimizer of the LASSO problem shown in Equation (2.6) with $\alpha > 0$, then there exists an $\epsilon \geq 0$ such that $\Xi$ is a minimizer of the BPD problem shown in Equation (2.5).

Both problems shown in Equations (2.5) and (2.6) have a convex objective function and a convex feasible region, which allows us to use the powerful tools and guarantees of convex optimization. These two formulations can also be recast as an unconstrained optimization problem (via Lagrange dualization) in which the $\ell_1$ norm has been added to the objective function (see Foucart & Rauhut (2017) and Borwein & Lewis (2010) for more details). Moreover, there is a significant body of theoretical literature, both from the statistics and the signal processing communities, on the conditions under which we can successfully recover the support of $\Xi$ (see, e.g., Wainwright (2009)), on the uniqueness of LASSO solutions (see, e.g., Tibshirani et al. (2013)), or on the robust reconstruction of phenomena from incomplete data (see, e.g., Candès et al. (2006)), to name but a few results.

Remark 2.2 (From learning ODEs to learning PDEs). Section 2 so far has only dealt with the case where the dynamic is expressed as an ordinary differential equation (ODE). This framework can also be extended to deal with the case of a dynamic expressed as a partial differential equation (PDE), by simply adding the necessary partial derivatives as ansatz functions to the regression problem (Rudy et al., 2017).

2.1 Incorporating structure
Conservation laws are a fundamental pillar of our understanding of physical systems. These conservation laws stem from differentiable symmetries that are present in nature (Noether, 1918). Imposing these symmetry constraints in our sparse regression problem can potentially lead to better generalization performance under noise, reduced sample complexity, and learned dynamics that are consistent with the symmetries present in the real world. Our approach allows for arbitrary polyhedral constraints to be added, i.e., linear inequality and equality constraints; boundedness will be ensured automatically due to the $\ell_1$ norm constraint. In particular, there are two large classes of structural constraints that can be easily encoded into our learning problem.

From a differential perspective, we often observe in dynamical systems that certain relations hold between the elements of $\dot{\mathbf{x}}(t)$. Such is the case in chemical reaction dynamics, where, if we denote the rate of change of the $i$-th species by $\dot{x}_i(t)$, we might observe relations of the form $a_j \dot{x}_j(t) + a_k \dot{x}_k(t) = 0$ due to mass conservation, which relate the $j$-th and $k$-th species being studied. In the case where these relations are linear, we know that for some $\mathcal{J} \subseteq [\![1, d]\!]$ and all $t \geq 0$ we can write:
\[ \sum_{j \in \mathcal{J}} a_j \dot{x}_j(t) = c. \quad (2.7) \]
We can encode Equation (2.7) into our learning problem by using the fact that $\dot{x}_j(t) = \xi_j^T \boldsymbol{\psi}(\mathbf{x}(t))$ and imposing that for all data points $i \in [\![1, m]\!]$
\[ \sum_{j \in \mathcal{J}} a_j \xi_j^T \boldsymbol{\psi}(\mathbf{x}(t_i)) = c, \]
which can be expressed more succinctly as
\[ \sum_{j \in \mathcal{J}} a_j \xi_j^T \Psi(X) = c\, \mathbf{1}_m^T. \]
This involves the addition of $m$ linear constraints to our learning problem, which in the absence of noise does not pose any problems. However, when the data $\{\dot{\mathbf{x}}(t_i), \mathbf{x}(t_i)\}_{i=1}^{m}$ is contaminated by noise, and we only have access to $\{\dot{\mathbf{y}}(t_i), \mathbf{y}(t_i)\}_{i=1}^{m}$, it is futile to assume that $\sum_{j \in \mathcal{J}} a_j \dot{y}_j(t_i) = c$ for all $i \in [\![1, m]\!]$, or that $\sum_{j \in \mathcal{J}} a_j \xi_j^T \Psi(Y) = c\, \mathbf{1}_m^T$. In this case, it is more reasonable to assume that the derivatives are approximately preserved, and instead impose for some $\varepsilon > 0$ and all $i \in [\![1, m]\!]$ that:
\[ \left| \sum_{j \in \mathcal{J}} a_j \xi_j^T \boldsymbol{\psi}(\mathbf{y}(t_i)) - c \right| \leq \varepsilon. \]
The addition of this constraint to the problem in Equation (2.6) preserves the convexity of the original problem. Moreover, the feasible region of the optimization problem remains polyhedral.
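As a concrete illustration (with hypothetical index set J, coefficients a_j, target c and tolerance eps), the relaxed conservation constraint above amounts to $2m$ linear inequalities on the entries of $\Omega$, which can be assembled explicitly as rows of a constraint matrix acting on $\operatorname{vec}(\Omega)$:

```python
import numpy as np

def conservation_constraints(Psi, d, J, a, c, eps):
    """Encode | sum_{j in J} a_j * xi_j^T psi(y(t_i)) - c | <= eps for every sample i
    as A_ub @ vec(Omega) <= b_ub, with Omega in R^{n x d} vectorized column by column."""
    n, m = Psi.shape
    rows, rhs = [], []
    for i in range(m):
        row = np.zeros(n * d)
        for coeff, j in zip(a, J):
            row[j * n:(j + 1) * n] = coeff * Psi[:, i]   # block acting on column xi_j of Omega
        rows.append(row)
        rhs.append(c + eps)                              #  sum_j a_j xi_j^T psi(y(t_i)) <= c + eps
        rows.append(-row)
        rhs.append(eps - c)                              # -(sum) <= eps - c, i.e. sum >= c - eps
    return np.vstack(rows), np.array(rhs)
```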
One of the key assumptions used in many-particle quantum systems is the fact that the particles being studied are indistinguishable, and so it makes sense to assume that the effect that the $i$-th particle exerts on the $j$-th particle is the same as the effect that the $j$-th particle exerts on the $i$-th particle. The same can be said in classical mechanics for a collection of identical masses, where each mass is connected to all the other masses through identical springs. As an example, consider the system formed by two spring-coupled masses depicted in Figure 1.

Figure 1: Two spring-coupled masses.

Here we denote the displacement of the center of mass of the $i$-th body from its equilibrium position at rest by $x_i(t)$. This allows us to express the dynamical evolution of the system by $m \ddot{x}_1(t) = -k x_1(t) + k (x_2(t) - x_1(t))$ and $m \ddot{x}_2(t) = -k (x_2(t) - x_1(t)) - k x_2(t)$. Suppose we are given access to a series of noisy data points $\{\ddot{\mathbf{y}}(t_i), \mathbf{y}(t_i)\}_{i=1}^{m}$ and we want to learn the dynamic $\ddot{\mathbf{y}}(t) = [\ddot{y}_1(t), \ddot{y}_2(t)]$ with a dictionary $\mathcal{D} = \{\psi_1(\mathbf{y}) = 1, \psi_2(\mathbf{y}) = y_1, \psi_3(\mathbf{y}) = y_2\}$ of basis functions of polynomials of degree up to one. The problem is analogous to that of learning $\dot{\mathbf{x}}(t)$ and can be framed similarly to Equation (2.6), substituting $\dot{Y}$ by $\ddot{Y}$. If we use $\xi_j(\psi(\mathbf{x}))$ to refer to the coefficient in $\xi_j$, where $\xi_j$ is the $j$-th column of $\Omega$, associated with the basis function $\psi(\mathbf{x})$, we can write:
\[ \hat{\Xi} = \operatorname*{argmin}_{\substack{\Omega \in \mathbb{R}^{n \times d} \\ \|\Omega\|_{1,1} \leq \tau}} \left\| \begin{bmatrix} \ddot{y}_1(t_1) & \cdots & \ddot{y}_1(t_m) \\ \ddot{y}_2(t_1) & \cdots & \ddot{y}_2(t_m) \end{bmatrix} - \begin{bmatrix} \xi_1(1) & \xi_1(y_1) & \xi_1(y_2) \\ \xi_2(1) & \xi_2(y_1) & \xi_2(y_2) \end{bmatrix} \begin{bmatrix} 1 & \cdots & 1 \\ y_1(t_1) & \cdots & y_1(t_m) \\ y_2(t_1) & \cdots & y_2(t_m) \end{bmatrix} \right\|_F, \quad (2.8) \]
where $\xi_1 = [\xi_1(1), \xi_1(y_1), \xi_1(y_2)]$ and $\xi_2 = [\xi_2(1), \xi_2(y_1), \xi_2(y_2)]$. In light of the structure of the system and its symmetry, it makes sense to add to the learning problem the constraint $\xi_1(y_2) = \xi_2(y_1)$, that is, the effect of $y_2(t)$ on $\ddot{y}_1(t)$ is the same as the effect of $y_1(t)$ on $\ddot{y}_2(t)$. These constraints can also be readily applied in the integral formulation of the recovery problem.

Several optimization algorithms have been suggested to deal with the addition of such linear constraints, using either a variation of the thresholded least-squares algorithm used in SINDy (Loiseau & Brunton, 2018) or a sparse relaxed regularized regression (SR3) framework, where a regularization term controls the trade-off between the conditioning of the problem and the fidelity to the original problem. This framework results in a formulation that can be solved by iteratively solving a least-squares problem followed by a proximal problem (Champion et al., 2020; Zheng et al., 2018). In this paper we present an optimization algorithm which can easily and directly deal with the addition of linear constraints, and enforce them exactly, i.e., without the aforementioned trade-off.
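To make the two-mass example concrete, the following sketch sets up the constrained problem of Equation (2.8) with a generic convex modeling tool (CVXPY is used here purely for illustration; it is not the solver employed in this paper, and the data below are random placeholders):

```python
import numpy as np
import cvxpy as cp

# Placeholder noisy samples: Y has rows y_1, y_2; Yddot holds the (estimated) accelerations.
m = 200
Y = np.random.randn(2, m)
Yddot = np.random.randn(2, m)
tau = 10.0                                          # l_{1,1} radius (placeholder)

Psi = np.vstack([np.ones(m), Y[0], Y[1]])           # dictionary {1, y_1, y_2}, shape (3, m)

Omega = cp.Variable((3, 2))                         # columns are xi_1 and xi_2
objective = cp.Minimize(cp.norm(Yddot - Omega.T @ Psi, "fro"))
constraints = [cp.sum(cp.abs(Omega)) <= tau,        # ||Omega||_{1,1} <= tau
               Omega[2, 0] == Omega[1, 1]]          # symmetry: xi_1(y_2) = xi_2(y_1)
cp.Problem(objective, constraints).solve()
Xi_hat = Omega.value
```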
3. Conditional Gradient algorithms
For simplicity, let us assume that we are dealing with the differential formulation of the LASSO problem shown in Equation (2.6); thus we would like to solve
\[ \operatorname*{argmin}_{\substack{\Omega \in \mathbb{R}^{n \times d} \\ \|\Omega\|_{1,1} \leq \alpha}} f(\Omega), \quad (3.1) \]
where $f(\Omega) = \|\dot{Y} - \Omega^T \Psi(Y)\|_F^2$ (squaring the objective of Equation (2.6) does not change the set of minimizers). This can be done using first-order projection-based algorithms such as gradient descent or accelerated gradient descent. Using the former, the iterate at iteration $k+1$ can be expressed, for a suitably chosen step size $\gamma_k > 0$, as:
\[ \Omega_{k+1} = \operatorname*{argmin}_{\substack{\Omega \in \mathbb{R}^{n \times d} \\ \|\Omega\|_{1,1} \leq \tau}} \left\| \Omega - \left( \Omega_k - \gamma_k \nabla f(\Omega_k) \right) \right\|_F \quad (3.2) \]
\[ \phantom{\Omega_{k+1}} = \operatorname*{argmin}_{\substack{\Omega \in \mathbb{R}^{n \times d} \\ \|\Omega\|_{1,1} \leq \tau}} \left\| \Omega - \Omega_k - 2\gamma_k \Psi(Y) \left( \dot{Y} - \Omega_k^T \Psi(Y) \right)^T \right\|_F. \quad (3.3) \]
Fortunately, the quadratic problem shown in Equation (3.3) can be solved exactly with complexity $\mathcal{O}(nd)$ (Condat, 2016), as this is equivalent to projecting a flattened version of the matrix onto the $\ell_1$ polytope of dimension $nd$. If we were to add $L$ additional linear constraints to the problem in Equation (2.6) to reflect the underlying structure of the dynamical system through symmetry and conservation, we would arrive at a polytope $\mathcal{P}$ of the form
\[ \mathcal{P} = \left\{ \Omega \in \mathbb{R}^{n \times d} \mid \|\Omega\|_{1,1} \leq \tau,\ \operatorname{trace}(A_l^T \Omega) \leq b_l,\ l \in [\![1, L]\!] \right\}, \]
with $A_l \in \mathbb{R}^{n \times d}$ and $b_l \in \mathbb{R}$ for all $l \in [\![1, L]\!]$. So in this case, with additional structural constraints, the problem would transform into:
\[ \operatorname*{argmin}_{\Omega \in \mathcal{P}} \|\dot{Y} - \Omega^T \Psi(Y)\|_F^2. \quad (3.4) \]
Unfortunately, in general there is no closed-form solution to the projection operator onto $\mathcal{P}$, and so in order to use projection-based algorithms to solve the optimization problem, one has to compute these projections approximately. Note that computing a projection onto $\mathcal{P}$ is equivalent to solving a quadratic problem over $\mathcal{P}$, which can be as expensive as solving the original quadratic problem shown in Equation (3.4).
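For the plain $\ell_{1,1}$ ball without additional constraints, the projection in Equation (3.3) is cheap; a standard sort-based routine (our own sketch, not the faster method of Condat (2016) cited above) is:

```python
import numpy as np

def project_l1_ball(V, tau):
    """Euclidean projection of the matrix V onto {X : ||X||_{1,1} <= tau}."""
    v = V.flatten()
    if np.abs(v).sum() <= tau:
        return V.copy()                                   # already feasible
    u = np.sort(np.abs(v))[::-1]                          # sorted magnitudes, descending
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (cssv - tau))[0][-1]
    theta = (cssv[rho] - tau) / (rho + 1.0)               # soft-threshold level
    w = np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
    return w.reshape(V.shape)
```

The projected-gradient update of Equation (3.3) is then `project_l1_ball(Omega_k - gamma_k * grad_k, tau)`; with the additional linear constraints defining $\mathcal{P}$, however, no comparable closed-form projection is available.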
In light of this difficulty, one can opt to solve the optimization problem using projection-free algorithms like the Conditional Gradient (CG) algorithm (Levitin & Polyak, 1966), also known as the Frank-Wolfe (FW) algorithm (Frank & Wolfe, 1956), shown in Algorithm 1 with exact line search.

Algorithm 1: CG algorithm applied to (3.4)
Input: Initial point $\Omega_0 \in \mathcal{P}$. Output: Point $\Omega_{K+1} \in \mathcal{P}$.
1: for $k = 0$ to $K$ do
2:     $\nabla f(\Omega_k) \leftarrow -2\, \Psi(Y) \left( \dot{Y} - \Omega_k^T \Psi(Y) \right)^T$
3:     $V_k \leftarrow \operatorname*{argmin}_{\Omega \in \mathcal{P}} \operatorname{trace}\left( \Omega^T \nabla f(\Omega_k) \right)$
4:     $D_k \leftarrow V_k - \Omega_k$
5:     $\gamma_k \leftarrow \min\left\{ -\tfrac{1}{2} \operatorname{trace}\left( D_k^T \nabla f(\Omega_k) \right) / \| D_k^T \Psi(Y) \|_F^2,\ 1 \right\}$
6:     $\Omega_{k+1} \leftarrow \Omega_k + \gamma_k D_k$
7: end for

Figure 2: CG algorithm schematic.

This family of algorithms (including Algorithm 1) requires solving a linear optimization problem over a polytope (Line 3 of Algorithm 1) at each iteration, instead of a quadratic problem. As the iterates are obtained as a convex combination of the current iterate $\Omega_k$ and the solution of the linear optimization problem over $\mathcal{P}$, denoted by $V_k$ (Line 3 of Algorithm 1), thus always ensuring feasibility, these methods are projection-free. The direction $V_k - \Omega_k$ is the direction that best approximates (in the inner-product sense) the negative of the gradient of the objective function at the current iterate $\Omega_k$, in this case $-\nabla f(\Omega_k) = 2\, \Psi(Y) ( \dot{Y} - \Omega_k^T \Psi(Y) )^T$, if we restrict ourselves to moving towards vertices of the polytope $\mathcal{P}$. To be more precise:
\[ \operatorname{trace}\left( -(V_k - \Omega_k)^T \nabla f(\Omega_k) \right) = \max_{\Omega \in \mathcal{P}} \operatorname{trace}\left( -(\Omega - \Omega_k)^T \nabla f(\Omega_k) \right) = \max_{\Omega \in \mathcal{P}} \operatorname{trace}\left( 2 (\Omega - \Omega_k)^T \Psi(Y) \left( \dot{Y} - \Omega_k^T \Psi(Y) \right)^T \right). \]
This can be seen as equivalent to moving along the direction given by the vertex which minimizes a linear approximation of the objective function at the current iterate $\Omega_k$ over the polytope $\mathcal{P}$ (see Figure 2, where we have denoted the objective function as $f(\Omega) = \|\dot{Y} - \Omega^T \Psi(Y)\|_F^2$), that is:
\[ V_k = \operatorname*{argmin}_{\Omega \in \mathcal{P}} f(\Omega_k) + \operatorname{trace}\left( (\Omega - \Omega_k)^T \nabla f(\Omega_k) \right) = \operatorname*{argmin}_{\Omega \in \mathcal{P}} \|\dot{Y} - \Omega_k^T \Psi(Y)\|_F^2 - 2 \operatorname{trace}\left( (\Omega - \Omega_k)^T \Psi(Y) \left( \dot{Y} - \Omega_k^T \Psi(Y) \right)^T \right). \]
Once the vertex $V_k$ has been found, the exact line search solution is computed in Line 5 to find the step size $\gamma_k$ that results in the greatest decrease in primal gap, that is, $\gamma_k = \operatorname*{argmin}_{\gamma \in [0,1]} f(\Omega_k + \gamma (V_k - \Omega_k))$. Fortunately, as the function being minimized (see Equation (3.4)) is quadratic, there is a closed-form expression for the optimal step size. Note that the step size in Line 5 is always non-negative and the clipping ensures that we build convex combinations; the clipping is active (if at all) only in the very first iteration by standard arguments (see, e.g., (Braun et al., 2021)).
Remark 3.1. If we assume that the starting point $\Omega_0$ is a vertex of the polytope, then we know that the iterate $\Omega_k$ can be expressed as a convex combination of at most $k+1$ vertices of $\mathcal{P}$. This is due to the fact that the algorithm can pick up no more than one vertex per iteration. Note that the CG algorithm applied to the problem shown in Equation (3.1), where the feasible region is the $\ell_1$ ball without any additional constraints, picks up at most one basis function in the $k$-th iteration, as $V_k^T \boldsymbol{\psi}(\mathbf{x}(t)) = \pm \tau \psi_i(\mathbf{x}(t))$ for some $i \in [\![1, n]\!]$. This means that if we use the CG algorithm to solve a problem over the $\ell_1$ ball, we encourage sparsity not only through the regularization provided by the $\ell_1$ ball, but also through the specific nature of the CG algorithm, independently of the size of the feasible region. In practice, when using, e.g., early termination due to some stopping criterion, this results in the CG algorithm producing sparser solutions than projection-based algorithms (such as projected gradient descent, which typically uses dense updates) when applied to Problem (3.4), despite the fact that both algorithms converge to the same solution if the problem is strictly convex.

Thus, in addition to the trade-off between reconstruction accuracy and sparsity offered by LASSO formulations parametrized by the size $\alpha$ of the $\ell_1$ ball, we have the same trade-off in terms of the iteration count. Similar to iterative or semi-iterative regularization methods such as Landweber's method for ill-posed linear problems in Hilbert spaces (Hanke, 1991), reconstruction accuracy improves over the iterations, while the norm of the solution, in our case the $\ell_1$ norm, tends to grow. The trade-off can be decided by a termination criterion, e.g., a sufficiently small residual in Morozov's discrepancy principle (Morozov, 1966) or, in our case, a sufficiently small primal gap. One of the interesting properties of CG algorithms is the fact that, since $f$ is convex, at each iteration we can compute the Frank-Wolfe gap, an upper bound on the primal gap, at no extra cost.
Definition 3.2 (Frank-Wolfe gap). The Frank-Wolfe gap of the function $f$ over the feasible region $\mathcal{P}$ evaluated at $\Omega_k$, denoted by $g_{\mathcal{P}}(\Omega_k)$, is given by:
\[ g_{\mathcal{P}}(\Omega_k) = \max_{\Omega \in \mathcal{P}} \operatorname{trace}\left( (\Omega_k - \Omega)^T \nabla f(\Omega_k) \right). \]
To see why this quantity provides an upper bound on the primal gap $f(\Omega_k) - \min_{\Omega \in \mathcal{P}} f(\Omega)$ when $f(\Omega)$ is convex, note that if we denote $\Omega^* = \operatorname*{argmin}_{\Omega \in \mathcal{P}} f(\Omega)$, then
\[ f(\Omega_k) - f(\Omega^*) \leq \operatorname{trace}\left( (\Omega_k - \Omega^*)^T \nabla f(\Omega_k) \right) \quad (3.5) \]
\[ \phantom{f(\Omega_k) - f(\Omega^*)} \leq \max_{\Omega \in \mathcal{P}} \operatorname{trace}\left( (\Omega_k - \Omega)^T \nabla f(\Omega_k) \right) \quad (3.6) \]
\[ \phantom{f(\Omega_k) - f(\Omega^*)} = \operatorname{trace}\left( (\Omega_k - V_k)^T \nabla f(\Omega_k) \right), \quad (3.7) \]
with Equation (3.5) following from the convexity of $f$.

To sum up the advantages of CG algorithms when applied to structured sparse LASSO recovery problems:
1. Sparsity is encouraged through a two-fold approach: through the $\ell_1$ regularization used in the problem formulation, and through the use of Conditional Gradient algorithms, which are sparse in nature.
2. Linear equality and inequality constraints can be added easily and naturally to the constraint set of the problem to reflect symmetry or conservation assumptions. These additional constraints can be efficiently managed due to the fact that there are extremely efficient algorithms to solve linear programs over polytopes.
These characteristics make the class of Conditional Gradient methods extremely attractive versus projection-based algorithms for solving LASSO recovery problems.
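As an illustration of Algorithm 1 and of the sparsity mechanism of Remark 3.1, the following minimal NumPy sketch (our own; the data matrices are placeholders) implements the CG loop for the plain $\ell_{1,1}$ ball, where the linear minimization oracle returns a single signed, scaled unit matrix and the Frank-Wolfe gap of Definition 3.2 serves as the stopping criterion:

```python
import numpy as np

def cg_l1(Psi, Ydot, tau, iterations=500, tol=1e-10):
    """Conditional Gradient for min ||Ydot - Omega^T Psi||_F^2 s.t. ||Omega||_{1,1} <= tau."""
    n, _ = Psi.shape
    d = Ydot.shape[0]
    Omega = np.zeros((n, d))
    for _ in range(iterations):
        residual = Ydot - Omega.T @ Psi                  # (d, m)
        grad = -2.0 * Psi @ residual.T                   # gradient of the squared Frobenius loss, (n, d)
        # LMO over the l_{1,1} ball: a vertex puts +/- tau on the entry with largest |gradient|.
        i, j = np.unravel_index(np.argmax(np.abs(grad)), grad.shape)
        V = np.zeros_like(Omega)
        V[i, j] = -tau * np.sign(grad[i, j])
        D = V - Omega
        gap = -np.sum(D * grad)                          # Frank-Wolfe gap, an upper bound on the primal gap
        if gap <= tol:
            break
        denom = 2.0 * np.linalg.norm(D.T @ Psi, "fro") ** 2
        gamma = min(gap / denom, 1.0)                    # exact line search, clipped to [0, 1]
        Omega += gamma * D
    return Omega
```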
For the recovery of sparse dynamics from data, one of the most interesting algorithms in terms of sparsity is the Fully-Corrective Conditional Gradient (FCCG) algorithm (Algorithm 2). This algorithm picks up a vertex $V_k$ from the polytope $\mathcal{P}$ at each iteration (Line 4 of Algorithm 2) and reoptimizes over the convex hull of $\mathcal{S}_k \cup \{V_k\}$ (Line 6 of Algorithm 2), that is, over the union of the vertices picked up in previous iterations and the new vertex $V_k$. The reoptimization step can potentially remove a large number of unnecessary vertices picked up in earlier iterations.

Algorithm 2: Fully-Corrective Conditional Gradient (FCCG) algorithm applied to Problem (3.4)
Input: Initial point $\Omega_0 \in \mathcal{P}$. Output: Point $\Omega_{K+1} \in \mathcal{P}$.
1: $\mathcal{S}_0 \leftarrow \emptyset$
2: for $k = 0$ to $K$ do
3:     $\nabla f(\Omega_k) \leftarrow -2\, \Psi(Y) \left( \dot{Y} - \Omega_k^T \Psi(Y) \right)^T$
4:     $V_k \leftarrow \operatorname*{argmin}_{\Omega \in \mathcal{P}} \operatorname{trace}\left( \Omega^T \nabla f(\Omega_k) \right)$
5:     $\mathcal{S}_{k+1} \leftarrow \mathcal{S}_k \cup \{V_k\}$
6:     $\Omega_{k+1} \leftarrow \operatorname*{argmin}_{\Omega \in \operatorname{conv}(\mathcal{S}_{k+1})} \|\dot{Y} - \Omega^T \Psi(Y)\|_F^2$
7: end for

The reoptimization subproblem shown in Line 6 of Algorithm 2 is a quadratic problem over a polytope, like the original problem in Equation (3.4). However, it can be rewritten as an optimization problem over the unit probability simplex of dimension $k+1$, as the cardinality of the set $\mathcal{S}_{k+1}$ satisfies $|\mathcal{S}_{k+1}| = k+1$. To see this, note that given a set $\mathcal{S}_{k+1} \subseteq \operatorname{vert}(\mathcal{P})$ we can express any $\Omega \in \operatorname{conv}(\mathcal{S}_{k+1})$ as $\Omega = \sum_{i=0}^{k} \lambda_i V_i$ for some $\boldsymbol{\lambda} = [\lambda_0, \cdots, \lambda_k] \in \Delta_{k+1}$ and $V_i \in \mathcal{S}_{k+1}$ for all $i \in [\![0, k]\!]$. This leads to:
\[ \min_{\Omega \in \operatorname{conv}(\mathcal{S}_{k+1})} \|\dot{Y} - \Omega^T \Psi(Y)\|_F^2 = \min_{\Omega \in \operatorname{conv}(\mathcal{S}_{k+1})} \left( \operatorname{trace}\left( \dot{Y}^T \dot{Y} \right) - 2 \operatorname{trace}\left( \Omega^T \Psi(Y) \dot{Y}^T \right) + \operatorname{trace}\left( \Omega^T \Psi(Y) \Psi(Y)^T \Omega \right) \right) \]
\[ = \min_{\boldsymbol{\lambda} \in \Delta_{k+1}} \left( \|\dot{Y}\|_F^2 - 2 \sum_{i=0}^{k} \lambda_i \operatorname{trace}\left( V_i^T \Psi(Y) \dot{Y}^T \right) + \sum_{i=0}^{k} \sum_{j=0}^{k} \lambda_i \lambda_j \operatorname{trace}\left( V_i^T \Psi(Y) \Psi(Y)^T V_j \right) \right), \]
which can be expressed more succinctly if we denote $\Lambda_{k+1} = \left[ \operatorname{vec}\left( V_0^T \Psi(Y) \right), \cdots, \operatorname{vec}\left( V_k^T \Psi(Y) \right) \right] \in \mathbb{R}^{dm \times (k+1)}$ and write:
\[ \min_{\Omega \in \operatorname{conv}(\mathcal{S}_{k+1})} \|\dot{Y} - \Omega^T \Psi(Y)\|_F^2 = \min_{\boldsymbol{\lambda} \in \Delta_{k+1}} \left( \|\dot{Y}\|_F^2 - 2 \operatorname{vec}(\dot{Y})^T \Lambda_{k+1} \boldsymbol{\lambda} + \boldsymbol{\lambda}^T \Lambda_{k+1}^T \Lambda_{k+1} \boldsymbol{\lambda} \right) \quad (3.8) \]
\[ \phantom{\min_{\Omega \in \operatorname{conv}(\mathcal{S}_{k+1})} \|\dot{Y} - \Omega^T \Psi(Y)\|_F^2} = \min_{\boldsymbol{\lambda} \in \Delta_{k+1}} \left\| \Lambda_{k+1} \boldsymbol{\lambda} - \operatorname{vec}(\dot{Y}) \right\|_2^2. \quad (3.9) \]
So in order to solve the optimization problem in Line 6 of Algorithm 2 we would need to solve the optimization problem shown in Equation (3.9) (which is also convex, as convexity is invariant under affine maps) and take $\Omega = \sum_{i=0}^{k} \lambda_i V_i$. While the original quadratic problem over $\mathcal{P}$, shown in Equation (3.4), has dimensionality $n \times d$, the quadratic problem shown in Line 6 of Algorithm 2 has dimensionality $k+1$ when it is solved in $\boldsymbol{\lambda}$-space, which leads to improved convergence due to the reduced problem dimensionality.
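For illustration, the correction step of Algorithm 2 can be carried out in the barycentric coordinates of Equation (3.9); the sketch below (our own, using plain projected gradient descent rather than the accelerated variant referenced in Appendix B) relies on a standard sort-based projection onto the probability simplex:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the unit probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > cssv)[0][-1]
    theta = cssv[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def reoptimize_over_hull(Lambda, ydot_vec, lam0, steps=200):
    """Approximately solve min_{lam in simplex} ||Lambda @ lam - vec(Ydot)||_2^2, Eq. (3.9)."""
    lam = lam0.copy()
    L = np.linalg.norm(Lambda, 2) ** 2                  # squared spectral norm; 2L bounds the gradient's Lipschitz constant
    for _ in range(steps):
        grad = 2.0 * Lambda.T @ (Lambda @ lam - ydot_vec)
        lam = project_simplex(lam - grad / (2.0 * L))   # fixed step size 1 / (2L)
    return lam                                           # Omega = sum_i lam_i * V_i recovers the matrix iterate
```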
Remark 3.3. If the polytope $\mathcal{P}$ being considered is simply the $\ell_1$ ball, then computing $\Lambda_{k+1}$ requires at most $m(k+1)$ multiplications, since in this case $\|V_i\|_0 = 1$ for all $i \in [\![0, k]\!]$, as $V_i$ is simply one of the vertices of the $\ell_1$ ball (which is a polytope), and so computing $\operatorname{vec}\left( V_i^T \Psi(Y) \right)$ requires at most $m$ multiplications for each $i$. This means that $\Lambda_{k+1}$ is sparse, as $\|\Lambda_{k+1}\|_0 \leq m(k+1)$, which allows us to efficiently compute $\Lambda_{k+1}^T \Lambda_{k+1} \in \mathbb{R}^{(k+1) \times (k+1)}$ and $\operatorname{vec}(\dot{Y})^T \Lambda_{k+1} \in \mathbb{R}^{k+1}$.
Remark 3.4. If additional constraints are added to the $\ell_1$ ball, in general we cannot make any statements about the sparsity of $\Lambda_{k+1}$, other than that in numerical experiments we observe that $\|\Lambda_{k+1}\|_0 \ll dm(k+1)$.

Due to the fact that there are efficient algorithms to compute projections onto the probability simplex of dimension $k$ with complexity $\mathcal{O}(k)$ (Condat, 2016), we can use accelerated projected gradient descent to solve the subproblems in Line 6 of Algorithm 2 (Nesterov, 1983; 2018) (shown in Algorithm 4 and Algorithm 5 in Appendix B). Solving the problem shown in Line 6 of Algorithm 2 to optimality at each iteration is computationally prohibitive, and so ideally we would like to solve the problem in Line 6 to $\varepsilon_k$-optimality. This leads to the question: how should we choose $\varepsilon_k$ at each iteration $k$ if we want to find an $\varepsilon$-optimal solution to the problem shown in Equation (3.4)? Computing a solution to the problem shown in Line 6 to accuracy $\varepsilon_k = \varepsilon$ at each iteration might be way too computationally expensive. Conceptually, we need only relatively inaccurate solutions in early iterations, where $\Omega^* \notin \operatorname{conv}(\mathcal{S}_{k+1})$, and require accurate solutions only when $\Omega^* \in \operatorname{conv}(\mathcal{S}_{k+1})$. At the same time, we do not know whether we have found $\mathcal{S}_{k+1}$ such that $\Omega^* \in \operatorname{conv}(\mathcal{S}_{k+1})$. The rationale behind the Blended Conditional Gradient (BCG) algorithm (Braun et al., 2019) (the variant used for our specific problem is shown in Algorithm 3) is to provide an explicit value of the accuracy $\varepsilon_k$ needed at each iteration, starting with rather large $\varepsilon_k$ in early iterations and progressively becoming more accurate when approaching the optimal solution; the process is controlled by an optimality-gap measure. In some sense one might think of BCG as a practical version of FCCG with stronger convergence guarantees and much faster real-world performance.

Algorithm 3: CINDy: Blended Conditional Gradient (BCG) algorithm variant applied to Problem (3.4)
Input: Initial point $\Omega_0 \in \mathcal{P}$. Output: Point $\Omega_{K+1} \in \mathcal{P}$.
1: $V_0 \leftarrow \operatorname*{argmin}_{\Omega \in \mathcal{P}} \operatorname{trace}\left( \Omega^T \nabla f(\Omega_0) \right)$
2: $\Phi \leftarrow \operatorname{trace}\left( (\Omega_0 - V_0)^T \nabla f(\Omega_0) \right) / 2$
3: $\mathcal{S}_0 \leftarrow \{\Omega_0\}$
4: for $k = 0$ to $K$ do
5:     Find $\Omega_{k+1}$ such that $g_{\operatorname{conv}(\mathcal{S}_k)}(\Omega_{k+1}) \leq \Phi$  ⊲ Solve problem approximately
6:     $\nabla f(\Omega_{k+1}) \leftarrow -2\, \Psi(Y) \left( \dot{Y} - \Omega_{k+1}^T \Psi(Y) \right)^T$
7:     $V_{k+1} \leftarrow \operatorname*{argmin}_{\Omega \in \mathcal{P}} \operatorname{trace}\left( \Omega^T \nabla f(\Omega_{k+1}) \right)$
8:     $g_{\mathcal{P}}(\Omega_{k+1}) \leftarrow \operatorname{trace}\left( (\Omega_{k+1} - V_{k+1})^T \nabla f(\Omega_{k+1}) \right)$
9:     if $g_{\mathcal{P}}(\Omega_{k+1}) \leq \Phi$ then
10:        $\Phi \leftarrow g_{\mathcal{P}}(\Omega_{k+1}) / 2$  ⊲ Update accuracy
11:        $\mathcal{S}_{k+1} \leftarrow \mathcal{S}_k$
12:        $\Omega_{k+1} \leftarrow \Omega_k$
13:    else
14:        $\mathcal{S}_{k+1} \leftarrow \mathcal{S}_k \cup \{V_{k+1}\}$  ⊲ Expand active set
15:        $D_k \leftarrow V_{k+1} - \Omega_k$
16:        $\gamma_k \leftarrow \min\left\{ -\operatorname{trace}\left( D_k^T \nabla f(\Omega_k) \right) / \left( 2 \| D_k^T \Psi(Y) \|_F^2 \right),\ 1 \right\}$
17:        $\Omega_{k+1} \leftarrow \Omega_k + \gamma_k D_k$
18:    end if
19: end for

The algorithm approximately minimizes $f(\Omega)$ over $\operatorname{conv}(\mathcal{S}_k)$ in Line 5 of Algorithm 3. This problem is analogous to the one shown in Equation (3.9) and can be solved in the space of $\boldsymbol{\lambda}$ barycentric coordinates. The approximate minimization is carried out until the Frank-Wolfe gap satisfies $g_{\operatorname{conv}(\mathcal{S}_k)}(\Omega_{k+1}) \leq \Phi$. The algorithm then computes the Frank-Wolfe gap over $\mathcal{P}$ in Lines 7-8, that is, $g_{\mathcal{P}}(\Omega_{k+1})$. If this is smaller than the accuracy $\Phi$ to which we are computing the solutions in Line 5, we increase the accuracy to which we compute the solutions in Line 10 by taking $\Phi = g_{\mathcal{P}}(\Omega_{k+1})/2$. This means that as we get closer to the solution of the optimization problem, and the gap $g_{\mathcal{P}}(\Omega_{k+1})$ decreases, we increase the accuracy to which we solve the problems over $\operatorname{conv}(\mathcal{S}_k)$. If, on the other hand, $g_{\mathcal{P}}(\Omega_{k+1})$ is larger than $\Phi$, expanding the active set promises more progress than continuing to optimize over $\operatorname{conv}(\mathcal{S}_k)$. Thus, we potentially expand the active set in Line 14, and we perform a standard CG step with exact line search in Lines 15-17. Regarding the step size of this CG step, the same comments apply as in Section 3: it is always non-negative, it ensures convex combinations, and the clipping, if active, is active only in the very first iteration.

The BCG algorithm enjoys robust theoretical convergence guarantees, and exhibits very fast convergence in practice. Moreover, it generally produces solutions with a high level of sparsity, in our experiments essentially identical to those produced by FCCG. This makes the BCG algorithm a powerful alternative to the sequentially-thresholded least-squares approach followed in the SINDy algorithm (Brunton et al., 2016) (or the sequentially-thresholded ridge regression in Rudy et al. (2017)).
4. Numerical experiments
We benchmark the CINDy algorithm (Algorithm 3), applied to the LASSO formulations presented in Equation (2.6), against the following algorithms. Our main benchmark here is the SINDy algorithm; however, we also include two more popular optimization methods for further comparison.
SINDy: We use a SINDy algorithm implementation based on the Python implementation in the PDE-FIND GitHub repository from (Rudy et al., 2017) (which originally used ridge regression, as opposed to the least-squares regression used in (Brunton et al., 2016)).
FISTA: The Fast Iterative Shrinkage-Thresholding Algorithm (Beck & Teboulle, 2009), commonly known as FISTA, is a first-order accelerated method commonly used to solve LASSO problems equivalent to the ones shown in Equation (2.6) (in which the $\ell_1$ norm appears as a regularization term in the objective function, as opposed to a constraint).

IPM: Interior-Point Methods (IPM) are an extremely powerful class of convex optimization algorithms, able to reach a highly accurate solution in a small number of iterations. These algorithms rely on the resolution of a linear system of equations at each iteration, which can be done efficiently if the underlying system is sparse. Unfortunately this is not the case for our LASSO formulations, which makes this algorithm impractical for large problems. We use the path-following primal-form interior-point method for quadratic problems described in (Andersen et al., 2011), and implemented in Python's CVXOPT, to solve the LASSO problem in Equation (2.6) (with and without additional constraints, as described in Section 2.1).

We use CINDy (c) and CINDy to refer to the results achieved by the CINDy algorithm with and without the additional constraints described in Section 2.1. Likewise, we use IPM (c) and IPM to refer to the results achieved by the IPM algorithm with and without additional constraints. We have not added structural constraints to the formulation in the SINDy algorithm, as there is no straightforward way to include constraints in the original implementation, nor in the FISTA algorithm, as we would need to compute non-trivial proximal/projection operators, making the algorithm computationally expensive.
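For reference, the FISTA baseline solves the penalized counterpart of Equation (2.6); a minimal sketch of the corresponding soft-thresholding iteration (our own illustration, with a placeholder regularization weight, not the exact implementation used in the experiments) is:

```python
import numpy as np

def fista_lasso(Psi, Ydot, reg, iterations=1000):
    """FISTA for min_Omega ||Ydot - Omega^T Psi||_F^2 + reg * ||Omega||_{1,1}."""
    n, d = Psi.shape[0], Ydot.shape[0]
    L = 2.0 * np.linalg.norm(Psi @ Psi.T, 2)             # Lipschitz constant of the smooth part's gradient
    Omega = np.zeros((n, d))
    Z = Omega.copy()
    t = 1.0
    for _ in range(iterations):
        grad = -2.0 * Psi @ (Ydot - Z.T @ Psi).T          # gradient of the smooth part at Z
        W = Z - grad / L
        Omega_next = np.sign(W) * np.maximum(np.abs(W) - reg / L, 0.0)  # soft-thresholding (prox of l1)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Z = Omega_next + ((t - 1.0) / t_next) * (Omega_next - Omega)    # momentum step
        Omega, t = Omega_next, t_next
    return Omega
```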
Remark 4.1 (Hyperparameter selection for the CINDy algorithm). In the experiments we have not tuned the $\ell_1$ parameter in the LASSO formulation for the CINDy algorithm (neither in the integral nor in the differential formulation), simply relying on $\alpha = 2 \|\dot{Y} \Psi(Y)^{\dagger}\|_{1,1}$ and $\alpha = 2 \|\delta Y\, \Gamma(Y)^{\dagger}\|_{1,1}$ in the differential and integral formulations, respectively, for all the experiments. With this choice, purposefully, all computed solutions are located in the interior of the feasible region. It is important to note that, due to this choice, sparsity in the recovered dynamics is therefore due to implicit regularization by the CINDy optimization algorithm and not due to binding constraints of the LASSO problem formulation.

Remark 4.2 (Hyperparameter selection for SINDy, FISTA and the IPM algorithm). We have selected the threshold coefficient for SINDy, and the $\ell_1$ regularization parameters of FISTA and the IPM algorithm, based on performance on testing data. In the differential and integral formulations we have selected the hyperparameters that gave the smallest value of $\|\dot{Y}_{\text{validation}} - \Omega^T \Psi(Y_{\text{validation}})\|_F$ and $\|\delta Y_{\text{validation}} - \Omega^T \Gamma(Y_{\text{validation}})\|_F$, respectively. Other criteria could be chosen to increase the level of sparsity of the solution, at the expense of accuracy in inferring derivatives/trajectories. Note that in the BCG algorithm, the sparsity-accuracy compromise is instead parametrized in terms of the stopping criterion, rather than thresholds or $\ell_1$ bounds. We hasten to stress, however, that the sparsity of CINDy (and more precisely of the BCG algorithm) is due to extremely sparse updates in each iteration, whereas the other algorithms' updates are dense, and sparsity is only realized by means of postprocessing these updates or solutions.

Remark 4.3 (Stopping criterion). We use a fixed number of rounds of thresholding and least squares for each run of the SINDy algorithm (or fewer, if no more coefficients are thresholded in a given iteration). The threshold value that yields the best accuracy in terms of testing data is outputted afterwards. During the selection of the FISTA hyperparameter, the algorithm is run for a sufficiently large number of iterations until the primal progress made falls below a small tolerance. The same is done when we run the FISTA algorithm with the final hyperparameter selected. The IPM algorithm is run with the default stopping criterion parameters. Lastly, the CINDy algorithm is run until the Frank-Wolfe gap, an upper bound on the primal gap, falls below a small tolerance.

For each physical model in this section we generate a set of $T$ points, simulating the physical model $c$ different times, to generate $c$ experiments. First, a random starting point for the $j$-th experiment is generated. This random starting point is used to generate a set of $T/c$ points equally spaced in time, $\mathbf{x}^j(t_i)$ with $i \in [\![1, T/c]\!]$, using a high-order Runge-Kutta scheme, ensuring that the discretization error $\|\mathbf{x}^j(t_i) - \mathbf{x}(t_i)\|$ remains below a small tolerance. The samples are then contaminated with i.i.d. Gaussian noise.
If we denote the noisy data point at time $t_i$ for the $j$-th experiment by $\mathbf{y}^j(t_i)$, we have $\mathbf{y}^j(t_i) = \mathbf{x}^j(t_i) + \eta\, \mathcal{N}(0, \Sigma)$, where $\mathcal{N}(0, \Sigma)$ denotes the $d$-dimensional multivariate Gaussian distribution centered at zero with covariance matrix $\Sigma$, where
\[ \Sigma = \operatorname{diag}\left( \frac{1}{T} \sum_{i=1}^{T/c} \sum_{j=1}^{c} \left( x^j_1(t_i) - \mu_1 \right)^2, \cdots, \frac{1}{T} \sum_{i=1}^{T/c} \sum_{j=1}^{c} \left( x^j_d(t_i) - \mu_d \right)^2 \right), \quad (4.1) \]
\[ \mu_k = \frac{1}{T} \sum_{i=1}^{T/c} \sum_{j=1}^{c} x^j_k(t_i). \quad (4.2) \]
We denote the noise level that we vary in our experiments by $\eta$. Note that the $k$-th element on the diagonal of $\Sigma$ is simply the sample variance of the $k$-th component of the points generated in the experiments. Both $\Gamma(Y)$ and $\Psi(Y)$ are normalized so that their rows have unit variance, to make the learning process easier for all the algorithms. For the training of the algorithms, the bulk of the data points are used; a smaller share is used for validation and for selecting the hyperparameters that perform best (the $\ell_1$ regularization parameter in the FISTA and IPM algorithms, or the threshold in SINDy's sequentially-thresholded least-squares algorithm); and the remaining data points are used for evaluating the output of each algorithm and are referred to as the testing set. We would like to stress that, while it might seem that we are in the overdetermined regime in terms of the number of samples used in training, this is not accurate: due to the evolutionary nature of the data, i.e., evolving the dynamic in time, the samples obtained within one experiment are highly correlated. The approximate derivatives are computed from noisy data using local polynomial interpolation (Knowles & Renka, 2014). The matrix $\Gamma(Y)$ used in the integral formulation of the sparse recovery problem is computed through the integration of local polynomial interpolations. The same matrices and the same training-validation-testing split are used in each experiment for all the algorithms, to make a fair comparison. Section C in the Appendix shows the difference in accuracy that can be achieved with the different methods when estimating the derivatives and integrals for one physical model. Given the disparity in accuracy that can be achieved between the estimation of derivatives and integrals, we have decided to present the results for the integral and differential formulations separately.
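The noise model of Equations (4.1)-(4.2) can be summarized in a few lines (a sketch with our own variable names; the array layout is an assumption): the per-coordinate noise variance is the empirical variance of that coordinate across all experiments, scaled by the noise level $\eta$.

```python
import numpy as np

def add_noise(X_all, eta, seed=0):
    """X_all: array of shape (c, d, T_per) holding c experiments, each with T_per samples of dimension d.
    Returns y = x + eta * N(0, Sigma), with Sigma the diagonal of per-coordinate sample variances."""
    rng = np.random.default_rng(seed)
    mu = X_all.mean(axis=(0, 2))                                   # Eq. (4.2): per-coordinate mean over all samples
    var = ((X_all - mu[None, :, None]) ** 2).mean(axis=(0, 2))     # diagonal of Sigma, Eq. (4.1)
    noise = rng.normal(size=X_all.shape) * np.sqrt(var)[None, :, None]
    return X_all + eta * noise
```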
We benchmark the algorithms in terms of the following metrics, in a similar spirit as was done in (Kaheman et al., 2020). Given a dictionary of basis functions $\mathcal{D}$ of cardinality $n$, an associated exact dynamic $\Xi \in \mathbb{R}^{n \times d}$ such that $\dot{\mathbf{x}}(t) = \Xi^T \boldsymbol{\psi}(\mathbf{x}(t))$, and a dynamic $\Omega \in \mathbb{R}^{n \times d}$ outputted by the algorithms, we define:

Definition 4.4 (Recovery error). The recovery error $\mathcal{E}_R$ of the algorithm is given by $\mathcal{E}_R \overset{\text{def}}{=} \|\Omega - \Xi\|_F$.

Given noisy data points $Y_{\text{testing}}$, $\dot{Y}_{\text{testing}}$ from the testing set we define:

Definition 4.5 (Derivative inference error). The derivative inference error $\mathcal{E}_D$ of the algorithm is given by $\mathcal{E}_D \overset{\text{def}}{=} \|(\Omega - \Xi)^T \Psi(Y_{\text{testing}})\|_F$.

This measure aims at quantifying how well the learned dynamic will infer the true derivatives at $Y$.

Definition 4.6 (Trajectory inference error). The trajectory inference error $\mathcal{E}_T$ of the algorithm is given by $\mathcal{E}_T \overset{\text{def}}{=} \|(\Omega - \Xi)^T \Gamma(Y_{\text{testing}})\|_F$.

This measure aims at quantifying how well the learned dynamic will infer the evolution of the trajectories, encoded by $\Gamma(Y)$, compared to the true dynamic $\Xi$. In order to gauge how well a given algorithm is able to recover the true support of a dynamic, we define:

Definition 4.7 (Extraneous terms). The number of extraneous terms of a given dynamic $\Omega \in \mathbb{R}^{n \times d}$ with respect to its true counterpart $\Xi \in \mathbb{R}^{n \times d}$ is defined as $\mathcal{S}_E \overset{\text{def}}{=} \left| \left\{ \Omega_{i,j} \mid \Omega_{i,j} \neq 0,\ \Xi_{i,j} = 0,\ i \in [\![1, n]\!],\ j \in [\![1, d]\!] \right\} \right|$. This metric simply counts the terms picked up in $\Omega$ that are not present in the true physical model, represented by $\Xi$, i.e., it counts the false positives.

Definition 4.8 (Missing terms). The number of missing terms of a given dynamic $\Omega \in \mathbb{R}^{n \times d}$ with respect to its true counterpart $\Xi \in \mathbb{R}^{n \times d}$ is defined as $\mathcal{S}_M \overset{\text{def}}{=} \left| \left\{ \Omega_{i,j} \mid \Omega_{i,j} = 0,\ \Xi_{i,j} \neq 0,\ i \in [\![1, n]\!],\ j \in [\![1, d]\!] \right\} \right|$. This metric simply counts the terms that have not been picked up in $\Omega$ but actually participate in governing the true physical model, represented by $\Xi$, i.e., it counts the false negatives.

The Kuramoto model describes a large collection of $d$ weakly coupled identical oscillators that differ in their natural frequency $\omega_i$ (Kuramoto, 1975) (see Figure 3). This dynamic is often used to describe synchronization phenomena in physics, and has been previously used in the numerical experiments of a tensor-based algorithm for the recovery of large dynamics (Gelß et al., 2019). If we denote by $x_i$ the angular displacement of the $i$-th oscillator, then the governing equation with external forcing (see (Acebrón et al., 2005)) can be written as:
\[ \dot{x}_i = \omega_i + \frac{K}{d} \sum_{j=1}^{d} \sin\left( x_j - x_i \right) + h \sin(x_i) = \omega_i + \frac{K}{d} \sum_{j=1}^{d} \left[ \sin(x_j) \cos(x_i) - \cos(x_j) \sin(x_i) \right] + h \sin(x_i), \]
for $i \in [\![1, d]\!]$, where $d$ is the number of oscillators (the dimensionality of the problem), $K$ is the coupling strength between the oscillators and $h$ is the external forcing parameter. The exact dynamic $\Xi$ can be expressed using a dictionary of basis functions formed by sine and cosine functions of $x_i$ for $i \in [\![1, d]\!]$, pairwise combinations of these functions, and a constant term. To be more precise, the dictionary used is
\[ \mathcal{D} = \left\{ \prod_{i=1}^{d} \sin(x_i)^{a_i} \prod_{i=1}^{d} \cos(x_i)^{b_i} \ \middle|\ a_i, b_i \in [\![0, 1]\!],\ i \in [\![1, d]\!],\ 0 \leq \sum_{i=1}^{d} (a_i + b_i) \leq 2 \right\}, \]
which has a cardinality of $1 + d + 2d^2$. Note, however, that the data is contaminated with noise, and so we observe $\mathbf{y}$ as opposed to $\mathbf{x}$. For the system we choose the natural frequencies $\omega_i$ uniformly at random for $i \in [\![1, d]\!]$. The random starting point for each instance of the experimental data used is chosen component-wise uniformly at random over a range of angles. This starting point is used to generate a trajectory according to the exact dynamic using a high-order Runge-Kutta scheme up to the fixed maximum time $t_{T/c}$. This trajectory is then contaminated with noise, in accordance with the description in the previous section.
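For concreteness, a short sketch (our own; parameter values are placeholders) of the forced Kuramoto right-hand side and of the trigonometric dictionary described above, whose size $1 + 2d + \binom{2d}{2} = 1 + d + 2d^2$ matches the stated cardinality:

```python
import numpy as np
from itertools import combinations

def kuramoto_rhs(x, omega, K, h):
    """Forced Kuramoto dynamic: xdot_i = omega_i + (K/d) * sum_j sin(x_j - x_i) + h * sin(x_i)."""
    d = x.size
    return omega + (K / d) * np.sin(x[None, :] - x[:, None]).sum(axis=1) + h * np.sin(x)

def kuramoto_library(X):
    """Trigonometric dictionary: constant, all sin/cos terms, and all products of two distinct factors.
    X has shape (d, m); returns Psi(X) of shape (1 + d + 2*d**2, m)."""
    d, m = X.shape
    singles = [np.sin(X[i]) for i in range(d)] + [np.cos(X[i]) for i in range(d)]
    pairs = [f * g for f, g in combinations(singles, 2)]
    return np.vstack([np.ones(m)] + singles + pairs)
```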
Remark 4.9. Note that if we were to include $\cos^2(y_i)$ and $\sin^2(y_i)$ for $i \in [\![1, d]\!]$ in $\mathcal{D}$, the matrix $\Psi(Y)$ built with this library would not have full rank, as $\cos^2(y_i) + \sin^2(y_i) = 1$ and the library $\mathcal{D}$ already includes a built-in constant. In our experiments we have observed that when using a dictionary that does not have full rank, the thresholded least-squares algorithm SINDy tends to produce solutions that include constant, $\cos^2(y_i)$ and $\sin^2(y_i)$ terms, whereas the dynamics returned by the CINDy algorithm tend to only include constant terms, thereby providing a more parsimonious representation of the dynamic. If we add an $\ell_2$ regularization term to the optimization algorithm, resulting in a thresholded ridge-regression algorithm, the resulting dynamic tends toward higher parsimony, but at the expense of accuracy in predicting derivatives and trajectories.

We also test the performance of the CINDy algorithm and the IPM algorithm with the addition of symmetry constraints. We use $\xi_j(\psi(\mathbf{x}))$ to refer to the coefficient in $\xi_j$, where $\xi_j$ is the $j$-th column of $\Xi$, associated with the basis function $\psi(\mathbf{x})$. The underlying rationale behind the constraints is that, as the particles are identical except for their intrinsic frequency, the effect of $x_i$ on $x_j$ should be the same as the effect of $x_j$ on $x_i$. In both the integral and the differential formulation we impose that for all $i, j \in [\![1, d]\!]$:
\[ \xi_j(\sin(x_i)) = \xi_i(\sin(x_j)), \qquad \xi_i(\cos(x_i)) = \xi_i(\cos(x_j)), \qquad \xi_j(\cos(x_i)) = \xi_i(\cos(x_j)), \]
\[ \xi_j(\sin(x_i)\cos(x_j)) = \xi_i(\sin(x_j)\cos(x_i)), \qquad \xi_j(\cos(x_i)\sin(x_j)) = \xi_i(\cos(x_j)\sin(x_i)), \]
\[ \xi_j(\sin(x_i)\sin(x_j)) = \xi_i(\sin(x_j)\sin(x_i)), \qquad \xi_j(\cos(x_i)\cos(x_j)) = \xi_i(\cos(x_j)\cos(x_i)), \]
which are simple linear constraints that can easily be added to the linear optimization oracle used in Line 7 of the CINDy algorithm (Algorithm 3).

The images in Figures 3 and 4 show the recovery results for fixed values of the coupling strength $K$ and the forcing parameter $h$, and for two different values of the dimension $d$. The same total number of points, spread over several experiments of fixed maximum time, was used to infer the dynamic in both cases, and the noise level $\eta$ was varied over several orders of magnitude. The derivatives used were computed using differentiation of local polynomial approximations, and the integrals using integration of local polynomial approximations. Each test was performed several times. The graphs indicate with lines the average value obtained for $\mathcal{E}_R$, $\mathcal{E}_D$, $\mathcal{E}_T$, $\mathcal{S}_E$ and $\mathcal{S}_M$ for a given noise level and algorithm. The shaded regions indicate the values obtained after adding and subtracting one standard deviation to the average error for each noise level and algorithm.

For the case with the smaller dimension (see Figure 3) we can observe that, in the differential formulation, the CINDy, CINDy (c) and SINDy algorithms achieve the smallest recovery error $\mathcal{E}_R$ at low noise levels; the performance of the SINDy algorithm degrades after that, relative to that of the best-performing algorithms, whereas the CINDy and CINDy (c) algorithms are among the best performing in terms of $\mathcal{E}_R$ for all noise levels. Regarding the $\mathcal{E}_D$ and $\mathcal{E}_T$ errors, there is no clear distinction between the algorithms.
Regarding the sparsity achieved by the algorithms, we can see that CINDy and CINDy (c) consistently tend to produce the sparsest solutions, as measured by $\mathcal{S}_E$, whereas IPM and IPM (c) tend to pick up all the terms in the dictionary. For interior-point methods, dense solutions are of course to be expected if no solution rounding is performed. Note also that the performance of SINDy degrades above a certain noise level, where it starts to produce dense solutions. Lastly, on average none of the algorithms tends to miss more than one of the basis functions that are present in the exact dynamic, as measured by $\mathcal{S}_M$; this is especially remarkable for the CINDy and CINDy (c) algorithms, which consistently have the lowest number of extra terms. Overall, in the differential formulation experiments we observe that the CINDy and CINDy (c) algorithms produce the most accurate solutions, as measured by $\mathcal{E}_R$, consistently producing among the best dynamics for all the noise levels, while producing dynamics that are in general much sparser than those produced by the other algorithms.

In terms of the integral formulation, the CINDy and CINDy (c) algorithms are on average more accurate in terms of $\mathcal{E}_R$ than any of the other algorithms tested, by a larger margin than in the differential formulation. Regarding the $\mathcal{E}_D$ and $\mathcal{E}_T$ errors, as in the results for the differential formulation, there are only slight differences between the methods. However, the CINDy and CINDy (c) algorithms produce the sparsest solutions, with all the remaining algorithms picking up a large number of extraneous basis functions.

For the case with the larger dimension (see Figure 4), in the differential case we can see very good performance from the SINDy, CINDy, CINDy (c), and FISTA algorithms in $\mathcal{E}_R$ at low noise levels. However, the performance of SINDy degrades at higher noise levels, in terms of recovery accuracy (there is a difference of more than two orders of magnitude between SINDy and CINDy) and in terms of $\mathcal{S}_E$, as the algorithm starts picking up most of the available basis functions from the dictionary. The performance of the FISTA algorithm also degrades at high noise levels, in terms of $\mathcal{E}_R$ and $\mathcal{S}_E$. Regarding the integral formulation, there is a large difference between the performance of CINDy and CINDy (c) and that of the rest of the algorithms. For most noise levels these two algorithms are more than two orders of magnitude more accurate in terms of $\mathcal{E}_R$, while maintaining extremely sparse solutions. This performance difference is key to accurately simulating out-of-sample trajectories. Regarding the IPM and IPM (c) results, we were not able to get the performance of these two algorithms to be on par with the other algorithms for low noise levels, despite increasing the accuracy of the solver.

One of the most crucial aspects of many modern learning problems is the quantity of data needed to train a given model to achieve a certain target accuracy. When training data is expensive to gather, it is usually advantageous to use models or frameworks that require the least amount of training data to reach a given target accuracy on validation data. As we can see in Figure 3, in both the differential and the integral formulation all the algorithms perform similarly when tested on noisy validation data, as shown in the second and third rows of images; however, there are disparities in how they perform when measuring the performance against the exact dynamic, as seen in the first row of images.
For example, in the differential formulation the CINDy and CINDy (c) algorithms perform noticeably better than the SINDy algorithm for higher noise levels, from − to − , and in the integral formulation the CINDy and CINDy (c) algorithms perform noticeably better than the SINDy algorithm for noise levels below − . From a sample-efficiency perspective, this suggests that in both of these regimes where CINDy has an advantage, it will require fewer samples than SINDy to reach a target accuracy.

This is confirmed in Figure 5, which shows a heat map of log(E_R) for different noise levels (x-axis) and different numbers of training samples (y-axis), when using the Kuramoto model of dimension d = for benchmarking. If one looks at the differential formulation, one can see that in the low-training-sample regime both CINDy and CINDy (c) perform better than SINDy at higher noise levels. It should also be noted that CINDy (c) performs better than CINDy, which is expected, as the introduction of extra constraints lowers the dimensionality of the learning problem, so that fewer parameters have to be learned. Thus the addition of constraints brings two advantages: (1) it outputs dynamics that are consistent with the underlying physics of the phenomenon, and (2) it can potentially require fewer data samples to train. Similar conclusions can be drawn when inspecting the results for the integral formulation, for example when focusing again on the low-training-sample regime. We provide an extended analysis over a broader range of sample sizes in Appendix E for completeness.

Observing the results for E_T in Figure 4, it would seem at first sight that all the algorithms (except the IPM and IPM (c) algorithms for low noise levels) perform similarly when inferring trajectories from an initial position. However, when we simulate the Kuramoto system from a given initial position, the algorithms show very different performance. This is because, while the single-point evaluations have rather similar errors on average (which means nothing other than that the methods generalize similarly on the specific evaluations, as expected, since this was the objective function being minimized), they differ greatly in their structural generalization behavior: all algorithms but CINDy and CINDy (c) pick up wrong terms to explain the dynamic, which then, due to compounding during the trajectory evolution, lead to significant mismatches. In Figure 6 we show the results of simulating the dynamics learned by the CINDy and SINDy algorithms from the integral formulation for a Kuramoto model with d = and a noise level of − . To make the differences between the algorithms and the positions of the oscillators easier to see, we have placed the i-th oscillator at a radius of i, for i ∈ ⟦1, d⟧. This is contrary to how this dynamic is usually visualized, with all the particles oscillating at the same radius.
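To make the trajectory comparisons in Figure 6 concrete, the following is a minimal Python sketch of how a learned dynamic ẋ = Ω^T ψ(x) can be integrated forward from an initial condition. The function names, the tolerances, and the illustrative Kuramoto-style dictionary are our own choices and are not taken from the paper's code.

    import numpy as np
    from scipy.integrate import solve_ivp

    def simulate_learned_dynamic(Omega, psi, x0, t_eval):
        # Integrate x'(t) = Omega^T psi(x(t)) forward from the initial state x0.
        # Omega: (n, d) learned coefficient matrix; psi: maps a state in R^d to
        # the n dictionary values; t_eval: times at which to record the trajectory.
        rhs = lambda t, x: Omega.T @ psi(x)
        sol = solve_ivp(rhs, (t_eval[0], t_eval[-1]), x0, t_eval=t_eval,
                        rtol=1e-9, atol=1e-9)
        return sol.y  # shape (d, len(t_eval))

    def kuramoto_dictionary(x):
        # Illustrative dictionary: constant, sin/cos of each phase, and pairwise
        # products sin(x_j) cos(x_i), mirroring the kind of library used in Section 4.2.
        d = x.shape[0]
        pairs = [np.sin(x[j]) * np.cos(x[i]) for i in range(d) for j in range(d)]
        return np.concatenate(([1.0], np.sin(x), np.cos(x), np.array(pairs)))

Trajectories produced this way from the learned Ω and from the exact Ξ can then be compared directly, which is how the mismatches caused by spurious terms become visible.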
Figure 3:
Sparse recovery of the Kuramoto model:
Algorithm comparison for a Kuramoto model of dimension d = , with a differential formulation shown on the left column, and with an integral formulation on the right column. The first, second, third, fourth and fifth rows of images indicate a comparison of E_R, E_D, E_T, S_E, and S_M, as we vary the noise level, respectively.
Figure 4:
Sparse recovery of the Kuramoto model:
Algorithm comparison for a Kuramoto model of dimension d = , with a differential formulation shown on the left column, and with an integral formulation on the right column. The first, second, third, fourth and fifth rows of images indicate a comparison of E_R, E_D, E_T, S_E, and S_M, as we vary the noise level, respectively.
Figure 5:
Sample efficiency of the sparse recovery of the Kuramoto model:
Algorithm comparison for a Kuramoto model of dimension d = for both formulations.
Figure 6:
Trajectory comparison:
Kuramoto model of dimension d = . The panels show snapshots of the oscillator positions at t = 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0 s for the dynamics learned by SINDy and CINDy, together with the exact dynamic.

The Fermi-Pasta-Ulam-Tsingou model describes a one-dimensional system of d identical particles, where neighboring particles are connected with springs and subject to a nonlinear forcing term (Fermi et al., 1955). This computational model was used at Los Alamos to study the behaviour of complex physical systems over long time periods. The prevailing hypothesis behind the experiments was the idea that these systems would eventually exhibit ergodic behaviour, as opposed to the approximately periodic behaviour that some complex physical systems seemed to exhibit. This is indeed the case, as this model transitions to ergodic behaviour after seemingly periodic behaviour over short time scales. This dynamic has already been used in Gelß et al. (2019). The equations of motion that govern the particles, when subjected to cubic forcing terms, are given by

    ẍ_i = (x_{i+1} − 2x_i + x_{i−1}) + β[(x_{i+1} − x_i)³ − (x_i − x_{i−1})³],

where i ∈ ⟦1, d⟧ and x_i refers to the displacement of the i-th particle with respect to its equilibrium position. The exact dynamic Ξ can be expressed using a dictionary of monomials of degree up to three. To be more precise, the dictionary used is

    D = { ∏_{i=1}^{d} x_i^{a_i} | a_i ∈ ⟦0, 3⟧, i ∈ ⟦1, d⟧, 0 ≤ Σ_{i=1}^{d} a_i ≤ 3 },

which has cardinality (d + 3 choose 3). Using this dictionary we know that the exact dynamic satisfies ‖Ξ‖₀ = d − . As in the previous example, we can impose a series of linear constraints between the dynamics of neighboring particles (as the particles are identical). In both the integral and the differential formulation we impose that for all i ∈ ⟦1, d⟧ and all j ∈ {i + 1, i − 1} with 1 ≤ j ≤ d,

    ξ_j(x_i^a x_j^b) = ξ_i(x_i^b x_j^a)

holds with a, b ∈ ⟦0, 3⟧ and a + b ≤ 3.

Remark 4.10 (On the construction of Ÿ and δẎ). In this case we are dealing with a second-order ordinary differential equation (ODE), as opposed to a first-order ODE. As we only have access to noisy measurements {y(t_i)}_{i=1}^{m}, if we are dealing with the differential formulation we have to numerically estimate {ÿ(t_i)}_{i=1}^{m} in order to solve

    argmin_{Ω ∈ P, Ω ∈ ℝ^{n×d}} ‖Ÿ − Ω^T Ψ(Y)‖_F,

where Ÿ = [ÿ(t_1), …, ÿ(t_m)] ∈ ℝ^{d×m} is the matrix with the estimates of the second derivatives of y with respect to time as columns. On the other hand, if we wish to tackle the problem from an integral perspective, we now use the fact that ẋ(t_{i+1}) = ẋ(t_1) + ∫_{t_1}^{t_{i+1}} Ξ^T ψ(x(τ)) dτ, which allows us to phrase the sparse regression problem from an integral perspective as

    argmin_{Ω ∈ P, Ω ∈ ℝ^{n×d}} ‖δẎ − Ω^T Γ(Y)‖_F,

where δẎ = [ẏ(t_2) − ẏ(t_1), …, ẏ(t_m) − ẏ(t_1)] ∈ ℝ^{d×(m−1)} and Γ(Y) is computed as in Section 2. Note that in this case using the differential formulation requires estimating the second derivative of the noisy data with respect to time to form Ÿ, whereas the integral formulation requires estimating the first derivative with respect to time to form δẎ.

The images in Figures 7 and 8 show the recovery results for d = and d = , respectively, when learning the Fermi-Pasta-Ulam-Tsingou model with β = .
The exact dynamic in this case satisfies ‖Ξ‖₀ = and ‖Ξ‖₀ = for d = and d = , respectively. In the lower-dimensional experiment we used points, and in the higher-dimensional one . In both cases we spread the points across experiments, for a maximum time of second per experiment. The derivatives used were computed using polynomial interpolation of degree 8. We used polynomial interpolation of degree to compute the integrals for all the noise levels. The values of the noise level α ranged from − to − . Each test was performed times. The graphs indicate with lines the average value obtained for E_R, E_D, E_T, S_E and S_M for a given noise level and algorithm. The shaded regions indicate the value obtained after adding and subtracting one standard deviation to the average error for each noise level and algorithm.

For the case with d = (Figure 7) we can observe that in both the differential and the integral formulation all the algorithms perform similarly in terms of E_D and E_T; however, the CINDy and CINDy (c) algorithms perform on par with or better than the best-performing alternative. For low noise levels the CINDy algorithm performs 2-5 times better than the worst-performing algorithm. There is a huge difference in the number of extra basis functions picked up by the algorithms, as measured by S_E: the CINDy and CINDy (c) algorithms do not pick up any extra basis functions until the noise level is above − , whereas all the other algorithms, with the exception of the SINDy algorithm for a noise level of − , pick up extra basis functions. The IPM and IPM (c) algorithms tend to pick up all the possible basis functions, as does the SINDy algorithm for noise levels above − . Note how for small noise levels the SINDy algorithm does not pick up a significant number of extra basis functions in the differential formulation, and how, as the noise level increases, the SINDy algorithm tends to pick up all the available ansatz functions. As the SINDy, IPM and IPM (c) algorithms tend to pick up approximately all the available basis functions for noise levels above − , they consequently do not miss out on any basis functions, as measured by S_M. The FISTA, CINDy and CINDy (c) algorithms miss out on some of the basis functions at higher noise levels, due to their stronger tendency towards producing sparse solutions.

For the case with d = (Figure 8) we observe again that all the models perform similarly in terms of E_D and E_T in both the integral and the differential formulation. However, the difference between the algorithms in terms of E_R is magnified. For example, the CINDy, CINDy (c) and FISTA algorithms are up to two orders of magnitude more accurate than the SINDy algorithm in terms of E_R for the differential formulation. They also provide the sparsest solutions in terms of S_E. In the integral formulation the CINDy and CINDy (c) algorithms are also two orders of magnitude more accurate than SINDy. The LASSO algorithm performs better than SINDy, but significantly worse than CINDy and CINDy (c) in terms of E_R. In terms of performance with respect to S_E, the CINDy and CINDy (c) algorithms do significantly better than the other algorithms; however, for the highest noise levels they miss out on a significant number of basis functions, as measured by S_M. For all noise levels except one, the SINDy algorithm tends to pick up all the available basis functions, as can be seen in the plot for S_E.
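For reference, the error and sparsity metrics discussed throughout this section can be computed from the learned and exact coefficient matrices along the following lines. This is a simplified sketch: the threshold deciding which coefficients count as active is our own illustrative assumption, since the exact convention is fixed earlier in the paper.

    import numpy as np

    def recovery_metrics(Omega, Xi, tol=1e-6):
        # Omega: learned coefficient matrix, Xi: exact coefficient matrix, both (n, d).
        # E_R: relative recovery error in the Frobenius norm.
        # S_E: number of coefficients active in Omega but not in Xi (extra terms).
        # S_M: number of coefficients active in Xi but missed by Omega.
        E_R = np.linalg.norm(Omega - Xi) / np.linalg.norm(Xi)
        active_learned = np.abs(Omega) > tol   # tol is an illustrative choice
        active_exact = np.abs(Xi) > tol
        S_E = int(np.sum(active_learned & ~active_exact))
        S_M = int(np.sum(~active_learned & active_exact))
        return E_R, S_E, S_M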
As in Section 4.2.1, we present a heat map in Figure 9 that compares the accuracy, in terms of log(E_R), as we vary the noise levels (x-axis) and the number of training samples (y-axis) for the SINDy and CINDy algorithms. If one looks at the differential formulation, one can see that in the low-training-sample regime both CINDy and CINDy (c) perform slightly better than SINDy at higher noise levels. The difference in performance is less pronounced than in Figure 5, however. We provide an extended analysis over a broader range of sample sizes in Appendix E for completeness. In the images in the Appendix it is easier to discern the advantage of adding constraints to the system, as we can clearly see that the accuracy of CINDy (c) is higher than that of CINDy for a given noise level and number of samples.

The results shown in Figure 8 for E_T (see the third row of images) seem to indicate that all the algorithms will perform similarly when predicting trajectories from an initial position; however, this stands in contrast to what is shown in the first row of images, where we can see that the dynamic learned by SINDy is far from the true dynamic. This discrepancy is due to the fact that all the data is generated in one regime of the dynamical phenomenon, and the metric E_T is computed using noisy testing data from the same regime. If we were to test the performance in inferring trajectories with initial conditions that differ from those seen in the training-testing-validation data, the picture would be quite different. For example, one could test the inference power of the different learned dynamics when the initial positions of the oscillators form a sinusoid with unit amplitude. We can see the difference in accuracy between the different learned dynamics by simulating forward in time the dynamic learned by the CINDy algorithm and the SINDy algorithm, and comparing them to the evolution of the true dynamic. The results in Figure 10 show the difference in behaviour at different times for the dynamics learnt by the two algorithms in the integral formulation, with a noise level of − , for the example of dimensionality d = 10.
Figure 7:
Sparse recovery of the Fermi-Pasta-Ulam-Tsingou model:
Algorithm comparison for a Fermi-Pasta-Ulam-Tsingou model of dimension d = , with a differential formulation shown on the left column, and with an integral formulation on the right column. The first, second, third, fourth and fifth rows of images indicate a comparison of E_R, E_D, E_T, S_E, and S_M, as we vary the noise level, respectively.
Figure 8:
Sparse recovery of the Fermi-Pasta-Ulam-Tsingou model:
Algorithm comparison for a Fermi-Pasta-Ulam-Tsingou model of dimension d = , with a differential formulation shown on the left column, and with an integral formulation on the right column. The first, second, third, fourth and fifth rows of images indicate a comparison of E_R, E_D, E_T, S_E, and S_M, as we vary the noise level, respectively.
Figure 9:
Sample efficiency of the sparse recovery of the Fermi-Pasta-Ulam-Tsingou model:
Algorithm comparison for a Fermi-Pasta-Ulam-Tsingou model of dimension d = for both formulations.

In keeping with the physical nature of the problem, we present the ten-dimensional phenomenon as a series of oscillators suffering a displacement on the vertical y-axis, in a similar fashion as was done in the original paper (Fermi et al., 1955). Note that we have added to the images the two extremal particles that do not oscillate.
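For completeness, a minimal sketch of the cubic Fermi-Pasta-Ulam-Tsingou right-hand side, written as a first-order system suitable for a standard ODE integrator, is given below. It follows the equations of motion stated above, with the two non-oscillating boundary particles held fixed; the function and variable names are our own.

    import numpy as np

    def fput_rhs(t, state, beta):
        # state stacks the displacements x (length d) and velocities v (length d);
        # the boundary particles x_0 = x_{d+1} = 0 are kept fixed.
        d = state.size // 2
        x, v = state[:d], state[d:]
        xp = np.concatenate(([0.0], x, [0.0]))                   # pad with fixed ends
        lin = xp[2:] - 2.0 * xp[1:-1] + xp[:-2]                   # x_{i+1} - 2 x_i + x_{i-1}
        cub = beta * ((xp[2:] - xp[1:-1]) ** 3 - (xp[1:-1] - xp[:-2]) ** 3)
        return np.concatenate((v, lin + cub))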
Figure 10:
Trajectory comparison:
Fermi-Pasta-Ulam-Tsingou model of dimension d = 10. The panels show the particle displacements at t = 0.0, 1.6, 3.2, 4.8, 6.5, and 8.1 s for the dynamics learned by SINDy and CINDy, together with the exact dynamic.

The two orders of magnitude of difference between the SINDy and the CINDy algorithms, in terms of E_R, manifests itself clearly when we try to predict trajectories with out-of-sample initial positions that differ from those that have been used for learning. This suggests that the CINDy algorithm has better generalization properties under noise.

The Michaelis-Menten model is used to describe enzyme reaction kinetics (Michaelis & Menten, 2007). We focus on the following derivation (Briggs & Haldane, 1925), in which an enzyme E combines with a substrate S to form an intermediate product ES with a reaction rate k_f. This reaction is reversible, in the sense that the intermediate product ES can decompose into E and S with a reaction rate k_r. This intermediate product ES can also proceed to form a product P and regenerate the free enzyme E. This can be expressed as

    S + E ⇌ E·S → E + P,

with forward rate k_f, backward rate k_r, and catalytic rate k_cat. If we assume that the rate of a given reaction depends proportionally on the concentration of the reactants, and we denote the concentrations of E, S, ES and P as x_E, x_S, x_ES and x_P, respectively, we can express the dynamics of the chemical reaction as:

    ẋ_E  = −k_f x_E x_S + k_r x_ES + k_cat x_ES
    ẋ_S  = −k_f x_E x_S + k_r x_ES
    ẋ_ES =  k_f x_E x_S − k_r x_ES − k_cat x_ES
    ẋ_P  =  k_cat x_ES.

One of the interesting things about the Michaelis-Menten dynamic is that we can use some of the structural constraints described in Section 2.1.1; that is, the exact dynamic satisfies ẋ_S + ẋ_ES + ẋ_P = ẋ_E + ẋ_ES = 0. The exact dynamic Ξ can be expressed using a dictionary of monomials of degree up to two. To be more precise, the dictionary used is

    D = { x_E^{a_E} x_S^{a_S} x_ES^{a_ES} x_P^{a_P} | a_E, a_S, a_ES, a_P ∈ ⟦0, 2⟧, 0 ≤ a_E + a_S + a_ES + a_P ≤ 2 }.

With this dictionary in mind, and denoting the coefficient vector associated with the chemical E as ξ_E, and likewise for the other chemicals, we can impose the following additional constraints (componentwise, for all i):

    ξ_S + ξ_ES + ξ_P = ξ_E + ξ_ES = 0.

The images in Figure 11 show the recovery results for k_f = , k_r = , k_cat = and d = 4. The derivatives used were computed using a polynomial interpolation of degree 8. We used polynomial interpolation of degree to compute the integrals for all the noise levels. A total of points were used to infer the dynamic, spread over experiments for a maximum time of seconds. The initial state of the system for the j-th experiment was selected randomly as x_j(t_1) ∼ U[ , ].

As in the previous experiment, in both the differential and the integral formulation, all the algorithms perform similarly in terms of E_D and E_T. For noise levels below − there is a clear benefit to using the CINDy and CINDy (c) algorithms, as these are up to an order of magnitude more accurate than the other algorithms in terms of E_R. For noise levels above − the CINDy and CINDy (c) algorithms perform similarly to the other algorithms in terms of E_R. As in the previous examples, there is also a huge difference in the number of extra basis functions picked up, as measured by S_E, with CINDy producing in general the sparsest solutions,
Figure 11:
Sparse recovery of the Michaelis-Menten model:
Algorithm comparison for a Michaelis-Menten model of dimension d = 4, with a differential formulation shown on the left column, and with an integral formulation on the right column. The first, second, third, fourth and fifth rows of images indicate a comparison of E_R, E_D, E_T, S_E, and S_M, as we vary the noise level, respectively.

followed by FISTA and CINDy (c) (although the latter has the added benefit that it preserves some of the structure behind the physical phenomenon due to the constraints). The IPM, IPM (c) and SINDy algorithms tend to pick up all the available basis functions for noise levels above − . In terms of correct basis functions that have not been picked up, the FISTA, CINDy and CINDy (c) algorithms only miss out, on average, on fewer than one of the basis functions at the highest noise levels in the differential formulation. Similar comments can be made regarding the integral formulation, where we observe that the aforementioned three algorithms only miss out on some basis functions for noise levels above − .
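A minimal sketch of how Michaelis-Menten trajectories for such an experiment could be generated is shown below; the rate constants, the time horizon, and the number of sample points are illustrative placeholders of ours (the concrete experimental values are not reproduced here), and the function names are our own.

    import numpy as np
    from scipy.integrate import solve_ivp

    def michaelis_menten_rhs(t, x, k_f, k_r, k_cat):
        # x = (x_E, x_S, x_ES, x_P), following the kinetics stated above;
        # note that dx_E + dx_ES = 0 and dx_S + dx_ES + dx_P = 0 hold exactly.
        x_E, x_S, x_ES, x_P = x
        dx_E = -k_f * x_E * x_S + k_r * x_ES + k_cat * x_ES
        dx_S = -k_f * x_E * x_S + k_r * x_ES
        dx_ES = k_f * x_E * x_S - k_r * x_ES - k_cat * x_ES
        dx_P = k_cat * x_ES
        return [dx_E, dx_S, dx_ES, dx_P]

    # One illustrative experiment: random initial concentrations, placeholder rates.
    rng = np.random.default_rng(0)
    x0 = rng.uniform(0.0, 1.0, size=4)
    t_eval = np.linspace(0.0, 5.0, 200)
    sol = solve_ivp(michaelis_menten_rhs, (0.0, 5.0), x0, t_eval=t_eval,
                    args=(1.0, 0.5, 1.0))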
Acknowledgments

This research was partially funded by Deutsche Forschungsgemeinschaft (DFG) through the DFG Cluster of Excellence MATH+ and Project A05 in CRC TRR 154.
References
Acebrón, J. A., Bonilla, L. L., Vicente, C. J. P., Ritort, F., and Spigler, R. The Kuramoto model: A simple paradigm for synchronization phenomena. Reviews of Modern Physics, 77(1):137, 2005.

Andersen, M., Dahl, J., Liu, Z., Vandenberghe, L., Sra, S., Nowozin, S., and Wright, S. Interior-point methods for large-scale cone programming. Optimization for Machine Learning, 5583, 2011.

Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Borwein, J. and Lewis, A. S. Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer Science & Business Media, 2010.

Braun, G., Pokutta, S., Tu, D., and Wright, S. Blended conditional gradients. In International Conference on Machine Learning, pp. 735–743. PMLR, 2019.

Braun, G., Carderera, A., Combettes, C. W., Hassani, H., Karbasi, A., Mokhtari, A., and Pokutta, S. Conditional gradients – a survey. Forthcoming, 2021.

Briggs, G. E. and Haldane, J. B. S. A note on the kinetics of enzyme action. Biochemical Journal, 19(2):338–339, 1925.

Brunton, S. L., Proctor, J. L., and Kutz, J. N. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932–3937, 2016.

Candès, E. J., Romberg, J., and Tao, T. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.

Champion, K., Lusch, B., Kutz, J. N., and Brunton, S. L. Data-driven discovery of coordinates and governing equations. Proceedings of the National Academy of Sciences, 116(45):22445–22451, 2019. ISSN 0027-8424. doi: 10.1073/pnas.1906995116.

Champion, K., Zheng, P., Aravkin, A. Y., Brunton, S. L., and Kutz, J. N. A unified sparse optimization framework to learn parsimonious physics-informed models from data. IEEE Access, 8:169259–169271, 2020.

Chen, S. S., Donoho, D. L., and Saunders, M. A. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.

Combettes, P. L. and Pesquet, J.-C. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer, 2011.

Condat, L. Fast projection onto the simplex and the ℓ₁ ball. Mathematical Programming, 158(1-2):575–585, 2016.

Fermi, E., Pasta, P., Ulam, S., and Tsingou, M. Studies of the nonlinear problems. Technical report, Los Alamos Scientific Lab., N. Mex., 1955.

Foucart, S. and Rauhut, H. A mathematical introduction to compressive sensing. Bull. Am. Math, 54:151–165, 2017.

Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.

Gelß, P., Klus, S., Eisert, J., and Schütte, C. Multidimensional approximation of nonlinear dynamical systems. Journal of Computational and Nonlinear Dynamics, 14(6), 2019.

Hanke, M. Accelerated Landweber iterations for the solution of ill-posed equations. Numer. Math., 60:341–373, 1991.

Heldt, D., Kreuzer, M., Pokutta, S., and Poulisse, H. Approximate computation of zero-dimensional polynomial ideals. Journal of Symbolic Computation, 44(11):1566–1591, 2009.

Hoffmann, M., Fröhner, C., and Noé, F. Reactive SINDy: Discovering governing reactions from concentration data. The Journal of Chemical Physics, 150(2):025101, 2019.

Juditsky, A. and Nemirovski, A. Statistical Inference via Convex Optimization, volume 69. Princeton University Press, 2020.

Kaheman, K., Brunton, S. L., and Kutz, J. N. Automatic differentiation to simultaneously identify nonlinear dynamics and extract noise probability distributions from data. arXiv preprint arXiv:2009.08810, 2020.

Knowles, I. and Renka, R. J. Methods for numerical differentiation of noisy data. Electron. J. Differ. Equ., 21:235–246, 2014.

Kuramoto, Y. Self-entrainment of a population of coupled non-linear oscillators. In International Symposium on Mathematical Problems in Theoretical Physics, pp. 420–422. Springer, 1975.

Levitin, E. S. and Polyak, B. T. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1–50, 1966.

Loiseau, J.-C. and Brunton, S. L. Constrained sparse Galerkin regression. Journal of Fluid Mechanics, 838:42–67, 2018.

Michaelis, L. and Menten, M. L. Die Kinetik der Invertinwirkung. Universitätsbibliothek Johann Christian Senckenberg, 2007.

Morozov, V. On the solution of functional equations by the method of regularization. Soviet Math. Dokl., 7:414–417, 1966.

Nesterov, Y. Lectures on Convex Optimization, volume 137. Springer, 2018.

Nesterov, Y. E. A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, volume 269, pp. 543–547, 1983.

Noether, E. Invariante Variationsprobleme. Nachrichten der Königlichen Gesellschaft der Wissenschaften, Mathematisch-Physikalische Klasse 2, 235–257. Transport Theory and Statistical Physics, pp. 183–207, 1918.

Rudy, S. H., Brunton, S. L., Proctor, J. L., and Kutz, J. N. Data-driven discovery of partial differential equations. Science Advances, 3(4):e1602614, 2017.

Rudy, S. H., Kutz, J. N., and Brunton, S. L. Deep learning of dynamics and signal-noise decomposition with time-stepping constraints. Journal of Computational Physics, 396:483–506, 2019.

Schaeffer, H. Learning partial differential equations via data discovery and sparse optimization. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 473(2197):20160446, 2017.

Schaeffer, H. and McCalla, S. G. Sparse model selection via integral terms. Physical Review E, 96(2):023302, 2017.

Schmidt, M. and Lipson, H. Distilling free-form natural laws from experimental data. Science, 324(5923):81–85, 2009.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

Tibshirani, R. J. et al. The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490, 2013.

Wainwright, M. J. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ₁-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.

Zhang, L. and Schaeffer, H. On the convergence of the SINDy algorithm. Multiscale Modeling & Simulation, 17(3):948–972, 2019.

Zheng, P., Askham, T., Brunton, S. L., Kutz, J. N., and Aravkin, A. Y. A unified framework for sparse relaxed regularized regression: SR3. IEEE Access, 7:1404–1423, 2018.
Appendix A. Preliminaries
Given a differentiable function f(x) : ℝ^d → ℝ we say the function f(x) is:

Definition A.1 (L-smooth). A function is L-smooth if for any x, y ∈ ℝ^d we have:

    f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖².

This is equivalent to the gradient of f(x) being L-Lipschitz. If the function is twice differentiable, this is equivalent to:

    0 < L = max_{x, y ∈ ℝ^d} (x − y)^T ∇²f(x) (x − y) / ‖x − y‖².    (A.1)

Definition A.2 (Convex). A function is convex if for any x, y ∈ ℝ^d we have:

    f(x) ≥ f(y) + ⟨∇f(y), x − y⟩.

If the function is twice differentiable, this is equivalent to:

    0 ≤ min_{x, y ∈ ℝ^d} (x − y)^T ∇²f(x) (x − y) / ‖x − y‖².    (A.2)

Definition A.3 (μ-strongly convex). A function is μ-strongly convex if for any x, y ∈ ℝ^d we have:

    f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (μ/2)‖x − y‖².

If the function is twice differentiable, this is equivalent to:

    0 < μ = min_{x, y ∈ ℝ^d} (x − y)^T ∇²f(x) (x − y) / ‖x − y‖².    (A.3)

Given a compact convex set X ⊂ ℝ^d we define the constrained optimization problem:

    min_{x ∈ X} f(x).    (A.4)

Appendix B. Accelerated Projected Gradient Descent

In this section we focus on the case where f(x) is L-smooth and convex (or potentially μ-strongly convex). One of the key characteristics of convex problems is that any local minimum of the problem in Equation (A.4) is a global minimum. Moreover, there exist efficient algorithms for computing the minima of these problems. In order to tackle the problem shown in Equation (A.4), one can, for example, use either a projection-based or a projection-free method, depending on how computationally difficult it is to compute projections onto X. We denote the Euclidean projection of x onto X as Π_X(x) : ℝ^d → X, which is defined as:

    Π_X(x) := argmin_{y ∈ X} ‖x − y‖².

In general, computing these projections is non-trivial. However, for a series of structured feasible regions there are closed-form expressions for these projections, which can be computed efficiently (see Table 1):
Feasible region X            Mathematical expression                               Projection
Unit probability simplex     { x ∈ ℝ^d : 1ᵀx = 1, x ≥ 0 }                          O(d)
ℓ_p-ball, p ∈ {1, 2, +∞}     { x ∈ ℝ^d : ‖x‖_p ≤ 1 }                               O(d)
Nuclear norm-ball            { X ∈ ℝ^{m×n} : ‖X‖_nuc ≤ 1 }                         O(mn min{m, n})
Matroid polytope             { x ∈ ℝ^d : 1_Sᵀx ≤ r(S) ∀ S ∈ P(E), x ≥ 0 }          O(poly(d))

Table 1: Complexities of projections onto several feasible regions.

Fortunately, projections onto the probability simplex can be computed very efficiently, which makes accelerated projected gradient descent algorithms an attractive alternative when solving constrained convex problems over the probability simplex (as is the case in Line 6 of Algorithm 2). These algorithms are termed accelerated because they improve upon the convergence guarantees offered by the standard projected gradient descent algorithm, both in the convex case (see Algorithm 4) and in the strongly convex case (see Algorithm 5). If we measure optimality by the number of iterations k needed for the algorithms to achieve an ε-optimal accuracy (which means that f(x_k) − min_{x ∈ X} f(x) ≤ ε), then in the smooth convex case the accelerated projected gradient descent algorithm is able to reach an ε-optimal solution in O(1/√ε) iterations, as opposed to the O(1/ε) iterations needed with standard projected gradient descent. In the smooth and strongly convex case the accelerated projected gradient descent algorithm achieves an ε-optimal solution in O(√(L/μ) log(1/ε)) iterations, as opposed to the O((L/μ) log(1/ε)) iterations needed for the standard projected gradient descent algorithm.

Algorithm 4:
Accelerated gradient descent for smooth convex problems.
Input :
Objective function 𝑓 ( x ) , feasible region X , initial point x ∈ X . Output :
Point x_{K+1} ∈ X.
  y_1 ← x_1
  L ← max_{x, y ∈ X} (x − y)^T ∇²f(x) (x − y) / ‖x − y‖²
  γ_1 ← 1
  for k = 1 to K do
      x_{k+1} ← Π_X( y_k − (1/L) ∇f(y_k) )
      γ_{k+1} ← (1 + √(1 + 4γ_k²)) / 2
      y_{k+1} ← x_{k+1} + ((γ_k − 1)/γ_{k+1}) (x_{k+1} − x_k)
  end

Algorithm 5: Accelerated gradient descent for smooth strongly-convex problems.
Input :
Objective function 𝑓 ( x ) , feasible region X , initial point x ∈ X . Output :
Point x_{K+1} ∈ X.
  y_1 ← x_1
  μ ← min_{x, y ∈ X} (x − y)^T ∇²f(x) (x − y) / ‖x − y‖²
  L ← max_{x, y ∈ X} (x − y)^T ∇²f(x) (x − y) / ‖x − y‖²
  for k = 1 to K do
      x_{k+1} ← Π_X( y_k − (1/L) ∇f(y_k) )
      y_{k+1} ← x_{k+1} + ((1 − √(μ/L))/(1 + √(μ/L))) (x_{k+1} − x_k)
  end
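The following is a minimal Python sketch of Algorithm 4, assuming a Euclidean projection onto the unit probability simplex computed with the classical sort-based method (cf. Condat, 2016); the function names and the stopping rule (a fixed iteration budget) are our own simplifications.

    import numpy as np

    def project_simplex(v):
        # Euclidean projection of v onto { x : sum(x) = 1, x >= 0 } (sort-based).
        u = np.sort(v)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u + (1.0 - css) / np.arange(1, v.size + 1) > 0)[0][-1]
        lam = (1.0 - css[rho]) / (rho + 1.0)
        return np.maximum(v + lam, 0.0)

    def accelerated_pgd(grad_f, L, x0, project=project_simplex, num_iters=1000):
        # Accelerated projected gradient descent for smooth convex problems (Algorithm 4).
        x_prev, y, gamma = x0, x0, 1.0
        for _ in range(num_iters):
            x = project(y - grad_f(y) / L)
            gamma_next = (1.0 + np.sqrt(1.0 + 4.0 * gamma ** 2)) / 2.0
            y = x + (gamma - 1.0) / gamma_next * (x - x_prev)
            x_prev, gamma = x, gamma_next
        return x_prev

The strongly convex variant (Algorithm 5) is obtained by replacing the momentum coefficient with the constant (1 − √(μ/L))/(1 + √(μ/L)).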
Appendix C. On computing integrals and derivatives from noisy data

One of the key requirements for the success of any sparse recovery algorithm is the accurate estimation of integrals and derivatives. We are typically given a dictionary of basis functions D = { ψ_i | i ∈ ⟦1, n⟧ } that can be used to represent our dynamic, encoded by ẋ(t) = Ξ^T ψ(x(t)), where Ξ ∈ ℝ^{n×d} and ψ(x(t)) = [ψ_1(x(t)), …, ψ_n(x(t))]^T ∈ ℝ^n, along with some data points. Ideally, we would like to observe a series of noise-free data points from the physical system {x(t_i)}_{i=1}^{m}. We would also like to observe {ẋ(t_i)}_{i=1}^{m}, if we follow the differential approach, so that we can construct the matrix Ẋ in the left-hand problem of Equation (C.1); or we would like to observe { ∫_{t_1}^{t_{j+1}} ψ_i(x(τ)) dτ }_{j=1}^{m−1} for all i ∈ ⟦1, n⟧, if we follow the integral approach, so that we can construct the Γ(X) matrix in the right-hand problem of Equation (C.1).

    Noise-free LASSO, differential approach:   argmin_{‖Ω‖_{1,1} ≤ α, Ω ∈ ℝ^{n×d}} ‖Ẋ − Ω^T Ψ(X)‖_F
    Noise-free LASSO, integral approach:       argmin_{‖Ω‖_{1,1} ≤ α, Ω ∈ ℝ^{n×d}} ‖δX − Ω^T Γ(X)‖_F        (C.1)

However, in general we do not observe either of these quantities, and we have to estimate Ẋ and Γ(X) from {x(t_i)}_{i=1}^{m}. The estimation of these matrices is made even more difficult if we are only able to observe noise-corrupted data points {y(t_i)}_{i=1}^{m} such that y(t_i) = x(t_i) + ν(t_i), where {ν(t_i)}_{i=1}^{m} is a set of i.i.d. noise vectors, which is typically the case. In this case, we can only work with a noisy matrix Y, from which we either have to estimate Ẏ ∈ ℝ^{d×m} or Γ(Y) ∈ ℝ^{n×(m−1)}. The error in the formulation now comes both from the noisy data and from the estimation of the derivatives/integrals from noisy data. The question then becomes: how do ‖Ẏ − Ẋ‖_F and ‖Γ(Y) − Γ(X)‖_F evolve as we increase the noise? Given data contaminated with noise, is it easier to get an accurate estimate of Ẏ than of Γ(Y)? If so, we might favor one approach over the other.

For example, if we let the Fermi-Pasta-Ulam-Tsingou system with d = evolve from a random initial state, sample the system T/c times at regularly spaced intervals, and repeat this experiment c = times, we generate T = data points x_j(t_i) with i ∈ ⟦1, T/c⟧ and j ∈ ⟦1, c⟧, with t_{T/c} = seconds. We can proceed to corrupt these data points with Gaussian noise as in Section 4, generating data points y_j(t_i) = x_j(t_i) + α N(0, Σ). In this case we need to estimate Ẍ if we want to formulate the problem from a differential perspective, or Ẋ if we want to formulate the problem from an integral perspective. Using the same dictionary of basis functions as in Section 4.3, we use the first-order central difference rule as well as differentiation of local polynomial interpolations to estimate the first derivative of X with respect to time, using Y. The results are shown in the image on the left in Figure 12. We also use the second-order central difference formula, as well as differentiation of local polynomial interpolations, to estimate the second derivative of X with respect to time, using Y.
The results can be seen in the image on the right in Figure 12. We also use Simpson's quadrature rule, as well as integration of local polynomial interpolations, to estimate the integral matrix Γ(Ẋ) using Ẏ, which is our estimate of Ẋ; the errors when computing these integrals with the aforementioned methods are shown in Figure 13.
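A simplified sketch of the two estimation steps compared here is given below: local polynomial fitting for the derivatives, and a cumulative quadrature rule for the integral matrix Γ. The window size, the polynomial degree, and the use of the trapezoid rule (instead of Simpson's rule) are our own simplifications for illustration.

    import numpy as np

    def poly_derivative(t, y, degree=4, window=9, order=1):
        # Estimate the order-th derivative of one coordinate of the noisy signal y(t)
        # by fitting a local polynomial of the given degree on a sliding window.
        half = window // 2
        dy = np.empty_like(y)
        for k in range(y.size):
            lo, hi = max(0, k - half), min(y.size, k + half + 1)
            coeffs = np.polyfit(t[lo:hi] - t[k], y[lo:hi], degree)
            dy[k] = np.polyder(np.poly1d(coeffs), order)(0.0)
        return dy

    def dictionary_integrals(t, Psi_of_y):
        # Cumulative integrals int_{t_1}^{t_{j+1}} psi_i(y(tau)) dtau via the trapezoid
        # rule; Psi_of_y is the (n, m) matrix of dictionary evaluations along the data.
        increments = 0.5 * (Psi_of_y[:, 1:] + Psi_of_y[:, :-1]) * np.diff(t)
        return np.cumsum(increments, axis=1)   # shape (n, m-1)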
Figure 12:
Fermi-Pasta-Ulam-Tsingou:
Comparison of estimates of the first and second derivatives with respect to time, computed from the noisy data using central differences and differentiation of local polynomial interpolations of degrees 2 to 8, with the exact first and second order derivatives.
Figure 13:
Fermi-Pasta-Ulam-Tsingou:
Comparison of estimates of the integral matrix Γ, computed from the noisy data using Simpson's quadrature rule and integration of local polynomial interpolations of degrees 2, 4, 6 and 8, with the exact values.

Regarding the errors in approximating Ẋ and Ẍ, for a given method we expect the errors in estimating Ẍ to be higher than those in estimating Ẋ, which is what we observe in the experiments. There is no direct way to make a fair comparison between an approximation to the matrices Ẋ and Ẍ and an approximation to the matrix Γ(Ẋ), given their different nature (and even size); however, if we use the metrics ‖Ẋ − Ẏ‖_F/‖Ẋ‖_F, ‖Ẍ − Ÿ‖_F/‖Ẍ‖_F and ‖Γ(Ẋ) − Γ(Ẏ)‖_F/‖Γ(Ẋ)‖_F to compare the two approximations, the data seems to suggest that it is easier to estimate the matrix Γ(Ẋ) than it is to estimate Ẍ, at least in the current experiment with the Fermi-Pasta-Ulam-Tsingou model.

Appendix D. Additional figures

The images shown in Figures 14 and 15 show the objective function evaluation for the constrained version of the Kuramoto LASSO problem with d = and d = , respectively. The objective function is evaluated for the different methods, for both the training data and the testing data, and for both the differential and the integral formulation.

The images shown in Figures 16 and 17 show the objective function evaluation for the constrained version of the Fermi-Pasta-Ulam-Tsingou LASSO problem with d = and d = , respectively. The objective function is evaluated for the different methods, for both the training data and the testing data, and for both the differential and the integral formulation.

The images shown in Figure 18 show the objective function evaluation for the constrained version of the Michaelis-Menten LASSO problem with d = 4. The objective function is evaluated for the different methods, for both the training data and the testing data, and for both the differential and the integral formulation.
Figure 14: Kuramoto: Evaluation of ‖Ẏ_training − Ω^T Ψ(Y_training)‖_F (a) for the differential formulation, and of ‖δY_training − Ω^T Γ(Y_training)‖_F (b) for the integral formulation, for the experiments with d = ; and evaluation of ‖Ẏ_validation − Ω^T Ψ(Y_validation)‖_F (c) for the differential formulation, and of ‖δY_validation − Ω^T Γ(Y_validation)‖_F (d) for the integral formulation, for the experiments with d = .
Figure 15: Kuramoto: Evaluation of ‖Ẏ_training − Ω^T Ψ(Y_training)‖_F (a) for the differential formulation, and of ‖δY_training − Ω^T Γ(Y_training)‖_F (b) for the integral formulation, for the experiments with d = ; and evaluation of ‖Ẏ_validation − Ω^T Ψ(Y_validation)‖_F (c) for the differential formulation, and of ‖δY_validation − Ω^T Γ(Y_validation)‖_F (d) for the integral formulation, for the experiments with d = .
Figure 16: Fermi-Pasta-Ulam-Tsingou: Evaluation of ‖Ÿ_training − Ω^T Ψ(Y_training)‖_F (a) for the differential formulation, and of ‖δẎ_training − Ω^T Γ(Y_training)‖_F (b) for the integral formulation, with d = ; and evaluation of ‖Ÿ_validation − Ω^T Ψ(Y_validation)‖_F (c) for the differential formulation, and of ‖δẎ_validation − Ω^T Γ(Y_validation)‖_F (d) for the integral formulation, with d = .
Figure 17: Fermi-Pasta-Ulam-Tsingou: Evaluation of ‖Ÿ_training − Ω^T Ψ(Y_training)‖_F (a) for the differential formulation, and of ‖δẎ_training − Ω^T Γ(Y_training)‖_F (b) for the integral formulation, with d = ; and evaluation of ‖Ÿ_validation − Ω^T Ψ(Y_validation)‖_F (c) for the differential formulation, and of ‖δẎ_validation − Ω^T Γ(Y_validation)‖_F (d) for the integral formulation, with d = .
Figure 18: Michaelis-Menten: Evaluation of ‖Ẏ_training − Ω^T Ψ(Y_training)‖_F (a) for the differential formulation, and of ‖δY_training − Ω^T Γ(Y_training)‖_F (b) for the integral formulation, with d = 4; and evaluation of ‖Ẏ_validation − Ω^T Ψ(Y_validation)‖_F (c) for the differential formulation, and of ‖δY_validation − Ω^T Γ(Y_validation)‖_F (d) for the integral formulation, with d = 4.

Appendix E. Sample efficiency
The images shown in Figures 19, 20, and 21 show the evolution of E_R, S_E and S_M as we vary the number of training data points when learning the Kuramoto dynamic (d = ) with the dictionary described in Section 4.2. The images show the resulting metrics when generating data points per experiment and using local polynomial interpolation of degree to compute the derivatives and the integrals. Note that the
Figure 19:
Sample efficiency of the sparse recovery of the Kuramoto model:
Algorithm comparison in terms of E_R for a Kuramoto model of dimension d = for the differential formulation (left) and the integral formulation (right).

artifacts present in the images are caused by the use of cubic interpolation to generate the images, which results in an oscillatory behaviour in the heat maps.

The images shown in Figures 22, 23, and 24 show the evolution of E_R, S_E and S_M as we vary the number of training data points when learning the Fermi-Pasta-Ulam-Tsingou dynamic (d = ) with the dictionary described in Section 4.3. The images show the resulting metrics when generating data points per experiment and using local polynomial interpolation of degree to compute the derivatives and the integrals.
Figure 20:
Sample efficiency of the sparse recovery of the Kuramoto model:
Algorithm comparison in terms of S_E for a Kuramoto model of dimension d = for the differential formulation (left) and the integral formulation (right).
Figure 21:
Sample efficiency of the sparse recovery of the Kuramoto model:
Algorithm comparison in terms of S_M for a Kuramoto model of dimension d = for the differential formulation (left) and the integral formulation (right).
Figure 22:
Sample efficiency of the sparse recovery of the Fermi-Pasta-Ulam-Tsingou model:
Algorithm comparison in terms of E_R for a Fermi-Pasta-Ulam-Tsingou model of dimension d = for the differential formulation (left) and the integral formulation (right).
Figure 23:
Sample efficiency of the sparse recovery of the Fermi-Pasta-Ulam-Tsingou model:
Algorithm comparison in terms of S_E for a Fermi-Pasta-Ulam-Tsingou model of dimension d = for the differential formulation (left) and the integral formulation (right).
Figure 24:
Sample efficiency of the sparse recovery of the Fermi-Pasta-Ulam-Tsingou model:
Algorithm comparison in terms of S_M for a Fermi-Pasta-Ulam-Tsingou model of dimension d = 5 for the differential formulation (left) and the integral formulation (right).