A Unified Approach for Sparse Dynamical System Inference from Temporal Measurements
Yannis Pantazis∗ and Ioannis Tsamardinos∗†

∗ Institute of Applied and Computational Mathematics, FORTH, Heraklion, 70013, Greece
† Computer Science Department, University of Crete, Heraklion, 70013, Greece and Gnosis Data Analysis PC, Palaiokapa 64, 71305, Heraklion, Greece

January 23, 2019
Abstract

Motivation:
Temporal variations in biological systems, and more generally in the natural sciences, are typically modelled as a set of ordinary, partial, or stochastic differential or difference equations. Algorithms for learning the structure and the parameters of a dynamical system are distinguished based on whether time is discrete or continuous, whether observations are time-series or time-course, and whether the system is deterministic or stochastic. However, no existing approach is able to handle the various types of dynamical systems simultaneously.
Results:
In this paper, we present a unified approach to infer both the structure and the parameters of nonlinear dynamical systems of any type under the restriction of being linear with respect to the unknown parameters. Our approach, which is named Unified Sparse Dynamics Learning (USDL), consists of two steps. First, an atemporal system of equations is derived through the application of the weak formulation. Then, assuming a sparse representation for the dynamical system, we show that the inference problem can be expressed as a sparse signal recovery problem, allowing the application of an extensive body of algorithms and theoretical results. Results on simulated data demonstrate the efficacy and superiority of the USDL algorithm under multiple interventions and/or stochasticity. Additionally, USDL's accuracy significantly correlates with theoretical metrics such as the exact recovery coefficient. On real single-cell data, the proposed approach is able to induce high-confidence subgraphs of the signaling pathway.
The great majority of biological processes are time-varying, requiring the use of dynamical models for their quantitative description. Examples range from macroscopic processes, studied for instance in epidemiology and population dynamics, to microscopic processes such as biochemical reactions and gene regulation in a living cell, all of which are modeled as time-varying dynamical systems of complex interactions [1]. Learning the structural form and the parameters of a dynamical system allows one to predict not only the evolution of the system but also the effects of manipulation and perturbation. Depending on the characteristics of the biological system under study as well as on the available measurements, a palette of dynamical model formalisms have been successfully applied. For deterministic processes, typical types of models include difference equations (when time is modeled as discrete), ordinary differential equations (ODEs) for continuous time, and their space-extended counterpart, partial differential equations. For stochastic processes, typical types of models include Markov chains such as auto-regressive and moving average models (for discrete time) and stochastic differential equations for continuous time. Due to the absence of a unifying mechanism, inferring the structure and the parameters of a dynamical system depends on the underlying formalism, requiring specialized techniques and algorithms.

Several approaches that learn the structural form of an ODE system have been presented in the literature. One of the first attempts was the Inferelator [2], which is an algorithm for de novo learning of parsimonious regulatory networks from systems-biology data sets using a shrinkage operator that induces sparsity. ODEion [3] searches over the model space and performs an optimization-via-simulation approach, while SINDy is another sparsity-induced algorithm [4, 5]. Sparse regression dedicated to single-cell data has also been proposed [6]. Generally, sparse solutions are obtained through convex relaxation approaches [7] such as linear programming or the Lasso [8] and, not surprisingly, such optimization programs have been studied extensively in structure inference and network reconstruction [9, 10, 11, 12, 13]. Bayesian inference approaches [14, 15] have also been utilized for structure learning and phenomenological modeling of deterministic dynamical systems. Bayesian methods can tackle unmeasured (latent) variables at the cost of increased computational demands due to the substantial sampling of the model space. Moreover, theoretical guarantees on the performance are hard to obtain for Bayesian methods. Studies on stochastic dynamical systems prove, under appropriate assumptions, that the true sparse stochastic dynamics are guaranteed to be inferred [13, 16]. The prominent feature of these studies is that the dynamical system is linear with respect to both the unknown parameters and the state variables. Given the structure of the dynamical system, parameter estimation algorithms are also partitioned according to the type of the dynamical system [17].

In this paper, we present the USDL (Unified Sparse Dynamics Learning) algorithm, which is a novel approach to infer both the structure and the parameters of any dynamical system from temporal measurements.
First, by employing the weak formulation [18, 19], the problem of inducing the structure of a dynamical model is transformed into an equivalent yet atemporal learning problem. The weak formulation can be intuitively understood as a projection operator that multiplies the dynamical system's equation by an arbitrary function, called a test function, and then integrates over time and/or space. The weak formulation can also be thought of as a type of feature extraction. The weak formulation has several advantages: (a) by using integration by parts, the weak formulation does not require the computation of the derivatives of the trajectories; in contrast, other methods compute the derivatives numerically, thus amplifying the noise and deteriorating the reconstruction accuracy; (b) by suitable definition of the test functions, the same algorithm can be applied to almost any type of dynamical system, thus unification across different families of dynamical models is achieved; and (c) the weak formulation transforms the dynamical system into a linear system of equations where the time and/or space dimensions have been completely eliminated.

Second, we assume sparsity of the solution, which, in combination with the weak formulation, results in a well-posed, well-studied problem in computer science, namely sparse signal recovery [20, 21], also known as compressed sensing [22, 23]. Sparsity in our context means that the dynamics of each state variable are typically driven by a relatively small number of variables. Sparsity is critical for learning large systems from finite data and constitutes a form of complexity penalization and regularization, thus favoring simpler solutions. The reformulation of the problem as a sparse signal recovery problem allows us to straightforwardly apply a large, vividly-evolving body of theoretical and algorithmic results. Specifically, we choose the Orthogonal Matching Pursuit algorithm [24, 25, 26], which is a greedy and fast algorithm for recovering the sparse solution, while theoretical guarantees on the correctness of the learned solution are provided based on the mutual incoherence parameter [27] and the exact recovery coefficient [7]. The presented examples reveal that the above (in)coherency metrics, and especially the latter, are informative indicators of the reconstruction accuracy as measured by the precision and recall curves. Furthermore, multiple interventions typically convey crucial information about the true structure of a biological dynamical system [28]. The proposed USDL algorithm is capable of handling not only observational but also interventional data, a property that distinguishes it from the existing sparsity-induced approaches.

Finally, we compare the USDL algorithm with SINDy [4]. Despite the fact that SINDy occasionally converges faster than USDL with respect to the amount of data, as in the Lorenz96 model [29], we demonstrate that our approach performs better in terms of accuracy on both interventional time-course data and stationary stochastic time-series, as shown in the Results section by the respective protein networks and the multidimensional Ornstein-Uhlenbeck process, given a sufficient amount of temporal data. The merit of the weak formulation is notably highlighted at the stationary regime of the Ornstein-Uhlenbeck process, where an educated guess of test functions resulted in perfect reconstruction of the stochastic dynamical system, showing the plasticity and generality of the USDL algorithm which stems from the plethora of choices for the test functions.
The weak formulation has been employed primarily in the field of applied mathematics, where it provides a rigorous theoretical framework to define solutions that are not necessarily differentiable [30]. Another important application of the weak formulation is found in numerical analysis. Particularly, the finite elements method, which is a numerical technique for estimating approximate solutions of dynamical systems, is based on it [18]. In our setting, the solutions are given as measurements while the system of differential equations is unknown and has to be inferred. Thus, instead of applying the weak formulation to approximate solutions, we use it to transform the problem of learning the structure of a dynamical system into an equivalent yet atemporal learning problem.

For clarity purposes, the weak formulation is presented for ODEs, which are a particular example of dynamical systems. Let $x = x(t) \in \mathbb{R}^N$ be an $N$-dimensional vector function of time which represents the state variables, where $N$ is the number of state variables. Following physics notation (i.e., $\dot{x} := \frac{dx}{dt}$), a system of ODEs with linear parameters is defined as $\dot{x} = A\psi(x)$, $x(t_0) = x_0$, where $t_0$ is the starting time instant with initial value $x_0 \in \mathbb{R}^N$. $A \in \mathbb{R}^{N \times Q}$ is the unknown and usually sparse connectivity (or coefficient or parameter) matrix to be estimated. The dictionary, $\psi(\cdot)$, is a (given) $Q$-dimensional vector-valued function, $\psi: \mathbb{R}^N \to \mathbb{R}^Q$. In non-matrix form, the ODE system reads

$$\dot{x}_n = \sum_{q=1}^{Q} a_{nq}\, \psi_q(x), \qquad x_n(t_0) = x_{0,n}, \qquad n = 1, \ldots, N. \tag{1}$$

An example from the theory of biochemical reaction networks [31] is presented in Table 1. Under the mass action kinetics law and depending on the number of reactants in a chemical reaction network, two candidate dictionaries are shown. If the user assumes that only single-reactant reactions occur, then the dictionary is simply the identity function (i.e., $\psi(x) = x$), since the reaction rates are linear with respect to the state variables in this case. However, if the user assumes that two-reactant reactions occur, then the dictionary is augmented with all quadratic terms, as shown in the last row of Table 1.

Table 1: Candidate dictionaries, $\psi(x)$, according to the allowed chemical reaction types under the mass action kinetics law. Symbols $X_1$–$X_N$ correspond to the state variables (a.k.a. (reaction) species in chemistry and systems biology). The row vector $x_{i:j}$ is defined as $x_{i:j} := [x_i, x_{i+1}, \ldots, x_j]$. For the two-reactant case, the dynamical system is nonlinear with respect to the state variables.

Unknown reactions | Dictionary, $\psi(x)$ | Size, $Q$
$X_i \to X_i'$ | $x$ | $N$
$X_i + X_j \to X_i' + X_j'$ | $[x,\; x_1 x_{1:N},\; x_2 x_{2:N},\; \ldots,\; x_N x_N]^T$ | $(N+3)N/2$

In order to derive the (finite) weak formulation, a set of $M$ test functions denoted by $\{\phi_m(t)\}_{m=1}^{M}$ has to be specified. The test functions are smooth, not necessarily orthogonal, and can be chosen from a large repository of functions. Typical examples are polynomials, splines, Fourier modes (i.e., sines and cosines with varying frequency) or kernel functions from other integral transforms. Fourier modes are preferred when trajectories exhibit periodicities, while splines are more appropriate when localized phenomena have to be highlighted. As we will show later, customized families of test functions may be required in order to reveal imperceptible variable interactions. Proceeding, let $T$ be the final time and, without loss of generality, assume that $t_0 = 0$.
Denoting by $\langle f, g \rangle := \int_0^T f(t)\, g(t)\, dt$ the inner product between two functions $f$ and $g$ belonging to the $L^2$ function space, define the $M$-dimensional vector $z_n$ whose $m$-th element is given by $z_{nm} := \langle \dot{x}_n, \phi_m \rangle$, the $M \times Q$ dictionary matrix $\Psi$ whose $(m, q)$-th element is given by $\Psi_{mq} := \langle \psi_q(x), \phi_m \rangle$, and let $a_n$ be a $Q$-dimensional vector which corresponds to the $n$-th row of the matrix $A$. Then, the weak form of the ODE system in eq. (1) is

$$z_n = \Psi a_n, \qquad n = 1, \ldots, N. \tag{2}$$

In Appendix A, we provide the detailed derivation of the weak formulation for ODEs as well as for other types of dynamical systems.

The intuition behind the weak formulation is that it projects the solution of a dynamical system to a finite-dimensional vector (i.e., to a set of linear functionals in mathematical language) whose elements are defined by the inner product between the solution and the test functions. A key advantage of the weak formulation is that appropriate test functions for various dynamical systems such as partial and/or stochastic differential equations exist and can be utilized, transforming again the dynamical inference problem to an atemporal/aspatial problem similar to eq. (2). Thus, unification of the structure inference problem for various dynamical systems is achieved. Moreover, the original problem where time is continuous is transformed from an infinite dimensional problem (meaning that eq. (1) should be satisfied for all $t \in [0, T]$) to a finite dimensional one. Additionally, and more importantly from a practical viewpoint, there is no need to numerically estimate the time derivatives of the state variables. Indeed, exploiting the integration by parts formula, it is straightforward to obtain that

$$z_{nm} = x_n(t)\, \phi_m(t) \Big|_0^T - \langle x_n, \dot{\phi}_m \rangle,$$

and since the test functions, $\phi_m$, are explicitly known, their differentiation is exact. In general, differentiation amplifies the noise of a signal, resulting in high variance estimates of the derivatives, deteriorating the performance of any inference approach and necessitating the use of derivative approximation and regularization. Finally, when new trajectories (or time-series) from different initial conditions are obtained, they can be easily incorporated into the formulation by straightforward concatenation. Indeed, if $P$ trajectories, $\{x^{(p)}(t)\}_{p=1}^{P}$, are provided, then, expanding both

$$z_n = \begin{bmatrix} z_n^{(1)} \\ \vdots \\ z_n^{(P)} \end{bmatrix} \in \mathbb{R}^{MP} \qquad \text{and} \qquad \Psi = \begin{bmatrix} \Psi^{(1)} \\ \vdots \\ \Psi^{(P)} \end{bmatrix} \in \mathbb{R}^{MP \times Q},$$

with $z^{(p)}_{nm} = \langle \dot{x}^{(p)}_n, \phi_m \rangle$ and $\Psi^{(p)}_{mq} = \langle \psi_q(x^{(p)}), \phi_m \rangle$, eq. (2) is still valid.

In the weak formulation, the estimation of the integrals (i.e., the inner products between functions) from the temporal data is required. There are two major categories of temporal data that we consider, depending on whether the same object is repeatedly measured or not. For the case of repeated measurements, the same object is measured sequentially over a time interval, hence a time-series is constructed at the sampling points. When the sampling frequency is high enough, the collected time-series can be considered as continuous over time. For repeated measurements and deterministic systems, standard techniques such as the trapezoidal rule or Simpson's rule are utilized for the numerical integration. When measurements are far from each other, interpolation between the time points can be applied.
Additionally, there is no need for equispaced sampling since these methods can handle uneven sampling. For stochastic integrals, numerical integration requires different treatment since the definition of the integrals is different (e.g., the Ito integral); nevertheless, numerical methods also exist for this case [32].

Non-repeated measurements, which we also refer to as time-course data, measure a different object at each time instant. This may happen because the object is destroyed during the measurement process as, for instance, in mass cytometry (see the demonstration examples below), and the same object cannot be measured more than once. However, it is assumed that all the measured objects are drawn from the same distribution. For time-course data, time-series cannot be directly constructed from the data. In order to create time-series from the time-course data, the collocation method [33] in conjunction with a trajectory smoothing penalty [34, 35] is utilized. In the collocation method, a time-series is approximated by a weighted sum of basis functions. Moreover, time-course data from different experiments might be available. Each experiment might perform interventions to some or all variables, hence each variable has its own trajectory per experiment, which resembles the so-called multiple shooting method [36]. Interventional data are important in structure inference because they often reveal critical information about the biological system under study. Details on the collocation method and on how to deal with interventions (e.g., inhibition) using the multiple shooting method can be found in Appendices B and C, respectively.
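To make the construction of eq. (2) concrete, the following minimal Python sketch assembles $z_n$ and $\Psi$ from a single sampled (or collocation-smoothed) trajectory using Fourier-mode test functions, the integration-by-parts identity, and the trapezoidal rule mentioned above. It is an illustrative sketch rather than the authors' released implementation (S1 Code); the function names and the choice of a linear dictionary are our own assumptions.

```python
import numpy as np

def fourier_test_functions(num_pairs, T):
    """Constant plus `num_pairs` sine/cosine pairs on [0, T], together with
    their exact derivatives (test functions are known analytically)."""
    phis = [lambda t: np.ones_like(t)]
    dphis = [lambda t: np.zeros_like(t)]
    for k in range(1, num_pairs + 1):
        w = 2.0 * np.pi * k / T
        phis += [lambda t, w=w: np.sin(w * t), lambda t, w=w: np.cos(w * t)]
        dphis += [lambda t, w=w: w * np.cos(w * t), lambda t, w=w: -w * np.sin(w * t)]
    return phis, dphis

def weak_formulation(t, x, dictionary, phis, dphis):
    """Return Z (N x M, row n holds z_n) and Psi (M x Q) for one trajectory.

    t          : (S,) sampling times on [0, T]
    x          : (S, N) sampled state trajectory
    dictionary : maps the (S, N) trajectory to an (S, Q) matrix of atom values
    """
    psi_x = dictionary(x)                                   # (S, Q)
    N, M, Q = x.shape[1], len(phis), psi_x.shape[1]
    Z, Psi = np.zeros((N, M)), np.zeros((M, Q))
    for m, (phi, dphi) in enumerate(zip(phis, dphis)):
        # z_{nm} = x_n(T) phi_m(T) - x_n(0) phi_m(0) - <x_n, dphi_m>  (integration by parts)
        boundary = x[-1] * phi(t[-1:]) - x[0] * phi(t[:1])
        Z[:, m] = boundary - np.trapz(x * dphi(t)[:, None], t, axis=0)
        # Psi_{mq} = <psi_q(x), phi_m>
        Psi[m, :] = np.trapz(psi_x * phi(t)[:, None], t, axis=0)
    return Z, Psi

# Example with the single-reactant (linear) dictionary psi(x) = x:
# t, x = ...                                   # times and smoothed trajectory
# phis, dphis = fourier_test_functions(20, t[-1])
# Z, Psi = weak_formulation(t, x, lambda x: x, phis, dphis)
```

Multiple trajectories or experiments can then be handled by stacking the resulting $Z$ rows and $\Psi$ blocks vertically, exactly as in the concatenation described above.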
The weak formulation enables us to transform a dynamical system inference problem into the identification of the sparse solution of eq. (2), which belongs to the well-studied field of sparse signal recovery (SSR). In particular, structure inference is transformed into the problem of finding the support of the sparse solution. (For the sake of simplicity, we drop the dependence on $n$ for the rest of the section.) In SSR, the goal is to minimize the $\ell_0$ quasi-norm of the coefficient vector, $a$, given that $\|z - \Psi a\|_2 < \epsilon$, where $\epsilon$ is a predefined tolerance. Performing the minimization is computationally feasible in small scale systems but, in general, it is an intractable problem since the search space grows exponentially; hence, alternative approximation methods have been developed. One type of approximation is based on the so-called convex relaxation approach, where the above optimization is replaced by a convex program [7]. Convex relaxation is appealing because the optimization can be completed in polynomial time using standard software. General conditions under which the convex relaxation program returns the right answer have been presented in the literature [7]. In the presence of noise, additional assumptions are imposed on the strength of the components' coefficients in order to achieve perfect reconstruction with high probability. In Appendix D, we present two commonly used convex relaxation formulations, namely the $\ell_1$ error formulation, where the $\ell_1$ norm is minimized instead of the $\ell_0$ norm, and the Lasso, where an $\ell_1$ norm regularization term is added to the quadratic cost functional. Another family of techniques that solves the SSR problem is greedy algorithms such as matching pursuit [37] and orthogonal matching pursuit (OMP) [24, 25, 26]. For these greedy algorithms, the correct support of the signal is recovered under suitable assumptions which are similar to the assumptions of convex relaxation. Details on OMP, as well as on known theoretical results that assert under which conditions these algorithms infer the true sparse representation, are provided in Appendix D.

In order for any sparse signal identification algorithm to achieve perfect reconstruction, both the degree of collinearity among the columns of $\Psi$ and the signal-to-noise ratio have to be properly controlled. The mutual incoherence parameter (MIP), first introduced in [27], is defined by

$$\mu(\Psi) := \max_{1 \le q, q' \le Q,\; q \ne q'} \frac{|\psi_q^T \psi_{q'}|}{\|\psi_q\|\, \|\psi_{q'}\|} \tag{3}$$

and is a measure of similarity between the columns of $\Psi$. In the extreme cases, $\mu(\Psi) = 0$ when the matrix is orthogonal, while $\mu(\Psi) = 1$ when, for instance, two columns of the matrix are collinear. In Appendix D, theorems that determine under which conditions on MIP the SSR algorithms are guaranteed to return the correct solution are presented. However, MIP can be unnecessarily conservative because it penalizes the correlation of features that may not participate in the solution. Indeed, if two columns of $\Psi$ are highly similar (i.e., collinear or strongly dependent) but not part of the solution, then MIP is close to one, yet the SSR algorithms are still expected to estimate the right solution. This can be circumvented with the exact recovery coefficient (ERC), which does not measure the collinearity between two columns in general but measures only the collinearity of the subspace defined by the solution with respect to each column not in the solution.
The definition of ERC, given a set of indices $\mathcal{S} \subset \{1, \ldots, Q\}$, is [38]

$$\mathrm{ERC}(\mathcal{S}) := 1 - \max_{q' \notin \mathcal{S}} \big\| (\bar{\Psi}_\mathcal{S}^T \bar{\Psi}_\mathcal{S})^{-1} \bar{\Psi}_\mathcal{S}^T \bar{\psi}_{q'} \big\|_1 \tag{4}$$

where $\bar{\psi}_q := \psi_q / \|\psi_q\|$ are the normalized dictionary atoms (i.e., normalized columns of $\Psi$), while $\bar{\Psi}_\mathcal{S}$ is the matrix that contains only the columns of $\bar{\Psi} := [\bar{\psi}_1 | \ldots | \bar{\psi}_Q]$ that are indexed by the set $\mathcal{S}$. Letting $\mathcal{T}$ be the true index set (i.e., the support of the true signal, or, in our case, the true structure of the dynamical system), a necessary condition for the convex relaxation algorithms or for OMP to correctly solve SSR is $\mathrm{ERC}(\mathcal{T}) > 0$. The difficulty in estimating ERC arises from the fact that the true index set, $\mathcal{T}$, is not known a priori. Nevertheless, it can be used as an a posteriori indicator of accuracy. Under noise, the OMP algorithm with the standard stopping criterion returns the true solution if it additionally holds that [39]

$$|a_q| \ge \frac{1}{\mathrm{SNR}(q)\, \mathrm{ERC}(\mathcal{T})\, \lambda_{\min}(\mathcal{T})}, \qquad \text{for all } q = 1, \ldots, Q, \tag{5}$$

where $\mathrm{SNR}(q) = \|\psi_q\| / \|e\|$ is the signal-to-noise ratio between the $q$-th dictionary atom and the error/noise vector $e := z - \Psi a$, while $\lambda_{\min}(\mathcal{T})$ is the smallest eigenvalue of the matrix $\bar{\Psi}_\mathcal{T}^T \bar{\Psi}_\mathcal{T}$. Overall, monitoring these quantities is extremely informative for determining the quality and the confidence of the learned structure.

Remark 1:
There is a difference between our sparse identification problem and typical SSR. In standard SSR, the number of dictionary atoms (columns of $\Psi$) is usually larger than the number of measurements (rows of $\Psi$), while the opposite is true here since multiple experiments and multiple interventions are typically performed and measured.
Remark 2:
SSR algorithms have one hyperparameter that needs to be fine-tuned. We incorporate a selection process using the F1 score, which is the harmonic mean of precision and recall, as a performance metric. We always select the value that maximizes the F1 score.
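As a minimal illustration of this selection process, the sketch below scores each candidate hyperparameter value by the F1 score of the recovered support. It assumes that a reference support is available (the simulated ground truth in the benchmark examples, or a reference network such as KEGG for the real data); the helper names are hypothetical.

```python
import numpy as np

def f1_of_support(true_support, est_support):
    """F1 score (harmonic mean of precision and recall) between two index sets."""
    true_support, est_support = set(true_support), set(est_support)
    tp = len(true_support & est_support)
    if tp == 0:
        return 0.0
    precision = tp / len(est_support)
    recall = tp / len(true_support)
    return 2.0 * precision * recall / (precision + recall)

def select_hyperparameter(candidates, solve_ssr, true_support):
    """Pick the hyperparameter whose recovered support maximizes the F1 score.
    `solve_ssr(value)` is assumed to return the estimated coefficient vector."""
    scores = [f1_of_support(true_support, np.flatnonzero(solve_ssr(v))) for v in candidates]
    return candidates[int(np.argmax(scores))]
```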
Remark 3:
The restricted isometry property [40] is, like MIP, an a priori metric that ensures perfect reconstruction; however, it suffers from the same problem of being too conservative. Additionally, it is computationally expensive, thus we choose not to present it in detail.
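Since both incoherence metrics are computed repeatedly in the examples below, a small sketch of how they can be evaluated from the dictionary matrix $\Psi$ is given here. It follows eqs. (3) and (4) directly; the helper names are of our choosing and the code is illustrative rather than the authors' implementation.

```python
import numpy as np

def _normalized_columns(Psi):
    """Normalize every dictionary atom (column of Psi) to unit Euclidean norm."""
    return Psi / np.linalg.norm(Psi, axis=0, keepdims=True)

def mutual_incoherence(Psi):
    """MIP, eq. (3): largest absolute cosine between two distinct columns of Psi."""
    Psi_bar = _normalized_columns(Psi)
    G = np.abs(Psi_bar.T @ Psi_bar)
    np.fill_diagonal(G, 0.0)
    return G.max()

def exact_recovery_coefficient(Psi, support):
    """ERC(S), eq. (4), for a candidate support S (iterable of column indices)."""
    Psi_bar = _normalized_columns(Psi)
    S = sorted(set(support))
    off = [q for q in range(Psi.shape[1]) if q not in S]
    if not off:
        return 1.0
    # l1 norms of the least-squares coefficients of every off-support atom
    # regressed on the (normalized) support atoms
    coeffs, *_ = np.linalg.lstsq(Psi_bar[:, S], Psi_bar[:, off], rcond=None)
    return 1.0 - np.abs(coeffs).sum(axis=0).max()
```

In practice, MIP can be reported before any inference is performed, while ERC is evaluated a posteriori on the estimated support, exactly as done in Algorithm 1 below.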
Before proceeding with the demonstration examples, a coarse summary of the proposed dynamical system inference algorithm is presented below as pseudo-code.
Algorithm 1
USDL - Unified Sparse Dynamics Learning

Input: Time-series or time-course measurements, the dictionary ψ(x), and a set of test functions {φ_m}.
  if time-course measurements then
    Apply the collocation method.            ▷ Time-series interpolation
  end if
  Compute Ψ and z_n, n = 1, ..., N.          ▷ Weak formulation
  Estimate MIP from Ψ.
  for n = 1, ..., N do                       ▷ For each row of A
    â_n = SSR(z_n, Ψ)                        ▷ Solve SSR problem
    Estimate ERC(T̂_n).
  end for
Output: Â, MIP and ERC.

The inference capabilities of the proposed approach are presented for several classes of dynamical systems. Source code that produces the figures is available (S1 Code). Among the algorithms that are capable of solving SSR, we adopt OMP for the following reasons. First, it is computationally more efficient since a forward selection of the components is performed. Second, the theoretical justifications of the various methods are very similar, making the algorithms almost equivalent. Third, the hyperparameter of OMP is more intuitive compared, for instance, with Lasso, since it is the energy of the noise term (i.e., $\|e\|_2$). Indeed, we approximate the noise energy as $(1 + \alpha)$ times the $\ell_2$-norm of the residual between the signal and the complete least squares solution, with $\alpha$ usually being a small positive number. OMP stops when the relative residual energy becomes smaller than $\alpha$; therefore, OMP returns sparser solutions as we increase the value of $\alpha$ (see Appendix D for more details; a short illustrative sketch of this stopping rule is also given after Remark 4 below). Lastly, it is straightforward to incorporate prior knowledge by adding any known contribution to the dynamics; hence, it is straightforward to include data with one or more interventions with known effects.

The first demonstration is a simulation of mass cytometry measurements with a three-species prototypical protein interaction network with cycles. The complete biochemical reaction network is given in Appendix E and it has been simulated with a standard ODE solver. The ground truth of interactions is shown in Fig. 1(a), where an arrow means that the source variable up-regulates the target variable while a vertical bar means that the source variable down-regulates the target variable. The experimental setup assumes that P1, P2 and P3 are measured at specific time points and every sample is destroyed during the measurement (see dots in Fig. 1(b)). In order to apply the weak formulation, time-series have to be constructed; thus, we first apply the collocation method with smoothing penalty. Dashed curves in Fig. 1(b) correspond to the estimated time-series. If, additionally, multiple experimental interventions are performed, the multiple shooting method is applied and one time-series per experiment is obtained.

Proceeding, despite the fact that the complete biochemical reaction network is nonlinear with respect to the state variables, the assumed model for inference is linear, $\dot{x} = Ax$, where the vector $x(t) \in \mathbb{R}^3$ contains the abundance of P1–P3 at time instant $t$. The connectivity matrix, $A$, encodes the direct causal interactions within the set of state variables. Indeed, if element $a_{nq}$ is zero then no direct causal interaction exists from species $x_q$ to $x_n$. If $a_{nq}$ is positive then an increase of variable $x_q$ implies an increase of the rate of $x_n$, which results in increasing the concentration of $x_n$; thus $x_q$ activates or up-regulates $x_n$ and this is denoted with an arrow from $x_q$ to $x_n$. On the contrary, if $a_{nq}$ is negative then $x_q$ inhibits or down-regulates $x_n$ and this is denoted with a vertical bar. Thus, the structure of the network can be induced from the matrix $A$. Indeed, both the strength and the type of an interaction are inferred from the absolute value and the sign of the corresponding element of $A$, respectively. Furthermore, it is noteworthy that several sources of error exist in this benchmark example. First, there is measurement error related to the machine limitations and it is assumed to be additive.
Second, there is uncertainty error due to the fact that each measurement comes from a different cell and each cell has different concentrations of the measured quantities, as Fig. 1(b) demonstrates. Additional sources of error stem from the facts that (i) the complete reaction network is nonlinear in the state variables while the assumed model is not (as explained in Appendix E, the complete reaction network is unfortunately unidentifiable, making it an improper example for testing the learning algorithms), and (ii) not all species are measured, since the compound P1P3 is not quantified, resulting in the existence of latent confounding variables.

Figure 2: Network reconstruction of protein interactions from temporal mass cytometry data. (a) Time-course measurements (dots) and the estimated smoothed trajectories (solid lines). The collocation method in conjunction with a smoothing penalty was used for the estimation of the time-series. Observe the high level of stochasticity of the time-course mass cytometry data. (b) The reconstructed subnetwork with four proteins using the USDL algorithm. Bold arrows (true positives) indicate that the true network of interactions is inferred. (c) Similar to (b) for SINDy. (d) The reconstructed network with eight proteins, where non-bold arrows correspond to false positives while dotted arrows correspond to false negatives. Dynamics for the additional proteins vary less over time, as is evident from the lower panel of (a). Nevertheless, most of the interactions are directly (such as Akt → Rb or Erk → S6) or indirectly (like Akt → Erk → S6 instead of Akt → S6) inferred. (e) Similar to (d) for SINDy.
Fig. 1(c) presents the MIP as a function of the number of experiments (upper plot) as well as the ERC for each state variable (lower plot). Even though MIP drops as the number of experiments is increased, the decrease is not significant, making the correct inference of the interaction network a priori less certain. ERC, which is an a posteriori metric, asserts that the problematic variable is P2 when only one experiment is fed to the inference algorithm (dashed lines in the lower plot of Fig. 1(c)), while ERC is positive when all five interventions are used for the inference, making one crucial assumption for perfect reconstruction true. However, this is not always enough, as shown by the performance of the algorithms when the measurement noise is doubled (red solid lines in Fig. 1(c)&(d)).

Indeed, the precision-recall curves of Fig. 1(d) assert that the network is partially reconstructed when USDL (solid lines) is fed with data from only one intervention. In contrast, the true network is reconstructed when all five interventions are taken into account. Note that 41 Fourier modes, which correspond to a constant function, 20 sines and 20 cosines, were used as test functions in the USDL algorithm. Additionally, Fig. 1(d) presents the reconstruction accuracy for the SINDy algorithm (dashed lines). The hyperparameter value for both approaches is optimally selected by maximizing the F1 score, which is shown in Figure 1(a) of Appendix E. When one intervention is used, the performance of SINDy is slightly better than USDL; however, SINDy does not improve its accuracy as the number of experiments increases. Evidently, SINDy is not capable of handling datasets that have multiple complex interventions. Moreover, we evaluate the forecast capabilities of the inferred model on new experiments. Prediction accuracy (Figure 1(b) in Appendix E) is in accordance with the precision-recall curves (i.e., Fig. 1(d)) in all cases, revealing once again that the problematic variable is P2. Overall, this demonstration example shows the necessity of designing and executing several experiments for guaranteed perfect reconstruction of a network from non-repeated time-course measurements. In Appendix E, we present additional case studies on the performance of the proposed approach when the number of sampling points is reduced as well as when different weights for the smoothing penalty in the collocation method are applied.

Remark 4:
For small systems, a brute force alternative is tractable. A complete search of all possible solutions when the non-zero components are fewer than ten is computationally feasible for dictionary sizes up to twenty atoms. However, such an approach will provide little or no information on how to design a new experiment or a new data acquisition policy, compared with greedy algorithms or convex relaxation methods where metrics such as MIP and ERC can guide the experimental designer.
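Following up on the forward reference above, here is a minimal Python sketch of OMP with the residual-energy stopping rule described in the introduction of this section, where the noise level is approximated by $(1 + \alpha)$ times the residual norm of the complete least-squares fit. It reflects one plausible reading of that rule under stated assumptions and is not the authors' released implementation; the function and argument names are ours.

```python
import numpy as np

def omp(z, Psi, alpha=0.05, max_atoms=None):
    """Orthogonal Matching Pursuit with a residual-energy stopping rule.

    The tolerance is set to (1 + alpha) times the residual norm of the
    complete least-squares solution; larger alpha therefore yields sparser
    solutions, as discussed in the text."""
    M, Q = Psi.shape
    full_ls = np.linalg.lstsq(Psi, z, rcond=None)[0]
    eps = (1.0 + alpha) * np.linalg.norm(z - Psi @ full_ls)
    max_atoms = max_atoms or min(M, Q)
    support, coef, r = [], np.zeros(0), z.copy()
    col_norms = np.linalg.norm(Psi, axis=0)
    while np.linalg.norm(r) > eps and len(support) < max_atoms:
        # greedy step: atom most correlated with the current residual
        corr = np.abs(Psi.T @ r) / col_norms
        corr[support] = -np.inf
        support.append(int(np.argmax(corr)))
        # orthogonal step: refit all selected atoms jointly
        coef, *_ = np.linalg.lstsq(Psi[:, support], z, rcond=None)
        r = z - Psi[:, support] @ coef
    a = np.zeros(Q)
    a[support] = coef
    return a, support
```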
The second demonstration is the inference of protein interactions from publicly available mass cytometry data [41]. Single-cell analysis, and particularly mass cytometry, widely opens new directions for understanding cellular responses to perturbations and cellular functionalities due to the capability of measuring tens of proteins in each cell. Moreover, it can be multiplexed, resulting in studying the cells under different conditions and time points in a relatively cheap and fast way [42, 41]. Given the high resolution of single-cell analysis, it is expected to become a standard technique in the medical sciences in the near future. In [41], thirteen time points are sampled and measurements are separated into three subpopulations, namely, CD4+, CD8+ and Effector/Memory. Two activation cocktails, which stimulate the receptors CD3/CD28 and CD3/CD28/CD4, respectively, were applied. Each experiment was repeated twice with different activation levels. Reconstructing the signaling pathway upon activation is a non-trivial task because few proteins inside the cells are measured and, on top of that, many interfering mechanisms with different rates are also occurring. Both result in a large number of latent confounding factors. Thus, it is very hard to directly reconstruct the complete system of interactions. However, network reconstruction would be more successful if restricted to subnetworks.

We perform network inference for two subnetworks, with the first being a cascade of the CD3z, SLP76, Erk and S6 proteins, while the second is enriched with MAPKAPKII, Creb, Akt and Rb. Fig. 2(a) presents the trajectories of the proteins estimated from the mass cytometry data using the collocation method with smoothing penalty. The multiple shooting method is utilized for each subpopulation and each experiment. We note that the collocation method assumes that the overall measurement noise is Gaussian; however, we observed that the noise in the mass cytometry data is sometimes skewed and/or multimodal, potentially deteriorating the quality of the estimated trajectories. Furthermore, time-series adjustment is performed by subtracting the minimum value of each trajectory, which merely corresponds to the state of no activity. As is evident from the figure, the level of stochasticity is high, making the dense sampling of the signaling phenomenon necessary.

Figs. 2(b)&(c) present the reconstruction of the smaller subnetwork for the USDL and SINDy algorithms, respectively, while Figs. 2(d)&(e) present the reconstruction results for the larger subnetwork. We consider a linear ODE model ($\dot{x} = Ax$), excluding more complicated protein interactions. The USDL algorithm utilizes 21 Fourier modes, while the hyperparameter for each approach is fine-tuned based on the F1 score estimated on an independent subset of the data (see Figure 10 in Appendix E) using the KEGG database [43] as ground truth. The semantics of arrow and bar edges are the same as in the previous example. An edge is bold when it is also found in the KEGG database, regular if it is found with USDL (or SINDy) but not found in the KEGG database, and dotted if it is in the KEGG database but not found. Inference is repeated 100 times using a portion of the available data in each iteration and the reported edges are the ones that are found in at least 80% of the runs. Concentrating on the case with four proteins, the subnetwork is correctly reconstructed with the USDL algorithm, while SINDy infers two additional edges.
The bar edges in Fig. 2(b), which imply down-regulation, are explained as a mechanism to model the degradation of each variable over time. When four more proteins are added, the network reconstruction becomes harder. Nevertheless, USDL using the CD4+ subpopulation and the CD3/CD28 activator (see Fig. 2(d)) was able to recover half of the known edges. The cascade Slp76 → Erk → S6 is still correctly inferred. However, CD3z was replaced by Akt in the phosphorylation of Slp76. Additionally, the proposed algorithm correctly assesses the influence of Akt on the phosphorylation of both Rb and Creb, showing that Akt plays a central role in the pathway signaling of T cells. In contrast, SINDy with the optimal hyperparameter returns an almost fully-connected graph, resulting in a non-sparse solution.

The difficulty in inferring the KEGG-based network of protein interactions is reflected in both MIP and ERC (reported in Appendix E). Both incoherency metrics deteriorate when more proteins are added to the analysis, making the assumptions for perfect reconstruction less valid. Thus, more experiments are required in order to improve the accuracy of the network inference algorithm. These experiments can be guided by metrics such as MIP or ERC. Additional demonstrations can be found in Appendix E. Particularly, the reconstructed networks when additional activators and/or subpopulations are considered are presented, as well as how important the high sampling rate is for protein interaction inference from mass cytometry data. For instance, the reconstruction accuracy is severely reduced when half of the time-points are removed, as is shown in Appendix E. Finally, we have integrated the USDL algorithm in SCENERY (http://scenery.csd.uoc.gr/) [44], which is a web tool for single-cell cytometry analysis. SCENERY provides a comprehensive and easy-to-use graphical user interface where users may upload their data and perform various types of protein network reconstruction. The incorporation of the USDL algorithm into SCENERY aims to increase its reusability and, hopefully, its popularity.

The last example is a multidimensional Ornstein-Uhlenbeck process, which is a system of linear stochastic differential equations with additive noise. It has applications in evolutionary biology, where multiple traits are modelled over time with a multidimensional Ornstein-Uhlenbeck process [45], as well as in particle physics [46] and finance [47]. Mathematically, the driving system of equations is given by

$$\dot{x} = -Ax + \sigma \dot{B}, \tag{6}$$

where $x(t) \in \mathbb{R}^N$ is the stochastic process, the connectivity matrix $A \in \mathbb{R}^{N \times N}$ determines the interactions between the variables, $\sigma \in \mathbb{R}$ corresponds to the noise level, while $B(t) \in \mathbb{R}^N$ is an $N$-dimensional standard Brownian motion. Intuitively, the derivative of a Brownian motion can be understood as a continuous-time zero-mean white noise with variance one. We set $N = 20$, while the matrix $A$ is defined as the graph Laplacian with random edges and maximum outgoing degree equal to 3.

Figure 3: Performance analysis and comparison between the USDL and SINDy algorithms for the Ornstein-Uhlenbeck stochastic process. (a) The connectivity graph for each variable of the Ornstein-Uhlenbeck process. The edges, their direction as well as the type of interaction are determined by the non-zero elements of the connectivity matrix A. (b) Precision and recall are shown as functions of the number of measured time-series in two different regimes: stationary (blue) and transient (red).
Both USDL (solid) and SINDy (dashed) algorithms achieve perfect reconstruction of the dynamical system in the transient regime when enough time-series are measured. For the stationary regime, perfect reconstruction is possible only for USDL and a special type of test functions (peaky Fourier modes), while SINDy (blue dashed) fails to completely recover the dynamical system in this regime due to the high stochasticity.

Fig. 3(a) presents the graph of interactions for each variable of a randomly-drawn instance of $A$. We distinguish between two regimes, namely, the stationary (or equilibrium) regime and the transient regime. At stationarity, the driving force is primarily the stochastic or diffusion term (second summand in the r.h.s. of eq. (6)), with the deterministic or drift term (first summand in the r.h.s. of eq. (6)) acting as a stabilizer. In the transient regime, the dynamics are primarily driven by the drift term. This separation is of great importance because the signal in the former case is buried under the noise (i.e., both have approximately the same energy or, in other words, the signal-to-noise ratio is approximately one), while the signal is stronger compared to the stochastic term in the transient regime (see the simulated trajectories for both regimes in Appendix E, Figure 16(a)). As the performance measures in Fig. 3(b) show, both USDL and SINDy are able to infer the correct structure of $A$ (i.e., the correct graph of interactions) in the transient regime (red curves) when enough time-series (approximately P = 100) are provided. We remark here that, as in the previous examples, the hyperparameter values of both algorithms are selected based on the maximum F1 score. The averaged ERC is positive for this setup, as an inset plot reveals, and the signal-to-noise ratio is high enough to theoretically guarantee the perfect reconstruction of the dynamical system.

In the stationary regime, the driving force is the noise, and a positive (averaged) ERC is not enough to guarantee perfect reconstruction; control of the signal-to-noise ratio is also required for true structure learning. We tested a series of typical test functions such as Fourier modes and splines and we also varied the number of test functions; however, we were not able to perfectly reconstruct the dynamical system, with the accuracy hitting a plateau (figures in Appendix E), despite the fact that theoretical results [16] suggest that true recovery is possible. The problematic cases arise from variables whose time cross-correlations have similar shape and are close to each other; thus, we need to define another type of test functions with the property of having sharp changes, which assists the separation between small time-differences. A well-educated choice of test functions, which we named peaky Fourier modes (details in Appendix E), results in perfect reconstruction of the connectivity matrix in the stationary regime when USDL is applied (blue solid lines in Fig. 3(b)). In contrast, the SINDy algorithm (dashed blue lines) is incapable of inferring the structure of the dynamical system because it is not designed for stochastic systems and thus it is unable to handle the high intensity of the noise observed in the stationary regime. Moreover, we measured the root mean squared error (RMSE) on the elements of the connectivity matrix $A$ to assess the parameter estimation behavior of the inference approaches.
RMSE results (Figure 17(b) in Appendix E) reveal that a two orders of magnitude lower RMSE is reported in the transient regime relative to the stationary regime. In the same figure, it is shown that the RMSE for USDL is lower than the RMSE for SINDy, which is in accordance with the structure inference performance (i.e., the precision-recall curves).

The weak formulation enables the transformation of any spatio-temporal dynamical system that is linear with respect to its parameters into a linear system of equations. The unification through the weak formulation creates the foundations for general dynamical system inference in biological applications. For instance, in mass cytometry, and more generally in single-cell analysis, not trajectories but the distribution of the species populations is measured over time. Thus, structure learning from partial differential equations such as the Fokker-Planck or the master equation [46], which both describe the evolution of the probability distribution of the measured quantities, can be transformed into a linear set of equations. Moreover, the avoidance of differentiation through the integration-by-parts trick could benefit already existing dynamical inference algorithms, especially in adverse, noisy conditions. Additionally, the transformed structure learning problem can be considered not only as an SSR problem but also as a feature selection problem [48], a subfield of machine learning and statistics. The extensive body of work on feature selection could also be employed and, therefore, boost the accuracy of the overall inference.

The SSR literature offers an arsenal of theoretical indicators and metrics which, as we showed, correlate well with the performance as quantified by the precision and recall curves. Even though there is a growing research area of dynamical system inference algorithms, a limited number of attempts to compute and exploit such metrics can be found in biological studies. The presented examples revealed that the values of these metrics can easily lie far from the theoretically desirable. For instance, MIP took values closer to one rather than to zero in the mass cytometry example, necessitating the design of additional experiments or the elimination of some problematic proteins from the dictionary. Actually, the determination of the dictionary is crucial in biological inverse problem inference. Quantities that are constant over time can imperil the accuracy of a structure learning algorithm because of the addition of collinearities, especially when quadratic terms are considered in the dictionary. Thus, it is preferable to remove some or all of the constant-over-time variables from the dictionary and attempt to infer the structure of the quantities that are time-varying. Both incoherence metrics can serve as a guideline for the construction of a dictionary with high potential for true recovery.

Dynamics in biological processes contain critical information about the underlying reaction mechanisms between molecules. Current technologies are able to measure several time-points, increasing the possibility of inducing the interactions between the measured quantities. However, the shape characteristics of biological dynamics are usually simple. Prominent examples are impulsive patterns, which are either up-regulating or down-regulating excitations followed by a return to their basis values, and sustained patterns, where the measured quantity remains over-expressed or under-expressed after the excitation [49].
A cascade of four impulsive responses for protein signaling whose interactions were correctly inferred is shown in the upper plot of Fig. 2(a). Thus, a dynamical system that can potentially be identified correctly from relatively simple trajectories should not have complex driving forces. This is the primary reason why in our experiments we chose a linear dictionary. Adding more complex interactions to the dictionary will only result in less identifiable inference problems.

Finally, the presented examples assumed a dynamical model that is linear with respect to the state variables and, thus, a linear dictionary was constructed. However, the proposed inference algorithm is not restricted to linear differential equations. In Appendix F, we apply the proposed dynamical inference algorithm to a nonlinear and chaotic system from climate science, namely Lorenz96 [29], where comparisons with SINDy are also performed. The precision-recall results indicate that both methods are able to achieve perfect reconstruction. However, SINDy requires 3-4 times less data in order to succeed for the case of moderate chaotic behaviour. Generally, chaotic systems enjoy richer and more complex dynamics, which actually assists the structural learning of the differential equations, as both the incoherence metrics and the obtained results on the accuracy revealed.
In this paper, we present the USDL algorithm, a generic and unified approach to solve the sparse dynamical structure inference problem from temporal data. It is based on the weak formulation of differential equations, where the dimension of time is eliminated. Several properties of the weak formulation, such as being derivative-free, are useful and have high practical value. The transformed system is a set of linear equations whose sparse solution can be found using SSR algorithms. Convex relaxation methods as well as greedy algorithms such as OMP can be used, and theoretical guarantees can be computed. To this end, a priori metrics such as MIP and a posteriori metrics such as ERC are computed and the satisfiability of the assumptions is checked. These metrics might also serve as candidates for optimization in experimental design. Moreover, various dynamical models can be transformed with the weak formulation to an SSR problem, making our approach model independent. We test and compare USDL against SINDy on a wide range of dynamical models and show that, under high stochasticity, USDL achieves perfect reconstruction given enough data while SINDy fails for the same amount of data. This is notably evident for the stationary OU process, where noise is prevalent, revealing the generality and robustness of the proposed approach.
Acknowledgements
We sincerely thank George Papoutsoglou for collecting and preprocessing the mass cytometry data and reading early drafts of the manuscript.
Funding
The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. 617393.
References

[1] M. E. J. Newman. Networks: An Introduction. Oxford University Press, pages 163–186, 2014.
[2] R. Bonneau, D. J. Reiss, P. Shannon, M. Facciotti, L. Hood, N. S. Baliga, and V. Thorsson. The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome Biology, 7(5):R36, 2006.
[3] P. Gennemark and D. Wedelin. ODEion: a software module for structural identification of ordinary differential equations. Journal of Bioinformatics and Computational Biology, 12(01):1350015, 2014.
[4] Steven L. Brunton, Joshua L. Proctor, and J. Nathan Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences of the United States of America, 113(15):3932–3937, 2016.
[5] Niall M. Mangan, Steven L. Brunton, Joshua L. Proctor, and J. Nathan Kutz. Inferring biological networks by sparse identification of nonlinear dynamics. IEEE Transactions on Molecular, Biological and Multi-Scale Communications, 2(1):52–63, 2016.
[6] Anna Klimovskaia, Stefan Ganscha, and Manfred Claassen. Sparse regression based structure learning of stochastic reaction networks from single cell snapshot time series. PLOS Computational Biology, 12(12):e1005234, 2016.
[7] Joel A. Tropp. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051, 2006.
[8] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
[9] Mika Gustafsson, Michael Hörnquist, Jesper Lundström, Johan Björkegren, and Jesper Tegnér. Reverse engineering of gene networks with LASSO and nonlinear basis functions. Annals of the New York Academy of Sciences, 1158:265–275, 2009.
[10] Elias August and Antonis Papachristodoulou. Efficient, sparse biological network determination. BMC Systems Biology, 3(1):25, 2009.
[11] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
[12] Camille Charbonnier, Julien Chiquet, and Christophe Ambroise. Weighted-LASSO for structured network inference from time course data. Statistical Applications in Genetics and Molecular Biology, 9(1):Article 15, 2010.
[13] Andrew Bolstad, Barry D. Van Veen, and Robert Nowak. Causal network inference via group sparse regularization. IEEE Transactions on Signal Processing, 59(6):2628–2641, 2011.
[14] K. J. Friston, L. Harrison, and W. Penny. Dynamic causal modelling. NeuroImage, 19(4):1273–1302, 2003.
[15] B. C. Daniels and I. Nemenman. Automated adaptive inference of phenomenological dynamical models. Nature Communications, 6:8, 2015.
[16] José Bento, Morteza Ibrahimi, and Andrea Montanari. Learning networks of stochastic differential equations. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 172–180. Curran Associates, Inc., 2010.
[17] Joseph J. DiStefano. Dynamic Systems Biology Modeling and Simulation. Academic Press, 2015.
[18] Gilbert Strang and George J. Fix. An Analysis of the Finite Element Method. Wellesley-Cambridge Press, 2nd edition, 2008.
[19] M. E. Davis. Numerical Methods and Modeling for Chemical Engineers. John Wiley & Sons, 1984.
[20] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.
[21] Alfred M. Bruckstein, David L. Donoho, and Michael Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34–81, 2009.
[22] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
[23] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Applied and Numerical Harmonic Analysis. Springer, New York, NY, 2013.
[24] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Asilomar Conference on Signals, Systems and Computers, pages 40–44. IEEE Comput. Soc. Press, 1993.
[25] G. Davis, S. Mallat, and M. Avellaneda. Adaptive greedy approximations. Constructive Approximation, 13(1):57–98, 1997.
[26] Joel A. Tropp and Anna C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.
[27] D. L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47(7):2845–2862, 2001.
[28] K. Sachs, O. Perez, D. Pe'er, D. A. Lauffenburger, and G. P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308:523–529, 2005.
[29] Edward N. Lorenz. Predictability: a problem partly solved. In Seminar on Predictability, volume 1, pages 1–18, Shinfield Park, Reading, 1996. ECMWF.
[30] Lawrence C. Evans. Partial Differential Equations, volume 19. American Mathematical Society, 1998.
[31] Gábor Lente. Deterministic Kinetics in Chemistry and Systems Biology. SpringerBriefs in Molecular Science. Springer International Publishing, 2015.
[32] B. Oksendal. Stochastic Differential Equations: An Introduction with Applications. Springer-Verlag, 1985.
[33] J. O. Ramsay, G. Hooker, D. Campbell, and J. Cao. Parameter estimation for differential equations: a generalized smoothing approach. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 69(5):741–796, 2007.
[34] Choujun Zhan and Lam F. Yeung. Parameter estimation in systems biology models using spline approximation. BMC Systems Biology, 5(1):14, 2011.
[35] Peter Craven and Grace Wahba. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31(4):377–403, 1978.
[36] M. Peifer and J. Timmer. Parameter estimation in ordinary differential equations for biochemical processes using the method of multiple shooting. IET Systems Biology, 1(2):78–88, 2007.
[37] S. G. Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.
[38] J. A. Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.
[39] T. Tony Cai and Lie Wang. Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Transactions on Information Theory, 57(7):4680–4688, 2011.
[40] E. J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
[41] Smita Krishnaswamy, Matthew H. Spitzer, Michael Mingueneau, Sean C. Bendall, Oren Litvin, Erica Stone, Dana Pe'er, and Garry P. Nolan. Conditional density-based analysis of T cell signaling in single-cell data. Science, 346(6213), 2014.
[42] Bernd Bodenmiller, Eli R. Zunder, Rachel Finck, Tiffany J. Chen, Erica S. Savig, Robert V. Bruggner, Erin F. Simonds, Sean C. Bendall, Karen Sachs, Peter O. Krutzik, and Garry P. Nolan. Multiplexed mass cytometry profiling of cellular states perturbed by small-molecule regulators. Nature Biotechnology, 30(9):858–867, 2012.
[43] M. Kanehisa and S. Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 28(1):27–30, 2000.
[44] G. Papoutsoglou, G. Athineou, V. Lagani, I. Xanthopoulos, A. Schmidt, S. Élias, J. Tegnér, and I. Tsamardinos. SCENERY: a web application for (causal) network reconstruction from cytometry data. Nucleic Acids Research, 45:W270–W275, 2017.
[45] Krzysztof Bartoszek, Jason Pienaar, Petter Mostad, Staffan Andersson, and Thomas F. Hansen. A phylogenetic comparative method for studying multivariate adaptation. Journal of Theoretical Biology, 314:204–215, 2012.
[46] Crispin Gardiner. Handbook of Stochastic Methods: for Physics, Chemistry & the Natural Sciences. Springer-Verlag, 2004.
[47] Crispin Gardiner. Stochastic Methods: A Handbook for the Natural and Social Sciences, 2009.
[48] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
[49] Ziv Bar-Joseph, Anthony Gitter, and Itamar Simon. Studying and modelling dynamic biological processes using time-series gene expression data. Nature Reviews Genetics, 13(8):552–564, 2012.
[50] Thomas Mikosch. Elementary Stochastic Calculus with Finance in View. World Scientific, 2003.
[51] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.
[52] Zheng Zhang, Yong Xu, Jian Yang, Xuelong Li, and David Zhang. A survey of sparse representation: algorithms and applications. IEEE Access, 3:490–530, 2015.
[53] Robert Tibshirani. Regression selection and shrinkage via the lasso. Journal of the Royal Statistical Society B, 58(1):267–288, 1996.
[54] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
[55] R. N. Gutenkunst, J. J. Waterfall, F. P. Casey, K. S. Brown, C. R. Myers, and J. P. Sethna. Universally sloppy parameter sensitivities in systems biology models. PLOS Computational Biology, 3, 2007.
A Weak formulation for various differential equations
The theory of the weak formulation has been developed in applied mathematics, particularly in the theory as well as in the numerical analysis of partial differential equations (PDEs). However, any dynamical system can be written in a weak form. This section presents the derivation of the weak formulation for ODEs, PDEs, stochastic differential equations (SDEs) and multivariate autoregressive models (MARs). Thus, a wide range of dynamical systems is covered. We also discuss various choices for the test functions. As we see in Appendix E, the choice of test functions can play a critical role in the perfect reconstruction of a dynamical system.
A.1 Ordinary differential equations
As presented in the main text, an ODE system with linear parameters is given in a non-matrix form by
$$\dot{x}_n = \sum_{q=1}^{Q} a_{nq}\, \psi_q(x), \qquad x_n(0) = x_{n,0}, \qquad n = 1, \ldots, N. \qquad (7)$$
Here, $x_n(t)$ is the $n$-th state variable of the system, while $\psi_q(x)$ is the $q$-th candidate function (or atom, or element) of the dictionary.

In order to define the weak formulation, a set of test functions indexed by $m = 1, \ldots, M$ and denoted by $\phi_m(t)$ has to be specified. The number of test functions, $M$, can be infinite, but for practical purposes it is always finite; for the rest of the paper we assume that it is finite. Then, multiplying (7) by $\phi_m(t)$ and integrating from 0 to $T$, we get for each $n$ that
$$\int_0^T \dot{x}_n(t)\, \phi_m(t)\, dt = \int_0^T \sum_{q=1}^{Q} a_{nq}\, \psi_q(x(t))\, \phi_m(t)\, dt, \qquad (8)$$
which is rewritten as
$$\langle \dot{x}_n, \phi_m \rangle = \sum_{q=1}^{Q} a_{nq}\, \langle \psi_q(x), \phi_m \rangle, \qquad (9)$$
where $\langle f, g \rangle = \int_0^T f(t)\, g(t)\, dt$ is the inner product between two functions $f$ and $g$ in the $L^2$ function space. Notice that the infinite number of equations in (7) (one for each time instant $t \in [0, T]$) is transformed into $M$ equations (one for each test function). The above set of equations is an atemporal system of equations whose matrix form is given in eq. (2) of the main text.

Moreover, we assume that the test functions are smooth, having at least a first derivative; hence, the differentiation of the state variables can be avoided using integration by parts. Indeed, it holds for each $n$ that
$$x_n(t)\, \phi_m(t)\, \Big|_0^T - \langle x_n, \dot{\phi}_m \rangle = \sum_{q=1}^{Q} a_{nq}\, \langle \psi_q(x), \phi_m \rangle. \qquad (10)$$
This is a key feature of the weak formulation, both theoretically and practically. From a practical point of view, there is no need to numerically estimate the time derivatives of the state variables. As discussed, differentiation generally amplifies the noise of a signal, resulting in high-variance estimates of the derivatives and deteriorating the performance of any inference approach that requires derivative approximation. The weak formulation avoids differentiating the noisy state variables by exploiting the integration by parts theorem, as shown above.
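To make the construction concrete, the following minimal Python sketch assembles the atemporal quantities of eqs. (9)–(10), namely $\langle \dot{x}_n, \phi_m \rangle$ (evaluated through the boundary term and integration by parts) and the dictionary projections $\langle \psi_q(x), \phi_m \rangle$, using trapezoidal quadrature. The function and argument names are illustrative and not taken from the USDL reference implementation; a family of test functions (discussed in the next subsection) is assumed to be given.

```python
import numpy as np

def weak_form_system(t, x_n, psi_vals, test_fns, test_fns_dot):
    """Assemble z_n and Psi of the atemporal system z_n = Psi a_n (eqs. (9)-(10)).

    t            : (K,) array of time points covering [0, T]
    x_n          : (K,) trajectory of the n-th state variable
    psi_vals     : (K, Q) array; column q holds psi_q(x(t)) along the trajectory
    test_fns     : list of M vectorized callables phi_m(t)
    test_fns_dot : list of M vectorized callables dphi_m/dt(t)
    """
    M, Q = len(test_fns), psi_vals.shape[1]
    z = np.empty(M)
    Psi = np.empty((M, Q))
    for m, (phi, phi_dot) in enumerate(zip(test_fns, test_fns_dot)):
        # boundary term minus <x_n, dphi_m/dt>: no differentiation of the noisy x_n is needed
        z[m] = x_n[-1] * phi(t[-1]) - x_n[0] * phi(t[0]) - np.trapz(x_n * phi_dot(t), t)
        # <psi_q(x), phi_m> for every dictionary atom q
        Psi[m] = np.trapz(psi_vals * phi(t)[:, None], t, axis=0)
    return z, Psi
```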
A.1.1 Test functions

The test functions utilized in the weak formulation are smooth, not necessarily orthogonal, and can be chosen from a huge repository of functions; examples are polynomials, Fourier/Laplace basis functions, B-splines, etc. Furthermore, the number of test functions should be kept to a minimum so as to reduce the computational complexity of the overall optimization problem. The choice of test functions can affect the performance of the inference procedure, hence a careful and educated selection is needed for optimal performance. Of course, the optimal selection of test functions depends on the specific problem at hand, and there is no family of functions that performs optimally in all cases. The most suitable test functions for phenomena that have periodicities are the Fourier modes. A set of $M$ Fourier modes is defined by
$$\phi_1(t) = 1, \qquad \phi_{2m}(t) = \cos\!\left(\frac{2\pi m t}{T}\right) \quad \text{and} \quad \phi_{2m+1}(t) = \sin\!\left(\frac{2\pi m t}{T}\right), \qquad m = 1, \ldots, (M-1)/2, \qquad (11)$$
with $T$ being the final time. In this paper, Fourier modes are used as the default choice of test functions since they form a complete basis of the $L^2$ function space. Another important class of test functions are the B-spline functions. B-splines with or without equally-spaced knots constitute another set of test functions that we explore in our demonstration examples. The advantage of using unequally-spaced knots is that specific areas of interest can be indicated and exploited; for instance, in cellular protein signaling, if the dynamical phenomenon is stronger at the beginning, then more knots are assigned to the early times. Figure 4 demonstrates a set of 12 B-spline functions, both with evenly-spaced knots (upper panel) and with unevenly-spaced knots (lower panel).

A third class of test functions are data-dependent test functions, which can optimally capture the variations of the measured time-series. Similarly to the Karhunen-Loève expansion, spectral analysis of the measurement matrix reveals the principal components that explain the time-series. The principal components with the highest energy (i.e., eigenvalues) are used as test functions; they are numerically computed through the singular value decomposition. We remark that this approach is known as principal component analysis or proper orthogonal decomposition.
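For completeness, here is one way the Fourier test functions of eq. (11), together with the derivatives needed for the integration by parts in eq. (10), could be generated. This is a sketch with illustrative names, assuming $M$ is odd; the resulting lists can be passed directly to the weak-form sketch shown earlier.

```python
import numpy as np

def fourier_test_functions(M, T):
    """Return M Fourier test functions on [0, T] and their time derivatives (constant, cosines, sines)."""
    fns = [lambda t: np.ones_like(np.asarray(t, dtype=float))]
    dfns = [lambda t: np.zeros_like(np.asarray(t, dtype=float))]
    for m in range(1, (M - 1) // 2 + 1):
        w = 2.0 * np.pi * m / T
        fns += [lambda t, w=w: np.cos(w * t), lambda t, w=w: np.sin(w * t)]
        dfns += [lambda t, w=w: -w * np.sin(w * t), lambda t, w=w: w * np.cos(w * t)]
    return fns, dfns
```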
Figure 4: Upper plot: a set of 12 B-splines with evenly-spaced knots in the interval [0, 10]. Lower plot: a set of 12 B-splines with unevenly-spaced knots in the same interval, where more resolution is put on the early times.

A.2 Partial Differential Equations

PDEs constitute another class of dynamical systems where the variables depend not only on time but also on space. One of the simplest and most well-studied PDEs is the heat equation, which in 1D reads $x_t - x_{uu} = 0$ with some initial and boundary conditions. We denote by $x_t = \frac{\partial x}{\partial t}$ and $x_{uu} = \frac{\partial^2 x}{\partial u^2}$ the first-order time derivative and the second-order space derivative, respectively. For our purposes, we consider systems which are linear with respect to the unknown parameters. Restricting to one space dimension and one variable, the equation can be of the general form
$$\sum_{q=1}^{Q} a_q\, \psi_q(x, x_t, x_u, x_{tt}, x_{tu}, x_{uu}, \ldots) = 0, \qquad (12)$$
where the driving functions, $\{\psi_q\}$, depend on the physical or engineering application. Several equations fall into the above general form; to name a few, the transport equation (first order), the heat equation (parabolic), Laplace's equation (elliptic), and wave equations with or without shocks and with or without interactions (hyperbolic PDEs). Additionally, special but general cases of reaction-diffusion equations (Fisher-Kolmogorov equation), chemotaxis, the master equation as well as the Fokker-Planck equation (with the last two belonging to the class of Kolmogorov equations) can be written in the general form given by (12).

Typically, inference (or inverse) problems in PDEs try to estimate the boundaries or some other parameters of the PDE and not the driving forces that govern the system, since the kinetic laws are assumed known. Nevertheless, in this paper we want to show the generality of the weak formulation approach in transforming a PDE into an atemporal system of equations. Interestingly, the weak formulation is best known from its application to finding the solution of PDEs. Proceeding, the test functions are now both time and space dependent and the integration is performed in both dimensions. So, let $\{\phi_m(t, u)\}_{m=1}^{M}$ be a set of $M$ test functions; then, the weak form of (12) is
$$\sum_{q=1}^{Q} a_q\, \langle \psi_q(x, x_t, x_u, x_{tt}, x_{tu}, x_{uu}, \ldots), \phi_m \rangle = 0, \qquad (13)$$
where $\langle f, g \rangle = \int_D \int_0^T f(t, u)\, g(t, u)\, dt\, du$ is the inner product adapted for functions with both time and space arguments and $D$ is the spatial domain of the free variable. Notice that the integration by parts trick can be applied to both time derivatives and space derivatives; thus, the need for numerical estimation of the derivatives is minimized. Moreover, a standard choice for time-space test functions is to assume that time and space are separated: the test functions are defined as $\phi_m(t, u) = \phi_{m,1}(t)\, \phi_{m,2}(u)$, where $\phi_{m,1}$ and $\phi_{m,2}$ are test functions as in the previous subsection.

Finally, we would like to note that our primary intention is to show that the weak formulation is also applicable to PDEs without any difficulty in the derivation of the atemporal system. However, the study of particular PDEs may necessitate special treatment, especially regarding the sampling scheme of the solution as well as the suitable selection of test functions. An extensive discussion of PDE dynamical system inference is beyond the scope of this publication.

A.3 Stochastic Differential Equations
A stochastic process $x(t, \omega)$ is a function of two variables, time $t$ and random outcome $\omega$. For a fixed outcome $\omega$, it is a function of time, $x(t) := x(t, \omega)$, which is called a realization, time-series or trajectory of the process. In the following, we suppress the dependence on the random element and keep only the dependence on time. Proceeding, consider an $N$-dimensional stochastic process $x(t)$ which is driven by an SDE with additive noise,
$$\dot{x}_n = \sum_{q=1}^{Q} a_{nq}\, \psi_q(x) + \sigma \dot{B}_n, \qquad n = 1, \ldots, N, \qquad (14)$$
where $B_n(t)$ are independent Brownian motions for each $n = 1, \ldots, N$. Brownian motion as a function of time is nowhere differentiable, thus its time derivative can be rigorously defined only through the weak formulation. However, on an intuitive level, the time derivative of a Brownian motion is known as white noise, and it is also part of the driving forces of this stochastic dynamical system. The more standard notation for SDEs is based on differentials and is given by the following formula,
$$dx_n(t) = \sum_{q=1}^{Q} a_{nq}\, \psi_q(x(t))\, dt + \sigma\, dB_n(t), \qquad n = 1, \ldots, N. \qquad (15)$$
As before, the weak formulation is obtained by multiplication with a test function and integration. Thus, for $m = 1, \ldots, M$, we get
$$\int_0^T \phi_m(t)\, dx_n(t) = \sum_{q=1}^{Q} a_{nq} \int_0^T \phi_m(t)\, \psi_q(x(t))\, dt + \sigma \int_0^T \phi_m(t)\, dB_n(t), \qquad n = 1, \ldots, N. \qquad (16)$$
However, there is an important difference between the standard Riemann integration used for ODEs and PDEs and the stochastic integrals: the interpretation of the integrals is different, which results in different approaches when numerical estimation is performed. Here, the stochastic integrals are Ito integrals [50], which are defined in a manner similar to Riemann-Stieltjes integrals. Numerically, an Ito integral is approximated by the Riemann sum
$$\int_0^T f(t)\, dx_t \approx \sum_{i=0}^{K-1} f(t_i)\, \big( x(t_{i+1}) - x(t_i) \big), \qquad (17)$$
where $\{t_i\}_{i=0}^{K}$ denotes a partition of the interval $[0, T]$. Notice that there exists a variant of the integration by parts theorem which holds for Ito integrals; however, we did not utilize it in the demonstrated SDE example.
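Numerically, the left-hand side of eq. (16) is approximated exactly as in eq. (17); a left-point Riemann sum such as the following one-liner (a sketch with illustrative names) suffices, and the left-point evaluation is precisely what makes the result an Ito integral rather than, e.g., a Stratonovich one.

```python
import numpy as np

def ito_weak_projection(t, x_n, phi):
    """Left-point Riemann-sum approximation of the Ito integral int_0^T phi(t) dx_n(t), cf. eq. (17)."""
    return np.sum(phi(t[:-1]) * np.diff(x_n))
```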
A.4 Multivariate Autoregressive Model

The weak formulation can also be applied to discrete-time dynamical systems. The approach is similar to the continuous-time case except for the definition of the inner products, where the integration is replaced by summation. We demonstrate the weak form of discrete-time dynamical systems with the multivariate autoregressive (MAR) model, which additionally is stochastic. In MAR models, the current measurement is a linear combination of the previous measurements plus a stochastic white noise (i.e., Gaussian and independent). The mathematical formula for the MAR model is
$$x(t) = \sum_{\tau=1}^{T_0} A(\tau)\, x(t - \tau) + e(t), \qquad t = 0, 1, \ldots, T, \qquad (18)$$
where $x(t) \in \mathbb{R}^N$ is the discrete-time $N$-dimensional MAR process, $A(\tau)$ is the connectivity matrix for the $\tau$-th previous measurement, and $e(t)$ is the driving noise term. $T_0$ determines the order of the MAR model.

The weak formulation for a discrete-time system is defined for a set of test functions $\{\phi_m\}_{m=1}^{M}$ whose domain is the set of natural numbers. The new atemporal system is given by
$$\langle x_n, \phi_m \rangle = \sum_{\tau=1}^{T_0} \sum_{q=1}^{Q} a_{nq}(\tau)\, \langle x_q(t - \tau), \phi_m \rangle + \langle e_n, \phi_m \rangle, \qquad (19)$$
where the inner product is now defined as $\langle f, g \rangle = \sum_{t=0}^{T} f(t)\, g(t)$. Defining the cross-correlation function between two functions as $C_{fg}(\tau) = \sum_{t=0}^{T} f(t)\, g(t + \tau)$, the system of weak-form equations becomes
$$C_{x_n \phi_m}(0) = \sum_{\tau=1}^{T_0} \sum_{q=1}^{Q} a_{nq}(\tau)\, C_{x_q \phi_m}(-\tau) + C_{e_n \phi_m}(0), \qquad (20)$$
which is a linear system of equations. Test functions can be discrete versions of the Fourier modes or of the B-spline functions. Using the Dirac delta functions as test functions (i.e., $\phi_m(t) = \delta(t - m)$), the weak formulation falls back to the "temporal" formulation studied in [13].
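A small sketch of the discrete weak form of eq. (19) follows; the names are illustrative, and lagged values before $t = 0$ are zero-padded here, which is a convenience assumption rather than something stated in the text.

```python
import numpy as np

def mar_weak_form(X, test_fns, order):
    """Assemble the discrete weak-form quantities of eq. (19) for all variables at once.

    X        : (T+1, N) array with the measurements x(0), ..., x(T)
    test_fns : list of M vectorized callables defined on the integer grid {0, ..., T}
    order    : number of lags tau = 1, ..., order
    """
    grid = np.arange(X.shape[0])
    N = X.shape[1]
    lhs = np.empty((len(test_fns), N))            # <x_n, phi_m> for every n
    rhs = np.empty((len(test_fns), order * N))    # <x_q(t - tau), phi_m> for every lag and variable
    for m, phi in enumerate(test_fns):
        w = phi(grid)
        lhs[m] = X.T @ w
        cols = []
        for tau in range(1, order + 1):
            X_lag = np.vstack([np.zeros((tau, N)), X[:-tau]])   # x(t - tau), zero before t = 0
            cols.append(X_lag.T @ w)
        rhs[m] = np.concatenate(cols)
    return lhs, rhs
```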
B Time-series Data Fitting a.k.a. Collocation Method

A sample of a time-course dataset (i.e., a non-repeated measurement) is denoted as
$$y_{nkp} = y_n^{(p)}(t_k), \qquad (21)$$
where $n = 1, \ldots, N$ with $N$ being the number of species, $k = 0, \ldots, K$ with $K + 1$ being the total number of sampling points, and $p = 1, \ldots, P$ with $P$ being the number of measurements. The interpretation is that $y_{nkp}$ is the $p$-th measurement of the $n$-th species at time instant $t_k$. We set $t_0 = 0$ and $t_K = T$ to be the initial and final time points, respectively. Notice that the number of measurements may depend on the time index $k$ (i.e., $P = P(k)$), but for simplicity we assume that it is constant. It is also useful to define the measurement vector $y_n$ as
$$y_n = [\, y_n^{(1)}(t_0), \ldots, y_n^{(1)}(t_K), \ldots, y_n^{(P)}(t_0), \ldots, y_n^{(P)}(t_K)\, ]^T. \qquad (22)$$

The weak formulation does not apply directly when time-course data are provided; thus, time-series interpolation is utilized. This is achieved by assuming that the state variables over time can be approximated by a linear combination of $L$ basis functions of time, denoted by $\{\bar{\phi}_l(t)\}_{l=1}^{L}$. Thus, the $n$-th state variable is written as
$$x_n(t) = \sum_{l=1}^{L} c_{nl}\, \bar{\phi}_l(t) = c_n^T \bar{\phi}(t), \qquad n = 1, \ldots, N, \qquad (23)$$
where $\bar{\phi}(t) = [\bar{\phi}_1(t), \ldots, \bar{\phi}_L(t)]^T \in \mathbb{R}^L$ is an $L$-dimensional vector whose elements are the basis functions. A typical choice for the basis functions are B-splines, which are local in time and are defined by two parameters: the degree of the interpolating polynomials and the set of knots. Knots in the B-spline construction are usually dictated by the sampling time points, $t_0, \ldots, t_K$, but they can be enriched or reduced on a user-specific basis. If necessary, they can also be optimized through a cross-validation procedure.

In order to formulate the data fitting optimization problem, define the $L \times (K+1)P$ matrix $\tilde{\phi}$ as
$$\tilde{\phi} = \begin{bmatrix} \bar{\phi}_1(t_0) & \cdots & \bar{\phi}_1(t_0) & \cdots & \bar{\phi}_1(t_K) & \cdots & \bar{\phi}_1(t_K) \\ \vdots & & \vdots & & \vdots & & \vdots \\ \bar{\phi}_L(t_0) & \cdots & \bar{\phi}_L(t_0) & \cdots & \bar{\phi}_L(t_K) & \cdots & \bar{\phi}_L(t_K) \end{bmatrix}, \qquad (24)$$
which is in alignment with the measurement vectors $y_n$ in (22). The (penalized) data fitting problem is defined as
$$\min_{C} \sum_{n=1}^{N} \Big\{ \| y_n - \tilde{\phi}^T c_n \|_2^2 + \lambda_C\, c_n^T \ddot{\bar{\Phi}}\, c_n \Big\}, \qquad (25)$$
where $C$ is the $N \times L$ coefficient (or weight) matrix defined by
$$C = [c_1, \ldots, c_N]^T, \qquad (26)$$
while the second term is a smoothing penalty controlled by the non-negative parameter (i.e., weight) $\lambda_C$. The second term is designed to penalize the ripples of the interpolated time-series by penalizing the squared $L^2$ norm of the derivative of the time-series. This reads $\int_0^T (\dot{x}_n(t))^2\, dt = \langle \dot{x}_n, \dot{x}_n \rangle$, which can be rewritten as $c_n^T \ddot{\bar{\Phi}}\, c_n$, where $\ddot{\bar{\Phi}}$ is an $L \times L$ symmetric matrix with elements $\ddot{\bar{\Phi}}_{l,k} = \langle \dot{\bar{\phi}}_l, \dot{\bar{\phi}}_k \rangle$. The derivation of the ripple penalty is
$$\int_0^T (\dot{x}_n(t))^2\, dt = \int_0^T \Big( \sum_{l=1}^{L} c_{nl}\, \dot{\bar{\phi}}_l(t) \Big) \Big( \sum_{k=1}^{L} c_{nk}\, \dot{\bar{\phi}}_k(t) \Big)\, dt = \sum_{l=1}^{L} \sum_{k=1}^{L} c_{nl}\, c_{nk}\, \langle \dot{\bar{\phi}}_l, \dot{\bar{\phi}}_k \rangle = c_n^T \ddot{\bar{\Phi}}\, c_n.$$

The optimization problem in (25) is a regularized least squares (RLS) problem. Additionally, it can be decoupled into $N$ independent optimization subproblems, one for each coefficient vector $c_n$. Thus, the solution for the $n$-th coefficient vector is given by
$$\hat{c}_n = \Big( \tilde{\phi}\, \tilde{\phi}^T + \lambda_C\, \ddot{\bar{\Phi}} \Big)^{-1} \tilde{\phi}\, y_n, \qquad n = 1, \ldots, N. \qquad (27)$$
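As a sketch of the closed-form solution (27), the following Python snippet fits a single species with a penalized B-spline basis. The helper names, the quadrature grid size, and the use of scipy's BSpline class are illustrative choices rather than the paper's implementation.

```python
import numpy as np
from scipy.interpolate import BSpline

def fit_trajectory(t_obs, y_obs, knots, degree, lam_C):
    """Penalized B-spline fit of eqs. (25)/(27) for one species.

    t_obs, y_obs : sampling times and (possibly repeated) measurements, flattened to 1-D arrays
    knots        : full knot vector of the B-spline basis
    degree       : polynomial degree of the basis
    lam_C        : smoothing weight lambda_C
    Returns the coefficient vector c_n and a callable interpolant x_n(t).
    """
    L = len(knots) - degree - 1                       # number of basis functions
    eye = np.eye(L)
    basis = [BSpline(knots, eye[l], degree) for l in range(L)]

    Phi = np.array([[b(t) for t in t_obs] for b in basis])        # L x (#observations), cf. eq. (24)
    # roughness penalty <dphi_l/dt, dphi_k/dt> via quadrature on a fine grid
    grid = np.linspace(t_obs.min(), t_obs.max(), 500)
    dPhi = np.array([b.derivative()(grid) for b in basis])
    Pen = np.trapz(dPhi[:, None, :] * dPhi[None, :, :], grid, axis=2)  # L x L matrix

    c = np.linalg.solve(Phi @ Phi.T + lam_C * Pen, Phi @ y_obs)   # closed form of eq. (27)
    return c, lambda t: np.sum([c[l] * basis[l](t) for l in range(L)], axis=0)
```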
Cost functional (25) can be seen from a Bayesian perspective as the sum of a log-likelihood and a log-prior distribution. The assumption incorporated in the prior distribution is that the trajectories are typically smooth functions of time. Both the likelihood and the prior distributions are Gaussian, thus the data fitting problem is a linear optimization problem for the coefficient matrix which can be solved explicitly, as shown above. We would also like to highlight that a simple approach to create many trajectories is to repeatedly and randomly choose a subset of the measurements and solve (25). A weight can also be assigned to each trajectory from the posterior distribution defined by (25). The expectation when one feeds several trajectories to the learning algorithm is that the dynamics of the underlying process are sampled more accurately, thus improving the performance. Finally, there are cases where the distribution of the measurements is multi-modal or heteroscedastic. In such cases, the Gaussianity assumption is not valid, resulting in sub-optimal solutions; hence, the cost functional should be enriched with multi-modal distributions such as a (non-linear) Gaussian mixture model. Alternatively, Monte Carlo methods, which are computationally more expensive, can be utilized for the sampling of the trajectories in the non-Gaussian case.

C Tackling Multiple Interventions
Depending on the application at hand, there are various types of interventions that can be applied. Probably the simplest intervention is to start from different initial values, then let the system evolve and measure it again. The wider exploration of the state space results in more informative measurements, thus the estimation of the structure and the parameters of the dynamical system becomes, in principle, more identifiable. In systems biology, for instance, such types of interventions are called activations. Another type of intervention is (partial) inhibition, where the effect of a state variable is neutralized or kept constant over time. For instance, if the $n^*$-th state variable is kept constant, then we can impose the constraint $x_{n^*}(t) = c$ for all $t$, or equivalently $\dot{x}_{n^*}(t) = 0$, which results in the constraint
$$\sum_{q=1}^{Q} a_{n^* q}\, \psi_q(x) = 0.$$
When an ODE system is augmented with constraints that do not contain time derivatives, the resulting set of equations is called a system of differential algebraic equations. More complex interventions and/or physical constraints (or laws) can be incorporated into the ODE system. The intervention type where $c = 0$ is typically called inhibition.

However, experimentalists and scientists do not have full control over the effect of an intervention, thus it is preferable to introduce a general way to model interventions other than hard constraints. Indeed, for ODE systems, the original equation, $\dot{x}_n = \sum_{q=1}^{Q} a_{nq}\, \psi_q(x)$, can be slightly extended to
$$\dot{x}_n = \sum_{q=1}^{Q} a_{nq}\, \psi_q(x) + b_n u_n, \qquad x_n(0) = x_{n,0}, \qquad n = 1, \ldots, N, \qquad (28)$$
where $b_n \in \mathbb{R}$, while $u_n = u_n(t)$ is a given function of time which can be thought of as an input signal and constitutes the intervention signal. If, for instance, the $n^*$-th state variable is inhibited, then this intervention is approximated by the reaction $X_{n^*} \to \emptyset$ with a large rate constant. In the modified ODE system (28), intervention on variable $n^*$ is modeled by setting $u_n = 0$ for $n \neq n^*$ and $u_{n^*} = x_{n^*}$. Moreover, input signals can account not only for inhibition but also for other types of interventions such as dosage level. Indeed, the form of the input signal for a dosage intervention can be defined as $u_n(t) = u(t) = e^{-t/\tau}$, where $\tau$ controls the decay rate of the drug, while $b_n$ reflects the strength of the effect of the drug on the $n$-th state variable.

Proceeding, assume that there are $R$ different experimental conditions; thus, there are $R$ different ODE systems, each one having a different intervention input $u_n^{(r)}$, $r = 1, \ldots, R$. The weak formulation for the $r$-th ODE and the $n$-th variable is given by
$$z_n^{(r)} = \Psi^{(r)} a_n + b_n^{(r)} v_n^{(r)}, \qquad (29)$$
where $z_n^{(r)}$ and $\Psi^{(r)}$ are as in the main text, while $v_n^{(r)}$ is the projection vector of the intervention input onto the test functions, with elements given by $v_{nm}^{(r)} = \langle u_n^{(r)}, \phi_m \rangle$. The integrated ODE system can be written in the matrix form
$$\begin{bmatrix} z^{(1)} \\ z^{(2)} \\ \vdots \\ z^{(R)} \end{bmatrix} = \begin{bmatrix} \Psi^{(1)} & v^{(1)} & & \\ \Psi^{(2)} & & v^{(2)} & \\ \vdots & & & \ddots \\ \Psi^{(R)} & & & v^{(R)} \end{bmatrix} \begin{bmatrix} a \\ b^{(1)} \\ \vdots \\ b^{(R)} \end{bmatrix}. \qquad (30)$$
The derived system of equations falls again into the SSR category, thus OMP can be utilized. Moreover, since we know a priori that there is a driving input, we start OMP with the index set $S = \{Q + 1, \ldots, Q + R\}$. Finally, we remark that when $v^{(r)} = 0$ for some $r$, there is no input term; hence, it becomes redundant in the above system of equations, and the corresponding column of the measurement matrix as well as the corresponding $b^{(r)}$ coefficient are removed.
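The stacking in eq. (30) is mechanical; a hedged sketch with illustrative names is shown below. Note that with zero-based indexing the known-input block corresponds to columns $Q, \ldots, Q+R-1$, which is how OMP would be seeded in code.

```python
import numpy as np
from scipy.linalg import block_diag

def assemble_intervention_system(Psi_list, z_list, v_list):
    """Stack the R experimental conditions into the block system of eq. (30).

    Psi_list : list of R matrices Psi^(r), each M x Q
    z_list   : list of R vectors z^(r), each of length M
    v_list   : list of R vectors v^(r) with the projected intervention inputs
    Unknowns are ordered as [a, b^(1), ..., b^(R)].
    """
    Q = Psi_list[0].shape[1]
    A_blocks = np.vstack(Psi_list)                                   # shared dictionary part
    B_blocks = block_diag(*[v.reshape(-1, 1) for v in v_list])       # one b^(r) column per condition
    Theta = np.hstack([A_blocks, B_blocks])
    z = np.concatenate(z_list)
    init_support = list(range(Q, Q + len(v_list)))                   # seed OMP with the input indices
    return Theta, z, init_support
```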
D Sparse Signal Recovery

After the application of the weak formulation, the transformed system of equations is written as
$$z = \Psi a + e, \qquad (31)$$
where $z$, $\Psi$ and $a$ are as defined in the main text, while $e$ denotes the noise vector with $\|e\|_2 = \epsilon$ being the energy of the error/noise term. The SSR problem is defined as an $l_0$ quasi-norm minimization program,
$$\min_a \|a\|_0 \quad \text{subject to} \quad \|z - \Psi a\|_2 \leq \epsilon, \qquad \text{(l0-error)}$$
where the $l_0$ quasi-norm counts the number of non-zero elements of its argument (i.e., $\|a\|_0 = \#\{q : a_q \neq 0\}$). In order to solve this non-convex program, one must search through all possible supports, making this approach intractable for large systems since the search space is exponentially large [51]. There is a wide spectrum of algorithms that try to approximate the solution of the above intractable program; we refer to [52] for a recent review. In the following, we present representative approximation algorithms from two different families that solve SSR. Moreover, the presented techniques are accompanied by theoretical guarantees which determine under which conditions the approximate solution is also the solution of (l0-error).

Let us remark here that the columns of the measurement matrix are assumed to be normalized in the literature; however, the matrix $\Psi$ obtained by the weak formulation does not have normalized columns. In order to make the columns of $\Psi$ have $l_2$ norm equal to 1, the following computation is performed:
$$z = \Psi a + e = \sum_{q=1}^{Q} a_q \psi_q + e = \sum_{q=1}^{Q} a_q \|\psi_q\|_2 \frac{\psi_q}{\|\psi_q\|_2} + e = \sum_{q=1}^{Q} \bar{a}_q \bar{\psi}_q + e = \bar{\Psi} \bar{a} + e. \qquad (32)$$
Thus, the non-zero coefficients of $a$ are amplified by a factor which is proportional to the $l_2$-norm of the respective column of $\Psi$.
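In code, the normalization of eq. (32) is a one-liner; coefficients recovered for the normalized dictionary must be divided by the stored column norms to map them back to the original scale (a sketch with illustrative names):

```python
import numpy as np

def normalize_columns(Psi):
    """Rescale the dictionary so that every column has unit l2 norm, cf. eq. (32)."""
    norms = np.linalg.norm(Psi, axis=0)
    return Psi / norms, norms   # a_q = a_bar_q / norms[q] recovers the unnormalized coefficients
```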
D.1 Convex Relaxation Theory

One approach to relax the above optimization problem and make it computationally feasible is to replace the $l_0$ quasi-norm with the $l_1$ norm and obtain the convex program
$$\min_a \|a\|_1 \quad \text{subject to} \quad \|z - \Psi a\|_2 \leq \delta, \qquad \text{(l1-error)}$$
where $\|a\|_1 = \sum_{q=1}^{Q} |a_q|$ is the $l_1$ norm, while $\delta$ is the tolerance. It holds that the $l_1$ norm is the closest convex function to the $l_0$ quasi-norm [7]. Using standard linear programming software, the (l1-error) minimization problem can be solved in polynomial time. Nevertheless, the solution of (l1-error) is just an approximation to the SSR problem, and it is not guaranteed to provide the same answer as the $l_0$ quasi-norm problem. Systematic investigations have been presented throughout the last fifteen years exploring when the solutions of the two problems coincide. For the sake of completeness, we provide a special case of a theorem by Tropp [7] which is based on the ERC and determines under which conditions the solution of (l1-error) recovers the true support. In order to present the theorem, we have to define the following quantities. The $(p, q)$-matrix operator norm of a matrix $B$ is defined as
$$\|B\|_{p,q} := \max_{x \neq 0} \frac{\|Bx\|_q}{\|x\|_p},$$
while, for an index set $S$, we define the restricted pseudo-inverse operator over $S$ as
$$\bar{\Psi}_S^{\dagger} := (\bar{\Psi}_S^T \bar{\Psi}_S)^{-1} \bar{\Psi}_S^T,$$
and the best $l_2$ approximation of $z$ restricted on $S$ as
$$\hat{z}_S := \bar{\Psi}_S \bar{\Psi}_S^{\dagger} z.$$
Then, the following holds.
Theorem 1: (Tropp ’06)
Let $\mathcal{T}$ be the support of the true solution for which $ERC(\mathcal{T}) \geq 0$, and select the tolerance to be
$$\delta = \sqrt{\epsilon^2 + \left( \frac{\|\bar{\Psi}^T (z - \hat{z}_{\mathcal{T}})\|_\infty\, \|\bar{\Psi}_{\mathcal{T}}^{\dagger}\|_{2,1}}{ERC(\mathcal{T})} \right)^2}.$$
If for all $q \in \mathcal{T}$ it holds that $|a_q| > \frac{\|\bar{\Psi}_{\mathcal{T}}^{\dagger}\|_{2,2}}{\|\psi_q\|_2}\, \delta$, then the support of the (unique) solution of the (l1-error) convex program with tolerance $\delta$ is exactly $\mathcal{T}$.

Another methodology to relax the $l_0$ quasi-norm optimization program is to consider the Lagrangian version of (l1-error), which reads
$$\min_a \|z - \Psi a\|_2^2 + \lambda \|a\|_1, \qquad \text{(l1-penalty)}$$
where $\lambda$ is a weight parameter that balances data fidelity against sparsity; indeed, increasing $\lambda$ results in sparser solutions. This optimization program is also known as LASSO [53], and there exist fast algorithms such as LARS [54] which find LASSO solutions. As in the previous case, there are theoretical guarantees stating under which conditions the LASSO solution is also the solution of the SSR problem. We provide two theorems, one involving the MIP and the other the ERC, which are special cases of theorems in [7].

Theorem 2: (Tropp ’06)
Let $\mathcal{T}$ be the support of the true solution and $k := |\mathcal{T}|$ its size. Suppose that $\mu < \frac{1}{2k-1}$, as well as that
$$\|\bar{\Psi}^T (z - \hat{z}_{\mathcal{T}})\|_\infty \leq \lambda\, \frac{1 - (2k-1)\mu}{1 - (k-1)\mu}$$
for some $\lambda > 0$, where $\hat{z}_{\mathcal{T}}$ is the best $l_2$ approximation over $\mathcal{T}$. If for all $q \in \mathcal{T}$ it holds that
$$|a_q| > \frac{\lambda}{\|\psi_q\|_2\, \big(1 - (k-1)\mu\big)},$$
then the support of the (unique) solution of the Lasso convex program with parameter $\lambda$ is exactly $\mathcal{T}$.

It is noteworthy that in the noiseless case, and given the condition on $\mu$, there is always a small enough $\lambda$ which satisfies the condition on the non-zero coefficients of $a$. Moving forward, as we saw in the main text, the condition on the MIP is rarely satisfied, mainly due to collinearity between columns of $\Psi$ that might be irrelevant to the space spanned by the atoms of the actual solution. For this reason, we stated that the MIP is rather conservative. The following theorem, again due to Tropp, is based on the ERC and determines when the Lasso solution recovers the true support.

Theorem 3: (Tropp ’06)
Let $\mathcal{T}$ be the support of the true solution for which $ERC(\mathcal{T}) \geq 0$. Suppose that
$$\|\bar{\Psi}^T (z - \hat{z}_{\mathcal{T}})\|_\infty \leq \lambda\, ERC(\mathcal{T})$$
for some $\lambda > 0$. If for all $q \in \mathcal{T}$ it holds that
$$|a_q| > \frac{\lambda\, \|(\bar{\Psi}_{\mathcal{T}}^T \bar{\Psi}_{\mathcal{T}})^{-1}\|_{\infty,\infty}}{\|\psi_q\|_2},$$
then the support of the (unique) solution of the Lasso convex program with parameter $\lambda$ is exactly $\mathcal{T}$.

The condition on the ERC is harder to validate than the MIP condition because the support of the true but unknown solution is required. Here we present the results when the true support is known; however, in Tropp [7] the conditions are relaxed to an arbitrary index set of a linearly independent collection of dictionary atoms. Finally, we remark that a difficulty with the Lasso formulation is that little or no information about the hyper-parameter $\lambda$ is provided, and practitioners usually apply information criteria in order to select $\lambda$. Nevertheless, guidance on the values of $\lambda$ can be obtained from the above theorems, since two counterbalancing conditions on $\lambda$ need to be satisfied.
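For reference, the Lagrangian program (l1-penalty) can be solved with any proximal-gradient routine; the following ISTA sketch (our own illustrative implementation, not LARS and not part of USDL) minimizes exactly the objective above for a given $\lambda$.

```python
import numpy as np

def lasso_ista(Psi, z, lam, n_iter=5000):
    """Minimize ||z - Psi a||_2^2 + lam * ||a||_1 by the iterative soft-thresholding algorithm."""
    L = 2.0 * np.linalg.norm(Psi, 2) ** 2            # Lipschitz constant of the smooth part
    a = np.zeros(Psi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Psi.T @ (Psi @ a - z)           # gradient of the quadratic data-fidelity term
        u = a - grad / L
        a = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)   # soft-thresholding proximal step
    return a
```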
D.2 Orthogonal Matching Pursuit

There are several greedy algorithms proposed in the literature that try to solve the SSR problem. Among them, one of the fastest and surely the most popular is OMP. Apart from being very fast, OMP has the important property that its hyper-parameter is easy to interpret and to approximate from the available data. For the sake of completeness, we present the basic OMP algorithm, where we have added an extra criterion on the maximum allowed number of non-zero elements, denoted by $K$.

OMP Algorithm:
1. Initialize the residual $r_0 = z$ and the set of selected indices $S = \emptyset$. Set the counter to $i = 1$.
2. Find the next index as $q' = \arg\max_q |\bar{\psi}_q^T r_{i-1}|$ and add $q'$ to $S$.
3. Solve the least squares problem using only the selected indices: $\hat{a}_S = (\Psi_S^T \Psi_S)^{-1} \Psi_S^T z = \Psi_S^{\dagger} z$.
4. Update the residual error $r_i = z - \Psi_S \hat{a}_S$.
5. If $\|r_i\|_2 < \epsilon$ or $|S| > K$, stop the algorithm. Otherwise, set $i = i + 1$ and return to step 2.

The stopping criterion in OMP requires knowledge of the $l_2$ norm of the true residual error, which in principle is not available a priori. The energy of the true residual error is approximated as
$$\hat{\epsilon}^2 = (1 + \alpha)\, \|r_{LS}\|_2^2,$$
where $r_{LS}$ is the residual when the complete dictionary is utilized, given by $r_{LS} := z - \Psi \Psi^{\dagger} z$, while $\alpha$ is a small, user-defined positive number that corresponds to the potential over-fitting of the complete model representation. Notice that the stopping criterion can then be rewritten as
$$\frac{\|r_i\|_2^2 - \|r_{LS}\|_2^2}{\|r_{LS}\|_2^2} < \alpha,$$
implying that OMP stops when the relative residual energy becomes smaller than $\alpha$.
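A compact Python rendering of the algorithm above follows; the names are illustrative, at most `K_max` atoms are selected, and the `init_support` argument anticipates the prior-knowledge variant of Section D.2.1.

```python
import numpy as np

def omp(Psi, z, eps, K_max, init_support=()):
    """Orthogonal Matching Pursuit; columns of Psi are assumed l2-normalized."""
    S = list(init_support)
    a = np.zeros(Psi.shape[1])
    if S:
        a[S] = np.linalg.lstsq(Psi[:, S], z, rcond=None)[0]
    r = z - Psi @ a
    while np.linalg.norm(r) >= eps and len(S) < K_max:
        q = int(np.argmax(np.abs(Psi.T @ r)))        # atom most correlated with the residual
        if q in S:
            break                                    # no further progress can be made
        S.append(q)
        a = np.zeros(Psi.shape[1])
        a[S] = np.linalg.lstsq(Psi[:, S], z, rcond=None)[0]   # least squares on the selected atoms
        r = z - Psi @ a
    return a, S
```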
In the following, we present two theorems on perfect support recovery, one based on the MIP and the other on the ERC, which were first proven in [39].

Theorem 4: (Cai & Wang ’11)

Suppose $\|e\|_2 \leq \epsilon$ and $\mu < \frac{1}{2k-1}$, where $k$ is the number of non-zero elements of $a$. Then, the OMP algorithm with stopping rule $\|r_i\|_2 \leq \epsilon$ recovers the true subset of correct dictionary atoms indexed by $\mathcal{T}$ if for all $q \in \mathcal{T}$ it holds that
$$|a_q| \geq \frac{\mathrm{SNR}(q)}{1 - (2k-1)\mu}.$$
The next theorem collects the pieces that have been presented in the main text.
Theorem 5: (Cai & Wang ’11)
Suppose $\|e\|_2 \leq \epsilon$ and $ERC(\mathcal{T}) > 0$, where $\mathcal{T}$ is the support of the true solution. Then, the OMP algorithm with stopping rule $\|r_i\|_2 \leq \epsilon$ recovers the true subset of correct dictionary atoms, $\mathcal{T}$, if for all $q \in \mathcal{T}$ it holds that
$$|a_q| \geq \frac{\mathrm{SNR}(q)}{ERC(\mathcal{T})\, \lambda_{\min}(\mathcal{T})}.$$
Additional theoretical results can be found in [39]. For instance, by varying the stopping criterion, one can guarantee that the strongest components of the solution are recovered, and that they are selected first.
D.2.1 Adding Prior Knowledge
In some applications, the user may partially know the driving forces of a dynamical system, or may have an instrument that intervenes on the system in a systematic and known way. Thus, being able to add prior knowledge to the inference algorithm is very important. Another convenient feature of OMP is that adding prior knowledge is straightforward: instead of starting with an empty index set, $S$ in step 1 of OMP is initialized with the provided indices that encode the prior knowledge. Accordingly, the initial residual error is evaluated after the subtraction of the known non-zero components.

E Further Demonstrations
We provide here several details on the production of the figures in the main text. Additional experiments are also presented for each of the demonstration examples.
E.1 Protein interaction network
This demonstration example emulates the output of a single-cell mass cytometry machine with a prototypical protein interaction network where protein P1 interacts with protein P3 through protein P2. The complete biochemical reaction network is given in Table 2, and the corresponding ODE system has been simulated with a standard ODE solver. It consists of six reactions which follow the law of mass action kinetics. R1 and R2 produce P2 and P3, respectively, while R3 and R4 correspond to the binding and unbinding of P3 to P1. R5 and R6 are simple degradation mechanisms. The initial concentration of the state variables (i.e., the species) is 1 for P1 and 0 for the rest of the species. Furthermore, we perform four interventions. The first intervention was to change the initial concentration of P2 from 0 to 5, while the second intervention was to change the initial concentration of P3 from 0 to 15. The third intervention was the inhibition of P2, while the final one was the inhibition of P3. Inhibition was implemented by augmenting the system with a new and fast degradation reaction with rate constant equal to 100.

Table 2: The complete biochemical reaction network. P1–P3 correspond to the 3 proteins, while [P1P3] corresponds to the protein–protein complex. The state of the reaction model is defined as $x = [P_1, P_2, P_3, [P_1P_3]]^T$.

Event | Reaction | Rate | Rate constant
R1 | P1 → P1 + P2 | $a_1(x) = k_1 P_1$ | $k_1 = 4$
R2 | P2 → P2 + P3 | $a_2(x) = k_2 P_2$ | $k_2$
R3 | P1 + P3 → [P1P3] | $a_3(x) = k_b\, P_1 \cdot P_3$ | $k_b$
R4 | [P1P3] → P1 + P3 | $a_4(x) = k_u\, [P_1P_3]$ | $k_u$
R5 | P2 → ∅ | $a_5(x) = k_{-1} P_2$ | $k_{-1}$
R6 | P3 → ∅ | $a_6(x) = k_{-2} P_3$ | $k_{-2}$

The resulting dynamical system has a subtle property. Due to the fact that $dP_1/dt + d[P_1P_3]/dt = 0$, the following conservation law holds: $P_1 + [P_1P_3] = 1$. Consequently, the dynamical system consists of three equations with four unknowns. Therefore, the complete reaction network is inherently unidentifiable when the conservation law is not taken into account, making it an inappropriate example for testing the USDL algorithm. This is the main reason why we choose a coarser and thus simpler dynamical model for the learning algorithms. Moreover, the property of unidentifiability is very typical in biochemical reaction networks and, in conjunction with the sloppiness of the parameters [55], the structure inference of the complete biochemical network of interactions is impossible most of the time.
Figure 5: (a) The F1 score as a function of the hyperparameter for both USDL (left column) and SINDy (right column) and various numbers of experiments and noise levels. (b) Time-series forecasts (black curves) based on the estimated parameter values for both USDL (left column) and SINDy (right column).
Figure 5(a) presents the F1 score (i.e., the harmonic mean of precision and recall) for both USDL (left column) and SINDy (right column) and various numbers of experiments and noise levels. Perfect reconstruction, which is implied by an F1 score of 100%, is achieved only by the USDL algorithm when the noise has low variance and data from all five experiments are used. As already stated in the main text, we choose the hyperparameter value that maximizes the F1 score. We also observe that utilizing up to R = 3 experiments results in similar F1 score performance for both approaches; however, the USDL algorithm is superior when all available data are exploited.

Figure 5(b) presents the time-series forecasts (black curves) based on the estimated parameter values for both USDL (left column) and SINDy (right column). The simulation is performed on a new experiment which is different from the five experiments used for the inference; indeed, the initial concentration is set to 10 for one of the proteins and to 5 for another. Let us also remark that we use as initial values for the ODE solver the mean of each variable's concentration at time point 0. In accordance with the ERC metric, the variable with the lowest ERC also exhibits the highest forecast error.
We proceed by showing results on the performance of the proposed approach under various experimental conditions. We test the performance when the number of sampling points is reduced, as well as when different weights for the smoothing penalty in the collocation method are used. In all experiments of this subsection, we set the hyperparameter of both the USDL and SINDy algorithms based on the optimal F1 score value. We do not show these plots since they are very similar to Figure 5(a).
Figure 6: (a) Time-series interpolation for the protein interaction network when the smoothing weight parameter is $\lambda_C = 1$. Trajectories are not smooth, introducing further noise; nevertheless, the dynamics of the measurements are correctly captured. (b) MIP (upper panel) and ERC per state variable (lower panel) for two noise levels, low (blue) and high (red). ERC is negative when R = 1 experimental condition is considered (dashed lines), while it is positive when R = 5 experimental conditions are considered (solid lines). (c) Precision (upper panel) and recall (lower panel) curves as a function of R for both the USDL (solid lines) and SINDy (dashed lines) algorithms. Perfect reconstruction is achieved for the low noise level when R = 5 interventions are provided. (d) Time-series forecasts (black curves) based on the estimated parameter values for both USDL (left column) and SINDy (right column). Forecasts are similar to the case where $\lambda_C = 100$.

Figure 6 presents the interpolated time-series of each species when $\lambda_C = 1$. It is evident to the naked eye that the bending of the trajectories is harsher, especially near the sampling points. Nevertheless, positive values of the ERC for all species, when all five interventions are taken into account, indicate the perfect reconstruction of the protein interaction network. Indeed, the performance of the USDL algorithm, measured by precision/recall metrics and by forecast capacity on a new experiment, is only slightly affected, as shown in Figures 6(c) & (d), respectively. The SINDy algorithm, as in the main text, does not exploit the information from the multiple interventions; thus, SINDy's performance is suboptimal. In the opposite direction, where the smoothing weight is large, the constructed trajectories are fairly smoother, as Figure 7 shows. Despite the error in the dynamics, the performance of the USDL algorithm is not negatively affected, as the precision and recall plots reveal. Evidently, the accuracy of both algorithms is slightly improved for the high noise level, with USDL achieving perfect reconstruction when fed with data from five interventions. It is noteworthy that the ERC indicates this behavior, since it is larger under high noise (red solid line in Figure 7(b)). In both cases, the inference methods were robust against the hyperparameter of the collocation method.

The next experiment assesses the performance of the proposed algorithm when measurements from fewer time points are obtained. Since dense sampling is more expensive, both in time and in money, than sparse sampling, it is desirable that the learning methods be robust against fewer sampling points. Figure 8 presents the results when we reduce the number of sampling times by a factor of 3 (from 12 to 4), keeping the time instants at 0, 2, 8 and 32.
Figure 7: (a) Time-series interpolation for the protein interaction network when the smoothing weight parameter $\lambda_C$ is large. Trajectories are excessively smoothed, resulting in an inability to capture the dynamics correctly. (b) MIP (upper panel) and ERC per state variable (lower panel) for two noise levels, low (blue) and high (red). (c) Precision (upper panel) and recall (lower panel) curves as a function of R for both the USDL (solid lines) and SINDy (dashed lines) algorithms. Perfect reconstruction is achieved by the USDL algorithm at both low and high noise levels. (d) Time-series forecasts (black curves) based on the estimated parameter values for both USDL (left column) and SINDy (right column).

Interestingly, the performance of the USDL algorithm remained the same, with perfect reconstruction being achieved when all five interventions are considered under low noise. This implies that the collocation method was able to correctly estimate the trajectories from these four time points. Perfect reconstruction was also predicted from the positivity of the ERC for all variables in the case of five interventions. We would like to remark here that the ERC can be utilized as an experimental design metric for the determination of the optimal sampling points. Indeed, using prior information about the dynamics of the system under study, the space of sampling points could be explored, and optimal conditions may be achieved by making the ERC as large as possible for all variables. Additionally, the number as well as the type of interventions could be chosen based on the ERC; however, this is a different study which we leave as future work.

The final experiment assesses the performance of the proposed algorithm when the protein interaction network has different reaction constants. Nature rarely provides the constant values that are most favourable for an algorithmic approach, and it is important to test the algorithm under different conditions. Figure 9 presents the results when the reaction constants of the first two reactions are doubled (i.e., $k_1 = 8$ and $k_2$ doubled accordingly). Perfect reconstruction is no longer achieved, and the responsible variable is the one whose ERC is negative. For the case where all five interventions are taken into account, the inspection of the estimated connectivity matrix (i.e., of A) revealed that the row describing the dynamics of this variable was wrong in most runs (more than 95%), while the other two rows, which describe the dynamics of the remaining variables, were correct in most runs (more than 95%). Nevertheless, the total error in the structure inference significantly affects the forecast accuracy of all variables, as Figure 9(d) demonstrates.
Figure 8: (a) Time-series interpolation for the protein interaction network when fewer sampling points are measured. (b) MIP (upper panel) and ERC per state variable (lower panel) for two noise levels, low (blue) and high (red). ERC is positive when R = 5 interventions are considered (solid lines), while it is negative when R = 1 intervention is considered (dashed lines). (c) Precision (upper panel) and recall (lower panel) curves as a function of R for both the USDL (solid lines) and SINDy (dashed lines) algorithms. Overall, the right positioning of the sampling instants resulted in robust inference of the network of interactions. (d) Time-series forecasts (black curves) based on the estimated parameter values for both USDL (left column) and SINDy (right column).

Overall, this example demonstrates that when the ERC is positive for all variables, then with high probability the protein interaction network will be perfectly reconstructed. In contrast, when there are variables whose ERC is negative, the probability of perfect reconstruction becomes very low, and more experiments are required in order to make the ERC positive and thus recover the true network of interactions. Since the ERC is a variable-dependent quantity, we can be confident that the inferred interactions between variables with positive ERC are true. Finally, we remark that for small systems of interactions like this one, a brute-force alternative is feasible: a complete search over all possible solutions with fewer than ten non-zero components is computationally tractable for dictionary sizes up to twenty atoms. However, such an approach provides little or no information on how to design a new experiment or a new data acquisition policy, in contrast with greedy algorithms or convex relaxation methods, where metrics such as the MIP and the ERC can guide the experimental designer.

E.3 Mass Cytometry
The protein interactions found in the literature are reported in Table 3; the right-most column provides the scientific source of the corresponding interaction. The smoothing weight in the collocation method is $\lambda_C = 100$. In order to make the results robust, we repeat the inference 100 times, using half of the points in each iteration. An edge is added to the network when it is found in at least 80% of the repetitions. The hyperparameter tuning is based on the optimal F1 scores shown in Figure 10.
Figure 9: (a) Time-series interpolation for the protein interaction network when different reaction constants are assumed, without changing the topology of the network. (b) MIP (upper panel) and ERC per state variable (lower panel) for two noise levels, low (blue) and high (red). ERC remains negative for one of the variables even when R = 5 interventions are considered (solid lines). (c) Precision (upper panel) and recall (lower panel) curves as a function of R for both the USDL (solid lines) and SINDy (dashed lines) algorithms. The performance results indicate that perfect reconstruction is not achieved in any case, which was again expected from the negativity of the ERC; perfect network reconstruction requires more interventions. (d) Time-series forecasts (black curves) based on the estimated parameter values for both USDL (left column) and SINDy (right column). The forecast error is larger for the variable with negative ERC, confirming once again that the ERC correlates well with the forecast outcome.

Table 3: Protein interactions found in the literature.
Interaction | Literature
CD3z → Slp76 | Krishnaswamy et al. [41]
Slp76 → Erk | Krishnaswamy et al. [41]
Erk → S6 | Krishnaswamy et al. [41]
Erk → Creb | Krishnaswamy et al. [41]
CD3z → MAPKAPKII | Krishnaswamy et al. [41]
MAPKAPKII → Creb | Krishnaswamy et al. [41]
Akt → Rb | Krishnaswamy et al. [41]
Akt → S6 | Krishnaswamy et al. [41]
Akt → Creb | PI3K-AKT Signaling (KEGG)
Table 4 reports the values of MIP and ERC per variable for the two subnetworks presented in the main text. MIP is almost 1, showing that there is strong collinearity between the columns of the dictionary matrix Ψ. The ERC values for the small subnetwork are positive, increasing the confidence that the inferred network is correct. On the contrary, some variables have negative ERC for the larger subnetwork, casting doubt on the correctness of the inferred network.

Table 4: MIP and ERC values for the two subnetworks of proteins. Positive ERC adds confidence to the inference results; however, it is just an estimate since the ground truth is unknown.

 | MIP | ERC: CD3z | SLP76 | Erk | S6 | MAPKAPKII | Creb | Akt | Rb
Subnet 1 | 0.967 | 0.032 | 0.114 | 0.020 | 0.065 | – | – | – | –
Subnet 2 | 0.980 | 0.031 | 0.119 | -0.314 | -0.256 | -0.163 | 0.031 | 0.119 | -2.545
Figure 10: The F1 score as a function of the hyperparameter for both USDL (left column) and SINDy (right column). The first row shows the F1 score for the four-protein subnetwork and the second row for the eight-protein subnetwork. Evidently, the F1 score decreases as the number of proteins increases for both approaches.
Next, we examine the inference capabilities of the proposed approach under various experimental conditions. Figure 11 presents the first experiment, where we eliminate half of the sampling points; the new sampling points are at 0, 2, 5, 8 and 20 minutes. As Figure 11(a) shows, the network inference with USDL deteriorates, since the interaction CD3z → Slp76 is missed while an additional interaction in which Slp76 produces S6 is found. SINDy is minimally affected by the elimination of half of the measurements, although it infers an almost fully-connected graph. For the eight-protein network, the inference deteriorates when fewer sampling points are present, for both USDL and SINDy (see Figures 11(c) & (d), respectively).

Figure 11: Reconstructed protein signalling networks when half of the sampling points are provided. (a) USDL, (b) SINDy, (c) USDL, (d) SINDy.

Figure 12 demonstrates the reconstructed networks when both activation cocktails are taken into consideration. Again, network inference deteriorates for both algorithms, even though we provide additional data from another intervention. A probable explanation is that there exists interference between measured and unmeasured proteins, with the unmeasured ones serving as latent variables. These extra sources of noise confuse the inference algorithm, resulting in inferior performance. We also tested the performance when all three available populations are taken into consideration and fed to the inference algorithms; the results are presented in Figure 13. One activation cocktail (CD3/CD28) is considered. For the USDL algorithm there are a few edges that are missing. On the contrary, the SINDy algorithm, especially for the eight-protein network, is able to reconstruct most of the interactions with a minimum number of false positives. This is a rather unexpected result, since different populations have different variable connectivity.

Figure 12: Reconstructed protein signalling network when both activation cocktails are provided. (a) USDL, (b) SINDy, (c) USDL, (d) SINDy.

The final experiment is to add all the available measurements, with all three populations and both activation cocktails (Figure 14). The results, as quantified by the reconstructed networks, are similar to the case where both activation cocktails and one population (i.e., CD4 naive) are considered. Overall, we conclude that more data does not always produce superior results and that caution is necessary in the experimental design in order to reveal the actual protein interactions. Interestingly, the worst performance for both inference algorithms is obtained when all activators are considered: in this case, only two protein interactions that have been reported in the literature are found. It is also noteworthy that there are two interactions that are present in all settings of the experiments; these interactions form the cascade Slp76 → Erk → S6. Moreover, in all experiments with USDL, the edge between Erk and Akt is always inferred, which can be either a true interaction not reported in the literature
or a way to up-regulate Akt, since in the T-cell signaling pathway Akt is phosphorylated by CD28, which is not measured.

Finally, we present an experiment where prior knowledge is provided to the inference algorithm. It is legitimate to use any available prior information to guide the inference algorithm, especially when a scientist is searching for new knowledge. Our algorithm, and in particular OMP, can easily incorporate prior knowledge as discussed earlier; thus, one can provide the known interactions and look for new ones. In Figure 15 we present the reconstructed network when the following interactions are provided a priori: CD3z → Slp76, CD3z → MAPKAPKII, MAPKAPKII → Creb and Erk → Creb. There are two immediate observations related to Creb that can be made. First, the interaction Akt → Creb is lost, probably because it has been replaced by Erk → Creb; second, instead of up-regulating Creb, MAPKAPKII down-regulates it, questioning the existence of this interaction.

Figure 13: Reconstructed protein signalling network when three distinct populations are provided. (a) USDL, (b) SINDy, (c) USDL, (d) SINDy.
Figure 14: Reconstructed protein signalling network when all available data are provided. (a) USDL, (b) SINDy, (c) USDL, (d) SINDy.
E.4 Multidimensional Ornstein-Uhlenbeck process
The multidimensional Ornstein-Uhlenbeck (OU) process is defined in the more standard notation as
$$dx_t = -A x_t\, dt + \sigma\, dB_t, \qquad (33)$$
where $A$ is the connectivity matrix, $B_t$ is an $N$-dimensional standard Brownian motion and $\sigma$ is the noise level, which is set to 0.25. The OU process satisfies the detailed balance condition [46]; thus, at the stationary regime, it is time-reversible, meaning that the time-reversed time-series have the same path distribution. Moreover, the stationary distribution, $\mu(x)$, is proportional to
$$\mu(x) \propto e^{-\frac{1}{\sigma^2} x^T \Sigma^{-1} x}. \qquad (34)$$
Obviously, it is a zero-mean Gaussian. The matrix $\Sigma$ is found as the solution of the Lyapunov equation $A\Sigma + \Sigma A^T = I$. The concentration matrix, which is defined as the inverse of the covariance matrix, is not necessarily sparse even when $A$ is sparse. This implies that if the measurements are obtained as i.i.d. samples from the stationary distribution, then it is impossible to infer the causal relationships between the state variables, since they have been lost. On the contrary, using the richer information contained in the dynamics, primarily the time correlations, the true connectivity matrix can be estimated and the causal relations can be correctly inferred.

Proceeding, the time-series of an N = 20-dimensional OU process are presented in Figure 16(a). The upper panel shows the time-series at the stationary regime, while the lower panel shows the time-series at the transient regime. At the transient regime, the initial values of the process were independently sampled from a Gaussian distribution with variance one, which obviously results in the process starting out of equilibrium; the OU process converges to equilibrium after one time unit. The numerical integration is performed using the Euler-Maruyama scheme, which is a first-order finite-difference scheme with a small time step. The maximum allowed number of non-zero elements in OMP is set to K = 6, while the threshold is set according to the maximum value of the F1 score, which is shown in Figure 17(a). The SINDy algorithm's hyperparameter is set similarly. Figure 17(a) also reveals that there is a large region of hyperparameter values that produce perfect reconstruction at the transient regime. The region of optimal values is much smaller at the stationary regime, but as we increase the number of time-series it tends to grow, at least for the USDL algorithm. Additionally, this figure highlights the significant superiority of the USDL algorithm over the SINDy algorithm when the noise is prominent (i.e., at the stationary regime), showing the generality of the proposed methodology.

Figure 16(b) shows the ERC values per variable. It is evident that when P = 5 time-series are fed to the USDL algorithm (dashed lines), the ERC is negative in both regimes for all variables. Moreover, the ERC for the transient regime is worse than for the stationary regime, which might be caused by the different threshold values. When P = 100 time-series are fed to the inference algorithm, the ERC at both regimes is positive for all variables, and the ERC takes slightly larger values in the transient regime. Nevertheless, as is evident from the precision-recall curves (Fig. 2(b) in the main text, where precision is slightly above 50%), perfect reconstruction is not achieved with stationary time-series. This is a consequence of the fact that the strength of the interaction coefficients (i.e., the non-zero elements of $A$) relative to the noise level is low. Indeed, noise is the primary driving force of the dynamics in the stationary regime; thus, it is harder to accurately infer the connectivity matrix.
Only when P = 1000 is the signal-to-noise ratio high enough for perfect reconstruction of the dynamical system. Figure 17(b) presents the RMSE results for the parameter estimation of the connectivity matrix A. The RMSE results are qualitatively similar to the precision-recall results, since the RMSE at the transient regime is two orders of magnitude smaller than the RMSE at the stationary regime. Furthermore, USDL performs better at the stationary regime than SINDy, revealing that a well-educated selection of the family of test functions can also improve the parameter estimation outcomes.
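For concreteness, the following sketch shows how such OU trajectories can be generated with the Euler-Maruyama scheme and how the stationary covariance can be obtained from the Lyapunov equation above. The connectivity matrix, time step, and trajectory length are illustrative placeholders, not the values used in the experiments.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def simulate_ou(A, sigma, dt, n_steps, x0, rng):
    """Euler-Maruyama integration of dx_t = -A x_t dt + sigma dB_t."""
    N = A.shape[0]
    x = np.empty((n_steps + 1, N))
    x[0] = x0
    for k in range(n_steps):
        x[k + 1] = x[k] - A @ x[k] * dt + sigma * np.sqrt(dt) * rng.standard_normal(N)
    return x

rng = np.random.default_rng(0)
N, sigma, dt = 20, 0.25, 1e-3
# Illustrative sparse connectivity matrix (self-degradation plus a chain of interactions).
A = np.eye(N) + 0.5 * np.diag(np.ones(N - 1), k=1)

# Transient regime: independent unit-variance Gaussian initial values, i.e., out of equilibrium.
x_transient = simulate_ou(A, sigma, dt, n_steps=2000, x0=rng.standard_normal(N), rng=rng)

# Stationary covariance: solve A Sigma + Sigma A^T + I = 0, the Lyapunov equation of the text.
Sigma = solve_continuous_lyapunov(A, -np.eye(N))
```

Inspecting the inverse of Sigma for such a sparse A illustrates the point made above: the concentration matrix obtained from i.i.d. stationary samples need not share the sparsity pattern of A.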
Figure 16: (a) Time-series of a multi-dimensional OU process at the stationary regime (upper plot) as well as at the transient regime (lower plot). (b) ERC for each variable of the OU process at both regimes and two different numbers of trajectories. ERC is positive when P = 100 time-series are given (solid lines).
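The ERC and MIP values reported in these figures can be computed directly from the measurement matrix Ψ and the true support. The sketch below uses the standard definitions from the sparse signal recovery literature (the coherence for MIP and the exact recovery coefficient of the greedy-selection condition for ERC); the paper's exact normalization conventions may differ, so treat it as an illustration.

```python
import numpy as np

def mip(Psi):
    """Mutual incoherence: largest absolute inner product between distinct,
    l2-normalized columns of the measurement matrix."""
    Psi_n = Psi / np.linalg.norm(Psi, axis=0, keepdims=True)
    G = np.abs(Psi_n.T @ Psi_n)
    np.fill_diagonal(G, 0.0)
    return G.max()

def erc(Psi, support):
    """Exact recovery coefficient for a true support S:
    ERC = 1 - max_{j not in S} || pinv(Psi_S) psi_j ||_1.
    A positive value is the classical sufficient condition for exact support recovery."""
    support = np.asarray(support)
    off = np.setdiff1d(np.arange(Psi.shape[1]), support)
    if off.size == 0:
        return 1.0
    pinv_S = np.linalg.pinv(Psi[:, support])
    return 1.0 - np.max(np.sum(np.abs(pinv_S @ Psi[:, off]), axis=0))
```

In the OU experiments, Ψ would be the weak-form measurement matrix associated with one state variable and the support would, roughly, be the non-zero pattern of the corresponding row of A.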
Figure 17: (a) The F1 score as a function of the hyperparameter for both USDL (left column) and SINDy (right column) and various numbers of time-series, using the peaky Fourier modes as test functions. (b) The respective RMSE for both USDL (solid lines) and SINDy (dashed lines). The parameter estimation is more accurate at the transient regime (by almost two orders of magnitude) due to the higher signal-to-noise ratio compared to the stationary regime.
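For completeness, a generic orthogonal matching pursuit (OMP) routine with the two hyperparameters discussed above, a maximum sparsity level K and a threshold, is sketched below. This is a textbook OMP rather than the authors' exact implementation, and the threshold is applied here to the residual norm purely for illustration; in the experiments the threshold is tuned via the F1 score.

```python
import numpy as np

def omp(Psi, z, K, threshold):
    """Greedy OMP: select at most K atoms or stop once the residual norm is below the threshold."""
    M, Q = Psi.shape
    norms = np.linalg.norm(Psi, axis=0) + 1e-12      # guard against zero columns
    support, coeffs, sol = [], np.zeros(Q), np.zeros(0)
    residual = z.copy()
    while len(support) < K and np.linalg.norm(residual) > threshold:
        corr = np.abs(Psi.T @ residual) / norms      # correlation with the current residual
        corr[support] = 0.0                          # never re-select an atom
        support.append(int(np.argmax(corr)))
        # Re-fit all selected coefficients by least squares and update the residual.
        sol, *_ = np.linalg.lstsq(Psi[:, support], z, rcond=None)
        residual = z - Psi[:, support] @ sol
    coeffs[support] = sol
    return coeffs, support
```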
E.4.1 Selection of Test Functions
In the main text, we highlight that the crucial decision is related to the definition of the test functions. Figure 18 presents the results of the USDL algorithm when M = 81 Fourier modes are deployed. Both precision and recall are about 90%. As in all other experiments, we set the USDL algorithm's threshold according to the optimal F1 score shown in Figure 19(a). Results from the SINDy algorithm are shown for comparison purposes. Looking in depth at why perfect reconstruction was not successful, we found that the problematic cases had the following pattern. When an interaction of the form x_n → x_{n'} exists in the connectivity matrix, OMP inferred two interactions: the correct one as well as x_{n'} → x_n. These two types of interaction are closely related since they both assert that x_n precedes x_{n'}. The problematic inference arises because the degradation of x_n, which is represented as x_n → x_n, is not enough to explain the dynamics of x_n; hence additional interactions are inferred. These additional edges are not random; they connect variables with strong time correlations. A way to separate these additional interactions is to observe that the time cross-correlations between one variable and the other variables are ordered according to which variables are the driving forces. Thus, we need to define another type of test function with sharp changes, able to discriminate between small time differences. For instance, multiplying the Fourier modes with sawtooth functions would work. However, the sawtooth function is not smooth; thus we propose to use the following test functions, which we call peaky Fourier modes,

φ_{2m−1}(t) = [cos²(4πmt/T) / (0.01 + cos²(4πmt/T))] cos(2πmt/T)    (35)

and

φ_{2m}(t) = [cos²(4πmt/T) / (0.01 + cos²(4πmt/T))] sin(2πmt/T)    (36)

with m = 1, ..., (M − 1)/2. We remark that the above functions are infinitely smooth but they have abrupt edges. Figure 20 shows the sine and cosine Fourier modes for m = 5 (blue lines) as well as the respective sine and cosine peaky Fourier modes (red lines).
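The peaky Fourier modes of Eqs. (35)-(36) can be implemented directly; the sketch below also evaluates a standard Fourier mode for comparison, as in Figure 20. The time grid and the choice T = 1 are arbitrary illustrative values.

```python
import numpy as np

def peaky_fourier_modes(t, M, T):
    """Constant mode plus (M - 1) peaky Fourier modes evaluated on the grid t."""
    modes = [np.ones_like(t)]
    for m in range(1, (M - 1) // 2 + 1):
        peak = np.cos(4 * np.pi * m * t / T) ** 2
        peak = peak / (0.01 + peak)              # close to 1 except near its zeros: abrupt edges
        modes.append(peak * np.cos(2 * np.pi * m * t / T))
        modes.append(peak * np.sin(2 * np.pi * m * t / T))
    return np.stack(modes, axis=0)               # shape (M, len(t))

# Compare the m = 5 cosine mode with its peaky counterpart, as in Figure 20.
T = 1.0
t = np.linspace(0.0, T, 2001)
standard_cos = np.cos(2 * np.pi * 5 * t / T)
peaky_cos = peaky_fourier_modes(t, M=81, T=T)[2 * 5 - 1]
```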
Figure 18: (a) MIP (upper panel) and ERC per state variable (lower panel) for the OU process when more test functions are considered. The number of Fourier modes is M = 81. (b) Precision (upper panel) and recall (lower panel) curves as a function of P, i.e., the number of trajectories, for both the USDL and SINDy algorithms. However, perfect reconstruction is not achieved.
Figure 19: (a) The F1 score as a function of the hyperparameter for both USDL (left column) and SINDy (right column) and various numbers of time-series, using the standard Fourier modes. (b) The respective RMSE for both USDL (solid lines) and SINDy (dashed lines).
Figure 20: Upper plot: Sine Fourier mode (blue) and its peaky version (red). Lower plot: Same as the upper plot but for the cosine Fourier mode. Notice that the peaky Fourier modes are actually smooth but they do have abrupt changes. The advantage of peaky Fourier modes as test functions is that they "extract" information adequately, as they capture variations more accurately.
F Sparse Learning of the Lorenz96 Model
Lorenz96 [29] is an idealized deterministic climate model whose nonlinear dynamical set of equations is given in Figure 21(a). We set N = 10, and for a large enough force (F ≥ 8) the system is chaotic (Figure 21(b)). The time-series of the Lorenz96 system are generated through numerical integration using a third-order Runge-Kutta scheme with a time step of 0.… The sampling frequency is 1000 Hz or, equivalently, the sampling time is 0.001, which is low enough to assume that the complete time-series is measured.
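For reference, the standard Lorenz96 equations, dx_i/dt = (x_{i+1} − x_{i−2}) x_{i−1} − x_i + F with periodic indexing, can be simulated as sketched below. A general-purpose Runge-Kutta integrator from SciPy is used here in place of the third-order scheme mentioned above, and the initial condition, observation window, and tolerances are illustrative choices.

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz96_rhs(t, x, F):
    """Standard Lorenz96 right-hand side with periodic indexing."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

N, F = 10, 8.0
rng = np.random.default_rng(1)
x0 = F * np.ones(N) + 0.01 * rng.standard_normal(N)   # small perturbation of the equilibrium

t_end, fs = 2.0, 1000.0                                # observation window and sampling rate (Hz)
t_eval = np.linspace(0.0, t_end, int(t_end * fs) + 1)  # one sample every 1/fs
sol = solve_ivp(lambda t, x: lorenz96_rhs(t, x, F), (0.0, t_end), x0,
                t_eval=t_eval, rtol=1e-8, atol=1e-10)
x = sol.y.T                                            # shape (len(t_eval), N)
```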
Figure 21: (a) The nonlinear system of equations with periodic indexing. F is the driving force, the linear term corresponds to the dissipation of energy, while the quadratic terms correspond to convection (mixing). Force values above 8 result in chaotic behavior [29]. (b) Time-series (or trajectories) of Lorenz96 for two values of F. The upper plot shows the time-series under weak forcing (F = 8) while the lower plot shows the time-series under strong forcing (F = 40).

In order to perform sparse learning inference, the dictionary elements have to be determined. The chosen dictionary, ψ(x), contains the constant term, the linear terms, and all quadratic combinations, resulting in Q = 66 dictionary atoms (or constructed features) in total. Then, the elements of the vectors z_n, with n = 1, ..., N, and of the matrix Ψ are numerically estimated using the trapezoidal rule; a minimal sketch of this construction is given below. Even though this is a deterministic system and should fall under the noiseless SSR case, this is not true in practice, since there are errors, first, in the construction/discretization of the time-series due to the numerical scheme and, second, in the numerical evaluation of the integrals.
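The sketch below builds the constant-linear-quadratic dictionary (Q = 66 for N = 10) and assembles the weak-form quantities with the trapezoidal rule. The particular weak form assumed here (integration of the dictionary against a test function, with integration by parts on the left-hand side) is a common choice and only an assumption; the boundary-term handling in the actual implementation may differ.

```python
import numpy as np
from itertools import combinations_with_replacement
from scipy.integrate import trapezoid

def quadratic_dictionary(x):
    """Constant, linear and all quadratic monomials of the state; x has shape (T, N)."""
    T, N = x.shape
    cols = [np.ones(T)] + [x[:, i] for i in range(N)]
    cols += [x[:, i] * x[:, j] for i, j in combinations_with_replacement(range(N), 2)]
    return np.stack(cols, axis=1)                  # shape (T, Q), Q = 1 + N + N(N + 1)/2

def weak_form_system(t, x, phis, dphis, n):
    """Assemble (z_n, Psi) for state variable n from test functions phi_m (callables)
    and their derivatives dphi_m, using trapezoidal quadrature.
    Assumed weak form: z_n[m] = [phi_m x_n] at the endpoints - int phi_m' x_n dt,
                       Psi[m, q] = int phi_m psi_q(x(t)) dt."""
    D = quadratic_dictionary(x)
    M, Q = len(phis), D.shape[1]
    z, Psi = np.empty(M), np.empty((M, Q))
    for m, (phi, dphi) in enumerate(zip(phis, dphis)):
        pm, dpm = phi(t), dphi(t)
        z[m] = pm[-1] * x[-1, n] - pm[0] * x[0, n] - trapezoid(dpm * x[:, n], t)
        Psi[m] = trapezoid(pm[:, None] * D, t, axis=0)
    return z, Psi
```

With z_n and Ψ assembled, the sparse recovery step (for instance, the OMP sketch given earlier) estimates the coefficients of the n-th equation.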
We tested two cases, one with the force taking a low value (F = 8) and another with the force taking a high value (F = 40). In total, M = 41 Fourier modes are used, consisting of the constant function, 20 sines, and 20 cosines on the observation interval. The threshold value for the OMP algorithm was set to α = 0.… and the sparsity level to K = 7, which is larger than the true value of 4. As quantified by the precision-recall subplots (see Figure 22(a)), the performance in terms of both precision and recall improves as the number of measured time-series increases. It actually reaches perfect reconstruction when P = 5 for the strong forcing case and when P = 20 for the weak forcing case. In general, stronger forces, which result in more chaotic behavior and stronger mixing, are helpful in identifying the true model, as is evident from the fact that the red curves outperform the blue ones. Figure 22(a) also presents the reconstruction accuracy of the SINDy algorithm [4] (dashed lines), with its hyperparameter value set to λ = 0.1. We employ the central difference scheme, which is a second-order method, for the numerical estimation of the derivatives. Interestingly, SINDy requires fewer time-series in order to achieve perfect reconstruction in both parameter regimes, showing that Lasso is a competitive alternative for solving the SSR problem.

Figure 22(b) shows the RMSE between the true connectivity matrix and the estimated one for both algorithms and both parameter regimes. As expected, the RMSE decreases as the data size increases until it hits a plateau, which stems from the discretization error of the various numerical integrations. Evidently, the RMSE performance of USDL is better than that of SINDy in the strong forcing regime, while the opposite is true in the weak forcing regime. Moreover, SSR metrics such as MIP and ERC correlate well with the performance of the USDL algorithm, as depicted in the subplots of Figure 22(c). The ERC values per variable are (statistically) equal, as expected because of the symmetry in the dynamical system's state variables. We remark also that the theoretical guarantees on perfect reconstruction are satisfied only under strong forcing and when at least P = 20 time-series of length 2 are taken into account, since then ERC is positive for all state variables. In contrast, MIP is not small enough; it should be below 0.…

F.1 Further Experimentation
We present a series of comparisons under various setups and assess the performance of the USDL algorithm. The first variation uses a different family of test functions, keeping all the other hyperparameters fixed. A set of M = 40 equally-spaced basis spline (B-spline) functions of second order is employed as test functions (a construction sketch is given after this discussion). Figure 24 presents the performance of the proposed approach in terms of precision and recall curves (upper left plot), in terms of RMSE (upper right plot), as well as metrics from SSR theory (lower plot). Results with B-splines are overall similar to the case of Fourier modes. There is a minor difference, with USDL performing slightly better in terms of precision for weak forcing (blue in the upper curve of Figure 24(a)) when the number of trajectories is in the range of 3-10. For comparison purposes, we also present the accuracy performance of the SINDy algorithm, which is the same as in Figure 22(a).

The second variation utilizes fewer test functions. Using fewer test functions is expected to weaken the inference method since less information is fed to the algorithm. Figure 25 shows the same quantities as in the first variation when only M = 11 Fourier modes are computed. Interestingly, the performance is similar to the case with 41 Fourier modes when at least P = 5 trajectories are provided, for both weak and strong forces. The performance deteriorates significantly, both in terms of precision and recall, when the force is strong and only one or two trajectories are measured. This behavior is in accordance with the ERC, which drops significantly in this case (see Figure 25(c)).

The third variation enlarges the dictionary, ψ(x), by considering additional candidate functions as driving forces of the dynamical system. Thus, apart from the linear and quadratic terms, we add the complete set of cubic combinations, resulting in Q = 286 dictionary atoms for the case of N = 10 state variables. Figure 26 presents the performance of both the USDL (solid lines) and SINDy (dashed lines) algorithms (upper plots) under this setup, as well as the incoherence metrics (lower plots). Starting with the metrics of SSR theory, it is evident that MIP increases for both weak and strong forcing, reflecting the fact that the additional atoms increase the collinearity of the measurement matrix Ψ. In contrast, ERC remains almost the same. In terms of performance, both precision and recall deteriorate for the case of weak forcing (blue curves in Figure 26(a)). However, as the number of time-series increases, the performance of both algorithms greatly improves, from almost 0 to almost 100 percent. For the strong forcing case (red curves in Figure 26(a)), apart from the case where very few trajectories are provided, perfect reconstruction of the dynamical system is observed, showing once again that strong chaos assists the dynamical model inference.
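The B-spline test functions of the first variation can be constructed, for instance, with SciPy's B-spline basis elements; the sketch below builds M equally-spaced basis functions of a given degree on the observation interval. Reading "second-order" as degree 2 and the particular knot placement are assumptions for illustration.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_test_functions(T, M, degree=2):
    """M equally-spaced B-spline basis functions of the given degree on [0, T]."""
    knots = np.linspace(0.0, T, M + degree + 1)
    basis = []
    for i in range(M):
        b = BSpline.basis_element(knots[i:i + degree + 2], extrapolate=False)
        # Outside its local support a basis element evaluates to nan; map that to zero.
        basis.append(lambda s, b=b: np.nan_to_num(b(s)))
    return basis

# Example: evaluate M = 40 quadratic B-splines on a time grid over an interval of length 2.
T = 2.0
t = np.linspace(0.0, T, 2001)
phis = bspline_test_functions(T, M=40, degree=2)
values = np.stack([phi(t) for phi in phis], axis=0)     # shape (40, len(t))
```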
Figure 22: Performance analysis of the USDL algorithm for the nonlinear Lorenz96 climate model. (a) Precision and recall curves under two different parameter regimes: weak force (blue) and strong force (red). Results from both USDL (solid lines) and SINDy [4] (dashed lines) are presented. Perfect reconstruction of the dynamical system is achieved when enough time-series are measured, for both parameter regimes and both inference algorithms. However, the SINDy algorithm requires fewer time-series in order to achieve perfect reconstruction in both regimes, revealing that there are cases where the iterative thresholding algorithm that is part of SINDy outperforms the greedy OMP algorithm that is part of the proposed approach. Moreover, the driving force F affects the performance of both algorithms for the same number of time-series; a higher value of F, which implies more chaotic behavior, results in higher inference accuracy. (b) RMSE as a function of the number of time-series for USDL (solid lines) and SINDy (dashed lines). Except for the case of F = 8 and the USDL algorithm (blue solid line), the RMSE quickly reaches a plateau whose value quantifies the total discretization error. (c) ERC value for each state variable, supplemented with confidence intervals. Weak (blue) and strong (red) forces are considered, while one (dashed) or twenty time-series are fed to the algorithm. Greater values of ERC are observed under strong forcing, revealing that chaos assists the correct inference of the dynamics. True recovery is theoretically guaranteed only for the strong forcing case and when at least 20 time-series are measured, since then ERC is positive.

In contrast to the previous experiments, the USDL algorithm performs better when the number of trajectories is low, while the SINDy algorithm performs better when the number of trajectories is above P = 10, where we observe perfect reconstruction of the dynamical system even for the weak force case (dashed blue lines in Figure 26(a)).

In all the above variations, the RMSE performance was minimally affected by the different setups. The observed quantitative differences were in accordance with the precision-recall results. Moreover, we repeated the fine-tuning procedure for the hyperparameter values and the results were similar to Figure 23; therefore we keep the same values for both algorithms' hyperparameters.

The final comparisons demonstrate the effect of the sampling frequency on the sparse inference algorithms. Up to now, we considered a sampling rate of 1000 Hz, meaning that we sampled the time-series every 0.001 time units, which is dense enough for the measured signal to be considered practically continuous.
Figure 23: The harmonic mean of averaged precision and averaged recall (i.e., the F1 score) as a function of the hyperparameter for both USDL (left column) and SINDy (right column), for various numbers of time-series and forces. There is a large parametric region where optimal values are achieved for both algorithms.

In real applications, high sampling frequency results in increased costs, and typically there is a trade-off between expenses and sampling rate. Figure 27 presents various performance metrics as well as both SSR incoherence metrics when the sampling rate is dropped from 1000 Hz to 100 Hz. Results are similar except for the strong forcing case with one trajectory, where the precision decreases (red curves in the upper panel of Figure 27(a)). The relative change of ERC (dashed red line in the lower panel of Figure 27(c)) captures this deterioration. Notice that such deterioration is expected, since strong forces result in larger and sharper modulations of the time-series. Again, the SINDy algorithm achieves perfect reconstruction with fewer time-series compared to the USDL algorithm. Overall, the performance is satisfactory when the sampling frequency drops by a factor of 10. However, if the sampling frequency drops by a factor of 100, the performance is severely reduced for the USDL algorithm (solid lines) and moderately reduced for the SINDy algorithm (dashed lines), as Figure 28(a) asserts. The increase of the RMSE shown in Figure 28(b) further confirms the deterioration of the inference methods. There are two major factors contributing to this behavior. First, the numerical integration produces larger errors and, second, the time-series are sampled below the Nyquist frequency, which results in aliasing and heavy distortion of the signals. Consequently, the dynamical system inference fails in this case, revealing the sensitivity of the proposed method to adequate sampling. Finally, we remark that we optimized over the hyperparameter values using the F1 score as a measure of goodness. The results on the F1 score for both cases are shown in Figure 29, where the severe effect of over-downsampling is also evident (right plots). Moreover, the optimal parameter region has shrunk, implying that the inference methods are more sensitive to the hyperparameter value.
Figure 24: Results for the Lorenz96 model when M = 40 second-order basis spline functions are used as test functions. (a) Precision (upper panel) and recall (lower panel) curves as a function of P, i.e., the number of trajectories, for both USDL (solid lines) and SINDy (dashed lines) algorithms. The performance of the USDL algorithm is overall similar to the performance when Fourier modes are used. (b) RMSE as a function of the number of time-series for USDL (solid lines) and SINDy (dashed lines). It frequently occurs in our experiments that the RMSE for the USDL algorithm when F = 8 (blue solid) is worse than when F = 40 (red solid), while the opposite frequently happens for the SINDy algorithm. (c) MIP (upper panel) and ERC per state variable (lower panel).
Figure 25: Results for the Lorenz96 model when M = 11 Fourier modes are used as test functions. (a) Precision (upper panel) and recall (lower panel) curves as a function of P, i.e., the number of trajectories, for both USDL (solid lines) and SINDy (dashed lines) algorithms. Even though four times fewer test functions are used, there is little or no deterioration of the performance in most of the cases (P ≥ 5).
Figure 26: Results for the Lorenz96 model when cubic terms are added to the dictionary. The size of the dictionary is approximately five times larger. (a) Precision (upper panel) and recall (lower panel) curves as a function of P, i.e., the number of trajectories, for both USDL (solid lines) and SINDy (dashed lines) algorithms. Performance is slightly worse for the weak forcing case (blue), while perfect reconstruction is achieved when the force is strong (red) and more than P = 5 time-series are measured, for both algorithms. Despite being less accurate for small P, the SINDy algorithm is able to perfectly reconstruct the dynamical system under the weak force when P = 10 trajectories are provided. (b) RMSE as a function of the number of time-series for USDL (solid lines) and SINDy (dashed lines). The RMSE results are in accordance with the precision and recall curves. (c) MIP (upper panel) and ERC per state variable (lower panel).
Figure 27: Results for the Lorenz96 model when the time-series are sampled at a rate of 100 Hz instead of 1000 Hz. (a) Precision (upper panel) and recall (lower panel) curves as a function of P, i.e., the number of trajectories, for both USDL (solid lines) and SINDy (dashed lines) algorithms. A minor reduction in the performance of both USDL and SINDy is observed. (b) RMSE as a function of the number of time-series for USDL (solid lines) and SINDy (dashed lines). Results are qualitatively similar to the case where the rate is 1000 Hz (shown in Figure 22), while from a quantitative perspective the RMSE is now two orders of magnitude worse compared to Figure 22. (c) MIP (upper panel) and ERC per state variable (lower panel).
Figure 28: Results for the Lorenz96 model when the time-series are sampled at a rate of 10 Hz instead of 1000 Hz. (a) Precision (upper panel) and recall (lower panel) curves as a function of P, i.e., the number of trajectories, for both USDL (solid lines) and SINDy (dashed lines) algorithms. The impact of sub-sampling by a factor of 100 is severe, especially for the USDL algorithm, resulting in poor performance of the inference methods, primarily in terms of precision. As expected, inference is more accurate under weak forcing, due to insufficient sampling in the strong forcing case. (b) RMSE as a function of the number of time-series for USDL (solid lines) and SINDy (dashed lines). Stronger forces result in a deterioration of the RMSE performance for both algorithms. (c) MIP (upper panel) and ERC per state variable (lower panel).
[Figure 29: F1 score as a function of the hyperparameter for USDL (alpha) and SINDy (lambda) under weak (F = 8) and strong (F = 40) forcing and various numbers of time-series; panel (a) corresponds to a sampling frequency of 100 Hz.]