Inference of dynamic systems from noisy and sparse data via manifold-constrained Gaussian processes
Shihao Yang∗   Samuel W. K. Wong†   S. C. Kou‡

September 30, 2020

Key Points
Ordinary differential equations are a ubiquitous tool for modeling behaviors in science, such as gene regulation, epidemics and ecology. An important problem is to infer and characterize the uncertainty of parameters that govern the equations. Here we present an accurate and fast inference method using manifold-constrained Gaussian processes, such that the derivatives of the Gaussian process must satisfy the dynamics of the differential equations. Our method completely avoids the use of numerical integration and is thus fast to compute. Our construction is embedded in a principled statistical framework and is demonstrated to yield fast and reliable inference in a variety of practical problems. Our method works even when some system component(s) is/are unobserved, which is a significant challenge for previous methods.

Keywords: Parameter estimation | Ordinary differential equations | Posterior sampling | Inverse problem

∗ The H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, USA
† Department of Statistics and Actuarial Science, University of Waterloo, Canada
‡ To whom correspondence should be addressed. Department of Statistics, Harvard University, USA. [email protected]
Abstract
Parameter estimation for nonlinear dynamic system models, represented by ordinary differential equations (ODEs), using noisy and sparse experimental data is a vital task in many fields. We propose a fast and accurate method, MAGI (MAnifold-constrained Gaussian process Inference), for this task. MAGI uses a Gaussian process model over time-series data, explicitly conditioned on the manifold constraint that derivatives of the Gaussian process must satisfy the ODE system. By doing so, we completely bypass the need for numerical integration and achieve substantial savings in computational time. MAGI is also suitable for inference with unobserved system components, which often occur in real experiments. MAGI is distinct from existing approaches in that we provide a principled statistical construction under a Bayesian framework, which incorporates the ODE system through the manifold constraint. We demonstrate the accuracy and speed of MAGI using realistic examples based on physical experiments.
Introduction
Dynamic systems, represented as a set of ordinary differential equations (ODEs), are commonly used to model behaviors in scientific domains, such as gene regulation [1], the spread of disease [2], and ecology [3]. We focus on models specified by a set of ODEs
$$\dot{x}(t) = \frac{dx(t)}{dt} = f(x(t), \theta, t), \quad t \in [0, T], \qquad (1)$$
where the vector $x(t)$ contains the system outputs that evolve over time $t$, and $\theta$ is the vector of model parameters to be estimated from experimental data. When $f$ is nonlinear, solving for $x(t)$ given initial conditions $x(0)$ and $\theta$ generally requires a numerical integration method, such as Runge-Kutta.

Historically, ODEs have mainly been used for conceptual or theoretical understanding rather than data fitting, as experimental data were limited. Advances in experimental and data-collection techniques have increased the capacity to follow dynamic systems closer to real time. Such data will generally be recorded at discrete times and be subject to measurement error. Thus, we assume that we observe $y(\tau) = x(\tau) + \epsilon(\tau)$ at a set of observation time points $\tau$, with error $\epsilon$ governed by noise level $\sigma$. Our focus here is inference of $\theta$ given $y(\tau)$, with emphasis on nonlinear $f$, where specialized methods that exploit a linear structure, e.g. [4, 5], are not generally applicable. We shall present a coherent, statistically principled framework for dynamic system inference with the help of Gaussian processes (GPs). The key of our method is to restrict the GPs to the manifold that satisfies the ODE system: thus we name our method MAGI (MAnifold-constrained Gaussian process Inference). Placing a GP on $x(t)$ facilitates inference of $\theta$ without numerical integration, and our explicit manifold constraint is the key novel idea that addresses the conceptual incompatibility between the GP and the specification of the ODE model, as we shall discuss shortly when overviewing our method. We show that the resulting parameter inference is computationally efficient, statistically principled, and effective in a variety of practical scenarios. In particular, MAGI works even when some system component(s) are unobserved. To the best of our knowledge, none of the currently available software packages can analyze systems with unobserved component(s).
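To make the data-generating setup concrete, the following minimal Python sketch (hypothetical helper code, not part of the MAGI package; the function name and solver settings are our own choices) produces noisy observations $y(\tau) = x(\tau) + \epsilon(\tau)$ from a given ODE by numerical integration. Note that this integration step is only needed to simulate data; MAGI's inference itself never requires it.

```python
import numpy as np
from scipy.integrate import solve_ivp

def simulate_noisy_observations(f, x0, theta, tau, sigma, seed=0):
    """Simulate y(tau) = x(tau) + eps(tau) from dx/dt = f(x, theta, t).

    Used only to generate synthetic data; MAGI's inference bypasses
    numerical integration entirely.
    """
    rng = np.random.default_rng(seed)
    sol = solve_ivp(lambda t, x: f(x, theta, t), (0.0, tau[-1]), x0,
                    t_eval=tau, rtol=1e-8, atol=1e-8)
    x_tau = sol.y.T                     # shape (len(tau), D)
    eps = rng.normal(scale=sigma, size=x_tau.shape)
    return x_tau + eps                  # noisy observations y(tau)
```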
Overview of our method

Following the Bayesian paradigm, we view the $D$-dimensional system $x(t)$ as a realization of the stochastic process $X(t) = (X_1(t), \ldots, X_D(t))$, and the model parameters $\theta$ as a realization of the random variable $\Theta$. In Bayesian statistics, the basis of inference is the posterior distribution, obtained by combining the likelihood function with a chosen prior distribution on the unknown parameters and stochastic processes. Specifically, we impose a general prior distribution $\pi$ on $\theta$ and independent GP prior distributions on each component $X_d(t)$, so that $X_d(t) \sim \mathcal{GP}(\mu_d, K_d)$, $t \in [0, T]$, where $K_d: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a positive definite covariance kernel for the GP and $\mu_d: \mathbb{R} \to \mathbb{R}$ is the mean function. Then, for any finite set of time points $\tau_d$, $X_d(\tau_d)$ has a multivariate Gaussian distribution with mean vector $\mu_d(\tau_d)$ and covariance matrix $K_d(\tau_d, \tau_d)$. Denote the observations by $y(\tau) = (y_1(\tau_1), \ldots, y_D(\tau_D))$, where $\tau = (\tau_1, \tau_2, \ldots, \tau_D)$ is the collection of all observation time points, and each component $X_d$ can have its own set of observation times $\tau_d = (\tau_{d,1}, \ldots, \tau_{d,N_d})$. If the $d$-th component is not observed, then $N_d = 0$ and $\tau_d = \emptyset$. $N = N_1 + \cdots + N_D$ is the total number of observations. We note that for the remainder of the paper, the notation $t$ refers to time generically, while $\tau$ refers specifically to the observation time points.

As an illustrative example, consider the dynamic system in [1] that governs the oscillation of Hes1 mRNA ($M$) and Hes1 protein ($P$) levels in cultured cells, where it is postulated that a Hes1-interacting ($H$) factor contributes to a stable oscillation.
Figure 1: Inference by MAGI for the Hes1 partially observed asynchronous system on 2000 simulated datasets. The red curve is the truth. MAGI recovers the system well, without the use of any numerical solver: the green curve shows the median of the inferred trajectories among the 2000 simulated datasets, and a 95% interval from the 2.5 and 97.5 percentiles of all inferred trajectories is shown via the blue shaded area. [Panels: sample observations; P component (partially observed); M component (partially observed); H component (unobserved).]

The ODEs of the three-component system $X = (P, M, H)$ are
$$f(X, \theta, t) = \begin{pmatrix} -aPH + bM - cP \\ -dM + \dfrac{e}{1 + P^2} \\ -aPH + \dfrac{f}{1 + P^2} - gH \end{pmatrix},$$
where $\theta = (a, b, c, d, e, f, g)$ are the associated parameters. In Fig 1 (left panel) we show noise-contaminated data generated from the system, which closely mimics the experimental setup described in [1]: $P$ and $M$ are observed at 15-minute intervals for 4 hours, but $H$ is never observed. In addition, the $P$ and $M$ observations are asynchronous: starting at time 0, every 15 minutes we observe $P$; starting at 7.5 minutes, every 15 minutes we observe $M$; $P$ and $M$ are never observed at the same time. It can be seen that the mRNA and protein levels exhibit the behavior of regulation via negative feedback.

The goal here is to infer the seven parameters of the system: $a, b$ govern the rate of protein synthesis in the presence of the interacting factor; $c, d, g$ are the rates of decomposition; and $e, f$ are inhibition rates. The unobserved $H$ component poses a challenge for most existing methods, but is capably handled by MAGI: the $P$ and $M$ panels of Fig 1 show that our inferred trajectories provide good fits to the observed data, and the $H$ panel shows that the dynamics of the entirely unobserved $H$ component are largely recovered as well. We emphasize that these trajectories are inferred without any use of numerical solvers. We shall return to this example in detail in the Results section.

Intuitively, the GP prior on $X(t)$ facilitates computation, as the GP provides closed analytical forms for $\dot{X}(t)$ and $X(t)$, which can bypass the need for numerical integration. In particular, with a GP prior on $X(t)$, the conditional distribution of $\dot{X}(t)$ given $X(t)$ is also a GP, with its mean function and covariance kernel completely specified. This GP specification for the derivatives $\dot{x}(t)$, however, is inherently incompatible with the ODE model, because (1) also completely specifies $\dot{x}(t)$ given $x(t)$ (via the function $f$). As a key novel contribution of our method, MAGI addresses this conceptual incompatibility by constraining the GP to satisfy the ODE model in (1). To do so, we first define a random variable $W$ quantifying the difference between the stochastic process $X(t)$ and the ODE structure with a given value of the parameter $\theta$:
$$W = \sup_{t \in [0,T],\, d \in \{1,\ldots,D\}} \big| \dot{X}_d(t) - f(X(t), \theta, t)_d \big|. \qquad (2)$$
$W \equiv 0$ if and only if the ODEs with parameter $\theta$ are satisfied by $X(t)$. Therefore, ideally, the posterior distribution for $X(t)$ and $\theta$ given the observations $y(\tau)$ and the ODE structure, $W \equiv 0$, is (informally)
$$p_{\Theta, X(t) \mid W, Y(\tau)}(\theta, x(t) \mid W = 0, Y(\tau) = y(\tau)). \qquad (3)$$
While (3) is the ideal posterior, in reality $W$ is not generally computable.
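For concreteness, this vector field can be written as a short Python function (a hypothetical sketch with our own naming conventions, not the MAGI package API), which can be passed directly to the simulation sketch shown in the Introduction:

```python
import numpy as np

def hes1_ode(x, theta, t):
    """Hes1 vector field f(X, theta, t) for X = (P, M, H).

    theta = (a, b, c, d, e, f, g); t is unused because the system is
    autonomous, but is kept for a uniform signature.
    """
    P, M, H = x
    a, b, c, d, e, f, g = theta
    return np.array([
        -a * P * H + b * M - c * P,             # dP/dt
        -d * M + e / (1.0 + P**2),              # dM/dt
        -a * P * H + f / (1.0 + P**2) - g * H,  # dH/dt
    ])
```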
In practice we approximate $W$ by finite discretization on the set $I = (t_1, t_2, \ldots, t_n)$ such that $\tau \subset I \subset [0, T]$, and similarly define $W_I$ as
$$W_I = \max_{t \in I,\, d \in \{1,\ldots,D\}} \big| \dot{X}_d(t) - f(X(t), \theta, t)_d \big|. \qquad (4)$$
Note that $W_I$ is the maximum over a finite set, and $W_I \to W$ monotonically as $I$ becomes dense in $[0, T]$. Therefore, the practically computable posterior distribution is
$$p_{\Theta, X(I) \mid W_I, Y(\tau)}(\theta, x(I) \mid W_I = 0, Y(\tau) = y(\tau)).$$
As derived in the SI, this posterior distribution can be decomposed as
$$p_{\Theta, X(I) \mid W_I, Y(\tau)}(\theta, x(I) \mid W_I = 0, Y(\tau) = y(\tau)) \propto P(\Theta = \theta, X(I) = x(I), W_I = 0, Y(\tau) = y(\tau))$$
$$= \pi_{\Theta}(\theta) \times \underbrace{P(X(I) = x(I))}_{(1)} \times \underbrace{P(Y(\tau) = y(\tau) \mid X(\tau) = x(\tau))}_{(2)} \times \underbrace{P\big(\dot{X}(I) - f(x(I), \theta, t_I) = 0 \mid X(I) = x(I)\big)}_{(3)},$$
where, in the decomposition, (1) corresponds to the GP prior on $X$, (2) corresponds to the noisy observations, and (3) corresponds to $W_I = 0$ given $X(I)$ and $\theta$. All three terms are multivariate Gaussian; (3) is Gaussian because $\dot{X}(I)$ given $X(I)$ has a multivariate Gaussian distribution as long as the kernel $K$ is twice differentiable. Thus, the practically computable posterior simplifies to
$$p_{\Theta, X(I) \mid W_I, Y(\tau)}(\theta, x(I) \mid W_I = 0, Y(\tau) = y(\tau)) \qquad (5)$$
$$\propto \pi_{\Theta}(\theta) \exp\bigg\{ -\frac{1}{2} \sum_{d=1}^{D} \bigg[ \underbrace{|I| \log(2\pi) + \log|C_d| + \big\|x_d(I) - \mu_d(I)\big\|^2_{C_d^{-1}}}_{(1)} + \underbrace{|I| \log(2\pi) + \log|K_d| + \big\| f^{x,\theta}_{d,I} - \dot{\mu}_d(I) - m_d\{x_d(I) - \mu_d(I)\} \big\|^2_{K_d^{-1}}}_{(3)} + \underbrace{N_d \log(2\pi\sigma_d^2) + \big\|x_d(\tau_d) - y_d(\tau_d)\big\|^2_{\sigma_d^{-2}}}_{(2)} \bigg] \bigg\},$$
where $\|v\|^2_A = v^{\intercal} A v$, $|I|$ is the cardinality of $I$, $f^{x,\theta}_{d,I}$ is short for the $d$-th component of $f(x(I), \theta, t_I)$, and the multivariate Gaussian covariance matrix $C_d$ and the matrix $K_d$ can be derived as follows for each component $d$ (dropping the index $d$ for brevity):
$$C = K(I, I), \qquad m = {}'K(I, I)\, K(I, I)^{-1}, \qquad K = K''(I, I) - {}'K(I, I)\, K(I, I)^{-1}\, K'(I, I), \qquad (6)$$
where ${}'K = \frac{\partial}{\partial s} K(s, t)$, $K' = \frac{\partial}{\partial t} K(s, t)$, and $K'' = \frac{\partial^2}{\partial s\, \partial t} K(s, t)$.

In practice we choose the Matern kernel
$$K(s, t) = \phi_1\, \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \sqrt{2\nu}\, \frac{l}{\phi_2} \right)^{\nu} B_{\nu}\!\left( \sqrt{2\nu}\, \frac{l}{\phi_2} \right),$$
where $l = |s - t|$, $\Gamma$ is the Gamma function, and $B_{\nu}$ is the modified Bessel function of the second kind; the degree of freedom $\nu$ is set to 2.01 to ensure that the kernel is twice differentiable. $K$ has two hyper-parameters, $\phi_1$ and $\phi_2$; their meaning and specification are discussed in the Materials and Methods section.
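To illustrate how the matrices in (6) arise in practice, here is a minimal hypothetical Python sketch that computes $C$, $m$, and the conditional covariance of the derivative process for a kernel with simple closed-form derivatives. For brevity it uses a squared-exponential kernel in place of the paper's Matern kernel with $\nu = 2.01$ (a simplifying assumption of ours); the construction is identical once ${}'K$, $K'$, and $K''$ are available.

```python
import numpy as np

def rbf_kernel_blocks(I, phi1, phi2):
    """Compute C = K(I,I), m, and K_dd of Eq (6) for the GP derivative
    process, using K(s,t) = phi1 * exp(-(s-t)^2 / (2*phi2^2)) as a
    stand-in for the Matern(nu=2.01) kernel used in the paper.
    """
    I = np.asarray(I, dtype=float)
    diff = I[:, None] - I[None, :]                    # s - t
    K = phi1 * np.exp(-diff**2 / (2 * phi2**2))
    dK_ds = -(diff / phi2**2) * K                     # 'K  = dK/ds
    dK_dt = (diff / phi2**2) * K                      # K'  = dK/dt
    d2K = (1.0 / phi2**2 - diff**2 / phi2**4) * K     # K'' = d2K/(ds dt)
    Kinv = np.linalg.inv(K + 1e-10 * np.eye(len(I)))  # jitter for stability
    C = K                                             # prior covariance of x(I)
    m = dK_ds @ Kinv                                  # maps x(I) to E[x'(I) | x(I)]
    K_dd = d2K - dK_ds @ Kinv @ dK_dt                 # Cov[x'(I) | x(I)]
    return C, m, K_dd
```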
Review of related work
The problem of dynamic system inference has been studied in the literature, which we now briefly review. We first note that a simple approach to constructing the 'ideal' likelihood function is according to $p(y(\tau) \mid \hat{x}(t, \theta, x(0)), \sigma)$, where $\hat{x}(t, \theta, x(0))$ is the numerical solution of the ODE obtained by numerical integration given $\theta$ and the initial conditions. This approach suffers from a high computational burden: numerical integration is required for every $\theta$ sampled in an optimization or Markov chain Monte Carlo (MCMC) routine [6]. Smoothing methods have been useful for eliminating the dependence on numerical ODE solutions, and an innovative penalized likelihood approach [7] uses a B-spline basis for constructing estimated functions that simultaneously satisfy the ODE system and fit the observed data. While in principle the method in [7] can handle an unobserved system component, substantive manual input is required, as we show in the Results, which contrasts with the ready-made solution that MAGI provides.

As an alternative to the penalized likelihood approach, GPs are a natural candidate for fulfilling the smoothing role in a Bayesian paradigm, due to their flexibility and analytic tractability [8]. The use of GPs to approximate the dynamic system and facilitate computation has been previously studied by a number of authors [6, 9, 10, 11, 12, 13]. The basic idea is to specify a joint GP over $y$, $x$, $\dot{x}$ with hyperparameters $\phi$, and then provide a factorization of the joint density $p(y, x, \dot{x}, \theta, \phi, \sigma)$ that is suitable for inference. The main challenge is to find a coherent way to combine information from two distinct sources: the approximation to the system by the GP, governed by hyperparameters $\phi$, and the actual dynamic system equations, governed by parameters $\theta$. In [6, 9], the proposed factorization is $p(y, x, \dot{x}, \theta, \phi, \sigma) = p(y \mid x, \sigma)\, p(\dot{x} \mid x, \theta, \phi)\, p(x \mid \phi)\, p(\phi)\, p(\theta)$, where $p(y \mid x, \sigma)$ comes from the observation model and $p(x \mid \phi)$ comes from the GP prior, as in our approach. However, there are significant conceptual difficulties in specifying $p(\dot{x} \mid x, \theta, \phi)$: on one hand, the distribution of $\dot{x}$ is completely determined by the GP given $x$, while on the other hand $\dot{x}$ is completely specified by the ODE system $\dot{x} = f(x, \theta, t)$; these two are incompatible. Previous authors have attempted to circumvent this incompatibility of the GP and the ODE system: [6, 9] use a product-of-experts heuristic by letting $p(\dot{x} \mid x, \theta, \phi) \propto p(\dot{x} \mid x, \phi)\, p(\dot{x} \mid x, \theta)$, where the two distributions in the product come from the GP and a noisy version of the ODE, respectively. In [13], the authors arrive at the same posterior as [6, 9] by assuming an alternative graphical model that bypasses the product-of-experts heuristic; nonetheless, the method requires working with an artificial noisy version of the ODE. In [10], the authors start with a different factorization: $p(y, x, \dot{x}, \theta, \phi, \sigma) = p(y \mid \dot{x}, \phi, \sigma)\, p(\dot{x} \mid x, \theta)\, p(x \mid \phi)\, p(\phi)\, p(\theta)$, where $p(y \mid \dot{x}, \phi, \sigma)$ and $p(x \mid \phi)$ are given by the GP and $p(\dot{x} \mid x, \theta)$ is a Dirac delta distribution given by the ODE. However, this factorization is incompatible with the observation model $p(y \mid x, \sigma)$, as discussed in detail in [14]. There is other related work that uses GPs in an ad hoc, partial fashion to aid inference. In [11], GP regression is used to obtain the means of $x$ and $\dot{x}$ for embedding within an Approximate Bayesian Computation estimation procedure. In [12], GP smoothing is used during an initial burn-in phase as a proxy for the likelihood, before switching to the 'ideal' likelihood to obtain the final MCMC samples. While empirical results from the aforementioned studies are promising, a principled statistical framework for inference that addresses the previously noted conceptual incompatibility between the GP and ODE specifications is lacking. Our work presents one such principled statistical framework, through the explicit manifold constraint. MAGI is therefore distinct from recent GP-based approaches [9, 13] and from any other Bayesian analogs of [7].

In addition to resolving the conceptual incompatibility, none of the existing methods offers a practical solution for a system with unobserved component(s), which highlights another unique and important contribution of our approach.

Results
We apply MAGI to three systems. We begin with an illustration that demonstrates the effectiveness of MAGI in practical problems with unobserved system component(s). Then, we make comparisons with other current methods on two benchmark systems, which show that our proposed method provides more accurate inference while having much faster runtime.
Illustration: Hes1 model
The Hes1 model described in the Introduction demonstrates inference on a system with an unobserved component and asynchronous observation times. This section continues the inference for this model. Ref [1] studied the theoretical oscillation behavior using parameter values a = 0.022, b = 0.3, c = 0.031, d = 0.028, e = 0.5, f = 20, g = 0.3, which lead to one oscillation cycle approximately every 2 hours. Ref [1] also set the initial condition at the lowest value of P when the system is in oscillation equilibrium: P = 1.439, M = 2.037, H = 17.904. The noise level in our simulation is derived from [1], where the standard errors based on repeated measures are reported to be around 15% of the P (protein) level and M (mRNA) level, so we set the simulation noise to be multiplicative, following a log-normal distribution with standard deviation 0.15; throughout this example we assume the noise level σ is known to be 0.15 from the repeated measures reported in [1]. The H component is never observed. Owing to the multiplicative error on this strictly positive system, we apply our method to the log-transformed ODEs, so that the resulting error distributions are Gaussian. To the best of our knowledge, MAGI is the only method that provides a practical and complete solution for handling hidden-component cases like this example.

We generate 2000 simulated datasets based on the above setup for the Hes1 system. The left-most panel in Fig 1 shows one example dataset. For each dataset, we use MAGI to infer the trajectories and estimate the parameters. We use the posterior mean of X(t) = (P, M, H)(t) as the inferred trajectories for the system components, which are generated by MAGI without using any numerical solver. Fig 1 summarizes the inferred trajectories across the 2000 simulated datasets, showing the median of the inferred trajectories of X(t) together with the 95% interval of the inferred trajectories represented by the 2.5 and 97.5 percentiles. The posterior mean of θ = (a, b, c, d, e, f, g) is our estimate of the parameters. Table 1 summarizes the parameter estimates across the 2000 simulated datasets by showing their means and standard deviations. Fig 1 shows that MAGI recovers the system well, including the completely unobserved H component. Table 1 shows that MAGI also recovers the system parameters well, except for the parameters that only appear in the equation for the unobserved H component, which we discuss shortly. Together, Fig 1 and Table 1 demonstrate that MAGI can recover the entire system without any use of a numerical solver, even in the presence of unobserved component(s).
Metrics for assessing the quality of system recovery
To further assess the quality of the parameter estimates and the system recovery, we consider two metrics. First, as shown in Table 1, we examine the accuracy of the parameter estimates by directly calculating the root mean squared error (RMSE) of the parameter estimates with respect to the true parameter values. We call this measure the parameter RMSE metric.

Second, it is possible that a system might be insensitive to some of the parameters; in the extreme case, some parameters may not be fully identifiable given only the observed data and components. In these situations, it is possible that the system trajectories implied by quite distinct parameter values are similar to each other (or even close to the true trajectory). We thus consider an additional trajectory RMSE metric to account for possible parameter insensitivity, which measures how well the system components are recovered given the parameter and initial condition estimates. The trajectory RMSE is obtained by treating the numerical ODE solution based on the true parameter values as the ground truth: first, the numerical solver is used to reconstruct the trajectory based on the estimates of the parameters and initial conditions (from a given method); then, we calculate the RMSE of this reconstructed trajectory relative to the true trajectory at the observation time points. We emphasize that the trajectory RMSE metric is only for evaluation purposes, to assess (and compare across methods) how well a method recovers the trajectories of the system components; throughout MAGI, no numerical solver is ever needed. A code sketch of this metric is given after Table 1 below.

We summarize the trajectory RMSEs of MAGI in Table 2 for the Hes1 system. We compare MAGI with the benchmark provided by the B-spline-based penalization approach of Ref [7]. To the best of our knowledge, Ref [7] is the only method with a software package that can be manually adapted to handle an unobserved component. We note, however, that this package itself is not ready-made for this problem: it requires substantial manual input, as it does not have a default or built-in setup of its hyper-parameters for the unobserved component. None of the other benchmark methods [9, 13] provides software that is equipped to handle an unobserved component. Table 1 compares our estimates against those given by Ref [7] based on the parameter RMSE, which shows that the parameter RMSEs for MAGI are substantially smaller than those of [7]. Fig 1 shows that the inferred trajectories from MAGI are very close to the truth. On the contrary, the method in [7] is not able to recover the hidden component H, nor the associated parameters f and g; see Fig S1 in the SI for the plots. Table 2 compares the trajectory RMSEs of the two methods. It is seen that the trajectory RMSE of MAGI is substantially smaller than that of [7]. Further implementation details and comparisons are provided in the SI.

Finally, we note that MAGI recovers the unobserved component H almost as well as the observed components P and M, as measured by the trajectory RMSEs. In comparison, for the result of [7] in Table 2, the trajectory RMSE of the unobserved H component is orders of magnitude worse than those of P and M. The numerical results thus illustrate the effectiveness of MAGI in borrowing information from the observed components to infer the unobserved component, which is made possible by explicitly conditioning on the ODE structure. The self-regulating parameter g and the inhibition rate parameter f for the unobserved component appear to have high inference variation across the simulated datasets despite the small trajectory RMSEs.
This suggests that the system itself could be insensitive to f and g when the H component is unobserved.

Table 1: Parameter inference in the Hes1 partially observed asynchronous system based on 2000 simulated datasets. Average parameter estimates based on MAGI and Ref [7] across the 2000 simulated datasets are reported together with the standard deviation; parameter RMSEs are reported in the following column. The boldface highlights the best method in terms of parameter RMSE for each parameter.

θ     Truth    MAGI Estimate    MAGI RMSE    Ref [7] Estimate    Ref [7] RMSE
a     0.022    0.022
b     0.3
c     0.031
d     0.028
e     0.5      0.552
f     20       13.863
g     0.3
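The trajectory RMSE metric described above can be sketched in a few lines of hypothetical Python (the helper name and solver settings are our own; this is evaluation-only code, since MAGI itself never invokes a solver):

```python
import numpy as np
from scipy.integrate import solve_ivp

def trajectory_rmse(f, theta_hat, x0_hat, x_true_tau, tau):
    """Per-component trajectory RMSE: reconstruct the trajectory from the
    *estimated* parameters and initial conditions with a numerical solver,
    then compare to the true trajectory at the observation time points."""
    sol = solve_ivp(lambda t, x: f(x, theta_hat, t), (tau[0], tau[-1]),
                    x0_hat, t_eval=tau, rtol=1e-8, atol=1e-8)
    x_hat_tau = sol.y.T                  # shape (len(tau), D)
    return np.sqrt(np.mean((x_hat_tau - x_true_tau) ** 2, axis=0))
```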
Table 2: Trajectory RMSEs of the individual components in the Hes1 system, comparing the average trajectory RMSEs of MAGI and Ref [7] over the 2000 simulated datasets. The best trajectory RMSE for each system component is shown in boldface.

Method       P       M       H
MAGI
Ref [7]      1.30    0.40    59.47

Figure 2: Inferred trajectories by MAGI for each component of the FN system over 100 simulated datasets. The blue shaded area represents the 95% interval. The inset plot magnifies the corresponding segment. [Panels: V component; R component; truth, median of all inferred trajectories, and 95% interval from the 2.5 and 97.5 percentiles of all inferred trajectories.]

Comparison with previous methods based on GPs
To further assess MAGI, we compare it with two methods: the Adaptive Gradient Matching (AGM) of Ref [9] and the Fast Gaussian process based Gradient Matching (FGPGM) of Ref [13], representing the state of the art of inference methods based on GPs. For a fair comparison, we use the same benchmark systems, scripts, and software provided by the authors for performance assessment, and run the software using the settings recommended by the authors. The benchmark systems include the FitzHugh-Nagumo (FN) equations [15] and a protein transduction model [16].
FN Model
The FitzHugh-Nagumo (FN) equations are a classic ion channel model that describes spike potentials. The system consists of X = (V, R), where V is the voltage of the neuron membrane potential and R is the recovery variable from neuron currents, satisfying the ODE
$$f(X, \theta, t) = \begin{pmatrix} c\left(V - \dfrac{V^3}{3} + R\right) \\ -\dfrac{1}{c}\left(V - a + bR\right) \end{pmatrix},$$
where θ = (a, b, c) are the associated parameters. As in [13, 9], the true parameters are set to a = 0.2, b = 0.2, c = 3, and we generate the true trajectories for this model using a numerical solver with initial conditions V = −1, R = 1. To compare MAGI with FGPGM of Ref [13] and AGM of Ref [9], we simulated 100 datasets under the noise setting of σ_V = σ_R = 0.2 with 41 observations. The noise level is chosen to be of similar magnitude to that of [13], and the noise level is set to be the same across the two components because the implementation of [9] can only handle equal-variance noise. The number of repetitions (i.e., 100) is set to be the same as [13] due to the high computing time of these alternative methods.
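A hypothetical Python sketch of the FN vector field (our own helper code, following the same conventions as the simulation sketch in the Introduction) is:

```python
import numpy as np

def fn_ode(x, theta, t):
    """FitzHugh-Nagumo vector field for X = (V, R), theta = (a, b, c)."""
    V, R = x
    a, b, c = theta
    return np.array([
        c * (V - V**3 / 3.0 + R),   # dV/dt
        -(V - a + b * R) / c,       # dR/dt
    ])

# Example usage mirroring the text's setup (assumed values):
# tau = np.arange(0.0, 20.5, 0.5)   # 41 observation times on [0, 20]
# y = simulate_noisy_observations(fn_ode, [-1.0, 1.0], (0.2, 0.2, 3.0), tau, sigma=0.2)
```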
Table 3: Parameter inference in the FN model based on 100 simulated datasets. For each method, average parameter estimates are reported together with the standard deviation; parameter RMSEs across simulations are also reported. The boldface highlights the best method in terms of parameter RMSE for each parameter.

θ    MAGI Estimate    MAGI RMSE    FGPGM [13] Estimate    FGPGM RMSE    AGM [9] Estimate    AGM RMSE
a    0.19
b
c    2.88

The parameter estimation results from the three methods are summarized in Table 3, where MAGI has the lowest parameter RMSEs among the three. Fig 2 shows the inferred trajectories obtained by our method: MAGI recovers the system well, and the 95% interval band is so narrow around the truth that the band is only clearly visible after magnification (as shown in the figure inset). The SI provides a visual comparison of the inferred trajectories of the different methods, where MAGI gives the most consistent results across the simulations. Furthermore, to assess how well the methods recover the system components, we calculated the trajectory RMSEs; the results are summarized in Table 4, where MAGI significantly outperforms the others, reducing the trajectory RMSE over the best alternative method by 60% in V and 24% in R. We note that, compared to the true parameter values, all three methods show some bias in the parameter estimates, which is partly due to the GP prior, as discussed in [13]; MAGI appears to have the smallest bias.

For computing cost, the average runtime of MAGI for this system over the repetitions is 3 minutes, which is 145 times faster than FGPGM [13] and 90 times faster than AGM [9] on the same CPU (we follow the authors' recommendations for running their methods; see the SI for details).

Table 4: Trajectory RMSEs of each component in the FN system, comparing the average trajectory RMSEs of the three methods over 100 simulated datasets. The best trajectory RMSE for each system component is shown in boldface. MAGI reduces the RMSE by 60% in component V and 24% in component R over the best alternative method.

Method        V        R
MAGI
FGPGM [13]    0.257    0.094
AGM [9]       1.177    0.662
Protein transduction model
This protein transduction example is based on systems biology, where components S and S_d represent a signaling protein and its degraded form, respectively. In the biochemical reaction, S binds to protein R to form the complex S_R, which enables the activation of R into R_pp. X = (S, S_d, R, S_R, R_pp) satisfies the ODE
$$f(X, \theta, t) = \begin{pmatrix} -k_1 S - k_2 S R + k_3 S_R \\ k_1 S \\ -k_2 S R + k_3 S_R + \dfrac{V \cdot R_{pp}}{K_m + R_{pp}} \\ k_2 S R - k_3 S_R - k_4 S_R \\ k_4 S_R - \dfrac{V \cdot R_{pp}}{K_m + R_{pp}} \end{pmatrix},$$
where θ = (k_1, k_2, k_3, k_4, V, K_m) are the associated rate parameters.

We follow the same simulation setup as [13, 9], by taking t = {0, 1, 2, 4, 5, 7, 10, 15, 20, 30, 40, 50, 60, 80, 100} as the observation times, X(0) = (1, 0, 1, 0, 0) as the initial values, and θ = (0.07, 0.6, 0.05, 0.3, 0.017, 0.3) as the true parameter values. Two scenarios for additive observation noise are considered: σ = 0.001 (low noise) and σ = 0.01 (high noise). Note that the observation times are unequally spaced, with only a sparse number of observations from t = 20 to t = 100. Further, inference for this system has been noted to be challenging due to the non-identifiability of the parameters, in particular K_m and V [13]. Therefore, the parameter RMSE is not meaningful for this system, and we focus our comparison on the trajectory RMSE.

We compare MAGI with FGPGM of Ref [13] and AGM of Ref [9] on 100 simulated datasets for each noise setting (see the SI for method and implementation details).
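A hypothetical Python sketch of this vector field (our own helper code, following the same conventions as the earlier sketches) is:

```python
import numpy as np

def protein_transduction_ode(x, theta, t):
    """Protein transduction vector field for X = (S, S_d, R, S_R, R_pp),
    with theta = (k1, k2, k3, k4, V, Km)."""
    S, Sd, R, SR, Rpp = x
    k1, k2, k3, k4, V, Km = theta
    mm = V * Rpp / (Km + Rpp)               # Michaelis-Menten term
    return np.array([
        -k1 * S - k2 * S * R + k3 * SR,     # dS/dt
        k1 * S,                             # dS_d/dt
        -k2 * S * R + k3 * SR + mm,         # dR/dt
        k2 * S * R - k3 * SR - k4 * SR,     # dS_R/dt
        k4 * SR - mm,                       # dR_pp/dt
    ])
```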
Figure 3: Inferred trajectories by MAGI for each component of the protein transduction system in the high noise setting. The red line is the truth, and the black line is the median inferred trajectory over 100 simulated datasets. The blue shaded area represents the 95% interval. The inset plots magnify the corresponding segment. [Panels: S, S_d, R, S_R, and R_pp components.]
We plot the inferred trajectories of MAGI in the high noise setting in Fig 3, which closely recover the system. The 95% interval band from MAGI is so narrow that for most of the inferred components we need magnification (as shown in the figure insets) to clearly see the 95% band. We then calculated the trajectory RMSEs; the results are summarized in Table 5 for each system component. In both noise settings, MAGI produces trajectory RMSEs that are uniformly smaller than both FGPGM [13] and AGM [9] for all system components. In the low noise setting, the advantage of MAGI is especially apparent for components S, R, S_R, and R_pp, with trajectory RMSEs less than half of those of the closest comparison method. In the high noise setting, MAGI reduces the trajectory RMSE the most for S_d and R_pp.

Table 5: Trajectory RMSEs of each component in the protein transduction system, comparing the average trajectory RMSEs of the three methods over 100 simulated datasets for each noise setting.

Low noise case, σ = 0.001
Method        S         S_d       R         S_R       R_pp
MAGI
FGPGM [13]    0.0049    0.0016    0.0156    0.0036    0.0149
AGM [9]       0.0476    0.2881    0.3992    0.0826    0.2807

High noise case, σ = 0.01
Method        S         S_d       R         S_R       R_pp
MAGI
FGPGM [13]    0.0128    0.0089    0.0210    0.0136    0.0309
AGM [9]       0.0671    0.3125    0.4138    0.0980    0.2973
Discussion
We have presented a novel methodology for the inference of dynamic systems, using manifold-constrained Gaussian processes. A key feature that distinguishes our work from previous approaches is that it provides a principled statistical framework, firmly grounded in the Bayesian paradigm. Our method also outperformed currently available GP-based approaches in the accuracy of inference on benchmark examples. Furthermore, the computation time of our method is much faster. Our method is robust and able to handle a variety of challenging systems, including unobserved components, asynchronous observations, and parameter non-identifiability.

A robust software implementation is provided, with user interfaces available for R, MATLAB, and Python, as described in the SI. The user may specify custom ODE systems in any of these languages for inference with our package, by following the syntax in the examples that accompany this article. In practice, inference with MAGI using our software can be carried out with relatively few user interventions. The setting of hyperparameters and initial values is fully automatic, though it may be overridden by the user. The main setting that requires some tuning is the number of discretization points in I. In our examples, this was determined by gradually increasing the denseness of the points with short sampler runs, until the results become indistinguishable. Note that further increasing the denseness of I has no ill effect, apart from increasing the computational time.

An inherent feature of the GP approximation is the tendency to favor smoother curves. This limitation has been previously acknowledged [13, 9]. As a consequence, two potential forms of bias can exist. First, estimates derived from the posterior distributions of the parameters may have some statistical bias. Second, the trajectories reconstructed by a numerical solver based on the estimated parameters may differ slightly from the inferred trajectories. MAGI, which is built on a GP framework, does not entirely eliminate these forms of bias. However, as seen in the benchmark systems, the magnitude of our bias in both respects is significantly smaller than that of the current state-of-the-art methods [13, 9].

We considered the inference of dynamic systems specified by ODEs in this article. Stochastic differential equations (SDEs) and continuous-time Markov processes, which explicitly model intrinsic noise, have also been used to describe stochastic dynamic systems. Extending our method to the inference of SDE and continuous-time Markov models is a future direction we plan to investigate.

Materials and Methods
For notational simplicity, we drop the dimension index d in this section when the meaning is clear.

Algorithm overview
We begin by summarizing the computational scheme of MAGI. Overall, we use Hamiltonian Monte Carlo (HMC) [17] to obtain samples of $(X(I), \theta, \sigma)$ from their joint posterior distribution. At each iteration of HMC, $X(I)$, $\theta$, and $\sigma$ are updated together, with leapfrog step sizes automatically tuned during the burn-in period to achieve an acceptance rate between 60% and 90%. (If $\sigma$ is known a priori, it is fixed and not sampled.) At the completion of HMC sampling (and after discarding an appropriate burn-in period for convergence), we take the posterior means of $X(I)$ as the inferred trajectories, and the posterior means of $\theta$ as the parameter estimates. The techniques we use to temper the posterior and speed up the computations are discussed in the following 'Prior tempering' subsection and in 'Techniques for computational efficiency' in the SI.

Several steps are taken to initialize the HMC sampler. First, we apply a GP fitting procedure to obtain values of $\phi$ and $\sigma$ for the observed components; the computed values of the hyper-parameters $\phi$ are subsequently held fixed during the HMC sampling, while the computed value of $\sigma$ is used as the starting value in the HMC sampler. Second, starting values of $X(I)$ for the observed components are obtained by linearly interpolating between the observation time points. Third, starting values for the remaining quantities, namely $\theta$ and $(X(I), \phi)$ for any unobserved component(s), are obtained by optimization of the posterior as described below.

Setting hyper-parameters φ for observed components

The GP prior $X_d(t) \sim \mathcal{GP}(\mu_d, K_d)$, $t \in [0, T]$, is placed on each component $X_d(t)$ separately. The Gaussian process Matern kernel
$$K(l) = \phi_1\, \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \sqrt{2\nu}\, \frac{l}{\phi_2} \right)^{\nu} B_{\nu}\!\left( \sqrt{2\nu}\, \frac{l}{\phi_2} \right)$$
has two hyper-parameters that are held fixed during sampling: $\phi_1$ controls the overall variance level of the GP, while $\phi_2$ controls the bandwidth for how much neighboring points of the GP affect each other.

When the observation noise $\sigma$ is unknown, values of $(\phi_1, \phi_2, \sigma)$ are obtained jointly by maximizing the GP fit without conditioning on any ODE information, namely:
$$(\tilde{\phi}, \tilde{\sigma}) = \arg\max_{\phi, \sigma}\, p(\phi, \sigma \mid y(I)) = \arg\max_{\phi, \sigma}\, \pi_{\Phi_1}(\phi_1)\, \pi_{\Phi_2}(\phi_2)\, \pi_{\sigma}(\sigma)\, p(y(I) \mid \phi, \sigma), \qquad (7)$$
where $y(I) \mid \phi, \sigma \sim \mathcal{N}(0, K_{\phi} + \sigma^2 \mathrm{I}_n)$ and $\mathrm{I}_n$ denotes the identity matrix. The index set $I$ is the smallest evenly spaced set such that all observation time points of this component are in $I$, i.e., $\tau \subseteq I$. The priors $\pi_{\Phi_1}(\phi_1)$ and $\pi_{\sigma}(\sigma)$ for the variance parameter $\phi_1$ and for $\sigma$ are set to be flat. The prior $\pi_{\Phi_2}(\phi_2)$ for the bandwidth parameter $\phi_2$ is set to be a Gaussian distribution with mean equal to half of the periodicity with the highest frequency loading after a Fourier transformation of $y$ on $I$ (the values of $y$ on $I$ are linearly interpolated from the observations at $\tau$), and with the standard deviation set such that $\int_0^T \pi_{\Phi_2}(\phi_2)\, d\phi_2 = 0.9$. This Gaussian prior on $\phi_2$ serves to prevent it from being too extreme. In the subsequent sampling of $(\theta, X(I), \sigma)$, the hyper-parameters $\phi$ are fixed at $\tilde{\phi}$, while $\tilde{\sigma}$ gives the starting value of $\sigma$ in the HMC sampler.
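A hypothetical sketch of this GP fitting step, again substituting a squared-exponential kernel for the Matern (ν = 2.01) kernel and omitting the Gaussian prior on φ2 for brevity (both our own simplifications), maximizes the marginal likelihood of (7) over (φ1, φ2, σ):

```python
import numpy as np
from scipy.optimize import minimize

def fit_gp_hyperparams(y, I):
    """Maximize the GP marginal likelihood over (phi1, phi2, sigma), cf. (7).

    Simplifying assumptions: flat priors throughout, an RBF kernel in place
    of the Matern(nu=2.01), and a zero-mean GP.
    """
    y = np.asarray(y, dtype=float)
    I = np.asarray(I, dtype=float)
    n = len(I)
    diff = I[:, None] - I[None, :]

    def neg_log_marginal(log_params):
        phi1, phi2, sigma = np.exp(log_params)  # log-scale keeps parameters positive
        cov = phi1 * np.exp(-diff**2 / (2 * phi2**2)) + sigma**2 * np.eye(n)
        _, logdet = np.linalg.slogdet(cov)
        alpha = np.linalg.solve(cov, y)
        return 0.5 * (logdet + y @ alpha + n * np.log(2 * np.pi))

    x0 = np.log([np.var(y) + 1e-6, (I[-1] - I[0]) / 10.0, 0.1 * np.std(y) + 1e-6])
    res = minimize(neg_log_marginal, x0, method="Nelder-Mead")
    return np.exp(res.x)  # (phi1_hat, phi2_hat, sigma_hat)
```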
If $\sigma$ is known, then values of $(\phi_1, \phi_2)$ are obtained by maximizing
$$\tilde{\phi} = \arg\max_{\phi}\, p(\phi \mid y(I), \sigma) = \arg\max_{\phi}\, \pi_{\Phi_1}(\phi_1)\, \pi_{\Phi_2}(\phi_2)\, p(y(I) \mid \phi, \sigma), \qquad (8)$$
and held fixed at $\tilde{\phi}$ in the subsequent HMC sampling of $(\theta, X(I))$. The priors for $\phi_1$ and $\phi_2$ are the same as previously defined.

Initialization of X(I) for the observed components

To provide starting values of $X(I)$ for the HMC sampler, we use the values of $y(\tau)$ at the observation time points and linearly interpolate the remaining points in $I$.

Initialization of the parameter vector θ when all system components are observed

When there is no unobserved component, we provide starting values of $\theta$ for the HMC sampler by optimizing the posterior (5) as a function of $\theta$ alone, holding $X(I)$ and $\sigma$ unchanged at their starting values. The optimized $\theta$ is then used as the starting value for the HMC sampler in this case.

Settings in the presence of unobserved system components: setting φ, initializing X(I) for unobserved components, and initializing θ

Separate treatment is needed for the setting of $\phi$ and the initialization of $\theta$ and $X(I)$ for the unobserved component(s). We use an optimization procedure that seeks to maximize the full posterior in (5) as a function of $\theta$ together with $\phi$ and the whole curve $X(I)$ for the unobserved components, while holding $\sigma$, $\phi$, and $X(I)$ for the observed components unchanged at their initial values discussed above. We thereby set $\phi$ for the unobserved component, and the starting values of $\theta$ and of $X(I)$ for the unobserved components, at the optimized values. In the subsequent sampling, the hyper-parameters are fixed at the optimized $\phi$, while the HMC sampling starts at the $\theta$ and the $X(I)$ obtained by this optimization.

Prior tempering
After $\phi$ is set, we use a tempering scheme to control the influence of the GP prior relative to the likelihood during HMC sampling. Note that (5) can be written as
$$p_{\Theta, X(I) \mid Y(\tau), W_I}(\theta, x(I) \mid y(\tau), W_I = 0) \propto p_{\Theta, X(I) \mid W_I}(\theta, x(I) \mid W_I = 0)\; p_{Y(\tau) \mid X(\tau)}(y(\tau) \mid x(\tau)). \qquad (9)$$
As the cardinality $|I|$ increases with more discretization points, the prior part $p_{\Theta, X(I) \mid W_I}(\theta, x(I) \mid W_I = 0)$ grows, while the likelihood part $p_{Y(\tau) \mid X(\tau)}(y(\tau) \mid x(\tau))$ stays unchanged. Thus, to balance the influence of the prior, we introduce a tempering hyper-parameter $\beta$ with the corresponding posterior
$$p^{(\beta)}_{\Theta, X(I) \mid W_I, Y(\tau)}(\theta, x(I) \mid 0, y(\tau)) \propto p_{\Theta, X(I) \mid W_I}(\theta, x(I) \mid W_I = 0)^{1/\beta}\; p_{Y(\tau) \mid X(I)}(y(\tau) \mid x(I))$$
$$\propto \pi_{\Theta}(\theta) \exp\bigg\{ -\frac{1}{2} \sum_{d=1}^{D} \bigg[ N_d \log(2\pi\sigma_d^2) + \big\|x_d(\tau_d) - y_d(\tau_d)\big\|^2_{\sigma_d^{-2}} + \frac{1}{\beta}\Big( \big\|x_d(I) - \mu_d(I)\big\|^2_{C_d^{-1}} + \big\| f^{x,\theta}_{d,I} - \dot{\mu}_d(I) - m_d(x_d(I) - \mu_d(I)) \big\|^2_{K_d^{-1}} \Big) \bigg] \bigg\}.$$
A useful setting that we recommend is $\beta = D|I|/N$, where $D$ is the number of system components, $|I|$ is the number of discretization time points, and $N = \sum_{d=1}^{D} N_d$ is the total number of observations. This setting aims to balance the likelihood contribution from the observations against the total number of discretization points.
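As a worked instance of this recommendation, using the Hes1 numbers from the SI: with D = 3 components, |I| = 33 discretization points, and N = 33 total observations, β = 3 × 33/33 = 3. A one-line sketch of the rule (hypothetical helper, not the package API):

```python
def tempering_beta(D, n_discretization, n_obs):
    """Recommended prior tempering factor beta = D * |I| / N."""
    return D * n_discretization / n_obs

beta = tempering_beta(3, 33, 33)  # Hes1 example from the SI: beta = 3.0
```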
Data availability
All of the data used in the article are simulation data. The details, including the models used to generate the simulation data, are described in the Results and the SI. Our software package also includes complete replication scripts for all the data and examples.
Acknowledgements
The research of S.W.K.W. is supported in part by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada. The research of S.C.K. is supported in part by NSF Grant DMS-1810914.
References

[1] Hirata H, et al. (2002) Oscillatory expression of the bHLH factor Hes1 regulated by a negative feedback loop. Science 298(5594):840–843.

[2] Biometrics.

[3] Differential Equations and Applications in Ecology, Epidemics, and Population Problems (Elsevier).

[4] Gorbach NS, Bauer S, Buhmann JM (2017) Scalable variational inference for dynamical systems. Advances in Neural Information Processing Systems, pp. 4806–4815.

[5] Wu L, Qiu X, Yuan Yx, Wu H (2019) Parameter estimation and variable selection for big systems of linear ordinary differential equations: A matrix-based approach. Journal of the American Statistical Association.

[6] Calderhead B, Girolami M, Lawrence ND (2009) Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes. Advances in Neural Information Processing Systems, pp. 217–224.

[7] Ramsay JO, Hooker G, Campbell D, Cao J (2007) Parameter estimation for differential equations: a generalized smoothing approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(5):741–796.

[8] Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences.

[9] Dondelinger F, Husmeier D, Rogers S, Filippone M (2013) ODE parameter inference using adaptive gradient matching with Gaussian processes. International Conference on Artificial Intelligence and Statistics, pp. 216–228.

[10] Barber D, Wang Y (2014) Gaussian processes for Bayesian estimation in ordinary differential equations. International Conference on Machine Learning, pp. 1485–1493.

[11] Ghosh S, Dasmahapatra S, Maharatna K (2017) Fast approximate Bayesian computation for estimating parameters in differential equations. Statistics and Computing.

[12] International Conference on Artificial Intelligence and Statistics, pp. 1252–1260.

[13] Wenk P, et al. (2019) Fast Gaussian process based gradient matching for parameter identification in systems of nonlinear ODEs. International Conference on Artificial Intelligence and Statistics, pp. 1351–1360.

[14] Macdonald B, Higham C, Husmeier D (2015) Controversy in mechanistic modelling with Gaussian processes. Proceedings of Machine Learning Research.

[15] FitzHugh R (1961) Impulses and physiological states in theoretical models of nerve membrane. Biophysical Journal 1(6):445–466.

[16] Vyshemirsky V, Girolami MA (2008) Bayesian ranking of biochemical system models. Bioinformatics 24(6):833–839.

[17] Neal RM (2011) MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo, Chapman & Hall/CRC Handbooks of Modern Statistical Methods, eds. Brooks S, Gelman A, Jones G, Meng X (CRC Press), pp. 113–162.
Supporting Information
This supporting information file presents the detailed derivation of Eq (5) of the main text, techniques for efficientcomputation, and further details and discussion for each of the dynamic system examples in the main manuscript.
Detailed derivation of Eq (5) in the main text
Recall that our key idea is to define a random variable $W$ quantifying the difference between the stochastic process $X(t)$ and the ODE structure with a given value of the parameter $\theta$:
$$W = \sup_{t \in [0,T],\, d \in \{1,\ldots,D\}} \big| \dot{X}_d(t) - f(X(t), \theta, t)_d \big|.$$
Ideally, the posterior distribution for $X(t)$ and $\theta$ given the observations $y(\tau)$ and the manifold constraint specified by the ODE structure, $W \equiv 0$, is (informally)
$$p_{\Theta, X(t) \mid W, Y(\tau)}(\theta, x(t) \mid W = 0, Y(\tau) = y(\tau)).$$
In reality $W$ is not generally computable, so in practice we approximate $W$ by finite discretization on the set $I = (t_1, t_2, \ldots, t_n)$ such that $\tau \subset I \subset [0, T]$, and similarly define $W_I = \max_{t \in I,\, d \in \{1,\ldots,D\}} |\dot{X}_d(t) - f(X(t), \theta, t)_d|$. We thus consider the practically computable posterior
$$p_{\Theta, X(I) \mid W_I, Y(\tau)}(\theta, x(I) \mid W_I = 0, Y(\tau) = y(\tau)).$$
Note that in this posterior distribution, we consider the joint distribution of $\theta$ and $X(I)$ together. Thus, effectively, we simultaneously infer both the parameter $\theta$ and the trajectory $X(I)$ from the noisy observations $y(\tau)$.

Since
$$p_{\Theta, X(I) \mid W_I, Y(\tau)}(\theta, x(I) \mid W_I = 0, Y(\tau) = y(\tau)) \propto P(\Theta = \theta, X(I) = x(I), W_I = 0, Y(\tau) = y(\tau)),$$
let us consider the right-hand side, which can be decomposed as
$$P(\Theta = \theta, X(I) = x(I), W_I = 0, Y(\tau) = y(\tau))$$
$$= \pi_{\Theta}(\theta) \times \underbrace{P(X(I) = x(I) \mid \Theta = \theta)}_{(1)} \times \underbrace{P(Y(\tau) = y(\tau) \mid X(I) = x(I), \Theta = \theta)}_{(2)} \times \underbrace{P(W_I = 0 \mid Y(\tau) = y(\tau), X(I) = x(I), \Theta = \theta)}_{(3)}$$
$$= \pi_{\Theta}(\theta) \times \underbrace{P(X(I) = x(I))}_{(1)} \times \underbrace{P(Y(\tau) = y(\tau) \mid X(\tau) = x(\tau))}_{(2)} \times \underbrace{P(W_I = 0 \mid Y(\tau) = y(\tau), X(I) = x(I), \Theta = \theta)}_{(3)},$$
where the second equality holds because the GP prior on $X(I)$ is independent of $\Theta$, and the observations $Y(\tau)$ depend only on $X(\tau)$.

Note that the third term can be simplified as
$$P(W_I = 0 \mid Y(\tau) = y(\tau), X(I) = x(I), \Theta = \theta) = P(\dot{X}(I) - f(x(I), \theta, t_I) = 0 \mid Y(\tau) = y(\tau), X(I) = x(I), \Theta = \theta)$$
$$= P(\dot{X}(I) - f(x(I), \theta, t_I) = 0 \mid X(I) = x(I)) = P(\dot{X}(I) = f(x(I), \theta, t_I) \mid X(I) = x(I)),$$
which is the conditional density of $\dot{X}(I)$ given $X(I)$, evaluated at $f(x(I), \theta, t_I)$.

We, therefore, obtain
$$p_{\Theta, X(I) \mid W_I, Y(\tau)}(\theta, x(I) \mid W_I = 0, Y(\tau) = y(\tau)) \propto \pi_{\Theta}(\theta) \times \underbrace{P(X(I) = x(I))}_{(1)} \times \underbrace{P(Y(\tau) = y(\tau) \mid X(\tau) = x(\tau))}_{(2)} \times \underbrace{P(\dot{X}(I) = f(x(I), \theta, t_I) \mid X(I) = x(I))}_{(3)}$$
$$= \pi_{\Theta}(\theta) \exp\bigg\{ -\frac{1}{2} \sum_{d=1}^{D} \bigg[ \underbrace{|I| \log(2\pi) + \log|C_d| + \big\|x_d(I) - \mu_d(I)\big\|^2_{C_d^{-1}}}_{(1)} + \underbrace{N_d \log(2\pi\sigma_d^2) + \big\|x_d(\tau_d) - y_d(\tau_d)\big\|^2_{\sigma_d^{-2}}}_{(2)} + \underbrace{|I| \log(2\pi) + \log|K_d| + \big\| f^{x,\theta}_{d,I} - \dot{\mu}_d(I) - m_d \{x_d(I) - \mu_d(I)\} \big\|^2_{K_d^{-1}}}_{(3)} \bigg] \bigg\},$$
where expression (1) comes from the GP prior on $X$, expression (2) corresponds to the noisy observations, and expression (3) corresponds to the conditional density of $\dot{X}(I)$ given $X(I)$ evaluated at $f(x(I), \theta, t_I)$; all three expressions are Gaussian as long as the kernel $K$ of the GP on $X$ is twice differentiable. Please note that in the last equation, (i) we have used the notations $\|v\|^2_A = v^{\intercal} A v$, $|I|$ for the cardinality of $I$, and $f^{x,\theta}_{d,I}$ for the $d$-th component of $f(x(I), \theta, t_I)$; and (ii) for each component $d$, the multivariate Gaussian covariance matrix $C_d$ and the matrix $K_d$ are related to the GP kernel $K$ as
$$C_d = K(I, I), \qquad m_d = {}'K(I, I)\, K(I, I)^{-1}, \qquad K_d = K''(I, I) - {}'K(I, I)\, K(I, I)^{-1}\, K'(I, I),$$
where ${}'K = \frac{\partial}{\partial s} K(s, t)$, $K' = \frac{\partial}{\partial t} K(s, t)$, and $K'' = \frac{\partial^2}{\partial s\, \partial t} K(s, t)$.
After setting $\phi$, the matrix inverses $C_d^{-1}$ and $K_d^{-1}$ can be pre-computed and held fixed in the sampling of $X$, $\theta$, $\sigma$ from the target posterior, Eq. (5) in the main text. Thus, evaluating Eq. (5) in the main text at sampled values of $(X, \theta, \sigma)$ only involves matrix multiplication, which has typical computational complexity of $O(n^2)$, where $n$ is the matrix dimension (i.e., the number of discretization points). Due to the short-term memory and local structure of Gaussian processes (GPs), the partial correlation between two distant points diminishes quickly to zero, so the off-diagonal part of the precision matrices $C_d^{-1}$ and $K_d^{-1}$ is close to zero. Similarly, $m_d$ is the projection matrix from the Gaussian process to its derivative process, and since the derivative is a local property, the effect of a far-away point is small given one's neighboring points, so the off-diagonal part of the projection matrix $m_d$ is close to zero as well. Therefore, an efficient band matrix approximation may be used on $C_d^{-1}$, $K_d^{-1}$, and $m_d$ to reduce the computation to $O(n)$ with a fixed band size, when calculating Eq. (5) in the main text at each sampled $(X, \theta, \sigma)$. In our experience, a band size of 20 to 40 is sufficient, and we recommend using an evenly spaced $I$ for best results with the band matrix approximation and thus faster computation. In our implementation, a failure of the band approximation is automatically detected by checking for divergence in the quadratic form, and a warning is output to the user to increase the band size.
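As an illustration of the idea (a hypothetical sketch of ours, not the package's internal code), the quadratic form $v^{\intercal} A v$ under a band approximation of a precision matrix $A$ costs O(n · b) arithmetic for band size b, instead of O(n²):

```python
import numpy as np

def banded_quadratic_form(A, v, band=20):
    """Approximate v^T A v keeping only the entries of A within `band`
    sub-/super-diagonals; the arithmetic is O(n * band) rather than O(n^2)."""
    n = len(v)
    total = 0.0
    for i in range(n):
        lo, hi = max(0, i - band), min(n, i + band + 1)
        total += v[i] * (A[i, lo:hi] @ v[lo:hi])
    return total

# Sanity check: with band >= n - 1 this reproduces the exact quadratic form,
# e.g., np.allclose(banded_quadratic_form(A, v, band=len(v)), v @ A @ v).
```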
More details of the examples

Hes1 model
As stated in the main text, this system has three components, $X = (P, M, H)$, following the ODE
$$f(X, \theta, t) = \begin{pmatrix} -aPH + bM - cP \\ -dM + \dfrac{e}{1 + P^2} \\ -aPH + \dfrac{f}{1 + P^2} - gH \end{pmatrix},$$
where $\theta = (a, b, c, d, e, f, g)$ are the associated parameters.

The true parameter values in the simulation are set as a = 0.022, b = 0.3, c = 0.031, d = 0.028, e = 0.5, f = 20, g = 0.3, which lead to one oscillation cycle approximately every 2 hours. The initial condition is set to P(0) = 1.439, M(0) = 2.037, H(0) = 17.904. Recall that these settings, along with the simulated noise level, are derived from Ref [1], where the standard errors based on repeated measures are reported to be around 15% of the P (protein) level and M (mRNA) level. Thus the simulation noise is set to be multiplicative, following a log-normal distribution with standard deviation 0.15, since all components in the system are strictly positive. The number of observations is also set based on Ref [1], where P and M are observed at 15-minute intervals for 4 hours but the H component is entirely unobserved. In addition, the observations for P and M are asynchronous: starting at time 0, every 15 minutes we observe P; starting at 7.5 minutes, every 15 minutes we observe M. Following our notation in the main text, τ_1 = {0, 15, 30, ..., 240}, τ_2 = {7.5, 22.5, 37.5, ..., 232.5}, and τ_3 = ∅. In total we have N_1 = 17 observations for P, N_2 = 16 observations for M, and N_3 = 0 observations for H; P and M are never observed at the same time. See Fig 1 (left-most panel) of the main text for a visual illustration.

We provide additional details on how to set up MAGI, as applied to this system. Since the components are strictly positive, we first apply a log-transformation to the system so that the resulting noise is additive Gaussian. Define $\tilde{P} = \log P$, $\tilde{M} = \log M$, $\tilde{H} = \log H$, so that the transformed system is
$$\frac{d\tilde{X}(t)}{dt} = \begin{pmatrix} -a \exp(\tilde{H}) + b \exp(\tilde{M} - \tilde{P}) - c \\ -d + e \exp(-\tilde{M})\,\big(1 + \exp(2\tilde{P})\big)^{-1} \\ -a \exp(\tilde{P}) + f \exp(-\tilde{H})\,\big(1 + \exp(2\tilde{P})\big)^{-1} - g \end{pmatrix}.$$
We conduct all the inference on the log-transformed system, and transform back to the original scale only at the final step to obtain inferred trajectories on the original scale.
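A hypothetical Python sketch of this log-transformed vector field (our own helper, not the MAGI package API) is:

```python
import numpy as np

def hes1_log_ode(x_tilde, theta, t):
    """Hes1 vector field on the log scale, x_tilde = (log P, log M, log H).

    Working on the log scale turns the multiplicative (log-normal) noise
    into additive Gaussian noise, as required by the observation model.
    """
    Pt, Mt, Ht = x_tilde
    a, b, c, d, e, f, g = theta
    denom = 1.0 + np.exp(2.0 * Pt)  # equals 1 + P^2 on the original scale
    return np.array([
        -a * np.exp(Ht) + b * np.exp(Mt - Pt) - c,      # d(log P)/dt
        -d + e * np.exp(-Mt) / denom,                   # d(log M)/dt
        -a * np.exp(Pt) + f * np.exp(-Ht) / denom - g,  # d(log H)/dt
    ])
```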
As described in 'Setting hyper-parameters φ for observed components' in the Materials and Methods, we consider the observed P component and the observed M component separately when setting their respective hyper-parameters φ. For P, since the observation time points are already equally spaced, we have I = τ_1 = {0, 15, 30, ..., 240}; $\tilde{\phi}$ is obtained by optimization of Eq (8) in the main text given $y_1(I) = y_1(\tau_1)$, fixing the noise level σ_1 at the true value of 0.15. For M, since the observation time points are also equally spaced, we have I = τ_2 = {7.5, 22.5, 37.5, ..., 232.5}; $\tilde{\phi}$ for M is obtained by optimization of Eq (8) in the main text given $y_2(I) = y_2(\tau_2)$, again fixing the noise level σ_2 at the true value of 0.15.

Next, we consider the discretization set I. In this example we use all observation time points as the discretization set, i.e., I = τ_1 ∪ τ_2 = {0, 7.5, 15, 22.5, ..., 232.5, 240}. To initialize X(I) for the observed components P and M, we follow the approach described in the Materials and Methods, using the values of y(τ) at the observation time points and linear interpolation for the remaining points in I.

We set the hyper-parameter φ and the initial values for the unobserved component H by maximizing the full likelihood function, Eq. (5) of the main text, as described in the Materials and Methods section ('Settings in the presence of unobserved system components: setting φ, initializing X(I) for unobserved components, and initializing θ').

To balance the contribution from the GP prior and that from the observed data, we use prior tempering (as described in the 'Prior tempering' subsection of the Materials and Methods of the main text). We set $\beta = D|I| / \sum_{d=1}^{D} N_d = 3$, since we have a total of 33 observations (17 observations for P, 16 observations for M, and 0 observations for H) and a total of 33 discretization points (at times 0, 7.5, 15, ..., 240) for each of the 3 dimensions. Finally, the priors for each parameter in θ are set to be flat on the interval (0, ∞).

Having initialized the sampler for this system, we next provide details on the HMC sampling used to obtain our estimates of the trajectories and parameters. A total of 20000 HMC iterations were run, with the first 10000 discarded as burn-in. Each HMC iteration uses 500 leapfrog steps, where the leapfrog step size is drawn randomly from a uniform distribution around L for each iteration. During the burn-in period, L is adaptively tuned: at each HMC iteration, L is multiplied by 1.005 if the acceptance rate in the previous 100 iterations is above 90%, and L is multiplied by 0.995 if the acceptance rate in the previous 100 iterations is below 60%. To speed up computations, we use a band matrix approximation (see 'Techniques for computational efficiency' in this SI document) with band size 20. Using the draws from the 10000 HMC iterations after burn-in, the posterior mean of X = (P, M, H) is our inferred trajectory for the system components at the time points in I, generated by MAGI without using any numerical solver; the posterior mean of θ = (a, b, c, d, e, f, g) provides our parameter estimates.

We make comparisons with the B-spline-based penalization method of Ref [7], which provides estimated parameters for a given dataset and ODE, but does not provide estimates for the system components (i.e., the trajectories) of the ODE.
30, 2020Thus, to infer the trajectories of system components implied by the method of Ref [7], we run the numerical solver foreach parameter estimate (and initial values) produced by the method of Ref [7] to obtain the inferred trajectories for thesystem components. The method of Ref [7] also has hyper-parameters, in particular, the spline basis functions. Theauthors’ R package
CollocInfer does not provide the capability to fit spline basis functions if there are unobservedsystem components. Thus, to obtain results with unobserved components, we fit these spline basis functions using thetrue value of all system components at the observation time points in this study, which in fact gives the method of Ref[7] an additional advantage than in practice: in the analysis of real data, the true value of the system components iscertainly unavailable. Specifically, we used the routines in the R package
CollocInfer by Ref [7] twice: the first time,we supply the package with the fully-observed noiseless true values of all system components at the observation timepoints, and thus obtain the estimated B-spline basis functions as part of the package output; the second time, we supplythe package with noisy data, together with the B-spline basis functions we obtained in the first run for the unobservedcomponent, to get the final inference results. All other settings are kept at the default values in the package.Even under this setting, the method of Ref [7] had difficulty recovering the system trajectories and parameters θ (FigureS1, Table 1 of the main text). Figure S1 plots the inferred trajectories across the 2000 datasets, comparing the twomethods side by side, where the method of Ref [7] is seen to have difficulty to recover the unobserved component H . Table 1 of the main text shows the parameter RMSE, where the method of Ref [7] has difficulty to recover theparameters f and g , which are associated with the unobserved component H . Even for the observed components P and M , the inferred trajectory of Ref [7] has much larger RMSE compared to MAGI (see Figure S1 and Table 2 of the maintext).Finally, we want to highlight that none of the other benchmark methods, for example, [9, 13], provides software that isequipped to handle an unobserved component.Figure S1: Inference for Hes1 partially observed asynchronized system on 2000 simulated datasets, comparing MAGIto the method of Ref [7]. The green line is the median of the inferred trajectories across the 2000 simulated datasets.The blue shaded area represents the 95% interval represented by the 2.5 and 97.5 percentiles of the inferred trajectories.The upper panel is the result from MAGI, and the lower panel is result from the method of Ref [7]. MAGI time l l l l l l l l l l l l l l l l ll l l l l l l l l l l l l l l l sample observations ll true Ptrue Mtrue Hobserved Pobserved M time P P component (Partially Observed) . . . . . . time M M component (Partially Observed) time H H component (Unobserved) truthmedian of all inferred trajectories 95% interval from the 2.5 and 97.5 percentile of all inferred trajectories
FitzHugh-Nagumo (FN) Model
As stated in the main text, the FitzHugh-Nagumo (FN) model has two components, X = (V, R), following the ODE

f(X, \theta, t) = \begin{pmatrix} c\,(V - V^3/3 + R) \\ -\tfrac{1}{c}\,(V - a + bR) \end{pmatrix},

where θ = (a, b, c) are the associated parameters.

Following the same simulation setup as Refs [13, 9], the initial conditions of the system are set at X(0) = (V(0), R(0)) = (−1, 1), the true parameter values are set at θ = (a, b, c) = (0.2, 0.2, 3), and the system is observed at equally spaced time points from 0 to 20 at intervals of 0.5, i.e., τ = {0, 0.5, 1, 1.5, ..., 20}. Simulated observations have additive Gaussian noise with σ = 0.2 on both components.

We provide additional details on how to set up MAGI, as applied to this system. As described in “Setting hyper-parameters φ for observed components” in the Materials and Methods, the smallest index set that includes the observation time points is I₀ = τ = {0, 0.5, 1, 1.5, ..., 20}; then, given y(τ), values of (φ̃, σ̃) are obtained by optimizing Eq (7) in the main text. Next, we consider the discretization set I. In this example we insert 3 additional equally spaced discretization time points between each two adjacent observation time points, i.e., I = {0, 0.125, 0.25, ..., 19.875, 20}, with |I| = 161 time points. As noted in the Discussion section of the main text, we successively increased the denseness of points in I and found that a further increase in the number of discretization points yielded nearly identical results to I = {0, 0.125, 0.25, ..., 19.875, 20}. Next, to initialize X_I for the sampler, we follow the approach described in Materials and Methods, using the values of y(τ) at the observation time points and linear interpolation for the remaining points in I. Then, we obtain a starting value of θ for the HMC sampler according to the “Initialization of the parameter vector θ when all system components are observed” subsection in the main text. We apply tempering to the posterior distribution following our guideline in the “Prior tempering” subsection in the main text, where β = D|I| / ∑_{d=1}^{D} N_d = (161 × 2)/(41 × 2) ≈ 3.93. Finally, the prior distributions for each parameter in θ are set to be flat on (0, ∞).

Having initialized the sampler for this system, we run HMC sampling to obtain our estimates of the trajectory and parameters. A total of 20000 HMC iterations were run, with the first 10000 discarded as burn-in. Each HMC iteration uses 100 leapfrog steps, where the leapfrog step size for each iteration is drawn randomly from a uniform distribution whose range is proportional to a scale factor L. During the burn-in period, L is adaptively tuned: at each HMC iteration, L is multiplied by 1.005 if the acceptance rate in the previous 100 iterations is above 90%, and L is multiplied by 0.995 if the acceptance rate in the previous 100 iterations is below 60%. To speed up computations, we use a band matrix approximation (see “Techniques for computational efficiency” in this SI document) with band size 20. Using the draws from the 10000 HMC iterations after burn-in, the posterior mean of X = (V, R) is our inferred trajectory for the system components at time points in I, generated by MAGI without using any numerical solver; the posterior mean of θ = (a, b, c) provides our parameter estimates. A sketch of the ODE right-hand side and the discretization setup appears below.
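The following minimal R sketch spells out the FN right-hand side and the discretization bookkeeping described above. The function and variable names are our own, and the noisy data y_obs is simulated here only as a placeholder.

```r
# FN right-hand side: X = (V, R), theta = (a, b, c)
fn_ode <- function(theta, V, R) {
  a <- theta[1]; b <- theta[2]; cc <- theta[3]
  cbind(cc * (V - V^3 / 3 + R),        # dV/dt
        -(1 / cc) * (V - a + b * R))   # dR/dt
}

# Observation times and discretization set I: 3 extra points between each
# pair of adjacent observations, giving |I| = 161
tau <- seq(0, 20, by = 0.5)    # 41 observation time points
I   <- seq(0, 20, by = 0.125)  # 161 discretization time points

# Tempering factor beta = D * |I| / sum_d N_d = (161 * 2) / (41 * 2)
beta <- (2 * length(I)) / (2 * length(tau))

# Initialize X_I by linear interpolation of the noisy observations
# (y_obs is a random placeholder here; in practice it is the observed data)
y_obs  <- cbind(V = rnorm(length(tau)), R = rnorm(length(tau)))
X_init <- apply(y_obs, 2, function(y) approx(x = tau, y = y, xout = I)$y)
```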
For the two benchmark methods, we strictly follow the authors' recommendations. Specifically, for FGPGM of Ref [13], we run their provided software with all settings as recommended by the authors: the standard deviation parameter γ there, which handles potential mismatch between the GP derivatives and the ODE system, is set to the authors' recommended value, a Matern52 kernel is used, and 300000 MCMC iterations are run. We treat the first half of the iterations as burn-in, and use the posterior mean as the estimate of the parameters and initial conditions. For AGM of Ref [9], the observation noise level is assumed to be known and fixed at its true value (as this method cannot handle an unknown noise level), and 300000 MCMC iterations are run. We treat the first half of the iterations as burn-in, and use the posterior means of the sampled values of the parameters and initial conditions as their respective estimates.

As described in “Metrics for assessing the quality of system recovery” in the main text, the parameter RMSE is the root mean squared error (RMSE) of the parameter estimates relative to the true parameter values; a short sketch of this computation is given below. To visualize the parameter estimates of the different methods, we plot histograms of the estimated parameters for each method in Figure S2. The red line indicates the true value of each parameter (a, b, c), and the histograms show the distributions of the corresponding parameter estimates over the 100 simulated datasets. For MAGI (upper panel), the red lines lie close to the centers of the histograms for each parameter, indicating that the statistical bias is small; the spreads of the histograms illustrate the variances of the estimates. For FGPGM [13] (middle panel), the red lines also lie close to the centers of the histograms, indicating small statistical bias; however, the histograms are visibly wider than in the upper panel, showing larger variances of the estimates. For AGM [9] (lower panel), the relatively narrow histograms indicate that the variances of the parameter estimates are small; however, for parameters a and c the histograms are much further from the true values, indicating a larger statistical bias than the other two methods.
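As a concrete reading of this metric, the following R sketch computes the parameter RMSE over repeated simulated datasets; theta_hat is a hypothetical matrix of per-dataset estimates, filled with placeholder values here.

```r
# Parameter RMSE over 100 simulated datasets: theta_hat is a 100 x 3 matrix
# of per-dataset estimates (placeholder values), theta_true the truth
theta_true <- c(a = 0.2, b = 0.2, c = 3)
theta_hat  <- matrix(rnorm(100 * 3, mean = theta_true), ncol = 3, byrow = TRUE)

err        <- sweep(theta_hat, 2, theta_true)  # estimate minus truth
param_rmse <- sqrt(colMeans(err^2))            # RMSE per parameter
```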
Figure S2: Histograms of the estimated θ of the FN system over 100 simulated datasets. Three methods are compared. Upper panel: MAGI. Middle panel: FGPGM of Ref [13]. Lower panel: AGM of Ref [9]. The red line is the true parameter value. [Figure: histograms of the estimates of a, b, and c for each method.]

As described in “Metrics for assessing the quality of system recovery” in the main text, the trajectory RMSE is computed for each method based on its estimates of the parameters and initial conditions. Recall that the trajectory RMSE treats the numerical ODE solution based on the true parameter values as the ground truth, and is obtained as follows: first, the numerical solver is used to reconstruct the trajectory based on the estimates of the parameters and initial conditions from a given method; then, the RMSE of this reconstructed trajectory relative to the true trajectory at the observation time points is calculated. A minimal sketch of this computation is given below. To visualize the trajectory RMSEs shown in Table 4 of the main text for each method, Figure S3 plots the true trajectory (red lines) and the 95% interval of the reconstructed trajectories (gray bands) over the 100 simulated datasets for MAGI, FGPGM of Ref [13], and AGM of Ref [9]. For MAGI (upper panel), the gray bands closely follow the true trajectories for both components, showing that the statistical bias of the reconstructed trajectories is small; the bands are also quite narrow, showing that the variance of the reconstructed trajectories is low. For FGPGM [13] (middle panel), the gray bands largely follow the true trajectories for both components, showing that the statistical bias of the reconstructed trajectories is small; however, the bands are visibly wider than in the upper panel for both components, indicating larger variances of the reconstructed trajectories. For AGM [9] (lower panel), the gray bands do not capture the true trajectory for either component, indicating clear statistical bias in the reconstructed trajectories, and the bands are also much wider than for the other two methods, indicating higher variance; this is probably due to the underlying statistical bias in the parameter estimates seen in the lower panel of Figure S2.
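The following R sketch, using the deSolve package, illustrates the trajectory RMSE computation for the FN system under one set of estimated parameters and initial conditions; theta_hat and x0_hat are hypothetical estimates from one method on one dataset.

```r
library(deSolve)

# FN dynamics in the form expected by deSolve::ode
fn_rhs <- function(t, state, p) {
  V <- state[1]; R <- state[2]
  list(c(p[3] * (V - V^3 / 3 + R),
         -(1 / p[3]) * (V - p[1] + p[2] * R)))
}

tau   <- seq(0, 20, by = 0.5)  # observation time points
truth <- ode(y = c(V = -1, R = 1), times = tau, func = fn_rhs,
             parms = c(0.2, 0.2, 3))

# Hypothetical estimates from one method on one simulated dataset
theta_hat <- c(0.22, 0.18, 2.9)
x0_hat    <- c(V = -0.95, R = 1.05)
recon     <- ode(y = x0_hat, times = tau, func = fn_rhs, parms = theta_hat)

# Trajectory RMSE: reconstructed vs. true trajectory at the observation times
traj_rmse <- sqrt(colMeans((recon[, c("V", "R")] - truth[, c("V", "R")])^2))
```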
Figure S3: Reconstructed trajectories by the numerical solver for each component of the FN system from three methods. Upper panel: MAGI. Middle panel: FGPGM of Ref [13]. Lower panel: AGM of Ref [9]. The red line is the true trajectory. The grey area is the 95% interval represented by the 2.5 and 97.5 percentiles.
Protein transduction model
As stated in the main text, the protein transduction model has five components, X = (S, S_d, R, S_R, R_pp), following the ODE

f(X, \theta, t) = \begin{pmatrix} -k_1 S - k_2 S \cdot R + k_3 S_R \\ k_1 S \\ -k_2 S \cdot R + k_3 S_R + \dfrac{V \cdot R_{pp}}{K_m + R_{pp}} \\ k_2 S \cdot R - k_3 S_R - k_4 S_R \\ k_4 S_R - \dfrac{V \cdot R_{pp}}{K_m + R_{pp}} \end{pmatrix},

where θ = (k_1, k_2, k_3, k_4, V, K_m) are the associated rate parameters.

Following the same simulation setup as in Refs [13, 9], the initial conditions of the system are X(0) = (1, 0, 1, 0, 0), the true parameter values are θ = (0.07, 0.6, 0.05, 0.3, 0.017, 0.3), and the system is observed at the time points t = {0, 1, 2, 4, 5, 7, 10, 15, 20, 30, 40, 50, 60, 80, 100}. In the low-noise scenario, simulated observations have additive Gaussian noise with σ = 0.001, while in the high-noise scenario σ = 0.01. As noted in the main text, inference for this system is challenging due to the non-identifiability of the parameters, so the comparison of the different methods focuses on trajectory recovery rather than the parameter RMSE.

We provide additional details on how to set up MAGI, as applied to this system. Recall that the observation times are unequally spaced. Thus, as described in “Setting hyper-parameters φ for observed components” in the Materials and Methods, we take I₀ = {0, 1, 2, ..., 99, 100}, which is the smallest index set with equally spaced time points that includes the observation times, and use linear interpolation between the observations y(τ) to obtain y(I₀); given y(I₀), values of (φ̃, σ̃) are obtained by optimization. Next, we consider the discretization set I. In this example we insert 1 additional equally spaced discretization time point between each two adjacent time points in I₀, i.e., I = {0, 0.5, 1, ..., 99.5, 100}, with |I| = 201 time points. As noted in the Discussion, we successively increased the denseness of points in I and found that a further increase in the number of discretization points yielded nearly identical results to this setting of I. Next, to initialize X_I for the sampler, we follow the approach described in Materials and Methods, using the values of y(τ) at the observation time points and linear interpolation for the remaining points in I. Then, we obtain a starting value of θ for the HMC sampler according to “Initialization of the parameter vector θ when all system components are observed”. We apply tempering to the posterior following our guideline in “Prior tempering”, so that β = D|I| / ∑_{d=1}^{D} N_d = (201 × 5)/(15 × 5) = 13.4. Finally, priors for each parameter in θ are set to be uniform on the bounded interval used in Ref [13].

Having initialized the sampler for this system, we run HMC sampling to obtain samples of the trajectory and parameters. A total of 20000 HMC iterations were run, with the first 10000 discarded as burn-in. Each HMC iteration uses 100 leapfrog steps, where the leapfrog step size for each iteration is drawn randomly from a uniform distribution whose range is proportional to a scale factor L. During the burn-in period, L is adaptively tuned: at each HMC iteration, L is multiplied by 1.005 if the acceptance rate in the previous 100 iterations is above 90%, and L is multiplied by 0.995 if the acceptance rate in the previous 100 iterations is below 60%. To speed up computations, we use a band matrix approximation (see “Techniques for computational efficiency” in this SI document) with band size 40. Using the draws from the 10000 HMC iterations after burn-in, the posterior mean of X = (S, S_d, R, S_R, R_pp) is our inferred trajectory for the system components, generated by MAGI without using any numerical solver; the posterior mean of θ = (k_1, k_2, k_3, k_4, V, K_m) provides our parameter estimates. The ODE right-hand side is spelled out in the sketch below.
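For concreteness, a minimal R sketch of this right-hand side follows; the function name and argument layout are our own illustrative choices.

```r
# Protein transduction right-hand side: X = (S, Sd, R, SR, Rpp),
# theta = (k1, k2, k3, k4, V, Km)
pt_ode <- function(theta, X) {
  k1 <- theta[1]; k2 <- theta[2]; k3 <- theta[3]
  k4 <- theta[4]; V  <- theta[5]; Km <- theta[6]
  S <- X[1]; Sd <- X[2]; R <- X[3]; SR <- X[4]; Rpp <- X[5]
  c(-k1 * S - k2 * S * R + k3 * SR,               # dS/dt
    k1 * S,                                        # dSd/dt
    -k2 * S * R + k3 * SR + V * Rpp / (Km + Rpp),  # dR/dt
    k2 * S * R - k3 * SR - k4 * SR,                # dSR/dt
    k4 * SR - V * Rpp / (Km + Rpp))                # dRpp/dt
}

# Example evaluation at the initial state with the true parameters
pt_ode(theta = c(0.07, 0.6, 0.05, 0.3, 0.017, 0.3),
       X     = c(S = 1, Sd = 0, R = 1, SR = 0, Rpp = 0))
```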
We compare MAGI with FGPGM of Ref [13] and AGM of Ref [9] on 100 simulated datasets for each of the two noise settings. All methods use the same priors for θ, namely the uniform priors used previously in Ref [13]. We strictly follow the authors' recommendations for running their methods. Specifically, for FGPGM of Ref [13], we run their provided software with all settings as recommended by the authors: the standard deviation parameter γ there, which handles potential mismatch between the GP derivatives and the ODE system, is set to the authors' recommended value, a sigmoid kernel is used, and 300000 MCMC iterations are run. We treat the first half of the iterations as burn-in, and use the posterior mean as the estimate of the parameters and initial conditions. For AGM of Ref [9], the observation noise level is assumed to be known and fixed at its true value (as this method cannot handle an unknown noise level), and 300000 MCMC iterations are run. We treat the first half of the iterations as burn-in, and use the posterior mean as the estimate of the parameters and initial conditions.

As described in “Metrics for assessing the quality of system recovery” in the main text, the trajectory RMSE is computed for each method based on its estimates of the parameters and initial conditions. Recall that the trajectory RMSE treats the numerical ODE solution based on the true parameter values as the ground truth, and is obtained as follows:
First, the numerical solver is used to reconstruct the trajectory based on the estimates of the parameters and initial conditions from a given method; then, the RMSE of this reconstructed trajectory relative to the true trajectory at the observation time points is calculated. To visualize the trajectory RMSEs shown in Table 4 of the main text for each method, Figures S4 and S5 (for the low- and high-noise cases, respectively) plot the true trajectory (red lines) and the 95% interval of the reconstructed trajectories (gray bands) over the 100 simulated datasets for MAGI, FGPGM of Ref [13], and AGM of Ref [9].

In the low-noise case (Figure S4), the gray bands for MAGI (upper panel) closely follow the true trajectories for all five system components, showing that the statistical bias of the reconstructed trajectories is small overall. The interval bands are also very narrow, indicating that the variance of the reconstructed trajectories is low. For FGPGM [13] (middle panel), the gray bands largely follow the true trajectories for most of the system components, indicating that the statistical bias of the reconstructed trajectories is small over most of the time range; however, there is clearly visible bias during the second half of the time period (t = 50 to t = 100) for R and R_pp. The interval bands are also narrow, indicating that the variance of the reconstructed trajectories is low. For AGM [9] (lower panel), the gray bands are unable to capture the true trajectories, indicating significant statistical bias in the reconstructed trajectories. The wide interval bands indicate a high variance of the reconstructed trajectories as well; note that the 97.5 percentile of AGM also exceeds the visible upper limit of the plots for S_d and R.

Inference is more challenging in the high-noise case (Figure S5). For MAGI (upper panel), the gray bands still closely follow the true trajectories for all five system components, showing that the statistical bias of the reconstructed trajectories is small overall, with some slight bias for R_pp. The interval bands are wider than in the corresponding low-noise case but still relatively narrow for all components, indicating that the variance of the reconstructed trajectories is low. For FGPGM [13] (middle panel), the gray bands largely follow the true trajectories for all system components, showing that the statistical bias of the reconstructed trajectories is small overall. The interval bands are, however, significantly wider than in the upper panel; the variance of the reconstructed trajectories of FGPGM is thus much larger than that of MAGI. For AGM [9] (lower panel), the gray bands are again unable to capture the true trajectories, indicating significant statistical bias in the reconstructed trajectories. The wide interval bands indicate a high variance of the reconstructed trajectories; note that the 97.5 percentile of AGM again exceeds the visible upper limit of the plots for S_d and R.
Figure S4: Reconstructed trajectories by the numerical solver for each component of the protein transduction system from three methods, in the low-noise case. Upper panel: MAGI. Middle panel: FGPGM of Ref [13]. Lower panel: AGM of Ref [9]. The red line is the true trajectory. The grey area is the 95% interval represented by the 2.5 and 97.5 percentiles.
Figure S5: Reconstructed trajectories by the numerical solver for each component of the protein transduction system from three methods, in the high-noise case. Upper panel: MAGI. Middle panel: FGPGM of Ref [13]. Lower panel: AGM of Ref [9]. The red line is the true trajectory. The grey area is the 95% interval represented by the 2.5 and 97.5 percentiles.
Software implementation
User interfaces for MAGI are available for R, MATLAB, and Python at the GitHub repository https://github.com/wongswk/magi. A hedged usage sketch with the R interface follows.
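As an illustration only, the sketch below shows roughly how the R interface can be invoked on the FN system. The MagiSolver call, the odeModel fields, and the shipped model functions (fnmodelODE, fnmodelDx, fnmodelDtheta) reflect our reading of the package documentation and should be checked against the repository before use; deSolve is used here only to simulate example data.

```r
library(magi)     # R interface to MAGI (see repository above)
library(deSolve)  # used here only to simulate example data

# Simulate noisy FN observations (sigma = 0.2) on the observation grid
fn_rhs <- function(t, state, p) {
  V <- state[1]; R <- state[2]
  list(c(p[3] * (V - V^3 / 3 + R), -(1 / p[3]) * (V - p[1] + p[2] * R)))
}
tau   <- seq(0, 20, by = 0.5)
truth <- ode(y = c(V = -1, R = 1), times = tau, func = fn_rhs,
             parms = c(0.2, 0.2, 3))
y <- data.frame(time = tau,
                V = truth[, "V"] + rnorm(length(tau), sd = 0.2),
                R = truth[, "R"] + rnorm(length(tau), sd = 0.2))

# Assumed interface (check the repository documentation): the package ships
# the FN model functions and a single entry point, MagiSolver
fn_model <- list(fOde = fnmodelODE, fOdeDx = fnmodelDx,
                 fOdeDtheta = fnmodelDtheta,
                 thetaLowerBound = c(0, 0, 0),
                 thetaUpperBound = c(Inf, Inf, Inf))
result <- MagiSolver(y, fn_model, control = list(niterHmc = 20000))

# Posterior means of the parameters (a, b, c)
colMeans(result$theta)
```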