Kernel Ordinary Differential Equations
Xiaowu Dai and Lexin Li
University of California at Berkeley
Abstract
Ordinary differential equations (ODE) are widely used in modeling biological and physical processes in science. In this article, we propose a new reproducing kernel-based approach for estimation and inference of ODEs given noisy observations. We do not restrict the functional forms in ODE to be linear or additive, and we allow pairwise interactions. We perform sparse estimation to select individual functionals, and construct confidence intervals for the estimated signal trajectories. We establish the estimation optimality and selection consistency of kernel ODE under both the low-dimensional and high-dimensional settings, where the number of unknown functionals can be smaller or larger than the sample size. Our proposal builds upon the smoothing spline analysis of variance (SS-ANOVA) framework, but tackles several important problems that are not yet fully addressed, and thus extends the scope of the existing SS-ANOVA as well. We demonstrate the efficacy of our method through numerous ODE examples.
Key Words:
Component selection and smoothing operator; High dimensionality; Ordinary differential equations; Smoothing spline analysis of variance; Reproducing kernel Hilbert space.
1 Introduction

Ordinary differential equations (ODE) have been widely used to model dynamic systems and biological and physical processes in a variety of scientific applications. Examples include infectious disease (Liang and Wu, 2008), genomics (Cao and Zhao, 2008; Chou and Voit, 2009; Ma et al., 2009; Lu et al., 2011; Henderson and Michailidis, 2014; Wu et al., 2014), and neuroscience (Izhikevich, 2007; Zhang et al., 2015, 2017; Cao et al., 2019), among many others. A system of ODEs takes the form,

$$\frac{dx(t)}{dt} = \begin{pmatrix} dx_1(t)/dt \\ \vdots \\ dx_p(t)/dt \end{pmatrix} = \begin{pmatrix} F_1(x(t)) \\ \vdots \\ F_p(x(t)) \end{pmatrix} = F(x(t)), \qquad (1)$$

where $x(t) = (x_1(t), \ldots, x_p(t))^\top \in \mathbb{R}^p$ denotes the system of $p$ variables of interest, $F = \{F_1, \ldots, F_p\}$ denotes the set of unknown functionals that characterize the regulatory relations among $x(t)$, and $t$ indexes time in an interval standardized to $\mathcal{T} = [0, 1]$. The system is observed at $n$ time points $\{t_1, \ldots, t_n\}$ with measurement errors,

$$y_i = x(t_i) + \epsilon_i, \quad i = 1, \ldots, n, \qquad (2)$$

where $y_i = (y_{i1}, \ldots, y_{ip})^\top \in \mathbb{R}^p$ denotes the observed data, $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{ip})^\top \in \mathbb{R}^p$ denotes the vector of measurement errors that are usually assumed to follow independent normal distributions with mean 0 and variance $\sigma_j^2$, $j = 1, \ldots, p$, and $n$ denotes the number of time points. Besides, an initial condition $x(0) \in \mathbb{R}^p$ is usually given for the system (1).

In a biological or physical system, a central question of interest is to uncover the structure of the system of ODEs, in terms of which variables regulate which other variables, given the observed noisy time-course data $\{y_i\}_{i=1}^{n}$. Specifically, we say that $x_k$ regulates $x_j$ if $F_j$ is a functional of $x_k$. In other words, $x_k$ controls the change of $x_j$ through the functional $F_j$ on the derivative $dx_j/dt$. Therefore, the functionals $F = \{F_1, \ldots, F_p\}$ encode the regulatory relations of interest, and are often assumed to take the form,

$$F_j(x(t)) = \theta_{j0} + \sum_{k=1}^{p} F_{jk}(x_k(t)) + \sum_{k=1}^{p} \sum_{l \neq k} F_{jkl}(x_k(t), x_l(t)), \quad j = 1, \ldots, p, \qquad (3)$$

where $\theta_{j0} \in \mathbb{R}$ denotes the intercept, and $F_{jk}$ and $F_{jkl}$ represent the main effect and two-way interaction, respectively. Higher-order interactions are possible, but two-way interactions are the most common structure studied in ODEs (Ma et al., 2009; Zhang et al., 2015).

There have been numerous pioneering works studying statistical modeling of ODEs. However, nearly all existing solutions constrain the forms of $F$. Broadly speaking, there are two categories of functional forms imposed. The first category considers linear functionals for $F$. For instance, Lu et al. (2011) studied a system of linear ODEs to model dynamic gene regulatory networks. Zhang et al. (2015) extended the linear ODEs to include interactions to model brain connectivity networks. The model of Zhang et al. (2015), other than differentiating between the variables that encode the neuronal activities and the ones that represent the stimulus signals, is in effect of the form,

$$F_j(x(t)) = \theta_{j0} + \sum_{k=1}^{p} \theta_{jk} x_k(t) + \sum_{k=1}^{p} \sum_{l \neq k} \theta_{jkl} x_k(t) x_l(t), \quad j = 1, \ldots, p, \qquad (4)$$

whereas the model of Lu et al. (2011) is similar to (4) but focuses on the main-effect terms only. In both cases, $F_j$ takes a linear form.
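For concreteness, the following minimal sketch generates data from a toy instance of (1)-(2): it integrates a two-variable system by the forward Euler scheme and samples noisy observations at $n$ evenly spaced time points. The particular system $F$, the step size, and the noise level are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

def F(x):
    """An illustrative right-hand side of the form (1); this toy choice has
    linear main effects plus one pairwise interaction, as in (4)."""
    return np.array([-x[0] + 0.5 * x[1],
                     -0.5 * x[1] + 0.2 * x[0] * x[1]])

def euler_solve(F, x0, T=1.0, h=0.01):
    """Forward-Euler solution of dx/dt = F(x(t)) on [0, T] with step h."""
    grid = np.arange(0.0, T + h, h)
    xs = np.empty((len(grid), len(x0)))
    xs[0] = x0
    for i in range(1, len(grid)):
        xs[i] = xs[i - 1] + h * F(xs[i - 1])
    return grid, xs

# Observation model (2): y_i = x(t_i) + eps_i at n time points.
rng = np.random.default_rng(0)
grid, xs = euler_solve(F, x0=np.array([1.0, 1.0]))
n, sigma = 40, 0.1                                   # illustrative values
idx = np.linspace(0, len(grid) - 1, n).astype(int)
y = xs[idx] + sigma * rng.normal(size=(n, xs.shape[1]))
```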
Dattner and Klaassen (2015) further extended the functional $F_j$ in (4) to a generalized linear form, but without the interactions, i.e.,

$$F_j(x(t)) = \theta_{j0} + \psi_j(x(t))^\top \theta_j, \quad j = 1, \ldots, p, \qquad (5)$$

where $\theta_{j0} \in \mathbb{R}$, $\theta_j \in \mathbb{R}^d$, and $\psi_j(x) = (\psi_{j1}(x), \ldots, \psi_{jd}(x))^\top \in \mathbb{R}^d$ is a finite set of known basis functions. The second category considers additive functionals for $F$. Particularly, Henderson and Michailidis (2014); Wu et al. (2014); Chen et al. (2017) considered the generalized additive model for $F_j$,

$$F_j(x(t)) = \theta_{j0} + \sum_{k=1}^{p} F_{jk}(x_k(t)) = \theta_{j0} + \sum_{k=1}^{p} \left\{ \psi(x_k(t))^\top \theta_{jk} + \delta_{jk}(x_k(t)) \right\}, \quad j = 1, \ldots, p, \qquad (6)$$

where $\theta_{j0} \in \mathbb{R}$, $\theta_{jk} \in \mathbb{R}^d$, $\psi(x) = (\psi_1(x), \ldots, \psi_d(x))^\top \in \mathbb{R}^d$ is a finite set of common basis functions, and $\delta_{jk}$ is the residual function. Different from Dattner and Klaassen (2015), the residual $\delta_{jk}$ is unknown. The functional $F_j$ in (6) takes an additive form.

These works have laid a solid foundation for statistical modeling of ODEs. However, in plenty of scientific applications, the linear or additive forms of the functionals $F$ can be restrictive. Besides, it is highly nontrivial to couple the basis function-based solutions with the interactions. We give a more specific example in Section 2.1, where a commonly used enzyme network ODE system involves both nonlinear functionals and two-way interactions. Such examples are often the rule rather than the exception, motivating us to consider a more flexible form of ODEs. Moreover, the existing ODE methods have primarily focused on sparse estimation, but few have tackled the problem of statistical inference, which is challenging due to the complicated correlation structures of ODEs.

In this article, we propose a novel approach of kernel ordinary differential equations (KODE) for estimation and inference of the ODE system in (1) given the noisy observations from (2). We adopt the general formulation of (3), but we do not restrict the functional forms of $F$, and we allow pairwise interactions. As such, we consider a more general ODE system that encompasses (4), (5) and (6) as special cases. We further introduce sparsity regularizations to achieve selection of individual functionals in (3), which yields a sparse recovery of the regulatory relations among $F$, and thus improves the model interpretability. Moreover, we derive the confidence interval for the estimated signal trajectory $x_j(t)$. We establish the estimation optimality and selection consistency of kernel ODE, under both the low-dimensional and high-dimensional settings, where the number of unknown functionals $p$ can be smaller or larger than the number of time points $n$, and we study the regime-switching phenomenon. These differences clearly separate our proposal from the existing ODE solutions in the literature.

Our proposal is built upon the smoothing spline analysis of variance (SS-ANOVA) framework that was first introduced by Wahba et al. (1995), then further developed in regression and functional data analysis settings by Huang (1998); Lin and Zhang (2006); Zhu et al. (2014). We adopt a similar component selection and smoothing operator (COSSO) type penalty as Lin and Zhang (2006) for regularization, and conceptually, our work extends COSSO to the ODE setting. However, our proposal differs from COSSO and the existing SS-ANOVA methods considerably, in multiple ways.
First, unlike the standard SS-ANOVA models, the regressors of kernel ODE are not directly observed, and need to be estimated from the data with error. This extra layer of randomness and estimation error introduces additional difficulty to SS-ANOVA. Second, we employ the integral of the estimated trajectories in the loss function to improve the estimation properties (Dattner and Klaassen, 2015; Chen et al., 2017). The use of the integral and the inclusion of the interaction terms pose some identifiability questions that we tackle explicitly. Third, we establish the estimation optimality and selection consistency in the RKHS framework, which is utterly different from the estimation result studied in Zhu et al. (2014), and requires new technical tools. Moreover, our theoretical analysis extends that of Chen et al. (2017) from the finite bases setting of cubic splines to the infinite bases setting of RKHS. Finally, for statistical inference, we derive the confidence bands to provide uncertainty quantification for the penalized estimators of the signal trajectories in the ODE model. Our solution builds on the confidence intervals idea of Wahba (1983). But unlike the classical methods focusing on the fixed dimensionality $p$ (Wahba, 1983; Opsomer and Ruppert, 1997), we allow a diverging $p$ that can far exceed the sample size $n$. In summary, our proposal tackles several crucial problems that are not yet fully addressed in the existing SS-ANOVA framework, and it is far from a straightforward extension. We believe the proposed kernel ODE method not only makes a useful addition to the toolbox of ODE modeling, but also extends the scope of SS-ANOVA-based kernel learning.

The rest of the article is organized as follows. We propose our kernel ODE method in Section 2, and develop the corresponding estimation algorithm and inference procedure in Section 3. We derive the consistency and optimality of the proposed method in Section 4. We investigate the numerical performance in Section 5, and illustrate with a real data example in Section 6. We relegate all proofs to the Supplementary Appendix.

2 Kernel ODE

2.1 Motivating example

We consider an enzymatic regulatory network as an example to demonstrate that nonlinear functionals as well as interactions are common in systems of ODEs. Ma et al. (2009) found that all circuits of three-node enzyme network topologies that perform biochemical adaptation can be well approximated by two architectural classes: a negative feedback loop with a buffering node, and an incoherent feedforward loop with a proportioner node. The mechanism of the first class follows the Michaelis-Menten kinetic equations (Tzafriri, 2003),

$$\begin{aligned}
\frac{dx_1(t)}{dt} &= c_1 x_0 \frac{1 - x_1(t)}{\{1 - x_1(t)\} + C_1} - \tilde{c}_1 c_2 x_3(t) \frac{x_1(t)}{x_1(t) + C_2}, \\
\frac{dx_3(t)}{dt} &= c_3 x_1(t) \frac{1 - x_3(t)}{\{1 - x_3(t)\} + C_3} - \tilde{c}_2 c_4 \frac{x_3(t)}{x_3(t) + C_4}, \qquad (7) \\
\frac{dx_2(t)}{dt} &= c_5 x_1(t) \frac{1 - x_2(t)}{\{1 - x_2(t)\} + C_5} - c_6 x_3(t) \frac{x_2(t)}{x_2(t) + C_6},
\end{aligned}$$

where $x_1(t), x_2(t), x_3(t)$ are three interacting nodes such that $x_1(t)$ receives the input, $x_3(t)$ plays the diverse regulatory role, and $x_2(t)$ transmits the output, $x_0$ is the initial input stimulus, and $c_1, \ldots, c_6$, $C_1, \ldots, C_6$, $\tilde{c}_1, \tilde{c}_2$ denote the catalytic rate parameters, the Michaelis-Menten constants, and the concentration parameters, respectively. See also Figure 1(a) for a graphical illustration of this ODE system. In this model, we see that the functionals $F_1, F_3, F_2$ are all nonlinear. Besides, both $F_1$ and $F_2$ involve two-way interactions.
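To make explicit where the nonlinearity and the two-way interactions in (7) sit, a sketch of the right-hand side is given below; the rate constants default to the values used later in Section 5.2, but should be read as placeholders.

```python
import numpy as np

def nfblb_rhs(x, x0=1.0, c=(10.0, 10.0, 10.0, 10.0, 10.0, 1.0),
              C=(0.1,) * 6, c1t=1.0, c2t=0.2):
    """Right-hand side of the Michaelis-Menten system (7), x = (x1, x2, x3).
    Each fraction is a nonlinear (Michaelis-Menten) term; the products
    x3*x1 and x3*x2 are the two-way interactions in F_1 and F_2."""
    x1, x2, x3 = x
    c1, c2, c3, c4, c5, c6 = c
    dx1 = c1 * x0 * (1 - x1) / ((1 - x1) + C[0]) \
        - c1t * c2 * x3 * x1 / (x1 + C[1])        # interaction x3 * x1
    dx2 = c5 * x1 * (1 - x2) / ((1 - x2) + C[4]) \
        - c6 * x3 * x2 / (x2 + C[5])              # interaction x3 * x2
    dx3 = c3 * x1 * (1 - x3) / ((1 - x3) + C[2]) \
        - c2t * c4 * x3 / (x3 + C[3])             # nonlinear, no interaction
    return np.array([dx1, dx2, dx3])
```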
It is of great interest to estimate the $F_j$'s given the observed data, to verify model (7), and to carry out statistical inference for the unknown parameters. This example, along with many other ODE systems with nonlinear functionals and interaction terms, motivates us to consider a general ODE system as given in (3).

2.2 Model formulation

Before presenting our method, we first briefly review the two-step collocation estimation method, which is commonly used for parameter estimation in ODEs, and is also useful in our setting. The method was first proposed by Varah (1982), then extended to various ODE models. In the first step, it fits a smoothing estimate,

$$\min_{z_j} \sum_{i=1}^{n} \{y_{ij} - z_j(t_i)\}^2 + \lambda_{nj} J(z_j), \quad j = 1, \ldots, p,$$

where $J(\cdot)$ is a smoothness penalty. Let $\hat{x}_j(t)$ denote its minimizer. Given $\hat{x}_j(t)$, in the second step, it solves an optimization problem to estimate the model parameters $\theta_{j0} \in \mathbb{R}$ and $\theta_j = (\theta_{j1}, \ldots, \theta_{jp})^\top \in \mathbb{R}^p$, for $j = 1, \ldots, p$. Particularly, Varah (1982) considered the derivative $d\hat{x}_j(t)/dt$ and the following minimization,

$$\min_{\theta_{j0}, \theta_j} \int \left( \frac{d\hat{x}_j(t)}{dt} - \theta_{j0} - \sum_{k=1}^{p} \theta_{jk} \hat{x}_k(t) \right)^2 dt, \quad j = 1, \ldots, p.$$

Wu et al. (2014) developed a similar two-step collocation method for their additive ODE model (6), and estimated the model parameters $\theta_{j0} \in \mathbb{R}$ and $\theta_{jk} = (\theta_{jk1}, \ldots, \theta_{jkd})^\top \in \mathbb{R}^d$, for $j, k = 1, \ldots, p$, with a standardized group $\ell_1$-penalty,

$$\min_{\theta_{j0}, \theta_{jk}} \int \left( \frac{d\hat{x}_j(t)}{dt} - \theta_{j0} - \sum_{k=1}^{p} \theta_{jk}^\top \psi(\hat{x}_k(t)) \right)^2 dt + \tau_{nj} \sum_{k=1}^{p} \left[ \int \left\{ \theta_{jk}^\top \psi(\hat{x}_k(t)) \right\}^2 dt \right]^{1/2}.$$

Related proposals employed group $\ell_1$ and regular $\ell_1$-penalties; meanwhile, Henderson and Michailidis (2014) considered an extra $\ell_2$-penalty.

Alternatively, in the second step, Dattner and Klaassen (2015) proposed to focus on the integral $\int_0^t \psi_j(\hat{x}(u)) du$, rather than the derivative $d\hat{x}_j(t)/dt$, and they estimated the model parameters $\theta_{j0} \in \mathbb{R}$ and $\theta_j = (\theta_{j1}, \ldots, \theta_{jd})^\top \in \mathbb{R}^d$, for $j = 1, \ldots, p$, in (5) by,

$$\min_{\theta_{j0}, \theta_j} \sum_{j=1}^{p} \int \left\{ \hat{x}_j(t) - \theta_{j0} - \theta_j^\top \int_0^{t} \psi_j(\hat{x}(u)) du \right\}^2 dt.$$

They found that this modification from the derivative to the integral leads to a more robust estimate, and also to an easier derivation of the asymptotic properties. Chen et al. (2017) adopted this idea for their additive ODE model (6), and estimated the parameters $\theta_{j0} \in \mathbb{R}$, $b_j \in \mathbb{R}$, and $\theta_{jk} = (\theta_{jk1}, \ldots, \theta_{jkd})^\top \in \mathbb{R}^d$, for $j, k = 1, \ldots, p$, by

$$\min_{\theta_{j0}, b_j, \theta_{jk}} \frac{1}{n} \sum_{i=1}^{n} \left\{ y_{ij} - \theta_{j0} - b_j t_i - \sum_{k=1}^{p} \theta_{jk}^\top \int_0^{t_i} \psi(\hat{x}_k(u)) du \right\}^2 + \tau_{nj} \sum_{k=1}^{p} \left[ \frac{1}{n} \sum_{i=1}^{n} \left\{ \theta_{jk}^\top \int_0^{t_i} \psi(\hat{x}_k(u)) du \right\}^2 \right]^{1/2}.$$

We build the proposed kernel ODE within the smoothing spline ANOVA framework; see Wahba et al. (1995) and Gu (2013) for more background on SS-ANOVA. Specifically, let $\mathcal{H}_k$ denote a space of functions of $x_k(t) \in \mathcal{X}$ with zero marginal integral, and let $\{1\}$ denote the space of constant functions. Construct the tensor product space,

$$\mathcal{H} = \{1\} \oplus \sum_{k=1}^{p} \mathcal{H}_k \oplus \sum_{k=1}^{p} \sum_{l \neq k} \left( \mathcal{H}_k \otimes \mathcal{H}_l \right). \qquad (8)$$
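A small sketch of how the decomposition (8) translates into kernels: the Gram matrix of an interaction space $\mathcal{H}_k \otimes \mathcal{H}_l$ is the elementwise product of the two main-effect Gram matrices, a fact used again for (13) below. The Gaussian kernel and the omission of the zero-marginal-integral centering are simplifying assumptions.

```python
import numpy as np

def kernel_1d(u, v, bw=1.0):
    """A Mercer kernel on a single coordinate; Gaussian for illustration."""
    return np.exp(-(u[:, None] - v[None, :]) ** 2 / (2 * bw ** 2))

def anova_grams(X):
    """Gram matrices for (8): one per main-effect space H_k, and one per
    interaction space H_k (x) H_l via the product kernel K_k * K_l."""
    n, p = X.shape
    mains = {k: kernel_1d(X[:, k], X[:, k]) for k in range(p)}
    inters = {(k, l): mains[k] * mains[l]      # elementwise (Hadamard) product
              for k in range(p) for l in range(p) if k != l}
    return mains, inters
```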
We assume the functionals $F_j$, $j = 1, \ldots, p$, in the ODE model (3) are located in the space $\mathcal{H}$. The identifiability of the terms in (3) is assured by the conditions specified through the averaging operators: $\int_{\mathcal{T}} F_{jk}(x_k(t)) dt = 0$ for $k = 1, \ldots, p$. Let $\|\cdot\|_{\mathcal{H}}$ denote the norm of $\mathcal{H}$, and let $\mathcal{P}_k F_j$ and $\mathcal{P}_{kl} F_j$ denote the orthogonal projections of $F_j$ onto $\mathcal{H}_k$ and $\mathcal{H}_k \otimes \mathcal{H}_l$, respectively. We consider a two-step collocation estimation method, by first obtaining a smoothing spline estimate $\hat{x}(t) = (\hat{x}_1(t), \ldots, \hat{x}_p(t))^\top$, where

$$\hat{x}_j(t) = \arg\min_{z_j \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \{y_{ij} - z_j(t_i)\}^2 + \lambda_{nj} \|z_j\|_{\mathcal{F}}^2 \right\}, \quad j = 1, \ldots, p, \qquad (9)$$

then estimating $F_j \in \mathcal{H}$ and $\theta_{j0} \in \mathbb{R}$ by the following penalized optimization,

$$\min_{\theta_{j0}, F_j} \frac{1}{n} \sum_{i=1}^{n} \left\{ y_{ij} - \theta_{j0} - \int_0^{t_i} F_j(\hat{x}(t)) dt \right\}^2 + \tau_{nj} \left( \sum_{k=1}^{p} \|\mathcal{P}_k F_j\|_{\mathcal{H}} + \sum_{k=1}^{p} \sum_{l \neq k} \|\mathcal{P}_{kl} F_j\|_{\mathcal{H}} \right). \qquad (10)$$

Our proposal deals with the integral $\int_0^{t_i} F_j(\hat{x}(u)) du$, rather than the derivative $d\hat{x}_j(t)/dt$, which is in a similar spirit as Dattner and Klaassen (2015). Besides, it involves two penalty functions, $J_1 \equiv \|\cdot\|_{\mathcal{F}}^2$ in (9), and $J_2(F_j) \equiv \sum_{k=1}^{p} \|\mathcal{P}_k F_j\|_{\mathcal{H}} + \sum_{k=1}^{p} \sum_{l \neq k} \|\mathcal{P}_{kl} F_j\|_{\mathcal{H}}$ in (10), where $\lambda_{nj}$ and $\tau_{nj}$ are two tuning parameters. We next make some remarks about this proposal.

For the functionals, the formulation in (10) is highly flexible, nonlinear, and incorporates two-way interactions. Meanwhile, it naturally covers the linear ODEs in (4) and (5), and the additive ODE in (6), as special cases. In particular, if $\mathcal{H}$ is the linear functional space, $\mathcal{H} = \{1\} \oplus \sum_{k=1}^{p} \{x_k - 1/2\} \oplus \sum_{k \neq l} \left[ \{x_k - 1/2\} \otimes \{x_l - 1/2\} \right]$ with the input space $\mathcal{X} = [0, 1]^p$, then any $F$ of the form in (4) belongs to $\mathcal{H}$. If $\mathcal{H}$ is spanned by the finite set of known basis functions in (5), then any $F$ in (5) belongs to $\mathcal{H}$. If $\mathcal{H}$ is the additive functional space, $\mathcal{H} = \{1\} \oplus \sum_{k=1}^{p} \mathcal{H}_k$ with the $\ell_1$-norm, then for $F_{jk}(x_k(t)) = \psi(x_k(t))^\top \theta_{jk}$, the penalty on the main effects becomes $\sum_{k=1}^{p} \|\mathcal{P}_k F_j\|_{\mathcal{H}} = \sum_{k=1}^{p} \left[ \int \{\psi(x_k(t))^\top \theta_{jk}\}^2 dt \right]^{1/2}$, which is exactly the same as in the ODE model of Chen et al. (2017).

For the penalties, the first penalty function $J_1$ is the squared RKHS norm corresponding to the RKHS $\{\mathcal{F}, \|\cdot\|_{\mathcal{F}}\}$. It is for estimating $\hat{x}_j$, and $\mathcal{F}$ does not have to be the same as $\mathcal{H}$. The second penalty function $J_2$ is a sum of RKHS norms on the main effects and pairwise interactions. This penalty is similar to the COSSO penalty of Lin and Zhang (2006), but as we outline in Section 1, our extension is far from trivial. We also note that we do not impose a hierarchical structure on the main effects and interactions, in that if an interaction term is selected, the corresponding main effect terms do not have to be selected (Wang et al., 2009). This is motivated by the observation that, in an ODE system such as the enzymatic regulatory network example in Section 2.1, the interaction terms $x_1(t)x_2(t)$ and $x_3(t)x_2(t)$ both appear in the ODE regulating the dynamics of $x_2(t)$, but the main effect terms $x_1(t)$ and $x_3(t)$ are not present.
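The first step (9) is an ordinary smoothing problem. As a stand-in for a full smoothing spline fit, the sketch below estimates one trajectory by kernel ridge regression in an RKHS $\mathcal{F}$, returning a callable estimate that can then be plugged into the integral in (10); the kernel choice and the fixed tuning values are assumptions, whereas the paper tunes $\lambda_{nj}$ by GCV.

```python
import numpy as np

def smooth_trajectory(t, yj, lam=1e-3, bw=0.2):
    """Estimate x_j(.) from (t_i, y_ij), i = 1..n, by penalized least
    squares in an RKHS, i.e., a kernel ridge stand-in for (9)."""
    K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2 * bw ** 2))
    alpha = np.linalg.solve(K + len(t) * lam * np.eye(len(t)), yj)
    def xhat(s):
        s = np.atleast_1d(s)
        Ks = np.exp(-(s[:, None] - t[None, :]) ** 2 / (2 * bw ** 2))
        return Ks @ alpha
    return xhat
```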
Theorem 1. Assume that the RKHS $\mathcal{H}$ can be decomposed as in (8). Then there exists a minimizer of (10) in $\mathcal{H}$ for any tuning parameter $\tau_{nj} \geq 0$. Moreover, the minimizer lies in a finite-dimensional space.

Theorem 1 is a generalization of the well-known representer theorem (Wahba, 1990). The difference is that, unlike the smoothing splines model studied in Wahba (1990), the minimization in (10) involves an integral in the loss function, and the penalty is not a norm in $\mathcal{H}$ but a convex pseudo-norm. A direct implication of Theorem 1 is that, although the minimization with respect to $F_j$ is taken over an infinite-dimensional space in (10), the solution to (10) can actually be found in a finite-dimensional space. We next develop an estimation algorithm to solve (10).

3 Estimation and Inference

3.1 Estimation algorithm

The estimation of the proposed kernel ODE system consists of two major steps. The first step is the smoothing spline estimation in (9), which is standard, and the tuning of the smoothness parameter $\lambda_{nj}$ is often done through generalized cross-validation (see, e.g., Gu, 2013). The second step is to solve (10). Toward that end, we first propose an optimization problem that is equivalent to (10), but is computationally easier to tackle. We then develop an estimation algorithm to solve this new equivalent problem.

Specifically, we consider the following optimization problem, for $j = 1, \ldots, p$,

$$\min_{\theta_{j0}, \theta_j, F_j} \frac{1}{n} \sum_{i=1}^{n} \left\{ y_{ij} - \theta_{j0} - \int_0^{t_i} F_j(\hat{x}(t)) dt \right\}^2 + \eta_{nj} \left( \sum_{k=1}^{p} \theta_{jk}^{-1} \|\mathcal{P}_k F_j\|_{\mathcal{H}}^2 + \sum_{k=1}^{p} \sum_{l \neq k} \theta_{jkl}^{-1} \|\mathcal{P}_{kl} F_j\|_{\mathcal{H}}^2 \right) + \kappa_{nj} \left( \sum_{k=1}^{p} \theta_{jk} + \sum_{k=1}^{p} \sum_{l \neq k} \theta_{jkl} \right), \qquad (11)$$

subject to $\theta_{jk} \geq 0$, $\theta_{jkl} \geq 0$, $k, l = 1, \ldots, p$, $k \neq l$, where $\theta_j = (\theta_{j1}, \ldots, \theta_{jp}, \theta_{j12}, \ldots, \theta_{j1p}, \ldots, \theta_{jp1}, \ldots, \theta_{jp(p-1)})^\top \in \mathbb{R}^{p^2}$ collects the parameters to estimate, and $\eta_{nj}, \kappa_{nj} \geq 0$ are the tuning parameters, $j = 1, \ldots, p$. Comparing (11) to (10), we introduce the parameters $\theta_{jk}$ and $\theta_{jkl}$ to control the sparsity of the main effect and interaction terms in $F_j$, which is similar to Lin and Zhang (2006). The two optimization problems (10) and (11) are equivalent, in the following sense. Let $\kappa_{nj} = \tau_{nj}^2 / (4 \eta_{nj})$. Then we have,

$$\eta_{nj} \theta_{jk}^{-1} \|\mathcal{P}_k F_j\|_{\mathcal{H}}^2 + \kappa_{nj} \theta_{jk} \geq 2 \eta_{nj}^{1/2} \kappa_{nj}^{1/2} \|\mathcal{P}_k F_j\|_{\mathcal{H}} = \tau_{nj} \|\mathcal{P}_k F_j\|_{\mathcal{H}},$$

where the equality holds if $\theta_{jk} = \eta_{nj}^{1/2} \kappa_{nj}^{-1/2} \|\mathcal{P}_k F_j\|_{\mathcal{H}}$. A similar result holds for $\theta_{jkl} = \eta_{nj}^{1/2} \kappa_{nj}^{-1/2} \|\mathcal{P}_{kl} F_j\|_{\mathcal{H}}$. In other words, if $(\hat{\theta}_{j0}, \hat{F}_j)$ minimizes (10), then $(\hat{\theta}_{j0}, \hat{\theta}_j, \hat{F}_j)$ minimizes (11), with $\hat{\theta}_{jk} = \eta_{nj}^{1/2} \kappa_{nj}^{-1/2} \|\mathcal{P}_k \hat{F}_j\|_{\mathcal{H}}$ and $\hat{\theta}_{jkl} = \eta_{nj}^{1/2} \kappa_{nj}^{-1/2} \|\mathcal{P}_{kl} \hat{F}_j\|_{\mathcal{H}}$, for any $k, l = 1, \ldots, p$, $k \neq l$. Meanwhile, if $(\hat{\theta}_{j0}, \hat{\theta}_j, \hat{F}_j)$ minimizes (11), then $(\hat{\theta}_{j0}, \hat{F}_j)$ minimizes (10).

Next, we devise an iterative alternating optimization approach to solve (11).
That is, we first estimate $\theta_{j0}$ given fixed $F_j$ and $\theta_j$, then estimate the functional $F_j$ given fixed $\theta_{j0}$ and $\theta_j$, and finally estimate $\theta_j$ given fixed $\theta_{j0}$ and $F_j$.

For given $\hat{F}_j$ and $\hat{\theta}_j$, we have that,

$$\hat{\theta}_{j0} = \bar{y}_j - \int_{\mathcal{T}} \bar{T}(t) \hat{F}_j(\hat{x}(t)) dt,$$

where $T_i(t) = 1\{0 \leq t \leq t_i\}$, $\bar{T}(t) = n^{-1} \sum_{i=1}^{n} T_i(t)$, and $\bar{y}_j = n^{-1} \sum_{i=1}^{n} y_{ij}$.

For given $\hat{\theta}_{j0}$ and $\hat{\theta}_j$, the optimization problem (11) becomes,

$$\min_{F_j} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left[ (y_{ij} - \bar{y}_j) - \int_{\mathcal{T}} \{T_i(t) - \bar{T}(t)\} F_j(\hat{x}(t)) dt \right]^2 + \eta_{nj} \left( \sum_{k=1}^{p} \hat{\theta}_{jk}^{-1} \|\mathcal{P}_k F_j\|_{\mathcal{H}}^2 + \sum_{k=1}^{p} \sum_{l \neq k} \hat{\theta}_{jkl}^{-1} \|\mathcal{P}_{kl} F_j\|_{\mathcal{H}}^2 \right) \right\}. \qquad (12)$$

Let $K_j(\cdot, \cdot): \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ denote the Mercer kernel generating the RKHS $\mathcal{H}_j$, $j = 1, \ldots, p$. Then $K_k K_l$ is the reproducing kernel of the RKHS $\mathcal{H}_k \otimes \mathcal{H}_l$ (Aronszajn, 1950). Let $K_{\theta_j} = \sum_{k=1}^{p} \hat{\theta}_{jk} K_k + \sum_{k \neq l} \hat{\theta}_{jkl} K_k K_l$. By the representer theorem (Wahba, 1990), the solution $\hat{F}_j$ to (12) is of the form,

$$\hat{F}_j(\hat{x}(t)) = b_j + \sum_{i=1}^{n} c_{ij} \int_{\mathcal{T}} K_{\theta_j}(\hat{x}(t), \hat{x}(s)) \{T_i(s) - \bar{T}(s)\} ds, \qquad (13)$$

for some $b_j \in \mathbb{R}$ and $c_j = (c_{1j}, \ldots, c_{nj})^\top \in \mathbb{R}^n$. Write $y_j = (y_{1j}, \ldots, y_{nj})^\top \in \mathbb{R}^n$ and $\bar{y}_j = (\bar{y}_j, \ldots, \bar{y}_j)^\top \in \mathbb{R}^n$. Let $B$ be an $n \times 1$ vector whose $i$th entry is $B_i = \int_{\mathcal{T}} \{T_i(t) - \bar{T}(t)\} dt$, $i = 1, \ldots, n$, and let $\Sigma$ be an $n \times n$ matrix whose $(i, i')$th entry is $\Sigma_{ii'} = \int_{\mathcal{T}} \int_{\mathcal{T}} \{T_i(s) - \bar{T}(s)\} K_{\theta_j}(\hat{x}(t), \hat{x}(s)) \{T_{i'}(t) - \bar{T}(t)\} ds\, dt$, $i, i' = 1, \ldots, n$. Plugging (13) into (12), we obtain the following quadratic minimization problem in terms of $\{b_j, c_j\}$,

$$\min_{b_j, c_j} \frac{1}{n} \left\| (y_j - \bar{y}_j) - (B b_j + \Sigma c_j) \right\|^2 + \eta_{nj} c_j^\top \Sigma c_j,$$

which has a closed-form solution. Consider the QR decomposition $B = [Q_1 \ Q_2] [R^\top, 0]^\top$, where $Q_1 \in \mathbb{R}^{n \times 1}$, $Q_2 \in \mathbb{R}^{n \times (n-1)}$, and $[Q_1 \ Q_2]$ is orthogonal such that $B^\top Q_2 = 0_{1 \times (n-1)}$. Write $W_j = \Sigma + n \eta_{nj} I_n$, where $I_n$ is the $n \times n$ identity matrix. Then the minimizers are,

$$c_j = Q_2 (Q_2^\top W_j Q_2)^{-1} Q_2^\top (y_j - \bar{y}_j), \qquad b_j = R^{-1} Q_1^\top (y_j - \bar{y}_j - W_j c_j).$$

Following the usual smoothing splines literature, we tune the parameter $\eta_{nj}$ in (12) by minimizing the generalized cross-validation criterion (GCV, Wahba et al., 1995),

$$\mathrm{GCV} = \frac{\| A_j(\eta_{nj})(y_j - \bar{y}_j) - (y_j - \bar{y}_j) \|^2}{\left[ n^{-1} \mathrm{tr}\{I_n - A_j(\eta_{nj})\} \right]^2},$$

where the smoothing matrix $A_j(\eta_{nj}) \in \mathbb{R}^{n \times n}$ is of the form,

$$A_j(\eta_{nj}) = I_n - n \eta_{nj} Q_2 (Q_2^\top W_j Q_2)^{-1} Q_2^\top. \qquad (14)$$

For given $\hat{\theta}_{j0}$ and $\hat{F}_j$, $\theta_j$ is the solution to a usual $\ell_1$-penalized regression problem,

$$\min_{\theta_j} \left\{ (z_j - G \theta_j)^\top (z_j - G \theta_j) + n \kappa_{nj} \left( \sum_{k=1}^{p} \theta_{jk} + \sum_{k=1}^{p} \sum_{l \neq k} \theta_{jkl} \right) \right\}, \qquad (15)$$

subject to $\theta_{jk} \geq 0$, $\theta_{jkl} \geq 0$, $k, l = 1, \ldots, p$, $k \neq l$, where the "response" is $z_j = (y_j - \bar{y}_j) - (1/2) n \eta_{nj} c_j - B b_j$, and the "predictor" is $G \in \mathbb{R}^{n \times p^2}$, whose first $p$ columns are $\Sigma_k c_j$ with $k = 1, \ldots, p$, and whose last $p(p-1)$ columns are $\Sigma_{kl} c_j$ with $k, l = 1, \ldots, p$, $k \neq l$. Here $\Sigma_k = (\Sigma^k_{ii'})$ and $\Sigma_{kl} = (\Sigma^{kl}_{ii'})$ are both $n \times n$ matrices, whose $(i, i')$th entries are $\Sigma^k_{ii'} = \int_{\mathcal{T}} \int_{\mathcal{T}} \{T_i(s) - \bar{T}(s)\} K_k(\hat{x}(t), \hat{x}(s)) \{T_{i'}(t) - \bar{T}(t)\} ds\, dt$ and $\Sigma^{kl}_{ii'} = \int_{\mathcal{T}} \int_{\mathcal{T}} \{T_i(s) - \bar{T}(s)\} K_k K_l(\hat{x}(t), \hat{x}(s)) \{T_{i'}(t) - \bar{T}(t)\} ds\, dt$, respectively, $i, i' = 1, \ldots, n$, $j = 1, \ldots, p$.
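Two of the three updates admit direct numerical recipes. The sketch below mirrors the closed form for $(b_j, c_j)$ via the QR decomposition displayed above, and casts the $\theta_j$ update (15) as a nonnegative Lasso; the use of scikit-learn, and the rescaling `alpha = kappa / 2` that matches scikit-learn's $(1/2n)$-scaled objective to (15), are implementation assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def update_bc(B, Sigma, y_c, eta):
    """Closed-form (b_j, c_j) of the quadratic problem, via B = [Q1 Q2][R; 0]."""
    n = len(y_c)
    Q, R = np.linalg.qr(B.reshape(-1, 1), mode='complete')   # full QR
    Q1, Q2 = Q[:, :1], Q[:, 1:]
    W = Sigma + n * eta * np.eye(n)
    c = Q2 @ np.linalg.solve(Q2.T @ W @ Q2, Q2.T @ y_c)
    b = np.linalg.solve(R[:1, :1], Q1.T @ (y_c - W @ c))
    return b.item(), c

def update_theta(z, G, kappa):
    """The theta_j update (15): nonnegative Lasso with response z_j, design G."""
    fit = Lasso(alpha=kappa / 2.0, positive=True, fit_intercept=False).fit(G, z)
    return fit.coef_
```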
We employ the Lasso for (15) in our implementation, and tune the parameter $\kappa_{nj}$ using tenfold cross-validation, following the usual Lasso literature.

We repeat the above optimization steps iteratively until some stopping criterion is met, i.e., when the estimates in two consecutive iterations are close enough, or when the number of iterations reaches some maximum number. In our simulations, we have found that the algorithm converges quickly, usually within 10 iterations. Another issue is the identifiability of the $\mathcal{P}_k F_j$'s and $\mathcal{P}_{kl} F_j$'s in (11), in the sense of unique solutions. We introduce the collinearity indices $C_{jk}$ and $C_{jkl}$ to reflect the identifiability. Specifically, let $W$ denote a $p^2 \times p^2$ matrix, whose entries are $\cos(\mathcal{P}_k F_j, \mathcal{P}_{k'} F_j)$, $\cos(\mathcal{P}_k F_j, \mathcal{P}_{k'l'} F_j)$, $\cos(\mathcal{P}_{kl} F_j, \mathcal{P}_{k'} F_j)$, and $\cos(\mathcal{P}_{kl} F_j, \mathcal{P}_{k'l'} F_j)$, $j, k, l = 1, \ldots, p$. Then $C_{jk}$ and $C_{jkl}$ are defined by the diagonals of $W^{-1}$. When some $C_{jk}$ and $C_{jkl}$ are much larger than one, the identifiability issue occurs (Stewart, 1987; Gu, 2013). This is often due to an insufficient amount of data relative to the complexity of the model we fit. In this case, we find that increasing $\eta_{nj}$ and $\kappa_{nj}$ in (11) often helps with the identifiability issue, as it helps reduce the model complexity.

We summarize the above estimation procedure in Algorithm 1.

Algorithm 1: Iterative optimization algorithm for kernel ODE.
1. Initialization: set the initial values $\theta_{jk} = \theta_{jkl} = 1$, $j, k, l = 1, \ldots, p$, $k \neq l$, and the tuning parameters $(\eta_{nj}, \kappa_{nj})$.
2. Fit the smoothing spline model (9), and obtain $\hat{x}_j(t)$, $j = 1, \ldots, p$.
3. Repeat:
   (a) Solve $\hat{\theta}_{j0}$ given $\hat{F}_j$ and $\hat{\theta}_j$, $j = 1, \ldots, p$.
   (b) Solve $\hat{F}_j$ in (12) given $\hat{\theta}_{j0}$ and $\hat{\theta}_j$, $j = 1, \ldots, p$.
   (c) Solve $\hat{\theta}_j$ in (15) given $\hat{\theta}_{j0}$ and $\hat{F}_j$, $j = 1, \ldots, p$.
4. Until the stopping criterion is met.
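A sketch of the collinearity diagnostic just described: given the fitted component vectors (each component $\mathcal{P}_k F_j$ or $\mathcal{P}_{kl} F_j$ evaluated at the sample points, an assumed input format), form the matrix of pairwise cosines and read off the diagonal of its inverse.

```python
import numpy as np

def collinearity_indices(components):
    """Collinearity indices C_jk, C_jkl: diagonals of the inverse of the
    cosine matrix W; values much larger than one flag identifiability issues."""
    U = np.array([c / np.linalg.norm(c) for c in components])
    W = U @ U.T                     # W[a, b] = cos(component_a, component_b)
    return np.diag(np.linalg.inv(W))
```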
3.2 Confidence intervals

Next, we derive the confidence intervals for the estimated trajectory $\hat{x}_j(t_i)$. This is related to post-selection inference, as the actual coverage probability of a confidence interval that ignores the uncertainty of the preceding sparse estimation can be dramatically smaller than the nominal level. Our result extends the recent work of Berk et al. (2013) and Bachoc et al. (2019) from linear regression models to nonparametric ODE models, while our setting is more challenging, as it involves infinite-dimensional functional objects.

Let $\hat{\theta}_j$ denote the estimator of $\theta_j$ obtained from Algorithm 1. Denote $\mathcal{M} \equiv \{1, \ldots, p, (1, 2), \ldots, (1, p), \ldots, (p, 1), \ldots, (p, p-1)\}$, and denote $M_j \subseteq \mathcal{M}$ as the index set of the nonzero entries of the sparse estimator $\hat{\theta}_j$. Note that $M_j$ is allowed to be an empty set. Let $\hat{\theta}_{M_j}$ be the least squares estimate with $M_j$ as the support that minimizes the unpenalized objective function in (15), i.e., $(z_j - G \theta_j)^\top (z_j - G \theta_j)$. Plugging this estimate $\hat{\theta}_{M_j}$ into (13) gives the corresponding estimate of the functional $F_j$ as,

$$\hat{F}_{j, \hat{\theta}_{M_j}}(\hat{x}(t)) = b_j + \sum_{i=1}^{n} c_{ij} \int_{\mathcal{T}} K_{\hat{\theta}_{M_j}}(\hat{x}(t), \hat{x}(s)) \{T_i(s) - \bar{T}(s)\} ds.$$
For a nominal level $\alpha \in (0, 1)$ and $i = 1, \ldots, n$, define $c(\hat{x}_j(t_i))$ as the smallest constant satisfying that,

$$P_{n, F_j, \sigma_j} \left[ \max_{M_j \subseteq \mathcal{M}} \sigma_j^{-1} \left| \{\tilde{A}_{M_j}\}_{i \cdot} (y_j - \bar{y}_j) \right| \leq c(\hat{x}_j(t_i)) \right] \geq 1 - \alpha, \qquad (16)$$

where $\{\tilde{A}_{M_j}\}_{i \cdot} = \{A_{M_j}\}_{i \cdot} / \|\{A_{M_j}\}_{i \cdot}\|_{\ell_2}$, $\{A_{M_j}\}_{i \cdot}$ is the $i$th row of $A_{M_j}$, $A_{M_j}$ is the smoothing matrix as defined in (14) with the corresponding $\hat{\theta}_{M_j}$, and $\sigma_j^2$ is the variance of the error term $\epsilon_{ij}$ in (2). We then construct the confidence interval $\mathrm{CI}(\hat{x}_j(t_i))$ for the prediction of the true trajectory $x_j(t)$ following model selection as,

$$\mathrm{CI}(\hat{x}_j(t_i)) = \int_{\mathcal{T}} \{T_i(t) - \bar{T}(t)\} \hat{F}_{j, \hat{\theta}_{M_j}}(\hat{x}(t)) dt \ \pm \ c(\hat{x}_j(t_i)) \, \sigma_j \, \|\{A_{M_j}\}_{i \cdot}\|, \qquad (17)$$

for any $i = 1, \ldots, n$ and $j = 1, \ldots, p$.

Next, we show that the confidence interval in (17) has the desired coverage probability. Later we develop a procedure to estimate the cutoff value $c(\hat{x}_j(t_i))$ in (16) given the data.
Theorem 2. Let $M_j \subseteq \mathcal{M}$ be the index set of the nonzero entries of the sparse estimator $\hat{\theta}_j$. Then the choice of $c(\hat{x}_j(t_i))$ in (16) does not depend on $F_j$, and $\mathrm{CI}(\hat{x}_j(t_i))$ in (17) satisfies the coverage property, for any $i = 1, \ldots, n$ and $j = 1, \ldots, p$, in that,

$$\inf_{F_j \in \mathcal{H}, \sigma_j > 0} P \left\{ \int_{\mathcal{T}} \{T_i(t) - \bar{T}(t)\} \, \mathbb{E}\left[ \hat{F}_{j, \hat{\theta}_{M_j}}(\hat{x}(t)) \right] dt \in \mathrm{CI}(\hat{x}_j(t_i)) \right\} \geq 1 - \alpha.$$

A few remarks are in order. First, the coverage in Theorem 2 is guaranteed for all sparse estimation and selection procedures. As such, $\mathrm{CI}(\hat{x}_j(t_i))$ in (17), following the terminology of Berk et al. (2013), is a universally valid post-selection confidence interval. Second, if we replace $c(\hat{x}_j(t_i))$ in (17) by $z_{\alpha/2}$, i.e., the upper $\alpha/2$ quantile of the standard normal distribution, then $\mathrm{CI}(\hat{x}_j(t_i))$ reduces to the "naive" confidence interval. It is constructed as if $M_j$ were fixed a priori, and it ignores any uncertainty or error of the sparse estimation step. This naive confidence interval, however, does not have the coverage property as in Theorem 2, and thus is not a truly valid confidence interval. Finally, data splitting (Cox, 1975) is a commonly used alternative strategy for post-selection inference, but it is not directly applicable in our ODE setting, because it is difficult to split the time series data into independent parts.

Next, we devise a procedure to compute the cutoff value $c(\hat{x}_j(t_i))$.
Proposition 1. The value $c(\hat{x}_j(t_i))$ in (16) is the same as the solution $t \geq 0$ satisfying,

$$\mathbb{E}_U \, P \left( \max_{M_j \subseteq \mathcal{M}} \left| \{\tilde{A}_{M_j}\}_{i \cdot} V \right| \leq t / U \;\Big|\; U \right) = 1 - \alpha,$$

where $V$ is uniformly distributed on the unit sphere in $\mathbb{R}^n$, and $U$ is a nonnegative random variable such that $U^2$ follows a chi-squared distribution $\chi^2(n)$.

Following Proposition 1, we compute $c(\hat{x}_j(t_i))$ as follows. We first generate $N$ i.i.d. copies of random vectors $V_1, \ldots, V_N$ uniformly distributed on the unit sphere in $\mathbb{R}^n$. We then calculate the quantity $c_\nu = \max_{M_j \subseteq \mathcal{M}} |\{\tilde{A}_{M_j}\}_{i \cdot} V_\nu|$, for $\nu = 1, \ldots, N$. Let $D_U$ denote the cumulative distribution function of $U$, and $D_{\chi^2}$ the cumulative distribution function of a $\chi^2(n)$ distribution, so that $D_U(t) = D_{\chi^2}(t^2)$. We next obtain $c(\hat{x}_j(t_i))$ by searching for the $c$ that solves $N^{-1} \sum_{\nu=1}^{N} D_U(c / c_\nu) = 1 - \alpha$, using, for example, a bisection search. Finally, we estimate the error variance $\sigma_j^2$ in (17) using the usual noise estimator in the context of RKHS (Wahba, 1990), i.e., $\hat{\sigma}_j^2 = \|A_{M_j}(y_j - \bar{y}_j) - (y_j - \bar{y}_j)\|^2 / \mathrm{tr}(I - A_{M_j})$.

We also remark that the inference on the prediction of the trajectory $x_j(t)$ following model selection, as described in Theorem 2, amounts to the inference on the estimation of the integral $\int_0^t F_j(x(s)) ds$. This type of inference is of great importance in dynamic systems (Izhikevich, 2007; Chou and Voit, 2009; Ma et al., 2009). Our solution takes the selected model as an approximation to the truth, but does not require that the true data generation model be among the candidates of model selection. We note that it is also possible to do inference on the individual components of $F_j$ directly; e.g., one could construct the confidence interval for $F_{jk}$ in (3). But this is achieved at the cost of imposing additional assumptions, including the requirement that the true data generation model is among the class of pairwise interaction models as in (3), and the orthogonality property as in Chernozhukov et al. (2015), or its equivalent characterization as in Zhang and Zhang (2014); Javanmard and Montanari (2014). For nonparametric kernel estimators, the orthogonality property is shown to hold if the covariates $x_j$'s are assumed to be weakly dependent (Lu et al., 2020). It is interesting to further investigate if such a property holds in the context of the kernel ODE model under a similar condition of weakly dependent covariates. We leave this as our future research.
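The Monte Carlo recipe of Proposition 1 is direct to implement. In the sketch below, `rows` stacks the normalized $i$th rows $\{\tilde{A}_{M_j}\}_{i\cdot}$ over all candidate models $M_j$ (an assumed input format), and the root is found with a bracketing solver in place of plain bisection.

```python
import numpy as np
from scipy.stats import chi2
from scipy.optimize import brentq

def mc_cutoff(rows, n, alpha=0.05, N=1000, seed=0):
    """Compute c(xhat_j(t_i)) following Proposition 1 by Monte Carlo."""
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(N, n))
    V /= np.linalg.norm(V, axis=1, keepdims=True)        # uniform on the sphere
    c_nu = np.abs(np.asarray(rows) @ V.T).max(axis=0)    # c_nu, nu = 1..N
    # D_U(t) = P(U <= t) with U^2 ~ chi2(n), hence D_U(t) = D_chi2(t^2).
    f = lambda c: np.mean(chi2.cdf((c / c_nu) ** 2, df=n)) - (1 - alpha)
    return brentq(f, 1e-8, 1e3)
```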
4 Theoretical Properties

We next establish the estimation optimality and selection consistency of kernel ODE. These theoretical results hold for both the low-dimensional and high-dimensional settings, where the number of functionals $p$ can be smaller or larger than the sample size $n$. We first introduce two assumptions.

Assumption 1. The number of nonzero functional components is bounded, i.e., $\mathrm{card}\left( \{k : F_{jk} \neq 0\} \cup \{1 \leq l \neq k \leq p : F_{jkl} \neq 0\} \right)$ is bounded, for any $j = 1, \ldots, p$.
Assumption 2. For any $F_j \in \mathcal{H}$, there exists a random variable $B$, with $\mathbb{E}(B^2) < \infty$, such that

$$\left| \frac{\partial F_j(x)}{\partial x_k} \right| \leq B \, \|F_j\|_{L_2}, \quad \text{almost surely}.$$

Assumption 1 concerns the complexity of the functionals. Similar assumptions have been adopted in the sparse additive model over RKHS when $F_{jkl} = 0$ (see, e.g., Koltchinskii and Yuan, 2010; Raskutti et al., 2011). Assumption 2 is an inverse Poincaré inequality type condition, which regularizes the fluctuation in $F_j$ relative to the $L_2$-norm. The same assumption was also used for additive models in RKHS (Zhu et al., 2014).

We begin with the error bound for the estimated trajectory $\hat{x}(t)$, which holds uniformly for $j = 1, \ldots, p$. This is a relatively standard result, and the bound is needed both for analyzing the error of the functional estimators in kernel ODE, and for establishing the selection consistency later.

Theorem 3 (Optimal estimation of the trajectory). Suppose that $x_j(t) \in \mathcal{F}$, $j = 1, \ldots, p$, and the RKHS $\mathcal{F}$ is embedded in a $\beta_1$th-order Sobolev space, $\beta_1 > 1/2$. Then the smoothing spline estimate from (9) satisfies, for any $j = 1, \ldots, p$,

$$\min_{\lambda_{nj} \geq 0} \int_{\mathcal{T}} \{\hat{x}_j(t) - x_j(t)\}^2 dt = O_p\left( n^{-\frac{2\beta_1}{2\beta_1 + 1}} \right),$$

which achieves the minimax optimal rate.

Next, we derive the convergence rate for the estimated functional $F_j$. Because the trajectory $\hat{x}$ is estimated, establishing the optimal rate of convergence requires extra theoretical attention, which is related to recent work on errors in variables for Lasso-type regressions (Loh and Wainwright, 2012; Zhu et al., 2014). The proof involves several tools for Rademacher processes (van der Vaart and Wellner, 1996), and concentration inequalities for empirical processes (Talagrand, 1996; Yuan and Zhou, 2016).

Theorem 4 (Optimal estimation of the functional). Suppose that $F_j \in \mathcal{H}$, $j = 1, \ldots, p$, where $\mathcal{H}$ satisfies (8), and the RKHS $\mathcal{H}_j$ is embedded in a $\beta_2$th-order Sobolev space, $\beta_2 > 1$. Suppose Assumptions 1 and 2 hold. Then, as long as $F_j$ is not a constant function, the KODE estimate $\hat{F}_j$ from (10) satisfies, for any $j = 1, \ldots, p$,

$$\min_{\tau_{nj} \geq 0} \int_{\mathcal{T}} \left\{ \hat{F}_j(x(t)) - F_j(x(t)) \right\}^2 dt = O_p\left( \left( \frac{n}{\log n} \right)^{-\frac{2\beta_2}{2\beta_2 + 1}} + \frac{\log p}{n} + n^{-\frac{2\beta_1}{2\beta_1 + 1}} \right),$$

which achieves the minimax optimal rate.

This theorem is one of the key results, and we make a few remarks. First, there are three error terms in Theorem 4, which are attributed to the estimation of the interactions, the Lasso estimation, and the measurement errors in variables, respectively. Particularly, the error term $O_p(n^{-2\beta_1/(2\beta_1+1)})$ arises because $x(t)$ is unobserved, and is instead measured at discrete time points subject to measurement errors. Since this error term achieves the optimal rate, it fully characterizes the influence of the estimated $\hat{x}(t)$ on the resulting estimator $\hat{F}_j$. Moreover, $\beta_1$ and $\beta_2$ measure the orders of smoothness for estimating $x_j$ and $F_j$, respectively. They can be different, which allows flexibility when choosing kernels for the estimation procedure. For instance, if there is prior knowledge that $x(t)$ is smooth, we may then choose $\beta_1 > \beta_2$, and the resulting estimator $\hat{F}_j$ achieves a convergence rate of $O_p\left( (n/\log n)^{-2\beta_2/(2\beta_2+1)} + \log p / n \right)$.
It is interesting to note that this rate is the same as the rate as if $x(t)$ were directly observed and there were no integral involved in the loss function, for example, in the setting of Lin and Zhang (2006).

Second, there exists a regime-switching phenomenon, depending on the dimensionality $p$ relative to the sample size $n$. On one hand, in the ultrahigh-dimensional setting where $p > \exp\left[ \{n (\log n)^{2\beta_2}\}^{1/(2\beta_2 + 1)} \right]$, the minimax optimal rate in Theorem 4 becomes $O_p\left( \log p / n + n^{-2\beta_1/(2\beta_1+1)} \right)$; indeed, $\log p / n$ dominates $(n/\log n)^{-2\beta_2/(2\beta_2+1)}$ exactly when $\log p$ exceeds $\{n (\log n)^{2\beta_2}\}^{1/(2\beta_2+1)}$. Here, the first rate $O_p(\log p / n)$ matches the minimax optimal rate for estimating a $p$-dimensional linear regression when the vector of regression coefficients has a bounded number of nonzero entries (Raskutti et al., 2011). Hence, we pay no extra price in terms of the rate of convergence for adopting a nonparametric model of $F_j$ in (3), when compared with the more restrictive linear ODE model in (4) (Zhang et al., 2015). On the other hand, in the low-dimensional setting where $p \leq \exp\left[ \{n (\log n)^{2\beta_2}\}^{1/(2\beta_2 + 1)} \right]$, the optimal rate becomes $O_p\left( (n/\log n)^{-2\beta_2/(2\beta_2+1)} + n^{-2\beta_1/(2\beta_1+1)} \right)$. Here, the first rate $O_p\left( (n/\log n)^{-2\beta_2/(2\beta_2+1)} \right)$ is the same as the optimal rate for estimating $F_j$ as if we knew a priori that $F_j$ comes from a two-dimensional tensor product functional space, rather than the $p$-variate functional space $\mathcal{H}$ in (8); see also Lin (2000) for a similar observation.

Third, the optimal rate in Theorem 4 is immune to the "curse of dimensionality", in the following sense.
We introduce $p(p-1)$ pairwise interaction components to $\mathcal{H}$ in (8), and hence, for each $x_j(t)$, $j = 1, \ldots, p$, we have to estimate a total of $p^2$ functions. A direct application of an existing basis expansion approach, for instance, Brunton et al. (2016), leads to a rate of $O_p\left( n^{-O(1/p)} \right)$. This rate degrades quickly as $p$ increases. By contrast, we proceed in a different way, where we simultaneously aim for the flexibility of a nonparametric ODE model, by letting $\mathcal{H}$ obey a tensor product structure as in (8), while exploiting the interaction structure of the system. As a result, our optimal error bound $O_p\left( (n/\log n)^{-2\beta_2/(2\beta_2+1)} \right)$ does not depend on the dimensionality $p$.

Lastly, the incorporation of the integral $\int_0^{t_i} F_j(\hat{x}(t)) dt$ in the loss function in (10) makes the estimation error of $\hat{F}_j$ depend on the convergence of $\mathbb{E} \int_{\mathcal{T}} \{\hat{x}_j(t) - x_j(t)\}^2 dt$. As a comparison, if we were to use the derivative instead of the integral, the estimation error would depend on the convergence of the derivative, $\mathbb{E} \int_{\mathcal{T}} \{d\hat{x}_j(t)/dt - dx_j(t)/dt\}^2 dt$ (Wu et al., 2014). However, it is known that derivative estimation in a reproducing kernel Hilbert space has a slower convergence rate than function estimation (Cox, 1983). That is, $\mathbb{E} \int_{\mathcal{T}} \{d\hat{x}_j(t)/dt - dx_j(t)/dt\}^2 dt$ converges at a slower rate than $\mathbb{E} \int_{\mathcal{T}} \{\hat{x}_j(t) - x_j(t)\}^2 dt$. This demonstrates the advantage of working with the integral in our KODE formulation, and our result echoes the observation for the additive ODE model (Chen et al., 2017).

Next, we establish the selection consistency of KODE. Putting all the functionals $\{F_1, \ldots, F_p\}$ together forms a network of regulatory relations among the $p$ variables $\{x_1(t), \ldots, x_p(t)\}$. Recall that we say $x_k$ is a regulator of $x_j$ if, in (3), $F_{jk}$ is nonzero, or $F_{jkl}$ is nonzero for some $l \neq k$. Denote the sets of the true regulators and the estimated regulators of $x_j(t)$ by

$$S_j = \left\{ 1 \leq k \leq p : F_{jk} \neq 0, \text{ or } F_{jkl} \neq 0 \text{ for some } 1 \leq l \neq k \leq p \right\},$$
$$\hat{S}_j = \left\{ 1 \leq k \leq p : \|\hat{F}_{jk}\|_{\mathcal{H}} \neq 0, \text{ or } \|\hat{F}_{jkl}\|_{\mathcal{H}} \neq 0 \text{ for some } 1 \leq l \neq k \leq p \right\},$$

respectively, $j = 1, \ldots, p$. We need some extra regularity conditions on the minimum regulatory effect and the design matrix, which are all commonly adopted in the literature on Lasso regression (Zhao and Yu, 2006; Ravikumar et al., 2010; Chen et al., 2017). Due to the additional notation involved, we defer those conditions to the Supplementary Appendix. The next theorem establishes that KODE is able to recover the true regulatory network asymptotically.

Theorem 5 (Recovery of the regulatory network). Suppose that $F_j \in \mathcal{H}$, $j = 1, \ldots, p$, where $\mathcal{H}$ satisfies (8), and the RKHS $\mathcal{H}_j$ is embedded in a $\beta_2$th-order Sobolev space, $\beta_2 > 1$. Suppose Assumption 1, and Assumptions 3-5 in the Appendix, hold. Then KODE correctly recovers the true regulatory network, in that, for all $j = 1, \ldots, p$,

$$P\left( \hat{S}_j = S_j \right) \to 1, \quad \text{as } n \to \infty.$$

Figure 1: (a) Diagram of the NFBLB regulatory network following (7). (b) Phase dynamics of the three nodes $x_1, x_2, x_3$ over time, with the input $x_0$ drawn uniformly at random.

5 Simulation Studies

We study the empirical performance of the proposed KODE with two examples: the enzyme regulatory network, and the Lotka-Volterra equations.
5.1 General setup

Given a system of ODEs and the initial condition, we obtain the numerical solutions of the ODEs using the Euler method with step size 0.01.
The data observations are drawn from the solutions at an evenly spaced time grid, with measurement errors. To implement KODE, we fit the smoothing spline to estimate $x_j(t)$ in (9) using the Mercer kernel $K_{\mathcal{F}}(x, x') = (1 + \sqrt{3}|x - x'|/\nu) \exp(-\sqrt{3}|x - x'|/\nu)$, where the smoothing parameter $\lambda_{nj}$ is chosen by GCV, and the bandwidth $\nu$ is chosen by tenfold cross-validation. We compute the integral $\int_0^{t_i} F_j(\hat{x}(t)) dt$ in (10) numerically, with independent sets of 1,000 Monte Carlo points. We compare KODE with the linear ODE with interactions in (4) (Zhang et al., 2015), and the additive ODE in (6) (Chen et al., 2017). Due to the lack of available code online, we implement the two competing methods in the framework of Algorithm 1, using a linear kernel for (4), and an additive Mercer kernel for (6). We evaluate the performance using the prediction error, plus the false discovery proportion and the empirical power for edge selection of the corresponding regulatory network.
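For reference, the Mercer kernel $K_{\mathcal{F}}$ above is of Matérn type and is straightforward to code; the bandwidth value below is a placeholder for the cross-validated choice.

```python
import numpy as np

def matern_kernel(x, xp, nu=1.0):
    """K_F(x, x') = (1 + sqrt(3)|x - x'|/nu) * exp(-sqrt(3)|x - x'|/nu)."""
    d = np.sqrt(3.0) * np.abs(np.asarray(x)[:, None]
                              - np.asarray(xp)[None, :]) / nu
    return (1.0 + d) * np.exp(-d)
```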
5.2 Enzymatic regulatory network

The first example is a three-node enzyme regulatory network of a negative feedback loop with a buffering node (NFBLB, Ma et al., 2009). The ODE system is given in (7) in Section 2.1. Figure 1(a) shows the NFBLB network diagram consisting of the three interacting nodes: $x_1$ receives the input, $x_2$ transmits the output, and $x_3$ plays a regulatory role, leading a negative regulatory link to $x_2$. We note that, although biological circuits can have more than three nodes, many of those circuits can be reduced to a three-node framework, given that multiple molecules often function as a single virtual node. Moreover, despite the diversity of possible network topologies, NFBLB is one of the two core three-node topologies that can perform adaptation, in the sense that the system resets itself after responding to a stimulus; see Ma et al. (2009) for more discussion of NFBLB. For the ODE system in (7), we set the catalytic rate parameters of the enzymes as $c_1 = c_2 = c_3 = c_4 = c_5 = 10$, $c_6 = 1$, the Michaelis-Menten constants as $C_1 = \cdots = C_6 = 0.1$, and the concentration parameters of the enzymes as $\tilde{c}_1 = 1$, $\tilde{c}_2 = 0.2$.
These parameters achieve the adaptation shown in Figure 1(b). The output node $x_2$ shows a strong initial response to the stimulus, and also exhibits strong adaptation, since its post-stimulus steady state $x_2 = 0.052$ is close to the pre-stimulus state $x_2 = 0$.
The input $x_0 \in \mathbb{R}$ is drawn uniformly from a fixed interval, the initial value is $x(0) = 0$, and the measurement errors are i.i.d. normal with mean zero and variance $\sigma_j^2$. The time points $t_i$ are evenly distributed, $i = 1, \ldots, n$. In this example, $p = 3$, and for each function $x_j(t)$, $j = 1, 2, 3$, there are $p^2 = 9$ functions to estimate, so in total there are 27 functions to estimate under the sample size $n = 40$.
Figure 2 reports the true and estimated trajectories of $x_2(t)$, with 95% upper and lower confidence bounds, for the three ODE methods, where we use the tensor product Mercer kernel for KODE in (10). The noise level $\sigma_j$ is set to a small common value for $j = 1, 2, 3$, and the results are averaged over 500 data replications.
It is seen that the KODE estimate has a smaller variance than the additive and linear ODE estimates. Moreover, the confidence interval of KODE achieves the desired coverage for the true trajectory. In contrast, the confidence intervals of the additive and linear ODE models mostly fail to include the truth. This is because there is a discrepancy between the additive and linear ODE model specifications and the true ODE model in (7), and this discrepancy accumulates as the course of the ODE evolves.
Figure 2: The true (black solid line) and estimated (blue dashed line) trajectory of $x_2(t)$, with the 95% upper and lower confidence bounds (red dotted lines). (a) KODE, (b) Additive ODE, (c) Linear ODE. The results are averaged over 500 data replications.

Figure 3: The prediction and selection performance of the three ODE methods with varying noise level. The results are averaged over 500 data replications. (a) Prediction error, (b) False discovery proportion, (c) Empirical power.
Figure 3 reports the prediction and selection performance of the three ODE methods, with the noise level $\sigma_j$ varying over an increasing grid, $j = 1, 2, 3$.
The results are averaged over 500 data replications. The prediction error is defined as the square root of the sum of the predictive mean squared errors for $x_1(t), x_2(t), x_3(t)$ at the unseen "future" time point $t = 2$. The false discovery proportion is defined as the proportion of falsely selected edges in the regulatory network out of the total number of selected edges. The empirical power is defined as the proportion of selected true edges in the network. It is seen that KODE clearly outperforms the two alternative solutions in terms of both prediction and selection accuracy.

5.3 Lotka-Volterra equations

The second example is the high-dimensional Lotka-Volterra equations, which are pairs of first-order nonlinear differential equations describing the dynamics of biological systems in which predators and prey interact (Volterra, 1928). We consider a ten-node system,

$$\frac{dx_{2j-1}(t)}{dt} = \alpha_{1,j} x_{2j-1}(t) - \alpha_{2,j} x_{2j-1}(t) x_{2j}(t), \qquad \frac{dx_{2j}(t)}{dt} = \alpha_{3,j} x_{2j-1}(t) x_{2j}(t) - \alpha_{4,j} x_{2j}(t), \qquad (18)$$

where the rate parameters $\alpha_{1,j}, \alpha_{2,j}, \alpha_{3,j}, \alpha_{4,j}$ are fixed positive constants that are scaled geometrically in $j$, for $j = 1, \ldots, 5$.
The parameters $\alpha_{2,j}$ and $\alpha_{3,j}$ define the interaction between the two populations, such that $dx_{2j-1}(t)/dt$ and $dx_{2j}(t)/dt$ are nonadditive functions of $x_{2j-1}$ and $x_{2j}$, where $x_{2j-1}$ is the prey and $x_{2j}$ is the predator. Figure 4(a) shows an illustration of the interaction between $x_1(t)$ and $x_2(t)$. The input $x_0$ is drawn uniformly at random, with the initial values satisfying $x_{2j-1}(0) = x_{2j}(0)$, and the measurement errors are i.i.d. normal $N(0, \sigma_j^2)$, where $\sigma_j$ again reflects the noise level. The time points are evenly distributed, with $n = 200$. In this example, $p = 10$, and for each function $x_j(t)$, $j = 1, \ldots, 10$, there are $p^2 = 100$ functions to estimate, so in total there are 1,000 functions to estimate under the sample size $n = 200$.
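To see how one predator-prey pair of (18) maps onto the decomposition (3), the sketch below separates each derivative into its main-effect and pure-interaction terms; the rate values are placeholders.

```python
def lv_pair_rhs(prey, pred, a1=1.0, a2=0.5, a3=0.5, a4=1.0):
    """One pair of (18): dx_{2j-1}/dt and dx_{2j}/dt. Each line is a linear
    main effect F_jk plus a bilinear pairwise interaction F_jkl, as in (3)."""
    d_prey = a1 * prey - a2 * prey * pred   # main effect  - interaction
    d_pred = a3 * prey * pred - a4 * pred   # interaction  - main effect
    return d_prey, d_pred
```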
Figure 4(b) and (c) report the estimated trajectory of $x_1(t)$, with 95% upper and lower confidence bounds, for KODE and the additive ODE, where the noise level is set as $\sigma_j = 1$, $j = 1, \ldots, 10$.
The confidence interval of KODE achieves better empirical coverage of the true trajectory than that of the additive ODE. For this example, we use the linear kernel for KODE in (10), since the functional forms in (18) are known to be linear in the main effects and the pairwise interactions; for this reason, we only compare KODE with the additive ODE method. Moreover, in the implementation, the estimates $\hat{F}_j(\hat{x}(t))$ are thresholded to be nonnegative, to respect the physical constraint that population sizes cannot be negative.
Figure 4: (a) The true trajectories of the prey $x_1(t)$ and the predator $x_2(t)$. (b) The estimated trajectory $\hat{x}_1(t)$ (blue dashed line), with the 95% upper and lower confidence bounds (red dotted lines), by KODE. (c) The same, by the additive ODE. The results are averaged over 500 data replications.

Figure 5: The prediction and selection performance of the two ODE methods with varying noise level. The results are averaged over 500 data replications. (a) Prediction error, (b) False discovery proportion, (c) Empirical power.
10. All the results are averaged over 500 data replications. Itis seen that the KODE estimate achieves a smaller prediction error, and a higher selectionaccuracy, since KODE allows flexible non-additive structures, which results in significantlysmaller bias and variance in functional estimation as compared to the additive modeling.23
6 Application to Gene Regulatory Network
We illustrate KODE with a gene regulatory network application. Schaffter et al. (2011) developed an open-source platform called GeneNetWeaver (GNW) that generates in silico benchmark gene expression data, using dynamical models of gene regulation and nonlinear ODEs. The generated data have been used for evaluating the performance of network inference methods in the DREAM3 competition (Marbach et al., 2009), and were also analyzed by Henderson and Michailidis (2014); Chen et al. (2017) in additive ODE modeling. GNW extracts two regulatory networks of E.coli (E.coli1, E.coli2) and three regulatory networks of yeast (yeast1, yeast2, yeast3), each of which comes in two dimensions, $p = 10$ nodes and $p = 100$ nodes. This yields in total 10 combinations of network structures. Figure 6(a)-(b) shows an example of the 10-node and the 100-node E.coli1 networks, respectively. The system of ODEs for each extracted network is based on a thermodynamic approach, which leads to a non-additive and nonlinear ODE structure (Marbach et al., 2010). Besides, the network structures are sparse; e.g., for the 10-node E.coli1 network, there are 11 edges out of 90 possible pairwise edges, and for the 100-node E.coli1 network, there are 125 edges out of 9,900 possible pairwise edges. Moreover, for the 10-node networks, GNW provides $R = 4$ perturbation experiments, and for the 100-node networks, GNW provides $R = 46$ perturbation experiments. In each perturbation experiment, GNW generates the time-course data with a different initial condition of the ODE system, to emulate the diversity of gene expression trajectories (Marbach et al., 2009). Figure 6(c)-(f) shows the time-course data under the $R = 4$ perturbation experiments for the 10-node E.coli1 network. All the trajectories are measured at $n = 21$ evenly spaced time points in $[0, 1]$. Let $\{y^{(r)}_{ij}; i = 1, \ldots, n, j = 1, \ldots, p, r = 1, \ldots, R\}$ denote the observed data for the $p$ variables under the $R$ experiments, with unknown initial conditions $x^{(r)}(0) \in \mathbb{R}^p$, $r = 1, \ldots, R$.
Figure 6: (a) The 10-node E.coli1 network. (b) The 100-node E.coli1 network. (c)-(f) Four perturbation experiments for the 10-node E.coli1 network, where each experiment corresponds to a different initial condition of the ODE system.
Then we modify the KODE method in (9) and (10), seeking $F_j \in \mathcal{H}$ and $\theta_{j0} \in \mathbb{R}$ that minimize

$$\frac{1}{Rn} \sum_{r=1}^{R} \sum_{i=1}^{n} \left\{ y^{(r)}_{ij} - \theta_{j0} - \int_0^{t_i} F_j(\hat{x}^{(r)}(t)) dt \right\}^2 + \tau_{nj} \left( \sum_{k=1}^{p} \|\mathcal{P}_k F_j\|_{\mathcal{H}} + \sum_{k=1}^{p} \sum_{l \neq k} \|\mathcal{P}_{kl} F_j\|_{\mathcal{H}} \right),$$

where $\hat{x}^{(r)}(t) = (\hat{x}^{(r)}_1(t), \ldots, \hat{x}^{(r)}_p(t))^\top$ is the smoothing spline estimate obtained by,

$$\hat{x}^{(r)}_j(t) = \arg\min_{z_j \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left( y^{(r)}_{ij} - z_j(t_i) \right)^2 + \lambda_{nj} \|z_j\|_{\mathcal{F}}^2 \right\}, \quad j = 1, \ldots, p, \ r = 1, \ldots, R.$$

Algorithm 1 can be modified accordingly to work with the multiple experiments.

We again compare KODE with the additive ODE (Chen et al., 2017) and the linear ODE (Zhang et al., 2015), adopting the same implementation as in the simulations. Since we know the true edges of the generated gene regulatory networks, we use the area under the ROC curve (AUC) as the evaluation criterion.
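Since the truth is known, edge recovery can be scored by the AUC; the sketch below ranks each candidate edge $k \to j$ by the estimated component norms and compares against the true adjacency. The input format and the use of scikit-learn are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def edge_auc(true_adj, edge_scores):
    """AUC for regulatory-edge recovery. true_adj[j, k] = 1 if x_k regulates
    x_j; edge_scores[j, k] is, e.g., ||P_k F_j||_H plus the norms of the
    interaction components involving x_k. Self-loops are excluded, matching
    the p(p-1) candidate edges described above."""
    off = ~np.eye(true_adj.shape[0], dtype=bool)
    return roc_auc_score(true_adj[off].ravel(), edge_scores[off].ravel())
```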
The area under the ROC curve, and the 95% confidence interval, for 10 combinationsof network structures from GNW. The results are averaged over 100 data replications. p = 10 p = 100KODE Additive ODE Linear ODE KODE Additive ODE Linear ODE E.coli1 .
541 0 . .
677 0 . . , . . , . . , . . , . . , . . , . E.coli2 .
632 0 . .
659 0 . . , . . , . . , . . , . . , . . , . .
541 0 . .
589 0 . . , . . , . . , . . , . . , . . , . .
562 0 . .
588 0 . . , . . , . . , . . , . . , . . , . .
569 0 . .
613 0 . . , . . , . . , . . , . . , . . , . the ROC curve (AUC) as the evaluation criterion. Table 1 reports the results averagedover 100 data realizations for all ten combinations of network structures. It is clearly seenthat KODE outperforms both the additive and linear ODE methods in all cases. Thisdemonstrates that our proposed KODE is a competitive and useful tool for ODE modeling.In addition, this example also shows that the proposed method can scale up and work withreasonably large networks. For instance, for the network with p = 100 nodes, there are p = 10 ,
References
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337-404.

Bachoc, F., Leeb, H., and Pötscher, B. M. (2019). Valid confidence intervals for post-model-selection predictors. Annals of Statistics, 47:1475-1504.

Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L. (2013). Valid post-selection inference. Annals of Statistics, 41:802-837.

Brunton, S. L., Proctor, J. L., and Kutz, J. N. (2016). Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113:3932-3937.

Cao, J. and Zhao, H. (2008). Estimating dynamic models for gene regulation networks. Bioinformatics, 24:1619-1624.

Cao, X., Sandstede, B., and Luo, X. (2019). A functional data method for causal dynamic network modeling of task-related fMRI. Frontiers in Neuroscience, 13:127.

Chen, S., Shojaie, A., and Witten, D. M. (2017). Network reconstruction from high-dimensional ordinary differential equations. Journal of the American Statistical Association, 112:1697-1707.

Chernozhukov, V., Hansen, C., and Spindler, M. (2015). Valid post-selection and post-regularization inference: An elementary, general approach. Annual Review of Economics, 7:649-688.

Chou, I.-C. and Voit, E. O. (2009). Recent developments in parameter estimation and structure identification of biochemical and genomic systems. Mathematical Biosciences, 219:57-83.

Cox, D. D. (1983). Asymptotics for M-type smoothing splines. Annals of Statistics, 11:530-551.

Cox, D. R. (1975). A note on data-splitting for the evaluation of significance levels. Biometrika, 62:441-444.

Dattner, I. and Klaassen, C. A. J. (2015). Optimal rate of direct estimators in systems of ordinary differential equations linear in functions of the parameters. Electronic Journal of Statistics, 9:1939-1973.

Gu, C. (2013). Smoothing Spline ANOVA Models. Springer Science & Business Media.

Henderson, J. and Michailidis, G. (2014). Network reconstruction using nonparametric additive ODE models. PLOS ONE, 9:1-15.

Huang, J. Z. (1998). Projection estimation in multiple regression with application to functional ANOVA models. The Annals of Statistics, 26(1):242-272.

Izhikevich, E. (2007). Dynamical Systems in Neuroscience. MIT Press.

Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15:2869-2909.

Koltchinskii, V. and Yuan, M. (2010). Sparsity in multiple kernel learning. Annals of Statistics, 38:3660-3695.

Liang, H. and Wu, H. (2008). Parameter estimation for differential equation models using a framework of measurement error in regression models. Journal of the American Statistical Association, 103:1570-1583.

Lin, Y. (2000). Tensor product space ANOVA models. Annals of Statistics, 28:734-755.

Lin, Y. and Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. Annals of Statistics, 34:2272-2297.

Loh, P.-L. and Wainwright, M. J. (2012). High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. Annals of Statistics, 40:1637-1664.

Lu, J., Kolar, M., and Liu, H. (2020). Kernel meets sieve: Post-regularization confidence bands for sparse additive model. Journal of the American Statistical Association, 1-16.

Lu, T., Liang, H., Li, H., and Wu, H. (2011). High-dimensional ODEs coupled with mixed-effects modeling techniques for dynamic gene regulatory network identification. Journal of the American Statistical Association, 106:1242-1258.

Ma, W., Trusina, A., El-Samad, H., Lim, W. A., and Tang, C. (2009). Defining network topologies that can achieve biochemical adaptation. Cell, 138:760-773.

Marbach, D., Prill, R. J., Schaffter, T., Mattiussi, C., Floreano, D., and Stolovitzky, G. (2010). Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the National Academy of Sciences, 107:6286-6291.

Marbach, D., Schaffter, T., Mattiussi, C., and Floreano, D. (2009). Generating realistic in silico gene networks for performance assessment of reverse engineering methods. Journal of Computational Biology, 16:229-239.

Opsomer, J. D. and Ruppert, D. (1997). Fitting a bivariate additive model by local polynomial regression. Annals of Statistics, 25:186-211.

Raskutti, G., Wainwright, M. J., and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Transactions on Information Theory, 57:6976-6994.

Ravikumar, P., Wainwright, M. J., and Lafferty, J. (2010). High-dimensional Ising model selection using $\ell_1$-regularized logistic regression. Annals of Statistics, 38:1287-1319.

Schaffter, T., Marbach, D., and Floreano, D. (2011). GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics, 27:2263-2270.

Stewart, G. W. (1987). Collinearity and least squares regression. Statistical Science, 2:68-84.

Talagrand, M. (1996). New concentration inequalities in product spaces. Inventiones Mathematicae, 126:505-563.

Tzafriri, A. R. (2003). Michaelis-Menten kinetics at high enzyme concentrations. Bulletin of Mathematical Biology, 65:1111-1129.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer-Verlag, New York.

Varah, J. M. (1982). A spline least squares method for numerical parameter estimation in differential equations. SIAM Journal on Scientific and Statistical Computing, 3:28-46.

Volterra, V. (1928). Variations and fluctuations of the number of individuals in animal species living together. ICES Journal of Marine Science, 3:3-51.

Wahba, G. (1983). Bayesian "confidence intervals" for the cross-validated smoothing spline. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 45:133-150.

Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.

Wahba, G., Wang, Y., Gu, C., Klein, R., and Klein, B. (1995). Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy. Annals of Statistics, 23:1865-1895.

Wang, S., Nan, B., Zhu, N., and Zhu, J. (2009). Hierarchically penalized Cox regression with grouped variables. Biometrika, 96(2):307-322.

Wu, H., Lu, T., Xue, H., and Liang, H. (2014). Sparse additive ordinary differential equations for dynamic gene regulatory network modeling. Journal of the American Statistical Association, 109:700-716.

Yuan, M. and Zhou, D.-X. (2016). Minimax optimal rates of estimation in high dimensional additive models. Annals of Statistics, 44(6):2564-2593.

Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society. Series B, 76(1):217-242.

Zhang, T., Wu, J., Li, F., Caffo, B., and Boatman-Reich, D. (2015). A dynamic directional model for effective brain connectivity using electrocorticographic (ECoG) time series. Journal of the American Statistical Association, 110:93-106.

Zhang, T., Yin, Q., Caffo, B., Sun, Y., and Boatman-Reich, D. (2017). Bayesian inference of high-dimensional, cluster-structured ordinary differential equation models with applications to brain connectivity studies. Annals of Applied Statistics, 11:868-897.

Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541-2563.

Zhu, H., Yao, F., and Zhang, H. H. (2014). Structured functional additive regression in reproducing kernel Hilbert spaces. Journal of the Royal Statistical Society. Series B, 76(3):581-603.