Kernel Ordinary Differential Equations
Xiaowu Dai and Lexin Li
University of California at Berkeley
Abstract
Ordinary differential equations (ODE) are widely used in modeling biological and physical processes in science. In this article, we propose a new reproducing kernel-based approach for estimation and inference of ODEs given noisy observations. We do not restrict the functional forms in ODE to be linear or additive, and we allow pairwise interactions. We perform sparse estimation to select individual functionals, and construct confidence intervals for the estimated signal trajectories. We establish the estimation optimality and selection consistency of kernel ODE under both the low-dimensional and high-dimensional settings, where the number of unknown functionals can be smaller or larger than the sample size. Our proposal builds upon the smoothing spline analysis of variance (SS-ANOVA) framework, but tackles several important problems that are not yet fully addressed, and thus extends the scope of the existing SS-ANOVA as well. We demonstrate the efficacy of our method through numerous ODE examples.
Key Words:
Component selection and smoothing operator; High dimensionality; Ordinary differential equations; Smoothing spline analysis of variance; Reproducing kernel Hilbert space.
1 Introduction

Ordinary differential equations (ODE) have been widely used to model dynamic systems and biological and physical processes in a variety of scientific applications. Examples include infectious disease (Liang and Wu, 2008), genomics (Cao and Zhao, 2008; Chou and Voit, 2009; Ma et al., 2009; Lu et al., 2011; Henderson and Michailidis, 2014; Wu et al., 2014), and neuroscience (Izhikevich, 2007; Zhang et al., 2015, 2017; Cao et al., 2019), among many others. A system of ODEs takes the form,

$$\frac{dx(t)}{dt} = \begin{pmatrix} dx_1(t)/dt \\ \vdots \\ dx_p(t)/dt \end{pmatrix} = \begin{pmatrix} F_1(x(t)) \\ \vdots \\ F_p(x(t)) \end{pmatrix} = F(x(t)), \qquad (1)$$

where $x(t) = (x_1(t), \ldots, x_p(t))^\top \in \mathbb{R}^p$ denotes the system of $p$ variables of interest, $F = \{F_1, \ldots, F_p\}$ denotes the set of unknown functionals that characterize the regulatory relations among $x(t)$, and $t$ indexes time in an interval standardized to $\mathcal{T} = [0, 1]$. The system is observed at $n$ time points $\{t_1, \ldots, t_n\}$ with measurement errors,

$$y_i = x(t_i) + \epsilon_i, \quad i = 1, \ldots, n, \qquad (2)$$

where $y_i = (y_{i1}, \ldots, y_{ip})^\top \in \mathbb{R}^p$ denotes the observed data, $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{ip})^\top \in \mathbb{R}^p$ denotes the vector of measurement errors that are usually assumed to follow independent normal distributions with mean 0 and variance $\sigma_j^2$, $j = 1, \ldots, p$, and $n$ denotes the number of time points. Besides, an initial condition $x(0) \in \mathbb{R}^p$ is usually given for the system (1).

In a biological or physical system, a central question of interest is to uncover the structure of the system of ODEs, in terms of which variables regulate which other variables, given the observed noisy time-course data $\{y_i\}_{i=1}^{n}$. Specifically, we say that $x_k$ regulates $x_j$ if $F_j$ is a functional of $x_k$. In other words, $x_k$ controls the change of $x_j$ through the functional $F_j$ on the derivative $dx_j/dt$. Therefore, the functionals $F = \{F_1, \ldots, F_p\}$ encode the regulatory relations of interest, and are often assumed to take the form,

$$F_j(x(t)) = \theta_{j0} + \sum_{k=1}^{p} F_{jk}(x_k(t)) + \sum_{k=1}^{p} \sum_{l \neq k} F_{jkl}(x_k(t), x_l(t)), \quad j = 1, \ldots, p, \qquad (3)$$

where $\theta_{j0} \in \mathbb{R}$ denotes the intercept, and $F_{jk}$ and $F_{jkl}$ represent the main effect and two-way interaction, respectively. Higher-order interactions are possible, but two-way interactions are the most common structure studied in ODEs (Ma et al., 2009; Zhang et al., 2015).

There have been numerous pioneering works studying statistical modeling of ODEs. However, nearly all existing solutions constrain the forms of $F$. Broadly speaking, there are two categories of functional forms imposed. The first category considers linear functionals for $F$. For instance, Lu et al. (2011) studied a system of linear ODEs to model dynamic gene regulatory networks. Zhang et al. (2015) extended the linear ODEs to include interactions to model brain connectivity networks. The model of Zhang et al. (2015), other than differentiating between the variables that encode the neuronal activities and the ones that represent the stimulus signals, is in effect of the form,

$$F_j(x(t)) = \theta_{j0} + \sum_{k=1}^{p} \theta_{jk} x_k(t) + \sum_{k=1}^{p} \sum_{l \neq k} \theta_{jkl} x_k(t) x_l(t), \quad j = 1, \ldots, p, \qquad (4)$$

whereas the model of Lu et al. (2011) is similar to (4) but focuses on the main-effect terms only. In both cases, $F_j$ takes a linear form.
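For concreteness, the following minimal sketch generates data from a toy instance of (1)-(2): it integrates a two-variable system by the forward Euler scheme and samples noisy observations at $n$ evenly spaced time points. The particular system $F$, the step size, and the noise level are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

def F(x):
    """An illustrative right-hand side of the form (1); this toy choice has
    linear main effects plus one pairwise interaction, as in (4)."""
    return np.array([-x[0] + 0.5 * x[1],
                     -0.5 * x[1] + 0.2 * x[0] * x[1]])

def euler_solve(F, x0, T=1.0, h=0.01):
    """Forward-Euler solution of dx/dt = F(x(t)) on [0, T] with step h."""
    grid = np.arange(0.0, T + h, h)
    xs = np.empty((len(grid), len(x0)))
    xs[0] = x0
    for i in range(1, len(grid)):
        xs[i] = xs[i - 1] + h * F(xs[i - 1])
    return grid, xs

# Observation model (2): y_i = x(t_i) + eps_i at n time points.
rng = np.random.default_rng(0)
grid, xs = euler_solve(F, x0=np.array([1.0, 1.0]))
n, sigma = 40, 0.1                                   # illustrative values
idx = np.linspace(0, len(grid) - 1, n).astype(int)
y = xs[idx] + sigma * rng.normal(size=(n, xs.shape[1]))
```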
Dattner and Klaassen (2015) further extended the functional $F_j$ in (4) to a generalized linear form, but without the interactions, i.e.,

$$F_j(x(t)) = \theta_{j0} + \psi_j(x(t))^\top \theta_j, \quad j = 1, \ldots, p, \qquad (5)$$

where $\theta_{j0} \in \mathbb{R}$, $\theta_j \in \mathbb{R}^d$, and $\psi_j(x) = (\psi_{j1}(x), \ldots, \psi_{jd}(x))^\top \in \mathbb{R}^d$ is a finite set of known basis functions. The second category considers additive functionals for $F$. Particularly, Henderson and Michailidis (2014); Wu et al. (2014); Chen et al. (2017) considered the generalized additive model for $F_j$,

$$F_j(x(t)) = \theta_{j0} + \sum_{k=1}^{p} F_{jk}(x_k(t)) = \theta_{j0} + \sum_{k=1}^{p} \left\{ \psi(x_k(t))^\top \theta_{jk} + \delta_{jk}(x_k(t)) \right\}, \quad j = 1, \ldots, p, \qquad (6)$$

where $\theta_{j0} \in \mathbb{R}$, $\theta_{jk} \in \mathbb{R}^d$, $\psi(x) = (\psi_1(x), \ldots, \psi_d(x))^\top \in \mathbb{R}^d$ is a finite set of common basis functions, and $\delta_{jk}$ is the residual function. Different from Dattner and Klaassen (2015), the residual $\delta_{jk}$ is unknown. The functional $F_j$ in (6) takes an additive form.

These works have laid a solid foundation for statistical modeling of ODEs. However, in plenty of scientific applications, the linear or additive forms of the functionals $F$ can be restrictive. Besides, it is highly nontrivial to couple the basis function-based solutions with the interactions. We give a more specific example in Section 2.1, where a commonly used enzyme network ODE system involves both nonlinear functionals and two-way interactions. Such examples are often the rule rather than the exception, motivating us to consider a more flexible form of ODEs. Moreover, the existing ODE methods have primarily focused on sparse estimation, but few have tackled the problem of statistical inference, which is challenging due to the complicated correlation structures of ODEs.

In this article, we propose a novel approach of kernel ordinary differential equations (KODE) for estimation and inference of the ODE system in (1) given the noisy observations from (2). We adopt the general formulation of (3), but we do not restrict the functional forms of $F$, and we allow pairwise interactions. As such, we consider a more general ODE system that encompasses (4), (5) and (6) as special cases. We further introduce sparsity regularizations to achieve selection of individual functionals in (3), which yields a sparse recovery of the regulatory relations among $F$, and thus improves the model interpretability. Moreover, we derive the confidence interval for the estimated signal trajectory $x_j(t)$. We establish the estimation optimality and selection consistency of kernel ODE, under both the low-dimensional and high-dimensional settings, where the number of unknown functionals $p$ can be smaller or larger than the number of time points $n$, and we study the regime-switching phenomenon. These differences clearly separate our proposal from the existing ODE solutions in the literature.

Our proposal is built upon the smoothing spline analysis of variance (SS-ANOVA) framework that was first introduced by Wahba et al. (1995), then further developed in regression and functional data analysis settings by Huang (1998); Lin and Zhang (2006); Zhu et al. (2014). We adopt a similar component selection and smoothing operator (COSSO) type penalty as Lin and Zhang (2006) for regularization, and conceptually, our work extends COSSO to the ODE setting. However, our proposal differs from COSSO and the existing SS-ANOVA methods considerably, in multiple ways.
First, unlike the standard SS-ANOVA models, the regressors of kernel ODE are not directly observed, and need to be estimated from the data with error. This extra layer of randomness and estimation error introduces additional difficulty to SS-ANOVA. Second, we employ the integral of the estimated trajectories in the loss function to improve the estimation properties (Dattner and Klaassen, 2015; Chen et al., 2017). The use of the integral and the inclusion of the interaction terms pose some identifiability questions that we tackle explicitly. Third, we establish the estimation optimality and selection consistency in the RKHS framework, which is utterly different from the estimation result studied in Zhu et al. (2014), and requires new technical tools. Moreover, our theoretical analysis extends that of Chen et al. (2017) from the finite bases setting of cubic splines to the infinite bases setting of RKHS. Finally, for statistical inference, we derive the confidence bands to provide uncertainty quantification for the penalized estimators of the signal trajectories in the ODE model. Our solution builds on the confidence intervals idea of Wahba (1983). But unlike the classical methods focusing on the fixed dimensionality $p$ (Wahba, 1983; Opsomer and Ruppert, 1997), we allow a diverging $p$ that can far exceed the sample size $n$. In summary, our proposal tackles several crucial problems that are not yet fully addressed in the existing SS-ANOVA framework, and it is far from a straightforward extension. We believe the proposed kernel ODE method not only makes a useful addition to the toolbox of ODE modeling, but also extends the scope of SS-ANOVA-based kernel learning.

The rest of the article is organized as follows. We propose our kernel ODE method in Section 2, and develop the corresponding estimation algorithm and inference procedure in Section 3. We derive the consistency and optimality of the proposed method in Section 4. We investigate the numerical performance in Section 5, and illustrate with a real data example in Section 6. We relegate all proofs to the Supplementary Appendix.

2 Kernel ODE

2.1 Motivating example

We consider an enzymatic regulatory network as an example to demonstrate that nonlinear functionals as well as interactions are common in systems of ODEs. Ma et al. (2009) found that all circuits of three-node enzyme network topologies that perform biochemical adaptation can be well approximated by two architectural classes: a negative feedback loop with a buffering node, and an incoherent feedforward loop with a proportioner node. The mechanism of the first class follows the Michaelis-Menten kinetic equations (Tzafriri, 2003),

$$\begin{aligned}
\frac{dx_1(t)}{dt} &= c_1 x_0 \frac{1 - x_1(t)}{\{1 - x_1(t)\} + C_1} - \tilde{c}_1 c_2 x_3(t) \frac{x_1(t)}{x_1(t) + C_2}, \\
\frac{dx_3(t)}{dt} &= c_3 x_1(t) \frac{1 - x_3(t)}{\{1 - x_3(t)\} + C_3} - \tilde{c}_2 c_4 \frac{x_3(t)}{x_3(t) + C_4}, \qquad (7) \\
\frac{dx_2(t)}{dt} &= c_5 x_1(t) \frac{1 - x_2(t)}{\{1 - x_2(t)\} + C_5} - c_6 x_3(t) \frac{x_2(t)}{x_2(t) + C_6},
\end{aligned}$$

where $x_1(t), x_2(t), x_3(t)$ are three interacting nodes such that $x_1(t)$ receives the input, $x_3(t)$ plays the diverse regulatory role, and $x_2(t)$ transmits the output, $x_0$ is the initial input stimulus, and $c_1, \ldots, c_6$, $C_1, \ldots, C_6$, $\tilde{c}_1, \tilde{c}_2$ denote the catalytic rate parameters, the Michaelis-Menten constants, and the concentration parameters, respectively. See also Figure 1(a) for a graphical illustration of this ODE system. In this model, we see that the functionals $F_1, F_3, F_2$ are all nonlinear. Besides, both $F_1$ and $F_2$ involve two-way interactions.
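To make explicit where the nonlinearity and the two-way interactions in (7) sit, a sketch of the right-hand side is given below; the rate constants default to the values used later in Section 5.2, but should be read as placeholders.

```python
import numpy as np

def nfblb_rhs(x, x0=1.0, c=(10.0, 10.0, 10.0, 10.0, 10.0, 1.0),
              C=(0.1,) * 6, c1t=1.0, c2t=0.2):
    """Right-hand side of the Michaelis-Menten system (7), x = (x1, x2, x3).
    Each fraction is a nonlinear (Michaelis-Menten) term; the products
    x3*x1 and x3*x2 are the two-way interactions in F_1 and F_2."""
    x1, x2, x3 = x
    c1, c2, c3, c4, c5, c6 = c
    dx1 = c1 * x0 * (1 - x1) / ((1 - x1) + C[0]) \
        - c1t * c2 * x3 * x1 / (x1 + C[1])        # interaction x3 * x1
    dx2 = c5 * x1 * (1 - x2) / ((1 - x2) + C[4]) \
        - c6 * x3 * x2 / (x2 + C[5])              # interaction x3 * x2
    dx3 = c3 * x1 * (1 - x3) / ((1 - x3) + C[2]) \
        - c2t * c4 * x3 / (x3 + C[3])             # nonlinear, no interaction
    return np.array([dx1, dx2, dx3])
```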
It is of great interest to estimate the $F_j$'s given the observed data, to verify model (7), and to carry out statistical inference for the unknown parameters. This example, along with many other ODE systems with nonlinear functionals and interaction terms, motivates us to consider a general ODE system as given in (3).

2.2 Model formulation

Before presenting our method, we first briefly review the two-step collocation estimation method, which is commonly used for parameter estimation in ODEs, and is also useful in our setting. The method was first proposed by Varah (1982), then extended to various ODE models. In the first step, it fits a smoothing estimate,

$$\min_{z_j} \sum_{i=1}^{n} \{y_{ij} - z_j(t_i)\}^2 + \lambda_{nj} J(z_j), \quad j = 1, \ldots, p,$$

where $J(\cdot)$ is a smoothness penalty. Let $\hat{x}_j(t)$ denote its minimizer. Given $\hat{x}_j(t)$, in the second step, it solves an optimization problem to estimate the model parameters $\theta_{j0} \in \mathbb{R}$ and $\theta_j = (\theta_{j1}, \ldots, \theta_{jp})^\top \in \mathbb{R}^p$, for $j = 1, \ldots, p$. Particularly, Varah (1982) considered the derivative $d\hat{x}_j(t)/dt$ and the following minimization,

$$\min_{\theta_{j0}, \theta_j} \int \left( \frac{d\hat{x}_j(t)}{dt} - \theta_{j0} - \sum_{k=1}^{p} \theta_{jk} \hat{x}_k(t) \right)^2 dt, \quad j = 1, \ldots, p.$$

Wu et al. (2014) developed a similar two-step collocation method for their additive ODE model (6), and estimated the model parameters $\theta_{j0} \in \mathbb{R}$ and $\theta_{jk} = (\theta_{jk1}, \ldots, \theta_{jkd})^\top \in \mathbb{R}^d$, for $j, k = 1, \ldots, p$, with a standardized group $\ell_1$-penalty,

$$\min_{\theta_{j0}, \theta_{jk}} \int \left( \frac{d\hat{x}_j(t)}{dt} - \theta_{j0} - \sum_{k=1}^{p} \theta_{jk}^\top \psi(\hat{x}_k(t)) \right)^2 dt + \tau_{nj} \sum_{k=1}^{p} \left[ \int \left\{ \theta_{jk}^\top \psi(\hat{x}_k(t)) \right\}^2 dt \right]^{1/2}.$$

Related proposals employed group $\ell_1$ and regular $\ell_1$-penalties; meanwhile, Henderson and Michailidis (2014) considered an extra $\ell_2$-penalty.

Alternatively, in the second step, Dattner and Klaassen (2015) proposed to focus on the integral $\int_0^t \psi_j(\hat{x}(u)) du$, rather than the derivative $d\hat{x}_j(t)/dt$, and they estimated the model parameters $\theta_{j0} \in \mathbb{R}$ and $\theta_j = (\theta_{j1}, \ldots, \theta_{jd})^\top \in \mathbb{R}^d$, for $j = 1, \ldots, p$, in (5) by,

$$\min_{\theta_{j0}, \theta_j} \sum_{j=1}^{p} \int \left\{ \hat{x}_j(t) - \theta_{j0} - \theta_j^\top \int_0^{t} \psi_j(\hat{x}(u)) du \right\}^2 dt.$$

They found that this modification from the derivative to the integral leads to a more robust estimate, and also to an easier derivation of the asymptotic properties. Chen et al. (2017) adopted this idea for their additive ODE model (6), and estimated the parameters $\theta_{j0} \in \mathbb{R}$, $b_j \in \mathbb{R}$, and $\theta_{jk} = (\theta_{jk1}, \ldots, \theta_{jkd})^\top \in \mathbb{R}^d$, for $j, k = 1, \ldots, p$, by

$$\min_{\theta_{j0}, b_j, \theta_{jk}} \frac{1}{n} \sum_{i=1}^{n} \left\{ y_{ij} - \theta_{j0} - b_j t_i - \sum_{k=1}^{p} \theta_{jk}^\top \int_0^{t_i} \psi(\hat{x}_k(u)) du \right\}^2 + \tau_{nj} \sum_{k=1}^{p} \left[ \frac{1}{n} \sum_{i=1}^{n} \left\{ \theta_{jk}^\top \int_0^{t_i} \psi(\hat{x}_k(u)) du \right\}^2 \right]^{1/2}.$$

We build the proposed kernel ODE within the smoothing spline ANOVA framework; see Wahba et al. (1995) and Gu (2013) for more background on SS-ANOVA. Specifically, let $\mathcal{H}_k$ denote a space of functions of $x_k(t) \in \mathcal{X}$ with zero marginal integral, and let $\{1\}$ denote the space of constant functions. Construct the tensor product space,

$$\mathcal{H} = \{1\} \oplus \sum_{k=1}^{p} \mathcal{H}_k \oplus \sum_{k=1}^{p} \sum_{l \neq k} \left( \mathcal{H}_k \otimes \mathcal{H}_l \right). \qquad (8)$$
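A small sketch of how the decomposition (8) translates into kernels: the Gram matrix of an interaction space $\mathcal{H}_k \otimes \mathcal{H}_l$ is the elementwise product of the two main-effect Gram matrices, a fact used again for (13) below. The Gaussian kernel and the omission of the zero-marginal-integral centering are simplifying assumptions.

```python
import numpy as np

def kernel_1d(u, v, bw=1.0):
    """A Mercer kernel on a single coordinate; Gaussian for illustration."""
    return np.exp(-(u[:, None] - v[None, :]) ** 2 / (2 * bw ** 2))

def anova_grams(X):
    """Gram matrices for (8): one per main-effect space H_k, and one per
    interaction space H_k (x) H_l via the product kernel K_k * K_l."""
    n, p = X.shape
    mains = {k: kernel_1d(X[:, k], X[:, k]) for k in range(p)}
    inters = {(k, l): mains[k] * mains[l]      # elementwise (Hadamard) product
              for k in range(p) for l in range(p) if k != l}
    return mains, inters
```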
We assume the functionals $F_j$, $j = 1, \ldots, p$, in the ODE model (3) are located in the space $\mathcal{H}$. The identifiability of the terms in (3) is assured by the conditions specified through the averaging operators: $\int_{\mathcal{T}} F_{jk}(x_k(t)) dt = 0$ for $k = 1, \ldots, p$. Let $\|\cdot\|_{\mathcal{H}}$ denote the norm of $\mathcal{H}$, and let $\mathcal{P}_k F_j$ and $\mathcal{P}_{kl} F_j$ denote the orthogonal projections of $F_j$ onto $\mathcal{H}_k$ and $\mathcal{H}_k \otimes \mathcal{H}_l$, respectively. We consider a two-step collocation estimation method, by first obtaining a smoothing spline estimate $\hat{x}(t) = (\hat{x}_1(t), \ldots, \hat{x}_p(t))^\top$, where

$$\hat{x}_j(t) = \arg\min_{z_j \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \{y_{ij} - z_j(t_i)\}^2 + \lambda_{nj} \|z_j\|_{\mathcal{F}}^2 \right\}, \quad j = 1, \ldots, p, \qquad (9)$$

then estimating $F_j \in \mathcal{H}$ and $\theta_{j0} \in \mathbb{R}$ by the following penalized optimization,

$$\min_{\theta_{j0}, F_j} \frac{1}{n} \sum_{i=1}^{n} \left\{ y_{ij} - \theta_{j0} - \int_0^{t_i} F_j(\hat{x}(t)) dt \right\}^2 + \tau_{nj} \left( \sum_{k=1}^{p} \|\mathcal{P}_k F_j\|_{\mathcal{H}} + \sum_{k=1}^{p} \sum_{l \neq k} \|\mathcal{P}_{kl} F_j\|_{\mathcal{H}} \right). \qquad (10)$$

Our proposal deals with the integral $\int_0^{t_i} F_j(\hat{x}(u)) du$, rather than the derivative $d\hat{x}_j(t)/dt$, which is in a similar spirit as Dattner and Klaassen (2015). Besides, it involves two penalty functions, $J_1 \equiv \|\cdot\|_{\mathcal{F}}^2$ in (9), and $J_2(F_j) \equiv \sum_{k=1}^{p} \|\mathcal{P}_k F_j\|_{\mathcal{H}} + \sum_{k=1}^{p} \sum_{l \neq k} \|\mathcal{P}_{kl} F_j\|_{\mathcal{H}}$ in (10), where $\lambda_{nj}$ and $\tau_{nj}$ are two tuning parameters. We next make some remarks about this proposal.

For the functionals, the formulation in (10) is highly flexible, nonlinear, and incorporates two-way interactions. Meanwhile, it naturally covers the linear ODEs in (4) and (5), and the additive ODE in (6), as special cases. In particular, if $\mathcal{H}$ is the linear functional space, $\mathcal{H} = \{1\} \oplus \sum_{k=1}^{p} \{x_k - 1/2\} \oplus \sum_{k \neq l} \left[ \{x_k - 1/2\} \otimes \{x_l - 1/2\} \right]$ with the input space $\mathcal{X} = [0, 1]^p$, then any $F$ of the form in (4) belongs to $\mathcal{H}$. If $\mathcal{H}$ is spanned by the finite set of known basis functions in (5), then any $F$ in (5) belongs to $\mathcal{H}$. If $\mathcal{H}$ is the additive functional space, $\mathcal{H} = \{1\} \oplus \sum_{k=1}^{p} \mathcal{H}_k$ with the $\ell_1$-norm, then for $F_{jk}(x_k(t)) = \psi(x_k(t))^\top \theta_{jk}$, the penalty on the main effects becomes $\sum_{k=1}^{p} \|\mathcal{P}_k F_j\|_{\mathcal{H}} = \sum_{k=1}^{p} \left[ \int \{\psi(x_k(t))^\top \theta_{jk}\}^2 dt \right]^{1/2}$, which is exactly the same as in the ODE model of Chen et al. (2017).

For the penalties, the first penalty function $J_1$ is the squared RKHS norm corresponding to the RKHS $\{\mathcal{F}, \|\cdot\|_{\mathcal{F}}\}$. It is for estimating $\hat{x}_j$, and $\mathcal{F}$ does not have to be the same as $\mathcal{H}$. The second penalty function $J_2$ is a sum of RKHS norms on the main effects and pairwise interactions. This penalty is similar to the COSSO penalty of Lin and Zhang (2006), but as we outline in Section 1, our extension is far from trivial. We also note that we do not impose a hierarchical structure on the main effects and interactions, in that if an interaction term is selected, the corresponding main effect terms do not have to be selected (Wang et al., 2009). This is motivated by the observation that, in an ODE system such as the enzymatic regulatory network example in Section 2.1, the interaction terms $x_1(t)x_2(t)$ and $x_3(t)x_2(t)$ both appear in the ODE regulating the dynamics of $x_2(t)$, but the main effect terms $x_1(t)$ and $x_3(t)$ are not present.
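The first step (9) is an ordinary smoothing problem. As a stand-in for a full smoothing spline fit, the sketch below estimates one trajectory by kernel ridge regression in an RKHS $\mathcal{F}$, returning a callable estimate that can then be plugged into the integral in (10); the kernel choice and the fixed tuning values are assumptions, whereas the paper tunes $\lambda_{nj}$ by GCV.

```python
import numpy as np

def smooth_trajectory(t, yj, lam=1e-3, bw=0.2):
    """Estimate x_j(.) from (t_i, y_ij), i = 1..n, by penalized least
    squares in an RKHS, i.e., a kernel ridge stand-in for (9)."""
    K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2 * bw ** 2))
    alpha = np.linalg.solve(K + len(t) * lam * np.eye(len(t)), yj)
    def xhat(s):
        s = np.atleast_1d(s)
        Ks = np.exp(-(s[:, None] - t[None, :]) ** 2 / (2 * bw ** 2))
        return Ks @ alpha
    return xhat
```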
Theorem 1. Assume that the RKHS $\mathcal{H}$ can be decomposed as in (8). Then there exists a minimizer of (10) in $\mathcal{H}$ for any tuning parameter $\tau_{nj} \geq 0$. Moreover, the minimizer lies in a finite-dimensional space.

Theorem 1 is a generalization of the well-known representer theorem (Wahba, 1990). The difference is that, unlike the smoothing splines model studied in Wahba (1990), the minimization in (10) involves an integral in the loss function, and the penalty is not a norm in $\mathcal{H}$ but a convex pseudo-norm. A direct implication of Theorem 1 is that, although the minimization with respect to $F_j$ is taken over an infinite-dimensional space in (10), the solution to (10) can actually be found in a finite-dimensional space. We next develop an estimation algorithm to solve (10).

3 Estimation and Inference

3.1 Estimation algorithm

The estimation of the proposed kernel ODE system consists of two major steps. The first step is the smoothing spline estimation in (9), which is standard, and the tuning of the smoothness parameter $\lambda_{nj}$ is often done through generalized cross-validation (see, e.g., Gu, 2013). The second step is to solve (10). Toward that end, we first propose an optimization problem that is equivalent to (10), but is computationally easier to tackle. We then develop an estimation algorithm to solve this new equivalent problem.

Specifically, we consider the following optimization problem, for $j = 1, \ldots, p$,

$$\min_{\theta_{j0}, \theta_j, F_j} \frac{1}{n} \sum_{i=1}^{n} \left\{ y_{ij} - \theta_{j0} - \int_0^{t_i} F_j(\hat{x}(t)) dt \right\}^2 + \eta_{nj} \left( \sum_{k=1}^{p} \theta_{jk}^{-1} \|\mathcal{P}_k F_j\|_{\mathcal{H}}^2 + \sum_{k=1}^{p} \sum_{l \neq k} \theta_{jkl}^{-1} \|\mathcal{P}_{kl} F_j\|_{\mathcal{H}}^2 \right) + \kappa_{nj} \left( \sum_{k=1}^{p} \theta_{jk} + \sum_{k=1}^{p} \sum_{l \neq k} \theta_{jkl} \right), \qquad (11)$$

subject to $\theta_{jk} \geq 0$, $\theta_{jkl} \geq 0$, $k, l = 1, \ldots, p$, $k \neq l$, where $\theta_j = (\theta_{j1}, \ldots, \theta_{jp}, \theta_{j12}, \ldots, \theta_{j1p}, \ldots, \theta_{jp1}, \ldots, \theta_{jp(p-1)})^\top \in \mathbb{R}^{p^2}$ collects the parameters to estimate, and $\eta_{nj}, \kappa_{nj} \geq 0$ are the tuning parameters, $j = 1, \ldots, p$. Comparing (11) to (10), we introduce the parameters $\theta_{jk}$ and $\theta_{jkl}$ to control the sparsity of the main effect and interaction terms in $F_j$, which is similar to Lin and Zhang (2006). The two optimization problems (10) and (11) are equivalent, in the following sense. Let $\kappa_{nj} = \tau_{nj}^2 / (4 \eta_{nj})$. Then we have,

$$\eta_{nj} \theta_{jk}^{-1} \|\mathcal{P}_k F_j\|_{\mathcal{H}}^2 + \kappa_{nj} \theta_{jk} \geq 2 \eta_{nj}^{1/2} \kappa_{nj}^{1/2} \|\mathcal{P}_k F_j\|_{\mathcal{H}} = \tau_{nj} \|\mathcal{P}_k F_j\|_{\mathcal{H}},$$

where the equality holds if $\theta_{jk} = \eta_{nj}^{1/2} \kappa_{nj}^{-1/2} \|\mathcal{P}_k F_j\|_{\mathcal{H}}$. A similar result holds for $\theta_{jkl} = \eta_{nj}^{1/2} \kappa_{nj}^{-1/2} \|\mathcal{P}_{kl} F_j\|_{\mathcal{H}}$. In other words, if $(\hat{\theta}_{j0}, \hat{F}_j)$ minimizes (10), then $(\hat{\theta}_{j0}, \hat{\theta}_j, \hat{F}_j)$ minimizes (11), with $\hat{\theta}_{jk} = \eta_{nj}^{1/2} \kappa_{nj}^{-1/2} \|\mathcal{P}_k \hat{F}_j\|_{\mathcal{H}}$ and $\hat{\theta}_{jkl} = \eta_{nj}^{1/2} \kappa_{nj}^{-1/2} \|\mathcal{P}_{kl} \hat{F}_j\|_{\mathcal{H}}$, for any $k, l = 1, \ldots, p$, $k \neq l$. Meanwhile, if $(\hat{\theta}_{j0}, \hat{\theta}_j, \hat{F}_j)$ minimizes (11), then $(\hat{\theta}_{j0}, \hat{F}_j)$ minimizes (10).

Next, we devise an iterative alternating optimization approach to solve (11).
That is, we first estimate $\theta_{j0}$ given fixed $F_j$ and $\theta_j$, then estimate the functional $F_j$ given fixed $\theta_{j0}$ and $\theta_j$, and finally estimate $\theta_j$ given fixed $\theta_{j0}$ and $F_j$.

For given $\hat{F}_j$ and $\hat{\theta}_j$, we have that,

$$\hat{\theta}_{j0} = \bar{y}_j - \int_{\mathcal{T}} \bar{T}(t) \hat{F}_j(\hat{x}(t)) dt,$$

where $T_i(t) = 1\{0 \leq t \leq t_i\}$, $\bar{T}(t) = n^{-1} \sum_{i=1}^{n} T_i(t)$, and $\bar{y}_j = n^{-1} \sum_{i=1}^{n} y_{ij}$.

For given $\hat{\theta}_{j0}$ and $\hat{\theta}_j$, the optimization problem (11) becomes,

$$\min_{F_j} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left[ (y_{ij} - \bar{y}_j) - \int_{\mathcal{T}} \{T_i(t) - \bar{T}(t)\} F_j(\hat{x}(t)) dt \right]^2 + \eta_{nj} \left( \sum_{k=1}^{p} \hat{\theta}_{jk}^{-1} \|\mathcal{P}_k F_j\|_{\mathcal{H}}^2 + \sum_{k=1}^{p} \sum_{l \neq k} \hat{\theta}_{jkl}^{-1} \|\mathcal{P}_{kl} F_j\|_{\mathcal{H}}^2 \right) \right\}. \qquad (12)$$

Let $K_j(\cdot, \cdot): \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ denote the Mercer kernel generating the RKHS $\mathcal{H}_j$, $j = 1, \ldots, p$. Then $K_k K_l$ is the reproducing kernel of the RKHS $\mathcal{H}_k \otimes \mathcal{H}_l$ (Aronszajn, 1950). Let $K_{\theta_j} = \sum_{k=1}^{p} \hat{\theta}_{jk} K_k + \sum_{k \neq l} \hat{\theta}_{jkl} K_k K_l$. By the representer theorem (Wahba, 1990), the solution $\hat{F}_j$ to (12) is of the form,

$$\hat{F}_j(\hat{x}(t)) = b_j + \sum_{i=1}^{n} c_{ij} \int_{\mathcal{T}} K_{\theta_j}(\hat{x}(t), \hat{x}(s)) \{T_i(s) - \bar{T}(s)\} ds, \qquad (13)$$

for some $b_j \in \mathbb{R}$ and $c_j = (c_{1j}, \ldots, c_{nj})^\top \in \mathbb{R}^n$. Write $y_j = (y_{1j}, \ldots, y_{nj})^\top \in \mathbb{R}^n$ and $\bar{y}_j = (\bar{y}_j, \ldots, \bar{y}_j)^\top \in \mathbb{R}^n$. Let $B$ be an $n \times 1$ vector whose $i$th entry is $B_i = \int_{\mathcal{T}} \{T_i(t) - \bar{T}(t)\} dt$, $i = 1, \ldots, n$, and let $\Sigma$ be an $n \times n$ matrix whose $(i, i')$th entry is $\Sigma_{ii'} = \int_{\mathcal{T}} \int_{\mathcal{T}} \{T_i(s) - \bar{T}(s)\} K_{\theta_j}(\hat{x}(t), \hat{x}(s)) \{T_{i'}(t) - \bar{T}(t)\} ds\, dt$, $i, i' = 1, \ldots, n$. Plugging (13) into (12), we obtain the following quadratic minimization problem in terms of $\{b_j, c_j\}$,

$$\min_{b_j, c_j} \frac{1}{n} \left\| (y_j - \bar{y}_j) - (B b_j + \Sigma c_j) \right\|^2 + \eta_{nj} c_j^\top \Sigma c_j,$$

which has a closed-form solution. Consider the QR decomposition $B = [Q_1 \ Q_2] [R^\top, 0]^\top$, where $Q_1 \in \mathbb{R}^{n \times 1}$, $Q_2 \in \mathbb{R}^{n \times (n-1)}$, and $[Q_1 \ Q_2]$ is orthogonal such that $B^\top Q_2 = 0_{1 \times (n-1)}$. Write $W_j = \Sigma + n \eta_{nj} I_n$, where $I_n$ is the $n \times n$ identity matrix. Then the minimizers are,

$$c_j = Q_2 (Q_2^\top W_j Q_2)^{-1} Q_2^\top (y_j - \bar{y}_j), \qquad b_j = R^{-1} Q_1^\top (y_j - \bar{y}_j - W_j c_j).$$

Following the usual smoothing splines literature, we tune the parameter $\eta_{nj}$ in (12) by minimizing the generalized cross-validation criterion (GCV, Wahba et al., 1995),

$$\mathrm{GCV} = \frac{\| A_j(\eta_{nj})(y_j - \bar{y}_j) - (y_j - \bar{y}_j) \|^2}{\left[ n^{-1} \mathrm{tr}\{I_n - A_j(\eta_{nj})\} \right]^2},$$

where the smoothing matrix $A_j(\eta_{nj}) \in \mathbb{R}^{n \times n}$ is of the form,

$$A_j(\eta_{nj}) = I_n - n \eta_{nj} Q_2 (Q_2^\top W_j Q_2)^{-1} Q_2^\top. \qquad (14)$$

For given $\hat{\theta}_{j0}$ and $\hat{F}_j$, $\theta_j$ is the solution to a usual $\ell_1$-penalized regression problem,

$$\min_{\theta_j} \left\{ (z_j - G \theta_j)^\top (z_j - G \theta_j) + n \kappa_{nj} \left( \sum_{k=1}^{p} \theta_{jk} + \sum_{k=1}^{p} \sum_{l \neq k} \theta_{jkl} \right) \right\}, \qquad (15)$$

subject to $\theta_{jk} \geq 0$, $\theta_{jkl} \geq 0$, $k, l = 1, \ldots, p$, $k \neq l$, where the "response" is $z_j = (y_j - \bar{y}_j) - (1/2) n \eta_{nj} c_j - B b_j$, and the "predictor" is $G \in \mathbb{R}^{n \times p^2}$, whose first $p$ columns are $\Sigma_k c_j$ with $k = 1, \ldots, p$, and whose last $p(p-1)$ columns are $\Sigma_{kl} c_j$ with $k, l = 1, \ldots, p$, $k \neq l$. Here $\Sigma_k = (\Sigma^k_{ii'})$ and $\Sigma_{kl} = (\Sigma^{kl}_{ii'})$ are both $n \times n$ matrices, whose $(i, i')$th entries are $\Sigma^k_{ii'} = \int_{\mathcal{T}} \int_{\mathcal{T}} \{T_i(s) - \bar{T}(s)\} K_k(\hat{x}(t), \hat{x}(s)) \{T_{i'}(t) - \bar{T}(t)\} ds\, dt$ and $\Sigma^{kl}_{ii'} = \int_{\mathcal{T}} \int_{\mathcal{T}} \{T_i(s) - \bar{T}(s)\} K_k K_l(\hat{x}(t), \hat{x}(s)) \{T_{i'}(t) - \bar{T}(t)\} ds\, dt$, respectively, $i, i' = 1, \ldots, n$, $j = 1, \ldots, p$.
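Two of the three updates admit direct numerical recipes. The sketch below mirrors the closed form for $(b_j, c_j)$ via the QR decomposition displayed above, and casts the $\theta_j$ update (15) as a nonnegative Lasso; the use of scikit-learn, and the rescaling `alpha = kappa / 2` that matches scikit-learn's $(1/2n)$-scaled objective to (15), are implementation assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def update_bc(B, Sigma, y_c, eta):
    """Closed-form (b_j, c_j) of the quadratic problem, via B = [Q1 Q2][R; 0]."""
    n = len(y_c)
    Q, R = np.linalg.qr(B.reshape(-1, 1), mode='complete')   # full QR
    Q1, Q2 = Q[:, :1], Q[:, 1:]
    W = Sigma + n * eta * np.eye(n)
    c = Q2 @ np.linalg.solve(Q2.T @ W @ Q2, Q2.T @ y_c)
    b = np.linalg.solve(R[:1, :1], Q1.T @ (y_c - W @ c))
    return b.item(), c

def update_theta(z, G, kappa):
    """The theta_j update (15): nonnegative Lasso with response z_j, design G."""
    fit = Lasso(alpha=kappa / 2.0, positive=True, fit_intercept=False).fit(G, z)
    return fit.coef_
```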
We employ the Lasso for (15) in our implementation, and tune the parameter $\kappa_{nj}$ using tenfold cross-validation, following the usual Lasso literature.

We repeat the above optimization steps iteratively until some stopping criterion is met, i.e., when the estimates in two consecutive iterations are close enough, or when the number of iterations reaches some maximum number. In our simulations, we have found that the algorithm converges quickly, usually within 10 iterations. Another issue is the identifiability of the $\mathcal{P}_k F_j$'s and $\mathcal{P}_{kl} F_j$'s in (11), in the sense of unique solutions. We introduce the collinearity indices $C_{jk}$ and $C_{jkl}$ to reflect the identifiability. Specifically, let $W$ denote a $p^2 \times p^2$ matrix, whose entries are $\cos(\mathcal{P}_k F_j, \mathcal{P}_{k'} F_j)$, $\cos(\mathcal{P}_k F_j, \mathcal{P}_{k'l'} F_j)$, $\cos(\mathcal{P}_{kl} F_j, \mathcal{P}_{k'} F_j)$, and $\cos(\mathcal{P}_{kl} F_j, \mathcal{P}_{k'l'} F_j)$, $j, k, l = 1, \ldots, p$. Then $C_{jk}$ and $C_{jkl}$ are defined by the diagonals of $W^{-1}$. When some $C_{jk}$ and $C_{jkl}$ are much larger than one, the identifiability issue occurs (Stewart, 1987; Gu, 2013). This is often due to an insufficient amount of data relative to the complexity of the model we fit. In this case, we find that increasing $\eta_{nj}$ and $\kappa_{nj}$ in (11) often helps with the identifiability issue, as it helps reduce the model complexity.

We summarize the above estimation procedure in Algorithm 1.

Algorithm 1: Iterative optimization algorithm for kernel ODE.
1. Initialization: set the initial values $\theta_{jk} = \theta_{jkl} = 1$, $j, k, l = 1, \ldots, p$, $k \neq l$, and the tuning parameters $(\eta_{nj}, \kappa_{nj})$.
2. Fit the smoothing spline model (9), and obtain $\hat{x}_j(t)$, $j = 1, \ldots, p$.
3. Repeat:
   (a) Solve $\hat{\theta}_{j0}$ given $\hat{F}_j$ and $\hat{\theta}_j$, $j = 1, \ldots, p$.
   (b) Solve $\hat{F}_j$ in (12) given $\hat{\theta}_{j0}$ and $\hat{\theta}_j$, $j = 1, \ldots, p$.
   (c) Solve $\hat{\theta}_j$ in (15) given $\hat{\theta}_{j0}$ and $\hat{F}_j$, $j = 1, \ldots, p$.
4. Until the stopping criterion is met.
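A sketch of the collinearity diagnostic just described: given the fitted component vectors (each component $\mathcal{P}_k F_j$ or $\mathcal{P}_{kl} F_j$ evaluated at the sample points, an assumed input format), form the matrix of pairwise cosines and read off the diagonal of its inverse.

```python
import numpy as np

def collinearity_indices(components):
    """Collinearity indices C_jk, C_jkl: diagonals of the inverse of the
    cosine matrix W; values much larger than one flag identifiability issues."""
    U = np.array([c / np.linalg.norm(c) for c in components])
    W = U @ U.T                     # W[a, b] = cos(component_a, component_b)
    return np.diag(np.linalg.inv(W))
```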
3.2 Confidence intervals

Next, we derive the confidence intervals for the estimated trajectory $\hat{x}_j(t_i)$. This is related to post-selection inference, as the actual coverage probability of a confidence interval that ignores the uncertainty of the preceding sparse estimation can be dramatically smaller than the nominal level. Our result extends the recent work of Berk et al. (2013) and Bachoc et al. (2019) from linear regression models to nonparametric ODE models, while our setting is more challenging, as it involves infinite-dimensional functional objects.

Let $\hat{\theta}_j$ denote the estimator of $\theta_j$ obtained from Algorithm 1. Denote $\mathcal{M} \equiv \{1, \ldots, p, (1, 2), \ldots, (1, p), \ldots, (p, 1), \ldots, (p, p-1)\}$, and denote $M_j \subseteq \mathcal{M}$ as the index set of the nonzero entries of the sparse estimator $\hat{\theta}_j$. Note that $M_j$ is allowed to be an empty set. Let $\hat{\theta}_{M_j}$ be the least squares estimate with $M_j$ as the support that minimizes the unpenalized objective function in (15), i.e., $(z_j - G \theta_j)^\top (z_j - G \theta_j)$. Plugging this estimate $\hat{\theta}_{M_j}$ into (13) gives the corresponding estimate of the functional $F_j$ as,

$$\hat{F}_{j, \hat{\theta}_{M_j}}(\hat{x}(t)) = b_j + \sum_{i=1}^{n} c_{ij} \int_{\mathcal{T}} K_{\hat{\theta}_{M_j}}(\hat{x}(t), \hat{x}(s)) \{T_i(s) - \bar{T}(s)\} ds.$$
For a nominal level $\alpha \in (0, 1)$ and $i = 1, \ldots, n$, define $c(\hat{x}_j(t_i))$ as the smallest constant satisfying that,

$$P_{n, F_j, \sigma_j} \left[ \max_{M_j \subseteq \mathcal{M}} \sigma_j^{-1} \left| \{\tilde{A}_{M_j}\}_{i \cdot} (y_j - \bar{y}_j) \right| \leq c(\hat{x}_j(t_i)) \right] \geq 1 - \alpha, \qquad (16)$$

where $\{\tilde{A}_{M_j}\}_{i \cdot} = \{A_{M_j}\}_{i \cdot} / \|\{A_{M_j}\}_{i \cdot}\|_{\ell_2}$, $\{A_{M_j}\}_{i \cdot}$ is the $i$th row of $A_{M_j}$, $A_{M_j}$ is the smoothing matrix as defined in (14) with the corresponding $\hat{\theta}_{M_j}$, and $\sigma_j^2$ is the variance of the error term $\epsilon_{ij}$ in (2). We then construct the confidence interval $\mathrm{CI}(\hat{x}_j(t_i))$ for the prediction of the true trajectory $x_j(t)$ following model selection as,

$$\mathrm{CI}(\hat{x}_j(t_i)) = \int_{\mathcal{T}} \{T_i(t) - \bar{T}(t)\} \hat{F}_{j, \hat{\theta}_{M_j}}(\hat{x}(t)) dt \ \pm \ c(\hat{x}_j(t_i)) \, \sigma_j \, \|\{A_{M_j}\}_{i \cdot}\|, \qquad (17)$$

for any $i = 1, \ldots, n$ and $j = 1, \ldots, p$.

Next, we show that the confidence interval in (17) has the desired coverage probability. Later we develop a procedure to estimate the cutoff value $c(\hat{x}_j(t_i))$ in (16) given the data.
Theorem 2. Let $M_j \subseteq \mathcal{M}$ be the index set of the nonzero entries of the sparse estimator $\hat{\theta}_j$. Then the choice of $c(\hat{x}_j(t_i))$ in (16) does not depend on $F_j$, and $\mathrm{CI}(\hat{x}_j(t_i))$ in (17) satisfies the coverage property, for any $i = 1, \ldots, n$ and $j = 1, \ldots, p$, in that,

$$\inf_{F_j \in \mathcal{H}, \sigma_j > 0} P \left\{ \int_{\mathcal{T}} \{T_i(t) - \bar{T}(t)\} \, \mathbb{E}\left[ \hat{F}_{j, \hat{\theta}_{M_j}}(\hat{x}(t)) \right] dt \in \mathrm{CI}(\hat{x}_j(t_i)) \right\} \geq 1 - \alpha.$$

A few remarks are in order. First, the coverage in Theorem 2 is guaranteed for all sparse estimation and selection procedures. As such, $\mathrm{CI}(\hat{x}_j(t_i))$ in (17), following the terminology of Berk et al. (2013), is a universally valid post-selection confidence interval. Second, if we replace $c(\hat{x}_j(t_i))$ in (17) by $z_{\alpha/2}$, i.e., the upper $\alpha/2$ quantile of the standard normal distribution, then $\mathrm{CI}(\hat{x}_j(t_i))$ reduces to the "naive" confidence interval. It is constructed as if $M_j$ were fixed a priori, and it ignores any uncertainty or error of the sparse estimation step. This naive confidence interval, however, does not have the coverage property as in Theorem 2, and thus is not a truly valid confidence interval. Finally, data splitting (Cox, 1975) is a commonly used alternative strategy for post-selection inference, but it is not directly applicable in our ODE setting, because it is difficult to split the time series data into independent parts.

Next, we devise a procedure to compute the cutoff value $c(\hat{x}_j(t_i))$.
Proposition 1. The value $c(\hat{x}_j(t_i))$ in (16) is the same as the solution $t \geq 0$ satisfying,

$$\mathbb{E}_U \, P \left( \max_{M_j \subseteq \mathcal{M}} \left| \{\tilde{A}_{M_j}\}_{i \cdot} V \right| \leq t / U \;\Big|\; U \right) = 1 - \alpha,$$

where $V$ is uniformly distributed on the unit sphere in $\mathbb{R}^n$, and $U$ is a nonnegative random variable such that $U^2$ follows a chi-squared distribution $\chi^2(n)$.

Following Proposition 1, we compute $c(\hat{x}_j(t_i))$ as follows. We first generate $N$ i.i.d. copies of random vectors $V_1, \ldots, V_N$ uniformly distributed on the unit sphere in $\mathbb{R}^n$. We then calculate the quantity $c_\nu = \max_{M_j \subseteq \mathcal{M}} |\{\tilde{A}_{M_j}\}_{i \cdot} V_\nu|$, for $\nu = 1, \ldots, N$. Let $D_U$ denote the cumulative distribution function of $U$, and $D_{\chi^2}$ the cumulative distribution function of a $\chi^2(n)$ distribution, so that $D_U(t) = D_{\chi^2}(t^2)$. We next obtain $c(\hat{x}_j(t_i))$ by searching for the $c$ that solves $N^{-1} \sum_{\nu=1}^{N} D_U(c / c_\nu) = 1 - \alpha$, using, for example, a bisection search. Finally, we estimate the error variance $\sigma_j^2$ in (17) using the usual noise estimator in the context of RKHS (Wahba, 1990), i.e., $\hat{\sigma}_j^2 = \|A_{M_j}(y_j - \bar{y}_j) - (y_j - \bar{y}_j)\|^2 / \mathrm{tr}(I - A_{M_j})$.

We also remark that the inference on the prediction of the trajectory $x_j(t)$ following model selection, as described in Theorem 2, amounts to the inference on the estimation of the integral $\int_0^t F_j(x(s)) ds$. This type of inference is of great importance in dynamic systems (Izhikevich, 2007; Chou and Voit, 2009; Ma et al., 2009). Our solution takes the selected model as an approximation to the truth, but does not require that the true data generation model be among the candidates of model selection. We note that it is also possible to do inference on the individual components of $F_j$ directly; e.g., one could construct the confidence interval for $F_{jk}$ in (3). But this is achieved at the cost of imposing additional assumptions, including the requirement that the true data generation model is among the class of pairwise interaction models as in (3), and the orthogonality property as in Chernozhukov et al. (2015), or its equivalent characterization as in Zhang and Zhang (2014); Javanmard and Montanari (2014). For nonparametric kernel estimators, the orthogonality property is shown to hold if the covariates $x_j$'s are assumed to be weakly dependent (Lu et al., 2020). It is interesting to further investigate if such a property holds in the context of the kernel ODE model under a similar condition of weakly dependent covariates. We leave this as our future research.
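The Monte Carlo recipe of Proposition 1 is direct to implement. In the sketch below, `rows` stacks the normalized $i$th rows $\{\tilde{A}_{M_j}\}_{i\cdot}$ over all candidate models $M_j$ (an assumed input format), and the root is found with a bracketing solver in place of plain bisection.

```python
import numpy as np
from scipy.stats import chi2
from scipy.optimize import brentq

def mc_cutoff(rows, n, alpha=0.05, N=1000, seed=0):
    """Compute c(xhat_j(t_i)) following Proposition 1 by Monte Carlo."""
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(N, n))
    V /= np.linalg.norm(V, axis=1, keepdims=True)        # uniform on the sphere
    c_nu = np.abs(np.asarray(rows) @ V.T).max(axis=0)    # c_nu, nu = 1..N
    # D_U(t) = P(U <= t) with U^2 ~ chi2(n), hence D_U(t) = D_chi2(t^2).
    f = lambda c: np.mean(chi2.cdf((c / c_nu) ** 2, df=n)) - (1 - alpha)
    return brentq(f, 1e-8, 1e3)
```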
4 Theoretical Properties

We next establish the estimation optimality and selection consistency of kernel ODE. These theoretical results hold for both the low-dimensional and high-dimensional settings, where the number of functionals $p$ can be smaller or larger than the sample size $n$. We first introduce two assumptions.

Assumption 1. The number of nonzero functional components is bounded, i.e., $\mathrm{card}\left( \{k : F_{jk} \neq 0\} \cup \{1 \leq l \neq k \leq p : F_{jkl} \neq 0\} \right)$ is bounded, for any $j = 1, \ldots, p$.
Assumption 2. For any $F_j \in \mathcal{H}$, there exists a random variable $B$, with $\mathbb{E}(B^2) < \infty$, such that

$$\left| \frac{\partial F_j(x)}{\partial x_k} \right| \leq B \, \|F_j\|_{L_2}, \quad \text{almost surely}.$$

Assumption 1 concerns the complexity of the functionals. Similar assumptions have been adopted in the sparse additive model over RKHS when $F_{jkl} = 0$ (see, e.g., Koltchinskii and Yuan, 2010; Raskutti et al., 2011). Assumption 2 is an inverse Poincaré inequality type condition, which regularizes the fluctuation in $F_j$ relative to the $L_2$-norm. The same assumption was also used for additive models in RKHS (Zhu et al., 2014).

We begin with the error bound for the estimated trajectory $\hat{x}(t)$, which holds uniformly for $j = 1, \ldots, p$. This is a relatively standard result, and the bound is needed both for analyzing the error of the functional estimators in kernel ODE, and for establishing the selection consistency later.

Theorem 3 (Optimal estimation of the trajectory). Suppose that $x_j(t) \in \mathcal{F}$, $j = 1, \ldots, p$, and the RKHS $\mathcal{F}$ is embedded in a $\beta_1$th-order Sobolev space, $\beta_1 > 1/2$. Then the smoothing spline estimate from (9) satisfies, for any $j = 1, \ldots, p$,

$$\min_{\lambda_{nj} \geq 0} \int_{\mathcal{T}} \{\hat{x}_j(t) - x_j(t)\}^2 dt = O_p\left( n^{-\frac{2\beta_1}{2\beta_1 + 1}} \right),$$

which achieves the minimax optimal rate.

Next, we derive the convergence rate for the estimated functional $F_j$. Because the trajectory $\hat{x}$ is estimated, establishing the optimal rate of convergence requires extra theoretical attention, which is related to recent work on errors in variables for Lasso-type regressions (Loh and Wainwright, 2012; Zhu et al., 2014). The proof involves several tools for Rademacher processes (van der Vaart and Wellner, 1996), and concentration inequalities for empirical processes (Talagrand, 1996; Yuan and Zhou, 2016).

Theorem 4 (Optimal estimation of the functional). Suppose that $F_j \in \mathcal{H}$, $j = 1, \ldots, p$, where $\mathcal{H}$ satisfies (8), and the RKHS $\mathcal{H}_j$ is embedded in a $\beta_2$th-order Sobolev space, $\beta_2 > 1$. Suppose Assumptions 1 and 2 hold. Then, as long as $F_j$ is not a constant function, the KODE estimate $\hat{F}_j$ from (10) satisfies, for any $j = 1, \ldots, p$,

$$\min_{\tau_{nj} \geq 0} \int_{\mathcal{T}} \left\{ \hat{F}_j(x(t)) - F_j(x(t)) \right\}^2 dt = O_p\left( \left( \frac{n}{\log n} \right)^{-\frac{2\beta_2}{2\beta_2 + 1}} + \frac{\log p}{n} + n^{-\frac{2\beta_1}{2\beta_1 + 1}} \right),$$

which achieves the minimax optimal rate.

This theorem is one of the key results, and we make a few remarks. First, there are three error terms in Theorem 4, which are attributed to the estimation of the interactions, the Lasso estimation, and the measurement errors in variables, respectively. Particularly, the error term $O_p(n^{-2\beta_1/(2\beta_1+1)})$ arises because $x(t)$ is unobserved, and is instead measured at discrete time points subject to measurement errors. Since this error term achieves the optimal rate, it fully characterizes the influence of the estimated $\hat{x}(t)$ on the resulting estimator $\hat{F}_j$. Moreover, $\beta_1$ and $\beta_2$ measure the orders of smoothness for estimating $x_j$ and $F_j$, respectively. They can be different, which allows flexibility when choosing kernels for the estimation procedure. For instance, if there is prior knowledge that $x(t)$ is smooth, we may then choose $\beta_1 > \beta_2$, and the resulting estimator $\hat{F}_j$ achieves a convergence rate of $O_p\left( (n/\log n)^{-2\beta_2/(2\beta_2+1)} + \log p / n \right)$.
It is interesting to note that this rate is the same as the rate as if $x(t)$ were directly observed and there were no integral involved in the loss function, for example, in the setting of Lin and Zhang (2006).

Second, there exists a regime-switching phenomenon, depending on the dimensionality $p$ relative to the sample size $n$. On one hand, in the ultrahigh-dimensional setting where $p > \exp\left[ \{n (\log n)^{2\beta_2}\}^{1/(2\beta_2 + 1)} \right]$, the minimax optimal rate in Theorem 4 becomes $O_p\left( \log p / n + n^{-2\beta_1/(2\beta_1+1)} \right)$; indeed, $\log p / n$ dominates $(n/\log n)^{-2\beta_2/(2\beta_2+1)}$ exactly when $\log p$ exceeds $\{n (\log n)^{2\beta_2}\}^{1/(2\beta_2+1)}$. Here, the first rate $O_p(\log p / n)$ matches the minimax optimal rate for estimating a $p$-dimensional linear regression when the vector of regression coefficients has a bounded number of nonzero entries (Raskutti et al., 2011). Hence, we pay no extra price in terms of the rate of convergence for adopting a nonparametric model of $F_j$ in (3), when compared with the more restrictive linear ODE model in (4) (Zhang et al., 2015). On the other hand, in the low-dimensional setting where $p \leq \exp\left[ \{n (\log n)^{2\beta_2}\}^{1/(2\beta_2 + 1)} \right]$, the optimal rate becomes $O_p\left( (n/\log n)^{-2\beta_2/(2\beta_2+1)} + n^{-2\beta_1/(2\beta_1+1)} \right)$. Here, the first rate $O_p\left( (n/\log n)^{-2\beta_2/(2\beta_2+1)} \right)$ is the same as the optimal rate for estimating $F_j$ as if we knew a priori that $F_j$ comes from a two-dimensional tensor product functional space, rather than the $p$-variate functional space $\mathcal{H}$ in (8); see also Lin (2000) for a similar observation.

Third, the optimal rate in Theorem 4 is immune to the "curse of dimensionality", in the following sense.
We introduce $p(p-1)$ pairwise interaction components to $\mathcal{H}$ in (8), and hence, for each $x_j(t)$, $j = 1, \ldots, p$, we have to estimate a total of $p^2$ functions. A direct application of an existing basis expansion approach, for instance, Brunton et al. (2016), leads to a rate of $O_p\left( n^{-O(1/p)} \right)$. This rate degrades quickly as $p$ increases. By contrast, we proceed in a different way, where we simultaneously aim for the flexibility of a nonparametric ODE model, by letting $\mathcal{H}$ obey a tensor product structure as in (8), while exploiting the interaction structure of the system. As a result, our optimal error bound $O_p\left( (n/\log n)^{-2\beta_2/(2\beta_2+1)} \right)$ does not depend on the dimensionality $p$.

Lastly, the incorporation of the integral $\int_0^{t_i} F_j(\hat{x}(t)) dt$ in the loss function in (10) makes the estimation error of $\hat{F}_j$ depend on the convergence of $\mathbb{E} \int_{\mathcal{T}} \{\hat{x}_j(t) - x_j(t)\}^2 dt$. As a comparison, if we were to use the derivative instead of the integral, the estimation error would depend on the convergence of the derivative, $\mathbb{E} \int_{\mathcal{T}} \{d\hat{x}_j(t)/dt - dx_j(t)/dt\}^2 dt$ (Wu et al., 2014). However, it is known that derivative estimation in a reproducing kernel Hilbert space has a slower convergence rate than function estimation (Cox, 1983). That is, $\mathbb{E} \int_{\mathcal{T}} \{d\hat{x}_j(t)/dt - dx_j(t)/dt\}^2 dt$ converges at a slower rate than $\mathbb{E} \int_{\mathcal{T}} \{\hat{x}_j(t) - x_j(t)\}^2 dt$. This demonstrates the advantage of working with the integral in our KODE formulation, and our result echoes the observation for the additive ODE model (Chen et al., 2017).

Next, we establish the selection consistency of KODE. Putting all the functionals $\{F_1, \ldots, F_p\}$ together forms a network of regulatory relations among the $p$ variables $\{x_1(t), \ldots, x_p(t)\}$. Recall that we say $x_k$ is a regulator of $x_j$ if, in (3), $F_{jk}$ is nonzero, or $F_{jkl}$ is nonzero for some $l \neq k$. Denote the sets of the true regulators and the estimated regulators of $x_j(t)$ by

$$S_j = \left\{ 1 \leq k \leq p : F_{jk} \neq 0, \text{ or } F_{jkl} \neq 0 \text{ for some } 1 \leq l \neq k \leq p \right\},$$
$$\hat{S}_j = \left\{ 1 \leq k \leq p : \|\hat{F}_{jk}\|_{\mathcal{H}} \neq 0, \text{ or } \|\hat{F}_{jkl}\|_{\mathcal{H}} \neq 0 \text{ for some } 1 \leq l \neq k \leq p \right\},$$

respectively, $j = 1, \ldots, p$. We need some extra regularity conditions on the minimum regulatory effect and the design matrix, which are all commonly adopted in the literature on Lasso regression (Zhao and Yu, 2006; Ravikumar et al., 2010; Chen et al., 2017). Due to the additional notation involved, we defer those conditions to the Supplementary Appendix. The next theorem establishes that KODE is able to recover the true regulatory network asymptotically.

Theorem 5 (Recovery of the regulatory network). Suppose that $F_j \in \mathcal{H}$, $j = 1, \ldots, p$, where $\mathcal{H}$ satisfies (8), and the RKHS $\mathcal{H}_j$ is embedded in a $\beta_2$th-order Sobolev space, $\beta_2 > 1$. Suppose Assumption 1, and Assumptions 3-5 in the Appendix, hold. Then KODE correctly recovers the true regulatory network, in that, for all $j = 1, \ldots, p$,

$$P\left( \hat{S}_j = S_j \right) \to 1, \quad \text{as } n \to \infty.$$

Figure 1: (a) Diagram of the NFBLB regulatory network following (7). (b) Phase dynamics of the three nodes $x_1, x_2, x_3$ over time, with the input $x_0$ drawn uniformly at random.

5 Simulation Studies

We study the empirical performance of the proposed KODE with two examples: the enzyme regulatory network, and the Lotka-Volterra equations.
5.1 General setup

Given a system of ODEs and the initial condition, we obtain the numerical solutions of the ODEs using the Euler method with step size 0.01.
The data observations are drawn from the solutions at an evenly spaced time grid, with measurement errors. To implement KODE, we fit the smoothing spline to estimate $x_j(t)$ in (9) using the Mercer kernel $K_{\mathcal{F}}(x, x') = (1 + \sqrt{3}|x - x'|/\nu) \exp(-\sqrt{3}|x - x'|/\nu)$, where the smoothing parameter $\lambda_{nj}$ is chosen by GCV, and the bandwidth $\nu$ is chosen by tenfold cross-validation. We compute the integral $\int_0^{t_i} F_j(\hat{x}(t)) dt$ in (10) numerically, with independent sets of 1,000 Monte Carlo points. We compare KODE with the linear ODE with interactions in (4) (Zhang et al., 2015), and the additive ODE in (6) (Chen et al., 2017). Due to the lack of available code online, we implement the two competing methods in the framework of Algorithm 1, using a linear kernel for (4), and an additive Mercer kernel for (6). We evaluate the performance using the prediction error, plus the false discovery proportion and the empirical power for edge selection of the corresponding regulatory network.
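For reference, the Mercer kernel $K_{\mathcal{F}}$ above is of Matérn type and is straightforward to code; the bandwidth value below is a placeholder for the cross-validated choice.

```python
import numpy as np

def matern_kernel(x, xp, nu=1.0):
    """K_F(x, x') = (1 + sqrt(3)|x - x'|/nu) * exp(-sqrt(3)|x - x'|/nu)."""
    d = np.sqrt(3.0) * np.abs(np.asarray(x)[:, None]
                              - np.asarray(xp)[None, :]) / nu
    return (1.0 + d) * np.exp(-d)
```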
5.2 Enzymatic regulatory network

The first example is a three-node enzyme regulatory network of a negative feedback loop with a buffering node (NFBLB, Ma et al., 2009). The ODE system is given in (7) in Section 2.1. Figure 1(a) shows the NFBLB network diagram consisting of the three interacting nodes: $x_1$ receives the input, $x_2$ transmits the output, and $x_3$ plays a regulatory role, leading a negative regulatory link to $x_2$. We note that, although biological circuits can have more than three nodes, many of those circuits can be reduced to a three-node framework, given that multiple molecules often function as a single virtual node. Moreover, despite the diversity of possible network topologies, NFBLB is one of the two core three-node topologies that can perform adaptation, in the sense that the system resets itself after responding to a stimulus; see Ma et al. (2009) for more discussion of NFBLB. For the ODE system in (7), we set the catalytic rate parameters of the enzymes as $c_1 = c_2 = c_3 = c_4 = c_5 = 10$, $c_6 = 1$, the Michaelis-Menten constants as $C_1 = \cdots = C_6 = 0.1$, and the concentration parameters of the enzymes as $\tilde{c}_1 = 1$, $\tilde{c}_2 = 0.2$.
These parameters achieve the adaptation shown in Figure 1(b). The output node $x_2$ shows a strong initial response to the stimulus, and also exhibits strong adaptation, since its post-stimulus steady state $x_2 = 0.052$ is close to the pre-stimulus state $x_2 = 0$.
The input $x_0 \in \mathbb{R}$ is drawn uniformly from a fixed interval, the initial value is $x(0) = 0$, and the measurement errors are i.i.d. normal with mean zero and variance $\sigma_j^2$. The time points $t_i$ are evenly distributed, $i = 1, \ldots, n$. In this example, $p = 3$, and for each function $x_j(t)$, $j = 1, 2, 3$, there are $p^2 = 9$ functions to estimate, so in total there are 27 functions to estimate under the sample size $n = 40$.
Figure 2 reports the true and estimated trajectories of $x_2(t)$, with 95% upper and lower confidence bounds, for the three ODE methods, where we use the tensor product Mercer kernel for KODE in (10). The noise level $\sigma_j$ is set to a small common value for $j = 1, 2, 3$, and the results are averaged over 500 data replications.
It is seen that the KODE estimate has a smaller variance than the additive and linear ODE estimates. Moreover, the confidence interval of KODE achieves the desired coverage for the true trajectory. In contrast, the confidence intervals of the additive and linear ODE models mostly fail to include the truth. This is because there is a discrepancy between the additive and linear ODE model specifications and the true ODE model in (7), and this discrepancy accumulates as the course of the ODE evolves.
Figure 2: The true (black solid line) and estimated (blue dashed line) trajectory of $x_2(t)$, with the 95% upper and lower confidence bounds (red dotted lines). (a) KODE, (b) Additive ODE, (c) Linear ODE. The results are averaged over 500 data replications.

Figure 3: The prediction and selection performance of the three ODE methods with varying noise level. The results are averaged over 500 data replications. (a) Prediction error, (b) False discovery proportion, (c) Empirical power.
Figure 3 reports the prediction and selection performance of the three ODE methods, with the noise level $\sigma_j$ varying over an increasing grid, $j = 1, 2, 3$.
The results are averaged over 500 data replications. The prediction error is defined as the square root of the sum of the predictive mean squared errors for $x_1(t), x_2(t), x_3(t)$ at the unseen "future" time point $t = 2$. The false discovery proportion is defined as the proportion of falsely selected edges in the regulatory network out of the total number of selected edges. The empirical power is defined as the proportion of selected true edges in the network. It is seen that KODE clearly outperforms the two alternative solutions in terms of both prediction and selection accuracy.

5.3 Lotka-Volterra equations

The second example is the high-dimensional Lotka-Volterra equations, which are pairs of first-order nonlinear differential equations describing the dynamics of biological systems in which predators and prey interact (Volterra, 1928). We consider a ten-node system,

$$\frac{dx_{2j-1}(t)}{dt} = \alpha_{1,j} x_{2j-1}(t) - \alpha_{2,j} x_{2j-1}(t) x_{2j}(t), \qquad \frac{dx_{2j}(t)}{dt} = \alpha_{3,j} x_{2j-1}(t) x_{2j}(t) - \alpha_{4,j} x_{2j}(t), \qquad (18)$$

where the rate parameters $\alpha_{1,j}, \alpha_{2,j}, \alpha_{3,j}, \alpha_{4,j}$ are fixed positive constants that are scaled geometrically in $j$, for $j = 1, \ldots, 5$.
The parameters $\alpha_{2,j}$ and $\alpha_{3,j}$ define the interaction between the two populations, such that $dx_{2j-1}(t)/dt$ and $dx_{2j}(t)/dt$ are nonadditive functions of $x_{2j-1}$ and $x_{2j}$, where $x_{2j-1}$ is the prey and $x_{2j}$ is the predator. Figure 4(a) shows an illustration of the interaction between $x_1(t)$ and $x_2(t)$. The input $x_0$ is drawn uniformly at random, with the initial values satisfying $x_{2j-1}(0) = x_{2j}(0)$, and the measurement errors are i.i.d. normal $N(0, \sigma_j^2)$, where $\sigma_j$ again reflects the noise level. The time points are evenly distributed, with $n = 200$. In this example, $p = 10$, and for each function $x_j(t)$, $j = 1, \ldots, 10$, there are $p^2 = 100$ functions to estimate, so in total there are 1,000 functions to estimate under the sample size $n = 200$.
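To see how one predator-prey pair of (18) maps onto the decomposition (3), the sketch below separates each derivative into its main-effect and pure-interaction terms; the rate values are placeholders.

```python
def lv_pair_rhs(prey, pred, a1=1.0, a2=0.5, a3=0.5, a4=1.0):
    """One pair of (18): dx_{2j-1}/dt and dx_{2j}/dt. Each line is a linear
    main effect F_jk plus a bilinear pairwise interaction F_jkl, as in (3)."""
    d_prey = a1 * prey - a2 * prey * pred   # main effect  - interaction
    d_pred = a3 * prey * pred - a4 * pred   # interaction  - main effect
    return d_prey, d_pred
```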
Figure 4(b) and (c) report the estimated trajectory of $x_1(t)$, with 95% upper and lower confidence bounds, for KODE and the additive ODE, where the noise level is set as $\sigma_j = 1$, $j = 1, \ldots, 10$.
The confidence interval of KODE achieves better empirical coverage of the true trajectory than that of the additive ODE. For this example, we use the linear kernel for KODE in (10), since the functional forms in (18) are known to be linear in the main effects and the pairwise interactions; for this reason, we only compare KODE with the additive ODE method. Moreover, in the implementation, the estimates $\hat{F}_j(\hat{x}(t))$ are thresholded to be nonnegative, to respect the physical constraint that population sizes cannot be negative.
Figure 4: (a) The true trajectories of the prey $x_1(t)$ and the predator $x_2(t)$. (b) The estimated trajectory $\hat{x}_1(t)$ (blue dashed line), with the 95% upper and lower confidence bounds (red dotted lines), by KODE. (c) The same, by the additive ODE. The results are averaged over 500 data replications.

Figure 5: The prediction and selection performance of the two ODE methods with varying noise level. The results are averaged over 500 data replications. (a) Prediction error, (b) False discovery proportion, (c) Empirical power.
10. All the results are averaged over 500 data replications. Itis seen that the KODE estimate achieves a smaller prediction error, and a higher selectionaccuracy, since KODE allows flexible non-additive structures, which results in significantlysmaller bias and variance in functional estimation as compared to the additive modeling.23
6 Application to Gene Regulatory Network
We illustrate KODE with a gene regulatory network application. Schaffter et al. (2011) developed an open-source platform called GeneNetWeaver (GNW) that generates in silico benchmark gene expression data, using dynamical models of gene regulation and nonlinear ODEs. The generated data have been used for evaluating the performance of network inference methods in the DREAM3 competition (Marbach et al., 2009), and were also analyzed by Henderson and Michailidis (2014); Chen et al. (2017) in additive ODE modeling. GNW extracts two regulatory networks of E.coli (E.coli1, E.coli2) and three regulatory networks of yeast (yeast1, yeast2, yeast3), each of which comes in two dimensions, $p = 10$ nodes and $p = 100$ nodes. This yields in total 10 combinations of network structures. Figure 6(a)-(b) shows an example of the 10-node and the 100-node E.coli1 networks, respectively. The system of ODEs for each extracted network is based on a thermodynamic approach, which leads to a non-additive and nonlinear ODE structure (Marbach et al., 2010). Besides, the network structures are sparse; e.g., for the 10-node E.coli1 network, there are 11 edges out of 90 possible pairwise edges, and for the 100-node E.coli1 network, there are 125 edges out of 9,900 possible pairwise edges. Moreover, for the 10-node networks, GNW provides $R = 4$ perturbation experiments, and for the 100-node networks, GNW provides $R = 46$ perturbation experiments. In each perturbation experiment, GNW generates the time-course data with a different initial condition of the ODE system, to emulate the diversity of gene expression trajectories (Marbach et al., 2009). Figure 6(c)-(f) shows the time-course data under the $R = 4$ perturbation experiments for the 10-node E.coli1 network. All the trajectories are measured at $n = 21$ evenly spaced time points in $[0, 1]$. Let $\{y^{(r)}_{ij}; i = 1, \ldots, n, j = 1, \ldots, p, r = 1, \ldots, R\}$ denote the observed data for the $p$ variables under the $R$ experiments, with unknown initial conditions $x^{(r)}(0) \in \mathbb{R}^p$, $r = 1, \ldots, R$.
Figure 6: (a) The 10-node E.coli1 network. (b) The 100-node E.coli1 network. (c)-(f) Four perturbation experiments for the 10-node E.coli1 network, where each experiment corresponds to a different initial condition of the ODE system.
Then we modify the KODE method in (9) and (10), seeking $F_j \in \mathcal{H}$ and $\theta_{j0} \in \mathbb{R}$ that minimize

$$\frac{1}{Rn} \sum_{r=1}^{R} \sum_{i=1}^{n} \left\{ y^{(r)}_{ij} - \theta_{j0} - \int_0^{t_i} F_j(\hat{x}^{(r)}(t)) dt \right\}^2 + \tau_{nj} \left( \sum_{k=1}^{p} \|\mathcal{P}_k F_j\|_{\mathcal{H}} + \sum_{k=1}^{p} \sum_{l \neq k} \|\mathcal{P}_{kl} F_j\|_{\mathcal{H}} \right),$$

where $\hat{x}^{(r)}(t) = (\hat{x}^{(r)}_1(t), \ldots, \hat{x}^{(r)}_p(t))^\top$ is the smoothing spline estimate obtained by,

$$\hat{x}^{(r)}_j(t) = \arg\min_{z_j \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left( y^{(r)}_{ij} - z_j(t_i) \right)^2 + \lambda_{nj} \|z_j\|_{\mathcal{F}}^2 \right\}, \quad j = 1, \ldots, p, \ r = 1, \ldots, R.$$

Algorithm 1 can be modified accordingly to work with the multiple experiments.

We again compare KODE with the additive ODE (Chen et al., 2017) and the linear ODE (Zhang et al., 2015), adopting the same implementation as in the simulations. Since we know the true edges of the generated gene regulatory networks, we use the area under the ROC curve (AUC) as the evaluation criterion.
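Since the truth is known, edge recovery can be scored by the AUC; the sketch below ranks each candidate edge $k \to j$ by the estimated component norms and compares against the true adjacency. The input format and the use of scikit-learn are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def edge_auc(true_adj, edge_scores):
    """AUC for regulatory-edge recovery. true_adj[j, k] = 1 if x_k regulates
    x_j; edge_scores[j, k] is, e.g., ||P_k F_j||_H plus the norms of the
    interaction components involving x_k. Self-loops are excluded, matching
    the p(p-1) candidate edges described above."""
    off = ~np.eye(true_adj.shape[0], dtype=bool)
    return roc_auc_score(true_adj[off].ravel(), edge_scores[off].ravel())
```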
The area under the ROC curve, and the 95% confidence interval, for 10 combinationsof network structures from GNW. The results are averaged over 100 data replications. p = 10 p = 100KODE Additive ODE Linear ODE KODE Additive ODE Linear ODE E.coli1 .
541 0 . .
677 0 . . , . . , . . , . . , . . , . . , . E.coli2 .
632 0 . .
659 0 . . , . . , . . , . . , . . , . . , . .
541 0 . .
589 0 . . , . . , . . , . . , . . , . . , . .
562 0 . .
588 0 . . , . . , . . , . . , . . , . . , . .
569 0 . .
613 0 . . , . . , . . , . . , . . , . . , . the ROC curve (AUC) as the evaluation criterion. Table 1 reports the results averagedover 100 data realizations for all ten combinations of network structures. It is clearly seenthat KODE outperforms both the additive and linear ODE methods in all cases. Thisdemonstrates that our proposed KODE is a competitive and useful tool for ODE modeling.In addition, this example also shows that the proposed method can scale up and work withreasonably large networks. For instance, for the network with p = 100 nodes, there are p = 10 ,
References
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337-404.

Bachoc, F., Leeb, H., and Pötscher, B. M. (2019). Valid confidence intervals for post-model-selection predictors. Annals of Statistics, 47:1475-1504.

Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L. (2013). Valid post-selection inference. Annals of Statistics, 41:802-837.

Brunton, S. L., Proctor, J. L., and Kutz, J. N. (2016). Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113:3932-3937.

Cao, J. and Zhao, H. (2008). Estimating dynamic models for gene regulation networks. Bioinformatics, 24:1619-1624.

Cao, X., Sandstede, B., and Luo, X. (2019). A functional data method for causal dynamic network modeling of task-related fMRI. Frontiers in Neuroscience, 13:127.

Chen, S., Shojaie, A., and Witten, D. M. (2017). Network reconstruction from high-dimensional ordinary differential equations. Journal of the American Statistical Association, 112:1697-1707.

Chernozhukov, V., Hansen, C., and Spindler, M. (2015). Valid post-selection and post-regularization inference: An elementary, general approach. Annual Review of Economics, 7:649-688.

Chou, I.-C. and Voit, E. O. (2009). Recent developments in parameter estimation and structure identification of biochemical and genomic systems. Mathematical Biosciences, 219:57-83.

Cox, D. D. (1983). Asymptotics for M-type smoothing splines. Annals of Statistics, 11:530-551.

Cox, D. R. (1975). A note on data-splitting for the evaluation of significance levels. Biometrika, 62:441-444.

Dattner, I. and Klaassen, C. A. J. (2015). Optimal rate of direct estimators in systems of ordinary differential equations linear in functions of the parameters. Electronic Journal of Statistics, 9:1939-1973.

Gu, C. (2013). Smoothing Spline ANOVA Models. Springer Science & Business Media.

Henderson, J. and Michailidis, G. (2014). Network reconstruction using nonparametric additive ODE models. PLOS ONE, 9:1-15.

Huang, J. Z. (1998). Projection estimation in multiple regression with application to functional ANOVA models. The Annals of Statistics, 26(1):242-272.

Izhikevich, E. (2007). Dynamical Systems in Neuroscience. MIT Press.

Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15:2869-2909.

Koltchinskii, V. and Yuan, M. (2010). Sparsity in multiple kernel learning. Annals of Statistics, 38:3660-3695.

Liang, H. and Wu, H. (2008). Parameter estimation for differential equation models using a framework of measurement error in regression models. Journal of the American Statistical Association, 103:1570-1583.

Lin, Y. (2000). Tensor product space ANOVA models. Annals of Statistics, 28:734-755.

Lin, Y. and Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. Annals of Statistics, 34:2272-2297.

Loh, P.-L. and Wainwright, M. J. (2012). High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. Annals of Statistics, 40:1637-1664.

Lu, J., Kolar, M., and Liu, H. (2020). Kernel meets sieve: Post-regularization confidence bands for sparse additive model. Journal of the American Statistical Association, 1-16.

Lu, T., Liang, H., Li, H., and Wu, H. (2011). High-dimensional ODEs coupled with mixed-effects modeling techniques for dynamic gene regulatory network identification. Journal of the American Statistical Association, 106:1242-1258.

Ma, W., Trusina, A., El-Samad, H., Lim, W. A., and Tang, C. (2009). Defining network topologies that can achieve biochemical adaptation. Cell, 138:760-773.

Marbach, D., Prill, R. J., Schaffter, T., Mattiussi, C., Floreano, D., and Stolovitzky, G. (2010). Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the National Academy of Sciences, 107:6286-6291.

Marbach, D., Schaffter, T., Mattiussi, C., and Floreano, D. (2009). Generating realistic in silico gene networks for performance assessment of reverse engineering methods. Journal of Computational Biology, 16:229-239.

Opsomer, J. D. and Ruppert, D. (1997). Fitting a bivariate additive model by local polynomial regression. Annals of Statistics, 25:186-211.

Raskutti, G., Wainwright, M. J., and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Transactions on Information Theory, 57:6976-6994.

Ravikumar, P., Wainwright, M. J., and Lafferty, J. (2010). High-dimensional Ising model selection using $\ell_1$-regularized logistic regression. Annals of Statistics, 38:1287-1319.

Schaffter, T., Marbach, D., and Floreano, D. (2011). GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics, 27:2263-2270.

Stewart, G. W. (1987). Collinearity and least squares regression. Statistical Science, 2:68-84.

Talagrand, M. (1996). New concentration inequalities in product spaces. Inventiones Mathematicae, 126:505-563.

Tzafriri, A. R. (2003). Michaelis-Menten kinetics at high enzyme concentrations. Bulletin of Mathematical Biology, 65:1111-1129.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer-Verlag, New York.

Varah, J. M. (1982). A spline least squares method for numerical parameter estimation in differential equations. SIAM Journal on Scientific and Statistical Computing, 3:28-46.

Volterra, V. (1928). Variations and fluctuations of the number of individuals in animal species living together. ICES Journal of Marine Science, 3:3-51.

Wahba, G. (1983). Bayesian "confidence intervals" for the cross-validated smoothing spline. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 45:133-150.

Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.

Wahba, G., Wang, Y., Gu, C., Klein, R., and Klein, B. (1995). Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy. Annals of Statistics, 23:1865-1895.

Wang, S., Nan, B., Zhu, N., and Zhu, J. (2009). Hierarchically penalized Cox regression with grouped variables. Biometrika, 96(2):307-322.

Wu, H., Lu, T., Xue, H., and Liang, H. (2014). Sparse additive ordinary differential equations for dynamic gene regulatory network modeling. Journal of the American Statistical Association, 109:700-716.

Yuan, M. and Zhou, D.-X. (2016). Minimax optimal rates of estimation in high dimensional additive models. Annals of Statistics, 44(6):2564-2593.

Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society. Series B, 76(1):217-242.

Zhang, T., Wu, J., Li, F., Caffo, B., and Boatman-Reich, D. (2015). A dynamic directional model for effective brain connectivity using electrocorticographic (ECoG) time series. Journal of the American Statistical Association, 110:93-106.

Zhang, T., Yin, Q., Caffo, B., Sun, Y., and Boatman-Reich, D. (2017). Bayesian inference of high-dimensional, cluster-structured ordinary differential equation models with applications to brain connectivity studies. Annals of Applied Statistics, 11:868-897.

Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541-2563.

Zhu, H., Yao, F., and Zhang, H. H. (2014). Structured functional additive regression in reproducing kernel Hilbert spaces. Journal of the Royal Statistical Society. Series B, 76(3):581-603.