Multiple-shooting adjoint method for whole-brain dynamic causal modeling
Juntang Zhuang, Nicha Dvornek, Sekhar Tatikonda, Xenophon Papademetris, Pamela Ventola, James Duncan
Yale University
Abstract.
Dynamic causal modeling (DCM) is a Bayesian framework to infer directed connections between compartments, and has been used to describe the interactions between underlying neural populations based on functional neuroimaging data. DCM is typically analyzed with the expectation-maximization (EM) algorithm. However, because the inversion of a large-scale continuous system is difficult when noisy observations are present, DCM by EM is typically limited to a small number of compartments (<

https://jzkay12.github.io/TorchDiffEqPack

Keywords: multiple shooting, adjoint method, dynamic causal modeling
Introduction

Autism spectrum disorder (ASD) is a neurodevelopmental disorder that affects both social behavior and mental health [11]. ASD is typically diagnosed with behavioral tests, and recently functional MRI (fMRI) has been applied to analyze the cause of ASD [3]. Connectome analysis in fMRI aims to elucidate neural connections in the brain and can be generally categorized into two types: the functional connectome (FC) [21] and the effective connectome (EC) [5]. The FC typically calculates the correlation between time-series of different regions-of-interest (ROIs) in the brain, and is typically robust and easy to compute; however, FC does not reveal the underlying dynamics. EC models the directed influence between ROIs, and is widely used in the analysis of EEG [8] and fMRI [19].

EC is typically estimated using dynamic causal modeling (DCM) [5]. DCM can be viewed as a Bayesian framework for parameter estimation in a dynamical system represented by an ordinary differential equation (ODE). A DCM model is typically optimized using the expectation-maximization (EM) algorithm [10]. Despite its wide application and good theoretical properties, a drawback is that the algorithm needs to be re-derived whenever the forward model changes, which limits its application. Furthermore, current DCM cannot handle large-scale systems, hence is unsuitable for whole-brain analysis. Recent works such as rDCM [4], spectral-DCM [17] and sparse-DCM [16] modify DCM for whole-brain analysis of resting-state fMRI, yet they are limited to a linear dynamical system and use the EM algorithm for optimization, hence cannot be used as off-the-shelf methods for different forward models.

In this project, we propose the Multiple-Shooting Adjoint (MSA) method for parameter estimation in DCM. Specifically, MSA uses the multiple-shooting method [1] for robust fitting of an ODE, and uses the adjoint method [15] for gradient estimation in the continuous case; after deriving the gradient, generic optimizers such as stochastic gradient descent (SGD) can be applied. Our contributions are: (1) MSA is implemented as an off-the-shelf method, and can be easily applied to generic non-linear cases by specifying the forward model without re-deriving the optimization algorithm. (2) In toy examples, we validate the accuracy of MSA in parameter estimation; we also validate its ability to handle large-scale systems. (3) We apply MSA to whole-brain dynamic causal modeling for fMRI; in a classification task of ASD vs. control, EC estimated by MSA achieves better performance than FC.
We first introduce the notations and the problem in Sec. 2.1, then introduce the mathematical methods in Sec. 2.3-2.4, and finally introduce DCM for fMRI in Sec. 2.5.
We summarize the notations here for ease of reading; they correspond to Fig. 1.

• $z(t)$, $\tilde{z}(t)$, $\bar{z}(t)$: $z(t)$ is the true time-series, $\tilde{z}(t)$ is the noisy observation, and $\bar{z}(t)$ is the estimation. If $p$ time-series are observed, then they are $p$-dimensional vectors for each time $t$.
• $(t_i, \hat{z}_i)_{i=0}^{N}$: $\{\hat{z}_i\}_{i=0}^{N}$ are the corresponding guesses of the states at the split time points $\{t_i\}_{i=0}^{N}$. See Fig. 1. The $\hat{z}_i$ are discrete points, while $\tilde{z}(t)$, $z(t)$, $\bar{z}(t)$ are trajectories.

Fig. 1:
Left: illustration of the shooting method. Right: illustration of the multiple-shooting method. Blue dots represent the guesses of the state at the split times $t_i$.

• $f_\eta$: the hidden state $z(t)$ follows the ODE $dz/dt = f(z, t)$, where $f$ is parameterized by $\eta$.
• $\theta$: $\theta = [\eta, \hat{z}_0, \dots, \hat{z}_N]$. We concatenate all optimizable parameters into one vector, denoted $\theta$, for ease of notation.
• $\lambda(t)$: the Lagrangian multiplier in the continuous case, used to derive the adjoint state equation.

The task of DCM can be viewed as a parameter estimation problem for a continuous dynamical system, and can be formulated as:

$$\operatorname*{argmin}_{\eta} \int \big( z(\tau) - \tilde{z}(\tau) \big)^2 \, d\tau \quad \text{s.t.} \quad \frac{dz(\tau)}{d\tau} = f_\eta\big(z(\tau), \tau\big) \qquad (1)$$

The goal is to estimate $\eta$ from the observations $\tilde{z}$. In the following sections, we first briefly introduce the multiple-shooting method, which is related to the numerical solution of a continuous dynamical system; next, we introduce the adjoint state method, which efficiently determines the gradient for parameters in continuous dynamical systems; next, we introduce the proposed MSA method, which combines multiple-shooting and the adjoint state method and can be applied with general forward models and gradient-based optimizers; finally, we introduce the DCM model and demonstrate the application of MSA.

The shooting method is commonly used to fit an ODE under noisy observations, which is crucial for parameter estimation in ODEs. In this section, we first introduce the shooting method, then explain its variant, the multiple-shooting method, for long time-series.
Shooting method
The shooting method typically reduces a boundary-value problem to an initial value problem [6]. An example is shown in Fig. 1: to find a correct initial condition (at $t = 0$) that reaches the target (at $t = 1$), the shooting algorithm first takes an initial guess (e.g. $\hat{z}(0)$), then integrates the curve to reach the point $(1, z(1))$; the error term $\mathrm{target} - z(1)$ is used to update the initial condition (e.g. $\hat{z}(0)$) so that the end-time value $z(1)$ is closer to the target. This process is repeated until convergence. Besides the initial condition, the shooting method can be applied to update other parameters.
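As a toy illustration (not the paper's implementation), the following PyTorch sketch applies the shooting idea to a scalar ODE $dz/dt = \eta z$ with an assumed, fixed $\eta$: the guess of the initial condition is repeatedly corrected so that the integrated end-time value hits a target. The Euler integrator and all constants are placeholder choices.

```python
import torch

def f(z, t, eta):
    # Right-hand side of the toy ODE dz/dt = eta * z (placeholder forward model)
    return eta * z

def integrate(z0, eta, t0=0.0, t1=1.0, steps=100):
    # Fixed-step Euler integration from t0 to t1 (stand-in for a proper ODE solver)
    z, dt = z0, (t1 - t0) / steps
    for k in range(steps):
        z = z + dt * f(z, t0 + k * dt, eta)
    return z

eta = torch.tensor(1.5)                       # assumed known dynamics parameter
target = torch.tensor(2.0)                    # desired end-time value z(1)
z0_hat = torch.zeros(1, requires_grad=True)   # initial guess of z(0)

opt = torch.optim.SGD([z0_hat], lr=0.02)
for _ in range(200):
    opt.zero_grad()
    z1 = integrate(z0_hat, eta)     # "shoot": integrate the guess forward to t = 1
    loss = (z1 - target) ** 2       # error between the end-time value and the target
    loss.backward()                 # gradient w.r.t. the initial condition
    opt.step()                      # move the guess so the shot lands closer to the target
```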
Multiple-shooting method

The multiple-shooting method [1] is an extension of the shooting method to long time-series; it splits a long time-series into chunks, and applies the shooting method to each chunk. Integration of a dynamical system over a long time is typically subject to noise and numerical error, while solving short time-series is generally easier and more robust.

As shown in the right subfigure of Fig. 1, a guess of the initial condition at time $t_0$ is denoted as $\hat{z}_0$, and we can use any ODE solver to get the estimated integral curve $\bar{z}(t),\ t \in [t_0, t_1]$. Similarly, we can guess the initial condition at time $t_1$ as $\hat{z}_1$, and get $\bar{z}(t),\ t \in [t_1, t_2]$ by integration as in Eq. 3. Note that each time chunk is shorter than the entire sequence ($|t_{i+1} - t_i| < |t_2 - t_0|,\ i \in \{0, 1\}$), hence easier to solve. The split causes another issue: the guess might not match the estimation at the boundary points (e.g. $\bar{z}(t_1) \neq \hat{z}_1$, $\bar{z}(t_2) \neq \hat{z}_2$). Therefore, we need to consider this mismatch error when updating the parameters; minimizing the mismatch error is typically easier than directly analyzing the entire sequence. The multiple-shooting method can be written as:

$$\operatorname*{argmin}_{\eta, \hat{z}_0, \dots, \hat{z}_N} J = \operatorname*{argmin}_{\eta, \hat{z}_0, \dots, \hat{z}_N} \sum_{i=0}^{N} \int_{t_i}^{t_{i+1}} \big( \bar{z}(\tau) - \tilde{z}(\tau) \big)^2 d\tau \;+\; \alpha \sum_{i=0}^{N} \big( \bar{z}(t_i) - \hat{z}_i \big)^2 \qquad (2)$$

$$\bar{z}(t) = \hat{z}_i + \int_{t_i}^{t} f_\eta\big( \bar{z}(\tau), \tau \big)\, d\tau, \quad t_i < t < t_{i+1}, \quad i \in \{0, 1, \dots, N\} \qquad (3)$$

where $N$ is the total number of chunks discretized at points $\{t_0, \dots, t_N\}$, with corresponding guesses $\{\hat{z}_0, \dots, \hat{z}_N\}$. We use $\bar{z}(t)$ to denote the estimated curve as in Eq. 3; suppose $t$ falls into the chunk $[t_i, t_{i+1}]$, then $\bar{z}(t)$ is determined by solving the ODE from $(\hat{z}_i, t_i)$, where $\hat{z}_i$ is the guess of the initial state at $t_i$. We use $\tilde{z}(t)$ to denote the observation. The first part in Eq. 2 corresponds to the difference between the estimation $\bar{z}(t)$ and the observation $\tilde{z}(t)$, while the second part corresponds to the mismatch between the estimation (orange square, $\bar{z}(t_i)$) and the guess (blue circle, $\hat{z}_i$) at the split time points $t_i$. The second part is weighted by a hyper-parameter $\alpha$. The ODE function $f$ is parameterized by $\eta$. The optimization goal is to find the best $\eta$ that minimizes the loss in Eq. 2; besides the model parameters $\eta$, we also need to optimize the guesses $\hat{z}_i$ for the state at time $t_i$, $i \in \{0, 1, \dots, N\}$. Note that though previous work typically limits $f$ to a linear form, we do not have such a limitation: multiple-shooting is generic for general $f$.
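A minimal sketch of the objective in Eqs. 2-3, assuming a toy scalar forward model and a fixed-step Euler integrator (both placeholders, not the paper's ACA-based implementation): each chunk is integrated from its own guess $\hat{z}_i$, the data term is accumulated along the chunk, and the mismatch penalty compares the end of each chunk with the guess that starts the next one.

```python
import torch

def f(z, t, eta):
    # Placeholder forward model dz/dt = eta * z; MSA only requires f to be specified
    return eta * z

def multiple_shooting_loss(eta, z_hats, z_obs, dt, chunk_len, alpha=1.0):
    # Eq. 2: data term along each chunk plus mismatch between chunk ends and the next guesses
    data_term, mismatch = 0.0, 0.0
    for i in range(len(z_hats) - 1):          # loop over chunks
        z = z_hats[i]                         # start chunk i from its own guess (Eq. 3)
        for k in range(chunk_len):
            idx = i * chunk_len + k
            data_term = data_term + (z - z_obs[idx]) ** 2 * dt
            z = z + dt * f(z, idx * dt, eta)  # one Euler step within the chunk
        mismatch = mismatch + (z - z_hats[i + 1]) ** 2   # estimate at t_{i+1} vs guess
    return data_term + alpha * mismatch

# Toy usage: noisy observations of dz/dt = 0.5 z with z(0) = 1
dt, chunk_len, num_chunks = 0.01, 50, 4
t = torch.arange(num_chunks * chunk_len) * dt
z_obs = torch.exp(0.5 * t) + 0.1 * torch.randn(num_chunks * chunk_len)

eta = torch.tensor(0.0, requires_grad=True)            # model parameter to estimate
z_hats = torch.cat([z_obs[::chunk_len], z_obs[-1:]]).clone().requires_grad_(True)
loss = multiple_shooting_loss(eta, z_hats, z_obs, dt, chunk_len)
loss.backward()   # gradients w.r.t. eta and every guess, ready for any optimizer
```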
Our goal is to minimize the loss function in Eq. 2. Let $\theta = [\eta, \hat{z}_0, \dots, \hat{z}_N]$ represent all learnable parameters. After fitting an ODE, we derive the gradient of the loss $L$ w.r.t. the parameters $\theta$ and the state guesses $\hat{z}_i$ for optimization.

Adjoint state equation

Note that, different from the discrete case, the gradient in the continuous case is slightly more complicated. We refer to the adjoint method [15,23,2]. Consider the following problem:

$$\frac{dz(t)}{dt} = f_\theta\big(z(t), t\big), \quad \text{s.t. } z(0) = x,\ t \in [0, T], \quad \theta = [\eta, \hat{z}_0, \dots, \hat{z}_N] \qquad (4)$$

$$\hat{y} = z(T), \quad J\big(\hat{y}, y\big) = J\Big(z(0) + \int_0^T f_\theta(z, t)\, dt,\ y\Big) \qquad (5)$$

where the initial condition $z(0)$ is specified by the input $x$, and the output is $\hat{y} = z(T)$. The loss function $J$ is applied to $\hat{y}$, with target $y$. Compared with Eq. 1 to Eq. 3, for simplicity, we use $\theta$ to denote both the model parameters $\eta$ and the guesses of the initial conditions $\{\hat{z}_i\}$. The Lagrangian is

$$L = J\big(z(T), y\big) + \int_0^T \lambda(t)^\top \Big[ \frac{dz(t)}{dt} - f_\theta\big(z(t), t\big) \Big] dt \qquad (6)$$

where $\lambda(t)$ is the continuous Lagrangian multiplier. Then we have the following:

$$\frac{\partial J}{\partial z(T)} + \lambda(T) = 0 \qquad (7)$$

$$\frac{d\lambda(t)}{dt} + \Big( \frac{\partial f_\theta(z(t), t)}{\partial z(t)} \Big)^\top \lambda(t) = 0 \quad \forall t \in (0, T) \qquad (8)$$

$$\frac{dL}{d\theta} = \int_0^T \lambda(t)^\top \frac{\partial f_\theta(z(t), t)}{\partial \theta}\, dt \qquad (9)$$

We skip the proof for simplicity. In general, the adjoint method determines the initial condition $\lambda(T)$ by Eq. 7, then solves Eq. 8 to get the trajectory of $\lambda(t)$, and finally integrates $\lambda(t)$ as in Eq. 9 to get the final gradient. Note that Eq. 7 to Eq. 9 are generic for a general $\theta$; in the case of Eq. 2 and Eq. 3, we have $\theta = [\eta, \hat{z}_0, \dots, \hat{z}_N]$ and $\nabla_\theta = [\frac{\partial L}{\partial \eta}, \frac{\partial L}{\partial \hat{z}_0}, \dots, \frac{\partial L}{\partial \hat{z}_N}]$. Note that we need to calculate $\frac{\partial f}{\partial z}$ and $\frac{\partial f}{\partial \theta}$, which can be easily computed by a single backward pass; we only need to specify the forward model without worrying about the backward pass, because automatic differentiation is supported in frameworks such as PyTorch and TensorFlow. After deriving the gradients of all parameters, we can update the parameters by general gradient descent methods.

Note that though $J(z(T), y)$ is defined on a single point in Eq. 6, it can also be defined in the integral form of Eq. 2, or as a sum of a single-point loss and an integral form. The key observation is that for any loss in the integral form, e.g. $\int_{t=0}^{T} loss(t)\, dt$, we can define an auxiliary variable $F$ such that $\frac{dF(t)}{dt} = loss(t),\ F(0) = 0$; then $F(T)$ is just the value of the integral. In this way, we can transform the integral form $\int_0^T loss(t)\, dt$ into the single-point form $F(T)$.
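To make the last observation concrete, here is a small sketch (with a made-up scalar system and observation function) of the augmentation trick: the state is extended with $F$, $dF/dt = loss(t)$, $F(0) = 0$, so that after integration $F(T)$ equals the integral-form loss and an adjoint that only handles an end-time loss still applies.

```python
import torch

def f(z, t):
    # Original dynamics (placeholder): dz/dt = -z
    return -z

def z_obs(t):
    # Hypothetical continuous-time observation the loss compares against
    return torch.exp(-t)

def augmented_f(state, t):
    # Augmented state [z, F]; dF/dt = (z - z_obs(t))^2 is the running loss
    z, F = state
    return torch.stack([f(z, t), (z - z_obs(t)) ** 2])

# Euler integration of the augmented system; F(T) equals the integral-form loss
state = torch.tensor([1.2, 0.0])      # z(0) = 1.2 (slightly off), F(0) = 0
dt, T = 0.01, 1.0
for k in range(int(T / dt)):
    t = torch.tensor(k * dt)
    state = state + dt * augmented_f(state, t)
print("integral-form loss F(T):", state[1].item())
```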
Algorithm 1:
Multiple-shooting adjoint method
Input:
Observation $\tilde{z}(t)$, number of chunks $N$, learning rate $lr$.
Initialize the model parameters $\eta$ and the states $\{\hat{z}_i\}_{i=0}^{N}$ at the discretized points $\{t_i\}_{i=0}^{N}$.
Repeat until convergence:
(1) Estimate the trajectory $\bar{z}(t)$ from the current parameters by the multiple-shooting method as in Eq. 3.
(2) Compute the loss $J$ in Eq. 2 and plug $J$ into Eq. 6. Derive the gradient by the adjoint method as in Eq. 7 to Eq. 9.
(3) Update the parameters: $\theta \leftarrow \theta - lr \times \nabla_\theta$
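A compact, self-contained sketch of Algorithm 1 on a toy scalar ODE (repeating the Eq. 2 objective from the earlier sketch so it runs on its own). The Euler solver stands in for ACA, and backpropagation through the solver stands in for the adjoint gradient of Eqs. 7-9; the optimizer, step sizes and data are illustrative only (the paper uses AdaBelief).

```python
import torch

def f(z, t, eta):
    # User-specified forward model; a scalar linear ODE as a placeholder
    return eta * z

def msa_loss(eta, z_hats, z_obs, dt, chunk_len, alpha=1.0):
    # Multiple-shooting objective of Eq. 2 with a fixed-step Euler solver
    data, mismatch = 0.0, 0.0
    for i in range(len(z_hats) - 1):
        z = z_hats[i]
        for k in range(chunk_len):
            idx = i * chunk_len + k
            data = data + (z - z_obs[idx]) ** 2 * dt
            z = z + dt * f(z, idx * dt, eta)
        mismatch = mismatch + (z - z_hats[i + 1]) ** 2
    return data + alpha * mismatch

# Synthetic noisy observations of dz/dt = 0.5 z
dt, chunk_len, num_chunks = 0.01, 50, 4
t = torch.arange(num_chunks * chunk_len) * dt
z_obs = torch.exp(0.5 * t) + 0.1 * torch.randn(num_chunks * chunk_len)

# Algorithm 1: initialize eta and the guesses at the split points, then iterate
eta = torch.tensor(0.0, requires_grad=True)
z_hats = torch.cat([z_obs[::chunk_len], z_obs[-1:]]).clone().requires_grad_(True)
opt = torch.optim.Adam([eta, z_hats], lr=0.05)          # any gradient-based optimizer
for step in range(300):
    opt.zero_grad()
    loss = msa_loss(eta, z_hats, z_obs, dt, chunk_len)  # steps (1)-(2): trajectory and loss
    loss.backward()                                     # gradient (adjoint/ACA in the paper)
    opt.step()                                          # step (3): update [eta, z_hats]
print("estimated eta:", eta.item())
```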
Adaptive checkpoint adjoint

Eq. 7 to Eq. 9 are the analytical form of the gradient in the continuous case, yet the numerical implementation is crucial for empirical performance. Note that $z(t)$ is solved in forward-time (0 to $T$), while $\lambda(t)$ is solved in reverse-time ($T$ to 0), yet the gradient in Eq. 9 requires both $z(t)$ and $\lambda(t)$ in the integrand. Memorizing a continuous trajectory $z(t)$ requires much memory; to save memory, most existing implementations forget the forward-time trajectory of $z(t)$, and instead only record the end-time states $z(T)$ and $\lambda(T)$ and solve Eq. 4 and Eq. 7 to Eq. 9 in reverse-time on-the-fly.

While the memory cost is low, existing implementations of the adjoint method typically suffer from numerical error: since the forward-time trajectory (denoted as $\overrightarrow{z}(t) = z(t)$) is deleted, and the reverse-time trajectory (denoted as $\overleftarrow{z}(t)$) is reconstructed from the end-time state $z(T)$ by solving Eq. 4 in reverse-time, $\overrightarrow{z}(t)$ and $\overleftarrow{z}(t)$ cannot accurately overlap due to inevitable errors of numerical ODE solvers. The error $\overrightarrow{z}(t) - \overleftarrow{z}(t)$ propagates to the gradient in Eq. 9 through the $\frac{\partial f(z,t)}{\partial z}$ term. Please see [23] for a detailed explanation. To solve this issue, the adaptive checkpoint adjoint (ACA) [23] records $\overrightarrow{z}(t)$ using a memory-efficient method to guarantee numerical accuracy. In this work, we use ACA for its accuracy.

MSA is a combination of the multiple-shooting method and the adjoint method, and is generic for various $f$. Details are summarized in Algo. 1. MSA iterates over the following steps until convergence: (1) estimate the trajectory based on the current parameters, using the multiple-shooting method for integration; (2) compute the loss and derive the gradient using the adjoint method; (3) update the parameters based on the gradient.

Advantages of MSA
Previous work has used the multiple-shooting method for parameter estimation in ODEs [13], yet MSA differs in the following aspects. (A) Suppose the parameters have $k$ dimensions. MSA uses an element-wise update, hence has only $O(k)$ computational cost in each step; the method in [13] requires the inversion of a $k \times k$ matrix, hence might be infeasible for large-scale systems. (B) The implementation of [13] does not tackle the mismatch between the forward-time and reverse-time trajectories, while we use ACA [23] for accurate gradient estimation in step (2) of Algo. 1. (C) From a practical perspective, our implementation is based on PyTorch, which supports automatic differentiation; therefore we only need to specify the forward model $f$ without manually computing the gradients $\frac{\partial f}{\partial z}$ and $\frac{\partial f}{\partial \theta}$. Hence, our method is off-the-shelf for general models, while the method of [13] needs to re-implement $\frac{\partial f}{\partial z}$ and $\frac{\partial f}{\partial \theta}$ for different $f$, and conventional DCM with EM needs to re-derive the entire algorithm when $f$ changes.

We briefly introduce dynamic causal modeling here. Suppose there are $p$ nodes (ROIs) and denote the observed fMRI time-series signal as $s(t)$, which is a $p$-dimensional vector at each time $t$. Denote the hidden neuronal state as $z(t)$; then $z(t)$ and $s(t)$ are $p$-dimensional vectors for each time point $t$. Denote the hemodynamic response function (HRF) [9] as $h(t)$, and denote the external stimulation as $u(t)$, which is an $n$-dimensional vector for each $t$. The forward model is:

$$f\big([z(t), D(t)]\big) = \begin{bmatrix} dz(t)/dt \\ dD(t)/dt \end{bmatrix} = \begin{bmatrix} D(t)z(t) + Cu(t) \\ Bu(t) \end{bmatrix}, \quad D(0) = A \qquad (10)$$

$$s(t) = \big(z(t) + \epsilon(t)\big) * h(t), \quad \tilde{z}(t) = z(t) + \epsilon(t) = \mathrm{Deconv}\big(s(t), h(t)\big) \qquad (11)$$

where $\epsilon(t)$ is the noise at time $t$, which is assumed to follow an independent Gaussian distribution, and $*$ represents the convolution operation. Note that a more general model would be $s(t) = \big(z(t) + \epsilon_1(t)\big) * h(t) + \epsilon_2(t)$, where $\epsilon_1(t)$ is the inherent noise in the neuronal state $z(t)$, and $\epsilon_2(t)$ is the measurement noise. We omit $\epsilon_2(t)$ for simplicity in this project; it is possible to model both noises, and even to model the HRF with learnable parameters, but this would result in a more complicated model and require more data for accurate parameter estimation.

$D(t)$ is a $p \times p$ matrix for each $t$, representing the effective connectome between nodes. $A$ is a matrix of shape $p \times p$, representing the interaction between ROIs. $B$ is a tensor of shape $p \times p \times n$, representing the effect of stimulation on the effective connectome. $C$ is a matrix of shape $p \times n$, representing the effect of stimulation on the neuronal state. An example with $n = 1,\ p = 3$ is shown in Fig. 2. The task is to estimate the parameters $A, B, C$ from the noisy observation $s(t)$. For simplicity, we assume $h(t)$ is fixed and use the empirical result from the Nitime project [18] in our experiments. By deconvolution of $s(t)$ with $h(t)$, we get a noisy observation of $z(t)$, denoted as $\tilde{z}(t)$; $z(t)$ follows the ODE defined in Eq. 10. By plugging $f$ into Eq. 1, and viewing $\eta$ as $[A, B, C]$, this problem turns into a parameter estimation problem for ODEs, which can be efficiently solved by Algo. 1. We emphasize that Algo. 1 is generic, and MSA can in fact be applied to any form of $f$;
the linear form of $f$ in Eq. 10 is simply a special case for a specific fMRI model.
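A rough sketch of how the forward model of Eq. 10 could be written as code for MSA, with toy sizes, made-up parameter values, and a simple Euler integrator (the HRF convolution/deconvolution of Eq. 11 and the actual estimation loop are omitted):

```python
import torch

p, n = 3, 1                       # number of ROIs and of stimulation channels (toy sizes)
A = torch.randn(p, p) * 0.1       # D(0): baseline effective connectome (illustrative values)
B = torch.zeros(p, p, n)          # effect of stimulation on the effective connectome
C = torch.randn(p, n) * 0.1       # effect of stimulation on the neuronal state

def u(t):
    # External stimulation u(t); an alternating block design as a placeholder
    return torch.tensor([1.0 if int(t) % 2 == 0 else 0.0])

def dcm_f(z, D, t):
    # Eq. 10:  dz/dt = D(t) z(t) + C u(t),   dD/dt = B u(t)
    ut = u(t)
    dz = D @ z + C @ ut
    dD = torch.einsum('ijk,k->ij', B, ut)   # contract the stimulation index of B
    return dz, dD

# Forward simulation from z(0) = 0, D(0) = A with Euler steps
z, D, dt = torch.zeros(p), A.clone(), 0.01
for k in range(500):
    dz, dD = dcm_f(z, D, k * dt)
    z, D = z + dt * dz, D + dt * dD
```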
Fig. 2:
Toy example of dynamic causal modeling with 3 nodes (labeled 1 to 3). $u$ is a 1-D stimulation signal, so $n = 1,\ p = 3$. $A, B, C$ are defined as in Eq. 10. For simplicity, though $A$ is a $3 \times 3$ matrix, only three of its elements are non-zero.

We first validate MSA on toy examples of linear dynamical systems, then validate its performance on large-scale systems and non-linear dynamical systems.

Fig. 3:
Results for the toy example of a linear dynamical system in Fig. 2. Left: errors in the estimated values of the three non-zero connections in $A$; all other parameters are set to 0 in the simulation. Right: from top to bottom, results for nodes 1, 2, 3 respectively. For each node, we plot the observation and the estimated curves from the MSA and EM methods. Note that the estimated curve is generated by integrating the ODE under the estimated parameters with only the initial condition known, not by smoothing the noisy observation.

A linear dynamical system with 3 nodes
We first start with a simple linear dynamical system with only 3 nodes. We further simplify the matrix $A$ as in Fig. 2, where only three elements of $A$ are non-zero. We set $B$ as a zero matrix, and $u(t)$ as a 1-dimensional signal. The dynamical system is linear:

$$\begin{bmatrix} dz(t)/dt \\ dD(t)/dt \end{bmatrix} = \begin{bmatrix} D(t)z(t) + Cu(t) \\ 0 \end{bmatrix}, \quad D(0) = A, \quad u(t) = \begin{cases} 1, & \mathrm{floor}(t)\,\%\,2 = 0 \\ 0, & \text{otherwise} \end{cases} \qquad (12)$$

$$\tilde{z}(t) = z(t) + \epsilon(t), \quad \epsilon(t) \sim \mathcal{N}(0, \sigma^2) \qquad (13)$$

$u(t)$ is an alternating block function with a period of 2, taking values 0 or 1. The observed function $\tilde{z}(t)$ suffers from i.i.d. Gaussian noise $\epsilon(t)$ with zero mean and uniform variance $\sigma^2$.

We perform 10 independent simulations and parameter estimations. For the estimation of DCM with the EM algorithm, we use the SPM package [14], which is a widely used standard baseline. The estimation in MSA is implemented in PyTorch, using ACA [23] as the ODE solver. For MSA, we use the AdaBelief optimizer [25] to update the parameters with the gradient; though other optimizers such as SGD can be used, we found AdaBelief converges faster in practice.

For each of the non-zero elements in $A$, we show the boxplot of the estimation error in Fig. 3. Compared with EM, the error of MSA is significantly closer to 0 and has a smaller variance. An example of a noisy observation and the estimated curves is shown in Fig. 3, and the estimation by MSA is visually closer to the ground truth compared to the EM algorithm. We emphasize that the estimated curve is not a simple smoothing of the noisy observation; instead, after estimating the parameters of the ODE,
the estimated curve (for $t > 0$) is generated by solving the ODE using only the initial state. Therefore, the match between the estimated curve and the observation demonstrates that our method learns the underlying dynamics of the system.
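For illustration, toy data of the form of Eqs. 12-13 could be simulated roughly as follows (the non-zero entries of $A$, the stimulation weights $C$, the noise level, and the step size are all made-up values, not the settings used in the paper):

```python
import torch

p = 3
A = torch.zeros(p, p)
A[1, 0], A[2, 0], A[2, 1] = 0.4, -0.3, 0.5   # three non-zero connections (illustrative)
C = 0.5 * torch.ones(p, 1)                   # stimulation weights (illustrative)

def u(t):
    # Alternating block stimulation with period 2, taking values 0 or 1 (Eq. 12)
    return torch.tensor([1.0 if int(t) % 2 == 0 else 0.0])

# Simulate Eq. 12 with Euler steps (B = 0, so D(t) stays equal to A), then add noise (Eq. 13)
dt, T, noise_std = 0.05, 20.0, 0.5
z, traj = torch.zeros(p), []
for k in range(int(T / dt)):
    z = z + dt * (A @ z + C @ u(k * dt))
    traj.append(z.clone())
z_clean = torch.stack(traj)                                  # shape (num_steps, p)
z_noisy = z_clean + noise_std * torch.randn_like(z_clean)    # observations fed to MSA / EM
```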
Application to large-scale systems
After validation on a small system with only 3 nodes, we validate MSA on large-scale systems with more nodes. We use the same linear dynamical system as in Eq. 12, but with the node number $p$ ranging from 10 to 100. Note that the dimensions of $A$ and $B$ grow at a rate of $O(p^2)$, and the EM algorithm estimates a covariance matrix of size $O(p^4)$, hence the memory for the EM method grows extremely fast with $p$. For the various settings, the ground-truth parameters are randomly generated from a uniform distribution between -1 and 1, and the variance of the measurement noise is set as $\sigma^2 = 0.5$.
For each setting, we perform 5 independent runs, and report the mean squared error (MSE) between the estimated parameters and the ground truth.

As shown in Table 1, for small systems (number of nodes $\leq 20$), MSA consistently generates a lower MSE than the EM algorithm. For large-scale systems, since the memory cost of the EM algorithm is $O(p^4)$, the algorithm quickly runs out of memory. On the other hand, the memory cost of MSA is $O(p^2)$ because it only uses the first-order gradient. Hence, MSA is suitable for large-scale systems such as whole-brain fMRI analysis.

Application to general non-linear systems
Since neither the multiple-shooting method nor the adjoint state method requires the ODE $f$ to be linear, MSA can be applied to general non-linear systems. Furthermore, since our implementation is in PyTorch, which supports automatic differentiation, we only need to specify $f$ when fitting different models, and the gradient is calculated automatically. Therefore, MSA is an off-the-shelf method and is suitable for general non-linear ODEs both in theory and in implementation.

Table 1:
Mean squared error (lower is better) in the estimation of parameters for a linear dynamical system with different numbers of nodes. "OOM" represents "out of memory".

            10 Nodes     20 Nodes     50 Nodes     100 Nodes
EM          3.… ± …      … ± …        OOM          OOM
MSA         … ± …        … ± …        … ± …        … ± …

We validate MSA on the Lotka-Volterra (L-V) equations [22], a system of non-linear ODEs describing the dynamics of predator and prey populations. The L-V equations can be written as:

$$f\big([z_1(t), z_2(t)]\big) = \begin{bmatrix} dz_1(t)/dt \\ dz_2(t)/dt \end{bmatrix} = \begin{bmatrix} \zeta z_1(t) - \beta z_1(t) z_2(t) \\ \delta z_1(t) z_2(t) - \gamma z_2(t) \end{bmatrix}, \quad \begin{bmatrix} \tilde{z}_1(t) \\ \tilde{z}_2(t) \end{bmatrix} = \begin{bmatrix} z_1(t) + \epsilon_1(t) \\ z_2(t) + \epsilon_2(t) \end{bmatrix} \qquad (14)$$

where $\zeta, \beta, \delta, \gamma$ are the parameters to estimate, $\tilde{z}(t)$ is the noisy observation, and $\epsilon(t)$ is the independent noise. Note that there are non-linear terms $z_1(t)z_2(t)$ in the ODE, making the EM derivation difficult. Furthermore, the EM method needs to explicitly derive the posterior mean, hence needs to be re-derived for every different $f$; MSA is generic and hence does not require re-derivation. Besides the L-V model, we also consider a modified L-V model, defined as:

$$dz_1(t)/dt = \zeta z_1(t) - \beta\,\phi\big(z_1(t)\big)\, z_1(t) z_2(t) \qquad (15)$$

$$dz_2(t)/dt = \delta\,\phi\big(z_1(t)\big)\, z_1(t) z_2(t) - \gamma z_2(t) \qquad (16)$$

where $\phi(x) = 1/(1 + e^{-x})$ is the sigmoid function. We use this example to demonstrate the ability of MSA to fit highly non-linear ODEs.

We compare MSA with LMFIT [12], which is a well-known Python package for non-linear fitting. We use the L-BFGS solver in LMFIT, which generates better results than the other solvers. We did not compare with the original DCM with EM because it is unsuitable for general non-linear models. The estimated curves for $t > 0$ are shown in Fig. 4 (L-V model) and Fig. 5 (modified L-V model).

We apply MSA to whole-brain fMRI analysis with dynamic causal modeling. fMRI for 82 children with ASD and 48 age- and IQ-matched healthy controls were acquired. A biological motion perception task and a scrambled motion task [7] were presented in alternating blocks. The fMRI (BOLD, 132 volumes, TR = 2000 ms, TE = 25 ms, flip angle = 60°, voxel size 3.44 × … × … mm) was acquired on a Siemens MAGNETOM Trio 3T scanner.

Fig. 4:
Results for the L-V model.
Fig. 5:
Results for the modified L-V model.
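To make the "only specify $f$" point concrete for the non-linear examples above, below is a minimal PyTorch sketch of the L-V right-hand side of Eq. 14 with a short forward simulation; the parameter values, initial state and step size are arbitrary illustrations. Fitting the modified model of Eqs. 15-16 with MSA would only require editing this one function, with the rest of the pipeline unchanged.

```python
import torch

def lv_f(z, t, params):
    # Lotka-Volterra dynamics of Eq. 14; z = [prey, predator]
    zeta, beta, delta, gamma = params
    z1, z2 = z
    dz1 = zeta * z1 - beta * z1 * z2
    dz2 = delta * z1 * z2 - gamma * z2
    return torch.stack([dz1, dz2])

params = torch.tensor([0.6, 0.3, 0.2, 0.4], requires_grad=True)  # made-up initial values for (zeta, beta, delta, gamma)
z = torch.tensor([2.0, 1.0])                                     # initial prey and predator populations
dt = 0.01
for k in range(2000):               # forward simulation with Euler steps
    z = z + dt * lv_f(z, k * dt, params)
```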
Fig. 6:
An example of MSA for one subject. Left: effective connectome during task 1. Middle: effective connectome during task 2. Right: the top and bottom graphs represent the effective connectome for task 1 and task 2, respectively. Blue and red edges represent positive and negative connections, respectively. Only the top 5% strongest connections are visualized.
Estimation of EC
We use the AAL atlas [20] containing 116 ROIs. For each subject, the parameters of the dynamic causal model in Eq. 10 are estimated using MSA. An example snapshot of the effective connectome (EC) during the two tasks is shown in Fig. 6, showing that MSA captures the dynamic EC during different tasks.
Classification task
We conduct classification experiments for ASD vs. control using EC and FC as input, respectively. The EC estimated by MSA at each time point provides a data sample, and the classification of a subject is based on the majority vote of the predictions across all of its time points. The FC is computed using Pearson correlation. We experimented with a random forest model and InvNet [24]. Results for a 10-fold subject-wise cross-validation are shown in Fig. 7. For both models, using EC as input generates better accuracy, F1 score and AUC score (the threshold range is [0, 1]). This indicates that estimating the underlying dynamics of fMRI helps the identification of ASD.
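A rough sketch of this evaluation protocol on placeholder data (the array shapes, the random-forest settings, and the single train/test split below are assumptions; the paper uses 10-fold subject-wise cross-validation and additionally an InvNet classifier):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: one flattened EC matrix per time point per subject
n_subjects, n_timepoints, n_rois = 20, 132, 116
rng = np.random.default_rng(0)
X = rng.normal(size=(n_subjects, n_timepoints, n_rois * n_rois)).astype(np.float32)
y = rng.integers(0, 2, size=n_subjects)          # subject labels (ASD vs. control)

train, test = np.arange(15), np.arange(15, 20)   # one illustrative subject-wise split

# Every time point is a training sample and inherits its subject's label
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[train].reshape(-1, n_rois * n_rois), np.repeat(y[train], n_timepoints))

# Subject-level prediction: majority vote over the predictions of its time points
pred_t = clf.predict(X[test].reshape(-1, n_rois * n_rois)).reshape(len(test), n_timepoints)
pred_subject = (pred_t.mean(axis=1) > 0.5).astype(int)
print("subject-level accuracy:", (pred_subject == y[test]).mean())
```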
Fig. 7:
Classification results for ASD vs. control.
Conclusion

We propose the multiple-shooting adjoint (MSA) method for parameter estimation in ODEs, enabling whole-brain dynamic causal modeling. MSA has the following advantages: robustness to noisy observations, the ability to handle large-scale systems, and a general off-the-shelf framework for non-linear ODEs. We validate MSA in extensive toy examples and apply MSA to whole-brain fMRI analysis with DCM. To our knowledge, our work is the first to successfully apply whole-brain dynamic causal modeling in a classification task based on fMRI. Finally, MSA is generic and can be applied to other problems such as EEG and the modeling of biological processes.
References
1. Bock, H.G., Plitt, K.J.: A multiple shooting algorithm for direct solution of optimal control problems. IFAC Proceedings Volumes (1984)
2. Chen, R.T., Rubanova, Y., Bettencourt, J., Duvenaud, D.K.: Neural ordinary differential equations. Advances in Neural Information Processing Systems (2018)
3. Di Martino, A., O'Connor, D., Chen, B., Alaerts, K., Anderson, J.S., et al.: Enhancing studies of the connectome in autism using the autism brain imaging data exchange II. Scientific Data (2017)
4. Frässle, S., Harrison, S.J., Heinzle, J., Clementz, B.A., Tamminga, C.A., et al.: Regression dynamic causal modeling for resting-state fMRI. bioRxiv (2020)
5. Friston, K.J., Harrison, L.: Dynamic causal modelling. NeuroImage (2003)
6. Hildebrand, F.B.: Introduction to Numerical Analysis (1987)
7. Kaiser, M.D., Hudac, C.M., Shultz, S., Lee, S.M., Cheung, C., et al.: Neural signatures of autism. PNAS (2010)
8. Kiebel, S.J., Garrido, M.I., Moran, R.J., Friston, K.J.: Dynamic causal modelling for EEG and MEG. Cognitive Neurodynamics (2008)
9. Lindquist, M.A., Loh, J.M., Atlas, L.Y., Wager, T.D.: Modeling the hemodynamic response function in fMRI: efficiency, bias and mis-modeling. NeuroImage (2009)
10. Moon, T.K.: The expectation-maximization algorithm. IEEE Signal Processing Magazine (1996)
11. Nation, K., Clarke, P., Wright, B., Williams, C.: Patterns of reading ability in children with autism spectrum disorder. J Autism Dev Disord (2006)
12. Newville, M., Stensitzki, T., Allen, D.B., Rawlik, M., Ingargiola, A., Nelson, A.: LMFIT: Non-linear least-square minimization and curve-fitting for Python (2016)
13. Peifer, M., Timmer, J.: Parameter estimation in ordinary differential equations for biochemical processes using the method of multiple shooting (2007)
14. Penny, W.D., Friston, K.J., Ashburner, J.T., Kiebel, S.J., Nichols, T.E.: Statistical Parametric Mapping: The Analysis of Functional Brain Images (2011)
15. Pontryagin, L.S.: Mathematical Theory of Optimal Processes (2018)
16. Prando, G., Zorzi, M., Bertoldo, A., Corbetta, M., Chiuso, A.: Sparse DCM for whole-brain effective connectivity from resting-state fMRI data. NeuroImage (2020)
17. Razi, A., Seghier, M.L., Zhou, Y., McColgan, P., Zeidman, P., Park, H.J., et al.: Large-scale DCMs for resting-state fMRI. Network Neuroscience (2017)
18. Rokem, A., Trumpis, M., Perez, F.: Nitime: time-series analysis for neuroimaging data. In: Proceedings of the 8th Python in Science Conference (2009)
19. Seghier, M.L., Zeidman, P., Leff, A.P., Price, C.: Identifying abnormal connectivity in patients using dynamic causal modelling of fMRI responses. Frontiers in Neuroscience (2010)
20. Tzourio-Mazoyer, N., Landeau, B., Papathanassiou, D., Crivello, F., Etard, O., et al.: Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage (2002)
21. Van Den Heuvel, M.P., Pol, H.E.H.: Exploring the brain network: a review on resting-state fMRI functional connectivity. European Neuropsychopharmacology (2010)
22. Volterra, V.: Variations and fluctuations of the number of individuals in animal species living together. ICES Journal of Marine Science
23. Zhuang, J., et al.: Adaptive checkpoint adjoint method for gradient estimation in neural ODE. In: International Conference on Machine Learning (2020)
24. Zhuang, J., et al.: Invertible network for classification and biomarker selection for ASD. In: Medical Image Computing and Computer Assisted Intervention (2019)
25. Zhuang, J., et al.: AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in Neural Information Processing Systems (2020)