An Ensemble Solver for Segregated Cardiovascular FSI

A Preprint
Xue Li
Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN 46556, [email protected]
Daniele E. Schiavazzi
Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN 46556, [email protected]
February 2, 2021

Abstract
Computational models are increasingly used for diagnosis and treatment of cardiovascular disease. To provide a quantitative hemodynamic understanding that can be effectively used in the clinic, it is crucial to quantify the variability in the outputs from these models due to multiple sources of uncertainty. To quantify this variability, the analyst invariably needs to generate a large collection of high-fidelity model solutions, typically requiring a substantial computational effort. In this paper, we show how an explicit-in-time ensemble cardiovascular solver offers superior performance with respect to the embarrassingly parallel solution with implicit-in-time algorithms, typical of an inner-outer loop paradigm for non-intrusive uncertainty propagation. We discuss in detail the numerics and efficient distributed implementation of a segregated FSI cardiovascular solver on both CPU and GPU systems, and demonstrate its applicability to idealized and patient-specific cardiovascular models, analyzed under steady and pulsatile flow conditions.

Keywords: Ensemble solvers · Uncertainty Quantification · Computational Hemodynamics · Explicit Time Integration · Biomechanics
In this study we focus on the development of efficient solvers for complex fluid-structure interaction (FSI) phenomena arising in cardiovascular (CV) hemodynamics. For this and many other applications, output variability is induced by uncertainty or ignorance in the input processes, e.g., material property distribution, physiologically sound boundary conditions or model anatomy, resulting from operator-dependent image volume segmentation. In this context, the new paradigm of Uncertainty Quantification (UQ) is rapidly becoming an integral part of the modeling exercise, and an indispensable tool to rigorously quantify confidence in the simulation outputs, enabling robust predictions of greater clinical impact. However, running a complete UQ study on a large scale cardiovascular model is typically associated with a substantial computational cost. Non-intrusive approaches for the solution of the forward problem in uncertainty quantification (also known as uncertainty propagation) typically consider the underlying deterministic solver as a black box (see Figure 1), requiring model solutions at various parameter realizations to be independently (and possibly simultaneously) computed. The scalability of this paradigm is limited due to two main reasons. First, in the embarrassingly parallel solution of multiple instances of the same problem, a large number of operations is repeated.
Figure 1:
Schematic representation of the forward and inverse problems in UQ.

Second, the solution of large linear systems of equations from numerical integration in time with implicit schemes presents, in general, a less than ideal scalability when computed on large multi-core architectures.

To tackle these challenges, we propose an efficient computational approach for solving multiple instances of the same model (a so-called model ensemble) at the same time, in a highly scalable fashion, enabled by CPU/GPU implementations of numerical integration schemes that rely heavily on distributed sparse matrix-vector products. We validate our approach by characterizing the effect of variability in vessel wall thickness and elastic modulus on the mechanical response of ideal and patient-specific cardiovascular models analyzed under steady and pulsatile flow conditions. A stochastic model for the spatial distribution of thickness and elastic modulus is provided, in this study, by approximating a Gaussian random field with Matérn covariance through the solution of a stochastic partial differential equation on a triangular finite element mesh of the vessel lumen [1]. Additionally, we follow a one-way coupled, segregated approach to fluid-structure interaction, where the wall stress is computed by an implicit Variational Multiscale fluid solver and passed to a structural, three-d.o.f. shell model of the vascular walls [2]. Unique contributions of our approach are:

1. We propose, for the first time, an ensemble solver in the context of cardiovascular hemodynamics for UQ, with the aim of drastically reducing the computational effort required to perform campaigns of high-fidelity model solutions.
2. Our approach paves the way to a fully explicit treatment of fluid-structure interaction in cardiovascular modeling, whose potential has not yet been fully explored in the literature. In our opinion, this new paradigm will provide simpler approaches to simulating complex physiological dysfunctions (e.g., aortic dissection, involving large vessel deformations and contact/auto-contact phenomena, see [3]).
3. The combination of explicit time integration schemes with the solution of model ensembles leads to an efficient distribution of computing load and memory usage on the GPU, enabling scalability in a way that is currently not possible with embarrassingly parallel model solutions and implicit solvers.

We motivate the computational advantage of explicit-in-time ensemble solvers through a back-of-the-envelope argument. It is well known that explicit numerical integration schemes are only conditionally stable. For structural problems, the largest stable time step is determined through a CFL condition proportional to ∆t, where ∆t is the amount of time required by an elastic wave to cross the smallest element in the mesh. A reasonable value of ∆t for CV modeling is ∆t = l_{e,min}/c, where l_{e,min} is the diameter of the circle/sphere inscribed in the smallest element in the mesh, c = √(E/ρ) is the elastic wave speed, and E and ρ are the elastic modulus and density of the vascular tissue (assumed homogeneous and isotropic in the current argument). Typical values of l_{e,min}, E and ρ for vascular tissue lead to a stable time step on the order of 10⁻⁶ s. In contrast, the typical time step adopted by implicit CV FSI solvers is one millisecond (an upper bound, typically much smaller).
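As an illustration of this estimate, the short calculation below evaluates the CFL-limited step and the resulting cost ratio for one plausible set of material and mesh values; the numbers are assumptions chosen purely for illustration, not the settings used in the remainder of the paper, and the resulting ratio changes substantially with them.

```python
import math

# Illustrative values (assumptions for this sketch, not the paper's settings)
E = 0.7e6         # elastic modulus of the vessel wall [Pa]
rho = 1060.0      # tissue density [kg/m^3]
l_e_min = 1.0e-4  # inscribed diameter of the smallest mesh element [m]

c = math.sqrt(E / rho)     # elastic wave speed [m/s]
dt_explicit = l_e_min / c  # CFL-limited explicit time step [s]

# Implicit reference: dt = 1 ms with roughly 10 Newton iterations, 10 linear
# iterations each and 2 matrix-vector products per linear iteration.
dt_implicit = 1.0e-3
matvecs_per_implicit_step = 10 * 10 * 2

explicit_steps_per_implicit_step = dt_implicit / dt_explicit
relative_cost = explicit_steps_per_implicit_step / matvecs_per_implicit_step
print(f"dt_explicit = {dt_explicit:.2e} s, relative cost = {relative_cost:.1f}")
```

The point of the argument is only that the per-step work of an explicit update reduces to a single (ensemble) matrix-vector product, so the cost gap with implicit schemes is far smaller than the raw time-step ratio suggests.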
At every time step, implicit solvers require multiple linear systems to be solved with iterative methods, each consisting of several matrix-vector products per iteration. Thus, if we assume an average of 10 non-linear iterations consisting of 10 linear solver iterations each (with two matrix-vector multiplications per iteration), an explicit solver is only a factor of five more expensive. However, the cost of explicit methods can be further reduced through several means, for example, increasing the critical time step via mass scaling (see, e.g., [4]), selectively updating the stiffness, damping and mass matrices between successive time steps to avoid the repeated assembly of element matrices (in the linear regime), and by solving multiple realizations of the boundary conditions, material properties and geometry at the same time. Additionally, the large matrix-vector operations needed by the explicit solution of model ensembles are particularly well suited for GPU computing. In summary, explicit time integration schemes have many advantages over implicit approaches in the simulation of phenomena occurring over small time intervals and, combined with the solution of model ensembles, bear substantial potential to boost the efficiency of solving CV models on modern GPU-based systems. Even though our work focuses on CV hemodynamics, the proposed solver paradigm is applicable to fluid-structure interaction phenomena in other fields, but its efficiency is affected by the stiffness and mass properties of the selected application.

Use of explicit structural solvers in cardiovascular flow problems is mainly related to their flexibility in modeling complex contact configurations. Studies involving coronary stent deployment following endoscopic balloon inflation
are proposed, e.g., in [5, 6], and coupling with an implicit fluid solver is discussed in [7]. Implementations of explicit structural solvers on GPUs are discussed in various studies in the literature. In [8] the authors describe in detail an application involving thin shells, while an overview of applications in biomechanics is given in [9]. Additionally, ensemble methods for fluid problems have been recently proposed in [10, 11, 12, 13, 14] in the context of the Navier-Stokes equations with distinct initial conditions and forcing terms. This is based on the observation that the solution of linear systems is responsible for a significant fraction of the overall running time for linearly implicit methods, and that it is far more efficient to solve a system of equations with the same coefficient matrix and different right-hand sides multiple times than to solve entirely different systems. Extensions have also been proposed to magnetohydrodynamics [15], natural convection problems [16] and parametrized flow problems [17, 18]. We note how, in our case, there is no approximation introduced in the formulation of the ensemble numerical scheme. In addition, no ensemble method appears to be available in the literature in the context of fluid-structure interaction problems.

The basic methodology behind the proposed solver is discussed in Section 2, including the generation of random field material properties, the structural finite element formulation, the variational multiscale fluid solver, their weak coupling and the CPU/GPU implementation. Validation of the proposed approach is discussed in Section 3 with reference to an idealized model of the thoracic aorta and a patient-specific coronary model. Performance and scalability of the approach are discussed in Section 3.3, followed by conclusions and future work in Section 4.
A homogeneous and isotropic vascular tissue with uncertain elastic modulus and thickness is assumed in this study, modeled through a Gaussian random field with Matérn covariance. A Gaussian marginal distribution appears to be the simplest idealized distribution compatible with the scarce experimental observations, while the choice of a Matérn covariance relates to its finite differentiability, which makes this model more desirable than other kernels [19]. Let ‖·‖ denote the Euclidean distance in R^d. The Matérn covariance between two points at distance ‖h‖ is

r(‖h‖) = σ² / (2^{ν−1} Γ(ν)) (κ‖h‖)^ν K_ν(κ‖h‖),   h ∈ R^d,   (1)

where Γ(·) is the gamma function, K_ν is the modified Bessel function of the second kind, σ² is the marginal variance, ν is a scaling parameter which determines the mean square differentiability of the underlying process, and κ is related to the correlation length ρ = √(8ν)/κ, i.e., the distance which corresponds to a correlation of approximately 0.1, for all ν. It is known from the literature [20, 21] how Gaussian random fields with Matérn covariance can be obtained as solutions of a linear fractional stochastic partial differential equation (SPDE) of the form

(κ² − ∆)^{α/2} x(s) = W(s),   s ∈ R^d,   (2)

where α = ν + d/2, κ > 0, ν > 0, W(s) is a white noise spatial process, ∆ = Σ_i ∂²/∂s_i² is the Laplacian, and the marginal variance is σ² = Γ(ν) / (Γ(ν + d/2) (4π)^{d/2} κ^{2ν}).

Since the lumen wall is modelled with a surface of triangular elements, we are interested in generating realizations from discretely indexed Gaussian random fields. This is achieved through an approximate stochastic weak solution of the SPDE (2), as discussed in [1]. We construct a discrete approximation of the solution x(s) using a linear combination of basis functions {ψ_k}, k = 1, …, n, and appropriate weights {w_k}, k = 1, …, n, i.e., x(s) = Σ_{k=1}^n ψ_k(s) w_k. We then introduce an appropriate Sobolev space with inner product ⟨·,·⟩, a family of test functions {ϕ_k}, k = 1, …, n, and derive a Galerkin functional for (2) of the form

⟨ϕ_i, (κ² − ∆)^{α/2} ψ_j⟩ w_j = ⟨ϕ_i, W⟩.   (3)

We then choose ϕ_k = (κ² − ∆)^{1/2} ψ_k for α = 1 and ϕ_k = ψ_k for α = 2, leading to precision matrices Q_α expressed as

Q_α = κ² C + G,   for α = 1,
Q_α = (κ² C + G)^T C^{−1} (κ² C + G),   for α = 2,
Q_α = (κ² C + G)^T C^{−1} Q_{α−2} C^{−1} (κ² C + G),   for α > 2,   (4)

where, for α ≥ 3, a recursive Galerkin formulation is used, letting α = 2 on the left-hand side of equation (2) and replacing the right-hand side with a field generated by α − 2, assigning ϕ_k = ψ_k, k = 1, …, n. Note how the use of
piecewise linear basis functions {ψ_k}, k = 1, …, n, leads to matrices

G_ij = ⟨∇ψ_i, ∇ψ_j⟩   and   C_ij = ⟨ψ_i, ψ_j⟩,   (5)

that are sparse, and often found in the finite element discretization of second order elliptic PDEs. However, the precision matrices Q_α are, in general, not sparse as they contain the inverse C^{−1}. Thus, by replacing the matrix C with the lumped diagonal matrix C̃, sparsity is restored and Q_α can be efficiently manipulated and decomposed through fast routines for sparse linear algebra available on a wide range of architectures. In addition, the introduction of C̃ leads to non-zero terms on each row only for the immediate neighbors I(s_k) of a given node k on the triangular surface mesh, since the basis function ψ_k is supported only on the elements connected to node k. This reduces the Gaussian random field to a Gaussian Markov random field, for which

ρ( x(s_i), x(s_k) | x(s_j), s_j ∈ I(s_i) ) = 0,   s_k ∉ I(s_i),   (6)

or, in other words, given the value of x on its neighbors, the random field at any node is statistically independent from any other location. The approximation error introduced in (6) is, however, small and often negligible in applications [22]. The interested reader is referred to [1] for additional detail on the derivation of Q_α. Numerically generated realizations for various correlation lengths are shown in Figure 2 on an ideal cylindrical representation of the descending thoracic aorta, while Figure 3 shows the agreement of the generated field with the Matérn model.

Figure 2:
Random field generated on a cylindrical mesh for various correlation lengths: (a) ρ = 0.95 cm, (b) ρ = 3.7 cm, (c) ρ = 7.2 cm.
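For the simplest case α = 1, where Q_1 = κ²C̃ + G, realizations like those in Figure 2 can be drawn by factorizing the sparse precision matrix and solving one triangular system per sample. The sketch below uses the scikit-sparse Cholesky factorization mentioned later in the paper, but the function name, argument layout and the explicit handling of the fill-reducing permutation are our own illustrative choices rather than the CVFES implementation.

```python
import numpy as np
from scipy.sparse.linalg import spsolve_triangular
from sksparse.cholmod import cholesky

def sample_matern_field(C_lumped, G, kappa, mu, n_samples, seed=0):
    """Draw realizations of a GMRF with precision Q_1 = kappa^2 * C_lumped + G.

    C_lumped : (n, n) lumped (diagonal) mass matrix of the surface mesh, sparse CSC
    G        : (n, n) matrix with entries <grad psi_i, grad psi_j>, sparse CSC
    mu       : (n,) mean of the field at the mesh nodes
    """
    rng = np.random.default_rng(seed)
    Q = (kappa**2 * C_lumped + G).tocsc()

    # Sparse Cholesky: P Q P^T = L L^T, with fill-reducing permutation P
    factor = cholesky(Q)
    L, perm = factor.L(), factor.P()

    samples = np.empty((n_samples, Q.shape[0]))
    for i in range(n_samples):
        z = rng.standard_normal(Q.shape[0])
        # y = L^{-T} z has covariance (L L^T)^{-1}; undo the permutation afterwards
        y = spsolve_triangular(L.T.tocsr(), z, lower=False)
        x = np.empty_like(y)
        x[perm] = y
        samples[i] = mu + x
    return samples
```

Higher values of α would build Q_α through the recursion in (4) before the factorization; the sampling step is otherwise unchanged.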
Figure 3:
Comparison between spatial correlations from a numerically generated field x(s) with precision matrix Q_α and the exact Matérn model.

We use a small strain, linear, 3 d.o.f. elastic thin shell which allows for full compatibility between the fluid mesh, discretized with tetrahedral elements, and the solid walls. The in-plane stiffness of the shell is complemented with a transverse shear stiffness which provides stability under transverse loading [2]. Using a superscript l and lower case x, y, z to indicate quantities expressed in the local shell reference frame, we introduce a constitutive relation in Voigt notation expressed as σ^l = C ε^l, with

σ^l = [σ_xx, σ_yy, τ_xy, τ_xz, τ_yz]^T,   ε^l = [∂u_x/∂x, ∂u_y/∂y, ∂u_x/∂y + ∂u_y/∂x, ∂u_z/∂x, ∂u_z/∂y]^T.   (7)

We assume ε^l_zz = 0, i.e., zero deformation through the thickness, disregarding the effect of both the normal pressure acting at the lumen surface and the Poisson effect due to the membrane deformations. Strains ε^l and nodal displacements
u are related through the matrix B of shape function derivatives for a linear triangular element, i.e.,

ε^l = B u,   B = 1/(2A_e) [ y_23   0     0     y_31   0     0     y_12   0     0
                            0      x_32  0     0      x_13  0     0      x_21  0
                            x_32   y_23  0     x_13   y_31  0     x_21   y_12  0
                            0      0     y_23  0      0     y_31  0      0     y_12
                            0      0     x_32  0      0     x_13  0      0     x_21 ],
u = [u_x,1, u_y,1, u_z,1, u_x,2, u_y,2, u_z,2, u_x,3, u_y,3, u_z,3]^T,   (8)

where x_j, y_j, j ∈ {1, 2, 3}, are the local coordinates of the j-th element node, x_ij = x_i − x_j (similarly for y_ij), u_x,j, u_y,j, u_z,j are the local nodal displacements and A_e is the triangular element area. The constitutive matrix is expressed as

C = E/(1 − ν²) [ 1   ν   0            0              0
                 ν   1   0            0              0
                 0   0   0.5(1 − ν)   0              0
                 0   0   0            0.5 k (1 − ν)  0
                 0   0   0            0              0.5 k (1 − ν) ],   (9)

where E and ν are the Young's modulus and Poisson's ratio, respectively, and the shear factor k accounts for a parabolic variation of transverse shear stress through the shell thickness (assumed equal to 5/6 for a shell with rectangular cross section). Finally, the local element stiffness matrix k_e ∈ R^{9×9} can be expressed as

k_e = ∫_{Ω_se} B^T C B dΩ_se = Σ_{i=1}^{n_gp} B^T C B A_e ζ_i w_i,   (10)

where n_gp is the total number of integration points and ζ_i, w_i, i ∈ {1, 2, …, n_gp}, are the element thickness and integration rule weights, respectively. In this study, we adopt a three-point Gauss integration rule to capture a linear variation for E and ζ through each element, generated from a Gaussian Markov random field with Matérn covariance and precision Q_α, i.e., E = E(x, y, ω) and ζ = ζ(x, y, ω). Nodal vectors E ∼ N(E̅, Q_α^{−1}) and ζ ∼ N(ζ̅, Q_α^{−1}) are generated as

E = E̅ + (L^T)^{−1} z   and   ζ = ζ̅ + (L^T)^{−1} z,   (11)

where z ∼ N(0, I_n) is a vector with standard Gaussian components and L is the sparse Cholesky factor of Q_α, i.e., Q_α = L L^T. The precision matrix Q_α is assembled in compressed sparse column (CSC) format and the Cholesky decomposition is computed using the cholmod routine provided by the scikit-sparse Python library. Finally, the product (L^T)^{−1} z is computed using the solve_Lt routine, as the solution of a triangular system.
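To make (8)-(10) concrete, the sketch below assembles the 9×9 stiffness of a single triangular shell element for one realization of the modulus and thickness at its integration points. The function name, argument layout and the default values of ν and k are illustrative assumptions, not the CVFES implementation.

```python
import numpy as np

def cst_shell_stiffness(x, y, E_gp, zeta_gp, nu=0.49, k=5.0/6.0):
    """Local 9x9 stiffness k_e of the 3-node, 3-d.o.f. shell element in (10).

    x, y    : local in-plane coordinates of the three element nodes
    E_gp    : elastic modulus at the integration points (one value per point)
    zeta_gp : wall thickness at the integration points
    nu, k   : Poisson ratio and shear correction factor (illustrative defaults)
    """
    x21, x32, x13 = x[1] - x[0], x[2] - x[1], x[0] - x[2]
    y12, y23, y31 = y[0] - y[1], y[1] - y[2], y[2] - y[0]
    A_e = 0.5 * abs((x[1] - x[0]) * (y[2] - y[0]) - (x[2] - x[0]) * (y[1] - y[0]))

    # B maps nodal displacements to [eps_xx, eps_yy, gam_xy, gam_xz, gam_yz], eq. (8)
    B = np.array([[y23, 0.0, 0.0, y31, 0.0, 0.0, y12, 0.0, 0.0],
                  [0.0, x32, 0.0, 0.0, x13, 0.0, 0.0, x21, 0.0],
                  [x32, y23, 0.0, x13, y31, 0.0, x21, y12, 0.0],
                  [0.0, 0.0, y23, 0.0, 0.0, y31, 0.0, 0.0, y12],
                  [0.0, 0.0, x32, 0.0, 0.0, x13, 0.0, 0.0, x21]]) / (2.0 * A_e)

    def C_mat(E):  # plane-stress constitutive matrix with transverse shear, eq. (9)
        c = E / (1.0 - nu**2)
        return c * np.array([[1.0, nu,  0.0,              0.0,              0.0],
                             [nu,  1.0, 0.0,              0.0,              0.0],
                             [0.0, 0.0, 0.5 * (1.0 - nu), 0.0,              0.0],
                             [0.0, 0.0, 0.0, 0.5 * k * (1.0 - nu),          0.0],
                             [0.0, 0.0, 0.0, 0.0,          0.5 * k * (1.0 - nu)]])

    w = np.full(len(E_gp), 1.0 / len(E_gp))  # equal weights for the 3-point rule
    k_e = np.zeros((9, 9))
    for E_i, zeta_i, w_i in zip(E_gp, zeta_gp, w):
        k_e += B.T @ C_mat(E_i) @ B * A_e * zeta_i * w_i
    return k_e
```

Since B and the geometric factors are shared across realizations, only the per-Gauss-point products of modulus and thickness change from sample to sample, which is what the ensemble GPU kernels described later exploit.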
The evolution of blood flow and pressure in the human cardiovascular system can be modeled using the Navier-Stokes equations. Even though many simplifying assumptions can be made to the equations to reduce the computational complexity, here we focus on high-fidelity models, i.e., models associated with large discretizations of a three-dimensional (n_sd = 3) fluid domain Ω_f ⊆ R^{n_sd}. The boundary Γ_f of Ω_f coincides with the mid-plane of the solid domain Ω_s, and is partitioned into Γ_f = Γ_fg ∪ Γ_fh ∪ Γ_fs, which correspond to the application of Dirichlet and Neumann boundary conditions and to the interaction with the solid, respectively. Consider also the vector fields h: Γ_h × (0, T) → R^{n_sd}, g: Γ_g × (0, T) → R^{n_sd}, f: Ω × (0, T) → R^{n_sd} and v_0: Ω → R^{n_sd}. We would like to solve the problem of finding v(X, t) and p(X, t), ∀X ∈ Ω, ∀t ∈ [0, T], such that

ρ v̇ + ρ v·∇v = −∇p + ∇·τ + f,   (X, t) ∈ Ω × (0, T),
∇·v = 0,   (X, t) ∈ Ω × (0, T),   (12)

subject to the boundary conditions

v = g,   (X, t) ∈ Γ_g × (0, T),
t_n = σ·n = [−p I + τ]·n = h,   (X, t) ∈ Γ_h × (0, T),
t_n = t_f,   (X, t) ∈ Γ_s × (0, T),
v(X, 0) = v_0(X),   X ∈ Ω,   (13)
where τ = µ(∇v + ∇v^T) is the viscous stress tensor resulting from considering blood as a Newtonian fluid. The solution of (12) in weak form requires the definition of four approximation spaces, i.e., two trial spaces for the velocity v and pressure p,

S_h^k = { v | v(·, t) ∈ H¹(Ω), t ∈ [0, T], v|_{x ∈ Ω_e} ∈ P_k(Ω_e), v(·, t) = g on Γ_g },
P_h^k = { p | p(·, t) ∈ L²(Ω), t ∈ [0, T], p|_{x ∈ Ω_e} ∈ P_k(Ω_e) },

and two test spaces for w and q,

W_h^k = { w | w(·, t) ∈ H¹(Ω), t ∈ [0, T], w|_{x ∈ Ω_e} ∈ P_k(Ω_e), w(·, t) = 0 on Γ_g },   Q_h^k = P_h^k,

where H¹(Ω) is the Sobolev space of function triplets in L²(Ω) with derivatives in L²(Ω) and P_k(Ω) is the space of polynomials of order k in Ω. The above spaces are separated into large and small scale contributions, S_h^k = S̄_h^k ⊕ S̃_h^k, W_h^k = W̄_h^k ⊕ W̃_h^k and P_h^k = P̄_h^k ⊕ P̃_h^k, with a corresponding decomposition of velocity and pressure as v = v̄ + ṽ and p = p̄ + p̃, respectively. This decomposition is introduced in a weak form of the Navier-Stokes equations (12) and a closure is obtained by expressing the small scale variables ṽ and p̃ in terms of their large scale counterparts v̄ and p̄ using [23]

[ṽ, p̃]^T = [τ_M R_M(v̄, p̄), τ_C R_C(v̄, p̄)]^T,   (14)

where R_M and R_C represent the momentum and continuity residuals,

R_M(v, p) = ρ v̇ + ρ v·∇v + ∇p − ∇·τ − f,
R_C(v, p) = ∇·v,   (15)

and the stabilization coefficients are expressed as

τ_M = ( 4/∆t² + v·G v + C_I ν² G:G )^{−1/2},   τ_C = (τ_M g·g)^{−1},
G_ij = Σ_{k=1}^{3} ∂ξ_k/∂x_i ∂ξ_k/∂x_j,   g_i = Σ_{k=1}^{3} ∂ξ_k/∂x_i,   (16)

where ∂ξ/∂x is the inverse Jacobian of the element mapping between the parametric and physical domains. To simplify the notation, in what follows the large scale variables v̄ and p̄ will be denoted simply by v and p and the spaces S̄_h^k, W̄_h^k and P̄_h^k by S_h^k, W_h^k and P_h^k. A weak solution of the Navier-Stokes equations can now be determined by finding v ∈ S_h^k and p ∈ P_h^k such that

B(w, q; v, p) = B_G(w, q; v, p)
  + Σ_{e=1}^{n_el} ∫_{Ω_e} { (v·∇)w · (τ_M R_M) + ∇·w τ_C R_C } dΩ_e
  + Σ_{e=1}^{n_el} ∫_{Ω_e} { w · [−τ_M R_M · ∇v] + [R_M · ∇w] · [τ R_M · v] } dΩ_e
  + Σ_{e=1}^{n_el} ∫_{Ω_e} ∇q · (τ_M/ρ) R_M dΩ_e = 0,   (17)

for all w ∈ W_h^k and q ∈ P_h^k, where the Galerkin functional B_G is expressed as

B_G(w, q; v, p) = ∫_Ω w · (ρ v̇ + ρ v·∇v − f) dΩ
  + ∫_Ω { ∇w : (−p I + τ) − ∇q · v } dΩ
  + ∫_{Γ_h} { −w·h + q v_n } dΓ + ∫_{Γ_s} { −w·t_f + q v_n } dΓ
  + ∫_{Γ_g} q v_n dΓ.   (18)
The discrete variables w_h, v_h, q_h and p_h are introduced in (17), leading to a non-linear system of equations of the form

N_M(v̇_h, v_h, p_h) = 0,
N_C(v̇_h, v_h, p_h) = 0.   (19)

A predictor-multicorrector scheme [24] is used for time integration and, at each time step, the resulting non-linear system (19) is solved through successive Newton iterations. Additional details can be found in [23, 25].

Since this study focuses on the development of an ensemble solver, we provide a one-way coupled, simplified treatment of the interaction between fluid and structure, leaving a more rigorous treatment to future work. Here we assume a fluid model with rigid walls and a structural model of the lumen governed by the three d.o.f. shell discussed in Section 2.2.1. The lumen deformation does not, in turn, affect the geometry of the fluid region. The elastic forces in the lumen wall are in equilibrium with the shear and pressure exerted by the fluid, or, in other words, t_s = −t_f, where the wall stress t_f computed by the fluid solver is

t_f = σ_f · n,   σ_f = 2µ ε − p I,   (20)

p is the main nodal pressure unknown in the VMS fluid solver, and ε = ∇^s v is the symmetric part of the velocity gradient, which is constant on each P1 element in the fluid mesh. An example of the shear forces computed by the VMS fluid solver for a coronary model is illustrated in Figure 4. At every node on the wall, the shear forces and normal vectors from adjacent elements are averaged and the nodal pressure added, leading to the three components of the nodal force that are passed to the structural solver.

Finally, stress components in the circumferential, radial and axial directions are obtained by transforming the solid stress tensor to a local cylindrical coordinate system (see Figure 4). For an arbitrary Gauss point or lumen shell node s, we identify the closest location on the vessel centerline. The tangent vector to the centerline at s is defined as the local axial direction z, the normal at the Gauss point is the local radial direction r, and the local circumferential direction θ is obtained as the cross product between r and z.

Figure 4:
Visualization of interface shear forces exchanged between the fluid and the structural solver. The local axis system used for stress post-processing is also shown in the close up.
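A minimal sketch of this post-processing step is shown below. The function and argument names are illustrative (the centerline points and tangents are assumed to be precomputed), and the projection of the wall normal onto the plane orthogonal to the tangent is our own addition to keep the triad orthonormal.

```python
import numpy as np

def local_cylindrical_axes(node, centerline_pts, centerline_tangents, wall_normal):
    """Local (r, theta, z) axes at a lumen wall node or Gauss point.

    node                : (3,) coordinates of the wall node
    centerline_pts      : (m, 3) points sampled along the vessel centerline
    centerline_tangents : (m, 3) unit tangents at those points
    wall_normal         : (3,) outward unit normal of the lumen wall at the node
    """
    # closest centerline location defines the local axial direction z
    i = np.argmin(np.linalg.norm(centerline_pts - node, axis=1))
    e_z = centerline_tangents[i] / np.linalg.norm(centerline_tangents[i])

    # wall normal defines the radial direction; remove any axial component
    e_r = wall_normal - np.dot(wall_normal, e_z) * e_z
    e_r /= np.linalg.norm(e_r)

    # circumferential direction completes the right-handed triad
    e_theta = np.cross(e_r, e_z)
    return e_r, e_theta, e_z
```

A stress tensor expressed in global axes can then be rotated to the local system as R σ R^T, with the rows of R given by the three unit vectors above.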
The solution in time for the dynamics of the undamped non-linear structural system

M ü + C u̇ + K(u) u = f(t),   (21)

is computed using central differences in time, resulting in an update formula in the [n∆t, (n+1)∆t] intervals expressed as

( M̃ + (∆t/2) C̃ ) u^{n+1} = ∆t² f^n − ( ∆t² K − 2 M̃ ) u^n − ( M̃ − (∆t/2) C̃ ) u^{n−1},   (22)

which is performed independently by an arbitrary number of mesh partitions. To do so efficiently, M̃ and C̃ in (22) are lumped (diagonal) matrices, pre-assembled before the beginning of the time loop and containing the elemental contributions from all finite elements, even those belonging to separate mesh partitions. Given the limited amount of deformation typically observed in cardiovascular applications over a single heart cycle, the assumption of a fixed nodal mass over time is considered realistic. This assumption removes the need of communicating nodal masses during finite element assembly, improving scalability. In some cases, a viscous force f_v is added to the right-hand side in order to damp the high-frequency oscillations as f_v = −c_d u̇^n, through an appropriate damping coefficient c_d.

Even though the geometry of the vessel lumen is updated at every step by adding u^{n+1}, the displacement magnitudes over a typical heart cycle remain of the same order as the thickness of the vessel wall, hence in the linear regime. In this context, the stiffness does not significantly change and it is possible to save computational time by avoiding its assembly at every iteration. In our code, we therefore provide the option to selectively update the stiffness matrix after a prescribed number of iterations. Synchronization of displacements for the nodes shared by multiple partitions is implemented using Send/Recv to the root CPU and a broadcast back.

The performance of the proposed ensemble solver was first tested on multiple CPUs. We use distributed sparse matrices in the Yale compressed sparse row (CSR) format [26] with dense coefficient entries of size proportional to n_s, where n_s is the number of material property realizations, and compare the resulting matrix-vector product kernel against the mkl_cspblas_dcsrgemv routine provided through the Intel MKL library, using single and multiple threads. We verified the satisfactory performance of our implementation under a wide range of mesh sizes, numbers of cores, and with/without multithreading. Encouraging speedups were obtained on multiple CPUs, as shown in Figure 5 and Figure 6.
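Because M̃ and C̃ are diagonal, one step of (22) reduces to a sparse matrix-vector product (to form K u^n) followed by element-wise updates, which can be evaluated for all realizations at once. The sketch below illustrates this for displacement arrays holding one column per realization; names and data layout are illustrative, not the solver's actual interface.

```python
import numpy as np

def central_difference_step(m_lumped, c_lumped, dt, f_n, Ku_n, u_n, u_nm1):
    """One ensemble step of the central-difference update in (22).

    m_lumped, c_lumped : (n_dof,) lumped mass and damping diagonals
    f_n   : (n_dof, n_s) external forces for the n_s realizations
    Ku_n  : (n_dof, n_s) stiffness times current displacement, K(u^n) u^n,
            e.g. from an (assembly-free) sparse matrix-vector product per realization
    u_n, u_nm1 : (n_dof, n_s) displacements at steps n and n-1
    """
    lhs = m_lumped + 0.5 * dt * c_lumped  # diagonal system, inverted trivially
    rhs = (dt**2 * f_n
           - (dt**2 * Ku_n - 2.0 * m_lumped[:, None] * u_n)
           - (m_lumped[:, None] - 0.5 * dt * c_lumped[:, None]) * u_nm1)
    return rhs / lhs[:, None]
```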
Figure 5:
Performance of the matrix-vector product kernel on CPU for our Cython+OpenMP implementation, the GPU implementation and the MKL library on multiple threads (a). Optimization of the GPU matrix-vector kernel and of the CPU-GPU communication performance (b). Tests were performed using 1 CPU and 1 GPU on a cylindrical model associated with a sparse matrix having nnz = 160,587 and 1,000 material property realizations.
We developed a hardware-independent OpenCL implementation of the solver running on multiple GPUs. We started with a naïve CSR-based parallel sparse matrix-vector product (SpMV), known as
CSR-Scalar, where each row of the sparse matrix is assigned to a separate thread [27]. This works well on CPUs, but causes uncoalesced, slow memory accesses on GPUs, since elements in each row occupy consecutive addresses in memory, but consecutive threads access elements on different rows. In addition, long rows lead to an unequal amount of work among the threads, and some of them need to wait for others to finish. We then transitioned to a
CSR-Vector scheme [28], assigning a wavefront (a so-called warp on NVIDIA architectures) to work on a single row of the matrix. This allows for access to consecutive memory locations in parallel, resulting in fast coalesced loads. However, CSR-Vector can lead to poor GPU occupancy for short rows due to unused execution resources. Improved performance can be achieved using
CSR-Stream [29], which statically fixes the number of nonzeros that will be processed by one wavefront and streams all of these values into the local scratchpad memory, effectively utilizing the GPU's DRAM bandwidth and improving over CSR-Scalar. CSR-Stream also dynamically determines the number of rows on which each wavefront will operate, thus improving over CSR-Vector. While CSR-Stream substantially improves the performance of the SpMV product kernel, the CPU-to-GPU data transfer still dominates the time step update, as shown in Figure 5b. This problem can be mitigated by sending data to the GPU in smaller chunks, thus overlapping data transfer and kernel execution.
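The access pattern that makes the ensemble formulation attractive is easiest to see in a serial CSR-scalar sketch, where every nonzero now carries one coefficient per realization and the innermost loop over realizations is contiguous in memory (and therefore maps naturally to coalesced loads or vectorized CPU instructions). This NumPy version is illustrative only; the solver implements the corresponding kernels in Cython/OpenCL.

```python
import numpy as np

def csr_ensemble_spmv(indptr, indices, vals, x):
    """CSR-scalar style SpMV for an ensemble of realizations.

    indptr, indices : usual CSR row pointers and column indices (shared by all realizations)
    vals            : (nnz, n_s) one coefficient per nonzero and per realization
    x               : (n, n_s) one right-hand vector per realization
    Returns y such that, for each realization k, y[:, k] = A_k x[:, k].
    """
    n, n_s = x.shape
    y = np.zeros((n, n_s))
    for row in range(n):                       # one "thread" per row in CSR-scalar
        for p in range(indptr[row], indptr[row + 1]):
            y[row] += vals[p] * x[indices[p]]  # n_s multiply-adds per nonzero
    return y
```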
Figure 6:
Performance of the explicit ensemble solver on multiple CPUs for a mesh with 5,074 elements and 2,565 nodes (a) and for a mesh with 15,136 elements and 7,628 nodes (b).
Figure 7:
Performance of the explicit ensemble solver on multiple CPUs/GPUs for a mesh with 5,074 elements and 2,565 nodes (a) and for a mesh with 15,136 elements and 7,628 nodes (b).

In addition, data transfer can be minimized by an assembly-free approach. Even though this is a standard practice in explicit finite element codes, it is particularly effective on a GPU for three reasons. First, storage of a sparse global matrix would occupy a significant portion of the GPU memory, posing restrictions on the model size. Second, indexing operations to access entries in the global matrix would cause severe uncoalesced memory access, significantly reducing the degree of parallelism in the GPU. Third, for the most common sparse matrix storage schemes, indexing always includes searching, which would be particularly slow on a GPU. We compute local matrices directly on the GPU and take their product with a partition-based displacement vector, where only the displacements of shared nodes need to be synchronized through the root CPU.

To compute element matrices, we buffer data to the GPU before the beginning of the time integration loop, in order to avoid any CPU to GPU data transfer due to element assembly. Buffered quantities include mesh geometry and material properties, particularly the product between the Young's modulus and thickness at each Gauss point for all random field realizations. Note how the rest of the local stiffness matrix is constant for linear triangular elements. In addition, the left-hand-side lumped mass matrix resulting from the central difference scheme does not change throughout the time loop. We also leverage a mesh coloring algorithm, allowing working units to process different elements at the same time. During each time step, each computing group in the GPU works on one element, while the working units in the same group work on different realizations and therefore have access to coalesced GPU memory. Communication is only triggered by displacement synchronization. We use pinned host memory to speed up the data transfer between CPU and GPU. All displacements for each realization and the local matrices are stored in private memory since they do not need to be shared with other working units. Final GPU speedups are illustrated in Figure 7.
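The coloring guarantees that elements processed concurrently do not scatter contributions to the same node, so no atomic operations are needed when accumulating nodal forces. A simple greedy version is sketched below; it is only an illustration of the idea and not necessarily the coloring algorithm used in CVFES.

```python
import numpy as np

def greedy_element_coloring(elements):
    """Greedy coloring of mesh elements so that no two elements of a color share a node.

    elements : (n_el, 3) node connectivity of the triangular shell mesh
    Returns an array of color indices, one per element.
    """
    node_to_elems = {}
    for e, conn in enumerate(elements):
        for n in conn:
            node_to_elems.setdefault(int(n), []).append(e)

    colors = -np.ones(len(elements), dtype=int)
    for e, conn in enumerate(elements):
        # colors already taken by elements sharing at least one node with element e
        taken = {int(colors[nb]) for n in conn for nb in node_to_elems[int(n)]
                 if colors[nb] >= 0}
        c = 0
        while c in taken:
            c += 1
        colors[e] = c
    return colors
```

All elements with the same color can then be assembled and scattered in one parallel pass, with one GPU work group per element and one work item per realization, as described above.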
The CVFES solver is developed in Python 3 with optimization in Cython [30], and element assembly and matrix products are implemented in OpenCL [31] (through the Python pyOpenCL library [32]). The code leverages the VTK library [33] to read the solid and fluid meshes and the boundary conditions. In this context the solver is fully compatible with the input files generated by the SimVascular software platform [34] and can be easily integrated with the SimVascular modeling workflow. Partitioning on multiple CPUs and GPUs is obtained for both solvers using parMETIS [35]. The code used to generate the results discussed in Section 3 is available through a public GitHub repository at https://github.com/desResLab/CVFES.

The first benchmark represents an ideal cylindrical lumen subject to aortic flow. The cylinder has a diameter equal to 4 cm and a length of 30 cm, while two Matérn random fields for thickness and elastic modulus have been assigned as discussed in Section 2.1. Specifically, a mean µ = 7.× Barye and a standard deviation σ = 7.× Barye have been assumed for the elastic modulus, whereas the thickness random field is characterized through a mean µ = 0. cm and a standard deviation σ = 0. cm. Three values of the correlation length were considered, equal to 0.95 cm, 3.7 cm and 7.2 cm, respectively (see discussion in [36]). A uniform pressure of 13 mmHg was added to the pressure computed by the VMS fluid solver: we considered diastole as a natural (unstressed) state and applied the difference between a diastolic pressure of 80 mmHg and the mean brachial pressure in a healthy subject (i.e., with systolic pressure equal to 120 mmHg). This configuration is analyzed both under steady state and pulsatile flow conditions, with fully fixed structural boundary conditions at the two cylinder ends.

The fluid solution is computed with the VMS fluid solver discussed in Section 2.2.2 using a parabolic velocity profile at the inflow corresponding to a −. mL/s volumetric flow rate, a zero-traction boundary condition at the outlet and a no-slip condition at the lumen wall. As expected, the fluid solution shows a perfectly linear relative pressure profile along the cylinder center path and a uniform parabolic velocity profile from inlet to outlet, typical of viscous-dominated Poiseuille flow, as shown in Figure 8.

The steady state pressure resulting from the fluid solver is applied as discussed in Section 2.2.3 and 100 thickness and elastic modulus realizations are solved simultaneously. The explicit structural simulation is run for . seconds, until a steady state was observed. Three wall mesh densities (coarse, medium and fine) are finally considered, consisting of 5,074, 15,136 and 32,994 triangular shell elements, with explicit time steps set to ∆t = 4.× − s, ∆t = 4.× − s and ∆t = 1.× − s, respectively, and no viscous damping.

Displacement magnitudes along the longitudinal z axis (cylinder generator) are shown in the top row of Figure 9 for the finer mesh and various correlation lengths. The mean displacement one diameter away from the fully fixed ends (thick black line) is consistent with a homogeneous solution for a thick cylinder with average elastic modulus and thickness (blue dashed line). Displacements associated with single random field realizations are also shown, in Figure 9 (top row), using colors. As expected, the displacement wave length increases with the correlation length, and so does the displacement uncertainty quantified through the 5%-95% confidence interval (gray shaded area).
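The ensemble bands in Figure 9 are simple pointwise statistics over the realizations; a sketch of how they could be computed is shown below (array names are illustrative).

```python
import numpy as np

def confidence_band(displacements, lo=5.0, hi=95.0):
    """Ensemble mean and 5%-95% band along the cylinder generator.

    displacements : (n_s, n_points) displacement magnitude sampled along the z axis,
                    one row per material-property realization
    """
    return (displacements.mean(axis=0),
            np.percentile(displacements, lo, axis=0),
            np.percentile(displacements, hi, axis=0))
```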
Finally, the second row of Figure 9 shows how increasing the mesh density produces a limited difference in the 5%-95% confidence interval. Circumferential stress was found to be the most significant component, as expected, showing uncertainty increasing with the correlation length, similarly to what was observed for the displacement magnitude. The in-plane and out-of-plane shear components (σ_θz and σ_rz), though much smaller than the circumferential and axial stresses, are essentially related to the material property non-homogeneity and reduce for an increasing correlation length. These stress components are zero both on average and when solving the model with average material properties. Thus, they can only be captured by explicitly modeling the spatial variability of material properties, as in the proposed approach.

A parabolic pulsatile inflow (Figure 11a) was applied to the same cylindrical geometry discussed in the previous section, while all the other boundary conditions (walls and outlets) were kept the same as in the steady case. The time step is set to .× − s and the simulation is run for two heart cycles and 100 material property realizations. To avoid the application of impulsive loads, which can excite a broad range of frequencies and significantly affect the undamped dynamics, a ramp was applied to the wall loads resulting from the fluid solver in the last heart cycle.
Figure 8:
Steady state pressure (a) and velocity (b) distributions from the variational multi-scale fluid solver.
Figure 9:
Displacement magnitude along the cylinder generator (longitudinal z axis) for various correlation lengths (first row). 5%-95% confidence intervals of the displacement magnitudes for three increasing mesh densities (i.e., coarse, medium and fine, second row).

The ramp follows a sine wave and is kept active for the first . seconds of the simulation. No damping was applied to the simulation, i.e., f_v = 0. As expected, the resulting displacements follow a time profile similar to the inflow, and the 5%-95% confidence intervals increase with the correlation length of the underlying random field. Circumferential and axial stresses are the dominant stress components, and they also increase with the correlation length, as observed for the steady state results.

We also demonstrate the results of the proposed ensemble solver on a patient-specific model of the left anterior descending coronary branch. An ensemble of 100 model solutions was obtained using random field parameters for the elastic modulus equal to µ = 1.× Barye and σ = 1.× Barye. For the thickness, we considered a mean equal to µ = 0. cm and a standard deviation of σ = 0. cm. A slip-free boundary condition was applied at the outlets of the fluid domain, whereas fully fixed mechanical restraints (i.e., all three nodal translations) were applied along the edges at both the inlets and outlets. A uniform pressure has been superimposed to the lumen stress obtained from the fluid solver, as discussed for the ideal aortic model in Section 3.1.

A constant flow rate equal to −. mL/s is applied at the model inlet with a parabolic profile, while a zero-traction boundary condition is applied at the outlets and a no-slip condition at the walls. The explicit time step is set to .× − s and the model was run for . seconds to reach the steady state. No additional viscous force f_v was considered. Figure 12 shows the displacements and stresses for all three analyzed correlation lengths, averaged over a cross-sectional slice of the branch of interest and plotted for successive slices along the longitudinal z axis.
Figure 10:
Ensemble means, 5%-95% confidence intervals and single realizations of longitudinal stress profiles for various mesh densities and random field correlation lengths.
Figure 11:
Results for pulsatile flow on the ideal cylindrical vessel. Inflow time history and spatial location for acquisition of displacement and stress outputs (a). Displacement time history at the selected spatial location, with thick lines representing ensemble averages and thin lines the 5%/95% percentiles (b). Circumferential and axial stress time history at the selected spatial location (c).

Displacement results confirm the absence of torsion and a prevalent radial deformation mode, with rigid body motion evident from the axial displacements d_z, affected by the centerline path geometry. The stress results confirm the importance of the circumferential, followed by the axial, component. The circumferential stress reduces with the coronary branch radius, as intuitively suggested by the Barlow (or Mariotte) formula for thin-walled cylinders. Similar cylindrical displacements, circumferential and axial stresses are observed for all three correlation lengths. For the selected coronary branch, the smaller correlation length (0.95 cm) is approximately equal to twice the largest diameter, inducing minimal changes in local deformability.
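For reference, the thin-walled estimate invoked here (a standard textbook relation, not an expression taken from the solver) reads

σ_θθ ≈ p r / ζ,   σ_zz ≈ p r / (2ζ),

so for a given transmural pressure p and wall thickness ζ the hoop stress scales with the local radius r, consistent with the decreasing trend along the branch shown in Figure 12.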
Displacement and stress profiles in left coronary artery LAD branch under steady flow, averaged over allmaterial property realizations and cross-sectional slice.
The same geometry analyzed in the previous section is subject to a parabolic pulsatile inflow shown in Figure 13. Atime step equal to . × − is selected, and the model is run for two complete heart cycles (1.6 seconds) and 100material property realizations. Similar to the pulsatile cylindrical test case, a time ramp is applied during the first 0.2s, to avoid impulsive loads produced by a non-zero wall stress at t = 0 , and a pressure of 13 mmHg is superimposedto the fluid wall stress. A viscous force f v was applied, using a damping coefficient equal to c d = 0 . , which wasfound from various tests to remove the high-frequency oscillations without affecting the system dynamics. Results forthe average displacements and stress are shown in Figure 13 at four successive locations over the center path of theLAD branch. Even for this case the circumferential and axial stress are the most relevant components and their timehistory is similar to the inflow and exhibit a maximum at diastole. . . . . . . . . . − . − . . I nfl o w [ m L / s ] Inflow . . . . . . . . . . D i s p l a ce m e n t [ c m ] . . . . σ θθ [ B a r y e ] loc. 1loc. 2loc. 3loc. 4 . . . . σ zz [ B a r y e ] Figure 13:
Displacement and stress profiles in the left coronary artery LAD branch under pulsatile flow, averaged over all material property realizations and over each cross-sectional slice.
Table 1:
Speedup - Model with 5,074 elements and 2,565 nodes - 1,000 time steps with ∆t = 1.× −.

Hardware | 1 Smp (Spd) | 10 Smp (Spd) | 50 Smp (Spd) | 100 Smp (Spd) | 200 Smp (Spd) | 500 Smp (Spd)
1 CPU
12 CPU
24 CPU

(*) All time entries are in a days:hours:minutes:seconds format. The speedup is indicated as (x, y), where x is the speedup with respect to the number of samples and y the speedup obtained by distributing the computation on multiple CPUs or GPUs.

Table 2:
Speedup - Model with 15,136 elements and 7,628 nodes - 1,000 time steps with ∆t = 1.× −.

Hardware | 1 Smp (Spd) | 10 Smp (Spd) | 50 Smp (Spd) | 100 Smp (Spd) | 200 Smp (Spd) | 500 Smp (Spd)
1 CPU
12 CPU
24 CPU

(*) All time entries are in a days:hours:minutes:seconds format. The speedup is indicated as (x, y), where x is the speedup with respect to the number of samples and y the speedup obtained by distributing the computation on multiple CPUs or GPUs.

In this section, we compare the performance obtained by running the proposed ensemble solver on three cylindrical models with an increasing number of elements, as shown in Table 1, Table 2 and Table 3. The explicit time step was set to .× − s for all models, and each run consisted of 1,000 time steps. The GPUs used for these tests are four GeForce RTX 2080 Ti with 11 GB of RAM, equipped with 4,352 NVIDIA CUDA cores and connected to the server mainboard through PCI Express ports.

Two types of speedup are investigated. The first relates to solving the same number of material property realizations on an increasing number of processors, either CPUs or GPUs, and provides an idea of the effectiveness of the various optimizations presented in Section 2.3 and Section 2.4. This speedup (first number in parentheses) is observed to increase with the model size and the number of random field realizations. The computation/communication tradeoff and the small mesh sizes selected for these tests also justify the negligible benefit of using 24 CPU cores instead of 12. The speedup achieved by our GPU implementation is instead very relevant, i.e., approximately three orders of magnitude with respect to a single CPU implementation.

The second type of speedup quantifies the efficiency of the proposed ensemble solver, i.e., how much faster one can obtain the solution of multiple realizations by solving them at the same time, with respect to the solution of a single realization on the same hardware multiplied by the total number of realizations. The computational savings of an ensemble solution are quantified between one and two orders of magnitude, confirming our initial claim.
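In our notation (introduced here for clarity, not taken from the tables): if T_p(n_s) is the wall-clock time needed to advance n_s realizations for the prescribed number of steps on p processing units, the two quantities above correspond to

S_hw = T_1(n_s) / T_p(n_s)   and   S_ens = n_s · T_p(1) / T_p(n_s),

i.e., the hardware speedup at a fixed ensemble size and the ensemble speedup with respect to repeating a single-realization run n_s times on the same hardware.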
Our study investigates the efficiency achievable by an ensemble cardiovascular solver on modern GPU architectures. The basic methodological approaches are discussed together with details on our efforts to optimize execution on such architectures, both algorithmically and in terms of host-to-device communication. In particular, the computation of local element matrices is performed directly on the GPU, and working units are used to determine displacement increments for independent material property realizations.

The main result from our analysis is the observation that explicit-in-time ensemble solvers based on matrix-vector products naturally achieve high scalability, due to their ability to generate large amounts of computation and the reusable data patterns provided by independent realizations. This also suggests that ensemble cardiovascular solvers are ideal to efficiently generate the campaigns of high-fidelity model solutions that form a prerequisite for uncertainty quantification studies, the state-of-the-art paradigm for the analysis of cardiovascular systems, due to their ability to quantify the effects of multiple sources of uncertainty on the simulation outputs.
Table 3:
Speedup - Model with 131,552 elements and 65,896 nodes - 1,000 time steps with ∆t = 1.× −.

Hardware | 1 Smp (Spd) | 10 Smp (Spd) | 50 Smp (Spd) | 100 Smp (Spd) | 200 Smp (Spd) | 500 Smp (Spd)
1 CPU
12 CPU
24 CPU

(*) All time entries are in a days:hours:minutes:seconds format. The speedup is indicated as (x, y), where x is the speedup with respect to the number of samples and y the speedup obtained by distributing the computation on multiple CPUs or GPUs.

We have also validated our solver for the steady state and pulsatile solution of an idealized aortic flow and of a patient-specific model of the left coronary artery, under vessel wall mechanical property uncertainty, modeled through a Gaussian random field approximation.

Future work will be devoted to further improving the computational efficiency of the proposed approach and to transitioning from a segregated to a fully coupled Arbitrary Lagrangian-Eulerian fluid-structure interaction paradigm. In addition, the proposed approach only considered uncertainties due to vessel wall thickness and elastic modulus. We plan to also support uncertainty in the boundary conditions and model geometry.
Acknowledgements
This work was supported by a National Science Foundation award, CAREER: Bayesian Inference Networks for Model Ensembles (PI Daniele E. Schiavazzi). This research used computational resources provided through the Center for Research Computing at the University of Notre Dame. We also acknowledge support from the open source SimVascular project.

Conflict of interest
The authors declare that they have no conflict of interest.
References

[1] F. Lindgren, H. Rue, and J. Lindström. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423–498, 2011.
[2] C.A. Figueroa, I.E. Vignon-Clementel, K.E. Jansen, T.J.R. Hughes, and C.A. Taylor. A coupled momentum method for modeling blood flow in three-dimensional deformable arteries. Computer Methods in Applied Mechanics and Engineering, 195(41-43):5685–5706, 2006.
[3] K. Bäumler, V. Vedula, A.M. Sailer, J. Seo, P. Chiu, G. Mistelbauer, F.P. Chan, M.P. Fischbein, A.L. Marsden, and D. Fleischmann. Fluid–structure interaction simulations of patient-specific aortic dissection. Biomechanics and Modeling in Mechanobiology, pages 1–22, 2020.
[4] H. Askes, D.C.D. Nguyen, and A. Tyas. Increasing the critical time step: micro-inertia, inertia penalties and mass scaling. Computational Mechanics, 47(6):657–667, 2011.
[5] S. Morlacchi, C. Chiastra, D. Gastaldi, G. Pennati, G. Dubini, and F. Migliavacca. Sequential structural and fluid dynamic numerical simulations of a stented bifurcated coronary artery. Journal of Biomechanical Engineering, 133(12), 2011.
[6] C. Chiastra, G. Dubini, and F. Migliavacca. Modeling the stent deployment in coronary arteries and coronary bifurcations. In Biomechanics of Coronary Atherosclerotic Plaque, pages 579–597. Elsevier, 2020.
[7] C. Chiastra, S. Morlacchi, D. Gallo, U. Morbiducci, R. Cárdenes, I. Larrabide, and F. Migliavacca. Computational fluid dynamic simulations of image-based stented coronary bifurcation models. Journal of The Royal Society Interface, 10(84):20130193, 2013.
[8] A. Bartezzaghi, M. Cremonesi, N. Parolini, and U. Perego. An explicit dynamics GPU structural solver for thin shell finite elements. Computers & Structures, 154:29–40, 2015.
[9] V. Strbac, D.M. Pierce, J. Vander Sloten, and N. Famaey. GPGPU-based explicit finite element computations for applications in biomechanics: the performance of material models, element technologies, and hardware generations.
Computer Methods in Biomechanics and Biomedical Engineering, 20(16):1643–1657, 2017.
[10] N. Jiang and W. Layton. An algorithm for fast calculation of flow ensembles. International Journal of Uncertainty Quantification, 4(4):273–301, 2014.
[11] N. Jiang. A higher order ensemble simulation algorithm for fluid flows. Journal of Scientific Computing, 64(1):264–288, 2015.
[12] N. Jiang and W. Layton. Numerical analysis of two ensemble eddy viscosity numerical regularizations of fluid motion. Numerical Methods for Partial Differential Equations, 31(3):630–651, 2015.
[13] A. Takhirov, M. Neda, and J. Waters. Time relaxation algorithm for flow ensembles. Numerical Methods for Partial Differential Equations, 32(3):757–777, 2016.
[14] N. Jiang. A second-order ensemble method based on a blended backward differentiation formula timestepping scheme for time-dependent Navier–Stokes equations. Numerical Methods for Partial Differential Equations, 33(1):34–61, 2017.
[15] M. Mohebujjaman and L.G. Rebholz. An efficient algorithm for computation of MHD flow ensembles. Computational Methods in Applied Mathematics, 17(1):121–137, 2017.
[16] J.A. Fiordilino. Ensemble time-stepping algorithms for the heat equation with uncertain conductivity. Numerical Methods for Partial Differential Equations, 34(6):1901–1916, 2018.
[17] Y. Luo and Z. Wang. An ensemble algorithm for numerical solutions to deterministic and random parabolic PDEs. SIAM Journal on Numerical Analysis, 56(2):859–876, 2018.
[18] M. Gunzburger, N. Jiang, and Z. Wang. An efficient algorithm for simulating ensembles of parameterized flow problems.
IMA Journal of Numerical Analysis, 39(3):1180–1205, 2019.
[19] M.L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer Science & Business Media, 2012.
[20] P. Whittle. On stationary processes in the plane. Biometrika, pages 434–449, 1954.
[21] P. Whittle. Stochastic processes in several dimensions. Bulletin of the International Statistical Institute, 40(2):974–994, 1963.
[22] D. Bolin and F. Lindgren. A comparison between Markov approximations and other methods for large spatial data sets. Computational Statistics & Data Analysis, 61:7–21, 2013.
[23] Y. Bazilevs, V.M. Calo, J.A. Cottrell, T.J.R. Hughes, A. Reali, and G. Scovazzi. Variational multiscale residual-based turbulence modeling for large eddy simulation of incompressible flows. Computer Methods in Applied Mechanics and Engineering, 197(1):173–201, 2007.
[24] K.E. Jansen, C.H. Whiting, and G.M. Hulbert. A generalized-α method for integrating the filtered Navier–Stokes equations with a stabilized finite element method. Computer Methods in Applied Mechanics and Engineering, 190(3-4):305–319, 2000.
[25] J. Seo, D.E. Schiavazzi, and A.L. Marsden. Performance of preconditioned iterative linear solvers for cardiovascular simulations in rigid and deformable vessels. Computational Mechanics, pages 1–23, 2019.
[26] A. Buluç, J.T. Fineman, M. Frigo, J.R. Gilbert, and C.E. Leiserson. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In
Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, pages 233–244, 2009.
[27] M. Garland. Sparse matrix computations on manycore GPUs. In Proceedings of the 45th Annual Design Automation Conference, pages 2–6, 2008.
[28] N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1–11, 2009.
[29] J.L. Greathouse and M. Daga. Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format. In SC'14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 769–780. IEEE, 2014.
[30] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D.S. Seljebotn, and K. Smith. Cython: The best of both worlds. Computing in Science & Engineering, 13(2):31–39, March-April 2011.
[31] J.E. Stone, D. Gohara, and G. Shi. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering, 12(3):66–73, 2010.
[32] A. Klöckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih. PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation. Parallel Computing, 38(3):157–174, 2012.
[33] W. Schroeder, K. Martin, and B. Lorensen. The Visualization Toolkit (4th ed.). Kitware, 2006.
[34] H. Lan, A. Updegrove, N.M. Wilson, G.D. Maher, S.C. Shadden, and A.L. Marsden. A re-engineered software interface and workflow for the open-source SimVascular cardiovascular modeling package. Journal of Biomechanical Engineering, 140(2), 2018.
[35] G. Karypis and V. Kumar. METIS: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 4.0, 2009.
[36] J.S. Tran, D.E. Schiavazzi, A.M. Kahn, and A.L. Marsden. Uncertainty quantification of simulated biomechanical stimuli in coronary artery bypass grafts.