[PDF] Massively parallel-in-space-time, adaptive finite element framework for non-linear parabolic equations

Abstract

We present an adaptive methodology for the solution of (linear and) non-linear time dependent problems that is especially tailored for massively parallel computations. The basic concept is to solve for large blocks of space-time unknowns instead of marching sequentially in time. The methodology is a combination of a computationally efficient implementation of a parallel-in-space-time finite element solver coupled with a posteriori space-time error estimates and a parallel mesh generator. This methodology enables, in principle, simultaneous adaptivity in both space and time (within the block) domains. We explore this basic concept in the context of a variety of time-steppers including Θ -schemes and Backward Differentiate Formulas. We specifically illustrate this framework with applications involving time dependent linear, quasi-linear and semi-linear diffusion equations. We focus on investigating how the coupled space-time refinement indicators for this class of problems affect spatial adaptivity. Finally, we show good scaling behavior up to 150,000 processors on the Blue Waters machine. This is achieved by careful usage of memory via block storage and non-zero formats along with lumped communication for matrix assembly. This methodology enables scaling on next generation multi-core machines by simultaneously solving for large number of time-steps, and reduces computational overhead by refining spatial blocks that can track localized features. This methodology also opens up the possibility of efficiently incorporating adjoint equations for error estimators and inverse design problems, since blocks of space-time are simultaneously solved and stored in memory.

Full PDF

MMASSIVELY PARALLEL-IN-SPACE-TIME, ADAPTIVE FINITEELEMENT FRAMEWORK FOR NON-LINEAR PARABOLICEQUATIONS

ROBERT DYJA, BASKAR GANAPATHYSUBRAMANIAN,AND KRISTOFFER G. VAN DER ZEE

Abstract.

We present an adaptive methodology for the solution of (linearand) non-linear time dependent problems that is especially tailored for mas-sively parallel computations. The basic concept is to solve for large blocks ofspace-time unknowns instead of marching sequentially in time. The method-ology is a combination of a computationally eﬃcient implementation of aparallel-in-space-time ﬁnite element solver coupled with a posteriori space-time error estimates and a parallel mesh generator. This methodology en-ables, in principle, simultaneous adaptivity in both space and time (withinthe block) domains. We explore this basic concept in the context of a varietyof time-steppers including Θ-schemes and Backward Diﬀerentiate Formulas.We speciﬁcally illustrate this framework with applications involving time de-pendent linear, quasi-linear and semi-linear diﬀusion equations. We focus oninvestigating how the coupled space-time reﬁnement indicators for this class ofproblems aﬀect spatial adaptivity. Finally, we show good scaling behavior upto 150,000 processors on the Blue Waters machine. This is achieved by carefulusage of memory via block storage and non-zero formats along with lumpedcommunication for matrix assembly. This methodology enables scaling on nextgeneration multi-core machines by simultaneously solving for large number oftime-steps, and reduces computational overhead by reﬁning spatial blocks thatcan track localized features. This methodology also opens up the possibility ofeﬃciently incorporating adjoint equations for error estimators and inverse de-sign problems, since blocks of space-time are simultaneously solved and storedin memory. Introduction

We describe the methodology, implementation details and application examplesof space-time block adaptive solutions to parabolic partial diﬀerential equations(evolution problems). This approach is primarily motivated by the necessity of

Czestochowa University of Technology, ul. Dabrowskiego 73, 42-201 Czestochowa,PolandIowa State University, 306 Lab of Mechanics, Ames, IA 50011University of Nottingham, School of Mathematical Sciences, University Park, Not-tingham NG7 2RD, UK

E-mail addresses : [email protected], [email protected],[email protected] . Key words and phrases. parabolic problems, parallel-in-time, ﬁnite element method, adaptivemesh reﬁnement.We gratefully acknowledge NSF 1435587 and XSEDE resources at TACC, as well as an ex-ploratory account on BlueWaters.Kristoﬀer G. van der Zee’s research was supported by the Engineering and Physical SciencesResearch Council (EPSRC) under grant EP/I036427/1. a r X i v : . [ phy s i c s . c o m p - ph ] A ug R. DYJA, B. GANAPATHYSUBRAMANIAN, AND K.G. VAN DER ZEE designing computational methodologies that can scale to leverage the availabilityof very large computing clusters (with ∼ ∼ thousands of processors, beyond which eﬃciency iscommunication limited) which make this approach unsuitable on larger machines.To overcome this barrier, a natural approach is to consider the time domain as anadditional dimension and simultaneously solve for blocks of time, instead of thestandard approach of sequential time-stepping. Early work on this approach wasconsidered by Hughes and coworkers [15, 16], Tezduyar and co-workers [24], andReddy and co-workers [21], while variations on this theme have recently been ex-plored by several groups [6, 9, 18, 19, 22, 28]. Other examples include methodsthat exploit the fact that some coeﬃcients during matrix assembly can be calcu-lated concurrently [7] and methods that transform the equations so as to build onesystem of equations for more than one timestep [17].The concept of solving for blocks of time simultaneously has recently gaineda lot of attention to enable eﬀective usage of exascale computing resources (seefor instance the US DOE’s Exascale Mathematics Working Group [14]). In addi-tion to this obvious advantage, solving for space-time blocks also allows naturalincorporation of a posteriori error estimates for mesh adaptivity, and enables thesolution of inverse problems (involving adjoints). This has several additional tangi-ble beneﬁts in the context of computational overhead. For evolution problems thatexhibit regionalized behavior in space and time (wave equations, moving interfaceslike bubbles and shocks) solving in blocks of space-time that are locally reﬁned tomatch the regional behavior provides substantial computational gain [9]. Similarly,the availability of error estimates across a block of time allows optimal choices ofspace and time adaptivity.Motivated by these considerations, this paper presents a methodology for thesolution of linear, quasi-linear and semi-linear (time dependent) diﬀusion equationsin three dimensions. We discuss the development of a parallel adaptive frameworkfor the solution of large blocks of space-time. We detail the development of theblock space-time framework for two classes of time-steppers ( θ schemes, BackwardDiﬀerence Formula (BDF)). We discuss implementation details of the massivelyparallel adaptive block-space time framework and illustrate scaling behavior up to150,000 processors. We subsequently deﬁne a posteriori space-time error indicators to identify spatial regions for mesh adaption. Finally, we demonstrate that for suﬃ-ciently large problems the block space-time approach is much more computationallyeﬃcient than the standard sequential time-stepping approach.The outline of the rest of the paper is as follows: Section 2, and 3 detail the blockspace-time framework for linear, and non-linear evolution equations, respectively.Section 4 discusses the space-time error estimates for these classes of problems. InSection 5, we discuss implementation details and show scaling performance andanalysis. Section 6 illustrates several numerical examples of the framework. Weconclude in Section 7. DAPTIVE PARALLEL-IN-SPACE-TIME FINITE ELEMENT 3 Basic space-time formulation: linear and non-linear versions

Space-time framework for a linear problem.

Given a bounded domainΩ ∈ R , and a ﬁnite time domain [0 , T ], consider the parabolic equation that solvesfor u : Ω × [0 , T ] → R :(1) (cid:40) ∂ t u ( x , t ) − ∇ · κ ∇ u ( x , t ) = f ( x , t ) in Ω × [0 , T ] u ( x ,

0) = u where f : Ω × [0 , T ] → R is a smooth source function, and κ >

0. We consider (with-out loss of generality) that Dirichlet boundary conditions are imposed on the bound-ary Γ , unless otherwise speciﬁed. Considering a tesselation,

T ≡ { Ω , ..., Ω e , ... } ,of the domain Ω into elements (with average size h ), the weak form of this equationis given as:(2) (cid:40) Find u h ( · , t ) ∈ U h :( w h , ∂ t u h ( · , t )) + ( ∇ w h , κ ∇ u h ( · , t )) = ( w h , f ) ∀ w h ∈ V h where ( . , . ) is the L inner product on Ω and(3) U h := (cid:8) u h | u h ∈ H (Ω) , u h ∈ P (Ω e ) ∀ e (cid:9) V h := (cid:8) w h | w h ∈ H (Ω) , w h ∈ P (Ω e ) ∀ e (cid:9) with P (Ω e ) being the space of the standard polynomial ﬁnite element shape func-tions on element Ω e . To obtain a fully discretized form, we employ a time steppingtechnique on the above semi-discrete equation. While any time-stepping methodcan be used, as an example, consider the Euler Backward Formula that is deﬁnedon a discretization { , t , t , ..., T } of the time domain:(4) (cid:18) w h , u hn +1 − u hn ∆ t (cid:19) − (cid:0) ∇ w h , κ ∇ u hn +1 (cid:1) = (cid:0) w h , f n +1 (cid:1) , for n = 0 , , ... where the subscript denotes evaluation at that discrete time, and ∆ t = t n +1 − t n is the timestep.Following standard FEM practice, with the tesselation of the domain resultingin k nodal values (to describe spatial variation) of the ﬁeld u , equation (4) can beexpressed in terms of matrix-vector products as:(5) Mu n +1 − ∆ t Ku n +1 = Mu n + ∆ t f n +1 , for n = 0 , , ... where M and K are the global mass and stiﬀness matrices, respectively. u n +1 and u n are vectors containing the nodal values of the ﬁeld u at time step n + 1 and n , respectively. Equation (5) represents the system of equations solved to get thesolution for timestep n + 1. The size of vector u is equal to the number of nodalunknowns, k . Similarly matrices K , M are sparse matrices of size k × k .Consider a block-wise division of the total time domain. Each block, B i , consistsof multiple timesteps. This is schematically represented in Figure 1. Instead ofsequentially solving for each time step (as in (5)), consider solving for the ﬁeldvariable in a complete time block, B , consisting of N timesteps simultaneously, i.e.solve for u i , i = 1 , ..., N simultaneously. With some abuse of notation, the elements of matrix M are equal to M ij = ( w hi , w hj ) andmatrix K are equal to K ij = ( ∇ w hi , κ ∇ w hj ). R. DYJA, B. GANAPATHYSUBRAMANIAN, AND K.G. VAN DER ZEE B B t= t n t n+ t N t N +B t=Tt t t n +B Figure 1.

Indexing of timesteps per block (B). Number oftimesteps per block is equal to N.This results in a block diagonal matrix of size ( N × k )-by-( N × k ) given by:(6)  I − M M + ∆ t K − M M + ∆ t K . . . − M M + ∆ t K   u u u ... u N  =  IC∆ t f ∆ t f ...∆ t f N  where I is an identity matrix (of size k ) and IC are the imposed initial conditions.This system solves for N timesteps at once with a total number of unknowns N × k . Remark 1.

By treating the (unknown) nodal values at diﬀerent time steps as mul-tiple degrees of freedom associated with each spatial node, we can leverage standardalgorithmic approaches (assembly, memory usage) tailored for multiple d.o.f prob-lems. Many approaches in uncertainty quantiﬁcation (polynomial chaos represen-tation, spectral stochastic methods) leverage such an approach of representing ﬁeldvariation along additional dimensions (stochastic dimensions) as simply additionaldegrees of freedom at each spatial location [20, 27, 11, 23] . Figure 2(left) illustratesan elemental matrix considering N (= 10) time steps (i.e. degrees of freedom pernode), for a 2D quadrilateral element . Note the sparse structure of the elementalmatrix as well as the resulting global matrix (for a × quad mesh) in Figure 2(right). Figure 2.

Left, Nonzero entries in an elemental matrix. Right,Nonzero entries in the resultant global matrix for a 2x2 mesh ofquadrilateral elements. Dark squares represent nonzero entries andnumber are node indices. The time block consists of 10 time steps. using Backward-Euler discretization in time DAPTIVE PARALLEL-IN-SPACE-TIME FINITE ELEMENT 5

Space-time framework for non-linear problems.

Extending the approachto certain non-linear problems is straightforward. Consider the case where κ is afunction of the dependent variable, u . We assume that κ ( u ) satisﬁes appropri-ate smoothness and boundedness assumptions to ensure existence and uniqueness,0 < κ ≤ κ ( u ) ≤ κ < ∞ . In this case the weak form for a block B is(7) ( w h , u hn +1 − u hn ∆ t ) + ( ∇ w h , κ ( u hn +1 ) ∇ u hn +1 ) = ( w h , f n +1 ) , for n = 0 , , ..., N The solution to such non-linear equations are usually via (quasi-)Newton schemes.The methodology involves construction of the Jacobian and residual, which areused to compute updates. This is represented in matrix-vector terms as: J u in +1 δ u i +1 n +1 = F u in +1 , u i +1 n +1 = u in +1 + δ u i +1 n +1 , (8) for i = 1 , ..., until convergence and for n = 0 , , ... where J u in +1 is the Jacobian (or linearized form), and F u in +1 is the residual of theabove equation, both computed using u in +1 . More speciﬁcally, for the non-lineardiﬀusion equations deﬁned by (7), the residual, F u in +1 , is given by(9) F u in +1 = 1∆ t Mu in +1 − t Mu n − K ( u in +1 ) u in +1 − f n +1 where K ( u in +1 ) denotes the solution dependent stiﬀness matrix. The Jacobian isgiven as:(10) J u in +1 = 1∆ t M + (cid:18) K ( u in +1 ) + d K ( u in +1 )d u (cid:19) Instead of sequentially solving for each time step (as in (9)), consider solving forthe ﬁeld variable in a complete time block, B , consisting of N timesteps simulta-neously. That is,(11)  J u i − t M J u i . . . − t M J u iN   δ u i +11 δ u i +12 ... δ u i +1 N  =  F u i F u i ... F u iN  and  u i +11 u i +12 ... u i +1 N  =  u i u i ... u iN  +  δ u i +11 δ u i +12 ... δ u i +1 N  Remark 2.

One can alternatively ignore the oﬀ-diagonal entries of the block Ja-cobian to construct an approximate diagonal Jacobian. The propagation of timeinformation is then limited to the residual on the right hand side. We tested bothapproaches, with the latter approach taking marginally more iterations to conver-gence, while providing substantial ease of implementation. Unless otherwise stated,all our results are based on the latter approach.

R. DYJA, B. GANAPATHYSUBRAMANIAN, AND K.G. VAN DER ZEE Space-time formulation: Higher Order Time schemes

We next look at extending the space-time strategy to incorporate two families ofhigher order time-steppers: Θ scheme and Backward diﬀerence formula. We con-sider linear and nonlinear diﬀusion, and moreover, we also consider the treatment ofthe Allen-Cahn equation, which is a parabolic PDE with a lower-order nonlinearity,whose solution has evolving layers and for which adaptivity is particularly useful.3.1. Θ scheme: Linear equation.

The semi-discrete form of the Θ-scheme –which is a generalization of the Euler Backward scheme – is as follows:(12) ( w, u n +1 ) − ( w, u n ) + ∆ t [(1 − Θ)( ∇ w, κ ∇ u n ) + Θ( ∇ w, κ ∇ u n +1 )] =∆ t [(1 − Θ)( w, f n ) + Θ( w, f n +1 )] , for n = 0 , , ... The fully discrete matrix-vector representation is given by(13) Mu n +1 − Mu n +∆ t (1 − Θ) Ku n +∆ t Θ Ku n +1 = ∆ t (1 − Θ) f n +∆ t Θ f n +1 , for n = 0 , , ... Again, it is straightforward to group and simultaneously solve for N time stepstogether. The corresponding matrix form is expressed as:  I − M + ∆ t (1 − Θ) K M + ∆ t Θ K − M + ∆ t (1 − Θ) K M + ∆ t Θ K . . . − M + ∆ t (1 − Θ) K M + ∆ t Θ K   u u u ... u N  =  IC∆ t [(1 − Θ) f + Θ f ]∆ t [(1 − Θ) f + Θ f ]...∆ t [(1 − Θ) f N − + Θ f N ]  (14)Again, the global space-time matrix (14) has a block structure.3.2. Θ scheme: Nonlinear diﬀusion with variable coeﬃcient. The corre-sponding weak form for this case is given as:(15) ( w h , u hn +1 − u hn ∆ t ) + (1 − Θ)( ∇ w h , κ ( u hn ) ∇ u hn ) + Θ( ∇ w h , κ ( u hn +1 ) ∇ u hn +1 ) =(1 − Θ)( w h , f n ) + Θ( w h , f n +1 ) , for n = 0 , , ... The Jacobian is:(16) J u in +1 = M + ∆ t Θ (cid:18) K ( u in +1 ) + d K ( u in +1 )d u (cid:19) while the residual is given by:(17) F u in +1 = Mu in +1 − Mu n +∆ t Θ (cid:0) K ( u in +1 ) u in +1 − f n +1 (cid:1) +∆ t (1 − Θ) ( K ( u n ) u n − f n ) DAPTIVE PARALLEL-IN-SPACE-TIME FINITE ELEMENT 7

Again, it is straightforward to group and simultaneously solve for N time stepstogether. The corresponding (diagonal) block Jacobian (see Remark 2) is given as:(18)  M +∆ t Θ (cid:16) K ( u i ) + d K ( u i )d u (cid:17) M +∆ t Θ (cid:16) K ( u i ) + d K ( u i )d u (cid:17) . . . M +∆ t Θ (cid:16) K ( u iN ) + d K ( u iN )d u (cid:17)  As before, the upper index denotes Newton-Raphson iteration.3.3. Θ scheme: Allen Cahn Equation.

The Allen-Cahn equation is a semi-linear diﬀusion equation with a non-linear reaction term:(19) ∂ t u ( x , t ) − ∆ u ( x , t ) + (cid:15) − f ( u ) = 0where we set f ( u ) = u ( u − u ( x ,

0) = u along with zeroﬂux conditions in the boundaries.The corresponding semi-discrete form is given as:(20) ( w h , ∂ t u h ) + ( ∇ w h , ∇ u h ) + (cid:15) − ( w h , f ( u h )) = 0Using the Θ-scheme results in the fully discrete form:(21) ( w h , u hn +1 ) + ∆ t Θ (cid:2) ( ∇ w h , ∇ u hn +1 ) + (cid:15) − ( w h , f ( u hn +1 )) (cid:3) =( w h , u hn ) − ∆ t (1 − Θ) (cid:2) ( ∇ w h , ∇ u hn ) + (cid:15) − ( w h , f ( u hn )) (cid:3) Again, it is straightforward to group and simultaneously solve for N time stepstogether. The corresponding (diagonal) block Jacobian is given as:(22)  M + ∆ t Θ (cid:0) K ( u i )+ (cid:15) − f ( u i )d u (cid:17) M + ∆ t Θ (cid:0) K ( u i )+ (cid:15) − f ( u i )d u (cid:17) . . . M + ∆ t Θ (cid:0) K ( u iN )+ (cid:15) − f ( u iN )d u (cid:17)  Alternative schemes for the Allen-Cahn equation and other phase-ﬁeld models aredescribed in, e.g., [13].3.4.

Backward diﬀerence formula based time steppers.

BDF based time-steppers of order s utilize the solution at s previous time steps to construct thesolution at the next time step. A general s order BDF scheme is given as:(23) s (cid:88) k =0 α k u n + k = ∆ tβg n + s R. DYJA, B. GANAPATHYSUBRAMANIAN, AND K.G. VAN DER ZEE where the left hand side is the BDF scheme representation of the time derivative, ∂u∂t , in terms of the solution u i at time point i , and the right hand side collects allother terms. Here, α and β are known BDF coeﬃcients [8]. A ﬁrst order ( s = 1)BDF scheme is identical to the Euler Backward scheme described earlier. Thesimplest multi-step scheme is for s = 2 and for the linear diﬀusion equation it isgiven as:(24)( w h , u hn +2 ) −

43 ( w h , u hn +1 ) + 13 ( w h , u hn ) + 23 ∆ t ( ∇ w h , κ ∇ u hn +2 ) = 23 ∆ t ( w h , f n +2 )Again, it is straightforward to group and simultaneously solve for N time stepstogether. The corresponding space-time block equations are as follows:(25)  I − M M + ∆ t K M − M M + ∆ t K . . . M − M M + ∆ t K   u u u ... u N  =  IC∆ t f ∆ t f ... ∆ t f N  It is clear that higher order multistep methods produce block matrices that have alarger bandwidth (see Figure 3). Nonlinear problems can be treated similarly.

Figure 3.

View of nonzero entries of an elemental matrix (2Dquad with linear basis functions) for a block of space-time con-sisting of 2 timesteps + initial condition (left) and 9 timesteps +initial condition(right) with Backward diﬀerence formula of secondorder. First timestep is calculated using Euler Backward scheme.4.

Adaptive meshing for the block-space-time method: Residualbased error estimator

A central idea of this work is to develop a block space-time methodology that canbe integrated with mesh adaptivity. This will enable targeted reﬁnement of regionsthat exhibit variations in the corresponding block of time. Mesh adaptivity requiresthe deﬁnition of an indicator function that determines which regions of the spacerequire reﬁnement/coarsening. In this work, we build on prior work and utilize Note that the ﬁrst time step is approximated using a backward Euler time stepper as thesecond order BDF scheme requires knowledge of the solution at two previous time steps

DAPTIVE PARALLEL-IN-SPACE-TIME FINITE ELEMENT 9 standard residual-based error indicators; see, e.g. [4, 26, 25]. Alternative duality-based indicators for nonlinear parabolic problems are described in, e.g. [9, 10, 12].Residual based error indicators η e are constructed for each spatial element, Ω e ,and consist of two terms – an interior residual, r int , and a jump residual, r jump [26]:(26) η e = h e (cid:107) r hint (cid:107) L (Ω e ) + h e (cid:107) r hjump (cid:107) L ( ∂ Ω e ) where h e is the size of element, Ω e . For example, for the linear and quasi-lineardiﬀusion equation, the detailed derivation of the interior and jump residual is avail-able in the work of Verfurth [26]. We refer the interested reader to that work andonly show the key results here. Basically, the two terms are constructed (as thename suggests) from the deﬁnition of the residual:(27) R h ( w ) = ( w, f ) − ( w, ∂ t u h ) − ( ∇ w, κ ( u h ) ∇ u h )This residual is decomposed into element-wise terms as:(28) R h ( w ) = (cid:88) e (cid:18)(cid:90) Ω e wr h int dΩ + (cid:90) Γ e wr h jump dΓ (cid:19) where(29) r h int = f − ∂ t u h + ∇ · κ ( u h ) ∇ u h and(30) r h jump = (cid:26) ∂ Ω e ∩ Γ h κ ( u h ) ∇ u h + · n + + κ ( u h ) ∇ u h − · n − on ∂ Ω e \ Γ h where Γ h is that part of the boundary with Dirichlet conditions imposed.The interior and jump residuals for the Allen-Cahn equations are similarly deﬁnedas(31) r h int = (cid:15) − f ( u h ) − ∂ t u h + ∆ u h and(32) r h jump = (cid:26) ∂ Ω e ∩ Γ h ∇ u h + · n + + ∇ u h − · n − on ∂ Ω e \ Γ h Recall that we are using error indicators deﬁned over a block of time. We extendthe concept of error indicator deﬁned at one time step to the notion of an errorindicator deﬁned over a block of time steps. There are several choice of errorindicators with two of them being:the average value of the error indicator across the block of time(33) η e,avge = (cid:88) Block of time, B (cid:32) (cid:80) Nn =1 η e,n N (cid:33) and the maximum value in the block of time:(34) η e,max = max Block of time, B η e,n The former approach is advantageous for slow variations in the block and avoidsfrequent migrating of elements between processors, which can negatively aﬀect scal-ability. The latter approach is advantageous when there is rapid changes across afew timesteps. In this case the error estimator is not ’diﬀused’ by timesteps wheresolution is changing slowly.

Remark 3.

In this work, we approximate the semi-discrete term, ∂u∂t in terms ofits ﬁnite diﬀerence representation ( Θ or BDF scheme). An implicit assumptionis that the time steps are small enough that this approximation is valid. Ideally,one would choose a consistent representation in both space and time; i.e. usinga ﬁnite element representation for time variations [21, 15] . This provides severaladvantages in terms of mathematical elegance. We defer this development to asubsequent paper. Implementation details

We utilize our in-house scalable, parallel FEM framework that is optimized fordistributed memory computing. The FEM software library is implemented in C++and uses object oriented software principles. Linear algebra, parallel matrix andvector storage are all performed by the PETSc library [5]. PETSc modules (KSP,SNES) are used to solve (non)linear equations. The FEM library is dynamicallylinked to the Parallel Hierarchical Grid (PHG) library [2] which is a parallel meshreﬁnement framework. PHG uses a bisection type algorithm [29], speciﬁcally newestvertex bisection to reﬁne/coarsen elements . PHG operates on simplex elements andproduces conforming meshes after reﬁnement. While our existing implementationwas reasonably optimized, several software engineering principles had to be carefullyimplemented to ensure eﬃcient execution of the block space-time problems. Someof the simpler changes that had substantial impact for the space-time formulation , but had minimal impact on the existing, standard iterative formulation are: (a)moving calculations out of loops whenever possible, (b) executing calculations andstoring results in array before loops, (c) conversion to 64-bit based integer variablesfor storing matrix indices, which allows going beyond 4 billion unknowns, (d) usinglocal values instead of values obtained through pointer or reference .We made software engineering decisions to ensure that the space-time imple-mentation is compatible with our existing sequential FEM frameworks (see remark1). Basically, the space-time formulation is treated as the solution of a steady-stateproblem. Initial conditions are imposed as boundary condition on the ﬁrst degreeof freedom. We discuss key aspects of the implementation next.5.1. Memory interlacing and matrix bandwidth.

We rearrange the vectorof unknowns u to enumerate time points before looping over space. i.e. u = { u , u , ..., u N , u , u , ..., u N , ..., u kN } , where subscript refers to time and superscriptdenotes space. This allows for more eﬃcient assembly, because all data (coeﬃcientsin system of equation) for a speciﬁc element share memory locality, thus preventingcache misses. Moreover, this approach is compatible with existing frameworksfor FEM, because problems with multiple DOFs are supported in existing FEMframeworks, so we can use standard procedures for system assembling or imposingboundary conditions. This has the additional advantage of reducing the matrix In newest vertex bisection, the edge that lies in opposite to the newest node is divided even very simple changes possibly like avoiding division operations by replacing with multipli-cation, or using ’const’ to enable the compiler to automatically simplify expressions, and avoidingunnecessary conversions between integer and double values have some impact these standard best practices provide minimal improvements to sequential time stepping, andare usually not reported or ignored. the function can access values more quickly, as it does not have to ﬁrst fetch pointer, thancheck where pointer points and get value DAPTIVE PARALLEL-IN-SPACE-TIME FINITE ELEMENT 11 bandwidth. Table 1 enumerates the bandwidth for space-time formulation (2Dmesh with 100 ×

100 quad elements, linear basis function) using a Euler Backwardformulation. As expected, the bandwidth linearly increases with increasing numberof timesteps.

Table 1.

The bandwidth size for space-time formulationnumber of timestepssequential 5 10 25 50row size 10201 51005 102010 255025 510050bandwidth 205 1023 2043 5103 10203

10 100 1000 10000 CN BDF2 EB t i m e [ s ] Elliptic problem with force term aij125102550100no zeros

Figure 4.

Computational time for diﬀerent sizes of block size inblock based matrix aij format . Investigated example was the 2Dlinear diﬀusion problem using 100x100 quad elements with oneblock of 100 timesteps. Numbers in legend describes block size,the ’aij’ is sparse matrix without blocks and ’no zeroes’ is specialcase with ignored zero entries in element matrix.5.2.

Matrix access and storage.

We explored the use of block structured matrixformats that are optimized for storing the global matrix, where the block size is afunction of the number of timesteps. We use blocks of sizes ranging from 2 × N × N (where N is the number of timesteps in a block, B ). Figure 4 plots thetime required to assemble and solve a linear diﬀusion problem using the space-timeformulation (Backward Euler). We can see that diﬀerent non-zero blocks sizes cangive observable savings in computational time. Note also that each of these blocksis itself quite sparse (see Figure. 6). Careful identiﬁcation of the non-zero patterncould potentially result in much larger savings. This is illustrated in the ’no-zero’column in Figure 4 where we avoid inserting zero elements to the global matrix.The memory requirement for this case is also substantially minimized (see ﬁgure5). m e m o r y [ b y t e s ] size of blockElliptic problem with force term Figure 5.

Memory requirements for diﬀerent sizes of block sizein block based matrix aij format and sparse element matrix On theexample of elliptic problem on grid of 100x100 elements with oneblock of 100 timesteps

Figure 6.

View of full element matrix for Euler Backward (left)and compressed element matrix (right). Dark grey color meansnonzero entries in matrix.5.3.

Compressed element matrix.

Figure 6 shows the non zero patterns withinan elemental stiﬀness matrix. The size of this matrix is ( nbf × N ) × ( nbf × N ),where nbf is the number of basis functions (assuming 1 dof) and N is the numberof time steps in the block, B . Clearly, as N increases, the sparsity of the elementalstiﬀness matrix improves. This suggests using compressed formats for storing theelemental stiﬀness matrices. This issue becomes more important with increasingblock size.5.4. Scaling.

We performed scaling studies on two machines. Preliminary scalingwas performed on the TACC Stampede [3]. Stampede consists of 6400 computenodes each equipped with two Intel E5-2680 8-core processors and 32 GB of memory.Each compute node has access to 250GB of local storage and Lustre ﬁlesystem witha write performance of 150GB/s. The nodes are connected with FDR InﬁniBand

DAPTIVE PARALLEL-IN-SPACE-TIME FINITE ELEMENT 13

16 32 64 128 256 512 1024 2048 16 32 64 128 256 512 1024 2048 R e l a t i v e S peed U p t i m e [ s ] Figure 7.

Speed up on Stampede (top) and time requirements forspace-time and iterative formulations on the same machine (bot-tom)network. Scaling results are shown in Figure 7. We also performed scaling on NCSABlue Waters [1]. Blue Waters consists of 22,640 nodes, each consisting of two AMD6276 ”Interlagos” processors for total number of 362,240 computing cores. Eachnode has 64GB of memory, which gives total system memory equal to 1.476PB.Compute nodes have access to storage with Lustre ﬁlesystem that has size 26.4 PBand oﬀers total bandwith of more than 1TB/s.Figures 8, 9, 10 show scaling for the linear, nonlinear and Allen-Cahn equationsrespectively. We show good scaling upto 131,072 processors. The largest prob-lem size used was ∼ . mesh with 10timesteps per block. The results illustrate reasonably good scaling for the space-time approach, and Figure 7 suggests that for larger number of processors, thespace-time approach performs better than the iterative approach. Speciﬁcally, be-yond a threshold number of processors, the total time to solve a given problem is

16 64 256 1024 4096 16384 65536 262144 16 64 256 1024 4096 16384 65536 262144 R e l a t i v e S peed U p *10200 *10400 *10800 *10 Figure 8.

Speed up on Blue Waters for linear time dependent diﬀusion

16 64 256 1024 4096 16384 65536 262144 16 64 256 1024 4096 16384 65536 262144 R e l a t i v e S peed U p *10200 *10400 *10550 *10 Figure 9.

Speed up on Blue Waters for nonlinear time dependentdiﬀusion (nonlinearity induced by diﬀusion coeﬃcient dependingon u )lower for the space time approach compared to the iterative approach. We antic-ipate that for more complex problems (complex geometries, remeshing, I/O) thespace-time approach may perform even better.6. Numerical Examples

We illustrate our adaptive parallel-in-space-time framework on linear and non-linear diﬀusion equation and the Allen-Cahn equation.6.1.

Problem A – Linear Diﬀusion.

We ﬁrst consider the linear case. We set κ = 1. We use the method of manufactured solutions to construct the forcing termin (1) to ensure an analytical solution, u :(35) u ( x, y, z, t ) = exp ( − α ( x, y, z, t )) DAPTIVE PARALLEL-IN-SPACE-TIME FINITE ELEMENT 15

16 64 256 1024 4096 16384 65536 262144 16 64 256 1024 4096 16384 65536 262144 R e l a t i v e S peed U p *10200 *10400 *10550 *10 Figure 10.

Speed up on Blue Waters Supercomputer for Allen-Cahn problemwhere α ( x, y, z, t ) is equal to:(36) α ( x, y, z, t ) = ( x − x ( t )) + ( y − y ( t )) + ( z − z ( t )) d and x ( t ) = a cos( ωt ) + by ( t ) = a sin( ωt ) + bz ( t ) = b with a = 0 . b = 0 . ω = 2 π , d = 0 .

1. Thus, u is a rotating exponential hillcentered at the midplane and rotating with a radius of 0.2 and an angular speed of2 π . This manufactured solution is constructed by the following forcing term(37) f ( x , t ) = − (cid:18) x − x ) + ( y − y ) + ( z − z ) ) d − d −− aω (cid:18) cos( tω )( y − y ) − sin( tω )( x − x ) d (cid:19)(cid:19) exp ( − α ( x, y, z, t ))The equation is solved in a region of dimensions [0 : 1] × [0 : 1] × [0 : 1]. Everyboundary face of the region has an essential boundary condition with prescribedvalue of u equal to the value computed from (35). We use a time step of 0.01 andsolve for a block of 100 timesteps.Figure 11 shows a cutoﬀ of the reﬁned spatial mesh after 20 reﬁnement iterations.We remind the reader that each iteration consists of solving the space-time problem,constructing the time-averaged elemental reﬁnement indicators, equation (33), andthen reﬁning the 3D mesh according to the indicators. Given that this is a movingsource problem, it is clearly seen that there is reﬁnement in regions where the sourcehas passed through.We compare spatial convergence rates for three implementations: (a) sequen-tial time stepping with no spatial adaptivity, (b) space-time implementation withno spatial adaptivity, and (c) space-time implementation with spatial adaptivity. Figure 11.

Reﬁned mesh: view from top (left) and front (right). l a s t || e || L2 h ‾ grid convergence for space-time formulation, dt=0.01iterspace-timespace-time, adapt Figure 12.

Grid convergence for linear heat equation. Error is (cid:107) u − u h (cid:107) from last time-step versus average element size ¯h. l a s t || e || L2 dttime convergence for space-time formulation (linear problem)EBBDF2 Figure 13.

Time convergence for linear heat equation. Error is (cid:107) u − u h (cid:107) from last time-step versus size of time-step.We plot convergence in Figure 12, where error is (cid:107) u − u h (cid:107) . The ﬁrst two imple-mentations (obviously) overlap, with the adaptive mesh implementation showing a DAPTIVE PARALLEL-IN-SPACE-TIME FINITE ELEMENT 17 reduced error. All three curves show a slope of 2, which is to be expected. The Fig-ure 13 shows time-step convergence, with expected slopes of 1 and 2 for BackwardEuler and Backward Diﬀerence Formulae, respectively.6.2.

Problem B – Nonlinear Diﬀusion.

For the nonlinear case, we set thecoeﬃcient κ ( u ) = 1 + 10 u and we choose an exact solution such that | u | ≤

1, thusbounding κ . As before, we choose our analytical solution to be given by (35). Theforcing term, f , is consequently:(38) f ( x , t ) =80 ud (cid:0) ( x − x ) + ( y − y ) + ( z − z ) (cid:1) exp ( − α ( x, y, z, t ))++ (1 + 10 u ) (cid:18) x − x ) + ( y − y ) + ( z − z ) ) d − d (cid:19) exp ( − α ( x, y, z, t )) −− aω cos( tω )( y − y ) − sin( tω )( x − x ) d exp ( − α ( x, y, z, t ))Figure. 14 plots several time snapshots of the moving non-linear source problem,with the solution accurately tracking the moving source. In Figure. 15, we plotspatial convergence for three implementations: (a) sequential time stepping with nospatial adaptivity, (b) space-time implementation with no spatial adaptivity, and(c) space-time implementation with spatial adaptivity (plotted with the averageelement size). Convergence rates follow along expected lines, with a slope of 2.6.3. Problem C – Allen-Cahn.

In this ﬁnal example we solve the modiﬁed Allen-Cahn problem. The key physics (i.e. phase change) which is described by thisnon-linear equation essentially occurs in a highly localized region of the domain (ona surface of co-dimension 1). Adaptive reﬁnement (and coarsening) has been a veryeﬀective approach to accurately resolve this localized region. The equation is givenas ∂u∂t = − D (cid:0) f ( u ) − C n ∇ u (cid:1) , where(39) f ( u ) = 2 Au (1 − u + 2 u ) − k and D = 1, C n = 0 . A = 16, k = 0 .

1. The initial conditions are(40) u ( x ) = 0 . .  r − . (cid:113) A C n  , r = (cid:112) x + y + z with zero ﬂux conditions on all boundaries. This represents an initial solid of radius, r = 0 .

5, that is melting. Using symmetry arguments, we consider a single octantof the space ([0 : 1] × [0 : 1] × [0 : 1]). We consider a time step of 0.02 and a blocksize of 50 time-steps. Figure 16 shows snapshots of the melting sphere at varioustime points.Figure 17 illustrates adaptive mesh reﬁnement across the time block. The ﬁgureshows the reﬁned mesh for the ﬁrst block of time ( t = 0 to t = 1) Note thatreﬁnement is localized along the solidiﬁcation front within the time block . To capturethis thin interface using a uniform mesh would have required 589 824 elements,instead of the 137 776 elements that were used here. (a) t=0s (b) t=0.25s (c) t=0.75s (d) t=1s Figure 14.

Results of computations where upper half of compu-tational domain is removed. Values of u change in time from 0s(a) to 1s (d) l a s t || e || L2 h ‾ grid convergence for space-time formulation, dt=0.01, nonlineartet, itertet, adapttet Figure 15.

Grid convergence for nonlinear heat equation. Erroris (cid:107) u − u h (cid:107) from last time-step versus average element size ¯h. DAPTIVE PARALLEL-IN-SPACE-TIME FINITE ELEMENT 19 (a) t=0s (b) t=1s (c) t=3s (d) t=5s

Figure 16.

Results of computations for the 3D Allen-Cahn prob-lem. Values of u for initial condition (a), after 1s (b), 3s (c) and5s (d) Y XZ

Figure 17.

Finite element mesh after 20 reﬁnement iterationswith superimposed solution at time 1 s at the dense part of mesh7. Conclusion

We present formulation, implementation details and representative examples ofa parallel-in-space-time based adaptive methodology for the solution of (linear and)non-linear time dependent problems. The basic concept is to solve for large blocks ofspace-time unknowns instead of marching sequentially in time. The methodology isa combination of a computationally eﬃcient implementation of a parallel-in-space-time ﬁnite element solver coupled with a posteriori space-time error estimates and a parallel mesh generator. We illustrate how this implementation is especially tailoredfor massively parallel computations. We show good scaling behavior up to 150,000processors on the Blue Waters machine. This methodology enables scaling on nextgeneration multi-core machines by simultaneously solving for large number of time-steps, and reduces computational overhead by reﬁning spatial blocks that can tracklocalized features. This methodology also opens up the possibility of eﬃcientlyincorporating adjoint equations for error estimators and inverse design problems,since blocks of space-time are simultaneously solved and stored in memory. Ourfuture work is focused on extending the space-time framework to utilizing ﬁniteelement basis functions in time (which enables formal derivation of space-time aposteriori error estimates), and subsequently implementing 4D ﬁnite elements toenable simultaneous space and time adaptivity.

References Blue waters user portal | system summary .2. Phg (parallel hierarchical grid) .3.

Texas advanced computing center - stampede technical details .4. Mark Ainsworth and J. Tinsley Oden,

A posteriori error estimation in ﬁnite element analysis ,John Wiley & Sons, New York, 2000.5. Satish Balay, William D. Gropp, Lois Curfman McInnes, and Barry F. Smith,

Eﬃcient man-agement of parallelism in object oriented numerical software libraries , Modern Software Toolsin Scientiﬁc Computing (E. Arge, A. M. Bruaset, and H. P. Langtangen, eds.), Birkh¨auserPress, 1997, pp. 163–202.6. Marek Behr,

Simplex space-time meshes in ﬁnite element simulations , Int. J. Numer. Meth.Fluids (2008), no. 9, 1421–1434.7. Kevin Burrage,

10. parallel methods for systems of ordinary diﬀerential equations , Applica-tions on Advanced Architecture Computers, 1996, pp. 101–120.8. John C. Butcher,

Numerical methods for ordinary diﬀerential equations , John Wiley & Sons,West Sussex, 2008.9. V. Carey, D. Estep, A. Johansson, M. Larson, and S. Tavener,

Blockwise adaptivity for timedependent problems based on coarse scale adjoint solutions , SIAM J. Sci. Comput. (2010),no. 4, 2121–2145.10. G. S¸im¸sek, X. Wu, K.G. van der Zee, and E.H. van Brummelen, Duality-based two-level errorestimation for time-dependent pdes: Application to linear and nonlinear parabolic equations ,Computer Meth. Appl. Mech. Engrg. (2015), 83–109.11. Bert J. Debusschere, Habib N. Najm, Philippe P. Pbay, Omar M. Knio, Roger G. Ghanem,and Olivier P. Le Matre,

Numerical challenges in the use of polynomial chaos representationsfor stochastic processes , SIAM J. Sci. Comput. (2004), no. 2, 698–719.12. K. Eriksson, C. Johnson, and A. Logg, Adaptive computational methods for parabolic prob-lems , Encyclopedia of Computational Mechanics, in: Fundamentals (E. Stein, R. de Borst,and T.J.R. Hughes, eds.), John Wiley & Sons, Ltd., 2004, pp. 675–702.13. Hector Gomez and Kristoﬀer G. van der Zee,

Computational phase-ﬁeld modeling , Encyclo-pedia of Computational Mechanics, Second Edition, John Wiley & Sons, Ltd., (In Press).14. Jeﬀrey Hittinger, Sven Leyﬀer, and Jack Dongarra,

Models and algorithms for exascale com-puting pose challenges for applied mathematicians , SIAM News, 2013.15. Thomas J. R. Hughes and G. M. Hulbert,

Space-time ﬁnite element methods for elastody-namics: Formulations and error estimates , Comput. Methods Appl. Mech. Eng. (1988),no. 3, 339–363.16. Thomas J.R. Hughes and James R. Stewart, A space-time formulation for multiscale phe-nomena , J. Comput. Appl. Math. (1996), no. 1, 217–229.17. J. Lions, Yvon Maday, and Gabriel Turinici, A”parareal”in time discretization of pde’s , C. R.Acad. Sci. Series I Mathematics (2001), no. 7, 661–668.

DAPTIVE PARALLEL-IN-SPACE-TIME FINITE ELEMENT 21

18. Robert B. Lowrie, Philip L. Roe, and Bram van Leer,

Space-time methods for hyperbolic con-servation laws , Barriers and Challenges in Computational Fluid Dynamics (V. Venkatakrish-nan, Manuel D. Salas, and Sukumar R. Chakravarthy, eds.), Springer Netherlands, Dordrecht,1998, pp. 79–98.19. Karthik Mani and Dimitri Mavriplis,

Eﬃcient solutions of the euler equations in a time-adaptive space-time framework , 49th AIAA Aerospace Sciences Meeting including the NewHorizons Forum and Aerospace Exposition, American Institute of Aeronautics and Astronau-tics, 2011.20. Akil Narayan and Dongbin Xiu,

Stochastic collocation methods on unstructured grids in highdimensions via interpolation , SIAM J. Sci. Comput. (2012), no. 3, A1729–A1752.21. J. P. Pontaza and J. N. Reddy, Space-time coupled spectral/hp least-squares ﬁnite elementformulation for the incompressible navier-stokes equations , J. Comput. Phys. (2004),no. 2, 418–459.22. Thomas C. S. Rendall, Christian B. Allen, and Edward D. C. Power,

Conservative unsteadyaerodynamic simulation of arbitrary boundary motion using structured and unstructuredmeshes in time , Int. J. Numer. Meth. Fluids (2012), no. 12, 1518–1542.23. Christian Soize and Roger Ghanem, Physical systems with random uncertainties: Chaos rep-resentations with arbitrary probability measure , SIAM J. Sci. Comput. (2004), no. 2, 395–410.24. Tayfun E. Tezduyar, Sunil Sathe, Ryan Keedy, and Keith Stein, Spacetime ﬁnite element tech-niques for computation of ﬂuidstructure interactions , Computer Meth. Appl. Mech. Engrg. (2006), no. 17–18, 2002–2027.25. R¨udiger Verf¨urth,

A review of a posteriori error estimation and adaptive mesh-reﬁnementtechniques , Wiley-Teubner, New York and Stuttgart, 1996.26. ,

A posteriori error estimation techniques for ﬁnite element methods , Oxford Univer-sity Press, Oxford, 2008.27. Xiaoliang Wan and George Em Karniadakis,

Multi-element generalized polynomial chaos forarbitrary probability measures , SIAM J. Sci. Comput. (2006), no. 3, 901–928.28. Luming Wang and Per-Olof Persson, A high-order discontinuous galerkin method with un-structured spacetime meshes for two-dimensional compressible ﬂows on domains with largedeformations , Comput Fluids (2015), 53–68.29. Lin-Bo Zhang,

A parallel algorithm for adaptive local reﬁnement of tetrahedral meshes usingbisection , Numer. Math. Theor. Meth. Appl.2