Fast solution of fully implicit Runge-Kutta and discontinuous Galerkin in time for numerical PDEs, Part I: the linear setting
Ben S. Southworth, Oliver Krzysik, Will Pazner, Hans De Sterck
January 8, 2021
Abstract
Fully implicit Runge-Kutta (IRK) methods have many desirable properties as time integration schemes in terms of accuracy and stability, but are rarely used in practice with numerical PDEs due to the difficulty of solving the stage equations. This paper introduces a theoretical and algorithmic framework for the fast, parallel solution of the systems of equations that arise from IRK methods applied to linear numerical PDEs (without algebraic constraints). This framework also naturally applies to discontinuous Galerkin discretizations in time. The new method can be used with arbitrary existing preconditioners for backward Euler-type time stepping schemes, and is amenable to the use of three-term recursion Krylov methods when the underlying spatial discretization is symmetric. Under quite general assumptions on the spatial discretization that yield stable time integration, the preconditioned operator is proven to have conditioning $\sim O(1)$, with only weak dependence on the number of stages/polynomial order; for example, the preconditioned operator for 10th-order Gauss integration has condition number less than two. The new method is demonstrated to be effective on various high-order finite-difference and finite-element discretizations of linear parabolic and hyperbolic problems, demonstrating fast, scalable solution of up to 10th-order accuracy. In several cases, the new method can achieve 4th-order accuracy using Gauss integration with roughly half the number of preconditioner applications as required using standard SDIRK techniques.

∗ BSS was supported by Lawrence Livermore National Laboratory under contract B639443, and as a Nicholas C. Metropolis Fellow under the Laboratory Directed Research and Development program of Los Alamos National Laboratory. OAK acknowledges the support of an Australian Government Research Training Program (RTP) Scholarship.
† Theoretical Division, Los Alamos National Laboratory, U.S.A. ([email protected]), http://orcid.org/0000-0002-0283-4928
‡ School of Mathematics, Monash University, Australia ([email protected]), https://orcid.org/0000-0001-7880-6512
§ Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, U.S.A. ([email protected])
¶ Department of Applied Mathematics, University of Waterloo, Waterloo, Canada ([email protected])

1 Introduction
Consider the method-of-lines approach to the numerical solution of linear partial differential equations (PDEs), where we discretize in space and arrive at a system of ordinary differential equations (ODEs) in time,
$$M u'(t) = \mathcal{L}(t) u + f(t) \quad \text{in } (0, T], \qquad u(0) = u_0,$$
where $M$ is a mass matrix, $\mathcal{L}(t) \in \mathbb{R}^{N \times N}$ a discrete linear operator, and $f(t)$ a time-dependent forcing function. Then, consider time propagation using an $s$-stage Runge-Kutta scheme, characterized by the Butcher tableau
$$\begin{array}{c|c} c & A \\ \hline & b^T \end{array},$$
with Runge-Kutta matrix $A = (a_{ij})$, weight vector $b = (b_1, \ldots, b_s)^T$, and quadrature nodes $c = (c_1, \ldots, c_s)$. Runge-Kutta methods update the solution using a sum over stage vectors,
$$u_{n+1} = u_n + \delta t \sum_{i=1}^s b_i k_i, \tag{1}$$
$$M k_i = \mathcal{L}(t_n + \delta t c_i)\Big(u_n + \delta t \sum_{j=1}^s a_{ij} k_j\Big) + f(t_n + \delta t c_i). \tag{2}$$
The stage vectors $\{k_i\}$ can then be expressed as the solution of the block linear system
$$\left(\begin{bmatrix} M & & \\ & \ddots & \\ & & M \end{bmatrix} - \delta t \begin{bmatrix} a_{11}\mathcal{L}_1 & \cdots & a_{1s}\mathcal{L}_1 \\ \vdots & \ddots & \vdots \\ a_{s1}\mathcal{L}_s & \cdots & a_{ss}\mathcal{L}_s \end{bmatrix}\right) \begin{bmatrix} k_1 \\ \vdots \\ k_s \end{bmatrix} = \begin{bmatrix} f_1 \\ \vdots \\ f_s \end{bmatrix}, \tag{3}$$
where $\mathcal{L}_i := \mathcal{L}(t_n + \delta t c_i)$ and $f_i := f(t_n + \delta t c_i) + \mathcal{L}(t_n + \delta t c_i) u_n$. Primarily, this paper focuses on spatial operators $\mathcal{L}$ that are independent of time; however, some of the results hold for commuting operators, such as those that may arise from, for example, a time-dependent reaction term, so for now we maintain this generality.

The difficulty in fully implicit Runge-Kutta methods (which we will denote IRK) lies in solving the $Ns \times Ns$ block linear system in (3). This paper focuses on the parallel simulation of numerical PDEs, where $N$ is typically very large and $\mathcal{L}$ is highly ill-conditioned. In such cases, direct solution techniques to solve (3) are not a viable option, and fast, parallel iterative methods must be used. However, IRK methods are rarely employed in practice due to the difficulties of solving (3).
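To make the structure of (1)–(3) concrete, the following sketch assembles and solves the stage system for a time-independent operator with the 2-stage (4th-order) Gauss method. The spatial operator (a scaled 1D Laplacian with $M = I$), grid size, and step count are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# 2-stage Gauss (4th-order) Butcher data
A = np.array([[1/4, 1/4 - np.sqrt(3)/6],
              [1/4 + np.sqrt(3)/6, 1/4]])
b = np.array([1/2, 1/2])
s = 2

# Illustrative spatial operator: scaled 1D Laplacian on (0, 1), M = I
N = 32
L = (N + 1)**2 * (-2*np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1))

def irk_step(u, dt):
    # Stage system (3) with time-independent L and no forcing:
    # (I - dt * kron(A, L)) k = [L u_n; ...; L u_n]
    lhs = np.eye(s * N) - dt * np.kron(A, L)
    k = np.linalg.solve(lhs, np.tile(L @ u, s)).reshape(s, N)
    return u + dt * (b @ k)          # update (1)

# First discrete Laplacian eigenmode: the exact ODE solution is e^{mu t} u0
x = np.arange(1, N + 1) / (N + 1)
u0 = np.sin(np.pi * x)
mu = -4 * (N + 1)**2 * np.sin(np.pi / (2 * (N + 1)))**2

dt, nt = 0.01, 10
u = u0.copy()
for _ in range(nt):
    u = irk_step(u, dt)
u_exact = np.exp(mu * dt * nt) * u0
```

Since the initial condition is an exact eigenvector of the discrete Laplacian, the only error is from the time discretization, which is tiny at 4th order for this step size.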
Even for relatively simple parabolic PDEs where $-\mathcal{L}$ is symmetric positive definite (SPD), (3) is a large nonsymmetric matrix with significant block coupling. For nonsymmetric matrices $\mathcal{L}$ that already have variable coupling, traditional iterative methods are even less likely to yield acceptable performance in solving (3).

Note, PDEs with an algebraic constraint, such as the divergence-free constraint in the Stokes equations, instead yield a differential algebraic equation (DAE), which requires separate careful treatment, and is addressed in a companion paper along with nonlinearities [49].

1.2 Discontinuous Galerkin in time

Discontinuous Galerkin (DG) in time discretizations of systems of linear ODEs give rise to linear algebraic systems of the form
$$\left(\begin{bmatrix} \delta_{11} M & \cdots & \delta_{1s} M \\ \vdots & \ddots & \vdots \\ \delta_{s1} M & \cdots & \delta_{ss} M \end{bmatrix} - \delta t \begin{bmatrix} m_{11}\mathcal{L}_1 & \cdots & m_{1s}\mathcal{L}_1 \\ \vdots & \ddots & \vdots \\ m_{s1}\mathcal{L}_s & \cdots & m_{ss}\mathcal{L}_s \end{bmatrix}\right)\begin{bmatrix} u_1 \\ \vdots \\ u_s \end{bmatrix} = \begin{bmatrix} r_1 \\ \vdots \\ r_s \end{bmatrix}. \tag{4}$$
The coefficients $m_{ij}$ correspond to a temporal mass matrix, the coefficients $\delta_{ij}$ correspond to a DG weak derivative with upwind numerical flux, and the unknowns $u_i$ are the coefficients of the polynomial expansion of the approximate solution (for example, see [1, 27, 31, 47]). Both of the coefficient matrices $\{m_{ij}\}$ and $\{\delta_{ij}\}$ are invertible. It can be seen that the algebraic form of the DG in time discretization is closely related to the implicit Runge-Kutta system (3) and, in fact, (4) can be recast in the form of (3) using the invertibility of the matrix $\{\delta_{ij}\}$. In particular, the degree-$p$ DG method using $(p+1)$-point Radau quadrature, which is exact for polynomials of degree $2p$, is equivalent to the Radau IIA collocation method [31], which is used for many of the numerical results in Section 3. Thus, although the remainder of this paper focuses on fully implicit Runge-Kutta, the algorithms developed here can also be applied to DG discretizations in time on fixed slab-based meshes.
This paper develops fast, parallel preconditioning techniques for the solution of fully implicit Runge-Kutta methods and DG discretizations in time for linear numerical PDEs. First, Section 1.4 provides background on why IRK methods are desirable over the simpler and more commonly used diagonally implicit Runge-Kutta (DIRK) methods, and also provides some historical context for the preconditioners developed in this work. Section 1.5 then briefly discusses stable integration from a method-of-lines perspective and introduces two key elements that will be used throughout the paper.

Section 2 introduces a new method to solve for the IRK update (1) for linear operators $\mathcal{L}$ that are independent of time. This new method requires the preconditioning of $s$ real-valued matrices of the form $\eta M - \delta t \mathcal{L}$ for some $\eta > 1$, analogous to the matrices that arise in backward Euler integration, and is easily implemented using existing preconditioners and parallel software libraries. In contrast to other works that have considered the preconditioning of (3), the proposed algorithm here (i) is amenable to short-term Krylov recursion (conjugate gradient (CG)/MINRES) if $\eta M - \delta t \mathcal{L}$ is, and (ii) only operates on the solution, thus not requiring the storage of each stage vector. Theory is developed in Section 2.2 that guarantees $O(1)$ conditioning of the preconditioned operator under basic assumptions on stability from Section 1.5, with only weak dependence on the number of stages or polynomial order. For example, the preconditioned operator for 10th-order Gauss integration has condition number less than two.

Numerical results are provided in Section 3, demonstrating the new method for a variety of problems and corresponding preconditioners, including very high-order finite-difference and discontinuous Galerkin spatial discretizations of advection-diffusion equations, and matrix-free continuous Galerkin discretizations of diffusion equations. The method is shown to be fast and scalable up to 10th-order accuracy in time, effective on fully advective (hyperbolic) problems, and, for multiple examples, can obtain 4th-order accuracy with Gauss integration using roughly half as many preconditioning iterations as needed by standard 4th-order SDIRK schemes.

1.4 Motivation and previous work

Diagonally implicit Runge-Kutta (DIRK) methods, where $A$ is lower triangular, are commonly used in practice. For such schemes, the solution of (3) using a block substitution algorithm requires only $s$ linear solves of systems of the form $M - \delta t\, a_{ii} \mathcal{L}_i$. Unfortunately, DIRK schemes suffer from order reduction, where the order of accuracy observed in practice on stiff nonlinear PDEs or DAEs can be limited to $\approx \min\{p, q+1\}$ or $q$, respectively, for integration order $p$ and stage order $q$ [18, 25].
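The block-substitution solve for lower-triangular $A$ described above can be sketched as follows; the 2-stage, L-stable SDIRK tableau ($\gamma = 1 - \sqrt{2}/2$) and the setting $M = I$ are illustrative assumptions for this sketch.

```python
import numpy as np

g = 1 - np.sqrt(2) / 2                      # 2-stage, L-stable SDIRK coefficient
A = np.array([[g, 0.0], [1 - g, g]])
b = np.array([1 - g, g])
s = 2

def dirk_step(L, u, dt):
    # Block forward substitution on (3): one solve with (M - dt*a_ii*L)
    # per stage; here M = I and L is time-independent.
    N = len(u)
    k = np.zeros((s, N))
    for i in range(s):
        rhs = L @ (u + dt * sum(A[i, j] * k[j] for j in range(i)))
        k[i] = np.linalg.solve(np.eye(N) - dt * A[i, i] * L, rhs)
    return u + dt * b @ k

# Dahlquist check: u' = -u, u(0) = 1, integrated to t = 1
L = np.array([[-1.0]])
u = np.array([1.0])
for _ in range(100):
    u = dirk_step(L, u, 0.01)
```

Each stage reuses the same shifted operator (since $a_{11} = a_{22} = \gamma$ for SDIRK), which is exactly why these schemes are attractive with existing backward Euler-type preconditioners.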
The stage order of a DIRK method is at most one (EDIRK methods, with one explicit stage, have a maximum stage order of two) and, thus, even a 6th-order DIRK method may only yield first- or second-order accuracy [9]. In contrast, IRK methods may have arbitrarily high stage order and, thus, formally high-order accuracy on stiff, nonlinear problems, and even index-2 DAEs. Although the focus of this paper is linear PDEs without algebraic constraints, we want to highlight that the theory and framework developed here are fundamental to a companion paper on nonlinear PDEs and DAEs [49]. Furthermore, for less stiff problems, IRK methods can yield accuracy as high as order $2s$ for an $s$-stage method, compared with a maximum of $s$ or $s+1$ for SDIRK methods with reasonable stability properties [18, Section IV.6]. Multistep methods can overcome some of the accuracy constraints of SDIRK methods, but implicit multistep methods cannot be A-stable and greater than order two, which is limiting when considering advection-dominated or hyperbolic problems, where the field of values often pushes up against the imaginary axis. Furthermore, for problems where symplectic integration is desirable for conservation, neither linear multistep nor explicit Runge-Kutta methods can be symplectic [19]. Although DIRK methods can be symplectic, they are limited to at most 4th order and, moreover, known methods above second order are impractical due to negative diagonal entries of $A$ (leading to a negative shift rather than a positive shift of the spatial discretization) [25]. Thus, even moderate-order symplectic integration requires IRK methods.

One simplification for using IRK methods is to assume $\mathcal{L}_i = \mathcal{L}_j$ for all $i, j$, in which case (3) can be expressed in Kronecker product form,
$$(I \otimes M - \delta t\, A \otimes \mathcal{L})\, k = f. \tag{5}$$
Such an assumption is natural for linear problems with no time-dependent differential components and, for the most part, covers the type of problems studied in this paper. Many papers have considered the solution of (5), with Butcher's [10] and Bickart's [6] being some of the earliest works, which develop ways to transform (5) to a simpler form. There, and in many of the works that followed, the goal was to minimize the cost of LU decompositions used to solve (5), typically in the context of ODEs.

For large-scale simulation of PDEs, particularly on modern computing architectures, LU decompositions (or other direct factorizations) are typically not feasible. In this vein, a number of people have considered preconditioning techniques for (5), or approximations to (5) on the nonlinear iteration level or time discretization level. Various block preconditioning/approximation techniques have been studied, primarily for parabolic problems [20, 21, 22, 23, 34, 36, 50], and multigrid methods for IRK and parabolic problems were developed in [28]. New ADI-type preconditioners for IRK methods were developed for parabolic problems in [12], with provably bounded spectral radius.

In Kronecker form (5), SIRK methods [37] are also relatively straightforward to solve using existing preconditioning techniques. But, although SIRK methods offer some advantages over DIRK methods, they still lack the favorable stability and accuracy properties of IRK methods [8, 39].
Remark 1 (Growing interest in IRK). It is worth pointing out that, while writing this paper, at least three preprints have been posted online studying the use of IRK methods for numerical PDEs. Two papers develop new block preconditioning techniques for parabolic PDEs [24, 43], and one focuses on a high-level numerical implementation of IRK methods with the Firedrake package [16].
Throughout the paper, we use the reformulation used in, for example, [42], where we can pull an $A \otimes I$ out of the fully implicit system in (3), yielding the equivalent problem
$$\left(A^{-1} \otimes M - \delta t \begin{bmatrix} \mathcal{L}_1 & & \\ & \ddots & \\ & & \mathcal{L}_s \end{bmatrix}\right)(A \otimes I)\begin{bmatrix} k_1 \\ \vdots \\ k_s \end{bmatrix} = \begin{bmatrix} f_1 \\ \vdots \\ f_s \end{bmatrix}. \tag{6}$$
The off-diagonal block coupling in (6) now consists of mass matrices rather than differential operators, which makes the analysis and solution more tractable. The algorithms developed here depend on the eigenvalues of $A$ and $A^{-1}$, leading to our first assumption.

Assumption 1.
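The algebraic equivalence of (3)/(5) and the reformulation (6) is easy to check numerically; the sketch below uses a random invertible Butcher-like matrix and a random stand-in spatial operator (both illustrative assumptions), with $M = I$ and a time-independent operator.

```python
import numpy as np

rng = np.random.default_rng(0)
s, N = 3, 5
A = np.eye(s) + rng.random((s, s)) / 10    # hypothetical invertible Butcher-like matrix
L = rng.random((N, N))                     # stand-in spatial operator, M = I
dt = 0.1

# Form (3)/(5): I ⊗ M - dt * A ⊗ L
lhs_kron = np.eye(s * N) - dt * np.kron(A, L)

# Form (6): (A^{-1} ⊗ M - dt * blockdiag(L, ..., L)) (A ⊗ I)
lhs_reform = (np.kron(np.linalg.inv(A), np.eye(N))
              - dt * np.kron(np.eye(s), L)) @ np.kron(A, np.eye(N))
```

Multiplying out, $(A^{-1}\otimes M)(A\otimes I) = I\otimes M$ and $(I\otimes \mathcal{L})(A\otimes I) = A\otimes \mathcal{L}$, so the two forms agree exactly.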
Assume that all eigenvalues of $A$ (and equivalently $A^{-1}$) have positive real part.

Recall that if an IRK method is A-stable, irreducible, and $A$ is invertible (which includes DIRK, Gauss, Radau IIA, and Lobatto IIIC methods, among others), then Assumption 1 holds [18]; that is, Assumption 1 is straightforward to satisfy in practice.

Stability must be taken into consideration when applying ODE solvers within a method-of-lines approach to numerical PDEs. The Dahlquist test problem extends naturally to this setting, where we are interested in the stability of the linear operator $\mathcal{L}$, for the ODE(s) $u'(t) = \mathcal{L} u$, with solution $e^{t\mathcal{L}} u_0$. A necessary condition for stability is that the eigenvalues of $\mathcal{L}$ lie within distance $O(\delta t)$ of the region of stability for the Runge-Kutta scheme of choice (e.g., see [44]). Here we are interested in implicit schemes and, because most implicit Runge-Kutta schemes used in practice are A- or L-stable, an effectively necessary condition for stability is that the real parts of the eigenvalues of $\mathcal{L}$ be nonpositive. For normal operators, this requirement ends up being a necessary and sufficient condition for stability.

For non-normal or non-diagonalizable operators, the analysis is more complicated. One of the best-known works on the subject is by Reddy and Trefethen [44], where necessary and sufficient conditions for stability are derived as the $\varepsilon$-pseudo-eigenvalues of $\mathcal{L}$ being within $O(\varepsilon) + O(\delta t)$ of the stability region as $\varepsilon, \delta t \to 0$. Here we relax this assumption to something that is more tractable to work with by noting that the $\varepsilon$-pseudo-eigenvalues are contained within the field of values to $O(\varepsilon)$ [51, Eq. (17.9)], where the field of values is defined as
$$W(\mathcal{L}) := \{\langle \mathcal{L} x, x \rangle : \|x\| = 1\}. \tag{7}$$
This motivates the following assumption for the analysis done in this paper:

Assumption 2.
Let $\mathcal{L}$ be the linear spatial operator, and assume that $W(\mathcal{L}) \le 0$ (that is, $W(\mathcal{L})$ is a subset of the left half-plane, including the imaginary axis).

It should be noted that the field of values has an additional connection to stability. From [51, Theorem 17.1], we have that $\|e^{t\mathcal{L}}\| \le 1$ for all $t \ge 0$ if and only if $W(\mathcal{L}) \le 0$, with analogous characterizations of uniform bounds $\|e^{t\mathcal{L}}\| \le C$ for all $t \ge 0$. In practice, Assumption 2 often holds when simulating numerical PDEs, and in Section 2.2 it is proven that Assumptions 1 and 2 guarantee that the methods proposed here yield $O(1)$ conditioning of the preconditioned operator. It should also be noted that $\mathcal{L}$ need not be nonsingular.

For ease of notation, let us scale both sides of (6) by a block-diagonal operator with diagonal blocks $M^{-1}$, and let
$$\hat{\mathcal{L}}_i := \delta t\, M^{-1} \mathcal{L}_i, \quad \text{for } i = 1, \ldots, s.$$
Now let $\alpha_{ij}$ denote the $ij$-element of $A^{-1}$. Then, solving (6) can be effectively reduced to inverting the operator
$$\mathcal{M}_s := A^{-1} \otimes I - \begin{bmatrix} \hat{\mathcal{L}}_1 & & \\ & \ddots & \\ & & \hat{\mathcal{L}}_s \end{bmatrix} = \begin{bmatrix} \alpha_{11} I - \hat{\mathcal{L}}_1 & \alpha_{12} I & \cdots & \alpha_{1s} I \\ \alpha_{21} I & \alpha_{22} I - \hat{\mathcal{L}}_2 & \cdots & \alpha_{2s} I \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_{s1} I & \cdots & \cdots & \alpha_{ss} I - \hat{\mathcal{L}}_s \end{bmatrix}. \tag{8}$$
We proceed by deriving a closed-form inverse of (8), demonstrating how the Runge-Kutta update in (1) can then be performed directly (without forming and saving each stage vector), and developing a preconditioning strategy to apply this update using existing preconditioners.

This section introduces a result similar to Bickart's [6], but using a different framework and generalized to hold for commuting operators. We consider $\mathcal{M}_s$ as a matrix over the commutative ring of linear combinations of $\{I, \hat{\mathcal{L}}\}$, and the determinant and adjugate referred to in Lemma 1 are defined over matrix-valued elements rather than scalars. For the interested reader, see [7] for details on matrices and the corresponding linear algebra when matrix elements are defined over a space of commuting matrices.

Lemma 1.
Let $\alpha_{ij}$ denote the $(i,j)$th entry of $A^{-1}$ and assume $\{\hat{\mathcal{L}}_i\}_{i=1}^s$ are commuting operators. Define $\mathcal{M}_s$ as in (8), let $\det(\mathcal{M}_s)$ be the determinant of $\mathcal{M}_s$, and let $\mathrm{adj}(\mathcal{M}_s)$ be the adjugate of $\mathcal{M}_s$. Then, $\mathcal{M}_s$ is invertible if and only if $\det(\mathcal{M}_s)$ is invertible, and
$$\mathcal{M}_s^{-1} = \mathrm{diag}\big(\det(\mathcal{M}_s)^{-1}\big)\,\mathrm{adj}(\mathcal{M}_s),$$
where "diag" indicates a block-diagonal matrix, with diagonal blocks in this case given by $\det(\mathcal{M}_s)^{-1}$. Now, suppose $\hat{\mathcal{L}}_i = \hat{\mathcal{L}}_j$ for all $i, j$, and let $P_s(x)$ be the characteristic polynomial of $A^{-1}$. Then,
$$\mathcal{M}_s^{-1} = \mathrm{diag}\big(P_s(\hat{\mathcal{L}})^{-1}\big)\,\mathrm{adj}(\mathcal{M}_s).$$

Note, there are methods with one explicit stage followed by several fully implicit stages [9]. In such cases, $A$ is not invertible, but the explicit stage can be eliminated from the system (by doing an explicit time step). The remaining operator can then be reformulated as in (6).

Proof.
Notice in (8) that if $\hat{\mathcal{L}}_i$ and $\hat{\mathcal{L}}_j$ commute for all $i, j$, then $\mathcal{M}_s$ is a matrix over the commutative ring of linear combinations of $I$ and $\{\hat{\mathcal{L}}_i\}$. Let $\mathrm{adj}(\mathcal{M}_s)$ denote the matrix adjugate. A classical result in matrix analysis tells us that
$$\mathrm{adj}(\mathcal{M}_s)\,\mathcal{M}_s = \mathcal{M}_s\,\mathrm{adj}(\mathcal{M}_s) = \mathrm{diag}(\det(\mathcal{M}_s)).$$
Moreover, $\mathcal{M}_s$ is invertible if and only if the determinant of $\mathcal{M}_s$ is invertible, in which case $\mathcal{M}_s^{-1} = \mathrm{diag}(\det(\mathcal{M}_s)^{-1})\,\mathrm{adj}(\mathcal{M}_s)$ [7, Theorem 2.19 & Corollary 2.21]. For the case of time-independent operators ($\hat{\mathcal{L}}_i = \hat{\mathcal{L}}_j$), notice that $\mathcal{M}_s$ takes the form $A^{-1} - \hat{\mathcal{L}} I$ over the commutative ring defined above. Analogous to a scalar matrix, the determinant of $A^{-1} - \hat{\mathcal{L}} I$ is the characteristic polynomial of $A^{-1}$ evaluated at $\hat{\mathcal{L}}$.

Returning to (6), we can express the solution for the set of all stage vectors $k = [k_1; \ldots; k_s]$ as
$$k := \mathrm{diag}\big(\det(\mathcal{M}_s)^{-1}\big)(A^{-1} \otimes I)\,\mathrm{adj}(\mathcal{M}_s)\, f,$$
where $f = [f_1; \ldots; f_s]$ (note that $A^{-1} \otimes I$ commutes with $\mathrm{diag}(\det(\mathcal{M}_s)^{-1})$). The Runge-Kutta update is then given by
$$u_{n+1} = u_n + \delta t \sum_{i=1}^s b_i k_i = u_n + \delta t\, \det(\mathcal{M}_s)^{-1}(b^T A^{-1} \otimes I)\,\mathrm{adj}(\mathcal{M}_s)\, f. \tag{9}$$

Remark 2 (Implementation & complexity). The adjugate consists of linear combinations of $I$ and $\hat{\mathcal{L}}$, and an analytical form can be derived for an arbitrary $s \times s$ matrix, where $s \sim O(1)$. Applying its action requires a set of vector summations and matrix-vector multiplications. In particular, the diagonal elements of $\mathrm{adj}(\mathcal{M}_s)$ are monic polynomials in $\hat{\mathcal{L}}$ of degree $s-1$ (or linear combinations of comparable degree if $\hat{\mathcal{L}}_i \ne \hat{\mathcal{L}}_j$), and off-diagonal terms are polynomials in $\hat{\mathcal{L}}$ of degree $s-2$.

Returning to (9), we consider two cases. First, if a given Runge-Kutta scheme is stiffly accurate (for example, Radau IIA methods), then $b^T A^{-1} = [0, \ldots, 0, 1]$. This yields the nice simplification that computing the update in (9) only requires applying the last row of $\mathrm{adj}(\mathcal{M}_s)$ to $f$ (in a dot-product sense) and applying $\det(\mathcal{M}_s)^{-1}$ to the result. From the discussion above regarding the adjugate structure, applying the last row of $\mathrm{adj}(\mathcal{M}_s)$ requires $(s-1)(s-2) + (s-1) = (s-1)^2$ matrix-vector multiplications. Because this only happens once, followed by the linear solve(s), these multiplications are typically of relatively marginal cost.

In the more general case of non-stiffly accurate methods (for example, Gauss methods), one can obtain an analytical form for $(b^T A^{-1} \otimes I)\,\mathrm{adj}(\mathcal{M}_s)$. Each element in this block $1 \times s$ matrix consists of polynomials in $\hat{\mathcal{L}}$ of degree $s-1$ (although typically not monic). Compared with stiffly accurate schemes, this now requires $(s-1)s$ matrix-vector multiplications, which is $s-1$ more than for stiffly accurate schemes, but still typically of marginal overall computational cost.

2.2 Preconditioning by conjugate pairs

Following the discussion and algorithm developed in Section 2.1, the key outstanding point is inverting $\det(\mathcal{M}_s)$. Moving forward, we restrict our attention to the case $\hat{\mathcal{L}}_i = \hat{\mathcal{L}}_j$ for all $i, j$, in which case we have a simple closed form, $\det(\mathcal{M}_s)^{-1} = P_s(\hat{\mathcal{L}})^{-1}$, where $P_s(x)$ is the characteristic polynomial of $A^{-1}$ (see Lemma 1).

In contrast to much of the early work on solving IRK systems, where LU factorizations were the dominant cost and system sizes relatively small, explicitly forming and inverting $P_s(\hat{\mathcal{L}})$ for numerical PDEs is typically not a viable option in high-performance simulation on modern computing architectures. Instead, by computing the eigenvalues $\{\lambda_i\}$ of $A^{-1}$, we can express $P_s(\hat{\mathcal{L}})$ in a factored form,
$$P_s(\hat{\mathcal{L}}) = \prod_{i=1}^s (\lambda_i I - \hat{\mathcal{L}}), \tag{10}$$
and its inverse can then be computed by successive applications of $(\lambda_i I - \hat{\mathcal{L}})^{-1}$, for $i = 1, \ldots, s$. Unfortunately, eigenvalues of $A$ and $A^{-1}$ are often complex, and for real-valued matrices this makes the inverse of individual factors $(\lambda_i I - \hat{\mathcal{L}})^{-1}$ more difficult and often impractical with standard preconditioners and existing software.
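As a sanity check on the factorization (10), the following sketch verifies that successively solving with each factor $(\lambda_i I - \hat{\mathcal{L}})$ inverts $P_s(\hat{\mathcal{L}})$, and that for a conjugate pair of eigenvalues the result is real. The 2-stage Gauss method is used (eigenvalues of $A^{-1}$ are $3 \pm \mathrm{i}\sqrt{3}$); the matrix size and random data are illustrative assumptions.

```python
import numpy as np

# 2-stage Gauss Butcher matrix; eigenvalues of A^{-1} are 3 ± i*sqrt(3)
A = np.array([[1/4, 1/4 - np.sqrt(3)/6],
              [1/4 + np.sqrt(3)/6, 1/4]])
lam = np.linalg.eigvals(np.linalg.inv(A))

rng = np.random.default_rng(1)
N = 6
Lhat = rng.random((N, N)) / 2              # stand-in for dt * M^{-1} L
f = rng.random(N)

# P_s(Lhat) = prod_i (lam_i I - Lhat), eq. (10)
Ps = np.eye(N, dtype=complex)
for li in lam:
    Ps = Ps @ (li * np.eye(N) - Lhat)

# Invert via successive solves with each (complex) factor
x = f.astype(complex)
for li in lam:
    x = np.linalg.solve(li * np.eye(N) - Lhat, x)
```

Because the two factors are complex conjugates of each other and the data are real, the combined solve returns a real vector up to roundoff, which is exactly what motivates the real-valued conjugate-pair preconditioning developed next.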
Moving forward, let $\lambda := \eta + \mathrm{i}\beta$ denote an eigenvalue of $A^{-1}$, for $\eta, \beta \in \mathbb{R}$, with $\beta \ge 0$ and $\eta > 0$ (Assumption 1). Combining the factors in (10) corresponding to a conjugate pair of eigenvalues yields the real-valued quadratic operator
$$Q_\eta := \big((\eta + \mathrm{i}\beta) I - \hat{\mathcal{L}}\big)\big((\eta - \mathrm{i}\beta) I - \hat{\mathcal{L}}\big) = (\eta^2 + \beta^2) I - 2\eta \hat{\mathcal{L}} + \hat{\mathcal{L}}^2 = (\eta I - \hat{\mathcal{L}})^2 + \beta^2 I. \tag{11}$$
In practice, we typically do not want to directly form or precondition a quadratic operator like (11), due to (i) the overhead cost of large parallel matrix multiplication, and (ii) the fact that many fast parallel methods, such as multigrid, are not well suited to solving a polynomial in $\hat{\mathcal{L}}$. The point of (11) is that, by considering conjugate pairs of eigenvalues, the resulting operator is real-valued. Here we propose preconditioning (11) with the inverse of a real-valued quadratic polynomial, $(\delta I - \hat{\mathcal{L}})(\gamma I - \hat{\mathcal{L}})$, for $\delta, \gamma \in \mathbb{R}_+$.

A natural starting point is preconditioning (11) with the inverse of the real-valued part of the operator, $(\eta I - \hat{\mathcal{L}})^2$, dropping the $+\beta^2 I$ term, which can be applied as two successive inverses of $(\eta I - \hat{\mathcal{L}})$. Expanding, the preconditioned operator then takes the form
$$\mathcal{P}_\eta := (\eta I - \hat{\mathcal{L}})^{-2}\big[(\eta I - \hat{\mathcal{L}})^2 + \beta^2 I\big] = I + \beta^2 (\eta I - \hat{\mathcal{L}})^{-2} = I + \frac{\beta^2}{\eta^2}\Big(I - \frac{\hat{\mathcal{L}}}{\eta}\Big)^{-2}. \tag{12}$$
For $\eta > \beta$ and under the assumptions introduced in Section 1.5, the preconditioned operator in (12) has a well-bounded field of values, which can be directly translated into guaranteed GMRES convergence bounds [30]. However, for $\beta \gg \eta$, such a preconditioning is less effective.

Thus, consider a more general preconditioner of similar form, but with two separate constants $\gamma, \delta > 0$, where the preconditioned operator takes the form
$$\mathcal{P}_{\delta,\gamma} := (\delta I - \hat{\mathcal{L}})^{-1}(\gamma I - \hat{\mathcal{L}})^{-1}\big[(\eta I - \hat{\mathcal{L}})^2 + \beta^2 I\big]. \tag{13}$$
Such an approach was proven effective for symmetric definite spatial matrices in [5], where it is assumed $\gamma = \delta$, and the optimal constant is derived to be $\gamma = \gamma_* = \sqrt{\eta^2 + \beta^2}$. The following theorem derives tight bounds on the condition number of $\mathcal{P}_{\delta,\gamma}$ (13), as well as optimality of $\gamma \in (0, \infty)$, in a much more general context. Corollary 1 then shows that the optimal preconditioner over all $\delta, \gamma \in (0, \infty)$, in terms of minimizing a tight upper bound on the $\ell^2$-condition number, is given by
$$\mathcal{P}_{\gamma_*} := (\gamma_* I - \hat{\mathcal{L}})^{-2}\big[(\eta I - \hat{\mathcal{L}})^2 + \beta^2 I\big], \tag{14}$$
where $\delta = \gamma = \gamma_* = \sqrt{\eta^2 + \beta^2}$.

Theorem 1 (Optimal preconditioning). Suppose Assumptions 1 and 2 hold, that is, $\eta > 0$ and $W(\hat{\mathcal{L}}) \le 0$, and suppose $\hat{\mathcal{L}}$ is real-valued. Let $\mathcal{P}_{\delta,\gamma}$ denote the preconditioned operator as in (13), where $[(\eta I - \hat{\mathcal{L}})^2 + \beta^2 I]$ is preconditioned with $(\delta I - \hat{\mathcal{L}})^{-1}(\gamma I - \hat{\mathcal{L}})^{-1}$, for $\delta, \gamma \in (0, \infty)$. Let $\kappa(\mathcal{P}_{\delta,\gamma})$ denote the two-norm condition number of $\mathcal{P}_{\delta,\gamma}$, and define $\gamma_*$ by
$$\gamma_* := \frac{\eta^2 + \beta^2}{\delta}. \tag{15}$$
Then
$$\kappa(\mathcal{P}_{\delta,\gamma_*}) \le \frac{1}{2\eta}\Big(\delta + \frac{\eta^2 + \beta^2}{\delta}\Big). \tag{16}$$
Moreover, (i) bound (16) is tight in the sense that $\exists\, \hat{\mathcal{L}}$ such that (16) holds with equality, and (ii) $\gamma = \gamma_*$ is optimal in the sense that, without further assumptions on $\hat{\mathcal{L}}$, $\gamma_*$ minimizes a tight upper bound on $\kappa(\mathcal{P}_{\delta,\gamma})$ over all $\gamma \in (0, \infty)$.

Proof.
First we will prove bound (16), then we will demonstrate that it is tight, and finally we will demonstrate the optimality of $\gamma = \gamma_*$.

The square of the condition number of $\mathcal{P}_{\delta,\gamma}$ is given by
$$\kappa(\mathcal{P}_{\delta,\gamma})^2 = \|\mathcal{P}_{\delta,\gamma}\|^2 \|\mathcal{P}_{\delta,\gamma}^{-1}\|^2 = \frac{\max_{v \ne 0} \|\mathcal{P}_{\delta,\gamma} v\|^2 / \|v\|^2}{\min_{v \ne 0} \|\mathcal{P}_{\delta,\gamma} v\|^2 / \|v\|^2}, \tag{17}$$
where, for real-valued $\hat{\mathcal{L}}$, the max and min can be obtained by restricting ourselves to real-valued $v$. The key step in establishing (16) is bounding $\|\mathcal{P}_{\delta,\gamma}\|$ and $\|\mathcal{P}_{\delta,\gamma}^{-1}\|$ from above by bounding $\|\mathcal{P}_{\delta,\gamma} v\|^2 / \|v\|^2$ from above and below, respectively. Consider the form of the preconditioned operator $\mathcal{P}_{\delta,\gamma}$ in (13) and make the substitution $v \mapsto (\gamma I - \hat{\mathcal{L}})(\delta I - \hat{\mathcal{L}}) w$. Using the fact that rational functions of $\hat{\mathcal{L}}$ commute, $\|\mathcal{P}_{\delta,\gamma} v\|^2$ can be expanded for real-valued $v$ (and, thus, real-valued $w$) as
$$\begin{aligned} \|\mathcal{P}_{\delta,\gamma} v\|^2 &= \big\|[(\eta I - \hat{\mathcal{L}})^2 + \beta^2 I] w\big\|^2 = \big\|(\eta^2 + \beta^2) w - 2\eta \hat{\mathcal{L}} w + \hat{\mathcal{L}}^2 w\big\|^2 \\ &= \big\|(\eta^2 + \beta^2) w + \hat{\mathcal{L}}^2 w\big\|^2 - 4\eta(\eta^2 + \beta^2)\langle \hat{\mathcal{L}} w, w\rangle - 4\eta\langle \hat{\mathcal{L}}(\hat{\mathcal{L}} w), \hat{\mathcal{L}} w\rangle + 4\eta^2 \big\|\hat{\mathcal{L}} w\big\|^2. \end{aligned} \tag{18}$$
Similarly, expanding $\|v\|^2$ yields
$$\begin{aligned} \|v\|^2 &= \big\|(\gamma I - \hat{\mathcal{L}})(\delta I - \hat{\mathcal{L}}) w\big\|^2 = \big\|\delta\gamma w - (\delta + \gamma)\hat{\mathcal{L}} w + \hat{\mathcal{L}}^2 w\big\|^2 \\ &= \big\|\delta\gamma w + \hat{\mathcal{L}}^2 w\big\|^2 - 2\delta\gamma(\delta + \gamma)\langle \hat{\mathcal{L}} w, w\rangle - 2(\delta + \gamma)\langle \hat{\mathcal{L}}(\hat{\mathcal{L}} w), \hat{\mathcal{L}} w\rangle + (\delta + \gamma)^2 \big\|\hat{\mathcal{L}} w\big\|^2. \end{aligned} \tag{19}$$
Thus, the key ratio in (17) takes the form
$$\frac{\|\mathcal{P}_{\delta,\gamma} v\|^2}{\|v\|^2} = \frac{c_1(w) f_1(w) + c_2 f_2(w) + c_3 f_3(w) + c_4 f_4(w)}{f_1(w) + f_2(w) + f_3(w) + f_4(w)}, \tag{20}$$
where, for $\delta, \gamma > 0$, we have defined the functions and constants
$$\begin{alignedat}{2} f_1 &:= \big\|\delta\gamma w + \hat{\mathcal{L}}^2 w\big\|^2 \ge 0, &\qquad c_1 &:= \frac{\big\|(\eta^2+\beta^2) w + \hat{\mathcal{L}}^2 w\big\|^2}{\big\|\delta\gamma w + \hat{\mathcal{L}}^2 w\big\|^2} \ge 0, \\ f_2 &:= -2\delta\gamma(\delta+\gamma)\langle \hat{\mathcal{L}} w, w\rangle \ge 0, &\qquad c_2 &:= \frac{\eta^2+\beta^2}{\delta\gamma}\cdot\frac{2\eta}{\delta+\gamma} > 0, \\ f_3 &:= -2(\delta+\gamma)\langle \hat{\mathcal{L}}(\hat{\mathcal{L}} w), \hat{\mathcal{L}} w\rangle \ge 0, &\qquad c_3 &:= \frac{2\eta}{\delta+\gamma} > 0, \\ f_4 &:= (\delta+\gamma)^2 \big\|\hat{\mathcal{L}} w\big\|^2 \ge 0, &\qquad c_4 &:= \Big(\frac{2\eta}{\delta+\gamma}\Big)^2 > 0. \end{alignedat} \tag{21}$$
Note that the functions $f_2$ and $f_3$ are non-negative by the assumption $W(\hat{\mathcal{L}}) \le 0$, while for all $w \ne 0$ it must hold that either $c_1 f_1 > 0$ or $c_4 f_4 > 0$ ($c_4 f_4 = 0$ if and only if $\hat{\mathcal{L}} w = 0$, which implies $c_1 f_1 > 0$ for $w \ne 0$). Since all of the addends in the numerator and denominator of (20) are non-negative, and at least one addend in each is positive, (20) can be bounded as
$$\min\{c_1, c_2, c_3, c_4\} =: c_{\min} \le \frac{\|\mathcal{P}_{\delta,\gamma} v\|^2}{\|v\|^2} \le c_{\max} := \max\{c_1, c_2, c_3, c_4\}.$$
Applying these bounds to the norms in (17) yields
$$\|\mathcal{P}_{\delta,\gamma}\| \le \sqrt{c_{\max}}, \qquad \|\mathcal{P}_{\delta,\gamma}^{-1}\| \le 1/\sqrt{c_{\min}}. \tag{22}$$
Bounding $c_1$ for general $\gamma$, and hence $c_{\min}$ and $c_{\max}$, is difficult because the sign of $\langle \hat{\mathcal{L}}^2 w, w\rangle$ (which appears in expanding $\|\delta\gamma w + \hat{\mathcal{L}}^2 w\|^2$) is not known for general $\hat{\mathcal{L}}$, noting that the sign of $W(\hat{\mathcal{L}})$ does not determine that of $W(\hat{\mathcal{L}}^2)$. However, observe from (21) that the judicious choice of $\gamma = \gamma_* := (\eta^2+\beta^2)/\delta$ yields $c_1(w) = 1$. Moreover, in the final part of this proof we demonstrate that $\gamma = \gamma_*$ is optimal and, as such, moving forward we only consider the case $\gamma = \gamma_*$.

Letting $\gamma = \gamma_* := (\eta^2+\beta^2)/\delta$, from (21) one has $c_1 = 1 \ge c_2 = c_3 = \sqrt{c_4} = 2\eta/(\delta + \gamma_*)$, where the inequality $1 \ge 2\eta/(\delta+\gamma_*)$ follows by noting the equivalent relation $\delta^2 - 2\eta\delta + \eta^2 + \beta^2 \ge 0$, which holds for all $\eta, \delta > 0$. Thus, for $\gamma = \gamma_*$, the bounds in (22) are given by
$$\|\mathcal{P}_{\delta,\gamma_*}\| \le 1, \qquad \|\mathcal{P}_{\delta,\gamma_*}^{-1}\| \le \frac{\delta + \gamma_*}{2\eta} = \frac{1}{2\eta}\Big(\delta + \frac{\eta^2+\beta^2}{\delta}\Big). \tag{23}$$
Applying these bounds to the condition number (17) yields the upper bound in (16).

We now show that bound (16) is tight and that $\gamma = \gamma_*$ is optimal. We do so by construction, showing that the bound in (16) is achieved for certain matrices, and that for $\gamma \ne \gamma_*$ there exist matrices $\hat{\mathcal{L}}$ for which $\kappa(\mathcal{P}_{\delta,\gamma}) > (\delta + \gamma_*)/(2\eta)$. Note that the min/max of $\|\mathcal{P}_{\delta,\gamma} v\|^2/\|v\|^2$ over $v$ for real-valued $\mathcal{P}_{\delta,\gamma}$ is equivalent when minimizing over real or complex $v$; we now consider complex $v$ for theoretical purposes. To that end, let $v = (\gamma I - \hat{\mathcal{L}})(\delta I - \hat{\mathcal{L}}) w$, but suppose that $(\mathrm{i}\xi, w)$ is an eigenpair of $\hat{\mathcal{L}}$, with $\xi$ a real number and $w$ a complex eigenvector. Plugging into $\|\mathcal{P}_{\delta,\gamma} v\|^2$ (18) and $\|v\|^2$ (19), and taking the ratio as in (20), define the following function of $\xi$:
$$H_{\delta,\gamma}(\xi) := \frac{\|\mathcal{P}_{\delta,\gamma} v\|^2}{\|v\|^2}\bigg|_{\hat{\mathcal{L}} w = \mathrm{i}\xi w} = \frac{|(\eta - \mathrm{i}\xi)^2 + \beta^2|^2}{|(\delta\gamma - \xi^2) - \mathrm{i}(\delta+\gamma)\xi|^2} = \frac{(\delta\gamma_* - \xi^2)^2 + (2\eta\xi)^2}{(\delta\gamma - \xi^2)^2 + [\xi(\delta+\gamma)]^2}, \tag{24}$$
where we have made use of $\delta\gamma_* = \eta^2 + \beta^2$. By virtue of restricting that $w$ be an eigenvector, from (17) we have
$$\frac{1}{\|\mathcal{P}_{\delta,\gamma}^{-1}\|^2} = \min_{v \ne 0} \frac{\|\mathcal{P}_{\delta,\gamma} v\|^2}{\|v\|^2} \le H_{\delta,\gamma}(\xi) \le \max_{v \ne 0} \frac{\|\mathcal{P}_{\delta,\gamma} v\|^2}{\|v\|^2} = \|\mathcal{P}_{\delta,\gamma}\|^2. \tag{25}$$
That is, any value of $1/H_{\delta,\gamma}(\xi)$ serves as a lower bound on $\|\mathcal{P}_{\delta,\gamma}^{-1}\|^2$, while any value of $H_{\delta,\gamma}(\xi)$ serves as a lower bound on $\|\mathcal{P}_{\delta,\gamma}\|^2$. Therefore, the ratio of any two values of $H_{\delta,\gamma}(\xi)$ provides a lower bound on $\kappa(\mathcal{P}_{\delta,\gamma})^2$.

We now show that bound (16) on $\kappa(\mathcal{P}_{\delta,\gamma_*})$ is tight.
Considering (24) at the judiciouslychosen eigenvalues of i ξ = { , ± i √ δγ ∗ } , we have H δ,γ (0) = γ ∗ γ , H δ,γ ( ± (cid:112) δγ ∗ ) = (2 η ) γ ∗ δ ( γ − γ ∗ ) + γ ∗ ( δ + γ ) . (26)First observe from (25) and (26) that (cid:107)P δ,γ ∗ (cid:107) ≥ H δ,γ ∗ (0) = 1, and thus the upper boundon (cid:107)P δ,γ ∗ (cid:107) from (23) achieves equality for a matrix (cid:98) L having an eigenvalue of ξ = 0.Secondly, observe from (25) and (26) that (cid:107)P − δ,γ ∗ (cid:107) ≥ / H δ,γ ∗ ( ±√ δγ ∗ ) = [( δ + γ ∗ ) / (2 η )] ,and thus the upper bound on (cid:107)P − δ,γ ∗ (cid:107) from (23) achieves equality for a matrix (cid:98) L havingeigenvalues i ξ = ± i √ δγ ∗ . Therefore, bound (16) on κ ( P δ,γ ∗ ) achieves equality for anymatrix (cid:98) L having eigenvalues { , ± i √ δγ ∗ } . Finally, having shown that (16) is tight, we now show that γ = γ ∗ is optimal.Once again, consider the matrix (cid:98) L from above with eigenvalues { , ± i √ δγ ∗ } , such that κ ( P δ,γ ∗ ) = ( δ + γ ∗ ) / (2 η ). Now, observe from this, (25), and (26), that for 0 < γ < γ ∗ , κ ( P δ,γ ∗ ) < γ ∗ [ δ ( γ − γ ∗ ) + γ ∗ ( δ + γ ) ](2 ηγ ) = H δ,γ (0) H δ,γ ( ±√ δγ ∗ ) ≤ κ ( P δ,γ ) , (27)by noting that the first inequality in (27) is equivalent to γ ∗ ( γ ∗ − γ ) + ( γ ∗ − γ )[2 γ ∗ γ + δ ( γ ∗ + γ )] >
0, which is clearly true when 0 < γ < γ*. Now suppose that L̂ has eigenvalues iξ → ±i∞, which, when substituted into (24), yields lim_{ξ→±∞} H_{δ,γ}(ξ) = 1. Combining with (25) and (26), we have for γ* < γ < ∞,

κ(P_{δ,γ*}) < √( δ²(γ − γ*)² + δγ*(δ + γ)² ) / (2η√(δγ*)) = H_{δ,γ}(±∞)/H_{δ,γ}(±√(δγ*)) ≤ κ(P_{δ,γ}),   (28)

by noting that the first inequality in (28) is equivalent to δ(γ − γ*)² + γ*(γ − γ*)(2δ + γ* + γ) >
0, which is clearly satisfied for γ > γ*. By construction in (27) and (28), we have shown that for all γ ∈ (0, ∞) \ γ*, there exist matrices L̂ such that κ(P_{δ,γ}) > κ(P_{δ,γ*}) = (δ + γ*)/(2η). It therefore holds for general L̂ satisfying Assumption 2 that a tight upper bound on κ(P_{δ,γ}) for γ ∈ (0, ∞) \ γ* must be larger than the tight upper bound κ(P_{δ,γ*}) ≤ (δ + γ*)/(2η). Hence γ = γ* is the minimizer over γ ∈ (0, ∞) of a tight upper bound on κ(P_{δ,γ}). By continuity of eigenvalues and continuity of H_{δ,γ}(ξ) at ξ = 0, there also exist nonsingular matrices with condition number within ε of (16) for any ε > 0.

Corollary 1 (Optimal preconditioning with γ = δ). A tight upper bound on the condition number of the general operator P_{δ,γ} (13) is minimized over δ, γ ∈ (0, ∞) when δ = γ = γ*, with

γ* = √(η² + β²).   (29)

That is to say, the preconditioner (γI − L̂)⁻² employed in (14) with γ = γ* is optimal over the space of general preconditioners (δI − L̂)⁻¹(γI − L̂)⁻¹, for δ, γ ∈ (0, ∞). Furthermore, the condition number of the preconditioned operator P_{γ*} (14) is tightly bounded via

κ(P_{γ*}) ≤ √(1 + β²/η²).   (30)

Proof.
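The construction in the proof above can be checked numerically. The sketch below (an illustration, not part of the original analysis) builds the worst-case normal matrix with eigenvalues {0, ±i√(δγ*)}, plus a large imaginary pair standing in for ξ → ±∞, and verifies that γ = γ* attains bound (16) while other choices of γ do strictly worse; the values of η, β, and δ are illustrative (they happen to match the 2-stage Gauss pair η = 3, β = √3).

```python
import numpy as np

eta, beta, delta = 3.0, np.sqrt(3.0), 2.0
gamma_star = (eta**2 + beta**2) / delta        # optimal gamma for this delta, (15)
xi = np.sqrt(delta * gamma_star)               # critical eigenvalue from the proof

def rot(s):
    # Real normal 2x2 block with eigenvalues +/- i*s.
    return np.array([[0.0, s], [-s, 0.0]])

# Normal matrix with eigenvalues {0, +/-i*xi, +/-i*Xi}, Xi >> 1 mimicking xi -> infinity.
L = np.zeros((5, 5))
L[1:3, 1:3] = rot(xi)
L[3:5, 3:5] = rot(1.0e6)
I = np.eye(5)

def kappa(gamma):
    # 2-norm condition number of P_{delta,gamma} defined in (13).
    Q = (eta**2 + beta**2) * I - 2.0 * eta * L + L @ L
    P = np.linalg.solve((delta * I - L) @ (gamma * I - L), Q)
    s = np.linalg.svd(P, compute_uv=False)
    return s[0] / s[-1]

bound = (delta + gamma_star) / (2.0 * eta)     # tight bound (16); here 4/3
assert abs(kappa(gamma_star) - bound) < 1e-8   # gamma = gamma* attains the bound
assert kappa(0.5 * gamma_star) > bound         # smaller gamma: strictly worse
assert kappa(2.0 * gamma_star) > bound         # larger gamma: strictly worse
```

Because L is normal, the singular values of P_{δ,γ} coincide with the moduli of its eigenvalues, so the computed condition number matches the ratio of H values used in the proof.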
From Theorem 1, a tight upper bound on the condition number of P_{δ,γ} (13) is minimized with respect to γ when γ = γ* (15), with its minimum value given by (16). To minimize bound (16) with respect to δ, we differentiate it and observe that for δ > 0 there is a unique critical point at δ = √(η² + β²). Since bound (16) tends to +∞ as δ → 0⁺ and as δ → ∞, this critical point must be a global minimum. Therefore, the tight upper bound (16) is minimized when δ = √(η² + β²). Substituting δ = √(η² + β²) into (15) yields γ* = √(η² + β²). Finally, substituting δ = γ* = √(η² + β²) into (16) and noting that P_{γ,γ} (13) is equivalent to P_γ (14) yields bound (30).

Table 1 provides condition number bounds from Corollary 1 and (30) for Gauss, Radau IIA, and Lobatto IIIC Runge-Kutta methods.

               2 stages   3 stages      4 stages       5 stages
               λ±,1       λ₁    λ±,1    λ±,1   λ±,2    λ₁    λ±,1   λ±,2
Gauss          1.15       1.00  1.38    1.61   1.04    1.00  1.83   1.13
Radau IIA      1.22       1.00  1.51    1.79   1.05    1.00  2.05   1.15
Lobatto IIIC   1.41       1.00  1.79    2.12   1.06    1.00  2.42   1.17

Table 1: Bounds on κ(P_{γ*}) from Corollary 1 and (30) for Gauss, Radau IIA, and Lobatto IIIC integration, with 2–5 stages. Each column within a given set of stages corresponds to either a real eigenvalue, λ₁ = η₁, or a conjugate pair of eigenvalues, e.g., λ±,₂ = η₂ ± iβ₂, of A₀⁻¹.

In the proof of Theorem 1, γ* provides a delicate cancellation, and we have been unable to complete a general proof for γ ≠ γ* without further assumptions, such as W(L̂) ≥
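The entries of Table 1 depend only on the eigenvalues of A₀⁻¹. As a concrete check, the sketch below reproduces the 2-stage Gauss entry directly from the standard 2-stage (4th-order) Gauss-Legendre Butcher matrix:

```python
import numpy as np

s3 = np.sqrt(3.0)
A0 = np.array([[1.0 / 4.0, 1.0 / 4.0 - s3 / 6.0],
               [1.0 / 4.0 + s3 / 6.0, 1.0 / 4.0]])   # 2-stage Gauss Butcher matrix

lam = np.linalg.eigvals(np.linalg.inv(A0))           # one conjugate pair eta +/- i*beta
eta, beta = lam[0].real, abs(lam[0].imag)

gamma_star = np.sqrt(eta**2 + beta**2)               # optimal constant (29)
bound = np.sqrt(1.0 + beta**2 / eta**2)              # condition number bound (30)

assert abs(eta - 3.0) < 1e-12 and abs(beta - s3) < 1e-12
assert abs(gamma_star - np.sqrt(12.0)) < 1e-12
assert abs(bound - np.sqrt(4.0 / 3.0)) < 1e-12       # = 1.1547..., the 1.15 in Table 1
```

The eigenvalues of A₀⁻¹ here are 3 ± i√3, so γ* = √12 and the bound evaluates to √(4/3) ≈ 1.15, matching the first Gauss column of Table 1.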
0. For the specific case of δ = γ = η, one can derive a tight bound of 1 + β²/η² for SPD operators and ≈ (1 + β²/η²)^{3/2}/2 for skew-symmetric operators (by appealing to normality). Numerical tests indicate that using the modified constant γ* rather than η is particularly important for hyperbolic-type problems. Indeed, one example in Section 3.2 demonstrates an almost 6× reduction in iteration count achieved by using γ* instead of η, whereas γ* has offered a much more modest improvement over η for the parabolic/diffusive problems we have explored. It is unclear whether an O(1) bound on condition number fails to hold for general nonsymmetric operators with δ = γ = η, or whether the simple increase in condition number from ≈ β/η to ≈ (1 + β²/η²)^{3/2}/2 is enough to cause a significant degradation in convergence. In any case, P_{γ*} has performed better than P_η for all problems we have tested, and is the recommended preconditioner in practice.

Remark 3 (Mass matrices). Recall that in the finite element context, where mass matrices are involved, we defined L̂ := δtM⁻¹L. For a given conjugate pair of eigenvalues, the quadratic polynomial (11) can be expressed as

Q_η = M⁻¹((η + iβ)M − δtL)M⁻¹((η − iβ)M − δtL).   (31)

In this context, it is best to first scale both sides of the linear system by M. This halves the number of times M⁻¹ must be applied for each Krylov iteration, and if M and L are Hermitian, the resulting quadratic system is SPD and can be solved using preconditioned CG or MINRES, preconditioned with one application of a preconditioner, the action of M, and a second application of the preconditioner.

Remark 4 (Inexact preconditioning). In practice, fully converging (γ*I − L̂)⁻² each iteration as a preconditioner is often not desirable due to the cost of performing a full linear solve.
Here, we propose applying a Krylov method to Q_η := (η² + β²)I − 2ηL̂ + L̂² by computing the operator's action (that is, not fully constructing it), and preconditioning each Krylov iteration with two applications of a sparse parallel preconditioner for (γ*I − L̂), approximating the action of (γ*I − L̂)⁻¹.

Analogous to standard block-preconditioning techniques, this approximate-inverse approach is often (but not always) more efficient than computing a full inverse each iteration. However, it is important that the underlying preconditioner provides a good approximation. Fortunately, for difficult problems without highly effective preconditioners, it is straightforward to apply either multiple inner fixed-point iterations or an inner Krylov iteration (wrapped with a flexible outer Krylov method [38, 46]) to ensure robust (outer) iterations. In Section 3.2.3, a numerical example is shown where the proposed method diverges using a single inner fixed-point iteration as a preconditioner for (γ*I − L̂), but three (or more) inner fixed-point iterations yield fast, scalable convergence.

3 Numerical results

3.1 Finite-difference advection-diffusion

In this section, we consider a constant-coefficient advection-diffusion problem discretized in space with high-order finite differences. An exact solution to this problem is used to demonstrate the high-order accuracy of the IRK methods, and the robustness of the algorithms developed in the previous section with respect to mesh resolution. Specifically, we solve the PDE

u_t + α₁u_x + α₂u_y = ε₁u_xx + ε₂u_yy + s(x, y, t),   t ∈ (0, 2],   (32)

posed on a doubly periodic spatial domain, for fixed positive constants α₁, α₂, ε₁, ε₂. The source term s(x, y, t) is chosen such that the solution of the PDE is a product of smooth, translating sine waves with exponentially decaying amplitude. We consider tests using IRK methods of orders three, four, seven, and eight.
The 3rd- and 4th-order IRK methods are paired with 4th-order central finite differences in space, and the 7th- and 8th-order methods with 8th-order central finite differences in space. In all cases, a time step of δt = 2h is used, with h denoting the spatial mesh size, and results are run on four cores. Due to the diffusive, but non-SPD, nature of the spatial discretization, we apply GMRES(30) preconditioned by a classical algebraic multigrid (AMG) method in the hypre library [15]. Specifically, we use classical interpolation (type 0), Falgout coarsening (type 6) with a strength tolerance θ_C = 0.
25, zero levels of aggressive coarsening, and ℓ₁-Gauss–Seidel relaxation (type 8), with equal absolute and relative stopping tolerances. A single iteration of AMG is applied to approximate (γ*I − δtL)⁻¹.

In Figure 1, discretization errors are shown for different IRK methods, alongside the average number of AMG iterations needed per time step. The expected asymptotic convergence rates (black dashed lines in the left panels) are observed for all discretizations.

Figure 1: Finite-difference advection-diffusion problem (32). L∞-discretization errors at t = 2 as a function of time step δt are shown on the left for various discretizations of approximately 4th order (top: Gauss(4), Lobatto IIIC(4), Radau IIA(3), A-SDIRK(4), L-SDIRK(4)) and 8th order (bottom: Gauss(8), Lobatto IIIC(8), Radau IIA(7)). Black dashed lines with slopes of three and four are shown (top), as are those with slopes of seven and eight (bottom). Plots on the right show the average number of AMG iterations per time step.

The preconditioner appears robust with respect to mesh and problem size, since the average number of AMG iterations per time step (which is a proxy for the number of GMRES iterations) remains roughly constant as the mesh is refined. Of the fully implicit methods, the Gauss methods require the fewest AMG iterations, closely followed by the Radau IIA methods, with the Lobatto IIIC methods requiring the most AMG iterations. This is consistent with the theoretical estimates in Table 1.
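The sensitivity to how accurately (γ*I − δtL)⁻¹ is approximated (revisited in Remark 4 and Section 3.2.3) can be mimicked without AMG. The sketch below swaps in k Jacobi-based Richardson sweeps as the approximate inverse, on a 1D periodic upwind advection operator with the 2-stage Gauss values η = 3, β = √3; all of these choices are illustrative stand-ins, not the discretization or solver used in this section. More accurate inner solves push the conditioning of the preconditioned quadratic operator toward the bound (30).

```python
import numpy as np

n = 64
eta, beta = 3.0, np.sqrt(3.0)
gamma_star = np.sqrt(eta**2 + beta**2)

# Lhat = dt*L for first-order upwind periodic advection, with dt*c/h = 2.
a = 2.0
S = np.roll(np.eye(n), -1, axis=1)             # periodic shift: (S u)_i = u_{i-1}
Lhat = -a * (np.eye(n) - S)
I = np.eye(n)
A = gamma_star * I - Lhat                      # backward-Euler-type operator
Q = (eta**2 + beta**2) * I - 2.0 * eta * Lhat + Lhat @ Lhat

D_inv = np.diag(1.0 / np.diag(A))
G = I - D_inv @ A                              # Jacobi iteration matrix

def kappa(T):
    s = np.linalg.svd(T, compute_uv=False)
    return s[0] / s[-1]

def precond(k):
    # k Richardson/Jacobi sweeps approximating A^{-1}, applied twice to Q.
    Pk = (I - np.linalg.matrix_power(G, k)) @ np.linalg.inv(A)
    return Pk @ Pk @ Q

k_exact = kappa(np.linalg.inv(A @ A) @ Q)               # exact (gamma* I - Lhat)^{-2}
assert k_exact < np.sqrt(1.0 + beta**2 / eta**2) + 1e-8  # bound (30): ~1.1547
assert kappa(precond(3)) < kappa(precond(1))             # more inner sweeps: better
```

Since the operator here is a normal circulant, the condition number is a meaningful proxy for Krylov convergence; with three sweeps the preconditioned operator is already close to the exactly preconditioned one.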
Note that while the Gauss and Radau IIA methods have very similar iteration counts, Gauss converges at one order higher accuracy, which can be seen in the left-hand panels of the figure.

Considering the lower-order methods in the top row of Figure 1, L-SDIRK(4), a 5-stage, 4th-order, L-stable SDIRK method, requires the most AMG iterations of all methods. A-SDIRK(4), a 3-stage, 4th-order, A-stable SDIRK method, requires far fewer AMG iterations than L-SDIRK(4). However, A-SDIRK(4) yields a significantly larger discretization error than the other 4th-order schemes, and takes longer to reach its asymptotic convergence rate.¹ Thus, in terms of work per accuracy, the new preconditioner with 4th-order Gauss integration is the clear winner for this particular problem.

¹An exception here is A-SDIRK(4), which appears to be converging with a rate closer to three than four; however, further decreasing δt (not shown here) confirms that 4th-order convergence is eventually achieved.

3.2 Discontinuous Galerkin advection-diffusion

Here we consider a more difficult advection-diffusion problem, discretized using high-order DG finite elements. In particular, we demonstrate the effectiveness and scalability of the new preconditioning on more complex flows and DG discretizations (Figure 2 and Section 3.2.2), the reduction in computational work that can be achieved using large time steps and very high-order integration (Section 3.2.2), and how using multiple "inner" preconditioning iterations to approximate (γ*I − L̂)⁻¹ (or even inner Krylov acceleration) can be important for the fast, scalable application of IRK integration (Section 3.2.3).

The governing equations in spatial domain Ω = [0, 1] × [0,
1] are given by

u_t + ∇·(βu − ε∇u) = f,   (33)

where β(x, y) := (cos(4πy), sin(2πx))ᵀ is the prescribed velocity field and ε the diffusion coefficient. Periodic boundary conditions are enforced on ∂Ω, and (33) is discretized with an upwind DG method [14], where diffusion terms are treated with the symmetric interior penalty method [3, 4]. The resulting finite element problem is to find u_h ∈ V_h such that, for all v_h ∈ V_h,

∫_Ω ∂_t(u_h)v_h dx − ∫_Ω u_h β·∇_h v_h dx + ∫_Γ û_h β·⟦v_h⟧ ds + ε∫_Ω ∇_h u_h·∇_h v_h dx − ε∫_Γ {∇_h u_h}·⟦v_h⟧ ds − ε∫_Γ {∇_h v_h}·⟦u_h⟧ ds + ∫_Γ σ⟦u_h⟧·⟦v_h⟧ ds = ∫_Ω f v_h dx,

where V_h is the DG finite element space consisting of piecewise polynomials of degree p defined on elements of the computational mesh T of the spatial domain Ω. No continuity is enforced between mesh elements. Here, ∇_h is the broken gradient, Γ denotes the skeleton of the mesh, and {·} and ⟦·⟧ denote the average and jump of a function across a mesh interface. û_h is used to denote the upwind numerical flux. This discretization has been implemented in the MFEM finite element framework [2], and uses AMG preconditioning with approximate ideal restriction (AIR) [32, 33] (after first scaling by the inverse of the block-diagonal mass matrix).

The DG method is particularly well-suited for advection-dominated problems. In the following subsections we vary ε from 0 (purely advective) to 0.
01. The velocity field, initial condition, and numerical solution for a small positive value of ε are shown in Figure 2.

3.2.1 η vs. γ*

First, we demonstrate the effectiveness of using γ* (Section 2.3) instead of η in the preconditioner, as well as the robustness of the proposed method on a fully hyperbolic problem, where most papers have only discussed parabolic PDEs. Thus, we set the diffusion coefficient ε = 0 and apply AIR as a preconditioner for the individual systems (γM − δtL).

AIR was originally designed for upwind DG discretizations of advection and is well-suited for this problem. We use the hypre implementation, with distance-1.5 restriction with strength tolerance θ_R = 0.
01, one-point interpolation (type 100), Falgout coarsening (type 6) with strength tolerance θ_C = 0.
1, no pre-relaxation, and forward Gauss–Seidel post-relaxation (type 3), applied first to F-points and followed by a second sweep over all points. The domain is discretized using 4th-order finite elements on a structured mesh, and the time step for each integration scheme is chosen such that the spatial and temporal orders of accuracy match; for example, for 8th-order integration we choose δt = √h, so that δt⁸ = h⁴, and for 4th-order integration δt = h. All linear systems are solved to a fixed relative tolerance. There are a total of 1,638,400 spatial degrees of freedom (DOFs), and the simulations are run on 288 cores of the Quartz machine at Lawrence Livermore National Lab, resulting in roughly 5,700 DOFs per core.

Figure 2: DG advection-diffusion problem, with the velocity field shown in subplot (a) and the solution plotted at a sequence of times from t = 0 onward.

Table 2 compares η and γ* as the preconditioning constants. Iteration counts are shown for Gauss, Radau IIA, and Lobatto IIIC integration, with 2–5 stages, and the (factor of) reduction in iteration count achieved using γ* vs. η is also shown. For 5-stage Lobatto IIIC integration, γ* yields almost a 6× reduction in total inner AIR iterations to solve for the "hard" stage (β > η), while in no case is there an increase in iteration count when using γ*.

3.2.2 High-order integration

This section demonstrates high-order IRK integration applied to the advection-dominated problem in Figure 2, using AIR as a preconditioner for the linear systems. Figure 3a shows the total number of AIR iterations required to compute one time step using six different Runge-Kutta schemes, from 4th to 10th order, as a function of mesh spacing h. Note that as h →
0, there is only modest growth in the number of AIR iterations per time step, indicating good (but not perfect) scalability. Moreover, note that there is small growth in the number of iterations for SDIRK4 as well (increasing from 20 to 25), which suggests the growth in iteration count is due to imperfect scalability of the AIR iterations rather than the new IRK preconditioning.

Gauss
Stages/Order     2/4     3/6        4/8         5/10
Iterations(η)    17      6   30     11   47     8    16   70
Iterations(γ*)   11      6   15     10   19     8    13   23
Speedup          1.5     1.0 2.0    1.1  2.5    1.0  1.2  3.0

Radau IIA
Stages/Order     2/3     3/5        4/7         5/9
Iterations(η)    12      5   39     11   64     8    16   97
Iterations(γ*)   12      5   18     9    21     8    12   25
Speedup          1.0     1.0 2.2    1.2  3.0    1.0  1.3  3.9

Lobatto IIIC
Stages/Order     2/2     3/4        4/6         5/8
Iterations(η)    8       3   67     11   113    7    17   175
Iterations(γ*)   8       3   22     9    26     7    12   30
Speedup          1.0     1.0 3.0    1.2  4.3    1.0  1.4  5.8

Table 2: Average AIR iterations to solve for each stage in an implicit Runge-Kutta method using preconditioners (ηM − δtL)⁻¹ and (γ*M − δtL)⁻¹, with γ* defined in (29). The ratio iterations(η)/iterations(γ*) is shown in the "Speedup" rows.

Figure 3b then shows the relative number of AIR iterations to compute a given accuracy compared to SDIRK4. For example, if h = 0.
01, then SDIRK4 uses δt = 0.01, while the 8th-order schemes use δt = √0.
01 = 0.1 = 10δt, that is, 10× fewer time steps to achieve comparable accuracy. Note that as h →
0, this factor becomes progressively larger. For quite moderate h, we see how very high-order integration can quickly lead to a reduction in total preconditioning iterations compared to a standard SDIRK4 scheme. Gauss8 and Radau9, for example, require almost 7× fewer AIR iterations than SDIRK4 at the finest resolution considered. Although this does not account for additional setup time (Gauss8 and Radau9 require solving two and three different linear systems, respectively, while SDIRK4 requires only one), very high-order integration enabled by the new preconditioning still offers the opportunity for significant speedups.
Figure 3: Total AIR iterations per time step (a), and AIR iterations per time step relative to SDIRK4 (b), for SDIRK4, Gauss4, Radau7, Gauss8, Radau9, and Gauss10.

3.2.3 Diffusive problems and inner Krylov
In [32], AIR was shown to be effective on some DG advection-diffusion problems, and classical AMG is known to be effective on diffusion-dominated problems. However, the regime with comparable levels of advection and diffusion remains the most difficult from a multigrid perspective. We use this regime to demonstrate that the methods developed here require a "good" preconditioner for a backward Euler time step, (γM − δtL)⁻¹, in order to converge for more general IRK methods. Fortunately, ensuring the preconditioner is sufficiently good can be achieved by appealing to standard block preconditioning techniques, where an inner iteration applies multiple AIR iterations as a single preconditioner.

Here we consider a problem analogous to the one above, but set the diffusion coefficient to ε = 0.
01. We use a fixed spatial mesh, a time step δt = 0.
1, and three-stage, 6th-order Gauss integration. Altogether, this yields equal orders of accuracy in time and space. FGMRES [46] is used for the outer iteration, which allows GMRES to be applied in an inner iteration as a preconditioner for (γ*M − δtL). Figure 4 plots the total number of AIR iterations per time step as a function of the number of AIR iterations applied for each application of the preconditioner, using an inner GMRES or an inner fixed-point (Richardson) iteration. An advection-dominated problem with a small positive value of ε is also shown for comparison.

Recall that we have three stages: one yields a single linear system corresponding to a real eigenvalue, and the other two correspond to a pair of complex-conjugate eigenvalues, which we precondition as in Section 2. The latter ends up being the more difficult problem to solve: for ε = 0.
01 (Figure 4b), the outer FGMRES iteration for the complex-conjugate quadratic does not converge in 1000 iterations when using one AIR iteration as a preconditioner. If two AIR iterations with GMRES are used as a preconditioner, the FGMRES iteration converges in approximately 130 iterations, each of which requires two applications of GMRES preconditioned with two AIR iterations, yielding just over 500 total AIR iterations to converge. Further increasing the number of AIR iterations per preconditioning yields good convergence using inner fixed-point or GMRES, with 150 and 112 total AIR iterations per time step, respectively. In contrast, Figure 4a shows that additional AIR iterations for the advection-dominated case are generally detrimental to overall computational cost (although the outer iteration converges slightly faster, the gain does not make up for the additional linear solves per iteration).

Figure 4: AIR iterations per time step as a function of the number of AIR iterations applied in each application of the preconditioner, using an inner fixed-point or inner GMRES iteration, for (a) the advection-dominated case and (b) diffusion coefficient ε = 0.01.

3.3 High-order matrix-free diffusion

In this example, we illustrate the use of high-order IRK methods coupled with high-order finite element spatial discretizations. It is well known that matrix assembly becomes prohibitively expensive for high-order finite elements. Naive algorithms typically require O(p^{3d}) operations to assemble the resulting system matrix, where p is the polynomial degree and d is the spatial dimension. Techniques such as sum factorization can reduce this cost on tensor-product elements to O(p^{2d+1}); however, this cost can still be prohibitive for large values of p [35].
On the other hand, matrix-free operator evaluation on tensor-product meshes can be performed in O(p^{d+1}) operations [40], motivating the development of solvers and preconditioners that can be constructed and applied without access to the assembled system matrix [26].

We consider a high-order finite element discretization of the linear heat equation on spatial domain Ω,

∫_Ω ∂_t(u_h)v_h dx + ∫_Ω ∇u_h·∇v_h dx = ∫_Ω f v_h dx,

where u_h, v_h ∈ V_h, and V_h denotes the degree-p H¹-conforming finite element space defined on a mesh T consisting of tensor-product elements (i.e., quadrilaterals or hexahedra). The matrix-free action of the corresponding operator is computed in O(p^{d+1}) operations using the partial assembly features of the MFEM finite element library [2]. To precondition the resulting system, we make use of a low-order-refined preconditioner, whereby the high-order system is preconditioned using a spectrally equivalent low-order finite element discretization computed on a refined mesh [11]. The low-order-refined discretization can be assembled in O(p^d) time, thereby avoiding the prohibitive costs of high-order matrix assembly. We make use of the uniform preconditioners for the low-order-refined problem based on subspace corrections, developed in [41].

For this test case, we take the spatial domain to be Ω = [0, 1] × [0, 1] and the source term

f(x, y, t) = sin(2πx) cos(2πy)(cos(t) + 8π²(2 + sin(t))),

which corresponds to the exact solution

u(x, y, t) = sin(2πx) cos(2πy)(2 + sin(t)).

We begin with a very coarse 3 × 3 mesh and a time step that scales with the exponent k −
1, where k is the order of accuracy of the time integration method, resulting in kth-order convergence in both space and time. The mesh and time step are refined by factors of two to confirm the high-order convergence in space and time of the method. The relative L² error, obtained by comparing against the exact solution, is shown in Figure 5.

Figure 5: High-order convergence in space and time for the matrix-free diffusion problem. Gauss and Radau IIA methods of orders 2 through 10 are used. The dashed lines indicate the expected rates of convergence for each method.

We also use this test case to study the effect of inner iterations on the convergence of the iterative solver. As discussed in Remark 4, it is important that the underlying preconditioner provides a good approximation of the inverse of the operator. For that reason, we consider the use of an inner Krylov solver at every iteration. Since this corresponds to using a variable preconditioner at each iteration, a flexible Krylov method may have to be used for the outer iteration, although in practice good convergence is often still observed using the standard CG method [38]. In particular, we compare the total number of preconditioner applications required to converge the outer iteration to a fixed relative tolerance, both with and without an inner Krylov solver. For the inner Krylov solver, we use a CG iteration with the same relative tolerance as the outer iteration, in order to give a good approximation of the inverse of the operator. The iteration counts are displayed in Figure 6. We note that for the fully implicit IRK methods, using an inner Krylov solver can reduce the total number of preconditioner applications by about a factor of 1.5, although this depends on the type of method and order of accuracy.
As expected, the use of inner iterations does not reduce the total number of preconditioner applications for DIRK methods. In addition, for this test case, the total number of preconditioner applications required for the second- and fourth-order Gauss IRK methods is between 1.3 and 2 times smaller than the number required for the corresponding equal-order DIRK methods.
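The O(p^{d+1}) matrix-free evaluation underlying this section can be sketched on a single 2D tensor-product element: interpolate nodal values to quadrature points one dimension at a time, scale by the quadrature weights, and apply the transposes, never forming the assembled operator. The nodal basis and quadrature below are illustrative choices, not MFEM's implementation.

```python
import numpy as np

p, q = 4, 5                                    # polynomial degree, quadrature points
nodes = np.linspace(-1.0, 1.0, p + 1)          # illustrative equispaced nodal basis
xq, wq = np.polynomial.legendre.leggauss(q)    # 1D Gauss quadrature

# 1D interpolation matrix: B[i, j] = phi_j(xq[i]) for the Lagrange basis phi_j.
B = np.array([[np.prod([(x - nodes[k]) / (nodes[j] - nodes[k])
                        for k in range(p + 1) if k != j])
               for j in range(p + 1)] for x in xq])

def apply_mass_2d(U):
    # Matrix-free action of the 2D mass operator on nodal values U, shape (p+1, p+1).
    # Each 1D contraction costs O(p^2 * q), i.e. O(p^{d+1}) total for d = 2.
    Uq = B @ U @ B.T                           # interpolate to the quadrature grid
    Uq = Uq * np.outer(wq, wq)                 # scale by tensor-product weights
    return B.T @ Uq @ B                        # apply test functions (transposes)

# Cross-check against the assembled Kronecker-product matrix (O(p^{2d}) entries).
M1 = B.T @ np.diag(wq) @ B
M = np.kron(M1, M1)
U = np.random.default_rng(0).standard_normal((p + 1, p + 1))
assert np.allclose(apply_mass_2d(U).ravel(), M @ U.ravel())
```

The same contraction pattern, with differentiation matrices in place of B in the appropriate slots, gives the stiffness action used for the heat equation above.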
4 Conclusions

This paper introduces a theoretical and algorithmic framework for the fast, parallel solution of fully implicit Runge-Kutta and DG discretizations in time for linear numerical PDEs. Theory is developed to guarantee O(1) conditioning of the preconditioned operators under general assumptions on the spatial discretization that yield stable time integration. Numerical results demonstrate the new method on various high-order finite-difference and finite-element discretizations of linear parabolic and hyperbolic problems, demonstrating fast, scalable solution of up to 10th-order accuracy. In several cases, the new method can achieve 4th-order accuracy using Gauss integration with roughly half the number of preconditioner applications required by standard SDIRK techniques. Ongoing work involves addressing fully nonlinear problems and algebraic constraints, in particular without assuming that the linear system (3) can be expressed in Kronecker-product form (thus allowing for a true Newton or improved Newton-like method, compared with the commonly used and analyzed simplified-Newton approach).

Acknowledgments
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-JRNL-817946). Los Alamos National Laboratory report number LA-UR-20-30412.

Figure 6: Comparison of the total number of preconditioner applications with and without inner iterations, as a function of temporal order, for Gauss, Radau IIA, Lobatto IIIC, and DIRK methods. Both the outer iteration and the inner CG iteration are converged to the same relative tolerance.

This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.

References

[1]
G. Akrivis, C. Makridakis, and R. H. Nochetto, Galerkin and Runge-Kutta methods: unified formulation, a posteriori error estimates and nodal superconvergence, Numerische Mathematik, 118 (2011), pp. 429–456.
[2] R. Anderson, J. Andrej, A. Barker, J. Bramwell, J.-S. Camier, J. Cerveny, V. Dobrev, Y. Dudouit, A. Fisher, T. Kolev, W. Pazner, M. Stowell, V. Tomov, J. Dahm, D. Medina, and S. Zampini, MFEM: a modular finite element methods library, Computers & Mathematics with Applications, (2020).
[3] D. N. Arnold, An interior penalty finite element method with discontinuous elements, SIAM Journal on Numerical Analysis, 19 (1982), pp. 742–760.
[4] D. N. Arnold, F. Brezzi, B. Cockburn, and L. D. Marini, Unified analysis of discontinuous Galerkin methods for elliptic problems, SIAM Journal on Numerical Analysis, 39 (2002), pp. 1749–1779.
[5] S. Basting and E. Bänsch, Preconditioners for the discontinuous Galerkin time-stepping method of arbitrary order, ESAIM: Mathematical Modelling and Numerical Analysis, 51 (2017), pp. 1173–1195.
[6] T. A. Bickart, An efficient solution process for implicit Runge–Kutta methods, SIAM Journal on Numerical Analysis, 14 (1977), pp. 1022–1027.
[7] W. C. Brown, Matrices over Commutative Rings, Marcel Dekker, Inc., 1993.
[8] K. Burrage, Efficiently implementable algebraically stable Runge–Kutta methods, SIAM Journal on Numerical Analysis, 19 (1982), pp. 245–258.
[9] J. Butcher and D. Chen, A new type of singly-implicit Runge–Kutta method, Applied Numerical Mathematics, 34 (2000), pp. 179–188.
[10] J. C. Butcher, On the implementation of implicit Runge-Kutta methods, BIT Numerical Mathematics, 16 (1976), pp. 237–240.
[11] C. Canuto, P. Gervasio, and A. Quarteroni, Finite-element preconditioning of G-NI spectral methods, SIAM Journal on Scientific Computing, 31 (2010), pp. 4422–4451.
[12] H. Chen, A splitting preconditioner for the iterative solution of implicit Runge-Kutta and boundary value methods, BIT Numerical Mathematics, 54 (2014), pp. 607–621.
[13] H. Chen, Kronecker product splitting preconditioners for implicit Runge-Kutta discretizations of viscous wave equations, Applied Mathematical Modelling, 40 (2016), pp. 4429–4440.
[14] B. Cockburn and C.-W. Shu, Runge-Kutta discontinuous Galerkin methods for convection-dominated problems, Journal of Scientific Computing, 16 (2001), pp. 173–261.
[15] R. D. Falgout and U. M. Yang, hypre: a library of high performance preconditioners, European Conference on Parallel Processing, 2331 LNCS (2002), pp. 632–641.
[16] P. E. Farrell, R. C. Kirby, and J. Marchena-Menendez, Irksome: automating Runge–Kutta time-stepping for finite element methods, arXiv preprint arXiv:2006.16282, (2020).
[17] M. J. Gander and M. Neumüller, Analysis of a new space-time parallel multigrid algorithm for parabolic problems, SIAM Journal on Scientific Computing, 38 (2016), pp. A2173–A2208.
[18] E. Hairer and G. Wanner, Solving Ordinary Differential Equations II: Stiff and Differential-Algebraic Problems, (1996), pp. 118–130.
[19] E. Hairer, G. Wanner, and C. Lubich, Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations, (2002).
[20] W. Hoffmann and J. J. B. de Swart, Approximating Runge-Kutta matrices by triangular matrices, BIT Numerical Mathematics, 37 (1997), pp. 346–354.
[21] P. J. van der Houwen and J. J. B. de Swart, Parallel linear system solvers for Runge-Kutta methods, Advances in Computational Mathematics, 7 (1997), pp. 157–181.
[22] P. J. van der Houwen and J. J. B. de Swart, Triangularly implicit iteration methods for ODE-IVP solvers, SIAM Journal on Scientific Computing, 18 (1997), pp. 41–55.
[23] L. O. Jay, Inexact simplified Newton iterations for implicit Runge-Kutta methods, SIAM Journal on Numerical Analysis, 38 (2000), pp. 1369–1388.
[24] X. Jiao, X. Wang, and Q. Chen, Optimal and low-memory near-optimal preconditioning of fully implicit Runge-Kutta schemes for parabolic PDEs, arXiv preprint arXiv:2012.12779, (2020).
[25] C. Kennedy and M. H. Carpenter, Diagonally implicit Runge-Kutta methods for ordinary differential equations: a review, tech. report, 2016.
[26] M. Kronbichler and K. Ljungkvist, Multigrid for matrix-free high-order finite element computations on graphics processors, ACM Transactions on Parallel Computing, 6 (2019), pp. 1–32.
[27] P. Lasaint and P. Raviart, On a finite element method for solving the neutron transport equation, Mathematical Aspects of Finite Elements in Partial Differential Equations, (1974), pp. 89–123.
[28] J. Van Lent and S. Vandewalle, Multigrid methods for implicit Runge–Kutta and boundary value method discretizations of parabolic PDEs, SIAM Journal on Scientific Computing, 27 (2005), pp. 67–92.
[29] R. J. LeVeque, Finite Difference Methods for Ordinary and Partial Differential Equations: Steady-State and Time-Dependent Problems, vol. 98, SIAM, 2007.
[30] J. Liesen and P. Tichý, The field of values bound on ideal GMRES, arXiv preprint arXiv:1211.5969, (2020).
[31] C. Makridakis and R. H. Nochetto, A posteriori error analysis for higher order dissipative methods for evolution problems, Numerische Mathematik, 104 (2006), pp. 489–514.
[32] T. A. Manteuffel, S. Münzenmaier, J. Ruge, and B. S. Southworth, Nonsymmetric reduction-based algebraic multigrid, SIAM J. Sci. Comput., 41 (2019), pp. S242–S268.
[33] T. A. Manteuffel, J. Ruge, and B. S. Southworth, Nonsymmetric algebraic multigrid based on local approximate ideal restriction (ℓAIR), SIAM J. Sci. Comput., 40 (2018), pp. A4105–A4130.
[34] K. A. Mardal, T. K. Nilssen, and G. A. Staff, Order-optimal preconditioners for implicit Runge–Kutta schemes applied to parabolic PDEs, SIAM Journal on Scientific Computing, 29 (2007), pp. 361–375.
[35] J. Melenk, K. Gerdes, and C. Schwab, Fully discrete hp-finite elements: fast quadrature, Computer Methods in Applied Mechanics and Engineering, 190 (2001), pp. 4339–4364.
[36] T. K. Nilssen, G. A. Staff, and K. Mardal, Order optimal preconditioners for fully implicit Runge-Kutta schemes applied to the bidomain equations, Numerical Methods for Partial Differential Equations, 27 (2011), pp. 1290–1312.
[37] S. P. Nørsett, Runge-Kutta methods with a multiple real eigenvalue only, BIT Numerical Mathematics, 16 (1976), pp. 388–393.
[38] Y. Notay, Flexible conjugate gradients, SIAM Journal on Scientific Computing, 22 (2000), pp. 1444–1460, https://doi.org/10.1137/s1064827599362314.
[39] B. Orel, Real pole approximations to the exponential function, BIT, 31 (1991), pp. 144–159.
[40]
S. A. Orszag , Spectral methods for problems in complex geometries , Journalof Computational Physics, 37 (1980), pp. 70–92, doi: .[41]
W. Pazner , Efficient low-order refined preconditioners for high-order matrix-freecontinuous and discontinuous Galerkin methods , SIAM Journal on Scientific Com-puting (In Press), (2020).[42]
W. Pazner and P.-O. Persson , Stage-parallel fully implicit Runge–Kutta solversfor discontinuous Galerkin fluid simulations , Journal of Computational Physics, 335(2017), pp. 700–717, doi: .[43]
M. M. Rana, V. E. Howle, K. Long, A. Meek, and W. Milestone , Anew block preconditioner for implicit runge-kutta methods for parabolic pde , arXivpreprint arXiv:2010.11377, (2020).[44]
S. C. Reddy and L. N. Trefethen , Stability of the method of lines , NumerischeMathematik, 62 (1992), pp. 235–267, doi: .[45]
T. Richter, A. Springer, and B. Vexler , Efficient numerical realiza-tion of discontinuous Galerkin methods for temporal discretization of parabolicproblems , Numerische Mathematik, 124 (2013), pp. 151–182, doi: .[46]
Y. Saad , A flexible inner-outer preconditioned GMRES algorithm , SIAM Journalon Scientific Computing, 14 (1993), pp. 461–469.2447]
D. Sch¨otzau and C. Schwab , Time Discretization of Parabolic Problems by theHP-Version of the Discontinuous Galerkin Finite Element Method , SIAM Journalon Numerical Analysis, 38 (2000), pp. 837–875, doi: .[48]
I. Smears , Robust and efficient preconditioners for the discontinuous Galerkintime-stepping method , IMA Journal of Numerical Analysis, (2016), p. drw050,doi: .[49]
B. S. Southworth, O. A. Krzysik, and W. Pazner , Fast parallel solution offully implicit Runge-Kutta and discontinuous Galerkin in time for numerical PDEs,part II: nonlinearities and DAEs , arXiv preprint arXiv:2101.01776, (2021).[50]
G. A. Staff, K.-A. Mardal, and T. K. Nilssen , Preconditioning of fullyimplicit Runge-Kutta schemes for parabolic PDEs , Modeling, Identification andControl: A Norwegian Research Bulletin, 27 (2006), pp. 109–123, doi: .[51]