A Note on Parallel Preconditioning for All-At-Once Evolutionary PDEs
ANTHONY GODDARD∗ AND ANDY WATHEN†

∗ Mathematical Institute, Oxford University, UK. Supported by the James Pantyfedwen Foundation.
† Mathematical Institute, Oxford University, UK.
Abstract. McDonald, Pestana and Wathen (SIAM J. Sci. Comput., 40 (2018), no. 2, pp. A1012–A1033) present a method for preconditioning time-dependent PDEs via approximation by a nearby time-periodic problem; that is, they employ circulant-related matrices as preconditioners for the nonsymmetric block Toeplitz matrices which arise from an all-at-once formulation. They suggest that such an approach might be efficiently implemented in parallel. In this short article, we present parallel numerical results for their preconditioner which exhibit strong scaling. We also extend their preconditioner via a Neumann series approach, which also allows for efficient parallel execution. Our simple implementation (in C++ and MPI) is available at the Git repository PARALAAOMPI (https://github.com/anthonyjamesgoddard/PARALAAOMPI).
1. Introduction.
There have been several suggestions for ways to achieve parallel, computationally efficient methods for time-dependent problems. Perhaps the most common approaches are based on the Parareal algorithm [7] and its use together with multilevel ideas [4], although there are several other approaches; see the review by Gander [6]. A recent suggestion by McDonald et al. [1] involves the use of circulant-related preconditioners for the block Toeplitz matrices that arise from the approximation of initial value problems with constant time-steps. Their approach is particularly geared to linear initial value problems for PDEs, although it can be applied in the simpler context of ODE IVPs [9]. For PDEs, regularity of the spatial grid is not required.

The approach requires the solution of block diagonal systems in different orderings so as to effect the action of the preconditioner, as we show below. It also allows for an extension involving a Neumann series, which we introduce here. Both the extended and the original preconditioners would appear suitable for effective parallel implementation, although such implementation details have not been explored before. That the original preconditioner leads to a small number of iterations, independent of the number of time-steps, when employed with the widely used GMRES method [5] is established in [1].

In this short article we make a preliminary exploration of the possibilities for effective parallel implementation of these preconditioners. Our initial results, using C++ and MPI, show strong scaling with up to 32 cores for the preconditioned GMRES solution of the all-at-once (monolithic) system. This system is derived from a spatial finite element approximation and a simple time-stepping strategy in the usual method of lines approach. We explore all-at-once formulations for the heat equation and for the wave equation together with associated initial and boundary conditions.

In Section 2 we describe the original McDonald-Pestana-Wathen preconditioner and our extension of it. The details of our parallel implementation are given in Section 3 and numerical (timing) results in Section 4, followed by conclusions.
2. Description of the Preconditioners.
Consider the heat equation

\[
\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} \quad \text{on } \Omega \times (0, T], \tag{2.1}
\]
\[
u = 0 \quad \text{for } x \in \partial\Omega, \tag{2.2}
\]
\[
u(0, x) = u_s(x), \tag{2.3}
\]
where $\Omega = [0,1]$ and $T$ is the final time. Discretising in space with standard Galerkin finite elements we obtain

\[
M \frac{d\mathbf{u}}{dt} = -K\mathbf{u}, \tag{2.4}
\]

where $M, K \in \mathbb{R}^{n \times n}$ are the mass and stiffness matrices and $n$ is the number of nodes in the spatial discretisation. Now we use the implicit Euler scheme to discretise the temporal domain to obtain

\[
(M + \tau K)\,\mathbf{u}_k = M\mathbf{u}_{k-1}, \qquad k \in [1, \ell], \tag{2.5}
\]

where $\tau$ is the constant time-step, $\ell$ is the number of time-steps in the temporal discretisation and $\mathbf{u}_0$ is a projection of the initial data. The idea presented in [1] involves packaging the approximate solutions into a so-called monolithic system. Executing this idea yields

\[
A\mathbf{U} =
\begin{bmatrix}
A_0 & & & \\
A_1 & A_0 & & \\
 & \ddots & \ddots & \\
 & & A_1 & A_0
\end{bmatrix}
\begin{bmatrix}
\mathbf{u}_1\\ \mathbf{u}_2\\ \vdots\\ \mathbf{u}_\ell
\end{bmatrix}
=
\begin{bmatrix}
M\mathbf{u}_0\\ 0\\ \vdots\\ 0
\end{bmatrix}
= \mathbf{b}, \tag{2.6}
\]

where $A_0 = M + \tau K$, $A_1 = -M$ and $A \in \mathbb{R}^{n\ell \times n\ell}$. We note that the monolithic system does not need to be formed explicitly; it is merely used as a conduit for demonstration purposes. The McDonald-Pestana-Wathen preconditioner in its original form is the block circulant matrix

\[
P =
\begin{bmatrix}
A_0 & & & A_1\\
A_1 & A_0 & & \\
 & \ddots & \ddots & \\
 & & A_1 & A_0
\end{bmatrix}. \tag{2.7}
\]
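Since $A$ need never be assembled, its action on a vector can be applied block row by block row, using only products with the two $n \times n$ blocks. The following is a minimal sketch of such a matrix-free application; it is not taken from the repository, and `applyA0` and `applyA1` stand for user-supplied sparse matrix-vector products with $A_0 = M + \tau K$ and $A_1 = -M$.

```cpp
// Matrix-free application of the all-at-once operator A of (2.6):
// block row k of A*U is A0*u_k + A1*u_{k-1} (the first block row has no A1 term).
// applyA0/applyA1 are assumed user-supplied sparse mat-vecs; names are illustrative.
#include <algorithm>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

Vec applyAllAtOnce(const std::function<Vec(const Vec&)>& applyA0,
                   const std::function<Vec(const Vec&)>& applyA1,
                   const Vec& U, int n, int ell) {
    Vec result(static_cast<std::size_t>(n) * ell);
    Vec uk(n), ukm1(n);
    for (int k = 0; k < ell; ++k) {
        std::copy(U.begin() + static_cast<std::size_t>(k) * n,
                  U.begin() + static_cast<std::size_t>(k + 1) * n, uk.begin());
        Vec rk = applyA0(uk);                       // A0 * u_k
        if (k > 0) {
            std::copy(U.begin() + static_cast<std::size_t>(k - 1) * n,
                      U.begin() + static_cast<std::size_t>(k) * n, ukm1.begin());
            Vec s = applyA1(ukm1);                  // A1 * u_{k-1}
            for (int r = 0; r < n; ++r) rk[r] += s[r];
        }
        std::copy(rk.begin(), rk.end(),
                  result.begin() + static_cast<std::size_t>(k) * n);
    }
    return result;
}
```

The block circulant $P$ of (2.7) can be applied in the same way, with an additional $A_1\mathbf{u}_\ell$ contribution in the first block row.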
We can extend the all-at-once method to non-uniform time-stepping schemes. Consider applying the implicit Euler scheme to (2.4) with variable time-steps $\tau_1, \tau_2, \dots, \tau_\ell$: equation (2.5) then becomes

\[
(M + \tau_k K)\,\mathbf{u}_k = M\mathbf{u}_{k-1}, \qquad k \in [1, \ell]. \tag{2.8}
\]

We can package this sequence into a monolithic system as follows:

\[
B\mathbf{U} =
\begin{bmatrix}
A^{(1)} & & & \\
A_1 & A^{(2)} & & \\
 & \ddots & \ddots & \\
 & & A_1 & A^{(\ell)}
\end{bmatrix}
\begin{bmatrix}
\mathbf{u}_1\\ \mathbf{u}_2\\ \vdots\\ \mathbf{u}_\ell
\end{bmatrix}
=
\begin{bmatrix}
M\mathbf{u}_0\\ 0\\ \vdots\\ 0
\end{bmatrix}
= \mathbf{b}, \tag{2.9}
\]

where $A^{(k)} = M + \tau_k K$ and, as before, $A_1 = -M$. While we have lost the block Toeplitz structure of the system, this will not prevent us from applying

\[
Q =
\begin{bmatrix}
A^{(1)} & & & A_1\\
A_1 & A^{(2)} & & \\
 & \ddots & \ddots & \\
 & & A_1 & A^{(\ell)}
\end{bmatrix} \tag{2.10}
\]

as a preconditioner. The issue is that of the computational cost of the application of the preconditioner. We can write

\[
Q = P + \sigma \otimes K, \tag{2.11}
\]

where $\sigma$ is a diagonal matrix with diagonal entries given by $\sigma_k = \tau_k - \tau$, $k \in [1, \ell]$, and $\tau = T/\ell$. Assuming that $\|(\sigma \otimes K)P^{-1}\| < 1$, we can use a Neumann approximation to the inverse of $Q$ to obtain

\[
Q_i^{-1} = P^{-1} - P^{-1}\bigl([\sigma \otimes K]P^{-1}\bigr) + P^{-1}\bigl([\sigma \otimes K]P^{-1}\bigr)^2 - \cdots + (-1)^{i-1} P^{-1}\bigl([\sigma \otimes K]P^{-1}\bigr)^{i-1}, \tag{2.12}
\]

where $i$ is some positive integer that we must choose.

We can also apply the all-at-once method to hyperbolic equations. We will restrict our scope to the wave equation

\[
\frac{\partial^2 u}{\partial t^2} = \frac{\partial^2 u}{\partial x^2} \quad \text{on } \Omega \times (0, T], \tag{2.13}
\]
\[
u = 0 \quad \text{for } x \in \partial\Omega, \tag{2.14}
\]
\[
\frac{\partial u}{\partial t}(0, x) = 0, \tag{2.15}
\]
\[
u(0, x) = u_s(x). \tag{2.16}
\]

Discretising in space we obtain

\[
M \frac{d^2\mathbf{u}}{dt^2} = -K\mathbf{u}. \tag{2.17}
\]

We can choose the central difference formula to approximate the second time derivative, resulting in the sequence

\[
M\mathbf{u}_{n-1} + (\tau^2 K - 2M)\,\mathbf{u}_n + M\mathbf{u}_{n+1} = 0. \tag{2.18}
\]

Casting this sequence into a monolithic system, we obtain

\[
C_{CD}\mathbf{U} =
\begin{bmatrix}
A_0 & M & & & \\
M & A_0 & M & & \\
 & M & A_0 & \ddots & \\
 & & \ddots & \ddots & M\\
 & & & M & A_0
\end{bmatrix}
\begin{bmatrix}
\mathbf{u}_1\\ \mathbf{u}_2\\ \mathbf{u}_3\\ \vdots\\ \mathbf{u}_\ell
\end{bmatrix}
=
\begin{bmatrix}
-M\mathbf{u}_0\\ 0\\ \vdots\\ 0
\end{bmatrix}
= \mathbf{b}_{CD}, \tag{2.19}
\]

where $A_0 = \tau^2 K - 2M$ for this problem. We can then precondition this system with the block circulant

\[
R_{CD} =
\begin{bmatrix}
A_0 & M & & & M\\
M & A_0 & M & & \\
 & M & A_0 & \ddots & \\
 & & \ddots & \ddots & M\\
M & & & M & A_0
\end{bmatrix}. \tag{2.20}
\]

Alternatively, we can approximate the second time derivative as

\[
\frac{d^2\mathbf{u}_n}{dt^2} \approx \frac{\mathbf{u}_{n-2} - 2\mathbf{u}_{n-1} + \mathbf{u}_n}{\tau^2}. \tag{2.21}
\]

Expression (2.21) will be referred to as the 2-step backwards difference (BD2) formula, which is a first order approximation. By substituting (2.21) into (2.17) and rearranging we obtain

\[
M\mathbf{u}_{n-2} - 2M\mathbf{u}_{n-1} + (M + \tau^2 K)\,\mathbf{u}_n = 0, \tag{2.22}
\]
which can be compiled into a monolithic system

\[
C_{BD2}\mathbf{U} =
\begin{bmatrix}
A_0 & & & & \\
A_1 & A_0 & & & \\
A_2 & A_1 & A_0 & & \\
 & \ddots & \ddots & \ddots & \\
 & & A_2 & A_1 & A_0
\end{bmatrix}
\begin{bmatrix}
\mathbf{u}_1\\ \mathbf{u}_2\\ \mathbf{u}_3\\ \vdots\\ \mathbf{u}_\ell
\end{bmatrix}
=
\begin{bmatrix}
(M + \tau^2 K)\mathbf{u}_0\\ -M\mathbf{u}_0\\ 0\\ \vdots\\ 0
\end{bmatrix}
= \mathbf{b}_{BD2}, \tag{2.23}
\]

where $A_0 = M + \tau^2 K$, $A_1 = -2M$, $A_2 = M$ for this problem. We will precondition this system with

\[
R_{BD2} =
\begin{bmatrix}
A_0 & & & A_2 & A_1\\
A_1 & A_0 & & & A_2\\
A_2 & A_1 & A_0 & & \\
 & \ddots & \ddots & \ddots & \\
 & & A_2 & A_1 & A_0
\end{bmatrix}. \tag{2.24}
\]

We can also use a second order backwards difference formula

\[
\frac{d^2\mathbf{u}_n}{dt^2} \approx \frac{-\mathbf{u}_{n-3} + 4\mathbf{u}_{n-2} - 5\mathbf{u}_{n-1} + 2\mathbf{u}_n}{\tau^2}, \tag{2.25}
\]

which is a second order method and will be referred to as the 4-step backwards difference (BD4) formula. By substituting (2.25) into (2.17) we obtain

\[
(2M + \tau^2 K)\,\mathbf{u}_n - 5M\mathbf{u}_{n-1} + 4M\mathbf{u}_{n-2} - M\mathbf{u}_{n-3} = 0. \tag{2.26}
\]

As a consequence of using a 4-step method we have to approximate $\mathbf{u}_1, \mathbf{u}_2$ using sub-4-step methods. In this case we can use BD2 and the initial conditions. This results in the monolithic system

\[
C_{BD4}\mathbf{U} =
\begin{bmatrix}
B & & & & & \\
C & B & & & & \\
A_2 & A_1 & A_0 & & & \\
A_3 & A_2 & A_1 & A_0 & & \\
 & \ddots & \ddots & \ddots & \ddots & \\
 & & A_3 & A_2 & A_1 & A_0
\end{bmatrix}
\begin{bmatrix}
\mathbf{u}_1\\ \mathbf{u}_2\\ \mathbf{u}_3\\ \mathbf{u}_4\\ \vdots\\ \mathbf{u}_\ell
\end{bmatrix}
=
\begin{bmatrix}
B\mathbf{u}_0\\ -M\mathbf{u}_0\\ M\mathbf{u}_0\\ 0\\ \vdots\\ 0
\end{bmatrix}
= \mathbf{b}_{BD4}, \tag{2.27}
\]

where $B = M + \tau^2 K$, $C = -2M$, $A_0 = 2M + \tau^2 K$, $A_1 = -5M$, $A_2 = 4M$, $A_3 = -M$ for this problem. We precondition this formulation of the wave equation system in a slightly different way. Previously we preconditioned the Toeplitz system with its corresponding circulant. In this instance, we do not have a Toeplitz structure to begin with. To overcome this, we precondition the system with the circulant matrix that would result if we had a Toeplitz structure, namely
\[
R_{BD4} =
\begin{bmatrix}
A_0 & & & & A_3 & A_2 & A_1\\
A_1 & A_0 & & & & A_3 & A_2\\
A_2 & A_1 & A_0 & & & & A_3\\
A_3 & A_2 & A_1 & A_0 & & & \\
 & \ddots & \ddots & \ddots & \ddots & & \\
 & & & A_3 & A_2 & A_1 & A_0
\end{bmatrix}. \tag{2.28}
\]

We will not pursue non-uniform temporal domains for this problem since there is less interest in using non-uniform time-steps for the wave equation.
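As an illustration of how the truncated series (2.12) can be applied to a vector in practice, the following minimal sketch assumes user-supplied routines `applyPinv` and `applySigmaK` for the actions of $P^{-1}$ and $\sigma \otimes K$; the routine names are illustrative and are not those of the repository code.

```cpp
// Apply the truncated Neumann approximation (2.12) to a vector v:
//   Q_i^{-1} v = sum_{j=0}^{i-1} (-1)^j  P^{-1} ( [sigma ⊗ K] P^{-1} )^j  v.
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Op  = std::function<Vec(const Vec&)>;

Vec applyNeumann(const Op& applyPinv, const Op& applySigmaK, const Vec& v, int i) {
    Vec term = applyPinv(v);                    // j = 0 term: P^{-1} v
    Vec result = term;
    double sign = 1.0;
    for (int j = 1; j < i; ++j) {
        term = applyPinv(applySigmaK(term));    // P^{-1} ([sigma ⊗ K] P^{-1})^j v
        sign = -sign;
        for (std::size_t m = 0; m < result.size(); ++m)
            result[m] += sign * term[m];
    }
    return result;
}
```

Each additional term in the series costs one further application of $P^{-1}$ and one multiplication by $\sigma \otimes K$; this trade-off is examined in Sections 3 and 4.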
3. Parallel Implementation.
Throughout our implementation we keep all matrices on the master process and broadcast them when necessary. Vectors, on the other hand, are defined on all processes. To understand the reasoning for this, consider the fact that the dense matrix $U$ introduced below requires $O(\ell^2)$ memory and our sparse blocks (linear combinations of $M$ and $K$) each require $O(n)$ memory. Since there are $\ell$ sparse blocks, the total memory cost is $O(\ell^2 + n\ell)$. Further, since a vector requires $O(n\ell)$ memory, if each one of $p$ processes has a copy of the vector, then we are using $O(n\ell p)$ memory. In the complexity analysis below, we consider the situation where $\ell$ processes are available; using $\ell$ processes would significantly increase the memory requirements of our implementation with this memory management scheme. In our case $p \ll \ell$, and so the vector storage cost almost matches that of the matrix storage cost. Even in the case where $p \sim \ell$, the reduction in communication cost that results from this type of memory management is likely to be significant.

In order to see how (2.7) can be applied in parallel we follow [1] in writing it in the form

\[
P = I_\ell \otimes A_0 + \Sigma \otimes A_1, \tag{3.1}
\]

where
\[
\Sigma =
\begin{bmatrix}
 & & & & 1\\
1 & & & & \\
 & 1 & & & \\
 & & \ddots & & \\
 & & & 1 &
\end{bmatrix}
\in \mathbb{R}^{\ell \times \ell}, \tag{3.2}
\]

and $I_\ell$ is the $\ell \times \ell$ identity matrix. The key property of circulant matrices is that they can be diagonalised by a Fourier basis. That is, we can write $\Sigma = U\Lambda U^*$, where $U_{k,j} = e^{2\pi i (k-1)(j-1)/\ell}/\sqrt{\ell}$ and the diagonal entries of $\Lambda$ are the eigenvalues of the "downshift" matrix $\Sigma$, which are $\ell$-th roots of unity. Hence, again following [1], (3.1) can be written as

\[
P = (U \otimes I_n)\bigl[I_\ell \otimes A_0 + \Lambda \otimes A_1\bigr](U^* \otimes I_n). \tag{3.3}
\]

Inverting $P$ we obtain

\[
P^{-1} = (U \otimes I_n)\bigl[I_\ell \otimes A_0 + \Lambda \otimes A_1\bigr]^{-1}(U^* \otimes I_n). \tag{3.4}
\]

The application of $(U \otimes I_n)$ to a vector can be carried out in parallel. To see this, consider the explicit representation of $(U \otimes I_n)$ given by

\[
(U \otimes I_n) =
\begin{bmatrix}
U_{11} I_n & \cdots & U_{1\ell} I_n\\
\vdots & \ddots & \vdots\\
U_{\ell 1} I_n & \cdots & U_{\ell\ell} I_n
\end{bmatrix}. \tag{3.5}
\]
First, we broadcast each row of $U$ to a process. Then we can evaluate

\[
\mathbf{y}_i =
\begin{bmatrix}
U_{i1} I_n & \cdots & U_{i\ell} I_n
\end{bmatrix}
\begin{bmatrix}
\mathbf{z}_1\\ \vdots\\ \mathbf{z}_\ell
\end{bmatrix} \tag{3.6}
\]

on each process, where $\mathbf{z}_i \in \mathbb{R}^n$ is a chunk of the $n\ell$-vector $\mathbf{z}$ and $i$ is an integer such that $i \in [1, \ell]$. This can be done in $O(n\ell)$ operations. The local results $\mathbf{y}_i$ of the calculations carried out on each process can then be reduced to yield the final result of the calculation $(U \otimes I_n)\mathbf{z}$.

The only other implementation-specific issue that we need to be concerned with is the inversion of the block diagonal matrix $I_\ell \otimes A_0 + \Lambda \otimes A_1$. This can easily be carried out in parallel by assigning each diagonal block $A_0 + \lambda_i A_1$, where $\lambda_i$ is the $i$-th diagonal entry of $\Lambda$, to a process and then applying the Thomas algorithm to each block (in one spatial dimension each block is tridiagonal), which incurs a cost of $O(n)$ over $\ell$ processes. Therefore, the total complexity of this implementation is $O(n + n\ell)$ over $\ell$ processes. As in [1], multilevel iterations can be applied as approximate solvers for the spatial operators when there is more than one spatial dimension.

An alternative method involves the use of the Fast Fourier Transform (FFT). In order to see how the FFT can be applied in this case, we can write

\begin{align}
(U \otimes I_n)\mathbf{z} &=
\begin{bmatrix}
U_{11} I_n & \cdots & U_{1\ell} I_n\\
\vdots & \ddots & \vdots\\
U_{\ell 1} I_n & \cdots & U_{\ell\ell} I_n
\end{bmatrix}
\begin{bmatrix}
\mathbf{z}_1\\ \vdots\\ \mathbf{z}_\ell
\end{bmatrix} \tag{3.7}\\
&=
\begin{bmatrix}
U & & \\
 & \ddots & \\
 & & U
\end{bmatrix}
\begin{bmatrix}
\mathbf{x}_1\\ \vdots\\ \mathbf{x}_n
\end{bmatrix} \tag{3.8}\\
&= (I_n \otimes U)\mathbf{x}, \tag{3.9}
\end{align}

where $(\mathbf{x}_i)_k = (\mathbf{z}_k)_i$ and the $\mathbf{x}_i \in \mathbb{R}^\ell$ are the chunks of $\mathbf{x}$, with $i \in [1, n]$ and $k \in [1, \ell]$. This transformation is called a vector transpose and can be carried out in $O(n)$ over $\ell$ processes. We can therefore form

\[
\mathbf{y}_i = U\mathbf{x}_i, \tag{3.10}
\]

where $i$ is an integer such that $i \in [1, n]$. According to [8], the cost of applying $U$ to a vector using the FFT is $O(\ell \log \ell)$. Before we can progress with the calculation, we have to apply the vector transpose again to $[\mathbf{y}_1, \dots, \mathbf{y}_n]^T$. Consequently, the complexity of this implementation becomes $O(\ell \log \ell + n)$ over $\max(n, \ell)$ processes.

The timing results in the next section were obtained using the first method of evaluating $(U \otimes I_n)$, not the FFT method. The reason for this is that the extra communication required to perform the transpose operation multiple times significantly reduced the performance of the all-at-once implementation. However, the functionality for both routines is provided in the GitHub repository.

From an inspection of the operations we can see that the most expensive component of a GMRES iteration is the application of the preconditioner. If we consider applying the above approach to $Q_i^{-1}B\mathbf{U} = Q_i^{-1}\mathbf{b}$, then to increase $i$ from 1 to 2 we have to take into account the cost of one extra preconditioner application, as well as extra communication. That is, for the Neumann method to be considered effective, we must expect the number of GMRES iterations needed to solve the problem to reduce by more than half.
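To make this requirement explicit, consider a rough cost model in which a single application of $P^{-1}$ costs $C_P$ and dominates the remaining work of a GMRES iteration (the quantities $C_P$, $C_{\sigma K}$, $C_{\mathrm{comm}}$, $k_1$ and $k_2$ are introduced here purely for illustration). If $k_1$ and $k_2$ denote the iteration counts obtained with $Q_1^{-1} = P^{-1}$ and with $Q_2^{-1}$ respectively, then

\[
\mathrm{cost}(i=1) \approx k_1\, C_P,
\qquad
\mathrm{cost}(i=2) \approx k_2\bigl(2\,C_P + C_{\sigma K} + C_{\mathrm{comm}}\bigr) \;>\; 2\,k_2\, C_P,
\]

where $C_{\sigma K}$ is the cost of one multiplication by $\sigma \otimes K$ and $C_{\mathrm{comm}}$ the extra communication cost. Hence $\mathrm{cost}(i=2) < \mathrm{cost}(i=1)$ already requires $k_2 < k_1/2$: the iteration count must indeed fall by more than half.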
The vector transpose required by the FFT-based variant is implemented in the routine VectorTranspose of ParallelRoutines.cpp in the Git repository.
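To illustrate the first method concretely, the following is a self-contained sketch of applying $P^{-1}$ as in (3.4)-(3.6) with MPI, in the one-dimensional setting where $A_0 = M + \tau K$ and $A_1 = -M$ are tridiagonal. This is not the repository code: the function names are illustrative, the entries of $U$ are generated on the fly rather than broadcast, and each process simply handles a subset of the $\ell$ Fourier modes, with a final reduction assembling the action of $(U \otimes I_n)$ on the locally transformed blocks.

```cpp
// Sketch of y = P^{-1} z following (3.4)-(3.6); A0 and A1 are passed via their
// three diagonals (sub, main, super), with sub[0] and sup[n-1] unused.
#include <mpi.h>
#include <cmath>
#include <complex>
#include <vector>

using cd = std::complex<double>;

// Thomas algorithm for a complex tridiagonal system.
std::vector<cd> thomasSolve(std::vector<cd> a, std::vector<cd> b,
                            std::vector<cd> c, std::vector<cd> d) {
    const std::size_t n = b.size();
    for (std::size_t i = 1; i < n; ++i) {
        cd w = a[i] / b[i - 1];
        b[i] -= w * c[i - 1];
        d[i] -= w * d[i - 1];
    }
    std::vector<cd> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (std::size_t i = n - 1; i-- > 0; )
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    return x;
}

std::vector<double> applyPinv(const std::vector<double>& z, int n, int ell,
                              const std::vector<double>& A0sub, const std::vector<double>& A0dia,
                              const std::vector<double>& A0sup, const std::vector<double>& A1sub,
                              const std::vector<double>& A1dia, const std::vector<double>& A1sup) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    const double pi = 3.14159265358979323846;
    std::vector<cd> acc(static_cast<std::size_t>(n) * ell, cd(0.0, 0.0));

    for (int k = rank; k < ell; k += size) {        // Fourier modes owned by this process
        // Step 1: block k of (U* ⊗ I_n) z, cf. (3.6) with conjugated entries of U.
        std::vector<cd> y(n, cd(0.0, 0.0));
        for (int j = 0; j < ell; ++j) {
            cd u = std::exp(cd(0.0, -2.0 * pi * k * j / ell)) / std::sqrt(double(ell));
            for (int r = 0; r < n; ++r) y[r] += u * z[static_cast<std::size_t>(j) * n + r];
        }
        // Step 2: solve (A0 + lambda_k A1) w = y, lambda_k an ell-th root of unity.
        cd lambda = std::exp(cd(0.0, -2.0 * pi * k / ell));
        std::vector<cd> sub(n), dia(n), sup(n);
        for (int r = 0; r < n; ++r) {
            sub[r] = A0sub[r] + lambda * A1sub[r];
            dia[r] = A0dia[r] + lambda * A1dia[r];
            sup[r] = A0sup[r] + lambda * A1sup[r];
        }
        std::vector<cd> w = thomasSolve(sub, dia, sup, y);
        // Step 3: this mode's contribution to (U ⊗ I_n) applied to the block solves.
        for (int j = 0; j < ell; ++j) {
            cd u = std::exp(cd(0.0, 2.0 * pi * j * k / ell)) / std::sqrt(double(ell));
            for (int r = 0; r < n; ++r) acc[static_cast<std::size_t>(j) * n + r] += u * w[r];
        }
    }
    // Sum the per-process contributions; the exact result is real, so the imaginary
    // parts cancel and only the real parts need to be reduced.
    std::vector<double> local(acc.size()), global(acc.size());
    for (std::size_t i = 0; i < acc.size(); ++i) local[i] = acc[i].real();
    MPI_Allreduce(local.data(), global.data(), static_cast<int>(global.size()),
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}
```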
4. Numerical results.
All parallel results in this section were generated on the nightcrawler workstation at Oxford University. This machine is equipped with Intel(R) Xeon(R) Gold 6140 CPUs running at 2.30GHz, 768GB of RAM and 4600GB of scratch disk capacity. Care was taken to access the machine when the workload was low so as to maintain consistent results.

In [1] we see that the idea of preconditioning a block Toeplitz system with a block circulant matrix yields very good theoretical results. Table 1 of [1] shows that few iterations are required when computing the solution of such preconditioned Toeplitz systems. It was suggested in [1] that the all-at-once method can be executed in parallel. Timing results for the heat equation on a uniform temporal domain with initial condition $u_s(x) = x(1-x)$ are given in Table 4.1.
TABLE 4.1. Timed results (in seconds) for solving the system $P^{-1}A\mathbf{U} = P^{-1}\mathbf{b}$ using GMRES with a fixed convergence tolerance. The iteration count remained at a constant value of 2 for all values of $n$ and $\ell$ tested. $p$ is the number of processes used in the calculations.

For the largest problems tested, the time taken to solve the system reduces from a matter of minutes on a single process to around a minute. Observe that, for all values of $n$ and $\ell$, the time taken to solve the problem reduces by more than half as a result of increasing $p$ from 1 to 2. We suspect that this is because half of the problem fits in local memory better than the entire problem does.

While the data presented in Table 4.1 is very useful for highlighting the speed-up achieved by distributing the calculation across multiple processes, it would be useful to see how efficiently the processes are being used. To see this, we consider the parallel efficiency of $p$ processes, defined by

\[
P^{p}_{\mathrm{eff}} = \frac{\text{Time Taken on 1 Process}}{p \times \text{Time Taken on } p \text{ Processes}}. \tag{4.1}
\]

The parallel efficiency results are shown in Figure 4.1. The suspected reason for the jump in $P^{p}_{\mathrm{eff}}$ at $p = 2$ is that, as noted above, the problem fits better in local memory over two processes. The parallel efficiency falls as we increase the number of processes used in the calculation, which is to be expected; this is a consequence of the number of communications taking place between processes. We note that as we increase the number of degrees of freedom, $P^{p}_{\mathrm{eff}}$ also increases, particularly for the largest values of $n$ and $\ell$, for which
$P^{32}_{\mathrm{eff}} > 1$. That is, our metric for measuring parallel efficiency implies that the all-at-once implementation is more efficient on 32 processes than it is on a single process.

FIG. 4.1. The parallel efficiency of our implementation of GMRES used to solve the all-at-once formulation of the preconditioned heat equation system $P^{-1}A\mathbf{U} = P^{-1}\mathbf{b}$. (Curves are shown for $n \in \{320, 512, 768\}$ with $\ell \in \{768, 1024, 1440\}$, and for $n = 1440$, $\ell = 1568$.)

In Section 2 we introduced an extension to the all-at-once method that enables us to consider problems that are non-uniformly discretised in time. The key behind the extension is the Neumann approximation of the preconditioner $Q^{-1}$ given by (2.12). The non-uniform temporal discretisation that was used to obtain the numerical results associated with the system $Q_i^{-1}B\mathbf{U} = Q_i^{-1}\mathbf{b}$ is given by

\[
t_j =
\begin{cases}
0, & j = 0,\\
\tau\bigl(j + \delta(\mathrm{Rand}(0,1,j) - 0.5)\bigr), & j \in [1, \ell-1],\\
T, & j = \ell,
\end{cases}
\tag{4.2}
\]

where $j$ is an integer, $\delta$ is a real number such that $\delta \in (0,1)$ and $\mathrm{Rand}(0,1,j)$ is a random real number between $0$ and $1$. Larger values of $\delta$ clearly tend to give more irregular time-steps $\tau_j = t_j - t_{j-1}$. Table 4.2 shows the GMRES iteration counts required to solve $Q_i^{-1}B\mathbf{U} = Q_i^{-1}\mathbf{b}$ for increasing values of $i$. We also show $\Delta_{1,2}$, the difference between the time taken to solve the problem with $i = 2$ and the time taken with $i = 1$; that is, a positive value of $\Delta_{1,2}$ indicates that it took longer to solve the problem with $i = 2$ than it did with $i = 1$.
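For concreteness, the perturbed grid (4.2) can be generated as in the following sketch, which assumes $\tau = T/\ell$ as before and uses a standard uniform random number generator in the role of $\mathrm{Rand}(0,1,j)$; the function name and the choice of generator are illustrative only.

```cpp
// Generate the non-uniform time grid of (4.2); the time-steps are tau_j = t[j] - t[j-1].
#include <random>
#include <vector>

std::vector<double> perturbedGrid(int ell, double T, double delta, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> rand01(0.0, 1.0);   // plays the role of Rand(0,1,j)
    const double tau = T / ell;
    std::vector<double> t(ell + 1);
    t[0] = 0.0;
    t[ell] = T;
    for (int j = 1; j < ell; ++j)
        t[j] = tau * (j + delta * (rand01(gen) - 0.5));        // cf. (4.2)
    return t;
}
```

Since $\delta \in (0,1)$, the perturbation of each interior node is smaller than half a uniform step, so the nodes remain strictly increasing.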
TABLE 4.2. GMRES iteration counts for $Q_i^{-1}B\mathbf{U} = Q_i^{-1}\mathbf{b}$, computed with a fixed convergence tolerance. Sub-tables (a)-(f) correspond to different pairs of $n$ and $\ell$; within each sub-table the rows correspond to different values of $\delta$ and the columns to $i = 1, 2, 3$. $\Delta_{1,2}$ is the difference between the time taken to solve the problem with $i = 2$ and the time taken with $i = 1$. In this particular experiment, 16 processes were utilised.

We found $\Delta_{1,2} > 0$ for all values of $n$, $\ell$ and $\delta$ tested. This somewhat validates the claim we made in the previous section, according to which the iteration count would need to reduce by more than half for any improvements to be observed in the timed results. For one combination of $n$, $\ell$ and $\delta$ the iteration count did reduce by half as a result of increasing $i$ from 1 to 2; even in this case $\Delta_{1,2}$ was positive and quite significant. On a positive note, Table 4.2 highlights the robustness of the all-at-once formulation in the presence of temporal perturbations. For instance, increasing $\delta$ from 0.1 to 0.9 increases the iteration count by 2 for all values of $n$ and $\ell$ tested. Increasing $i$ beyond 2 does halve the iteration count in some cases, but since the cost incurred by communication would increase further, there is no advantage in doing so.

In Section 2 we introduced an all-at-once formulation of the wave equation. Three formulas were considered as candidates to approximate the second time derivative. The central difference formulation performed poorly with the smooth initial condition $u_s(x) = \sin(\pi x)$, and
GMRES failed to converge to a solution with the non-smooth initial condition

\[
u_{ns}(x) =
\begin{cases}
\cos\bigl(\pi(x - \,\cdot\,)\bigr), & x \in (\,\cdot\,, \,\cdot\,),\\
0, & \text{elsewhere in } [0, 1],
\end{cases}
\tag{4.3}
\]

a cosine profile supported on a subinterval of $\Omega$ and zero elsewhere, when applied to the system $R_{CD}^{-1}C_{CD}\mathbf{U} = R_{CD}^{-1}\mathbf{b}_{CD}$. The central difference formulation performed poorly on $u_s$ in the sense that the solution displays aggressive numerical dissipation, regardless of how large $n$ and $\ell$ are chosen to be. However, the number of GMRES iterations required to solve the system $R_{CD}^{-1}C_{CD}\mathbf{U} = R_{CD}^{-1}\mathbf{b}_{CD}$ with the smooth initial condition remained at 2 for all values of $n$ and $\ell$ tested. For these reasons the central difference formulation will not be considered further.

Iteration counts for BD2 and BD4 are shown in Table 4.3 and Table 4.4. Plots of the solution of the wave equation, obtained using the corresponding all-at-once formulations, are shown in Figure 4.2. All of these results were obtained with the non-smooth initial condition $u_{ns}$. Iteration counts for BD2 are low and the wave speeds are largely conserved. The drawback of BD2 is the introduction of numerical dissipation in the solution, as seen in Figure 4.2(a), although the dissipation is very subtle compared to that observed in the solution of the central difference formulation. BD4 rectifies this drawback, but iteration counts are much higher. Similarly to the central difference approximation, when the smooth initial condition was used, the number of GMRES iterations required to solve the problem remained fixed at 2 for both the BD2 and BD4 formulations. Our observations indicate that the all-at-once formulation of the wave equation is sensitive to the choice of initial conditions.
TABLE 4.3. GMRES iteration counts $k$ required to solve the wave equation system, together with the wave speed $v$ of the approximate solution; the wave speed of the exact solution is 1. (a) $R_{BD2}^{-1}C_{BD2}\mathbf{U} = R_{BD2}^{-1}\mathbf{b}_{BD2}$; (b) $R_{BD4}^{-1}C_{BD4}\mathbf{U} = R_{BD4}^{-1}\mathbf{b}_{BD4}$. A fixed convergence tolerance was used.
Timed results and parallel efficiencies for the wave equation are given in Table 4.5 and Figure 4.3. The parallel efficiency falls as the number of processes increases for all values of $n$ and $\ell$ tested. However, the parallel efficiency does seem to increase with the number of degrees of freedom. This trend was also observed in the parallel efficiency results for the heat equation.
5. Conclusions.
We have provided a parallel implementation of the all-at-once method of [1]. This was achieved using MPI and C++, and is proof-of-concept software that supports the claims made in [1], namely, that the all-at-once method with the McDonald et al. preconditioner is parallelisable. We have also provided new applications for the all-at-once method, namely to non-uniform temporal discretisations and to hyperbolic equations. Problems in one spatial dimension were considered throughout. To apply the all-at-once method to higher-dimensional problems, some alterations to the implementation provided in this paper are required. An alternative suggested in [1] would be to apply Algebraic Multigrid in parallel instead of the parallel Thomas algorithm.
TABLE 4.4. GMRES iteration counts $k$ required to solve $R_{BD4}^{-1}C_{BD4}\mathbf{U} = R_{BD4}^{-1}\mathbf{b}_{BD4}$ for larger values of $n$ and $\ell$. A fixed convergence tolerance was used.

FIG. 4.2. Solution profiles for the wave equation, plotted as $u$ against $x$. The profiles were taken at equally spaced time intervals and the non-smooth initial condition $u_{ns}$ was used. (a) $R_{BD2}^{-1}C_{BD2}\mathbf{U} = R_{BD2}^{-1}\mathbf{b}_{BD2}$; (b) $R_{BD4}^{-1}C_{BD4}\mathbf{U} = R_{BD4}^{-1}\mathbf{b}_{BD4}$.
TABLE 4.5. Timed results (in seconds) for solving the system $R_{BD2}^{-1}C_{BD2}\mathbf{U} = R_{BD2}^{-1}\mathbf{b}_{BD2}$ using GMRES with a fixed convergence tolerance. The iteration count remained at a constant value of 2 for all values of $n$ and $\ell$ tested. $p$ is the number of processes used in the calculations. The smooth initial condition $u_s$ was used.

REFERENCES

[1] E. McDonald, J. Pestana, and A. J. Wathen, Preconditioning and iterative solution of all-at-once systems for evolutionary partial differential equations, SIAM J. Sci. Comput., 40 (2018), no. 2, pp. A1012–A1033.
[2] C. C. Paige and M. A. Saunders, Solution of sparse indefinite systems of linear equations, SIAM J. Numer. Anal., 12 (1975), pp. 617–629.
[3] J. Pestana and A. J. Wathen, A preconditioned MINRES method for nonsymmetric Toeplitz matrices, SIAM J. Matrix Anal. Appl., 36 (2015), pp. 273–288.
[4] R. D. Falgout, S. Friedhoff, Tz. V. Kolev, S. P. MacLachlan, and J. B. Schroder, Parallel time integration with multigrid, SIAM J. Sci. Comput., 36 (2014), pp. C635–C661.
[5] Y. Saad and M. H. Schultz, GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM J. Sci. Statist. Comput., 7 (1986), pp. 856–869.
[6] M. J. Gander, 50 years of time parallel time integration, in Multiple Shooting and Time Domain Decomposition Methods, Contributions in Mathematical and Computational Sciences, vol. 9, Springer, 2015, pp. 69–113.
[7] J.-L. Lions, Y. Maday, and G. Turinici, Résolution d'EDP par un schéma en temps «pararéel», C. R. Acad. Sci. Paris Sér. I Math., 332 (2001), pp. 661–668.
[8] C. Van Loan, Computational Frameworks for the Fast Fourier Transform, SIAM, Philadelphia, 1992.
[9] E. McDonald, S. Hon, J. Pestana, and A. J. Wathen, Preconditioning for nonsymmetry and time-dependence, in Domain Decomposition Methods in Science and Engineering XXIII, Lecture Notes in Computational Science and Engineering, vol. 116, Springer, 2017, pp. 81–91. Proceedings of the 23rd International Conference on Domain Decomposition Methods (2015), Jeju, Korea.
FIG. 4.3. The parallel efficiency of our implementation of GMRES used to solve the all-at-once formulation of the preconditioned wave equation system $R_{BD2}^{-1}C_{BD2}\mathbf{U} = R_{BD2}^{-1}\mathbf{b}_{BD2}$. The smooth initial condition $u_s$ was used. (Curves are shown for $n \in \{320, 512, 768\}$ with $\ell \in \{768, 1024, 1440\}$, together with one further case with $n = 1440$.)