Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs
Eike Müller*,1, Xu Guo, Robert Scheichl and Sinan Shi

Department of Mathematical Sciences, University of Bath, Bath BA2 7AY, United Kingdom
Edinburgh Parallel Computing Centre (EPCC), The University of Edinburgh, James Clerk Maxwell Building, Mayfield Road, Edinburgh EH9 3JZ, United Kingdom
1 Current address: OT-Med, Europôle Méditerranéen de l'Arbois, Bâtiment du Cerege, BP 80, 13545 Aix-en-Provence Cedex 4, France
* Email: [email protected]
March 1, 2013
Abstract
Many problems in geophysical and atmospheric modelling require the fast solution of elliptic partial differential equations (PDEs) in "flat" three dimensional geometries. In particular, an anisotropic elliptic PDE for the pressure correction has to be solved at every time step in the dynamical core of many numerical weather prediction (NWP) models, and equations of a very similar structure arise in global ocean models, subsurface flow simulations and gas and oil reservoir modelling. The elliptic solve is often the bottleneck of the forecast, and to meet operational requirements an algorithmically optimal method has to be used and implemented efficiently. Graphics Processing Units (GPUs) have been shown to be highly efficient (both in terms of absolute performance and power consumption) for a wide range of applications in scientific computing, and recently iterative solvers have been parallelised on these architectures. In this article we describe the GPU implementation and optimisation of a Preconditioned Conjugate Gradient (PCG) algorithm for the solution of a three dimensional anisotropic elliptic PDE for the pressure correction in NWP. Our implementation exploits the strong vertical anisotropy of the elliptic operator in the construction of a suitable preconditioner. As the algorithm is memory bound, performance can be improved significantly by reducing the amount of global memory access. We achieve this by using a matrix-free implementation which does not require explicit storage of the matrix and instead recalculates the local stencil. Global memory access can also be reduced by rewriting the PCG algorithm using loop fusion and we show that this further reduces the runtime on the GPU. We demonstrate the performance of our matrix-free GPU code by comparing it both to a sequential CPU implementation and to a matrix-explicit GPU code which uses existing CUDA libraries. The absolute performance of the algorithm for different problem sizes is quantified in terms of floating point throughput and global memory bandwidth.
Anisotropic elliptic PDEs arise in many areas of geophysical and atmospheric modelling, which are often characterised by "flat" geometries: the horizontal extent of the domain of interest is much larger than its vertical size. This is the case for global weather- and climate prediction models. As the height of the atmosphere is significantly smaller than the radius of the earth, the horizontal resolution is of the order of 10-25 kilometres but the vertical grid spacing is several tens or hundreds of metres. Similar ranges of scales are encountered in models for simulating global ocean currents. Due to the layered structure of geological formations, oil and gas reservoir simulations and subsurface flow models of aquifers are also typically carried out in "flat" domains. After discretisation the cells of the computational grid are very flat and the resulting matrix stencil is highly anisotropic, i.e. the coupling in the vertical direction exceeds the horizontal coupling by several orders of magnitude. To achieve optimal performance, it is important to exploit the strong anisotropy of the system when choosing an appropriate computational grid and an efficient solver.

In this work we focus on the elliptic PDE for the pressure correction arising in the dynamical core of numerical weather- and climate prediction models. In many forecast models the semi-implicit semi-Lagrangian time stepping introduced in [1] and [2] is used to advance the atmospheric fields forward in time. In contrast to explicit time stepping this method has a larger stability region and allows for longer model time steps without compromising the accuracy of the large scale dynamics, which can reduce the overall model runtime. However, if this approach is used to solve the fully compressible non-hydrostatic Euler equations, a three dimensional PDE for the pressure correction has to be solved at every time step, as discussed for example in [3, 4, 5, 6, 7, 8], which often forms the computationally most expensive proportion of the model runtime.

Algorithmically the most efficient solvers for large elliptic PDEs are suitably preconditioned Krylov subspace or multigrid methods (see e.g. [9, 10, 11, 12]). The strong anisotropy in the vertical direction can be exploited to construct an efficient preconditioner (or multigrid smoother) based on vertical line relaxation, as discussed in [4, 5]. In a related context [13] and [14] describe how the equations of ocean flows can be discretised on a tensor product grid which is unstructured in the horizontal but consists of regular columns in the vertical direction. Again the strong anisotropy is used in the construction of an efficient preconditioner for the iterative solver. In a similar fashion anisotropic elliptic PDEs arise in fully implicit methods for gas- and oil reservoir modelling. A "supercoarsening" multigrid algorithm for solving elliptic PDEs encountered in multiphase flow in porous media is described in [15]: while the full three dimensional equation is solved on the finest grid, any vertical variations are averaged out on the coarser multigrid levels by collapsing vertical columns to a single layer.

The exact hardware on which forecast models will be implemented in the future is currently unknown, and it is important to explore novel chip architectures in addition to standard CPUs.
Graphics Processing Units (GPUs) are fast and power-efficient computing devices, and significant speedups relative to standard CPU implementations have been achieved in the past for iterative solvers for elliptic PDEs, as described in [16, 17, 18, 19, 20, 21, 22, 23, 24, 25]. While modern multicore CPUs contain several tens of cores and have a peak floating point performance of the order of 10-100 GFLOPs, GPUs have several hundreds to thousands of cores, and applications have to make efficient use of the massively parallel SIMD architecture and the limited cache size per thread. The nVidia M2090 Fermi GPU, on which this work was carried out, has a peak performance of 1.331/0.665 TFLOP/s in single and double precision respectively and a global memory bandwidth of 177 GByte/s. By dividing the peak FLOP rate by the memory bandwidth on the GPU, one can deduce that the number of computations per floating point variable read from memory is around 30, so computations are essentially "free" and the performance is limited by the speed with which data can be read from global memory and by how efficiently it can be kept in cache.

In this article we describe a matrix-free GPU implementation of a preconditioned Conjugate Gradient (PCG) solver tailored towards the solution of anisotropic PDEs with a tensor-product structure. The most compute intensive components of the iterative solver are the evaluation of a large sparse matrix-vector product (SpMV) and the inversion of a block-tridiagonal matrix. Both kernels were ported to the GPU and the memory access pattern and thread layout were adapted to increase data throughput. In our implementation the matrix is not stored explicitly but recalculated in every grid cell, which reduces access to global memory compared to the matrix-explicit code. However, the stencil we use is more complicated than the simple Poisson stencil on a regular grid used in previous studies (see [17, 19]). Due to the tensor product structure of the equation and the computational grid, the matrix can be written as the product of a one dimensional vertical discretisation, which is the same for every column and only needs to be calculated and stored once, and a horizontal stencil. As the number of vertical levels is very large in atmospheric applications, and the horizontal coupling only needs to be calculated once for each vertical column, this creates only a small overhead. The vertical discretisation requires the storage of four vectors of length n_z, where n_z is the number of vertical levels; in total we store 4 x n_z values to parametrise the matrix. With the exception of [17, 19], all implementations discussed in the literature that we are aware of store the matrix explicitly, which requires the storage of 7 x n_horiz x n_z matrix entries for a seven-point stencil, where n_horiz is the number of horizontal grid cells.
This can have a negative impact on the performance on bandwidth limited architectures, as it requires reading the matrix stencil from global memory in addition to the field vectors. While the specific implementation described in this article is based on a three dimensional grid which can be written as the tensor product of a regular horizontal grid and a graded vertical grid, the method we present can also be applied to unstructured horizontal grids.

As the sparse matrix-vector product and preconditioner solve are highly efficient in our GPU implementation, other parts of the main CG loop, such as level 1 BLAS vector updates and scalar products, start to account for a significant proportion of the runtime, even if they are implemented using optimised GPU libraries such as CUBLAS. We find that to achieve further performance increases it is not sufficient to optimise the kernels in isolation; rather, several components of the main CG iteration need to be considered together, as has been suggested in [20]. In particular we find that the number of memory references can be reduced further by fusing the loops over the computational grid in the main kernels with the BLAS operations, and this leads to an additional performance gain. For the entire PCG algorithm we are able to obtain a substantial total speedup for single precision arithmetic on an nVidia Fermi M2090 card relative to one core of an Intel Xeon Sandybridge E5-2620 CPU; for double precision arithmetic the speedup is slightly smaller. This includes the time for setting up the discretisation and copying data between host and device. To study the performance of our matrix-free GPU code we compared it to an implementation which stores the matrix explicitly in the compressed sparse row (CSR) storage format using the CUSPARSE and CUBLAS libraries. Our matrix-free code is significantly faster than the implementation based on CSR data structures, which does not exploit the regular structure of the problem. We quantified the absolute performance in terms of floating point operations per second (FLOPs) and global memory bandwidth for different problem sizes, where the latter is the more relevant measure for the performance of a memory bound algorithm. The optimised matrix-free code achieves a sizeable proportion of the theoretical peak memory bandwidth, but this also shows that theoretically there is still potential for additional improvements which could lead to a further speedup. An idea of how this could be achieved by increasing the granularity of the algorithm is discussed below. The floating point performance is 70-80 GFLOPs for single precision and 40-50 GFLOPs for double precision, corresponding to around 5-8% of the theoretical peak value.

Overview.
This article is organised as follows: previous GPU implementations of iterative solvers for PDEs are reviewed in section 2. In section 3 the model equation and its discretisation are described in detail, with particular emphasis on the tensor-product structure of the grid and the elliptic operator. Preconditioned Krylov subspace solvers and the matrix-free and interleaved form of the PCG algorithm for solving the model equation are presented in section 4. A general overview of the GPU architecture and the CUDA programming and execution model can be found in section 5, and our CUDA implementation of the PCG solver is described in section 6. Performance measurements are discussed in section 7, where we also present comparisons to a matrix-explicit implementation and quantify the absolute performance. Our conclusions and a discussion of planned further work can be found in section 8. For reference, appendix A contains the explicit form of the two most important kernels of our optimised algorithm.
The GPU implementation of Krylov subspace solvers, and in particular of the Preconditioned Conjugate Gradient algorithm, has been studied extensively in the literature, both for more general sparse matrices and for matrices arising from the discretisation of elliptic PDEs. As far as we are aware, all implementations discussed in the literature (with the exception of [19] and [17]) are based on matrix-explicit representations. While some of the authors study Poisson or sign-positive Helmholtz equations, none of the problems studied in the literature show the strong anisotropy which characterises the elliptic operator we consider in this work, and hence the preconditioners investigated in the literature will not be optimal in our case. While the speedups presented in the following review depend on the problem and on the hardware used at a particular time and should only be used as an indicator for achievable performance gains, almost all GPU implementations are significantly faster than the corresponding sequential CPU versions.

Some early work is presented in [16], where both a conjugate gradient and a multigrid solver are implemented for solving the sign-positive Helmholtz equation $-\Delta u + \sigma u = g$ arising from an implicit time discretisation of the incompressible Navier-Stokes equations on a regular two dimensional grid. However, for both solvers the matrix is stored explicitly.

The compressed sparse row (CSR) storage format is a very popular and general format which has been used in a variety of recent GPU implementations of Krylov subspace algorithms. A PCG solver for the same sign-positive Helmholtz equation as in [16] is described in [23] for two and three dimensions. For the approximate inverse SSOR preconditioner, which requires an additional matrix multiplication, a substantial socket-to-socket speedup is reported for the best implementation of the PCG algorithm on an nVidia Tesla T10 card relative to the unpreconditioned CG algorithm on an Intel Xeon Quad-Core 2.66 GHz CPU. Although in contrast to [16] a three dimensional system is solved, the elliptic operator considered is fully isotropic in both cases.

In [20] a modified version of the Conjugate Gradient algorithm is used for solving a set of general matrices from the University of Florida sparse matrix collection described in [26]. A simple Jacobi preconditioner is used and the performance of the solver is optimised by using the prefetch-CSR sparse matrix-vector multiplication in [27] and by fusing kernels in the main PCG loop. As described in section 4.3 below, we use a similar technique for fusing different kernels in our implementation to improve the performance of the code. Together with some other improvements the authors of [20] report that this led to a significant speedup compared to a GPU implementation using the "row per warp" sparse matrix-vector multiplication described in [28]. One of the problems studied in [20] is the "thermal2" matrix which arises from an FEM discretisation of the stationary heat equation $\partial_x(k\,\partial_x T) + \partial_y(k\,\partial_y T) = 0$ on an unstructured two dimensional grid. For this problem a significant speedup relative to a single core of an Intel Core2 2.4 GHz CPU was achieved on both nVidia GT8800 and GTX280 GPUs.

The GPU implementation of a Krylov subspace solver for the (sign-indefinite) two dimensional Helmholtz equation is described in [21].
A shifted Laplace multigrid preconditioner is used to reduce the number of iterations, and a speedup relative to the sequential implementation on one core of an AMD Phenom 9850 CPU was achieved on an nVidia GeForce 9800 GTX/9800 GTX+ GPU.

Other sparse matrix storage formats have also been used to implement iterative solvers on GPUs. An implementation of a CG solver for the two dimensional PDE arising from the implicit time discretisation of the heat equation is described in [29]. The ELLPACK-R data format, which is more suitable for structured problems, was used for storing the matrix, and a speedup was achieved for a two dimensional problem on an nVidia GeForce GTX 480 card, relative to the sequential implementation on an Intel Core i7 860 CPU with 2.80 GHz. Although the structure of the five point nearest-neighbour stencil arising from a finite-difference approximation of the Poisson equation is similar to the stencil we use in our discretisation, in contrast to our problem the elliptic PDE solved in [29] is two dimensional and fully isotropic.

The GPU implementation of preconditioned GMRES and Conjugate Gradient solvers for a range of problems and preconditioners has been studied in [25], and the performance for different sparse matrix storage formats is compared. While in some of the implementations only the sparse matrix-vector product is carried out on the device and the preconditioner is executed on the host, preconditioners that are easier to parallelise are also ported to the GPU. However, the authors find that for a simple block-Jacobi preconditioner implemented on the GPU the number of iterations is very large. This should be compared to our implementation: for the strongly anisotropic elliptic PDE we consider, the blocks have a direct physical interpretation as they describe the strong vertical coupling within one column, which is much larger than the coupling between different blocks. As a result, the simple block-diagonal preconditioner proved to be very efficient in our numerical tests.

In [22] the GPU implementation of a Preconditioned Conjugate Gradient solver for both a two dimensional Poisson problem and the elliptic equation arising in the Variational Boussinesq Model (VBM) is described. A Repeated Red Black (RRB) preconditioner is used, but an incomplete Poisson preconditioner with diagonal scaling is also considered. As for the sparse approximate inverse used in [23], the incomplete Poisson preconditioner can be reduced to an additional sparse matrix product. The sparse matrix-vector product is implemented by storing the local five-point stencil at each gridpoint. For the RRB preconditioner a speedup was achieved for the Poisson test problem on an nVidia GeForce GTX 580 card, relative to the sequential implementation on an Intel Xeon W3520 CPU. However, the costs for memory allocation and setup of the preconditioner matrix take up around a third of the total runtime. In contrast, in our matrix-free implementation only a small amount of data has to be copied between host and device and the matrix setup costs are negligible. For realistic problems speedups are reported in [22] both for the RRB preconditioner and for the incomplete Poisson preconditioner.

As far as we are aware, the only matrix-free implementations of a CG solver discussed in the literature are [17, 19].
In [17] both the homogeneous Poisson equation and the Navier-Stokes equations are solved on an unstructured mesh by implementing matrix-free gradient and divergence operations. A speedup was achieved on an nVidia 6600GT card relative to an AMD Athlon 64 CPU. Note, however, that the implementation is based on the low-level graphics API and the hardware used in the study is quite dated by current standards. The GPU implementation of a matrix-free PCG solver for the homogeneous Poisson equation in three dimensions is also described in [19].

Due to its significance in many scientific applications and in particular in iterative solvers, the performance of sparse matrix-vector multiplications on its own has been studied extensively in the literature. Various sparse matrix formats are described in [28] and their performance for a sparse matrix-vector multiplication is compared for both structured and unstructured matrices. While CSR is the most general format and can be applied to matrices with widely varying row sizes, the best performance for structured matrices arising from the discretisation of PDEs is obtained with the DIA and ELLPACK formats. However, in [24] an efficient parameter dependent implementation of sparse matrix-vector multiplication based on the CSR format on cache based GPUs is described. Cache usage and performance can be improved significantly by varying the number of threads processing each row, the thread block size and the number of rows processed by one cooperating thread group. The authors find that by tuning the parameters heuristically, the performance of a Conjugate Gradient solver for a structured problem arising from the finite element discretisation of a simple elliptic Poisson problem using CSR storage is comparable to the corresponding implementation using the ELLPACK format. Both matrix-explicit versions are beaten by an implementation which stores a small local matrix on each element and assembles the global stiffness matrix on-the-fly in each matrix-vector product, as described in [30, 31]. For other work on improved matrix-explicit implementations of the sparse matrix-vector product see the review and references cited in [24].

The number of iterations can often be reduced significantly by using multigrid methods, and recently both geometric and algebraic multigrid solvers have been ported to GPUs, see e.g. [32, 33, 34]. The extension of our PCG solver to a geometric multigrid solver for anisotropic problems based on the tensor product idea in [35] is discussed in [36], and we are currently working on a GPU implementation of the same matrix-free geometric multigrid solver. We finally remark that multiple-GPU implementations of iterative solvers have been described in the literature (see e.g. [37, 38, 19, 39]) and we are currently working on extending our algorithm to clusters of GPUs.

Following [7, 8], a model equation which reproduces the most important features of the full PDE for the pressure correction in an NWP model has been derived in [36]. The derivation of the full pressure correction equation in atmospheric models can be found for example in [3, 4, 5, 6], and is described for ocean models in [13] and [14]. The model equation we use is a symmetric positive definite PDE and can be written in spherical coordinates as
$$-\omega^2 \left( \Delta_{2d} + \frac{\lambda^2}{r^2} \frac{\partial}{\partial r}\left( r^2 \frac{\partial}{\partial r} \right) \right) u + u = f \qquad (1)$$
where $\Delta_{2d}$ denotes the two dimensional Laplacian on the unit sphere.
For simplicity all length scales are measured in units of the earth radius R_earth, and the equation is solved in a thin spherical shell r ∈ [1, 1 + H_atmos]. We write H_atmos = D/R_earth ≪ 1, where D is the thickness of the atmosphere. The model parameters ω and λ depend on the model time step size; in particular, ω is proportional to the time step size. In our numerical experiments the parameters were adjusted to their values for typical model resolutions in NWP at a typical Courant number. An important feature of the discretised PDE is the strong anisotropy in the vertical direction: the depth of the atmosphere is about two orders of magnitude smaller than the radius of the earth, and consequently the horizontal grid spacing Δx is significantly larger than the vertical grid spacing Δz. The strong grid aligned anisotropy in the vertical direction is given (approximately) by
$$\gamma = \left( \lambda \frac{\Delta x}{\Delta z} \right)^2, \qquad (2)$$
and as H_atmos ≪ 1 we have Δz ≪ Δx, so γ ≫ 1. This property can be used to construct a simple but very efficient and parallelisable preconditioner for iterative solvers of this equation, which solves the vertical equation exactly but ignores the horizontal couplings.

The condition number of the preconditioned operator approaches a fixed value as the horizontal resolution increases and does not diverge as for the Poisson equation. To see this, note that to keep the Courant number constant, the time step size Δt ∝ ω has to be chosen proportional to the horizontal grid spacing, i.e. Δt ∝ Δx. The largest eigenvalue of the preconditioned matrix is of the order ω²/Δx² ∝ Δt²/Δx² and independent of Δx, whereas the smallest eigenvalue is 1 due to the presence of the zero order term in (1).

After discretisation the elliptic operator in (1) can be written abstractly in tensor product form as the sum of three terms:
$$L = D^{(h)} \otimes M^{(r)} + M^{(h)} \otimes D^{(r)} + \tilde{M}^{(h)} \otimes \tilde{M}^{(r)}. \qquad (3)$$
Here M^(h) and M̃^(h) (M^(r) and M̃^(r)) are horizontal (vertical) mass matrices which only contain couplings in the horizontal (vertical) direction. D^(h) (D^(r)) are second order derivative operators which contain couplings in the horizontal (vertical) direction.

To discretise (1) we use a finite volume scheme on a tensor-product grid, which consists of a (possibly unstructured but conforming) two dimensional grid on the unit sphere and a non-uniform (typically graded) one dimensional grid in the vertical direction. The model fields are defined as integrals over a grid cell given by the horizontal grid element T and vertical index k = 0, ..., n_z − 1,
$$u_k^{(T)} \equiv \int_{r_k}^{r_{k+1}} \int_T u(r, \theta)\, r^2\, \mathrm{d}r\, \mathrm{d}\theta, \qquad (4)$$
with θ denoting horizontal coordinates on the unit sphere (throughout this work we use zero-based indexing as all our implementations are in the C programming language). Independent of the horizontal discretisation we need to store a vector u^(T) of length n_z at each element T. The vertical grid is defined by the grid points r_k with k = 0, ..., n_z. The number of vertical grid cells is usually large, n_z = O(100). In meteorological applications a graded vertical grid with smaller grid spacings near the ground is desirable and we use r_k = 1 + (k/n_z)² · H_atmos.

Equation (1) is discretised using a finite volume scheme, and schematically it can be written for each horizontal grid element T as
$$(Au)^{(T)} = A_T\, u^{(T)} + \sum_{T' \in \mathcal{N}(T)} A_{T,T'}\, u^{(T')} = f^{(T)}, \qquad (5)$$
where N(T) is the set of neighbours of T.
A_T is a tridiagonal n_z × n_z matrix and the A_{T,T'} are diagonal n_z × n_z matrices for each neighbouring element T'. The matrices A_{T,T'} correspond to the off-diagonal couplings in the horizontal derivative operator D^(h) ⊗ M^(r) in (3) and are given by the product
$$A_{T,T'} = \alpha_{T,T'}\, \mathrm{diag}(d) = \alpha_{T,T'}\, \mathrm{diag}\big(d_0, d_1, \ldots, d_{n_z-1}\big). \qquad (6)$$
The coefficient α_{T,T'} is the ratio between the length S_{T,T'} of the edge between the cells T and T' and the distance between the midpoints r_T and r_{T'} of these cells. We also define the diagonal coefficient α_T as the sum of the off-diagonal coefficients,
$$\alpha_T \equiv \sum_{T' \in \mathcal{N}(T)} \alpha_{T,T'}. \qquad (7)$$
The (symmetric) tridiagonal matrix A_T can be split into a sum of three terms,
$$A_T = |T|\, \mathrm{diag}(a) - \alpha_T\, \mathrm{diag}(d) + |T|\, \mathrm{tridiag}\big({-(b+c)},\, b,\, c\big), \qquad (8)$$
with diagonal entries d̃_k = |T| (a_k − b_k − c_k) − α_T d_k, where |T| is the area of the horizontal grid element T. The first term corresponds to the product M̃^(h) ⊗ M̃^(r) in (3), the second term is the diagonal contribution of D^(h) ⊗ M^(r), and the third term corresponds to the vertical derivative term M^(h) ⊗ D^(r). While the finite volume discretisation described so far leads to a seven-point nearest-neighbour stencil on a regular grid, the same structure also arises for other stencil types which include couplings between grid cells which are not directly adjacent.

The vectors a, b, c and d do not depend on the grid cell T and can be precomputed. On the other hand the quantities |T| and α_{T,T'} only need to be calculated once per vertical column. These two important observations will be exploited in the efficient matrix-free implementation of the PCG solver described below.

Figure 1: Structure of the matrix arising from the finite volume discretisation of (1) (shown for n_z = 4). Vertical couplings are shown in red and horizontal couplings in blue.
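To make the tensor-product structure concrete, the following C sketch (not taken from the paper's code; all names are hypothetical) shows how the column matrix A_T of (8) could be rebuilt on the fly from the precomputed vertical vectors a, b, c, d and the per-column quantities |T| and α_T; the horizontal coupling to a neighbour T' is simply the diagonal matrix α_{T,T'} diag(d) of (6) and never needs to be stored.

```c
/* Hypothetical sketch: rebuild the column matrix A_T of Eq. (8) on the fly.
 * a, b, c, d are the precomputed vertical vectors of length nz,
 * areaT = |T| and alphaT = sum of the horizontal couplings (Eq. (7)). */
typedef struct {
  double *diag;  /* dtilde_k, length nz                  */
  double *upper; /* coupling to level k+1 (unused at top) */
  double *lower; /* coupling to level k-1 (unused at k=0) */
} TridiagMatrix;

void assemble_column_matrix(int nz, const double *a, const double *b,
                            const double *c, const double *d,
                            double areaT, double alphaT, TridiagMatrix *AT) {
  for (int k = 0; k < nz; ++k) {
    /* dtilde_k = |T| (a_k - b_k - c_k) - alpha_T d_k  (Eq. (8)) */
    AT->diag[k]  = areaT * (a[k] - b[k] - c[k]) - alphaT * d[k];
    AT->upper[k] = areaT * b[k];
    AT->lower[k] = areaT * c[k];
  }
}
```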
Assuming an ordering of the horizontal degrees of freedom which maps each cell T to a linear index ν(T) ∈ {0, ..., n_horiz − 1}, one can write the full 3d solution vector of length n = n_horiz × n_z as
$$u = \left\{ u^{(0)}, u^{(1)}, \ldots, u^{(n_{\mathrm{horiz}}-1)} \right\} \qquad (9)$$
with
$$u_\ell = u_k^{(T)} \quad \text{where} \quad \ell = n_z \cdot \nu(T) + k. \qquad (10)$$
In this representation the discretised equation (5) can be written in the familiar form
$$A u = f \qquad (11)$$
and the structure of the matrix A is shown in Fig. 1. Note that the matrix has a block structure, where each of the blocks corresponds to one combination (T, T') of neighbouring elements of the horizontal grid. Each of the gray blocks is of size n_z × n_z; the dark gray blocks (T' = T) describe the diagonal terms and vertical coupling, whereas the light gray blocks (T ≠ T' ∈ N(T)) describe the horizontal coupling.

In the following we work on one panel of a cubed sphere grid with gnomonic projection (Fig. 2). This defines a mapping from a square reference domain Ω̃ to spherical coordinates on one sixth of the entire sphere, and each cell of the regular grid of size n_horiz = m × m on Ω̃ can be labelled with an index pair (i, j) ∈ [0, m−1] × [0, m−1]. In this case we have
$$u_\ell = u_k^{(i,j)} \quad \text{where} \quad \ell = n_z (m \cdot i + j) + k \qquad (12)$$
and we label each grid cell T_ij by its horizontal indices. We stress, however, that there is no algorithmic problem in extending our approach to unstructured horizontal grids or to the entire spherical shell, for example by ordering the horizontal grid cells along a space-filling curve.

Figure 2: Cubed sphere grid. The model equation was discretised on one logically rectangular panel of this grid.

Typically the number of degrees of freedom per atmospheric variable in current global forecast models is of the order of several ten million. For next generation forecast models global horizontal model resolutions of around one kilometre are envisaged, which will require the solution of PDEs with a correspondingly larger number of unknowns. Clearly naive direct methods can not be used for the solution of equations of this size, and spectral methods often require a regular underlying grid structure and are hard to parallelise.

Krylov subspace methods (see e.g. [12]) are very efficient iterative algorithms for solving sparse linear systems, in particular if they are suitably preconditioned. These methods construct the approximation u^(k) of the exact solution u of (11) in a k-dimensional Krylov subspace
$$\mathcal{K}_k = \mathrm{span}\left\{ r_0,\, A r_0,\, A^2 r_0,\, \ldots,\, A^{k-1} r_0 \right\} \subset \mathbb{R}^n, \qquad (13)$$
where r_0 = f − A u^(0) is the initial residual. They are easy to parallelise as, in addition to local operations such as sparse matrix-vector multiplications and axpy-like vector updates, they require only a small number of global reductions.

The simplest Krylov subspace method, which can be applied to symmetric positive definite matrices, is the Conjugate Gradient (CG) algorithm introduced in [9]. At each step the approximate solution vector u^(k) is updated by adding a correction proportional to the search direction p^(k). The search directions are chosen such that they are A-orthogonal for different k, k': ⟨p^(k), A p^(k')⟩ = 0 for k ≠ k'. The convergence rate depends on the spectral properties of the matrix A, in particular on the condition number κ, which is the ratio between the largest and smallest eigenvalue, as derived in [12].
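Since the operation counts discussed below refer to the building blocks of this iteration, it may help to see them spelled out. The following is a generic textbook sketch of the preconditioned CG loop (not the paper's code; function and variable names are hypothetical), with the sparse matrix-vector product, the preconditioner solve and the level 1 BLAS operations made explicit.

```c
/* Hypothetical sketch of a preconditioned CG iteration (textbook form).
 * Per iteration: one SpMV, one preconditioner solve, three axpy-type
 * updates, one scal, two dot products and one norm (cf. Tab. 1 below).
 * r, z, p, q are caller-provided work vectors of length n. */
#include <math.h>

void pcg(int n, int maxiter, double eps,
         void (*spmv)(const double *x, double *y),   /* y <- A x      */
         void (*prec)(const double *r, double *z),   /* z <- M^{-1} r */
         const double *f, double *u,
         double *r, double *z, double *p, double *q) {
  spmv(u, r);                                   /* r <- f - A u */
  for (int i = 0; i < n; ++i) r[i] = f[i] - r[i];
  prec(r, z);
  for (int i = 0; i < n; ++i) p[i] = z[i];
  double kappa = 0.0, rnorm0 = 0.0;
  for (int i = 0; i < n; ++i) { kappa += r[i] * z[i]; rnorm0 += r[i] * r[i]; }
  rnorm0 = sqrt(rnorm0);
  for (int k = 0; k < maxiter; ++k) {
    spmv(p, q);                                 /* SpMV: q <- A p        */
    double sigma = 0.0;                         /* dot:  sigma <- <p, q> */
    for (int i = 0; i < n; ++i) sigma += p[i] * q[i];
    double alpha = kappa / sigma;
    double rnorm = 0.0;
    for (int i = 0; i < n; ++i) {               /* two axpy updates      */
      u[i] += alpha * p[i];
      r[i] -= alpha * q[i];
      rnorm += r[i] * r[i];                     /* nrm2                  */
    }
    if (sqrt(rnorm) / rnorm0 < eps) break;      /* stopping criterion    */
    prec(r, z);                                 /* preconditioner solve  */
    double kappa_new = 0.0;                     /* dot: <r, z>           */
    for (int i = 0; i < n; ++i) kappa_new += r[i] * z[i];
    double beta = kappa_new / kappa;
    kappa = kappa_new;
    /* p <- z + beta p  (a scal followed by an axpy in BLAS terms) */
    for (int i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
  }
}
```

Together with the u and r updates, the search direction update accounts for the three axpy-type operations, one scal, two dot products and one norm counted per iteration in Tab. 1 below; in the interleaved variant of Algorithm 1 these element-wise loops are fused with the SpMV and the tridiagonal preconditioner solve.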
For the system arising from the discretisation of the Poisson equation, κ grows rapidly with the inverse grid spacing 1/h and it can be shown that asymptotically for h → 0 the number of iterations required to reduce the error by a factor ε is
$$k \propto \frac{|\log \epsilon|}{h}. \qquad (14)$$
For anisotropic systems h is the smallest grid spacing in the problem; for the highly anisotropic problem in this work this is the vertical grid spacing Δz ≪ Δx. The dominant cost in each iteration is the sparse matrix-vector multiplication y ← Ax, which is of O(n) computational complexity. Hence the total cost of the algorithm is
$$\mathrm{Cost(CG)} \propto \frac{n}{h}\, |\log \epsilon|. \qquad (15)$$
To solve non-symmetric systems, more general Krylov subspace methods such as GMRES, BCG, BiCGStab and GCR can be used. Although the number of sparse matrix-vector products and intermediate vectors which need to be stored changes between different Krylov subspace algorithms, their general structure is very similar and in this work we focus on the Conjugate Gradient algorithm for simplicity.

An equivalent version of the linear system in (11) can be obtained by multiplication with the inverse of a matrix M,
$$M^{-1} A u = M^{-1} f. \qquad (16)$$
This is generally referred to as left preconditioning. M is a matrix which should be easy to invert (i.e. it should be easy to solve the system Mx = y) and as similar to A as possible, such that the preconditioned matrix M^{-1}A is well conditioned. Usually these two requirements are mutually exclusive and a tradeoff between them has to be found.

Initial profiling of the CPU code revealed that the sparse matrix-vector multiplication and preconditioner solve account for the largest proportion of the runtime. In addition to these operations each iteration requires three axpy-like vector updates, one scal operation and two scalar products (dot). An additional norm calculation (nrm2) is required for the evaluation of the stopping criterion. The number of floating point operations and memory references for the individual level 1 BLAS operations is given in Tab. 1; the total number of FLOPs per grid cell for all BLAS operations is 13 and the number of memory references is 16, and hence this part of the algorithm is clearly memory bound.

operation      FLOPs   MEM
scal           1       2
axpy           2       3
dot            2       2
nrm2           2       1
total (PCG)    13      16

Table 1: Number of floating point operations and memory references per grid cell for different level 1 BLAS operations. The last row shows the total number of FLOPs and memory references for all BLAS operations in the PCG algorithm.

The strong anisotropy of the discretised PDE can be used to construct an efficient preconditioner: if the small horizontal couplings are ignored, the matrix A shown in Fig. 1 is block-diagonal with each block corresponding to one vertical column. Each of the tridiagonal blocks can be inverted independently using the Thomas algorithm, written down explicitly in [40]. More efficient block-SOR preconditioners can be used as well, and require the inversion of the same block-diagonal matrix plus an additional sparse matrix-vector product.

This approach has been applied very successfully for the pressure solver in the dynamical core of several numerical
weather- and climate prediction models, see [41, 5, 6, 42]. In particular the authors of [4] show the efficiency of a simple 1d line relaxation in comparison to other preconditioners such as 2d ADI or a three dimensional pointwise SOR iteration. The good weak and strong scaling on up to 65536 cores of the HECToR Cray supercomputer, for problems with a very large number of degrees of freedom, is demonstrated for the model equation (1) in [36].

As discussed in section 3, the condition number of the preconditioned elliptic operator approaches a fixed value for physical choices of the parameter ω in (1) as the horizontal grid spacing tends to zero. Hence for these parameters Krylov subspace solvers for this equation are algorithmically stable in the sense that the number of iterations does not diverge as the horizontal resolution increases. However, as shown in [36] the number of iterations and total solution time can be reduced significantly by using (geometric) multigrid solvers.

Neither the matrix A nor the preconditioner matrix M need to be stored explicitly in the algorithm; it is sufficient to evaluate the sparse matrix-vector product y ← Ax and to solve the equation Mx = y for x. For matrices arising from the discretisation of PDEs, such as the one discussed in section 3, the local matrix stencil only couples each grid cell to its neighbours. As memory access is significantly more expensive than floating point operations on GPUs, it will be beneficial to recalculate the stencil whenever it is needed in the sparse matrix-vector product or preconditioner solve. In each horizontal grid cell (i, j) the diagonal matrices A_{T_ij,T_i'j'} in (6) (with T_i'j' ∈ N(T_ij)) and the tridiagonal matrix A_{T_ij} in (8) are calculated from the precomputed vectors a, b, c and d. For efficiency, we instead store the vectors a′, b′, c′ and d with
$$a'_k = a_k / d_k, \quad b'_k = b_k / d_k, \quad c'_k = c_k / d_k, \qquad (17)$$
as then the number of floating point operations in the sparse matrix-vector multiplication can be further reduced. To reconstruct the matrix stencil the coefficients |T_ij| and α_{i'j'} are also needed (we write α_ij ≡ α_{T_ij} and α_{i'j'} ≡ α_{T_ij,T_i'j'} for T_i'j' ∈ N(T_ij) for simplicity). While these can be given by relatively complicated algebraic expressions on complex geometries, such as the spherical grid considered in this article, they only need to be calculated once in each vertical column and for large n_z this will only lead to a small overhead.

On the regular horizontal grid which we use in our implementation the sparse matrix-vector product y ← Ax can then be calculated in each grid cell (i, j, k) as
$$\begin{aligned} y_{ijk} \leftarrow \Big[ &\big( (a'_k - b'_k - c'_k)\, |T_{ij}| - \alpha_{ij} \big)\, x_{ijk} + |T_{ij}|\, b'_k\, x_{i,j,k+1} + |T_{ij}|\, c'_k\, x_{i,j,k-1} \\ &+ \alpha_{i+1,j}\, x_{i+1,j,k} + \alpha_{i-1,j}\, x_{i-1,j,k} + \alpha_{i,j+1}\, x_{i,j+1,k} + \alpha_{i,j-1}\, x_{i,j-1,k} \Big]\, d_k \end{aligned} \qquad (18)$$
(with possible modifications at the boundary of the domain). For each (i, j, k) this requires 20 floating point operations and 12 memory references (7 loads for x, 1 store for y and 4 loads for a′_k, b′_k, c′_k and d_k). A sketch of a kernel of this form is given below.
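The following CUDA-C sketch illustrates what a matrix-free kernel implementing (18) might look like. It is not the actual kernel from Appendix A: the argument list, the index macro and the way the horizontal coefficients are supplied are hypothetical simplifications (in the real code |T_ij| and the α coefficients are recomputed from the grid geometry once per column, and the reordered memory layout of section 6.1 is used), and boundary modifications are reduced to simply dropping out-of-range neighbour contributions.

```cuda
/* Hypothetical sketch of a matrix-free stencil kernel for Eq. (18).
 * One thread processes one vertical column (i,j). ap, bp, cp, d are the
 * vertical vectors a', b', c', d of Eq. (17), area(i,j) = |T_ij| and
 * alpha{E,W,N,S} are the horizontal couplings to the four neighbours,
 * here passed in precomputed per column purely to keep the sketch short.
 * IDX() encodes the memory layout of Eq. (12); the actual implementation
 * uses the reordered layout of section 6.1 to obtain coalesced access. */
#define IDX(i, j, k, m, nz) ((size_t)(nz) * ((size_t)(m) * (i) + (j)) + (k))

__global__ void spmv_matrix_free(int m, int nz,
                                 const float *__restrict__ ap,
                                 const float *__restrict__ bp,
                                 const float *__restrict__ cp,
                                 const float *__restrict__ d,
                                 const float *__restrict__ area,
                                 const float *__restrict__ alphaE,
                                 const float *__restrict__ alphaW,
                                 const float *__restrict__ alphaN,
                                 const float *__restrict__ alphaS,
                                 const float *__restrict__ x,
                                 float *__restrict__ y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i >= m || j >= m) return;
  float T = area[m * i + j];
  /* sum of the horizontal couplings, Eq. (7) */
  float alpha_ij = alphaE[m*i+j] + alphaW[m*i+j] + alphaN[m*i+j] + alphaS[m*i+j];
  for (int k = 0; k < nz; ++k) {
    float s = ((ap[k] - bp[k] - cp[k]) * T - alpha_ij) * x[IDX(i, j, k, m, nz)];
    if (k < nz - 1) s += T * bp[k] * x[IDX(i, j, k + 1, m, nz)];
    if (k > 0)      s += T * cp[k] * x[IDX(i, j, k - 1, m, nz)];
    /* horizontal couplings; out-of-range neighbours are simply dropped here,
       the real code modifies the stencil at the panel boundary */
    if (i < m - 1) s += alphaE[m*i+j] * x[IDX(i + 1, j, k, m, nz)];
    if (i > 0)     s += alphaW[m*i+j] * x[IDX(i - 1, j, k, m, nz)];
    if (j < m - 1) s += alphaN[m*i+j] * x[IDX(i, j + 1, k, m, nz)];
    if (j > 0)     s += alphaS[m*i+j] * x[IDX(i, j - 1, k, m, nz)];
    y[IDX(i, j, k, m, nz)] = s * d[k];
  }
}
```

With the block size B_x × B_y = 64 × 2 used later in section 7, such a kernel might be launched on an m × m horizontal grid as spmv_matrix_free<<<dim3((m+63)/64, (m+1)/2), dim3(64, 2)>>>(...).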
However, as the vectors a′, b′, c′ and d do not change from column to column, they are likely to remain in cache, reducing the number of memory references to 8. Depending on the innermost loop, two of the values of x (namely the ones belonging to the same vertical column which are needed at the next vertical level, i.e. x_ijk and x_{i,j,k+1}) will most likely stay in cache, which reduces the number of memory references further to 6. In contrast, 14 floating point operations and 22 memory references are necessary if the matrix A is stored in compressed sparse row storage (CSR) format. Again, the actual number of memory references is likely to be smaller as some of the variables are cached. However, in this case caching is not possible for the matrix entries, which vary from one three dimensional grid cell to the next.

For the construction of the preconditioner we drop the second term in (5), which couples different vertical columns, and write A_{T_ij} u^(i,j) = b^(i,j) (using (12) to implicitly convert between the global vector representation and the column based representation in (4)). For each column (i, j) the tridiagonal system A_{T_ij} x^(i,j) = y^(i,j) defined by
$$|T_{ij}|\, d_k \Big[ x^{(i,j)}_{k-1}\, c'_k + x^{(i,j)}_k \big( (a'_k - b'_k - c'_k) - \tilde\alpha_{ij} \big) + x^{(i,j)}_{k+1}\, b'_k \Big] = y^{(i,j)}_k \qquad (19)$$
with α̃_ij ≡ α_ij / |T_ij| needs to be solved for x^(i,j). This can be done efficiently in O(n_z) time using the Thomas algorithm, which is essentially Gaussian elimination applied to a tridiagonal system. The forward iteration requires 8 memory references at each step (loading the auxiliary vector φ_{k−1} and y^(i,j)_{k−1} and storing φ_k and y^(i,j)_k, plus four loads for a′_k, b′_k, c′_k and d_k). In each step of the backward iteration 4 memory references are needed. Hence the total number of memory references is 12. Again, some of the data might be kept in cache. If only the vectors a′, b′, c′ and d are cached the number of memory references reduces to 8. If in addition any data in the same vertical column can be kept in cache, the number of memory references reduces further to 5. As there are no dependencies in the horizontal direction, we can parallelise in this direction, i.e. the tridiagonal solve in each vertical column can be carried out independently.

An additional advantage of the matrix-free method is the fact that there are no matrix setup costs; the cost for precomputing the vectors a′, b′, c′, d and copying them to the device is negligible as these vectors only have length n_z.

Field vectors are accessed in different components of the PCG algorithm; for example the residual r is needed in the preconditioner solve, in the update of the residual and in the calculation of the residual norm. In the standard implementation of the algorithm these operations are carried out in separate loops over the grid, which increases the number of memory references as data can not be kept in cache. However, the main iteration of the PCG algorithm can be rewritten such that it only consists of two loops over the grid, each of which contains either the sparse matrix-vector multiplication or the tridiagonal solve and a number of BLAS operations. Similar loop fusion for the GPU implementation of the PCG algorithm presented in [43] has been discussed in [20].

The main iteration of this modified algorithm is shown in Algorithm 1; the kernels are written down explicitly for the matrix-free implementation in Appendix A.

Algorithm 1 Interleaved PCG loop
for k = 1, maxiter do
  Interleaved preconditioner kernel: calculate
    r ← r − α q,  z = M⁻¹ r,  ||r|| ← √⟨r, r⟩,  κ ← ⟨r, z⟩
  in a single iteration over the grid.
  if (||r|| / ||r_0|| < ε or ||r|| < τ) then exit
  β ← κ / κ_old,  κ_old ← κ
  Interleaved SpMV kernel: calculate
    u ← u + α p,  p ← z + β p,  q ← A z + β q,  σ ← ⟨p, q⟩
  in a single iteration over the grid.
  α ← κ_old / σ
end for

The number of floating point operations and memory references for the matrix-free PCG algorithm and for its interleaved version are shown in Tab. 2. For the memory references we give three different values corresponding to the following assumptions: (1) no data is kept in cache, (2) only the vectors a′, b′, c′ and d are cached and (3) in addition data in the same vertical column is cached. While the algorithm is still memory bound on the GPU, the number of memory references is reduced in the interleaved implementation, in particular if the cache can be used efficiently.

Algorithm        operation   FLOPs   Memory references
                                     no cache   matrix cached   columns cached
PCG              SpMV        20      12         8               6
                 prec        13      12         8               5
                 BLAS        13      16         16              16
                 total       46      40         32              27
Interleaved      SpMV        28      17         13              11
PCG              prec        19      16         12              9
                 total       47      33         25              20

Table 2: Number of floating point operations and memory references per iteration and per grid point for different components of the PCG and the interleaved PCG algorithm. Memory references are shown without caching, with caching of the vectors a′, b′, c′ and d only, and assuming that all degrees of freedom in a vertical column are cached as well.
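The interleaved kernels themselves are given in Appendix A of the paper and are not reproduced in this excerpt. As an illustration of the column-wise preconditioner on which they are built, the sketch below shows a plain (non-interleaved) CUDA-C Thomas-algorithm kernel for the tridiagonal systems (19); all names are hypothetical and, as in the SpMV sketch above, the per-column coefficients are passed in precomputed. In the interleaved preconditioner kernel of Algorithm 1 the residual update r ← r − αq, the norm ⟨r, r⟩ and the scalar product ⟨r, z⟩ would be fused into the same forward and backward sweeps.

```cuda
/* Hypothetical sketch of the block-diagonal preconditioner z = M^{-1} r:
 * one thread solves the tridiagonal system (19) in its vertical column
 * with the Thomas algorithm. phi is scratch space for the forward sweep.
 * Same (hypothetical) layout macro as in the SpMV sketch above. */
#define IDX(i, j, k, m, nz) ((size_t)(nz) * ((size_t)(m) * (i) + (j)) + (k))

__global__ void prec_thomas(int m, int nz,
                            const float *__restrict__ ap,
                            const float *__restrict__ bp,
                            const float *__restrict__ cp,
                            const float *__restrict__ d,
                            const float *__restrict__ area,      /* |T_ij|            */
                            const float *__restrict__ alpha_sum, /* alpha_ij, Eq. (7) */
                            const float *__restrict__ r,
                            float *__restrict__ z,
                            float *__restrict__ phi) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i >= m || j >= m) return;
  float T = area[m * i + j];
  float alpha_t = alpha_sum[m * i + j] / T;   /* tilde(alpha)_ij */
  /* row k of Eq. (19): lower = |T| d_k c'_k, upper = |T| d_k b'_k,
     diag = |T| d_k ((a'_k - b'_k - c'_k) - tilde(alpha)_ij) */
  float diag0 = T * d[0] * ((ap[0] - bp[0] - cp[0]) - alpha_t);
  phi[IDX(i, j, 0, m, nz)] = T * d[0] * bp[0] / diag0;   /* modified upper coeff. */
  z[IDX(i, j, 0, m, nz)]   = r[IDX(i, j, 0, m, nz)] / diag0;
  for (int k = 1; k < nz; ++k) {                         /* forward sweep  */
    float low   = T * d[k] * cp[k];
    float diag  = T * d[k] * ((ap[k] - bp[k] - cp[k]) - alpha_t);
    float up    = T * d[k] * bp[k];
    float denom = diag - low * phi[IDX(i, j, k - 1, m, nz)];
    phi[IDX(i, j, k, m, nz)] = up / denom;
    z[IDX(i, j, k, m, nz)] =
        (r[IDX(i, j, k, m, nz)] - low * z[IDX(i, j, k - 1, m, nz)]) / denom;
  }
  for (int k = nz - 2; k >= 0; --k)                      /* backward sweep */
    z[IDX(i, j, k, m, nz)] -= phi[IDX(i, j, k, m, nz)] * z[IDX(i, j, k + 1, m, nz)];
}
```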
In contrast to CPUs, on which most transistors are used for advanced execution control units and cache hierarchies, GPUs have a very large number (several hundreds to thousands) of lightweight compute cores which support SIMD parallelism for simple compute kernels and are ideally suited for floating point intensive calculations on regular data. The clock speed of each individual core is smaller than on the CPU, which improves the power efficiency.

The cores in the GPU are grouped into a number of streaming multiprocessors (SMs). Data can be stored in global GPU memory and, in addition, each SM has a smaller and faster shared memory. Each compute core can also access an even smaller local memory in addition to a set of registers. Constants can be stored in fast constant memory. Modern GPUs, such as the Fermi architecture (see [44]), also have a hardware managed L1/L2 cache hierarchy. Data transfer between host and device memory has to be managed explicitly by the user. As the PCIe bus typically has a small bandwidth (of the order of ten GB/s), this is expensive and data should be calculated and kept on the GPU as long as possible.
The CUDA programming model described in the CUDA programming guide ([45]) provides an extension of the C language. When writing code for a GPU, the most compute intensive subroutines are parallelised by isolating them in simple kernels. On the GPU these kernels are executed by lightweight threads which are run on the compute cores of the SMs. Threads are grouped into blocks in a one-, two- or three dimensional grid, which execute independently on one SM. Synchronisation and data exchange is only possible between threads in the same block. In each block the threads are arranged into a grid, i.e. each thread is uniquely identified by its thread index and block index. Threads are scheduled in groups of size 32, called warps. The threads in each warp execute one common instruction at a time; if the execution path diverges, for example due to an if-statement in the kernel, both branches are executed. This is known as thread divergence and should be avoided whenever possible.

As our solver algorithm is memory bound, performance can be increased by minimising latency and increasing global memory bandwidth. If a warp stalls as it is waiting for data from memory, it is paused and the next warp which is ready for execution is launched by the warp scheduler. In contrast to threads on a multicore CPU, the GPU scheduler is designed for launching and switching threads with minimal overheads. Thus, given that the number of threads is large enough, memory latency can be hidden. To achieve this the GPU usually has to be oversubscribed, i.e. the number of threads launched at a single time should exceed the number of compute cores.

For optimal efficiency, memory access for all threads in one half-warp should be coalesced. Global memory access is processed in segments of 128 bytes (which is the size of one cache line) and all threads in a half-warp should access the smallest possible number of different segments. For example, if each of the 16 threads reads a double precision (8 byte) floating point number from global memory, this will require the transfer of one segment if these numbers are stored consecutively, but it will require 16 separate memory transfers, and thus incur a big penalty, if the numbers are more than 128 bytes apart in global memory.

Both the standard PCG method and its interleaved version were implemented in C and CUDA-C. In the standard version the sparse matrix-vector multiplication and preconditioner were implemented as kernels and the CUBLAS library was used for the level 1 BLAS operations. To minimise memory transfers between host and device, data is kept on the device inside the entire PCG loop and all operations are carried out on the GPU. Host-device data transfers are only necessary for copying the initial solution and the right hand side to GPU memory before the CG iteration and for copying the final solution back to the host at the end. In addition, the four vectors a′, b′, c′ and d describing the vertical discretisation need to be copied to the device once before the main PCG iteration. However, as each vector only has length n_z, the amount of data transferred is negligible in comparison to the size of the initial solution and right hand side vector; in particular it is significantly less than for matrix-explicit implementations.

6.1 Domain decomposition and data layout

Due to the inherently sequential nature of the Thomas algorithm, parallelisation in the vertical direction is not possible for the preconditioner.
Instead the code is parallelised by assigning each vertical column to one thread and organising these into thread blocks of size B = B_x × B_y, each of which is launched on one streaming multiprocessor. Parallelisation only in the horizontal direction is common in atmospheric models, where data dependencies in the vertical direction are introduced by physical processes such as radiative transfer. To achieve a good occupancy the number of threads per block should be large; in particular, latency hiding can only occur for B ≫ 1.

In the code all three dimensional field vectors, such as the solution vector u, are represented as one dimensional arrays of size n = m × m × n_z which are stored contiguously in memory. To access the entry u_ijk the three dimensional index (i, j, k) ∈ [0, m−1] × [0, m−1] × [0, n_z−1] (where k is the vertical index) has to be mapped to a linear index ℓ ∈ {0, ..., n−1}. To achieve optimal cache usage on the CPU, vertical columns are stored consecutively in memory in the C code, which can be achieved by the mapping
$$\ell^{(\mathrm{C})} = n_z \cdot (m \cdot i + j) + k \qquad (20)$$
already introduced in (12). However, on a GPU the data layout in (20) would lead to a significant amount of uncoalesced memory access, as the number of vertical levels is large. For a typical n_z of 128, consecutive threads will access data which is 128 × sizeof(float) bytes apart in memory. Although this problem can be mitigated by use of the L1 cache (or manual prefetching into shared memory at the beginning of each kernel), on a GPU the L1 cache is shared between a large number of threads, which severely limits the cache size per thread. As each thread processes an entire vertical column, efficient caching would only be possible for relatively small horizontal block sizes B: in the Thomas algorithm two vectors of length n_z need to be stored per column in addition to the four vectors a′, b′, c′ and d which describe the vertical discretisation. The total amount of L1 cache on the Fermi architecture is limited to 48 kB, hence for single precision floating point numbers one would have
$$48\,\mathrm{kB} \geq (2B + 4)\, n_z \times \mathrm{sizeof}(float), \qquad (21)$$
which would limit the number of threads per block to 44 for single precision and 22 for double precision calculations if we assume that n_z = 128.

An alternative approach is to change the storage format of the fields. Instead of (20) we use
$$\ell^{(\mathrm{CUDA\text{-}C})} = m \cdot (n_z \cdot j + k) + i \qquad (22)$$
in the matrix-free CUDA-C code, i.e. the first horizontal index runs fastest. This ensures that, provided the horizontal block size B_x is larger than 16, memory access for all threads in a half-warp is coalesced, as illustrated in Fig. 3.

Figure 3: Memory layout and data access pattern. Data read by all threads in a warp is shown with a gray background. If the vertically contiguous ordering of the degrees of freedom in (20) is used, memory access is not coalesced between the threads in one warp (left). The ordering of the degrees of freedom in (22) leads to coalesced memory access between all threads in a warp (right).
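For reference, the two mappings can be written as small helper functions (hypothetical names; a sketch only): with (22), threads that differ only in the first horizontal index i touch consecutive array elements, so the loads of a half-warp fall into a single 128-byte segment.

```cuda
/* Index mapping (20): vertical columns contiguous (used in the CPU code). */
__host__ __device__ inline int idx_c(int i, int j, int k, int m, int nz) {
  return nz * (m * i + j) + k;
}

/* Index mapping (22): the first horizontal index i runs fastest (used in
 * the matrix-free CUDA-C code), so threads of a (half-)warp, which differ
 * in i, access consecutive memory locations and the loads coalesce. */
__host__ __device__ inline int idx_cuda(int i, int j, int k, int m, int nz) {
  return m * (nz * j + k) + i;
}
```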
In our numerical tests we found that the block size B_x = 64, B_y = 2 gave good results; in particular the global load efficiency was close to its optimal value for the interleaved preconditioner and remained high for the SpMV kernel. For this block size the data processed by one block is too large to fit into cache. However, we find that both the interleaved preconditioner and the interleaved SpMV kernel still benefit from the L1 cache, and the total global memory bandwidth reaches a sizeable fraction of the peak value for both kernels. To reduce the number of cache misses further, a more fine grained parallelisation would have to be used to reduce the data volume per thread and to ensure that data used by each thread block can be kept in shared memory.

The approach we are currently exploring (but which is not used in the implementation described in this article) is to use a different solver for the tridiagonal system in the vertical direction. The substructuring method discussed in [46] splits the n_z × n_z problem into B_z smaller tridiagonal systems of size ≈ (n_z/B_z) × (n_z/B_z), which can be solved independently by different threads. To obtain the global solution on the entire vector of length n_z, a global tridiagonal system of size B_z × B_z for the interface points has to be solved before the solutions of the small subsystems can be combined to the total solution.

In addition to the matrix-free implementation described in section 4.2 we also wrote a version of the code based on an explicit representation of the matrix. Because of its popularity in the literature on sparse matrix-vector products and Krylov subspace solvers (see the discussion in section 2) we chose the CSR format. The n × n tridiagonal matrix used in the preconditioner was stored as a set of three vectors of length n. The cusparse{S|D}csrmv() function from the CUSPARSE library was used for the sparse matrix-vector multiplication and the cusparse{S|D}gtsv() subroutine for the tridiagonal solver (on the CPU the corresponding LAPACK routines GTTRF and GTTRS were used).
All runs were carried out on the GPU node of the aquila cluster in Bath. The node contains an Intel Xeon E5-2620 Sandybridge CPU with a clock speed of 2.00 GHz and an nVidia Fermi M2090 GPU. The theoretical peak performance of one core of the Sandybridge CPU without AVX extensions is 8.0 GFLOPs (4 floating point operations per cycle × 2.00 GHz). The nvcc compiler (release 5.0, V0.2.1221) was used for compiling the CUDA code and we used version 4.4.6 of the gnu C compiler for compilation of the CPU code. To achieve the best possible performance of the matrix-explicit CPU code, optimised BLAS and LAPACK libraries based on the OpenBLAS implementation (see [47]) were used. AVX extensions were disabled on the CPU.

We first study the performance and speedup of the individual components of the PCG solver and of the entire algorithm for a fixed problem size, with ω and λ set to a typical set of parameters in meteorological applications. On the GPU we used a two-dimensional block layout and each block has a size of B_x × B_y = 64 × 2 threads. We found that this gives good results and varying the block size did not increase the performance.

In Tab. 3 the times for a single sparse matrix-vector multiplication and preconditioner solve are shown for the matrix-explicit method. These times do not include any costs for setting up the matrix or for transferring data between host and device, as this is only required once for each PCG solve, which consists of a large number of sparse matrix-vector multiplications and preconditioner applications. The speedups shown in this table are relative to the sequential CPU implementation. Assuming that the CPU code can be parallelised perfectly, the socket-to-socket speedup is a factor of six smaller, as the CPU contains six processor cores.

matrix-explicit        time per call           speedup
kernel                 C          CUDA-C       C vs. CUDA-C
single precision
  SpMV                 170.6      6.91         24.7x
  preconditioner       205.3      12.50        16.4x
double precision
  SpMV                 182.2      10.91        16.7x
  preconditioner       249.7      23.20        10.8x

Table 3: Measured times and speedups for one sparse matrix-vector multiplication (SpMV) y ← Ax and one preconditioner solve x ← M⁻¹y on the CPU and GPU using the matrix-explicit implementation. All times are given in milliseconds. The speedup of the CUDA-C code relative to the sequential CPU implementation is shown in the last column for each case.

The corresponding times for the matrix-free implementation are shown in Tab. 4, where we also show the speedup of the matrix-free CUDA-C code relative to the matrix-explicit CUDA-C code.

matrix-free              time per call        speedup
kernel                   C          CUDA-C    C vs. CUDA-C    matrix-free vs. CSR
single precision
  SpMV                   78.5       0.75      104.7x          9.2x
  preconditioner         252.6      2.40      105.3x          5.2x
  interlvd. SpMV         129.9      2.16      60.1x           —
  interlvd. prec.        253.1      3.34      75.8x           —
double precision
  SpMV                   80.7       1.41      57.2x           7.7x
  preconditioner         350.4      3.75      93.4x           6.2x
  interlvd. SpMV         132.3      3.86      34.3x           —
  interlvd. prec.        351.3      4.86      72.3x           —

Table 4: Measured times and speedups for one sparse matrix-vector multiplication (SpMV) y ← Ax and preconditioner solve y ← M⁻¹x on the CPU and GPU using the matrix-free implementation. All times are given in milliseconds. The last two columns show the speedup of the matrix-free CUDA-C code relative to the corresponding sequential CPU implementation and relative to the matrix-explicit CUDA-C version.

On the CPU the matrix-free sparse matrix-vector multiplication is more than twice as fast as the matrix-explicit implementation. However, the matrix-explicit preconditioner is around 20-40% faster than the corresponding matrix-free version. On the GPU the matrix-free code is significantly faster than the matrix-explicit version: for the sparse matrix-vector multiplication the speedup is 9.2x for single precision and 7.7x for double precision. For the preconditioner the speedup is slightly smaller
with 5.2x and 6.2x for single and double precision respectively. The speedup of both standalone matrix-free GPU kernels relative to the sequential CPU code is more than 100x in single precision. In double precision the speedup is around 93x for the preconditioner and 57x for the sparse matrix-vector multiplication. While the corresponding speedups for the interleaved kernels are smaller, their performance has to be judged in the context of the full PCG loop, which for the standalone kernels also contains several level 1 BLAS operations.

We observe that for the matrix-free standalone SpMV kernel the double precision implementation takes about twice as long as the single precision version. This suggests that the implementation is bandwidth limited and less affected by cache efficiency, as the double precision version requires transferring twice as much data from global memory as the single precision implementation. This interpretation is also corroborated by the relatively high cache hit rate for this kernel reported in Tab. 7. The cache hit rate is smaller for the interleaved sparse matrix-vector multiplication and the preconditioner kernels, all of which show a smaller increase in the runtime between single and double precision. A similar drop of performance by nearly a factor of two when going from single to double precision can be observed for the matrix-explicit kernels, see Tab. 3.

We now analyse the performance of the entire PCG solver and break down the time spent in a single iteration of the algorithm.
We now analyse the performance of the entire PCG solver and break down the time spent in a single iteration of the algorithm. We measured the time required to carry out 100 PCG iterations, which is sufficient to reduce the residual by five orders of magnitude. Our measurements include the time for the matrix setup and data transfer between host and device. The results are listed for three different implementations of the algorithm in Tab. 5, where we also calculated the speedup of the matrix-free CUDA-C code both relative to the C code and relative to the matrix-explicit GPU code. The matrix-free interleaved algorithm gives the best performance, with a speedup of roughly 60× (single precision) and roughly 48× (double precision) relative to the C code on the CPU. It outperforms the matrix-explicit GPU code by more than a factor of four. On the CPU the interleaved algorithm is only slightly faster than standard PCG for single precision and even slightly slower for double precision. This is because in the sequential implementation the bulk of the time is spent in the sparse matrix-vector multiplication and preconditioner kernels, so fusing the remaining kernels can only give a small additional speedup. This is different on the GPU, as will be discussed in section 7.4.2.

Table 5: Total solution time for different implementations of the PCG algorithm. Costs for matrix setup and host-device data transfer are listed separately and included in the total times. 100 iterations of the PCG main loop were carried out in all cases and all times are given in seconds.

                               matrix and            data        total time          speedup
                               preconditioner setup  transfer
  implementation               C         CUDA        CUDA        C         CUDA      C vs. CUDA    mat.-free vs. CSR
  single precision
    matrix-explicit            0.55      0.15        0.047       43.72     2.54      ≈17×          —
    matrix-free                —         —           0.047       37.00     0.87      ≈43×          ≈2.9×
    matrix-free interleaved    —         —           0.047       37.39     0.62      ≈60×          ≈4.1×
  double precision
    matrix-explicit            0.72      0.16        0.073       50.50     4.41      ≈11×          —
    matrix-free                —         —           0.073       49.37     1.48      ≈33×          ≈3.0×
    matrix-free interleaved    —         —           0.073       45.78     0.96      ≈48×          ≈4.6×

As expected, the cost for setting up the vertical discretisation matrix (calculation of a′, b′, c′ and d) and copying it to the device turned out to be negligible (≪ ms). For the matrix-explicit code the matrix setup time only accounts for a small proportion of the runtime; on the GPU it amounts to only a few per cent of the total solution time in both single and double precision. Although for the matrix-free interleaved code host-device data transfer of the solution and right hand side vector takes up only a small part of the runtime (well below ten per cent in both single and double precision), this is no longer true if a smaller number of iterations is used, for example to reduce the residual by only three orders of magnitude. We also found that on the CPU the total solution time can be reduced by a factor of around four if the Krylov subspace solver is replaced by a geometric multigrid solver, as is discussed in [36]. Again, this would increase the relative importance of the host-device memory transfer.

The time per iteration is given for the three different implementations of the PCG algorithm in Tab. 6, where we also calculate the speedup of the matrix-free implementation relative to the matrix-explicit version. The speedup relative to the C implementation is roughly 66× for the matrix-free interleaved implementation in single precision and roughly 52× in double precision. In both cases it is more than four times faster than the matrix-explicit implementation on the GPU.

Table 6: Time per iteration for different implementations of the conjugate gradient algorithm. All times are measured in milliseconds.

                               time per iteration      speedup
  implementation               C          CUDA         C vs. CUDA    mat.-free vs. CSR
  single precision
    matrix-explicit            410.8      23.5         ≈17×          —
    matrix-free                376.3      8.1          ≈46×          ≈2.9×
    matrix-free interlvd.      370.6      5.6          ≈66×          ≈4.2×
  double precision
    matrix-explicit            494.6      41.4         ≈12×          —
    matrix-free                491.5      13.9         ≈35×          ≈3.0×
    matrix-free interlvd.      453.4      8.8          ≈52×          ≈4.7×

Comparing the time per iteration for the matrix-free and the matrix-free interleaved implementation, we find that the ratio between the two is 5.6/8.1 ≈ 0.69 in single precision and 8.8/13.9 ≈ 0.63 in double precision. This should be compared to the corresponding ratio of memory references in Tab. 2, which is /40 without caching, /27 assuming that the vectors a′, b′, c′ and d and all data in one vertical column are cached, and /32 if only a′, b′, c′ and d are cached.

To identify the remaining bottlenecks, these times are further broken down for the GPU code in Figs. 4 and 5 for single and double precision arithmetic. Note that for the matrix-free code the BLAS operations, and in particular the axpy updates, make up a significant proportion of the time per iteration. The plots clearly show the benefit of the interleaved implementation, which reduces the time per iteration from 8.1 ms to 5.6 ms in single precision and from 13.9 ms to 8.8 ms in double precision.

While comparing the runtime of the CUDA-C code to the corresponding sequential CPU implementation can give an idea of the achievable performance gains, it is somewhat arbitrary in that it depends on the exact CPU which is used for this comparison. For this reason we also quantified the absolute performance of the matrix-free CUDA-C code. Several performance indicators for the kernels of the matrix-free code are shown in Tab. 7.
Figure 4: Time per iteration for different parts of the main CG loop on the device using single precision. The BLAS operations were implemented with the CUBLAS library.
Figure 5: Time per iteration for different parts of the main CG loop on the device using double precision. The BLAS operations were implemented with the CUBLAS library.

The global load efficiency measures the amount of coalesced global memory access and the L1 hit rate quantifies the cache efficiency. For all kernels the load efficiency is very high due to the coalescence of global memory access described in section 6.1.

Table 7: Performance measurements of the different matrix-free kernels as reported by the nVidia visual profiler. The bandwidth is the DRAM load bandwidth.

                        GFLOPs     memory bandwidth    global load      L1 hit
  kernel                           [GB/s]              efficiency       rate
  single precision
    SpMV                223.7      50.73               86.2%            76.7%
    preconditioner      45.4       38.41               99.9%            40.4%
    interlvd. SpMV      108.7      64.67               88.8%            56.5%
    interlvd. prec.     47.7       46.16               99.6%            33.6%
  double precision
    SpMV                119.0      79.07               85%              74.0%
    preconditioner      29.1       49.35               99.7%            40.1%
    interlvd. SpMV      60.9       78.97               88.0%            57.5%
    interlvd. prec.     32.8       64.58               99.8%            33.1%

The floating point performance for one iteration of the matrix-free interleaved PCG algorithm is plotted for a range of problem sizes in Fig. 6. The nVidia M2090 Fermi GPU has a global memory of 6 GB, and as the PCG algorithm requires the storage of 5 field vectors, this limits the size of the largest problem that fits into memory; the limit is a factor of two smaller in double precision. In all cases the size of a vertical column was kept fixed at n_z = 128. The performance is virtually independent of the problem size and amounts to about 70-80 GFLOPs in single and 40-50 GFLOPs in double precision. As the algorithm is memory bound, a more relevant measure is the global memory bandwidth, which is shown for the interleaved kernels in Fig. 7. The bandwidth increases slightly with the problem size and a sizeable fraction of the peak value could be achieved.

Figure 6: Floating point performance for different problem sizes for the matrix-free interleaved PCG code on the GPU.

Figure 7: Global memory bandwidth (DRAM read bandwidth as reported by the nVidia Visual Profiler) for different problem sizes for the interleaved apply- and preconditioner kernels.

In this article we described the matrix-free implementation of a Preconditioned Conjugate Gradient solver for strongly anisotropic elliptic PDEs on a GPU. Equations of this type arise in many applications in atmospheric and geophysical modelling if the problem is discretised in "flat" geometries. In particular, the semi-implicit semi-Lagrangian time discretisation of the non-hydrostatic Euler equations in the dynamical core of many weather and climate prediction models leads to a three dimensional PDE for the pressure correction, and due to the small thickness of the atmosphere the elliptic operator has a very strong anisotropy in the vertical direction. This anisotropy can be exploited to construct a simple and efficient preconditioner based on vertical line relaxation. The dependencies in each vertical column require a horizontal domain decomposition, which has implications for the data and thread layout on the GPU.

As the performance of the algorithm is limited by the speed with which data can be transferred from global memory to the compute units, it is important to reduce the number of memory references. We achieved this by using a matrix-free implementation which recalculates the local matrix stencil whenever it is needed instead of reading it from memory. In addition, we reduced the amount of data transfer by fusing several kernels in the PCG loop, which gave a further reduction of the time per iteration in both single and double precision (cf. Tab. 6). In total we demonstrated that on an nVidia Fermi M2090 GPU the best matrix-free code achieved a speedup of roughly 60× in single precision and roughly 48× in double precision compared to the corresponding sequential implementation on an Intel Xeon E5-2620 Sandybridge CPU (Tab. 5). In terms of absolute times, the residual of the problem considered above could be reduced by more than five orders of magnitude in 0.62 seconds in single precision. The double precision implementation takes about 1.5 times longer, which demonstrates the good double precision performance of modern GPUs. Our matrix-free version is more than four times faster than a matrix-explicit GPU implementation based on the CUSPARSE and CUBLAS libraries using the CSR matrix format, which clearly demonstrates the benefit of our approach. We measured the absolute performance of our code for a range of problem sizes and achieved a sizeable fraction of the peak global memory bandwidth across this range.

Global memory access was coalesced for the threads within a half warp by numbering the degrees of freedom such that the horizontal index runs fastest. This is different from CPU implementations, where vertical columns are stored consecutively in memory for cache efficiency. While the specific implementation discussed in this article is based on a regular horizontal grid, our method can be applied to any three dimensional grid which can be written as the tensor product of a possibly unstructured horizontal grid and a non-uniform one dimensional grid in the vertical direction.

The achieved global memory bandwidth is a sizeable fraction of the peak value, and it can theoretically be improved further by making better use of the GPU cache or shared memory. This would require the parallelisation of the tridiagonal solver in the vertical direction, and we are currently investigating the substructuring approach described in [46].

With the planned increase in weather and climate model resolution, problems with a much larger number of degrees of freedom need to be solved. Clearly this is not possible on a single GPU due to the limited size of its global memory, and hundreds of GPUs are necessary to solve problems of this size. We are currently extending the algorithm to multi-GPU clusters; this will introduce additional bottlenecks: unless data can be copied directly between GPU memories, it has to be transferred to the host at each iteration before it can be sent through the standard MPI network.
However, in this case only halo data needs to be exchanged between neighbouring devices and, provided the local problem size is not too small, this is significantly less than the total data processed by one GPU.

The PCG solver described in this work requires around one hundred iterations to reduce the residual by five orders of magnitude. In contrast, multigrid solvers can achieve the same reduction in a much smaller number of iterations, as has been demonstrated in [36] for the elliptic PDE considered in this article. The preconditioner used in this work can be used as a multigrid smoother and the only missing components are intergrid operators for restriction and prolongation, which is the object of our current research.
Acknowledgements

We would like to thank all members of the GungHo! project and in particular Chris Maynard and David Ham for useful and inspiring discussions. The numerical experiments in this work were carried out on a node of the aquila supercomputer at the University of Bath and we are grateful to Steven Chapman for his continuous and tireless technical support, which was essential for the success of this project. The contribution of EM and RS was funded as part of the NERC project on "Next Generation Weather and Climate Prediction" (NGWCP), grant number NE/J005576/1.
A Interleaved PCG kernels
The kernels for the interleaved PCG algorithm described in section 4.3 are shown in Algorithms 2 and 3. In the GPU implementation each thread calculates dot products and norms in one column. To obtain the global sums, these per-column results need to be reduced with an additional BLAS operation. However, as this operation only acts on two-dimensional (horizontal) fields, its cost is a very small fraction of the time per iteration.
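As a purely illustrative sketch (this is not the authors' code: the kernel name column_dot, the layout of the partial-result field sigma_col, the auxiliary ones vector and the choice of cublasSdot for the final reduction are our own assumptions), the per-column accumulation of a dot product and its subsequent reduction over the horizontal grid could be organised as follows:

  // Sketch only: each thread owns one vertical column (i,j), accumulates its
  // contribution to sigma = <p,q> over that column and writes the partial sum
  // to a two-dimensional field of size nx*ny.
  __global__ void column_dot(int nx, int ny, int nz,
                             const float* __restrict__ p,
                             const float* __restrict__ q,
                             float* __restrict__ sigma_col)
  {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= nx || j >= ny) return;
    float sigma = 0.0f;
    for (int k = 0; k < nz; ++k) {
      int idx = (k * ny + j) * nx + i;   // horizontal index runs fastest
      sigma += p[idx] * q[idx];
    }
    sigma_col[j * nx + i] = sigma;
  }

  // Host side (one possible BLAS-type reduction; 'ones' is a device vector
  // of length nx*ny filled with 1.0f and 'handle' is an initialised CUBLAS handle):
  //   float sigma;
  //   cublasSdot(handle, nx * ny, sigma_col, 1, ones, 1, &sigma);

Writing one partial result per column keeps the amount of data involved in the final reduction proportional to the number of horizontal grid points only, which is why its cost is negligible compared to the loops over the vertical columns.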
Algorithm 2: Interleaved matrix-multiplication kernel. Simultaneously calculate u ← u + α p, p ← z + β p, q ← Az + β q and σ ← ⟨p, q⟩ in a single iteration over the grid.

  for i = 0, ..., m do
    for j = 0, ..., m do
      Calculate α_{i',j'} and |T_ij|
      σ ← 0
      for k = 0, ..., n_z − 1 do
        p* ← p_{ijk}, q* ← q_{ijk}, z* ← z_{ijk}
        u_{ijk} ← u_{ijk} + α · p*
        p* ← β · p* + z*, q* ← β · q*
        p_{ijk} ← p*
        δq ← ((a'_k − b'_k − c'_k) · |T_ij| − α_{ij}) · z*
              + b'_k · |T_ij| · z_{i,j,k+1} + c'_k · |T_ij| · z_{i,j,k−1}
              + α_{i+1,j} · z_{i+1,j,k} + α_{i−1,j} · z_{i−1,j,k}
              + α_{i,j+1} · z_{i,j+1,k} + α_{i,j−1} · z_{i,j−1,k}
        q* ← q* + d_k · δq, σ ← σ + p* · q*
        q_{ijk} ← q*
      end for
    end for
  end for

Algorithm 3: Interleaved preconditioner kernel. Simultaneously calculate r ← r − α q, R ≡ ||r|| ← sqrt(⟨r, r⟩), κ ← ⟨r, z⟩ and solve M z = r in a single iteration over the grid.

  for i = 0, ..., m do
    for j = 0, ..., m do
      Calculate α_{i',j'} and |T_ij|
      R ← 0, κ ← 0, D ← (a'_0 − b'_0 − c'_0) − α'_{ij}, φ_0 ← b'_0 / D
      r* ← r_{ij,0} − α · q_{ij,0}, R ← R + r* · r*
      z_{ij,0} ← r* / (D · |T_ij| · d_0), r_{ij,0} ← r*
      for k = 1, ..., n_z − 1 do
        D ← ((a'_k − b'_k − c'_k) − α'_{ij}) − φ_{k−1} · c'_k, φ_k ← b'_k / D
        r* ← r_{ijk} − α · q_{ijk}, R ← R + r* · r*
        z_{ijk} ← (r* / (|T_ij| · d_k) − c'_k · z_{i,j,k−1}) / D, r_{ijk} ← r*
      end for
      κ ← κ + z_{i,j,n_z−1} · r_{i,j,n_z−1}
      for k = n_z − 2, ..., 0 do
        z* ← z_{ijk} − φ_k · z_{i,j,k+1}, κ ← κ + z* · r_{ijk}
        z_{ijk} ← z*
      end for
    end for
  end for
  R ← sqrt(R)

References

[1] M. Kwizak and A. J. Robert. A semi-implicit scheme for grid point atmospheric models of the primitive equations.
Monthly Weather Review, 99:32, 1971.

[2] André Robert. A stable numerical integration scheme for the primitive meteorological equations. Atmosphere-Ocean, 19(1):35–46, 1981.

[3] P. K. Smolarkiewicz and L. G. Margolin. On forward-in-time differencing in fluids: An Eulerian/semi-Lagrangian nonhydrostatic model for stratified flows. Atmosphere-Ocean, special vol. XXXV(1):127–152, 1997.

[4] S. J. Thomas, A. V. Malevsky, M. Desgagne, R. Benoit, P. Pellerin, and M. Valin. Massively Parallel Implementation of the Mesoscale Compressible Community Model. Span, pages 1–19, 1997.

[5] W. C. Skamarock, P. K. Smolarkiewicz, and J. B. Klemp. Preconditioned conjugate-residual solvers for Helmholtz equations in nonhydrostatic models. Monthly Weather Review, 125(4):587–599, April 1997.

[6] T. Davies, M. J. P. Cullen, A. J. Malcolm, M. H. Mawson, A. Staniforth, A. A. White, and N. Wood. A new dynamical core for the Met Office's global and regional modelling of the atmosphere.
Quarterly Journal of the Royal Meteorological Society, 131(608):1759–1782, April 2005.

[7] Thomas Melvin, Mark Dubal, Nigel Wood, Andrew Staniforth, and Mohamed Zerroukat. An inherently mass-conserving iterative semi-implicit semi-Lagrangian discretization of the non-hydrostatic vertical-slice equations. Quarterly Journal of the Royal Meteorological Society, 136(648):799–814, 2010.

[8] Nigel Wood, Andrew Staniforth, Andy White, Tom Allen, Michail Diamantakis, Markus Gross, Thomas Melvin, Chris Smith, Simon Vosper, Mohamed Zerroukat, and John Thuburn. An inherently mass-conserving semi-implicit semi-Lagrangian discretisation of the deep-atmosphere global nonhydrostatic equations. Submitted to Quarterly Journal of the Royal Meteorological Society, 2013.

[9] M. R. Hestenes and E. Stiefel. Methods of Conjugate Gradients for Solving Linear Systems.
Journal of Research of the National Bureau of Standards, 49(6):409–436, 1952.

[10] W. L. Briggs, V. E. Henson, and S. F. McCormick. A Multigrid Tutorial. Society for Industrial and Applied Mathematics, 2000.

[11] U. Trottenberg, C. W. Oosterlee, and A. Schüller. Multigrid. Academic Press, 2001.

[12] Yousef Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 2nd edition, April 2003.

[13] John Marshall, Alistair Adcroft, Chris Hill, Lev Perelman, and Curt Heisey. A finite-volume, incompressible Navier Stokes model for studies of the ocean on parallel computers.
Journal of Geophysical Research, 102:5753–5766, 1997.

[14] O. B. Fringer and M. Gerritsen. An unstructured-grid, finite-volume, nonhydrostatic, parallel coastal ocean simulator. Ocean Modelling, 14:139–173, 2006.

[15] S. Lacroix, Y. Vassilevski, J. Wheeler, and M. Wheeler. Iterative Solution Methods for Modeling Multiphase Flow in Porous Media Fully Implicitly. SIAM Journal on Scientific Computing, 25(3):905–926, 2003.

[16] Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schröder. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. ACM Transactions on Graphics, 22:917–924, 2003.

[17] S. Menon and J. B. Perot. Implementation of an efficient conjugate gradient algorithm for Poisson solutions on graphics processors. In
Proceedings of the 2007 Meeting of the Canadian CFD Society, Toronto, Canada, 2007.

[18] R. F. Carvalho, C. A. P. S. Martins, R. M. S. Batalha, and A. F. P. Camargos. 3D parallel conjugate gradient solver optimized for GPUs. In Electromagnetic Field Computation (CEFC), 2010, 14th Biennial IEEE Conference on, page 1, May 2010.

[19] M. Ament, G. Knittel, D. Weiskopf, and W. Strasser. A Parallel Preconditioned Conjugate Gradient Solver for the Poisson Problem on a Multi-GPU Platform. In Parallel, Distributed and Network-Based Processing (PDP), 2010, 18th Euromicro International Conference on, pages 583–592, February 2010.

[20] M. M. Dehnavi, D. M. Fernández, and D. Giannacopoulos. Enhancing the Performance of Conjugate Gradient Solvers on Graphic Processing Units. Magnetics, IEEE Transactions on, 47(5):1162–1165, May 2011.

[21] H. Knibbe, C. W. Oosterlee, and C. Vuik. GPU implementation of a Helmholtz Krylov solver preconditioned by a shifted Laplace multigrid method.
Journal of Computational and Applied Mathematics, 236:281–293, 2011.

[22] Martijn de Jong. Developing a CUDA solver for large sparse matrices for MARIN. Master's thesis, Delft Institute of Applied Mathematics, 2012.

[23] Rudi Helfenstein and Jonas Koko. Parallel preconditioned conjugate gradient algorithm on GPU. Journal of Computational and Applied Mathematics, 236(15):3584–3590, 2012. Proceedings of the Fifteenth International Congress on Computational and Applied Mathematics (ICCAM-2010), Leuven, Belgium, 5-9 July, 2010.

[24] I. Reguly and M. Giles. Efficient sparse matrix-vector multiplication on cache-based GPUs. In Innovative Parallel Computing (InPar), 2012, pages 1–12, May 2012.

[25] Ruipeng Li and Yousef Saad. GPU-accelerated preconditioned iterative linear solvers. The Journal of Supercomputing, 63:443–466, 2013.

[26] Timothy A. Davis and Yifan Hu. The University of Florida sparse matrix collection.
ACM Trans. Math. Softw., 38(1):1:1–1:25, December 2011.

[27] M. M. Dehnavi, D. M. Fernández, and D. Giannacopoulos. Finite-Element Sparse Matrix Vector Multiplication on Graphic Processing Units. Magnetics, IEEE Transactions on, 46(8):2982–2985, August 2010.

[28] Nathan Bell and Michael Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 18:1–18:11, New York, NY, USA, 2009. ACM.

[29] Dominik Michels. Sparse-matrix-CG-solver in CUDA. In Proceedings of the 15th Central European Seminar on Computer Graphics, 2011.

[30] Graham R. Markall, David A. Ham, and Paul H. J. Kelly. Towards generating optimised finite element solvers for GPUs from high-level specifications.
Procedia Computer Science, 1(1):1815–1823, 2010. ICCS 2010.

[31] C. D. Cantwell, S. J. Sherwin, R. M. Kirby, and P. H. J. Kelly. From h to p efficiently: Strategy selection for operator evaluation on hexahedral and tetrahedral elements. Computers & Fluids, 43(1):23–28, 2011. Symposium on High Accuracy Flow Simulations. Special Issue Dedicated to Prof. Michel Deville.

[32] Nolan Goodnight, Cliff Woolley, Gregory Lewin, David Luebke, and Greg Humphreys. A multigrid solver for boundary value problems using programmable graphics hardware. In ACM SIGGRAPH 2005 Courses, SIGGRAPH '05, New York, NY, USA, 2005. ACM.

[33] Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, and Stefan Turek. Efficient finite element geometric multigrid solvers for unstructured grids on GPUs. Techn. Univ., Fak. für Mathematik, 2011.

[34] James Brannick, Yao Chen, Xiaozhe Hu, and Ludmil Zikatanov. Parallel Unsmoothed Aggregation Algebraic Multigrid Algorithms on GPUs. arXiv preprint arXiv:1302.2547, 2013.

[35] S. Börm and R. Hiptmair. Analysis Of Tensor Product Multigrid.
Numer. Algorithms, 26:200–1, 1999.

[36] Eike Müller and Robert Scheichl. Massively parallel solvers for elliptic PDEs in Numerical Weather- and Climate Prediction. In preparation, 2013.

[37] Ali Cevahir, Akira Nukada, and Satoshi Matsuoka. Fast Conjugate Gradients with Multiple GPUs, volume 5544 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2009. Editors: Allen, Gabrielle and Nabrzyski, Jarosław and Seidel, Edward and Albada, Geert Dick and Dongarra, Jack and Sloot, Peter M.A.

[38] Serban Georgescu and Hiroshi Okuda. Conjugate gradients on multiple GPUs. International Journal for Numerical Methods in Fluids, 64(10-12):1254–1273, 2010.

[39] Mickeal Verschoor and Andrei C. Jalba. Analysis and performance estimation of the Conjugate Gradient method on multiple GPUs.
Parallel Comput., 38(10-11):552–575, October 2012.

[40] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, New York, NY, USA, 3rd edition, 2007.

[41] P. K. Smolarkiewicz and L. G. Margolin. Variational solver for elliptic problems in atmospheric flows. Appl. Math. and Comp. Sci., 4:101–125, 1994.

[42] Zbigniew Piotrowski, Andrzej Wyszogrodzki, and Piotr Smolarkiewicz. Towards petascale simulation of atmospheric circulations with soundproof equations. Acta Geophysica, pages 1–18, 2011.

[43] A. T. Chronopoulos and C. W. Gear. s-step iterative methods for symmetric linear systems. Journal of Computational and Applied Mathematics, 25(2):153–168, 1989.

[44] nVidia Corporation. Fermi architecture whitepaper, 2009. Accessed 9 February 2013.

[45] nVidia Corporation. CUDA programming guide, 2012. Accessed 9 February 2013.

[46] A. Toselli and O.B. Widlund.