Compressed Basis GMRES on High Performance GPUs
José I. Aliaga, Hartwig Anzt, Thomas Grützmacher, Enrique S. Quintana-Ortí, Andrés E. Tomás
Abstract.
Krylov methods provide a fast and highly parallel numerical tool for the iterative solution of many large-scale sparse linear systems. To a large extent, the performance of practical realizations of these methods is constrained by the communication bandwidth in all current computer architectures, motivating the recent investigation of sophisticated techniques to avoid, reduce, and/or hide the message-passing costs (in distributed platforms) and the memory accesses (in all architectures). This paper introduces a new communication-reduction strategy for the (Krylov) GMRES solver that advocates decoupling the storage format (i.e., the data representation in memory) of the orthogonal basis from the arithmetic precision that is employed during the operations with that basis. Given that the execution time of the GMRES solver is largely determined by the memory accesses, the datatype transforms can be mostly hidden, resulting in the acceleration of the iterative step via a lower volume of bits being retrieved from memory. Together with the special properties of the orthonormal basis (whose elements are all bounded by 1), this paves the road toward the aggressive customization of the storage format, which includes some floating point as well as fixed point formats, with little impact on the convergence of the iterative process. We develop a high performance implementation of the "compressed basis GMRES" solver in the Ginkgo sparse linear algebra library and, using a large set of test problems from the SuiteSparse matrix collection, we demonstrate robustness and performance advantages on a modern NVIDIA V100 GPU of up to 50% over the standard GMRES solver that stores all data in IEEE double precision.
Key words.
Sparse linear systems, mixed precision, Krylov solvers, compressed basis GMRES, GPUs.
AMS subject classifications.
1. Introduction.
Krylov solvers enhanced with some type of sophisticated preconditioning technique nowadays compound a popular approach for the iterative solution of large and sparse linear systems [24]. In particular, preconditioned Krylov solvers are often preferred over their direct counterparts for the solution of discretized high-dimensional problems (e.g., 3D problems), where a factorization-based direct solver would incur significant fill-in [11, 24]. Krylov solvers are also widely appealing for massively-parallel architectures (e.g., graphics processing units, or GPUs) due to their superior scalability.

At a high level, Krylov methods span a Krylov subspace by generating a sequence of orthogonal (Krylov) search directions (starting with the normalized residual and computing each new search direction via the multiplication of the sparse coefficient matrix with the previous direction); the orthogonalization of the resulting search direction against previous search directions; and the optimization of the solution approximation in the extended Krylov subspace [24]. Each iteration step is usually composed of a sparse matrix-vector product (
SpMV); an orthogonalization routine; and several vector operations that compute the new search directions, update the solution approximation, and estimate the norm of the residual [24].

∗ Universitat Jaume I, Spain ([email protected])
† Karlsruhe Institute of Technology, Germany; and Innovative Computing Laboratory, University of Tennessee at Knoxville, USA ([email protected])
‡ Karlsruhe Institute of Technology, Germany ([email protected])
§ Universitat Politècnica de València, Spain ([email protected])
¶ Universitat Jaume I, Spain; and Universitat de València, Spain ([email protected])
The numerical operations (kernels) appearing in Krylov methods are well-suited for parallelization. Unfortunately, most of these kernels, including SpMV, are memory-bound on virtually all modern processor architectures [18]. As a result, many generic as well as hardware-specific optimization efforts for Krylov methods have focused on avoiding, reducing, or hiding (i.e., overlapping with computation) the communication/memory accesses of the algorithm. Some optimization techniques targeting the communication overhead include the following:
– The design of specialized (i.e., application-specific) sparse matrix data layouts that restrict the indexing information (overhead) and/or improve data locality when accessing the contents of the sparse coefficient matrix [24].
– The reorganization of the operations inside the body of the Krylov solver that trades off reduced communication for an increase of computation per iteration, possibly also at the cost of introducing numerical instabilities that may result in slower convergence of the iteration; see, e.g., [10] and the references therein.
– The reformulation of the solver as an iterative refinement scheme combined with the use of mixed precision for the storage of, and arithmetic operations with, the sparse coefficient matrix [17].
– The utilization of adaptive-precision schemes for memory-bound preconditioners [5].

In this paper we also address the communication costs of Krylov methods, focusing on the Generalized Minimal RESidual (GMRES) algorithm, a Krylov solver for general linear systems that explicitly maintains the complete set of Krylov search directions instead of relying on short recurrences (as many other Krylov solvers do) [24]. Orthogonally to all previous communication optimization efforts, our optimized variant of the GMRES algorithm reduces communication in the access to the orthogonal basis during the iteration loop body. In more detail, our GMRES algorithm decouples the memory storage format from the arithmetic precision, and stores the Krylov search directions in a compact "reduced precision" format. This radically diminishes the memory access volume during the orthogonalization, while not affecting the convergence rate of the solver, yielding notable performance improvements. Concretely, we make the following contributions in our paper:
– We propose to decouple the memory storage format from the arithmetic precision, maintaining the Krylov basis in reduced precision in memory while performing all arithmetic operations using full, hardware-supported IEEE double precision.
– We evaluate several compact storage formats for the basis, including IEEE single and half precision as well as fixed point-based alternatives enhanced with vector-wise normalization.
– We provide strong practical evidence of the advantage of our approach by developing a high performance realization of the solver for modern NVIDIA V100 GPUs and testing it using a considerable number of large-scale problems from the SuiteSparse matrix collection (https://sparse.tamu.edu/).
– We integrate the mixed precision GMRES algorithm into the Ginkgo sparse linear algebra library (https://ginkgo-project.github.io).
– We combine our implementation with a high performance realization of an adaptive precision block-Jacobi preconditioner that adjusts the memory format of the preconditioner data to the numerical requirements [5].

The rest of the paper is structured as follows. After reviewing related work in Section 2 and the GMRES algorithm in Section 3, we introduce in Section 4 the compressed basis GMRES (CB-GMRES) storing the orthonormal basis in reduced precision. In Section 5 we provide details about how we decouple the memory precision from the arithmetic precision, and how we realize the implementation of CB-GMRES in Ginkgo. The experimental evaluation of the CB-GMRES implementation is presented in Section 6, assessing accuracy, convergence, performance, and flexibility of the developed algorithm. We conclude in Section 7 with a summary of the findings and ideas for future research.
2. Related work.
The potential of using lower precision in different components of a Krylov solver has been previously investigated for both Lanczos-based (short-term recurrence) and Arnoldi-based (long-term recurrence) algorithms and the associated methods for solving linear systems of equations.

From the theoretical point of view, most of those works are based on rounding theory for Krylov solvers running in finite precision. Among the most relevant results are those by Paige [22], who derived distinct relations between the loss of orthogonality and other important quantities in finite precision Lanczos. Greenbaum extended these results to prove backward stability for the CG method in finite precision [15]. She also derived theoretical bounds for the maximum attainable accuracy in finite precision for CG, BiCG, BiCGSTAB, and other Lanczos-based methods [16]. Carson [9] extended these results to s-step Lanczos/CG variants, deducing that an s-step Lanczos in finite precision behaves like a classical Lanczos run in lower "effective" precision, where this "effective" precision depends on the conditioning of the polynomials used to generate the s-step bases. Additional bounds for Lanczos-based Krylov solvers running in finite precision can be found in [20].

From these theoretical results on Krylov solvers running in finite precision, Simoncini/Szyld [25] and Eshof/Sleijpen [27] developed "inexact Krylov subspace methods" that apply the SpMV in lower precision to accelerate linear system solvers when this kernel dominates the cost of the computation. Theoretical results prove that inexact Krylov methods can achieve the same solution accuracy as high precision Krylov solvers, but little is known about the potential convergence delay.

Concerning long-recurrence strategies, Gratton et al. [14] combined the previous findings from Björck [8] and Paige et al. [22, 23] to derive theoretical norms for a mixed precision GMRES algorithm based on modified Gram-Schmidt.
In this algorithm, they consider using inexact (e.g., single precision) inner products in the orthogonalization process, which results in a loss of double precision (DP-)orthogonality of the Krylov search directions. This makes the work by Gratton et al. [14] very similar to our approach. However, our approach is different in several aspects:
• We decouple the arithmetic precision from the memory storage format to maintain the orthogonal search directions in lower precision while preserving full precision in all computations;
• We consider not only IEEE single precision as the reference compact storage format, but also IEEE half precision (HP) and fixed point formats based on 32-bit and 16-bit integers;
• We realize a production-ready and sustainable implementation for high performance GPU architectures including restarting and classical Gram-Schmidt with reorthogonalization; and
• We provide comprehensive experimental results analyzing accuracy, convergence, and performance of our mixed precision GMRES solver.

 1: Compute r := b − Ax, β := ‖r‖, and v_1 := r/β. Set V_1 = [v_1]
 2: for j := 1, 2, ..., m
 3:   Compute w := A(M⁻¹ v_j)
 4:   ω := ‖w‖
 5:   Orthogonalize: h_{1:j,j} := V_j^T w, w := w − V_j h_{1:j,j}
 6:   h_{j+1,j} := ‖w‖
 7:   if (h_{j+1,j} < η ω) then
 8:     Re-orthogonalize: u := V_j^T w, w := w − V_j u
 9:     h_{1:j,j} := h_{1:j,j} + u
10:     h_{j+1,j} := ‖w‖
11:   endif
12:   if (h_{j+1,j} = 0) or (h_{j+1,j} < η ω) then set m := j and go to step 17, endif
13:   v_{j+1} := w/h_{j+1,j}
14:   Set V_{j+1} := [V_j, v_{j+1}]
15: endfor
16: Define the (m + 1) × m Hessenberg matrix H̄_m = (h_{ij}), 1 ≤ i ≤ m + 1, 1 ≤ j ≤ m
17: Compute y_m, the minimizer of ‖β e_1 − H̄_m y‖, and x_m := x + M⁻¹(V_m y_m)
18: if satisfied then Stop, else set x := x_m and go to step 1, endif

Fig. 1. Algorithmic formulation of the restarted GMRES algorithm for the solution of sparse linear systems.
3. The GMRES algorithm.
Consider the linear system

(3.1)    Ax = b,

where the coefficient matrix A ∈ R^{n×n} is sparse, with n_z nonzero entries; b ∈ R^n represents the right-hand side vector; and x ∈ R^n contains the sought-after solution (vector). Figure 1 displays a mathematical formulation of the restarted GMRES algorithm for the iterative solution of (3.1). There we assume that M ∈ R^{n×n} defines an appropriate preconditioner; x is an initial approximation to the actual solution; e_1 stands for the first column of the square identity matrix of order m + 1; and the scalars m and η respectively define the dimension of the orthogonal basis and the threshold for the re-orthogonalization. The orthogonalization mechanism in the algorithm relies on the classical Gram-Schmidt (CGS) method, but a version that employs the modified Gram-Schmidt (MGS) variant is simple to derive from it [13]. We prefer CGS over MGS as it allows for higher efficiency (using BLAS-2 routines), and provides comparable accuracy if enhanced with optional re-orthogonalization. The stopping criterion can be based, for example, on the residual ‖r_m‖ = ‖b − A x_m‖ being smaller than a certain relative threshold τ · ‖b‖. For convenience, the GMRES algorithm internally keeps track of the residual by iteratively updating the residual vector in every iteration.
However, rounding effects can cause the iterative residual to differ from the explicit residual, and every restart therefore explicitly computes the residual to re-align the iteratively-computed residual.

From the computational point of view, the main kernels appearing in the GMRES algorithm correspond to the application of the preconditioner M and the SpMV operation with the coefficient matrix A (both in Line 3); the orthogonalization of vector w with respect to the vectors in the basis V_j (Lines 5 and 8); the solution of the linear least squares (LLS) problem (Line 17); the assembly of the next iterate, which requires the application of the orthogonal basis followed by the preconditioner (Line 17); and a few minor vector operations such as axpys, vector scaling, etc. [19]. The solution of the LLS problem and the assembly of the iterate x_m (Line 17) also contribute a minor cost to the overall procedure, as they are m times less frequent in comparison with the kernel calls in Lines 3, 5, and 8.
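The restarted iteration of Figure 1 can be condensed into a few lines of code. The following is a minimal NumPy sketch for illustration only (not Ginkgo's implementation): the preconditioner is fixed to the identity, and the small LLS problem in step 17 is solved with a dense least squares routine.

```python
import numpy as np

def gmres_restarted(A, b, x0, m=30, eta=0.1, tol=1e-10, max_restarts=100):
    """Restarted GMRES with CGS orthogonalization and optional
    re-orthogonalization, following the structure of Figure 1
    (preconditioner fixed to M = I for brevity)."""
    n = b.shape[0]
    x = x0.astype(np.float64).copy()
    bnorm = np.linalg.norm(b)
    for _ in range(max_restarts):
        r = b - A @ x                       # explicit residual at every restart
        beta = np.linalg.norm(r)
        if beta <= tol * bnorm:             # stopping criterion ||r|| <= tau*||b||
            break
        V = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        V[:, 0] = r / beta
        k = 0
        for j in range(m):                  # Arnoldi loop (steps 2-15)
            w = A @ V[:, j]                 # step 3: SpMV (M = I)
            omega = np.linalg.norm(w)       # step 4
            h = V[:, :j + 1].T @ w          # step 5: CGS (first GeMV)
            w = w - V[:, :j + 1] @ h        #         (second GeMV)
            hnext = np.linalg.norm(w)       # step 6
            if hnext < eta * omega:         # steps 7-11: re-orthogonalize
                u = V[:, :j + 1].T @ w
                w = w - V[:, :j + 1] @ u
                h = h + u
                hnext = np.linalg.norm(w)
            H[:j + 1, j] = h
            H[j + 1, j] = hnext
            k = j + 1
            if hnext == 0.0:                # step 12: (lucky) breakdown
                break
            V[:, j + 1] = w / hnext         # steps 13-14: extend the basis
        e1 = np.zeros(k + 1)
        e1[0] = beta
        y, *_ = np.linalg.lstsq(H[:k + 1, :k], e1, rcond=None)  # step 17: LLS
        x = x + V[:, :k] @ y                # step 17: update the iterate
    return x
```

Note that the two GeMV products with V_j in step 5 are exactly the accesses to the orthogonal basis whose memory cost the compressed basis technique targets.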
4. CB-GMRES storing the orthonormal basis in reduced precision.
For simplicity, consider that the GMRES algorithm integrates a simple preconditioner, such as a Jacobi scheme (or a block-Jacobi variant with a small block size) [24]. The performance of the algorithm is then strongly determined by the costs of the SpMV kernel and the general matrix-vector products (GeMV) with V_j^T and V_j. These are memory-bound kernels, with their execution times largely dictated by the number of memory accesses (memory operations, or memops hereafter). The optimization we propose thus aims to reduce the cost of the GeMV operations by storing the orthogonal basis V_j in a more compact, reduced-precision format.

In order to analyze the theoretical memop count of the SpMV and the GeMV kernels, for simplicity, let us assume the following:
1. The right-hand side vectors for both types of matrix-vector products reside in cache. In general, this is not true but, for the following theoretical derivation, the memory layer where the vectors reside is not important.
2. The sparse coefficient matrix is stored in the compressed sparse row (CSR) format. This is a general and flexible data layout that employs one integer per nonzero value to represent its column index, plus n + 1 integers for the row pointers [24].
3. The re-orthogonalization mechanism included in the GMRES algorithm (Lines 7–11 in Figure 1) is not needed.
Then, the ratio between the contributions of the two GeMV and the SpMV to the memop count, due to the accesses to the corresponding matrices, is given by
(4.1)    Memops_GeMV / Memops_SpMV = 2nm′ / (n_z(1 + f) + (n + 1)f) ≈ 2nm′ / (ns(1 + f) + nf) = 2m′ / (s(1 + f) + f),

where s = n_z/n is the average number of nonzero entries per row of the sparse matrix; m′ denotes the number of vectors in the orthogonal basis at the current iteration; and f = 32/64 = 1/2 accounts for the (32-bit) integer indices of the CSR format occupying half the size of the (64-bit) DP values. The factor m′ steadily grows with the iteration count, which hints that the memops related to the orthogonalization can quickly come to dominate the cost. In practical implementations though, the GMRES solver is usually enhanced with a restart mechanism like in the formulation of the algorithm in Section 3, to keep both the memory requirements and the orthogonalization cost at reasonable levels. Depending on the problem size and the available resources, the typical values for the restart parameter vary between m = 30 and 200. At the same time, the nonzero-per-row ratio s is usually relatively small, and often significantly smaller than the restart parameter m. Therefore, assuming a restart parameter m and considering the memops in that restart cycle, equation (4.1) then becomes
(4.2)    Memops_GeMV / Memops_SpMV = (Σ_{j=1}^{m−1} 2nj) / (m(n_z(1 + f) + (n + 1)f)) ≈ m / (s(1 + f) + f).

With typical parameters f = 1/2 and m = 100, the memory access count due to the orthogonalization theoretically exceeds the memory access overhead of the SpMV kernel (i.e., the ratio is larger than 1) for matrices with ratios s = n_z/n below (m − f)/(1 + f) ≈ 66.
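The break-even point implied by the ratio (4.2) is easy to evaluate numerically. A small sketch (the function name is ours; f = 1/2 corresponds to 64-bit values with 32-bit indices):

```python
def gemv_to_spmv_memops(s, m, f=32 / 64):
    """Ratio (4.2) between the memory accesses of the two GeMV kernels
    (orthogonalization) and of the CSR SpMV over one restart cycle of m
    iterations, for a matrix with s nonzeros per row on average."""
    return m / (s * (1 + f) + f)

# With restart parameter m = 100, the orthogonalization dominates the
# memory traffic (ratio > 1) whenever s < (m - f) / (1 + f), about 66.3.
assert gemv_to_spmv_memops(s=34.8, m=100) > 1   # e.g., the af_* matrices
assert gemv_to_spmv_memops(s=82.3, m=100) < 1   # e.g., audikw_1
```

The two sample values of s correspond to rows of Table 1, showing that for most of the test matrices the orthogonalization is indeed the dominant source of memory traffic.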
CB-GMRES. In order to reduce the memory access volume in the orthogonalization step of GMRES, we propose to store the vectors of the orthogonal basis V_j in a compact reduced-precision format; retrieve the data from memory in that format; and transform the values back into IEEE DP on-the-fly, immediately before the arithmetic operations take place.
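Rounding a DP orthonormal basis to a compact format perturbs its orthogonality; a quick NumPy check illustrates the magnitude of the effect for the IEEE floating point formats (illustrative sketch only):

```python
import numpy as np

def orthogonality_loss(Q, fmt):
    """||Qc^T Qc - I|| after a round trip of the basis Q through the compact
    format fmt, with the product itself evaluated in double precision."""
    Qc = Q.astype(fmt).astype(np.float64)
    return np.linalg.norm(Qc.T @ Qc - np.eye(Q.shape[1]))

rng = np.random.default_rng(42)
# A DP orthonormal basis of 50 vectors of length 1000.
Q, _ = np.linalg.qr(rng.standard_normal((1000, 50)))

sp_loss = orthogonality_loss(Q, np.float32)  # rounding to SP: tiny loss
hp_loss = orthogonality_loss(Q, np.float16)  # rounding to HP: larger loss
```

The loss grows with the coarseness of the storage format, which is why the different compact formats affect the convergence rate differently while the arithmetic, performed in DP, remains unchanged.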
Discussion. Storing the orthogonal basis of a Krylov method in a reduced precision format will typically introduce rounding errors that may affect the numerical properties of the method, potentially impacting the convergence and numerical stability of the iterative solver. As the solution approximation is optimal in the generated Krylov subspace, perturbed Krylov search directions may result in a loss of the DP-orthogonality of the search directions and in a different (Krylov) subspace in which the solution approximation is computed. However, the solution approximation process accounts for the perturbed search directions and, as long as the generated subspace allows for a good approximation of the solution, this approximation will be found in the optimization process. Hence, as long as the search directions are "relatively" close to the optimal search directions, the convergence will only be mildly affected. In particular, we may assume that the need for additional search directions (equivalent to additional iterations) can be compensated by the faster execution of each iteration.

To close this section, we emphasize that:
– Our approach is orthogonal and complementary to other techniques which aim to reduce the memory access overhead, for example, by customizing the sparse matrix data layout to the application, operating with an iterative refinement scheme plus mixed precision for the coefficient matrix, or exploiting customized precision in the preconditioner, among others.
– The arithmetic precision is decoupled from the representation format, so that we can actually store the data for the orthogonal basis in any format while relying on the data types with hardware support for the arithmetic operations.

Fig. 2. The architecture of the Ginkgo library, separating the algorithmic core from the backends: the library core contains the architecture-agnostic algorithm implementations (iterative solvers, preconditioners, etc.); runtime polymorphism selects the right kernel depending on the target architecture; architecture-specific kernels (OpenMP, CUDA, HIP) execute the algorithm on the target architecture; and the sequential Reference kernels serve to check the correctness of the algorithm design and of the optimized kernels.
5. Implementation of CB-GMRES.

5.1. The Ginkgo sparse linear algebra library.
For convenience and ease of use, we have realized the CB-GMRES algorithm in the Ginkgo ecosystem. Ginkgo is a sparse linear algebra library implemented in modern C++ that embraces two principal design concepts [2]: The first principle, aiming at future technology readiness, is to consequently separate the numerical algorithms from the hardware-specific kernel implementations to ensure correctness (via comparison with sequential reference kernels), performance portability (by applying hardware-specific kernel optimizations), and extensibility (via kernel backends for other hardware architectures); see Figure 2. The second design principle – pursuing user-friendliness – is the convention to express functionality in terms of linear operators: every solver, preconditioner, factorization, matrix-vector product, and matrix reordering is expressed as a linear operator (or a composition thereof). This makes it easy to combine the CB-GMRES with any preconditioner available in Ginkgo, and to select the realization of the SpMV kernel that is most appropriate for the characteristics of a specific problem [3].

Ginkgo relies on an "executor" concept to favor platform portability. The executor specifies the memory location and execution domain of linear algebra objects and abstracts the computational capabilities of distinct devices. Each executor implements methods for allocating/deallocating memory on the device targeted by the executor, copying data between executors, providing hardware-specific kernels, running operations, and synchronizing all operations launched on the executor. The user can run a single code on different platforms (without having to modify the code) by selecting the proper executor at the beginning of the application. This encapsulates all information in the executor, which automatically orchestrates memory allocation, memory transfers, and kernel selection. For the CB-GMRES implementation with the orthonormal Krylov basis stored in reduced precision, the executor concept is extended with a "memory accessor", described next, that handles the data conversion transparently to the user.

Fig. 3. Accessor separating the memory format from the arithmetic format and realizing on-the-fly data conversion in each memory access.
At a high level, the idea of the CB-GMRES solver is to compress the orthogonal matrix/vector before and after the memory operations using one of the reduced/customized storage formats, but still use the working precision (i.e., DP) for the arithmetic operations. Retrieving the orthonormal basis in reduced precision from memory thus requires reading the basis contents and converting them into DP. When these values are stored in SP, the conversion is easy to perform via a datatype casting operator. For fixed point representations, though, the conversion requires some additional manipulations plus the scaling with a normalization factor.

To decouple the memory access and conversion from the code development effort, we use a memory accessor that converts the data between DP and the memory storage/communication format on-the-fly (see Figure 3). The efficient implementation of the accessor aims to hide the cost of these data conversions by overlapping them with the memory accesses, in principle introducing a minor or even negligible overhead. In addition, the introduction of this technique can accelerate the execution, as accessing the data in lower precision significantly reduces the memory access volume per iteration.

Considering the realization of the CB-GMRES algorithm, after the new basis vector v_j = v is formed at iteration j, the memory accessor is activated in order to compress the DP values of this vector before storing them into memory; see Lines 13 and 14 of the algorithm in Figure 1. The memory accessor is also active when retrieving the contents of the full orthogonal basis V_j from memory; see Lines 5 and 8 of the algorithm.

On the technical side, the accessor leverages static polymorphism (via C++ templates) for both the arithmetic precision (in our work, fixed to IEEE DP) and the memory format. While this flexible design can accommodate any memory format, we currently only support IEEE SP and HP as well as 32-bit and 16-bit fixed point formats as compact storage alternatives for the basis.
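For the fixed point formats, the accessor's read path amounts to an integer load, a cast, and a multiplication with the normalization factor. A minimal Python sketch of the round trip for a 16-bit fixed point format (the function names are illustrative, not Ginkgo's API):

```python
import numpy as np

def compress_fixed16(v):
    """Store a DP vector as int16 fixed point plus one DP scale per vector.

    The entries of an orthonormal basis vector are all bounded by 1 in
    magnitude, but a per-vector normalization factor guards the general case."""
    scale = np.max(np.abs(v)) / 32767.0       # normalization factor
    q = np.round(v / scale).astype(np.int16)  # 16-bit fixed point payload
    return q, scale

def decompress_fixed16(q, scale):
    """Accessor read path: int16 -> DP cast, then scaling (done on-the-fly)."""
    return q.astype(np.float64) * scale

v = np.random.default_rng(1).standard_normal(1000)
v /= np.linalg.norm(v)                        # a unit-norm basis vector
q, s = compress_fixed16(v)
err = np.linalg.norm(v - decompress_fixed16(q, s))
# err is on the order of the int16 quantization step for this vector.
```

In the actual solver this conversion happens element-wise inside the memory-bound kernels, so the extra integer arithmetic is overlapped with the (halved) memory traffic.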
6. Experimental Evaluation of the compressed basis GMRES.
In this section, we analyze several properties of the CB-GMRES algorithm in order to assess the benefits of this solver as part of production code. Concretely, we investigate the following questions: 1) Can we achieve high accuracy in the solution approximations? 2) How significant is the convergence delay introduced by moving away from the "full" precision Krylov search directions and utilizing instead search directions that are low precision approximations of these orthonormal vectors? 3) What are the performance advantages of the CB-GMRES over the standard (DP) GMRES? 4) Which specific storage format should we use for the memory operations?
To answer these questions, we select a set of 49 large-scale test matrices from the SuiteSparse Matrix Collection [1] that we adopt as benchmark problems to explore the accuracy, convergence, and performance of the CB realizations of the GMRES algorithm. The selected test matrices are regular, appropriate in size and nonzero count, and such that a DP GMRES needs at least 40 iterations to converge. The test matrices are listed along with some key properties in Table 1.

The CB-GMRES algorithm is implemented utilizing building blocks from the Ginkgo environment. The orthogonalization kernel is based on classical Gram-Schmidt with optional re-orthogonalization. All other functionality (SpMV kernels, preconditioners, utility functions, comparison solvers, etc.) is taken from the Ginkgo library. Unless otherwise stated, we enhance all the CB and DP GMRES algorithms with a simple light-weight scalar Jacobi preconditioner (diagonal scaling), as this generally improves convergence and provides a more realistic setting than a stand-alone GMRES algorithm. The SpMV kernel integrated in all variants of GMRES to generate the Krylov search directions is Ginkgo's CSR-based SpMV routine; this particular realization of SpMV maintains the coefficient matrix in Compressed Sparse Row (CSR) format, and automatically selects a CSR kernel that is optimal for a problem-specific sparsity pattern [4]. The DP GMRES code is identical to the CB-GMRES except for maintaining the orthogonal basis in the (DP) arithmetic precision.
Table 1
Test matrices
Matrix           Size        Non-zeros    Non-zeros per row
af_0_k101        503,625     17,550,675   34.8
af_1_k101        503,625     17,550,675   34.8
af_2_k101        503,625     17,550,675   34.8
af_3_k101        503,625     17,550,675   34.8
af_4_k101        503,625     17,550,675   34.8
af_5_k101        503,625     17,550,675   34.8
af_shell1        504,855     17,562,051   34.8
af_shell10       1,508,065   52,259,885   34.7
af_shell2        504,855     17,562,051   34.8
af_shell3        504,855     17,562,051   34.8
af_shell4        504,855     17,562,051   34.8
af_shell5        504,855     17,579,155   34.8
af_shell6        504,855     17,579,155   34.8
af_shell7        504,855     17,579,155   34.8
af_shell8        504,855     17,579,155   34.8
af_shell9        504,855     17,588,845   34.8
apache2          715,176     4,817,870    6.7
atmosmodd        1,270,432   8,814,880    6.9
atmosmodj        1,270,432   8,814,880    6.9
atmosmodl        1,489,752   10,319,760   6.9
atmosmodm        1,489,752   10,319,760   6.9
audikw_1         943,695     77,651,847   82.3
bone010          986,703     47,851,783   48.5
boneS10          914,898     40,878,708   44.7
Bump_2911        2,911,419   127,729,899  43.9
circuit5M_dc     3,523,317   14,865,409   4.2
Cube_Coup_dt6    2,164,760   124,406,070  57.5
CurlCurl_2       806,529     8,921,789    11.1
CurlCurl_3       1,219,574   13,544,618   11.1
CurlCurl_4       2,380,515   26,515,867   11.1
ecology1         1,000,000   4,996,000    5.0
ecology2         999,999     4,995,991    5.0
Fault_639        638,802     27,245,944   42.7
Flan_1565        1,564,794   114,165,372  73.0
G3_circuit       1,585,478   7,660,826    4.8
Geo_1438         1,437,960   60,236,322   41.9
Hook_1498        1,498,023   59,374,451   39.6
inline_1         503,712     36,816,170   73.1
ldoor            952,203     42,493,817   44.6
mc2depi          525,825     2,100,225    4.0
ML_Geer          1,504,002   110,686,677  73.6
parabolic_fem    525,825     3,674,625    7.0
Serena           1,391,349   64,131,971   46.1
ss               1,652,680   34,753,577   21.0
t2em             921,632     4,590,832    5.0
thermal2         1,228,045   8,580,313    7.0
tmt_sym          726,713     5,080,961    7.0
tmt_unsym        917,825     4,584,801    5.0
Transport        1,602,111   23,487,281   14.7
Fig. 4. Final normalized residual norms attained by the GMRES variants for the 49 test matrices.

Fig. 5. Convergence of GMRES with the orthogonal basis stored using 32-bit and 16-bit floating point and fixed point formats, for the Serena problem. (Similar behaviour was observed for other problems from the 49-case collection.)

While in this case all CB-GMRES variants achieve the same final accuracy, the storage format selected for the orthogonal basis impacts the convergence rate and, in consequence, the iteration count. In addition, the spikes in the residual curves identify the restart points that update the recurrence residual with an explicitly computed residual.
Even though we have now experimentally demonstrated that the CB-GMRES variants can compensate for the perturbations in the subspace via additional iterations (which is equivalent to extending the subspace by additional search directions), the resulting algorithms will only be useful in production if the associated iteration overhead is smaller than the runtime reduction coming from the decreased memory access volume. In the right-hand side plot in Figure 7 we show statistics on the performance improvements that CB-GMRES renders over DP GMRES when using different storage formats for the orthogonal basis. As could be expected, storing the orthogonal basis in the formats that incur the largest iteration overheads does not always pay off.
Table 2: Statistics for the GMRES and CB-GMRES variants.
Finally, we investigate how the CB-GMRES interacts with a more sophisticated preconditioner and with other mixed precision techniques. For this, we switch from a scalar Jacobi preconditioner to a block-Jacobi preconditioning scheme with block size 4, and report the performance advantages in Figure 10. As in the previous experiments, we fix the restart parameter to m = 100 and run the experiments with a right-hand side vector defined by b_i = sin(i), a starting guess x = 0, and the residual stopping criterion set to ‖Ax∗ − b‖ ≤ τ · ‖b‖. Compared with the results in Figure 8, we note a slight decrease in the speedups, which is expected as the addition of a more expensive preconditioner diminishes the performance benefits of compressing the basis.
Fig. 9. Speedup of the different CB-GMRES variants over the standard (DP) GMRES.

Additional benefits may come from combining CB-GMRES with other mixed precision techniques, for example from a compact storage format for the coefficient matrix in the SpMV kernel, or from the integration into a mixed precision iterative refinement framework.
7. Summary and Outlook.
We have introduced and evaluated a communication-reduction version of GMRES that maintains the orthogonal basis in a compressed (compact) form while performing all arithmetic in double precision. The combination of these two factors aims to reduce the traffic between memory and the processor arithmetic units while maintaining the accuracy of the search directions generated during the optimization process and extracting the performance from hardware-supported arithmetic. In turn, the memory storage provides (to a certain extent) enough flexibility to evaluate distinct 16-bit and 32-bit formats, including floating point and fixed point variants.

We have integrated a high-performance realization of the GMRES with compressed basis (CB-GMRES) into the Ginkgo sparse linear algebra library.
Fig. 10. Speedup of the CB-GMRES variants over the standard (DP) GMRES when both are combined with a block-Jacobi preconditioner.

Our approach is orthogonal to, and can be combined with, other optimization techniques targeting the preconditioner, the SpMV, or the GMRES algorithm itself. In future work we will investigate whether compression techniques that are traditionally used to compress large datasets can pose an alternative to the use of 32-bit and 16-bit fixed and floating point formats to store the compressed basis vectors.
Acknowledgments.
José I. Aliaga, Enrique S. Quintana-Ortí and Andrés E. Tomás were supported by the EU H2020 project 732631 "OPRECOMP. Open Transprecision Computing" and the MINECO (Spain) project TIN2017-82972-R. Hartwig Anzt and Thomas Grützmacher were supported by the "Impuls und Vernetzungsfond" of the Helmholtz Association under grant VH-NG-1241, and the US Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.