[PDF] Compressed Basis GMRES on High Performance GPUs

Abstract

Krylov methods provide a fast and highly parallel numerical tool for the iterative solution of many large-scale sparse linear systems. To a large extent, the performance of practical realizations of these methods is constrained by the communication bandwidth in all current computer architectures, motivating the recent investigation of sophisticated techniques to avoid, reduce, and/or hide the message-passing costs (in distributed platforms) and the memory accesses (in all architectures). This paper introduces a new communication-reduction strategy for the (Krylov) GMRES solver that advocates for decoupling the storage format (i.e., the data representation in memory) of the orthogonal basis from the arithmetic precision that is employed during the operations with that basis. Given that the execution time of the GMRES solver is largely determined by the memory access, the datatype transforms can be mostly hidden, resulting in the acceleration of the iterative step via a lower volume of bits being retrieved from memory. Together with the special properties of the orthonormal basis (whose elements are all bounded by 1), this paves the road toward the aggressive customization of the storage format, which includes some floating point as well as fixed point formats with little impact on the convergence of the iterative process. We develop a high performance implementation of the "compressed basis GMRES" solver in the Ginkgo sparse linear algebra library and using a large set of test problems from the SuiteSparse matrix collection we demonstrate robustness and performance advantages on a modern NVIDIA V100 GPU of up to 50% over the standard GMRES solver that stores all data in IEEE double precision.

Full PDF

CCOMPRESSED BASIS GMRES ON HIGH PERFORMANCE GPUS

JOS´E I. ALIAGA ∗ , HARTWIG ANZT † , THOMAS GR ¨UTZMACHER ‡ ,ENRIQUE S. QUINTANA-ORT´ı § , AND

ANDR´ES E. TOM ´AS ¶ Abstract.

Krylov methods provide a fast and highly parallel numerical tool for the iterativesolution of many large-scale sparse linear systems. To a large extent, the performance of practicalrealizations of these methods is constrained by the communication bandwidth in all current com-puter architectures, motivating the recent investigation of sophisticated techniques to avoid, reduce,and/or hide the message-passing costs (in distributed platforms) and the memory accesses (in allarchitectures).This paper introduces a new communication-reduction strategy for the (Krylov) GMRES solverthat advocates for decoupling the storage format (i.e., the data representation in memory) of theorthogonal basis from the arithmetic precision that is employed during the operations with thatbasis. Given that the execution time of the GMRES solver is largely determined by the memoryaccess, the datatype transforms can be mostly hidden, resulting in the acceleration of the iterativestep via a lower volume of bits being retrieved from memory. Together with the special propertiesof the orthonormal basis (whose elements are all bounded by 1), this paves the road toward theaggressive customization of the storage format, which includes some ﬂoating point as well as ﬁxedpoint formats with little impact on the convergence of the iterative process.We develop a high performance implementation of the “compressed basis GMRES” solver in theGinkgo sparse linear algebra library and using a large set of test problems from the SuiteSparsematrix collection we demonstrate robustness and performance advantages on a modern NVIDIAV100 GPU of up to 50% over the standard GMRES solver that stores all data in IEEE doubleprecision.

Key words.

Sparse linear systems, mixed precision, Krylov solvers, compressed basis GMRES,GPUs.

AMS subject classiﬁcations.

1. Introduction.

Krylov solvers enhanced with some type of sophisticated pre-conditioning technique nowadays compound a popular approach for the iterative so-lution of large and sparse linear systems [24]. In particular, preconditioned Krylovsolvers are often preferred over their direct counterparts for the solution of discretizedhigh-dimensional problems (e.g., 3D problems), where a factorization-based directsolver based would incur signiﬁcant ﬁll-in [11, 24]. Krylov solvers are also widely ap-pealing for massively-parallel architectures (e.g., graphics processing units, or GPUs)due to their superior scalability.At a high level, Krylov methods span a Krylov subspace by generating a sequenceof orthogonal (Krylov) search directions (starting with the normalized residual andcomputing each new search direction via the multiplication of the sparse coeﬃcientmatrix with the previous direction); the orthogonalization of the resulting searchdirection against previous search directions; and the optimization of the solutionapproximation in the extended Krylov subspace [24]. Each iteration step is usuallycomposed of a sparse matrix-vector product (

SpMV ); an orthogonalization routine;and several vector operations that compute the new search directions, update thesolution approximation, and estimate the norm of the residual [24]. ∗ Universitat Jaume I, Spain ([email protected]) † Karlsruhe Institute of Technology, Germany; and Innovative Computing Laboratory, Universityof Tennessee at Knoxville, USA ([email protected]). ‡ Karlsruhe Institute of Technology, Germany ([email protected]) § Universitat Polit`ecnica de Val`encia, Spain ([email protected]) ¶ Universitat Jaume I, Spain; and Universitat de Val`encia, Spain ([email protected])1 a r X i v : . [ c s . M S ] S e p J. I. ALIAGA ET AL.

The numerical operations (kernels) appearing in Krylov methods are well-suitedfor parallelization. Unfortunately, most of these kernels, including

SpMV , are me-mory-bound on virtually all modern processor architectures [18]. As a result, manygeneric as well as hardware-speciﬁc optimization eﬀorts for Krylov methods havefocused on avoiding, reducing, or hiding (i.e., overlapping with computation) thecommunication/memory accesses of the algorithm. Some optimization techniquestargeting the communication overhead include the following:– The design of specialized (i.e., application-speciﬁc) sparse matrix data layoutsthat restrict the indexing information (overhead) and/or improve data localitywhen accessing the contents of the sparse coeﬃcient matrix [24].– The reorganization of the operations inside the body of the Krylov solver thattrades oﬀ reduced communication for an increase of computation per itera-tion, possibly also at the cost of introducing numerical instabilities that mayresult in slower convergence of the iteration; see, e.g., [10] and the referencestherein.– The reformulation of the solver as an iterative reﬁnement scheme combinedwith the use of mixed precision for the storage of and arithmetic operationswith the sparse coeﬃcient matrix [17].– The utilization of adaptive-precision schemes for memory-bound precondi-tioners [5].In this paper we also address the communication costs of Krylov methods, fo-cusing on the Generalized Minimal residual (GMRES) algorithm, a Krylov solver forgeneral linear systems that explicitly maintains the complete set of Krylov search di-rections instead of relying on short recurrences (as many other Krylov solvers do) [24].Orthogonally to all previous communication optimization eﬀorts, our optimized vari-ant of the GMRES algorithm reduces communication in the access to the orthogonalbasis during the iteration loop body. In more detail, our GMRES algorithm decou-ples the memory storage format from the arithmetic precision, and stores the Krylovsearch directions in a compact “reduced precision” format. This radically diminishesthe memory access volume during the orthogonalization, while not aﬀecting the con-vergence rate of the solver, yielding notable performance improvements. Concretely,we make the following contributions in our paper:– We propose to decouple the memory storage format from the arithmetic pre-cision to maintain the Krylov basis in reduced precision in memory whileperforming all arithmetic operations using full, hardware-supported ieee ieee ieee ﬁxed point-based alternatives enhanced with vector-wise normalization.– We provide strong practical evidence of the advantage of our approach bydeveloping a high performance realization of the solver for modern NVIDIA’sV100 GPUs and testing it using a considerable number of large-scale problemsfrom the SuiteSparse matrix collection (https://sparse.tamu.edu/).– We integrate the mixed precision GMRES algorithm into in the Ginkgo sparse linear algebra library.– We combine our implementation with a high performance realization of anadaptive precision block-Jacobi preconditioner that adjusts the memory for- https://ginkgo-project.github.ioOMPRESSED BASIS GMRES ON HIGH PERFORMANCE GPUS compressed basis GMRES (CB-GMRES) storing the orthonormal basis in reduced precision. In Section 5 we providedetails about how we decouple the memory precision from the arithmetic precision,and how we realize the implementation of CB-GMRES in Ginkgo. The experimentalevaluation of the CB-GMRES implementation is presented in Section 6, assessingaccuracy, convergence, performance, and ﬂexibility of the developed algorithm. Weconclude in Section 7 with a summary of the ﬁndings and ideas for future research.

2. Related work.

The potential of using lower precision in diﬀerent componentsof a Krylov solver has been previously investigated for both Lanczos-based (short-termrecurrence) and Arnoldi-based (long-term recurrence) algorithms and the associatedmethods for solving linear systems of equations.From the theoretical point of view, most of those works are based on roundingtheory for Krylov solvers running in ﬁnite precision. Among the most relevant resultsare those by Paige [22], who derived distinct relations between the loss of orthogonalityand other important quantities in ﬁnite precision Lanczos. Greenbaum extended theseresults to prove backward stability for the CG method in ﬁnite precision [15]. She alsoderived theoretical bounds for the maximum attainable accuracy in ﬁnite precision forCG, BiCG, BiCGSTAB, and other Lanczos-based methods [16]. Carson [9] extendedthese results to s -step Lanczos/CG variants, deducing that an s -step Lanczos in ﬁniteprecision behaves like a classical Lanczos run in lower “eﬀective” precision, where this“eﬀective” precision depends on the conditioning of the polynomials used to generatethe s -step bases. Additional bounds for Lanczos-based Krylov solvers running in ﬁniteprecision can be found in [20].From these theoretical results on Krylov solvers running in ﬁnite precision, Si-moncini/Szyld [25] and Eshof/Sleijpen [27] developed “inexact Krylov subspace meth-ods” that apply the SpMV in lower precision to accelerate linear system solvers whenthis kernel dominates the cost of the computation. Theoretical results prove that in-exact Krylov methods can achieve the same solution accuracy as high precision Krylovsolvers, but little is known about the potential convergence delay.Concerning long-recurrence strategies, Gratton et al [14] combined the previousﬁndings from Bj¨orck [8] and Paige et al [22, 23] to derive theoretical norms for a mixedprecision GMRES algorithm based on modiﬁed Gram-Schmidt. In this algorithm,they consider using inexact (e.g., single precision) inner products in the orthogonal-ization process, which results in a loss of double precision (DP-)orthogonality of theKrylov search directions. This makes the work by Gratton et al [14] very similar toour approach. However, our approach is diﬀerent in several aspects: • We decouple the arithmetic precision from the memory storage format tomaintain the orthogonal search directions in lower precision while preservingfull precision in all computations; • We consider not only IEEE single precision as the reference compact storageformat, but also IEEE half precision (HP) and ﬁxed point formats based on32-bit and 16-bit integers; • We realize a production-ready and sustainable implementation for high per-formance GPU architectures including restarting and classical Gram-Schmidtwith reorthogonalization; and • We provide comprehensive experimental results analyzing accuracy, conver-

J. I. ALIAGA ET AL. Compute r := b − Ax , β := (cid:107) r (cid:107) , and v := r /β . Set V = [ v ] for j := 1 , , . . . , m Compute w := A ( M − v ) ω := (cid:107) w (cid:107) Orthogonalize h j,j := V Tj w , w := w − V j h j,j h j +1 ,j := (cid:107) w (cid:107) if ( h j +1 ,j < η ω ) then Re-orthogonalize u := V Tj w , w := w − V j u h j,j := h j,j + u h j +1 ,j := (cid:107) w (cid:107) endif if ( h j +1 ,j = 0) or ( h j +1 ,j < η ω ) then set m := j and go to step 17 , endif v := w/h j +1 ,j Set V j +1 := [ V j , v ] endfor Deﬁne the ( m + 1) × m Hessenberg matrix ¯ H m = ( h ij ) ≤ i ≤ m +1 , ≤ j ≤ m Compute y m the minimizer of (cid:107) βe − ¯ H m y (cid:107) and x m := x + M − ( V m y m ) if satisﬁed then Stop , else set x := x m and go to step 1 , endif Fig. 1 . Algorithmic formulation of the restarted GMRES algorithm for the solution of sparselinear systems. gence, and performance of our mixed precision GMRES solver.

3. The GMRES algorithm.

Consider the linear system(3.1) Ax = b, where the coeﬃcient matrix A ∈ R n × n is sparse, with n z nonzero entries; b ∈ R n represents the right-hand side vector; and x ∈ R n contains the sought-after solution(vector). Figure 1 displays a mathematical formulation of the restarted GMRESalgorithm for the iterative solution of (3.1). There we assume that M ∈ R n × n deﬁnesan appropriate preconditioner; x is an initial approximation to the actual solution; e stands for the ﬁrst column of the square identity matrix of order m +1; and the scalars m and η respectively deﬁne the dimension of the orthogonal basis and the threshold forthe re-orthogonalization. The orthogonalization mechanism in the algorithm relies onthe classical Gram-Schmidt (CGS) method, but a version that employs the modiﬁedGram-Schmidt (MGS) variant is simple to derive from that [13]. We prefer CGSover MGS as it allows for higher eﬃciency (using BLAS 2 routines), and providescomparable accuracy if enhanced with optional re-orthogonalization. The stoppingcriterion can be based, for example, on the residual (cid:107) r m (cid:107) = (cid:107) b − Ax m (cid:107) being smallerthan a certain relative threshold τ · (cid:107) b (cid:107) . For convenience, the GMRES algorithminternally keeps track of the residual by iteratively updating the residual vector inevery iteration. However, rounding eﬀects can cause the iterative residual to diﬀerfrom the explicit residual, and every restart therefore explicitly computes the residualto re-align the iteratively-computed residual.From the computational point of view, the main kernels appearing in the GMRESalgorithm correspond to the application of the preconditioner M and the SpMV operation with the coeﬃcient matrix A (both in Line 3); the orthogonalization ofvector w with respect to the vectors in the basis V j (Lines 5 and 8); the solution ofthe linear least squares (LLS) problem (Line 17); the assembly of the next iterate,which requires the application of the orthogonal basis followed by the preconditioner(Line 17); and a few minor vector operations such as axpy s, vector scaling, etc. [19]. OMPRESSED BASIS GMRES ON HIGH PERFORMANCE GPUS x m (Line 17) also con-tribute a minor cost to the overall procedure, as they are m times less frequent incomparison with the kernel calls in Lines 3, 5, and 8.

4. CB-GMRES storing the orthonormal basis in reduced precision.

Forsimplicity, consider that the GMRES algorithm integrates a simple preconditioner,such as a Jacobi scheme (or a block-Jacobi variant with a small block size) [24]. Theperformance of the algorithm is then strongly determined by the costs of the

SpMV kernel and the general matrix-vector products (

GeMV ), with V Tj and V j . These arememory-bound kernels, with their execution times largely dictated by the numberof memory accesses (memory operations, or memops hereafter). The optimizationwe propose thus aims to reduce the cost of the GeMV operations by storing theorthogonal basis V j in a more compact, reduced-precision format.In order to analyze the theoretical memop count of the SpMV and the

GeMV kernels, for simplicity, let us assume the following:1. The right-hand side vectors for both types of matrix-vector products reside incache. In general, this is not true but, for the following theoretical derivation,the memory layer where the vectors reside is not important.2. The sparse coeﬃcient matrix is stored in the compressed sparse row (CSR)format. This is a general and ﬂexible data layout that employs one integerper nonzero value to represent its column index, plus n + 1 integers for therow pointers [24].3. The re-orthogonalization mechanism included in the GMRES algorithm (Li-nes 7–11 in Figure 1) is not needed.Then, the ratio between the contributions of SpMV and the two

GeMV to the memopcount, due to the accesses to the corresponding matrices, is given by(4.1)

Memops

GeMV

Memops

SpMV = 2 nm (cid:48) n z (1 + f ) + ( n + 1) f ≈ nm (cid:48) ns (1 + f ) + nf = 2 m (cid:48) s (1 + f ) + f , where s = n z /n is the average number of nonzero entries per row of the sparse matrix; m (cid:48) = j − f > f = 32 /

64 = 1 / m (cid:48) steadilygrows with the iteration count ( m (cid:48) = j − j ), which hints that the mem-ops related to the orthogonalization can quickly to dominate the cost. In practicalimplementations though, the GMRES solver is usually enhanced with a restart mech-anism like in the formulation of the algorithm in Section 3, to keep both the memoryrequirements and the orthogonalization cost at reasonable levels. Depending on theproblem size and the available resources, the typical values for the restart parametervary between m = 30 and 200. At the same time, the nonzero-per-row ratio s isusually relatively small, and often signiﬁcantly smaller than the restart parameter m . Therefore, assuming a restart parameter m and considering the memops in that J. I. ALIAGA ET AL. restart cycle, equation (4.1) then becomes(4.2)

Memops

GeMV

Memops

SpMV = (cid:80) m − j =1 njm ( n z (1 + f ) + ( n + 1) f ) ≈ ms (1 + f ) + f . With typical parameters f = 1 / m = 100, the memory access count due tothe orthogonalization theoretically thus exceeds the memory access overhead for the SpMV kernel for matrices with ratios s = n z /n > CB-GMRES.

In order to reduce the memory access volume in the orthogonaliza-tion step of GMRES, we propose to store the vectors of the orthogonal basis V j ina compact reduced-precision format; retrieve the data from memory in that format;and transform the values into ieee ieee ieee Discussion.

Storing the orthogonal basis of a Krylov method in a reduced precisionformat will typically introduce rounding errors that may aﬀect the numerical prop-erties of the method, potentially impacting the convergence and numerical stabilityof the iterative solver. As the solution approximation is optimal in the generatedKrylov subspace, perturbed Krylov search directions may result a loss in the DP-orthogonality of the search directions and a diﬀerent (Krylov) subspace and in whichthe solution approximation is computed. However, the solution approximation processaccounts for the perturbed search directions, and as long as the generated subspaceallows for a good approximation of the solution, this approximation will be foundin the optimization process. Hence, as long as the search directions are “relatively”close to the optimal search directions, the convergence will only be mildly aﬀected. Inparticular, we may assume that the need for additional search directions (equivalentto additional iterations) can be compensated by the faster execution of each iteration.To close this section, we emphasize that:– Our approach is orthogonal and complementary to other techniques whichaim to reduce the memory access overhead, for example, by customizing thesparse matrix data layout to the application, operating with iterative reﬁne-

OMPRESSED BASIS GMRES ON HIGH PERFORMANCE GPUS Library InfrastructureAlgorithm Implementations • Iterative Solvers • Preconditioners • … Core

OpenMP-kernels • SpMV • Solver kernels • Precond kernels • … OpenMP

Reference kernels • SpMV • Solver kernels • Precond kernels • … Reference

CUDA-GPU kernels • SpMV • Solver kernels • Precond kernels • … CUDA

HIP-GPU kernels • SpMV • Solver kernels • Precond kernels • … HIP

Library core contains architecture-agnostic algorithm implementation;Runtime polymorphism selects the right kernel depending on the target architecture;Architecture-specific kernels execute the algorithm on target architecture;Reference are sequential kernels to check correctness of algorithm design and optimized kernels; Optimized architecture-specific kernels; • Shared kernels

Common

Fig. 2 . The architecture of the Ginkgo library separating the algorithmic core from the backends. ment scheme+mixed precision for the coeﬃcient matrix, or the exploitingcustomized precision in the preconditioner, among others.– The arithmetic precision is decoupled from the representation format so thatwe can actually store the data for the orthogonal basis in any format whilerelying on the data types with hardware support for the arithmetic operations.

5. Implementation of CB-GMRES.5.1. The Ginkgo sparse linear algebra library.

For convenience and ease ofuse, we have realized the CB-GMRES algorithm in the Ginkgo ecosystem. Ginkgo isa sparse linear algebra library implemented in modern C++ that embraces two prin-cipal design concepts [2]: The ﬁrst principle, aiming at future technology readiness, isto consequently separate the numerical algorithms from the hardware-speciﬁc kernelimplementation to ensure correctness (via comparison with sequential reference ker-nels), performance portability (by applying hardware-speciﬁc kernel optimizations),and extensibility (via kernel backends for other hardware architectures); see Figure 2.The second design principle – pursuing user-friendliness – is the convention to expressfunctionality in terms of linear operators: every solver, preconditioner, factorization,matrix-vector product, and matrix reordering is expressed as a linear operator (orcomposition thereof). This allows to easily combine the CB-GMRES with any pre-conditioner available in Ginkgo, and to select the realization of the

SpMV kernel thatis most appropriate for the characteristics of a speciﬁc problem [3].Ginkgo relies on an “executor” concept to favor platform portability. The ex-ecutor speciﬁes the memory location and execution domain of linear algebra objectsand abstracts the computational capabilities of distinct devices. Each executor im-plements methods for allocating/deallocating memory on the device targeted by theexecutor, copying data between executors, providing hardware-speciﬁc kernels, run-ning operations, and synchronizing all operations launched on the executor. Theuser can run a single code on diﬀerent platforms (without having to modify his/hercode) by selecting the proper executor at the beginning of the application. This en-capsulates all information in the executor, and automatically orchestrates memoryallocation, memory transfers, and kernel selection. For the CB-GMRES implemen-

J. I. ALIAGA ET AL.

Fig. 3 . Accessor separating the memory format from the arithmetic format and realizing on-the-ﬂy data conversion in each memory access. tation with the orthonormal Krylov basis stored in reduced precision, the executorconcept is extended with a “memory accessor”, described next, that handles the dataconversion transparently to the user.

At a high level, the idea of the CB-GMRES solver is tocompress the orthogonal matrix/vector before and after the memory operations usingone of the reduced/customized storage formats, but still use the working precision(i.e., DP) for the arithmetic operations. Retrieving the orthonormal basis in reducedprecision from memory thus requires reading the basis contents and converting theminto DP. When these values are stored in SP, the conversion is easy to perform viaa datatype casting operator. For ﬁxed point representations, though, the conversionrequires some additional manipulations plus the scaling with a normalization factor.To decouple the memory access and conversion from the code development eﬀort,we use a memory accessor that converts the data between DP and the memory stor-age/communication format on-the-ﬂy (see Figure 3). The eﬃcient implementation ofthe accessor aims to hide the cost of these data conversions by overlapping them withthe memory accesses, in principle introducing a minor or even negligible overhead.In addition, the introduction of this technique can accelerate the execution as access-ing the data in lower precision signiﬁcantly reduces the memory access volume periteration.Considering the realization of the CB-GMRES algorithm, after the new basisvector v j = v is formed at iteration j , the memory accessor is activated in order tocompress the DP values of this vector before storing them into memory; see Lines 13and 14 in the algorithm in Figure 1. The memory accessor is also active when re-trieving the contents of the full orthogonal basis V j from memory; see Lines 5 and 8of the algorithm.On the technical side, the accessor leverages static polymorphism (via C++ tem-plates) for both the arithmetic precision (in our work, ﬁxed to IEEE DP) and thememory format. While this ﬂexible design can accommodate any memory format, wecurrently only support , and (for IEEE DP, SP and OMPRESSED BASIS GMRES ON HIGH PERFORMANCE GPUS , (for 32-bit and 16-bit ﬁxed point formats) inGinkgo. The versions based on integer formats rely on a ﬁxed point representation inorder to maintain the orthonormal basis vectors. This representation only requires afractional part because the vectors are normalized, making each vector entry smallerthan 1. However, this is not eﬃcient for large vectors because the largest absolutevalue will likely be signiﬁcantly smaller than 1, therewith wasting representation range(and precision). To optimize the accuracy, a diﬀerent scaling factor is used for eachvector: σ j = (cid:107) v j (cid:107) ∞ / (cid:107) v j (cid:107) max intxx , where v j is the vector computed at iteration j before normalization, and max intxx is the maximum positive value of the integer representation using xx ∈ { , } bits.Both norms can be computed simultaneously so that the extra overhead due to thememory accesses to obtain the inﬁnity norm remains small. The vector v j is thenstored in V j +1 as V j +1 = [ V j , v j /σ j ] , and any subsequent access to the contents of V j +1 implies an intrinsic post-multi-plication by a diagonal matrix Σ j +1 = diag( σ , σ , . . . , σ j , σ j +1 ) that contains thescaling factors on the diagonal. This scaling adds one multiplication per element tothe computational cost of any operation involving the orthogonal basis, and storingthe scaling factor in memory. However, as the whole algorithm is heavily bandwidthbound, we expect the overhead remaining small.

6. Experimental Evaluation of the compressed basis GMRES.

In thissection, we analyze several properties of the CB-GMRES algorithm in order to assessthe beneﬁts of this solver as part of production code. Concretely, we investigate thefollowing questions: 1) Can we achieve high accuracy in the solution approximations?2) How signiﬁcant is the convergence delay introduced by moving away from the “full”precision Krylov search directions and utilizing instead search directions that are lowprecision approximations of these orthonormal vectors? 3) What are the performanceadvantages of the CB-GMRES over the standard (DP) GMRES? 4) Which speciﬁcstorage format we should use for the memory operations?

To answer these questions, we select a set of 49 large-scale testmatrices from the Suite Sparse Matrix Collection [1] that we adopt as benchmarkproblems to explore the accuracy, convergence, and performance of the CB realizationsof the GMRES algorithm. The selected test matrices are regular, appropriate in sizeand nonzero count, and a DP GMRES needs at least 40 iterations to converge. Thetest matrices are listed along with some key properties in Table 1.The CB-GMRES algorithm is implemented utilizing building blocks from theGinkgo environment. The orthogonalization kernel is based on classical Gram-Sch-midt with optional re-orthogonalization. All other functionality (

SpMV kernels, pre-conditioners, utility functions, comparison solvers, etc.) is taken from the Ginkgo li-brary. Unless otherwise stated, we enhance all the CB and DP GMRES algorithmswith a simple light-weight scalar Jacobi preconditioner (diagonal scaling) as this gen-erally improves convergence and provides a more realistic setting than a stand-aloneGMRES algorithm. The

SpMV kernel integrated in all variants of GMRES to gener-ate the Krylov search directions is Ginkgo’s CSR-based

SpMV routine; this particu-lar realization of

SpMV maintains the coeﬃcient matrix in Compressed Sparse Row(CSR) format, and automatically selects a csr kernel that is optimal for a problem-speciﬁc sparsity pattern [4]. The DP GMRES code is identical to the CB-GMRES0

J. I. ALIAGA ET AL.

Table 1

Test matrices

Matrix Size Non-zeros Non-zeros per row af 0 k101 503,625 17,550,675 34.8af 1 k101 503,625 17,550,675 34.8af 2 k101 503,625 17,550,675 34.8af 3 k101 503,625 17,550,675 34.8af 4 k101 503,625 17,550,675 34.8af 5 k101 503,625 17,550,675 34.8af shell1 504,855 17,562,051 34.8af shell10 1,508,065 52,259,885 34.7af shell2 504,855 17,562,051 34.8af shell3 504,855 17,562,051 34.8af shell4 504,855 17,562,051 34.8af shell5 504,855 17,579,155 34.8af shell6 504,855 17,579,155 34.8af shell7 504,855 17,579,155 34.8af shell8 504,855 17,579,155 34.8af shell9 504,855 17,588,845 34.8apache2 715,176 4,817,870 6.7atmosmodd 1,270,432 8,814,880 6.9atmosmodj 1,270,432 8,814,880 6.9atmosmodl 1,489,752 10,319,760 6.9atmosmodm 1,489,752 10,319,760 6.9audikw 1 943,695 77,651,847 82.3bone010 986,703 47,851,783 48.5boneS10 914,898 40,878,708 44.7Bump 2911 2,911,419 127,729,899 43.9circuit5M dc 3,523,317 14,865,409 4.2Cube Coup dt6 2,164,760 124,406,070 57.5CurlCurl 2 806,529 8,921,789 11.1CurlCurl 3 1,219,574 13,544,618 11.1CurlCurl 4 2,380,515 26,515,867 11.1ecology1 1,000,000 4,996,000 5.0ecology2 999,999 4,995,991 5.0Fault 639 638,802 27,245,944 42.7Flan 1565 1,564,794 114,165,372 73.0G3 circuit 1,585,478 7,660,826 4.8Geo 1438 1,437,960 60,236,322 41.9Hook 1498 1,498,023 59,374,451 39.6inline 1 503,712 36,816,170 73.1ldoor 952,203 42,493,817 44.6mc2depi 525,825 2,100,225 4.0ML Geer 1,504,002 110,686,677 73.6parabolic fem 525,825 3,674,625 7.0Serena 1,391,349 64,131,971 46.1ss 1,652,680 34,753,577 21.0t2em 921,632 4,590,832 5.0thermal2 1,228,045 8,580,313 7.0tmt sym 726,713 5,080,961 7.0tmt unsym 917,825 4,584,801 5.0Transport 1,602,111 23,487,281 14.7

OMPRESSED BASIS GMRES ON HIGH PERFORMANCE GPUS a f _0_ k f _1_ k f _2_ k f _3_ k f _4_ k f _5_ k f _ s he ll f _ s he ll f _ s he ll f _ s he ll f _ s he ll f _ s he ll f _ s he ll f _ s he ll f _ s he ll f _ s he ll c he2a t m o s m odda t m o s m od j a t m o s m od l a t m o s m od m aud i k w _1bone010bone S B u m p_2911 c i r c u i t M _d c C ube_ C oup_d t C u r l C u r l _2 C u r l C u r l _3 C u r l C u r l _4e c o l og y c o l og y F au l t _639 F l an_1565 G c i r c u i t G eo_1438 H oo k _1498 i n li ne_1 l doo r m c i M L_ G ee r pa r abo li c _ f e m S e r ena ss t m t he r m a l t m t _ sy m t m t _un sy m T r an s po r t Matrices -10 -9 -8 -7 -6 N o r m a li z ed r e s i dua l no r m Fig. 4 . Normalized residual of the distinct CB-GMRES versions. code with the orthogonal basis stored in DP as we did not detect any runtime overheadfrom using the memory accessor.In the performance tests, we utilize Ginkgo’s CUDA executor, which is heavily-optimized for NVIDIA GPUs. We run all experiments on an NVIDIA V100 GPUwith support for compute capability 7.0 [21]. The V100 accelerator board is equippedwith 16 GB of main memory, 128 KB L1 cache and 6MB of L2 cache. Bandwidthtests achieved 897 GB/s for main memory access in this particular device [26]. Thetheoretical peak performance for the V100 GPU is 7.83 DP TFLOPS (that is, 7 . · ﬂoating-point operations per second). Ginkgo’s CUDA backend was compiled usingCUDA version 9.2. We initially investigate whether CB-GMREScan match the accuracy levels attained by DP GMRES. For that purpose, we consider49 linear systems of the form Ax = b , with the coeﬃcient matrix deﬁned from thetest matrices in Table 1, and the components of the right-hand side vector set as b i = sin ( i ). The GMRES algorithm is started with an initial guess x = 0, uses a restartparameter m = 100, and is considered to converge when the solution approximation x ∗ yields a residual (cid:107) Ax ∗ − b (cid:107) ≤ − (cid:107) b (cid:107) . We believe this setting reﬂects real-worldproblems, and we use it for the rest of the evaluation.To avoid expensive explicit residual computations, the GMRES algorithm inter-nally updates a recurrence residual that is used to check convergence. However, whenusing ﬁnite precision and due to the accumulation of rounding error, this iteratively-computed residual can diverge from the real residual, and the GMRES algorithm maystop “too early” even though the real residual did not fall below the selected thresh-old. Using the compressed basis formats to store the orthonormal basis may enhancethis eﬀect. To tackle this problem, we modify all the implementations to compute theexplicit residual once convergence is indicated by the recurrence residual, but con-tinue iterating with the updated residual in case the actual accuracy threshold is notfulﬁlled.To assess the solution accuracy, in Figure 4 we report the normalized residual (cid:107) Ax ∗ − b (cid:107) / (cid:107) b (cid:107) for the solution approximations computed with the distinct CB-GMRES versions. In all ﬁgures and tables in this section, as well as in the followingdiscussions, and respectively identify diﬀerent realizations of CB-2 J. I. ALIAGA ET AL. -10 -8 -6 -4 -2 N o r m a li z ed r e s i dua l no r m 0 500 1000 1500 2000 2500Iteration number10 -10 -8 -6 -4 -2 N o r m a li z ed r e s i dua l no r m Fig. 5 . Convergence of the CB-GMRES variants for the circuit5M dc and

Serena problems.

GMRES with the orthogononal basis stored using xx -bit ﬂoating-point and ﬁxed-pointformats. The notation refers to the DP GMRES algorithm. In theseinitial results, we notice that the CB-GMRES variants fulﬁll the residual accuracyrequirement in most cases, but small diﬀerences in the residual norms may indicatevariations in the convergence rate. In Figure 5 we expose the convergencebehaviour of the CB-GMRES variants for the circuit5M dc and

Serena problems.(Similar behaviour was observed for other problems from the 49-case collection.)While in this case all CB-GMRES variants achieve the same ﬁnal accuracy, the storageformat selected for the orthogonal basis impacts the convergence rate and, in conse-quence, the iteration count. In addition, the spikes in the residual curves identifythe restart points that update the recurrence residual with an explicitly computedresidual. For

GMRES in particular, this results in signiﬁcant corrections of thenormalized residual. As expected, using a compressed format to store the orthog-onal basis can delay convergence and require additional search directions. In orderto quantify this eﬀect, in Figure 6 we display the iteration count of the CB-GMRESvariants relative to the DP GMRES iteration count. An iteration overhead of 100 inthat ﬁgure identiﬁes those storage formats for which CB-GMRES did not convergewithin the iteration limit.This experiment shows that the realizations

GMRES and

GMRES match the iteration count of DP GMRES in almost all cases, and only need a fewadditional iterations for a couple of problems. In contrast, when the orthogonal basisis stored using the 16-bit formats, the overhead often increases dramatically, andeven for matrices within the same ( af shell ) group, there is no clear winner betweenthe

GMRES and

GMRES . As expected, for those problems where

GMRES and

GMRES need additional iterations,

GMRES and

GMRES typically fail.In the left-hand side plot in Figure 7 and Table 2 (left-hand side), we report a fewkey statistics obtained from the experimental evaluation with the 49 test problems.While storing the vector entries in or incurs no iteration overhead,when using 16-bit storage we obtain a median iteration overhead of 4 × , with the 50%-quantiles varying between 2 × and 7 . × , and the 90%-quantiles reaching up to 12 × and 15 × for and , respectively. OMPRESSED BASIS GMRES ON HIGH PERFORMANCE GPUS a f _0_ k f _1_ k f _2_ k f _3_ k f _4_ k f _5_ k f _ s he ll f _ s he ll f _ s he ll f _ s he ll f _ s he ll f _ s he ll f _ s he ll f _ s he ll f _ s he ll f _ s he ll c he2a t m o s m odda t m o s m od j a t m o s m od l a t m o s m od m aud i k w _1bone010bone S B u m p_2911 c i r c u i t M _d c C ube_ C oup_d t C u r l C u r l _2 C u r l C u r l _3 C u r l C u r l _4e c o l og y c o l og y F au l t _639 F l an_1565 G c i r c u i t G eo_1438 H oo k _1498 i n li ne_1 l doo r m c i M L_ G ee r pa r abo li c _ f e m S e r ena ss t m t he r m a l t m t _ sy m t m t _un sy m T r an s po r t Matrices R e l a t i v e i t e r a t i on s c o m pa r ed t o < f l oa t > Fig. 6 . Iteration overhead of the CB-GMRES variants relative to the DP GMRES iterationcount for a residual threshold (cid:107) Ax ∗ − b (cid:107) ≤ − (cid:107) b (cid:107) . Precision010203040 R e l a t i v e i t e r a t i on s Precision00.511.52 S peedup Fig. 7 . Statistics obtained from running the CB-GMRES algorithms on the 49 test problems.Left: Iteration overhead (relative to DP GMRES); Right: speedup relative to DP GMRES.

Even though we now have experimentallydemonstrated that the CB-GMRES variants can compensate for the perturbations inthe subspace via additional iterations (which is equivalent to extending the subspaceby additional search directions), the resulting algorithms will only be useful in produc-tion if the associated iteration overhead is smaller than the runtime reduction comingfrom the decreased memory access volume. In the right-hand side plot in Figure 7 weshow statistics on the performance improvements that CB-GMRES renders over DPGMRES when using diﬀerent storage formats for the orthogonal basis. As could beexpected from the large iteration overheads, storing the orthogonal basis in or usually results in a slowdown of the global solution process. Conversely,storing the orthogonal basis in or yields attractive performanceimprovements, with slight advantages for the GMRES variant. The medianspeedup for

GMRES is 1.4 × , with the 50%-quantiles reaching up to 1.6 × andoutliers reaching up to 1.75 × . Here we note that GMRES shows an outlier witha 2.4 × speedup, which is likely related to faster convergence due to “lucky rounding.”In Figure 8 we provide a detailed performance evaluation by visualizing thespeedup for the distinct test problems. There we notice a very uniform picture con-4 J. I. ALIAGA ET AL.Solver arithmetic mean arithmetic median variance

GMRES

Table 2

Statistics for the

GMRES iteration count normalized to the

GMRES> implementation on the test matrices listed in Table 1. cerning the speedups for

GMRES and

GMRES , with

GMRES being slightly superior. This is likely due to the overhead of the scaling process andthe additional scaling factors needed when storing the basis vectors in

GMRES .From this experiment, we conclude that the

GMRES is an appropriate choicefor a wide range of problems.When motivating the use of a more compact storage format to maintain the or-thonormal vectors in Section 4, we argued that the memory savings against DP GM-RES grow with the size of the Krylov subspace; that is, the instances of CB-GMRESusing a larger restart parameter m should attain larger performance beneﬁts thantheir CB-GMRES counterparts adopting smaller restart values. In more detail, whenignoring numerical eﬀects, we can expect that the speedup asymptotically reaches theratio between the storage format complexities: 4 × when using GMRES or GMRES ; and 2 × when using GMRES or GMRES . In Figure 9 wequantify those speedups experimentally, considering restart parameters in the range10–300. We note that restart values beyond 200 are rarely employed as they intro-duce numerical instabilities and signiﬁcant memory- and computational overhead. Toavoid this issue, this experiment considers the runtimes needed to execute 10 restartcycles but ignores any numerical eﬀects. Also, even though we already identiﬁed the

GMRES as being superior in terms of convergence and performance, we in-clude all CB variants in this analysis. In Figure 9 we employ grey lines to indicate thespeedup behavior for the distinct matrices and use boxplots to illustrate the statis-tics for the CB-GMRES variants. The results indicate that the average speedups for

GMRES or GMRES asymptotically approach a value below 2 × , withthe speedups being constantly higher for the former (which requires no scaling). Thespeedup is smaller than 2 × because the cost savings are limited to those obtained fromthe compressed storage of the orthogonal basis, but not in other parts of the algorithmsuch as, for example, the SpMV kernel (see Amdahl’s law). For

GMRES or GMRES , the speedup values are larger, though below the 4 × theoretical bound.Again, the scaling process and memory overhead make the GMRES speedupsinferior to the

GMRES speedups.

Finally, we investigate how the CB-GMRES interacts with a moresophisticated preconditioner and with other mixed precision techniques. For this, weswitch from a scalar Jacobi preconditioner to a block-Jacobi preconditioning schemewith block-size 4, and report the performance advantages in Figure 10. As in the pre-vious experiments, we ﬁx the restart parameter to m = 100 and run the experimentswith a right-hand side vector deﬁned by b i = sin ( i ), an starting guess x = 0, andthe residual stopping criterion set to (cid:107) Ax ∗ − b (cid:107) ≤ − (cid:107) b (cid:107) . Compared with theresults in Figure 8, we note a slight decrease in the speedups, which is expected asthe addition of a more expensive preconditioner diminishes the performance beneﬁts OMPRESSED BASIS GMRES ON HIGH PERFORMANCE GPUS Speedup over af_0_k101af_1_k101af_2_k101af_3_k101af_4_k101af_5_k101af_shell1af_shell10af_shell2af_shell3af_shell4af_shell5af_shell6af_shell7af_shell8af_shell9apache2atmosmoddatmosmodjatmosmodlatmosmodmaudikw_1bone010boneS10Bump_2911circuit5M_dcCube_Coup_dt6CurlCurl_2CurlCurl_3CurlCurl_4ecology1ecology2Fault_639Flan_1565G3_circuitGeo_1438Hook_1498inline_1ldoormc2depiML_Geerparabolic_femSerenasst2emthermal2tmt_symtmt_unsymTransport M a t r i c e s Fig. 8 . speedup of the CB-GMRES variants over DP GMRES for a residual threshold (cid:107) Ax ∗ − b (cid:107) / ≤ − (cid:107) b (cid:107) . J. I. ALIAGA ET AL. S peedup o v e r < f l oa t > Mean of Mean of Mean of Mean of

Fig. 9 . speedup for diﬀerent CB-GMRES variants ( GMRES ) over DP GM-RES (

GMRES ) for increasing restart values. attained from storing the Krylov basis in a compressed format.We next move from a standard block-Jacobi preconditioner to an adaptive preci-sion block-Jacobi variant that stores the inverted diagonal blocks of the preconditionerin reduced precision if the numerical properties allow for it [6, 12]. We thus com-bine a CB-GMRES algorithm with a multi-precision preconditioner. In Figure 11,we report the speedups of CB-GMRES + adaptive precision block-Jacobi over DPGMRES + double precision block-Jacobi. The results in that ﬁgure provide exper-imental evidence that the performance advantages are accumulative, and the newCB-GMRES can be eﬃciently combined with an independent optimization strategytargeting the preconditioner. We can naturally expect similar behaviors when com-bining CB-GMRES with other communication reduction techniques which target the

SpMV kernel, or from the integration into a mixed precision iterative reﬁnementframework.

7. Summary and Outlook.

We have introduced and evaluated a communi-cation-reduction version of GMRES that maintains the orthogonal basis in a com-pressed (compact) form while performing all arithmetic in double precision. Thecombination of these two factors aims to reduce the traﬃc between memory andthe processor arithmetic units while maintaining the accuracy of the search direc-tions generated during the optimization process and extracting the performance fromhardware-supported arithmetic. In contrast, the memory storage provides (to a cer-tain extent) enough ﬂexibility to evaluate distinct 16-bit and 32-bit formats, includingﬂoating point and ﬁxed point variants.We have integrated a high-performance realization of the GMRES with com-

OMPRESSED BASIS GMRES ON HIGH PERFORMANCE GPUS Speedup over af_0_k101af_1_k101af_2_k101af_3_k101af_4_k101af_5_k101af_shell1af_shell10af_shell2af_shell3af_shell4af_shell5af_shell6af_shell7af_shell8af_shell9apache2atmosmoddatmosmodjatmosmodlatmosmodmaudikw_1bone010boneS10Bump_2911circuit5M_dcCube_Coup_dt6CurlCurl_2CurlCurl_3CurlCurl_4ecology1ecology2Fault_639Flan_1565G3_circuitGeo_1438Hook_1498inline_1ldoormc2depiML_Geerparabolic_femSerenasst2emthermal2tmt_symtmt_unsymTransport M a t r i c e s Fig. 10 . speedup for diﬀerent CB-GMRES variants ( GMRES ) over DP GM-RES (

GMRES ) for increasing restart values. J. I. ALIAGA ET AL. pressed orthogonal basis into the Ginkgo framework for sparse linear systems. Theperformance evaluation of this solver on a recent NVIDIA V100 GPU demonstratesthe practical advantages of the communication-reduction technique, which is alignedwith the acceleration that could be expected from Amdahl’s law. On the one hand,the speedups are more notable for the 32-bit ﬂoating point format, followed closelyby its 32-bit ﬁxed point counterpart. On the other hand, the 16-bit formats furtherreduce the communication volume, but they regularly fail to preserve the convergencecharacteristics of the GMRES solver. Overall, we believe that the proposed techniqueis useful as it tackles the memory wall problem that is present in current processors.Furthermore, its beneﬁts are orthogonal and, therefore accumulative, to those thatcan be attained with other communication-reduction techniques applied, for example,to the preconditioner, the realization of

SpMV , or the GMRES algorithm itself.In future work we will investigate whether compression techniques that are tradi-tionally used to compress large datasets can pose an alternative to the use of of 32-bitand 16-bit ﬁxed and ﬂoating point formats to store the compressed basis vectors.

Acknowledgments.

Jos´e I. Aliaga, Enrique S. Quintana-Ort´ı and Andr´es E.Tom´as were supported by the EU H2020 project 732631 “OPRECOMP. Open Trans-precision Computing” and the

MINECO (Spain) project TIN2017-82972-R. HartwigAnzt and Thomas Gr¨utzmacher were supported by the “Impuls und Vernetzungs-fond” of the Helmholtz Association under grant VH-NG-1241, and the US ExascaleComputing Project (17-SC-20-SC), a collaborative eﬀort of the U.S. Department ofEnergy Oﬃce of Science and the National Nuclear Security Administration.

REFERENCES[1]

Suite Sparse Matrix Collection . http://faculty.cse.tamu.edu/davis/suitesparse.html, April2020.[2]

H. Anzt, T. Cojean, G. Flegar, F. G¨obel, T. Gr¨utzmacher, P. Nayak, T. Ribizel, Y.-H.Tsai, and E. S. Quintana-Ort´ı , Ginkgo: A modern linear operator algebra frameworkfor high performance computing , 2020, https://arxiv.org/abs/2006.16852.[3]

H. Anzt, T. Cojean, C. Yen-Chen, J. Dongarra, G. Flegar, P. Nayak, S. Tomov, Y. M.Tsai, and W. Wang , Load-balancing sparse matrix vector product kernels on gpus , ACMTransactions on Parallel Computing (TOPC), 7 (2020), pp. 1–26.[4]

H. Anzt, T. Cojean, C. Yen-Chen, J. Dongarra, G. Flegar, P. Nayak, S. Tomov, Y. M.Tsai, and W. Wang , Load-balancing sparse matrix vector product kernels on gpus , ACMTrans. Parallel Comput., 7 (2020), https://doi.org/10.1145/3380930, https://doi.org/10.1145/3380930.[5]

H. Anzt, J. Dongarra, G. Flegar, N. J. Higham, and E. S. Quintana-Ort´ı , Adaptive preci-sion in block-Jacobi preconditioning for iterative sparse linear system solvers , Concurrencyand Computation: Practice and Experience, 31 (2019), p. e4460.[6]

H. Anzt, J. Dongarra, G. Flegar, N. J. Higham, and E. S. Quintana-Ort´ı , Adaptive preci-sion in block-jacobi preconditioning for iterative sparse linear system solvers , Concurrencyand Computation: Practice and Experience, 31 (2019), p. e4460.[7]

H. Anzt, G. Flegar, T. Gr¨utzmacher, and E. S. Quintana-Ort´ı , Toward a modu-lar precision ecosystem for high-performance computing , The International Journal ofHigh Performance Computing Applications, 33 (2019), pp. 1069–1078, https://doi.org/10.1177/1094342019846547, https://doi.org/10.1177/1094342019846547, https://arxiv.org/abs/https://doi.org/10.1177/1094342019846547.[8] ˚A. Bj¨orck , Solving linear least squares problems by Gram-Schmidt orthogonalization , BITNumerical Mathematics, 7 (1967), pp. 1–21.[9]

E. C. Carson , Communication-avoiding Krylov subspace methods in theory and practice , PhDthesis, University of California, Berkeley, 2015.[10]

S. Cools , Analyzing and improving maximal attainable accuracy in the communication hidingpipelined BiCGStab method , Parallel Computing, 86 (2019), pp. 16 – 35.[11]

T. Davies , Direct Methods for Sparse Linear Systems , Society for Industrial and AppliedOMPRESSED BASIS GMRES ON HIGH PERFORMANCE GPUS Mathematics, Philadelphia, PA, USA, 2006.[12]

G. Flegar, H. Anzt, T. Cojean, and E. S. Quintana-Ort´ı , Customized-precision block-jacobi preconditioning for krylov iterative solvers on data-parallel manycore processors ,ACM TOMS, (submitted).[13]

G. H. Golub and C. F. V. Loan , Matrix Computations , The Johns Hopkins University Press,Baltimore, 3rd ed., 1996.[14]

S. Gratton, E. Simon, D. Titley-Peloquin, and P. Toint , Exploiting variable precision inGMRES , SIAM J. Sci. Comput. (to appear), (2020).[15]

A. Greenbaum , Behavior of slightly perturbed Lanczos and conjugate-gradient recurrences ,Lin. Alg. Appl., 113 (1989), pp. 7–63.[16]

A. Greenbaum , Estimating the attainable accuracy of recursively computed residual methods ,SIAM J. Matrix Anal. Appl., 18 (1997), pp. 535–551.[17]

N. J. Higham , Accuracy and Stability of Numerical Algorithms , Society for Industrial andApplied Mathematics, Philadelphia, PA, USA, second ed., 2002.[18]

M. Horowitz , Computing’s energy problem (and what we can do about it) , in 2014 IEEEInternational Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Feb2014, pp. 10–14, https://doi.org/10.1109/ISSCC.2014.6757323.[19]

C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh , Basic linear algebra subpro-grams for Fortran usage , ACM Trans. Math. Softw., 5 (1979), pp. 308–323.[20]

G. Meurant and Z. Strako , The lanczos and conjugate gradient algorithms in ﬁniteprecision arithmetic , Acta Numerica, 15 (2006), p. 471542, https://doi.org/10.1017/S096249290626001X.[21]

NVIDIA Corp. , Whitepaper: NVIDIA TESLA V100 GPU ARCHITECTURE , 2017.[22]

C. C. Paige , Accuracy and eﬀectiveness of the Lanczos algorithm for the symmetric eigen-problem , Lin. Alg. Appl., 34 (1980), pp. 235–258.[23]

C. C. Paige, M. Rozloˇzn´ık, and Z. Strakoˇs , Modiﬁed gram-schmidt MGS, least squares, andbackward stability of MGS-GMRES , SIAM J. Matrix Anal. Appl., 28 (2006), pp. 264–284.[24]

Y. Saad , Iterative methods for sparse linear systems , Society for Industrial and Applied Math-ematics, Philadelphia, PA, USA, 2nd ed., 2003.[25]

V. Simoncini and D. B. Szyld , Theory of inexact Krylov subspace methods and applicationsto scientiﬁc computing , SIAM J. Sci. Comput., 25 (2003), pp. 454–477.[26]

Y. M. Tsai, T. Cojean, and H. Anzt , Sparse linear algebra on AMD andNVIDIA GPUs– the race is on , in High Performance Computing, P. Sadayappan, B. L. Chamberlain,G. Juckeland, and H. Ltaief, eds., Cham, 2020, Springer International Publishing, pp. 309–327.[27]

J. van den Eshof and G. L. Sleijpen , Inexact Krylov subspace methods for linear systems ,SIAM J. Matrix Anal. Appl., 26 (2004), pp. 125–153. J. I. ALIAGA ET AL.

Speedup over full precision af_0_k101af_1_k101af_2_k101af_3_k101af_4_k101af_5_k101af_shell1af_shell10af_shell2af_shell3af_shell4af_shell5af_shell6af_shell7af_shell8af_shell9apache2atmosmoddatmosmodjatmosmodlatmosmodmaudikw_1bone010boneS10Bump_2911circuit5M_dcCube_Coup_dt6CurlCurl_2CurlCurl_3CurlCurl_4ecology1ecology2Fault_639Flan_1565G3_circuitGeo_1438Hook_1498inline_1ldoormc2depiML_Geerparabolic_femSerenasst2emthermal2tmt_symtmt_unsymTransport M a t r i c e s Fig. 11 . speedup for diﬀerent CB-GMRES variants ( GMRES ) over DP GM-RES (