A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic
by the ECP Multiprecision Effort Team (Lead: Hartwig Anzt)

Ahmad Abdelfattah, Hartwig Anzt, Erik G. Boman, Erin Carson, Terry Cojean, Jack Dongarra, Mark Gates, Thomas Grützmacher, Nicholas J. Higham, Sherry Li, Neil Lindquist, Yang Liu, Jennifer Loe, Piotr Luszczek, Pratik Nayak, Sri Pranesh, Siva Rajamanickam, Tobias Ribizel, Barry Smith, Kasia Swirydowicz, Stephen Thomas, Stanimire Tomov, Yaohung M. Tsai, Ichitaro Yamazaki, Ulrike Meier Yang

University of Tennessee, Knoxville, USA; Karlsruhe Institute of Technology, Karlsruhe, Germany; Sandia National Lab, Albuquerque, USA; Charles University, Prague, Czech Republic; Oak Ridge National Lab, Oak Ridge, USA; University of Manchester, Manchester, UK; Lawrence Livermore National Lab, USA; Lawrence Berkeley National Lab, Berkeley, USA; Argonne National Lab, Argonne, USA; National Renewable Energy Lab, Boulder, USA
Technical report from the Exascale Computing Project.

1 Introduction
Within the past years, hardware vendors have started designing low precision special function units in response to the demand of the machine learning community for high compute power in low precision formats. The server-line products are also increasingly featuring low-precision special function units, such as the NVIDIA tensor cores in ORNL's Summit supercomputer, providing more than an order of magnitude higher performance than what is available in IEEE double precision. At the same time, the gap between compute power on the one hand and memory bandwidth on the other keeps increasing, making data access and communication prohibitively expensive compared to arithmetic operations. Having the choice between ignoring the hardware trends and continuing the traditional path, and adjusting the software stack to the changing hardware designs, the US Exascale Computing Project decided to take the aggressive step of building a multiprecision focus effort to take on the challenge of designing and engineering novel algorithms exploiting the compute power available in low precision and adjusting the communication format to the application-specific needs. To start the multiprecision focus effort, we surveyed the numerical linear algebra community and summarize all existing multiprecision knowledge, expertise, and software capabilities in this landscape analysis report. We also include current efforts and preliminary results that may not yet be considered "mature technology," but have the potential to grow into production quality within the multiprecision focus effort. As we expect the reader to be familiar with the basics of numerical linear algebra, we refrain from providing a detailed background on the algorithms themselves, but focus on how mixed- and multiprecision technology can help improve the performance of these methods, and present highlights of applications significantly outperforming the traditional fixed precision methods.
2 Dense Linear Algebra
The revolution of machine learning applications and artificial intelligence (AI) spiked an interest in developing high-performance half-precision arithmetic (the 16-bit floating-point format), since most AI applications do not necessarily require the accuracy of single or double precision [1]. Half precision also enables machine learning applications to run faster, not only because of the faster arithmetic, but also because of the reduction in memory storage and traffic by a factor of 2× against single precision, and by a factor of 4× against double precision.

In terms of vendor support, NVIDIA, Google, and AMD manufacture hardware that is capable of performing floating point arithmetic using 16-bit formats. Google's Tensor Processing Units (TPUs) are customized chips that are mainly designed for machine learning workloads using the bfloat16 format. AMD also provides half-precision capabilities, and their software stack shows support for both the bfloat16 format and the IEEE binary16 format [2]. The theoretical performance of half precision on AMD GPUs follows the natural 2×/4× speedups against single/double precision, respectively. As an example, the MI50 GPU has a theoretical FP16 performance of 26.5 Tflop/s, against 13.3 Tflop/s for FP32 and 6.6 Tflop/s for FP64. But perhaps the most widely accessible hardware with half-precision capability are NVIDIA GPUs, which have provided half-precision arithmetic since the Pascal architecture. Throughout this section, we will focus on NVIDIA GPUs and their math libraries to highlight half-precision developments for numerical kernels.

NVIDIA GPUs implement the "binary16" format which is defined by the IEEE 754 standard [2]. While the Pascal GPU architecture introduced hardware support for FP16 arithmetic, the Volta architecture, which powers the Summit supercomputer, comes with hardware acceleration units (called Tensor Cores) for matrix multiplication in FP16. These Tensor Cores are theoretically about 6× faster than the theoretical FP16 peak performance of the preceding architecture. Applications taking advantage of the Tensor Cores can run up to 4× faster than using the regular FP16 arithmetic on the same GPU. The Tensor Cores are also able to perform a mixed-precision multiplication, with a low precision input (e.g., half precision) and a higher precision output (typically single precision). The Tensor Core units are discussed in more detail in Section 2.1.1.

In terms of half-precision Basic Linear Algebra Subroutines (BLAS), most of the available routines consider only dense matrix multiplications (GEMMs). From the perspective of machine learning applications, most of the performance-critical components in training/inference can be reformulated to take advantage of the GEMM kernel. As for dense linear algebra, many high level algorithms are built to extract their high performance from GEMM calls. Therefore, accelerating such performance-critical steps through FP16 GEMM (HGEMM) propagates the performance advantage to the entire algorithm, while keeping other numerical stages in their original precision(s). An example of this practice is the mixed precision dense LU factorization [3], which is used to accelerate the solution of Ax = b in double precision, see Section 2.2. The CUDA Toolkit is one of the first programming models to provide half-precision (i.e., FP16) arithmetic. Early support was added in late 2015 for selected embedded GPU models that are based on the Maxwell architecture. FP16 arithmetic has become mainstream in CUDA-enabled GPUs since the Pascal architecture.
In general, half precision has a dynamic range that is significantly smaller than that of single or double precision.

The Volta and Turing architectures introduce hardware acceleration for matrix multiplication in FP16. The hardware acceleration units are called Tensor Cores. They can deliver a theoretical peak performance that is up to 8× faster than the peak FP32 performance. As an example, each Volta V100 GPU has 640 Tensor Cores, evenly distributed across 80 multiprocessors. Each Tensor Core possesses a mixed-precision 4×4×4 matrix processing array which performs the operation D = A × B + C, where A, B, C, and D are 4×4 matrices. The inputs A and B must be represented in FP16 format, while C and D can be represented in FP16 or in FP32 formats. It is also possible that C and D point to the same matrix. cublasHgemm implements the GEMM operation for real FP16 arithmetic.

Apart from the vendor library, taking advantage of the Tensor Cores in a custom kernel is also possible through the use of low-level APIs that are provided by the programming model. As shown in Figure 1, Tensor Cores deal with input and output data through opaque data structures called fragments. Each fragment is used to store one matrix. Fragments can be loaded from shared memory or from global memory using the load_matrix_sync() API. A similar API is available for storing the contents of an output fragment into the shared/global memory of the GPU. The mma_sync()
API is used to perform the multiplication. The user is responsible for declaring the fragments as required, and for calling the APIs in the correct sequence.

Figure 1: Programmability of the Tensor Core units.

The programming model imposes some restrictions on the programming of the Tensor Cores. First, the GEMM dimensions (M, N, K), which also control the size of the fragments, are limited to three discrete combinations, namely (16, 16, 16), (32, 8, 16), and (8, 32, 16). Second, the load, store, and multiply operations must be performed by one full warp (32 threads). Finally, the load/store APIs require that the leading dimension of the corresponding matrix be a multiple of 16 bytes. As an example, a standard GEMM operation of size (16, 16, 16) requires three load_matrix_sync() calls (for A, B, and C), one mma_sync() call, and then a final store_matrix_sync() call to write the result. The latest CUDA version to date (10.1) provides direct access to the Tensor Cores through an instruction called mma.sync. The instruction allows one warp to perform four independent GEMM operations of size (8, 8, 4). However, using the explicit instruction may lead to long-term compatibility issues for open-source libraries as new architectures are released.

The cuBLAS library provides several routines that take advantage of the reduced FP16 precision. Figure 2 shows the performance of three different HGEMM kernels. An HGEMM kernel with half-precision output achieves roughly 2× the single-precision performance if the tensor cores are turned off, while a significantly higher performance can be achieved if the tensor cores are turned on. As the figure shows, the tensor cores are capable of delivering an asymptotic performance that is several times the asymptotic performance of a non-accelerated HGEMM.

However, perhaps the most interesting performance graph of Figure 2 is the HGEMM with FP32 output. The reason is that its performance is close to the accelerated HGEMM kernel, but with much more precision on the output. This is of particular importance for mixed-precision algorithms [3, 4].

Figure 2: Performance of different HGEMM kernels from the cuBLAS library on square sizes (HGEMM with FP16 output, tensor cores on; HGEMM with FP32 output, tensor cores on; HGEMM with FP16 output, tensor cores off). Results are shown on a Tesla V100 GPU using CUDA-9.1.

To put this fact in more perspective, Figure 3 shows the forward error of the three different HGEMM kernels with respect to the single-precision GEMM kernel from the Intel MKL library. The forward error is computed as $\|R_{\mathrm{cuBLAS}} - R_{\mathrm{MKL}}\|_F / (\sqrt{k+2}\,|\alpha|\,\|A\|_F \|B\|_F + 2|\beta|\,\|C\|_F)$, where k is equal to the matrix size. The first surprising observation is that an HGEMM operation with FP16 output is more accurate if the tensor cores are turned on, which means that the utilization of the tensor core units achieves both better performance and higher accuracy. The second observation is that performing HGEMM with FP32 output achieves at least two more digits of accuracy when compared with the other two HGEMM variants. Given that HGEMM with FP32 output runs mostly within a small percentage of the peak tensor core throughput, it is clearly the best option for mixed-precision algorithms that target achieving higher accuracy while taking advantage of the half precision.
Figure 3: Forward error of HGEMM with respect to MKL SGEMM (C = αAB + βC). Results are shown for square sizes using cuBLAS 9.1 and MKL 2018.1. The forward error is computed as $\|R_{\mathrm{cuBLAS}} - R_{\mathrm{MKL}}\|_F / (\sqrt{k+2}\,|\alpha|\,\|A\|_F \|B\|_F + 2|\beta|\,\|C\|_F)$, where k is equal to the matrix size.

Apart from the vendor-supplied BLAS, few efforts have focused on building open-source BLAS routines that utilize the tensor cores of NVIDIA GPUs. An example of such efforts is in the MAGMA library [5], which has a batch HGEMM kernel that makes use of the tensor cores [6]. The kernel builds an abstraction layer over the tensor cores to overcome their size restrictions, so that arbitrary blocking sizes can be used by the kernel. The batch HGEMM kernel in MAGMA outperforms cuBLAS for relatively small sizes, as shown in Figure 4. The same work also shows that extremely small matrices (below a certain small size) may not necessarily benefit from tensor core acceleration.
Figure 4: Performance of the batch HGEMM kernel on square sizes. Results are shown on a Tesla V100 GPU using CUDA-9.1.
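To make the fragment-based WMMA programming model described above concrete, below is a minimal CUDA sketch of a single-warp kernel that computes one 16×16 output tile with FP16 inputs and FP32 accumulation. The kernel name and the single-tile setting are ours for illustration; it assumes column-major inputs whose leading dimensions satisfy the 16-byte alignment restriction.

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes C(16x16) = A(16xK) * B(Kx16), all column-major,
    // with FP16 inputs and an FP32 accumulator (requires sm_70 or newer).
    __global__ void wmma_gemm_16(const half *A, const half *B, float *C, int K) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
        wmma::fill_fragment(c_frag, 0.0f);

        for (int k = 0; k < K; k += 16) {
            wmma::load_matrix_sync(a_frag, A + k * 16, 16); // A tile, lda = 16
            wmma::load_matrix_sync(b_frag, B + k, K);       // B tile, ldb = K
            wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // D = A*B + C
        }
        wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_col_major);
    }

Launching wmma_gemm_16<<<1, 32>>>(dA, dB, dC, K) computes a single tile with one full warp, as required by the programming model; a production kernel would tile a larger GEMM across many warps.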
On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. There are two reasons for this. First, 32-bit floating-point arithmetic is usually twice as fast as 64-bit floating-point arithmetic on most modern processors. Second, the number of bytes moved through the memory system is halved.

Mixed precision algorithms stem from the observation that, in many cases, a single precision solution of a problem can be refined to the point where double precision accuracy is achieved. The refinement can be accomplished, for instance, by means of Newton's method (see Equation (1)), which computes the zero of a function f(x) according to the iterative formula
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}. \quad (1)$$
In general, we would compute a starting point and f(x) in single precision arithmetic, and the refinement process would be computed in double precision arithmetic. If the refinement process is cheaper than the initial computation of the solution, then double precision accuracy can be achieved at nearly the same speed as single precision accuracy.

A common approach to the solution of linear systems, either dense or sparse, is to perform an LU factorization of the coefficient matrix using Gaussian elimination. First, the coefficient matrix A is factored into the product of a lower triangular matrix L and an upper triangular matrix U. Partial row pivoting is in general used to improve numerical stability, resulting in a factorization PA = LU, where P is a permutation matrix. The solution of the system is achieved by first solving Ly = Pb (forward substitution) and then solving Ux = y (backward substitution). Due to round-off errors, the computed solution x carries a numerical error magnified by the condition number of the coefficient matrix A.

In order to improve the computed solution, we can apply an iterative process which produces a correction to the computed solution at each iteration, which yields the method commonly known as the iterative refinement (IR) algorithm. As Demmel points out [7], the nonlinearity of the round-off errors makes the iterative refinement process equivalent to Newton's method applied to the function f(x) = b − Ax. Provided that the system is not too ill-conditioned, the algorithm produces a solution correct to the working precision. Iterative refinement in double/double precision is a fairly well understood concept and was analyzed by Wilkinson [8], Moler [9], and Stewart [10].

Iterative refinement is a long-standing method that was programmed by Wilkinson in the 1940s for the ACE digital computer. The idea is to improve the computed solution of a linear system by iteratively solving a correction equation and adding the correction to the current solution; for a comprehensive treatment, see Higham [11, Chap. 12]. The algorithm can be modified to use a mixed precision approach. The three tasks (original factorization and solve, residual computation, and correction-equation solve) can be done in the same precision (fixed precision) or in different precisions (mixed precision). The original usage was mixed precision, with the residual computed at twice the working precision.

For the current work, the factorization PA = LU and the solution of the triangular systems Ly = Pb and
Ux = y are computed using single precision arithmetic. The residual calculation and the update of the solution are computed using double precision arithmetic and the original double precision coefficients (see Algorithm 1). The most computationally expensive operation, the factorization of the coefficient matrix A, is performed using single precision arithmetic and takes advantage of its higher speed. The only operations that must be executed in double precision are the residual calculation and the update of the solution (they are denoted with ε_d in Algorithm 1).
Algorithm 1 Mixed precision iterative refinement for direct solvers
  LU ← PA                         ▷ (ε_s)
  Solve Ly = Pb                   ▷ (ε_s)
  Solve Ux_0 = y                  ▷ (ε_s)
  for k = 1, 2, . . . do
    r_k ← b − Ax_{k−1}            ▷ (ε_d)
    Solve Ly = Pr_k               ▷ (ε_s)
    Solve Uz_k = y                ▷ (ε_s)
    x_k ← x_{k−1} + z_k           ▷ (ε_d)
    Check convergence
  end for
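As a concrete illustration, the following is a C++ sketch of Algorithm 1 using the LAPACKE interface, with FP32 playing the role of ε_s and FP64 that of ε_d; the function name, the fixed iteration cap, and the crude stopping test are ours.

    #include <lapacke.h>
    #include <cmath>
    #include <vector>

    // Mixed precision iterative refinement (Algorithm 1), column-major storage.
    void mp_ir(int n, const std::vector<double>& A,
               const std::vector<double>& b, std::vector<double>& x) {
        // Round A to FP32 and factorize PA = LU in single precision.
        std::vector<float> As(A.begin(), A.end());
        std::vector<lapack_int> ipiv(n);
        LAPACKE_sgetrf(LAPACK_COL_MAJOR, n, n, As.data(), n, ipiv.data());

        // Initial solve in FP32, then promote the iterate to FP64.
        std::vector<float> bs(b.begin(), b.end());
        LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As.data(), n,
                       ipiv.data(), bs.data(), n);
        x.assign(bs.begin(), bs.end());

        for (int k = 0; k < 50; ++k) {
            // Residual r = b - A x in FP64 (the O(n^2) part).
            std::vector<double> r(b);
            for (int j = 0; j < n; ++j)
                for (int i = 0; i < n; ++i)
                    r[i] -= A[i + j * (size_t)n] * x[j];

            // Correction solve with the FP32 factors; update in FP64.
            std::vector<float> rs(r.begin(), r.end());
            LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As.data(), n,
                           ipiv.data(), rs.data(), n);
            double rnorm = 0.0;
            for (int i = 0; i < n; ++i) {
                x[i] += rs[i];
                rnorm = std::fmax(rnorm, std::fabs(r[i]));
            }
            if (rnorm < 1e-14) break; // crude absolute test, for illustration only
        }
    }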
We observe that the only operation with computational complexity of O(n³) is handled in single precision, while all operations performed in double precision are of at most O(n²) complexity. The coefficient matrix A is converted to single precision for the LU factorization and the resulting factors are stored in single precision, while the initial coefficient matrix A needs to be kept in memory. Therefore, one drawback of this approach is that it uses 50% more memory than the standard double precision algorithm.

The method in Algorithm 1 can offer significant improvements for the solution of a sparse linear system in many cases if:
1. single precision computation is significantly faster than double precision computation;
2. the iterative refinement procedure converges in a small number of steps;
3. the cost of each iteration is small compared to the cost of the system factorization. If the cost of each iteration is too high, then a low number of iterations will result in a performance loss with respect to the full double precision solver. In the sparse case, for a fixed matrix size, both the cost of the system factorization and the cost of the iterative refinement step may vary substantially depending on the number of nonzeros and the matrix sparsity structure; this will be addressed in Section 4.2. In the dense case, results are more predictable.

Note that the choice of the stopping criterion in the iterative refinement process is critical. Formulas for the errors computed at each step of Algorithm 1 can be found, for instance, in [12, 13].

Recently, Carson and Higham [14] analyzed a three-precision iterative refinement scheme (factorization precision, working precision, residual precision) and concluded that if the condition number of A is not too large, namely $\kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty < 10^4$, then using FP16 for the O(n³) portion (the LU factorization) and (FP32, FP64) or (FP64, FP128) as the (working, residual) precision for the O(n²) portion (the refinement loop), one can expect to achieve forward and backward errors at the level of the respective working precision. We note that, if x̂ is an approximate solution of Ax = b, the forward error is defined by $\|\hat{x} - x\|_\infty / \|x\|_\infty$ and the backward error is defined by $\min\{\epsilon : (A + \Delta A)\hat{x} = b,\ \|\Delta A\| \le \epsilon\|A\|\}$, which can be evaluated as $\|r\| / (\|A\|\,\|\hat{x}\|)$, where $r = b - A\hat{x}$.

Carson and Higham [15] proposed the use of the Generalized Minimal Residual method (GMRES) [16] preconditioned by the FP16 LU factors as the solver in the correction equation and showed that in this case the constraint on the condition number can be relaxed considerably; we refer to [17, Table 2.2] for the limiting condition numbers and the corresponding forward and backward errors. Analysis covering this GMRES-based approach when two precisions are used, with the residual precision equal to the working precision, is given in [17]. The idea is that the GMRES solver will provide a better and more stable approximate solution to $Az_k = r_k$ than the basic triangular solve, which is directly affected by the quality of the low-precision LU factors.
Using GMRES, we can still guarantee that the solution of the correction equation $Az_k = r_k$ has some correct digits and a residual at the level of the convergence tolerance requested by the algorithm. The convergence tolerance of the refinement process is chosen to be of the order of the unit roundoff of the low-precision arithmetic used during the factorization (e.g., 10⁻⁴ or 10⁻⁸ when the LU factorization is in FP16 or FP32, respectively). This variant is called GMRES-based iterative refinement (GMRES-IR) by Carson and Higham, and it is described in Algorithm 2. Note that U⁻¹ and L⁻¹ are never explicitly formed; instead, the matrix–vector products $U^{-1}L^{-1}Ay$ needed by GMRES are computed by a multiplication by A followed by two triangular solves. Since this paper focuses on the practical usage and possible performance gains rather than error analysis, we point the reader to [11, 15, 17] for detailed error analysis of the IR and GMRES-IR techniques.
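For example, the preconditioned operator can be applied to a vector with one matrix–vector product followed by two triangular solves, along the following lines (a CBLAS sketch with the LU factors promoted to the working precision and pivoting omitted for brevity; the function name is ours):

    #include <cblas.h>

    // y <- U^{-1} L^{-1} A v, without ever forming U^{-1} or L^{-1}.
    // LU holds the combined factors, LAPACK getrf style, column-major.
    void apply_precond_op(int n, const double* A, const double* LU,
                          const double* v, double* y) {
        // y = A * v (one matrix-vector product)
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, 1.0, A, n, v, 1, 0.0, y, 1);
        // y = L^{-1} y (unit lower triangular solve)
        cblas_dtrsv(CblasColMajor, CblasLower, CblasNoTrans, CblasUnit,
                    n, LU, n, y, 1);
        // y = U^{-1} y (upper triangular solve)
        cblas_dtrsv(CblasColMajor, CblasUpper, CblasNoTrans, CblasNonUnit,
                    n, LU, n, y, 1);
    }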
Algorithm 2 Iterative refinement using GMRES (GMRES-IR)
  Convert A to A_f, rounding from precision u_w to u_f
  Perform the LU factorization of A_f in precision u_f
  Find the initial solution x_0 using the computed LU factorization of A_f in precision u_f, then cast x_0 to precision u_w
  repeat                                                         ▷ Refinement loop
    Compute the residual r_k = b − Ax_k in precision u_r and cast it to u_w    ▷ Residual
    Solve U⁻¹L⁻¹Az_k = U⁻¹L⁻¹r_k by GMRES in precision u_w       ▷ Correction
    Correct the current solution x_{k+1} = x_k + z_k in precision u_w          ▷ Update
  until x_k is accurate enough

The design and implementation of numerical algorithms that efficiently exploit current highly parallel computer architectures is a challenge, especially if close to peak performance is to be achieved. Since in the GMRES-IR approach the O(n³) operations are in lower precision, this idea enables a new class of iterative refinement solvers and a number of computational techniques that allow us to solve fundamental Ax = b problems close to peak FP64 performance. The developments open up directions for future work, including further optimizations, development of a full set of mixed-precision factorizations, linear system solvers, as well as eigensolvers and singular value decomposition (SVD).

It is clear that the use of low precision floating-point arithmetic in iterative refinement can lead to significant speedups. However, fp16 has a small dynamic range, and therefore encountering overflow, underflow, and subnormal numbers is very likely. To address these issues we now discuss scaling algorithms, which are presented in [18], [19], and [20]. We refer interested readers to these references for more details.

We consider a two-sided diagonal scaling prior to converting to fp16: A is replaced by RAS, where
$$R = \mathrm{diag}(r_i), \quad S = \mathrm{diag}(s_i), \quad r_i, s_i > 0, \quad i = 1\colon n.$$
Such scaling algorithms have been developed in the context of linear systems and linear programming problems. Despite the large literature on scaling such problems, no clear conclusions are available on when or how one should scale; see [21] for a recent experimental study. In contrast to previous studies, where the aim of scaling has been to reduce a condition number or to speed up the convergence of an iterative method applied to the scaled matrix, we scale in order to help squeeze a single or double precision matrix into half precision, with a particular application to using the resulting half precision LU factors for iterative refinement. Our usage of two-sided diagonal scaling is given in Algorithm 3.
Algorithm 3 (Two-sided diagonal scaling then round). This algorithm rounds A ∈ R^{n×n} to the fp16 matrix A^{(h)}, scaling all elements to avoid overflow. θ ∈ (0, 1] is a parameter.
  1: Apply any two-sided diagonal scaling algorithm to A to obtain diagonal matrices R, S.
  2: Let β be the maximum magnitude of any entry of RAS.
  3: μ = θ x_max / β
  4: A^{(h)} = fl_h(μ(RAS))

We consider two different algorithms for determining R and S; both algorithms are carried out at the working precision. One option is row and column equilibration, which ensures that every row and column has maximum element in modulus equal to 1—that is, each row and column is equilibrated. The LAPACK routines xyyEQU carry out this form of scaling [22]. A symmetry-preserving two-sided scaling is proposed by Knight, Ruiz, and Uçar [23]. The algorithm is iterative and scales simultaneously on both sides rather than sequentially on one side and then the other.

Haidar et al. [3] proposed IR methods using mixed-precision factorizations. While classical IR and extensions like GMRES-IR use fixed-precision factorizations (e.g., in precision u_f as illustrated in Algorithm 2), mixed-precision factorizations apply higher precision (e.g., u_w) at critical parts of the algorithm to get extra-precise factorizations while retaining the performance of the low-precision counterpart. The developments were applied to GPU Tensor Cores and illustrate that FP16 can be used to get FP64 accuracy for problems with κ∞(A) of up to about 10⁵, compared to a more typical requirement of κ∞(A) < 10⁴. The work illustrates that mixed-precision techniques can be of great interest for linear solvers in many engineering areas. The results show that on a single NVIDIA V100 GPU the new solvers can be up to four times faster than an optimized double precision solver [3], [4], [24].

The mixed-precision factorizations were motivated by the need to get extra precision when working with very low precisions, like FP16. This also allows one to easily overcome implementation issues and other limitations of using FP16 arithmetic, and thus to harness the power of specialized hardware, like the GPU Tensor Cores, for a larger range of scientific computing applications.

A building block for the mixed-precision factorizations is mixed-precision BLAS, which eases the development of many mixed-precision LAPACK algorithms. Currently, cuBLAS provides a mixed FP32-FP16 precision HGEMM that uses the GPU's Tensor Core FP16 acceleration. In this GEMM, the input A and B matrices can be FP32, internally get cast to FP16, are used to compute a GEMM on Tensor Cores in full (FP32) accuracy, and the result is stored back in GPU memory in FP32. There are two main benefits of having such mixed-precision BLAS routines. First, this mixed-precision HGEMM is almost as fast as the non-mixed FP16-only HGEMM (see Figure 2), and second, the use of mixed precision gains about one more decimal digit of accuracy (see Figure 3).

Besides the two main benefits outlined above, the availability of mixed-precision GEMM also enables us to easily develop other mixed-precision algorithms, e.g., LAPACK, and in particular the various mixed-precision factorizations that we recently added in MAGMA [3]. Figure 5 shows the performance of the mixed-precision LU (marked as "FP16-TC hgetrf LU").
Figure 5: Mixed-precision LU (hgetrf) in MAGMA and its speedup vs. FP64 LU.
Note that this factorization is about 4× faster than dgetrf. Its data storage is in FP32 and the implementation is the same as sgetrf, except that it uses the mixed-precision HGEMMs for the trailing matrix updates. Figure 6 shows the mixed-precision iterative refinement in MAGMA [3]. The 4× overall acceleration is due to a number of optimizations. First, note that the 3 iterations needed to get to FP64 accuracy led to a loss of about 2 Tflop/s compared to the hgetrf performance (24 Tflop/s vs. 26 Tflop/s), i.e., the overhead of one iteration can be deduced as being about 2%. Losing 75%, e.g., through up to 40 iterations, would lead to no acceleration compared to the FP64 solver. This overhead per iteration is very low, which is due to fusing all data conversions with computational kernels. Without fusion, the overhead would have easily been several times higher. Second, note that the iterative refinement using the mixed-precision factorization has a more than 2× smaller overhead in terms of iterations to solution (3 vs. 7 iterations until FP64 convergence). This is due to the extra digit of accuracy that the mixed-precision HGEMM has over the FP16 HGEMM, which also translates to a more accurate mixed-precision LU.

Figure 6: Mixed-precision iterative refinement in MAGMA: performance of solving Ax = b using FP64 (dgesv), or using IR with GMRES to achieve FP64 accuracy (FP16-TC→64 dhgesv, FP16→64 dhgesv, FP32→64 dsgesv), and acceleration vs. the FP64 solver. Note the roughly 2% overhead per iteration, and the more than 2× smaller overhead in terms of iterations for the mixed-precision LU vs. the regular FP16 LU (3 vs. 7 iterations until FP64 convergence).

2.5 Cholesky Factorization

In the previous sections we considered scaling of a general and a symmetric matrix; we now assume that we are given a symmetric positive definite matrix A ∈ R^{n×n} in finite precision arithmetic of precision u and wish to compute a Cholesky factorization in finite precision arithmetic with precision u_h > u. The most practically important cases are where (u_h, u) = (half, single), (half, double), or (single, double). Naively, we might first form A^{(h)} = fl_h(A), where fl_h denotes the operation of rounding to precision u_h, and then compute the Cholesky factorization of A^{(h)} in precision u_h. For half precision this procedure can fail for two reasons. First, if fp16 is used then the limited range might cause overflow during the rounding. Second, for both bfloat16 and fp16, A^{(h)} might not be (sufficiently) positive definite, because a matrix whose smallest eigenvalue is safely bounded away from zero with respect to single precision or double precision may become numerically indefinite under the perturbation induced by rounding to half precision. The second issue can also be encountered when a double precision matrix is rounded to single precision. To overcome these problems we will use scaling and shifting.

The first step is to scale the matrix A to $H = D^{-1}AD^{-1}$, with $D = \mathrm{diag}(a_{ii}^{1/2})$, where D will be kept at precision u. Because Cholesky factorization is essentially numerically invariant under two-sided diagonal scaling (as can be seen from the recurrence relations for the Cholesky factors), the sole reason for scaling is to reduce the dynamic range in order to avoid overflow and reduce the chance of underflow, for fp16. For bfloat16 or single precision it will not usually be necessary to scale, and we can work with A throughout. For the rest of the presentation we will always scale, for simplicity of notation. Two-sided diagonal scaling before rounding to low precision was used in [18] in the context of LU factorization.
The scaling used there equilibrates rows and columns; our scaling with D is the natural analogue of that for symmetric positive definite matrices. For Cholesky factorization we need an extra ingredient to ensure a successful factorization, which we consider next.

We now convert H to the lower precision u_h, incorporating a shift to ensure that the lower precision matrix is sufficiently positive definite for Cholesky factorization to succeed, as discussed in [19, Sec. 2]. We will shift H by c_n u_h I, where c_n is a positive integer constant, to obtain G = H + c_n u_h I. Since the diagonal of H is I, this shift incurs no rounding error and it produces the same result whether we shift in precision u then round, or round then shift in precision u_h.

For fp16, in view of the narrow range we will also multiply the shifted matrix by a scalar to bring it close to the overflow level x_max, in order to minimize the chance of underflow and of subnormal numbers being produced. So our final precision-u_h matrix is constructed as
$$G = H + c_n u_h I, \quad \beta = 1 + c_n u_h, \quad \mu = \theta x_{\max}/\beta, \quad A^{(h)} = fl_h(\mu G), \quad (2)$$
where θ ∈ (0, 1] is a parameter. Note that β = max_{ij} |g_{ij}|, so the largest absolute value of any element of A^{(h)} is θ x_max. Note also that since the growth factor for Cholesky factorization is 1 (see, e.g., [11, Prob. 10.4]), there is no danger of overflow during the Cholesky factorization of A^{(h)}.

We refer to [19, Sec. 3.3] for an analysis regarding the choice of c_n. However, since the estimates are pessimistic, we take the pragmatic approach of taking c_n to be a small constant c. If the Cholesky factorization fails we will increase c and try again. We will determine experimentally how large c should be for a range of problems of interest. Based on this we give the low precision Cholesky factorization algorithm in Algorithm 4.

Algorithm 4 (Cholesky factorization in precision u_h). Given a symmetric positive definite A ∈ R^{n×n} in precision u, this algorithm computes an approximate Cholesky factorization R^T R ≈ μD^{-1}AD^{-1} at precision u_h, where D = diag(a_ii^{1/2}). The scalar θ ∈ (0, 1] and the positive integer c are parameters.
  1: D = diag(a_ii^{1/2}), H = D^{-1}AD^{-1}   % Set h_ii ≡ 1 instead of computing it.
  2: G = H + c u_h I
  3: β = 1 + c u_h
  4: μ = θ x_max / β
  5: A^{(h)} = fl_h(μG)
  6: Attempt the Cholesky factorization A^{(h)} = R^T R in precision u_h.
  7: if the Cholesky factorization failed then
  8:   c ← 2c, goto line 2
  9: end if
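A minimal sketch of Algorithm 4's scale–shift–retry loop is shown below, with FP32 standing in for the low precision u_h so that LAPACKE_spotrf can play the role of the precision-u_h Cholesky; the function name, θ = 0.1, and the retry limit are illustrative choices.

    #include <lapacke.h>
    #include <cmath>
    #include <limits>
    #include <vector>

    // Attempt a low precision Cholesky of mu * (D^{-1} A D^{-1} + c*u_h*I),
    // doubling the shift c until the factorization succeeds.
    bool low_prec_cholesky(int n, const std::vector<double>& A,
                           std::vector<float>& R, double theta = 0.1) {
        const double xmax = std::numeric_limits<float>::max();         // u_h range
        const double uh   = std::numeric_limits<float>::epsilon() / 2; // u_h
        std::vector<double> d(n);
        for (int i = 0; i < n; ++i) d[i] = std::sqrt(A[i + i * (size_t)n]);
        for (int c = 1; c <= (1 << 20); c *= 2) {
            double beta = 1.0 + c * uh, mu = theta * xmax / beta;
            R.assign((size_t)n * n, 0.0f);
            for (int j = 0; j < n; ++j)
                for (int i = 0; i < n; ++i) {
                    double h = A[i + j * (size_t)n] / (d[i] * d[j]);
                    if (i == j) h = 1.0 + c * uh;           // shifted unit diagonal
                    R[i + j * (size_t)n] = (float)(mu * h); // round to u_h
                }
            if (LAPACKE_spotrf(LAPACK_COL_MAJOR, 'L', n, R.data(), n) == 0)
                return true;  // R now holds the Cholesky factor of A^(h)
        }
        return false;         // give up after too many shift increases
    }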
We consider the linear least squares problem min_x ‖Ax − b‖_2, where A ∈ R^{m×n} with m ≥ n has full rank. The ideas of mixed-precision iterative refinement and GMRES-IR can be adapted to the least squares case. Least squares problems may be ill conditioned in practice, and so rounding errors may result in an insufficiently accurate solution. In this case, iterative refinement may be used to improve accuracy, and it also improves stability. The normal equations method solves A^T Ax = A^T b using the Cholesky factorization of A^T A (see Section 2.5). In general, this method is deprecated by numerical analysts because it has a backward error bound of order κ₂(A)u [11, sect. 20.4] and the Cholesky factorization can break down for κ₂(A) > u^{-1/2}, but it is used by statisticians with some justification [25]. Here, we assume that A is (sufficiently) well conditioned. We propose the GMRES-IR-based least squares solver given in Algorithm 5.

Algorithm 5 (Cholesky-based GMRES-IR for the least squares problem). Let a full rank A ∈ R^{m×n}, where m ≥ n, and b ∈ R^m be given in precision u. This algorithm solves the least squares problem min_x ‖b − Ax‖_2 using Cholesky-based GMRES-IR. The scalar θ ∈ (0, 1] and the positive integer c are parameters.
  1: Compute B = AS, where S = diag(1/‖a_j‖_2), with a_j the jth column of A.
  2: μ = θ x_max
  3: B^{(h)} = fl_h(μ^{1/2} B)
  4: Compute C = B^{(h)T} B^{(h)} in precision u_h.
  5: Compute the Cholesky factorization C + c u_h diag(c_ii) = R^T R in precision u_h.
  6: if the Cholesky factorization failed then c ← 2c, goto line 5, end if
  7: Form b^{(h)} = fl_h(SA^T b).
  8: Solve R^T Ry = b^{(h)} in precision u_h and form x_0 = μSy at precision u.
  9: for i = 0 : i_max − 1 do
 10:   Compute r_i = A^T(b − Ax_i) at precision u_r and round r_i to precision u.
 11:   Solve MA^T Ad_i = Mr_i by GMRES at precision u, where M = μSR^{-1}R^{-T}S and matrix–vector products with A^T A are computed at precision u_r, and store d_i at precision u.
 12:   x_{i+1} = x_i + d_i at precision u.
 13:   if converged then return x_{i+1}, quit, end if
 14: end for

We make some comments on the algorithm. Line 1 produces a matrix B with columns of unit 2-norm. The computation C = B^{(h)T} B^{(h)} on line 4 produces a symmetric positive definite matrix with constant diagonal elements μ = θ x_max, so overflow cannot occur for θ < 1. The shift on line 5 is analogous to that in Algorithm 4, but here the matrix C is already well scaled and in precision u_h, so there is no need to scale C to have unit diagonal.

There are two reasons why explicitly forming C = B^{(h)T} B^{(h)} in Algorithm 5 is reasonable from the numerical stability point of view. First, C is used to form a preconditioner, so the usual problems with forming a cross-product matrix (loss of significance and condition squaring) are less of a concern. Second, if we are working in fp16 on an NVIDIA V100 we can exploit the tensor cores when forming C to accumulate block fused multiply-add operations in single precision; this leads to a more accurate C, as shown by the error analysis of Blanchard et al. [26].

For the computed R̂ we have $\hat{R}^T\hat{R} \approx B^{(h)T}B^{(h)} \approx \mu S A^T A S$, or $(A^TA)^{-1} \approx \mu S\hat{R}^{-1}\hat{R}^{-T}S$, so we are preconditioning with an approximation to the inverse of A^T A. We apply the preconditioned operator MA^T A to vectors at precision u_r. Computing y = A^T Ax costs 4mn flops and SR^{-1}R^{-T}Sy costs another 2n² + 2n flops, making 4mn + 2n² + 2n flops in total. For m ≫ n and large n, computing y = A^T Ax costs a factor n/4 fewer flops than the mn² flops needed to form A^T A, while for m ≈ n the difference is a factor n/6. For large n, even allowing for the fact that the flops we are comparing are at different precisions, as long as GMRES converges quickly the cost of the refinement stage should be negligible compared with the cost of forming A^T A and computing the Cholesky factorization.

Related to this work is the Cholesky–QR algorithm for computing a QR factorization A = QR. It forms the cross-product matrix A^T A, computes the Cholesky factorization A^T A = R^T R, then obtains the orthogonal factor Q as Q = AR^{-1}, and this process can be iterated for better numerical stability; see, for example, [27], [28], [29], [30]. Our work differs in that we do not compute Q, we carry out the Cholesky factorization in lower precision than the working precision, and we solve a least squares problem using preconditioned iterative refinement.
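Applying the preconditioner M = μSR⁻¹R⁻ᵀS of Algorithm 5 to a vector requires only two diagonal scalings and two triangular solves. A minimal sketch (assumptions: R promoted to the working precision, S stored as a vector s of column-norm reciprocals, function name ours):

    #include <cblas.h>
    #include <vector>

    // v <- mu * S R^{-1} R^{-T} S v, with S = diag(s) and R upper triangular.
    void apply_M(int n, double mu, const std::vector<double>& s,
                 const std::vector<double>& R, std::vector<double>& v) {
        for (int i = 0; i < n; ++i) v[i] *= s[i];        // v = S v
        cblas_dtrsv(CblasColMajor, CblasUpper, CblasTrans, CblasNonUnit,
                    n, R.data(), n, v.data(), 1);        // v = R^{-T} v
        cblas_dtrsv(CblasColMajor, CblasUpper, CblasNoTrans, CblasNonUnit,
                    n, R.data(), n, v.data(), 1);        // v = R^{-1} v
        for (int i = 0; i < n; ++i) v[i] *= mu * s[i];   // v = mu S v
    }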
Another approach to mixed precision least squares iterative refinement was presented by Carson, Higham, and Pranesh in [20]. This approach is based on the QR factorization
$$A = Q\begin{bmatrix} R \\ 0 \end{bmatrix},$$
where Q = [Q₁, Q₂] ∈ R^{m×m} is an orthogonal matrix with Q₁ ∈ R^{m×n} and Q₂ ∈ R^{m×(m−n)}, and R ∈ R^{n×n} is upper triangular. The unique least squares solution is x = R^{-1}Q₁^T b with residual ‖b − Ax‖₂ = ‖Q₂^T b‖₂.

An iterative refinement approach that works even when Ax = b is inconsistent was suggested by Björck [31]. Refinement is performed on the augmented system
$$\begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} r \\ x \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix}, \quad (3)$$
which is equivalent to the normal equations. In this way, the solution x_i and residual r_i for the least squares problem are simultaneously refined. Björck [31] shows that the linear system can be solved by reusing the QR factors of A.

Existing analyses of the convergence and accuracy of this approach in finite precision assume that at most two precisions are used; the working precision u is used to compute the QR factorization, solve the augmented system, and compute the update. A second precision u_r ≤ u is used to compute the residuals. Typically u_r = u², in which case it can be shown that as long as the condition number of the augmented system matrix is smaller than u^{-1}, the refinement process will converge with a limiting forward error on the order of u; see [32] and [11, sect. 20.5] and the references therein.

The work [20] shows that the three-precision iterative refinement approach of Carson and Higham [14] can be applied in this case; the theorems developed in [14] regarding the forward error and normwise and componentwise backward error for iterative refinement of linear systems are applicable. The only thing that must change is the analysis of the method for solving the correction equation, since we now work with a QR factorization of A, which can be used in various ways.

The work in [20] also extends the GMRES-based refinement scheme of [15] to the least squares case and shows that one can construct a left preconditioner using the existing QR factors of A such that GMRES provably converges to a backward stable solution of the preconditioned augmented system. Further, it is shown that an existing preconditioner developed for saddle point systems can also work well in the GMRES-based approach in practice, even though the error analysis is not applicable. We refer the reader to [20] for further details.

Quantization is a technique widely used in deep learning inference [33, 34]. While the model is usually still trained in single precision, quantization compresses the data and uses lower precision to carry out the computation in the inference stage, which applies the trained model to new data in a real application. For an int8 quantized model, the data is converted into 8-bit integers. The computation and communication costs are reduced by a factor of 4 compared to 32-bit single precision, while the accuracy loss is acceptable for predictive models. Integer arithmetic is available on most hardware architectures. Field Programmable Gate Arrays (FPGAs) are usually more capable in integer operations and might not have floating-point arithmetic units. New Application-Specific Integrated Circuits (ASICs) for deep learning inference are also moving toward using mostly integer arithmetic for quantized neural networks. This motivated us to investigate the use of integer arithmetic for Gaussian elimination (LU factorization) with partial pivoting.
Table 1: Proposed fixed-point number representation.
  Storage format:                       i in 32-bit integer
  Represented real number:              R(i) = i/2³¹
  Conversion from double precision α:   i ← int32(α × 2³¹)
  Conversion to double precision α:     α ← double(i)/2³¹
  Addition:                             R(i) + R(j) = i/2³¹ + j/2³¹ = (i + j)/2³¹ = R(i + j)
  Multiplication:                       R(i) × R(j) = i/2³¹ × j/2³¹ = (i × j)/2⁶² = (i × j/2³¹)/2³¹ = R(i × j/2³¹)

The basic idea is to scale numbers down to fit into a fixed-point number representation i/2³¹, where i is a 32-bit integer. The exponent does not change under addition or multiplication, so it can be ignored. Addition in this form is simply integer addition. Multiplication becomes i/2³¹ × j/2³¹ = i × j/2⁶² = (i × j/2³¹)/2³¹. Computing i × j/2³¹ can be done with a widening 32-bit integer multiply that keeps the high part of the 64-bit result. Note that this operation can be done in one instruction on modern CPU instruction set architectures (ISAs), including x64 and ARM. Table 1 summarizes the proposed fixed-point number representation.

Algorithm 6 shows the LU factorization with partial pivoting based on integer arithmetic.

Algorithm 6 LU factorization with partial pivoting based on integer arithmetic.
  Input: n-by-n matrix A in double precision; integer r for the range while normalizing A.
  1: Declare the identity matrix P as the permutation matrix.
  2: m ← max(|A|) × 2^r; A ← A/m                ▷ Normalize A into [−2^{−r}, 2^{−r}].
  3: A_int ← int32(A × 2³¹)                     ▷ Convert A into the proposed fixed-point representation.
  4: for i = 1 . . . n do                       ▷ Main loop over columns.
  5:   pivot ← (arg max |A_int[i:n, i]|) + i − 1    ▷ Find the pivot index.
  6:   swap(A_int[i, :], A_int[pivot, :])       ▷ Swap rows.
  7:   swap(P[i, :], P[pivot, :])
  8:   α ← int64(2⁶²)/A_int[i, i]               ▷ Find the scale with integer division.
  9:   A_int[i:n, i] ← αA_int[i:n, i]/2³¹       ▷ Scale the column (high part of the 64-bit product).
 10:   A_int[i+1:n, i+1:n] ← A_int[i+1:n, i+1:n] − A_int[i+1:n, i] × A_int[i, i+1:n]/2³¹   ▷ Integer rank-1 update with a division using an integer shift.
 11: end for
 12: L ← lower triangular part of double(A_int)/2³¹ with unit diagonal.
 13: U ← upper triangular part of double(A_int)/2³¹ including the diagonal.
 14: Return: P, L, U as the result of the factorization such that P(A/m) = LU.

The computation inside the loop is mainly 32-bit integer arithmetic. Line 8 requires a 64-bit integer division, but only once per column. The scale in line 9 remains in the int32 range because the pivot has larger magnitude than the other elements in the column. The update in line 10 is a 32-bit integer multiply of which we only need the high 32 bits of the 64-bit result.

The input integer r determines the number of bits (32 − r) we are actually using while converting A into integers. Because the matrix can grow during the factorization and we do not apply any dynamic scaling during the factorization, it might exceed the integer range and overflow at some point. To avoid this, we first scale the matrix into [−2^{−r}, 2^{−r}]. The higher r is, the more room we have within the integer range, but the less accurate the input matrix is after conversion to int32.

Figure 7 shows the normalized backward error ‖Ax − b‖∞/(‖A‖∞‖x‖∞) vs. the input matrix size. The algorithm is implemented in MATLAB R2018b. Each element of the matrix is generated from the uniform random distribution uniform(−1, 1). Each point is the geometric average over 30 random matrices, and error bars indicate the 15th and 85th percentiles. The results from single and double precision LU factorization are also reported as bold reference lines.

Figure 7: Normalized backward error vs. input matrix size for different numbers of usable bits r. Results in single and double precision are also shown.

For the results above 10⁻², overflow occurred and the algorithm failed. Otherwise, there is no numerical error during the factorization: the error is only introduced in the conversions between the floating-point and fixed-point formats, not during the integer factorization. The backward error grows with r since the input is truncated more at the conversion. But even when r = 10, the method is still using 32 − 10 = 22 bits, and the result is comparable with single precision, which uses 23 mantissa bits.
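The arithmetic of Table 1 reduces to ordinary integer operations; a small C++ sketch (the helper names are ours):

    #include <cstdint>

    // Q1.31 fixed-point helpers for the representation R(i) = i / 2^31.
    // The widening multiply plus shift corresponds to keeping the high part
    // of the 64-bit product, one instruction on x86-64 and ARM.
    static inline int32_t fxmul(int32_t i, int32_t j) {
        return (int32_t)(((int64_t)i * (int64_t)j) >> 31); // R(i)*R(j) = R(i*j/2^31)
    }
    static inline int32_t fxadd(int32_t i, int32_t j) {
        return i + j;                                      // R(i)+R(j) = R(i+j)
    }
    static inline int32_t fx_from_double(double a) {
        return (int32_t)(a * 2147483648.0);                // int32(a * 2^31)
    }
    static inline double fx_to_double(int32_t i) {
        return (double)i / 2147483648.0;                   // double(i) / 2^31
    }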
We would like to showcase that it is possible to perform a low precision factorization using integer arithmetic. As a next step we will conduct a more detailed error analysis and extend the approach to shorter integers such as int16 and int8. Also, to tackle the overflow problem, we would like to consider dynamically scaling the columns during the factorization to keep the numbers in range. Furthermore, in the same spirit as per-channel quantization in deep learning, assigning a different range to each column might also be a feasible approach.

2.8 Symmetric Eigenvalue Problems
In [35] an algorithm is described for determining rigorous error bounds for a simple eigenvalue and its associated eigenvector. The algorithm has the pleasing feature of providing an improved eigenpair as a by-product. This approach assumes that an eigenpair is given. No assumptions are made about how that eigenpair was found, whether through some knowledge of the physical problem, an initial eigenvalue decomposition in a lower precision, or a clever guess.

We are interested in improving the accuracy of an eigenvalue–eigenvector pair. Consider the eigenvalue problem Ax = λx, where λ and x have been found by some means. Because they were arrived at by some calculation on a computer with finite precision or by some insight into the problem, they are, in general, not the true eigenvalue and eigenvector, but an approximation. We know, however, that there exist μ and ỹ such that
$$A(x + \tilde{y}) = (\lambda + \mu)(x + \tilde{y})$$
is the exact solution to the eigenvalue problem, where μ and ỹ are the corrections to the computed λ and x.

We will normalize x such that ‖x‖∞ = 1 and say x_s = 1, where the sth component of x is the largest. This can be done because we have one degree of freedom in our choice of the components of x. We will assume that the sth component of x is exact and no correction is needed. This determines the value of ỹ_s, which is the correction to x_s. Because x_s is exact, the value of ỹ_s is zero. This also determines the degree of freedom in the corrected vector x + ỹ, through the relationship between x and ỹ, namely (x + ỹ)_s = 1.

We can rewrite the equation $A(x + \tilde{y}) = (\lambda + \mu)(x + \tilde{y})$ as
$$(A - \lambda I)\tilde{y} - \mu x = \lambda x - Ax + \mu\tilde{y}.$$
Note that λx − Ax is the residual for the computed eigenvalue and eigenvector. If we look more closely at the product (A − λI)ỹ, we discover that because ỹ_s = 0, the sth column of (A − λI) does not participate in the product with ỹ. In the formulation of (A − λI)ỹ − μx, we can therefore replace the sth component of ỹ, which is zero, by the value μ and the sth column of (A − λI) by −x; this substitution folds the unknown μ into the correction vector.

We will define y by y ≡ ỹ + μe_s, where e_s is the sth column of the identity matrix, so that the sth component of the newly defined y has the value μ; i.e., y_s = μ. We will also define the matrix B as the matrix (A − λI) with the sth column replaced by −x. Thus we can rewrite the equation above as
$$By = r + y_s\tilde{y}, \quad \text{where } r = \lambda x - Ax.$$
Because the (n + 1)st element of the solution vector is known, we can solve with the truncated form of B, in which the (n + 1)st row and (n + 1)st column are no longer present. This truncation can be done because we know the solution vector has a zero in the (n + 1)st position. The above equation is a nonlinear equation defining the correction y. This system can be solved by the following iterative method:
$$By^{(p+1)} = r + y_s^{(p)}\tilde{y}^{(p)}, \quad \text{where } \tilde{y}^{(p)} = y^{(p)} - y_s^{(p)}e_s.$$
This is the approach used in [36].
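A dense sketch of this fixed-point iteration is given below, using LAPACK's dgesv for the solves with B (re-factorizing in each sweep for simplicity); the function name and the fixed number of sweeps are ours.

    #include <lapacke.h>
    #include <vector>

    // Refine a computed eigenpair (lambda, x) with x_s = 1 its largest component.
    // B is (A - lambda I) with column s replaced by -x; r = lambda x - A x.
    void refine_eigenpair(int n, int s, const std::vector<double>& A,
                          std::vector<double>& x, double& lambda) {
        std::vector<double> B0((size_t)n * n), r(n), y(n, 0.0);
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i)
                B0[i + j * (size_t)n] = (j == s) ? -x[i]
                    : A[i + j * (size_t)n] - (i == j ? lambda : 0.0);
        for (int i = 0; i < n; ++i) {
            r[i] = lambda * x[i];
            for (int j = 0; j < n; ++j) r[i] -= A[i + j * (size_t)n] * x[j];
        }
        for (int p = 0; p < 5; ++p) {              // a few fixed-point sweeps
            std::vector<double> B(B0), rhs(r);
            for (int i = 0; i < n; ++i)            // rhs = r + y_s * ytilde
                rhs[i] += y[s] * (i == s ? 0.0 : y[i]);
            std::vector<lapack_int> ipiv(n);
            LAPACKE_dgesv(LAPACK_COL_MAJOR, n, 1, B.data(), n,
                          ipiv.data(), rhs.data(), n);
            y = rhs;                               // y^(p+1) = B^{-1} rhs
        }
        lambda += y[s];                            // the correction mu sits in y_s
        for (int i = 0; i < n; ++i) if (i != s) x[i] += y[i];
    }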
Algorithm 7 Iterative refinement for the symmetric eigenvalue problem.
  Input: A = A^T ∈ R^{n×n}, X̂ ∈ R^{n×ℓ}, 1 ≤ ℓ ≤ n
  Output: X′ ∈ R^{n×ℓ}, D̃ = diag(λ̃_i) ∈ R^{ℓ×ℓ}, Ẽ ∈ R^{ℓ×ℓ}, ω ∈ R
  function [X′, D̃, Ẽ, ω] ← REFSYEV(A, X̂)
    1: R ← I_n − X̂^T X̂
    2: S ← X̂^T A X̂
    3: λ̃_i ← s_ii/(1 − r_ii) for i = 1, . . . , ℓ    ▷ Compute approximate eigenvalues.
    4: D̃ ← diag(λ̃_i)
    5: ω ← (‖S − D̃‖ + ‖A‖‖R‖)
    6: ẽ_ij ← (s_ij + λ̃_j r_ij)/(λ̃_j − λ̃_i) if |λ̃_i − λ̃_j| > ω, and r_ij/2 otherwise, for 1 ≤ i, j ≤ ℓ    ▷ Compute the entries of the refinement matrix Ẽ.
    7: X′ ← X̂ + X̂Ẽ                                  ▷ Update X̂ by X̂(I_n + Ẽ).
  end function
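A dense, working-precision sketch of Algorithm 7 for ℓ = n is given below, using CBLAS for the four GEMMs and the Frobenius norm for ω, as suggested in the discussion that follows; all names are illustrative.

    #include <cblas.h>
    #include <cmath>
    #include <vector>

    // One refinement sweep of Algorithm 7 on the full spectrum (l = n),
    // column-major storage; X holds the approximate eigenvectors.
    void refsyev(int n, const std::vector<double>& A, std::vector<double>& X) {
        std::vector<double> R((size_t)n*n), S((size_t)n*n), W((size_t)n*n);
        // R = I - X^T X
        cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans, n, n, n,
                    -1.0, X.data(), n, X.data(), n, 0.0, R.data(), n);
        for (int i = 0; i < n; ++i) R[i + i * (size_t)n] += 1.0;
        // S = X^T A X (two GEMMs)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    1.0, A.data(), n, X.data(), n, 0.0, W.data(), n);
        cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans, n, n, n,
                    1.0, X.data(), n, W.data(), n, 0.0, S.data(), n);
        // Approximate eigenvalues and omega (Frobenius norms)
        std::vector<double> lam(n);
        for (int i = 0; i < n; ++i)
            lam[i] = S[i + i * (size_t)n] / (1.0 - R[i + i * (size_t)n]);
        double nSD = 0, nA = 0, nR = 0;
        for (size_t k = 0; k < (size_t)n * n; ++k) {
            int i = (int)(k % n), j = (int)(k / n);
            double sd = S[k] - (i == j ? lam[i] : 0.0);
            nSD += sd * sd; nA += A[k] * A[k]; nR += R[k] * R[k];
        }
        double omega = std::sqrt(nSD) + std::sqrt(nA) * std::sqrt(nR);
        // Element-wise refinement matrix E
        std::vector<double> E((size_t)n*n);
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i)
                E[i + j*(size_t)n] = (std::fabs(lam[i] - lam[j]) > omega)
                    ? (S[i + j*(size_t)n] + lam[j]*R[i + j*(size_t)n]) / (lam[j] - lam[i])
                    : R[i + j*(size_t)n] / 2.0;
        // X' = X + X E (fourth GEMM, accumulated via a workspace)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    1.0, X.data(), n, E.data(), n, 0.0, W.data(), n);
        for (size_t k = 0; k < (size_t)n * n; ++k) X[k] += W[k];
    }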
Algorithm 7 shows another approach, using an iterative refinement procedure for solving a symmetric eigenvalue problem [37]. This method also succeeds for clustered eigenvalues [38]. Lines 1, 2, and 7 represent the compute-intensive parts of the algorithm, which amount to 4 calls to the matrix-matrix multiply function xGEMM. Line 5 requires matrix norms; the original analysis uses the 2-norm, but it is suggested to approximate it using the Frobenius norm because that is much easier to compute in practice. Line 6 is an element-wise operation to construct the refinement matrix Ẽ. Line 7 is the update of the eigenvectors by applying the refinement matrix Ẽ. High-precision arithmetic is required for all computations except the matrix norms in line 5. Although the algorithm can work on only a subset of the eigenvectors, it is then only refining within the corresponding subspace, and hence the refinement process could be limited. In other words, the desired accuracy might be unattainable if only a part of the spectrum is refined in higher precision.

Figure 8 shows the convergence behavior of Algorithm 7 on a real symmetric matrix of size n = 200 when refining the entire eigenspectrum. Each line represents the convergence of one eigenvalue, and the normalized residual ‖Ax − λx‖/(‖A‖‖x‖) is plotted against subsequent iteration numbers.

Iterative refinement based on a linear solve is also possible [36]. Algorithm 8 is the procedure called SICE which, in each iteration, solves a linear system resulting from a rank-1 update in order to refine a single eigenpair. The rank-1 update is introduced by replacing one column of A − λI to remove one degree of freedom in the eigenvector correction and, at the same time, compute a correction for the corresponding eigenvalue. The original formulation [36] solves the system with two series of Givens rotations to make it upper triangular. This process is hard to parallelize on modern architectures. Also, some form of orthogonalization should be considered when using the algorithm to refine more than one eigenvalue.

Algorithm 8 SICE algorithm for iteratively refining a computed eigenpair.
  function [x, λ] ← SICE(A, x₀, λ₀)
    [Q, T] ← schur(A)                      ▷ Schur decomposition to find A = QTQ^T, where T is quasi upper triangular.
    [m, s] ← max(x₀)                       ▷ Find the maximum value and its index in the eigenvector.
    x₀ ← x₀/m                              ▷ Normalize.
    for i = 1, 2, . . . do
      c ← −x_{i−1} − (A − λ_{i−1}I)[:, s]  ▷ Column s of A − λ_{i−1}I.
      d ← Q^T c
      f ← e_s^T Q                          ▷ Row s of Q.
      Solve the rank-1 updated system Q(T − λ_{i−1}I + df^T)Q^T y_i = Ax_{i−1} − λ_{i−1}x_{i−1}
      λ_i ← λ_{i−1} + y_i[s]               ▷ Eigenvalue correction.
      x_i ← x_{i−1} + y_i                  ▷ Eigenvector correction.
      x_i[s] ← x_{i−1}[s]                  ▷ Restore x[s].
      if 2 × y_i[s] > y_{i−1}[s] then
        Break from the for loop.
      end if
    end for
    x ← x_i
    λ ← λ_i
  end function

In many applications, we are satisfied with a subset of the eigenvalue-eigenvector pairs. In this case, it can be much more efficient to use an algorithm such as the Multiple Relatively Robust Representations (MRRR) [39] to compute the eigenpairs once the matrix has been reduced to tridiagonal form. Though this method by itself is less accurate than its counterparts (Divide and Conquer and QR), [40] shows that using a mixed precision approach can be beneficial to improve the accuracy of the solve and the overall time to solution. The mixed precision approach here also shows promise in improving the orthogonality over its single precision and other solver counterparts.

A fundamental requirement for accelerating scientific computations through multiprecision use is the ability to efficiently convert data between the different floating point formats employed and to minimize the communications associated with these data movements. Techniques and implementations to accomplish this efficiently have usually been developed ad hoc, i.e., implemented and tuned for particular algorithms and implementations that use mixed precision. Thus, although there are some solutions that address particular challenges, there are no standards, often there are no user-level interfaces to lower-level building blocks, and they are therefore not extracted as independent, supported
libraries or components infrastructure that other developers can use. To address this, we have been investigating a number of building blocks that can be extracted and included in numerical libraries for the development of mixed-precision algorithms. The components that are of interest for data and communication compression are discussed in the subsequent subsections.

Many mixed-precision algorithms need to convert data between different standard IEEE formats. For example, LAPACK supports this type of data conversion as needed for its mixed-precision iterative refinement solvers. Support is provided through auxiliary routines for either general or triangular matrices, following standard LAPACK naming conventions and matrix representations. For example, general matrices can be converted from FP64 to FP32 as follows:

    zlag2c(M, N, zA, LDA, cA, LDCA, INFO)
    dlag2s(M, N, dA, LDA, sA, LDSA, INFO)

The first example casts a double complex matrix to single complex, and the second a double to a single real matrix. The other direction (from single to double) is provided through the clag2z and slag2d routines. The interfaces for converting triangular matrices are:

    zlat2c(UPLO, N, zA, LDA, cA, LDCA, INFO)
    dlat2s(UPLO, N, dA, LDA, sA, LDSA, INFO)

and the corresponding routines for going from single to double are clat2z and slat2d, respectively.

These routines, following LAPACK's interfaces, are also provided in MAGMA for GPUs. MAGMA also adds conversion from single to half precision (FP32 to FP16) for general matrices:

    slag2h(M, N, sA, LDA, hA, LDHA, INFO, QUEUE)

and the corresponding hlag2s. These routines are well optimized for NVIDIA GPUs, and are also supported for AMD GPUs (through hipMAGMA). MAGMA also provides batched equivalents for batches of conversions.

A library more specialized for mixed-precision calculations may have to support a more complete set of data conversion routines, e.g., for arrays, strided arrays, tensors, etc., and more combinations of formats, including user/application-defined ones. For example, some mixed-precision FFT algorithms (see Section 3.3) use dynamic "splitting" of a high precision array (e.g., FP32) into two lower-precision (e.g., FP16) arrays. See also Section 7 for further discussion and extensions on formats.

Data compression for reducing communications is another component needed for the development of mixed-precision algorithms. The simplest form that we consider is casting, as discussed in Section 3.1. This is an example of a lossy compression. Casting from FP64 to FP32, for example, leads directly to a loss of about 8 decimal digits of accuracy, but reduces the data size by a factor of two. Casting has been used to accelerate the FP64 solvers in MAGMA up to 4× using the mixed-precision iterative refinement techniques [24, 4, 3], and we use it as a benchmark to evaluate the potential of using other compression algorithms.
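As an illustration of such a cast, below is a minimal use of dlag2s through the LAPACKE interface (the wrapper name is ours); a positive return value reports an entry whose magnitude overflows the FP32 range:

    #include <lapacke.h>
    #include <vector>

    // Cast an n x n FP64 matrix to FP32 with LAPACK's dlag2s.
    bool cast_fp64_to_fp32(int n, const std::vector<double>& dA,
                           std::vector<float>& sA) {
        sA.resize((size_t)n * n);
        lapack_int info = LAPACKE_dlag2s(LAPACK_COL_MAJOR, n, n,
                                         dA.data(), n, sA.data(), n);
        return info == 0;   // info > 0: an entry overflowed in single precision
    }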
ZFP provides lossy compression algorithms, where the compression mode can be specified by the user as either fixed rate, fixed accuracy, or fixed precision [41]. An analysis of the round-off error introduced by ZFP in compressing floating-point data is presented in [42]. The values in this experiment are random numbers and the specified compression is ×. Note that, compared to casting, the compression rate is the same as casting to FP16, while the accuracy is comparable to casting to FP32. These results make it feasible to use tools like ZFP to accelerate memory-bound codes, such as FFTs (see Section 3.3), up to × while losing about 8 decimal digits of accuracy.

Of interest is an MPI extension that fuses data conversions with the MPI communications. The conversion is user specified and includes casting or other data compression or conversion mechanisms: a single MPI call converts the input data as specified and sends the converted data, and the corresponding MPI call receives and converts the result back, again as specified by the user. Our MPI collaborators have developed preliminary mixed-precision MPI for All2All and P2P communication using casting. The results show that asymptotically, for large enough data, the MPI communications can be accelerated proportionally to the data compression, i.e., the conversion overhead is negligible. The implementations are for CPUs, as well as for GPUs using GPU-direct communications. Our preliminary implementation can also use ZFP to compress the data. Figure 9 illustrates an acceleration result for All2All in FP64 (marked as base, i.e., the acceleration-of-1 line).
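To make the fused-conversion idea concrete, the following is a minimal sketch (our construction, not the actual mixed-precision MPI prototype) emulating the cast-based compression around a standard MPI_Alltoall; a true fused implementation would perform the conversion inside the MPI call, or in a GPU kernel overlapped with the transfer, rather than in separate loops.

    #include <mpi.h>
    #include <vector>

    // All2All of FP64 data, transmitted as FP32: cast down, communicate,
    // cast back up. 'count_per_rank' is the number of values for each peer.
    void alltoall_fp64_as_fp32(const double* send, double* recv,
                               int count_per_rank, MPI_Comm comm) {
        int nranks;
        MPI_Comm_size(comm, &nranks);
        const size_t n = static_cast<size_t>(count_per_rank) * nranks;

        std::vector<float> send32(n), recv32(n);
        for (size_t i = 0; i < n; ++i)                 // lossy "compression"
            send32[i] = static_cast<float>(send[i]);

        MPI_Alltoall(send32.data(), count_per_rank, MPI_FLOAT,
                     recv32.data(), count_per_rank, MPI_FLOAT, comm);

        for (size_t i = 0; i < n; ++i)                 // "decompression"
            recv[i] = static_cast<double>(recv32[i]);
    }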
[Figure 10 panel data, heFFTe GPU acceleration: with GPUs, packing 0.91%, unpacking 0.72%, FFT computation 1.03%, MPI communication 97.34%; with CPUs, packing 9.65%, unpacking 29.13%, FFT computation 11.77%, MPI communication 49.45%. Right panel: heFFTe weak scaling on 3D FFTs.]

Figure 10:
Left: heFFTe acceleration using GPUs vs. CPUs for FFTs on 4 Summit nodes. Note that the nodal computations are accelerated × using GPUs. Right: heFFTe weak scalability, with sizes indicated on the graph, on up to 1024 nodes (6 V100 GPUs per node; double complex arithmetic; starting and ending with bricks; performance assumes N log N flops).

Note that here we use ZFP and manage to accelerate the MPI communication by more than ×, while losing only about 8 decimal digits of accuracy, i.e., we achieve an acceleration that outperforms the corresponding version using a cast to FP32 (while having the same accuracy). The data sending itself is accelerated close to × as the data is compressed ×, but the overall acceleration drops when adding the cost of the data compression and decompression. This means that the acceleration can theoretically still go up to about × in implementations where the GPU work on compression and decompression is pipelined and overlapped with the communication.

One application of the mixed-precision technologies described in this section is the acceleration of multidimensional FFTs. We found that more than a dozen ECP applications use FFTs in their codes, including LAMMPS, HACC, ExaAM, and applications from the CoPA co-design center [43]. ECP applications that require FFT-based solvers suffer from the lack of fast and scalable 3D FFT routines for distributed heterogeneous parallel systems such as the ones projected for the upcoming exascale computing systems, and some of the applications have indicated interest in exploring the use of approximate FFTs that trade some loss of accuracy for an increase in performance. To address the above needs for high-performance scalable FFTs on large-scale GPU systems, we developed and released the
Highly Efficient FFTs for Exascale (heFFTe) library [44, 45, 46]. heFFTe v0.2 features very good weak, as well as strong, scalability and performance that is close to 90% of the roofline peak (see Figure 10). However, after accelerating the local/nodal computations by about × using GPUs (vs. CPUs), the main bottleneck becomes the MPI communication. Currently, on typical FFT problems the GPUs are busy only about 1% of the time, while the remaining 99% is spent in MPI communications, i.e., the GPUs are free to be used for other computations 99% of the time. Thus, any acceleration of the MPI communications would translate into the same acceleration of the overall FFT computation.

One idea to accelerate FFTs using mixed precision is through a tradeoff of accuracy for speed. Here we reduce the communication volume by compressing the data, e.g., by casting or compression as outlined in Section 3.2. Multidimensional FFTs use All2All-type MPI communications, where the data is first packed (locally on each GPU, benefiting from the GPUs' high bandwidth), then comes the MPI communication, and finally the data is unpacked locally on each GPU [43, 44]. As packing and unpacking are memory bound and involve going through the data once, casting or another type of compression can be fused with the packing/unpacking, which largely removes its overhead. This idea is explored through the use of mixed-precision MPI that fuses all these operations (as in Section 3.2.1). The speedups obtained vs. the reduction in accuracy are quantified in Figure 9.

Current work is on overlapping the use of the GPUs for compression/decompression with the MPI communications through pipelining. This is possible as FFTs are memory bound and the GPUs are free to do other computations up to 99% of the time, e.g., as benchmarked for the Summit hardware configuration.
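As an illustration of fusing the cast with the packing described above, here is a minimal sketch (our construction; heFFTe's actual packing kernels are GPU kernels with multidimensional indexing). The down-conversion happens inside the single pass over the data that the pack already performs, so no extra memory sweep is added.

    #include <vector>

    // Pack chunks of a local FP64 brick into an FP32 send buffer; the cast
    // is fused into the one traversal the pack performs anyway.
    void pack_and_cast(const std::vector<double>& brick,
                       const std::vector<size_t>& offsets,  // chunk start indices
                       const std::vector<size_t>& lengths,  // chunk lengths
                       std::vector<float>& sendbuf) {
        size_t total = 0;
        for (size_t len : lengths) total += len;
        sendbuf.resize(total);
        size_t out = 0;
        for (size_t c = 0; c < offsets.size(); ++c)
            for (size_t k = 0; k < lengths[c]; ++k)
                sendbuf[out++] = static_cast<float>(brick[offsets[c] + k]);
    }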
Accuracy requirements are mainly application dependent and can be controlled by specifying a desired compression rate. The accuracy of the resulting mixed-precision FFTs will be higher than that of the corresponding FFT using "low" precision for both storage and computation, while retaining the same performance. Moreover, going to the extreme, it is possible to tune the mixed-precision FFTs to reach the same accuracy as, e.g., FP64 FFTs. For example, this can be done by evaluating the loss of decimal digits in local computations due to round-off and using an appropriate data compression rate, so that only the valid digits get communicated.
Another idea for accelerating multidimensional FFTs is that, instead of compression, the high-precision FFT data can be dynamically split into two scaled sets of lower-precision data; the FFT transformations are applied to the two sets in parallel and the results are combined at the end. This was explored in the context of FP32 data that gets split into two FP16 sets in order to apply fast GPU Tensor Core computations on the FP16 data [47, 48], e.g., by going to a small radix, e.g., 4, where the FFT matrix can be represented exactly (even in FP16), thus benefiting from Tensor Core accelerated HGEMMs. An extension of this idea can take into account that the FFT matrix is never assembled, except for a small radix matrix that is applied to the data as a GEMM. Thus, without much computational overhead (because of the memory-bound nature of FFTs), one can use a higher-precision Fourier matrix and computations (than the precision that stores the data) to accelerate the entire FFT computation.
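A minimal sketch of the splitting step follows (our construction, loosely following the idea in [47, 48]; the scaling choice is illustrative). Since the FFT is linear, the two halves can be transformed independently and recombined as y = y_hi + y_lo / scale. The _Float16 type assumes a compiler with native FP16 support; otherwise a half-precision library type can be substituted.

    #include <cstddef>
    #include <vector>

    // Split an FP32 array into two FP16 arrays: x ≈ hi + lo / scale, where hi
    // carries the leading bits and lo the scaled remainder.
    void split_fp32_to_2xfp16(const std::vector<float>& x,
                              std::vector<_Float16>& hi,
                              std::vector<_Float16>& lo,
                              float scale) {
        hi.resize(x.size());
        lo.resize(x.size());
        for (size_t i = 0; i < x.size(); ++i) {
            hi[i] = static_cast<_Float16>(x[i]);            // leading part
            float rem = x[i] - static_cast<float>(hi[i]);   // FP32 remainder
            lo[i] = static_cast<_Float16>(rem * scale);     // scaled residual
        }
    }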
Direct methods for sparse systems can also benefit from using lower precision formats. The idea is to perform the expensive calculations in lower precision, taking advantage of the faster speed often provided by the hardware, and then to employ some cheaper "fixup" algorithm to recover the accuracy at a higher precision. Sparse factorizations, such as sparse LU and QR factorizations, are most often used to construct sparse direct solvers. Two related but orthogonal research directions can be taken here: the first concerns the factorizations themselves, and the second the resulting direct solvers.
Similar to dense LU and QR factorizations, a large fraction of the computation lies in the Schur complement updates throughout the elimination steps. In the dense case, much of the work in the Schur complement update can be realized in terms of GEMM operations. However, in the sparse case, each Schur complement update usually follows three steps: 1) gather the values from sparse data structures into contiguous memory, 2) perform the GEMM operation, 3) scatter the output of the GEMM into the destination sparse data structures.

The benefit of using lower precision is twofold: for step 2), we can use very fast lower precision vendor-provided GEMM functions, e.g., those utilizing NVIDIA's Tensor Cores; for the gather/scatter in steps 1) and 3), the amount of data movement is reduced. For the dense case, the main benefit comes from the accelerated GEMM speed. But in the sparse case, GEMM is only one part of the three steps above. Furthermore, the dimensions of the GEMM kernel calls are generally smaller and of non-uniform size throughout the factorization. Therefore, the speed gain from GEMM alone is limited, and we will need to design new schemes that enhance the overlap of the GEMM computation with the gather/scatter operations, as sketched below.
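A minimal sketch of the three-step update with the GEMM in low precision (our construction): in practice the GEMM would be a batched Tensor Core HGEMM on GPU-resident buffers, and the index maps come from the factorization's symbolic structure.

    #include <vector>

    // Naive single-precision GEMM: C += A(m x k) * B(k x n), column-major.
    void gemm_fp32(int m, int n, int k, const std::vector<float>& A,
                   const std::vector<float>& B, std::vector<float>& C) {
        for (int j = 0; j < n; ++j)
            for (int p = 0; p < k; ++p)
                for (int i = 0; i < m; ++i)
                    C[i + j * m] += A[i + p * m] * B[p + j * k];
    }

    void schur_update(int m, int n, int k,
                      const std::vector<double>& Lpanel,  // m x k, gathered source
                      const std::vector<double>& Upanel,  // k x n, gathered source
                      const std::vector<int>& rowmap,     // destination rows
                      const std::vector<int>& colmap,     // destination cols
                      std::vector<double>& F, int ldF) {  // destination block
        // 1) gather: cast panel values into contiguous low-precision buffers
        std::vector<float> A(Lpanel.begin(), Lpanel.end());
        std::vector<float> B(Upanel.begin(), Upanel.end());
        std::vector<float> C(static_cast<size_t>(m) * n, 0.0f);
        // 2) low-precision GEMM (a vendor HGEMM in practice)
        gemm_fp32(m, n, k, A, B, C);
        // 3) scatter: subtract the update into the sparse destination
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i)
                F[rowmap[i] + colmap[j] * ldF] -= static_cast<double>(C[i + j * m]);
    }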
4.2 Multiprecision sparse direct solvers

For the dense case, in Section 2.2 we revisited the mixed precision iterative refinement (IR) algorithms with adaptive precision adjustment depending on the convergence history. These algorithms can deliver high accuracy in the solution even when the expensive LU and QR factorizations are done in lower precision. We recall the IR algorithm using three precisions in Algorithm 9 [49, 50]. This algorithm is already available in the xGERFSX functions in LAPACK. The following three precisions are used:

• ε_w is the working precision used to store the input data A and b. It is the lowest precision used in the solver, and is the desired precision for the output.
• ε_x is the precision used to store the computed solution x^(i). We require ε_x ≤ ε_w, possibly ε_x ≤ ε_w^2 if necessary for componentwise convergence.
• ε_r is the precision used to compute the residuals r^(i). We usually have ε_r ≪ ε_w, typically being at least twice the working precision (ε_r ≤ ε_w^2).
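The following minimal dense sketch (our construction, not LAPACK's xGERFSX code) instantiates the three precisions of Algorithm 9 below, with ε_w = float for storage and factorization, ε_x = double for the computed solution, and ε_r = long double for the residual.

    #include <vector>

    // Gaussian elimination without pivoting, entirely in float (eps_w).
    // Illustration only: a real solver would factor once, with pivoting,
    // and reuse the factors in every refinement step.
    std::vector<float> lu_solve(std::vector<float> A, std::vector<float> b, int n) {
        for (int k = 0; k < n; ++k)
            for (int i = k + 1; i < n; ++i) {
                float m = A[i * n + k] / A[k * n + k];
                for (int j = k; j < n; ++j) A[i * n + j] -= m * A[k * n + j];
                b[i] -= m * b[k];
            }
        std::vector<float> x(n);
        for (int i = n - 1; i >= 0; --i) {
            float s = b[i];
            for (int j = i + 1; j < n; ++j) s -= A[i * n + j] * x[j];
            x[i] = s / A[i * n + i];
        }
        return x;
    }

    // Three-precision refinement: eps_w = float, eps_x = double, eps_r = long double.
    std::vector<double> refine3(const std::vector<float>& A,
                                const std::vector<float>& b, int n, int maxit) {
        std::vector<float> x0 = lu_solve(A, b, n);        // solve in eps_w
        std::vector<double> x(x0.begin(), x0.end());      // x stored in eps_x
        for (int it = 0; it < maxit; ++it) {
            std::vector<float> r(n);
            for (int i = 0; i < n; ++i) {                 // residual in eps_r
                long double s = b[i];
                for (int j = 0; j < n; ++j)
                    s -= static_cast<long double>(A[i * n + j]) * x[j];
                r[i] = static_cast<float>(s);             // rounded for the solve
            }
            std::vector<float> dx = lu_solve(A, r, n);    // correction in eps_w
            for (int i = 0; i < n; ++i) x[i] += dx[i];    // update in eps_x
        }
        return x;
    }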
Algorithm 9: Three-precision iterative refinement for direct solvers

    Solve A x^(1) = b using the basic solution method (e.g., LU or QR)    ▷ (ε_w)
    i = 1
    repeat
        r^(i) ← b − A x^(i)                                               ▷ (ε_r)
        Solve A dx^(i+1) = r^(i) using the basic solution method          ▷ (ε_w)
        Update x^(i+1) ← x^(i) + dx^(i+1)                                 ▷ (ε_x)
        i ← i + 1
    until x^(i) is "accurate enough"
    return x^(i) and error bounds

With the above setup and adaptive adjustment of ε_x and ε_r, the algorithm converges with small normwise error and error bound if the normwise condition number of A does not exceed 1/(γ(n) ε_w). Similarly, the algorithm converges with small componentwise error and error bound if the componentwise condition number of A does not exceed 1/(γ(n) ε_w). Moreover, this IR procedure can return to the user reliable error bounds, both normwise and componentwise. The error analysis in [49] should all carry through to the sparse cases. The following are example configurations of the precisions:

• ε_w = 2^−53 (IEEE-754 double precision), ε_x = 2^−53, ε_r = 2^−105 (double-double)
• ε_w = 2^−8 (bfloat16), with correspondingly finer ε_x and ε_r

Our plan is first to extend the above algorithm to the sparse direct solvers SuperLU and STRUMPACK. While doing so, we will address the following open questions:

• When ε_w is bfloat16, the error analysis and error bounds may need to be revisited.
• The relative cost of the sparse LU/QR factorization (lines 1 and 5) and the sparse matvec (line 4) is different from the dense counterpart. For a typical discretized 3D PDE problem, the respective costs are O(n^2) and O(n^{4/3}). Thus, the ratio between "expensive" and "cheap" is smaller than in the dense case, and we need to be more mindful of the higher precision calculations.

The scope of our review includes both Lanczos-based (short-term recurrence) and Arnoldi-based (long-term recurrence) methods and the associated methods for solving linear systems of equations Ax = b. In the context of long-term recurrence methods, we consider the Arnoldi-QR algorithm with the modified Gram-Schmidt implementation of the Generalized Minimum Residual (GMRES) Krylov subspace method for iteratively solving linear systems of equations. The emphasis here is to examine the approaches employed to date that incorporate mixed-precision floating point arithmetic to speed up computations, and yet retain some or all of the numerical properties of the original algorithms in full double precision arithmetic (i.e., representation error and loss of orthogonality).

We first summarize very briefly the most well-known results on the finite precision behavior of Lanczos and CG methods, and discuss how such results could potentially be extended to the mixed precision case, as well as existing progress in this area. We note that there is a huge literature on the finite precision behavior of Lanczos-based methods which we cannot hope to fully describe here; for a more thorough account and historical references, we point the reader to the manuscript of Meurant and Strakoš [51]. Fundamental relations dealing with the loss of orthogonality and other important quantities in finite precision Lanczos have been derived by Chris Paige [52]. These results were subsequently used by Anne Greenbaum to prove backward stability-like results for the CG method [53]; namely, Greenbaum showed that CG in finite precision can be seen as exact CG run on a larger linear system, in which the coefficient matrix has eigenvalues in tight clusters around the eigenvalues of the original matrix (where the diameter of these clusters depends on properties of the matrix and the machine precision).
Greenbaum also proved fundamental results on the maximum attainable accuracy in finite precision in what she calls "recursively computed residual methods", which include CG, BICG, BICGSTAB, and other Lanczos-based methods [54]. The results of Paige and Greenbaum have also been extended to s-step Lanczos/CG variants in [55], where it is shown that s-step Lanczos in finite precision behaves like classical Lanczos run in a lower "effective" precision (where this "effective" precision depends on the conditioning of the polynomials used to generate the s-step bases). We believe that these existing results can be extended to the mixed precision case; in Paige's analysis [52], he first defines one ε quantity that is used for errors in inner products and another that comes from errors in matrix-vector products, but these quantities are then combined in later theorems in order to simplify the analysis. It is possible to expand upon his analysis and keep these two quantities separate; such results could also then be interpreted in the framework of Greenbaum [53].

Existing results in the area of mixed precision Lanczos-based methods are contained within the work on "inexact Krylov subspace methods", which also applies to Arnoldi-based methods; see, e.g., the manuscripts of Simoncini and Szyld [56], and van den Eshof and Sleijpen [57]. Within such frameworks, it is assumed that the matrix-vector products are computed with some bounded perturbation (which can change in each iteration) and all other computation is exact. These methods were motivated by improving performance in applications where the matrix-vector products dominate the cost of the computation, e.g., when the matrix is dense or the application of A involves solving a linear system. Many theoretical results on inexact Krylov subspace methods, mostly focused on the maximum attainable accuracy, have been proved in the literature. A surprising result is that the inexactness in the matrix-vector products can be permitted to grow in norm as the iterations progress, at a rate proportional to the inverse of the residual norm, without affecting the maximum attainable accuracy. However, a crucial practical question is whether inexactness affects the convergence behavior before the attainable accuracy is reached; this is entirely possible in the case of short-term recurrence methods such as CG and has not been well studied theoretically.

We briefly mention works which make use of mixed precision Krylov subspace methods in practical applications, focusing on performance rather than on theoretical results. One instance of this is the work of Clark et al. [58], which uses mixed precision CG and BICGSTAB methods implementing the "reliable update" strategy of Sleijpen and van der Vorst [59] within a Lattice QCD application run on GPUs. The idea behind the "reliable update" strategy is that the true residual is computed and used to replace the recursively updated residual in select iterations, thus improving the attainable accuracy; this is done in conjunction with batched updates to the solution vector. By using higher (double) precision only in the true residual computations and group updates (and single or half precision for the rest of the computation), the authors claim they are able to achieve full double precision accuracy. This deserves further theoretical study, which we believe can be achieved by extending the results in [59] and the related work of van der Vorst and Ye [60] to the mixed precision setting.

5.2 QR MGS-GMRES
For MGS-GMRES, the mixed precision work by Gratton et al. [61] is the most recent and appropriate; in particular, the loss-of-orthogonality relations due to Björck [62] and Paige [52], later refined by Paige, Rozložník, and Strakoš [63], are employed in order to provide tolerances for mixed single-double computations. MGS-GMRES convergence stalls (the norm-wise relative backward error approaches ε) when linear independence of the Krylov vectors is lost, and this is signaled by Paige's S matrix norm reaching ‖S‖ = 1. The S matrix [64] is derived from the lower triangular T matrix appearing in the rounding error analyses by Giraud et al. [65].

To summarize, the Gratton et al. [61] paper starts from the Arnoldi-QR algorithm using the modified Gram-Schmidt algorithm, first assuming exact arithmetic in the MGS-GMRES iterative solver. The Arnoldi-QR algorithm applied to a non-symmetric matrix A produces the matrix factorization, with loss of orthogonality F_k,

    A V_k = V_{k+1} H_k,    V_{k+1}^T V_{k+1} = I + F_k.    (4)

They next introduce inexact (e.g., single precision) inner products; this directly relates to the loss-of-orthogonality relations for the A = QR factorization produced by MGS. The resulting loss of orthogonality, as measured by ‖I − Q^T Q‖, grows as O(ε) κ(A), as derived by Björck [62], and as O(ε) κ([r_0, A V_k]) for Arnoldi-QR, as described by Paige, Rozložník, and Strakoš [66, 63] and related work. The inexact inner products are given by

    h_ij = v_i^T w_j + η_ij,    (5)

where the h_ij are elements of the Hessenberg matrix H_k, and the Arnoldi-QR algorithm produces a QR factorization of the matrix

    [r_0, A V_k] = V_{k+1} [β e_1, H_k].    (6)

The loss of orthogonality relations for F_k are given by

    F_k = U_k + U_k^T,    (7)

where U_k is the strictly upper triangular matrix of cross products of the Krylov vectors, with entries (U_k)_{ij} = v_i^T v_j for i < j. Define also the strictly upper triangular matrix N_k collecting the inner product errors η_ij and the upper triangular factor R_k formed from the Hessenberg entries h_ij. (8) The loss of orthogonality relation derived by Björck [62] for the A = QR factorization via the modified Gram-Schmidt algorithm can then be applied to the Arnoldi-QR algorithm to obtain

    N_k = −[0, U_k] H_k = −U_k R_k.    (9)

The complete loss of orthogonality (linear independence) of the Krylov vectors in MGS-GMRES signals that the minimum error has been achieved, and GMRES then stalls, or really can go no further than when the norm-wise relative backward error reaches O(ε). Gratton et al. [61] show how to maintain sufficient orthogonality in order to achieve a desired relative residual error level, by switching the inner products from double to single at certain tolerance levels, and combine this with inexact matrix-vector products as in van den Eshof and Sleijpen [57] and Simoncini and Szyld [56].

In practice, the restarted variant of GMRES is often employed to reduce memory requirements. The algorithm produces both implicit and explicit residuals, so we might ask whether either can be computed in reduced precision. The work described herein on iterative refinement by Nick Higham and Erin Carson for mixed precision can be applied to analyse the convergence of restarted GMRES(m), assuming a fixed number of iterations, because restarted GMRES is just iterative refinement with GMRES as the solver for the correction term. However, a more detailed analysis with experiments has yet to be performed.
We are fairly certain that the residual computations must be performed in higher precision in order to achieve a norm-wise backward error close to double precision machine round-off.

5.3 Alternative Approaches

Although somewhat outside the scope of this review, we can demonstrate that it is possible to modify the Gratton et al. [61] analysis based on the inverse compact
WY form of the MGS algorithm, introduced by Świrydowicz et al. [67]. Rather than treating all of the inner products in the MGS-GMRES algorithm equally, consider the strictly upper triangular matrix U = L^T from the loss of orthogonality relations. We introduce the single precision computation L_{:,j−1} = Q_{j−1}^T q_{j−1} and the double precision triangular solve r = (I + L_{j−1})^{−1} Q_{j−1}^T a to update R, as this directly employs the forward error analysis of Higham [68]. The former affects the loss of orthogonality, whereas the latter affects the representation error for QR, and then also for Arnoldi-QR. This could allow more (or most) of the inner products to be computed in single precision.

Evidence for maintaining orthogonality is provided in Figure 11, with ‖I − Q^T Q‖ plotted for A = QR using the inner products of standard MGS (blue) in double precision versus the inverse compact WY MGS (red) with Q_{j−1}^T q_{j−1} in single precision (simulated in MATLAB); we observe at most the same or slightly higher error levels. The x-axis is the log condition number of randomly generated matrices. The lower triangular solve is computed in double precision.

Barlow [69] contains similar if not the same algorithm formulations in block form. His work is related to Björck's 1994 paper [70, Section 7], which derives the triangular matrix T using a recursive form for MGS, referred to as a "compact WY" representation in the literature. While Björck used a lower triangular matrix for the compact WY form of MGS, Malard and Paige [71] derived the upper triangular form, also employed by Barlow, which reverses the order of the elementary projectors. The latter is unstable in that a backward recurrence leads to O(ε) κ(A) loss of orthogonality. An interesting observation from Julien Langou is that the upper triangular form is less stable than the lower triangular one (even though the backward-forward algorithm results in re-orthogonalization; see the algorithm in Leon, Björck, and Gander [72]).

Barlow [69] employs the Householder compact WY representation of reflectors and also refers to the work of Chiara Puglisi [73] (discussed in Joffrain et al. [74]), where this is referred to as the "inverse compact WY" representation of Householder; this originally comes from Walker's work on Householder GMRES [75]. Barlow then extends this approach to the block compact WY form of MGS; see also the technical report by Sun [76]. The contribution of Świrydowicz et al. [67] was to note that there exists an inverse compact WY representation for MGS, having the projector

    P = I − Q_{j−1} T Q_{j−1}^T = I − Q_{j−1} (I + L_{j−1})^{−1} Q_{j−1}^T,

and to "lag" the norm ‖q_{j−1}‖ so that these can be computed in one global reduction. Barlow [69] makes this connection for blocks (in effect this is given in his equation (3.10)) and references Puglisi [73].

Björck and Paige [77] made the link between Householder and MGS based on the observation made by Sheffield. Paige defines this to be augmentation, and Gratton et al. [61] also reference this work. Paige has also recently extended these augmentation ideas to Lanczos. The T matrix appears in Paige's work with Wülling [78] and then later in [64] to derive the loss of orthogonality matrix S = (I + L_{j−1}^T)^{−1} L_{j−1}^T.
This also appears in the work of Giraud, Gratton, and Langou [65]; Langou also worked with Barlow and Smoktunowicz [79] on the Pythagorean trick to reduce cancellation error in the computation of vector norms and on a Cholesky-like form of classical Gram-Schmidt (CGS).

In order to combine single-double floating-point operations in MGS-GMRES, at first it appears that we could store the T matrix in single precision, but then we would still have to form Q_{j−1}^T a and store Q_{j−1} in double precision. By examining the cost trade-offs a bit further, we can instead use a form of re-orthogonalization based on a backward-forward solver recurrence

    T = (I + L_{j−1}^T)^{−1} (I + L_{j−1})^{−1},

and our initial computational results demonstrate this works well, driving the relative residual to O(ε) for double precision, with orthogonality maintained at the level of single precision roundoff. The representation error (backward error) for A + E = QR computed by MGS is not affected by single precision inner products and remains O(ε). We are not aware of whether or not this was previously known.

Figure 11: Loss of orthogonality for the mixed single-double MGS algorithm.

Figure 12: GMRES residuals and loss of orthogonality ‖S‖ for the impcol_e matrix.
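A minimal sketch (our construction, mirroring the MATLAB simulation behind Figure 11) of the mixed-precision inverse compact WY MGS step: the inner products are accumulated in single precision, while the triangular solve that forms the column of R is done in double precision.

    #include <vector>

    using Mat = std::vector<std::vector<double>>;  // columns q_0 .. q_{j-1}

    // Single-precision inner products c = Q^T v (the "inexact" part).
    std::vector<double> qtv_fp32(const Mat& Q, const std::vector<double>& v) {
        std::vector<double> c(Q.size());
        for (size_t i = 0; i < Q.size(); ++i) {
            float s = 0.0f;
            for (size_t k = 0; k < v.size(); ++k)
                s += static_cast<float>(Q[i][k]) * static_cast<float>(v[k]);
            c[i] = static_cast<double>(s);
        }
        return c;
    }

    // Double-precision forward solve r = (I + L)^{-1} c, with L strictly
    // lower triangular (L[i][k] valid for k < i); r updates the column of R.
    std::vector<double> solve_IL(const Mat& L, std::vector<double> c) {
        for (size_t i = 0; i < c.size(); ++i)
            for (size_t k = 0; k < i; ++k)
                c[i] -= L[i][k] * c[k];
        return c;
    }

The new column is then orthogonalized as a ← a − Q_{j−1} r, with the update kept in double precision.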
6 Multiprecision Preconditioners

In the iterative solution process for large sparse systems, preconditioners are an important building block facilitating satisfactory convergence. The concept of preconditioning is to turn an ill-conditioned linear system Ax = b into a (left-)preconditioned system MAx = Mb (or AMy = b, x = My for right preconditioning), which allows for faster convergence of the iterative solver. The convergence characteristics typically depend on the conditioning of the target system. For an ill-conditioned A, the preconditioner is itself necessarily ill-conditioned; otherwise, the preconditioner cannot be expected to improve the conditioning of the problem or the convergence of the iterative solver. In that respect, the preconditioner basically tries to approximate the inverse of the system matrix. Obviously, if the preconditioner is the exact inverse, the solution is readily available; however, computing the exact inverse is prohibitively expensive, and in most cases the preconditioner is just a rough approximation of the system matrix inverse. As a consequence, it is natural to question the need for using high precision for a preconditioner that inherently carries only limited accuracy. Indeed, choosing a lower precision format for the preconditioner is a valid strategy as long as the accuracy loss induced by the lower precision format impacts neither the preconditioner accuracy nor its regularity. For example, Trilinos allows the use of low precision preconditioners inside high precision iterative solvers, see Section 9, and the hypre team works on multigrid methods running the first cycles in lower precision.

However, the use of lower precision in the preconditioner application results in different rounding effects than when using high precision. Specifically, the rounding effects make the preconditioner non-constant, as they are not only larger than in high precision, but also depend on the input data [80]. As a result, low precision preconditioners can only be used to accelerate an iterative method that can handle non-constant preconditioners, i.e., that converges even if the preconditioner changes between iterations. For Krylov subspace solvers generating search directions orthogonal to the previous search directions, a changing preconditioner requires an additional orthogonalization of the preconditioned search direction against the previous preconditioned search directions. The flexible Krylov solvers (e.g., FGMRES, FCG) contain this additional orthogonalization and are therefore slightly more expensive; at the same time, they do allow for low precision preconditioners, which can compensate for the additional cost.

An alternative workaround is to decouple the memory precision from the arithmetic precision, see Section 7, and only store the preconditioner in low precision but apply it in high precision [80], as in the sketch below. Running all arithmetic in high precision keeps the preconditioner constant and removes the need for the additional orthogonalization of the preconditioned search direction. On the other hand, decoupling the memory precision from the arithmetic precision requires converting between the formats on the fly when reading data from main memory. Fortunately, most iterative solvers and preconditioners are memory bound, and the conversion can be hidden behind the memory transfers. A production-ready implementation of an adaptive precision block-Jacobi preconditioner decoupling the memory precision from the arithmetic precision is available in the Ginkgo library, see Section 9.
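A minimal sketch (our construction, not Ginkgo's implementation) of this decoupling for a block-Jacobi preconditioner: the inverted diagonal blocks are stored in FP32, but every apply up-converts on the fly and performs all arithmetic in FP64, so the preconditioner remains a constant operator.

    #include <vector>

    struct BlockJacobi {
        int bs;                        // block size
        std::vector<float> blocks;     // inverted diagonal blocks, stored in FP32

        // y = M x, arithmetic in double; reads convert FP32 -> FP64 on the fly.
        void apply(const std::vector<double>& x, std::vector<double>& y) const {
            int nblocks = static_cast<int>(blocks.size()) / (bs * bs);
            for (int b = 0; b < nblocks; ++b) {
                const float* B = &blocks[static_cast<size_t>(b) * bs * bs];
                for (int i = 0; i < bs; ++i) {
                    double s = 0.0;
                    for (int j = 0; j < bs; ++j)   // on-the-fly upconversion
                        s += static_cast<double>(B[i * bs + j]) * x[b * bs + j];
                    y[b * bs + i] = s;
                }
            }
        }
    };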
Across the complete hardware technology food chain, we are witnessing a widening gap between the compute power in terms of floating point operations per second on the one side and the communication power in terms of memory bandwidth on the other. In modern processor technology, retrieving values from main memory takes several orders of magnitude longer than performing arithmetic operations, and communicating between distinct nodes of a cluster is again orders of magnitude slower than main memory access. In consequence, more and more algorithms hit the memory wall, and already today virtually all applications inside the ECP ecosystem are memory bound on modern hardware architectures. With no disruptive hardware changes on the horizon, we are facing a situation where all applications suffer from the slow communication to main memory or in-between nodes.

A promising, and maybe the only promising, strategy to overcome this problem is to utilize the bandwidth capacity more carefully, reduce the communication volume and the number of communication points, and, whenever possible, trade communication against computations. Specifically, the idea is to radically decouple the memory precision from the arithmetic precision, employ high precision only in the computations, and lower the precision as much as possible when accessing data in main memory or communicating with remote processors [81].

Figure 13: Accessor separating the memory format from the arithmetic format and realizing on-the-fly data conversion in each memory access.
Figure 14: Storage format optimization for block-Jacobi: starting from the most compact storage (top left), the format is extended in exponent bits to fit the data range (rightwards) and to preserve regularity (downwards) until both are satisfied. The format is selected per block after inverting it via Gauss-Jordan elimination and computing its condition number and exponent range.

An important aspect in this context is the design of a "memory accessor" that converts data on the fly between the IEEE high precision arithmetic format and the memory/communication format, see Figure 13. Also, the memory/communication format does not necessarily have to be part of the IEEE standard, but can be an arbitrary composition of sign, exponent, and significand bits [82], or even a nonstandard format like L. Gustafson's Unum format [83]. Obviously, there is a close link to the idea of complementing the format separation with compression techniques, as proposed in Section 3.

A proof of concept for the idea of decoupling the arithmetic precision from the memory precision is the adaptive precision block-Jacobi preconditioner [80] available in the Ginkgo sparse linear algebra library. The idea here is to compute a block-Jacobi preconditioner in high precision, but then store the distinct inverted diagonal blocks in the lowest floating point precision format that avoids overflow and still preserves the regularity of the preconditioner, see Figure 14. This storage format is chosen for each diagonal block individually, reflecting its characteristics. Figure 15 (top) visualizes the distribution of formats when storing the inverted diagonal blocks of size 24 for symmetric positive definite matrices of the SuiteSparse Matrix Collection. Obviously, converting to a lower precision format generally reduces the accuracy of the linear operator; but as block-Jacobi preconditioners ignore all off-(block)diagonal entries, they are typically only a rough approximation of the matrix inverse and therewith by design have only very limited accuracy. Experimental results reveal that the use of a lower precision format for storing the inverted diagonal blocks has in most cases only negligible effects on the preconditioner effectiveness and the outer solver convergence. At the same time, storing the inverted diagonal blocks in lower precision reduces the memory access volume in every preconditioner application, therewith accelerating the bandwidth-bound iterative solution process, see Figure 15.

Figure 15: Top: Distribution of floating point formats among the distinct blocks when preserving 1 digit of accuracy in each inverted diagonal block. Each column represents one symmetric positive definite matrix of the SuiteSparse Matrix Collection. Bottom: Impact on the top-level CG solver solving the system-induced linear problem. For most systems, the convergence rate is unaffected by the use of a lower storage precision format, and all preconditioner applications are faster, resulting in an average 20% runtime reduction.
For the adaptive precision block-Jacobi preconditioner, it is important that the accessor converts the inverted diagonal blocks back to the IEEE standard precision not only for performance reasons (leveraging the highly optimized IEEE floating point arithmetic of the processors) but also for numerical reasons: using the working precision in the arithmetic operations of the preconditioner application preserves the preconditioner as a constant operator, whereas applying the preconditioner in lower precision would result in a non-constant preconditioner and require the use of a (more expensive) flexible iterative solver [80].
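In the same spirit, a minimal sketch (our construction) of the accessor idea of Figure 13: the storage type and the arithmetic type are decoupled template parameters, and each access converts on the fly.

    #include <vector>

    template <typename StorageT, typename ArithT>
    class Accessor {
        std::vector<StorageT> mem_;   // compact memory format
    public:
        explicit Accessor(size_t n) : mem_(n) {}
        ArithT load(size_t i) const {        // upconvert on read
            return static_cast<ArithT>(mem_[i]);
        }
        void store(size_t i, ArithT v) {     // downconvert (round) on write
            mem_[i] = static_cast<StorageT>(v);
        }
    };

    // Usage: all arithmetic in double, memory traffic in float, e.g.,
    //   Accessor<float, double> a(n);  a.store(i, a.load(i) * alpha);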
Multigrid methods are highly effective iterative methods. There are basically two types of multigrid methods: geometric multigrid (GMG) and algebraic multigrid (AMG). GMG requires actual grids on each level to generate its components, whereas AMG can be considered more of a black box method, in that it can be given a matrix and right-hand side and will generate the components for each level automatically using sensible heuristics. These methods are an interesting target for multiprecision treatment due to their different components, which affect the overall algorithm in different ways. GMG and AMG combine smoothers, coarse grids, and restriction and prolongation operators on each level. In addition, it is of interest to investigate changes in precision on different levels. Finally, GMG and AMG can be used as preconditioners for other solvers, i.e., there is potential to use lower precision across the whole preconditioner. Historically, most work focused on the use of a lower precision GMG or AMG method as a preconditioner to a double precision solver. Additionally, there are attempts to apply ZFP [41] within MG or to establish an error analysis framework for AMG.

Ljungkvist and Kronbichler [84, 85] successfully use mixed precision to solve the Laplace problem for different orders with a matrix-free geometric multigrid approach. Their solver infrastructure allows for mixed-precision arithmetic that performs the multigrid V-cycle in single precision with an outer correction in double precision, increasing throughput by up to 83 percent. Similarly, Glimberg et al. [86] use a single precision multigrid to precondition a double precision defect correction scheme and solve the Laplace problem within a nonlinear water wave application on a GPU architecture. They achieve a speedup of up to 1.6 for the mixed precision version over the double precision version, and a speedup of 1.9 for a purely single precision version. Yamagishi and Matsumura [87] also apply a single precision multigrid to a double precision conjugate gradient solver for the Poisson/Helmholtz problem within their non-hydrostatic ocean model. They report a speedup of up to 2 for a single precision matvec over a double precision one and improved overall times using this approach; however, they compare the full application run only to their CPU version.

There are various publications that pursue the same strategy of using a single precision AMG preconditioner with a double precision solver. Emans and van de Meer [88] perform a careful analysis of the individual kernels of preconditioned Krylov solvers on multi-core CPUs, including sparse matrix-vector multiplications (SpMV), which make up a large portion of AMG. They also consider the effect of communication, where lower precision leads to smaller messages, but latencies are still an issue, particularly on the coarsest levels of AMG. They find that the use of mixed precision for the preconditioner barely affects convergence, and therefore the speedups for the kernels, which were between 1.1 and 1.5, can potentially carry over to the whole solver and lead to improved runtimes within CFD applications.

Sumiyoshi et al. [89] investigate AMG performance on a heterogeneous computer architecture with both CPUs and GPUs for isotropic and anisotropic Poisson problems. They consider smoothed aggregation AMG as a stand-alone solver and carefully analyze different portions of the algorithm on five different architectures, including one multi-core CPU cluster.
They report speedups between 1.2 and 1.6 on the GPU-CPU architectures for the mixed-precision implementation over the double precision version. These speedups are related to the SpMV performance (between 1.6 and 1.8) on these architectures. However, the mixed-precision version was slightly slower on the CPU-only architecture, which achieved barely any improvement for the SpMV operations.

Richter et al. [90] examine the performance of a single precision AMG preconditioner (ML and PETSc) applied within a double precision PCG solver. They apply the method to an electrostatic simulation of a high voltage isolator on a GPU/CPU computer architecture. Their mixed precision version takes about 84 percent of the time of the double precision version. An approach described in a presentation by Kate Clark [91] takes the use of mixed precision even further, to involve half precision: Clark and collaborators achieved good results using a double precision defect correction approach with a single precision Krylov solver and a half precision AMG preconditioner.

Another interesting related study, by Fox and Kolasinski [92], examines the use of ZFP, a lossy compression algorithm, within multigrid. Due to the local structure of ZFP, it can easily be integrated into numerical simulations without changing the underlying algorithms. However, since ZFP is lossy, it introduces some error; thus, it is important to understand whether the error caused by ZFP overwhelms other traditional sources of error, such as the discretization error. The study shows that for MG on a Poisson problem, applying ZFP to the approximation vector can significantly decrease memory use, and is thus expected to decrease runtimes, while the generated errors stay below the discretization error. Since a hardware version of ZFP is not available yet, no actual runs were possible; however, the results show good potential for using ZFP within GMG and/or AMG as a preconditioner.

Currently, Tamstorf et al. [93] appear to be the only ones who have investigated the theory of multiprecision multigrid methods. Their original intent was to improve the appearance of the movement of cloth within Disney movies, which requires higher than FP64 accuracy; however, their theory applies equally to decreased precision. They have created a theoretical framework with rigorous proofs for a mixed-precision version of multigrid for solving the algebraic equations that arise from discretizing linear elliptic partial differential equations (PDEs). The arising matrices being sparse and symmetric positive definite enables the use of the so-called energy or A-norm to establish convergence and error estimates. Bounds on the convergence behavior of multigrid are developed and analyzed as a function of the matrix condition number. Both theoretical and numerical results confirm that convergence to the level of discretization accuracy can be achieved with mixed-precision versions of V-cycles and full multigrid. This framework is inspired by the results of Carson and Higham [15] but ultimately provides tighter bounds for many PDEs. Tamstorf et al. [94] further extend their theoretical framework to include the quantization error. They use the bounds to guide the choice of precision level in their progressive-precision multigrid scheme by balancing quantization, algebraic, and discretization errors.
They show that while iterative refinement is susceptible to quantization errors during the residual and update computation, the V-cycle used to compute the correction in each iteration is much more resilient and continues to work even if the system matrices in the hierarchy become indefinite due to quantization.

Modern high-performance computing (HPC) hardware continues to experience an ongoing shift towards supporting a variety of reduced-precision formats for representing floating-point numbers in order to offer much higher performance rates. However, portability is often of little concern, as the hardware tends to serve only a specific set of workloads that are of special interest to the particular vendor. Examples include Intel's Cascade Lake Vector Neural Network Instructions (VNNI) and the recently announced Xe platform for graphics cards, AMD's Radeon Instinct cards (MI6, MI8, MI25, MI50, MI60), and NVIDIA's compute cards from the Pascal, Volta, and Turing series. Finally, ARM included 16-bit floating point (FP16) in its NEON vector unit specification VFP 8.2-A. These accelerators follow two types of specifications for 16-bit floating-point formats: the IEEE-compliant FLOAT16 and the extended-range BFLOAT16.

At the same time, a new breed of accelerators takes the use of reduced precision to a new level by targeting machine learning workloads. This new hardware includes Cloud AI 100 by Qualcomm, Dot-Product Engine by HPE, Eyeriss by MIT [95], TPU by Google [96], MAERI by Georgia Institute of Technology [97], Deep Learning Boost by Intel, CS-1 by Cerebras, and Zion by Facebook. In general, the machine learning community has been more aggressive in evaluating multiple precisions, to the extent that even a 1-bit Stochastic Gradient Descent has been considered [98]. The typical use case in machine learning is to perform the training in 32-bit arithmetic and use a different precision for the inference task. Quantization for inference is supported in popular frameworks like TensorFlow [99] and PyTorch [100]. Quantization is the approach of storing tensors, and computing on them, using bit widths lower than standard floating point bit widths. Even in machine learning frameworks, the support for quantization is limited to just the key functionality needed for convolutional or recurrent neural networks, with some limited hardware support. For example, PyTorch and TensorFlow support 8-bit quantization for activations and weights. This allows using 8 bits for inference where the additional 2-4x performance is necessary. On the training front, it has been shown that 16-bit training is sufficient for certain tasks [1, 101]. The recent Gordon Bell winner demonstrated that lower-precision training can be used for scientific machine learning tasks as well [102].

The effort in deep learning analogous to the examples of interest in scientific computing involves training the network in lower precision and performing the inference in a higher one [103, 104]. The compute imbalance between training and inference is even higher than that between factorization and the subsequent iterative refinement. Another difference is that in the context of neural network training, lowering the precision may be incorporated into the model as a regularizer.
Ginkgo is a modern sparse linear algebra library able to run on multi- and manycore architectures [105]. The library design is guided by combining ecosystem extensibility with heavy, architecture-specific kernel optimization using the platform-native languages CUDA (NVIDIA GPUs), HIP (AMD GPUs), or OpenMP (Intel/AMD/ARM multicore). The software development cycle ensures production-quality code by featuring unit testing, automated configuration and installation, Doxygen code documentation, as well as a Continuous Integration (CI) and Continuous Benchmarking framework.

Ginkgo uses a static template parameter for the value type and a template parameter for the integer type to allow for compilation in different precision formats. The standard value types supported are IEEE double precision, IEEE single precision, double complex precision, and single complex precision. Theoretically, Ginkgo can also be compiled for any other (arbitrary) precision format, but the support on both the hardware and the software side is very limited outside the IEEE standard. Aside from being compilable for different precision formats, Ginkgo features the adaptive precision block-Jacobi preconditioner, decoupling the memory precision from the arithmetic precision and optimizing the storage format for each inverted diagonal block individually. Even though it heavily leverages advanced multiprecision technology, the numerical considerations of the adaptive precision block-Jacobi preconditioner are fully automated and hidden from the user, who can employ the functionality as a black-box algorithm without numerical degradation. Building upon the knowledge gained with the adaptive precision block-Jacobi, Ginkgo is currently employing the accessor concept to consistently separate the memory precision from the arithmetic precision, see Section 7. An orthogonal multiprecision technology that is under consideration for integration into Ginkgo is a multiprecision SpMV based on value clustering.
The Highly Efficient FFTs for Exascale (heFFTe) library provides fast and robust multidimensional FFT routines for exascale platforms. heFFTe leverages established but ad hoc software tools that have traditionally been part of application codes, but were not extracted as independent, supported libraries. These multidimensional FFTs rely on third-party 1D FFTs, either from FFTW or from vendor libraries.

FFTs are used in many domain applications, including molecular dynamics, spectrum estimation, fast convolution and correlation, signal modulation, and wireless multimedia applications. For example, the distributed 3D FFT is one of the most important kernels used in molecular dynamics computations, and its performance can affect an application's scalability on larger machines. Similarly, the performance of first-principles calculations depends strongly on the performance of the FFT solver. Specifically, for DOE, we found that more than a dozen ECP applications use FFTs in their codes. However, the current state-of-the-art FFT libraries are not scalable on large heterogeneous machines with many nodes, or even on one node with multiple high-performance GPUs (e.g., several NVIDIA V100 GPUs). To address these needs, the heFFTe v0.2 release demonstrates very good weak and strong scalability and a very high performance that is close to 90% of the roofline theoretical peak. This is achieved through (1) efficient use of the GPUs' high bandwidth, (2) algorithms that reduce global communications where possible, and (3) employment of GPUDirect technologies as well as MPI optimizations. heFFTe provides multiprecision capabilities with support for both single and double precision arithmetic. heFFTe is a C++ library, and the arithmetic used is templated, so that other precisions can easily be added. Current work is on adding mixed-precision capabilities using mixed-precision MPI and compression, as described in Section 3.2.

hypre is a software library of high-performance preconditioners and solvers for the solution of large, sparse linear systems of equations on massively parallel computers. The hypre library was created with the primary goal of providing users with advanced parallel preconditioners. The library features parallel multigrid solvers for both structured and unstructured grid problems. For ease of use, these solvers are accessed from the application code via hypre's conceptual linear system interfaces, which allow a variety of natural problem descriptions and include a structured, a semi-structured, and a linear-algebraic interface. The (semi-)structured interfaces are an alternative to the standard matrix-based interface, give users a more natural means for describing linear systems, and provide access to structured multigrid solvers, which can take advantage of the additional information.
The Kokkos Kernels project primarily focuses on performance-portable kernels for sparse/dense linear algebra and graph kernels. Kokkos Kernels relies on the Kokkos programming model for portability. The focus of the sparse linear algebra kernels has been to support the requirements of frameworks such as Trilinos and of computational science applications. The sparse linear algebra data structure used is compressed row storage. Kokkos Kernels provides kernels for sparse matrix-vector multiplication, sparse matrix-matrix multiplication, ILU(k) factorization, a Gauss-Seidel preconditioner, and triangular solves when the triangular factors arise from direct solvers or incomplete factorizations. All these kernels are templated on the matrix and vector types, allowing multiple precision support from the initial software design. Kokkos Kernels also supports dense linear algebra kernels for team-level BLAS and LAPACK functionality. This allows computational science applications to use BLAS and LAPACK operations in the "inner loop" when programming for accelerators. The BLAS and LAPACK functionality is also templated on the scalar type, allowing multiprecision use. Kokkos Kernels also supports graph kernels such as distance-1 coloring, distance-2 coloring, and triangle counting.
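As an illustration of this design, here is a minimal generic sketch (our construction, not the Kokkos Kernels API) of a CSR SpMV templated on the scalar and ordinal types; the same source instantiates for float, double, or other scalar types.

    #include <vector>

    template <typename Scalar, typename Ordinal>
    void spmv_csr(Ordinal nrows,
                  const std::vector<Ordinal>& rowptr,  // nrows + 1 entries
                  const std::vector<Ordinal>& colidx,
                  const std::vector<Scalar>& values,
                  const std::vector<Scalar>& x, std::vector<Scalar>& y) {
        for (Ordinal i = 0; i < nrows; ++i) {
            Scalar sum = Scalar(0);
            for (Ordinal k = rowptr[i]; k < rowptr[i + 1]; ++k)
                sum += values[k] * x[colidx[k]];
            y[i] = sum;
        }
    }

    // Same kernel, multiple precisions:
    //   spmv_csr<float, int>(...);   spmv_csr<double, int>(...);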
MAGMA provides LAPACK functionality and a large number of highly optimized dense and sparse linear algebra (LA) routines for heterogeneous architectures. Besides LAPACK, other dense LA routines in MAGMA include BLAS, Batched BLAS and LAPACK, and mixed-precision factorizations and solvers. A MAGMA Sparse component provides support for sparse iterative solvers and preconditioners, a number of sparse matrix formats and conversion routines, SpMV/SpMM, and auxiliary kernels.

MAGMA addresses the complex challenges of heterogeneous compute environments by providing hybridized software that combines the strengths of different algorithms for different hardware components. MAGMA's LA algorithms specifically target hybrid manycore systems featuring GPUs and thus enable applications to fully exploit the power offered by each of the hardware components. MAGMA provides solvers for linear systems, least squares, eigenvalue problems, and singular value problems. Designed to be similar to LAPACK in functionality, data storage, and interface, the MAGMA library allows scientists to seamlessly port any linear algebra reliant software components to heterogeneous architectures. MAGMA allows applications to fully exploit the power of current heterogeneous systems of multi/manycore CPUs and multiple GPUs to deliver the fastest possible time to an accurate solution within given energy constraints.

MAGMA provides mixed-precision solvers using LU, Cholesky, or QR factorizations. In terms of low precision developments, the latest MAGMA release to date (v2.5.3) provides an optimized batch HGEMM kernel that outperforms the vendor BLAS for relatively small sizes. It also provides a mixed-precision linear solver for Ax = b in double precision that takes advantage of half precision during the LU factorization. The mixed-precision solver is up to × faster than a direct FP64 solver, and converges to double precision accuracy provided the condition number κ∞(A) of the matrix is not too large.

PETSc is a suite of data structures and routines for the scalable solution of scientific applications modeled by partial differential equations; TAO is a scalable library for numerical optimization. PETSc/TAO can be easily used in application codes written in C, C++, Fortran, and Python. PETSc is written in pure C89 (recently extended to support portions of C99 that are supported by the more recent Microsoft C compilers). The emphasis has always been on ultimate portability via the Fortran and C standards. At the same time, we have always ensured that PETSc compiles completely with C++ as well, and that the C-compiled version can be used from C++-compiled code. The largest hassle in this regard has always been the differences in complex number handling between C and C++, requiring extensive code to handle the differences. PETSc can be built only for a single scalar type and precision at a time, for example, real numbers in quad precision. Since C does not offer templates, managing multiple integrated precisions is difficult. For CPUs, PETSc supports half precision (ARM only), single, double, and quad (GNU compilers only). The above applies to CPU-based systems. For GPUs, where large improvements in time to solution are possible with lower precision, PETSc will use its GPU interfaces to allow computing with a selected precision at runtime on the GPUs. If the numerical values are in, say, double precision on the CPU, they would be converted to, for example, single precision when transferred to the GPU for the computation.
Of course, the more desirable case where the data remains on the GPU will require less conversion, except when specifically needed, for example, when ill-conditioning requires a portion of the computation to be done in higher precision.
PLASMA (Parallel Linear Algebra Software for Modern Architectures) [5] is a software package based on modern OpenMP for solving problems in dense linear algebra. PLASMA provides implementations of state-of-the-art algorithms using modern task scheduling techniques, building on OpenMP's data-dependence tracking and task scheduling. PLASMA provides routines for solving linear systems, least squares problems, eigenvalue problems, and singular value problems. The PLASMA library allows scientists to easily port their existing software components from LAPACK to PLASMA to take advantage of new multicore architectures. PLASMA provides a LAPACK-style interface for maximum portability and compatibility. An interface with a more efficient data storage is also provided to achieve performance as close as possible to the computational peak performance of the machine.
SLATE is a distributed, GPU-accelerated library for dense linear algebra, intended as a replacement for ScaLAPACK. To this end, SLATE provides parallel Basic Linear Algebra Subprograms (BLAS), norms, linear system solvers, least squares solvers, and singular value and eigenvalue solvers. It is written using modern C++, with ScaLAPACK- and LAPACK-compatible wrappers.

SLATE provides mixed-precision solvers using both LU and Cholesky factorizations. The factorization is done in a lower precision, then iterative refinement is applied to improve the accuracy to a higher precision. The code is templated on the two precisions; currently single/double and single-complex/double-complex are supported. Future plans include using half precision and a more robust GMRES-based refinement mechanism.
9.9 STRUMPACK

STRUMPACK is a distributed, GPU-accelerated library for dense and sparse linear algebra using rank-structured matrix approximations, including hierarchically semiseparable (HSS), hierarchically off-diagonal low rank (HODLR), butterfly, and a non-hierarchical format called block low rank (BLR). The baseline sparse STRUMPACK is a multifrontal sparse LU direct solver. The frontal matrices in the sparse factors can be approximated with the above rank-structured formats, serving as effective sparse preconditioners with nearly optimal complexity in flops and memory. Sparse STRUMPACK relies on ButterflyPACK for the HODLR and butterfly formats, and provides C++ interfaces to the ButterflyPACK Fortran library.

STRUMPACK is written using modern C++, with templated datatypes to support various precisions, including real and complex, single and double precision. It can also support half precision. Currently, iterative refinement and GMRES are performed in the same working precision as the factorization.
SuperLU is a distributed, GPU-accelerated sparse direct solver for general sparse linear systems, using supernodal techniques in the LU factorization and triangular solves. It uses MPI+OpenMP+CUDA to support various forms of parallelism. Routines are also provided to equilibrate the system, estimate the condition number, calculate the relative backward error, and estimate error bounds for the refined solutions.

SuperLU is written in C and is callable from either C or Fortran programs. The code base uses macros to template the datatypes, so it can support a mixture of various precisions, including real and complex, and single, double, and half precision. Currently, iterative refinement is performed in the same working precision as the factorization. Work is in progress to provide lower precision factorization coupled with higher precision iterative refinement.
The Trilinos Project is a premier software framework in scientific computing for the solution of large-scale, complex multiphysics engineering and scientific problems. Trilinos is object-oriented and organized into about 60 different packages, each with a specific focus. These packages include linear and nonlinear solvers, preconditioners (including algebraic multigrid), graph partitioners, eigensolvers, and optimization algorithms, among other things. Users are required to install only the subset of packages related to the problems they are trying to solve. Trilinos supports MPI+X, where X can be CUDA, OpenMP, etc. (anything Kokkos supports). In Trilinos, the scalar type is a template parameter, typically set to IEEE double precision, while IEEE single precision is also fully supported. Users can employ preconditioners that are compiled in single precision inside a double precision outer solver; however, they have to account for the numerical effects, i.e., they may need a flexible Krylov solver (FCG/FGMRES), since in finite precision the low-precision preconditioner no longer acts as a fixed linear operator. A brief discussion of using mixed precision in the Belos package was given in [106]. Scalar types other than single and double may also be used; however, this is not common and is not supported by the explicit template instantiation (ETI) build system.
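The following minimal sketch (generic C++, not Trilinos/Belos code) shows the pattern: a Jacobi preconditioner stored and applied in single precision inside a double-precision preconditioned CG iteration on a matrix-free 1-D Laplacian. In production, a flexible method such as FCG would be preferred to absorb the perturbations the low-precision preconditioner introduces.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Matrix-free 1-D Laplacian stencil: y = A x.
    static void apply_A(const std::vector<double>& x, std::vector<double>& y) {
        int n = (int)x.size();
        for (int i = 0; i < n; ++i)
            y[i] = 2.0 * x[i] - (i > 0 ? x[i - 1] : 0.0)
                              - (i + 1 < n ? x[i + 1] : 0.0);
    }

    int main() {
        const int n = 100;
        std::vector<double> b(n, 1.0), x(n, 0.0), r(b), z(n), p(n), Ap(n);
        // Jacobi preconditioner: inverse diagonal, kept in float.
        std::vector<float> inv_diag(n, 0.5f);  // diag(A) = 2

        auto dot = [&](const std::vector<double>& u,
                       const std::vector<double>& v) {
            double s = 0; for (int i = 0; i < n; ++i) s += u[i] * v[i];
            return s;
        };
        // z = M^{-1} r, computed through single precision.
        auto precond = [&](const std::vector<double>& rr,
                           std::vector<double>& zz) {
            for (int i = 0; i < n; ++i)
                zz[i] = double(inv_diag[i] * float(rr[i]));
        };

        precond(r, z);
        p = z;
        double rz = dot(r, z);
        for (int it = 0; it < 1000; ++it) {
            apply_A(p, Ap);
            double alpha = rz / dot(p, Ap);
            for (int i = 0; i < n; ++i) {
                x[i] += alpha * p[i];
                r[i] -= alpha * Ap[i];
            }
            if (std::sqrt(dot(r, r)) < 1e-10) {
                std::printf("converged at iteration %d\n", it);
                break;
            }
            precond(r, z);
            double rz_new = dot(r, z);
            double beta = rz_new / rz;  // an FCG variant would use a
            rz = rz_new;                // flexible beta here instead
            for (int i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        }
    }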
10 IEEE Formats and Format Conversion
The half-precision (fp16) floating-point format, defined in the 2008 revision of the IEEE standard for floating-point arithmetic, and the more recently proposed half-precision format bfloat16 are increasingly available in GPUs and other accelerators. While the support for low precision arithmetic is mainly motivated by machine learning applications, as discussed in earlier sections, general purpose numerical algorithms can benefit from it too. Since the appropriate hardware is not always available, and one may wish to experiment with new arithmetics not yet implemented in hardware, software simulations of low precision arithmetic are needed. In [107], Higham and Pranesh discuss a strategy to simulate low precision arithmetic using arithmetic of higher precision, and the correctness of such simulations is explained via rounding error analysis. A MATLAB function chop (https://github.com/higham/chop) is provided that can be used to efficiently simulate fp16, bfloat16, and other low precision arithmetics, with or without the representation of subnormal numbers, and with the options of round to nearest, directed rounding, stochastic rounding, and random bit flips in the significand. Interested readers are referred to [107] for further details; a minimal illustration of the basic rounding step such a simulation performs appears at the end of this section.

Traditional rounding error analysis in numerical linear algebra leads to backward error bounds involving the constant $\gamma_n = nu/(1-nu)$, for a problem size $n$ and unit roundoff $u$. In light of large-scale and possibly low-precision computations, such bounds can struggle to provide any useful information. In [108], Higham and Mary develop a new probabilistic rounding error analysis that can be applied to a wide range of algorithms. By using a concentration inequality and making probabilistic assumptions about the rounding errors, they show that in several core linear algebra computations $\gamma_n$ can be replaced by a relaxed constant $\widetilde{\gamma}_n$ proportional to $\sqrt{n \log n}\,u$, with a probability bounded below by a quantity independent of $n$. The new constant $\widetilde{\gamma}_n$ grows much more slowly with $n$ than $\gamma_n$. We refer to [108], [109] for further details.
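As a flavor of how such a simulation operates, the function below (a minimal C++ sketch, far simpler than chop itself: no subnormal handling, rounding-mode options, or NaN/Inf handling, and the function name is ours) rounds a single-precision value to bfloat16 precision using round-to-nearest-even on the stored bits, so that a bfloat16 operation can be emulated by rounding the inputs and the result of each fp32 operation.

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Round an IEEE single-precision value to bfloat16 precision and
    // return it as a float: bfloat16 keeps the 8-bit exponent of fp32 but
    // only 8 explicit significand bits, so rounding amounts to
    // round-to-nearest-even on the top 16 bits of the fp32 encoding.
    float round_to_bfloat16(float x) {
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof bits);
        uint32_t bias = 0x7FFFu + ((bits >> 16) & 1u);  // ties to even
        bits = (bits + bias) & 0xFFFF0000u;             // drop low 16 bits
        float y;
        std::memcpy(&y, &bits, sizeof y);
        return y;
    }

    int main() {
        // Simulate one bfloat16 addition on rounded operands.
        float a = round_to_bfloat16(1.0f / 3.0f);
        float b = round_to_bfloat16(0.1f);
        float c = round_to_bfloat16(a + b);
        std::printf("%.8f + %.8f = %.8f in simulated bfloat16\n", a, b, c);
    }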
Acknowledgments

This work was supported by the US Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Disclaimer
This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA-0003525.

References

[1] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep Learning with Limited Numerical Precision. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 1737–1746. JMLR.org, 2015.
[2] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–70, Aug 2008.
[3] Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J. Higham. Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed Up Mixed-Precision Iterative Refinement Solvers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC '18, pages 47:1–47:11, Piscataway, NJ, USA, 2018. IEEE Press.
[4] Azzam Haidar, Panruo Wu, Stanimire Tomov, and Jack Dongarra. Investigating half precision arithmetic to accelerate dense linear system solvers. In Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, pages 1–8, 2017.
[5] Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. Journal of Physics: Conference Series, 180:012037, July 2009.
[6] Ahmad Abdelfattah, Stanimire Tomov, and Jack J. Dongarra. Fast Batched Matrix Multiplication for Small Sizes Using Half-Precision Arithmetic on GPUs. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 111–122. IEEE, 2019.
[7] James W. Demmel. Applied Numerical Linear Algebra. SIAM, Philadelphia, 1997.
[8] James Hardy Wilkinson. Rounding Errors in Algebraic Processes. Courier Corporation, 1994.
[9] Cleve B. Moler. Iterative refinement in floating point. Journal of the ACM, 14(2):316–321, 1967.
[10] Gilbert W. Stewart. Introduction to Matrix Computations. Elsevier, 1973.
[11] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 2002.
[12] James Demmel, Yozo Hida, William Kahan, Xiaoye S. Li, Sonil Mukherjee, and E. Jason Riedy. Error bounds from extra-precise iterative refinement. ACM Transactions on Mathematical Software, 32(2):325–351, 2006.
[13] Werner Oettli and William Prager. Compatibility of approximate solution of linear equations with given error bounds for coefficients and right-hand sides. Numerische Mathematik, 6(1):405–409, 1964.
[14] Erin Carson and Nicholas J. Higham. Accelerating the solution of linear systems by iterative refinement in three precisions. SIAM Journal on Scientific Computing, 40(2):A817–A847, 2018.
[15] Erin Carson and Nicholas J. Higham. A new analysis of iterative refinement and its application to accurate solution of ill-conditioned sparse linear systems. SIAM Journal on Scientific Computing, 39(6):A2834–A2856, 2017.
[16] Youcef Saad and Martin H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 7(3):856–869, 1986.
[17] Nicholas J. Higham. Error analysis for standard and GMRES-based iterative refinement in two and three precisions. MIMS EPrint 2019.19, Manchester Institute for Mathematical Sciences, The University of Manchester, November 2019.
[18] Nicholas J. Higham, Srikara Pranesh, and Mawussi Zounon. Squeezing a matrix into half precision, with an application to solving linear systems. SIAM J. Sci. Comput., 41(4):A2536–A2551, 2019.
[19] Nicholas J. Higham and Srikara Pranesh. Exploiting lower precision arithmetic in solving symmetric positive definite linear systems and least squares problems. MIMS EPrint 2019.20, Manchester Institute for Mathematical Sciences, The University of Manchester, November 2019.
[20] Erin Carson, Nicholas J. Higham, and Srikara Pranesh. Three-precision GMRES-based iterative refinement for least squares problems. MIMS EPrint 2020.5, Manchester Institute for Mathematical Sciences, The University of Manchester, February 2020.
[21] Joseph M. Elble and Nikolaos V. Sahinidis. Scaling linear optimization problems prior to application of the simplex method. Comput. Optim. Appl., 52(2):345–371, 2012.
[22] E. Anderson, Z. Bai, C. H. Bischof, S. Blackford, J. W. Demmel, J. J. Dongarra, J. J. Du Croz, A. Greenbaum, S. J. Hammarling, A. McKenney, and D. C. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, third edition, 1999.
[23] Philip A. Knight, Daniel Ruiz, and Bora Uçar. A symmetry preserving algorithm for matrix scaling. SIAM J. Matrix Anal. Appl., 35(3):931–955, 2014.
[24] Azzam Haidar, Ahmad Abdelfattah, Mawussi Zounon, Panruo Wu, Srikara Pranesh, Stanimire Tomov, and Jack Dongarra. The design of fast and energy-efficient linear solvers: On the potential of half-precision arithmetic and iterative refinement techniques. In Yong Shi, Haohuan Fu, Yingjie Tian, Valeria V. Krzhizhanovskaya, Michael Harold Lees, Jack Dongarra, and Peter M. A. Sloot, editors, Computational Science – ICCS 2018, pages 586–600. Springer International Publishing, Cham, 2018.
[25] Nicholas J. Higham and G. W. Stewart. Numerical linear algebra in statistical computing. In A. Iserles and M. J. D. Powell, editors, The State of the Art in Numerical Analysis, pages 41–57. Oxford University Press, 1987.
[26] Pierre Blanchard, Nicholas J. Higham, Florent Lopez, Theo Mary, and Srikara Pranesh. Mixed precision block fused multiply-add: Error analysis and application to GPU tensor cores. MIMS EPrint 2019.18, Manchester Institute for Mathematical Sciences, The University of Manchester, September 2019.
[27] Takeshi Fukaya, Ramaseshan Kannan, Yuji Nakatsukasa, Yusaku Yamamoto, and Yuka Yanagisawa. Shifted Cholesky QR for computing the QR factorization of ill-conditioned matrices. SIAM J. Sci. Comput., 42:A477–A503, 2020.
[28] Yusaku Yamamoto, Yuji Nakatsukasa, Yuka Yanagisawa, and Takeshi Fukaya. Roundoff error analysis of the CholeskyQR2 algorithm. Electron. Trans. Numer. Anal., 44:306–326, 2015.
[29] Ichitaro Yamazaki, Stanimire Tomov, and Jack Dongarra. Mixed-precision Cholesky QR factorization and its case studies on multicore CPU with multiple GPUs. SIAM J. Sci. Comput., 37(1):C307–C330, 2015.
[30] Ichitaro Yamazaki, Stanimire Tomov, and Jack Dongarra. Stability and performance of various singular value QR implementations on multicore CPU with a GPU. ACM Trans. Math. Software, 43(2):10:1–10:18, September 2016.
[31] Åke Björck. Iterative refinement of linear least squares solutions I. BIT Numerical Mathematics, 7(4):257–278, 1967.
[32] Åke Björck. Iterative refinement and reliable computing. In M. G. Cox and S. J. Hammarling, editors, Reliable Numerical Computation, pages 249–266. Oxford University Press, 1990.
[33] PyTorch 1.4.0 documentation: Quantization.
[34] TensorFlow Lite 8-bit quantization specification.
[35] J. J. Dongarra, C. B. Moler, and J. H. Wilkinson. Improving the accuracy of computed eigenvalues and eigenvectors. SIAM Journal on Numerical Analysis, 20(1):23–45, 1983.
[36] Jack Dongarra. Algorithm 589 SICEDR: A FORTRAN subroutine for improving the accuracy of computed matrix eigenvalues. ACM Transactions on Mathematical Software, 8(4):371–375, December 1982.
[37] T. Ogita and K. Aishima. Iterative refinement for symmetric eigenvalue decomposition. Japan Journal of Industrial and Applied Mathematics, 35(3):1007–1035, 2018.
[38] T. Ogita and K. Aishima. Iterative refinement for symmetric eigenvalue decomposition II: Clustered eigenvalues. Japan J. Indust. Appl. Math., 36:435–459, 2019. https://doi.org/10.1007/s13160-019-00348-4.
[39] Inderjit S. Dhillon, Beresford N. Parlett, and Christof Vömel. The design and implementation of the MRRR algorithm. ACM Trans. Math. Softw., 32(4):533–560, 2006. https://doi.org/10.1145/1186785.1186788.
[40] M. Petschow, E. S. Quintana-Ortí, and P. Bientinesi. Improved accuracy and parallelism for MRRR-based eigensolvers – a mixed precision approach. SIAM Journal on Scientific Computing, 36(2):C240–C263, 2014. https://doi.org/10.1137/130911561.
[41] P. Lindstrom. zfp version 0.5.3, April.
[42] James Diffenderfer, Alyson Fox, Jeffrey Hittinger, Geoffrey Sanders, and Peter Lindstrom. Error analysis of zfp compression for floating-point data, 2018.
[43] Stanimire Tomov, Azzam Haidar, Alan Ayala, Daniel Schultz, and Jack Dongarra. Design and implementation for FFT-ECP on distributed accelerated systems. ECP WBS 2.3.3.09 Milestone Report ICL-UT-19-05, April 2019.
[44] Stanimire Tomov, Alan Ayala, Azzam Haidar, and Jack Dongarra. FFT-ECP API and high-performance library prototype for 2-D and 3-D FFTs on large-scale heterogeneous systems with GPUs. ECP WBS 2.3.3.13 Milestone Report FFT-ECP STML13-27, January 2020. Revision 01-2020.
[45] Alan Ayala, Stanimire Tomov, Xi Luo, Hejer Shaiek, Azzam Haidar, George Bosilca, and Jack Dongarra. Impacts of multi-GPU MPI collective communications on large FFT computation. In Workshop on Exascale MPI (ExaMPI) at SC19, Denver, CO, November 2019.
[46] Hejer Shaiek, Stanimire Tomov, Alan Ayala, Azzam Haidar, and Jack Dongarra. GPUDirect MPI communications and optimizations to accelerate FFTs on exascale systems. EuroMPI'19 Posters, Zurich, Switzerland, ICL-UT-19-06, September 2019.
[47] A. Sorna, X. Cheng, E. D'Azevedo, K. Wong, and S. Tomov. Optimizing the fast Fourier transform using mixed precision on tensor core hardware. In 2018 IEEE 25th International Conference on High Performance Computing Workshops (HiPCW), pages 3–7, 2018.
[48] Xiaohe Cheng, Anumeena Sorna, Eduardo D'Azevedo, Kwai Wong, and Stanimire Tomov. Accelerating 2D FFT: Exploit GPU tensor cores through mixed-precision. November 2018.
[49] J. Demmel, Y. Hida, W. Kahan, X. S. Li, S. Mukherjee, and E. J. Riedy. Error bounds from extra-precise iterative refinement. ACM Trans. Math. Softw., 32(2):325–351, June 2006.
[50] J. Demmel, Y. Hida, E. J. Riedy, and X. S. Li. Extra-precise iterative refinement for overdetermined least squares problems. ACM Transactions on Mathematical Software, 35(4):28, 2009.
[51] Gérard Meurant and Zdeněk Strakoš. The Lanczos and conjugate gradient algorithms in finite precision arithmetic. Acta Numerica, 15:471–542, 2006.
[52] Christopher C. Paige. Accuracy and effectiveness of the Lanczos algorithm for the symmetric eigenproblem. Lin. Alg. Appl., 34:235–258, 1980.
[53] Anne Greenbaum. Behavior of slightly perturbed Lanczos and conjugate-gradient recurrences. Lin. Alg. Appl., 113:7–63, 1989.
[54] Anne Greenbaum. Estimating the attainable accuracy of recursively computed residual methods. SIAM J. Matrix Anal. Appl., 18(3):535–551, 1997.
[55] Erin Claire Carson. Communication-Avoiding Krylov Subspace Methods in Theory and Practice. PhD thesis, University of California, Berkeley, 2015.
[56] Valeria Simoncini and Daniel B. Szyld. Theory of inexact Krylov subspace methods and applications to scientific computing. SIAM J. Sci. Comput., 25(2):454–477, 2003.
[57] Jasper van den Eshof and Gerard L. G. Sleijpen. Inexact Krylov subspace methods for linear systems. SIAM J. Matrix Anal. Appl., 26(1):125–153, 2004.
[58] Michael A. Clark, Ronald Babich, Kipton Barros, Richard C. Brower, and Claudio Rebbi. Solving lattice QCD systems of equations using mixed precision solvers on GPUs. Computer Physics Communications, 181(9):1517–1528, 2010.
[59] Gerard L. G. Sleijpen and Henk A. van der Vorst. Reliable updated residuals in hybrid Bi-CG methods. Computing, 56(2):141–163, 1996.
[60] Henk A. van der Vorst and Qiang Ye. Residual replacement strategies for Krylov subspace iterative methods for the convergence of true residuals. SIAM J. Sci. Comput., 22(3):835–852, 2000.
[61] Serge Gratton, Ehouarn Simon, David Titley-Peloquin, and Philippe Toint. Exploiting variable precision in GMRES. SIAM J. Sci. Comput. (to appear), 2020.
[62] Åke Björck. Solving linear least squares problems by Gram-Schmidt orthogonalization. BIT Numerical Mathematics, 7(1):1–21, 1967.
[63] Christopher C. Paige, Miroslav Rozložník, and Zdeněk Strakoš. Modified Gram-Schmidt (MGS), least squares, and backward stability of MGS-GMRES. SIAM J. Matrix Anal. Appl., 28(1):264–284, 2006.
[64] Christopher C. Paige. The effects of loss of orthogonality on large scale numerical computations. In International Conference on Computational Science and Its Applications, pages 429–439. Springer, 2018.
[65] Luc Giraud, Serge Gratton, and Julien Langou. A rank-k update procedure for reorthogonalizing the orthogonal factor from modified Gram-Schmidt. SIAM J. Matrix Anal. Appl., 25(4):1163–1177, 2004.
[66] Christopher C. Paige and Zdeněk Strakoš. Residual and backward error bounds in minimum residual Krylov subspace methods. SIAM J. Sci. Comput., 23(6):1898–1923, 2002.
[67] K. Świrydowicz, J. Langou, S. Ananthan, U. Yang, and S. J. Thomas. Low synchronization Gram-Schmidt and GMRES algorithms. Numer. Lin. Alg. Appl., 2020.
[68] Nicholas J. Higham. The accuracy of solutions to triangular systems. SIAM J. Numer. Anal., 26(5):1252–1265, 1989.
[69] Jesse L. Barlow. Block modified Gram-Schmidt algorithms and their analysis. SIAM J. Matrix Anal. Appl., 40(4):1257–1290, 2019.
[70] Åke Björck. Numerics of Gram-Schmidt orthogonalization. Lin. Alg. Appl., 197:297–316, 1994.
[71] J. Malard and C. C. Paige. Efficiency and scalability of two parallel QR factorization algorithms. In Proceedings of the IEEE Scalable High Performance Computing Conference, pages 615–622. IEEE, 1994.
[72] Steven J. Leon, Åke Björck, and Walter Gander. Gram-Schmidt orthogonalization: 100 years and more. Numer. Lin. Alg. Appl., 20(3):492–532, 2013.
[73] Chiara Puglisi. Modification of the Householder method based on the compact WY representation. SIAM J. Sci. Stat. Comput., 13(3):723–726, 1992.
[74] Thierry Joffrain, Tze Meng Low, Enrique S. Quintana-Ortí, Robert van de Geijn, and Field G. Van Zee. Accumulating Householder transformations, revisited. ACM Transactions on Mathematical Software, 32(2):169–179, 2006.
[75] Homer F. Walker. Implementation of the GMRES method using Householder transformations. SIAM J. Sci. Stat. Comput., 9(1):152–163, 1988.
[76] Xiaobai Sun. Aggregations of elementary transformations. Technical Report DUKE-TR-1996-03, Duke University, Durham, NC, 1996.
[77] Åke Björck and Christopher C. Paige. Loss and recapture of orthogonality in the modified Gram-Schmidt algorithm. SIAM J. Matrix Anal. Appl., 13(1):176–190, 1992.
[78] Christopher C. Paige and Wolfgang Wülling. Properties of a unitary matrix obtained from a sequence of normalized vectors. SIAM J. Matrix Anal. Appl., 35(2):526–545, 2014.
[79] Alicja Smoktunowicz, Jesse L. Barlow, and Julien Langou. A note on the error analysis of classical Gram-Schmidt. Numerische Mathematik, 105(2):299–313, 2006.
[80] Hartwig Anzt, Jack Dongarra, Goran Flegar, Nicholas J. Higham, and Enrique S. Quintana-Ortí. Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers. Concurrency and Computation: Practice and Experience, 31(6):e4460, 2019.
[81] Hartwig Anzt, Goran Flegar, Thomas Grützmacher, and Enrique S. Quintana-Ortí. Toward a modular precision ecosystem for high-performance computing. The International Journal of High Performance Computing Applications, 33(6):1069–1078, 2019.
[82] Thomas Grützmacher, Terry Cojean, Goran Flegar, Fritz Göbel, and Hartwig Anzt. A customized precision format based on mantissa segmentation for accelerating sparse linear algebra. Concurrency and Computation: Practice and Experience, page e5418, 2019.
[83] J. L. Gustafson. The End of Error: Unum Computing. Chapman & Hall/CRC Computational Science. Taylor & Francis, 2015.
[84] Karl Ljungkvist and Martin Kronbichler. Multigrid for matrix-free finite element computations on graphics processors. Technical report, Department of Information Technology, Uppsala University, 2017.
[85] Karl Ljungkvist and Martin Kronbichler. Multigrid for matrix-free high-order finite element computations on graphics processors. ACM Transactions on Parallel Computing, 2019.
[86] Stefan Lemvig Glimberg, Allan Peter Engsig-Karup, and Morten G. Madsen. A fast GPU-accelerated mixed-precision strategy for fully nonlinear water wave computations. In Proceedings of ENUMATH 2011, 2011.
[87] Takateru Yamagishi and Yoshimasa Matsumura. GPU acceleration of a non-hydrostatic ocean model with a multigrid Poisson/Helmholtz solver. Procedia Computer Science, 80:1658–1669, 2016.
[88] Maximilian Emans and Albert van der Meer. Mixed-precision AMG as linear equation solver for definite systems. In Proceedings of the International Conference on Computational Science, ICCS 2010, volume 1, pages 175–183, 2010.
[89] Yuki Sumiyoshi, Akihiro Fujii, Akira Nukada, and Teruo Tanaka. Mixed-precision AMG method for many core accelerators. In EuroMPI/ASIA '14: Proceedings of the 21st European MPI Users' Group Meeting, pages 127–132, 2014.
[90] Christian Richter, Sebastian Schöps, and Markus Clemens. GPU-accelerated mixed precision algebraic multigrid preconditioners for discrete elliptic field problems. IEEE Transactions on Magnetics, 50(2), 2014.
[91] Kate Clark. Effective use of mixed precision for HPC. Smoky Mountain Conference, 2019.
[92] Alyson Fox and Avary Kolasinski. Error analysis of inline zfp compression for multigrid methods. 2019 Copper Mountain Conference for Multigrid Methods.
[93] Rasmus Tamstorf, Joseph Benzaken, and Stephen McCormick. Algebraic error analysis for mixed precision multigrid solvers. SIAM Journal on Scientific Computing, 2020. Submitted.
[94] Rasmus Tamstorf, Joseph Benzaken, and Stephen McCormick. Discretization-error-accurate mixed precision multigrid solvers. SIAM Journal on Scientific Computing, 2020. Submitted.
[95] Yu-Hsin Chen, Joel S. Emer, and Vivienne Sze. Eyeriss v2: A flexible and high-performance accelerator for emerging deep neural networks. CoRR, abs/1807.07928, 2018.
[96] Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, Richard C. Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit. CoRR, abs/1704.04760, 2017.
[97] Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects. ACM SIGPLAN Notices, 53(2):461–475, 2018.
[98] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[99] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265–283, 2016.
[100] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[101] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.
[102] Thorsten Kurth, Sean Treichler, Joshua Romero, Mayur Mudigonda, Nathan Luehr, Everett Phillips, Ankur Mahesh, Michael Matheson, Jack Deslippe, Massimiliano Fatica, et al. Exascale deep learning for climate analytics. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 649–660. IEEE, 2018.
[103] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. CoRR, abs/1502.02551, 2015. Accessed: 2018-08-01.
[104] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1737–1746, Lille, France, 2015. PMLR. Accessed: 2018-08-01.
[105] Hartwig Anzt, Erik Boman, Rob Falgout, Pieter Ghysels, Michael Heroux, Xiaoye Li, Lois Curfman McInnes, Richard Tran Mills, Sivasankaran Rajamanickam, Karl Rupp, et al. Preparing sparse solvers for exascale computing. Philosophical Transactions of the Royal Society A, 378(2166):20190053, 2020.
[106] Eric Bavier, Mark Hoemmen, Sivasankaran Rajamanickam, and Heidi Thornquist. Amesos2 and Belos: Direct and iterative solvers for large sparse linear systems. Scientific Programming, 20:241–255, 2012.
[107] Nicholas J. Higham and Srikara Pranesh. Simulating low precision floating-point arithmetic. SIAM Journal on Scientific Computing, 41(5):C585–C602, 2019.
[108] Nicholas J. Higham and Theo Mary. A new approach to probabilistic rounding error analysis.