Reproducibility of Parallel Preconditioned Conjugate Gradient in Hybrid Programming Environments
Roman Iakymchuk, Maria Barreda, Stef Graillat, Jose I. Aliaga, Enrique S. Quintana-Orti
Abstract
The Preconditioned Conjugate Gradient method is often employed for the solution of linear systems of equations arising in numerical simulations of physical phenomena. While being widely used, the solver is also known for its lack of accuracy while computing the residual. In this article, we propose two algorithmic solutions that originate from the ExBLAS project to enhance the accuracy of the solver as well as to ensure its reproducibility in a hybrid MPI + OpenMP tasks programming environment. One is based on ExBLAS and preserves every bit of information until the final rounding, while the other relies upon floating-point expansions and, hence, expands the intermediate precision. Instead of converting the entire solver into its ExBLAS-related implementation, we identify those parts that violate reproducibility/non-associativity, secure them, and combine this with sequential executions. These algorithmic strategies are reinforced with programmability suggestions to assure deterministic executions. Finally, we verify these approaches on two modern HPC systems: both versions deliver reproducible numbers of iterations, residuals, direct errors, and vector-solutions for an overhead of less than 37.7 % on 768 cores.
Keywords
Preconditioned Conjugate Gradient, MPI, OpenMP tasks, reproducibility, accuracy, floating-point expansion, long accumulator, fused multiply-add.
Many current scientific and engineering problems involve the solution of large and sparse linear systems of equations. Some traditional examples appear, for instance, in circuit and device simulation, quantum physics, large-scale eigenvalue computations, nonlinear sparse equations, and all sorts of applications that include the discretization of partial differential equations (PDEs) Barrett et al. (1994). For many problems (especially those associated with 3-D models), the size and complexity of these systems have turned iterative projection methods, based on Krylov subspaces, into a highly competitive approach compared with direct solvers Saad (2003). In particular, the Conjugate Gradient (CG) method is one of the most efficient Krylov subspace-based algorithms for the solution of sparse linear systems when the coefficient matrix is symmetric positive definite (s.p.d.) Saad (2003). Preconditioning is usually incorporated in real implementations in order to accelerate the convergence of the method and improve its numerical features, yielding the Preconditioned Conjugate Gradient (PCG) method.

One would expect that the results of different runs of PCG are identical, for instance, in the number of iterations, the intermediate and final residuals, as well as the solution vector. However, in practice this is often not the case due to different reduction trees – the Message Passing Interface (MPI) implementations (libraries) Gropp et al. (2014) offer up to 14 different implementations for reduction –, OpenMP task scheduling, data alignment, the instructions used, etc. Each of these factors may change the execution order of floating-point operations, which are commutative but non-associative, and, hence, result in non-reproducible results. We define reproducibility as the ability to obtain a bit-wise identical and accurate result for multiple executions on the same data. Therefore, the aim of this study is to ensure reliable computations (we also refer to them as robust), at reasonable cost, for codes that leverage PCG (or any similar Krylov subspace solver) and encounter numerical issues during sensitive computations of the residual. Our approach and routines are also aimed at debugging, as we ensure reproducible residuals, direct errors, numbers of iterations, and solution vectors across the sequential, pure MPI, and even hybrid MPI + OpenMP versions.

Ensuring bit-wise reproducibility is often a complex and expensive task that imposes modifications to the algorithm and its underlying parts such as the BLAS (Basic Linear Algebra Subprograms) routines Lawson et al. (1979); Dongarra et al. (1990). These modifications are necessary to preserve every bit of information (both result and error) Collange et al. (2015) or, alternatively, to cut off some parts of the data and operate on the remaining most-significant parts Mukunoki et al. (2020); Demmel and Nguyen (2015).

Sorbonne Université, LIP6, France; Fraunhofer ITWM, Germany; Universitat Jaume I, Spain; Universitat Politècnica de València, Spain
Corresponding author:
Roman Iakymchuk, Sorbonne Université, LIP6, Campus Pierre et Marie Curie, 4 place Jussieu, 75252 PARIS CEDEX 05, France. Email: [email protected]
Furthermore, the bit-wise reproducibility can become expensive, with an overhead of at least 8 % for a parallel reduction Collange et al. (2015); Demmel and Nguyen (2015), up to 2x–4x for the matrix-vector product Iakymchuk et al. (2019b), and more than 10x for matrix-matrix multiplication Iakymchuk et al. (2016). In this paper, we aim to revisit reproducibility and raise its appeal by reducing its negative impact on performance and minimizing the changes to both the algorithm and its building blocks. We also raise a question:
Can reproducibility of algorithms be ensured by design, with both minimal changes to algorithms and almost negligible overhead?
Hence, our idea is to address those parts of algorithms that violate associativity – such as parallel reductions, dot products, and possible replacements by compilers of a*b + c in favor of the fused multiply-add (fma) operation, etc. – as well as to combine that with sequential executions of sub-blocks/subroutines. Such a sequential execution of operations is reproducible under some constraints, for example the same initial conditions on the input data, like data alignment. We verify this idea (both its algorithmic and its programmability aspects) on a typical sparse linear algebra solver such as PCG and ensure its reproducibility on parallel distributed-memory systems using a hybrid combination of the MPI + OpenMP-tasks programming models. On the one hand, the hybridization reduces the communication burden, being more focused on intra-node computations and work balancing, especially on nodes with large core counts such as those in the MareNostrum4 platform at Barcelona Supercomputing Center. On the other hand, it introduces a new challenge in the form of a double-level reduction: an initial reduction among tasks inside a process/node, followed by one among processes. Thus, we ensure reproducibility of the PCG solver by preventing non-deterministic executions as follows:

• We construct two reproducible solutions: a first one based on the ExBLAS approach Iakymchuk et al. (2017) and an alternative lightweight version based on floating-point expansions (FPEs). The ExBLAS-based approach with its cornerstone Kulisch long accumulator Kulisch (2013) is robust but expensive, since it is designed to cover severe (ill-conditioned) cases with very broad dynamic ranges. Motivated by "100 bits suffice for many HPC applications", as noted by David Bailey at ARITH-21 Bailey (2013), and by a mini accumulator from the ARM team Lutz and Hinds (2017); Burgess et al. (2019), we derive a faster but less generic version using FPEs, which is the other core algorithmic component in the ExBLAS approach, aiming to adjust the algorithm to the problem at hand.

• As a consequence, we also address the common issue of sparse iterative solvers – the accuracy while computing the residual – and propose to use solutions that offer reproducibility (and potentially correct rounding) only while computing the corresponding dot products.

• Hence, we derive two hybrid (MPI + OpenMP tasks), reproducible, and accurate dot products using ExBLAS and FPEs.

• Finally, we demonstrate the applicability and feasibility of the aforementioned idea with the ExBLAS- and FPE-based approaches in the hybrid MPI + OpenMP implementation of PCG on the example of a 3D Poisson's equation with 27 stencil points as well as on several test matrices from the SuiteSparse matrix collection. This extends our previous results with the pure MPI implementation of PCG Iakymchuk et al. (2019a) to the more complex double-level dot products and reductions with dynamic scheduling of the tasks.
To sum up: the FPE-based solution (we also call it Opt) is efficient and fast, but it is limited to cases where the condition number and/or the dynamic range do not exceed certain thresholds. (At this point, we note that the condition number of a linear system can be cheaply estimated with fair accuracy.) In comparison, the ExBLAS-based solution is reserved for extreme cases as well as for problems about which we do not have any information in advance.

This article is organized as follows. Section 2 reviews several aspects of computer arithmetic, in particular floating-point expansions and long accumulators, as well as the ExBLAS approach for accurate and reproducible computations. Section 3 introduces the PCG algorithm and describes in detail its hybrid (MPI + OpenMP) implementation. We present strategies for ensuring reproducibility of PCG in Section 4 and evaluate the corresponding implementations in Section 5. Finally, Section 6 reviews related work, while Section 7 draws conclusions and outlines future directions.

At first, we briefly introduce floating-point arithmetic, which consists in approximating real numbers by numbers that have a finite, fixed-precision representation. These numbers are composed of a significand, an exponent, and a sign:

  x = ± x_0.x_1 ... x_{M−1} × b^e,   0 ≤ x_i ≤ b − 1,   x_0 ≠ 0,

where x_0.x_1...x_{M−1} is the mantissa, b is the basis (2 in our case), M is the precision, and e stands for the exponent, which is bounded (e_min ≤ e ≤ e_max).

The IEEE 754 standard IEEE Computer Society (2008), created in 1985 and then revised in 2008, has led to a considerable enhancement in the reliability of numerical computations by rigorously specifying the properties of floating-point arithmetic. This standard is now adopted by most processors, thus leading to a much better portability of numerical applications. The standard specifies floating-point formats, which are often associated with precisions like binary16, binary32, and binary64, see Table 1. Floating-point representation allows numbers to cover a wide dynamic range, which is defined as the absolute ratio between the number with the largest magnitude and the number with the smallest non-zero magnitude in a set. For instance, binary64 (double-precision) can represent positive numbers from about 4.9 × 10^−324 to about 1.8 × 10^308, so it covers a dynamic range of about 3.7 × 10^631.

The IEEE 754 standard requires correctly rounded results for the basic arithmetic operations (+, −, ×, /, √, fma). It means that the operations are performed as if the result were first computed with infinite precision and then rounded to the floating-point format. The correct rounding criterion guarantees a unique, well-defined answer, ensuring bit-wise reproducibility for a single operation. Several rounding modes are provided.
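Correct rounding of each individual operation, however, does not make a sequence of operations order-independent: the computed value of a sum still depends on the evaluation order, which is precisely the source of non-reproducibility addressed in this article. A minimal, self-contained C++ illustration (the values are chosen purely for demonstration):

#include <cstdio>

int main() {
    double a = 1e16, b = 1.0, c = 1.0;
    // Both expressions sum the same three numbers, yet the results differ:
    std::printf("(a + b) + c = %.1f\n", (a + b) + c);  // 10000000000000000.0 (b and c are lost one at a time)
    std::printf("a + (b + c) = %.1f\n", a + (b + c));  // 10000000000000002.0 (b + c = 2 survives the addition)
    return 0;
}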
Table 1. Parameters for three IEEE arithmetic precisions; quadruple (128 bits) is omitted.

Type   | Size    | Significand | Exponent | Rounding unit              | Range
half   | 16 bits | 11 bits     | 5 bits   | u = 2^−11 ≈ 4.88 × 10^−4   | ≈ 10^±5
single | 32 bits | 24 bits     | 8 bits   | u = 2^−24 ≈ 5.96 × 10^−8   | ≈ 10^±38
double | 64 bits | 53 bits     | 11 bits  | u = 2^−53 ≈ 1.11 × 10^−16  | ≈ 10^±308

The standard also contains a reproducibility clause that forwards the reproducibility issue to language standards. Emerging attention to reproducibility strives to draw more careful attention to the problem within the computer arithmetic community. It has led to the inclusion, in the revised version of the standard, of error-free transformations (EFTs) for addition and multiplication – which return the exact outcome as the result plus the error – to assure numerical reproducibility of floating-point operations. These mechanisms, once implemented in hardware, will simplify our reproducible algorithms – like the ones used in the ExBLAS Collange et al. (2015), ReproBLAS Demmel and Nguyen (2015), and OzBLAS Mukunoki et al. (2020) libraries – and boost their performance.

There are three approaches that enable the addition of floating-point numbers without incurring round-off errors or with reduced impact of such errors. The main idea is to keep track of both the result and the errors during the course of the computations. The first approach uses EFTs to compute both the result and the rounding error and stores them in a floating-point expansion (FPE), which is an unevaluated sum of p floating-point numbers, whose components are ordered in magnitude with minimal overlap to cover the whole range of exponents. Typically, an FPE relies upon the traditional EFT for addition, which is twosum Knuth (1969) (Algorithm 1), and for multiplication, which is the twoprod
EFT Ogita et al. (2005) (Algorithm 2). Note that the underlying architecture should support fma, which is often the case. Otherwise, we refer to Algorithm 3.3 in Ogita et al. (2005), which relies on Dekker's algorithm for splitting a floating-point number Dekker (1971); this altogether requires 17 flops, in contrast to the 3 flops of Algorithm 2 with fma.

Algorithm 1: Error-free transformation for the summation of two floating-point numbers.
Input: a, b are two floating-point numbers.
Output: r, s are the result and the error, resp.
Function [r, s] = twosum(a, b)
  r := a + b
  z := r − a
  s := (a − (r − z)) + (b − z)
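For reference, Algorithm 1 translates directly into C++; this small, self-contained sketch also shows the error term recovering a bit that the rounded sum loses (the driver values are illustrative only):

#include <cstdio>

// twosum EFT (Algorithm 1): r is the rounded sum, s the exact rounding error,
// so that a + b = r + s holds exactly.
static void twosum(double a, double b, double &r, double &s) {
    r = a + b;
    double z = r - a;
    s = (a - (r - z)) + (b - z);
}

int main() {
    double r, s;
    twosum(1e16, 1.0, r, s);
    std::printf("r = %.17g, s = %.17g\n", r, s);  // r = 1e16, s = 1: the unit lost by rounding
    return 0;
}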
Algorithm 2: Error-free transformation for the product of two floating-point numbers.
Input: a, b are two floating-point numbers.
Output: r, s are the result and the error, resp.
Function [r, s] = twoprod(a, b)
  r := a * b
  s := fma(a, b, −r)

The second approach projects the finite range of exponents of floating-point numbers into a long vector, a so-called long (fixed-point) accumulator, and stores every bit there. For instance, Kulisch Kulisch and Snyder (2011) proposed to use a 4288-bit long accumulator for the exact dot product of two vectors composed of binary64 numbers; such a large long accumulator is designed to cover all severe cases without overflow in its highest digit.

The third approach is based on slicing or splitting a floating-point number into slices using Dekker's algorithm. Then, the same work is carried out separately on each slice and the accumulated results are aggregated/merged. More details and analysis can be found in Rump et al. (2008a), with ideas originating from Zielke and Drygalla (2003). This approach is implemented in ReproBLAS and OzBLAS.
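Analogously, Algorithm 2 can be written with the C++11 std::fma (assuming, as noted above, that the architecture provides a hardware fma):

#include <cmath>
#include <cstdio>

// twoprod EFT (Algorithm 2): r is the rounded product and s its exact error,
// so that a * b = r + s holds exactly.
static void twoprod(double a, double b, double &r, double &s) {
    r = a * b;
    s = std::fma(a, b, -r);   // exact residual of the rounded product
}

int main() {
    double r, s;
    twoprod(1.0 + 0x1p-30, 1.0 - 0x1p-30, r, s);
    std::printf("r = %.17g, s = %.17g\n", r, s);   // r = 1, s = -2^-60
    return 0;
}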
The ExBLAS project Iakymchuk et al. (2015) is an effort to derive a fast, accurate, and reproducible BLAS library by constructing a multi-level approach for these operations that is tailored to various modern architectures with their complex multi-level memory structures. On one side, this approach aims to ensure performance similar to that of the non-deterministic parallel counterparts. On the other side, the approach preserves every bit of information before the final rounding to the desired format, to assure correct rounding and, therefore, reproducibility. Hence, ExBLAS combines long accumulators and FPEs into algorithmic solutions. In addition, it efficiently tunes and implements them on various architectures, including conventional CPUs, NVIDIA and AMD GPUs, and Intel Xeon Phi co-processors (for details we refer to Collange et al. (2015)). Thus, ExBLAS assures reproducibility through assuring correct rounding.

The cornerstone of ExBLAS is the reproducible parallel reduction, which is at the core of many BLAS routines. The ExBLAS parallel reduction relies upon FPEs with the twosum EFT Knuth (1969) and long accumulators, so it is correctly rounded and reproducible. In practice, the latter is invoked only once per overall summation, which results in little overhead when accumulating large vectors. Our interest in this article is the dot product of two vectors, which is another crucial fundamental BLAS operation. The EXDOT algorithm is based on the previous EXSUM algorithm and the twoprod EFT Ogita et al. (2005) (see Algorithm 2): the algorithm accumulates the result and the error of twoprod to the same FPEs and then follows the EXSUM scheme. These and the other routines – such as the matrix-vector product, triangular solve, and matrix-matrix multiplication – are distributed in the ExBLAS library∗. In this paper, we derive a hybrid MPI + OpenMP tasks EXDOT, where a long accumulator is shared among OpenMP tasks within one process and each OpenMP thread owns two FPEs underneath (one for the result and the other for the error) that are merged at the end of the computations.

∗ ExBLAS repository: https://github.com/riakymch/exblas
In this section we review the PCG algorithm and its task-parallel implementation using MPI and OpenMP tasks. The goal of the following analysis is twofold: to offer a complete description of the parallelization approach and, even more important, to identify the key inter-node (i.e., between MPI ranks) and intra-node (i.e., between threads executing tasks) communications, in particular reductions, which pose a challenge to ensuring reproducibility.
We consider the linear system Ax = b, where the coefficient matrix A ∈ R^{n×n} is sparse and symmetric positive definite (s.p.d.), with n_z nonzero entries; b ∈ R^n is the right-hand side vector; and x ∈ R^n is the sought-after solution vector. Figure 1 presents the algorithmic description of the classical iterative PCG. In the loop body of the algorithm, the following operations are executed: a sparse matrix-vector product (SpMV) (S1), three DOT products (S2, S6, and S8), three AXPY(-like) operations (S3, S4, and S7), the preconditioner application (S5), and a few scalar operations Barrett et al. (1994). In particular, in the proposed implementation of the PCG method, we incorporate a Jacobi preconditioner Saad (2003), which is composed of the elements on the diagonal of the matrix (M := D = diag(A)). Therefore, the application of the preconditioner is carried out on a vector and involves an element-wise multiplication of two vectors.

In this subsection we analyse the communication patterns of a message-passing implementation of the PCG solver that operates on a distributed-memory platform. For clarity, hereafter we will drop the superindices that denote the iteration count in the variable names. The following considerations are taken into account in the analysis of the communications:

• The parallel platform comprises K processes (or MPI ranks), denoted as P_1, P_2, ..., P_K.
• The coefficient matrix A is partitioned into K blocks of rows (A_1, A_2, ..., A_K), with the k-th distribution block A_k ∈ R^{p_k×n} stored in P_k, and n = Σ_{k=1}^{K} p_k.
• Vectors are partitioned and allocated conformally with the block-row distribution of A. For example, the residual vector r is partitioned as r_1, r_2, ..., r_K, where P_k stores r_k.
• The scalars α, β, ρ, τ are replicated on all K processes.

Considering these previous aspects, we next examine how they affect the different computational kernels (S1–S8) that are executed in a single PCG iteration in Figure 1.

Sparse matrix-vector product (S1):
The input operands are the coefficient matrix A, which is distributed by blocks of rows, and the vector d, which is partitioned and distributed according to A. A communication stage is required before executing this kernel in order to assemble the distributed parts of vector d into a single vector e, which is replicated in all processes. We denote this communication as d → e, which can be performed in MPI via an MPI_Allgatherv. Note that vector e is the only array that is replicated in all processes. After that, the computation can proceed in parallel and each process calculates its local slice of the output vector w: P_k : w_k := A_k e.
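The following sketch illustrates this communication and the process-local computation, assuming a CSR storage (rowptr/colind/val) of the local block A_k and the usual MPI_Allgatherv metadata (recvcounts/displs); all identifiers are illustrative and not taken from the actual code:

// d_k (length p_k) is the local slice of d; e (length n) is the replicated vector.
MPI_Allgatherv(d_k, p_k, MPI_DOUBLE,
               e, recvcounts, displs, MPI_DOUBLE, MPI_COMM_WORLD);   // d -> e

// w_k := A_k * e, computed row by row on the local block of rows.
for (int i = 0; i < p_k; i++) {
    double wi = 0.0;
    for (int idx = rowptr[i]; idx < rowptr[i + 1]; idx++)
        wi += val[idx] * e[colind[idx]];
    w_k[i] = wi;
}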
DOT products (S2, S6, S8): In this kernel, each process can concurrently compute a partial result (in step S2, P_k calculates ρ_k := <d_k, w_k>). Then, these intermediate values are reduced into a globally-replicated scalar (for example, ρ := β/(ρ_1 + ρ_2 + ··· + ρ_K) in S2). We implement this reduction in MPI using MPI_Allreduce.
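In code, the pattern for S2 boils down to a local partial DOT product followed by a global reduction; the summation order hidden inside MPI_Allreduce is exactly what may vary between runs. A sketch with illustrative names:

double rho_k = 0.0;                       // partial result of process P_k
for (int i = 0; i < p_k; i++)
    rho_k += d_k[i] * w_k[i];

double rho_sum = 0.0;                     // rho_1 + rho_2 + ... + rho_K
MPI_Allreduce(&rho_k, &rho_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
double rho = beta / rho_sum;              // as in S2: rho := beta / <d, w>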
Applying this idea to all the DOT products, there are three process synchronizations per iteration because ρ, β, τ are globally-replicated.
AXPY(-type) vector updates (S3, S4, S7): The AXPY kernel involves two distributed vectors (x and d in S3) and a globally-replicated scalar (ρ in S3). This kernel can be executed concurrently because all processes can perform their local parts of the computation without any communication (P_k : x_k := x_k + ρ d_k).

Application of the preconditioner (S5):
The kernel in step S5 consists in applying the Jacobi preconditioner M. In order to do that, vector r is scaled by the diagonal of the matrix. Here, each process stores a different group of the diagonal elements and also a local piece of the vector r, so that the computations can be done in parallel, i.e., P_k : z_k := M_k^{-1} r_k.

The algorithm with communications is summarized in Figure 2. We can re-arrange the operations to reduce the number of synchronizations in the loop body of the PCG solver, as shown there. Concretely, pushing up step S8 next to step S6, we can simultaneously execute these two reductions by merging them into one and, hence, the number of synchronizations decreases from three to two per iteration of PCG.

In a cluster of multicore processors, a good practice to increase the performance of the codes is to introduce an additional level of parallelism. This level is exploited in each node of the cluster using, for example, OpenMP. The analysis in Aliaga et al. (2017); Barreda et al. (2019) exposes that, in the PCG, a reasonable option is to leverage task-parallelism, which consists in dividing each kernel into a collection of finer-grain operations, or tasks. Then, each thread executes a different task and two consecutive kernels can be executed concurrently, avoiding a thread-synchronization point after each kernel, as described next.

In the following analysis, for simplicity, we merge the execution of S3 with that of S4, and S8 with S6. Therefore, we will only consider kernels S1–S2 and S4–S7 in the loop body of the PCG solver (see Figure 2). Thus, the operations in the solver are interlaced by a series of data dependencies which impose a strict order of execution:

  ··· S7 (iteration l−1) --d--> S1 --w--> S2 --ρ--> S4 --r--> S5 --z--> S6 --β--> S7 (iteration l) --d--> S1 ··· (iteration l+1),

where the variable that generates the dependency is indicated on each dependency arrow.

Exploiting task-parallelism allows some of these kernels to be (partially) computed concurrently (denoted with the symbol "∥"), breaking the strict inter-kernel barriers due to the dependencies; in particular, we aim to attain a parallel execution with S1 ∥ S2 and S4 ∥ S5 ∥ S6.
Compute preconditioner for A → M
Set starting guess x^(0)
Initialize z^(0), d^(0), β^(0), τ^(0), l := 0 (iteration count)
r^(0) := b − A x^(0)
τ^(0) := <r^(0), r^(0)>
while (τ^(l) > τ_max)
    Step | Operation                                    | Kernel
    S1:  | w^(l)   := A d^(l)                           | SpMV
    S2:  | ρ^(l)   := β^(l) / <d^(l), w^(l)>            | DOT product
    S3:  | x^(l+1) := x^(l) + ρ^(l) d^(l)               | AXPY
    S4:  | r^(l+1) := r^(l) − ρ^(l) w^(l)               | AXPY
    S5:  | z^(l+1) := M^(−1) r^(l+1)                    | Apply precond.
    S6:  | β^(l+1) := <z^(l+1), r^(l+1)>                | DOT product
    S7:  | d^(l+1) := (β^(l+1)/β^(l)) d^(l) + z^(l+1)   | AXPY-like
    S8:  | τ^(l+1) := <r^(l+1), r^(l+1)>                | DOT product
    l := l + 1
end while

Figure 1. Formulation of the PCG solver annotated with computational kernels. The threshold τ_max is an upper bound on the relative residual for the computed approximation to the solution. In the notation, <·,·> computes the DOT (inner) product of its vector arguments.
Compute preconditioner for A → M
Set starting guess x
Initialize z, d, β, τ, l := 0
r := b − A x
τ := <r, r>
while (τ > τ_max)
    Step | Operation             | Communication
         | β′ := β               | –
    S1:  | S1.1: d → e           | Allgatherv
         | S1.2: w := A e        | –
    S2:  | ρ := β / <d, w>       | Allreduce
    S3:  | x := x + ρ d          | –
    S4:  | r := r − ρ w          | –
    S5:  | z := M^(−1) r         | –
    S6:  | β := <z, r>           | Allreduce
    S8:  | τ := <r, r>           | Allreduce
    S7:  | d := (β/β′) d + z     | –
    l := l + 1
end while

Figure 2. Message-passing formulation of the PCG solver annotated with communication.
Sparse matrix-vector product S1 ∥ DOT product S2: On the one hand, the local operands to process P_k of the SpMV can be divided as w_k := A_k e:

  [ w̃_1 ]    [ Ã_{1,1}  Ã_{1,2}  ···  Ã_{1,J} ] [ e_1 ]
  [ w̃_2 ]    [ Ã_{2,1}  Ã_{2,2}  ···  Ã_{2,J} ] [ e_2 ]
  [  ⋮  ] := [    ⋮        ⋮      ⋱      ⋮    ] [  ⋮  ]
  [ w̃_I ]    [ Ã_{I,1}  Ã_{I,2}  ···  Ã_{I,J} ] [ e_J ]

Here, we can consider each group of rows as a task, which computes the corresponding SpMV operation to obtain a partial result w̃_i. For example, if we consider w̃_1, there is a task calculating w̃_1 = Σ_{j=1}^{J} Ã_{1,j} e_j.

On the other hand, the computation local to P_k for the DOT product S2 can be decomposed into S tasks. These tasks can be computed concurrently by partitioning the input operands d_k, w_k into S pieces, with each task obtaining a partial result ρ̃_s:

  ρ_k := <d_k, w_k>  ≡  ρ̃_s := <d̃_s, w̃_s>,  s = 1, 2, ..., S.

These partial results are reduced to generate a unique value local to the k-th node, ρ_k := Σ_{s=1}^{S} ρ̃_s, and these local values are thereafter reduced across all K nodes to produce the globally replicated scalar ρ := Σ_{k=1}^{K} ρ_k.

Note that the advantage here is that we can eliminate the dependency S1 → S2 by splitting up these operations into fine-grain tasks. Hence, the execution of some tasks of the second kernel can start as soon as the corresponding results of the previous one are available, resulting in a partially-parallel execution of these two kernels. However, the global reduction required at the end of S2 enforces a task/process synchronization point that is an impediment to extending this idea beyond that point.

AXPY vector update S4 ∥ preconditioner application S5 ∥ DOT product S6: S4–S6 can be computed in parallel by applying a similar division of the three kernels into fine-grain tasks. Nevertheless, again a task/process synchronization is required right after S6.

AXPY vector update S7 and SpMV S1 (subsequent iteration): The convergence test and the requirement to perform the replication d → e at the beginning of each iteration insert a process synchronization that makes the concurrent computation of the local tasks corresponding to these two kernels impossible.

In this subsection we detail how to exploit the described two levels of parallelism via a combination of two parallel programming interfaces: MPI MPI Forum (2019) and OpenMP OpenMP ARB (2019).

We leverage OpenMP tasks to implement task-parallelism. At execution time, the runtime system underlying OpenMP detects data dependencies between tasks, with the help of compiler directives annotated with clauses that indicate the task operands' directionality (input (in), output (out), or both (inout)). Then, a task graph is generated during the execution, which is used to schedule the tasks to the cores, exploiting the inherent task-level parallelism while fulfilling the dependencies embedded in the graph.

As an example, the DOT product, which computes α := x^T y, x, y ∈ R^q, is annotated as shown below.
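The sketch below conveys the idea, assuming the vectors are processed in blocks of an illustrative size bs and that partial results are accumulated atomically into alpha (these details, as well as the exact clauses, are illustrative rather than the production code):

// One fine-grain task per block of the DOT product alpha := x^T y.
// The depend clauses declare the directionality of the operands so that the
// OpenMP runtime can chain this task with the producers/consumers of x, y, alpha.
for (int ib = 0; ib < q; ib += bs) {
    int c = (q - ib < bs) ? (q - ib) : bs;
    #pragma omp task depend(in: x[ib:c], y[ib:c]) depend(inout: alpha) firstprivate(ib, c)
    {
        double partial = 0.0;
        for (int i = ib; i < ib + c; i++)
            partial += x[i] * y[i];
        #pragma omp atomic
        alpha += partial;
    }
}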
For the routine AXPY, y := y + αx, the code snippet using OpenMP tasks is as follows:
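Again as a sketch (block size bs and loop bounds are illustrative):

// Each task updates one block of y := y + alpha * x; x and alpha are inputs,
// the block of y is both read and written (inout).
for (int ib = 0; ib < n; ib += bs) {
    int c = (n - ib < bs) ? (n - ib) : bs;
    #pragma omp task depend(in: x[ib:c]) depend(inout: y[ib:c]) firstprivate(ib, c)
    for (int i = ib; i < ib + c; i++)
        y[i] += alpha * x[i];
}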
The replication of vector d into e is performed across the processes using the MPI collective MPI_Allgatherv, as stated previously. To ensure that all the processes have finalized their computation of d prior to the MPI collective, we introduce a task barrier, using the OpenMP taskwait directive. This creates a task synchronization point because it enforces that all tasks up to that point are completed. Furthermore, this synchronization point is leveraged to perform the convergence test (τ > τ_max?) right after it, which is followed by an implicit MPI synchronization across processes in the MPI collective primitive.

The MPI_Allreduce primitive is used to implement the global reductions. Similarly to the previous case, we insert a task synchronization on the specific variable being reduced before invoking the MPI collectives for the reduction. This ensures that all tasks operating on that variable have been finalized before the reduction across nodes can start. Moreover, atomic updates are employed to accumulate the results from each reduction task (e.g., β̃_n for S8) into the local result (β_k).

The previous description is condensed in Figure 3, focusing on the operations computed by the process P_k during iteration l.

In this section, we present our strategies for ensuring reproducibility of the PCG solver. The first strategy relies on the ExBLAS approach, while the second is derived from it and is based on FPEs. Both strategies are reinforced with programmability components such as the explicit use of fma instructions and a careful re-arrangement of computations. Therefore, the reproducibility of the PCG solver is guaranteed via the reproducibility of its building blocks on each iteration.
Section 2.1 provides an overview of the ExBLAS approach. Here we exploit the ExBLAS parallel reduction in conjunction with the twoprod EFT to derive a hybrid MPI (inter-node, distributed) and OpenMP (intra-node) version of the DOT products appearing in PCG. The intra-node DOT product is presented here, and its distributed part is described in Section 4.3 together with the FPE-based alternative.

For an accurate and reproducible DOT product within an MPI process, we rely upon OpenMP tasks following the ExBLAS approach. We allocate one long accumulator per MPI process as well as, within exblas::cpu::exdot, a vector of FPEs shared among OpenMP threads. Hence, the work on the process-local DOT products is divided into multiple tasks (more than threads) in such a way that the intermediate results from each task are stored in this large vector of FPEs. To complete the local DOT products, we flush all FPEs sequentially into the process-owned long accumulator. Listing 1 outlines the code snippet of this implementation.
Accumulate is presented in Algorithm 4; however, here it also includes the possibility to flush the error to the long accumulator in case there is not enough capacity to store this error.

Figure 3. Dependencies between kernels in the PCG solver (data flow of the tasks of S1–S8 within process P_k across iterations, including the MPI_Allgather and MPI_Allreduce synchronization points and the convergence test).

Listing 1: Process-local DOT product with OpenMP tasks and ExBLAS.
double *fpe = (double *) calloc(NBFPE * omp_get_num_threads(), sizeof(double));
for (int i = 0; i < N_k; i += bm) {
    int cs = N_k - i;
    int c  = cs < bm ? cs : bm;                              // size of the current block/task
    for (int j = i; j < (i + c); j++) {
        double r1;
        double x = TwoProductFMA(a[j], b[j], r1);            // exact product: x + r1 = a[j]*b[j]
        Accumulate(&fpe[omp_get_thread_num() * NBFPE], x);   // accumulate the result into the thread's FPE
        Accumulate(&fpe[omp_get_thread_num() * NBFPE], r1);  // accumulate the error into the thread's FPE
    }
}
for (int i = 0; i < omp_get_num_threads(); i++)
    Flush(&fpe[i * NBFPE]);        // flush each FPE into the process-owned long accumulator
Delivering both correctly rounded and reproducible results, ExBLAS has two major drawbacks Collange et al. (2015). The first drawback is related to the required memory storage, which amounts to n_t × p + acc_s, where n_t is the thread count, p is the size of the floating-point expansion, and acc_s is the size of the superaccumulator (2,098 bits for summation). The second drawback is the number of required operations: for an input vector of size n with dynamic range d, the cost of accumulation is

  n × ⌈d⌉ × C_fpe/VL + n_t × p × VL × C_sa,              when d < p,
  n × (p × C_fpe/VL + C_sa) + n_t × p × VL × C_sa,       otherwise,

where C_fpe = 6 flops (see Algorithm 1) is the cost of the expansion update, VL is the architecture-dependent vector length on SIMD architectures (4 with AVX and 1 on GPUs), and C_sa = 16 flops + 2 indirect memory accesses is the cost of the long-accumulator update. The right-hand side term is the cost of flushing the expansions to long accumulators at the end of the summation and becomes negligible as n increases. These two drawbacks can be observed for compute-intensive kernels, leading to large performance overheads Iakymchuk et al. (2016). However, these drawbacks are either hardly visible or relatively small on bandwidth- and memory-bound operations such as the DOT product (reduction) and potentially PCG, due to the possibility to saturate the bandwidth and hide the cost of the extra computations and memory needs.
We introduce a lightweight strategy for reproducibility using the ExBLAS approach as a starting point. The ExBLAS drawbacks served us as a motivation to design an alternative, cheaper strategy for reproducible computations with accuracy (correct rounding) guarantees. Examining the PCG method for moderately conditioned but largely sparse matrices, like the studied Poisson matrices, we come to the conclusion that the method can successfully accommodate accurate and reproducible computations, ensuring their robustness, using eight floating-point numbers, meaning an FPE of size 8 (FPE8). In fact, the size of the FPEs is tunable and exposed to the end-user. This approach is complemented with the early-exit technique Collange et al. (2015): we stop propagating zero-errors in the FPE. From our experience, the early-exit technique significantly improves performance. According to Hida et al. (2001), FPE8 is capable of representing at least 424 = 8 · 53 bits of significand.

Our main motivation for iterative solvers, where the next iteration corrects the previous estimate, is to provide a good-enough associativity-assuring approach, since the properties provided by ExBLAS get demolished by the next computation/iteration. In fact, we are working on developing a concept of weak reproducibility Imamura et al. (2019) – reproducibility under a certain accuracy guarantee, e.g. defined as the input tolerance, that is not necessarily correct rounding. However, we want all computations within a single iteration to be reproducible. Therefore, the FPE of size 8 with the early-exit technique is generic enough to cover a wide range of problems with various condition numbers and/or dynamic ranges. We also use two different FPEs underneath (one for the results and another for the errors) that are merged at the end of the computations.

We discuss here the DOT product using OpenMP tasks, while the distributed DOT product is presented in the section below. Listing 2 outlines the FPE-based solution: each MPI process allocates a vector of FPEs for each OpenMP thread and invokes a local routine to conduct DOT products on their local copies of vectors of size N_k. This local DOT product routine subdivides the process-local DOT product into tasks of size bm; each task calls a sequential DOT product. This DOT product, as in Listing 1, is composed of the call to the twoprod EFT (Algorithm 2) for the exact multiplication of two floating-point numbers; and, then, the accumulation of the output result and the error into the thread-local FPEs with the help of Algorithm 4, which relies upon the twosum EFT (Algorithm 1). Later, the FPEs with the result and the error are combined into one by invoking Algorithm 3, which calls Algorithm 4 in a loop over the FPE with errors: Accumulate(fpe, fpeerr[i]). To complete the OpenMP DOT product, we perform the process-local reduction on FPEs by sequentially executing Algorithm 3. Finally, we round the FPE-result to the target precision using the NearSum algorithm Rump et al. (2008b), which is described in Section 4.3.

Listing 2: Process-local DOT product with OpenMP tasks and FPEs.
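Only the interface of this routine is fixed by the discussion above; the body below is a sketch under the following assumptions: NBFPE is the compile-time FPE size, results holds one FPE per OpenMP thread, the helpers TwoProductFMA, Accumulate, and fpesum behave as in Listing 1 and Algorithms 2–4, and the task pragma and final sequential merge are illustrative rather than the exact production code.

void bblas_ddot(int bm, int N_k, double *X, double *Y, double *results) {
    for (int i = 0; i < N_k; i += bm) {
        int c = (N_k - i < bm) ? (N_k - i) : bm;        // size of this task's block
        #pragma omp task firstprivate(i, c) shared(X, Y, results)
        {
            double *fpe = &results[omp_get_thread_num() * NBFPE];
            double fpeerr[NBFPE] = {0.0};               // separate FPE for the errors
            for (int j = i; j < i + c; j++) {
                double err;
                double prod = TwoProductFMA(X[j], Y[j], err);   // Algorithm 2
                Accumulate(fpe, prod);                          // Algorithm 4
                Accumulate(fpeerr, err);
            }
            fpesum(fpe, fpeerr, NBFPE);       // merge the error FPE into the result FPE (Algorithm 3)
        }
    }
    #pragma omp taskwait
    for (int t = 1; t < omp_get_num_threads(); t++)      // sequential process-local reduction
        fpesum(&results[0], &results[t * NBFPE], NBFPE);
}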
Algorithm 3: Aggregation of two FPEs of size p.
Function fpesum(a, b, p)
Input: b is a FPE.
Output: a is a FPE containing the result.
for i = 0 → p − 1 do
  Accumulate(a, b[i])
end

Algorithm 4: Adding a floating-point number x to a floating-point expansion a of size p.
Function Accumulate(a, x)
Input: x is a floating-point number.
Output: a is a FPE containing the result.
for i = 0 → p − 1 do
  (a[i], x) := twosum(a[i], x)
end

We re-assure reproducibility of parallel PCG by first examining potential sources of non-deterministic computations and, in addition, presenting our mitigation strategies for them. Note that we target a hybrid MPI + OpenMP tasks implementation of PCG, where each process conducts computations on its own local slices of the matrix as well as the vectors (see Section 3.2 and Figure 2).

DOT products (S2, S6, S8): The main issue of non-determinism emerges from the DOT products and, thus, the parallel reductions such as MPI_Allreduce() that are employed in order to compute the tolerance τ as well as both β and ρ. Hence, we 1) exploit the ExBLAS approach to build a reproducible and correctly-rounded DOT product; 2) construct a DOT product solely based on FPEs; and 3) extend the ExBLAS- and FPE-based DOT products to distributed memory in order to make them suitable for the PCG algorithm in Figure 2. While Sections 4.1 and 4.2 present implementations of the DOT product using OpenMP tasks, i.e. within each MPI process, Listings 3 and 4 provide pseudo-codes for our implementation of the distributed DOT product using the ExBLAS and lightweight strategies, respectively. After carrying out the process-local DOT products, via either the ExBLAS- or the FPE-based implementation, we realise the global reduction by splitting it into three stages:

• MPI_Reduce() acting on either long accumulators or FPEs. For the ExBLAS approach, since the long accumulator is an array of long integers, we apply a regular reduction. Note that we may need to carry out an extra intermediate normalization after the reduction of a certain number of long accumulators, bounded by the K = 64 − 52 = 12 carry-safe bits per digit of the long accumulator. For the FPE approach, we may need to renormalize FPEs using Priest's renormalization method Hida et al. (2001); Priest (1991) and define the MPI operation that is based on the twosum EFT, see Algorithm 3;

• Rounding to double: for long accumulators, we use the ExBLAS-native Round() routine. To guarantee correctly rounded results of the FPE-based computations, we employ the NearSum algorithm from Rump et al. (2008b) for FPEs of size eight or of variable size; it may require a renormalization beforehand.

• MPI_Bcast() to distribute the result of the DOT product to the other processes, as only the master performs the rounding.

Splitting the MPI_Allreduce() operation into MPI_Reduce() and MPI_Bcast() provides us full control of the operation and may even lead to better performance, as noted in Hunold and Carpen-Amarie (2016).

Listing 3: Reproducible Allreduce with ExBLAS.
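A sketch of the ExBLAS variant, assuming the long accumulator is stored as an array of ACC_SIZE 64-bit integers filled by a process-local routine in the spirit of Listing 1, and that Normalize and Round denote the ExBLAS carry-propagation and rounding steps (all names and signatures here are illustrative):

std::vector<int64_t> superacc(ACC_SIZE, 0);
local_exdot(N_k, d_k, w_k, superacc.data());        // OpenMP-task DOT product, cf. Listing 1
Normalize(superacc.data());                          // propagate carries before the reduction
MPI_Reduce((myId == 0) ? MPI_IN_PLACE : superacc.data(), superacc.data(),
           ACC_SIZE, MPI_INT64_T, MPI_SUM, 0, MPI_COMM_WORLD);
double rho = 0.0;
if (myId == 0)
    rho = Round(superacc.data());                    // ExBLAS-native rounding to double
MPI_Bcast(&rho, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);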
Listing 4: Reproducible Allreduce with FPEs only.

double *fpes = (double *) calloc(NBFPE * omp_get_num_threads(), sizeof(double));
bblas_ddot(..., &fpes[0]);                 // process-local DOT product (Listing 2)
renormalize(&fpes[0]);                     // optional
MPI_Op Op;                                 // user-defined reduction operation
MPI_Op_create(fpesum, 1, &Op);
MPI_Reduce((myId == 0) ? MPI_IN_PLACE : &fpes[0], &fpes[0],
           N, MPI_DOUBLE, Op, ...);
if (myId == 0) {
    beta = Round(&fpes[0]);                // NearSum
}
MPI_Bcast(&beta, 1, MPI_DOUBLE, ...);

Sparse matrix-vector product (S1): The other reproducibility issue is hidden in the computation of the sparse matrix-vector product. With the current distributed implementation of this operation, each MPI process computes its dedicated part w_k of the vector w by multiplying a block of rows A_k by the vector e. These process-local multiplications are correspondingly divided into tasks, where each task is responsible for the product of a sub-block of rows by the vector. Since the computations are carried out locally and sequentially, they are deterministic and, thus, reproducible. However, some parts of the code like a += b * c – present in the original implementation of PCG – may not always provide the same result, depending on the compiler optimization strategies.

Our approach to solve this issue is to explicitly instruct compilers to use fma†. Note that the underlying architecture should support fma; otherwise, this may lead to a runtime error. This is possible through the std::fma instruction added to the C++11 language standard. With this option, we avoid non-determinism in the order of operations, reduce the number of rounding errors from three to two, and, therefore, achieve reproducibility for this type of operations. Consequently, we accomplish reproducibility for the sparse matrix-vector multiplication.

AXPY(-type) vector updates (S3, S4, S7): For this type of operations, we rely upon the sequential MKL implementation of AXPY(-type). Alternatively, we can replace this call to MKL's AXPY(-type) by our implementation using fma to ensure correctly-rounded and, hence, reproducible results. This will not impact performance since the algorithm is strictly memory-bound and this type of kernels is not performance-crucial.

Application of the preconditioner (S5): The application of the Jacobi preconditioner is rather simple: first, the inverses of the diagonal elements are computed and then the application of the preconditioner only involves an element-wise multiplication of two vectors. Thus, this part is both correctly rounded and reproducible.

Reproducibility and accuracy of both approaches: It is evident that the results provided by the ExBLAS DOT are both correctly-rounded and reproducible. With the lightweight DOT, we search for the minimal size of FPE such that we still preserve every bit of both the result and the error. For the studied 3D Poisson's equation, a small FPE size suffices and ensures results identical to ExBLAS and to the reference highly accurate solution. However, we also aim to be generic and, hence, we provide the implementation that relies on FPEs of size eight with the early-exit technique. We add a check to the FPE-based implementation for those cases where the condition number and/or the dynamic range are too large and we cannot keep every bit of information. A warning is then raised, offering also a suggestion to switch to the ExBLAS-based implementation.

† This and the other case of y = a*x + b*y are analyzed in more detail in Wiesenberger et al. (2019).
Nonetheless, note that the lightweight implementation is intended for moderately conditioned problems or problems with a moderate dynamic range, in order to be accurate and reproducible, but also high performing, since the ExBLAS version can be very resource demanding. To sum up, if information about the problem is known in advance, it is good to explore the FPE-based implementation.

The experiments in this section employed IEEE 754 double-precision arithmetic and were carried out on two different clusters:

• The MareNostrum4 (MN4) supercomputer at Barcelona Supercomputing Center (BSC): This platform consists of SD530 Compute Racks with an Intel Omni-Path high-performance network interconnect. Each node comprises two 24-core Intel Xeon Platinum 8160 processors (2.10 GHz) and 96 Gbytes of DDR4 RAM. The platform runs the SuSE Linux Enterprise Server operating system. The codes on this platform were compiled using GCC v7.2.0, Intel MPI v2018.1, and MKL v2017.4.

• The Tintorrum cluster at Universitat Jaume I: This is an 8-node cluster, where each node is equipped with two 8-core Intel Xeon E5-2630v3 processors (Haswell-EP) (for a total of 128 cores), running at 2.4 GHz, with 20 MBytes of L3 on-chip cache (LLC, or last level of cache), and with 64 GBytes of DDR3 RAM. The operating system running on the cluster is Linux version 2.6.32-642.4.2.el6.centos.plus.x86_64. The codes were compiled with GCC v5.3.0, OpenMPI v1.10.2, and Intel MKL v2017.1.

For the experimental analysis, we leveraged a sparse s.p.d. coefficient matrix arising from the finite-difference method applied to a 3D Poisson's equation with 27 stencil points. The fact that the vector involved in the SpMV kernel has to be replicated in all MPI ranks constrains the size of the largest problem that can be solved. Given that the theoretical cost of PCG is t_c ≈ nnz + 7n floating-point arithmetic operations, where nnz denotes the number of nonzeros of the original matrix and n its size, the execution time of the method is usually dominated by that of the SpMV kernel. Therefore, in order to analyze the weak scalability of the method, we maintain the number of non-zero entries per node. For this purpose, we modified the original matrix, transforming it into a band matrix, where the lower and upper bandwidths (bandL and bandU, respectively) depend on the number of nodes employed in the experiment as follows:

  bandL = bandU = 100 × nodes  ⟹  nnz = (bandL + bandU + 1) × n.

With 8 nodes in Tintorrum and 16 in MN4, the bandwidth ranges between 100 and 800 on the first platform, and from 100 to 1,600 on the second one. With this approach we can then maintain the number of rows/columns of the matrix equal to n=4,000,000 while increasing its bandwidth and, therefore, the computational workload proportionally to the hardware resources, as required in a weak scaling experiment.

The right-hand side vector b in the iterative solvers was always initialized to the product of A with a vector containing only ones; and the PCG iteration was started with the initial guess x^(0) = 0. The parameter that controls the convergence of the iterative process was fixed for all experiments.

We analyze the performance of two reproducible versions of the PCG algorithm parallelized with MPI: one that relies on the ExBLAS approach, and an alternative variant that is based on floating-point expansions (FPEs) of size eight with the early-exit technique. Hereafter, we will refer to them as Exblas and Opt (or FPE8EE), accordingly.
Our experiments evaluate the strong and weak scaling of these reproducible implementations compared against the regular (non-deterministic) version of PCG; all three versions are implemented with MPI + OpenMP tasks.

We next analyze the performance of the three implementations on the aforementioned clusters. On the one hand, in order to assess the strong scalability, we fix the matrix size to n=16,000,000 and the size of the upper and lower bandwidth to 100, as we increase the number of cores. On the other hand, in order to analyze the weak scalability, we proceed as explained earlier, fixing the matrix size to n=4,000,000 and increasing the bandwidth from 100 to 100 × mnodes (with mnodes=16 in MN4 and mnodes=8 in Tintorrum).

Table 2 reports the total execution time (averaged over 5 different executions) of the different MPI + OpenMP tasks PCG solvers on both platforms, varying the number of cores (from 48 to 768 in MN4 and from 16 to 128 in Tintorrum) as we maintain the problem size. We tested different combinations of MPI processes per node and OpenMP threads per process: the best performing in MN4 is 8 MPI processes with 6 OpenMP threads each, and the optimum combination on Tintorrum is 8 MPI processes with 2 OpenMP threads each. The weak scaling experiment offers notable results as, when executing the algorithms on more than one node (beyond 48 cores in MN4 and beyond 16 cores on Tintorrum) while increasing the problem proportionally, the execution time is maintained. The executions on one node show a different behavior because the communication is in general faster as it entails no inter-node communication. Notably, the extra (local) operations of both the ExBLAS and Opt implementations have a positive effect on scalability at the larger node counts due to a better ratio of computations to communication compared with the original version. The behaviour of the strong scaling experiment could be expected for a parallel algorithm dealing with a sparse linear algebra operation. This experiment in particular reports an important increase of the overhead when the number of nodes becomes large, as the communication cost then dominates the execution time. But the overhead of the reproducible versions decreases due to the favorable ratio between computations and communication.

Execution time in seconds of the implementations in MN4
Number of cores | Weak: Regular | Exblas     | Opt        | Strong: Regular | Exblas     | Opt
48              | 3.5349E+00    | 8.8568E+00 | 7.7153E+00 | 1.3280E+01      | 3.5312E+01 | 2.9730E+01
96              | 3.1697E+00    | 5.9492E+00 | 5.4720E+00 | 7.6761E+00      | 1.8550E+01 | 1.6142E+01
192             | 2.9610E+00    | 4.7935E+00 | 4.5801E+00 | 5.1802E+00      | 1.0523E+01 | 9.3799E+00
384             | 2.8018E+00    | 3.9885E+00 | 3.8810E+00 | 3.9321E+00      | 6.5620E+00 | 6.0571E+00
768             | 3.5905E+00    | 4.7965E+00 | 4.7348E+00 | 3.6662E+00      | 5.0488E+00 | 4.7846E+00

Execution time in seconds of the implementations on Tintorrum
Number of cores | Weak: Regular | Exblas     | Opt        | Strong: Regular | Exblas     | Opt
16              | 8.3203E+00    | 1.4222E+01 | 1.3014E+01 | 3.2747E+01      | 5.7285E+01 | 5.1238E+01
32              | 1.6787E+01    | 2.2833E+01 | 2.1898E+01 | 4.8481E+01      | 7.0335E+01 | 6.8607E+01
64              | 1.8877E+01    | 2.1114E+01 | 2.0992E+01 | 5.8668E+01      | 7.2928E+01 | 7.1930E+01
128             | 1.8322E+01    | 2.0331E+01 | 2.0156E+01 | 6.4591E+01      | 6.8174E+01 | 6.7651E+01

Table 2. Timings of different implementations of the Preconditioned Conjugate Gradient method in MN4 and Tintorrum.
Unfortunately, we cannot evaluate a larger problem to increase the weight of the computational cost, as the problem dimension is constrained by the node memory capacity.

Figure 4 reports the total execution time (averaged over 5 different executions) of the reproducible MPI PCG solvers for the two clusters, normalized with respect to the execution time of the regular MPI version, when we vary the number of cores (from 48 to 768 in MN4 and from 16 to 128 in Tintorrum). Specifically, in the two top plots we present the strong scaling evaluation. In these graphs, we can observe that the difference of both versions with respect to the regular one is higher on a small number of cores, and it decreases with the core count. We observe that the overhead of both the Exblas and Opt implementations compared with the regular version is smooth and decreasing: from 2.66x and 2.24x on the single node to 37.7 % and 30.5 % on 16 nodes of MN4 for Exblas and Opt, respectively; and, on Tintorrum, from 74.9 % and 56.5 % on the single node to 5.6 % and 4.7 % on 8 nodes for Exblas and Opt, accordingly. Moreover, the overhead between the Exblas and Opt versions decreases, e.g. from 18 % to 6 % in MN4, at the large core count: this is due to the very similar implementations of both, since Exblas underneath relies upon FPE8EE for the OpenMP DOT products. Note that such a difference is much larger for the pure MPI implementation Iakymchuk et al. (2019a).

The two bottom graphs in Figure 4 expose the weak scaling evaluation, where we set the number of non-zeros of the sparse matrix to be roughly proportional to the number of cores, increasing the size of the band of the matrix, as discussed in Section 5.1. These results show that both versions offer performance similar to the baseline on the large number of cores. For instance, the overheads are 33.70 % and 31.75 % for the Exblas and Opt implementations in MN4, respectively, and only 11 % and 10 % for Exblas and Opt on Tintorrum, accordingly. As in the strong scaling analysis, the Opt version outperforms the Exblas implementation. If we compare the results on both clusters, we can observe that they are more stable on Tintorrum because the number of cores per node is smaller on this platform than in MN4.

In addition to the performance results, we also report the results of the accuracy and reproducibility evaluation. For that, we developed a generator of ill-conditioned matrices. This generator scales the first row and the first column of the matrix so that the DOT product determines the condition number of the matrix. Additionally, we derive a sequential version of the code that relies on the GNU Multiple Precision Floating-Point Reliable (MPFR) library Fousse et al. (2007) – a C library for multiple (arbitrary) precision floating-point computations on CPUs – as a highly accurate reference implementation. This implementation uses 2,048 bits of accuracy for computing the DOT product (192 bits for the internal product of two floating-point numbers) and performs correct rounding of the computed result to double precision.

Table 3 reports the intermediate and final residuals of the PCG solver for an ill-conditioned matrix with the number of rows/columns equal to n=4,019,679 (159³) and a bandwidth of size 200. The results are exposed with all digits in hexadecimal. For this test, all four implementations took the same number of iterations (47) to converge under the prescribed tolerance. We used one node of MN4 with the processes pinned one per core.
We present only a few iterations, but the difference is present in all iterations. The ExBLAS and Opt implementations deliver both accurate and reproducible results that are identical with those of the MPFR library. Note that these results are identical to the ones from the pure MPI implementations in Iakymchuk et al. (2019a) and only the results of the original code differ. The original code shows a difference of one digit on the initial iteration and of up to five digits on the 45th iteration on 48 cores (8 MPI processes with 6 OpenMP threads each). We also add the results of the original code on one core/process to highlight the reproducibility issue. To show these results, we merge the two columns of the ExBLAS and Opt results as they are identical.

We assume that this discrepancy in accuracy and reproducibility becomes larger at scale (more nodes) due to the stronger impact of the topology and reduction trees.

Figure 4. Analysis of the strong (top) and weak (bottom) scalability of the two reproducible versions of the MPI + OpenMP tasks PCG; the time is normalized with respect to the regular non-deterministic MPI version. (Four panels: strong and weak scalability on MareNostrum4 and Tintorrum, plotting normalized time versus number of cores for ExBLAS and Opt.)

Iteration | Residual: MPFR        | Original              | Original 48 cores     | Exblas & FPE8EE
–         | –                     | …p+49                 | 0x1.19f179eb7f03…p+49 | 0x1.19f179eb7f032p+49
2         | 0x1.f86089ece9f75p+38 | 0x1.f86089ece…p+38    | 0x1.f86089eceaf76p+38 | 0x1.f86089ece9f75p+38
9         | 0x1.fc59a29d329ffp+28 | 0x1.fc59a29d3…p+28    | 0x1.fc59a29d32d1bp+28 | 0x1.fc59a29d329ffp+28
10        | 0x1.74f5ccc211471p+22 | 0x1.74f5ccc…p+22      | 0x1.74f5ccc2…p+22     | 0x1.74f5ccc211471p+22
…         | …                     | …                     | …                     | …
40        | 0x1.7031058eb2e3ep-19 | 0x1.7031058dd6bcfp-19 | 0x1.7031058eaf4c2p-19 | 0x1.7031058eb2e3ep-19
42        | 0x1.4828f76bd68afp-23 | 0x1.4828f76d1aa3p-23  | 0x1.4828f76bda71ap-23 | 0x1.4828f76bd68afp-23
45        | 0x1.8646260a70678p-26 | 0x1.8646260a…p-26     | 0x1.8646260a…p-26     | 0x1.8646260a70678p-26
47        | 0x1.13fa97e2419c7p-33 | 0x1.13fa97e…p-33      | 0x1.13fa97e24…p-33    | 0x1.13fa97e2419c7p-33

Table 3. Accuracy and reproducibility comparison of the intermediate and final residual against MPFR for an ill-conditioned matrix. The matrix is generated following the procedure from Section 5.1 with n=4,019,679 (159³) and a bandwidth of size 200.

We conduct a set of tests using real cases from the SuiteSparse matrix collection. We select matrices with various condition numbers, ranging from 1.49e+04 up to 2.22e+18‡, with as many as one million nonzero elements. Table 4 presents our experimental results on a single node using eight MPI processes and six OpenMP threads per process on the MareNostrum4 cluster. For each test matrix, we report the number of iterations required to reach the prescribed residual tolerance, the direct error computed as in Section 3.5.1 of Golub and Loan (2013), and the total execution time. We have selected the direct error instead of the residual since it is known that a smaller residual does not imply higher accuracy, meaning solutions with a smaller residual might be less accurate. This is confirmed for the first three matrices, for which the residual satisfies the tolerance but the direct error is still large.
The direct error as well as the number of iterations are identical for both the ExBLAS and Opt variants, hence we merge these columns. Our reproducible variants require a smaller number of iterations than the original version for the msc01050, olafu, and bcsstk28 matrices. For instance, for the olafu matrix (1,015,156 nonzero elements and condition number 7.61e+11), both the ExBLAS and Opt variants require 42,342 iterations, while the original version needs 43,046 iterations. In a few cases, our reproducible variants may perform slightly more iterations than the non-reproducible variant due to differences in the accumulation of rounding errors arising from distinct optimizations in the codes. The overhead of our reproducible variants can be as low as roughly 30 % for the 494_bus matrix but can reach almost 3x for the sts4098 matrix (cf. Table 4). This overhead is expected and is in line with the pattern from Figure 4, where the largest overhead is observed on a single node. Moreover, we also run tests using eight MPI processes and two OpenMP threads per process – the Opt and ExBLAS results are again identical in terms of the number of iterations, residuals, and direct errors.

Furthermore, we conduct similar experiments on the Tintorrum cluster. Table 5 presents the results of these experiments. There, we also use one node but four MPI processes and four OpenMP threads per process. These results show a similar trend to that of MareNostrum4: a smaller number of iterations of the ExBLAS and Opt versions for the plat1919, olafu, gyro_k, bcsstk28, and msc04515 matrices; the overhead of the reproducible versions is smallest for the 494_bus matrix but may also grow up to a few times. Notably, both the ExBLAS and Opt versions deliver identical results, excluding timings, on the MareNostrum4 and Tintorrum clusters.

In addition, we conduct experiments using the pure MPI versions of the Reproducible Preconditioned Conjugate Gradient Iakymchuk et al. (2019a) on the MareNostrum4 and Tintorrum clusters, see Tables 6 and 7. We observe that the number of iterations, residuals, direct errors, the final error, and the vector-solutions are identical to those produced by the MPI + OpenMP tasks versions. Hence, our reproducible strategies ensure cross-cluster reproducibility of PCG implemented with pure MPI as well as with the hybrid MPI + OpenMP tasks model.

‡ In practice, selecting coefficient matrices for the linear systems for which cond(A) ≤ 10¹² would have been more realistic due to the limits of double precision arithmetic.

To enhance reproducibility, Intel proposed the "Conditional Numerical Reproducibility" (CNR) option in its Math Kernel Library (MKL). Although CNR guarantees reproducibility, it does not ensure correct rounding, meaning the accuracy is arguable. Additionally, the cost of obtaining reproducible results with CNR is high. For instance, for large arrays the MKL summation with CNR was almost 2x slower than the regular MKL summation on the Mesu cluster hosted at Sorbonne University Collange et al. (2015).

Demmel and Nguyen implemented a family of algorithms – originating from the works by Rump, Ogita, and Oishi Rump et al. (2010, 2008b,a) – for reproducible summation in floating-point arithmetic Demmel and Nguyen (2013, 2015). These algorithms always return the same answer. They first compute an absolute bound of the sum and then round all numbers to a fraction of this bound.
In consequence, the addition of the rounded quantities is exact; however, the computed sum obtained with their implementations with two or three bins is not correctly rounded. Their results showed only a small overhead in large-scale runs (CPUs only) compared with the Intel MKL dasum(), but a slowdown of a few times on a single node. Ahrens, Nguyen, and Demmel extended this concept to a few other reproducible BLAS routines, distributed as the ReproBLAS library§; however, only the reproducible reduction is available in a parallel version. Furthermore, the ReproBLAS effort was extended to a reproducible tall-skinny QR factorization Nguyen and Demmel (2015).

§ http://bebop.cs.berkeley.edu/reproblas/

The other approach to ensure reproducibility is ExBLAS, initially proposed by Collange, Defour, Graillat, and Iakymchuk in Collange et al. (2015). ExBLAS is based on combining long accumulators and floating-point expansions in conjunction with error-free transformations; this approach is presented in Section 2.1. Collange et al. (2015) showed that their algorithms for reproducible and accurate summation incur a modest overhead on 32 nodes and an even smaller one on 16 cores (one node). While ExSUM covers a wide range of architectures as well as distributed-memory clusters, the other routines primarily target GPUs. Exploiting the modular and hierarchical structure of linear algebra algorithms, the ExBLAS approach was also applied to construct reproducible LU factorizations with partial pivoting Iakymchuk et al. (2019b).

Recently, Mukunoki and Ogita presented their approach to reproducible BLAS, called OzBLAS Mukunoki et al. (2020), with tunable accuracy. This approach differs from both ReproBLAS and ExBLAS as it does not require implementing every BLAS routine from scratch but instead relies on high-performance (vendor) implementations. OzBLAS implements the Ozaki scheme Ozaki et al. (2012), which follows a fork-join approach: the matrix and vector are split (each element is sliced) into sub-matrices and sub-vectors so that the partial products can be computed without error or overflow; then, the high-performance BLAS is called on each of these splits; finally, the results are merged back using, for instance, the NearSum algorithm. Currently, the OzBLAS library includes the dot product, the matrix-vector product (gemv), and the matrix-matrix multiplication (gemm). These algorithmic variants and their implementations on GPUs and CPUs (only dot) ensure reproducibility of the BLAS kernels and make the accuracy tunable up to correctly-rounded results.

In this work, we addressed the reproducibility of iterative solvers for sparse linear systems using a representative instance, the Preconditioned Conjugate Gradient method. We first analyzed the hybrid MPI + OpenMP tasks implementation of the PCG method and identified two major sources of non-deterministic behavior, namely the DOT product and compiler optimizations. The latter may change the order of operations or replace some of them in favor of the fused multiply-add (fma) operation. For a reproducible, double-layered (intra- and inter-node) distributed DOT product, we leveraged the ExBLAS approach and also proposed an alternative lightweight variant based solely on FPEs.
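To give a flavour of the building blocks shared by the ExBLAS-based and the FPE-only variants, the following sketch combines the classic TwoSum error-free transformation Knuth (1969) with a short floating-point expansion. It is only an illustration under our own simplifications (fixed expansion length, no long accumulator, no early-exit optimization), not the code of either implementation:

    #include <array>

    // TwoSum (Knuth): s = fl(a + b) and the exact rounding error e,
    // so that a + b == s + e holds in IEEE-754 double precision.
    inline double two_sum(double a, double b, double& e) {
        double s = a + b;
        double t = s - a;
        e = (a - (s - t)) + (b - t);
        return s;
    }

    // Accumulate x into a floating-point expansion of N components:
    // the rounding error of each level is propagated to the next one,
    // so no information is lost as long as the expansion is long enough.
    template <int N>
    void fpe_accumulate(std::array<double, N>& fpe, double x) {
        for (int i = 0; i < N; ++i) {
            double err;
            fpe[i] = two_sum(fpe[i], x, err);
            x = err;              // carry the error to the next component
            if (x == 0.0) break;  // nothing left to propagate
        }
        // In ExBLAS, any residual x would be flushed to the long accumulator.
    }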
Table 4. Evaluation of different MPI + OpenMP tasks implementations of the PCG method using test matrices from the SuiteSparse matrix collection on one MareNostrum4 node with eight MPI processes and six OpenMP threads per process.

Matrix | Nonzeros | cond(A) | Iterations: Orig / Opt&ExBLAS | Direct error: Orig / Opt&ExBLAS | Time [secs]: Orig / Opt / ExBLAS
plat1919 | 32,399 | 2.22e+18 | 28200 / 28347 | 0x1.d1fd2948ac992p+4 / 0x1.d1fd1980ddc3dp+4 | 2.89e+00 / 5.79e+00 / 6.74e+00
msc01050 | 26,198 | 9.00e+15 | 1459 / 1441 | 0x1.fe62a1a8f70acp-7 / 0x1.c06963286be9ap-7 | 1.35e-01 / 2.27e-01 / 2.48e-01
mhd4800b | 27,520 | 1.03e+14 | 32 / 32 | 0x1.171f2d2a7c15dp-7 / 0x1.171f2cf90e554p-7 | 3.76e-03 / 1.10e-02 / 1.30e-02
olafu | 1,015,156 | 7.61e+11 | 43046 / 42342 | 0x1.4683bc1ddab86p-28 / 0x1.3c603001a3c8fp-28 | 3.54e+01 / 6.84e+01 / 7.60e+01
gyro_k | 1,021,159 | 1.10e+09 | 16064 / 16075 | 0x1.7f300f81c65c7p-26 / 0x1.7ef8863fde778p-26 | 1.41e+01 / 2.78e+01 / 3.10e+01
bcsstk28 | 219,024 | 6.28e+09 | 5592 / 5483 | 0x1.dd4e900472e3dp-31 / 0x1.ba21beeb93c43p-31 | 1.25e+00 / 2.51e+00 / 2.85e+00
bcsstk13 | 83,883 | 5.64e+08 | 2571 / 2571 | 0x1.03a9e5339d79fp-34 / 0x1.5e81334a6997bp-34 | 3.66e-01 / 6.47e-01 / 7.37e-01
sts4098 | 72,356 | 3.56e+07 | 666 / 668 | 0x1.ad954060aba53p-38 / 0x1.5a2eda56201fbp-38 | 8.87e-02 / 2.28e-01 / 2.61e-01
494_bus | 1,666 | 3.89e+06 | 410 / 410 | 0x1.22befca9188e3p-35 / 0x1.b83293969f70bp-36 | 3.29e-02 / 4.36e-02 / 5.05e-02
msc04515 | 97,707 | 4.78e+05 | 4883 / 4885 | 0x1.41d64ef6a77bfp-32 / 0x1.4b0c4d82346dap-32 | 6.85e-01 / 1.80e+00 / 2.07e+00
bcsstk27 | 56,126 | 1.49e+04 | 331 / 331 | 0x1.3f45a221626fdp-40 / 0x1.2b65be5099a69p-40 | 3.66e-02 / 5.94e-02 / 6.60e-02

Table 5. Evaluation of different MPI + OpenMP tasks implementations of the PCG method using test matrices from the SuiteSparse matrix collection on one Tintorrum node with four MPI processes and four OpenMP threads per process.

Matrix | Nonzeros | cond(A) | Iterations: Orig / Opt&ExBLAS | Direct error: Orig / Opt&ExBLAS | Time [secs]: Orig / Opt / ExBLAS
plat1919 | 32,399 | 2.22e+18 | 29133 / 28347 | 0x1.d1fccb48b0708p+4 / 0x1.d1fd1980ddc3dp+4 | 2.21e+00 / 6.07e+00 / 7.65e+00
msc01050 | 26,198 | 9.00e+15 | 1441 / 1441 | 0x1.20ef1ec5aba0fp-6 / 0x1.c06963286be9ap-7 | 1.07e-01 / 2.26e-01 / 2.74e-01
mhd4800b | 27,520 | 1.03e+14 | 32 / 32 | 0x1.171f2d405896bp-7 / 0x1.171f2cf90e554p-7 | 4.35e-02 / 1.06e-01 / 8.98e-02
olafu | 1,015,156 | 7.61e+11 | 44309 / 42342 | 0x1.580f68bcf7c59p-28 / 0x1.3c603001a3c8fp-28 | 5.80e+01 / 1.10e+02 / 1.28e+02
gyro_k | 1,021,159 | 1.10e+09 | 16623 / 16075 | 0x1.70c410f76c1f3p-26 / 0x1.7ef8863fde778p-26 | 2.12e+01 / 4.24e+01 / 4.98e+01
bcsstk28 | 219,024 | 6.28e+09 | 5485 / 5483 | 0x1.b33c0d0a65819p-31 / 0x1.ba21beeb93c43p-31 | 1.59e+00 / 3.50e+00 / 4.17e+00
bcsstk13 | 83,883 | 5.64e+08 | 2554 / 2571 | 0x1.1f45539d3ad76p-34 / 0x1.5e81334a6997bp-34 | 3.87e-01 / 8.04e-01 / 9.58e-01
sts4098 | 72,356 | 3.56e+07 | 666 / 668 | 0x1.75b26e5cc575ep-38 / 0x1.5a2eda56201fbp-38 | 1.02e-01 / 3.00e-01 / 3.69e-01
494_bus | 1,666 | 3.89e+06 | 409 / 410 | 0x1.8a70c6145af0bp-33 / 0x1.b83293969f70bp-36 | 1.76e-02 / 3.31e-02 / 4.14e-02
msc04515 | 97,707 | 4.78e+05 | 5138 / 4885 | 0x1.318ee7cc28729p-32 / 0x1.4b0c4d82346dap-32 | 8.34e-01 / 2.43e+00 / 2.98e+00
bcsstk27 | 56,126 | 1.49e+04 | 331 / 331 | 0x1.287d86ae5b307p-40 / 0x1.2b65be5099a69p-40 | 3.34e-02 / 6.53e-02 / 7.81e-02
Table 6. Evaluation of different pure MPI implementations Iakymchuk et al. (2019a) of the PCG method using test matrices from the SuiteSparse matrix collection on one MareNostrum4 node with 48 MPI processes.

Matrix | Nonzeros | cond(A) | Iterations: Orig / Opt&ExBLAS | Direct error: Orig / Opt&ExBLAS | Time [secs]: Orig / Opt / ExBLAS
plat1919 | 32,399 | 2.22e+18 | 28404 / 28347 | 0x1.d1fd13459efb2p+4 / 0x1.d1fd1980ddc3dp+4 | 2.09e+00 / 7.25e+00 / 1.15e+01
msc01050 | 26,198 | 9.00e+15 | 1449 / 1441 | 0x1.1e960e3dd96adp-6 / 0x1.c06963286be9ap-7 | 1.18e-01 / 3.51e-01 / 4.95e-01
mhd4800b | 27,520 | 1.03e+14 | 32 / 32 | 0x1.171f2d513bf1fp-7 / 0x1.171f2cf90e554p-7 | 5.17e-03 / 9.00e-03 / 2.16e-02
olafu | 1,015,156 | 7.61e+11 | 44872 / 42342 | 0x1.284b2460347acp-28 / 0x1.3c603001a3c8fp-28 | 1.38e+01 / 2.22e+01 / 7.77e+01
gyro_k | 1,021,159 | 1.10e+09 | 16577 / 16075 | 0x1.76957c0ecf952p-26 / 0x1.7ef8863fde778p-26 | 5.29e+00 / 8.71e+00 / 3.14e+01
bcsstk28 | 219,024 | 6.28e+09 | 5736 / 5483 | 0x1.e9c64a28a93dfp-31 / 0x1.ba21beeb93c43p-31 | 8.43e-01 / 1.70e+00 / 3.69e+00
bcsstk13 | 83,883 | 5.64e+08 | 2571 / 2571 | 0x1.03a9e5339d79fp-34 / 0x1.5e81334a6997bp-34 | 2.32e-01 / 5.07e-01 / 3.01e+00
sts4098 | 72,356 | 3.56e+07 | 666 / 668 | 0x1.ad954060aba53p-38 / 0x1.5a2eda56201fbp-38 | 6.44e-02 / 1.85e-01 / 1.47e+00
494_bus | 1,666 | 3.89e+06 | 410 / 410 | 0x1.98d21a409a23cp-36 / 0x1.b83293969f70bp-36 | 4.47e-02 / 9.29e-02 / 1.20e-01
msc04515 | 97,707 | 4.78e+05 | 4883 / 4885 | 0x1.41d64ef6a77bfp-32 / 0x1.4b0c4d82346dap-32 | 4.89e-01 / 1.47e+00 / 1.18e+01
bcsstk27 | 56,126 | 1.49e+04 | 331 / 331 | 0x1.3f45a221626fdp-40 / 0x1.2b65be5099a69p-40 | 2.02e-02 / 4.50e-02 / 1.22e-01

Table 7. Evaluation of different pure MPI implementations Iakymchuk et al. (2019a) of the PCG method using test matrices from the SuiteSparse matrix collection on one Tintorrum node with 16 MPI processes.

Matrix | Nonzeros | cond(A) | Iterations: Orig / Opt&ExBLAS | Direct error: Orig / Opt&ExBLAS | Time [secs]: Orig / Opt / ExBLAS
plat1919 | 32,399 | 2.22e+18 | 28225 / 28347 | 0x1.d1fd2b1f22f89p+4 / 0x1.d1fd1980ddc3dp+4 | 1.05e+00 / 2.27e+00 / 1.21e+01
msc01050 | 26,198 | 9.00e+15 | 1440 / 1441 | 0x1.91eb8c4cf549ep-7 / 0x1.c06963286be9ap-7 | 5.85e-02 / 1.01e-01 / 4.14e-01
mhd4800b | 27,520 | 1.03e+14 | 32 / 32 | 0x1.171f2d071bbecp-7 / 0x1.171f2cf90e554p-7 | 2.05e-03 / 4.82e-03 / 3.28e-02
olafu | 1,015,156 | 7.61e+11 | 44840 / 42342 | 0x1.73ee2c0ee4f91p-28 / 0x1.3c603001a3c8fp-28 | 2.08e+01 / 3.39e+01 / 1.55e+02
gyro_k | 1,021,159 | 1.10e+09 | 16518 / 16075 | 0x1.7c04191d8b8d9p-26 / 0x1.7ef8863fde778p-26 | 8.31e+00 / 1.38e+01 / 6.28e+01
bcsstk28 | 219,024 | 6.28e+09 | 5640 / 5483 | 0x1.d8e49bef6a637p-31 / 0x1.ba21beeb93c43p-31 | 5.58e-01 / 1.06e+00 / 5.46e+00
bcsstk13 | 83,883 | 5.64e+08 | 2337 / 2571 | 0x1.040b38e017aeep-34 / 0x1.5e81334a6997bp-34 | 1.36e-01 / 2.68e-01 / 1.23e+00
sts4098 | 72,356 | 3.56e+07 | 668 / 668 | 0x1.77b31324422c1p-38 / 0x1.5a2eda56201fbp-38 | 4.36e-02 / 9.92e-02 / 6.02e-01
494_bus | 1,666 | 3.89e+06 | 410 / 410 | 0x1.f52f8f4c274dp-37 / 0x1.b83293969f70bp-36 | 1.21e-02 / 1.63e-02 / 5.78e-02
msc04515 | 97,707 | 4.78e+05 | 4874 / 4885 | 0x1.2dfe5e95c1703p-32 / 0x1.4b0c4d82346dap-32 | 3.26e-01 / 7.56e-01 / 4.80e+00
bcsstk27 | 56,126 | 1.49e+04 | 331 / 331 | 0x1.2c11a8f939c1dp-40 / 0x1.2b65be5099a69p-40 | 1.45e-02 / 2.53e-02 / 1.04e-01

Both strategies split the MPI_Allreduce routine into the combination of MPI_Reduce and MPI_Bcast, and perform the intra-node DOT product with FPEs. To tackle compiler interference in the computations, we restructure the computations as well as explicitly invoke fma instructions. Both approaches deliver identical results on the two clusters and ensure the reproducibility of PCG in the number of iterations, the intermediate and final residuals, the direct errors, as well as the solution-vector, both for the example of a 3D Poisson's equation discretized with a 27-point stencil and for several test matrices from the SuiteSparse matrix collection.
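A minimal sketch of this reduction scheme and of the explicit fma is given below; it assumes an FPE stored as a plain array of doubles and uses helper names of our own, so it illustrates the idea rather than reproducing the routines shipped in ReproCG_MPI_OMP:

    #include <mpi.h>
    #include <cmath>

    enum { FPE_LEN = 8 };  // length of the expansion (our assumption)

    // Merge two expansions with TwoSum so that no rounding error is lost;
    // because every bit is kept, the outcome does not depend on the order
    // in which the MPI library combines the partial results.
    void fpe_merge(void* in, void* inout, int* len, MPI_Datatype* type) {
        (void)len; (void)type;  // the whole buffer is treated as one FPE
        double* a = static_cast<double*>(in);
        double* b = static_cast<double*>(inout);
        for (int k = 0; k < FPE_LEN; ++k) {
            double x = a[k];
            for (int i = 0; i < FPE_LEN && x != 0.0; ++i) {
                double s = b[i] + x;
                double t = s - b[i];
                double e = (b[i] - (s - t)) + (x - t);
                b[i] = s;
                x = e;
            }
        }
    }

    // Deterministic global dot product: the usual MPI_Allreduce is replaced
    // by MPI_Reduce with the user-defined merge followed by MPI_Bcast, so
    // every rank rounds the very same expansion to the final double.
    double reduce_fpe(double local[FPE_LEN], MPI_Comm comm) {
        double global[FPE_LEN] = {0.0};
        MPI_Op op;
        MPI_Op_create(fpe_merge, /*commute=*/1, &op);
        MPI_Reduce(local, global, FPE_LEN, MPI_DOUBLE, op, 0, comm);
        MPI_Bcast(global, FPE_LEN, MPI_DOUBLE, 0, comm);
        MPI_Op_free(&op);
        double dot = 0.0;
        for (int k = FPE_LEN - 1; k >= 0; --k) dot += global[k];  // simple final rounding
        return dot;
    }

    // Explicitly requesting the fma removes the compiler's freedom to contract
    // (or not) a*x + y, which would otherwise change the rounding between builds.
    double axpy_term(double a, double x, double y) { return std::fma(a, x, y); }

Alternatively, contraction can be disabled globally, for example with -ffp-contract=off in GCC/Clang, when an fma-free evaluation is preferred.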
On a single node, the FPE- and ExBLAS-based reproducible versions of PCG show a maximum overhead of 2.24x and 2.66x, respectively, due to the additional memory allocation and computations. When the communication starts to dominate the execution time, both versions show very low overhead compared with the original non-deterministic implementation: 37.70 % for ExBLAS and 30.50 % for Opt on 768 cores of MareNostrum4, and 5.6 % for ExBLAS and 4.6 % for Opt on 128 cores of Tintorrum. This is a solid argument in favor of the reproducible PCG at scale. The code is available at https://github.com/riakymch/ReproCG_MPI_OMP.

Our study promotes the adoption of reproducibility by design through the proper choice of the underlying libraries as well as a moderate programmability effort. A brief guideline would be: 1) for fundamental numerical computations, leverage reproducible underlying libraries such as ExBLAS, ReproBLAS, or OzBLAS; and 2) analyze the algorithm and make it reproducible by eliminating any source of uncertainty that exposes non-associativity, such as reductions and the use/non-use of fmas. Additionally, we argue for the need of bit-wise reproducible and correctly-rounded results for iterative solvers, even though these results are refined during subsequent iterations until the desired tolerance is reached and, thus, their bit-wise accuracy is not exploited in full.

Our future work aims to conduct a deeper analysis of the lightweight approach to support our experimental results. One idea is to bind the length of the FPEs to the condition number of the input problem and/or its dynamic range, similarly to Carson and Higham (2018) for the mixed-precision direct linear solver.

Acknowledgements
To begin with, we would like to thank the reviewers for their valuable comments and suggestions. This research was partially supported by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement via the Robust project No. 842528, as well as by the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the H2020 EC RIA Programme; in particular, the author gratefully acknowledges the support of Vicenç Beltran and the computer resources and technical support provided by BSC. The researchers from Universitat Jaume I (UJI) and Universitat Politècnica de València (UPV) were supported by MINECO project TIN2017-82972-R. Maria Barreda was also supported by the POSDOC-A/2017/11 project from the Universitat Jaume I.

References
Aliaga JI, Barreda M, Flegar G, Bollhöfer M and Quintana-Ortí ES (2017) Communication in task-parallel ILU-preconditioned CG solvers using MPI+OmpSs. Concurrency and Computation: Practice and Experience.
Barreda M, Aliaga J, Beltran V and Casas M (2019) Iteration-fusing conjugate gradient for sparse linear systems with MPI + OmpSs. Journal of Supercomputing. DOI:10.1007/s11227-019-03100-4.
Barrett R, Berry M, Chan TF, Demmel J, Donato J, Dongarra J, Eijkhout V, Pozo R, Romine C and van der Vorst H (1994) Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM.
Burgess N, Goodyer C, Lutz DR and Hinds CN (2019) High-precision anchored accumulators for reproducible floating-point summation. IEEE Transactions on Computers.
Carson E and Higham NJ (2018) Accelerating the solution of linear systems by iterative refinement in three precisions. SIAM J. Sci. Comput. 40(2): A817-A847.
Collange S, Defour D, Graillat S and Iakymchuk R (2015) Numerical reproducibility for the parallel reduction on multi- and many-core architectures. Parallel Computing 49: 83-97.
Dekker TJ (1971) A floating point technique for extending the available precision. Numerische Mathematik 18: 224-242.
Demmel J and Nguyen HD (2013) Fast reproducible floating-point summation. In: Proceedings of ARITH-21. IEEE, pp. 163-172.
Demmel J and Nguyen HD (2015) Parallel reproducible summation. IEEE Transactions on Computers.
Dongarra JJ, Du Croz J, Hammarling S and Duff IS (1990) A set of level 3 basic linear algebra subprograms. ACM TOMS.
Fousse L, Hanrot G, Lefèvre V, Pélissier P and Zimmermann P (2007) MPFR: A multiple-precision binary floating-point library with correct rounding. ACM TOMS.
Golub GH and Van Loan CF (2013) Matrix Computations. 4th edition. Baltimore: The Johns Hopkins University Press.
Gropp W, Hoefler T, Thakur R and Lusk E (2014) Using Advanced MPI: Modern Features of the Message-Passing Interface. MIT Press.
Hida Y, Li XS and Bailey DH (2001) Algorithms for quad-double precision floating point arithmetic. In: Proceedings of ARITH-15. pp. 155-162. DOI:10.1109/ARITH.2001.930115.
Hunold S and Carpen-Amarie A (2016) Reproducible MPI benchmarking is still not as easy as you think. IEEE Transactions on Parallel and Distributed Systems.
Iakymchuk R, Collange S, Defour D and Graillat S (2015) ExBLAS: Reproducible and accurate BLAS library. In: Proceedings of the NRE2015 workshop held as part of SC15, Austin, TX, USA, November 15-20, 2015.
Iakymchuk R, Defour D, Collange S and Graillat S (2016) Reproducible and accurate matrix multiplication. Springer LNCS.
Iakymchuk R, Collange S, Defour D and Graillat S (2017) ExBLAS (Exact BLAS) library. Available on the WWW, https://exblas.lip6.fr/. Accessed 31-JAN-2019.
Iakymchuk R, Barreda M, Graillat S, Aliaga JI and Quintana-Ortí ES (2019a) Reproducibility strategies for parallel preconditioned conjugate gradient. Journal of Computational and Applied Mathematics. Available online 2nd January 2020. DOI:10.1016/j.cam.2019.112697. HAL preprint: hal-02391618.
Iakymchuk R, Graillat S, Defour D and Quintana-Ortí ES (2019b) Hierarchical approach for deriving a reproducible unblocked LU factorization. IJHPCA.
IEEE Computer Society (2008) IEEE Standard for Floating-Point Arithmetic. IEEE Standard 754-2008.
Imamura T, Mukunoki D, Iakymchuk R, Jézéquel F and Graillat S (2019) Numerical reproducibility based on minimal-precision validation. In: the CRE2019 workshop held as part of SC19, Denver, CO, USA, November 17-22, 2019.
Knuth DE (1969) The Art of Computer Programming: Seminumerical Algorithms, volume 2. Addison-Wesley.
Kulisch U (2013) Computer Arithmetic and Validity: Theory, Implementation, and Applications, de Gruyter Studies in Mathematics, volume 33. 2nd edition. Berlin: Walter de Gruyter & Co.
Kulisch U and Snyder V (2011) The exact dot product as basic tool for long interval arithmetic. Computing.
Lawson CL, Hanson RJ, Kincaid DR and Krogh FT (1979) Basic linear algebra subprograms for Fortran usage. ACM TOMS.
Lutz DR and Hinds CN (2017) High-precision anchored accumulators for reproducible floating-point summation. In: Proceedings of ARITH-24. IEEE, London, UK, pp. 98-105.
Mukunoki D, Ogita T and Ozaki K (2020) Accurate and reproducible BLAS routines with Ozaki scheme for many-core architectures. In: Proc. International Conference on Parallel Processing and Applied Mathematics (PPAM 2019), Lecture Notes in Computer Science, volume 12043. pp. 516-527.
Nguyen HD and Demmel J (2015) Reproducible tall-skinny QR. In: Proceedings of ARITH-22. pp. 152-159. DOI:10.1109/ARITH.2015.28.
Ogita T, Rump SM and Oishi S (2005) Accurate sum and dot product. SIAM J. Sci. Comput. 26: 1955-1988.
OpenMP ARB (2019) The OpenMP API specification for parallel programming.
Ozaki K, Ogita T, Oishi S and Rump SM (2012) Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications. Numerical Algorithms.
Rump SM, Ogita T and Oishi S (2008a) Accurate floating-point summation part I: Faithful rounding. SIAM J. Sci. Comput. https://doi.org/10.1137/050645671.
Rump SM, Ogita T and Oishi S (2008b) Accurate floating-point summation part II: Sign, k-fold faithful and rounding to nearest. SIAM J. Sci. Comput.
Rump SM, Ogita T and Oishi S (2010) Fast high precision summation. Nonlinear Theory and Its Applications, IEICE.
Saad Y (2003) Iterative Methods for Sparse Linear Systems. Philadelphia, PA, USA: SIAM.
Wiesenberger M, Einkemmer L, Held M, Gutierrez-Milla A, Saez X and Iakymchuk R (2019) Reproducibility, accuracy and performance of the FELTOR code and library on parallel computer architectures. Computer Physics Communications.