[PDF] Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing - Researchain

Abstract

In this paper, we present Ginkgo, a modern C++ math library for scientific high performance computing. While classical linear algebra libraries act on matrix and vector objects, Ginkgo's design principle abstracts all functionality as "linear operators", motivating the notation of a "linear operator algebra library". Ginkgo's current focus is oriented towards providing sparse linear algebra functionality for high performance GPU architectures, but given the library design, this focus can be easily extended to accommodate other algorithms and hardware architectures. We introduce this sophisticated software architecture that separates core algorithms from architecture-specific back ends and provide details on extensibility and sustainability measures. We also demonstrate Ginkgo's usability by providing examples on how to use its functionality inside the MFEM and deal.ii finite element ecosystems. Finally, we offer a practical demonstration of Ginkgo's high performance on state-of-the-art GPU architectures.

Full PDF

11 Ginkgo: A Modern Linear Operator Algebra Framework for HighPerformance Computing

HARTWIG ANZT,

Karlsruhe Institute of Technology and Innovative Computing Laboratory, University ofTennessee

TERRY COJEAN,

Karlsruhe Institute of Technology

GORAN FLEGAR,

Universidad Jaime I

FRITZ G ¨OBEL,

Karlsruhe Institute of Technology

THOMAS GR ¨UTZMACHER,

Karlsruhe Institute of Technology

PRATIK NAYAK,

Karlsruhe Institute of Technology

TOBIAS RIBIZEL,

Karlsruhe Institute of Technology

YUHSIANG MIKE TSAI,

Karlsruhe Institute of Technology

ENRIQUE S. QUINTANA-ORT´I,

Universitat Polit`ecnica de Val`enciaIn this paper, we present Ginkgo, a modern C++ math library for scientific high performance computing.While classical linear algebra libraries act on matrix and vector objects, Ginkgo’s design principle abstractsall functionality as “linear operators,” motivating the notation of a “linear operator algebra library.” Ginkgo’scurrent focus is oriented towards providing sparse linear algebra functionality for high performance GPUarchitectures, but given the library design, this focus can be easily extended to accommodate other algorithmsand hardware architectures. We introduce this sophisticated software architecture that separates core algo-rithms from architecture-specific backends and provide details on extensibility and sustainability measures.We also demonstrate Ginkgo’s usability by providing examples on how to use its functionality inside theMFEM and deal.ii finite element ecosystems. Finally, we offer a practical demonstration of Ginkgo’s highperformance on state-of-the-art GPU architectures.CCS Concepts: •

Mathematics of computing → Mathematical software; • Computing methodologies → Massively parallel algorithms; • Software and its engineering → Software creation and management;

Additional Key Words and Phrases: High Performance Computing, Healthy Software Lifecycle, Multicore andManycore Architectures

ACM Reference format:

Hartwig Anzt, Terry Cojean, Goran Flegar, Fritz G¨obel, Thomas Gr¨utzmacher, Pratik Nayak, Tobias Ribizel,Yuhsiang Mike Tsai, and Enrique S. Quintana-Ort´ı. 2016. Ginkgo: A Modern Linear Operator AlgebraFramework for High Performance Computing. 1, 1, Article 1 (January 2016), 30 pages.DOI: 10.1145/nnnnnnn.nnnnnnn

With the rise of manycore accelerators, such as graphics processing units (GPUs), there is anincreasing demand for linear algebra libraries that can efficiently transform the massive hardwareconcurrency available in a single compute node into high arithmetic performance. At the sametime, more and more application projects adopt object-oriented software designs based on C++.In this paper, we present the result from our effort toward the design and development of Ginkgo,a next-generation, high performance sparse linear algebra library for multicore and manycorearchitectures. The library combines ecosystem extensibility with heavy, architecture-specific kernel a r X i v : . [ c s . M S ] J u l :2 H. Anzt et al. Library InfrastructureAlgorithm Implementations • Iterative Solvers • Preconditioners • … Core

OpenMP-kernels • SpMV • Solver kernels • Precond kernels • … OpenMP

Reference kernels • SpMV • Solver kernels • Precond kernels • … Reference

CUDA-GPU kernels • SpMV • Solver kernels • Precond kernels • … CUDA

HIP-GPU kernels • SpMV • Solver kernels • Precond kernels • … HIP

Library core contains architecture-agnostic algorithm implementation;Runtime polymorphism selects the right kernel depending on the target architecture;Architecture-specific kernels execute the algorithm on target architecture;Reference are sequential kernels to check correctness of algorithm design and optimized kernels; Optimized architecture-specific kernels; • Shared kernels

Common

Fig. 1. Ginkgo library architecture separating the core containing the algorithms from architecture-specificbackends. optimization using the platform-native languages CUDA (for NVIDIA GPUs), HIP (for AMD GPUs),and OpenMP (for general-purpose multicore processors, such as those from Intel, AMD or ARM).The software development cycle that drives Ginkgo ensures production-quality code by featuringunit testing, automated configuration and installation, Doxygen code documentation, as well as acontinuous integration and continuous benchmarking framework. Ginkgo is an open source effortlicensed under the BSD 3-clause. The object-oriented Ginkgo library is constructed around two principal design concepts. Thefirst principle, aiming at future technology readiness, is to consequently separate the numericalalgorithms from the hardware-specific kernel implementation to ensure correctness (via comparisonwith sequential reference kernels), performance portability (by applying hardware-specific kerneloptimizations), and extensibility (via kernel backends for other hardware architectures), see Figure 1.The second design principle, aiming at user-friendliness, is the convention to express functionalityin terms of linear operators: every solver, preconditioner, factorization, matrix-vector product, andmatrix reordering is expressed as a linear operator (or composition thereof).The rest of the paper is organized as follows. In Section 2, we leverage a simple use case tomotivate the design choices underlying Ginkgo, and elaborate on the concept of linear operators,memory management, hardware-specific kernel optimization, and event logging. Section 3 providesadditional details on Ginkgo’s current solvers, realizations for the sparse matrix-vector product(SpMV) kernel, and preconditioner capabilities. Section 4 elaborates on how the design allows foreasy extension, so that users can contribute new algorithmic technology or additional hardwarebackends. As many applications are in desperate need for high performance sparse linear algebratechnology, Section 5 showcases the usage of Ginkgo as a backend library in scientific applications,and also reviews Ginkgo’s integration into the extreme-scale Software Development Kit (xSDK). https://opensource.org/licenses/BSD-3-Clause, Vol. 1, No. 1, Article 1. Publication date: January 2016. inkgo: A Modern Linear Operator Algebra Framework for High Performance Computing 1:3 stop solverpreconditionermatrix solverpreconditioner stop log PolymorphicObjectAbstract::FactoryIteration::FactoryTime::FactoryResidualNormReduction::FactoryCombined::Factory Bicg::FactoryBicgstab::FactoryCg::FactoryCgs::FactoryFcg::FactoryGmres::FactoryIr::FactoryLowerTrs::FactoryUpperTrs::FactoryIsai::FactoryJacobi::FactoryIlu < ParIct > ::FactoryIlu < ParIlu > ::FactoryIlu < ParIlut > ::FactoryLinOpCooCsrDenseEllHybridIdentityPermutationSellpSparsityCsr BicgBicgstabCgsFcgGmresIrUpperTrsLowerTrsCg + create_default()+ clone()+ copy_from()+ clear()+ apply()+ convert_to()+ move_to()+ get_system_matrix()+ get_preconditioner()+ set_preconditioner()+ get_parameters()+ get_executor()+ get_size()+ add_logger()+ remove_logger() IsaiJacobiIlu < ParIct > Ilu < ParIlu > Ilu < ParIlut > CriterionCombinedIterationResidualNormReductionTime + update()+ check()+ create_default()+ clone()+ copy_from()+ clear()+ get_parameters()+ get_executor()+ add_logger()+ remove_logger()

ExecutorCudaExecutorHipExecutorOmpExecutorReferenceExecutorLoggerConvergencePapiRecordStream

Fig. 2. Ginkgo’s class hierarchy showcasing the main namespaces (colored boxes) and classes (gray boxes)for Ginkgo.

In Section 6 we describe how Ginkgo’s design and development cycle promotes sustainable softwaredevelopment; and in Section 7, we offer representative performance results indicating Ginkgo’scompetitiveness for sparse linear algebra on high-end GPU architectures. We conclude in Section 8with a summary of the paper and the potential of the library design becoming a role model forfuture developments.

Figure 2 displays Ginkgo’s rich class hierarchy together with its main namespaces and classes.To better understand the role of each object, this section introduces Ginkgo’s interface using aminimal, concrete example as a starting point, and gradually presenting more advanced abstractionsthat demonstrate Ginkgo’s high composability and extensibility. These abstractions include: • the LinOp and

LinOpFactory classes which are used to implement and compose linearalgebra operations, • the Executor classes that allow transparent algorithm execution on multiple devices; and • other utilities such as the Criterion classes, which control the iteration process, as wellas the memory passing decorators that allow fine-grained control of how memory objectsare passed between different components of the library and the application. , Vol. 1, No. 1, Article 1. Publication date: January 2016. :4 H. Anzt et al.

Figure 3 illustrates the specific flowchart Ginkgo uses to solve a linear system, highlighting theinteractions between Ginkgo’s classes. In the program code for this example given in Listing 1, thesystem matrix A , the right-hand side b , and the initial solution guess x , are initially read from thestandard input using Ginkgo’s ‘read’ utility (lines 9–11). Next, the program creates a factory for aCG Krylov solver preconditioned with a block-Jacobi scheme (lines 13–15). The solver is configuredto stop either after 20 iterations or having improved the original residual by 15 orders of magnitude(lines 16–19). (Stopping criteria are further discussed in Section 2.5.) The system matrix is bound tothe iterative solver, which is used to solve the system with the right-hand side and initial guess.The initial guess is overwritten with the computed solution (line 23). Solvers (and more generally LinOp and

LinOpFactory ) are discussed in detail in Section 2.2. Finally, the solution is printed tothe standard output (line 25). int main() { // Instantiate a CUDA executor auto cuda = gko:: CudaExecutor :: create(0, gko:: OmpExecutor :: create ()); // Read data auto A = gko::read >(std::cin , cuda); auto b = gko::read >(std::cin , cuda); auto x = gko::read >(std::cin , cuda); // Create the solver factory auto solver_factory = gko:: solver ::Cg <>:: build () .with_preconditioner(gko:: preconditioner ::Jacobi <>:: build ().on(cuda)) .with_criteria( gko::stop:: Iteration :: build ().with_max_iters (20u).on(cuda), gko::stop:: ResidualNormReduction <>:: build () .with_reduction_factor (1e-15) .on(cuda)) .on(cuda); // Create the solver from the factory and solve the system solver_factory ->generate(gko::give(A))->apply(gko::lend(b), gko::lend(x)); // Write result write(std::cout , gko::lend(x)); } Listing 1. A minimal example that uses Ginkgo to solve a linear system. The system matrix, right-hand side,and the initial solution guess are read from the standard input. The system is solved on an NVIDIA-enabledGPU using the CG method enhanced with a block-Jacobi preconditioner. Two stopping criteria are combinedto limit the maximum number of iterations and set the desired relative error. The solution is written to thestandard output.

Ginkgo supports execution on GPU and CPU architectures using different backends (currently,CUDA, HIP, and OpenMP). To accommodate this, when creating an object, the user passes aninstance of an

Executor in order to specify where the data for that object should be stored and theoperations on that data should be performed. The particular example in Listing 1 creates a CUDA

Executor (line 7) that employs the first GPU device (the one returned by cudaGetDevice(0) ).Since CUDA GPU accelerators are controlled by the CPU, an OpenMP

Executor is needed toorchestrate the execution on the GPU. (Section 2.3 describes the executors model in more detail.)Ginkgo avoids expensive memory movement and copies. At the same time, sharing data betweendifferent modules in the code might cause unexpected results (e.g., one module changes a matrixused by a solver in a different module, which causes that solver to tackle the wrong system). Ginkgoresolves the dilemma by allowing both shared and exclusive (unique) ownership of the objects.This comes at the price of some verbosity in argument passing: in most cases, plain argumentscannot be passed directly, but have to be wrapped in special “decorator” functions that specify inwhich “mode” they are passed (shared, copied, etc.). , Vol. 1, No. 1, Article 1. Publication date: January 2016. inkgo: A Modern Linear Operator Algebra Framework for High Performance Computing 1:5

Input A Input bStopping CriterionPreconditioner

Solver ClassSolver FactoryObjectSolver ObjectwithStopping CriterionSolution Vector

CREATEGENERATEAPPLY

File Data in programSystem MatrixObject

READ “VIEW”

Right Hand SideObject FileData in program

READ“VIEW”

Criterion FactoryObjectCriterion Class

CREATE

PreconditionerFactoryObjectPreconditionerClass

CREATE

Fig. 3. Flowchart providing an alternative view of the code example shown in Listing 1. All object interactionsare represented by arrows. The colors correspond to the type of the objects following the color convention inFigure 2.

The minimal example in Listing 1 already utilizes two of the decorator functions, gko::give and gko::lend , both in line 23. The first one, gko::give(A) , causes the caller to yield the ownership ofmatrix A to the solver, leaving the caller’s version of A in a valid, but undefined state (e.g., accessingany of its methods is not defined, but the object can still be de-allocated or assigned to). The seconddecorator, appearing twice, in gko::lend(x) and gko::lend(b) , “lends” objects x and b to thesolver by temporarily passing ownership to it until the control flow returns from apply back to thecaller. This is a special ownership mode that is only used when the callee does not need permanentownership of the object. Different ownership modes, as well as their relation to std::move arediscussed in Section 2.4. LinOp and

LinOpFactory

Ginkgo exposes an application programming interface (API) that allows toeasily combine different components for the iterative solution of linear systems: solvers, matrixformats, preconditioners, etc. The API enables running distinct iterative solvers and enhancing thesolvers with different types of preconditioners. A preconditioner can be a matrix or even anothersolver. Furthermore, the system matrix does not need to be stored explicitly in memory, but canbe available only as a function that is applied to a vector to compute a matrix-vector product(matrix-free). The objective of providing a clean and easy-to-use interface mandates that all thesespecial cases are uniformly realized in the API.The central observation that guides Ginkgo’s design is that the operations and interactionsbetween the solver, the system matrix, and the preconditioner can be represented as the applicationof linear operators : , Vol. 1, No. 1, Article 1. Publication date: January 2016. :6 H. Anzt et al. (1) The major operation that an iterative solver performs on the system matrix A is the multi-plication with a vector (realized as a Matrix-Vector product, or MV). This operation canbe viewed as the application of the induced linear operator L A : z (cid:55)→ Az . In some cases,multiplication with the transpose is also needed, which is yet another application of a linearoperator L A T : z (cid:55)→ A T z .(2) The solver itself solves a system Ax = b , which is the application of the linear operator S A : b (cid:55)→ A − b ( = x ) . Here, the term “solver” is not used to denote a function f that takes A and b as inputs and produces x , but instead a function with the system matrix A alreadyfixed (that is, S A = f ( A , ·) ).(3) The application of the preconditioner M , as in v = M − u , can be viewed as the applicationof the linear operator P M : u (cid:55)→ M − u ( = v ) .There are several remarks that have to be made regarding the observations above. First, in thecontext of numerical computations, with finite precision arithmetic, the term “linear operator”should be understood loosely. In fact, none of the previous categories strictly satisfy the linearitydefinition of the linear operator: L ( αx + βy ) = αL ( x ) + βL ( y ) , where α , β are scalars and x , y denotevectors. Instead, they are just approximations of the linear operators that satisfy the formula L ( αx + βy ) = αL ( x ) + βL ( y ) + E , where the error term E = E ( L , α , β , x , y ) is the result of one ormore of the following effects:(1) rounding errors introduced by storing non-representable values in floating-point format;(2) rounding errors introduced by finite-precision floating-point arithmetic;(3) instability and inaccuracy of the method used to apply the linear operator to a vector; and(4) inexact operator application, e.g. only few iterations of an iterative linear solver.The data layout and the implementation of any linear operator is internal to that operator, andthe interface does not expose implementation details. For example, a direct solver could store itsmatrix data in factored form, as two triangular factors (e.g., A = LU ) and implement its applicationas two triangular solves (with L and U ). In contrast, an iterative solver could just store the originalsystem matrix, and the entire implementation of the method could be a part of the linear operatorapplication. Nonetheless, both operators can still expose the same public interface. LinOp . In coherence with the observations in Section 2.2.1, the central abstraction inGinkgo’s design is the abstract class (interface)

LinOp , which represents the mathematical conceptof a linear operator. All concrete linear operators (solvers, matrix formats, preconditioners) are in-stances of

LinOp . Furthermore, this generic operator L exposes a pure virtual method apply(b, x) that is overridden by a concrete linear operator with an implementation that computes the result x = L ( b ) with conformal dimensions for L , x and b , where vectors are interpreted as dense matricesof dimension n ×

1. This design enables that a single interface can be leveraged to compute an MVwith different matrix formats, the application of distinct types of preconditioners, the solution oflinear systems using various solvers, or even the application of a user-defined linear operator.Using the

LinOp abstraction, an iterative solver can be implemented via references to other

LinOp s that represent the system matrix and the preconditioner. The solver does not have to beaware of the type of the matrix or the preconditioner — it is sufficient to know that they are bothconformal with the

LinOp interface. This means that the same implementation of the solver canbe configured to integrate various preconditioners and matrices. Furthermore, the linear operatorabstraction can also be used to compose “cascaded” solvers where the preconditioner can be replacedby another, less accurate solver, or even to create matrix-free methods by supplying a specializedoperator as the system matrix, without explicitly storing the matrix. , Vol. 1, No. 1, Article 1. Publication date: January 2016. inkgo: A Modern Linear Operator Algebra Framework for High Performance Computing 1:7

LinOpFactory . LinOp exposes a uniform interface to different types of linear algebraoperations. A missing piece in the puzzle is how these

LinOp s are created in the first place. Forexample, in order to solve a system with a matrix A represented by the linear operator L A , anoperation has to be provided which, given the operator L A , creates a solver operator S A . Similarly, tocreate a preconditioner P A for a matrix A , an operator that maps L A to P A is needed. These are bothexamples of higher-order (non-linear) functions that map linear operators to other linear operators(in this case Σ : L A (cid:55)→ S A and Φ : L A (cid:55)→ P A ). Ginkgo provides an abstract class LinOpFactory thatrepresents mappings such as Σ and Φ . Concretely, the class LinOpFactory provides an abstractmethod generate(LinOp) which, given a linear operator from the domain of the mapping, returnsthe corresponding

LinOp from its input.The linear operators constructed by using operator factories are usually solvers and precondi-tioners. For example, in order to construct a BiCGSTAB solver operator that solves a problemwith the system matrix A , represented by the operator L A , one would first create a BiCGSTABfactory (which implements the LinOpFactory interface and represents the operator S ); and thencall generate on S , passing the operator L A as input, to obtain a BiCGSTAB operator S A , with thesystem matrix, A .Some factories are designed to be combined with other factories. For instance, to create aniterative refinement solver, which uses CG preconditioned with Jacobi as the inner solver, onewould create an iterative refinement factory S , and as the inner solver factory, pass a CG factoryconstructed with a Jacobi factory as the preconditioner factory. Then, when calling the generate method on S with the system matrix represented by a linear operator L A , this linear operator ispropagated to the CG and Jacobi factories, to create CG and Jacobi operators with the system matrix A . Instead of using LinOpFactory , an alternative (and more obvious) approach would have beento just use the constructor of

LinOp to provide all the “component” linear operators. However,this alternative presents the drawback that the “type” of the operator cannot be decoupled fromits data. To illustrate this, consider the scenario of a solver S which tackles a linear system usingthe LU factorization; and then invokes two triangular solvers on the resulting L and U factors.There are multiple algorithms for the solution of the triangular systems, which in Ginkgo arerepresented by different linear operators. Thus, the operators to use should somehow be passedas input parameters to the solver S . The problem is that they cannot be constructed outside of S ,since their factors are not known at that point. LinOpFactory provides an elegant solution to thisproblem, since instead of a

LinOp , the solver S can be provided with linear operator factories, whichare then used to construct the triangular solver operators once the factors L and U are known. After the previous elaboration on

LinOp and

LinOpFactory , it istimely to re-visit the example in Listing 1. The objects A , b and x in lines 9–11 are LinOp objectsthat store their data as “matrices” in CSR (compressed sparse row [27]) and dense matrix formats,respectively. Calling the method apply on these objects has the effect of calculating the matrix-vector product using that data. The solver factory object (defined in lines 13–21), is actuallya compound

LinOpFactory used to create a solver with the CG method. In this particular case,the CG solver is preconditioned with a block-Jacobi method (specified by providing a block-Jacobifactory as the preconditioner factory to the CG factory).All the work actually occurs in line 23. First, the CG factory solver factory is used to gen-erate a linear operator object representing the CG solver by calling the generate method. Since solver factory has a block-Jacobi factory set as the preconditioner factory, the solver factory ’s generate method invokes generate on the block-Jacobi factory; and the system matrix A is passedas input argument, which has the effect of generating a block-Jacobi preconditioner operator for , Vol. 1, No. 1, Article 1. Publication date: January 2016. :8 H. Anzt et al. that matrix. Then, the resulting linear operator is immediately used to solve the system by applyingit on b . This will have the effect of iterating the CG solver preconditioned with the generatedblock-Jacobi preconditioner operator on the system matrix A , thus solving the system. Traditional linear algebra libraries, such as BLAS [26] and LA-PACK [9], use vectors and matrices as basic objects, and provide operations such as matrix productsand the solution of linear systems on these objects as functions. In contrast, Ginkgo achievescomposability and extensibility (cf. Section 4) by treating linear operations as basic objects, andproviding methods to manipulate these operations in order to express the desired complex operation.This is the principle guiding the design of Ginkgo, which motivates the title of this paper: whileother libraries can be characterized as “linear algebra libraries”, Ginkgo’s algebra is performed onlinear operators, making it a “linear operator algebra library”.While the current focus of Ginkgo is on the iterative solution of sparse linear systems, other typesof operations on linear operators also fit into Ginkgo’s concept of

LinOp and

LinOpFactory . Forexample, a matrix factorization A = UV can be viewed as a linear operator factory Ψ : L A (cid:55)→ F U , V ,where the linear operator F U , V : b (cid:55)→ UV b stores the two factors U and V , and provides publicmethods to access the factors. An appealing feature of Ginkgo is the ability to run code on a variety of device architecturestransparently. In order to accommodate this functionality, Ginkgo introduces the

Executor class atits core. In consequence, the first task a user has to do when using Ginkgo is to create an

Executor .The

Executor specifies the memory location and the execution space of the linear algebra objectsand represents computational capabilities of distinct devices. Currently, four executor types areprovided: • CudaExecutor for CUDA-enabled GPUs; • HipExecutor for HIP-enabled GPUs; • OmpExecutor for OpenMP execution on multicore CPUs; and • ReferenceExecutor for sequential execution on CPUs (used for correctness checking).Each of these executors implements methods for allocating/deallocating memory on the devicetargeted by that executor, copying data between executors, running operations, and synchronizingall operations launched on the executor.Listing 1 illustrated the use of

Executor . Combined with the gko::clone(Executor, Object) utility function, the

Executor class makes it straight-forward to move all data and operations to ahost OpenMP executor, as in Listing 2. That code creates an gko::OmpExecutor object for executionon the CPU (line 1). Next, a CUDA executor representing a GPU device with ID 0 is created (line 2);and the system matrix data is read from a file and allocated on the gko::CudaExecutor ’s devicememory (line 4). Finally, the function gko::clone creates a copy of A on the gko::OmpExecutor ,that is, in the platform’s main memory (line 6). auto omp = gko:: OmpExecutor :: create (); auto cuda = gko:: CudaExecutor :: create(0, omp); // As in previous example , A is allocated on a CUDA device auto A = gko::read >("data/A.mtx", cuda); // copy A to an OpenMP -capable device auto A_copy = gko:: clone(omp , A); // All subsequent operations triggered from A_copy will use executor omp Listing 2. Copy of a matrix in CSR format from a CUDA device to a CPU through the

OmpExecutor . , Vol. 1, No. 1, Article 1. Publication date: January 2016. inkgo: A Modern Linear Operator Algebra Framework for High Performance Computing 1:9 benchmark cmake common core cuda dev_tools doc examples hip include omp reference test_install third_party test matrix solver factorization stop basegtestgflags scriptstestmatrixbase components solver preconditioner factorization stop test matrix solverbaselog device_hooks devices preconditioner factorization stop test matrix solver preconditioner stop base test matrix base components solver preconditioner stop ginkgo components matrixsolver preconditioner base external−lib−interfacing performance−debugging simple−solver−logging custom−logger poisson−solver preconditioned−solver papi−logging simple−solver scripts headers pages conf utils solver spmv conversions Fig. 4. Code distribution among different modules in Ginkgo develop version 1.1.1. The entire code base inthis release is 8.0 MB (represented by the entire figure). The top level rectangles represent different top-leveldirectories; these are: the core (1.4 MB) module, examples (1.2 MB), the

HIP module (928 KB), the

CUDA module (920 KB), the reference module (924 KB), the include directory with the core module’s public headers(852 KB), the omp module (612 KB), the common directory which contains shared HIP and CUDA kernels(388 KB) and the benchmark (244 KB) and doc directories (212 KB each). The first rectangles in the core , CUDA , HIP , omp and reference modules represent unit tests for these modules, which amount to 644, 396, 400,308 and 612 KB, respectively. In order to allow a transparent execution of operations on multiple executors, the kernels inGinkgo have separate implementations for each executor type, organized into several modules,see Figure 1 and Figure 4 for the code distribution, respectively. The core module contains all classdefinitions and non-performance critical utility functions that do not depend on an executor. Inaddition, there is a module for each executor, which contains the kernels and utilities specific forthat executor. Each module is compiled as a separate shared library, which allows to mix-and-matchmodules from different sources. This paves the road for hardware vendors to provide their ownproprietary modules: they only have to optimize their module, make it available in binary form,and users can then link it with Ginkgo. We note that the similarities between HIP and CUDA allowthe usage of common template kernels that are identical in kernel design but are compiled with , Vol. 1, No. 1, Article 1. Publication date: January 2016. :10 H. Anzt et al. architecture-specific parameters to either the

HipExecutor or the

CudaExecutor . This strategyreduces code replication and favors productivity and maintainability.Ginkgo contains dummy kernel implementations of all modules that throw an exception when-ever they are called. This allows a user to deactivate certain modules if no hardware support isavailable or to reduce compilation time. In general, during the configuration step, Ginkgo’s auto-matic architecture detection activates all modules for which hardware support has been detected.The

Executor design allows switching the target device where the solver in Listing 1 is executedthrough a one-line change that replaces the executor used for it. In addition, if one of the argumentsfor the apply method is not on the same executor as the operator being applied, the library willtemporarily move that argument to the correct executor before performing the operation, andreturn it back once the operation is complete. Even though this is done automatically, the usermay attain higher performance by explicitly moving the arguments in order to avoid unnecessarycopies (in the case, for example, of repeated kernel invocation).

Libraries have to specify several key memory management aspects: memory allocation, datamovement and copy, and memory deallocation. In contrast to traditional libraries such as BLAS andLAPACK, which leave memory management to the user, Ginkgo allocates/deallocates its memoryautomatically, using the C++ “Resource Acquisition Is Initialization” (RAII ) concept combinedwith the native allocation/deallocation functions of the executor (cf. Section 2.3). Alternatively, toeliminate unnecessary allocations and data copies, Ginkgo’s matrix formats can be configured touse raw data already allocated and managed by the application by using Array views .A more difficult problem is to realize data movement and copies between different entities of theapplication (e.g., functions and other objects). The memory management has to not only protectagainst memory leaks or invalid memory deallocations, but also avoid unnecessary data copies.The problem is usually solved by specifying a well-defined owner for each object, responsible fordeallocating the object once it is no longer needed.For simple C++ types, this behavior is enabled via the use of parameter qualifiers: Parametersare passed by-value and thus copied unless explicitly declared as references (which is when theyare passed by-reference without copying). The C++11 standard added move semantics as a thirdalternative where an input parameter that is either explicitly (using std::move ) or implicitly (bynot having a name) designated a temporary value may move its internal data into the functionwithout copying, leaving it in a valid but unspecified state. However, trying to pass polymorphicobjects by-value would lead to object slicing [3]. In Ginkgo, we avoid these issues with polymorphictypes like

Executor and

LinOp by always passing and returning them as pointers. To this goal,we use the smart pointer types std::unique ptr and std::shared ptr , which were added in theC++11 standard. They provide safe resource management using RAII while still providing (almost)the same semantics as raw pointers. Ginkgo uses pointers for parameters and return types in threedifferent contexts, where we say that a function parameter is used in a non-owning context if theobject will only be used during the function call, and in an owning context if the object needs tobe accessible even after the function call completed. Figure 5 shows the different ways to pass apolymorphic object as a parameter in Ginkgo.Functions that only need to modify a polymorphic object in a non-owning context take thisobject as a raw pointer parameter T* . To simplify the interaction with smart pointers, Ginkgoprovides the overloaded gko::lend function which returns the underlying raw pointer for bothsmart and raw pointers. This decorator function allows for a concise and uniform way to pass https://en.cppreference.com/w/cpp/language/raii, Vol. 1, No. 1, Article 1. Publication date: January 2016. inkgo: A Modern Linear Operator Algebra Framework for High Performance Computing 1:11 Owner Callee no access

Give

Owner Callee

Clone copy

Share

Owner Callee workno more accessafter completion

Lend

Fig. 5. Different ways of passing polymorphic objects as parameters in Ginkgo: gko::clone , gko::lend , gko::give , gko::share together with the lifetime of the passed object. polymorphic objects to functions without ownership transfer. “Lending” an object can be comparedwith normal by-reference semantics for value types. When by-value semantics are necessary, wecan explicitly pass a copy using gko::lend(gko::clone( · )) .Functions that need to receive a polymorphic object in an owning context take this object asa std::shared ptr . We can pass an object to such a parameter in three ways: gko::clone creates a copy of the current object to be passed to the function ( by-value ), gko::give specifies thatthe object will not be used afterwards and can thus be moved into the function ( move semantics )and gko::share specifies that the ownership should be shared with the function ( by-reference ).Note that the gko::share annotation can usually be left out, since all owning smart pointers in C++already provide conversions to std::shared ptr .Functions that create new instances of a polymorphic object return a std::unique ptr ,while access to already existing objects is provided with std::shared ptr to allow theobjects to be used in both owning and non-owning contexts.The overloaded decorator functions gko::clone , gko::lend , gko::give and gko::share pro-vide a uniform interface for all types of smart and raw pointers, while still ensuring type safety.For example, calling gko::give with a non-owning pointer will fail to compile and output anappropriate error message. Virtually all iterative methods include the concept of a “stopping criterion” that evaluates whetherthe current approximation to the solution of the linear systems is accurate enough. To facilitatecontrolling the iteration process, Ginkgo provides a collection of stopping criteria. All of themare implementations of the base

Criterion class, which specifies what type of information can bepassed to the stopping criterion. A concrete criterion provides an implementation of the check() method that verifies if its condition has been met and, therefore, the iteration process has to bestopped.The stopping criteria are initially generated from criterion factories (created by the user) bypassing the system matrix, right-hand side, and an initial guess. In addition, during the iterationprocess, information can be updated when calling the check() function with the new iterationcount, residual, solution or residual norm.Currently, three basic stopping criteria are provided in Ginkgo: • The

Time criterion, which automatically stops the iteration process after a certain amountof time; • the Iteration criterion, which stops the iteration process once a certain iteration counthas been reached; and , Vol. 1, No. 1, Article 1. Publication date: January 2016. :12 H. Anzt et al. • the ResidualNormReduction criterion, which stops the iteration process once the initialrelative residual norm has been reduced by the certain specified amount.Additionally, Ginkgo provides a

Combined criterion, which can be used to combine multiplecriteria together through a logical– OR operation ( | ), so that the first subcriterion that is fulfilledstops the iteration process. This is illustrated in lines 16–19 of Listing 1. This design implies somestopping criteria may detain the iteration process before “convergence” is reached, in particular the Time and

Iteration criteria. Ginkgo provides a stopping status class, which can be inspectedto find out which criterion stopped the iteration process.The

Criterion class hierarchy is designed to avoid negative impact on the performance, andmay even improve it. For example, in case an iterative method is applied with multiple right-handside vectors, the stopping status is evaluated for each right-hand side individually, skippingvector updates in subsequent iterations for those right-hand side vectors where convergence hasbeen achieved.Also, all operations required to control the iteration process can be handled inside the

Criterion classes. The consequence is that, for most solvers, the residual norm and related operations arecomputed only when using the

ResidualNormReduction criterion. Therefore, the user can combinea solver with a simple stopping criterion to make it more lightweight or choose a more precisebut more expensive stopping criterion. In summary, Ginkgo’s design of stopping criteria tries tohonor the C++ philosophy of “only paying for what you use”.

Another utility that is provided to users in Ginkgo is the logging of events with the purpose torecord information about Ginkgo’s execution. This covers many aspects of the library, such asmemory allocation, executor events,

LinOp events, stopping criterion events, etc. For ease of use,the event logging tools provide different forms of output formats, and allow the usage of multipleloggers at once. As with the rest of Ginkgo, this tool is designed to be controllable, extensible,and as lightweight as possible. To offer support for all those capacities, the Logger infrastructurefollows the visitor and observer design patterns [22]. This design implies a minimal impact oflogging on the logged classes and allows to accommodate any logger.The following four loggers are currently provided in Ginkgo: • the Stream logger, which logs the events to a stream (e.g., file, screen, etc.); • the Record logger, which stores the events in a structure which has a history of all receivedevents that the user can retrieve at any moment; • the Convergence logger is a simple mechanism that stores the relative residual norm andnumber of iterations of the solver on convergence; and • the PAPI SDE logger uses the PAPI Software Defined Events backend [24] in order to enableaccess to Ginkgo’s internal information through the PAPI interface and tools.Almost every class in Ginkgo possesses multiple corresponding logging events. The loggedclasses are:

Executor , Operation , PolymorphicObject , LinOp , LinOpFactory and

Criterion .The user has the freedom to choose whether he/she wants to log all events or select only some ofthem. When an event is not selected for logging by the user, as a result of the implementation ofthe logging facilities, the event is not propagated and generates a “no-op”. , Vol. 1, No. 1, Article 1. Publication date: January 2016. inkgo: A Modern Linear Operator Algebra Framework for High Performance Computing 1:13

Currently, Ginkgo provides a list of Krylov solvers (BICG, BiCGSTAB, CG, CGS, FCG, GMRES) forthe iterative solution of sparse linear systems, fixed-point methods, and direct solvers for sparsetriangular systems such as those that appear in incomplete factorization preconditioning. In orderto generate a solver, a solver factory (of type

LinOpFactory ) must first be created, where solvercontrol parameters, such as the stopping criterion, are set. The concrete solver is then generatedby binding the system matrix to the solver factory. This allows to generate multiple solvers fordistinct problems with the same solver settings, e.g. in time-stepping methods. Except for IterativeRefinement (IR), where the internal solver can be chosen, all iterative solvers have the option toattach a preconditioner of the class

LinOp . Furthermore, all solvers implement the abstract

LinOp interface, which not only simplifies the solver usage, but also allows to use the same notation forcalling solvers, preconditioners, SpMV, etc. This allows the user to compose iterative solvers bychoosing another iterative solver as a preconditioner.

Ginkgo allows any solver to be used as a preconditioner, i.e., to cascade Krylov solvers. Additionally,Ginkgo features diagonal scaling preconditioners (block-Jacobi) as well as incomplete factorization(ILU-type) preconditioners. As any of the other solvers, preconditioners are generated through a

LinOpFactory and implement the abstract class

LinOp .The block-Jacobi preconditioners can switch between a “standard” mode and an “adaptiveprecision” mode [14]. In the latter case, the memory precision is decoupled from the arithmeticprecision, and the storage format for each inverted diagonal block is optimized to preserve thenumerical properties while reducing the memory access cost [21].The ILU-based preconditioners can be generated by interfacing vendor libraries, via the ParILUalgorithm [18], or via a variant known as the ParILUT algorithm [13] that dynamically adapts thesparsity pattern of the incomplete factorization to the problem characteristics [17].For the application of an ILU-type preconditioner, Ginkgo leverages two distinct solvers: one forthe lower triangular matrix L and one for the upper triangular matrix U . The default choices arethe direct lower and upper triangular solvers but the user can change this to use iterative triangularsolves.In Listing 3 we illustrate how an ILU preconditioner can be customized in almost all aspects.In this case, we select a CGS solver for solving the upper triangular system by first creating thefactory in lines 18–23 and then attaching it to the preconditioner factory in lines 26–28. Insteadof relying on the internal generation of the incomplete factors, we generate them ourselves inlines 13–15. Afterwards, we generate the ILU preconditioner in line 29. In the end, we employ thenow already generated preconditioner in line 40 with a BiCGSTAB solver. , Vol. 1, No. 1, Article 1. Publication date: January 2016. :14 H. Anzt et al. int main() { // Instantiate a CUDA executor auto cuda = gko:: CudaExecutor :: create(0, gko:: OmpExecutor :: create ()); // Read data auto A = gko::read >(std::cin , cuda); auto b = gko::read >(std::cin , cuda); auto x = gko::read >(std::cin , cuda); // Generate ILU (0) factorization auto ilu_factorization = gko:: factorization ::ParIlu <>:: build ().on(cuda) ->generate(A); // Create a custom upper solver factory auto upper_solver_factory = gko:: solver ::Cgs <>:: build () .with_criteria( gko::stop:: ResidualNormReduction <>:: build () .with_reduction_factor (1e-5) .on(cuda)) .on(cuda); // Create an ILU preconditioner factory with a CGS upper solver auto ilu_factory = gko:: preconditioner ::Ilu , gko:: solver ::Cgs <>>::build () .with_u_solver_factory(gko:: share(upper_solver_factory)) .on(cuda); auto ilu_prec = ilu_factory ->generate(gko:: share(ilu_factorization)); // Create the solver factory with ILU preconditioning auto solver_factory = gko:: solver ::Bicgstab <>:: build () .with_criteria( gko::stop:: ResidualNormReduction <>:: build () .with_reduction_factor (1e-15) .on(cuda), .with_generated_preconditioner(gko:: share(ilu_prec)) .on(cuda); // Create the solver from the factory and solve the system solver_factory ->generate(gko::give(A))->apply(gko::lend(b), gko::lend(x)); // Write result write(std::cout , gko::lend(x)); } Listing 3. An example of creating a CG solver with ILU preconditioning with an iterative solver for the uppertriangular factor.

As described in Section 2, Ginkgo provides a set of generic linear operators, including variousgeneral matrix formats, popular solvers, and simple preconditioners. However, sparse linearalgebra often includes problem-specific knowledge. This means that, in general, a highly-optimizedimplementation of a generic algorithm will still be outperformed by a carefully crafted customalgorithm employing application-specific knowledge. To tackle this, Ginkgo promotes extensibilityso that users can develop their own implementation for specific functionality without needing tomodify Ginkgo’s code (or recompile it).Domain-specific extensions can be elaborated as part of the application that uses them, or evenbundled together to create an ecosystem around Ginkgo. Currently, this is possible for all linearoperators, stopping criteria, loggers, and corresponding factories. Adding custom data types alsorequires only minor changes in a single header file and a recompilation. The only extension thatrequires more significant efforts is the addition of new architectures and executors. This involvesmodifying a key portion of Ginkgo as it requires the addition of specialized implementations of allkernels for the new architecture and executor.In contrast to the previous section, where Ginkgo is used as a library and the application is builtaround it, this section describes how Ginkgo can be used as a framework in which the applicationinserts its own custom components to work in harmony with Ginkgo’s built-in technology. , Vol. 1, No. 1, Article 1. Publication date: January 2016. inkgo: A Modern Linear Operator Algebra Framework for High Performance Computing 1:15

Ginkgo’s facilities for memory management (e.g., automatic allocation and deallocation, or trans-parent copies between different executors) are designed to simplify its use as a library. As a result,the implementation burden is then shifted to the developers of these facilities, which are eitherthe developers of Ginkgo or, in case the application using Ginkgo needs custom extensions, thedevelopers of that application. To alleviate the burden and help developers focus on their algorithms,Ginkgo provides basic building blocks that handle memory management and the implementationof interfaces supported by the component being developed.

Array . Most components in Ginkgo have some sort of associated data, which shouldbe stored together with its executor. When copying a component, its data should also be copied,possibly to a different executor. When the object is destroyed, the data should be deallocatedwith it. Doing this manually for every class introduces a large amount of boilerplate code, whichincreases the effort of developing new components, and can lead to subtle memory leaks. Inaddition, different devices have different APIs for memory management, so a separate versionwould have to be written for each executor.To handle these issues in a single point in code, while removing some of the burden from thedeveloper, Ginkgo provides the

Array class. This is a container which encapsulates fixed-sizedarrays stored on a specific

Executor . It supports copying between executors and moving to anotherexecutor. In addition, it leverages the RAII idiom to automatically deallocate itself from the memorywhen it is no longer needed. auto omp = gko:: OmpExecutor :: create (); auto cuda = gko:: CudaExecutor :: create(0, omp); using arr = gko::Array ; arr x(cuda , {1, 2, 3, 4}); // an array of integers on the GPU arr cpu_x(omp , x); // a copy of x on the CPU arr z(omp , 10); // an uninitialized array of 10 integers on the CPU z = x; // copy x from the GPU to z (on the CPU) z.set_executor(cuda); // move z to the GPU auto d[] = {1, 2, 3, 4}; auto d_arr = arr::view(omp , 4, d); // use existing data auto size = x.get_num_elems (); // get the size of x auto x_data = x.get_data (); // get raw pointer to x's data // Note that x_data [0] would cause a segmentation fault if called from the CPU. // Memory used for x, cpu_x and z is automatically deallocated. // d_arr does not try to deallocate the memory. Listing 4. Usage examples of the

Array class.

Listing 4 shows some common usage examples of arrays. Lines 5–7 display several ways ofinitializing the

Array : using an initializer list, copying from an existing array (from a differentexecutor), or allocating a specified amount of uninitialized memory. The last constructor will onlyallocate the memory, without calling the constructors on individual elements, which remains theresponsibility of the caller. While this is not the usual behavior in C++, properly parallelizing theconstruction of the elements in multi- and manycore systems is a non-trivial task. Nevertheless,the elements of the arrays used in Ginkgo are mostly trivial types , so there is usually no need tocall the constructor in the first place.Lines 9–10 shown in Listing 4 illustrate how the assignment operator can be used to copy arraysand how the executor of the array can be changed via the set executor method. The combinationof the assignment operator and the RAII idiom usually means that classes using arrays as buildingblocks do not require user-defined destructors or assignment operators, since the ones synthesized https://en.cppreference.com/w/cpp/language/raii , Vol. 1, No. 1, Article 1. Publication date: January 2016. :16 H. Anzt et al. by the compiler behave as expected (in particular, this is true for all of Ginkgo’s linear operators,stopping criteria, and loggers).Lines 12–13 show that Array can also be used to store data in a non-owning fashion in a view ,i.e., the data will not be de-allocated when the

Array is destroyed. This feature is particularly usefulwhen using Ginkgo to operate on data owned by the application or another library.Finally, raw data stored in the

Array can be retrieved as shown in Lines 15–17. The get data method will return a raw pointer on the device where the array is allocated, so trying to dereferencethe pointer from another device will result in a runtime error.

Most components in Ginkgo expose a rich collection of utilityfunctions, usually related to conversion, object creation, and memory movement. These utilitiesare usually trivial to implement, and do not differ much between components. However, they stillrequire that the developer implements them, which steers the focus away from the actual algorithmdevelopment. Ginkgo addresses this issue by using mixins [2]. Since those are neither well-known by the community nor well-supported in languages commonly used in high performancecomputing (e.g., C, C++, Fortran), this subsection provides a simple example where mixins areleveraged to reduce boilerplate code. The remaining parts of Section 4 introduce mixins providedby Ginkgo when extending certain aspects of its ecosystem.As a toy example, assume there is an interface Clonable , which consists of a single method clone exposed to create a clone of an object. This method is useful if the object that should be clonedis only available through its base class (i.e., the static type of the object differs from its dynamictype). A common example where this is used is the prototype design pattern [25]. Obviously, theimplementation of the clone method should just create a new object using the copy constructor.Listing 5 is an example implementation of such a hierarchy consisting of three classes A , B and C .Classes A and B directly implement Clonable , while C indirectly implements it through B . struct Clonable { virtual ˜Clonable () = default; virtual std:: unique_ptr clone () const = 0; }; struct A : Clonable { std:: unique_ptr clone () const override { return std:: unique_ptr (new A{*this}); } }; struct B : Clonable { std:: unique_ptr clone () const override { return std:: unique_ptr (new B{*this}); } }; struct C : B { std:: unique_ptr clone () const override { return std:: unique_ptr (new C{*this}); } }; Listing 5. An example hierarchy implementing clonable without the use of mixins.

The implementation of the clone method is almost identical in all classes, so it represents agood candidate for extraction into a mixin. Mixins are not supported directly in C++, so theirimplementation is handled via inheritance, usually coupled with the Curiously Recurring TemplatePattern (CRTP) [19]. Nevertheless, using inheritance in this context should not be viewed asestablishing a parent–child relationship between the mixin and the class inheriting from it, butinstead as the class “including” the generic implementations provided by the mixin. Listing 6 shows The only mixin known to the authors is std::enable shared from this from the C++ standard library., Vol. 1, No. 1, Article 1. Publication date: January 2016. inkgo: A Modern Linear Operator Algebra Framework for High Performance Computing 1:17 the implementation of the same hierarchy using the

EnableCloning mixin designed to provide ageneric implementation of the clone method. The mixin relies on the knowledge of the type ofthe implementer to call the appropriate constructor, which is provided as a template parameter.The base interface implemented by the mixin is also passed as a template parameter to allowindirect implementations, as is the case in class C . Once the mixin is set up, any class that wishes toimplement Clonable can just include the mixin to automatically get a default implementation ofthe interface, making the class cleaner, and removing the burden of writing boilerplate code. struct Clonable { virtual ˜Clonable () = default; virtual std:: unique_ptr clone () const = 0; }; template struct EnableCloning : Base { std:: unique_ptr clone () const override { return std:: unique_ptr ( new Implementer {* static_cast (this)}); } }; struct A : EnableCloning {}; struct B : EnableCloning {}; struct C : EnableCloning {}; Listing 6. An example hierarchy implementing clonable using the

EnableCloning mixin.

Ginkgo uses mixins to provide default implementations, or parts of implementations of poly-morphic objects, linear operators, various factories, as well as a few of other utility methods. Tobetter distinguish mixins from regular classes, mixin names begin with the “

Enable ” prefix.

The matrix structure is one of the most common types of domain-specific information in sparselinear algebra. For example, the discretization of the 1D Poisson’s differential equation with a3-point stencil results in a tridiagonal matrix with a value 2 for all diagonal entries and − custom-matrix-format example, whichis included in Ginkgo’s source distribution.Line 1 includes the EnableLinOp mixin, which implements the entire

LinOp interface except thetwo apply impl methods. These methods are called inside the default implementation of the apply method to perform the actual application of the linear operator. The default implementation of apply contains additional functionalities (executor normalization, argument size checking, logging hooks,etc.). Thus, by using the two-stage design with apply and apply impl , the implementers of matrixformats do not have to worry about these details. Line 2 includes the

EnableCreateMethod mixin,which provides a default implementation of the static create method. The default implementationwill forward all the arguments to the

StencilMatrix ’ constructor, allocate and construct the matrixusing the new operator, and return a unique pointer ( std::unique ptr ) to the constructed object. , Vol. 1, No. 1, Article 1. Publication date: January 2016. :18 H. Anzt et al.

The constructor itself is defined in lines 4–8. Its parameters are the executor where the matrixdata should be located and operations performed, the size of the stencil, and the three coefficients ofthe stencil. The executor and the size are handled by

EnableLinOp , and the coefficients are storedin an

Array (defined in line 55) located on the executor used by the matrix.Linear operators provide two variants of the apply method. The “simple” version performs theoperation x = Ab and the “advanced” version for x = αAb + βx . Both of them are often used in linearalgebra, and can be expressed in terms of each other: A “simple” application is just an “advanced”one with α = β =

0. The “advanced” application can be expressed by combining x and theresult of “simple” application using the scal and axpy BLAS routines (called scale and add scaled in Ginkgo). In general, specialized versions result in superior performance. Thus, Ginkgo providesboth of them separately. However, for the sake of brevity, this example implements the “advanced”version in terms of the “simple” one (lines 14–42).The remainder of the code (lines 15–57) contains the implementation structure of the “simple”application. The input parameters contain the input vector b and the vector x where the solutionwill be stored. Each input and solution vector is represented by one column of a linear operator. Toaccommodate future extensions (e.g., sparse matrix–sparse vector multiplication), both x and b aregeneral linear operators. However, the only type supported by this example (and all of Ginkgo’sbuilt-in operators) is matrix::Dense . Downcasting these vectors to matrix::Dense is realized inlines 15–16 using the gko::as utility, which throws an exception if one of them is not in fact adense matrix.The implementation of the apply operation depends on the hardware architecture. The Referenceversion uses a simple sequential CPU implementation; the OpenMP version relies on a parallelimplementation based on OpenMP; and the CUDA and HIP versions launch a CUDA kernel and a HIPkernel, respectively. To support all four implementations, Ginkgo defines the Operation interface.An object that implements this interface is passed to the executor’s run method, which will selectthe appropriate implementation depending on the executor (lines 40–41). Thus,

StencilMatrix has to define a class (called stencil operation in this example, lines 18–39) which implements the

Operation interface and encapsulates the four implementations. The implementations are placedinto the four overloads of the run method: the reference version in lines 23–25; the OpenMP versionin lines 26–28; the CUDA version in lines 29–31; and the HIP version in lines 32–34. References tothe required data also have to be passed to stencil operation so that the implementation canaccess it.The new matrix format can be used instead of the CSR format in the example in Listing 1 bychanging the definition of A in line 9 as shown in line 59 of Listing 7, and placing the definition of A after the definition of b . In addition, lines 14–15 defining the preconditioner have to be removed,since the block-Jacobi preconditioning requires additional functionalities of the matrix format. Matrix formats are not the only linear operators that can be extended. A similar approach canbe used to define new solvers and preconditioners. StencilMatrix would have to define conversion to matrix::Csr for block-Jacobi preconditioning to work., Vol. 1, No. 1, Article 1. Publication date: January 2016. inkgo: A Modern Linear Operator Algebra Framework for High Performance Computing 1:19 class StencilMatrix : public gko:: EnableLinOp , public gko:: EnableCreateMethod { public: StencilMatrix(std:: shared_ptr exec , gko:: size_type size = 0, double left = -1.0, double center = 2.0, double right = -1.0) : gko:: EnableLinOp (exec , gko::dim <2>{size}), coefficients(exec , {left , center , right }) {} protected: using vec = gko:: matrix ::Dense <>; using coef_type = gko::Array ; void apply_impl(const gko:: LinOp *b, gko:: LinOp *x) const override { auto dense_b = gko::as (b); auto dense_x = gko::as (x); struct stencil_operation : gko:: Operation { stencil_operation(const coef_type &coefficients , const vec *b, vec *x) : coefficients{coefficients}, b{b}, x{x} {} void run(std:: shared_ptr ) const override { // Reference kernel implementation } void run(std:: shared_ptr ) const override { // OpenMP kernel implementation } void run(std:: shared_ptr ) const override { // CUDA kernel implementation } void run(std:: shared_ptr ) const override { // HIP kernel implementation } const coef_type &coefficients; const vec *b; vec *x; }; this ->get_executor ()->run( stencil_operation(coefficients , dense_b , dense_x)); } void apply_impl(const gko:: LinOp *alpha , const gko:: LinOp *b, const gko:: LinOp *beta , gko:: LinOp *x) const override { auto dense_b = gko::as (b); auto dense_x = gko::as (x); auto tmp_x = dense_x ->clone (); this ->apply_impl(b, gko::lend(tmp_x)); dense_x ->scale(beta); dense_x ->add_scaled(alpha , gko::lend(tmp_x)); } private: coef_type coefficients; }; // using the matrix format: auto A = StencilMatrix :: create(exec , b->get_size ()[0], -1.0, 2.0, -1.0); Listing 7. Example implementation of a user-defined matrix format specialized for 3-point stencil matrices.

Implementing new stopping criteria requires a deeper understanding of the concept than thatexplained in Section 2.5. To accommodate higher generality, a criterion is allowed to maintainstate during the execution of a solver (e.g., a criterion based on a time limit may need to recordthe point in time when the solver was started). On the other hand, a linear operator may invoke asolver multiple times, every time its apply method is called. As a consequence, the same criterioncannot be reused for multiple runs, as the state from the previous invocation may interfere with asubsequent run. The solution is to prevent users from directly instantiating criteria. Instead, theuser instantiates a criterion factory, which is then used by the solver to create a new criterioninstance every time the solver is invoked. When creating the criterion, the solver will pass basicinformation about the system being solved, which includes the system matrix, the right-hand side, , Vol. 1, No. 1, Article 1. Publication date: January 2016. :20 H. Anzt et al. the initial guess, and optionally the initial residual. During its execution, the solver will call thecriterion’s check method to decide whether to stop the process. This method receives a list ofparameters that includes the current iteration number, and optionally one or more of the following:the current residual, the current residual norm, and the current solution. Based on this information,the criterion decides, separately for each right-hand side, whether the iteration process should bedetained.Currently, Ginkgo includes conventional stopping criteria for iterative solvers based on iterationcount, execution time or residual thresholds, as well as mechanisms to combine multiple criteria.Nevertheless, users may achieve tighter control of the iteration process by defining their ownstopping criteria. Listing 8 offers a sample stopping criterion based on the number of iterationswhich, even though already available in Ginkgo as gko::stop::Iteration , is simple enough toshow in full as part of this paper.As mentioned in Section 2.5, all stopping criteria, including custom ones, should implement the

Criterion interface. In addition to the check method, the interface provides various other utilitymethods which facilitate memory management. To reduce the volume of boiler-plate code neededfor new stopping criteria, Ginkgo provides the

EnablePolymorphicObject mixin. This mixininherits an interface supporting memory management (in this case

Criterion ), and implementsutility methods related to it (line 2). For the mixin to work properly, the class being enabled has toprovide a constructor with an executor as its only parameter (lines 21–23).Creating a criterion factory can be simplified by using the

CREATE FACTORY PARAMETERS , FACTO-RY PARAMETER and

ENABLE CRITERION FACTORY macros. The first one creates a member type parameters type , which contains all of the parameters of the criterion (lines 4–6). Each parameteris defined using the

FACTORY PARAMETER macro, which adds a data member of the requested nameand default value, as well as a utility method “ with ” that can be used whenconstructing the factory to set the parameter. In this case, the only parameter is the maximumnumber of iterations (line 5). Finally, the

ENABLE CRITERION FACTORY macro creates a factorymember type named

Factory that uses the parameters to create the criterion. The macro also addsa data member parameters which holds those parameters (line 7). When used to instantiate a newcriterion, the factory will pass itself, as well as an instance of parameters type , to the constructorof the criterion. This constructor is defined in lines 25–29.Finally, the implementation of the criterion logic is comprised inside the check method (lines 10–19). The current state of the solver is passed via the

Updater object. This particular criterion usesthe

Updater::num iterations property to check whether the limit on the number of iterationshas been reached (line 13). If this is not the case, the criterion returns false , indicating to the solverthat iterative process should continue (line 14). Otherwise, the stopping statuses of all columnsare set (line 16), and the one changed property is set to true to indicate that at least one of thestatuses changed (lines 14–17). Finally, once the iteration process for all right-hand sides has beencompleted, the criterion returns true . The stoppingId and the setFinalized flags are additionaldescriptors that may be used to retrieve additional details about the event that stopped the iterationprocess. , Vol. 1, No. 1, Article 1. Publication date: January 2016. inkgo: A Modern Linear Operator Algebra Framework for High Performance Computing 1:21 class Iteration : public gko:: EnablePolymorphicObject { GKO_CREATE_FACTORY_PARAMETERS(parameters , Factory) { gko:: size_type GKO_FACTORY_PARAMETER(max_iters , 0); }; GKO_ENABLE_CRITERION_FACTORY(Iteration , parameters , Factory); public: bool check(gko:: uint8 stoppingId , bool setFinalized , gko::Array *stop_status , bool *one_changed , const gko::stop:: Updater &updater) override { if (updater.num_iterations_ < parameter_.max_iters) { return false; } this ->set_all_statuses(stoppingId , setFinalized , stop_status); *one_changed = true; return true; } explicit Iteration(std:: shared_ptr exec) : gko:: EnablePolymorphicObject ( std::move(exec)) {} explicit Iteration(const Factory *factory , const gko::stop:: CriterionArgs &args) : gko:: EnablePolymorphicObject ( factory ->get_executor ()), parameters_{factory ->get_parameters ()} {} }; Listing 8. An example of a stopping criterion that stops the iteration proces once a certain iteration limit isreached.

The executor is a central class in Ginkgo that provides all important primitives for allocating/deal-locating memory on a device, transferring data to other supported devices, and basic intra-devicecommunication (e.g., synchronization). An executor always has a master executor which is aCPU-side executor capable of allocating/deallocating space in the main memory. This concept isconvenient when considering devices such as CUDA or HIP accelerators, which feature their ownseparate memory space. Although implementing a Ginkgo executor that leverages features such asunified virtual memory (UVM) is possible via the interface, in order to attain higher performancewe decided to manage all copies by direct calls to the underlying APIs.Support for new devices (e.g., optimized versions of the library for different architectures, newaccelerators or co-processors, new programming models) in a heterogeneous node can be added toGinkgo by creating new executors for those devices. This requires 1) creating a new class whichimplements the

Executor interface; 2) adding kernel declarations in all Ginkgo classes with kernelsfor the new executor; 3) extending the internal gko::Operation to execute kernel operations onthe new executor; and 4) implementing kernels for all Ginkgo classes on the new architectures.Although this is an involved process and implies modifications in multiple parts of Ginkgo, theprocess has been successfully executed to extend Ginkgo to support a new HIP executor. Thanks toGinkgo’s design, most changes to Ginkgo’s base classes transfer to gko::Executor and its related gko::Operation classes. In addition, although most matrix formats, solvers, preconditioners,and utility functions rely on kernels that need to be implemented to support a new executionspace, a good first step is to declare all kernels as

GKO NOT IMPLEMENTED . This allows to obtain acompiling first version featuring the new executor with kernels throwing an exception when called.The required kernel implementations can then be progressively added without endangering thesuccessful compilation of the software stack. , Vol. 1, No. 1, Article 1. Publication date: January 2016. :22 H. Anzt et al.

In this section we describe and demonstrate how to interface Ginkgo from other libraries. Specifi-cally, we showcase the usage of Ginkgo’s solver and preconditioner functionality from the deal.ii[8] and MFEM [10] finite element software packages.

To use Ginkgo as a solver in an external library, one must first adapt the data structures of theexternal library to Ginkgo’s data structures. We accomplish this by borrowing the raw data fromthe external library’s data structures; next operate on this data - e.g. solve a linear system; andthen return the result back to the application in the original data format. int main() { // Set solver parameters SolverControl control (200, 1e-6); const unsigned int size = 32; unsigned int dim = (size - 1) * (size - 1); // Setup a simple matrix FDMatrix testproblem(size , size); SparsityPattern structure(dim , dim , 5); testproblem.five_point_structure(structure); structure.compress (); SparseMatrix A(structure); testproblem.five_point(A); Vector f(dim); f = 1.; Vector u(dim); u = 0.; // Instantiate a Reference executor. auto ref = gko:: ReferenceExecutor :: create (); // Create a ginkgo preconditioner. auto jacobi = gko:: preconditioner ::Jacobi <>:: build ().on(ref); // Use ginkgo to solve the system on the gpu using the CG solver with jacobi // preconditioning. // Note that this is an additional constructor that takes in a created // LinOpFactory object and hence is generic. GinkgoWrappers ::SolverCG <> solver(control , "reference", jacobi); // Solves the system and copies the data back to deal.ii's solution variable. solver.solve(A, u, f); } Listing 9. Usage of Ginkgo’s solver capabilities in a deal.ii application. The code snippet only shows thesolution step and assumes that the system matrix and right-hand side are available from deal.ii. , Vol. 1, No. 1, Article 1. Publication date: January 2016. inkgo: A Modern Linear Operator Algebra Framework for High Performance Computing 1:23 int main() { . . // Setup the finite element space and assemble the linear . // and bilinear forms . OperatorPtr A; Vector B, X; a->FormLinearSystem(ess_tdof_list , x, *b, A, X, B); // Solve the linear system with CG + ILU from Ginkgo. // Instantiate a Reference executor. auto ref = gko:: ReferenceExecutor :: create (); // Setup the preconditioner. auto ilu_precond = gko:: preconditioner ::Ilu , gko:: solver ::UpperTrs <>>::build () .on(ref); // Create the solver object with convergence parameters. GinkgoWrappers :: CGSolver ginkgo_solver("reference", 1, 2000, 1e-12, 0.0, ilu_precond.release ()); // The solve method internally converts the MFEM objects to Ginkgo 's // objects if necessary , computes the solution and returns the solution. ginkgo_solver.solve (&(( SparseMatrix &)(*A)), X, B); // Get solution back to MFEM a->RecoverFEMSolution(X, *b, x); . . // Clean up . } Listing 10. Usage of Ginkgo’s solver capabilities in a MFEM application.

Listings 9 and 10 showcase the explotation of Ginkgo functionality in deal.ii and MFEMapplications. Our main objective is to expose Ginkgo’s functionalities to the external librarieswhile maintaining an uniform interface within those libraries. The interfaces preserve the libraries’own solver interface, and take the executor determining the execution space as the only additionalparameter. All data movement is handled automatically and remains transparent to the user.

Ginkgo provides a multitude of preconditioners on both the CPU and the GPU. An exampleof such a preconditioner is the block-jacobi preconditioner. To accomodate the use of ginkgo’spreconditioners in deal.ii or MFEM, an additional constructor for each of the concrete solverclasses has been provided which takes in a gko::LinOpFactory as an argument. In the mostgeneral case this can be taken to be any generic linear operator factory with an overloaded applyimplementation to serve as a preconditioner.

Ginkgo is a part of the extreme-scale Scientific Software Development Kit (xSDK [5]), a softwarestack that comprises some of the most important research software libraries and that is availableon all US leadership computing facilities. Ginkgo is included in the xSDK release 0.5.0 [4] which isavailable as a Spack metapackage.Within the xSDK effort, interoperability examples with MFEM and deal.ii showcase the

LinOp concept of Ginkgo, and the use of Ginkgo as a solver using partial assembly of the finite elementoperator within MFEM. , Vol. 1, No. 1, Article 1. Publication date: January 2016. :24 H. Anzt et al.

An important aspect of the Ginkgo library is its orientation towards software sustainability, easeof use, and openness to external contributions. Aside from Ginkgo being used as a framework foralgorithmic research, its primary intention is to provide a numerical software ecosystem designedfor easy adoption by the scientific computing community. This requires sophisticated designguidelines and high quality code. With these goals in mind, Ginkgo follows the guidelines andpolicies of the xSDK and the Better Scientific Software (BSSw [6]) initiative. In order to facilitate easyadoption, Ginkgo is open source with a modified BSD license, which does not restrict commercialuse of the software. The main repository is publicly available on github and only prototypeimplementations of ongoing research are kept in a private repository. The github repository isopen to external contributions through a peer-review concept and uses issues for bug trackingand to bolster development efforts. A Continuous Integration (CI) system realizes the automaticsynchronization of repositories, and the compilation and testing of the distinct branches. The CI isalso setup to ensure quality of the library in terms of memory leaks, threading issues, detection ofbugs thanks to static code analyzers, etc. The configuration and compilation processes are facilitatedwith CMake. The testing is realized using Google Test [7] and comprises a comprehensive list ofunit tests ensuring the library’s functionality. A feature spearheading sustainable high performancesoftware development is Ginkgo’s Continuous Benchmarking (CB) framework. This componentof Ginkgo’s ecosystem automatically runs performance tests on each code change; archives theperformance results in a public git repository; and allows users to investigate the performancevia an interactive web tool, the Ginkgo Performance Explorer [11]. Finally, the documentationis automatically kept up-to date-with the software, and multiple wiki pages containing examples,tutorials, and contributor guidelines are available. In the performance evaluation, we consider two GPU-centric HPC nodes from different hardwarevendors: The AMD node consists of an AMD Threadripper 1920X (12 x 3.5 Ghz) CPU, 64 GB RAM,and an AMD RadeonVII GPU. The RadeonVII GPU features 16 GB of main memory accessible at of1,024 GB/s (according to the specifications), and has a theoretical peak of 3.4 (double precision)TFLOP/s. The NVIDIA node is integrated into the Summit supercomputer, and consists of twoIBM POWER9 processors and six NVIDIA Volta V100 accelerators. The NVIDIA V100 GPUs eachhave a theoretical peak of 7.8 (double-precision) TFLOP/s and feature 16 GB of high-bandwidthmemory (HBM2). The board specifications indicate a memory bandwidth of 920 GB/s for thisaccelerator. We run all our experiments on a single GPU. We note that we do not intend this to bea performance-focused paper, and therefore refrain from showing a comprehensive performanceevaluation, but only show selected performance results that are representative for the commonusage of Ginkgo.

Relying on static and dynamic polymorphism largely simplifies code maintenance and extendability.A common concern when using these C++ features is the runtime overhead induced by runtimepolymorphism. Due to Ginkgo’s design, multiple runtime polymorphisms are evaluated at differentlevels. For example, calling the SpMV apply() functionality goes through 3 polymorphism forks:Format selection, Executor selection, and Kernel variant selection. Solvers undergo a similar process, https://ginkgo-project.github.io/gpe/, Vol. 1, No. 1, Article 1. Publication date: January 2016. inkgo: A Modern Linear Operator Algebra Framework for High Performance Computing 1:25 Solver BiCGSTAB CG CGS FCG GMRESTime per iteration ( µs ) 1.26 1.28 1.00 1.45 1.51 Table 1. Overhead of the main Ginkgo solvers measured by averaging 10.000 solver runs, each doing 1.000iterations.Fig. 6. Performance profile comparing the runtime of Ginkgo’s SpMV kernels with the vendor libraries onthe AMD RadeonVII (left) and the NVIDIA V100 (right). The plain names represent the Ginkgo kernels,the “hipsp ” and “cusp ” labels refer to the vendor implementations in AMD’s hipSPARSE and NVIDIA’scuSPARSE libraries, respectively. except that during each iteration they call multiple kernels: an SpMV, possibly a preconditioner,etc.To evaluate the performance impact of the multiple runtime polymorphism branches, in Table 1we first measure the overhead for all Ginkgo’s solvers. The results there are obtained using amatrix of size 1, with an initial solution x = b ) set to N aN . This allowsrunning the full solver algorithm executing all runtime polymorphism branches with negligiblekernel execution time. We report results for 1.000 solver iterations averaged over 10.000 solverruns. Table 1 shows that the time per iteration is at most 1.5 µs for any of the solvers. We next evaluate the performance of the SpMV kernel for all matrices available in the Suite SparseMatrix Collection [1, 27] on the AMD RadeonVII and the NVIDIA V100 GPU [28]. For this purpose,we compare the performance profile of the SpMV kernels available in the Ginkgo library withtheir counterparts available in the NVIDIA cuSPARSE and the AMD hipSPARSE libraries. Theperformance profile indicates for how many test matrices from the Suite Sparse Matrix Collectiona specific format is the fastest (maximum slowdown factor 1.0), and how well a specific formatgeneralized. I.e., for a given “acceptable slowdown factor,” which percentage of the problems fromthe Suite Sparse Matrix Collection can be covered. The performance profiles reveal that Ginkgo’skernels are at least competitive, and in many cases superior to the vendor libraries.

Prior to evaluating the performance of Ginkgo’s Krylov solvers, we point out that Krylov solversoperating with sparse linear systems are memory-bound algorithms. For this reason, we initiallyassess the bandwidth efficiency of the implementations of the different Krylov solvers. Concretely,we select the COO matrix format for the SpMV kernel, and run the Krylov solvers without any , Vol. 1, No. 1, Article 1. Publication date: January 2016. :26 H. Anzt et al.

Solver Read access volume Write access volumeBiCGSTAB ( · n + · nnz ) · VT + · nnz · IT + (cid:100) iter / (cid:101) · (( · n + · nnz ) · VT + · nnz · IT ) + (cid:98) iter / (cid:99) · (( · n + · nnz ) · VT + · nnz · IT ) ( · n + ) · VT + (cid:100) iter / (cid:101) · (( · n + ) · VT ) + (cid:98) iter / (cid:99) · (( · n + ) · VT ) CG ( · n + · nnz )· VT + · nnz · IT + iter ·(( · n + · nnz ) · VT + · nnz · IT ) ( · n + ) · VT + iter · (( · n + ) · VT ) CGS ( · n + · nnz ) · VT + · nnz · IT + (cid:100) iter / (cid:101) · (( · n + · nnz ) · VT + · nnz · IT ) + (cid:98) iter / (cid:99) · (( · n + · nnz ) · VT + · nnz · IT ) ( · n + ) · VT + (cid:100) iter / (cid:101) · ( · n + ) · VT + (cid:98) iter / (cid:99) · ( · n · VT ) FCG ( · n + · nnz )· VT + · nnz · IT + iter ·(( · n + · nnz ) · VT + · nnz · IT ) ( · n + ) · VT + iter · (( · n + ) · VT ) GMRES ( · n + · nnz + / · r + n · r + r / + )· VT + · nnz · IT + (cid:98) iter / k (cid:99) ·(( + / · k + · n + · nnz + k / + k · n )· VT + · nnz · IT ) + iter ·(( · n + + · nnz )· VT + · nnz · IT + ) + iter r ·(( · n + )· VT ) ( · n + r + · k + )· VT + + (cid:98) iter / k (cid:99) ·(( k + · n + ) · VT + ) + iter · (( · n + ) · VT + ) + iter r · (( n + ) · VT ) Table 2. Memory access volume of a full run of the distinct solver. Here VT is the value type size in bytes(e.g., for double it is 8 bytes); IT is value type for the index type; and iter is the number of iterations thesolver does. In GMRES, k is the Krylov dimension (or restart iteration setting); r = iter % k ; and iter r = (cid:98) iter / k (cid:99) ∗ ( k − ) ∗ k / + ( iter % k − ) ∗ iter % k / . preconditioner. In Table 2 we list the target Krylov solvers along with their memory access volume(as a function of the iteration count). The formula for the GMRES algorithm is more involved as weimplement a variant enhanced with restart.For the experimental evaluation, we run 10,000 solver iterations on 10 different but representativetest matrices from the Suite Sparse collection. For GMRES, we set the restart parameter to 100.In Figure 7, we visualize the memory bandwidth usage of the different Krylov solvers for both aV100 GPU executing CUDA code and an AMD RadeonVII GPU executing HIP code. In each graphwe indicate the experimental peak bandwidth achieved by a reference stream triad bandwidthbenchmark [20]. Both test machines reach a very similar STREAM triad bandwidth (809 . . a [ i ] = b [ i ] + αc [ i ] , Vol. 1, No. 1, Article 1. Publication date: January 2016. inkgo: A Modern Linear Operator Algebra Framework for High Performance Computing 1:27 Fig. 7. Memory efficiency of Ginkgo’s Krylov solvers.

Operation V100 performance (GB/s) RadeonVII performance (GB/s)Copy 790.475 841.669Mul 787.301 841.934Add 811.312 806.632Triad 812.617 809.754Dot 844.321 635.677

Table 3. Stream bandwidth results from [20] on the V100 and RadeonVII machines for key operations.

Ginkgo provides both (block-Jacobi type) preconditioners based on diagonal scaling and (ILU type)incomplete factorization preconditioners. Ginkgo’s ILU preconditioner technology is spearheadingthe community, including ParILUT, the first threshold-based ILU preconditioner for GPU archi-tectures [17]. This preconditioner approximates the values of the preconditioner via fixed-pointiterations while dynamically adapting the sparsity pattern to the matrix properties [13]. Dependingon the matrix characteristics, this preconditioner can significantly accelerate the solution processof linear system solves; see Figure 8.Advanced techniques for the ILU preconditioner generation are complemented with fast tri-angular solvers, including iterative methods [12, 23] and the approximation of the inverse of thetriangular factors via a sparse matrix (incomplete sparse approximate inverse preconditioning [16]).The block-Jacobi preconditioner available in Ginkgo outperforms its competitors by automat-ically adapting the memory precision to the numerical requirements, therewith reducing thememory access time of the memory-bound preconditioner application [14, 21]. The inversion of thediagonal block is realized via a heavily-tuned batched variable size Gauss-Jordan elimination [15];see Figure 9.

Ginkgo is a modern C++-based sparse linear algebra library for GPU-centric HPC architectures withmany appealing features including the Linear Operator abstraction, which fosters easy adoption ofthe library. Some other aspects that we have elaborated on include the execution control, memorymanagement via smart pointers, software quality measures, and library extensibility. We havealso provided example codes for software integration into the deal.ii and MFEM finite elementecosystems, and demonstrated the high performance of Ginkgo on high end GPU architectures. We , Vol. 1, No. 1, Article 1. Publication date: January 2016. :28 H. Anzt et al.

Fig. 8. Time-to-solution comparison between standard ILU preconditioning (NVIDIA’s cuSPARSE) andGinkgo’s ParILUT for solving anisotropic flow problems on two NVIDIA GPU generations. The GMRES solveris taken from the Ginkgo library.Fig. 9. Performance of the block-Jacobi preconditioner generation on the NVIDIA V100 GPU. The precondi-tioner generation includes the Gauss-Jordan elimination featuring pivoting, the condition number calculationand exponent range analysis, the storage format optimization, the format conversion, and the preconditionerstorage in GPU main memory. The Different bars for each size represent the distinct memory precisionscenarios and the scenarios where the performance data includes the automatic precision detection based onthe block condition number and exponent range. believe that the design of the library and the sustainability measures that are taken as part of thedevelopment process have the potential to become role models for the future efforts on scientificsoftware packages. , Vol. 1, No. 1, Article 1. Publication date: January 2016. inkgo: A Modern Linear Operator Algebra Framework for High Performance Computing 1:29

ACKNOWLEDGMENTS

This work was supported by the “Impuls und Vernetzungsfond of the Helmholtz Association” undergrant VH-NG-1241. G. Flegar and E. S. Quintana-Ort´ı were supported by project TIN2017-82972-Rof the MINECO and FEDER and the H2020 EU FETHPC Project 732631 “OPRECOMP”. This researchwas also supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of theU.S. Department of Energy Office of Science and the National Nuclear Security Administration. Theauthors want to acknowledge the access to the Summit supercomputer at the Oak Ridge NationalLab (ORNL).

REFERENCES [1] 2020. Suite Sparse Matrix Collection. http://faculty.cse.tamu.edu/davis/suitesparse.html.[2] accessed in April 2020. Mix In. Portland Pattern Repository. https://wiki.c2.com/?MixIn[3] accessed in April 2020. Object Slicing. Portland Pattern Repository. https://wiki.c2.com/?ObjectSlicing[4] accessed in April 2020. xSDK Examples https://xsdk.info/release-0-5-0/.[5] accessed in April 2020. xSDK: Extreme-scale Scientific Software Development Kit https://xsdk.info/.[6] accessed in August 2018. Better Scientific Software (BSSw) https://bssw.io/.[7] accessed in August 2018. Google Test https://github.com/google/googletest.[8] G. Alzetta, D. Arndt, W. Bangerth, V. Boddu, B. Brands, D. Davydov, R. Gassmoeller, T. Heister, L. Heltai, K. Kormann,M. Kronbichler, M. Maier, J.-P. Pelteret, B. Turcksin, and D. Wells. 2018. The deal.II

Library, Version 9.0.

Journal ofNumerical Mathematics

26, 4 (2018), 173–183. https://doi.org/10.1515/jnma-2018-0054[9] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A.McKenney, and D. Sorensen. 1999.

LAPACK Users’ Guide (third ed.). Society for Industrial and Applied Mathematics,Philadelphia, PA.[10] Robert Anderson, Julian Andrej, Andrew Barker, Jamie Bramwell, Jean-Sylvain Camier, Jakub Cerveny, VeselinDobrev, Yohann Dudouit, Aaron Fisher, Tzanio Kolev, Will Pazner, Mark Stowell, Vladimir Tomov, Johann Dahm,David Medina, and Stefano Zampini. 2019. MFEM: a modular finite element methods library. arXiv e-prints , ArticlearXiv:1911.09220 (Nov. 2019), arXiv:1911.09220 pages. arXiv:cs.MS/1911.09220[11] Hartwig Anzt, Yen-Chen Chen, Terry Cojean, Jack Dongarra, Goran Flegar, Pratik Nayak, Enrique S Quintana-Ort´ı,Yuhsiang M Tsai, and Weichung Wang. 2019. Towards Continuous Benchmarking: An Automated PerformanceEvaluation Framework for High Performance Software. In

Proceedings of the Platform for Advanced Scientific ComputingConference . 1–11.[12] Hartwig Anzt, Edmond Chow, and Jack Dongarra. 2015. Iterative sparse triangular solves for preconditioning. In

European Conference on Parallel Processing . Springer, Berlin, Heidelberg, 650–661.[13] Hartwig Anzt, Edmond Chow, and Jack Dongarra. 2018. ParILUT—A New Parallel Threshold ILU Factorization.

SIAMJournal on Scientific Computing

40, 4 (2018), C503–C519.[14] Hartwig Anzt, Jack Dongarra, Goran Flegar, Nicholas J Higham, and Enrique S Quintana-Ort´ı. 2019. Adaptive precisionin block-Jacobi preconditioning for iterative sparse linear system solvers.

Concurrency and Computation: Practice andExperience

31, 6 (2019), e4460.[15] Hartwig Anzt, Jack Dongarra, Goran Flegar, and Enrique S Quintana-Ort´ı. 2019. Variable-size batched Gauss–Jordanelimination for block-Jacobi preconditioning on graphics processors.

Parallel Comput.

81 (2019), 131–146.[16] Hartwig Anzt, Thomas K Huckle, J¨urgen Br¨ackle, and Jack Dongarra. 2018. Incomplete sparse approximate inversesfor parallel preconditioning.

Parallel Comput.

71 (2018), 1–22.[17] Hartwig Anzt, Tobias Ribizel, Goran Flegar, Edmond Chow, and Jack Dongarra. 2019. ParILUT—A Parallel ThresholdILU for GPUs. In . IEEE, 231–241.[18] Edmond Chow, Hartwig Anzt, and Jack Dongarra. 2015. Asynchronous iterative algorithm for computing incompletefactorizations on GPUs. In

International Conference on High Performance Computing . Springer, Cham, 1–16.[19] James O. Coplien. 1995. Curiously Recurring Template Patterns.

C++ Report (1995).[20] Tom Deakin, James Price, Matt Martineau, and Simon McIntosh-Smith. 2016. GPU-STREAM v2.0: Benchmarkingthe Achievable Memory Bandwidth of Many-Core Processors Across Diverse Parallel Programming Models. In

HighPerformance Computing , Michela Taufer, Bernd Mohr, and Julian M. Kunkel (Eds.). Springer International Publishing,Cham, 489–507.[21] Goran Flegar, Hartwig Anzt, Terry Cojean, and Enrique S. Quintana-Ort´ı. submitted. Customized-Precision Block-Jacobi Preconditioning for Krylov Iterative Solvers on Data-Parallel Manycore Processors.

ACM TOMS (submitted)., Vol. 1, No. 1, Article 1. Publication date: January 2016. :30 H. Anzt et al. [22] Erich Gamma, Richard Helm, Ralph Johnson, and John M. Vlissides. 1994.

Design Patterns: Elementsof Reusable Object-Oriented Software

EuroPar Conference 2020 (accepted).[24] Heike Jagode, Anthony Danalis, Hartwig Anzt, and Jack Dongarra. 2019. PAPI software-defined events for in-depthperformance analysis.

The International Journal of High Performance Computing Applications

33, 6 (2019), 1113–1127.[25] R. Johnson, E. Gamma, J. Vlissides, and R. Helm. 1995.

Design Patterns: Elements of Reusable Object-Oriented Software .Addison-Wesley. https://books.google.de/books?id=iyIvGGp2550C[26] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. 1979. Basic Linear Algebra Subprograms for Fortran Usage.

ACM Trans. Math. Softw.

5, 3 (Sept. 1979), 308–323. https://doi.org/10.1145/355841.355847[27] Y. Saad. 2003.

Iterative Methods for Sparse Linear Systems (second ed.). Society for Industrial and Applied Mathematics.https://doi.org/10.1137/1.9780898718003[28] Yuhsiang M. Tsai, Terry Cojean, and Hartwig Anzt. accepted. Sparse Linear Algebra on AMD and NVIDIA GPUs –The Race is on.

Related Researches

FastAD: Expression Template-Based C++ Library for Fast and Memory-Efficient Automatic Differentiation

by James Yang

Event-Based Automatic Differentiation of OpenMP with OpDiLib

by Johannes Blühdorn

An Empirical Analysis of the R Package Ecosystem

by Ethan Bommarito

Using Jupyter for reproducible scientific workflows

by Marijan Beg

Automatic Generation of Interpolants for Lattice Samplings: Part II -- Implementation and Code Generation

by Joshua Horacsek

Automatic Generation of Interpolants for Lattice Samplings: Part I -- Theory and Analysis

by Joshua Horacsek

Quasi-Monte Carlo Software

by Sou-Cheng T. Choi

COMET: A Domain-Specific Compilation of High-Performance Computational Chemistry

by Erdal Mutlu

User manual for bch, a program for the fast computation of the Baker-Campbell-Hausdorff and similar series

by Harald Hofstätter

lrsarith: a small fixed/hybrid arithmetic C library

by David Avis

FDApy: a Python package for functional data

by Steven Golovkine

Accelerated Polynomial Evaluation and Differentiation at Power Series in Multiple Double Precision

by Jan Verschelde

UFL Dual Spaces, a proposal

by David A. Ham

Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores

by Orestis Zachariadis

ParaMonte: A high-performance serial/parallel Monte Carlo simulation library for C, C++, Fortran

by Amir Shahmoradi

A highly scalable approach to solving linear systems using two-stage multisplitting

by Nick Brown

Extendible and Efficient Python Framework for Solving Evolution Equations with Stabilized Discontinuous Galerkin Method

by Andreas Dedner

Compressed Basis GMRES on High Performance GPUs

by José I. Aliaga

Flexible Performant GEMM Kernels on GPUs

by Thomas Faingnaert

AMReX: Block-Structured Adaptive Mesh Refinement for Multiphysics Applications

by Weiqun Zhang

Portable high-order finite element kernels I: Streaming Operations

by Noel Chalmers

A Survey of Singular Value Decomposition Methods for Distributed Tall/Skinny Data

by Drew Schmidt

HDGlab: An open-source implementation of the hybridisable discontinuous Galerkin method in MATLAB

by Matteo Giacomini

Dune-CurvedGrid -- A Dune module for surface parametrization

by Simon Praetorius

A parallel structured divide-and-conquer algorithm for symmetric tridiagonal eigenvalue problems

by Xia Liao

«

1

2

3

4

»

Submitted on 30 Jun 2020 (v1), last revised 1 Jul 2020 (this version, v2) Updated

arXiv.org Original Source

NASA ADS

Google Scholar

Semantic Scholar

Decentralizing Knowledge