Accelerated Cyclic Reduction: A Distributed-Memory Fast Solver for Structured Linear Systems
Gustavo Chávez a,* , George Turkiyyah b , Stefano Zampini a , Hatem Ltaief a , David Keyes a

a King Abdullah University of Science and Technology (KAUST).
b Department of Computer Science, American University of Beirut (AUB).
* Corresponding author: Gustavo Chávez.
Abstract
We present Accelerated Cyclic Reduction (ACR), a distributed-memory fast solver for rank-compressible block tridiagonal linear systems arising from the discretization of elliptic operators, developed here for three dimensions. Algorithmic synergies between Cyclic Reduction and hierarchical matrix arithmetic operations result in a solver that has $O(kN \log N(\log N + k))$ arithmetic complexity and $O(kN \log N)$ memory footprint, where $N$ is the number of degrees of freedom and $k$ is the rank of a block in the hierarchical approximation, and which exhibits substantial concurrency. We provide a baseline for performance and applicability by comparing with the multifrontal method with and without hierarchical semi-separable matrices, with algebraic multigrid and with the classic cyclic reduction method. Over a set of large-scale elliptic systems with features of nonsymmetry and indefiniteness, the robustness of the direct solvers extends beyond that of the multigrid solver, and relative to the multifrontal approach ACR has lower or comparable execution time and size of the factors, with substantially lower numerical ranks. ACR exhibits good strong and weak scaling in a distributed context and, as with any direct solver, is advantageous for problems that require the solution of multiple right-hand sides. Numerical experiments show that the rank $k$ patterns are of $O(1)$ for the Poisson equation and of $O(n)$ for the indefinite Helmholtz equation. The solver is ideal in situations where low-accuracy solutions are sufficient, or otherwise as a preconditioner within an iterative method.

Keywords:
Cyclic reduction, Hierarchical matrices, Fast direct solvers, Elliptic equations
1. Introduction
Cyclic reduction, introduced in [1], is a direct solver for tridiagonal linear systems. It is effective for the solution of (block) Toeplitz and (block) tridiagonal matrices that arise from the discretization of elliptic PDEs [2, 3]. For the constant-coefficient Poisson equation, since each of the blocks of the discretized system is Fourier diagonalizable, cyclic reduction can be used in combination with the fast Fourier transform (FFT) to deliver optimal complexity, as proposed in the FACR method [4]. However, in the presence of variable coefficients, the FFT-enabled version of cyclic reduction cannot be used. The purpose of this work is to address the time and memory complexity growth in the presence of heterogeneous blocks with a variant called Accelerated Cyclic Reduction (ACR). The main observation is that elliptic operators have a hierarchical structure of off-diagonal blocks that can be approximated with low-rank matrices. Thus we approximate appropriate blocks of the initially sparse matrix with hierarchical matrices and operate on these blocks with hierarchical matrix arithmetic, instead of the usual dense operations, to obtain a direct solver of log-linear arithmetic and memory complexities. This philosophy follows recent work discussed below, but to our knowledge, this is the first demonstration of the utility of complexity-reducing hierarchical substitution in the context of cyclic reduction.

Cyclic reduction can be thought of as a direct Gaussian elimination on a permuted system that recursively computes the Schur complement of half of the unknowns until a single block remains or the system is small enough to be inverted directly. Schur complement computations have a complexity that is dominated by the cost of the inverse; by applying a red/black re-ordering of the unknowns, the linear system separates into two halves with block diagonal structure. This decoupling addresses the most expensive step of the Schur complement computation regarding operation complexity and does so in a way that launches independent subproblems. This concurrency feature, in the form of recursive bisection, can be naturally implemented in a distributed-memory
parallel environment. The stability of block cyclic reduction has been studied in [5], where the authors present error bounds for strictly and nonstrictly diagonally dominant matrices.

In order to simplify the description of the algorithm, in this work we consider structured linear systems arising from the discretization of scalar PDEs on three-dimensional Cartesian grids. For three-dimensional problems of size $N = n^3$, where $n$ is the number of discretization points in the linear dimension of the target domain, the synergy of cyclic reduction and hierarchical matrices leads to a parallel fast direct solver of $O(kN \log N(\log N + k))$ arithmetic complexity and $O(kN \log N)$ memory footprint, where $k \ll N$ represents the numerical rank of compressed blocks. This is in contrast to $O(N^{7/3})$ operations and $O(N^{5/3})$ memory, respectively, if hierarchically low-rank matrices are not used. In this manuscript, we present ACR and its distributed-memory implementation, and we demonstrate its performance on a set of problems with various symmetry and spectral properties in three dimensions. These problems include the Poisson equation, the convection-diffusion equation, and the indefinite Helmholtz equation. We show that ACR is competitive in memory consumption and time-to-solution when compared to methods that rely on a global factorization and do not exploit the cyclic reduction structure.

1.1. Related work

Recent years have seen increasing interest in the use of hierarchical low-rank approximations to accelerate the direct solution of linear systems. In this section, we briefly describe some of this literature, focusing primarily on efforts that target distributed-memory environments.

Arguably the most common approach for using hierarchical matrix representations in matrix factorizations is to use low-rank approximations to compress the dense frontal blocks that arise in the multifrontal variant of Gaussian elimination. The enabling property is that, under proper ordering, many of the off-diagonal blocks of the Schur complement of discretized elliptic PDEs have an effective low-rank approximation [6] that improves the memory and arithmetic estimates of conventional multifrontal solvers [7]. Furthermore, there are efficient low-rank approximation methods to perform the necessary arithmetic operations and preserve the low-rank representation during the factorization and solution stages of the solver. Within this general approach, various methods that differ in the particular data-sparse format used and in the algorithms for the computation of low-rank approximations have been developed.

In Wang et al. [8] the authors investigate the use of the HSS format [9] to accelerate the parallel geometric multifrontal method, which results in a method known as the HSS-structured multifrontal solver (HSSMF). The general approach uses intra-node parallel HSS operations within a distributed-memory implementation of the multifrontal sparse factorization. This approach lowers the complexity of both arithmetic operations and memory consumption of the resulting HSS-structured multifrontal solver by leveraging the underlying numerically low-rank structure of the intermediate dense matrices appearing within the factorization process, driven by an optimal nested dissection ordering.

In a similar line of work, Ghysels et al.
[10] also investigate a combination of the multifrontal method and the HSS-structured hierarchical format, extending the range of applicability of the solver to general non-symmetric matrices. Using the task-based parallelism paradigm, they introduce randomized sampling compression [11] and fast ULV HSS factorization [12]. Under the assumption of the existence of an underlying low-rank structure of the frontal matrices, randomized methods deliver almost linear complexity; this reduces the asymptotic complexity of the solver, which is mainly attributed to the frontal matrices near the root of the elimination tree. The effectiveness of these task-based algorithms in combination with a distributed-memory implementation of the multifrontal method is available in an early-stage software release of the package STRUMPACK [10], which we will consider in the numerical experiments section of this article. The HSS format assumes a weak admissibility condition (see Section 2.1.1), which in practice requires the use of large numerical ranks even for approximations with modest relative accuracy. Consequently, this stresses the memory requirements and increases overall execution time.

The hierarchical interpolative factorization [13, 14] is another method for finding low-rank approximations that has proved to be a fast solver for symmetric elliptic PDEs and integral equations. This decomposition relies on a "skeletonization" procedure to eliminate a redundant set of points from a symmetric matrix to further compress the dense fronts. The key step in skeletonization uses the interpolative decomposition of low-rank matrices to achieve a quasi-linear overall complexity in factorization. The performance of the hierarchical interpolative decomposition in a distributed-memory environment is reported in [15].

A fast direct method for high-order discretizations of elliptic PDEs has been proposed by Martinsson et al. [16, 17, 18]. The method is based on a multidomain spectral collocation discretization scheme and a hierarchy of nested grids, similar to nested dissection. It exploits analytical properties of elliptic PDEs to build Dirichlet-to-Neumann operators, by hierarchically merging these operators originating from smaller grids. When computations are done using the HSS data-sparse format, an asymptotic complexity of $O(N^{4/3})$ can be reached. Even though this approach has larger asymptotic estimates than the log-linear performance of the methods above, the high-order discretization of the PDE makes it quite powerful in practice, as it can reach the same accuracy with fewer degrees of freedom compared to second-order discretizations. A distributed-memory implementation of this algorithm is in progress.

The BLR format [19] has also been used to compress blocks into low-rank approximations to accelerate the factorization process of the multifrontal method. This format is compatible with numerical pivoting and is well suited for the reuse of existing high-performance implementations of dense linear algebra kernels.
Even though this format is not hierarchical, it has proven to be useful for a wide range of problems [20] within the distributed-memory implementation of the multifrontal method provided by the MUMPS library [21, 22].

Rather than compressing and identifying individual blocks of the decomposition, another hierarchy-exploiting approach considers the system as a whole and seeks to construct a holistic decomposition of the full linear system. An example of such a decomposition is the recursive computation of the inverse of a hierarchical matrix [23, 24], or the computation of its Cholesky or LU factorization [25, 26]. These methods generally have much higher prefactors than methods that compress individual matrix blocks of the factorizations and are not usually competitive for large-scale problems; as an example, we refer the reader to [23] for a discussion of the challenges of scaling the construction of the inverse of a hierarchical matrix.

Pouransari et al. approximate fill-in via low-rank approximations with the H^2 format; see [27]. This format guarantees linear complexity provided that blocks correspond to well-separated clusters and have a data-sparse property. The algorithm starts by recursively bisecting the computational domain, implicitly forming a binary tree. The leaf nodes correspond to independent subdomains, and the internal nodes correspond to Schur complements to be computed with low-rank arithmetic operations. The bottom-up elimination process is performed with a procedure referred to as "extended sparsification", in which the original matrix dimension grows by introducing auxiliary variables but nonetheless remains sparse. Alternatively, elimination can be performed with an in-place algorithm that keeps the matrix size constant. A related method with similar strategies as in this work is the so-called "compress and eliminate" solver [28]. A recent extension of this line of work into a distributed-memory environment, documented in [29], demonstrates that concurrent processors can work on independent subdomains defined by their corresponding subgraphs, where interior vertices are eliminated concurrently. Communication is needed at the boundary vertices, but additional concurrency at the boundary is exploited through graph coloring.

1.2. Contributions

The contribution of this work is the development of a parallel, robust and efficient method for the solution of block tridiagonal linear systems, with emphasis on systems that arise from the discretization of elliptic PDEs. ACR is a fast solver in the sense that it has log-linear complexity in operation count and memory consumption. The algorithm arrives at the solution in a finite number of steps, rather than iteratively converging to a solution, which makes it a direct solver with a tunable accuracy. The fact that ACR is entirely algebraic extends its range of applicability to problems with an arbitrary coefficient structure, including nonsymmetry within the block tridiagonal sparsity structure, subject to their amenability to rank compression. This entirely algebraic property gives the method robustness on problems that are challenging for iterative methods, while still maintaining asymptotic efficiency.

Two key features of the algorithm from a computational perspective are the simplicity of its parallelization and the regularity of its communication patterns in a distributed-memory environment.
The communication pattern is well established beforehand and is based on recursive bisection, as opposed to nested dissection with different block sizes at different levels of the factorization. The amount of inter-node concurrency is proportional to the size of the blocks and fits readily into a distributed-memory parallel environment. The algorithm also exhibits substantial intra-node concurrency, both in processing multiple blocks and within its hierarchical operations on individual blocks, which fits the multi-core architecture of modern supercomputers.

We demonstrate that our implementation is well suited for modern parallel multi-core systems and scalable in a distributed-memory environment. We also compare our implementation against other state-of-the-art direct solvers over a relevant class of problems and show competitive time to solution and memory requirements.
2. Preliminaries
In this section, we review the building blocks of the proposed solver, namely hierarchical low-rank approximations and the cyclic reduction algorithm.

2.1. Hierarchical matrices
A hierarchical matrix is a data-sparse representation that enables fast linear algebraic operations by using a hierarchy of off-diagonal blocks, each represented by a low-rank approximation, that can be tuned to guarantee an arbitrary precision. The approximation, sometimes referred to as compression, is performed via the singular value decomposition, or with a related method that delivers a low-rank approximation with fewer arithmetic operations than the traditional SVD. For the representation to be effective in terms of arithmetic operations and memory requirements, numerical ranks significantly smaller than the sizes of the various matrix blocks are required. There are several hierarchical and non-hierarchical low-rank approximation formats available in the literature. In this work, we consider the H-matrix format introduced by Hackbusch et al. in [30]. Being modular by design, ACR is not limited to the H-format. In fact, the use of the H^2-format would immediately translate into an additional reduction of one logarithmic factor in the arithmetic and memory complexity estimates, from $O(kN \log N(\log N + k))$ to $O(kN \log N)$ in terms of operations, and from $O(kN \log N)$ to $O(kN)$ in terms of memory requirements; however, this requires a complete set of hierarchical matrix arithmetic operations and fast construction, which at the time of this publication is still ongoing work within our group. Our implementation uses the H-format and the arithmetic operations provided by the HLibPro library. We refer the reader to [31, 32] for a discussion of the shared-memory scalability of these hierarchical matrix operations, their relative costs, and their performance on modern manycore architectures. HLibPro does not feature a distributed-memory solver; we use its shared-memory kernels in combination with MPI to orchestrate the parallel workload across nodes in a distributed-memory environment, as we will discuss in Section 4.

2.1.1. H-matrix construction

The structure of a hierarchical matrix in the H format can be described by four components: an index set, a cluster tree, a block cluster tree, and the choice of an admissibility condition. The index set $I = \{0, 1, \ldots, N-1\}$ represents the $N$ degrees of freedom. The cluster tree represents row/column groupings, and it is constructed by recursively subdividing the index set. Once the cluster tree is formed, the block cluster tree defines matrix sub-blocks over the product index set $I \times I$. Its leaves are either low-rank blocks or small dense ones. Finally, the admissibility condition determines whether a given block should be represented as a low-rank approximation or as a dense block.

The first step in the construction of an H-matrix is the definition of the cluster tree of unknowns. In this work, since each block row of the sparse matrix represents a plane from a three-dimensional regular discretization, we leverage the geometry information by selecting a binary space partitioning strategy to cluster the unknowns, considering the two-dimensional domain representing the planes.

The next step is the definition of a block cluster tree for these two-dimensional domains, which together with the admissibility condition determines the structure of the hierarchical representation of the plane-block. We chose a standard admissibility condition, as opposed to the simpler weak admissibility condition, because it provides the flexibility of selecting a range of coarser or finer blocks, tuned by an admissibility parameter $\eta$.
Weak admissibility refers to a matrix decomposition where the (1,2) and (2,1) blocks are single low-rank blocks and the (1,1) and (2,2) blocks are recursively decomposed in a similar way. On the other hand, standard admissibility allows a more refined blocking of the matrix; the $\eta$ parameter appears in the inequality
\[
\min\big(\mathrm{diameter}(\tau), \mathrm{diameter}(\sigma)\big) \le \eta \cdot \mathrm{distance}(\tau, \sigma),
\]
where $\tau$ and $\sigma$ denote two geometric regions defined as the convex hulls of two separate point sets $t$ and $s$ (nodes in the cluster tree). A matrix block $A_{ts}$ satisfying this inequality is represented in low-rank form.

The motivation for choosing a standard admissibility condition is that, by further refining the off-diagonal blocks, it is possible to achieve a similar accuracy with smaller numerical ranks, which are crucial to ensure economical memory consumption and overall high performance. The impact of the admissibility condition is illustrated in Figure 1, which depicts the H-inverse of the variable-coefficient two-dimensional Poisson operator discretized on a
$N = 64 \times 64$ grid using a finite difference scheme. In the right panel, the use of a few small dense blocks in the off-diagonal regions allows much smaller ranks to be used in the remaining low-rank blocks, without compromising accuracy.

The word "block" is overloaded in this discussion. It is used to denote the partitions of the block tridiagonal coefficient matrix of the problem. It is also used to denote the partitioning of a matrix into low-rank and dense subdivisions. When necessary to avoid confusion, we will use the word "plane" or "plane-block" to refer to the first meaning.

(a) Weak admissibility. (b) Standard admissibility. (c) Rank histogram with respect to the admissibility condition.
Figure 1: H-inverse of the 2D Poisson operator, discretized with $N = 64 \times 64$ grid points, using two different admissibility conditions. The number in each block is the numerical rank necessary to achieve a compression accuracy of 1E-4. The color map is determined by the ratio of the numerical rank to the size of the block; deep blue indicates an effective low-rank approximation, while red depicts dense blocks.
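As a concrete illustration of the admissibility machinery just described (our own minimal sketch in Python, not code from the ACR implementation; HLibPro provides these components internally), the fragment below checks the standard admissibility inequality for two point clusters and builds cluster-tree leaves by geometric bisection.

import numpy as np

def diameter(points):
    # Diameter of the axis-aligned bounding box of a cluster of points (m x d array).
    return np.linalg.norm(points.max(axis=0) - points.min(axis=0))

def distance(tau, sigma):
    # Distance between the bounding boxes of two clusters (zero if they overlap).
    lo = np.maximum(tau.min(axis=0), sigma.min(axis=0))
    hi = np.minimum(tau.max(axis=0), sigma.max(axis=0))
    return np.linalg.norm(np.maximum(lo - hi, 0.0))

def is_admissible(tau, sigma, eta):
    # Standard admissibility: min(diam(tau), diam(sigma)) <= eta * dist(tau, sigma).
    return min(diameter(tau), diameter(sigma)) <= eta * distance(tau, sigma)

def bisect(indices, points, leaf_size=32):
    # Geometric bisection along the longest bounding-box edge; returns cluster-tree leaves.
    if len(indices) <= leaf_size:
        return [indices]
    extents = points[indices].max(axis=0) - points[indices].min(axis=0)
    axis = int(np.argmax(extents))
    order = indices[np.argsort(points[indices, axis])]
    half = len(order) // 2
    return bisect(order[:half], points, leaf_size) + bisect(order[half:], points, leaf_size)

Blocks whose row and column clusters satisfy the inequality are stored in low-rank form; the remaining small blocks near the diagonal are kept dense, which is the pattern visible in the right panel of Figure 1.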
Table 1 shows the storage gains from representing the inverse of a 2D Poisson problem with an H-matrix with weak admissibility versus standard admissibility. Table 1 also shows the difference in the number of operations between these two structures. The cost of the H-matrix inversion grows with the average rank $k$ of the low-rank blocks and with the sparsity constant $C_{sp}$ of the structure of the hierarchical matrix inverse; see [34] for the precise operation counts. Since the weak admissibility condition requires larger ranks than the standard admissibility condition, at scale, this tends to increase the memory requirements and the number of floating-point operations.

Operation       Format                          Storage   Operations
Inverse         H (weak admissibility)          723 MB    8.0E11
Inverse         H (standard admissibility)      434 MB    5.0E11
Factorization   HSS (weak admissibility)*        40 MB    3.3E07

Table 1: H-inverse of the 2D Poisson operator for $N = 16{,}384$ grid points, using two different admissibility conditions. We document the memory and floating-point operations to build the H-matrix inverse with weak and standard admissibility. The weak admissibility condition tends to require large ranks, which leads to increased memory requirements and more arithmetic operations than the standard admissibility condition. (Equivalent dense storage and arithmetic operations for the inverse would have required 2,147 MB and 1.0E13 operations.) *As a matter of comparison, we show the equivalent storage requirements of the same problem using the nested-basis HSS format, which for this particular problem has an optimal complexity.

A low-rank approximation for a given off-diagonal block can be found in a variety of ways. Several strategies, ranging from randomized algorithms to heuristics for pivoting, are available in the literature. Every block of the H-matrix stored as a low-rank approximation has the form of an outer product $AB^T$. The goal of efficient hierarchical matrix processing is to construct the best possible low-rank factorization as matrix operations are performed. This routine is often referred to as the compression step. For a comprehensive discussion of the construction of H-matrices and their arithmetic, we refer the reader to [34].

2.2. Cyclic reduction

This section reviews the cyclic reduction algorithm in preparation for the following section describing the accelerated cyclic reduction variant that improves its arithmetic and memory complexity growth.
Consider the seven-point stencil finite difference discretization with Dirichlet boundary conditions of the three-dimensional variable-coefficient Poisson equation on the unit cube,
\[
-\nabla \cdot \big(\kappa(x)\, \nabla u\big) = f(x). \qquad (1)
\]
This discretization leads to a block tridiagonal linear system of $N = n^3$ unknowns, corresponding to a matrix $A$ composed of $3n - 2$ blocks of size $n^2 \times n^2$:
\[
A = \mathrm{tridiagonal}(E_i, D_i, F_i) =
\begin{bmatrix}
D_0 & F_0 & & & \\
E_1 & D_1 & F_1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & E_{n-2} & D_{n-2} & F_{n-2} \\
 & & & E_{n-1} & D_{n-1}
\end{bmatrix}. \qquad (2)
\]
Block cyclic reduction can be used to solve the system defined in Equation 2. The algorithm consists of two phases: elimination and back substitution.

The first step is to rearrange the linear system via the matrix permutation
$(PAP^T)(Pu) = Pf$. The permutation matrix $P$ corresponds to a red/black (even/odd) ordering of the blocks. For illustration, we choose $n = 8$:
\[
\begin{bmatrix}
D_0^{(0)} & & & & F_0^{(0)} & & & \\
 & D_2^{(0)} & & & E_2^{(0)} & F_2^{(0)} & & \\
 & & D_4^{(0)} & & & E_4^{(0)} & F_4^{(0)} & \\
 & & & D_6^{(0)} & & & E_6^{(0)} & F_6^{(0)} \\
E_1^{(0)} & F_1^{(0)} & & & D_1^{(0)} & & & \\
 & E_3^{(0)} & F_3^{(0)} & & & D_3^{(0)} & & \\
 & & E_5^{(0)} & F_5^{(0)} & & & D_5^{(0)} & \\
 & & & E_7^{(0)} & & & & D_7^{(0)}
\end{bmatrix}
\begin{bmatrix}
u_0^{(0)} \\ u_2^{(0)} \\ u_4^{(0)} \\ u_6^{(0)} \\ u_1^{(0)} \\ u_3^{(0)} \\ u_5^{(0)} \\ u_7^{(0)}
\end{bmatrix}
=
\begin{bmatrix}
f_0^{(0)} \\ f_2^{(0)} \\ f_4^{(0)} \\ f_6^{(0)} \\ f_1^{(0)} \\ f_3^{(0)} \\ f_5^{(0)} \\ f_7^{(0)}
\end{bmatrix}. \qquad (3)
\]
Grouping the even and odd blocks, the permuted system can be written as
\[
\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
\begin{bmatrix} u_{\mathrm{even}} \\ u_{\mathrm{odd}} \end{bmatrix}
=
\begin{bmatrix} f_{\mathrm{even}} \\ f_{\mathrm{odd}} \end{bmatrix}. \qquad (4)
\]
The Schur complement computations of the partitioned system are shown in Equation 5:
\[
\big(A_{22} - A_{21} A_{11}^{-1} A_{12}\big)\, u_{\mathrm{odd}} = \tilde{f}, \qquad \tilde{f} = f_{\mathrm{odd}} - A_{21} A_{11}^{-1} f_{\mathrm{even}}. \qquad (5)
\]
Since the upper-left block $A_{11}$ is block diagonal, its inverse can be computed as the inverse of each individual block (in this case: $D_0^{(0)}$, $D_2^{(0)}$, $D_4^{(0)}$, and $D_6^{(0)}$), in parallel. All computations for the generation of the Schur complement at step
$i+1$, whose size is half of the step-$i$ problem, are also done at block-level granularity, as shown in Equation 6, which applies to odd $j$ only. There is a slight abuse of notation in Equation 6 to handle the case of the last plane, which has only one neighbor: the computations involving the plane $j+1$ are simply omitted. The block operations involved ($A +_H B$, $A -_H B$, $A \cdot_H B$, $A \cdot_H b$, $A^{-1}_H$) are performed with dense or with hierarchical arithmetic, depending on whether the matrices are represented in the regular sparse format or in the H-matrix format, as we will later refer back to when describing the H-matrix accelerated cyclic reduction method.
\[
\begin{aligned}
E_j^{(i+1)} &= -\,E_j^{(i)} \cdot_H (D_{j-1}^{(i)})_H^{-1} \cdot_H E_{j-1}^{(i)} \\
D_j^{(i+1)} &= D_j^{(i)} -_H E_j^{(i)} \cdot_H (D_{j-1}^{(i)})_H^{-1} \cdot_H F_{j-1}^{(i)} -_H F_j^{(i)} \cdot_H (D_{j+1}^{(i)})_H^{-1} \cdot_H E_{j+1}^{(i)} \\
F_j^{(i+1)} &= -\,F_j^{(i)} \cdot_H (D_{j+1}^{(i)})_H^{-1} \cdot_H F_{j+1}^{(i)} \\
f_j^{(i+1)} &= f_j^{(i)} - E_j^{(i)} \cdot_H (D_{j-1}^{(i)})_H^{-1} \cdot_H f_{j-1}^{(i)} - F_j^{(i)} \cdot_H (D_{j+1}^{(i)})_H^{-1} \cdot_H f_{j+1}^{(i)}
\end{aligned}
\qquad (6)
\]
This process of permutation and Schur complementation is recursive. It finishes when a single block is left, or when the remaining system is small enough to be inverted directly. Recursion is possible because the Schur complement of a tridiagonal matrix is tridiagonal. This property can be seen in the structure of the matrix at the next step, shown in Equation 7 and illustrating the remaining (originally odd) unknowns after they have been renumbered sequentially:
\[
\begin{bmatrix}
D_0^{(1)} & F_0^{(1)} & & \\
E_1^{(1)} & D_1^{(1)} & F_1^{(1)} & \\
 & E_2^{(1)} & D_2^{(1)} & F_2^{(1)} \\
 & & E_3^{(1)} & D_3^{(1)}
\end{bmatrix}
\begin{bmatrix} u_0^{(1)} \\ u_1^{(1)} \\ u_2^{(1)} \\ u_3^{(1)} \end{bmatrix}
=
\begin{bmatrix} f_0^{(1)} \\ f_1^{(1)} \\ f_2^{(1)} \\ f_3^{(1)} \end{bmatrix}. \qquad (7)
\]
The algorithm proceeds to apply the red/black permutation followed by a Schur complementation for two more steps to compute the last single block $D^{(q)}$.

Once elimination is completed, the solve stage starts from the last block of unknowns, as shown in Equation 8:
\[
D^{(q)} \cdot_H u^{(q)} = f^{(q)}. \qquad (8)
\]
Once the solution at the last step, $u^{(q)}$, is computed, it is propagated backward in the hierarchy of the elimination tree. The formula to compute the solution at step $i$ is given by
\[
u_j^{(i)} = (D_j^{(i)})_H^{-1} \cdot_H \big( f_j^{(i)} - E_j^{(i)} \cdot_H u_{j-1}^{(i+1)} - F_j^{(i)} \cdot_H u_{j+1}^{(i+1)} \big), \qquad (9)
\]
where $u_{j-1}^{(i+1)}$ and $u_{j+1}^{(i+1)}$ denote the already computed solution values of the neighboring planes. This procedure continues until the solution of the entire linear system is computed.

Back-substitution is much more lightweight than the elimination phase regarding computation and communication volume, because it communicates parts of the solution in the form of vectors, and the only matrix operation performed is a matrix-vector multiplication. For large-scale problems, this makes the solve phase orders of magnitude faster than the elimination phase. As with other direct solvers, the ability to efficiently solve for a given right-hand side, given a factorization, motivates the use of ACR for multiple right-hand sides at a minimal cost per new forcing term.
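For readers who prefer code to formulas, the following is a minimal sketch of Equations 2, 6 and 9 with dense NumPy blocks, i.e. classical block cyclic reduction rather than the H-matrix version, assuming the constant-coefficient case; it is our own illustration rather than the ACR code, and it checks the recurrences against a dense solve on a small problem.

import numpy as np

def poisson_blocks(n):
    # 7-point stencil on an n x n x n unit-cube grid, grouped by planes: n diagonal blocks
    # D[j] of size n^2, coupling blocks E[j], F[j] = -I. The unused boundary couplings
    # E[0] and F[n-1] are set to zero so that the update formulas need no special cases.
    T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    A2d = np.kron(np.eye(n), T) + np.kron(T, np.eye(n))     # 2D Laplacian of one plane
    m = n * n
    D = [A2d + 2.0 * np.eye(m) for _ in range(n)]
    E = [np.zeros((m, m))] + [-np.eye(m) for _ in range(n - 1)]
    F = [-np.eye(m) for _ in range(n - 1)] + [np.zeros((m, m))]
    return D, E, F

def block_cr_solve(D, E, F, f):
    # Eliminate the even planes (Equation 6), recurse on the odd ones, then recover the
    # even planes by back-substitution (Equation 9).
    m = len(D)
    if m == 1:
        return [np.linalg.solve(D[0], f[0])]
    Dinv = {j: np.linalg.inv(D[j]) for j in range(0, m, 2)}
    Dr, Er, Fr, fr = [], [], [], []
    for j in range(1, m, 2):
        Dj = D[j] - E[j] @ Dinv[j - 1] @ F[j - 1]
        fj = f[j] - E[j] @ Dinv[j - 1] @ f[j - 1]
        Ej = -E[j] @ Dinv[j - 1] @ E[j - 1]
        if j + 1 < m:
            Dj = Dj - F[j] @ Dinv[j + 1] @ E[j + 1]
            fj = fj - F[j] @ Dinv[j + 1] @ f[j + 1]
            Fj = -F[j] @ Dinv[j + 1] @ F[j + 1]
        else:
            Fj = np.zeros_like(Dj)                           # last plane has one neighbor
        Dr.append(Dj); Er.append(Ej); Fr.append(Fj); fr.append(fj)
    u_odd = block_cr_solve(Dr, Er, Fr, fr)
    u = [None] * m
    for i, j in enumerate(range(1, m, 2)):
        u[j] = u_odd[i]
    for j in range(0, m, 2):                                 # back-substitution (Equation 9)
        rhs = f[j] - (E[j] @ u[j - 1] if j > 0 else 0) - (F[j] @ u[j + 1] if j + 1 < m else 0)
        u[j] = Dinv[j] @ rhs
    return u

if __name__ == "__main__":
    n = 8                                                    # 8 planes of 8 x 8 points, N = 512
    D, E, F = poisson_blocks(n)
    rng = np.random.default_rng(0)
    f = [rng.standard_normal(n * n) for _ in range(n)]
    u = block_cr_solve(D, E, F, f)
    # Cross-check against a dense solve of the assembled N x N system.
    m, A = n * n, np.zeros((n ** 3, n ** 3))
    for j in range(n):
        A[j*m:(j+1)*m, j*m:(j+1)*m] = D[j]
        if j > 0:
            A[j*m:(j+1)*m, (j-1)*m:j*m] = E[j]
        if j < n - 1:
            A[j*m:(j+1)*m, (j+1)*m:(j+2)*m] = F[j]
    err = np.linalg.norm(np.concatenate(u) - np.linalg.solve(A, np.concatenate(f)))
    print(f"difference vs. dense solve: {err:.2e}")

Replacing np.linalg.inv and the dense block products above with H-matrix inversion and multiplication, as described in the next section, yields ACR.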
3. Accelerated Cyclic Reduction
This section describes how cyclic reduction can be used in combination with hierarchical matrices to obtain a variant that improves the computational complexity and memory requirements of the classical cyclic reduction method.

3.1. H-matrix approximation

ACR approximates each $D_i$, $E_i$ and $F_i$ block of the original block tridiagonal matrix $A$ given in Equation 2 with a hierarchical matrix, and then proceeds with the cyclic reduction algorithm, as described in the previous section, using hierarchical matrix arithmetic instead of the conventional dense linear algebra operations. In generating the structure of the hierarchical matrix representations of the blocks, we exploit the fact that the domain is subdivided into $n$ planes, each consisting of $n^2$ grid points, and that block rows of the matrix are identified with the planes of the discretization grid. We consider this geometry and use a two-dimensional planar bisection clustering when constructing each H-matrix.

Cyclic reduction requires hierarchical matrix addition, subtraction, matrix-matrix multiplication, matrix-vector multiplication and matrix inversion. The relative accuracy of the approximation is specified during the compression of each block and while performing hierarchical matrix arithmetic operations. Committing to a given tolerance ensures that the numerical ranks are adjusted to preserve the specified accuracy during the elimination and solve phases. It is at the block level that the improvements in the complexity estimates take place. Table 2 summarizes the advantages of a block-wise approximation of matrix blocks with H-matrices in the computation of the inverse of a block, and its storage, as compared to their equivalent dense counterparts.

                 Inverse                           Storage
Dense matrix     $O(N^3)$                          $O(N^2)$
H-matrix         $O(kN \log N(\log N + k))$        $O(kN \log N)$

Table 2: Comparing the complexity estimates of storing and computing the inverse of an $N \times N$ matrix block in dense format, versus approximating the matrix block with a hierarchical matrix with numerical rank $k$.

To simplify the exposition, we assume the size of the linear system is a power of two; the number of steps required by ACR is thus $q = \log_2 n$. The size of the blocks is $n$ for 2D problems and $n^2$ for the 3D problems considered here. As mentioned in Section 2.2, two procedures define cyclic reduction: elimination and back-substitution. The high-level algorithm for elimination is shown in Listing 1, whereas the high-level algorithm for back-substitution is shown in Listing 2. Even though Algorithms 1 and 2 show permutations and matrix operations at the level of the global system, our implementation operates at a per-block granularity, which means that permutations are part of the implementation's logic and that linear algebraic operations are performed block by block as shown in Equation 6. This is possible since cyclic reduction preserves the block tridiagonal structure during elimination.

Algorithm 1
ACR elimination
  Block-wise low-rank approximation of $A$: $A^{(0)} = A$
  for $i = 0$ to $q-1$ do
    // Generate $A^{(i+1)}$, block tridiagonal of size $n/2^{i+1}$, using block-level operations (Equation 6)
    // Requires $O(k\, n^2 \log n^2 (\log n^2 + k)/2^{i+1})$ operations
    $A^{(i+1)} = A_{22}^{(i)} -_H A_{21}^{(i)} \cdot_H (A_{11}^{(i)})_H^{-1} \cdot_H A_{12}^{(i)}$
    // Forward substitution, requires $O(k\, n^2 \log n^2 / 2^{i+1})$ operations
    $f^{(i+1)} = f_{\mathrm{odd}}^{(i)} - A_{21}^{(i)} \cdot_H (A_{11}^{(i)})_H^{-1} \cdot_H f_{\mathrm{even}}^{(i)}$
  end for

Algorithm 2
ACR back-substitution
  Solve $A^{(q)} u^{(q)} = f^{(q)}$
  for $i = q-1$ to $0$ do
    // Back-substitution, requires $O(k\, n^2 \log n^2 / 2^{i+1})$ operations
    // This is performed at block level (Equation 9)
    $u_{\mathrm{even}}^{(i)} = (A_{11}^{(i)})_H^{-1} \cdot_H (f_{\mathrm{even}}^{(i)} - A_{12}^{(i)} \cdot_H u^{(i+1)})$
  end for

Every cyclic reduction step requires two matrix-matrix multiplications, one matrix inversion and one matrix addition per block being eliminated. These kernels have an arithmetic complexity of $O(k\, n^2 \log n^2 (\log n^2 + k))$ operations [34]. For a problem of size $N = n^3$ with $n = 2^q$, ACR performs $n/2 + n/4 + n/8 + \ldots \approx n$ block eliminations in total. The most expensive computation for each of them is the inverse of a block of size $n^2 \times n^2$, which in H-format has a complexity of $O(k\, n^2 \log n^2 (\log n^2 + k))$; therefore, ACR results in an $O(kN \log N(\log N + k))$ overall algorithm, with $O(kN \log N)$ memory requirements. Table 3 summarizes the complexity estimates of each of the H-matrix operations involved in ACR. Table 4 summarizes the complexity estimates of the classical cyclic reduction method, without exploitation of equal blocks, versus ACR.

Operation          Complexity
$A +_H B$          $O(k\, n^2 \log n^2)$
$A \cdot_H B$      $O(k\, n^2 \log n^2 (\log n^2 + k))$
$A_H^{-1}$         $O(k\, n^2 \log n^2 (\log n^2 + k))$
$A \cdot_H b$      $O(k\, n^2 \log n^2)$

Table 3: Summary of the complexity of the H-matrix arithmetic operations.

Method                                 Operations                       Memory
Cyclic Reduction (CR)                  $O(N^{7/3})$                     $O(N^{5/3} \log N)$
Accelerated Cyclic Reduction (ACR)     $O(kN \log N(\log N + k))$       $O(kN \log N)$

Table 4: Summary of the sequential complexity estimates of the classic cyclic reduction method and the proposed variant, accelerated cyclic reduction; $k$ represents the numerical rank of the approximation.

Because ACR effectively uses hierarchical representations only for a set of regular two-dimensional problems, the constants appearing in the asymptotic complexity estimates tend to be smaller, by virtue of lower rank requirements, which makes it feasible to perform large-scale computations. For instance, our limited experiments show that for the 3D Poisson problem (Table 5) ACR requires substantially lower numerical ranks than the ranks reported in the HSSMF literature [8, 10].

In terms of practical usage, ACR has different concurrency properties than H-LU or multifrontal HSS, enabling different amounts of independent work to be performed. The regularity of the computational patterns of ACR is valuable for the ability to efficiently use current and future hardware architectures.
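For completeness, a sketch of how the per-block costs stated above accumulate into the totals of Table 4. Summing the dominant per-plane inversion cost over the $q$ levels,
\[
\sum_{i=0}^{q-1} \frac{n}{2^{i+1}}\; O\!\big(k\, n^2 \log n^2 (\log n^2 + k)\big)
\;\approx\; n \cdot O\!\big(k\, n^2 \log n^2 (\log n^2 + k)\big)
\;=\; O\!\big(kN \log N (\log N + k)\big),
\]
since $\sum_{i \ge 0} 2^{-(i+1)} = 1$ and $N = n^3$; the analogous sum of the per-block storage $O(k\, n^2 \log n^2)$ over all levels gives the $O(kN \log N)$ memory footprint.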
4. Parallel accelerated cyclic reduction
This section describes how to leverage the concurrency features of the accelerated cyclic reduction method in a distributed-memory parallel environment.
The parallel ACR elimination and back-substitution algorithms are listed in Algorithms 3 and 4, respectively.
Algorithm 3
Parallel ACR elimination
  $j$ = processor number
  parallel for at all processors $j$, $j \in [0, 2^q - 1]$
    Block-wise conversion to H-matrix of $A$ = tridiagonal($E_j^{(0)}$, $D_j^{(0)}$, $F_j^{(0)}$)
  end parallel for
  for $i = 0$ to $q-1$ do
    parallel for at $j$ even, $j \in [0, 2^{q-i} - 1]$
      Compute $(D_j^{(i)})_H^{-1}$
      Communicate $E_j^{(i)}$, $(D_j^{(i)})_H^{-1}$, $F_j^{(i)}$, $f_j^{(i)}$ to processor $j-1$
      Communicate $E_j^{(i)}$, $(D_j^{(i)})_H^{-1}$, $F_j^{(i)}$, $f_j^{(i)}$ to processor $j+1$
    end parallel for
    parallel for at $j$ odd, $j \in [0, 2^{q-i} - 1]$
      Compute $E_j^{(i+1)}$, $D_j^{(i+1)}$, $F_j^{(i+1)}$, $f_j^{(i+1)}$ from Equation 6
    end parallel for
  end for

Algorithm 4
Parallel ACR back-substitution
  $n = 2^q$
  $j$ = processor number
  for $i = q$ to $0$ do
    parallel for at $j$, $j \in [0, 2^{q-i} - 1]$
      Compute $u_j^{(i)}$ from Equation 9
      Communicate $u_j^{(i)}$ to processor $j-1$
      Communicate $u_j^{(i)}$ to processor $j+1$
    end parallel for
  end for

A number of concurrency features of the algorithms are evident. Each block row, identified with a plane in the discretization, is assigned to an MPI rank. This decomposition allows the initial conversion of each block into an H-matrix in an embarrassingly parallel manner. The $q = \log_2 n$ levels of Schur complement computation exploit concurrent execution in two ways:

• The inverse of the block $A_{11}$ of Equation 4 can be computed concurrently in a block-wise fashion, since $A_{11}$ is block diagonal. This computation is embarrassingly parallel.

• Computing the Schur complement requires two matrix-matrix multiplications and one matrix addition. Since the linear system partition is formed out of matrix blocks, these block matrix-matrix multiplications and block matrix additions can also be computed concurrently.

Figure 2 depicts the concurrency through the various levels of ACR elimination. We note here that the ACR decomposition strategy bears a similarity to the slice decomposition [35], and also relates to the sweeping preconditioner strategy [36], with the key distinction being that rather than sweeping through the domain, ACR eliminates several planes at once, concurrently.
Figure 2: Concurrency in ACR elimination for the 16-planes case. Level 0 can eliminate eight planes concurrently, thus halving the problem size passed to the next level; this process continues until one plane is left.
In the current implementation, each plane is assigned to an MPI rank, and multiple planes are assigned to compute nodes. Let $p$ be the number of physical compute nodes, each storing $n/p$ planes at the beginning of the factorization. After $r$ steps of ACR, each compute node holds $n/(2^r p)$ planes. At level $r = \log_2(n/p)$, a coarse level called the C-level, every node holds a single plane only. The remaining $\log_2 p$ steps of ACR beyond the C-level leave some compute nodes idle, as illustrated in Figure 3.
Figure 3: Distribution of multiple planes per physical compute node for an example with n=16 and p = 4.
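The plane-to-node bookkeeping of Figure 3 is simple enough to enumerate directly; the following sketch (our own illustration in Python, not part of the ACR code) prints, for $n$ planes distributed over $p$ nodes, how many planes remain active at each level and where the C-level falls.

import math

def acr_schedule(n, p):
    # n planes (a power of two) assigned blockwise to p nodes; the active set halves at
    # every ACR level, and beyond the C-level some nodes necessarily sit idle.
    assert n % p == 0 and (n & (n - 1)) == 0 and (p & (p - 1)) == 0
    q = int(math.log2(n))
    c_level = int(math.log2(n // p))           # every node holds a single plane here
    for level in range(q + 1):
        active = n >> level                    # planes still active at this level
        busy_nodes = min(active, p)
        planes_per_node = active // p
        print(f"level {level:2d}: {active:4d} active planes, {busy_nodes:3d} busy nodes"
              + (f", {planes_per_node} planes/node" if planes_per_node else "")
              + ("   <- C-level" if level == c_level else ""))

acr_schedule(n=16, p=4)   # the example of Figure 3

For the Figure 3 example ($n = 16$, $p = 4$) the C-level is level 2; afterwards two and then three of the four nodes are idle, which is the source of the $\log p$ term discussed below.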
Distributed-memory communication occurs only at inter-node boundaries, thanks to sorting at every step of the factorization, as the computation of the Schur complement for plane $P_j$ requires only planes $P_{j-1}$ and $P_{j+1}$; see Figure 3. Thus, up to the C-level there are $O(p)$ communication messages per step, each transmitting planes of size $O(k\, n^2 \log n^2)$. Beyond the C-level, there are $O(p/2 + \cdots + 1) \approx O(p)$ communication messages, adding up to a total communication volume of $O\big(k\, p\, n^2 \log n^2 (\log(n/p) + 1)\big)$ for ACR. The communication pattern, with its bottom-up binary tree structure, is depicted in Figure 4 (Step 1: block conversion into H-matrices; Step 2: elimination; Step 3: back-substitution).
Figure 4: Communication pattern for the 8-planes case. $P_j$ depicts the plane being eliminated, and $u_j$ its corresponding solution.

The regularity of the ACR algorithm makes it straightforward to estimate the parallel time of the factorization and assess its scalability characteristics. Consider the busiest compute node, which executes $\log_2 n$ ACR steps. In the $\log_2(n/p)$ steps preceding the C-level, this node processes $n/(2p) + n/(4p) + \cdots + 1 \approx n/p$ planes. This results in an asymptotic parallel time complexity for ACR of $O\big(k\, n^2 \log n^2 (\log n^2 + k)(n/p + \log p)\big)$. The sequential computational time gets reduced by the number of parallel compute nodes $p$, but at the expense of an additional $\log p$ factor that inhibits perfect strong scaling. Fortunately, the amount of work above the C-level that introduces this $\log p$ factor is small and grows only as $n = N^{1/3}$.

Finally, we note that beyond the parallelism across distributed compute nodes, there is additional concurrency available at the node level. This additional level of parallelism is possible not only because elimination and back-substitution for multiple block rows can proceed concurrently, but also because parallel variants of the hierarchical matrix arithmetic can be used in performing operations on individual blocks. The two levels of intra-node parallelism are shown schematically in Figure 5. In practice, programming models based on tasks and directed acyclic graphs have proven to be effective in parallelizing hierarchical matrix arithmetic [32, 10], but the optimal allocation of the multiple cores of a compute node to either block-row processing or to individual operations on single blocks requires tuning. We do not describe this aspect of the parallel implementation further here.
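The critical-path count behind this estimate is, in sketch form, the number of plane reductions performed by the busiest node,
\[
\underbrace{\frac{n}{2p} + \frac{n}{4p} + \cdots + 1}_{\log_2(n/p)\ \text{steps up to the C-level}}
\;+\;
\underbrace{1 + 1 + \cdots + 1}_{\log_2 p\ \text{steps beyond it}}
\;\approx\; \frac{n}{p} + \log_2 p,
\]
which, multiplied by the per-plane elimination cost $O(k\, n^2 \log n^2(\log n^2 + k))$, gives the parallel time estimate above.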
Figure 5: Parallel ACR elimination tree depicting two levels of concurrency, using distributed-memory parallelism to distribute concurrent work across compute nodes, and shared-memory parallelism to perform H-matrix operations within the nodes.
5. Numerical results
This section documents the parallel performance and scalability of ACR in a distributed-memory environment. The source code is written in C and compiled with the Intel C compiler v15. External libraries utilized in the reference implementation include HLIBpro v2.2 with Intel TBB [31, 37], and the sequential version of the Intel Math Kernel Library [38]. Experiments are conducted on the Cray XC40 Shaheen supercomputer at the King Abdullah University of Science & Technology. Each node has 128 GB of RAM and two Intel Haswell processors, each with 16 cores clocked at 2.3 GHz.

To provide a baseline of performance we consider the solution of the same linear systems with STRUMPACK [10] v1.0.3, the open-source implementation of the HSS-structured multifrontal solver (HSSMF) developed at the Lawrence Berkeley National Laboratory. The HSSMF method can solve a broader class of linear systems compared to ACR, but the comparison is still of interest, as STRUMPACK is among the few available implementations of distributed-memory fast direct solvers that exploit hierarchically low-rank approximations.

The tuning parameters of ACR include the choice of the leaf node size $n_{\min}$ for the H-matrices, the threshold parameter $\eta$ used to decide which blocks will be approximated with a low-rank factorization or kept as dense, full-rank blocks, and the accuracy $H_\epsilon$ for the construction and algebraic operations of the H-matrices. The tuning parameters for STRUMPACK include how many matrices from the nested-dissection elimination tree will be approximated as HSS, which is controlled by specifying the size threshold at which frontal matrices will be represented as HSS matrices, the compression accuracy for the HSS matrices, and the minimum leaf size of the HSS frontal matrices. We recall here that the HSS matrix format uses the so-called weak admissibility condition, whereas ACR uses a standard admissibility condition, which does not limit the use of dense blocks exclusively to the matrix diagonal. Additionally, we also consider the algebraic multigrid (AMG) implementation of hypre [39, 40]. Comparison experiments are set to deliver solutions with comparable relative residuals $\|Ax - b\|/\|b\|$. For further comparisons, we also consider the multifrontal (MF) implementation of STRUMPACK, and our cyclic reduction (CR) implementation with dense matrix blocks.

5.1. Poisson equation

We consider a sequence of Poisson problems of up to $512^3 \approx 1.3 \times 10^8$ unknowns,
\[
-\nabla^2 u = f(x), \quad x \in \Omega = [0,1]^3, \qquad u(x) = 0, \quad x \in \Gamma, \qquad (10)
\]
discretized with the 7-point finite-difference star stencil, which leads to a symmetric positive definite linear system. Although this problem can be solved with ACR, other methods such as multigrid or FFTs are ordinarily used instead; we consider it to report on a standard and well-known problem, to facilitate the exposition of ACR.
(a) Strong scaling of factorization. (b) Strong scaling of solve. (c) Weak scaling of factorization. (d) Weak scaling of solve. (e) Memory consumption. (f) Choice of H-matrix structure to represent planes; blue indicates low-rank blocks, whereas red indicates dense blocks.

Figure 6: Parallel scalability and memory consumption of ACR for the solution of the Poisson equation. Figure 6f depicts the structure of the H-matrices utilized for all the ACR blocks (in this case, extracted from one of the Poisson problems). The deep blue color indicates an effective compression, while red blocks indicate no compression.

Table 5 (columns: $H_\epsilon$, $\eta$, leaf size, relative residual, average rank, largest rank): Execution parameters, obtained relative residual, and ranks of the ACR factorization for the Poisson experiments.

Table 6 (columns: $N$, compression tolerance, relative residual, leaf size, minimum front size, largest rank): Execution parameters, obtained relative residual, and ranks of the HSSMF factorization for the Poisson experiments.
Furthermore, the discretization of the Poisson equation has all positive eigenvalues with rapid decay in the off-diagonal blocks, making it also an ideal case for the analysis of hierarchically low-rank approximations.

Figures 6a and 6b show the total time in seconds for the factorization and solve phases of ACR in a strong scaling setting; dashed lines indicate ideal scaling. Ideal scaling of the factorization stage deteriorates at large processor counts as factors such as communication volume and hardware latency begin to play a significant role; the same factors tend to dominate even more during the solve phase, the latter being a sequence of fast H-matrix-vector multiplications with limited availability of communication/computation overlap.

Figures 6c and 6d depict the results of a weak scaling test for ACR with different numbers of degrees of freedom per processor, along with the ideal weak scaling reference lines depicted as dashed curves. The timings deviate from the ideal scaling due to the inherent load imbalance of the recursive bisection strategy of cyclic reduction, as some processors become idle towards the end of the reduction. Communication latency further impacts the solve stage at large core counts due to the lower arithmetic intensity of this stage.

Figure 6e depicts the memory requirements to store the ACR factorization, together with the expected asymptotic memory usage of $O(N \log N)$. We stress that the maximum rank of the factored matrices varies from 5 to 10 within all the combinations of problem sizes and numbers of processors considered in the strong and weak scaling tests (data not shown). Figure 6f depicts the structure of the H-matrices used to represent each plane, with the choice of standard admissibility condition. Dark blue blocks denote a low ratio between the numerical rank of the approximation and the full rank of the block, whereas red blocks indicate non-admissible blocks stored in dense format. For visualization purposes, the figure was taken from one of the problems and represents the last diagonal block during the elimination phase of ACR. The prevalence of dark blue blocks indicates a good relative compression of each block, since the ratio of the numerical rank of the approximation to the actual block size is very small. Most of the red blocks are clustered near the diagonal, where the smallest blocks reside.

Figure 7 compares all solvers under consideration for a set of Poisson problems of increasing size, with processor counts increased from 256 to 4,096, respectively. We document the execution parameters, obtained relative residual, and ranks of the ACR and HSSMF factorizations in Tables 5 and 6. We report factorization times in Figure 7a, showing that ACR can competitively tackle these problems. Similarly, the solve timings in Figure 7b show that ACR is able to solve for a given right-hand side in times comparable to the other methods under consideration. Figure 7c documents the size of the factors required by the factorizations, and it shows that the cyclic reduction method (CR) with dense blocks runs out of memory even for the smaller problems in this set. Additionally, we report the peak memory utilization of each solver using the library PAPI v5.5 [41], which shows the largest memory usage that each solver required to produce the factorization. Also, the experiments confirm that the HSSMF method requires less memory to store its factors than the multifrontal method (MF).
However, as Figure 7d shows, the HSSMF method requires higher ranks than ACR, which translates into a larger size of the factors and prohibited the execution of HSSMF for the largest problems considered. The experiments show that ACR requires only $O(1)$ ranks, as opposed to the $O(n)$ rank requirements of the HSSMF factorization.

As expected for this particular problem, multigrid is the method of choice concerning performance and memory footprint for a single right-hand side. However, for multiple right-hand sides, the ability to reuse the factorization could give the advantage to solvers based on factorization.

(a) Factorization time. (b) Solve time. (c) Memory consumption. (d) Largest rank in the factorization.
Figure 7: Performance of the factorization and solve phases of ACR for the Poisson problem.

The factorization times for ACR and HSSMF are comparable, with the setup stage of HSSMF being faster for smaller problems; the smaller ranks required by ACR instead lead to a faster factorization step for large problem sizes and a faster time to solution.

While the ACR and HSSMF solvers can deliver a more accurate solution as direct solvers (i.e., without iterative procedures), this comes at the expense of more time and memory; it is common practice that such a factorization is then used as a preconditioner or passed to an iterative refinement procedure. Numerical experiments confirm that ACR could be used as a direct solver if we tune its parameters with a higher accuracy for its H-matrix representations and operations, as depicted in Figure 8, at the expense of modest rank increases, albeit with higher memory requirements and time to solution. However, as Table 7 shows, a low-accuracy factorization in combination with an iterative procedure is best to minimize the total time to solution.

Table 7 columns: $H_\epsilon$, Factors (MB), Largest rank, Factorization, Apply, Total time, Iterations.
Table 7: Iterative solution of a Poisson problem with the conjugate gradient method and an ACR preconditioner. The relative residual of the solution is 1E-6 in all cases.

(a) ACR factorization time. (b) ACR solve time. (c) ACR size of the factors. (d) Rank requirements of the factorization. (Quantities plotted against relative residual tolerances ranging from 1e-2 to 1e-12.)
Figure 8: Controllable accuracy of the ACR solution for a Poisson problem.
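The usage pattern behind Table 7 is a standard preconditioned conjugate gradient loop in which a low-accuracy ACR factorization plays the role of the preconditioner. The sketch below is our own illustration: acr_factorize and acr_apply are hypothetical stand-ins for the factorization setup and its $O(kN \log N)$ application, not the actual ACR interface.

import numpy as np

def pcg(apply_A, b, apply_M, tol=1e-6, maxiter=200):
    # Preconditioned conjugate gradients; apply_A and apply_M are callables returning
    # A @ x and an (approximate) inverse application M @ x, respectively.
    x = np.zeros_like(b)
    r = b - apply_A(x)
    z = apply_M(r)
    p = z.copy()
    rz = r @ z
    for it in range(maxiter):
        Ap = apply_A(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, it + 1
        z = apply_M(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxiter

# Hypothetical usage with a loose ACR factorization (e.g. H_eps = 1e-2) as preconditioner:
# factorization = acr_factorize(A_blocks, eps=1e-2)                      # setup once
# x, its = pcg(lambda v: A @ v, b, lambda v: acr_apply(factorization, v))

The trade-off documented in Table 7 is that a looser $H_\epsilon$ cheapens the setup and shrinks the factors at the cost of a few extra iterations, which minimizes the total time to solution.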
5.2. Convection-diffusion equation

We next consider a standard convection-diffusion problem,
\[
-\nabla^2 u + \alpha\, b(x) \cdot \nabla u = f(x), \quad x \in \Omega = [0,1]^3,
\qquad
b(x) =
\begin{bmatrix}
\sin(a\pi x)\sin\!\big(a\pi(\tfrac{1}{2}+y)\big) + \sin\!\big(a\pi(\tfrac{1}{2}+z)\big)\sin(a\pi x) \\
\cos(a\pi x)\cos\!\big(a\pi(\tfrac{1}{2}+y)\big) + \cos\!\big(a\pi(\tfrac{1}{2}+y)\big)\cos(a\pi z) \\
\cos(a\pi x)\cos\!\big(a\pi(\tfrac{1}{2}+z)\big) + \sin\!\big(a\pi(\tfrac{1}{2}+y)\big)\sin(a\pi z)
\end{bmatrix}, \qquad (11)
\]
discretized with a 7-point upwind finite difference scheme, which leads to a non-symmetric linear system that is challenging for classical iterative solvers, especially when the convection term dominates the equation. The $b(x)$ term we consider is a three-dimensional generalization of the two-dimensional vortex flow proposed by Wessel et al. [42]. We adjust the forcing term and boundary conditions to match an exact solution composed of $\sin(\pi x) + \sin(\pi y) + \sin(\pi z)$ plus higher-frequency sine modes in each coordinate direction, as proposed by Gupta and Zhang [43], an archetypal challenging problem for multigrid methods.

To demonstrate the robustness of ACR and HSSMF for this problem, we fix the number of degrees of freedom and increase the dominance of the convective term; results are reported in Figure 9. Consistent with the Poisson problem, multigrid remains the method of choice for diffusion-dominated problems in terms of time to solution; however, when $\alpha$ is increased, the performance of AMG deteriorates. On the other hand, both ACR and HSSMF prove able to solve convection-dominated problems, with ACR being consistently faster than HSSMF, particularly in the back-substitution phase. The sizes of the factors generated by ACR and HSSMF are comparable, with ACR using significantly smaller ranks.

(a) Factorization time. (b) Solve time. (c) Factors size. (d) Largest rank in the factorization.
Figure 9: Robustness of ACR for the convection-diffusion problem. In convection-dominated problems (large $\alpha$), AMG fails to converge while the direct solvers maintain a steady performance.

5.3. Indefinite Helmholtz equation

We finally consider the indefinite Helmholtz equation with Dirichlet boundary conditions on the unit cube,
\[
-\big(\nabla^2 u + \kappa^2 u\big) = f(x), \quad \Omega = [0,1]^3, \qquad (12)
\]
discretized with the 27-point trilinear finite element scheme on hexahedra. Results for ACR and HSSMF are reported in Figure 10. The parameter $\kappa$ is chosen to obtain a sampling rate of approximately 12 points per wavelength for each problem size, corresponding to approximately $10 \times 10 \times 10$ wavelengths across the unit cube for one of the problem sizes considered.
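For reference, the relation between $\kappa$ and the sampling rate, under the assumption of a unit cube with $n$ grid points per direction ($h \approx 1/n$) and wavelength $\lambda = 2\pi/\kappa$, is
\[
\text{points per wavelength} = \frac{\lambda}{h} \approx \frac{2\pi n}{\kappa} = 12
\quad\Longrightarrow\quad
\kappa \approx \frac{\pi n}{6},
\]
so $\kappa$ grows linearly with $n$ when the sampling rate is held fixed, and the larger problems correspond to higher-frequency, more indefinite regimes.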
As opposed to the positive definite Helmholtz equation, which models phenomena similar to diffusion, the indefinite variant, commonly denoted as the wave Helmholtz equation, has a solution that is oscillatory in nature. Multigrid methods are known to diverge without specific customizations for high-frequency Helmholtz problems, which we also confirmed via experimentation. For a detailed examination of the difficulties of solving the Helmholtz equation with classical iterative methods we refer the reader to [44].

We document the execution parameters, obtained relative residual, and ranks of the ACR and HSSMF factorizations in Tables 8 and 9. Numerical experiments show that ACR features consistently lower factorization and solve times than HSSMF, as can be seen in Figures 10a and 10b. The sizes of the factors of ACR and HSSMF are comparable, with slightly higher memory requirements for ACR due to performance-oriented tuning; see Figure 10c. Furthermore, as also shown in Section 5.1, HSSMF required less memory than MF, and CR quickly runs out of memory as the problem size grows. Finally, the largest rank of ACR is consistently lower than that of HSSMF, even though both solvers require $O(n)$ ranks, as shown in Figure 10d. Nevertheless, the lower ranks lead to a faster time to solution in favor of ACR.

Table 8 (columns: $H_\epsilon$, $\eta$, leaf size, relative residual, average rank, largest rank): Execution parameters, obtained relative residual, and ranks of the ACR factorization for the Helmholtz experiments.

Table 9 (columns: $N$, compression tolerance, relative residual, leaf size, minimum front size, largest rank): Execution parameters, obtained relative residual, and ranks of the HSSMF factorization for the Helmholtz experiments.

(a) Factorization time. (b) Solve time. (c) Memory usage. (d) Largest rank in the factorization.
Figure 10: Solution of increasingly larger indefinite Helmholtz problems, consistently discretized with 12 points per wavelength.

6. Conclusions and future work

We present a novel fast direct solver, Accelerated Cyclic Reduction, for block tridiagonal linear systems which commonly arise in the discretization of elliptic operators. The elimination strategy is based on a red/black ordering of the blocks that logically divides the grid into planes, approximates the matrix blocks representing these planes with H-matrices, and proceeds with elimination using hierarchical matrix operations. ACR achieves a log-linear arithmetic complexity of $O(kN \log N(\log N + k))$ and memory requirements of $O(kN \log N)$ by approximating each block with a hierarchical matrix whose structure is refined using a spatial partitioning of the planar grid sections, employing a strong admissibility criterion that effectively limits the ranks of individual low-rank blocks in the hierarchical matrix representations, and operating with hierarchical matrix arithmetic throughout. The average rank $k$ of the blocks inside the hierarchical matrix representations controls the accuracy of the approximation and grows only modestly with problem size. A fair agreement with the rank estimate of [6] was found for the 3D Poisson equation, $O(1)$ (Table 5), and for the 3D Helmholtz equation, $O(n)$ (Table 8).

The concurrency features of ACR are among its most important strengths. The regularity and structure of the decomposition allow an efficient load balance. These features are demonstrated in a distributed-memory environment with numerical experiments that study the strong and weak scalability of our implementation. We provide a reference for performance and memory consumption using comparisons with state-of-the-art open-source implementations of the HSS-structured multifrontal solver from the STRUMPACK library, and algebraic multigrid from hypre.

ACR, being essentially a direct solver with tunable accuracy, can tackle problems that lack definiteness, such as the indefinite high-frequency Helmholtz equation, or symmetry, such as the convection-diffusion equation. For these problems, stock versions of algebraic multigrid fail to produce convergent schemes. We demonstrated the robustness of ACR in dealing with such problems over a range of problem sizes and parameters.

While multigrid methods are generally superior for scalar problems possessing smoothness and definiteness, direct factorization methods such as ACR and HSSMF benefit where multiple right-hand sides are involved, as the time to solve per extra forcing term is orders of magnitude smaller than the factorization, which can be reused. The smaller ranks $k$ of ACR result in solution times per new right-hand side that are smaller than those of HSSMF.

Although having the same asymptotic complexity as other solvers that use general hierarchical matrix representations in their factorizations, such as H-LU, ACR has fundamentally different algorithmic roots which enable a novel alternative for a relevant class of problems, with competitive performance, increasing concurrency as the problem grows, and almost optimal memory requirements.
Moreover, to the best of our knowledge, this is the first distributed-memory implementation of the synergy of cyclic reduction and hierarchical matrices, and it scales up to 8,192 cores on the largest problem sizes considered in our experiments.

ACR has been demonstrated for a regular grid discretization, but a generalization to arbitrary grids is possible and we intend to explore it in the future. Such a generalization requires an ordering of the mesh that produces a sequence of thin, elongated regions (in 2D or 3D) in which every region has only two neighbors, so that the block tridiagonal structure is preserved. Such an ordering might be produced via a breadth-first search traversal of the mesh, as shown in Figure 11 and sketched below. In the unstructured case, the diagonal blocks do not necessarily have the same size, and the off-diagonal blocks may be rectangular. The main algorithmic implication is that each block will have its own hierarchical matrix structure, generated from the geometry of the region it represents. Computationally, however, the structure generation represents a small portion of the overall computation.

In addition, because of the tunable-accuracy characteristics of ACR, there are complexity-accuracy trade-offs that naturally lead to the development of a new scalable preconditioner, which we present in [45].

Figure 11: Partitioning of an unstructured mesh that produces a block tridiagonal matrix structure, for the application of ACR.
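The level-set ordering referred to above can be sketched in a few lines. The following illustration makes our own assumptions about the data structures (an adjacency dictionary for the mesh graph and an arbitrary seed node) and is not the preprocessing used by our implementation.

from collections import deque

def bfs_level_blocks(adjacency, seed):
    # Group mesh nodes into breadth-first-search level sets starting from `seed`.
    # Every mesh edge connects nodes whose levels differ by at most one, so ordering
    # the unknowns level by level yields a block tridiagonal matrix whose diagonal
    # blocks (the levels) may have unequal sizes and whose off-diagonal couplings
    # are rectangular, as discussed in the text.
    level = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    blocks = [[] for _ in range(max(level.values()) + 1)]
    for node, lvl in level.items():
        blocks[lvl].append(node)
    return blocks  # blocks[k] lists the unknowns of the k-th "plane"

# Example: a 3 x 3 structured mesh treated as an unstructured graph, seeded at a corner.
adjacency = {
    (i, j): [(i + di, j + dj) for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
             if 0 <= i + di < 3 and 0 <= j + dj < 3]
    for i in range(3) for j in range(3)
}
print(bfs_level_blocks(adjacency, (0, 0)))  # first levels: [(0,0)], then [(1,0),(0,1)], ...

Each list returned by such an ordering would then receive its own hierarchical matrix structure, built from the geometry of the nodes in that level, consistent with the algorithmic implication noted above.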
7. Acknowledgments
We thank the anonymous reviewers for their detailed comments and suggestions for this manuscript. The authors would also like to thank Ronald Kriemann from the Max Planck Institute for Mathematics in the Sciences for the development and continuous support of HLibPro, Alexander Litvinenko from the King Abdullah University of Science and Technology (KAUST) for the enlightening discussions and advice, and Pieter Ghysels from the Lawrence Berkeley National Laboratory for his recommendations on the use of STRUMPACK. Support from the KAUST Supercomputing Laboratory and access to Shaheen are gratefully acknowledged. The work of all authors was supported by the Extreme Computing Research Center at KAUST.
References

[1] R. W. Hockney, A fast direct solution of Poisson's equation using Fourier analysis, Journal of the ACM 12 (1) (1965) 95–113.
[2] B. L. Buzbee, G. H. Golub, C. W. Nielson, On direct methods for solving Poisson's equations, SIAM Journal on Numerical Analysis 7 (4) (1970) 627–656.
[3] W. Gander, G. H. Golub, Cyclic Reduction: history and applications, Scientific Computing (Hong Kong, 1997) (1997) 73–85.
[4] P. N. Swarztrauber, The methods of Cyclic Reduction, Fourier analysis and the FACR algorithm for the discrete solution of Poisson's equation on a rectangle, SIAM Review 19 (3) (1977) 490–501.
[5] P. Yalamov, V. Pavlov, Stability of the block cyclic reduction, Linear Algebra and its Applications 249 (1-3) (1996) 341–358.
[6] S. Chandrasekaran, P. Dewilde, M. Gu, N. Somasunderam, On the numerical rank of the off-diagonal blocks of Schur complements of discretized elliptic PDEs, SIAM Journal on Matrix Analysis and Applications 31 (5) (2010) 2261–2290.
[7] I. S. Duff, J. K. Reid, The multifrontal solution of indefinite sparse symmetric linear equations, ACM Transactions on Mathematical Software 9 (3) (1983) 302–325.
[8] S. Wang, X. S. Li, F.-H. Rouet, J. Xia, M. V. De Hoop, A parallel geometric multifrontal solver using hierarchically semiseparable structure, ACM Transactions on Mathematical Software 42 (3) (2016) 21:1–21:21.
[9] R. Vandebril, M. Van Barel, G. Golub, N. Mastronardi, A bibliography on semiseparable matrices, Calcolo 42 (3-4) (2005) 249–270.
[10] P. Ghysels, X. S. Li, F.-H. Rouet, S. Williams, A. Napov, An efficient multicore implementation of a novel HSS-structured multifrontal solver using randomized sampling, SIAM Journal on Scientific Computing 38 (5) (2016) S358–S384.
[11] P. G. Martinsson, A fast randomized algorithm for computing a hierarchically semiseparable representation of a matrix, SIAM Journal on Matrix Analysis and Applications 32 (4) (2011) 1251–1274.
[12] J. Xia, Y. Xi, M. Gu, A superfast structured solver for Toeplitz linear systems via randomized sampling, SIAM Journal on Matrix Analysis and Applications 33 (3) (2012) 837–858.
[13] K. L. Ho, L. Ying, Hierarchical interpolative factorization for elliptic operators: differential equations, Communications on Pure and Applied Mathematics.
[14] K. L. Ho, L. Ying, Hierarchical interpolative factorization for elliptic operators: integral equations, Communications on Pure and Applied Mathematics.
[15] Y. Li, L. Ying, Distributed-memory hierarchical interpolative factorization, arXiv preprint arXiv:1607.00346.
[16] P. G. Martinsson, A direct solver for variable coefficient elliptic PDEs discretized via a composite spectral collocation method, Journal of Computational Physics 242 (2013) 460–479.
[17] A. Gillman, P. G. Martinsson, A direct solver with O(N) complexity for variable coefficient elliptic PDEs discretized via a high-order composite spectral collocation method, SIAM Journal on Scientific Computing 36 (4) (2014) A2023–A2046.
[18] S. Hao, P. G. Martinsson, A direct solver for elliptic PDEs in three dimensions based on hierarchical merging of Poincaré-Steklov operators, Journal of Computational and Applied Mathematics 308 (2016) 419–434.
[19] C. Weisbecker, Improving multifrontal solvers by means of algebraic block low-rank representations, Ph.D. thesis, Institut National Polytechnique de Toulouse-INPT (2013).
[20] P. R. Amestoy, C. Ashcraft, O. Boiteau, A. Buttari, J.-Y. L'Excellent, C. Weisbecker, Improving multifrontal methods by means of block low-rank representations, SIAM Journal on Scientific Computing 37 (3) (2015) A1451–A1474.
[21] P. R. Amestoy, I. S. Duff, J.-Y. L'Excellent, J. Koster, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications 23 (1) (2001) 15–41.
[22] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing 32 (2) (2006) 136–156, Parallel Matrix Algorithms and Applications (PMAA'04).
[23] M. Izadi, Hierarchical matrix techniques on massively parallel computers, Ph.D. thesis, Max Planck Institute for Mathematics in the Sciences (2012).
[24] S. Ambikasaran, E. Darve, An O(N log N) fast direct solver for partial Hierarchically Semiseparable matrices, Journal of Scientific Computing 57 (3) (2013) 477–501.
[25] I. Ibragimov, S. Rjasanow, K. Straube, Hierarchical Cholesky decomposition of sparse matrices arising from curl-curl equation, Journal of Numerical Mathematics 15 (1) (2007) 31–57.
[26] L. Grasedyck, R. Kriemann, S. Le Borne, Parallel black box H-LU preconditioning for elliptic boundary value problems, Computing and Visualization in Science 11 (4-6) (2008) 273–291.
[27] H. Pouransari, P. Coulier, E. Darve, Fast hierarchical solvers for sparse matrices using extended sparsification and low-rank approximation, SIAM Journal on Scientific Computing 39 (3) (2017) A797–A830. doi:10.1137/15M1046939.
[28] D. A. Sushnikova, I. V. Oseledets, "Compress and eliminate" solver for symmetric positive definite sparse matrices, arXiv preprint arXiv:1603.09133.
[29] C. Chen, H. Pouransari, S. Rajamanickam, E. Boman, E. Darve, A distributed memory hierarchical solver for sparse matrices, (Personal communication).
[30] W. Hackbusch, A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices, Computing 62 (2) (1999) 89–108.
[31] R. Kriemann, Parallel H-matrix arithmetics on shared memory systems, Computing 74 (3) (2005) 273–297.
[32] R. Kriemann, H-LU factorization on many-core systems, Computing and Visualization in Science 16 (3) (2013) 105–117.
[33] L. N. Trefethen, D. Bau III, Numerical linear algebra, Vol. 50, SIAM, 1997.
[34] W. Hackbusch, Hierarchical matrices: Algorithms and analysis, Vol. 49, Springer, 2015.
[35] R. Guivarch, L. Giraud, J. Stein, Parallel distributed fast 3D Poisson solver for meso-scale atmospheric simulations, International Journal of High Performance Computing Applications 15 (1) (2001) 36–46.
[36] B. Engquist, L. Ying, Sweeping preconditioner for the Helmholtz equation: hierarchical matrix representation, Communications on Pure and Applied Mathematics 64 (5) (2011) 697–735.
[37] L. Grasedyck, W. Hackbusch, R. Kriemann, Performance of H-LU preconditioning for sparse matrices, Computational Methods in Applied Mathematics 8 (4) (2008) 336–349.
[38] A. Kalinkin, A. Anders, R. Anders, et al., Schur complement computations in Intel® Math Kernel Library PARDISO, Applied Mathematics 6 (02) (2015) 304.
[39] W. Briggs, V. Henson, S. McCormick, A Multigrid Tutorial, 2nd Edition, Society for Industrial and Applied Mathematics, 2000.
[40] R. D. Falgout, U. M. Yang, hypre: A Library of High Performance Preconditioners, Springer Berlin Heidelberg, Berlin, Heidelberg, 2002, pp. 632–641.
[41] S. Browne, J. Dongarra, N. Garner, G. Ho, P. Mucci, A portable programming interface for performance evaluation on modern processors, International Journal of High Performance Computing Applications 14 (3) (2000) 189–204.
[42] W. F. Ames, Numerical methods for partial differential equations, Academic Press, 2014.
[43] M. M. Gupta, J. Zhang, High accuracy multigrid solution of the 3D convection-diffusion equation, Applied Mathematics and Computation 113 (2) (2000) 249–274.
[44] O. G. Ernst, M. J. Gander, Why it is difficult to solve Helmholtz problems with classical iterative methods, in: Numerical Analysis of Multiscale Problems, Springer, 2012, pp. 325–363.
[45] G. Chávez, G. Turkiyyah, S. Zampini, D. Keyes, Parallel accelerated cyclic reduction preconditioner for three-dimensional elliptic PDEs with variable coefficients, Journal of Computational and Applied Mathematics. doi:10.1016/j.cam.2017.11.035.