Distributed-memory H-matrix Algebra I: Data Distribution and Matrix-vector Multiplication

Yingzhou Li*, Jack Poulson and Lexing Ying

Department of Mathematics, Duke University, Durham, NC 27708, USA.
Hodge Star, Toronto, Canada.
Department of Mathematics and ICME, Stanford University, Stanford, CA 94305, USA.
Abstract.
We introduce a data distribution scheme for H-matrices and a distributed-memory algorithm for H-matrix-vector multiplication. Our data distribution scheme avoids an expensive Ω(P) scheduling procedure used in previous work, where P is the number of processes, while the data remain well balanced. Based on this data distribution, our distributed-memory algorithm evenly distributes all computations among the P processes and adopts a novel tree-communication algorithm to reduce the latency cost. The overall complexity of our algorithm is O(N log N / P + α log P + β log² P) for H-matrices under the weak admissibility condition, where N is the matrix size, α denotes the latency, and β denotes the inverse bandwidth. Numerically, the algorithm is applied to both two- and three-dimensional problems of various sizes on various numbers of processes. Good parallel efficiency is still observed on thousands of processes.

AMS subject classifications: 65F99, 65Y05
Key words: Parallel fast algorithm, H-matrix, distributed-memory, parallel computing

1 Introduction

For linear elliptic partial differential equations, the blocks of both the forward and the backward operators, when restricted to non-overlapping domains, are numerically low-rank [7]. Hence both operators can be represented in a data-sparse form. Many fast algorithms benefit from this low-rank property and apply these operators with quasi-linear scaling. Such fast algorithms include, but are not limited to, tree code [4, 41], the fast multipole method (FMM) [3, 11, 18, 19, 20, 39, 42, 50], the panel clustering method [21], etc.

∗ Corresponding author.
Email address: [email protected] (Y. Li), [email protected] (J. Poulson), [email protected] (L. Ying)

The low-rank structures in these fast algorithms are revealed via various interpolation techniques, such as pole expansion, Chebyshev interpolation, equivalent interaction, etc. [15, 19, 39, 50]. In contrast to approximating the application of operators, another group of research focuses on approximating operators directly in compressed matrix form. As one of the earliest members of this group, the H-matrix [5, 6, 7, 17, 21, 22, 26, 27, 28] hierarchically compresses operators restricted to far-range interactions by low-rank matrices. The memory cost and the matrix-vector multiplication complexity are quasi-linear with respect to the number of degrees of freedom (DOFs) in the problem. Shortly after introducing the H-matrix, Hackbusch et al. [23] introduced the H²-matrix, which uses nested low-rank bases to further reduce the memory cost and multiplication complexity down to linear. Related to the fast algorithms above, the H-matrix and the H²-matrix can be viewed as algebraic versions of tree code and FMM, respectively, but they are more flexible in the choice of admissibility conditions and low-rank compression techniques, which reflects the general advantages of algebraic representations.

Developments in the H-matrix family and extensions beyond it have been explored over the past decade. The hierarchical off-diagonal low-rank matrix (HODLR) [2] and the hierarchical semi-separable matrix (HSS) [48] are two popular hierarchical matrices with the simplest admissibility condition, i.e., the weak admissibility condition. Different from hierarchical matrices, the recursive skeletonization factorization (RS) [37] and the hierarchical interpolative factorization (HIF) [24, 25] introduce separators in the domain partition and compress the operator as a product of sparse matrices. The partition and factorization in RS and HIF are in a similar spirit to those of the multifrontal method [1, 12] and SuperLU [30], while extra low-rank approximations are introduced to compress the interactions within fronts. Other algebraic representations include the block low-rank approximation [49], the block basis factorization [45], etc. The benefits of algebraic representations over analytical fast algorithms are twofold: 1) numerical low-rank approximation is more effective than interpolation; 2) matrix factorization and inversion become feasible. We emphasize that these algebraic representations are valid not only for linear elliptic operators, but also for operators associated with low-to-medium frequency Helmholtz equations and radial basis function kernel matrices. When operators have high-frequency character, the low-rank structure appears in a very different way compared with all the aforementioned fast algorithms, and this setting has also been well studied by the community [8, 9, 13, 14, 31, 32, 33, 34, 38].

Many of these fast algorithms and algebraic representations have been parallelized in either shared-memory or distributed-memory settings to make them applicable to practical problems of interest [10, 16, 18, 35, 40, 41, 42, 43, 44, 46, 47, 50]. Here we focus on the parallelization of the H-matrix. Kriemann [28, 29] implemented a shared-memory parallel H-matrix using a block-wise distribution, i.e., each block is assigned to a single process. Processes assigned to blocks near the root level are responsible for computations of complexity linear in N, where N is the total number of DOFs.
Hence the speedup of such a parallelization scheme is theoretically upper bounded by O(log N) and limited in practice to 16 processes. Izadi [26, 27] published detailed algorithms for H-matrix addition, matrix-vector multiplication, matrix-matrix multiplication, and matrix inversion in the distributed-memory setting. In [26, 27], the data of the H-matrix are evenly distributed among all processes according to their global matrix indices, which is similar to our data distribution for one-dimensional problems with uniform discretization but different from ours for other setups. The computations in [26, 27] are distributed under task-based parallelization, whose scheduling part costs Ω(P) operations on P processes. According to the numerical results therein, good parallel efficiency is again limited to 16 processes.

In this paper, we first propose a balanced data distribution scheme for H-matrices based on the underlying domain geometry†. In an H-matrix, the domain is usually hierarchically partitioned and then organized in a domain tree structure. In order to avoid any expensive scheduling procedure, our processes are also organized in a tree structure in correspondence with the hierarchical domain partition. Each process then owns a unique piece of the domain and also owns the associated data in the H-matrix. Following such a data distribution, all data in the H-matrix are evenly distributed among all processes. For an H-matrix of size N distributed on P processes, the memory cost is O(N log N / P) on each process. Our data distribution scheme is scalable up to P = O(N) processes.

Building on top of our data distribution, a distributed-memory parallel algorithm is proposed to conduct the H-matrix-vector multiplication. Our parallel algorithm consists of several parts: a computation part, three consecutive communication parts, and another computation part. When the input and output vectors are distributed according to the tree structure of the processes, both computation parts are communication-free. A novel data communication scheme, the tree-communication, is introduced to significantly reduce the costs of two of the communication parts. The remaining communication part consists of a constant number of point-to-point communications on each process. Mainly due to the process organization and the tree-communication scheme, the expensive scheduling procedure is avoided entirely throughout our algorithm. The overall computational and communication complexities, then, are O(N log N / P) and O(α log P + β(log² P + log(N/P) + (N/P)^((d−1)/d)))‡, respectively, where d is the dimension of the problem, α denotes the message latency, and β denotes the inverse bandwidth.

Finally, the parallel algorithm is applied to two-dimensional and three-dimensional problems of sizes varying from a few thousand to a quarter billion on massive numbers of processes. The parallel scaling is found to be near-ideal on the computational resources available to us, up to a few thousand processes. In all cases, our H-matrix-vector multiplications are completed within a few seconds.

† When the domain geometry of the problem is not available and only the graph connectivity of the problem is known, our data distribution scheme can be extended to use the hierarchical partition of the graph instead.
‡ This is the complexity for H-matrices under standard admissibility conditions and an upper bound for H-matrices under the weak admissibility condition.
The rest of the paper is organized as follows. In Section 2, we revisit the H-matrix together with admissibility conditions. Section 3 introduces our balanced data distribution scheme. The distributed-memory H-matrix-vector multiplication algorithm is detailed in Section 4. Section 5 presents numerical results for two-dimensional and three-dimensional problems of various sizes. Finally, we conclude the paper in Section 6, together with some discussion of future work.

2 H-matrix

In this section, we first review the definition and the structure of the H-matrix. The H-matrix-vector multiplication then follows in a straightforward way.

Let us assume that K(t,s) is a kernel satisfying the hierarchical low-rank property as in tree code or the H-matrix. Applying the Nyström discretization to the integral equation,

u(t) = ∫_Ω K(t,s) f(s) ds,  for t ∈ Ω,   (2.1)

results in a matrix-vector multiplication, and the matrix therein can be approximated by an H-matrix. Throughout the rest of the paper, we use the concept of a domain and the Nyström discretization points in the domain interchangeably. For example, a matrix restricted to Ω₁ × Ω₂ means the matrix restricted to the row and column indices corresponding to the discretization points in Ω₁ and Ω₂, respectively. In (2.1), the operator maps from the domain Ω to itself. In practice, the H-matrix can also be used to approximate operators mapping from one domain to another, and the rest of the paper extends to such a setting with a minor update of the domain notation. To simplify our presentation, we limit ourselves to the self-mapping case.

In the above setting, the structure of the H-matrix fundamentally relies on the hierarchical partition of the domain Ω, which is defined as follows.

Definition 2.1 (Domain tree). A tree T_Ω = (V_Ω, E_Ω) with vertex set V_Ω and edge set E_Ω is called a domain tree of Ω if the following conditions hold:
1. all nodes in T_Ω are subdomains of Ω;
2. the set of children of a domain ω ∈ V_Ω, denoted as C(ω) = {ν ∈ V_Ω | ∃(ω,ν) ∈ E_Ω}, is either empty or a partition of ω;
3. Ω ∈ V_Ω is the root of T_Ω.

Figure 1: Hierarchical partition of an ideal one-dimensional domain (left) and two-dimensional domain (right). Gray lines indicate connections between parent domains and their child subdomains.

When a (quasi-)uniform discretization of a regular d-dimensional domain Ω = [0,1]^d is considered, the domain tree is constructed by applying a 2^d uniform partition recursively. Such domains are later referred to as ideal d-dimensional domains. Figure 1 illustrates the domain trees associated with an ideal one-dimensional domain and an ideal two-dimensional domain.

The low-rank submatrices in an H-matrix are determined by admissibility conditions. There are many different admissibility conditions, leading to different H-matrix structures. Here we introduce two of them: the weak admissibility condition and the standard admissibility condition.

Definition 2.2 (Weak admissibility condition). Two domains, ω and ν, are weakly admissible if ω ∩ ν = ∅.

Definition 2.3 (Standard admissibility condition).
Two domains, ω and ν, are standard admissible if

min(diam(ω), diam(ν)) ≤ ρ dist(ω, ν),   (2.2)

where diam(ω) is the diameter of ω, dist(ω, ν) is the distance between the two domains, and ρ is a constant adjusting the size of the buffer zone.

The weak admissibility condition is the simplest admissibility condition used in practice and leads to the simplest H-matrix structure. The standard admissibility condition is more complicated but widely used in many fast algorithms [4, 19]. Importantly, for linear elliptic differential operators with L∞ coefficients discretized by a local basis set, both the forward differential operator and its inverse can be well approximated by H-matrices under the standard admissibility condition. Throughout this paper, we adopt ρ ≡ √d.
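Both conditions amount to simple geometric tests. The following is a minimal sketch, assuming axis-aligned bounding boxes (lower corner, upper corner) for the geometric quantities and index sets for the disjointness test; the function names and representations are our own illustration, not part of the paper.

```python
# A minimal sketch of the two admissibility tests, assuming axis-aligned
# boxes given as (lower, upper) numpy arrays and rho = sqrt(d) as in the text.
import numpy as np

def diam(box):
    lo, hi = box
    return np.linalg.norm(hi - lo)

def dist(box_a, box_b):
    (alo, ahi), (blo, bhi) = box_a, box_b
    gap = np.maximum(0.0, np.maximum(blo - ahi, alo - bhi))  # 0 where boxes overlap
    return np.linalg.norm(gap)

def weakly_admissible(idx_a, idx_b):
    # Definition 2.2: the two domains are disjoint.  In a hierarchical partition
    # any two distinct cells are disjoint, so we test the index sets directly.
    return len(set(idx_a) & set(idx_b)) == 0

def standard_admissible(box_a, box_b, rho=None):
    # Definition 2.3: min(diam(omega), diam(nu)) <= rho * dist(omega, nu).
    d = len(box_a[0])
    rho = np.sqrt(d) if rho is None else rho
    return min(diam(box_a), diam(box_b)) <= rho * dist(box_a, box_b)

# Two touching cells of a partition: weakly admissible, but not standard
# admissible, since their distance vanishes.
a_box = (np.array([0.0, 0.0]), np.array([0.5, 0.5]))
b_box = (np.array([0.5, 0.0]), np.array([1.0, 0.5]))
print(weakly_admissible([0, 1], [2, 3]), standard_admissible(a_box, b_box))
```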
Figure 2: The weak admissibility condition and the corresponding H-matrices for ideal one-dimensional (left) and two-dimensional (right) domains. The first row shows the domain partitions, with gray blocks being the domains that are non-admissible to ω. The second row shows the corresponding H-matrices, with red submatrices being dense and white ones being low-rank.

Figure 2 and Figure 3 show the weak admissibility condition and the standard admissibility condition, respectively, for both ideal one-dimensional and two-dimensional domains. One more popular admissibility condition, known as the strong admissibility condition [6, 37], simply replaces the "min" in (2.2) by "max". The strong admissibility condition and the standard admissibility condition coincide on ideal domains.

Based on the domain tree and the admissibility conditions, we are ready to precisely define the H-matrix as an approximation of K mapping from Ω_s = Ω to Ω_t = Ω, i.e., an approximation of K_{Ω_t×Ω_s} = K|_{Ω_t×Ω_s}. The domains Ω_t and Ω_s are called the target domain and the source domain, respectively, where the target domain is associated with the row indices and the source domain with the column indices.

Definition 2.4 (H-matrix). Assume that K maps vectors defined on the source domain Ω_s to vectors defined on the target domain Ω_t. K is an H-matrix with rank r and domain trees T_{Ω_t} and T_{Ω_s} if the following conditions hold, in order, for each child subdomain pair of Ω_t × Ω_s, i.e., ω_t × ω_s ∈ C(Ω_t) × C(Ω_s):
1. if C(ω_t) = ∅ or C(ω_s) = ∅, then K_{ω_t×ω_s} is a dense matrix D_{ω_t×ω_s}; else
2. if ω_t and ω_s are admissible, then K_{ω_t×ω_s} is a low-rank matrix with rank r, i.e., K_{ω_t×ω_s} = U_{ω_t×ω_s} V_{ω_t×ω_s}^⊤ with U_{ω_t×ω_s} ∈ R^{|ω_t|×r} and V_{ω_t×ω_s} ∈ R^{|ω_s|×r}, where |·| denotes the number of DOFs in the domain; otherwise
3. K_{ω_t×ω_s} is an H-matrix with rank r and domain trees T_{ω_t} and T_{ω_s}.

Figure 3: The standard admissibility condition and the corresponding H-matrices for ideal one-dimensional (left) and two-dimensional (right) domains.

In the above definition, the three conditions must be checked in the given order. The third condition defines the hierarchical structure of the H-matrix.

In order to further clarify the definition, we walk through the ideal two-dimensional domain case under the weak admissibility condition. We begin with K mapping from Ω_s = Ω = [0,1]² to Ω_t = Ω = [0,1]². There are 16 child subdomain pairs in C(Ω_t) × C(Ω_s), and all child domains have children of their own, so the first condition in Definition 2.4 fails for all pairs. We then check the weak admissibility condition. Twelve of the 16 pairs are admissible, i.e., all non-overlapping domain pairs on the first level as in Figure 1. Therefore, 12 off-diagonal submatrices are low-rank, which are denoted by the large white blocks in Figure 2 (right). The remaining 4 child subdomain pairs are H-matrices of their own. We continue this process until the leaf level of the domain tree and resolve the entire H-matrix as in Figure 2 (right). The H-matrix under the standard admissibility condition is much more complicated. Figure 3 depicts the H-matrices with the same domain and domain tree as in Figure 2 but under the standard admissibility condition instead.

The H-matrix-vector multiplication can be carried out efficiently as long as we can read from the input vector and write to the output vector restricted to subdomains. We denote the H-matrix as K, mapping from Ω to Ω, and write the H-matrix-vector multiplication as
y = K x,

where both x and y are vectors defined on Ω. We first initialize the output vector y as a zero vector. Then, we traverse all submatrices in K that contain data, i.e., dense submatrices and low-rank submatrices. For any such submatrix, denoted K_{ω_t×ω_s}, we conduct the matrix-vector multiplication and add the result to the output vector,

y_{ω_t} = y_{ω_t} + K_{ω_t×ω_s} x_{ω_s} = \begin{cases} y_{ω_t} + D_{ω_t×ω_s} x_{ω_s}, & \text{dense submatrix}, \\ y_{ω_t} + U_{ω_t×ω_s}(V_{ω_t×ω_s}^⊤ x_{ω_s}), & \text{low-rank submatrix}, \end{cases}   (2.3)

where y_{ω_t} and x_{ω_s} denote the vectors restricted to the domains ω_t and ω_s, respectively. When (2.3) has been applied to all data-carrying submatrices in K, the vector y is the final H-matrix-vector multiplication result.

As shown in many previous works [6, 21], for a regular domain with almost uniformly distributed discretization points, the memory cost of the H-matrix is O(r N log N), where N is the total number of DOFs and r is the numerical rank. The H-matrix-vector multiplication can be carried out in O(r N log N) operations. Here we have mainly reviewed the structure and the matrix-vector multiplication of the H-matrix. The construction algorithms for H-matrices [6, 36], as well as other algebraic operations such as matrix-matrix multiplication, matrix factorization, etc., have been studied extensively in the literature; they are beyond the scope of this paper and we omit a detailed discussion.
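The loop behind (2.3) can be written as the following minimal sketch, in which the H-matrix is flattened into a list of data-carrying blocks; the block records and names are our own illustration, not the paper's implementation.

```python
# A minimal sketch of the sequential H-matrix-vector multiplication (2.3).
# Each data-carrying block stores its global row/column index sets and either
# a dense matrix D or low-rank factors (U, V).
import numpy as np

def h_matvec(blocks, x):
    """y = K x for an H-matrix given as a flat list of dense/low-rank blocks."""
    y = np.zeros_like(x)
    for blk in blocks:
        rows, cols = blk["rows"], blk["cols"]
        if blk["type"] == "dense":
            y[rows] += blk["D"] @ x[cols]
        else:  # low-rank block: U V^T restricted to rows x cols
            y[rows] += blk["U"] @ (blk["V"].T @ x[cols])
    return y

# Tiny 4x4 example with a 2x2 block partition: dense diagonal blocks and
# rank-1 off-diagonal blocks (the weak-admissibility pattern of Figure 2).
rng = np.random.default_rng(0)
I0, I1 = np.arange(0, 2), np.arange(2, 4)
blocks = [
    {"type": "dense", "rows": I0, "cols": I0, "D": rng.standard_normal((2, 2))},
    {"type": "dense", "rows": I1, "cols": I1, "D": rng.standard_normal((2, 2))},
    {"type": "lowrank", "rows": I0, "cols": I1,
     "U": rng.standard_normal((2, 1)), "V": rng.standard_normal((2, 1))},
    {"type": "lowrank", "rows": I1, "cols": I0,
     "U": rng.standard_normal((2, 1)), "V": rng.standard_normal((2, 1))},
]
x = rng.standard_normal(4)
print(h_matvec(blocks, x))
```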
3 H-matrix Data Distribution

The organization of the processes and the associated data distribution avoid the expensive parallel scheduling procedure of [26, 27]. In Section 3.1, we first explain our hierarchical organization of processes. Then, in Section 3.2, the data distribution together with the load balancing is discussed.

3.1 Process Organization

Processes are organized in correspondence with the domain tree T_Ω. The main idea is to assign subdomains to processes as evenly as possible while preserving the hierarchical structure.

Let the P processes be indexed from 0 to P − 1,
and let P be upper bounded by the number of leaf nodes in T_Ω§. The set of all processes, denoted P = {0, 1, . . . , P − 1}, is called the process group. We then traverse the domain tree to assign the process group, subgroups, or individual processes to the nodes of T_Ω. To the root node of T_Ω, i.e., the domain Ω, we assign the entire process group P. From now on, we consider a general domain Ω^ℓ in T_Ω at level ℓ with a general process group P^ℓ assigned to it. The assignment of subgroups of P^ℓ to the child subdomains of Ω^ℓ obeys the following conditions (a recursive sketch of this assignment is given at the end of this subsection):

1) If the number of child subdomains of Ω^ℓ is smaller than or equal to the number of processes in P^ℓ, i.e., |C(Ω^ℓ)| ≤ |P^ℓ|, then P^ℓ is partitioned into |C(Ω^ℓ)| subgroups such that the number of processes in each subgroup is proportional to the number of DOFs in the corresponding child subdomain. Each subgroup is then assigned to the corresponding child subdomain.

2) If the number of child subdomains of Ω^ℓ is larger than the number of processes in P^ℓ, i.e., |C(Ω^ℓ)| > |P^ℓ|, then C(Ω^ℓ) is organized into |P^ℓ| parts such that the total numbers of DOFs in the parts are balanced. Each process is then assigned to the subdomains of one part.

According to this process organization strategy, a process participates in one and only one process group at each level. When a single process is assigned to a subdomain ω in T_Ω, it is assigned to all descendants of ω in T_Ω. We can then combine all subdomains that are singly owned by a process p and denote the union as ω_p. All such unions {ω_p}_{p=0}^{P−1} form a balanced partition of Ω, where the balancing factor is upper bounded by twice the balancing factor of T_Ω. The balancing factor here refers to the ratio of the heaviest workload to the lightest workload among all processes. Further, in each process group or subgroup, the process with the smallest index is called the group leader; e.g., 0 is the group leader of P. When a process is the group leader at a level ℓ, it is the group leader in all descendant groups in which it participates. For example, process 0 is the group leader of all process groups it participates in, and its workload is the heaviest among all processes.

Figure 4 (top) shows a four-level domain tree together with its process assignment for P = {0, 1, . . . , 7} and an ideal one-dimensional domain. Each process is assigned to a unique subdomain at level 3. In addition, Figure 5 (top) shows a three-level domain tree together with its process assignment for P = {0, 1, . . . , 7} and an ideal two-dimensional domain. Each process owns two subdomains at level 2, and the eight processes form a perfect partition of the domain.

Figure 4: H-matrix data distribution of an ideal one-dimensional domain with the weak and standard admissibility conditions on 8 processes. The top part is the four-level domain tree together with its process assignment; red processes are group leaders. The middle and bottom parts are the distributed H-matrix under the weak and standard admissibility conditions, respectively. The left columns are the H-matrix data owned by the target process groups and the right columns are those owned by the source process groups. Blue blocks indicate data owned by process 5, whereas light red and yellow blocks indicate data owned by other processes.

Figure 5: H-matrix data distribution of an ideal two-dimensional domain with the weak admissibility condition on 8 processes. The top part is the three-level domain tree together with its process assignment; red processes are group leaders. The bottom parts are the data in the H-matrix owned by the target process groups (left) and the source process groups (right). Blue blocks indicate data owned by process 2, whereas light red and yellow blocks indicate data owned by other processes.

§ Having more processes than the number of leaf nodes (O(N)) is feasible if the later algorithm description is slightly modified. However, such a setup is not of practical use, and we omit the details.
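The two assignment rules above translate into a short recursion over the domain tree. The following is a minimal sketch under simple illustrative choices (a greedy proportional split for rule 1 and a greedy balanced grouping for rule 2); the data structures and names are our own, not the paper's implementation.

```python
# A minimal sketch of the recursive process assignment of Section 3.1.
# A domain-tree node is a dict {"dofs": int, "children": [...]}; a process
# group is a list of consecutive ranks.
def assign(node, procs, table=None):
    table = {} if table is None else table
    table[id(node)] = list(procs)            # this group owns this subdomain
    kids = node.get("children", [])
    if not kids:
        return table
    if len(kids) <= len(procs):
        # Rule 1: split the group, sizes roughly proportional to child DOFs.
        total = sum(k["dofs"] for k in kids)
        start, remaining = 0, len(procs)
        for i, k in enumerate(kids):
            if i == len(kids) - 1:
                size = remaining
            else:
                size = max(1, round(len(procs) * k["dofs"] / total))
                size = min(size, remaining - (len(kids) - 1 - i))  # leave >= 1 each
            assign(k, procs[start:start + size], table)
            start, remaining = start + size, remaining - size
    else:
        # Rule 2: group the children into len(procs) parts with balanced DOFs
        # (greedy, in order); a single process owns every node in its part.
        target = sum(k["dofs"] for k in kids) / len(procs)
        part, load = 0, 0.0
        for k in kids:
            if load >= target * (part + 1) and part < len(procs) - 1:
                part += 1
            load += k["dofs"]
            assign(k, procs[part:part + 1], table)
    return table

# Example: a two-level binary tree with 8 + 8 DOFs split over 4 processes.
leaves = [{"dofs": 4} for _ in range(4)]
tree = {"dofs": 16, "children": [{"dofs": 8, "children": leaves[:2]},
                                 {"dofs": 8, "children": leaves[2:]}]}
groups = assign(tree, list(range(4)))
print(groups[id(tree)], groups[id(tree["children"][0])], groups[id(leaves[0])])
```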
3.2 Data Distribution

Definition 2.4 explicitly shows that all data (matrix entries) reside in two types of submatrices, either dense submatrices or low-rank submatrices. The hierarchical submatrices defined by the third condition of Definition 2.4 exist only virtually, for recursion purposes. Hence, we just need to distribute the dense submatrices and the low-rank submatrices among the processes.

Low-rank submatrix.
Consider a low-rank submatrix associated with the domain pair Ω_t × Ω_s, where Ω_t and Ω_s are the target and source domains, respectively. According to the process organization defined in Section 3.1, there are two process groups assigned to Ω_t and Ω_s, denoted P_t and P_s, respectively. We distribute the two factors of the low-rank submatrix, U_{Ω_t×Ω_s} and V_{Ω_t×Ω_s}, to the two process groups, i.e., U_{Ω_t×Ω_s} is stored among P_t and V_{Ω_t×Ω_s} is stored among P_s. If either P_t or P_s has only one process, then that process owns the corresponding factor entirely. Now, assume there is more than one process in P_t. Since U_{Ω_t×Ω_s} is a tall and skinny matrix, it is distributed in a block-row fashion: for each process p ∈ P_t, the rows corresponding to ω_p are owned by process p, where ω_p is the singly owned subdomain of p. If there is more than one process in P_s, then V_{Ω_t×Ω_s} is distributed in the same way among the processes in P_s.

Dense submatrix.
Consider a dense submatrix associated with the domain pair Ω_t × Ω_s and the corresponding process group pair P_t × P_s. There are three scenarios for the sizes of the process groups: (i) |P_t| = |P_s| = 1; (ii) |P_t| = |P_s| > 1; (iii) |P_t| > |P_s| = 1. In the first scenario, the dense matrix D_{Ω_t×Ω_s} is owned by P_s. In the second scenario, the transpose of the dense matrix, D_{Ω_t×Ω_s}^⊤, is distributed among P_s in the same way as the distribution of V_{Ω_t×Ω_s} above. In the last scenario, the dense matrix D_{Ω_t×Ω_s} is distributed among P_t in the same way as the distribution of U_{Ω_t×Ω_s} above.

Once these data distribution strategies are applied to all submatrices, the H-matrix is fully distributed among P. To further facilitate the understanding of the overall data distribution, Figure 4 and Figure 5 show the data in the H-matrices for an ideal one-dimensional domain with the weak and standard admissibility conditions and for an ideal two-dimensional domain with the weak admissibility condition, respectively. Both H-matrices are distributed among the process group P of size eight. In Figure 4 and Figure 5, blue blocks highlight the data owned by process 5 and process 2, respectively.
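For the low-rank factors, the block-row ownership rule can be summarized by a small helper. The following is a minimal sketch; the mapping subdomain_rows and all names are our own illustrative assumptions, not the paper's code.

```python
# A minimal sketch of the block-row distribution of a low-rank block in
# Section 3.2.  subdomain_rows[p] maps a process p to the row indices of its
# singly owned subdomain omega_p within this block.
import numpy as np

def my_low_rank_pieces(rank, U, V, target_group, source_group, subdomain_rows):
    """Return the locally stored pieces of a low-rank block U V^T."""
    piece = {}
    if rank in target_group:          # U is stored in block rows over P_t
        rows = subdomain_rows[rank] if len(target_group) > 1 else slice(None)
        piece["U"] = U[rows, :]
    if rank in source_group:          # V is stored in block rows over P_s
        rows = subdomain_rows[rank] if len(source_group) > 1 else slice(None)
        piece["V"] = V[rows, :]
    return piece

# Example: a rank-2 block shared between target group {0,1} and source group
# {2,3}, with 4 rows/columns split evenly among the group members.
U = np.arange(8.0).reshape(4, 2)
V = np.arange(8.0).reshape(4, 2)
rows = {0: np.arange(0, 2), 1: np.arange(2, 4), 2: np.arange(0, 2), 3: np.arange(2, 4)}
print(my_low_rank_pieces(1, U, V, target_group=[0, 1], source_group=[2, 3],
                         subdomain_rows=rows)["U"])
```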
Remark 3.1. The data distribution strategies introduced here are suitable and efficient for a range of parallel-friendly H-matrix algebraic operations, e.g., matrix-vector multiplication, matrix-matrix multiplication, matrix compression, matrix addition, etc. Some other H-matrix algebraic operations, such as H-matrix LU factorization and H-matrix inversion, are not parallel-friendly, since their internal operations depend sequentially on each other. Our data distribution strategies work for these operations as well, while their efficiency remains to be explored.
Remark 3.2. These data distribution strategies can also be easily extended to the H²-matrix. The nested bases of the H²-matrix can be distributed among all processes in a similar way as the low-rank factors above, while the small middle matrix of each low-rank block of the H²-matrix could be singly owned by either its source or its target group leader. Given such distribution strategies for the H²-matrix, all of its algebraic operations can be parallelized in an analogous way as for the H-matrix.

We now discuss the load balancing of the distributed H-matrix. As shown in Figure 4 and Figure 5, the load balancing differs for different admissibility conditions: Figure 5, under the weak admissibility condition, shows ideal load balancing, whereas Figure 4, under the standard admissibility condition, shows a slightly unbalanced data distribution. In the following, we assume the domain is an ideal d-dimensional domain, [0,1]^d, with n uniform discretization points in each dimension and N = n^d discretization points in total. In such an ideal case, each process owns subdomains of the same size.

Assume first that the weak admissibility condition is applied. At each level of T_Ω, any domain Ω^ℓ has the same number of admissible domains. Each process participates in one domain on the target side and another on the source side. Hence all processes own exactly the same amount of data in low-rank submatrices at each level, and summed over all levels the low-rank data are evenly distributed among all processes. Regarding the dense submatrices, they are all of the same size and owned by their source processes. Since all processes own source-side domains of the same size, and these domains carry the same number of dense submatrices, all processes own the same amount of dense-submatrix data. Overall, the data of the dense and low-rank submatrices are evenly distributed among all processes, and the load balancing in this case is ideal.

When the standard admissibility condition is applied, the load balancing depends on the boundary condition of the problem. If a periodic boundary condition is adopted, the load balancing is still ideal. If a non-periodic boundary condition is adopted, the data loads differ between processes owning domains near the center and processes owning domains near the corners. Since all low-rank submatrices are evenly owned by the processes of their process groups, the load balancing factor is simply the ratio of the numbers of low-rank submatrices for different processes, i.e., the numbers of admissible domains. Consider a level ℓ that is neither one of the first two levels nor the last one. The parent domain of a center subdomain Ω^ℓ_center has 3^d non-admissible neighbor domains, each of which is partitioned into 2^d subdomains at level ℓ. Excluding the non-admissible subdomains of Ω^ℓ_center, there are 3^d·2^d − 3^d admissible subdomains of Ω^ℓ_center. A corner subdomain Ω^ℓ_corner, however, has a parent domain that is also a corner domain, with 2^d non-admissible neighbor domains. By a similar calculation, Ω^ℓ_corner has 2^d·2^d − 2^d admissible subdomains at level ℓ. Hence the load balancing factor is (3/2)^d. The same factor also holds for the load balancing of the dense submatrices. Overall, asymptotically as N goes to infinity, the load balancing factor of the distributed H-matrix under the standard admissibility condition and a non-periodic boundary condition is upper bounded by (3/2)^d. Since this factor is independent of both N and P, we still regard our data distribution in this case as balanced.
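A quick numerical check of this ratio, using the admissible-subdomain counts as reconstructed above, confirms that the center-to-corner ratio is exactly (3/2)^d for any dimension.

```python
# Center vs. corner admissible-subdomain counts per level and their ratio,
# under the counting stated in the text (a sanity check, not new analysis).
for d in (1, 2, 3):
    center = 3**d * 2**d - 3**d
    corner = 2**d * 2**d - 2**d
    print(d, center / corner, (3 / 2)**d)   # the two columns agree
```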
4 H-matrix-vector Multiplication

H-matrix-vector multiplication is the fundamental operation in H-matrix algebra and reveals the value of the H-matrix as a fast algorithm. It is also one of the basic operations involved in other H-matrix algebraic operations, including matrix-matrix multiplication, matrix compression, matrix factorization, and matrix inversion. As briefly reviewed in Section 2, the sequential H-matrix-vector multiplication is as simple as looping over all low-rank and dense submatrices, multiplying each submatrix by the input vector restricted to the source domain, and adding the result to the output vector restricted to the target domain. The distributed-memory version, however, is considerably more involved. Based on the data distribution of Section 3, we present the distributed-memory H-matrix-vector multiplication algorithm in this section, followed by its complexity analysis.

4.1 Algorithm

The distributed-memory H-matrix-vector multiplication algorithm consists of the following five steps:

Step 1. Source-side local computation;
Step 2. Tree-reduction on the source process tree;
Step 3. Data transfer from source to target;
Step 4. Tree-broadcast on the target process tree;
Step 5. Target-side local computation.

Among these five steps, Steps 1 and 5 involve only computations and are communication-free, whereas Steps 2, 3, and 4 focus on efficient communication under our data distribution and process organization. We elaborate on the five steps one by one. Throughout the following description, we assume the input vector x is already distributed in a block-row fashion among the process group P. More precisely, any process p ∈ P owns x_{ω_p} = x|_{ω_p}, where ω_p is p's singly owned domain. The output vector y will be distributed in exactly the same way as x.

Step 1: source-side local computation. The source-side local computation goes through all submatrices containing data, i.e., low-rank submatrices and dense submatrices, and conducts all communication-free calculations. We now describe the specific operations for submatrices of the two types.
Low-rank submatrix. Consider a low-rank submatrix associated with Ω_t × Ω_s and process groups P_t × P_s. The explicit block forms of V_{Ω_t×Ω_s} and x_{Ω_s} are

V_{Ω_t×Ω_s} = \begin{pmatrix} v_{p_0} \\ \vdots \\ v_{p_{|P_s|-1}} \end{pmatrix},  x_{Ω_s} = \begin{pmatrix} x_{p_0} \\ \vdots \\ x_{p_{|P_s|-1}} \end{pmatrix},   (4.1)

where p_i ∈ P_s, and v_{p_i} and x_{p_i} are stored on process p_i. We aim to compute the product of V_{Ω_t×Ω_s}^⊤ and x_{Ω_s} as

V_{Ω_t×Ω_s}^⊤ x_{Ω_s} = Σ_{i=0}^{|P_s|−1} v_{p_i}^⊤ x_{p_i},   (4.2)

where the summation over i requires communication, since the terms v_{p_i}^⊤ x_{p_i} are owned by different processes for different i. Hence, in this step, we only compute

z_local = v_{p_i}^⊤ x_{p_i}   (4.3)

on process p_i, without conducting any communication. The communication for the summation over i in (4.2) is postponed to the next step.

Dense submatrix. Consider a dense submatrix associated with Ω_t × Ω_s and P_t × P_s. When there is more than one process in the target process group, i.e., |P_t| > 1, the data of this submatrix are owned by the target process group. No local computation is needed, and we assign z_local = x_{Ω_s} for the later communications. When there is only one process in the target process group, i.e., |P_t| = 1, the data are distributed among the source process group as

D_{Ω_t×Ω_s} = (d_{p_0} ⋯ d_{p_{|P_s|−1}}),  x_{Ω_s} = \begin{pmatrix} x_{p_0} \\ \vdots \\ x_{p_{|P_s|-1}} \end{pmatrix},   (4.4)

for p_i ∈ P_s and |P_s| ≥ 1. Similar to the low-rank submatrix case, we aim to compute

D_{Ω_t×Ω_s} x_{Ω_s} = Σ_{i=0}^{|P_s|−1} d_{p_i} x_{p_i}.   (4.5)

Instead, in this step we only conduct the local computation z_local = d_{p_i} x_{p_i} on each process p_i ∈ P_s, without communication.
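The three local rules of Step 1, (4.3), the dense case of (4.5), and the pass-through case z_local = x_{Ω_s}, can be gathered into one routine. The following is a minimal sketch; the block descriptors, the column-index map, and all names are our own illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of Step 1 (source-side local computation): for every block
# this process touches on the source side, produce a local contribution
# z_local, following (4.3) and (4.5).
import numpy as np

def source_side_local(rank, blocks, x_local, local_cols):
    """Return {block id: z_local} for the blocks this process touches in Step 1."""
    z = {}
    for bid, blk in blocks.items():
        if rank not in blk["source_group"]:
            continue
        xs = x_local[local_cols[bid]]          # x restricted to this block's columns
        if blk["type"] == "lowrank":
            z[bid] = blk["V_local"].T @ xs     # (4.3): v_p^T x_p
        elif len(blk["target_group"]) == 1:
            z[bid] = blk["D_local"] @ xs       # dense, |P_t| = 1: d_p x_p
        else:
            z[bid] = xs                        # dense, |P_t| > 1: forward x_{Omega_s}
    return z
```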
Step 2: tree-reduction on the source process tree. This step implements the communication-required summations in (4.2) and (4.5). Naïvely, we could perform many MPI reductions¶, one for each submatrix, reducing the summation results to the corresponding group leaders. However, such a naïve reduction strategy requires many more messages than the tree-reduction introduced below, which benefits from the hierarchical organization of both the H-matrix and the processes.

The preliminary step of the tree-reduction is to collect and pack the local results that require communication in (4.2) and (4.5). On each process, we visit the H-matrix level by level from root to leaf. At each level, each process participates in one and only one process group; hence the local results destined for the same group leader are packed together in an array in a common ordering. Across levels, we concatenate the packed local results until one level before the level at which the process group contains only one process. We denote the maximum number of such levels as L_P.

Then a sequence of reductions is conducted from level L_P backward to the root level. At level L_P, all processes reduce the entire concatenated array to their group leaders at this level. The group leaders at level L_P have then already collected their group members' contributions to the summations from the root level to level L_P − 1. Hence, the non-leader group members at level L_P no longer participate in the remaining communications of this step. At a following level ℓ = L_P − 1, L_P − 2, . . . , 1, the participating processes are the group leaders at level ℓ + 1. They reduce their concatenated arrays from level 1 to level ℓ (with the contributions from their own group members) to their group leaders at level ℓ. When all reductions are completed, all group leaders own the summations (4.2) and (4.5) of their groups. With a slight abuse of notation, we still denote these summation results as z_local.
Remark 4.1. When the domain and its discretization are far from balanced, the process tree is also not balanced. Hence, it is possible that at some level ℓ < L_P a process forms a process group by itself. In this case, such a process does not need to participate in the reductions at level ℓ or lower. We do not exclude such cases from the description above, but we do exclude them in our implementation.

Figure 6 depicts the flow of a tree-reduction for an ideal one-dimensional domain distributed evenly over 8 processes. Although there is no communication-required data on the root level in the H-matrix-vector multiplication, we still include data cubes at level 0 in the figure, to demonstrate the idea and to show the extendability of the tree-reduction to more than two levels.

¶ We refer to "MPI_Reduce" with the addition operation as the reduction throughout this paper.
Figure 6: Tree-communication flowchart. The tree-reduction and the tree-broadcast flow from top to bottom and from bottom to top, respectively. Different columns with gray background are the concatenated arrays owned by different processes. Each cube is the packed data on the corresponding level, and the number in the cube indicates its group leader. For the tree-reduction, yellow cubes are local data to be reduced to their group leaders and summed together, whereas blue cubes are the final summation results owned only by the group leaders; as shown in the figure, only the yellow cubes and their owner processes participate in the reduction communications. For the tree-broadcast, blue cubes are the original packed data to be sent to group members, yellow cubes are packed data being broadcast, and light yellow cubes are the final broadcast data.
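The tree-reduction can be sketched with point-to-point MPI messages. The following minimal sketch assumes P = 2^L processes and, purely for illustration, one scalar partial result per level of the process tree (in the actual algorithm each level contributes a packed array of partial products); the names and the per-level data layout are our own, not the paper's code.

```python
# A minimal sketch of the tree-reduction (Step 2) on a binary process tree.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
P, rank = comm.Get_size(), comm.Get_rank()
L = P.bit_length() - 1          # depth of a binary process tree, P = 2^L
LP = L - 1                      # deepest level whose groups have > 1 process

# z[l-1] plays the role of the level-l entry of the concatenated array.
z = np.arange(1, LP + 1, dtype=np.float64) * (rank + 1)

for lvl in range(LP, 0, -1):    # from level L_P back to level 1
    g = P // 2**lvl             # group size at this level
    if rank % (g // 2) != 0:    # only leaders of the level below participate
        continue
    leader = (rank // g) * g    # group leader at this level
    if rank != leader:
        comm.Send(z[:lvl], dest=leader, tag=lvl)      # levels 1..lvl
    else:
        buf = np.empty(lvl, dtype=np.float64)
        comm.Recv(buf, source=leader + g // 2, tag=lvl)
        z[:lvl] += buf          # leader accumulates the partial sums

# Each level-l group leader now holds the sum of its group's level-l entries;
# rank 0, a group leader at every level, holds the completed sums for all levels.
if rank == 0:
    print("reduced array on the root leader:", z)
```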
Step 3: data transfer from source to target. After the previous step, all local data z_local are stored on their group leaders on the source side. In order to finish the computation, the local data must be sent to the processes of the target groups. To better benefit from the hierarchical structure, we accomplish this communication in this step and the next one. In this step, local data are sent from the source group leaders to the corresponding target group leaders; the next step is then responsible for broadcasting the local data to the processes of the target groups.

Given a pair of target and source group leaders, p_t and p_s, they could be the group leaders of many submatrices. Hence, process p_s first packs the local data of all those submatrices and sends them in one message to process p_t. After process p_t receives the packed local data, it unpacks the data into the submatrices.
Remark 4.2. We emphasize that a process participates in at most O(log P) group leader pairs. Consider process 0 as a source group leader; it acts most frequently as a source group leader among all processes. As mentioned before, each process participates in only one process group on each level of the process tree. Process 0 is then the group leader of one process group on each level, which adds up to O(log P) groups. A source process group on each level interacts with only a constant number of target process groups, where the constant depends on the admissibility condition. Hence process 0 is paired with a constant number of target group leaders at each level. Summing over all levels, process 0 is paired with O(log P) target group leaders.

Step 4: tree-broadcast on the target process tree. Consider a low-rank submatrix associated with Ω_t × Ω_s and process groups P_t × P_s as an example. The matrix-vector multiplication admits

U_{Ω_t×Ω_s} V_{Ω_t×Ω_s}^⊤ x_{Ω_s} = \begin{pmatrix} u_{p_0}(V_{Ω_t×Ω_s}^⊤ x_{Ω_s}) \\ \vdots \\ u_{p_{|P_t|-1}}(V_{Ω_t×Ω_s}^⊤ x_{Ω_s}) \end{pmatrix} = \begin{pmatrix} u_{p_0} z_local \\ \vdots \\ u_{p_{|P_t|-1}} z_local \end{pmatrix},   (4.6)

where p_i ∈ P_t and z_local is the summation in (4.2). After the previous step, in each submatrix, z_local is owned by the target group leader. Hence, in order to form the products u_{p_i} z_local in (4.6), z_local needs to be shared with all target group members. A similar equation can be written for dense submatrices whose target process groups have size greater than one. In this step, we hierarchically broadcast the local data z_local from the group leaders to the group members and name this the tree-broadcast; it is the reverse procedure of the tree-reduction.

Similar to the tree-reduction, we first collect and pack the local results that require communication. On each group leader, we visit the H-matrix level by level from root to leaf. At each level, the local results that are to be broadcast to the same group are packed together in an array. Across levels, we concatenate the packed local results until level L_P. Then a sequence of broadcasts is executed from the first level forward to level L_P. At a level ℓ = 1, 2, . . . , L_P − 1, the group leaders broadcast their arrays from level 1 to level ℓ to the subgroup leaders at level ℓ + 1. The subgroup leaders then concatenate the received array with their own packed arrays. Once the concatenation is accomplished, we move on to the next level. Finally, at level L_P, the group leaders broadcast their entire arrays to all their group members. In the end, all processes of the target process groups have received all the local data needed for the submatrices in which they participate.

A level skipping similar to that for the unbalanced source process tree (Remark 4.1) can be applied to the tree-broadcast as well. Figure 6 illustrates a tree-broadcast procedure for an ideal one-dimensional domain distributed over 8 processes.
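The tree-broadcast mirrors the tree-reduction sketch given after Figure 6. The following minimal sketch makes the same illustrative assumptions (P = 2^L processes, one scalar of data per level); overwriting the received prefix while keeping the deeper entries plays the role of "concatenating the received array with the process's own packed array". Names are ours, not the paper's code.

```python
# A minimal sketch of the tree-broadcast (Step 4) on a binary process tree.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
P, rank = comm.Get_size(), comm.Get_rank()
L = P.bit_length() - 1           # depth of a binary process tree, P = 2^L
LP = L - 1                       # deepest level whose groups have > 1 process

# Each process starts from its own packed per-level data.
z = (rank + 1.0) * np.arange(1, LP + 1)

for lvl in range(1, LP + 1):     # from level 1 forward to level L_P
    g = P // 2**lvl              # group size at this level
    if rank % (g // 2) != 0:     # only the (sub)group leaders at this step take part
        continue
    leader = (rank // g) * g     # this level's group leader
    if rank == leader:
        comm.Send(z[:lvl], dest=leader + g // 2, tag=lvl)
    else:
        comm.Recv(z[:lvl], source=leader, tag=lvl)   # overwrite levels 1..lvl

# z[l-1] on every process now equals the level-l entry originally packed by
# its level-l group leader, i.e., the leaders' data has reached all members.
print(rank, z)
```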
Step 5: target-side local computation. The target-side local computation goes through all low-rank and dense submatrices and accumulates the product results into the output vector y. Here we assume the output vector y is initialized to zero. We describe the operations for the two types of submatrices.

Low-rank submatrix. Consider a low-rank submatrix associated with Ω_t × Ω_s and process groups P_t × P_s. As shown in (4.6), for a process p_i ∈ P_t, the product result is u_{p_i} z_local. After the previous step, z_local is owned by p_i. Hence we only need to perform the communication-free computation

y_{p_i} = y_{p_i} + u_{p_i} z_local,   (4.7)

where y_{p_i} is the output vector y restricted to the subdomain of Ω_t owned by p_i.

Dense submatrix. Consider a dense submatrix associated with Ω_t × Ω_s and process groups P_t × P_s. If there is only one process in P_t, then the matrix-vector multiplication as in (4.5) has already been carried out in the first step, and the result z_local is owned by P_t after the previous communication steps. Hence we simply add it to the output vector,

y_{Ω_t} = y_{Ω_t} + z_local.   (4.8)

If there is more than one process in P_t, then the dense matrix is owned by P_t in a block-row fashion and the matrix-vector multiplication admits

D_{Ω_t×Ω_s} x_{Ω_s} = \begin{pmatrix} d_{p_0} x_{Ω_s} \\ \vdots \\ d_{p_{|P_t|-1}} x_{Ω_s} \end{pmatrix} = \begin{pmatrix} d_{p_0} z_local \\ \vdots \\ d_{p_{|P_t|-1}} z_local \end{pmatrix},   (4.9)

where p_i ∈ P_t and each p_i has a copy of z_local. In this step, process p_i is responsible for the computation

y_{p_i} = y_{p_i} + d_{p_i} z_local,   (4.10)

where y_{p_i} is the same as in (4.7).
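The accumulation rules (4.7), (4.8), and (4.10) can again be gathered into one routine. The following is a minimal sketch; the block descriptors, the row-index map, and all names are our own illustrative assumptions.

```python
# A minimal sketch of Step 5 (target-side local computation): every
# contribution z received through Steps 2-4 is multiplied by the locally
# stored factor (if any) and accumulated into the local piece of y.
import numpy as np

def target_side_local(rank, blocks, z, y_local, local_rows):
    for bid, blk in blocks.items():
        if rank not in blk["target_group"] or bid not in z:
            continue
        rows = local_rows[bid]                           # rows of y owned locally
        if blk["type"] == "lowrank":
            y_local[rows] += blk["U_local"] @ z[bid]     # (4.7)
        elif len(blk["target_group"]) == 1:
            y_local[rows] += z[bid]                      # (4.8): already D x
        else:
            y_local[rows] += blk["D_local"] @ z[bid]     # (4.10)
    return y_local
```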
Remark 4.3. Here we described the algorithm computing y = K x for a distributed-memory H-matrix K. A more standard matrix-vector multiplication operation in linear algebra is y = α K x + β y, the "GEMV" operation of level-2 BLAS. Such an operation can easily be accommodated here by not initializing y as a zero vector and modifying (4.7), (4.8), and (4.10) accordingly; all the other steps remain unchanged.

4.2 Complexity Analysis

In this section, we analyze the computational and communication complexities of the distributed-memory H-matrix-vector multiplication algorithm. To simplify the notation, we denote by L_P = O(log P) and L_N = O(log N) the numbers of levels in the process trees‖ and the domain trees, respectively.

The computational complexity follows readily from the analysis of the data balancing in Section 3.2. Notice that our total number of floating-point operations stays identical to that of the sequential H-matrix-vector multiplication if the extra computation in the tree-reduction is excluded, and the computation in the tree-reduction is of lower order compared with the dense matrix-vector multiplications conducted on each process. Hence, the extra computation in the communication steps can be ignored in our complexity analysis. Further, each process conducts a number of floating-point operations proportional to the amount of data it owns. Thanks to the balanced data distribution, we conclude that the computational work is also balanced across all processes and each process conducts O(N log N / P) operations.

The communication complexity consists of two parts: the latency (α) and the per-process inverse bandwidth (β). The complexity analysis for the latency is relatively simple and is the same for the different admissibility conditions. The latency essentially counts the number of send/receive communications. In the tree-reduction and tree-broadcast steps, each process conducts a reduction and a broadcast among a constant number of processes, hence O(1) send/receive communications on each level. Summing over all L_P levels, the latencies of both the tree-reduction and the tree-broadcast are O(α log P). Regarding Step 3 of our algorithm, as discussed in Remark 4.2, each process only communicates with O(log P) other processes. Hence the latency of Step 3, and the overall latency, are O(α log P). The inverse bandwidth complexities, however, differ for the different admissibility conditions and are discussed separately.

Weak admissibility condition. Consider the tree-reduction and tree-broadcast steps. At any given level ℓ ≤ L_P, each process participates in one process group and owns a constant number of submatrices. Hence the final concatenated array is of length O(L_P). Process 0 is the most communication-intensive process. At a level ℓ ≤ L_P, it communicates an array of size O(ℓ) in both the tree-reduction and the tree-broadcast. Therefore, process 0 in total sends and receives O(L_P²) data, which is an upper bound for the other processes. The inverse bandwidth complexities of the tree-reduction and tree-broadcast steps are then O(β log² P).

The inverse bandwidth complexity of the third step is much simplified for H-matrices under the weak admissibility condition due to one crucial difference between the weak admissibility condition and other admissibility conditions: H-matrices under the weak admissibility condition only have H-submatrices along their diagonal blocks, whereas H-matrices under other admissibility conditions also have H-submatrices in off-diagonal blocks. In the distributed-memory setting, this property means that the source and target process groups remain the same for all H-submatrices when the weak admissibility condition is adopted. Hence only low-rank submatrices are distributed among different source and target process groups. Now we again consider process 0, which is a group leader at all levels. For levels below L_P, process 0 does not participate in any submatrix with different source and target process groups. For level L_P and above, process 0 is responsible for sending the entire reduced array of length O(L_P) to other processes. Hence the inverse bandwidth complexity for process 0 in this step is O(β log P), which is an upper bound for the other processes.

Overall, the complexity, including both the computational and the communication complexity, for distributed-memory H-matrices under the weak admissibility condition on P processes is

O( N log N / P + α log P + β log² P ).   (4.11)

Standard admissibility condition. All the communication complexity analyses under the weak admissibility condition carry over to the standard admissibility condition with a different prefactor, determined by the number of admissible neighbors. Some extra communication cost comes from those H-submatrices singly owned by different target and source processes. In this case, no tree-communication is needed, but the source process needs to pack all local data of such an H-submatrix and send them to the target process. The amount of local data in the H-submatrix is a constant times the number of its low-rank and dense submatrices. Such H-submatrices mostly correspond to neighboring subdomains and are of size N/P. A more detailed calculation, which we omit here, shows that such an H-submatrix has O(log(N/P)) low-rank submatrices and O((N/P)^((d−1)/d)) dense submatrices, where d is the dimension of the problem. The number of low-rank submatrices essentially counts the number of levels, whereas the number of dense submatrices counts the number of finest-scale subdomains on the interface between the two neighboring subdomains. Hence the extra communication cost under the standard admissibility condition is O(β(log(N/P) + (N/P)^((d−1)/d))).

Overall, the complexity for distributed-memory H-matrices under the standard admissibility condition on P processes is

O( N log N / P + α log P + β( log² P + log(N/P) + (N/P)^((d−1)/d) ) ).   (4.12)

‖ Here we count the number of levels in a process tree until the first level at which all process groups contain one process.

Remark 4.4.
According to (4.11) and (4.12), we notice the trade-off between the computational complexity and the communication complexity. When P is much smaller than N, the dominant cost comes from the computational part, whereas as P approaches N, the computational cost becomes O(log N) and the communication complexity, O(log² P), dominates.

5 Numerical Results

All numerical experiments were performed on the Texas Advanced Computing Center (TACC) cluster Stampede2. This cluster has 4,200 Intel Knights Landing nodes, each with 68 cores and 96 GB of DDR memory. Nodes are interconnected via an Intel Omni-Path network with a fat-tree topology. We allocate various numbers of nodes for our tests, and each node runs 32 MPI processes. The memory limit per process is 3 GB.

In the following numerical results, we adopt a few measurements to demonstrate the parallel efficiency of our algorithm. In addition to the regular wall-clock time (walltime), we also calculate the speedup as well as the efficiency factor. Given a problem, we denote by P_0 the smallest number of processes able to solve the problem, and by t_0 the corresponding walltime; solving the problem on P ≥ P_0 processes takes t_P seconds. The speedup and the efficiency factor (in percent) are then

Speedup = P_0 t_0 / t_P   and   Eff = P_0 t_0 / (P t_P) · 100%.
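Under these definitions (as reconstructed above), the two measurements are straightforward to compute; the short helper below, with hypothetical numbers, illustrates them.

```python
# Speedup and efficiency as defined above; P0, t0 are the baseline process
# count and walltime, tP the walltime on P processes.  The sample numbers are
# purely illustrative.
def speedup(P0, t0, tP):
    return P0 * t0 / tP

def efficiency(P0, t0, P, tP):
    return P0 * t0 / (P * tP) * 100.0   # in percent

print(speedup(32, 10.0, 5.3), efficiency(32, 10.0, 64, 5.3))
```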
5.1 H-matrices for Two-Dimensional Problems

Let Ω = [0,1]² be the domain of interest. We discretize the problem with n points in each dimension, for grids of size up to 65536 × 65536. The structure of an H-matrix is then determined by a hierarchical partition of Ω. Since the construction of the H-matrix is beyond the scope of this paper, and the H-matrix-vector multiplication does not rely on the properties of the underlying problem, we fill the dense and low-rank submatrices of the H-matrices with random numbers and use these random H-matrices to explore the parallel scaling of our algorithm. Random input vectors are used in our tests as well. Both the weak admissibility condition and the standard admissibility condition are explored. In addition, we use two choices of r, r = 4 and r = 8, where the latter makes the problems more computation intensive. Each H-matrix is distributed among various numbers of processes, from 32 up to 16384. The reported runtime is averaged over 128 random input vectors.

Figure 7 depicts the strong scaling plots for the different H-matrices, and Table 1 further details walltimes, speedups, and efficiency factors. In both weak admissibility cases, Figure 7(a) and Figure 7(b), strong scaling is well preserved as we keep doubling the number of processes. Towards the end of each curve, when the communication cost dominates the walltime, the walltime remains flat for a long stretch, which means that the communication cost grows very mildly as the number of processes increases. In the standard admissibility cases, Figure 7(c) and Figure 7(d), good strong scaling is also observed in most cases. Compared with the weak admissibility cases, especially towards the end of each curve, the communication cost kicks in earlier as the number of processes increases, which is due to the different prefactors in the complexity analysis of Section 4.2. Table 1 provides more evidence supporting these comments. We emphasize that the parallel efficiencies are impressive especially for the larger problems: for the two largest problem sizes, the parallel efficiencies stay above 72 percent in the weak admissibility cases and above 90 percent in the standard admissibility cases, even when thousands of processes are used. Finally, we comment on the weak scaling. Although not plotted explicitly, the weak scaling∗∗ can be read off by connecting dots vertically across the curves in the figures; on the top half of each figure the weak scaling is nearly ideal (flat). Hence we claim that our algorithm and implementation exhibit both good strong scaling and good weak scaling.

Figure 7: Strong scaling of H-matrix-vector multiplication for various two-dimensional problems on various numbers of processes (up to 16384 processes). Panels (a) and (b) are H-matrices under the weak admissibility condition with rank 4 and 8, respectively; panels (c) and (d) are H-matrices under the standard admissibility condition with rank 4 and 8, respectively. Solid lines are strong scaling curves and dashed lines are their corresponding theoretical references. Different colors are problems of different sizes, as indicated in the legend.

Table 1: Walltime, speedup, and efficiency factor of H-matrix-vector multiplication for two-dimensional problems, under the weak and standard admissibility conditions, for various N, r, and P.

∗∗ The computational cost grows quasi-linearly whereas the number of processes grows linearly; our weak scaling definition here ignores the extra logarithmic factor.

5.2 H-matrices for Three-Dimensional Problems

In this section, we present numerical results for the domain Ω = [0,1]³. We discretize the problem with n being 64, 128, . . . , 1024, and the corresponding matrices are of size N × N with N = n³. As in the two-dimensional case, we adopt random H-matrices and random input vectors to explore the parallel scaling of our algorithm. Both the weak admissibility condition and the standard admissibility condition are explored, as well as the two choices of r. Each H-matrix is distributed among various numbers of processes, from 32 up to 16384. The reported runtime is averaged over 128 random input vectors.

The comments on the two-dimensional problems in Section 5.1 apply seamlessly to the three-dimensional problems. Under both the weak and standard admissibility conditions, strong scaling and weak scaling are well preserved as the number of processes increases, and H-matrices under the weak admissibility condition show better parallel efficiency than those under the standard admissibility condition. We now focus on the comparison between the two-dimensional and three-dimensional problems. Comparing Figure 7(a) and Figure 7(b) with Figure 8(a) and Figure 8(b), respectively, we find that all four figures show similar strong scaling as well as weak scaling. This behavior is predicted by (4.11), where the complexity under the weak admissibility condition is independent of the dimensionality of the problem. In contrast, comparing Figure 7(c) and Figure 7(d) with Figure 8(c) and Figure 8(d), respectively, the two-dimensional problems show better strong scaling than their three-dimensional counterparts. Under the standard admissibility condition, the number of neighboring subdomains increases as the dimension increases, which also implies that the required communication cost increases. As detailed in (4.12), the communication complexity depends monotonically on the dimension d. Hence, as confirmed by the numerical results, the communication cost dominates the walltime earlier for larger d.
Figure 8: Strong scaling of H-matrix-vector multiplication for various three-dimensional problems on various numbers of processes (up to 16384 processes). Panels (a) and (b) are H-matrices under the weak admissibility condition with rank 4 and 8, respectively; panels (c) and (d) are H-matrices under the standard admissibility condition with rank 4 and 8, respectively. Solid lines are strong scaling curves and dashed lines are their corresponding theoretical references. Different colors are problems of different sizes, as indicated in the legend.

Table 2: Walltime, speedup, and efficiency factor of H-matrix-vector multiplication for three-dimensional problems, under the weak and standard admissibility conditions, for various N, r, and P.

6 Conclusion

In this paper, we introduce a data distribution for distributed H-matrices and a distributed-memory H-matrix-vector multiplication algorithm.

Given the tree structure of the domain organization in the H-matrix, we also organize our processes in a process tree; two process trees are adopted, for the target and the source domains. Under our data distribution scheme, the load balancing factors are constants for both the weak admissibility condition (the constant is independent of the dimension d) and the standard admissibility condition (the constant depends on d). For problems of extremely large size N, our data distribution scheme allows the number of processes to grow as large as O(N). In this case, each process owns a part of the H-matrix whose size depends only logarithmically on N. Therefore, our data distribution is feasible for problems of extremely large size on massive numbers of processes.

The proposed distributed-memory H-matrix-vector multiplication algorithm is parallel efficient. Specifically, under our tree organizations of both processes and data, we introduce a tree-communication scheme, i.e., the "tree-reduction" and the "tree-broadcast", to significantly reduce the latency complexity. All computations required by the sequential H-matrix-vector multiplication are evenly distributed among all processes. Importantly, our algorithm entirely avoids the expensive scheduling step, which costs Ω(P) on P processes. Overall, our algorithm complexities for a d-dimensional problem of size N distributed among P processes are O(N log N / P + α log P + β log² P) and O(N log N / P + α log P + β(log² P + log(N/P) + (N/P)^((d−1)/d))) for the weak admissibility condition and the standard admissibility condition, respectively, where α denotes the latency and β denotes the per-process inverse bandwidth.

There are several future directions for improvement, both in the algorithm and in the implementation. Instead of pure MPI parallelization, one can combine OpenMP and MPI to further reduce the local communications within a node. This could improve the communication cost, especially for H-matrices under the standard admissibility condition, by a significant factor. Other H-matrix algebraic operations can also be efficiently parallelized given our data distribution and process organization. In a companion paper, we will introduce distributed-memory H-matrix compression, H-matrix addition, as well as H-matrix-H-matrix multiplication.

Availability.
Availability.

The distributed-memory H-matrix code, DMHM, is available under the GPLv3 license at https://github.com/YingzhouLi/dmhm. The code supports both two-dimensional and three-dimensional problems.
Acknowledgments

The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper. The work of Y.L. is supported in part by the US National Science Foundation under awards DMS-1454939 and DMS-2012286, and by the US Department of Energy via grant DE-SC0019449. The work of L.Y. is partially supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program, and the National Science Foundation under award DMS-1818449.