Load Plus Communication Balancing in Contiguous Partitions for Distributed Sparse Matrices: Linear-Time Algorithms
Peter Ahrens
Abstract—We study partitioning to parallelize multiplication of one or more dense vectors by a sparse matrix (SpMV or SpMM). We consider contiguous partitions, where the rows (or columns) of the matrix are split into K parts without reordering. We present exact and approximate contiguous partitioning algorithms that minimize the runtime of the longest-running processor under cost models that combine work factors and hypergraph communication factors. This differs from traditional graph or hypergraph partitioning models, which minimize total communication under a work balance constraint. We address regimes where partitions of the row space and column space are expected to match (the symmetric case) or are allowed to differ (the nonsymmetric case).

Our algorithms use linear space. Our exact algorithm runs in linear time when K is sublinear. Our (1 + ǫ)-approximate algorithm runs in linear time when K log(1/ǫ) is sublinear.

We combine concepts from high-performance computing and computational geometry. Existing load balancing algorithms optimize a linear model of per-processor work. We make minor adaptations to optimize arbitrary nonuniform monotonic increasing or decreasing cost functions which may be expensive to evaluate. We then show that evaluating our model of communication is equivalent to planar dominance counting. We specialize Chazelle's dominance counting algorithm to points in the bounded integer plane and generalize it to trade reduced construction time for increased query time, since our partitioners make very few queries.

Our algorithms split the original row (or column) ordering into parts to optimize diverse cost models. Combined with reordering or embedding techniques, our algorithms might be used to build more general heuristic partitioners, as they can optimally round one-dimensional embeddings of direct K-way noncontiguous partitioning problems.

I. INTRODUCTION
Scientific computing applications require efficient multiplication of one or more dense vectors by a sparse matrix. These kernels are referred to as SpMV or SpMM respectively, and often arise in situations where dense vectors are multiplied by the same sparse matrix repeatedly, such as iterative linear solvers. Parallelization can increase the efficiency of these operations, and datasets are sometimes large enough that parallelization across distributed networks of processors becomes imperative due to memory constraints. A common parallelization strategy is to partition the rows (or similarly, the columns) of the sparse matrix and corresponding elements of the dense vector(s) into parts, then assign each part to a separate processor. While there are a myriad of methods for partitioning the rows of sparse matrices, we focus on the case where the practitioner does not wish to change the ordering of the rows and the parts are therefore contiguous. There are several reasons to prefer contiguous partitioning. The order of the rows may have already been carefully optimized for numerical considerations, the natural order of the rows may already be amenable to partitioning, the solver may involve operators with special structure (like stencils or block-diagonals) that would be difficult to reorder, or reordering may simply be too complicated or expensive to implement on the target architecture. Furthermore, several noncontiguous partitioners (spectral methods or other heuristics [1], [2]) work by producing a one-dimensional embedding of the columns which is subsequently "rounded" to produce partitions. Our algorithm can be thought of as optimally rounding one-dimensional embeddings. Since noncontiguous partitioning to minimize communication costs is NP-hard [3], [4], one can view contiguous partitioning as a compromise, where the user is asked to use domain knowledge or heuristics to produce a good ordering, but the partitioning can be performed efficiently and accurately. As we will demonstrate, the contiguous partitioning techniques we develop are efficient and support highly expressive cost models.

[Fig. 1: Our running example matrix A, together with an example symmetric partition of x and y. Nonzeros are denoted with ∗.]

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA ([email protected]). This work was supported by a Department of Energy Computational Science Graduate Fellowship, DE-FG02-97ER25308.

In many applications of SpMV or SpMM, including iterative solvers, all processors wait for the last processor to finish multiplication.
This synchronization presents a load balancing problem where our goal is to minimize the maximum amount of time that any processor requires to finish and communicate its portion of the product. If we assign a cost to each row (typically proportional to the number of nonzeros plus some constant term reflecting linear operations) and model the cost of a part as the sum of the costs of rows in the part, this corresponds to the "Chains-On-Chains" load balancing problem. Chains-On-Chains has received extensive study over several decades; we refer to [5] for a summary.

Unfortunately, communication costs and storage considerations cannot be modeled accurately as the sum of some per-row workload. Iterative solvers require each processor to communicate completed entries of their part of the solution vector to the other processors that need them. The size of the messages that need to be sent is proportional to the number of distinct nonzero column locations that occur in the row part, regardless of how many times a particular column appears in the part. Parts which have nonzeros in many different columns need to communicate more. Accurate communication costs depend on the number of distinct nonlocal columns; this relationship is nonlinear and depends on the sparsity structure of the matrix itself. Cache and storage effects are also nonlinear; computing with a row part is much faster when the part fits in cache, and computation is impossible if a row part does not fit in the processor's memory.

A. Contributions
In this work, we partition sparse matrices to balance the sum of work and communication on each processor. We describe families of monotonic cost models which can account for nonlocal-column-based communication costs and discontinuous cache and storage effects. We show how state-of-the-art sublinear time algorithms for chain partitioning can optimize monotonic cost functions (cost functions which either only increase or only decrease as more columns are added to the part). Finally, we propose efficient algorithms to evaluate our cost models. The resulting partitioners run in linear time, use linear space, and optimize accurate models. By balancing the combined cost of communication and work over each part, we can shift work burdens from extroverted processors which communicate frequently to introverted processors which communicate rarely, more efficiently utilizing computing resources [6].

Our contributions are three-fold:

• We propose families of monotonic communication-aware cost functions which target iterative solvers in distributed memory settings. We address symmetric and nonsymmetric partitions. When the partition is nonsymmetric, we address the cases where either the row or the column partition is considered fixed. Our techniques can be generalized to create new families of monotonic cost functions for a wide variety of situations, accounting for factors like communication, cache, and storage effects.

• We adapt state-of-the-art chains-on-chains partitioning algorithms (originally due to Iqbal et al. and Nicol et al. [7], [8] and further improved by Pinar et al. [5]) to support arbitrary nonuniform monotonic increasing and decreasing cost functions which may be expensive to evaluate.

• We use a slightly modified reduction from multicolored one-dimensional dominance counting to standard two-dimensional dominance counting to compute our communication terms [9].
We generalize a two-dimensional dominance counting algorithm due to Chazelle to trade construction time for query time, allowing the overall algorithm to run in linear time [10].

When load-balancing across a sublinear number of processors, we desire a data structure that can be constructed in linear time and evaluate queries in sublinear time. Since our dominance counting algorithm is specialized to the case of computing prefix sums in a sparse matrix, it may be of independent interest, especially for two-dimensional rectilinear load balancing problems with linear cost functions [8], [11], [12].

Our dominance counting data structure uses a tree of height H, and can be constructed with H passes over the data. We suggest setting H to 3. When partitioning the rows of an m × n matrix with N nonzeros to run on K processors, the overall runtime of our ǫ-approximate partitioner (Algorithm 3) is

O(m + n + HN + K log(K/ǫ) log(m) H N^(1/H)) = O(m + n + N + K log(K/ǫ) log(m) N^(1/3))

for subadditive monotonically increasing costs, which is linear when K grows strictly slower than N^(1 − 1/H) = N^(2/3). The runtime of our exact partitioner (Algorithm 4) is

O(m + n + HN + K^2 log(m) H N^(1/H)) = O(m + n + N + K^2 log(m) N^(1/3))

which is linear when K grows strictly slower than N^((1 − 1/H)/2) = N^(1/3). Both algorithms use at most N + n extra storage, regardless of the choice of H.

II. BACKGROUND
We index starting from 1. Consider the m × n matrix A. We refer to the entry in the ith row and jth column of A as a_ij. A matrix is called sparse if most of its entries are zero. Let N be the number of nonzeros in A. Since most of their entries are zeros, it is more efficient to store sparse matrices in compressed formats. For example, the Compressed Sparse Row (CSR) format stores a sorted vector of nonzero column coordinates in each row, and a corresponding vector of values. This is accomplished efficiently with three arrays pos, idx, and val of length m + 1, N, and N respectively. Locations pos_i to pos_(i+1) − 1 of idx hold the sorted column indices j such that a_ij ≠ 0, and the corresponding entries of val hold the nonzero values. Sometimes it can be convenient to use Compressed Sparse Column (CSC) format, the transpose of CSR format where columns are compressed instead of rows.
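As a concrete illustration, the CSR arrays described above can be built and queried as in the following sketch (our own helper names, using Python's 0-indexing rather than the text's 1-indexing):

```python
def dense_to_csr(A):
    """Convert a dense row-major matrix (list of lists) to CSR arrays pos, idx, val."""
    pos, idx, val = [0], [], []
    for row in A:
        for j, a in enumerate(row):
            if a != 0:
                idx.append(j)
                val.append(a)
        pos.append(len(idx))  # pos[i+1] - pos[i] = number of nonzeros in row i
    return pos, idx, val

A = [[4, 0, 1],
     [0, 0, 2],
     [0, 3, 0]]
pos, idx, val = dense_to_csr(A)
# Row i occupies idx[pos[i]:pos[i+1]] and val[pos[i]:pos[i+1]].
assert pos == [0, 2, 3, 4]
assert idx == [0, 2, 2, 1]
assert val == [4, 1, 2, 3]
```

Note that pos acts as a prefix sum over row lengths, which is what later allows nonzero counts over row ranges to be computed in constant time.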
A. Iterative Solvers
When A is sparse, it is much faster to multiply by A (processing only the nonzero values) than it is to use direct factorization methods to solve A · x = b. This has motivated the development of iterative solvers, which solve the symmetric, positive semidefinite linear system A · x = b through repeated multiplication by A. We consider the case where we wish to solve for p distinct right-hand-sides.

Algorithm 1 (Serial Conjugate Gradient Method). Solve the symmetric, positive semidefinite system A · x = b to a tolerance of ǫ. The initial value of x is a guess. Note that in the case that there are p right-hand sides, x and b should be considered n × p matrices. Adapted from [13, Chapter 6].

function CONJUGATEGRADIENT(A, x, b, ǫ)
    r ← b − A · x
    δ ← r⊺ · r
    z ← r
    while δ^(1/2) > ǫ do
        y ← A · z
        α ← δ/(y⊺ · z)
        x ← x + αz
        r ← r − αy
        δ′ ← r⊺ · r
        β ← δ′/δ
        z ← r + βz
        δ ← δ′
    end while
    return x
end function

Because we can expect the inner loop to execute many times, we ignore the initialization steps and focus on the loop itself. The dominant cost is the sparse matrix dense vector
SpMV multiply A · z (or sparse matrix dense matrix SpMM multiply if p > 1), but there are also several dot products and pointwise additions to compute, which become more noticeable as the matrix becomes sparser.

The two dot products in Algorithm 1 impose synchronization points, where all processors locally compute their dot products, but depend on a global sum of the local dot products to compute α and β, which are needed in the next phase of computation. This isolates the dominant phase of computation (the SpMV) to the period from when each processor receives a global value of β to when all processors have posted their local contributions to α. Once β is received, processors compute their local copies of z, receive relevant nonlocal entries of z from other processors, compute y = A · z, and then post y⊺ · z.

A high-level summary of the conjugate gradient method is presented in Algorithm 1. We use conjugate gradient as an example because it is a well-known algorithm which helps illustrate several of the important terms in the computational cost of iterative solvers. While conjugate gradient is our running example, the cost functions we present are formulated for broader applications of SpMV and SpMM. For example, similar iterative algorithms exist for least-squares problems that apply to possibly rectangular matrices, multiplying by both A and A⊺ in each iteration [13]. We must therefore distinguish when we intend to partition in the symmetric or nonsymmetric (and possibly rectangular) case.

B. Parallelism and Partitioning
We consider a distributed memory model where storage of the matrix and intermediate vectors is divided across the different processors, and any communication of solver state involves sending a message between two processors. We assume that each processor can use multiple threads for asynchronous processing of communication. Without loss of generality, since transposition runs in linear time, we assume our matrix is stored in CSR format.

We must decide how to break up our data among processors to compute y = A · x. A frequent parallelization strategy for SpMV and SpMM on K processors is to split the rows or columns of the matrix and corresponding vector(s) into K contiguous matching parts, where part k is stored and/or computed on processor k. When we decompose the matrix row-wise, each processor first waits to communicate input vector entries (elements of x), then computes its local portion of y = A · x. When we decompose the matrix column-wise, each processor first computes its portion of A · x, then communicates these partial products for the final summation to y on the destination processor. Without loss of generality, we assume that we decompose the SpMV row-wise, although our cost functions apply equally to both cases. When multiplication by both A and A⊺ is required, we can simply use both interpretations of A.

In a K-partition Π of the rows of A, each row i is assigned to a single part k, and the set of rows in part k is denoted as π_k. We require that the parts be disjoint. Any partition Π on n elements must satisfy coverage and disjointness constraints:

⋃_k π_k = {1, ..., n}, and ∀ k ≠ k′, π_k ∩ π_k′ = ∅. (1)

We distinguish between symmetric and nonsymmetric partitioning regimes. A symmetric partitioning regime assumes A to be square and that the input and output vectors will use the same partition Π. Note that we can use symmetric partitioning on square, asymmetric matrices.
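The coverage and disjointness constraints (1) can be checked mechanically; the following sketch (our own helper names) does so for partitions represented as lists of parts:

```python
def is_partition(parts, n):
    """Check coverage and disjointness (constraints (1)) for parts of {1, ..., n}.
    Sorting the concatenated parts must yield exactly 1..n: any overlap or gap fails."""
    flat = [i for p in parts for i in p]
    return sorted(flat) == list(range(1, n + 1))

assert is_partition([[1, 2, 3], [4, 5], [6]], 6)
assert not is_partition([[1, 2], [2, 3]], 3)  # element 2 repeated: disjointness fails
assert not is_partition([[1, 2]], 3)          # element 3 missing: coverage fails
```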
The nonsymmetric partitioning regime makes no assumption on the size of A, and allows the input vector (columns) to be partitioned according to some partition Φ which may differ from the output vector (row) partition Π. Since our load balancing algorithms are designed to optimize only one partition at a time, we alternate between optimizing Π or Φ, considering the other partition to be fixed. When Φ is considered fixed and our goal is only to find Π, we refer to this more restrictive problem as primary alternate partitioning. When Π is considered fixed and our goal is only to find Φ, we refer to this problem as secondary alternate partitioning. Alternating partitioning has been examined as a subroutine in heuristic solutions to nonsymmetric partitioning regimes, where the heuristic alternates between improving the row partition and the column partition, iteratively converging to a local optimum [14], [15]. Similar alternating approaches have been used for the related two-dimensional rectilinear partitioning regimes [8], [12].

A partition is contiguous when each part consists of consecutively ordered elements. Formally, Π is contiguous when

∀ k < k′, ∀ (i, i′) ∈ π_k × π_k′, i < i′. (2)

A wealth of literature has been devoted to noncontiguous partitioning, where elements can be assigned to arbitrary processor locations. While contiguous partitions are less flexible, they are useful when the cost of reordering the matrix is unacceptable, when the matrix already has good local structure, or when the matrix has already been carefully ordered for numerical reasons. Contiguous partitions are also easier to conceptually understand and implement. This work demonstrates that optimal contiguous partitions can be constructed in nearly linear time, whereas noncontiguous partitioning is usually NP-hard [16], [17].
Even noncontiguous secondary alternate partitioning to balance communication (known as the "Column Assignment Problem") is NP-hard [18].

A contiguous partition Π can be described by its split points S, where i ∈ π_k for S_k ≤ i < S_(k+1). Thus, when we parallelize our solver, processor k will be responsible for rows S_k to S_(k+1) − 1 of A. Figure 1 illustrates a symmetric decomposition in the context of Algorithm 1.

C. Cost Models and Optimization
The history of cost models used to optimize iterative solvers is tied to both the methods available to optimize these models and the nature of the architecture under consideration. Since communication costs are quite expensive in the distributed memory model and storage limits on each node can be a concern, classical graph and hypergraph partitioners use graph theory to model and minimize the communication costs under per-node work or storage balance constraints.

We consider graphs and hypergraphs G = (V, E) on m vertices and n edges. Edges can be thought of as connecting sets of vertices. We say that a vertex i is incident to edge j if i ∈ e_j, where e_j denotes the vertex set of edge j; dually, v_i denotes the set of edges incident to vertex i, so i ∈ e_j if and only if j ∈ v_i. Graphs can be thought of as a specialization of hypergraphs where each edge is incident to exactly two vertices. The degree of a vertex is the number of incident edges, and the degree of an edge is the number of incident vertices.

The graph and hypergraph partitioning problems map rows of A to vertices of a graph or hypergraph, and use edges to represent communication costs. Vertices are weighted to represent computation cost or storage requirements, the weight of a part is viewed as the sum of the weights of its vertices, and the weights are constrained to be approximately balanced (balanced to a relative factor ǫ). We let W represent the vertex weights, so that w_i is the weight of vertex i.

Graph models for symmetric partitioning of symmetric matrices typically use the adjacency representation adj(A) of a sparse matrix A. If G = adj(A), an edge exists between vertices i and i′ if and only if a_ii′ ≠ 0. Thus, cut edges (edges whose vertices lie in different parts) require communication of their corresponding columns. However, this model overcounts communication costs in the event that multiple cut edges correspond to the same column, since each column only represents one entry of y which needs to be sent.
Reductions to bipartite graphs are used to extend this model to the possibly rectangular, nonsymmetric case [15].

Problem 1 (Edge Cut Graph Partitioning). Optimize the feasible (1) partition Π under G = (V, E) = adj(A):

argmin_Π |{(i, i′) ∈ E : i ∈ π_k, i′ ∈ π_k′, k ≠ k′}| such that ∀ k, ∑_(i ∈ π_k) w_i < ((1 + ǫ)/K) ∑_i w_i

Inaccuracies in the graph model led to the development of the hypergraph model of communication. Here we use the incidence representation of a hypergraph, inc(A). If G = inc(A), edges correspond to columns in the matrix, vertices correspond to rows, and we connect the edge e_j to the vertex v_i when a_ij ≠ 0. Thus, if there is some edge e_j which is not cut in a row partition Π, all vertices incident to e_j must belong to the same part π_k, and we can avoid communicating the input entry j by assigning it to processor k in our column partition Φ. In this way, we can construct a column partition Φ such that the number of cut edges in a row partition Π corresponds exactly to the number of entries of y that must be communicated, and the number of parts an edge touches is one more than the number of processors which need to receive that entry of y, since one of these processors has that entry of y stored locally. While it is clear that this will work in the nonsymmetric regime, Reference [4] explains how the partitions may be constrained to be symmetric when A is square. To formalize these cost functions on a partition Π, we define λ_j(A, Π) as the set of row parts which contain nonzeros in the jth column. Tersely, λ_j = {k : ∃ i ∈ π_k, a_ij ≠ 0}. The former cost is known as the "edge cut" metric and the latter is known as the "(λ − 1) cut" metric.

Problem 2 (Hyperedge Cut Hypergraph Partitioning). Optimize the feasible (1) partition Π under G = (V, E) = inc(A):

argmin_Π |{e_j ∈ E : |λ_j| > 1}| such that ∀ k, ∑_(i ∈ π_k) w_i < ((1 + ǫ)/K) ∑_i w_i

Problem 3 ((λ − 1) Cut Hypergraph Partitioning).
Optimize the feasible (1) partition Π under G = (V, E) = inc(A):

argmin_Π ∑_(e_j ∈ E) (|λ_j| − 1) such that ∀ k, ∑_(i ∈ π_k) w_i < ((1 + ǫ)/K) ∑_i w_i

While the hypergraph model better captures communication in our kernel, heuristics for these noncontiguous partitioning problems are highly specialized to the problem formulation, and can be expensive. For example, state-of-the-art multilevel partitioners recursively pair vertices and vertex groups, maintaining balance constraints and using iterative improvement algorithms at each level of grouping. The pairing step can be thought of as a weighted maximum matching problem (often approached with greedy heuristics). The weights in the graph case are the edge weights; the weights in the hypergraph case are the vertex similarities, measured by the number (or total weight) of shared edges. The refinement is usually accomplished with the Kernighan-Lin heuristic in graphs or the Fiduccia-Mattheyses algorithm in hypergraphs [3], [4]. A multilevel graph partitioner using a greedy coarsening algorithm and the FM heuristic runs in log-linear time. However, the added complexity of the hypergraph case is costly (especially the coarsening step), and these heuristics usually require at least quadratic time.

Both the graph and hypergraph formulations minimize total communication subject to a work or storage balance constraint, but it has been observed that the runtime depends more on the processor with the most communication, rather than the sum of all communication [15]. Several approaches use a two-phase approach to nonsymmetric partitioning where the matrix is first partitioned to minimize total communication volume, then the partition is refined to balance the communication volume (and other metrics) across processors [19], [20], [21]. Other approaches modify traditional hypergraph partitioning techniques to incorporate communication balance (and other metrics) in a single phase of partitioning [22], [23], [24]. Given a row partition, Bisseling et al.
consider finding a column partition to balance communication (the secondary alternate partitioning regime) [18].

The contiguous case is much more forgiving. Kernighan proposed a dynamic programming algorithm which solves the contiguous graph partitioning problem to optimality in quadratic time, and this result was extended to hypergraph partitioning by Grandjean and Uçar [25], [26].

Simplifications to the cost function lead to further reductions in runtime and stronger guarantees. Much research has been devoted to the "Chains-On-Chains" partitioning problem, where the objective is to produce a contiguous partition that minimizes the maximum amount of work performed by any processor (where work is modeled as the sum of some per-element work). The cost function doesn't involve communication, leading to algorithmic simplifications. When minimizing the maximum amount of work in a noncontiguous asymmetric (primary or secondary) partition, our partitioning problem is equivalent to bin-packing, which is approximable using straightforward heuristics that can be made to run in log-linear time [27]. These problems are often described as "load balancing" rather than "partitioning." In Chains-On-Chains partitioning, the work of a part is typically modeled as directly proportional to the number of nonzeros in that part. Formally,
Problem 4 (Chains-On-Chains). Optimize the feasible (1) contiguous (2) partition Π under G = (V, E) = adj(A):

argmin_Π max_k ∑_(i ∈ π_k) |v_i|

This cost model is easily computable since the pos vector allows us to compute the total number of nonzeros in a partition from i to i′ as pos_i′ − pos_i in constant time. Using pos and some additional structure in the cost function, algorithms for Chains-On-Chains partitioning run in sublinear time. Nicol observed that the work terms for the rows in Chains-On-Chains partitioners can each be augmented by a constant to reflect the cost of saxpy operations in iterative linear solvers [28]. Local refinements to contiguous partitions have been proposed to take communication factors into account, but this work proposes the first globally optimal single-phase contiguous partitioner with a communication-based cost model [29], [2].

III. PROBLEM FORMULATION
Although it is simpler and often less accurate, the objective of Chains-On-Chains (Problem 4) has two key advantages over the graph-based objectives of Problems 1, 2, and 3: Chains-On-Chains treats work balance as an objective, not a constraint, and the cost function is monotonic.

Since Chains-On-Chains partitioning uses partition balance as an objective, attempting to minimize the time spent by the longest-running processor, it better captures the effects of the main synchronization point in our iterative solver. The runtime of the solver doesn't depend directly on the total amount of work or communication, just on the maximum amount of work and communication any particular processor has to perform [6], [20], [18], [30]. We refer to a costliest part in a partition as a bottleneck processor.

Additionally, Chains-On-Chains partitioning minimizes an objective function which is monotonic, meaning that adding additional rows to a part will never decrease (or never increase) the cost of that part, whereas neither of the communication metrics of the graph-based objectives are monotonic. The combination of a balance objective and monotonic cost function makes the contiguous partitioning problem efficiently solvable. Put simply, monotonicity implies that adding vertices to a part can only reduce (or only increase) the cost of the other parts, which enables parametric search techniques [5].
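The parametric search enabled by monotonicity rests on a feasibility probe: given a candidate bottleneck cost c, greedily extend each part for as long as its monotonic increasing cost stays at most c; the optimum is then found by searching over c. The sketch below is our own simplified illustration (helper names invented; fast variants would replace the linear scan with binary search over split points):

```python
def probe(m, K, cost, c):
    """Feasibility probe for monotonic increasing costs: return split points
    (0-indexed; part k covers rows S[k]..S[k+1]-1) achieving max part cost <= c,
    or None if K parts do not suffice."""
    S = [0]
    for _ in range(K):
        lo = S[-1]
        hi = lo
        # Greedily extend the part while its cost stays within the budget c.
        while hi < m and cost(lo, hi + 1) <= c:
            hi += 1
        S.append(hi)
        if hi == m:
            break
    return S + [m] * (K + 1 - len(S)) if S[-1] == m else None

# Example with a linear (Chains-On-Chains style) cost: sum of per-row works,
# evaluated in O(1) from a prefix sum, mirroring the role of the pos array.
work = [3, 1, 4, 1, 5, 9, 2, 6]
pre = [0]
for w in work:
    pre.append(pre[-1] + w)
cost = lambda lo, hi: pre[hi] - pre[lo]
assert probe(len(work), 3, cost, 14) is not None  # a bottleneck of 14 is achievable
assert probe(len(work), 3, cost, 9) is None       # 9 is too tight for 3 parts
```

The same probe works unchanged for any monotonic increasing cost, including the communication-aware costs developed below; only the cost oracle changes.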
A. Monotonic Load Balancing
This leads us to the first contribution of this work: monotonic cost models for SpMV in the context of iterative solvers on distributed memory machines. We start with a formal definition of monotonicity. A cost function f on a set of vertices is defined as monotonic increasing if for any two sets of vertices π, π′ ⊆ V where π ⊆ π′,

f(π) ≤ f(π′). (3)

Our cost function f is defined as monotonic decreasing if for any two sets of vertices π, π′ ⊆ V where π ⊆ π′,

f(π) ≥ f(π′). (4)

Monotonic costs can be composed; if we have two monotonic increasing or decreasing functions f_1 and f_2, and a is a nonnegative constant, then the following functions are also monotonic increasing or decreasing, respectively:

g(π) = min(f_1(π), f_2(π))
g(π) = max(f_1(π), f_2(π))
g(π) = f_1(π) + f_2(π)
g(π) = a · f_1(π) (5)

We are ready to state our main partitioning problem, which will provide an optimization framework for the rest of the work. By changing the cost function, our problem will address the symmetric partitioning regime and both of our nonsymmetric regimes to varying degrees of accuracy.
Problem 5 (Monotonic Load Balancing). Optimize the feasible (1) contiguous (2) partition Π under

argmin_Π max_k f_k(π_k)

where each f_k is monotonic increasing (3) or each f_k is monotonic decreasing (4).

B. Cost Modeling
Looking carefully at Algorithm 1, our example iterative solver, the inner loop can be divided into two phases, separated by the synchronization to compute dot products for α and β. The first phase, between α and β, which computes x, r, and β, is relatively inexpensive and requires no communication other than a scalar reduction at the beginning and end.

We focus instead on the phase between β and α, which computes z, y, and α. This phase is much more expensive, involving the key SpMV operation. It begins with each processor sending copies of their local portions of z to processors that need it, then receiving the required nonlocal copies of z from other processors. In distributed memory settings, both the sending and the receiving node must participate in transmission of the message. The accuracy to which we can model the effects of communication will depend on whether we are partitioning in the nonsymmetric or symmetric regime.

In all of our partitioning regimes, we model the per-row work (computation) cost (due to dot products, vector scaling, and vector addition) with the scalar c_dot, and the per-nonzero work costs (due to matrix multiplication) with the scalar c_spmv. We further assume that the runtime of a part is proportional to the sum of local work and the number of received entries. This differs from the model of communication used by Bisseling et al., where communication is modeled as proportional to the maximum number of sent entries or the maximum number of received entries, whichever is larger [18]. While ignoring sent entries may seem to represent a loss in accuracy, there are several reasons to prefer such a model. Processors have multiple threads which can process sends and receives independently.
If we assume that the network is not congested (that sending processors can handle requests when receiving processors make them), then the critical path for a single processor to finish its work consists only of receiving the necessary input entries and computing its portion of the matrix product. We assume the cost of receiving an entry is proportional to some scalar c_message.
1) Monotonic Nonsymmetric Cost Modeling:
Since it admits a more accurate cost model, we consider the alternating partitioning regime first. This regime considers only one of the row (output) space partition Π of our sparse matrix multiplication operation and the column (input) space partition Φ, considering the other to be fixed.

We model our matrix as an incidence hypergraph. Notice that we are incentivized to pick partitions where elements of the input vector that reside on a processor are reused in computing output vector elements on the same processor, so they do not need to be communicated. The nonlocal entries of the input vector which processor k must receive are the edges j incident to vertices i ∈ π_k such that j ∉ φ_k. We can express this tersely as (⋃_(i ∈ π_k) v_i) \ φ_k. Thus, an expressive cost model for the alternating regime can be constructed as

f_k(π_k, φ_k) = c_dot |π_k| + c_spmv ∑_(i ∈ π_k) |v_i| + c_message |(⋃_(i ∈ π_k) v_i) \ φ_k| (6)

When Φ is fixed, each f_k is a linear combination of monotonic increasing factors and is therefore monotonic increasing (increasing the size of a row part can only increase the work and the set of columns which potentially need communication). When Π is fixed, each f_k is monotonic decreasing (increasing the size of a column part can only increase the number of local columns, decreasing the communication of part k).

While we require the opposite partition to be fixed, we will only require the partition we are currently constructing to be contiguous. Requiring both Π and Φ to be contiguous would limit us to matrices whose nonzeros are clustered near the diagonal. Allowing arbitrary fixed partitions gives us the flexibility to use other approaches for the secondary alternate partitioning problem.
For example, one might simply assign each column to an incident part to optimize total communication volume as suggested by Çatalyürek [4]. Of course, since our alternating partitioning regime assumes we have a fixed Π or Φ, we need an initial partition. We propose starting by constructing Π because this partition involves more expensive tradeoffs between work and communication. Since we would have no Φ to start, we can assume that all entries of the input vector must be communicated to processors that need them, regardless of whether they are stored locally or not; then we can decouple the costs of the two partitions, upper-bounding the cost of communication. In the hypergraph model, processor k receives at most |∪_{i∈π_k} v_i| entries of the input vector. Thus, our cost model would be

    f(π_k) = c_dot·|π_k| + c_spmv·Σ_{i∈π_k} |v_i| + c_message·|∪_{i∈π_k} v_i|.    (7)

Note that any column partition Φ will achieve or improve on the modeled cost of (7).
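To make the atoms in (7) concrete, here is a small Python sketch (an illustration, not the paper's implementation; the coefficient values and the 4×4 example matrix are assumptions) that evaluates the model for a contiguous row part directly from CSR arrays:

```python
# Hypothetical evaluation of cost model (7) for a contiguous row part
# pi_k = rows [s_lo, s_hi) of a CSR matrix. Coefficients are arbitrary.

def part_cost(pos, idx, s_lo, s_hi, c_dot=1.0, c_spmv=1.0, c_message=10.0):
    n_rows = s_hi - s_lo                             # |pi_k|
    n_nnz = pos[s_hi] - pos[s_lo]                    # sum_{i in pi_k} |v_i|
    footprint = len(set(idx[pos[s_lo]:pos[s_hi]]))   # |union_{i in pi_k} v_i|
    return c_dot * n_rows + c_spmv * n_nnz + c_message * footprint

# 0-based CSR encoding of a 4x4 example matrix.
pos = [0, 2, 3, 5, 6]
idx = [0, 2, 2, 1, 3, 3]
print(part_cost(pos, idx, 0, 2))  # 2 rows, 3 nonzeros, footprint {0, 2} -> 25.0
```

Note that the set-based footprint term here takes time linear in the size of the part; Section V replaces it with a far cheaper dominance-counting query.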
2) Monotonic Symmetric Cost Modeling:
In the symmetric partitioning case, we can achieve an approximation of the accuracy of the nonsymmetric model by adjusting scalar coefficients. However, the symmetric partitioning regime does not need multiple alternating steps, and we construct our partition with just one solution to Problem 5. Recall that the symmetric case asks us to produce a partition Π which will be used to partition both the row and column space simultaneously. Simply replacing Φ with Π in (6), we obtain

    f(π_k) = c_dot·|π_k| + c_spmv·Σ_{i∈π_k} |v_i| + c_message·|(∪_{i∈π_k} v_i) \ π_k|    (8)

Unfortunately, the nonlocal communication factor |(∪_{i∈π_k} v_i) \ π_k| is not necessarily monotonic. However, notice that |(∪_{i∈π_k} v_i) \ π_k| + |π_k| = |(∪_{i∈π_k} v_i) ∪ π_k| is monotonic. Assume that each row of the matrix has at least v_min nonzeros (in linear solvers, v_min should be at least two, or less occupied rows could be trivially computed from other rows). We rewrite (8) in the following form:

    f(π_k) = (c_dot + v_min·c_spmv − c_message)·|π_k| + c_spmv·Σ_{i∈π_k} (|v_i| − v_min) + c_message·|(∪_{i∈π_k} v_i) ∪ π_k|.

Since |π_k|, |(∪_{i∈π_k} v_i) ∪ π_k|, and Σ_{i∈π_k}(|v_i| − v_min) are all monotonic, we can say that (8) is monotonic when the coefficients on these terms are positive. We therefore require

    c_dot + v_min·c_spmv ≥ c_message    (9)

Informally, (9) asks that the rows and dot products hold "enough" local work to rival communication costs. These conditions correspond to situations where it would be cheaper to communicate a summand in yᵀ·A·x than it would be to compute it (assuming the required entries of y are locally available).
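As a sanity check, the algebra above can be verified numerically; the following sketch (with arbitrary example coefficients and sets, not taken from the paper) confirms that the monotone rewriting agrees with (8) whenever every row meets the v_min threshold:

```python
# Check that the monotone rewriting of (8) matches the original form.
# All values below are illustrative assumptions.
c_dot, c_spmv, c_message, v_min = 1.0, 2.0, 4.0, 2

rows = {0: {0, 1}, 1: {1, 2, 3}, 2: {0, 3}}  # v_i: column sets per row
part = {0, 1}                                 # pi_k

union = set().union(*(rows[i] for i in part))
nnz = sum(len(rows[i]) for i in part)

original = (c_dot * len(part) + c_spmv * nnz
            + c_message * len(union - part))                 # form (8)
rewritten = ((c_dot + v_min * c_spmv - c_message) * len(part)
             + c_spmv * (nnz - v_min * len(part))
             + c_message * len(union | part))                # monotone form

assert original == rewritten  # both equal 20.0 here
```

Here condition (9) holds (c_dot + v_min·c_spmv = 5 ≥ c_message = 4), so every term of the rewritten form carries a nonnegative coefficient.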
These constraints are most suitable to matrices with heavy rows, because increasing the number of nonzeros in a row increases the amount of local work and the communication footprint, but not the cost to communicate the single entry of output corresponding to that row. Depending on how sparse our matrix is, we may approximate the modeled cost of the matrix by assuming v_min to be larger than it really is. If there are at most m′ "underfull" rows with fewer than v_min nonzeros, then (8) will be additively inaccurate by at most m′·v_min·c_spmv.

C. Capturing Constraints and Discontinuities
Traditional partitioners often use balance constraints to reflect per-node storage limits, rather than to balance work. Fortunately, threshold functions are monotonic, and we can use thresholds in our cost functions to reflect constraints. If our maximum storage size is w_max, then we can define

    τ_{w_max}(w) = 0 if w < w_max, 1 otherwise.

Since τ is monotonic in w, we may simply take the maximum of τ_{w_max}(|π_k|)·∞ (where ∞ is a suitably large value) and our original cost functions to produce a new monotonic cost which enforces balance constraints. We can also use thresholds to capture discontinuous phenomena like cache effects. If we have a more expensive cost model we wish to use when our input or output vectors do not fit in cache, we can multiply the more expensive model by a threshold and take the larger of the in-cache model and the thresholded out-of-cache model.

D. Atoms and Molecules
Together, the monotonic "atoms" |π_k|, Σ_{i∈π_k}|v_i|, |(∪_{i∈π_k} v_i) \ φ_k|, |∪_{i∈π_k} v_i|, and |(∪_{i∈π_k} v_i) ∪ π_k| can be combined with operations like +, positive scalar ·, min, max, and τ to produce complex "molecules" like (6), (7), (8), and even new cost functions which we have yet to consider. Since Nicol's algorithm minimizes monotonic cost functions on contiguous partitions with a sublinear number of calls to the cost function, our overall algorithm will be efficient if we can ensure that our functions, and thus their composite atoms, are efficiently computable. Figure 2 displays two example partitions and the values of the corresponding atoms. We assume our partition is contiguous, and therefore specified by split points S. Thus, we can compute |π_k| as s_{k+1} − s_k in constant time. Since the CSR format requires pos to be a prefix sum of the number of nonzeros in each row, we can compute Σ_{i∈π_k}|v_i| as pos_{s_{k+1}} − pos_{s_k} in constant time. Storage requirements usually depend on factors such as |π_k| and Σ_{i∈π_k}|v_i|. All that remains is to find an efficient way to compute |(∪_{i∈π_k} v_i) \ φ_k|, |∪_{i∈π_k} v_i|, and |(∪_{i∈π_k} v_i) ∪ π_k|, which we will discuss later in Section V.

E. Bounding the Costs
While our partitioner for Problem 5 does not make any assumption on the cost other than nonnegativity and monotonicity, our approximate partitioner will need an upper and lower bound on the cost to search within, and both algorithms perform better when given better bounds on the cost. We can use subadditivity in our cost functions to provide good lower bounds on the cost. A cost function f is subadditive if for any two sets of vertices, π and π′,

    f(π) + f(π′) ≥ f(π ∪ π′)    (10)

Like monotonicity, subadditivity can be composed. Unlike monotonicity, subadditivity cannot be composed under the min function, and is not preserved under the threshold τ. Given subadditive functions f_1 and f_2 and a positive constant a, the following functions are subadditive:

    g(π) = max(f_1(π), f_2(π)),    g(π) = f_1(π) + f_2(π),    g(π) = f_1(π)·a    (11)

Given a monotonic increasing cost function f over a set of vertices V, it is clear that f(V) is an upper bound on the cost of a K-partition. Therefore,

    max_k f(π_k) ≤ f(V).    (12)

One may use subadditivity in f to create a lower bound on a K-partition. Applying the definition of subadditivity to some function f, we obtain

    Σ_k f(π_k) ≥ f(V),

and therefore,

    max_k f(π_k) ≥ f(V)/K.    (13)

Fig. 2: Two parts in a symmetric partition of our example matrix, and the values of some of our various "atoms" for each part. Although part 2 (shown on top) contains more nonzeros than part 3 (shown on bottom), part 3 contains more distinct nonzero column locations, and will therefore require more communication. Again, nonzeros in the matrix A are denoted with ∗.

Finding lower bounds on partition costs is more difficult than finding upper bounds. Equations (7) and (8) are subadditive and use the same cost for each part, so they obey (13). However, Equation (6) uses different functions f_k on each part, and therefore we cannot apply (13). One workaround is to lower bound only the work terms in the cost function, since these are uniform across partitions. This is equivalent to first lower-bounding all our functions by some uniform subadditive one, then applying Equation (13). Similarly, although subadditivity is not preserved under operations involving thresholds τ_{w_max}, we can often provide lower bounds by examining only the subadditive terms. For example, if our cost function is subadditive except for a finite set of monotonic constraint functions τ_{w_max}, ..., we can write it in the form

    f(π) = max(f′(π), τ_{w_max}, ...),

and simply use the upper and lower bound on f′(π), since this is the only bound that matters if a feasible solution exists. Subadditivity does not help us bound monotonic decreasing cost functions. For these costs, it may make sense to simply use max_{k,i} f_k(i, i) and max_k f_k(1, m) as upper and lower bounds on the cost.

IV. Monotonic Increasing Partitioners
Pinar et al. present a multitude of algorithms for optimizing linear cost functions [5]. We examine both an approximate and an optimal algorithm. We chose the approximate "ǫ-BISECT+" algorithm (originally due to Iqbal et al. [7]) and the exact "NICOL+" algorithm (originally due to Nicol et al. [8]) because they were easy to understand and enjoyed strong guarantees, but used dynamic split point bounds and other optimizations based on problem structure which result in empirically reduced calls to the cost function. Since the approximate algorithm introduces many of the key ideas needed to understand the exact algorithm, we start with our adaptation of the "ǫ-BISECT+" partitioner, which produces a K-partition within ǫ of the optimal cost when it lies within the given bounds. If our cost is monotonic increasing, increasing the size of some part in a partition will not increase the cost of other parts. Therefore, if we know the cost of a K-partition to be at most c, then we can set the endpoint of the first part to the largest i′ such that f_1(1, i′) ≤ c, and the remaining (K − 1)-partition which starts at i′ will also cost at most c. This observation provides us with a procedure that determines whether a partition of cost c is feasible. We can use such a procedure to search over the space of possible costs, stopping when our lower bound on the cost is within ǫ of the upper bound. Note that if we had the optimal value of a K-partition, we could use this search procedure to find the optimal split points using only K·log(m) evaluations of the cost function. While Pinar et al. use this fact to simplify their algorithms and only return the optimal partition value, our cost function is much more expensive to evaluate than theirs, so our presentations of their algorithms have been modified to return split points with no extra evaluations of the cost functions [5]. Since parametric search is a repeated pattern, we provide pseudocode for our binary search procedure as a warm up.
Algorithm 2.
Given a monotonic increasing cost function f_k defined on pairs of split points, a starting split point i, and a maximum cost c, find the greatest i′ such that i < i′, f_k(i, i′) ≤ c, and i′_low ≤ i′ ≤ i′_high. Returns max(i, i′_low) − 1 if no cost at most c can be found.

    function SEARCH(f_k, i, i′_low, i′_high, c)
        i′_low ← max(i, i′_low)
        while i′_low ≤ i′_high do
            i′ ← ⌊(i′_low + i′_high)/2⌋
            if f_k(i, i′) ≤ c then
                i′_low ← i′ + 1
            else
                i′_high ← i′ − 1
            end if
        end while
        return i′_high
    end function

We now state our adapted bisection algorithm as Algorithm 3. Algorithm 3 differs from Pinar et al. in that it is stated in terms of possibly different f_k for each part, makes no assumptions on the value of f_k(i, i), allows for an early exit to the probe function, returns the partition itself instead of the best cost (this avoids extra probes needed to construct the partition from the best cost), and constructs the dynamic split index bounds in the algorithm itself, instead of using more complicated heuristics (which may not apply to all cost functions) to initialize the split index bounds. Since Algorithm 3 evaluates at most log(c_high/(c_low·ǫ)) objectives, the number of calls to the cost function is bounded by

    K·log(m)·log(c_high/(c_low·ǫ)).    (14)

If our cost is subadditive and we use (13) to correctly set c_high = f(1, m + 1) and c_low = f(1, m + 1)/K, then the number of calls to the cost function is bounded by

    K·log(m)·log(K/ǫ).    (15)

Algorithm 3 (BISECT Partitioner). Given monotonic increasing cost function(s) f defined on pairs of split points, find a contiguous K-partition Π which minimizes

    c = max_k f_k(s_k, s_{k+1})

to a relative accuracy of ǫ within the range c_low ≤ c ≤ c_high, if such a partition exists. This is an adaptation of the "ǫ-BISECT+" algorithm by Pinar et al. [5], which was a heuristic improvement on the algorithm proposed by Iqbal et al. [7].
    function BISECTPARTITION(f, m, c_low, c_high, ǫ)
        (s^high_1, ..., s^high_{K+1}) ← (1, m + 1, m + 1, ..., m + 1)
        (s^low_1, ..., s^low_{K+1}) ← (1, 1, ..., 1)
        (s_1, ..., s_{K+1}) ← (1, 1, ..., 1)
        while c_low·(1 + ǫ) < c_high do
            c ← (c_low + c_high)/2
            s_1 ← 1
            t ← true
            for k = 1, 2, ..., K do
                s_{k+1} ← SEARCH(f_k, s_k, s^low_{k+1}, s^high_{k+1}, c)
                if s_{k+1} < max(s_k, s^low_{k+1}) then
                    t ← false
                    s_{k+1}, ..., s_{K+1} ← max(s_k, s^low_{k+1})
                    break
                end if
            end for
            if t and s_{K+1} = m + 1 then
                c_high ← c
                S^high ← S
            else
                c_low ← c
                S^low ← S
            end if
        end while
        return S^high
    end function

The key insight made by Nicol et al. which allows us to improve our bisection algorithm into an exact algorithm was that there are only m² possible costs which could be a bottleneck in our partition [8]. Thus, Nicol's algorithm searches the split points instead of searching the costs, and achieves a strongly polynomial runtime. We will reiterate the main idea of the algorithm, but refer the reader to [5] for more detailed analysis. Assume that we know the starting split point of processor k to be i. Consider the ending point i′ in a partition of minimal cost. If k were a bottleneck (longest running processor) in such a partition, then f_k(i, i′) would be the overall cost of the partition, and we could use this cost to bound that of all other processors. If k were not a bottleneck, then f_k(i, i′) should be strictly less than the minimum feasible cost of a partition, and it would be impossible to construct a partition of cost f_k(i, i′). Thus, Nicol's algorithm searches for the first bottleneck processor, examining each processor in turn.
When we find a processor where the cost f_k(i, i′) is feasible, and less than the best feasible cost seen so far, we record the resulting partition in the event this was the first bottleneck processor. Then, we set i′ so that f_k(i, i′) is the greatest infeasible cost and continue searching assuming that the previous processor was not a bottleneck. We have made similar modifications in our adaptation of "NICOL+" as we did for our adaptation of "ǫ-BISECT+." We phrase our algorithm in terms of potentially multiple f_k, construct our dynamic split point bounds inside the algorithm instead of using additional heuristics, make no assumptions on the value of f_k(i, i), allow for early exits to the probe function, and return a partition instead of an optimal cost. We also consider bounds on the cost of a partition to be optional in this algorithm. Our adaptation of "NICOL+" ([5]) for general monotonic part costs is presented in Algorithm 4. Although "NICOL+" uses outcomes from previous searches to bound the split points in future searches, a simple worst-case analysis of the algorithm shows that the number of calls to the cost function is bounded by

    K²·log(m)².    (16)

Since this number of probe sequences is sublinear in the size of the input, we will use this as our theoretical upper bound.

Algorithm 4 (NICOL Partitioner). Given monotonic increasing cost function(s) f defined on pairs of split points, find a contiguous K-partition Π which minimizes

    c = max_k f_k(s_k, s_{k+1})

within the range c_low ≤ c ≤ c_high, if such a partition exists. This is an adaptation of the "NICOL+" algorithm by Pinar et al. [5], which was a heuristic improvement on the algorithm proposed by Nicol et al. [8].
    function NICOLPARTITION(f, m, c_low = −∞, c_high = ∞)
        (s^high_1, ..., s^high_{K+1}) ← (1, m + 1, m + 1, ..., m + 1)
        (s^low_1, ..., s^low_{K+1}) ← (1, 1, ..., 1)
        (s_1, ..., s_{K+1}) ← (1, 1, ..., 1)
        for k ← 1, 2, ..., K do
            i ← s_k
            i′_high ← s^high_{k+1}
            i′_low ← max(s_k, s^low_{k+1})
            while i′_low ≤ i′_high do
                i′ ← ⌊(i′_low + i′_high)/2⌋
                c ← f_k(i, i′)
                if c_low ≤ c < c_high then
                    s_{k+1} ← i′
                    t ← true
                    for k′ = k + 1, k + 2, ..., K do
                        s_{k′+1} ← SEARCH(f_{k′}, s_{k′}, s^low_{k′+1}, s^high_{k′+1}, c)
                        if s_{k′+1} < max(s_{k′}, s^low_{k′+1}) then
                            t ← false
                            s_{k′+1}, ..., s_{K+1} ← max(s_{k′}, s^low_{k′+1})
                            break
                        end if
                    end for
                    if t and s_{K+1} = m + 1 then
                        c_high ← c
                        i′_high ← i′ − 1
                        S^high ← S
                    else
                        c_low ← c
                        i′_low ← i′ + 1
                        S^low ← S
                    end if
                else if c ≥ c_high then
                    i′_high ← i′ − 1
                else
                    i′_low ← i′ + 1
                end if
            end while
            if i′_high < max(s_k, s^low_{k+1}) then
                break
            end if
            s_{k+1} ← i′_high
        end for
        return S^high
    end function

Of course, some modifications are needed to apply Algorithms 3 and 4 to monotonic decreasing cost functions. Because the increasing and decreasing variants are so similar, we state the monotonic decreasing versions in Appendix A.

V. Computing the Footprint
Efficient computation of |(∪_{i∈π_k} v_i) \ φ_k|, |∪_{i∈π_k} v_i|, and |(∪_{i∈π_k} v_i) ∪ π_k| presents a major challenge in this work. These quantities concern the size of the set of distinct nonzero column locations in some row part. Prior works scan the rows of the matrix from top to bottom, using a hash table to record nonzero column locations that have been seen before, and perhaps tally when each has been seen before [31], [26], [29], [2], [32]. These approaches are efficient if we wish to compute the cost of parts for all starting points corresponding to a fixed end point, or for all end points corresponding to a fixed start point, making them a good fit for dynamic programming approaches that will evaluate cost functions for all possible parts in a structured, exhaustive pattern. These approaches often take O(m + N) time or similar, meaning that each of the approximately m evaluations of the cost function occurs in amortized constant time. However, if we wish to evaluate the cost for arbitrary start and end points using this approach, each evaluation of the cost function would run in time linear in the size of the considered part, which would lead to a superlinear runtime overall. Thus, we need a data structure that supports efficient, arbitrary evaluations of our communication terms. In the secondary alternate regime when Π is fixed,

    |(∪_{i∈π_k} v_i) \ φ_k| = |∪_{i∈π_k} v_i| − |(∪_{i∈π_k} v_i) ∩ φ_k|.

By constructing sorted list representations of each set ∪_{i∈π_k} v_i in linear time and space, we can easily query the last term |(∪_{i∈π_k} v_i) ∩ φ_k| in O(log(m)) time by searching for the boundaries of the contiguous region defining φ_k. In the other regimes, we can compute our communication terms by evaluating |∪_{i∈π_k} v_i| over special hypergraphs. In the symmetric regime, if we define A′ as A with a full diagonal, and consider the corresponding hypergraph (V′, E′) = inc(A′), then we have

    (∪_{i∈π_k} v_i) ∪ π_k = ∪_{i∈π_k} v′_i.

In the primary alternate regime when Φ is fixed, if we define A^(k) as A where all columns other than φ_k have been zeroed out,

    |(∪_{i∈π_k} v_i) \ φ_k| = |∪_{i∈π_k} v_i| − |∪_{i∈π_k} v^(k)_i|.

Thus, if we can efficiently count |∪_{i∈π_k} v_i| for arbitrary A, we can compute all of our atoms. The remainder of this discussion therefore focuses on computing |∪_{i∈π_k} v_i|. We know that Σ_{i∈π_k}|v_i| is an easy upper bound on |∪_{i∈π_k} v_i|, but it overcounts columns for each additional row v_i they occur in. If we were somehow able to count only the first appearance of a nonzero column in our part, we could compute |∪_{i∈π_k} v_i|. We refer to the pair of a nonzero entry and its following duplicate nonzero entry in the same column as a link. If a nonzero occurs at row i in some column and the closest following nonzero occurs at row i′ in that column, we call this an (i, i′) link. An (i, i′) link is contained in a part if both of its nonzero entries are. We can think of the second occurrence in each contained link as an overcounted nonzero in our upper bound Σ_{i∈π_k}|v_i|.
Thus, by subtracting the number of links contained in our part from the number of nonzeros in our part, we obtain the number of distinct nonzero columns in the part. If we define a matrix L where l_{ii′} is equal to the number of (i, i′) links in A, then we have

    |∪_{i∈π_k} v_i| = Σ_{i∈π_k} |v_i| − Σ_{(i,i′)∈π_k×π_k} l_{ii′}    (17)

Recall that we can compute the number of nonzeros in our part in constant time using the pos array of CSR format, and only link counting remains. Consider each link (i, i′) as a point (−i, i′) in an integer grid. We say that a point (i_1, i′_1) dominates a point (i_2, i′_2) if i_1 ≥ i_2 and i′_1 ≥ i′_2. Thus, the number of links contained in some part k which extends from s_k = i to s_{k+1} = i′ + 1 is equal to the number of points dominated by (−i, i′). The dominance counting problem in two dimensions asks for a data structure to support queries on the number of dominated points. Our reduction is almost equivalent to the reduction from multicolored one-dimensional dominance counting to two-dimensional standard dominance counting described by Gupta et al., but our reduction requires only one dominance query, while Gupta's requires two [9]. We have re-used the values in the pos array of the link matrix in CSR format to avoid the second dominance query (these queries can be expensive in practice). Dominance counting in two dimensions has been the subject of intensive theoretical study [10], [33], [34]. However, because of the focus on only query time and storage, little attention has been given to construction time, which is always superlinear. Therefore, we modify Chazelle's original dominance counting algorithm to allow us to trade construction time for query time [10]. At this point, the reader may ask whether it is necessary to reduce our problem to the (seemingly more complicated) dominance counting problem. However, the problems are roughly equivalent.
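Before turning to that equivalence, identity (17) is easy to validate by brute force. The sketch below (illustrative; the example rows are assumptions, and the link matrix is represented as a plain list of pairs) builds the links of a small matrix and checks the identity for every contiguous part:

```python
# Brute-force check of identity (17): for any part, the number of distinct
# nonzero columns equals the nonzero count minus the contained links.
rows = [{0, 2}, {2, 3}, {0, 2, 3}, {1, 3}]  # v_i per row (0-based)

links = []          # (i, i'): consecutive appearances of a column
last_seen = {}
for i, v in enumerate(rows):
    for j in sorted(v):
        if j in last_seen:
            links.append((last_seen[j], i))
        last_seen[j] = i

def distinct_columns(lo, hi):
    """|union of v_i| for the part [lo, hi), via identity (17)."""
    nnz = sum(len(rows[i]) for i in range(lo, hi))
    contained = sum(1 for (i, ip) in links if lo <= i and ip < hi)
    return nnz - contained

for lo in range(len(rows)):
    for hi in range(lo, len(rows) + 1):
        direct = len(set().union(set(), *rows[lo:hi]))
        assert distinct_columns(lo, hi) == direct
```

The containment test `lo <= i and ip < hi` is exactly the dominance condition that the data structures below answer in sublinear time.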
If we are given an arbitrary set of points on a bounded grid, we can construct a sparse matrix A with links corresponding to each point on the grid. Then we could compute dominance queries using only the values |∪_{i∈π_k} v_i| and the number of nonzeros in rows of A. Dominance counting (and the related semigroup range sum problem [35], [36], [33]) are roughly equivalent to the problem of computing prefix sums on sparse matrices and tensors in the database community. These data structures are called "Summed Area Tables" or "Data Cubes," and they value dynamic update support and low or constant query time at the expense of storage and construction time. The baseline approach is to fill an (m + 1) × (m + 1) dense matrix with the required values in the sum, taking at least m² time. Improved structures tile the sparse matrix with partially dense data structures, imposing superlinear storage costs with respect to the size of the original matrix. We refer curious readers to [37] for an overview of existing approaches to the sparse prefix sum problem, with the caution that most of these works reference the size of a dense representation of the summed area table when they use words like "sublinear" and "superlinear."

A. Dominance Counting In A Bounded Integer Grid
Chazelle's original formulation of the dominance counting algorithm uses space linear in the number of points to be stored, log-linear construction time, and logarithmic query time. We adapt this structure to a small integer grid, allowing us to trade off construction time and query time. Whereas Chazelle's algorithm can be seen as searching through the wake of a merge sort, our algorithm can be seen as searching through the wake of a radix sort. Our algorithm can also be thought of as a decorated transposition from CSR to CSC. Because the algorithm is so detail-oriented, we give a high-level description, but leave the specifics to the pseudocode presented in Algorithm 5.

Fig. 3: Links of our example matrix A are illustrated as line segments connecting elements of A on the left, and as points (with labeled multiplicities) in the link matrix L on the right. Links residing entirely within part 2 are shown in bold. Part 2 contains two links starting at i = 3 and terminating at i = 5, and three links starting at i = 5 and terminating at i = 6. In total, part 2 contains five links, which is equal to the number of points dominated by our dotted region representing the partition split points.

In this section, assume we have been given N points (i_1, j_1), ..., (i_N, j_N) in the range [1, ..., m] × [1, ..., n]. Since these points come from CSR matrices or our link construction, we assume that our points are initially sorted on their i coordinates (i_q ≤ i_{q+1}) and we have access to an array pos to describe where the points corresponding to each value of i start. If this is not the case, either the matrix or the following algorithm can be transposed, or the points can be sorted with a histogram sort in O(m + N) time.

Algorithm 5.
Construct the dominance counting data structure over N points (i_1, j_1), ..., (i_N, j_N), ordered on i initially. We assume we are given the j coordinates in a vector idx.

    function CONSTRUCTDOMINANCE(N, idx)
        qos ← zero-initialized vector of length n + 2
        qos_1 ← 1
        qos_{n+2} ← N + 1
        tmp ← uninitialized vector of length 2^b + 1
        cnt ← zero-initialized 3-tensor of size (2^b + 1) × (⌊N/2^{b′}⌋ + 1) × H
        byt ← uninitialized vector of length N
        for h ← H, H − 1, ..., 1 do
            for J ← 1, 2^{hb} + 1, ..., n + 1 do
                Fill tmp with zeros.
                for q ← qos_J, qos_J + 1, ..., qos_{J+2^{hb}} − 1 do
                    d ← key_h(idx_q)
                    tmp_{d+1} ← tmp_{d+1} + 1
                end for
                tmp_1 ← qos_J
                for d ← 1, 2, ..., 2^b do
                    tmp_{d+1} ← tmp_d + tmp_{d+1}
                end for
                for q ← qos_J, qos_J + 1, ..., qos_{J+2^{hb}} − 1 do
                    d ← key_h(idx_q)
                    q′ ← tmp_d
                    byt_{q′} ← 2^{hb}·⌊idx_{q′}/2^{hb}⌋ + idx_q mod 2^{hb}
                    tmp_d ← q′ + 1
                end for
                for d ← 1, 2, ..., 2^b do
                    qos_{J + d·2^{(h−1)b}} ← tmp_d
                end for
            end for
            Fill tmp with zeros.
            for Q ← 1, 2^{b′} + 1, ..., N do
                for q ← Q, Q + 1, ..., Q + 2^{b′} − 1 do
                    d ← key_h(idx_q)
                    tmp_d ← tmp_d + 1
                end for
                for d ← 1, 2, ..., 2^b do
                    cnt_{(d+1),Q,h} ← tmp_d + cnt_{d,Q,h}
                end for
            end for
            (idx, byt) ← (byt, idx)
        end for
        byt ← idx
        return (qos, byt, cnt)
    end function

Algorithm 6.
Query the dominance counting data structure over N points (i_1, j_1), ..., (i_N, j_N), ordered on i initially. We assume we are given the j coordinates in a vector idx.

    function QUERYDOMINANCE(i, j)
        ∆q ← pos_i − 1
        j ← j − 1
        c ← 0
        for h ← H, H − 1, ..., 1 do
            j′ ← 2^{hb}·⌊j/2^{hb}⌋
            q_1 ← qos_{j′} − 1
            q_2 ← q_1 + ∆q
            d ← key_h(j)
            Q_1 ← ⌊q_1/2^{b′}⌋ + 1
            Q_2 ← ⌊q_2/2^{b′}⌋ + 1
            c ← c + cnt_{d,Q_2,h} − cnt_{d,Q_1,h}
            ∆q ← cnt_{(d+1),Q_2,h} − cnt_{d,Q_2,h}
            ∆q ← ∆q − (cnt_{(d+1),Q_1,h} − cnt_{d,Q_1,h})
            for q ← 2^{b′}·(Q_1 − 1) + 1, 2^{b′}·(Q_1 − 1) + 2, ..., q_1 do
                d′ ← key_h(byt_q)
                if d′ < d then
                    c ← c − 1
                else if d′ = d then
                    ∆q ← ∆q − 1
                end if
            end for
            for q ← 2^{b′}·(Q_2 − 1) + 1, 2^{b′}·(Q_2 − 1) + 2, ..., q_2 do
                d′ ← key_h(byt_q)
                if d′ < d then
                    c ← c + 1
                else if d′ = d then
                    ∆q ← ∆q + 1
                end if
            end for
        end for
        return c
    end function

The construction phase of our algorithm successively sorts the points on their j coordinates in rounds, starting at the most significant digit and moving to the least. We refer to the ordering at round h as σ_h. We use H rounds, each with b-bit digits, where b is the smallest integer such that H·b ≥ log₂(n). Let key_h(j) refer to the h-th group of b bits of j, counting from the least significant (so key_H is the most significant digit). Formally,

    key_h(j) = ⌊j/2^{(h−1)b}⌋ mod 2^b.

At each round h, our points are sorted on the digits H, ..., h of their j coordinates using a histogram sort within each bucket formed by the previous round. We use an array qos (similar to pos) to store the starting position of each bucket in the current ordering of points. Formally, qos_j will record the starting position for points (i_q, j_q) where j_q ≥ j. Although we only use certain entries of qos during our construction phase, it will be completely filled by the end of the algorithm. Note that qos is of size n + 2 instead of n + 1, as one might expect. This is because we assume that 0 is a possible value of j during construction, and we subtract one from j when we query. This is because (i_1, j_1) dominates (i_2, j_2) when i_1 > i_2 and j_1 > j_2, which is equivalent to i_1 − 1 ≥ i_2 and j_1 − 1 ≥ j_2. Although we can interpret the algorithm as re-sorting the points several times, each construction phase only needs access to its corresponding bit range of j coordinates (the keys) in the current ordering. The query phase needs access to the ordering of keys before executing each phase. Thus, the algorithm iteratively constructs a vector byt, where the h-th group of b bits in byt corresponds to the h-th group of b bits in the current ordering of j coordinates (key_h(byt_q) = key_h(j_{σ_h(q)})). As the construction algorithm proceeds, we can use the lower bits of byt to store the remaining j coordinate bits to be sorted. Each phase of our algorithm needs to sort ⌈n/2^{hb}⌉ buckets. Using our histogram sort with a scratch array of size 2^b, we can sort a bucket of N′ points in O(N′ + 2^b) time. Thus, we can sort the buckets of level h in O(2^b·⌈n/2^{hb}⌉ + N) time, and bucket sorting takes O(n + H·N) time in total over all levels. A query requests the number of points in our data structure dominated by (i, j). Notice that in the initial ordering, i_q < i is equivalent to q < pos_i. Thus, the dominated points reside within the first pos_i − 1 positions of the initial ordering. Our algorithm starts by counting the number of points such that key_H(j_q) < key_H(j) and q < pos_i. All remaining dominated points satisfy key_H(j_q) = key_H(j), so let q′ be the number of points with key_H(j_q) = key_H(j) and q < pos_i. After our first sorting round, the set of points in the initial ordering where key_H(j_q) = key_H(j) would have been stored contiguously, and therefore the first q′ of them satisfy i_{σ_h(q)} < i. We can then apply our procedure recursively within this bucket to count the number of dominated points. We have left out an important aspect of our algorithm. Our query procedure needs to count the number of dominated points that satisfy key_h(j_{σ_h(q)}) < d within ranges of q that agree on the top bits of each j. While qos stores the requisite ranges of q, we still need to count the points. In O(N + 2^b) time, for a particular value of h, we can walk byt from left to right, using a scratch vector of size 2^b to count the number of points we have seen with each value of key_h(byt_q). If we cache a prefix sum of our scratch vector once every 2^{b′} points (the prefix sum takes O(2^b) time), then we can use the cache to jump-start the counting process at query time.
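The digit extraction and per-bucket histogram sort described above can be sketched in Python as follows (a simplification for illustration; the digit width b = 2 and the sample keys are assumptions, and only a single round on one bucket is shown):

```python
b = 2  # digit width in bits (an assumed value)

def key(h, j):
    """key_h(j): the h-th b-bit digit of j, h = 1 being least significant."""
    return (j >> ((h - 1) * b)) % (1 << b)

def histogram_round(js, h):
    """Stably sort one bucket of j coordinates on digit key_h."""
    counts = [0] * ((1 << b) + 1)
    for j in js:                      # histogram of digit values
        counts[key(h, j) + 1] += 1
    for d in range(1 << b):           # prefix sums give bucket start offsets
        counts[d + 1] += counts[d]
    out = [0] * len(js)
    for j in js:                      # stable placement pass
        d = key(h, j)
        out[counts[d]] = j
        counts[d] += 1
    return out

print(histogram_round([13, 2, 7, 8, 1], 2))  # sorted on bits 2..3: [2, 1, 7, 8, 13]
```

Because the placement pass is stable, applying such rounds from h = H down to 1 within the buckets of the previous round yields a full sort on j, which is the ordering the query phase walks back through.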
During a query, after checking our cached count in constant time, we only need to count a maximum of 2^{b′} points to obtain the correct count. Our cached count array is a 3-tensor cnt, where cnt_{hqd} stores the number of points q′ such that q′ < 2^{b′} · q and key_h(j_{σ_h(q′)}) < d. If we cache every 2^{b′} points, computing cnt takes O(2^b⌈N/2^{b′}⌉) = O(N · 2^{b−b′}) time per phase. Thus, construction takes time

    O(n + HN(1 + 2^{b−b′})).   (18)

Each query needs to traverse H levels, each level taking O(2^{b′}) time, so queries require time

    O(H · 2^{b′}).   (19)

The pos vector uses m + 1 words, the qos vector uses n + 2 words, the byt vector uses N words, and the cnt tensor uses at most HN · 2^{b−b′} words. Thus, the structure uses

    m + n + 3 + N(1 + H · 2^{b−b′})   (20)

words.

A careful reader may notice that our complexity differs slightly from Chazelle's original complexity. Since Chazelle only considered the case where 2^b = 2, each key was one bit, and Chazelle opted to store the byt array as a set of bit-packed vectors (if we think of bits as a dimension, this would be analogous to the transpose of our storage format). Because the size of a word bounded the size of the input, and Chazelle assumed that we could count the number of set bits in a word in constant time, this saved another log factor at query time. Since we will choose b > 1 in most cases, it is likely that such bit counting instructions will not benefit us significantly even if they existed for 64-bit integers.

B. Balancing the Tree
Our primary alternate cost function requires us to create K separate dominance counters, one for each set of columns φ_k. Unfortunately, the runtime and storage of our dominance counting data structure as stated depend on the dimension n. Thus, if each processor required a dominance counter, then the total runtime and storage would be O(nK). To avoid this contingency, we may choose to first transform our points into "rank space" [38], [10]. Recall that our points (i_1, j_1), ..., (i_N, j_N) are given to us in i-major, j-minor order. We start by resorting the points into a j-major, i-minor order (i_{σ(1)}, j_{σ(1)}), ..., (i_{σ(N)}, j_{σ(N)}). Our transformation maps a point (i_q, j_q) to its position pair (q, σ(q)). Notably, i_q < i_{q′} if and only if q < q′, and j_q < j_{q′} if and only if σ(q) < σ(q′), so our dominance counts are preserved under our transformation. If we store our two orderings, we can binary search to find the transformed point at query time.

Not counting the initial resorting, our new construction time becomes

    O(N + HN(1 + 2^{b−b′})).   (21)

Since we need to perform binary searches to transform query points, the new query time would be

    O(log(m) + log(n) + H · 2^{b′}).   (22)

Since the i and j in the new structure are the integers 1, ..., N, pos and qos are the identity and we no longer need to store them. However, we do need to store our two orderings, and our storage cost becomes

    N(3 + H · 2^{b−b′})   (23)

words.

In the primary alternate case, where we wish to create separate dominance counters corresponding to points in each column part φ_k of the matrix, we can sort the entire link matrix of all of the columns into both i-major and j-major orders in linear time with, for example, a counting sort (transposition). We can then stratify the links in the overall ordering by their corresponding column parts in linear time, producing the K requisite lists of sorted points required to count dominance within each column part φ_k.
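A minimal sketch of the rank space transformation (the helper name is ours, and we assume distinct i and distinct j coordinates so that ties do not arise; the binary searches needed to transform query points are omitted):

```python
def to_rank_space(points):
    """Map points, given in i-major order, to (q, r) where q is the point's
    position in the i-major order and r its position in j-major order.
    With distinct coordinates, dominance between points is preserved."""
    n = len(points)
    by_j = sorted(range(n), key=lambda q: points[q][1])
    jrank = [0] * n
    for r, q in enumerate(by_j):
        jrank[q] = r
    return [(q, jrank[q]) for q in range(n)]

points = [(1, 50), (3, 10), (7, 40), (9, 20)]  # already in i-major order
ranked = to_rank_space(points)
assert ranked == [(0, 3), (1, 0), (2, 2), (3, 1)]
```

After the transformation both coordinates live in {0, ..., N − 1}, so the counter's footprint depends only on the number of points, not on m or n.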
Since the storage and construction runtime of the new dominance counters depend multiplicatively on the number of points, and the total number of points over all of the dominance counters is N, we can construct all the dominance counters in linear time.

The reader may ask why we have bothered to present the non-transformed version of our dominance counter first. We do so because the non-transformed counter combines the act of sorting the points with the act of constructing the dominance counter, which is likely to be practically faster than the transformed version, which essentially sorts the points twice.

TABLE I: Relevant quantities involved in runtime tradeoffs

(14) Algorithm 3 max queries: K log(m) log(c_high/(c_low ǫ))
(15) Algorithm 3 max queries (subadditive cost): K log(m) log(K/ǫ)
(16) Algorithm 4 max queries: K log(m)
(21) Dominance construction time: O(N + HN(1 + 2^{b−b′}))
(22) Dominance query time: O(log(m + n) + H · 2^{b′})
(23) Dominance storage: N(3 + H · 2^{b−b′})

C. Sparse Prefix Sums
Like Chazelle's algorithm, our dominance counting algorithm can be extended to compute prefix sums of non-Boolean values over a bounded integer grid. At each level, we simply need to store the values associated with the current ordering of points. When we query the structure, in the same way that we count c, the number of points in our bucket where d′ < d, we would also need to sum the values associated with these points. In order to maintain the same asymptotic runtime, this necessitates the use of an array similar to cnt which records a prefix sum of the values of previous points (rather than their count) in the ordering at each level. These modifications would not increase the runtime of construction or queries, but would increase the storage by HN words, so that the storage requirement for a sparse prefix sum would be

    m + n + 3 + N(1 + H + H · 2^{b−b′}).   (24)

VI. PUTTING IT TOGETHER
Having described our partitioners and routines to compute our cost functions, we can combine the two and evaluate the runtime of the overall algorithm. Our analysis balances the runtime of constructing the dominance counting data structure against the runtime of completing the queries.

As we explain in the beginning of Section V, all of our cost functions can be evaluated with at most two calls to a dominance counting data structure. In the primary alternate case, we need to construct a dominance counter for each processor over N points total. Therefore, we assume a transformation to rank space for our runtime analysis, and include the cost of a linear-time counting sort in our construction phase. Table I describes the relevant quantities.

Combining the construction time, number of queries, and time per query (assuming that the query time is dominated by dominance counting), our approximate partitioner (Algorithm 3) runs in time

    O(m + n + HN(1 + 2^{b−b′}) + K log(K/ǫ) log(m)(log(m + n) + H · 2^{b′}))   (25)

for subadditive costs, and our exact partitioner (Algorithm 4) runs in time

    O(m + n + HN(1 + 2^{b−b′}) + K log(m)(log(m + n) + H · 2^{b′})).   (26)

There are several ways to set H, b, and b′. Chazelle considered an algorithm which sets H = log₂(N), 2^b = 2, and b′ = log₂(log₂(N)). While storage would be linear and the query time polylogarithmic, constructing a dominance counter with these settings would require log₂(N) passes over the data, which is onerous for realistic problem sizes (roughly 20 passes for a million nonzeros).

Thus, we recommend setting H, the height of the tree and the number of passes, to a small constant (we use H = 3 below), while keeping storage costs linear, since storage is often a critical resource in scientific computing. For correctness, we minimize b subject to H · b ≥ log₂(N). We minimize b′ subject to 2^{b′} ≥ H · 2^b to ensure that the footprint of our dominance counter is at most four times the size of A.
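As an arithmetic sanity check on these parameter choices (a sketch under our reading of the constraints, namely b = ⌈log₂(N)/H⌉ and 2^{b′} ≥ H · 2^b; the helper name is ours):

```python
import math

def choose_parameters(N, H):
    """Hypothetical parameter choice: H digits of b bits cover log2(N) bits,
    and b' is large enough that the cnt cache adds at most one word per point."""
    b = math.ceil(math.log2(N) / H)
    bp = b + math.ceil(math.log2(H))  # so that H * 2**(b - bp) <= 1
    return b, bp

b, bp = choose_parameters(10**6, 3)
assert (b, bp) == (7, 9)
assert 3 * 2 ** (b - bp) <= 1      # storage term H * 2**(b - b') stays bounded
assert 3 * b >= math.log2(10**6)   # all bits of a rank-space coordinate covered
```

For a billion points, the same choice gives b = 10 and b′ = 12, so each histogram still fits comfortably in cache.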
Assuming H is a constant, our approximate partitioner (Algorithm 3) runs in time

    O(m + n + HN + K log(K/ǫ) log(m) H² N^{1/H})   (27)

for subadditive costs, and our exact partitioner (Algorithm 4) runs in time

    O(m + n + HN + K log(m) H² N^{1/H}).   (28)

With these parameters, the approximate partitioner runs in linear time when K log(1/ǫ) grows strictly slower than N^{1−1/H}, and the exact partitioner runs in linear time when K grows strictly slower than N^{1−1/H} (in both cases, up to logarithmic factors). Our algorithms can run in linear time precisely when our partitioners use strictly sublinear queries, since we are able to offset polynomial query time decreases with logarithmic construction time increases.

For our practical choice of H = 3, K log(1/ǫ) needs to grow slower than N^{2/3} for linear time approximate partitioning, and K needs to grow slower than N^{2/3} for linear time exact partitioning. However, both algorithms use dynamic bounds on split indices to reduce the number of probes, so they are likely to outperform these worst-case estimates. Furthermore, K, the number of processors, is often a relatively small constant.

VII. CONCLUSION
Traditional graph partitioning approaches have two main limitations: the cost models are highly simplified, and the problems are NP-hard. While the ordering of the rows and columns of a matrix does not affect the meaning of the described linear operation, it often carries useful information about the problem structure. Contiguous partitioning shifts the burden of reordering onto the user, asking them to use domain-specific knowledge or known heuristics to produce good orderings. In exchange, we show that the contiguous partitioning problem can be solved optimally in linear time and space, and that it provably optimizes cost models which are closer to the realities of distributed parallel computing.

Researchers point out that traditional graph partitioning approaches are inaccurate, since they minimize the total communication, rather than the maximum runtime of any processor under both work and communication factors [6], [20], [18], [30]. We show that, in the contiguous partitioning case, we can efficiently minimize the maximum runtime under cost models which include communication factors.

We present a rich framework for constructing and optimizing expressive cost models for contiguous decompositions of iterative solvers. Our only constraints are monotonicity and perhaps subadditivity. Using a set of efficiently computable "atoms", we can construct complex "molecules" of cost functions which express complicated nonlinear dynamics such as cache effects, memory constraints, and communication costs.

Our algorithm is the first to provably balance communication costs of contiguous partitions in linear time and linear space. We show that we can compute our communication costs with dominance counting, and we generalize a classical dominance counting algorithm to reduce construction time by increasing query time. Our new data structure can also be used to compute sparse prefix sums.
We show that, with minor adaptation, state-of-the-art load balancing algorithms are capable of optimizing our cost functions in linear time.

In this version of the manuscript, we only address algorithmic challenges. Future work includes an empirical evaluation of serial and parallel implementations, as well as an exploration of the effects of contiguous communication balancing under popular row and column reorderings.

REFERENCES

[1] Tony F. Chan, P. Ciarlet, and W. K. Szeto. On the Optimality of the Median Cut Spectral Bisection Graph Partitioning Method. SIAM Journal on Scientific Computing, 18(3):943–948, May 1997.
[2] Kevin Aydin, MohammadHossein Bateni, and Vahab Mirrokni. Distributed Balanced Partitioning via Linear Embedding. Algorithms, 12(8):162, August 2019.
[3] George Karypis and Vipin Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, 20(1):359–392, January 1998.
[4] U. V. Catalyurek and C. Aykanat. Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions on Parallel and Distributed Systems, 10(7):673–693, July 1999.
[5] Ali Pınar and Cevdet Aykanat. Fast optimal load balancing algorithms for 1D partitioning. Journal of Parallel and Distributed Computing, 64(8):974–996, August 2004.
[6] Bruce Hendrickson. Load balancing fictions, falsehoods and fallacies. Applied Mathematical Modelling, 25(2):99–108, December 2000.
[7] Mohammad Ashraf Iqbal. Approximate algorithms for partitioning problems. International Journal of Parallel Programming, 20(5):341–361, October 1991.
[8] D. M. Nicol. Rectilinear Partitioning of Irregular Data Parallel Computations. Journal of Parallel and Distributed Computing, 23(2):119–134, November 1994.
[9] P. Gupta, R. Janardan, and M. Smid. Further Results on Generalized Intersection Searching Problems: Counting, Reporting, and Dynamization. Journal of Algorithms, 19(2):282–317, September 1995.
[10] Bernard Chazelle. A Functional Approach to Data Structures and Its Use in Multidimensional Searching. SIAM Journal on Computing, 17(3):427–462, June 1988.
[11] Erik Saule, Erdeniz Ö. Baş, and Ümit V. Çatalyürek. Load-balancing spatially located computations using rectangular partitions. Journal of Parallel and Distributed Computing, 72(10):1201–1214, October 2012.
[12] Abdurrahman Yaşar and Ümit V. Çatalyürek. Heuristics for Symmetric Rectilinear Matrix Partitioning. arXiv:1909.12209 [cs], September 2019.
[13] Yousef Saad. Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia, 2nd edition, 2003.
[14] Tamara G. Kolda. Partitioning sparse rectangular matrices for parallel processing. In Alfonso Ferreira, José Rolim, Horst Simon, and Shang-Hua Teng, editors, Solving Irregularly Structured Problems in Parallel, Lecture Notes in Computer Science, pages 68–79, Berlin, Heidelberg, 1998. Springer.
[15] Bruce Hendrickson and Tamara G. Kolda. Partitioning Rectangular and Structurally Unsymmetric Sparse Matrices for Parallel Processing. SIAM Journal on Scientific Computing, 21(6):2048–2072, January 2000.
[16] M. R. Garey, D. S. Johnson, and L. Stockmeyer. Some simplified NP-complete graph problems. Theoretical Computer Science, 1(3):237–267, February 1976.
[17] Thomas Lengauer. Combinatorial Algorithms for Integrated Circuit Layout. John Wiley & Sons, Inc., USA, 1990.
[18] Rob H. Bisseling and Wouter Meesen. Communication balancing in parallel sparse matrix-vector multiplication. Electronic Transactions on Numerical Analysis, 21:47–65, 2005.
[19] A. Pinar and B. Hendrickson. Partitioning for complex objectives. In Proceedings 15th International Parallel and Distributed Processing Symposium (IPDPS 2001), pages 1232–1237, April 2001.
[20] Bora Uçar and Cevdet Aykanat. Encapsulating Multiple Communication-Cost Metrics in Partitioning Sparse Rectangular Matrices for Parallel Matrix-Vector Multiplies. SIAM Journal on Scientific Computing, 25(6):1837–1859, January 2004.
[21] Kadir Akbudak, Oguz Selvitopi, and Cevdet Aykanat. Partitioning Models for Scaling Parallel Sparse Matrix-Matrix Multiplication. ACM Transactions on Parallel Computing, 4(3):13:1–13:34, January 2018.
[22] Mehmet Deveci, Kamer Kaya, Bora Uçar, and Umit V. Çatalyürek. Hypergraph Partitioning for Multiple Communication Cost Metrics. Journal of Parallel and Distributed Computing, 77(C):69–83, March 2015.
[23] Seher Acer, Oguz Selvitopi, and Cevdet Aykanat. Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems. Parallel Computing, 59:71–96, November 2016.
[24] M. Ozan Karsavuran, Seher Acer, and Cevdet Aykanat. Reduce Operations: Send Volume Balancing While Minimizing Latency. IEEE Transactions on Parallel and Distributed Systems, 31(6):1461–1473, June 2020.
[25] Brian W. Kernighan. Optimal Sequential Partitions of Graphs. Journal of the ACM, 18(1):34–40, January 1971.
[26] Anael Grandjean, Johannes Langguth, and Bora Uçar. On Optimal and Balanced Sparse Matrix Partitioning Problems. In , pages 257–265, September 2012.
[27] György Dósa. The Tight Bound of First Fit Decreasing Bin-Packing Algorithm Is FFD(I) ≤ 11/9·OPT(I) + 6/9. In Bo Chen, Mike Paterson, and Guochuan Zhang, editors, Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, Lecture Notes in Computer Science, pages 1–11, Berlin, Heidelberg, 2007. Springer.
[28] D. M. Nicol and D. R. O'Hallaron. Improved algorithms for mapping pipelined and parallel computations. IEEE Transactions on Computers, 40(3):295–306, March 1991.
[29] Louis H. Ziantz, Can C. Özturan, and Boleslaw K. Szymanski. Run-time optimization of sparse matrix-vector multiplication on SIMD machines. In Costas Halatsis, Dimitrios Maritsas, George Philokyprou, and Sergios Theodoridis, editors, PARLE'94 Parallel Architectures and Languages Europe, Lecture Notes in Computer Science, pages 313–322, Berlin, Heidelberg, 1994. Springer.
[30] Sivasankaran Rajamanickam and Erik Boman. Parallel partitioning with Zoltan: Is hypergraph partitioning worth it? In David Bader, Henning Meyerhenke, Peter Sanders, and Dorothea Wagner, editors, Contemporary Mathematics, volume 588, pages 37–52. American Mathematical Society, Providence, Rhode Island, January 2013.
[31] Peter Ahrens and Erik G. Boman. On Optimal Partitioning For Sparse Matrices In Variable Block Row Format. arXiv:2005.12414 [cs], May 2020.
[32] C. J. Alpert and A. B. Kahng. Multiway partitioning via geometric embeddings, orderings, and dynamic programming. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 14(11):1342–1358, November 1995.
[33] Joseph JaJa, Christian W. Mortensen, and Qingmin Shi. Space-efficient and fast algorithms for multidimensional dominance reporting and counting. In Proceedings of the 15th International Conference on Algorithms and Computation (ISAAC '04), pages 558–568, Hong Kong, China, December 2004. Springer-Verlag.
[34] Timothy M. Chan and Bryan T. Wilkinson. Adaptive and Approximate Orthogonal Range Counting. ACM Transactions on Algorithms, 12(4):45:1–45:15, September 2016.
[35] Bernard Chazelle. Lower bounds for orthogonal range searching: part II. The arithmetic model. Journal of the ACM, 37(3):439–463, July 1990.
[36] S. Alstrup, G. Stolting Brodal, and T. Rauhe. New data structures for orthogonal range searching. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 198–207, November 2000.
[37] Michael Shekelyan, Anton Dignös, and Johann Gamper. Sparse prefix sums: Constant-time range sum queries over sparse multidimensional data cubes. Information Systems, 82:136–147, May 2019.
[38] Harold N. Gabow, Jon Louis Bentley, and Robert E. Tarjan. Scaling and related techniques for geometry problems. In Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing (STOC '84), pages 135–143, New York, NY, USA, December 1984. Association for Computing Machinery.
[39] Brad Jackson, Jeffrey D. Scargle, David Barnes, Sundararajan Arabhi, Alina Alt, Peter Gioumousis, Elyus Gwin, Paungkaew Sangtrakulcharoen, Linda Tan, and Tun Tao Tsai. An Algorithm for Optimal Partitioning of Data on an Interval. IEEE Signal Processing Letters, 12(2):105–108, February 2005.
[40] H. A. Choi and B. Narahari. Efficient Algorithms for Mapping and Partitioning a Class of Parallel Computations. Journal of Parallel and Distributed Computing, 19(4):349–363, December 1993.
[41] B. Olstad and F. Manne. Efficient partitioning of sequences. IEEE Transactions on Computers, 44(11):1322–1326, November 1995.

APPENDIX
A. Monotonic Decreasing Partitioners
Here we present modified versions of Algorithms 2, 3, and 4 which handle monotonic decreasing cost functions.
Algorithm 7.
Given a monotonic decreasing cost function f_k defined on pairs of split points, a starting split point i, and a maximum cost c, find the least i′ such that i < i′ and f_k(i, i′) ≤ c, subject to i′_low ≤ i′ ≤ i′_high. Returns i′_high + 1 if no cost at most c can be found.

function ReverseSearch(f_k, i, i′_low, i′_high, c)
    i′_low ← max(i, i′_low)
    while i′_low ≤ i′_high do
        i′ ← ⌊(i′_low + i′_high)/2⌋
        if f_k(i, i′) ≤ c then
            i′_high ← i′ − 1
        else
            i′_low ← i′ + 1
        end if
    end while
    return i′_low
end function

Algorithm 8 (Reverse BISECT Partitioner). Given monotonic decreasing cost function(s) f defined on pairs of split points, find a contiguous K-partition Π which minimizes c = max_k f_k(s_k, s_{k+1}) to a relative accuracy of ǫ within the range c_low ≤ c ≤ c_high, if such a partition exists.

function ReverseBisectPartition(f, m, c_low, c_high, ǫ)
    (s^high_1, ..., s^high_{K+1}) ← (1, m + 1, m + 1, ..., m + 1)
    (s^low_1, ..., s^low_{K+1}) ← (1, 1, ..., 1)
    (s_1, ..., s_{K+1}) ← (1, 1, ..., 1)
    while c_low(1 + ǫ) < c_high do
        c ← (c_low + c_high)/2
        s_1 ← 1
        t ← true
        for k = 1, 2, ..., K do
            s_{k+1} ← ReverseSearch(f_k, s_k, s^low_{k+1}, s^high_{k+1}, c)
            if s_{k+1} > s^high_{k+1} then
                t ← false
                (s_{k+1}, ..., s_{K+1}) ← (s^high_{k+1}, ..., s^high_{K+1})
                break
            end if
        end for
        if t then
            c_high ← c
            S^low ← S
        else
            c_low ← c
            S^high ← S
        end if
    end while
    s^high_{K+1} ← m + 1
    return S^high
end function

Algorithm 9 (Reverse NICOL Partitioner). Given monotonic decreasing cost function(s) f defined on pairs of split points, find a contiguous K-partition Π which minimizes c = max_k f_k(s_k, s_{k+1}) within the range c_low ≤ c ≤ c_high, if such a partition exists.
function ReverseNicolPartition(f, m, c_low = −∞, c_high = ∞)
    (s^high_1, ..., s^high_{K+1}) ← (1, m + 1, m + 1, ..., m + 1)
    (s^low_1, ..., s^low_{K+1}) ← (1, 1, ..., 1)
    (s_1, ..., s_{K+1}) ← (1, 1, ..., 1)
    for k ← 1, 2, ..., K do
        i ← s_k
        i′_high ← s^high_{k+1}
        i′_low ← max(s_k, s^low_{k+1})
        while i′_low ≤ i′_high do
            i′ ← ⌊(i′_low + i′_high)/2⌋
            c ← f_k(i, i′)
            if c_low ≤ c < c_high then
                s_{k+1} ← i′
                t ← true
                for k′ = k + 1, k + 2, ..., K do
                    s_{k′+1} ← ReverseSearch(f_{k′}, s_{k′}, s^low_{k′+1}, s^high_{k′+1}, c)
                    if s_{k′+1} > s^high_{k′+1} then
                        t ← false
                        (s_{k′+1}, ..., s_{K+1}) ← (s^high_{k′+1}, ..., s^high_{K+1})
                        break
                    end if
                end for
                if t then
                    c_high ← c
                    i′_low ← i′ + 1
                    S^low ← S
                else
                    c_low ← c
                    i′_high ← i′ − 1
                    S^high ← S
                end if
            else if c ≥ c_high then
                i′_low ← i′ + 1
            else
                i′_high ← i′ − 1
            end if
        end while
        if i′_low > s^high_{k+1} then
            break
        end if
        s_{k+1} ← i′_high
    end for
    s^high_{K+1} ← m + 1
    return S^high
end function

The intuition behind these algorithms is the opposite of the monotonic increasing case. Instead of searching for the last split point whose cost is less than a candidate cost, we search for the first. Instead of looking for a candidate partition which can reach the last row without exceeding the cost, we look for a candidate partition which does not need to pass the last row to avoid exceeding the cost. The majority of the changes reside in small details, which we leave to the pseudocode.
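A runnable sketch of the bisection inside ReverseSearch, assuming a cost that is monotonic decreasing in i′ (the toy cost and function names are ours; infeasibility is reported as i′_high + 1):

```python
def reverse_search(f, i, lo, hi, c):
    """Find the least i' in [max(i, lo), hi] with f(i, i') <= c, for f
    monotonic decreasing in its second argument. Returns hi + 1 if no
    feasible i' exists in the range."""
    lo = max(i, lo)
    while lo <= hi:
        mid = (lo + hi) // 2
        if f(i, mid) <= c:
            hi = mid - 1   # feasible: look for an even smaller i'
        else:
            lo = mid + 1   # infeasible: the threshold lies to the right
    return lo

# toy decreasing cost: f(i, i') = 100 - (i' - i)
f = lambda i, ip: 100 - (ip - i)
assert reverse_search(f, 0, 1, 200, 90) == 10  # least i' with 100 - i' <= 90
assert reverse_search(f, 0, 1, 5, 90) == 6     # infeasible range -> hi + 1
```

Compared with the increasing case, only the direction of the feasible region flips; the bisection skeleton is unchanged.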
B. A Word On Dynamic Programming
Our complex dominance counting data structure is required because our partitioners require random access to the cost function. This may lead readers to wonder whether dynamic programming would offer a simpler algorithmic solution, since dynamic programming queries the cost function in a structured pattern. Indeed, it is possible to avoid the dominance counting tree entirely during dynamic programming by using a hash-based structure similar to those of [31], [26], [39], [29]. These structures compute all values of f(i, i′) for each i′ in turn, and the dynamic programming solutions of Choi et al. and Olstad et al. compare the split points of all i for parts ending at each i′ in turn [40], [41]. Since the pattern for computing costs matches the pattern for querying costs, we can use these algorithmic structures to provide amortized constant-time access to each query. Unfortunately, the dynamic programming solutions make K passes over the rows, performing O(K) work for each row. The total number of probes becomes O(K(m − K)), which would be superlinear by a factor of K when K is not a constant. The recursive bisection heuristics suggested by Pinar et al. to provide static split point bounds to empirically accelerate the dynamic programming algorithm require random access to the cost function, and, if implemented using an approach similar to that of Ziantz et al., would require N log(m) time to execute [5], [29]. Therefore, we will not investigate dynamic programming solutions in this initial version of the manuscript.

C. A Word On Total Versus Maximum Communication
Our suggested initial cost function (7) for primary alternate partitioning simply assumes that no columns are local to any part. If we included only this communication term, our partitioner would find a partition Π to minimize

    max_k |⋃_{i∈π_k} v_i|.

At first glance, this appears to disagree with the (λ − 1) hypergraph communication metric of Problem 3, which minimizes the total number of nonlocal columns (where we must also optimize Φ),

    Σ_k |⋃_{i∈π_k} v_i \ φ_k| = Σ_{e_j∈E} (|λ_j| − 1).

However, it is worth pointing out that the total number of nonlocal columns differs from the total number of incident columns by the constant |E|, and thus, (λ − 1) hypergraph partitioning also minimizes

    |E| + Σ_{e_j∈E} (|λ_j| − 1) = Σ_{e_j∈E} |λ_j| = Σ_k |⋃_{i∈π_k} v_i|.
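The identity can be checked numerically on a toy instance (the incidence sets below are made up; rows are vertices and columns are hyperedges):

```python
# rows' nonzero column sets v_i, and a 2-way contiguous row partition
v = {1: {1, 2}, 2: {2, 3}, 3: {3, 4}, 4: {1, 4}}
parts = [{1, 2}, {3, 4}]  # pi_1 = rows {1, 2}, pi_2 = rows {3, 4}

# |lambda_j| = number of parts whose rows touch column j (connectivity)
cols = set().union(*v.values())
lam = {j: sum(any(j in v[i] for i in p) for p in parts) for j in cols}

# left side: total incident columns, summed over parts
total_incident = sum(len(set().union(*(v[i] for i in p))) for p in parts)
# right side: |E| plus the (lambda - 1) connectivity metric
connectivity_minus_one = sum(lam[j] - 1 for j in cols)
assert total_incident == len(cols) + connectivity_minus_one
```

Here each part touches three columns (six incident columns total), two columns are cut (connectivity 2), and indeed 6 = 4 + 2, so minimizing the (λ − 1) metric and minimizing total incident columns coincide.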