A Distributed-Memory Algorithm for Computing a Heavy-Weight Perfect Matching on Bipartite Graphs
Ariful Azad (CRD, Lawrence Berkeley National Laboratory, CA), Aydın Buluç (CRD, Lawrence Berkeley National Laboratory, CA), Xiaoye S. Li (CRD, Lawrence Berkeley National Laboratory, CA), Xinliang Wang (Tsinghua University, Beijing, China), Johannes Langguth (Simula Research Laboratory, Fornebu, Norway)
Abstract.
We design and implement an efficient parallel approximation algorithm for the problem of maximum weight perfect matching in bipartite graphs, i.e., the problem of finding a set of non-adjacent edges that covers all vertices and has maximum weight. This problem differs from the maximum weight matching problem, for which scalable approximation algorithms are known. It is primarily motivated by finding good pivots in scalable sparse direct solvers before factorization, where sequential implementations of maximum weight perfect matching algorithms, such as those available in MC64, are widely used due to the lack of scalable alternatives. To overcome this limitation, we propose a fully parallel distributed-memory algorithm that first generates a perfect matching and then searches for weight-augmenting cycles of length four in parallel and iteratively augments the matching with a vertex-disjoint set of such cycles. For most practical problems, the weights of the perfect matchings generated by our algorithm are very close to the optimum. An efficient implementation of the algorithm scales up to 256 nodes (17,408 cores) on a Cray XC40 supercomputer and can solve instances that are too large to be handled by a single node using the sequential algorithm.
Key words.
Bipartite graphs, matching, parallel approximation algorithms, graph theory, transversals
1. Introduction.
The maximum cardinality matching (MCM) problem is a classical topic in combinatorial optimization. Given a graph, it asks for a set of non-incident edges of maximum size. For the bipartite version of the problem, efficient sequential algorithms such as Hopcroft-Karp [15] have been known for a long time. Practical algorithms for bipartite MCM have recently been studied intensively [5, 10, 19], leading to the development of scalable distributed-memory algorithms [4, 20].

Finding a maximum-cardinality matching on a bipartite graph that also has maximum weight (also called the assignment problem in the literature) is a harder problem, both w.r.t. complexity and in practice. This is because in the transversal problem (i.e., maximum cardinality matching) the edges are interchangeable, but in the assignment problem they are not. As a result, only cardinality matchings allow concurrent vertex-disjoint searches in the graph, which makes the assignment problem harder to parallelize. Recently published attempts at parallelizing the assignment problem, e.g., [30], rely on the auction paradigm [7]. While this approach has demonstrated some speedups, its scalability is limited and it is inefficient in maximizing the cardinality in distributed memory [28].

In this paper, we follow a different approach. Instead of relaxing both the maximum cardinality and maximum weight requirements at the same time, we only relax the maximum weight requirement and use an algorithm that always returns maximum cardinality. This means that we solve the transversal problem optimally and the assignment problem approximately. We only consider graphs that have perfect matchings. Hence, we approximate the weight of a maximum-weight perfect matching (MWPM) and thus solve the approximate-weight perfect matching (AWPM) problem on distributed-memory machines.

The motivation for this problem comes from sparse direct solvers. Often, sparse linear systems are too large to be solved on a single node, necessitating distributed-memory solvers such as SuperLU_DIST [21]. Partial pivoting, which is stable in practice, requires dynamic row exchanges that are too expensive to perform in the distributed case. Consequently, distributed-memory solvers often resort to static pivoting, where the input is pre-permuted to have a "heavy" diagonal so that the factorization can proceed without further pivoting. The definition of "heavy" at the minimum implies having only nonzeros on the diagonal. Whether maximizing the product or the sum of the absolute values of the diagonal is the right choice for the objective function is debatable, but both can be solved by finding a perfect bipartite matching of maximum weight. In this formulation, rows and columns of the sparse matrix serve as vertices on each side of the bipartite graph, and nonzeros as edges between them.

Since the input matrix is already distributed as part of the library requirements, it is necessary to use distributed-memory parallel matching algorithms. However, the lack of scalable matching algorithms forces distributed-memory solvers to assemble the entire instance on a single node and then use a sequential matching library, such as the highly optimized implementations of MWPM algorithms available in MC64 [11]. For instance, the new algorithm in SuperLU_DIST demonstrated strong scaling to 24,… cores.

The contributions of this paper can be summarized as follows:
• Algorithm: We present a highly parallel algorithm for the approximate-weight perfect bipartite matching problem.
• Usefulness: The presented algorithm returns perfect matchings that are often within 99% of the optimum weight.
• Performance: We provide a hybrid OpenMP-MPI implementation that runs significantly faster than a sequential implementation of the optimum algorithm (up to 2500× faster on 256 nodes of NERSC/Cori). On 256 nodes of the same system, the parallel implementation attains up to 114× speedup relative to its runtime on a single node.
• Impact: The presented algorithm can be used to find good pivots in distributed sparse direct solvers such as SuperLU_DIST, eliminating a longstanding performance bottleneck.
2. Related Work.
The bipartite maximum cardinality matching problem has been studied for more than a century, and many different algorithms for solving it have been published over the years [2, 12, 15, 27]. Experimental studies [10, 16] established that, when using heuristic initialization [18], optimized variants of two different approaches, the Pothen-Fan algorithm [27] and the push-relabel algorithm [12], provide superior practical performance in the sequential case. Both algorithms have efficient shared-memory counterparts [6, 19] which show good scaling on a single compute node. For distributed-memory systems, however, the problem has proven to be extremely challenging. Due to the suspected inherent sequentiality of the problem, i.e., no theoretically efficient parallel algorithm is known, such parallel algorithms tend to require a large number of consecutive communication rounds. More recently, a push-relabel variant that exploits the fact that local matching can be performed at a much faster rate than nonlocal operations was presented [20]. A different approach formulated the problem in terms of sparse matrix operations [4]. An implementation of the resulting algorithm scaled up to 16,384 cores. Our work uses this algorithm as a subroutine.

For the weighted case, parallel approximation algorithms have been shown to scale very well [24], even in distributed memory [23]. Furthermore, these algorithms also work for nonbipartite graphs. On the other hand, exact algorithms such as successive shortest paths have proven difficult to parallelize, even for shared memory. Currently, auction algorithms [7, 30], which essentially constitute a weighted version of the push-relabel algorithm, are a promising direction and can efficiently find matchings of near-maximum weight, but they tend to be very inefficient at finding perfect cardinality matchings in distributed memory [28]. In shared memory, however, they can be competitive [13]. For that case, Hogg and Scott showed that the auction algorithm provides matrix orderings for direct linear solvers of a quality similar to the exact method [14].

The aim of our work is similar to Hogg and Scott's. However, since we target distributed-memory machines, we need to develop a different algorithm. Pettie and Sanders [26] described and analyzed sequential linear-time (2/3 − ε)-approximation algorithms for the general weighted matching problem. Our approximation idea of using cycles of length 4 is inspired by this work.
3. Notation and Background.
For any matrix A of size n × n′ there is a weighted bipartite graph G = (R ∪ C, E, w) whose vertex set consists of n row vertices in R and n′ column vertices in C. For each nonzero entry a_ij of A, E contains an undirected edge {i, j} that is incident to row vertex i and column vertex j and has weight w({i, j}) = a_ij.

Given a bipartite graph G = (R ∪ C, E, w), a matching M is a subset of E such that at most one edge in M is incident on any vertex. Given a matching M in G, an edge is matched if it belongs to M, and unmatched otherwise. Similarly, a vertex is matched if it is an endpoint of a matched edge. If an edge {i, j} is matched, we define m_j := i and m_i := j and call the vertices mates of each other. A matching M is maximal if there is no other matching M′ that properly contains M, and M is maximum if |M| ≥ |M′| for every matching M′ in G. Furthermore, if |M| = |R| = |C|, M is called a perfect matching. When A has full structural rank, the corresponding bipartite graph G has a perfect matching. Since this paper focuses on perfect matching algorithms, we assume |R| = |C|. We denote |R| by n and |E| by m. Now, the perfect matching problem consists of either finding a matching that is perfect, or deciding that no such matching exists in the input graph.

The weight w(M) of a matching M is simply the sum of the weights of its edges. The maximum weight matching problem asks for a matching of maximum weight regardless of cardinality, while the maximum weight perfect matching (MWPM) problem asks for a perfect matching that has maximum weight among all perfect matchings. If we multiply all weights by −1, the MWPM problem becomes equivalent to the minimum weight perfect matching problem [31]. All the above problems can be further subdivided into the bipartite and the general case, with the latter often requiring significantly more sophisticated algorithms. In the following, we will restrict our discussion to the bipartite case.

An M-alternating cycle in G is a cycle whose edges are alternately matched and unmatched in matching M. By exchanging the matched and unmatched edges of an M-alternating cycle, we get a new matching of the same cardinality, but potentially different weight. If the new weight is higher, the alternating cycle is also an M-augmenting cycle. The difference in weight is called the gain of the cycle. Let W(i, j, m_j, m_i) denote the gain of the cycle formed by the vertices i and j and their mates m_j and m_i. A cycle that contains k vertices and k edges is called a k-cycle.

Since we study a maximization problem, for any α ∈ (0, 1], a perfect matching M in G is α-optimum if w(M) ≥ α·w(M*), where M* is a perfect matching of maximum weight in G. We call an algorithm an α-approximation algorithm if it always returns α-optimum solutions.
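To make the gain of a 4-cycle concrete, the following sketch (our own C++ illustration, not taken from the authors' code) evaluates W(i, j, m_j, m_i) from the four edge weights involved:

#include <iostream>

// Gain of the alternating 4-cycle (i, j, m_j, m_i): the two currently
// unmatched edges {i,j} and {m_i,m_j} would enter the matching, and the
// two matched edges {i,m_i} and {m_j,j} would leave it. A positive gain
// means the cycle is weight-augmenting.
double cycleGain(double w_ij, double w_mi_mj, double w_i_mi, double w_mj_j) {
    return w_ij + w_mi_mj - w_i_mi - w_mj_j;
}

int main() {
    // Toy example: swapping the matched pairs increases the weight by 3.
    double gain = cycleGain(/*w({i,j})=*/5.0, /*w({m_i,m_j})=*/4.0,
                            /*w({i,m_i})=*/3.0, /*w({m_j,j})=*/3.0);
    std::cout << "gain = " << gain << "\n";  // prints: gain = 3
}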
4. The Pettie-Sanders Algorithm for Weighted Matching.

A (2/3 − ε) sequential linear-time approximation algorithm for the weighted matching problem on general graphs was presented by Pettie and Sanders [26]. Because we restrict ourselves to dealing with perfect matchings in bipartite graphs, the algorithm can be simplified significantly. This simplified version, called PSS, is based on three fundamental statements:

1. A perfect matching is 2/3-optimal if it contains no augmenting 4-cycles.
2. For any perfect matching M, there exists a set A of vertex-disjoint augmenting 4-cycles such that:

  w(M ⊕ A) ≥ w(M) + (3/5)·((2/3)·w(M*) − w(M))

3. An augmenting 4-cycle can be found in O(deg(v)) time, where deg(v) is the maximum degree among the vertices in the cycle.

Based on that, a lower bound on the probability of obtaining cycles from A by selecting random augmenting 4-cycles can be derived, which in turn implies a randomized approximation guarantee, since the average gain of an augmenting 4-cycle from A is:

  (3/5)·((2/3)·w(M*) − w(M)) / |A|

Thus, the randomized PSS algorithm repeatedly selects and applies random augmenting 4-cycles to the matching. It thereby computes a (2/3 − ε)-optimum matching in expected time O(m log(1/ε)). We implement a practical variant of this algorithm that simply loops over the vertices to find and use augmenting 4-cycles and stops if no such cycles can be found, and use it for comparison with the parallel algorithm in Section 6.

In the same publication, Pettie and Sanders also present a derandomized version of their algorithm with a similar approximation guarantee. The simplified version is shown in Algorithm 1. Our parallel algorithm is inspired by this derandomized algorithm.
Algorithm 1: The deterministic sequential algorithm adapted from Pettie and Sanders [26]

  1: Let M be a perfect matching
  2: for iter = 1 to maxiter do
  3:     S := ∅
  4:     for all j ∈ C do
  5:         Find the max-gain cycle (i, j, m_i, m_j) rooted at j
  6:         S := S ∪ (i, j, m_i, m_j)
  7:     M := M ⊕ greedy(S)
  8: return M

To find a 4-cycle rooted at a vertex j, select a neighbor i ≠ m_j, then follow the matching edges incident to i and j. Since M is a perfect matching, these edges always exist. Next, scan the neighborhood Γ(m_i) for m_j. If the edge {m_j, m_i} exists, we have found an alternating 4-cycle of gain W(i, j, m_j, m_i) = w({i, j}) + w({m_i, m_j}) − w({i, m_i}) − w({m_j, j}).

The main problem when parallelizing the algorithm stems from Line 7. Here, greedy(S) is a set of vertex-disjoint cycles selected from S via a greedy algorithm. Since doing so in parallel would be costly, we deviate slightly from this idea and perform only local comparisons to find a heavy set of disjoint augmenting 4-cycles. This means that our algorithm does not have the same theoretical properties as Pettie and Sanders's algorithms, and it can terminate without returning a (2/3 − ε)-optimum matching, although this can only happen on very specific instances. While this could easily be overcome by applying random augmentations, doing so is not desirable from a practical point of view.
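The neighborhood scan described above translates directly into code. Below is a minimal sequential sketch, assuming a CSC matrix with colPtr/rowIdx/val arrays and mate arrays for both vertex sides; all identifiers are ours, not from the authors' implementation:

#include <vector>
#include <cstdint>

// Minimal CSC view of a square sparse matrix: column j's nonzeros are
// rowIdx[colPtr[j] .. colPtr[j+1]) with weights in val.
struct CSC {
    int64_t n;
    std::vector<int64_t> colPtr, rowIdx;
    std::vector<double> val;
};

// Look up w({i, j}) by scanning column j; returns false if the edge is absent.
bool edgeWeight(const CSC& A, int64_t i, int64_t j, double& w) {
    for (int64_t k = A.colPtr[j]; k < A.colPtr[j + 1]; ++k)
        if (A.rowIdx[k] == i) { w = A.val[k]; return true; }
    return false;
}

// Find the maximum-gain augmenting 4-cycle rooted at column j.
// mateCol[j] is the row matched to column j, mateRow[i] the column matched
// to row i; wMateCol[j] = w({mateCol[j], j}) and wMateRow[i] = w({i, mateRow[i]}).
// Returns the best gain (<= 0 if no augmenting 4-cycle through j exists)
// and stores the partner row in bestRow.
double bestCycleAt(const CSC& A, int64_t j,
                   const std::vector<int64_t>& mateRow,
                   const std::vector<int64_t>& mateCol,
                   const std::vector<double>& wMateRow,
                   const std::vector<double>& wMateCol,
                   int64_t& bestRow) {
    double best = 0.0;
    bestRow = -1;
    for (int64_t k = A.colPtr[j]; k < A.colPtr[j + 1]; ++k) {
        int64_t i = A.rowIdx[k];
        if (i == mateCol[j]) continue;               // skip the matched edge
        double w_mimj;
        // Scan for the closing edge {m_j, m_i} of the cycle (i, j, m_j, m_i).
        if (!edgeWeight(A, mateCol[j], mateRow[i], w_mimj)) continue;
        double gain = A.val[k] + w_mimj - wMateRow[i] - wMateCol[j];
        if (gain > best) { best = gain; bestRow = i; }
    }
    return best;
}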
5. The Parallel Approximation Algorithm.

5.1. Overview of the parallel AWPM algorithm.
In our approximate weight perfect matching, or AWPM, algorithm, we combine a parallel maximum cardinality matching (MCM) algorithm with a distributed-memory approximation algorithm for the weight. The MCM algorithm is initialized using a maximal cardinality matching algorithm that returns a matching with cardinality at least half of the maximum. While this step is optional, doing so greatly improves the parallel performance. The maximal and maximum matching algorithms together provide a perfect matching whose weight is iteratively improved by discovering augmenting cycles, following the idea of the Pettie-Sanders algorithm discussed above.

Distributed-memory algorithms for cardinality matchings on bipartite graphs were developed in prior work [3, 4]. Both of these algorithms are implemented using a handful of bulk-synchronous matrix and vector operations and are freely available as part of the CombBLAS library. Among several variants of the maximal matching algorithms, we used a simple greedy algorithm to initialize MCM. For this work, we modified the original cardinality matching algorithms in such a way that when selecting potential matching edges, we break ties by giving precedence to edges with higher weight, as illustrated in the sketch below. This simple heuristic often results in the perfect matchings having high weight without incurring any additional cost. Once a perfect matching is obtained, we improve its weight using our parallel approximation algorithm.
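As an illustration of the tie-breaking heuristic, here is a single-node sketch of a greedy maximal matching that prefers the heaviest admissible edge. This is our own simplification; the actual initialization runs distributed inside CombBLAS:

#include <vector>
#include <cstdint>

// Greedy maximal matching on a CSC matrix that prefers heavy edges:
// each column grabs its heaviest still-unmatched row. This guarantees
// cardinality at least 1/2 of the maximum; the weight preference costs
// nothing extra and tends to leave a heavier matching for the
// augmentation phase to start from.
void greedyMaximal(int64_t n, const std::vector<int64_t>& colPtr,
                   const std::vector<int64_t>& rowIdx,
                   const std::vector<double>& val,
                   std::vector<int64_t>& mateRow,   // row -> col, -1 if free
                   std::vector<int64_t>& mateCol) { // col -> row, -1 if free
    mateRow.assign(n, -1);
    mateCol.assign(n, -1);
    for (int64_t j = 0; j < n; ++j) {
        int64_t bestRow = -1;
        double bestW = 0.0;
        for (int64_t k = colPtr[j]; k < colPtr[j + 1]; ++k) {
            int64_t i = rowIdx[k];
            // Tie-breaking rule: among admissible rows, keep the heaviest.
            if (mateRow[i] == -1 && (bestRow == -1 || val[k] > bestW)) {
                bestRow = i;
                bestW = val[k];
            }
        }
        if (bestRow != -1) { mateRow[bestRow] = j; mateCol[j] = bestRow; }
    }
}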
As mentioned in the last section, the algorithm for maximum weight approximation aims to find augmenting 4-cycles. Unlike longer cycles, they can be found in time proportional to the degree of their vertices, and doing so requires only a constant number of communication rounds in the distributed-memory parallel setting. Thus, our algorithm can be considered a special case of the general augmenting cycles algorithm. We will refer to it as AWAC, i.e., approximate weight augmenting cycles, or the augmenting cycles algorithm in short.

The vertices of a 4-cycle might be distributed over 4 processes, requiring 4 major communication steps, which will be described in detail below. We use a regular 2D partitioning of the input matrix. Thus, p processes form a √p × √p grid. They are denoted by (row, column) in the grid. Let A_{row,column} be the submatrix assigned to a process. Figure 5.1 illustrates the layout and explains the labeling of the processes.

Fig. 5.1: Augmenting 4-cycle in the 2D distributed matrix. Each square contains the part of the matrix assigned to one process. For simplicity, here the matched edges are gathered on the main diagonal (dashed red line). a and c denote rows of processes, while b and d denote columns. Thus, if edge {i, j} is on process (a, b), then its incident matched edges are on processes (a, d) and (c, b), respectively, where c and d depend on the position of the matched edges. If the unmatched edge {m_j, m_i} exists in the graph, it must be on process (c, d).

The algorithm starts by distributing the matching information across rows and columns, and then loops over the four steps k times, or until no more augmenting 4-cycles can be found, as shown in Algorithm 2.
Algorithm 2: Sketch of the Basic Parallel Approximation Algorithm

  Input: a weighted graph G = (R ∪ C, E, w)
  On each process (a, b) do in parallel:
      for all rows i of A contained in A_{a,b} do
          Copy m_i and w(i, m_i) to a local array
      for all columns j of A contained in A_{a,b} do
          Copy m_j and w(m_j, j) to a local array
      for i = 1 to k do
          Step A: Cycle requests
          Step B: Cycle completion
          Step C: Local cycle comparison
          Step D: Nonlocal cycle comparison
          if no cycle was found then break
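Before walking through the four steps, it helps to make the ownership rule of Figure 5.1 concrete. The sketch below assumes an n × n matrix partitioned into √p × √p equal tiles; the helper names are ours, not the authors':

#include <cstdint>
#include <utility>

// Owner of edge {i, j} in a gridDim x gridDim block partitioning of an
// n x n matrix: process (i / tile, j / tile) with tile = ceil(n / gridDim).
std::pair<int, int> owner(int64_t i, int64_t j, int64_t n, int gridDim) {
    int64_t tile = (n + gridDim - 1) / gridDim;
    return { static_cast<int>(i / tile), static_cast<int>(j / tile) };
}

// For a cycle rooted at column j with partner row i, the four edges live on:
//   {i, j}     -> (a, b)
//   {m_j, j}   -> (c, b)   same process column as {i, j}
//   {i, m_i}   -> (a, d)   same process row as {i, j}
//   {m_j, m_i} -> (c, d)   the process that receives the A-request
struct CycleOwners { std::pair<int,int> ab, cb, ad, cd; };

CycleOwners cycleOwners(int64_t i, int64_t j, int64_t m_i, int64_t m_j,
                        int64_t n, int gridDim) {
    return { owner(i, j, n, gridDim),   owner(m_j, j, n, gridDim),
             owner(i, m_i, n, gridDim), owner(m_j, m_i, n, gridDim) };
}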
Like the deterministic sequential Algorithm 1, we construct a set of vertex-disjoint augmentations. In effect, we parallelize the for all statement in Line 4 of Algorithm 1. However, we do not use the same greedy strategy to pick the augmenting 4-cycles in Line 7, since the standard greedy algorithm is inherently sequential. Instead, we perform limited local weight comparisons, which are described below.

After initialization, the first step is to select the vertices at which augmenting cycles are rooted. While all vertices are eligible, we can reduce this number since each cycle can be rooted at any of its 4 vertices. Because the matrix is stored in Compressed Sparse Column (CSC) format, we only start cycles from column vertices, reducing the number of root vertices by half. This number can be halved again by rooting a given cycle only at the column vertex with the lower index.

Now, for a potential cycle rooted at a column vertex j and containing a row vertex i, we generate a request to the owner of {m_j, m_i}, which includes the edge weight w({i, j}), as shown in Algorithm 3. For performance reasons, the exchange of all such requests is bundled into an All-to-All collective operation.

Algorithm 3: Step A

  for all processes (a, b) do
      for all columns j of A contained in A_{a,b} do
          for all rows i > m_j of A contained in A_{a,b} do
              if {i, j} exists then
                  Let (c, d) be the owner of {m_j, m_i}
                  Add A-request (m_j, m_i, w({i, j})) to the request queue for process (c, d)
      exchange A-requests via AllToAllv
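A sketch of how the A-requests of Algorithm 3 can be bucketed by destination process before the AllToAllv exchange follows (a local, single-node view with illustrative identifiers; we also carry i and j in the request for clarity, although the receiver could derive them from the matching information it holds):

#include <cstdint>
#include <tuple>
#include <utility>
#include <vector>

// One Step-A request: ask the owner of {m_j, m_i} whether that edge exists,
// carrying w({i, j}) so the receiver can evaluate the cycle gain.
struct ARequest {
    int64_t mj, mi;   // global row m_j and column m_i identifying the edge
    int64_t i, j;     // the root edge {i, j} of the candidate cycle
    double  w_ij;     // w({i, j})
};

// Bucket A-requests by destination process, ready to be flattened into the
// send buffer of an MPI_Alltoallv. localEdges holds this tile's nonzeros
// as (i, j, w) triples in global coordinates; mateRow/mateCol hold the
// distributed matching information copied locally.
std::vector<std::vector<ARequest>>
buildARequests(const std::vector<std::tuple<int64_t,int64_t,double>>& localEdges,
               const std::vector<int64_t>& mateRow,
               const std::vector<int64_t>& mateCol,
               int64_t n, int gridDim) {
    auto owner = [&](int64_t r, int64_t c) {
        int64_t tile = (n + gridDim - 1) / gridDim;
        return std::pair<int,int>(int(r / tile), int(c / tile));
    };
    std::vector<std::vector<ARequest>> outbox(gridDim * gridDim);
    for (const auto& [i, j, w] : localEdges) {
        int64_t mj = mateCol[j], mi = mateRow[i];
        if (i == mj) continue;   // a matched edge cannot root a cycle
        if (i <= mj) continue;   // root each cycle only once (i > m_j rule)
        auto [c, d] = owner(mj, mi);
        outbox[c * gridDim + d].push_back({mj, mi, i, j, w});
    }
    return outbox;
}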
In the next step, we need to determine whether {m_j, m_i}, and thus the alternating cycle on i, j, m_j, m_i, exists and is augmenting, i.e., has W(i, j, m_j, m_i) > 0. Note that the weight of the matching edges is stored on all processes in the same grid row/column and can thus be accessed directly, as shown in Algorithm 4.

Algorithm 4: Step B

  for all processes (c, d) do
      for all A-requests (m_j, m_i, w({i, j})) from (a, b) do
          if {m_j, m_i} exists then
              W(i, j, m_j, m_i) = w({i, j}) + w({m_i, m_j}) − w({i, m_i}) − w({m_j, j})
              if W(i, j, m_j, m_i) > 0 then
                  Add B-request (i, j, m_j, m_i, W(i, j, m_j, m_i)) to the request queue for process (c, b)
      exchange B-requests via AllToAllv

However, before we can augment the matching along this cycle, we have to make sure that it is vertex-disjoint with other augmenting cycles, and if not, find out which cycle has the highest gain. We therefore send a request to the process that owns the matched edge {m_j, j}. The owner of each edge {m_j, j} collects all incoming requests for that edge, selects one with maximum gain, and discards all others. Since these requests correspond to cycles rooted at j, all remaining cycles are now disjoint w.r.t. their {m_j, j} edge, as shown in Algorithm 5. However, we still have to ensure that the cycles are disjoint from other cycles at the edge {i, m_i}. Therefore, we send a request to its owner, which might not share any column or row with the sending process. Figure 5.2 shows an example of such competing cycles.

Fig. 5.2: Cycle collisions in the graph view. Left: multiple cycles, all of which are rooted in column vertex j, compete for the matched edge {m_j, j} in Step C of the algorithm. Right: cycles rooted at different column vertices j and j′ compete for the matched edge {i, m_i} in Step D.

Algorithm 5: Step C

  for all processes (c, b) do
      for all rows m_j of A contained in A_{c,b} do
          Find the B-request (i, j, m_j, m_i, W(i, j, m_j, m_i)) with maximum gain
          Add C-request (i, j, m_j, m_i, W(i, j, m_j, m_i)) to the request queue for process (a, d)
      exchange C-requests via AllToAllv
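The per-edge winner selection of Steps C and D is a simple reduction; a minimal sketch follows, keyed here by the column j of the matched edge {m_j, j} (Step D is analogous, keyed by the other matched edge {i, m_i}; the struct layout is our own):

#include <cstdint>
#include <unordered_map>
#include <vector>

struct BRequest {
    int64_t i, j, mj, mi;
    double gain;   // W(i, j, m_j, m_i), computed in Step B
};

// Keep, for every matched edge {m_j, j}, only the competing request with
// maximum gain; all others are discarded.
std::vector<BRequest> selectMaxGain(const std::vector<BRequest>& incoming) {
    std::unordered_map<int64_t, BRequest> best;   // key: root column j
    for (const auto& r : incoming) {
        auto it = best.find(r.j);
        if (it == best.end() || r.gain > it->second.gain) best[r.j] = r;
    }
    std::vector<BRequest> winners;
    winners.reserve(best.size());
    for (const auto& kv : best) winners.push_back(kv.second);
    return winners;
}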
In Step D, the owner of each edge {i, m_i} collects the requests for that edge, selects one with maximum gain, and discards all others, similar to Step C. Thus, all cycles are disjoint w.r.t. their {i, m_i} edges. However, it is still possible that the {i, m_i} edge of one cycle is the {m_j, j} edge of a different cycle. Thus, if a process sent a C-request for an edge e = {m_j, j} in Step C, then it will automatically discard the requests for other cycles that have e as their {i, m_i} edge. As mentioned in Section 4, our strategy deviates from the Pettie-Sanders algorithm here. The reason for this lies in the fact that finding a maximal set of augmenting 4-cycles would incur additional communication that most likely affects only a small number of vertices. Therefore, in the parallel case it is preferable to simply drop the problematic cycles and generate new ones instead.

The final step consists of augmenting the matching by flipping the matched and unmatched edges in each cycle and communicating the change along the rows and columns, as shown in Algorithm 6. The entire path of the communication is sketched in Figure 5.3.

Algorithm 6: Step D

  for all processes (a, d) do
      for all columns j of A contained in A_{a,d} do
          if no C-request was sent from i in Step C then
              Find the C-request (i, j, m_j, m_i, W(i, j, m_j, m_i)) with maximum gain
              k = m_i
              l = m_j
              broadcast m_i = j, w(i, j) to all processes (a, ∗)
              broadcast m_j = i, w(i, j) to all processes (∗, b)
              broadcast m_k = l, w(l, k) to all processes (c, ∗)
              broadcast m_l = k, w(l, k) to all processes (∗, d)

Fig. 5.3: Communication in the 2D distributed matrix view. In Step A (orange arrow) communication goes from (a, b) to (c, d), in Step B (blue arrow) from (c, d) to (c, b), and in Step C (purple arrow) from (c, b) to (a, d). In Step D, the matching is updated along rows a and c and along columns b and d (green squares). In this example, that includes all but the process in the center.
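The Step D updates can be propagated with broadcasts along the grid rows and columns. Below is a schematic MPI sketch, assuming a square grid and row/column communicators obtained via MPI_Comm_split; it is illustrative, not the authors' code:

#include <mpi.h>

struct MateUpdate { long vertex; long newMate; double weight; };

// Every process in the root's grid row learns the update, and every
// process in the root's grid column learns it as well.
void broadcastUpdate(MateUpdate u, int rootRank,
                     MPI_Comm rowComm, MPI_Comm colComm) {
    MPI_Bcast(&u, sizeof(MateUpdate), MPI_BYTE, rootRank, rowComm);
    MPI_Bcast(&u, sizeof(MateUpdate), MPI_BYTE, rootRank, colComm);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    // Largest square grid that fits in p processes (extra ranks stay idle).
    int gridDim = 1;
    while ((gridDim + 1) * (gridDim + 1) <= p) ++gridDim;
    int myRow = rank / gridDim, myCol = rank % gridDim;
    MPI_Comm rowComm, colComm;
    MPI_Comm_split(MPI_COMM_WORLD, myRow, myCol, &rowComm); // same grid row
    MPI_Comm_split(MPI_COMM_WORLD, myCol, myRow, &colComm); // same grid column
    // Toy update: the root announces that row vertex 0 is now matched to column 1.
    MateUpdate u{0, 1, 1.0};
    broadcastUpdate(u, /*rootRank=*/0, rowComm, colComm);
    MPI_Comm_free(&rowComm);
    MPI_Comm_free(&colComm);
    MPI_Finalize();
    return 0;
}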
We measure communication by the number of words moved (W) and the number of messages sent (S). The cost of communicating a W-word message is α + βW, where α is the latency and β is the inverse bandwidth, both defined relative to the cost of a single arithmetic operation. Hence, an algorithm that performs F arithmetic operations, sends S messages, and moves W words takes F + αS + βW time.

Since rows are permuted randomly at the start of the algorithm, matrix nonzeros are assumed to be i.i.d. distributed in this analysis. We also assume a square √p × √p process grid. The runtime of the MCM algorithm is dominated by parallel sparse matrix-sparse vector multiplication, whose complexity has been analyzed in our prior work [4]. Assuming |iters_MCM| to be the number of iterations needed to find an MCM, the perfect matching computation takes:

  T_MCM = O( m/p + β·(m/p + n/√p) + |iters_MCM|·α·√p )

In Step A of the AWAC algorithm, we exchange a total of O(m) requests among all processes. Under the i.i.d. assumption, each process contains m/p edges and spends O(m/p) time to prepare O(m/p) requests for all other processes. Let R_i and C_i be the sets of rows and columns whose pairwise connections are stored on process i. Let M(R_i) and M(C_i) be the sets of mates of the vertices in R_i and C_i, respectively. Since a process stores at most n/√p rows and columns, |M(R_i)| ≤ n/√p and |M(C_i)| ≤ n/√p. In order for the i-th process to receive a request, there must be an edge between a pair of vertices in M(R_i) × M(C_i). Under the i.i.d. assumption, the number of such edges is O(m/p). Hence a process receives O(m/p) messages in the communication round of Step A.

The number of requests in Step B cannot be greater than in Step A, and the requests in Step C and the augmentations in Step D are bounded by O(n). Hence the total cost of AWAC is:

  T_AWAC = O( |iters| · (m/p + β·m/p + α·p) )

Note that in AWAC, |iters| is bounded by a small constant k that depends on the desired approximation ratio.
6. Results.

6.1. Experimental setup.
We evaluated the performance of our algorithms on the Edison and Cori supercomputers at NERSC. On Cori, we used KNL nodes set to cache mode, where the available 16GB of MCDRAM is used as cache. Table 5.1 summarizes the key features of these systems. We used Cray's MPI implementation for inter-node communication and OpenMP for intra-node multithreading. Unless otherwise stated, all of our experiments used 4 MPI processes per node. We rely on the CombBLAS library [8] for parallel file I/O, data distribution, and storage. We always used square process grids because rectangular grids are not supported in CombBLAS.

Table 5.1: Overview of evaluated platforms. (* shared between 2 cores in a tile. Memory bandwidth is measured using the STREAM copy benchmark per node.)

                        Cori                Edison
  Core                  Intel KNL           Intel Ivy Bridge
  Clock (GHz)           1.4                 2.4
  L1 cache (KB)         32                  32
  L2 cache (KB)         1024*               256
  Node arch.
  Sockets/node          1                   2
  Cores/socket          68                  12
  STREAM BW (GB/s)      102                 104
  Memory (GB)           96                  64
  Overall system
  Nodes                 9,688               5,586
  Interconnect          Aries (Dragonfly)   Aries (Dragonfly)
  Prog. environment
  Compiler              Intel C++ Compiler (icpc) ver. 18.0.0
  Optimization          -O3                 -O3

In our experiments, we used a diverse test set of real-world matrices from the University of Florida sparse matrix collection [9] and other sources, as shown in Table 6.1. Before running the matching algorithms, we normalized the matrices such that the maximum entry of each row or column is 1 and the other entries are bounded by 1 (a sketch of one possible implementation appears after Table 6.2 below).

Table 6.1: Square matrices used to evaluate matching algorithms. All matrices, except five, are from the University of Florida sparse matrix collection [9]. We separate the bigger matrices (nnz > 200 × 10^6) from the smaller ones.

  Matrix             Type    n (×10^6)   nnz (×10^6)   Application
  memchip            unsym   2.71        14.81         circuit simulation
  rajat31            unsym   4.69        20.32         circuit simulation
  Freescale2         unsym   3.00        23.04         circuit simulation
  boneS10            sym     0.91        28.19         model reduction problem
  Serena             sym     1.39        32.96         gas reservoir simulation
  audikw_1           sym     0.94        39.30         structural problem
  dielFilterV3real   sym     1.10        45.20         higher-order finite element
  Flan_1565          sym     1.56        59.49         structural problem
  circuit5M          unsym   5.56        59.52         circuit simulation
  Li7Nmax6 [1]       sym     0.66        212.23        nuclear config. interaction
  HV15R              unsym   2.02        283.07        3D engine fan
  perf008cr          sym     7.90        321.06        3D structural mechanics
  nlpkkt240          sym     27.99       401.23        sym. indef. KKT matrix
  Nm7 [1]            unsym   4.08        437.37        nuclear config. interaction
  NaluR3 [22]        unsym   17.60       473.71        low Mach fluid flow
  A05                unsym   1.56        1088.71       MHD for plasma fusion

We compare the performance of our parallel algorithm with the MWPM implementations in MC64 [11] and an optimized sequential implementation of AWPM. Our sequential implementation uses the Karp-Sipser and push-relabel algorithms [16] to find a perfect matching, and then uses the augmenting cycles algorithm to find an AWPM. MC64 provides five options for computing the weight of a matching. In all of our experiments except Table 6.3, we used option 4, which maximizes the sum of the weights on the matched edges. Since a matching algorithm is often used as a subroutine of another application, we exclude file I/O and graph generation time when reporting the running time of the matching algorithms.

Even though our parallel algorithm cannot guarantee the theoretical bound of the Pettie-Sanders algorithm, the obtained matchings are often very close to the optimum in practice, as shown in Table 6.2. Here, we compute the approximation ratio by dividing the approximated weight by the optimum weight and show it in the last column. For 10 out of 16 matrices in Table 6.1, AWPM finds a matching with the optimum weight. The approximate weights of the other matrices are also close to the optimum. This observation extends to many other matrices from the University of Florida collection. In an extended set of more than 100 matrices, the average approximation ratio attained by AWPM is 98.66% (min 86%, max 100%). Hence, our algorithm can successfully substitute MWPM algorithms in many practical problems without sacrificing the quality of the matchings.

Table 6.2: The weight of matchings from AWPM and MC64.

  Matrix             MC64         AWPM         Approx. ratio
  memchip            2,707,524    2,707,520    100%
  rajat31            4,690,002    4,690,002    100%
  Freescale2         2,994,270    2,989,080    99.98%
  boneS10            914,898      914,898      100%
  Serena             1,391,344    1,391,340    100%
  audikw_1           943,624      943,624      100%
  dielFilterV3real   1,102,796    1,102,796    100%
  Flan_1565          1,564,794    1,564,794    100%
  circuit5M          5,557,920    5,557,890    99.99%
  Li7Nmax6           663,526      663,521      99.99%
  HV15R              1,877,709    1,770,960    94.31%
  perf008cr          7,902,327    7,902,327    100%
  nlpkkt240          27,762,507   27,710,011   99.81%
  Nm7                4,079,730    4,079,730    100%
  NaluR3             17,598,889   17,598,889   100%
  A05                1,140,660    1,140,660    100%
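The normalization mentioned in the experimental setup can be realized with two sweeps over the nonzeros. Below is one possible single-node sketch (our own helper; after the column sweep, every column maximum is exactly 1 and all entries are bounded by 1 in absolute value):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct Triple { int64_t i, j; double v; };

// Scale a matrix so that all entries are bounded by 1 in absolute value:
// divide each entry by the maximum of its row, then by the maximum of the
// resulting columns.
void normalize(std::vector<Triple>& A, int64_t n) {
    std::vector<double> rowMax(n, 0.0), colMax(n, 0.0);
    for (const auto& t : A) rowMax[t.i] = std::max(rowMax[t.i], std::fabs(t.v));
    for (auto& t : A) if (rowMax[t.i] > 0) t.v /= rowMax[t.i];
    for (const auto& t : A) colMax[t.j] = std::max(colMax[t.j], std::fabs(t.v));
    for (auto& t : A) if (colMax[t.j] > 0) t.v /= colMax[t.j];
}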
Figure 6.1 compares the runtime of the parallel AWPM algorithm with MC64 and sequential AWPM for all 16 matrices from Table 6.1. Here, the MC64+gather lines include the time to gather the whole matrix on a single node, representing the total time needed to compute MWPM sequentially in a distributed-memory setting. The experiments in Figure 6.1 were run on Edison.

At first, we discuss the performance of the matching algorithms on the bigger matrices, where a distributed-memory algorithm is urgently needed. MC64 failed on A05, the largest matrix in our suite, because of insufficient memory on a single node of Edison. For matrices of this scale, our distributed-memory algorithm is the only viable option for enabling extreme-scale scientific applications that rely on MWPM. For the other large matrices, parallel AWPM is significantly faster than its sequential competitors. For some problems, this is true even on a single node. For example, when computing a matching on NaluR3 on a single node of Edison, parallel AWPM is 40× and 6× faster than MC64 and sequential AWPM, respectively. On 256 nodes (6144 cores), our parallel algorithm becomes 3372× and 384× faster than MC64 and sequential AWPM. Note that AWPM is able to find the optimum solution on NaluR3. Hence, the drastic reduction in runtime comes for free, without sacrificing any matching quality. For the other large matrices (with nnz greater than 200M), parallel AWPM runs at least 20× faster than MC64+gather on high concurrency. This observation also holds for most matrices in the second and third rows of Figure 6.1.

On smaller matrices (e.g., those in the first row of Figure 6.1), the performance gain from the parallel algorithm is not as dramatic as with the bigger matrices. This is expected, as it is hard for parallel algorithms to reduce a subsecond sequential runtime. However, for all matrices except Freescale2, parallel AWPM runs faster than MC64+gather on high concurrency. Hence our algorithm is competitive on smaller matrices and runs significantly faster on bigger matrices.

While most instances show excellent scalability, there are two outliers, Freescale2 and HV15R, where parallel AWPM does not run faster than sequential AWPM. For both matrices, parallel AWPM spends more than 80% of the runtime on finding an initial perfect matching using the distributed-memory MCM algorithm [4]. Obtaining perfect matchings on these two matrices requires searching for long paths going through many processors, which is hard to parallelize. Nevertheless, even for these matrices, parallel AWPM remains competitive with or significantly faster than MC64+gather, which is the main competitor of our algorithm in practice.

The matching algorithms discussed above show similar performance trends on Cori-KNL, as shown in Figure 6.2. For example, on the perf008cr matrix, parallel AWPM runs 72× and 34× faster than MC64+gather on 17,408 and 6,144 cores of Cori-KNL and Edison, respectively.
On Cori-KNL, MC64 is able to compute a matching for the A05 matrix in 135 seconds, whereas the parallel AWPM algorithm takes just 0.44 seconds on 256 nodes of Cori-KNL.

Unless otherwise noted, we report the speedup of the parallel AWPM algorithm relative to its runtime on a single node. AWPM still runs in parallel on a single node, using 4 MPI processes and employing multithreading within each process. At first, we discuss the scalability of the complete AWPM algorithm (including the time to compute the initial perfect matching), as shown in Figures 6.1 and 6.2.
Sequential AWPM
Number of cores (log scale) T i m e i n s e c o n d s ( l o g s c a l e )
24 96 384 1536 61440.180.71 memchip (4.5x) 24 96 384 1536 61440.180.71 rajat31(4.95x) 24 96 384 1536 61440.351.41 Freescale2 (1.5x) 24 96 384 1536 61440.030.130.52 boneS10 (30x)24 96 384 1536 61440.060.2514 Serena (25x) 24 96 384 1536 61440.060.2514 audikw_1 (28x) 24 96 384 1536 61440.060.2514 dielFilterV3real (33x) 24 96 384 1536 61440.060.2514 Flan_1565 (40x)24 96 384 1536 61440.351.415.66 circuit5M (8.1x) 24 96 384 1536 61440.52832 Li7Nmax6 (20x) 24 96 384 1536 614441664256 HV15R (43x) 24 96 384 1536 61440.52832 perf008cr (36x)24 96 384 1536 61441416 nlpkkt240 (23x) 24 96 384 1536 61442832128 Nm7 (31x) 24 96 384 1536 61440.060.251416642561024 NaluR3 (3372x) 384 1536 61440.250.512 A05
Fig. 6.1: Comparing the parallel AWPM algorithm with sequential AWPM and MC64 on Edison. AWPMtime includes the computation of a perfect matching and the newly developed augmenting cycles algo-rithm. A red line plots the time to gather a distributed matrix on a node and then run MC64. For eachmatrix, the best speedup attained by the parallel AWPM algorithm relative to MC64+gather time isshown at the top of the corresponding subplot. Four MPI processes per node and 6 threads per processwere used. Matrices are arranged in ascending order by nnz first from left to right and then from top tobottom.over 13 matrices that were able to run on a single node of Edison. However, the performance improvementis more significant on bigger matrices. For example, our algorithm attains the best speedup of 82 × forNaluR3 on 256 nodes on Edison. On 256 nodes of Cori-KNL, AWPM achieves 30 × speedup on averageover 14 matrices that were able to run on a single node. The best speedups on Cori-KNL are attained onLi7Nmax6 (114 × ) and NaluR3 (97 × ). For the A05 matrix, as we go from 16 nodes to 256 nodes, parallelAWPM runs 8 × and 9 × faster on Edison and Cori-KNL, respectively.Since the primary contribution of the paper is the approximate weight augmenting cycles algorithm,we show its scalability separately in Figure 6.3. We only show results for matrices with more than 200Mnonzeros. For all matrices in Figure 6.3, the parallel augmenting cycles algorithm attains more thana 50 × speedup on both Edison and Cori-KNL. Among these matrices, the best speedups of 260 × was C64 MC64+gather Parallel AWPM
Sequential AWPM
Number of cores (log scale) T i m e i n s e c o n d s ( l o g s c a l e )
64 256 1024 4096 163840.25141664256 A0564 256 1024 4096 163840.52832128 perf008cr64 256 1024 4096 163840.130.528 Flan_156564 256 1024 4096 163840.060.2514 boneS10 (42x) (56x) (129x) (302x)
Fig. 6.2: Comparing parallel AWPM with two sequential algorithms for four representative matrices onCori-KNL. The best speedup of the parallel AWPM algorithm relative to MC64+gather time is shownat the top of the corresponding subplot. See the caption of Figure 6.1 for further detail.
Fig. 6.3: Strong scaling of the approximate weight augmenting cycles algorithm for the largest seven matrices from Table 6.1 (Li7Nmax6, HV15R, perf008cr, nlpkkt240, Nm7, NaluR3, and A05) on (a) Edison and (b) Cori-KNL. The scalability plots start from one node. Four MPI processes are used on each node of Edison and Cori-KNL.
In order to study the behavior of parallel AWPM, we break down the execution time of two instances in Figure 6.4. For both instances, the newly developed augmenting cycles algorithm scales very well. This trend is also observed for most of the matrices, as was explained before. However, the maximum cardinality matching algorithm that is used to obtain an initial perfect matching starts to become the bottleneck on high concurrency. As studied extensively in prior work [4], the performance of the parallel MCM algorithm suffers if it needs to search long augmenting paths that span many processors.

Fig. 6.4: Breakdown of parallel AWPM runtime on Cori-KNL (panels: Nm7 and nlpkkt240; each plot splits the time into maximal cardinality matching, maximum cardinality matching, and augmenting cycles, as a function of the number of cores).
As stated in Section 1, our motivation for AWPM comes from the need for parallel pre-pivoting of large entries onto the main diagonal to improve the stability of sparse direct solvers. We now evaluate how AWPM performs relative to MC64 in this regard, using the SuperLU_DIST direct solver. For each matrix A in Table 6.3, we set the true solution vector x_true = [1, …, 1]^T, then generate the right-hand side vector b by computing b = A·x_true. We use the LAPACK-style simple equilibration scheme to compute the row and column scalings D_r and D_c, which make each row and each column of the scaled matrix D_r·A·D_c have equal norm, with the maximum entry of each row or column being ±1. The matching is then used to obtain the row permutation P_r. Here, the maximization criterion is the sum of the logarithms of the weights, i.e., the product of the weights. The sparsity reordering P_c is then obtained with METIS using the permuted matrix (graph). Finally, the LU factorization is computed as P_c^T (P_r (D_r A D_c)) P_c = LU, and the solution x is computed based on the transformed linear system. The relative solution error ‖x − x_true‖_∞ / ‖x‖_∞ is reported in Table 6.3.

Table 6.3: Weights of matchings from MC64 and AWPM when the sum of the logarithms of the weights is maximized, and the SuperLU_DIST relative error of the solution. For the other matrices in Table 6.1, we could not get a solution from SuperLU_DIST.

  Matrix             MC64 weight   AWPM weight   Rel. error (MC64)   Rel. error (AWPM)
  memchip            0.00          0.00          …                   …
  rajat31            0.00          0.00          …                   …
  Freescale2         -11982.80     -15122.14     …                   …
  boneS10            0.00          0.00          …                   …
  Serena             -6.33         -6.33         …                   …
  audikw_1           -81.90        -81.90        …                   …
  dielFilterV3real   -28.97        -28.97        …                   …
  Flan_1565          0.00          0.00          …                   …
  circuit5M          -1298.52      -2491.49      …                   …
  perf008cr          0.00          0.00          …                   …
  A05                -780638       -780638       …                   …

For most matrices, the relative error obtained with AWPM is remarkably close to that of MC64, with the exception of circuit5M. This can be explained by the difference in the weights found by MC64 and AWPM. However, for most matrices, the AWPM weights are identical to the MC64 weights. Recall from Table 6.2 that when the sum of weights is maximized, AWPM achieves 99.99% of the optimum weight for this matrix. When using the permutation obtained from the "AWPM-sum" metric for circuit5M, the computed solution is as accurate as that of MC64. In the future, we plan to investigate the influence of the AWPM-sum and AWPM-product metrics on the solution accuracy of SuperLU_DIST.
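Schematically, the static-pivoting pipeline composes the scalings and the matching-based row permutation before factorization. Below is a simplified sketch in our own notation (not the SuperLU_DIST API), omitting the fill-reducing permutation P_c:

#include <cstdint>
#include <vector>

struct Entry { int64_t i, j; double v; };

// Form the pre-pivoted matrix P_r (D_r A D_c): scale entry (i, j) by
// dr[i] * dc[j], then move row i to position rowPerm[i], where rowPerm is
// the permutation produced by the matching (MC64 or AWPM) so that heavy
// entries land on the diagonal. The fill-reducing permutation P_c would be
// applied symmetrically afterwards.
std::vector<Entry> prePivot(const std::vector<Entry>& A,
                            const std::vector<double>& dr,
                            const std::vector<double>& dc,
                            const std::vector<int64_t>& rowPerm) {
    std::vector<Entry> B;
    B.reserve(A.size());
    for (const auto& t : A)
        B.push_back({rowPerm[t.i], t.j, dr[t.i] * t.v * dc[t.j]});
    return B;
}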
7. Concluding Remarks.
We presented a new distributed-memory parallel algorithm for the approximate-weight perfect matching problem on bipartite graphs. That is, our algorithm returns a perfect matching (if it exists) that has approximately maximum weight. Our motivation comes from distributed-memory sparse direct solvers, where an initial permutation of the sparse matrix that places entries with large absolute values on the diagonal is often performed before the factorization, in order to avoid expensive pivoting during runtime. This process is called static pivoting, and the permutation is ideally found using a maximum-weight perfect matching. However, previous attempts at parallelizing the exact algorithms met with limited success. Since the perfect matching part is a strict requirement of static pivoting, our algorithm only relaxes the weight requirement.

There are two key reasons for the performance of our algorithm. For the initial phase, where we find a perfect matching, we use existing optimized distributed-memory cardinality matching algorithms with minor modifications to maximize the weight when tie-breaking. For the iterative phase, which improves the weights while maintaining maximum cardinality, we restrict the algorithm to 4-cycles, and thus avoid traversing long augmenting paths, following the Pettie-Sanders approximation algorithm. Hence, the crux of our paper is an efficient method for finding augmenting 4-cycles on a bipartite graph in distributed memory.

In terms of LP duality, unlike the Hungarian algorithm [17], ours is a primal method because it maintains a feasible solution. It does not compute a solution to the dual problem, i.e., a feasible potential. Since the dual values can be used to scale matrices in linear solvers [25], extending the algorithm to provide such scaling factors is a goal for future work.

We have tested the algorithm on the largest available instances in the Florida sparse matrix collection. While in some cases finding the perfect matching in parallel is expensive, the weight approximation always scales well. Our experiments show that the weight of the matchings we obtain is high enough to replace MC64 for pre-pivoting in linear solvers such as SuperLU_DIST, thereby making such solvers fully scalable.

REFERENCES

[1] Hasan Metin Aktulga, Aydın Buluç, Samuel Williams, and Chao Yang. Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations. In Proceedings of the IPDPS, pages 1213–1222, 2014.
[2] H. Alt, N. Blum, K. Mehlhorn, and M. Paul. Computing a maximum cardinality matching in a bipartite graph in time O(n^{1.5}·√(m/log n)). Information Processing Letters, 37(4):237–240, 1991.
[3] Ariful Azad and Aydın Buluç. A matrix-algebraic formulation of distributed-memory maximal cardinality matching algorithms in bipartite graphs. Parallel Computing, 58:117–130, 2016.
[4] Ariful Azad and Aydın Buluç. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. In Proceedings of the IPDPS. IEEE, 2016.
[5] Ariful Azad, Aydın Buluç, and Alex Pothen. Computing maximum cardinality matchings in parallel on bipartite graphs via tree-grafting. IEEE Transactions on Parallel and Distributed Systems (TPDS), 28(1):44–59, 2017.
[6] Ariful Azad, Mahantesh Halappanavar, Sivasankaran Rajamanickam, Erik Boman, Arif Khan, and Alex Pothen. Multithreaded algorithms for maximum matching in bipartite graphs. In IPDPS, Shanghai, China, 2012. IEEE.
[7] Dimitri P. Bertsekas and David A. Castañón. A generic auction algorithm for the minimum cost network flow problem. Computational Optimization and Applications, 2(3):229–259, 1993.
[8] Aydın Buluç and John R. Gilbert. The Combinatorial BLAS: Design, implementation, and applications. International Journal of High-Performance Computing Applications (IJHPCA), 25(4), 2011.
[9] Timothy A. Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1, 2011.
[10] I. S. Duff, K. Kaya, and B. Uçar. Design, implementation, and analysis of maximum transversal algorithms. ACM Transactions on Mathematical Software, 38(2):13:1–13:31, 2011.
[11] Iain S. Duff and Jacko Koster. On algorithms for permuting large entries to the diagonal of a sparse matrix. SIAM Journal on Matrix Analysis and Applications, 22(4):973–996, 2001.
[12] A. V. Goldberg and R. E. Tarjan. A new approach to the maximum flow problem. Journal of the Association for Computing Machinery, 35:921–940, 1988.
[13] Mahantesh Halappanavar, John Feo, Oreste Villa, Antonino Tumeo, and Alex Pothen. Approximate weighted matching on emerging manycore and multithreaded architectures. The International Journal of High Performance Computing Applications, 26(4):413–430, 2012.
[14] Jonathan Hogg and Jennifer Scott. On the use of suboptimal matchings for scaling and ordering sparse symmetric matrices. Numerical Linear Algebra with Applications, 22(4):648–663, 2015.
[15] John E. Hopcroft and Richard M. Karp. An n^{5/2} algorithm for maximum matchings in bipartite graphs. SIAM Journal on Computing, 2(4):225–231, 1973.
[16] K. Kaya, J. Langguth, F. Manne, and B. Uçar. Push-relabel based algorithms for the maximum transversal problem. Computers & Operations Research, 40(5):1266–1275, 2013.
[17] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[18] J. Langguth, F. Manne, and P. Sanders. Heuristic initialization for bipartite matching problems. ACM Journal of Experimental Algorithmics, 15:1.1–1.22, February 2010.
[19] Johannes Langguth, Ariful Azad, Mahantesh Halappanavar, and Fredrik Manne. On parallel push-relabel based algorithms for bipartite maximum matching. Parallel Computing, 40(7):289–308, 2014.
[20] Johannes Langguth, Md. Mostofa Ali Patwary, and Fredrik Manne. Parallel algorithms for bipartite matching problems on distributed memory computers. Parallel Computing, 37(12):820–845, 2011.
[21] Xiaoye S. Li and James W. Demmel. SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems. ACM Transactions on Mathematical Software (TOMS), 29(2):110–140, 2003.
[22] Paul Lin, Matthew Bettencourt, Stefan Domino, Travis Fisher, Mark Hoemmen, Jonathan Hu, Eric Phipps, Andrey Prokopenko, Sivasankaran Rajamanickam, Christopher Siefert, et al. Towards extreme-scale simulations for low Mach fluids with second-generation Trilinos. Parallel Processing Letters, 24(04):1442005, 2014.
[23] Fredrik Manne and Rob H. Bisseling. A parallel approximation algorithm for the weighted maximum matching problem. In International Conference on Parallel Processing and Applied Mathematics, pages 708–717, 2007.
[24] Fredrik Manne and Mahantesh Halappanavar. New effective multithreaded matching algorithms. In Proceedings of the IPDPS, pages 519–528, 2014.
[25] Markus Olschowka and Arnold Neumaier. A new pivoting strategy for Gaussian elimination. Linear Algebra and its Applications, 240(Supplement C):131–151, 1996.
[26] Seth Pettie and Peter Sanders. A simpler linear time 2/3 − ε approximation for maximum weight matching. Information Processing Letters, 91(6):271–276, 2004.
[27] A. Pothen and C.-J. Fan. Computing the block triangular form of a sparse matrix. ACM Transactions on Mathematical Software, 16:303–324, 1990.
[28] Edward Jason Riedy. Making Static Pivoting Scalable and Dependable. PhD thesis, University of California, Berkeley, 2010.
[29] P. Sao, X. S. Li, and R. Vuduc. A 3D LU factorization algorithm for sparse matrices. In Proceedings of the IPDPS (to appear). IEEE, 2018.
[30] M. Sathe, O. Schenk, and H. Burkhart. An auction-based weighted matching implementation on massively parallel architectures. Parallel Computing, 38(12):595–614, 2012.
[31] Alexander Schrijver. Combinatorial Optimization: Polyhedra and Efficiency. Springer, 2003.