Shared-memory Exact Minimum Cuts∗

Monika Henzinger†  Alexander Noe‡  Christian Schulz§

Abstract
The minimum cut problem for an undirected edge-weighted graph asks us to divide its set of nodes into two blocks while minimizing the weight sum of the cut edges. In this paper, we engineer the fastest known exact algorithm for the problem. State-of-the-art algorithms like the algorithm of Padberg and Rinaldi or the algorithm of Nagamochi, Ono and Ibaraki identify edges that can be contracted to reduce the graph size such that at least one minimum cut is maintained in the contracted graph. Our algorithm achieves improvements in running time over these algorithms by a multitude of techniques. First, we use a recently developed fast and parallel inexact minimum cut algorithm to obtain a better bound for the problem. Then we use reductions that depend on this bound to reduce the size of the graph much faster than previously possible. We use improved data structures to further improve the running time of our algorithm. Additionally, we parallelize the contraction routines of Nagamochi, Ono and Ibaraki. Overall, we arrive at a system that outperforms the fastest state-of-the-art solvers for the exact minimum cut problem significantly.
1 Introduction

Given an undirected graph with non-negative edge weights, the minimum cut problem is to partition the vertices into two sets so that the sum of edge weights between the two sets is minimized. A minimum cut is often also referred to as the edge connectivity of a graph [24, 14]. The problem has applications in many fields. In particular, for network reliability [16, 30], assuming equal failure probability of the edges, the smallest edge cut in the network has the highest chance to disconnect the network; in VLSI design [21], a minimum cut can be used to minimize the number of connections between microprocessor blocks; and it is further used as a subproblem in the branch-and-cut algorithm for solving the Traveling Salesman Problem and other combinatorial problems [27].

As the minimum cut problem has many applications and is often used as a subproblem for complex problems, it is highly important to have algorithms that are able to solve the problem in reasonable time on huge data sets. As data sets are growing substantially faster than processor speeds, a good way to achieve this is efficient parallelization. While there is a multitude of algorithms which solve the minimum cut problem exactly on a single core [12, 14, 17, 24], to the best of our knowledge there exists only one parallel exact algorithm for the minimum cut problem: Karger and Stein [17] present a parallel variant of their random contraction algorithm [17] which computes a minimum cut with high probability in polylogarithmic time using n^2 processors. This is, however, unfeasible for large instances. There has been an MPI implementation of this algorithm by Gianinazzi et al. [9]. However, there have been no parallel implementations of the algorithms of Hao et al. [12] and Nagamochi et al. [24, 25], which outperformed other exact algorithms by orders of magnitude [7, 13, 15], both on real-world and generated networks.

All algorithms that solve the minimum cut problem exactly have non-linear running times; the currently fastest is the deterministic algorithm of Henzinger et al. [14] with running time O(m log^2 n log log^2 n). There is a linear time approximation algorithm, namely the (2 + ε)-approximation algorithm by Matula [23], and a linear time heuristic minimum cut algorithm by Henzinger et al. [13] based on the label propagation algorithm [29]. The latter paper also contains a shared-memory parallel implementation of their algorithm.

We engineer the fastest known exact minimum cut algorithm. We do so by (1) incorporating recently proposed inexact methods, (2) using better suited data structures and other optimizations, as well as (3) parallelization. Algorithms like the algorithm of Padberg and Rinaldi or the algorithm of Nagamochi, Ono and Ibaraki identify edges that can be contracted to reduce the graph size such that at least one minimum cut is maintained in the contracted graph. Our algorithm achieves improvements in running time by a multitude of techniques. First, we use a recently developed fast and parallel inexact minimum cut algorithm [13] to obtain a better approximate bound λ̂ for the problem.

∗ The research leading to these results has received funding from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013) / ERC grant agreement No. 340506.
† University of Vienna, Vienna, Austria. [email protected]
‡ University of Vienna, Vienna, Austria. [email protected]
§ University of Vienna, Vienna, Austria. [email protected]
As known graph reduction techniques depend on this bound, the better bound enables us to apply more reductions and reduce the size of the graph much faster. For example, edges whose incident vertices have a connectivity of at least λ̂ can be contracted without the contraction affecting the minimum cut. Using better suited data structures as well as incorporating observations that help to save a significant amount of work in the contraction routine of Nagamochi, Ono and Ibaraki [25] further reduce the running time of our algorithm. For example, we observe a significantly higher performance on some graphs when using a FIFO bucket priority queue, bounded priority queues, as well as better bounds λ̂. Additionally, we give a parallel variant of the contraction routines of Nagamochi, Ono and Ibaraki [25]. Overall, we arrive at a system that outperforms the state-of-the-art by a factor of more than 2.

2 Preliminaries

2.1 Basic Concepts

Let G = (V, E, c) be a weighted undirected graph with vertex set V, edge set E ⊆ V × V and non-negative edge weights c: E → N. We extend c to a set of edges E′ ⊆ E by summing the weights of the edges; that is, c(E′) := Σ_{e={u,v}∈E′} c(u, v). We apply the same notation for single nodes and sets of nodes. Let n = |V| be the number of vertices and m = |E| be the number of edges in G. The neighborhood N(v) of a vertex v is the set of vertices adjacent to v. The weighted degree of a vertex is the sum of the weights of its incident edges. For brevity, we simply call this the degree of the vertex. For a set of vertices A ⊆ V, we denote by E[A] := {(u, v) ∈ E | u ∈ A, v ∈ V \ A} the set of edges in E that start in A and end in its complement.

A cut (A, V \ A) is a partitioning of the vertex set V into two non-empty partitions A and V \ A, each being called a side of the cut. The capacity of a cut (A, V \ A) is c(A) = Σ_{(u,v)∈E[A]} c(u, v). A minimum cut is a cut (A, V \ A) that has smallest weight c(A) among all cuts in G. We use λ(G) (or simply λ, when its meaning is clear) to denote the value of the minimum cut over all A ⊂ V. For two vertices s and t, we denote by λ(G, s, t) the smallest cut of G where s and t are on different sides of the cut. The connectivity λ(G, e) of an edge e = (s, t) is defined as λ(G, s, t), the connectivity of its incident vertices. This is also known as the minimum s-t-cut of the graph or the connectivity of the vertices s and t. At any point in the execution of a minimum cut algorithm, λ̂(G) (or simply λ̂) denotes the lowest upper bound of the minimum cut that the algorithm has discovered until that point. For a vertex u ∈ V with minimum vertex degree, the size of the trivial cut ({u}, V \ {u}) is equal to the vertex degree of u. Hence, the minimum vertex degree δ(G) can serve as an initial bound.

Many algorithms tackling the minimum cut problem use graph contraction. Given an edge (u, v) ∈ E, we define G/(u, v) to be the graph after contracting edge (u, v). In the contracted graph, we delete vertex v and all edges incident to this vertex. For each edge (v, w) ∈ E, we add an edge (u, w) with c(u, w) = c(v, w) to G or, if the edge already exists, we give it the edge weight c(u, w) + c(v, w).
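To make the contraction operation concrete, the following minimal C++ sketch (our illustration, not code from the paper's implementation) contracts an edge (u, v) in a weighted graph stored as adjacency maps, merging parallel edges by summing their weights and dropping the resulting self-loop:

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Weighted graph as adjacency maps: adj[v][w] = c(v, w).
    using Graph = std::vector<std::unordered_map<int, uint64_t>>;

    // Contract edge (u, v): redirect every edge (v, w) to (u, w), summing
    // weights of parallel edges and dropping the resulting self-loop.
    void contract(Graph& adj, int u, int v) {
        for (const auto& [w, weight] : adj[v]) {
            adj[w].erase(v);          // remove the old endpoint v
            if (w == u) continue;     // (u, v) itself becomes a self-loop
            adj[u][w] += weight;      // merge parallel edges by summing
            adj[w][u] += weight;      // keep the graph symmetric
        }
        adj[v].clear();               // v no longer exists
    }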
2.2 Related Work

We review algorithms for the global minimum cut and related problems. A closely related problem is the minimum s-t-cut problem, which asks for a minimum cut with nodes s and t in different partitions. Ford and Fulkerson [8] proved that a minimum s-t-cut is equal to a maximum s-t-flow. Gomory and Hu [11] observed that the (global) minimum cut can be computed with n − 1 minimum s-t-cut computations. For the following decades, this result by Gomory and Hu was used to find better algorithms for the global minimum cut using improved maximum flow algorithms [17]. One of the fastest known maximum flow algorithms is the push-relabel algorithm [10] by Goldberg and Tarjan. Hao and Orlin [12] adapt the push-relabel algorithm to pass information to future flow computations. When a push-relabel iteration is finished, they implicitly merge the source and sink to form a new sink and find a new source. Vertex heights are maintained over multiple iterations of push-relabel. With these techniques they achieve a total running time of O(mn log(n^2/m)) for a graph with n vertices and m edges, which is asymptotically equal to a single run of the push-relabel algorithm.

Padberg and Rinaldi [26] give a set of heuristics for edge contraction. Chekuri et al. [7] give an implementation of these heuristics that can be performed in time linear in the graph size. Using these heuristics it is possible to sparsify a graph while preserving at least one minimum cut in the graph. If their algorithm does not find an edge to contract, it performs a maximum flow computation, giving the algorithm a worst-case running time of O(n^4). However, the heuristics can also be used to improve the expected running time of other algorithms by applying them on interim graphs [7].

Nagamochi et al. [24, 25] give a minimum cut algorithm which does not use any flow computations. Instead, their algorithm uses maximum spanning forests to find a non-empty set of contractible edges. This contraction algorithm is run until the graph is contracted into a single node. The algorithm has a running time of O(mn + n^2 log n). Stoer and Wagner [33] give a simpler variant of the algorithm of Nagamochi, Ono and Ibaraki [25], which has the same asymptotic time complexity. The performance of this algorithm on real-world instances, however, is significantly worse than the performance of the algorithms of Nagamochi, Ono and Ibaraki or Hao and Orlin, as shown in experiments conducted by Jünger et al. [15]. Both the algorithms of Hao and Orlin, and Nagamochi, Ono and Ibaraki achieve close to linear running time on most benchmark instances [7, 15]. To the best of our knowledge, there are no parallel implementations of either algorithm, and neither of the two algorithms has a straightforward parallel implementation.

Kawarabayashi and Thorup [18] give a deterministic near-linear time algorithm for the minimum cut problem, which runs in O(m log^12 n). Their algorithm works by growing contractible regions using a variant of PageRank [28]. It was later improved by Henzinger et al. [14] to run in O(m log^2 n log log^2 n) time. Based on the algorithm of Nagamochi, Ono and Ibaraki, Matula [23] gives a (2 + ε)-approximation algorithm for the minimum cut problem. The algorithm contracts more edges than the algorithm of Nagamochi, Ono and Ibaraki to guarantee a linear time complexity while still guaranteeing a (2 + ε)-approximation factor. Karger and Stein [17] give a randomized Monte Carlo algorithm based on random edge contractions. This algorithm returns the minimum cut with high probability and a larger cut otherwise. In experiments, the algorithm was often outperformed by the algorithms of Nagamochi et al. and Hao and Orlin by orders of magnitude [7, 13, 15].
2.3 The Algorithm of Nagamochi, Ono and Ibaraki

We discuss the algorithm by Nagamochi, Ono and Ibaraki [24, 25] in greater detail, since our work makes use of the tools proposed by those authors. The intuition behind the algorithm is as follows: imagine you have an unweighted graph with minimum cut value exactly one. Then any spanning tree must contain at least one edge of each of the minimum cuts. Hence, after computing a spanning tree, every remaining edge can be contracted without losing the minimum cut. Nagamochi, Ono and Ibaraki extend this idea to the case where the graph can have edges with positive weight, as well as the case in which the minimum cut is bounded by λ̂. The first observation is the following: assume that you have already found a cut in the current graph of size λ̂ and you want to find out whether there is a cut of size < λ̂. Then the contraction process only needs to ensure that the contracted graph contains all cuts having a value strictly smaller than λ̂. To do so, Nagamochi, Ono and Ibaraki build edge-disjoint maximum spanning forests and contract all edges that are not in one of the first λ̂ − 1 of them. Note that the edge-disjoint maximum spanning forests certify for any edge e = {u, v} that is not in one of them that the minimum cut between u and v is at least λ̂. Hence, the edge can be "safely" contracted. As weights are integer, this guarantees that the contracted graph still contains all cuts that are strictly smaller than λ̂. Since it would be inefficient to directly compute λ̂ − 1 edge-disjoint maximum spanning forests, the algorithm instead identifies contractible edges with a modified graph traversal; the total running time is O(mn + n^2 log n). In experimental evaluations [7, 15, 13] it is one of the fastest exact minimum cut algorithms, both on real-world and generated instances.

We now dive into more details of the algorithm. To find contractible edges, the algorithm uses a modified breadth-first graph traversal (BFS) algorithm [24, 25]. More precisely, the algorithm starts at an arbitrary vertex. In each step, the algorithm visits (scans) the vertex v that is most strongly connected to the already visited vertices. For this purpose a priority queue Q is used, in which the connectivity strength r: V → R of each vertex to the already discovered vertices is used as a key. When scanning a vertex v, the value r(w) is kept up to date for every unscanned neighbor w of v by setting r(w) := r(w) + c(e). Moreover, for each such edge e = (v, w), the algorithm computes a lower bound q(e) for the connectivity, i.e. the smallest cut λ(G, v, w) which places v and w on different sides of the cut. More precisely, as shown in [25, 24], if the vertices are scanned in a certain order (the order used by the algorithm), then r(w) at the time of scanning is a lower bound on λ(G, v, w).

For an edge that has connectivity λ(G, v, w) ≥ λ̂, we know that there is no cut smaller than λ̂ that places v and w in different partitions. If an edge e is not in a given cut (A, V \ A), it can be contracted without affecting the cut. Thus, we can contract edges with connectivity at least λ̂ without losing any cuts smaller than λ̂. As q(e) ≤ λ(G, u, v) (lower bound), all edges with q(e) ≥ λ̂ are contracted. Afterwards, the algorithm continues on the contracted graph. A single iteration of the subroutine can be performed in O(m + n log n). The authors show that in each BFS run, at least one edge of the graph can be contracted [24]. This yields a total running time of O(mn + n^2 log n). However, in practice the number of iterations is typically much smaller than n − 1; often it is proportional to log n.
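As a small worked example (ours, not taken from the original papers): let G be the unweighted triangle on vertices a, b, c, so δ(G) = 2 and we initialize λ̂ = 2. Starting at a, the traversal sets r(b) = r(c) = 1 and q(a, b) = q(a, c) = 1. It then scans b (breaking the tie arbitrarily) and processes the edge (b, c): since r(c) = 1 < λ̂ ≤ r(c) + c(b, c) = 2, the edge receives q(b, c) = 2 ≥ λ̂ and is marked contractible. Indeed λ(G, b, c) = 2, so contracting (b, c) preserves the minimum cut value 2.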
2.4 VieCut

VieCut is a multilevel algorithm that uses a shared-memory parallel implementation of the label propagation algorithm [29] to find clusters with a strong intra-cluster connectivity. The algorithm then contracts these clusters, as it is assumed that the minimum cut does not split a cluster, since the vertices in a cluster are strongly interconnected with each other. This contraction is followed by a linear-work shared-memory run of the Padberg-Rinaldi local tests for contractible edges [26]. This whole process is repeated until the graph has only a constant number of vertices left and can be solved exactly by the algorithm of Nagamochi et al. [25].
While VieCut cannot guarantee optimality or even a small approximation ratio, in practice the algorithm finds near-optimal minimum cuts, often even the exact minimum cut, very quickly and in parallel. The algorithm can be implemented to have sequential running time O(n + m).

3 Fast Exact Minimum Cuts

In this section we detail our shared-memory algorithm for the minimum cut problem, which is based on the algorithms of Nagamochi et al. [24, 25] and Henzinger et al. [13]. We aim to modify the algorithm of Nagamochi et al. [25] in order to find exact minimum cuts faster and in parallel. Their algorithm uses a routine described above in Section 2.3, called CAPFOREST in their original work, in order to compute a lower bound q(e) on the connectivity λ(G, u, v) for each edge e = (u, v). If the connectivity between two vertices is larger than the current upper bound for the minimum cut, then the edge connecting them can be contracted. That also means that edges e with q(e) ≥ λ̂ can be safely contracted. The algorithm is guaranteed to find at least one such edge.

We start this section with optimizations to the sequential algorithm. First, we use a recently published inexact algorithm to lower the minimum cut upper bound λ̂. This enables us to save work and to perform contractions more quickly. We then give different implementations of the priority queue Q and detail the effects of the choice of queue on the algorithm. We show that the algorithm remains correct even if we limit the priorities in the queue to λ̂, meaning that elements in the queue having a key larger than that will not be updated. This significantly lowers the number of priority queue operations necessary. Then we adapt the algorithm so that we are able to detect contractible edges in parallel efficiently. Lastly, we put everything together and present a full system description.

3.1 Sequential Optimizations

Lowering the Upper Bound λ̂. Note that the upper bound λ̂ for the minimum cut is an important parameter for exact contraction-based algorithms such as the algorithm NOI of Nagamochi et al. [25]. The algorithm computes a lower bound for the connectivity of the two incident vertices of each edge and contracts all edges whose incident vertices have a connectivity of at least λ̂. Thus it is possible to contract more edges if we manage to lower λ̂ beforehand. A trivial upper bound λ̂ for the minimum cut is the minimum vertex degree, as it represents the trivial cut which separates the minimum degree vertex from all other vertices. We run VieCut to lower λ̂ in order to allow us to find more edges to contract. Although VieCut is an inexact algorithm, in most cases it already finds the minimum cut [13] of the graph. As there are by definition no cuts smaller than the minimum cut, the result of VieCut is guaranteed to be at least as large as the minimum cut λ. As we set λ̂ to the result of VieCut when running NOI, we can therefore guarantee a correct result. A similar idea is employed by the linear time (2 + ε)-approximation algorithm of Matula [23], which initializes the algorithm of Nagamochi et al. [25] with a bound λ̂ equal to a constant fraction (depending on ε) of the minimum degree.

Bounded Priority Queues. Whenever we visit a vertex, we update the priority of all of its neighbors in Q by adding the respective edge weight. Thus, in total we perform |E| priority queue increase-weight operations. In practice, many vertices reach priority values much higher than λ̂ and perform many priority increases until they reach their final value. We limit the values in the priority queue by λ̂, i.e. we do not update priorities that are already λ̂. Lemma 3.1 shows that this does not affect the correctness of the algorithm. Let q̃_G(e) be the value q(e) assigned to e in the modified algorithm on graph G, and let r̃_G(x) be the r-value of a node x in the modified algorithm on G.
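The bounded update is a small change in the scanning loop. The following C++ sketch is our illustration of it (the method names key and increase_key are placeholders, not the paper's API); the exact value r(y) is kept outside the queue, since it also yields q(e), and only the queue key is capped:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Bounded priority update for one scanned edge (x, y) with weight c_e.
    // PQ is any addressable max-priority queue offering key(v) and
    // increase_key(v, k).
    template <typename PQ>
    void update_neighbor(PQ& pq, std::vector<uint64_t>& r,
                         int y, uint64_t c_e, uint64_t lambda_hat) {
        r[y] += c_e;                          // q(e) is set to this exact value
        if (pq.key(y) >= lambda_hat) return;  // key already capped: skip update
        pq.increase_key(y, std::min(r[y], lambda_hat));
    }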
Lemma 3.1. Limiting the values in the priority queue Q used in the CAPFOREST routine to a maximum of λ̂ does not interfere with the correctness of the algorithm. For every edge e = (v, w) with q̃_G(e) ≥ λ̂, it holds that λ(G, e) ≥ λ̂. Therefore the edge can be contracted.

Proof. As we limit the priority queue Q to a maximum value of λ̂, we cannot guarantee that we always pop the element with the highest value r(v) if there are multiple elements that have values r(v) ≥ λ̂ in Q. However, we know that the vertex x that is popped from Q either is maximal or has r(x) ≥ λ̂.

We prove Lemma 3.1 by creating a graph G′ = (V, E, c′) by lowering edge weights (possibly to 0, effectively removing the edge) while running the algorithm, so that CAPFOREST on G′ visits vertices in the same order (assuming equal tie breaking) and assigns the same q values as the modified algorithm on G.

We first describe the construction of G′. We initialize the weight of all edges in graph G′ with the weight of the respective edge in G and run CAPFOREST on G′. Whenever we scan an edge e = (x, y) and would update r_{G′}(y) to a value above λ̂, i.e. when r_{G′}(y) + c(e) > λ̂, we set c′(e) in G′ to c(e) − (r_{G′}(y) + c(e) − λ̂) = λ̂ − r_{G′}(y), which is non-negative and lower than c(e) by exactly the amount by which the update would overshoot λ̂. Thus, after the update, r_{G′}(y) = λ̂. As we scan every edge exactly once in a run of CAPFOREST, the weights of edges already scanned remain constant afterwards. This completes the construction of G′.

Note that during the construction of G′, edge weights were only decreased and never increased. Thus it holds that λ(G′, x, y) ≤ λ(G, x, y) for any pair of nodes (x, y). If we ran the unmodified CAPFOREST algorithm on G′, each edge would be assigned a value q_{G′}(e) with q_{G′}(e) ≤ λ(G′, e). Thus for every edge e it holds that q_{G′}(e) ≤ λ(G′, e) ≤ λ(G, e).

Below we will show that q̃_G(e) = q_{G′}(e) for all edges e. It then follows that for all edges e it holds that q̃_G(e) ≤ λ(G, e). This implies that if q̃_G(e) ≥ λ̂ then λ(G, e) ≥ λ̂, which is what we needed to show.

It remains to show for all edges e that q̃_G(e) = q_{G′}(e). To show this claim we will show the following stronger claim. For any i with 1 ≤ i ≤ m, after the i-th scan of an edge the modified algorithm on G and the original algorithm on G′ with the same tie breaking have visited all nodes and scanned all edges up to now in the same order, and for all edges e it holds that q̃_G(e) = q_{G′}(e) (we assume that before scanning an edge e, q(e) = 0) and for all nodes x it holds that r̃_G(x) = r_{G′}(x). We show this claim by induction on i.

For i = 1 observe that before the first edge scan q̃_G(e) = q_{G′}(e) = 0 for all edges e, and the same node is picked as first node due to identical tie breaking and the fact that G = G′ at that point. Now consider i > 1 and assume that the claim holds after the (i − 1)-th edge scan. If one of the algorithms has to pick a new current node, then so does the other, as both have scanned all edges incident to the previous current node; due to identical tie breaking and r̃_G(x) = r_{G′}(x) for all nodes x, both pick the same new node y. Then both algorithms scan the same incident edge of y, as in both algorithms the set of unscanned neighbors of y is identical.
If neither algorithm has to pick a new node, then both have scanned the same edges of the same current node y and, due to identical tie breaking, will pick the same next edge to scan. Let this edge be (y, w). By induction r̃_G(w) = r_{G′}(w) at this time. As (y, w) is unscanned, c′(y, w) = c(y, w), which implies that r̃_G(w) + c(y, w) = r_{G′}(w) + c′(y, w). If r̃_G(w) + c(y, w) ≤ λ̂, then the modified algorithm on G and the original algorithm on G′ will set the r-value of w to the same value, namely r̃_G(w) + c(y, w). If r̃_G(w) + c(y, w) > λ̂, then r̃_G(w) is set to λ̂ and c′(y, w) is set to λ̂ − r_{G′}(w), which leads to r_{G′}(w) being set to λ̂. Thus r̃_G(w) = r_{G′}(w), and by induction r̃_G(x) = r_{G′}(x) for all x. Additionally, the modified algorithm on G sets q̃_G(y, w) = r̃_G(w) and the original algorithm on G′ sets q_{G′}(y, w) = r_{G′}(w). It follows that q̃_G(y, w) = q_{G′}(y, w) and, thus, by induction q̃_G(e) = q_{G′}(e) for all e. This completes the proof of the claim.

Lemma 3.1 allows us to considerably lower the number of priority queue operations, as we do not need to update priorities that are bigger than λ̂. This optimization has even more benefit in combination with running VieCut to lower the upper bound λ̂, as we directly lower the number of priority queue operations.

Nagamochi et al. [25] use an addressable priority queue Q in their algorithm to find contractible edges. In this section we address variants for the implementation of this priority queue. As the algorithm often has many elements with maximum priority in practice, the implementation of this priority queue can have a major impact on the order of vertex visits and thus also on the edges that will be marked contractible.
Bucket Priority Queue. As our algorithm limits the values in the priority queue to a maximum of λ̂, we observe integer priorities in the range [0, λ̂]. Hence, we can use a bucket queue that is implemented as an array with λ̂ + 1 buckets. In addition, the data structure keeps the id of the highest non-empty bucket, also known as the top bucket, and stores the position of each vertex in the priority queue. Priority updates can be implemented by deleting an element from its bucket and pushing it to the bucket with the updated priority. This allows constant time access for all operations except for deletions of the maximum priority element, which have to check all buckets between the prior top bucket and the new top bucket, possibly up to λ̂ checks. We give two possible implementations for the buckets so that they can store all elements with a given priority.

The first implementation, BStack, uses a dynamic array (std::vector) as the container for all elements in a bucket. When we add a new element, we push it to the back of the array. Q.pop_max() returns the last element of the top bucket. Thus our algorithm will always next visit the element whose priority it most recently increased. The algorithm therefore does not fully explore all vertices in a local region.

The other implementation, BQueue, uses a double-ended queue (std::deque) as the container instead. A new element is pushed to the back of the queue and Q.pop_max() returns the first element of the top bucket. This results in a variant of our algorithm which performs closer to a breadth-first search, in that it first explores the vertices that have been discovered earlier, i.e. are closer to the source vertex in the graph.
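A minimal sketch of such a bucket queue (our illustration, assuming integer keys in [0, max_key]; the real implementation additionally stores each vertex's position so that priority updates can delete and re-insert elements):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Bucket max-priority queue with integer keys in [0, max_key].
    // Popping from the back of a bucket gives the BStack (LIFO) variant;
    // using std::deque and popping from the front gives BQueue.
    class BucketPQ {
        std::vector<std::vector<int>> buckets; // buckets[k]: vertices with key k
        std::size_t top = 0;                   // highest possibly non-empty bucket
        std::size_t elements = 0;

    public:
        explicit BucketPQ(std::size_t max_key) : buckets(max_key + 1) {}

        void push(int v, std::size_t key) {
            buckets[key].push_back(v);
            top = std::max(top, key);
            ++elements;
        }

        int pop_max() {                          // assumes the queue is non-empty
            while (buckets[top].empty()) --top;  // scan down to the top bucket
            int v = buckets[top].back();         // back(): BStack behavior
            buckets[top].pop_back();
            --elements;
            return v;
        }

        bool empty() const { return elements == 0; }
    };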
Bottom-Up Binary Heap. A binary heap [36] is a binary tree (implemented as an array, where element i has its children at indices 2i and 2i + 1) which fulfills the heap property, i.e. each element has a priority that is not lower than that of either of its children. Thus the element with highest priority is the root of the tree. The tree can be made addressable by using an array of indices, in which we save the position of each vertex. We use a binary heap with the bottom-up heuristic [35], in which we sift down holes that were created by the deletion of the top priority vertex. Priority changes are implemented by sifting the addressed element up or down in the tree. Operations have a running time of up to O(log n) to sift an element up or down to fix the heap property. In Q.pop_max(), the Heap priority queue does not favor either old or new elements in the priority queue, and therefore this implementation can be seen as a middle ground between the two bucket priority queues.
3.2 Parallel CAPFOREST

We modify the algorithm in order to quickly find contractible edges using shared-memory parallelism. The pseudocode can be found in Algorithm 1; pseudocode for the original CAPFOREST algorithm can be found in Algorithm 3 in Appendix A.2. The proofs in this section show that the modifications do not violate the correctness of the algorithm. Detailed proofs for the original CAPFOREST algorithm and the modifications of Nagamochi et al. for weighted graphs can be found in [25].

Figure 1: Example run of Algorithm 1. Every process starts at a random vertex and scans the region around its start vertex. These regions do not overlap.

The idea of our algorithm is as follows. We aim to find contractible edges using shared-memory parallelism. Every processor selects a random vertex and runs Algorithm 1, which is a modified version of CAPFOREST [24, 25] where the priority values are limited to λ̂, the current upper bound on the size of the minimum cut. We want to find contractible edges without requiring that every process look at the whole graph. To achieve this, every vertex will only be visited by one process. Compared to limiting the number of vertices each process visits, this has the advantage that we also scan the vertices in sparse regions of the graph, which might otherwise not be scanned by any process. Figure 1 shows an example run of Algorithm 1 with p = 5: every process randomly chooses a start vertex and performs Algorithm 1 on it to "grow a region" of scanned vertices.

As we want to employ shared-memory parallelism to speed up the algorithm, we share an array T between all processes to denote whether a vertex has already been visited. Every process has a blacklist B to mark nodes which were already scanned by another process and are therefore not explored by this process. For every vertex v we keep a value r(v), which denotes the total weight of edges connecting v to already scanned vertices. Over the course of a run of the algorithm, every edge e = (v, w) is given a value q(e) (equal to r(w) right after scanning e) which is a lower bound for the smallest cut λ(G, v, w). We mark an edge e as contractible (more accurately, we union the incident vertices in the shared concurrent union-find data structure [1]) if q(e) ≥ λ̂. Note that this does not modify the graph; it just remembers which nodes to collapse. The actual node collapsing happens in a postprocessing step.
Algorithm 1 Parallel CAPFOREST
Input: G = (V, E, c) undirected graph; λ̂ upper bound for the minimum cut; T shared array of vertex visits
Output: U union-find data structure to mark contractible edges
1:  Label all vertices v ∈ V "unvisited", blacklist B empty, α ← 0
2:  ∀v ∈ V: r(v) ← 0
3:  ∀e ∈ E: q(e) ← 0
4:  Q ← empty priority queue
5:  Insert random vertex into Q
6:  while Q not empty do
7:      x ← Q.pop_max()               ▷ choose unvisited vertex with highest priority
8:      Mark x "visited"
9:      if T(x) = True then           ▷ every vertex is visited only once
10:         B(x) ← True
11:     else
12:         T(x) ← True
13:     end if
14:     α ← α + c(x) − 2r(x)
15:     λ̂ ← min(λ̂, α)
16:     for each edge e = (x, y) to a vertex y not in B and not visited do
17:         if r(y) < λ̂ ≤ r(y) + c(e) then
18:             U.union(x, y)         ▷ mark edge e to contract
19:         end if
20:         r(y) ← r(y) + c(e)
21:         q(e) ← r(y)
22:         Q(y) ← min(r(y), λ̂)
23:     end for
24: end while

Nagamochi and Ibaraki showed [25] that contracting only the edges that fulfill the condition in line 17 is equivalent to contracting all edges e with q(e) ≥ λ̂.

If a vertex v has already been visited by another process, it will not be visited by any other workers. A process that tries to visit v after it has already been visited locally blacklists v by setting B(v) to true and does not visit the vertex. Subsequently, no more edges incident to v will be marked contractible by this process. This is necessary to ensure the correctness of the algorithm. As the set of detected contractible edges is different depending on the start vertex, we looked into visiting every vertex by a number of processes up to a given parameter in order to find more contractible edges. However, this generally resulted in higher total running times, and thus we only visit every vertex once.

After all processes are finished, every vertex has been visited exactly once (or possibly zero times, if the graph is disconnected). On average, every process visits roughly n/p vertices and all processes finish at the same time. We do not perform any form of locking of the elements of T, as this would come with a running time penalty for every write, and the worst that can happen with concurrent writes is that a vertex is visited more often, which does not affect the correctness of the algorithm. However, as we terminate early and no process visits every vertex, we can no longer guarantee that the algorithm actually finds a contractible edge. In practice, this only happens if the graph is already very small (fewer than
50 vertices in all of our experiments). We can then run the CAPFOREST routine, which is guaranteed to find at least one edge to contract.

In lines 14 and 15 of Algorithm 1 we compute the value of the cut between the scanned and unscanned vertices and update λ̂ if this cut is smaller than it. For more details on this we refer the reader to [25].

In practice, many vertices reach values of r(y) that are much higher than λ̂ and would therefore need to update their priority in Q often. As previously detailed, we limit the values in the priority queue by λ̂ and do not update priorities that are already greater than or equal to λ̂. This allows us to considerably lower the number of priority queue operations per vertex.
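The shared marking step relies on a concurrent union-find structure in the spirit of Anderson and Woll [1]. The following is a minimal sketch (ours, simplified; the actual implementation referenced in the paper may differ) of a lock-free union-find with path halving using atomic compare-and-swap:

    #include <atomic>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Lock-free union-find with path halving; linking by id keeps the
    // sketch short (the wait-free structure of Anderson and Woll [1]
    // uses ranks instead).
    class ConcurrentUnionFind {
        std::vector<std::atomic<int>> parent;

    public:
        explicit ConcurrentUnionFind(std::size_t n) : parent(n) {
            for (std::size_t i = 0; i < n; ++i) parent[i] = static_cast<int>(i);
        }

        int find(int x) {
            while (true) {
                int p = parent[x].load();
                int gp = parent[p].load();
                if (p == gp) return p;                   // p is a root
                parent[x].compare_exchange_weak(p, gp);  // path halving
                x = gp;
            }
        }

        // Merge the sets of x and y; returns false if already merged.
        bool unite(int x, int y) {
            while (true) {
                int rx = find(x), ry = find(y);
                if (rx == ry) return false;
                if (rx < ry) std::swap(rx, ry);  // hang the larger id below
                int expected = rx;
                if (parent[rx].compare_exchange_strong(expected, ry)) return true;
                // CAS failed: another thread changed rx's parent; retry.
            }
        }
    };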
Theorem 3.1. Algorithm 1 is correct.
Proof. As Algorithm 1 is a modified variant of CAPFOREST [24, 25], we use the correctness of their algorithm and show that our modifications cannot result in incorrect results. In order to show this we need the following lemma:
Lemma 3.2.
1) Multiple instances of Algorithm 1 can be run in parallel with all instances sharing a parallel union-find data structure.
2) Early termination does not affect correctness.
3) For every edge e = (v, w), where neither v nor w are blacklisted, q(e) is a lower bound for the connectivity λ(G, v, w), even if the set of blacklisted vertices B is not empty.
4) When limiting the priority of a vertex in Q to λ̂, it still holds that the vertices incident to an edge e = (x, y) with q(e) ≥ λ̂ have connectivity λ(G, x, y) ≥ λ̂.

Proof. A run of the CAPFOREST algorithm finds a non-empty set of edges that can be contracted without contracting a cut with value less than λ̂ [24]. We show that none of our modifications can result in incorrect results:

1) The CAPFOREST routine can be started from an arbitrary vertex and finds a set of edges that can be contracted without affecting the minimum cut λ. This is true for any vertex v ∈ V. As we do not change the underlying graph but just mark contractible edges, correctness is obviously upheld when running the algorithm multiple times starting at different vertices. This is also true when running the different iterations in parallel, as long as the underlying graph is not changed. Marking the edge e = (u, v) as contractible is equivalent to performing a Union of vertices u and v. The Union operation in a union-find data structure is commutative, and therefore the order of unions is irrelevant for the final result. Thus performing the iterations successively has the same result as performing them in parallel.

2) Over the course of the algorithm we set a value q(e) for each edge e and we maintain a value λ̂ that never increases. We contract edges that have value q(e) ≥ λ̂ at the time when q(e) is set. For every edge, this value is set exactly once. If we terminate the algorithm prior to setting q(e) for all edges, the set of contracted edges is a subset of the set of edges that would be contracted in a full run, and all contracted edges e fulfill q(e) ≥ λ̂ at termination. Thus, no edge contraction contracts a cut that is smaller than λ̂.

3) Let e = (v, w) be an edge and let B_e be the set of nodes blacklisted at the time when e is scanned. We show that for an edge e = (v, w), q(e) ≤ λ(Ḡ, v, w), where Ḡ = (V̄, Ē) with vertices V̄ = V \ B_e and edges Ē = {(u, v) ∈ E : u ∉ B_e and v ∉ B_e} is the graph G with all blacklisted vertices and their incident edges removed. As the removal of vertices and edges cannot increase edge connectivities, q_Ḡ(e) ≤ λ(Ḡ, v, w) ≤ λ(G, v, w), and e is a contractible edge.

Whenever we visit a vertex b, we decide whether we blacklist the vertex. If we blacklist the vertex b, we immediately leave the vertex and do not change any values r(v) or q(e) for any other vertex or edge. As vertex b is marked as blacklisted, we will not visit the vertex again, and the edges incident to b only affect r(b). As edges incident to any of the vertices in B_e do not affect q(e), the value of q(e) in the algorithm with the blacklist on G is equal to the value of q(e) on Ḡ, which does not contain the blacklisted vertices in B_e and their incident edges. On Ḡ this is equivalent to a run of CAPFOREST without blacklisted vertices, and due to the correctness of CAPFOREST [25] we know that for every edge e ∈ Ē: q_Ḡ(e) ≤ λ(Ḡ, v, w) ≤ λ(G, v, w). Note that in Ḡ we only exclude the vertices that are in B_e.
It is possible that a node y that was unvisited when e was scanned might get blacklisted later; however, this does not affect the value of q(e), as the value q(e) is set when an edge is scanned and never modified afterwards.

4) Proof in Lemma 3.1.

We can combine the sub-proofs (3) and (4) by creating the graph Ḡ′, in which we remove all edges incident to blacklisted vertices and decrease edge weights to make sure no q(e) is strictly larger than λ̂. As we only lowered edge weights and removed edges, for every edge e = (u, v) between two non-blacklisted vertices, either q_G(e) ≤ λ(Ḡ′, u, v) ≤ λ(G, u, v) or q_G(e) > λ̂, and thus we only contract contractible edges. As none of our modifications can result in the contraction of edges that should not be contracted, Algorithm 1 is correct.
Parallel Graph Contraction. After using Algorithm 1 to find contractible edges, we use a concurrent hash table [22] to generate the contracted graph G_C = (V_C, E_C), in which each block in U is represented by a single vertex: first we assign each block a vertex ID in the contracted graph in [0, |V_C|). For each edge e = (u, v), we compute a hash of the block IDs of u and v to uniquely identify the edge in E_C. We use this identifier to compute the weights of all edges between blocks. If there are two blocks that each contain many vertices, there might be many edges between them, and if so, the hash table spends considerable time on synchronization. We thus compute the weight of the edge connecting two such heavy blocks locally on each process and sum the partial weights afterwards to reduce synchronization overhead. If the collapsed graph G_C has a minimum degree of less than λ̂, we update λ̂ to the value of this cut.
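A sequential sketch of this aggregation step (our illustration; the paper's implementation uses the concurrent hash table of [22] and per-process aggregation for heavy block pairs):

    #include <cstdint>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Edge { int u, v; uint64_t weight; };

    // Build the contracted edge set: block_id[v] maps each vertex to its
    // block (computed from the union-find), num_blocks = |V_C|. Parallel
    // edges between two blocks are merged by summing their weights;
    // self-loops inside a block are dropped. Assumes num_blocks < 2^32 so
    // the pair identifier fits into 64 bits.
    std::unordered_map<uint64_t, uint64_t>
    contract_graph(const std::vector<Edge>& edges,
                   const std::vector<int>& block_id, uint64_t num_blocks) {
        std::unordered_map<uint64_t, uint64_t> contracted; // pair id -> weight
        for (const Edge& e : edges) {
            uint64_t bu = block_id[e.u], bv = block_id[e.v];
            if (bu == bv) continue;              // edge inside a block
            if (bu > bv) std::swap(bu, bv);      // normalize direction
            uint64_t key = bu * num_blocks + bv; // unique pair identifier
            contracted[key] += e.weight;         // sum parallel edges
        }
        return contracted;
    }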
3.3 Putting Everything Together

Algorithm 2 Parallel Minimum Cut
Input: G = (V, E, c)
λ̂ ← VieCut(G), G_C ← G
while G_C has more than 2 vertices do
    λ̂ ← Parallel CAPFOREST(G_C, λ̂)
    if no edges marked contractible then
        λ̂ ← CAPFOREST(G_C, λ̂)
    end if
    G_C, λ̂ ← Parallel Graph Contract(G_C)
end while
return λ̂

Algorithm 2 shows the overall structure of the algorithm. We first run VieCut to find a good upper bound λ̂ for the minimum cut. Afterwards, we run Algorithm 1 to find contractible edges. In the unlikely case that none were found, we run CAPFOREST [25] sequentially to find at least one contractible edge. We create a new contracted graph using parallel graph contraction, as shown in Section 3.2. This process is repeated until the graph has only two vertices left. Whenever we encounter a collapsed vertex with a degree lower than λ̂, we update the upper bound. We return the smallest cut we encounter in this process. If we also want to output the minimum cut itself, then for each collapsed vertex v_C in G_C we store which vertices of G are included in v_C. When we update λ̂, we store which vertices are contained in the minimum cut. This allows us to see which vertices are on one side of the cut.

4 Experiments

4.1 Experimental Setup

We implemented the algorithms using C++17 and compiled all codes using g++ 7.1.0 with full optimization (-O3). Our experiments are conducted on a machine with two Intel Xeon E5-2643 v4 CPUs with 3.4 GHz and 6 CPU cores each, and 1.5 TB RAM in total. We perform five repetitions per instance and report the average running time.

Performance plots relate the fastest running time to the running time of each other algorithm on a per-instance basis. For each algorithm, these ratios are sorted in increasing order. The plots show the ratio t_best/t_algorithm on the y-axis. A point close to zero indicates that the running time of the algorithm was considerably worse than that of the fastest algorithm on the same instance. A value of one indicates that the corresponding algorithm was one of the fastest algorithms to compute the solution. Thus an algorithm is considered to outperform another algorithm if its corresponding ratio values are above those of the other algorithm. In order to include instances that were too big for an algorithm, i.e. some implementations are limited to 32-bit integers, we set the corresponding ratio below zero.
Algorithms. There have been multiple experimental studies that compare exact algorithms for the minimum cut problem [7, 13, 15]. All of these studies report that the algorithm of Nagamochi et al. and the algorithm of Hao and Orlin outperform other algorithms, such as the algorithm of Karger and Stein [17] or the algorithm of Stoer and Wagner [33], often by multiple orders of magnitude. Among others, we compare ourselves against two available implementations of the sequential algorithm of Nagamochi et al. [24, 25].

Figure 2: Total running time in nanoseconds per edge on RHG graphs, one panel per average node degree.

Figure 3: Total running time on real-world graphs, normalized by the running time of NOI λ̂-Heap-VieCut.

Figure 4: Performance plot for all graphs (legend shared with Figure 3).

Henzinger et al. [13] give an implementation of the algorithm of Nagamochi et al. [24, 25], written in C++ (NOI-HNSS), that uses a binary heap. We use this algorithm with small optimizations in the priority queue as the base of our implementation. Chekuri et al. [7] give an implementation of the flow-based algorithm of Hao and Orlin using all optimizations given in the paper (variant ho in [7]), implemented in C, denoted in our experiments as HO-CGKLS. They also give an implementation of the algorithm of Nagamochi et al. [24, 25], denoted as
NOI-CGKLS, which uses a heap as its priority queue data structure (variant ni-nopr in [7]). As their implementations use signed 32-bit integers as edge ids, we include their algorithms only for graphs that have fewer than 2^31 edges. Most of our discussion focuses on comparisons to the NOI-HNSS implementation, as it outperforms the implementations by Chekuri et al.

Gianinazzi et al. [9] give an MPI implementation of the algorithm of Karger and Stein [17]. We performed preliminary experiments on small graphs which can be solved by
NOI-HNSS, NOI-CGKLS and HO-CGKLS in less than 3 seconds. On these graphs, their implementation using 24 processes took more than 5 minutes, which matches other studies [7, 15, 13] that report bad real-world performance of (other implementations of) the algorithm of Karger and Stein. Gianinazzi et al. report a running time of 5 seconds for RMAT graphs with n = 16000 and an average degree of 4000, using 1536 cores. As NOI can find the minimum cut on RMAT graphs [19] of equal size in less than 2 seconds using a single core, we do not include the implementation of [9] in our experiments. As our algorithm solves the minimum cut problem exactly, we do not include the (2 + ε)-approximation algorithm of Matula [23] and the inexact algorithm VieCut in the experiments.
Instances.
We use a set of graph instances from the experimental study of Henzinger et al. [13]. The set of instances contains k-cores [3] of large undirected real-world graphs taken from the 10th DIMACS Implementation Challenge [2] as well as from the Laboratory for Web Algorithmics [4, 5]. Additionally, it contains large random hyperbolic graphs [20, 34] whose vertex counts and average degrees are powers of two. A detailed description of the graph instances is given in Appendix A. These graphs are unweighted; however, the contracted graphs that are created in the course of the algorithm have edge weights.

Figure 5: Scaling plots for five large graphs (gsh-2015-host, k = 10 (λ = 1); uk-2007-05, k = 10 (λ = 1); twitter-2010, k = 50 (λ = 3); rhg_25_8_1 (λ = 118); rhg_25_8_2 (λ = 73)). Top: scalability. Bottom: speedup compared to NOI-HNSS and the fastest sequential algorithm (first three graphs: NOI λ̂-BStack; last two graphs: NOI λ̂-Heap).

4.2 Sequential Experiments

We limit the values in the priority queue Q to λ̂ in order to significantly lower the number of priority queue operations needed to run the contraction routine. In this experiment, we want to examine the effects that different priority queue implementations and limiting the priority queue values have on sequential minimum cut computations. We also include variants which run VieCut first to lower λ̂. We start with sequential experiments using the implementation of NOI-HNSS. We use two variants:
NOI λ̂ limits values in the priority queue to λ̂, while NOI-HNSS allows arbitrarily large values in Q. For NOI λ̂, we test the three priority queue implementations BQueue, Heap and BStack. As the priority queue for NOI-HNSS has priorities of up to the maximum degree of the graph, and the contracted graphs can have very large degrees, the bucket priority queues are not suitable for NOI-HNSS; therefore we only use the original binary heap implementation of NOI-HNSS [13]. The variants NOI-HNSS-VieCut and NOI λ̂-Heap-VieCut first run the shared-memory parallel algorithm VieCut using all 24 threads to lower λ̂ before running the respective sequential algorithm. We report the total running time, i.e. the sum of the running times of VieCut and NOI.
Priority Queue Implementations. Figure 2 shows the results for RHG graphs and Figure 3 shows the results for real-world graphs, normalized by the running time of NOI λ̂-Heap-VieCut. Figure 4 gives performance plots for all graphs. We can see that in nearly all sequential runs, NOI λ̂-BStack is 5 − 10% faster than NOI λ̂-BQueue. This can be explained by the fact that this priority queue uses std::vector instead of std::deque as its underlying data structure and thus has lower access times to add and remove elements. As all vertices are visited by the only thread, the scan order does not greatly influence how many edges are contracted.

In the RHG graphs, nearly no vertices in NOI-HNSS reach priorities in Q that are much larger than λ̂. Usually, less than 5% of edges do not incur an update in Q. Thus, NOI-HNSS and NOI λ̂-Heap have practically the same running time. NOI λ̂-BStack is usually 5% slower. As the real-world graphs are social networks and web graphs, they contain vertices with very high degrees. In these vertices, NOI-HNSS often reaches priority values much higher than λ̂, and NOI λ̂ can actually save priority queue operations. Thus, NOI λ̂-Heap is up to 1.83 times faster than NOI-HNSS, with an average (geometric) speedup factor of 1.35. Also, in contrast to the RHG graphs, NOI λ̂-BStack is faster than NOI-HNSS on real-world graphs. Due to the low diameter of web and social graphs, the number of vertices in Q is very large. This favors the BStack priority queue, as it has constant access times. The average geometric speedup of NOI λ̂-BStack compared to NOI λ̂-Heap on these graphs is greater than 1.

Reduction of λ̂ by VieCut. Now we reduce λ̂ by VieCut before running NOI. While the other algorithms are slower for denser RHG graphs, NOI-HNSS-VieCut and NOI λ̂-Heap-VieCut are faster in these graphs with higher density. This happens as the variants without VieCut find fewer contractible edges and therefore need more rounds of CAPFOREST. The highest speedup compared to NOI λ̂-Heap is reached in the largest and densest RHG graphs, where NOI λ̂-Heap-VieCut has a speedup of factor 4. NOI λ̂-Heap-VieCut is fastest on most real-world graphs; however, when the minimum degree is very close to the minimum cut λ, running VieCut cannot significantly lower λ̂. Then the extra work to run VieCut takes longer than the time saved by lowering the upper bound λ̂. Averaged (geometric mean) over all graphs, NOI λ̂-Heap-VieCut is faster than the variant without VieCut. NOI λ̂-Heap-VieCut is fastest or close to the fastest algorithm on all but the very sparse graphs, in which the algorithm of Nagamochi et al. [25] is already very fast [13] and therefore using VieCut cannot sufficiently lower λ̂ and thus the running time of the algorithm. NOI-CGKLS and HO-CGKLS are outperformed on all graphs.
4.3 Shared-memory Parallelism

We run experiments on 5 of the largest graphs in the data sets using up to 24 threads on 12 cores. First, we compare the performance of Algorithm 2 using different priority queues: ParCut λ̂-Heap, ParCut λ̂-BStack and ParCut λ̂-BQueue all limit the priorities to λ̂, the result of VieCut. Figure 5 shows the results of these scaling experiments. The top row shows how well the algorithms scale with an increased number of processors. The lower row shows the speedup compared to the fastest sequential algorithm of Section 4.2. On all graphs, ParCut λ̂-BQueue has the highest speedup when using 24 threads. On real-world graphs, ParCut λ̂-BQueue also has the lowest total running time. In the large RHG graphs, in which the priority queue is usually only filled with up to 1000 elements, the worse constants of the double-ended queue cause the variant to be slightly slower than ParCut λ̂-Heap, even when running with 24 threads. In the two large real-world graphs that have a minimum degree of 10, the sequential algorithm NOI λ̂-BStack contracts most edges in a single run of CAPFOREST; due to the low minimum degree, the number of priority queue operations per vertex is also very low. Thus, ParCut λ̂ using only a single thread has a significantly higher running time, as it runs VieCut first and performs graph contraction using a concurrent hash table, as described in Section 3.2, which is slower than sequential graph contraction when using just one thread. In graphs with higher minimum degree, NOI needs to perform multiple runs of CAPFOREST. By lowering λ̂ using VieCut we can contract significantly more edges and achieve a speedup factor of more than 12 compared to NOI λ̂-Heap. On twitter-2010, k = 50, ParCut λ̂-BQueue has a speedup of more than 10 compared to NOI-HNSS, more than 16 compared to NOI-CGKLS, and more than 25 compared to HO-CGKLS. The other graphs have more than 2^31 edges and are thus too large for NOI-CGKLS and
HO-CGKLS.

5 Conclusion

We presented a shared-memory parallel exact algorithm for the minimum cut problem. Our algorithm is based on the algorithms of Nagamochi et al. [24, 25] and of Henzinger et al. [13]. We use different data structures and optimizations to decrease the running time of the algorithm of Nagamochi et al. by a factor of more than 2, and our shared-memory parallel algorithm yields further speedups. Future work includes checking whether our sequential optimizations and parallel implementation can be applied to the (2 + ε)-approximation algorithm of Matula [23].
References

[1] R. J. Anderson and H. Woll. Wait-free parallel algorithms for the union-find problem. In Proceedings of the Twenty-Third Annual ACM Symposium on Theory of Computing, STOC '91, pages 370–380. ACM, 1991.
[2] D. Bader, A. Kappes, H. Meyerhenke, P. Sanders, C. Schulz, and D. Wagner. Benchmarking for Graph Clustering and Partitioning. In Encyclopedia of Social Network Analysis and Mining. Springer, 2014.
[3] V. Batagelj and M. Zaversnik. An O(m) algorithm for cores decomposition of networks. arXiv preprint cs/0310049, 2003.
[4] P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In S. Srinivasan, K. Ramamritham, A. Kumar, M. P. Ravindra, E. Bertino, and R. Kumar, editors, Proceedings of the 20th International Conference on World Wide Web, pages 587–596. ACM Press, 2011.
[5] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In Proceedings of the Thirteenth International World Wide Web Conference (WWW 2004), pages 595–601, Manhattan, USA, 2004. ACM Press.
[6] D. Chakrabarti and C. Faloutsos. Graph mining: Laws, generators, and algorithms. ACM Computing Surveys, 38(1):2, 2006.
[7] C. S. Chekuri, A. V. Goldberg, D. R. Karger, M. S. Levine, and C. Stein. Experimental study of minimum cut algorithms. In Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '97), pages 324–333. SIAM, 1997.
[8] L. R. Ford and D. R. Fulkerson. Maximal flow through a network. Canadian Journal of Mathematics, 8(3):399–404, 1956.
[9] L. Gianinazzi, P. Kalvoda, A. De Palma, M. Besta, and T. Hoefler. Communication-avoiding parallel minimum cuts and connected components. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 219–232. ACM, 2018.
[10] A. V. Goldberg and R. E. Tarjan. A new approach to the maximum-flow problem. Journal of the ACM, 35(4):921–940, 1988.
[11] R. E. Gomory and T. C. Hu. Multi-terminal network flows. Journal of the Society for Industrial and Applied Mathematics, 9(4):551–570, 1961.
[12] J. Hao and J. B. Orlin. A faster algorithm for finding the minimum cut in a graph. In Proceedings of the 3rd Annual ACM-SIAM Symposium on Discrete Algorithms, pages 165–174. Society for Industrial and Applied Mathematics, 1992.
[13] M. Henzinger, A. Noe, C. Schulz, and D. Strash. Practical minimum cut algorithms. In Proceedings of the 20th Workshop on Algorithm Engineering and Experiments (ALENEX 2018), pages 48–61. SIAM, 2018.
[14] M. Henzinger, S. Rao, and D. Wang. Local flow partitioning for faster edge connectivity. In Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1919–1938. SIAM, 2017.
[15] M. Jünger, G. Rinaldi, and S. Thienel. Practical performance of efficient minimum cut algorithms. Algorithmica, 26(1):172–195, 2000.
[16] D. R. Karger. A randomized fully polynomial time approximation scheme for the all-terminal network reliability problem. SIAM Review, 43(3):499–522, 2001.
[17] D. R. Karger and C. Stein. A new approach to the minimum cut problem. Journal of the ACM, 43(4):601–640, 1996.
[18] K.-i. Kawarabayashi and M. Thorup. Deterministic global minimum cut of a simple graph in near-linear time. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing, pages 665–674. ACM, 2015.
[19] F. Khorasani, R. Gupta, and L. N. Bhuyan. Scalable SIMD-efficient graph processing on GPUs. In Proceedings of the 24th International Conference on Parallel Architectures and Compilation Techniques, PACT '15, pages 39–50, 2015.
[20] D. Krioukov, F. Papadopoulos, M. Kitsak, A. Vahdat, and M. Boguñá. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106, 2010.
[21] B. Krishnamurthy. An improved min-cut algorithm for partitioning VLSI networks. IEEE Transactions on Computers, 33(5):438–446, 1984.
[22] T. Maier, P. Sanders, and R. Dementiev. Concurrent hash tables: Fast and general?(!). In ACM SIGPLAN Notices, volume 51, page 34. ACM, 2016.
[23] D. W. Matula. A linear time 2 + ε approximation algorithm for edge connectivity. In Proceedings of the 4th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 500–504. SIAM, 1993.
[24] H. Nagamochi and T. Ibaraki. Computing edge-connectivity in multigraphs and capacitated graphs. SIAM Journal on Discrete Mathematics, 5(1):54–66, 1992.
[25] H. Nagamochi, T. Ono, and T. Ibaraki. Implementing an efficient minimum capacity cut algorithm. Mathematical Programming, 67(1):325–341, 1994.
[26] M. Padberg and G. Rinaldi. An efficient algorithm for the minimum capacity cut problem. Mathematical Programming, 47(1):19–36, 1990.
[27] M. Padberg and G. Rinaldi. A branch-and-cut algorithm for the resolution of large-scale symmetric traveling salesman problems. SIAM Review, 33(1):60–100, 1991.
[28] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
[29] U. N. Raghavan, R. Albert, and S. Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[30] A. Ramanathan and C. J. Colbourn. Counting almost minimum cutsets with reliability applications. Mathematical Programming, 39(3):253–261, 1987.
[31] S. B. Seidman. Network structure and minimum degree. Social Networks, 5(3):269–287, 1983.
[32] C. L. Staudt, A. Sazonovs, and H. Meyerhenke. NetworKit: An interactive tool suite for high-performance network analysis. CoRR, abs/1403.3005, 2014.
[33] M. Stoer and F. Wagner. A simple min-cut algorithm. Journal of the ACM, 44(4):585–591, 1997.
[34] M. von Looz, H. Meyerhenke, and R. Prutkin. Generating random hyperbolic graphs in subquadratic time. In Proceedings of the 26th International Symposium on Algorithms and Computation (ISAAC 2015), volume 9472 of LNCS, pages 467–478. Springer, 2015.
[35] I. Wegener. Bottom-up-heapsort, a new variant of heapsort beating, on an average, quicksort (if n is not very small). Theoretical Computer Science, 118(1):81–98, 1993.
[36] J. W. J. Williams. Heapsort. Communications of the ACM, 7(6):347–348, 1964.
A Instances and CAPFOREST Pseudocode

A.1 Random Hyperbolic Graphs (RHG) [20]
Random hyperbolic graphs replicate many features of real-world networks [6]: the degree distribution follows a power law, they often exhibit a community structure, and they have a small diameter. In denser hyperbolic graphs, the minimum cut is often equal to the minimum degree, which results in a trivial minimum cut. In order to prevent trivial minimum cuts, we use a power law exponent of 5. We use the generator of von Looz et al. [34], which is a part of NetworKit [32], to generate unweighted random hyperbolic graphs whose vertex counts and average vertex degrees are powers of two. These graphs generally have very few small cuts, and the minimum cut has two partitions of similar size.

A.2 Real-world Graphs
We use large real-world web graphs and social networks from [2, 4, 5], detailed in Table 1. The minimum cut problem on these web and social graphs can be seen as a network reliability problem. As these graphs are generally disconnected and contain vertices with very low degree, we use a k-core decomposition [31, 3] to generate versions of the graphs with a minimum degree of k. The k-core of a graph G = (V, E) is the maximum subgraph G′ = (V′, E′) with V′ ⊆ V and E′ ⊆ E which fulfills the condition that every vertex in G′ has a degree of at least k. We perform our experiments on the largest connected component of G′. For every real-world graph we use, we compute a set of 4 different k-cores in which the minimum cut is not equal to the minimum degree.

Table 1: k-cores used in experiments with their respective minimum cuts. For each graph, n and m give the size of the original graph; for each core, n', m', λ and δ give the size, minimum cut and minimum degree of the core.

Graph                        n      m    |    k      n'      m'    λ     δ
hollywood-2011
com-orkut
uk-2002 [2, 4, 5]          18M    262M   |   10      9M    226M    1    10
                                         |   30    2.5M    115M    1    30
                                         |   50    783K     51M    1    50
                                         |  100     98K     11M    1   100
twitter-2010 [4, 5]        42M    1.2B   |   25     13M    958M    1    25
                                         |   30     10M    884M    1    30
                                         |   50    4.3M    672M    3    50
                                         |   60    3.5M    625M    3    60
gsh-2015-host [4, 5]       69M    1.8B   |   10     25M    1.3B    1    10
                                         |   50    5.3M    944M    1    50
                                         |  100    2.6M    778M    1   100
                                         | 1000    104K    188M    1  1000
uk-2007-05

We generate a diverse set of graphs with different sizes. On the large graphs gsh-2015-host and uk-2007-05, we use cores with k in 10, 50, 100 and 1000. On the smaller graphs we use cores with k in 10, 30, 50 and 100. twitter-2010 and com-orkut only had few cores where the minimum cut is not trivial; therefore we used those cores. As hollywood-2011 is very dense, we multiplied the k values of all its cores by a factor of 2.
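The O(m) core decomposition of Batagelj and Zaversnik [3] computes all cores at once by repeatedly peeling a vertex of minimum degree. The following simpler C++ sketch (our illustration, not the implementation used in the experiments) extracts a single k-core by peeling all vertices of degree below k; it also runs in O(n + m):

    #include <queue>
    #include <vector>

    // Compute which vertices belong to the k-core of an undirected graph
    // by repeatedly removing vertices of degree < k.
    std::vector<bool> k_core(const std::vector<std::vector<int>>& adj, int k) {
        const int n = static_cast<int>(adj.size());
        std::vector<int> degree(n);
        std::vector<bool> in_core(n, true);
        std::queue<int> to_remove;

        for (int v = 0; v < n; ++v) {
            degree[v] = static_cast<int>(adj[v].size());
            if (degree[v] < k) to_remove.push(v);
        }
        while (!to_remove.empty()) {
            int v = to_remove.front(); to_remove.pop();
            if (!in_core[v]) continue;         // already peeled
            in_core[v] = false;
            for (int w : adj[v])
                if (in_core[w] && --degree[w] == k - 1)
                    to_remove.push(w);         // w just dropped below k
        }
        return in_core;
    }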
Algorithm 3 CAPFOREST
Input: G = (V, E, c) undirected graph with integer edge weights; λ̂ upper bound for the minimum cut
Output: T forest of contractible edges
1:  Label all vertices v ∈ V "unvisited", blacklist B empty, α ← 0
2:  ∀v ∈ V: r(v) ← 0
3:  ∀e ∈ E: q(e) ← 0
4:  Q ← empty priority queue
5:  Insert random vertex into Q
6:  while Q not empty do
7:      x ← Q.pop_max()               ▷ choose unvisited vertex with highest priority
8:      α ← α + c(x) − 2r(x)
9:      λ̂ ← min(λ̂, α)
10:     for each edge e = (x, y) to a vertex y not in B and not visited do
11:         if r(y) < λ̂ ≤ r(y) + c(e) then
12:             T ← T ∪ {e}           ▷ mark edge e to contract
13:         end if
14:         r(y) ← r(y) + c(e)
15:         q(e) ← r(y)
16:         Q(y) ← r(y)
17:     end for
18:     Mark x "visited"
19: end while