Parallel Minimum Cuts in O(m log^2(n)) Work and Low Depth
Daniel Anderson, Carnegie Mellon University ([email protected])
Guy E. Blelloch, Carnegie Mellon University ([email protected])
Abstract
We present an O(m log^2(n)) work, O(polylog(n)) depth parallel algorithm for minimum cut. This algorithm matches the work of a recent sequential algorithm by Gawrychowski, Mozes, and Weimann [ICALP'20, (2020), 57:1-57:15], and improves on the previously best known parallel algorithm by Geissmann and Gianinazzi [SPAA'18, (2018), pp. 1-11], which performs O(m log^4(n)) work in O(polylog(n)) depth. Our algorithm makes use of three components that might be of independent interest. Firstly, we design a parallel data structure for dynamic trees that solves mixed batches of queries and weight updates in low depth. It generalizes and improves the work bounds of a previous data structure of Geissmann and Gianinazzi and is work efficient with respect to the best sequential algorithm. Secondly, we design a parallel algorithm for approximate minimum cut that improves on previous results by Karger and Motwani. We use this algorithm to give a work-efficient procedure to produce a tree packing, as in Karger's sequential algorithm for minimum cuts. Lastly, we design a work-efficient parallel algorithm for solving the minimum 2-respecting cut problem.

Introduction
The minimum cut problem is one of the classic problems in graph theory and algorithms. The problem is to find, given an undirected weighted graph G = (V, E), a subset of vertices S ⊂ V such that the total weight of the edges crossing from S to V \ S is minimized. Early approaches to the problem were based on reductions to maximum s-t flows [15, 16]. Several algorithms followed which were based on edge contraction [30, 31, 20, 25]. Karger first observed that tree packings [32] can be used to find minimum cuts [22]. In particular, for a graph with n vertices and m edges, Karger showed how to generate a set of O(log n) spanning trees such that, with high probability, the minimum cut crosses at most two edges of one of them. The second ingredient is then an O(m log^2(n)) time algorithm to find the so-called minimum 2-respecting cut of each of these spanning trees, yielding an O(m log^3(n)) time algorithm for minimum cut. Karger [22] also describes a parallel algorithm for finding the minimum 2-respecting cut in O(n^2) work and O(log^3(n)) depth.

Until very recently, these were the state-of-the-art sequential and parallel algorithms for the weighted minimum cut problem. A new wave of interest in the problem has recently pushed these frontiers. Geissmann and Gianinazzi [13] design a parallel algorithm for minimum 2-respecting cuts that performs O(m log^4(n)) work in O(log^3(n)) depth. Their algorithm is based on parallelizing Karger's algorithm by replacing a sequential data structure for the so-called minimum path problem, based on dynamic trees, with a data structure that can evaluate a batch of updates and queries simultaneously in low depth. Their algorithm performs just a factor of O(log(n)) additional work over Karger's sequential algorithm, but substantially improves on the work of Karger's parallel algorithm.

Even more recently, a breakthrough from Gawrychowski, Mozes, and Weimann [11] gave an O(m log^2(n)) algorithm for minimum cuts.
Their algorithm is also based on Karger's algorithm, and achieves the O(log(n)) speedup by designing an O(m log(n)) algorithm for finding the minimum 2-respecting cuts, which was the bottleneck of Karger's algorithm. This is the first result to beat Karger's seminal algorithm in over 20 years.

To generate the O(log n) spanning trees, Karger used a combination of random sampling [20] and a modification of a tree packing algorithm of Gabow [10]. The random sampling requires a constant approximation to the minimum cut, which is the most challenging part to parallelize. Karger and Motwani give a parallel algorithm for approximating the cut that runs with O(m^2/n) work in polylogarithmic depth [24].

In our work, we combine ideas from Gawrychowski et al. and Geissmann et al. with several new techniques to close the gap between the parallel and sequential algorithms. Our contribution can be summarized by:

Theorem 1.
Given a weighted, undirected graph G, there exists a parallel algorithm that, with high probability, computes the minimum cut of G in O(m log^2(n)) work and O(log^3(n)) depth.

We achieve this using a combination of results that may be of independent interest. Firstly, we design a framework for evaluating mixed batches of updates and queries on trees work efficiently and in low depth. This algorithm is based on parallel tree contraction [28] and parallel rake-compress trees (RC trees) [1]. Roughly, we say that a set of update and query operations implemented on an RC tree is simple (defined formally in Section 3) if the updates maintain values at the leaves that are modified by an associative operation and combined at the internal nodes, and the queries read only the nodes on a root-to-leaf path and their children. Simple operation sets include updates and queries on path and subtree weights.
Theorem 2.
Given a constant-degree RC tree of size n and a simple operation set, after O(n) work and O(log n) depth preprocessing, every following batch of k operations from the operation set can be processed in O(k log(k + n)) work and O(log(n) log(k)) depth. The total space required is O(n + k_max), where k_max is the maximum size of a batch.

This result generalizes and improves on previous results by Geissmann and Gianinazzi, who give an algorithm for evaluating a batch of k path-weight updates and queries in Ω(k log^3(n)) work.

Next, we design a faster parallel algorithm for approximating minimum cuts, which is used as an ingredient in producing the tree packing used in Karger's approach (Section 4). To achieve this, we design a faster sampling scheme for producing graph skeletons, leveraging recent results on sampling binomial random variables, and a transformation that reduces the maximum edge weight of the graph to O(m log(n)) while preserving an approximate minimum cut.

Lastly, we show how to solve the minimum 2-respecting cut problem work-efficiently in parallel, using our new parallel dynamic tree algorithms combined with the use of RC trees to efficiently perform a divide-and-conquer search over the edges of the 2-constraining trees (Section 5).

Theorem 3.
There exists an algorithm that, given a weighted, undirected graph G and a rooted spanning tree T, computes the minimum 2-respecting cut of G with respect to T in O(m log(n)) work and O(polylog(n)) depth w.h.p.

Application to the unweighted problem.
The unweighted minimum cut problem, or edge connectivity problem, was recently improved by Ghaffari, Nowicki, and Thorup [14], who give an O(m log(n) + n log^4(n)) work and O(log^3(n)) depth randomized algorithm which uses Geissmann and Gianinazzi's algorithm as a subroutine. By plugging our improved algorithm into Ghaffari, Nowicki, and Thorup's algorithm, we obtain an algorithm for unweighted minimum cut that runs in O(m log(n) + n log^2(n)) work and O(polylog(n)) depth w.h.p.

Model of computation.
We analyze algorithms in the work-depth model using fork-join-style parallelism. A procedure can fork off another procedure call to run in parallel, and then wait for forked procedures to complete with a join. Work is defined as the total number of instructions performed by the algorithm, and depth (also called span) is the length of the longest chain of sequentially dependent instructions [5]. The model can work-efficiently cross-simulate the classic CRCW PRAM model [5] and the more recent binary forking model [6] with at most a logarithmic-factor difference in the depth.
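As a toy accounting sketch (ours, not from the paper), the work and depth of a balanced fork-join reduction can be tallied explicitly: the two recursive calls are independent and could be forked, so their work adds while their depth takes a maximum.

```python
# Illustration (not from the paper): counting work and depth of a
# fork-join style divide-and-conquer sum. Work is the total number of
# "instructions" (here, additions); depth is the longest chain of
# sequentially dependent additions. A real fork-join runtime would run
# the two recursive calls in parallel; here we only account their costs.

def reduce_sum(xs):
    """Returns (total, work, depth) for a balanced fork-join sum."""
    if len(xs) == 1:
        return xs[0], 0, 0
    mid = len(xs) // 2
    # The two halves are independent, so they could be forked.
    left, wl, dl = reduce_sum(xs[:mid])
    right, wr, dr = reduce_sum(xs[mid:])
    # One addition joins the results: work adds up, depth takes the max.
    return left + right, wl + wr + 1, max(dl, dr) + 1

total, work, depth = reduce_sum(list(range(16)))
```

For n = 16 inputs this performs n - 1 = 15 additions of work but only log2(n) = 4 of depth, the usual O(n) work, O(log n) depth reduction.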
Randomness.
We say that a statement happens with high probability (w.h.p.) in n if, for any constant c, the constants in the statement can be set such that the probability that the event fails to hold is O(n^{-c}). In line with Karger's work on random sampling [21], we assume that we can generate O(1) random bits in O(1) time. Since some of the subroutines we use require random Θ(log(n))-bit words, these take O(log(n)) work to generate. We can assume that the depth is unaffected since we can always pre-generate the anticipated number of random words in parallel at the beginning of our algorithms.

Our algorithms are Monte Carlo, i.e., correct w.h.p. but running in a deterministic amount of time. We can use Las Vegas algorithms, which are fast w.h.p. but always correct, as subroutines, because any Las Vegas algorithm can be converted into a Monte Carlo algorithm by halting and returning an arbitrary answer after the desired time bound. Note that it is not always possible to convert a Monte Carlo algorithm into a Las Vegas one unless a fast algorithm for verifying a solution is available, which is not the case for minimum cuts.

Tree contraction. Parallel tree contraction is a framework for producing dynamic tree algorithms, introduced by Miller and Reif [29]. Tree contraction works by performing a sequence of rounds, each applying two operations, rake and compress, in parallel across every vertex of the tree, to produce a sequence of smaller (contracted) trees. The rake operation removes a leaf vertex and merges it with its parent. The compress operation removes a vertex of degree two and replaces its two incident edges with a single edge joining its neighbors. For a rooted tree the root is never removed, and is the final surviving vertex. The technique of Miller and Reif produces a sequence of O(log(n)) trees w.h.p., with O(n) vertices in total across all of the contracted trees w.h.p.
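The rake and compress rounds can be simulated sequentially on a rooted tree. The following sketch is our illustration, not the paper's parallel implementation: each round rakes every current leaf, then uses independent coin flips (heads for the vertex, tails for its parent) to choose a set of degree-one internal vertices to splice out, which guarantees that no two adjacent vertices compress in the same round.

```python
import random

def contract(parent, seed=1):
    """Sequentially simulates rounds of tree contraction on a rooted tree
    given as {child: parent}; the root is absent from `parent`.
    Returns the number of rounds until only the root remains."""
    rng = random.Random(seed)
    rounds = 0
    while parent:
        rounds += 1
        children = {}
        for c, p in parent.items():
            children.setdefault(p, []).append(c)
        # Rake: remove every leaf (a non-root vertex with no children).
        for v in [v for v in parent if v not in children]:
            del parent[v]
        # Compress: a vertex with exactly one child may splice itself out,
        # linking its child directly to its parent. The coin-flip rule
        # (heads for v, tails for v's parent) keeps spliced vertices
        # non-adjacent, mimicking one parallel round.
        children = {}
        for c, p in parent.items():
            children.setdefault(p, []).append(c)
        old_parent = dict(parent)
        coin = {v: rng.random() < 0.5 for v in parent}
        for v in list(parent):
            if (len(children.get(v, [])) == 1 and coin[v]
                    and not coin.get(old_parent[v], False)):
                parent[children[v][0]] = parent[v]
                del parent[v]
    return rounds
```

On a path of 64 vertices the expected number of rounds is O(log n); every round removes at least one leaf, so at most n - 1 rounds are ever needed.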
Their algorithm applies to bounded-degree trees, but arbitrary-degree trees can be handled by converting them into equivalent bounded-degree trees.

A powerful application of tree contraction is that it can be used to produce a recursive clustering of the given tree with attractive properties. From the resulting tree contraction, a recursive clustering can be produced that consists of O(n) clusters with recursive height O(log(n)) w.h.p. Such a clustering can be represented as a so-called rake-compress tree (RC tree) [2].

Rake-compress trees.
The RC tree of a tree T encodes a recursive clustering of T corresponding to the result of tree contraction, where each cluster corresponds to a rake or compress. Figure 1 illustrates a recursive clustering and its corresponding RC tree. A cluster is defined to be a connected subset of vertices and edges of the original tree. Importantly, a cluster can contain an edge without containing its endpoints. The boundary vertices of a cluster C are the vertices v ∉ C such that an edge e ∈ C has v as one of its endpoints. All of the clusters in an RC tree have at most two boundary vertices. A cluster with no boundary vertices is called a nullary cluster (generated at the top-level root cluster), a cluster with one boundary is a unary cluster (generated by the rake operation), and a cluster with two boundaries is a binary cluster (generated by the compress operation). The cluster path of a binary cluster is the path in T between its boundary vertices. Nodes in an RC tree correspond to clusters, such that a node is always the disjoint union of its children (subclusters). The leaf clusters of the RC tree are the vertices and edges of the original tree. Note that all non-leaf clusters have exactly one vertex (leaf) cluster as a child. This vertex is that cluster's representative vertex. Clusters have the useful property that the constituent clusters of a parent cluster C share a single boundary vertex in common, the representative of C, and their remaining boundary vertices become the boundary vertices of C.

In this paper we will be considering rooted trees. In this case the root of the tree is also the representative of the top-level nullary cluster of the RC tree. All binary clusters have a binary subcluster whose path is above the representative vertex, which we will refer to as the top cluster, and a binary subcluster below the representative vertex, which we call the bottom cluster.
We will also refer to the binary subcluster of a unary cluster as the top cluster, as its path is also above the representative vertex. In our pseudocode, we will use the following notation. For a cluster x: x.v is the representative vertex, x.t is the top subcluster, x.b is the bottom subcluster, x.U is a list of unary subclusters, and x.p is the parent.

RC trees are similar to top trees [3], which are also based on a recursive clustering strategy. Both data structures support a wide variety of queries. Compared to top trees, however, RC trees are somewhat simpler (fewer cases) and, importantly for us, it is well understood how to construct them in parallel [29, 12]. We refer the reader to [1] and [2] for a more in-depth explanation of RC trees and their properties.

Compressed path trees.
For a weighted (unrooted) tree T and a set of marked vertices V ⊂ T, the compressed path tree is a minimal tree T_c on V and some additional "Steiner vertices" from T such that for every pair (u, v) ∈ V, the lightest edge on the path from u to v is the same in T and T_c. Alternatively, the compressed path tree is the tree T with all unmarked vertices of degree less than three spliced out. It is not hard to show that T_c has size less than 2|V|. Compressed path trees are described in [4], where it is shown that given an RC tree for the tree T and a set of k marked vertices, the compressed path tree can be produced in O(k log(1 + n/k)) work and O(log^2(n)) depth w.h.p. Gawrychowski et al. [11] define a similar notion which they call "topologically induced trees", but their algorithm is sequential and requires O(k log n) work (time).

Figure 1: A tree, a clustering, and the corresponding RC tree [1]. (a) A tree. (b) A recursive clustering of the tree produced by tree contraction; clusters produced in earlier rounds are depicted in a darker color. (c) The corresponding RC tree: (non-base) unary clusters are shown as circles, binary clusters as rectangles, and the finalize (nullary) cluster at the root with two concentric circles. The base clusters (the leaves) are labeled in lowercase, and the composite clusters are labeled with the uppercase of their representative.

Karger's minimum cut algorithm.
Karger's algorithm for minimum cuts [22] is based on the notion of k-respecting cuts. Given a weighted, undirected graph G and a spanning tree T, a cut of G k-respects T if at most k edges of T cross the cut. Karger's algorithm is the following two-step process.

1. Find O(log(n)) spanning trees of G such that w.h.p., the minimum cut 2-respects at least one of them.
2. Find, for each of the aforementioned spanning trees, the minimum 2-respecting cut in G.

Karger solves the first step using a combination of random sampling and tree packing. Given a weighted graph G, a tree packing of G is a set of spanning trees with weights assigned to the edges such that for each edge in G, its total weight in all of the spanning trees is no more than its weight in G. Since the underlying tree packing algorithms used by Karger have running time proportional to the size of the minimum cut, random sampling is used to produce a sparsified graph, or skeleton, such that the resulting tree packing still has the desired property w.h.p. This allows the tree packing algorithms to run sufficiently fast. Given the sparsified graph, Karger gives two algorithms for producing tree packings of size O(log(n)) such that w.h.p., the minimum cut 2-respects one of them. The first approach uses a tree packing algorithm of Gabow [10]. The second is based on the packing algorithm of Plotkin et al. [33], and is much more amenable to parallelism. It works by performing O(log^2(n)) minimum spanning tree computations. In total, step one of the algorithm takes O(m + n log^3(n)) time.

For the second step, Karger develops an algorithm to find, given a graph G and a spanning tree T, the minimum cut of G that 2-respects T. The algorithm works by arbitrarily rooting the tree, and considering two cases: when the two cut edges are on the same root-to-leaf path, and when they are not.
Both cases use a similar technique: they consider each edge e in the tree and try to find the best matching e′ to minimize the weight of the cut induced by the edges {e, e′}. This is achieved by using a dynamic tree data structure to maintain, for each candidate e′, the value that the cut would have if e′ were selected as the second cutting edge, while iterating over the possibilities of e and updating the dynamic tree. Karger shows that this step can be implemented sequentially in O(m log^2(n)) time, which results in a total runtime of O(m log^3(n)) when applied to the O(log n) spanning trees.

The batched mixed operation problem is to take an offline sequence of mixed operations on a data structure, usually a mix of queries and updates, and process them as a batch. The primary reason for batch processing is to allow for parallelism on what would otherwise be a sequential execution of the operations. We use the term operation-set to refer to the set of operations that can be applied among the mixed operations. Here we are interested in operations on trees, and our results apply to operation-sets that can be implemented on an RC tree in a particular way, defined as follows.
Definition 1.
An implementation of an operation-set on trees is a simple RC implementation if it uses an RC representation of the trees and satisfies the following conditions.

1. The implementation maintains a value at every RC cluster that can be calculated in constant time from the values of the children of the cluster,
2. every query operation is implemented by traversing from a leaf to the root, examining values at the visited clusters and their children, taking constant time per value examined, and using constant space, and
3. every update operation involves updating the value of a leaf using an associative constant-time operation, and then reevaluating the values on each cluster on the path from the leaf to the root.
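The three conditions can be seen in a minimal sketch (ours, not the paper's data structure) that substitutes a balanced binary hierarchy of clusters for a real RC tree, with sums as the cluster values: internal values are computed in O(1) from children (condition 1), a query walks leaf to root reading siblings of the visited nodes (condition 2), and an update applies an associative operation at a leaf and reevaluates the leaf-to-root path (condition 3).

```python
# Sketch only: a balanced binary hierarchy with sum values, standing in
# for an RC tree. The class name and methods are our own illustration.

class ClusterTree:
    def __init__(self, values):
        n = 1
        while n < len(values):
            n *= 2
        self.n = n
        self.val = [0] * (2 * n)
        self.val[n:n + len(values)] = values
        for i in range(n - 1, 0, -1):        # condition 1: O(1) per node
            self.val[i] = self.val[2 * i] + self.val[2 * i + 1]

    def update(self, leaf, delta):           # condition 3
        i = self.n + leaf
        self.val[i] += delta                 # associative operation (+)
        i //= 2
        while i >= 1:                        # reevaluate path to root
            self.val[i] = self.val[2 * i] + self.val[2 * i + 1]
            i //= 2

    def prefix_query(self, leaf):            # condition 2
        # Sum of leaves 0..leaf: walk leaf-to-root, adding the value of
        # each left sibling hanging off the path.
        i = self.n + leaf
        total = self.val[i]
        while i > 1:
            if i % 2 == 1:
                total += self.val[i - 1]
            i //= 2
        return total
```

Both operations touch O(log n) nodes with O(1) work at each, which is exactly the cost profile that Definition 1 forces on a simple RC implementation.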
Note that every operation has an associated leaf (either an edge or vertex). Also note that setting (i.e., overwriting) a value is an associative operation (just return the second of the arguments). For simple RC implementations, all operations take time (work) proportional to the depth of the RC tree, since they only follow a path to the root taking constant time at each cluster. Although the simple RC restriction may seem contrived, most operations on trees studied in previous work [36, 3, 2] can be implemented in this form, including most path and subtree operations. This is because of a useful property of RC trees: all paths and subtrees in the source tree can be decomposed into clusters that are children of a single path in the RC tree, and typically operations need just update or collect a contribution from each such cluster.

As an example, consider the following two operations on a rooted tree (the first an update, and the second a query):

• addWeight(v, w): adds weight w to a vertex v
• subtreeSum(v): returns the sum of weights for the subtree rooted at v

Algorithm 1 The subtreeSum query. The query starts at the leaf for v and goes up the RC tree keeping track of the total weight on the bottom side of v. Note that x will never be a unary cluster, so if it is not the representative or top subcluster of p (Line 5), it is the bottom subcluster with nothing below it in this cluster.

1: procedure subtreeSum(v)
2:   w ← 0
3:   x ← v; p ← x.p
4:   while p is binary do
5:     if (x = p.t) or (x = p.v) then
6:       w ← w + p.b.w + p.v.w + Σ_{u ∈ p.U} u.w
7:     x ← p; p ← x.p
8:   return w + p.v.w + Σ_{u ∈ p.U} u.w

These operations can use a simple RC implementation by keeping as the value of each cluster the sum of the values of all its children. Leaves in the RC tree start with zero weight. This satisfies the first condition since the sums take constant time.
An addWeight(v, w) adds weight w to the vertex v (which is a leaf in the RC tree) and updates the sums up to the root cluster. This satisfies the third condition since addition is associative and takes constant time. The query can be implemented as in Algorithm 1, which only examines values on a path from the start vertex to the root and the children along that path. Each step takes constant time and the function requires constant space, satisfying the second condition. The operations therefore have a simple RC implementation.

We are interested in implementing batches of operations from an operation-set on trees with a simple RC implementation. In particular, we prove Theorem 2.
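Before the proof, the role of associativity in batching can be seen at a single leaf: a prefix sum over the timestamped updates answers every interleaved query at once, matching a purely sequential execution. A sketch (our illustration, with our own function names):

```python
from itertools import accumulate

# Sketch: a timestamped batch of mixed updates and queries on ONE leaf
# value. Because the update (+) is associative, one prefix sum yields
# the value seen by every query; in parallel this is O(log k) depth.

def batched(initial, ops):
    """ops: list of ('add', x) or ('query', None), in timestamp order."""
    deltas = [x if kind == 'add' else 0 for kind, x in ops]
    # running[t] = initial value plus all updates before timestamp t.
    running = list(accumulate(deltas, initial=initial))
    return [running[t] for t, (kind, _) in enumerate(ops) if kind == 'query']

def sequential(initial, ops):
    """Reference: apply the operations one at a time."""
    value, answers = initial, []
    for kind, x in ops:
        if kind == 'add':
            value += x
        else:
            answers.append(value)
    return answers
```

For example, with initial value 10 and operations add 5, query, add 3, query, query, both versions answer [15, 18, 18]; the batched version replaces the sequential dependence with a single parallel prefix sum.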
The preprocessing just builds an RC tree on the source tree and sets the values for each cluster based on the initial values on the leaves. This can be implemented with the Miller-Reif algorithm [29], or in the binary forking model [6], or deterministically [12]. All take linear work and logarithmic depth (w.h.p. for the randomized versions). Our algorithm for each batch is then implemented as follows:

1. Timestamp the operations by their position.
2. Collect all operations by their associated leaf, and sort within each leaf by timestamp. This can be implemented with a single sort.
3. For each leaf, use a prefix sum on the update values to calculate the value of the leaf after each operation, starting from the initial value on the leaf.
4. Initialize each query using the value it received from the prefix sum. We now have a list of operations on each leaf sorted by timestamp. For each operation we have its value, and for each query we also have its partial evaluation based on the value. We prepend the initial value. We call this the operation list. An operation list is non-trivial if it has more than just the initial value.
5. Sequentially for each level of the cluster tree, starting one above the deepest, and in parallel for every cluster on the level for which at least one child has a non-trivial operation list:
   (a) Merge the operation lists from each child into a single list by timestamp.
   (b) Calculate for each element the latest value of each child at or before the timestamp. This can be implemented by prefix sums.
   (c) For each element in the list, calculate the value at that timestamp from the child values collected in the previous step.
   (d) For queries, use the values and/or child values to update the query.

Note that this algorithm needs to have children with non-trivial operation lists identify parents that need to be processed. This can be implemented by keeping a list of all the clusters at a level with non-trivial operation lists left-to-right in level order.
When moving up a level, adjacent duplicates that share the same parent can be combined.

We first consider why the algorithm is correct. We assume by structural induction (over subtrees) that the operation lists contain the correct values for each timestamped operation in the list. This is true at the leaves since we apply a prefix sum across the associative operation to calculate the value at each update. For internal clusters, assuming the child clusters have correct operation lists (values for each timestamp valid until the next timestamp, and partial results of queries), we properly determine the operation lists for the cluster. In particular, for all timestamps that appear in children we promote them to the parent, and for each we calculate the value based on the current value, by timestamp, for each child.

We now consider the costs. The cost of the batch before processing the levels is dominated by the sort, which takes O(k log k) work and O(log k) depth. The cost at each level is then dominated by the merging and prefix sums, which take O(k) work and O(log k) depth accumulated across all clusters that have a child with a non-trivial operation list. If the RC tree has depth O(log n), then across all levels the work is bounded by O(k log n) and the depth by O(log(n) log(k)). The total work and depth are therefore as stated. The space for each batch of size k is bounded by the size of the RC tree, which is O(n), and the total space of the operation lists at any two adjacent levels, which is O(k).

Note that we could maintain operation lists at each cluster for all operations on the source tree (along with links to the child nodes) across all batches. This would allow arbitrary queries back in time in O(log n) work per query. However, it would not satisfy the desired space bounds.

We now consider implementing mixed operations consisting of updating paths, and querying both paths and subtrees. We will use these in batch in Sections 3.2 and 5.
In particular we wish to maintain, given a rooted tree T = (V, E) with a weight w(e) for each e ∈ E, a data structure that supports the following operations.

• AddPath(u, v, x): For u, v ∈ V, adds x to the weight of all edges on the u-v path.
• QuerySubtree(v): Returns the lightest weight of an edge in the subtree rooted at v ∈ V.
• QueryPath(u, v): For u, v ∈ T, returns the lightest weight of an edge on the u-v path.
• QueryEdge(e): Returns w(e).

We also consider AddPath'(v, x), which adds x to the path from v to the root, and QueryPath'(u, v), which requires that v be the representative vertex of an ancestor of u in the RC tree. The more general forms can be implemented from these with a constant number of calls, given the LCA in the original tree for AddPath and in the RC tree for
QueryPath. The interface can be implemented in O(log(n)) time by using top trees [3]. Here we describe a simple RC implementation to allow efficient batching.

Lemma 1. The
AddPath', QuerySubtree, QueryPath', and QueryEdge operations on bounded-degree trees can be supported with a simple RC implementation.
Algorithm 2 A simple RC implementation of AddPath and QuerySubtree.

1: procedure f_unary(w_v, (m_t, l_t, w_t), U)
2:   w′ ← w_v + Σ_{(m,w) ∈ U} w
3:   m_u ← min_{(m,w) ∈ U} m
4:   return (min(m_t, l_t + w′, m_u), w_t + w′)
5: procedure f_binary(w_v, (m_t, l_t, w_t), (m_b, l_b, w_b), U)
6:   w′ ← w_v + w_b + Σ_{(m,w) ∈ U} w
7:   m_u ← min_{(m,w) ∈ U} m
8:   return (min(m_t, m_b, m_u), min(l_t + w′, l_b), w_t + w′)
9: procedure AddPath'(v, w)
10:   v.value ← v.value + w
11:   Reevaluate the f(·) on the path to the root.
12: procedure QuerySubtree(v)
13:   m ← ∞; l ← ∞
14:   x ← v; p ← x.p
15:   while p is binary do
16:     if (x = p.t) or (x = p.v) then
17:       w′ ← p.b.w + p.v.w + Σ_{u ∈ p.U} u.w
18:       l ← min(l + w′, p.b.l)
19:       m ← min(m, p.b.m, min_{u ∈ p.U} u.m)
20:     x ← p; p ← x.p
21:   w′ ← p.v.w + Σ_{u ∈ p.U} u.w
22:   return min(l + w′, m, min_{u ∈ p.U} u.m)
Our simple RC implementation for combining values,
AddPath , and
QuerySubtree isgiven in Algorithm 2. The other two operations can be found in Appendix E. The value of eachvertex (leaf) in the cluster is the the total weight added to that vertex by
AddPath . The valuefor each unary cluster consists of: m , the minimum weight edge in the cluster, and w , the totalweigh of AddPath s originating in the cluster. For each binary cluster we separate the minimumweights on and off the cluster path. In particular, the value of each binary cluster consists of: m ,the minimum weight edge not on the cluster path, l , the minimum edge on the cluster path due toall AddPath originating in the cluster, and w , the total weight of AddPath s originating in thecluster. The f binary and f unary calculate the values for unary and binary clusters from the valuesof their children. We initialize each vertex with zero, and each edge e (which are binary clusters)with ( m = 0 , l = w ( e ) , w = 0).It is a simple RC implementation since (1) the f ( · ) can be computed in constant time, (2) thequeries just traverse from a leaf on a path to the root (possibly ending early) only examining childvalues, taking constant time per level and constant space, and (3) the update just sets a leaf usingan associative addition, and reevaluates the values to the root.We argue the implementation is correct. Firstly we argue by structural induction on the RCtree that the values as described in the previous paragraph are maintained correctly by f binary and f unary . In particular assuming the children are correct we show the parent is correct. The valuesare correct for leaves since we increment the value on vertices with AddPath , and initialize theedges appropriately. To calculate the minimum edge weight of a unary cluster f unary takes theminimum of three quantities: the minimum off-path edge of the child binary cluster, the overall8inimum edge of any of the child unary clusters, and, importantly, the minimum edge on thecluster path of the child binary cluster plus the AddPath weight contributed by the unary clustersand the representative vertex (i.e., min( m t , l t + w (cid:48) , m y )). 
This is correct since all paths from those clusters to the root go through the cluster path of the child binary cluster, so it needs to be adjusted. The off-path edges and child unary clusters do not need to be adjusted since no path from the cluster vertex goes through them. The minimum weight is therefore correct. The total AddPath weight is trivially correct since it just adds the contributions.

For binary clusters we need to separately consider the minimum off- and on-path edges. For the off-path edges, the parts that are off the cluster path are the off-path edges from the two binary children, plus all edges from the unary children (i.e., min(m_t, m_b, m_u)). For the on-path edges, both the top and bottom binary clusters contribute their on-path edges. The on-path edges from the bottom binary cluster do not need to be adjusted because no vertices in the cluster are below them. The on-path edges from the top binary cluster need to be adjusted by the AddPath weights from all vertices in the bottom cluster, all vertices in unary child clusters, and the representative vertex, since they are all below the path (this sum is given by w′). The minimum of the resulting adjusted top edge and bottom edge is then returned, which is indeed the minimum edge on the path accounting for AddPaths on vertices in the cluster. The
QuerySubtree accumulates the appropriate minimum weights within a subtree as it goes up the RC tree. As with the calculation of values, it needs to separate the on-path and off-path minimum weights. Whenever coming as the upper binary cluster to the parent, QuerySubtree needs to add all the contributing AddPath weights from vertices below it in the parent cluster (the representative vertex, the lower binary cluster, and the unary clusters) to the current minimum on-path weight. A minimum is then taken with the lower on-path minimum edge to calculate the new minimum on-path edge weight (Line 18). The off-path minimum is the minimum of the current off-path minimum, the minimum off-path edge of the bottom cluster, and the minimums of the unary clusters (Line 19). Once we reach a unary cluster we are done, since for a unary cluster all subtrees of vertices within the cluster are fully contained within the cluster. The final line therefore just determines the overall minimum for the subtree rooted at v by considering the on-path edges adjusted by AddPath contributions, the off-path edges, and all edges in child unary clusters.
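For reference, the semantics that Lemma 1 supports can be pinned down by a brute-force implementation (ours, O(n) per operation, purely for illustration and contrast with the O(log n)-per-operation simple RC implementation): edge weights live on (child, parent) edges, AddPath'(v, x) adds x to every edge on the v-to-root path, and QuerySubtree(v) returns the minimum edge weight inside v's subtree.

```python
# Brute-force reference for AddPath' and QuerySubtree semantics.
# Not the RC implementation: every operation walks the tree explicitly.

class NaivePathTree:
    def __init__(self, parent, weight):
        self.parent = parent          # {child: parent}; root absent
        self.weight = dict(weight)    # {child: weight of edge (child, parent)}

    def add_path(self, v, x):
        """AddPath'(v, x): add x to every edge on the v-to-root path."""
        while v in self.parent:
            self.weight[v] += x
            v = self.parent[v]

    def query_subtree(self, v):
        """Minimum weight among edges (c, parent[c]) lying inside
        the subtree rooted at v (i.e., parent[c] is v or below v)."""
        best = None
        for c, p in self.parent.items():
            u = p
            while u is not None and u != v:
                u = self.parent.get(u)
            if u == v and (best is None or self.weight[c] < best):
                best = self.weight[c]
        return best
```

On the tree rooted at 0 with edges (1,0), (2,0), (3,1), (4,1), adding 3 along the path from vertex 4 raises the weights of edges (4,1) and (1,0), which changes the subtree minimum at 1 but not the overall minimum, still held by edge (2,0).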
Corollary 1.
Given a bounded-degree tree of size n, any sequence of m AddPath, QuerySubtree, QueryPath, and QueryEdge operations can be evaluated in O(n + m log n) work, O(log^2(n + m)) depth, and O(n + m) space.

Proof. The LCAs required to convert
AddPath to AddPath ’ and
QueryPath to QueryPath ’can be computed in O ( n ) work, O (log( n )) depth, and O ( n ) space [35] . The rest follows fromTheorem 2 and Lemma 1. Using our batched mixed operations, we can improve previous results on finding 2-respecting cuts.In particular we can shave off a log factor in the work of Geissmann and Gianinazzi’s (GG) algo-rithm [13], and we can parallelize Lovett and Sandlund’s (LS) algorithm [26].Geissmann and Gianinazzi find 2-respecting cuts by first finding an O ( m ) sequence of mixed AddPath and
QueryPath operations for each of O(log²(n)) trees. They show how to find each set in O(m log(n)) work and O(log(n)) depth [13, Lemma 12]. On each set they then use their own data structure to evaluate the sequence in O(m log²(n)) work and O(log²(n)) depth, for a total of O(m log⁴(n)) work and O(log²(n)) depth across the sets. Replacing their data structure with the result of Corollary 1 improves their results to O(m log³(n)) work.
Lovett and Sandlund significantly simplify Karger's algorithm by first finding a heavy-light decomposition, i.e., a vertex-disjoint set of paths in a tree such that every path in the tree is covered by at most O(log(n)) of them. It then reduces finding the 2-respecting cuts to a sequence of AddPath and QueryPath operations on the decomposed paths induced by each non-tree edge, for a total of O(m log(n)) operations. Using Geissmann and Gianinazzi's O(n log(n)) work, O(log(n)) depth algorithm for finding a heavy-light decomposition [13, Lemma 7], and the results of Corollary 1 again gives an O(m log³(n)) work, O(log²(n)) depth algorithm.
We follow the general approach used by Karger to produce a set of O(log(n)) spanning trees such that w.h.p., the minimum cut 2-respects at least one of them. We have to make several improvements to achieve our desired work and depth bounds. At a high level, Karger's algorithm works as follows.
1. Compute a constant-factor approximation to the minimum cut c
2. Sample the edges of G with probability Θ(log(n)/c)
3. Use the tree packing algorithm of Plotkin [33] to generate a packing of O(log(n)) trees
Step 2 is trivial to parallelize, as the sampling can be done independently in parallel. The sampling procedure produces an unweighted multigraph with O(m log(n)) edges, and takes O(m log²(n)) work and O(log(n)) depth.
In Step 3, Plotkin's algorithm consists of O(log²(n)) sequential minimum spanning tree (MST) computations on a weighting of the sampled graph, which has O(m log(n)) edges. Naively this would require O(m log³(n)) work. To save work, we can use the trick mentioned by Gawrychowski et al. [11]. Since the sampled graph is a multigraph sampled from a graph with m edges, the MST algorithm need only know about the lightest of each parallel edge, which can be maintained in O(1) time since the weights change by a fixed amount each iteration. Using Cole, Klein, and Tarjan's linear-work and O(log(n)) depth MST algorithm [8] results in a total of O(m log²(n)) work in O(log³(n)) depth w.h.p.
The only nontrivial part of parallelizing the tree production is actually Step 1, computing a constant-factor approximation to the minimum cut. In the sequential setting, Matula's algorithm can be used, which runs in linear time on unweighted graphs, and can be extended to weighted graphs to run in O(m log²(n)) time. To the best of our knowledge, the only known parallelization of Matula's algorithm is due to Karger and Motwani [24], but it takes O(m²/n) work, which is far too much for our purposes. In Appendix A, we derive a faster version of the approximation algorithm that runs in O(m log²(n)) work and O(log³(n)) depth. Taking all of this together, we have the following theorem. Theorem 4.
Given a weighted, undirected graph, a set of O(log(n)) spanning trees can be produced in O(m log²(n)) work and O(log³(n)) depth such that w.h.p., the minimum cut 2-respects at least one of them.
Minimum 2-Respecting Cuts
We are given a graph G and a set of O(log(n)) trees such that, w.h.p., the minimum cut of G 2-respects at least one of them. In this section, we show how to find the minimum 2-respecting cut of G with respect to a tree T in O(m log(n)) work and O(polylog(n)) depth.
Our faster O(m log(n)) work algorithm, like those that came before it, finds the minimum 2-respecting cut by considering two cases. We assume that the tree T is rooted arbitrarily. In the first case, we assume that the two tree edges of the cut occur along the same root-to-leaf path, i.e., one is a descendant of the other. This is called the descendant edges case. In the second case, we assume that the two edges do not occur along the same root-to-leaf path. This is the independent edges case. We assume we are given an undirected weighted graph G = (V, E) with maximum degree three. Note that any graph of arbitrary degree can easily be ternarized by replacing high-degree vertices with paths of infinite-weight edges, resulting in a graph of maximum degree three with the same minimum cut, and only a constant-factor larger size.
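The ternarization step described above is simple to make concrete. The sketch below is our own illustration (function name and representation are assumptions, not from the paper): each vertex of degree d > 3 is replaced by a path of d vertices joined by "infinite"-weight edges, and its d incident edges are redistributed, one per path vertex.

```python
# A sketch of ternarization: replace each high-degree vertex by a path of
# infinite-weight edges so that the result has maximum degree three and the
# same minimum cut. Edge list representation with float('inf') weights is
# our own choice for illustration.

from collections import defaultdict

def ternarize(n, edges):
    """edges: list of (u, v, w). Returns (n', edges') with max degree <= 3."""
    INF = float("inf")
    incident = defaultdict(list)
    for i, (u, v, _) in enumerate(edges):
        incident[u].append(i)
        incident[v].append(i)
    endpoint = {i: [u, v] for i, (u, v, _) in enumerate(edges)}
    next_id = n
    new_edges = []
    for v in range(n):
        d = len(incident[v])
        if d <= 3:
            continue
        # Path v = p_0, p_1, ..., p_{d-1} of infinite-weight edges; attach
        # one original edge to each path vertex.
        path = [v] + [next_id + j for j in range(d - 1)]
        next_id += d - 1
        for a, b in zip(path, path[1:]):
            new_edges.append((a, b, INF))
        for pv, i in zip(path, incident[v]):
            ends = endpoint[i]
            ends[ends.index(v)] = pv  # reattach edge i to path vertex pv
    for i, (_, _, w) in enumerate(edges):
        a, b = endpoint[i]
        new_edges.append((a, b, w))
    return next_id, new_edges
```

On a star with center of degree four, this produces three infinite-weight path edges and leaves every vertex with degree at most three; any finite cut of the new graph avoids the infinite edges, so it corresponds to a cut of the original graph of the same weight.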
We now present our algorithm for minimum 2-respecting cut for the descendant edges case. Let T be a spanning tree of a connected graph G = (V, E) of degree at most three, and root T at an arbitrary vertex of degree at most two. The rooted tree is binary since G is a connected graph with bounded degree three.
We use the following fact. For any tree edge e ∈ T, let F_e denote the set of edges (u, v) ∈ E (tree and non-tree) such that the u–v path in T contains the edge e. Then the weight of the cut induced by a pair of edges {e, e′} is given by

w(F_e Δ F_{e′}) = w(F_e) + w(F_{e′}) − 2w(F_e ∩ F_{e′}),

where Δ denotes the symmetric difference between the two sets. For each tree edge e, our algorithm seeks the tree edge e′ that minimizes w(F_e Δ F_{e′}), which is equivalent to minimizing the expression

w(F_{e′}) − 2w(F_e ∩ F_{e′}).

To do so, it traverses T from the root while maintaining dynamic weights on a tree data structure that satisfies the following invariant:
Invariant 1 (Current subtree invariant). When visiting e = (u, v), for every edge e′ ∈ Subtree(v), the weight of e′ in the dynamic tree is w(F_{e′}) − 2w(F_e ∩ F_{e′}).
The initial weight of each edge e is therefore w(F_e). Maintaining this invariant as the algorithm traverses the tree can then be achieved with the following observation. When the traversal descends from an edge p = (w, u) to a neighboring child edge e = (u, v), the following hold for all e′ ∈ Subtree(v):
1. (F_e ∩ F_{e′}) ⊇ (F_p ∩ F_{e′}), since any path that goes through p and e′ must pass through e.
2.
(F_e ∩ F_{e′}) \ (F_p ∩ F_{e′}) are the edges (x, y) ∈ F_{e′} such that e is a top edge of the path x–y in T (i.e., e is on the path from x to y in T, but the parent edge of e is not).
Therefore, to maintain the current subtree invariant, when the algorithm visits the edge e, it need only subtract twice the weight of all x–y paths that contain e as a top edge. This can be done efficiently by precomputing the sets of top edges. There are at most two top edges for each path x–y, and they can be found from the LCA of x and y in T. We need not consider tree edges since they will never appear in F_{e′}. By maintaining the aforementioned invariant, the solution follows by taking the minimum value of w(F_e) + QuerySubtree(v) for all edges e = (u, v) during the traversal. As described, this algorithm is entirely sequential, but it can be parallelized using our mixed-batch evaluation algorithm (Corollary 1).
The operation sequence can be generated as follows. First, the weights w(F_e) for each edge can be computed using the batch evaluation algorithm (Corollary 1), where each edge (u, v) of weight w creates an AddPath(u, v, w) operation, followed by a QueryEdge(e) for every edge e ∈ T. This takes O(m log(n)) work and O(log²(n)) depth. The LCAs required to compute the sets of top edges can be computed using the parallel LCA algorithm of Schieber and Vishkin [35] in O(m) work and O(log(n)) depth in total. By computing an Euler tour of the tree T (an ordered sequence of visited edges) beginning at the root, the order in which to perform the tree operations can be deduced in O(n) work and O(log(n)) depth. Each edge in the Euler tour generates an AddPath operation for each of its top edges, followed by a
QuerySubtree operation. Note that each edge is visited twice during the Euler tour. The second visit corresponds to negating the AddPath operations from the first visit. The solution is then the minimum result of all of the QuerySubtree operations. Since there are a constant number of top edges per path, and O(m) paths in total, the operation sequence has length O(m). Using Corollary 1, we arrive at the following result. Theorem 5.
There exists an algorithm that, given a weighted, undirected graph G and a rooted spanning tree T, computes the minimum 2-respecting cut of G with respect to T such that one of the cut edges is a descendant of the other, in O(m log(n)) work and O(log²(n)) depth w.h.p.
The independent edges case is where the two cutting edges do not fall on the same root-to-leaf path. To solve the independent edges problem, we use the framework of Gawrychowski et al. [11], which is to decompose the problem into a set of subproblems, which they call bipartite problems. The key challenge in parallelizing the solution to the bipartite problem is dealing with the fact that the resulting trees might not be balanced. The algorithm of Gawrychowski et al. relies on performing a biased divide-and-conquer search guided by a heavy-light decomposition [17], and then propagating results up the trees bottom-up. Since the trees may be unbalanced, this cannot be easily parallelized. Our solution is to use the recursive clustering of RC trees to guide a divide-and-conquer search in which we can maintain all of the needed information on the clusters, so we never have to propagate anything up the original, possibly unbalanced, tree.
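Both the descendant and independent cases can be validated against a brute-force reference that evaluates w(F_e Δ F_e′) directly for every pair of tree edges. The sketch below is our own illustration (hypothetical function name, far above the stated bounds in cost) and is only practical for tiny graphs; note that w(F_e Δ F_e′) equals w(F_e) + w(F_e′) − 2w(F_e ∩ F_e′), so it is computed here as the weight of a plain symmetric difference.

```python
# Brute-force reference for the minimum cut that 1- or 2-respects a given
# spanning tree T. For each tree edge e, F[e] is the set of graph edges
# whose tree path contains e; the cut for a pair {e, e'} is w(F_e ^ F_e').

from itertools import combinations

def min_2_respecting_cut(n, edges, tree_idx):
    """edges: list of (u, v, w); tree_idx: indices of edges forming T."""
    # Root T at vertex 0; identify each tree edge by its child vertex.
    adj = [[] for _ in range(n)]
    for i in tree_idx:
        u, v, _ = edges[i]
        adj[u].append(v)
        adj[v].append(u)
    parent, seen, order = [-1] * n, [False] * n, [0]
    seen[0] = True
    for u in order:
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                parent[v] = u
                order.append(v)

    def root_path(u):  # tree edges (as child vertices) above u
        s = set()
        while parent[u] != -1:
            s.add(u)
            u = parent[u]
        return s

    # F[c]: graph edges whose tree path contains the edge (c, parent[c]).
    F = {c: set() for c in range(n) if parent[c] != -1}
    for i, (u, v, _) in enumerate(edges):
        for c in root_path(u) ^ root_path(v):
            F[c].add(i)

    wt = lambda S: sum(edges[i][2] for i in S)
    best = min(wt(F[c]) for c in F)            # cuts 1-respecting T
    for a, b in combinations(F, 2):            # cuts 2-respecting T
        best = min(best, wt(F[a] ^ F[b]))
    return best
```

On a unit-weight 4-cycle with any three of its edges as the tree, every single tree edge gives a cut of weight 2, which is indeed the minimum cut.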
Definition 2 (The bipartite problem) . Given two weighted rooted trees T and T and a set ofweighted edges that cross from one tree to the other, i.e. L = { ( u, v ) : u ∈ T , v ∈ T } , the bipartiteproblem is to select e ∈ T and e ∈ T with the goal of minimizing the sum of the weight of e and e plus the weights of all edges ( v , v ) ∈ L such that v is in the subtree underneath e and v is in the subtree underneath e . The size of a bipartite problem is the size of L plus the sizes of T and T . Gawrychowski et al. observe that if T and T are disjoint subtrees of T , then, assigning weights of − w ( F e ) to each edge, the solution to the bipartite problem is the minimum 2-respecting cut suchthat e ∈ T and e ∈ T . The independent edges problem is then solved by reducing it to severalinstances of the bipartite problem, and taking the minimum answer among all of them. We willshow how to generate the bipartite problems efficiently, and how to solve them efficiently, both inparallel. The following parallel algorithm generates O ( n ) instances of the bipartite problem with total sizeat most O ( m ). For each edge e in T , the algorithm first assigns them a weight equal to − w ( F e ).Now consider all non-tree edges, i.e. all edges e ∈ E ( G ) , e / ∈ T , and group them by the LCA of12heir endpoints in T . This forms a partition of the O ( m ) edges of G , each group identified by avertex. Each vertex in T conversely has an associated (possibly empty) list of non-tree edges.For each vertex v in T with a non-empty associated list of edges, create a compressed path treeof T with respect to the endpoints of the associated edges and v . Finally, for each such compressedpath tree, root it at v (the common LCA of the edge endpoints). The bipartite problems arenow generated as follows. For each vertex v with a non-empty list of non-tree edges, and thecorresponding compressed path tree T v , consider the children x, y of v in T v . 
The bipartite problem consists of T₁, which contains the edge (v, x) and the subtree of T_v rooted at x, and likewise, T₂, which contains the edge (v, y) and the subtree of T_v rooted at y, and L, the associated list of non-tree edges.
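Before turning to the efficient algorithm, the specification in Definition 2 can be made concrete with a brute-force solver. The sketch below is our own naive reference (assumed names and representation, cubic-time), not the recursive algorithm of this paper: it simply tries every pair (e₁, e₂) and sums the crossing edges below both.

```python
# Brute-force reference for the bipartite problem (Definition 2).
# par_i: parent array of tree T_i (root r has par[r] == -1);
# w_i[c]: weight of the edge (c, par_i[c]); L: list of (u, v, w) with
# u in T_1 and v in T_2. Runs in O(|T_1| * |T_2| * |L| * depth) time.

def solve_bipartite(par1, w1, par2, w2, L):
    def below(par, anc, x):
        # Is x in the subtree hanging under the edge (anc, par[anc])?
        while x != -1:
            if x == anc:
                return True
            x = par[x]
        return False

    best = float("inf")
    for e1 in range(len(par1)):
        if par1[e1] == -1:
            continue  # the root has no parent edge
        for e2 in range(len(par2)):
            if par2[e2] == -1:
                continue
            cost = w1[e1] + w2[e2]
            cost += sum(w for (u, v, w) in L
                        if below(par1, e1, u) and below(par2, e2, v))
            best = min(best, cost)
    return best
```

For example, with T₁ a root with two child edges of weights 4 and 1, T₂ a single edge of weight 2, and one crossing edge of weight −5 between the weight-4 subtree and the weight-2 subtree, the best choice pays 4 + 2 − 5 = 1 rather than 1 + 2 = 3.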
There exists an algorithm that can generate the bipartite problems in O(m log(n)) work and O(log²(n)) depth w.h.p.
Proof. The edge weights can be computed using the batch evaluation algorithm in O(m log(n)) work and O(log²(n)) depth in the same way as before. LCAs can be computed using the parallel LCA algorithm of Schieber and Vishkin [35] in O(m) work and O(log(n)) depth. Grouping the edges by LCA can be achieved using a parallel sorting algorithm in O(m log(n)) work and O(log(n)) depth. Together, these steps take O(m log(n)) work and O(log²(n)) depth. For each group, computing the compressed path tree takes O(m_i log(1 + n/m_i)) ≤ O(m_i log(n)) work and O(log²(n)) depth w.h.p., where m_i is the number of edges in the group. Performing all compressed path tree computations in parallel, and noting that the edge lists of each vertex are a disjoint partition of the edges of G, this takes at most O(m log(n)) work and O(log²(n)) depth in total w.h.p.
It remains only for us to show that the bipartite problems can be efficiently solved in parallel. Our solution is a recursive algorithm that utilizes the recursive cluster structure of RC trees. Recall that RC trees consist of unary and binary clusters (and the nullary cluster at the root, but this is not needed by our algorithm). See Figure 2. A unary cluster is the disjoint union of exactly one binary cluster at the top, zero to two unary clusters at the bottom, and one leaf cluster corresponding to the representative vertex joining them in the middle. A binary cluster is the disjoint union of two binary clusters, one on top and one below; zero or one unary clusters at the bottom; and the leaf cluster corresponding to the representative vertex joining them in the middle. All non-leaf clusters, except the root cluster, are either unary or binary clusters.
Since the bipartite problems are constructed such that trees T₁ and T₂ always have a root with a single child, the root cluster of their RC trees consists of exactly one unary cluster.
High-level idea.
Recall that the goal is to select an edge e₁ ∈ T₁ and an edge e₂ ∈ T₂ that minimizes their costs plus the cost of all edges (u, v) ∈ L such that u is a descendant of e₁ and v is a descendant of e₂. Our algorithm first constructs an RC tree of T₁, and weights the edges in T₁ and T₂ by their cost. At a high level, the algorithm then works as follows. Given a binary cluster c of T₁, the algorithm maintains weights on T₂ such that for each edge e₂ ∈ T₂, its weight is the weight of e₂ in the original tree plus the sum of the weights of all edges (u, v) ∈ L such that u is a descendant of the bottom boundary of c, and v is a descendant of e₂. This implies that for a binary cluster of T₁ consisting of an isolated edge e₁ ∈ T₁, the weights of each e₂ ∈ T₂ are precisely such that w(e₁) + w(e₂) is the value of selecting {e₁, e₂} as the solution. This idea leads to a very natural recursive algorithm. We start with the topmost unary cluster of T₁ and proceed recursively down the clusters of T₁, maintaining T₂ with weights as described.

Figure 2: Unary clusters and binary clusters. (a) A unary cluster consisting of one binary subcluster and two unary subclusters. (b) A binary cluster consisting of two binary subclusters and one unary subcluster.

When the algorithm recurses into the top binary child of a cluster, it must add the weights of all (u, v) ∈ L that are descendants of that cluster to the corresponding paths in T₂. If recursing on the bottom binary subcluster of a binary cluster, the weights on T₂ are unchanged. When recursing on a unary cluster, since it has no descendants, the algorithm uses the original weights of T₂. Once the recursion hits a binary cluster that consists of a single edge e₁, it can return the solution w(e₁) + w(e₂), where e₂ is the lightest edge with respect to the current weights on T₂.
Lastly, to perform this process efficiently, the algorithm compresses, using the compressed path tree algorithm [4], the tree T₂ every time it recurses, keeping only the vertices that are endpoints of the crossing edges that touch the current cluster of T₁.
Implementation.
We provide pseudocode for our algorithm in Algorithm 3. Given a bipartite problem (T₁, T₂, L), we use the notation L(C) to denote the edges of L limited to those that are incident on some vertex in the cluster C. Furthermore, we use V_{T₂}(L(C)) to denote the set of vertices given by the endpoints of the edges in L(C) that are in T₂. The pseudocode does not make the parallelism explicit, but all that is required is to run the recursive calls in parallel. The procedure takes as input a cluster C of T₁, a compressed version of T₂ with its original weights, and T₂′, the compressed version of T₂ with updated weights. At the top level, it takes the cluster representing all of T₁ for the first argument, and the cluster for all of T₂ for the second and third arguments. The Compress function compresses the given tree with respect to the given vertex set and its root, and returns the compressed tree still rooted at the same root.
AddPaths(S) takes a set S ⊂ L of edges and for each one, adds w(u, v) to the root-to-v path, where v ∈ T₂, returning a new copy of the tree.
Remark 1 (Identifying vertices). Since this algorithm creates many copies of T₂, we must ensure that we can still identify and locate a desired vertex given its label. One simple way to achieve this is to build a static hashtable alongside each copy of T₂ that maps vertex labels to the instance of that vertex in that copy. Since our bounds are already randomized, using hashing is okay.
An ingredient that we need to achieve low depth is an efficient way to update the weights in T₂ when adding weights to a collection of paths. Although RC trees support batch-adding weights to paths, the standard algorithm does not meet our cost requirements. This is easy to achieve in linear work and O(log(n)) depth using ideas similar to standard treefix sum algorithms (see Appendix D for details). It remains to show that the Bipartite procedure runs in low work and depth.
Theorem 6.
A bipartite problem of size m can be solved in O(m log(m)) work and O(log³(m)) depth w.h.p.

Algorithm 3: Parallel bipartite problem algorithm
procedure Bipartite(C : Cluster, T : Tree, T′ : Tree, L : Edge list)
  if C = {e₁} then
    return w(e₁) + LightestEdge(T′)
  else
    local T_cmp ← T.Compress(V_{T₂}(L(C_t)))
    local T″ ← T′.AddPaths(L(C) \ L(C_t))
    local T″_cmp ← T″.Compress(V_{T₂}(L(C_t)))
    local ans ← Bipartite(C_t, T_cmp, T″_cmp, L(C_t))
    for each cluster C′ in C_U do
      local T_cmp ← T.Compress(V_{T₂}(L(C′)))
      ans ← min(ans, Bipartite(C′, T_cmp, T_cmp, L(C′)))
    if C is a binary cluster then
      local T_cmp ← T.Compress(V_{T₂}(L(C_b)))
      local T′_cmp ← T′.Compress(V_{T₂}(L(C_b)))
      ans ← min(ans, Bipartite(C_b, T_cmp, T′_cmp, L(C_b)))
    return ans

Here C_t and C_b denote the top and bottom binary subclusters of C, and C_U denotes its unary subclusters.
Proof.
First, since all recursive calls are made in parallel and the recursion is on the clusters of T₁, the maximum number of levels of recursion is O(log(m)) w.h.p. We will show that the algorithm performs O(m) work in total at each level, in O(log²(m)) depth w.h.p. Observe first that at each level of recursion, the edge sets L of the calls form a disjoint partition of the non-tree edges, since each recursive call takes a disjoint subset. We will now argue that each call does work proportional to |L|. Since T and T′ are both compressed with respect to L, their size is proportional to |L|. AddPaths can be implemented in linear work in the size of T₂ and O(log(m)) depth (Appendix D), and hence takes O(|L|) work and O(log(m)) depth. Compress(K) takes O(|K| log(1 + |T₂|/|K|)) ≤ O(|K| + |T₂|) work and O(log²(m)) depth w.h.p. Since compression is with respect to some subset of L, all of the compress operations take O(|L|) work and O(log²(m)) depth w.h.p. In total, this is O(|L|) work in O(log²(m)) depth w.h.p. at each level for each call. Since the edge sets L at each level are a disjoint partition of the non-tree edges, the total work per level is O(m) w.h.p., and hence the desired bounds follow.
Since there are O(n) bipartite problems of total size O(m), solving them all in parallel gives us the following theorem, which, when combined with Theorem 5, proves Theorem 3. Theorem 7.
There exists an algorithm that, given a weighted, undirected graph G and a rooted spanning tree T, computes the minimum 2-respecting cut of G with respect to T such that the cut edges are independent, in O(m log(n)) work and O(log³(n)) depth w.h.p. Combining Theorem 3 with Theorem 4 concludes our main result (Theorem 1).
We present the first work-efficient algorithm for minimum cuts that runs in low depth. That is, the first highly parallel algorithm that performs no more work than the best sequential algorithm. Since our algorithm is work efficient, finding a faster parallel algorithm would entail finding a faster sequential algorithm. Our algorithm is Monte Carlo and it runs in O(m log²(n)) work and O(polylog(n)) depth. It remains an open problem to find a deterministic algorithm, even a sequential one, that runs in O(m polylog(n)) time.
Acknowledgments
This work was supported in part by NSF grants CCF-1408940 and CCF-1629444.
References
[1] U. A. Acar, D. Anderson, G. E. Blelloch, L. Dhulipala, and S. Westrick. Parallel batch-dynamic trees via change propagation. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2020.
[2] U. A. Acar, G. E. Blelloch, and J. L. Vittes. An experimental analysis of change propagation in dynamic trees. In Algorithm Engineering and Experiments (ALENEX), 2005.
[3] S. Alstrup, J. Holm, K. D. Lichtenberg, and M. Thorup. Maintaining information in fully dynamic trees with top trees. ACM Trans. on Algorithms, 1(2):243–264, 2005.
[4] D. Anderson, G. E. Blelloch, and K. Tangwongsan. Work-efficient batch-incremental minimum spanning trees with applications to the sliding window model. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2020. (To appear).
[5] G. E. Blelloch. Programming parallel algorithms. Commun. ACM, 39(3), Mar. 1996.
[6] G. E. Blelloch, J. T. Fineman, Y. Gu, and Y. Sun. Optimal parallel algorithms in the binary-forking model. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2020.
[7] J. Cheriyan, M.-Y. Kao, and R. Thurimella. Scan-first search and sparse certificates: an improved parallel algorithm for k-vertex connectivity. SIAM Journal on Computing, 22(1):157–174, 1993.
[8] R. Cole, P. N. Klein, and R. E. Tarjan. Finding minimum spanning forests in logarithmic time and linear work using random sampling. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 1996.
[9] M. Farach-Colton and M.-T. Tsai. Exact sublinear binomial sampling. Algorithmica, 73(4):637–651, 2015.
[10] H. N. Gabow. A matroid approach to finding edge connectivity and packing arborescences. Journal of Computer and System Sciences, 50(2):259–273, 1995.
[11] P. Gawrychowski, S. Mozes, and O. Weimann. Minimum cut in O(m log² n) time. In Intl. Colloq. on Automata, Languages and Programming (ICALP), 2020. (To appear).
[12] H. Gazit, G. L. Miller, and S. Teng. Optimal tree contraction in an EREW model. In S. K. Tewksbury, B. W. Dickinson, and S. C. Schwartz, editors, Concurrent Computations: Algorithms, Architecture and Technology, pages 139–156, New York, 1988. Plenum Press. Princeton Workshop on Algorithms, Architecture and Technology Issues for Models of Concurrent Computation.
[13] B. Geissmann and L. Gianinazzi. Parallel minimum cuts in near-linear work and low depth. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2018.
[14] M. Ghaffari, K. Nowicki, and M. Thorup. Faster algorithms for edge connectivity via random 2-out contractions. In ACM-SIAM Symposium on Discrete Algorithms (SODA), 2020.
[15] R. E. Gomory and T. C. Hu. Multi-terminal network flows. Journal of the Society for Industrial and Applied Mathematics, 9(4):551–570, 1961.
[16] J. Hao and J. B. Orlin. A faster algorithm for finding the minimum cut in a directed graph. Journal of Algorithms, 17(3):424–446, 1994.
[17] D. Harel and R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J. on Computing, 13(2):338–355, 1984.
[18] L. Hübschle-Schneider and P. Sanders. Parallel weighted random sampling, 2019.
[19] D. Karger. Random Sampling in Graph Optimization Problems. PhD thesis, Stanford University, 1995.
[20] D. R. Karger. Global min-cuts in RNC, and other ramifications of a simple min-cut algorithm. In SODA, volume 93, pages 21–30, 1993.
[21] D. R. Karger. Random sampling in cut, flow, and network design problems. Mathematics of Operations Research, 24(2):383–413, 1999.
[22] D. R. Karger. Minimum cuts in near-linear time. J. ACM, 47(1):46–76, 2000.
[23] D. R. Karger, P. N. Klein, and R. E. Tarjan. A randomized linear-time algorithm to find minimum spanning trees. J. ACM, 42(2):321–328, 1995.
[24] D. R. Karger and R. Motwani. Derandomization through approximation: An NC algorithm for minimum cuts. In ACM Symposium on Theory of Computing (STOC), pages 497–506, 1994.
[25] D. R. Karger and C. Stein. A new approach to the minimum cut problem. J. ACM, 43(4):601–640, 1996.
[26] A. M. Lovett and B. Sandlund. A simple algorithm for minimum cuts in near-linear time. arXiv preprint arXiv:1908.11829, 2019.
[27] D. W. Matula. A linear time 2+ε approximation algorithm for edge connectivity. In ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 500–504, 1993.
[28] G. L. Miller and J. H. Reif. Parallel tree contraction and its application. Technical report, Harvard University, Aiken Computation Laboratory, 1985.
[29] G. L. Miller and J. H. Reif. Parallel tree contraction part 1: Fundamentals. In Randomness and Computation, volume 5, pages 47–72, 1989.
[30] H. Nagamochi and T. Ibaraki. Computing edge-connectivity in multigraphs and capacitated graphs. SIAM Journal on Discrete Mathematics, 5(1):54–66, 1992.
[31] H. Nagamochi and T. Ibaraki. A linear-time algorithm for finding a sparse k-connected spanning subgraph of a k-connected graph. Algorithmica, 7(1-6):583–596, 1992.
[32] C. S. J. Nash-Williams. Edge-disjoint spanning trees of finite graphs. Journal of the London Mathematical Society, 1(1):445–450, 1961.
[33] S. A. Plotkin, D. B. Shmoys, and É. Tardos. Fast approximation algorithms for fractional packing and covering problems. Mathematics of Operations Research, 20(2):257–301, 1995.
[34] S. Rajasekaran and J. H. Reif. Optimal and sublogarithmic time randomized parallel sorting algorithms. SIAM J. on Computing, 18(3), 1989.
[35] B. Schieber and U. Vishkin. On finding lowest common ancestors: Simplification and parallelization. SIAM Journal on Computing, 17(6):1253–1262, 1988.
[36] D. D. Sleator and R. E. Tarjan. A data structure for dynamic trees. Journal of Computer and System Sciences, 26(3):362–391, 1983.
A Parallel Constant-Factor Minimum Cut Approximation
Step one of Karger's procedure for producing a tree packing is to compute a constant-factor approximation to the minimum cut, which is then used to derive the sampling probability for constructing a sparse skeleton. In this section, we will derive an algorithm for a constant-factor approximate minimum cut that runs in O(m log²(n)) work and O(polylog(n)) depth. Karger and Motwani [24] give an algorithm that runs in O(m²/n) work and O(polylog(n)) depth. We achieve our bounds by improving Karger's algorithm and speeding up several of the components. Specifically, we use the following combination of ideas, new and old.
1. We extend a k-approximation algorithm of Karger [20] to work in parallel, allowing us to produce an O(log(n))-approximate minimum cut in low work and depth
2. The log(n)-approximate minimum cut allows us to make O(log log(n)) guesses of the minimum cut such that at least one of them is a constant-factor approximation
3. We use a faster sampling technique for producing Karger's skeletons for weighted graphs. This is done by transforming the graph into a graph that maintains an approximate minimum cut but has edge weights each bounded by O(m log(n)), and then using binomial random variables to sample all of the multiedges of a particular edge at the same time, instead of separately
4. We show that the parallel sparse k-certificate algorithm of Cheriyan, Kao, and Thurimella [7] for unweighted graphs can be modified to run on weighted graphs
5. We show that Karger and Motwani's parallelization of Matula's algorithm can be modified to run on weighted graphs
We will use the following result due to Karger.
Definition 3 (p-skeleton of a graph). Given an unweighted graph G and a probability p, the skeleton G(p) consists of the vertices of G and a random subset of the edges of G, each sampled with probability p.
Lemma 3 (Karger [19]).
With high probability, if G(p) is constructed and has minimum cut ĉ = Ω(log(n)/ε²) for ε ≤ 1, then the minimum cut in G is (1 ± ε)ĉ/p.
Parallelizing the k-approximation algorithm. Karger describes an O(mn^{2/k} log(n)) time sequential algorithm for finding a cut in a graph within a factor of k of the optimal cut [20]. Karger's algorithm works by randomly selecting edges to contract with probability proportional to their weight until a single vertex remains, and keeping track of the component with smallest incident weight (not including internal edges) during the contraction. This relies on the following lemma.
Lemma 4 (Karger [20]). Given a weighted graph with minimum cut c, with probability n^{−2/k}, the meta-vertex with minimum incident weight encountered during a single trial of the contraction algorithm implies a cut of weight at most kc.
Running O(n^{2/k} log(n)) rounds yields a cut of size at most kc w.h.p. Here we show how to parallelize Karger's algorithm using batched mixed operations, yielding the following result:
Lemma 5.
For a weighted graph, a cut within a factor of k of the minimum cut can be found w.h.p. in O(mn^{2/k} log²(n)) work and polylogarithmic depth.
Karger's procedure for selecting the random edge order takes O(m log²(n)) work. It can easily be slightly modified to improve the bounds by a logarithmic factor as follows. The algorithm selects the edges by running a prefix sum over the edge weights. Assuming a total weight of W, it then picks m random integers up to W, and for each uses binary search on the result of the prefix sum to pick an edge. This process, however, might end up picking only the heaviest edges. Karger shows that by removing those edges the total weight W decreases by a constant factor, with high probability. Since the edges can be preprocessed to have weights polynomial in n (see below), repeating for log(n) rounds the algorithm will select all edges in the appropriate weighted random order. Each round takes O(m log(n)) work, for a total of O(m log²(n)) work. However, replacing the binary search with a sort of the random integers and a merge into the result of the prefix sum yields an O(m log(n)) work randomized algorithm. In particular, m random numbers uniformly distributed over a range can be sorted in O(m) work and O(log(n)) span by first determining for each number which of m evenly distributed buckets within the range it is in, then sorting by bucket using an integer sort [34], and finally sorting within buckets.
The more interesting part to parallelize is identifying the component with smallest incident weight during the contraction process. Identifying the edges that are contracted is easy using a minimum spanning tree over the position on which the edge is selected, but keeping track of the smallest incident weight of a component is somewhat trickier. To achieve this, we use our parallel batch mixed operations framework from Section 3. In Appendix B, we show that the following operations have a simple RC implementation and therefore can be applied in batch.
• SubtractWeight(v, w): Subtract weight w from vertex v
• JoinEdge(e): Mark the edge e as "joined"
• QueryWeight(v): Return the weight of the connected component containing the vertex v, where the components are induced by the joined edges

With this tool, we can simulate the contraction process and determine the minimum incident weight of a component as follows:

1. Compute an MST with respect to the random edge ordering, where a heavier weight indicates that an edge contracts later.
2. For each edge (u, v) ∈ G, determine the heaviest edge in the MST on the unique (u, v) path.
3. Construct a vertex-weighted tree from the MST, where the weights are the total incident weight on each vertex in G. For each edge (u, v) in the MST in contraction order:
   • Determine the set of edges in G such that (u, v) is the heaviest edge on its MST path. For each such edge identified, SubtractWeight from each of its endpoints by the weight of the edge
   • Perform JoinEdge on the edge (u, v)
   • Perform QueryWeight on the vertex u

Observe that the weight of a component at the point in time when it is queried is precisely the total weight of its incident edges (again, not including internal edges). Taking the minimum over the initial degrees and all query results therefore yields the desired answer.

Step 1 takes O(m log(n)) work and O(log^2(n)) depth to compute the random edge permutation using Karger's technique [20], and O(m) work and O(log(n)) depth to run a parallel MST algorithm [23]. Step 2 takes O(m log(n)) work and O(log(n)) depth using RC trees [2, 1], and Step 3 takes O(m log(n)) work and O(log^2(n)) depth using our batch evaluation framework (Theorem 2). Based on Lemma 4, trying O(n^{2/k} log(n)) random contractions yields the result of Lemma 5. Setting k = log(n) then gives a log(n)-approximation in O(m log^2(n)) work and O(log^2(n)) depth.

Transformation to bounded edge weights.
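The transformation developed in this section (Lemma 6 below) can be sketched sequentially as follows. This is our own illustrative code: the function name is ours, `c_approx` stands for the O(log n)-approximation produced by Lemma 5, and the paper performs the contractions with a work-efficient parallel connected-components algorithm rather than union-find.

```python
import math

def bound_edge_weights(n, edges, c_approx):
    """Sequential sketch of the weight-bounding transformation (Lemma 6).

    `edges` is a list of (u, v, w) triples with integer weights, and
    `c_approx` is an O(log n)-approximation of the minimum cut. Returns
    a dict mapping contracted endpoint pairs to scaled integer weights
    in [1, O(m log n)].
    """
    m = len(edges)
    # Union-find for contracting edges heavier than c_approx: no such
    # edge can cross a minimum cut, so its endpoints can be merged.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v, w in edges:
        if w > c_approx:
            parent[find(u)] = find(v)
    # Delete edges lighter than s (together they change any cut by at
    # most s*m <= c/2), then scale the remaining weights down by s.
    s = c_approx / (2 * m * math.log2(n))
    scaled = {}
    for u, v, w in edges:
        ru, rv = find(u), find(v)
        if ru == rv or w < s:
            continue  # self-loop after contraction, or too light
        key = (min(ru, rv), max(ru, rv))
        scaled[key] = scaled.get(key, 0) + int(w // s)
    return scaled
```

On a small example, the two heavy edges are contracted away and the two light edges survive with small scaled weights.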
For our algorithm to be efficient, we require that the input graph has small integer weights. Karger [19] gives a transformation that ensures all edge weights of a graph are bounded by O(n^2) without affecting the minimum cut by more than a constant factor. For our algorithm, O(n^2) would be too big, so we design a different transformation that guarantees all edge weights are bounded by O(m log(n)), and only affects the weight of the minimum cut by a constant factor.

Lemma 6.
There exists a transformation that, given an integer-weighted graph G, produces an integer-weighted graph G′ no larger than G, such that G′ has edge weights bounded by O(m log(n)), and the minimum cut of G′ is a constant-approximate minimum cut in G.

Proof. Let G be the input graph and suppose that the true value of the minimum cut is c. First, we use Lemma 5 to obtain an O(log(n))-approximate minimum cut, whose value we denote by c̃ (so c ≤ c̃ ≤ c log(n)). We can contract all edges of the graph with weight greater than c̃, since they cannot appear in the minimum cut. Let s = c̃/(2m log(n)). We delete (not contract) all edges with weight less than s. Since there are at most m edges in any cut, this affects the value of any cut by at most sm = c̃/(2 log(n)) ≤ c/2. Therefore the minimum cut in this graph is still a constant-factor approximation to the minimum cut in G.

Next, scale all remaining edge weights down by the factor s, rounding down. All edge weights are now integers in the range [1, 2m log(n)]. This is the transformed graph G′. It remains to argue that the value of the minimum cut is a constant-factor approximation. First, note that the cut in G′ induced by the true minimum cut of G has value at most c/s, and hence the minimum cut in G′ is no larger. Now consider any cut in G′ (in particular a minimum one), and scale the weights of its edges back up by a factor s. This introduces a rounding error of at most s per edge. Since any cut has at most m edges, the total rounding error is at most sm ≤ c/2. Therefore the value of the minimum cut in G′ is a constant-factor approximation to the value of the minimum cut in G.

Lastly, observe that this transformation can easily be performed in parallel by using a work-efficient connected components algorithm to perform the edge contractions, as is standard (see e.g. [25]).

Sampling the skeleton from a weighted graph.
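As a point of reference for the sampling described in this section, here is a naive sequential sketch of constructing a p-skeleton edge by edge. The per-edge coin-flip loop is for illustration only; the algorithm in the paper draws each binomial variate in polylogarithmic time (Appendix C).

```python
import random

def sample_skeleton(edges, p, rng=None):
    """Naive sketch of p-skeleton sampling for a weighted graph.

    Each edge of weight w stands for w parallel unit multiedges, each
    kept independently with probability p, so the kept multiplicity is
    Binomial(w, p). This O(w)-per-edge loop is illustrative only.
    """
    rng = rng or random.Random()
    skeleton = []
    for u, v, w in edges:
        x = sum(1 for _ in range(w) if rng.random() < p)
        if x > 0:
            skeleton.append((u, v, x))
    return skeleton
```

With p = 1 the skeleton is the graph itself, and with p = 0 it is empty; in between, each kept multiplicity is between 1 and the original weight.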
Note that by definition, the p-skeleton of a graph has O(pm) edges in expectation. For a weighted graph, the p-skeleton is defined as the p-skeleton of the corresponding unweighted multigraph in which an edge of weight w is replaced by w parallel multiedges. The p-skeleton of a weighted graph therefore has O(pW) edges in expectation, where W is the total weight in the graph. Karger gives an algorithm for generating a p-skeleton in O(pW log(m)) work, which relies on performing O(pW) independent random samples with probabilities proportional to the weight of each edge, each of which takes O(log(m)) amortized time. In Karger's algorithm, given a guess of the minimum cut c, he computes p-skeletons for p = Θ(log(n)/c), which results in a skeleton of O(m log(n)) edges, and hence takes O(m log^2(n)) work to compute.

Our algorithm instead does the following. For each edge e in the graph, sample a binomial random variable x ∼ B(w(e), p). The skeleton then contains the edge e with weight x (conceptually, x unweighted copies of the multiedge e). This results in a graph with the same distribution as if sampled using Karger's technique. In Appendix C, we show how to use recent results on sampling binomial random variables to perform samples from B(n′, 1/2) in O(log(n′)) time w.h.p., and from B(n′, p) in O(log^2(n′)) time w.h.p., for any n′ ≤ N after O(N^{1/2+ε}) work preprocessing. Since we can preprocess the graph to have edge weights at most O(m log(n)) (Lemma 6), this is no more than O(m) work in preprocessing.

At first glance, this does not improve on Karger's bounds, since we need to perform O(log^2(n)) work per edge when sampling from B(n, p). However, we use the fact that only the first sample of the graph needs to be this expensive. In Karger's algorithm, and by extension, our algorithm, each subsequent sample always takes p as exactly half of the value of p from the last iteration, and hence we can use subsampling, which only requires random variables from B(n, 1/2). This allows us to perform O(log(n)) rounds of subsampling in O(m log^2(n)) total work, instead of O(m log^3(n)) work.

Sparse certificates. A k-connectivity certificate of a graph G = (V, E) is a graph G′ = (V, E′ ⊆ E) such that every cut in G of weight at most k has the same weight in G′. In other words, a k-connectivity certificate is a subgraph that preserves cuts of weight up to k. A k-connectivity certificate is called sparse if it has O(kn) edges.

Cheriyan, Kao, and Thurimella [7] introduce a parallel graph search called scan-first search, which they show can be used to generate k-connectivity certificates of undirected graphs. Here, we briefly note that the algorithm can easily be extended to handle weighted graphs. The scan-first search algorithm is implemented as shown in Algorithm 4.
Algorithm 4 Scan-first search [7]
procedure SFS(G = (V, E): Graph, r: Vertex)
    Find a spanning tree T′ rooted at r
    Find a preorder numbering of the vertices in T′
    For each vertex v ∈ T′ with v ≠ r, let b(v) denote the neighbor of v with the smallest preorder number
    Let T be the tree formed by {v, b(v)} for all v ≠ r

Using a linear-work, low-depth spanning tree algorithm, scan-first search can easily be implemented in O(m) work and O(log(n)) depth. Cheriyan, Kao, and Thurimella show that if E_i are the edges in a scan-first search forest of the graph G_{i−1} = (V, E \ (E_1 ∪ ... ∪ E_{i−1})), then E_1 ∪ ... ∪ E_k is a sparse k-connectivity certificate. A sparse k-connectivity certificate can therefore be found in O(km) work and O(k log(n)) depth by running scan-first search k times.

In the weighted setting, we treat an edge of weight w as w parallel unweighted multiedges. As always, this is only conceptual; the multigraph is never actually generated. To compute certificates in weighted graphs, we therefore use the following simple modification. After computing each scan-first search tree, instead of removing its edges from G, simply lower their weight by one, and remove them only once their weight becomes zero. It is easy to see that this is equivalent to running the ordinary algorithm on the unweighted multigraph. We therefore have the following.

Lemma 7.
Given a weighted, undirected graph G = ( V, E ) , a sparse k -connectivity certificate canbe found in O ( km ) work and O ( k log( n )) depth. Parallel Matula’s algorithm for weighted graphs.
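Matula's algorithm, described next, repeatedly computes the weighted sparse certificates of Lemma 7. As a sequential sketch of that subroutine, the following uses an ordinary DFS forest in place of scan-first search; scan-first search matters for the parallel depth bound, not for this illustration. The function name and edge representation are ours.

```python
from collections import defaultdict

def sparse_certificate(n, edges, k):
    """Sequential sketch of a sparse k-connectivity certificate for a
    weighted graph (the semantics of Lemma 7).

    Runs k rounds; each round takes one unit of weight from every edge
    of a spanning forest of the positive-weight remaining graph.
    `edges` maps pairs (u, v) with u < v to positive integer weights.
    """
    remaining = dict(edges)
    cert = defaultdict(int)
    for _ in range(k):
        adj = defaultdict(list)
        for (u, v), w in remaining.items():
            if w > 0:
                adj[u].append(v)
                adj[v].append(u)
        seen = set()
        for r in range(n):
            if r in seen:
                continue
            seen.add(r)
            stack = [r]
            while stack:  # iterative DFS building one forest tree
                x = stack.pop()
                for y in adj[x]:
                    if y not in seen:
                        seen.add(y)
                        e = (min(x, y), max(x, y))
                        cert[e] += 1  # take one unit of this edge
                        remaining[e] -= 1
                        stack.append(y)
    return dict(cert)
```

On a weighted triangle, each of the k rounds contributes one spanning forest (two edges), so the certificate carries total weight 2k with per-edge weight at most k.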
Matula [27] gave a linear-time sequential algorithm for a (2 + ε)-approximation to edge connectivity (unweighted minimum cut). It is easy to extend it to weighted graphs so that it runs in O(m log(n) log(W)) time, where W is the total weight of the graph. Using standard transformations to obtain polynomially bounded edge weights, this gives an O(m log^2(n)) algorithm. Karger and Motwani [24] gave a parallel version of Matula's unweighted algorithm that runs in O(m^2/n) work. We will show that a slight modification to this algorithm makes it work on weighted graphs in O(dm log(W/m)) work and O(d log(n) log(W)) depth, where d is the minimum degree of the graph. When d = O(log(n)) and W = O(m polylog(n)), this gives a work bound of O(m log(n) log log(n)).

Essentially, Karger and Motwani's version of Matula's algorithm performs the steps indicated in Algorithm 5.

Algorithm 5 Approximate minimum cut
procedure Matula(G = (V, E): Graph)
    local d ← minimum degree in G
    local k ← d/(2 + ε)
    local C ← compute a sparse k-certificate of G
    local G′ ← contract all non-certificate edges of E
    return min(d, Matula(G′))

It can be shown that at each iteration, the size of the graph is reduced by a constant factor, and hence there are at most O(log(n)) iterations. Furthermore, the work performed at each step is geometrically decreasing, so the total work, using the sparse certificate algorithm of Cheriyan, Kao, and Thurimella [7], is O(dm), and the depth is O(d log^2(n)).

To extend this to weighted graphs, we replace the sparse certificate routine with our modified version for weighted graphs, and replace the computation of d with the equivalent weighted degree. By interpreting an edge-weighted graph as a multigraph where each edge of weight w corresponds to w parallel multiedges, we can see that the algorithm is equivalent. To argue the cost bounds, note that, in place of the original argument that the size of the graph decreases by a constant factor each iteration, it is now the total weight of the graph that must decrease by a constant factor in each iteration. Because of this, it is no longer true that the work of each iteration is geometrically decreasing. Naively, this gives a work bound of O(dm log(W)), but we can tighten this slightly as follows. Observe that after performing log(W/m) iterations, the total weight of the graph will have been reduced to O(m), and hence, as in the sequential algorithm, the work must subsequently begin to decrease geometrically. Hence the total work can actually be bounded by O(dm log(W/m) + dm) = O(dm log(W/m)). We therefore have the following.
Lemma 8.
Given an undirected, weighted graph G with minimum weighted degree d and total weight W, a constant-factor approximation to the minimum cut can be computed in O(dm log(W/m)) work and O(d log(n) log(W)) depth.

A parallel approximation algorithm for minimum cut.
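The doubling search described in this section needs only O(log log n) guesses, since the guesses double from C/log n up to C, a range spanning a factor of log n. A minimal sketch of the guess generation (names ours):

```python
import math

def doubling_guesses(C, n):
    """Guess sequence for the doubling search: starting from an
    O(log n)-approximation C of the minimum cut, try
    c = C/log n, 2C/log n, 4C/log n, ..., C. Doubling across a factor
    of log n yields only O(log log n) guesses, one of which is within a
    factor of two of the minimum cut."""
    c = C / math.log2(n)
    guesses = []
    while c < C:
        guesses.append(c)
        c *= 2
    guesses.append(C)
    return guesses
```

For example, with n = 256 (so log2 n = 8) and C = 80, the guesses are 10, 20, 40, 80.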
The final ingredient needed to produce the parallel minimum cut approximation is a trick due to Karger. Recall that to produce Karger's skeleton graph, the sampling probability must be inversely proportional to the weight of the minimum cut, which, paradoxically, is what we are trying to compute. This issue is solved by using doubling. The algorithm makes successively larger guesses of the minimum cut and computes the resulting approximation. It can then verify whether the guess was too high by checking whether the minimum cut in the skeleton contained too few edges (Lemma 3). Specifically, Karger's sampling theorem (Lemma 6.3.2 of [19]) says that we will know that we have made the correct guess within a factor of two when the skeleton has Ω(log(n)) edges in its minimum cut. To perform the minimum amount of work, we use Lemma 5 to first produce an O(log(n))-approximation to the minimum cut, which allows us to make just O(log log(n)) guesses with the guarantee that one of them will be correct to within a factor of two.

Our algorithm proceeds by making these O(log log(n)) guesses in parallel. For each, we consider the corresponding skeleton graph and compute a Θ(log(n))-certificate, since, by assumption, until we have made the correct guess, the minimum cut in the skeleton will be less than O(log(n)) w.h.p. This then guarantees that we can run our version of parallel Matula's algorithm in O(n log(n) log log(n)) work (Lemma 8), since, after producing the certificate, the total weight of the graph is at most O(n log(n)), and the minimum degree is no more than O(log(n)). The details are depicted in Algorithm 6. It takes O(m log^2(n)) work to produce the sequence of graph skeletons, O(m log(n) log log(n)) work to compute the sparse certificates, and O(n log(n)(log log(n))^2) work to compute the resulting approximations from Matula's algorithm. All together, the algorithm takes at most O(m log^2(n)) work, and runs in O(log^3(n)) depth.

Algorithm 6 Approximate minimum cut algorithm
procedure ApproxMinCut(G = (V, E): Graph)
    local C ← a log(n)-approximation of MinCut(G)
    for c ∈ {C/log(n), 2C/log(n), 4C/log(n), ..., C} do in parallel
        p(c) ← Θ(log(n)/c)
        local G_p ← compute the skeleton graph G′(p(c))
        local G′_p ← compute a Θ(log(n))-certificate of G_p
        ĉ(c) ← Matula(G′_p)
    if ĉ(C/log(n)) ≤ O(log(n)) then
        return ĉ(C/log(n))
    else
        local c ← min{c | ĉ(c) ≥ O(log(n))}
        return ĉ(c)/p(c)

To see that the final returned value is correct, we appeal to Karger's sampling theorem, which says that w.h.p., if our guess of the minimum cut is too high by a factor of two, the minimum cut of the skeleton will have fewer than O(log(n)) edges w.h.p. [19], and hence the certificate algorithm does not damage the minimum cut. Once our guess is below the minimum cut by a factor of two, the sampling theorem says that the minimum cut of the skeleton exceeds Ω(log(n)). Provided that we set the constant of the Θ(log(n))-certificate to be a constant factor larger than this threshold, a Chernoff bound shows that one of our guesses leads to a skeleton with approximate minimum cut ĉ = Ω(log(n)) that is not damaged by the certificate, and then Lemma 3 says that ĉ/p is a constant-factor approximation of the minimum cut w.h.p. This argument works only if the minimum cut of G has size at least Ω(log(n)); but note that if it does not, the skeleton construction G′(1) (which must occur during the last iteration) and the certificate completely preserve the minimum cut, and hence the last iteration of the loop finds a constant-factor approximation of the minimum cut in G.

B Mixed Component Weight Operations
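As a reference point for the operations this appendix implements over RC trees, the following union-find sketch pins down their plain sequential semantics (our own illustration; the RC-tree version below is what supports batching in low depth):

```python
class ComponentWeights:
    """Sequential semantics of SubtractWeight / JoinEdge / QueryWeight,
    using union-find with a weight stored at each component root."""

    def __init__(self, weights):
        self.parent = list(range(len(weights)))
        self.weight = list(weights)  # component weight, valid at roots

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def subtract_weight(self, v, w):
        self.weight[self.find(v)] -= w

    def join_edge(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru != rv:
            self.parent[ru] = rv
            self.weight[rv] += self.weight[ru]  # merge component weights

    def query_weight(self, v):
        return self.weight[self.find(v)]
```

In the contraction simulation of Section 4, SubtractWeight removes the weight of edges as they become internal, so QueryWeight returns exactly the total incident weight of the component.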
Here, we describe a simple RC implementation of the following operations, which are hence amenable to our batched mixed operations framework.

• SubtractWeight(v, w): Subtract weight w from vertex v
• JoinEdge(e): Mark the edge e as "joined"
• QueryWeight(v): Return the weight of the connected component containing the vertex v, where the components are induced by the joined edges

The values stored in the RC clusters are as follows. Vertices store their weight, and unary clusters store the weight of the component reachable via joined edges from the boundary vertex. A binary cluster is either joined, meaning that its boundary vertices are connected by joined edges, in which case it stores a single value, the weight of the component reachable via joined edges from the boundaries; otherwise it is split, in which case it stores a pair: the weight of the component reachable via joined edges from the top boundary, and the weight of the component reachable via joined edges from the bottom boundary. We provide pseudocode for the update operations for illustration in Algorithm 7.

Algorithm 7 A simple RC implementation of SubtractWeight and JoinEdge. Here v_w is the weight of the representative vertex, t and b are the values of the top and bottom subclusters, and U is the set of unary subcluster values; a pair denotes a split subcluster and a single value a joined one.

procedure f_unary(v_w, t, U)
    if t = (t_t, t_b) then return t_t
    else return v_w + t + ∑_{u ∈ U} u

procedure f_binary(v_w, t, b, U)
    if t = t and b = b are single values then return t + b + v_w + ∑_{u ∈ U} u
    else if t = (t_t, t_b) and b is a single value then return (t_t, t_b + v_w + b + ∑_{u ∈ U} u)
    else if t is a single value and b = (b_t, b_b) then return (t + v_w + b_t + ∑_{u ∈ U} u, b_b)
    else if t = (t_t, t_b) and b = (b_t, b_b) then return (t_t, b_b)

procedure SubtractWeight(v, w)
    v.value ← v.value − w
    Reevaluate the f(·) on the path to the root

procedure JoinEdge(e)
    e.value ← 0
    Reevaluate the f(·) on the path to the root

The initial value of a vertex is its starting weight. The initial value of an edge is (0, 0). f_unary and f_binary can be evaluated in constant time, and the structure of the updates involves setting the value at a leaf using an associative operation and re-evaluating the values of the ancestor clusters.

We can argue that the values are correctly maintained by structural induction. First consider unary clusters. If the top subcluster is split, then the representative vertex and unary subclusters are not reachable via joined edges, and hence the only reachable component is the component reachable inside the top subcluster from its top boundary, whose weight is t_t. If the top subcluster is joined, then the representative vertex is reachable, which is by definition the boundary vertex of the unary subclusters, and hence the reachable component is the union of the reachable components of all of the subclusters, whose weight is as given.

For binary clusters, there are four possible cases, depending on whether the top and bottom subclusters are joined or not.
If both are joined, then the representative, and hence the boundary of all subclusters, is reachable from both boundaries, and hence the cluster is joined and the reachable component is the union of the reachable components of the subclusters. If either subcluster is split, then the reachable component at the corresponding boundary is just the reachable component of the subcluster, whose weight is as given. Lastly, if one of the subclusters is joined, then the corresponding boundary can reach the representative vertex, and hence the reachable components of the unary subclusters, whose weights are as given.

It remains to argue that we can implement QueryWeight with a simple RC implementation. Consider a vertex v whose component weight is desired and consider the parent cluster P of v, i.e., the cluster of which v is the representative. If P has no binary subclusters that are joined, observe that P must contain the entire component of v induced by joined edges, since the only way for a component to exit a cluster is via a boundary, which would have to be joined. Answering the query in this situation is therefore easy; the result is the sum of the weights of v, the unary subclusters of P, the bottom boundary weight of the top subcluster (if it exists), and the top boundary weight of the bottom subcluster (if it exists). Suppose instead that P contains a binary subcluster that is joined to some boundary vertex u ≠ v. Since the subcluster is joined, u is in the same induced component as v, and hence QueryWeight(v) has the same answer as QueryWeight(u). By standard properties of RC trees, since u is a boundary of P, we also know that the leaf cluster u is the child of some ancestor of P. Since the root cluster has no binary subclusters, this process of jumping to joined boundaries must eventually discover a vertex that falls into the easy case, and since such a vertex u is always the child of some ancestor of P, the algorithm only examines clusters that are on, or are children of, the root-to-v path in the RC tree, and hence the algorithm is a simple RC implementation.

C Sampling Binomial Random Variables
Our graph sampling procedure makes use of binomial random variables. We will use the following results due to Farach-Colton et al. [9].
Lemma 9 (Farach-Colton et al. [9], Theorem 1). Given a positive integer n, one can sample a random variate from the binomial distribution B(n, 1/2) in O(1) time with probability 1 − 1/n^{Ω(1)} and in expectation, after O(n^{1/2+ε})-time preprocessing for any constant ε > 0, assuming that O(log(n)) bits can be operated on in O(1) time. The preprocessing can be reused for any n′ = O(n).

We can also use the following reduction to sample B(n′, p) for arbitrary 0 ≤ p ≤ 1.

Lemma 10 (Farach-Colton et al. [9], Theorem 2). Given an algorithm that can draw a sample from B(n′, 1/2) in O(f(n)) time with probability 1 − 1/n^{Ω(1)} and in expectation for any n′ ≤ n, drawing a sample from B(n′, p) for any real p can be done in O(f(n) log(n)) time with probability 1 − 1/n^{Ω(1)} and in expectation, assuming each bit of p can be obtained in O(1) time.

We note, importantly, that the model used by Farach-Colton et al. assumes that random Θ(log(n))-size words can be generated in constant time. Since we only assume that we can generate random bits in constant time, we will have to account for this with an extra O(log(n)) factor in the work where appropriate. Note that this does not negatively affect the depth, since we can pregenerate as many random words as we anticipate needing, all in parallel at the beginning of our algorithm. Lastly, we also remark that although it might not be clear from their definition, the constants in the algorithm can be configured to control the constant in the Ω(1) term in the probability, and therefore their algorithms take O(1) time in expectation and O(log(n)) time w.h.p.

Preprocessing in parallel.
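The preprocessing discussed in this section consists of building alias tables, which support O(1) sampling from a discrete distribution. For reference, here is the standard sequential construction (Vose's method); the paper relies on the parallel construction of Hübschle-Schneider and Sanders instead, and the helper names are ours.

```python
import random

def build_alias_table(weights):
    """Vose's alias method: O(n) construction of a table supporting
    O(1) sampling proportional to `weights` (sequential stand-in for
    the parallel construction cited in the text)."""
    n = len(weights)
    total = sum(weights)
    prob = [w * n / total for w in weights]  # normalized to mean 1
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                 # s's leftover mass goes to l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def alias_sample(prob, alias, rng):
    """Draw one index: pick a slot uniformly, then keep it or take its
    alias according to the slot's stored probability."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

For weights [1, 3], index 1 should be drawn about three times as often as index 0.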
In order to make use of these results, we need to show that the preprocessing of Lemma 9 can be parallelized. Thankfully, it is easy. The preprocessing phase consists of generating n^ε alias tables of size O(√(n log(n))). Hübschle-Schneider and Sanders [18] give a linear-work, O(log(n))-depth parallel algorithm for building alias tables. Building all of them in parallel means we can perform the alias table preprocessing in O(n^{1/2+ε}) work and O(log(n)) depth. The last piece of preprocessing information that needs to be generated is a lookup table for decomposing any integer n′ = O(n) into a sum of a constant number of square numbers. This table construction is trivial to parallelize, and hence all preprocessing runs in O(n^{1/2+ε}) work and O(log(n)) depth.

D Bulk Path Updates For RC Trees
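As a specification of the operation this appendix implements, here is a sequential treefix-style sketch on a plain rooted tree (names ours; the RC-tree version described below achieves O(log n) depth). It uses the observation that adding x to every edge on the root-to-v path is the same as adding, to each edge (parent(v), v), the total update weight in v's subtree.

```python
def bulk_path_updates(parent, updates):
    """Apply a batch of root-to-vertex path updates to edge weights.

    `parent[v]` is v's parent (parent[root] = root). Each (v, x) in
    `updates` adds x to every edge on the root-to-v path. Returns a
    dict mapping edges (parent[v], v) to their total added weight.
    """
    n = len(parent)
    delta = [0] * n
    for v, x in updates:
        delta[v] += x
    # Compute depths so vertices can be processed leaves-first.
    depth = [0] * n
    def d(v):
        if parent[v] != v and depth[v] == 0:
            depth[v] = d(parent[v]) + 1
        return depth[v]
    for v in range(n):
        d(v)
    # Deepest first: when v is reached, delta[v] already holds the
    # total update weight of v's entire subtree.
    edge_add = {}
    for v in sorted(range(n), key=lambda v: -depth[v]):
        if parent[v] != v:
            edge_add[(parent[v], v)] = delta[v]
            delta[parent[v]] += delta[v]  # push subtree total upward
    return edge_add
```

On the path 0-1-2 rooted at 0, an update of 5 at vertex 2 and 2 at vertex 1 gives edge (0,1) a total of 7 and edge (1,2) a total of 5.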
Given an RC tree of a tree on n vertices, a root r, and a set of k path updates of the form (v_i, x_i), denoting that x_i is to be added to the weight of all edges on the path from r to v_i, we can apply all of them in O(n) work and O(log(n)) depth w.h.p. The idea is similar to a standard treefix sum algorithm. First, the algorithm associates each vertex v_i with its weight x_i. Now the operation is to add to each edge (u, v), where v is a child of u, the weight of all vertices in the subtree rooted at v. The algorithm then proceeds in two steps. First, in a traversal of the RC tree, it computes, for each cluster C, the total weight W(C) of all vertices in it. This step is the same as the first step of our batch evaluation algorithm, and takes O(n) work and O(log(n)) depth w.h.p. Second, for each child cluster of the root, it traverses the RC tree top-down, maintaining the weight w of all vertices that are descendants of the current cluster (initially zero). This is achieved, when recursing on C.t, by adding W(C) − W(C.t) to the accumulated weight w, and otherwise keeping w the same. When reaching a base edge cluster, the value w is added to the weight of the edge. This takes O(n) work and O(log(n)) depth w.h.p.

E Additional Mixed Tree Operations
Algorithm 8 A simple RC implementation of QueryEdge and QueryPath′. For QueryPath′, the node v must be the representative node of an ancestor of u in the RC tree. It maintains m, the minimum off the cluster path; t, the minimum on the cluster path above where the path from u meets the cluster path; and b, the minimum on the cluster path below that point. Once v is found, it picks one of b or t depending on which side v is on, or neither if a unary cluster. If a binary cluster, it then needs to continue up the tree to add in the additional weights for binary clusters.

procedure QueryEdge(e)
    w ← w(e)
    x ← e; p ← x.p
    while binary(p) do
        if x = p.t then
            w ← w + p.b.w + p.v.w + ∑_{u ∈ p.U} u.w
        x ← p; p ← x.p
    return w + p.v.w + ∑_{u ∈ p.U} u.w

procedure QueryPath′(u, v)
    m ← ∞; t ← ∞; b ← ∞
    x ← u; p ← x.p
    while not p.v = v do
        w′ ← p.v.w + ∑_{u ∈ p.U} u.w
        if unary(p) then
            if x = p.t then m ← min(t + w′, m)
            else m ← min(p.t.l + w′, m)
            t ← ∞; b ← ∞
        else
            w′ ← w′ + p.b.w
            if x = p.t then
                t ← t + w′; b ← min(b, p.b.l)
            else if x = p.b then
                t ← min(p.t.l + w′, t)
            else
                t ← p.t.l + w′; b ← p.t.l
        x ← p; p ← x.p
    if x = p.t then l ← b
    else if x = p.b then l ← t
    else return m
    while binary(p) do
        w′ ← p.v.w + p.b.w + ∑_{u ∈ p.U} u.w
        if x = p.t then l ← l + w′
        x ← p; p ← x.p
    return min(m, l)
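For reference, the quantity that a path query of this kind answers can be specified naively on a plain rooted tree: the minimum edge weight on the path from a vertex up to one of its ancestors. The names and representation below are ours; the RC-tree implementation in Algorithm 8 answers the same query in O(log n) rather than O(depth).

```python
def path_min_naive(parent, weight, u, v):
    """Minimum edge weight on the tree path from u up to its ancestor v.

    `parent[x]` is x's parent (parent[root] = root) and `weight[x]` is
    the weight of the edge (parent[x], x). Assumes v is an ancestor of
    u (or u itself, in which case the path is empty).
    """
    best = float("inf")
    while u != v:
        best = min(best, weight[u])
        u = parent[u]
    return best
```

On the path 0-1-2-3 with edge weights 4, 2, 7 (from the root down), the minimum on the path from 3 up to the root is 2.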