Distributed Computation in Node-Capacitated Networks
John Augustine, Mohsen Ghaffari, Robert Gmyr, Kristian Hinnenthal, Fabian Kuhn, Jason Li, Christian Scheideler
JOHN AUGUSTINE,
IIT Madras
MOHSEN GHAFFARI,
ETH Zurich
ROBERT GMYR,
University of Houston
KRISTIAN HINNENTHAL and CHRISTIAN SCHEIDELER,
Paderborn University
FABIAN KUHN,
University of Freiburg
JASON LI,
Carnegie Mellon University
In this paper, we study distributed graph algorithms in networks in which the nodes have a limited communication capacity. Many distributed systems are built on top of an underlying networking infrastructure, for example by using a virtual communication topology known as an overlay network. Although this underlying network might allow each node to directly communicate with a large number of other nodes, the amount of communication that a node can perform in a fixed amount of time is typically much more limited.

We introduce the Node-Capacitated Clique model as an abstract communication model, which allows us to study the effect of nodes having limited communication capacity on the complexity of distributed graph computations. In this model, the n nodes of a network are connected as a clique and communicate in synchronous rounds. In each round, every node can exchange messages of O(log n) bits with at most O(log n) other nodes. When solving a graph problem, the input graph G is defined on the same set of n nodes, where each node knows which other nodes are its neighbors in G.

To initiate research on the Node-Capacitated Clique model, we present distributed algorithms for the Minimum Spanning Tree (MST), BFS Tree, Maximal Independent Set, Maximal Matching, and Vertex Coloring problems. We show that even with only O(log n) concurrent interactions per node, the MST problem can still be solved in polylogarithmic time. In all other cases, the runtime of our algorithms depends linearly on the arboricity of G, which is a constant for many important graph families such as planar graphs.

CCS Concepts: • Theory of computation → Distributed algorithms.

Additional Key Words and Phrases: Distributed Algorithms, Node Capacity, Graph Algorithms
Nowadays, most distributed systems and applications do not have a dedicated communication infrastructure, but instead share a common physical network with many others. The logical network formed on top of this infrastructure is called an overlay network. For these systems, the amount of information that a node can send out in a single round does not scale linearly with the number of its incident edges. Instead, it rather depends on the bandwidth of the connection of the node to the communication infrastructure as a whole. For these networks, it is therefore more reasonable to impose a bound on the amount of information that a node can send and receive in one round, rather than imposing a bound on the amount of information that can be sent along each of its incident edges. Also, the topology of the overlay network may change over time, and these changes are usually under the control of the distributed application. To capture these aspects, we propose to study the so-called
Node-Capacitated Clique model.∗ The model is inspired in part by the Congested Clique model first introduced by Lotker, Patt-Shamir, Pavlov, and Peleg [47], which has received significant attention recently [8, 9, 11, 12, 14, 20–22, 25, 26, 28–30, 33, 34, 37, 38, 45, 47].

∗ This is an extended version of a paper that will appear at SPAA 2019. Authors' addresses: John Augustine, [email protected], IIT Madras, India; Mohsen Ghaffari, ghaff[email protected], ETH Zurich, Switzerland; Robert Gmyr, [email protected], University of Houston, USA; Kristian Hinnenthal; Christian Scheideler, {krijan,scheidel}@mail.upb.de, Paderborn University, Germany; Fabian Kuhn, [email protected], University of Freiburg, Germany; Jason Li, [email protected], Carnegie Mellon University, USA.

Similarly to the Congested Clique model, the nodes of the Node-Capacitated Clique are interconnected by a complete graph. However, in the Node-Capacitated Clique every node can only send and receive at most O(log n) messages consisting of O(log n) bits in each round. This limitation is added precisely to address the issue explained above. It particularly rules out the possibility that the model allows one node to be in contact with up to Θ(n) other nodes at the same time; a property of the Congested Clique that seems to severely limit its practicability. We comment that the capacity bound of O(log n) messages per node per round is a natural choice: it is small enough to ensure scalability, and any smaller would require unnecessarily complicated techniques for the protocol to ensure nodes do not receive more messages than the capacity bound.

Compared to traditional overlay network research, the Node-Capacitated Clique model has the advantage that it abstracts away the issue of designing and maintaining a suitable overlay network, for which many solutions have already been found in recent years.
Nevertheless, it is closely related to overlay networks: every overlay network algorithm (i.e., an algorithm in which overlay edges can be established by introducing nodes to each other, and which satisfies the capacity bound of O(log n) messages) can be simulated in the Node-Capacitated Clique without any overhead. Furthermore, any algorithm for our model can be simulated with a multiplicative O(log n) runtime overhead in the CRCW PRAM model (by assigning each processor O(log n) memory cells, and letting nodes write into randomly chosen cells of other processors), which in turn can be simulated with only O(log n) overhead by a network of constant degree [53]. The Congested Clique model and its broadcast variant, on the other hand, are far more powerful (and arguably beyond what is possible in overlay networks): whereas in the Congested Clique a total of Θ̃(n²) bits can be transmitted in each round, in the Node-Capacitated Clique only Θ̃(n) bits may be sent. For example, the gossip problem—i.e., delivering one message from each node to every other node—can be solved in a single round in the Congested Clique, whereas the problem requires at least Ω(n/log n) rounds in the Node-Capacitated Clique model. Even the simple broadcast problem—i.e., delivering one message from one node to all nodes—already takes time Ω(log n/log log n) in the Node-Capacitated Clique.

In this paper, we assume some edges of the network are marked as edges of an input graph G, where each node knows which other nodes are its neighbors in G, and aim to solve graph problems on G using the power of the Node-Capacitated Clique. Such edges can, for instance, be seen as edges of an underlying physical network, or represent relations between nodes in social networks.
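To make the bandwidth gap and the resulting gossip bound concrete, a quick back-of-the-envelope calculation helps (an illustrative sketch; the concrete constants inside the O-notation are our own choice):

```python
import math

def congested_clique_bits(n, msg_bits):
    # Congested Clique: every ordered pair of nodes may exchange one message per round.
    return n * (n - 1) * msg_bits

def ncc_bits(n, msg_bits, cap):
    # Node-Capacitated Clique: each node sends at most `cap` messages per round.
    return n * cap * msg_bits

n = 2 ** 16
b = int(math.log2(n))   # message size: O(log n) bits
cap = b                 # capacity: O(log n) messages per node per round

# Per-round bandwidth ratio is roughly n / log n.
print(congested_clique_bits(n, b) // ncc_bits(n, b, cap))

# Gossip: every node must receive n - 1 distinct messages but can only
# receive `cap` per round, giving the Omega(n / log n) round bound.
print(math.ceil((n - 1) / cap))
```

For n = 2^16, the per-round bandwidth of the Congested Clique exceeds that of the Node-Capacitated Clique by a factor of several thousand, and gossip alone already needs thousands of rounds.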
Our results in that direction also turn out to be useful for some other theoretical models: they are relevant for hybrid networks [27] and also the k-machine model for processing large-scale graphs [36].

The concept of hybrid networks has just recently been considered in theory (e.g., [27]). In a hybrid network, nodes have different communication modes: we are given a network of cheap links of arbitrary topology that is not under the control of the nodes and may potentially be changing over time. In addition to that, the nodes have the ability to build arbitrary overlay networks of costly links that are fully under the control of the nodes. Cell phones, for example, can communicate in an ad-hoc fashion via their WiFi interfaces, which is free but only has a limited range, and whose connections may change as people move. Additionally, they may use their cellular infrastructure, which comes at a price, but remains fully under their control. Although in the idealized setting this overlay network may form a clique, to save costs, the nodes might want to exchange only a small number of messages of small size in each communication round. This property is captured by the Node-Capacitated Clique. The network of cheap links, on the other hand, can be seen as an input graph in the Node-Capacitated Clique for which the nodes want to solve a graph problem of interest.

Another interesting application of the Node-Capacitated Clique is the recently introduced k-machine model [36], which was designed for the study of data-center-level distributed algorithms for large-scale graph problems. Here, a data center with k servers is modeled as k machines that are fully interconnected and capable of executing synchronous message passing algorithms. A standard approach for the k-machine model is to partition the input graph in a fair way so that each machine stores a set of nodes of the input graph with their incident edges.
It is quite natural to simulate algorithms designed for the Node-Capacitated Clique model in the k-machine model. Precisely, any algorithm that requires T rounds in the Node-Capacitated Clique model can be simulated in time at most Õ(nT/k). The details of this simulation can be found in Appendix A. To illustrate the usefulness of this simulation, we remark that the running time of the fast minimum spanning tree algorithm provided by Pandurangan et al. [51] can be obtained simply by converting the algorithm we provide in this work to the k-machine model.

As we demonstrate in this paper, many graph problems can be solved efficiently in the Node-Capacitated Clique, which shows that many interesting problems can be solved efficiently in distributed systems based on an overlay network over a shared infrastructure as well as hybrid networks and server systems.

In the
Node-Capacitated Clique model we consider a set V of n computational entities that we model as nodes of a graph. Each node has a unique identifier consisting of O(log n) bits, and every node knows the identifiers of all nodes such that, on a logical level, they form a complete graph. Note that since every node knows the identifier of every other node, the nodes also know the total number of nodes n. As node identifiers are common knowledge, without loss of generality we can assume that the identifiers are from the set {0, 1, . . . , n − 1}.

The network operates in a synchronous manner with time measured in rounds. In every round, each node can perform an arbitrary amount of local computation and send distinct messages consisting of O(log n) bits to up to O(log n) other nodes. The messages are received at the beginning of the next round. A node can receive up to O(log n) messages. If more messages are sent to a node, it receives an arbitrary subset of O(log n) messages; additional messages are simply dropped by the network.

Let G = (V, E) be an undirected graph with an arbitrary edge set, but the same node set as the Node-Capacitated Clique. We aim to solve graph problems on G in the Node-Capacitated Clique model. At the beginning, each node locally knows which identifiers correspond to its neighbors in G, but has no further knowledge about the graph.

The Congested Clique model has already been studied extensively in the past years.
Problems studied in prior work include routing and sorting [45], minimum spanning trees [26, 28, 34, 38, 47], subgraph detection [8, 11, 14], shortest paths [9, 11], local problems [12, 29, 30], minimum cuts [25, 33], and problems related to matrix multiplication [11, 20]. Some of the upper bounds are astonishingly small, such as the constant-time upper bounds for routing and sorting and for the computation of a minimum spanning tree, demonstrating the power of the Congested Clique model.

While almost no non-trivial lower bounds exist for the Congested Clique model (due to their connection to circuit complexity [15]), various lower bounds have already been shown for the more general CONGEST model [17, 19, 43, 46, 49, 52, 54]. As pointed out in [39], the reductions used in these lower bounds usually boil down to constructing graphs with bottlenecks, that is, graphs where large amounts of information have to be transmitted over a small cut.

Problem                      Runtime
Minimum Spanning Tree        O(log n)
BFS Tree                     O((a + D + log n) log n)
Maximal Independent Set      O((a + log n) log n)
Maximal Matching             O((a + log n) log n)
O(a)-Coloring                O((a + log n) log^{3/2} n)

Table 1. An overview of our results. We use a for the arboricity and D for the diameter of the given graph.

As this is not the case for the Node-Capacitated Clique, the lower bounds are of limited use here. Therefore, it remains interesting to determine upper and lower bounds for the Node-Capacitated Clique.

Hybrid networks have only recently been studied in theory. An example is the hybrid network model proposed in [27], which allows the design of much faster distributed algorithms for graph problems than with a classical communication network. Also, the problem of finding short routing paths with the help of a hybrid network approach has been considered [32]. A priori, these papers do not assume that the nodes are completely interconnected, so extra measures have to be taken to build up appropriate overlays.
Abstracting from that problem, the Node-Capacitated Clique allows one to focus on how to efficiently exchange information in order to solve the given problems.

The graph problems considered in this paper have already been extensively studied in many different models. In the CONGEST model, for example, a breadth-first search can trivially be performed in time O(D). There exists an abundance of algorithms to solve the maximal independent set, the maximal matching, and the coloring problem in the CONGEST model (see, e.g., [6] for a comprehensive overview). Computing a minimum spanning tree has also been well studied in that model (see, e.g., [17, 18, 52, 54]). Whereas the running times of the above-mentioned algorithms depend on D and additional polylogarithmic factors, algorithms have also been proposed to solve such problems more efficiently in graphs with small arboricity [3–6, 40, 41]. Notably, Barenboim and Khazanov [7] show how to solve a variety of graph problems in the Congested Clique efficiently for such graphs, e.g., compute an O(a)-orientation in time O(log a), an MIS in time O(√a), and an O(a)-coloring in time O(a^ε), where a is the arboricity of the given graph. The algorithms make use of the Nash-Williams forest-decomposition technique [50], which is one of the key techniques used in our work.
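The forest-decomposition technique yields low-outdegree orientations directly: root every tree of every forest arbitrarily and direct each edge from child to parent. A centralized sketch of this step, assuming the decomposition is already given (the instance and all names are our own illustration):

```python
from collections import Counter, defaultdict

def orient_from_forests(forests):
    """Given a partition of a graph's edges into `forests` (a Nash-Williams
    decomposition), return an orientation with outdegree <= len(forests):
    root every tree and direct each edge from child to parent."""
    orientation = []
    for forest in forests:
        adj = defaultdict(list)
        for u, v in forest:
            adj[u].append(v)
            adj[v].append(u)
        parent = {}
        for root in adj:                 # pick an arbitrary root per tree
            if root in parent:
                continue
            parent[root] = None
            stack = [root]
            while stack:                 # DFS to assign parents
                u = stack.pop()
                for w in adj[u]:
                    if w not in parent:
                        parent[w] = u
                        stack.append(w)
        # Each non-root node contributes exactly one out-edge per forest.
        orientation += [(c, p) for c, p in parent.items() if p is not None]
    return orientation

# K4 has arboricity 2; one valid split of its 6 edges into two forests:
forests = [[(0, 1), (1, 2), (2, 3)], [(0, 2), (0, 3), (1, 3)]]
orientation = orient_from_forests(forests)
outdeg = Counter(u for u, _ in orientation)
print(len(orientation), max(outdeg.values()))  # 6 edges, outdegree at most 2
```

Since a node is a non-root in at most one tree per forest, the outdegree is bounded by the number of forests, i.e., by the arboricity when the decomposition is optimal.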
We present a set of basic communication primitives and then show how they can be applied to solve certain graph problems (see Table 1 for an overview). Note that for many important graph families such as planar graphs, our algorithms have polylogarithmic runtime (except when depending on the diameter D).

Although many of our algorithms rely on existing algorithms from the literature, we point out that most of these algorithms cannot be executed in the Node-Capacitated Clique in a straightforward fashion. The main reason is that high-degree nodes cannot efficiently communicate with all of their neighbors directly in our model, which imposes significant difficulties on the application of these algorithms. To overcome these difficulties, we present a set of basic tools that still allow for efficient communication, and combine them with variations of well-known algorithms and novel techniques. Notably, we present an algorithm to compute an orientation of the input graph G with arboricity a, in which each edge gets assigned a direction, ensuring that the outdegree of any node is at most O(a). The algorithm is later used to efficiently construct multicast trees to be used for communication between nodes. Achieving this is a highly nontrivial task in our model and requires a combination of techniques, ranging from aggregation and multicasting to shared randomness and coding techniques. We believe that many of the presented ideas might also be helpful for other applications in the Node-Capacitated Clique.

Although proving lower bounds for the presented problems seems to be a highly nontrivial task, we believe that many problems require a running time linear in the arboricity. For the MIS problem, for example, it seems that we need to communicate at least 1 bit of information about every edge (typically in order for a node of the edge to learn when the edge is removed from the graph because the other endpoint has joined the MIS).
However, explicitly proving such a lower bound in this model seems to require more than our current techniques for proving multi-party communication complexity lower bounds.

In this section, we first give some basic definitions and then describe a set of communication primitives needed throughout the paper.
Let G = (V, E) be an undirected graph. The neighborhood of a node u is defined as N(u) = {v ∈ V | {u, v} ∈ E}, and d(u) = |N(u)| denotes its degree. With ∆ = max_{u∈V} d(u) we denote the maximum degree of all nodes in G, and d̄ = Σ_{u∈V} d(u)/n is the average degree of all nodes. The diameter D of G is the maximum length of all shortest paths in G.

The arboricity a of G is the minimum number of forests into which its edges can be partitioned. Since the edges of any graph with maximum degree ∆ can be greedily assigned to ∆ forests, a ≤ ∆. Furthermore, since the average degree of a forest is at most 2, and the edges of G can be partitioned into a forests, d̄ ≤ 2a. Graphs of many important graph families have small arboricity although their maximum degree might be unbounded. For example, a tree obviously has arboricity 1. Nash-Williams [50] showed that the arboricity of a graph G is given by max_{H⊆G} ⌈m_H/(n_H − 1)⌉, where H ⊆ G is a subgraph of G with at least two nodes and n_H and m_H denote the number of nodes and edges of H, respectively. Therefore, any planar graph, which has at most 3n − 6 edges, has arboricity at most 3, and any graph of genus g, which is the minimum number of handles that must be added to the plane to embed the graph without any crossings, has arboricity O(√g) [4]. Furthermore, it is known that the family of graphs that exclude a fixed minor [13] and the family of graphs with bounded treewidth [16] have bounded arboricity.

An orientation of G is an assignment of directions to each edge, i.e., for every {u, v} ∈ E either u → v (u is directed to v) or v → u (v is directed to u). If u → v, then u is an in-neighbor of v and v is an out-neighbor of u. For u ∈ V define N_in(u) = {v ∈ V | v → u} and N_out(u) = {v ∈ V | u → v}. The indegree of a node u is defined as d_in(u) = |N_in(u)| and its outdegree is d_out(u) = |N_out(u)|. A k-orientation is an orientation with maximum outdegree k. For a graph with
For a graph witharboricity a , there always exists an a -orientation: we root each tree of every forest arbitrarily and direct every edgefrom child to parent node.To allow each node to efficiently gather information sent to it by other nodes, our communication primitives makeheavy use of aggregate functions . An aggregate function f maps a multiset S = { x , . . . , x N } of input values to somevalue f ( S ) . For some functions f it might be hard to compute f ( S ) in a distributed fashion, so we will focus on so-called distributive aggregate functions : An aggregate function f is called distributive if there is an aggregate function д suchthat for any multiset S and any partition S , . . . , S ℓ of S , f ( S ) = д ( f ( S ) , . . . , f ( S ℓ )) . Classical examples of distributiveaggregate functions are MAX, MIN, and SUM. ur algorithms make heavy use of randomized strategies. To show that the correctness and runtime of the algo-rithms hold with high probability (w.h.p.) , we use a generalization of the Chernoff bound in [56] (Theorem 2):
Lemma 2.1.
Let X_1, . . . , X_n be k-wise independent random variables with X_i ∈ [0, b], and let X = Σ_{i=1}^n X_i. Then for all δ ≥ 0, µ ≥ E[X], and k ≥ ⌈δµ⌉, it holds that Pr[X ≥ (1 + δ)µ] ≤ e^{−min[δ, δ²]·µ/(3b)}.

Our algorithms make heavy use of a set of communication primitives, which are presented in this section. Whereas the
Aggregate-and-Broadcast algorithm will be used as a general tool for aggregation and synchronization purposes, the other primitives are used to allow nodes to send and receive messages to and from specific sets of nodes associated with them. Note that a node is not able to send or receive a large set of messages in few rounds; the center of a star, for example, would need linear time to deliver messages to all of its neighbors. If, however, the number of distinct messages a node has to send is small, or if messages destined for a node can be combined using an aggregate function, then messages can be efficiently delivered using a randomized routing strategy. Due to space limitations, we only present the high-level ideas of our algorithms and state their results. The full description and all proofs can be found in Appendix B.
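The star example is easy to see in a toy, sequential rendering of a single communication round of the model (our own illustrative simulator, not code from the paper; the drop policy for overloaded receivers is one admissible choice of "arbitrary subset"):

```python
import math
import random

def ncc_round(n, outboxes, seed=0):
    """Deliver one synchronous Node-Capacitated Clique round: each node may
    send to at most cap = O(log n) targets, and each node keeps at most cap
    of the messages addressed to it; the rest are dropped by the network."""
    cap = max(1, math.ceil(math.log2(n)))
    inboxes = {v: [] for v in range(n)}
    for u, msgs in outboxes.items():
        assert len(msgs) <= cap, "sender exceeds its capacity"
        for target, msg in msgs:
            inboxes[target].append((u, msg))
    rng = random.Random(seed)
    for v in range(n):
        if len(inboxes[v]) > cap:          # overloaded: an arbitrary subset survives
            inboxes[v] = rng.sample(inboxes[v], cap)
    return inboxes

# Star: all n - 1 leaves message the center at once; only cap get through,
# so the center needs roughly (n - 1)/cap rounds to hear from every leaf.
n = 64
inboxes = ncc_round(n, {u: [(0, "hello")] for u in range(1, n)})
print(len(inboxes[0]))  # 6, i.e. ceil(log2(64))
```

This is exactly why the primitives below route through intermediate nodes and combine messages with aggregate functions instead of addressing a hotspot directly.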
Butterfly Simulation.
To distribute local communication load over all nodes of the network, our algorithms rely on an emulation of a butterfly network. Formally, for d ∈ N, the d-dimensional butterfly is a graph with node set [d + 1] × [2^d], where we denote [k] = {0, . . . , k − 1}, and edge set E_1 ∪ E_2 with

E_1 = {{(i, α), (i + 1, α)} | i ∈ [d], α ∈ [2^d]},
E_2 = {{(i, α), (i + 1, β)} | i ∈ [d], α, β ∈ [2^d], α and β differ only at the i-th bit}.

The node set {(i, j) | j ∈ [2^d]} represents level i of the butterfly, and the node set {(i, j) | i ∈ [d + 1]} represents column j of the butterfly. In our algorithms, every node u ∈ V with identifier i ≤ 2^d − 1 emulates the nodes of column i of the d-dimensional butterfly with d = ⌊log n⌋. Since u knows the identifiers of all other nodes, it knows exactly which nodes emulate its neighbors in the butterfly. As every node in the Node-Capacitated Clique can send and receive O(log n) messages in each round, and the butterfly has constant degree, a communication round in the butterfly can be simulated in a single round in our model.

Aggregate-and-Broadcast Problem.
We are given a distributive aggregate function f and a set A ⊆ V, where each member of A stores exactly one input value. The goal is to let every node learn f(inputs of A).

Theorem 2.2.
There is an
Aggregate-and-Broadcast Algorithm that solves the Aggregate-and-Broadcast Problem in time O(log n).

In principle, the algorithm first aggregates all values from the topmost (i.e., level 0) to the bottommost level (i.e., level d) of the butterfly, and then broadcasts the result upwards to all nodes in the butterfly. We say an event holds with high probability (w.h.p.) if it holds with probability at least 1 − 1/n^c for any fixed constant c > 0.

Aggregation Problem. We are given a distributive aggregate function f and a set of aggregation groups A = {A_1, . . . , A_N}, A_i ⊆ V, i ∈ {1, . . . , N}, with targets t_1, . . . , t_N ∈ V, where each node holds exactly one input value s_{u,i} for each aggregation group A_i of which it is a member, i.e., u ∈ A_i. Note that a node may be member or target of multiple aggregation groups. The goal is to aggregate these input values so that eventually t_i knows f(s_{u,i} | u ∈ A_i) for all i. We define L = Σ_{i=1}^N |A_i| to be the global load of the Aggregation Problem, and the local load ℓ = ℓ_1 + ℓ_2, where ℓ_1 = max_{u∈V} |{i ∈ {1, . . . , N} | u ∈ A_i}| and ℓ_2 = max_{u∈V} |{i ∈ {1, . . . , N} | u = t_i}|. Whereas the global load captures the total number of messages that need to be processed, ℓ_1 and ℓ_2 indicate the work required for inserting messages into the butterfly, or sending aggregates from butterfly nodes to their targets, respectively. We require that every node knows the identifier and target of all aggregation groups it is a member of, and an upper bound ℓ̂ on ℓ.

Theorem 2.3.
There is an
Aggregation Algorithm that solves any Aggregation Problem in time O(L/n + (ℓ + ℓ̂)/log n + log n), w.h.p.

From a very high level, the algorithm works as follows. First, packets are sent to random nodes of the topmost level of the butterfly. Then, packets belonging to the same aggregation group A_i are routed to an intermediate target h(i) in the bottommost level of the butterfly using a (pseudo-)random hash function h and a variant of the random rank routing protocol [1, 57]. Whenever two packets belonging to the same aggregation group collide on a butterfly node, they are combined using the function f. Finally, the result of aggregation group A_i is sent from its intermediate target to its actual target t_i.

The intermediate steps of the algorithm are synchronized using a variant of the Aggregate-and-Broadcast algorithm: every node delays its participation in an aggregation until having finished the current step. Once the aggregation finishes, all nodes become informed about a common round to start the next step. Termination of the routing protocol can easily be determined by passing down tokens in the butterfly. We also use the same techniques to achieve synchronization for all other algorithms in this paper without explicitly mentioning it.

Note that common hash functions require shared randomness.
Although in the remainder of this paper we assume that all hash functions behave like perfectly random functions, it can be shown that it suffices to use Θ(log n)-wise independent hash functions (see, e.g., [10] and the references therein): whenever we aim to show that the outcome of a random experiment deviates from the expected value by at most O(log n), w.h.p., we can immediately use Lemma 2.1; if the deviation we aim to show is higher, we can partition the events in a suitable way so that we only need Θ(log n)-wise independence for each subset of events, and the sum of the deviations does not exceed the overall desired deviation. To agree on such hash functions, all nodes have to learn Θ(log² n) random bits. This can be done by letting the node with identifier 0 broadcast Θ(log n) messages, each consisting of log n bits, to all other nodes using the butterfly.

Multicast Tree Setup Problem.
We are given a set of multicast groups A = {A_1, . . . , A_N}, A_i ⊆ V, with sources s_1, . . . , s_N ∈ V such that each node is the source of at most one multicast group (but possibly a member of multiple groups). The goal is to set up a multicast tree T_i in the butterfly for each i ∈ {1, . . . , N} with root h(i), which is a node chosen uniformly and independently at random among the nodes of the bottommost level of the butterfly, and a unique and randomly chosen leaf l(i, u) in the topmost level for each u ∈ A_i. Let L = Σ_{i=1}^N |A_i| and ℓ = max_{u∈V} |{i ∈ {1, . . . , N} | u ∈ A_i}|, and define the congestion of the multicast trees to be the maximum number of trees that share the same butterfly node. We require that each node u ∈ V knows the identifier and source of all multicast groups it is a member of. (We only enumerate the aggregation groups from 1, . . . , N to simplify the presentation of the algorithm. Actually, we only require each aggregation group to be uniquely identified, which can easily be achieved for all algorithms in this paper.)

Theorem 2.4. There is a
Multicast Tree Setup Algorithm that solves any Multicast Tree Setup Problem in time O(L/n + ℓ/log n + log n), w.h.p. The resulting multicast trees have congestion O(L/n + log n), w.h.p.

The algorithm shares many similarities with the Aggregation Algorithm; in fact, the multicast trees stem from the paths taken by the packets during an aggregation. Alongside the aggregation, every butterfly node u records for every i ∈ {1, . . . , N} all edges along which packets from group A_i arrived during the routing towards h(i), and declares them as edges of T_i.

Multicast Problem.
Assume we have constructed multicast trees for a set of multicast groups A = {A_1, . . . , A_N}, A_i ⊆ V, with sources s_1, . . . , s_N ∈ V such that each node is the source of at most one multicast group. The goal is to let every source s_i send a message p_i to all nodes u ∈ A_i. Let C be the congestion of the multicast trees and ℓ = max_{u∈V} |{i ∈ {1, . . . , N} | u ∈ A_i}|. We require that the nodes know an upper bound ℓ̂ on ℓ.

Theorem 2.5.
There is a
Multicast Algorithm that solves any Multicast Problem in time O(C + ℓ̂/log n + log n), w.h.p.

The algorithm multicasts messages by sending them upwards along the multicast trees, performing our routing strategy in "reverse order". We remark that, similar to the Aggregation Algorithm, the Multicast Algorithm may easily be extended to allow a node to be the source of multiple multicasts; however, we will only need the simplified variant in our paper.
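Stripped of the butterfly routing details, the multicast step just pushes each packet from the root of its multicast tree out to its leaves; a minimal sketch (the tree shape and node names are hypothetical stand-ins):

```python
from collections import deque

def multicast(children, root, packet):
    """Deliver `packet` to every node of a multicast tree, given as a map
    from node to its list of children; returns the nodes reached in BFS order."""
    reached = []
    queue = deque([root])
    while queue:
        u = queue.popleft()
        reached.append(u)                  # u receives the packet...
        queue.extend(children.get(u, ()))  # ...and forwards it to its children
    return reached

# A tree T_i rooted at the intermediate target h(i), with leaves l(i, u)
# standing in for the group members u in A_i.
children = {"h(i)": ["x", "y"], "x": ["l(i,u1)", "l(i,u2)"], "y": ["l(i,u3)"]}
print(multicast(children, "h(i)", "p_i"))
```

In the actual algorithm, every tree edge is an edge of the emulated butterfly, and the congestion C bounds how many trees compete for the same butterfly node per round.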
Multi-Aggregation Problem.
We are given a set of multicast groups A = {A_1, . . . , A_N}, A_i ⊆ V, with sources s_1, . . . , s_N ∈ V such that every source s_i stores a multicast packet p_i, and every node is the source of at most one multicast group. We assume that multicast trees for the multicast groups with congestion C have already been set up. The goal is to let every node u ∈ V receive f({p_i | u ∈ A_i}) for a given distributive aggregate function f.

Theorem 2.6.
There is a
Multi-Aggregation Algorithm that solves any Multi-Aggregation Problem in time O(C + log n), w.h.p.

The Multi-Aggregation Algorithm combines all of the previous algorithms to allow a node to first multicast a message to a set of nodes associated with it, and then aggregate all messages destined for it. More precisely, each source s_i first multicasts its packet p_i to all leaves in its multicast tree. Every node l(i, u) then maps p_i to a packet (id(u), p_i) for all i and u ∈ A_i. The resulting packets are randomly distributed among the nodes of the topmost level of the butterfly. Finally, all packets associated with identifier id(u) for some u are aggregated towards an intermediate target h(id(u)) on level d using the aggregate function f as in the Aggregation Algorithm. From there, the result is finally delivered to u. For applications beyond our paper, the algorithm may also be extended to allow nodes to be sources of multiple multicast groups, and to receive aggregates corresponding to distinct aggregations.

As a first example of graph algorithms for the Node-Capacitated Clique, we describe an algorithm that computes a minimum spanning tree (MST) in time O(log n). More specifically, for every edge in the input graph G, one of its endpoints eventually knows whether the edge is in the MST or not. We assume that each edge of G has an integral weight in {1, 2, . . . , W} for some positive integer W = poly(n).

High-Level Description.
From a high level, our algorithm mimics Boruvka's algorithm with Heads/Tails clustering, which works as follows. Start with every node as its own component. For O(log n) iterations, every component C (1) finds its lightest, i.e., minimum-weight, edge out of the component that connects it to the other components, (2) flips a Heads/Tails coin, and (3) learns the coin flip of the component C′ on the other side of the lightest edge. If C flips Tails and C′ flips Heads, then the edge connecting C to C′ is added to the MST, and thus effectively component C merges with component C′ (and whatever other components are merging with C′ simultaneously). It is well known that, w.h.p., all nodes get merged into one component within O(log n) iterations and the added edges form an MST (see, e.g., [23, 24]).

Details of the Algorithm.
Over the course of the algorithm, each component C ⊆ V maintains a leader node l(C) ∈ C whose identifier is known to every node in the component. Furthermore, we maintain a multicast tree for each component C with source l(C) and corresponding multicast group C \ {l(C)}. We ensure that the set of multicast trees has congestion O(log n). In each round of Boruvka's algorithm with the partition of V into components {C_1, . . . , C_N}, every leader l(C_i) flips Heads/Tails and multicasts the result to all nodes in its component by using the Multicast Algorithm of Theorem 2.5. As the multicast trees have congestion O(log n), and ℓ̂ = O(log n), this takes time O(log n), w.h.p.

For each component C, the leader then learns the lightest edge to a neighbor in V \ C in time O(log n log W). This is a highly nontrivial task that we address later. Afterwards, the leader multicasts the lightest edge {u, v} ∈ (C × (V \ C)) ∩ E to every node in its component, which can again be done in time O(log n). For each component C that flips Tails, the node u ∈ C incident to the lightest outgoing edge {u, v} now has to learn whether v's component C′ has flipped Heads, and, if so, the identifier of l(C′). Therefore, u joins a multicast group A_{id(v)} with source v, i.e., declares itself a member of A_{id(v)}, and constructs multicast trees with the help of Theorem 2.4. As every node is a member of at most one multicast group, setting up the corresponding trees with congestion O(log n) takes time O(log n), w.h.p. By using the Multicast Algorithm, the endpoints of all lightest edges learn the result of the coin flip and the identifier of their adjacent component's leader in time O(log n).

If for the edge {u, v} the component C′ of v has flipped Heads, then u sends the identifier of the leader of C′ to its own leader, which in turn informs all nodes of C using a multicast. Note that thereby only u learns that {u, v} is an edge of the MST, but not v.
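A centralized stand-in for the coin-flip merging described above may help fix the ideas (the distributed multicasts are replaced by direct dictionary updates; the four-node instance is our own toy example, not one from the paper):

```python
import random

def boruvka_round(components, edges, weight, rng):
    """One Heads/Tails Boruvka iteration; `components` maps node -> leader.
    Returns the MST edges added in this round and merges components in place."""
    start = dict(components)                 # component labels at round start
    # (1) every component finds its lightest outgoing edge (MST-safe by the cut property)
    lightest = {}
    for e in edges:
        cu, cv = start[e[0]], start[e[1]]
        if cu == cv:
            continue
        for c in (cu, cv):
            if c not in lightest or weight[e] < weight[lightest[c]]:
                lightest[c] = e
    # (2) every component leader flips a coin
    heads = {c: rng.random() < 0.5 for c in set(start.values())}
    added = []
    # (3) a Tails component merges into the Heads component across its lightest edge
    for c, (u, v) in lightest.items():
        other = start[v] if start[u] == c else start[u]
        if not heads[c] and heads[other]:
            added.append((u, v))
            for x in components:
                if components[x] == c:
                    components[x] = other
    return added

rng = random.Random(1)
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
weight = {(0, 1): 1, (1, 2): 4, (2, 3): 2, (0, 3): 3}
components = {u: u for u in range(4)}
mst = []
while len(set(components.values())) > 1:
    mst += boruvka_round(components, edges, weight, rng)
print(sorted(weight[e] for e in mst))  # the cycle's heaviest edge is excluded
```

Since Heads components never change their label within a round, the merges are well defined, and each added edge is a lightest outgoing edge of some component and therefore belongs to the MST.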
Finally, the multicast trees of the resulting components are rebuilt by letting each node join a multicast group corresponding to its new leader. As the components are disjoint, the resulting trees with congestion O(log n) are built in time O(log n), w.h.p. Finding the Lightest Edge.
To find the lightest edge of a component, we "sketch" its incident edges. Our algorithm follows the procedure FindMin of [35], with the "broadcast-and-echo" subroutine inside each component replaced by multicasts and aggregations (i.e., executions of the Multicast and Aggregation Algorithm) from/to the leader to/from the entire component. As argued above, and due to Theorem 2.3, both steps can be performed in time O(log n), w.h.p. We highlight the main steps of FindMin, and refer the reader to [35] for the details and proof.

Initially, we bidirect each edge into two arcs in opposite directions, and define the identifier id(u, v) = id(u) ∘ id(v), where ∘ denotes the concatenation of two binary strings. We will apply binary search to the weights of edges so that we can find the lightest outgoing edge. (The algorithm FindMin of [35] actually uses a "Θ(log n)-ary" search instead of binary search, but we replace it with binary search here for simplicity of explanation.) Every iteration has a current range [L, R] ⊆ [1, W] such that the lightest outgoing edge has weight in that range. To compute the next range, the algorithm determines whether there is an outgoing edge with weight in [L, M], where M := ⌊(L + R)/2⌋. If so, the new range becomes [L, M]; otherwise, the new range is [M + 1, R]. The remaining task is to solve the following subproblem: given a range [a, b], determine whether there exists an outgoing edge with weight in [a, b].

To sketch their incident edges, the nodes use a (pseudo-)random hash function h that maps each edge identifier to {0, 1}. For a node u, define

h↑(u) := Σ_{v ∈ N(u): w(u,v) ∈ [a,b]} h(id(u, v)) mod 2,  and  h↓(u) := Σ_{v ∈ N(u): w(u,v) ∈ [a,b]} h(id(v, u)) mod 2,

and for a component C ⊆ V, define h↑(C) := Σ_{u ∈ C} h↑(u) mod 2 and h↓(C) similarly. Observe that the unordered sets {id(u, v) : u ∈ C, v ∈ N(u), w(u, v) ∈ [a, b]} and {id(v, u) : u ∈ C, v ∈ N(u), w(u, v) ∈ [a, b]} are the same if and only if component C does not have an outgoing edge with weight in the range [a, b]. Also, the hash function h satisfies the property that, if two sets S_1, S_2 of integers are not equal, then the values of Σ_{x ∈ S_1} h(x) mod 2 and Σ_{x ∈ S_2} h(x) mod 2 are not equal with constant probability. To compute the values of h↑(C) and h↓(C), each node u ∈ C computes h↑(u) and h↓(u), and an aggregation towards the leader node is performed in each component C with addition mod 2 as the aggregate function. We can repeat this procedure O(log n) times so that, w.h.p., there is no outgoing edge of C with weight in [a, b] if and only if h↑(C) and h↓(C) are equal in every trial. Note that this requires the nodes to know O(log n) different hash functions; by the discussion in Section 2.2, the necessary O(log² n) bits can be retrieved beforehand in O(log n) rounds.

The running time analysis from [35], modified to count the number of "broadcast-and-echo" subroutines, can be rewritten as follows. Lemma 3.1 ([35], Lemma 2).
The leader node of each component learns the lightest edge out of its component within O(log W log n) iterations of multicasts and aggregations, w.h.p.

Since each iteration can be performed in time O(log n), and there are O(log n) phases of Boruvka's algorithm, w.h.p., we conclude the following theorem. Theorem 3.2.
The algorithm computes an MST in time O(log³ n · log W), w.h.p.

O(a)-ORIENTATION

One of the reasons the MST problem can be solved very efficiently is that we only require one endpoint of each edge to learn whether the edge is in the MST or not; otherwise, the problem seems to become significantly harder, as every node would have to learn some information about each incident edge. We observe this difficulty for the other graph problems considered in this paper as well. To approach this issue, we aim to set up multicast trees connecting each node with all of its neighbors in G, allowing us to essentially simulate variants of classical algorithms. As we will see, such trees can be set up efficiently if G has small arboricity by first computing an O(a)-orientation of G, which is described in this section.

We present the Orientation Algorithm, which computes an O(a)-orientation in time O((a + log n) log n), w.h.p. More specifically, the goal is to let every node learn a direction of all of its incident edges in G. The algorithm essentially constructs a Nash-Williams forest decomposition [50] using the approach of [4]. From a high-level perspective, the algorithm repeatedly identifies low-degree nodes and removes them from the graph until the graph is empty. Whenever a node leaves, all of its incident edges are directed away from it. More precisely, the algorithm proceeds in phases 1, …, T. Let d_i(u) be the number of incident edges of a node u that have not yet been assigned a direction at the beginning of phase i. Define d̄_i to be the average degree of all nodes u with d_i(u) >
0, i.e., d̄_i = Σ_{u ∈ V} d_i(u) / |{u ∈ V | d_i(u) > 0}|. In phase i, a node u is called inactive if d_i(u) = 0, active if d_i(u) ≤ 2d̄_i, and waiting if d_i(u) > 2d̄_i. In each phase, an edge {u, v} gets directed from u to v if u is active and v is waiting, or if both nodes are active and id(u) < id(v). Thereby, each node is waiting until it becomes active in some phase, and remains inactive for all subsequent phases. This results in a partition of the nodes into levels L_1, …, L_T, where level i is the set L_i of active nodes in phase i. The lemma below follows from the fact that in every phase, at least half of all nodes that are not yet inactive become inactive, which can easily be shown, and that d̄_i ≤ 2a, since any subgraph of G can be partitioned into a forests, whose average degree is at most 2. Lemma 4.1.
The Orientation Algorithm takes O(log n) phases to compute an O(a)-orientation.

It remains to show how a single phase can be performed efficiently in our model. Here, the main difficulty lies in having active nodes determine which of their neighbors are already inactive in order to conclude the orientations of incident edges. We approach this problem by solving the following Identification Problem: We are given a set L ⊆ V of learning nodes and a set P ⊆ V of playing nodes. Every playing node knows a subset of its neighbors that are potentially learning, i.e., it knows that none of its other neighbors are learning. The goal is to let every learning node determine which of its neighbors are playing.

To solve this problem, we present the Identification Algorithm, which will later be used as a subroutine. In this subsection, we represent each edge {u, v} by two directed edges (u, v) and (v, u). We assume that all nodes know s (pseudo-)random hash functions h_1, …, h_s : E → [q] for some parameters s and q. The hash functions are used to map every directed edge to s trials. We say an edge e participates in trial i if h_j(e) = i for some j.

Let u ∈ L. We refer to an edge (u, v) as a red edge of u if v is not playing, and as a blue edge of u if v is playing. We identify each edge (u, v) by the identifiers of its endpoints, i.e., id(u, v) = id(u) ∘ id(v). Let X(i) be the XOR of the identifiers of all edges (u, v) that participate in trial i, and X′(i) be the XOR of the identifiers of all blue edges (u, v) that participate in trial i. Furthermore, let x(i) be the total number of edges adjacent to u that participate in trial i, and let x′(i) be the number of blue edges that participate in trial i.

Our idea is to let u use these values to identify all of its red edges; then it can conclude which of its neighbors must be playing. Before describing this, we explain how the values are determined. Clearly, the values X(i) and x(i) can be computed by u by itself for all i. The other values are more difficult to obtain, as u does not know which of its edges are blue. To compute these values, we use the Aggregation Algorithm: Each playing node v is in aggregation group A_id(w)∘i for every potentially learning neighbor w and every trial i such that (w, v) participates in trial i.
The input of v for the group A_id(w)∘i is (id(w, v), 1), where the first coordinate is used to let w compute X′(i), and the second coordinate is used to compute x′(i). Correspondingly, the aggregate function f combines two inputs corresponding to the same aggregation group by taking the XOR of the first coordinates and the sum of the second coordinates. Thereby, u eventually receives both X′(i) and x′(i).

We now show how u can identify its red edges using the aggregated information. First, it determines a trial i for which x(i) = x′(i) + 1. Since neighbors that are not playing did not participate in the aggregation, in this case there is exactly one red edge (u, v) such that id(u, v) is included in X(i) but not in X′(i). Therefore, id(u, v) can be retrieved by taking the XOR of both values. Having identified id(u, v), u determines all trials in which (u, v) participates using the common hash functions and "removes" id(u, v) from X(i) by again computing the XOR of both. It then decreases x(i) by 1 and repeats the above procedure until no further edge can be identified. If u always finds a trial i for which x(i) = x′(i) + 1, then it eventually has identified all red edges. Clearly, all the remaining neighbors must be playing.
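The learner's side of this peeling procedure can be sketched sequentially as follows. This simulates a single learning node; the trial assignment via seeded PRNGs and the parameters s and q are illustrative choices of ours, not the paper's construction:

```python
import random

def trials_of(eid, s, q):
    """Trials in which edge eid participates under s (pseudo-)random hash functions."""
    return {random.Random(f"{j}:{eid}").randrange(q) for j in range(s)}

def identify_red_edges(all_eids, blue_eids, s=6, q=256):
    """Learner-side sketch of the Identification Algorithm.
    all_eids: ids of all incident edges (known locally to the learner).
    blue_eids: ids of edges to playing nodes; the learner never sees this
    set itself, only the per-trial XOR/count aggregates computed from it."""
    X = [0] * q      # X(i): XOR of ids of all edges participating in trial i
    x = [0] * q      # x(i): number of all edges participating in trial i
    for e in all_eids:
        for t in trials_of(e, s, q):
            X[t] ^= e
            x[t] += 1
    X1 = [0] * q     # X'(i) and x'(i): the same values restricted to blue
    x1 = [0] * q     # edges, as delivered by the Aggregation Algorithm
    for e in blue_eids:
        for t in trials_of(e, s, q):
            X1[t] ^= e
            x1[t] += 1
    red = set()
    progress = True
    while progress:
        progress = False
        for t in range(q):
            if x[t] == x1[t] + 1:                # exactly one red edge left in t
                e = X[t] ^ X1[t]                 # its id pops out of the XOR
                red.add(e)
                for t2 in trials_of(e, s, q):    # "remove" it from all its trials
                    X[t2] ^= e
                    x[t2] -= 1
                progress = True
    return red
```

Note that x(i) − x′(i) always equals the number of not-yet-removed red edges in trial i, so a trial with difference exactly 1 pins down a single red identifier; blue edges can never be misidentified.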
Lemma 4.2.
Let u ∈ L and assume that u is incident to at most p red edges. Let s be the number of hash functions, and q be the number of trials. Then

Pr[u fails to identify at least k red edges] ≤ 2 (sk/q)^((s−2)k/2)

for q ≥ 2esp and s ≥ 6.

Proof. u fails to identify at least k red edges only if at some iteration of the above procedure there are j ≥ k edges left such that these edges participate only in trials in which at least two of the j edges participate. In this case, the j edges participate in at most ⌊sj/2⌋ different trials, since otherwise there would be a trial in which only one of them participates. Therefore, the probability of that event is

Pr ≤ Σ_{j=k}^{p} (p choose j) (q choose sj/2) (sj/(2q))^{sj}
   ≤ Σ_{j=k}^{p} (ep/j)^j (2eq/(sj))^{sj/2} (sj/(2q))^{sj}
   = Σ_{j=k}^{p} [ e²ps/(2q) · (esj/(2q))^{(s−2)/2} ]^j
   ≤ Σ_{j=k}^{p} (sj/q)^{(s−2)j/2}
   ≤ 2 (sk/q)^((s−2)k/2),

where the last inequality holds because consecutive summands decrease geometrically:

(s(j+1)/q)^{(s−2)(j+1)/2} = (s(j+1)/q)^{(s−2)/2} · ((j+1)/j)^{(s−2)j/2} · (sj/q)^{(s−2)j/2}
                          ≤ (es(j+1)/q)^{(s−2)/2} · (sj/q)^{(s−2)j/2}
                          ≤ 1/2 · (sj/q)^{(s−2)j/2},

using ((j+1)/j)^j ≤ e, q ≥ 2esp, and s ≥ 6. □

Details of the Algorithm.

Finally, we show how the Identification Algorithm can be used to efficiently realize a phase of the high-level algorithm in time O(a + log n), w.h.p. In our algorithm, every node learns the direction of all its incident edges in the phase in which it is active; however, its neighbors might learn their direction only in subsequent phases. Each phase is divided into three stages: In Stage 1, every node determines whether it is active in this phase.
In Stage 2, every active node learns which of its neighbors are inactive. Finally, in Stage 3 every active node learns which of its remaining neighbors, which must be either active or waiting, are active. From this information, and since every node knows the identifiers of all of its neighbors, every active node concludes the direction of each of its incident edges. In the following we describe the three stages of a phase i in detail.

Stage 1: Determine Active Nodes. We assume that all nodes start the stage in the same round. First, every node u that is not inactive needs to compute d_i(u) (i.e., d(u) minus the number of inactive neighbors) to determine whether it remains waiting or becomes active in this phase. This value can easily be computed using the Aggregation Algorithm: Every inactive node v, which already knows the orientation of each of its incident edges, is a member of every aggregation group A_id(w) such that v → w. As the input value of each node we choose 1, the aggregate function f is the sum, and ℓ ≤ 4a =: ℓ̂ (the out-degree of an inactive node is at most 2d̄_j ≤ 4a for the phase j in which it was active). By performing the Aggregation Algorithm, u determines the number of its inactive neighbors and, by subtracting this value from d(u), computes d_i(u). Afterwards, the nodes use the Aggregate-and-Broadcast Algorithm to compute d̄_i and to achieve synchronization. Stage 2: Identify Inactive Neighbors.
The goal of this stage is to let every active node learn which of its neighbors are inactive. The stage is divided into two steps: In the first step, a large fraction of the active nodes succeeds in the identification of inactive neighbors. The purpose of the second step is to take care of the nodes that were unsuccessful in the first step, i.e., that only identified some, but not all, of their incident red edges. In both steps we use the Identification Algorithm described in the previous section, and carefully choose the parameters to achieve that each step only takes time O(a + log n).

At the beginning of the first step, the nodes compute d*_i = max_{u ∈ L_i} d_i(u) by performing the Aggregate-and-Broadcast Algorithm. Let d* = max_{j ≤ i} d*_j, which is a value known to all nodes, and note that d* = O(a). Then, the nodes perform the Identification Algorithm, where the active nodes are learning and the inactive nodes are playing. Hence, the endpoints of the red edges learned by the active nodes must either be active or waiting. If we chose s = c log n and q = ecd* log n for some constant c, every active node would identify all of its red edges, w.h.p.; however, the resulting runtime would be O(a log n). To reduce this to O(a + log n), we instead choose s = c and q = ecd* log n for some constant c > 6, and accept that nodes fail to identify some of their red edges in this step. However, for this choice Lemma 4.2 implies that each node fails to identify at most log n red edges, w.h.p.

We now describe how these remaining edges are identified in the second step. Let U = {u ∈ V | u is unsuccessful}. We divide U into the set of high-degree nodes U_high = {u ∈ U | d(u) − d_i(u) > n/log n} and the set of low-degree nodes U_low = {u ∈ U | d(u) − d_i(u) ≤ n/log n} and consider the nodes of each set separately. By dealing with high-degree nodes separately, we ensure that the global load required to let low-degree nodes identify their red edges reduces by a log n factor. First, the nodes of U_high (of which there are only O(a + log n), w.h.p.) broadcast their identifiers by using a variant of the Aggregate-and-Broadcast Algorithm: Using the path system of the butterfly, every node u ∈ U_high sends its identifier to the node v with identifier 0; however, messages are not combined. Instead, whenever multiple identifiers contend to use the same edge in the same round, the smallest identifier is sent first. After v has received all identifiers, it broadcasts them in a pipelined fashion, i.e., one after the other, to all other nodes. For every node u ∈ A := {u ∈ V | u is active or waiting} define R_u = U_high ∩ N(u), i.e., (v, u) is a red edge of v for all v ∈ R_u. Let u ∈ A. For each v ∈ R_u, u chooses a round from {1, …, max{|R_u|, d*_i}} uniformly and independently at random and sends its own identifier to v in that round. Afterwards, every high-degree node can identify all of its red edges. As max_{u ∈ A} max{|R_u|, d*_i} = O(a + log n), this takes time O(a + log n), w.h.p.

To let the low-degree nodes identify their red edges, we again use the Identification Algorithm. First, in order to narrow down its set of potentially learning neighbors, every inactive node determines which of its neighbors are unsuccessful low-degree nodes. To this end, we let every inactive node u join multicast group A_id(v) for all u → v such that v is not inactive (recall that every inactive node knows the directions of all of its incident edges, and whether the other endpoint of each edge is inactive or not). Every node v ∈ U_low then informs its inactive neighbors by using the Multicast Algorithm. Since every node is a member of at most d* multicast groups, which is a value known to all nodes, the nodes know an upper bound on ℓ as required by the algorithm. Having narrowed down the set of learning nodes and the sets of potentially learning neighbors to the unsuccessful ones only, the Identification Algorithm is performed once again. As the parameters of the algorithm we choose s = c log n and q = ec log² n for some constant c > 6. Stage 3: Identify Active Neighbors.
Finally, every active node has to learn which of the endpoints of its red edges are active. In the following, let id(e) = id(u) ∘ id(v) be the identifier of an edge given by its endpoints u and v such that id(u) < id(v). The nodes use two (pseudo-)random hash functions h and r, where h maps the identifier of an edge e to a node h(id(e)) ∈ V uniformly and independently at random, and r maps its identifier to a round r(id(e)) ∈ {1, …, d*_i} uniformly and independently at random. Every active node u sends an edge-message containing id(e) to h(id(e)) in round r(id(e)) for every incident edge e leading to an active or waiting node. Using this strategy, two adjacent active nodes u, v send an edge-message containing id({u, v}) to the same node in the same round. Whenever a node receives two edge-messages with the same edge identifier, it immediately responds to the corresponding nodes, which thereby learn that both endpoints are active.

We now turn to the analysis of the algorithm. We mainly show the following lemma:
Lemma 4.3.
In phase i of the algorithm, every node v ∈ L_i learns the directions of its incident edges. Each phase takes time O(a + log n), w.h.p. In every round, each node sends and receives at most O(log n) messages, w.h.p.

We present the proof in three parts: first, we show the correctness of the algorithm, then analyze its runtime, and finally show that every node receives at most O(log n) messages in each round. Lemma 4.4.
In the first step, every active node fails to identify at most log n red edges, w.h.p. Proof.
Note that every active node can be adjacent to at most p ≤ d* active or waiting nodes, i.e., it is incident to at most p red edges. Therefore, by Lemma 4.2, the probability that an active node u fails to identify at least log n red edges is at most

2 (c log n / (ecd* log n))^((c−2) log n / 2) = 2 (ed*)^(−(c−2) log n / 2) ≤ n^(−c/2+2).

Taking the union bound over all nodes implies the lemma. □
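For intuition, a failure bound of the form 2(sk/q)^((s−2)k/2) (our reading of the Lemma 4.2 expression) decays extremely fast in k. A quick numeric evaluation with illustrative parameters of our own choosing (c = 8, d* = 32, n = 2^20; these are not values from the paper):

```python
import math

def identification_failure_bound(s, q, k):
    """Evaluate 2 * (s*k/q) ** ((s-2)*k/2), our reading of the Lemma 4.2
    bound on the probability that at least k red edges stay unidentified."""
    return 2 * (s * k / q) ** ((s - 2) * k / 2)

# Illustrative first-step parameters (assumptions, not from the paper):
# s = c = 8 hash functions, p = d* = 32 red edges, n = 2**20, k = log n.
c, d_star, log_n = 8, 32, 20
q = math.e * c * d_star * log_n          # q = e * c * d* * log n trials
bound = identification_failure_bound(c, q, log_n)
```

Already for these modest parameters the bound is far below any inverse-polynomial threshold, which is why a constant number of hash functions suffices in the first step.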
Lemma 4.5.
After the second step, every active node has identified all of its red edges, w.h.p.

Proof. If u ∈ U_high, then after having received the identifiers of all neighbors that are active or waiting, u immediately knows its red edges. Now let u ∈ U_low. Since by Lemma 4.4 u has at most p ≤ log n remaining red edges, by Lemma 4.2 the probability that u fails to identify at least one of its remaining red edges is at most

2 (c log n / (ec log² n))^((c log n − 2)/2) = 2 (e log n)^(−(c log n − 2)/2) ≤ n^(−c/2+2).

Taking the union bound over all nodes implies the lemma. □
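For concreteness, the high-level peeling process from the beginning of this section, which underlies the stages analyzed next, can be simulated sequentially. This is a sketch under our reading of the thresholds (activity cutoff 2·d̄_i), not the distributed implementation:

```python
def orient_edges(nodes, edges):
    """Sequential sketch of the high-level Orientation Algorithm: in each
    phase, nodes of remaining degree at most twice the current average are
    active and direct all of their remaining incident edges
    (active -> waiting, and active -> active by smaller id)."""
    orientation = {}                      # frozenset({u, v}) -> (tail, head)
    remaining = {frozenset((u, v)): (u, v) for u, v in edges}
    while remaining:
        deg = {u: 0 for u in nodes}
        for u, v in remaining.values():
            deg[u] += 1
            deg[v] += 1
        alive = [u for u in nodes if deg[u] > 0]
        dbar = sum(deg[u] for u in alive) / len(alive)
        active = {u for u in alive if deg[u] <= 2 * dbar}
        for key, (u, v) in list(remaining.items()):
            if u in active and v in active:
                tail, head = (u, v) if u < v else (v, u)
            elif u in active:
                tail, head = u, v
            elif v in active:
                tail, head = v, u
            else:
                continue                  # both waiting: a later phase decides
            orientation[key] = (tail, head)
            del remaining[key]
    return orientation
```

Every phase directs all edges incident to active nodes, and by Markov's inequality at least half of the remaining nodes are active, so the loop ends after O(log n) phases; the out-degree of each node is bounded by its degree in the phase in which it becomes active.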
To bound the runtime of the complete algorithm, we now prove that each stage takes time O ( a + log n ) , w.h.p. Lemma 4.6.
Stage 1 takes time O ( a + log n ) , w.h.p. Proof.
In the execution of the Aggregation Algorithm, every inactive node is a member of at most O(a) aggregation groups and every active node is the target of at most one aggregation, i.e., L = O(na) and ℓ + ℓ̂ = O(a). The lemma follows from Theorem 2.3. □

For the runtime of Stage 2 we need the following two lemmas.
Lemma 4.7. |U_high| = O(a + log n), w.h.p. Proof.
Let A = {u ∈ L_i | d(u) − d_i(u) > n/log n}. Note that since d̄ ≤ 2a, we have Σ_{u ∈ V} d(u) ≤ 2an, and therefore |A| ≤ 2a log n. For u ∈ A let X_u be the binary random variable that is 1 if u is unsuccessful in the first step, and 0 otherwise. By Lemma 4.2 and since c > 6, we have

Pr[X_u = 1] ≤ 2 (ed* log n)^(−(c−2)/2) ≤ 1/log n.

Let X = Σ_{u ∈ A} X_u. X is the sum of independent binary random variables with expected value E[X] ≤ 2a log n / log n = 2a =: µ. Let δ = max{α log n/µ, 1} for some constant α > 3; then by using the Chernoff bound of Lemma 2.1 we get that

Pr[X ≥ (1 + δ)µ] ≤ e^(−α log n/3) ≤ n^(−α/3),

and thus X = O(a + log n), w.h.p. □

Lemma 4.8. Σ_{u ∈ U_low} (d(u) − d_i(u)) = O(an/log n + n), w.h.p.

Proof.
Let B = {u ∈ L_i | d(u) − d_i(u) ≤ n/log n}. For a node u ∈ B, let X_u be the random variable that is d(u) − d_i(u) if u is unsuccessful in the first step, and 0 otherwise. From the proof of Lemma 4.7, we have that Pr[X_u = d(u) − d_i(u)] ≤ 1/log n. Then X = Σ_{u ∈ B} X_u is a sum of independent random variables with expected value E[X] ≤ Σ_{u ∈ B} d(u)/log n ≤ 2an/log n =: µ. Note that X_u ≤ n/log n for all u ∈ B. Therefore, we can use the Chernoff bound of Lemma 2.1 with δ = max{αn/µ, 1} for some constant α > 3, and get

Pr[X ≥ (1 + δ)µ] ≤ e^(−αn log n/(3n)) ≤ n^(−α/3).

Therefore, we have that X = O(an/log n + n), w.h.p. □

We are now ready to bound the runtime of Stage 2.
Lemma 4.9.
Stage 2 takes time O(a + log n), w.h.p.

Proof. The computation of d* at the beginning of the first step takes time O(log n). To perform the first execution of the Identification Algorithm, every node has to learn s = O(1) hash functions, which can be done in time O(log n) (see Section 2.2). In the first execution of the Identification Algorithm, every active node u is the target of aggregation group A_id(u)∘i for every trial i, and every inactive neighbor v of u is a member of all aggregation groups A_id(u)∘i such that (u, v) participates in trial i. Therefore, every active node is the target of at most ecd* log n and every inactive node is a member of at most cd* aggregation groups. Since both values are known to every node, the nodes know an upper bound ℓ̂ = ecd* log n on ℓ. Since every inactive node is a member of at most cd* aggregation groups, the global load L is bounded by ncd*. By Theorem 2.3, the Aggregation Algorithm takes time

O(ncd*/n + ecd* log n / log n + log n) = O(a + log n),

w.h.p., to solve the problem.

Now consider the second step. By Lemma 4.7, |U_high| = O(a + log n), w.h.p. A simple delay sequence argument can be used to show that all identifiers are broadcast within time O(a + log n). Informing each node in U_high about its red edges takes an additional O(a + log n) rounds, as |R_u| = O(a + log n) for every node u and d*_i = O(a). The multicast trees to handle low-degree nodes are constructed in time O(a + log n), as every inactive node joins at most d* multicast groups, and the resulting trees have congestion O(nd*/n + log n) = O(a + log n), w.h.p. Correspondingly, the multicast can be performed in time O(a + log n), w.h.p. We now bound the runtime of the final execution of the Identification Algorithm.
First, note that the s = Θ(log n) hash functions can be learned by broadcasting the O(log n) bits required for each hash function (see Section 2.2) in a pipelined fashion in a binary tree, which is implicitly given in the network. Clearly, this takes time O(log n) and requires each node to send and receive only O(log n) messages in each round. Every inactive node is a member of at most O(a log n) aggregation groups, and every node is the target of at most ec log² n aggregation groups. By Lemma 4.8, Σ_{u ∈ U_low} (d(u) − d_i(u)) = O(an/log n + n), w.h.p. As this is also a bound on the number of edges that participate in any trial, and each edge participates in c log n trials, the global load L is bounded by O(an + n log n). Therefore, by Theorem 2.3, the Aggregation Algorithm takes time O(a + log n), w.h.p. □

The lemma below follows from the fact that d*_i = O(a). Lemma 4.10.
Stage 3 takes time O ( a + log n ) . Finally, it remains to show that no node receives too many messages.
Lemma 4.11.
In each round of the algorithm, every node sends and receives at most O ( log n ) messages, w.h.p. Proof.
By the discussion of Section 2.2, the executions of the Aggregation, Multicast Tree Setup, and Multicast Algorithm ensure that every node receives only O(log n) messages in each round. It remains to show the claim for the second step of Stage 2, where high-degree nodes broadcast their identifiers and receive their red edges, and for Stage 3, where active nodes learn which of their red edges lead to other active nodes.

For the first part, note that after all high-degree nodes have broadcast their identifiers, every active or waiting node sends out O(log n) messages containing its identifier in every round, w.h.p., which can easily be shown using Chernoff bounds. Second, as every high-degree node receives at most d*_i identifiers, it also follows from the Chernoff bound that every such node receives at most O(log n) messages in each round, w.h.p.

Now consider Stage 3 of the algorithm. Again, by using the Chernoff bound, it can easily be shown that no node sends out more than O(log n) edge-messages in any round. Therefore, every node only receives O(log n) response messages in every round. It remains to show that every node receives at most O(log n) edge-messages in every round, from which it follows that it only sends out O(log n) response messages in every round. Let A = {{u, v} | u or v is active} and note that |A| ≤ nd*_i. Fix a node u ∈ V and a round i ∈ {1, …, d*_i}, and for e ∈ A let X_e be the binary random variable that is 1 if and only if h(id(e)) = u and r(id(e)) = i. Then Pr[X_e = 1] = 1/(nd*_i), and X = Σ_{e ∈ A} X_e has expected value E[X] ≤ 1. Using the Chernoff bound we get that X = O(log n), w.h.p., which implies that u receives at most O(log n) edge-messages in round i. The claim follows by taking the union bound over all nodes and rounds. □

Taking Lemma 4.3 together with Lemma 4.1 yields the final theorem of this section.
Theorem 4.12.
The Orientation Algorithm computes an O(a)-orientation in time O((a + log n) log n), w.h.p.

We conclude our initiating study of the Node-Capacitated Clique by presenting a set of graph problems that can be solved efficiently in graphs with bounded arboricity. The presented algorithms rely on a structure of precomputed multicast trees. More specifically, for every node u ∈ V we construct a multicast tree T_id(u) for the multicast group A_id(u) = N(u). Since such trees enable the nodes to send messages to all of their neighbors, in the following we refer to them as broadcast trees.

In a naive approach to construct these trees, one could simply use the Multicast Tree Setup Algorithm, where each node joins the multicast group of every neighbor. However, as ℓ = Δ, the time to construct these trees would be O(d̄ + Δ/log n + log n), which can be O(n/log n) if G is a star, for example. Instead, we first construct an O(a)-orientation of the edges as shown in the previous section, and let u only join multicast groups A_id(v) (which translates to injecting one packet per group into the butterfly) for every out-neighbor v. Additionally, for every out-neighbor v it takes care of v joining u's multicast group by injecting a packet for v. In the case of a star, for example (whose arboricity is one), every node, including the center, injects at most two packets. In general, we obtain the following result. Lemma 5.1.
Setting up broadcast trees takes time O ( a + log n ) , w.h.p. The congestion of the broadcast trees is O ( a + log n ) ,w.h.p. The corollary below, which follows from the analysis of Theorem 2.6, establishes one of the key techniques used bythe algorithms in this section.
Corollary 1.
Let S ⊆ V. Using the broadcast trees, the Multi-Aggregation Algorithm solves any Multi-Aggregation Problem with multicast groups A_id(u) = N(u) and s_id(u) = u for all u ∈ S in time O(Σ_{u ∈ S} d(u)/n + log n), w.h.p.

As a simple example, we show how to compute
Breadth-First Search (BFS) Trees: Let s be a node and let δ(u) be the length of a shortest (unweighted) path from s to u in G. Furthermore, let π(u) be the predecessor of u on a shortest path from s to u (breaking ties by choosing the one with smallest identifier). The goal is to let each node u ∈ V eventually store δ(u) and π(u). Using the broadcast trees, the problem can easily be solved by the following algorithm, which proceeds in phases. In Phase 1, only s is active, and in Phase i > 1, all nodes that have received an identifier in Phase i − 1 for the first time are active. In each phase, every active node broadcasts its identifier to all of its neighbors using the Multi-Aggregation Algorithm; choosing f as the minimum function, every node that has an active neighbor thereby receives the minimum identifier of all active neighbors. Furthermore, in every Phase i > 1 every active node u sets δ(u) = i − 1 and π(u) to the node whose identifier it has received in the previous phase. Clearly, after at most D + 1 phases every node has been reached and the algorithm terminates. Theorem 5.2.
The algorithm computes a BFS tree in time O((a + D + log n) log n), w.h.p.

Proof. By Lemma 5.1, the broadcast trees are constructed in time O((a + log n) log n), w.h.p. Let S_i be the set of nodes active in Phase i. By Corollary 1, the Multi-Aggregation Algorithm takes time O(Σ_{u ∈ S_i} d(u)/n + log n), w.h.p. We conclude a runtime of

O( (a + log n) log n + Σ_{i=1}^{D+1} ( Σ_{u ∈ S_i} d(u)/n + log n ) )
= O( (a + log n) log n + Σ_{u ∈ V} d(u)/n + (D + 1) log n )
= O( (a + D + log n) log n ),

w.h.p. □

In this section we show how to compute a maximal independent set (MIS): A set U ⊆ V is an MIS if (1) it is an independent set, i.e., no two nodes of U are adjacent in G, and (2) it is maximal, i.e., there is no independent set U′ ⊆ V such that U ⊂ U′. On a high level, we perform the algorithm of Métivier et al. [48], which works as follows. First, all nodes are active and no node is in the MIS. The algorithm proceeds in phases, where in each phase every active node u first chooses a random number r(u) ∈ [0, 1] and broadcasts the value to all of its neighbors. u then joins the MIS (and becomes inactive) if r(u) is smaller than the minimum of all received values. If so, it broadcasts a message to all of its neighbors, instructing them to become inactive.

We can easily perform a phase of the algorithm in our model by using two executions of the Multi-Aggregation Algorithm, the first to let every node aggregate the minimum of all values chosen by its neighbors, and the second to let every node that is not in the MIS determine whether it is adjacent to a node that is in the MIS. This information is then used to determine whether the nodes have reached an MIS using the Aggregate-and-Broadcast Algorithm. Since by [48] O(log n) phases suffice, and each phase can be performed in time O(d̄ + log n) = O(a + log n) by Corollary 1, we conclude the following theorem. Theorem 5.3.
The algorithm computes an MIS in time O (( a + log n ) log n ) , w.h.p. Similar to an MIS, a maximal matching M ⊆ E is defined as a maximal set of independent (i.e., node-disjoint) edges. Tocompute a maximal matching, we propose to use the algorithm of Israeli and Itai [31], which works as follows. Initially,no node is matched. The algorithm proceeds in phases, where in each phase every unmatched node u performs thefollowing procedure. First, it chooses an edge to an unmatched neighbor uniformly at random. If u itself has beenchosen by multiple neighbors, it accepts only one choice arbitrarily and informs the respective node. The outcomeis a collection of paths and cycles. Each node of a path or cycle finally chooses one of its at most two neighbors. Ifthereby two adjacent nodes choose the same edge, the edge joins the matching and the two nodes become matched.Afterwards, all matched nodes and their incident edges are removed from the graph.The algorithm lends itself to a realization using communication primitives. First, we let every unmatched node ran-domly pick one of its unmatched neighbors by performing the Multi-Aggregation Algorithm with a slight modification. ere, every node u that is still unmatched multicasts a packet p id ( u ) using its broadcast tree. Recall that after p id ( u ) hasreached butterfly node l ( id ( u ) , v ) for all v ∈ N ( u ) in the execution of the Multi-Aggregation Algorithm, it is mappedto a new packet ( id ( v ) , p id ( u ) ) . Here, we additionally let l ( id ( u ) , v ) choose a value r ∈ [ , ] uniformly at random, andannotate ( id ( v ) , p id ( u ) ) by r . Whenever thereafter two packets with the same target are combined, the packet annotatedby the minimum value remains. 
Thereby, every node that still has an unmatched neighbor receives the identifier of a node chosen uniformly and independently at random among its unmatched neighbors.

Afterwards, every node that has been chosen by multiple neighbors has to choose one of them arbitrarily. This can be done by performing the Aggregation Algorithm, in which we let every node u aggregate the minimum of the identifiers of all nodes by which it has been chosen in the previous step. In the resulting collection of paths and cycles, neighbors can directly send messages to each other to determine which edges join the matching. Finally, the nodes have to determine whether the matching is maximal, which can be done as described in the previous section. Using Corollary 3.5 of [31] and Chernoff bounds, it can be shown that O(log n) phases suffice. We conclude the following theorem.

Theorem 5.4. The algorithm computes a maximal matching in time O((a + log n) log n), w.h.p.

O(a)-Coloring

The goal of this section is to compute an O(a)-coloring, in which every node has to choose one of O(a) colors such that no color is chosen by two adjacent nodes. Following the idea of Barenboim and Elkin [4], we consider the partition of the nodes into levels L_1, . . . , L_T and color the nodes of each level separately. Recall that after the algorithm to compute the O(a)-orientation, every node knows the index of its own level. Furthermore, for all i every node u ∈ L_i knows which of its neighbors are in lower levels L_1, . . . , L_{i−1}, the same level L_i, and higher levels L_{i+1}, . . . , L_T, since it knows which of its neighbors were inactive, active, or waiting in phase i. First, the nodes use the Aggregate-and-Broadcast Algorithm to compute â = max_{u∈V} {max(d_L(u), d_out(u))} = O(a), where d_L(u) is the number of neighbors of u that are in the same level as u. Furthermore, the nodes set up multicast trees for multicast groups A_{id(u)} = N_in(u) with source s_{id(u)} = u for all u ∈ V. More precisely, every node joins the multicast group of each of its out-neighbors, which can be done in time O(a + log n), w.h.p., by Theorem 2.4.

Afterwards, the algorithm proceeds in phases 1, . . . , T, where in each phase i the nodes of level L_{T−i+1} get colored. Throughout the algorithm's execution, every node u maintains a color palette C(u) initially set to [(2 + ε)â] for some constant ε > 0. After each phase, the color palette of every remaining uncolored node has been narrowed down to all colors that have not yet been chosen by its neighbors. Since every u ∈ L_{T−i+1} has at most â neighbors in higher levels, C(u) still consists of at least (1 + ε)â colors at the beginning of phase i.

In phase i of the algorithm, the nodes of level L_{T−i+1} essentially perform the Color-Random Algorithm of Kothapalli et al. [42]. First, every node u ∈ L_{T−i+1} chooses a color c_u from its color palette uniformly at random. Then, it informs its in-neighbors about its choice by performing the Multicast Algorithm using the precomputed multicast trees and â as an upper bound on ℓ. Thereby, u receives the colors chosen by its out-neighbors of the same level. If u does not receive its own color c_u, it permanently chooses c_u. In that case, it first informs all of its in-neighbors about its permanent choice by again performing the Multicast Algorithm. Afterwards, it informs all of its out-neighbors by performing the Aggregation Algorithm. Here, u is a member of the aggregation groups A_{id(v)◦c_u} for all v ∈ N_out(u) and target of the aggregation groups A_{id(u)◦i} for all i ∈ [(2 + ε)â]. Note that every node is a member of at most â and a target of at most (2 + ε)â aggregation groups. Afterwards, all nodes (including nodes of lower levels) remove all colors permanently chosen by neighbors from their palettes.

The above procedure is repeated until all nodes of level L_{T−i+1} have permanently chosen a color, which is determined by performing the Aggregate-and-Broadcast Algorithm after each repetition. Then, if i < T, the nodes start the next phase, and terminate otherwise. The following theorem can be shown using the following facts: (1) there are O(log n) phases, (2) O(√(log n)) repetitions during a phase suffice until all nodes of the corresponding level are colored (see the discussion in Section 4 of [42]), and (3) each repetition takes time O(a + log n).

Theorem 5.5.
The algorithm computes an O(a)-coloring in time O((a + log n) log^{3/2} n), w.h.p.

Conclusion

Our work initiates the study of the effect of node capacities on the complexity of distributed graph computations. We provide some ideas to approach the difficulties such limitations impose, which might be of interest for other problems as well. Clearly, there is an abundance of classical problems that may be newly investigated under our model and for which our algorithms may be helpful. In general, it would be interesting to see a classification of graph algorithms that can or cannot be efficiently performed in the Node-Capacitated Clique. We are also very interested in proving lower bounds, which seems to be highly non-trivial in our model. In particular, we do not know whether the arboricity or the average node degree are natural lower bounds for some of the problems considered in this paper, although we highly suspect it.

Interestingly, the algorithms presented in this paper do not fully exploit the power of the Node-Capacitated Clique. In fact, all of our algorithms still achieve the presented runtimes if, in addition to knowing their neighbors in the input graph, the nodes initially only know Θ(log n) random nodes. It is an interesting question whether there are algorithms that actually require knowing all node identifiers.

ACKNOWLEDGMENTS
J.A. is supported by DST/SERB Extra Mural Research Grant EMR/2016/003016, DST-DAAD Joint Project INT/FRG/DAAD/P-25/2018, and DST MATRICS MTR/2018/001198. M.G. is supported by SNSF Project No. 200021_184735. K.H. and C.S. are supported by DFG Project No. 160364472-SFB901 ("On-The-Fly Computing"). F.K. is supported by ERC Grant No. 336495 (ACDC).
¹ Most communication in our algorithms is carried out using a butterfly as an overlay, which can be constructed, e.g., using [2].

REFERENCES
[1] R. Aleliunas. 1982. Randomized parallel communication. In Proc. of 1st ACM Symposium on Principles of Distributed Computing (PODC). 60–72.
[2] John Augustine and Sumathi Sivasubramaniam. 2018. Spartan: A Framework For Sparse Robust Addressable Networks. In Proc. of the 32nd IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 1060–1069. https://ieeexplore.ieee.org/document/8425259/
[3] Leonid Barenboim and Michael Elkin. 2009. Distributed (∆+1)-Coloring in Linear (in ∆) Time. In Proc. of the 41st Annual ACM Symposium on Theory of Computing (STOC). 111–120.
[4] Leonid Barenboim and Michael Elkin. 2010. Sublogarithmic distributed MIS algorithm for sparse graphs using Nash-Williams decomposition. Distributed Computing 22, 5-6 (2010), 363–379.
[5] Leonid Barenboim and Michael Elkin. 2011. Deterministic Distributed Vertex Coloring in Polylogarithmic Time. J. ACM 58, 5 (2011), 1–25.
[6] Leonid Barenboim, Michael Elkin, Seth Pettie, and Johannes Schneider. 2016. The Locality of Distributed Symmetry Breaking. J. ACM 63, 3 (2016), 20:1–20:45.
[7] Leonid Barenboim and Victor Khazanov. 2018. Distributed Symmetry-Breaking Algorithms for Congested Cliques. arXiv preprint arXiv:1802.07209 (2018).
[8] Florent Becker, Pedro Montealegre, Ivan Rapaport, and Ioan Todinca. 2018. The Impact of Locality on the Detection of Cycles in the Broadcast Congested Clique Model. In LATIN 2018: Theoretical Informatics. 134–145.
[9] Ruben Becker, Andreas Karrenbauer, Sebastian Krinninger, and Christoph Lenzen. 2017. Near-optimal approximate shortest paths and transshipment in distributed and streaming models. In Proc. of 31st International Symposium on Distributed Computing (DISC). 7:1–7:16.
[10] L. Elisa Celis, Omer Reingold, Gil Segev, and Udi Wieder. 2013. Balls into Bins: Smaller Hash Families and Faster Evaluation. SIAM J. Comput. 42, 3 (2013), 1030–1050.
[11] Keren Censor-Hillel, Petteri Kaski, Janne H. Korhonen, Christoph Lenzen, Ami Paz, and Jukka Suomela. 2015. Algebraic Methods in the Congested Clique. In Proc. of 34th ACM Symposium on Principles of Distributed Computing (PODC). 143–152.
[12] Keren Censor-Hillel, Merav Parter, and Gregory Schwartzman. 2017. Derandomizing local distributed algorithms under bandwidth restrictions. In Proc. of 31st International Symposium on Distributed Computing (DISC). 11:1–11:16.
[13] Narsingh Deo and Bruce Litow. 1998. A Structural Approach to Graph Compression. In Proc. of the MFCS Workshop on Communications. 91–100.
[14] Danny Dolev, Christoph Lenzen, and Shir Peled. 2012. "Tri, tri again": Finding triangles and small subgraphs in a distributed setting. In Proc. of 26th International Symposium on Distributed Computing (DISC). 195–209.
[15] Andrew Drucker, Fabian Kuhn, and Rotem Oshman. 2014. On the power of the congested clique model. In Proc. of 33rd ACM Symposium on Principles of Distributed Computing (PODC). 367–376.
[16] Vida Dujmovic and David R. Wood. 2007. Graph Treewidth and Geometric Thickness Parameters. Discrete & Computational Geometry 37, 4 (2007), 641–670.
[17] Michael Elkin. 2004. Unconditional lower bounds on the time-approximation tradeoffs for the distributed minimum spanning tree problem. In Proc. of the 36th ACM Symposium on Theory of Computing (STOC). 331–340.
[18] Michael Elkin. 2006. A faster distributed protocol for constructing a minimum spanning tree. J. Comput. System Sci. 72, 8 (2006), 1282–1308.
[19] Silvio Frischknecht, Stephan Holzer, and Roger Wattenhofer. 2012. Networks cannot compute their diameter in sublinear time. In Proc. of 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 1150–1162.
[20] Francois Le Gall. 2016. Further algebraic algorithms in the congested clique model and applications to graph-theoretic problems. In Proc. of 30th International Symposium on Distributed Computing (DISC). 57–70.
[21] Mohsen Ghaffari. 2017. Distributed MIS via All-to-All Communication. In Proc. of the ACM Symposium on Principles of Distributed Computing (PODC). ACM, 141–149.
[22] Mohsen Ghaffari, Themis Gouleakis, Slobodan Mitrovic, and Ronitt Rubinfeld. 2018. Improved Massively Parallel Computation Algorithms for MIS, Matching, and Vertex Cover. arXiv preprint arXiv:1802.08237 (2018).
[23] Mohsen Ghaffari and Bernhard Haeupler. 2016. Distributed Algorithms for Planar Networks II: Low-congestion Shortcuts, MST, and Min-Cut. In Proc. of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 202–219.
[24] Mohsen Ghaffari, Fabian Kuhn, and Hsin-Hao Su. 2017. Distributed MST and Routing in Almost Mixing Time. In Proc. of the ACM Symposium on Principles of Distributed Computing (PODC). 131–140.
[25] Mohsen Ghaffari and Krzysztof Nowicki. 2018. Congested Clique Algorithms for the Minimum Cut Problem. In Proc. of the 2018 ACM Symposium on Principles of Distributed Computing (PODC). 357–366.
[26] Mohsen Ghaffari and Merav Parter. 2016. MST in log-star rounds of congested clique. In Proc. of 35th ACM Symposium on Principles of Distributed Computing (PODC). 19–28.
[27] Robert Gmyr, Kristian Hinnenthal, Christian Scheideler, and Christian Sohler. 2017. Distributed Monitoring of Network Properties: The Power of Hybrid Networks. In Proc. of 44th International Colloquium on Automata, Languages, and Programming (ICALP). 137:1–137:15.
[28] James W. Hegeman, Gopal Pandurangan, Sriram V. Pemmaraju, Vivek B. Sardeshmukh, and Michele Scquizzato. 2015. Toward optimal bounds in the congested clique: Graph connectivity and MST. In Proc. of 34th ACM Symposium on Principles of Distributed Computing (PODC). 91–100.
[29] James W. Hegeman and Sriram V. Pemmaraju. 2014. Lessons from the congested clique applied to MapReduce. In Proc. of 21st Colloquium on Structural Information and Communication Complexity (SIROCCO). 149–164.
[30] James W. Hegeman, Sriram V. Pemmaraju, and Vivek B. Sardeshmukh. 2014. Near-constant-time distributed algorithms on a congested clique. In Proc. of 28th International Symposium on Distributed Computing (DISC). 514–530.
[31] Amos Israeli and A. Itai. 1986. A fast and simple randomized parallel algorithm for maximal matching. Inform. Process. Lett. 22, 2 (1986), 77–80.
[32] Daniel Jung, Christina Kolb, Christian Scheideler, and Jannik Sundermeier. 2018. Competitive Routing in Hybrid Communication Networks. In Proc. of the 14th International Symposium on Algorithms and Experiments for Wireless Networks (ALGOSENSORS). 15–31.
[33] Tomasz Jurdziński and Krzysztof Nowicki. 2018. Connectivity and Minimum Cut Approximation in the Broadcast Congested Clique. In Structural Information and Communication Complexity (SIROCCO). 331–344.
[34] Tomasz Jurdziński and Krzysztof Nowicki. 2018. MST in O(1) rounds of congested clique. In Proc. of the 29th ACM-SIAM Symposium on Discrete Algorithms (SODA). 2620–2632.
[35] Valerie King, Shay Kutten, and Mikkel Thorup. 2015. Construction and impromptu repair of an MST in a distributed network with o(m) communication. In Proc. of the 2015 ACM Symposium on Principles of Distributed Computing (PODC). 71–80.
[36] Hartmut Klauck, Danupon Nanongkai, Gopal Pandurangan, and Peter Robinson. 2015. Distributed Computation of Large-scale Graph Problems. In Proc. of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 391–410.
[37] Christian Konrad. 2018. MIS in the Congested Clique Model in O(log log ∆) Rounds. arXiv preprint arXiv:1802.07647 (2018).
[38] Janne H. Korhonen. 2016. Brief announcement: Deterministic MST sparsification in the congested clique. In Proc. of 30th International Symposium on Distributed Computing (DISC).
[39] Janne H. Korhonen and Jukka Suomela. 2017. Brief Announcement: Towards a Complexity Theory for the Congested Clique. In Proc. of 31st International Symposium on Distributed Computing (DISC). 55:1–55:3.
[40] Kishore Kothapalli and Sriram Pemmaraju. 2011. Distributed graph coloring in a few rounds. In Proc. of the 30th Annual ACM Symposium on Principles of Distributed Computing (PODC). 31–40.
[41] Kishore Kothapalli and Sriram Pemmaraju. 2012. Super-Fast 3-Ruling Sets. In IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), Vol. 18. 136–147.
[42] Kishore Kothapalli, Christian Scheideler, Melih Onus, and Christian Schindelhauer. 2006. Distributed coloring in Õ(√(log n)) bit rounds. In Proc. of the 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS).
[43] Shay Kutten and David Peleg. 1998. Fast distributed construction of small k-dominating sets and applications. Journal of Algorithms 28, 1 (1998), 40–66.
[44] F. T. Leighton, B. M. Maggs, A. G. Ranade, and S. B. Rao. 1994. Randomized routing and sorting on fixed-connection networks. Journal of Algorithms 17 (1994), 157–205.
[45] Christoph Lenzen. 2013. Optimal deterministic routing and sorting on the congested clique. In Proc. of 32nd ACM Symposium on Principles of Distributed Computing (PODC). 42–50.
[46] Christoph Lenzen and David Peleg. 2013. Efficient distributed source detection with limited bandwidth. In Proc. of 32nd ACM Symposium on Principles of Distributed Computing (PODC). 375–382.
[47] Zvi Lotker, Boaz Patt-Shamir, Elan Pavlov, and David Peleg. 2005. Minimum-weight spanning tree construction in O(log log n) communication rounds. SIAM J. Comput. 35, 1 (2005), 120–131.
[48] Y. Métivier, J. M. Robson, N. Saheb-Djahromi, and A. Zemmari. 2011. An optimal bit complexity randomized distributed MIS algorithm. Distributed Computing 23, 5-6 (2011), 331–340.
[49] Danupon Nanongkai. 2014. Distributed approximation algorithms for weighted shortest paths. In Proc. of 46th ACM Symposium on Theory of Computing (STOC). 565–573.
[50] C. St. J. A. Nash-Williams. 1964. Decomposition of Finite Graphs Into Forests. Journal of the London Mathematical Society 39, 1 (1964), 12–12.
[51] Gopal Pandurangan, Peter Robinson, and Michele Scquizzato. 2016. Fast Distributed Algorithms for Connectivity and MST in Large Graphs. In Proc. of the 28th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 429–438.
[52] David Peleg and Vitaly Rubinovich. 2000. Near-tight lower bound on the time complexity of distributed MST construction. SIAM J. Comput. 30, 5 (2000), 1427–1442.
[53] Abhiram G. Ranade. 1991. How to Emulate Shared Memory. J. Comput. System Sci. 42, 3 (1991), 307–326.
[54] Atish Das Sarma, Stephan Holzer, Liah Kor, Amos Korman, Danupon Nanongkai, Gopal Pandurangan, David Peleg, and Roger Wattenhofer. 2011. Distributed verification and hardness of distributed approximation. In Proc. of 43rd ACM Symposium on Theory of Computing (STOC). 363–372.
[55] Christian Scheideler. 1998. Universal Routing Strategies for Interconnection Networks. Springer Verlag, Heidelberg.
[56] Jeanette P. Schmidt, Alan Siegel, and Aravind Srinivasan. 1995. Chernoff-Hoeffding Bounds for Applications with Limited Independence. SIAM Journal on Discrete Mathematics 8, 2 (1995), 223–250.
[57] Eli Upfal. 1982. Efficient schemes for parallel communication. In
Proc. of 1st ACM Symposium on Principles of Distributed Computing (PODC). 241–250.

A SIMULATIONS IN THE k-MACHINE MODEL

In this section we consider the simulation of an algorithm for the Node-Capacitated Clique in the k-machine model. For the Congested Clique model, Klauck et al. [36] provide a conversion theorem that states the following.

Theorem A.1 (Theorem 4.1 in [36]).
Any algorithm A_C in the Congested Clique model that executes in T_C rounds and passes at most M_C messages over the course of the algorithm's execution can be simulated in the k-machine model so that it requires at most Õ(M_C/k² + T_C ∆′/k) rounds. Here, ∆′ is the communication degree complexity and refers to the maximum number of messages sent by any node in any round.

The simulation alluded to in Theorem A.1 is quite straightforward. Each node from the Congested Clique model is placed randomly on one of the k machines in the k-machine model. Under this random vertex partitioning scheme, each machine will get at most Õ(n/k) nodes from the Congested Clique model. So it is natural for the messages sent by each node u in the Congested Clique model to be simulated by the machine that holds u.

The following conversion result suited for the Node-Capacitated Clique model follows as a corollary when we notice that the number of messages per round is at most Õ(n) and, furthermore, ∆′ under the Node-Capacitated Clique model is at most O(log n).

Corollary 2.
Any algorithm A_NCC in the Node-Capacitated Clique model that executes in T_NCC rounds can be simulated in the k-machine model so that it requires at most Õ(n T_NCC/k²) rounds.

B COMMUNICATION PRIMITIVES
In this section, we provide full descriptions of our communication primitives, and provide the missing proofs. For simplicity, we refer to butterfly nodes as BF-nodes.

B.1 Aggregate-and-Broadcast Algorithm
We first describe the Aggregate-and-Broadcast Algorithm of Theorem 2.2 in detail. First, every node that stores an input value, but does not emulate a node of the butterfly (in which case the most significant bit of its identifier must be 1), sends it to the BF-node j of level 0 such that j equals the remaining bits of its identifier. Afterwards, every BF-node of level 0 stores at most two input values, i.e., its own value and at most one value of a node that does not emulate a node of the butterfly. Note that for every BF-node of level 0 there is a unique path of length d from that node to any BF-node of level d in the butterfly. In the aggregation phase, we send all input values to BF-node 0 of level d, which in the following we refer to as the root of the butterfly, along that path system. Whenever two values x, y reach the same BF-node u, u only forwards g({x, y}). Thereby, the root eventually computes the aggregate of all values. This value is finally broadcast to all BF-nodes of level 0 in the broadcast phase: Every BF-node of level i that receives the value forwards it to all of its neighbors in level i − 1. Finally, every node that does not emulate a BF-node gets informed by the BF-node of level 0 whose identifier differs only in the most significant bit. The correctness of Theorem 2.2 can easily be seen.

As pointed out in the paper, we also use the above algorithm to achieve synchronization: Assume that the nodes execute some distributed algorithm that finishes in different rounds at the nodes. In order to start a follow-up algorithm at the same round, the nodes can make use of the following slight modification of the Aggregate-and-Broadcast Algorithm: Every node delays its participation in the aggregation phase until it has finished the current algorithm. Once it has finished, it sends a token to its corresponding BF-node at level 0. Once a BF-node at level 0 has received a token from each of the at most two nodes of the Node-Capacitated Clique associated with it, it sends a token in the direction of the butterfly's root. Similarly, once a BF-node at level i > 0 has received a token from each of its neighbors at level i − 1, it forwards a token towards the root. Once the root has received tokens from all of its neighbors at level d − 1, all nodes have finished, and the root initiates the broadcast phase, after which all nodes can start the follow-up algorithm in the same round. Altogether, the synchronization takes O(log n) rounds.

B.2 Aggregation Algorithm
Next, we describe the Aggregation Algorithm of Theorem 2.3. We divide the execution of the algorithm into three phases, the Preprocessing Phase, the Combining Phase, and the Postprocessing Phase.

First, in the Preprocessing Phase, all input values are sent in batches of size ⌈log n⌉ to BF-nodes of level 0 chosen uniformly at random. More specifically, every node u ∈ V transforms each input value s_{u,i} for each A_i of which u is a member into a packet of the form (i, s_{u,i}), and enumerates all of its packets arbitrarily from 1 to k ≤ ℓ as p_1, . . . , p_k. Then, for each j ∈ {1, . . . , ⌈k/log n⌉}, u sends out the packets p_{(j−1)⌈log n⌉+1}, . . . , p_{min{j⌈log n⌉, k}} in communication round j to BF-nodes chosen uniformly and independently at random among all BF-nodes of level 0. To achieve synchronization after this phase, the nodes perform the Aggregate-and-Broadcast Algorithm.

In the Combining Phase, the input values of each aggregation group A_i are aggregated at a node h(i) (the intermediate target) chosen uniformly and independently at random from the BF-nodes of level d using a (pseudo-)random hash function h. This is achieved by using a variant of the random rank protocol [1, 57]: Each packet p = (i, s_{u,i}) stored at some BF-node of level 0 gets assigned a rank(p) = ρ(i) using some (pseudo-)random hash function ρ : {1, . . . , N} → [K] that is known to all nodes. Then, all packets belonging to aggregation group A_i are routed towards their target h(i) along the unique paths in the butterfly, using the following rules:

(1) Whenever a BF-node stores multiple packets belonging to the same aggregation group A_i, it combines them into a single packet of rank ρ(i), combining their values using the given aggregate function.
(2) Whenever multiple packets from different aggregation groups contend to use the same edge in the same round, the one with the smallest rank wins (preferring the one with the smallest aggregation group identifier in case of a tie), and all others get delayed.

Note that a packet can never get delayed by a packet belonging to the same aggregation group.
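Rules (1) and (2) can be illustrated with a toy, centralized simulation (the tuple encoding and the name `route` are illustrative; real packets travel along butterfly paths and carry values that are merged by the aggregate function):

```python
from collections import defaultdict

def route(packets, max_rounds=100):
    """Toy simulation of the two combining-phase rules on a leveled path
    collection.  A packet is (group, rank, route), where `route` is the
    list of nodes it still has to visit (route[0] = current node).
    Packets of the same group at the same node share their remaining
    route, so rule (1) reduces to deduplication here."""
    rounds = 0
    while any(len(p[2]) > 1 for p in packets) and rounds < max_rounds:
        rounds += 1
        # Rule (1): combine packets of the same group at the same node.
        seen, combined = set(), []
        for g, r, rt in packets:
            if (g, rt[0]) not in seen:
                seen.add((g, rt[0]))
                combined.append((g, r, rt))
        # Rule (2): per edge, the packet with the smallest (rank, group)
        # advances; all other contenders are delayed for one round.
        by_edge, arrived = defaultdict(list), []
        for p in combined:
            if len(p[2]) > 1:
                by_edge[(p[2][0], p[2][1])].append(p)
            else:
                arrived.append(p)  # already at its target
        packets = arrived
        for contenders in by_edge.values():
            contenders.sort(key=lambda p: (p[1], p[0]))
            g, r, rt = contenders[0]
            packets.append((g, r, rt[1:]))   # winner crosses the edge
            packets.extend(contenders[1:])   # losers stay put
    return packets, rounds
```

For example, two groups whose routes share an edge are serialized by their ranks, while duplicate packets of one group are combined into a single packet before contending for the edge.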
Clearly, in each round at most one packet is sent along each edge of the butterfly, and eventually all (combined) packets have reached their targets.

In order to determine whether the Combining Phase has finished, every BF-node of level 0 sends out a token to all neighbors at level 1 once it has sent out all of its packets. Correspondingly, every BF-node at level i > 0 sends out a token to all neighbors at level i + 1 once it has received a token from each of its neighbors at level i − 1 and has sent out all of its packets. By performing the Aggregate-and-Broadcast Algorithm to determine whether all BF-nodes of level d have received two tokens, the nodes eventually detect that the Combining Phase has finished.

Finally, in the Postprocessing Phase the BF-nodes of level d send their packets to the corresponding targets in rounds that are randomly chosen from {1, . . . , s}, where s = ⌈ℓ̂/log n⌉. More specifically, for each packet p stored at some node u, which contains the result f({s_{u,i} | u ∈ A_i}) for some aggregation group A_i, u selects a round r ∈ {1, . . . , s} uniformly and independently at random and sends p to t_i in round r. Again, the end of the phase is determined by using the Aggregate-and-Broadcast Algorithm.

We now turn to the analysis of the algorithm.

Lemma B.1. The Preprocessing Phase takes time O(ℓ/log n). Moreover, in each round every node sends and receives at most O(log n) packets, w.h.p.

Proof.
The runtime and the bound on the number of packets sent out in each round are obvious. Hence, it remains to bound the number of packets that are received in each round.

Fix any BF-node u of level 0 and any round t ∈ {1, . . . , ⌈ℓ/log n⌉}. Altogether, at most n⌈log n⌉ packets are sent out in round t, which we denote by p_1, . . . , p_{n⌈log n⌉}. For each p_i, let the binary random variable X_i be 1 if and only if p_i is sent to BF-node u in round t. Furthermore, let X = Σ_{i=1}^{n⌈log n⌉} X_i. Certainly, E[X_i] = Pr[X_i = 1] = 1/2^d and therefore E[X] ≤ (n⌈log n⌉)/2^d = O(log n). Since the packets choose their destinations uniformly and independently at random, it follows from Lemma 2.1 that X = O(log n), w.h.p. □

In order to bound the runtime of the Combining Phase, we first analyze our variant of the random rank protocol in a general setting: A path collection P = {p_1, . . . , p_N} in some graph G is a leveled path collection if every node v can be given a level l(v) ∈ ℕ so that for every edge (v, w) of a path in that collection, l(w) = l(v) + 1. Given a leveled path collection P of size n in which packets belonging to the same aggregation group have the same destination, let the congestion C of P be defined as the maximum number of aggregation groups that have packets that want to cross the same edge, and let the degree d of P be defined as the maximum number of edges in E(P) leading to the same node, where E(P) is the set of all edges used by the paths in P.

Theorem B.2.
For any leveled path collection P of size n with congestion C, depth D, and degree d, the routing strategy used in the Combining Phase with parameter K ≥ 8C needs at most O(C + D log d + log n) steps, w.h.p., to finish routing in P.

Proof. We closely follow the analysis of the random rank protocol in [55] and extend it with ideas from [44] so that the analysis covers the case that packets can be combined. In order to bound the runtime, we will use the following delay sequence argument.

Suppose the runtime of the routing strategy is at least T ≥ D + s. We want to show that it is very improbable that s is large. For this we need to find a structure that witnesses a large s; this structure should become more and more unlikely to exist the larger s becomes.

Let p_1 be a packet that arrived at its destination v in step T, and let A_1 be the aggregation group of p_1. We follow the path of p_1 (or one of its predecessors, if p_1 is the result of the combination of two packets at some point) backwards until we reach a link e_1 where it was delayed the last time. Let us denote the length of the path from v to e_1 (inclusive) by l_1, and the packet that delayed p_1 by p_2. Let A_2 be the aggregation group of p_2. From e_1 we follow the path of p_2 (or one of its predecessors) backwards until we reach a link e_2 where p_2 was delayed the last time, by a packet p_3 from some aggregation group A_3. Let us denote the length of the path from e_1 (exclusive) to e_2 (inclusive) by l_2. We repeat this construction until we arrive at a packet p_{s+1} from some aggregation group A_{s+1} that prevented the packet p_s at edge e_s from moving forward, and denote the number of links on the path of p_i from e_i (inclusive) to e_{i−1} (exclusive) by l_i. Altogether it holds for all i ∈ {1, . . . , s}: a packet from aggregation group A_{i+1} is sent over e_i at time step T − Σ_{j=1}^{i}(l_j + 1) + 1, and prevents at that time step a packet from aggregation group A_i from moving forward.

The path from e_s to v recorded by this process in reverse order is called the delay path. It consists of s contiguous parts of routing paths of lengths l_1, . . . , l_s ≥ 0 with Σ_{i=1}^{s} l_i ≤ D. Because of the contention resolution rule it holds that ρ(A_i) ≥ ρ(A_{i+1}) for all i ∈ {1, . . . , s}. A structure that contains all these features is defined as follows.

Definition B.3 (s-delay sequence). An s-delay sequence consists of
• s not necessarily different delay links e_1, . . . , e_s;
• s + 1 delay groups a_1, . . . , a_{s+1} such that the path of a packet from a_i traverses e_i and e_{i−1} in that order for all i ∈ {2, . . . , s}, the path of a packet from a_1 contains e_1, and the path of a packet from a_{s+1} contains e_s;
• s integers l_1, . . . , l_s ≥ 0 such that l_1 is the number of links on the path of a packet from a_1 from e_1 (inclusive) to its destination, for all i ∈ {2, . . . , s}, l_i is the number of links on the path of a packet from a_i from e_i (inclusive) to e_{i−1} (exclusive), and Σ_{i=1}^{s} l_i ≤ D; and
• s + 1 ranks r_1, . . . , r_{s+1} with 0 ≤ r_{s+1} ≤ . . . ≤ r_1 < K.

A delay sequence is called active if for all i ∈ {1, . . . , s + 1} we have ρ(a_i) = r_i.

Our observations above yield the following lemma.

Lemma B.4.
Any choice of the ranks that yields a routing time of T ≥ D + s steps implies an active s-delay sequence.

Lemma B.5. The number of different s-delay sequences is at most $n \cdot d^D \cdot C^s \cdot \binom{D+s}{s} \cdot \binom{s+K}{s+1}$.

Proof. There are at most $\binom{D+s}{s}$ possibilities to choose the l_i's such that Σ_{i=1}^{s} l_i ≤ D. Furthermore, there are at most n choices for v, which also fixes a_1. Once v and l_1 are fixed, there are at most d^{l_1} choices for e_1. Once e_1 is fixed, there are at most d^{l_2} choices for e_2, and so on. So altogether, there are at most d^D possibilities for e_1, . . . , e_s. Since the congestion at every edge is at most C, there are at most C possibilities for each e_i to pick a_{i+1}, so altogether, there are at most C^s possibilities to select a_2, . . . , a_{s+1}. Finally, there are at most $\binom{s+K}{s+1}$ ways to select the r_i such that 0 ≤ r_{s+1} ≤ . . . ≤ r_1 < K. □

Note that we assumed that there is a unique, total ordering on the ranks of the aggregation groups once ρ is fixed. Hence, every aggregation group can only occur once in an s-delay sequence. Since ρ is assumed to be a (pseudo-)random hash function, the probability that a given s-delay sequence is active is 1/K^{s+1}. Thus,

Pr[the protocol needs at least D + s steps]
≤ Pr[there exists an active s-delay sequence]   (Lemma B.4)
≤ $n \cdot d^D \cdot C^s \cdot \binom{D+s}{s} \cdot \binom{s+K}{s+1} \cdot \frac{1}{K^{s+1}}$   (Lemma B.5)
≤ $n \cdot 2^{D \log d} \cdot C^s \cdot 2^{D+s} \cdot 2^{s+K} \cdot \frac{1}{K^{s+1}}$
≤ $n \cdot 2^{2s + D(\log d + 1) + K} \cdot \left(\frac{C}{K}\right)^s$.

If we set K ≥ 8C and s = K + D(log d + 1) + (α + 1) log n, where α > 0 is an arbitrary constant, then

Pr[the algorithm needs at least D + s steps] ≤ $n \cdot 2^{2s + D(\log d + 1) + K} \cdot 2^{-3s} = n \cdot 2^{-s + D(\log d + 1) + K} = n^{-\alpha}$,

which concludes the proof of Theorem B.2. □

Using Theorem B.2, we are now able to bound the runtime of the Combining Phase by determining the parameters of the underlying routing problem.

Lemma B.6.
The Combining Phase takes time O(L/n + log n), w.h.p.

Proof. The depth of the butterfly is O(log n) and its degree is 4. Furthermore, the size of the routing problem is L. Therefore, it only remains to show that the congestion of the routing problem is O(L/n + log n), w.h.p.

Consider some fixed edge e from level i to level i + 1. For each aggregation group A ∈ 𝒜, let the binary random variable X_A be 1 if and only if there is at least one packet from A crossing e. There are 2^i · 2^{d−i−1} = 2^d/2 pairs of a level-0 BF-node and a level-d BF-node whose unique shortest path passes through e. Since the source of every packet is chosen uniformly and independently at random among all BF-nodes of level 0, and the destinations of the aggregation groups are chosen uniformly and independently at random from all BF-nodes of level d, the probability for an individual packet to pass through e is (2^d/2)/(2^d · 2^d) = 1/2^{d+1}. Hence, E[X_A] = Pr[X_A = 1] ≤ |A|/2^{d+1}. Let X = Σ_{A∈𝒜} X_A. Then

E[X] = Σ_{A∈𝒜} E[X_A] ≤ Σ_{A∈𝒜} |A|/2^{d+1} = L/2^{d+1} ≤ L/n.

Since the X_A's are independent, it follows from the Chernoff bounds (Lemma 2.1) that X = O(L/n + log n), w.h.p. □

Using Chernoff bounds and the fact that every node at level d of the butterfly is the target of at most O(ℓ̂ + log n) aggregation groups, w.h.p., the following result can be shown similarly to Lemma B.1.

Lemma B.7.
The Postprocessing Phase takes time $O(\hat{\ell}/\log n)$, w.h.p. Moreover, in each round every node sends and receives at most $O(\log n)$ packets, w.h.p.

We conclude the following theorem.
Theorem B.8.
The Aggregation Algorithm takes time $O(L/n + (\ell + \hat{\ell})/\log n + \log n)$, w.h.p.

B.3 Multicast Tree Setup Algorithm
First, every node $u$ injects an (empty) packet $(i, u)$ for each $i$ such that $u \in A_i$ into a BF-node $l(i, u)$ of level 0 chosen uniformly and independently at random. As in the Aggregation Algorithm, packets are sent in batches of size $\lceil \log n \rceil$. Then, for all $i$, all packets of $A_i$ are aggregated at $h(i)$ using the same routing strategy as in the Aggregation Algorithm and an arbitrary aggregate function. Alongside the algorithm's execution, every BF-node $u$ records for every $i \in \{1, \ldots, N\}$ all edges along which packets from group $A_i$ arrived during the routing towards $h(i)$, and declares them as edges of $T_i$. Again, the intermediate steps are synchronized using the Aggregate-and-Broadcast Algorithm, and the final termination is determined using a token passing strategy.

The following theorem follows from the analysis of the Aggregation Algorithm.

Theorem B.9.
The Multicast Tree Setup Algorithm computes multicast trees in time $O(L/n + \ell/\log n + \log n)$, w.h.p. The resulting multicast trees have congestion $O(L/n + \log n)$, w.h.p.

B.4 Multicast Algorithm
The Multicast Algorithm shares many similarities with the Aggregation Algorithm. First, every source $s_i$ directly sends $p_i$ to $h(i)$. Then, in the Spreading Phase, $h(i)$ sends $p_i$ to all $l(i, u)$ for all $i$ and $u \in A_i$. This is done by using the multicast trees and a variant of the routing protocol of the Combining Phase: First, each packet $p_i$ is assigned a rank $rank(p_i) = \rho(i)$. Whenever a multicast packet $p_i$ of some aggregation group $A_i$ is stored by an inner node of $T_i$, i.e., by some BF-node $u$ of level $j \in \{1, \ldots, d\}$, then a copy of $p_i$ is sent over each outgoing edge of $u$ in $T_i$, i.e., towards one or both of $u$'s neighbors in level $j - 1$. If two packets from different multicast groups contend to use the same edge at the same time, the one with the smallest rank is sent (preferring the one with the smallest multicast group identifier in case of a tie), and the others get delayed. Once there are no packets in transit anymore, which is determined by using the token passing strategy of the Aggregation Algorithm from level 0 in the direction of level $d$, all leaves of the multicast trees have received their multicast packet. Finally, every leaf node $l(i, u)$ sends $p_i$ to $u$ in a round chosen uniformly at random from $\{1, \ldots, \lceil \hat{\ell}/\log n \rceil\}$.

The following theorem follows from the discussion of the previous sections and an adaptation of the delay sequence argument in the proof of Theorem B.2.

Theorem B.10.
The Multicast Algorithm takes time $O(C + \hat{\ell}/\log n + \log n)$, w.h.p.

B.5 Multi-Aggregation Algorithm
The Multi-Aggregation Algorithm essentially first performs a multicast, then maps each multicast packet to a new aggregation group corresponding to its target, and finally aggregates the packets to their targets. More precisely, first every node $s_i$ sends its multicast packet to $h(i)$. Then, by using the same strategy as in the Multicast Algorithm, we let each $l(i, u)$ receive $p_i$ for all $u \in A_i$ and all $i$. Every node $l(i, u)$ then maps $p_i$ to a packet $(id(u), p_i)$ for all $u \in A_i$ and all $i$. We randomly distribute the resulting packets by letting each BF-node send out its packets, one after the other, to BF-nodes of level 0 chosen uniformly and independently at random. By using the same strategy as in the Aggregation Algorithm, we then aggregate all packets $(id(u), p_i)$ for all $i$ to $h(id(u))$, and finally send the result $f(\{p_i \mid u \in A_i\})$ from $h(id(u))$ to $u$.

The following theorem follows from the discussion of the previous sections and from the fact that the mapping takes time $O(C)$.

Theorem B.11.
The Multi-Aggregation Algorithm takes time $O(C + \log n)$, w.h.p.
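All of the bounds above rest on the congestion analysis of the butterfly (proof of Lemma B.6), whose key counting step is that exactly $2^i \cdot 2^{d-i-1} = 2^{d-1}$ of the $2^d \cdot 2^d$ (source, destination) row pairs route through any fixed edge between levels $i$ and $i+1$. The following Python sketch checks this by exhaustive enumeration on a small butterfly; the concrete conventions (rows $0, \ldots, 2^d - 1$, level-$i$ edges fixing bit $i$ of the row, bit-fixing shortest paths) are our own illustrative assumptions, not the paper's exact formalization.

```python
# Illustrative sketch (our own butterfly conventions, not the paper's
# exact model): the unique shortest path from a level-0 row to a
# level-d row fixes one row bit per level.  We verify the counting
# step of Lemma B.6: exactly 2^(d-1) of the 2^d * 2^d (source,
# destination) pairs route through any fixed edge.

def path_edges(d, src, dst):
    """Edges ((level, row), (level+1, row')) of the unique bit-fixing
    path from row src at level 0 to row dst at level d."""
    edges = []
    row = src
    for i in range(d):
        # fix bit i of the current row to dst's bit i
        nxt = (row & ~(1 << i)) | (dst & (1 << i))
        edges.append(((i, row), (i + 1, nxt)))
        row = nxt
    return edges

def pairs_through_edge(d, edge):
    """Count (src, dst) pairs whose path uses the given edge."""
    return sum(edge in path_edges(d, s, t)
               for s in range(1 << d) for t in range(1 << d))

d = 4
edge = ((1, 0b0110), (2, 0b0100))  # a fixed edge from level 1 to level 2
print(pairs_through_edge(d, edge))  # 2^(d-1) = 8
```

With sources and destinations chosen uniformly at random, dividing $2^{d-1}$ by the $2^{2d}$ possible pairs gives the per-packet crossing probability $1/2^{d+1}$ used in the expected-congestion bound.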