Near-Optimal Scheduling in the Congested Clique
Keren Censor-Hillel, Yannic Maus, Volodymyr Polosukhin
Technion∗

∗ {ckeren, yannic.maus, po}@cs.technion.ac.il

Abstract
This paper provides three nearly-optimal algorithms for scheduling t jobs in the CLIQUE model. First, we present a deterministic scheduling algorithm that runs in O(GlobalCongestion + dilation) rounds for jobs that are sufficiently efficient in terms of their memory. The dilation is the maximum round complexity of any of the given jobs, and the GlobalCongestion is the total number of messages in all jobs divided by the per-round bandwidth of n² of the CLIQUE model. Both are inherent lower bounds for any scheduling algorithm.

Then, we present a randomized scheduling algorithm which runs t jobs in O(GlobalCongestion + dilation · log n + t) rounds and only requires that inputs and outputs do not exceed O(n log n) bits per node, which is met by, e.g., almost all graph problems. Lastly, we adjust the random-delay-based scheduling algorithm [Ghaffari, PODC'15] from the CONGEST model and obtain an algorithm that schedules any t jobs in O(t/n + LocalCongestion + dilation · log n) rounds, where the LocalCongestion relates to the congestion at a single node of the
CLIQUE. We compare this algorithm to the previous approaches and show their benefit. We schedule the set of jobs on-the-fly, without a priori knowledge of its parameters or the communication patterns of the jobs. In light of the inherent lower bounds, all of our algorithms are nearly-optimal.

We exemplify the power of our algorithms by analyzing the message complexity of the state-of-the-art MIS protocol [Ghaffari, Gouleakis, Konrad, Mitrovic and Rubinfeld, PODC'18], and we show that we can solve t instances of MIS in O(t + log log ∆ · log n) rounds, that is, in O(1) amortized time, for t ≥ log log ∆ · log n.

Introduction

Motivated by the ever-growing number of frameworks for parallel computations, we address the complexity of executing multiple jobs in such settings. Such frameworks, e.g., MapReduce [KSV10], typically need to execute a long queue of jobs. A fundamental goal of such systems is to schedule many jobs in parallel, in order to utilize as much of the computational power of the system as possible. Ideally, this is done by the system in a black-box manner, without the need to modify the jobs and, more importantly, without the need to know their properties, and specifically their communication patterns, beforehand.

In their seminal work, Leighton, Maggs, and Rao [LMR94] studied the special case where each of the to-be-scheduled jobs is a routing protocol that routes a packet through a network along a given path. The goal in their work is to schedule t jobs such that the length of the schedule, i.e., the overall runtime until all t packets have reached their destination, is minimized. They showed that there always exists a schedule of length O(congestion + dilation), where congestion is the maximum number of packets that need to be routed over a single edge of the network and dilation is the maximum length of a path that a packet needs to travel. Clearly, both parameters are lower bounds on the length of any schedule, implying that the above schedule is asymptotically optimal.
Further, Leighton, Maggs, and Rao [LMR94] showed that assigning a random delay to each packet gives a schedule of length O(congestion + dilation · log(t · dilation)).

In his beautiful work, Ghaffari [Gha15] raised the question of running multiple jobs in the distributed CONGEST model on n nodes. Applying the random delays method [LMR94], he showed a randomized algorithm which, after O(dilation · log n) rounds of pre-computation, runs a given set of jobs in O(congestion + dilation · log n) rounds. Here, in a similar spirit to [LMR94], congestion is the maximum number of messages that need to be sent over a single edge and dilation is the maximum round complexity of all jobs. Further, Ghaffari [Gha15] showed that this is nearly optimal, by constructing an instance which requires Ω(congestion + dilation · log n / log log n) rounds to schedule.

In this paper, we address the t-scheduling problem in the (CONGESTED) CLIQUE model [LPPP05], in which each of n machines can send O(log n)-bit messages to any other machine in each round. Our goal is thus to devise scheduling algorithms that run t jobs in a black-box manner, such that they complete in a number of rounds that beats the trivial solution of simply running the jobs sequentially one after the other, and, ideally, reaches inherent lower bounds that we discuss later. We emphasize that we schedule all jobs' actions on-the-fly during their execution. Throughout the paper, we use the terminology that a job is a protocol that n nodes, v_0, . . . , v_{n−1}, need to run on some input, and we use the notion of an algorithm for the scheduling procedure that the n machines, p_0, . . . , p_{n−1}, execute. Each machine p_i is given the inputs of the nodes v_{i,j} for all jobs j, and the machines run an algorithm which simulates the protocols of their assigned nodes.

Our contributions are three algorithms for scheduling t jobs in the CLIQUE model, which exhibit trade-offs based on the parameters of dilation, LocalCongestion, and
GlobalCongestion of the set of jobs, which we formally define below. Our scheduling algorithms complete within round complexities that are nearly optimal w.r.t. the appropriate parameters.
No scheduling algorithm can beat the dilation of the set of jobs, which is the maximum runtime of a job in the set, had this job been executed standalone. Similarly, another natural lower bound is given by the GlobalCongestion, which is the total number of messages that all nodes in all jobs send over all rounds, normalized by the n² per-round bandwidth of the CLIQUE model (for simplicity, this considers the possibility that a machine sends a message to itself). The main goal is thus to get as close as possible to these parameters.

As a toy example, consider a set of jobs in which each completes within a single round. Intuitively, if the total number of messages that need to be sent by all nodes in all jobs is at most n², then one could hope to squeeze all of these jobs into a single round of the CLIQUE model, as n² is the available bandwidth per round. The main hurdle in a straightforward argument as above lies in the fact that a machine cannot send more than n messages in a round. Thus, although we are promised that in total there are no more than n² messages, it might be that a machine is required to send/receive ω(n) messages, because the heaviest-loaded nodes of multiple jobs might be located on the same machine.

This implies that a naïve scheduling, in which each machine simulates the nodes that are located at it, is more expensive than our single-round goal scheduling, as some messages must wait for later rounds. In the general case, these issues become more severe, as the jobs may originally require more than a single round, and it could be that each round displays an imbalance in a different set of nodes and machines.

The key ingredient in the first two scheduling algorithms that we present is hence to rebalance the nodes among the machines, for the sake of a more efficient simulation that deals with the possible imbalance, which may also vary from round to round. The third scheduling algorithm we present is inspired by the random-delay approach of [LMR94, Gha15]. In what follows, we present the guarantees that are obtained by our three scheduling algorithms, and discuss the trade-offs that they exhibit.

Deterministic scheduling.
A crucial factor in the complexity of rebalancing the nodes among the machines is the amount of information that needs to be passed from one machine to another in order for the latter to take over the simulation of a node. To this end, we define an M-memory efficient job as a job where, for each node, its state can be encoded in M log n bits, and the number of messages it needs to receive in a round can be inferred from its state. In Section 3, we obtain the following deterministic algorithm for scheduling t jobs that are M-memory efficient.

Theorem 3.1.
There is a deterministic algorithm that schedules t = poly(n) jobs that are M-memory efficient in O(GlobalCongestion + ⌈M · t/n⌉ · dilation) rounds.

At a very high level, in the algorithm for Theorem 3.1, the machines rebalance nodes in each round by sending the states of nodes. The main technical effort is that the reassignment needs to be computed by the machines on-the-fly, and we show how to do so in a fast way. Notice that for the case that M · t = O(n), the round complexity we get from Theorem 3.1 is O(GlobalCongestion + dilation), which is optimal. Another crucial point is that our algorithm does not require the knowledge of either the GlobalCongestion or the dilation of the set of jobs.
Randomized scheduling.
If we are given a set of jobs that are not memory efficient for a reasonable value of M, it may be too expensive to rebalance the nodes among the machines in every simulated round. However, if the input of each node is not too large, we can randomly shuffle the nodes at the beginning of the simulation, and if the output is also not too large then we can efficiently unshuffle, and reach the original assignment.

To capture this, we say that a job is I/O efficient if its input and output can be encoded within O(n log n) bits. Notice that most graph-related problems are I/O efficient, e.g., MST [LPPP05, HPP+19, Gal16]. An example of a graph problem that is not I/O efficient is k-clique listing, in which all nodes together have to explicitly output all k-cliques in the input graph [DLP12, IG17, PRS18, CGL20, CPZ19], which can be as many as Ω(n^k), thus necessitating large outputs. While the k-clique listing problem is not output efficient, it is input efficient, and as it does not require a specific node to output a specific clique, one could also run several instances of the problem by omitting the output unshuffling step of our scheduling algorithm. We obtain the following randomized algorithm for scheduling t jobs that are I/O efficient.

Theorem 4.1.
There is a randomized algorithm in the CLIQUE model that schedules t = poly(n) jobs that are I/O efficient in O(t + GlobalCongestion + dilation · log n) rounds, w.h.p. An event occurs w.h.p. (with high probability) if, for an arbitrary constant c ≥ 1, the probability that the event occurs is at least 1 − n^{−c}, where n is the number of machines. All our results can be adapted to any constant c at the cost of increasing the runtime by a constant factor.
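As a toy illustration of why random shuffling helps, consider an adversarial placement in which the heavily loaded node of every job sits on the same machine. The following sketch (our own example, not the paper's algorithm; all names are hypothetical) compares the maximum per-machine load before and after a per-job random reassignment:

```python
import random

def max_machine_load(assignment, loads, n):
    """Maximum total load over machines, where assignment[j] is the machine
    hosting job j's heaviest node and loads[j] is that node's message load."""
    per_machine = [0] * n
    for j, machine in enumerate(assignment):
        per_machine[machine] += loads[j]
    return max(per_machine)

random.seed(0)
n, t = 100, 100
loads = [n] * t                 # one heavy node of load n per job

# Adversarial placement: every job's heavy node sits on machine 0.
naive = max_machine_load([0] * t, loads, n)

# Random shuffling: each job independently re-assigns its heavy node.
shuffled = max_machine_load([random.randrange(n) for _ in range(t)], loads, n)
```

With the seed above, `naive` equals t · n while `shuffled` is far smaller, since the t heavy nodes spread nearly evenly over the n machines.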
Like the deterministic scheduling algorithm (Theorem 3.1), the scheduling algorithm of Theorem 4.1 requires neither the knowledge of GlobalCongestion nor the knowledge of dilation.

Both of our scheduling algorithms for Theorem 3.1 and Theorem 4.1 have the machines possibly simulate the execution of nodes that are not originally assigned to them. We stress that any black-box scheduling algorithm in which each machine only simulates the nodes that are originally assigned to it must inherently suffer from another type of congestion as a lower bound on its round complexity, namely, the maximum number of messages that all nodes assigned to a single machine have to send or receive, normalized by the bandwidth n that each machine has per round. We call this the LocalCongestion of a set of jobs. We obtain the following random-delay-based algorithm for scheduling any t jobs, without reassigning nodes.

Theorem 4.4 (Simplified). There is a randomized algorithm in the CLIQUE model that schedules t = poly(n) jobs in O(t/n + LocalCongestion + dilation · log n) rounds, w.h.p.

The stated complexity in the above simplified version of Theorem 4.4 requires the knowledge of the LocalCongestion, but this can be eliminated using a standard doubling approach, at the cost of a logarithmic multiplicative factor (see the precise statement in Section 4).

The random-delay algorithm which gives Theorem 4.4 is suboptimal for a set of jobs which have a single machine with heavily-loaded nodes assigned to it, since in this case it does not exploit the entire bandwidth of the
CLIQUE model. For example, for a problem with inputs of at most O(n log n) bits per node, a protocol in which a fixed leader learns the entire input takes O(n) rounds, where in each round each node sends one message to the leader, who receives n messages. For n such jobs, the GlobalCongestion is n, while the LocalCongestion is n². In such a setting, our random-shuffling algorithm from Theorem 4.1 outperforms the random-delay algorithm from Theorem 4.4. One may suggest to replace the fixed leader by a randomly or more carefully chosen leader. However, this trick might be more complicated in the general case: suppose now that n^{0.9} nodes need to learn n^{1.1} messages each. For such a set of jobs, it holds that LocalCongestion = n^{0.1}, while GlobalCongestion = 1. Thus, it is more efficient to run Theorem 4.1 in this case. Another crucial example in which random-shuffling outperforms random-delays is the maximal independent set protocol that we describe below. Note that our algorithms address these cases in a black-box manner, without assuming knowledge of the communication pattern.
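To make the two congestion measures concrete, here is a back-of-the-envelope computation for the fixed-leader example (the normalizations, n² for GlobalCongestion and n for LocalCongestion, follow the definitions in the preliminaries; the concrete numbers are our own sanity check, not from the paper):

```python
# t = n jobs, each taking n rounds in which every non-leader node sends one
# message per round to a fixed leader node hosted on machine 0.
n = 8                          # machines
t = n                          # jobs
msgs_per_job = n * (n - 1)     # n rounds, n - 1 messages to the leader per round

total_msgs = t * msgs_per_job
global_congestion = total_msgs / n ** 2   # normalized by total clique bandwidth n^2
local_congestion = total_msgs / n         # all receives pile up at the leader's machine

assert global_congestion == n - 1         # on the order of n
assert local_congestion == n * (n - 1)    # on the order of n^2
```

The gap between the two values is exactly what makes the random-shuffling approach preferable here.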
Applications.
In Section 5, we present two applications in order to exemplify our scheduling algorithms. We summarize these applications below and defer a more detailed discussion to Section 5 and Section 6.

A maximal independent set (MIS) of a graph G = (V, E) is a set M ⊆ V such that no two nodes in M are adjacent and no node of V can be added to M without violating this condition. The state-of-the-art randomized CLIQUE protocol for solving the MIS problem completes in O(log log ∆) rounds, w.h.p., where ∆ is the maximum degree of the graph [GGK+18].

Theorem 5.1 (Multiple MIS instances). There is a randomized algorithm in the CLIQUE model which solves t = poly(n) instances of MIS in O(t + log log ∆ · log n) rounds, w.h.p.

Another application that exemplifies our scheduling algorithms is a variant of the pointer jumping problem, which is a widespread algorithmic technique [Hir76]. In the P-pointer jumping problem, each node has a permutation on P elements. A fixed node has a value pointer p and should learn the result of applying these permutations one after another on p. Pointer jumping can be solved by an O(log n)-round protocol in the CLIQUE model by learning the composition of all permutations (see Section 5.2). We observe that this protocol does not utilize the entire bandwidth, and we leverage this for obtaining an algorithm that executes multiple instances of this protocol efficiently.
Theorem 5.5 (Pointer Jumping). For P ≤ n, there are algorithms in the CLIQUE model that solve t = poly(n) instances of the P-pointer jumping problem deterministically in O(⌈P · t/n⌉ · log n) rounds, and randomized in O(t + log n) rounds, w.h.p.

We obtain the deterministic result using our scheduling algorithm in Theorem 3.1 and the randomized result using our random-shuffling scheduling algorithm in Theorem 4.1. The proposed simple O(log n)-round pointer jumping protocol also serves as an example where scheduling jobs via the random-shuffling approach of Theorem 4.1 is significantly better than the random-delay-based approach of Theorem 4.4. For more details we refer to Section 5.2.

In Section 6 we discuss the amortized versions of these results, and present a small example of a set of jobs that can be scheduled with o(1)-amortized complexity. In light of the growing number of O(1)-round CLIQUE protocols, e.g., [CDP20, Now19, GNT20], we propose the amortized complexity of solving many instances of a problem in parallel as a valuable measure of efficiency for future research.
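The composition idea behind the O(log n)-round pointer jumping protocol can be sketched centrally as follows (a toy Python sketch with our own function names; it mimics the doubling structure, not the distributed protocol itself):

```python
def compose(q, p):
    """Return the permutation obtained by applying p first and then q."""
    return [q[x] for x in p]

def pointer_jump(perms, start):
    """Compose all permutations by pairwise doubling (ceil(log2 len(perms))
    combining steps), then apply the composition to the start pointer."""
    layer = list(perms)
    steps = 0
    while len(layer) > 1:
        nxt = [compose(layer[a + 1], layer[a])
               for a in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:            # odd leftover permutation carries over
            nxt.append(layer[-1])
        layer = nxt
        steps += 1
    return layer[0][start], steps
```

For instance, with four permutations the result agrees with applying them sequentially, while only two combining steps are needed instead of four applications.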
Related Work

Many graph problems are studied in the CLIQUE model. There are fast protocols in the CLIQUE model for distance computations [CKK+18, CPS20], and more. To the best of our knowledge, there are no previous works that study the scheduling of jobs in the CLIQUE model. In the past, it has been shown that running multiple instances of the same protocol on different inputs can result in fast algorithms for some complex problems. We survey some of these.

Hegeman et al. [HPP+15] reduce the MST problem to multiple smaller instances of graph connectivity, breaking below the long-standing upper bound of O(log log n) by Lotker et al. [LPPP05]. Further variants and improvements on the MST problem [Kor16, GP16, JN18, Now19] all exploit invoking multiple instances of sparser problems. This line of work culminated in the deterministic O(1)-round algorithm of Nowicki [Now19]. In [GN18], Ghaffari and Nowicki show a randomized algorithm which solves O(n^{1−ǫ}) many instances of the MST problem in O(ǫ^{−1}) rounds. This is used for finding the minimum cut of a graph. The state-of-the-art O(1)-round algorithm for the minimum cut problem, by Ghaffari et al. [GNT20], runs Θ(log n) instances of connected components as a subroutine. The complexity of computing multiple matrix multiplications in parallel was explored by Le Gall [Gal16] and was used in the same paper to solve the all-pairs-shortest-path problem.

The notion of LocalCongestion is somewhat similar to the notion of
Communication Degree Complexity [KNPR15]. The difference lies in the fact that the Communication Degree Complexity is an upper bound on the number of messages sent or received by any node in any round, while LocalCongestion is an upper bound on the total number of messages sent or received by any node over all rounds.
Preliminaries
The CLIQUE Model. In the (CONGESTED) CLIQUE model, n machines p_0, . . . , p_{n−1} communicate with each other in synchronous rounds in an all-to-all fashion. In each round, any pair of machines can exchange O(log n) bits. There is usually no constraint on either the size of the local memory or the time complexity of the local computations. Besides the local memory, each machine has a read-only input buffer and a write-only output buffer, as well as read/write incoming- and outgoing-message buffers.

Routing in the CLIQUE Model. Lenzen's routing scheme [Len13] says that a set of messages can be routed in the CLIQUE model within O(1) rounds, given that each machine sends and receives at most O(n) messages. We formally state it here in its generalized version, which addresses the case of more than a linear number of messages. In the generalized version, each machine p_i holds a set of messages M_i = ∪_{i′∈[n]} M_{i,i′}, where M_{i,i′} is the set of messages of p_i with destination p_{i′}. The claim follows by having each machine chop its set of messages M_i into chunks of n messages, each of which contains |M_{i,i′}| · n/X messages for each i′ ∈ [n], and applying the original routing scheme X/n times.

The routing scheme can be adapted to preserve the message complexity in the following way. Let Y = Σ_{i∈[n]} |M_i| ≤ n² be the total number of messages. First, compute a global numbering of the messages and the total number of messages Y. Then, send O(⌈√Y⌉) messages to each one of the first O(⌈√Y⌉) machines via intermediate nodes, based on the numbering. Sort the messages by destination using Lenzen's sorting algorithm [Len13] over the O(⌈√Y⌉)-clique. Finally, deliver the messages to their destinations via intermediate nodes, based on the indices of the messages in the sorted sequence. The round complexity of this algorithm is O(1) and its message complexity is O(Y + ⌈√Y⌉ · ⌈√Y⌉) = O(Y).

Claim 2.1 (Lenzen's Routing Scheme). Let X be a globally known value and let P be the property that |M_i| ≤ X for all i ∈ [n] and Σ_{i∈[n]} |M_{i,i′}| ≤ X for all i′ ∈ [n]. There is an algorithm in the CLIQUE model which completes in O(⌈X/n⌉) rounds and O(Σ_{i∈[n]} |M_i|) messages, and delivers all messages if P holds, or indicates that it does not hold.

Protocols and Jobs. A protocol is run on an input that is provided in a distributed manner in the read-only input buffer of each machine.
The complexity of a protocol is the number of synchronous rounds until each machine has finished writing its output to its write-only output buffer. A job is an instance of a protocol together with a given input, and a job is finished when each machine has written its output. We generally assume that each job finishes in poly(n) rounds.

For our purposes of fast scheduling, we need to specify the internals of each synchronous round. We follow the standard description, which is usually omitted and simply referred to as a 'round'. We require that for each machine, the input and output buffers are only accessed in the first and last rounds of the protocol on that machine, respectively. In particular, this means that any further access to the input requires storing it in the local memory. Accessing the incoming- and outgoing-message buffers is not restricted to certain rounds. Each synchronous round of a protocol consists of 3 steps, in the following order.

• Receiving Step: Read from the incoming-message buffer (or from the input buffer if this is the first round), possibly modifying the local memory.

• Computation Step: Possibly modify the local memory.

• Sending Step: Write to the outgoing-message buffer (or to the output buffer if this is the last round), possibly modifying the local memory.

After these 3 steps, all messages written in the outgoing-message buffers are delivered into the incoming-message buffers of their targets. (We thank an anonymous reviewer for pointing this out.)
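The three-step round structure above can be made concrete with a minimal toy simulator (class and function names are our own, purely for illustration): nodes expose one hook per step, and the driver delivers outgoing buffers between rounds.

```python
class Node:
    """A node of a toy job; subclasses override the three per-round steps."""
    def __init__(self, ident, n):
        self.ident, self.n = ident, n
        self.inbox = []      # incoming-message buffer
        self.memory = {}     # local memory

    def receiving_step(self, r):
        pass

    def computation_step(self, r):
        pass

    def sending_step(self, r):
        return []            # list of (destination, message) pairs

def run_protocol(nodes, rounds):
    """Drive Receiving -> Computation -> Sending, then deliver all messages."""
    for r in range(rounds):
        for v in nodes:
            v.receiving_step(r)
        for v in nodes:
            v.computation_step(r)
        outgoing = [(d, m) for v in nodes for (d, m) in v.sending_step(r)]
        for v in nodes:
            v.inbox = [m for (d, m) in outgoing if d == v.ident]

# Usage: a two-round job in which node 0 broadcasts a value in round 0
# and every node stores it in round 1.
class Broadcast(Node):
    def receiving_step(self, r):
        if self.inbox:
            self.memory["val"] = self.inbox[0]

    def sending_step(self, r):
        if self.ident == 0 and r == 0:
            return [(d, 42) for d in range(self.n)]
        return []

nodes = [Broadcast(i, 4) for i in range(4)]
run_protocol(nodes, 2)
```

Note how the input/output buffers of the model correspond here to the first receiving step and the last sending step, matching the access restriction stated above.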
The Scheduling Problem.
In the t-scheduling problem (or simply a scheduling problem, if t is clear from the context), the objective is to execute t jobs. Since our goal is to do this in an efficient manner, we wish to allow a machine to simulate a computation that originally should take place in a different machine in a naïve execution of the t jobs. To this end, we distinguish between the physical machines and the nodes, which are the virtual machines that need to execute each job. That is, for each job j we denote by {v_{i,j} | i ∈ [n]} the set of nodes that need to execute job j.

Formally, in the t-scheduling problem, the input for machine p_i is composed of the inputs of all the nodes with identifiers of the form v_{i,j} for each job j ∈ [t]. We also assume that each machine knows the protocol for each of the t jobs. An algorithm solves the scheduling problem or schedules the jobs when each job has finished writing its output. That is, for deterministic jobs, we require each machine p_i to write the output of the nodes v_{i,j} for all j ∈ [t]. For randomized jobs, the machines' output distribution for each job has to be equal to the distribution of outputs in a naïve execution of the job. In the rest of the paper, we refer to the scheduling solution as an algorithm, while we use the term protocol only for the content of a job.

Notations.
Following the widespread conventions, we denote by log the logarithm base 2, and by ln the natural logarithm. Also, we denote [n] = {0, 1, . . . , n − 1}. We denote by s^r_{i,j} and t^r_{i,j} the number of messages sent and received by v_{i,j} in round r, respectively. If job j terminates before round r, we set s^r_{i,j} = t^r_{i,j} = 0. We sometimes drop the superscript r when it is clear from the context. We denote by ℓ_j the round complexity of job j, and by m_j = Σ_{i∈[n], r∈[ℓ_j]} s^r_{i,j} = Σ_{i∈[n], r∈[ℓ_j]} t^r_{i,j} the total number of messages sent or received during the execution of job j, i.e., the message complexity of job j. Another notation we extensively use is m_r = Σ_{i∈[n], j∈[t]} s^r_{i,j} = Σ_{i∈[n], j∈[t]} t^r_{i,j}, which is the number of messages all nodes in all jobs send or receive during round r.

Congestion parameters.
We define the normalized GlobalCongestion as the total number of messages sent by all the jobs divided by n², and the normalized LocalCongestion as the maximum number of messages sent or received by the nodes of a single machine over the entire course of the execution of all jobs, divided by n. Formally,

dilation = max_{j∈[t]} ℓ_j ,

GlobalCongestion = Σ_{j∈[t]} m_j / n² = Σ_{i∈[n]} Σ_{j∈[t]} Σ_{r∈[ℓ_j]} s^r_{i,j} / n² = Σ_{r∈[dilation]} m_r / n² ,

LocalCongestion = max { max_{i∈[n]} Σ_{j∈[t]} Σ_{r∈[ℓ_j]} s^r_{i,j} / n , max_{i∈[n]} Σ_{j∈[t]} Σ_{r∈[ℓ_j]} t^r_{i,j} / n } .

Hoeffding bound.
Some of our proofs use the following Hoeffding bound.
Claim 2.2 (Hoeffding Bound [Hoe63]). Let {X_i}^n_{i=1} be independent random variables with values in the interval X_i ∈ [0, 1] and the expectation of their sum bounded by E[Σ^n_{i=1} X_i] ≤ µ. Then, for all ǫ > 0,

Pr[ Σ^n_{i=1} X_i ≥ (1 + ǫ)µ ] ≤ ( e^ǫ / (1 + ǫ)^{1+ǫ} )^µ ≤ e^{−ǫ²µ/(2+ǫ)} .

Deterministic Scheduling
The objective of this section is to prove the following theorem.
Theorem 3.1.
There is a deterministic algorithm that schedules t = poly(n) jobs that are M-memory efficient in O(GlobalCongestion + ⌈M · t/n⌉ · dilation) rounds.

The formal definition of an M-memory efficient job as used in Theorem 3.1 is as follows.

Definition 3.2 (M-memory efficient job). For a given value M, an M-memory efficient job is a job in which for each node v in each round r, the state (local memory) of v at the end of the Computation Step can be encoded in M log n bits. In addition, there is a function that, given the state of node v after the Computation Step of round r, infers the number of messages it sends and receives in this round.

Theorem 3.1 requires that jobs use at most M log n bits of local memory per node. Thus, the power of the result is when M = o(n), as otherwise the naïve execution of the jobs one after another schedules them in dilation · t rounds. In the case that M · t = O(n), the runtime becomes O(GlobalCongestion + dilation), which is optimal up to a constant factor as, clearly, any schedule for any collection of jobs requires at least Ω(GlobalCongestion + dilation) rounds.

To schedule the jobs for Theorem 3.1, we work in epochs. Each machine p_i first simulates round 0 up to the end of the Computation Step for the nodes v_{i,j}, for each j ∈ [t]. This does not require any communication. Then, the epochs are such that for each round r, at the start of epoch r, all nodes in all jobs are at the end of the Computation Step of round r. Clearly, for each simulated node that finishes in round r, the machine does not need to do anything for the part that executes the beginning of round r + 1. The reason why we execute the protocol in these shifted epochs, from the Sending Step of round r (including) to the Sending Step of round r + 1 (excluding), lies in the fact that the bottleneck is the possible imbalance in communication.

Recall that m_r denotes the number of messages all nodes from all jobs send in round r.
Since in each round of the CLIQUE model at most n² messages can be exchanged, routing m_r messages cannot be done faster than in ⌈m_r/n²⌉ rounds. We aim to execute an epoch in this optimal number of O(⌈m_r/n²⌉) rounds. We start with the simple case and then use it to solve the general case.

The first case is when m_r ≤ 2n². In Lemma 3.4, we show that in this case, we can route all m_r messages in O(⌈M · t/n⌉) rounds. The challenge we encounter is that although m_r ≤ 2n², we are not promised that the messages are balanced across the machines in the following sense. It is possible that some machine p_i, which simulates the nodes v_{i,j} for all jobs 0 ≤ j < t, is required to send significantly more than n messages when summing over all messages that need to be sent by these nodes v_{i,j}. We overcome this issue by assigning the simulation of some of these nodes to some other machine p_{i′}, which originally has a smaller load of messages to send. The crux that underlies our ability to defer the simulation of a node v_{i,j} to a machine p_{i′} is that the state of the node does not consume too many bits. We show how to compute a well-balanced assignment of nodes to machines in Claim 3.3. This assignment allows us to execute the epoch in the claimed number of O(⌈M · t/n⌉) rounds.

In the general case, we can have m_r > 2n². We show how to carefully split up the messages that need to be sent into chunks that allow us to use multiple invocations of Lemma 3.4. This allows us to execute the epoch in O(⌈m_r/n² + M · t/n⌉) rounds. As the core of our algorithm is handling the case m_r ≤ 2n², we now focus on this case.

We start with the following notation. An assignment of nodes to machines corresponds to a function ϕ : [n] × [t] → [n], where ϕ(i, j) = k says that the i-th node in job j, i.e., v_{i,j}, is assigned to the k-th machine p_k.
We sometimes abuse notation and write ϕ(v_{i,j}) = p_k for ϕ(i, j) = k. We call an assignment balanced if the number of nodes assigned to each machine is O(t), i.e., if for each k, it holds that |ϕ^{−1}(p_k)| = O(t). The (balanced) assignment ϕ(i, j) = i is called the trivial assignment. We denote by S_{i,j,r} the state of node v_{i,j} after its Computation Step in round r.

Claim 3.3 (Distributing the states). Given are t jobs that are M-memory efficient, and globally known initial and final balanced assignments, ϕ_s and ϕ_f, respectively. Assume that for each i ∈ [n] and j ∈ [t], machine ϕ_s(i, j) holds the state S_{i,j,r} of node v_{i,j} after its Computation Step in round r. Then, there exists a deterministic CLIQUE algorithm which completes in O(⌈M · t/n⌉) rounds and moves the states according to ϕ_f, that is, at the end of the algorithm, for each i ∈ [n] and j ∈ [t], machine ϕ_f(i, j) holds the state S_{i,j,r} of node v_{i,j}.

Proof of Claim 3.3.
For each node v_{i,j}, denote i′ = ϕ_f(i, j). For each node v_{i,j} such that i′′ = ϕ_s(i, j), machine p_{i′′} sends S_{i,j,r} to machine p_{i′}. Overall, each machine p_i sends |ϕ_s^{−1}(p_i)| · M = O(t · M) messages and receives |ϕ_f^{−1}(p_i)| · M = O(t · M) messages. Thus, by Claim 2.1, this completes in O(⌈M · t/n⌉) rounds.

Lemma 3.4 (Scheduling of a round with m_r ≤ 2n² messages). Given are t jobs that are M-memory efficient, and given is a round number r for which m_r ≤ 2n². Assume that for each i ∈ [n], p_i holds S_{i,j,r} for all j ∈ [t]. Then there exists a deterministic CLIQUE algorithm which completes in O(⌈M · t/n⌉) rounds, at the end of which, for each i ∈ [n], p_i holds S_{i,j,r+1} for all j ∈ [t].

The outline of the algorithm is as follows. Each machine partitions its simulated nodes into buckets of contiguous ranges of indices, such that the nodes in each bucket send and receive O(n) messages altogether. Thus, the messages of all nodes in a bucket can be sent or received by a single machine. We show that the number of buckets over all machines is O(n). The machines collectively assign the buckets such that each machine gets O(1) buckets, and they make the assignment globally known. Then, the states S_{i,j,r} are distributed according to the assignment using Claim 3.3, each machine executes the Sending Step of round r for each of its newly assigned nodes, and all messages get delivered. Then, each machine executes the remainder of the protocol of its newly assigned nodes until after the Computation Step of round r + 1. Finally, the states S_{i,j,r+1} for round r + 1 are distributed back according to the trivial assignment.

Proof of Lemma 3.4.
We begin by describing the algorithm (see Algorithm 1). Afterwards, we prove its correctness and analyze the round complexity.
Algorithm 1 Simulating a round with m_r ≤ 2n².

1: Compute the balanced assignment ϕ : [n] × [t] → [n].
2: Distribute the states according to the assignment ϕ.
3: Execute the protocol for round r, accounting for ϕ.
4: Distribute the states back according to the trivial assignment.
The Algorithm.
We first show how to split the nodes into buckets. Then we show how to compute a globally known assignment ϕ, distribute the nodes according to ϕ, execute the jobs until after the next Computation Step, and assign the nodes back to their initial machines.

Forming buckets (locally): Each machine p_i, for each j ∈ [t], uses S_{i,j,r} to locally compute s_{i,j} and t_{i,j}, the number of messages each node v_{i,j} sends and receives in round r, respectively. This is possible by the definition of an M-memory efficient job. Let S_i = Σ^{t−1}_{j=0} s_{i,j} and T_i = Σ^{t−1}_{j=0} t_{i,j}. Then, each machine p_i (locally and independently) applies [CDKL19, Lemma 7] (restated in Claim 3.5 for better readability) with k = k_i = ⌈max{S_i/n, T_i/n}⌉ to the sequences (s_{i,j})^{t−1}_{j=0} and (t_{i,j})^{t−1}_{j=0}, to split its nodes into k_i buckets B_{i,0}, . . . , B_{i,k_i−1} of contiguous ranges of the jobs' indices.

Claim 3.5 (Lemma 7 from [CDKL19]). Let s_0, . . . , s_{n−1} ∈ N and t_0, . . . , t_{n−1} ∈ N be sequences of natural numbers where each number is upper bounded by s and t, respectively. Let S = Σ_{j∈[n]} s_j and T = Σ_{j∈[n]} t_j. Then, for any k ∈ N, there is a partition of [n] into k sets B_0, . . . , B_{k−1}, such that for each i, the set B_i consists of consecutive elements, and

Σ_{j∈B_i} s_j ≤ 2(S/k + s)  and  Σ_{j∈B_i} t_j ≤ 2(T/k + t).

Invoking Claim 3.5 with s = n ≥ s_{i,j}, t = n ≥ t_{i,j}, S = S_i, and T = T_i implies that, for each i ∈ [n] and i′ ∈ [k_i], the nodes inside each bucket B_{i,i′} want to send/receive at most 4n messages, i.e.,

Σ_{j∈B_{i,i′}} s_{i,j} ≤ 2(S_i/k_i + s) ≤ 2(S_i/(S_i/s) + s) = 4s = 4n, and
Σ_{j∈B_{i,i′}} t_{i,j} ≤ 2(T_i/k_i + t) ≤ 2(T_i/(T_i/t) + t) = 4t = 4n.
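A partition with Claim 3.5-style guarantees can be obtained by a simple greedy sweep: close the current bucket as soon as its send-load reaches 2S/k or its receive-load reaches 2T/k. The sketch below is our own construction, slightly looser than the claim (it may use one extra bucket), with the bounds checked as assertions:

```python
def bucket_partition(sends, recvs, k):
    """Greedily split indices 0..n-1 into consecutive buckets.  Each bucket
    carries at most 2(S/k + max_send) sends and 2(T/k + max_recv) receives,
    and at most k + 1 buckets are produced (the lemma achieves k)."""
    S, T = sum(sends), sum(recvs)
    buckets, cur, cs, ct = [], [], 0, 0
    for j in range(len(sends)):
        cur.append(j)
        cs += sends[j]
        ct += recvs[j]
        # Close the bucket once either load crosses its threshold.
        if (S > 0 and cs >= 2 * S / k) or (T > 0 and ct >= 2 * T / k):
            buckets.append(cur)
            cur, cs, ct = [], 0, 0
    if cur:
        buckets.append(cur)
    return buckets
```

Each closed bucket exceeded before its last element neither 2S/k in sends nor 2T/k in receives, so adding one more element overshoots by at most the maximum single value, giving the stated per-bucket bounds; and since a closed bucket consumes at least 2S/k of S or 2T/k of T, at most k buckets close, plus possibly one trailing bucket.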
Recall that the buckets of machine p_i are numbered from 0 to k_i − 1. Define, for each i ∈ [n] and i′ ∈ [k_i],

f(i, i′) = ⌊(i′ + Σ_{i″<i} k_{i″}) / 5⌋,

that is, the buckets of all machines are numbered consecutively in a globally known order and assigned to the machines in groups of five; since Σ_{i∈[n]} k_i ≤ 5n, we have f(i, i′) ∈ [n]. Bucket B_{i,i′} is assigned to machine p_{f(i,i′)}.
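The truncated formula above admits the following natural reading (an assumption on our part, hedged accordingly): number the buckets consecutively across machines and assign them in groups of five, which realizes the "at most 5 buckets per machine" property used in the round-complexity analysis.

```python
from itertools import accumulate

def bucket_assignment(k, n):
    """Hypothetical reading of f(i, i'): bucket B_{i,i'} with global number
    g = i' + sum_{i''<i} k_{i''} goes to machine floor(g / 5)."""
    prefix = [0] + list(accumulate(k))  # prefix[i] = sum of k_{i''} for i'' < i
    return {(i, ip): (ip + prefix[i]) // 5 for i in range(n) for ip in range(k[i])}

n = 5
k = [3, 0, 2, 4, 1]  # k_i buckets on machine p_i, sum(k) <= 5n
f = bucket_assignment(k, n)
loads = [sum(1 for m in f.values() if m == i) for i in range(n)]
```

Grouping by fives makes each machine responsible for at most five consecutive buckets, matching the text's load bound.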
For each i ∈ [n] and j ∈ [t], the machine p_{φ(i,j)} receives the state S_{i,j,r} and executes the Sending Step of round r, the Receiving Step of round r + 1, and the Computation Step of round r + 1 for node v_{i,j}. Thus, afterwards it holds the state S_{i,j,r+1}. Since this state is then sent back to p_i, the correctness follows.

Round Complexity.
The partitioning of each machine's nodes into buckets is done locally, without communication. Broadcasting the number of buckets (the value of k_i) can be done in a single round. We next reason about the time complexity that is required to make the assignment φ globally known. The computation of f(i, i′) is done locally. Informing machine p_{f(i,i′)} about the smallest and largest job in the bucket B_{i,i′} requires each machine p_i to send at most t messages and to receive at most 5 messages. Thus, by ⌈t/n⌉ invocations of Claim 2.1, this step completes in O(⌈t/n⌉) rounds. Since each machine p_i is assigned at most 5 buckets, and for each bucket assigned to it the machine broadcasts a constant number of elements (the smallest and largest job index in it, together with the identifier i′), this step completes in O(1) rounds.

The runtime is hence dominated by distributing the states via Claim 3.3, which takes O(⌈M · t/n⌉) rounds. All nodes in a bucket send/receive at most 4n messages in total, and each machine executes the sending/receiving phase for at most 5 buckets; thus these steps are done in O(1) rounds by Claim 2.1.

The next lemma deals with the general case, where the total number of messages m_r might be larger than 2n^2.

Lemma 3.6 (Scheduling of a round r). Given are t jobs that are M-memory efficient, and given is a round number r. Assume that for each i ∈ [n], p_i holds S_{i,j,r} for all j ∈ [t]. Then there exists a deterministic CLIQUE algorithm which completes in O(⌈m_r/n^2 + M · t/n⌉) rounds, at the end of which, for each i ∈ [n], p_i holds S_{i,j,r+1} for all j ∈ [t].

The proof of Lemma 3.6 uses the next claim to split all jobs into chunks that send smaller numbers of messages, in order to apply Lemma 3.4.
Claim 3.7.
Let S be a non-empty (globally known) set of consecutive indices of size at most n^c for some constant c > 0, and let x > 0. Each machine p_i has a sequence of numbers (s_{i,j})_{j∈S} that are all upper bounded by n. There is a deterministic algorithm in the CLIQUE model, which in O(1) rounds finds the minimum index j ∈ S (if it exists) that satisfies

x ≤ Σ_{j′∈S, j′≤j} Σ_{i=0}^{n−1} s_{i,j′} ≤ x + n^2.   (1)

We solve this problem in c recursion levels. On recursion level c′, which goes from c down to 1, we start with a search space S_{c′} of size n^{c′} and finish with a search space S_{c′−1} of size n^{c′−1}. After each level, we maintain the invariant that if the required j exists then j − 1 ∈ S_{c′−1}, and that S_{c′−1} is contiguous; we always maintain a search space of consecutive indices. Next, we explain the c′-th recursion level, and for that purpose assume that the current search space S_{c′} is of size n^{c′} for 0 ≤ c′ ≤ c. If this is not the case, we append dummy indices to make S_{c′} of size exactly n^{c′}. To narrow down the search space, we compute n prefix sums S_{ℓ_1}, …, S_{ℓ_n}, where S_{ℓ_{i′}} sums up all values with index j < ℓ_{i′} over all machines. The indices ℓ_0 = min S_{c′}, …, ℓ_n = ℓ_0 + n^{c′} are equidistantly placed in S_{c′}. Let i′ be the largest index such that S_{ℓ_{i′}} < x. The new search space is formed by the indices S_{c′−1} = [ℓ_{i′}, ℓ_{i′+1}). After the last recursion level we obtain a singleton search space S_0. We return j as that value plus one if it is less than n^c; otherwise, we respond that the required j does not exist.

Proof of Claim 3.7. Algorithm:
Initially, we may assume that the search space S is of size exactly n^c. If this is not the case, we append dummy indices to the end of S; in other words, we add the indices {max S + 1, max S + 2, …, max S + n^c − |S|} to S, obtaining the range of indices [min S, …, min S + n^c − 1]. We then run c recursion levels, in each of which we decrease the size of the search space by a factor of n, while always maintaining a search space of consecutive indices.

Consider level c′ with the search space S_{c′}. Let ℓ_0 be the smallest index in S_{c′}, and for 1 ≤ i′ ≤ n let ℓ_{i′} = ℓ_0 + i′ · n^{c′−1}. Now, each machine p_i builds prefix sums S^i_{ℓ_1}, …, S^i_{ℓ_n} of its own numbers, that is,

S^i_{ℓ_{i′}} = Σ_{j∈S, j<ℓ_{i′}} s_{i,j}.

Then, all machines send their computed prefix sum corresponding to ℓ_{i′+1} to machine p_{i′}, which sums up all received prefixes; that is, afterwards machine p_{i′} holds S_{ℓ_{i′+1}} = Σ_{i∈[n]} S^i_{ℓ_{i′+1}}. In a second round of communication, S_{ℓ_1}, …, S_{ℓ_n} are broadcast, and every machine can determine the new search space S_{c′−1} = [ℓ_{i′}, ℓ_{i′+1}), where i′ is the largest number such that S_{ℓ_{i′}} < x. After c levels of recursion the search space consists of a single index ℓ. We return j = ℓ + 1.

Correctness:
By induction, before level c′ the search space size is |S_{c′}| = n^{c′}, and the largest index ℓ such that S_ℓ < x belongs to S_{c′}. Thus, after c levels, the search space is a singleton ℓ. This means that x ≤ S_j in case j < n^c. In case we return that j does not exist, it holds that ℓ = n^c − 1, and so the sum of all s_{i,j} is indeed below x. As each s_{i,j} is upper bounded by n, we obtain Σ_{i∈[n]} s_{i,j} ≤ n^2, and as S_{j−1} < x, we conclude that S_j = S_{j−1} + Σ_{i∈[n]} s_{i,j} < x + n^2, as required.

Round complexity: As all s_{i,j} are upper bounded by n and t is polynomial in n, all numbers can be sent in O(log n)-bit messages. Each recursion level can be implemented in O(1) rounds; thus we need O(c) = O(1) rounds in total.

We continue with the proof of Lemma 3.6.

Proof of Lemma 3.6. Algorithm. A short pseudocode is given in Algorithm 2.

Algorithm 2 Scheduling of a round.
1. Split the jobs into chunks J_1, J_2, …, J_k.
2. for each chunk J_{k′} do
3.   Apply Algorithm 1 on J_{k′}.

We use Claim 3.7 to split the jobs into k = O(⌈m_r/n^2⌉) chunks J_1, …, J_k, such that the jobs in each chunk send at most 2n^2 messages in round r over all of their nodes. Then, we iteratively apply Lemma 3.4 on each chunk to progress each job to the next round.

Forming chunks: First, each machine p_i, for each job j, uses S_{i,j,r} to locally compute the number of messages s_{i,j} that node v_{i,j} sends in round r. Assume that chunks J_1, …, J_{k′−1} have been formed, and let S = [t] \ (J_1 ∪ · · · ∪ J_{k′−1}). We apply Claim 3.7 with the index set S, where machine p_i holds the sequence (s_{i,j})_{j∈S}, and with x = n^2. If we find j, then by the guarantee of Claim 3.7 we obtain that all jobs in a chunk J_{k′}, for k′ < k, send at least n^2 messages and at most 2n^2 messages in round r; the jobs in chunk k send at most 2n^2 messages. Otherwise, if we do not find j, the nodes of the jobs in S send less than 2n^2 messages in round r, so we obtain the last chunk and set J_k = J_{k′} = S. We thus have k ≤ ⌈m_r/n^2⌉.

Executing round r + 1: Since, by construction, the jobs in each chunk send at most 2n^2 messages, we can iteratively apply Lemma 3.4 on the chunks.

Round complexity. We split the jobs into at most k = O(⌈m_r/n^2⌉) chunks, where forming each chunk can be done in O(1) rounds by Claim 3.7.
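The c-level narrowing of Claim 3.7 can be simulated sequentially; in the sketch below (a centralized stand-in for the distributed prefix-sum exchange), column j's total plays the role of Σ_{i∈[n]} s_{i,j}:

```python
def find_chunk_end(col_sums, x, n, c=2):
    """Smallest j with x <= sum(col_sums[:j+1]), found by c levels of n-ary
    narrowing as in Claim 3.7; returns None if the total stays below x."""
    if sum(col_sums) < x:
        return None
    vals = list(col_sums) + [0] * (n**c - len(col_sums))  # dummy indices
    lo = 0  # left end of the current search space
    for level in range(c, 0, -1):
        step, base = n ** (level - 1), lo
        for ip in range(n):  # n equidistant probe points per level
            probe = base + ip * step
            if sum(vals[:probe]) < x:  # the prefix sum S_probe, as in the proof
                lo = probe             # keep the rightmost probe with S_probe < x
    return lo

cols = [2, 1, 4, 1]  # per-job message totals in round r
```

With x set to the chunk threshold, the returned index is the last job of the next chunk.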
The invocation of Lemma 3.4 on chunk J_{k′} takes O(⌈M · |J_{k′}|/n⌉) rounds per chunk. Thus, the round complexity of the algorithm is

O(1) + Σ_{k′=1}^{k} O(⌈M · |J_{k′}|/n⌉) = O(k + M · Σ_{k′=1}^{k} |J_{k′}|/n) = O(⌈m_r/n^2 + M · t/n⌉).

Finally, we use Lemmas 3.4 and 3.6 to obtain the near-optimal scheduling of Theorem 3.1.

Theorem 3.1. There is a deterministic algorithm that schedules t = poly n jobs that are M-memory efficient in O(GlobalCongestion + ⌈M · t/n⌉ · dilation) rounds.

Proof of Theorem 3.1. We repeatedly apply Lemma 3.6 until all jobs terminate. First, each machine p_i reads the input for each node v_{i,j} for each j ∈ [t], and executes the Computation Step of round r = 0, as a result of which it holds the state S_{i,j,0} for each of its nodes. Then, we split the execution into epochs, where in epoch r all jobs move from the Computation Step of round r to the Computation Step of round r + 1. A single epoch is implemented via Lemma 3.6 in O(⌈m_r/n^2 + M · t/n⌉) rounds. After epoch r = dilation − 1, all machines compute the outputs given the respective terminating state S_{i,j,dilation−1} of each of their nodes.

Round complexity. The pre-processing in round r = 0 and the post-processing in the last round r = dilation − 1 require no communication. The round complexity of epoch r for all jobs is O(⌈m_r/n^2 + M · t/n⌉), where m_r is the number of messages sent in round r. Since Σ_{r=0}^{dilation−1} m_r/n^2 = GlobalCongestion, we obtain the overall round complexity by

Σ_{r∈[dilation]} O(⌈m_r/n^2 + M · t/n⌉) = O(GlobalCongestion + dilation · ⌈M · t/n⌉).

In this section we show and compare the two approaches for randomized scheduling: random shuffling (Section 4.1) and random delaying (Section 4.2). In contrast to Theorem 3.1, the results in this section do not require the jobs to be memory efficient.

4.1 Scheduling through Random Shuffling

In this subsection we use random shuffling to schedule I/O efficient jobs, and we obtain the following theorem.

Theorem 4.1.
There is a randomized algorithm in the CLIQUE model that schedules t = poly n jobs that are I/O efficient in O(t + GlobalCongestion + dilation · log n) rounds, w.h.p.

The definition of an I/O efficient job as used in Theorem 4.1 is as follows.

Definition 4.2 (I/O efficient job). An I/O efficient job is a job in which each node receives and produces at most O(n log n) bits of input and output.

Algorithm. The high-level overview of the algorithm for Theorem 4.1 (see Algorithm 3) consists of three steps: Input Shuffling, Execution, and Output Unshuffling.

Algorithm 3 Scheduling of I/O efficient jobs.
1. Input Shuffling.
2. Execution: Run dilation many phases, where in phase r each machine p_i runs the protocol for its nodes {v_{π_j^{−1}(i),j} | j ∈ [t]}, and messages are routed via Claim 2.1.
3. Output Unshuffling.

Input Shuffling: We iterate sequentially through the jobs. For each job j, a leader machine, say, p_0, generates a uniformly random permutation π_j : [n] → [n]. The permutation becomes globally known within two rounds by having p_0 send π_j(i) to each p_i, after which each p_i broadcasts π_j(i) to all machines. In the last round of this subroutine, each machine p_i sends the input of v_{i,j} to machine p_{π_j(i)}. A single round is sufficient because the job is I/O efficient. Thus, at the end, machine p_i holds the states of the nodes v_{π_j^{−1}(i),j} for all j ∈ [t]. We call this subroutine Input Shuffling.

Execution: In dilation many phases we progress each job by one round. That is, each machine p_i performs all actions of the nodes that it holds, which are v_{π_j^{−1}(i),j} for all j ∈ [t]. In order to use Claim 2.1 efficiently for each phase r, the machines need to compute a bound on the number of messages that any of them sends or receives in phase r. To this end, the machines jointly compute the value of m_r = Σ_{j∈[t]} Σ_{i∈[n]} s^r_{i,j}, where s^r_{i,j} is the number of messages that node v_{i,j} sends in round r.
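The Input Shuffling subroutine just described can be sketched sequentially (permutations are drawn centrally here, standing in for the leader's two-round broadcast):

```python
import random

def shuffle_inputs(inputs, n, t, rng):
    """For each job j, draw a uniform permutation pi_j and move the input of
    node v_{i,j} to machine pi_j(i), as in the Input Shuffling subroutine."""
    perms = [rng.sample(range(n), n) for _ in range(t)]  # pi_j as a list
    shuffled = {(perms[j][i], j): inputs[i, j]
                for i in range(n) for j in range(t)}
    return perms, shuffled

n, t = 6, 4
inputs = {(i, j): f"in-{i}-{j}" for i in range(n) for j in range(t)}
perms, shuffled = shuffle_inputs(inputs, n, t, random.Random(1))
# Output Unshuffling: invert each pi_j to restore the trivial assignment.
inv = [{perms[j][i]: i for i in range(n)} for j in range(t)]
restored = {(inv[j][m], j): v for (m, j), v in shuffled.items()}
```

Because each π_j is a permutation, unshuffling via the inverse maps recovers exactly the original placement.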
They do this by having each machine p_i send Σ_{j∈[t]} s^r_{π_j^{−1}(i),j} to a leader machine, say, p_0, which then sums these values and broadcasts their sum m_r. That is, m_r is the total number of messages sent by all nodes in all jobs in round r, and we show that for each i ∈ [n], O(m_r/n + n log n) is a bound on Σ_{j∈[t]} s^r_{π_j^{−1}(i),j} (respectively, Σ_{j∈[t]} t^r_{π_j^{−1}(i),j}), which is the number of messages that machine p_i has to send (receive) in phase r, to be used when invoking Claim 2.1.

Output Unshuffling: At the end, after each machine executes the protocols until they finish, we use a single round of communication for each job to unshuffle the outputs according to π_j^{−1}. At the end of this Output Unshuffling subroutine, machine p_i holds the output of v_{i,j} for all j ∈ [t]. This finishes the description of the algorithm.

In the following lemma, we bound the number of messages that each machine has to send/receive in one phase by X = O(m_r/n + n · log n).

Lemma 4.3. Consider t jobs and a set of permutations {π_j}_{j∈[t]} generated uniformly at random, and let S = max_{i∈[n]} Σ_{j∈[t]} s^r_{π_j^{−1}(i),j} and R = max_{i∈[n]} Σ_{j∈[t]} t^r_{π_j^{−1}(i),j}. Then, w.h.p., it holds that X = max{S, R} = O(m_r/n + n log n), where m_r = Σ_{i∈[n]} Σ_{j∈[t]} s^r_{i,j}.

Proof of Lemma 4.3. Let c ≥ 1 be a constant. Denote by S^r_{i,j} = Σ_{i′∈[n]} s^r_{i′,j} · 1[π_j(i′) = i] the random variable whose value is the number of messages sent by machine p_i for job j (note that there is a single i′ = π_j^{−1}(i) for which i = π_j(i′), but this i′ is also a random variable). These variables are bounded by n and are independent for different j. Denote by S^r_i = Σ_{j∈[t]} S^r_{i,j}/n the random variable whose value is the total number of messages machine p_i sends, normalized by n. Denote c′ = c + 2.
We show that the normalized number of messages machine p_i sends is bounded as S^r_i ≤ m_r/n^2 + 2c′ ln n, with probability at least 1 − n^{−c′}.

First, we note that the expected normalized number of messages machine p_i sends is

E[Σ_{j∈[t]} S^r_{i,j}/n] = Σ_{j∈[t]} Σ_{i′∈[n]} s^r_{i′,j} · E[1[π_j(i′) = i]]/n = Σ_{j∈[t]} Σ_{i′∈[n]} s^r_{i′,j}/n^2 = m_r/n^2,

where the first equality holds due to the linearity of expectation, the second one holds since π_j is sampled uniformly (so E[1[π_j(i′) = i]] = 1/n), and the last one is due to the definition of m_r.

Since for different j the variables S^r_{i,j}/n are independent, we use Claim 2.2 (Hoeffding Bound) with a relative error ε > 0, which we later optimize, to bound the probability that a machine has too many messages to send:

Pr[S^r_i > (1 + ε) m_r/n^2] = Pr[Σ_{j∈[t]} S^r_{i,j}/n > (1 + ε) m_r/n^2] < e^{−ε^2 m_r/((2+ε)n^2)}.

If m_r ≥ c′ · n^2 ln n, then for ε = 2 we have that e^{−ε^2 m_r/((2+ε)n^2)} ≤ e^{−c′ ln n} = n^{−c′}. In other words, w.h.p., 3m_r/n^2 = O(m_r/n^2) rounds are sufficient for machine p_i to send all of its required messages in round r. Otherwise, we have m_r < c′ · n^2 ln n. In this case, for ε = 2c′ · n^2 ln(n)/m_r ≥ 2, we get e^{−ε^2 m_r/((2+ε)n^2)} = e^{−2c′ ln(n)/(2/ε+1)} ≤ n^{−c′}. In other words, w.h.p., (1 + 2c′ ln(n) · n^2/m_r) · m_r/n^2 = m_r/n^2 + 2c′ · ln n = O(m_r/n^2 + log n) rounds are sufficient for machine p_i to send all of its required messages in round r. We conclude that Pr[S^r_i > m_r/n^2 + 2c′ ln n] < n^{−c′}.

Denote by T^r_i the random variable whose value is the number of messages received by machine p_i, normalized by n. By the same approach, we show that

Pr[T^r_i > m_r/n^2 + 2c′ ln n] < n^{−c′}.

By a union bound over S^r_i, T^r_i for all i ∈ [n], we obtain that for some i one of the events S^r_i > m_r/n^2 + 2(c+2) ln n, T^r_i > m_r/n^2 + 2(c+2) ln n happens with probability at most 2n^{−c′+1} ≤ n^{−c} for n ≥ 2.
Notice that S = max_{i∈[n]} S^r_i · n and R = max_{i∈[n]} T^r_i · n; thus X = max{S, R} = O(m_r/n + n log n) w.h.p.

With an upper bound at hand on the number of messages that each machine sends or receives in phase r, we can prove that Algorithm 3 satisfies the statement of Theorem 4.1.

Proof of Theorem 4.1. We prove the correctness and bound the runtime of the presented algorithm (see Algorithm 3).

Correctness: After the Input Shuffling subroutine (Line 1), the input for node v_{i,j} is stored on machine p_{π_j(i)}. For each phase r ∈ [dilation], we invoke Claim 2.1 with the computed value X, which w.h.p. is a bound on the number of messages that each machine sends or receives. Thus, w.h.p., this invocation succeeds. Since dilation = O(poly n), a union bound over all phases gives that at the end of the Execution subroutine, each machine p_i holds the outputs of all nodes v_{π_j^{−1}(i),j} for each j ∈ [t]. After Output Unshuffling, machine p_i holds the output for node v_{i,j} for each job j ∈ [t].

Round Complexity: The initial Input Shuffling (Line 1) and the Output Unshuffling at the end of the algorithm (Line 3) complete within O(t) rounds each. For each phase r in the Execution part of the algorithm, computing m_r is done in 2 rounds. By Lemma 4.3, X = O(m_r/n + n log n) is a bound on Σ_{j∈[t]} s^r_{π_j^{−1}(i),j} and Σ_{j∈[t]} t^r_{π_j^{−1}(i),j}, which are the numbers of messages that machine p_i sends and receives in phase r, respectively, for all i ∈ [n]. Thus, invoking Claim 2.1 completes in O(m_r/n^2 + log n) rounds, w.h.p. The overall round complexity of the algorithm is therefore

O(t + Σ_{r∈[dilation]} (m_r/n^2 + log n)) = O(t + GlobalCongestion + dilation · log n).

4.2 Scheduling through Random Delays

In this subsection we show how to use the random-delays approach introduced in [LMR94] to schedule round-efficient jobs.

Theorem 4.4.
There is a randomized algorithm in the CLIQUE model which schedules t jobs in O(LocalCongestion + dilation · log n + t/n) rounds, w.h.p., given an upper bound on the value of LocalCongestion, and in O(LocalCongestion + log LocalCongestion · (dilation · log n + t/n)) rounds, w.h.p., if such a bound is not known.

In the algorithm, job j ∈ [t] is executed with a delay D_j that is chosen uniformly at random from [D], where D = ⌊LocalCongestion/ln n⌋. In the crucial step of the proof, we use a Hoeffding Bound to show that this random delay implies that each machine has to send and receive at most X = O(LocalCongestion · n/D) messages per round in all jobs combined. The claim then follows by routing all messages of a single round with Lenzen's routing scheme (Claim 2.1). This approach assumes that all nodes know a bound on LocalCongestion, an assumption which can be removed at the cost of a logarithmic factor with a standard doubling technique.

Algorithm: We describe the algorithm for the case where LocalCongestion is known. The algorithm consists of an initializing part, Sample Delays, followed by the actual Execution part. Let D = ⌊LocalCongestion/ln n⌋.

Sample Delays: We start by generating a random delay D_j for each job j and broadcasting it. For this, a leader node, say, p_0, samples a delay D_j uniformly at random from [D], independently for each job j. Notice that in the special case D ≤ 1 (i.e., LocalCongestion < 2 ln n), the delays degenerate to the deterministic D_j = 0.

Execution (O(D + dilation) phases): In phase r we progress each job j (for which r ≥ D_j holds) from round r − D_j to round r − D_j + 1. Each machine p_i executes the protocol of round r − D_j for job j. To deliver the messages efficiently, we use the algorithm from Claim 2.1, which requires the bound X on max_{i∈[n]} {Σ_{j∈[t]} s^{r−D_j}_{i,j}, Σ_{j∈[t]} t^{r−D_j}_{i,j}}, the number of messages a machine sends or receives. If LocalCongestion < 2 ln n, the number of messages to send or receive is clearly bounded by O(n log n). In the general case, we show that this bound is O(LocalCongestion · n/D) w.h.p.

Doubling: To remove the requirement of knowing LocalCongestion, we use a standard doubling technique. We attempt to run the algorithm until success, doubling the estimate of LocalCongestion with each attempt, starting from a guess of LocalCongestion = 1. A failure is detected when the algorithm from Claim 2.1 fails.

Algorithm 4 Scheduling of jobs.
1. Sample delays: Independently and uniformly at random pick D_j ∈ [D] and broadcast the values.
2. Execution: Run O(D + dilation) phases, where in phase r each job j that satisfies r ≥ D_j is progressed by one round, and the messages of all jobs are routed with Claim 2.1.

In the proof of the following lemma, we bound the number of messages that each machine has to send/receive in one phase by X = O(LocalCongestion · n/D).

Lemma 4.5. Given t jobs and a set of delays {D_j}_{j∈[t]} sampled uniformly at random from [D] for D = ⌊LocalCongestion/ln n⌋ ≥ 2, let S = max_{i∈[n]} Σ_{j∈[t]: r≥D_j} s^{r−D_j}_{i,j}, R = max_{i∈[n]} Σ_{j∈[t]: r≥D_j} t^{r−D_j}_{i,j}, and X = max{S, R}. Then, w.h.p., it holds that X = O(LocalCongestion · n/D).

Proof. Let c ≥ 1 be a constant. Denote by S_{i,j} = Σ_{r′∈[dilation]} s^{r′}_{i,j} · 1[D_j + r′ = r] the random variable whose value is the number of messages sent by machine p_i for job j in round r. These variables are independent for different values of j, as the delays D_j are independent. They are also bounded by n, which means that the variables S_{i,j}/n are also independent and belong to [0, 1]. Denote by S_i = Σ_{j∈[t]} S_{i,j}/n the random variable whose value is the number of messages sent by machine p_i, normalized by n. Denote c′ = c + 2.
We show that the normalized number of messages machine p_i sends is bounded as S_i ≤ (1 + 2c′) LocalCongestion/D, with probability at least 1 − n^{−c′}.

First, we note that the expected normalized number of messages machine p_i sends is

E[S_i] = E[Σ_{j∈[t]} S_{i,j}/n] = Σ_{j∈[t]} Σ_{r′∈[dilation]} s^{r′}_{i,j} · E[1[D_j + r′ = r]]/n ≤ Σ_{j∈[t]} Σ_{r′∈[dilation]} s^{r′}_{i,j}/(D · n) ≤ LocalCongestion/D,

where the second transition is due to the linearity of expectation, the third follows from the delays being uniformly selected (so E[1[D_j + r′ = r]] ≤ 1/D), and the last one is due to the definition of LocalCongestion.

Since the D_j are independent, we use Claim 2.2 (Hoeffding Bound) with ε = 2c′ to bound the probability of S_i being larger than the expected value:

Pr[S_i ≥ (1 + 2c′)(LocalCongestion/D)] = Pr[Σ_{j∈[t]} (S_{i,j}/n) ≥ (1 + 2c′)(LocalCongestion/D)] ≤ e^{−((2c′)^2/(2+2c′)) · LocalCongestion/D} ≤ e^{−c′ ln n} = n^{−c′},

where the second transition is due to Claim 2.2 and the third is due to the selection of D ≤ LocalCongestion/ln n.

Denote by T_i the random variable whose value is the number of messages received by machine p_i in round r, normalized by n. Using a similar approach, it holds that

Pr[T_i ≥ (1 + 2c′) LocalCongestion/D] ≤ n^{−c′}.

By a union bound over all S_i and T_i, we obtain that the probability that for some i, S_i or T_i exceeds (1 + 2c′) LocalCongestion/D is bounded by n^{−c}. Since S = max_{i∈[n]} S_i · n, R = max_{i∈[n]} T_i · n, and X = max{S, R}, it w.h.p. holds that X = O(LocalCongestion · n/D).

The following simple routing primitive is used in the random-delay-based algorithm of Theorem 4.4.

Definition 4.6 (Multiple broadcast problem). Each machine p_i ∈ V is given a set M_i of m_i messages of size O(log n) bits each. The goal is to deliver each message to all the machines.

Lemma 4.7.
There is an algorithm in the CLIQUE model which solves the multiple broadcast problem in O(⌈Σ_{i∈[n]} m_i/n⌉) rounds.

Proof. The pseudocode is given in Algorithm 5. First, on Line 1, each machine p_i broadcasts m_i, the number of messages it has. Given the information it receives, machine p_i locally computes y_i = Σ_{i′=0}^{i−1} m_{i′}, the number of messages held by the machines with preceding identifiers i′ < i. This allows each machine to compute the indices of its messages in the global numbering. We split the execution into ⌈Σ_{i∈[n]} m_i/n⌉ = ⌈y_n/n⌉ phases. In phase k, the batch of messages with indices [k · n, min{(k+1) · n, Σ_{i∈[n]} m_i}) is broadcast in two rounds. In the first round, the i′-th message of the current batch (i.e., the message with global number k · n + i′) is sent to machine p_{i′} (Line 4). In the second round, each machine broadcasts the message it received in the previous round (Line 5).

Algorithm 5 Multiple broadcasts.
1. Each machine p_i broadcasts m_i.
2. Each machine p_i locally computes y_i = Σ_{i′=0}^{i−1} m_{i′}.
3. for k ← 0, …, ⌈y_n/n⌉ − 1 do
4.   For each i′ ∈ [n], the message with global number k · n + i′ is sent to machine p_{i′}.
5.   Each p_{i′} broadcasts the message it received.

In the first round of each phase, at most one message is received by each machine, and in particular only one message is sent between any pair of machines. In the second round of each phase, each machine sends at most one message to each other machine. Hence, the entire execution completes in O(⌈Σ_{i=0}^{n−1} m_i/n⌉) rounds.

Proof of Theorem 4.4. We prove the correctness and bound the runtime of the aforementioned algorithm (Algorithm 4). First, in the special case LocalCongestion < 2 ln n, the number of messages each machine has to send over the entire execution, for all jobs combined, and in particular in each round, is bounded by 2 · n ln n.
Thus, a straightforward execution of one round of all jobs with Claim 2.1 completes in O(ln n) rounds, and the entire execution takes O(dilation · log n) rounds. From now on we assume D ≥ 2.

Correctness. In each phase r ∈ [D + dilation], we invoke Claim 2.1 with a bound of X = O(LocalCongestion · n/D), which, due to Lemma 4.5, w.h.p. bounds the number of messages each machine sends or receives. Thus, by a union bound over all phases, all invocations succeed w.h.p.

Round complexity. Broadcasting t values during Sample Delays (Line 1) takes O(⌈t/n⌉) rounds by Lemma 4.7. For each phase r ∈ [D + dilation], by Lemma 4.5, X = O(LocalCongestion · n/D) is a bound on the number of messages that machine p_i sends and receives in phase r for each i ∈ [n] w.h.p., and by applying a union bound over the D + dilation = O(poly(n)) phases, this holds in every phase w.h.p. Thus, invoking the algorithm from Claim 2.1 completes in O(⌈LocalCongestion/D⌉) = O(LocalCongestion/D) rounds per phase. Overall, for D = ⌊LocalCongestion/ln n⌋, the algorithm terminates in O(⌈t/n⌉) + (dilation + D) · O(LocalCongestion/D) = O(t/n + dilation · log n + LocalCongestion) rounds, w.h.p.

Doubling. Since the algorithm succeeds w.h.p. once our estimate is at least the value of LocalCongestion, we finish within O(log LocalCongestion) attempts. Thus, w.h.p., the round complexity of this approach is Σ_{κ=0}^{O(log LocalCongestion)} O(t/n + 2^κ + dilation · log n) = O(LocalCongestion + log LocalCongestion · (t/n + dilation · log n)).

In this section we apply the scheduling algorithms developed in Sections 3 and 4 to protocols which solve MIS (Section 5.1) and Pointer Jumping (Section 5.2). We analyze the round complexity of the resulting algorithms.

5.1 Maximal Independent Set

A maximal independent set (MIS) of a graph G = (V, E) is a subset of nodes M ⊆ V such that no two nodes in M are connected by an edge and adding any node to M would break this property.
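Both defining conditions can be captured by a short checker (a sequential reference, not part of the distributed protocols discussed below):

```python
def is_mis(adj, m):
    """Check the two conditions above: m is independent, and every node
    outside m has a neighbor in m (so no node can be added)."""
    independent = all(u not in adj[v] for v in m for u in m if u != v)
    maximal = all(v in m or any(u in m for u in adj[v]) for v in adj)
    return independent and maximal

# path 0 - 1 - 2 - 3
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
```

On the path, {0, 2} and {1, 3} are maximal independent sets, while {0} is independent but not maximal.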
In this subsection, we show that we can efficiently solve multiple MIS instances using our scheduling algorithm from Theorem 4.1.

Theorem 5.1 (Multiple MIS instances). There is a randomized algorithm in the CLIQUE model which solves t = poly n instances of MIS in O(t + log log ∆ · log n) rounds, w.h.p.

To prove our result, we show that the MIS protocol for the CLIQUE model given in [GGK+18], which completes in O(log log ∆) rounds, uses O(n log log ∆) messages over all rounds combined, which we state as follows.

Theorem 5.2 (Analysis of the MIS protocol of [GGK+18, Theorem 1.1]). There is a randomized MIS protocol in the CLIQUE + Lenzen's Routing model which completes in O(log log ∆) rounds and sends O(n log log ∆) messages, w.h.p.

Given Theorem 5.2, we prove Theorem 5.1 as follows.

Proof of Theorem 5.1. By Theorem 5.2, a set of t jobs of the MIS protocol of [GGK+18] has dilation = O(log log ∆) and GlobalCongestion = t · n log log ∆/n^2 ≤ t, w.h.p. By Theorem 4.1, w.h.p., we can schedule the t jobs in a number of rounds bounded by O(t + GlobalCongestion + dilation · log n) = O(t + log log ∆ · log n).

Remark. Theorem 5.1 also shows that the random-shuffling approach may be more efficient than random delays. In the MIS protocol of [GGK+18] which we use here, a leader node is used for collecting some of the edges of the graph. Potentially, since the leader node may receive O(n) messages in each of the O(log log ∆) iterations of the protocol, applying the random-delay scheduling of Theorem 4.4 on t such MIS jobs results in a complexity of O(t log log ∆ + log log ∆ · log n) rounds. This run-time is asymptotically worse than the one obtained by the algorithm from Theorem 5.1 for t = Ω(log n). Moreover, it is no better than the naïve execution of the protocol multiple times, one after another.

It remains to prove Theorem 5.2.

Proof of Theorem 5.2. The correctness and the round complexity follow from [GGK+18]; here we analyze the message complexity. Throughout, the protocol maintains a set M ⊆ V, initially empty.
The Protocol (see Algorithm 6).

Random ranking: First, a leader node v∗ generates a uniformly random permutation π : [n] → [n] and makes it globally known within 2 rounds, by sending each node v_i the value of π(i), which v_i then broadcasts to everyone. The value π(i) is called the rank of v_i and does not change during the algorithm.

Degree reduction by simulating greedy steps: The second part of the protocol is a loop which, as shown in [GGK+18, Theorem 1.1], uses O(log log ∆) iterations w.h.p. to reduce the maximum degree of active nodes to ∆′ = min{∆, poly log n}.

Algorithm 6 The MIS algorithm of [GGK+18].
1. The leader v∗ generates a uniformly random permutation π : [n] → [n] and sends π(i) to v_i.
2. Each node v_i broadcasts π(i).
3. k ← 0.
4. while the maximum degree of active nodes is at least ∆′ = min{∆, poly log n} do
5.   M_k ← ∅ if k = 0, and M_k ← M_{k−1} if k ≥ 1.
6.   N_k ← ∅ if k = 0, and N_k ← N_{k−1} if k ≥ 1.
7.   Every edge {v_i, v_{i′}} with both endpoints active and π(i) ≤ π(i′) ≤ n/∆^{(α^k)} is sent to v∗ by v_i.
8.   while there exists an undecided node v_i ∉ M_k ∪ N_k with π(i) ≤ n/∆^{(α^k)} do
9.     Add the node v_i with the smallest rank π(i) among the undecided nodes to M_k.
10.    All the neighbors of v_i that are known to v∗ are added to N_k.
11.  The leader v∗ informs the nodes in M_k \ M_{k−1} that they are such.
12.  The nodes in M_k \ M_{k−1} are added to M, and they inform their neighbors that they are such and become inactive.
13.  The nodes in N_G(M_k \ M_{k−1}) inform their neighbors that they are such and become inactive.
14.  k ← k + 1.
15. for k from 0 to O(log log ∆′) do
16.   Each active node v_i sends all adjacent edges from H_k to each of its neighbors in H_k.
17. Each active node v_i simulates O(log ∆′) rounds of the MIS protocol of [Gha16] locally. The chosen nodes are added to M and they become inactive along with their neighbors.
18. The leader v∗ learns all remaining edges, locally computes an MIS over them, and informs the nodes, which are then added to M.

In each iteration k ≥ 0, we produce a set M_k ⊆ V, which is initially empty for k = 0 and is initialized to M_{k−1} for k ≥ 1. The nodes in M_k are afterwards added to the resulting MIS, M. We also use a set N_k, initially empty for k = 0 and initialized to N_{k−1} for k ≥ 1, of nodes that will not be in M. Initially, all nodes are active. A node that is in M_k ∪ N_k is decided and becomes inactive; otherwise it remains active. Fix a constant α = 3/4. In iteration k, all edges {v_i, v_{i′}} where both endpoints are active and have ranks π(i) ≤ π(i′) ≤ n/∆^{(α^k)} are sent to the leader v∗ by v_i. The leader v∗ now applies greedy MIS steps, as follows. As long as there is an active node v_i with π(i) ≤ n/∆^{(α^k)}, the node with the smallest rank is added to M_k and all of its neighbors that are known to v∗ are added to N_k. After these greedy steps, the leader v∗ informs the nodes in M_k that they are such. These nodes are added to M and become inactive, and they inform their neighbors, which join N_k and become inactive as well.

The loop terminates when the maximum degree of active nodes is at most ∆′ = min{∆, poly log n}. To check that this condition is met, each node sends its degree to the leader and the leader broadcasts the decision; this requires 1 round. We denote by H = (V′, E′) the graph, of maximum degree bounded by ∆′, that is induced by the remaining active nodes.

Small degrees (the graph H): First, each active node generates O(log ∆′) random bits. These random bits are from now on sent along with the node's identifier whenever the latter is sent in a message. Then, each node of H learns its O(log ∆′)-hop neighborhood in H. To this end, we proceed in O(log log ∆′) iterations, where after iteration k, each node in H knows its 2^k-hop neighborhood in H.
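The doubling step can be illustrated by a sequential graph-power computation (boolean "squaring" of the adjacency structure stands in for the distributed edge exchange):

```python
def square(adj):
    """One doubling step: v's neighbors in H_{k+1} are all nodes within two
    H_k-hops of v, mirroring 'send your H_k-edges to your H_k-neighbors'."""
    return [{u for w in adj[v] | {v} for u in adj[w] | {w} if u != v}
            for v in range(len(adj))]

# path 0 - 1 - 2 - 3 - 4; adj is the 1-hop (H_0) adjacency
H = [{1}, {0, 2}, {1, 3}, {2, 4}, {3}]
H1 = square(H)   # 2-hop neighborhoods
H2 = square(H1)  # 4-hop neighborhoods
```

After O(log log ∆′) such squarings the reach is 2^k hops, covering the O(log ∆′)-hop neighborhood needed for the local simulation.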
In iteration k, each node sends its edges in H_k to its neighbors in H_k. Notice that, by induction over k, at the beginning of iteration k each node knows its neighbors in H_k, and at the end of the iteration it knows its neighbors in H_{k+1}.

After learning its O(log ∆′)-hop neighborhood, each active node locally simulates O(log ∆′) rounds of the randomized MIS protocol of [Gha16]. Each iteration of this protocol requires O(log ∆′) random bits at each node, which are the ones generated by the node at the beginning of this step. Each node chosen into the MIS is added to M and becomes inactive along with its neighbors. Notice that all nodes compute the same MIS locally, because each node knows a sufficiently large neighborhood, including O(log ∆′) globally consistent random bits for each node in the neighborhood.

Wrapping-up part: Finally, the leader v* learns the remaining graph induced by active nodes, locally computes an MIS, and informs the nodes, which are then added to M. This finishes the description of the algorithm.

Message Complexity.

Random ranking: Since this part takes 2 rounds, it clearly sends at most O(n) messages.

Degree reduction (simulating greedy steps): Let G_0 be the subgraph induced by nodes with ranks π(i) ≤ n/∆. Since the maximum degree in G_0 is ∆, the number of edges in G_0 is bounded by (n/∆) · ∆ = n. This implies that at most O(n) messages are sent to the leader v* in the first iteration. For k ≥ 1, let r_k = n/∆^{α^k}, and let G_k = (V_k, E_k) be the subgraph that is induced by nodes with ranks in the range [r_{k−1}, r_k] that are still active after iteration k−1. In [GGK+18, Theorem 1.1], it is shown that G_k has at most O(n) edges, w.h.p., which implies that at most O(n) messages are sent to the leader v* in iteration k. Informing the nodes in M_k \ M_{k−1} that they should join M_k requires at most O(n) messages.
Notice that the leader does not inform nodes in N_k \ N_{k−1}, as they are informed by their neighbors in M_k \ M_{k−1}. Checking the loop condition requires O(n) messages per iteration. Since [GGK+18, Lemma 3.1] implies that after O(log log ∆) iterations of the loop the degree in the graph induced by active nodes is at most n log n / (n/∆^{α^{O(log log ∆)}}) = poly log n = ∆′ and the loop terminates, this gives a total of O(n log log ∆) such messages, w.h.p. Over the entire execution of the protocol, each node v_i is informed at most deg_G(v_i) times by one of its neighbors that such a neighbor enters the MIS or becomes inactive. Thus, over the entire course of the algorithm this requires O(n²) messages.

Small degrees: For k = O(log log ∆′), the maximum degree in H_k = H^{poly log ∆′} is bounded by ∆′^{poly log ∆′}. To send one identifier together with O(log ∆′) random bits we need 1 + O(log ∆′ / log n) messages of O(log n) bits. Thus, for each k, each active node sends at most O(1 + log ∆′ / log n) · ∆′^{poly log ∆′} = 2^{poly log ∆′} messages. Over all O(log log ∆′) iterations, each active node sends O(log log ∆′) · 2^{poly log ∆′} = 2^{poly log ∆′} messages. As ∆′ = poly log n holds, the number of messages sent by each active node to learn its O(log ∆′)-hop neighborhood is 2^{poly log log n} = O(n). This implies O(n²) messages in total. The simulation of [Gha16] to decide whether to join M is then done locally, without communication.

Wrapping-up part: In [Gha17, Lemma 2.11], it is shown that the graph induced by active nodes after learning O(log ∆′)-hop neighborhoods in H and simulating O(log ∆′) iterations of the MIS algorithm from [Gha16] has at most O(n) edges. Thus, learning the remaining edges at the leader and informing the nodes about the leader's decision requires O(n) messages. Notice that in the CLIQUE model this would require some routing scheme.
However, in the CLIQUE + Lenzen's Routing model this is part of the model definition. Thus, overall the algorithm sends O(n²) messages.

5.2 Pointer Jumping

In this subsection we address the pointer jumping problem, which is widely used in parallel and distributed data structures [Hir76].

Definition 5.3 (P-pointer jumping). In a P-pointer jumping problem, each node v_i is given a permutation π_i: [P] → [P]. A fixed node v_{i′} is given a number p ∈ [P]. The aim of the algorithm is for v_{i′} to learn the composition of the permutations applied to p, i.e., (π_{n−1} ∘ π_{n−2} ∘ ⋯ ∘ π_0)(p) = π_{n−1}(π_{n−2}(… π_0(p) …)).

In the following claim we show a simple deterministic O(log n)-round CLIQUE protocol for solving P-pointer jumping with a complexity of O(P · n) messages.

Claim 5.4 (Pointer jumping). For P = O(n), there is a deterministic O(P)-memory efficient protocol in the CLIQUE + Lenzen's Routing model which solves the pointer jumping problem in O(log n) rounds and O(P · n) messages.

Algorithm 7 Pointer jumping.
  for k = 0 to ⌈log n⌉ do
    for i ∈ [n] in parallel do
      if i has at least k trailing zeros in its binary representation then
        v_i computes π_i ∘ π_{i+1} ∘ ⋯ ∘ π_{min{i+2^k, n}−1}.
      if i has exactly k < ⌈log n⌉ trailing zeros in its binary representation then
        v_i sends π_i ∘ π_{i+1} ∘ ⋯ ∘ π_{min{i+2^k, n}−1} to v_{i−2^k}.
  v_{i′} sends p to v_0.
  v_0 replies to v_{i′} with (π_{n−1} ∘ π_{n−2} ∘ ⋯ ∘ π_0)(p).

Proof of Claim 5.4.

Algorithm and correctness. The pseudo-code for this simple, well-known protocol for the pointer jumping problem is presented in Algorithm 7. At a high level, first v_0 learns the composition of the permutations, then v_{i′} sends the entry p to v_0 and it responds with the final output. To learn the composition of the permutations we proceed in ⌈log n⌉ + 1 iterations.
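The doubling pattern behind this composition can be illustrated with a centralized sketch; the helper names `compose` and `pointer_jump` are ours, and we use the convention that the left permutation in a block is applied first.

```python
# Centralized sketch of the doubling composition (helper names are ours;
# comp[i] covers a contiguous block of permutations, applied left first).

def compose(f, g):
    """Return the permutation 'apply f, then g', as a list."""
    return [g[f[x]] for x in range(len(f))]

def pointer_jump(perms, p):
    """perms: list of n permutations of [P] (as lists). Returns the
    result of applying perms[0], perms[1], ..., perms[n-1] to p."""
    n = len(perms)
    comp = list(perms)  # comp[i] initially covers just pi_i
    span = 1            # number of permutations comp[i] accounts for
    while span < n:
        # Position i absorbs the block held by position i + span (if it
        # exists), doubling the number of permutations it accounts for.
        comp = [compose(comp[i], comp[i + span]) if i + span < n else comp[i]
                for i in range(n)]
        span *= 2
    return comp[0][p]

# Sanity check against naive sequential application of the permutations.
perms = [[1, 2, 0], [2, 0, 1], [0, 2, 1]]
for p in range(3):
    expected = p
    for perm in perms:
        expected = perm[expected]
    assert pointer_jump(perms, p) == expected
```

In Algorithm 7 the same merges happen in parallel, with the trailing-zero condition on i determining which node holds, and which node sends, each block.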
On each iteration except the first, each node v_i which receives a permutation from v_{i+2^{k−1}} computes the composition of the permutation it possesses and the received permutation, that is, it composes the permutation π_i ∘ π_{i+1} ∘ ⋯ ∘ π_{i+2^{k−1}−1} with the permutation π_{i+2^{k−1}} ∘ ⋯ ∘ π_{i+2^k−1}. Each node v_i which has exactly k trailing zeros in its identifier and currently knows the composition of 2^k permutations, π_i ∘ π_{i+1} ∘ ⋯ ∘ π_{min{i+2^k, n}−1}, sends it to the node v_{i−2^k}. Each node sends and receives at most P = O(n) messages. Clearly, after ⌈log n⌉ iterations, node v_0 possesses the composition of all n permutations.

Memory-efficiency. To compose the received permutation with the current one, we store both permutations and the output permutation in local memory. Thus we require O(P log n) bits of local memory. Given only the index of the round, each node can deduce the number of messages every node sends to it. In the CLIQUE model this algorithm would require some routing scheme, but the routing of Claim 2.1 is not known to run in O(P log n) bits of memory. However, our results apply in the potentially more powerful model of CLIQUE + Lenzen's Routing [KS20], in which each node is allowed to send and receive n messages in each round. Therefore, no additional memory overhead for a routing algorithm is required.

Round complexity. The algorithm finishes within O(log n) iterations. In each iteration, each node sends and receives P = O(n) messages, thus each iteration completes in O(1) rounds.

Message complexity. In the k-th iteration of the algorithm, O(n/2^k) nodes send P messages each. Thus, the protocol uses Σ_{k=0}^{⌈log n⌉} O(n/2^k) · O(P) = O(P · n) messages.

Applying our deterministic scheduling algorithm and our random shuffling algorithm, we obtain the following theorem on the complexity of solving multiple instances of the pointer jumping problem.

Theorem 5.5 (Pointer Jumping).
For P ≤ n, there are algorithms in the CLIQUE model that solve t = poly n instances of the P-pointer jumping problem deterministically in O(⌈P · t/n⌉ · log n) rounds, and randomized in O(t + log² n) rounds, w.h.p.

Proof of Theorem 5.5. The first part of the theorem follows immediately from Claim 5.4 and Theorem 3.1. By Theorem 3.1, running t instances of the protocol from Claim 5.4 completes in O(t · P · n/n² + ⌈P · t/n⌉ · log n) = O(⌈P · t/n⌉ · log n) rounds. This gives the first claim.

Since each node of the job consumes only P log n ≤ n log n bits of input and produces log n bits of output, it is I/O efficient. Thus, by Theorem 4.1, running t instances of the protocol from Claim 5.4 completes in O(t + t · P · n/n² + log n · log n) = O(t + log² n) rounds, w.h.p., which gives the second claim.

The proposed simple O(log n)-round pointer jumping protocol also serves as an example where scheduling jobs via the random-shuffling approach of Theorem 4.1 is significantly better than the random-delay based approach of Theorem 4.4. Since multiple nodes receive Ω(n log n) messages in the execution of the protocol, applying the random-delay scheduling algorithm from Theorem 4.4 solves t instances of the problem in O(t · log n + log² n) rounds, which is no better than sequentially running one instance after another.

Our results suggest that the amortized complexity, i.e., the runtime of solving many instances of a problem divided by the number of instances, is a valuable measure for the efficiency of protocols in the CLIQUE model. Our interest in obtaining protocols with fast amortized complexities stems from the growing number of problems which admit O(1)-round CLIQUE protocols, e.g., [CDP20, Now19, GNT20], whose amortized complexity could potentially be shown to go below constant, as well as from problems that are still not known to have a constant worst-case complexity.
We now elaborate on this viewpoint.

We give MIS as an example of a problem which can be solved with a good amortized complexity. The best known protocol [GGK+18] requires O(log log ∆) rounds. Theorem 5.1 shows that running t = poly n instances of MIS completes in O(t + log log ∆ · log n) rounds. For t = Ω(log log ∆ · log n), the second part of the complexity "amortizes out" and we obtain that we run t instances of the MIS problem in O(t) rounds. Basically, we show that the amortized complexity of the MIS problem is O(1) rounds.

Note that the amortized complexity should not be optimized in isolation from other measures. For example, consider the trivial O(n)-round protocol for pointer jumping, in which, in the i-th round, the i-th node applies its permutation to the current pointer and sends the result to the next node. It requires only O(n) messages. Thus, it is trivial to run t ≤ n² instances of this pointer jumping protocol in only O(n) rounds, leading to an amortized complexity of O(1/n) = o(1). However, the latency of this algorithm is an unacceptable O(n) rounds. Instead, Theorem 5.5 shows that the pointer jumping problem has an acceptable amortized complexity of O(1) rounds and a small latency of O(log n) rounds.

For certain protocols, Theorem 3.1 might even yield o(1) amortized complexity. For example, consider a job in which it is required to compute the √n-bin histogram of some given data. In the trivial 2-round protocol, each node locally builds a histogram of its input and sends the number of elements in its i-th bin to v_i. For all i ∈ [√n], node v_i sums the received values and broadcasts the result. Clearly, such an algorithm is O(√n)-memory efficient and uses O(n√n) messages. Our algorithm from Theorem 3.1 executes t instances of this protocol in O(⌈t/√n⌉) rounds.
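The 2-round histogram protocol just described can be sketched centrally as follows; the function name is ours, and the two communication rounds are simulated by plain loops.

```python
# Centralized sketch of the 2-round histogram protocol (function name is
# ours): round one computes local histograms and 'sends' bin-i counts to
# node v_i; round two has v_i sum its bin and 'broadcast' the total.

def histogram_protocol(inputs, num_bins):
    """inputs: one list of values in [num_bins] per node.
    Returns the global histogram over all nodes' data."""
    # Round 1: each node counts its own data per bin.
    local = []
    for data in inputs:
        counts = [0] * num_bins
        for x in data:
            counts[x] += 1
        local.append(counts)
    # Round 2: node v_i receives one count per node and sums them.
    return [sum(node_counts[i] for node_counts in local)
            for i in range(num_bins)]

# Three nodes with small inputs over 3 bins.
inputs = [[0, 1, 1], [2, 0], [1]]
assert histogram_protocol(inputs, 3) == [2, 3, 1]
```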
Whenever t = o(√n), this gives an o(1) amortized round complexity with constant latency.

The reader may notice that for some sets of jobs, it may be that some ad-hoc routing could be developed for efficient scheduling. We emphasize that, in contrast, the power of our algorithms is that they do not require tailoring the protocols for the sake of scheduling them within a given set of jobs. This is pivotal for obtaining a general framework, because knowing in advance the setting in which a protocol would be executed is an unreasonable assumption that we do not wish to make.

Acknowledgements: This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 755839-ERC-BANDWIDTH.

References

[CDKL19] Keren Censor-Hillel, Michal Dory, Janne H. Korhonen, and Dean Leitersdorf. Fast approximate shortest paths in the congested clique. In PODC, pages 74–83. ACM, 2019.
[CDP20] Artur Czumaj, Peter Davies, and Merav Parter. Simple, deterministic, constant-round coloring in the congested clique. In PODC, pages 309–318. ACM, 2020.
[CGL20] Keren Censor-Hillel, François Le Gall, and Dean Leitersdorf. On distributed listing of cliques. In PODC, pages 474–482. ACM, 2020.
[CKK+19] Keren Censor-Hillel, Petteri Kaski, Janne H. Korhonen, Christoph Lenzen, Ami Paz, and Jukka Suomela. Algebraic methods in the congested clique. Distributed Comput., 32(6):461–478, 2019.
[CPS20] Keren Censor-Hillel, Merav Parter, and Gregory Schwartzman. Derandomizing local distributed algorithms under bandwidth restrictions. Distributed Comput., 33(3-4):349–366, 2020.
[CPZ19] Yi-Jun Chang, Seth Pettie, and Hengjie Zhang. Distributed triangle detection via expander decomposition. In SODA, pages 821–840. SIAM, 2019.
[DLP12] Danny Dolev, Christoph Lenzen, and Shir Peled. "Tri, tri again": Finding triangles and small subgraphs in a distributed setting (extended abstract). In DISC, pages 195–209, 2012.
[Gal16] François Le Gall.
Further algebraic algorithms in the congested clique model and applications to graph-theoretic problems. In DISC, pages 57–70, 2016.
[GGK+18] Mohsen Ghaffari, Themis Gouleakis, Christian Konrad, Slobodan Mitrovic, and Ronitt Rubinfeld. Improved massively parallel computation algorithms for MIS, matching, and vertex cover. In PODC, pages 129–138. ACM, 2018.
[Gha15] Mohsen Ghaffari. Near-optimal scheduling of distributed algorithms. In PODC, pages 3–12. ACM, 2015.
[Gha16] Mohsen Ghaffari. An improved distributed algorithm for maximal independent set. In SODA, pages 270–277. SIAM, 2016.
[Gha17] Mohsen Ghaffari. Distributed MIS via all-to-all communication. In PODC, pages 141–149. ACM, 2017.
[GN18] Mohsen Ghaffari and Krzysztof Nowicki. Congested clique algorithms for the minimum cut problem. In PODC, pages 357–366. ACM, 2018.
[GNT20] Mohsen Ghaffari, Krzysztof Nowicki, and Mikkel Thorup. Faster algorithms for edge connectivity via random 2-out contractions. In SODA, pages 1260–1279. SIAM, 2020.
[GP16] Mohsen Ghaffari and Merav Parter. MST in log-star rounds of congested clique. In PODC, pages 19–28. ACM, 2016.
[Hir76] Daniel S. Hirschberg. Parallel algorithms for the transitive closure and the connected component problems. In STOC, pages 55–57. ACM, 1976.
[Hoe63] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc., 58(301):13–30, 1963.
[HPP+15] James W. Hegeman, Gopal Pandurangan, Sriram V. Pemmaraju, Vivek B. Sardeshmukh, and Michele Scquizzato. Toward optimal bounds in the congested clique: Graph connectivity and MST. In PODC, pages 91–100. ACM, 2015.
[IG17] Taisuke Izumi and François Le Gall. Triangle finding and listing in CONGEST networks. In PODC, pages 381–389. ACM, 2017.
[JN18] Tomasz Jurdzinski and Krzysztof Nowicki. MST in O(1) rounds of congested clique. In SODA, pages 2620–2632. SIAM, 2018.
[KNPR15] Hartmut Klauck, Danupon Nanongkai, Gopal Pandurangan, and Peter Robinson.
Distributed computation of large-scale graph problems. In SODA, pages 391–410. SIAM, 2015.
[Kor16] Janne H. Korhonen. Deterministic MST sparsification in the congested clique. CoRR, abs/1605.02022, 2016.
[KS20] Fabian Kuhn and Philipp Schneider. Computing shortest paths and diameter in the hybrid network model. In PODC, pages 109–118. ACM, 2020.
[KSV10] Howard J. Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for MapReduce. In SODA, pages 938–948. SIAM, 2010.
[Len13] Christoph Lenzen. Optimal deterministic routing and sorting on the congested clique. In PODC, pages 42–50. ACM, 2013.
[LMR94] Frank Thomson Leighton, Bruce M. Maggs, and Satish Rao. Packet routing and job-shop scheduling in O(congestion + dilation) steps. Comb., 14(2):167–186, 1994.
[LPPP05] Zvi Lotker, Boaz Patt-Shamir, Elan Pavlov, and David Peleg. Minimum-weight spanning tree construction in O(log log n) communication rounds. SIAM J. Comput., 35(1):120–131, 2005.
[Now19] Krzysztof Nowicki. A deterministic algorithm for the MST problem in constant rounds of congested clique. CoRR, abs/1912.04239, 2019.
[PRS18] Gopal Pandurangan, Peter Robinson, and Michele Scquizzato. On the distributed complexity of large-scale graph computations. In