Algorithms for a Topology-aware Massively Parallel Computation Model
Xiao Hu (Duke University) [email protected] · Paraschos Koutris (UW-Madison) [email protected] · Spyros Blanas (The Ohio State University) [email protected]
Abstract
Most of the prior work in massively parallel data processing assumes homogeneity, i.e., every computing unit has the same computational capability and can communicate with every other unit with the same latency and bandwidth. However, this strong assumption of a uniform topology rarely holds in practical settings, where computing units are connected through complex networks. To address this issue, Blanas et al. [ ] recently proposed a topology-aware massively parallel computation model that integrates the network structure and heterogeneity into the cost model. The network is modeled as a directed graph, where each edge is associated with a cost function that depends on the data transferred between the two endpoints. The computation proceeds in synchronous rounds, and the cost of each round is measured as the maximum cost over all the edges in the network.

In this work, we take the first step into investigating three fundamental data processing tasks in this topology-aware parallel model: set intersection, cartesian product, and sorting. We focus on network topologies that are tree topologies, and present both lower bounds and (asymptotically) matching upper bounds. The optimality of our algorithms is with respect to the initial data distribution among the network nodes, instead of assuming a worst-case distribution as in previous results. Apart from the theoretical optimality of our results, our protocols are simple, use a constant number of rounds, and, we believe, can be implemented in practical settings as well.

1 Introduction

The popularity of massively parallel data processing systems has led to an increased interest in studying the formal underpinnings of massively parallel models. As a simplification of the Bulk Synchronous Parallel (BSP) model [ ], the Massively Parallel Computation (MPC) model [ ] has enjoyed much success in studying algorithms for query evaluation [7, 8, 28, 25, 27, 26, 48], as well as other fundamental data processing tasks [24, 2, 21, 3, 6, 22, 4]. In the MPC model, any pair of compute nodes in a cluster communicates via a point-to-point channel. Computation proceeds in synchronous rounds: at each round, all nodes first exchange messages and then perform computation on their local data.

Algorithms in the MPC model operate on a strong assumption of homogeneity: every compute node has the same data processing capability and communicates with every other node with the same latency and bandwidth. In practice, however, large deployments are heterogeneous in their computing capabilities, often consisting of different generations of CPUs and GPUs. In the cloud, the speed of communication differs based on whether the compute nodes are located within the same rack, across racks, or across datacenters. In addition to static effects from the network topology, a model needs to capture the dynamic effects of different algorithms that may cause network contention. This homogeneity assumption is not confined to the theoretical development of algorithms; it is also used when deploying algorithms in the real world.

Recent work has started taking into account the impact of network topology for data processing.
In the model proposed by Chattopadhyay et al. [12, 11], the underlying network is modeled as a graph, where nodes communicate with their neighbors through the connecting edges. Computation proceeds in rounds: in each round, Õ(1) bits can be exchanged per edge. (The notation Õ hides a polylogarithmic factor in the input size.) The complexity of algorithms in such a model is measured by the number of rounds. Using the same model, Langberg et al. [ ] prove tight topology-sensitive bounds on the round complexity for computing functional aggregate queries. Although these algorithmic results have appealing theoretical guarantees, they are unrealistic starting points for implementation. As the number of rounds required is usually polynomial in the data size, the synchronization cost would be extremely high in practice. In addition, the amount of data that can be exchanged per edge in each round is too small; the compute nodes in today's mainstream parallel data processing systems can process gigabytes of data in each round.

Recently, Blanas et al. [ ] proposed a new massively parallel data processing model that is aware of the network topology as well as the network bandwidth. The underlying communication network is represented as a directed graph, where each edge is associated with a cost function that depends on the data transferred between the two endpoints. A subset of the nodes in the network consists of compute nodes, i.e., nodes that can store data and perform computation; the remaining nodes can only route data to the desired destination. Computation still proceeds in rounds: in each round, each compute node sends data to other compute nodes, receives data, and then performs local computation. There is no limit on the size of the data that can be transmitted per edge; the cost is defined as the sum across all rounds of the maximum cost over all edges in the network at each round. This model is general enough to capture the MPC model as a special case.

In this work, we use the above topology-aware model to prove lower bounds and design algorithms for three fundamental data processing tasks: set intersection, cartesian product, and sorting. These three tasks are essential building blocks for evaluating any complex analytical query in a data processing system. In contrast to prior work, which either assumes a worst-case or a uniform initial data distribution over the nodes in the network, we study algorithms in a more fine-grained manner by assuming that the cardinality of the initial data placed at each node can be arbitrary and is known in advance. This information allows us to build more optimized algorithms that can take advantage of data placement to discover a more efficient communication pattern.

Our contributions.
We summarize our algorithmic results in Table 1. Our results are restricted to network topologies that have two properties. First, they are symmetric, i.e., for each link (u, v) there exists a link (v, u) with the same bandwidth. Second, the network graph is a tree. Even with these two restrictions, we can capture several widely deployed topologies, such as star topologies and fat trees. All our algorithms are simple to describe and run either in a single round or in a constant number of rounds, hence requiring minimal synchronization. We thus believe that they form a good starting point for an efficient practical implementation. We next present our results for each data processing task in more detail.

Task                 Algorithm       Rounds   Cost (vs. optimal)
Set intersection     randomized      1        O(log |V| log N) with high probability
Cartesian product    deterministic   1        O(1)
Sorting              randomized      O(1)     O(1) with high probability

Table 1: A summary of our results. The network graph is G = (V, E), while the size of the input data is denoted by N.
Set Intersection (Section 3). In this task, we want to compute the intersection R ∩ S of two sets. Our lower bound for set intersection uses classic results from communication complexity on the lopsided set disjointness problem. This lower bound has a rather complicated form (as shown in Section 3.1), since each link has a different data capacity budget depending on the underlying network as well as the initial data distribution. Since set intersection is a computation-light but communication-heavy task, the challenge is how to effectively route the data according to the capacity of each link. We design a single-round randomized routing strategy for set intersection that matches the lower bound with high probability, losing only a polylogarithmic factor (w.r.t. the input size and network size). Surprisingly, the routing depends only on the topology and initial data placement, but not on the bandwidth of the links.

Cartesian Product (Section 4). Here we want to compute the cross product R × S of two sets. This task is fundamental for various join operators, such as natural join, θ-join, similarity join, and set containment join. We derive two lower bounds of different flavors. The first lower bound has a similar form to that for set intersection. The second lower bound uses instead a counting argument, which states that each pair in the cartesian product must be enumerated by at least one compute node, and the two elements participating in this result must reside on the same node when it is enumerated. We propose a one-round deterministic routing strategy for computing the cartesian product, which has asymptotically optimal guarantees. Our protocol generalizes the HyperCube algorithm that is used to compute the cartesian product in the MPC model [ ].

Sorting (Section 5). We first define a valid ordering of compute nodes as any left-to-right traversal of the underlying network tree, after picking an arbitrary node as the tree root. If the ordering of compute nodes is v_1, v_2, ..., v_{|V_C|}, at the end of the algorithm all elements on node v_i are in sorted order and smaller than those on node v_j if i < j. Our lower bound again has a similar form to the one we derived for set intersection. We present a sampling-based sorting algorithm which runs in a constant number of rounds and matches our lower bound with high probability. The protocol is again independent of the topology and the bandwidth, and depends only on the initial placement of the data.

2 The Computation Model

In this section, we present the computational model we will use for this work.
Network Model.
We model the network topology using a directed graph G = (V, E). Each edge e ∈ E represents a network link with bandwidth w_e ≥ 0, where the direction of the edge captures the direction of the data flow. We distinguish a subset of nodes in the network, V_C ⊆ V, to be compute nodes. Compute nodes are the only nodes in the network that can store data and perform computation on their local data. Non-compute nodes can only route data. We only consider connected networks, where every pair of compute nodes is connected through a directed path.

Computation.
A parallel algorithm A proceeds in sequential rounds (or phases). We denote by r ∈ ℕ the number of rounds of the algorithm. In the beginning, each compute node v ∈ V_C holds part of the input I, denoted X(v) ⊆ I. In this work, we assume that {X(v)}_{v ∈ V_C} forms a partition of the input I; in other words, there is no initial data duplication across the nodes. The goal of the algorithm is to compute a function over the input I, such that in the end the compute nodes together hold the function output. We also assume that the algorithm A has knowledge of the following: (i) the topology of the graph, (ii) the bandwidth of each link, and (iii) |X(v)| for each compute node v ∈ V_C. In the case of relational data, we further assume that the algorithm knows the cardinality of the local fragment of each relation.

We use X_i(v) to denote the data stored at compute node v ∈ V_C after the i-th round completes, where i = 1, ..., r. At every round, the compute nodes first perform some computation on their local data. Then, they communicate by sending data to other compute nodes in the network. We assume that for a data transfer from compute node u to compute node v, the algorithm must explicitly specify the routing path (or a collection of routing paths). We use Y_i(e) to denote the data that is routed through link e during round i, and |Y_i(e)| to denote its total size measured in bits.

Cost Model.
Since the algorithm proceeds in sequential rounds, we can decompose the cost of the algorithm, denoted cost(A), as the sum of the costs for each round i:

$$\mathrm{cost}(A) = \sum_{i=1}^{r} \mathrm{cost}_i(A)$$

The model captures the cost of each round by considering only the cost of communication. The cost of the i-th round is

$$\mathrm{cost}_i(A) := \max_{e \in E(G)} |Y_i(e)| / w_e.$$

In other words, the cost of each round is captured by the cost of transferring data through the most bottlenecked link in the network. In some cases, it will be convenient to express the cost using tuples/elements instead of bits, which we will mention explicitly.

[Figure 1: Common computer network topologies have structure, which permits more efficient solutions than what is feasible for arbitrary topologies. (a) Star topology. (b) Tree topology.]

Even though the model does not take into account any computation time in the cost, it is possible to incorporate computation costs in the model by appropriately transforming the underlying graph; for more details, see [ ]. We should note here that our model does not capture factors such as congestion on a router node, or communication delays due to a large network diameter.
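To make the cost definition concrete, the following minimal sketch computes the cost of one round from the per-edge traffic |Y_i(e)| and the bandwidths w_e. The dictionary-based graph encoding and all names are our own illustration, not part of the model's formal definition.

```python
# A minimal sketch of the per-round cost: cost_i(A) = max_e |Y_i(e)| / w_e.
# Edges are keyed by (source, target); traffic[e] is |Y_i(e)| in bits and
# bandwidth[e] is w_e. These dictionaries are illustrative assumptions.

def round_cost(traffic: dict, bandwidth: dict) -> float:
    """Cost of one round: the most bottlenecked link dominates."""
    return max(traffic.get(e, 0) / w for e, w in bandwidth.items())

def total_cost(rounds: list, bandwidth: dict) -> float:
    """cost(A) = sum over rounds of the per-round bottleneck cost."""
    return sum(round_cost(traffic, bandwidth) for traffic in rounds)

# Example: a tiny star with center 'o' and compute nodes 'v1', 'v2'.
bandwidth = {('v1', 'o'): 10.0, ('o', 'v2'): 5.0,
             ('v2', 'o'): 5.0, ('o', 'v1'): 10.0}
round1 = {('v1', 'o'): 100, ('o', 'v2'): 100}   # v1 ships 100 bits to v2
print(total_cost([round1], bandwidth))           # max(100/10, 100/5) = 20.0
```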
Even though the model supports general network topologies, computer networks often have a specific structure. When the underlying topology has some structure, several problems (such as routing [5, 35, 14, 13]) admit more efficient solutions than what is achievable for general topologies. It is therefore natural to consider restrictions on the topology that are of either theoretical or practical interest.
Symmetric Network.
Wired networks support full-duplex operation that allows simultaneous communication in both directions of a link. Furthermore, datacenter networks allocate the same bandwidth for transmitting and receiving data for each node. These networks are represented in the model using a symmetric network. We say that a network topology is symmetric if for every edge e = (u, v) ∈ E we also have that e′ = (v, u) ∈ E with w_e = w_{e′}. In other words, the cost of sending data from u to v is the same as the cost of sending the same data from v to u.

Star Topology.
The most common topology for small clusters is the star topology, where all computers are connected to a single switch. A star network with p + 1 nodes has p compute nodes V_C = {v_1, ..., v_p} that are all connected to a central node w that only does routing. Figure 1a depicts an example of a star network. Within a node, a multi-core CPU also exhibits a star topology: individual CPU cores exchange data through a shared cache and memory hierarchy, which implicitly forms the center of the star.

Tree Topology.
As the network grows, a single router is no longer sufficient to connect all nodes. A common solution to scale the network further is to arrange r routers {w_1, ..., w_r} in a star topology, and connect the p compute nodes V_C = {v_1, ..., v_p} to individual routers. Figure 1b shows an example of a tree topology. A key property of a tree topology is that there exists a unique directed path between any two compute nodes, hence routing is trivial.

In this work, we will focus on symmetric tree topologies. We make two observations about such topologies (a normalization sketch follows the list):

• We can assume w.l.o.g. that every compute node is a leaf. Indeed, if we have a non-leaf compute node v ∈ V_C, we can transform G into a new graph G′ by adding a new compute node v′, introducing a new link between v, v′ with bandwidth +∞, and turning v into a routing-only node.

• We can assume w.l.o.g. that there are no nodes with degree 2. Indeed, consider a non-leaf node v with two adjacent edges e_1 = (v, u_1) and e_2 = (v, u_2). We can then remove v, and replace the two edges with a single edge e = (u_1, u_2) with bandwidth min{w_{e_1}, w_{e_2}}.
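The following sketch applies the two w.l.o.g. normalizations above to a symmetric tree. The adjacency-dictionary representation, the fresh node id, and the function names are our own assumptions for illustration.

```python
# A minimal sketch of the two normalizations, assuming the symmetric tree
# is stored as adj[node] = {neighbor: bandwidth}.
import math

def push_compute_to_leaf(adj, compute, v):
    """If compute node v is not a leaf, attach a fresh leaf v_new via an
    infinite-bandwidth link and demote v to a routing-only node."""
    if v in compute and len(adj[v]) > 1:
        v_new = (v, "leaf")              # hypothetical fresh node id
        adj[v][v_new] = math.inf
        adj[v_new] = {v: math.inf}
        compute.discard(v)
        compute.add(v_new)

def contract_degree_two(adj, v):
    """Remove a degree-2 node v, merging its two incident edges into one
    whose bandwidth is the minimum of the two."""
    if len(adj[v]) == 2:
        (u1, w1), (u2, w2) = adj.pop(v).items()
        del adj[u1][v]; del adj[u2][v]
        adj[u1][u2] = adj[u2][u1] = min(w1, w2)
```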
We discuss here how the topology-aware model can capture the MPC model [30, 7] as a special case. Recall that in the MPC model we have a collection of p nodes. The MPC model is topology-agnostic: every machine can communicate with any other machine, and the cost of a round is defined as the maximum amount of data that is received during this round across all machines. The MPC model corresponds to an asymmetric star topology with p compute nodes. For every edge e = (v_i, o) that goes from a compute node to the center o the bandwidth is w_e = +∞, while for the inverse edge e′ = (o, v_i) the bandwidth is w_{e′} = 1.

It should be noted that all previous works using the MPC model assume a uniform data distribution, where each node initially receives N/p data, where N is the input size. This assumption has been used both for lower and upper bounds. In contrast, our algorithms and lower bounds take the sizes of the initial data distribution as parameters.
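As an illustration of this construction (our own encoding, not from the paper), the asymmetric star that simulates an MPC instance with p machines can be built as follows:

```python
# A minimal sketch: the MPC model as an asymmetric star. Uplinks to the
# center 'o' have infinite bandwidth; downlinks have bandwidth 1, so a
# round's cost equals the maximum amount of data received by any machine.
import math

def mpc_as_star(p: int) -> dict:
    bandwidth = {}
    for i in range(p):
        v = f"v{i}"
        bandwidth[(v, "o")] = math.inf   # sending is free
        bandwidth[("o", v)] = 1.0        # receiving is the bottleneck
    return bandwidth

print(len(mpc_as_star(4)))  # 8 directed links
```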
3 Set Intersection

In the set intersection problem, we are given two sets R, S. Our goal is to enumerate all pairs (r, s) ∈ R × S with r = s, i.e., to enumerate the intersection R ∩ S. Note that there is no designated node for each output pair, as long as it is emitted by at least one node. We assume that all elements from both sets are drawn from the same domain.

Given an initial distribution D of the data across the compute nodes, we denote by R_v^D, S_v^D the elements from R and S respectively in node v. Let N_v^D = |R_v^D| + |S_v^D|, and N^D = Σ_v N_v^D = |R| + |S|. Whenever the context is clear, we will drop the superscript D from the notation.

3.1 Lower Bound

We present a lower bound on the cost for the case of a symmetric tree topology. To prove the lower bound, we use a reduction from the lopsided set disjointness problem in communication complexity. In this problem, Alice holds a set X of n elements and Bob holds a set Y of m elements from some common domain. The goal is to decide whether the intersection X ∩ Y is empty while minimizing communication.
It is known [43, 19] that for any multi-round randomized communication protocol, either Alice has to send Ω(n) bits to Bob, or Bob has to send Ω(m) bits to Alice.

To construct the reduction, we observe that any edge e = (u, v) defines a partitioning of the compute nodes in the tree G into two subsets: V_e^- and V_e^+. Here, V_e^- is the set of compute nodes on the same side as u, and V_e^+ the set on the same side as v. Hence, any algorithm that computes the set intersection in the tree topology also solves a lopsided set disjointness problem, where Alice holds all data located in V_e^-, Bob holds all data located in V_e^+, and they can only communicate through the edge e. Following this core idea, we can show the following lower bound.
Theorem 1. Let G = (V, E) be a symmetric tree topology. Any algorithm computing the intersection R ∩ S has cost Ω(C_LB), where

$$C_{LB} = \max_{e \in E} \frac{1}{w_e} \cdot \min\Big\{ |R|,\; |S|,\; \sum_{v \in V_e^-} N_v,\; \sum_{v \in V_e^+} N_v \Big\}.$$

Observe that the above lower bound holds independently of the number of rounds that the algorithm uses.
Proof.
Consider an edge e ∈ E. Any algorithm that computes the set intersection R ∩ S must solve the following problem. Alice holds two sets, R_A = ∪_{v ∈ V_e^-} R_v and S_A = ∪_{v ∈ V_e^-} S_v. Similarly, Bob holds two sets, R_B = ∪_{v ∈ V_e^+} R_v and S_B = ∪_{v ∈ V_e^+} S_v. Then, Alice and Bob must together compute two set intersections, R_A ∩ S_B and R_B ∩ S_A, communicating only through the link e with bandwidth w_e. The lower bound for lopsided set disjointness tells us that in order to compute R_A ∩ S_B we need to communicate Ω(min{|R_A|, |S_B|}) bits, and for R_B ∩ S_A we need at least Ω(min{|R_B|, |S_A|}) bits. Hence, the cost of any algorithm must be Ω(C), where:

$$C = \frac{1}{w_e} \max\big( \min\{|R_A|, |S_B|\},\ \min\{|R_B|, |S_A|\} \big) \;\geq\; \frac{1}{2 w_e} \min\big\{ |R_A| + |R_B|,\ |S_A| + |S_B|,\ |R_A| + |S_A|,\ |R_B| + |S_B| \big\} \;=\; \frac{1}{2 w_e} \min\Big\{ |R|,\ |S|,\ \sum_{v \in V_e^-} N_v,\ \sum_{v \in V_e^+} N_v \Big\}$$

Applying the above argument to every edge in the tree G, we obtain the desired result.

3.2 Warm-up on Symmetric Star

We first consider the star topology to present some of the key ideas. W.l.o.g. we assume |R| ≤ |S|. We present a one-round algorithm that is based on randomized hashing.

Our algorithm (Algorithm 1) at its core performs a randomized hash join. It first partitions the compute nodes into two subsets, V_α and V_β, depending on the size of their local data. Define N′ = |R| + Σ_{v ∈ V_α} |S_v|. Let h be a random hash function that maps each value a in the domain independently to node v ∈ V_C with the following probability:

$$\Pr[h(a) = v] = \begin{cases} N_v / N' & \text{if } v \in V_\alpha \\ |R_v| / N' & \text{if } v \in V_\beta \end{cases}$$

If V_β = ∅, then the algorithm performs a distributed hash join using the above hash function h. Observe that the algorithm does not hash each value uniformly across the compute nodes, but with probability proportional to the amount of input data N_v that each node holds.

If V_β ≠ ∅, we perform hashing only on a subset of the data, using a subset of the nodes. In particular, each node v ∈ V_β first gathers all the elements from R (the smaller set) and locally computes R ∩ S_v, while hashing is used to compute the remaining set intersection. After the data is communicated, the intersection can be computed locally at each node.
Algorithm 1: StarIntersect(G, D)
1: V_α ← {v ∈ V_C | min{N_v, N − N_v} < |R|}; V_β ← V_C \ V_α
2: for v ∈ V_C do
3:     send every a ∈ R_v^D to all nodes in V_β ∪ {h(a)}
4:     if v ∈ V_α then send every a ∈ S_v^D to h(a)
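The routing rule of Algorithm 1 is easy to state in code. The sketch below is our own illustration: the random hash h is simulated with a seeded weighted choice, so all senders route a given element consistently within one run, and all names are assumptions.

```python
# A minimal sketch of StarIntersect's routing rule. h(a) picks node v with
# probability N_v/N' (v in V_alpha) or |R_v|/N' (v in V_beta); seeding the
# choice with the element makes every sender agree on h(a).
import random

def star_routes(a, is_r, holder, N, R_sizes):
    """Destinations for element a held by `holder` under Algorithm 1."""
    total, R_total = sum(N.values()), sum(R_sizes.values())
    V_alpha = {v for v in N if min(N[v], total - N[v]) < R_total}
    V_beta = set(N) - V_alpha
    nodes = sorted(N)
    weights = [N[v] if v in V_alpha else R_sizes[v] for v in nodes]
    rng = random.Random(hash(a))                 # same element -> same h(a)
    h_a = rng.choices(nodes, weights=weights, k=1)[0]
    if is_r:
        return V_beta | {h_a}                    # replicate R to all beta-nodes
    return {h_a} if holder in V_alpha else set() # beta-nodes keep S locally
```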
We next show that the above algorithm is optimal within a polylogarithmic factor.

Lemma 1. Let G = (V, E) be a symmetric star topology, and consider sets R, S with N = |R| + |S|. Then, StarIntersect computes the set intersection R ∩ S with cost O(log N log |V|) away from the optimal solution with high probability.

Proof. The correctness of the algorithm is straightforward. We next bound the cost of the algorithm. We measure the cost in elements of the sets; to translate to bits, it suffices to add a log(N) factor, which captures the number of bits necessary to represent each element.

To simplify the notation, we use w_v to refer to the bandwidth w_e of the edge e = (v, w), where v ∈ V_C and w is the central node of the star topology. We can now reformulate the lower bound from Theorem 1 as

$$C_{LB} = \max\Big\{ \max_{v \in V_\alpha} \frac{\min\{N_v, N - N_v\}}{w_v},\ \max_{v \in V_\beta} \frac{|R|}{w_v} \Big\}$$

We now distinguish two cases, depending on whether the edge is adjacent to a node in V_α or in V_β.

Case 1: v ∈ V_β. Consider the two edges (v, w) and (w, v). The number of tuples sent through edge (v, w) is |R_v| ≤ |R|. As for the tuples received, node v receives |R| − |R_v| tuples from R, as well as some tuples from S, whose number in expectation is (|R_v| / N′) · Σ_{u ∈ V_α} |S_u| ≤ |R_v|. Thus, the cost incurred by edges adjacent to V_β is max_{v ∈ V_β} |R| / w_v ≤ C_LB. Even though the above analysis only bounds the expectation, we can use Chernoff bounds to show that, with probability polynomially small in the number of compute nodes, the number of tuples does not exceed the expectation by more than an O(log |V|) factor on any of the edges.

Case 2: v ∈ V_α. We bound separately the number of R-tuples and S-tuples that go through each edge. The expected number of S-tuples that go through edge (w, v) is

$$\Big( \sum_{u \in V_\alpha} |S_u| - |S_v| \Big) \cdot \frac{N_v}{N'} \;\leq\; (N' - |R| - |S_v|) \cdot \frac{N_v}{N'} \;\leq\; \frac{(N' - N_v)\, N_v}{N'} \;\leq\; \min\{N_v, N - N_v\}$$

The third inequality is a direct application of the fact that min{a, b} ≥ a·b/(a + b) for any a, b ≥ 0, together with N′ ≤ N. Similarly, the expected number of S-tuples that go through edge (v, w) is

$$|S_v| \cdot \frac{N' - N_v}{N'} \;\leq\; \frac{(N' - N_v)\, N_v}{N'} \;\leq\; \min\{N_v, N - N_v\}$$

For R-tuples, we distinguish two cases. If V_β = ∅, then we can bound the expected size using the same argument as above for S-tuples. We now turn to the case where V_β ≠ ∅.

We first claim that N_v ≤ N − N_v for each vertex v ∈ V_α. Indeed, if not, then we must have N − N_v < |R|, which implies that N_v > |S|. However, this is a contradiction: there exists u ∈ V_β with N_u ≥ |R|, and then N_v + N_u > |S| + |R| = N. Hence, it suffices to bound the number of R-tuples that go through each edge by N_v.

Indeed, the number of R-tuples that go through (v, w) for v ∈ V_α is at most |R_v| ≤ N_v. As for the edge (w, v), the expected number of tuples that use the edge is

$$(|R| - |R_v|) \cdot \frac{N_v}{N'} \;\leq\; \frac{|R|}{N'} \cdot N_v \;\leq\; N_v$$

Combining these two cases yields the desired claim. Note that all expectation calculations can be extended to high-probability statements by losing a factor of O(log |V|), as mentioned before.

3.3 General Symmetric Tree

We now generalize the algorithm for the star topology to an arbitrary (symmetric) tree topology. W.l.o.g. we assume |R| ≤ |S|.
We partition all edges in E into two subsets:

$$E_\alpha = \Big\{ e \in E \;\Big|\; \min\Big\{ \sum_{v \in V_e^+} N_v,\ \sum_{v \in V_e^-} N_v \Big\} < |R| \Big\}$$
$$E_\beta = \Big\{ e \in E \;\Big|\; \min\Big\{ \sum_{v \in V_e^+} N_v,\ \sum_{v \in V_e^-} N_v \Big\} \geq |R| \Big\}$$

An edge e ∈ E is called an α-edge if e ∈ E_α, and a β-edge if e ∈ E_β. Observe that the definition is symmetric w.r.t. the direction of the edge: if (u, v) is an α-edge, so is (v, u). The intuition behind this partition lies in the lower bound of Theorem 1, where the amount of data that can go through an α-edge is O(min{Σ_{v ∈ V_e^+} N_v, Σ_{v ∈ V_e^-} N_v}) and through a β-edge is O(|R|). We denote by G_β the edge-induced subgraph of the edge set E_β.
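Classifying edges needs only subtree weights: rooting the tree anywhere, one side of each edge carries the weight of the subtree below it. A sketch under that assumption (adjacency-dict tree, data sizes N_v on compute leaves; all names are ours):

```python
# A minimal sketch of the alpha/beta edge classification. For each tree
# edge (child, parent), the child-side weight is the subtree sum of N_v;
# the other side holds the rest. An edge is beta iff both sides weigh >= |R|.

def classify_edges(adj, N, root, R_size):
    """Return (alpha, beta) edge lists for a symmetric tree.
    adj: adj[u] = set of neighbors; N: data size per compute node."""
    total = sum(N.values())
    alpha, beta = [], []

    def dfs(u, parent):
        below = N.get(u, 0)               # routing nodes contribute 0
        for x in adj[u]:
            if x != parent:
                below += dfs(x, u)
        if parent is not None:
            side = min(below, total - below)  # lighter side of (u, parent)
            (beta if side >= R_size else alpha).append((u, parent))
        return below

    dfs(root, None)
    return alpha, beta
```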
Lemma 2. The subgraph G_β is a connected tree.

Proof. For the sake of contradiction, assume there exist vertices u, v ∈ V(G_β) such that u, v are not connected in G_β. Then, there exists an α-edge e on the unique path that connects u and v in G. In turn, e splits G into two connected subtrees: G_e^+ (which contains all nodes in V_e^+) and G_e^- (which contains all nodes in V_e^-). Suppose w.l.o.g. that u ∈ V(G_e^+) and v ∈ V(G_e^-).

Since u, v belong in the edge-induced subgraph of E_β, there exist β-edges e_1 ∈ G_e^+ and e_2 ∈ G_e^-. Orienting e_1, e_2 away from e, we observe that V_{e_1}^+ ⊆ V_e^+ and V_{e_2}^- ⊆ V_e^-, which implies |R| ≤ Σ_{v ∈ V_{e_1}^+} N_v ≤ Σ_{v ∈ V_e^+} N_v and, similarly, |R| ≤ Σ_{v ∈ V_e^-} N_v. In this way, e would be a β-edge, contradicting our assumption.
[Figure 2: An illustration of a balanced partition.]

On the other hand, the edge-induced subgraph G_α derived from E_α is not necessarily connected and forms a forest.

Balanced Partition.
The first step of our algorithm is to compute a partition {V_C^1, V_C^2, ..., V_C^k} of the compute nodes V_C. In particular, the algorithm seeks a balanced partition, as illustrated in Figure 2.
Definition 1. A partition {V_C^1, V_C^2, ..., V_C^k} of the compute nodes V_C is balanced for data distribution D if the following properties hold:

1. If two nodes are connected in G_α, they belong in the same block of the partition;
2. Each edge appears in the spanning tree of at most one block of the partition;
3. For every block i, Σ_{v ∈ V_C^i} N_v^D ≥ |R|;
4. For every β-edge e in the spanning tree of a block i, min{Σ_{v ∈ V_C^i ∩ V_e^+} N_v, Σ_{v ∈ V_C^i ∩ V_e^-} N_v} ≤ |R|.

Before we show how to find a balanced partition, we first discuss how we can use it to compute the set intersection.
The Algorithm.
Let {V_C^1, V_C^2, ..., V_C^k} be a balanced partition. For every block V_C^i, we define a random hash function h_i that maps each value a in the domain independently to node v ∈ V_C^i with probability

$$\Pr[h_i(a) = v] = \frac{N_v}{\sum_{u \in V_C^i} N_u}$$

Using the above probabilities, we can now describe the detailed algorithm (Algorithm 2), which works in a single round. Each R-tuple is hashed across all blocks of the partition (hence it may be replicated), while each S-tuple is hashed only within the block that contains the node it resides on. After all data is communicated, each node locally computes the set intersection.
Algorithm 2: TreeIntersect(G, D)
1: find a balanced partition {V_C^1, V_C^2, ..., V_C^k}
2: for v ∈ V_C do
3:     for i = 1, ..., k do send every a ∈ R_v^D to h_i(a)
4:     let i be the block such that v ∈ V_C^i; send every a ∈ S_v^D to h_i(a)
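A sketch of this routing rule, assuming the balanced partition has already been computed; the per-block seeded weighted hashing mirrors the h_i functions, and all names are our own:

```python
# A minimal sketch of TreeIntersect's routing: R-tuples are replicated into
# every block via h_i; an S-tuple is hashed only inside its holder's block.
import random

def h(a, i, block, N):
    """h_i: weighted hash of element a onto one node of block i."""
    nodes = sorted(block)
    rng = random.Random(hash((a, i)))     # same (element, block) -> same node
    return rng.choices(nodes, weights=[N[v] for v in nodes], k=1)[0]

def destinations(a, is_r_tuple, holder, blocks, N):
    """blocks: the balanced partition, as a list of node sets."""
    if is_r_tuple:                         # R-tuples replicate into every block
        return {h(a, i, b, N) for i, b in enumerate(blocks)}
    i = next(j for j, b in enumerate(blocks) if holder in b)
    return {h(a, i, blocks[i], N)}         # S-tuples stay inside their block
```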
Theorem 2. On a symmetric tree topology G = (V, E), the set intersection R ∩ S with |R| + |S| = N can be computed in a single round with cost O(log N log |V|) away from the optimal solution with high probability.

Proof. The correctness of the algorithm comes from the fact that each subset of nodes V_C^i computes R ∩ (∪_{v ∈ V_C^i} S_v). Since S = ∪_{i=1}^{k} ∪_{v ∈ V_C^i} S_v, it follows that the algorithm computes all results in R ∩ S.

We next analyze the cost. As before, we measure the cost in number of tuples, and then pay an O(log N) factor to translate to bits. We first rewrite the lower bound as:

$$C_{LB} = \max\Big\{ \max_{e \in E_\alpha} \frac{1}{w_e} \min\Big\{ \sum_{v \in V_e^+} N_v,\ \sum_{v \in V_e^-} N_v \Big\},\ \max_{e \in E_\beta} \frac{|R|}{w_e} \Big\}$$
We analyze the cost for the edges in E_α and E_β separately.

Case e ∈ E_β. We will bound the amount of data that goes through e by O(|R|). The R-tuples that go through e are at most |R|, so it suffices to bound the number of S-tuples that cross edge e. By property (2) of a balanced partition, e is included in at most one spanning tree, say of block V_C^i. Then, the expected number of S-tuples that go through e is at most

$$\frac{1}{\sum_{v \in V_C^i} N_v} \cdot \sum_{v \in V_C^i \cap V_e^-} N_v \cdot \sum_{v \in V_C^i \cap V_e^+} N_v \;\leq\; \min\Big\{ \sum_{v \in V_C^i \cap V_e^-} N_v,\ \sum_{v \in V_C^i \cap V_e^+} N_v \Big\} \;\leq\; |R|$$

The first inequality comes from the fact that a·b/(a + b) ≤ min{a, b} for any a, b > 0.
The second inequality is implied directly by property (4) of a balanced partition.
Case e ∈ E_α. We will bound the amount of data that goes through e by min{Σ_{v ∈ V_e^-} N_v, Σ_{v ∈ V_e^+} N_v}. To bound the number of S-tuples, we again notice that e can belong in the spanning tree of at most one block, say V_C^i. Hence, as in the previous case, the expected number of S-tuples that go through e is at most

$$\frac{1}{\sum_{v \in V_C^i} N_v} \cdot \sum_{v \in V_C^i \cap V_e^-} N_v \cdot \sum_{v \in V_C^i \cap V_e^+} N_v \;\leq\; \min\Big\{ \sum_{v \in V_C^i \cap V_e^-} N_v,\ \sum_{v \in V_C^i \cap V_e^+} N_v \Big\} \;\leq\; \min\Big\{ \sum_{v \in V_e^-} N_v,\ \sum_{v \in V_e^+} N_v \Big\}$$

We can bound the number of R-tuples that go through e by distinguishing three cases:

• Neither G_e^- nor G_e^+ contains β-edges. Then, the partition consists of a single block, and the number of R-tuples can be bounded as we did above for the S-tuples.

• G_e^+ contains β-edges but G_e^- does not. Then, all vertices of G_β are in V_e^+. The R-data that goes through e is sent by nodes in V_e^-, so its size is bounded by Σ_{v ∈ V_e^-} |R_v| ≤ Σ_{v ∈ V_e^-} N_v = min{Σ_{v ∈ V_e^-} N_v, Σ_{v ∈ V_e^+} N_v}. Here, the last equality follows from the fact that G_e^+ contains at least one β-edge, which implies Σ_{v ∈ V_e^+} N_v ≥ |R| > Σ_{v ∈ V_e^-} N_v (the strict inequality holds since e is an α-edge).

• G_e^- contains β-edges but G_e^+ does not. Then, all nodes in V_e^+ belong in the same block V_C^i. We can bound the expected number of R-tuples that cross e by

$$\frac{1}{\sum_{v \in V_C^i} N_v} \cdot \sum_{v \in V_e^-} |R_v| \cdot \sum_{v \in V_C^i \cap V_e^+} N_v \;\leq\; \frac{\sum_{v \in V_e^-} |R_v| + \sum_{v \in V_C^i \cap V_e^+} N_v}{\sum_{v \in V_C^i} N_v} \cdot \min\Big\{ \sum_{v \in V_e^-} |R_v|,\ \sum_{v \in V_C^i \cap V_e^+} N_v \Big\} \;\leq\; \frac{|R| + \sum_{v \in V_C^i} N_v}{\sum_{v \in V_C^i} N_v} \cdot \min\Big\{ \sum_{v \in V_e^-} N_v,\ \sum_{v \in V_e^+} N_v \Big\} \;\leq\; 2 \min\Big\{ \sum_{v \in V_e^-} N_v,\ \sum_{v \in V_e^+} N_v \Big\}$$

where the last inequality follows from property (3) of Definition 1.

This completes the proof.

Finding a Balanced Partition.
Finally, we present how to compute a balanced partition in Algorithm 3. We say that two vertices in G are α-connected if there exists a path that uses only α-edges connecting them. For the algorithm below, we denote by Γ(x) the set of compute nodes that are α-connected with node x in G (line 2). Moreover, we use w(x) to denote the quantity Σ_{v ∈ Γ(x)} N_v, i.e., the total amount of data in the nodes from Γ(x).

Algorithm 3: BalancedPartition(G, D)
1: for x ∈ V(G_β) do
2:     Γ(x) ← {v ∈ V_C | v, x are α-connected in G}
3: P ← ∅
4: while |V(G_β)| > 0 do
5:     pick the leaf vertex x ∈ G_β with the smallest w(x)
6:     if w(x) ≥ |R| then add Γ(x) to P
7:     else
8:         y ← unique neighbor of x in G_β
9:         Γ(y) ← Γ(y) ∪ Γ(x)
10:    G_β ← G_β \ {x}
11: return P

The algorithm initially creates a group for each set of compute nodes that are connected through α-edges. Then, it starts merging the groups (beginning from the leaves of the tree) as long as the total number of elements in a group is less than |R|. We show that the above algorithm indeed creates the desired balanced partition.
Lemma 3. Algorithm 3 outputs a balanced partition of the compute nodes V_C in O(|V|) time.

Proof. First, we notice that in lines 1-2 each compute node in V_C belongs in exactly one Γ(x). In the remainder of the algorithm, every group Γ(x) with w(x) ≥ |R| is added to P, and every other group is merged into a neighboring group, so P is a partition of V_C. Indeed, the only issue may occur when we are left with a single vertex x: we claim that in this case we always have w(x) ≥ |R|. Suppose w(x) < |R|, and consider the last vertex u for which Γ(u) was added to P (such a vertex always exists, since every leaf vertex of G_β initially has weight at least |R|). But then, the algorithm could not have picked u at that point, since all other leaf vertices have smaller weight, a contradiction.

We now prove that the output partition satisfies all properties of a balanced partition (Definition 1).

(1) The first condition is trivial. From lines 1-2, two compute nodes that are connected in G_α will be in the same initial Γ(x), hence they will appear together in a block of the partition.

(2) By contradiction, assume there exists an edge e = (u, v) appearing in the spanning trees of V_C^i and V_C^j for i ≠ j. By the definition of spanning trees, there exists a pair of vertices x, y ∈ V_C^i and a pair of vertices x′, y′ ∈ V_C^j such that x, x′ ∈ G_e^+ and y, y′ ∈ G_e^-. When Algorithm 3 visits e in line 9, assume w.l.o.g. that u is visited before v. Since x, x′ are placed in different blocks of the partition, it cannot be that both x, x′ ∈ Γ(u). W.l.o.g., x′ ∉ Γ(u). This implies that x′ has already been put into one block together with vertices from G_e^-. Then x′, y′ will not appear in the same block, contradicting our assumption.

(3) It is easy to see that the algorithm adds a set of nodes to P only if their total weight is at least |R|.

(4) Consider a block V_C^i in the partition. Let e = (u, v) be a β-edge in the spanning tree of V_C^i, and assume w.l.o.g. that Algorithm 3 visits u before v. At that point, we have w(u) < |R|, since Γ(u) was merged with Γ(v). The key observation is that Γ(u) = V_C^i ∩ V_e^-, since no other compute nodes will be added to the "left" of e (as u is a leaf node). Hence,

$$\min\Big\{ \sum_{v \in V_C^i \cap V_e^+} N_v,\ \sum_{v \in V_C^i \cap V_e^-} N_v \Big\} \;\leq\; \sum_{v \in V_C^i \cap V_e^-} N_v \;=\; w(u) \;<\; |R|$$

This completes the proof.
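A sketch of Algorithm 3 over an explicit G_β; the data layout (gbeta as adjacency sets, groups mapping each G_β vertex to its α-connected compute nodes, sizes N_v) and the names are our own assumptions.

```python
# A minimal sketch of BalancedPartition. gbeta: adjacency sets of the beta
# subtree; groups[x]: compute nodes alpha-connected to x (as sets);
# N: local data sizes; R_size = |R|.

def balanced_partition(gbeta, groups, N, R_size):
    gbeta = {x: set(nbrs) for x, nbrs in gbeta.items()}
    weight = {x: sum(N[v] for v in groups[x]) for x in gbeta}
    partition = []
    while gbeta:
        # leaf of the current beta-tree with the smallest group weight
        leaves = [x for x in gbeta if len(gbeta[x]) <= 1]
        x = min(leaves, key=lambda v: weight[v])
        if weight[x] >= R_size or not gbeta[x]:   # last vertex: w(x) >= |R|
            partition.append(set(groups[x]))
        else:
            (y,) = gbeta[x]                        # unique neighbor in G_beta
            groups[y] |= groups[x]
            weight[y] += weight[x]
        for y in gbeta.pop(x):
            gbeta[y].discard(x)
    return partition
```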
Remark.
Interestingly, the algorithm we described above does not use the link bandwidths to decide what to send and where to send it. Instead, what matters is the connectivity of the network and how the data is initially partitioned across the compute nodes. This is a significant practical advantage, because bandwidth information may be imprecise or have high variability at runtime, such as when sharing a cluster with other users.
4 Cartesian Product
In the cartesian product problem, we are given two sets R, S with |R| = |S| = N/2. (We will discuss at the end why the unequal case is challenging, even on the simple symmetric star topology.) Our goal is to enumerate all pairs (r, s) for r ∈ R, s ∈ S, such that the output pairs are distributed among the compute nodes by the end of the algorithm. Similar to set intersection, there is no designated node for each output pair, as long as it is emitted by at least one node. We assume that all elements are drawn from the same domain, and that initially the input data is partitioned across the compute nodes.

4.1 Lower Bounds

We present two lower bounds on the cost for the case of a symmetric tree topology. The first one, as stated in Theorem 3, has the same form as the one in Theorem 1 when |R| = |S| = N/2, but uses a slightly different argument. Both lower bounds are expressed in terms of elements, not bits.
Theorem 3.
Let G = (V, E) be a symmetric tree topology. Any algorithm computing R × S has (tuple) cost Ω(C_LB), where

$$C_{LB} = \max_{e \in E} \frac{1}{w_e} \cdot \min\Big\{ \sum_{v \in V_e^-} N_v,\ \sum_{v \in V_e^+} N_v \Big\}.$$

Proof.
Let C_opt be the cost of any algorithm computing R × S on the tree topology G. Consider an edge e ∈ E. Suppose that C_opt · w_e ≤ Σ_{v ∈ V_e^-} |R_v|. Then, at least one element in R_u for some u ∈ V_e^- does not go through e, i.e., does not enter any vertex in V_e^+. In this case, in order to guarantee correctness, all data in S must be sent to u, hence C_opt · w_e ≥ Σ_{v ∈ V_e^+} |S_v|. Thus, C_opt · w_e ≥ min{Σ_{v ∈ V_e^-} |R_v|, Σ_{v ∈ V_e^+} |S_v|}. Using a symmetric argument, C_opt · w_e ≥ min{Σ_{v ∈ V_e^-} |S_v|, Σ_{v ∈ V_e^+} |R_v|}. Summing up the two inequalities, and observing that min{Σ_{v ∈ V_e^-} N_v, Σ_{v ∈ V_e^+} N_v} ≤ |R| (= |S| = N/2), we obtain the claimed bound for edge e. Applying the above argument to every edge in the tree G, we obtain the desired result.

The second lower bound uses a different argument that depends on the underlying tree topology. To state the lower bound, we first define a "directed" version G† of the symmetric tree G as follows. G† has the same vertex set as G. Recall that each edge e = (u, v) in G partitions the nodes of V into V_e^+ and V_e^-. Then, if Σ_{x ∈ V_e^-} N_x ≤ Σ_{x ∈ V_e^+} N_x, G† contains only an edge from u to v, and otherwise only an edge from v to u. As the next lemma shows, the resulting directed graph G† has a very specific structure.

Lemma 4. G† satisfies the following properties:
1. The out-degree of every node is at most one.
2. There exists exactly one node with out-degree zero.

Proof. By contradiction, assume there exists one node u ∈ V with at least two outgoing edges. Since G has no vertices with degree 2, this means that G† has three edges e_1 = (u, v_1), e_2 = (u, v_2), e_3 = (u, v_3). For each such edge, we have Σ_{x ∈ V_{e_i}^+} N_x ≥ Σ_{x ∈ V_{e_i}^-} N_x, and thus Σ_{x ∈ V_{e_i}^+} N_x ≥ N/2.
Observe that, because G is a tree, the vertex sets V_{e_i}^+ are disjoint. Then we come to the following contradiction:

$$N = \sum_{x \in V} N_x \;\geq\; \sum_{i=1}^{3} \sum_{x \in V_{e_i}^+} N_x \;\geq\; \frac{3N}{2}$$

Since G† is a directed tree, it is easy to see that there must exist at least one node with no outgoing edges; otherwise, there would be a cycle in the graph, a contradiction. Hence, it suffices to show that there is at most one such node. By contradiction, assume two nodes u, v with out-degree 0. Consider the unique path between u and v: then, there must be a node on the path with out-degree at least two. However, this contradicts (1), thus (2) is proven as well.

[Figure 3 (legend: compute nodes, routing nodes): Two examples of a directed graph G†. The left one is rooted at a compute node and the right one is rooted at a router.]

We denote the single node with out-degree zero as r, and call it the root of the tree. Every other node in G† points towards r, as the example in Figure 3 illustrates. Observe that the root r of the tree could be a compute node. But in that case, the algorithm that simply routes all the data to the root is asymptotically optimal, since its cost matches the lower bound in Theorem 3. Hence, we will focus on the case where the root is not a compute node; in this case, it is easy to observe that the nodes in G† with in-degree 0 are exactly the compute nodes.

A cover of G† is a subset S ⊆ V such that every leaf node has some ancestor in S. We will be interested in minimal covers of G†. Observe that the singleton set {r} is trivially a minimal cover.
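Building G† only requires comparing subtree weights, similar to the edge classification earlier. A sketch under the same assumed tree representation (names are ours):

```python
# A minimal sketch: orient each tree edge toward its heavier side to obtain
# G-dagger. Every directed edge points from the lighter side to the heavier,
# so each node ends up with out-degree at most one (Lemma 4).

def orient_tree(adj, N, root):
    total, arrows = sum(N.values()), []

    def dfs(u, parent):
        below = N.get(u, 0)
        for x in adj[u]:
            if x != parent:
                below += dfs(x, u)
        if parent is not None:
            # the child side weighs `below`; orient toward the heavier side
            arrows.append((u, parent) if below <= total - below else (parent, u))
        return below

    dfs(root, None)
    return arrows  # the unique sink of these arrows is the root r of G-dagger
```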
Theorem 4. Let G = (V, E) be a symmetric tree topology, and let U be a minimal cover of G† such that U ≠ {r}, where r is the root of G†. Then, any algorithm computing the cartesian product R × S for |R| = |S| = N/2 has (tuple) cost Ω(C_LB), where

$$C_{LB} = \frac{N}{\sqrt{\sum_{v \in U} w_v^2}},$$

where w_v is the capacity of the unique outgoing edge of v in G†.

Proof. Let e_u be the outgoing edge of u ∈ U in G†, with capacity w_u. Let T_u be the subtree rooted at u. From the minimality of U, it follows that the subtrees T_u, T_v have disjoint vertex sets for distinct u, v ∈ U. Moreover, from the definition of a cover, every compute node belongs in some (unique) subtree. This means that we can bound the output size by the union of the outputs produced at the compute nodes of each subtree. In the following, we bound the maximum output size of a given subtree T_u.

Let R′_u, S′_u denote the elements of R, S respectively that initially reside in some compute node of T_u. Moreover, let R′′_u, S′′_u be the elements of R, S that go through link e_u respectively. Then, the size of the result that can be produced at subtree T_u is at most |R′_u ∪ R′′_u| · |S′_u ∪ S′′_u|. Observe the following:

• |R′′_u| ≤ C_opt · w_u and |S′′_u| ≤ C_opt · w_u;
• |R′_u| ≤ C_opt · w_u and |S′_u| ≤ C_opt · w_u. Indeed, since e_u is an outgoing edge of u in G†, Theorem 3 tells us that C_opt · w_u ≥ |R′_u| + |S′_u|.

Hence, we can bound the number of outputs in T_u as:

$$|R'_u \cup R''_u| \cdot |S'_u \cup S''_u| \;\leq\; (|R'_u| + |R''_u|)(|S'_u| + |S''_u|) \;\leq\; (2 C_{opt} w_u)(2 C_{opt} w_u) \;=\; 4 \, C_{opt}^2 \, w_u^2$$

In order for the algorithm to be correct, the total size of the output must be at least |R| · |S|. Summing over all nodes in the minimal cover U, we obtain |R| · |S| ≤ 4 C_opt² Σ_{u ∈ U} w_u², which gives the claimed bound. This concludes the proof.

4.2 The Weighted HyperCube Algorithm

In this section, we present a deterministic one-round protocol on a symmetric star topology, named weighted HyperCube (wHC), which generalizes the HyperCube algorithm [ ]. We assume that the data statistics |R_v|, |S_v| are known to all compute nodes.

We fix a strict ordering ≤ on the compute nodes in V_C. Each node assigns consecutive numbers to its local data. More specifically, node v labels its data in R_v from 1 + Σ_{u < v} |R_u| to Σ_{u ≤ v} |R_u|, and similarly for S_v. In this way, each answer in the cartesian product can be uniquely mapped to a point in the grid □ = {1, ..., |R|} × {1, ..., |S|}. The wHC protocol assigns to each compute node v a square □_v of the grid, with a side length proportional to the bandwidth w_v of its link: let L be such that Σ_{v ∈ V_C} (L · w_v)² = (N/2)², and let l_v be the smallest power of 2 with l_v ≥ 2L · w_v. Node v is assigned a square □_v of dimensions l_v × l_v, receives the elements of R and S whose labels fall in □_v, and enumerates all pairs in □_v. To place the squares inside the grid, we use the following packing lemma.

Lemma 5. Let S be a set of squares of dimensions d_i × d_i, where each d_i is a power of 2. Then the squares in S can fully pack (i.e., completely cover, without gaps) a square of dimension at least (1/2)·sqrt(Σ_i d_i²).

Proof. We process the squares in increasing order of i ≥ 0: for each i, if there are 4 squares of dimensions 2^i × 2^i in S, we pack them into a larger square of dimensions 2^{i+1} × 2^{i+1}. In this way, we can transform S into a new set of squares S′ where, for every i, there are at most 3 squares of dimensions 2^i × 2^i. It is now easy to see that, by induction starting from the smallest size, all squares of dimension at most 2^{i−1} can be packed inside a square of dimension 2^i. Hence, we can pack all squares in S′ inside a square of dimension 2^{i*+1}, where 2^{i*} is the dimension of the largest square in S′. To conclude the argument, observe that the square with dimension 2^{i*} is fully packed. Also, 2^{i*+1} ≥ sqrt(Σ_i d_i²), since the squares pack disjointly inside the 2^{i*+1} × 2^{i*+1} square. Hence, we can fully pack a square of dimension at least (1/2)·sqrt(Σ_i d_i²).

The next lemma bounds the cost of the wHC protocol.

Lemma 6. Let G be a symmetric star topology.
Then, the wHC algorithm correctly computes the cartesian product R × S for |R| = |S| = N/2 with (tuple) cost O(C), where

$$C = \max\Big\{ \max_v \frac{N_v}{w_v},\ \frac{N}{\sqrt{\sum_v w_v^2}} \Big\}$$

Proof. To prove correctness, we apply Lemma 5 with S = {l_v × l_v | v ∈ V_C}. Then, the squares fully pack a square of area at least

$$\frac{1}{4} \sum_{v \in V_C} l_v^2 \;\geq\; \frac{1}{4} \sum_{v \in V_C} (2L \cdot w_v)^2 \;=\; \sum_{v \in V_C} (L \cdot w_v)^2 \;=\; (N/2)^2 \;=\; |R| \cdot |S|$$

Hence, the whole grid can be covered.

Next, we analyze the cost of the algorithm. First, the cost of sending data is max_v N_v / w_v. For the cost of receiving, observe that the square □_v covers l_v labels of R and l_v labels of S, so node v receives at most 2·l_v ≤ 2·(4L·w_v) = 8·w_v·L tuples. Hence, the cost of receiving is bounded by 8L = O(N / sqrt(Σ_v w_v²)). Combining these two costs obtains the desired result.

4.3 Warm-up on Symmetric Star

Before we present the general algorithm for symmetric trees, we warm up by studying the simpler symmetric star case (Algorithm 4).

The algorithm checks whether the maximum amount of data that some node holds exceeds N/2. If so, it is easy to observe that the strategy where every other compute node sends its data to that node is optimal. If every node holds at most N/2 data, then in G† all compute nodes of the star are directed to the central node o, which becomes the root of G†. In this case, running the wHC algorithm on the whole topology can be proven optimal.

Algorithm 4: StarCartesianProduct(G, D)
1: if max_u N_u > N/2 then all compute nodes send their data to arg max_u N_u
2: else run the wHC algorithm

Lemma 7. On a symmetric star topology, the StarCartesianProduct algorithm correctly computes the cartesian product R × S for |R| = |S| = N/2 in a single round deterministically and with cost O(1) away from the optimal.

Proof. We distinguish the analysis into two cases, depending on whether max_u N_u > N/2.

Suppose first that max_u N_u > N/2. Let u* = arg max_u N_u. For node u*, we have N − N_{u*} < N/2 < N_{u*}. For every other node v ≠ u*, it holds that N_v < N/2, hence N_v < N − N_v. Hence, we can write Theorem 3 as:

$$C_{opt} \;\geq\; \max_v \frac{1}{w_v} \min\{N_v, N - N_v\} \;\geq\; \max\Big\{ \frac{N - N_{u^*}}{w_{u^*}},\ \max_{v \neq u^*} \frac{N_v}{w_v} \Big\}$$

But this is exactly half the cost of the protocol where all nodes send their data to u*.

Suppose now that max_u N_u ≤ N/2. From Theorem 3, we obtain the lower bound C_opt ≥ max_v N_v / w_v. Additionally, observe that in G† all compute nodes of the star are directed to the central node o, and hence V_C is a minimal cover of G†. Indeed, if we add {o} to V_C, the cover is not minimal, since {o} is a minimal cover by itself. Plugging this cover into Theorem 4, we obtain that C_opt ≥ N / sqrt(Σ_v w_v²). To conclude, notice that these two lower bounds on C_opt match the upper bound of wHC in Lemma 6 within a constant factor.
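A sketch of the wHC side-length computation on a star; the code, the rounding convention, and the choice of L follow the construction above and assume strictly positive bandwidths (all names are our own):

```python
# A minimal sketch of wHC's square assignment on a symmetric star.
# L satisfies sum((L*w_v)^2) = (N/2)^2; each node gets a square whose
# power-of-two side is at least 2*L*w_v, so the squares can fully pack
# the (N/2) x (N/2) output grid by Lemma 5.
import math

def whc_sides(w: dict, N: int) -> dict:
    L = (N / 2) / math.sqrt(sum(b * b for b in w.values()))
    return {v: 2 ** math.ceil(math.log2(2 * L * b)) for v, b in w.items()}

sides = whc_sides({"v1": 4.0, "v2": 2.0, "v3": 2.0}, N=100)
print(sides)  # larger-bandwidth nodes take larger squares of the grid
```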
4.4 General Symmetric Tree

We now generalize the techniques for the star topology to an arbitrary tree topology.

The Algorithm. Assume that the data statistics |R_v|, |S_v| are known to all compute nodes. As in the wHC algorithm, each tuple from R is labeled with a unique index, and similarly for each tuple in S. In this way, each answer in the cartesian product can be uniquely mapped to a point in the grid □ = {1, ..., |R|} × {1, ..., |S|}. For simplicity, we split the routing phase into two steps.

Let r be the root of the directed graph G†. In the first step, each compute node v ∈ V_C sends its local data to r.

In the second step, we assign to each compute node v ∈ V_C a square □_v such that every result t = (t_r, t_s) is computed on some v. To compute t, the associated tuples t_r, t_s must be sent to v at least once. In this step, every tuple sent to v is sent from the root r, which has gathered all the necessary data in the first step. Next, we show how to find a balanced assignment on a tree, and we analyze its cost with respect to the lower bound in Theorem 4.

Balanced Packing on a Symmetric Tree. Let ζ(u) be the set of children of u in G†, and p_u the unique parent of u in G†. To simplify notation, we use w_v to denote the quantity w(v, p_v).

The algorithm is split into two phases. First, it computes a quantity w̃_v for each node v in G†. For the leaf nodes, we have w̃_v = w_v, while for the internal nodes w̃_v is computed in a bottom-up fashion (through a post-order traversal). In the second phase, the algorithm computes a quantity l_v for each node, but now in a top-down fashion (through a pre-order traversal). As a final step, each compute node v rounds up N · l_v to the closest power of 2, and then gets assigned a square of that dimension.

Algorithm 5: BalancedPackingTree(G)
1: forall v ∈ V \ {r} in post-order do
2:     if v is a leaf then
3:         w̃_v ← w_v
4:     else
5:         w̃_v ← min{ w_v, sqrt(Σ_{u ∈ ζ(v)} w̃_u²) }
6: w̃_r ← sqrt(Σ_{u ∈ ζ(r)} w̃_u²)
7: l_r ← 1
8: forall v ∈ V \ {r} in pre-order do
9:     l_v ← l_{p_v} · w̃_v / sqrt(Σ_{u ∈ ζ(p_v)} w̃_u²)
10: forall v ∈ V_C do
11:     d_v ← min{ 2^k : 2^k ≥ N · l_v }
12:     assign to v a square of dimensions d_v × d_v

The next lemma shows that Algorithm 5 guarantees certain properties for the computed quantities.

Lemma 8. The following properties hold:
1. For every non-root vertex v, w̃_v ≤ w_v.
2. For every vertex v, l_v ≤ w̃_v / w̃_r.
3. There exists a minimal cover U of G† such that w̃_r = sqrt(Σ_{u ∈ U} w_u²).
4. For every vertex u, l_u = sqrt(Σ_{v ∈ T_u ∩ V_C} l_v²), where T_u is the subtree rooted at u.

Proof. Property (1) is straightforward from the algorithm.

We prove property (2) by induction. For the base case, v is the root. In this case, l_r = 1, so the inequality holds with equality. Consider now any non-root vertex v with parent p_v. We then have:

$$l_v = l_{p_v} \cdot \frac{\tilde{w}_v}{\sqrt{\sum_{u \in \zeta(p_v)} \tilde{w}_u^2}} \;\leq\; \frac{\tilde{w}_{p_v}}{\tilde{w}_r} \cdot \frac{\tilde{w}_v}{\sqrt{\sum_{u \in \zeta(p_v)} \tilde{w}_u^2}} \;\leq\; \frac{\tilde{w}_v}{\tilde{w}_r}$$

The first inequality holds from the inductive hypothesis for the parent node p_v. The second inequality comes from line 5 of the algorithm, which implies that w̃_{p_v} ≤ sqrt(Σ_{u ∈ ζ(p_v)} w̃_u²) for every non-leaf vertex p_v.

We also use induction to show property (3). For a subtree rooted at a leaf node v, U = {v} is a minimal cover. In this case, w̃_v = w_v = sqrt(Σ_{u ∈ U} w_u²). For the induction step, consider some non-leaf node v. If w̃_v = w_v, then w̃_v = sqrt(Σ_{u ∈ U} w_u²) holds for the minimal cover U = {v}. Otherwise, w̃_v = sqrt(Σ_{u ∈ ζ(v)} w̃_u²). From the induction hypothesis, there exists a minimal cover U_u for the subtree rooted at each u ∈ ζ(v) such that w̃_u² = Σ_{t ∈ U_u} w_t². Moreover, it is easy to see that the set U = ∪_{u ∈ ζ(v)} U_u is a minimal cover for the subtree rooted at v. Hence, we can write:

$$\tilde{w}_v = \sqrt{\sum_{u \in \zeta(v)} \tilde{w}_u^2} = \sqrt{\sum_{u \in \zeta(v)} \sum_{t \in U_u} w_t^2} = \sqrt{\sum_{t \in U} w_t^2}$$

Property (4) follows directly from Algorithm 5. The base case, for u ∈ V_C, always holds. Consider any non-leaf node u. By induction, assume l_x = sqrt(Σ_{v ∈ T_x ∩ V_C} l_v²) for each node x ∈ ζ(u).
Implied by line 9 of Algorithm 5 (the l_x of the children of u split l_u so that Σ_{x ∈ ζ(u)} l_x² = l_u²), we have

$$l_u = \sqrt{\sum_{x \in \zeta(u)} l_x^2} = \sqrt{\sum_{x \in \zeta(u)} \sum_{v \in T_x \cap V_C} l_v^2} = \sqrt{\sum_{v \in T_u \cap V_C} l_v^2}.$$

Packing squares. In this part, we discuss how to pack the square of dimension d_v assigned to each leaf node v inside □. Our goal is to find an assignment (packing) of the squares to compute nodes V_C such that, for each vertex u, the number of elements that cross the link (u, p_u) is bounded by O(N · l_u).

We visit all vertices in a bottom-up way, starting from the leaves. We recursively assign to each node u a set of squares of the form S_u = {(2^i, c_i) : c_i ∈ {0, 1, 2, 3}}, meaning that there are c_i squares of dimensions 2^i × 2^i.

For every leaf node v ∈ V_C, only one square is assigned to v by Algorithm 5. Consider some non-leaf node u. Each of its children v ∈ ζ(u) is assigned a set of squares S_v. We apply the following procedure in increasing order of i ≥ 0: for each i, if there are 4 squares of dimensions 2^i × 2^i in ∪_{v ∈ ζ(u)} S_v, we pack them into a larger square of dimensions 2^{i+1} × 2^{i+1}. In this way, we transform ∪_{v ∈ ζ(u)} S_v into a new set of squares S_u where, for every i, there are at most c_i ≤ 3 squares of dimensions 2^i × 2^i.

Next, we bound the number of elements that cross the link (u, p_u) for each node u ∈ V, which is assigned the set of squares S_u. Let T_u be the subtree rooted at u, and let i* be the largest integer such that c_{i*} ≠ 0. Note that each square of dimensions 2^i × 2^i includes 2^i elements from each of R and S. Then, the total number of elements over all squares in S_u is Σ_i c_i · 2 · 2^i ≤ 2 · 3 · 2^{i*+1} ≤ 12 · 2^{i*}. Moreover, (2^{i*})² is at most the total area of the squares in S_u, which equals Σ_{v ∈ T_u ∩ V_C} d_v². Hence, the number of elements is at most

$$12 \cdot \sqrt{\sum_{v \in T_u \cap V_C} d_v^2} \;\leq\; 12 \cdot 2N \cdot \sqrt{\sum_{v \in T_u \cap V_C} l_v^2} \;=\; O(N \cdot l_u)$$

where the first inequality uses d_v ≤ 2N · l_v for each compute node v ∈ V_C (line 11 of Algorithm 5), and the last equality is implied by Lemma 8.

Theorem 5. On a symmetric tree topology G = (V, E), the cartesian product R × S for |R| = |S| = N/2 can be computed deterministically in a single round optimally.

Proof. To prove the correctness of the algorithm, we need to show that the packing of the squares fully covers the |R| × |S| grid. Indeed, consider the largest square 2^{i*} × 2^{i*} that occurs in the set of squares S_r assigned to the root node. Observe first that we can pack all squares in S_r inside a 2^{i*+1} × 2^{i*+1} square, and thus

$$(2^{i^*+1})^2 \;\geq\; \sum_{v \in V_C} d_v^2 \;\geq\; N^2 \sum_{v \in V_C} l_v^2 \;=\; N^2$$

using d_v ≥ N · l_v and Σ_{v ∈ V_C} l_v² = l_r² = 1 (Lemma 8). Hence, 2^{i*} ≥ N/2, so (2^{i*})² ≥ (N/2) · (N/2) = |R| · |S|, which means that the grid is fully packed by the largest square in S_r.

We next show that the cost is asymptotically close to the lower bounds in Theorem 3 and Theorem 4. It can easily be checked that the number of elements transmitted through any link e in the first step is at most O(min{Σ_{v ∈ V_e^-} N_v, Σ_{v ∈ V_e^+} N_v}), matching the lower bound in Theorem 3. For the second step, we have bounded the number of elements that cross link (u, p_u) by O(N · l_u). Lemma 8 implies that N · l_u ≤ N · w_u / sqrt(Σ_{v ∈ U} w_v²) for some minimal cover U of G†, hence matching the lower bound in Theorem 4.
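A sketch of Algorithm 5's two traversals; the tree is assumed to be stored as parent/children maps, and all names are our own:

```python
# A minimal sketch of BalancedPackingTree: a post-order pass computes
# w_tilde, a pre-order pass splits l proportionally, and each compute leaf
# rounds N*l_v up to a power of two as its square side.
import math

def balanced_packing(children, w, root, compute, N):
    wt, l, side = {}, {root: 1.0}, {}

    def post(v):
        for u in children.get(v, []):
            post(u)
        kids = children.get(v, [])
        agg = math.sqrt(sum(wt[u] ** 2 for u in kids)) if kids else 0.0
        wt[v] = agg if v == root else (w[v] if not kids else min(w[v], agg))

    def pre(v):
        kids = children.get(v, [])
        agg = math.sqrt(sum(wt[u] ** 2 for u in kids)) if kids else 1.0
        for u in kids:
            l[u] = l[v] * wt[u] / agg
            pre(u)

    post(root); pre(root)
    for v in compute:               # compute nodes are the leaves of G-dagger
        side[v] = 2 ** math.ceil(math.log2(max(N * l[v], 1)))
    return side
```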
4.5 Discussion on Unequal Sizes

Finally, we discuss the difficulty of computing the cartesian product R × S with |R| ≠ |S| on a symmetric star topology. W.l.o.g., assume |R| < |S|. The first lower bound, following the same argument as in Theorem 3, is Ω(C_LB), where

$$C_{LB} = \max_{v \in V_C} \frac{1}{w_v} \cdot \min\big\{ N_v,\ N - N_v,\ |R| \big\}$$

We next see how the counting argument yields a second lower bound, under the condition max_v N_v < N. Let C be the cost of any correct algorithm, and let R′_u, S′_u be the elements of R, S received by node u. Then, the size of the result that can be produced at u is |R_u ∪ R′_u| · |S_u ∪ S′_u|. Observe the following:

• |R′_u| ≤ C · w_u and |S′_u| ≤ C · w_u;
• if N_u < |R|, then |R_u ∪ R′_u| ≤ O(C · w_u) and |S_u ∪ S′_u| ≤ O(C · w_u);
• if N_u ≥ |R|, then C · w_u ≥ |R|.

Since the total size of the output must be at least |R| · |S|, summing over all nodes we obtain (up to constant factors)

$$|R| \cdot |S| \;\leq\; \sum_{u \in V_C:\, N_u < |R|} \min\{C \cdot w_u, |R|\} \cdot \min\{C \cdot w_u, |S|\} \;+\; \sum_{u \in V_C:\, N_u \geq |R|} |R| \cdot \min\{C \cdot w_u + |S_u|, |S|\}$$

whose minimizer gives the second lower bound, which becomes rather complicated, without a clean form as in Theorem 4.

This is just an intuition of why the unequal case makes the lower bound hard even on the symmetric star. In Appendix A.1, we give a more detailed analysis of the lower bound, as well as an optimal algorithm. Extending our current result to the general symmetric tree topology is left as future work.

5 Sorting

In the sorting problem, we are given a set R whose elements are drawn from a totally ordered domain. We first define an ordering of compute nodes in the following way: after picking an arbitrary node as the root, any left-to-right traversal of the underlying network tree is a valid ordering of the compute nodes. The goal is to redistribute the elements of R such that, for an ordering of compute nodes v_1, v_2, ..., v_{|V_C|}, the elements on node v_i are always smaller than those on node v_j whenever i < j.

Given an initial distribution D of the data across the compute nodes, we denote by N_v^D the initial data size on node v. Whenever the context is clear, we drop the superscript D from the notation.

5.1 Lower Bound

Our lower bound for sorting has the same form as the one for set intersection, with the only difference that the cost is expressed in tuples, not bits.

Theorem 6. Let G = (V, E) be a symmetric tree topology. Any algorithm sorting the elements of a set R has (tuple) cost Ω(C_LB), where

$$C_{LB} = \max_{e \in E} \frac{1}{w_e} \cdot \min\Big\{ \sum_{v \in V_e^-} N_v,\ \sum_{v \in V_e^+} N_v \Big\}.$$

Proof. We construct an initial data distribution as follows. Assume the elements of R are ordered as r_1, r_2, ..., r_N, where i is the rank of element r_i in R. Without loss of generality, assume N is even. We assign elements to compute nodes in the ordering {r_1, r_3, ..., r_{N−1}, r_2, r_4, ..., r_N}. Moreover, we pick one arbitrary routing node of G as the root, so that all compute nodes are leaves of the tree. All compute nodes in V_C are labeled v_1, v_2, ..., v_{|V_C|} in a left-to-right traversal ordering, i.e., recursively traversing the leaves in the left subtree and then the right subtree. For example, the node v_1 with initial data size N_{v_1} will be assigned the elements {r_1, r_3, ..., r_{2N_{v_1}−1}} if N_{v_1} ≤ N/2, and {r_1, r_3, ..., r_{N−1}, r_2, r_4, ..., r_{2(N_{v_1}−N/2)}} otherwise. We need to argue that any algorithm correctly sorting R under this initial distribution must have cost Ω(C_LB).

Consider an arbitrary edge e ∈ E. Removing e defines a partition of V_C into V_e^- and V_e^+.
Consider an arbitrary edge e ∈ E. Removing e defines a partition of V_C into V_e^-, V_e^+. Denote R_e^- = ∪_{v ∈ V_e^-} R_v and R_e^+ = ∪_{v ∈ V_e^+} R_v. Since the nodes on either side of e form a contiguous block in the left-to-right ordering, one of R_e^-, R_e^+ is a set of consecutive elements of the (cyclic) sequence r_1, r_3, ..., r_{N−1}, r_2, r_4, ..., r_N, and the other is its complement. Note that every element transmitted between V_e^- and V_e^+ must go through edge e. Without loss of generality, assume |R_e^-| ≤ N/2 ≤ |R_e^+|. Then it suffices to show that the total number of elements exchanged between V_e^- and V_e^+ is Ω(|R_e^-|).

In the extreme case, there is only one element in R_e^-, say R_e^- = {r_i}. If r_i is not sent through e, at least one element of R_e^+ must be sent to V_e^-; otherwise, no comparison between r_i and any element r_j ∈ R_e^+ is performed, contradicting the correctness of the algorithm. So at least one element is transmitted through edge e.

Figure 5: Data exchange between V_e^- and V_e^+ (elements of R_e^- and R_e^+ exchanged in Case (1) and Cases (3.3.1), (3.3.2)).

In general, R_e^- contains at least two elements. We distinguish four cases: (1) r_2 ∉ R_e^-, r_N ∉ R_e^-; (2) r_1 ∉ R_e^-, r_{N−1} ∉ R_e^-; (3) r_2 ∈ R_e^-, r_{N−1} ∈ R_e^-; (4) r_1 ∈ R_e^-, r_N ∈ R_e^-. Note that (2) can be argued symmetrically to (1), and (4) symmetrically to (3).

Case (1): r_2 ∉ R_e^-, r_N ∉ R_e^-. In this case R_e^- must be a subset of {r_1, r_3, ..., r_{N−1}}. Let i, j be the smallest and largest ranks of elements in R_e^-. If all elements of R_e^- are sent from V_e^- to V_e^+, we are done. Otherwise, let i′, j′ be the smallest and largest ranks of elements in R_e^- that are not sent from V_e^- to V_e^+. If all elements of R_e^- − {r_{i′}, r_{j′}} are sent from V_e^- to V_e^+, the number of such elements is |R_e^-| − 2 = Ω(|R_e^-|) and we are done. Otherwise, there is an element r_{k′} ∈ R_e^- − {r_{i′}, r_{j′}} not sent from V_e^- to V_e^+; by definition, r_{i′} < r_{k′} < r_{j′}. Since the nodes of V_e^- are consecutive in the left-to-right ordering, they must hold a contiguous range of ranks when the algorithm terminates, so all elements in [r_{i′}, r_{j′}] must reside on V_e^- at the end. In this case, each element of [r_{i′}, r_{j′}] − R_e^- must be sent from V_e^+ to V_e^-, and each element of {r_i, r_{i+2}, ..., r_{i′−2}} ∪ {r_{j′+2}, r_{j′+4}, ..., r_j} is sent from V_e^- to V_e^+, as illustrated in Figure 5. So the number of elements transmitted through edge e is at least (i′ − i)/2 + (j − j′)/2 + (j′ − i′)/2 = (j − i)/2 ≥ |R_e^-| − 1 ≥ |R_e^-|/2.

Case (3): r_2 ∈ R_e^-, r_{N−1} ∈ R_e^-. Let i be the smallest odd rank and j the largest even rank of elements in R_e^-, so that R_e^- = {r_i, r_{i+2}, ..., r_{N−1}} ∪ {r_2, r_4, ..., r_j}. Note that j < i since |R_e^-| ≤ N/2. We further consider three cases.

Case (3.1): all elements of {r_2, r_4, ..., r_j} are sent from V_e^- to V_e^+. If all elements of {r_i, r_{i+2}, ..., r_{N−1}} are also sent from V_e^- to V_e^+, we are done. Otherwise, let i_1, i_2 be the smallest and largest ranks of elements in {r_i, r_{i+2}, ..., r_{N−1}} not sent from V_e^- to V_e^+. If all elements of {r_i, r_{i+2}, ..., r_{N−1}} − {r_{i_1}, r_{i_2}} are sent from V_e^- to V_e^+, it can easily be checked that the number of elements sent from V_e^- to V_e^+ is at least |R_e^-| − 2, and we are done.
Otherwise, there is an element r_k ∈ {r_i, r_{i+2}, ..., r_{N−1}} − {r_{i_1}, r_{i_2}} not sent from V_e^- to V_e^+. By definition, r_{i_1} < r_k < r_{i_2}. By the ordering of the compute nodes, all elements in [r_{i_1}, r_{i_2}] must reside on V_e^- when the algorithm terminates. In this case, each element of [r_{i_1}, r_{i_2}] − R_e^- must be sent from V_e^+ to V_e^-, and each element of {r_2, r_4, ..., r_j} ∪ {r_i, r_{i+2}, ..., r_{i_1−2}} ∪ {r_{i_2+2}, r_{i_2+4}, ..., r_{N−1}} is sent from V_e^- to V_e^+, as illustrated in Figure 5. So the number of elements transmitted through edge e is at least j/2 + (i_1 − i)/2 + (N − 1 − i_2)/2 + (i_2 − i_1)/2 = j/2 + (N − 1 − i)/2 = |R_e^-| − 1 ≥ |R_e^-|/2.

Case (3.2): all elements of {r_i, r_{i+2}, ..., r_{N−1}} are sent from V_e^- to V_e^+. This case can be argued symmetrically.

Case (3.3): at least one element of {r_2, r_4, ..., r_j} and one element of {r_i, r_{i+2}, ..., r_{N−1}} are not sent from V_e^- to V_e^+. Let j_1, j_2 be the smallest and largest even ranks of elements in R_e^- not sent from V_e^- to V_e^+, and let i_1, i_2 be the smallest and largest odd ranks of elements in R_e^- not sent from V_e^- to V_e^+. Note that each element of {r_2, r_4, ..., r_{j_1−2}} ∪ {r_{j_2+2}, r_{j_2+4}, ..., r_j} ∪ {r_i, r_{i+2}, ..., r_{i_1−2}} ∪ {r_{i_2+2}, r_{i_2+4}, ..., r_{N−1}} is sent from V_e^- to V_e^+.

By the ordering of the compute nodes, when the algorithm terminates either (3.3.1) all elements in [r_{j_1}, r_{i_2}], or (3.3.2) all elements in [r_1, r_{j_2}] ∪ [r_{i_1}, r_N], must reside on V_e^-. In case (3.3.1), each element of [r_{j_1}, r_{i_2}] − R_e^- must additionally be sent from V_e^+ to V_e^-, as illustrated in Figure 5; a direct count of the two directions shows that the number of elements transmitted through edge e is at least N/2 + (i_1 − j_2)/2 − 5/2 ≥ N/4 ≥ |R_e^-|/2 for N large enough. In case (3.3.2), each element of {r_1, r_3, ..., r_{j_2−1}} ∪ {r_{i_1+1}, r_{i_1+3}, ..., r_N} must additionally be sent from V_e^+ to V_e^-, as illustrated in Figure 5; counting both directions again yields at least N/4 ≥ |R_e^-|/2 transmitted elements.

In the MPC model, the theoretically optimal sorting algorithm, inherited from [23], is rather complicated. Instead, sampling-based techniques such as TeraSort [41] are more amenable to extension to more complex networks. In this section, we present a randomized communication protocol for symmetric tree topologies, named weighted TeraSort (wTS), which generalizes the TeraSort algorithm in three fundamental ways. First, TeraSort is designed for the MapReduce [20] framework, an instantiation of the theoretical MPC model (with a star topology); we extend it to general tree topologies. Second, not all nodes participate in splitting the data, but only those that initially hold a substantial amount of data. Third, we do not split the data uniformly, but proportionally to the size of the initial data. Before introducing our algorithm, we revisit the TeraSort algorithm.

TeraSort Algorithm. TeraSort picks an arbitrary node as the coordinator and sets ρ = (4 |V_C| / N) · ln(|V_C| · N).

Round 1: Each node u ∈ V_C samples each element of its local storage independently with probability ρ, and sends all sampled elements to the coordinator. Let s be the number of samples generated in total.

Round 2: The coordinator sorts all received samples. Let b_i be the (i · ⌈s/|V_C|⌉)-th smallest element in the sorted samples for i ∈ {1, 2, ..., |V_C| − 1}, and set b_0 = −∞ and b_{|V_C|} = +∞. The coordinator then broadcasts the |V_C| + 1 splitters b_0, b_1, ..., b_{|V_C|} to all nodes.

Round 3: Upon receiving the splitters, each node scans its own elements. For each element x, the node finds the two consecutive splitters b_i, b_{i+1} with b_i ≤ x < b_{i+1} and sends x to v_{i+1}. Finally, each node locally sorts all elements it has received.
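For reference, the following single-process Python simulation of the three TeraSort rounds is a minimal sketch of our own (names and parameter choices are ours, not the original Hadoop implementation):

import math, random

def terasort_simulation(local_data):
    # local_data[u] holds node u's elements; returns one sorted run per node.
    p = len(local_data)                                  # p = |V_C|
    n = sum(len(d) for d in local_data)
    rho = min(1.0, 4 * p * math.log(p * n) / n)          # sampling probability
    # Round 1: each node samples each element independently with prob. rho.
    samples = sorted(x for d in local_data for x in d if random.random() < rho)
    if not samples:                                      # degenerate fallback
        samples = sorted(x for d in local_data for x in d)
    # Round 2: the coordinator picks p - 1 splitters from the sorted samples.
    step = math.ceil(len(samples) / p)
    splitters = [samples[min(i * step, len(samples) - 1)] for i in range(1, p)]
    # Round 3: element x goes to the node owning the range [b_i, b_{i+1}).
    runs = [[] for _ in range(p)]
    for d in local_data:
        for x in d:
            runs[sum(1 for b in splitters if b <= x)].append(x)
    return [sorted(r) for r in runs]                     # local sort per node

data = [[random.random() for _ in range(500)] for _ in range(4)]
runs = terasort_simulation(data)
assert sorted(sum(runs, [])) == sum(runs, [])            # globally ordered runs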
Now we describe our algorithm. Assume that the data statistics N_v are known to all compute nodes. A compute node v ∈ V_C is heavy if N_v ≥ N/(2|V_C|) and light otherwise. Let V_H, V_L ⊆ V_C be the sets of heavy and light compute nodes, respectively. For simplicity, we pick an arbitrary non-compute node as the root and label the heavy nodes in V_H from left to right as v_1, v_2, ..., v_k.

Round 1: Each light node u ∈ V_L sends its local data to the heavy nodes proportionally to the N_{v_i}'s. More specifically, node u sends N_u^i local elements to v_i for each i ∈ {1, 2, ..., k}, where the N_u^i are computed by Algorithm 6 below. Let M_j be the number of elements residing on heavy node v_j after round 1.

Round 2: Each heavy node samples each element of its local storage independently with the same probability ρ and sends the sampled elements to v_1. Let s be the number of samples generated in total.

Round 3: Node v_1 sorts all received samples. Let t_i be the (i · ⌈s/|V_C|⌉)-th smallest element among all samples, and let c_j = ⌈(|V_C|/N) · M_j⌉. Node v_1 chooses k + 1 splitters as follows: (1) b_0 = −∞; (2) b_i = t_j where j = c_1 + c_2 + ... + c_i; (3) b_k = +∞. Then v_1 broadcasts b_0, b_1, ..., b_k to the remaining heavy compute nodes.

Round 4: Upon receiving the splitters, each heavy node scans its own elements. For each element x, the node finds the two consecutive splitters b_i, b_{i+1} with b_i ≤ x < b_{i+1} and sends x to v_{i+1}. Finally, each node locally sorts all elements it has received.

A possible improvement is the following: if the maximum amount of data held by a single node exceeds N/2, every node simply sends its data to that node; otherwise, we run the wTS routine on the whole topology as described.

Algorithm 6: Proportional(V_H, u)
1: ∆ ← 0; i ← 1
2: while i ≤ k do
3:   x ← (N_{v_i} / Σ_{v ∈ V_H} N_v) · N_u
4:   if ∆ ≥ x − ⌊x⌋ then N_u^i ← ⌊x⌋; ∆ ← ∆ − (x − ⌊x⌋)
5:   else
6:     N_u^i ← ⌊x⌋ + 1; ∆ ← ∆ + 1 − (x − ⌊x⌋)
7:   i ← i + 1
8: return N_u^1, N_u^2, ..., N_u^k
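The rounding performed by Algorithm 6 translates directly into Python; the sketch below (our naming) returns the allocations N_u^1, ..., N_u^k for a single light node u:

import math

def proportional(heavy_sizes, n_u):
    # Algorithm 6 (sketch): split n_u elements across the heavy nodes in
    # proportion to heavy_sizes, keeping the rounding error Delta in [0, 1).
    total = sum(heavy_sizes)
    delta, alloc = 0.0, []
    for size in heavy_sizes:
        x = size / total * n_u               # exact proportional share
        frac = x - math.floor(x)
        if delta >= frac:                    # line 4: round down, spend credit
            alloc.append(math.floor(x))
            delta -= frac
        else:                                # line 6: round up, gain credit
            alloc.append(math.floor(x) + 1)
            delta += 1 - frac
    return alloc

print(proportional([5, 3, 2], 10))           # -> [5, 3, 2]
print(proportional([3, 3, 3], 10))           # -> [4, 3, 3], sums to 10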
Before analyzing the complexity of our algorithm, we point out some important properties of Algorithm 6.

Lemma 9. Consider a light node u ∈ V_L. Then the following hold for Algorithm 6:
1. for any i ∈ [k], Σ_{j=1..i} N_u^j − 1 ≤ (Σ_{j=1..i} N_{v_j} / Σ_{j=1..k} N_{v_j}) · N_u ≤ Σ_{j=1..i} N_u^j;
2. for any i_1, i_2 ∈ [k] with i_1 ≤ i_2, Σ_{j=i_1..i_2} N_u^j ≤ (Σ_{j=i_1..i_2} N_{v_j} / Σ_{j=1..k} N_{v_j}) · N_u + 1;
3. Σ_{j=1..k} N_u^j ≥ N_u.

Proof. We first prove (1) by induction. The base case i = 1 follows since N_u^1 = ⌊(N_{v_1} / Σ_{j=1..k} N_{v_j}) · N_u⌋ + 1 (or ⌊·⌋ when the fractional part is zero). For the inductive step, assume the claim holds for i. Let ∆_i be the value of ∆ after it is updated during the i-th iteration of the while loop. Observe that the invariant

∆_i = Σ_{j=1..i} N_u^j − (Σ_{j=1..i} N_{v_j} / Σ_{j=1..k} N_{v_j}) · N_u

always holds, and that 0 ≤ ∆_i < 1 and 0 ≤ x − ⌊x⌋ < 1 in the (i+1)-th iteration of the while loop. When the algorithm executes line 4, we have

Σ_{j=1..i+1} N_u^j = N_u^{i+1} + Σ_{j=1..i} N_u^j = ⌊x⌋ + ∆_i + (Σ_{j=1..i} N_{v_j} / Σ_{j=1..k} N_{v_j}) · N_u = (∆_i − x + ⌊x⌋) + (Σ_{j=1..i+1} N_{v_j} / Σ_{j=1..k} N_{v_j}) · N_u,

where x = (N_{v_{i+1}} / Σ_{j=1..k} N_{v_j}) · N_u. In this case 0 ≤ ∆_i − x + ⌊x⌋ < 1, so the claim holds. When the algorithm executes line 6,

Σ_{j=1..i+1} N_u^j = N_u^{i+1} + Σ_{j=1..i} N_u^j = ⌊x⌋ + 1 + ∆_i + (Σ_{j=1..i} N_{v_j} / Σ_{j=1..k} N_{v_j}) · N_u = (∆_i + 1 − x + ⌊x⌋) + (Σ_{j=1..i+1} N_{v_j} / Σ_{j=1..k} N_{v_j}) · N_u,

and again 0 ≤ ∆_i + 1 − x + ⌊x⌋ < 1, so the claim holds.

We prove (2) using (1). Observe that

Σ_{j=i_1..i_2} N_u^j = Σ_{j=1..i_2} N_u^j − Σ_{j=1..i_1−1} N_u^j ≤ [(Σ_{j=1..i_2} N_{v_j} / Σ_{j=1..k} N_{v_j}) · N_u + 1] − (Σ_{j=1..i_1−1} N_{v_j} / Σ_{j=1..k} N_{v_j}) · N_u = (Σ_{j=i_1..i_2} N_{v_j} / Σ_{j=1..k} N_{v_j}) · N_u + 1,

where we applied the upper bound of (1) to the prefix ending at i_2 and the lower bound of (1) to the prefix ending at i_1 − 1.

Property (3) follows immediately from (1) by setting i = k.
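As a quick numerical sanity check of Lemma 9 (not a substitute for the proof), the three properties can be tested on random inputs, assuming the proportional function from the sketch above is in scope:

import random

def check_lemma9(heavy_sizes, n_u, eps=1e-9):
    alloc = proportional(heavy_sizes, n_u)
    total, k = sum(heavy_sizes), len(heavy_sizes)
    for i in range(1, k + 1):                # property (1): prefix sums
        exact = sum(heavy_sizes[:i]) / total * n_u
        assert sum(alloc[:i]) - 1 <= exact + eps
        assert exact <= sum(alloc[:i]) + eps
    for i1 in range(k):                      # property (2): contiguous blocks
        for i2 in range(i1, k):
            share = sum(heavy_sizes[i1:i2 + 1]) / total * n_u
            assert sum(alloc[i1:i2 + 1]) <= share + 1 + eps
    assert sum(alloc) >= n_u                 # property (3): everything is sent

for _ in range(100):
    sizes = [random.randint(1, 50) for _ in range(random.randint(1, 8))]
    check_lemma9(sizes, random.randint(0, 200))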
Theorem 7. Let G = (V, E) be a symmetric tree topology and let R be an ordered set of N elements. If N ≥ 8|V_C|² · ln(|V_C| · N), then with probability at least 1 − 2/N the wTS algorithm sorts R in 4 rounds with cost O(1) away from optimal.

Proof. From property (3) of Lemma 9, all the data of the light nodes is sent to the heavy nodes during the first round. Hence, the algorithm produces a correctly sorted output. We complete the proof by analyzing the cost of the wTS algorithm.

First, observe that at least half of the data initially resides on the heavy nodes, i.e., Σ_{j=1..k} N_{v_j} ≥ N/2. Indeed, the total size of the initial data on the light nodes is strictly smaller than (N/(2|V_C|)) · |V_C| = N/2, so the remaining data, of size at least N/2, must reside on the heavy nodes. We next analyze the cost of each round separately.

Round 1. Consider an arbitrary edge e ∈ E, which defines a partition of the compute nodes into V_e^-, V_e^+. If V_H ∩ V_e^+ ≠ ∅, then V_H ∩ V_e^+ = {v_i, v_{i+1}, ..., v_j} or {v_1, ..., v_i} ∪ {v_j, ..., v_k} for some i ≤ j in [k]. For any light node u ∈ V_L, the number of elements sent by u to the nodes in V_H ∩ V_e^+ can be bounded using Lemma 9(2):

Σ_{v ∈ V_e^+ ∩ V_H} N_u^v ≤ (Σ_{v ∈ V_e^+ ∩ V_H} N_v / Σ_{v′ ∈ V_H} N_{v′}) · N_u + 2.

Thus the number of elements sent from light nodes in V_e^- to heavy nodes in V_e^+ is bounded by

Σ_{u ∈ V_e^- ∩ V_L} [ (Σ_{v ∈ V_e^+ ∩ V_H} N_v / Σ_{v′ ∈ V_H} N_{v′}) · N_u + 2 ] ≤ 2|V_C| + (2/N) · Σ_{u ∈ V_e^-} N_u · Σ_{v ∈ V_e^+} N_v = O( min{ Σ_{u ∈ V_e^-} N_u , Σ_{v ∈ V_e^+} N_v } ).

The first inequality uses Σ_{v′ ∈ V_H} N_{v′} ≥ N/2; the second uses 2|V_C| ≤ N/(2|V_C|) ≤ Σ_{v ∈ V_e^+ ∩ V_H} N_v ≤ Σ_{v ∈ V_e^+} N_v, together with the fact that a · b / (a + b) ≤ min{a, b} for any a, b ≥ 1 (here a + b = Σ_v N_v = N). If V_H ∩ V_e^- ≠ ∅, a symmetric argument bounds the traffic in the other direction.

We also observe that the number of elements received by any heavy node v ∈ V_H in round 1 is at most

Σ_{u ∈ V_L} ⌈(N_v / Σ_{v′ ∈ V_H} N_{v′}) · N_u⌉ ≤ (2N_v/N) · Σ_{u ∈ V_L} N_u + |V_C| ≤ 2N_v + N_v = 3N_v,

where the first inequality uses Σ_{v′ ∈ V_H} N_{v′} ≥ N/2 and the second uses |V_C| ≤ N/(2|V_C|) ≤ N_v. Hence, for every heavy node v, M_v ≤ N_v + 3N_v = 4N_v.

Rounds 2, 3. During sampling, each element is an independent Bernoulli sample, so E[s] = ρN. Applying a Chernoff bound, Pr[s ≥ 2ρN] ≤ exp(−Ω(ρN)). In rounds 2 and 3, the number of elements received or sent by any node is at most s, which is smaller than 2ρN with probability at least 1 − exp(−Ω(ρN)) ≥ 1 − (1/(|V_C| · N))^{|V_C|}. Observe that 2ρN ≤ N/|V_C|. Since there is a heavy node on each side of any edge through which data flows, 2ρN = O( min{ Σ_{u ∈ V_e^-} N_u , Σ_{v ∈ V_e^+} N_v } ).

Round 4. In this round, each heavy node v_j sends out at most M_j elements and receives all elements falling into its interval, i.e., R ∩ [b_{j−1}, b_j). Let t_0 = −∞ and t_{|V_C|} = +∞. Conditioned on s ≤ 2ρN, we first observe that for any j ∈ {1, 2, ..., |V_C|}, |R ∩ [t_{j−1}, t_j)| ≤ 2N/|V_C|, which holds with probability at least 1 − 1/N, following an analysis similar to [45]. Together, the probability that all these assumptions hold is at least

(1 − (1/(|V_C| · N))^{|V_C|}) · (1 − 1/N) ≥ 1 − 2/N.

Consider any heavy compute node v_j. The number of intervals allocated to v_j is exactly c_j, so the number of elements received by v_j in the last round is at most

⌈(|V_C|/N) · M_j⌉ · (2N/|V_C|) ≤ ((|V_C|/N) · M_j + 1) · (2N/|V_C|) = 2M_j + 2N/|V_C| ≤ 8N_{v_j} + 4N_{v_j} = O(N_{v_j})

with probability at least 1 − 2/N.

Next, we bound the amount of data transmitted over every link e ∈ E. Removing e partitions the compute nodes into V_e^-, V_e^+; w.l.o.g. assume Σ_{v ∈ V_e^- ∩ V_H} N_v ≤ Σ_{v ∈ V_e^+ ∩ V_H} N_v. The data sent from the heavy nodes in V_e^- to V_e^+ is always bounded by the total amount of data residing on V_e^- ∩ V_H, which is O(Σ_{v ∈ V_e^- ∩ V_H} N_v) = O( min{ Σ_{v ∈ V_e^- ∩ V_H} N_v , Σ_{v ∈ V_e^+ ∩ V_H} N_v } ). The data sent from the heavy nodes in V_e^+ to V_e^- is at most the number of elements received by the compute nodes in V_e^- ∩ V_H, and is thus also bounded by O(Σ_{v ∈ V_e^- ∩ V_H} N_v) = O( min{ Σ_{v ∈ V_e^- ∩ V_H} N_v , Σ_{v ∈ V_e^+ ∩ V_H} N_v } ). Either way, the traffic on each edge e matches its lower bound, completing the proof.
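To connect the optimality claim back to Theorem 6, the following sketch (ours; the edge-list tree representation is an assumption of the sketch, not the paper's) evaluates the sorting lower bound C_LB on an explicit symmetric tree:

def sorting_lower_bound(edges, sizes):
    # edges: list of (child, parent, w_e) triples forming a rooted tree.
    # sizes: dict node -> N_v (0 or absent for pure route nodes).
    # Returns max over edges e of min(data below e, data above e) / w_e.
    children = {}
    for c, p, _ in edges:
        children.setdefault(p, []).append(c)

    def subtree_size(v):                     # total N_v in the subtree under v
        return sizes.get(v, 0) + sum(subtree_size(c) for c in children.get(v, []))

    n = sum(sizes.values())
    return max(min(subtree_size(c), n - subtree_size(c)) / w for c, _, w in edges)

# Star with route root 'r', three compute leaves, unit bandwidth:
edges = [('a', 'r', 1.0), ('b', 'r', 1.0), ('c', 'r', 1.0)]
print(sorting_lower_bound(edges, {'a': 6, 'b': 2, 'c': 4}))   # -> 6.0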
Related Work. The fundamental difference between the topology-aware model we use and other parallel models (e.g., BSP [46], MPC [7], LogP [18]) is that the cost depends both on the topology and on the properties of the network and the nodes. Prior models view the network as a star topology, where every link and every node has exactly the same cost function. In this sense, our model can be viewed as a generalization that takes the topology and node heterogeneity into account. There have already been some efforts to introduce topology-aware models, including [11, 31], as mentioned in the introduction.

One line of work in distributed computing on networks comprises the classical LOCAL and CONGEST models [36, 44], where distributed problems are likewise considered over networks modeled as arbitrary graphs. These two models differ from ours in two important respects. First, in each round each node can only communicate with its neighbors, whereas in our model messages can be sent to nodes located several hops away. Second, the goal is to design algorithms that minimize the number of rounds. As a consequence of both aspects, the diameter of the communication network cannot be avoided as a cost in these models. Moreover, system synchronization after each round is a major bottleneck of modern massively parallel systems; any algorithm in these two models that runs in a non-constant number of rounds would therefore be hard to implement efficiently in practice.

Network routing has been studied in the context of parallel algorithms (see [32, 33]), distributed computing (see, e.g., [ ]), and mobile networks [ ]. Several general-purpose optimization methods for network problems have also been proposed [42]. Our work deviates from the prior literature by considering a "distribution-aware" setting, and tasks that have not been considered before.

The topology-aware model we use in this paper has previously been used to design algorithms for aggregation [37]; however, only star topologies were considered. Madden et al. [38, 40] also proposed a tiny aggregation service that performs topology-aware in-network aggregation in sensor networks. Culhane et al. [16, 17] propose LOOM, a system that builds an aggregation tree with fixed fan-in for all-to-one aggregations and assigns nodes to different parts of the plan according to the amount of data reduced during aggregation. Chowdhury et al. [15] propose Orchestra, a system that manages network activities in MapReduce systems. Both systems are cognizant of the network topology, but agnostic to the distribution of the input data; they also lack theoretical guarantees.

Conclusion. In this paper, we studied three fundamental data processing tasks in a topology-aware massively parallel computation model. We derived lower bounds that depend on the cardinality of the initial data at each node, and we designed provably optimal algorithms for each task with respect to the initial data distribution. Interestingly, these problems exhibit different dependencies on the topology structure, the cost functions (bandwidth), and the data distribution.

There are several exciting directions for future research. First, we would like to extend our algorithms and lower bounds to non-symmetric and general (non-tree) topologies. General topologies (e.g., grid, torus) are particularly challenging because there are multiple routing paths between two compute nodes, and thus a topology-aware algorithm needs to consider all nodes on a routing path, instead of just the destination. Looking further ahead, it would be interesting to study more complex tasks that have so far been analyzed only in the context of the MPC model, starting from a simple join between two relations and continuing to ensembles of tasks in more complex queries.

References
[1] F. N. Afrati and J. D. Ullman. Optimizing multiway joins in a map-reduce environment. IEEE Transactions on Knowledge and Data Engineering, 23(9):1282–1298, 2011.
[2] P. K. Agarwal, K. Fox, K. Munagala, and A. Nath. Parallel algorithms for constructing range and nearest-neighbor searching data structures. In PODS, pages 429–440, 2016.
[3] A. Andoni, A. Nikolov, K. Onak, and G. Yaroslavtsev. Parallel algorithms for geometric graph problems. In STOC, pages 574–583, 2014.
[4] S. Assadi, X. Sun, and O. Weinstein. Massively parallel algorithms for finding well-connected components in sparse graphs. In PODC, pages 461–470, 2019.
[5] N. Bansal, Z. Friggstad, R. Khandekar, and M. R. Salavatipour. A logarithmic approximation for unsplittable flow on line graphs. ACM Transactions on Algorithms, 10(1):1:1–1:15, 2014.
[6] R. da Ponte Barbosa, A. Ene, H. L. Nguyen, and J. Ward. A new framework for distributed submodular maximization. In FOCS, pages 645–654, 2016.
[7] P. Beame, P. Koutris, and D. Suciu. Communication steps for parallel query processing. In PODS, 2013.
[8] P. Beame, P. Koutris, and D. Suciu. Skew in parallel query processing. In PODS, 2014.
[9] S. Blanas, P. Koutris, and A. Sidiropoulos. Topology-aware parallel data processing: Models, algorithms and systems at scale. In CIDR, 2020.
[10] S. Blanas, K. Wu, S. Byna, B. Dong, and A. Shoshani. Parallel data analysis directly on scientific file formats. In SIGMOD, pages 385–396, 2014.
[11] A. Chattopadhyay, M. Langberg, S. Li, and A. Rudra. Tight network topology dependent bounds on rounds of communication. In SODA, pages 2524–2539, 2017.
[12] A. Chattopadhyay, J. Radhakrishnan, and A. Rudra. Topology matters in communication. In FOCS, pages 631–640, 2014.
[13] C. Chekuri, A. Ene, and A. Vakilian. Node-weighted network design in planar and minor-closed families of graphs. In ICALP, pages 206–217, 2012.
[14] C. Chekuri, S. Khanna, and F. B. Shepherd. Edge-disjoint paths in planar graphs with constant congestion. SIAM Journal on Computing, 39(1):281–301, 2009.
[15] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica. Managing data transfers in computer clusters with Orchestra. In SIGCOMM, pages 98–109, 2011.
[16] W. Culhane, K. Kogan, C. Jayalath, and P. Eugster. LOOM: Optimal aggregation overlays for in-memory big data processing. In HotCloud, 2014.
[17] W. Culhane, K. Kogan, C. Jayalath, and P. Eugster. Optimal communication structures for big data aggregation. In INFOCOM, pages 1643–1651, 2015.
[18] D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, K. E. Schauser, E. E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. In PPOPP, 1993.
[19] A. Dasgupta, R. Kumar, and D. Sivakumar. Sparse and lopsided set disjointness via information theory. In APPROX/RANDOM, pages 517–528, 2012.
[20] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[21] M. Ghaffari, T. Gouleakis, C. Konrad, S. Mitrović, and R. Rubinfeld. Improved massively parallel computation algorithms for MIS, matching, and vertex cover. In PODC, pages 129–138, 2018.
[22] M. Ghaffari, T. Gouleakis, C. Konrad, S. Mitrović, and R. Rubinfeld. Improved massively parallel computation algorithms for MIS, matching, and vertex cover. In PODC, pages 129–138, 2018.
[23] M. T. Goodrich. Communication-efficient parallel sorting. SIAM Journal on Computing, 29(2):416–432, 1999.
[24] M. T. Goodrich, N. Sitchinava, and Q. Zhang. Sorting, searching, and simulation in the MapReduce framework. In ISAAC, pages 374–383, 2011.
[25] X. Hu, Y. Tao, and K. Yi. Output-optimal parallel algorithms for similarity joins. In PODS, 2017.
[26] X. Hu, K. Yi, and Y. Tao. Output-optimal massively parallel algorithms for similarity joins. ACM Transactions on Database Systems, 44(2):6, 2019.
[27] B. Ketsman and D. Suciu. A worst-case optimal multi-round algorithm for parallel computation of conjunctive queries. In PODS, pages 417–428, 2017.
[28] P. Koutris, P. Beame, and D. Suciu. Worst-case optimal algorithms for parallel query processing. In ICDT, 2016.
[29] P. Koutris and D. Suciu. Parallel evaluation of conjunctive queries. In PODS, pages 223–234, 2011.
[30] P. Koutris and D. Suciu. A guide to formal analysis of join processing in massively parallel systems. SIGMOD Record, 45(4):18–27, 2016.
[31] M. Langberg, S. Li, S. V. M. Jayaraman, and A. Rudra. Topology dependent bounds for FAQs. 2019.
[32] F. T. Leighton. Complexity Issues in VLSI: Optimal Layouts for the Shuffle-Exchange Graph and Other Networks. MIT Press, 1983.
[33] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Elsevier, 2014.
[34] F. T. Leighton, B. M. Maggs, and S. B. Rao. Packet routing and job-shop scheduling in O(congestion + dilation) steps. Combinatorica, 14(2):167–186, 1994.
[35] C. E. Leiserson. Fat-trees: Universal networks for hardware-efficient supercomputing. IEEE Transactions on Computers, 34(10):892–901, 1985.
[36] N. Linial. Locality in distributed graph algorithms. SIAM Journal on Computing, 21(1):193–201, 1992.
[37] F. Liu, A. Salmasi, S. Blanas, and A. Sidiropoulos. Chasing similarity: Distribution-aware aggregation scheduling. PVLDB, 12(3):292–306, 2018.
[38] S. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. TAG: A tiny aggregation service for ad-hoc sensor networks. In OSDI, 2002.
[39] S. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. The design of an acquisitional query processor for sensor networks. In SIGMOD, pages 491–502, 2003.
[40] S. Madden, R. Szewczyk, M. J. Franklin, and D. E. Culler. Supporting aggregate queries over ad-hoc wireless sensor networks. In WMCSA, pages 49–58, 2002.
[41] O. O'Malley. Terabyte sort on Apache Hadoop. 2008.
[42] D. P. Palomar and M. Chiang. A tutorial on decomposition methods for network utility maximization. IEEE Journal on Selected Areas in Communications, 24(8):1439–1451, 2006.
[43] M. Pătrașcu. Unifying the landscape of cell-probe lower bounds. SIAM Journal on Computing, 40(3):827–847, 2011.
[44] D. Peleg. Distributed Computing: A Locality-Sensitive Approach. SIAM, 2000.
[45] Y. Tao, W. Lin, and X. Xiao. Minimal MapReduce algorithms. In SIGMOD, pages 529–540, 2013.
[46] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, August 1990.
[47] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, 1990.
[48] Y. Tao. A simple parallel algorithm for natural joins on binary relations. In ICDT, 2020.

A Omitted Proofs

A.1 Cartesian Product with Unequal Sizes

We consider the general cartesian product on a symmetric star topology G = (V, E). For simplicity, we divide the compute nodes into two subsets:

V_α = {v ∈ V_C : min{N_v, N − N_v} < |R|},  V_β = V_C − V_α.

The first lower bound can be simplified as follows.

Theorem 8. Any algorithm computing the cartesian product R × S has cost Ω(C), where

C ≥ max{ max_{v ∈ V_α} min{N_v, N − N_v} / w_v , max_{v ∈ V_β} |R| / w_v }.

Moreover, we define L(R, S, V_C) as the minimum value of C satisfying

Σ_{v ∈ V_C} min{C · w_v, |R|} · C · w_v ≥ |R| · |S|.   (2)

We can then give the second lower bound as follows.

Theorem 9. If max_v N_v ≤ N/2, any algorithm computing the cartesian product R × S has cost Ω(C), where

C ≥ min{ |S| / max_v w_v , Σ_{u ∈ V_α} |S_u| / Σ_{u ∈ V_β} w_u , L(R, ∪_{u ∈ V_α} S_u, V_α) }.

Proof. It suffices to show that if C ≤ |S| / max_v w_v, then C ≥ min{ Σ_{u ∈ V_α} |S_u| / Σ_{u ∈ V_β} w_u , L(R, ∪_{u ∈ V_α} S_u, V_α) }. We first rewrite the counting inequality from Section 4.5 as

|R| · Σ_{u ∈ V_α} |S_u| ≤ Σ_{u ∈ V_α} min{C · w_u, |R|} · C · w_u + Σ_{u ∈ V_β} |R| · C · w_u.

For this inequality to hold, at least one of the two sums on the right-hand side must be at least half of the left-hand side, which yields the desired result.

Generalized wHC Algorithm. We extend the wHC algorithm for computing R × S with |R| < |S| on a symmetric star topology.

Algorithm 7: BalancedPackingUnEqual(G, D)
1: L* ← L(R, S, V_C); w ← max_v w_v
2: while the |R| × |S| grid is not fully covered do
3:   u ← arg max_{v ∈ V_C} w_v
4:   ℓ ← min{k : w ≤ 2^k · w_u}
5:   if 2^{−ℓ} · w · L* ≥ |R| then assign to u a rectangle of size |R| × (w_u · L*)
6:   else assign to u a square of size (2^{−ℓ} · w · L*) × (2^{−ℓ} · w · L*)
7:   V_C ← V_C − {u}

To show the correctness of Algorithm 7, it suffices to show that the grid is fully covered once V_C becomes empty. Indeed, notice that each node v covers an area of size at least L* · w_v · min{L* · w_v, |R|} (up to a constant factor). Summing over all compute nodes, the total area covered is at least

Σ_{v ∈ V_C} L* · w_v · min{L* · w_v, |R|} ≥ |R| · |S|

by the definition of L* in (2), so the grid is covered.

Next, we analyze the cost of the algorithm. Observe that each node v receives at most 4L* · w_v tuples. Hence, the cost is bounded by 4L*, yielding the following result.

Lemma 10. The wHC algorithm correctly computes the cartesian product R × S with (tuple) cost O(C), where

C = max{ max_v N_v / w_v , L(R, S, V_C) }.
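Since L(R, S, V_C) is defined implicitly by inequality (2) and its left-hand side is nondecreasing in C, it can be approximated by binary search. The sketch below is our own illustration (the function name, tolerance, and iteration count are arbitrary choices of ours):

def l_star(r_size, s_size, weights, iters=60):
    # Smallest C with sum_v min(C*w_v, |R|) * C*w_v >= |R|*|S|, i.e. (2).
    def covered(c):
        return sum(min(c * w, r_size) * c * w for w in weights)
    lo, hi = 0.0, 1.0
    while covered(hi) < r_size * s_size:     # find an upper bracket for C
        hi *= 2
    for _ in range(iters):                   # bisect on the monotone predicate
        mid = (lo + hi) / 2
        if covered(mid) < r_size * s_size:
            lo = mid
        else:
            hi = mid
    return hi

# Four compute nodes with bandwidths 4, 2, 1, 1; |R| = 100, |S| = 1000:
print(l_star(100, 1000, [4, 2, 1, 1]))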
Putting Everything Together on a Symmetric Star. We now introduce our algorithm for computing the cartesian product on a symmetric star. It can easily be checked that the cost of Algorithm 8 matches the lower bounds of Theorem 8 and Theorem 9; the algorithm is therefore optimal.

Algorithm 8: GeneralizedStarCartesianProduct(G, D)
1: if max_u N_u > N/2 then
2:   all compute nodes send their data to arg max_u N_u
3: else
4:   all compute nodes send their R-tuples to V_β
5:   pick the best of:
     1. all compute nodes send their data to arg max_u w_u;
     2. all nodes in V_α send their S-tuples proportionally to V_β;
     3. run the wHC algorithm on V_α to compute R × ∪_{v ∈ V_α} S_v.
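Finally, a compact sketch of the dispatch performed by Algorithm 8 (ours; the three cost estimates below are deliberately crude placeholders loosely modeled on the terms of Theorem 9, not the paper's exact expressions):

def star_cartesian_product_plan(sizes, weights, r_size, s_size):
    # sizes[v] = N_v, weights[v] = w_v on a symmetric star.
    n = sum(sizes.values())
    heaviest = max(sizes, key=lambda v: sizes[v])
    if sizes[heaviest] > n / 2:
        return "all nodes send their data to " + heaviest
    v_alpha = {v for v in sizes if min(sizes[v], n - sizes[v]) < r_size}
    v_beta = set(sizes) - v_alpha
    # All compute nodes first send their R-tuples to V_beta, then pick the
    # cheapest of the three options of Algorithm 8 (placeholder estimates):
    options = {
        "1: centralize at the max-bandwidth node":
            n / max(weights.values()),
        "2: V_alpha sends S-tuples proportionally to V_beta":
            sum(sizes[v] for v in v_alpha) / max(sum(weights[v] for v in v_beta), 1e-9),
        "3: run wHC on V_alpha":
            (r_size * s_size) ** 0.5 / max(weights.values()),
    }
    return min(options, key=lambda k: options[k])

print(star_cartesian_product_plan(
    sizes={"a": 40, "b": 35, "c": 25},
    weights={"a": 2.0, "b": 1.0, "c": 1.0},
    r_size=30, s_size=70))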