Efficient Circuit Simulation in MapReduce
Fabian Frei
Department of Computer Science, ETH Zürich, Universitätstrasse 6, CH-8006 Zürich, [email protected]
Koichi Wada
Department of Applied Informatics, Hosei University, 3-7-2 Kajino, 184-8584 Tokyo, [email protected]
Abstract
The MapReduce framework has firmly established itself as one of the most widely used parallel computing platforms for processing big data on tera- and petabyte scale. Approaching it from a theoretical standpoint has proved to be notoriously difficult, however. In continuation of Goodrich et al.'s early efforts, explicitly espousing the goal of putting the MapReduce framework on footing equal to that of long-established models such as the PRAM, we investigate the obvious complexity question of how the computational power of MapReduce algorithms compares to that of combinational Boolean circuits commonly used for parallel computations. Relying on the standard MapReduce model introduced by Karloff et al. a decade ago, we develop an intricate simulation technique to show that any problem in NC (i.e., a problem solved by a logspace-uniform family of Boolean circuits of polynomial size and a depth polylogarithmic in the input size) can be solved by a MapReduce computation in O(T(n)/log n) rounds, where n is the input size and T(n) is the depth of the witnessing circuit family. Thus, we are able to closely relate the standard, uniform NC hierarchy modeling parallel computations to the deterministic MapReduce hierarchy DMRC by proving that NC^{i+1} ⊆ DMRC^i for all i ∈ N. Besides the theoretical significance, this result has important applied aspects as well. In particular, we show for all problems in NC—many practically relevant ones such as integer multiplication and division, the parity function, and recognizing balanced strings of parentheses being among these—how to solve them in a constant number of deterministic MapReduce rounds.

2012 ACM Subject Classification: Theory of computation → Complexity classes; Computing methodologies → MapReduce algorithms; Theory of computation → Circuit complexity; Theory of computation → MapReduce algorithms; Software and its engineering → Ultra-large-scale systems
Keywords and phrases
MapReduce, Circuit Complexity, Parallel Algorithms, Nick's Class NC

Related Version
This is the full version of the preliminary paper with the same title [9] presented at the 30th International Symposium on Algorithms and Computation (ISAAC 2019).
Funding
Koichi Wada: Research done in part during a supported visit at ETH Zürich and partly supported by JSPS KAKENHI No. 17K00019 and by the Japan Science and Technology Agency (JST) SICORP (Grant
Acknowledgements
We thank the anonymous reviewers for their helpful comments.
Despite the overwhelming success of the MapReduce framework in the big data industry and the great attention it has garnered ever since its inception over a decade ago, theoretical results about it have remained scarce in the literature. In particular, it is very natural to ask how powerful exactly MapReduce computations are in comparison to the traditional models of parallel computations based on circuits, a question that has practical implications as well. The answers have proved to be very elusive, however. In this paper, we show how MapReduce programs can efficiently simulate circuits used for parallel computations, thus tying these two worlds together more tightly.
In this section we first provide an introduction to the concept of MapReduce, then present the related work, and finally describe our contribution. In Section 2, we will formally define the traditional models of parallel computing and the MapReduce model. In Section 3, we then derive our main results. Section 4 concludes the paper with a short summary and a discussion of our findings, outlining opportunities for future research.
In recent years the amount of data available and demanding analysis has grown at an astonishing rate. The amount of memory in commercially available servers has also grown at a remarkable pace in the past decade; it is now exceeding tera- and even petabytes. Despite the considerable advances in the availability of computational power, traditional approaches remain insufficient to cope with such huge amounts of data. A new form of parallel computing has become necessary to deal with these enormous quantities of available data. The MapReduce framework has been attracting great interest due to its suitability for processing massive data sets. This framework was originally developed by Google [5], but an open source implementation called Hadoop has recently been developed and is currently used by over a hundred companies, including Yahoo!, Facebook, Adobe, and IBM [19].

MapReduce differs substantially from previous models of parallel computation in that it combines aspects of both parallel and sequential computation. Informally, a MapReduce computation can be described as follows. The input is a set of key-value pairs ⟨k; v⟩. In a first step, the map step, each of these key-value pairs is separately and independently transformed into an entire set of key-value pairs by a map function µ. In the next step, the shuffle step, we collect all key-value pairs from the sets that have been produced in the previous step, group them by their keys, and merge each group {⟨k; v_1⟩, ⟨k; v_2⟩, . . .} of pairs containing the same key into a single key-value pair ⟨k; {v_1, v_2, . . .}⟩ consisting of said key and a list of the associated values. In a third step, the reduce step, a reduce function ρ transforms the list of values in each key-value pair ⟨k; {v_1, v_2, . . .}⟩ into a new list {v′_1, v′_2, . . .}. Again, this is done separately and independently for each pair. The final output consists of the pairs {⟨k; v′_1⟩, ⟨k; v′_2⟩, . . .} for each key k. The different instances that implement the reduce function for the different groups of pairs are called reducers. Analogously, mappers are instances of the map function.

The three steps described above constitute one round of the MapReduce computation and transform the input set into a new set of key-value pairs. A complete MapReduce computation consists of any given number of rounds and acts just as the composition of the single rounds. The shuffle step works the same way every time; the map and reduce functions, however, may change from round to round. A MapReduce computation with R rounds is therefore completely described by a list µ_1, ρ_1, µ_2, ρ_2, . . . , µ_R, ρ_R of map and reduce functions. In both the map step and the reduce step, the input pairs can be processed in parallel since the map and reduce functions act independently on the pairs and groups of pairs, respectively. These steps therefore capture the parallel aspect of a MapReduce computation, whereas the shuffle step enforces a partial sequentiality since the shuffled pairs can be output only once the previous map step is completed in its entirety.

The MapReduce paradigm has been introduced in [5] in the context of algorithm design and analysis. A treatment as a formal computational model, however, was missing in the beginning. Later on, a number of models have emerged to deal more rigorously with algorithmic issues [8, 10, 11, 14, 15].
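To make the round structure just described concrete, here is a minimal, sequential Python sketch of one MapReduce round; the names run_round, mapper, and reducer as well as the toy character-counting example are ours and not part of any MapReduce implementation.

```python
from collections import defaultdict

def run_round(pairs, mapper, reducer):
    """Execute one MapReduce round (map, shuffle, reduce) on key-value pairs."""
    # Map step: every pair is transformed independently into a set of new pairs.
    mapped = [out for kv in pairs for out in mapper(kv)]
    # Shuffle step: group all values by their key.
    groups = defaultdict(list)
    for k, v in mapped:
        groups[k].append(v)
    # Reduce step: each key's value list is transformed independently.
    return [(k, v_out) for k, values in groups.items() for v_out in reducer(k, values)]

# Toy example: counting character occurrences in a single round.
pairs = [(i, ch) for i, ch in enumerate("abracadabra")]
mapper = lambda kv: [(kv[1], 1)]        # re-key each character by itself
reducer = lambda k, vs: [sum(vs)]       # sum the 1s per character
print(sorted(run_round(pairs, mapper, reducer)))
# [('a', 5), ('b', 2), ('c', 1), ('d', 1), ('r', 2)]
```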
In this paper, our interest lies in studying the MapReduce framework from a standpoint of parallel algorithmic power by comparing it to standard models of parallel computation such as Boolean circuits and parallel random access machines (PRAMs). A PRAM can be classified by how far simultaneous access by processors to its memory is restricted; it can be CRCW, EREW, CREW, or ERCW, where R, W, C, and E stand for Read, Write, Concurrent, and Exclusive, respectively [4]. If concurrent writing is allowed, we need to further specify how parallel writes by multiple processors to a single memory cell are handled. The most natural choice is arguably that every memory cell contains after each time step the total of all numbers assigned to it by different processors during that step. In fact, all constructions in this paper work with this treatment of simultaneous writes; we thus generally assume this model. If the context warrants it, we speak of a Sum-CRCW to make this assumption explicit.

We briefly present and discuss the following known results on the comparative power of the MapReduce framework and PRAM models. A T-time EREW-PRAM algorithm can be simulated by an O(T)-round MapReduce algorithm, where each reducer uses memory of constant size and an aggregate memory proportional to the amount of shared memory required by the PRAM algorithm [10, 11]. A P-processor, M-memory, T-time EREW-PRAM algorithm can be simulated by an O(T)-round, (P + M)-key MUD algorithm with a communication complexity of O(log(P + M)) bits per key, where a MUD (massive, unordered, distributed) algorithm is a data-streaming MapReduce algorithm in the following sense: The reducers do not receive the entire list of values associated with a given key at once, but rather as a stream to be processed in one pass, using only a small working memory determining the communication complexity [8]. When using MapReduce computations to simulate a CRCW-PRAM instead, again with P processors and M memory, we incur an O(log_m(P + M)) slowdown compared to the simulations above, where m is an upper bound on each reducer's input and output [10].

These results imply that any problem solved by a PRAM with a polynomial number of processors and in polylogarithmic time T can be simulated by a MapReduce computation with an amount of memory equal to the number of PRAM processors, and in a number of rounds equal to the computation time of even the powerful CRCW-PRAM. Since the class of problems solved by CRCW-PRAMs in time T ∈ O(log^i n) is equal to the class of problems solved by families of polynomial-sized combinational circuits of depth T ∈ O(log^i n) consisting of gates with unbounded fan-in and fan-out (often denoted AC^i) [1], these circuits can be simulated in a MapReduce computation with a number of rounds equal to the depth of these circuits.

Since the publication of the seminal paper by Karloff et al. [11], extensive effort has been spent on developing efficient algorithms in MapReduce-like frameworks [3, 6, 13, 12, 17]. Only few relationships between the theoretical MapReduce model [11] and classical complexity classes have been established, however; for example, any problem in SPACE(o(log n)) can be solved by a MapReduce computation with a constant number of rounds [7].

Recently, Roughgarden et al. [16, Theorem 6.1] described a short and simple way of simulating NC circuits with a certain class of models of parallel computation. The constraints of these models, namely the number of machines and the memory restrictions, are exactly tailored to allow for this general simulation method, however. In particular, it crucially relies on the fact that all models of this class are more powerful than the MapReduce model in that they all grant us a number of machines that is polynomial in the input size; this makes it possible to just dedicate one machine to each of the circuit gates. Such a simple simulation is impossible with MapReduce computations since the standard model due to Karloff only allows for a sublinear number of machines with sublinear memory.
We prove that NC^{i+1} ⊆ DMRC^i for all i ∈ {0, 1, 2, . . .}, where DMRC^i is the set of problems solvable by a deterministic MapReduce computation in O(log^i n) rounds. In the case of NC^1 ⊆ DMRC^0, which already opens up a plethora of applications on its own, the result holds for every possible choice of ε, that is, for 0 < ε ≤ 1/2. The higher levels of the hierarchy require an entirely different proof method, which yields the result for 0 < ε < 1/2. (Note that the previously known PRAM simulations mentioned above only yield AC^i ⊆ MRC^i.) The case of NC^1 is of particular practical interest since NC^1 \ AC^0 contains plenty of relevant problems such as integer multiplication and division, the parity function, and the recognition of the Dyck languages D_n, which contain all balanced strings of n different types of parentheses; see [1]. Our results show how to solve all of these problems with a deterministic MapReduce program in a constant number of rounds.

We denote by N = {0, 1, 2, . . .} the natural numbers including zero and let N^+ = N \ {0}. Moreover, we let [i] = {0, 1, . . . , i − 1} denote the i first natural numbers for any i ∈ N^+.

In this section, we define the common complexity classes capturing the power of parallel computation; most prominently the NC hierarchy. A finite set B = {f_0, . . . , f_{|B|−1}} of Boolean functions f_i: {0, 1}^{n_i} → {0, 1} with n_i ∈ N for every i ∈ [|B|] is called a basis. For every n, m ∈ N^+, a (Boolean) circuit C over the basis B with n inputs and m outputs is a directed acyclic graph that contains n sources (nodes with no incoming edges), called the input nodes, and m sinks (nodes with no outgoing edges). The fan-in of a node is the number of incoming edges, the fan-out is the number of outgoing edges. Nodes that are neither sources nor sinks are called gates. Each gate is labeled with a function f_i ∈ B and has fan-in n_i. It computes f_i on the input given by the incoming edges and outputs the result (either 0 or 1) to the outgoing edges. A basis B is said to be complete if for every Boolean function f, we can construct a circuit of the described form that computes f over the basis B. In the following, we use the complete basis B = {∨, ∧, ¬}.

The size of a circuit C, denoted by size(C), is the total number of edges it contains. The level of a node v in a circuit C, denoted level(v), is defined recursively: The level of a sink is 0, and the level of a node v with nonzero fan-out is one greater than the maximum of the levels of the outgoing neighbors of v. The depth of C, denoted depth(C), is the maximum level across all nodes in C.

A function f: {0, 1}* → {0, 1}* is implicitly logspace computable if the two mappings (x, i) ↦ χ_{i ≤ |f(x)|}, where χ denotes the characteristic function, and (x, i) ↦ (f(x))_i are computable using logarithmic space. A circuit family {C_n}_{n=0}^∞ is logspace-uniform if there is an implicitly logspace computable function mapping 1^n to the description of the circuit C_n. It is known that the class of languages that have logspace-uniform circuits of polynomial size equals P [1, Thm. 6.15].

For any i ∈ N, the complexity class NC^i contains a language L exactly if there is a constant c and a logspace-uniform family of circuits {C_n}_{n=0}^∞ recognizing L such that C_n has size O(n^c), depth O(log^i n), and all nodes have fan-in at most 2. The union is Nick's class NC = ⋃_{i=0}^∞ NC^i. We mention that there is an analogous definition of classes Nonuniform-NC^i that do not require logspace uniformity from the circuits; they constitute a different hierarchy.

The complexity classes AC^i and AC = ⋃_{i=0}^∞ AC^i are defined exactly as NC^i and NC, except that the restriction of the maximal fan-in to at most 2 is omitted. Nevertheless, the restriction on the circuit size implies that the fan-in of a node is bounded by a polynomial in n.
The OR gates and AND gates in such a circuit can therefore be replaced by trees of gates of fan-in at most 2 with a depth in O(log n). It follows that AC^i ⊆ NC^{i+1} for all i ∈ N and thus NC = AC. (Analogously, we see why Nick's class can also be defined, as it often is, by upper-bounding the fan-in by an arbitrary constant greater than 2.) The inclusion NC^i ⊆ AC^i for every i ∈ N is immediate from the definition. The first two inclusions of the resulting chain are known to be strict—namely, we have NC^0 ⊊ AC^0 ⊊ NC^1; see [1].

Finally, we summarize the known results on how the classes of languages recognized by different PRAMs fit into the two hierarchies of NC and AC. Let EREW^i, CREW^i, and CRCW^i denote the sets of problems of size n computed by EREW-PRAMs, CREW-PRAMs, and CRCW-PRAMs, respectively, with a polynomial number of processors in O(log^i n) time. For every i ∈ N, we have NC^i ⊆ EREW^i ⊆ CREW^i ⊆ CRCW^i = AC^i ⊆ NC^{i+1}; see [1].

In this section we describe the standard MapReduce model as proposed by Karloff et al. [11]. It defines the notions of map functions and reduce functions, which are summarized under the term primitives. Roughly speaking, a MapReduce computing system executes primitives, interleaved with so-called shuffle operations. The basic data unit in these computations is an ordered pair ⟨key; value⟩, called key-value pair. In general, keys and values are just binary strings, allowing us to encode all the usual entities.

A map function is a (possibly randomized) function that takes as input a single key-value pair and outputs a finite multiset of new key-value pairs. A reduce function (again, possibly randomized) takes instead an entire set of key-value pairs {⟨k; v_{k,1}⟩, ⟨k; v_{k,2}⟩, . . .}, where all the keys are identical, and outputs a single key-value pair ⟨k; v′⟩ with that same key.

A MapReduce program is nothing else than a sequence µ_1, ρ_1, µ_2, ρ_2, . . . , µ_R, ρ_R of map functions µ_r and reduce functions ρ_r. The input of this program is a multiset U_0 of key-value pairs. For each r ∈ {1, . . . , R}, a map step, a shuffle step, and a reduce step are successively executed as follows:

Map step: Each pair ⟨k; v⟩ in U_{r−1} is given as input to an arbitrary instance of the map function µ_r, which then produces a finite sequence of pairs. The multiset of all produced pairs is denoted by U′_r.

Shuffle step: For each key k, let V_{k,r} be the multiset of all values v_i such that ⟨k; v_i⟩ ∈ U′_r. The MapReduce system automatically constructs the multisets V_{k,r} from U′_r in the background.

Reduce step: For each key k, a reducer (i.e., an instance calculating the reduce function ρ_r) receives k and the elements of V_{k,r} in arbitrary order. We usually write such an input as a set of key-value pairs that all have key k. The reducer calculates from V_{k,r} another multiset V′_{k,r}, which is output in the form of one key-value pair ⟨k; v′⟩ for each v′ in V′_{k,r}.
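The three steps of the formal model translate directly into code. The following hedged Python sketch (our naming; a sequential stand-in for the parallel system of Karloff et al.) executes a MapReduce program given as the list of pairs (µ_r, ρ_r):

```python
def run_program(program, inputs):
    """Run a MapReduce program, i.e., a list [(mu_1, rho_1), ..., (mu_R, rho_R)],
    on an input multiset U_0 of key-value pairs."""
    pairs = list(inputs)                              # U_0
    for mu, rho in program:
        # Map step: U'_r is the multiset of all pairs produced by the mappers.
        u_prime = [out for kv in pairs for out in mu(kv)]
        # Shuffle step: the system groups the values V_{k,r} by key k.
        v = {}
        for k, value in u_prime:
            v.setdefault(k, []).append(value)
        # Reduce step: reducer k turns V_{k,r} into one pair <k; v'> per v' in V'_{k,r}.
        pairs = [(k, out) for k, values in v.items() for out in rho(k, values)]
    return pairs
```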
Fix any ε with 0 < ε ≤ 1/2 and let N denote the total size of the input. For every i ∈ N, a problem is in MRC^i if and only if there is a MapReduce program µ_1, ρ_1, µ_2, ρ_2, . . . , µ_R, ρ_R satisfying the following properties:
- It outputs a correct answer to the problem with probability at least 3/4.
- The number of rounds of the MapReduce program, R, is in O(log^i N).
- The potentially randomized primitives (i.e., all map and reduce functions) are computable by a RAM with O(log N)-bit words using O(N^{1−ε}) space and time polynomial in N.
- The pairs produced by the map functions can be stored in O(N^{2(1−ε)}) space.

A MapReduce program satisfying these conditions is called an MRC^i-algorithm. Note that due to the last condition it is impossible to even store the input unless 2(1 − ε) ≥ 1, which explains the requirement 0 < ε ≤ 1/2. As with NC, we define the union class MRC = ⋃_{i=0}^∞ MRC^i. Requiring all primitives to be deterministic yields the analogous hierarchy of DMRC = ⋃_{i=0}^∞ DMRC^i. Note that we obviously have DMRC^i ⊆ MRC^i for all i ∈ N. We will often refer to the single rounds of such MapReduce algorithms as MRC-rounds and DMRC-rounds, respectively.
We are now going to prove our two main results, NC^1 ⊆ DMRC^0 for 0 < ε ≤ 1/2 and NC^{i+1} ⊆ DMRC^i for all i ∈ N^+ and 0 < ε < 1/2. For NC^1 ⊆ DMRC^0, we simulate width-bounded branching programs that are equivalent to the respective circuits by Barrington's classical theorem [2], whereas for the higher levels of the hierarchy, we directly simulate the combinational circuits themselves.

Goodrich et al. [10] parametrize MapReduce algorithms, on the one hand, by the memory limit m for the input/output buffer of the reducers and, on the other hand, by the communication complexity K_r of round r, that is, the total size of inputs and outputs for all mappers and reducers in round r. We state a useful result from [10].

◮ Theorem 1.
Any CRCW-PRAM algorithm using M total memory, P processors, and T time can be simulated in O(T log_m P) deterministic MapReduce rounds with communication complexity K_r ∈ O((M + P) log_m(M + P)).

We denote by N the size of the smallest circuit representation of the CRCW-PRAM algorithm (i.e., its number of edges) plus the size of its input. Taking into account our requirements m ∈ O(N^{1−ε}) and K_r ∈ O(N^{2(1−ε)}), we obtain the following technical tool, which will prove to be useful in our endeavor.

◮ Corollary 2.
Any CRCW-PRAM algorithm using M total memory, P processors, and T time can be simulated in O(T log_{N^{1−ε}} P) DMRC-rounds if (M + P) log_{N^{1−ε}}(M + P) ∈ O(N^{2(1−ε)}).

NC^1 ⊆ DMRC^0

It is known that Nonuniform-NC^1 is equal to the class of languages recognized by nonuniform width-bounded branching programs. A careful inspection of the proof due to Barrington [2]—crucially relying on the non-solvability of the permutation group on 5 elements—reveals that it naturally translates to the uniform analogue: Our uniform class NC^1 is identical with the class of languages recognized by uniform width-bounded branching programs. In order to prove NC^1 ⊆ DMRC^0, it therefore suffices to show how to simulate such branching programs by appropriate MapReduce computations with a constant number of rounds.

We first define width-bounded branching programs. Let n, w ∈ N^+. The input to the program is an assignment α to n Boolean variables X = {x_0, . . . , x_{n−1}}. An instruction or line of the program is a triple (x_i, f, g), where i is the index of an input variable x_i ∈ X and f and g are endomorphisms of [w]. An instruction (x_i, f, g) evaluates to f if α(x_i) = 1 and to g if α(x_i) = 0. A width-w branching program of length t is a sequence of instructions (x_{i_j}, f_j, g_j) for j ∈ [t]. We also refer to the t instructions as the lines of the program. Given an assignment α to X, a branching program B yields a function B(α) that is the composition of the functions to which the instructions evaluate.

To recognize a language L ⊆ {0, 1}*, we need a family (B_n)_{n=0}^∞ of width-w branching programs with B_n taking n Boolean inputs. We say that L is recognized by B_n if there is, for each n ∈ N, a set F_n of endomorphisms of [w] such that for all α ∈ {0, 1}^n, α ∈ L if and only if B_n(α) ∈ F_n. If the f_i and g_i are automorphisms, that is, permutations of [w] for all i ∈ [t], then B_n is called a width-w permutation branching program, or w-PBP for short.

◮ Theorem 3. [2] If L ∈ NC^1, then L is recognized by a logspace-uniform 5-PBP family.
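As an aside, evaluating a w-PBP on a given assignment is straightforward; the following minimal Python sketch (function names ours) composes the permutations contributed by the instructions and tests membership in the accepting set F_n:

```python
def compose(f, g):
    """Composition f ∘ g of two permutations of [w] given as tuples: (f ∘ g)(i) = f(g(i))."""
    return tuple(f[g[i]] for i in range(len(f)))

def evaluate_pbp(program, alpha, accepting):
    """Evaluate a width-w PBP on assignment alpha (indexable by variable index).
    program: list of instructions (i, f, g); instruction j contributes f_j if
    alpha[i] == 1 and g_j otherwise. Accept iff the overall composition is in F_n."""
    w = len(program[0][1])
    result = tuple(range(w))             # start from the identity permutation
    for i, f, g in program:
        step = f if alpha[i] == 1 else g
        result = compose(step, result)   # apply the instructions in program order
    return result in accepting
```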
Due to Theorem 3 it is sufficient for our purposes to simulate the w-PBPs with constant w instead of the circuit families provided by the definition of NC^1. In order to do this, we need to encode the given w-PBP and the possible assignments in the right form, namely we express them as sets of key-value pairs. A w-PBP of length t can be described as the set {⟨p; (x_{i_p}, f_p, g_p)⟩ | p ∈ [t]}, where we call p the line number of line (x_{i_p}, f_p, g_p). Similarly, an assignment α: X → {0, 1}, x_i ↦ v_i to the input variables X = {x_0, x_1, . . . , x_{n−1}} is described by the set of key-value pairs {⟨i; (x_i, v_i)⟩ | i ∈ [n]}, letting the mappers divide the information by the indices of the input variables. Let N_O and N_I be the total size of the encodings of the w-PBP and the input assignment α, respectively. Let N = N_O + N_I and let d = ⌈N_O^{1−ε}⌉ and ℓ = ⌈N_O^ε⌉. We denote by ÷ the integer division. For every q ∈ [t ÷ d], let w-PBP_q be the q-th of the subprogram blocks of w-PBP of length d, that is, {⟨p; (x_{i_p}, f_p, g_p)⟩ | qd ≤ p ≤ (q + 1)d − 1}. For ease of readability, we assume from now on without loss of generality that dℓ = t, so that w-PBP can be partitioned into exactly ℓ such subprograms.

For every q ∈ [ℓ], we denote by X_q the subset of variables from X appearing in the instructions of subprogram w-PBP_q. An assignment α_q: X_q → {0, 1} to these variables is represented as a set of key-value pairs in the following way. Recall that the subprogram w-PBP_q is a list of lines, each of which requires the assignment of a value, either 0 or 1, for exactly one variable. Let x_{q,j} be the j-th variable to which a value is assigned in w-PBP_q, let p_{q,j} denote the number of the line in which this assignment occurs for the first time in w-PBP_q, and let v_{q,j} denote the value that is assigned to x_{q,j} in this line. Now, we represent α_q by {⟨q; (p_{q,j}, x_{q,j}, v_{q,j})⟩ | j ∈ [|X_q|]}. Note that despite the dependence of X_q on q, we always have |X_q| ≤ d. Having seen how to express w-PBP, α, and both w-PBP_q and α_q for all q ∈ [ℓ] as sets of key-value pairs, we are ready to state and prove the following lemma.

◮ Lemma 4.
Let L be a w-PBP-recognized language. If, for every q ∈ [ℓ], the representations of w-PBP_q and α_q are given, then we can decide in a 2-round DMRC^0-computation whether α ∈ L or not.

Proof. As already described above, let w-PBP be represented by the set {⟨p; (x_{i_p}, f_p, g_p)⟩ | p ∈ [t]} and, for every q ∈ [ℓ], the assignment α_q by {⟨q; (p_{q,j}, x_{q,j}, v_{q,j})⟩ | j ∈ [|X_q|]}. Note that there are ℓ subprograms of length at most d and ℓ partial assignments that each assign values to at most one variable per line of the corresponding partial program; thus the total size of the input is in O(dℓ) ⊆ O(N_O) ⊆ O(N).

We define the first map function µ_1 by µ_1(⟨p; (x_{i_p}, f_p, g_p)⟩) = {⟨p ÷ d; (p, x_{i_p}, f_p, g_p)⟩} for each p ∈ [t] and µ_1(⟨q; (p_{q,j}, x_{q,j}, v_{q,j})⟩) = {⟨p_{q,j} ÷ d; (p_{q,j}, x_{q,j}, v_{q,j})⟩} for each q ∈ [ℓ] and j ∈ [|X_q|].

For any q ∈ [ℓ], there is one subprogram w-PBP_q and an associated assignment set α_q. We use the map function µ_1 to find the value assignment for each variable appearing in w-PBP_q and store it in a key-value pair. This pair has the key q and is thereby designated to be processed by reducer q, which can calculate ρ_1, having all pairs with key q available. This function simulates, for each permutation π of [w], the subprogram w-PBP_q on this permutation with the received assignment and stores the resulting permutation π′. This yields a table T_q of size w! ∈ O(1), describing the action of w-PBP_q for the given assignment on all w! permutations. (We mention in passing that for the first reducer it would be sufficient to compute and store only the permutation that results from applying w-PBP_0 on the given assignment to the identity as the initial permutation, thus saving the time and memory necessary for the rest of the first table.) The output of ρ_1 on the q-th reducer is ⟨q; T_q⟩.

The map function µ_2 of the second round is simple; it maps ⟨q; T_q⟩ to ⟨0; (q, T_q)⟩, thus delivering all pairs (q, T_q) to a single instance of the reduce function ρ_2. This first reducer has therefore all tables T_0, . . . , T_{ℓ−1} at its disposal and knows which one is which. Using T_q as a look-up table for the permutation performed by w-PBP_q, reducer 0 can now compute, starting from the identity permutation id, the permutation π = T_{ℓ−1} ◦ · · · ◦ T_2 ◦ T_1 ◦ T_0(id), and the input is accepted if and only if π ∈ F_n, where F_n is the set of accepted permutations that is given to us alongside the program w-PBP. ◭
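A hedged sketch of the two reduce functions of this proof (our naming, with each block and assignment already delivered to its reducer): ρ_1 tabulates the action T_q of a subprogram on all w! start permutations, and ρ_2 chains the table look-ups starting from the identity.

```python
from itertools import permutations

def block_table(block, alpha):
    """Round-1 reducer of Lemma 4: tabulate the action of subprogram w-PBP_q
    under the assignment alpha on every start permutation pi of [w]."""
    w = len(block[0][1])
    table = {}
    for pi in permutations(range(w)):
        cur = pi
        for i, f, g in block:                   # lines of w-PBP_q, in order
            step = f if alpha[i] == 1 else g
            cur = tuple(step[c] for c in cur)   # compose: step ∘ cur
        table[pi] = cur
    return table                                # T_q, of size w! ∈ O(1)

def combine_tables(tables, accepting):
    """Round-2 reducer: compute T_{l-1}(· · · T_1(T_0(id)) · · ·) by chained
    look-ups and accept iff the resulting permutation lies in F_n."""
    w = len(next(iter(tables[0])))
    pi = tuple(range(w))                        # identity permutation
    for t in tables:                            # tables ordered by block index q
        pi = t[pi]
    return pi in accepting
```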
In the following four lemmas, we show that α_q can be computed in a constant number of rounds from w-PBP and α for every q ∈ [ℓ]. The challenge lies in designing an interface between the different reducers to bridge the gap between the ℓ program blocks w-PBP_q and the given assignments, initially cut into ℓ blocks based solely on the indices of the input variables, without exceeding the memory limits. We begin with a brief overview of the four steps.

1. For each x_i, where i ∈ [n], we compute the number of subprograms in which x_i appears, and denote this number by S(x_i). Note that S(x_i) ≤ ℓ and that S(x_i) is the number of all those reducers for which the value assignment of x_i is generally required to compute the resulting permutations in the corresponding subprograms.

2. We compute the prefix sums of S(x_i). For i ∈ [n], let y_i = Σ_{j=0}^{i} S(x_j). Note that y_i is the number of assignment triples (p_{q,j}, x_{q,j}, v_{q,j}) with 0 < j ≤ i needed to compute the action of the first i subprograms and that y_{n−1} = Σ_{q=0}^{ℓ−1} |α_q|.

3. Based on the prefix sums, we will compute a separation of the input variables into ℓ contiguous blocks such that, for each q ∈ [ℓ], it is feasible for reducer q to produce from the q-th block the input value assignments that it needs to contribute for the next step. This is nontrivial since the number of input assignments must not exceed O(d) due to the memory limitation of reducer q. A separation of the input variables {x_0, . . . , x_{n−1}} is a list of ℓ − 1 split values σ_1, . . . , σ_{ℓ−1} such that we have ℓ ordered, contiguous blocks {x_0, . . . , x_{σ_1}}, {x_{σ_1+1}, . . . , x_{σ_2}}, . . . , {x_{σ_{ℓ−1}+1}, . . . , x_{n−1}}. For notational convenience, we let σ_0 = −1 and σ_ℓ = n − 1. Let σ_q = max{j ∈ [n] | y_j ≤ qd} for q ∈ {1, . . . , ℓ − 1}. Using these split values, each reducer q can provide all value assignments needed for the computation of all subprograms in the next step without violating the memory limitations.

4. We compute α_q for q ∈ [ℓ] by using w-PBP, the input assignment α, and the split values.

◮ Lemma 5.
Calculating S(x_i) is in DMRC^0. That is, for each i ∈ [n], S(x_i) is computable from w-PBP in a constant number of DMRC^0-rounds.
Proof.
For each q ∈ [ℓ], the subprogram w-PBP_q is stored in reducer q. The output of reducer q—which will be the input to compute S(x_i)—is ⟨q; (q, x_{q,1})⟩, . . . , ⟨q; (q, x_{q,k_q})⟩, with the variables x_{q,1}, . . . , x_{q,k_q} appearing in the subprogram w-PBP_q and k_q ∈ O(d). The total number of inputs used to compute S(x_i) is therefore at most dℓ ∈ O(N). We use a Sum-CRCW-PRAM, whose concurrent writes to a single memory register are resolved by summing up all values being written to the same register simultaneously; see [10]. We use at most dℓ processors, P_{q,1}, . . . , P_{q,k_q} for each q ∈ [ℓ], and registers R_0, . . . , R_{n−1}, and let every processor P_{q,j} add 1 to the register of its variable x_{q,j} concurrently. Thus we see that computing S(x_i) is possible in constant time on a Sum-CRCW-PRAM and therefore, by Corollary 2, in DMRC^0. ◭
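In code, this Sum-CRCW computation amounts to nothing more than concurrent increments that the machine resolves by summation; a minimal sequential stand-in (our naming):

```python
def compute_s(subprograms, n):
    """Sum-CRCW sketch for Lemma 5: for each subprogram w-PBP_q, one processor per
    variable occurring in it adds 1 to that variable's register; the Sum-CRCW
    resolves the concurrent writes by summation. Afterwards registers[i] == S(x_i)."""
    registers = [0] * n                           # R_0, ..., R_{n-1}
    for block in subprograms:                     # block q lists lines (i, f, g)
        for i in set(line[0] for line in block):  # each variable counted once per block
            registers[i] += 1
    return registers
```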
◮ Lemma 6. Computing the prefix-sums of S(x_i) is in DMRC^0.

Proof.
The input is given as ⟨i; (S(x_i), i)⟩ for i ∈ [n]. We compute the prefix-sums y_i of S(x_i) for all i ∈ [n] in three rounds that can be summarized as follows: Each reducer q, for q ∈ [ℓ], determines its local prefix-sums; that is, it computes the d prefix-sums y^local_{dq}, . . . , y^local_{d(q+1)−1} of the d numbers S(x_{dq}), . . . , S(x_{d(q+1)−1}). A single reducer computes the prefix-sums z_0, z_1, . . . , z_{ℓ−1} of y^local_{d−1}, y^local_{2d−1}, . . . , y^local_{ℓd−1}, which are known from the first round. For every q ∈ [ℓ − 1], it sends z_q to reducer q + 1. Each reducer q + 1 with q ∈ [ℓ − 1] computes y_{d(q+1)+j} = y^local_{d(q+1)+j} + z_q for each j ∈ [d].

We now describe the three rounds in more detail at the level of key-value pairs. By defining the map function µ_1(⟨i; (S(x_i), i)⟩) = ⟨i ÷ d; (S(x_i), i)⟩, each reducer q, for q ∈ [ℓ], receives S(x_{dq}), . . . , S(x_{d(q+1)−1}) together with the correct indices. Thus we can compute in reducer q all local prefix-sums y^local_{dq}, . . . , y^local_{d(q+1)−1} of these numbers. The output of reducer q consists of the local prefix-sums in the format ⟨q; (p-sum, q, j, y^local_{q,j})⟩ for j ∈ [d] and the last of each group of local prefix-sums in the format ⟨q; (last, y^local_{d(q+1)−1})⟩, where p-sum = 0 and last = 1 is a simple binary identifier.

By defining the map function µ_2(⟨q; (last, y^local_{d(q+1)−1})⟩) = ⟨0; (last, y^local_{d(q+1)−1})⟩, all last parts of the local prefix-sums can be gathered in reducer 0. Thus, the prefix-sums z_0, z_1, . . . , z_{ℓ−1} of y^local_{d−1}, . . . , y^local_{dℓ−1} can be computed in it, and the output of the reducer is ⟨0; (last, q + 1, z_q)⟩ for every q ∈ [ℓ − 1]. All other pairs—that is, the pairs ⟨q; (p-sum, q, j, y^local_{q,j})⟩—are passed on unaltered.

The input of the third round consists of the output pairs ⟨q; (p-sum, q, j, y^local_{q,j})⟩ for all j ∈ [d] and q ∈ [ℓ] passed on from the first round and the pairs ⟨0; (last, q + 1, z_q)⟩ for all q ∈ [ℓ − 1] from the second round. Defining the map function as µ_3(⟨q; (p-sum, q, j, y^local_{q,j})⟩) = ⟨q; (p-sum, q, j, y^local_{q,j})⟩ and µ_3(⟨0; (last, q + 1, z_q)⟩) = ⟨q + 1; (last, q + 1, z_q)⟩, we can, for each j ∈ [d] and each q ∈ {1, . . . , ℓ − 1}, compute y_{dq+j} = y^local_{dq+j} + z_{q−1} in reducer q. The memory limitations of the mappers and reducers are clearly respected. ◭
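The three rounds can be traced in the following self-contained sketch, which operates on plain lists instead of key-value pairs (names and the final assertion are ours):

```python
def three_round_prefix_sums(s, d):
    """Sketch of the three DMRC rounds of Lemma 6 computing y_i = s_0 + ... + s_i.
    Block q of size d is handled by 'reducer' q; the z_q are the block offsets."""
    blocks = [s[q * d:(q + 1) * d] for q in range((len(s) + d - 1) // d)]
    # Round 1: each reducer q computes its local prefix-sums.
    local = []
    for block in blocks:
        acc, loc = 0, []
        for value in block:
            acc += value
            loc.append(acc)
        local.append(loc)
    # Round 2: a single reducer prefix-sums the block totals (the 'last' values).
    z, acc = [], 0
    for loc in local:
        acc += loc[-1]
        z.append(acc)                      # z_q = total of blocks 0..q
    # Round 3: reducer q+1 adds the offset z_q to each of its local prefix-sums.
    y = list(local[0])
    for q in range(1, len(blocks)):
        y.extend(v + z[q - 1] for v in local[q])
    return y

assert three_round_prefix_sums([1, 2, 3, 4, 5], 2) == [1, 3, 6, 10, 15]
```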
◮ Lemma 7. Each of the split values σ_1, . . . , σ_{ℓ−1} can be computed in one reducer, with the required prefix-sums being made available in one more DMRC^0-round.
Figure 1 Separation of the input variables x_0, . . . , x_{n−1} into ℓ blocks for the ℓ reducers, in dependence of the values of y_i.

Proof.
If there is a k ∈ [ℓ − 1] such that y_{n−1} ≤ kd, then it is clear from the definition σ_q = max{j ∈ [n] | y_j ≤ qd} of the split values that σ_k = σ_{k+1} = . . . = σ_{ℓ−1}. We can therefore assume that y_{n−1} > (ℓ − 1)d and characterize, for each q ∈ {1, . . . , ℓ − 1}, the split value σ_q as the unique integer satisfying y_{σ_q} ≤ qd and qd < y_{σ_q+1}; see Figure 1. This characterization is well defined since 0 < S(x_i) ≤ ℓ < d for each i ∈ [n] and y_{n−1} ≤ dℓ ∈ O(N_O). For each q ∈ [ℓ], in order to determine the split value σ_q, it is therefore sufficient to have available in the respective reducer a sequence of consecutive prefix-sums such that the first one is at most qd and the last one is greater than qd. This condition is satisfied if reducer q has the d + 2 consecutive prefix-sums y_{qd−1}, y_{qd}, . . . , y_{(q+1)d−1}, y_{(q+1)d} available. (For the first and the last reducer, the d + 1 prefix-sums y_0, . . . , y_{d−1}, y_d and y_{(ℓ−1)d−1}, y_{(ℓ−1)d}, . . . , y_{ℓd−1}, respectively, will suffice.) Slightly extending the sequence of available prefix-sums in each reducer by copying the overlapping prefix-sums from another reducer thus enables us to compute all split values in the ℓ reducers. Since for each q ∈ [ℓ], there are the d prefix-sums y_{qd}, . . . , y_{(q+1)d−1} in reducer q, each reducer can have the d + 2 prefix-sums made available after one more round by having each neighboring reducer copy one more prefix-sum into it. We have σ_0 = −1 and σ_ℓ = n − 1; it is thus immediately verified that, for every q ∈ [ℓ], the total number of subprograms in which input variables between x_{σ_q+1} and x_{σ_{q+1}} appear is at most 2d, showing that all the memory restrictions on the reducers are observed. ◭

◮ Lemma 8.
Given w-PBP, α, and the split values σ_0, . . . , σ_ℓ, we can, for each q ∈ [ℓ], compute α_q in a constant number of DMRC^0-rounds.
Proof.
We can assume that, for each κ ∈ [ℓ], the reducer κ has the subprogram w-PBP_κ, the κ-th block of input assignments {(x_j, v_j) | κd ≤ j ≤ (κ + 1)d − 1}, and the split values σ_0, . . . , σ_ℓ available. The output of reducer κ then consists of the following:
- ⟨κ; (q, p, x_{i_p}, f_p, g_p)⟩ for each line (p, x_{i_p}, f_p, g_p) in w-PBP_κ, where σ_q + 1 ≤ i_p ≤ σ_{q+1}.
- ⟨κ; (q, x_j, v_j)⟩ for each value assignment (x_j, v_j) with σ_q + 1 ≤ j ≤ σ_{q+1}.

For any κ ∈ [ℓ], we need to bound the total number of outputs with key κ from above. From the definition of the split values we see that this number is in O(d) since it is bounded by the number of lines, which is at most 2d, plus the number of assignments, which is at most d. Naturally, the map function µ of the next round is defined by µ(⟨κ; (q, p, x_{i_p}, f_p, g_p)⟩) = ⟨q; (p, x_{i_p}, f_p, g_p)⟩ and µ(⟨κ; (q, x_j, v_j)⟩) = ⟨q; (x_j, v_j)⟩.

The assignments α_q can then be computed by the subsequent reduce function using the key-value pairs produced above. For each q ∈ [ℓ], the reducer q now has available the lines of w-PBP and the value assignments for the input variables between x_{σ_q+1} and x_{σ_{q+1}}. It can therefore go through all the program lines and determine, on the one hand, which value assignments they require and, on the other hand, to which subprogram they belong. The required assignment information is then sent to the respective reducers by outputting ⟨q; (p ÷ d, p, x_{i_p}, v_{i_p})⟩. ◭

We finally obtain the desired inclusion by applying Theorem 3 and Lemmas 4 through 8.

◮ Theorem 9.
We have NC^1 ⊆ DMRC^0.

NC^{i+1} ⊆ DMRC^i for All i ≥ 1

For the higher levels in the hierarchy of Nick's class, we show how to simulate the involved circuits directly. We begin with a short outline of the proof. Let C_n = (V_n, E_n) be an NC^{i+1} circuit with an input of size n, given as a set of nodes and a set of directed edges, together with an input assignment α. The total size of C_n in bits is N_O, the total size of the input assignment in bits is N_I, and N = N_O + N_I. Note that size(C_n) is polynomial in n and depth(C_n) ∈ O(log^{i+1} n). We will take the following steps to simulate the circuit C_n with deterministic MapReduce computations:

1. We compute the level of each node in C_n.
2. The nodes and edges are sorted by their level.
3. Both the circuit C_n and the input assignment α are divided equally among the reducers.
4. We split the circuit into subcircuits computable in a constant number of rounds.
5. A custom communication scheme collects and constructs the complete subcircuits.
6. The entire circuit is evaluated via evaluation of the subcircuits.

Note that the equal division of C_n in the third step is very different from the split in the fourth one, where the parts may differ radically in size. Great care must be taken so as not to violate any of the memory and time restrictions, necessitating two unlike partitions. The subsequent steps then need to mediate between these dissimilar divisions. We will show that the steps (1) to (6) can be computed in O(log n), O(1), O(1), O(1), O(log n), and O(depth(C_n)/log n) rounds, respectively, yielding the desired theorem.

◮ Theorem 10.
We have NC^{i+1} ⊆ DMRC^i for all i ∈ N^+ and all 0 < ε < 1/2.

We begin by showing how to compute the level of each node in the circuit in O(log n) DMRC-rounds by simulating a CRCW-PRAM algorithm. (We mention in passing that this step requires more than a constant number of rounds, which prevents us from obtaining the result for NC^1 ⊆ DMRC^0 by simulating the circuits directly; the separate approach from Subsection 3.2 via Barrington's theorem is thus required for this case.)

In [18], an algorithm is presented that computes the levels of all nodes in a directed acyclic graph on a CREW-PRAM with O(n + m) processors in O(log m) time, where n and m are the numbers of nodes and edges in the graph, respectively. The first stage of this algorithm relies partly on the computation of prefix-sums, which can be computed much more efficiently when switching to a CRCW-PRAM, as we will show below. A straightforward adaptation of the analysis in [18], taking into account the maximum in-degree and out-degree and separating out the computation of prefix-sums, yields the following result.

◮ Lemma 11.
Let G = (V, E) be a directed acyclic graph with n nodes, m edges, maximum in-degree d_in, and maximum out-degree d_out. The level of each node in G can then be computed on a CRCW-PRAM with P ∈ O(m + P_P-Sum(O(m))) processors and time T ∈ O((log m) · (T_P-Sum(O(m)) + log max{d_in, d_out})), where P_P-Sum(q) and T_P-Sum(q) denote, respectively, the number of processors and the computation time to compute the prefix-sums of q numbers on a CRCW-PRAM.

In the following lemma, we aim to lower the time and memory requirements for computing prefix-sums on a CRCW-PRAM as far as possible.

◮ Lemma 12.
The prefix-sums of q numbers can be computed on a CRCW-PRAM with P ∈ O(q log q) processors and memory M ∈ O(q) in constant time.

Proof.
We use a Sum-CRCW-PRAM, where concurrent writes to the same memory register are resolved by adding up all simultaneously assigned numbers [10]. Let q numbers x_0, x_1, . . . , x_{q−1} be given as input. Without loss of generality, we assume q to be a power of 2 and calculate s_i(j) = Σ_{j·2^i ≤ p < (j+1)·2^i} x_p for all i ∈ [1 + log q] and all j ∈ [q/2^i]; see Figure 2 for an illustrating example. Since each of the q/2^i elements in s_i is the sum of 2^i elements, we can—by allocating q processors for each i ∈ [1 + log q]—compute every s_i(j) on a Sum-CRCW-PRAM with O(q log q) processors and O(1) time.

We now describe how the prefix-sums y(0), y(1), . . . , y(q − 1) are computed from the s_i(j). Assume first that j + 1 is a power of 2, that is, j + 1 = 2^p. Then we have y(j) = s_p(0), so the value has already been computed. If j + 1 = 2^p + 1 for some p, then we have y(j) = s_p(0) + s_0(2^p), so we need to add two summands. In general, y(j) can be calculated as the sum of at most 1 + log q block sums: Let a^j_{log q} a^j_{(log q)−1} . . . a^j_0 be the binary representation of j + 1. Now, we can see that

y(j) = s_{log q}(0) · a^j_{log q} + s_{(log q)−1}((j + 1 − 2^{(log q)−1}) ÷ 2^{(log q)−1}) · a^j_{(log q)−1} + . . . + s_1((j + 1 − 2) ÷ 2) · a^j_1 + s_0((j + 1 − 1) ÷ 1) · a^j_0;

Figure 2 Calculation of the prefix-sums s_i(j) = Σ_{p ∈ [(j+1)2^i] \ [j·2^i]} x_p for every i ∈ [1 + log q] and j ∈ [q/2^i] for the example of q = 8.

that is, y(j) can be computed as the sum of all s_p((j + 1 − 2^p) ÷ 2^p) such that a^j_p = 1. Thus, it is sufficient to supply a maximum of 1 + log q summands to compute each y(j) in a second time step, and the prefix-sums can be computed on a Sum-CRCW-PRAM with O(q log q) processors in constant time. ◭
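A hedged sketch of this construction (our naming; sequential loops stand in for the two parallel time steps of the PRAM):

```python
def dyadic_prefix_sums(x):
    """Sketch of Lemma 12: the dyadic block sums s_i(j) are all computable in one
    parallel sum-write step on a Sum-CRCW, and every prefix-sum y(j) is then the
    sum of at most 1 + log q block sums, selected by the binary representation
    of j + 1. Assumes len(x) is a power of 2."""
    q = len(x)
    log_q = q.bit_length() - 1
    # s[i][j] = x[j*2^i] + ... + x[(j+1)*2^i - 1]
    s = [[sum(x[j * 2**i:(j + 1) * 2**i]) for j in range(q // 2**i)]
         for i in range(log_q + 1)]
    y = []
    for j in range(q):
        total = 0
        for p in range(log_q + 1):
            if (j + 1) >> p & 1:                       # bit a^j_p of j + 1
                total += s[p][(j + 1 - 2**p) // 2**p]  # the p-th selected block
        y.append(total)
    return y

assert dyadic_prefix_sums([1, 2, 3, 4, 5, 6, 7, 8]) == [1, 3, 6, 10, 15, 21, 28, 36]
```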
We plug the result of Lemma 12 into Lemma 11 and then apply it to the graph C_n. Since its in-degrees and out-degrees are bounded by a constant ∆, we have m ≤ ∆n ∈ O(n). Hence we can compute the levels of the nodes of C_n on a CRCW-PRAM with P ∈ O(N log N) processors in time T ∈ O(log n). By Corollary 2, we obtain the following result.

◮ Lemma 13. Computing the levels of all nodes in C_n is in DMRC^1.

Proof.
From Lemmas 11 and 12 we know that the level of each node in C_n can be computed in T ∈ O(log n) time on a Sum-CRCW-PRAM with P ∈ O(N + N log N) processors. Now, Corollary 2 yields a MapReduce simulation of this Sum-CRCW-PRAM. We need to check that the conditions of Corollary 2 are indeed all satisfied: From T ∈ O(log n), P ∈ O(N + N log N), and M ∈ O(N) follows M + P ∈ O(N log N) and log_{N^{1−ε}}(M + P) ∈ O(1); hence we have (M + P) log_{N^{1−ε}}(M + P) ∈ O(N^{2(1−ε)}). Thus, the level of each node in C_n can be computed in O(log n) DMRC-rounds. ◭

Once the levels of all nodes are computed, each node in the circuit can be represented as (level(x_i), x_i). Recall that the depth of C_n is just the maximum level. Since depth(C_n) ∈ O(log^k n) for some k ∈ N^+ and the number of nodes is bounded by the number of edges size(C_n) ∈ O(N), we can encode each pair (level(x_i), x_i) by appending to a bit string of length log(c_1 log^k n) another one of length log(c_2 N), for appropriate constants c_1 and c_2, which results in a bit string of length log(cN log^k n) for c = c_1 c_2 ∈ N. This enables us to identify each pair (level(x_i), x_i) with a different bit string, which can be interpreted as an integer bounded by cN log^k n. We call this integer the sorting index of node x_i. Crucially, we chose the bit string to start with the encoding of the level. Sorting the sorting indices thus means to sort the nodes of C_n by their level. The following lemma shows how prefix-sums can be used to perform such a sort so efficiently on a CRCW-PRAM that we can apply Corollary 2 to simulate it in a constant number of DMRC-rounds.
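Before turning to the lemma, the sorting-index encoding can be pictured as packing the level into the high-order bits of a single integer; a small illustrative sketch (names and bit widths are ours):

```python
def sorting_index(level, node_id, id_bits):
    """Encode the pair (level(v), v) as one integer whose high-order bits hold
    the level; sorting these integers therefore sorts the nodes by level."""
    return (level << id_bits) | node_id

def decode(index, id_bits):
    return index >> id_bits, index & ((1 << id_bits) - 1)

# Nodes (level, id) end up sorted by level first, then id:
nodes = [(2, 0), (0, 3), (1, 7), (0, 1)]
ranked = sorted(sorting_index(lv, i, id_bits=8) for lv, i in nodes)
assert [decode(s, 8) for s in ranked] == [(0, 1), (0, 3), (1, 7), (2, 0)]
```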
◮ Lemma 14. A CRCW-PRAM with P ∈ O(D log D) processors and memory M ∈ O(D) can sort any subset I ⊆ {1, . . . , D} of integers in constant time.

Proof.
Recall that we use a Sum-CRCW-PRAM that sums up concurrent writes. Assume that the input and output are stored in the arrays x[0], . . . , x[p − 1] and y[0], . . . , y[p − 1], respectively, and that we have two auxiliary arrays z[0], . . . , z[D] and ẑ[0], . . . , ẑ[D] of size D + 1. The algorithm works in four steps:

1. Initialize z by using D + 1 ≤ P processors to set z[k] ← 0 for all k ∈ [D + 1].
2. Use p ≤ P processors in parallel to set z[x[k]] ← 1 for all k ∈ [p].
3. Compute the prefix-sums of the array z and save them into ẑ.
4. Use D processors to set, for all k ∈ {1, . . . , D} in parallel, y[ẑ[k]] ← k if and only if ẑ[k] = ẑ[k − 1] + 1.

Since the prefix-sums of D + 1 numbers can be computed by the Sum-CRCW-PRAM with P ∈ O(D log D) processors and memory M ∈ O(D) in constant time by Lemma 12, the above algorithm stays within these bounds as well.

We now prove that this algorithm is correct. First we observe that after step 2, for every k ∈ {1, . . . , D}, we have z[k] = 1 if and only if one of the p integers to be sorted is k. Because ẑ contains the prefix-sums of z, the value stored in ẑ[k] hence tells us how many of the p integers in x are at most k. (Note that accordingly we always have z[0] = ẑ[0] = 0.) Thus k is one of the integers in x if and only if ẑ[k] = ẑ[k − 1] + 1; otherwise, we have ẑ[k] = ẑ[k − 1]. Hence ẑ contains exactly the values 0 through p in non-decreasing order, that is, 0 = ẑ[0] ≤ ẑ[1] ≤ · · · ≤ ẑ[D − 1] ≤ ẑ[D] = p. Stepping through ẑ from start to end, that is, from k = 0 to k = D, we therefore observe an increment of 1 from ẑ[k − 1] to ẑ[k] exactly if k is one of the integers to be sorted. This means that in step 4 the integers contained in x are detected from left to right in ascending order and subsequently stored into y in the same order. ◭
Combining Lemma 14 and Corollary 2, a careful analysis yields the following result for any ε < 1/2.

◮ Corollary 15. Let c ∈ N and 0 < ε < 1/2. Any set of distinct integers from {0, . . . , ⌈cN log^k n⌉} can be sorted in a constant number of DMRC-rounds.
Proof.
We apply Lemma 14 with D ∈ O(N log^k n). We have D ∈ O(N log^k N) ⊆ O(N^{1+ζ}) and thus also D log D ∈ O(N^{1+ζ}) for any constant ζ > 0. Choose any ζ < 1 − 2ε, which is possible for ε < 1/2. The sorting is then possible on a CRCW-PRAM with O(N^{1+ζ}) processors and O(N^{1+ζ}) memory in constant time. By Corollary 2, this CRCW-PRAM can be simulated in a constant number of DMRC-rounds because log_{N^{1−ε}}(N^{1+ζ}) = (1 + ζ)/(1 − ε) ∈ O(1) and O(N^{1+ζ}) ⊆ O(N^{2(1−ε)}). ◭

Once all the nodes are sorted by their sorting index (and therefore implicitly by their level), we can enumerate them in ascending order using a sorting index j; that is, we store each node as the key-value pair ⟨j; (level(v), v)⟩. Clearly, we obtain an analogous representation of the edges in the form ⟨i; ((j, (level(v), v)), (j′, (level(v′), v′)))⟩, which will prove useful later on.

As we have already seen when discussing the branching programs, an assignment α to input variables X = {x_0, x_1, . . . , x_{n−1}} can be represented as a set {⟨i; (x_i, v_i)⟩ | i ∈ [n]} of key-value pairs, where α(x_i) = v_i ∈ {0, 1}. The circuit C_n is now divided into ℓ = N_O^ε subsets of edges according to the sorting indices, and input values are assigned to each subset as in the case of branching programs. For every q ∈ [ℓ], let C_n^q = {((j, level(v), v), (j′, level(v′), v′)) | qd ≤ j ≤ (q + 1)d − 1}, where d = N_O^{1−ε}, be the q-th subset. Note that |C_n^q| ∈ O(d). For every q ∈ [ℓ], the set of variables appearing in C_n^q is denoted as X_q and the assignment α_q to X_q is represented as {⟨j; (x_{q,j}, v_{q,j})⟩ | j ∈ [|α_q|]}, where x_{q,j} is the j-th variable that appears as an input in C_n^q, and v_{q,j} is its assignment value. Just as seen in Lemma 8 for the case of a branching program, we can now compute α_q from C_n and α for all q ∈ [ℓ], yielding the following lemma.

◮ Lemma 16.
Computing α_q from C_n and α is in DMRC^0 for every q ∈ [ℓ].

We can therefore assume that each input node is represented by ⟨j; (level(x_{i_j}), x_{i_j}, v_{i_j})⟩, a key-value pair that is computed from C_n^q and α_q for q ∈ [ℓ] in a single DMRC-round.
We divide C_n = (V_n, E_n) into as few subcircuits as possible such that the simulation of each subcircuit is in DMRC^0 and we can evaluate C_n by evaluating the subcircuits sequentially. Given v ∈ V_n and δ ∈ N, we define the v-down-circuit C^down_δ(v) = (V^down_δ(v), E^down_δ(v)) of depth δ to be the subcircuit of C_n induced by V^down_δ(v) = {u | level(v) ≤ level(u) ≤ level(v) + δ, u →* v}, where u →* v means that there is a directed path of any length (including 0) from u to v in C_n. The v-up-circuit C^up_δ(v) = (V^up_δ(v), E^up_δ(v)) of depth δ is analogously the subcircuit of C_n induced by V^up_δ(v) = {u | level(v) − δ ≤ level(u) ≤ level(v), v →* u}.

When dividing C_n into subcircuits we have two conflicting goals. On the one hand, we want as few of them as possible, which implies that they have to be of great depth. On the other hand, we need to simulate them in MapReduce without exceeding the memory bounds. A depth in O(log n) turns out to be the right choice. Let s = γ(log n)/log ∆, where ∆ ≥ 2 bounds the fan-in and fan-out of C_n and γ is an arbitrary constant satisfying 0 < γ < 1 − ε. Since a circuit of depth s and maximum degree bounded by a constant ∆ contains at most Σ_{i=1}^{s} ∆^i edges, the size of these subcircuits is in O(∆^s) = O(n^γ) ⊆ O(N^γ). Hence each reducer may contain up to N^{1−ε}/N^γ such subcircuits without exceeding the memory constraint of O(N^{1−ε}); see Figure 3. We denote this number of allowed subcircuits per reducer by β = N^{1−ε−γ}.

Figure 3 The up-circuits and down-circuits constructed in reducer q, comprising up to β edges.

For each i ∈ [⌈depth(C_n)/s⌉ + 1], we define L_i = i · s. For every node v on level L_i—that is, with level(v) = L_i—we call the v-down-circuit (v-up-circuit, resp.) of depth s an L_i-down-circuit (L_i-up-circuit, resp.). We will construct in each reducer the v-down-circuits and v-up-circuits of depth 1 of all its nodes. From those we then construct all L_i-down-circuits and L_i-up-circuits for every i. Note that we can evaluate all L_i-down-circuits if the values of the nodes of level L_{i+1} are given. The values of the nodes v of level L_{i+1} that are necessary to compute the L_i-up-circuits are then known from the L_{i+1}-down-circuits.

When the circuit C_n is divided into L_i-down-circuits, there may exist edges of C_n that are not contained in any L_i-down-circuit. If an edge ((j_u, level(u), u), (j_v, level(v), v)) satisfies L_{i_u} ≤ level(u) ≤ L_{i_u+1} and L_{i_v} ≤ level(v) ≤ L_{i_v+1} for i_u ≠ i_v, then this edge is not included in any L_{i_u}-down-circuit nor any L_{i_v}-down-circuit. We call such edges level-jumping edges; see Figure 4 for an example. We would like to replace every level-jumping edge (u, v) by a path from u to v that consists only of edges that will be part of the respective L_i-down-circuits and L_i-up-circuits in the resulting, augmented circuit. The following lemma states that this is possible without increasing the size by too much.
Figure 4 Two jumping edges on the left and their resolving division on the right.

◮ Lemma 17.
We can subdivide the jumping edges in C_n in a way that renders the subcircuit-wise evaluation possible without increasing the size beyond O(N).

Proof.
Let ((j_u, level(u), u), (j_v, level(v), v)) be a jumping edge, where L_{i_u} ≤ level(u) ≤ L_{i_u+1}, L_{i_v} ≤ level(v) ≤ L_{i_v+1}, and i_u < i_v. If i_u = i_v − 1, then this edge is divided into two edges ((j_u, level(u), u), dummy) and (dummy, (j_v, level(v), v)), introducing a new node dummy of the identity kind with level(dummy) = L_{i_v}. If i_u ≤ i_v − 2, then this edge is divided into three edges ((j_u, level(u), u), dummy_1), (dummy_1, dummy_2), and (dummy_2, (j_v, level(v), v)), introducing two new nodes with level(dummy_1) = L_{i_u+1} and level(dummy_2) = L_{i_v}. Having divided the jumping edges in this way, the newly created edges are all part of some dummy-down-circuit or dummy-up-circuit, except for edges of the form (dummy_1, dummy_2). Note that we cannot further subdivide the edges of the form (dummy_1, dummy_2) because we would exceed the size limit on the circuit otherwise. The most convenient way to deal with this is to adjust our definition of down-circuits and up-circuits such that every edge of the form (dummy_1, dummy_2) is considered to be both a dummy_1-down-circuit and a dummy_2-up-circuit on its own. This way, every edge in the augmented circuit is included in some down-circuit or up-circuit. Note that this augmentation can be performed in a single round and that the size of the augmented circuit is in O(N). In what follows, we consider C_n to be this augmented circuit. ◭
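A hedged sketch of this augmentation (our naming and a simplified block assignment via the boundary list; in the actual circuit the fresh dummy nodes would be identity gates, and the edge orientation follows the lemma's convention block(u) ≤ block(v)):

```python
import bisect
import itertools

def subdivide_jumping_edges(edges, level, L):
    """Sketch of the Lemma 17 augmentation. edges: pairs (u, v) with
    block(u) <= block(v); level: dict mapping nodes to their levels;
    L: the block boundary levels L_0 < L_1 < ... . Jumping edges receive
    one or two fresh dummy nodes placed on boundary levels."""
    block = lambda v: bisect.bisect_right(L, level[v]) - 1  # i with L_i <= level(v)
    fresh = itertools.count()
    out = []
    for u, v in edges:
        iu, iv = block(u), block(v)
        if iu == iv:                                  # not level-jumping: keep
            out.append((u, v))
        elif iu == iv - 1:                            # one dummy on level L_{i_v}
            d = ('dummy', next(fresh))
            level[d] = L[iv]
            out += [(u, d), (d, v)]
        else:                                         # two dummies, as in the proof
            d1, d2 = ('dummy', next(fresh)), ('dummy', next(fresh))
            level[d1], level[d2] = L[iu + 1], L[iv]
            out += [(u, d1), (d1, d2), (d2, v)]
    return out
```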
Having described the subcircuits on which the evaluation of the entire circuit will be based, we now need to show how to split and construct them in the ℓ different reducers. In each reducer, we start with the nodes v contained in it that satisfy level(v) = L_i for any i and the associated v-down-circuits and v-up-circuits of depth 1. We then iteratively increase the depth one by one, until the full L_i-down-circuits and L_i-up-circuits of depth up to s are constructed. Note that the nodes of any level L_i and their corresponding circuits may be scattered across multiple reducers since the edges were split equally among them according to their sorting index and not depending on the level. We therefore need to carefully implement a communication scheme that allows each reducer to encode requests for missing edges required in the construction, which are then delivered to them in multiple rounds, without exceeding any of the memory or time bounds. Taking care of all these details, we obtain the following lemma.

◮ Lemma 18. Given C_n, all L_i-down-circuits and L_i-up-circuits can be constructed in O(log n) DMRC-rounds whenever 0 < ε < 1/2.

Proof.
In the first round, the map function µ is defined such that each reducer q is assigned(via the choice of the key) β nodes of the form h j ; (level( v ) , v ) i and directed edges adjacentto these nodes. Note that one edge can thus be assigned to two different reducers, once asan outgoing, once as in in-going edge. Specifically, we define µ ( h j ; (level( v ) , v ) i ) = {h j ÷ β ; ( j, level( v ) , v ) i} for the key-value pairs representing nodes and µ ( h i ; ( ( j, level( v ) , v ) , ( j ′ , level( v ′ ) , v ′ ) ) i ) = { h j ÷ β ; (( j, level( v ) , v ) , ( j ′ , level( v ′ ) , v ′ )) i , h j ′ ÷ β ; (( j, level( v ) , v ) , ( j ′ , level( v ′ ) , v ′ )) i } for the key-value pairs representing edges.In the subsequent execution of ρ , each reducer can therefore directly construct the v -up-circuits and v -down-circuits of depth 1 for its β assigned nodes. We will now describehow some of these initial circuits, namely those on levels L i for any i ∈ [ r ], can be usedto extended to full L i -up-circuits and L i -down-circuits by iteratively increasing the circuitdepth one by one in the following way:Let v be a node with level( v ) = L i in reducer q for any i ∈ [ r ] and q ∈ [ ℓ ]. We want toextend C down1 ( v ) and C up1 ( v ) to C down2 ( v ) and C up2 ( v ), respectively. Let u in ( u out , resp.) beany node of in-degree (out-degree, resp.) 0 in it, that is, any node that potentially needsto be extended by one or multiple edges. These extending edges are not necessarily avail-able in reducer q , however. We need to find out which reducer stores them—if there areany—and then request these edges from it in some way. To determine the right reducer,we make use of the sorting index stored alongside each node, even when part of an edge.Any edge ( u in , v ) that we need to check for possible extensions is in fact represented as h q , ( ( j u in , level( u in ) , u in ) , ( j v , level( v ) , v ) ) i in reducer q . The number of the reducer con-taining the downward extending edges is now retrieved as to( u in ) = j u in ÷ β . Analogously,the upward extending edges for an edge ( v, u out ) are to be found in reducer to ( u out ), where T o ( u out ) = j u out ÷ β . We now know whom to ask for edges extending the subcircuit beyondnode u , namely reducer number to( u ). Let from( v ) = q denote the number of the reducersending the request, which we encode in form of the key-value pair h q ; ( u, to( u ) , from( v )) i . . Frei, K. Wada 19 Each reducer q does the above for every node with possible extending edges and also passesalong to the mapper all v -up-circuits and v -down-circuits constructed so far unaltered. Thisconcludes the first round.In the second round, the map function µ naturally re-assigns h q ; ( u, to( u ) , from( v )) i toreducer to( u ) , and returns the v -up-circuits and v -down-circuits to the reducers that sent them.Having received the edge request of the form of h to( u ); ( u, to( u ) , from( v )) i while executing ρ reducer to( u ) now sends all edges potentially useful to reducer from( v ) —that is, the entire u -up-circuit and the entire u -down-circuit of depth 1—to the next mapper in the form of apair (from( v ) , e ) for every edge containing node u . As before, all other circuits constructedso far get passed along without modification as well.In the third round, the map function µ routes the requested edges to the requestingreducer by generating the key-value pairs h from( v ); (from( v ) , e i . 
In the reducing step, which implements the same reduce function ρ as in the first round, reducer from(v) now finally has everything it needs to fully extend the v-up-circuits and v-down-circuits to depth 2.

Since performing the second and third rounds deepens the L_i-up-circuits and L_i-down-circuits by one level in the way just seen, the complete L_i-up-circuits and L_i-down-circuits can be constructed by repeating these two rounds s times.

It is again clear that the memory and I/O requirements of the reducers are all met in every round since the input size and output size are in O(d) for each reducer. Moreover, the total memory for storing the v-up-circuits and v-down-circuits in each reducer is in O(N^γ) because C_n has O(N) nodes. Since the constant γ was chosen such that 0 < γ ≤ 1 − ε, we have N^γ ∈ O(N^{1−ε}), and thus all up-circuits and down-circuits can be stored in the respective reducers. ◭

The main idea in the proof of the following lemma is to compute the evaluation values subcircuit-wise, starting with the deepest ones, and then iteratively moving up the circuit in depth(C_n)/s rounds, passing on the newly computed values to the right reducers, until the value of the unique output node is known.

◮ Lemma 19.
If all up-circuits and down-circuits are constructed in the proper reducers, C_n can be evaluated in O(depth(C_n)/log n) DMRC-rounds.
Proof.
Without loss of generality, let depth(C_n) be divisible by s and let r = depth(C_n)/s. Once all L_i-down-circuits and L_i-up-circuits for all i ∈ {1, . . . , r} have been constructed, we can evaluate C_n on the given input assignment. We begin by evaluating the L_{r−1}-down-circuits. Since every input node has its value assigned in a v-down-circuit, the L_{r−1}-down-circuits can be computed in the reducers containing these v-down-circuits. With the values of all nodes at level L_{r−1} determined, we can send the necessary values to the L_{r−2}-down-circuits and, in the case of edges that were divided using two dummy nodes, to lower-level down-circuits. Nodes at level L_{r−1} that are necessary to compute L_{r−2}-down-circuits are described in the L_{r−1}-up-circuits. Any node v at level L_{r−1} that is necessary to compute L_{r−2}-down-circuits is described in the v-up-circuit. Therefore, the output of the reducer q is as follows: Let v be at level L_{r−1} and let u_i, for i ∈ {1, . . . , k_v}, be the nodes at level L_{r−2} in the v-up-circuit. For each v in reducer q, it outputs (to(u_i), v, val(v)), where to(u_i) is the index of the reducer containing the u_i-down-circuit and val(v) is the value of v determined in the computation of the v-down-circuit. The reducer q also passes on all v-down-circuits and v-up-circuits contained in it.

In the next round, the map function sends each (to(u_i), v, val(v)) to the reducer containing the u_i-down-circuit; that is, it generates the key-value pair ⟨to(u_i); (v, val(v))⟩. Of course, the map function also passes along all v-down-circuits and v-up-circuits to the proper reducers. Since now each L_{r−2}-down-circuit is contained completely in a reducer that has gathered all values of nodes at level L_{r−1} necessary to compute this subcircuit, all L_{r−2}-down-circuits can be computed in their reducers. Now we can compute the values of nodes higher and higher up in the circuit, by iterating the last mapping-reducing function pair, until the value is finally known for the unique output node. As before, we clearly stay within the memory and I/O buffer limits of each reducer. ◭
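As a rough illustration of the level-by-level evaluation in the proof above, the following Python sketch shows the core of one such step: a reducer evaluates a down-circuit from the leaf values it has gathered and emits the result keyed by the reducer that holds the next down-circuit up. It is a minimal sketch under our own assumptions; the gate set, the data layout, and the helper to_reducer are illustrative and not definitions from the paper.

```python
# One evaluation step in the spirit of Lemma 19: evaluate a v-down-circuit from
# known leaf values, then forward val(v) to the reducer holding the next level.
# Gate set, data layout, and to_reducer are illustrative assumptions.

from typing import Dict, List, Tuple

Gate = Tuple[str, List[str]]      # (gate type, names of predecessor nodes)
DownCircuit = Dict[str, Gate]     # node name -> gate description

def evaluate_down_circuit(root: str, circuit: DownCircuit,
                          leaf_values: Dict[str, bool]) -> bool:
    """Evaluate the down-circuit rooted at `root`, given its leaf values."""
    memo = dict(leaf_values)

    def val(node: str) -> bool:
        if node in memo:
            return memo[node]
        gate, preds = circuit[node]
        inputs = [val(p) for p in preds]
        if gate == 'AND':
            memo[node] = all(inputs)
        elif gate == 'OR':
            memo[node] = any(inputs)
        else:                      # 'NOT' with a single predecessor
            memo[node] = not inputs[0]
        return memo[node]

    return val(root)

def to_reducer(j: int, beta: int) -> int:
    """Reducer index of the node with sorting index j, as in the construction."""
    return j // beta

# Usage: v computes AND(OR(x1, x2), NOT(x3)); its value is forwarded upward.
circuit = {'v': ('AND', ['a', 'b']), 'a': ('OR', ['x1', 'x2']), 'b': ('NOT', ['x3'])}
leaves = {'x1': False, 'x2': True, 'x3': False}
value = evaluate_down_circuit('v', circuit, leaves)
print(('to reducer', to_reducer(7, 4)), ('v', value))  # forwarded key-value pair
```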
In a substantial improvement over all previously known results, we have shown that NC^{i+1} ⊆ DMRC^i for all i ∈ ℕ. In the case of NC^1 ⊆ DMRC^0, we have proved this result for every feasible choice of ε in the model, that is, for 0 < ε ≤ 1/2. For i > 0, we have shown the result to hold for all but one value, namely ε = 1/2. For NC^1, which is particularly relevant in practice, we applied Barrington's theorem and simulated width-bounded branching programs [2], whereas we directly simulated the circuits for the higher levels of the hierarchy. We emphasize that neither of the two approaches can replace the other: Barrington's theorem only gives a characterization for the first level of the NC hierarchy, and the second approach does not even yield NC^1 ⊆ MRC^0. (Recall that DMRC^i is just the deterministic variant of MRC^i, so we have DMRC^i ⊆ MRC^i for all i ∈ ℕ.)

We would like to briefly address the small question that immediately arises from our result, namely whether it is possible to extend the inclusion NC^{i+1} ⊆ DMRC^i of Theorem 10 to the case ε = 1/2. Going through all involved lemmas, we see that the two reasons that our proof does not work in this corner case are the sorting of the nodes using Lemma 15 and the construction of the up-circuits and down-circuits in Lemma 18. Regarding the former, we can avoid the restriction by allowing randomization. For the latter, it is not clear that this can be achieved, however. If there were a way to construct the up-circuits and down-circuits for ε = 1/2 as well, the result would hold for the entire feasible range 0 < ε ≤ 1/2 of ε.

Besides dealing with the small issue mentioned above, the natural next step for future research is to take the complementary approach and address the reverse relationship: Having shown in this paper how to obtain efficient deterministic MapReduce algorithms for parallelizable problems, we now aim to include the largest possible subset of DMRC^i into NC^{i+1} for all i ∈ ℕ. By simply padding a P-complete language so as to include it in DMRC, Karloff et al. [11, Thm. 4.1] proved that DMRC ⊆ NC would imply P = NC, an equality generally deemed unlikely to hold. Thus we cannot expect to prove DMRC^i = NC^{i+1}, but try to determine DMRC^i ∩ NC^{i+1} in order to finally settle the long-standing open question of how exactly the MapReduce classes correspond to the classical classes of parallel computation.

References

[1] S. Arora and B. Barak, Computational Complexity: A Modern Approach, Cambridge University Press, 2009.
[2] D. A. Barrington, Bounded-Width Polynomial-Size Branching Programs Recognize Exactly Those Languages in NC^1, Journal of Computer and System Sciences (JCSS), vol. 38, 1989, 150–164.
[3] C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun, Map-Reduce for Machine Learning on Multicore, Advances in Neural Information Processing Systems (NIPS), 2006, 281–288.
[4] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, MIT Press, 1990.
[5] J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM (CACM), vol. 51, no. 1, 2008, 107–113.
[6] A. K. Farahat, A. Elgohary, A. Ghodsi, and M. S. Kamel, Distributed Column Subset Selection on MapReduce, International Conference on Data Mining (ICDM), 2013, 171–180.
[7] B. Fish, J. Kun, Á. D. Lelkes, L. Reyzin, and G. Turán, On the Computational Complexity of MapReduce, International Symposium on Distributed Computing (DISC), 2015, 1–15.
[8] J. Feldman, S. Muthukrishnan, A. Sidiropoulos, C. Stein, and Z. Svitkina, On Distributing Symmetric Streaming Computations, ACM Transactions on Algorithms, vol. 6, no. 4, 2010, 66:1–66:15.
[9] F. Frei and K. Wada, Efficient Circuit Simulation in MapReduce, 30th International Symposium on Algorithms and Computation (ISAAC), 2019, 52:1–52:21.
[10] M. Goodrich, N. Sitchinava, and Q. Zhang, Sorting, Searching, and Simulation in the MapReduce Framework, 22nd International Symposium on Algorithms and Computation (ISAAC), 2011, 374–383.
[11] H. Karloff, S. Suri, and S. Vassilvitskii, A Model of Computation for MapReduce, 21st ACM-SIAM Symposium on Discrete Algorithms (SODA), 2010, 938–948.
[12] S. Kamara and M. Raykova, Parallel Homomorphic Encryption, Financial Cryptography Workshops, 2013, 213–225.
[13] R. Kumar, B. Moseley, and S. Vassilvitskii, Fast Greedy Algorithms in MapReduce and Streaming, ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2013, 1–10.
[14] M. F. Pace, BSP vs MapReduce, 12th International Conference on Computational Science (ICCS), 2012, 246–255.
[15] A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, and E. Upfal, Space-Round Tradeoffs for MapReduce Computations, 26th ACM International Conference on Supercomputing (ICS), 2012, 235–244.
[16] T. Roughgarden, S. Vassilvitskii, and J. R. Wang, Shuffles and Circuits (On Lower Bounds for Modern Parallel Computation), Journal of the ACM (JACM), vol. 65, no. 6, 2018, 41:1–41:24.
[17] A. D. Sarma, F. N. Afrati, S. Salihoglu, and J. D. Ullman, Upper and Lower Bounds on the Cost of a Map-Reduce Computation, Proceedings of the VLDB Endowment (PVLDB), 2013, 277–288.
[18] A. Tada, M. Migita, and R. Nakamura, Parallel Topological Sorting Algorithm, Journal of the Information Processing Society of Japan (IPSJ), vol. 45, no. 4, 2004, 1102–1111.
[19] T. White, Hadoop: The Definitive Guide, 4th edition, O'Reilly, 2015.