[PDF] Optimal Sorting Circuits for Short Keys

Abstract

A long-standing open question in the algorithms and complexity literature is whether there exist sorting circuits of size o(n \log n). A recent work by Asharov, Lin, and Shi (SODA'21) showed that if the elements to be sorted have short keys whose length k = o(\log n), then one can indeed overcome the n\log n barrier for sorting circuits, by leveraging non-comparison-based techniques. More specifically, Asharov et al.~showed that there exist O(n) \cdot \min(k, \log n)-sized sorting circuits for k-bit keys, ignoring poly\log^* factors. Interestingly, the recent works by Farhadi et al. (STOC'19) and Asharov et al. (SODA'21) also showed that the above result is essentially optimal for every key length k, assuming that the famous Li-Li network coding conjecture holds. Note also that proving any {\it unconditional} super-linear circuit lower bound for a wide class of problems is beyond the reach of current techniques. Unfortunately, the approach taken by Asharov et al.~to achieve optimality in size somewhat crucially relies on sacrificing the depth: specifically, their circuit is super-{\it poly}logarithmic in depth even for 1-bit keys. Asharov et al.~phrase it as an open question how to achieve optimality both in size and depth. In this paper, we close this important gap in our understanding. We construct a sorting circuit of size O(n) \cdot \min(k, \log n) (ignoring poly\log^* terms) and depth O(\log n). To achieve this, our approach departs significantly from the prior works. Our result can be viewed as a generalization of the landmark result by Ajtai, Koml\'os, and Szemer\'edi (STOC'83), simultaneously in terms of size and depth. Specifically, for k = o(\log n), we achieve asymptotical improvements in size over the AKS sorting circuit, while preserving optimality in depth.

Full PDF

OOptimal Sorting Circuits for Short Keys *Wei-Kai LinCornell [email protected]

Elaine ShiCMU [email protected]

Abstract

A long-standing open question in the algorithms and complexity literature is whether there exist sort-ing circuits of size o ( n log n ) . A recent work by Asharov, Lin, and Shi (SODA’21) showed that if theelements to be sorted have short keys whose length k = o (log n ) , then one can indeed overcome the n log n barrier for sorting circuits, by leveraging non-comparison-based techniques. More speciﬁcally,Asharov et al. showed that there exist O ( n ) · min( k, log n ) -sized sorting circuits for k -bit keys, ignor-ing poly log ∗ factors. Interestingly, the recent works by Farhadi et al. (STOC’19) and Asharov et al.(SODA’21) also showed that the above result is essentially optimal for every key length k , assuming thatthe famous Li-Li network coding conjecture holds. Note also that proving any unconditional super-linearcircuit lower bound for a wide class of problems is beyond the reach of current techniques.Unfortunately, the approach taken by Asharov et al. to achieve optimality in size somewhat cruciallyrelies on sacriﬁcing the depth: speciﬁcally, their circuit is super- poly logarithmic in depth even for 1-bitkeys. Asharov et al. phrase it as an open question how to achieve optimality both in size and depth.In this paper, we close this important gap in our understanding. We construct a sorting circuit of size O ( n ) · min( k, log n ) (ignoring poly log ∗ terms) and depth O (log n ) . To achieve this, our approach departssigniﬁcantly from the prior works. Our result can be viewed as a generalization of the landmark resultby Ajtai, Koml´os, and Szemer´edi (STOC’83), simultaneously in terms of size and depth. Speciﬁcally,for k = o (log n ) , we achieve asymptotical improvements in size over the AKS sorting circuit, whilepreserving optimality in depth. * Author ordering is randomly generated. a r X i v : . [ c s . D S ] F e b Introduction

Sorting circuits have been investigated for a long time in the algorithms and complexity theory literature,and it is almost surprising that we still do not fully understand sorting circuits. Suppose we want to sort aninput array with n elements, each with a k -bit comparison key and a w -bit payload. A long-standing openquestion is whether there exist circuits with ( k + w ) · o ( n log n ) boolean gates where each gate is assumed tohave constant fan-in and constant fan-out [11]. The recent works of Farhadi et al. [14] (STOC’19) showedthat assuming the famous Li-Li network coding conjecture [26], it is impossible to construct sorting circuitsof size ( k + w ) · o ( n log n ) when there is no restriction on the key length k . Given this conditional lowerbound, we seem to have hit another wall. However, shortly afterwards, Asharov, Lin, and Shi [9] showedthat we can indeed overcome the n log n barrier when the keys are short, speciﬁcally, when k = o (log k ) .More speciﬁcally, Asharov et al. showed that an array containing n elements each with a k -bit key and a w -bit payload can be sorted in a circuit of size ( k + w ) · O ( n ) · min( k, log n ) (ignoring poly log ∗ terms);moreover, Asharov et al. [9] prove that this is optimal for every choice of k .Asharov et al. [9]’s result moved forward our understanding on sorting circuits, since it achieved asymp-totical improvements for short keys relative to the landmark result by Ajtai, Koml´os, and Szemer´edi [3](STOC’83), who constructed sorting circuits containing O ( n log n ) comparator -gates. As Asharov et al. [9]point out, an o ( n log n ) sorting circuit for short keys might have eluded the community earlier due to a cou-ple natural barriers. First, an o ( n log n ) sorting circuit is impossible in the comparator-based model evenfor -bit keys — this follows partly due to the famous 0-1 principle which was described in Knuth’s text-book [24]. Indeed, Asharov et al. [9] is the ﬁrst to show how to leverage non-comparison-based techniquesto achieve a non-trivial sorting result in the circuit model. Earlier, non-comparison-based sorting was inves-tigated in the Random Access Machine (RAM) model to achieve almost linear-time sorting [5,18,19,23,36]but it was unknown how non-comparison-based techniques can help in the circuit model. The second nat-ural barrier pertains to the stability of the sorting algorithm. Stability requires that elements with the samekey should preserve the same order as they appear in the input. Recent works [2, 27] have shown that an o ( n log n ) -sized stable sorting circuit is impossible even for -bit keys, if either the circuit does not per-form encoding or computation on the elements’ payloads (henceforth this is referred to as the indivisibility model), or if the Li-Li network coding conjecture [26] is true. Therefore, to achieve their result, Asharov etal. [9] had to forgo both the comparator-based restriction as well as the stability requirement.Despite the progress, Asharov et al. [9]’s result is nonetheless unsatisfying — to achieve optimal circuitsize, they give up on depth: their circuit is super- poly logarithmic in depth even for 1-bit keys. In fact, aswritten, the depth of their circuit is super-linear; but it is possible to leverage existing techniques [8] toimprove their depth to (log ∗ n ) log log n + O (1) which is still ω (log c n ) for any constant c . We are not awareof any known technique that can improve the depth to even polylogarithmic, even for -bit keys, while stillpreserving the o ( n log n ) circuit size.We therefore ask the following natural question, which was also phrased as the main open question inthe work by Asharov et al. [9]: Can we construct sorting circuits for short keys optimal both in size and depth?

More concretely, canwe sort n elements each with a k -bit key and w -bit payload in a circuit of size ( k + w ) · O ( n ) and oflogarithmic depth?If we could achieve the above, we would get a result that strictly generalizes AKS [3] (taking both circuitsize and depth into account). We answer the above question afﬁrmatively (except for an extra poly log ∗ factor in the circuit size). Main result 1: sorting circuit for short keys, optimal in both size and depth.

We explicitly constructa sorting circuit for short keys that is optimal in size (modulo poly log ∗ factors), and optimal in depth, as1tated in the following theorem: Theorem 1.1 (Sorting short keys in the circuit model) . Suppose that n > k +7 . There is a constant fan-in,constant fan-out boolean circuit that correctly sorts any array containing n elements each with a k -bit keyand a w -bit payloads, whose size is O ( nk ( w + k )) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) and whose depthis O (log n + log w ) . The circuit size is optimal upto poly log ∗ factors for every k due to a lower bound by Asharov et al. [9](assuming either the invisibility model or the Li-Li network coding conjecture). Furthermore, Ω(log n ) depth is necessary even for -bit keys, as implied by the lower bound of Cook et al. [12]; moreover, the log w part of the depth is needed even for propagating the comparison result to all bits of the output. Oursorting circuit leverages non-comparison-based techniques, and moreover it does not preserve stability —as mentioned earlier, these are inherent even for the 1-bit key special case.To get this main result requires a signiﬁcant amount of new machinery. Along the way, we go throughtwo major stepping stones that are each of independent interest — we present them below: Main result 2: sorting short keys on an oblivious PRAM, optimal in work and depth.

To eventually getour sorting circuit result, we ﬁrst make the problem a little easier by considering how to deterministicallysort n elements each with a k -bit key on an oblivious Parallel RAM (PRAM). A deterministic algorithmin the oblivious PRAM model is a PRAM algorithm whose memory access patterns do not depend on theinput (except the input size). We show that indeed, one can obliviously sort n elements each with a k -bit keyin O ( n ) · min( k, log n ) total work and O (log n ) depth, assuming that each element can be stored in O (1) memory words. The total work is optimal assuming either the indivisibility model or the Li-Li networkcoding conjecture [9, 27], and the depth is optimal unconditionally even for 1-bit keys [12]. Theorem 1.2 (Sorting short keys on an oblivious PRAM) . There exists a deterministic oblivious parallelalgorithm that sorts any input array containing n elements each with a k -bit key in O ( n ) · min( k, log n ) total work and O (log n ) depth, assuming that each element can be stored in O (1) words . Prior to our work, it was known that n elements with k -bit keys can sorted by a randomized obliviousalgorithm in O ( kn log log n log k ) work and polylogarithmic depth [27]. It is possible to improve the total work to O ( kn ) and get rid of the randomization by combining techniques from Lin et al. [27] and Asharov et al. [8].However, to the best of our knowledge, existing techniques are stuck at polylogarithmic depth. To attain theabove result, our techniques depart signiﬁcantly from the prior works [9, 27].More concretely, we leverage the linear-work, logarithmic depth oblivious compaction algorithm byAsharov et al. [8]. Compaction (also called tight compaction ) is a special case of sorting where the key ispromised to be 1 bit. The crux is how to efﬁciently upgrade the 1-bit oblivious sorting to k -bit oblivioussorting incurring only a k -factor more work relative to the 1-bit case, and preserving the depth. A na¨ıveidea for achieving this is to rely on Radix sort, but this fails on multiple accounts: 1) the -bit case (i.e.,the compaction algorithm) is not stable , and this is inherent as mentioned [2, 27]; and 2) using Radix sortwould have incurred polylogarithmic depth. Lin, Shi, and Xie [27] and Asharov et al. [9] proposed anew 2-parameter recursion trick to accomplish this upgrade. Their technique overcomes the non-stabilitychallenge, but unfortunately still suffers from polylogarithmic depth.We propose a fundamentally different approach to accomplish the 1-bit to k -bit upgrade. Speciﬁcally,we deﬁne a new building block called a nearly orderly segmenter . A nearly orderly segmenter partially sortsthe input array taking O ( nk ) work and O (log n ) depth, such that if we divide the outcome into k segments,each segment contains at most / − k fraction of elements that do not belong in the segment. We then pro-pose a novel algorithm that uses oblivious compaction [8] and additional new building blocks to detect andcorrect the remaining errors in linear work and logarithmic depth. Note that our nearly ordered segmenter Note that the theorem statement for oblivious PRAM does not have an extra poly log ∗ blowup in total work.

2s comparator-based, but the second step that corrects the remaining errors relies on non-comparison-basedtechniques (for example, the oblivious compaction building block itself is non-comparison-based).

Main result 3: linear-sized, logarithmic depth compaction circuit.

We stress that the above obliviousPRAM result does not directly give our circuit result and vice versa. It turns out that going from our obliviousPRAM result to our circuit result is highly non-trivial. The reason is that in a PRAM model, the word size isat least log n bits and one can perform arithmetic and boolean operations on log n bits in unit cost . Indeed, byleveraging this capability, a recent work [8] relied on the so-called “packing” technique to achieve an optimaloblivious parallel compaction algorithm. Speciﬁcally, they showed that sorting n elements each with a 1-bitkey can be accomplished in O ( n ) work and O (log n ) depth on an oblivious PRAM. Unfortunately, in thecircuit model, there is no more free lunch and operations on log n bits incur logarithmic or more cost. Asa result, a compaction circuit optimal in both size and depth is not known. The work by Asharov et al. [9]showed how to achieve optimality in circuit size for compaction (modulo poly log ∗ factors), but at the priceof super- poly logarithmic depth.We ﬁll in this gap and show how to construct a compaction circuit optimal in both size and depth (barringan extra poly log ∗ factor in size), as stated in the following theorem: Theorem 1.3 (Linear-sized, logarithmic-depth tight compaction circuit) . There is a circuit of size O ( nw ) · max( poly (log ∗ n − log ∗ w ) , and depth O (log n + log w ) that can sort any array containing elements with1-bit keys and w -bit payloads. To prove the above theorem, we need fairly sophisticated and novel techniques. At a high level, to avoidsuffering from the super-polylogarithmic depth of Asharov et al. [9], we ﬁrst construct various buildingblocks that can be regarded as relaxations of (tight) compaction. Speciﬁcally, by relaxing compaction alongseveral different axes, we deﬁne several new, intermediate abstractions, each of which will play a role in theﬁnal construction. We show that the relaxed abstractions can be realized in sub-logarithmic or logarithmicdepth. We then gradually bootstrap these building blocks into stronger ones, and the ﬁnal tight compactioncircuit is achieved through multiple steps of bootstrapping. We defer the details to Section 2.In summary, since both stepping stones require novel and fairly sophisticated techniques, our ﬁnal sort-ing circuit construction departs signiﬁcantly from the most closely related prior works [9, 27]. We deﬁnenumerous new algorithmic abstractions along the way, some of which may be of independent interest.

We give an informal technical overview of our ideas in this section. We will start with the two major steppingstones mentioned in Section 1.1, followed by our main sorting circuit result.

Without loss of generality, we assume that k < log n in the following exposition where n denotes thelength of the array to be sorted; since if k ≥ log n , we can simply run AKS [3] to sort the array. We alsoassume that n is a power of ; if not, we can pad it with elements with ∞ keys to the next power of . Weassume that each element can be stored in O (1) memory words.As mentioned, we will leverage the oblivious compaction (i.e., 1-bit-key sorting) algorithm by Asharovet al. [8]; however, existing techniques that upgrade the 1-bit case to k bits would incur polylogarithmicdepth [9, 27]. Our ideas completely depart from the prior works [9, 27]. We propose a new abstraction called an ( η, p ) -orderly segmenter , where η ∈ (0 , indicates how sorted theresulting array is, and p denotes the number of segments. An array A := A || A || . . . || A p , represented3s the concatenation of p equally sized partitions denoted A , A , . . . , A p , is said to be ( η, p ) -orderly iffin each of the p segments, at most η fraction of the elements are misplaced, i.e., they do not belong to thecorrect segment if the array were to be fully sorted. An ( η, p ) -orderly segmenter receives an input arraywhose length is divisible by p , and outputs a permutation of the input array that is ( η, p ) -orderly.We then show how to construct a deterministic, oblivious (2 − k , k ) -orderly segmenter that requires O ( nk ) total work and O ( k ) depth. The construction involves partially executing the AKS algorithm [3].Recall that the full AKS algorithm would execute for a total of log n cycles. In each cycle, the following isrepeated for O (1) number of times: partition the array into disjoint partitions where each partition may notbe a contiguous region in the original array, and apply an (cid:15) -near-sorter to each partition in parallel where (cid:15) ∈ (0 , is a sufﬁciently small constant . Our key observation is the following: Observation 2.1.

If we execute the AKS algorithm not for the full log n cycles, but only for k cycles, itgives a (2 − k , k ) -orderly segmenter. Of course, proving this is non-trivial since it requires us to use the properties of AKS in a non-blackboxmanner . We defer the proof to the formal technical sections.One helpful intuition is the following: if we run AKS for only o (log n ) cycles, we cannot guaranteesortedness within each segment of length n/ o (log n ) . Therefore, had the number of distinct keys been large,running AKS for only o (log n ) cycles could produce an outcome that is far from sorted. Fortunately, whenthe key length k = o (log n ) , running AKS for k = o (log n ) cycles would actually produce an array that isclose to fully sorted. This is a crucial observation we use next to correct the remaining errors. (2 − k , k ) -Orderly Array To sort an array, we ﬁrst apply a (2 − k , k ) -orderly segmenter, which gives a partially sorted array. Next,we apply an efﬁcient oblivious algorithm that corrects the remaining errors. Henceforth, let K := 2 k . New building blocks.

To achieve this, we ﬁrst need to deﬁne and construct a couple new building blocks:1.

SlowSort K ( A ) : an inefﬁcient oblivious sort algorithm — when given an array A of length m withat most K := 2 k distinct keys, the algorithm outputs a sorted permutation of A . We would like toaccomplish SlowSort K ( A ) in O ( mK ) total work and O (log m + k ) depth, since later we will apply SlowSort K ( A ) to arrays of size n/K where n is the length of the larger array we need to sort.It turns out that even this slow version is somewhat non-trivial to construct. The most obvious idea,that is, relying on AKS [3], does not work. AKS would have incurred O ( m log m ) work; and for smallchoices of K , log m could be larger than K .We instead make K copies of the input array: in the u -th copy, we want to put elements with the key u ∈ [0 , K − into the right positions, where as all other elements should be ﬁllers. If we can accomplishthis, we can sort A by performing a coordinate-wise K -way selection among the K arrays.Speciﬁcally, let s u be the number of elements smaller than u . In the sorted array, elements with the key u should appear in positions s u + 1 , s u + 2 , . . . , s u +1 . Now, in the u -th copy, we preserve all the elementswith the key u but replace all other elements with ﬁllers. We mark exactly s u ﬁllers with the key −∞ and the mark rest of ﬁllers with the key ∞ . Now, the u -th copy of the problem boils down to sorting m elements with 3 different keys. We show that this can be accomplished in linear time and logarithmicdepth, if we leverage the linear-work, logarithmic depth oblivious compaction [8] algorithm (we deferthe details of the construction to subsequent technical sections).2. FindDominant ( A ) : let A be an input array containing n elements each with a k -bit key, and let (cid:15) ∈ (0 , / . We say that an array A of length n is (1 − (cid:15) ) -uniform iff except for at most (cid:15)n elements, An (cid:15) -near-sorter is a constant depth comparator circuit described by Ajtai et al. [3], which we will formally deﬁne in thesubsequent technical sections. A have the same key — henceforth this key is said to be the dominant key. We willneed an oblivious algorithm FindDominant ( A ) which ﬁnds the dominant key among an (1 − − k ) -uniform input array A containing n elements; further, we want to accomplish this in O ( n ) total work and O (log n + k ) depth. We construct an oblivious algorithm for solving this problem that is reminiscentof Blum et al. [10]’s median-ﬁnding algorithm, and moreover the algorithm employs SlowSort K as abuilding block — see the subsequent technical section for details. Sorting a (2 − k , k ) -orderly array. Let A be a (2 − k , k ) -orderly array containing n elements with k -bit keys, and recall that K := 2 k . If A were to be fully sorted, then among the K segments, at most K segments can have multiple keys; and all remaining segments must have only a single key. Since A is (2 − k , k ) -orderly, it means that all but K segments of A must be (1 − − k ) -uniform.To understand our algorithm, it is instructive to ﬁrst look at a ﬂawed strawman idea: Flawed strawman idea: sorting a (2 − k , k ) -orderly array

1. Each segment decides if it is (1 − − k ) -uniform or not. That is, each segment calls FindDominant to ﬁnd its dominant key. If the segment is indeed (1 − − k ) -uniform, FindDominant is guaran-teed to return the correct dominant key; else an arbitrary result may be returned.2. Use oblivious compaction to extract 1) all segments in A that are not (1 − − k ) -uniform, and 2)from each (1 − − k ) -uniform segment: extract all elements whose keys differ from the dominantkey, and extract − k · ( n/K ) elements with the dominant key where n/K is the segment size.We can show that the number of extracted elements is upper bounded by n/K ; and there is a wayto pad the extracted array with ﬁllers to a ﬁxed length of n/K to hide how long it actually is.Note that the extracted elements contain all the misplaced elements (i.e., elements that belong toincorrect segments), but possibly some additional elements too. The invariant we want to maintainhere is that all remaining elements must belong to the right segment .3. Call SlowSort K to sort the extracted array and reverse route the result back to the original array.4. At this moment, all elements fall into the correct segment, but if a segment has multiple keys, itmay not be sorted internally. Fortunately, we know that at most K segments can be multi-keyed.Therefore, we use oblivious compaction to extract these K segments, call SlowSort K to sort withineach extracted segment, reverse route the result back, and output the ﬁnal result.This algorithm almost works except for one subtle issue that breaks correctness: the linear-work, logarithmic-depth oblivious compaction algorithm [8] is not stable and in fact this is inherent [2, 27]. This means that inthe Step 2 above, the extracted elements do not preserve the order in which they appear in the input array.To ﬁx this problem, one na¨ıve idea is to use SlowSort K to sort the extracted array once again basedon which segment each element belongs to, but this would be too costly since it would incur K · n/K = O ( nK ) work. Our idea is to switch to a more coarse-grained partitioning scheme at this point: we in-stead view the array as K super-segments, where each super-segment is the concatenation of K originalsegments. Therefore, we use SlowSort K to sort the extracted array whose length is n/K , and thisincurs O ( n ) work and O (log n + k ) depth. At this moment, we can follow through with Steps 3 and 4,with the following modiﬁcations: 1) the reverse-routing in Step 3 now needs to reverse the decision of the SlowSort K instance as well as the compaction; and 2) Step 4 now works on the super-segments ratherthan the segments. We defer a detailed description of our ﬁnal algorithm to the subsequent formal sections. Our operational circuit model.

Although our ﬁnal results are stated in a standard circuit model consistingof constant fan-in, constant fan-out AND, OR, and NOT gates, for convenience, we adopt an enhanced5perational circuit model in intermediate steps [9]. This operational model allows generalized boolean gates of constant fan-in and constant fan-out implementing any truth table; and moreover it allows w -selectorgates each of which takes in a ﬂag bit and two w -bit payload strings, and outputs one of the two payloadsdetermined by the ﬂag. Clearly, each generalized boolean gate can be instantiated with O (1) constant fan-in,constant fan-out AND, OR, and NOT gates. Each w -selector gate can be implemented with w generalizedboolean gates in O (log w ) depth. At ﬁrst sight, it seems like fully instantiating all w -selector gates will incuran O (log w ) multiplicative blowup in depth, but we show later (Lemma 6.1) that this can be avoided, andwe can get away with only O (log w ) additive overhead if we were to fully instantiate all w -selector gates. Challenges of the circuit model.

As mentioned in Section 1.1, going from our oblivious PRAM resultto the circuit result is highly non-trivial. On a PRAM, arithmetic and boolean operations on log n bits areperformed in unit cost, and leveraging this capability, prior work [8] provided us with an “ideal” obliviouscompaction building block which requires linear work and logarithmic depth. In the circuit model, unfortu-nately there is no such free lunch, and the counterpart of an ideal oblivious compaction is not known in thecircuit model. Although the recent work by Asharov, Lin, and Shi [9] constructed a linear-sized compactioncircuit (ignoring poly log ∗ factors), their circuit has super-polylogarithmic depth.To turn our oblivious PRAM algorithm into a circuit, a critical gap we need to overcome is to constructa compaction circuit optimal in both size and depth. This turned out to be highly non-trivial: notably,Asharov et al. [9]’s bootstrapping techniques for getting optimal circuit size seem to come at the price ofdepth blowup. To accomplish this goal, we go through several steps of bootstrapping that takes us fromweaker primitives to stronger primitives. Speciﬁcally, we need several intermediate abstractions — all ofthese abstractions can be viewed in some way as a relaxation of (tight) compaction; but each relaxation isof an incomparable nature. We will ﬁrst deﬁne all these intermediate abstractions, and then we explain ourblueprint for getting an optimal tight compaction circuit. We rely the following intermediate abstractions — a subset of them have been adopted in prior work [8, 33]but the others are new:•

Lossy loose compaction . Let α ∈ (0 , . Given an array of length n containing at most n/ real elements and all remaining elements are ﬁllers , an α -lossy loose compactor compresses the array by ahalf, losing at most αn real elements in the process.• Approximate splitter . Let β ∈ (0 , / and let α ∈ (0 , . An ( α, β ) -approximate splitter solves thefollowing problem: we are given an input array containing n elements each marked with a 1-bit labelindicating whether the element is distinguished or not. It is promised that at most β · n elements inthe input are distinguished. We want to output a permutation of the input array, such that at most αn distinguished elements are not contained in the ﬁrst (cid:98) βn + n/ (cid:99) positions of the output.• Approximate tight compaction . Let α ∈ (0 , . Given an input array containing n elements, each with a -bit key, an α -approximate tight compactor outputs a permutation of the input array, such that at most α · n elements in the output are misplaced. Here, the i -th element in the output is said be misplaced iffits key disagrees with what the i -th smallest key in the input array.• Sparse loose compactor . Let α ∈ (0 , . An array of length n is said to be α -sparse if there are at most αn real elements in it and the rest are all ﬁllers. A sparse loose compactor performs exactly the sametask as a lossy loose compactor, except that 1) the input array is promised to be / (log n ) C (cid:79) -sparse forsome ﬁxed constant C (cid:79) > ; 2) we now want to compress the array to n/ log n length; and 3) we do notwant to lose any real elements in the compressed output array.6 .2.2 Bootstrapping an Efﬁcient Lossy Loose Compactor Fix an arbitrary constant

C > . First, we want to construct a / (log n ) C -lossy loose compactor that has O ( n · w ) generalized boolean gates, O ( n ) number of w -selector gates (ignoring poly log ∗ terms), and withdepth O (log . n ) — here w denotes the bit-width of an element’s payload.We could get an inefﬁcient / (log n ) C -lossy loose compactor (for an arbitrary constant C > ) usingtechniques described in Asharov et al. [8]: speciﬁcally, the resulting / (log n ) C -lossy loose compactorrequires O ( n log log n ) generalized boolean gates, O ( n ) number of w -selector gates, and incurs depth O (log log n ) . If w ≥ log log n , we would then be able to implement this as a constant fan-in, constantfan-out boolean circuit of O ( nw ) size and O (log n + log w ) depth.Henceforth we focus on the case when w = o (log log n ) . In this case, the generalized boolean gatescost asymptotically more than the w -selector gates when we fully instantiate the circuit as a constant fan-in,constant fan-out boolean circuit. We want to bootstrap a more efﬁcient / (log n ) C -lossy loose compactorby balancing these two costs. During the bootstrapping, we can blow up the α parameter (i.e., the fractionof lost elements) by at most a (poly-)logarithmic factor.We are inspired by Asharov et al. [9]’s repeated bootstrapping technique: they use a loose compactorto bootstrap a tight compactor without incurring too much overhead, and then use the tight compactor tobootstrap a loose compactor much more efﬁcient than the original one. This is repeated for d := log(log ∗ n − log ∗ w ) times. Unfortunately, even if we allow lossiness, we cannot directly use their techniques due to theblowup in depth. One critical factor contributing to the depth blowup comes from the bootstrapping stepin which they construct a tight compactor given a loose compactor. Here, they have to perform metadatacomputation that is Θ(log n ) in depth. This would incur at least (Θ(log n )) d total depth over all steps of thebootstrapping, where d := log(log ∗ n − log ∗ w ) .Our key observation is to use a weaker intermediate abstraction during the bootstrapping, that is, anapproximate splitter. Speciﬁcally, we use a lossy loose compactor to construct an approximate splitter with-out incurring too much overhead, and then use the resulting approximate splitter to construct a lossy loosecompactor much more efﬁcient than the original one. Unlike Asharov et al. [9], the repeated bootstrappingno longer gives us a tight compactor directly; it only gives an efﬁcient lossy loose compactor. As explainedlater, getting a tight compactor from an efﬁcient lossy loose compactor requires additional novel techniques. Approximate splitter from lossy loose compactor.

In a pre-processing phase, we ﬁrst mark misplacedelements (and some additional elements) as either blue or red , such that the approximate splitter task canbe expressed as pairing up each blue with a distinct red and swapping almost all such pairs. Speciﬁcally,any distinguished element not contained in the ﬁrst (cid:98) βn + n/ (cid:99) positions of the input are colored blue .Any non-distinguished element contained in the ﬁrst (cid:98) βn + n/ (cid:99) positions of the input are colored red .This makes sure that n red ≥ n blue + n/ , where n red and n blue denote the number of red and blue elements, respectively. Observe that the metadata computation in the pre-processing step has constant depth(as opposed to logarithmic depth had we used the tight compaction version of the bootstrapping [9]).Next, we rely on an approximate swapper that swaps most of the blue elements with their paired red ,except for leaving behind at most n/ blue elements that are unswapped. Henceforth, we may assumethat swapped elements become uncolored. Such an approximate swapper circuit can be constructed using alinear-sized and constant-depth circuit by combining prior techniques [8, 9].Now, we want to extract almost all of the remaining blue elements except for at most n/ poly log n ofthem, as well as slightly more red elements than blue . Further, the extracted array is a constant factorshorter than the original array. For technical reasons, we have to use a different algorithm for extractingthe blue and red elements, respectively. Speciﬁcally, we rely on a lossy loose compactor to extract the blue elements; and rely on an (cid:15) (cid:48) -near-sorter to extract the red elements for some sufﬁciently small constant (cid:15) (cid:48) ∈ (0 , . At this moment, the problem boils down to swapping almost all blue elements in the extractedarray with a distinct, paired red element, and reverse routing the result back to the original array. We can7ccomplish this by recursing on the extracted array. The recursion stops when the extracted array’s sizebecomes n/ poly log n for some appropriate choice of poly log( · ) .We defer a formal description of the scheme and the parameters to the subsequent technical sections.This bootstrapping step incurs the following blowup in parameters:• Let α := 1 / poly log n be the loss-factor of the α -lossy loose compactor, then the resulting approximatesplitter has the approximation factor α .• Suppose that the / poly log( n ) -lossy loose compactor has B lc ( n ) number of generalized boolean gates,and S lc ( n ) number of w -selector gates, and has D lc ( n ) depth, then the resulting approximate splitter has C · B lc ( n ) generalized boolean gates, C · S lc ( n ) number of w -selector gates, and C log log n · D lc ( n ) depth, where C , C , C > are appropriately large constants. Lossy loose compactor from approximate splitter.

We want to construct a more efﬁcient lossy loosecompactor given an approximate splitter. Let f ( n ) < log n be some function on n , and let C sp > be some appropriate constant. Suppose that we have an α -approximate splitter that costs C sp · n · f ( n ) generalized boolean gates, C sp · n number of w -selector gates, and D sp ( n ) depth. We can construct a lossyloose compactor as follows:1. Divide the input array into f ( n ) -sized chunks. We say that a chunk is sparse if there are at most f ( n ) / real elements in it; otherwise it is called dense . Since the input is promised to be / -sparse, we willlater prove that at least / fraction of the chunks are sparse.2. Call an ( α, / -approximate splitter to move almost all dense chunks to the front and almost all sparsechunks to the end. Here the approximate splitter works on n/f ( n ) elements each of bit-width f ( n ) · w .3. Apply an ( α, / -approximate splitter to the trailing (cid:108) ( − ) · nf ( n ) (cid:109) chunks to compress each ofthese chunks to a length of (cid:106) f ( n )64 (cid:107) , losing few elements in the process. The ﬁrst (cid:106) ( + ) · nf ( n ) (cid:107) chunks are unchanged. Output the resulting array.The resulting lossy loose compactor has a lossy factor of . α ; moreover, it costs at most . · C sp · n · f ( f ( n )) generalized boolean gates, at most . · C sp · n number of w -selector gates, and has depth . D sp ( n ) . Note that the total number of generalized boolean gates reduces quite signiﬁcantly in this stepbut the total number of w -selector gates and the depth increase by a constant factor. Repeated bootstrapping.

We repeatedly perform the above bootstrapping. Henceforth going from lossyloose compactor to approximate splitter, and then back to a lossy loose compactor is called one step inour bootstrapping. After d := log(log ∗ n − log ∗ w ) steps of bootstrapping, the cost incurred by generalizedboolean gates and w -selector gates will be balanced. Speciﬁcally, there will be O ( nw ) · poly (log ∗ n − log ∗ w ) generalized boolean gates and O ( n ) · poly (log ∗ n − log ∗ w ) number of w -selector gates. Both can beinstantiated with O ( nw ) · poly (log ∗ n − log ∗ w ) number of AND, OR, and NOT gates of constant fan-in.After d steps of bootstrapping, the depth will be log log n · (Θ(log log n )) d which is upper bounded by O (log . n ) . The total lossy factor will be poly (log ∗ n − log ∗ w ) · α , where α = 1 / poly log( n ) denotes thelossy factor of the initial lossy loose compactor we started out with. Given a / (8 log C n ) -lossy loose compactor with O ( nw ) · poly (log ∗ n − log ∗ w ) boolean gates and O (log . n ) depth, we construct a / (log n ) C -approximate tight compactor which asymptotically preserves the circuitsize and has O (log n ) depth. The construction is similar to how Asharov et al. [9] constructed a tight com-pactor from a loose compactor, except that 1) to achieve O (log n ) small depth, we run the algorithm onlyfor Θ(log log n ) iterations rather than Θ(log n ) as Asharov et al. did; and 2) we prove that due to the lossi-ness in the loose compactor as well as the early stopping, our bootstrapping achieves only approximate tight8ompaction, i.e., α fraction of elements may still be misplaced. We defer the details to the formal sections. Now that we have a / (log n ) C -approximate tight compactor with O ( nw ) · poly (log ∗ n − log ∗ w ) booleangates and O (log n ) depth, we can apply it to the input array, such that all but / poly log n fraction of theelements are in the correct place. Next, we want to extract the / poly log n fraction of misplaced elementsto an array of length at most Θ( n/ log n ) . If we can accomplish this, we can then use AKS to swap everymisplaced with a distinct misplaced in the extracted short array, and reverse route the result back.Therefore, the crux is how to solve the sparse loose compaction problem, that is, we want to extractthe / poly log n fraction of misplaced elements to an output array of a ﬁxed length of (cid:98) n/ log n (cid:99) ; besidescontaining the misplaced elements, the output array is otherwise padded with ﬁller elements. Bipartite expander graphs with poly-logarithmic degree.

We are inspired by the loose compactor con-struction of Asharov et al. [9] which in turn builds on Pippenger’s self-routing superconcentrator [33].Asharov et al. [9]’s construction relies on d -regular bipartite expander graph with constant degree d andconstant spectral expansion (cid:15) ∈ (0 , . We will instead need a bipartite expander graph with m vertices onthe left and m vertices on the right, where each vertex has degree d = log c m . The spectral expansionof the graph is (cid:15) := 1 / log c m . In the above, c > c > , and both c and c are suitable constants.Such a bipartite expander graph can be constructed using standard techniques. As we shall see later, usinga polylogarithmic degree bipartite expander graph introduces additional complications to the algorithm incomparison with earlier works [9, 33]. Intuition.

Given such a polylogarithmic-degree bipartite expander graph, where L denotes the left ver-tices and R denotes the right vertices, we construct a sparse loose compactor as follows. Throughout, ouralgorithm will operate on super-elements rather than elements, where each super-element contains log n consecutive elements in the input array. Each super-element is real if it contains at least one real element. Ifthe fraction of real elements in the input is at most / (log n ) C , then the fraction of real super-elements is atmost / (log n ) C − . Henceforth let n (cid:48) := n/ log n denote the number of super-elements.We divide the input array into chunks each containing only d/ super-elements. Henceforth let m =2 n (cid:48) /d be the number of chunks. For simplicity, we assume that the numbers log n , n/ log n , and n (cid:48) /d areintegers in this informal overview, and we will deal with rounding issues in the formal technical sections.We will think of each of the m chunks as a left vertex in the bipartite expander graph. If the chunk containsat most d/ (2 log m ) real super-elements, it is said to be sparse; else it is said to be dense.At a very high level, the idea is for all the dense vertices on the left to distribute its load to the rightvertices, such that each right vertex receives no more than d/ (2 log m ) real super-elements. After the loaddistribution step, we empty all real super-elements from the dense chunks; and now all vertices on the leftand right are sparse chunks. We now compress each left and right chunk to / log m of its original sizewithout losing any real super-elements. This would compress the array by a Θ(1 / log m ) factor. Ofﬂine phase.

The load distribution step consists of an ofﬂine phase and an online phase. The ofﬂine phaselooks at only the real/ﬁller indicator of each super-element, and does not look at the payloads. The goalof the ofﬂine phase is to output a matching M between the left vertices L and the right vertices R , suchthat each dense chunk on the left has d/ neighbors in the matching M , and each right vertex has no morethan d/ m neighbors in M . If such a matching can be found, then during the online phase, each densechunk can route up to d/ super-elements each along a distinct edge in the matching M to a right vertex.To ﬁnd the matching, we use the ProposeAcceptFinalize algorithm ﬁrst proposed by Pippenger [33].For convenience, a left vertex is called a factory and a right vertex is called a facility .Initially, each factory corresponding to a dense chunk is unsatisﬁed and each factory corresponding to asparse chunk is satisﬁed . Each productive factory u ∈ L has at most d/ real super-elements. Now, repeatthe following for iter := log n (cid:48) / log log n (cid:48) times and output the resulting matching M at the end:9a) Propose:

Each unsatisﬁed factory sends a proposal (i.e., the bit 1) to each one of its neighbors. Eachsatisﬁed factory sends 0 to each one of its neighbors.(b)

Accept:

If a facility v ∈ R received no more than d/ (2 log m ) proposals, it sends an acceptancemessage to each one of its d neighbors; otherwise, it sends a reject message along each of its d edges.(c) Finalize:

Each currently unsatisﬁed factory u ∈ L checks if it received at least d acceptance messages.If so, for each edge over which an acceptance message is received, mark it as part of the matching M .At this moment, this factory becomes satisﬁed.In our subsequent formal sections, we will use the Expander Mixing Lemma (see Lemma A.1 of Ap-pendix A) to prove that in each iteration of the above ProposeAcceptFinalize algorithm, at most / log m fraction of the unsatisﬁed factories remain unsatisﬁed at the end of the iteration (Lemma 12.4). Therefore,one can show that after log n (cid:48) / log log n (cid:48) iterations, all factories become satisﬁed. Note that each iterationtakes O (log d ) = O (log log n ) depth (this is needed for tallying how many proposals or acceptance mes-sages a vertex has received), and therefore the total depth is only O (log n ) . One crucial observation is thatthe number of edges in the bipartite group is within a constant factor of the number of super-elements, whichis O ( n/ log n ) . In this way, over all log n/ log log n iterations of the ofﬂine phase, the number of generalizedboolean gates is upper bounded by O ( n ) .Finally, like in prior work [6, 33], it is not hard to show that each facility on the right will be matchedwith at most d/ (2 log m ) factories. Online routing phase.

Each dense chunk wants to route each of its up to d/ real super-elements alonga distinct edge in the matching M to the right. The challenge is that we need to accomplish this using alinear number of gates, i.e., each chunk is allowed to consume O ( d · w ) gates (ignoring poly log ∗ terms). Incomparison, in prior works [9, 33], this was a non-issue because their chunks were constant in size.We accomplish this by leveraging a tight compaction circuit that is optimal in size, but not so optimalin depth — since each chunk is small. In fact, to achieve this, we can use the tight compaction circuit byAsharov et al. [9], but replace some its building blocks with parallel versions (see Theorem 12.1 for moredetails). The resulting tight compaction circuit has depth that is super-polylogarithmic in the input length,but when applied to a chunk of poly log n size, the depth would be upper bounded by O (log n ) . Compressing all chunks.

Now that we have ﬁnished the load distribution phase, all chunks on the left andright must be sparse. We therefore compress each chunk to / log m of its original size. This can be doneby applying to each chunk a tight compaction circuit that is optimal in work but not optimal in depth (sameas the building block we used in the online routing phase).After this, the input is compressed to / log m of its original size, without losing any real elements. We can now put everything together and construct construct a compaction circuit optimal in size (barring poly log ∗ factors) and also optimal in depth. We ﬁrst apply our / poly log n -approximate tight compactor tosort almost all of the input array, except for leaving / poly log n fraction of elements still misplaced. Next,we rely on a sparse loose compactor to extract all misplaced elements to an array of size n/ log n , and theextracted array is padded with ﬁllers besides containing all the misplaced elements. We then use AKS toswap misplaced 0s with misplaced 1s in the short, extracted array, and reverse route the result back. Additional technicalities.

Our informal description above is a somewhat simpliﬁed version of our actualtight compaction circuit. We omitted various technicalities regarding how to implement some of the otherbuilding blocks in circuit, in a way that avoids extra blowups. We defer these details to the formal sections. In fact, in our formal technical sections, we will deﬁne a slight variant of tight compaction called “distribution” to accomplishthe online routing — see Sections 4 and 12.1. .3 Sorting Circuit for Short Keys With our algorithms in Sections 2.1 and 2.2, and with some extra work, one can get a sorting circuit for shortkeys that satisﬁes Theorem 1.1. The technicalities here are mostly how to efﬁciently convert some of thealgorithmic building blocks used by the oblivious PRAM sorting algorithm to the circuit model. We deferthe details to the subsequent formal sections.

Since the landmark AKS result [3], various works have attempted to simplify it and/or reduce the concreteconstants [16, 31, 35]. Notably, the recent ZigZag sort of Goodrich (STOC’14) [16] took a rather differ-ent approach than the original AKS; unfortunately, its depth is asymptotically worse than AKS. None ofthese works achieved theoretical improvements over AKS, and all of them considered the comparator-basedmodel.As mentioned, the special case of sorting 1-bit keys is also called compaction, which is trivial to ac-complish on a (non-oblivious) RAM. A line of work was concerned about the circuit complexity of com-paction [4, 22, 32, 37]; but all earlier works focused on the comparator-based model. Due to the famous 0-1principle described as early as in Knuth’s textbook [24], there is an Ω( n log n ) lower bound for compactionwith comparator-based circuits. Several works have considered compaction in other incomparable modelsof computation as explained below (but none of them easily translate to a circuit result). Leighton et al. [25]show how to construct comparison-based, probabilistic circuit families for compaction, with O ( n log log n ) comparators; again, here we require that for every input, an overwhelming fraction of the circuits in thefamily can give a correct result on the input. Subsequent works [27, 29] have improved Leighton’s resultby removing the restriction that the circuit family must be parametrized with the number of 0s without in-creasing the asymptotical overhead. These works also imply that compaction can be accomplished with in O ( n log log n ) time on a randomized Oblivious RAM [27, 29].Asharov et al. considered how to accomplish compaction on deterministic Oblivious RAMs in lin-ear work [7], but their construction is sequential in nature. Their work was subsequently extended [8] to aPRAM setting achieving optimality in both work and depth; but a counterpart of such an optimal compactionresult in the circuit model was not known earlier. Dittmer and Ostrovsky improve its concrete constants byintroducing randomness back [13]. Interestingly, linear-time oblivious compaction played a pivotal role inthe construction of an optimal Oblivious RAM (ORAM) compiler [7], a machine that translates a RAM pro-gram to a functionally-equivalent one with oblivious access patterns. Speciﬁcally, earlier ORAM compilersrelied on oblivious sorting which requires Ω( n log n ) time either assuming the indivisibility model [27] orthe Li-Li network coding conjecture [14]; whereas more recent works [7, 30] observed that with with a lotmore additional work, we could replace oblivious sorting with the weaker compaction primitive.Besides Pippenger’s self-routing super-concentrator [33], Arora, Leighton, and Maggs [6] considered aself-routing permutation network. Their construction does not accomplish sorting. Further, converting theirnon-blocking network to a permutation circuit would require at least Ω( n log n ) gates [1]. Pippenger’swork [33] adopted some techniques from the Arora et al. work [6]. Array and multiset notations.

Whenever we say an array , we mean an ordered array. Throughout thepaper, we may assume that the array to be sorted has length n that is a power of — in case not, we canalways round it up to the nearest power of by padding ∞ elements, incurring only constant blowup inarray length and consuming at most one additional bit in terms of key length.11iven an array A , the notation mset ( A ) denotes multiset formed by elements in A . Suppose that A and A (cid:48) are two arrays, then A || A (cid:48) denotes the array formed by concatenating A and A (cid:48) . For m ∈ N , we use thenotation [ m ] := { , , . . . , m } . Suppose that ≤ s ≤ t ≤ | A | , we use the notation A [ s : t ] to denote thelength- ( t − s + 1) segment of the array A from s -th element to the t -th element. We deﬁne the short-handnotations A [: t ] := A [1 : t ] and A [ s :] := A [ s : | A | ] . Unless otherwise noted log means log . Binary tree notations.

Given a complete binary tree with t levels, the level of a node is the number of edgesfrom the root to the node. For example, the root is at level ; and the leaves are at level t − .The tree distance of two nodes in a binary tree is the length of the shortest path between them. Misplaced elements.

Let A be an array of length n , and let [ s, t ] ⊆ [ n ] be a contiguous sub-range of [ n ] .The number of misplaced elements in the segment A [ s : t ] , denoted err ( A [ s : t ]) , is deﬁned as as the numberof elements residing in A [ s : t ] , however, if A were to be sorted, ought not to be in A [ s : t ] . More formally, err ( A [ s : t ]) = | mset ( A [ s : t ]) − mset ( B [ s : t ]) | . where B = sorted ( A ) denotes the sorted version of A , and recall that mset ( A [ s : t ]) denotes the multisetformed by elements in A [ s : t ] . As a special case, if mset ( A [ s : t ]) = mset ( B [ s : t ]) , then the err ( A [ s : t ]) = 0 . Nearly orderly segmenter.

We now deﬁne ( η, p ) -orderliness and an ( η, p ) -orderly segmenter. Deﬁnition 1 ( ( η, p ) -orderly) . Let m and p be positive integers, and suppose that n = mp . Write an array A of length n as the concatenation of p equal-sized segments: A = A || A || . . . || A p . We say that A is ( η, p ) -orderly iff for each i ∈ [ p ] , err ( A i ) ≤ η · | A i | . Deﬁnition 2 ( ( η, p ) -orderly segmenter) . Let n := mp . An ( η, p ) -nearly orderly segmenter (for n ) is a circuitthat takes an array A of length n , and outputs a permutation of A which is ( η, p ) -orderly. We rely on ideas from the AKS sorting network [3] to construct a nearly orderly segmenter.At a very high level, the AKS algorithm proceeds in O (log n ) cycles to sort a length- n input array.During each cycle t ,1. The algorithm partitions the current array into a number of disjoint intervals which are not necessarilyequally sized. Henceforth the term interval refers to a contiguous subarray. The number of intervalsis geometrically growing with each cycle.2. The algorithm then partitions the intervals into groups, and each group contains a disjoint (but notnecessarily contiguous) subset of the intervals. It then sorts each group and writes the sorted arrayback in place. Further, all the groups are sorted in parallel. The above partitioning and sorting proce-dure is repeated three times (and each time the partitioning may be different), and then the algorithmenters the next cycle.At the end of log n cycles, the input array is guaranteed to be sorted.It turns out if we repeat the AKS algorithm for k < log n cycles and stop, the resulting array will satisfy (2 − k , k ) -orderliness. For completeness, below we describe the algorithm where we essentially performAKS for k < log n cycles, we then rely on a technical lemma proven in the AKS paper [3] to prove thatthe resulting array is indeed (2 − k , k ) -orderly. 12 , 22 4 5 8 9 14 15 18 19 3, 6 7, 10 13,

16 17,

21 even cherry odd cherry

Figure 1: t -AKS-tree for t = 3 . Recall that we would like to partition the current array into a number of intervals in each AKS cycle t . Tounderstand how the intervals a deﬁned, we will ﬁrst deﬁne a helper data structure called a t -AKS-tree. t -AKS-tree. An t -AKS-tree is binary tree containing a total of t +1 levels numbered , , . . . , t , respectively.Henceforth deﬁne M ( t ) := 3 · t − . All tree nodes receive either one or two labels from the range [ M ( t )] ;further, each label is given to exactly one tree node. The labeling scheme satisﬁes the following constraints:1. Each leaf receives one label from the range [ M ( t )] ; and each non-leaf node receives two labels fromthe same range.2. For each internal node, every label in its right subtree is strictly greater than every label in its leftsubtree.3. The set of labels assigned to each subtree is a contiguous sub-range [ s, t ] ⊆ [ M ( t )] ; and further, theminimum s and maximum t of the range are assigned to the root of the sub-tree.One can check that the above set of constraints uniquely deﬁne the labeling on the tree nodes. In Figure 1,we give an example of a t -AKS-tree where t = 3 . t -AKS-intervals. Given an array A of length n , we can divide it into M ( t ) intervals called t -AKS-intervals,i.e., A := A || A || . . . || A M ( t ) , where the length of each interval A i depends on which level the label i showsup in the t -AKS-tree. At a very high level, the length geometrically decreases by a factor of approximately γ := 16 as the label i ’s level becomes smaller.We now deﬁne the lengths of each t -AKS-interval more formally, following the same approach as in theoriginal AKS paper [3]. We ﬁrst deﬁne the following numbers for t = 1 , , . . . , log n , and for (cid:96) = 1 , , . . . , t : X t ( (cid:96) ) := (cid:106) γ · n · − t · γ (cid:96) − t (cid:107) , Y t ( (cid:96) ) := (cid:80) (cid:96)j =1 X t ( j ) Let (cid:96) ∈ [0 , t − and j ∈ [1 , (cid:96) ] . Suppose that the j -th node at level (cid:96) in the t -AKS-tree have the twolabels i and i (cid:48) . Then, the lengths of the two intervals A i and A i (cid:48) are deﬁned as follows. | A i | := (cid:40) X t ( (cid:96) + 1) if j is odd Y t ( (cid:96) + 1) o.w. and | A i (cid:48) | := X t ( (cid:96) + 1) + Y t ( (cid:96) + 1) − | A i | Finally, in the last level (cid:96) = t in the t -AKS-tree, each node has only one label. Suppose that the j -thnode’s label is i , then the length of the interval A i is | A i | := n · − t − Y t ( t ) .13 act 3.1 (Group t -AKS-intervals into equally sized segments) . As mentioned, assume that n := | A | isa power of 2. Fix any non-leaf level (cid:96) ∈ { , , . . . , t − } in a t -AKS-tree, we can partition A into (cid:96) equally sized segments as follows (where equally sized means that every segment contains the same numberof elements):

1. Initially, for every node v in the t -AKS tree, L ( v ) is deﬁned to be the set of the original labels of v .Speciﬁcally, for every non-leaf node v , L ( v ) has two labels, and for every leaf node v , L ( v ) has onlyone label.2. For level i = 0 to (cid:96) − , for every node v in level i of the t -AKS-tree,(a) let S ⊆ L ( v ) be the subset of node v ’s labels smaller than every label in L ( v. LeftChild ) , and let S (cid:48) := L ( v ) \ S .(b) let L ( v. LeftChild ) := L ( v. LeftChild ) ∪ S ;(c) let L ( v. RightChild ) := L ( v. RightChild ) ∪ S (cid:48) ;3. For every node v in level (cid:96) of the t -AKS-tree: all t -AKS-intervals whose corresponding labels are in Subtree ( v ) (including L ( v ) ) are grouped together and called one segment, where Subtree ( v ) meansthe subtree rooted at v .In the example in Figure 1 where t = 3 , let (cid:96) := 2 . In this case, if we partition A into equally sizedsegments: ( A , A , . . . , A ) , ( A , A , . . . , A ) , ( A , A , . . . , A ) , ( A , A , . . . , A ) . Proof of Fact 3.1:

This fact is implicit in the AKS paper [3], we prove it explicitly below. Alternatively,we can consider the following equivalent variant of the above algorithm: in Step 2, we do not stop at theend of the ( (cid:96) − -th iteration, but continue all the way to level t − . At this moment, for each node v in level (cid:96) of the tree: we group together the labels on all leaf nodes in Subtree ( v ) — their corresponding t -AKS-intervals will form one segment.It is not hard to show through induction that at the end of the iteration i = 0 to t − in Step 2, each node v in level i + 1 of the t -AKS tree receives a set of labels from its parent which correspond to a total of Y t ( i ) elements. Therefore, at the end of iteration to t − , for each node v in level i + 1 , its labels correspond toa total of consecutive Y t ( i + 1) elements. At the end of the ﬁnal iteration i = t − , each leaf node’s labelscorrespond to n · − t consecutive elements. Even and odd cherries.

In a binary tree, a cherry is deﬁned to be a parent node and its two children. The even (or odd, resp.) cherries are those whose parents reside at an even (or odd, resp.) level .Given A := A || . . . || A M ( t ) written as t -AKS-intervals, we deﬁne EvenCherries ( A || . . . || A M ( t ) ) tobe a set of disjoint groups of t -AKS-intervals. Speciﬁcally, EvenCherries ( A || . . . || A M ( t ) ) is of the form { G , G , . . . , G d } where d is the number of even cherries in a t -AKS-tree, and for i ∈ [ d ] , each group G i corresponds to a distinct even cherry in a t -AKS-tree, i.e., G i is one of the following two forms dependingon whether the even cherry touches the leaf level:• either G i := A j || . . . || A j where j < j < . . . < j , and moreover, j , . . . , j correspond to thelabels of an even cherry in the t -AKS-tree that does not involve the leaf level;• or G i := A j || . . . || A j where j < j < . . . < j . and moreover, j , . . . , j correspond to the labelsof an even cherry in the t -AKS-tree, involving the leaves this time.14he notation OddCherries is similarly deﬁned but replacing “even” with “odd”. (cid:15) -near-sorter.

Let (cid:15) ∈ (0 , be a constant. An array A of length n is said to be (cid:15) -near-sorted, iff thefollowing holds for any ≤ k ≤ n :1. A [1 : k + (cid:15)n ] contains at least (1 − (cid:15) ) k of the k smallest elements in A ;2. A [ n − k − (cid:15)n + 1 : n ] contains at least (1 − (cid:15) ) k of the k largest elements in A .In the above, we use the following notations to deal with boundary conditions: for i > n , A [1 : i ] := A [1 : n ] ; and for i < , A [ i : n ] := A [1 : n ] .An (cid:15) -near-sorter (for n ) is a circuit containing O ( n ) comparators and of constant depth (dependent on (cid:15) ) that permutes any input array of length n into one that is (cid:15) -near-sorted. Earlier works have shown how toconstruct such and (cid:15) -near-sorter using expander graphs [3]. Our nearly orderly segmenter construction is described below:

Nearly orderly segmenterInput:

An array I whose length n is a power of . Parameters:

Let (cid:15) ∈ (0 , be a sufﬁciently small constant, and let C zigzag > be a sufﬁciently largeconstant. Algorithm:

Let A := I be the current array.For t = 1 , , . . . , min(6 k, log n ) : // t -th AKS cycle Divide into t -AKS-intervals. Write A := A || A || . . . || A M ( t ) , where A , A , . . . , A M ( t ) are t -AKS-intervals.2. Repeat the following C zigzag times:(a) Near-sort even cherries.

In parallel, apply an (cid:15) -near-sorter to each group of intervals containedin

EvenCherries ( A || . . . || A M ( t ) ) , and the result is written back in place (i.e., into the t -AKS-intervals’ original positions within A ).(b) Near-sort odd cherries.

In parallel, apply an (cid:15) -near-sorter to each group of intervals containedin

OddCherries ( A || . . . || A M ( t ) ) , and the result is written back in place.3. Near-sort even cherries.

Repeat Step 2a one ﬁnal time.

Output:

Finally, output A . Theorem 3.2 ( (2 − k , k ) -orderly-segmenter) . Let (cid:15) ∈ (0 , be a suitably small constant, and let C zigzag be a suitably large constant. Then, the above construction is a (2 − k , k ) -orderly-segmenter; moreover, itcan be implemented as a comparator-based circuit with O ( n ) · min(6 k, log n ) comparators and of O (1) · min(6 k, log n ) depth.Proof. The proof is presented in Section 3.4.

The size and depth bounds follow in a straightforward manner. Below we focus on proving that the algorithmgives a (2 − k , k ) -orderly-segmenter. To prove this, we need to rely on a technical lemma proven by Ajtaiet al. [3]. 15 emma 3.3 (Technical lemma due to Ajtai et al. [3]) . Fix any arbitrarily small constant α ∈ (0 , suchthat (16 γ ) · α < . There exist a suitably small constant (cid:15) ∈ (0 , and a suitably large constant C zigzag > , such that in the above construction, at the end of each cycle t ≤ log n , the following hold forany t -AKS-interval A i where i ∈ [ M ( t )] :For r ≥ , err r ( A i ) < α r +27 · | A i | , where err r ( A i ) denotes the number of elements actually in A i , butif the array were sorted, would land in a t -AKS-interval that is at a tree-distance at least r away fromthe node labeled with i in the t -AKS-tree. The above Lemma 3.3 is implied by the Theorem stated on page 7 of the original AKS paper [3] —we stated the lemma slightly differently from the original AKS paper for our convenience. We now useLemma 3.3 to prove Theorem 3.2.

Proof of Theorem 3.2:

Recall that γ = 16 . We will choose α such that · (16 γ ) · α = 1 , i.e., α = (32 γ ) − = . Moreover, suppose that we pick C zigzag to be sufﬁciently large and (cid:15) ∈ (0 , tobe sufﬁciently small such that Lemma 3.3 is satisﬁed. We run the algorithm speciﬁed in Section 3.3.2 withthe aforementioned parameters, and let A be the output array. Without loss of generality, we may assumethat k < log n since otherwise Ajtai et al. [3] proved that the outcome A would be sorted, and this wouldbe the easy case.We now divide A into k equally sized segments. We can equivalently view the k equally sizedsegments as being created by the procedure speciﬁed in Fact 3.1, where (cid:96) := 3 k . Pick an arbitrary segment,say, the i -th segment, among the k equally sized segments. Henceforth, let v k,i denote the i -th node inlevel k of the k -AKS-tree.Due to the procedure speciﬁed in Fact 3.1, we know that the i -th segment consists of1. all k -AKS-intervals whose labels reside in Subtree ( v k,i ) of the original k -AKS-tree ( not of the treeoutput by the procedure in Fact 3.1);2. a subset of the k -AKS-intervals whose labels reside in an ancestor node of v k,i in the k -AKS-tree.For convenience, whenever we say the level of a t -AKS-interval, we mean the level of its correspondinglabel in the t -AKS-tree. Let S (cid:96) denote the the total length of all k -AKS-intervals of level (cid:96) contained in the i -th segment. It is not hard to see that for (cid:96) ∈ [0 , k − , S (cid:96) ≤ S (cid:96) +1 / , by the deﬁnition of the lengths ofthe t -AKS-intervals.For (cid:96) ∈ [3 k, k ] , a k -AKS-interval of level (cid:96) contained in the i -th segment must have tree distance atleast (cid:96) − k + 1 from any k -AKS-interval not contained in the i -th segment.We use the term “wrong elements” to mean elements that do not belong to the i -th segment if the arraywere sorted. Let W (cid:96) denote the total number of wrong elements in some k -AKS-interval of level (cid:96) in the i -th segment. By Lemma 3.3, we have that W (cid:96) ≤ α (cid:96) − k +1)+27 · S (cid:96) ≤ α (cid:96) − k +1)+27 · S k k − (cid:96) Therefore, we have that (cid:80) (cid:96) ∈ [3 k, k ] W (cid:96) S k ≤ α k +1)+27 · (cid:32) α − (cid:18) α − (cid:19) + . . . + (cid:18) α − (cid:19) k (cid:33) ≤ α · α k · (cid:18) α − (cid:19) k · ≤ − k ( (cid:63) ) (cid:80) (cid:96) ∈ [0 , k − S (cid:96) S k ≤ k +1 · (cid:18) . . . + 18 k − (cid:19) ≤ k +1 · ≤ − k ( (cid:63)(cid:63) ) Combining ( (cid:63) ) and ( (cid:63)(cid:63) ) , we have that (cid:80) (cid:96) ∈ [0 , k ] W (cid:96) S k ≤ − k · ≤ − k Since S k is smaller than the total length of the i -th segment, we have that the fraction of misplaced elementsof the i -th segment must be upper bounded by − k . In this section, we present some building blocks that can be implemented as deterministic, oblivious parallelalgorithms. This means that the algorithms’ memory access patterns are ﬁxed a-priori and independent ofthe input (once we ﬁx the input’s length).

Compaction.

Compaction (short for “tight compaction”) solves the following problem: given an array inwhich every element is tagged with a 1-bit key, move all elements tagged with to the front of the array, andmove elements tagged with to the end. Asharov et al. [8] showed a deterministic algorithm that obliviouslycompacts any array containing n elements each of which encoded as (cid:96) words; and their algorithm achieves O ( (cid:96) · n ) total work and O (log n ) depth.Furthermore, their compactor supports a “reverse routing” capability. Speciﬁcally, their compactor canbe thought of a network consisting of O ( n ) selector gates of depth O (log n ) , with n inputs and n outputs.Each selector gate takes in a 1-bit ﬂag and two input elements that are (cid:96) words long, and the ﬂag is used todecide which of the two input elements to output. The ﬁrst phase of their algorithm, takes O ( n ) work and O (log n ) depth: it computes on the elements’ 1-bit keys, and populates all selector gates’ 1-bit ﬂags. Thesecond phase of their algorithm then routes the input elements to the output layer over this selector network.This takes O ( (cid:96) · n ) work and O (log n ) depth. Since each selector gate can remember its ﬂag, it is possibleto later on route elements in the reverse direction, from the output layer back to the input layer.We stress that Asharov et al. [8]’s oblivious compaction algorithm is not stable , i.e., it does not preservethe relative order of elements with the same key as they appeared in the input array. In fact, Lin, Shi,Xie [27] showed that this is inherent: any oblivious algorithm in the indivisibility model that achieves stable compaction must incur Ω( n log n ) work. Here, an algorithm in the indivisibility model is one that doesnot perform encoding or computation on the elements’ payload strings. Afshani et al. [2] shows that the Ω( n log n ) lower bound holds for oblivious, deterministic stable compaction even without the indivisibilityrequirement, but instead assuming that the Li-Li network coding conjecture holds [26]. Distribution.

Distribution solves the following problem. We are given an input array I of length n in whicheach element carries a w -bit payload and a 1-bit label indicating whether the element is real or a ﬁller .Additionally, we are given a bit-vector v of length n , where v [ i ] indicates whether the i -th output position isavailable to receive a real element. It is promised that the number of available positions is at least as manyas the number of real elements in I . We want to output an array O such that the multiset of real elementsin O is the same as the multiset of real elements in I , and moreover if O [ i ] contains a real element, then itmust be that v [ i ] = 1 , i.e., only available positions in the output array O can receive real elements.The following algorithm accomplishes the aforementioned distribution task using compaction as a build-ing block: 17 istribution

1. Let X be an array in which all payloads are ﬁllers and each X [ i ] is marked with the label v [ i ] .2. Now, apply tight compaction to X routing all entries with -labels to the front, and all entries with -labels to the end.3. Apply another instance of tight compaction to the input array I routing all real elements to the frontand all ﬁller elements to the end; let the outcome be I (cid:48) .4. Next, reverse-route the array I (cid:48) by reversing the routing decisions made in Step 2, and output theresult.Therefore, oblivious distribution can be accomplished with the same asymptotical overhead as obliviouscompaction. Just like compaction, here it also makes sense to consider a reverse-routing capability of ourdistribution algorithm. All preﬁx sums.

Given an array A of length n , an all-preﬁx-sum algorithm outputs the preﬁx sums of all n preﬁxes, i.e., A [: 1] , A [: 2] , . . . , and A [: n ] , respectively. It is promised that the sum of the entire array A can be stored in O (1) memory words. It is well-known that there is a deterministic, oblivious algorithm thatcomputes all preﬁx sums in O ( n ) work and O (log n ) depth [20]. Generalized binary-to-unary conversion.

Imagine that there are n receivers where the i -th receiver is la-beled with an indicator bit x [ i ] . We are given an integer (cid:96) ∈ { , , . . . , n } expressed in binary representation,and we want to output an array of n bits where the i -th bit represents the bit received by the i -th receiver.We want that the ﬁrst (cid:96) receivers marked with receive , and all other receivers marked with receive .The receivers marked with may receive an arbitrary bit. Note that in the special case that all receivers aremarked with , then the problem boils down to converting an integer (cid:96) ∈ { , , . . . , n } expressed in binaryrepresentation to a corresponding unary string.The generalized binary-to-unary conversion problem can easily be solved by invoking an all-preﬁx-sumcomputation on an oblivious parallel RAM, taking O ( n ) total work and O (log n ) depth . Sorting elements with ternary keys.

We will need a linear-work, and logarithmic-depth oblivious algo-rithm to sort an input array with ternary keys, as stated in the following theorem.

Theorem 4.1 (Sort elements with ternary keys) . There exists a deterministic, oblivious algorithm that cansort any input array A containing n elements each with a key from the domain { , , } in O ( n ) work and O (log n ) depth.Proof. Consider the following algorithm:

Ternary-key sorting

1. For each key b ∈ { , , } , let L b , U b ∈ [ n ] denote the starting and ending index for b if the array A were to be fully sorted. We can accomplish this by counting for each b ∈ { , , } the total numberof occurrences of b in A .2. Relying on oblivious distribution three times, we can route all elements with the key b to the posi-tions [ L b , U b ] of the output array. Output the result. We explicitly differentiate the generalized binary-to-unary conversion from the all-preﬁx-sum because it is more convenientlater for our circuit-model results. In the circuit model, the generalized binary-to-unary conversion can be solved with a circuit O ( n ) in size and O (log n ) in depth, whereas all-preﬁx sum requires a circuit O ( n log n ) in size and O (log n ) in depth (even whenthe input A is a bit array). A with ternary keys, and moreover,the algorithm completes in O ( n ) total work and O (log n ) depth.Just like compaction, here it also makes sense to consider a reverse-routing capability of our ternary-keysorting algorithm. Throughout, we assume that the array A to be sorted contains elements that are (key, payload) pairs. A keycan be expressed in k bits, and the entire element can ﬁt in O (1) memory words. Slow sorter.

We show that there is a slow sorter that sorts an array containing n elements with k -bit keys in O (2 k · n ) work and O (log n ) depth. Theorem 5.1 (Slow sorter) . Let K := 2 k . There exists a deterministic, oblivious algorithm, henceforthdenoted SlowSort K ( A ) , that can correctly sort any input array A of length n and containing elementswith k -bit keys in O ( nK ) total work and O ( k + log n ) depth.Proof. Consider the following algorithm:

SlowSort K ( A ) Input:

An array A whose length n is a power of . Every element in A has a k -bit key chosen from thedomain [0 , K − . Algorithm:

1. For each u ∈ [0 , K − in parallel, count the number of occurrences of the key u in A , andlet c u be this count. Using an all-preﬁx-sum algorithm, compute s u := (cid:80) i ∈ [0 ,u − c i for every u ∈ [1 , K − , and deﬁne s := 0 .2. Make K copies of the array A , denoted B , . . . , B K − , respectively. In each B u , the elementswhose keys are not u are replaced with ﬁller .3. For u ∈ [0 , K − :(a) In array B u , for the ﬁrst s u ﬁller elements, treat their keys as −∞ ; for every other ﬁllerelement, treat its the key as ∞ . This can be accomplished by invoking a generalized binary-to-unary conversion algorithm.(b) Invoke oblivious sorting for ternary keys to sort B u . In the resulting array denoted B (cid:48) u , theelements whose keys are equal to u will appear at positions s u + 1 , . . . , s u +1 .4. In parallel, populate the i -th element in the output array for every i ∈ [ n ] as follows: select theelement whose key is within the range [0 , K − among the elements B (cid:48) [ i ] , B (cid:48) [ i ] , . . . , B (cid:48) K − [ i ] .The selection can be accomplished by aggregating over a binary tree whose leaves are B (cid:48) [ i ] , B (cid:48) [ i ] , . . . , B (cid:48) K − [ i ] .One can easily verify that the above algorithm indeed correctly sorts in the input array. Moreover, itstotal work is bounded by O ( nK ) and its depth is bounded by O ( k + log n ) . Speciﬁcally, for the depth, the O ( k ) part upper bounds the depth of the ﬁrst step that computes the all-preﬁx-sum of K elements as well asthe last step where we select among K elements; and the O (log n ) part is an upper bound on the depth ofthe generalized binary-to-unary computation, as well as the ternary-key sorting.19 emark 1 (Reverse routing) . In the above

SlowSort algorithm, there is a way to reverse-route elements inthe output array back into their original positions in the input. Suppose that during Step 4, we remember foreach position of the output array, from which array B (cid:48) u it received an element, In this way, we can reverseStep 4 and reconstruct the arrays B (cid:48) , . . . , B (cid:48) K − from the output. Now, we can reverse the routing decisionsof the ternary sorter to reconstruct the arrays B , . . . , B K − . For each i ∈ [ n ] , there is only one B u suchthat B u [ i ] is not a ﬁller element, and this element B u [ i ] will be routed back to the i -th position of the inputarray. Clearly, the reverse routing does not cost more than the forward direction in terms of work and depth. Slow alignment.

We deﬁne a variant of the slow sorter algorithm, called

SlowAlign

K,K (cid:48) ( A ) . SlowAlign

K,K (cid:48) receives an input array A in which every element A [ i ] is not only tagged with a key A [ i ] . key from the domain [0 , K − , but also an index A [ i ] . idx which can be expressed in k (cid:48) := log K (cid:48) bits. As before, we assume thateach element, including its tagged key and index, can ﬁt in O (1) words. We want to output a permutation of A such that in the ordering the keys become consistent with the ordering of the indices in the input array. Inother words, suppose that B is the output array in which each element is tagged with only a key, then, ∀ i, j ∈ [ n ] and i (cid:54) = j : ( A [ i ] . idx < A [ j ] . idx ) = ⇒ ( B [ i ] . key ≤ B [ j ] . key ) (1) Theorem 5.2 ( SlowAlign

K,K (cid:48) ) . There is a deterministic, oblivious

SlowAlign

K,K (cid:48) ( A ) algorithm thatsolves the above alignment problem and outputs an array B that is a permutation of the input array A satisfying Equation (1) ; and moreover, the algorithm takes O (( K + K (cid:48) ) n ) total work and O (log K +log K (cid:48) + log n ) depth where n is the length of the input array.Proof. The oblivious algorithm

SlowAlign

K,K (cid:48) is described below:

SlowAlign

K,K (cid:48) ( A ) Input:

An array A of length n , and for every i ∈ [ n ] , the element A [ i ] is tagged with a key A [ i ] . key and an index A [ i ] . idx . Algorithm:

1. Call

SlowSort K ( A ) using the key ﬁeld as the key to sort the array A , and let B be the outcome.2. Call SlowSort K (cid:48) ( A [1] . idx , A [2] . idx , . . . , A [ n ] . idx ) and let idx , . . . , idx n be the resulting orderedlist of indices.3. Reverse route B by reversing the routing decisions made in Step 2.Correctness is easy to verify. For performance bounds, observe that Step 1 takes O ( nK ) work and O (log n +log K ) depth, Step 2 takes O ( nK (cid:48) ) work and O (log n +log K (cid:48) ) depth, Step 3’s work and depth are not morethan Step 2. Let (cid:15) ∈ (0 , / . We say that an array A of length n is (1 − (cid:15) ) -uniform iff except for at most (cid:15)n elements,all other elements in A have the same key — henceforth this key is said to be the dominant key.We want an algorithm that can correctly identify the dominant key when given an input array A that is (1 − (cid:15) ) -uniform. If the input array A is not (1 − (cid:15) ) -uniform, the output of the algorithm may be arbitrary. Theorem 5.3 ( FindDominant algorithm) . Suppose that n > k +7 and moreover n is a power of . Let A be an array containing n elements each with a k -bit key, and suppose that A is (1 − − k ) -uniform. Thereis a deterministic, oblivious algorithm that can correctly identify the dominant key given any such A ; andmoreover, the algorithm requires O ( n ) total work and O ( k + log n ) depth. roof. Let K := 2 k . We can call FindDominant ( A, K, n ) which is deﬁned below — since n > k +7 and n is a power of , one can verify that every recursive call to FindDominant ( B, K, n ) will have aninput B whose size is a multiple of . FindDominant ( B, K, n )

1. If | B | ≤ n/K , then call SlowSort K ( B ) and output either one of the median keys in the sortedarray. Else, continue with the following.2. Divide the array into columns of size . Obliviously sort each column using AKS [3]; and let a i , b i be the two median elements in column i , i.e., the 4th and 5th smallest elements.3. Output FindDominant (cid:0) { ( a i , b i ) } i ∈ [ | B | / , K, n (cid:1) .Henceforth, any element whose key differs from the dominant key is said to be a minority element. In theabove algorithm, for each column, if we want to make sure that both median elements are minority, wemust consume at least minority elements. If we want to make sure that one of the two median elements isminority, we must consume at least minority elements.Suppose that B is (1 − µ ) -uniform. In the array { ( a i , b i ) } i ∈ [ | B | / , the number of elements that areminority is upper bounded by µ ·| B | ; the fraction of elements that are minority is upper bounded by µ · | B | / | B | / µ/ After D := (cid:100) log K (cid:101) recursive calls, the algorithm will encounter the base case, invoke SlowSort andoutput the median. At this moment, the fraction of minority elements is upper bounded by − k · (cid:18) (cid:19) D ≤ / Therefore, outputting the median at this moment will give the correct result.

Recall that using a nearly ordered segmenter (see Section 3) to partially sort the input array, such that theresult is ( η, p ) -nearly ordered. We will show that there is an efﬁcient oblivious algorithm that can correctthe remaining errors and fully sort the array. Theorem 5.4.

Suppose that n > k +7 . There is a deterministic, oblivious algorithm that fully sorts an (2 − k , k ) -orderly array in O ( n ) total work and O (log n ) depth. Proof of Theorem 5.4:

We consider the following algorithm.

Fully sort an ( η, p ) -orderly arrayInput and parameters. The input is an array A whose length n is a power of . A is promised tobe ( η, p ) -orderly for η := 2 − k and p := 2 k , where k is a natural number such that k < log n .Henceforth we write A as A := A || A || . . . || A p where all A i s are equally sized segments. Let K :=2 k and let m := n/p . Algorithm.

1. For each segment i ∈ [ p ] in parallel:(a) Call key ∗ i := FindDominant ( A i , K, | A i | ) ;21b) Count the number of occurrences of key ∗ i in A i to decide if A i is (1 − η ) -uniform.(c) If A i is (1 − η ) -uniform, mark the following elements as “ misplaced ”: 1) all elements whosekey differ from key ∗ i ; and 2) exactly (cid:100) ηm (cid:101) number of elements with the dominant key key ∗ i .Else, mark all elements in A i as “ misplaced ”.2. All elements in A calculate which segment it falls in — note that all elements can learn its positionwithin A through an all-preﬁx-sum calculation, and the segment number can be calculated from theelement’s position within A .Call oblivious compaction to move all elements in A marked with “ misplaced ” to the front of thearray, and all other elements to the end; all elements carry their segment number in the process. Letthe outcome be called X .3. Call SlowAlign

K,K ( X [1 : 3 n/K ]) on the ﬁrst n/K elements of X , where the ﬁrst k -bits ofeach element’s segment number is used as the idx ﬁeld in the SlowAlign algorithm.4. Invoke the reverse routing algorithm of the compactor in Step 2 on the outcome of the previous step,and let Y be the outcome.5. We will now divide Y into K super-segments instead where each super-segment is composed of K original segments. Write Y := Y || Y || . . . || Y K as the concatenation of K equally sized super-segments. For each i ∈ [ K ] : check if Y i has multiple keys; if so, label the super-segment as“ multi-key ”.6. Invoke an oblivious compaction algorithm to move all the super-segments marked “ multi-key ” tothe front of the array (here the compaction algorithm treats each super-segment as an element). Letthe outcome be Z .7. Now, for each of the beginning K super-segments of Z in parallel (where each super-segment is n/K in size), use SlowSort K to sort within each super-segment.8. Finally, reverse route the outcome of the previous step by reversing the decisions made in Step 6,and output the result. Remark 2.

In the above algorithm, Steps 1b and 1c can be performed obliviously as follows:• The counting in Step 1b can be performed by aggregating over a binary-tree of depth O (log n ) .• If the segment A i is not (1 − η ) -uniform: all elements in A i label itself as “ misplaced ” (else all elementsin A i pretend to write a label for obliviousness).• Else, every element whose key differs from the dominate key key ∗ i marks itself “ misplaced ”; moreover,using a generalized binary-to-unary conversion algorithm, the ﬁrst (cid:100) ηm (cid:101) elements with the dominant key key ∗ i label itself as “ misplaced ”. All remaining elements pretend to write a label for obliviousness. Correctness.

We will now argue correctness of the above algorithm, i.e., that the result is fully sorted aslong as the input array is (2 − k , k ) -orderly. Since there are at most K distinct keys, it must be that in afully sorted array, at most K out of the p = 2 k = K segments have multiple keys, all remaining segmentsmust have a single key. Since the input array is (2 − k , k ) -orderly, it means that in the input array, all but K segments must be (1 − η ) -uniform.For any (1 − η ) -uniform segment in the input array, if we extract from it all elements whose keysdiffer from the dominant key, as well as at least (cid:100) ηm (cid:101) number of elements with the dominant key, thenall remaining elements must belong to this segment if the array were fully sorted. In Step 1c, we label all22uch elements as “ misplaced ”; as well as all elements in segments that are not (1 − η ) -uniform. The totalnumber of elements marked as “ misplaced ” is upper bounded by K · m + 2 ηm ( K − K ) ≤ n/K + 2 ηn = n/K + 2 n/K ≤ n/K Therefore, after Step 2 effectively X [1 : 3 n/K ] contains all elements marked “ misplaced ” as well assome additional elements that we want to extract, such that all remaining elements belong to their segment.Suppose that i , i , . . . , i K number of elements from each super-segment are contained in X [1 : 3 n/K ] .In Step 3 and Step 4, we move the smallest i extracted elements back to the ﬁrst super-segment, then nextsmallest i extracted elements to the second super-segment, and so on. In the outcome of Step 4, everyelement must belong to the correct super-segment.At this moment, we only need to sort the super-segments that are multi-keyed. The total number ofmulti-keyed super-segments is at most K . This is accomplished as follows: Step 6 moves all multi-keyedsuper-segments to the front, and then sorts within each of the ﬁrst K super-segments. Finally, Step 8 reverseroutes all the super-segments back to their original positions. Performance bounds.

Since by assumption, n > k +7 , then the length of each segment m := n/ k > k +7 , and therefore other assumption of Theorem 5.3 is satisﬁed and we can use Theorem 5.3 to characterizethe performance of the FindDominant step. Steps 2 and 6 each incurs O ( n ) work and O (log n ) depth.Step 3 incurs O (3 n/K · K ) = O ( n ) work and O (log n + k ) depth. Step 7 incurs O ( K · ( n/K ) · K ) = O ( n ) work and O (log n + k ) depth. The costs of all other steps are upper bounded by O ( n ) and O (log n + k ) too. Now, we can put everything together and obtain an oblivious parallel algorithm that sorts an input array withshort keys.

Theorem 5.5 (Restatement of Theorem 1.2) . There exists a deterministic oblivious parallel algorithm thatsorts any input array containing n elements each with a k -bit key in O ( n ) · min( k, log n ) total work and O (log n ) depth, assuming that each element can be stored in O (1) words.Proof. If n ≤ k +7 , we can just run AKS which takes O ( n log n ) total work an O (log n ) . Else, if n > k +7 , we can accomplish the task with the following algorithm. Sorting short keys on an oblivious PRAMInput.

An array A of length n each with a k -bit key and a payload string. We assume that n > k +7 and moreover each element can be stored in O (1) memory words. Algorithm.

1. Apply the (2 − k , k ) -orderly segmenter construction of Theorem 3.2 to the input array A , theoutcome is a permutation of A that is (2 − k , k ) -orderly.2. Apply the algorithm of Theorem 5.4 to correct the remaining errors and output the fully sorted result.Given Theorems 3.2 and 5.4, it is not hard to see that the algorithm takes O ( nk ) work and O (log n ) depth. Our result will be stated using the standard circuit model of computation [34] where the circuit consists ofAND, OR, and NOT gates; and moreover each AND and OR gate has constant fan-in and constant fan-out.23or convenience, we shall use an operational model that consists of generalized boolean gates and (re-verse) selector gates. A generalized boolean gate has constant fan-in and constant fan-out, and implementsany truth table between the inputs and outputs. A w -selector gate is a selector gate that takes in a 1-bitﬂag and two w -bit payload strings, and outputs one of the two payload strings determined by the ﬂag. A reverse selector gate is the opposite. A w -reverse selector gate takes one element x of bit-width w and a ﬂag b ∈ { , } as input and outputs ( m, w ) if b = 0 and (0 w , m ) if b = 1 . In our construction later, we willoften use reverse selector gates to preform “reverse routing”, where we reverse the routing decisions madeby earlier selector gates. Henceforth in the paper whenever we count selector and reverse selector gates, wedo not distinguish between them and count both towards selector gates.We say a circuit is in the indivisible model if and only if the input to the circuit consists of elements with k -bit keys and w -bit payloads, and the circuit never performs boolean computation on the payload strings,that is, the payload strings are only moved around using w -selector gates. Lemma 6.1 (Technical lemma about our operational circuit model) . In the indivisible model, any circuitwith n generalized boolean gates, n (cid:48) number of w -selector gates and of depth d can be implemented as aboolean circuit (having constant fan-in and constant fan-out) of size O ( n + n (cid:48) · w ) and depth O ( d + log w ) .Proof. Generalized boolean gates can be easily replaced with AND, OR, and NOT gates without incurringadditional asymptotical overhead. The key is how to instantiate all the w -selector gates without blowing upthe circuit’s depth by a log w multiplicative factor.First, imagine we have a “partial evaluation” circuit where payloads are fake and of the form w . In thisway, we can implement every w -selector gate as a degenerate one that takes O (1) depth, since the outputsare always w . Evaluating this partial evaluation circuit will populate the ﬂags on all selector gates. Noticesuch partial evaluation relies on the circuit being indivisible and thus populating a ﬂag is independent of anyresult of any w -selector.Since we are subject to constant fan-in and constant fan-out gates, to implement an actual selector gatewill require replicating the gate’s ﬂag w times, and then use w generalized boolean gates, one for selectingeach bit. After the partial evaluation phase, all selector gates can perform this w -way replication in parallel,incurring an additive rather than multiplicative log w overhead. At this point, we can instantiate each w -selector gate using one generalized boolean gate for each of the w bits being selected.Therefore, the total circuit size is ( n + n (cid:48) · w ) and the depth is O ( d + log w ) .We deﬁne some useful circuit gadgets below. Comparator. A k -bit comparator takes two values each of k bits, and outputs an answer from a constant-sized result set such as { >, <, = } , or {≥ , < } , or {≤ , > } . Note that the outcome can be expressed as 1 to 2bits. Fact 6.2. A k -bit comparator can be implemented with a circuit with O ( k ) generalized boolean gates and O (log k ) depth. Delayed-carry representation and shallow addition.

Adding two (cid:96) -bit numbers in binary representationtakes O (log (cid:96) ) depth. We will later need adders that are constant in depth. To do this, we can use a Wallace-tree-like trick and adopt a delayed-carry representation of numbers.We represent an (cid:96) -bit number v as the addition of two (cid:96) -bit numbers, i.e., v := x + y . Here, it mustbe that the sum x + y ≤ (cid:96) − can still be presented as (cid:96) -bits; moreover, the delayed-carry representationof v is not unique. Given two (cid:96) -bit numbers v := x + y and v := x + y in this delayed-carryrepresentation, we can compute the ( (cid:96) + 1) -bit number v + v as follows where the answer is also indelayed-carry representation: 24. Compute a delayed-carry representation of x + y + x , and let the result be x (cid:48) + y (cid:48) . This can be doneby summing up the i -th bit of x , y , and x respectively, for each i ∈ [ (cid:96) ] . For each i ∈ [ (cid:96) ] , the sum ofthe three bits can be expressed as a 2-bit number, where the ﬁrst bit becomes the i -th bit of x (cid:48) and theother bit becomes the ( i + 1) -st bit of y (cid:48) .2. Now, using the same method, compute and output a delayed-carry representation of x (cid:48) + y (cid:48) + y .The above can be accomplished with O ( (cid:96) ) generalized boolean gates and in O (1) depth. Henceforth this iscalled a shallow addition . Counting.

We will need a simple circuit gadget that counts the number of s in an input array containing n bits. Fact 6.3.

Given an input array containing n bits, counting the number of 1s in the input array can berealized with a circuit of size O ( n ) and depth O (log n ) .Proof. We can use the algorithm in Fact 4.3 of Asharov et al. [9], but use the delayed-carry representationof numbers, and replace all adders with shallow adders. Essentially, the numbers are added over a binarytree, where in the leaf level (also called the last level), every number is promised to be at most -bit long; inthe second to last level, every number is promised to be at most -bit long; and so on. In this way, the totalcircuit size for the entire tree of adders is O ( n ) . At the end of the algorithm, we perform a ﬁnal addition toconvert the delayed-carry representation of the answer to a binary representation. All preﬁx sums.

We consider an all-preﬁx-sum circuit gadget, which upon receiving an input A containing n non-negative integers, outputs the sums of all n preﬁxes, that is, A [: 1] , A [: 2] , A [: 3] , . . . , A [: n ] . It ispromised that the sum of the entire array A can be stored in (cid:96) bits. Fact 6.4.

For any (cid:96) ≤ n , there is a circuit composed of at most O ( n(cid:96) ) generalized boolean gates and ofdepth O (log n ) that solves the aforementioned all-preﬁx-sum problem.Proof. We can use the standard preﬁx sum algorithm, but represent all numbers using the delayed-carryrepresentation, and use shallow addition which can be computed in constant depth.

AllPreﬁxSum ( A ) Input:

An array A containing n bits, where n is a power of . We assume that each bit A [ i ] isrepresented in a delayed-carry representation as the sum of A [ i ] and . Algorithm:

1. If n = 1 , return the only element of A . Else proceed with the following2. Let A (cid:48) be the array of length n/ containing sums of adjacent pairs in A . A (cid:48) can be computed from A using n/ shallow additions.3. Compute S := AllPreﬁxSum ( A (cid:48) ) .4. Compute the all-preﬁx-sum array for A from S , ﬁlling the gaps by performing n/ shallow addi-tions.If we run the AllPreﬁxSum algorithm using the delayed-carry representation, the outcome will be n preﬁx sums where the i -th preﬁx sum is expressed the sum of two numbers, s i and t i . Finally, we cancompute s i + t i in parallel for all i ∈ [ n ] in parallel, taking O (log (cid:96) ) ≤ O (log n ) depth. The entire circuitfor computing all n preﬁx sums takes O ( n(cid:96) ) generalized boolean gates and O (log n ) depth.25 eneralized binary-to-unary conversion. The generalized binary-to-unary conversion problem has beendeﬁned earlier in Section 4. Earlier, we also showed how to solve it on an oblivious PRAM in linear totalwork and logarithmic depth. It turns out that it is a little trickier if we want to accomplish the same with alinear-sized and logarithmic depth circuit. This is because on a PRAM, arithmetic and boolean operationson log n bits can be performed in unit cost, whereas in a circuit model, we charge O (log n ) .We can solve the generalized binary-to-unary conversion problem with the following algorithm. Withoutloss of generality, we can assume that n is a power of ; if not, we can round n up to the nearest power of . Generalized binary-to-unary conversion circuit

1. First, we apply the counting circuit of Fact 6.3 to the input array x . Speciﬁcally, we compute thesum over a binary tree using the delayed-carry representation of numbers. At the end of this step,every tree node stores the sum of its subtree, in delayed-carry representation.Henceforth, let S ( v ) denote the sum of the subtree rooted at the node v . We may assume that allnumbers below use a delayed-carry representation.2. For convenience, assume that the root receives (cid:96) from an imaginary parent.From level i = 0 to log n − : every node in level i performs the following. Let S be the numberreceived from its parent, and let lc and rc denote the node’s left child and right child, respectively.Send S to lc and send S − S ( lc ) to rc .3. For convenience, assume that the root receives the label “ M ” from an imaginary parent.From level i = 0 to log n − , every node in level i does the following where lc and rc denote its leftchild and right child, respectively:• If the label received from its parent is not “ M ”, then pass the label to both children;• Else, let S be the number received earlier from its parent. – if S ≥ S ( lc ) then pass “1” to lc and pass “ M ” to rc ; – else, pass “ M ” to lc and pass “0” to rc .4. If a leaf node receives “0” or “1” from the parent, then output the corresponding bit. Otherwise, let S be the 1-bit number received from the parent, output S . Implementation as a circuit.

All numbers use a delayed-carry representation. Let v := x + y and v := x + y be two (cid:96) -bit numbers in delayed-carry representation, and suppose that v ≥ v . Then, v − v can be derived by computing x + x + y + y + 2 and keeping only the last (cid:96) bits, where y b denotes the number obtained by ﬂipping all bits of y b for b ∈ { , } . Therefore, we can use the shallowaddition trick to compute subtraction. Of course, before a node receives the label from { M , , } fromits parent, it is not guaranteed that S ≥ S ( lc ) , but we can just pretend it will be the case and continue.Therefore, Step 2 can be implemented in O (log n ) depth.Step 3 must be implemented in a pipelined manner to save depth: basically, as soon as a nodereceives the number S from its parent during Step 2, it immediately starts to compute the comparisonbetween S and S ( lc ) which takes O (log log n ) depth. In other words, the nodes do not wait for itsparent to compute this comparison before it computes its own comparison, but rather pre-computes thiscomparison ahead of time. Using this pipelining trick, Step 3 can also be accomplished in O (log n ) depth. 26inally, observe that S is at most log n + 1 bits at the root; and at level i it is at most log n + 1 − i bits. Therefore, the above can be implemented with an O ( n ) -sized circuit.This gives rise to the following fact. Fact 6.5.

There is a circuit with O ( n ) generalized boolean gates and of O (log n ) depth that solves theaforementioned generalized binary-to-unary conversion problem. Lossy loose compactor.

Let α ∈ [0 , . An ( α, n, w ) -lossy loose compactor (also written as α -lossy loosecompactor when n and w are clear from the context) solves the following problem:• Input : an array I containing n elements of the form { ( b i , v i ) } i ∈ [ n ] , where each b i ∈ { , } is a metadatabit indicating whether the element is real or ﬁller , and each v i ∈ { , } w is the payload . The input arrayis promised to have at most n/ real elements .• Output : an array O containing (cid:98) n/ . (cid:99) elements, such that mset ( O ) ⊆ mset ( I ) , and moreover, | mset ( I ) − mset ( O ) | ≤ αn where mset ( O ) denotes the multiset of real elements contained in O ,and mset ( I ) is similarly deﬁned.In other words, lossy loose compaction takes a relatively sparse input array containing only a smallconstant fraction of real elements; it compresses the input to slightly more than half its original length while preserving all but α · n real elements in the input. Loose compactor. If α = 0 , i.e., there is no loss, we also call it a loose compactor. More formally, an (0 , n, w ) -lossy loose compactor is also called an ( n, w ) -loose compactor. In this subsection, we will prove the following theorem.

Theorem 7.1.

Let c > be an arbitrary constant. There is a circuit in the indivisible model with O ( n log log n ) generalized boolean gates, O ( n ) number of w -selector gates, and of depth O (log log n ) that realizes an ( c n , n, w ) -lossy loose compactor. To prove the above theorem, we describe how to implement lossy loose compaction as a low-depthcircuit. Our construction is almost the same as the loose compactor circuit described by Asharov et al. [9],except that we now run the algorithm for fewer (i.e., c log log n ) iterations rather than O (log n ) iterations).Because we omit some iterations, we end up losing a small fraction of elements during the loose compaction.We describe the algorithm below. Expander graphs.

The construction will rely on a suitable family of bipartite expander graphs denoted { G (cid:15),m } m ∈ SQ where SQ ⊆ N is the set of perfect squares. The parameter (cid:15) ∈ (0 , is a suitable constantreferred to as the spectral expansion . The graph G (cid:15),m has m vertices on the left henceforth denoted L , and m vertices on the right henceforth denoted R , and each vertex has d := d ( (cid:15) ) number of edges where d is aconstant that depends on (cid:15) . We give additional preliminaries on expander graphs in Appendix A.Without loss of generality, we may assume that d is a multiple of since we can always consider thegraph that duplicates each edge times. It is not exactly half the original length due to rounding issues — See Remark 3. onstruction. The input array is grouped into chunks of d/ size. Chunks that have at most d/ elements(i.e., a quarter loaded) are said to be sparse and chunks that have more than d/ elements are said to be dense .The idea is to ﬁrst distribute the dense chunks such that there are only very few dense chunks after this step.Then, we can easily compact each chunk separately. When the remaining dense chunks are compressed, weend up losing some elements.The challenge is how to distribute the dense chunks. We can consider the chunks to be left-vertices inthe bipartite expander graph G (cid:15),m . Each dense chunk wants to distribute its real elements to its neighborson the right, such that each right vertex receives no more than d/ elements, i.e., each vertex on the rightis a sparse chunk too. At this moment, we can replace dense chunks on the left with ﬁller elements —for almost all of these dense chunks, their real elements have moved to the right. For the remaining densechunks, replacing them with ﬁller causes some elements to be lost. Now that all chunks are sparse, and wecan compress each chunk on the left and the right to a quarter its original size. All compressed chunks areconcatenated and output, and the output array is a half the length of the input.The distribution of the real elements to its neighbors on the right requires some care, as we have toguarantee that no node on the right will become dense. We will have to compute on which subset of edgeswe will route the real elements. This is done via the procedure ProposeAcceptFinalize described below.

ProposeAcceptFinalize subroutine.

We now describe the

ProposeAcceptFinalize subroutine which is thekey step to achieve the aforementioned distribution of dense chunks. To make the description more intuitive,henceforth we call each vertex in L a factory and each vertex in R a facility . Initially, imagine that the densevertices correspond to factories that manufacture at most d/ products, and the sparse vertices are factoriesthat are unproductive. There are at most m/ productive factories, and they want to route all their productsto facilities on the right satisfying the following constraints: 1) each edge can route only 1 product; and 2)each facility can receive at most d/ products. The ProposeAcceptFinalize algorithm described below ﬁndsa set of edges M to enable such routing, also called a feasible route as explained earlier. ProposeAcceptFinalize subroutine

Initially, each productive factory is unsatisﬁed and each unproductive factory is satisﬁed . For a pro-ductive factory u ∈ L , we use notation load ( u ) to denote the number of products it has (correspondingto the number of real elements in the chunk). Algorithm:

Repeat the following for iter times and output the resulting matching M at the end:(a) Propose:

Each unsatisﬁed factory sends a proposal (i.e., the bit 1) to each one of its neighbors.Each satisﬁed factory sends 0 to each one of its neighbors.(b)

Accept:

If a facility v ∈ R received no more than d/ proposals, it sends an acceptance messageto each one of its d neighbors; otherwise, it sends a reject message along each of its d edges.(c) Finalize:

Each currently unsatisﬁed factory u ∈ L checks if it received at least d acceptance mes-sages. If so, add the set of edges over which acceptance messages are received to the matching M . At this moment, this factory becomes satisﬁed.Notice that for a facility v ∈ R , the proposals it receives in iteration i + 1 is a subset of the proposals itreceives in iteration i . Therefore, once v starts accepting in some iteration i , it will also accept all proposalsreceived in future rounds i + 1 , i + 2 , . . . too, if any proposals are received. Moreover, the total number ofproduct v receives will not exceed d/ . Pippenger [33] and Asharov et al. [8] showed the following fact: Fact 7.2 (Pippenger [33] and Asharov et al. [8]) . There exist an appropriate constant (cid:15) ∈ (0 , and a bipar-tite expander graph family { G (cid:15),m } m ∈ N where each vertex has d edges for a constant d := d ( (cid:15) ) assumed tobe a multiple of , such that for any m ∈ SQ ⊆ N , at the end of the above ProposeAcceptFinalize procedure hich runs for iter iterations (and assuming it is instantiated with the family of graphs { G (cid:15),m } m ∈ N ), thefollowing must hold:1. at most m/ iter vertices in L remain unsatisﬁed;2. every satisﬁed vertex in u ∈ L has at least d/ edges in the output matching M ;3. for every vertex in v ∈ R , the output matching M has at most d/ edges incident to v . Given the

ProposeAcceptFinalize subroutine, we can realize a / log c n -lossy loose compaction as fol-lows where c > denotes a constant. / (log n ) c -Lossy Loose Compaction • Input:

An input array I of n elements, in which at most n/ are real and the rest are dummies.• Assumption : Without loss of generality, we assume that d is a multiple of . Further, we assumethat m is a perfect square and that n is a multiple of d/ ; henceforth let m := n/ ( d/

2) = 2 n/d .The algorithm can be easily generalized to any choice of n — see Remark 3.• The algorithm :1. Divide I into m chunks of size d/ . If a chunk contains at most d/ real elements (i.e., at mosta quarter loaded), it is said to be sparse ; otherwise it is said to be dense . It is not hard to see thatthe number of dense chunks must be at most m/ .2. Now imagine that each chunk is a vertex in L of G (cid:15),m , and D ⊂ L is a set of dense vertices(i.e., corresponding to the dense chunks). Let edges ( D, R ) denote all the edges in G (cid:15),m between D ⊂ L and R .Let D be the subset of productive factories, and run the ProposeAcceptFinalize subroutine for c log log n iterations. The outcome is a subset of edges M ⊆ edges ( D, R ) that satisfy Fact 7.2,where the fraction of unsatisﬁed chunks is / log c n .3. Now, every dense chunk u in D does the following: for each of an arbitrary subset of load ( u ) ≤ d/ outgoing edges of u in M , send one element over the edge to a corresponding neighbor in R ; for all remaining out edges of u , send a ﬁller element on the edge.Every vertex in R receives no more than d/ real elements. Henceforth we may consider everyvertex in R as a sparse chunk, i.e., an array of capacity d/ but containing only d/ real elements.4. At this moment, for each dense chunk in L , replace the entire chunk with d/ ﬁller elements.5. Now, all chunks in L and in R must be sparse, that is, each chunk contains at most d/ realelements, while its size is d/ . We now compress each chunk in L and R to a quarter of itsoriginal size (i.e., to size d/ in length), losing few elements in the process (we will bound thenumber of lost elements later).Output the compressed array O , containing of m · d = 2 · nd · d = n/ elements. Proposition 7.3 (Pippenger [33] and Asharov et al. [8]) . There exists an appropriate constant (cid:15) ∈ (0 , and a bipartite expander graph family { G (cid:15),m } m ∈ N where each vertex has d edges for a constant d := d ( (cid:15) ) ,such that for any perfect square m and n = md/ , the above lossy loose compaction algorithm, wheninstantiated with this family of bipartite expander graph, can correctly compress any input array of length n to a half its original size losing at most n/ log c n real elements, as long as the input array has at most n/ real elements. emark 3. In the above, we assumed that n is divisible by d/ . If n = dm/ where m is a perfect square.In case this is not the case, we can always round n up to the next integer that satisﬁes this requirement; thisblows up n by at most a o (1) factor. This is why in our deﬁnition of lossy loose compactor, the outputsize is allowed to be (cid:98) n/ . (cid:99) rather than (cid:98) n/ (cid:99) , assuming that n is sufﬁciently large. Implementing the algorithm in a low-depth circuit.

Since our lossy loose compactor algorithm is almostthe same as Asharov et al.’s loose compactor [9], we can implement the algorithm as a circuit in almost thesame way as described by Asharov et al. [9], except that we run fewer iterations. It is not hard to check thatthe resulting circuit has O ( n log log n ) generalized boolean gates, O ( n ) number of w -selector gates, and hasdepth O (log log n ) . Approximate swapper. An ( n, w ) -approximate swapper obtains an input array where each element ismarked with a label that is ⊥ , blue , or red . Let n red and n blue denote the number of red and blue elements,respectively. The ( n, w ) -approximate swapper circuit swaps a subset of the blue elements with red onesand the swapped elements receive the label ⊥ . We call elements marked red or blue colored and thosemarked ⊥ uncolored .Formally, an ( n, w ) -approximate swapper solves the following problem:• Input : an input array containing n elements where each element contains a w -bit payload string and atwo-bit metadata label whose value is chosen from the set { blue , red , ⊥} . Henceforth we assume theﬁrst bit of the label encodes whether the element is colored or not, and the second bit of the label picks acolor between blue and red if the element is indeed colored.• Output : a legal swap of the input array such that at most n/ | n red − n blue | elements remain colored,where the notion of a legal swap is deﬁned below.We say that an output array O is a legal swap of the input array I iff there exist pairs ( i , j ) , ( i , j ) , . . . , ( i (cid:96) , j (cid:96) ) of indices that are all distinct, such that for all k ∈ [ (cid:96) ] , I [ i k ] and I [ j k ] are colored and have opposite colors,and moreover O is obtained by swapping I [ i ] with I [ j ] , swapping I [ i ] with I [ j ] , . . . , and swapping I [ i k ] with I [ j k ] ; further, all swapped elements become uncolored. Theorem 8.1 (Approximate swapper) . There exists an ( n, w ) -approximate swapper circuit containing O ( n ) generalized boolean gates and O ( n ) number of w -selector gates, and of depth O (1) .Proof. We can use Algorithm 6.10 in Asharov et al. [8]: their algorithm is described for the oblivious PRAMmodel, and achieves O ( n ) work and O (1) depth. It is straightforward to check that the same algorithm canbe implemented as a circuit with O ( n ) generalized boolean gates, O ( n ) number of w -selector gates, and O (1) in depth. Note that Algorithm 6.10 in Asharov et al. [8] needs to compute the decomposed perfectmatchings on the ﬂy since their oblivious PRAM algorithm is uniform; however, we do not need to computethe matchings on the ﬂy in the circuit model, since the circuit model is non-uniform. Swapper.

A swapper is deﬁned in almost the same way as an approximate swapper, except that we requirethat the remaining colored elements do not exceed | n red − n blue | . In other words, if initially, the number of red elements equals the number of blue elements, then the swapper must swap every red element with adistinct blue element, leaving no colored elements behind. Theorem 8.2 (Slow swapper) . There exists an ( n, w ) -swapper circuit (henceforth denoted SlowSwap ( · ) )with O ( n log n ) generalized boolean gates, and O ( n log n ) number of w -selector gates , and whose depthis O (log n ) . roof. We can use the following algorithm:1. Use an AKS sorting circuit [3] to sort the input array such that all the red elements are in the front; andall the blue elements are at the end. Let the result be X .2. For each i ∈ , , . . . , (cid:98) n/ (cid:99) in parallel: if X [ i ] is marked red and X [ n + 1 − i ] is marked blue , thenswap X [ i ] and X [ n + 1 − i ] and mark both elements as uncolored.3. Reverse route the resulting array by reversing the decisions made by the AKS network in Step 1, andoutput the result.Since the AKS sorting network performs comparisons on labels that are at most 2-bits long, the entirealgorithm can be implemented as a circuit with O ( n log n ) generalized boolean gates, O ( n log n ) numberof w -selector gates, and of depth O (log n ) . Approximate splitter.

Let β ∈ (0 , / and let α ∈ (0 , . An ( α, β, n, w ) -approximate splitter (also writ-ten as ( α, β ) -approximate splitter when n and w are clear from the context) solves the following problem:suppose we are given an input array I containing n elements where each element has a w -bit payload and a1-bit label indicating whether the element is distinguished or not. It is promised that at most β · n elementsin I are distinguished. We want to output a permutation (denoted O ) of the input array I , such that at most αn distinguished elements are not contained in the ﬁrst (cid:98) βn + n/ (cid:99) positions of O . Theorem 8.3 (Approximate splitter from lossy loose compaction) . Suppose that there is an ( α, n, w ) -lossyloose compaction circuit with B lc ( n ) generalized boolean gates and S lc ( n ) w -selector gates, and of depth D lc ( n ) . Suppose also that there is an O (1) -depth approximate swapper circuit with B sw ( n ) generalizedboolean gates and S sw ( n ) w -selector gates for an input array containing n element each of bit-width w .For any constant β ∈ (0 , / , there is a (8 α, β, n, w ) -approximate splitter circuit with at most . S sw ( n )+5 S lc ( n ) + O ( n ) number of w -selector gates, . S sw ( n ) + 2 . B sw ( n ) + 2 . B lc ( n ) + 10 S lc ( n ) + O ( n ) gen-eralized boolean gates, and has depth at most . α · ( D lc ( n ) + O (1)) . Proof of Theorem 8.3:

Consider the following algorithm.

Approximate splitter from lossy loose compaction Color.

Any distinguished element not contained in the ﬁrst (cid:98) βn + n/ (cid:99) positions of X are colored blue . Any non-distinguished element contained in the ﬁrst (cid:98) βn + n/ (cid:99) positions of X are colored red . Observe that n red ≥ n blue + n/ , where n red and n blue denote the number of red and blue elements, respectively.Note that at this moment, each element in X is labeled with 3 bits of metadata, one bit of dis-tinguished indicator and two bits of color-indicator (indicating whether the element is colored oruncolored, and if so, which color).2. Swap.

Call

Swap n ( X ) deﬁned below to swap the blue elements with red elements except fora small residual fraction (here we use a payload of size w + 1 and not w as we also include thedistinguished-indicator as part of the payload). Return the outcome.We now describe the Swap n ( · ) subroutine. 31 wap n ( X ) • Input:

An array X of m ≤ n elements, each has a w -bit payload and a 2-bit label indicatingwhether the element is colored, and if so, whether the element is blue or red . n is the size of theoriginal problem when Swap is ﬁrst called; the same n will be passed into all recursive calls sinceit is used to decide when the recursion stops. It is promised that m red ≥ m blue + m/ where m red and m blue denote the number of red and blue elements in the input array X , respectively.• Algorithm: (a)

Base case. If m ≤ αn , then return X ; else continue with the following steps.(b) Approximate swapper.

Call an ( m, w ) -approximate swapper (see Theorem 8.1) on X to swapelements of opposite colors and uncolor them in the process, such that at most m/

128 + m red − m blue elements remain colored. Let the outcome be called X (cid:48) .(c) Lossy-extract blue.

Call an ( α, m, w + 1) -lossy loose compactor to compact X (cid:48) by a half, wherethe lossy loose compactor treats the blue elements as real, and all other elements as ﬁllers (i.e.,the loose compactor treats the second bit of the color label as a real-ﬁller indicator, and the ﬁrstbit of the color label is treated as part of the payload).Let the outcome be Y blue whose length is (cid:98)| X | / . (cid:99) .(d) Extract red.

Let (cid:15) (cid:48) = 1 / . Apply an (cid:15) (cid:48) -near-sorter (deﬁned in Section 3.3.1) to the array X (cid:48) treating all red elements as smaller than all other elements. Let Y red be the ﬁrst (cid:98) m/ (cid:99) elements of the resulting near-sorted array. Mark every non- red element in Y red as uncolored,and let Y := Y red || Y blue .(e) Recurse.

Recursively call

Swap n ( Y ) , and let the outcome be Y (cid:48) .(f) Reverse route.

Reverse the routing decisions made by all selector gates during Steps (c) and (d)(see Remark 4). Speciﬁcally, – pad Y (cid:48) [: (cid:98) m/ (cid:99) ] with a vector of ﬁllers to a length of m and reverse-route the padded arrayby reversing the decisions of Step (d) — let Z red be the outcome; – reverse-route Y (cid:48) [ (cid:98) m/ (cid:99) + 1 :] by reversing the decisions of Step (c), resulting in Z blue .Note that both Z blue and Z red have length m , i.e., length of the input to this recursive call.(g) Output.

Return O which is formed by a performing coordinate-wise select operation between X (cid:48) , Z red , and Z blue . For every i ∈ [ m ] : – if X (cid:48) [ i ] originally had a blue element and the element was not lost during Step (c), then let O [ i ] := Z blue [ i ] ; – if X (cid:48) [ i ] originally had a red element and Z red [ i ] is not a ﬁller, then let O [ i ] := Z red [ i ] ; – else let O [ i ] := X (cid:48) [ i ] ; Our approximate splitter algorithm actually requires a swapper where elements are of bit-length w +1 , but for conveniencewe rename the variable to w in the description of the swapper. Remark 4 (Reverse routing details) . For every selector gate g in Steps (c) and (d), its reverse selector gatedenoted g (cid:48) is one that receives a single element as input and outputs two elements; the same control bit b input to the original gate g is used by g (cid:48) to select which of the output receives the input element, and theother one will simply receive a ﬁller element. If g selected the ﬁrst input element to route to the output,then in g (cid:48) , the input element is routed to the ﬁrst output. Fact 8.4.

Suppose that n is greater than a sufﬁciently large constant. If a call to Swap n ( X ) does not hit thebase case, then, in the next recursive call to Swap n ( Y ) in Step (e), m (cid:48) := | Y | ≤ m . + m . Therefore, the ecursive call will hit the base case after at most (cid:6) log c α (cid:7) steps of recursion where c := 1 / (1 / . / > . . Fact 8.5.

Suppose that n is greater than a sufﬁciently large constant. If the condition m red ≥ m blue + m/ is satisﬁed at the beginning of some call Swap n ( X ) , then if and when the function makes a recursive callto Swap n ( Y ) , the same condition is satisﬁed by the array Y .Proof. If the execution does not trigger the base case, since n is greater than a sufﬁciently large constant, m must be greater than a sufﬁciently large constant too.Suppose the inequality is satisﬁed at the beginning of the recursive call. Then, after Step (b), at most m/ elements are blue , and at least m/ elements are red . After Step (d), due to the property of thenear-sorter, Y red has at least (1 − (cid:15) (cid:48) ) · ( m/ red elements. As long as m is greater than some appropriateconstant, in the next recursive call to Swap n ( Y ) in Step (e), m (cid:48) := | Y | ≤ m . + m . Let m (cid:48) red and m (cid:48) blue be the number of red and blue elements in Y respectively. We have that m (cid:48) blue ≤ m/ and m (cid:48) red ≥ (1 − (cid:15) (cid:48) ) · ( m/ . Therefore, m (cid:48) red − m (cid:48) blue m (cid:48) ≥ (1 − (cid:15) (cid:48) ) · ( m/ − m/ m . + m > / Fact 8.6.

Assume that n is greater than a sufﬁciently large constant. The remaining number of coloredelements at the end of the algorithm is at most αn + n red − n blue .Proof. The number of blue elements remaining is equal to the total number of blue elements lost duringall executions of Step (c), plus the size of the base case αn . Let c := 1 / (1 / . / . The total numberof elements lost during all executions of Step (c) is upper bounded by αn + αn/c + αn/c + . . . ≤ αn .Therefore, the total number of blue elements remaining is upper bounded by αn . This means that the totalnumber of colored elements remaining is at most αn + n red − n blue .Clearly, Step 1 of the algorithm takes only n number of generalized boolean gates. We now discuss howto implement Step 2 as a circuit. Implementing Step 2 in circuit.

This step is accomplished through recursive calls to

Swap on arrays oflength n (cid:48) := n, n/c, n/c , . . . , where c := 1 / (1 / . / . The recursion stops when n (cid:48) < αn . For eachlength n (cid:48) , we consume an approximate swapper, a loose compactor, an (cid:15) (cid:48) -near-sorter, and the reverse-routingcircuitry of the loose compactor and the (cid:15) (cid:48) -near-sorter. Thus for each problem size n (cid:48) = n, n/c, n/c , . . . ,we need• S sw ( n (cid:48) ) number of ( w + 1) -selector gates and B sw ( n (cid:48) ) number of generalized boolean gates corre-sponding to Step (b);• S lc ( n (cid:48) ) number of ( w + 2) -selector gates (one for the forward direction and one for the reversedirection) and B lc ( n (cid:48) ) generalized boolean gates corresponding to Step (c);• O ( n (cid:48) ) number of generalized boolean gates and O ( n (cid:48) ) number of ( w +2) -selector gates due to Step (d)and its reverse routing; and• O ( n (cid:48) ) generalized boolean gates and O ( n (cid:48) ) number of w -selector gates due to Step (g).Note that each ( w + 1) -selector gate can be realized with one w -selector gate that operates on the w -bitpayload and one generalized boolean gate that computes on the extra metadata bit; further, during the reverserouting, the metadata generalized boolean gate can also be used to save whether each output is a ﬁller. Thuseach problem size n (cid:48) can be implemented with S sw ( n (cid:48) ) + 2 S lc ( n (cid:48) ) + O ( n (cid:48) ) number of ( w + 1) -selector gates33nd B sw ( n (cid:48) ) + B lc ( n (cid:48) ) + 2 S lc ( n (cid:48) ) + O ( n (cid:48) ) generalized boolean gates. Replacing each ( w + 1) -selector gatewith a w -selector gate and a generalized boolean gate, we have that each problem size n (cid:48) can be implementedwith S sw ( n (cid:48) ) + 2 S lc ( n (cid:48) ) + O ( n (cid:48) ) number of w -selector gates and S sw ( n (cid:48) ) + B sw ( n (cid:48) ) + B lc ( n (cid:48) ) + 4 S lc ( n (cid:48) ) + O ( n (cid:48) ) generalized boolean gates.Summing over all n (cid:48) = n, n/c, n/c , . . . , we have the follow fact: Fact 8.7.

In the above approximate splitter algorithm, the total number of w -selector gates needed is upperbounded by . S sw ( n )+5 S lc ( n )+ O ( n ) and the total number of generalized boolean gates is upper boundedby . S sw ( n ) + 2 . B sw ( n ) + 2 . B lc ( n ) + 10 S lc ( n ) + O ( n ) ; furthermore, the depth is upper bounded by log .

79 1 α · (2 D lc ( n ) + O (1)) ≤ . α · ( D lc ( n ) + O (1)) . In this section, we show how to construct a circuit for lossy loose compaction from an approximate splitter.

Theorem 9.1.

Let f ( n ) be some function in n such that < f ( n ) ≤ log n holds for every n greaterthan an appropriate constant; let C sp > be a constant. Fix some α ∈ (0 , which may be a functionof n . Suppose that for any β ∈ (0 , / , for any n that is greater than an appropriately large constant, ( α, β, n, w ) -approximate splitter can be solved by a circuit with C sp · n · f ( n ) generalized boolean gates, C sp · n selector gates, and of depth D sp ( n ) . Then, for any n greater than an appropriately large constant, (1 . α, n, w ) -lossy loose compaction can be solved by a circuit with . C sp · n · f ( f ( n ))+ O ( n ) generalizedboolean gates, . C sp · n number of w -selector gates, and of depth . D sp ( n ) + O (log f ( n )) . The remainder of this section will be dedicated to proving the above theorem.

Proof of Theorem 9.1:

For simplicity, we ﬁrst consider the case when n is divisible by f ( n ) . Lookingahead, we will use f ( n ) to be log ( x ) n for some x that is power of . We will later extend our theoremstatement to the case when n is not divisible by f ( n ) . Consider the following algorithm: Lossy loose compaction from approximate splitter

1. Divide the input array into f ( n ) -sized chunks. We say that a chunk is sparse if there are at most f ( n ) / real elements in it; otherwise it is called dense . Now, count the number of elements inevery chunk, and mark each chunk as either sparse or dense . We will show later in Fact 9.2 thatat least / fraction of the chunks are sparse.2. Call an ( α, / , n/f ( n ) , w · f ( n )) -approximate splitter to move almost all the dense chunks to thefront and almost all the sparse chunks to the end.3. Apply an ( α, / , f ( n ) , w ) -approximate splitter to the trailing (cid:108) ( − ) · nf ( n ) (cid:109) chunks to com-press each of these chunks to a length of (cid:106) f ( n )64 (cid:107) , losing few elements in the process. The ﬁrst (cid:106) ( + ) · nf ( n ) (cid:107) chunks are unchanged. Output the resulting array.At the end of the algorithm, the output array has length at most ( 34 −

164 ) · nf ( n ) · f ( n )64 + ( 14 + 164 ) · nf ( n ) · f ( n ) ≤ . n < n/ . (2) Fact 9.2.

At least · nf ( n ) chunks are sparse. roof. Suppose not, this means that more than · nf ( n ) have more than f ( n ) / real elements. Thus thetotal number of elements is more than n/ which contradicts the input sparsity assumption of loosecompaction. Fact 9.3.

The above algorithm loses at most . αn real elements.Proof. If a dense chunk is not contained within the ﬁrst (cid:106) ( + ) · nf ( n ) (cid:107) chunks, we may assume that allelements in it will be lost. Due to the property of the approximate splitter, at most αn/f ( n ) dense chunksare not contained within the ﬁrst (cid:106) ( + ) · nf ( n ) (cid:107) chunks. Further, when we apply an approximate splitterto compress the trailing (cid:108) ( − ) · nf ( n ) (cid:109) chunks to each to a length of (cid:106) f ( n )64 (cid:107) , for each chunk, we maylose at most αf ( n ) real elements.Therefore, the number of real elements lost is upper bounded by the following as long as n is greaterthan an appropriate constant: α · ( n/f ( n )) · f ( n ) + αf ( n ) · (cid:24) ( 34 −

164 ) · nf ( n ) (cid:25) ≤ . αn Implementing the above algorithm in circuit.

We now analyze the circuit size of the above algorithm.For simplicity, we ﬁrst assume that n is divisible by f ( n ) and we will later modify our analysis to the moregeneral case when n is not divisible by f ( n ) .1. Due to Fact 6.3, Step 1 of the algorithm requires at most O ( n ) generalized boolean gates as we have n/f ( n ) counters each for a chunk of size f ( n ) . The counting for all chunks are performed in parallel,and thus the depth is O (log f ( n )) .2. Step 2 is a single invocation of an ( α, / , n/f ( n ) , w · f ( n )) -approximate splitter. Assuming that ( α, / , n, w ) -approximate splitter can be realized with C sp · n · f ( n ) generalized boolean gates and C sp · n selector gates, this step requires at most C sp · ( n/f ( n )) · f ( n/f ( n )) ≤ C sp · ( n/f ( n )) · f ( n ) generalized boolean gates and C sp · n/f ( n ) number of w · f ( n ) -selector gates. Each such selector gatecan in turn be realized with f ( n ) number of w -selector gates; moreover, the ﬂag bit needs to be replicated f ( n ) times over a binary tree, requiring O (log f ( n )) depth and O ( f ( n )) generalized boolean gates perchunk. Thus, in total, Step 2 requires C sp · n + O ( n ) generalized boolean gates, C sp · n number of w -selector gates, and requires at most D sp ( n ) + O (log f ( n )) depth.3. Step 3 of the algorithm requires applying (cid:108) ( − ) · nf ( n ) (cid:109) number of ( α, / , f ( n ) , w ) -approximatesplitters, where, according to our assumption in Theorem 9.1, each such approximate splitter consumes C sp · f ( n ) · f ( f ( n )) generalized boolean gates and C sp · f ( n ) number of w -selector gates. For sufﬁcientlylarge n and f ( n ) ≤ log n , we have that (cid:108) ( − ) · nf ( n ) (cid:109) · f ( n ) ≤ n . Therefore, in total there are atmost C sp · n · f ( f ( n )) generalized boolean gates and C sp · n number of w -selector gates. The depth ofthis step is upper bounded by D sp ( f ( n )) .Summarizing the above, we have the following fact: Fact 9.4.

Assume the same assumptions as in Theorem 9.1, and moreover n is divisible by f ( n ) . The lossyloose compaction algorithm above can be realized with a circuit consisting of C sp · n · ( f ( f ( n )) + 1) + O ( n ) generalized boolean gates, C sp · n number of w -selector gates, and of depth D sp ( n ) + D sp ( f ( n )) + O (log f ( n )) . hen n is not divisible by f ( n ) . When n is not divisible by f ( n ) , we can pad the last chunk with ﬁllerelements to a length of a multiple f ( n ) . After the padding the total number of elements is upper boundedby n + f ( n ) . As long as n is greater than an appropriately large constant, even with the aforementionedpadding, we would have the following fact: Fact 9.5.

Assume the same assumptions as in Theorem 9.1. Then, for sufﬁciently large n , the above lossyloose compaction algorithm can be realized with a circuit consisting of . C sp · n · f ( f ( n )) + O ( n ) generalized boolean gates, . C sp · n number of w -selector gates, and in depth . D sp ( n ) + O (log f ( n )) .

10 Linear-Sized, Low-Depth / poly log( · ) -Lossy Loose Compactor In this section, we shall prove the following theorem.

Theorem 10.1 (Linear-sized loose compactor) . Let (cid:101)

C > be an arbitrary constant. There exists acircuit in the indivisible model that solves (1 / log (cid:101) C ( n ) , n, w ) -lossy loose compaction, and moreover thecircuit depth is O (log . n ) , the total number of generalized boolean gates is upper bounded by O ( n · w ) · max (1 , poly (log ∗ n − log ∗ w )) , and the number of w -selector gates is upper bounded by O ( n ) · max (1 , poly (log ∗ n − log ∗ w )) .As a direct corollary, for any arbitrarily large constant c ≥ , if w ≥ log ( c ) n , it holds that the numberof generalized boolean gates is upper bounded by O ( nw ) , and the number of w -selector gates is upperbounded by O ( n ) . The case when w > log log n is easier (see Footnote 7), so in the remainder of this section, unlessotherwise noted, we shall assume that w ≤ log log n . Proof of Theorem 10.1:

We will construct tight compaction through repeated bootstrapping and boosting.Without loss of generality, we may assume that n is greater than an appropriately large constant. We havetwo steps:• LC i = ⇒ SP i +1 (Theorem 8.3): from lossy loose compactor to approximate splitter. Due to Theo-rem 8.1 and Theorem 8.3, we get the following, where we use different subscripts in the big-O notationsto hide different constants.Assuming ( α, n, w ) -lossy loose compactor with: : B lc ( n ) : S lc ( n ) depth : D lc ( n ) Then, for any β ∈ (0 , / , there exists (8 α, β, n, w ) -approximate splitter with: : 2 . B lc ( n ) + 10 S lc ( n ) + O ( n ) : 5 S lc ( n ) + O ( n ) depth : 2 . α · ( D lc ( n ) + O (1)) • SP i +1 = ⇒ LC i +1 (Theorem 9.1) : from approximate splitter to lossy loose compactor. SimplifyingTheorem 9.1 we have: 36ix some α ∈ (0 , . Assuming that for any β ∈ (0 , / , ( α, β, n, w ) -approximate splittercan be realized in a circuit with the following cost for some constant C sp and function f ( n ) : : C sp · n · f ( n ) : C sp · n depth : D sp ( n ) Then there exists a (1 . α, n, w ) -lossy loose compactor such that: : 2 . · C sp · n · f ( f ( n )) + O ( n ) : 2 . · C sp · n depth : 2 . · D sp ( n ) + O (log f ( n )) Choose C := (cid:101) C + 1 . Our starting point is Theorem 7.1, which gives as a circuit LC that realizes (1 / log C n, n, w ) -lossy loose compaction for the constant C > . Using the above two steps, we bootstrapand boost the circuit: LC : By Theorem 7.1, there exists a constant C > such that we can solve (1 / log C n, n, w ) -lossy loosecompaction with generalized boolean gates : Cn log log n selector gates : Cn depth : C log log n SP : By Theorem 8.3, for any β ∈ (0 , / , we can construct an (8 / log C n, β, n, w ) -approximate splittercircuit SP from LC . SP ’s size is upper bounded by the expressions :generalized boolean gates : 2 . Cn log n + 10 Cn + O ( n ) ≤ . Cn log n selector gates : 5 Cn + O ( n ) ≤ . Cn depth : 2 . C log log n · ( C log log n + O (1)) ≤ . C log log n · ( C log log n ) In the above, the inequalities hold as long as n is greater than an appropriately large constant. LC : Due to Theorem 9.1, we build a (8 · . / log C n, n, w ) lossy loose compaction circuit LC from SP . LC ’s size is upper bounded by the expressions:generalized boolean gates : 2 . · . Cn log log n + O ( n ) ≤ . · . Cn log log n selector gates : 2 . · . Cn ≤ . · . Cn depth : 2 . · (2 . C log log n ) · ( C log log n ) + O (log log n ) ≤ . · (2 . C log log n ) · ( C log log n ) SP : Due to Theorem 8.3, for any β ∈ (0 , / , we can construct a (8 · . / log C n, β, n, w ) -approximatesplitter circuit SP from LC . SP ’s size is upper bounded by the expressions:generalized boolean gates : 2 . · . · . Cn log log n + 10 · . · . Cn + O ( n ) ≤ . · (5 . Cn log log n selector gates : 5 · . · . Cn + O ( n ) ≤ . · (5 . Cn depth : 2 . C log log n · (2 . · (2 . C log log n ) · ( C log log n ) + O (1)) ≤ . · (2 . C log log n ) · ( C log log n ) When w > log log n , LC gives Theorem 10.1. Therefore, the rest of this section assumes w ≤ log log n . C : Due to Theorem 9.1, we build a ((8 · . / log C n, n, w ) -lossy loose compaction circuit LC from SP . LC ’s size is upper bounded by the expressions:generalized boolean gates : 2 . · . · (5 . Cn log (4) n + O ( n ) ≤ (2 . · . Cn log (4) n selector gates : 2 . · . · (5 . Cn ≤ (2 . · . Cn depth : 2 . · . · (2 . C log log n ) · ( C log log n ) + O (log (3) ( n )) ≤ (2 . · . C log log n ) · ( C log log n ) Let d ∈ N be the smallest integer such that log (2 d ) n ≤ w , i.e., d = (cid:100) log(log ∗ n − log ∗ w ) (cid:101) ≤ log(log ∗ n − log ∗ w ) + 1 . Continuing for d iterations, we get: LC d : LC d is a ((8 · . d / log C n, n, w ) -lossy loose compactor, and LC d ’s size is upper bounded by theexpressions:generalized boolean gates : (2 . · . d Cn log (2 d ) n = O ( nw ) · poly (log ∗ n − log ∗ w ) selector gates : (2 . · . d Cn = O ( n ) · poly (log ∗ n − log ∗ w ) depth : (2 . · . C log log n ) d · ( C log log n ) ≤ O (log . n ) This gives rise to Theorem 10.1.

11 Approximate Tight Compaction

Deﬁnition.

Let α ∈ (0 , . An ( α, n, w ) -approximate tight compactor (also written as α -approximate tightcompactor when n and w are clear from the context) solves the following problem: given an input array I containing n elements, each containing a -bit key and a w -bit payload, we want to output a permutation(denoted O ) of the input array I , such that at most α · n elements in O are misplaced — here, an element O [ i ] is said be misplaced iff O [ i ] is marked with the key b ∈ { , } ; however, the sorted array sorted ( I ) wants the key − b in position i . Theorem 11.1 (Approximate tight compaction) . Fix an arbitrary constant (cid:101)

C > . There is an (1 / (log n ) (cid:101) C , n, w ) -approximate tight compaction circuit that has O ( n · w ) · max (1 , poly (log ∗ n − log ∗ w )) generalized booleangates, O ( n ) · max (1 , poly (log ∗ n − log ∗ w )) number of w -selector gates, and with depth at most O (log n ) . Proof of Theorem 11.1:

Given an ( α, n, w ) -lossy loose compactor, we can obtain a (8 α, n, w ) -approximatetight compactor using an algorithm that is similar to the one described in the proof of Theorem 8.3. Forconvenience, below we shall refer to the elements with the -key in the input array I as distinguished . Approximate tight compaction from lossy loose compaction Count.

Compute the total number (denoted cnt ) of distinguished elements in the input array I .2. Color.

For any i ≤ cnt , if I [ i ] is not distinguished, mark the element red ; for any i > cnt , if I [ i ] isdistinguished, mark the element blue ; every other element is marked ⊥ . Let the outcome be X .Note that at this moment, each element is labeled with 3 bits of metadata, one bit of distinguishedindicator and two bits of color-indicator (indicating whether the element is colored, and if so, whichcolor).3. Swap.

Call (cid:92)

Swap n ( X ) (to be deﬁned below) to swap almost all the blue elements each with a red element — here we use a payload of size w + 1 and not w as we also include the distinguished-38ndicator as part of the payload. Return the outcome. (cid:92) Swap n ( X ) is deﬁned in a very similar to the Swap n algorithm of Theorem 8.3; except that now, wesimply use a lossy loose compactor to extract the residual red and blue elements, and then recurse on theextracted array. In comparison, in the earlier Swap n algorithm, we used a lossy loose compactor to extract blue elements and used a near-sorter to extract the red elements. (cid:92) Swap n ( X ) • Input:

An array X of m ≤ n elements, each has a w -bit payload and a 2-bit label indicating whetherthe element is colored, and if so, whether the element is blue or red . n is the size of the originalproblem when Swap is ﬁrst called; the same n will be passed into all recursive calls since it is usedto decide when the recursion stops.• Algorithm: (a)

Base case.

Same as Step (a) of the earlier

Swap n of Theorem 8.3.(b) Approximate swapper.

Same as Step (b) of the earlier

Swap n of Theorem 8.3; recall that theresulting array is denoted as X (cid:48) .(c) Lossy-extract colored.

Call an ( α, m, w + 1) -lossy loose compactor to compact X (cid:48) by a half,where the lossy loose compactor treats the colored elements as real, and all other elements asﬁllers (i.e., the loose compactor treats the ﬁrst bit of the color label as a real-ﬁller indicator, andthe second bit of the color label is treated as part of the payload).Let the outcome be Y whose length is half of X .(d) Recurse.

Recursively call (cid:92)

Swap n ( Y ) , and let the outcome be Y (cid:48) .(e) Reverse route.

Reverse the routing decisions made by all selector gates during Steps (c) (seeRemark 4 in the proof of Theorem 8.3). In this way, we can reverse-route elements in Y (cid:48) to anarray (denoted (cid:101) X ) whose length is m .(f) Output.

Return O which is formed by a performing coordinate-wise select operation between X (cid:48) and (cid:101) X . For every i ∈ [ m ] : – if X (cid:48) [ i ] originally had a colored element and the element was not lost during Step (c), then let O [ i ] := (cid:101) X [ i ] ; – else let O [ i ] := X (cid:48) [ i ] ; Our approximate tight compaction algorithm actually requires a swapper where elements are of bit-length w + 1 , but forconvenience we rename the variable to w in the description of the swapper. Suppose that n is sufﬁciently large. Then, the recursive call will hit the base case after at most (cid:6) log . α (cid:7) steps of recursion. Fact 11.2.

Assume that n is greater than a sufﬁciently large constant. The remaining number of coloredelements at the end of the algorithm is at most αn .Proof. The total number of elements lost during Step (c) of the algorithm is upper bounded by αn + αn/ . αn/ . . . . ≤ αn . Also, the recursion stops when m ≤ αn , all remaining colored elements will notget swapped. Therefore, the total number of colored elements remaining at the end is upper bounded by · αn + αn < αn , where the factor 2 comes from the fact that we may lose all αn in blue color andthus there are another αn in red . 39 mplementing Steps 1 and 2 in circuit. Due to Fact 6.3, Step 1 can be accomplished with O ( n ) generalizedboolean gates and in depth O (log n ) . When the count cnt is computed from Step 1, we can implement Step 2as follows. Recall that cnt ∈ { , , . . . , n } is a ((log n ) + 1) -bit number. Imagine that there are n receiversnumbered , , . . . , n . Each receiver is waiting to receive either “ ≤ ” or “ > ”. Those with indices , . . . , cnt should receive “ ≤ ” and those with indices cnt + 1 , . . . , n should receive “ > ”. We can accomplish this usingthe binary-to-unary conversion circuit of Fact 6.5, i.e., convert cnt into an n -bit string so that the head cnt bits are 0 and the tail n − cnt bits are 1. Due to Fact 6.5, Step 2 can be implemented as a circuit consistingof at most O ( n ) generalized boolean gates and in depth O (log n ) . Once each of the n receivers receiveeither “ ≤ ” or “ > ”, it takes a single generalized boolean gate per receiver to write down either blue , red , oruncolored. Implementing Steps 3 in circuit.

The approach and analysis are similar to the

Swap n circuit in the proofof Theorem 8.3.Summarizing the above, and plugging in a (1 / (cid:101) C n, n, w ) -lossy loose compactor as stated in Theo-rem 10.1, we will get Theorem 11.1.

12 Tight Compaction

Lemma 12.1 (Slow tight compaction circuit

SlowTC ) . There is an ( n, w ) -tight compaction circuit of depth O (log n ) , and requiring O ( nw + n log n ) generalized boolean gates and O ( n ) number of w -selector gates.Henceforth we will use SlowTC to denote this circuit.Proof.

We can use the tight compactor circuit constructed in Asharov et al. [9, Theorems 4.8 and 5.1]. Inparticular, wherever they employ an approximate swapper (called a loose swapper in their paper [9]), wereplace its implementation with a constant-depth one as described in Theorem 8.1. Asharov et al. [9] didnot analyze the depth of the circuit; however, with this modiﬁcation, it is not hard to show that the resultingcircuit has depth upper bounded by O (log n ) .Recall that in Section 4, we showed how to construct an algorithm that accomplishes distribution fromtight compaction. The same algorithm applies in the circuit model. This gives rise to the following corollary: Corollary 12.2 (Slow distribution circuit

SlowDistr ) . There is a circuit that solves the aforementioneddistribution problem, henceforth denoted

SlowDistr ; further, the number of generalized boolean gates, w -selector gates, and depth asymptotically match the SlowTC circuit of Theorem 12.1.Proof.

Use the above algorithm where tight compaction is instantiated with

SlowTC . Sparsity of an array.

Let A be an array in which each element has a w -bit payload, and is tagged with a bitdenoting whether the element is real or a ﬁller . Let α ∈ (0 , . An array A of length n is said to be α -sparseif there are at most αn real elements in it. Sparse loose compactor.

A sparse loose compactor is deﬁned almost in the same way as a loose compactor(see Section 7), except that 1) it works only on / poly log( n ) -sparse arrays; and 2) it compresses the arrayby / log n factor without losing any real elements.More formally, let C (cid:79) > be a sufﬁciently large universal constant. Given an input array I of length n that is promised to be / (log n ) C (cid:79) -sparse, an ( n, w ) -sparse loose compactor outputs an array O whoselength is (cid:98) n/ log n (cid:99) , and moreover, the multiset of real elements in O must be equal to the multiset of realelements in I . 40 heorem 12.3 (Sparse loose compactor) . There is an ( n, w ) -sparse loose compactor circuit, with O ( nw ) generalized boolean gates and O ( n ) number of w -selector gates, and of depth O (log n ) . Proof of Theorem 12.3:

We will run a variant of the lossy loose compactor algorithm described in the proofof Theorem 7.1 in Section 7.

Bipartite expander graphs with polylogarithmic degree.

Recall the bipartite graph of Margulis [28]. Fixa positive t ∈ N . The left and right vertex sets are L = R := [ t ] × [ t ] . A left vertex ( x, y ) is connected tothe right vertices ( x, y ) , ( x, x + y ) , ( x, x + y + 1) , ( x + y, y ) , ( x + y + 1 , y ) where all arithmetic is modulo t . We let H m be the resulting graph that has m = t vertices on each side.It is known (Margulis [28], Gabber and Galil [15], and Jimbo and Maruoka [21]) that for every m whichis a perfect square (i.e., of the form m = i for some i ∈ N ), H m is -regular and the second largesteigenvalue of its normalized adjacency matrix λ ( H m ) ∈ (1 / , is a constant. Let (cid:15) := 1 / log m . Wewill use a graph G (cid:15),m := H γm that is the γ -th power of G m , where γ is the smallest odd integer such that λ ( G (cid:15),m ) = λ ( H m ) γ ≤ (cid:15) . In other words, in G (cid:15),m , the edges are the length- γ paths in H m . Therefore, G (cid:15),m is a γ -regular bipartite graph. Note that the degree γ ∈ [log c m,

25 log c m ] for some constant c > (where any constant c > works later). Sparse loose compactor algorithm.

We ﬁrst describe the modiﬁcations to the meta-algorithm on top ofthe lossy loose compactor algorithm in the proof of Theorem 7.1. We then described the modiﬁed circuitimplementation of the meta-algorithm.

Sparse loose compactorExpander graph family and parameters.

We use a family of bipartite expander graphs { G (cid:15),m } m whose special expansion (cid:15) ≤ / log m . The expander graph family { G (cid:15),m } m can be constructed inthe aforementioned manner. Input.

The input is an array I of length n which is promised to be / (log n ) C (cid:79) -sparse. Interpret I as an array of n (cid:48) super-elements where n (cid:48) := n/ (cid:98) log n (cid:99) , each super-element consists of (cid:98) log n (cid:99) consecutive elements in I , and a super-element is real if it consists of at least one real element. Assumethat n (cid:48) = m · (cid:98) d/ (cid:99) for some perfect square m and d = Θ(log c m ) is the degree of the aforementionedbipartite expander graph G (cid:15),m where c > is an appropriate constant. For now, we assume that n is divisible by (cid:98) log n (cid:99) , and that n (cid:48) is divisible by (cid:98) d/ (cid:99) — see Remark 5 regarding how to deal withgeneral parameters. Algorithm.

Similar to the lossy loose compactor algorithm described in the proof of Theorem 7.1,except that we now parametrize the expander graph family differently as explained above, we run thealgorithm on super-elements throughout, and moreover, we introduce the following parameter modiﬁ-cations:• The array of length n (cid:48) = m ·(cid:98) d/ (cid:99) super-elements is divided into m chunks of (cid:98) d/ (cid:99) super-elements.We redeﬁne sparse and dense chunks as follows: a sparse chunk is one that has at most d/ (2 log m ) real super-elements. Any chunk that is not sparse is said to be dense.• We will run the ProposeAcceptFinalize subroutine for iter := log n (cid:48) / log log n (cid:48) iterations. More-over, in every iteration, each right vertex sends a rejection if it receives more than d/ (2 log m ) proposals; otherwise it sends an acceptance message. Each left vertex become satisﬁed if it receivesat least (cid:98) d/ (cid:99) acceptance messages.• After the dense chunks distribute their real super-elements to the right vertices, we compress allchunks such that each chunk contains (cid:4) d/ (2 log m ) (cid:5) super-elements, without losing any real super-41lements in the process (see Fact 12.5).Last but not the least, the circuit implementations of the ProposeAcceptFinalize subroutine andthe online routing phase are somewhat non-trivial, and needs to use the

SlowDistr and

SlowTC primitives — we will explain these details later.Note that for sufﬁciently large n , log m = Θ(log n ) and therefore the above algorithm produces anoutput that is Θ(1 / log n ) < / log n fraction of the original length. Lemma 12.4.

In each iteration of the

12 log m − (cid:15) · | U | m > (cid:15) · (cid:18)

12 log m − n ) C (cid:79) − (cid:19) , Since C (cid:79) > , and (cid:15) ≤ m , we have that (cid:112) | U | / | R neg | ≥ .

25 log m , i.e., | U | / | R neg | ≥ · log m ,that is, | R neg | ≤ | U | / log m .We conclude that the number of vertices in R that respond with a rejection is at most | U | / log m .Therefore, the number of edges that receive a rejection is at most | U | d/ log m . For a left vertex to remainunsatisﬁed, it must receive at least d/ rejections. This means that at most | U | / log m left vertices canremain unsatisﬁed. Fact 12.5.

Suppose that n (cid:48) is sufﬁciently large. Then, after iter := log n (cid:48) / log log n (cid:48) iterations, all leftvertices become satisﬁed.Proof. We only need to make sure that (cid:16) m (cid:17) iter · m < , that is, iter > log m/ log (cid:18) log m (cid:19) = log m m − . Therefore, for sufﬁciently large n , it sufﬁces to make sure that iter > log n (cid:48) log log n (cid:48) . Circuit implementation.

We now discuss how to implement the above meta algorithm in circuit.42 To determine whether each super-element is real or not, all super-elements in parallel run the countingcircuit of Fact 6.3 and then call a comparator circuit of Fact 6.2. In total, this step takes O ( n ) generalizedboolean gates and O (log log n ) depth.• To determine whether each chunk is sparse or dense, all chunks in parallel run the counting circuit ofFact 6.3 and then call a comparator circuit of Fact 6.2. In total, this step takes O ( n (cid:48) ) generalized booleangates and O (log d ) = O (log log n ) depth.• Next, we invoke the ProposeAcceptFinalize algorithm. In each iteration: – Every facility (i.e., right vertex) need to tally how many proposals it received, and decide whether itwants to send rejections or acceptance messages. For each facility, this requires a counting circuit ofFact 6.3, and a comparator circuit of Fact 6.2. Then, the decision can be propagated over a binary treeto all d edges. Accounting for all facilities, this step in total requires O ( n (cid:48) ) generalized boolean gatesand O (log d ) = O (log log n ) depth. – Every factory (i.e., left vertex) needs to tally how many acceptance messages it has received, anddecide if it wants to mark itself as satisﬁed. If it marks itself as satisﬁed, it will also mark all edgesover which an acceptance message is received as being part of the matching M . This can be done in asimilar fashion as how facilities tally their proposals, in total taking O ( n (cid:48) ) generalized boolean gatesand O (log log n ) depth.Accounting for all log n (cid:48) / log log n (cid:48) iterations, the total depth is at most O (log n ) , and the total numberof generalized boolean gates is at most O ( n (cid:48) ) · log n (cid:48) / log log n (cid:48) = O ( n ) .• Next, each dense chunk u ∈ L must send one real super-element over each of an arbitrary subset of load ( u ) ≤ d/ edges outgoing from u in the matching M . This can be accomplished by invokingan instance of SlowDistr (Corollary 12.2) for each chunk, such that in each dense chunk, each realsuper-element is sent over an edge in M . Thus, each chunk takes O ( (cid:98) d/ (cid:99) · ( w log n + log (cid:98) d/ (cid:99) )) = O ( (cid:98) d/ (cid:99) · w log n ) number of generalized boolean gates and O ( (cid:98) d/ (cid:99) ) total number of ( w · log n ) -selector gates. Accounting for all chunks, the total number of generalized boolean gates is at most O ( m · (cid:98) d/ (cid:99) · w log n ) = O ( nw ) , the total number of ( w · log n ) -selector gates is at most O ( n/ log n ) ,and the depth is at most O (log (cid:98) d/ (cid:99) ) = O (log n ) . Recall that each ( w · log n ) -selector gate can beimplemented as O (log n ) number of w -selector gates, and using O (log n ) generalized boolean gates topropagate the ﬂag over a binary-tree of log n leaves and depth log log n .Therefore, in total, this step can be implemented with O ( nw ) generalized boolean gates, O ( n ) numberof w -selector gates, and in depth O (log n ) .• Now, all dense chunks mark all its super-elements as ﬁllers. This can be done by having each chunkbroadcast its dense/sparse indicator bit over a binary tree to all d positions of the chunk. In total, we canimplement it with a circuit of O ( n (cid:48) ) generalized boolean gates and O (log d ) = O (log log n ) depth.• Finally, we need to compress all chunks on the left and the right to (cid:4) d/ (2 log m ) (cid:5) super-elements. Thiscan be accomplished by applying a SlowTC circuit to each chunk (Lemma 12.1), and the number ofgeneralized boolean gates, w -selector gates, and depth are asymptotically the same as the earlier step inwhich we invoke a SlowDistr instance per chunk.Summarizing the above, we get that the entire sparse loose compactor algorithm requires O ( nw ) gener-alized boolean gates, O ( n ) number of w -selector gates, and O (log n ) depth. Remark 5.

So far, we have assumed that n is divisible by (cid:98) log n (cid:99) and n (cid:48) := n/ (cid:98) log n (cid:99) is equal to m · (cid:98) d/ (cid:99) for some perfect square m , and d = Θ(log c m ) is the degree of the aforementioned bipartite expander graph G (cid:15),m where c > is an appropriate constant. 43f the above is not satisﬁed, we can let n (cid:48) := (cid:100) n/ (cid:98) log n (cid:99)(cid:101) . If n (cid:48) does not satisfy the above, we can ﬁndthe largest m ∗ such that n (cid:48) ≥ m ∗ · (cid:98) d ∗ / (cid:99) (note that d ∗ is a function of m ∗ for a ﬁxed (cid:15) ). Now, we can round m ∗ up to the next perfect square m , and still use d ∗ as the degree of the bipartite expander graph. We canpad the array with ﬁllers such that contains m · d ∗ super-elements, and then run the sparse loose compactoralgorithm. With this modiﬁcation, one can check that Lemma 12.4 and Fact 12.5 still hold. Therefore, ourearlier analyses hold. The padding incurs only o (1) blowup in the array’s length, i.e., n (cid:48)(cid:48) = (1 + o (1)) n (cid:48) .Our algorithm compresses the array to n (cid:48)(cid:48) / (log m ) in the number of super-elements, for sufﬁciently large n and thus sufﬁciently large n (cid:48) = (cid:100) n/ log n (cid:101) , the output length is upper bounded by (cid:98) n/ log n (cid:99) . Putting it all together, we can now realize a linear-sized, logarithmic-depth tight compaction circuit, as statedin the following theorem:

Theorem 12.6 (Linear-sized, logarithmic-depth tight compaction circuit) . There is an ( n, w ) -tight com-paction circuit with O ( nw ) · max( poly (log ∗ n − log ∗ w ) , generalized boolean gates, O ( n ) · max( poly (log ∗ n − log ∗ w ) , number of w -selector gates, and of depth O (log n ) . Note that the above theorem and Lemma 6.1 together would imply Theorem 1.3.

Proof of Theorem 12.6:

We construct a linear-sized, logarithmic-depth tight compaction circuit as follows.

Tight compactionInput.

An array I containing n elements each with a w -bit payload and a 1-bit key. Algorithm. Approximate tight compaction.

Apply a (1 / (log n ) C (cid:79) , n, w ) -approximate tight compactor to theinput array I ; let X denote the outcome.2. Count and label.

Count how many -keys there are in the array I , let the result be cnt . For each i ∈ [ n ] in parallel:• if X [ i ] has the key and i ≤ cnt , mark it as red ;• else if X [ i ] has the key and i > cnt , mark it as blue ;• else the element X [ i ] is uncolored.3. Sparse loose compaction.

Apply a sparse loose compactor to the outcome of the previous step; theoutcome is an array Y whose length is (cid:98) n/ log n (cid:99) containing all colored elements in X , padded withﬁller elements to a length of (cid:98) n/ log n (cid:99) .4. Slow swap.

Let Y (cid:48) := SlowSwap ( Y ) .5. Reverse route.

Reverse route the array Y (cid:48) by reversing the routing decisions made in Step 3, and letthe outcome be Z which has length n .6. Output.

The output O is obtained by performing a coordinate-wise select operation between Z and X : ∀ i ∈ [ n ] : O [ i ] := (cid:40) Z [ i ] if X [ i ] was marked “ misplaced ” X [ i ] o.w.44 mplementing the algorithm in circuit. Step 1 is implemented with the approximate tight compactioncircuit of Theorem 11.1.Step 2 is implemented as follows. First, use the counting circuit of Fact 6.3 to compute cnt . Then, usethe binary-to-unary circuit of Fact 6.5 to write down a string of n bits where the beginning cnt bits are and all other bits are . Next, all positions i ∈ [ n ] uses the comparator circuit of Fact 6.2 to compute its“ misplaced ” label.Step 3 is implemented with the sparse loose compactor circuit of Theorem 12.3. Step 4 is implementedusing the SlowSwap circuit of Theorem 8.2. Step 5’s costs are absorbed by Step 3. Finally, Step 6 can beaccomplished with n generalized boolean gates.Summarizing the above, the entire tight compaction circuit requires O ( nw ) · max( poly (log ∗ n − log ∗ w ) , generalized boolean gates, O ( n ) · max( poly (log ∗ n − log ∗ w ) , number of w -selector gates, and has depth O (log n ) .

13 Sorting Circuit for Short Keys

Earlier, we described various building blocks for an Oblivious PRAM model. We now discuss the size anddepth bounds for these building blocks in the circuit model.

Sorting elements with ternary keys.

Given Theorem 12.6, and Fact 6.3, we can implement the algorithmof Theorem 4.1 using a circuit with O ( nw ) · max(1 , poly (log ∗ n − log ∗ w )) generalized boolean gates, O ( n ) · max(1 , poly (log ∗ n − log ∗ w )) number of w -selector gates, and of depth O (log n ) . This leads to thefollowing fact: Fact 13.1.

There exists a circuit with O ( nw ) · max(1 , poly (log ∗ n − log ∗ w )) generalized boolean gates, O ( n ) · max(1 , poly (log ∗ n − log ∗ w )) number of w -selector gates, and of depth O (log n ) , capable of sortingany input array containing n elements each with a key from the domain { , , } and a payload of w bits. Slow sorter and slow alignment.

We now discuss how to implement the earlier

SlowSort K ( · ) and SlowAlign

K,K (cid:48) ( · ) algorithms in circuit. Fact 13.2 ( SlowSort K ( · ) circuit) . Let n be the length of the input array and w be the length of eachelement’s payload. Recall that each element has a key from the domain [0 , K − , and let k := log K .The SlowSort K ( · ) algorithm of Theorem 5.1 can be implemented as a circuit with O ( nK · ( w + k )) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) generalized boolean gates, O ( nK ) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) number of ( w + k ) -selector gates, and of depth O (log n + k ) .Proof. Recall the

SlowSort K ( · ) algorithm of Theorem 5.1 where K := 2 k :1. Step 1 can be implemented using K parallel instances of the counting circuit of Fact 6.3 on arrays oflength n , and then using the all-preﬁx-sum circuit of Fact 6.4 on an array of length K where the entire sumis promised to be at most O (log n ) bits long. In total, Step 1 requires a circuit with O ( nK + K log n ) = O ( nK ) generalized boolean gates and of depth O (log K + log n ) = O ( k + log n ) .2. Step 2 can be implemented by broadcast each element of A over a binary tree of K leaves, and thenhaving all nK elements perform a comparison in parallel using Fact 6.2. This requires O ( nK ) numberof ( w + k ) -selector gates, O ( nK ) generalized boolean gates, and at most O ( k ) depth.3. Step 3 invokes K parallel instances of the generalized binary-to-unary conversion circuit on arrays oflength n , and K parallel instances of the ternary-key sorting circuit. This requires O ( nK ( w + k )) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) generalized boolean gates, O ( nK ) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) number of ( w + k ) -selector gates, and has depth O (log n ) .45. Step 4 can be accomplished in a circuit with O ( nK ) number of generalized boolean gates, O ( nK ) number of ( w + k ) -selector gates and of depth O (log K ) = O ( k ) . Note that we can use a single bit tomark whether each element in each of B (cid:48) , B (cid:48) , . . . , B (cid:48) K − has a real key in the range [0 , K − or not.Summarizing the above, we have that the entire SlowSort K ( · ) algorithm can be implemented as acircuit with O ( nK ( w + k )) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) generalized boolean gates, O ( nK ) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) number of ( w + k ) -selector gates, and of depth O (log n ) .We now discuss the circuit implementation of the SlowAlign

K,K (cid:48) ( · ) algorithm of Theorem 5.2. Fact 13.3 ( SlowAlign

K,K (cid:48) ( · ) circuit) . Let n be the length of the input array and w be the length of eachelement’s payload. Recall that each element has a key from the domain [0 , K − , and an index from thedomain [0 , K (cid:48) − . Let k = log K and k (cid:48) = log K (cid:48) . The SlowAlign

K,K (cid:48) ( · ) algorithm of Theorem 5.1 canbe implemented as a circuit with O ( n · ( K + K (cid:48) ) · ( w + k + k (cid:48) )) · max(1 , poly (log ∗ n − log ∗ ( w + k + k (cid:48) ))) generalized boolean gates, O ( n ( K + K (cid:48) )) · max(1 , poly (log ∗ n − log ∗ ( w + k + k (cid:48) ))) number of ( w + k + k (cid:48) ) -selector gates, and of depth O (log n + k + k (cid:48) ) .Proof. Recall that

SlowAlign

K,K (cid:48) invokes one instance of

SlowSort K on an array of length n containing ( w + k (cid:48) ) -bit payloads, and one instance of SlowSort K (cid:48) on an array of length n containing ( w + k ) -bitpayloads and its reverse routing circuit. Therefore, the fact follows from Fact 13.2. Finding the dominant key.

We now analyze the complexity of the

FindDominant algorithm (The-orem 5.3) when implemented in circuit. Note that the

FindDominant algorithm need not look at theelements’ payload strings. Therefore, we may plug in an arbitrary w ≥ as the fake payload length.1. Step 1, i.e., the base case calls the SlowSort K algorithm on an array of length at most n/K where K := 2 k . Therefore, this step requires O ( n · ( w + k )) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) generalizedboolean gates, O ( n ) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) number of ( w + k ) -selector gates, and ofdepth O (log n + k ) .2. In each of the O ( k ) recursive calls to FindDominant , the array length would reduce by a factor of , and during each recursive call, we divide the array into groups of and run an AKS circuit on eachgroup. In total over all levels of recursion, this requires O ( n ) number of ( w + k ) -selector gates, O ( n ) generalized boolean gates, and O ( k ) depth.Therefore, we have the following fact. Fact 13.4 ( FindDominant circuit) . Suppose that n > k +7 and moreover n is a power of . Let A be an array containing n elements each with a k -bit key, and suppose that A is (1 − − k ) -uniform. Fixsome arbitrary w ≥ (which need not be the element’s payload length ). Then, there is a circuit that cancorrectly identify the dominant key given any such A ; and moreover, the circuit contains O ( n · ( w + k )) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) generalized boolean gates, O ( n ) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) number of ( w + k ) -selector gates, and of depth O (log n + k ) . We now ﬁnish it off and discuss how to implement the algorithm of Theorem 5.5 in the circuit model. To dothis, it sufﬁces to describe how to implement a nearly orderly segmenter in circuit, and how to sort a nearlyorderly array in circuit.

Nearly orderly segmenter.

Recall that for k ≤ log n , the algorithm of Theorem 3.2 is a comparator-basedcircuit with O ( nk ) comparators and of O ( k ) depth. We would like to convert this comparator-based circuitto a circuit with generalized boolean gates and w -selector gates. Note that the algorithm need not look at the elements’ payload strings. act 13.5 ( (2 − k , k ) -orderly segmenter circuit) . Suppose that k ≤ log n . There exists a (2 − k , k ) -orderly-segmenter circuit with O ( nk ) generalized boolean gates, O ( nk ) number of ( w + k ) -selector gates,and of depth O ( k ) .Proof. If we used a na¨ıve method for converting the comparator-based circuit in Theorem 3.2 to a circuitwith generalized boolean gates and w -selector gates, the resulting circuit depth would have depth O ( k log k ) because every comparator can be implemented as an O ( k ) -sized and O (log k ) -depth boolean circuit due toFact 6.2.Fortunately, we can rely on a pipelining technique to make the depth smaller.• In the beginning, all input bits of the input layer are “ready”. All other comparators not in the input layersee all bits of their inputs as “not ready”.• Whenever a comparator detects a new i ∈ [ k ] such that both of its inputs have the i -th bit ready, it cancompare the i -th bits of the two inputs, and as a result, the i -th bits of the two outputs of the gate will beready.Using this pipelining technique, we can ﬁrst compute all the generalized boolean gates which will pop-ulate the ﬂags of all selector gates. This step takes O ( k ) depth and O ( nk ) generalized boolean gates. Next,we can evaluate all O ( nk ) number of ( w + k ) -selector gates in a topological order; this can be accomplishedin O ( k ) depth. Sorting a nearly orderly array.

We now describe how to implement the algorithm of Theorem 5.4 incircuit.• Step 1 calls the

FindDominant circuit of Fact 13.4, and then for each segment, invokes one copy of thecounting circuit of Fact 6.3 and the generalized binary-to-unary conversion circuit of Fact 6.5. Therefore,this step can be accomplished with a circuit containing O ( n · ( w + k )) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) generalized boolean gates, O ( n ) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) number of ( w + k ) -selector gates,and of depth O (log n + k ) .• For Step 2: to mark each element with its segment index, we can simply hard-wire the segment indices inthe circuit. Then, we invoke the oblivious compaction circuit of Theorem 12.6 which requires O ( n ( w + k )) · max( poly (log ∗ n − log ∗ ( w + k )) , generalized boolean gates, O ( n ) · max( poly (log ∗ n − log ∗ ( w + k )) , number of ( w + k ) -selector gates, and O (log n ) depth.• Step 3 invokes the SlowAlign

K,K circuit of Fact 13.3 on n/K elements. This requires O ( n · ( w + k )) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) generalized boolean gates, O ( n ) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) number of ( w + 3 k ) -selector gates, and has depth O (log n + k ) .• Step 4 is a reverse routing step whose costs are absorbed by Step 2.• Step 5 invokes K instances of the counting circuit of Fact 6.3 each on an array of length n/K . Thistakes O ( n ) generalized boolean gates.• Step 6 invokes the compaction circuit of Theorem 12.6 on an array containing K elements, where eachelement is of length W := O ( n ( k + w ) /K ) . Since K · n ( k + w ) /K = O ( n ( k + w )) , this step requires O ( n ( k + w )) · max( poly (log ∗ n − log ∗ ( w + k )) , generalized boolean gates, O ( K ) · max( poly (log ∗ n − log ∗ ( w + k )) , number of W -selector gates, and of depth O (log K ) .• Step 7 invokes K instances the SlowSort K circuit of Fact 13.2 each on an array of length n/K . Thiscost of this step is dominated by that of Step 3.• Step 8 is a reverse routing step whose costs are dominated by Step 6.47ue to Lemma 6.1, the above can be implemented as a constant fan-in, constant fan-out a booleancircuit of size O ( n ( w + k )) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) and depth O (log n + log w ) , assumingthat n > k +7 . Fact 13.6 (Sorting a (2 − k , k ) -orderly array in circuit) . Suppose that n > k +7 . There is a constantfan-in, constant fan-out boolean circuit that fully sorts an (2 − k , k ) -orderly array containing n elementseach with a k -bit key and a w -bit payloads, whose size is O ( n ( w + k )) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) and whose depth is O (log n + log w ) . Sorting short keys in the circuit model.

Summarizing the above, we get the following theorem:

Theorem 13.7 (Restatement of Theorem 1.1) . Suppose that n > k +7 . There is a constant fan-in, constantfan-out boolean circuit that correctly sorts any array containing n elements each with a k -bit key and a w -bit payloads, whose size is O ( nk ( w + k )) · max(1 , poly (log ∗ n − log ∗ ( w + k ))) and whose depth is O (log n + log w ) .Proof. Follows directly due to the algorithm of Theorem 5.5 where we implement the nearly orderly seg-menter and the sorter for a nearly orderly array using the circuits of Facts 13.5 and 13.6, respectively.Further, we use Lemma 6.1 to convert each circuit gadget in our operational model to a constant fan-in,constant fan-out boolean circuit gadget.

Acknowledgments

This work is in part supported by an NSF CAREER Award under the award number CNS1601879, a PackardFellowship, and an ONR YIP award. We would like to thank Silei Ren for discussions and help in anearly stage of the project. Elaine Shi would like to thank Bruce Maggs for explaining the AKS algorithm,Pippenger’s self-routing super-concentrator, the Wallace-tree trick, and the elegant work by Arora, Leighton,and Maggs [6], as well as for his moral support of this work.

References [1] Private communication with Bruce Maggs.[2] Peyman Afshani, Casper Benjamin Freksen, Lior Kamma, and Kasper Green Larsen. Lower boundsfor multiplication via network coding. In , pages 10:1–10:12, 2019.[3] M. Ajtai, J. Koml´os, and E. Szemer´edi. An O ( n log n ) sorting network. In STOC , 1983.[4] V.E. Alekseev. Sorting algorithms with minimum memory.

Kibernetica , 5:99–103, 1969.[5] Arne Andersson, Torben Hagerup, Stefan Nilsson, and Rajeev Raman. Sorting in linear time?

J.Comput. Syst. Sci. , 57(1):74–93, August 1998.[6] Sanjeev Arora, Frank Thomson Leighton, and Bruce M. Maggs. On-line algorithms for path selectionin a nonblocking network (extended abstract). In

Proceedings of the 22nd Annual ACM Symposium onTheory of Computing, May 13-17, 1990, Baltimore, Maryland, USA , 1990.[7] Gilad Asharov, Ilan Komargodski, Wei-Kai Lin, Kartik Nayak, Enoch Peserico, and Elaine Shi. Op-tORAMa: Optimal Oblivious RAM. In

Eurocrypt , 2020.[8] Gilad Asharov, Ilan Komargodski, Wei-Kai Lin, Enoch Peserico, and Elaine Shi. Oblivious paralleltight compaction. In

Information-Theoretic Cryptography (ITC) , 2020.489] Gilad Asharov, Wei-Kai Lin, and Elaine Shi. Sorting short keys in circuits of size o ( n log n ) . In SODA ,2021.[10] Manuel Blum, Robert W. Floyd, Vaughan Pratt, Ronald L. Rivest, and Robert E. Tarjan. Time boundsfor selection.

J. Comput. Syst. Sci. , 7(4):448–461, August 1973.[11] Elette Boyle and Moni Naor. Is there an oblivious RAM lower bound? In

ITCS , 2016.[12] Stephen A. Cook, Cynthia Dwork, and R¨udiger Reischuk. Upper and lower time bounds for parallelrandom access machines without simultaneous writes.

SIAM J. Comput. , 15(1):87–97, 1986.[13] Samuel Dittmer and Rafail Ostrovsky. Oblivious tight compaction in O(n) time with smaller constant.In

SCN , 2020. https://eprint.iacr.org/2020/377 .[14] Alireza Farhadi, MohammadTaghi Hajiaghayi, Kasper Green Larsen, and Elaine Shi. Lower boundsfor external memory integer sorting via network coding. In

STOC , 2019.[15] Ofer Gabber and Zvi Galil. Explicit constructions of linear-sized superconcentrators.

J. Comput. Syst.Sci. , 22(3):407–420, 1981.[16] Michael T. Goodrich. Zig-zag sort: A simple deterministic data-oblivious sorting algorithm runningin O(N Log N) time. In

STOC , 2014.[17] Willem H. Haemers. Interlacing eigenvalues and graphs.

Linear Algebra and its Applications , 226-228:593 – 616, 1995. Honoring J.J.Seidel.[18] Yijie Han. Deterministic sorting in o( n loglog n ) time and linear space. J. Algorithms , 50(1):96–105,2004.[19] Yijie Han and Mikkel Thorup. Integer sorting in 0(n sqrt (log log n)) expected time and linear space.In

FOCS , 2002.[20] Joseph J´aJ´a.

An Introduction to Parallel Algorithms . Addison-Wesley, 1992.[21] Shuji Jimbo and Akira Maruoka. Expanders obtained from afﬁne transformations.

Combinatorica ,7(4):343–355, 1987.[22] Shuji Jimbo and Akira Maruoka. Selection networks with n log n size and o (log n ) depth. In Algorithms and Computation , pages 165–174, 1992.[23] David G. Kirkpatrick and Stefan Reisch. Upper bounds for sorting integers on random access ma-chines. Technical report, 1981. University of British Columbia.[24] Donald E. Knuth.

The Art of Computer Programming, Volume III: Sorting and Searching . Addison-Wesley, 1973.[25] Tom Leighton, Yuan Ma, and Torsten Suel. On probabilistic networks for selection, merging, and sort-ing. In

Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures ,SPAA ’95, pages 106–118, 1995.[26] Zongpeng Li and Baochun Li. Network coding : The case of multiple unicast sessions. In

AllertonConference on Communications , volume 16, page 8, 2004.[27] Wei-Kai Lin, Elaine Shi, and Tiancheng Xie. Can we overcome the n log n barrier for oblivioussorting? In SODA , 2019. 4928] Grigorii Aleksandrovich Margulis. Explicit constructions of concentrators.

Problemy Peredachi Infor-matsii , 9(4):71–80, 1973.[29] John C. Mitchell and Joe Zimmerman. Data-Oblivious Data Structures. In

STACS , pages 554–565,2014.[30] Sarvar Patel, Giuseppe Persiano, Mariana Raykova, and Kevin Yeo. Panorama: Oblivious ram withlogarithmic overhead. In

FOCS , 2018.[31] M. S. Paterson. Improved sorting networks with o (log n ) depth. In Algorithmica , 1990.[32] Nicholas Pippenger. Selection networks. In

Algorithms , pages 2–11, Berlin, Heidelberg, 1990.Springer Berlin Heidelberg.[33] Nicholas Pippenger. Self-routing superconcentrators.

J. Comput. Syst. Sci. , 52(1):53–60, February1996.[34] John E. Savage.

Models of Computation: Exploring the Power of Computing . Addison-Wesley Long-man Publishing Co., Inc., Boston, MA, USA, 1st edition, 1997.[35] Joel Seiferas. Sorting networks of logarithmic depth, further simpliﬁed.

Algorithmica , 53(3):374–384,March 2009.[36] Mikkel Thorup. Randomized sorting in O ( n log log n ) time and linear space using addition, shift, andbit-wise boolean operations. J. Algorithms , 42(2):205–230, 2002.[37] Andrew Chi-Chih Yao. Bounds on selection networks.

SIAM J. Comput. , 9(3):566–582, 1980.

A Expander Graphs and Spectral Expansion

Lemma A.1 (Expander mixing lemma for bipartite graphs [17]) . Let G = ( L ∪ R, E ) be a d -regularbipartite graph such that | L | = | R | = n . Then, for all sets S ⊆ L and T ⊆ R , it holds that (cid:12)(cid:12)(cid:12)(cid:12) e ( S, T ) − dn · | S | · | T | (cid:12)(cid:12)(cid:12)(cid:12) ≤ λ ( G ) · d · (cid:112) | S | · | T | , where λ ( G ) is deﬁned as the second largest eigenvalue of the normalized adjacency matrix A of G . Inother words, A is the adjacency matrix of G multiplied by /d ; let λ ≥ λ ≥ · · · ≥ λ n be the eigenvaluesof A , then λ ( G ) := λ . The eigenvalue λ ( G ) ∈ (1 /d, is also called the spectral expansion of thebipartite graph G ..