Sorting Short Integers

Abstract

We build boolean circuits of size O(nm^2) and depth O(\log(n) + m \log(m)) for sorting n integers, each of m bits. We also build circuits that sort n integers, each of m bits, according to their first k bits; these circuits have size O(nmk(1 + \log^*(n) - \log^*(m))) and depth O(\log^{3}(n)). This improves on the result of Asharov et al. (arXiv:2010.09884) and resolves some of their open questions.


arXiv [cs.CC]

Sorting Short Integers∗

Michal Koucký and Karel Král
Computer Science Institute, Charles University, Prague, Czech Republic
{koucky, kralka}@iuuk.mff.cuni.cz

February 26, 2021

∗ This research was supported by the Grant Agency of the Czech Republic under the grant agreement no. 19-27871X, and Charles University grant SVV-2017-260452.

Sorting undoubtedly plays a central role in computer science. A great many problems can be solved using sorting as a subcomponent. There are many practical variants of sorting based either on what we sort (integers, rational numbers, strings, etc.) or how we sort (in parallel, in distributed fashion, in external memory, etc.). Despite lots of research, many basic questions about sorting remain unanswered.

The classical comparison based sorting takes time $O(n \log(n))$ when sorting $n$ integers. A well known lower bound states that this is optimal for comparison based sorting. However, this is a great over-simplification and the picture is much more nuanced: sorting integers from a domain of size $M$ can be done using binary search trees in time $O(n \log |M|)$, thus sorting for example $m$-bit integers only needs $O(nm)$ comparisons. Such an algorithm can be implemented on a pointer machine, for example. In the RAM model, with the word size $m$ we can sort even faster: When $m = O(\log(n))$ one can sort in time $O(n)$ using radix sort, and when $m = \Omega(\log(n))$ one can also sort in linear time using the algorithm of Andersson [2]. When $m = O(\log(m))$ one can sort in expected time $O\big(n \sqrt{\log \tfrac{m}{\log(n)}}\big)$ and linear space using the algorithm of Han and Thorup [4]. It is an easy exercise to design Turing machines that sort $m$-bit integers in time $O(nm)$.

In many cryptographic applications there is an interest in oblivious algorithms, that is, algorithms in which the sequence of the operations is independent of the processed data. Sorting plays an important role in the construction of oblivious RAM. An oblivious comparison based parallel model of computation intended for sorting is the sorting network. Numbers in a sorting network are thought of as signals which can only be compared. The seminal paper by Ajtai, Komlós, and Szemerédi [1] gives an asymptotically optimal sorting network of logarithmic depth, thus having $O(n \log(n))$ comparators and matching the comparison based lower bound. The AKS network has immense applications in theoretical computer science, and we use it in this paper, too.

Another oblivious model of computation heavily used throughout theoretical computer science is the boolean circuit. One can turn the AKS sorting network into a circuit of size $O(nm \log(n))$ and depth $O(\log(m) \log(n))$ (see Section 4). However, when building boolean circuits for sorting it is not clear whether one can take any advantage of some of the faster algorithms for RAM or Turing machines, as simulating random access memory or Turing machine tapes by circuits requires substantial overhead. Asharov et al. [3] asked the question whether one can sort $m$-bit integers in time $o(nm \log(n))$ when $m = o(\log(n))$. They provide an answer to this question by constructing circuits for sorting $m$-bit integers of size $O(nm^2 (1 + \log^*(n) - \log^*(m))^{2+\varepsilon})$ and polynomial depth, for any $\varepsilon > 0$.

We improve their results: We build boolean circuits for sorting $m$-bit integers of size $O(nm^2)$ and depth $O(\log(n) + m \log(m))$. Pending some unexpected breakthrough this size seems optimal. The depth is provably optimal whenever $m = O(\log(n) / \log\log(n))$.

Asharov et al. [3] solve an even more general problem, as their circuits partially sort $n$ numbers each of $m$ bits by their first $k$ bits using a circuit of size $O(nmk (1 + \log^*(n) - \log^*(m))^{2+\varepsilon})$. We improve on this result as well by presenting circuits that sort $m$-bit integers according to their first $k$ bits of size $O(nmk (1 + \log^*(n) - \log^*(m)))$ and depth $O(\log^3(n))$. Our small circuits of poly-logarithmic depth answer some of the open questions of Asharov et al. [3]. We state our results in the next section.

We provide a family of boolean circuits that sort $m$-bit strings. Our circuits are smaller than the circuits directly derived from the AKS sorting network, and they improve on the result of Asharov et al. [3]. Our circuits achieve optimal logarithmic depth whenever $m \log(m) \le \log(n)$. Pending some unexpected breakthrough, their size also seems optimal.

Theorem 1.

For any integers $n, m \ge 1$ there is a size $O(nm^2)$ and depth $O(\log(n) + m \log(m))$ circuit that sorts $n$ integers of $m$ bits each.

For $m \ge \Omega(\log(n))$, the existence of such a circuit directly follows from AKS sorting networks. Our contribution is the construction of such circuits for $m \le o(\log(n))$. Our construction also uses a sorting network as a building block. We use the AKS sorting network as one of our primitives but in principle, we could use any sorting network or sorting circuit. In particular, we could use any circuit sorting $n$ numbers of $\log(n)$ bits each in our construction. Any improvement of the asymptotic complexity of sorting $\log(n)$-bit numbers would give us improved complexity of sorting short numbers.

The main idea behind our construction is to compress the input by computing the number of occurrences of each $m$-bit integer. This gives a vector of $2^m$ integers, each of size $O(\log(n))$. Decompressing this vector back gives the sorted input. Combining the counting and decompressing circuits gives us a circuit that sorts. The main technical lemma is our counting circuit, which is of independent interest.

Lemma 2.

For any integers $n, m \ge 1$ where $m \le \log(n)/10$ there is a circuit

$$\mathrm{FAST\_COUNT}_{n,m} : \{0,1\}^{n \cdot m} \to \{0,1\}^{\lceil 1 + \log(n) \rceil 2^m}$$

which given a sequence of $n$ strings of $m$ bits each outputs the number of occurrences of each possible $m$-bit string among the inputs, that is, for input $x_1, x_2, \dots, x_n \in \{0,1\}^m$ it outputs $n_{0^m}, n_{0^{m-1}1}, \dots, n_{1^m}$ where for each string $y \in \{0,1\}^m$, $n_y \in \{0,1\}^{\lceil 1 + \log(n) \rceil}$ represents $|\{ j \in [n] \mid x_j = y \}|$ in binary. The size of the circuit $\mathrm{FAST\_COUNT}_{n,m}$ is $O(nm^2)$ and its depth is $O(\log(n) + m \log(m))$.

We also provide a family of boolean circuits which sort the input integers by their first $k$ bits only. One can view this as sorting (key, value) pairs, where keys have $k$ bits and values have $m - k$ bits. The special case of $k = 1$ is closely related to routing in super-concentrators (see Section 1.2), and we use super-concentrators of Pippenger [6] as our building block. We get a size improvement over the result of Asharov et al. [3] while also achieving poly-logarithmic depth.

Theorem 3.

For any integers $n, m, k \ge 1$ where $k \le m$ and $k \le \log(n)/11$ there is a circuit

$$\mathrm{SORT}_{n,m,k} : \{0,1\}^{nm} \to \{0,1\}^{nm}$$

which partially sorts $n$ numbers each of $m$ bits by their first $k$ bits. The circuit $\mathrm{SORT}_{n,m,k}$ has size $O(knm (1 + \log^*(n) - \log^*(m)))$ and depth $O(\log^3(n))$.

One can take AKS sorting networks and turn them into circuits of size $O(nm \log(n))$ and depth $O(\log(m) \log(n))$. For $m = o(\log(n))$ this is sub-optimal as shown by Asharov et al. [3]. Asharov et al. show how to reduce the problem of sorting $m$-bit integers according to the first $k$ bits to the problem of sorting $m$-bit integers according to just a single bit. Sorting according to a single bit is essentially equivalent to routing in super-concentrators.

Super-concentrators were originally studied by Valiant with the aim of proving circuit lower bounds. A super-concentrator is a graph with two disjoint subsets of vertices $A, B \subseteq V(G)$, called inputs and outputs, with the property that for any sets $S \subseteq A$ and $T \subseteq B$ of the same size there is a set of vertex disjoint paths from each vertex of $S$ to some vertex of $T$. Pippenger [6] constructs super-concentrators with a linear number of edges and an algorithm that on input describing $S$ and $T$ outputs the list of edges forming the disjoint paths between $S$ and $T$. This can be turned into a circuit of size $O(n \log(n))$ and depth $O(\log^2(n))$.

The result of Pippenger [6] can be used to build a circuit sorting by one bit, but the circuit will be larger than we want (see Corollary 18). Thus, Asharov et al. [3] used the technique of Pippenger rather than his result to design a circuit sorting by one bit, and iterated it to sort by $k$ bits. Our technique differs substantially from that of Asharov et al., yet we use the circuits from AKS networks and from Pippenger's super-concentrators as black boxes.

To sort $m$-bit integers for $2^m \ll n$ our approach is to count the number of occurrences of each number in the input. This compresses the input from $nm$ bits into $2^m \log(n)$ bits.
We can then decompress the vector back to get the desired output. So the main challenge is to construct counting (compressing) circuits of size $O(nm^2)$. Interestingly, we use the sorting circuits derived from AKS networks to do that. But to avoid the size blow-up we do not use them on all of the integers at once but on blocks of integers of size $2^{4m}$. Then the $O(\log(n))$ overhead of the circuits turns into the acceptable $O(m)$ overhead. Each sorted block is then subdivided into parts of size $2^{2m}$. Clearly, most parts in each block will be monochromatic, that is, they will contain copies of the same integer. There will be at most $2^m$ non-monochromatic parts. We move these parts within a block to one side using another application of the AKS sorting circuit. Then we can afford to build a fairly expensive counting circuit for the small fraction of non-monochromatic parts, while cheaply counting the monochromatic parts. Summing the results by a linear size circuit gives us the desired compression. Our decompression essentially mirrors the compression.

We also design a circuit to sort according to a single bit, improving the parameters of Asharov et al. [3]. We take the circuit of Pippenger as a basis and apply it iteratively to larger and larger blocks of inputs. Again we start from blocks of size $2^{O(m)}$, and increase the size of the blocks exponentially at each iteration. We use Pippenger's circuit to sort each block by the bit. When we split the block into parts, only one will be non-monochromatic. Merging multiple blocks into one gives a mega-block with only a small fraction of non-monochromatic parts. These non-monochromatic parts can be separated from monochromatic ones, re-sorted, and re-partitioned to give only one non-monochromatic part in the mega-block. Each part takes on the role of an "$m$"-bit integer in the next iteration. Iterating this process leads to the desired result.

To sort according to the first $k$ bits we use the one-bit sorting similarly to Asharov et al. [3].
Thanks to our efficient sorting circuits for $m$-bit integers, which we use to sort the $k$-bit keys, we can avoid the use of median finding circuits.

Organization.

In the next section we review our notation. We provide basic construction tools including naïve constructions of counting and decompression circuits in Section 3. In Section 4 we recall basic facts on AKS sorting networks and related sorting circuits. In Section 5 we prove our main result by constructing efficient counting and decompression circuits. Finally, we provide a construction of partial sorting circuits for Theorem 3 in Section 6. Some of the proofs are deferred to the appendix.

In this paper $\mathbb{N}$ denotes the set of natural numbers, and for $1 \le a \le b \in \mathbb{N}$, $[a, b] = \{a, a+1, \dots, b\}$ and $[a] = \{1, \dots, a\}$. All logarithms are base two unless stated otherwise. For $m \in \mathbb{N}$, $\{0,1\}^m$ is the set of all binary strings of length $m$. A string $x \in \{0,1\}^m$, $x = x_1 x_2 \cdots x_m$, represents the number $\sum_{j \in [m]} x_j 2^{m-j}$ in binary, and we often identify the string with that number. (As the same integer has multiple binary representations differing in the number of leading zeroes, the number of leading zeroes should be clear from the context.) The most significant bit of $x = x_1 x_2 \cdots x_m$ is $x_1$ and the least significant bit of $x$ is $x_m$. The symbol $\circ$ denotes the concatenation of two strings. For strings $x, y \in \{0,1\}^m$, $x \oplus y$ denotes the bit-wise XOR of $x$ and $y$, $x \wedge y$ denotes the bit-wise AND, and $x \vee y$ the bit-wise OR.

We assume the reader is familiar with boolean circuits (see for instance the book of Jukna [5]). We assume boolean circuits consist of gates computing binary AND and OR, and unary gates computing negation. For us, boolean circuits might have multiple outputs, so a circuit with $n$ inputs and $m$ outputs computes a function $f : \{0,1\}^n \to \{0,1\}^m$. We usually index a circuit family by multiple integral parameters. Inputs and outputs of boolean circuits are often interpreted as sequences of substrings, e.g., a circuit $C_{n,m} : \{0,1\}^{nm} \to \{0,1\}^{nm}$ is viewed as taking $n$ binary strings of length $m$ as its input, and similarly for its output. We say a circuit family $(C_n)_{n \in \mathbb{N}}$ is uniform if there is an algorithm that on input $1^n$ outputs the description of the circuit $C_n$ in time polynomial in $n$.

3 Preliminaries

Here we review some of the circuits for basic primitives that we will use in our later constructions. Most of them are well known facts; for the others we provide proofs in the appendix.

Lemma 4 (Addition). There is a uniform family of boolean circuits $\mathrm{ADD}_m : \{0,1\}^{2m} \to \{0,1\}^{m+1}$ that given $x, y \in \{0,1\}^m$ representing two numbers in binary outputs their sum $x + y \in \{0,1\}^{m+1}$. The circuit $\mathrm{ADD}_m$ has size $\Theta(m)$ and depth $\Theta(\log(m))$.

Lemma 5 (Subtraction). There is a uniform family of boolean circuits $\mathrm{SUB}_m : \{0,1\}^{2m} \to \{0,1\}^{m}$ that given $x, y \in \{0,1\}^m$ representing two numbers in binary outputs the absolute value of their difference $|x - y| \in \{0,1\}^m$. The circuit $\mathrm{SUB}_m$ has size $\Theta(m)$ and depth $\Theta(\log(m))$.

Lemma 6 (Summation). There is a uniform family of boolean circuits $\mathrm{SUM}_{n,m} : \{0,1\}^{n \cdot m} \to \{0,1\}^{\lceil \log(n) \rceil + m}$ that given $x_1, x_2, \dots, x_n \in \{0,1\}^m$ interpreted as $n$ numbers, each of $m$ bits, outputs their sum $\sum_{j=1}^{n} x_j$. The circuit $\mathrm{SUM}_{n,m}$ has size $\Theta(nm)$ and depth $\Theta(\log(n) + \log(m))$.

Lemma 7 (Comparator). There is a uniform family of boolean circuits $\mathrm{SWITCH}_m : \{0,1\}^{2m} \to \{0,1\}^{2m}$ that given two numbers $x, y \in \{0,1\}^m$ outputs these two numbers sorted as integers, i.e., $\min(x, y) \circ \max(x, y)$. The size of the circuit $\mathrm{SWITCH}_m$ is $\Theta(m)$ and its depth is $\Theta(\log(m))$.

A technique similar to the proof of the next lemma will also be used later in the proofs of Lemma 2 and Lemma 16 in order to achieve smaller circuit size. The main idea is to split inputs into smaller blocks and process the blocks independently by smaller circuits. We provide the proof in the appendix.

Lemma 8 (Binary to unary). There is a uniform family of boolean circuits $\mathrm{ONES}_b : \{0,1\}^{b+1} \to \{0,1\}^{2^b}$ such that for any number $x \in \{0,1\}^{b+1}$ represented in binary the output consists of $x$ ones followed by $2^b - x$ zeroes, provided $x \le 2^b$. The circuit $\mathrm{ONES}_b$ has size $\Theta(2^b)$ and depth $\Theta(\log(b))$.

We will need a primitive that counts the number of occurrences of each string in the input. A counting circuit similar to Lemma 9 appears in Appendix A of the paper of Asharov et al. [3].
The construction of the counting circuit is rather straightforward: we just compare each input string $x_j$ with a given string $y$, getting an indicator bit set to one for equality and to zero for inequality, and then sum the indicator bits. We provide a proof in the appendix.

Lemma 9 (Count). There is a uniform family of boolean circuits $\mathrm{COUNT}_{n,m} : \{0,1\}^{n \cdot m} \to \{0,1\}^{2^m \lceil 1 + \log(n) \rceil}$ that given $x_1, x_2, \dots, x_n \in \{0,1\}^m$ counts the number of occurrences of each $y \in \{0,1\}^m$ among the inputs, i.e., the circuit outputs $n_{0^m}, n_{0^{m-1}1}, \dots, n_{1^m}$ where for each $y \in \{0,1\}^m$, $n_y$ represents in binary $|\{ j \in [n] \mid y = x_j \}|$ using $\lceil 1 + \log(n) \rceil$ bits. The size of the circuit $\mathrm{COUNT}_{n,m}$ is $O(nm 2^m)$ and its depth is $O(\log(n) + \log(m))$.

We will also need an inverse operation for the counting. To construct a circuit that decompresses the counts we would like to first compute the interval where a given string $x$ should appear and then get indicator bits for this interval. We can compute the interval using prefix sums of the counts. To get the indicator bits for the interval we utilize the circuit from Lemma 8, which outputs a given number of bits set to one followed by bits set to zero. The full proof is in the appendix.

Lemma 10 (Decompress). There is a uniform family of boolean circuits

$$\mathrm{DECOMPRESS}_{n,m} : \{0,1\}^{\lceil 1 + \log(n) \rceil 2^m} \to \{0,1\}^{n \cdot m}$$

that decompresses its input, that is, on input numbers $n_{0^m}, n_{0^{m-1}1}, \dots, n_{1^m}$, each represented in binary by $\lceil 1 + \log(n) \rceil$ bits, where $\sum_{x \in \{0,1\}^m} n_x = s \le n$, it outputs the string

$$(\underbrace{0 \cdots 0}_{m})^{n_{0 \cdots 0}} \circ (\underbrace{0 \cdots 0}_{m-1} 1)^{n_{0 \cdots 01}} \circ (\underbrace{0 \cdots 0}_{m-2} 10)^{n_{0 \cdots 010}} \circ (\underbrace{0 \cdots 0}_{m-2} 11)^{n_{0 \cdots 011}} \circ \cdots \circ (\underbrace{1 \cdots 1}_{m})^{n_{1 \cdots 1}} \circ (0^m)^{n - s}.$$

When $s > n$ the output might be arbitrary. The size of the circuit $\mathrm{DECOMPRESS}_{n,m}$ is $O(nm 2^m + 2^{2m} \log(n))$ and its depth is $O(m + \log(\log(n)))$.

4 Sorting circuits from AKS sorting networks

In this section we recall the construction of circuits for sorting from the Ajtai-Komlós-Szemerédi sorting networks. They will serve as the basic primitive for our later constructions.

Sorting networks.

Sorting networks model parallel algorithms that sort values using only comparisons. A sorting network consists of $n$ wires and $s$ comparators. The wires extend from left to right in parallel. Each wire carries an integer from left to right. Any two wires can be connected by a comparator at any point along their length. The comparator swaps the values carried along the two wires if the higher wire carries a higher value at that point; otherwise it has no effect. The sorting network should be such that when we input arbitrary integers to the wires on the left, the integers always exit in sorted order from top to bottom. The depth of a sorting network is the maximum number of comparators a value can encounter on its way. For a formal definition see, e.g., [1]. Observe that if the depth of a sorting network is $d$ and the number of inputs is $n$ then there are at most $s \le nd$ comparators. Ajtai, Komlós and Szemerédi [1] established the existence of sorting networks of logarithmic depth.

Theorem 11 (AKS [1]). For any integer $n \ge 1$, there is a sorting network for $n$ integers of depth $O(\log(n))$.

Sorting circuits.

Here we give a precise definition of sorting by a circuit. First we consider a circuit sorting $n$ integers, each of them $m$ bits long.

Definition 12 (Sort). Let $n, m \in \mathbb{N}$, and let $(C_{n,m})$ be a family of boolean circuits. We say that the circuit $C_{n,m} : \{0,1\}^{nm} \to \{0,1\}^{nm}$ sorts its input interpreted as $n$ integers $x_1, x_2, \dots, x_n$, each represented by $m$ bits, if it outputs $y_1, y_2, \dots, y_n \in \{0,1\}^m$ such that:

1. The outputs are sorted: For any $i < j \in [n]$, $y_i \le y_j$.
2. The inputs and outputs form the same multiset: For each $j \in [n]$, $|\{ i \in [n] \mid y_i = x_j \}| = |\{ i \in [n] \mid x_i = x_j \}|$.

An immediate consequence of the existence of AKS sorting networks is the existence of shallow sorting circuits, since by Lemma 7, each comparator can be replaced by a small circuit:

Corollary 13.

There is a family of boolean circuits $\mathrm{AKS}_{n,m} : \{0,1\}^{n \cdot m} \to \{0,1\}^{n \cdot m}$ that on an input $x_1, x_2, \dots, x_n \in \{0,1\}^m$ sorts these numbers. The size of the circuit $\mathrm{AKS}_{n,m}$ is $O(nm \log(n))$ and its depth is $O(\log(n) \log(m))$.

We also need circuits that sort the $n$ input integers, each of $m$ bits, by the $k$ most significant bits where $k < m$. Such sorting can be thought of as sorting (key, value) pairs, where keys are $k$ bits long and values are $(m - k)$ bits long. Formally it can be defined as follows:

Definition 14 (Partial Sort). Let $n, m, k \in \mathbb{N}$ be such that $k < m$, and let $(C_{n,m,k})$ be a family of boolean circuits. We say that the circuit $C_{n,m,k} : \{0,1\}^{nm} \to \{0,1\}^{nm}$ partially sorts by the first $k$ bits its input interpreted as $n$ integers $x_1, x_2, \dots, x_n$, each represented by $m$ bits, if it outputs $y_1, y_2, \dots, y_n \in \{0,1\}^m$ such that:

1. The outputs are partially sorted: For any $i < j \in [n]$, $(y_i)_1 (y_i)_2 \cdots (y_i)_k \le (y_j)_1 (y_j)_2 \cdots (y_j)_k$.
2. The inputs and outputs form the same multiset: For each $j \in [n]$, $|\{ i \in [n] \mid y_i = x_j \}| = |\{ i \in [n] \mid x_i = x_j \}|$.

Using a circuit of size $O(m)$ and depth $O(\log k)$ implementing a comparator which swaps two $m$-bit integers based only on the first $k$ bits we get the following variant of the previous corollary.

Corollary 15.

There is a family of boolean circuits $\mathrm{PARTIAL\_AKS}_{n,m,k} : \{0,1\}^{n \cdot m} \to \{0,1\}^{n \cdot m}$, for $k \le m$ and $k \le \log(n)$, that on input $x_1, x_2, \dots, x_n \in \{0,1\}^m$ partially sorts these numbers according to their $k$ most significant bits. That is, if $y_i, y_j$ are two output numbers where $i < j$ then we have $\lfloor y_i / 2^{m-k} \rfloor \le \lfloor y_j / 2^{m-k} \rfloor$. The size of the circuit $\mathrm{PARTIAL\_AKS}_{n,m,k}$ is $O(nm \log(n))$ and its depth is $O(\log(n) \log(k))$.

5 Sorting $n$ Binary Strings of Length $m$

Here we present a sorting circuit for short numbers. The construction consists of two circuits. The first circuit counts the number of occurrences of various strings (as stated in Lemma 2) and the second circuit decompresses these counts. Both of these constructions rely heavily on the following technique: we divide the problem into blocks which can be efficiently sorted using the AKS-based circuit. These blocks will be of size between $2^{O(m)}$ and $n / 2^{O(m)}$ where $m$ is the binary length of the input integers.

Thus when we sort the numbers inside each block and subdivide the block into parts, then by the pigeon-hole principle, most of the parts will be monochromatic (containing copies of a single string only). We can then separately count the strings in monochromatic parts (count the first string and then multiply that by the length of the part) and in the non-monochromatic parts (there are not that many strings in total in non-monochromatic parts). However, a priori we do not know which parts will be monochromatic and which will not. To save on circuitry we use sorting (on whole parts) to move the non-monochromatic parts aside. We build the (expensive) counting circuits only for non-monochromatic parts.

Proof of Lemma 2.

For the sake of simplicity let us assume that $n$ is a power of two, so it is divisible by $2^{4m}$. (By our assumption $n \ge 2^{10m}$; thus if $n$ is not a power of two, take the circuit for the closest power of two larger than $n$ and feed ones for the extra input bits.) We partition the input into $n / 2^{4m}$ blocks each consisting of $2^{4m}$ numbers. We sort each block by the circuit $\mathrm{AKS}_{2^{4m},m}$ of size $O(m 2^{4m} \log(2^{4m})) = O(m^2 2^{4m})$ and depth $O(m \log(m))$ as given in Corollary 13. Thus for this phase we need a circuit of total size $O(nm^2)$.

Then we subdivide each block into $2^{2m}$ parts each consisting of $2^{2m}$ numbers. Observe that most of these parts are monochromatic: a part is monochromatic if it contains $2^{2m}$ copies of a single $m$-bit number. We can upper bound the number of non-monochromatic parts by $2^m$. We can add a single indicator bit to each part indicating whether this part is monochromatic. As the parts are sorted it is enough to compare the first and last number in each part and set the bit to 1 if the numbers are equal and to 0 otherwise. We sort the parts prefixed by their indicator bit using the circuit $\mathrm{PARTIAL\_AKS}_{2^{2m},\, 1 + m 2^{2m},\, 1}$ from Corollary 15 to move all non-monochromatic parts to the front of each block. Thus the total size of the circuit sorting parts inside each block is $O\big(\tfrac{n}{2^{4m}} (2^{2m}) (1 + m 2^{2m})\, m\big) = O(nm^2)$ and its depth is $O(m)$. We call the first $2^m$ parts of each block potentially non-monochromatic. The other parts are definitely monochromatic.

From each definitely monochromatic part we take the first $m$-bit number and we count them. This can be done by the circuit $\mathrm{COUNT}_{\frac{n}{2^{4m}} (2^{2m} - 2^m),\, m}$ from Lemma 9 of size $O\big(\big(\tfrac{n}{2^{2m}} - \tfrac{n}{2^{3m}}\big) m 2^m\big) \le O(nm)$ and depth $O(\log(n) + \log(m))$.
By multiplying each count by $2^{2m}$ (that is, by appending $2m$ zeroes) we get the number of occurrences of each number in the definitely monochromatic parts.

As there are relatively few (exactly $\tfrac{n}{2^{4m}} 2^m 2^{2m}$) numbers overall in potentially non-monochromatic parts we can use the circuit $\mathrm{COUNT}_{n / 2^m,\, m}$ from Lemma 9 to count those numbers by a circuit of size $O\big(\tfrac{n}{2^m} m 2^m\big) \le O(nm)$ and depth $O(\log(n) + \log(m))$.

Thus we get two vectors of counts, one for numbers in potentially non-monochromatic parts and one for numbers in definitely monochromatic parts. Finally, we add the two vectors of $2^m$ numbers, each consisting of at most $\lceil 1 + \log(n) \rceil$ bits, to get the resulting counts. This uses a circuit of size $O(2^m \log(n)) = O(n)$ and depth $O(\log\log(n))$. Thus, the overall size of the circuit is $O(nm^2)$ and its depth is $O(\log(n) + m \log(m))$. □

Lemma 16.

For integers $n, m \ge 1$ where $m \le \log(n)/11$, there is a family of boolean circuits

$$\mathrm{FAST\_DECOMPRESS}_{n,m} : \{0,1\}^{\lceil 1 + \log(n) \rceil 2^m} \to \{0,1\}^{n \cdot m}$$

that decompresses its input as in Lemma 10. The size of $\mathrm{FAST\_DECOMPRESS}_{n,m}$ is $O(nm^2)$ and its depth is $O(m \log(m) + \log(\log(n)))$.

The construction of the decompression circuit mirrors the counting circuit, although it is somewhat simpler and uses a different choice of parameters. We separately decompress monochromatic blocks (by decompressing just a single string from each block and then creating the right number of copies) and the strings from non-monochromatic blocks (as there are not many of those). We then use partial sorting to rearrange the blocks in the proper order to construct a sorted sequence.

Proof.

For the sake of simplicity let us assume that 𝑛 is a power of two and let us set 𝑘 = 𝑛 / 𝑚 . (Thus 𝑘 is aninteger.) We will think of the output as partitioned into 2 𝑚 blocks of size 𝑘 . As in the proof of Lemma 10 wecompute the preﬁx sums 𝑝 𝑥 = Õ 𝑦 ∈{ , } 𝑚 : 𝑦<𝑥 𝑛 𝑦 for each 𝑥 ∈ { , } 𝑚 𝑝 𝑚 = 𝑛 . (Here, we identify 𝑚 -bit strings 𝑥 and 𝑦 with integers they represent.) We can compute each 𝑝 𝑥 using the circuit SUM 𝑚 , + log ( 𝑛 ) , thus computing all of them using a circuit of size O( log ( 𝑛 ) 𝑚 ) ≤ O( 𝑛 ) (bythe assumption 𝑚 ≤ log ( 𝑛 )/

11) and depth O( 𝑚 + log ( log ( 𝑛 ))) . Thus the string 𝑥 ∈ { , } 𝑚 should appear atoutput positions [ 𝑝 𝑥 + , 𝑝 𝑥 + ] . For any 𝑥 ∈ { , } 𝑚 we set: 𝑟 𝑥 = (( 𝑘 − ( 𝑝 𝑥 mod 𝑘 )) mod 𝑘 ) + ( 𝑝 𝑥 + mod 𝑘 ) 𝑞 𝑥 = 𝑛 𝑥 − 𝑟 𝑥 𝑘 The meaning is that if we partition the output into blocks of 𝑘 consecutive numbers, then for any 𝑥 ∈ { , } 𝑚 thenumber 𝑟 𝑥 tells the number of times the string 𝑥 appears in non-monochromatic blocks. (These occurrences arelocated in at most two non-monochromatic blocks.) The number 𝑞 𝑥 tells us in how many monochromatic blocksthe string 𝑥 ∈ { , } 𝑚 appears. Observe that 𝑞 𝑥 is an integer. Since 𝑛 is a power of two, so is 𝑘 , furthermore, 𝑘 is ﬁxed for given 𝑛 and 𝑚 , and thus computing mod 𝑘 and division by 𝑘 corresponds to selecting appropriate bitsfrom the binary representation of numbers. All numbers 𝑝 𝑥 , 𝑞 𝑥 and 𝑟 𝑥 are integers represented by 1 + log ( 𝑛 ) bits.Hence, each 𝑞 𝑥 and 𝑟 𝑥 can be computed from 𝑛 𝑥 and 𝑝 𝑥 by one circuit ADD + log ( 𝑛 ) and two SUB + log ( 𝑛 ) . Thecircuit computing values 𝑞 𝑥 and 𝑟 𝑥 for all 𝑥 has total size O( 𝑚 log ( 𝑛 )) and depth O( log log ( 𝑛 )) .The following holds: 𝑛 𝑥 = 𝑘𝑞 𝑥 + 𝑟 𝑥 Õ 𝑥 ∈{ , } 𝑚 𝑞 𝑥 = Õ 𝑥 ∈{ , } 𝑚 𝑛 𝑥 − 𝑟 𝑥 𝑘 ≤ 𝑛 / 𝑘 = 𝑚 Õ 𝑥 ∈{ , } 𝑚 𝑟 𝑥 ≤ 𝑘 𝑚 = 𝑛 / 𝑚 We use circuit DECOMPRESS 𝑚 ,𝑚 ( 𝑞 𝑚 , 𝑞 𝑚 − , . . . , 𝑞 𝑚 ) from Lemma 10 of size O (cid:0) 𝑚 𝑚 (cid:1) and depth O ( 𝑚 ) to decompress monochromatic blocks. We then just copy each resulting number 𝑘 times to create sortedmonochromatic blocks. Last 2 𝑚 − Í 𝑥 ∈{ , } 𝑚 𝑞 𝑥 blocks contain zero padding corresponding to the numbers innon-monochromatic blocks. They will be merged with the non-monochromatic blocks obtained next.In order to properly match the non-monochromatic blocks to the padded zeroes we adjust the count 𝑟 𝑚 : 𝑟 ′ 𝑚 = (cid:16) 𝑛 / 𝑚 (cid:17) − Õ 𝑥 ∈{ , } 𝑚 : 𝑥 ≠ 𝑚 𝑟 𝑥 using circuit SUM 𝑚 , + log ( 𝑛 ) and SUB + log ( 𝑛 ) of size O( 𝑛 ) and depth O( 𝑚 + log log ( 𝑛 )) . We use the circuitDECOMPRESS 𝑛 / 𝑚 ,𝑚 ( 𝑟 ′ 𝑚 , 𝑟 𝑚 − , . . . 
, 𝑟 𝑚 ) from Lemma 10 to decompress the non-monochromatic blocks.The circuit is of size O (cid:0) (cid:0) 𝑛 / 𝑚 (cid:1) 𝑚 𝑚 + 𝑚 log (cid:0) 𝑛 / 𝑚 (cid:1)(cid:1) ≤ O (cid:0) 𝑛𝑚 / 𝑚 (cid:1) and of depth O( 𝑚 + log ( log ( 𝑛 ))) .(Here, we used our assumption 𝑚 ≤ log ( 𝑛 )/

11, to bound 𝑛 ≥ 𝑚 and 2 𝑚 ≤ 𝑛 / / 𝑚 .)Finally, we compute the bit-wise OR of the last 2 𝑚 + blocks of the output from the previous step (monochromaticdecompression) with the current output (non-monochromatic decompression). This way we get a sequence of 𝑛 numbers partitioned into blocks where each block corresponds to one of the blocks in the desired output. However,we still need to rearrange the blocks in the proper order. We will use partial sorting of the whole blocks to do that.For a given block let 𝑥 be the ﬁrst number in that block. We preﬁx the block by a number 2 𝑥 (represented by 𝑚 + 𝑥 + 𝑘 numbers is preﬁxed by an 𝑚 + O( 𝑚 𝑚 ) = 𝑂 ( 𝑛 ) and depth O( log ( 𝑚 )) . We then use the PARTIAL_AKS 𝑚 , ( 𝑚 + )+ 𝑘𝑚,𝑚 + circuit of size O( 𝑛𝑚 ) and depth O( 𝑚 log ( 𝑚 )) to sort the blocks. Finally, we ignore the 𝑚 + (cid:3) Proof of Theorem 1.

This is just a combination of Lemma 2 with Lemma 16. □
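The compress-then-decompress idea behind Theorem 1 can be illustrated by a short sequential sketch. This is only an analogy in Python, not a circuit construction: the function names `count`, `decompress` and `sort_short_integers` are hypothetical, and the code ignores circuit size and depth entirely.

```python
def count(xs, m):
    """Compress: counts[y] is the number of inputs equal to the m-bit value y."""
    counts = [0] * (1 << m)
    for x in xs:
        counts[x] += 1
    return counts


def decompress(counts, n):
    """Expand the counts in increasing order of value, then pad with zeros to length n."""
    out = []
    for value, c in enumerate(counts):
        out.extend([value] * c)
    out.extend([0] * (n - len(out)))  # zero padding when the counts sum to s < n
    return out


def sort_short_integers(xs, m):
    """Sort m-bit integers by counting and then decompressing the counts."""
    return decompress(count(xs, m), len(xs))
```

For example, `sort_short_integers([3, 0, 2, 3, 1, 0], 2)` first compresses the input to the count vector `[2, 1, 1, 2]` and then expands it into the sorted sequence.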

Observe that the proofs of Lemma 2 and Lemma 16 do not depend on using specifically the AKS sorting. In particular, for the case of Lemma 2, if there is a circuit that sorts input numbers whose size is linear in the number of input bits, then there is a linear size circuit that counts these numbers.
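The block structure of the counting argument, with the sorter treated as a replaceable black box, can be sketched sequentially as follows. This is a hedged illustration rather than the circuit itself: the name `fast_count` and the `sort` parameter are hypothetical, the block size $2^{4m}$ and part size $2^{2m}$ are assumed parameters, and Python's built-in `sorted` stands in for the sorting circuit.

```python
def fast_count(xs, m, sort=sorted):
    """Count occurrences of each m-bit value via sorted blocks and parts.

    Assumes len(xs) is a multiple of the block size 2^(4m).
    """
    block, part = 1 << (4 * m), 1 << (2 * m)
    assert len(xs) % block == 0
    mono_reps, leftovers = [], []
    for b in range(0, len(xs), block):
        blk = sort(xs[b:b + block])  # any sorter can be plugged in here
        parts = [blk[p:p + part] for p in range(0, block, part)]
        # Indicator is True for a monochromatic part; sorting by it moves
        # the non-monochromatic parts (indicator False) to the front.
        parts.sort(key=lambda p: p[0] == p[-1])
        front, rest = parts[:1 << m], parts[1 << m:]
        assert all(p[0] == p[-1] for p in rest)  # definitely monochromatic
        for p in front:  # potentially non-monochromatic parts
            leftovers.extend(p)
        mono_reps.extend(p[0] for p in rest)  # one representative per part
    counts = [0] * (1 << m)
    for x in mono_reps:
        counts[x] += part  # each representative stands for a whole part
    for x in leftovers:
        counts[x] += 1  # the few leftover numbers are counted one by one
    return counts
```

Only one number per definitely monochromatic part is inspected, which is the source of the savings in the proof; the numbers in the first $2^m$ parts of each block are counted individually.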

6 Partial Sorting by the First $k$ Bits in Poly-logarithmic Depth

Here we design a family of boolean circuits that partially sorts by the first $k$ bits out of $m$ bits and which is asymptotically smaller than $\mathrm{PARTIAL\_AKS}_{n,m,k}$. We will need super-concentrators for our construction.

A directed acyclic graph $G = (V, E, A, B)$, where $V$ is the set of vertices, $E$ is the set of directed edges, and $A$ and $B$ are disjoint subsets of vertices of the same size, is a super-concentrator if the following hold: The vertices in $A$ (inputs) have in-degree zero, the vertices in $B$ (outputs) have out-degree zero, and for any $S \subseteq A$ and any $T \subseteq B$ with $|S| = |T|$ there is a set of pairwise vertex disjoint paths connecting each vertex from $S$ to some vertex in $T$. We parametrize the super-concentrator by the number of input vertices $n$, and we measure its size by the number of edges. We want the graph to have as few edges as possible. The depth of the super-concentrator is the number of edges on the longest directed path.

Pippenger [6] shows a construction of super-concentrators of linear size and logarithmic depth. He constructs a family of super-concentrators $S_n$, for $n$ being the number of inputs, where the in-degree and out-degree of each vertex is bounded by some universal constant, the number of edges is linear in $n$, and the depth is $O(\log(n))$. Moreover, there are finite automata which for any $S \subset A$, $T \subset B$ with $|S| = |T|$, when put on the vertices of the super-concentrator, find the set of vertex disjoint paths from $S$ to $T$ in $O(\log(n))$ iterations, each taking $O(\log(n))$ steps, for the total number of $O(n)$ steps of the automata. We describe this construction using the language of circuits. The circuit on input of the characteristic vectors of $S$ and $T$ computes the set of $|T|$ vertex disjoint paths connecting $S$ and $T$. The circuit outputs the characteristic vector of the set of edges participating in the paths.

Theorem 17 (Pippenger [6]).
There is a family of super-concentrators $S_n$ as described above and boolean circuits ROUTE$_n : \{0,1\}^{2n} \to \{0,1\}^{|S_n|}$ of size $O(n \log(n))$ and depth $O(\log^2(n))$ that, on input the characteristic vector of any set $T \subseteq [n]$ and the characteristic vector of any $S \subseteq [n]$ where $|T| = |S|$, output the characteristic vector of edges that form $|T|$ vertex-disjoint paths between $S$ and $T$.

By routing $m$ bits along each path in the super-concentrator we can use the above circuit to build a circuit that partially sorts $m$-bit integers by their most significant bit.

Corollary 18.

There is a family of boolean circuits PIPPENGER_SORT$_{n,m,1} : \{0,1\}^{n \cdot m} \to \{0,1\}^{n \cdot m}$ that on input $x_1, x_2, \dots, x_n \in \{0,1\}^m$ partially sort these numbers according to their most significant bit. The size of the circuit PIPPENGER_SORT$_{n,m,1}$ is $O(nm + n \log(n))$ and its depth is $O(\log^2(n))$.

Proof.

We give a sketch of the proof. First, we will use the graph $S_n$ to get all inputs starting with 1 to the proper place. Then, using the same construction, we will move all inputs starting with 0 to the proper place. We transform the graph $S_n$ into a circuit by replacing each vertex of in-degree $d$ by a routing gadget (circuit) which takes $d$ $m$-bit inputs together with $d$ control bits, one bit for each of the $m$-bit inputs, and outputs the bit-wise OR of the inputs whose control bit is set to 1. Such a routing gadget of size $O(dm)$ and depth $O(\log d)$ can be easily constructed. If $(u, v)$ is the $j$-th incoming edge of $v$ in $S_n$, we connect the $j$-th block of $m$ input bits of the routing gadget corresponding to $v$ to the output of the routing gadget of $u$. The routing gadgets of input vertices of $S_n$ are connected directly to the appropriate inputs of the sorting circuit. The routing gadget will be used with at most a single control bit set to one, thus it will route the corresponding input.

It remains to calculate paths that will route the integers starting with 1 in the above circuit in the desired way. For that, we calculate the sum $s$ of the most significant bits by which we are sorting using SUM$_{n,1}$ from Lemma 6, we expand it back using ONES$_{\lceil \log(n) \rceil + 1}(s)$, and reverse it to get the characteristic vector of a set $T$ to which we want to route. Together with the most significant bits of each input integer (which form the characteristic vector of $S$ from which we route) we feed this as an input to ROUTE$_n$. The output bits of ROUTE$_n$ are connected to the appropriate control bits of our routing gadgets. The sorted output will be obtained as the output of the $n$ routing gadgets corresponding to the output vertices of $S_n$.

The size of ROUTE$_n$ is $O(n \log(n))$ and the total size of the circuits implementing the routing gadgets is $O(mn)$. These two terms dominate the overall size of the circuit. The depth of the circuit is dominated by the depth of ROUTE$_n$. □

We can use the above circuit in an iterative fashion to build a smaller circuit for the same primitive.
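The data flow of the proof sketch of Corollary 18 can be mimicked functionally in a few lines of Python. This is only a sketch under stated assumptions: the helper names (`routing_gadget`, `ones`, `partial_sort_by_msb`) are hypothetical, Python integers stand in for bundles of $m$ wires, and the vertex-disjoint paths that ROUTE$_n$ would compute are replaced by a direct pairing of sources with targets, so the super-concentrator itself is not modeled.

```python
def routing_gadget(inputs, controls):
    """Bit-wise OR of the inputs whose control bit is 1 (at most one is expected)."""
    out = 0
    for value, control in zip(inputs, controls):
        out |= value & -control  # -1 is the all-ones mask, 0 is the zero mask
    return out


def ones(s, n):
    """Unary expansion of s: s ones followed by n - s zeros."""
    return [1] * s + [0] * (n - s)


def partial_sort_by_msb(xs, m):
    """Move the m-bit integers with most significant bit 0 before those with bit 1."""
    n = len(xs)
    S = [(x >> (m - 1)) & 1 for x in xs]  # characteristic vector of the sources
    s = sum(S)                            # plays the role of SUM_{n,1}
    T = ones(s, n)[::-1]                  # reversed expansion: the last s positions

    # Assumption: sources are paired with targets directly instead of being
    # routed along vertex-disjoint paths of a super-concentrator.
    pairing = {}
    for bit in (1, 0):
        sources = [i for i in range(n) if S[i] == bit]
        targets = [j for j in range(n) if T[j] == bit]
        pairing.update(zip(targets, sources))

    # Each output position reads its input through a gadget with a one-hot control.
    return [
        routing_gadget(xs, [int(i == pairing[j]) for i in range(n)])
        for j in range(n)
    ]


print(partial_sort_by_msb([0b101, 0b001, 0b110, 0b010], m=3))  # [1, 2, 5, 6]
```

The second pass of the loop (for `bit == 0`) corresponds to the second use of $S_n$ in the proof, which moves the inputs starting with 0 to the remaining positions.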

Lemma 19.

There is a family of boolean circuits ITERATIVE_SORT$_{n,m,1} : \{0,1\}^{n \cdot m} \to \{0,1\}^{n \cdot m}$ that on input $x_1, x_2, \dots, x_n \in \{0,1\}^m$ partially sort these numbers according to their most significant bit. The size of the circuit ITERATIVE_SORT$_{n,m,1}$ is $O(nm(1 + \log^*(n) - \log^*(m)))$ and its depth is $O(\log^2(n))$.

Proof. Assume $m \le \log(n)/11$, otherwise use Corollary 18. We will build the circuit iteratively using the circuit from Corollary 18 for blocks of various sizes. We will start with small blocks of items and we will iteratively sort a larger and larger number of items organized into mostly monochromatic blocks. Without loss of generality we assume that $m$ is a power of two, and we will ignore the rounding issues. We will have two parameters $m_i$ and 𝑛 𝑖 = 𝑚 𝑖 , where $m_0 = m$ and 𝑚 𝑖 + = 𝑚 𝑖 for $i \ge 0$.

At iteration $i$, all the items will be partitioned into parts of consecutive numbers; each part will be either monochromatic, containing all zeros or all ones, or it will be mixed. (Here we refer to the most significant bits of the numbers in the part.) For each part we will maintain two indicator bits telling which of the three possibilities occurs: a mixed indicator, which is one if the part is mixed, and a color indicator, which specifies the highest-order bit of the integers if the part is monochromatic. (For the latter we could use the first bit of the first integer in the part.) At each iteration $i > 0$, $m_i$ will denote the number of items in each part. $n_i/m_i$ consecutive parts form a block, so each block contains $n_i$ items. The blocks partition the input. We will maintain an invariant that the fraction of mixed parts in each block is at most $2/m_i^2$.

At iteration 0 we apply PIPPENGER_SORT$_{n_0,m,1}$ to consecutive blocks of $n_0$ input integers. Afterwards, the block is partitioned into parts of size 𝑚 and for each part we determine its status by comparing the most significant bits of the first and last integer in the part. It is clear that each block of size $n_0$ contains at most one mixed part. As the number of parts in the block is 𝑚 , the fraction of mixed parts in each block is at most 2 / 𝑚 , and this is also true for blocks of size 𝑛 .

At iteration $i > 0$, we divide the current sequence of parts of size $m_i$ into blocks containing $n_i/m_i$ parts, and we proceed in three steps:

Step 1. Sort the parts in each block using PIPPENGER_SORT$_{n_i/m_i,\, 1 + m_i \cdot m,\, 1}$ according to the mixed indicator. Hence, all the mixed parts will move to the end of the block. There are at most $2n_i/m_i^3$ mixed parts in each block; the remaining parts must be monochromatic.

Step 2. In each block, sort all the $m$-bit integers in the last $2n_i/m_i^3$ parts according to their most significant bit using PIPPENGER_SORT$_{2n_i/m_i^2,\, m,\, 1}$. This sorts together all the integers in the mixed parts (and perhaps a few other parts). Repartition them into parts of $m_i$ consecutive numbers and determine their indicator bits. Only one of the parts should be mixed at this point. Swap it with the last part in the block. (We provide details of the swap later.)

Step 3. In each block, sort all the parts except for the last one according to their color indicator using PIPPENGER_SORT$_{(n_i/m_i) - 1,\, 1 + m_i \cdot m,\, 1}$. This moves all the parts of color 0 to the front. Repartition all the numbers in the block into parts of $m_{i+1}$ consecutive integers and determine their indicator bits, where the last part is marked as mixed. At most two of the new parts should be mixed at this point. Notice that out of 𝑚 𝑖 + parts in each block, at most two are marked as mixed, so the invariant applies. We can move to the next iteration.

We iterate the algorithm until $m_i \ge \log(n)/4$. Once $m_i \ge \log(n)/4$, the number of integers in mixed parts is at most $2n/m_i^2 \le O(n/\log^2(n))$; the remaining items are in monochromatic parts. At this point we cannot form a block of size $n_i$, but we can still perform the same type of actions as in Steps 1-3: we can bring the monochromatic parts forward as in Step 1, sort the last $32n/\log^2(n)$ integers belonging to the mixed parts, move the remaining mixed part to the end, sort the monochromatic parts, and swap the mixed part with the first monochromatic part of color 1.

To swap a single mixed part with the last part we can copy the mixed part into a buffer by AND-ing every part bit-wise with the indicator whether that is the mixed part, and OR-ing all the results together. In a similar fashion we can copy the last part into the now unused part by letting each part bit-wise copy to its place either its original content or the content of the last part, again conditioning on an appropriate indicator bit. Hence, the swap can be implemented by a circuit of size proportional to the total size of the parts and depth logarithmic in the number of parts.

Now we will bound the total size of the circuit we constructed. Step 1 requires $n/n_i$ circuits of size $O(n_i m + (n_i/m_i) \log(n_i/m_i)) = O(n_i m)$, as $\log(n_i) = O(m_i)$, and of depth at most $O(\log^2(n_i))$. Step 2 requires $n/n_i$ sorting circuits of size $O(m n_i/m_i^2 + (n_i/m_i^2) \log(n_i/m_i^2)) = O(n_i)$ and of depth at most $O(\log^2(n_i))$, together with a circuit of total linear size $O(n)$ to recalculate the parts and do the swaps. The last step requires the same amount of circuitry as the first step.

Hence, each step requires circuits of total size $O(nm)$. The same goes for the initial sort at iteration 0 and the final sorts at the end. As there are at most $\log^*(n) - \log^*(m)$ iterations, the resulting size is $O(nm(\log^*(n) - \log^*(m)))$.
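The swap by masking described in the proof can be illustrated as follows. This is a minimal sketch, not the authors' circuit: the function name `swap_mixed_with_last` is hypothetical, each part is assumed to be packed into one non-negative Python integer, and `is_mixed` is assumed to contain at most a single 1.

```python
def swap_mixed_with_last(parts, is_mixed):
    """Swap the single mixed part with the last part using only AND/OR masking."""
    # Copy the mixed part into a buffer: AND each part with its indicator,
    # then OR all the results together.
    buffer = 0
    any_mixed = 0
    for part, flag in zip(parts, is_mixed):
        buffer |= part & -flag  # -1 keeps the part, 0 erases it
        any_mixed |= flag

    last = parts[-1]
    # Every position keeps its own content unless it held the mixed part,
    # in which case it receives the content of the last part.
    result = [
        (last & -flag) | (part & ~(-flag))
        for part, flag in zip(parts, is_mixed)
    ]
    # The last position receives the buffer (or keeps itself if nothing is mixed).
    result[-1] = (buffer & -any_mixed) | (last & ~(-any_mixed))
    return result


print(swap_mixed_with_last([0b0000, 0b0110, 0b1111, 0b1111], [0, 1, 0, 0]))
# [0, 15, 15, 6]
```

Every output position depends on its own part, the last part, and one indicator bit, which matches the claim that the size is proportional to the total size of the parts; the OR that fills the buffer is what contributes the depth logarithmic in the number of parts.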
Each step requires a circuit of depth $O(\log^2(n_i))$; recall that by our choice 𝑛 𝑖 = 𝑚 𝑖 , thus log ( 𝑛 𝑖 ) = 𝑚 𝑖 . Since 𝑚 𝑖 + = 𝑚 𝑖 and for each $i$ we have $m_i \le \log(n)/4$, the total depth is dominated by the last iteration, where we use a circuit of depth $O(\log^2(n))$. □

Conclusion

We have provided improved sorting circuits. Our technique used in the proof of Theorem 1 can be viewed as information compression and decompression. This technique might prove useful for other related problems. We list some open problems:

• Most of our circuits are uniform. The non-uniform part is due to the use of the AKS circuits and Pippenger's super-concentrators. Can one make uniform circuits of the same size?
• Is it possible to sort $n$ numbers each of $m$ bits in depth $O(\log(n))$ (can we get rid of the $m \log m$ factor in the circuit depth from Theorem 1)?
• Is it possible to partially sort $n$ numbers of $m$ bits each by their first bit using a circuit of size $O(nm)$ and depth $O(\log(n))$?

Acknowledgement: The authors are grateful for insightful discussions with Mike Saks on sorting and to Veronika Slívová for her insights and comments regarding the first versions of this paper.

References

[1] Miklós Ajtai, János Komlós, and Endre Szemerédi. Sorting in $c \log(n)$ parallel steps. Combinatorica, 3(1):1–19, 1983.
[2] Arne Andersson, Torben Hagerup, Stefan Nilsson, and Rajeev Raman. Sorting in linear time? Journal of Computer and System Sciences, 57(1):74–93, 1998.
[3] Gilad Asharov, Wei-Kai Lin, and Elaine Shi. Sorting short keys in circuits of size $o(n \log n)$. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2249–2268. SIAM, 2021.
[4] Yijie Han and Mikkel Thorup. Sorting integers in $O(n \sqrt{\log \log n})$ expected time and linear space. In IEEE Symposium on Foundations of Computer Science (FOCS'02), 2002.
[5] Stasys Jukna. Boolean function complexity: advances and frontiers, volume 27. Springer Science & Business Media, 2012.
[6] Nicholas Pippenger. Self-routing superconcentrators. Journal of Computer and System Sciences, 52(1):53–60, 1996.
[7] Christopher S. Wallace. A suggestion for a fast multiplier. IEEE Transactions on Electronic Computers, (1):14–17, 1964.

A Proofs of some preliminaries

Proof of Lemma 6.

We sketch the construction following the technique of Wallace [7]. Given three numbers $x, y, z \in \{0,1\}^k$, in constant depth and using $\Theta(k)$ gates we can compute $p, q \in \{0,1\}^{k+1}$ such that $x + y + z = p + q$. Here, $p$ is the coordinate-wise addition without carry, i.e., $0 \circ (x \oplus y \oplus z)$, and $q$ is the carry, i.e., $((x \wedge y) \vee (x \wedge z) \vee (y \wedge z)) \circ 0$. Thus, as long as there are at least three numbers to sum, we can use this to transform $x, y, z$, which take $3k$ bits, into $p, q$, which take $2k + 2$ bits. After $O(\log_{3/2}(n)) = O(\log(n))$ rounds we are left with just two numbers, and we sum those using Lemma 4. □

A figure of a small sorting network is given in Figure 1.
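The three-to-two reduction from the proof of Lemma 6 can be tried out directly on Python integers. The sketch below is illustrative only: the names `compress_3_to_2` and `wallace_sum` are hypothetical, unbounded Python integers replace the fixed-width $(k+1)$-bit strings, and the final addition of the two remaining numbers (done by Lemma 4 in the paper) is delegated to the built-in `sum`.

```python
def compress_3_to_2(x, y, z):
    """Return (p, q) with x + y + z == p + q, using only bit-wise operations."""
    p = x ^ y ^ z                            # coordinate-wise sum without carry
    q = ((x & y) | (x & z) | (y & z)) << 1   # carries, shifted one position left
    return p, q


def wallace_sum(numbers):
    """Sum the numbers by repeated 3-to-2 compression; also report the round count."""
    nums = list(numbers)
    rounds = 0
    while len(nums) > 2:
        full = len(nums) - len(nums) % 3     # how many numbers fit into whole triples
        reduced = []
        for i in range(0, full, 3):
            reduced.extend(compress_3_to_2(*nums[i:i + 3]))
        reduced.extend(nums[full:])          # leftovers wait for the next round
        nums = reduced
        rounds += 1
    return sum(nums), rounds                 # final addition of at most two numbers


total, rounds = wallace_sum([1] * 10)
print(total, rounds)
```

Each round turns every full triple into a pair, so the count of numbers shrinks by roughly a factor of $3/2$ per round, which is where the logarithmic number of rounds comes from.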

Proof of Lemma 8.

We first show how to construct a uniform family of boolean circuits (ONES'$_b$) which computes the same function and has the same size but depth $O(b)$. Then we use ONES'$_{\log(b)}$ to construct the desired circuit ONES$_b$.

The main idea of the construction of ONES'$_b$ is to recursively split the number $x$ into two numbers $x_L, x_R$ which describe how many bits set to one there should be in the first and the second half of the output.

[Figure 1: An example of a sorting network with three inputs (the horizontal lines), three comparators (the vertical lines), and depth three. The inputs on the left are numbers $x, y, z$, and after each comparator we noted what is on the horizontal line. Note that the bottommost output is $\max(\max(x, y), \max(\min(x, y), z)) = \max(x, y, z)$ and the middle one is $\min(\max(x, y), \max(\min(x, y), z))$, which is the median.]

Each of the two numbers $x_L, x_R$ will be represented by $b$ bits with the convention that if the most significant bit is equal to one then the number is a power of two (corresponding to all output bits in this part of the output being set to one). We recursively split the numbers $x_L, x_R$ in the same fashion until the numbers are represented by a single bit each, at which point they will represent the output bits. We set

$$x_L = \min(2^{b-1}, x), \qquad x_R = \min(2^{b-1}, \max(0, x - 2^{b-1})).$$

Note that if the number $x$ is represented by $b + 1$ bits ($x \in \{0,1\}^{b+1}$) then the numbers $x_L, x_R$ can be represented by $b$ bits ($x_L, x_R \in \{0,1\}^b$) as both of them represent at most half of $x$. Given $x \in \{0,1\}^{b+1}$ we can compute the maximum and minimum defining $x_L, x_R$ by inspecting the two most significant bits of $x$:

• If the most significant bit of $x$ is set to one (thus $x \ge 2^b$) we set $x_L = x_R = x/2$ (that is, $2^{b-1}$).
• If the most significant bit of $x$ is set to zero and the second most significant bit is set to one, then $x_L$ will be set to the binary number $10^{b-1}$ and $x_R$ will be $x - x_L$ (a copy of $x$ without the second most significant bit of $x$).
• If the two most significant bits of $x$ are equal to zero then $x_L = x$ (represented by one less bit than $x$) and $x_R = 0$.

Figure 2 shows an example of splitting $x$ into $x_L, x_R$.

[Figure 2: An example of splitting numbers where $b = 3$. The input number $x = 0101$ is split into $x_L = 100$, $x_R = 001$, which are themselves split recursively. The bottom nodes form the output.]

Thus we can compute the transformation $x \mapsto (x_L, x_R)$, where $x \in \{0,1\}^{b+1}$ and $x_L, x_R \in \{0,1\}^b$, using a circuit of size $\Theta(b)$ and depth $\Theta(1)$. Then each of the numbers $x_L, x_R$ is again split into two, etc., until we get single-bit numbers which represent the final output. The depth of the circuit ONES'$_b$ is $\Theta(b)$ as each splitting can be done in constant depth. If the circuit splitting $(b+1)$-bit numbers into $b$-bit numbers has size $s(b) \le cb + d$, for some universal constants $c$ and $d$, then the circuit ONES'$_b$ has size:

$$s(b) + 2s(b-1) + 4s(b-2) + \dots + 2^b s(0) = \sum_{j=0}^{b} 2^j s(b-j) \le \sum_{j=0}^{b} \left( 2^j c(b-j) + 2^j d \right) \le c\left(2^{b+1} - b - 2\right) + 2^{b+1} d = O(2^b).$$

To build the circuit ONES$_b$ of depth $O(\log b)$ we proceed as follows. For any $y > 0$ we denote the largest power of two not exceeding $y$ by $\ell(y) = \max\{2^j \mid j \in \mathbb{N},\ 2^j \le y\}$. We divide the output bits into blocks of $\ell(b)$ bits, and for each block $j \in [2^b/\ell(b)]$ of output bits with positions $[(j-1)\ell(b) + 1,\ j\ell(b)]$ (counting positions from one) we compute if it should be constant (that is, either constant zero when $x \le (j-1)\ell(b)$ or constantly equal to one when $x > j\ell(b)$). This check for constant values can be done in each block by a circuit of size $\Theta(b)$ and depth $\Theta(\log(b))$. We compute ONES'$_{\log(\ell(b))}$ with the input being the $\log \ell(b)$ least significant bits of $x$. This circuit is of size $O(b)$ and depth $O(\log(b))$. In each block, if the block should not be monochromatic then we use the output of that circuit as the output of the block; otherwise we use the appropriate constant one or zero copied $\ell(b)$ times as the output of the block. □

Proof of Lemma 9.

For each $y \in \{0,1\}^m$ we build a sub-circuit computing the number $n_y$ of times $y$ occurs among the inputs $x_1, \dots, x_n$. This is done by comparing $y$ to each $x_i$ in parallel, $i \in [n]$, to get an indicator bit whether they are equal. We obtain $n_y$ by summing up the indicator bits using the circuit SUM$_{n,1}$ of size $\Theta(n)$ and depth $\Theta(\log(n))$ from Lemma 6. Comparing $y$ to $x_i$ can be done by a circuit of size $O(m)$ and depth $O(\log(m))$. So we get $n_y$ using a circuit of size $\Theta(nm)$ and depth $\Theta(\log(n) + \log(m))$. Doing this for each $y \in \{0,1\}^m$ in parallel we get a circuit of size $\Theta(nm2^m)$ and depth $\Theta(\log(n) + \log(m))$. □

Proof of Lemma 10.

Given $n_{0^m}, n_{0^{m-1}1}, \dots, n_{1^m}$ we can compute the total sum $s = \sum_{x \in \{0,1\}^m} n_x$ and, for each $y \in \{0,1\}^m$, the number $p_y$ of binary strings before the first occurrence of $y$, i.e., $p_y = \sum_{x \in \{0,1\}^m : x < y} n_x$. Each of the numbers $p_y$ can be computed using the circuit SUM$_{y-1, \lceil 1 + \log(n) \rceil}$ from Lemma 6 of size $O(2^m \log(n))$ and depth $O(m + \log(\log(n)))$. Similarly for $s$. Thus we can get all numbers $p_y$ in parallel by a circuit of size $O(2^{2m} \log(n))$.

A given string $y \in \{0,1\}^m$, $y \ne 1^m$, should appear at each position $j \in [p_y + 1, p_{y+1}]$. Let $I_y \in \{0,1\}^n$ be the indicator vector of positions where $y$ should appear in the output. We can use ONES$_{\lceil 1 + \log(n) \rceil}(p_y) \oplus$ ONES$_{\lceil 1 + \log(n) \rceil}(p_{y+1})$ to calculate $I_y$ for each $y \ne 1^m$. For $y = 1^m$, $I_y =$ ONES$_{\lceil 1 + \log(n) \rceil}(p_y) \oplus$ ONES$_{\lceil 1 + \log(n) \rceil}(s)$. The size of ONES$_{\lceil 1 + \log(n) \rceil}$ is $\Theta(n)$. As there are $2^m$ different $y$'s, we need a circuit of size $\Theta(n2^m)$ and depth $\Theta(\log(\log(n)))$ to calculate all $I_y$'s.

If $x_1, x_2, \dots, x_n$ are the output integers, for each output position $j \in [n]$ we calculate the $k$-th bit of $x_j$ as

$$\bigvee_{y \in \{0,1\}^m} \left( (I_y)_j \wedge y_k \right).$$

To compute all these ORs we need a circuit of total size $\Theta(nm2^m)$ and depth $\Theta(m)$. □

B Proof of Theorem 3

Proof of Theorem 3.

We assume that $k \le \log(n)/11$, otherwise we can use Corollary 15 to sort the elements. Without loss of generality we assume $n$ is a power of two. We think of the input as organized into an array. We extract the first $k$ bits (key) from each input element and we sort the keys using the circuit from Theorem 1 of size $O(nk^2)$ and depth $O(\log(n) + k \log(k))$.

We will build recursively a circuit that will sort the input array of $n$ elements according to the first $k$ bits when the input is augmented with the array of sorted keys. Now our goal is to split the input array into two equal-sized parts $L$ and $R$ where all elements in $L$ are less than or equal to elements in $R$ when comparing only the keys.

To do that we take the median, the $n/2$-th key in the sorted array of keys, and form three arrays $L$, $M$, and $R$ of length $n$ with elements less than, equal to, and greater than the median, resp., and we mark the unused elements as dummy using an extra bit associated with each element. We sort $L$ and $M$ so that all non-dummy elements are to the left and $R$ so that all non-dummy elements are to the right. We use three circuits ITERATIVE_SORT$_{n,m+1,1}$ to do that. Now, we flip the first half of elements in $M$, i.e., swap the $i$-th element with the element in position $(n/2) - i + 1$, and we replace the dummy elements in the first half of $L$ by the corresponding elements in $M$. By one application of ITERATIVE_SORT$_{n,m+1,1}$ we move all the remaining non-dummy elements in $M$ to the left, and we merge those elements with the second half of $R$. We discard the second and first half of $L$ and $R$, respectively. (They contain only dummy elements.)

If the highest-order bit of the median is set to 0 then all the elements in $L$ have the highest-order bit set to 0, otherwise all the elements in $R$ have the highest-order bit set to 1. In either case we reduced the problem to one problem of sorting half of the elements according to $k - 1$ bits and one problem of sorting the other half according to $k$ bits. We recursively build circuits SORT$_{n/2,m,k-1}$ and SORT$_{n/2,m,k}$ that sort when the input is augmented with the sorted array of keys. We pass to each of the sorting sub-circuits the appropriate sub-problem and we re-route the results from them to form the final output.

Not counting the two sub-circuits SORT$_{n/2,m,k-1}$ and SORT$_{n/2,m,k}$, this step requires four copies of the circuit ITERATIVE_SORT$_{n,m+1,1}$ and additional $O(nm)$ gates to do the moves and the element comparisons with the median. Denote the size of this part of the circuit by $L_m(n) = O(nm(1 + \log^*(n) - \log^*(m)))$. The depth of the resulting circuit to perform all those operations is $O(\log^2(n))$ as the move operations are done in parallel (again, not counting the depth of SORT$_{n/2,m,k-1}$ and SORT$_{n/2,m,k}$). If we denote by $S_{m,k}(n)$ the size of the circuit SORT$_{n,m,k}$ we get the following recurrence:

$$\begin{aligned}
S_{m,k}(1) &= O(m) \\
S_{m,1}(n) &= O(nm(1 + \log^*(n) - \log^*(m))) \\
S_{m,k}(n) &\le L_m(n) + S_{m,k-1}(n/2) + S_{m,k}(n/2)
\end{aligned}$$

When we iterate the recurrence:

$$\begin{aligned}
S_{m,k}(n) &= L_m(n) + S_{m,k-1}(n/2) + S_{m,k}(n/2) \\
&= L_m(n) + S_{m,k-1}(n/2) + L_m(n/2) + S_{m,k-1}(n/4) + S_{m,k}(n/4) \\
&= L_m(n) + S_{m,k-1}(n/2) + L_m(n/2) + S_{m,k-1}(n/4) + L_m(n/4) + S_{m,k-1}(n/8) + S_{m,k}(n/8) \\
&= \dots \\
&= \left( L_m(n) + L_m(n/2) + \dots + L_m(1) \right) + \left( S_{m,k-1}(n/2) + S_{m,k-1}(n/4) + \dots + S_{m,k-1}(1) \right) + S_{m,k}(1) \\
&\le 2L_m(n) + S_{m,k-1}(n) + O(m)
\end{aligned}$$

which gives us

$$\begin{aligned}
S_{m,k}(n) &\le 2k L_m(n) + (k-1) S_{m,k}(1) + S_{m,1}(n) \\
&= 2k L_m(n) + O(nm(1 + \log^*(n) - \log^*(m))) \\
&= O(knm(1 + \log^*(n) - \log^*(m)))
\end{aligned}$$

To bound the depth $D_{m,k}(n)$ we use the following recurrence:

$$\begin{aligned}
D_{m,k}(1) &= O(1) \\
D_{m,k}(n) &\ge D_{m,k-1}(n) \\
D_{m,1}(n) &= O(\log^2(n)) \\
D_{m,k}(n) &= O(\log^2(n)) + \max\left( D_{m,k}(n/2),\ D_{m,k-1}(n/2) \right) \\
&\le O(\log^2(n)) + D_{m,k}(n/2) \\
&\le O(\log^3(n))
\end{aligned}$$

□
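The two recurrences can be sanity-checked numerically. The following is a rough sketch under simplifying assumptions that are not part of the proof: $n = 2^t$, the per-level cost $L_m(n)$ is replaced by exactly $n$, the base cost $S_{m,k}(1)$ by 1, and the per-level depth by exactly $\log^2(n) = t^2$. The function names `size` and `depth` are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def size(k, t):
    """Size recurrence for n = 2**t, with L_m(n) := n and S_{m,k}(1) := 1."""
    n = 2 ** t
    if t == 0:
        return 1
    if k == 1:
        return n
    return n + size(k - 1, t - 1) + size(k, t - 1)


@lru_cache(maxsize=None)
def depth(k, t):
    """Depth recurrence for n = 2**t, with per-level depth log(n)**2 = t*t."""
    if t == 0:
        return 1
    if k == 1:
        return t * t
    return t * t + max(depth(k, t - 1), depth(k - 1, t - 1))


for k in range(1, 8):
    for t in range(0, 16):
        assert size(k, t) <= 2 * k * 2 ** t   # matches the O(k * L_m(n)) shape
        assert depth(k, t) <= t ** 3 + 1      # matches the O(log^3(n)) shape
print("bounds hold for all tested k and t")
```

Under these assumptions the size stays below $2k \cdot n$ and the depth below $\log^3(n) + 1$, in line with the bounds derived above; the check says nothing about the hidden constants of the actual circuits.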