Radix Sorting With No Extra Space
Gianni Franceschini, Univ. of Pisa, [email protected]
S. Muthukrishnan, Google Inc., NY, [email protected]
Mihai Pătrașcu, MIT, [email protected]
October 30, 2018
Abstract
It is well known that n integers in the range [1, n^c] can be sorted in O(n) time in the RAM model using radix sorting. More generally, integers in any range [1, U] can be sorted in O(n √(log log n)) time [5]. However, these algorithms use O(n) words of extra memory. Is this necessary?

We present a simple, stable, integer sorting algorithm for words of size O(log n), which works in O(n) time and uses only O(1) words of extra memory on a RAM model. This is the integer sorting case most useful in practice. We extend this result with the same bounds to the case when the keys are read-only, which is of theoretical interest. Another interesting question is the case of arbitrary c. Here we present a black-box transformation from any RAM sorting algorithm to a sorting algorithm which uses only O(1) extra space and has the same running time. This settles the complexity of in-place sorting in terms of the complexity of sorting.

1 Introduction

Given n integer keys S[1 . . . n], each in the range [1, n], they can be sorted in O(n) time using O(n) space by bucket sorting. This can be extended to the case when the keys are in the range [1, n^c], for some positive constant c, by radix sorting, which uses repeated bucket sorting with O(n)-ranged keys. The crucial point is to do each bucket sorting stably; that is, if positions i and j, i < j, had the same key k, then the copy of k from position i appears before that from position j in the final sorted order. Radix sorting takes O(cn) time and O(n) space.

More generally, RAM sorting with integers in the range [1, U] is a much-studied problem. Currently, the best known bounds are given by the randomized algorithm in [5], which takes O(n √(log log n)) time, and the deterministic algorithm in [1], which takes O(n log log n) time. These algorithms also use O(n) words of extra memory in addition to the input. We ask a basic question: do we need O(n) auxiliary space for integer sorting?
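As a reminder of the classic building block discussed above, here is a minimal Python sketch (function names are ours) of stable bucket (counting) sort and the radix sort built from it; this is the O(n)-extra-words baseline whose space usage the paper sets out to remove:

```python
def counting_sort_by(a, key, k):
    """Stably sort list `a` by key(x), where key values lie in [0, k)."""
    count = [0] * (k + 1)
    for x in a:
        count[key(x) + 1] += 1
    for i in range(k):
        count[i + 1] += count[i]        # prefix sums: first slot of each bucket
    out = [None] * len(a)
    for x in a:                          # left-to-right scan keeps equal keys in order
        out[count[key(x)]] = x
        count[key(x)] += 1
    return out

def radix_sort(a, c=2):
    """Sort n integers in [0, n^c) with c passes of base-n counting sort,
    least significant digit first."""
    n = max(len(a), 2)                   # base of the representation
    for d in range(c):
        a = counting_sort_by(a, lambda x, d=d: (x // n**d) % n, n)
    return a
```

Each pass is stable, so earlier passes' orderings of lower digits are preserved, which is exactly the property the introduction insists on.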
The ultimate goal would be to design in-place algorithms for integer sorting that use only O(1) extra words. This question has been explored in depth for comparison-based sorting, and after a series of papers, we now know that in-place, stable comparison-based sorting can be done in O(n log n) time [10]. Some very nice algorithmic techniques have been developed in this quest. However, no such results are known for the integer sorting case. Integer sorting is used as a subroutine in a number of algorithms that deal with trees and graphs, including, in particular, sorting the transitions of a finite state machine. Indeed, the problem arose in that context for us. In these applications, it is useful if one can sort in-place in O(n) time. From a theoretical perspective, it is likewise interesting to know if the progress in RAM sorting, including [3, 1, 5], really needs extra space.

Our results are in-place algorithms for integer sorting. Taken together, these results resolve most of the issues with the space efficiency of integer sorting. In particular, our contributions are threefold.

A practical algorithm. In Section 2, we present a stable integer sorting algorithm for O(log n)-sized words that takes O(n) time and uses only O(1) extra words. This algorithm is a simple and practical replacement for radix sort. In the numerous applications where radix sorting is used, this algorithm can be used to improve the space usage from O(n) to only O(1) extra words. We have implemented the algorithm with positive results. One key idea of the algorithm is to compress a portion of the input, modifying the keys. The space thus made free is used as extra space for sorting the remainder of the input.

Read-only keys.
It is theoretically interesting whether integer sorting can be performed in-place without modifying the keys. The algorithm above does not satisfy this constraint. In Section 3, we present a more sophisticated algorithm that still takes linear time and uses only O(1) extra words without modifying the keys. In contrast to the previous algorithm, we cannot create space for ourselves by compressing keys. Instead, we introduce a new technique of pseudo pointers, which we believe will find applications in other succinct data structure problems. The technique is based on keeping a set of distinct keys as a pool of preset, read-only pointers in order to maintain linked lists as in bucket sorting.

As a theoretical exercise, in Section 5, we also consider the case when this sorting has to be done stably. We present an algorithm with identical performance that is also stable. Similar to other in-place stable sorting algorithms, e.g., comparison-based sorting [10], this algorithm is quite detailed and needs very careful management of keys as they are permuted. The resulting algorithm is likely not of practical value, but it is still fundamentally important to know that bucket and radix sorting can indeed be solved stably in O(n) time with only O(1) words of extra space. For example, even though comparison-based sorting has been well studied at least since the 1960s, it was not until much later that optimal, stable, in-place comparison-based sorting was developed [10].

Arbitrary word length.
Another question of fundamental theoretical interest is whether the recently discovered integer sorting algorithms that work with long keys and sort in o(n log n) time, such as [3, 1, 5], need any auxiliary space. In Section 4, we present a black-box transformation from any RAM sorting algorithm to a sorting algorithm which uses only O(1) extra space and retains the same time bounds. As a result, the running time bounds of [1, 5] can now be matched with only O(1) extra space. This transformation relies on a fairly natural technique of compressing a portion of the input to make space for simulating space-inefficient RAM sorting algorithms.

Definitions.
Formally, we are given a sequence S of n elements. The problem is to sort S according to the integer keys, under the following assumptions:
(i) Each element has an integer key within the interval [1, U].
(ii) The following unit-cost operations are allowed on S: (a) indirect address of any position of S; (b) read-only access to the key of any element; (c) exchange of the positions of any two elements.
(iii) The following unit-cost operations are allowed on integer values of O(log U) bits: addition, subtraction, bitwise AND/OR, and unrestricted bit shift.
(iv) Only O(1) auxiliary words of memory are allowed; each word has log U bits.
For the sake of presentation, we will refer to the elements' keys as if they were the input elements. For example, for any two elements x, y, instead of writing that the key of x is less than the key of y we will simply write x < y. We also need a precise definition of the rank of an element in a sequence when multiple occurrences of keys are allowed: the rank of an element x_i in a sequence x_1 . . . x_t is the cardinality of the multiset {x_j | x_j < x_i or (x_j = x_i and j ≤ i)}.

2 Stable Sorting for Modifiable Keys
We now describe our simple algorithm for (stable) radix sort without additional memory.
Gaining space.
The first observation is that numbers in sorted order have less entropy than in arbitrary order. In particular, n numbers from a universe of u have binary entropy n log u when the order is unspecified, but only log (u choose n) = n log u − Θ(n log n) in sorted order. This suggests that we can "compress" sorted numbers to gain more space:

Lemma 1.
A list of n integers in sorted order can be represented as: (a) an array A[1 . . . n] with the integers in order; (b) an array of n integers, such that the last Θ(n log n) bits of the array are zero. Furthermore, there exist in-place O(n) time algorithms for switching between representations (a) and (b).

Proof. One can imagine many representations (b) for which the lemma is true. We note nonetheless that some care is needed, as some obvious representations will in fact not lead to in-place encoding. Take for instance the appealing approach of replacing A[i] by the difference A[i] − A[i − 1], which is u/n on average. Then, one can try to encode the difference using a code optimized for smaller integers, for example one that represents a value x using log x + O(log log x) bits. However, the obvious encoding algorithm will not be in-place: even though the scheme is guaranteed to save space over the entire array, it is possible for many large values to cluster at the beginning, leading to a rather large prefix being in fact expanded. This makes it hard to construct the encoding in the same space as the original numbers, since we need to shift a lot of data to the right before we start seeing a space saving.

As it will turn out, the practical performance of our radix sort is rather insensitive to the exact space saving achieved here. Thus, we aim for a representation which makes in-place encoding particularly easy to implement, sacrificing constant factors in the space saving.

First consider the most significant bit of all integers. Observe that if we only remember the minimum i such that A[i] ≥ u/
2, we know all most significant bits (they are zero up to i and one after that). We will encode only the last n/2 integers, using the most significant bits of A[1 . . . n/2] to store the stream of bits needed by the encoding. We now break a number x into hi(x), containing the upper ⌊log(n/2)⌋ bits, and lo(x), with the low log u − ⌊log(n/2)⌋ bits. For all values in A[n/2 + 1 . . . n], we can throw away hi(A[i]) as follows. First we add hi(A[n/2 + 1]) zeros to the bit stream, followed by a one; then for every i = n/2 + 2, . . . , n we add hi(A[i]) − hi(A[i − 1]) zeros, followed by a one. Since the hi(·) values are nondecreasing and hi(A[n]) ≤ n/2, the stream contains at most n bits in total. Finally, we compact lo(A[n/2 + 1]), . . . , lo(A[n]) in one pass, gaining (n/2) · ⌊log(n/2)⌋ free bits.

An unstable algorithm.
Even just this compression observation is enough to give a simple algorithm, whose only disadvantage is that it is unstable. The algorithm has the following structure:
1. sort the subsequence S[1 . . . (n/ log n)] using the optimal in-place mergesort in [10].
2. compress S[1 . . . (n/ log n)] by Lemma 1, generating Ω(n) bits of free space.
3. radix sort S[(n/ log n) + 1 . . . n] using the free space.
4. uncompress S[1 . . . (n/ log n)].
5. merge the two sorted sequences S[1 . . . (n/ log n)] and S[(n/ log n) + 1 . . . n] by using the in-place, linear time merge in [10].
The only problematic step is 3. The implementation of this step is based on the cycle leader permuting approach, where a sequence A is rearranged by following the cycles of a permutation π. First A[1] is sent to its final position π(1). Then, the element that was in π(1) is sent to its final position π(π(1)). The process proceeds in this way until the cycle is closed, that is, until the element that must be moved to position 1 is found. At this point, the elements starting from A[2] are scanned until a new cycle leader A[i] (i.e., an element whose cycle has not yet been walked through) is found; A[i]'s cycle is followed in its turn, and so forth.

To sort, we use 2n^ε counters c_1, . . . , c_{n^ε} and d_1, . . . , d_{n^ε}, stored in the space freed in step 2; each d_j is initialized to 0. With a first scan of the elements, we store in each c_i the number of occurrences of key i. Then the c_i's are turned into prefix sums, so that in the end c_i = 1 + Σ_{j<i} c′_j, where c′_j is the number of occurrences of key j; that is, c_i is the final position of the first element with key i. During the cycle leader process, an element with key i is sent to position c_i + d_i, and d_i is then incremented. Since the counters occupy only O(n^ε) words, they fit comfortably in the Ω(n) bits of free space we have.
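To make the counter-driven rearrangement of step 3 concrete, here is a sketch in the style of an "American flag sort" (an illustrative Python rendering of our own; the array nxt plays the role of the counters c_i + d_i, and in the real algorithm these counters live in the bits freed by compression rather than in a separate array):

```python
def american_flag_sort(A, k):
    """Unstable, in-place counting sort of A (keys in [0, k)) by following
    permutation cycles: each swap places one element in its final zone."""
    n = len(A)
    counts = [0] * k
    for x in A:
        counts[x] += 1
    start = [0] * (k + 1)
    for i in range(k):                  # start[i] .. start[i+1]-1 is key i's zone
        start[i + 1] = start[i] + counts[i]
    nxt = start[:k]                     # next unfilled slot in each zone
    for key in range(k):
        while nxt[key] < start[key + 1]:
            x = A[nxt[key]]
            if x == key:
                nxt[key] += 1           # element already inside its zone
            else:
                # follow the cycle: place x in its zone, pull back the occupant
                A[nxt[key]], A[nxt[x]] = A[nxt[x]], A[nxt[key]]
                nxt[x] += 1
    return A
```

Every swap permanently places one element, so the rearrangement is O(n + k) overall, matching the cycle-leader analysis above.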
Stability through recursion.
To achieve stability, we need more than n free bits, which we can achieve by bootstrapping with our own sorting algorithm, instead of merge sort. There is also an important practical advantage to the new stable approach: the elements are permuted much more conservatively, resulting in better cache performance.
1. recursively sort a constant fraction of the array, say S[1 . . . n/2].
2. compress S[1 . . . n/2] by Lemma 1, generating Ω(n log n) bits of free space.
3. for a small enough constant γ, break the remaining n/2 elements into chunks of γn numbers each. Each chunk is sorted by a classic radix sort algorithm which uses the available space.
4. uncompress S[1 . . . n/2].
5. we now have 1 + 1/(2γ) = O(1) sorted subarrays. We merge them in linear time using the stable in-place algorithm of [10].
We note that the recursion can in fact be implemented bottom up, so there is no need for a stack of superconstant space. For the base case, we can use bubble sort when we are down to at most √n_0 elements, where n_0 is the original size of the array at the top level of the recursion.

Steps 2 and 4 are known to take O(n) time. For step 3, note that radix sort in base R applied to N numbers requires N + R additional words of space, and takes time O(N log_R u). Since we have free space of Ω(n log n) bits, or Ω(n) words, we can set N = R = γn, for a small enough constant γ. As we always have n = Ω(√n_0) = u^Ω(1), radix sort will take linear time.

The running time is described by the recursion T(n) = T(n/
2) + O(n), yielding T(n) = O(n).

A self-contained algorithm. Unfortunately, all algorithms so far use the in-place stable merging algorithm of [10]. We want to remove this dependence, and obtain a simple and practical sorting algorithm. By creating free space through compression at the right times, we can instead use a simple merging implementation that needs additional space. We first observe the following:
Lemma 2.
Let k ≥ 2 and α > 0 be arbitrary constants. Given k sorted lists of n/k elements, and αn words of free space, we can merge the lists in O(n) time.

Proof. We divide space into blocks of αn/(k + 1) words. Initially, we have k + 1 free blocks. We start merging the lists, writing the output in these blocks. Whenever we are out of free blocks, we look for additional blocks which have become free in the original sorted lists. In each list, the merging pointer may be inside some block, making it yet unavailable. However, we can have only k such partially consumed blocks, accounting for less than k · αn/(k + 1) wasted words of space. Since in total there are αn free words, there must always be at least one block which is available, and we can direct further output into it.

At the end, we have the merging of the lists, but the output appears in a nontrivial order of the blocks. Since there are (k + 1)(1 + 1/α) = O(1) blocks in total, we can remember this order using constant additional space. Then, we can permute the blocks in linear time, obtaining the true sorted order.

Since we need additional space for merging, we can never work with the entire array at the same time. However, we can now use a classic sorting idea, which is often used in introductory algorithms courses to illustrate recursion (see, e.g., [2]). To sort n numbers, one can first sort the first 2n/3 of the numbers (recursively), then the last 2n/3, and then the first 2n/3 again. Though normally this algorithm gives a running time of ω(n), it works efficiently in our case because we do not need recursion:
1. sort S[1 . . . n/
3] recursively.
2. compress S[1 . . . n/3], and sort S[n/3 + 1 . . . n] as before: first radix sort chunks of γn numbers, and then merge all chunks by Lemma 2 using the available space. Finally, uncompress S[1 . . . n/3].
3. compress S[2n/3 + 1 . . . n], which is now sorted. Using Lemma 2, merge S[1 . . . n/3] with S[n/3 + 1 . . . 2n/3]. Finally, uncompress.
4. once again, compress S[1 . . . n/3], merge S[n/3 + 1 . . . 2n/3] with S[2n/3 + 1 . . . n], and uncompress.
Note that steps 2–4 are linear time. Then, we have the recursion T(n) = T(n/
3) + O(n), solving to T(n) = O(n). Finally, we note that stability of the algorithm follows immediately from stability of classic radix sort and stability of merging.

Practical experience.
The algorithm is surprisingly effective in practice. It can be implementedin about 150 lines of C code. Experiments with sorting 1-10 million 32-bit numbers on a Pentiummachine indicate the algorithm is roughly 2.5 times slower than radix sort with additional memory,and slightly faster than quicksort (which is not even stable).
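The block-recycling accounting in the proof of Lemma 2 can be illustrated with the following Python sketch (names and structure are ours; blocks are modeled as separate lists rather than regions of one array, and the final in-place block permutation is replaced by reading the blocks back in output order):

```python
import heapq

def merge_with_blocks(runs, B, extra_blocks):
    """Sketch of Lemma 2: merge k sorted runs using a pool of free blocks of
    B words each. An input block is recycled as soon as its run's merging
    pointer leaves it; Lemma 2's accounting says a free block is always
    available when the output buffer must be flushed."""
    k = len(runs)
    blocks, layout, bid = {}, [], 0
    for run in runs:                    # cut each run into blocks of B words
        ids = []
        for j in range(0, len(run), B):
            blocks[bid] = run[j:j + B]
            ids.append(bid)
            bid += 1
        layout.append(ids)
    free = list(range(bid, bid + extra_blocks))   # initially free scratch blocks
    pos = [0] * k                       # merging pointer of each run
    heap = [(runs[r][0], r) for r in range(k) if runs[r]]
    heapq.heapify(heap)
    out_ids, out_buf = [], []
    while heap:
        val, r = heapq.heappop(heap)
        out_buf.append(val)
        pos[r] += 1
        if pos[r] % B == 0:             # pointer left a block: recycle it
            free.append(layout[r][pos[r] // B - 1])
        if pos[r] < len(runs[r]):
            heapq.heappush(heap, (runs[r][pos[r]], r))
        if len(out_buf) == B:           # flush the buffer into some free block
            b = free.pop(0)
            blocks[b] = out_buf
            out_ids.append(b)
            out_buf = []
    if out_buf:                         # flush the final partial block
        b = free.pop(0)
        blocks[b] = out_buf
        out_ids.append(b)
    # the in-place version permutes the O(1) blocks; here we just read them back
    return [x for b in out_ids for x in blocks[b]]
```

With extra_blocks = k + 1 the free list never empties, which is exactly the lemma's claim; out_ids is the "nontrivial order of the blocks" remembered in constant space.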
3 Unstable Sorting for Read-only Keys

Bit stealing. With the bit stealing technique [9], a bit of information is encoded in the relative order of a pair of elements with different keys: the pair is maintained in increasing order to encode a 0, and vice versa. The obvious drawback of this technique is that the cost of accessing a word of w encoded bits is O(w) in the worst case (no word-level parallelism). However, if we modify an encoded word with a series of l increments (or decrements) by 1, the total cost of the entire series is O(l) (see [2]).

To find pairs of distinct elements, we go from S to a sequence Z′Y′XY′′Z′′ with two properties. (i) For any z′ ∈ Z′, y′ ∈ Y′, x ∈ X, y′′ ∈ Y′′ and z′′ ∈ Z′′ we have that z′ < y′ < x < y′′ < z′′. (ii) Let m = α⌈n/ log n⌉, for a suitable constant α. Y′ is composed of the element y′_m with rank m plus all the other elements equal to y′_m; Y′′ is composed of the element y′′_m with rank n − m + 1 plus all the other elements equal to y′′_m. To obtain the new sequence we use the in-place, linear time selection and partitioning algorithms in [6, 7]. If X is empty, the task left is to sort Z′ and Z′′, which can be accomplished with any optimal, in-place mergesort (e.g., [10]). Let us denote Z′Y′ by M′ and Y′′Z′′ by M′′. The m pairs of distinct elements (M′[1], M′′[1]), (M′[2], M′′[2]), . . . , (M′[m], M′′[m]) will be used to encode information.

Since the choice of the constant α does not affect the asymptotic complexity of the algorithm, we have reduced our problem to a problem in which we are allowed to use a special bit memory with O(n/ log n) bits, where each bit can be accessed and modified in constant time, but without word-level parallelism.

Internal buffering. With the internal buffering technique [8], some of the elements are used as placeholders in order to simulate a working area and permute the other elements at lower cost. In our unstable sorting algorithm we use the basic idea of internal buffering in the following way.
Using the selection and partitioning algorithms in [6, 7], we pass from the original sequence S to ABC with two properties. (i) For any a ∈ A, b ∈ B and c ∈ C, we have that a < b < c. (ii) B is composed of the element b′ with rank ⌈n/2⌉ plus all the other elements equal to b′. We can use BC as an auxiliary memory in the following way. The element in the first position of BC is the separator element and will not be moved. The elements in the other positions of BC are placeholders and will be exchanged with (instead of being overwritten by) elements from A in any way the computation on A (in our case the sorting of A) may require. The "emptiness" of any location i of the simulated working area in BC can be tested in O(1) time by comparing the separator element BC[1] with BC[i]: if BC[1] ≤ BC[i], the i-th location is "empty" (that is, it contains a placeholder); otherwise it contains one of the elements of A.

Let us suppose we can sort the elements in A in O(|A|) time using BC as working area. After A is sorted, we use the partitioning algorithm in [6] to separate the elements equal to the separator element (BC[1]) from the elements greater than it (the computation on A may have altered the original order in BC). Then we just re-apply the same process to C: that is, we divide it into A′B′C′, we sort A′ using B′C′ as working area, and so forth. Clearly, this process requires O(n) time, and when it terminates the elements are sorted. Obviously, we can divide A into p = O(1) equally sized subsequences A_1, A_2, . . . , A_p, then sort each one of them using BC as working area, and finally fuse them using the in-place, linear time merging algorithm in [10].
Since the choice of the constant p does not affect the asymptotic complexity of the whole process, we have reduced our problem to a new problem, in which we are allowed to use a special exchange memory of O(n) locations, where each location can contain input elements only (no integers or any other kind of data). Any element can be moved to and from any location of the exchange memory in O(1) time.

The reduced problem. By blending together the basic techniques seen above, we can focus on a reduced problem in which assumption (iv) is replaced by:
(iv′) Only O(1) words of normal auxiliary memory and two kinds of special auxiliary memory are allowed:
(a) A random access bit memory B with O(n/ log n) bits, where each bit can be accessed in O(1) time (no word-level parallelism).
(b) A random access exchange memory E with O(n) locations, where each location can contain only elements from S, and these can be moved to and from any location of E in O(1) time.
If we can solve the reduced problem in O(n) time, we can also solve the original problem with the same asymptotic complexity. However, the resulting algorithm will be unstable because of the use of the internal buffering technique with a large pool of placeholder elements.

A naive approach. Despite the two special auxiliary memories, solving the reduced problem is not easy. Let us consider the following naive approach. We proceed as in normal bucket sorting: one bucket for each of the n^ε range values. Each bucket is a linked list: the input elements of each bucket are maintained in E, while its auxiliary data (e.g., the pointers of the list) are maintained in B. In order to amortize the cost of updating the auxiliary data (each pointer requires a word of Θ(log n) bits and B does not have word-level parallelism), each bucket is a linked list of slabs of Θ(log² n) elements each (B has only O(n/ log n) bits). At any time each bucket has a partially full head slab, which is where any new element of the bucket is stored.
Hence, for each bucket we need to store in B a word of O(log log n) bits with the position in the head slab of the last element added. The algorithm proceeds as usual: each element in S is sent to its bucket in O(1) time and is inserted in the bucket's head slab. With no word-level parallelism in B, the insertion in the head slab requires O(log log n) time. Therefore, we have an O(n log log n) time solution for the reduced problem and, consequently, an unstable O(n log log n) time solution for the original problem.

This simple strategy can be improved by dividing the head slab of a bucket into second-level slabs of Θ(log log n) elements each. As for the first-level slabs, there is a partially full, second-level head slab. For any bucket we maintain two words in B: the first one has O(log log log n) bits and stores the position of the last element inserted in the second-level head slab; the second one has O(log log n) bits and stores the position of the last full second-level slab contained in the first-level head slab. Clearly, this gives us an O(n log log log n) time solution for the reduced problem and the corresponding unstable solution for the original problem. By generalizing this approach to the extreme, we end up with O(log* n) levels of slabs, an O(n log* n) time solution for the reduced problem and the related unstable solution for the original problem.

The pseudo pointers technique. Unlike bit stealing and internal buffering, which were known earlier, the pseudo pointers technique has been specifically designed for improving the space complexity in integer sorting problems. Basically, in this technique a set of elements with distinct keys is used as a pool of pre-set, read-only pointers in order to simulate efficiently traversable and updatable linked lists. Let us show how to use this basic idea in a particular procedure that will be at the core of our optimal solution for the reduced problem.

Let d be the number of distinct keys in S.
We are given two sets of d input elements with distinct keys: the sets G and P of guides and pseudo pointers, respectively. The guides are given to us in sorted order, while the pseudo pointers form a sequence in arbitrary order. Finally, we are given a multiset I of d input elements (i.e., two elements of I can have equal keys). The procedure uses the guides, the pseudo pointers and the exchange memory to sort the d input elements of I in O(d) time.

We use three groups of contiguous locations in the exchange memory E. The first group H has n^ε locations (one for each possible value of the keys). The second group L has n^ε slots of two adjacent locations each. The last group R has d locations; the elements of I will end up here in sorted order. H, L and R are initially empty. We have two main steps.

First. For each s ∈ I, we proceed as follows. Let p be the leftmost pseudo pointer still in P. If the s-th location of H is empty, we move p from P to H[s] and then we move s from I to the first location of L[p] (i.e., the first location of the p-th slot of L), leaving the second location of L[p] empty. Otherwise, if H[s] contains an element p′ (a pseudo pointer), we move s from I to the first location of L[p], then we move p′ from H[s] to the second location of L[p], and finally we move p from P to H[s].

Second. We scan the guides in G from the smallest to the largest one. For a guide g ∈ G we proceed as follows. If the g-th location of H is empty, then there does not exist any element equal to g among the ones to be sorted (initially in I), and hence we move on to the next guide. Otherwise, if H[g] contains a pseudo pointer p, there is at least one element equal to g among the ones to be sorted, and this element is currently stored in the first location of the p-th slot of L. Hence, we move that element from the first location of L[p] to the leftmost empty location of R.
After that, if the second location of L[p] contains a pseudo pointer p′, there is another element equal to g, and we proceed in the same fashion. Otherwise, if the second location of L[p] is empty, then there are no more elements equal to g among the ones to be sorted, and therefore we can focus on the next guide element.

Basically, the procedure is bucket sorting where the auxiliary data of the list associated to each bucket (i.e., the links among elements in the list) is implemented by pseudo pointers in P instead of being stored explicitly in the bit memory (which lacks word-level parallelism and is inefficient to access). It is worth noting that the buckets' lists implemented with pseudo pointers are spread over an area that is larger than the one we would obtain with explicit pointers (that is because each pseudo pointer has a key of log n^ε bits, while an explicit pointer would have only log d bits).

The optimal solution. We can now describe the algorithm, which has three main steps.
First. Let us assume that for any element s ∈ S there is at least another element with the same key. (Otherwise, we can easily reduce to this case in linear time: we isolate the O(n^ε) elements that do not respect the property, we sort them with the in-place mergesort in [10], and finally we merge them after the other O(n) elements are sorted.) With this assumption, we extract from S two sets G and P of d input elements with distinct keys (this can be easily achieved in O(n) time using only the exchange memory E). Finally, we sort G with the optimal in-place mergesort in [10].

Second. Let S′ be the sequence with the O(n) input elements left after the first step. Using the procedure described above (the sets G and P computed in the first step will be the guides and pseudo pointers used in the procedure), we sort each block B_i of S′ with d contiguous elements. After that, let us focus on the first t = Θ(log log n) consecutive blocks B_1, B_2, . . . , B_t. We distribute the elements of these blocks into ≤ t groups G_1, G_2, . . . in the following way. Each group G_j can contain between d and 2d elements and is allocated in the exchange memory E. The largest element in a group is its pivot. The number of elements in a group is stored in a word of Θ(log d) bits allocated in the bit memory B. Initially there is only one group, and it is empty. In the i-th step of the distribution we scan the elements of the i-th block B_i. As long as the elements of B_i are less than or equal to the pivot of the first group, we move them into it. If, during the process, the group becomes full, we select its median element and partition the group into two new groups, using the selection and partitioning algorithms in [6, 7]. When, during the scan, the elements of B_i become greater than the pivot of the first group, we move to the second group and continue in the same fashion.
It is important to notice that the number of elements in a group (stored in a word of Θ(log d) bits in the bit memory B) is updated by increments of 1, and hence the total cost of updating the number of elements in any group is linear in the final number of elements in that group (see [2]). Finally, when all the elements of the first t = Θ(log log n) consecutive blocks B_1, B_2, . . . , B_t have been distributed into groups, we sort each group using the procedure described above (for groups with more than d elements, we sort them in two batches and then merge them with the in-place, linear time merging in [10]). The whole process is repeated for the second t = Θ(log log n) consecutive blocks, and so forth.

Third. After the second step, the sequence S′ (which contains all the elements of S with the exclusion of the guides and pseudo pointers, see the first step) is composed of contiguous subsequences S′_1, S′_2, . . . which are sorted and contain Θ(d log log n) elements each (where d is the number of distinct elements in S). Hence, if we see S′ as composed of contiguous runs of elements with the same key, we can conclude that the number of runs of S′ is O(n/ log log n). Therefore S′ can be sorted in O(n) time using the naive approach described above, keeping the insertion positions of the current runs in O(1) auxiliary words (we need only O(1) of them) instead of accessing the inefficient bit memory B at any single insertion. When the current run is finally exhausted, we copy the position into the bit memory. Finally, we sort P, and we merge P, G and S′ (once again, using the sorting and merging algorithms in [10]).

On stability. Let us focus on the reasons why the algorithm of this section is not stable. The major cause of instability is the use of the basic internal buffering technique in conjunction with large (ω(polylog(n))) pools of placeholder elements. This is clearly visible even in the first iteration of the internal buffering process: after A is sorted, the placeholder elements in BC are left permuted in a completely arbitrary way, and their initial order is lost.
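The two passes of the pseudo-pointer procedure can be rendered as the following toy sketch (Python, names ours; plain integers stand in for the read-only elements, None models an empty location of the exchange memory, and for simplicity we allow |I| ≠ d as long as one pseudo pointer is available per element of I):

```python
def pseudo_pointer_sort(I, G, P, key_range):
    """Sketch of the pseudo-pointer procedure: sort the multiset I using the
    sorted guides G and the pseudo pointers P (all with distinct keys).
    Linked lists of equal elements are threaded through the keys of the
    pseudo pointers themselves, with no integers stored anywhere."""
    H = [None] * key_range                        # H[v]: pseudo pointer heading v's list
    L = [[None, None] for _ in range(key_range)]  # L[p]: (element, previous head)
    P = list(P)                                   # pseudo pointers consumed left to right
    R = []                                        # output area
    for s in I:                                   # first pass: push s onto bucket s's list
        p = P.pop(0)                              # leftmost pseudo pointer still in P
        L[p][0] = s                               # the element goes into p's slot...
        L[p][1] = H[s]                            # ...linked to the old head (maybe None)
        H[s] = p                                  # p becomes the new head of bucket s
    for g in G:                                   # second pass: buckets in sorted key order
        p = H[g]
        while p is not None:                      # follow the chain of pseudo pointers
            R.append(L[p][0])
            p = L[p][1]
    return R
```

Within a bucket the elements come out in reverse order of insertion, which is harmless here: the procedure of this section is unstable anyway.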
4 Arbitrary Word Length

In this section, we consider the case of sorting integers of w = ω(log n) bits. We show a black-box transformation from any sorting algorithm on the RAM to a stable sorting algorithm with the same time bounds which only uses O(1) words of additional space. Our reduction needs to modify keys. Furthermore, it requires randomization for large values of w.

We first remark that an algorithm that runs in time t(n) can only use O(t(n)) words of space in most realistic models of computation. In models where the algorithm is allowed to write t(n) arbitrary words in a larger memory space, the space can also be reduced to O(t(n)) by introducing randomization, and storing the memory cells in a hash table.

Small word size.
We first deal with the case w = polylog(n). The algorithm has the following structure:
1. sort S[1 . . . n/ log n] using in-place stable merge sort [10]. Compress these elements by Lemma 1, gaining Ω(n) bits of space.
2. since t(n) = O(n log n), the RAM sorting algorithm uses at most O(t(n) · w) = O(n polylog(n)) bits of space. Then we can break the array into chunks of n/ log^c n elements, and sort each one using the available space.
3. merge the log^c n sorted subarrays.
4. uncompress S[1 . . . n/ log n] and merge with the rest of the array by stable in-place merging [10].
Steps 1 and 4 take linear time. Step 2 requires log^c n · t(n/ log^c n) = O(t(n)), because t(n) is convex and bounded in [n, n log n]. We note that step 2 can always be made stable, since we can afford a label of O(log n) bits per value.

It remains to show that step 3 can be implemented in O(n) time. In fact, this is a combination of the merging technique from Lemma 2 with an atomic heap [4]. The atomic heap can maintain a priority queue over polylog(n) elements with constant time per insert and extract-min. Thus, we can merge log^c n lists with constant time per element. The atomic heap can be made stable by adding a label of c log log n bits for each element in the heap, which we have space for. The merging of Lemma 2 requires that we keep track of O(k/α) subarrays, where k = log^c n is the number of lists and α = 1/ polylog(n) is the fraction of additional space we have available. Fortunately, this is only polylog(n) values to record, which we can afford.

Large word size.
For word size w ≥ log^{2+ε} n, the randomized algorithm of [1] can sort in O(n) time. Since this is the best bound one can hope for, it suffices to make this particular algorithm in-place, rather than give a black-box transformation. We use the same algorithm as above. The only challenge is to make step 2 work: sort n keys with O(n polylog(n)) bits of space, even if the keys have w > polylog(n) bits.

We may assume w is sufficiently large, which simplifies the algorithm of [1] to two stages. In the first stage, a signature of O(log n) bits is generated for each input value (through hashing), and these signatures are sorted in linear time. Since we are now working with O(log n)-bit keys regardless of the original w, this part needs O(n polylog(n)) bits of space, and it can be handled as above.

From the sorted signatures, an additional pass extracts a subkey of w/log n bits from each input value. Then, these subkeys are sorted in linear time. Finally, the order of the original keys is determined from the sorted subkeys and the sorted signatures.

To reduce the space in this stage, we first note that the algorithm for extracting subkeys does not require additional space. We can then isolate the subkey from the rest of the key using shifts, and group the subkeys and the remainders of the keys in separate arrays, taking linear time. This way, by extracting the subkeys instead of copying them, we require no extra space. We now note that the algorithm in [1] for sorting the subkeys also does not require additional space. At the end, we recompose the keys by applying the inverse permutation to the subkeys, and shifting them back into the keys.

Finally, sorting the original keys only requires knowledge of the signatures and order information about the subkeys. Thus, it requires O(n polylog(n)) bits of space, which we have. At the end, we find the sorted order of the original keys and we can implement the permutation in linear time.

In the following, we denote n^ε by r.
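To make the signature stage above concrete, here is a hedged Python toy (the function name, hash family, and parameters are our own illustrative choices, not those of [1]): it shows that stably sorting by collision-free signatures makes equal keys adjacent without ordering the groups themselves, which is precisely the gap that the subkey stage of [1] fills.

```python
import random

def group_equal_keys_by_signature(keys, sig_bits=32):
    """Toy sketch of the signature stage (our own rendering, not the
    algorithm of [1]): map each wide key to a short signature with a
    random modular hash, retrying until no two distinct keys collide,
    then stably sort by signature.  Signatures do not preserve the order
    of the keys, so this only groups duplicates together."""
    prime = (1 << sig_bits) - 5          # 2^32 - 5 is prime
    distinct = set(keys)
    while True:
        a = random.randrange(1, prime)   # random hash multiplier
        if len({a * k % prime for k in distinct}) == len(distinct):
            break                        # hash is injective on this input
    return sorted(keys, key=lambda k: a * k % prime)  # stable sort by signature

out = group_equal_keys_by_signature([2**100, 5, 2**100, 7, 5])
# equal keys are now adjacent, though the order between groups is arbitrary
```

The retry loop terminates quickly with high probability, since a random multiplier is unlikely to collide on any fixed input.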
Before we begin, let us recall that two consecutive sequences X and Y, possibly of different sizes, can be exchanged stably, in place and in linear time with three sequence reversals, since YX = (X^R Y^R)^R. Let us give a short overview of the algorithm. We have three phases.

Preliminary phase (§ 5.1). The purpose of this phase is to obtain some collections of elements to be used with the three techniques described in § 3. We extract the Θ(n/log n) smallest and largest elements of S. They will form an encoded memory of Θ(n/log n) bits. Then, we extract from the remaining sequence the Θ(n^ε) smallest elements and divide them into O(1) jump zones of equal length. After that, we extract from the remaining sequence some equally sized sets of distinct elements. Each set is collected into a contiguous zone. At the end of the phase, we have guide, distribution, pointer, head and spare zones.

Aggregating phase (§ 5.2). After the preliminary phase, we have reduced the problem to sorting a smaller sequence S′ (still of O(n) size) using the various sequences built to be used with the basic techniques in § 3. Let d be the number of distinct elements in S′ (computed during the preliminary phase). The objective of this phase is to sort each subsequence of size Θ(d polylog(n)) of the main sequence S′. For any such subsequence S′_l, we first find a set of pivots and then sort S′_l with a distributive approach. The guide zone is sorted and is used to retrieve in sorted order the lists of equal elements produced by the distribution. The distribution zone provides the sets of pivot elements, which are progressively moved into one of the spare zones. The head zone furnishes placeholder elements for the distributive processes. The distributive process depends on the hypothesis that every run of d contiguous elements of S′_l is sorted; the algorithm for sorting Θ(d) contiguous elements stably, in O(d) time and O(1) space, is described in § 5.2.1.

Final phase (§ 5.3). After the aggregating phase, the main sequence S′ has all its subsequences of Θ(d polylog(n)) elements in sorted order. With an iterative merging process, we obtain from S′ two new sequences: a small sequence containing O(d log n) sorted elements, and a large sequence still containing O(n) elements but with an important property: the length of any subsequence of equal elements is a multiple of a suitable number Θ(log n). By exploiting this property, the large sequence is sorted using the encoded memory and merged with the small one. Finally, we take care of all the zones built in the preliminary phase. Since they have sizes either O(n/log n) or O(n^ε), they can easily be sorted within our target bounds.

The preliminary phase has two main steps, described in Sections 5.1.1 and 5.1.2.
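The three-reversal block exchange recalled in the overview (YX = (X^R Y^R)^R) is used constantly in the phases below; a minimal Python sketch (function names are ours):

```python
def reverse(a, i, j):
    """Reverse a[i..j] in place."""
    while i < j:
        a[i], a[j] = a[j], a[i]
        i, j = i + 1, j - 1

def exchange_blocks(a, start, mid, end):
    """Stably exchange the consecutive blocks X = a[start..mid-1] and
    Y = a[mid..end-1] in place and in linear time, using the identity
    YX = (X^R Y^R)^R."""
    reverse(a, start, mid - 1)   # X -> X^R
    reverse(a, mid, end - 1)     # Y -> Y^R
    reverse(a, start, end - 1)   # (X^R Y^R)^R = YX

a = [1, 2, 3, 7, 8]
exchange_blocks(a, 0, 3, 5)      # X = [1, 2, 3], Y = [7, 8]
# a is now [7, 8, 1, 2, 3]
```

Note that the exchange uses O(1) auxiliary words and preserves the internal order of each block, which is what makes it suitable for stable in-place algorithms.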
We start by collecting some pairs of distinct elements. We go from S to a sequence Z′ Y′ X Y′′ Z′′ with the same two properties we saw earlier: we obtain m pairs of distinct elements (M′[1], M′′[1]), . . . , (M′[m], M′′[m]) with which to encode information by bit stealing. We use the encoded memory based on these pairs as if it were actual memory (that is, we will allocate arrays, we will index and modify entries of these arrays, etc.). However, in order not to lose track of the costs, the names of encoded structures will always be distinguished typographically: I, U, etc.

We allocate two arrays I_bg and I_en, each one with r = n^ε entries of 1 bit each. The entries of both arrays are set to 0. I_bg and I_en will be used each time the procedure in § 5.2.1 is invoked.

5.1.2 Jump, guide, distribution, pointer, head and spare zones

The second main step of this phase has six sub-steps.
First. Let us suppose |X| > n/log n (otherwise we sort it with the mergesort in [10]). Using the selection and partitioning in [6, 7], we go from X to J X′ such that J consists of the element j* with rank 3r + 1 = 3n^ε + 1 (in X) plus all the elements (in X) ≤ j*. Then, we move the rightmost element equal to j* to the last position of J (easily done in O(|J|) time and stably with a sequence exchange).

Second. Let us suppose |X′| > n/log n. With this step and the next one we separate out the elements that appear fewer than 7 times. Let us allocate in our encoded memory of m = O(n/log n) bits an encoded array I with r (= n^ε) entries of 4 bits each. All the entries are initially set to 0. Then, we start scanning X′ from left to right. For any element u ∈ X′ accessed during the scan, if I[u] < 7, we increment I[u] by one.

Third. We scan X′ again. Let u ∈ X′ be the i-th element accessed. If I[u] < 7, we decrement I[u] by 1 and exchange X′[i] (= u) with J[i]. At the end of the scan we have that J = W J′′, where W contains the elements of X′ occurring fewer than 7 times in X′. Then, we have to gather the elements previously in J and now scattered in X′. We accomplish this with the partitioning in [6], using J[|J|] to discern between elements previously in J and the ones belonging to X′ (we know that J[|J|] is equal to j* and, for any j ∈ J and any x′ ∈ X′, j ≤ J[|J|] < x′). After that we have W J′′ J′ X′′, where the elements in J′ are the ones previously in J and exchanged during the scan of X′. We exchange W with J′, ending up with J W X′′.

Fourth. We know that each element of X′′ occurs at least 7 times in it. We also know that the entries of I encode either 0 or 7. We scan X′′ from left to right. Let u ∈ X′′ be the i-th element accessed. If I[u] = 7, we decrement I[u] by one and we exchange X′′[i] (= u) with J[i]. After the scan we have that J = G J′′′, where, for any j, G[j] was the leftmost occurrence of its kind in X′′ (before the scan). Then, we sort G with the mergesort in [10] (|G| = O(r) = O(n^ε), so this takes o(n) time). We then gather the elements previously in J and now scattered in X′′ because of the scan. We end up with the sequence J W G X′′′. We repeat the same process (only testing for I[u] equal to 6 instead of 7) to gather the leftmost occurrence of each distinct element in X′′′ into a zone D, ending up with the sequence J W G D X′′′′.

Fifth. Each element of X′′′′ occurs at least 5 times in it, and the entries of I encode either 0 or 5. We scan X′′′′; let u ∈ X′′′′ be the i-th element accessed. If I[u] = 5, we decrement I[u] by 1 and exchange X′′′′[i] (= u) with J[i]. After the scan we have that J = P J′′′, where, for any j, P[j] was the leftmost occurrence of its kind in X′′′′ (before the scan). Unlike the fourth step, we do not sort P. We repeat the process finding T_1, T_2, T_3 and H, containing the second, third, fourth and fifth leftmost occurrences of each distinct element in X′′′′, respectively. After each of these processes, we gather back the elements previously in J scattered in X′′′′ (with the same technique used in the third and fourth steps). We end up with the sequence J W G D P T_1 T_2 T_3 H S′.

Sixth. Let us divide J into J_1 J_2 J_3 V, where |J_1| = |J_2| = |J_3| = r and |V| = |J| − 3r. We scan G; let u ∈ G be the i-th element accessed: we exchange T_1[i], T_2[i] and T_3[i] with J_1[u], J_2[u] and J_3[u], respectively.

We will refer to J_1, J_2 and J_3 as jump zones. Zones G, D, P and H will be referred to as guide, distribution, pointer and head zones, respectively. Finally, T_1, T_2 and T_3 will be called spare zones. With the preliminary phase we have passed from the initial sequence S to M′ J_1 J_2 J_3 V W G D P T_1 T_2 T_3 H S′ M′′. We allocated in the encoded memory two arrays I_bg and I_en. The encoded memory, I_bg and I_en, and the jump, guide, distribution, pointer, head and spare zones will be used in the next two phases to sort S′. Zones V and W are a byproduct of the phase and will not have an active role in the sorting of S′. The number of distinct elements in the sequence S′ is less than or equal to the sizes of the guide, distribution, pointer and head zones. For the rest of the paper we will denote |G| (= |D| = |P| = |H|) with d.

Lemma 3.
The preliminary phase requires O(n) time, uses O(1) auxiliary words and is stable.

Let us divide S′ into k subsequences S′_1 S′_2 . . . S′_k with |S′_i| = d log^β n, for a suitable constant β. Let t = log^δ n, for a suitable constant δ < 1. We will assume that d ≥ (2t + 1) log |S′_i|. We leave the particular case where d < (2t + 1) log |S′_i| for the full paper. For a generic 1 ≤ l ≤ k, let us assume that any S′_{l′} with l′ < l has been sorted, and that H is next to the left end of S′_l. To sort S′_l we have two main steps, described in §§ 5.2.2 and 5.2.3.

5.2.1 Sorting O(d) contiguous elements

Let us show how to exploit the two arrays I_bg and I_en (allocated in the encoded memory in the preliminary phase) and the jump, guide and pointer zones to sort a sequence A, with |A| ≤ d, stably in O(|A|) time and using O(1) auxiliary words. The process has two steps.

First. We scan A; let u ∈ A be the i-th element accessed. Let p = P[i] and h = J[u]. If I_bg[u] = 0, we set both I_bg[u] and I_en[p] to 1. In any case, we exchange J[u] (= h) with J[p] and A[i] (= u) with J[p]. Then, we exchange P[i] (= p) with J[u] (which is not h anymore).

Second. Let j = |A|. We scan G; let g ∈ G be the i-th element accessed. If I_bg[g] = 0, we do nothing. Otherwise, let p be J[g]; we set I_bg[g] = 0 and execute the following three steps. (i) We exchange J[p] with A[j], then J[g] with P[j] and finally J[g] with J[p]. (ii) We decrease j by 1. (iii) If I_en[p] = 1, we set I_en[p] = 0 and the sub-process ends; otherwise, let p be J[g + 1], and we go to (i).

Let us remark that the O(|A|) entries of I_bg and I_en that are changed from 0 to 1 in the first step are set back to 0 in the second one. I_bg and I_en were initialized in the preliminary phase; we could not afford to re-initialize them every time we invoke the process (they have r = n^ε entries and |A| may be o(n^ε)).

Lemma 4.
Using the encoded arrays I_bg, I_en and the jump, guide and pointer zones, a sequence A with |A| ≤ d can be sorted stably, in O(|A|) time and using O(1) auxiliary words.

5.2.2 Finding the pivots for S′_l

We find a set of pivots {e_1, e_2, . . . , e_{p−1}, e_p} with the following properties: (i) |{x ∈ S′_l | x < e_1}| ≤ d; (ii) |{x ∈ S′_l | e_p < x}| ≤ d; (iii) |{x ∈ S′_l | e_i < x < e_{i+1}}| ≤ d, for any 1 ≤ i < p; (iv) p = Θ(log^β n). In the end the pivots reside in the first p positions of D. We have four steps.

First. We allocate in the encoded memory an array P with r entries of log |S′_l| bits each, but we do not initialize every entry of P. We initialize only the d of them we will need: for any i = 1 . . . |G|, we set P[G[i]] to 0.

Second. We scan S′_l from left to right. Let u ∈ S′_l be the i-th element accessed; we increment P[u] by 1.

Third. We sort the distribution zone D using the algorithm described in § 5.2.1.

Fourth. Let i = 1, j = 0 and p = 0. We repeat the following process until i > |G|. (i) Let u = G[i]; we set j = j + P[u]. (ii) If j < d, we increase i by 1 and go to (i). If j ≥ d, we increase p by 1, exchange D[i] with D[p], increase i by 1, set j to 0 and go to (i).

5.2.3 Sorting S′_l

Let p be the number of pivots for S′_l selected in § 5.2.2; they reside in the first p positions of D. Let u be log |S′_l|. Let us assume that H is next to the left end of S′_l. We have six steps.

First. Let i = 0 and j = 0. The following two steps are repeated until i > p: (i) we increase i by p/t and j by 1; (ii) we exchange D[i] with T[j]. We will denote by p′ the number of selected pivots, now temporarily residing in the first p′ positions of T.

Second. Let us divide S′_l into q = |S′_l|/d blocks B_1 B_2 . . . B_q of d elements each. We sort each B_i using the algorithm in § 5.2.1.

Third. With a sequence exchange we bring H next to the right end of S′_l. Let us divide H into H_1 Ĥ_1 . . . H_{p′} Ĥ_{p′} H_{p′+1} H′, where |H_{p′+1}| = |H_i| = |Ĥ_i| = u. Let f = |S′_l|/u. We allocate the following arrays: (i) U_suc and U_pre, both with f + 2p′ + 1 entries of Θ(u) bits; (ii) H and Ĥ, with p′ + 1 and p′ entries of Θ(log u) bits, respectively; (iii) L and L̂, with p′ + 1 and p′ entries of Θ(u) bits; (iv) N and N̂, with p′ + 1 and p′ entries of Θ(u) bits. Each entry of every array is initialized to 0.

Fourth. In this step we want to transform S′_l and H in the following ways. We pass from S′_l to U_1 U_2 . . . U_{f′−1} U_{f′} H′′, where the U_i's are called units, and for which the following holds.

(i) f′ ≥ f − (2p′ + 1) and |U_i| = u, for any 1 ≤ i ≤ f′.

(ii) For any U_i, 1 ≤ i ≤ f′, one of the following holds: (a) there exists a 1 ≤ j ≤ p′ such that x = T[j], for any x ∈ U_i; (b) x < T[1], for any x ∈ U_i; (c) T[p′] < x, for any x ∈ U_i; (d) there exists a 1 ≤ j′ ≤ p′ − 1 such that T[j′] < x < T[j′ + 1], for any x ∈ U_i.

(iii) Let us call a set of related units a maximal set of units U = {U_{i_1}, U_{i_2}, . . . , U_{i_z}} for which one of the following conditions holds: (a) there exists a 1 ≤ j ≤ p′ such that x = T[j], for any x ∈ U_i and any U_i ∈ U; (b) x < T[1], for any x ∈ U_i and any U_i ∈ U; (c) T[p′] < x, for any x ∈ U_i and any U_i ∈ U; (d) there exists a 1 ≤ j′ ≤ p′ − 1 such that T[j′] < x < T[j′ + 1], for any x ∈ U_i and any U_i ∈ U. For any set of related units U = {U_{i_1}, U_{i_2}, . . . , U_{i_z}} we have that U_suc[i_y] = i_{y+1} and U_pre[i_{y+1}] = i_y, for any 1 ≤ y ≤ z − 1.

We also pass from H to H_1 Ĥ_1 . . . H_{p′} Ĥ_{p′} H_{p′+1} H′ (before this step, all the elements in H were the original ones gathered in § 5.1.2), so that the following hold.

(iv) The elements in H′ and H′′, plus the elements in H_i[H[i] + 1 . . . u] and in Ĥ_{i′}[Ĥ[i′] + 1 . . . u], for any 1 ≤ i ≤ p′ + 1 and 1 ≤ i′ ≤ p′, form the original set of elements that were in H before the fourth step.

(v) We have that: (a) x < T[1], for any x ∈ H_1[1 . . . H[1]]; (b) x > T[p′], for any x ∈ H_{p′+1}[1 . . . H[p′ + 1]]; (c) T[i − 1] < x < T[i], for any x ∈ H_i[1 . . . H[i]] and any 2 ≤ i ≤ p′; (d) x = T[i], for any x ∈ Ĥ_i[1 . . . Ĥ[i]] and any 1 ≤ i ≤ p′.

(vi) (a) Let j = L[1] (respectively, j = L[p′ + 1]); U_j is the rightmost unit such that x < T[1] (respectively, x > T[p′]) for any x ∈ U_j. (b) For any 2 ≤ i ≤ p′, let j = L[i]; U_j is the rightmost unit such that T[i − 1] < x < T[i] for any x ∈ U_j. (c) For any 1 ≤ i ≤ p′, let j = L̂[i]; U_j is the rightmost unit such that x = T[i] for any x ∈ U_j.

(vii) (a) N[1] (respectively, N[p′ + 1]) is the number of x ∈ S′_l such that x < T[1] (respectively, x > T[p′]). (b) For any 2 ≤ i ≤ p′, N[i] is the number of x ∈ S′_l such that T[i − 1] < x < T[i]. (c) For any 1 ≤ i ≤ p′, N̂[i] is the number of x ∈ S′_l such that x = T[i].

Let h = H[1], let i = 1 and let j = 1. We start the fourth step by scanning B_1. If B_1[i] < T[1], we increase h, H[1] and N[1] by 1, exchange B_1[i] with H_1[h] and increase i by 1. This sub-process goes on until one of the following two events happens: (a) h and H[1] are equal to u + 1; (b) B_1[i] ≥ T[1]. If event (a) happens, we exchange the u elements currently in H_1 with S′_l[(j − 1)u + 1 . . . ju]. Then, we set U_pre[j] to L[1], U_suc[L[1]] to j and L[1] to j. After that, we set h and H[1] to 0 and we increment j by 1. Finally, we go back to the scanning of B_1. Otherwise, if event (b) happens, we set h to Ĥ[1] and we continue the scanning of B_1 with the following sub-process: if B_1[i] = T[1], we increase h, Ĥ[1] and N̂[1] by 1, exchange B_1[i] with Ĥ_1[h] and increase i by 1. In its turn, this sub-process goes on until one of the following two events happens: (a′) h and Ĥ[1] are equal to u + 1; (b′) B_1[i] > T[1]. Similarly to what we did for event (a), if event (a′) happens, we exchange the u elements currently in Ĥ_1 with S′_l[(j − 1)u + 1 . . . ju]. Then, we set U_pre[j] to L̂[1], U_suc[L̂[1]] to j and L̂[1] to j. After that, we set h and Ĥ[1] to 0 and we increment j by 1. Finally, we go back to the scanning of B_1. Otherwise, if event (b′) happens, we set h to H[2] and we continue the scanning of B_1 with the following sub-process: if B_1[i] < T[2], we increase h, H[2] and N[2] by 1, exchange B_1[i] with H_2[h] and increase i by 1. We continue in this fashion, possibly passing to Ĥ[2], H[3], etc., until B_1 is exhausted. Then, the whole process is applied to B_2 from the beginning. When B_2 is exhausted we pass to B_3, B_4 and so forth, until each block is exhausted.

Fifth. We start by exchanging H′′ and H_1 Ĥ_1 . . . H_{p′} Ĥ_{p′} H_{p′+1}. Let h = 1 and h′ = u − H[1]. We exchange the elements in H_1[H[1] + 1 . . . u] with the ones in T[h . . . h′]. After that, we set h = h′, increment h′ by u − Ĥ[1] and exchange the elements in Ĥ_1[Ĥ[1] + 1 . . . u] with the ones in T[h . . . h′]. We proceed in this fashion until H_{p′+1} is done. Then we exchange the elements in H′′ H′ with the rightmost |H′′ H′| ones in T. After that we execute the following process to "link" the H_i's and Ĥ_i's to their respective sets of related units. We start by setting U_pre[f′ + 1] to L[1] and U_suc[L[1]] to f′ + 1. Then we set U_pre[f′ + 2] to L̂[1] and U_suc[L̂[1]] to f′ + 2. We proceed in this fashion until we set U_pre[f′ + 2p′ + 1] to L[p′ + 1] and U_suc[L[p′ + 1]] to f′ + 2p′ + 1. After that we execute the following process to bring each set of related units into a contiguous zone. Let i = ⌊N[1]/u⌋ + 1; we exchange H_1 with U_i; we swap the values in U_pre[i] and U_pre[f′ + 1], then the values in U_suc[i] and U_suc[f′ + 1]; we set U_pre[U_suc[f′ + 1]] = f′ + 1 and U_suc[U_pre[f′ + 1]] = f′ + 1; finally, we set U_suc[U_pre[i]] = i. Then, let j = U_pre[i]; we decrement i by 1 and exchange U_j with U_i; we swap the values in U_pre[i] and U_pre[j], then the values in U_suc[i] and U_suc[j]; we set U_pre[U_suc[j]] = j and U_suc[U_pre[j]] = j; finally, we set U_pre[U_suc[i]] = i and U_suc[U_pre[i]] = i. We proceed in this fashion until the entire set of related units of H_1 resides in S′_l[1 . . . N[1] − H[1]] (H_1 now resides in S′_l[N[1] − H[1] + 1 . . . N[1] − H[1] + u]). After that, we apply the same process to Ĥ_1, H_2, Ĥ_2 and so forth, until every set of related units has been compacted into a contiguous zone. We end up with the sequence U_1 H_1 Û_1 Ĥ_1 . . . U_{p′} H_{p′} Û_{p′} Ĥ_{p′} U_{p′+1} H_{p′+1} H′′ H′, where each U_i contains the set of related units of H_i and each Û_i the set of related units of Ĥ_i. Finally, we proceed to separate the elements of S′_l from the ones residing in H_i[H[i] + 1 . . . u] and Ĥ_{i′}[Ĥ[i′] + 1 . . . u], for any 1 ≤ i ≤ p′ + 1 and any 1 ≤ i′ ≤ p′. Since the "intruders" were previously residing in T, by the first and sixth steps in § 5.1.2 they are smaller than any element of S′_l. Therefore we can separate them with the stable partitioning algorithm in [6] and end up with the sequence R_1 R̂_1 . . . R_{p′} R̂_{p′} R_{p′+1} H′′′ H′′ H′. Finally, we exchange the elements in the sequence H = H′′′ H′′ H′ with the ones in T, getting back in H its original (distinct) elements.

Sixth. After the fifth step we are left with the sequence R_1 R̂_1 . . .
R_{p′} R̂_{p′} R_{p′+1} H, for which the following holds: (i) N[i] = |R_i| and N̂[i′] = |R̂_{i′}|, for any 1 ≤ i ≤ p′ + 1 and any 1 ≤ i′ ≤ p′; (ii) for any x ∈ R_1, x < T[1]; (iii) for any x ∈ R_{p′+1}, x > T[p′]; (iv) for any x ∈ R̂_i, x = T[i] (1 ≤ i ≤ p′); (v) for any x ∈ R_i, with 2 ≤ i ≤ p′, T[i − 1] < x < T[i]. We begin by moving H before R_1 with a sequence exchange. Since we do not need the p′ pivots anymore, we put them back in their original positions in D[1 . . . p], executing once again the process in the first step. Then, we allocate in the encoded memory an array R with p′ + 1 entries of Θ(u) bits. Let i = 1 and R[1] = 1: for j = 2, . . . , p′ + 1 we increment i by N[j − 1] + N̂[j − 1] and set R[j] = i. After that, we execute a series of p′ + 1 recursive invocations of the procedure here in § 5.2.3. First, R_1 = S′_l[R[1] . . . R[2] − 1] is sorted recursively with the same procedure in this section: we use S′_l[R[1] . . . R[2] − 1] in place of S′_l, D[1 . . . p/t − 1] in place of D, and so forth. After R_1 is sorted, we swap H and R_1 with a sequence exchange and proceed to sort R_2: we use S′_l[R[2] . . . R[3] − 1] in place of S′_l, D[p/t + 1 . . . 2p/t − 1] in place of D, and so forth. We proceed in this fashion until R_{p′+1} is sorted and H is located after it again. We do not need anything particularly complex to handle the recursion with O(1) words, since there can be only O(1) nested invocations.

Lemma 5.
The aggregating phase requires O(n) time, uses O(1) auxiliary words and is stable.

The final phase has two main steps, described in §§ 5.3.1 and 5.3.2.

5.3.1 Sorting S′

After the aggregating phase we are left with S′ = S′_1 S′_2 . . . S′_k where, for any 1 ≤ i ≤ k, S′_i is sorted and |S′_i| = d log^β n. Let f = log n. We have three steps.

First. We allocate in the encoded memory an array S′ with |S′| entries of one bit, initially set to 1. Then, we scan S′_1. During the scan, as soon as we encounter a subsequence S′[i . . . i + f − 1] of equal elements (that is, a subsequence of f consecutive equal elements), we set S′[i], S′[i + 1], . . . , S′[i + f − 2] and S′[i + f − 1] to 0. After that we use the partitioning algorithm in [6] to separate the elements with their corresponding entries in S′ set to 1 from the ones with their entries set to 0 (during the execution of the partitioning algorithm in [6], each time two elements are exchanged, the values of their entries in S′ are exchanged too). After the partitioning we have that S′_1 = S′′ O and the following conditions hold: (i) S′′ and O are still sorted (S′_1 was sorted and the partitioning is stable); (ii) the length of any maximal subsequence of consecutive equal elements of S′′ is a multiple of f; (iii) |O| ≤ df. Then, we merge O and S′_2 using the merging algorithm in [10], obtaining a sorted sequence S′′′ with |S′′′| = O(d log^β n). After that, we apply to S′′′ the same process we applied to S′_1, ending up with S′′′ = S′′ O, where conditions (i), (ii) and (iii) hold for S′′ and O too. We proceed in this fashion until S′_k is done.

Second. We now have that S′ = S′′ O. The following conditions hold: (i) O is sorted and |O| ≤ df; (ii) the length of any maximal subsequence of consecutive equal elements of S′′ is a multiple of f (and so |S′′| is a multiple of f too). Let s = |S′′|/f and let us divide S′′ into s subsequences F_1 F_2 . . . F_{s−1} F_s with |F_i| = f. We allocate in the encoded memory two arrays S′′_pre and S′′_suc, each one with s entries of Θ(log n) bits. We also allocate an array C with r (= n^ε) entries of Θ(log n) bits, each one initialized to 0. Then, for each F_i, from the rightmost to the leftmost one, we do the following. Let v = F_i[1]. If C[v] = 0, we set S′′_suc[i] = 0 and C[v] = i. Otherwise, if C[v] ≠ 0, we set S′′_suc[i] = C[v], S′′_pre[C[v]] = i and C[v] = i.

Third. We scan C and find the leftmost entry not equal to 0; let it be i.
Let j = C[i]; we exchange F_1 with F_j and do the following: (i) we swap the values in S′′_pre[1] and S′′_pre[j], then the values in S′′_suc[1] and S′′_suc[j]; (ii) we set S′′_pre[S′′_suc[j]] = j and S′′_suc[S′′_pre[j]] = j; (iii) finally, we set S′′_pre[S′′_suc[1]] = 1 and S′′_suc[S′′_pre[1]] = 1. Then, let j′ = S′′_suc[1]. We exchange F_2 and F_{j′}, and then we make similar adjustments to their entries in S′′_pre and S′′_suc. We proceed in this fashion until we exhaust the linked list associated with the i-th entry of C. After that we continue to scan C, find the leftmost non-zero entry i′ > i and process its associated list in the same way. At the end of the process, the F_i's have been permuted into stable sorted order. Now both S′′ and O are sorted. We merge them with the merging algorithm in [10] and S′ is finally sorted.

5.3.2 Taking care of the encoded memory and the zones

In the last main step of the final phase we sort all the zones that were built in the preliminary phase. We scan G; let u ∈ G be the i-th element accessed: we exchange T_1[i], T_2[i] and T_3[i] with J_1[u], J_2[u] and J_3[u], respectively. We swap S′′ and H (H has been moved after S′′ at the end of the aggregating phase). W, G, D, P, T_1, T_2, T_3 and H have O(d) = O(n^ε) elements and we can sort them using the mergesort in [10]. The obtained sequence can now be merged with S′′ using the merging algorithm in [10]. J_1, J_2, J_3 and V have O(n^ε) elements. We sort them using the mergesort in [10]. Finally, M′ and M′′ have Θ(n/log n) elements. We sort them with the mergesort in [10] and we are done.

Lemma 6.
The final phase requires O(n) time, uses O(1) auxiliary words and is stable.

Acknowledgments. Our sincere thanks to Michael Riley and Mehryar Mohri of Google, NY, who, motivated by manipulating transitions of large finite state machines, asked whether bucket sorting can be done in place in linear time.
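For completeness, the bit-stealing idea behind the encoded memory used throughout § 5 (one bit per pair (M′[i], M′′[i]) of distinct elements, encoded by the pair's relative order) can be sketched as follows. This is a minimal rendering under our own naming; it ignores how the pairs are collected stably in the preliminary phase.

```python
class StolenBitArray:
    """Sketch of bit stealing: pairs (m1[i], m2[i]) of distinct elements
    encode bit i by their relative order; swapping the pair flips the bit.
    Reading or writing a bit costs O(1) and uses no space beyond the pairs."""
    def __init__(self, m1, m2):
        # Precondition: m1[i] != m2[i] for all i (the pairs are distinct).
        self.m1, self.m2 = m1, m2
    def read(self, i):
        return 1 if self.m1[i] > self.m2[i] else 0
    def write(self, i, bit):
        if self.read(i) != bit:          # flip by swapping the pair
            self.m1[i], self.m2[i] = self.m2[i], self.m1[i]

# 4 pairs of distinct elements give 4 bits of encoded memory
bits = StolenBitArray([1, 3, 5, 7], [2, 4, 6, 8])
bits.write(0, 1)
bits.write(2, 1)
# reading back: [1, 0, 1, 0]
```

Note that the multiset of stored elements never changes, which is why the encoded memory can later be sorted away in § 5.3.2 without losing any input element.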
References

[1] Arne Andersson, Torben Hagerup, Stefan Nilsson, and Rajeev Raman. Sorting in linear time? Journal of Computer and System Sciences, 57(1):74–93, August 1998.

[2] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 2001.

[3] M. L. Fredman and D. E. Willard. Surpassing the information theoretic bound with fusion trees. Journal of Computer and System Sciences, 47:424–436, 1993.

[4] M. L. Fredman and D. E. Willard. Trans-dichotomous algorithms for minimum spanning trees and shortest paths. Journal of Computer and System Sciences, 48(3):533–551, 1994.

[5] Yijie Han and Mikkel Thorup. Integer sorting in O(n √(log log n)) expected time and linear space. In FOCS, pages 135–144. IEEE Computer Society, 2002.

[6] Jyrki Katajainen and Tomi Pasanen. Stable minimum space partitioning in linear time. BIT, 32(4):580–585, 1992.

[7] Jyrki Katajainen and Tomi Pasanen. Sorting multisets stably in minimum space. Acta Informatica, 31(4):301–313, 1994.

[8] M. A. Kronrod. Optimal ordering algorithm without operational field. Soviet Math. Dokl., 10:744–746, 1969.

[9] J. Ian Munro. An implicit data structure supporting insertion, deletion, and search in O(log n) time. Journal of Computer and System Sciences, 33(1):66–74, 1986.

[10] Jeffrey Salowe and William Steiger. Simplified stable merging tasks. Journal of Algorithms, 8(4):557–571, 1987.