RAM-Efficient External Memory Sorting
Lars Arge (MADALGO, Aarhus University, Aarhus, Denmark) and Mikkel Thorup (University of Copenhagen, Copenhagen, Denmark)
Abstract.
In recent years a large number of problems have been considered in external memory models of computation, where the complexity measure is the number of blocks of data that are moved between slow external memory and fast internal memory (also called I/Os). In practice, however, internal memory time often dominates the total running time once I/O-efficiency has been obtained. In this paper we study algorithms for fundamental problems that are simultaneously I/O-efficient and internal memory efficient in the RAM model of computation.
In the last two decades a large number of problems have been considered in the external memory model of computation, where the complexity measure is the number of blocks of elements that are moved between external and internal memory. Such movements are also called I/Os. The motivation behind the model is that random access to external memory, such as disks, is often many orders of magnitude slower than random access to internal memory; on the other hand, if external memory is accessed sequentially in large enough blocks, then the cost per element is small. In fact, disk systems are often constructed such that the time spent on a block access is comparable to the time needed to access each element in a block in internal memory.

Although the goal of external memory algorithms is to minimize the number of costly blocked accesses to external memory when processing massive datasets, it is also clear from the above that if the internal processing time per element in a block is large, then the practical running time of an I/O-efficient algorithm is dominated by internal processing time. Often I/O-efficient algorithms are in fact not only efficient in terms of I/Os, but can also be shown to be internal memory efficient in the comparison model. Still, in many cases the practical running time of I/O-efficient algorithms is dominated by the internal computation time. Thus both from a practical and a theoretical point of view it is interesting to investigate how internal-memory efficient algorithms can be obtained while simultaneously ensuring that they are I/O-efficient. In this paper we consider algorithms that are both I/O-efficient and efficient in the RAM model in internal memory.

⋆ This paper will appear in the Proceedings of the 24th International Symposium on Algorithms and Computation, LNCS 8283, Springer, 2013.
⋆⋆ Supported in part by the Danish National Research Foundation and the Danish National Advanced Technology Foundation.
⋆⋆⋆ Supported in part by an Advanced Grant from the Danish Council for Independent Research under the Sapere Aude research career program.
† Center for Massive Data Algorithmics, a center of the Danish National Research Foundation.
‡ Part of this work was done while the author was at AT&T Labs–Research.
Previous results.
We will be working in the standard external memory model of computation, where M is the number of elements that fit in main memory and an I/O is the process of moving a block of B consecutive elements between external and internal memory [1]. We assume that N ≥ M and M ≥ 2B. We also make the standard indivisibility assumption, which states that at any given time during an algorithm the original N input elements are stored somewhere in external or internal memory. Our internal memory time measure is simply the number of performed operations; note that this includes the number of elements transferred between internal and external memory.

Aggarwal and Vitter [1] described sorting algorithms using O((N/B) log_{M/B}(N/B)) I/Os. One of these algorithms, external merge-sort, is based on Θ(M/B)-way merging. First O(N/M) sorted runs are formed by repeatedly sorting M elements in main memory, and then these runs are merged together Θ(M/B) at a time to form longer runs. The process continues for O(log_{M/B}(N/M)) phases until one is left with one sorted list. Since the initial run formation and each phase can be performed in O(N/B) I/Os, the algorithm uses O((N/B) log_{M/B}(N/B)) I/Os. Another algorithm, external distribution-sort, is based on Θ(√(M/B))-way splitting. The N input elements are first split into Θ(√(M/B)) sets of roughly equal size, such that the elements in the first set are all smaller than the elements in the second set, and so on. Each of the sets is then split recursively. After O(log_{√(M/B)}(N/M)) = O(log_{M/B}(N/M)) split phases each set can be sorted in internal memory. Although performing the split is somewhat complicated, each phase can still be performed in O(N/B) I/Os.
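As a toy illustration of the Θ(M/B)-way merge phase, the following sketch (our own, not code from the paper) merges sorted runs while only holding one B-element block per run plus one output block in "main memory"; Python lists stand in for on-disk runs and a counter stands in for block transfers:

```python
import heapq

def merge_runs(runs, B):
    """Merge sorted runs, holding one B-element block per run in memory.

    Returns the merged run and the number of simulated block I/Os.
    """
    ios = 0
    positions = [0] * len(runs)   # next unread index in each run
    buffers = [[] for _ in runs]  # in-memory block per run
    heap = []                     # (element, run index)

    def refill(i):
        nonlocal ios
        if not buffers[i] and positions[i] < len(runs[i]):
            buffers[i] = list(runs[i][positions[i]:positions[i] + B])
            positions[i] += len(buffers[i])
            ios += 1              # one read I/O per loaded block

    for i in range(len(runs)):
        refill(i)
        if buffers[i]:
            heapq.heappush(heap, (buffers[i].pop(0), i))

    merged, out_block = [], []
    while heap:
        x, i = heapq.heappop(heap)
        out_block.append(x)
        if len(out_block) == B:   # output block runs full: one write I/O
            merged.extend(out_block)
            out_block = []
            ios += 1
        refill(i)
        if buffers[i]:
            heapq.heappush(heap, (buffers[i].pop(0), i))
    if out_block:
        merged.extend(out_block)
        ios += 1
    return merged, ios
```

With M/B runs this mirrors the O(N/B)-I/O merge described above; a real implementation would of course read and write disk blocks rather than slice lists.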
Thus external distribution-sort also uses O((N/B) log_{M/B}(N/B)) I/Os.

Aggarwal and Vitter [1] proved that external merge- and distribution-sort are I/O-optimal when the comparison model is used in internal memory, and in the following we will use sort_E(N) to denote the number of I/Os per block of elements of these optimal algorithms, that is, sort_E(N) = O(log_{M/B}(N/B)), and external comparison model sorting takes Θ((N/B) · sort_E(N)) I/Os. (As described below, the I/O-efficient algorithms we design will move O(N · sort_E(N)) elements between internal and external memory, so O(sort_E(N)) will also be the per element internal memory cost of obtaining external efficiency.) When no assumptions other than the indivisibility assumption are made about internal memory computation (i.e. covering our definition of the use of the RAM model in internal memory), Aggarwal and Vitter [1] proved that permuting N elements according to a given permutation requires Ω(min{N, (N/B) · sort_E(N)}) I/Os. Thus this is also a lower bound for RAM model sorting. For all practical values of N, M and B the bound is Ω((N/B) · sort_E(N)). Subsequently, a large number of I/O-efficient algorithms have been developed. Of particular relevance for this paper, several priority queues have been developed where insert and deletemin operations can be performed in O(sort_E(N)/B) I/Os amortized [2,4,8]. The structure by Arge [2] is based on the so-called buffer-tree technique, which uses O(M/B)-way splitting, whereas the other structures use O(M/B)-way merging.

In the RAM model the best known sorting algorithm uses O(N log log N) time [6]. Similar to the I/O-case, we use sort_I(N) = O(log log N) to denote the per element cost of the best known sorting algorithm. If randomization is allowed then this can be improved to O(N √(log log N)) expected time [7]. A priority queue can also be implemented so that the cost per operation is O(sort_I(N)) [9].

Our results.
In Section 2 we first discuss how both external merge-sort and external distribution-sort can be implemented to use optimal O(N log N) time if the comparison model is used in internal memory, by using an O(N log N) sorting algorithm and (in the merge-sort case) an O(log N) priority queue. We also show how these algorithms can relatively easily be modified to use O(N · (sort_I(N) + sort_I(M/B) · sort_E(N))) and O(N · (sort_I(N) + sort_I(M) · sort_E(N))) time, respectively, if the RAM model is used in internal memory, by using an O(N · sort_I(N)) sorting algorithm and an O(sort_I(N)) priority queue.

The question is of course if the above RAM model sorting algorithms can be improved. In Section 2 we discuss how it seems hard to improve the running time of the merge-sort algorithm, since it uses a priority queue in the merging step. By using a linear-time internal-memory splitting algorithm, however, rather than an O(N · sort_I(N)) sorting algorithm, we manage to improve the running time of external distribution-sort to O(N · (sort_I(N) + sort_E(N))). Our new split-sort algorithm still uses O((N/B) · sort_E(N)) I/Os. Note that for small values of M/B the N · sort_E(N)-term, that is, the time spent on moving elements between internal and external memory, dominates the internal time. Given the conventional wisdom that merging is superior to splitting in external memory, it is also surprising that a distribution algorithm outperforms a merging algorithm.

In Section 3 we develop an I/O-efficient RAM model priority queue by modifying the buffer-tree based structure of Arge [2]. The main modification consists of removing the need for sorting of O(M) elements every time a so-called buffer-emptying process is performed. The structure supports insert and deletemin operations in O(sort_E(N)/B) I/Os and O(sort_I(N) + sort_E(N)) time.
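In general, any priority queue supporting insert and deletemin sorts N elements via N inserts followed by N deletemins; a minimal sketch, with Python's binary heap standing in for the I/O-efficient queues discussed here:

```python
import heapq

def pq_sort(elements):
    """Sort by N inserts followed by N deletemins. With a priority queue
    whose operations cost O(sort_E(N)/B) I/Os amortized, this pattern
    yields an O((N/B) * sort_E(N))-I/O sorting algorithm."""
    pq = []
    for x in elements:
        heapq.heappush(pq, x)                     # insert
    return [heapq.heappop(pq) for _ in elements]  # deletemin, N times
```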
This priority queue can thus be used to develop another O((N/B) · sort_E(N)) I/O and O(N · (sort_I(N) + sort_E(N))) time sorting algorithm.

Finally, in Section 4 we show that when (N/B) · sort_E(N) = o(N) (and our sorting algorithms are I/O-optimal), any I/O-optimal sorting algorithm must transfer a number of elements between internal and external memory equal to Θ(B) times the number of I/Os it performs, that is, it must transfer Ω(N · sort_E(N)) elements and thus also use Ω(N · sort_E(N)) internal time. In fact, we show a lower bound on the number of I/Os needed by an algorithm that transfers b ≤ B elements on the average per I/O, significantly extending the lower bound of Aggarwal and Vitter [1]. The result implies that (in the practically realistic case) when our split-sort and priority queue sorting algorithms are I/O-optimal, they are in fact also CPU optimal in the sense that their running time is the sum of an unavoidable term and the time used by the best known RAM sorting algorithm. As mentioned above, the lower bound also means that the time spent on moving elements between internal and external memory, resulting from the fact that we are considering I/O-efficient algorithms, can dominate the internal computation time; that is, considering I/O-efficient algorithms implies that less internal-memory efficient algorithms can be obtained than if not considering I/O-efficiency. Furthermore, we show that when B ≤ M^{1−ε} for some constant ε > 0 (the tall cache assumption) the same Ω(N · sort_E(N)) number of transfers is needed for any algorithm using less than εN/4 I/Os.

External merge-sort.
In external merge-sort Θ(N/M) sorted runs are first formed by repeatedly loading M elements into main memory, sorting them, and writing them back to external memory. In the first merge phase these runs are merged together Θ(M/B) at a time to form longer runs. The merging is continued for O(log_{M/B}(N/M)) = O(sort_E(N)) merge phases until one is left with one sorted run. It is easy to realize that M/B runs can be merged together in O(N/B) I/Os: We simply load the first block of each of the runs into main memory, find and output the B smallest elements, and continue this process while loading a new block from the relevant run every time all elements in main memory from that particular run have been output. Thus external merge-sort uses O((N/B) log_{M/B}(N/M)) = O((N/B) · sort_E(N)) I/Os.

In terms of internal computation time, the initial run formation can trivially be performed in O(N/M · M log M) = O(N log M) time using any O(N log N) internal sorting algorithm. Using an O(log(M/B)) priority queue to hold the minimal element from each of the M/B runs during a merge, each of the O(log_{M/B}(N/M)) merge phases can be performed in O(N log(M/B)) time. Thus external merge-sort can be implemented to use O(N log M + log_{M/B}(N/M) · N log(M/B)) = O(N log M + N log(N/M)) = O(N log N) time, which is optimal in the comparison model.

When the RAM model is used in internal memory, we can improve the internal time by using a RAM-efficient O(M · sort_I(M)) time algorithm in the run formation phase and by replacing the O(log(M/B)) priority queue with an O(sort_I(M/B)) time priority queue [9]. This leads to an O(N · (sort_I(M) + sort_I(M/B) · sort_E(N))) time algorithm. There seems to be no way of avoiding the extra sort_I(M/B)-term, since that would require an O(1) time priority queue.

External distribution-sort.
In external distribution-sort the input set of N elements is first split into √(M/B) sets X_0, X_1, ..., X_{√(M/B)−1} defined by s = √(M/B) − 1 split elements x_1 < x_2 < ... < x_s, such that all elements in X_0 are smaller than x_1, all elements in X_{√(M/B)−1} are larger than or equal to x_s, and such that for 1 ≤ i ≤ √(M/B) − 2 all elements in X_i are larger than or equal to x_i and smaller than x_{i+1}. Each of these sets is recursively split until each set is smaller than M (and larger than M/(M/B) = B) and can be sorted in internal memory. If the s split elements are chosen such that |X_i| = O(N/s) then there are O(log_s(N/B)) = O(log_{M/B}(N/B)) = O(sort_E(N)) split phases. Aggarwal and Vitter [1] showed how to compute a set of s split elements with this property in O(N/B) I/Os. Since the actual split of the elements according to the split elements can also be performed in O(N/B) I/Os (just like merging of M/B sorted runs), the total number of I/Os needed by distribution-sort is O((N/B) · sort_E(N)).

Ignoring the split element computation, it is easy to implement external distribution-sort to use O(N log N) internal time in the comparison model: During a split we simply hold the split elements in main memory and perform a binary search among them with each input element to determine to which set X_i the element should go. Thus each of the O(log_{M/B}(N/B)) split phases uses O(N log √(M/B)) time. Similarly, at the end of the recursion we sort O(N/M) memory loads using O(N log M) time in total. The split element computation algorithm of Aggarwal and Vitter [1], or rather its analysis, is somewhat complicated. Still it is easy to realize that it also works in O(N log M) time, as required to obtain an O(N log N) time algorithm in total. The algorithm works by loading the N elements a memory load at a time, sorting them and picking out elements at regular intervals; this takes O(N/M · M log M) = O(N log M) time and results in a set of 4N/√(M/B) elements. Finally, a linear I/O and time algorithm is used √(M/B) times on this set of elements to obtain the split elements, thus using O(N) additional time.

If we use a RAM sorting algorithm to sort the memory loads at the end of the split recursion, the running time of this part of the algorithm is reduced to O(N · sort_I(M)). Similarly, we can use the RAM sorting algorithm in the split element computation algorithm, resulting in an O(N · sort_I(M)) time algorithm and consequently a sort_I(M)-term in the total running time. Finally, in order to avoid the binary search over √(M/B) split elements in the actual split algorithm, we can modify it to use sorting instead: To split N elements among s splitting elements stored in s/B blocks in main memory, we allocate a buffer of one block in main memory for each of the s + 1 output sets.
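This buffered, sorting-based distribution can be sketched as follows (our own toy version: each batch is sorted and then scanned against the splitters in one pass, and a buffer flush stands in for one block write):

```python
def split_batches(elements, splitters, M, B):
    """Split `elements` among sorted `splitters` into len(splitters)+1
    output sets, processing the input in sorted batches of M//2 elements
    and flushing a set's buffer whenever it holds B elements."""
    sets = [[] for _ in range(len(splitters) + 1)]
    buffers = [[] for _ in range(len(splitters) + 1)]
    writes = 0  # simulated block writes

    def emit(i, x):
        nonlocal writes
        buffers[i].append(x)
        if len(buffers[i]) == B:           # buffer runs full: write block
            sets[i].extend(buffers[i])
            buffers[i].clear()
            writes += 1

    for start in range(0, len(elements), M // 2):
        batch = sorted(elements[start:start + M // 2])  # sort the batch ...
        i = 0
        for x in batch:                    # ... then scan it against the
            while i < len(splitters) and x >= splitters[i]:
                i += 1                     #     splitters in one pass
            emit(i, x)
    for i, buf in enumerate(buffers):      # flush partially full buffers
        if buf:
            sets[i].extend(buf)
            writes += 1
    return sets, writes
```

Set X_i receives exactly the elements that are at least x_i and smaller than x_{i+1}; within a set the elements are not sorted, matching the recursive algorithm, where each set is split or sorted in a later phase.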
In total we thus require space for s + (s + 1)B < M/2 + M/2 = M elements in main memory. We can then repeatedly load M/2 of the input elements into main memory, sort them, and distribute them to the s + 1 buffers, outputting the B elements in a buffer when it runs full. Thus this process requires O(N · sort_I(M)) time and O(N/B) I/Os, like the split element finding algorithm. Overall this leads to an O(N · (sort_I(M) + sort_I(M) · sort_E(N))) time algorithm.

Split-sort.
While it seems hard to improve the RAM running time of the external merge-sort algorithm, we can actually modify the external distribution-sort algorithm further and obtain an algorithm that in most cases is optimal both in terms of I/O and time. This split-sort algorithm basically works like the distribution-sort algorithm with the split algorithm modification described above. However, we need to modify the algorithm further in order to avoid the sort_I(M)-term in the time bound that appears due to the repeated sorting of O(M) elements in the split element finding algorithm, as well as in the actual split algorithm.

First of all, instead of sorting each batch of M/2 elements in the actual split algorithm, we split it over the s = √(M/B) − 1 split elements in linear time using the following result.

Lemma 1 (Han and Thorup [7]). In the RAM model N elements can be split over N^{1−ε} split elements in linear time and space for any constant ε > 0.

Secondly, in order to avoid the sorting in the split element finding algorithm of Aggarwal and Vitter [1], we design a new algorithm that finds the split elements on-line as part of the actual split algorithm, that is, we start the splitting with no split elements at all and gradually add at most s = √(M/B) − 1 split elements. To split the N input elements we, as previously, repeatedly bring M/2 elements into main memory, split them over the current split elements, and output the B elements in a buffer when it runs full. However, during the process we keep track of how many elements are output to each subset. If the number of elements in a subset X_i becomes 2N/s, we pause the split algorithm, compute the median of X_i and add it to the set of splitters, and split X_i at the median element into two sets of size N/s. Then we continue the splitting algorithm.

It is easy to see that the above splitting process results in at most s + 1 subsets containing between N/s and 2N/s elements each: a subset is split as soon as it contains 2N/s elements, and each new set (defined by a new split element) contains at least N/s elements. The actual median computation and the split of X_i can be performed in O(|X_i|) = O(N/s) time and O(|X_i|/B) = O(N/(sB)) I/Os [1]. Thus if we charge this cost to the at least N/s elements that were inserted in X_i since it was created, each element is charged O(1) time and O(1/B) I/Os. Thus each distribution phase is performed in linear time and O(N/B) I/Os, leading to an O(N · (sort_I(M) + sort_E(N))) time algorithm.

Theorem 1.
The split-sort algorithm can be used to sort N elements in O(N · (sort_I(M) + sort_E(N))) time and O((N/B) · sort_E(N)) I/Os.

Remarks.
Since sort_I(M) + sort_E(N) ≥ sort_I(N), our split-sort algorithm uses Ω(N · sort_I(N)) time. In Section 4 we prove that the algorithm in some sense is optimal both in terms of I/O and time. Furthermore, we believe that the algorithm is simple enough to be of practical interest.

In this section we discuss how to implement an I/O- and RAM-efficient priority queue by modifying the I/O-efficient buffer tree priority queue [2].
Structure.
Our external priority queue consists of a fan-out Θ(√(M/B)) B-tree [3] T over O(N/M) leaves containing between M/2 and M elements each. In such a tree, all leaves are on the same level and each node (except the root) has fan-out between (1/4)√(M/B) and √(M/B) and contains at most √(M/B) splitting elements defining the element ranges of its children. Thus T has height O(log_{√(M/B)}(N/M)) = O(sort_E(N)). To support insertions efficiently in a "lazy" manner, each internal node is augmented with a buffer of size M, and an insertion buffer of size at most B is maintained in internal memory. To support deletemin operations efficiently, a RAM-efficient priority queue [9] supporting both deletemin and deletemax, called the mini-queue, is maintained in main memory containing the up to M/2 smallest elements of the structure.

Insertion.
To perform an insertion we first check if the element to be inserted is smaller than the maximal element in the mini-queue, in which case we insert the new element in the mini-queue and continue the insertion process with the current maximal element of the mini-queue instead. (A priority queue supporting both deletemin and deletemax can easily be obtained using two priority queues supporting deletemin and delete, as the one by Thorup [9].) Next we insert the element to be inserted in the insertion buffer. When we have collected B elements in the insertion buffer we insert them in the buffer of the root. If this buffer now contains more than M/2 elements we perform a buffer-emptying process on it, "pushing" elements in the buffer one level down to buffers on the next level of T: We load the O(M) elements in the buffer into main memory together with the at most √(M/B) splitting elements, distribute the elements among the splitting elements, and finally output them to the buffers of the relevant children. Since the splitting and buffer elements fit in memory and the buffer elements are distributed to √(M/B) buffers one level down, the buffer-emptying process is performed in O(M/B) I/Os. Since we split O(M) elements over at most √(M/B) splitters, the process can be performed in O(M) time (Lemma 1). After emptying the buffer of the root, some of the nodes on the next level may contain more than M/2 elements in their buffers, in which case we recursively perform buffer-emptying processes on them.

Recall that leaves contain between M/2 and M elements. When (between 1 and M/2) elements are pushed down to a leaf (when performing a buffer-emptying process on its parent), resulting in the leaf containing more than M (and less than 3M/2) elements, we split it into two leaves containing between M/2 and 3M/4 elements each using O(M/B) I/Os and O(M) time [1]. As a result of the split the parent node v gains a child, that is, a new leaf is inserted. If needed, T is then rebalanced using node splits as in a normal B-tree, that is, if the parent node now has √(M/B) children it is split into two nodes with (1/2)√(M/B) children each, while also distributing the elements in v's buffer among the two new nodes. This can easily be accomplished in O(M/B) I/Os and O(M) time. The rebalancing may propagate up along the path to the root (when the root splits, a new root with two children is constructed).

During buffer-emptying processes we push Θ(M) elements one level down the tree using O(M/B) I/Os and O(M) time. Thus each element inserted in the root buffer pays O(1/B) I/Os and O(1) time amortized per level, or O((1/B) log_{M/B}(N/B)) = O(sort_E(N)/B) I/Os and O(log_{M/B}(N/B)) = O(sort_E(N)) time amortized on buffer-emptying processes on a root-leaf path. When a leaf splits we may use O(M/B) I/Os and O(M) time in each node of a leaf-root path of length O(sort_E(N)). Amortizing over the at least M/2 elements that were inserted in the leaf since it was created, each element is charged another O(sort_E(N)/B) I/Os and O(sort_E(N)) time on insertion in the root buffer. Since insertion of an element in the root buffer is always triggered by an insertion operation, we can charge the O(sort_E(N)/B) I/Os and O(sort_E(N)) time cost to the insertion operation.

Deletemin.
To perform a deletemin operation we first check if the mini-queue contains any elements. If it does, we simply perform a deletemin operation on it and return the retrieved element using O(sort_I(M)) time and no I/Os. Otherwise we perform buffer-emptying processes on all nodes on the leftmost path in T, starting at the root and moving towards the leftmost leaf. After this the buffers on the leftmost path are all empty and the smallest elements in the structure are stored in the leftmost leaf. We load the between M/2 and M elements in the leaf into main memory, sort them, and remove the smallest M/2 elements, which we insert in the mini-queue. If the leaf now contains less than M/2 elements we fuse it with its sibling leaf, and if the resulting leaf contains more than M elements we split it. As a result of this the parent node v may lose a child. If needed, T is then rebalanced using node fusions as in a normal B-tree, that is, if v now has (1/4)√(M/B) children it is fused with its sibling (possibly followed by a split). As with splits after insertion of a new leaf, the rebalancing may propagate up along the path to the root (when the root only has one leaf left it is removed). Note that no buffer merging is needed, since the buffers on the leftmost path are all empty.

If buffer-emptying processes are needed during a deletemin operation, we spend O((M/B) log_{M/B}(N/B)) = O((M/B) · sort_E(N)) I/Os and O(M log_{M/B}(N/B)) = O(M · sort_E(N)) time on such processes that are not paid for by buffers running full (containing more than M/2 elements). Similarly, we spend O(M/B) I/Os and O(M · sort_I(M)) time to load and sort the leftmost leaf, and another O(M · sort_I(M)) time is used to insert the M/2 smallest elements in the mini-queue. Finally, we may spend O(M/B) I/Os and O(M) time on each of at most O(log_{M/B}(N/B)) nodes on the leftmost path that need to be fused or split. Altogether the filling up of the mini-queue requires O((M/B) · sort_E(N)) I/Os and O(M · (sort_I(M) + sort_E(N))) time. Since we only fill up the mini-queue when M/2 deletemin operations have been performed since the last fill-up, each deletemin operation is amortized charged O(sort_E(N)/B) I/Os and O(sort_E(N) + sort_I(M)) time.

Theorem 2.
There exists a priority queue supporting an insert operation in O(sort_E(N)/B) I/Os and O(sort_E(N)) time amortized, and a deletemin operation in O(sort_E(N)/B) I/Os and O(sort_I(M) + sort_E(N)) time amortized.

Remarks. Our priority queue can obviously be used in a simple O((N/B) · sort_E(N)) I/O and O(N · (sort_I(M) + sort_E(N))) time sorting algorithm. Note that it is essential that a buffer-emptying process does not require sorting of the elements in the buffer. In normal buffer-trees [2] such a sorting is indeed performed, mainly to be able to support deletions and (batched) rangesearch operations efficiently. Using a more elaborate buffer-emptying process we can also support deletions without the need for sorting of buffer elements.

Assume that (N/B) · sort_E(N) = o(N), and for simplicity also that B divides N. Recall that under the indivisibility assumption we assume the RAM model in internal memory but require that at any time during an algorithm the original N elements are stored somewhere in memory; we allow copying of the original elements. The internal memory contains at most M elements and the external memory is divided into N blocks of B elements each; we only need to consider N blocks, since we are considering algorithms doing less than N I/Os. During an algorithm, we let X denote the set of original elements (including copies) in internal memory and Y_i the set of original elements (including copies) in the i'th block; an I/O transfers up to B elements between a Y_i and X. Note that in terms of CPU time, an I/O can cost anywhere between 1 and B (transfers).

In the external memory permuting problem, we are given N elements in the first N/B blocks and want to rearrange them according to a given permutation; since we can always rearrange the elements within the N/B blocks in O(N/B) I/Os, a permutation is simply given as an assignment of elements to blocks (i.e. we ignore the order of the elements within a block). In other words, we start with a distribution of the N elements in X, Y_1, Y_2, ..., Y_N such that |Y_1| = |Y_2| = ... = |Y_{N/B}| = B and X = Y_{(N/B)+1} = Y_{(N/B)+2} = ... = Y_N = ∅, and should produce another given distribution of the same elements such that |Y_1| = |Y_2| = ... = |Y_{N/B}| = B and X = Y_{(N/B)+1} = Y_{(N/B)+2} = ... = Y_N = ∅.

To show that any permutation algorithm that performs O((N/B) · sort_E(N)) I/Os has to transfer Ω(N · sort_E(N)) elements between internal and external memory, we first note that at any given time during a permutation algorithm we can identify a distribution (or more) of the original N elements (or copies of them) in X, Y_1, Y_2, ..., Y_N. We then first want to bound the number of distributions that can be created using T I/Os, given that b_i, 1 ≤ i ≤ T, is the number of elements transferred in the i'th I/O; any correct permutation algorithm needs to be able to create at least N!/(B!)^{N/B} = Ω((N/B)^N) distributions.

Consider the i'th I/O. There are at most N possible choices for the block Y_j involved in the I/O; the I/O either transfers b_i ≤ B elements from X to Y_j or from Y_j to X. In the first case there are at most (M choose b_i) ways of choosing the b_i elements, and each element is either moved or copied. In the second case there are at most (B choose b_i) ways of choosing the elements to move or copy. Thus the I/O can at most increase the number of distributions that can be created by a factor of

N · ((M choose b_i) + (B choose b_i)) · 2^{b_i} < N · (2eM/b_i)^{b_i}.

The T I/Os can thus at most create ∏_{i=1}^{T} N · (2eM/b_i)^{b_i} distributions. That this number is bounded by (N(2eM/b)^b)^T, where b is the average of the b_i's, can be seen by just considering two values b_1 and b_2 with average b.
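This averaging step can be sanity-checked numerically; the sketch below compares logarithms of the per-I/O factor N(2eM/b)^b (the values of N, M, b_1 and b_2 are arbitrary illustrative choices, not from the paper):

```python
import math

def log_factor(N, M, b):
    # natural log of N * (2*e*M / b)**b, the per-I/O factor derived above
    return math.log(N) + b * math.log(2 * math.e * M / b)

N, M = 10**6, 10**4                       # illustrative values only
for b1, b2 in [(1, 3), (16, 240), (8, 512)]:
    b = (b1 + b2) / 2
    # two I/Os transferring b1 and b2 elements never create more
    # distributions than two I/Os transferring the average b each
    assert log_factor(N, M, b1) + log_factor(N, M, b2) <= 2 * log_factor(N, M, b)
```

The inequality holds for all positive b_1, b_2, since b ↦ b · ln(2eM/b) is concave.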
Indeed, for two values b_1 and b_2 with average b we have

N(2eM/b_1)^{b_1} · N(2eM/b_2)^{b_2} = N^2 (2eM)^{b_1+b_2} / (b_1^{b_1} · b_2^{b_2}) ≤ N^2 (2eM)^{b_1+b_2} / b^{b_1+b_2} = (N(2eM/b)^b)^2.

Next we consider the number of distributions that can be created using T I/Os for all possible values of b_i, 1 ≤ i ≤ T, with a given average b. This can trivially be bounded by multiplying the above bound by B^T (since this is a bound on the total number of possible sequences b_1, b_2, ..., b_T). Thus the number of distributions is bounded by B^T (N(2eM/b)^b)^T = ((BN)(2eM/b)^b)^T. Since any permutation algorithm needs to be able to create Ω((N/B)^N) distributions, we get the following lower bound on the number of I/Os T(b) needed by an algorithm that transfers b ≤ B elements on the average per I/O:

T(b) = Ω( N log(N/B) / (log N + b log(M/b)) ).

Now T(B) = Ω(min{N, (N/B) · sort_E(N)}) corresponds to the lower bound proved by Aggarwal and Vitter [1]. Thus when (N/B) · sort_E(N) = o(N) we get T(B) = Ω((N/B) · sort_E(N)) = Ω(N log(N/B) / (B log(M/B))). Since 1 ≤ b ≤ B ≤ M/2, we have T(b) = ω(T(B)) for b = o(B). Thus any algorithm performing the optimal O((N/B) · sort_E(N)) I/Os must transfer Ω(N · sort_E(N)) elements between internal and external memory.

Reconsider the above analysis under the tall cache assumption B ≤ M^{1−ε} for some constant ε > 0. In this case, the number of distributions any permutation algorithm needs to be able to create is Ω((N/B)^N) = Ω(N^{εN}). Above we proved that with T I/Os transferring an average number of b keys an algorithm can create at most ((BN)(2eM/b)^b)^T < N^{2T} M^{bT} distributions. Thus we have M^{bT} ≥ N^{εN − 2T}. For T < εN/4, we get M^{bT} ≥ N^{εN/2} and thus that the number of transferred elements bT is Ω(N log_M N). Since the tall cache assumption implies that log(N/B) = Θ(log N) and log(M/B) = Θ(log M), we have that N log_M N = Θ(N log_{M/B}(N/B)) = Θ(N · sort_E(N)). Thus any algorithm using less than εN/4 I/Os must transfer Ω(N · sort_E(N)) elements between internal and external memory.

Theorem 3.
When B ≤ M/2 and (N/B) · sort_E(N) = o(N), any I/O-optimal permuting algorithm must transfer Ω(N · sort_E(N)) elements between internal and external memory under the indivisibility assumption. When B ≤ M^{1−ε} for some constant ε > 0, any permuting algorithm using less than εN/4 I/Os must transfer Ω(N · sort_E(N)) elements between internal and external memory under the indivisibility assumption.

Remark. The above means that in practice, where (N/B) · sort_E(N) = o(N), our O((N/B) · sort_E(N)) I/O and O(N · (sort_I(N) + sort_E(N))) time split-sort and priority queue sort algorithms are not only I/O-optimal but also CPU optimal, in the sense that their running time is the sum of an unavoidable term and the time used by the best known RAM sorting algorithm.

References
1. A. Aggarwal and J. S. Vitter. The Input/Output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, 1988.
2. L. Arge. The buffer tree: A technique for designing batched external data structures. Algorithmica, 37(1):1–24, 2003.
3. D. Comer. The ubiquitous B-tree. ACM Computing Surveys, 11(2):121–137, 1979.
4. R. Fadel, K. V. Jakobsen, J. Katajainen, and J. Teuhola. Heaps and heapsort on secondary storage. Theoretical Computer Science, 220(2):345–362, 1999.
5. M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proc. IEEE Symposium on Foundations of Computer Science, pages 285–298, 1999.
6. Y. Han. Deterministic sorting in O(n log log n) time and linear space. In Proc. ACM Symposium on Theory of Computing, pages 602–608, 2002.
7. Y. Han and M. Thorup. Integer sorting in O(n √(log log n)) expected time and linear space. In Proc. IEEE Symposium on Foundations of Computer Science, pages 135–144, 2002.
8. V. Kumar and E. Schwabe. Improved algorithms and data structures for solving graph problems in external memory. In Proc. IEEE Symposium on Parallel and Distributed Processing, pages 169–177, 1996.
9. M. Thorup. Equivalence between priority queues and sorting. Journal of the ACM, 54(6):Article 28, 2007.