Compressed Key Sort and Fast Index Reconstruction
Yongsik Kwon a,b, Cheol Ryu a, Sang Kyun Cha a, Arthur H. Lee c, Kunsoo Park a, Bongki Moon a

a Seoul National University, Seoul, Korea
b SAP Labs Korea, Seoul, Korea
c State University of New York Korea, Incheon, Korea
Abstract
In this paper we propose an index key compression scheme based on the notion of distinction bits by proving that the distinction bits of index keys are sufficient information to determine the sorted order of the index keys correctly. While the actual compression ratio may vary depending on the characteristics of datasets (an average of 2.76 to one compression ratio was observed in our experiments), the index key compression scheme leads to significant performance improvements during the reconstruction of large-scale indexes. Our index key compression can be effectively used in database replication and index recovery of modern main-memory database systems.
Keywords:
Compressed key sort, Distinction bit, Index reconstruction, Parallel sorting
1. Introduction
Main-memory database systems have been widely used for many applications such as OLTP and OLAP, which are required to keep the latency low and the transaction throughput high. In such a main-memory database system, indexes are often deployed without on-disk representations [1, 2, 3, 4]. By letting the indexes reside solely in memory, the system can sustain the best attainable performance of the indexes, which are critical to query and data processing performance, even in the presence of many updates. Insertions, deletions and updates made to a database table will be reflected in all the indexes associated with the table as well as in the table itself. However, none of those update operations will incur any disk accesses for keeping the indexes up to date, because all the corresponding changes will only be applied to the table and the associated indexes residing in the main memory. The changes applied to the indexes will not even be written to the log or checkpointed to disk, as the most recent copy of each index can always be restored from its base table [5].

Since none of the index updates are propagated to disk, however, all the indexes have to be reconstructed from scratch when the database system restarts from a failure, an anti-cached table is loaded back from disk to memory, or an entire database is replicated from the master node to a slave node. For a table that has many indexes associated with it, the cost of loading the table may be significantly increased by the additional cost of constructing the indexes from the rows of the table. Therefore, it is practically an important challenge to limit the cost of index reconstruction so that the database loading time and the restart time can be kept to a minimum.

Table 1 shows the times taken to build a B-tree index for each of six memory-resident tables. The times are broken down into two separate stages, namely, sort and build.
Both the sort and build phases of index construction were performed by a single-core implementation. Just for the sake of comparison, the second column of the table shows the times taken to load the indexed columns from disk to memory. In all six cases reported in Table 1, the cost of internal sort was approximately 90 percent of the total cost. Evidently, the internal sort was the dominant factor
of the B-tree index construction, and hence fast construction of memory-resident B-tree indexes cannot be achieved without reducing the cost of internal sort drastically.

Preprint submitted to Journal of LaTeX Templates, September 25, 2020

table      load time   sort    build   total
INDBTAB    1.24        4.25    0.34    4.59
Human      5.34        19.94   1.17    21.12
Wikititle  1.36        5.09    0.59    5.68
ExURL      0.96        16.82   0.81    17.63
WikiURL    1.36        18.56   0.80    19.36
Part       0.25        0.65    0.05    0.70

Table 1: Index construction time for sample tables (in seconds).

In this paper, we propose a new sort approach relying on the distinction bits among the keys. We call this method a compressed key sort, as utilizing only the distinction bits of the keys can be considered a kind of compression of the keys. The experimental evaluation demonstrates that the overall cost of B-tree construction can be reduced by 21–54 percent for real-world datasets. Our experiments also show that the compressed key sort is readily parallelizable for multi-core processors, yields near-linear speedup, and can actually build a B-tree index faster than loading its image from disk or even an enterprise-class SSD.

The key question we pose in this paper is “What is the minimum amount of information required (in terms of the number of bits in the index keys) to determine the sorted order among the index keys correctly?” Whoever can determine it will extract the minimum number of bits from index keys and sort them still correctly but more efficiently. Once the keys are sorted, the B-tree index will be built by following the standard bulk-load procedure. This process is illustrated in Figure 1, where the top flow shows the conventional steps for index construction while the bottom one shows the proposed compressed key sort applied to index construction.

We now formally define the distinction bits of two index keys to be the most significant bits that are different between the keys.
We will prove that the distinction bits of index keys are sufficient information to determine the sorted order of the index keys correctly. Consequently, we can extract only the distinction bits of index keys into compressed form and sort the compressed keys in order to construct a B-tree index quickly.

Index keys for database tables can be as short as a 4-byte word, but they can also be longer than a few dozen bytes in business applications. Hence index trees and all related algorithms (sorting index keys, building the index, searching with a query key, etc.) should be able to handle long keys as well as short keys. Our compressed key sort approach assumes this wide range of index key sizes. To speed up the index construction process further, we exploit parallelism in building an index tree.

This paper is organized as follows. In Section 2 we discuss related work. In Section 3 we introduce our compressed key sort. In Section 4 we describe the index key formats we use for various data types and present the metadata information to keep for efficient index rebuilding. In Section 5 we explain the procedure of rebuilding the index. Section 6 shows the results of our experiments, and we conclude in Section 7.
Figure 1: Compressed key sort.
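As an illustrative sketch of the bottom flow in Figure 1 (the example and helper names are ours, not the paper's code), the following extracts the distinction bits of a small set of binary-string keys and sorts by the compressed keys; the resulting order matches the order of the full keys:

```python
# Toy sketch of compressed key sort: keep only the distinction bits of each
# key and sort by them (keys as binary strings, position 0 = most significant).

def d_bit(a, b):
    """Most significant bit position where two distinct keys differ."""
    return next(i for i, (x, y) in enumerate(zip(a, b)) if x != y)

def distinction_positions(keys):
    """Distinction bit positions of adjacent keys in sorted order."""
    s = sorted(keys)
    return sorted({d_bit(s[i - 1], s[i]) for i in range(1, len(s))})

def compress(key, positions):
    """Extract the bits of a key in the distinction bit positions."""
    return "".join(key[i] for i in positions)

keys = ["10110010", "10010010", "10111010", "10010110"]
positions = distinction_positions(keys)
by_compressed = sorted(keys, key=lambda k: compress(k, positions))
print(positions)                      # only 3 of the 8 bit positions
print(by_compressed == sorted(keys))  # same order from far fewer bits
```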
2. Related Work
There has been extensive research on efficient index structures for database tables, where efficiency measures are index size, search time, concurrency control, etc. Especially the following work focused on reducing index sizes and/or search time: Bayer and Unterauer's Prefix B-tree [6], Lehman and Carey's T-tree [7], Ferguson's Bit-tree [8], Bohannon et al.'s partial-key T-tree and partial-key B-tree [9], Rao and Ross's CSS-tree [10] and CSB+ tree [11], Chen et al.'s pB+ tree [12] and fpB+ tree [13], Schlegel et al.'s k-ary search tree [14], Boehm et al.'s Generalized Prefix Tree [15], Kissinger et al.'s KISS-tree [16], and more recently Kim et al.'s Fast Architecture-Sensitive Tree (FAST) [17], Yamamuro et al.'s VAST-tree [18], Levandoski et al.'s Bw-tree [19], Leis et al.'s Adaptive Radix Tree (ART) [20], Zhang et al.'s SuRF [21], and Binna et al.'s HOT [22]. See Graefe and Larson [23] and Graefe [24] for surveys.

Our work reduces the sizes of sort keys by compressing them, from which the speedups in sorting and index rebuilding are obtained. Hence our work is orthogonal to the previous work on efficient index structures, and it can be applied to many index structures. Compressing keys by distinction bits can also be applied to big data file formats. Popular self-described file formats such as ORC [25] and Parquet [26] adopt columnar storage structures to cope with read-heavy analytic workloads against large-scale distributed datasets. Compression by distinction bits can accelerate such common analytic tasks as sorting data and generating unique keys.

Kim et al.'s FAST [17] proposed a key compression technique which extracts the bits of index keys in the bit positions where the index keys are not all the same (which are called variant bits).
We go one step further and use distinction bits to determine the sorted order of index keys correctly.

There has been research on order-preserving compression [27, 28, 29, 30, 31], which maps index keys into encoded values such that the order of index keys is the same as the order of encoded values. In order-preserving compression, index keys are replaced by encoded values. In our compression scheme, however, there is no encoding. We simply extract part of the index keys (i.e., distinction bits) to speed up sorting and index building. The index tree built by our compression scheme will be a conventional B-tree index without any encoding of index keys.

For sorting on multi-core CPUs, there have recently been many results exploiting SIMD parallelism [32, 33, 34]. In our target applications, which require a wide range of index key sizes, however, the size of index keys is too big to exploit SIMD parallelism. Thus we implemented our own parallel sorting algorithm called the row-column sort, which relies only on the operation of comparing two elements during sorting. The comparison operator is called a comparator. Therefore, the row-column sort works with any key sizes. (In contrast, sorting on SIMD needs quite different algorithmic techniques such as merging networks [32, 33, 34].) In our experiments we compared the row-column sort with GCC STL parallel sort [35], which is an available parallel sort code on multi-core CPUs in which a custom comparator can be used. Experiments show that the row-column sort achieves a better speedup than GCC STL sort, and it is 31.4% faster than GCC STL sort when the number of cores is 16.

Figure 2: Distinction bits and invariant bits.
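To make the contrast with variant bits concrete, here is a small sketch of our own (binary-string keys, helper names assumed): every distinction bit position is a variant bit position, but not conversely, so distinction bits can need strictly fewer positions than variant-bit extraction:

```python
# Sketch: variant bit positions vs. distinction bit positions for toy keys.

def variant_positions(keys):
    """Positions where the keys do not all agree."""
    return [i for i in range(len(keys[0])) if len({k[i] for k in keys}) > 1]

def distinction_positions(keys):
    """Positions where some pair of adjacent keys in sorted order first differ."""
    s = sorted(keys)
    diffs = set()
    for a, b in zip(s, s[1:]):
        diffs.add(next(i for i, (x, y) in enumerate(zip(a, b)) if x != y))
    return sorted(diffs)

keys = ["011", "101", "110"]
print(variant_positions(keys))      # all three positions are variant
print(distinction_positions(keys))  # position 2 is variant but not needed
```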
3. Compressed Key Sort
In this section we first prove that the distinction bits of index keys are sufficient information todetermine the sorted order of the index keys correctly, and then present our compressed key sortbased on distinction bits, which is the central idea in rebuilding main-memory indexes efficiently.
We first introduce some terms to describe the compressed key sort. We consider key values as binary strings throughout this paper. The bit positions where all key values of the given dataset are identical are called invariant bit positions (the bits in these positions are called invariant bits). The other bit positions are called variant bit positions (the bits themselves are called variant bits). Each row in Figure 2 (a) represents a key value. In the figure, bit positions 0, 3, 4, 8, 9 and 10 are invariant bit positions, and bit positions 1, 2, 5, 6, 7 and 11 are variant bit positions. Note that bit positions start with 0, and the bit of a key in bit position i will be called the (i + 1)-st bit of the key (i.e., the bit in bit position 0 is the first bit, the bit in bit position 1 is the second bit, etc.). The first bit (i.e., in bit position 0) is the most significant bit in the keys.

The distinction bit position of two keys is defined as the most significant bit position where the two keys differ (the bits themselves are called distinction bits). The name distinction bit is from [8], but the main focus of this paper is sorting based on distinction bits.

Suppose that there are n + 1 keys key_0, key_1, . . . , key_n and they are in lexicographic order (sorted order), i.e., key_0 < key_1 < · · · < key_n. The distinction bit position of two keys key_i and key_j is denoted by D-bit(key_i, key_j). Let D_i = D-bit(key_{i−1}, key_i) for 1 ≤ i ≤ n, i.e., the distinction bit position of two adjacent keys in sorted order. We prove that the set of distinction bit positions of all possible key pairs is the same as the set {D_1, D_2, . . . , D_n}. First, we need a lemma.

Lemma 1.
D-bit(key_i, key_j) = min_{i < l ≤ j} D_l for any i < j.

Theorem 1. The set D_all of distinction bit positions of all possible key pairs is the same as the set D_adj = {D_1, D_2, . . . , D_n}, i.e., the set of distinction bit positions of adjacent keys in sorted order.

Proof. Since adjacent key pairs are part of all key pairs, we have D_adj ⊆ D_all. To prove D_all ⊆ D_adj, we show that the distinction bit position of any pair (say, key_i and key_j) belongs to D_adj. Without loss of generality, assume that i < j. By Lemma 1, D-bit(key_i, key_j) = min_{i < l ≤ j} D_l, which is an element of D_adj.

We call any superset of D_adj (e.g., one that retains positions invalidated by later deletions) a set of extended distinction bit positions. Let Compress(key_i) denote the bits of key_i in the (extended) distinction bit positions, concatenated from the most significant position down; the collection of the Compress(key_i)'s is called the distinction bit slice.

Theorem 2. The distinction bit slice is sufficient information to determine the sorted order of the keys.

Proof. We first prove the theorem for the distinction bit positions. We prove that the following relation holds for all i and j: key_i < key_j if Compress(key_i) < Compress(key_j). Let D = D-bit(key_i, key_j). Since the first D bits of key_i and key_j are the same, the order of key_i and key_j is determined by the bits in bit position D. By Lemma 1, the bits in bit position D are in Compress (and thus in the distinction bit slice). Hence, the order between key_i and key_j is determined by the order between Compress(key_i) and Compress(key_j). Due to the relation above, we can correctly determine the sorted order of the keys by the distinction bit slice. Similarly, we can prove the theorem for extended distinction bit positions.

When we maintain an index for a database table, index keys may be inserted, deleted, or updated by database operations. Then the distinction bit positions may change at runtime. For example, if one of the keys in Figure 2 is deleted, position 7 may no longer be a distinction bit position (but it is still a variant bit position); if a second key is also deleted, the distinction bit positions do not change, but position 7 becomes an invariant bit position. If an index key is inserted, a new distinction bit position may be added. It is quite expensive to maintain the distinction bit positions accurately when delete operations are allowed.
Theorem 2 allows us to lazily update distinction bit positions without affecting the correctness of sorting by letting some invalidated bit positions stay.

The scheme of extracting the distinction bits of key_i into Compress(key_i) for all i and sorting the Compress(key_i)'s rather than sorting full key values is called compressed key sort. In order to extract compressed keys from index keys, we need to keep only the (extended) distinction bit positions as a bitmap. Compressed key sort is the main reason for the speedup of index reconstruction.

Remark 1. To build an index, the sorting of index keys is necessary, which requires O(n log n) time. To compute distinction bit positions additionally, our compressed key sort needs O(n) time to compare adjacent keys in sorted order. However, our key compression is not optimal in terms of the number of bit positions if unlimited time is allowed. For the given keys in Figure 3, our key compression selects bit positions 0, 1, and 3 as distinction bit positions, but the bit slice in bit positions 2 and 3 can correctly determine the sorted order of the keys, and this is the minimum number of bit positions. An optimal algorithm can find the minimum number of bit positions by choosing every subset of the bit positions and checking whether the bit slice in the subset of bit positions can correctly determine the sorted order of the keys. We conjecture that our key compression is best (in terms of the number of bit positions) if the sorting complexity (i.e., O(n log n) time) is allowed.

Figure 4: Binary representations and index key formats. (a) int: 2 bytes. (b) decimal(2,0): 2 bytes. (c) float: 2 bytes (sign 1 bit, exponent 5 bits, significand 10 bits).

4. Data Structures

The B+ tree and its variants are widely used as indexes in modern DBMSs to enable fast access to data with a search key. If an index is defined on columns A_1, . . . , A_k of a table, its key can be represented as a tuple of column values, of the form (a_1, . . . , a_k) [37]. The ordering of the tuples is the lexicographic ordering. When k = 2, for example, the order of two tuples (a_1, a_2) and (b_1, b_2) is determined as follows: (a_1, a_2) < (b_1, b_2) if a_1 < b_1 or (a_1 = b_1 and a_2 < b_2).

In this section we describe how to make actual index keys from the tuples of column values so as to keep the lexicographic ordering of the tuples. We first explain how to make index keys from different data types and then explain how to make an index key from multiple columns.

For each data type (int, float, decimal, string, etc.), its index key format can be defined so that a lexicographic binary comparison in the index key format is equivalent to a comparison of original data values. For the mappings of data types int and float to index key formats, we refer readers to [20]. Here we describe the mappings of decimal and string to index key formats.

A. decimal: A decimal number x is represented by a 1-byte header and a decimal part.
The last bit of the header is the sign of the decimal number (1 for negative), and the second-to-last bit indicates whether the entry is null or not (0 for null). The decimal part contains a binary number corresponding to x in ⌈log_2(x + 1)/8⌉ bytes. The location of the decimal point is stored in the metadata of the column. For mapping, if the sign bit is 1 (i.e., x is negative), toggle the sign bit and all bits of the decimal part; otherwise, toggle the sign bit only. Then the order of the mapped values corresponds to that of the decimal numbers. See Figure 4, where decimal(m, n) means m total digits, of which n digits are to the right of the decimal point.

B. fixed-size string: We use a fixed-size string as it is.

C. variable-size string with maximum length: A variable-size string with maximum length n is denoted by varchar(n). We assume that the null character (denoted by ∅) is not allowed in the variable-size string. (In the case that null characters are allowed, we need to use some encoding of characters so that the encoded string doesn't have null characters.) We attach one null character at the end of the variable-size string to make the index key value. Then the lexicographic order of index key values corresponds to that of variable-size strings as follows. If two index keys have the same lengths, the order between them is trivially the order of the strings. If two index keys have different lengths (let k be the length of the shorter key) and their first k − 1 bytes are the same, then the shorter key has the null character in the k-th byte and the longer one has a non-null character in the k-th byte. For instance, if two keys are AB∅ and ABA∅, then AB∅ is smaller than ABA∅ due to the 3rd bytes, and this is the lexicographic order between the two strings AB and ABA.

PART (int)  NAME (varchar(30))  XXX (int)  YYY (int)  ZZZ (varchar(15))
14          AB                  27         8          DDDC
14          AB                  27         8          DDDE
14          ABA                 27         10         DE

Figure 5: Index keys from multiple columns. (a) Database table. (b) Index keys from the five columns.
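The two mappings above can be sketched as follows. This is a simplified layout of our own (not SAP HANA's exact format): a sign-toggled magnitude encoding for numeric values, and a NUL-terminated encoding for varchar, both compared as raw bytes.

```python
# Sketch of order-preserving index key formats (simplified, assumed layout).

def numeric_key(x, width=4):
    """Signed integer -> bytes whose byte-wise order equals numeric order:
    toggle the sign bit for every value, and all magnitude bits for negatives."""
    magnitude = abs(x).to_bytes(width, "big")
    if x < 0:
        return b"\x00" + bytes(b ^ 0xFF for b in magnitude)
    return b"\x01" + magnitude

def varchar_key(s):
    """Append one NUL terminator so a proper prefix sorts before any extension."""
    return s.encode("ascii") + b"\x00"

values = [99, 1, 0, -1, -99]
print(sorted(values, key=numeric_key) == sorted(values))  # byte order = numeric order
print(varchar_key("AB") < varchar_key("ABA"))             # NUL (0x00) < 'A' (0x41)
```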
Furthermore, the distinction bit position takes place in the null character of the shorter key.

In each data type, the order between two index keys can be determined by a lexicographic binary comparison of them.

We now explain how to make an index key from multiple columns. An index key on multiple columns is defined as the concatenation of the index keys from the multiple columns. Suppose that an index key is defined on the following five columns: PART (int), NAME (varchar(30)), XXX (int), YYY (int), and ZZZ (varchar(15)). Example column values in some rows are shown in Figure 5 (a), and the index keys of the three rows are in Figure 5 (b).

The distinction bit positions in Section 3.1 are defined on these full index keys. If the data types of the index columns have fixed lengths (as in int, float, decimal, and fixed-size string), the column values are aligned in the index keys, and the order between index keys is determined by the lexicographic order of the column values.

If the data types of the index columns have variable lengths (as in variable-size string), however, the column values may not be aligned in the index keys, as shown in Figure 5 (b). Still we define distinction bit positions on these full index keys. If two rows have variable-size strings of different lengths in a column (e.g., column NAME in Figure 5), the distinction bit position takes place in that column as described above if the previous columns have the same values, as in Figure 5, and the order between the two index keys is determined by the lexicographic order of the variable-size strings in that column.

Figure 6: Index tree structure. (a) index tree, (b) non-leaf node, (c) non-leaf node entry, (d) leaf node, (e) leaf node entry.

To compare two index keys, we make a binary comparison (by word sizes) of the two keys. If
one index key is shorter, it is padded with 0's in the binary comparison. (The padded value does not affect the order of the two keys.) In this way we define distinction bits and distinction bit positions on full index keys derived from multiple columns.

Although our compressed key sort can work with any variant of the B+ tree index structure, available codes for indexes have a small and fixed length for index keys (4 bytes for FAST [38] and the CSB+ tree [39], and 4 or 8 bytes for the k-ary search tree [14]) or some restrictions in building indexes (e.g., no parallel index building for ART [20]). Therefore, we use a full-fledged index tree which is being used in the SAP HANA database system, and apply our compressed key sort to it. Figure 6 shows the structure of the index tree, which is a variant of the partial-key B+ tree [9]. To define partial keys on key values key_0, key_1, . . . , key_n in sorted order, a parameter pk is introduced. The partial key of key_i is the pk bits following the distinction bit position D_i [9]. In Figure 2, the partial key of a key with D_i = 5 when pk = 4 is 1010. The distinction bit position D_i is also called the offset of the partial key of key_i [9].

We describe the structure of the index tree in Figure 6 as far as it is relevant to this paper. A leaf node of the index tree contains a list of entries, one for each index key, plus a pointer to the next node. An entry in a leaf node consists of a partial key, a distinction bit position, an index key length, and a record ID. The header of a leaf node contains a pointer to the last (i.e., highest) index key of the entries in the leaf node. A non-leaf node contains a list of entries, one for each child. The index key corresponding to an entry is the highest index key in the descendant leaves of the child corresponding to the entry.
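Partial-key extraction can be sketched as follows (bit-string keys, position 0 = MSB; the exact bit range is our reading of "the pk bits following D_i", so treat it as an assumption):

```python
# Sketch: the partial key of a key is the pk bits following its distinction
# bit position D_i (positions numbered from the most significant bit).

def partial_key(key_bits, d_pos, pk):
    return key_bits[d_pos + 1 : d_pos + 1 + pk]

key = "010110100110"
print(partial_key(key, 5, 4))  # the 4 bits in positions 6..9
```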
An entry in a non-leaf node consists of a partial key, a distinction bit position, an index key length, a pointer to the child node, and a pointer to the highest index key (where the partial key and the index key length are those of the highest index key, and the distinction bit position is that of the highest index key against the highest index key of the previous entry).

In addition, we keep the following information persistently for each index tree, which will be called the DS-metadata (DS stands for D-bit Slice).

1. D-bitmap: Our compression scheme requires distinction bit positions, which can be represented by a bitmap. The position of each bit in the bitmap means the position in the full index key. While the value 0 means that the bit position is not a distinction bit position, the value 1 means that it is possibly a distinction bit position.

2. Variant bitmap: Similarly, we store variant bit positions in a bitmap, where value 0 in a bit position means that the bit position is not a variant bit position and value 1 means that it is possibly a variant bit position.

3. Reference key value: We need a reference key value for invariant bits, which can be an arbitrary index key value because the invariant bits are the same for all index keys.

Note that we use extended distinction bit positions to define the D-bitmap. We maintain the variant bitmap and a reference key value in order to obtain partial keys when we rebuild our index tree. If partial keys are not needed in an index, the variant bitmap and the reference key value are not necessary, and we need only maintain the D-bitmap, which is the main information to keep for efficient index rebuilding.

We describe how to perform search/insert/delete operations with the index tree and the DS-metadata.

• Search: Given a search key value K, we search down the index tree for K as follows. In a non-leaf node, we need to compare K with an index key in a non-leaf node entry.
Since the entry has a pointer to the highest index key (say, A), we make a binary comparison of the two full key values K and A. A leaf node contains a list of partial keys, and thus we need to compare the search key K with a list of partial keys. The procedure to compare K with a list of partial keys is the same as the one described in Bohannon et al. [9].

• Insert: Given an insert key value K, insert K into the index tree as follows.

1. Search down the index tree with K and find the right place for insertion (say, between two keys A and B).

2. Compute the distinction bit positions D-bit(A, K) and D-bit(K, B).

3. Make changes in the index tree corresponding to the insertion, and update the D-bitmap and the variant bitmap as follows. For the D-bitmap, in principle we need to remove the bit position D-bit(A, B) and add new distinction bit positions D-bit(A, K) and D-bit(K, B) because key K has been inserted between A and B. By Lemma 1, however, D-bit(A, B) = min(D-bit(A, K), D-bit(K, B)). Since the minimum position is D-bit(A, B) and it is already set in the D-bitmap, we need only set max(D-bit(A, K), D-bit(K, B)) in the D-bitmap if it is not already set. For the variant bitmap, we perform a bitwise XOR on K and the reference key value, and then perform a bitwise OR on the variant bitmap and the result of the above bitwise XOR. The result of the bitwise OR will be the new variant bitmap. (Notice that the number of actual write operations on the D-bitmap is bounded by the number of 1's in the D-bitmap. Thus the chances that an actual write operation on the D-bitmap occurs during an insertion are low.)

• Delete: Given a delete key value K, delete K from the index tree as follows. We delete K as a usual deletion is done in the index tree, and simply leave the D-bitmap and the variant bitmap without changes. We need to show that the D-bitmap is valid after deleting K. Let A and B be the previous key value and the next key value of K, respectively.
After deleting K, D-bit(A, B) should be set in the D-bitmap. Again by Lemma 1, D-bit(A, B) = min(D-bit(A, K), D-bit(K, B)). Since D-bit(A, K) and D-bit(K, B) are set in the D-bitmap, D-bit(A, B) is already set, whether it is D-bit(A, K) or D-bit(K, B).

An update operation is done by a delete operation followed by an insert operation. Note that if there are only insert operations (i.e., no delete operations), the D-bitmap represents the distinction bit positions exactly.

As the data in a database table change, the DS-metadata is updated incrementally as above. When an insertion occurs, at most one distinction bit position is added to the D-bitmap, and some variant bit positions may be added to the variant bitmap. This operation is never reverted even if there is a delete or rollback, because implementing the revert exactly is quite expensive. Hence there may be positions in the D-bitmap whose values are 1, but which are not distinction bit positions. Also the variant bitmap may have positions whose values are 1, but which are not variant bit positions. However, they do not affect the correctness, as shown in Theorem 2. Such bit positions can be removed by scanning the index and computing the DS-metadata occasionally. If we rebuild the index anew, then certainly there will be no such bit positions.

With the current DS-metadata, we can rebuild the index tree (which will be described in the next section) when it is lost or unavailable. Even after the index tree is rebuilt, we may use the current DS-metadata as the DS-metadata. However, index rebuilding is an opportune time to compute the DS-metadata anew. We compute the new DS-metadata from the current DS-metadata as follows.

1. D-bitmap: Extract compressed keys from index keys by the current D-bitmap, sort the compressed keys, and compute the distinction bit positions between adjacent compressed keys (all three steps are part of index reconstruction), which make the new D-bitmap.
Note that the bit positions where the current D-bitmap had 0 remain 0 in the new D-bitmap.

2. Reference key value: Take an arbitrary index key as the reference key value.

3. Variant bitmap: Initially the variant bitmap is all 0, and we take the index keys one by one (say, K) and do the following. Perform a bitwise XOR on K and the reference key value, followed by a bitwise OR on the variant bitmap and the result of the bitwise XOR (as in the insert operation above).

If we build an index tree for the first time (i.e., there is no DS-metadata at all), then we compute the D-bitmap as above, but with full index keys rather than compressed keys.

Remark 2. Our key compression per se requires O(n) time to compute the DS-metadata initially (other than sorting) and O(1) time to update the DS-metadata for an insertion (other than the O(log n) search time to find the right place to insert). Note that sorting is needed anyway to build an index tree and a search is needed anyway to find the place to insert. For the optimal algorithm described in Remark 1, finding the minimum number of bit positions after an insertion is very expensive. Any practical compression scheme should have low complexities in computing compression information such as the DS-metadata and updating the information.

5. Index Reconstruction

We now describe how to build the index tree in parallel from a database table loaded in memory by using the DS-metadata on the fly.

Figure 7: Index-reconstruction procedure. 1. evenly distribute records to cores; 2. scan, key-record ID extraction, sorting; 3. find perfect partition and merge in parallel; 4. build subtrees in parallel; 5. merge trees into one.

We extract only the bits from the index key values
Figure 7 shows the overall procedure of parallel indexreconstruction.1. To collect index keys in parallel, data pages of the target table are evenly distributed to thecores.2. Each core scans the assigned data pages and extracts compressed keys and correspondingrecord IDs. A pair of a compressed key and the corresponding record ID makes a sort key.The record ID is included in the sort key so that each pair of a compressed key and a recordID can be directly used to fill its corresponding leaf node entry without causing many cachemisses.3. Sort the pairs of compressed key and record ID by a parallel sorting algorithm.4. Build the index tree in a bottom-up fashion. Sort key compression can be done by extracting the bits in the positions which have value 1in the D-bitmap. We now describe how to get compressed keys from index keys. (Though theexamples in Figure 8 are shown in the big endian format for readability, the actual implementationwas done in the little endian format due to Intel processors.)1. Separate one-word long (8 bytes) masks from the D-bitmap. The first mask starts from thebyte which contains the first 1 in the bitmap, and it is 8 bytes long. The second mask startsfrom the byte which contains the first 1 after the first mask, and it is 8 bytes long, and so on.In the example of Figure 8, we get three masks from the D-bitmap. See Figure 8 (c).2. By BMI instruction PEXT [40] (which copies selected bits from the source to contiguouslow-order bits of the destination), extract bits from an index key which are located in thepositions where the masks have value 1. See Figure 8 (d).3. Concatenate the extracted bits with shift and bitwise OR operations. Since there are threemasks in Figure 8, the extracted bits are concatenated in three steps (f).(i), (f).(ii), and(f).(iii) by a shift and a bitwise OR in each step. 
See Figure 8 (e) and Figure 8 (f). The bit string in Figure 8 (f).(iii) is the compressed key extracted from the full key in Figure 8 (a).

[Figure 8: Extracting compressed keys from index keys.]

In our target applications, which require a wide range of index key sizes, the size of sort keys is usually too big to exploit SIMD parallelism. Thus we implemented our own parallel sorting algorithm called the row-column sort, which is a comparison sort [41] (i.e., it relies only on the operation of comparing two elements during sorting; the comparison operator is called a comparator). Hence the row-column sort works for any key size. The details of the row-column sort are described in the Appendix.

Once the pairs of compressed index key and record ID are sorted in lexicographic order, the index tree can be built in a bottom-up fashion. First, we build leaf nodes from the sorted compressed keys and record IDs. To compute distinction bit positions, we make an array D-offset[i] from the D-bitmap, which stores the position of the (i+1)-st 1 in the D-bitmap. Then the distinction bit position of key_i and key_{i+1} is D-offset[D-bit(Compress(key_i), Compress(key_{i+1}))]. Next, we build non-leaf nodes in a bottom-up fashion. For two adjacent entries in a non-leaf node whose highest keys are key_i and key_j, the distinction bit position is D-offset[D-bit(Compress(key_i), Compress(key_j))].

In the case of our index tree, the leaf nodes and non-leaf nodes contain partial keys of a predefined length pk. Given the offset (i.e., distinction bit position) of a partial key and the predefined partial key length pk, the bits of the partial key are determined as follows.
A. If a bit position of the partial key is included in the compressed key, the bit value can be directly copied from the compressed key.
B.
If a bit position is a position which has value 0 in the variant bitmap (i.e., an invariant bit position), the bit value can be copied from the reference key value.
C. Otherwise (i.e., a bit position which has value 0 in the D-bitmap and value 1 in the variant bitmap), we have two options.
a) Add the bits required for partial key construction (pk bits following the distinction bit position) to the compressed key and use them here for index construction.
b) Since the record ID is also contained in the sort key, necessary bits can be copied from the record, for which a dereferencing is required.

Table 2: Statistics of six datasets, where k = thousand, M = million, B = byte, and b = bit.

                      INDBTAB  Human   Wikititle  ExURL  WikiURL  Part
database table size   884MB    5310MB  623MB      649MB  930MB    116MB
index size            390MB    860MB   333MB      184MB  305MB    46MB

[…] / 16 = 14. Since each entry in a non-leaf node takes 24B, the max fanout in a non-leaf node is 9. The fill factor is defined for each index during index building, and leaf and non-leaf nodes are filled up to max fanout × fill factor [41]. Given the number of records, the max fanouts, and the fill factor (default value is 0.9), the height of the index tree can be determined.

Index construction can be parallelized by partitioning the sorted pairs of index key and record ID and constructing subtrees in parallel. That is, n sort keys are divided into p blocks of n/p sort keys each, and one block is assigned to a thread (which is the situation at the end of the row-column sort). Thread i (1 ≤ i ≤ p) constructs a subtree consisting of all sort keys in the i-th block. When all the subtrees are constructed, they are merged into one tree such that the height of the resulting tree is minimized. Since the fanouts of the root nodes of the subtrees can be much less than the max fanout, just linking the root nodes of the subtrees may increase the height of the whole tree unnecessarily.
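The D-offset computation used for distinction bit positions in this section (mapping a distinction bit found in compressed keys back to its position in the full key) can be sketched as follows; positions are counted from the most significant bit of a `width`-bit bitmap, and the helper names are ours.

```python
def build_d_offset(d_bitmap, width):
    """D-offset[i] = position (from the MSB) of the (i+1)-st 1 in the D-bitmap."""
    return [p for p in range(width) if d_bitmap & (1 << (width - 1 - p))]

def d_bit(x, y, width):
    """Most significant bit position (from the MSB) where x and y differ."""
    return width - (x ^ y).bit_length()

def distinction_pos(c1, c2, d_offset):
    """Distinction bit position of two compressed keys, mapped to the full key.
    The compressed key width equals the number of 1s in the D-bitmap."""
    return d_offset[d_bit(c1, c2, len(d_offset))]
```

For a 4-bit D-bitmap 1010, D-offset is [0, 2]; two compressed keys differing in their second bit therefore have their distinction bit at position 2 of the full key.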
Hence we remove the root nodes of the subtrees, and build the top layers of the whole tree by linking the children of the root nodes of the subtrees. In this way the height of the whole tree can be minimized.

6. Performance Evaluation

We conduct experiments to measure the performance improvements due to our compressed key sort. In the experiments we compare the compressed key sort against the full key sort with respect to the time for sorting and index building. We use five real datasets and one TPC-H dataset: a database table in SAP HANA that records items in sales documents (which we call INDBTAB), a complete EST (expressed sequence tag) database of Human Chromosome 14 from Genome Assembly Gold-standard Evaluations [42], Wikipedia titles [43], external links of DBpedia [44], Wikipedia links of DBpedia [44], and the Part table (column name) of TPC-H [45].

[Figure 9: Sorting time and index-building time of six datasets (load; full key sort - sort; full key sort - build; compressed key sort - extract; compressed key sort - sort; compressed key sort - build).]

The computer used in our experiments is equipped with four Intel Xeon E7-8880 v4 (2.20GHz) processors, each of which contains 22 cores. The computer has 1TB DRAM memory. (Since we used no more than 16 cores in our experiments on parallelization, the experiments were done in a single processor.)

Table 2 presents the basic statistics of the six datasets such as the sizes of each database table and its index tree as well as a few important characteristics and measurements relevant to our proposed scheme. The full sort key refers to the combination of an uncompressed key taken from a dataset and the corresponding record ID. The record ID is 8 bytes long, and either the whole or only the variant bits of a record ID can be used as part of a sort key.
In the latter case, the variant bitmap in the DS-metadata should be expanded to include the variant bits of the record IDs. The compressed sort key consists of distinction bits in a key and variant bits in the corresponding record ID. Table 2 also shows the number of keys in a dataset, the lengths of the shortest, average, and longest keys, the number of bits in a full key, the number of distinction bits in keys, the number of variant bits in record IDs, the size of full sort keys (i.e., full key + record ID), and the size of compressed sort keys (i.e., distinction bits in key + variant bits in record ID). The length of a sort key, full or compressed, is presented in the unit of 8B because sort keys are stored in words.

The compression ratio and the sort key ratio are computed by the following formulas:

compression ratio = (number of bits in a full key) / (number of distinction bits in keys)
sort key ratio = (size of a full sort key) / (size of a compressed sort key)

Table 3: Distribution of distinction bit positions of INDBTAB.

bytes   distinction bit positions
1–8     00000000 00000000 00000001 00000001 00000001 00001111 00000011 00001111
9–16    00000000 00000011 00001111 00000111 00001111 00000111 00001111 00000000
17–24   00001111 00001111 00001111 00001111 00001111 00001110 00000000 00000000
25–32   00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
33–35   00000000 00000000 00000000

In Figure 9, the second and third bars for each dataset represent the full key sort and the compressed key sort, respectively. While the second bar is broken down into two phases, sort and build, the third bar is broken down into three phases, extract, sort, and build. This is because the extract phase is needed only for the compressed key sort to obtain compressed keys by extracting bits from full keys in the positions of 1s in the D-bitmap. Despite the extra phase of bit extraction, however, as is clearly shown in Figure 9, our compressed key sort reduced the total index building time substantially by expediting both the sort and build phases. The improvement ratio was 34.0% on average for the six datasets.
Note that all the measurements in the figure are normalized to the same scale for ease of presentation and comparison. The total time of building an index by the full key sort is 4.59, 21.12, 5.68, 17.63, 19.36, and 0.70 seconds for INDBTAB, Human, Wikititle, ExURL, WikiURL, and Part, respectively.

To better understand the performance differential, we looked into the distribution of distinction bit positions. Table 3 shows the distinction bit positions of INDBTAB, where distinction bit positions are set to 1. As is shown in the table, the distinction bit positions are distributed widely over many bytes of full index keys, and extracting distinction bits into a compressed key can make it shorter and improve the performance of sorting and index building. In the case of full key sort, a single key comparison will have to examine up to 22 bytes of each key (i.e., sort by distinguishing prefixes), because the last distinction bit position is in the 22nd byte in the table. In the case of compressed key sort, however, a single key comparison can be done by examining no more than 7 bytes, because there are only 56 distinction bits, which can be stored in 7 bytes. Although a compressed sort key is actually 16 bytes long due to the 27 variant bits in the record IDs, a comparison of two compressed sort keys finishes in one word comparison because all distinction bits belong to the first word of a compressed sort key.

We conduct a sensitivity analysis to see how our sort key compression scheme performs under various circumstances.
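To make the word-comparison argument concrete, here is a sketch (our own, not the system's code) of a comparator that works one 8-byte word at a time and reports how many word comparisons it used; when all distinction bits fall in the first word, two distinct compressed sort keys are separated after a single word comparison.

```python
def compare_by_words(a: bytes, b: bytes):
    """Compare two equal-length byte strings 8 bytes at a time.
    Returns (sign, number_of_word_comparisons_used)."""
    assert len(a) == len(b) and len(a) % 8 == 0
    for i in range(0, len(a), 8):
        wa, wb = a[i:i + 8], b[i:i + 8]   # big-endian view: byte order = key order
        if wa != wb:
            return (1 if wa > wb else -1), i // 8 + 1
    return 0, len(a) // 8
```

For example, two 16-byte compressed sort keys that already differ in their first word are decided in one word comparison, whereas two 24-byte full keys sharing a 16-byte common prefix need three.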
The main parameters that affect the performance are the sort key ratio defined in the previous section and the word comparison ratio defined as follows:

word comparison ratio = wcc_full / wcc_comp

where wcc_full is the average count of word comparisons required by a single full key comparison, and wcc_comp is the average count of word comparisons required by a single compressed key comparison.

We used the Zipf distribution [46] to generate synthetic datasets of various configurations. Each dataset is generated by a custom function, denoted by Zipf(s, n, m), so that it contains 10 million keys of n bytes each, the first m bytes of each 8-byte word in a key have the same arbitrary ASCII value, and the remaining 8 − m bytes of each word have lower-case ASCII characters following a Zipf distribution, where s is the value of the exponent characterizing the Zipf distribution. For example, Zipf(s, 16, 3) generates keys of type aaaZZZZZ aaaZZZZZ, where a is an arbitrary fixed character and Z is a byte having one of 'a' to 'z' by the Zipf distribution with exponent s.
Table 4: Statistics of synthetic datasets.

data  function         key size  full sort  compressed     sort key  word comparison
                                 key size   sort key size  ratio     ratio
1     Zipf(2.5,48,0)   48B       56B        40B            1.40      1.30
2     Zipf(2.5,56,0)   56B       64B        40B            1.60      1.30
3     Zipf(2.5,64,0)   64B       72B        40B            1.80      1.30
4     Zipf(2.5,72,0)   72B       80B        40B            2.00      1.30
5     Zipf(2.5,80,0)   80B       88B        40B            2.20      1.30
6     Zipf(2.5,88,0)   88B       96B        40B            2.40      1.30
7     Zipf(2.5,96,0)   96B       104B       40B            2.60      1.30
8     Zipf(2.5,104,0)  104B      112B       40B            2.80      1.30
9     Zipf(2.5,112,0)  112B      120B       40B            3.00      1.30
10    Zipf(1.5,40,0)   40B       48B        24B            2.00      1.06
11    Zipf(1.5,40,1)   40B       48B        24B            2.00      1.11
12    Zipf(1.5,40,2)   40B       48B        24B            2.00      1.20
13    Zipf(1.5,40,3)   40B       48B        24B            2.00      1.34
14    Zipf(1.5,40,4)   40B       48B        24B            2.00      1.55
15    Zipf(1.5,64,0)   64B       72B        24B            3.00      1.05
16    Zipf(1.5,64,1)   64B       72B        24B            3.00      1.10
17    Zipf(1.5,64,2)   64B       72B        24B            3.00      1.19
18    Zipf(1.5,64,3)   64B       72B        24B            3.00      1.33
19    Zipf(1.5,64,4)   64B       72B        24B            3.00      1.53
20    Zipf(1.5,64,5)   64B       72B        24B            3.00      1.85

[Figure 10: Total time ratio of synthetic datasets (total time ratio vs. sort key ratio for datasets 1-9; vs. word comparison ratio for datasets 10-14 and 15-20).]

Table 5: Sorting time and index-building time of INDBTAB (in seconds).

        full key sort          compressed key sort            total time        speedup
cores   sort   build  total    extract  sort   build  total   ratio   improve   full  comp
1       4.251  0.340  4.591    0.543    2.063  0.242  2.848   1.61    38.0%     1.0   1.0
2       2.195  0.171  2.366    0.274    1.053  0.123  1.450   1.63    38.7%     1.9   2.0
4       1.181  0.090  1.271    0.138    0.572  0.066  0.776   1.64    38.9%     3.6   3.7
8       0.549  0.049  0.598    0.069    0.265  0.035  0.369   1.62    38.3%     7.7   7.7
16      0.310  0.034  0.344    0.034    0.148  0.025  0.207   1.66    39.8%     13.3  13.8

Table 4 shows the statistics of the synthetic datasets used in the sensitivity analysis. As can be seen in the table, the datasets are generated such that they conform to the default parameter settings: word comparison ratio = 1.30 and sort key ratio = 2.00 or 3.00.
In datasets 1-9, the word comparison ratio is fixed to 1.30 and the sort key ratio changes as the key size changes. In datasets 10-14 (resp. 15-20), the sort key ratio is fixed to 2.00 (resp. 3.00) and the word comparison ratio changes as m changes in Zipf(s, n, m).

Figure 10 shows the index construction times by the compressed key sort in comparison with those by the full key sort. The measurements are the ratio of the latter to the former (i.e., full key sort time over compressed key sort time). So, the higher the ratio is, the larger the margin of improvement by the compressed key sort is. In datasets 1-9, as the sort key ratio increases, the advantage of our compressed key sort scheme grows proportionally. This is because the size of compressed sort keys gets smaller, which leads to a smaller amount of work in sorting and index building.

In datasets 10-14 (and likewise in datasets 15-20) the number of distinction bits in each dataset is about the same, and thus the sort key ratios are identical. However, as m in Zipf(1.5, 40, m) increases, the distinction bits are more widely spread in full keys. Hence, our compression scheme has the effect of compacting widely spread distinction bits in full keys into compressed keys, which leads to a smaller number of word comparisons in a comparison of two compressed keys. As m increases, therefore, the word comparison ratio gets bigger, which results in an improvement especially in sorting time, even though the sort key ratios remain the same. Therefore, our compression scheme improves the performance of index building in two ways:
1. by making compressed keys shorter than full keys, which leads to a smaller amount of work in sorting and index building, and
2.
by compacting distinction bits in full keys into compressed keys, which leads to a smaller number of word comparisons in a comparison of two compressed keys.

For example, whereas the sort key ratio of WikiURL is smaller than that of INDBTAB (in Table 2), the improvement of WikiURL is larger than that of INDBTAB (in Figure 9) because the word comparison ratio of WikiURL is larger than that of INDBTAB.

Finally, we compare the sensitivity analysis to the experimental results with the six datasets with respect to (sort key ratio, word comparison ratio, total time ratio). INDBTAB has a characteristic (3.00, 2.10, 1.61), which is similar to (3.00, 1.85, 1.66) of dataset 20; (2.33, 1.27, 1.59) of Human is similar to (2.40, 1.30, 1.46) of dataset 6; (2.20, 1.00, 1.30) of Wikititle is similar to (2.00, 1.06, 1.19) of dataset 10; and (2.03, 1.37, 1.44) of ExURL is similar to (2.00, 1.34, 1.25) of dataset 13.

In Section 6.2, we have shown that our compressed key sort can reduce the times for both the sort and index-build phases considerably. Nonetheless, the sort time still remains the dominant portion of the total index reconstruction time. In this section, we show that the sort time can be further reduced by parallelizing it on a multi-core computing platform. The choice of our sort algorithm is the row-column sort; refer to Appendix A for a detailed description of the algorithm.

[Figure 11: Speedups of sorting time and index-building time.]

Table 5 shows the detailed performance measurements from the full key sort and the compressed key sort with a varying number of cores used. Figure 11 presents the speedups observed in this experiment in log-log scale.
It clearly shows near-linear speedups in all measurements from both the full key sort and the compressed key sort, except for the index-building time with 16 cores, in which the speedup deteriorated because memory writes (390MB in Table 2) became dominant and were not accelerated by using multiple cores. When sixteen cores were used, the speedups in the total index reconstruction time were 13.3 and 13.8 for the full key sort and the compressed key sort, respectively.

In the case of sixteen cores, the tree index of INDBTAB was reconstructed from the pre-loaded database table in just 0.207 seconds. Given that the memory footprint of the index tree of INDBTAB is 390MB (see Table 2), this is approximately equivalent to 1.88 GB/sec of bandwidth. This level of bandwidth is higher than the peak bandwidth of enterprise-class magnetic disk drives and most commodity solid state drives. We present this result as evidence that a tree index can be reconstructed from the pre-loaded database table on the fly much more efficiently than loading a materialized image from disk.

7. Conclusion

We have defined the notion of distinction bit positions and proved that the bit slices of index keys in distinction bit positions are sufficient information to determine the sorted order of the index keys correctly. Consequently, utilizing only those bit slices is in effect equivalent to compressing keys losslessly with regard to sorting the keys. The key compression ratio achieved by the proposed method was 2.76:1 on average in our experiments.

We have then proposed the compressed key sort based on the distinction bit positions in order to expedite the reconstruction of a tree index from the base table in memory. The compressed key sort reduced the index reconstruction time by 34.0% on average in our experiment carried out on a single-core platform. The proposed method based on distinction bit positions is essentially a lightweight key compression scheme.
Thus it can be adopted in any application that involves sorting keys longer than the word size, and it is expected to deliver significant performance benefits.

Acknowledgements

Ryu and Park were supported by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00551, Framework of Practical Algorithms for NP-hard Graph Problems). Moon was supported by a grant (K-16-L03-C01-S03) funded by the ministry of science, ICT, and future planning, Korea, and PF Class Heterogeneous High Performance Computer Development (NRF-2016M3C4A7952587).

References

[1] H. Zhang, D. G. Andersen, A. Pavlo, M. Kaminsky, L. Ma, R. Shen, Reducing the storage overhead of main-memory OLTP databases with hybrid indexes, in: Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data, ACM, 2016, pp. 1567–1581.
[2] Microsoft, SQL Server Documentation, https://docs.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/comparing-disk-based-table-storage-to-memory-optimized-table-storage?view=sql-server-2017.
[3] C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal, R. Stonecipher, N. Verma, M. Zwilling, Hekaton: SQL Server's memory-optimized OLTP engine, in: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM, 2013, pp. 1243–1254.
[4] F. Faerber, A. Kemper, P.-A. Larson, J. Levandoski, T. Neumann, A. Pavlo, Main memory database systems, Foundations and Trends in Databases 8 (1-2) (2017) 1–130.
[5] N. Malviya, A. Weisberg, S. Madden, M. Stonebraker, Rethinking main memory OLTP recovery, in: Proceedings of the 30th International Conference on Data Engineering, IEEE Computer Society, 2014, pp. 604–615.
[6] R. Bayer, K. Unterauer, Prefix B-trees, ACM Transactions on Database Systems 2 (1) (1977) 11–26.
[7] T. J. Lehman, M. J.
Carey, A study of index structures for main memory database management systems, in: Proceedings of the 12th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., 1986, pp. 294–303.
[8] D. E. Ferguson, Bit-Tree: A data structure for fast file processing, Communications of the ACM 35 (6) (1992) 114–120.
[9] P. Bohannon, P. McIlroy, R. Rastogi, Main-memory index structures with fixed-size partial keys, ACM SIGMOD Record 30 (2) (2001) 163–174.
[10] J. Rao, K. A. Ross, Cache conscious indexing for decision-support in main memory, in: Proceedings of the 25th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., 1999, pp. 78–89.
[11] J. Rao, K. A. Ross, Making B+-trees cache conscious in main memory, ACM SIGMOD Record 29 (2) (2000) 475–486.
[12] S. Chen, P. B. Gibbons, T. C. Mowry, Improving index performance through prefetching, ACM SIGMOD Record 30 (2) (2001) 235–246.
[13] S. Chen, P. B. Gibbons, T. C. Mowry, G. Valentin, Fractal prefetching B+-trees: Optimizing both cache and disk performance, in: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, ACM, 2002, pp. 157–168.
[14] B. Schlegel, R. Gemulla, W. Lehner, K-ary search on modern processors, in: Proceedings of the 5th International Workshop on Data Management on New Hardware, ACM, 2009, pp. 52–60.
[15] M. Boehm, B. Schlegel, P. B. Volk, U. Fischer, D. Habich, W. Lehner, Efficient in-memory indexing with generalized prefix trees, in: Proceedings of the 14th BTW Conference on Database Systems for Business, Technology, and Web, 2011, pp. 227–246.
[16] T. Kissinger, B. Schlegel, D. Habich, W. Lehner, KISS-Tree: Smart latch-free in-memory indexing on modern architectures, in: Proceedings of the 8th International Workshop on Data Management on New Hardware, ACM, 2012, pp. 16–23.
[17] C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, P.
Dubey, FAST: Fast architecture sensitive tree search on modern CPUs and GPUs, in: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM, 2010, pp. 339–350.
[18] T. Yamamuro, M. Onizuka, T. Hitaka, M. Yamamuro, VAST-Tree: A vector-advanced and compressed structure for massive data tree traversal, in: Proceedings of the 15th International Conference on Extending Database Technology, ACM, 2012, pp. 396–407.
[19] J. J. Levandoski, D. B. Lomet, S. Sengupta, The Bw-tree: A B-tree for new hardware platforms, in: Proceedings of the 29th International Conference on Data Engineering, IEEE Computer Society, 2013, pp. 302–313.
[20] V. Leis, A. Kemper, T. Neumann, The adaptive radix tree: ARTful indexing for main-memory databases, in: Proceedings of the 29th International Conference on Data Engineering, IEEE Computer Society, 2013, pp. 38–49.
[21] H. Zhang, H. Lim, V. Leis, D. Andersen, M. Kaminsky, K. Keeton, A. Pavlo, SuRF: Practical range query filtering with fast succinct tries, in: Proceedings of the 2018 ACM SIGMOD International Conference on Management of Data, ACM, 2018, pp. 323–336.
[22] R. Binna, E. Zangerle, M. Pichl, G. Specht, V. Leis, HOT: A height optimized trie index for main-memory database systems, in: Proceedings of the 2018 ACM SIGMOD International Conference on Management of Data, ACM, 2018, pp. 521–534.
[23] G. Graefe, P.-A. Larson, B-tree indexes and CPU caches, in: Proceedings of the 17th International Conference on Data Engineering, IEEE Computer Society, 2001, pp. 349–358.
[24] G. Graefe, Modern B-tree techniques, Foundations and Trends in Databases 3 (4) (2011) 203–402.
[25] Apache ORC, https://orc.apache.org/.
[26] Apache Parquet, https://parquet.apache.org/.
[27] G. Antoshenkov, D. Lomet, J. Murray, Order preserving string compression, in: Proceedings of the 12th International Conference on Data Engineering, IEEE Computer Society, 1996, pp. 655–663.
[28] G.
Antoshenkov, Dictionary-based order-preserving string compression, The VLDB Journal 6 (1) (1997) 26–39.
[29] Z. Chen, J. Gehrke, F. Korn, Query optimization in compressed database systems, ACM SIGMOD Record 30 (2) (2001) 271–282.
[30] C. Binnig, S. Hildenbrand, F. Färber, Dictionary-based order-preserving string compression for main memory column stores, in: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, ACM, 2009, pp. 283–296.
[31] H. Zhang, X. Liu, D. Andersen, M. Kaminsky, K. Keeton, A. Pavlo, Order-preserving key compression for in-memory search trees, in: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, ACM, 2020, pp. 1601–1615.
[32] H. Inoue, T. Moriyama, H. Komatsu, T. Nakatani, AA-sort: A new parallel sorting algorithm for multi-core SIMD processors, in: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, IEEE Computer Society, 2007, pp. 189–198.
[33] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, P. Dubey, Efficient implementation of sorting on multi-core SIMD CPU architecture, Proceedings of the VLDB Endowment 1 (2) (2008) 1313–1324.
[34] N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, P. Dubey, Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort, in: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM, 2010, pp. 351–362.
[35] The GNU C++ Library Manual, http://gcc.gnu.org/onlinedocs/libstdc++/manual/.
[36] Y. S. Kwon, K. Park, C. Yoo, Optimal sort key compression and index rebuilding, US Patent Application Number 15/658,671 (2017).
[37] A. Silberschatz, H. F. Korth, S. Sudarshan, Database System Concepts, 4th Edition, McGraw-Hill Higher Education, 2001.
[38] V. Leis, FAST source.
[39] J. Rao, CSB+-tree source.
[40] Intel, Advanced Vector Extensions Programming Reference, 2011.
[41] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C.
Stein, Introduction to Algorithms, 3rd Edition, The MIT Press, 2009.
[42] Genome datasets of Human Chromosome 14, http://gage.cbcb.umd.edu/data/index.html.
[43] Wikipedia titles dump, http://dumps.wikimedia.org/enwiki/.
[44] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives, DBpedia: A nucleus for a web of open data, in: The Semantic Web, Springer Berlin Heidelberg, 2007, pp. 722–735.
[45] M. Poess, C. Floyd, New TPC benchmarks for decision support and web commerce, ACM SIGMOD Record 29 (4) (2000) 64–71.
[46] S. Ross, A First Course in Probability, 6th Edition, Prentice Hall, 2002.
[47] P. J. Varman, S. D. Scheufler, B. R. Iyer, G. R. Ricard, Merging multiple lists on hierarchical-memory multiprocessors, Journal of Parallel and Distributed Computing 12 (2) (1991) 171–177.
[48] R. S. Francis, I. D. Mathieson, L. Pannan, A fast, simple algorithm to balance a parallel multiway merge, in: Proceedings of the 5th International PARLE Conference on Parallel Architectures and Languages Europe, Springer-Verlag, 1993, pp. 570–581.
[49] J. L. Bentley, M. D. McIlroy, Engineering a sort function, Software: Practice and Experience 23 (11) (1993) 1249–1265.
[50] A. LaMarca, R. E. Ladner, The influence of caches on the performance of sorting, Journal of Algorithms 31 (1) (1999) 66–104.
[51] D. E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, 2nd Edition, Addison Wesley Longman Publishing Co., Inc., 1998.
[52] H. Shi, J. Schaeffer, Parallel sorting by regular sampling, Journal of Parallel and Distributed Computing 14 (4) (1992) 361–372.

Appendix A. Parallel Sorting

Algorithm 1 Row-Column Sort
 1: procedure Row_Column_Sort(Key[1..
n], n, p, e, C)
 2:   c ← ⌊C/e⌋
 3:   t ← max(⌊√(n/c)/p⌋, 1)
 4:   for i ← 1 to tp do
 5:     init block[i] ← (n/(tp)·(i−1) + 1, n/(tp)·i)
 6:     sorted block[i] ← (n/(tp)·(i−1) + 1, n/(tp)·i)    ▷ in another array Temp
 7:     for j ← 1 to n/(tpc) do
 8:       sub block[i][j] ← (n/(tp)·(i−1) + c(j−1) + 1, n/(tp)·(i−1) + cj)
 9:   for all threads i ← 1 to p do in parallel
10:     for j ← 1 to t do
11:       for k ← 1 to n/(tpc) do
12:         basic_sort(sub block[t(i−1)+j][k])
13:       multiway_merge(n/(tpc), sub block[t(i−1)+j][1..n/(tpc)], sorted block[t(i−1)+j])
14:   for all threads i ← 1 to p do in parallel
15:     perfect_partition(sorted block[1..tp], split block[1..tp][1..p])
16:   for all threads i ← 1 to p do in parallel
17:     final block[i] ← (n/p·(i−1) + 1, n/p·i)
18:     multiway_merge(tp, split block[1..tp][i], final block[i])

We describe our parallel sorting algorithm, which we call the row-column sort. The row-column sort uses the notion of the perfect partition in [47, 48]. A pair of a (full or compressed) key and its record ID will be called a sort key, which is an element in sorting. The input to the row-column sort is as follows:

Key[1..n]: array of sort keys
n: number of elements (i.e., sort keys)
p: number of threads
e: size of an element (in bytes)
C: last-level cache size (in bytes) per thread (i.e., available L3 cache size / number of threads in our experiments)

[Figure 12: Row-column sort. (a) init block[1..tp], where t = 1, p = 4. (b) sorted block[1..tp] (each block is sorted). (c) split block[1..tp][1..p] (perfect split of sorted blocks).]

For the dataset of INDBTAB in Section 6, for instance, n is 16.39 million, and e is 48 bytes for full sort keys. Typically in our target applications, the size of sort keys is too big to exploit SIMD parallelism.
Hence, the row-column sort does not rely on SIMD instructions; it is a comparison sort [41] (i.e., it relies on the operation of comparing two elements). Algorithm 1 shows the pseudo-code of the row-column sort. The details of the algorithm are as follows.
1. (lines 2–3) Compute two parameters which are used in the algorithm: c is the number of elements that can be included in C bytes, and t is set such that n/(tpc) ≈ tp in order to balance the workloads of line 13 and line 18. In Figure 12, n = 32 and p = 4. For simplicity, we assume in this toy example that e = 1 and c = 2. Then t is set to 1.
2. The row-column sort uses two arrays Key[1..n] and Temp[1..n], which are partitioned into blocks: init block[1..tp], sub block[1..tp][1..n/(tpc)], and final block[1..p] are blocks of array Key, and sorted block[1..tp] and split block[1..tp][1..p] are blocks of array Temp. See Figure 12. For simplicity of presentation, we assume that n/(tp), n/(tpc), and n/p are integers. In Algorithm 1, each block is represented by the first position and the last position in its array (but in the actual implementation only one of the first and last positions is necessary because the whole array is partitioned into blocks without overlaps). For example, init block[1] is represented by (1, n/(tp)).
3. (lines 9–13) Assign t init blocks to each thread. Each thread sorts each of its t init blocks as follows. (Note that a block init block[i] is partitioned into sub block[i][1..n/(tpc)].)
3.1. Sort each sub block[i][k] of Key[1..n] by the following basic sort. The basic sort is essentially Quicksort with insertion sort as the recursion base. The Quicksort partitions around the median of the medians of three samples, each of three elements (also called the pseudo-median of 9 elements) [49]. This basic sort is fast when all the input elements are within the last-level cache [49, 50].

Table 6: GCC STL sort vs. row-column sort (in seconds).
cores            1      2      4      8      16
GCC STL sort     4.016  2.727  1.380  0.775  0.452
row-column sort  4.251  2.195  1.181  0.549  0.310

[Figure 13: Speedups of GCC STL sort and row-column sort.]

3.2. Merge the n/(tpc) sub blocks into a sorted block by the multiway merge (i.e., n/(tpc)-way merge) using a tournament tree [51]. (In multiway_merge(x, in block[1..x], out block) of Algorithm 1, x is the number of blocks to be merged, in block[1..x] are the blocks to be merged, and out block is the merged block.)
4. (lines 14–15) Compute the perfect p-partition of the tp sorted blocks [47, 48]. The perfect p-partition of sorted blocks is defined as follows: each sorted block is partitioned into p split blocks (sizes of split blocks may vary, and there can even be an empty split block, as in the second sorted block of Figure 12 (c)) such that the collection of all the first split blocks constitutes the n/p smallest ones of the n elements (gray elements in Figure 12 (c)), the collection of all the second split blocks constitutes the next n/p smallest ones, and so on.
5. (lines 16–18) Thread i (1 ≤ i ≤ p) merges all the i-th split blocks of the perfect p-partition (i.e., split block[1..tp][i]) into a final block. Again we use the multiway merge (i.e., tp-way merge) using a tournament tree.

The perfect p-partition in step 4 is computed as follows. An x-split of the tp sorted blocks is defined as a partition of each sorted block[i] (1 ≤ i ≤ tp) into two disjoint subsets L_i and H_i such that
(1) any element in all L_i's is less than or equal to any element in all H_i's, and
(2) the number of elements in all L_i's is exactly x.
To find the perfect p-partition, each thread i (1 ≤ i ≤ p−1) computes an (i × n/p)-split of the tp sorted blocks. Then the (n/p)-split, (2n/p)-split, ..., ((p−1) × n/p)-split make the perfect p-partition.
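As a simple illustration of the x-split (our own sketch, not the algorithm of [47], which avoids scanning all elements), the following computes the sizes |L_i| by locating the x-th smallest element and then distributing ties on it:

```python
import bisect

def x_split_sizes(sorted_blocks, x):
    """Sizes |L_i| of an x-split of the sorted blocks: the x smallest
    elements overall, taken as a prefix of each sorted block."""
    if x == 0:
        return [0] * len(sorted_blocks)
    # Locate the x-th smallest element via a flat merge; a practical
    # implementation would find it with binary searches instead.
    flat = sorted(v for blk in sorted_blocks for v in blk)
    pivot = flat[x - 1]
    sizes = [bisect.bisect_left(blk, pivot) for blk in sorted_blocks]
    short = x - sum(sizes)                    # copies of the pivot still needed
    for i, blk in enumerate(sorted_blocks):
        ties = bisect.bisect_right(blk, pivot) - sizes[i]
        take = min(ties, short)
        sizes[i] += take
        short -= take
    return sizes
```

For p threads, thread i would compute the sizes for x = i × n/p; the prefix sizes of consecutive splits then delimit the split blocks of the perfect p-partition.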
In Figure 12 (c), the n/p-split has L_1 = {…}, L_2 = {…}, L_3 = {…}, L_4 = {…}, and we set split block[i][1] = (n/(tp)·(i−1) + 1, n/(tp)·(i−1) + |L_i|), i.e., split block[1][1] = (1, |L_1|). […] x-split. The algorithm in [47] also computes an x