Bridging the Gap Between Theory and Practice on Insertion-Intensive Database
Sepanta Zeighami
USC
[email protected]

Raymond Chi-Wing Wong
HKUST
[email protected]
ABSTRACT
With the prevalence of online platforms, today, data is being generated and accessed by users at a very high rate. Besides, applications such as stock trading or high-frequency trading require guaranteed low delays for performing an operation on a database. It is consequential to design databases that guarantee data insertion and query at a consistently high rate without introducing any long delay during insertion. In this paper, we propose Nested B-trees (NB-trees), an index that can achieve a consistently high insertion rate on large volumes of data, while providing asymptotically optimal query performance that is very efficient in practice. Nested B-trees support insertions at rates higher than LSM-trees, the state-of-the-art index for insertion-intensive workloads, while avoiding their long insertion delays and improving on their query performance. They approach the query performance of B-trees when complemented with Bloom filters. In our experiments, NB-trees had worst-case delays up to 1000 times smaller than LevelDB, RocksDB and bLSM, commonly used LSM-tree data-stores, could perform queries more than 4 times faster than LevelDB and 1.5 times faster than bLSM and RocksDB, while also outperforming them in terms of average insertion rate.
PVLDB Reference Format:
Sepanta Zeighami, Raymond Chi-Wing Wong. Bridging the Gap Between Theory and Practice on Insertion-Intensive Database.
PVLDB, XX(xxx): xxxx-yyyy, 2020.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx
1. INTRODUCTION
Due to the rapid growth of data in a variety of applications such as banking/trading systems [51], social media [24] and user logs [45, 20], massive data comes in at a rapid rate and it is very important for a database system to handle both fast insertion and fast query. Consider Facebook, with more than 41,000 posts per second [29], and YouTube, with more than 60,000 videos watched per second on average [19]. The data is generated at a rapid rate and is accessed by other users at the same time.
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. XX, No. xxx
ISSN 2150-8097.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx

Consider the Nasdaq Exchange, where an average of about 70,000 shares are traded per second [51]. This stock exchange platform requires a database system to guarantee insertion performance at rates higher than 70,000 insertions per second. Meanwhile, an insertion delay on the order of milliseconds is unacceptable in many trading scenarios, such as in high-frequency trading, a large component of the market [26] where stocks are traded by the millisecond [48]. Besides, the current stock price has to be accessed in a short time for the next sell/buy of this stock.
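To make the throughput requirement concrete, a sustained rate of 70,000 insertions per second leaves a budget of roughly 14 microseconds per insertion on average, so a single multi-millisecond stall consumes the budget of hundreds of insertions. A quick sketch of this arithmetic (only the 70,000/s rate is taken from the text; the 10 ms stall is a hypothetical value for illustration):

```python
# Average time budget per insertion at the quoted rate.
rate = 70_000              # insertions per second (figure from the text)
budget_s = 1.0 / rate      # ~1.43e-5 s, i.e. ~14.3 microseconds per insertion
print(f"per-insertion budget: {budget_s * 1e6:.1f} us")

# A hypothetical 10 ms insertion stall consumes the budget of ~700 insertions.
stall_s = 10e-3
print(f"insertions' worth of budget consumed by one stall: {stall_s / budget_s:.0f}")
```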
In this paper, we study how to design an index which achieves the following 5 requirements.

1. Short Average Insertion Time Requirement: The index could handle a lot of insertions within a short period of time.
2. Short Maximum Insertion Time Requirement: The index could handle each individual insertion within a short time.
3. Short Average Query Time Requirement: The index could return the answers of a lot of queries within a short period of time.
4. Short Maximum Query Time Requirement: The index could return the answer of each individual query within a very short time.
5. Theoretical Performance Guarantee Requirement: The index could have a theoretical performance guarantee on both the insertion performance and the query performance.

(1) The Short Average Insertion Time Requirement is needed due to the rapid data growth nowadays. (2) The Short Maximum Insertion Time Requirement is a stricter requirement. It requires that each individual insertion has to be completed within a short period of time, whereas the former requirement requires that the index could handle a collective set of insertions within a period of time, allowing some individual insertions to be completed with a longer delay. (3) The Short Average Query Time Requirement is needed due to the rapid data access in some applications. (4) The Short Maximum Query Time Requirement is needed since it requires that each individual query could be answered in a short time. (5) The Theoretical Performance Guarantee Requirement is needed so that we know how good/bad an index is. Based on the first 4 requirements, we are interested in the time complexities of the following:

(a) Amortized insertion time (Requirement 1)
(b) Worst-case insertion time (Requirement 2)
(c) Average query time (Requirement 3)
(d) Worst-case query time (Requirement 4)

(a) The amortized insertion time of an insertion is the total time the index needs to handle a batch of insertions divided by the total number of insertions handled by the index. (b) The worst-case insertion time of an insertion is the greatest insertion time of an insertion. (c) The average query time is the query time of a query in expectation. (d) The worst-case query time is the greatest query time of a query.

Similar to many recent studies [18, 17, 16, 11, 8, 46], we focus on the setting where the data is stored in external memory (e.g., HDD or SSD). Data storage in main memory is more expensive than HDDs or SSDs. As pointed out in [16] and discussed in [34], main memory costs 2 orders of magnitude more than disk in terms of price per bit. Moreover, main memory consumes about 4 times more power per bit than disk [49].
Thus, designing high-performance external-memory indices that provide guarantees for real-world applications can significantly reduce the cost of operations for many systems. On Amazon Web Services, any machine with more than 100GB of main memory costs at least US$1 per hour, but a machine with 15.25GB of main memory and 475GB of SSD costs US$0.156 per hour [3] (Linux machines, US East (Ohio) region). An SSD with 480GB capacity costs US$55 [5], while a 128GB DDR3L RAM module costs about US$393 [4].
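A back-of-the-envelope comparison using the component prices quoted above (US$55 for a 480GB SSD, US$393 for a 128GB RAM module) makes the gap concrete:

```python
def cost_per_gb(price_usd, capacity_gb):
    """Price per gigabyte in US dollars."""
    return price_usd / capacity_gb

ssd = cost_per_gb(55, 480)    # ~0.11 USD/GB
ram = cost_per_gb(393, 128)   # ~3.07 USD/GB

print(f"SSD: ${ssd:.3f}/GB, RAM: ${ram:.3f}/GB, ratio: {ram / ssd:.0f}x")
```

At these prices, RAM is roughly 27 times more expensive per gigabyte than SSD, which is the economic motivation for keeping most of the index on disk.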
Existing indices do not satisfy the above requirements simultaneously. There are two major branches of indices related to our goal: (1) LSM-tree-like indices [37, 42, 32] and (2) B-tree-like indices [7, 10, 27].

Consider the first branch. In recent years, LSM-trees [37, 42, 52, 15, 33, 17, 18] have attracted a lot of attention and are used as the standard index for insertion-intensive workloads in systems such as LevelDB [23], BigTable [12], HBase [1], RocksDB [21] (by Google and Facebook [12, 1, 21]), Cassandra [30] and Walnut [14]. LSM-trees buffer insertions in memory and merge them with on-disk components in bulk, creating sorted runs on disk. Although LSM-trees satisfy the Short Average Insertion Time Requirement, they do not satisfy the Short Maximum Insertion Time Requirement, the Short Average/Maximum Query Time Requirements or the Theoretical Performance Guarantee Requirement. This is because an LSM-tree's worst-case insertion time is linear in the data size [46, 32] and its worst-case query time is suboptimal [32]. In fact, in our experiments, although RocksDB [21], the industry standard and a common research baseline [18, 17, 16], took on the order of microseconds per insertion on average, it had a worst-case insertion time of 453 seconds. Such a worst-case insertion time is utterly unacceptable for any application that requires reliability.

There are two major techniques to improve the performance of LSM-trees in the literature. The first technique is
Bloom filters. They can improve the average query time of LSM-trees [42], but the worst-case query time remains suboptimal. Thus, LSM-trees with Bloom filters still do not satisfy the Short Maximum Query Time Requirement. One representative is bLSM [42], a variant of the LSM-tree that uses Bloom filters at each level. It also limits the number of LSM-tree levels. Setting the number of LSM-tree levels to a maximum allows for asymptotically optimal query time, but violates the Short Average Insertion Time Requirement, since the amortized insertion time becomes asymptotically larger than that of LSM-trees with an unrestricted number of levels. This is because the ratio of the sizes between LSM-tree components becomes unbounded, causing merge operations to read and rewrite a larger portion of the data. Furthermore, [42] provides methods to improve the worst-case query time of LSM-trees by a constant factor, but the worst-case insertion time remains linear in the data size. Thus, they still do not satisfy the Short Maximum Insertion Time Requirement. The second technique is fractional cascading. It improves the worst-case query time of LSM-trees [32], but their average query time remains high. Thus, LSM-trees with fractional cascading still do not satisfy the Short Average Query Time Requirement. One representative is [32], which adds an extra pointer to each component of the LSM-tree, pointing to its next component. This pointer allows for reading one disk page per LSM-tree level. This approach was not compared in the experimental studies of LSM-trees [42, 15, 17, 18] due to its high average query time. Fractional cascading and Bloom filters are incompatible [42] and cannot be used together.

Consider the second branch. Traditional B-trees [7] and B+-trees [44] are among the most commonly used indices for good query performance. They provide optimal query performance and thus satisfy the Short Average/Maximum Query Time Requirements.
They do not satisfy the Short Average and Maximum Insertion Time Requirements, because they perform no buffering and perform at least one disk access for every insertion, which is very time-consuming. Later, a write-optimized variant of B-trees called Bε-trees (also known as B-trees with Buffer) [10] was proposed, which reserves a portion of each node for a buffer. New data is inserted into the buffer of the root and moved down the levels of the tree whenever the buffer becomes full. However, this method, although faster than B-trees, does not satisfy the Short Average/Maximum Insertion Time Requirement. This is because B-tree nodes get scattered across the storage device, and moving this small buffer frequently down from a node requires accessing its children, which is time-consuming.

Motivated by the above, in this paper, we propose an index called the Nested B-tree (NB-tree) which satisfies the 5 requirements simultaneously. That is, NB-trees give short average/maximum insertion time which is multiple factors smaller than B-trees and is similar to LSM-trees. They provide worst-case insertion time logarithmic in the data size (unlike the LSM-tree's linear worst-case insertion time) and multiple factors smaller than B-trees, which satisfies the Short Maximum Insertion Time Requirement. They use Bloom filters to provide low average query time, while their structure allows for asymptotically optimal worst-case query time, satisfying the Short Maximum Query Time Requirement. This, together with their logarithmic, yet better-than-B-trees, worst-case insertion time shows that NB-trees satisfy the Theoretical Performance Guarantee Requirement.

Fig. 1 shows the structure of an NB-tree compared with an LSM-tree. Intuitively, a Nested B-tree is a B-tree in which each node contains a B+-tree. NB-trees can be seen as imposing a B-tree structure across the levels of an LSM-tree and breaking down each level into constant-sized B+-trees.

[Figure 1: Nested B-trees break down each level into constant-sized B+-trees and establish a connection between the keys in different levels. (a) LSM-Tree. (b) Nested B-Tree.]

By imposing a B-tree structure, NB-trees establish a relationship between the keys in different components and provide an asymptotically optimal query cost, which nears the query performance of B-trees when complemented with Bloom filters. This design is based on the observation that different levels need to be connected to avoid suboptimal worst-case query time, which is lacking in the structure of LSM-trees. Although this is also the intuition behind the design of the LSM-tree with fractional cascading [32], [32] fails to design an index compatible with Bloom filters or with logarithmic worst-case insertion time.

Furthermore, the B-tree structure ensures that keys in each node only overlap with the keys in its children. This limits the impact of merge operations across levels, causing the merge operations to have the same cost on all the levels, and is used to provide a logarithmic worst-case insertion cost. In essence, the connection created between different levels allows us to bound the cost during the merge operation and provide a per-insertion account of the total insertion cost. Such a worst-case analysis is missing in most of the LSM-tree literature [37, 18, 17, 16], where the focus has been on amortized analysis, and the few papers that have focused on worst-case performance provide an algorithm whose worst case is linear in the data size [46, 32].

Finally, by keeping the nodes as large constant-size B+-trees, NB-trees, similar to LSM-trees, perform mainly sequential I/O operations during insertions, which minimizes seek time and allows them to perform insertions better than B-trees and their variants.

• We propose the Nested B-Tree (NB-Tree), a novel data structure that satisfies all the 5 requirements mentioned for indices on large volumes of data. This is the first indexing structure satisfying all these 5 requirements in the literature.
• The NB-Tree is the first fast-insertion index with asymptotically-optimal query time. To the best of our knowledge, there is no existing index which could achieve this result.

• In our experiments, the NB-Tree's worst-case insertion time was more than 1000 times smaller than LevelDB, RocksDB and bLSM, the three popular LSM-tree databases. They achieved average query time almost the same as B+-trees, while performing insertions at least 10 times faster than them on average.

Summary of Results.
The performance improvement of NB-trees compared with LSM-tree variants and B-trees is summarized in Table 1. NB-trees outmatch LSM-trees on worst-case insertion and query time as well as average query time, and perform insertions faster than B-trees while providing similar average query performance. A more in-depth analysis of the related work is provided in Sec. 7.
Organization.
The rest of this paper is organized as follows. Section 2 discusses the terminology used and the problem addressed in this paper. Section 3 provides the design of the Nested B-tree data structure and Section 4 discusses more details on the implementation and analysis of the data structure. Section 5 discusses a more advanced version of the NB-tree that achieves a logarithmic worst-case insertion time and uses Bloom filters. Section 6 provides our experimental results. Section 7 discusses the relevant literature and Section 8 concludes the paper.
2. TERMINOLOGY AND SETTING
Problem Setting.
Key-value pairs are to be stored in an index that supports insertions, queries, deletions and updates. The index is to be stored on an HDD or SSD, and the term disk is used to broadly refer to the secondary storage device. The data is written to or read from disk in pages of size B bytes. The index can use up to M pages of main memory. Transferring a page from disk to main memory (or vice versa) incurs two costs: a seek time, T_seek, and a sequential read, T_seq,R, or write, T_seq,W, time. Seek time is the time difference between the starting time of the read/write request and the starting time of the data transfer. Sequential read/write time is the time taken to transfer the data from the disk to the main memory.

Sequential access time is determined by a device's bandwidth, while seek time depends on its internal mechanisms: on HDDs, the movement of the disk arm and platter; on SSDs, the limitations of its electrical circuits. Sequential time is proportional to the size of the data transferred, but seek time depends on how the data is stored, i.e., whether it is stored on contiguous blocks. It is important to account for seek time in our analysis as, per page, it can take much longer than sequential access time. For instance, an HDD (7200 rpm and 300MB/s bandwidth) based on the measurements in [41] has a seek time of 8.5 milliseconds and a transfer rate of 125 MB/s. Reading a 4KB disk page incurs a seek time of 8.5 × 10^-3 seconds, but reading it sequentially takes 4KB / 125MB/s ≈ 3 × 10^-5 seconds (283 times smaller than the seek time).

For ease of discussion, and as is the industry standard for common key-value stores such as RocksDB [22] and LevelDB [23], we consider the keys to be unique. Duplicate keys can be handled similarly to B-trees [44] by using an extra bucket or a uniquifier attribute as discussed in [44].
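The access-time model above (one seek plus a bandwidth-proportional transfer) can be sketched as follows, using the HDD figures quoted in the text (8.5 ms seek, 125 MB/s transfer rate); the 256-page batch size is an arbitrary illustrative choice:

```python
def page_read_time(page_bytes, t_seek, bandwidth, contiguous_pages=1):
    """Time to read `contiguous_pages` sequentially laid-out pages:
    one seek followed by a sequential transfer."""
    return t_seek + contiguous_pages * page_bytes / bandwidth

T_SEEK = 8.5e-3     # seconds (HDD seek time quoted above)
BANDWIDTH = 125e6   # bytes/second (125 MB/s transfer rate quoted above)
PAGE = 4 * 1024     # 4 KB page

random_read = page_read_time(PAGE, T_SEEK, BANDWIDTH)           # seek-dominated
amortized = page_read_time(PAGE, T_SEEK, BANDWIDTH, 256) / 256  # seek amortized over 256 pages
print(f"{random_read * 1e3:.2f} ms vs {amortized * 1e6:.1f} us per page")
```

Amortizing the single seek over a long sequential run is exactly why index designs that read and write large contiguous chunks (LSM-trees, and the NB-trees proposed here) beat designs that scatter single-page accesses.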
Performance Metrics.
When analyzing an index, we assume its performance is dominated by disk I/O operations. For an operation on an index (e.g., an insertion or query on the index), we use the term cost only when referring to the number of pages accessed during the operation. We use the term time when referring to the actual time taken, measured in seconds, during the operation. The time is dominated by disk I/O operations, and is composed of the sequential and seek times for all disk accesses performed during the operation. Because a time measure takes into account seek operations, it is a more realistic measure of the real-life performance of an index compared with the cost measures. We use the following metrics for the evaluation of indices.
Worst-case insertion time is the time, measured in seconds, it takes to insert an item into the index in the worst case. Moreover, given a set X of n keys to be inserted into an index, the amortized insertion time of a key in X with respect to X is the worst-case total time of inserting all the keys of X divided by n. Worst-case query time is the time an index takes to answer a query in the worst case.

Data Structure | Amortized Insertion Time | Worst-Case Insertion Time | Average Query Time | Worst-Case Query Time | Asymptotically Optimal Query Time
B-Tree [7] and B+-Tree | Bad | Medium | Good | Good | Yes
B-Tree with Buffer [10] | Medium | Medium | Medium | Medium | Yes
LSM-Tree (no BF, no FC) [37] | Good | Bad | Bad | Bad | No
LSM-Tree (BF, no FC): LevelDB [23], RocksDB [22], Monkey [16] | Good | Bad | Medium | Bad | No
LSM-Tree (no BF, FC) [32] | Good | Bad | Bad | Medium | Yes
NB-Tree (BF) [this paper] | Good | Good | Good | Medium | Yes

Table 1: Comparing NB-trees, LSM-trees and B-tree variants (BF: with Bloom filters, FC: with fractional cascading)
Average query time is the expected value of the random variable denoting the query time of a random query key (average time is defined over one operation and is the expected time the operation takes, while amortized time is defined over a set of operations and is the average time an operation takes in the worst case). The metrics are defined in terms of time, but their definitions in terms of cost are analogous.
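The four metrics can be made concrete on a trace of per-operation timings. The trace below is purely hypothetical (the numbers are not from the paper); the 3-second outlier stands in for an insertion that triggers a large merge:

```python
# Hypothetical per-operation timings, in seconds.
insert_times = [1e-5, 1e-5, 2e-5, 3.0, 1e-5]   # one insertion stalls for 3 s
query_times = [2e-4, 3e-4, 2.5e-4]

amortized_insertion = sum(insert_times) / len(insert_times)  # metric (a)
worst_case_insertion = max(insert_times)                     # metric (b)
average_query = sum(query_times) / len(query_times)          # metric (c)
worst_case_query = max(query_times)                          # metric (d)

# A modest amortized time (~0.6 s here, dominated by the one stall) can
# coexist with a huge worst case, which is why (a) and (b) are tracked
# separately -- the same distinction as Requirements 1 and 2.
```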
Problem Definition.
Our goal is to design an index that satisfies the Short Average/Maximum Insertion Time, Short Average/Maximum Query Time and Theoretical Performance Guarantee Requirements.
3. DESIGN OF NESTED B-TREE
An NB-tree is a B-tree whose nodes contain B+-trees. NB-tree insertion, deletion and update operations differ from those of B-trees, but the data is organized in the nodes of an NB-tree in a way that the properties of B-trees (in addition to other properties described later) are preserved. This allows for query and worst-case insertion times that grow only logarithmically in the data size. Moreover, insertion, deletion and update operations are first buffered in memory and batched together, which reduces the number of page accesses and the number of seek operations performed, improving significantly on amortized and worst-case insertion times compared with B-trees. These properties, complemented with Bloom filters, allow NB-trees to support insertions and queries at high rates without any delays during insertion. Next, we describe a basic version of NB-trees. We provide the final version in Section 5. We use Fig. 2 for illustration.

An NB-tree is defined as a collection of several tree structures, {D_1, ..., D_k, S}, for an integer k. D_1 to D_k are B+-trees that each store part of the data (i.e., key-value pairs) and are called data trees, or d-trees for short. Any key-value pair inserted into the index is stored in one of the d-trees, and the key-value pairs are moved between the d-trees throughout the life of an NB-tree. In Fig. 2 (e), D1, D2, D3 and D4 show four different data trees. They are all B+-trees, i.e., at the leaf level each key is written next to its corresponding value (not shown in the figure). For ease of discussion, we refer to the nodes of a data tree as data nodes or d-nodes and to the keys in a d-node as data keys or d-keys.

S is a tree structure similar to a B-tree. S is used to establish a relationship between the keys in the d-trees and to impose a structure on the d-trees. Thus, S is called a structural tree, or an s-tree for short. A structural tree is exactly a B-tree with some modifications discussed later.
In Fig. 2 (e), the ellipse labelled S shows a structural tree. Similar to a B-tree, an s-tree contains several nodes. For ease of discussion, we refer to the nodes of a structural tree as structural nodes or s-nodes and to the keys in an s-node as structural keys or s-keys.

An s-tree differs from a B-tree in the following ways. (1) An s-tree does not store any key-value pairs. It only contains keys and pointers. Keys in an s-tree are not associated with a value. For this reason we call it a structural tree (it only specifies a structure). (2) Each s-node, N, contains an extra pointer to the root d-node of a d-tree (which is a B+-tree). We call this d-tree N's d-tree (each d-tree is pointed to by exactly one s-node). The pointer in an s-node pointing to the root of its d-tree will be referred to as its d-tree pointer. In Fig. 2 (e), the pointers P1, P2, P3 and P4 are d-tree pointers for the s-nodes N1, N2, N3 and N4. (3) Leaf s-nodes only contain a d-tree pointer, and no keys or values. This is because an s-tree does not contain any data in its s-nodes. Since leaf s-nodes don't have any children, they do not contain any pointers or keys. In Fig. 2 (e), the leaf s-nodes N2, N3 and N4 do not contain any keys and only contain a d-tree pointer.

Specifically, non-leaf s-nodes in an s-tree are of the format ⟨P_d-tree, P_1, K_1, P_2, K_2, ..., P_r, K_r, P_{r+1}⟩ for an s-node with r + 1 children. The P_i are pointers to the s-nodes in the next level of the s-tree, the K_i are the corresponding s-keys, and P_d-tree is a pointer to the d-tree of the s-node. The s-keys are sorted in an s-node. For an s-key, K, in the s-node pointed to by P_i, for 1 < i < r + 1 it holds that K_{i-1} ≤ K < K_i; for i = 1, K < K_1; and for i = r + 1, K_r ≤ K. The only differences between a non-leaf s-node and a non-leaf B-tree node are that (1) an s-node has an extra pointer P_d-tree to the d-tree of the s-node and (2) s-keys are not associated with any value in the s-node. Moreover, a leaf s-node is of the format ⟨P_d-tree⟩, i.e., it only contains a d-tree pointer.

Structural Properties. The following properties are the structural properties of NB-trees.
S-tree Fanout.
Each non-leaf s-node has at most f children and each non-leaf, non-root s-node has at least ⌈f/2⌉ children. We call the parameter f the s-tree fanout. In Fig. 2, f is set to 3. Each s-node has at most 3 children, and non-leaf, non-root s-nodes must have at least 2 children.

D-tree Fanout.
Each non-leaf d-node has at most B children and each non-leaf, non-root d-node has at least ⌈B/2⌉ children. We call the parameter B the d-tree fanout. In Fig. 2, B is set to 4. Each d-node has at most 4 children, and non-leaf, non-root d-nodes must have at least 2 children.

D-tree Size.
For a parameter σ, each d-tree is at most of size σ. D-trees of leaf but not root s-nodes are at least of size ⌈σ/2⌉. σ can be specified by the number of bytes used by the d-tree or by the number of key-value pairs in the d-tree. The analysis in the paper uses the latter for ease of notation, while the former is used in the experiments as it is easier to specify in practice. Unless stated otherwise, σ refers to the number of key-value pairs in a d-tree (i.e., the number of d-keys in the leaf level). In Fig. 2, σ is set to 6. Each d-tree contains up to 6 keys in its leaves, and the d-trees of leaf but not root s-nodes contain at least 3 d-keys in their leaves.

[Figure 2: Insertions in an NB-Tree with parameters σ = 6 key-value pairs, f = 3, B = 4. (a) After insertion of the keys 1, 2, 8, 15, 21 and 32. (b) After insertion of 33. (c) After insertion of the keys 3, 4, 16, 17, 19 and 20. (d) After insertion of 10. (e) After insertion of the keys 5, 6, 11, 18, 29 and 45. (f.1) After insertion of 19.5 and flush(N1); D2 is full. (f.2) After SNodeSplit(N2); N1 now has more than 3 children. (f.3) After SNodeSplit(N1) and creating a new root s-node with its d-tree.]

Cross-s-node Linkage Property.
This property of NB-trees establishes the relationship between the s-keys in an s-node and the d-keys and s-keys of the s-node's children. Consider an s-node, N, with r s-keys, of the form ⟨P_d-tree, P_1, K_1, P_2, K_2, ..., P_r, K_r, P_{r+1}⟩, where the s-keys are in sorted order (i.e., for each i ∈ [1, r), K_i < K_{i+1}). The d-keys and s-keys stored in the subtree pointed to by P_i fall in the range determined by N's s-keys: for 1 < i < r + 1, such a key K satisfies K_{i-1} ≤ K < K_i; for i = 1, K < K_1; and for i = r + 1, K_r ≤ K.

Insertion of a key-value pair (K, V) in an NB-tree starts by inserting the pair in the d-tree of the root s-node and recursively moving the pair down the tree to ensure that the properties mentioned in Section 3.1 are satisfied. We refer to a d-tree as full if it has more than σ key-value pairs. In this section, we provide a conceptual discussion of how insertions are performed; how they are implemented in practice is discussed in Section 4.1.

Intuitively, the d-tree of each s-node can be seen as a storage space for the s-node. The key-value pairs are stored in the d-tree of each s-node. When the d-tree of an s-node N becomes full, the pairs are distributed down to the d-trees of the children of N based on N's s-keys such that the Cross-s-node Linkage Property is satisfied. This continues until the d-tree, D, of a leaf s-node, N, becomes full, in which case D and N are split into two and the median of the d-keys in D (i.e., the d-key, K, in D such that half of the d-keys in D are less than K) is inserted into the parent, P, of N. If P now has more than f children, P and its d-tree are similarly split into two. The splitting may continue until the root of the s-tree, which may result in an increase in the height of the tree. More specifically, insertion works as follows.

Insertion Operation. A new key-value pair (K, V) is always inserted into the d-tree, D, of the root s-node, N. We insert (K, V) in D using a B+-tree insertion mechanism. If D has up to σ d-keys, the insertion is finished. Otherwise, we need to ensure the d-tree size requirement is satisfied. For this, we call HandleFullSNode(N) (described later).

Example. In Fig.
2 (a), insertion of key-value pairs is done in the d-tree of the root s-node. Fig. 2 (a) shows the result of inserting the keys 1, 2, 8, 15, 21 and 32. They are all inserted into the d-tree of the root s-node. Now, inserting a new key (e.g., 33) in the d-tree of the root s-node causes the d-tree to become full, and HandleFullSNode is called on the root s-node to restore compliance with the d-tree size requirement. Fig. 2 (b) shows the result after calling HandleFullSNode; all the properties discussed in Section 3.1 are satisfied.

HandleFullSNode Operation. HandleFullSNode(N) is called to restore compliance with the d-tree size requirement when the size of the d-tree, D, of an s-node N surpasses σ. It acts differently when N is a leaf s-node and when it is not.

N is a leaf s-node. HandleFullSNode(N), if N is a leaf s-node, calls SNodeSplit(N). SNodeSplit(N) (detailed later) splits N into two s-nodes, N_small and N_large, and returns the median d-key, K_M, of the d-keys in D together with pointers P_small and P_large to N_small and N_large. Then HandleFullSNode(N) inserts K_M, P_small and P_large into the parent s-node of N and returns (similar to the insertion of the median into a parent node of a B-tree after the node splits). If N is a root s-node, i.e., has no parent, HandleFullSNode(N) creates a new root s-node and then inserts K_M, P_small and P_large into this new root (the s-tree's height increases by one).

Example. Consider Fig. 2 (a). HandleFullSNode splits the d-tree and the s-node into two, one d-tree containing the smaller half and the other the larger half of the d-keys (seen at the leaf level of Fig. 2 (b)). HandleFullSNode also creates a new root s-node and inserts the median of the d-keys into it.

N is not a leaf s-node. If N is not a leaf s-node, HandleFullSNode(N) first calls the flush(N) operation. flush(N) (detailed later) removes the key-value pairs from D and inserts them into the d-trees of N's children. After that, for any child s-node C of N, if C's d-tree is now full, HandleFullSNode(N) calls HandleFullSNode(C) recursively. If the d-tree of none of the children is full, HandleFullSNode(N) returns.

If N has k children, there can be up to k recursive calls. A recursive call HandleFullSNode(C) may result in C being split into two s-nodes, which increases the total number of children of N. Therefore, if the number of children of N becomes larger than f, HandleFullSNode(N) calls SNodeSplit(N), which splits N into s-nodes N_small and N_large and returns the median s-key K_M of N together with pointers P_small and P_large to N_small and N_large. Then HandleFullSNode(N) inserts K_M, P_small and P_large into the parent s-node of N and returns (similar to the insertion of the median into a parent node of a B-tree after the node splits). If N is a root s-node, i.e., has no parent, HandleFullSNode(N) creates a new root s-node and then inserts K_M, P_small and P_large into the new root (the s-tree's height increases by one).

Example. Consider Fig. 2 (e). HandleFullSNode(N1) first calls flush(N1), which moves d-keys from the d-tree of the root s-node N1 to its children. Fig. 2 (f.1) shows the result. Consequently, N2's d-tree becomes full (it has more than 6 keys). Thus, HandleFullSNode(N1) calls itself recursively, i.e., HandleFullSNode(N2). In the recursive call, since N2 is a leaf s-node, HandleFullSNode(N2) calls SNodeSplit(N2), which splits N2 into two. It inserts the median d-key into N1. Now N1 has more than two s-keys (Fig. 2 (f.2)). Then HandleFullSNode(N1) calls SNodeSplit(N1), which splits N1 into two. It creates a parent for N1 and inserts N1's median, 15, into N1's parent. Fig. 2 (f.3) shows the result.

SNodeSplit. SNodeSplit(N) splits an s-node N and its corresponding d-tree D into two. If N is a leaf s-node, let K_M be the median d-key of D. If N is not a leaf s-node, let K_M be the median s-key of N.
SNodeSplit(N) creates two s-nodes, N_small and N_large, with corresponding d-trees D_small and D_large. It inserts all the d-keys in D less than K_M into D_small and the d-keys at least K_M into D_large. It also inserts all s-keys in N less than K_M into N_small and the s-keys at least K_M into N_large. Let P_small be a pointer to N_small and P_large a pointer to N_large. The operation returns (K_M, P_small, P_large).

Example. See Fig. 2 (f.1) to (f.2) and Fig. 2 (f.2) to (f.3).

Flush. flush(N) is called on a non-leaf s-node, N. Intuitively, flush(N) distributes the d-keys in the d-tree of N to the d-trees of its children based on the s-keys of N. Let N contain r s-keys and be of the format ⟨P_d-tree, P_1, K_1, P_2, K_2, ..., P_r, K_r, P_{r+1}⟩. Let D denote the d-tree of N, pointed to by P_d-tree. Furthermore, let C_i be the s-node pointed to by P_i and let D_i be the d-tree of C_i. For every key K in D, we remove it from D and insert it into D_i if K_{i-1} ≤ K < K_i. We insert K into D_1 if K < K_1 and into D_{r+1} if K ≥ K_r.

Example. See Fig. 2 (e) to Fig. 2 (f.1).

Similar to LSM-trees [37], we perform deletions and updates by inserting delta records that indicate the modification into the index. Thus, deletions and updates are treated the same way as insertions and the same analysis applies to them. Note that delta records will be resolved before they reach the leaf level, or they can be discarded (if they reach the leaf level, the key they are meant to modify does not exist in the tree). Therefore, delta records do not affect the height of the s-tree and do not affect our analysis of query and insertion performance. The cases when s-nodes become underfull as a result of a deletion can also be handled using a mechanism similar to deletions in B-trees [44]. We will not provide the details for such scenarios since our focus is on write-intensive workloads, and we assume that deletions are infrequent and such cases can be ignored.
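The conceptual operations above can be sketched in a few dozen lines. The following is a purely in-memory illustration, not the paper's implementation: sorted Python lists stand in for on-disk B+-trees, values are omitted, flush moves all d-keys down (the on-disk version in Section 4 moves only the first σ), and Bloom filters and deletions are left out. Parameter values follow Fig. 2 (σ = 6, f = 3).

```python
import bisect

SIGMA, F = 6, 3   # d-tree capacity and s-tree fanout, as in Fig. 2

class SNode:
    """An s-node whose d-tree is modeled as a plain sorted list of keys."""
    def __init__(self, skeys=None, children=None, dtree=None):
        self.skeys = skeys if skeys is not None else []            # sorted s-keys
        self.children = children if children is not None else []   # child s-nodes
        self.dtree = dtree if dtree is not None else []            # sorted d-keys

    def is_leaf(self):
        return not self.children

def flush(node):
    """Distribute node's d-keys to its children by s-key range
    (the Cross-s-node Linkage rule); this sketch moves all d-keys down."""
    for key in node.dtree:
        child = node.children[bisect.bisect_right(node.skeys, key)]
        bisect.insort(child.dtree, key)
    node.dtree = []

def snode_split(node):
    """Split node and its d-tree around the median; return (K_M, small, large)."""
    if node.is_leaf():
        mid = len(node.dtree) // 2
        k_m = node.dtree[mid]
        return k_m, SNode(dtree=node.dtree[:mid]), SNode(dtree=node.dtree[mid:])
    mid = len(node.skeys) // 2
    k_m = node.skeys[mid]
    cut = bisect.bisect_left(node.dtree, k_m)    # d-keys < K_M go to the small half
    small = SNode(node.skeys[:mid], node.children[:mid + 1], node.dtree[:cut])
    large = SNode(node.skeys[mid + 1:], node.children[mid + 1:], node.dtree[cut:])
    return k_m, small, large

def handle_full(node):
    """HandleFullSNode: returns (K_M, small, large) when the caller must
    replace `node` by its two halves, and None otherwise."""
    if node.is_leaf():
        return snode_split(node)
    flush(node)
    i = 0
    while i < len(node.children):      # recurse into any child made full
        child = node.children[i]
        if len(child.dtree) > SIGMA:
            split = handle_full(child)
            if split is not None:
                k_m, small, large = split
                node.skeys.insert(i, k_m)
                node.children[i:i + 1] = [small, large]
                continue               # re-check the small half at index i
        i += 1
    if len(node.children) > F:         # fanout overflow: split this s-node too
        return snode_split(node)
    return None

def insert(root, key):
    """Insert a key (values omitted); returns the possibly-new root s-node."""
    bisect.insort(root.dtree, key)
    if len(root.dtree) > SIGMA:
        split = handle_full(root)
        if split is not None:          # the s-tree grows by one level
            k_m, small, large = split
            root = SNode(skeys=[k_m], children=[small, large])
    return root

def query(root, key):
    """Search the d-tree of every s-node on the root-to-leaf search path."""
    node = root
    while node is not None:
        i = bisect.bisect_left(node.dtree, key)
        if i < len(node.dtree) and node.dtree[i] == key:
            return True
        node = (node.children[bisect.bisect_right(node.skeys, key)]
                if node.children else None)
    return False
```

Replaying the insertions of Fig. 2 (a)-(d) (keys 1, 2, 8, 15, 21, 32, then 33, then 3, 4, 16, 17, 19, 20 and 10) reproduces the splits described above: the first overflow promotes the median 15 into a new root, and after the second the root holds the s-keys 15 and 20.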
NB-trees perform queries similarly to B-trees. However, in a B-tree, a query traverses the B-tree based on the keys in the nodes; in an NB-tree, a query traverses the s-tree based on the s-keys in the s-nodes. Furthermore, in a B-tree, a query only searches the keys in the nodes visited, whereas an NB-tree searches the s-keys in the s-nodes visited and also searches the corresponding d-tree of each s-node visited. Searching a d-tree is exactly a B+-tree search. Fig. 3 shows the query of key 11 on an NB-tree.

4. IMPLEMENTATION AND ANALYSIS

Figure 3: Querying for the key 11 in an NB-Tree. The shaded area shows the part of the tree read during the query.

To allow for fast performance, similar to LSM-trees, the d-tree corresponding to the root s-node is kept in memory. The rest of the d-trees are stored on disk. Manipulation of the s-tree is straightforward; here we focus on operations impacting on-disk d-trees. Insert and HandleFullSNode do not make any modifications to the on-disk d-trees themselves. Modifications are done through the flush and SNodeSplit operations, so we focus on them. To minimize the insertion time, we aim at minimizing the number of seek operations by performing our disk accesses sequentially. To this end, we maintain the following invariants. Firstly, all the d-nodes in a d-tree are written sequentially and can be retrieved by a sequential scan from the first node written. Secondly, the leaf d-nodes are written on disk in sorted order. Thus, a sequential scan of a d-tree from the first leaf d-node until the last d-node reads all the key-value pairs written in the d-tree in ascending order.

Flush(N). Assume that N contains r + 1 children C_i with respective d-trees D_i, 1 ≤ i ≤ r + 1, and r keys ⟨K_1, K_2, ..., K_r⟩. flush starts by sequentially scanning D and D_1, merge-sorting them together (a sequential scan of D and D_1 retrieves their keys in sorted order) and writing the output, D′, to a new disk location.
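This merge-based flush can be sketched with sorted lists standing in for on-disk runs. The function name flush_runs and the in-memory model are illustrative assumptions for this sketch; the real operation streams pages from disk sequentially:

```python
import heapq

def flush_runs(parent_run, skeys, child_runs, sigma):
    """Merge the first sigma parent d-keys into the child run each key
    belongs to (child i holds keys in [skeys[i-1], skeys[i]));
    returns the leftover parent run and the freshly written child runs."""
    moved, leftover = parent_run[:sigma], parent_run[sigma:]
    new_children = []
    for i, child in enumerate(child_runs):
        lo = skeys[i - 1] if i > 0 else float("-inf")
        hi = skeys[i] if i < len(skeys) else float("inf")
        part = [k for k in moved if lo <= k < hi]
        # one sequential merge of two sorted runs, written to a new location
        new_children.append(list(heapq.merge(part, child)))
    return leftover, new_children
```

Because both inputs of each merge are sorted runs, every child is produced by a single sequential pass, which is what keeps the number of seek operations per flush constant per child.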
Note that the two invariants mentioned above now hold for D′. From D, we only merge-sort the d-keys that are less than K_1 with D_1. We follow the same procedure and, in general, merge-sort the d-keys K such that K_{i-1} ≤ K < K_i from D with D_i for 1 < i ≤ r. For i = r + 1, we merge-sort the d-keys K such that K_r ≤ K from D with D_{r+1}, and for i = 1 the d-keys K such that K < K_1. Finally, we move down only the first σ d-keys from D if it has more. This is to avoid the sizes of the full d-trees in deeper levels of the tree getting progressively larger as a result of recursive flush calls. Because some of the d-keys may remain in a d-tree, we rewrite D starting from the (σ+1)-th d-key, thus removing the d-keys that were flushed down from D.

The cost of flush is O(σf/B). Assuming the main memory has enough space to buffer Ω(σ) key-value pairs (to buffer a constant fraction of the parent's d-tree and the d-tree of one child at a time), which is typically in the order of 100 MB, the flush operation performs a constant number of seek operations for merge-sorting N with each child and thus O(f) seek operations in total. The number of seek operations increases proportionately if there is less space available in memory.

SNodeSplit(N). The SNodeSplit(N) operation only performs disk accesses when dividing a d-tree into two. For this, we sequentially scan a d-tree and sequentially write it as two d-trees. It costs O(σ/B) page accesses and O(1) seek operations under the same conditions as above. This operation preserves the two invariants mentioned above.

Correctness. Induction on the number of insertion operations shows that the cross-s-node linkage and structural properties are preserved by the insertion algorithm. The correctness of the query operation follows from the cross-s-node linkage property, and the correctness of updates and deletions follows from the correctness of insertions.

Insertion Time Complexity.
There are at most O(n/σ) HandleFullSNode function calls on any level, because in the worst case all the keys are moved down to the leaf level and each flush moves σ keys. HandleFullSNode, excluding the recursive call, requires O(fσ/B) page accesses for flush and SNodeSplit. Each operation can be handled with O(f) seek operations. Since the height of the s-tree is O(log_f(n/σ)), the amortized insertion time is O(log_f(n/σ) × ((f/B) T_seq,W + (f/σ) T_seek)). Note that we only modify an s-node if its corresponding d-tree is modified. Thus, assuming each s-node fits in a disk page (f is typically much smaller than B), s-tree manipulations add at most one page write after writing each d-tree, which does not impact the complexity of the operations. For this version of the NB-tree, the worst-case insertion time is linear in n because all the s-nodes may be full at the same time. In Section 5 we introduce a few modifications that reduce the worst-case insertion time to logarithmic in n.

Query Time Complexity. In the worst case, the query will search one s-node in each level of the s-tree. The height of each d-tree is O(log_B σ) and the height of the s-tree is O(log_f(n/σ)); thus, the query takes time O(log_B σ × log_f(n/σ)) × (T_seek + T_seq,R). Observe that the query cost of NB-trees is asymptotically optimal, that is, it is within the constant factor log_B σ of the minimum number of page accesses required to answer a query. Note that in-memory caching, used to cache a number of levels of each d-tree, can reduce the query time by a constant factor, similar to B-trees.

NB-trees have three parameters: f, σ and B. B is set similarly to B-trees, so we focus on the other two. f provides a trade-off between insertion cost and query cost, while σ provides a trade-off between the number of seek operations per insertion and query cost. σ depends on how expensive seek operations are but, typically, for fast insertions, it is set in the order of tens or hundreds of megabytes.
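As a sanity check on these bounds, the heights and per-insertion cost factors can be computed for one assumed configuration. The parameter values below are illustrative, not measurements from the paper:

```python
def ceil_log(x, base):
    """Smallest h with base**h >= x (integer logarithm, no float error)."""
    h, v = 0, 1
    while v < x:
        v, h = v * base, h + 1
    return h

# Assumed example parameters: n keys total, sigma keys per d-tree,
# s-node fanout f, B key-value pairs per disk page.
n, sigma, f, B = 10**9, 10**6, 10, 100

stree_height = ceil_log(n // sigma, f)   # O(log_f(n / sigma)) levels
dtree_height = ceil_log(sigma, B)        # O(log_B sigma) levels per d-tree

# Per-insertion factors from the amortized bound
# O(log_f(n/sigma) * ((f/B) * T_seq,W + (f/sigma) * T_seek)):
pages_written = stree_height * f / B       # amortized sequential page writes
seeks = stree_height * f / sigma           # amortized seek operations
query_pages = stree_height * dtree_height  # worst-case pages per query

print(stree_height, dtree_height, pages_written, seeks, query_pages)
# 3 3 0.3 3e-05 9
```

Note how the seek term is divided by σ rather than B: with σ = 10⁶ keys per d-tree, an insertion incurs only a handful of seeks per hundred thousand keys, which is the source of the high sustained insertion rate.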
Typically, f is set to a number in the order of 10 for write-intensive workloads, and increasing it affects insertions much more than queries, as the insertion time depends linearly on f while the query time's dependence is only logarithmic. Section 6.2 provides an empirical analysis of parameter setting.

5. ADVANCED NB-TREE

We discuss modifications to the NB-tree design that reduce the worst-case insertion time from linear in n to logarithmic in n, and how to add Bloom filters to NB-trees to enhance their query performance. The version provided here is to be considered the final NB-tree index.

We make the following changes to the structural properties of the NB-tree. For non-leaf s-nodes, we remove the requirement that the maximum size of a d-tree be σ, and instead require the total number of key-value pairs in the d-trees of all sibling s-nodes to be at most f(σ + 1) (each s-node can still have at most f − 1 s-keys). We also require f to be at most a constant fraction of σ, which is typically true in practice.

Single Recursive Call. All the operations work the same as before, with one difference. In HandleFullSNode(N), after calling flush(N), if any s-node is oversized, HandleFullSNode will be called recursively on exactly one s-node, the one with the largest size (i.e., arg max_C |C|), instead of performing a recursive call for every full s-node. The rest of the operations work as before, but now there is at most one recursive call during a HandleFullSNode operation. The above insertion procedure remains correct and satisfies the new requirement on the maximum number of key-value pairs in the d-trees of non-leaf sibling s-nodes. This is because each level receives σ keys and flushes down σ keys (see Section 4.1) if any of the d-trees of sibling s-nodes has more than σ keys, and the requirement is already satisfied if none of the siblings has more than σ keys. For leaf s-nodes we still perform splits if their size surpasses σ keys. Thus, we can observe that the total size of siblings is at most f × σ.
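The child selection for the single recursive call can be sketched as a small helper. The function name and list-based interface are illustrative, not the paper's code:

```python
def recursion_target(dtree_sizes, sigma):
    """Index of the one child to recurse into: the child with the
    largest d-tree, and only if that d-tree is oversized; else None."""
    i = max(range(len(dtree_sizes)), key=lambda j: dtree_sizes[j])
    return i if dtree_sizes[i] > sigma else None
```

Ties resolve to the first largest child; since at most one child is recursed into, a flush triggers at most one further flush per level, which is what caps the work per insertion.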
Lazy Removal. Recall that during the flush operation, we need to remove the d-keys that were moved from the parent s-node to its children. In Section 4.1 we discussed a method that required rewriting the parent s-node's d-tree. Here, we discuss a lazy removal approach that removes this overhead. Consider the scenario when flush(N) is called; assume that N's parent is P and N's d-tree is D. Some of the d-keys of D are flushed to the d-trees of the children of N. At this stage, we create a pointer to the location of the smallest d-key K in D that is not flushed to N's children; that is, all the d-keys in D smaller than K are now present in the d-trees of N's children and need to be removed from D. Instead of removing these d-keys from D at this point, we postpone the removal until flush(P) is called (i.e., until N is a child s-node during a flush operation). When flush(P) is called, we need to flush d-keys from P's d-tree to N's d-tree. In doing so, we only merge d-keys in D that are at least K (using the pointer to K we remembered). Because of the sequentiality property, these keys can be retrieved by a sequential scan, and the existence of the keys smaller than K in N does not incur any extra cost for flush(P). After flush(P) is called, an entirely new d-tree is created for N and we discard the previous d-tree, thereby removing the keys smaller than K. This lazy removal does not incur any extra cost for insertions, as the d-keys whose removal was postponed will not be read by the insertion algorithm. Moreover, the total size of siblings will be at most f(σ + 1), because one s-node can now have at most σ more d-keys than was discussed in the paragraph above.

Deamortization.
Although the worst-case insertion time of the NB-tree with the changes discussed above is already logarithmic in the data size (shown below), we deamortize the insertion procedure by performing a 1/σ fraction of the operations for every new key inserted into the NB-tree, similar to [32], to reduce the worst-case insertion time by a factor of σ.

Insertion Time Complexity. An insertion operation performs at most one HandleFullSNode function call at each level of the s-tree, resulting in at most O(log_f(n/σ)) HandleFullSNode calls. Each flush and SNodeSplit step takes O(fσ/B) I/O operations. Thus, the total time taken for one insertion call is O(log_f(n/σ) × ((fσ/B) T_seq,W + f T_seek)). Deamortization reduces the cost by a factor of σ, so we achieve the worst-case insertion time O(log_f(n/σ) × ((f/B) T_seq,W + (f/σ) T_seek)). The amortized insertion time in this case is the same as the worst-case insertion time. This shows that NB-trees achieve a good amortized and worst-case insertion time compared with LSM-trees and B-trees, as shown in Table 1.

Query Time Complexity. The maximum size of an s-node is f(σ + 1), since that is the maximum total size of sibling s-nodes together. Thus, the query cost is now at most O(log_B(fσ) × log_f(n/σ)) based on an analysis similar to Section 3.2.3, but with the changed maximum size of an s-node. f is at most a fraction of σ, and thus log_B(fσ) is O(log_B(σ²)), which is O(log_B σ). Hence, the query cost is O(log_B σ × log_f(n/σ)), which is asymptotically optimal as discussed in Table 1.

We use Bloom filters to enhance the average query cost. A Bloom filter uses k bits per key and h hash functions to decide whether a key exists in a data structure. When searching for a key, if the Bloom filter returns negative, the key definitely does not exist in the data structure.
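A minimal Bloom filter with k bits per key and h hash functions can be sketched as follows. This is an illustrative construction (SHA-256-derived positions, and the class name BloomFilter are assumptions of this sketch); the paper does not prescribe a particular implementation:

```python
import hashlib

class BloomFilter:
    def __init__(self, n_keys, k=8, h=3):
        self.m = max(8, n_keys * k)              # total bits: k per expected key
        self.h = h
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        for i in range(self.h):                  # h independent hash positions
            d = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False is definitive; True may be a false positive.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))
```

A negative answer lets a query skip reading an s-node's d-tree entirely, which is why, with high probability, only one d-tree is searched per query.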
When it returns positive, the key may still not exist in the data structure, with a probability dependent on k and h (e.g., k = 8 and h = 3 results in a false-positive probability of less than 5%). We use a Bloom filter for the d-tree of each s-node. We need to create/modify the Bloom filters during flush or SNodeSplit operations. For child s-nodes in flush and for all the s-nodes in SNodeSplit, as we create a new d-tree for the s-nodes, we create a new Bloom filter for this d-tree and delete the old Bloom filter if it exists. For the parent s-node in flush, as mentioned above we use lazy removal: the d-tree is kept until the s-node is a child in a flush operation, when a new d-tree is created and the old d-tree discarded. We similarly keep the Bloom filter and create a new one only when the s-node is a child in a flush operation.

To search for a key q, we start our search from the root s-node. We check whether the Bloom filter for the root indicates that the d-tree of the root can contain q. If yes, we search the root's d-tree. If the Bloom filter is negative, or the search does not find q, we move down one level according to the pointers and perform the search recursively on the subtree rooted at that node. Overall, in the worst case, we go through all the levels of the s-tree and search the corresponding d-trees, which gives the same worst-case query time as before. However, with high probability, we only search one s-node in total and the cost will be O(log_B σ), which is a constant. Thus, NB-trees have a good average query time, as mentioned in Table 1.

6. EMPIRICAL STUDIES

6.1 Experimental Setup

We ran our experiments on a machine with Intel Core i5 3.20GHz CPUs and 8 GB RAM running CentOS 7. This machine has (1) a 250GB, 7200 rpm hard disk and (2) an SSD with the model "Crucial MX500" and a storage size of 1TB. Each disk page is 4KB. All algorithms were implemented in C/C++.

Dataset. Following [18, 17, 16], we conducted experiments on synthetic datasets.
Specifically, we generated synthetic datasets with n key-value pairs where each key is 8 bytes and each value is 128 bytes. Following [18, 17, 16], we generated keys uniformly to focus on worst-case performance. The largest dataset generated is about 250 GB (roughly 2 × 10⁹ keys).

Workload. We designed an insert workload and a query workload to study the query and insertion performance of different indices. Each insert workload starts from an empty dataset and involves n_I insertion operations. Each query workload involves n_Q query operations performed on an index built from the dataset containing n keys. n_Q is set to 10 throughout the experiments. In the query workload, we select keys uniformly from existing keys as the query input.

Measurements. Based on the four performance metrics discussed in Section 1, we designed measurements for each of the two workloads. Consider an insert workload involving n_I insertion operations. We have two measurements, namely (1) average insertion time and (2) maximum insertion time. (1) Average insertion time is defined as the average time taken per key to finish the entire insert workload, i.e., t_I/n_I, where t_I is the total time taken to complete the n_I insertion operations. Average insertion time helps us verify our theoretical results on amortized insertion time. (2) Maximum insertion time is a measure on the entire workload: the maximum insertion time of any key over the entire workload. Maximum insertion time helps us verify our theoretical results on worst-case insertion time. Consider a query workload involving n_Q query operations. We have two measurements: (1) the average query time and (2) the maximum query time. (1) The average query time is a measure on the entire workload, defined as the average time taken per key to finish the entire query workload, i.e., t_Q/n_Q, where t_Q is the total time taken to complete the n_Q query operations in the workload.
The average query time helps us verify our theoretical results on average query time. (2) The maximum query time is a measure on the entire workload, defined as the maximum query time of any query in the entire workload. The maximum query time helps us verify our theoretical results on worst-case query time.

Algorithms. We compared our index, the NB-tree, with six other indices: (1) LevelDB [23], (2) RocksDB [21, 22], (3) bLSM [42], (4) Bε-tree [10], (5) B-tree [7] and (6) B+-tree [44]. The first three (i.e., LevelDB, RocksDB and bLSM) are three different implementations of LSM-trees. Note that there exist many other variants of LSM-trees [17, 33, 54] which optimize insertion/query performance; these are discussed in detail in Section 7. However, as discussed in Section 7, the performance optimization techniques originally designed for LSM-trees could also be applied to NB-trees. Thus, these techniques are orthogonal to our work. For fairness, we do not include the other variants of LSM-trees in the comparison.

Moreover, we ran a preliminary experiment in which we inserted about 6GB of raw data and measured the average insertion time of all the algorithms. If the average insertion time was larger than 100 µs, we excluded the algorithm from the rest of the experiments. This is because, based on this result, we can conclude that the algorithm is not suitable for insertion-intensive workloads and it would be infeasible to run such an algorithm on the large datasets in our experiments.

(1) LevelDB: LevelDB [23] is a widely used key-value store implementing an LSM-tree and has been used in the experiments of many existing studies [16, 42, 33, 52, 43]. In order to have a fair comparison, we adopt two different parameter settings for LevelDB, namely leveldb-default and leveldb-tuned. leveldb-default is LevelDB with the default setting, similar to [16, 42] (i.e., multiplying factor = 10, in-memory write buffer size = 4 MB and no Bloom filter feature enabled).
In our preliminary experimental results, we found that the average insertion time of leveldb-default is larger than 100 µs. In the later experiments, we exclude this algorithm from our experimental results since it could not handle insertions with an average insertion time smaller than 100 µs. leveldb-tuned is LevelDB with a setting "tuned" for the best insertion performance. Specifically, in leveldb-tuned, following [17, 42], we enabled the Bloom filter feature using 10 bits per key for short query time. Due to the large available memory, we varied the user parameter called "in-memory write buffer size" from 10 MB to 100 MB to determine the "best" buffer size, i.e., the one giving the smallest average insertion time. When the buffer size is larger, LevelDB has fewer merge operations, resulting in a smaller insertion time; but at the same time, each merge takes longer, resulting in a larger insertion time. In our experiments, we found 32 MB to be the "best" buffer size. Thus, leveldb-tuned is LevelDB with the setting where multiplying factor = 10, in-memory write buffer size = 32 MB and the Bloom filter feature enabled.

(2) RocksDB: RocksDB [21] is a fork of LevelDB with some new features that are not necessarily relevant to our work (e.g., parallelism; see [22] for details). However, we observed that the two performed differently under our workloads, so we include both. Similar to LevelDB, we performed parameter tuning for RocksDB and observed that setting the write buffer size to 2GB gives the best average insertion time. We refer to this algorithm as rocksdb-tuned. Bloom filters are enabled and set to 10 bits per key.

(3) bLSM: bLSM [42] is a variant of an LSM-tree proposed for high query performance and low insertion delay. For a fair comparison, we obtained a parameter setting of bLSM with the best performance.
We varied the user parameter of the in-memory component size to determine the "best" in-memory component size with the "best" insertion and query performance (increasing the memory size improves both insertion and query performance). We found that 6 GB is the "best" size. In our experiments, we adopted this setting.

Figure 4: NB-Tree Performance vs. Fanout ((a) Avg. Query Time; (b) Avg. Insertion Time; curves for σ = 64 and σ = 2048).

Figure 5: NB-Tree Performance vs. S-Node Size ((a) Avg. Query Time; (b) Avg. Insertion Time; curves for f = 3, 6, 12).

Figure 6: Avg. Insertion Time vs. Data Size ((a) on SSD; (b) on HDD).

Figure 7: Max. Insertion Time vs. Data Size ((a) on SSD; (b) on HDD).

(4) Bε-trees: We implemented two versions of the Bε-tree, namely (a) Public-Version and (b) Own-Version. (a) Public-Version is a publicly available version of the Bε-trees used in the system TokuDB [39]. We adopted the default settings of TokuDB. However, TokuDB's average insertion time in our preliminary experiments is more than 200 µs.
It was not feasible to run TokuDB in our experiments, which require the insertion time to be at most 100 µs. (b) Own-Version is our own implementation of the Bε-tree. Own-Version could not handle insertions with an average insertion time less than 100 µs either. Thus, since the Bε-tree (both Public-Version and Own-Version) is not suitable for high-insertion-rate workloads, we exclude it from our experimental results.

(5) B-trees and (6) B+-trees: Similar to Bε-trees, we implemented two versions of B+-trees, namely Public-Version and Own-Version. Here, Public-Version denotes the B+-trees used in WiredTiger, a storage engine in MongoDB [47]. Similarly, we exclude B-trees and B+-trees from our experimental results since they could not handle insertions with an insertion time smaller than 100 µs per insertion. However, since it is well known that B+-trees are good for fast queries, we implemented a "bulk-load" version of a B+-tree, called B+-tree(bulk), as a baseline to compare the query performance among all indices in the experiments. We implemented B+-tree(bulk) by pre-sorting the data and adopting a bottom-up bulk-loading approach [44]. We do not include any measurement of the insertion statistics for B+-tree(bulk) since it does not show the realistic insertion performance of a B+-tree. The query performance of the bulk-load version of a B+-tree (i.e., B+-tree(bulk)) is better than that of the "normal insertion" version of a B+-tree because B+-tree(bulk) can be constructed such that almost all its nodes are full; the data is then not scattered across different disk pages, resulting in a lower seek time and a smaller query time. It is not easy to design a "bulk load" version of B-trees (since some key-value pairs are stored in internal nodes and some in leaf nodes) and thus we do not include one.

NB-Trees. We implemented the final version of the NB-tree discussed in Section 5, referred to as NB-Tree.
We set f to 3 and σ to 2 GB after conducting experiments to find the "best" parameters for the NB-tree, to be shown in Section 6.2.

In this section, our experiments measure the average insertion time for 25GB of raw data (n_I ≈ 2 × 10⁸ keys) and the average query time on a database of size 25GB (about 2 × 10⁸ keys). We ran each experiment on an HDD three times and averaged the results, shown in Figs. 4-5.

Fanout. We studied the effect of the fanout f for a small σ value, 64MB, and a large σ value, 2048MB, on NB-trees. Fig. 4 (a) shows that when σ = 64, increasing f causes the average query time to decrease. However, the trend is the opposite when σ = 2048. This is because query time depends on the number of page accesses and the seek time for the accesses. When σ is small, increasing f reduces the height by a lot (from 8 levels when f = 3 to 4 levels when f = 15). When the height is smaller, fewer Bloom filters are checked, decreasing the probability that at least one of the Bloom filters returns a false positive. Thus, increasing f reduces the number of page accesses and the query time. However, for large values of σ, increasing f does not change the height by much (from 4 levels when f = 3 to 3 levels when f = 15). In this case, most queries perform only one disk access. Note that d-trees of sibling s-nodes are written sequentially to the disk. Thus, when f is large, keys that are close to each other in the key space are written close to each other on disk. However, the query distribution is uniform, and it is likely that consecutive query keys are not close to each other in the key space. Hence, when f is large, the seek time during the queries becomes larger; this is less of an issue when f is small. Therefore, increasing f increases the seek time for queries. As a result, for σ = 2048 MB, the query time worsens when f increases. Fig. 4 (b) shows that the insertion time increases when f increases.
This result generally follows the theoretical model, where the factor f in the amortized insertion time complexity causes the insertion time to increase as f gets larger.

D-tree size. Fig. 5 shows that, generally, a larger σ improves the insertion time but worsens the query time, as the theory suggests. However, one interesting observation is a local minimum at σ = 16 MB for the average insertion time in Fig. 5 (b). This can be attributed to the HDD cache being 16 MB, which improves the sequential I/O performance during HandleFullSNode. As σ gets beyond 4 GB, the insertion time increases since NB-Tree no longer fits in main memory. The improvement in query performance when σ is larger than 1 GB is because the main memory component becomes large compared to the data size and some of the queries are answered by just checking the in-memory component.

Figure 8: Avg. Query Time vs. Data Size ((a) on SSD; (b) on HDD).

Figure 9: Max. Query Time vs. Data Size ((a) on SSD; (b) on HDD).

Table 2: Summary of the theoretical results (performance in terms of time). Amortized and worst-case insertion times are of the form O(α × T_seq,W + β × T_seek); worst-case query time is of the form O(α × (T_seq,R + T_seek)).

B-tree [7]: amortized insertion α = β = log_B n; worst-case insertion α = β = log_B n; worst-case query α = log_B n.
Bε-tree [10]: amortized insertion α = β = (f · log_f B · log_B n)/B; worst-case insertion α = β = (f · log_f B · log_B n)/B; worst-case query α = log_f B · log_B n.
LSM-tree [37]: amortized insertion α = (f · log_f B · log_B n)/B; worst-case insertion α = n/B, β = log_f B · log_B n; worst-case query α = log_f B · (log_B n)².
NB-tree (our paper): amortized insertion α = (f · log_f B · log_B n)/B, β = (f · log_f B · log_B n)/σ; worst-case insertion α = (f · log_f B · log_B n)/B, β = (f · log_f B · log_B n)/σ; worst-case query α = log_B σ · log_f(n/σ).

Parameter Setting. In the rest of the experiments, we optimize the NB-tree for an insertion-intensive workload. We select σ = 2 GB, which has the best insertion performance based on Fig. 5 (b), and set f = 3 because, for σ = 2 GB in Fig. 4 (b), f = 3 has the best insertion performance. Based on this parameter setting, we note that NB-Tree's memory usage is as follows: for a data size of 250GB (the maximum data size used in our experiments), about 2.3GB is allocated for caching Bloom filters and 1GB for caching non-leaf nodes of d-trees. Interestingly, even when optimizing NB-trees for insertions, they perform queries almost as fast as a B+-tree.

Average insertion time. Fig. 6 shows the average insertion time of the indices on HDD and SSD. NB-Tree achieves the lowest time on both HDD and SSD, while bLSM's performance deteriorates as the data size gets larger because it keeps the number of components constant. rocksdb-tuned performs similarly to NB-Tree on HDDs, but its performance is worse on SSDs. Note that the performance advantage of NB-Tree compared with rocksdb-tuned and bLSM is more visible on SSDs. This shows that NB-trees perform better on larger data sizes, when the ratio between the data size and the in-memory component is larger.

Maximum insertion time. Fig. 7 shows the maximum insertion time of the indices.
NB-Tree achieves the lowest time on both HDD and SSD, outperforming the other algorithms by at least 1000 times for some data sizes on both HDD and SSD. The maximum insertion time of rocksdb-tuned, bLSM and leveldb-tuned goes as high as more than 0.2 s (for rocksdb-tuned, this number is 453 s), which is unacceptable for many applications. The superior performance of NB-Tree is due to its logarithmic worst-case time together with the deamortization mechanism suggested in Section 5. Observe that rocksdb-tuned has a maximum insertion time of 453 seconds; even though this happens only once during the insertion process, it makes the system unreliable.

Average query time. Fig. 8 shows the average query time of the indices. NB-Tree achieves a query time almost as low as B+-tree(bulk) (which is worst-case optimal). rocksdb-tuned, leveldb-tuned and bLSM have query times larger than NB-Tree, more prominently on SSDs.

Maximum query time. Fig. 9 shows the maximum query time of the indices. rocksdb-tuned has the worst performance, while B+-tree(bulk) is generally better. Note that all queries have to wait for at least one disk I/O operation, but an I/O operation can take long if the operating system is busy or if there are disk failures. Thus, the maximum query time has a large variance and the comparison among the algorithms is less conclusive (note that insertions do not need to wait for disk I/O operations due to in-memory buffering).

Summary. The average query time of an NB-tree is 4 times smaller than that of LevelDB and 1.5 times smaller than those of bLSM and RocksDB. It is similar to the average query time of a nearly optimally constructed, bulk-loaded B+-tree, whereas building a B+-tree incrementally takes orders of magnitude longer than building an NB-tree. Besides, the average and maximum insertion times of an NB-tree (which are at most 0.0001 s) are multiple factors smaller than those of LevelDB, RocksDB and bLSM (which can be greater than 0.2 s). Overall, an NB-tree provides more reliable insertion and query performance.

7. RELATED WORK

We discuss indices used for insertion-intensive workloads.

LSM-trees. The LSM-tree is an index for insertion-intensive workloads used in many systems such as BigTable [12], LevelDB [23], Cassandra [30], HBase [1], RocksDB [22], Walnut [14] and AsterixDB [2]. By using an in-memory component and several on-disk B-tree components, LSM-trees [37] perform very few seek operations during insertions. However, this design causes a sub-optimal number of I/O operations during queries, and a linear worst-case insertion time that causes long insertion delays (see [46, 32] for a discussion of the LSM-tree's performance). Many improvements to the LSM-tree design have been proposed, as discussed below.

Query improvement. [42] uses Bloom filters to improve the query time and [16] tunes the Bloom filter parameters. Compared with LSM-trees, we showed that the Bloom filters adopted by NB-trees provide better theoretical and empirical performance. The method of [16] can also be used by NB-trees to optimize the Bloom filter parameters. Moreover, [28, 15] partition an LSM-tree into several smaller LSM-tree components, which provides a constant-factor improvement. [32] uses fractional cascading [13] to provide asymptotically optimal worst-case query time. Fractional cascading connects different LSM-tree components to each other. Consider the B+-tree of the i-th level of the LSM-tree. In each leaf node N of the B+-tree, some key-value pairs have extra pointers pointing to a node N′ of the (i+1)-th level. The pointers from the i-th level to the (i+1)-th level are called fence pointers. Fence pointers satisfy the properties that (1) the first key-value pair k of node N must have a fence pointer pointing to a next-level node N′, and every node N′ at level i+1 must have a fence pointer pointing to it from level i.
(2) Consider two keys, k_s and k_l, in level i that have fence pointers to nodes N_s and N_l in level i+1, such that there does not exist another key k in level i that has a fence pointer and satisfies k_s < k < k_l. Let r_s be the smallest key in N_s and r_l the smallest key in N_l. It holds that r_s ≤ k_s < k_l ≤ r_l. These properties make it possible to perform a constant number of disk-page accesses at each level.

LSM-trees with fractional cascading suffer from large worst-case insertion time and are not compatible with Bloom filters [42]. Thus, they provide worse query performance in practice. The reason for the incompatibility is that, to search the (i+1)-th level using the i-th level's fence pointers, we need to have searched the i-th level; by induction, we need to have searched all the levels of the LSM-tree. However, using a Bloom filter is only advantageous when we do not need to search all levels of the LSM-tree.

Insertion improvement. Most of the focus has been on optimizing the merge operation, with approaches divided into leveling and tiering categories. Leveling is the category discussed so far, which sorts each LSM-tree component during the merge. Tiering, during a merge operation, appends the data to the lower level and only sorts a level after it is full. This avoids rewriting the lower-level component during the merge operation, at the expense of query time. [17] uses the leveling merge policy at some levels of the tree and the tiering merge policy at others. In [18], unlike the original design, the size ratio across adjacent levels of the LSM-tree is not constant. More variations of tiering are discussed in [55, 6, 53, 52, 38, 54]. [9] discusses in-memory optimization for faster writes. These improvements are orthogonal to our work and can be adopted by NB-trees in the future. [33] discusses a theoretical model to analyze the insertion performance of LevelDB and provides methods for parameter optimization.
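The contrast between the leveling and tiering merge policies can be illustrated with a small sketch. The level capacities, the growth factor T, and the list-of-runs representation below are illustrative assumptions, not the design of any specific system:

```python
# Illustrative sketch of leveling vs. tiering merge policies for an
# LSM-style index. The growth factor T and level-0 capacity are
# hypothetical; real systems add many refinements on top of this.

T = 4          # size ratio between adjacent levels (assumption)
L0_CAP = 4     # capacity of level 0 in keys (assumption)

def capacity(level):
    return L0_CAP * (T ** level)

class Leveling:
    """Each level is a single sorted run; merges rewrite the whole level."""
    def __init__(self):
        self.levels = [[]]

    def insert(self, key):
        self._merge_into(0, [key])

    def _merge_into(self, i, run):
        if i == len(self.levels):
            self.levels.append([])
        # Leveling: re-sort the entire level on every merge.
        self.levels[i] = sorted(self.levels[i] + run)
        if len(self.levels[i]) > capacity(i):
            full = self.levels[i]
            self.levels[i] = []
            self._merge_into(i + 1, full)

class Tiering:
    """Each level holds several sorted runs; the level is only merged
    into one run and pushed down once it is full."""
    def __init__(self):
        self.levels = [[]]

    def insert(self, key):
        self._append_run(0, [key])

    def _append_run(self, i, run):
        if i == len(self.levels):
            self.levels.append([])
        self.levels[i].append(sorted(run))   # cheap append, no rewrite
        if len(self.levels[i]) > T:          # level full: merge now
            merged = sorted(k for r in self.levels[i] for k in r)
            self.levels[i] = []
            self._append_run(i + 1, merged)
```

The query-time cost of tiering mentioned above is visible in this sketch: a lookup under tiering may have to examine every run in a level, whereas under leveling each level is a single sorted run.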
The methods of [33] require knowledge of the probability distribution of the keys in advance and perform time-consuming optimizations that are not feasible in the real world. Thus, we did not include their method in our experiments. [46, 32, 42] discuss reducing the worst-case insertion time, but their methods take time linear in the data size, compared with the logarithmic worst-case time of NB-trees.

B-tree and B-tree with Buffer. B-trees [7] are read-optimized indices, performing an optimal number of I/O operations during queries [10]. However, they perform a seek operation for every page access, sacrificing their insertion performance. B-trees with Buffer [10] (also known as Bε-trees) are a write-optimized variant of B-trees where part of the disk page allocated to each node is reserved for a buffer. The buffer is flushed down the tree when it becomes full. B-trees with Buffer can be seen as a special case of NB-trees where the s-node size is one disk page, and the analysis of their query and insertion performance follows from that of NB-trees. In such a case, all disk accesses involve a seek operation, worsening the insertion performance, as our experiments confirmed. They also have worse space utilization, since they allow half-full nodes, and worse range query performance, since their nodes are not written sequentially on the disk. NB-trees keep their d-nodes full and write them sequentially for each s-node.

Other data structures. Many write-optimized data structures such as [8, 25, 50, 35] have been proposed for a variety of settings, and we do not have space to cover them all. Among them, the Y-tree [27] is similar to B-trees with Buffer but allows for larger unsorted buffers at each non-leaf level of the B-tree, which reduces the number of seek operations performed during insertions (this can also be seen as a form of tiering). For a buffer similar in size to that of a B-tree with Buffer, their performance will be similar to B-trees with Buffer, with the same weaknesses.
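The buffered-insertion idea behind B-trees with Buffer can be sketched as follows. The buffer capacity and the fixed two-level layout are illustrative assumptions; an actual Bε-tree generalizes this to arbitrary heights and ties the buffer size to the node's disk page:

```python
# Minimal sketch of buffered insertion in a B-tree with Buffer
# (B^epsilon-tree). BUFFER_CAP and the two-level layout below are
# illustrative assumptions; in the real structure the buffer shares
# the node's disk page and flushes cascade down the tree.

BUFFER_CAP = 4  # buffered insertions per internal node (assumption)

class Leaf:
    def __init__(self):
        self.keys = []

    def insert(self, key):
        self.keys.append(key)
        self.keys.sort()

class Internal:
    def __init__(self, pivots, children):
        self.pivots = pivots    # pivots[i] separates children[i] and children[i+1]
        self.children = children
        self.buffer = []        # pending insertions, not yet routed down

    def insert(self, key):
        # An insertion only touches this node's buffer, so it costs a
        # single page write instead of a root-to-leaf traversal.
        self.buffer.append(key)
        if len(self.buffer) >= BUFFER_CAP:
            self.flush()

    def flush(self):
        # When the buffer fills, route each buffered key to the child
        # it belongs to; one flush moves many keys at once, amortizing
        # the I/O cost per insertion.
        for key in self.buffer:
            i = sum(1 for p in self.pivots if key >= p)
            self.children[i].insert(key)
        self.buffer = []

root = Internal(pivots=[100], children=[Leaf(), Leaf()])
for k in [5, 150, 42, 7, 130]:
    root.insert(k)
```

The seek-operation weakness discussed above also shows up here: because the buffer lives inside a single one-page node, every flush still pays a seek per child page it touches, whereas an NB-tree's larger s-nodes allow many pages to be written per seek.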
However, a larger buffer worsens point query performance (although range queries are not affected as adversely), since it requires searching multiple pages of the unsorted buffer at each level of the tree with long scans. Y-trees also suffer from the issues mentioned above regarding the space utilization and range-query seek operations of B-trees with Buffer. Finally, Mass-tree [36] is an in-memory data structure that, like this paper, uses a nested index; however, its structural tree is a trie, which, although it works well in memory, can be unbalanced and cause large insertion and query costs if adopted for secondary storage.

In-memory optimization is outside the scope of this paper, but in-memory optimizations for B-trees such as [31, 40] improve their in-memory performance. However, their on-disk insertion performance is the same as B-trees', which is worse than NB-trees' in terms of amortized insertion time.

Summary. Table 2 shows the theoretical performance of the indices mentioned above (written as multiples of log_B n for easier comparison). For amortized insertion time, NB-trees perform σB times fewer seek operations than Bε-trees, σf log_f B times fewer than B-trees, and a similar number to LSM-trees (σ is typically on the order of 10,000 times larger than B). NB-trees have worst-case insertion time logarithmic in the data size, while LSM-trees' worst-case insertion time is linear in the data size. NB-trees' query time is a factor of log_σ n smaller than LSM-trees' and is asymptotically optimal. Overall, NB-trees have better worst-case insertion and query time (considering the number of seek operations) than existing indices while maintaining practical properties, such as compatibility with Bloom filters and high space utilization.

8.
CONCLUSION

We introduced Nested B-trees, an index that theoretically guarantees logarithmic worst-case insertion time and asymptotically optimal query time, and thus supports insertions at high rates with no delays while performing fast queries. This significantly improves on LSM-trees' linear worst-case insertion time and suboptimal query time, and avoids the long delays that frequently occur in LSM-trees during insertions. We empirically showed that NB-trees outperform RocksDB [21], LevelDB [23] and bLSM [42], commonly used LSM-tree databases, performing insertions faster than them, with maximum insertion times 1000 times smaller and query times lower by a factor of at least 1.5. NB-trees perform queries as fast as B-trees on large datasets, while performing insertions at least 10 times faster. In the future, a more detailed study can be done on optimizing the in-memory caching of the metadata, optimizing the parameter setting of Bloom filters, and using different flushing schemes such as tiering.

REFERENCES

[1] A. S. Aiyer, M. Bautin, G. J. Chen, P. Damania, P. Khemani, K. Muthukkaruppan, K. Ranganathan, N. Spiegelberg, L. Tang, and M. Vaidya. Storage infrastructure behind Facebook Messages: Using HBase at scale. In IEEE Data Eng. Bull., 2012.
[2] S. Alsubaiee, A. Behm, V. Borkar, Z. Heilbron, Y.-S. Kim, M. J. Carey, M. Dreseler, and C. Li. Storage management in AsterixDB. In VLDB'14, 2014.
[3] Amazon. AWS pricing. https://aws.amazon.com/ec2/pricing/on-demand/, 2019.
[4] Amazon. RAM cost. https://tinyurl.com/u7u3nna, 2019.
[5] Amazon. SSD cost. https://tinyurl.com/wpzhw4u, 2019.
[6] O. M. Balmau, D. Didona, R. Guerraoui, W. Zwaenepoel, H. Yuan, A. Arora, K. Gupta, and P. Konka. TRIAD: Creating synergies between memory, disk and log in log structured key-value stores. In USENIX ATC'17, 2017.
[7] R. Bayer and E. McCreight. Organization and maintenance of large ordered indexes. In Acta Informatica, 1972.
[8] M. A. Bender, M. Farach-Colton, R. Johnson, S. Mauras, T. Mayer, C. A. Phillips, and H.
Xu. Write-optimized skip lists. In PODS'17, 2017.
[9] E. Bortnikov, A. Braginsky, E. Hillel, I. Keidar, and G. Sheffi. Accordion: Better memory organization for LSM key-value stores. Proc. VLDB Endow., 11(12):1863–1875, Aug. 2018.
[10] G. S. Brodal and R. Fagerberg. Lower bounds for external memory dictionaries. In ACM-SIAM'03, 2003.
[11] B. Chandramouli, G. Prasaad, D. Kossmann, J. Levandoski, J. Hunter, and M. Barnett. FASTER: A concurrent key-value store with in-place updates. In SIGMOD'18, pages 275–290, New York, NY, USA, 2018. ACM.
[12] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In ACM Transactions on Computer Systems, 2008.
[13] B. Chazelle and L. J. Guibas. Fractional cascading: I. A data structuring technique. In Algorithmica, 1986.
[14] J. Chen, C. Douglas, M. Mutsuzaki, P. Quaid, R. Ramakrishnan, S. Rao, and R. Sears. Walnut: A unified cloud object store. In SIGMOD'12, 2012.
[15] L. Dai, J. Fu, and C. Feng. An improved LSM-tree index for NoSQL data-store. In International Conference on Computer Science and Technology, 2017.
[16] N. Dayan, M. Athanassoulis, and S. Idreos. Monkey: Optimal navigable key-value store. In SIGMOD'17, 2017.
[17] N. Dayan and S. Idreos. Dostoevsky: Better space-time trade-offs for LSM-tree based key-value stores via adaptive removal of superfluous merging. In SIGMOD'18, 2018.
[18] N. Dayan and S. Idreos. The log-structured merge-bush & the wacky continuum. In SIGMOD, 2019.
[19] R. Diorio, V. Timóteo, and E. Ursini. Testing an IP-based multimedia gateway. INFOCOMP, 13(1):21–25, 2014.
[20] DMR. Dropbox statistics. https://expandedramblings.com/index.php/dropbox-statistics/, 2017.
[21] Facebook. RocksDB documentation. https://github.com/facebook/rocksdb, 2018.
[22] Facebook.
RocksDB features not in LevelDB. https://github.com/facebook/rocksdb/wiki/Features-Not-in-LevelDB, 2018.
[23] Google. LevelDB documentation. https://github.com/google/leveldb/blob/master/doc/impl.md, 2017.
[24] http://wersm.com/. Facebook statistics. http://wersm.com/how-much-data-is-generated-every-minute-on-social-media/, 2017.
[25] J. Iacono and M. Pătrașcu. Using hashing to solve the dictionary problem. In ACM-SIAM'12. VLDB, 1999.
[28] C. Jermaine, E. Omiecinski, and W. G. Yee. The partitioned exponential file for database storage management. In VLDB'17, 2007.
[29] C. Kim and S.-U. Yang. Like, comment, and share on Facebook: How each behavior differs from the other. Public Relations Review, 43(2):441–449, 2017.
[30] A. Lakshman and P. Malik. Cassandra: A decentralized structured storage system. In ACM SIGOPS Operating Systems Review, 2010.
[31] J. J. Levandoski, D. B. Lomet, and S. Sengupta. The Bw-tree: A B-tree for new hardware platforms. In , pages 302–313. IEEE, 2013.
[32] Y. Li, B. He, R. J. Yang, Q. Luo, and K. Yi. Tree indexing on solid state drives. In VLDB'10, 2010.
[33] H. Lim, D. G. Andersen, and M. Kaminsky. Towards accurate and fast evaluation of multi-stage log-structured designs. In FAST, 2016.
[34] H. Lim, B. Fan, D. G. Andersen, and M. Kaminsky. SILT: A memory-efficient, high-performance key-value store. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pages 1–13. ACM, 2011.
[35] L. Lu, T. S. Pillai, H. Gopalakrishnan, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. WiscKey: Separating keys from values in SSD-conscious storage. In ACM Transactions on Storage, 2017.
[36] Y. Mao, E. Kohler, and R. T. Morris. Cache craftiness for fast multicore key-value storage. In Proceedings of the 7th ACM European Conference on Computer Systems, pages 183–196. ACM, 2012.
[37] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (LSM-tree). In Acta Informatica, 1996.
[38] F. Pan, Y. Yue, and J. Xiong.
dCompaction: Delayed compaction for the LSM-tree. In International Journal of Parallel Programming. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. SIGMOD'12, 2012.
[43] P. Shetty, R. P. Spillane, R. Malpani, B. Andrews, J. Seyster, and E. Zadok. Building workload-independent storage with VT-trees. In FAST, 2013.
[44] A. Silberschatz, H. F. Korth, S. Sudarshan, et al. Database system concepts. ICDE'17. Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 231–242. ACM, 2010.
[50] P. Wang, G. Sun, S. Jiang, J. Ouyang, S. Lin, C. Zhang, and J. Cong. An efficient design and implementation of LSM-tree based key-value store on open-channel SSD. In European Conference on Computer Systems. USENIX ATC'15, 2015.
[53] T. Yao, J. Wan, P. Huang, X. He, Q. Gui, F. Wu, and C. Xie. A light-weight compaction tree to reduce I/O amplification toward efficient key-value stores. In International Conference on Massive Storage Systems and Technology, 2017.
[54] Y. Yue, B. He, Y. Li, and W. Wang. Building an efficient put-intensive key-value store with skip-tree. In IEEE Transactions on Parallel and Distributed Systems, 2017.
[55] W. Zhang, Y. Xu, Y. Li, and D. Li. Improving write performance of LSMT-based key-value store. In