CedrusDB: Persistent Key-Value Store with Memory-Mapped Lazy-Trie
Maofan Yin, Hongbo Zhang, Robbert van Renesse, Emin Gün Sirer
Cornell University
Abstract
As RAM is becoming cheaper and growing abundant, it is time to revisit the design of persistent key-value storage systems. Most of today's persistent key-value stores are based either on a write-optimized Log-Structured Merge tree (LSM) or a read-optimized B+-tree. Instead, this paper introduces a new design called "lazy-trie" to index the persistent storage, a variant of the hash-trie data structure. A lazy-trie achieves near-optimal height and has practical storage overhead. Unlike other balanced data structures, it achieves this without the need for expensive reorganization of subtrees, allowing improved write performance.

We implemented CedrusDB, a lazy-trie-based persistent key-value store that supports point lookups and updates. CedrusDB outperforms both popular LSM- and B+-tree-based key-value stores in mixed workloads. The implementation of CedrusDB uses memory-mapping. The lazy-trie organization in virtual memory allows CedrusDB to better leverage concurrent processing than other organizations.

1 Introduction

Persistent key-value stores have become an indispensable part of applications such as web servers [5], cloud storage [17, 46], machine learning [35], mobile apps [22], and even blockchain infrastructures [39, 43]. There are two main on-disk data structures that are used extensively for general-purpose persistent key-value stores: B+-trees and Log-Structured Merge trees (LSMs). B+-trees have stable, predictable degradation as the data set grows large and support fast read access, but they incur non-sequential small writes that sometimes get amplified in order to maintain balance. LSMs, on the other hand, are optimized for write-intensive workloads. They usually have high write throughput, as most of the writes are sequential, and the resulting storage space is compact. An LSM-tree, unlike the name suggests, does not have an explicit tree structure that organizes nodes as in B+-trees. Instead, the merge tree serves as a conceptual hierarchy that directs how to merge-sort user data. The amount of merge-sorted data gets amplified as the level goes deeper and thus may cause severe write amplification during log compaction. Compared to B+-trees, LSMs also have higher read amplification.

Both designs are intended for scenarios where the data set cannot fit in memory and the underlying secondary storage is much slower than memory. But servers and consumer devices have increasingly larger memories. Today, servers may have tens to thousands of gigabytes of memory with secondary storage using flash or non-volatile memory technology. As a result, more applications can have all or at least most of their data set fully reside in memory, and this has ignited interest in exploring new data structures that better utilize the characteristics of the abundant memory [14, 46] or even non-volatile main memory [26, 32]. At the same time, solid state devices (SSDs) may also require changes in data structures on secondary storage to take advantage of the much faster access times [29, 40].

We consider the design of a persistent key-value store that uses SSDs for persistence and where the volatile main memory is capable of storing most of the data set for high access demand. Some recent key-value store designs have eschewed on-disk indexes altogether, favoring a fast in-memory index and an unordered solution for persistence such as logging updates sequentially or slab allocation [9, 28, 31, 33].
While such designs can perform very well in the failure-free case, crash recovery involves reading large sections if not all of the disk, leading to lengthy recovery times. In practice, all popular persistent key-value stores use on-disk indexes.

This paper proposes lazy-trie, a storage-friendly data structure. All nodes in a lazy-trie have the same number of child slots, simplifying maintenance. To bound the depth of the trie and probabilistically balance the load, user keys are hashed to index into the trie. To further reduce the depth of the trie and improve utilization, the lazy-trie uses a path compression technique similar to radix trees [34]. Finally, some small subtrees are collapsed into linked lists at leaves, greatly reducing storage overhead and read/write amplification.

We use the lazy-trie data structure to implement a memory-mapped, persistent key-value store, CedrusDB. It supports point operations as a key-value store sorted by key hash. It is able to achieve near-optimal dynamic tree height with practical storage overhead. CedrusDB outperforms carefully engineered key-value stores such as LevelDB, RocksDB, and LMDB in nearly all read- or write-intensive mixed workloads. Like LMDB [14], the implementation of CedrusDB uses memory-mapping, but, unlike LMDB, CedrusDB does not require that all of the data set fit in available memory: CedrusDB implements its own page replacement and does not rely on the operating system kernel to do so. The lazy-trie organization in virtual memory allows CedrusDB to better leverage concurrency.

A limitation of CedrusDB is that it does not support range queries (because the index is sorted by hash).
Figure 1: Basic structure of a hash-trie, with chained leaf nodes.

CedrusDB does not perform as well as some of its competitors for write-only workloads or if the working set size is much larger than available memory.

This paper makes the following contributions:

• The lazy-trie data structure, which dynamically grows with near-optimal tree height, has a practical storage footprint, and allows for efficient concurrent access.
• The design and implementation of CedrusDB, a high-performance key-value store that uses the lazy-trie.
• Evaluation results showing that, compared to its most important competitors, CedrusDB provides superior performance for mixed workloads and is able to better leverage multi-core computing.

Section 2 introduces the lazy-trie data structure and its unique properties. Section 3 discusses the storage model, design decisions, and optimizations that went into building CedrusDB. Section 4 presents our evaluation of CedrusDB. Section 5 discusses related work, and Section 6 concludes.
2 Lazy-Trie

The design of CedrusDB is inspired by memory-mapped key-value stores like LMDB [14]. Instead of using B+-tree variants or other tree structures that require complex operations to move nodes across sibling subtrees for a logarithmic tree height, we propose lazy-trie, a trie structure tailored for persistent storage. In this section, we describe the lazy-trie by progressively adding its crucial elements, along with a discussion of its properties and their implications for storage.

A trie is a tree structure that encodes all prefix paths of inserted key strings. Each key is treated as a sequence of consecutive fixed-width characters. A trie stores paths of edges representing character sequences, collapsing all shared prefixes into a tree topology.

A hash-trie is a trie indexed by a hash. CedrusDB uses a strong (well-distributed) and fast hash function (see Section 3) to map an arbitrary key string given by the user to a 256-bit hash. It then partitions the hash into equal-length characters. In the trie, a tree node consists of a character-indexed array of pointers to its child nodes and a pointer referring back to its parent node. We define the height of a given node to be the number of its ancestors. Thus, the root node has a height of zero. Each leaf node maintains a linked list of user key-value pairs. Figure 1 depicts the basic structure of a hash-trie, where two user keys stored in data nodes at the same leaf share a hash prefix. The size of the hash should be chosen so that the probability of such collisions is small.

The insertion algorithm starts from the root node. It first visits the child indexed by the first character of the key hash, proceeds to the next child by the second character, and recursively walks down the trie until the entire key sequence is consumed. Child nodes are created as needed. At the leaf node, the algorithm adds a data node containing the original key-value data of the user to the linked list.

Lookup is similar to insertion. When reaching the leaf node, the linked list is scanned, doing a full comparison using the original, unhashed user key to locate the value. A sketch of these two walks appears below.
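To make the two walks concrete, here is a minimal, self-contained Rust sketch of hash-trie insertion and lookup. It is an in-memory illustration with simplified placeholder types, not CedrusDB's persistent layout (described in Section 3).

```rust
// Minimal in-memory hash-trie sketch (illustrative; not CedrusDB's
// on-disk layout). Each hash character indexes a child table; a leaf
// holds a list of (key, value) pairs scanned with full key comparison.
const BRANCH: usize = 256; // branching factor: one byte per character

struct Node {
    children: Vec<Option<Box<Node>>>, // character-indexed child table
    items: Vec<(Vec<u8>, Vec<u8>)>,   // (user key, value) pairs at a leaf
}

impl Node {
    fn new() -> Self {
        Node { children: (0..BRANCH).map(|_| None).collect(), items: Vec::new() }
    }

    // Walk down one level per hash character, creating children as
    // needed; append the data item to the list at the leaf.
    fn insert(&mut self, hash: &[u8], key: Vec<u8>, val: Vec<u8>) {
        match hash.split_first() {
            Some((&c, rest)) => self.children[c as usize]
                .get_or_insert_with(|| Box::new(Node::new()))
                .insert(rest, key, val),
            None => self.items.push((key, val)),
        }
    }

    // Follow the same static path; at the leaf, compare the original,
    // unhashed user keys to pick the right item.
    fn lookup(&self, hash: &[u8], key: &[u8]) -> Option<&[u8]> {
        match hash.split_first() {
            Some((&c, rest)) => self.children[c as usize].as_ref()?.lookup(rest, key),
            None => self
                .items
                .iter()
                .find(|(k, _)| k.as_slice() == key)
                .map(|(_, v)| v.as_slice()),
        }
    }
}
```

Note that the path for a given hash is static: no other key's insertion or deletion ever moves it, which is what later makes fine-grained locking straightforward.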
There are some similarities between the hash-trie and the B+-tree data structures. First, both are balanced, n-ary tree structures. In practice, a typical B+-tree has a branching factor of several hundred. As we show later in the evaluation, a good choice for the branching factor of the hash-trie is hundreds of children per node as well. Second, both B+-trees and hash-tries only store user data at the leaves; intermediate nodes only contain metadata for indexing.

That being said, the structures have important differences. The height of a B+-tree is logarithmic in the number of keys. When inserting data, a B+-tree has to constantly adjust its topology across sibling subtrees to maintain tree balance. In a hash-trie, the path to the leaf node for a specific key is static: there is no reorganization of this path when other data is inserted or deleted. This significantly reduces the I/O cost of maintaining the internal index structure and simplifies concurrent access or modification to the tree. Finally, all tree nodes of a hash-trie have an identical size and the same storage footprint, which simplifies storage maintenance and reduces fragmentation. B+-tree nodes have a variable number of children, bounded by the branching factor.

A disadvantage of the hash-trie design is that operations always need to visit a fixed number of tree nodes even if the data store is small. Also, the utilization of child tables in tree nodes can be low, potentially resulting in significant write amplification. We next show, in two steps, how to improve upon the hash-trie structure to build a practical, dynamically grown lazy-trie with near-optimal height and small storage overhead by utilizing statistical properties of the hash function.

Figure 2: The tree structure without (left) and with (right) path compression.

Figure 3: The bookkeeping operations required by the lazy-trie. (a)(b) show a split operation, when a new data node d is inserted. (b)(c)(d) show a merge operation when the parent node u_c only has one child after the other child is removed.

Let the branching factor be b. Then two keys have probability b^{-h} of sharing the same prefix of length h. The exponentially decreasing probability means that, as one descends into the tree, it becomes less likely to have forks in the tree. This led us to compress the suffixes of paths in the hash-trie and only unroll a compressed path when necessary. Data nodes remain at the leaves, so the collision rate remains unchanged. Figure 2 illustrates the new design. Compared to radix trees, the difference is that we only compress the suffixes of paths; internal paths may still have nodes with only one child.

Whenever inserting a new node, we lazily create the minimum number of intermediate missing nodes just to distinguish data nodes with different hashes. During the insertion, we first follow the tree structure as usual. If we reach a data node where a tree node is expected, then we insert a tree node, uncompressing the path (see Figure 3).

The lookup algorithm stays essentially the same: CedrusDB traverses the lazy-trie according to the key hash until it reaches a data node, then it scans through the list of data nodes. In the delete algorithm, when a data node is removed, a merge operation checks whether it was the only child of its parent node. If so, it collapses the non-forking path repeatedly until reaching an ancestor having more than one child and stops there.

Unlike insertion or deletion in other tree structures, which require complex recursive reorganization, there is at most one non-recursive split for each insertion and one non-recursive merge for each deletion. In a split, a simple path of intermediate nodes is created as a chain, while, in a merge, the longest non-forking path is collapsed into a direct parent reference. These operations only happen on a single path of insertion or deletion, so they do not interfere with any other siblings or their subtrees.

As a result, the lazy-trie dynamically adjusts the tree height depending on how frequently the prefixes of inserted key hashes conflict. The hash function offers uniformly distributed key hashes regardless of the order of insertion, statistically balancing the tree with a height of approximately log_b n, where n is the number of keys. Since the probability of a shared prefix decreases exponentially with the length of the prefix, in expectation there will be few missing intermediate nodes in a split operation.
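As a worked example of this height argument (assuming the branching factor b = 256 that the evaluation later adopts):

```latex
\Pr[\text{two keys share a prefix of length } h] = b^{-h} = 256^{-h},
\qquad \text{expected height} \approx \log_b n.
% For n = 10^8 keys and b = 256:
\log_{256} 10^8 = \frac{8}{\log_{10} 256} \approx \frac{8}{2.41} \approx 3.3,
% so a lookup is expected to traverse only three to four tree nodes.
```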
The lazy-trie structure as described so far has an average path length that grows logarithmically with the number of keys (Figure 5, top, s = 1), but it suffers from high path length variation and low child table utilization. To mitigate this, we define the sluggishness s to be the maximum number of hash values that are allowed in a linked list of data nodes. A sluggish lazy-trie will only split a path when a linked list overflows. If so, the linked list is replaced with a leaf node, and all data nodes in the linked list are redistributed into the child table of the new leaf node. The redistribution is recursive and continues until the sluggishness constraint is met. Figure 4 shows an example of the steps in redistribution.

Figure 4: (a) a trie without sluggishness (s = 1); (b) inserting a data node d can lead to overflowing the linked list of data nodes; (c) for a maximum sluggishness of 3, the data nodes are redistributed.

To demonstrate how sluggish splitting mitigates the problems of high path length variation and low utilization, Figure 5 shows the average path length and child table utilization as the data store grows in size for different levels of sluggishness s. The periodic change of utilization is due to the statistical growth of the tree height. The bottom graph shows the number of bytes used by tree nodes as a function of the number of keys. With a sluggishness of s = 4, the storage overhead is significantly reduced compared to no sluggish splitting (s = 1). The sketch below illustrates the overflow-triggered redistribution.
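The following sketch shows sluggish splitting on the same kind of simplified in-memory trie as before. For brevity it bounds the length of the leaf list rather than the number of distinct hash values, and all names are illustrative.

```rust
// Sluggish splitting sketch: a leaf keeps a flat list until it overflows,
// and only then is replaced by an inner node whose children are filled by
// re-inserting the spilled items (recursing if a new leaf overflows too).
// Assumes no more than SLUGGISHNESS items share an entire hash.
const BRANCH: usize = 256;
const SLUGGISHNESS: usize = 4; // simplified bound on the leaf list length

enum Node {
    Inner(Vec<Option<Box<Node>>>), // character-indexed child table
    Leaf(Vec<(Vec<u8>, Vec<u8>)>), // (key hash, value) pairs
}

fn insert(node: &mut Node, depth: usize, hash: Vec<u8>, val: Vec<u8>) {
    match node {
        Node::Inner(children) => {
            let c = hash[depth] as usize;
            let child = children[c]
                .get_or_insert_with(|| Box::new(Node::Leaf(Vec::new())));
            insert(child, depth + 1, hash, val);
        }
        Node::Leaf(items) => {
            items.push((hash, val));
            if items.len() > SLUGGISHNESS {
                // Overflow: split the leaf and redistribute its items one
                // level down, re-checking the constraint recursively.
                let spilled = std::mem::take(items);
                let mut inner = Node::Inner((0..BRANCH).map(|_| None).collect());
                for (h, v) in spilled {
                    insert(&mut inner, depth, h, v);
                }
                *node = inner;
            }
        }
    }
}
```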
3 CedrusDB

In this section, we show how a lazy-trie can be used to build CedrusDB, a high-performance key-value store. CedrusDB has been fully implemented in ~6K lines of pure Rust. Rust is a modern systems programming language that provides statically checked memory- and thread-safety guarantees [4, 18, 42]. In addition to basic constructs offered by the standard library, it allows programmers to customize their building blocks with different safety guarantees [19]. For CedrusDB, we explored how to separate the lazy-trie algorithm from the underlying storage management. We also make use of native functionalities of Linux, some of which may not be available on other operating systems, such as io_submit for kernel-based asynchronous I/O (AIO).
In some key-value stores, like LMDB [14], the entire store is always mapped in memory and the maximum size has to be predetermined at its creation. CedrusDB supports a dynamically growing data set that may not all fit in memory. To support this, it has logical spaces, one for each type of object.

Figure 5: Average path length, child table utilization, and meta-data storage footprint for a 128-ary lazy-trie as a function of the number of keys, with different sluggishness values. The solid blue line in the bottom graph shows the user data footprint for 23-byte keys and 128-byte values.

A logical space is a 64-bit, four-layer virtual address space. A logical address is a 64-bit unsigned integer subdivided into four parts: a segment number, a region number, a page number, and a page offset. CedrusDB associates a file with each segment. Each region is either fully mapped or unmapped. When mapped, it can be accessed like ordinary memory; that is, regions are memory-mapped, allowing zero-copy read access. Figure 6 shows the organization of storage units with different granularities.

Figure 6: CedrusDB storage hierarchy with mapped memory and write buffer. The smallest rectangle represents a page.
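The following sketch decodes a logical address into its four parts. The paper does not spell out the exact field widths, so the constants below are assumptions, chosen only to be consistent with the 4KByte pages and 16MByte regions mentioned elsewhere in the paper.

```rust
// Hypothetical field widths: a 12-bit page offset (4KByte pages), 12 bits
// of pages per region (16MByte regions), and an assumed 20-bit region
// number; the remaining high bits select the segment (file).
const OFFSET_BITS: u32 = 12;
const PAGE_BITS: u32 = 12;
const REGION_BITS: u32 = 20;

struct LogicalAddr {
    segment: u64, // which file
    region: u64,  // mapping (and caching) granularity
    page: u64,    // write-buffer block granularity
    offset: u64,  // byte offset within the page
}

fn decode(addr: u64) -> LogicalAddr {
    LogicalAddr {
        offset: addr & ((1u64 << OFFSET_BITS) - 1),
        page: (addr >> OFFSET_BITS) & ((1u64 << PAGE_BITS) - 1),
        region: (addr >> (OFFSET_BITS + PAGE_BITS)) & ((1u64 << REGION_BITS) - 1),
        segment: addr >> (OFFSET_BITS + PAGE_BITS + REGION_BITS),
    }
}
```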
Objects cannot span regions. Regions can only be accessed by mapping them, and thus the number of mapped regions is effectively the cache size, blurring the boundary between the "cache-based" approach used by stores like LevelDB and RocksDB and the "memory-based" approach of stores like LMDB and Memcached. Ideally, performance is optimal when regions can remain mapped as long as they are in active use, which is the main focus of this paper.

To map regions, we use mmap. While in theory we could let the kernel keep track of dirty pages and write them back to the underlying segment files, we found that this solution, while simple, did not provide good performance even when madvise is used. The kernel ends up writing the same, actively modified pages repeatedly, thus incurring prohibitively high write cost. Moreover, when the bounded kernel buffer of pending writes is full, the kernel slows down store instructions (such as x86 mov) made to the virtual memory space, resulting in performance that is hard to predict.

Figure 7: Four logical spaces used in CedrusDB.

The storage architecture of CedrusDB is therefore hybrid. We map the regions in memory but keep track of our own write buffer for dirty pages. Doing so also benefits write-ahead logging.

Logical spaces allow the lazy-trie algorithm to operate transparently as if the entire data structure were in memory. CedrusDB maintains four logical spaces:

1. trie space: tree nodes in use;
2. trie free list: a stack of indexes (pointers) to unused tree nodes;
3. data space: data nodes in use;
4. data free list: an array of descriptors tracking the unused portion of the data space.

Tree nodes have a fixed size, and thus the trie space contains an array of tree nodes. The trie free list contains a stack of indexes into the tree node array. To free a tree node, its index is pushed onto the stack. To allocate a tree node, an index is popped from the stack.

In the data space, in-use and free data nodes of different sizes are stored contiguously, each with a header at the front and a footer at the end. The footer of a data node is right before the header of the next data node and contains the size of the data node; this supports merging two adjacent free data nodes (see below). Instead of maintaining a doubly-linked list of free data nodes like the free list in glibc's malloc, CedrusDB maintains a separate, unsorted array of hole descriptors for better locality and therefore I/O efficiency (see Figure 7). Each descriptor points to a free data node, while the header of a free data node points back to its descriptor.

CedrusDB adopts a next-fit allocation policy [23, 25], using two indexes into the array of hole descriptors. tailIdx points to the first unused descriptor. curIdx points to the last descriptor visited by the allocator. To allocate a data node, the allocator starts at curIdx and scans (wrapping around at tailIdx) until it finds a free data node that is large enough. If the free data node is larger than needed, it is split into two. If the free data node is exactly the right size, it is reused, its descriptor is replaced by the one before tailIdx, and tailIdx is decremented. If there is no free data node large enough, the data space is enlarged to accommodate a new data node. As we show in Section 4.2.5, the allocator does not have to fully scan through the free list, allowing for better performance; a sketch follows below.

To free a data node, we first check whether either or both of the adjacent data nodes are free. If so, we merge the data nodes (possibly reducing the number of hole descriptors if both adjacent data nodes are free). If not, we add a new hole descriptor at tailIdx. The tailIdx pointer, together with the trie root pointer and other tail pointers that indicate the boundaries of the logical spaces, is stored within the first (reserved) 4KByte page of the trie space, before the tree node array.
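Below is a simplified sketch of this next-fit policy; plain integers and a Vec stand in for the memory-mapped descriptor array, and the hole-merging logic on free is omitted.

```rust
// Next-fit allocation over an unsorted array of hole descriptors
// (simplified reconstruction; not CedrusDB's exact code).
struct Hole {
    addr: usize, // start of the free data node in the data space
    size: usize,
}

struct DataAllocator {
    holes: Vec<Hole>, // unsorted descriptors; len() acts as tailIdx
    cur_idx: usize,   // last descriptor visited (the next-fit cursor)
    space_end: usize, // tail of the data space, grown when nothing fits
}

impl DataAllocator {
    fn alloc(&mut self, size: usize) -> usize {
        let n = self.holes.len();
        for step in 0..n {
            let i = (self.cur_idx + step) % n; // wrap around at tailIdx
            if self.holes[i].size >= size {
                let addr = self.holes[i].addr;
                if self.holes[i].size > size {
                    // Split: carve the allocation off the front of the hole.
                    self.holes[i].addr += size;
                    self.holes[i].size -= size;
                } else {
                    // Exact fit: recycle descriptor i by swapping in the
                    // last descriptor (the one before tailIdx).
                    self.holes.swap_remove(i);
                }
                self.cur_idx =
                    if self.holes.is_empty() { 0 } else { i % self.holes.len() };
                return addr;
            }
        }
        // No hole is large enough: extend the data space instead.
        let addr = self.space_end;
        self.space_end += size;
        addr
    }
}
```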
To benefit from Rust's memory-safety and thread-safety guarantees, we crafted our own memory-mapped object abstraction that fits in Rust's idiom and thus utilizes its type checker. Rust does not come natively with a type-safe wrapper or primitive for memory-mapping. Although nix [36], a popular Rust library, provides related functions, they are exposed as Rust function signatures of the original POSIX interface in C. They are unsafe because the compiler makes normal assumptions about memory management and variable lifetimes without making accommodations for memory-mapped address space.
The MMappedSpace struct is the core of implementing the logical space abstraction (Section 3.1). It can be created from file handles and exposes safe methods.

```rust
#[repr(C)]
union ChdRaw {
    node: ObjPtr<Node>,
    data: ObjPtr<Data>,
}

pub struct Node {
    parent: ObjPtr<Node>,  // parent node (for deletion)
    height: u8,
    pub pidx: NBranchType, // index in the parent node
    data_mask: [u64; NBRANCH / 64], // data node bitmask
    chd_mask: [u64; NBRANCH / 64],  // validity bitmask
    children: [ChdRaw; NBRANCH],    // child table
    // reserved field for late-initialization
    pub rwlock: std::mem::MaybeUninit<RwLock<()>>,
}
```
Figure 8: Tree node layout.

We use an opaque struct to represent a pointer into logical space, so that it can either be reconstructed from persistent storage (like a pointer to an existing tree node) or allocated through MMappedSpace. The reason we need an abstraction for a pointer is two-fold. First, addresses in logical spaces do not directly correspond to virtual memory addresses even if the corresponding region is mapped. Second, using a plain integer to index into logical space would be unsafe. Given a typed object pointer, ObjPtr, CedrusDB dereferences it into a typed reference, ObjRef, whose validity is tied to the mapping of the corresponding region. The raw child union ChdRaw is hidden from other parts of CedrusDB, which access the trie data structure through safe enum structs and methods.
```rust
pub struct Data {
    lsize: u8,            // the size of the linked list
    hkey: [u8; HASH_LEN], // bytes of the hashed key
    key_mode: KeyMode,
    key_size: u64,        // number of bytes in the user key
    val_size: u64,        // number of bytes in the value
    next: ObjPtr<Data>,   // the next item in the list
    // NOTE: the raw key-value bytes follow
}
```

Figure 9: Data node layout.
We use Google's HighwayHash [3] to hash keys. For each user key-value pair, we use an object of the Data struct (Figure 9) to hold the precomputed HighwayHash of the original arbitrary-length key in hkey, and size information for the original key and value in key_size and val_size. The actual user key and value are placed directly after the Rust structure. To support keys with colliding hashes and sluggish splitting, Data objects are chained together into a singly-linked list using the next pointer. We use madvise [1] to request that the kernel prefetch the memory pages storing data objects.
As described in Section 3.1 and shown in Figure 6, CedrusDB uses a separate write buffer to serialize all changes and schedule them as block writes to the physical storage device. Each modification made to the internal lazy-trie data structure first writes directly to memory. (Due to limitations of mmap, we map the memory in copy-on-write mode even though we do not require that the original contents be saved.) The state of the memory thus always reflects the latest changes. To update the underlying storage, the same write is also sent to a disk thread via a bounded buffer.

The changes generated by the tree logic are short and frequent, which is not optimal for secondary storage. CedrusDB therefore aggregates the updates into blocks. The Linux ext4 filesystem has the same block size as the page size, so CedrusDB is configured to use 4KByte blocks.

While the lazy-trie data structure does not require log compaction such as used in LSMs and is arguably simpler to maintain than B+-trees, it potentially suffers from non-optimal locality and random writes when pointers need to be updated. To optimize storage updates, the CedrusDB disk thread uses Linux native Asynchronous I/O (AIO). Not to be confused with the POSIX AIO standard offered by glibc [37], Linux AIO performs concurrent writes if possible. We access AIO through libaio, a thin C ABI wrapper that is available on mainstream Linux distributions. AIO allows us to asynchronously manage reads and writes in a non-blocking style.

Because CedrusDB supports concurrent operations on the lazy-trie, it is possible that multiple user threads make changes to the same page (block). In CedrusDB it is the disk thread's responsibility to obtain the consistent state of a block from a file if it is not already available in the buffer, rather than copying the content from memory, as there may be a data race. The disk thread can schedule such an infrequent block read while at the same time scheduling other writes in the buffer without being blocked.
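A minimal sketch of this write path, with a bounded channel standing in for CedrusDB's write buffer and the AIO machinery elided (names and the channel-based framing are illustrative assumptions):

```rust
use std::collections::HashMap;
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

const BLOCK_SIZE: usize = 4096; // ext4 block size == page size

struct DirtyWrite {
    block_id: u64,  // which 4KByte block of the backing file
    off: usize,     // offset of the change within that block
    bytes: Vec<u8>, // the new contents
}

fn spawn_disk_thread(rx: Receiver<DirtyWrite>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let mut blocks: HashMap<u64, Box<[u8; BLOCK_SIZE]>> = HashMap::new();
        for w in rx {
            // Aggregate short, frequent updates into whole blocks; a real
            // implementation would read a block from the file when it is
            // not buffered, and schedule AIO writes of complete blocks.
            let block = blocks
                .entry(w.block_id)
                .or_insert_with(|| Box::new([0u8; BLOCK_SIZE]));
            block[w.off..w.off + w.bytes.len()].copy_from_slice(&w.bytes);
        }
        // Channel closed: flush the remaining dirty blocks here.
    })
}

fn main() {
    let (tx, rx) = sync_channel::<DirtyWrite>(1024); // bounded buffer
    let disk = spawn_disk_thread(rx);
    // A user thread first applies its change to mapped memory, then:
    tx.send(DirtyWrite { block_id: 0, off: 8, bytes: vec![0xCE; 16] }).unwrap();
    drop(tx); // closing the channel lets the disk thread drain and exit
    disk.join().unwrap();
}
```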
The lazy-trie design does not require tree re-balancing operations, simplifying concurrent access. Because walks down the tree diverge exponentially fast, gains from concurrent access can be significant.

Locating the leaf node does not change the trie structure, while insertion or deletion only makes changes starting from a leaf node. We assign a reader-writer lock to each tree node to control any access that goes through that node.

First consider lookup operations. For each node visited on the path to the leaf node, a thread acquires the read lock. Knowing that once the thread holds a read lock, no concurrent updates can happen to the node or any node below it, it is safe for the thread to release the lock on the parent node, allowing concurrent updates to other parts of the trie.

In theory, insertions could also obtain read locks the same way as lookups, until the thread needs to update a node. At that time, the thread would have to upgrade its read lock to a write lock. In practice, Rust does support an upgradable reader-writer lock, but it allows at most one thread to hold an upgradable read lock at a time (while other threads may concurrently hold a regular read lock). The lock also supports a downgrade operation that converts an upgradable read lock into a regular read lock. We use it as follows: insertion proceeds as lookups do but obtains upgradable read locks. As soon as the thread determines that it will not need to update the node, it downgrades the lock before attempting to get an upgradable read lock on the next node down the path.

Deletion is more complex, as it may remove a path that is being followed by other concurrent threads. CedrusDB assumes deletions are much less frequent than other operations. Based on this assumption, a thread performing a delete operation obtains write locks for the entire path, ruling out some concurrent access.
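The hand-over-hand pattern for lookups can be expressed in safe Rust with guards that own their lock. The sketch below uses the parking_lot crate's Arc-owning read guards (its "arc_lock" feature) on a simplified in-memory tree as one possible stand-in; it illustrates the locking order, not CedrusDB's embedded-lock implementation.

```rust
// Hand-over-hand (lock-coupling) lookup sketch. Requires the parking_lot
// crate with its "arc_lock" feature, whose read_arc() returns a guard
// that keeps the node alive without borrowing a local variable.
use parking_lot::RwLock;
use std::sync::Arc;

type NodeRef = Arc<RwLock<Node>>;

struct Node {
    children: Vec<Option<NodeRef>>, // character-indexed child table
    value: Option<Vec<u8>>,         // stands in for the leaf's data list
}

fn lookup(root: &NodeRef, hash: &[u8]) -> Option<Vec<u8>> {
    let mut guard = root.read_arc(); // read lock on the root
    for &c in hash {
        let child = guard.children[c as usize].as_ref()?.clone();
        // Lock the child first, then release the parent: assigning to
        // `guard` drops the old (parent) guard only after the right-hand
        // side has acquired the child's read lock.
        guard = child.read_arc();
    }
    guard.value.clone()
}
```

Insertions would follow the same pattern with parking_lot's upgradable read guards, upgrading to a write lock only at the node that actually changes and downgrading otherwise.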
The idiomatic way in Rust to add a lock to a type is to wrap it, as in RwLock<Node>. A lock object, however, cannot simply be persisted and mapped back from disk in a meaningful state. We therefore reserve space for the lock inside each tree node as an uninitialized field (the rwlock field in Figure 8). This adds little overhead to a tree node (which is dominated by the table of child pointers), but we need a way to initialize the locks after mapping their space from disk.

Figure 10: Implementation of trie node locks.

The lock objects are initialized on-the-fly as threads try to obtain them. We maintain two bitmasks with each region, each with one bit per trie node. For a region size of 16MBytes full of 256-child tree nodes, less than 2KBytes is needed for the bitmasks. init_fin indicates whether the corresponding lock is initialized. init indicates whether some thread is in the process of initializing the lock. The bitmasks are implemented by a fixed-length array of 64-bit atomic variables (std::sync::atomic::AtomicU64) so that queries and modifications can be performed atomically using bit operations. We hide this fast node-lock facility from other parts of the system behind a safe interface, as sketched below.
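A sketch of this on-the-fly initialization, with the two bitmasks as arrays of AtomicU64 (the bitmask names follow the text; everything else is an illustrative reconstruction):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct RegionLockState {
    init: Vec<AtomicU64>,     // bit per node: some thread claimed the init
    init_fin: Vec<AtomicU64>, // bit per node: the lock is ready to use
}

impl RegionLockState {
    /// Ensure the lock for node `i` is initialized; `init_lock` (which
    /// would write the node's MaybeUninit rwlock field in place) runs
    /// exactly once per node across all threads.
    fn ensure_init(&self, i: usize, init_lock: impl FnOnce()) {
        let (word, bit) = (i / 64, 1u64 << (i % 64));
        if self.init_fin[word].load(Ordering::Acquire) & bit != 0 {
            return; // fast path: already initialized
        }
        // Atomically try to claim the initialization of this node's lock.
        if self.init[word].fetch_or(bit, Ordering::AcqRel) & bit == 0 {
            init_lock(); // we won the race: initialize the lock object
            self.init_fin[word].fetch_or(bit, Ordering::Release);
        } else {
            // Another thread is initializing: spin until it finishes.
            while self.init_fin[word].load(Ordering::Acquire) & bit == 0 {
                std::hint::spin_loop();
            }
        }
    }
}
```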
The logical space could be another performance bottleneck for concurrent insertions, because its tail pointer is frequently written when new objects are pushed to the end of the space. Ideally, Fetch-And-Add (FAA) atomic operations would allow updating the pointer concurrently. However, extra care should be taken when the pointer reaches the boundary of the last allocated region. In this case, we need to extend the underlying file or create a new file to ensure that accessing the new region will not result in a SIGBUS fault.

We use the Linux fallocate system call to manipulate the allocated disk space for a file. To mitigate the overhead of fallocate, we grow a file at the granularity of regions. We use a tail pointer to keep track of the boundary of usable space and increase the pointer to extend the usable space after the new region is ready to be mapped. Since regions are large compared to the contained objects, we use a Compare-And-Swap (CAS) loop to optimistically increase the tail pointer in the common case where it advances within the same region. A busy flag is used to synchronize the threads when a new region needs to be allocated. In this slower case, only one thread will extend the file and move the tail pointer forward to the next region, while other threads will yield and continue to the next iteration of the CAS loop. There might be threads holding a stale view of the tail pointer referring to some previous region; we avoid possible ABA problems with careful checks. Figure 11 describes how to optimistically grow logical space at a high level.

Figure 11: Using CAS operations to maintain logical space.
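The following sketch reconstructs the optimistic tail advance at a high level, with the file extension and mapping elided; it is a simplified illustration of Figure 11, not CedrusDB's exact code.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

const REGION_SIZE: u64 = 16 << 20; // 16MByte regions, as in the evaluation

struct SpaceTail {
    tail: AtomicU64,  // boundary of usable space
    busy: AtomicBool, // serializes region allocation
}

impl SpaceTail {
    fn alloc(&self, size: u64) -> u64 {
        assert!(size <= REGION_SIZE); // objects cannot span regions
        loop {
            let old = self.tail.load(Ordering::Acquire);
            let new = old + size;
            if old / REGION_SIZE == new / REGION_SIZE {
                // Fast path: the allocation stays within the same region.
                if self
                    .tail
                    .compare_exchange(old, new, Ordering::AcqRel, Ordering::Acquire)
                    .is_ok()
                {
                    return old;
                }
            } else if self
                .busy
                .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
                .is_ok()
            {
                // Slow path: re-read the tail under the busy flag, since a
                // stale view may refer to a previous region (ABA check).
                let cur = self.tail.load(Ordering::Acquire);
                if cur / REGION_SIZE != (cur + size) / REGION_SIZE {
                    // fallocate() the file and mmap the next region here
                    // (elided); the unused remainder of the old region is
                    // skipped, then the first `size` bytes are claimed.
                    let start = (cur / REGION_SIZE + 1) * REGION_SIZE;
                    self.tail.store(start + size, Ordering::Release);
                    self.busy.store(false, Ordering::Release);
                    return start;
                }
                self.busy.store(false, Ordering::Release); // retry fast path
            } else {
                std::thread::yield_now(); // another thread is extending
            }
        }
    }
}
```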
4 Evaluation

We evaluate the implementation of CedrusDB in order to answer the following questions:

• How does the branching factor of a lazy-trie affect its performance? Which value of sluggishness should one choose? Are 256-bit hashes practical? (§4.2.1, §4.2.2)
• What is the performance of the various types of CedrusDB operations? Is it practical compared to other stores? (§4.2.3)
• How does performance degrade when user data cannot fit entirely in memory? What is the impact of the region size? (§4.2.4) What is the performance of the data space allocator and what is the overall storage footprint? (§4.2.5)
• How well does CedrusDB perform under various kinds of realistic workloads? How well does it leverage concurrency? (§4.3)
Unless otherwise noted, we use Dell R340 servers to conduct our experiments. Each machine has a hexa-core Intel Xeon E-2176G 3.70GHz CPU with 64GB of DDR4 memory. For the secondary storage medium, we use an Intel Optane 905P SSD attached via PCIe NVMe 3.0.

Figure 12: Lookup/insertion throughput (Xput.) and the tree footprint with different branching factors and sluggishness.

For microbenchmarks, we run each setup 5 times and plot the mean on the y-axis with standard deviation bars. For macrobenchmarks, we evaluate CedrusDB using YCSB [11]. YCSB is written in Java, whereas CedrusDB is in Rust and the other key-value stores are in C or C++. To avoid overhead caused by incompatible interfacing, we created a C binding for CedrusDB and invoke the APIs from a unified C++ test program dedicated to executing the YCSB workload. For each run, we populate the data store with a certain number of key-value items before running the test workload. The data store is reopened when the test starts. Finally, for multi-threaded experiments, we run a separate YCSB workload generator for each thread in such a way that a thread only handles keys that were preloaded or keys that were inserted by the thread itself.
There are two parameters required to instantiate a lazy-trie: the tree branching factor and the sluggishness, together determining the shape of the tree and its statistical characteristics. In Figure 12, we use branching factors 64, 128, and 256. We vary the sluggishness from 1 (no sluggishness) to 128. For each run, a data store of 100 million items is used to perform 10 million uniformly random lookups or insertions.

As discussed in Section 2.3, we expect that a larger branching factor will make the efficacy of sluggishness more pronounced, as having more slots in the child table leads to higher storage overhead and amplification. Indeed, we see that for a branching factor of 256, where each tree node takes around 2KBytes of space, sluggishness significantly improves performance. With more sluggishness, storage overhead is greatly mitigated and both lookup and insertion performance ramp up to reach or surpass those of other branching factors. When increasing the sluggishness from 4 to 16, the memory footprint is cut down from more than 23GBytes to 144MBytes.

Figure 13: Microbenchmark with 10^8 items and 10^7 operations (single-threaded).

Figure 14: Concurrent microbenchmark with 10^8 items and 10^7 operations (4 threads).

When using a branching factor of 64, each tree node only takes around 512 bytes, and it requires a lower sluggishness for comparable performance. On the other hand, because the data store can be fully cached in memory, the choice of branching factor does not significantly affect the maximum performance, due to the low cost of memory access. To reduce the tree height and obtain the best lookup performance, we use a branching factor of 256 with a sluggishness of 16 as a practical choice for all subsequent experiments.

In the prior experiments, we used 256-bit HighwayHash to hash the keys. HighwayHash also provides 64- and 128-bit hashing. While generating a shorter hash is faster, it increases the collision rate, amplifying the cost of scanning the linked list of data nodes. We experimented with all three hash lengths but found no difference in performance. We use 256-bit hashes for the remainder of the evaluation.
We examine the performance of a read-only workload with a single client thread. We populate the store with 100 million items and conduct 10 million uniformly random (L) or zipfian (L-zipf) reads. The right graph of Figure 13 shows that, using a single thread, LMDB outperforms CedrusDB, especially when reads have more locality (zipfian distribution). This is partly explained by the fact that CedrusDB requires hashing and scanning the sluggish leaf list for each lookup. Moreover, for zipfian workloads, hashing in CedrusDB eliminates locality. However, as shown on the right of Figure 14, CedrusDB significantly outperforms all the other key-value stores for concurrent lookup access with 4 client threads.

LSM-based key-value stores like RocksDB are optimized for write-only workloads. In Figure 13, we test 10 million pure insertions (I), deletions (D), or updates (U) with uniformly distributed keys. Additionally, we tested updates with a zipfian distribution (U-zipf) and a very write-intensive mixed workload with 50% lookups + 50% updates (M and M-zipf). As expected, RocksDB is significantly faster than LMDB for write-intensive operations.

CedrusDB has two kinds of updates: (i) in-place updates (as used in LMDB by default) when the new value fits into the footprint of the current one, and (ii) emulated updates that remove the current value and insert the new one. We instrumented CedrusDB so we could test both individually. cedrusdb shows in-place update performance and cedrusdb-m shows the performance when all updates are treated as deletions followed by insertions. Unsurprisingly, in-place updates are fastest since they do not alter the tree topology. Insertions are faster than updates because the latter engage the data space allocator frequently to recycle data nodes. The current implementation of CedrusDB uses a mutex lock for the data space allocator, which becomes a performance bottleneck in the concurrent setup, whereas the allocator for the trie space is wait-free. The performance of deletions is the lowest because they require frequent access to the data space allocator and also change the tree topology. Nonetheless, in these experiments CedrusDB is faster than LMDB.
While CedrusDB is optimized for the case where the entire data store fits within a given memory budget, the region-based mapping design of CedrusDB allows a larger data store. In this section, we evaluate how performance degrades as CedrusDB runs out of its memory budget. We started with a baseline experiment using 100 million data items that did not have a memory budget, so the data store could utilize all the memory. We recorded the maximum number of regions and used that as the memory budget. We then experimented with 0%-50% of additional user data to see how performance changes.

Figure 15 shows lookup performance for region sizes ranging from 64KBytes (lookup16, 16-bit) to 16MBytes (lookup24, 24-bit). The figure shows that a larger region size results in better performance when all data fit in memory, but performance degrades quickly when the data exceeds the memory budget.
Figure 15: Read performance for different region sizes, with user data beyond the memory budget.
Figure 16: Normalized write performance with user data beyond the memory budget.

When the whole data set fits in memory (0%), small region sizes hurt performance because there is more overhead for managing regions and their mappings. For large region sizes, the coarse granularity of mappings makes it more likely that cold and hot items are collocated in the same region, resulting in increased swapping.

Figure 16 shows the results for write operations. In this figure, each line is normalized to the performance under the full memory budget. We see that faster operations like updates degrade more than slower ones like deletions, due to the high penalty of swapping compared to in-memory operations.
So far (as is common practice in many key-value store evaluations), we used the same length for all values. For CedrusDB, this means that the data space allocator only needs to take one step to find the next-fit location that was previously freed to recycle. When allowing in-place updates, the allocator is not even engaged. Thus, to thoroughly examine our design and effectively evaluate the allocator, we generated a workload that mixes 128/256/1024-byte values uniformly. We first populated the store with 10 million items. To have each value updated many times on average, we ran 100 million operations with the mixed workload (M), changing the value of each key ~5 times throughout the entire run.
Figure 17: Data space allocator statistics and storage amplification due to fragmentation.

For the next-fit allocator, scanning through the entire free list before giving up is too expensive in practice. Instead of making sure to recycle a freed object whenever possible, allowing some slack in using the free list does not cause significant fragmentation. We therefore limit the maximum number of steps in scanning the free list during an allocation. In the right subgraph of Figure 17, we vary the scan step limit (Max. Scan) and show the change of the amplification factor (Disk Amp.) and the ratio of reusing freed space in an allocation (Recycled). The disk amplification factor is the final disk usage divided by the initial one. It is greater than 1 for all stores due to fragmentation. Once the limit exceeds 100, there is little difference compared to having no limit. Moreover, as shown in the left subgraph of Figure 17, we believe the fragmentation ratio of CedrusDB is reasonable compared to other key-value stores.
In this section, we evaluate CedrusDB using YCSB [11] workloads with zipfian distributions. We used a region size of 16MBytes for all experiments.
Figure 18 shows single-threaded throughput for different read/write ratios using 128-byte values; Figure 19 shows the same for 1KByte values. The data store is populated with 100 million keys. Two types of write operations are considered: updating the value of an existing key and inserting a value with a non-existing key. Although both use the same API, they trigger different code paths (as demonstrated in our microbenchmarks). The upper left graph shows results from runs where all writes are updates, whereas the lower left shows insertions; in the right subgraph, all operations are lookups. CedrusDB achieves better performance than the other stores in write-intensive mixed workloads. It also outperforms the others in read-intensive cases, with the exception that LMDB has the highest throughput for pure lookups using 128-byte values.

Figure 18: YCSB evaluation with 128-byte values (single-threaded).

To evaluate performance when values have varying sizes, we generated two workloads. Workload A mixes 128/256/1024-byte values, while workload B draws value sizes from the range 128-256 bytes (see Figure 20).

Figure 20: YCSB variable-length throughput (Xput.) with 128/256/1024 (*-A) or 128-256 (*-B) byte values (single-threaded).

Figure 21: YCSB evaluation.

Next we evaluate multi-threaded performance. Both LMDB and RocksDB have specific optimizations to take advantage of concurrency [12, 13]. Our lazy-trie requires no re-organization of the tree structure, which helps concurrent operations. We ran these experiments on an AWS r5d.8xlarge instance. We used one NVMe SSD and ran 80%-90% read and 20%-10% update/insert workloads with a varying number of client threads. Figure 22 shows that CedrusDB scales near-linearly until it hits the read lock bottleneck at 8 threads, with throughputs between 2 and 3 million operations per second. When it does, it outperforms RocksDB by a factor of 3-5 and LMDB by over 10 times.

5 Related Work

Various prior systems have looked into better leveraging available memory to speed up the performance of key-value stores. SILT [29] has a pipeline of three data structures to improve memory-efficiency and write-friendliness. LMDB [14] is a popular open-source key-value store that leverages a memory-mapped, copy-on-write B+-tree. CedrusDB's usage of memory-mapped storage is inspired by LMDB. There are also persistent key-value stores that only keep a fast, concurrent hash table or other indexing data structures in memory and pipe all writes directly into append-only logs [9, 28, 31, 33]. However, such architectures suffer from significant recovery overhead when a key-value store is restarted.

There has been extensive work to reduce write amplification in LSM-based persistent key-value stores. RocksDB [8, 20] is a fork of LevelDB improved by developers at Facebook. It provides more features, such as multi-threaded compaction and support for transactions. Inspired by skip lists and based on HyperLevelDB [21], PebblesDB [38] proposes the Fragmented LSM data structure, carefully choosing the SSTs during compaction to reduce amplification. LSM-trie [44] uses a static hash-trie merge structure that keeps re-organizing data for more efficient compaction. SuRF [45] uses an LSM design with a trie-based filter to optimize range queries. Accordion [7] improves the memory organization for LSM. Monkey [15] reduces the lookup cost for LSM by allocating memory to filters across different levels, minimizing the number of false positives. Dostoevsky [16] introduces lazy-leveling to remove superfluous merging. mLSM [39] is tailored for blockchain applications and significantly improves the performance of the Ethereum storage subsystem. There are also recent proposals to combine LSM and B+-tree designs.
Jungle [2] reduces update cost without sacrificing lookup cost in LSM using a B+-tree. SLM-DB [24] assumes persistent memory hardware. It uses a B+-tree for indexing and stages insertions to LSM.

Existing storage data structures have evolved in response to changes in hardware. wB+-tree [10] reduces transaction logging overhead for a B+-tree in non-volatile main memory. LB+-tree [30] optimizes index performance using 3D XPoint persistent memory. S3 [46] uses an in-memory skip-list index for a customized version of RocksDB in Alibaba Cloud. RECIPE [26] offers a principled way to convert concurrent indexes on DRAM to ones on persistent memory with crash-consistency.

Like CedrusDB, other systems have embraced the trie for in-memory indexes. ART [27] uses a radix tree that also compresses the non-diverging paths. HOT [6] uses an adaptive number of children for each node. Compared to these in-memory indexes, there are several major differences: (1) CedrusDB is designed to be persistent and has optimized the lazy-trie for an on-disk index; for example, lazy-trie tree nodes have a fixed number of slots in their child table to improve persistent storage performance. (2) To ensure near-optimal tree height like a B+-tree, instead of directly using key strings to index the trie, the lazy-trie uses fixed-length hashes, which have different statistical properties. (3) In addition to path compression, the lazy-trie employs sluggish splitting to reduce variance in path length, keeping its height comparable to that of a B+-tree.

Figure 22: Concurrent YCSB evaluation.

6 Conclusion

In this paper, we propose a novel data structure, the lazy-trie, for indexing on persistent storage. The lazy-trie has similar structural properties as a B+-tree and achieves comparable single-threaded lookup speed. Unlike a B+-tree, its maintenance does not require reorganization of the tree structure when items are inserted, and it therefore offers a significant improvement in write throughput.

To demonstrate the advantages of the lazy-trie, we built CedrusDB, a memory-mapped persistent key-value store. With comparable storage overhead, CedrusDB outperforms popular LSM- and B+-tree-based key-value stores in mixed workloads. The persistent lazy-trie index also has significant advantages for concurrent access when mapped in virtual memory.

We are currently looking into various future directions. First, the data space allocator is a major source of fragmentation and can be improved. Second, the lazy-trie is indexed by key hashes, but improving locality may enable using CedrusDB when the data set cannot reside mostly in memory. Finally, we are considering a copy-on-write version of the lazy-trie data structure to eliminate some node locking (e.g., read locks). A copy-on-write lazy-trie would also support multi-versioning [41], enabling fast snapshots and transaction isolation.

References

[1] madvise(2) - Linux manual page. http://man7.org/linux/man-pages/man2/madvise.2.html. Accessed: 2020-04-15.

[2] Jung-Sang Ahn, Mohiuddin Abdul Qader, Woon-Hak Kang, Hieu Nguyen, Guogen Zhang, and Sami Ben-Romdhane. Jungle: Towards dynamically adjustable key-value store by combining LSM-tree and copy-on-write B+-tree. In Daniel Peek and Gala Yadgar, editors. USENIX Association, 2019.
[3] Jyrki Alakuijala, Bill Cox, and Jan Wassenberg. Fast keyed hash/pseudo-random function using SIMD multiply and permute, 2016.

[4] Brian Anderson, Lars Bergstrom, Manish Goregaokar, Josh Matthews, Keegan McAllister, Jack Moffitt, and Simon Sapin. Engineering the Servo web browser engine using Rust. In Laura K. Dillon, Willem Visser, and Laurie Williams, editors, Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016 - Companion Volume, pages 81–89. ACM, 2016.

[5] Anirudh Badam, KyoungSoo Park, Vivek S. Pai, and Larry L. Peterson. HashCache: Cache storage for the next billion. In Jennifer Rexford and Emin Gün Sirer, editors, Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2009, April 22-24, 2009, Boston, MA, USA, pages 123–136. USENIX Association, 2009.

[6] Robert Binna, Eva Zangerle, Martin Pichl, Günther Specht, and Viktor Leis. HOT: A height optimized trie index for main-memory database systems. In Gautam Das, Christopher M. Jermaine, and Philip A. Bernstein, editors, Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pages 521–534. ACM, 2018.

[7] Edward Bortnikov, Anastasia Braginsky, Eshcar Hillel, Idit Keidar, and Gali Sheffi. Accordion: Better memory organization for LSM key-value stores. PVLDB, 11(12):1863–1875, 2018.

[8] Zhichao Cao, Siying Dong, Sagar Vemuri, and David H.C. Du. Characterizing, modeling, and benchmarking RocksDB key-value workloads at Facebook. Pages 209–223, Santa Clara, CA, February 2020. USENIX Association.

[9] Badrish Chandramouli, Guna Prasaad, Donald Kossmann, Justin J. Levandoski, James Hunter, and Mike Barnett. FASTER: A concurrent key-value store with in-place updates. In Gautam Das, Christopher M. Jermaine, and Philip A. Bernstein, editors, Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pages 275–290. ACM, 2018.

[10] Shimin Chen and Qin Jin. Persistent B+-trees in non-volatile main memory. PVLDB, 8(7):786–797, 2015.

[11] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Joseph M. Hellerstein, Surajit Chaudhuri, and Mendel Rosenblum, editors, Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC 2010, Indianapolis, Indiana, USA, June 10-11, 2010, pages 143–154. ACM, 2010.

[12] Symas Corporation. LMDB: Lightning memory-mapped database manager (LMDB). Accessed: 2020-05-16.

[13] Symas Corporation. MemTable - facebook/rocksdb wiki. https://github.com/facebook/rocksdb/wiki/MemTable. Accessed: 2020-05-16.

[14] Symas Corporation. Symas Lightning Memory-Mapped Database. https://symas.com/lmdb/. Accessed: 2020-04-15.

[15] Niv Dayan, Manos Athanassoulis, and Stratos Idreos. Monkey: Optimal navigable key-value store. In Semih Salihoglu, Wenchao Zhou, Rada Chirkova, Jun Yang, and Dan Suciu, editors, Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 79–94. ACM, 2017.

[16] Niv Dayan and Stratos Idreos. Dostoevsky: Better space-time trade-offs for LSM-tree based key-value stores via adaptive removal of superfluous merging. In Gautam Das, Christopher M. Jermaine, and Philip A. Bernstein, editors, Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pages 505–520. ACM, 2018.

[17] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. In Thomas C. Bressoud and M. Frans Kaashoek, editors, Proceedings of the 21st ACM Symposium on Operating Systems Principles 2007, SOSP 2007, Stevenson, Washington, USA, October 14-17, 2007, pages 205–220. ACM, 2007.

[18] Redox OS Developers. Redox - your next(gen) OS. Accessed: 2020-04-15.

[19] The Rust Project Developers. The Rustonomicon: the dark arts of advanced and unsafe Rust programming. https://doc.rust-lang.org/stable/nomicon/. Accessed: 2020-04-15.

[20] Siying Dong, Mark Callaghan, Leonidas Galanis, Dhruba Borthakur, Tony Savor, and Michael Strum. Optimizing space amplification in RocksDB. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings.

[21] HyperLevelDB. https://hackingdistributed.com/2013/06/17/hyperleveldb/. Accessed: 2020-04-20.

[22] Nabil Hachicha. SnappyDB: a fast and lightweight key/value database library for Android. Accessed: 2020-04-15.

[23] Mark S. Johnstone and Paul R. Wilson. The memory fragmentation problem: Solved? ACM SIGPLAN Notices, 34(3):26–36, 1998.

[24] Olzhas Kaiyrakhmet, Songyi Lee, Beomseok Nam, Sam H. Noh, and Young-ri Choi. SLM-DB: Single-level key-value store with persistent memory. In Arif Merchant and Hakim Weatherspoon, editors, pages 191–205. USENIX Association, 2019.

[25] Donald Ervin Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms. Pearson Education, 1997.

[26] Se Kwon Lee, Jayashree Mohan, Sanidhya Kashyap, Taesoo Kim, and Vijay Chidambaram. RECIPE: Converting concurrent DRAM indexes to persistent-memory indexes. In Tim Brecht and Carey Williamson, editors, Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP 2019, Huntsville, ON, Canada, October 27-30, 2019, pages 462–477. ACM, 2019.

[27] Viktor Leis, Alfons Kemper, and Thomas Neumann. The adaptive radix tree: ARTful indexing for main-memory databases. In Christian S. Jensen, Christopher M. Jermaine, and Xiaofang Zhou, editors, pages 38–49. IEEE Computer Society, 2013.

[28] Baptiste Lepers, Oana Balmau, Karan Gupta, and Willy Zwaenepoel. KVell: the design and implementation of a fast persistent key-value store. In Tim Brecht and Carey Williamson, editors, Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP 2019, Huntsville, ON, Canada, October 27-30, 2019, pages 447–461. ACM, 2019.

[29] Hyeontaek Lim, Bin Fan, David G. Andersen, and Michael Kaminsky. SILT: A memory-efficient, high-performance key-value store. In Ted Wobber and Peter Druschel, editors, Proceedings of the 23rd ACM Symposium on Operating Systems Principles 2011, SOSP 2011, Cascais, Portugal, October 23-26, 2011, pages 1–13. ACM, 2011.

[30] Jihang Liu, Shimin Chen, and Lujun Wang. LB+-trees: Optimizing persistent index performance on 3DXPoint memory. PVLDB, 13(7):1078–1090, 2020.

[31] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. Cache craftiness for fast multicore key-value storage. In Pascal Felber, Frank Bellosa, and Herbert Bos, editors, European Conference on Computer Systems, Proceedings of the Seventh EuroSys Conference 2012, EuroSys '12, Bern, Switzerland, April 10-13, 2012, pages 183–196. ACM, 2012.

[32] Leonardo Mármol, Jorge Guerra, and Marcos K. Aguilera. Non-volatile memory through customized key-value stores. In Nitin Agrawal and Sam H. Noh, editors. USENIX Association, 2016.

[33] Alexander Merritt, Ada Gavrilovska, Yuan Chen, and Dejan S. Milojicic. Concurrent log-structured memory for many-core key-value stores. PVLDB, 11(4):458–471, 2017.

[34] Donald R. Morrison. PATRICIA - practical algorithm to retrieve information coded in alphanumeric. J. ACM, 15(4):514–534, 1968.

[35] Giang Nguyen, Stefan Dlugolinsky, Martin Bobák, Viet D. Tran, Álvaro López García, Ignacio Heredia, Peter Malík, and Ladislav Hluchý. Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. Artif. Intell. Rev., 52(1):77–124, 2019.

[36] The nix-rust Project Developers. nix — crates.io: Rust package registry. https://crates.io/crates/nix. Accessed: 2020-04-15.

[37] The GNU Project. Asynchronous I/O (The GNU C Library). Accessed: 2020-04-15.

[38] Pandian Raju, Rohan Kadekodi, Vijay Chidambaram, and Ittai Abraham. PebblesDB: Building key-value stores using fragmented log-structured merge trees. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, October 28-31, 2017, pages 497–514. ACM, 2017.

[39] Pandian Raju, Soujanya Ponnapalli, Evan Kaminsky, Gilad Oved, Zachary Keener, Vijay Chidambaram, and Ittai Abraham. mLSM: Making authenticated storage faster in Ethereum. In Ashvin Goel and Nisha Talagala, editors. USENIX Association, 2018.

[40] Zhaoyan Shen, Feng Chen, Yichen Jia, and Zili Shao. Optimizing flash-based key-value cache systems. In Nitin Agrawal and Sam H. Noh, editors. USENIX Association, 2016.

[41] Yihan Sun, Guy E. Blelloch, Wan Shen Lim, and Andrew Pavlo. On supporting efficient snapshot isolation for hybrid workloads with multi-versioned indexes. PVLDB, 13(2):211–225, 2019.

[42] The Rust Team. Rust programming language. Accessed: 2020-05-26.

[43] Sheng Wang, Tien Tuan Anh Dinh, Qian Lin, Zhongle Xie, Meihui Zhang, Qingchao Cai, Gang Chen, Beng Chin Ooi, and Pingcheng Ruan. ForkBase: An efficient storage engine for blockchain and forkable applications. PVLDB, 11(10):1137–1150, 2018.

[44] Xingbo Wu, Yuehai Xu, Zili Shao, and Song Jiang. LSM-trie: An LSM-tree-based ultra-large key-value store for small data items. In Shan Lu and Erik Riedel, editors, pages 71–82. USENIX Association, 2015.

[45] Huanchen Zhang, Hyeontaek Lim, Viktor Leis, David G. Andersen, Michael Kaminsky, Kimberly Keeton, and Andrew Pavlo. SuRF: Practical range query filtering with fast succinct tries. In Gautam Das, Christopher M. Jermaine, and Philip A. Bernstein, editors, Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pages 323–336. ACM, 2018.

[46] Jingtian Zhang, Sai Wu, Zeyuan Tan, Gang Chen, Zhushi Cheng, Wei Cao, Yusong Gao, and Xiaojie Feng. S3: A scalable in-memory skip-list index for key-value store.