Efficient Data Management with Flexible File Address Space
Chen Chen, Wenshao Zhong, and Xingbo Wu
University of Illinois at Chicago
Abstract
Data management applications store their data using structured files in which data are usually sorted to serve indexing and queries. In order to insert or remove a record in a sorted file, the positions of existing data need to be shifted. To this end, the existing data after the insertion or removal point must be rewritten to admit the change in place, which can be unaffordable for applications that make frequent updates. As a result, applications often employ extra layers of indirections to admit changes out-of-place. However, it causes increased access costs and excessive complexity.

This paper presents a novel file abstraction, FlexFile, that provides a flexible file address space where in-place updates of arbitrary-sized data, such as insertions and removals, can be performed efficiently. With FlexFile, applications can manage their data in a linear file address space with minimal complexity. Extensive evaluation results show that a simple key-value store built on top of this abstraction can achieve high performance for both reads and writes.
Data management applications store data in files for persistent storage. The data in files are usually sorted in a specific order so that they can be correctly and efficiently retrieved in the future. However, it is not trivial to make updates such as insertions and deletions in these files. To commit in-place updates in a sorted file, existing data may need to be rewritten to maintain the order of data. For example, key-value (KV) stores such as LevelDB [18] and RocksDB [16] need to merge and sort KV pairs in their data files periodically, causing repeated rewriting of existing KV data [2, 28, 51].

It has been common wisdom to rewrite data for better access locality. By co-locating logically adjacent data on a storage device, the data can be quickly accessed in the future with a minimal number of I/O requests, which is crucial for traditional storage technologies such as HDDs. However, when managing data with new storage technologies that provide balanced random and sequential performance (e.g., Intel's Optane SSDs [21]), access locality is no longer a critical factor of I/O performance [50]. In this scenario, data rewriting becomes less beneficial for future accesses but still consumes enormous CPU and I/O resources [26, 34]. Therefore, it may not be cost-effective to rewrite data on these devices in exchange for better localities. Despite this, data management applications still need to keep their data logically sorted for efficient access. An intuitive solution is to relocate data in the file address space logically without rewriting them physically. However, this is barely feasible because of the lack of support for logically relocating data in today's files. As a result, applications have to employ additional layers of indirections to keep data logically sorted, which increases the implementation complexity and requires extra work in the user space, such as maintaining data consistency [45, 52].

We argue that if files can provide a flexible file address space where insertions and removals of data can be applied in-place, applications can keep their data sorted easily without rewriting data or employing complex indirections. With such a flexible abstraction, data management applications can delegate the data organizing jobs to the file interface, which will improve their simplicity and efficiency.

Efforts have been made to realize such an abstraction. For example, a few popular file systems—Ext4, XFS, and F2FS—have provided insert-range and collapse-range features for inserting or removing a range of data in a file [17, 20]. However, these mechanisms have not been able to help applications because of several fundamental limitations. First of all, these mechanisms have rigid block-alignment requirements. For example, deleting a record of only a few bytes from a data file using the collapse-range operation is not allowed. Second, shifting an address range is very inefficient with conventional file extent indexes. Inserting a new data segment to a file needs to shift all the existing file content after the insertion point to make room for the new data. The shift operation has O(N) cost (N is the number of blocks or extents in the file), which can be very costly due to intensive metadata updating, journaling, and rewriting. Third, commonly used data indexing mechanisms cannot keep track of shifted contents in a file.
Specifically, indexes using file offsets to record data positions are no longer usable because the offsets can be easily changed by shift operations. Therefore, a co-design of applications and file abstractions is necessary to realize the benefits of managing data on a flexible file address space.

This paper introduces FlexFile, a novel file abstraction that provides a flexible file address space for data management systems. The core of FlexFile is a B+-Tree-like data structure, named FlexTree, designed for efficient and flexible file address space indexing. In a FlexTree, it takes O(log N) time to perform a shift operation in the file address space, which is asymptotically faster than that of existing index data structures with O(N) cost. We implement FlexFile as a user-space I/O library that provides a file-like interface. It adopts log-structured space management for write efficiency and performs defragmentation based on data access locality for cost-effectiveness. It also employs logical logging [40, 55] to commit metadata updates at a low cost.

Based on the advanced features provided by FlexFile, we build FlexDB, a key-value store that demonstrates how to build simple yet efficient data management applications based on a flexible file address space. FlexDB has a simple structure and a small codebase (1.7k lines of C code). That being said, FlexDB is a fully-functional key-value store that not only supports regular KV operations like GET, SET, DELETE, and SCAN, but also integrates efficient mechanisms to support caching, concurrent access, and crash consistency.
Evaluation results show that FlexDB has substantially reduced the data rewriting overheads. It achieves up to 11.7× and 1.9× the throughput of read and write operations, respectively, compared to an I/O-optimized key-value store, RocksDB.

Modern file systems use extents to manage file address mappings. An extent is a group of contiguous blocks. Its metadata consists of three essential elements—file offset, length, and block number. Figure 1 shows an example of a file with a 96 KB address space on a file system using 4 KB blocks. This file consists of four extents. Real-world file systems employ index structures to manage extents. For example, Ext4 uses an HTree [15]. Btrfs and XFS use a B+-Tree [41, 49]. F2FS uses a multi-level mapping table [25].

Figure 1: A file with a 96 KB address space

Regular file operations such as overwrite do not modify existing mappings. An append-write to a file needs to expand the last extent in-place or add new extents to the end of the mapping index, which is of low cost. However, the insert-range and collapse-range operations with the aforementioned data structures can be very expensive due to the shifting of extents. To be specific, an insert-range or collapse-range operation needs to update the offset value of every extent after the insertion or removal point. For example, inserting a 4 KB extent at the offset of 16 KB to the example file in Figure 1 needs to update all the existing extents' metadata. Therefore, the shift operation has O(N) cost, where N is the total number of extents after the insertion or removal point.

We benchmark the file editing performance of an Ext4 file system on a 100 GB Intel Optane DC P4801X SSD. In each test, we construct a 1 GB file and measure the throughput of data writing. There are three tests, namely, PWRITE, INSERT-RANGE, and REWRITE. PWRITE starts with an empty file and uses the pwrite system call to write 4 KB blocks in random order (without overwrites). Both INSERT-RANGE and REWRITE start with an empty file and insert 4 KB data blocks at a random 4 KB-aligned offset within the already-written file range. Accordingly, each insertion shifts the data after the insertion point forward. INSERT-RANGE utilizes the insert-range of Ext4 (through the fallocate system call with mode=FALLOC_FL_INSERT_RANGE). The REWRITE test rewrites the shifted data to realize data shifting, which on average rewrites half of the existing data for each insertion.

Figure 2: Performance of random write/insertion on the Ext4 file system. The REWRITE test was terminated when 128 MB of data was written because of its low throughput.

The results are shown in Figure 2. The REWRITE test was terminated early due to its inferior performance caused by intensive data rewriting. With 128 MB of new data written before it was terminated, REWRITE caused 5.5 GB of writes to the SSD. INSERT-RANGE shows better performance than REWRITE by inserting data logically in the file address space. However, due to the inefficient shift operations in the extent index, the throughput of INSERT-RANGE dropped quickly and was eventually nearly 1000× lower than that of PWRITE. Although INSERT-RANGE does not rewrite any user data, it updates the metadata intensively and caused 25% more writes to the SSD compared to PWRITE. This number can be further increased if the application frequently calls fsync to enforce write ordering. XFS and F2FS also support the shift operations, but they exhibit much worse performance than Ext4, so their results are not included.

Extents are simple and flexible for managing variable-length data chunks. However, the block and page alignment requirements and the inefficient extent index structures in today's systems hinder the adoption of the shift operations. To make flexible file address spaces generally usable and affordable for data management applications, an efficient data shifting mechanism without the rigid alignment requirements is indispensable.
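For reference, the fragment below shows the kernel interface exercised by the INSERT-RANGE test. This is a minimal sketch for illustration only: the file name is hypothetical, the file is assumed to already exist and be larger than 16 KB, and error handling is kept to a minimum. The offset and length passed to fallocate must be multiples of the file system block size, and the call still triggers the extent shifting whose O(N) cost is discussed above.

    #define _GNU_SOURCE
    #include <fcntl.h>      /* open, fallocate, FALLOC_FL_INSERT_RANGE */
    #include <stdio.h>
    #include <unistd.h>     /* pwrite, close */

    #define BLOCK 4096      /* must match the file system block size */

    int main(void) {
        int fd = open("data.file", O_RDWR);   /* hypothetical existing 1 GB file */
        if (fd < 0) { perror("open"); return 1; }

        /* Make room for one 4 KB block at offset 16 KB.  Every extent mapping
         * beyond 16 KB must be shifted by the file system, which is where the
         * O(N) metadata cost comes from. */
        if (fallocate(fd, FALLOC_FL_INSERT_RANGE, 4 * BLOCK, BLOCK) != 0) {
            perror("fallocate(INSERT_RANGE)");  /* also fails if offset/len are unaligned */
            return 1;
        }

        /* The inserted range is a hole; fill it with the new record's data. */
        char buf[BLOCK] = {0};
        if (pwrite(fd, buf, BLOCK, 4 * BLOCK) != BLOCK) { perror("pwrite"); return 1; }

        close(fd);
        return 0;
    }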
Inserting or removing data in a file needs to shift all the existing data beyond the insertion or removal point, which causes intensive updates to the metadata of the affected extents. With regard to the number of extents in a file, the cost of shift operations can be significantly high due to the O(N) complexity in existing extent index structures.

The following introduces FlexTree, an augmented B+-Tree that supports efficient shift operations. The design of FlexTree is based on the observation that a shift operation alters a contiguous range of extents. FlexTree treats the shifted extents as a whole and applies the updates to them collectively. To facilitate this, it employs a new metadata representation scheme that stores the address information of an extent on its search path. As an extent index, it costs O(log N) time to perform a shift operation in FlexTree, and a shift operation only needs to update a few tree nodes.

Before demonstrating the design of FlexTree, we first start with an example of a B+-Tree [9] that manages a file address space in byte granularity, as shown in Figure 3a. Each extent corresponds to a leaf-node entry consisting of three elements—offset, length, and (physical) address. Each internal node contains pivot entries that separate the pointers to the child nodes. When inserting a new extent at the head of the file, every existing extent's offset and every pivot's offset must be updated because of the shift operation on the entire file.

FlexTree employs a new address metadata representation scheme that allows for shifting extents with substantially reduced changes. Figure 3b shows an example of a FlexTree that encodes the same address mappings as the B+-Tree. In FlexTree, the offset fields in extent entries and pivot entries are replaced by partial offset fields. Besides, the only structural difference is that in a FlexTree, every pointer to a child node is associated with a shift value. These shift values are used for encoding address information in cooperation with the partial offsets. The effective offset of an extent or pivot entry is determined by the sum of the entry's partial offset and the shift values of the pointers found on the search path from the root node to the entry. The search path from the root node (at level 0) to an entry at level N can be represented by a sequence ((X_0, S_0), (X_1, S_1), ..., (X_{N-1}, S_{N-1})), where X_i represents the index of the pointer at level i, and S_i represents the shift value associated with that pointer. Suppose the partial offset of an entry is P. Its effective offset E can be calculated as E = (∑_{i=0}^{N-1} S_i) + P.

Figure 3: Examples of a B+-Tree and a FlexTree that manage the same file address space ((a) B+-Tree; (b) FlexTree)

FlexTree supports basic operations such as appending extents at the end of a file and remapping existing extents, as well as advanced operations, including inserting or removing extents in the middle of a file (insert-range and collapse-range). The following explains how the address range operations execute in a FlexTree. In this section, a leaf node entry in FlexTree is denoted by a triple: (partial_offset, length, address).
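The effective-offset calculation can be illustrated with the small sketch below. The node layout and field names are simplified assumptions for exposition, not the actual FlexTree implementation; the numbers reuse the 30 + 3 = 33 case that appears later in the lookup example.

    #include <stdint.h>
    #include <stdio.h>

    #define FANOUT 4

    struct extent {                 /* leaf entry: (partial_offset, length, address) */
        uint64_t partial_offset;
        uint32_t length;
        uint64_t address;
    };

    struct node {
        int is_leaf;
        int nkeys;
        uint64_t pivots[FANOUT - 1];    /* partial offsets of pivots (internal nodes) */
        uint64_t shifts[FANOUT];        /* shift value attached to each child pointer */
        struct node *children[FANOUT];  /* used by internal nodes */
        struct extent extents[FANOUT];  /* used by leaf nodes */
    };

    /* Effective offset of a leaf entry: the sum of the shift values on the
     * search path from the root, plus the entry's partial offset. */
    static uint64_t effective_offset(const struct node *root,
                                     const int *path, int depth, int slot)
    {
        uint64_t shift_sum = 0;
        const struct node *n = root;
        for (int level = 0; level < depth; level++) {
            int x = path[level];            /* X_i: pointer index taken at this level */
            shift_sum += n->shifts[x];      /* S_i: shift value on that pointer */
            n = n->children[x];
        }
        return shift_sum + n->extents[slot].partial_offset;   /* E = sum(S_i) + P */
    }

    int main(void)
    {
        struct node leaf = { .is_leaf = 1, .nkeys = 1,
                             .extents = { { .partial_offset = 30, .length = 9, .address = 12 } } };
        struct node root = { .is_leaf = 0, .nkeys = 1 };
        root.children[0] = &leaf;
        root.shifts[0] = 3;                 /* this pointer carries a +3 shift */

        int path[] = { 0 };                 /* search path: take pointer 0 at the root */
        printf("effective offset = %llu\n", /* prints 33 = 3 + 30 */
               (unsigned long long)effective_offset(&root, path, 1, 0));
        return 0;
    }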
insert-range Operation

Inserting a new extent of length L to a leaf node z in FlexTree takes three steps. First, the operation searches for the leaf node and inserts a new entry with a partial offset P = E − (∑_{i=0}^{N-1} S_i), assuming the leaf node is not full. When inserting to the middle of an existing extent, the extent must be split before the insertion. The insertion requires a shift operation on all the extents after the new extent. In the second step, for each extent within node z that needs shifting, its partial offset is incremented by L. The remaining extents that need shifting span all the leaf nodes after node z. We observe that, if every extent within a subtree needs to be shifted, the shift value can be recorded in the pointer that points to the root of the subtree. Therefore, in the third step, the remaining extents are shifted as a whole by updating a minimum number of pointers to a few subtrees that cover the entire range. To this end, for each ancestor node of z at level i, the shift values of the pointers and the partial offsets of the pivots after the pointer at X_i are all increased by L. In this process, the updated pointers cover all the remaining extents, and the path of each remaining extent contains exactly one updated pointer. When the update is finished, every shifted extent has its effective offset increased by L. The number of updated nodes of a shift operation is bounded by the tree's height, so the operation's cost is O(log N).

Figure 4 shows the process of inserting a new extent with length 3 and physical address 89 at offset 0 in the FlexTree shown in Figure 3b. The first step is to search for the target leaf node for insertion. Because all the shift values of the pointers are 0, the effective offset of every entry is equal to its partial offset. Therefore, the target leaf node is the leftmost one, and the new extent should be inserted at the beginning of that leaf node. Then, there are three changes to be made to the FlexTree. First, a new entry (0, 3, 89) is inserted at the beginning of the target leaf node. Second, the partial offsets of the other two extents in the same leaf node are each incremented by 3. Third, following the target leaf node's path upward, the pointers to the three subtrees covering the remaining leaf nodes and the corresponding pivots are updated, as shown in the shaded areas in Figure 4. Now, the effective offset of every existing leaf entry is increased by 3.

Figure 4: Inserting a new extent in FlexTree

Similar to a B+-Tree, FlexTree splits every full node when a search travels down the tree for insertion. The split threshold in FlexTree is one entry smaller than the node's capacity because an insertion may cause an extent to be split, which leads to two entries being added to the node for the insertion. To split a node, half of the entries in the node are moved to a new node. Meanwhile, a pointer to the new node and a new pivot entry are created at the parent node. The new pointer inherits the shift value of the pointer to the old node so that the effective offsets of the moved entries remain unchanged. The new pivot entry inherits the effective offset of the median key in the old full node. The partial offset of the new pivot is calculated as the sum of the old median key's partial offset and the new pointer's inherited shift value. Figure 5 shows an example of a split operation, where the new pivot's partial offset is 38 (the old median key's partial offset plus the inherited shift value of 5).

Figure 5: An example of node splitting in FlexTree
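The shift step of the insertion can be sketched as below, reusing the toy node layout from the earlier effective-offset example. This is our simplified illustration, not the FlexTree implementation: partial offsets are bumped inside the target leaf, and each ancestor then shifts everything to the right of the taken pointer by adjusting only its shift values and pivots, so the total work is bounded by the tree height.

    #include <stdint.h>
    #include <stdio.h>

    #define FANOUT 4

    struct extent { uint64_t partial_offset; uint32_t length; uint64_t address; };

    struct node {
        int is_leaf;
        int nkeys;                          /* entries in a leaf, pivots in an internal node */
        uint64_t pivots[FANOUT - 1];
        uint64_t shifts[FANOUT];
        struct node *children[FANOUT];
        struct extent extents[FANOUT];
    };

    /* Shift everything after position `slot` in leaf `z` forward by `len`,
     * then walk back up the recorded search path.  At each ancestor, only the
     * pointers and pivots to the right of the taken pointer are touched. */
    static void shift_after_insert(struct node **path_nodes, const int *path_idx,
                                   int depth, struct node *z, int slot, uint64_t len)
    {
        for (int i = slot; i < z->nkeys; i++)            /* step 2: within the leaf */
            z->extents[i].partial_offset += len;

        for (int level = depth - 1; level >= 0; level--) {   /* step 3: ancestors */
            struct node *a = path_nodes[level];
            int x = path_idx[level];                     /* pointer taken at this level */
            for (int p = x + 1; p <= a->nkeys; p++)      /* pointers right of X_i */
                a->shifts[p] += len;
            for (int p = x; p < a->nkeys; p++)           /* pivots right of X_i */
                a->pivots[p] += len;
        }
    }

    int main(void)
    {
        struct node leaf = { .is_leaf = 1, .nkeys = 2,
            .extents = { { 0, 3, 89 }, { 3, 9, 20 } } }; /* entry (0,3,89) just inserted */
        struct node root = { .is_leaf = 0, .nkeys = 1, .pivots = { 12 } };
        root.children[0] = &leaf;

        struct node *path_nodes[] = { &root };
        int path_idx[] = { 0 };
        shift_after_insert(path_nodes, path_idx, 1, &leaf, 1, 3);
        printf("pivot now %llu, right sibling shift now %llu\n",   /* prints 15 and 3 */
               (unsigned long long)root.pivots[0],
               (unsigned long long)root.shifts[1]);
        return 0;
    }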
To retrieve the mappings of an address range in FlexTree, the operation first searches for the starting point of the range, which is a byte address within an extent. Then, it scans forward on the leaf level from the starting point to retrieve all the mappings in the requested range. The correctness of the forward scanning is guaranteed by the assumption that all extents on the leaf level are contiguous in the logical address space. However, a hole (an unmapped address range) in the logical address space can break the continuity and lead to incorrect range size calculation and wrong search results. To address this issue, FlexTree explicitly records holes as unmapped ranges using entries with a special address value.

Figure 6 shows the process of querying the address mappings from 36 to 55, a 19-byte range, in the FlexTree after the insertion in Figure 4. First, a search of logical offset 36 identifies the third leaf node. The partial offset values of the pivots in the internal nodes on the path are equal to their effective offsets (54 and 33), and the search path to the target leaf node carries the shift values +3 and +0. Although the first extent in the target leaf node has a partial offset value of 30, its effective offset is 33 (30 plus the +3 and +0 shift values on the path). The scan then walks forward on the leaf level from that extent and returns the mappings of the requested range as an array of four tuples, each containing a physical address and a length.

Figure 6: Looking up mappings from 36 to 55 in FlexTree
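The leaf-level walk can be illustrated with the short sketch below. It is a simplification under two assumptions: the extents have already been resolved to their effective offsets, and the leaf level is presented as one logically contiguous array (the extent values here are toy data, not those of the paper's figure).

    #include <stdint.h>
    #include <stdio.h>

    struct mapping { uint64_t address; uint32_t length; };

    /* Leaf-level view: extents sorted by effective offset and logically
     * contiguous; holes would appear as entries with a reserved address. */
    struct extent { uint64_t offset; uint32_t length; uint64_t address; };

    /* Collect the physical mappings that back the byte range [start, start+size). */
    static int query_range(const struct extent *ext, int n,
                           uint64_t start, uint64_t size,
                           struct mapping *out, int max_out)
    {
        int count = 0;
        uint64_t end = start + size;
        for (int i = 0; i < n && start < end; i++) {
            uint64_t e_end = ext[i].offset + ext[i].length;
            if (e_end <= start)
                continue;                            /* before the requested range */
            uint64_t skip = start - ext[i].offset;   /* clip the first extent */
            uint64_t take = (e_end < end ? e_end : end) - start;
            if (count < max_out)
                out[count] = (struct mapping){ ext[i].address + skip, (uint32_t)take };
            count++;
            start += take;
        }
        return count;                                /* number of (address, length) tuples */
    }

    int main(void)
    {
        const struct extent leaf_level[] = {         /* toy mapping for illustration */
            { 0, 12, 100 }, { 12, 20, 640 }, { 32, 8, 200 }, { 40, 24, 388 },
        };
        struct mapping out[8];
        int n = query_range(leaf_level, 4, 36, 19, out, 8);
        for (int i = 0; i < n; i++)
            printf("(%llu, %u)\n", (unsigned long long)out[i].address, out[i].length);
        return 0;
    }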
collapse-range Operation

To collapse (remove) an address range in FlexTree, the operation first searches for the starting point of the removal. If the starting point is in the middle of an extent, the extent is split so that the removal will start from the beginning of an extent. Similarly, a split is also used when the ending point is in the middle of an extent. The address range being removed will cover one or multiple extents. For each extent in the range, the extents after it are shifted backward using a process similar to the forward shifting in the insertion operation (§3.2.1). The only difference is that a negative shift value is used.

Figure 7 shows the process of removing a 9-byte address range (33 to 42) from the FlexTree in Figure 6 without leaving a hole in the address space. First, a search identifies the starting point, which is the beginning of the first extent in the third leaf node. Then the extent is removed, and the remaining extents in the leaf node are shifted backward. Finally, in the root node, the pointer to the subtree that covers the last two leaf nodes is updated with a negative shift value of −9, as shown in the shaded area in Figure 7.

Similar to a B+-Tree, FlexTree merges a node into a sibling if their total size is under a threshold after a removal. Since two nodes being merged can have different shift values in their parents' pointers, we need to adjust the partial offsets in the merged node to maintain correct effective offsets for all the entries. When merging two internal nodes, the shift values are also adjusted accordingly. Figure 8 shows an example of merging two internal nodes.

FlexTree manages extent address mappings in byte granularity. To be specific, the size of each extent in a FlexTree can be arbitrary bytes. In the implementation of FlexTree, the internal nodes have 64-bit shift values and pivots. For leaf nodes, we use 32-bit lengths, 32-bit partial offsets, and 64-bit physical addresses for extents. The largest physical address value (2^64 − 1) is reserved for unmapped address ranges. Therefore, an extent entry in FlexTree takes 16 bytes. While a 32-bit partial offset can only cover a 4 GB address range, the effective offset can address a much larger space using the sum of the shift values on the search path as the base. When a leaf node's maximum partial offset becomes too large, FlexTree subtracts a value M, which is the minimum partial offset in the node, from every partial offset of the node, and adds M to the node's corresponding shift value at the parent node. A practical limitation is that the extents in a leaf node can cover no more than a 4 GB range.
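To make the space accounting concrete, the struct below sketches one possible packing of the 16-byte leaf extent entry. The field names and the exact layout are our assumptions; only the field widths and the reserved all-ones address come from the text.

    #include <stdint.h>
    #include <stdio.h>

    /* One leaf entry of the FlexTree extent index: 32-bit partial offset,
     * 32-bit length, 64-bit physical address, i.e., 16 bytes per extent.
     * The all-ones address marks an unmapped (hole) range. */
    struct flextree_extent {
        uint32_t partial_offset;
        uint32_t length;
        uint64_t address;
    };

    #define FLEXTREE_ADDR_UNMAPPED UINT64_MAX   /* 2^64 - 1 */

    int main(void)
    {
        printf("extent entry size: %zu bytes\n", sizeof(struct flextree_extent)); /* 16 */
        struct flextree_extent hole = { 0, 4096, FLEXTREE_ADDR_UNMAPPED };
        printf("hole covers %u bytes, unmapped: %d\n",
               hole.length, hole.address == FLEXTREE_ADDR_UNMAPPED);
        return 0;
    }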
We implement FlexFile as a user-space I/O library that provides a file-like interface with FlexTree as the extent index structure. The FlexFile library manages the file address space in byte granularity, and its data and metadata are stored in regular files in a traditional file system. Each FlexFile consists of three files—a data file, a FlexTree file, and a logical log file. The user-space library implementation gives FlexFile the flexibility to perform byte-granularity space management without block alignment limitations. In the meantime, the FlexFile library can delegate the job of cache management and data storage to the underlying file system.
A FlexFile stores its data in a data file. Similar to the structures in log-structured storage systems [25, 42, 43], the data file's space is divided into fixed-size segments (4 MB in the implementation), and each new extent is allocated within a segment. Specifically, a large write operation may create multiple logically contiguous extents residing in different segments. To avoid small writes, an in-memory segment buffer is maintained, where consecutive extents are automatically merged if they are logically contiguous.
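A minimal sketch of this segment-buffered write path follows. The structure and function names are our assumptions rather than the library's actual interface: new data is appended into a 4 MB in-memory segment, and an extent that is logically and physically contiguous with the previous one is merged instead of creating a new entry.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define SEGMENT_SIZE (4u << 20)             /* 4 MB segments, as in the text */

    struct pending_extent { uint64_t logical_off; uint64_t phys_off; uint32_t length; };

    struct segment_buffer {
        uint64_t base_phys;                     /* physical offset of this segment in the data file */
        uint32_t used;                          /* bytes filled so far */
        unsigned char data[SEGMENT_SIZE];
        struct pending_extent last;             /* most recently buffered extent */
        int has_last;
    };

    /* Buffer `len` bytes destined for logical offset `loff`; merge with the
     * previous extent when both the logical and physical ranges are contiguous. */
    static int segbuf_append(struct segment_buffer *sb, uint64_t loff,
                             const void *buf, uint32_t len)
    {
        if (sb->used + len > SEGMENT_SIZE)
            return -1;                          /* caller seals this segment and opens a new one */
        uint64_t phys = sb->base_phys + sb->used;
        memcpy(sb->data + sb->used, buf, len);
        sb->used += len;

        if (sb->has_last &&
            sb->last.logical_off + sb->last.length == loff &&
            sb->last.phys_off + sb->last.length == phys) {
            sb->last.length += len;             /* merge: one extent instead of two */
        } else {
            sb->last = (struct pending_extent){ loff, phys, len };
            sb->has_last = 1;
        }
        return 0;
    }

    int main(void)
    {
        static struct segment_buffer sb = { .base_phys = 0 };
        segbuf_append(&sb, 100, "hello ", 6);
        segbuf_append(&sb, 106, "world", 5);    /* contiguous: merged into one 11-byte extent */
        printf("extent: logical %llu, len %u\n",
               (unsigned long long)sb.last.logical_off, sb.last.length);
        return 0;
    }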
Figure 7: Removing address mapping from offset 33 to 42

Figure 8: An example of node merging in FlexTree

The FlexFile library performs garbage collection (GC) to reclaim space from underutilized segments. It maintains an in-memory array to record the valid data size of each segment. A GC process scans the array to identify a set of most underutilized segments and relocates all the valid extents from these segments to new segments. Then, the FlexTree extent index is updated accordingly. Since the extents in a FlexFile can have arbitrary sizes, the GC process may produce less free space than expected because of the internal fragmentation in each segment. To address this issue, we adopt an approach used by a log-structured memory allocator [43] to guarantee that a GC process can always make forward progress. By limiting the maximum extent size to 1/K of the segment size, relocating the extents in one segment whose utilization ratio is not higher than (K−1)/K can reclaim free space for at least one new extent. Therefore, if the space utilization ratio of the data file is capped at (K−1)/K, the GC can always reclaim space from the most underutilized segment for writing new extents. To reduce the GC pressure in the implementation, we set the maximum extent size to 1/32 of the segment size (128 KB) and conservatively limit the space utilization ratio of the data file to roughly 93%. The FlexFile library also provides a flexfile_defrag interface for manually relocating a contiguous range of data in the file into new segments. We will evaluate the efficiency of the GC policy in §6.
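The victim-selection step can be sketched as follows (our illustration; the real library tracks more state): an in-memory array holds the valid bytes per segment, and GC picks the segment with the least valid data, which under the utilization cap is guaranteed to yield room for at least one maximum-sized extent.

    #include <stdint.h>
    #include <stdio.h>

    #define SEGMENT_SIZE (4u << 20)
    #define MAX_EXTENT   (SEGMENT_SIZE / 32)    /* 128 KB, 1/32 of a segment */

    /* Pick the most underutilized segment as the next GC victim. */
    static int pick_gc_victim(const uint32_t *valid_bytes, int nsegs)
    {
        int victim = 0;
        for (int i = 1; i < nsegs; i++)
            if (valid_bytes[i] < valid_bytes[victim])
                victim = i;
        return victim;
    }

    int main(void)
    {
        /* valid data per segment, maintained as extents are written and invalidated */
        uint32_t valid_bytes[4] = { SEGMENT_SIZE, 3u << 20, 1u << 20, SEGMENT_SIZE - MAX_EXTENT };
        int v = pick_gc_victim(valid_bytes, 4);
        printf("victim segment %d, reclaimable %u bytes (>= one %u-byte extent: %s)\n",
               v, SEGMENT_SIZE - valid_bytes[v], MAX_EXTENT,
               SEGMENT_SIZE - valid_bytes[v] >= MAX_EXTENT ? "yes" : "no");
        return 0;
    }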
A FlexFile maintains an in-memory FlexTree that periodically synchronizes its updates to the FlexTree file. It must ensure atomicity and crash consistency in this process. An insertion or removal operation often updates multiple tree nodes along the search path in the FlexTree. If we used a block-based journaling mechanism to commit updates, every dirtied node in the FlexTree would need to be written twice. To address this potential performance issue, we use a combination of copy-on-write (CoW) [40] and logical logging [40, 55] to minimize the I/O cost.

CoW is used to synchronize the persistent FlexTree with the in-memory FlexTree. The FlexTree file has a header at the beginning of the file that contains a version number and a root node position. A commit to the FlexTree file creates a new version of the FlexTree in the file. In the commit process, dirtied nodes are written to free space in the FlexTree file without rewriting existing nodes. Once all the updated nodes have been written, the file's header is updated atomically to make the new version persist. Once the new version has been committed, the file space used by the updated nodes in the old version can be safely reused in future commits.

Updates to the FlexTree extent index can be intensive with small insertions and removals. If every metadata update on the FlexTree directly commits to the FlexTree file, the I/O cost can be high because every commit can write multiple tree nodes in the FlexTree file. The FlexFile library adopts the logical logging mechanism [40, 55] to further reduce the metadata I/O cost. Instead of performing CoW to the persistent FlexTree on every metadata update, the FlexFile library records every FlexTree operation in a log file and only synchronizes the FlexTree file with the in-memory FlexTree when the logical log has accumulated a sufficient amount of updates. A log entry for an insertion or removal operation contains the logical offset, length, and physical address of the operation. A log entry for a GC relocation contains the old and new physical addresses and the length of the relocated extent. We allocate 48 bits for addresses, 30 bits for lengths, and 2 bits to specify the operation type. Each log entry takes 16 bytes of space, which is much smaller than the node size of FlexTree. The logical log can be seen as a sequence of operations that transforms the persistent FlexTree into the latest in-memory FlexTree. The version number of the persistent FlexTree is recorded at the head of the log. Upon a crash-restart, uncommitted updates to the persistent FlexTree can be recovered by replaying the log on the persistent FlexTree.

When writing data to a FlexFile, the data are first written to free segments in the data file. Then, the metadata updates are applied to the in-memory FlexTree and recorded in an in-memory buffer of the logical log. The buffered log entries are committed to the log file periodically or on-demand for persistence. In particular, the buffered log entries are committed after every execution of the GC process to make sure that the new positions of the relocated extents are persistently recorded. Then, the reclaimed space can be safely reused. Upon a commit to the log file, the data file must first be synchronized so that the logged operations will refer to the correct file data. When the logical log file size reaches a pre-defined threshold, or the FlexFile is being closed, the in-memory FlexTree is synchronized to the FlexTree file using the CoW mechanism. Afterward, the log file can be truncated and reinitialized using the FlexTree's new version number.

Figure 9 shows an example of the write ordering of a FlexFile. D_i and L_i represent the data write and the logical log write for the i-th file operation, respectively. At the time of "v1", the persistent FlexTree (version 1) is identical to the in-memory FlexTree. Meanwhile, the log file is almost empty and contains only a header that records the FlexTree version (version 1). Then, for each write operation, the data is written to the data file (or buffered if the data is small), and its corresponding metadata updates are logged in the logical log buffer. When the logical log buffer is full, all the file data (D_1, D_2, and D_3) are synchronized to the data file. Then the buffered log entries (L_1 + L_2 + L_3) are written to the logical log file. When the log file is full, the current in-memory FlexTree is committed to the FlexTree file to create a new version (version 2) using CoW. Once the nodes have been written to the FlexTree file, the new version number and the root node position of the FlexTree are written to the file atomically. The logical log is then cleared for recording future operations based on the new version. I/O barriers (fsync) are used before and after each logical log file commit and each FlexTree file header update to enforce the write ordering for crash consistency, as shown in Figure 9.

Figure 9: An example of write ordering in FlexFile (I/O barriers use fsync; the FlexTree file header is updated last)
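The ordering constraints can be summarized in the short sketch below. It is our simplification of the protocol just described: the function names are made up, the buffered-write helpers are left as comments, and error handling is minimal. The invariant it encodes is that data reaches the data file before the log entries that describe it, and that fsync barriers bracket each log commit and each FlexTree header update.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Write ordering for one commit cycle of a FlexFile (illustrative only).
     * data_fd, log_fd, and tree_fd are the data file, logical log file, and
     * FlexTree file; the commented helpers stand in for writing buffered bytes. */
    static int commit_logical_log(int data_fd, int log_fd)
    {
        /* 1. Data must be durable before the log entries that reference it. */
        if (fsync(data_fd) != 0) return -1;
        /* 2. Append the buffered log entries, then make them durable. */
        /* log_flush(log_fd); */
        if (fsync(log_fd) != 0) return -1;
        return 0;
    }

    static int checkpoint_flextree(int tree_fd, int log_fd)
    {
        /* 3. CoW-write all dirtied FlexTree nodes to free space in the tree file. */
        /* tree_write_dirty_nodes(tree_fd); */
        if (fsync(tree_fd) != 0) return -1;     /* barrier before the header update */
        /* 4. Atomically update the header (version number + root node position). */
        /* tree_write_header(tree_fd); */
        if (fsync(tree_fd) != 0) return -1;     /* barrier after the header update */
        /* 5. The logical log can now be truncated and tagged with the new version. */
        if (ftruncate(log_fd, 0) != 0) return -1;
        return 0;
    }

    int main(void)
    {
        int data_fd = open("/tmp/flex.data", O_CREAT | O_RDWR, 0644);
        int log_fd  = open("/tmp/flex.log",  O_CREAT | O_RDWR, 0644);
        int tree_fd = open("/tmp/flex.tree", O_CREAT | O_RDWR, 0644);
        if (data_fd < 0 || log_fd < 0 || tree_fd < 0) { perror("open"); return 1; }
        if (commit_logical_log(data_fd, log_fd) || checkpoint_flextree(tree_fd, log_fd))
            perror("commit");
        return 0;
    }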
We build FlexDB, a key-value store powered by the advanced features of FlexFile. Similar to the popular LSM-tree-based KV stores such as LevelDB [18] and RocksDB [16], FlexDB buffers updates in a MemTable and writes to a write-ahead log (WAL) for immediate data persistence. When committing updates to the persistent storage, however, FlexDB adopts a greatly simplified data model. FlexDB stores all the persistent KV data in sorted order in a table file (a FlexFile) whose format is similar to the SSTable format of LevelDB [5, 16, 18]. Instead of performing repeated compactions across a multi-level store hierarchy that causes high write amplification, FlexDB directly commits updates from the MemTable to the data file in-place at low cost, which is made possible by the flexible file address space of the FlexFile. FlexDB employs a space-efficient volatile sparse index to track positions of KV data in the table file and implements a user-space cache for fast reads.

FlexDB stores persistent KV pairs in the table file and keeps them always sorted (in lexical order by default) with in-place updates. Each KV pair in the file starts with the key and value lengths encoded with Base-128 Varint [6], followed by the key and value's raw data. A sparse KV index, whose structure is similar to a B+-Tree, is maintained in memory to enable fast search in the table file.

KV pairs in the table file are grouped into intervals, each covering a number of consecutive KV pairs. The sparse index stores an entry for each interval using the smallest key in it as the index key. Similar to FlexTree, the sparse index encodes the file offset of an interval using the partial offset and the shift values on its search path. Specifically, each leaf node entry contains a partial offset, and each child pointer in internal nodes records a shift value. The effective file offset of an interval is the sum of its partial offset and the shift values on its search path. A search of a key first identifies the target interval with a binary search on the sparse index.
Then the target file range is calculated using the shift values reached on the search path and the metadata in the target entry. Finally, the search scans the interval to find the KV pair. Figure 10 shows an example of the sparse KV index with four intervals. The first interval does not need an index key. The index keys of the other three intervals are "bit", "foo", and "pin", respectively. A search of "kit" reaches the third interval ("foo" < "kit" < "pin") at offset 64 (0+64).

Figure 10: An example of the sparse KV index in FlexDB

When inserting (or removing) a KV pair in an interval, the offsets of all the intervals after it need to be shifted so that the index can stay in sync with the table file. The shift operation is similar to that in a FlexTree. First, the operation updates the partial offsets of the intervals in the same leaf node. Then, the shift values on the path to the target leaf node are updated. Different from FlexTree, the partial offsets in the sparse index are not the search keys but the values in leaf node entries. Therefore, none of the index keys and pivots are modified in the shifting process. An update operation that resizes a KV pair is performed by removing the old KV pair with collapse-range and inserting the new one at the same offset with insert-range.

To insert a new key "cat" with the value "abcd" in the FlexDB shown in Figure 10, a search first identifies the interval at offset 42 whose index key is "bit". Assuming the new KV item's size is 9 bytes, we insert it into the table file between keys "bit" and "far" and shift the intervals after it forward by 9. As shown at the bottom of Figure 10, the effective offsets of the intervals "foo" and "pin" are both incremented by 9.

Similar to a B+-Tree, the sparse index needs to split a large node or merge two small nodes when their sizes reach specific thresholds. FlexDB also defines minimum and maximum size limits for an interval. The limits can be specified by the total data size in bytes or by the number of KV pairs. An interval whose size exceeds the maximum size limit will be split, with a new index key inserted in the sparse index.

Updates in FlexDB are first buffered in a MemTable. The MemTable is a thread-safe skip list that supports concurrent access by one writer and multiple readers. When the MemTable is full, it is converted to an immutable MemTable, and a new MemTable is created to receive new updates. Meanwhile, a background thread is activated to commit the updates to the table file and the sparse index. A lookup in FlexDB first checks the MemTable and the immutable MemTable (if it exists). If the key is not found in the MemTables, the lookup queries the sparse KV index to find the key in the table file. When the committer thread is active, it requires exclusive access to the sparse index and the table file to prevent inconsistent data or metadata from being reached by readers. To this end, a reader-writer lock is used for the committer thread to block the readers when necessary. For balanced performance and responsiveness, the committer thread releases and reacquires the lock every 4 MB of committed data so that readers can be served quickly without waiting for the completion of a committing process.
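The commit of a single MemTable entry into the sorted table file can be sketched as below. The FlexFile entry points used here (flexfile_collapse_range and flexfile_insert_range) are hypothetical names and signatures, stubbed out so the sketch runs; only flexfile_defrag and flexfile_read_extent are named in the text. The point of the sketch is that a resizing update is a removal plus an insertion at the same offset, so surrounding records are never rewritten.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    struct flexfile { uint64_t size; };     /* toy handle, real one is opaque */

    static int flexfile_collapse_range(struct flexfile *ff, uint64_t off, size_t len)
    {
        printf("collapse-range: off=%llu len=%zu\n", (unsigned long long)off, len);
        ff->size -= len;
        return 0;
    }

    static int flexfile_insert_range(struct flexfile *ff, uint64_t off,
                                     const void *buf, size_t len)
    {
        (void)buf;
        printf("insert-range:   off=%llu len=%zu\n", (unsigned long long)off, len);
        ff->size += len;
        return 0;
    }

    /* Commit one MemTable entry.  `off` is the byte offset of the existing
     * record for this key (or its insertion point) found via the sparse index;
     * `old_len` is 0 for a brand-new key; encoded_kv == NULL means a deletion. */
    static int commit_kv(struct flexfile *table, uint64_t off, size_t old_len,
                         const void *encoded_kv, size_t new_len)
    {
        if (old_len > 0 && flexfile_collapse_range(table, off, old_len) != 0)
            return -1;
        if (encoded_kv == NULL)
            return 0;
        return flexfile_insert_range(table, off, encoded_kv, new_len);
    }

    int main(void)
    {
        struct flexfile table = { .size = 1024 };
        commit_kv(&table, 42, 0, "\x03\x04" "catabcd", 9);  /* SET("cat","abcd"): 9 bytes */
        commit_kv(&table, 42, 9, NULL, 0);                  /* DELETE("cat") */
        printf("table size now %llu\n", (unsigned long long)table.size);
        return 0;
    }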
Real-world workloads often exhibit skewed access patterns [1, 4, 53]. Many popular KV stores employ user-space caching to exploit the access locality for improved search efficiency [7, 16, 19]. FlexDB adopts a similar approach by caching frequently used intervals of the table file in memory. The cache (namely the interval cache) uses the CLOCK replacement algorithm and a write-through policy. Every interval's entry in the sparse index contains a cache pointer that is initialized as NULL to represent an uncached interval. Upon a cache miss, the data of the interval is loaded from the table file, and an array of KV pairs is created based on the data. Then, the cache pointer is updated to point to a new cache entry containing the array, the number of KV pairs, and the interval size. A lookup on a cached interval can perform a binary search on the array.
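A cached-interval lookup reduces to a binary search over the decoded array, as in the small sketch below (an illustrative structure, not FlexDB's actual one; the real entry also carries CLOCK state and the interval's file range).

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    struct kv { const char *key; const char *value; };

    struct cache_entry {
        struct kv *kvs;        /* KV pairs of the interval, in sorted key order */
        int nkeys;             /* number of KV pairs */
        uint32_t bytes;        /* interval size in the table file */
    };

    /* Point lookup within a cached interval: plain binary search on the array. */
    static const char *cached_get(const struct cache_entry *ce, const char *key)
    {
        int lo = 0, hi = ce->nkeys - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int cmp = strcmp(ce->kvs[mid].key, key);
            if (cmp == 0) return ce->kvs[mid].value;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return NULL;           /* not in this interval */
    }

    int main(void)
    {
        struct kv kvs[] = { {"bar", "1"}, {"bit", "2"}, {"far", "3"}, {"kit", "4"} };
        struct cache_entry ce = { kvs, 4, 64 };
        printf("bit -> %s, cat -> %s\n", cached_get(&ce, "bit"),
               cached_get(&ce, "cat") ? cached_get(&ce, "cat") : "(miss)");
        return 0;
    }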
Upon a restart, FlexDB first recovers the uncommitted KV data from the write-ahead log. Then, it constructs the volatile sparse KV index of the table file. Intuitively, the sparse index can be built by sequentially scanning the KV pairs in the table file, but the cost can be significant for a large store. In fact, the index rebuilding only requires an index key for each interval. Therefore, a sparse index can be quickly constructed by skipping a certain number of extents every time an index key is determined.

In FlexDB, extents in the underlying table file are created by inserting or removing KV pairs, which guarantees that an extent in the file always begins with a complete KV pair. To identify a KV pair in the middle of a table file without knowing its exact offset, we add a flexfile_read_extent(off, buf, maxlen) function to the FlexFile library. A call to this function searches for the extent at the given file offset and reads up to maxlen bytes of data at the beginning of the extent. The extent's logical offset (≤ off) and the number of bytes read are returned. To build a sparse index, read_extent is used to retrieve a key at each approximate interval offset (4 KB, 8 KB, ...), and these keys are used as the index keys of the new intervals. FlexDB can immediately start processing requests once the sparse index is built. Upon the first access of a new interval, it can be split if it contains too many keys. Accordingly, the rebuilding cost can be effectively reduced by increasing the interval size.
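The rebuild loop can be illustrated with the sketch below. The text names flexfile_read_extent but not its C signature, so the signature, the 1-byte varints, and the toy in-memory "table file" used to make the sketch runnable are all our assumptions.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Toy stand-in for the table file: three extents, each starting with a
     * complete KV record of the form "varint klen, varint vlen, key, value". */
    struct toy_extent { uint64_t off; const char *bytes; size_t len; };
    static const struct toy_extent toy_table[] = {
        { 0,    "\x03\x01" "act" "1", 6 },
        { 4100, "\x03\x01" "far" "2", 6 },
        { 8220, "\x03\x01" "pin" "3", 6 },
    };

    /* Assumed signature for flexfile_read_extent(off, buf, maxlen): find the
     * extent covering `off`, copy its head into `buf`, report its logical
     * offset (<= off), and return the byte count.  Stubbed against toy_table. */
    static long flexfile_read_extent(uint64_t off, void *buf, size_t maxlen,
                                     uint64_t *extent_off)
    {
        for (int i = 2; i >= 0; i--) {
            if (toy_table[i].off <= off) {
                size_t n = toy_table[i].len < maxlen ? toy_table[i].len : maxlen;
                memcpy(buf, toy_table[i].bytes, n);
                *extent_off = toy_table[i].off;
                return (long)n;
            }
        }
        return -1;
    }

    int main(void)
    {
        /* Rebuild the sparse index: probe one key every ~4 KB instead of
         * scanning every record; each probed extent begins with a full KV pair. */
        for (uint64_t off = 4096; off < 12288; off += 4096) {
            char buf[64];
            uint64_t eoff = 0;
            long n = flexfile_read_extent(off, buf, sizeof(buf), &eoff);
            if (n > 2) {
                char key[32] = {0};
                memcpy(key, buf + 2, (size_t)buf[0]);   /* 1-byte varints in this toy */
                printf("index key \"%s\" for interval near offset %llu\n",
                       key, (unsigned long long)eoff);
            }
        }
        return 0;
    }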
In this section, we experimentally evaluate FlexTree, the FlexFile library, and FlexDB. In the evaluation of FlexTree and the FlexFile library, we focus on the performance of the enhanced insert-range and collapse-range operations as well as the performance metrics of regular file operations. For FlexDB, we use a set of microbenchmarks and YCSB [8]. Experiments in this section are run on a workstation with dual Intel Xeon Silver 4210 CPUs and 64 GB of DDR4 ECC RAM. The persistent storage for all tests is an Intel Optane DC P4801X SSD with 100 GB capacity. The workstation runs a 64-bit Linux OS with kernel version 5.9.14.

In this subsection, we evaluate the performance of the FlexTree index structure and compare it with a regular B+-Tree and a sorted array. The B+-Tree has the structure shown in Figure 3a, where a shift operation can cause intensive node updates. The sorted array is allocated with malloc and contains a sorted sequence of extents. A shift operation in it also requires intensive data movements in memory, but a lookup of an extent in the array can be done efficiently with a binary search.

We benchmark three operations—append, lookup, and insert. An append experiment starts with an empty index. Each operation appends a new extent after all the existing extents. Once the appends are done, we run a lookup experiment by sequentially querying every extent. Note that every lookup performs a complete search on the index instead of simply walking on the leaf nodes or the array. An insert experiment starts with an empty index. Each operation inserts a new extent before all the existing extents.

Table 1 shows the execution time of each experiment. For appends, the sorted array outperforms FlexTree and B+-Tree because appending new extents at the end of an array is trivial and does not require node splits or memory allocations. FlexTree is about 10% slower than B+-Tree. The overhead mainly comes from calculating the effective offset of each accessed entry on the fly.
Table 1: Execution time of the extent metadata operations
 Operation: Append, Lookup, Insert. Execution time (seconds):
 FlexTree:     7.83, 91.1, 9.47, 99.2
 B+-Tree:      7.01, 82.8, 7.91, 88.9
 Sorted Array: 4.49, 55.1, 8.56, 86.8

The results of the lookups are similar to those of the appends. However, for each index, the lookups take a longer time than the corresponding appends. The reason is that the index is small at the beginning of an append test. Appending to a small index is faster than appending to a large index, which reduces the total execution time. In a lookup experiment, every search operation is performed on a large index with a consistently high cost.

FlexTree's address metadata representation scheme allows for much faster extent insertions with a small loss in lookup speed. The B+-Tree and the sorted array show extremely high overhead due to the intensive memory writes and movements. To be specific, every time an extent is inserted at the beginning, the entire mapping index is rewritten. FlexTree maintains a consistent O(log N) cost for insertions, which is asymptotically and practically faster.

In this section, we evaluate the efficiency of file I/O operations in the FlexFile library (referred to as FlexFile in this section) and compare it with the file abstractions provided by two file systems, Ext4 and XFS. The FlexFile library is configured to run on top of an XFS file system. The XFS and Ext4 file systems are formatted using the mkfs command with the default configurations. File system benchmarks that require hierarchical directory structures do not apply to FlexFile because FlexFile manages file address spaces for single files. As a result, the evaluation focuses on file I/O operations.

Each experiment consists of a write phase and a read phase. There are three write patterns for the write phase—random insert (using insert-range), random write, and sequential write. The first two patterns are the same as the INSERT-RANGE and PWRITE tests described in Section 2, respectively. The sequential write pattern writes 4 KB blocks sequentially. A write phase starts with a newly formatted file system and creates a 1 GB file by writing or inserting 4 KB data blocks using the respective pattern. After the write phase, we measure the read performance with two read patterns—sequential and random. Each read operation reads a 4 KB block from the file. The random pattern uses randomly shuffled offsets so that it reads each block exactly once with 1 GB of reads. For each read pattern, the kernel page cache is first cleared. Then the program reads the entire file twice and measures the throughput before and after the cache is warmed up. The results are shown in Table 2. We also calculate the write amplification (WA) ratio of each write pattern using the S.M.A.R.T. data reported by the SSD. The following discusses the key observations on the experimental results.
Fast Inserts and Writes
Insertions in FlexFile have a low cost (O(log N)). As shown in Table 2, FlexFile's random-insert throughput is more than 500× higher than Ext4's (491 vs. 0.87 Kops/sec) and four orders of magnitude higher than XFS's. Figure 11a shows the throughput variations during the random-insert experiments. FlexFile maintains a constantly high throughput, while Ext4 and XFS both suffer extreme throughput degradations because of the growing index sizes that lead to increasingly intensive metadata updates. The write throughput of FlexFile is higher than that of Ext4 and XFS for both random and sequential writes. The reason is that the FlexFile library commits writes to the data file in the unit of segments, which enables batching and buffering in the user space. Meanwhile, the FlexFile library adopts log-structured writes in the data file, which transforms random writes on the FlexFile into sequential writes on the SSD. This allows FlexFile to gain a higher throughput at the beginning of the random insert experiment.

Figure 11: The random insert experiment ((a) the write phase; (b) the sequential read phase)
Low Write Amplification
FlexFile maintains a low write amplification ratio (1.03 to 1.04) throughout all the write experiments. In the random and sequential write experiments, the WA ratio of FlexFile is slightly higher than those of Ext4 and XFS due to the extra data it writes in the logical log and the persistent FlexTree. Ext4 and XFS show higher WA ratios in the random insert experiments because of the overheads of metadata updates for file space allocation. In Ext4 and XFS, each insert operation updates on average 50% of the existing extents' metadata, which causes intensive metadata I/O. In FlexFile, the data file is written sequentially, and each insert operation updates O(log N) FlexTree nodes. Additionally, the logical logging in the FlexFile library substantially reduces metadata writes under random inserts.
Read Performance
All the systems exhibit good read throughput with a warm cache and low random read throughput with a cold cache. However, we observe that FlexFile also shows low throughput in sequential reads with a cold cache on a file constructed by random writes or random inserts. This is because random writes and random inserts generate highly fragmented layouts in the FlexFile where consecutive extents reside in different segments in the underlying data file. Therefore, it cannot benefit from prefetching in the OS kernel, and it costs extra time to perform extent index querying.
Table 2: I/O performance of FlexFile, Ext4, and XFS

                               Random Insert        Random Write         Seq. Write
                             Flex   Ext4   XFS    Flex   Ext4   XFS    Flex   Ext4   XFS
 Write Tput (Kops/sec)        491   0.87   0.01    489    242    359    493    304    460
 Write Amp. Ratio            1.04   1.28   5.53   1.04   1.02   1.02   1.03   1.02   1.02
 Read Throughput (Kops/sec)
   Seq. Cold                   95    221    240     95    445    441    449    482    443
   Seq. Warm                  737   1053   1028    731   1079   1044    921   1048   1050
   Rand. Cold                  92     87     92     92     87     96     95    100     95
   Rand. Warm                 571    803    808    569    836    823    648    804    824
Table 3: Synthetic KV datasets using real-world KV sizes

 Dataset                        ZippyDB [4]   UDB [4]   SYS [1]
 Avg. Key Size (Bytes)               48          27        28
 Avg. Value Size (Bytes)             43         127       396
 Number of KV Pairs (∼40 GB)        470 M       280 M     100 M
Figure 11b shows the throughput variations of the sequential read phase before and after the cache is warmed up. Ext4 and XFS maintain a constant throughput in the first 1 GB with optimal prefetching. Meanwhile, FlexFile shows a much lower throughput at the beginning due to the random reads in the underlying data file. FlexFile's throughput keeps improving as the hit ratio increases. After 1 GB of reads, all the systems show high and stable throughput.
We evaluate the performance of FlexDB through various experiments and compare it with Facebook's RocksDB [16], an I/O-optimized KV store. We also evaluated LMDB [47], a B+-Tree-based KV store, TokuDB [35], a Bε-Tree-based KV store, and KVell [26], a highly-optimized KV store for fast SSDs. However, LMDB and TokuDB exhibit consistently low performance compared with RocksDB. Similar observations are also reported in recent studies [13, 19, 34]. KV insertions in KVell abort when a new key shares a common prefix longer than 8 bytes with an existing key, since KVell only records up to eight bytes of a key for space-saving. Several state-of-the-art KV stores [7, 23, 24, 54] are either built for byte-addressable non-volatile memory or not publicly available for evaluation. Therefore, we focus on the comparison between FlexDB and RocksDB in this section.

For a fair comparison, both FlexDB and RocksDB are configured with a 256 MB MemTable and an 8 GB user-space cache. FlexDB is configured with a maximum interval size of 8 KV pairs. RocksDB is tuned as suggested by its official tuning guide (following the configurations for "Total ordered database, flash storage") [39].

We generate synthetic KV datasets using the representative KV sizes of Facebook's production workloads [1, 4]. Table 3 shows the details of the datasets. The size of each dataset is about 40 GB, which is approximately 5× the size of the user-level cache in both systems. The workloads are generated using three key distributions—sequential, Zipfian, and Zipfian-Composite.

Write
We first evaluate the write performance of the two stores. In each experiment, each thread inserts 25% (approx. 10 GB) of the dataset into an empty store following the key distribution. For sequential writes, the dataset is partitioned into four contiguous ranges, and each thread inserts one range of KVs. We run experiments with the KV datasets described in Table 3 and different key distributions. For the Zipfian and Zipfian-Composite distributions, insertions can overwrite previously inserted keys, which leads to reduced write I/O.

Figure 12: Benchmark results of FlexDB ((a) insertion throughput; (b) SSD writes; (c) query throughput; (d) GC overhead). Key distributions: S – Sequential; Z – Zipfian; C – Zipfian-Composite.

Figure 12a shows the measured throughput of FlexDB and RocksDB in the experiments. The sizes of data written to the SSD are shown in Figure 12b. FlexDB outperforms RocksDB in every experiment. The advantage mainly comes from FlexDB's capability of directly inserting new data into the table file without requiring further rewrites. In contrast, RocksDB needs to merge and sort the tables when moving the KV data across the multi-level structure, which generates excessive rewrites that consume extra CPU and I/O resources. As shown in Figure 12b, RocksDB's write I/O is up to 2× that of FlexDB.

In every experiment with sequential writes, FlexDB generates about 90 GB of write I/O to the SSD, where about 40 GB are in the WAL and the rest are in the FlexFile. However, RocksDB writes up to 175 GB in these experiments. The root cause of the excessive writes is that every newly written table in RocksDB spans a long range in the key space with keys clustered at four distant locations, making the new tables overlap with each other. Consequently, RocksDB has to perform intensive compactions to sort and merge these tables.

Read
We measure the read performance of FlexDB and RocksDB. To emulate a randomized data layout in real-world KV stores, we first load a store with the UDB dataset and perform 100 million random updates following the Zipfian distribution. We then measure the point query (GET) and range query (SCAN) performance of the store. Each read experiment runs for 60 seconds. Figure 12c shows the performance results. For GET operations, FlexDB consistently outperforms RocksDB because it can quickly find the target key using a binary search on the sparse index. In contrast, a GET operation in RocksDB needs to check the tables on multiple levels. In this process, many key comparisons are required for selecting candidate tables on the search path. The advantage of FlexDB is particularly high with sequential GETs because a sequence of GET operations can access the keys in an interval with at most one miss in the interval cache.

The advantage of FlexDB is also significant in range queries. We test the SCAN performance of both systems using the Zipfian distribution. As shown on the right of Figure 12c, FlexDB outperforms RocksDB by roughly 8× to 11.7× because of its capability of quickly searching on the sparse index and retrieving the KV data with optimal cache locality.

GC overhead
We evaluate the impact of the FlexFile GC activities on FlexDB using an update-intensive experiment. In a run of the experiment, each thread performs 100 million KV updates to a store containing the UDB dataset (42 GB). We first run experiments with a 90 GB FlexFile data file, which represents the scenario of modest space utilization and low GC overhead. For comparison, we run the same experiments with a 45 GB data file size that exhibits a 93% space utilization ratio and causes intensive GC activities in the FlexFile. The results are shown in Figure 12d. The intensive GC shows a negligible impact on both throughput and I/O with the sequential and Zipfian workloads. In this scenario, the GC process can easily find near-empty segments because the frequently updated keys are often co-located in the data file with a good spatial locality. Comparatively, the Zipfian-Composite distribution has a much weaker spatial locality, which leads to intensive rewrites in the GC process under the high space utilization ratio. With this workload, the intensive GC causes 10.3% more writes on the SSD and 19.5% lower throughput in FlexDB compared to the low-GC configuration.
Recovery
This experiment evaluates the recovery speed of FlexDB (described in §5.4) with a clean page cache. For a store containing the UDB dataset (42 GB), using a small rebuilding interval size of 4 KB leads to a recovery time of 27.2 seconds. Increasing the recovery interval size to 256 KB reduces the recovery time to 2.5 seconds. For comparison, rebuilding the sparse index by sequentially scanning the table file costs 93.7 seconds. There is a penalty when accessing a large interval for the first time. In practice, the penalty can be reduced by promptly loading and splitting large intervals in the background using spare bandwidth.
YCSB Benchmark
YCSB [8] is a popular benchmark that evaluates KV store performance using realistic workloads. In each experiment, we sequentially load the whole UDB dataset (42 GB) into the store and run the YCSB workloads from A to F. Each workload is run for 60 seconds. The details of the YCSB workloads are shown in Table 4. A scan in workload E performs a seek and retrieves 50 KV pairs.

Table 4: YCSB workloads

 Workload       A         B         C         D        E         F
 Distribution   Zipfian   Zipfian   Zipfian   Latest   Zipfian   Zipfian
 Operations     50% U     5% U      100% R    5% I     5% I      50% R
                50% R     95% R               95% R    95% S     50% M
 I: Insert; U: Update; R: Read; S: Scan; M: Read-Modify-Write.

Figure 13: YCSB benchmark with KV sizes in UDB ((a) 42 GB store size, 64 GB RAM; (b) 84 GB store size, 16 GB RAM)

Figure 13a shows the benchmark results. In read-dominated workloads, including B, C, and E, FlexDB shows higher throughput than RocksDB (up to roughly 7×). This is especially the case in workload E because of FlexDB's advantage on range queries. Workload D is also read-dominated, but FlexDB shows a similar performance to RocksDB. The reason is that the latest access pattern produces a strong temporary access locality. Therefore, most of the requests are served from the MemTable(s), and both stores achieve a high throughput of over 2.6 Mops/sec.

In write-dominated workloads, including A and F, FlexDB also outperforms RocksDB, but the performance advantage is not as high as that in read-dominated workloads. In RocksDB, the compactions are asynchronous, and they do not block read requests. In the current implementation of FlexDB, when a committer thread is actively merging updates into the FlexFile, readers that reach the sparse index and the FlexFile can be temporarily blocked (see §5.2). Meanwhile, the MemTables can still serve user requests. This is the main limitation of the current implementation of FlexDB.

We also run the YCSB benchmark in an out-of-core scenario. In this experiment, we limit the available memory size to 16 GB (using cgroups) and double the size of the UDB dataset to 84 GB (560 M keys). With this setup, only about 10% of the file data can reside in the OS page cache. The benchmark results are shown in Figure 13b. Both systems show reduced throughput in all the workloads due to the increased I/O cost. Compared to the first YCSB experiment, the advantage of FlexDB is reduced in read-dominated workloads B, C, and E, because a miss in the interval cache requires multiple reads to load the interval. That said, FlexDB still achieves 1.3× to 4.0× speedups over RocksDB.

Data-management Systems
Studies on improving I/O efficiency in data-management systems are abundant [10, 56]. B-tree-based KV stores [31, 32, 47] support efficient searching with minimum read I/O but have suboptimal performance under random writes because of the in-place updates [27]. LSM-Tree [33] uses out-of-place writes and delayed sorting to improve write performance, and it has been widely adopted in modern write-optimized KV stores [16, 18]. However, the improved write efficiency comes at the cost of high overheads on read operations, since a search may query multiple tables at different locations [29, 44]. To compensate for reads, LSM-tree-based KV stores need to rewrite table files periodically using a compaction process, which in turn offsets the benefit of out-of-place writes [11, 12, 36, 38, 51]. KVell [26] indexes all the KV pairs in a volatile B+-Tree and leaves KV data unsorted on disk for fast reads and writes. However, maintaining a volatile full index leads to a high memory footprint and a lengthy recovery process. SplinterDB [7] employs a Bε-Tree for efficient writes by logging unsorted KV pairs in every tree node. However, the unsorted node layout causes slow reads, especially for range queries [7]. Recent works [23, 24, 54] also employ byte-addressable non-volatile memories for fast access and persistence. These solutions require non-trivial implementation, including space allocation, garbage collection, and maintaining crash consistency, which overlaps the core duties of a file system. FlexDB delegates the challenging data organizing tasks to the mechanisms behind the file interface, which effectively reduces the application's complexity. Managing persistently sorted KV data with efficient in-place updates achieves fast reads and writes at low cost.

Address Space Management
Modern in-kernel file systems, such as Ext4, XFS, Btrfs, and F2FS, use B+-Trees and their variants or multi-level mapping tables to index file extents [15, 25, 41, 48]. These file systems provide comprehensive support for general file management tasks but exhibit suboptimal performance in metadata-intensive workloads, such as massive file creation, frequent small writes, and the insert-range and collapse-range operations that require data shifting. Recent studies employ write-optimized data structures in file systems to improve metadata management performance. Specifically, BetrFS [22, 55], TokuFS [14], WAFL [30], TableFS [37], and KVFS [46] use write-optimized indexes, including the Bε-Tree [3] and the LSM-Tree [33], to manage file system metadata. Their designs exploit the advantages of these indexes and successfully optimize many file system metadata and file I/O operations. However, these systems still employ a traditional file address space abstraction and do not allow data to be freely moved or shifted in the file address space. Therefore, rearranging file data in these systems still relies on rewriting existing data. The design of FlexFile removes this fundamental limitation from the traditional file address space abstraction. Leveraging the efficient shift operations for logically reorganizing data in files, applications built on FlexFile can easily avoid data rewriting in the first place.
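As a concrete point of comparison, the insert-range and collapse-range operations mentioned above are invoked through fallocate(2) [17]. The sketch below shows the interface; the file name and offsets are illustrative, and the block-alignment requirement noted in the comments is what prevents record-granularity insertions and removals.

```c
/* Sketch: range insertion/removal with the existing in-kernel interface
 * on Ext4/XFS/F2FS. "data.sorted" and the offsets are illustrative. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.sorted", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Remove one 4 KiB block at offset 64 KiB; all later content shifts
     * toward the start of the file. The offset and length must be
     * multiples of the file system block size, so a record of only a few
     * bytes cannot be removed this way. */
    if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 64 * 1024L, 4096) != 0)
        perror("collapse-range");

    /* Make room for 4 KiB at the same offset; later content shifts toward
     * the end of the file, and the new range reads as zeroes until it is
     * overwritten. */
    if (fallocate(fd, FALLOC_FL_INSERT_RANGE, 64 * 1024L, 4096) != 0)
        perror("insert-range");

    close(fd);
    return 0;
}
```

FlexFile's flexible address space avoids both the alignment restriction and the costly data shifting behind these calls.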
Conclusion
This paper presents a novel file abstraction, FlexFile, that enables lightweight in-place updates in a flexible file address space. It allows applications to perform efficient data management on a linear file layout with a simplified implementation. FlexDB, a KV store built on FlexFile with a small codebase, achieves speedups of up to 11.7× for reads and 1.9× for writes compared with the highly optimized RocksDB.

References
[1] Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. "Workload Analysis of a Large-Scale Key-Value Store". In: SIGMETRICS Perform. Eval. Rev. 2017, pp. 363–375.
[3] Michael A. Bender, Martin Farach-Colton, William Jannen, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, and Yang Zhan. "An Introduction to Bε-trees and Write-Optimization". In: ;login: Magazine. 2020, pp. 209–223.
[5] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. "Bigtable: A Distributed Storage System for Structured Data". In: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI'06). 2006, p. 15.
[6] DWARF Debugging Information Format Committee. DWARF Debugging Information Format Version 5. 2017.
[7] Alexander Conway, Abhishek Gupta, Vijay Chidambaram, Martin Farach-Colton, Richard Spillane, Amy Tai, and Rob Johnson. "SplinterDB: Closing the Bandwidth Gap for NVMe Key-Value Stores". In: 2020, pp. 49–63.
[8] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. "Benchmarking Cloud Serving Systems with YCSB". In: Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC'10). 2010, pp. 143–154.
[9] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Third Edition. 3rd ed. The MIT Press, 2009.
[10] Ali Davoudian, Liu Chen, and Mengchi Liu. "A Survey on NoSQL Stores". In: ACM Comput. Surv.
[11] Proceedings of the 2018 International Conference on Management of Data (SIGMOD'18). 2018, pp. 505–520.
[12] Niv Dayan and Stratos Idreos. "The Log-Structured Merge-Bush & the Wacky Continuum". In: Proceedings of the 2019 International Conference on Management of Data (SIGMOD'19). 2019, pp. 449–466.
[13] Siying Dong, Mark Callaghan, Leonidas Galanis, Dhruba Borthakur, and Tony Savor. "Optimizing Space Amplification in RocksDB". In: The Conference on Innovative Data Systems Research (CIDR'17). Vol. 3. 2017, p. 3.
[14] John Esmet, Michael A. Bender, Martin Farach-Colton, and Bradley C. Kuszmaul. "The TokuFS Streaming File System". In: Proceedings of the 4th USENIX Conference on Hot Topics in Storage and File Systems (HotStorage'12). 2012, p. 14.
[15] Ext4 Disk Layout. https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout.
[16] Facebook. RocksDB. https://rocksdb.org.
[17] fallocate(2) — Linux manual page.
[18] Sanjay Ghemawat and Jeff Dean. LevelDB. https://github.com/google/leveldb.
[19] Eran Gilad, Edward Bortnikov, Anastasia Braginsky, Yonatan Gottesman, Eshcar Hillel, Idit Keidar, Nurit Moscovici, and Rana Shahout. "EvenDB: Optimizing Key-Value Storage for Spatial Locality". In: Proceedings of the Fifteenth European Conference on Computer Systems (EuroSys'20). 2020.
[20] Inserting a hole into a file. https://lwn.net/Articles/629965/.
[21] Intel® Optane™ Technology.
[22] William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Esmet, Yizheng Jiao, Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Leif Walsh, Michael A. Bender, Martin Farach-Colton, Rob Johnson, Bradley C. Kuszmaul, and Donald E. Porter. "BetrFS: Write-Optimization in a Kernel File System". In: ACM Trans. Storage.
[23] Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST'19). 2019, pp. 191–204.
[24] Sudarsun Kannan, Nitish Bhat, Ada Gavrilovska, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. "Redesigning LSMs for Nonvolatile Memory with NoveLSM". In: Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC'18). 2018, pp. 993–1005.
[25] Changman Lee, Dongho Sim, Joo-Young Hwang, and Sangyeun Cho. "F2FS: A New File System for Flash Storage". In: Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST'15). 2015, pp. 273–286.
[26] Baptiste Lepers, Oana Balmau, Karan Gupta, and Willy Zwaenepoel. "KVell: The Design and Implementation of a Fast Persistent Key-Value Store". In: Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP'19). 2019, pp. 447–461.
[27] Yinan Li, Bingsheng He, Robin Jun Yang, Qiong Luo, and Ke Yi. "Tree Indexing on Solid State Drives". In: Proc. VLDB Endow. 2016, pp. 133–148.
[29] Chen Luo and Michael J. Carey. "LSM-based storage techniques: a survey". In: The VLDB Journal.
[30] Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST'10). 2010, p. 2.
[31] MongoDB.
[32] Michael A. Olson, Keith Bostic, and Margo I. Seltzer. "Berkeley DB". In: USENIX Annual Technical Conference, FREENIX Track. 1999, pp. 183–191.
[33] Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. "The Log-Structured Merge-Tree (LSM-Tree)". In: Acta Inf.
[34] Proceedings of the 2016 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC'16). 2016, pp. 537–550.
[35] PerconaFT (TokuDB). https://github.com/percona/PerconaFT.
[36] Pandian Raju, Rohan Kadekodi, Vijay Chidambaram, and Ittai Abraham. "PebblesDB: Building Key-Value Stores Using Fragmented Log-Structured Merge Trees". In: Proceedings of the 26th Symposium on Operating Systems Principles (SOSP'17). 2017, pp. 497–514.
[37] Kai Ren and Garth Gibson. "TABLEFS: Enhancing Metadata Efficiency in the Local File System". In: Proceedings of the 2013 USENIX Conference on Annual Technical Conference (USENIX ATC'13). 2013, pp. 145–156.
[38] Kai Ren, Qing Zheng, Joy Arulraj, and Garth Gibson. "SlimDB: A Space-Efficient Key-Value Storage Engine for Semi-Sorted Data". In: Proc. VLDB Endow.
[39] RocksDB Tuning Guide. https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide.
[40] Ohad Rodeh. "B-Trees, Shadowing, and Clones". In: ACM Trans. Storage.
[41] ACM Trans. Storage.
[42] ACM Trans. Comput. Syst.
[43] Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST'14). 2014, pp. 1–16.
[44] Russell Sears and Raghu Ramakrishnan. "bLSM: A General Purpose Log Structured Merge Tree". In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD'12). 2012, pp. 217–228.
[45] Kai Shen, Stan Park, and Meng Zhu. "Journaling of Journal is (Almost) Free". In: Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST'14). 2014, pp. 287–293.
[46] Pradeep Shetty, Richard Spillane, Ravikant Malpani, Binesh Andrews, Justin Seyster, and Erez Zadok. "Building Workload-Independent Storage with VT-Trees". In: Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST'13). 2013, pp. 17–30.
[47] Symas Lightning Memory-mapped Database. https://symas.com/lmdb/.
[48] The SGI XFS Filesystem.
[49] Darrick Wong, Dave Chinner, Eric Sandeen, Ryan Lerch, and Silicon Graphics Inc. XFS Algorithms & Data Structures, 3rd Edition. 2018.
[50] Kan Wu, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. "Towards an Unwritten Contract of Intel Optane SSD". In: Proceedings of the 11th USENIX Conference on Hot Topics in Storage and File Systems (HotStorage'19). 2019, p. 3.
[51] Xingbo Wu, Yuehai Xu, Zili Shao, and Song Jiang. "LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data Items". In: 2015, pp. 71–82.
[52] Jingpei Yang, Ned Plasson, Greg Gillis, Nisha Talagala, and Swaminathan Sundararaman. "Don't Stack Your Log On My Log". In: 2014.
[53] Juncheng Yang, Yao Yue, and K. V. Rashmi. "A large scale analysis of hundreds of in-memory cache clusters at Twitter". In: 2020, pp. 191–208.
[54] Ting Yao, Yiwen Zhang, Jiguang Wan, Qiu Cui, Liu Tang, Hong Jiang, Changsheng Xie, and Xubin He. "MatrixKV: Reducing Write Stalls and Write Amplification in LSM-tree Based KV Stores with Matrix Container in NVM". In: 2020, pp. 17–31.
[55] Yang Zhan, Alex Conway, Yizheng Jiao, Eric Knorr, Michael A. Bender, Martin Farach-Colton, William Jannen, Rob Johnson, Donald E. Porter, and Jun Yuan. "The Full Path to Full-Path Indexing". In: Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST'18). 2018, pp. 123–138.
[56] H. Zhang, G. Chen, B. C. Ooi, K. Tan, and M. Zhang. "In-Memory Big Data Management and Processing: A Survey". In: