Efficient Data Management with Flexible File Address Space
Chen Chen, Wenshao Zhong, and Xingbo Wu
University of Illinois at Chicago
Abstract
Data management applications store their data using structured files in which data are usually sorted to serve indexing and queries. In order to insert or remove a record in a sorted file, the positions of existing data need to be shifted. To this end, the existing data after the insertion or removal point must be rewritten to admit the change in place, which can be unaffordable for applications that make frequent updates. As a result, applications often employ extra layers of indirections to admit changes out-of-place. However, it causes increased access costs and excessive complexity.

This paper presents a novel file abstraction, FlexFile, that provides a flexible file address space where in-place updates of arbitrary-sized data, such as insertions and removals, can be performed efficiently. With FlexFile, applications can manage their data in a linear file address space with minimal complexity. Extensive evaluation results show that a simple key-value store built on top of this abstraction can achieve high performance for both reads and writes.
Data management applications store data in files for persistent storage. The data in files are usually sorted in a specific order so that they can be correctly and efficiently retrieved in the future. However, it is not trivial to make updates such as insertions and deletions in these files. To commit in-place updates in a sorted file, existing data may need to be rewritten to maintain the order of data. For example, key-value (KV) stores such as LevelDB [18] and RocksDB [16] need to merge and sort KV pairs in their data files periodically, causing repeated rewriting of existing KV data [2, 28, 51].

It has been common wisdom to rewrite data for better access locality. By co-locating logically adjacent data on a storage device, the data can be quickly accessed in the future with a minimal number of I/O requests, which is crucial for traditional storage technologies such as HDDs. However, when managing data with new storage technologies that provide balanced random and sequential performance (e.g., Intel's Optane SSDs [21]), access locality is no longer a critical factor of I/O performance [50]. In this scenario, data rewriting becomes less beneficial for future accesses but still consumes enormous CPU and I/O resources [26, 34]. Therefore, it may not be cost-effective to rewrite data on these devices in exchange for better localities. Despite this, data management applications still need to keep their data logically sorted for efficient access. An intuitive solution is to relocate data in the file address space logically without rewriting them physically. However, this is barely feasible because of the lack of support for logically relocating data in today's files. As a result, applications have to employ additional layers of indirections to keep data logically sorted, which increases the implementation complexity and requires extra work in the user space, such as maintaining data consistency [45, 52].

We argue that if files can provide a flexible file address space where insertions and removals of data can be applied in-place, applications can keep their data sorted easily without rewriting data or employing complex indirections. With such a flexible abstraction, data management applications can delegate the data organizing jobs to the file interface, which will improve their simplicity and efficiency.

Efforts have been made to realize such an abstraction. For example, a few popular file systems—Ext4, XFS, and F2FS—have provided insert-range and collapse-range features for inserting or removing a range of data in a file [17, 20]. However, these mechanisms have not been able to help applications because of several fundamental limitations. First of all, these mechanisms have rigid block-alignment requirements. For example, deleting a record of only a few bytes from a data file using the collapse-range operation is not allowed. Second, shifting an address range is very inefficient with conventional file extent indexes. Inserting a new data segment to a file needs to shift all the existing file content after the insertion point to make room for the new data. The shift operation has O(N) cost (N is the number of blocks or extents in the file), which can be very costly due to intensive metadata updating, journaling, and rewriting. Third, commonly used data indexing mechanisms cannot keep track of shifted contents in a file.
Specifically, indexes using file offsets to record data positions are no longer usable because the offsets can be easily changed by shift operations. Therefore, a co-design of applications and file abstractions is necessary to realize the benefits of managing data on a flexible file address space.

This paper introduces FlexFile, a novel file abstraction that provides a flexible file address space for data management systems. The core of FlexFile is a B+-Tree-like data structure, named FlexTree, designed for efficient and flexible file address space indexing. In a FlexTree, it takes O(log N) time to perform a shift operation in the file address space, which is asymptotically faster than that of existing index data structures with O(N) cost. We implement FlexFile as a user-space I/O library that provides a file-like interface. It adopts log-structured space management for write efficiency and performs defragmentation based on data access locality for cost-effectiveness. It also employs logical logging [40, 55] to commit metadata updates at a low cost.

Based on the advanced features provided by FlexFile, we build FlexDB, a key-value store that demonstrates how to build simple yet efficient data management applications based on a flexible file address space. FlexDB has a simple structure and a small codebase (1.7k lines of C code). That being said, FlexDB is a fully-functional key-value store that not only supports regular KV operations like GET, SET, DELETE, and SCAN, but also integrates efficient mechanisms to support caching, concurrent access, and crash consistency.
Evaluation results show that FlexDB has substantially reduced the data rewriting overheads. It achieves up to 11.7× and 1.9× the throughput of read and write operations, respectively, compared to an I/O-optimized key-value store, RocksDB.

Modern file systems use extents to manage file address mappings. An extent is a group of contiguous blocks. Its metadata consists of three essential elements—file offset, length, and block number. Figure 1 shows an example of a file with a 96 KB address space on a file system using 4 KB blocks. This file consists of four extents. Real-world file systems employ index structures to manage extents. For example, Ext4 uses an HTree [15]. Btrfs and XFS use a B+-Tree [41, 49]. F2FS uses a multi-level mapping table [25].

Figure 1: A file with a 96 KB address space

Regular file operations such as overwrite do not modify existing mappings. An append-write to a file needs to expand the last extent in-place or add new extents to the end of the mapping index, which is of low cost. However, the insert-range and collapse-range operations with the aforementioned data structures can be very expensive due to the shifting of extents. To be specific, an insert-range or collapse-range operation needs to update the offset value of every extent after the insertion or removal point. For example, inserting a 4 KB extent at the offset of 16 KB to the example file in Figure 1 needs to update all the existing extents' metadata. Therefore, the shift operation has O(N) cost, where N is the total number of extents after the insertion or removal point.

We benchmark the file editing performance of an Ext4 file system on a 100 GB Intel Optane DC P4801X SSD. In each test, we construct a 1 GB file and measure the throughput of data writing. There are three tests, namely, PWRITE, INSERT-RANGE, and REWRITE. PWRITE starts with an empty file and uses the pwrite system call to write 4 KB blocks in random order (without overwrites). Both INSERT-RANGE and REWRITE start with an empty file and insert 4 KB data blocks at a random 4 KB-aligned offset within the already-written file range. Accordingly, each insertion shifts the data after the insertion point forward. INSERT-RANGE utilizes the insert-range of Ext4 (through the fallocate system call with mode=FALLOC_FL_INSERT_RANGE). The REWRITE test rewrites the shifted data to realize data shifting, which on average rewrites half of the existing data for each insertion.

Figure 2: Performance of random write/insertion on the Ext4 file system. The REWRITE test was terminated when 128 MB of data was written because of its low throughput.

The results are shown in Figure 2. The REWRITE test was terminated early due to its inferior performance caused by intensive data rewriting. With 128 MB of new data written before it was terminated, REWRITE caused 5.5 GB of writes to the SSD. INSERT-RANGE shows better performance than REWRITE by inserting data logically in the file address space. However, due to the inefficient shift operations in the extent index, the throughput of INSERT-RANGE dropped quickly and was eventually nearly 1000× lower than that of PWRITE. Although INSERT-RANGE does not rewrite any user data, it updates the metadata intensively and caused 25% more writes to the SSD compared to PWRITE. This number can be further increased if the application frequently calls fsync to enforce write ordering. XFS and F2FS also support the shift operations, but they exhibit much worse performance than Ext4, so their results are not included.

Extents are simple and flexible for managing variable-length data chunks. However, the block and page alignment requirements and the inefficient extent index structures in today's systems hinder the adoption of the shift operations. To make flexible file address spaces generally usable and affordable for data management applications, an efficient data shifting mechanism without the rigid alignment requirements is indispensable.
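For reference, the fragment below shows the kernel interface exercised by the INSERT-RANGE test. This is a minimal sketch for illustration only: the file name is hypothetical, the file is assumed to already exist and be larger than 16 KB, and error handling is kept to a minimum. The offset and length passed to fallocate must be multiples of the file system block size, and the call still triggers the extent shifting whose O(N) cost is discussed above.

    #define _GNU_SOURCE
    #include <fcntl.h>      /* open, fallocate, FALLOC_FL_INSERT_RANGE */
    #include <stdio.h>
    #include <unistd.h>     /* pwrite, close */

    #define BLOCK 4096      /* must match the file system block size */

    int main(void) {
        int fd = open("data.file", O_RDWR);   /* hypothetical existing 1 GB file */
        if (fd < 0) { perror("open"); return 1; }

        /* Make room for one 4 KB block at offset 16 KB.  Every extent mapping
         * beyond 16 KB must be shifted by the file system, which is where the
         * O(N) metadata cost comes from. */
        if (fallocate(fd, FALLOC_FL_INSERT_RANGE, 4 * BLOCK, BLOCK) != 0) {
            perror("fallocate(INSERT_RANGE)");  /* also fails if offset/len are unaligned */
            return 1;
        }

        /* The inserted range is a hole; fill it with the new record's data. */
        char buf[BLOCK] = {0};
        if (pwrite(fd, buf, BLOCK, 4 * BLOCK) != BLOCK) { perror("pwrite"); return 1; }

        close(fd);
        return 0;
    }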
Inserting or removing data in a file needs to shift all the existing data beyond the insertion or removal point, which causes intensive updates to the metadata of the affected extents. With regard to the number of extents in a file, the cost of shift operations can be significantly high due to the O(N) complexity in existing extent index structures.

The following introduces FlexTree, an augmented B+-Tree that supports efficient shift operations. The design of FlexTree is based on the observation that a shift operation alters a contiguous range of extents. FlexTree treats the shifted extents as a whole and applies the updates to them collectively. To facilitate this, it employs a new metadata representation scheme that stores the address information of an extent on its search path. As an extent index, it costs O(log N) time to perform a shift operation in FlexTree, and a shift operation only needs to update a few tree nodes.

Before demonstrating the design of FlexTree, we first start with an example of a B+-Tree [9] that manages a file address space in byte granularity, as shown in Figure 3a. Each extent corresponds to a leaf-node entry consisting of three elements—offset, length, and (physical) address. Each internal node contains pivot entries that separate the pointers to the child nodes. When inserting a new extent at the head of the file, every existing extent's offset and every pivot's offset must be updated because of the shift operation on the entire file.

FlexTree employs a new address metadata representation scheme that allows for shifting extents with substantially reduced changes. Figure 3b shows an example of a FlexTree that encodes the same address mappings as the B+-Tree. In FlexTree, the offset fields in extent entries and pivot entries are replaced by partial offset fields. Besides, the only structural difference is that in a FlexTree, every pointer to a child node is associated with a shift value. These shift values are used for encoding address information in cooperation with the partial offsets. The effective offset of an extent or pivot entry is determined by the sum of the entry's partial offset and the shift values of the pointers found on the search path from the root node to the entry. The search path from the root node (at level 0) to an entry at level N can be represented by a sequence ((X_0, S_0), (X_1, S_1), ..., (X_{N-1}, S_{N-1})), where X_i represents the index of the pointer at level i, and S_i represents the shift value associated with that pointer. Suppose the partial offset of an entry is P. Its effective offset E can be calculated as E = (∑_{i=0}^{N-1} S_i) + P.

Figure 3: Examples of a B+-Tree and a FlexTree that manage the same file address space ((a) B+-Tree; (b) FlexTree)

FlexTree supports basic operations such as appending extents at the end of a file and remapping existing extents, as well as advanced operations, including inserting or removing extents in the middle of a file (insert-range and collapse-range). The following explains how the address range operations execute in a FlexTree. In this section, a leaf node entry in FlexTree is denoted by a triple: (partial_offset, length, address).
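The effective-offset calculation can be illustrated with the small sketch below. The node layout and field names are simplified assumptions for exposition, not the actual FlexTree implementation; the numbers reuse the 30 + 3 = 33 case that appears later in the lookup example.

    #include <stdint.h>
    #include <stdio.h>

    #define FANOUT 4

    struct extent {                 /* leaf entry: (partial_offset, length, address) */
        uint64_t partial_offset;
        uint32_t length;
        uint64_t address;
    };

    struct node {
        int is_leaf;
        int nkeys;
        uint64_t pivots[FANOUT - 1];    /* partial offsets of pivots (internal nodes) */
        uint64_t shifts[FANOUT];        /* shift value attached to each child pointer */
        struct node *children[FANOUT];  /* used by internal nodes */
        struct extent extents[FANOUT];  /* used by leaf nodes */
    };

    /* Effective offset of a leaf entry: the sum of the shift values on the
     * search path from the root, plus the entry's partial offset. */
    static uint64_t effective_offset(const struct node *root,
                                     const int *path, int depth, int slot)
    {
        uint64_t shift_sum = 0;
        const struct node *n = root;
        for (int level = 0; level < depth; level++) {
            int x = path[level];            /* X_i: pointer index taken at this level */
            shift_sum += n->shifts[x];      /* S_i: shift value on that pointer */
            n = n->children[x];
        }
        return shift_sum + n->extents[slot].partial_offset;   /* E = sum(S_i) + P */
    }

    int main(void)
    {
        struct node leaf = { .is_leaf = 1, .nkeys = 1,
                             .extents = { { .partial_offset = 30, .length = 9, .address = 12 } } };
        struct node root = { .is_leaf = 0, .nkeys = 1 };
        root.children[0] = &leaf;
        root.shifts[0] = 3;                 /* this pointer carries a +3 shift */

        int path[] = { 0 };                 /* search path: take pointer 0 at the root */
        printf("effective offset = %llu\n", /* prints 33 = 3 + 30 */
               (unsigned long long)effective_offset(&root, path, 1, 0));
        return 0;
    }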
insert-range Operation

Inserting a new extent of length L to a leaf node z in FlexTree takes three steps. First, the operation searches for the leaf node and inserts a new entry with a partial offset P = E − (∑_{i=0}^{N-1} S_i), assuming the leaf node is not full. When inserting to the middle of an existing extent, the extent must be split before the insertion. The insertion requires a shift operation on all the extents after the new extent. In the second step, for each extent within node z that needs shifting, its partial offset is incremented by L. The remaining extents that need shifting span all the leaf nodes after node z. We observe that, if every extent within a subtree needs to be shifted, the shift value can be recorded in the pointer that points to the root of the subtree. Therefore, in the third step, the remaining extents are shifted as a whole by updating a minimum number of pointers to a few subtrees that cover the entire range. To this end, for each ancestor node of z at level i, the shift values of the pointers and the partial offsets of the pivots after the pointer at X_i are all increased by L. In this process, the updated pointers cover all the remaining extents, and the path of each remaining extent contains exactly one updated pointer. When the update is finished, every shifted extent has its effective offset increased by L. The number of updated nodes of a shift operation is bounded by the tree's height, so the operation's cost is O(log N).

Figure 4 shows the process of inserting a new extent with length 3 and physical address 89 at offset 0 in the FlexTree shown in Figure 3b. The first step is to search for the target leaf node for insertion. Because all the shift values of the pointers are 0, the effective offset of every entry is equal to its partial offset. Therefore, the target leaf node is the leftmost one, and the new extent should be inserted at the beginning of that leaf node. Then, there are three changes to be made to the FlexTree. First, a new entry (0, 3, 89) is inserted at the beginning of the target leaf node. Second, the partial offsets of the other two extents in the same leaf node are each incremented by 3. Third, following the target leaf node's path upward, the pointers to the three subtrees covering the remaining leaf nodes and the corresponding pivots are updated, as shown in the shaded areas in Figure 4. Now, the effective offset of every existing leaf entry is increased by 3.

Figure 4: Inserting a new extent in FlexTree

Similar to a B+-Tree, FlexTree splits every full node when a search travels down the tree for insertion. The split threshold in FlexTree is one entry smaller than the node's capacity because an insertion may cause an extent to be split, which leads to two entries being added to the node for the insertion. To split a node, half of the entries in the node are moved to a new node. Meanwhile, a pointer to the new node and a new pivot entry are created at the parent node. The new pointer inherits the shift value of the pointer to the old node so that the effective offsets of the moved entries remain unchanged. The new pivot entry inherits the effective offset of the median key in the old full node. The partial offset of the new pivot is calculated as the sum of the old median key's partial offset and the new pointer's inherited shift value. Figure 5 shows an example of a split operation, where the new pivot's partial offset is 38 (the old median key's partial offset plus the inherited shift value of 5).

Figure 5: An example of node splitting in FlexTree
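The shift step of the insertion can be sketched as below, reusing the toy node layout from the earlier effective-offset example. This is our simplified illustration, not the FlexTree implementation: partial offsets are bumped inside the target leaf, and each ancestor then shifts everything to the right of the taken pointer by adjusting only its shift values and pivots, so the total work is bounded by the tree height.

    #include <stdint.h>
    #include <stdio.h>

    #define FANOUT 4

    struct extent { uint64_t partial_offset; uint32_t length; uint64_t address; };

    struct node {
        int is_leaf;
        int nkeys;                          /* entries in a leaf, pivots in an internal node */
        uint64_t pivots[FANOUT - 1];
        uint64_t shifts[FANOUT];
        struct node *children[FANOUT];
        struct extent extents[FANOUT];
    };

    /* Shift everything after position `slot` in leaf `z` forward by `len`,
     * then walk back up the recorded search path.  At each ancestor, only the
     * pointers and pivots to the right of the taken pointer are touched. */
    static void shift_after_insert(struct node **path_nodes, const int *path_idx,
                                   int depth, struct node *z, int slot, uint64_t len)
    {
        for (int i = slot; i < z->nkeys; i++)            /* step 2: within the leaf */
            z->extents[i].partial_offset += len;

        for (int level = depth - 1; level >= 0; level--) {   /* step 3: ancestors */
            struct node *a = path_nodes[level];
            int x = path_idx[level];                     /* pointer taken at this level */
            for (int p = x + 1; p <= a->nkeys; p++)      /* pointers right of X_i */
                a->shifts[p] += len;
            for (int p = x; p < a->nkeys; p++)           /* pivots right of X_i */
                a->pivots[p] += len;
        }
    }

    int main(void)
    {
        struct node leaf = { .is_leaf = 1, .nkeys = 2,
            .extents = { { 0, 3, 89 }, { 3, 9, 20 } } }; /* entry (0,3,89) just inserted */
        struct node root = { .is_leaf = 0, .nkeys = 1, .pivots = { 12 } };
        root.children[0] = &leaf;

        struct node *path_nodes[] = { &root };
        int path_idx[] = { 0 };
        shift_after_insert(path_nodes, path_idx, 1, &leaf, 1, 3);
        printf("pivot now %llu, right sibling shift now %llu\n",   /* prints 15 and 3 */
               (unsigned long long)root.pivots[0],
               (unsigned long long)root.shifts[1]);
        return 0;
    }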
To retrieve the mappings of an address range in FlexTree, the operation first searches for the starting point of the range, which is a byte address within an extent. Then, it scans forward on the leaf level from the starting point to retrieve all the mappings in the requested range. The correctness of the forward scanning is guaranteed by the assumption that all extents on the leaf level are contiguous in the logical address space. However, a hole (an unmapped address range) in the logical address space can break the continuity and lead to incorrect range size calculation and wrong search results. To address this issue, FlexTree explicitly records holes as unmapped ranges using entries with a special address value.

Figure 6 shows the process of querying the address mappings from 36 to 55, a 19-byte range, in the FlexTree after the insertion in Figure 4. First, a search of logical offset 36 identifies the third leaf node. The partial offset values of the pivots in the internal nodes on the path are equal to their effective offsets (54 and 33), and the search path to the target leaf node carries the shift values +3 and +0. Although the first extent in the target leaf node has a partial offset value of 30, its effective offset is 33 (30 plus the +3 and +0 shift values on the path). The scan then walks forward on the leaf level from that extent and returns the mappings of the requested range as an array of four tuples, each containing a physical address and a length.

Figure 6: Looking up mappings from 36 to 55 in FlexTree
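The leaf-level walk can be illustrated with the short sketch below. It is a simplification under two assumptions: the extents have already been resolved to their effective offsets, and the leaf level is presented as one logically contiguous array (the extent values here are toy data, not those of the paper's figure).

    #include <stdint.h>
    #include <stdio.h>

    struct mapping { uint64_t address; uint32_t length; };

    /* Leaf-level view: extents sorted by effective offset and logically
     * contiguous; holes would appear as entries with a reserved address. */
    struct extent { uint64_t offset; uint32_t length; uint64_t address; };

    /* Collect the physical mappings that back the byte range [start, start+size). */
    static int query_range(const struct extent *ext, int n,
                           uint64_t start, uint64_t size,
                           struct mapping *out, int max_out)
    {
        int count = 0;
        uint64_t end = start + size;
        for (int i = 0; i < n && start < end; i++) {
            uint64_t e_end = ext[i].offset + ext[i].length;
            if (e_end <= start)
                continue;                            /* before the requested range */
            uint64_t skip = start - ext[i].offset;   /* clip the first extent */
            uint64_t take = (e_end < end ? e_end : end) - start;
            if (count < max_out)
                out[count] = (struct mapping){ ext[i].address + skip, (uint32_t)take };
            count++;
            start += take;
        }
        return count;                                /* number of (address, length) tuples */
    }

    int main(void)
    {
        const struct extent leaf_level[] = {         /* toy mapping for illustration */
            { 0, 12, 100 }, { 12, 20, 640 }, { 32, 8, 200 }, { 40, 24, 388 },
        };
        struct mapping out[8];
        int n = query_range(leaf_level, 4, 36, 19, out, 8);
        for (int i = 0; i < n; i++)
            printf("(%llu, %u)\n", (unsigned long long)out[i].address, out[i].length);
        return 0;
    }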
collapse-range Operation

To collapse (remove) an address range in FlexTree, the operation first searches for the starting point of the removal. If the starting point is in the middle of an extent, the extent is split so that the removal will start from the beginning of an extent. Similarly, a split is also used when the ending point is in the middle of an extent. The address range being removed will cover one or multiple extents. For each extent in the range, the extents after it are shifted backward using a process similar to the forward shifting in the insertion operation (§3.2.1). The only difference is that a negative shift value is used.

Figure 7 shows the process of removing a 9-byte address range (33 to 42) from the FlexTree in Figure 6 without leaving a hole in the address space. First, a search identifies the starting point, which is the beginning of the first extent in the third leaf node. Then the extent is removed, and the remaining extents in the leaf node are shifted backward. Finally, in the root node, the pointer to the subtree that covers the last two leaf nodes is updated with a negative shift value of −9, as shown in the shaded area in Figure 7.

Similar to a B+-Tree, FlexTree merges a node into a sibling if their total size is under a threshold after a removal. Since two nodes being merged can have different shift values in their parents' pointers, we need to adjust the partial offsets in the merged node to maintain correct effective offsets for all the entries. When merging two internal nodes, the shift values are also adjusted accordingly. Figure 8 shows an example of merging two internal nodes.

FlexTree manages extent address mappings in byte granularity. To be specific, the size of each extent in a FlexTree can be arbitrary bytes. In the implementation of FlexTree, the internal nodes have 64-bit shift values and pivots. For leaf nodes, we use 32-bit lengths, 32-bit partial offsets, and 64-bit physical addresses for extents. The largest physical address value (2^64 − 1) is reserved for unmapped address ranges. Therefore, an extent entry in FlexTree takes 16 bytes. While a 32-bit partial offset can only cover a 4 GB address range, the effective offset can address a much larger space using the sum of the shift values on the search path as the base. When a leaf node's maximum partial offset becomes too large, FlexTree subtracts a value M, which is the minimum partial offset in the node, from every partial offset of the node, and adds M to the node's corresponding shift value at the parent node. A practical limitation is that the extents in a leaf node can cover no more than a 4 GB range.
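To make the space accounting concrete, the struct below sketches one possible packing of the 16-byte leaf extent entry. The field names and the exact layout are our assumptions; only the field widths and the reserved all-ones address come from the text.

    #include <stdint.h>
    #include <stdio.h>

    /* One leaf entry of the FlexTree extent index: 32-bit partial offset,
     * 32-bit length, 64-bit physical address, i.e., 16 bytes per extent.
     * The all-ones address marks an unmapped (hole) range. */
    struct flextree_extent {
        uint32_t partial_offset;
        uint32_t length;
        uint64_t address;
    };

    #define FLEXTREE_ADDR_UNMAPPED UINT64_MAX   /* 2^64 - 1 */

    int main(void)
    {
        printf("extent entry size: %zu bytes\n", sizeof(struct flextree_extent)); /* 16 */
        struct flextree_extent hole = { 0, 4096, FLEXTREE_ADDR_UNMAPPED };
        printf("hole covers %u bytes, unmapped: %d\n",
               hole.length, hole.address == FLEXTREE_ADDR_UNMAPPED);
        return 0;
    }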
We implement FlexFile as a user-space I/O library that provides a file-like interface with FlexTree as the extent index structure. The FlexFile library manages the file address space in byte granularity, and its data and metadata are stored in regular files in a traditional file system. Each FlexFile consists of three files—a data file, a FlexTree file, and a logical log file. The user-space library implementation gives FlexFile the flexibility to perform byte-granularity space management without block alignment limitations. In the meantime, the FlexFile library can delegate the job of cache management and data storage to the underlying file system.
A FlexFile stores its data in a data file. Similar to the structures in log-structured storage systems [25, 42, 43], the data file's space is divided into fixed-size segments (4 MB in the implementation), and each new extent is allocated within a segment. Specifically, a large write operation may create multiple logically contiguous extents residing in different segments. To avoid small writes, an in-memory segment buffer is maintained, where consecutive extents are automatically merged if they are logically contiguous.
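A minimal sketch of this segment-buffered write path follows. The structure and function names are our assumptions rather than the library's actual interface: new data is appended into a 4 MB in-memory segment, and an extent that is logically and physically contiguous with the previous one is merged instead of creating a new entry.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define SEGMENT_SIZE (4u << 20)             /* 4 MB segments, as in the text */

    struct pending_extent { uint64_t logical_off; uint64_t phys_off; uint32_t length; };

    struct segment_buffer {
        uint64_t base_phys;                     /* physical offset of this segment in the data file */
        uint32_t used;                          /* bytes filled so far */
        unsigned char data[SEGMENT_SIZE];
        struct pending_extent last;             /* most recently buffered extent */
        int has_last;
    };

    /* Buffer `len` bytes destined for logical offset `loff`; merge with the
     * previous extent when both the logical and physical ranges are contiguous. */
    static int segbuf_append(struct segment_buffer *sb, uint64_t loff,
                             const void *buf, uint32_t len)
    {
        if (sb->used + len > SEGMENT_SIZE)
            return -1;                          /* caller seals this segment and opens a new one */
        uint64_t phys = sb->base_phys + sb->used;
        memcpy(sb->data + sb->used, buf, len);
        sb->used += len;

        if (sb->has_last &&
            sb->last.logical_off + sb->last.length == loff &&
            sb->last.phys_off + sb->last.length == phys) {
            sb->last.length += len;             /* merge: one extent instead of two */
        } else {
            sb->last = (struct pending_extent){ loff, phys, len };
            sb->has_last = 1;
        }
        return 0;
    }

    int main(void)
    {
        static struct segment_buffer sb = { .base_phys = 0 };
        segbuf_append(&sb, 100, "hello ", 6);
        segbuf_append(&sb, 106, "world", 5);    /* contiguous: merged into one 11-byte extent */
        printf("extent: logical %llu, len %u\n",
               (unsigned long long)sb.last.logical_off, sb.last.length);
        return 0;
    }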
Figure 7: Removing address mapping from offset 33 to 42

Figure 8: An example of node merging in FlexTree

The FlexFile library performs garbage collection (GC) to reclaim space from underutilized segments. It maintains an in-memory array to record the valid data size of each segment. A GC process scans the array to identify a set of most underutilized segments and relocates all the valid extents from these segments to new segments. Then, the FlexTree extent index is updated accordingly. Since the extents in a FlexFile can have arbitrary sizes, the GC process may produce less free space than expected because of the internal fragmentation in each segment. To address this issue, we adopt an approach used by a log-structured memory allocator [43] to guarantee that a GC process can always make forward progress. By limiting the maximum extent size to 1/K of the segment size, relocating the extents in one segment whose utilization ratio is not higher than (K−1)/K can reclaim free space for at least one new extent. Therefore, if the space utilization ratio of the data file is capped at (K−1)/K, the GC can always reclaim space from the most underutilized segment for writing new extents. To reduce the GC pressure in the implementation, we set the maximum extent size to 1/32 of the segment size (128 KB) and conservatively limit the space utilization ratio of the data file to roughly 93%. The FlexFile library also provides a flexfile_defrag interface for manually relocating a contiguous range of data in the file into new segments. We will evaluate the efficiency of the GC policy in §6.
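The victim-selection step can be sketched as follows (our illustration; the real library tracks more state): an in-memory array holds the valid bytes per segment, and GC picks the segment with the least valid data, which under the utilization cap is guaranteed to yield room for at least one maximum-sized extent.

    #include <stdint.h>
    #include <stdio.h>

    #define SEGMENT_SIZE (4u << 20)
    #define MAX_EXTENT   (SEGMENT_SIZE / 32)    /* 128 KB, 1/32 of a segment */

    /* Pick the most underutilized segment as the next GC victim. */
    static int pick_gc_victim(const uint32_t *valid_bytes, int nsegs)
    {
        int victim = 0;
        for (int i = 1; i < nsegs; i++)
            if (valid_bytes[i] < valid_bytes[victim])
                victim = i;
        return victim;
    }

    int main(void)
    {
        /* valid data per segment, maintained as extents are written and invalidated */
        uint32_t valid_bytes[4] = { SEGMENT_SIZE, 3u << 20, 1u << 20, SEGMENT_SIZE - MAX_EXTENT };
        int v = pick_gc_victim(valid_bytes, 4);
        printf("victim segment %d, reclaimable %u bytes (>= one %u-byte extent: %s)\n",
               v, SEGMENT_SIZE - valid_bytes[v], MAX_EXTENT,
               SEGMENT_SIZE - valid_bytes[v] >= MAX_EXTENT ? "yes" : "no");
        return 0;
    }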
A FlexFile maintains an in-memory FlexTree that periodically synchronizes its updates to the FlexTree file. It must ensure atomicity and crash consistency in this process. An insertion or removal operation often updates multiple tree nodes along the search path in the FlexTree. If we used a block-based journaling mechanism to commit updates, every dirtied node in the FlexTree would need to be written twice. To address this potential performance issue, we use a combination of copy-on-write (CoW) [40] and logical logging [40, 55] to minimize the I/O cost.

CoW is used to synchronize the persistent FlexTree with the in-memory FlexTree. The FlexTree file has a header at the beginning of the file that contains a version number and a root node position. A commit to the FlexTree file creates a new version of the FlexTree in the file. In the commit process, dirtied nodes are written to free space in the FlexTree file without rewriting existing nodes. Once all the updated nodes have been written, the file's header is updated atomically to make the new version persist. Once the new version has been committed, the file space used by the updated nodes in the old version can be safely reused in future commits.

Updates to the FlexTree extent index can be intensive with small insertions and removals. If every metadata update on the FlexTree directly commits to the FlexTree file, the I/O cost can be high because every commit can write multiple tree nodes in the FlexTree file. The FlexFile library adopts the logical logging mechanism [40, 55] to further reduce the metadata I/O cost. Instead of performing CoW to the persistent FlexTree on every metadata update, the FlexFile library records every FlexTree operation in a log file and only synchronizes the FlexTree file with the in-memory FlexTree when the logical log has accumulated a sufficient amount of updates. A log entry for an insertion or removal operation contains the logical offset, length, and physical address of the operation. A log entry for a GC relocation contains the old and new physical addresses and the length of the relocated extent. We allocate 48 bits for addresses, 30 bits for lengths, and 2 bits to specify the operation type. Each log entry takes 16 bytes of space, which is much smaller than the node size of FlexTree. The logical log can be seen as a sequence of operations that transforms the persistent FlexTree into the latest in-memory FlexTree. The version number of the persistent FlexTree is recorded at the head of the log. Upon a crash-restart, uncommitted updates to the persistent FlexTree can be recovered by replaying the log on the persistent FlexTree.

When writing data to a FlexFile, the data are first written to free segments in the data file. Then, the metadata updates are applied to the in-memory FlexTree and recorded in an in-memory buffer of the logical log. The buffered log entries are committed to the log file periodically or on-demand for persistence. In particular, the buffered log entries are committed after every execution of the GC process to make sure that the new positions of the relocated extents are persistently recorded. Then, the reclaimed space can be safely reused. Upon a commit to the log file, the data file must first be synchronized so that the logged operations will refer to the correct file data. When the logical log file size reaches a pre-defined threshold, or the FlexFile is being closed, the in-memory FlexTree is synchronized to the FlexTree file using the CoW mechanism. Afterward, the log file can be truncated and reinitialized using the FlexTree's new version number.

Figure 9 shows an example of the write ordering of a FlexFile. D_i and L_i represent the data write and the logical log write for the i-th file operation, respectively. At the time of "v1", the persistent FlexTree (version 1) is identical to the in-memory FlexTree. Meanwhile, the log file is almost empty and contains only a header that records the FlexTree version (version 1). Then, for each write operation, the data is written to the data file (or buffered if the data is small), and its corresponding metadata updates are logged in the logical log buffer. When the logical log buffer is full, all the file data (D_1, D_2, and D_3) are synchronized to the data file. Then the buffered log entries (L_1 + L_2 + L_3) are written to the logical log file. When the log file is full, the current in-memory FlexTree is committed to the FlexTree file to create a new version (version 2) using CoW. Once the nodes have been written to the FlexTree file, the new version number and the root node position of the FlexTree are written to the file atomically. The logical log is then cleared for recording future operations based on the new version. I/O barriers (fsync) are used before and after each logical log file commit and each FlexTree file header update to enforce the write ordering for crash consistency, as shown in Figure 9.

Figure 9: An example of write ordering in FlexFile (I/O barriers use fsync; the FlexTree file header is updated last)
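The ordering constraints can be summarized in the short sketch below. It is our simplification of the protocol just described: the function names are made up, the buffered-write helpers are left as comments, and error handling is minimal. The invariant it encodes is that data reaches the data file before the log entries that describe it, and that fsync barriers bracket each log commit and each FlexTree header update.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Write ordering for one commit cycle of a FlexFile (illustrative only).
     * data_fd, log_fd, and tree_fd are the data file, logical log file, and
     * FlexTree file; the commented helpers stand in for writing buffered bytes. */
    static int commit_logical_log(int data_fd, int log_fd)
    {
        /* 1. Data must be durable before the log entries that reference it. */
        if (fsync(data_fd) != 0) return -1;
        /* 2. Append the buffered log entries, then make them durable. */
        /* log_flush(log_fd); */
        if (fsync(log_fd) != 0) return -1;
        return 0;
    }

    static int checkpoint_flextree(int tree_fd, int log_fd)
    {
        /* 3. CoW-write all dirtied FlexTree nodes to free space in the tree file. */
        /* tree_write_dirty_nodes(tree_fd); */
        if (fsync(tree_fd) != 0) return -1;     /* barrier before the header update */
        /* 4. Atomically update the header (version number + root node position). */
        /* tree_write_header(tree_fd); */
        if (fsync(tree_fd) != 0) return -1;     /* barrier after the header update */
        /* 5. The logical log can now be truncated and tagged with the new version. */
        if (ftruncate(log_fd, 0) != 0) return -1;
        return 0;
    }

    int main(void)
    {
        int data_fd = open("/tmp/flex.data", O_CREAT | O_RDWR, 0644);
        int log_fd  = open("/tmp/flex.log",  O_CREAT | O_RDWR, 0644);
        int tree_fd = open("/tmp/flex.tree", O_CREAT | O_RDWR, 0644);
        if (data_fd < 0 || log_fd < 0 || tree_fd < 0) { perror("open"); return 1; }
        if (commit_logical_log(data_fd, log_fd) || checkpoint_flextree(tree_fd, log_fd))
            perror("commit");
        return 0;
    }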
We build FlexDB, a key-value store powered by the advanced features of FlexFile. Similar to the popular LSM-tree-based KV stores such as LevelDB [18] and RocksDB [16], FlexDB buffers updates in a MemTable and writes to a write-ahead log (WAL) for immediate data persistence. When committing updates to the persistent storage, however, FlexDB adopts a greatly simplified data model. FlexDB stores all the persistent KV data in sorted order in a table file (a FlexFile) whose format is similar to the SSTable format of LevelDB [5, 16, 18]. Instead of performing repeated compactions across a multi-level store hierarchy that causes high write amplification, FlexDB directly commits updates from the MemTable to the data file in-place at low cost, which is made possible by the flexible file address space of the FlexFile. FlexDB employs a space-efficient volatile sparse index to track positions of KV data in the table file and implements a user-space cache for fast reads.

FlexDB stores persistent KV pairs in the table file and keeps them always sorted (in lexical order by default) with in-place updates. Each KV pair in the file starts with the key and value lengths encoded with Base-128 Varint [6], followed by the key and value's raw data. A sparse KV index, whose structure is similar to a B+-Tree, is maintained in memory to enable fast search in the table file.

KV pairs in the table file are grouped into intervals, each covering a number of consecutive KV pairs. The sparse index stores an entry for each interval using the smallest key in it as the index key. Similar to FlexTree, the sparse index encodes the file offset of an interval using the partial offset and the shift values on its search path. Specifically, each leaf node entry contains a partial offset, and each child pointer in internal nodes records a shift value. The effective file offset of an interval is the sum of its partial offset and the shift values on its search path. A search of a key first identifies the target interval with a binary search on the sparse index.
Then the target file range is calculated using the shift values reached on the search path and the metadata in the target entry. Finally, the search scans the interval to find the KV pair. Figure 10 shows an example of the sparse KV index with four intervals. The first interval does not need an index key. The index keys of the other three intervals are "bit", "foo", and "pin", respectively. A search of "kit" reaches the third interval ("foo" < "kit" < "pin") at offset 64 (0+64).

Figure 10: An example of the sparse KV index in FlexDB

When inserting (or removing) a KV pair in an interval, the offsets of all the intervals after it need to be shifted so that the index can stay in sync with the table file. The shift operation is similar to that in a FlexTree. First, the operation updates the partial offsets of the intervals in the same leaf node. Then, the shift values on the path to the target leaf node are updated. Different from FlexTree, the partial offsets in the sparse index are not the search keys but the values in leaf node entries. Therefore, none of the index keys and pivots are modified in the shifting process. An update operation that resizes a KV pair is performed by removing the old KV pair with collapse-range and inserting the new one at the same offset with insert-range.

To insert a new key "cat" with the value "abcd" in the FlexDB shown in Figure 10, a search first identifies the interval at offset 42 whose index key is "bit". Assuming the new KV item's size is 9 bytes, we insert it into the table file between keys "bit" and "far" and shift the intervals after it forward by 9. As shown at the bottom of Figure 10, the effective offsets of the intervals "foo" and "pin" are both incremented by 9.

Similar to a B+-Tree, the sparse index needs to split a large node or merge two small nodes when their sizes reach specific thresholds. FlexDB also defines minimum and maximum size limits for an interval. The limits can be specified by the total data size in bytes or by the number of KV pairs. An interval whose size exceeds the maximum size limit will be split, with a new index key inserted in the sparse index.

Updates in FlexDB are first buffered in a MemTable. The MemTable is a thread-safe skip list that supports concurrent access by one writer and multiple readers. When the MemTable is full, it is converted to an immutable MemTable, and a new MemTable is created to receive new updates. Meanwhile, a background thread is activated to commit the updates to the table file and the sparse index. A lookup in FlexDB first checks the MemTable and the immutable MemTable (if it exists). If the key is not found in the MemTables, the lookup queries the sparse KV index to find the key in the table file. When the committer thread is active, it requires exclusive access to the sparse index and the table file to prevent inconsistent data or metadata from being reached by readers. To this end, a reader-writer lock is used for the committer thread to block the readers when necessary. For balanced performance and responsiveness, the committer thread releases and reacquires the lock every 4 MB of committed data so that readers can be served quickly without waiting for the completion of a committing process.
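The commit of a single MemTable entry into the sorted table file can be sketched as below. The FlexFile entry points used here (flexfile_collapse_range and flexfile_insert_range) are hypothetical names and signatures, stubbed out so the sketch runs; only flexfile_defrag and flexfile_read_extent are named in the text. The point of the sketch is that a resizing update is a removal plus an insertion at the same offset, so surrounding records are never rewritten.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    struct flexfile { uint64_t size; };     /* toy handle, real one is opaque */

    static int flexfile_collapse_range(struct flexfile *ff, uint64_t off, size_t len)
    {
        printf("collapse-range: off=%llu len=%zu\n", (unsigned long long)off, len);
        ff->size -= len;
        return 0;
    }

    static int flexfile_insert_range(struct flexfile *ff, uint64_t off,
                                     const void *buf, size_t len)
    {
        (void)buf;
        printf("insert-range:   off=%llu len=%zu\n", (unsigned long long)off, len);
        ff->size += len;
        return 0;
    }

    /* Commit one MemTable entry.  `off` is the byte offset of the existing
     * record for this key (or its insertion point) found via the sparse index;
     * `old_len` is 0 for a brand-new key; encoded_kv == NULL means a deletion. */
    static int commit_kv(struct flexfile *table, uint64_t off, size_t old_len,
                         const void *encoded_kv, size_t new_len)
    {
        if (old_len > 0 && flexfile_collapse_range(table, off, old_len) != 0)
            return -1;
        if (encoded_kv == NULL)
            return 0;
        return flexfile_insert_range(table, off, encoded_kv, new_len);
    }

    int main(void)
    {
        struct flexfile table = { .size = 1024 };
        commit_kv(&table, 42, 0, "\x03\x04" "catabcd", 9);  /* SET("cat","abcd"): 9 bytes */
        commit_kv(&table, 42, 9, NULL, 0);                  /* DELETE("cat") */
        printf("table size now %llu\n", (unsigned long long)table.size);
        return 0;
    }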
Real-world workloads often exhibit skewed access patterns [1, 4, 53]. Many popular KV stores employ user-space caching to exploit the access locality for improved search efficiency [7, 16, 19]. FlexDB adopts a similar approach by caching frequently used intervals of the table file in memory. The cache (namely the interval cache) uses the CLOCK replacement algorithm and a write-through policy. Every interval's entry in the sparse index contains a cache pointer that is initialized as NULL to represent an uncached interval. Upon a cache miss, the data of the interval is loaded from the table file, and an array of KV pairs is created based on the data. Then, the cache pointer is updated to point to a new cache entry containing the array, the number of KV pairs, and the interval size. A lookup on a cached interval can perform a binary search on the array.
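A cached-interval lookup reduces to a binary search over the decoded array, as in the small sketch below (an illustrative structure, not FlexDB's actual one; the real entry also carries CLOCK state and the interval's file range).

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    struct kv { const char *key; const char *value; };

    struct cache_entry {
        struct kv *kvs;        /* KV pairs of the interval, in sorted key order */
        int nkeys;             /* number of KV pairs */
        uint32_t bytes;        /* interval size in the table file */
    };

    /* Point lookup within a cached interval: plain binary search on the array. */
    static const char *cached_get(const struct cache_entry *ce, const char *key)
    {
        int lo = 0, hi = ce->nkeys - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int cmp = strcmp(ce->kvs[mid].key, key);
            if (cmp == 0) return ce->kvs[mid].value;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return NULL;           /* not in this interval */
    }

    int main(void)
    {
        struct kv kvs[] = { {"bar", "1"}, {"bit", "2"}, {"far", "3"}, {"kit", "4"} };
        struct cache_entry ce = { kvs, 4, 64 };
        printf("bit -> %s, cat -> %s\n", cached_get(&ce, "bit"),
               cached_get(&ce, "cat") ? cached_get(&ce, "cat") : "(miss)");
        return 0;
    }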
Upon a restart, FlexDB first recovers the uncommitted KV data from the write-ahead log. Then, it constructs the volatile sparse KV index of the table file. Intuitively, the sparse index can be built by sequentially scanning the KV pairs in the table file, but the cost can be significant for a large store. In fact, the index rebuilding only requires an index key for each interval. Therefore, a sparse index can be quickly constructed by skipping a certain number of extents every time an index key is determined.

In FlexDB, extents in the underlying table file are created by inserting or removing KV pairs, which guarantees that an extent in the file always begins with a complete KV pair. To identify a KV pair in the middle of a table file without knowing its exact offset, we add a flexfile_read_extent(off, buf, maxlen) function to the FlexFile library. A call to this function searches for the extent at the given file offset and reads up to maxlen bytes of data at the beginning of the extent. The extent's logical offset (≤ off) and the number of bytes read are returned. To build a sparse index, read_extent is used to retrieve a key at each approximate interval offset (4 KB, 8 KB, ...), and these keys are used as the index keys of the new intervals. FlexDB can immediately start processing requests once the sparse index is built. Upon the first access of a new interval, it can be split if it contains too many keys. Accordingly, the rebuilding cost can be effectively reduced by increasing the interval size.
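The rebuild loop can be illustrated with the sketch below. The text names flexfile_read_extent but not its C signature, so the signature, the 1-byte varints, and the toy in-memory "table file" used to make the sketch runnable are all our assumptions.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Toy stand-in for the table file: three extents, each starting with a
     * complete KV record of the form "varint klen, varint vlen, key, value". */
    struct toy_extent { uint64_t off; const char *bytes; size_t len; };
    static const struct toy_extent toy_table[] = {
        { 0,    "\x03\x01" "act" "1", 6 },
        { 4100, "\x03\x01" "far" "2", 6 },
        { 8220, "\x03\x01" "pin" "3", 6 },
    };

    /* Assumed signature for flexfile_read_extent(off, buf, maxlen): find the
     * extent covering `off`, copy its head into `buf`, report its logical
     * offset (<= off), and return the byte count.  Stubbed against toy_table. */
    static long flexfile_read_extent(uint64_t off, void *buf, size_t maxlen,
                                     uint64_t *extent_off)
    {
        for (int i = 2; i >= 0; i--) {
            if (toy_table[i].off <= off) {
                size_t n = toy_table[i].len < maxlen ? toy_table[i].len : maxlen;
                memcpy(buf, toy_table[i].bytes, n);
                *extent_off = toy_table[i].off;
                return (long)n;
            }
        }
        return -1;
    }

    int main(void)
    {
        /* Rebuild the sparse index: probe one key every ~4 KB instead of
         * scanning every record; each probed extent begins with a full KV pair. */
        for (uint64_t off = 4096; off < 12288; off += 4096) {
            char buf[64];
            uint64_t eoff = 0;
            long n = flexfile_read_extent(off, buf, sizeof(buf), &eoff);
            if (n > 2) {
                char key[32] = {0};
                memcpy(key, buf + 2, (size_t)buf[0]);   /* 1-byte varints in this toy */
                printf("index key \"%s\" for interval near offset %llu\n",
                       key, (unsigned long long)eoff);
            }
        }
        return 0;
    }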
In this section, we experimentally evaluate FlexTree, the FlexFile library, and FlexDB. In the evaluation of FlexTree and the FlexFile library, we focus on the performance of the enhanced insert-range and collapse-range operations as well as the performance metrics of regular file operations. For FlexDB, we use a set of microbenchmarks and YCSB [8]. Experiments in this section are run on a workstation with dual Intel Xeon Silver 4210 CPUs and 64 GB of DDR4 ECC RAM. The persistent storage for all tests is an Intel Optane DC P4801X SSD with 100 GB capacity. The workstation runs a 64-bit Linux OS with kernel version 5.9.14.

In this subsection, we evaluate the performance of the FlexTree index structure and compare it with a regular B+-Tree and a sorted array. The B+-Tree has the structure shown in Figure 3a, where a shift operation can cause intensive node updates. The sorted array is allocated with malloc and contains a sorted sequence of extents. A shift operation in it also requires intensive data movements in memory, but a lookup of an extent in the array can be done efficiently with a binary search.

We benchmark three operations—append, lookup, and insert. An append experiment starts with an empty index. Each operation appends a new extent after all the existing extents. Once the appends are done, we run a lookup experiment by sequentially querying every extent. Note that every lookup performs a complete search on the index instead of simply walking on the leaf nodes or the array. An insert experiment starts with an empty index. Each operation inserts a new extent before all the existing extents.

Table 1 shows the execution time of each experiment. For appends, the sorted array outperforms FlexTree and B+-Tree because appending new extents at the end of an array is trivial and does not require node splits or memory allocations. FlexTree is about 10% slower than B+-Tree. The overhead mainly comes from calculating the effective offset of each accessed entry on the fly.
Table 1: Execution time of the extent metadata operations
 Operation: Append, Lookup, Insert. Execution time (seconds):
 FlexTree:     7.83, 91.1, 9.47, 99.2
 B+-Tree:      7.01, 82.8, 7.91, 88.9
 Sorted Array: 4.49, 55.1, 8.56, 86.8

The results of the lookups are similar to those of the appends. However, for each index, the lookups take a longer time than the corresponding appends. The reason is that the index is small at the beginning of an append test. Appending to a small index is faster than appending to a large index, which reduces the total execution time. In a lookup experiment, every search operation is performed on a large index with a consistently high cost.

FlexTree's address metadata representation scheme allows for much faster extent insertions with a small loss in lookup speed. The B+-Tree and the sorted array show extremely high overhead due to the intensive memory writes and movements. To be specific, every time an extent is inserted at the beginning, the entire mapping index is rewritten. FlexTree maintains a consistent O(log N) cost for insertions, which is asymptotically and practically faster.

In this section, we evaluate the efficiency of file I/O operations in the FlexFile library (referred to as FlexFile in this section) and compare it with the file abstractions provided by two file systems, Ext4 and XFS. The FlexFile library is configured to run on top of an XFS file system. The XFS and Ext4 file systems are formatted using the mkfs command with the default configurations. File system benchmarks that require hierarchical directory structures do not apply to FlexFile because FlexFile manages file address spaces for single files. As a result, the evaluation focuses on file I/O operations.

Each experiment consists of a write phase and a read phase. There are three write patterns for the write phase—random insert (using insert-range), random write, and sequential write. The first two patterns are the same as the INSERT-RANGE and PWRITE tests described in Section 2, respectively. The sequential write pattern writes 4 KB blocks sequentially. A write phase starts with a newly formatted file system and creates a 1 GB file by writing or inserting 4 KB data blocks using the respective pattern. After the write phase, we measure the read performance with two read patterns—sequential and random. Each read operation reads a 4 KB block from the file. The random pattern uses randomly shuffled offsets so that it reads each block exactly once with 1 GB of reads. For each read pattern, the kernel page cache is first cleared. Then the program reads the entire file twice and measures the throughput before and after the cache is warmed up. The results are shown in Table 2. We also calculate the write amplification (WA) ratio of each write pattern using the S.M.A.R.T. data reported by the SSD. The following discusses the key observations on the experimental results.
Fast Inserts and Writes
Insertions in FlexFile have a low cost (O(log N)). As shown in Table 2, FlexFile's random-insert throughput is more than 500× higher than Ext4's (491 vs. 0.87 Kops/sec) and four orders of magnitude higher than XFS's. Figure 11a shows the throughput variations during the random-insert experiments. FlexFile maintains a constantly high throughput, while Ext4 and XFS both suffer extreme throughput degradations because of the growing index sizes that lead to increasingly intensive metadata updates. The write throughput of FlexFile is higher than that of Ext4 and XFS for both random and sequential writes. The reason is that the FlexFile library commits writes to the data file in the unit of segments, which enables batching and buffering in the user space. Meanwhile, the FlexFile library adopts log-structured writes in the data file, which transforms random writes on the FlexFile into sequential writes on the SSD. This allows FlexFile to gain a higher throughput at the beginning of the random insert experiment.

Figure 11: The random insert experiment ((a) the write phase; (b) the sequential read phase)
Low Write Amplification
FlexFile maintains a low write amplification ratio (1.03 to 1.04) throughout all the write experiments. In the random and sequential write experiments, the WA ratio of FlexFile is slightly higher than those of Ext4 and XFS due to the extra data it writes in the logical log and the persistent FlexTree. Ext4 and XFS show higher WA ratios in the random insert experiments because of the overheads of metadata updates for file space allocation. In Ext4 and XFS, each insert operation updates on average 50% of the existing extents' metadata, which causes intensive metadata I/O. In FlexFile, the data file is written sequentially, and each insert operation updates O(log N) FlexTree nodes. Additionally, the logical logging in the FlexFile library substantially reduces metadata writes under random inserts.
Read Performance
All the systems exhibit good read throughput with a warm cache and low random read throughput with a cold cache. However, we observe that FlexFile also shows low throughput in sequential reads with a cold cache on a file constructed by random writes or random inserts. This is because random writes and random inserts generate highly fragmented layouts in the FlexFile where consecutive extents reside in different segments in the underlying data file. Therefore, it cannot benefit from prefetching in the OS kernel, and it costs extra time to perform extent index querying.
Table 2: I/O performance of FlexFile, Ext4, and XFS

                               Random Insert        Random Write         Seq. Write
                             Flex   Ext4   XFS    Flex   Ext4   XFS    Flex   Ext4   XFS
 Write Tput (Kops/sec)        491   0.87   0.01    489    242    359    493    304    460
 Write Amp. Ratio            1.04   1.28   5.53   1.04   1.02   1.02   1.03   1.02   1.02
 Read Throughput (Kops/sec)
   Seq. Cold                   95    221    240     95    445    441    449    482    443
   Seq. Warm                  737   1053   1028    731   1079   1044    921   1048   1050
   Rand. Cold                  92     87     92     92     87     96     95    100     95
   Rand. Warm                 571    803    808    569    836    823    648    804    824
Table 3: Synthetic KV datasets using real-world KV sizes

 Dataset                        ZippyDB [4]   UDB [4]   SYS [1]
 Avg. Key Size (Bytes)               48          27        28
 Avg. Value Size (Bytes)             43         127       396
 Number of KV Pairs (∼40 GB)        470 M       280 M     100 M
Figure 11b shows the throughput variations of the sequential read phase before and after the cache is warmed up. Ext4 and XFS maintain a constant throughput in the first 1 GB with optimal prefetching. Meanwhile, FlexFile shows a much lower throughput at the beginning due to the random reads in the underlying data file. FlexFile's throughput keeps improving as the hit ratio increases. After 1 GB of reads, all the systems show high and stable throughput.
We evaluate the performance of FlexDB through various experiments and compare it with Facebook's RocksDB [16], an I/O-optimized KV store. We also evaluated LMDB [47], a B+-Tree-based KV store, TokuDB [35], a Bε-Tree-based KV store, and KVell [26], a highly-optimized KV store for fast SSDs. However, LMDB and TokuDB exhibit consistently low performance compared with RocksDB. Similar observations are also reported in recent studies [13, 19, 34]. KV insertions in KVell abort when a new key shares a common prefix longer than 8 bytes with an existing key, since KVell only records up to eight bytes of a key for space-saving. Several state-of-the-art KV stores [7, 23, 24, 54] are either built for byte-addressable non-volatile memory or not publicly available for evaluation. Therefore, we focus on the comparison between FlexDB and RocksDB in this section.

For a fair comparison, both FlexDB and RocksDB are configured with a 256 MB MemTable and an 8 GB user-space cache. FlexDB is configured with a maximum interval size of 8 KV pairs. RocksDB is tuned as suggested by its official tuning guide (following the configurations for "Total ordered database, flash storage") [39].

We generate synthetic KV datasets using the representative KV sizes of Facebook's production workloads [1, 4]. Table 3 shows the details of the datasets. The size of each dataset is about 40 GB, which is approximately 5× the size of the user-level cache in both systems. The workloads are generated using three key distributions—sequential, Zipfian, and Zipfian-Composite.

Write
We first evaluate the write performance of the two stores. In each experiment, each thread inserts 25% (approx. 10 GB) of the dataset into an empty store following the key distribution. For sequential writes, the dataset is partitioned into four contiguous ranges, and each thread inserts one range of KVs. We run experiments with the KV datasets described in Table 3 and different key distributions. For the Zipfian and Zipfian-Composite distributions, insertions can overwrite previously inserted keys, which leads to reduced write I/O.

Figure 12: Benchmark results of FlexDB ((a) insertion throughput; (b) SSD writes; (c) query throughput; (d) GC overhead). Key distributions: S – Sequential; Z – Zipfian; C – Zipfian-Composite.

Figure 12a shows the measured throughput of FlexDB and RocksDB in the experiments. The sizes of data written to the SSD are shown in Figure 12b. FlexDB outperforms RocksDB in every experiment. The advantage mainly comes from FlexDB's capability of directly inserting new data into the table file without requiring further rewrites. In contrast, RocksDB needs to merge and sort the tables when moving the KV data across the multi-level structure, which generates excessive rewrites that consume extra CPU and I/O resources. As shown in Figure 12b, RocksDB's write I/O is up to 2× that of FlexDB.

In every experiment with sequential writes, FlexDB generates about 90 GB of write I/O to the SSD, where about 40 GB are in the WAL and the rest are in the FlexFile. However, RocksDB writes up to 175 GB in these experiments. The root cause of the excessive writes is that every newly written table in RocksDB spans a long range in the key space with keys clustered at four distant locations, making the new tables overlap with each other. Consequently, RocksDB has to perform intensive compactions to sort and merge these tables.

Read
We measure the read performance of FlexDB and RocksDB. To emulate a randomized data layout in real-world KV stores, we first load a store with the UDB dataset and perform 100 million random updates following the Zipfian distribution. We then measure the point query (GET) and range query (SCAN) performance of the store. Each read experiment runs for 60 seconds. Figure 12c shows the performance results. For GET operations, FlexDB consistently outperforms RocksDB because it can quickly find the target key using a binary search on the sparse index. In contrast, a GET operation in RocksDB needs to check the tables on multiple levels. In this process, many key comparisons are required for selecting candidate tables on the search path. The advantage of FlexDB is particularly high with sequential GETs because a sequence of GET operations can access the keys in an interval with at most one miss in the interval cache.

The advantage of FlexDB is also significant in range queries. We test the SCAN performance of both systems using the Zipfian distribution. As shown on the right of Figure 12c, FlexDB outperforms RocksDB by roughly 8× to 11.7× because of its capability of quickly searching on the sparse index and retrieving the KV data with optimal cache locality.

GC overhead
We evaluate the impact of the FlexFile GC activities on FlexDB using an update-intensive experiment. In a run of the experiment, each thread performs 100 million KV updates to a store containing the UDB dataset (42 GB). We first run experiments with a 90 GB FlexFile data file, which represents the scenario of modest space utilization and low GC overhead. For comparison, we run the same experiments with a 45 GB data file size that exhibits a 93% space utilization ratio and causes intensive GC activities in the FlexFile. The results are shown in Figure 12d. The intensive GC shows a negligible impact on both throughput and I/O with the sequential and Zipfian workloads. In this scenario, the GC process can easily find near-empty segments because the frequently updated keys are often co-located in the data file with a good spatial locality. Comparatively, the Zipfian-Composite distribution has a much weaker spatial locality, which leads to intensive rewrites in the GC process under the high space utilization ratio. With this workload, the intensive GC causes 10.3% more writes on the SSD and 19.5% lower throughput in FlexDB compared to the low-GC configuration.
Recovery
This experiment evaluates the recovery speed of FlexDB (described in §5.4) with a clean page cache. For a store containing the UDB dataset (42 GB), using a small rebuilding interval size of 4 KB leads to a recovery time of 27.2 seconds. Increasing the recovery interval size to 256 KB reduces the recovery time to 2.5 seconds. For comparison, rebuilding the sparse index by sequentially scanning the table file costs 93.7 seconds. There is a penalty when accessing a large interval for the first time. In practice, the penalty can be reduced by promptly loading and splitting large intervals in the background using spare bandwidth.
YCSB Benchmark
YCSB [8] is a popular benchmark that evaluates KV store performance using realistic workloads. In each experiment, we sequentially load the whole UDB dataset (42 GB) into the store and run the YCSB workloads from A to F. Each workload is run for 60 seconds. The details of the YCSB workloads are shown in Table 4. A scan in workload E performs a seek and retrieves 50 KV pairs.

Table 4: YCSB workloads

 Workload       A         B         C         D        E         F
 Distribution   Zipfian   Zipfian   Zipfian   Latest   Zipfian   Zipfian
 Operations     50% U     5% U      100% R    5% I     5% I      50% R
                50% R     95% R               95% R    95% S     50% M
 I: Insert; U: Update; R: Read; S: Scan; M: Read-Modify-Write.

Figure 13: YCSB benchmark with KV sizes in UDB ((a) 42 GB store size, 64 GB RAM; (b) 84 GB store size, 16 GB RAM)

Figure 13a shows the benchmark results. In read-dominated workloads, including B, C, and E, FlexDB shows higher throughput than RocksDB (up to roughly 7×). This is especially the case in workload E because of FlexDB's advantage on range queries. Workload D is also read-dominated, but FlexDB shows a similar performance to RocksDB. The reason is that the latest access pattern produces a strong temporary access locality. Therefore, most of the requests are served from the MemTable(s), and both stores achieve a high throughput of over 2.6 Mops/sec.

In write-dominated workloads, including A and F, FlexDB also outperforms RocksDB, but the performance advantage is not as high as that in read-dominated workloads. In RocksDB, the compactions are asynchronous, and they do not block read requests. In the current implementation of FlexDB, when a committer thread is actively merging updates into the FlexFile, readers that reach the sparse index and the FlexFile can be temporarily blocked (see §5.2). Meanwhile, the MemTables can still serve user requests. This is the main limitation of the current implementation of FlexDB.

We also run the YCSB benchmark in an out-of-core scenario. In this experiment, we limit the available memory size to 16 GB (using cgroups) and double the size of the UDB dataset to 84 GB (560 M keys). With this setup, only about 10% of the file data can reside in the OS page cache. The benchmark results are shown in Figure 13b. Both systems show reduced throughput in all the workloads due to the increased I/O cost. Compared to the first YCSB experiment, the advantage of FlexDB is reduced in read-dominated workloads B, C, and E, because a miss in the interval cache requires multiple reads to load the interval. That said, FlexDB still achieves 1.3× to 4.0× speedups over RocksDB.

Data-management Systems
Studies on improving I/O efficiency in data-management systems are abundant [10, 56]. B-tree-based KV stores [31, 32, 47] support efficient searching with minimum read I/O but have suboptimal performance under random writes because of the in-place updates [27]. LSM-Tree [33] uses out-of-place writes and delayed sorting to improve write performance, and it has been widely adopted in modern write-optimized KV stores [16, 18]. However, the improved write efficiency comes at the cost of high overheads on read operations, since a search may query multiple tables at different locations [29, 44]. To compensate for reads, LSM-tree-based KV stores need to rewrite table files periodically using a compaction process, which in turn offsets the benefit of out-of-place writes [11, 12, 36, 38, 51]. KVell [26] indexes all the KV pairs in a volatile B+-Tree and leaves KV data unsorted on disk for fast reads and writes. However, maintaining a volatile full index leads to a high memory footprint and a lengthy recovery process. SplinterDB [7] employs a Bε-Tree for efficient writes by logging unsorted KV pairs in every tree node. However, the unsorted node layout causes slow reads, especially for range queries [7]. Recent works [23, 24, 54] also employ byte-addressable non-volatile memories for fast access and persistence. These solutions require non-trivial implementation, including space allocation, garbage collection, and maintaining crash consistency, which overlaps the core duties of a file system. FlexDB delegates the challenging data organizing tasks to the mechanisms behind the file interface, which effectively reduces the application's complexity. Managing persistently sorted KV data with efficient in-place updates achieves fast reads and writes at low cost.

Address Space Management
Modern in-kernel file systems, such as Ext4, XFS, Btrfs, and F2FS, use B+-Trees and their variants or multi-level mapping tables to index file extents [15, 25, 41, 48]. These file systems provide comprehensive support for general file management tasks but exhibit suboptimal performance in metadata-intensive workloads, such as massive file creation, frequent small writes, and the insert-range and collapse-range operations that require data shifting. Recent studies employ write-optimized data structures in file systems to improve metadata management performance. Specifically, BetrFS [22, 55], TokuFS [14], WAFL [30], TableFS [37], and KVFS [46] use write-optimized indexes, including the Bε-Tree [3] and the LSM-Tree [33], to manage file system metadata. Their designs exploit the advantages of these indexes and successfully optimize many file system metadata and file I/O operations. However, these systems still employ a traditional file address space abstraction and do not allow data to be freely moved or shifted in the file address space. Therefore, rearranging file data in these systems still relies on rewriting existing data. The design of FlexFile removes this fundamental limitation from the traditional file address space abstraction. Leveraging the efficient shift operations for logically reorganizing data in files, applications built on FlexFile can easily avoid data rewriting in the first place.
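As a concrete point of comparison, the insert-range and collapse-range operations mentioned above are invoked through fallocate(2) [17]. The sketch below shows the interface; the file name and offsets are illustrative, and the block-alignment requirement noted in the comments is what prevents record-granularity insertions and removals.

```c
/* Sketch: range insertion/removal with the existing in-kernel interface
 * on Ext4/XFS/F2FS. "data.sorted" and the offsets are illustrative. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.sorted", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Remove one 4 KiB block at offset 64 KiB; all later content shifts
     * toward the start of the file. The offset and length must be
     * multiples of the file system block size, so a record of only a few
     * bytes cannot be removed this way. */
    if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 64 * 1024L, 4096) != 0)
        perror("collapse-range");

    /* Make room for 4 KiB at the same offset; later content shifts toward
     * the end of the file, and the new range reads as zeroes until it is
     * overwritten. */
    if (fallocate(fd, FALLOC_FL_INSERT_RANGE, 64 * 1024L, 4096) != 0)
        perror("insert-range");

    close(fd);
    return 0;
}
```

FlexFile's flexible address space avoids both the alignment restriction and the costly data shifting behind these calls.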
Conclusion
This paper presents a novel file abstraction, FlexFile, that enables lightweight in-place updates in a flexible file address space. It allows applications to perform efficient data management on a linear file layout with a simplified implementation. FlexDB, a KV store built on FlexFile with a small codebase, achieves speedups of up to 11.7× for reads and 1.9× for writes compared with the highly optimized RocksDB.

References
[1] Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. "Workload Analysis of a Large-Scale Key-Value Store". In: SIGMETRICS Perform. Eval. Rev. 2017, pp. 363–375.
[3] Michael A. Bender, Martin Farach-Colton, William Jannen, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, and Yang Zhan. "An Introduction to Bε-trees and Write-Optimization". In: ;login: Magazine. 2020, pp. 209–223.
[5] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. "Bigtable: A Distributed Storage System for Structured Data". In: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI'06). 2006, p. 15.
[6] DWARF Debugging Information Format Committee. DWARF Debugging Information Format Version 5. 2017.
[7] Alexander Conway, Abhishek Gupta, Vijay Chidambaram, Martin Farach-Colton, Richard Spillane, Amy Tai, and Rob Johnson. "SplinterDB: Closing the Bandwidth Gap for NVMe Key-Value Stores". In: 2020, pp. 49–63.
[8] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. "Benchmarking Cloud Serving Systems with YCSB". In: Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC'10). 2010, pp. 143–154.
[9] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Third Edition. 3rd ed. The MIT Press, 2009.
[10] Ali Davoudian, Liu Chen, and Mengchi Liu. "A Survey on NoSQL Stores". In: ACM Comput. Surv.
[11] Proceedings of the 2018 International Conference on Management of Data (SIGMOD'18). 2018, pp. 505–520.
[12] Niv Dayan and Stratos Idreos. "The Log-Structured Merge-Bush & the Wacky Continuum". In: Proceedings of the 2019 International Conference on Management of Data (SIGMOD'19). 2019, pp. 449–466.
[13] Siying Dong, Mark Callaghan, Leonidas Galanis, Dhruba Borthakur, and Tony Savor. "Optimizing Space Amplification in RocksDB". In: The Conference on Innovative Data Systems Research (CIDR'17). Vol. 3. 2017, p. 3.
[14] John Esmet, Michael A. Bender, Martin Farach-Colton, and Bradley C. Kuszmaul. "The TokuFS Streaming File System". In: Proceedings of the 4th USENIX Conference on Hot Topics in Storage and File Systems (HotStorage'12). 2012, p. 14.
[15] Ext4 Disk Layout. https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout.
[16] Facebook. RocksDB. https://rocksdb.org.
[17] fallocate(2) — Linux manual page.
[18] Sanjay Ghemawat and Jeff Dean. LevelDB. https://github.com/google/leveldb.
[19] Eran Gilad, Edward Bortnikov, Anastasia Braginsky, Yonatan Gottesman, Eshcar Hillel, Idit Keidar, Nurit Moscovici, and Rana Shahout. "EvenDB: Optimizing Key-Value Storage for Spatial Locality". In: Proceedings of the Fifteenth European Conference on Computer Systems (EuroSys'20). 2020.
[20] Inserting a hole into a file. https://lwn.net/Articles/629965/.
[21] Intel® Optane™ Technology.
[22] William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Esmet, Yizheng Jiao, Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Leif Walsh, Michael A. Bender, Martin Farach-Colton, Rob Johnson, Bradley C. Kuszmaul, and Donald E. Porter. "BetrFS: Write-Optimization in a Kernel File System". In: ACM Trans. Storage.
[23] Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST'19). 2019, pp. 191–204.
[24] Sudarsun Kannan, Nitish Bhat, Ada Gavrilovska, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. "Redesigning LSMs for Nonvolatile Memory with NoveLSM". In: Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC'18). 2018, pp. 993–1005.
[25] Changman Lee, Dongho Sim, Joo-Young Hwang, and Sangyeun Cho. "F2FS: A New File System for Flash Storage". In: Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST'15). 2015, pp. 273–286.
[26] Baptiste Lepers, Oana Balmau, Karan Gupta, and Willy Zwaenepoel. "KVell: The Design and Implementation of a Fast Persistent Key-Value Store". In: Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP'19). 2019, pp. 447–461.
[27] Yinan Li, Bingsheng He, Robin Jun Yang, Qiong Luo, and Ke Yi. "Tree Indexing on Solid State Drives". In: Proc. VLDB Endow. 2016, pp. 133–148.
[29] Chen Luo and Michael J. Carey. "LSM-based storage techniques: a survey". In: The VLDB Journal.
[30] Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST'10). 2010, p. 2.
[31] MongoDB.
[32] Michael A. Olson, Keith Bostic, and Margo I. Seltzer. "Berkeley DB". In: USENIX Annual Technical Conference, FREENIX Track. 1999, pp. 183–191.
[33] Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. "The Log-Structured Merge-Tree (LSM-Tree)". In: Acta Inf.
[34] Proceedings of the 2016 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC'16). 2016, pp. 537–550.
[35] PerconaFT (TokuDB). https://github.com/percona/PerconaFT.
[36] Pandian Raju, Rohan Kadekodi, Vijay Chidambaram, and Ittai Abraham. "PebblesDB: Building Key-Value Stores Using Fragmented Log-Structured Merge Trees". In: Proceedings of the 26th Symposium on Operating Systems Principles (SOSP'17). 2017, pp. 497–514.
[37] Kai Ren and Garth Gibson. "TABLEFS: Enhancing Metadata Efficiency in the Local File System". In: Proceedings of the 2013 USENIX Conference on Annual Technical Conference (USENIX ATC'13). 2013, pp. 145–156.
[38] Kai Ren, Qing Zheng, Joy Arulraj, and Garth Gibson. "SlimDB: A Space-Efficient Key-Value Storage Engine for Semi-Sorted Data". In: Proc. VLDB Endow.
[39] RocksDB Tuning Guide. https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide.
[40] Ohad Rodeh. "B-Trees, Shadowing, and Clones". In: ACM Trans. Storage.
[41] ACM Trans. Storage.
[42] ACM Trans. Comput. Syst.
[43] Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST'14). 2014, pp. 1–16.
[44] Russell Sears and Raghu Ramakrishnan. "bLSM: A General Purpose Log Structured Merge Tree". In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD'12). 2012, pp. 217–228.
[45] Kai Shen, Stan Park, and Meng Zhu. "Journaling of Journal is (Almost) Free". In: Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST'14). 2014, pp. 287–293.
[46] Pradeep Shetty, Richard Spillane, Ravikant Malpani, Binesh Andrews, Justin Seyster, and Erez Zadok. "Building Workload-Independent Storage with VT-Trees". In: Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST'13). 2013, pp. 17–30.
[47] Symas Lightning Memory-mapped Database. https://symas.com/lmdb/.
[48] The SGI XFS Filesystem.
[49] Darrick Wong, Dave Chinner, Eric Sandeen, Ryan Lerch, and Silicon Graphics Inc. XFS Algorithms & Data Structures, 3rd Edition. 2018.
[50] Kan Wu, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. "Towards an Unwritten Contract of Intel Optane SSD". In: Proceedings of the 11th USENIX Conference on Hot Topics in Storage and File Systems (HotStorage'19). 2019, p. 3.
[51] Xingbo Wu, Yuehai Xu, Zili Shao, and Song Jiang. "LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data Items". In: 2015, pp. 71–82.
[52] Jingpei Yang, Ned Plasson, Greg Gillis, Nisha Talagala, and Swaminathan Sundararaman. "Don't Stack Your Log On My Log". In: 2014.
[53] Juncheng Yang, Yao Yue, and K. V. Rashmi. "A large scale analysis of hundreds of in-memory cache clusters at Twitter". In: 2020, pp. 191–208.
[54] Ting Yao, Yiwen Zhang, Jiguang Wan, Qiu Cui, Liu Tang, Hong Jiang, Changsheng Xie, and Xubin He. "MatrixKV: Reducing Write Stalls and Write Amplification in LSM-tree Based KV Stores with Matrix Container in NVM". In: 2020, pp. 17–31.
[55] Yang Zhan, Alex Conway, Yizheng Jiao, Eric Knorr, Michael A. Bender, Martin Farach-Colton, William Jannen, Rob Johnson, Donald E. Porter, and Jun Yuan. "The Full Path to Full-Path Indexing". In: Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST'18). 2018, pp. 123–138.
[56] H. Zhang, G. Chen, B. C. Ooi, K. Tan, and M. Zhang. "In-Memory Big Data Management and Processing: A Survey". In: