SplitFS: Reducing Software Overhead in File Systems for Persistent Memory
Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap, Taesoo Kim, Aasheesh Kolli, Vijay Chidambaram
Rohan Kadekodi
University of Texas at Austin
Se Kwon Lee
University of Texas at Austin
Sanidhya Kashyap
Georgia Institute of Technology
Taesoo Kim
Georgia Institute of Technology
Aasheesh Kolli
Pennsylvania State University and VMware Research
Vijay Chidambaram
University of Texas at Austin and VMware Research
Abstract
We present SplitFS, a file system for persistent memory (PM) that reduces software overhead significantly compared to state-of-the-art PM file systems. SplitFS presents a novel split of responsibilities between a user-space library file system and an existing kernel PM file system. The user-space library file system handles data operations by intercepting POSIX calls, memory-mapping the underlying file, and serving reads and overwrites using processor loads and stores. Metadata operations are handled by the kernel PM file system (ext4 DAX). SplitFS introduces a new primitive termed relink to efficiently support file appends and atomic data operations. SplitFS provides three consistency modes, which different applications can choose from, without interfering with each other. SplitFS reduces software overhead by up to 4× compared to the NOVA PM file system, and 17× compared to ext4 DAX. On a number of micro-benchmarks and applications such as the LevelDB key-value store running the YCSB benchmark, SplitFS increases application performance by up to 2× compared to ext4 DAX and NOVA while providing similar consistency guarantees.

CCS Concepts

• Information systems → Storage class memory; • Hardware → Non-volatile memory; • Software and its engineering → File systems management;

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SOSP ’19, October 27–30, 2019, Huntsville, ON, Canada
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6873-5/19/10...$15.00
https://doi.org/10.1145/3341301.3359631
Keywords
Persistent Memory, File Systems, Crash Consistency, Direct Access
ACM Reference Format:
Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap, Taesoo Kim, Aasheesh Kolli, and Vijay Chidambaram. 2019. SplitFS: Reducing Software Overhead in File Systems for Persistent Memory. In SOSP ’19: Symposium on Operating Systems Principles, October 27–30, 2019, Huntsville, ON, Canada. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3341301.3359631
1 Introduction

Persistent Memory (PM) is a new memory technology that was recently introduced by Intel [17]. PM will be placed on the memory bus like DRAM and will be accessed via processor loads and stores. PM has a unique performance profile: compared to DRAM, loads have 2–3.7× higher latency and 1/3rd bandwidth, while stores have the same latency but 1/6th bandwidth [18]. A single machine can be equipped with up to 6 TB of PM. Given its large capacity and low latency, an important use case for PM will be acting as storage.

Rohan Kadekodi and Se Kwon Lee are supported by SOSP 2019 student travel scholarships from the National Science Foundation.

File system | Append Time (ns) | Overhead (ns) | Overhead (%)
ext4 DAX | 9002 | 8331 | 1241%
PMFS | 4150 | 3479 | 518%
NOVA-Strict | 3021 | 2350 | 350%
SplitFS-Strict | 1251 | 580 | 86%
SplitFS-POSIX | 1160 | 488 | 73%

Table 1: Software Overhead. The table shows the software overhead of various PM file systems for appending a 4K block. It takes 671 ns to write 4 KB to PM. Strict and POSIX indicate the guarantees offered by the file systems (§3.2).

Traditional file systems add large overheads to each file-system operation, especially on the write path. The overhead comes from performing expensive operations on the critical path, including allocation, logging, and updating multiple complex structures. The systems community has proposed different architectures to reduce overhead. BPFS [7], PMFS [24], and NOVA [33] redesign the in-kernel file system from scratch to reduce overhead for file-system operations. Aerie [28] advocates a user-space library file system coupled with a slim kernel component that does coarse-grained allocations. Strata [19] proposes keeping the file system entirely in user-space, dividing the system between a user-space library file system and a user-space metadata server. Aerie
and Strata both seek to reduce overhead by not involving the kernel for most file-system operations.

Despite these efforts, file-system data operations, especially writes, have significant overhead. For example, consider the common operation of appending 4K blocks to a file (total 128 MB). It takes 671 ns to write a 4 KB block to PM; thus, if performing the append operation took a total of 675 ns, the software overhead would be 4 ns. Table 1 shows the software overhead of the append operation on various PM file systems. We observe that there is still significant overhead (3.5–12.4×) for file appends.

This paper presents SplitFS, a PM file system that seeks to reduce software overhead via a novel split architecture: a user-space library file system handles data operations while a kernel PM file system (ext4 DAX) handles metadata operations. We refer to all file-system operations that modify file metadata as metadata operations. Such operations include open(), close(), and even file appends (since the file size is changed). The novelty of SplitFS lies in how responsibilities are divided between the user-space and kernel components, and the semantics provided to applications. Unlike prior work like Aerie, which used the kernel only for coarse-grained operations, or Strata, where all operations are in user-space, SplitFS routes all metadata operations to the kernel. While FLEX [32] invokes the kernel at a fine granularity like SplitFS, it does not provide strong semantics such as synchronous, atomic operations to applications. At a high level, the SplitFS architecture is based on the belief that if we can accelerate common-case data operations, it is worth paying a cost on the comparatively rarer metadata operations. This is in contrast with in-kernel file systems like NOVA, which extensively modify the file system to optimize metadata operations.
SplitFS transparently reduces software overhead for reads and overwrites by intercepting POSIX calls, memory-mapping the underlying file, and serving reads and overwrites via processor loads and stores. SplitFS optimizes file appends by introducing a new primitive named relink that minimizes both data copying and trapping into the kernel. The application does not have to be rewritten in any way to benefit from SplitFS. SplitFS reduces software overhead by up to 4× compared to NOVA and 17× compared to ext4 DAX.

Apart from lowering software overhead, the split architecture leads to several benefits. First, instead of re-implementing file-system functionality, SplitFS can take advantage of the mature, well-tested code in ext4 DAX for metadata operations. Second, the user-space library file system in SplitFS allows each application to run with one of three consistency modes (POSIX, sync, strict). We observe that not all applications require the same guarantees; for example, SQLite does not require the strong guarantees provided by NOVA-strict, and gets 2.5× higher throughput on ext4 DAX and SplitFS-POSIX than on NOVA-strict owing to their weaker guarantees. Applications running with different consistency modes do not interfere with each other on SplitFS.

SplitFS introduces the relink primitive to optimize file appends and atomic data operations. Relink logically and atomically moves a contiguous extent from one file to another, without any physical data movement. Relink is built on top of the swap_extents ioctl in ext4 DAX, and uses ext4 journaling to ensure the source and destination files are modified atomically. Both file appends and data overwrites in strict mode are redirected to a temporary PM file we term the staging file. On fsync(), the data from the staging file is relinked into the original file. Relink provides atomic data operations without paging faults or data copying.

SplitFS also introduces an optimized logging protocol.
In strict mode, all data and metadata operations in SplitFS are atomic and synchronous. SplitFS achieves this by logging each operation. In the common case, SplitFS will write a single cache line worth of data (64B), followed by one memory fence (e.g., sfence in x86 systems), for each operation; in contrast, NOVA writes at least two cache lines and issues two fences. As a result of these optimizations, SplitFS logging is 4× faster than NOVA in the critical path. Thanks to relink and optimized logging, atomic data operations in SplitFS are 2–6× faster than in NOVA-strict, providing strong guarantees at low software overhead.

We evaluate SplitFS using a number of micro-benchmarks, three utilities (git, tar, rsync), two key-value stores (Redis, LevelDB), and an embedded database (SQLite). Our evaluation on Intel DC Persistent Memory shows that SplitFS, though it is built on ext4 DAX, outperforms ext4 DAX by up to 2× on many workloads. SplitFS outperforms NOVA by 10%–2× (when providing the same consistency guarantees) on LevelDB, Redis, and SQLite when running benchmarks like YCSB and TPCC. SplitFS also reduces the total amount of write IO by 2× compared to Strata on certain workloads. On metadata-heavy workloads such as git and tar, SplitFS suffers a modest drop in performance (less than 15%) compared to NOVA and ext4 DAX.

SplitFS is built on top of ext4 DAX; this is both a strength and a weakness. Since SplitFS routes all metadata operations through ext4 DAX, it suffers from the high software overhead and high write IO of ext4 DAX for metadata operations. Despite these limitations, we believe SplitFS presents a useful new point in the spectrum of PM file-system designs. ext4 DAX is a robust file system under active development; its performance will improve with every Linux kernel version.
SplitFS provides the best features of ext4 DAX while making up for its lack of performance and strong consistency guarantees.

This paper makes the following contributions:

• A new architecture for PM file systems with a novel split of responsibilities between a user-space library file system and a kernel file system.
• The novel relink primitive that can be used to provide efficient appends and atomic data operations.
• The design and implementation of SplitFS, based on the split architecture. We have made SplitFS publicly available at https://github.com/utsaslab/splitfs.
• Experimental evidence demonstrating that SplitFS outperforms state-of-the-art in-kernel and in-user-space PM file systems, on a range of workloads.
2 Background

This section provides background on persistent memory (PM), PM file systems, Direct Access, and memory mapping.
Persistent memory is a new memory technology that offers durability and performance close to that of DRAM. PM can be attached on the memory bus similar to DRAM, and would be accessed via processor loads and stores. PM offers 8-byte atomic stores, and stores become persistent as soon as they reach the PM controller [16]. There are two ways to ensure that stores become persistent: (i) using non-temporal store instructions (e.g., movnt in x86) to bypass the cache hierarchy and reach the PM controller, or (ii) using a combination of regular temporal store instructions and cache line flush instructions (e.g., clflush or clwb in x86).

Intel DC Persistent Memory is the first PM product that was made commercially available, in April 2019. Table 2 lists the performance characteristics of PM revealed in a report by Izraelevitz et al. [18]. Compared to DRAM, PM has 3.7× higher latency for random reads, 2× higher latency for sequential reads, 1/3rd read bandwidth, and close to 1/6th write bandwidth. Finally, PMs are expected to exhibit limited write endurance [25].

Property | DRAM | Intel PM
Sequential read latency (ns) | 81 | 169 (2.08×)
Random read latency (ns) | 81 | 305 (3.76×)
Store + flush + fence (ns) | 86 | 91 (1.05×)
Read bandwidth (GB/s) | 120 | 39.4 (0.33×)
Write bandwidth (GB/s) | 80 | 13.9 (0.17×)

Table 2: PM Performance. The table shows performance characteristics of DRAM, PM, and the ratio of PM/DRAM, as reported by Izraelevitz et al. [18].

The Linux ext4 file system introduced a new mode called Direct Access (DAX) to help users access PM [21]. DAX file systems eschew the use of page caches and rely on memory mapping to provide low-latency access to PM.

A memory map operation (performed via the mmap() system call) in ext4 DAX maps one or more pages in the process virtual address space to extents on PM. For example, consider a range of virtual addresses mapped to a contiguous range of bytes in file foo on ext4 DAX; each byte in that range of foo then corresponds to a fixed byte on PM.
A store to one of those mapped virtual addresses then translates directly to a store to the corresponding byte on PM. Thus, PM can be accessed via processor loads and stores without the interference of software; the virtual memory subsystem is in charge of translating virtual addresses into corresponding physical addresses on PM.

While DAX and mmap() provide low-latency access to PM, they do not provide other features such as naming or atomicity for operations. The application is forced to impose its own structure and semantics on the raw bytes offered by mmap(). As a result, PM file systems still provide useful features to applications and end users.

Apart from ext4 DAX, researchers have developed a number of other PM file systems such as SCMFS [31], BPFS [7], Aerie [28], PMFS [24], NOVA [33], and Strata [19]. Only ext4 DAX, PMFS (now deprecated), NOVA, and Strata are publicly available and supported by modern Linux 4.x kernels.

These file systems make trade-offs between software overhead, amount of write IO, and operation guarantees. NOVA provides strong guarantees such as atomicity for file-system operations. PMFS provides slightly weaker guarantees (data operations are not atomic), but as a result obtains better performance on some workloads. Strata is a cross-media file system which uses PM as one of its layers. Strata writes all data to a per-process private log, then coalesces the data and copies it to a shared area for public access. For workloads dominated by operations such as appends, Strata cannot coalesce the data effectively, and has to write data twice: once to the private log, and once to the shared area. This increases the PM wear-out by up to 2×. All these file systems still suffer from significant overhead for write operations (Table 1).

3 SplitFS

We present the goals of SplitFS, its three modes, and their guarantees. We present an overview of the design, describe how different operations are handled, and discuss how SplitFS provides atomic operations at low overhead.
We describe the implementation of SplitFS, and discuss its various tuning parameters. Finally, we discuss how the design of SplitFS affects security.
3.1 Goals

Low software overhead. SplitFS aims to reduce software overhead for data operations, especially writes and appends.
Transparency. SplitFS does not require the application to be modified in any way to obtain lower software overhead and increased performance.
Minimal data copying and write IO. SplitFS aims to reduce the number of writes made to PM. SplitFS aims to avoid copying data within the file system whenever possible. This both helps performance and reduces wear-out on PM. Minimizing writes is especially important when providing strong guarantees like atomic operations.
Low implementation complexity. SplitFS aims to re-use existing software like ext4 DAX as much as possible, and to reduce the amount of new code that must be written and maintained for SplitFS.
Flexible guarantees. SplitFS aims to provide applications with a choice of crash-consistency guarantees. This is in contrast with PM file systems today, which provide all running applications with the same set of guarantees.
3.2 SplitFS Modes and Guarantees

SplitFS provides three different modes: POSIX, sync, and strict. Each mode provides a different set of guarantees. Concurrent applications can use different modes at the same time as they run on SplitFS. Across all modes, SplitFS ensures the file system retains its integrity across crashes.

Table 3 presents the three modes provided by SplitFS. Across all modes, appends are atomic in SplitFS; if a series of appends is followed by fsync(), the file will be atomically appended on fsync().

Mode | Sync. Data Ops | Atomic Data Ops | Sync. Metadata Ops | Atomic Metadata Ops | Equivalent to
POSIX | ✗ | ✗ | ✗ | ✓ | ext4 DAX
sync | ✓ | ✗ | ✓ | ✓ | NOVA-Relaxed, PMFS
strict | ✓ | ✓ | ✓ | ✓ | NOVA-Strict, Strata

Table 3: SplitFS modes. The table shows the three modes of SplitFS, the guarantees provided by each mode, and lists current file systems which provide the same guarantees.

POSIX mode. In POSIX mode, SplitFS provides metadata consistency [6], similar to ext4 DAX. The file system will recover to a consistent state after a crash with respect to its metadata. In this mode, overwrites are performed in-place and are synchronous. Note that appends are not synchronous, and require an fsync() to be persisted. However, SplitFS in POSIX mode guarantees atomic appends, a property not provided by ext4 DAX. This mode slightly differs from standard POSIX semantics: when a file is accessed or modified, the file metadata will not immediately reflect that.
Sync mode. SplitFS ensures that, on top of the POSIX mode guarantees, operations are also guaranteed to be synchronous. An operation may be considered complete and persistent once the corresponding call returns, and applications do not need a subsequent fsync(). Operations are not atomic in this mode; a crash may leave a data operation partially completed. No additional crash recovery needs to be performed by SplitFS in this mode. This mode provides similar guarantees to PMFS, as well as to NOVA without data and metadata checksumming and with in-place updates; we term this NOVA configuration NOVA-Relaxed.

Strict mode. SplitFS ensures that, on top of the sync mode guarantees, each operation is also atomic. This is a useful guarantee for applications; editors can allow atomic changes to the file when the user saves the file, and databases can remove logging and directly update the database. This mode does not provide atomicity across system calls though, so it cannot be used to update two files atomically together. This mode provides similar guarantees to a NOVA configuration we term NOVA-Strict: NOVA with copy-on-write updates, but without checksums enabled.
Visibility. Apart from appends, all SplitFS operations become immediately visible to all other processes on the system. On fsync(), appends are persisted and become visible to the rest of the system. SplitFS is unique in its visibility guarantees, and takes the middle ground between ext4 DAX and NOVA, where all operations are immediately visible, and Strata, where new files and data updates are only visible to other processes after the digest operation. Immediate visibility of changes to data and metadata, combined with atomic, synchronous guarantees, removes the need for leases to coordinate sharing; applications can share access to files as they would on any other POSIX file system.

Technique | Benefit
Split architecture | Low-overhead data operations, correct metadata operations
Collection of memory-mmaps | Low-overhead data operations in the presence of updates and appends
Relink + Staging | Optimized appends, atomic data operations, low write amplification
Optimized operation logging | Atomic operations, low write amplification

Table 4: Techniques. The table lists each main technique used in SplitFS along with the benefit it provides. The techniques work together to enable SplitFS to provide strong guarantees at low software overhead.

Figure 1: SplitFS Overview. The figure provides an overview of how SplitFS works. Read and write operations are transformed into loads and stores on the memory-mapped file. Append operations are staged in a staging file and relinked on fsync(). Other metadata POSIX calls like open(), close(), etc. are passed through to the in-kernel PM file system. Note that loads and stores do not incur the overhead of trapping into the kernel.
3.3 Overview

We now provide an overview of the design of SplitFS, and how it uses various techniques to provide the outlined guarantees. Table 4 lists the different techniques and the benefit each technique provides.
Split architecture. As shown in Figure 1, SplitFS comprises two major components: a user-space library linked to the application, called U-Split, and a kernel file system, called K-Split. SplitFS services all data operations (e.g., read() and write() calls) directly in user-space and routes metadata operations (e.g., fsync(), open(), etc.) to the kernel file system underneath. File-system crash-consistency is guaranteed at all times. This approach is similar to Exokernel [12], where only the control operations are handled by the kernel and data operations are handled in user-space.

Collection of mmaps. Reads and overwrites are handled by mmap()-ing the surrounding 2 MB part of the file, and serving reads via memcpy and writes via non-temporal stores (movnt instructions). A single logical file may have data present in multiple physical files; for example, appends are first sent to a staging file, and thus the file data is spread over the original file and the staging file. SplitFS uses a collection of memory-maps to handle this situation. Each file is associated with a number of open mmap() calls over multiple physical files, and reads and overwrites are routed appropriately.
Staging. SplitFS uses temporary files called staging files for both appends and atomic data operations. Appends are first routed to a staging file, and are later relinked on fsync(). Similarly, file overwrites in strict mode are also first sent to staging files and later relinked to their appropriate files.
Relink. On an fsync(), all the staged appends of a file must be moved to the target file; in strict mode, overwrites have to be moved as well. One way to move the staged appends to the target file is to allocate new blocks and then copy the appended data to them. However, this approach leads to write amplification and high overhead. To avoid these unnecessary data copies, we developed a new primitive called relink. Relink logically moves PM blocks from the staging file to the target file without incurring any copies.

Relink has the following signature: relink(file1, offset1, file2, offset2, size). Relink atomically moves data from offset1 of file1 to offset2 of file2. If file2 already has data at offset2, existing data blocks are de-allocated. Atomicity is ensured by wrapping the changes in an ext4 journal transaction. Relink is a metadata operation, and does not involve copying data when the involved offsets and size are block-aligned. When offset1 or offset2 happens to be in the middle of a block, SplitFS copies the partial data for that block to file2, and performs a metadata-only relink for the rest of the data. Given that SplitFS is targeted at POSIX applications, block writes and appends are often block-aligned by the applications. Figure 2 illustrates the different steps involved in the relink operation.

Figure 2: Relink steps. This figure provides an overview of the steps involved while performing a relink operation. Initially, the staging file has an mmap() region of pre-allocated physical blocks available for appends. An append to the target file is routed to a staging-file block, and later reads of the appended region are also routed to that block. On fsync(), the newly written block in the staging file is logically linked to the target file, while retaining its mmap() region and physical block.
Optimized logging . In strict mode, SplitFS guarantees atom-icity for all operations. To provide atomicity, we employ an
Operation Log and use logical redo logging to record theintent of each operation. Each U-Split instance has its ownoperation log that is pre-allocated, mmap -ed by U-Split, andwritten using non-temporal store instructions. We use thenecessary memory fence instructions to ensure that log en-tries persist in the correct order. To reduce the overheadsfrom logging, we ensure that in the common case, per opera-tion, we write one cache line (64B) worth of data to PM anduse a single memory fence ( sfence in x86) instruction in theprocess. Operation log entries do not contain the file dataassociated with the operation ( e.g., data being appended to afile), instead they contain a logical pointer to the staging filewhere the data is being held.We employ a number of techniques to optimize logging.First, to distinguish between valid and invalid or torn logentries, we incorporate a 4B transactional checksum [23]within the 64B log entry. The use of checksum reduces thenumber of fence instructions necessary to persist and val-idate a log entry from two to one. Second, we maintain atail for the log in DRAM and concurrent threads use the tailas a synchronization variable. They use compare-and-swapto atomically advance the tail and write to their respectivelog entries concurrently. Third, during the initialization ofthe operation log file, we zero it out. So, during crash re-covery, we identify all non-zero 64B aligned log entries as being potentially valid and then use the checksum to identifyany torn entries. The rest are valid entries and are replayed.Replaying log entries is idempotent, so replaying them mul-tiple times on crashes is safe. We employ a 128MB operationlog file and if it becomes full, we checkpoint the state ofthe application by calling relink() on all the open files thathave data in staging files. We then zero out the log and reuseit. Finally, we designed our logging mechanism such thatall common case operations ( write() , open() , etc.) 
can belogged using a single 64B log entry while some uncommonoperations, like rename() , require multiple log entries.Our logging protocol works well with the SplitFS archi-tecture. The tail of each U-Split log is maintained only inDRAM as it is not required for crash recovery. Valid log en-tries are instead identified using checksums. In contrast, filesystems such as NOVA have a log per inode that resides onPM, whose tail is updated after each operation via expensive clflush and sfence operations. Providing Atomic Operations . In strict mode, SplitFS pro-vides synchronous, atomic operations. Atomicity is providedin an efficient manner by the combination of staging files,relink, and optimized logging. Atomicity for data operationslike overwrites is achieved by redirecting them also to a stag-ing file, similar to how appends are performed. SplitFS logsthese writes and appends to record where the latest data re-sides in the event of a crash. On fsync() , SplitFS relinks thedata from the staging file to the target file atomically. Onceagain, the data is written exactly once, though SplitFS pro-vides the strong guarantee of atomic data operations. Relinkallows SplitFS to implement a form of localized copy-on-write. Due to the staging files being pre-allocated, locality ispreserved to an extent. SplitFS logs metadata operations toensure they are atomic and synchronous. Optimized loggingensures that for most operations exactly one cache line iswritten and one sfence is issued for logging.
3.4 Handling Reads, Overwrites, and Appends

Reads. Reads consult the collection of mmaps to determine where the most recent data for the read offset is, since the data could have been overwritten or appended (and thus reside in a staging file). If a valid memory-mapped region for the offsets being read exists in U-Split, the read is serviced from the corresponding region. If such a region does not exist, then the 2 MB region surrounding the read offset is first memory-mapped, added to the collection of mmaps, and then the read operation is serviced using processor loads.
Overwrites. Similar to reads, if the target offset is already memory-mapped, then U-Split services the overwrite using non-temporal store instructions. If the target offset is not memory-mapped, then the 2 MB region surrounding the offset is first memory-mapped, added to the collection of mmaps, and then the overwrite is serviced. However, in strict mode, to guarantee atomicity, overwrites are first redirected to a staging file (even if the offset is memory-mapped), then the operation is logged, and finally relinked on a subsequent fsync() or close().

Appends. SplitFS redirects all appends to a staging file, and performs a relink on a subsequent fsync() or close(). As with overwrites, appends are performed with non-temporal writes, and in strict mode, SplitFS also logs details of the append operation to ensure atomicity.

3.5 Implementation

We implement SplitFS as a combination of a user-space library file system (9K lines of C code) and a small patch to ext4 DAX to add the relink system call (500 lines of C code). SplitFS supports 35 common POSIX calls, such as pwrite(), pread64(), fread(), readv(), ftruncate64(), openat(), etc.; we found that supporting this set of calls is sufficient to support a variety of applications and micro-benchmarks. Since the PM file systems PMFS and NOVA are supported by Linux kernel version 4.13, we modified 4.13 to support SplitFS. We now present other details of our implementation.

Intercepting POSIX calls. SplitFS uses
LD_PRELOAD to intercept POSIX calls and either serve them from user-space or route them to the kernel after performing some book-keeping tasks. Since SplitFS intercepts calls at the POSIX level in glibc rather than at the system call level, SplitFS has to intercept several variants of common system calls like write().

Relink. We implement relink by leveraging an ioctl provided by ext4 DAX. The
EXT4_IOC_MOVE_EXT ioctl swaps extents between a source file and a destination file, and uses journaling to perform this atomically. The ioctl also de-allocates blocks in the target file if they are replaced by blocks from the source file. By default, the ioctl also flushes the swapped data in the target file; we modify the ioctl to only touch metadata, without copying, moving, or persisting data. We also ensure that after the swap has happened, existing memory mappings of both source and destination files remain valid; this is vital to SplitFS performance as it avoids page faults. The ioctl requires blocks to be allocated at both the source and destination files. To satisfy this requirement, when handling appends via relink, we allocate blocks at the destination file, swap extents from the staging file, and then de-allocate the blocks. This allows us to perform relink without using up extra space, and reduces implementation complexity at the cost of temporary allocation of data.
Handling file open and close. On file open, SplitFS performs stat() on the file and caches its attributes in user-space to help handle later calls. When a file is closed, we do not clear its cached information. When the file is unlinked, all cached metadata is cleared, and if the file has been memory-mapped, it is un-mapped. The cached attributes are used to check file permissions on every subsequent file operation (e.g., read()) intercepted by U-Split.
Handling fork. Since SplitFS uses a user-space library file system, special care needs to be taken to handle fork() and execve() correctly. When fork() is called, SplitFS is copied into the address space of the new process (as part of copying the address space of the parent process), so that the new process can continue to access SplitFS.
Handling execve. execve() overwrites the address space, but open file descriptors are expected to work after the call completes. To handle this, SplitFS does the following: before executing execve(), SplitFS copies its in-memory data about open files to a shared memory file on /dev/shm; the file name is the process ID. After executing execve(), SplitFS checks the shared memory device and copies information from the file if it exists.

Handling dup. When a file descriptor is duplicated, the file offset must change whenever operations are performed on either file descriptor. SplitFS handles this by maintaining a single offset per open file, and having each file descriptor maintained by SplitFS point to this shared offset. Thus, if two threads dup a file descriptor and change the offset from either thread, SplitFS ensures both threads see the changes.
Staging files. SplitFS pre-allocates staging files at startup, creating 10 files each 160 MB in size. Whenever a staging file is completely utilized, a background thread wakes up and creates and pre-allocates a new staging file. This avoids the overhead of creating staging files in the critical path.
Cache of memory-mappings. SplitFS caches all memory-mappings it creates in its collection of memory mappings. A memory-mapping is only discarded on unlink(). This reduces the cost of setting up memory mappings in the critical path of a read or write.
Multi-thread access. SplitFS uses a lock-free queue for managing the staging files. It uses fine-grained reader-writer locks to protect its in-memory metadata about open files, inodes, and memory-mappings.
SplitFS provides a number of tunable parameters that can be set by application developers and users for each U-Split instance. These parameters affect the performance of SplitFS.

mmap() size. SplitFS supports a configurable mmap() size for handling overwrites and reads. Currently, SplitFS supports mmap() sizes ranging from 2 MB to 512 MB. The default size is 2 MB, allowing SplitFS to employ huge pages while pre-populating the mappings.
Number of staging files at startup. There are ten staging files at startup by default; when a staging file is used up, SplitFS creates another staging file in the background. We experimentally found that having ten staging files provides a good balance between application performance and the initialization cost and space usage of staging files.
Size of the operation log. The default size of the operation log is 128 MB for each U-Split instance. Since each log entry occupies a single cache line in the common case, SplitFS can support up to 2M operations without clearing and re-initializing the log. This helps applications with small bursts achieve good performance while getting strong semantics.
SplitFS does not expose any new security vulnerabilities compared to an in-kernel file system. All metadata operations are passed through to the kernel, which performs security checks. SplitFS does not allow a user to open, read, or write a file to which they previously did not have permissions. The U-Split instances are isolated from each other in separate processes; therefore, applications cannot access the data of other applications while running on SplitFS. Each U-Split instance only stores book-keeping information in DRAM for the files that the application already has access to. An application that uses SplitFS may corrupt its own files, just as with an in-kernel file system.
We reflect on our experiences building SplitFS, describe problems we encountered and how we solved them, and present surprising insights that we discovered.
Page faults lead to significant cost. SplitFS memory-maps files before accessing them, and uses
MAP_POPULATE to pre-fault all pages so that later reads and writes do not incur page-fault latency. As a result, we find that a significant portion of the time for open() is consumed by page faults. While the latency of device IO usually dominates page-fault cost in storage systems based on solid state drives or magnetic hard drives, the low latency of persistent memory highlights the cost of page faults.
Huge pages are fragile. A natural way of minimizing page faults is to use 2 MB huge pages. However, we found huge pages fragile and hard to use. Setting up a huge-page mapping in the Linux kernel requires a number of conditions. First, the virtual address must be 2 MB aligned. Second, the physical address on PM must be 2 MB aligned. As a result, fragmentation in either the virtual address space or the physical PM prevents huge pages from being created. For most workloads, after a few thousand files were created and deleted, fragmenting PM, we found it impossible to create any new huge pages. Our collection-of-mappings technique sidesteps this problem by creating huge pages at the beginning of the workload, and reusing them to serve reads and writes. Without huge pages, we observed read performance dropping by 50% in many workloads. We believe this is a fundamental problem that must be tackled, since huge pages are crucial for accessing large quantities of PM.
Avoiding work in the critical path is important. Finally, we found that a general design technique that proved crucial for SplitFS is simplifying the critical path. We pre-allocate wherever possible, and use a background thread to perform pre-allocation in the background. Similarly, we pre-fault memory mappings, and use a cache to re-use memory mappings as much as possible. SplitFS rarely performs heavy-weight work in the critical path of a data operation. Similarly, even in strict mode, SplitFS optimizes logging, trading off shorter recovery time for a simple, low-overhead logging protocol. We believe this design principle will be useful for other systems designed for PM.
Staging writes in DRAM. An alternate design that we tried was staging writes in DRAM instead of on PM. While DRAM staging files incur lower allocation costs than PM staging files, we found that the cost of copying data from DRAM to PM on fsync() overshadowed the benefit of staging data in DRAM. In general, DRAM buffering is less useful in PM systems because PM and DRAM performance are similar.
Legacy applications need to be rewritten to take maximum advantage of PM. We observe that the applications we evaluate, such as LevelDB, spend a significant portion of their time (over 60%) on file operations; to take further advantage of PM, they would need to be rewritten to use libraries like libpmem that operate exclusively on data structures in mmap()-ed PM.

In this section, we use a number of microbenchmarks and applications to evaluate SplitFS in relation to state-of-the-art PM file systems like ext4 DAX, NOVA, and PMFS. While comparing these different file systems, we seek to answer the following questions:

Application                   Description
TPC-C [9] on SQLite [26]      Online transaction processing
YCSB [8] on LevelDB [15]      Data retrieval & maintenance
Set in Redis [1]              In-memory data structure store
Git                           Popular version control software
Tar                           Linux utility for data compression
Rsync                         Linux utility for data copy
Table 5: Applications used in evaluation. The table provides a brief description of the real-world applications we use to evaluate PM file systems.

• How does SplitFS affect the performance of different system calls as compared to ext4 DAX? (§5.4)
• How do the different techniques employed in SplitFS contribute to overall performance? (§5.5)
• How does SplitFS compare to other file systems for different PM access patterns? (§5.6)
• Does SplitFS reduce file-system software overhead as compared to other PM file systems? (§5.7)
• How does SplitFS compare to other file systems for real-world applications? (§5.8 & §5.9)
• What are the compute and storage overheads incurred when using SplitFS? (§5.10)

We first briefly describe our experimental methodology (§5.1 & §5.2) before addressing each of the above questions.
We evaluate the performance of SplitFS against other PM file systems on Intel Optane DC Persistent Memory Modules (PMM). The experiments are performed on a 2-socket, 96-core machine with 768 GB PMM, 375 GB DRAM, and 32 MB Last Level Cache (LLC). We run all evaluated file systems on the 4.13 version of the Linux kernel (Ubuntu 16.04). We run each experiment multiple times and report the mean. In all cases, the standard deviation was less than five percent of the mean, and the experiments could be reliably repeated.
We used two key-value stores (Redis, LevelDB), an embedded database (SQLite), and three utilities (tar, git, rsync) to evaluate the performance of SplitFS. Table 5 lists the applications and their characteristics.
TPC-C on SQLite. TPC-C is an online transaction processing benchmark. It has five different types of transactions, each with different ratios of reads and writes. We run SQLite v3.23.1 with SplitFS, and measure the performance of TPC-C on SQLite in the Write-Ahead-Logging (WAL) mode.
YCSB on LevelDB. The Yahoo Cloud Serving Benchmark [8] has six different key-value store benchmarks, each with a different read/write ratio. We run the YCSB workloads on the LevelDB key-value store. We set the sstable size to 64 MB, as recommended in Facebook's tuning guide [13].
Redis. We set 1M key-value pairs in Redis [1], an in-memory key-value store. We ran Redis in the Append-Only-File mode, where it logs updates to the database in a file and performs fsync() on the file every second.
Utilities. We also evaluated the performance of SplitFS for tar, git, and rsync. With git, we measured the time taken for git add and git commit of all files in the Linux kernel, repeated ten times. With rsync, we copy a 7 GB dataset of 1200 files with characteristics similar to backup datasets [30] from one PM location to another. With tar, we compressed the Linux kernel 4.18 along with the files from the backup dataset.
Correctness. First, to validate the functional correctness of SplitFS, we run various micro-benchmarks and real-world applications and compare the resulting file-system state to the one obtained with ext4 DAX. We observe that the file-system states obtained with ext4 DAX and SplitFS are equivalent, validating how SplitFS handles POSIX calls in its user-space library file system.
Recovery times. Crash recovery in the POSIX and sync modes of SplitFS does not require anything beyond allowing the underlying ext4 DAX file system to recover. In strict mode, however, all valid log entries in the operation log need to be replayed on top of ext4 DAX recovery. This additional log-replay time depends on the number and type of valid log entries in the log. To estimate the additional time needed for recovery, we crash our real-world workloads at random points in their execution and measure the log-replay time. In our crash experiments, the maximum number of log entries to be replayed was 18,000, which took about 3 seconds on emulated PM (emulation details in §5.8). In a worst-case micro-benchmark where we perform cache-line-sized writes and crash with 2M valid log entries (128 MB of data), we observed a log-replay time of 6 seconds on emulated PM.
The central premise of SplitFS is that it is a good trade-off to accelerate data operations at the expense of metadata operations. Since data operations are more prevalent, this optimization improves overall application performance. To validate this premise, we construct a micro-benchmark similar to FileBench Varmail [27] that issues a variety of data and metadata operations. The micro-benchmark first creates a file and appends 16 KB to it (as four appends, each followed by an fsync()), closes it, opens it again, reads the whole file as one read call, closes it, then opens and closes the file once more, and finally deletes the file. The multiple open and close calls were introduced to account for the fact that their latency varies over time: opening a file for the first time takes longer than opening a file that we recently closed, due to file-metadata caching inside U-Split. Table 6 shows the latencies we observed for different system calls; they are reported for all three modes provided by SplitFS, and for ext4 DAX, on which SplitFS was built.

System call    Strict     Sync     POSIX    ext4 DAX
open            2.09      2.08      1.82      1.54
close           0.78      0.69      0.69      0.34
append          3.14      3.09      2.84     11.05
fsync           6.85      6.80      6.80     28.98
read            4.57      4.53      4.53      5.04
unlink         14.60     13.56     14.33      8.60

Table 6: SplitFS system call overheads. The table compares the latency (in µs) of different system calls for various modes of SplitFS and ext4 DAX.

We make three observations based on these results. First, data operations on SplitFS are significantly faster than on ext4 DAX; writes especially are 3–4× faster. Second, metadata operations (e.g., open(), close(), etc.) are slower on SplitFS than on ext4 DAX, as SplitFS has to set up its own data structures in addition to performing the operation on ext4 DAX. In SplitFS, unlink() is an expensive operation because the file mappings that are created for serving reads and overwrites need to be unmapped in the unlink() wrapper. Third, as the consistency guarantees provided by SplitFS get stronger, the syscall latency generally increases. This increase can be attributed to the additional work SplitFS has to do (e.g., logging in strict mode) for each system call to provide stronger guarantees. Overall, SplitFS achieves its objective of accelerating data operations, albeit at the expense of metadata operations.

We examine how the various techniques employed by SplitFS contribute to overall performance. We use two write-intensive microbenchmarks: sequential 4 KB overwrites and 4 KB appends. An fsync() is issued every ten operations. Figure 3 shows how individual techniques, introduced one after the other, improve performance.
Figure 3: SplitFS techniques contributions. This figure shows the contributions of different techniques to overall performance. We compare the relative merits of these techniques using two write-intensive microbenchmarks: sequential overwrites and appends.

Sequential overwrites. SplitFS increases sequential-overwrite performance by more than 2× compared to ext4 DAX, since overwrites are served from user-space via processor stores. However, further optimizations like handling appends using staging files and relink have negligible impact on this workload, as it does not issue any file append operations.
Appends. The split architecture by itself does not accelerate appends: without staging files or relink, all appends go to ext4 DAX, as they are metadata operations. Just introducing staging files to buffer appends improves performance by about 2×. In this setting, even though appends are serviced in user-space, overall performance is bogged down by expensive data-copy operations on fsync(). Introducing the relink primitive to this setting eliminates data copies and increases application throughput by 5×.

To understand the relative merits of different PM file systems, we compare their performance on microbenchmarks performing different file-IO patterns: sequential reads, random reads, sequential writes, random writes, and appends. Each benchmark reads/writes an entire 128 MB file in 4 KB operations. We compare file systems providing the same guarantees: SplitFS-POSIX with ext4 DAX, SplitFS-sync with PMFS, and SplitFS-strict with NOVA-strict and Strata. Figure 4 captures the performance of these file systems for the different micro-benchmarks.
Figure 4: Performance on different IO patterns. This figure compares SplitFS with state-of-the-art PM file systems in their respective modes, using micro-benchmarks that perform five different kinds of file access patterns. The y-axis is throughput normalized to ext4 DAX in POSIX mode, PMFS in sync mode, and NOVA-strict in strict mode (higher is better). The absolute throughput numbers in Mops/s are given over the baseline in each group.

POSIX mode. SplitFS is able to reduce the execution times of ext4 DAX by at least 27% and as much as 7.85× (sequential reads and appends, respectively). Read-heavy workloads present fewer improvement opportunities for SplitFS, as file read paths in the kernel are optimized in modern PM file systems. However, write paths are much more complex and longer, especially for appends. So, servicing a write in user-space has a higher payoff than servicing a read, an observation we already made in Table 6.

Sync mode. Compared to PMFS, SplitFS improves the performance of write workloads (by as much as 2.89×) and increases performance for read workloads (by as much as 56%). As with ext4 DAX, SplitFS's ability to avoid expensive write system calls translates to superior performance on the write workloads.

Strict mode. NOVA, Strata, and SplitFS in this mode provide atomicity guarantees for all operations and perform the necessary logging. As can be expected, the overheads of logging result in reduced performance compared to file systems in other modes. Overall, SplitFS improves performance over NOVA by up to 5.8× on the random-writes workload. This improvement stems from SplitFS's superior logging, which incurs half the number of log writes and fence operations compared to NOVA.
The central premise of SplitFS is that it is possible to accelerate applications by reducing file-system software overhead. We define file-system software overhead as the time taken to service a file-system call minus the time spent actually accessing data on the PM device. For example, if a system call takes 100 µs to be serviced, of which only 25 µs were spent reading or writing to PM, then we say that the software overhead is 75 µs. To provide another example, for appending 4 KB (which takes 10 µs to write to PM), if file system A writes 10 metadata items (incurring 100 µs) while file system B writes two metadata items (incurring 20 µs), file system B has the lower overhead. In addition to avoiding the kernel traps of system calls, the different techniques discussed in §3 help SplitFS reduce its software overhead. Minimizing software overhead allows applications to fully leverage PM.

Figure 5: Software overhead in applications. This figure shows the relative file-system software overhead incurred by different applications with various file systems, as compared to SplitFS providing the same level of consistency guarantees (lower is better). The numbers shown indicate the absolute time taken to run the workload for the baseline file system.

Figure 5 highlights the relative software overheads incurred by different file systems compared to SplitFS providing the same level of guarantees. We present results for three write-heavy workloads: LevelDB running YCSB Load A and Run A, and SQLite running TPCC. ext4 DAX and NOVA (in relaxed mode) suffer the highest relative software overheads, up to 3.6× and 7.4× respectively. NOVA-relaxed incurs the highest software overhead for TPCC because it has to update the per-inode logical log entries on overwrites before updating the data in place. On the other hand, SplitFS-sync can directly perform in-place data updates, and thus has significantly lower software overhead.
PMFS suffers the lowest relative software overhead, capping off at 1.9× for YCSB Load A and Run A. Overall, SplitFS incurs the lowest software overhead.

Figure 6: Real application performance. This figure shows the performance of both data-intensive applications (YCSB, Redis, and TPCC) and metadata-intensive utilities (git, tar, and rsync) with different file systems, providing three different consistency guarantees: POSIX, sync, and strict. Overall, SplitFS beats all other file systems on all data-intensive applications (in their respective modes) while incurring minor performance degradation on metadata-heavy workloads. For throughput workloads, higher is better. For latency workloads, lower is better. The numbers indicate the absolute throughput in Kops/s or latency in seconds for the base file system.

Figure 6 summarizes the performance of various applications on different file systems. The performance metric we use for these data-intensive workloads (LevelDB with YCSB, Redis with 100% writes, and SQLite with TPCC) is throughput measured in Kops/s. For each mode of consistency guarantee (POSIX, sync, and strict), we compare SplitFS to state-of-the-art PM file systems. We report the absolute performance for the baseline file system in each category and the relative throughput for SplitFS. Despite our best efforts, we were not able to run Strata on these large applications; other researchers have also reported problems in evaluating Strata [32]. We evaluated Strata with a smaller-scale YCSB workload using a 20 GB private log.

Overall, SplitFS outperforms other PM file systems (when providing similar consistency guarantees) on all data-intensive workloads, by as much as 2.70×. We next present a breakdown of these numbers for different guarantees.

POSIX mode. SplitFS outperforms ext4 DAX in all workloads. Write-heavy workloads like Run A (2×), Load A (89%), Load E (91%), Redis (27%), etc.
benefit the most from SplitFS. SplitFS speeds up writes and appends the most, so write-heavy workloads benefit the most. SplitFS outperforms ext4 DAX on read-dominated workloads, but the margin of improvement is lower.

Workload    Strata          SplitFS
Load A       29.1 kops/s    1.73×
Run A        55.2 kops/s    1.76×
Run B        76.8 kops/s    2.16×
Run C        94.3 kops/s    2.14×
Run D       113.1 kops/s    2.25×
Load E       29.1 kops/s    1.72×
Run E         8.1 kops/s    2.03×
Run F        73.3 kops/s    2.25×

Table 7: SplitFS vs. Strata. This table compares the performance of Strata and SplitFS-strict running YCSB on LevelDB. We present the raw throughput numbers for Strata and the normalized SplitFS-strict throughput w.r.t. Strata. This is the biggest workload that we could run reliably on Strata.
Sync and strict mode. SplitFS outperforms the sync-mode file systems PMFS and NOVA (relaxed) and the strict-mode file system NOVA (strict) on all the data-intensive workloads. Once again, it is the write-heavy workloads that show the biggest boost in performance. For example, SplitFS in sync mode outperforms NOVA (relaxed) and PMFS by 2× and 30% on Run A, and in strict mode outperforms NOVA (strict) by 2×. Read-heavy workloads, on the other hand, do not show much improvement in performance.

Comparison with Strata. We were able to reliably evaluate Strata (employing a 20 GB private log) using LevelDB running smaller-scale YCSB workloads (1M records, and 1M ops for workloads A–D and F, 500K ops for workload E). We were unable to run Strata on Intel DC Persistent Memory; hence, we use DRAM to emulate PM. We employ the same PM emulation framework used by Strata. We inject a delay of 220 ns on every read() system call to emulate the access latencies of the PM hardware. We do not add this fixed 220 ns delay for writes, because writes do not go straight to PM in the critical path, but only to the memory controller. We add bandwidth-modeling delays for reads as well as writes to emulate a memory device with 1/3rd the bandwidth of DRAM, an expected characteristic of PMs [18]. While this emulation approach is far from perfect, we observe that the resulting memory access characteristics are in line with the expected behavior of PMs [19]. SplitFS outperforms Strata on all workloads, by 1.72×–2.25×, as shown in Table 7.

Figure 6 compares the performance of SplitFS with other PM file systems (we only show the best-performing PM file system) on metadata-heavy workloads like git, tar, and rsync. These metadata-heavy workloads do not present many opportunities for SplitFS to service system calls in userspace, and in turn slow metadata operations down due to the additional bookkeeping performed by SplitFS. These workloads represent the worst-case scenarios for SplitFS.
The maxi-mum overhead experienced by SplitFS is 13%.
SplitFS consumes memory for its file-related metadata (e.g., to keep track of open file descriptors and the staging files used). It additionally consumes CPU time to execute background threads that help with metadata management and move some expensive tasks off the application's critical path.
Memory usage. SplitFS uses a maximum of 100 MB to maintain its own metadata to help track different files, the mappings between file offsets and mmap()-ed regions, etc. In strict mode, SplitFS additionally uses 40 MB to maintain data structures that provide atomicity guarantees.
CPU utilization. SplitFS uses a background thread to handle various deferred tasks (e.g., staging-file allocation, file closures). This thread utilizes one physical thread of the machine, occasionally increasing CPU consumption by 100%.
SplitFS builds on a large body of work on PM file systems and low-latency storage systems. We briefly describe the work that is closest to SplitFS.
Aerie. Aerie [28] was one of the first systems to advocate accessing PM from user-space. Aerie proposed a split architecture similar to SplitFS, with a user-space library file system and a kernel component. Aerie used a user-space metadata server to hand out leases, and only used the kernel component for coarse-grained activities like allocation. In contrast, SplitFS does not use leases (instead making most operations immediately visible) and uses ext4 DAX as its kernel component, passing all metadata operations to the kernel. Aerie proposed eliminating the POSIX interface, aiming to provide applications flexibility in interfaces. In contrast, SplitFS aims to efficiently support the POSIX interface.
Strata. The Strata [19] cross-device file system is similar to Aerie and SplitFS in many respects. There are two main differences from SplitFS. First, Strata writes all data to a process-private log, coalesces the data, and then writes it to a shared space. In contrast, only appends are private (and only until fsync()) in SplitFS; all metadata operations and overwrites are immediately visible to all processes. SplitFS does not need to copy data between a private space and a shared space; it instead relinks data into the target file. Second, since Strata is implemented entirely in user-space, its authors had to re-implement a lot of VFS functionality in their user-space library. SplitFS instead depends on the mature codebase of ext4 DAX for all metadata operations.
Quill and FLEX. Quill [11] and File Emulation with DAX (FLEX) [32] both share with SplitFS the core technique of transparently transforming read and overwrite POSIX calls into processor loads and stores. However, while Quill and FLEX do not provide strong semantics, SplitFS can provide applications with synchronous, atomic operations if required. SplitFS also differs in its handling of appends. Quill calls into the kernel for every operation, and FLEX optimizes appends by pre-allocating data beyond what the application asks for. In contrast, SplitFS elegantly handles this problem using staging files and the relink primitive. While Quill appends are slower than ext4 DAX, SplitFS appends are faster than ext4 DAX appends. At the time of writing this paper, FLEX has not been made open-source, so we could not evaluate it.
PM file systems. Several file systems, such as SCMFS [31], BPFS [7], and NOVA [33], have been developed specifically for PM. While each file system tries to reduce software overhead, they are unable to avoid the cost of trapping into the kernel. The relink primitive of SplitFS is similar to the short-circuit paging presented in BPFS. However, while short-circuit paging relies on an atomic 8-byte write, SplitFS relies on ext4's journaling mechanism to make relink atomic.
Kernel bypass. Several projects have advocated direct user-space access to networking [29], storage [5, 10, 14, 20], and other hardware features [3, 4, 22]. These projects typically follow the philosophy of separating the control path and the data path, as in Exokernel [12] and Nemesis [2]. SplitFS follows this philosophy, but differs in the abstraction provided by the kernel component: SplitFS uses a PM file system as its kernel component to handle all metadata operations, instead of limiting it to lower-level decisions like allocation.
We present SplitFS, a PM file system built using the split architecture. SplitFS handles data operations entirely in user-space, and routes metadata operations through the ext4 DAX PM file system. SplitFS provides three modes with varying guarantees, and allows applications running at the same time to use different modes. SplitFS only requires adding a single system call to the ext4 DAX file system. Evaluating SplitFS with micro-benchmarks and real applications, we show that it outperforms state-of-the-art PM file systems like NOVA on many workloads. The design of SplitFS allows users to benefit from the maturity and constant development of the ext4 DAX file system, while getting the performance and strong guarantees of state-of-the-art PM file systems. SplitFS is publicly available at https://github.com/utsaslab/splitfs.
Acknowledgments
We would like to thank our shepherd, Keith Smith, the anonymous reviewers, and members of the LASR group and the Systems and Storage Lab for their feedback and guidance. We would like to thank Intel and ETRI IITP/KEIT [2014-3-00035] for providing access to Optane DC Persistent Memory for conducting experiments for the paper. This work was supported by NSF CAREER grant 1751277 and generous donations from VMware, Google, and Facebook. Any opinions, findings, and conclusions, or recommendations expressed herein are those of the authors and do not necessarily reflect the views of other institutions.

References

[1] 2019. Redis: In-memory data structure store. https://redis.io. (2019).
[2] Paul R. Barham. 1997. A fresh approach to file system quality of service. In Proceedings of the 7th International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV '97). IEEE, 113–122.
[3] Adam Belay, Andrea Bittau, Ali José Mashtizadeh, David Terei, David Mazières, and Christos Kozyrakis. 2012. Dune: Safe User-level Access to Privileged CPU Features. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2012), London, UK, March 3-7, 2012. 387–400. https://doi.org/10.1145/2150976.2151017
[6] Vijay Chidambaram, Tushar Sharma, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2012. Consistency Without Ordering. In Proceedings of the 10th USENIX Symposium on File and Storage Technologies (FAST '12). San Jose, California, 101–116.
[7] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee. 2009. Better I/O Through Byte-addressable, Persistent Memory. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP '09). 133–146.
[8] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing. ACM, 143–154.
[9] Transaction Processing Performance Council. 2001. TPC Benchmark C, Standard Specification Version 5. (2001).
[10] Matt DeBergalis, Peter F. Corbett, Steven Kleiman, Arthur Lent, Dave Noveck, Thomas Talpey, and Mark Wittle. 2003. The Direct Access File System. In Proceedings of the FAST '03 Conference on File and Storage Technologies, March 31–April 2, 2003, Cathedral Hill Hotel, San Francisco, California, USA.
[11] Quill: Exploiting fast non-volatile memory by transparently bypassing the file system. Department of Computer Science and Engineering, University of California, San Diego.
[12] Dawson R. Engler, M. Frans Kaashoek, and James O'Toole. 1995. Exokernel: An Operating System Architecture for Application-Level Resource Management. In Proceedings of the Fifteenth ACM Symposium on Operating System Principles (SOSP 1995), Copper Mountain Resort, Colorado, USA, December 3-6, 1995. 251–266. https://doi.org/10.1145/224056.224076
[13] Facebook. 2017. RocksDB Tuning Guide. https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide. (2017).
[14] Garth A. Gibson, David Nagle, Khalil Amiri, Fay W. Chang, Eugene M. Feinberg, Howard Gobioff, Chen Lee, Berend Ozceri, Erik Riedel, David Rochberg, and Jim Zelenka. 1997. File Server Scaling with Network-Attached Secure Disks. In Proceedings of the 1997 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Seattle, Washington, USA, June 15-18, 1997.
[18] CoRR abs/1903.05714 (2019). arXiv:1903.05714 http://arxiv.org/abs/1903.05714
[19] Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, and Thomas E. Anderson. 2017. Strata: A Cross Media File System. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP 2017), Shanghai, China, October 28-31, 2017. ACM, 460–477. https://doi.org/10.1145/3132747.3132770
[20] Edward K. Lee and Chandramohan A. Thekkath. 1996. Petal: Distributed Virtual Disks. In ASPLOS-VII Proceedings: Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, Massachusetts, USA, October 1-5, 1996.
In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP 2005), Brighton, UK, October 23-26, 2005. 206–220. https://doi.org/10.1145/1095810.1095830
[24] Dulloor Subramanya Rao, Sanjay Kumar, Anil Keshavamurthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson. 2014. System Software for Persistent Memory. In Ninth EuroSys Conference 2014 (EuroSys 2014), Amsterdam, The Netherlands, April 13-16, 2014.
login: The USENIX Magazine 41, 1 (2016), 6–12.
[28] Haris Volos, Sanketh Nalli, Sankarlingam Panneerselvam, Venkatanathan Varadarajan, Prashant Saxena, and Michael M. Swift. 2014. Aerie: Flexible File-system Interfaces to Storage-class Memory. In Proceedings of the Ninth European Conference on Computer Systems (EuroSys '14).
[29] Thorsten von Eicken, Anindya Basu, Vineet Buch, and Werner Vogels. 1995. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the Fifteenth ACM Symposium on Operating System Principles (SOSP 1995), Copper Mountain Resort, Colorado, USA, December 3-6, 1995. 40–53. https://doi.org/10.1145/224056.224061
[30] Grant Wallace, Fred Douglis, Hangwei Qian, Philip Shilane, Stephen Smaldone, Mark Chamness, and Windsor Hsu. 2012. Characteristics of backup workloads in production systems. In
Proceedings of the 10th USENIX conference on File and Stor-age Technologies, FAST 2012, San Jose, CA, USA, February 14-17, 2012
TOS
9, 3(2013), 7:1–7:23. https://doi.org/10.1145/2501620.2501621[32] Jian Xu, Juno Kim, Amirsaman Memaripour, and Steven Swanson. 2019.Finding and Fixing Performance Pathologies in Persistent MemorySoftware Stacks. In
Proceedings of the Twenty-Fourth InternationalConference on Architectural Support for Programming Languages andOperating Systems, ASPLOS 2019, Providence, RI, USA, April 13-17, 2019 .427–439. https://doi.org/10.1145/3297858.3304077[33] Jian Xu and Steven Swanson. 2016. NOVA: A Log-structured File Sys-tem for Hybrid Volatile/Non-volatile Main Memories. In14th USENIXConference on File and Storage Technologies, FAST 2016, Santa Clara,CA, USA, February 22-25, 2016.