Kernel/User-level Collaborative Persistent Memory File System with Efficiency and Protection
Youmin Chen Youyou Lu Bohong Zhu Jiwu Shu Tsinghua University
Abstract
Emerging high-performance non-volatile memories recall the importance of efficient file system design. To avoid the virtual file system (VFS) and syscall overhead of kernel-based file systems, recent works deploy file systems directly in user level. Unfortunately, a user-level file system can easily be corrupted by a buggy program with misused pointers, and is hard to scale on multi-core platforms when it incorporates a centralized coordination service.
In this paper, we propose KucoFS, a Kernel and user-level collaborative file system. It consists of two parts: a user-level library with direct-access interfaces, and a kernel thread, which performs metadata updates and enforces write protection by toggling the permission bits in the page table. Hence, KucoFS achieves both the direct access of user-level designs and the fine-grained write protection of kernel-level ones. We further explore its scalability to multicores. For metadata scalability, KucoFS rebalances the pathname resolution overhead between the kernel and userspace by adopting the index offloading technique. For data access efficiency, it coordinates data allocation between kernel and userspace, and uses range-lock write and lock-free read to improve concurrency. Experiments on Optane DC persistent memory show that KucoFS significantly outperforms existing file systems and shows better scalability.
Emerging byte-addressable non-volatile memories (NVMs), such as PCM [19, 26, 38], ReRAM [5], and the recently released Intel Optane DC persistent memory [4], provide performance comparable to DRAM and data persistence similar to disks. Such high-performance hardware recalls the importance of redesigning efficient file systems. Efficiency here refers not only to the lightweight software overhead of the file system itself, but also to its scalability to multicores, which is needed to exploit the hardware performance of non-volatile memories.
File systems have long been part of the operating system and are placed at the kernel level to provide data protection from arbitrary user writes. System calls (syscalls) are used for communication between the kernel and userspace. In the kernel, the virtual file system (VFS) is an abstraction layer that hides concrete file system designs to provide uniform access. However, both syscall and VFS incur non-negligible overhead in file systems for NVMs. Our evaluation on NOVA [33] shows that even this highly scalable and efficient NVM-aware file system still suffers great overhead in the VFS layer and fails to scale on some file operations (e.g., creat/unlink). For syscall, the context-switch overhead occupies up to 34% of the file system access time, even without counting the effects of TLB and CPU cache misses on the subsequent execution.
Recent works like Strata [18] and Aerie [30] propose to design NVM file systems at user level. By bypassing the operating system, they exploit the benefits of direct access. However, since the NVM space is exported into the applications' address space, a programmer can easily corrupt the file system image by misusing pointers that accidentally point to the NVM space. Moreover, these file systems adopt a trusted but centralized component to coordinate critical updates, which inevitably restricts their scalability to multi-cores.
It is difficult to achieve both performance efficiency and write protection simultaneously as long as the VFS and the kernel/user-space architecture remain unchanged. In this paper, we revisit the file system architecture and propose a Kernel and user-level collaborative File System named KucoFS. Unlike existing user-level file systems, KucoFS enables user-level direct access while ensuring the write protection that a kernel file system provides. KucoFS decouples the file system into a kernel thread (a.k.a. master) and a user-space library (a.k.a. Ulib). Programs can directly read and write file data in user level by linking with Ulib, while the master is dedicated to updating metadata on behalf of the applications, as well as guaranteeing the integrity of file data. KucoFS prevents a buggy program from corrupting the file system by exporting the NVM space to user level in read-only mode. In this way, read operations can still be conducted in user space. To serve write operations without compromising the direct-access feature, the master carefully manipulates the page table to make the related data pages writable beforehand, and read-only again once the operation completes, retaining the protection feature.
We further explore the multicore scalability from the following aspects: (1) Metadata Scalability.
Like existing user-level file systems, KucoFS introduces a centralized master, despite the different insight behind such an architecture. As a result, the master in KucoFS can still become the bottleneck when the number of served programs increases. We introduce index offloading to migrate the pathname resolution overhead to userspace, and use batching-based logging to amortize the metadata persistence overhead. (2) Write Protocol. To write file data, Ulib needs to interact with the master both before and after the operation to enforce write protection. This not only further increases the pressure on the master, but also leads to increased latency. We propose an efficient write protocol that reduces the number of interactions between Ulib and the master when writing a file. It achieves this by lazily reserving free data pages from the master, and by coordinating concurrent write operations directly in user space with a range lock. (3) Read-Write Conflicts. Ulib is likely to read inconsistent data when it directly accesses the file system without any coordination, while master-involved reading reduces the benefits of direct access. We propose lock-free fast read to deal with read-write conflicts. By carefully checking the status of metadata, Ulib is able to consistently read file data without any interaction with the master, despite concurrent writers.
Implementing an NVM file system in the Linux kernel faces two types of unavoidable costs: the syscall overhead and the heavyweight software stack in VFS. We investigate both by analyzing NOVA [33], a well-known, highly scalable and efficient NVM-based file system. Our experimental platform is described in Section 6.1.
Syscall Overhead.
We analyze the syscall overhead by collecting the context-switch latency of common file system operations (each operation is repeated over 1 million files or directories with a single thread). The results are shown in Figure 1(a). We observe that the context-switch latency takes up to 21% of the total execution time, and this ratio is especially large for read-oriented operations (e.g., stat/open). Note that the context-switch latency we captured only includes the direct part; indirect costs (e.g., cache pollution) can further affect the efficiency of a program [28].
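For reference, a minimal sketch of how such a per-operation latency can be measured from user space is shown below; the file path and iteration count are placeholders, and the breakdown in Figure 1 comes from the authors' more detailed instrumentation.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/* Time N stat() calls on one file and report the mean latency. This
 * captures the full user-visible cost (trap + VFS + file system), not
 * the isolated context-switch share reported in Figure 1. */
int main(void)
{
    const char *path = "/mnt/pmem/testfile";   /* placeholder: must exist */
    const long iters = 1000000;
    struct stat st;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        stat(path, &st);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg stat latency: %.0f ns\n", ns / iters);
    return 0;
}
```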
Inefficiency of VFS.
In the existing Linux kernel, VFS improves the performance of storage devices (e.g., HDD/SSD) by maintaining a page cache in DRAM, but such a caching mechanism is not always effective for NVMs, since their access latency is very close to DRAM's. Therefore, a number of NVM-aware file systems choose to bypass it directly [10, 12, 13, 22, 31, 33, 37]. However, we find that the remaining software stack in VFS is still too heavyweight: our experiments show that NOVA spends an average of 34% of its execution time in the VFS layer (Figure 1(a)). In addition, VFS synchronizes concurrent syscalls with coarse-grained locks, which limits scalability. As shown in Figure 1(b), to create/rename/delete files in the same folder, VFS directly locks the parent directory, so the throughput stays unchanged despite the increasing number of client threads.
To sum up, the unified abstraction in VFS and the syscall interface do provide a safe and convenient way for programmers, but at the same time, such a classical design also restricts us from reconstructing the file system stack.
Figure 1: Analysis of OS-part Overhead with NOVA.
A group of file systems reduces the OS-part overhead by enabling user-level programs to directly access NVM devices without trapping into the kernel [18, 30]. However, they fail to provide the following important properties:
Write Protection.
Both Aerie [30] and Strata [18] rely on the hardware virtualization capabilities of modern server systems (i.e., the MMU) to enforce coarse-grained protection: they grant each application access rights to contiguous subsets of the NVM space, so as to prevent other malicious processes from corrupting the file system image. However, mapping (a subset of) the file system image into an application's address space and granting it write access is still dangerous, despite the third-party service they adopt to manage metadata: applications can access Aerie by directly updating the file data in place, and Strata allows user-level programs to directly update the per-process operation log and the DRAM cache (including both metadata and data). As a result, a buggy program can easily corrupt the file system image by misusing pointers that accidentally point to the NVM space [11, 34], and real-world evidence shows that such accidents are quite common.
Multicore Scalability.
Aerie relies on a trusted file system service (TFS, a separate user-level process) to ensure the integrity of metadata updates and to coordinate concurrent accesses with a distributed lock service. Such a centralized service easily becomes the bottleneck when the number of concurrent applications increases. Strata, in contrast, enables applications to update file data by appending their modifications directly to a per-process log without the involvement of a third-party service. However, Strata requires background threads (KernFS) to asynchronously digest the log entries (including both data and metadata) into the storage devices. If an application completely uses up its log, it must wait for an in-progress digest to complete before it can reclaim log space. Consequently, the number of digestion threads determines Strata's overall performance. Similar to Aerie, Strata also relies on the KernFS for concurrency control, which means the application needs to interact with the KernFS each time it accesses a new file. Besides, both of them access the third-party service via socket-based RPCs, which again introduce context-switch overhead, reducing the benefits of direct access.
In this section, we discuss the design goals and non-goals, and clarify the trade-offs we made when building KucoFS.
Direct-access and data protection.
The key design aspect of KucoFS lies in decoupling the functionality of a file system into two parts, so as to achieve the respective advantages of direct access of user-level file systems and write protection of kernel-based ones. Note that KucoFS mainly targets enforcing write protection against buggy programs; immunity to malicious attacks is out of the scope of this paper. Nevertheless, KucoFS is still robust to them in most cases by using checksums and leases (Section 4.3).
Scalability.
KucoFS should work well on a multi-core platform, so as to take full advantage of the internal parallelism of persistent memory. This drives us to design a scalable master service for concurrent applications. Besides, more efficient concurrency control is also required to deal with concurrent accesses and read-write conflicts.
Atomicity and Consistency, as required by most existing applications [24]. Two aspects need to be taken into consideration: 1) KucoFS should always remain consistent even after the system crashes abnormally, which requires us to carefully design failure-atomic update protocols; 2) readers should always see consistent data/metadata while other programs are concurrently updating files or directories.
Compatible APIs.
KucoFS should be backward compatible with kernel file system APIs, so that existing applications can use KucoFS without modifying their source code.
KucoFS makes a few tradeoffs that deviate from standard POSIX semantics, but without restricting its applicability in real-world applications. 1) KucoFS implements per-user directory trees (i.e., the programs within the same user share a "private" root node), instead of a global tree, so as to enforce read protection (Section 4.4). 2) We do not provide an explicit way for sharing data between different users, as it is not the common case (several feasible approaches are discussed in Section 4.4). 3) Some minor properties are not implemented in KucoFS (e.g., atime).
Figure 2: Architecture of KucoFS.
We designed KucoFS with the main goal of providing direct access while enforcing data protection, failure atomicity and consistency, as well as scalability to multi-cores.
Figure 2 shows the architecture of KucoFS. KucoFS consists of a user-level library and a global kernel thread, which are respectively called Ulib and master. Ulib communicates with the master via an exclusively owned message buffer. In KucoFS, each user owns a partition of the file system image, which is mapped into the user-level address space with read-only access rights. By linking with Ulib, an application can post memory load instructions to directly locate the data for read-only operations (e.g., read/stat). Ulib writes file data by always redirecting updates to new data pages with a copy-on-write mechanism. To enable user-level direct writes, the master modifies the permission bits in the page table to switch the newly allocated data pages between "writable" and "read-only" while Ulib is updating them. Ulib is not allowed to update metadata directly. Instead, it posts a request to the master through the message buffer, and the master updates the metadata on its behalf.
KucoFS adopts both DRAM and NVM to manage the file system image (see Figure 3). For efficiency, KucoFS only operates on the DRAM data for normal requests. In DRAM, an array of pointers (the inode table) is placed at a predefined location to point to the actual inodes. The first element in the inode table always points to the root inode of each user; therefore, Ulib can look up iteratively from the root inode to any file directly in user space. KucoFS uses an Ext2-like [9] block mapping to map a file to its data pages. We choose block mapping, instead of the widely used extent tree, to support lock-free fast read (Section 4.4). We then introduce a skip-list [25] to organize the dentry list of each directory, so as to achieve atomicity and consistency (Section 4.2).
Figure 3: Data layout in KucoFS and the steps to create a file.
To ensure the durability and crash consistency of metadata, KucoFS further places an append-only persistent operation log in NVM. When the master updates the metadata, it first atomically appends a log entry, and then actually updates the in-memory metadata. To prevent the operation log from growing arbitrarily, the master periodically checkpoints the modifications to the NVM metadata pages in the background and finally truncates the log (Section 4.6). Since the operation log only contains lightweight metadata, the checkpoint overhead is not high. In the face of system failures, the in-memory metadata can always be recovered by replaying the log entries in the operation log.
In addition to the operation log and metadata pages, the remaining NVM space is cut into contiguous 4 KB data pages to store file data. The free data pages are managed with both a bitmap in NVM and a free list in DRAM (for fast allocation). Similar to the metadata pages, the bitmap is also lazily persisted by the master during checkpointing.
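A rough struct-level sketch of this layout is given below; the field names and widths are illustrative assumptions rather than KucoFS's exact on-media format.

```c
#include <stdint.h>

#define KUCO_NAME_MAX 255

enum kuco_op { KUCO_OP_CREATE, KUCO_OP_UNLINK, KUCO_OP_RENAME, KUCO_OP_WRITE };

/* One record in the append-only operation log (persisted in NVM).
 * For creat it records the new inode number, the parent directory's
 * inode number, the file name and the access attributes. */
struct kuco_log_entry {
    uint32_t op;                       /* enum kuco_op */
    uint32_t inode_no;
    uint32_t parent_inode_no;
    uint32_t mode;                     /* permission bits / ACL summary */
    uint16_t name_len;
    char     name[KUCO_NAME_MAX + 1];
} __attribute__((packed));

/* DRAM-resident index: an array of inode pointers at a predefined
 * location; entry 0 points to the per-user root inode, so Ulib can walk
 * from the root to any file entirely in user space. */
struct kuco_inode;                     /* opaque in this sketch */
extern struct kuco_inode *inode_table[1 << 20];
```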
KucoFS delegates all metadata updates to the master. To relieve the pressure on the master, we propose to 1) minimize its metadata indexing overhead with index offloading, and 2) reduce the metadata persistence overhead with batching.
Index Offloading.
To update metadata, the master needs to perform an iterative pathname resolution from the root inode down to the directory containing the file. When a large number of processes access the file system concurrently, such indexing overhead is a heavy burden for the master. Things become even worse when a directory contains a large number of sub-files or the file path is long. To address this issue, we propose to offload the pathname resolution from the master to Ulib.
Since the file system image is mapped into user space, Ulib can locate the related metadata directly in user level before posting a metadata update request. Take the creat operation for example: Ulib finds the address of the predecessor in its parent directory's dentry list, and then posts the request to the master, piggybacking the addresses of the related metadata. In this way, the master can directly insert a new dentry into the dentry list at the given address. For the unlink operation, the addresses of both the dentry in the parent directory and the inode itself are provided. However, we still need extra techniques to ensure correctness.
First, we need to ensure that
Ulib can always read a consistent directory tree while the master is updating it concurrently. To this end, we organize the dentry list of each directory as a skip-list [25], keyed by the hash value of each file name. A skip-list is a linked-list-like data structure with multiple layers, where each higher layer acts as an "express lane" for the lists below, thus providing O(logN) search/insert complexity (see Figure 3). More importantly, by performing simple pointer manipulations on a singly linked list with the CPU's atomic operations, we can atomically update the list. We enforce the master to update the dentry list at a different point in time for different operations: for creat, it inserts the new dentry in the final step, to atomically make the created file visible; for unlink, it deletes the dentry first. Hence, Ulib is guaranteed to always have a consistent view of the directory tree even without acquiring a lock. Renaming involves updating two dentries simultaneously, so a program could observe two identical files at some point in time. To address this issue, we add a dirty flag to each dentry to prevent Ulib from reading such an inconsistent state.
Second, it is possible that the metadata pre-located by Ulib becomes obsolete before it is actually accessed by the master (e.g., the inode or dentry has already been deleted by the master on behalf of other concurrent processes). To solve this problem, we reuse the dirty bit in each inode/dentry: once an item is deleted, this bit is set to an invalid state, so other applications and the master itself can determine the liveness of each metadata item. The deleted items are temporarily kept in place and reclaimed via an epoch-based reclamation (EBR) mechanism [14]. We follow the classic approach of using three reclamation queues, each associated with an epoch number. The master pushes deleted items only to the currently active epoch queue. When all Ulib instances are active in the current epoch, the master increases the global epoch and begins to reclaim space from the oldest queue. A reader executing in user level can suffer arbitrary delays due to thread scheduling, impacting the reclamation efficiency, but we believe this is not a serious issue, since KucoFS only reclaims these obsolete items periodically.
Third, a pre-located dentry may no longer be the predecessor when a new dentry has been inserted between it and its successor in the meantime. Hence, the master also needs to check the validity of the pre-located metadata by comparing the related fields. Note that the master can update in-memory metadata without any synchronization overhead (i.e., locking), since all metadata updates are delegated to the master [27].
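Because all metadata updates are funneled through the single-writer master while Ulib readers traverse the dentry list lock-free, an insert only needs a release store of the predecessor's next pointer, and an unlink only needs to set the dirty flag before detaching the node. A simplified sketch of the bottom level of the skip-list (field names are assumptions):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Bottom level of the dentry skip-list; the higher "express lanes" are
 * maintained the same way. Field names are illustrative. */
struct dentry {
    uint64_t name_hash;               /* key: hash of the file name */
    uint32_t inode_no;
    _Atomic uint32_t dirty;           /* set before the node becomes invalid */
    struct dentry *_Atomic next;
};

/* Master-side insert (single writer): link the new node after `pred`.
 * The release store makes the fully initialized node visible to
 * concurrent lock-free readers in Ulib. */
static void dentry_insert(struct dentry *pred, struct dentry *node)
{
    atomic_store_explicit(&node->next,
                          atomic_load_explicit(&pred->next, memory_order_relaxed),
                          memory_order_relaxed);
    atomic_store_explicit(&pred->next, node, memory_order_release);
}

/* Master-side unlink: mark the node dead first, then detach it; the
 * memory is reclaimed later by the epoch-based reclamation scheme. */
static void dentry_unlink(struct dentry *pred, struct dentry *node)
{
    atomic_store_explicit(&node->dirty, 1, memory_order_release);
    atomic_store_explicit(&pred->next,
                          atomic_load_explicit(&node->next, memory_order_relaxed),
                          memory_order_release);
}
```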
Examples.
To create a file, Ulib sends a creat request to the master; the address of the predecessor in the parent directory's dentry list is included in the message as well. Upon receiving the request, the master performs the following steps (as shown in Figure 3): (1) it reserves an empty inode number from the inode table and appends a log entry to guarantee crash consistency; this log entry records the inode number, file name, parent directory inode number, and other attributes; (2) it allocates an inode with each field filled and updates the inode table to point to this inode; and (3) it inserts a dentry into the dentry list at the given address, making the created file visible. To delete a file, the master appends a log entry first, deletes the dentry in the parent directory using the given addresses, and finally frees the related space (e.g., the inode, NVM file pages and block mapping). With such a strict execution order, the failure atomicity and consistency described in Section 3 are guaranteed.
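A hedged sketch of the master-side creat path implied by these steps is shown below; the helper functions are hypothetical stand-ins for KucoFS internals.

```c
#include <stdint.h>

/* Hypothetical stand-ins for KucoFS internals. */
struct kuco_inode;
struct dentry;
extern uint32_t           kuco_reserve_inode_no(void);
extern void               kuco_log_append(int op, uint32_t ino,
                                          uint32_t parent_ino, const char *name);
extern struct kuco_inode *kuco_alloc_inode(uint32_t ino, const char *name);
extern struct dentry     *kuco_alloc_dentry(const char *name, uint32_t ino);
extern int                dentry_pred_valid(struct dentry *pred, const char *name);
extern struct dentry     *dentry_find_pred(uint32_t parent_ino, const char *name);
extern void               dentry_insert(struct dentry *pred, struct dentry *node);
extern struct kuco_inode *inode_table[];
#define KUCO_OP_CREATE 0

/* Master-side creat: persist the log entry first, then update the DRAM
 * metadata, and insert the dentry last so the file appears atomically. */
uint32_t master_handle_creat(uint32_t parent_ino, const char *name,
                             struct dentry *pred)
{
    uint32_t ino = kuco_reserve_inode_no();                  /* step 1 */
    kuco_log_append(KUCO_OP_CREATE, ino, parent_ino, name);  /* persist intent */

    inode_table[ino] = kuco_alloc_inode(ino, name);          /* step 2 */

    if (!dentry_pred_valid(pred, name))      /* pre-located hint became stale */
        pred = dentry_find_pred(parent_ino, name);
    dentry_insert(pred, kuco_alloc_dentry(name, ino));       /* step 3 */
    return ino;
}
```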
Batching-based Metadata Logging.
The master ensures the crash consistency of metadata by appending log entries and flushing them out of the CPU cache. However, cache flushing leads to significant overhead since NVM has poor write bandwidth. Fortunately, the master serves many user-space applications, so it can flush log entries in batches. Following this idea, we let the master fetch multiple requests at a time from concurrent applications and process them as a batch: multiple log entries from different requests can be merged into one large log entry. After it is persisted, the master updates the in-memory metadata one by one in the order described above, and finally sends the acknowledgments back. Such a processing mode has the following advantage: the CPU flushes data at cacheline granularity (typically 64 B), which is larger than most log entries, so by merging and persisting them together, the number of flush operations is dramatically reduced. Note that this batching is different from that of Aerie and Strata: Aerie batches requests before sending them to the TFS, so as to reduce the cost of posting RPCs, and the KernFS in Strata digests batches of operations from the log, which coalesces adjacent writes into sequential writes. Instead, KucoFS batches log entries to amortize the data persistence overhead, leveraging the mismatch between the flush granularity and the log entry size.
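The following sketch illustrates the batching idea: entries from several requests are appended back-to-back and the touched cachelines are written back once, followed by a single fence. It assumes a toolchain and CPU with CLWB support (compile with -mclwb); CLFLUSHOPT or CLFLUSH would be substituted otherwise.

```c
#include <immintrin.h>   /* _mm_clwb, _mm_sfence */
#include <stdint.h>
#include <string.h>

/* Append a batch of small log entries contiguously at the log tail and
 * persist them with per-cacheline write-backs plus one fence, instead
 * of flushing each (sub-cacheline) entry individually. */
static void log_append_batch(uint8_t *log_tail,
                             const void **entries, const size_t *sizes, int n)
{
    uint8_t *p = log_tail;
    for (int i = 0; i < n; i++) {
        memcpy(p, entries[i], sizes[i]);
        p += sizes[i];
    }
    /* Write back every cacheline the batch touched. */
    for (uint8_t *cl = (uint8_t *)((uintptr_t)log_tail & ~63UL); cl < p; cl += 64)
        _mm_clwb(cl);
    _mm_sfence();        /* order the write-backs before the tail update */
    /* The persistent tail pointer would be advanced (and flushed) here. */
}
```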
Another key design principle lies in how to provide an efficient, consistent and safe write protocol. To achieve these goals, we propose pre-allocation and a direct-access range lock to simplify the interaction between Ulib and the master.
Figure 4: Layout of Direct Access Range-Lock.
Similar to NOVA [33] and PMFS [13], we use a copy-on-write (CoW) mechanism to update data pages: a write moves the unmodified part of the data from the old place, together with the application data, to new data pages. CoW causes extra copying overhead for small-sized updates. In most cases, however, it dismisses the double-write overhead of redo/undo logging and the log cleaning overhead of log-structured data management.
KucoFS first uses CoW to update the data pages, and then atomically appends a log entry to record the metadata modifications, during which the old data and metadata are never touched. If a system failure occurs before a write operation is finished, KucoFS simply rolls back to its original state. As such, failure atomicity and consistency are guaranteed. We rely on the master to enforce write protection over each file page by leveraging the permission bits in the page table: when a user-level program directly writes data, the master carefully manipulates the permission bits of the related data pages. An intuitive write protocol is:
1) Ulib sends the first request to the master to lock the file, reserve free data pages, and make them "writable";
2) Ulib relies on CoW to copy both the unmodified data from the old place and the new data from the user buffer to the reserved data pages, and flushes them out of the CPU cache;
3) Ulib sends a second request to the master to reset the newly written data pages to "read-only", append a new log entry (inode number, offset, size and related NVM addresses) to the operation log, update the metadata (i.e., inode, block mapping), and finally release the lock.
We can observe that a single write operation involves posting two requests to the master. This not only leads to high write latency, but also limits the efficiency of the master, since it is frequently involved. Thus, we propose pre-allocation and a direct access range-lock to avoid sending the first request to the master.
Pre-allocation.
Rather than posting a request to the master to reserve free pages for each write operation, we allow Ulib to lazily allocate data pages from the master (4 MB at a time in our implementation). These data pages are managed privately by Ulib with a free list. When an application exits, the unused data pages are returned to the master. After an abnormal exit, these data pages are temporarily non-reusable by other applications, but they can still be reclaimed after reboot by replaying the operation log.
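A sketch of the Ulib-side allocator under these assumptions follows; master_reserve_extent() is a hypothetical call standing in for the actual request to the master.

```c
#include <stddef.h>
#include <stdint.h>

#define KUCO_PAGE_SIZE   4096u
#define KUCO_PREALLOC_SZ (4u << 20)                 /* 4 MB per reservation */
#define KUCO_BATCH_PAGES (KUCO_PREALLOC_SZ / KUCO_PAGE_SIZE)

/* Hypothetical RPC: the master reserves a writable 4 MB extent of NVM
 * data pages for this process and returns its base address. */
extern void *master_reserve_extent(size_t bytes);

/* Ulib-private stack of free page addresses (DRAM, no locking needed). */
static void *free_pages[KUCO_BATCH_PAGES];
static unsigned n_free;

static void *ulib_alloc_page(void)
{
    if (n_free == 0) {                              /* lazily refill from master */
        uint8_t *base = master_reserve_extent(KUCO_PREALLOC_SZ);
        for (unsigned i = 0; i < KUCO_BATCH_PAGES; i++)
            free_pages[n_free++] = base + (size_t)i * KUCO_PAGE_SIZE;
    }
    return free_pages[--n_free];
}
```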
Direct Access Range-Lock.
To completely avoid sending the first request of the naive write protocol, we further propose the direct access range-lock. It coordinates concurrent writes directly in user level, since we can no longer rely on the master to acquire the lock.
As shown in Figure 4, we assign each opened file a range lock (i.e., a DRAM ring buffer), which is pointed to by the inode. Ulib writes a file by acquiring the range lock first, and the file write is delayed once a lock conflict occurs. Each slot in the ring buffer has five fields: state, offset, size, lease and a checksum. The checksum is the hash value of the first four fields. We also place a version at the head of each ring buffer to describe the ordering of write operations. To acquire the lock of a file, Ulib first increments the version with a fetch_and_add instruction. It then inserts a lock item into a specific slot of the ring buffer, whose location is determined by the fetched version (modulo the ring buffer size). After this, Ulib traverses the ring buffer backward to find the first conflicting lock item (i.e., one whose written data overlaps). If it exists, Ulib verifies its checksum, and then polls on its state until it is released. Ulib also checks its lease field repeatedly to avoid deadlock in case an application aborted before releasing the lock. Once the lock has been acquired, Ulib performs the second and third steps described in the naive protocol. In step 3, the version is encapsulated in the request, so the master can persist it in the log entry.
Worth noticing, the proposed range lock supports concurrent writes to the same file covering different data pages. Such fine-grained concurrency control is important in high-performance computing [8, 35] and in emerging rack-scale computers with hundreds to thousands of cores [17].
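The sketch below illustrates the acquire path of such a range lock. The slot layout mirrors Figure 4, but the field order, the lease length, and the keyed_hash() helper are assumptions, and the checksum here covers only the offset, size and lease fields.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

#define RING_SLOTS 8                     /* lock items per ring buffer */
#define LEASE_NS   (10ULL * 1000 * 1000) /* illustrative 10 ms lease */

struct lock_item {
    _Atomic uint32_t state;              /* 0 = free, 1 = held */
    uint64_t offset, size;               /* byte range being written */
    uint64_t lease;                      /* absolute expiry of the holder */
    uint64_t checksum;                   /* keyed hash of offset/size/lease */
};

struct range_lock {
    _Atomic uint64_t version;            /* orders all writes to this file */
    struct lock_item slot[RING_SLOTS];
};

/* Assumed helper: keyed hash so a corrupted slot can be detected. */
extern uint64_t keyed_hash(const struct lock_item *it);

static uint64_t now_ns(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (uint64_t)t.tv_sec * 1000000000ULL + (uint64_t)t.tv_nsec;
}

static int overlap(const struct lock_item *it, uint64_t off, uint64_t len)
{
    return it->offset < off + len && off < it->offset + it->size;
}

/* Acquire the range [off, off+len); returns the version that is later
 * embedded in the write request so the master can persist it. */
uint64_t range_lock_acquire(struct range_lock *rl, uint64_t off, uint64_t len)
{
    uint64_t ver = atomic_fetch_add(&rl->version, 1);
    struct lock_item *me = &rl->slot[ver % RING_SLOTS];

    me->offset = off;
    me->size = len;
    me->lease = now_ns() + LEASE_NS;
    me->checksum = keyed_hash(me);
    atomic_store_explicit(&me->state, 1, memory_order_release);

    /* Scan earlier slots and wait for any conflicting, still-held writer. */
    for (uint64_t d = 1; d < RING_SLOTS && d <= ver; d++) {
        struct lock_item *it = &rl->slot[(ver - d) % RING_SLOTS];
        if (it->checksum != keyed_hash(it) || !overlap(it, off, len))
            continue;                    /* corrupted/reused slot or no overlap */
        while (atomic_load_explicit(&it->state, memory_order_acquire) == 1)
            if (now_ns() > it->lease) {  /* holder aborted: treat as released */
                atomic_store(&it->state, 0);
                break;
            }
    }
    return ver;
}

/* In KucoFS the master clears the slot after applying the metadata
 * update; a user-level release would simply be: */
void range_lock_release(struct range_lock *rl, uint64_t ver)
{
    atomic_store(&rl->slot[ver % RING_SLOTS].state, 0);
}
```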
Write Protection.
KucoFS strictly controls the access rights to the file system image: both the in-memory metadata and the persistent operation log are critical to the file system, so the master is the only one allowed to update them. Ulib only has write access to its privately managed free data pages. However, these pages are immediately changed to "read-only" once they are allocated to serve write operations. Since both the metadata and the valid data pages are non-writable, KucoFS is immune to arbitrary memory writes. However, there are still two anomalies: 1) the private data pages can still be corrupted within a write operation by other concurrent threads; however, a kernel-based file system cannot cope with such a case either [13], and we believe it is unlikely to happen. 2) A buggy application can still corrupt the range lock or the message buffer, since they are directly writable in user space. We add checksum and lease fields to each slot, enabling user-level programs to identify whether an inserted element has been corrupted. Both the lock item and the request message contain only a few tens of bytes of data, so the hash calculation overhead is not high. Besides, the secret key for generating the checksum is owned by the master and granted only to trusted applications. Therefore, KucoFS is even immune to some malicious attacks (e.g., replay or DoS), though this is not the main target of this paper.
When the master updates the page table for a write operation, it needs to explicitly flush the related TLB entries to make the modifications visible. This implies that each write operation in KucoFS involves two TLB flushes. Luckily, we can allocate multiple data pages at a time in the pre-allocation phase, so the TLB entries can be flushed in batches, which reduces the flushing overhead dramatically.
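KucoFS performs these permission switches by editing page-table entries from the kernel; as a rough user-level analogue of the idea (not the actual KucoFS mechanism), one can toggle a mapped region between read-only and writable with mprotect, which likewise triggers TLB invalidations and therefore also benefits from batching several pages per call:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* User-level analogue of the master's permission toggling: make a batch
 * of pages writable, fill them, then seal them read-only again. KucoFS
 * performs the equivalent page-table edits in the kernel so that only
 * the master can grant write access. */
int main(void)
{
    size_t len = 16 * 4096;                        /* batch of 16 pages */
    char *pages = mmap(NULL, len, PROT_READ,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (pages == MAP_FAILED) return 1;

    mprotect(pages, len, PROT_READ | PROT_WRITE);  /* open the write window */
    memset(pages, 0xAB, len);                      /* CoW-style data copy */
    mprotect(pages, len, PROT_READ);               /* seal: writes now fault */

    printf("first byte after sealing: 0x%02x\n", (unsigned char)pages[0]);
    munmap(pages, len);
    return 0;
}
```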
KucoFS updates data pages with a CoW mechanism; hence, any data page is in either the old or the new version. This gives us the opportunity to design an efficient read protocol that reads directly in user level. Considering that the master may be updating the metadata for other concurrent writers, the main challenge is how to read a consistent snapshot of the block mappings efficiently despite those writers.
Figure 5: Lock-Free Fast Read with Version Checking.
Hence, we propose lock-free fast read, which guarantees that readers never read data from unfinished writes. It achieves this by embedding a version field in each pointer of the block mapping: as shown in Figure 5, each 96-bit block mapping item contains four fields, which are start, version, end and pointer. Take a write operation that updates three data pages for example: when the master updates the block mapping, the headers of the three mapping items are constructed as [start=1, V] [V] [V, end=1]. Note that all three items share the same version (i.e., V), which is provided by Ulib when it acquires the range lock (Section 4.3); the start bit of the first item and the end bit of the last item are set to 1. We only reserve 40 bits for the pointer field since it always points to a 4 KB-aligned page (the lower 12 bits can be discarded). It is easy to see that when there are no concurrent writers, the block mapping items satisfy one of the following conditions (Figure 5):
(a) Writes without overlapping. The items with the same version are enclosed by a start bit and an end bit, indicating that multiple threads have updated the same file but different data pages.
(b) Overlapping in the tail. The reader sees a start bit where the version increases, indicating that a thread has overwritten the tail part of the pages updated by a former thread.
(c) Overlapping in the head. The reader sees an end bit before the version decreases, indicating that a thread has overwritten the front part of the pages updated by a former thread.
If Ulib encounters any case that violates the above conditions, we assert that the master is updating the block mappings for other concurrent write threads. In this case, Ulib needs to reload the metadata and check its validity again. To reduce the overhead of retrying, the read thread copies the file data to the user's buffer only after it has successfully collected a consistent version of the block mapping. This is safe because the obsolete NVM pages are reclaimed lazily. When the modified mapping items span multiple cachelines, the master also adds extra mfence instructions to serialize the updates, so the read threads see the updates in order.
Read Protection.
Leveraging the permission bits to enforce read protection is more challenging, since metadata have semantically richer permissions [30]. Hence, instead of maintaining a fully compatible hierarchical/group access control as in kernel-based file systems, we partition the directory tree into per-user sub-trees, and each user has a private root node. When a program accesses KucoFS, only the sub-tree (i.e., the inode table, inodes, dentry lists, etc.) and the related data pages of the current user are mapped into its address space, while the other space is invisible to it. To alleviate the bookkeeping overhead of page mapping, the master assigns each user 4 MB of contiguous DRAM/NVM blocks, which form the per-user file system image (i.e., DRAM metadata, operation log, data pages, etc.). Similar to Arrakis [23], KucoFS does not provide an explicit way to share data between different users, yet there are several practical approaches: 1) create a standalone partition that every user has read/write access to; 2) issue user-level RPCs to a specific user to acquire the data. We believe this tradeoff is unlikely to be an obstacle in real-world scenarios, since KucoFS naturally supports efficient sharing between applications within the same user, which is the more common case.
We introduce a checkpoint mechanism to prevent the operation log from growing arbitrarily: when the master is not busy, or when the operation log grows beyond a maximum size, it applies the metadata modifications to the NVM metadata pages by replaying log entries in the operation log. The bitmap used to manage the NVM free space is updated and persisted as well. After that, the operation log is truncated. Each time KucoFS is restarted, the master first replays the un-checkpointed log entries in the operation log, so as to make the NVM metadata pages up to date. It then copies the NVM metadata pages to DRAM. The free list of NVM data pages is also reconstructed according to the bitmap stored in NVM. Keeping redundant copies of metadata in DRAM and NVM introduces higher DRAM/NVM space consumption, but we believe it is worth the cost: by selectively placing the (un)structured metadata in DRAM and NVM, we can perform fast indexing directly in DRAM, append log entries with reduced persistence overhead (batching), and lazily checkpoint in the background without affecting performance. As future work, we plan to reduce the DRAM footprint by only keeping the metadata of active files in DRAM.
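The restart path implied above can be summarized by the following sketch, where the helper functions are hypothetical placeholders for the corresponding KucoFS routines.

```c
/* Hypothetical placeholders for the corresponding KucoFS routines. */
struct kuco_log_iter;                        /* walks the un-checkpointed tail */
extern int  log_next(struct kuco_log_iter *it, void *entry_out);
extern void apply_to_nvm_metadata(const void *entry);
extern void copy_nvm_metadata_to_dram(void);
extern void rebuild_free_list_from_bitmap(void);

/* Restart path: bring the NVM metadata pages up to date, then rebuild
 * the DRAM copies and the data-page allocator state. */
void kuco_recover(struct kuco_log_iter *it)
{
    unsigned char entry[512];
    while (log_next(it, entry))              /* replay un-checkpointed entries */
        apply_to_nvm_metadata(entry);

    copy_nvm_metadata_to_dram();             /* inode table, dentry lists */
    rebuild_free_list_from_bitmap();         /* free NVM data pages */
}
```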
We finally summarize the design of KucoFS by walking through an example of writing 4 KB of data to a new file and then reading it back. First of all, the program links with Ulib to map the related NVM/DRAM space of the current user into its address space.
Open.
Before sending the open system call, Ulib pre-locates the related metadata. Since this is a new file, Ulib cannot find its inode; instead, it finds the predecessor in its parent directory's dentry list for the later creation. The address, as well as other information (e.g., file name, O_CREAT flags, etc.), is encapsulated in the open request. When the master receives the request, it creates the file based on the given address. It also allocates a range-lock ring buffer for this file, since it is opened for the first time. Then the master sends a response message. After this, Ulib creates a file descriptor for the opened file and returns it to the application.
Write.
The application then uses the write call via Ulib to write 4 KB of data to the newly created file. First, Ulib finds the inode of this file and locks it with the direct access range lock. Ulib blocks the program when there is a write conflict and waits until the corresponding lock has been released; after that, Ulib acquires the lock successfully. It then allocates a 4 KB page from its privately managed free list, copies the data into it, and flushes it out of the CPU cache. Ulib needs to post an extra request to the master to allocate more free data pages once its own space is used up. Finally, Ulib sends the write request to the master to finish the remaining steps: changing the permission bits of the written data pages to "read-only", atomically appending a log entry describing this write operation, updating the in-memory metadata, and finally unlocking the file.
Read.
KucoFS enables reading file data without interacting with the master. To read the first 4 KB of this file, Ulib directly locates the inode in user space and reads the first block mapping item (i.e., the pointer). Version checking is performed to ensure that its state satisfies one of the three conditions described in Section 4.4. After this, Ulib can safely read the file data page pointed to by the pointer.
Close.
Ulib also needs to send a close system call to the master upon closing the file. The master then reclaims the space of the range-lock ring buffer if no other process is accessing this file.
KucoFS is implemented in two parts: a loadable kernel module (i.e., the master) and a shared library (i.e., Ulib). Each Ulib instance communicates with the master via an exclusively owned message buffer.
KucoFS’s APIs.
KucoFS provides a POSIX-like interface, so existing applications can access it without any modification to their source code. It achieves this by setting the LD_PRELOAD environment variable. Ulib intercepts all APIs in the standard C library that are related to file system operations. Ulib processes a call directly if the prefix of the accessed file matches a predefined string (e.g., "/kuco"); otherwise, the call is processed in legacy mode. Note that write operations only pass file descriptors to locate the file data; therefore,
Ulib distinguishes the write operations targeting KucoFS from those targeting legacy file systems by using only big file descriptor numbers (greater than 2 in our implementation).
Figure 6: Read and write throughput with FxMark. ("Low": different threads read (write) data from (to) different files; "Medium": the same file but different data blocks; the default I/O size is 4 KB.)
Ulib copies the file data to its privately managed datapages, and then sends a request to the master to map thesepages into contiguous address space. When the applicationissues a msync system call,
Ulib then handles it as a writeoperation, so as to atomically makes the updates in these datapages visible to other applications.
In this section, we evaluate the overall performance of KucoFS with micro- and macro-benchmarks as well as real-world applications. We also study the effects of its internal mechanisms.
Testbed.
Our experimental testbed is equipped with two Intel Xeon Gold 6240M CPUs (36 physical cores and 72 logical threads), 192 GB of DDR4 DRAM, and six Optane DC persistent memory DIMMs (256 GB per module, 1.5 TB in total). Our evaluation on Optane DC shows that its read bandwidth peaks at 37 GB/s and its write bandwidth at 13.2 GB/s. The server runs Ubuntu 19.04 and Linux kernel 5.1, the kernel version supported by NOVA.
Compared Systems.
We evaluate KucoFS against NVM-aware file systems including PMFS [13], NOVA [33], and Strata [18], as well as traditional file systems with DAX support, including Ext4-DAX [2] and XFS-DAX [29]. (We use the publicly available versions: https://github.com/NVSL/PMFS-new, https://github.com/NVSL/linux-nova, and https://github.com/ut-osa/strata.) Strata only supports a few applications and has trouble running multi-threaded workloads [36], so we only give its single-threaded performance results in the Filebench and Redis evaluations.
Aerie is based on Linux 3.2.2, which does not have the drivers required to support Optane DC. Hence, we compare with Aerie [30] by emulating persistent memory with DRAM (due to limited space, we only describe these experimental results in words, without including them in the figures).
We use FxMark [21] to evaluate the basic file system operations (in terms of both throughput and multi-core scalability). FxMark provides 19 micro-benchmarks, categorized by four criteria: data types (i.e., data or metadata), modes (i.e., read or write), operations (i.e., read, overwrite, append, create, etc.), and sharing levels (i.e., low, medium or high). We only include some of them in the paper due to limited space.
File Read.
Figure 6 (a)-(b) show the file read performance of each file system with a varying number of client threads and different sharing levels (Low/Medium). We observe that KucoFS exhibits significantly higher throughput than the other file systems, and its throughput scales linearly as the number of clients increases. Specifically, with 36 client threads and the Low sharing level, KucoFS outperforms NOVA and PMFS by 6x on average, and has two orders of magnitude higher performance than XFS-DAX and Ext4-DAX. Such performance advantage stems primarily from the design of lock-free fast read, which enables user-space direct access without the involvement of the master. The kernel file systems (e.g., XFS, Ext4, NOVA and PMFS) have to perform context switches and walk through the VFS layer, which hurts read performance. Besides, all of the compared systems need to lock the file before actually reading the file data; such locking overhead impacts their performance severely even though contention is low [20]. We further observe that the throughput of the compared systems remains steady and low under the Medium sharing level, since all the threads acquire the lock of the same file. Instead, the performance of KucoFS is unchanged as the sharing level varies, because it does not rely on a per-file lock to coordinate concurrent readers. Note that the measured read performance via FxMark
is larger than the raw bandwidth of Optane DC (37 GB/s), because FxMark lets each thread read one file page repeatedly, so the accessed data is served from the CPU cache. With our emulated persistent memory, Aerie shows almost the same performance as KucoFS at the Low sharing level, but its throughput falls far behind the others at the Medium sharing level. This is because Aerie needs to contact the TFS frequently to acquire the lock, causing extra context-switch overhead.
Figure 7: readdir performance with FxMark. ("Medium": in the same folder.)
File Write.
The throughput of both append and overwrite operations is given in Figure 6 (c)-(e). For overwrite operations with the "Low" sharing level, all systems exhibit a performance curve that increases first and then decreases. In the increasing part, KucoFS shows the highest throughput among the compared systems because it can directly write data in user space. XFS and NOVA also show good scalability: NOVA partitions the free space to avoid global locking overhead when allocating new data pages, while XFS writes data in place without allocating new pages. Both PMFS and Ext4 fail to scale since they adopt transactions to write data, introducing extra locking overhead. In the decreasing part, their throughput is restricted by the Optane bandwidth because of its poor scalability [15]. For overwrite operations with the "Medium" sharing level, the throughput of KucoFS is one order of magnitude higher than the other three file systems when the number of threads is small. Such performance benefits mainly come from the range-lock design in KucoFS, which enables parallel updates to different data blocks of the same file. The performance of KucoFS drops again when the number of clients exceeds 8, which is mainly restricted by the ring buffer size of the range lock (we reserve 8 lock items in each ring buffer). For append operations, XFS-DAX, Ext4-DAX and PMFS exhibit un-scalable performance as the number of client threads increases, because all of them use a global lock to manage the metadata journal and free data pages, so lock contention contributes the major overhead. Both NOVA and KucoFS show better scalability, and KucoFS outperforms NOVA by 1.1x to 3x as the number of threads varies. On our emulated persistent memory, Aerie shows the worst performance because the trusted service is the bottleneck: the clients need to frequently interact with it to acquire locks and allocate new data pages.
We conclude that by fully exploiting the benefits of direct access, KucoFS always shows the highest performance among the evaluated file systems.
Figure 8: Filebench Throughput with Different File Systems.
Metadata Read.
Figure 7(a) shows the performance of readdir operations with the Medium sharing level (i.e., all threads read the same directory); Aerie does not support this operation. We observe that only KucoFS exhibits scalable performance, and PMFS cannot even complete the workload as the number of clients increases. The kernel file systems lock the parent directory's inode in VFS before reading the dentry list and the file inodes; as a result, the execution of different client threads is serialized when they access the same directory. In contrast, the skip-list used in KucoFS supports lock-free reads and atomic updates, enabling multiple readers to concurrently read the same directory.
File Creation.
To evaluate the performance of creat with the Medium sharing level, FxMark lets each client thread create 10 K files in a shared directory. As shown in Figure 7(b), KucoFS achieves one order of magnitude higher throughput than the compared file systems, and it exhibits scalable performance as the number of threads increases. XFS-DAX, Ext4-DAX and PMFS use a global lock to perform metadata journaling and manage the free space, which leads to their un-scalable performance. Besides, the VFS layer needs to lock the inode of the parent directory before creating the files, so NOVA also fails to scale even though it avoids a global lock. We explain the high performance of KucoFS from the following aspects: (1) in KucoFS, all metadata updates are delegated to the master, so it can update them without any locking overhead; (2) by offloading all the indexing overhead to user space, the master only needs to do very lightweight operations; (3) KucoFS can persist metadata in batches, while the other kernel file systems do not have such an opportunity. Aerie synchronizes the metadata of the created files to its trusted service in batches, so it achieves performance comparable to KucoFS, but it fails to work properly with more threads.
We then use Filebench [1] as a macro-benchmark to evaluate the performance of KucoFS. We select two workloads, Fileserver and Varmail, with the same settings as in the NOVA paper: files are created with an average size of 128 KB and
32 KB for Fileserver and Varmail, respectively. The I/O sizes of both read and write operations are set to 16 KB in Fileserver. Varmail has a read I/O size of 1 MB and a write I/O size of 16 KB. Fileserver and Varmail have write-to-read ratios of 2:1 and 1:1, respectively. The total number of files in each workload is set to 100 K. Fileserver emulates the I/O activity of a simple file server [3] by randomly performing creates, deletes, appends, reads and writes. Varmail emulates an email server and uses a write-ahead log for crash consistency; it contains a large number of small files involving both read and write operations. We only give a single-threaded evaluation of Strata. Figure 8 shows the results, and we make the following observations:
(1) KucoFS shows the highest performance on all the evaluated workloads. In the single-threaded evaluation, its throughput is 2.5x, 2x, 1.34x, 1.29x and 1.26x higher than XFS, Ext4, PMFS, NOVA, and Strata respectively for the Fileserver workload, and 2.7x, 6x, 2x, 1.67x and 1.3x higher for the Varmail workload. Such performance advantage mainly comes from the direct access feature of KucoFS: it executes file I/O operations directly in user level, thus dismissing the OS-part overhead (i.e., context saving and reloading, and executing in the VFS layer). Strata also benefits from direct access; however, it needs to acquire a lease from the third-party service each time it accesses a new file, which limits its efficiency. We also observe that the design of KucoFS is a good fit for the Varmail workload. This is expected: Varmail frequently creates/deletes files, so it generates more metadata operations and issues system calls more frequently. As described before, KucoFS eliminates the OS-part overhead and is better at handling metadata operations. Besides, Strata shows much higher throughput than NOVA since the file I/Os in Varmail are small-sized; Strata only needs to append these small updates to its operation log, reducing write amplification dramatically.
(2) KucoFS is better at handling concurrent workloads. With 20 concurrent client threads and the Fileserver workload, KucoFS outperforms XFS-DAX and Ext4-DAX by 3.5x on average, PMFS by 2.3x, and NOVA by 1.4x. Such performance advantage is more obvious for the Varmail workload: it achieves 15% higher performance than XFS-DAX and Ext4-DAX on average. Two reasons contribute to this good performance: 1) KucoFS incorporates techniques like index offloading to enable the master to provide scalable metadata access performance; 2) KucoFS avoids using a global lock by letting each client manage private free data pages. NOVA also exhibits good scalability since it uses a per-file log structure and partitioned free space management.
Figure 9: Redis SET throughput with different value sizes.
Many modern cloud applications use key-value stores like Redis for storing data. Redis exports an API that allows applications to process and query structured data, but uses the file system for persistent data storage. Redis has two approaches to persist its data: one is to log operations to an append-only file (AOF), and the other is to use an asynchronous snapshot mechanism. We only evaluate Redis in AOF mode in this paper. Similar to Strata [18], we configure Redis to use AOF mode and to persist data synchronously.
Figure 9 shows the throughput of SET operations using 12-byte keys and various value sizes. For small values, the throughput of Redis on KucoFS is 53% higher on average than on PMFS, NOVA and Strata, and 76% higher than on XFS-DAX and Ext4-DAX. This is consistent with the evaluation results of Append operations, where KucoFS outperforms the other systems by at least 2x with a single thread. With larger object sizes, KucoFS achieves only slightly higher throughput than the other file systems, since the Optane bandwidth becomes the major limiting factor.
Figure 10: Benefits of Each Optimization in KucoFS.
In this section, we analyze the performance improvements brought by each optimization in KucoFS. First, we measure the individual benefits of index offloading and batching-based logging. To do so, we disable batching by letting the master persist log entries one by one, and we move the metadata indexing operations back to the master to see the effect of index offloading. Figure 10(a) shows the results, measuring the throughput of creat with a varying number of clients. We make the following observations:
(1) In the single-thread evaluation, index offloading does not improve performance: since moving the metadata indexing from Ulib back to the master does not reduce the total execution latency of each operation, the single-thread throughput is unchanged. We also find that batching does not degrade the single-thread performance, which is in contrast to the broad belief that batching causes higher latency. In our implementation, the master simply scans the message buffer to fetch the pending requests, and the overhead of scanning is insignificant.
(2) When the number of client threads increases, we find that index offloading improves throughput by 55% at most for the creat operation. Since KucoFS only allows the master to update metadata on behalf of multiple
Ulib instances, the theoretical throughput limit is T_max = 1 / L_req, where L_req is the latency for the master to process one request. Therefore, the proposed offloading mechanism improves performance by shortening the execution time of each request (i.e., L_req). Similarly, batching is introduced to speed up the processing efficiency of the master by reducing the data persistence overhead. From the figure, we find that it improves throughput by 33% at most for the creat operation.
Second, we demonstrate the efficiency of lock-free fast read by concurrently reading and writing data in the same file. In our evaluation, one read thread sequentially reads a file with an I/O size of 16 KB, and an increasing number of threads are launched to overwrite the same file concurrently (4 KB writes to random offsets). We let the read thread issue read operations 1 million times and measure its execution time while varying the number of write threads. For comparison, we also implement KucoFS r/w lock, which reads file data by acquiring the read-write lock in the range-lock ring buffer, and KucoFS w/o lock, which reads file data directly without regard to correctness. We make the following observations from Figure 10(b): (1) the proposed lock-free fast read achieves almost the same performance as KucoFS w/o lock, which shows that the overhead of version checking is extremely low; (2) KucoFS r/w lock needs much more time to finish reading (7% to 3.2x more time than the lock-free design for different I/O sizes), because atomic operations are needed to acquire the range lock, which severely impacts read performance when there are more conflicts; (3) the execution time of NOVA is orders of magnitude higher than that of KucoFS. NOVA directly uses a mutex to synchronize concurrent readers and writers; as a result, the reader is delayed dramatically by the writers.
Kernel/Userspace Collaboration.
The emergence of high-throughput and low-latency hardware (e.g., InfiniBand networks, NVMe SSDs and NVMs) prompts the idea of moving I/O operations from the kernel to user level. Belay et al. [6] abstract the Dune process leveraging the virtualization hardware in modern processors; it enables direct access to privileged CPU instructions in user space and executes syscalls with reduced overhead. Based on Dune, IX [7] steps further to improve the performance of data-center applications by separating the management and scheduling functions of the kernel (control plane) from network processing (data plane). Arrakis [23] is a new network server operating system. It splits the traditional role of the kernel in two: applications have direct access to virtualized I/O devices, while the kernel only enforces coarse-grained protection and does not need to be involved in every operation.
Persistent Memory File System.
Existing research on NVM-based file systems can be classified into three categories: (1)
Kernel-Level.
BPFS [12] adopts short-circuit shadow paging to guarantee metadata and data consistency, and introduces epoch hardware modifications to efficiently enforce orderings. SCMFS [32] simplifies file management by mapping files to contiguous virtual address regions with the virtual memory management (VMM) of the existing OS, but it fails to support consistency for both data and metadata. Both PMFS [13] and NOVA [33] use separate mechanisms to guarantee the consistency of metadata and data: PMFS uses journaling for metadata updates and performs writes with a copy-on-write mechanism; NOVA is a log-structured file system deployed on a hybrid DRAM-NVM architecture, which manages metadata with per-inode logs to improve scalability and moves file data out of the log (file data is managed with CoW) to achieve efficient garbage collection. NOVA-Fortis [34] steps further to be fault-tolerant by providing a snapshot mechanism. While these kernel file systems provide POSIX I/O and propose different approaches to enforce (meta)data consistency, their performance is still restricted by the existing OS abstractions (e.g., syscall and VFS). (2)
User-Level.
Both Aerie [30] and Strata [18] propose to avoid the OS-part overhead by implementing the file system in user space. With this design, user-level applications have direct access to the file system image. Both of them adopt a third-party trusted service to coordinate concurrent operations and process other essential work (e.g., metadata management in Aerie and data digestion in Strata). However, by exporting the file system image to user-level applications, they are vulnerable to arbitrary writes from buggy applications. (3)
Device-Level.
DevFS [16] proposes to push the file system implementation into the storage device that has compute capability and device-level RAM, which requires the support of dedicated hardware.
In this paper, we revisit the file system architecture for non-volatile memories by proposing a kernel and user-level collaborative file system named KucoFS. It fully exploits the respective advantages of direct access in user level and data protection in kernel space. We further improve its scalability to multicores by rebalancing the load between kernel and user space and by carefully coordinating read and write conflicts. Experiments show that KucoFS provides both efficient and scalable non-volatile memory management.

References

[1] Filebench file system benchmark, 2004.
[2] Support ext4 on NV-DIMMs. https://lwn.net/Articles/588218, 2014.
[3] SPEC SFS 2014, 2017.
[4] Intel Optane DC persistent memory, 2019.
[5] IG Baek, MS Lee, S Seo, MJ Lee, DH Seo, D-S Suh, JC Park, SO Park, HS Kim, IK Yoo, et al. Highly scalable nonvolatile resistive memory using simple binary oxide driven by asymmetric unipolar voltage pulses. In Electron Devices Meeting, 2004. IEDM Technical Digest. IEEE International, pages 587–590. IEEE, 2004.
[6] Adam Belay, Andrea Bittau, Ali Mashtizadeh, David Terei, David Mazières, and Christos Kozyrakis. Dune: Safe user-level access to privileged CPU features. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI '12, pages 335–348, Berkeley, CA, USA, 2012. USENIX Association.
[7] Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. IX: A protected dataplane operating system for high throughput and low latency. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI '14, pages 49–65, Berkeley, CA, USA, 2014. USENIX Association.
[8] Spyros Blanas, Kesheng Wu, Surendra Byna, Bin Dong, and Arie Shoshani. Parallel data analysis directly on scientific file formats. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pages 385–396. ACM, 2014.
[9] Remy Card, Theodore Ts'o, and Stephen Tweedie. Design and implementation of the second extended filesystem. In Proceedings of the 1st Dutch International Symposium on Linux, pages 1–6, 1994.
[10] Youmin Chen, Jiwu Shu, Jiaxin Ou, and Youyou Lu. HiNFS: A persistent memory file system with both buffering and direct-access. ACM Trans. Storage, 14(1):4:1–4:30, April 2018.
[11] Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. NV-Heaps: Making persistent objects fast and safe with next-generation, non-volatile memories. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 105–118, New York, NY, USA, 2011. ACM.
[12] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee. Better I/O through byte-addressable, persistent memory. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pages 133–146, New York, NY, USA, 2009. ACM.
[13] Subramanya R. Dulloor, Sanjay Kumar, Anil Keshavamurthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson. System software for persistent memory. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys '14, pages 15:1–15:15, New York, NY, USA, 2014. ACM.
[14] Keir Fraser. Practical lock-freedom. Technical report, University of Cambridge, Computer Laboratory, 2004.
[15] Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu, Subramanya R Dulloor, et al. Basic performance measurements of the Intel Optane DC persistent memory module. arXiv preprint arXiv:1903.05714, 2019.
[16] Sudarsun Kannan, Andrea C Arpaci-Dusseau, Remzi H Arpaci-Dusseau, Yuangang Wang, Jun Xu, and Gopinath Palani. Designing a true direct-access file system with DevFS. In , page 241, 2018.
[17] Kimberly Keeton. The Machine: An architecture for memory-centric computing. In Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2015.
[18] Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, and Thomas Anderson. Strata: A cross media file system. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP '17, pages 460–477, New York, NY, USA, 2017. ACM.
[19] Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. Architecting phase change memory as a scalable DRAM alternative. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA), pages 2–13, New York, NY, USA, 2009. ACM.
[20] Bojie Li, Tianyi Cui, Zibo Wang, Wei Bai, and Lintao Zhang. SocksDirect: Datacenter sockets can be fast and compatible. In Proceedings of the ACM Special Interest Group on Data Communication, SIGCOMM '19, pages 90–103, New York, NY, USA, 2019. ACM.
[21] Changwoo Min, Sanidhya Kashyap, Steffen Maass, Woonhak Kang, and Taesoo Kim. Understanding manycore scalability of file systems. In Proceedings of the 2016 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC '16, pages 71–85, Berkeley, CA, USA, 2016. USENIX Association.
[22] Jiaxin Ou, Jiwu Shu, and Youyou Lu. A high performance file system for non-volatile main memory. In Proceedings of the Eleventh European Conference on Computer Systems, EuroSys '16, pages 12:1–12:16, New York, NY, USA, 2016. ACM.
[23] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The operating system is the control plane. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI '14, pages 1–16, Berkeley, CA, USA, 2014. USENIX Association.
[24] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. All file systems are not created equal: On the complexity of crafting crash-consistent applications. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI '14, pages 433–448, Berkeley, CA, USA, 2014. USENIX Association.
[25] William Pugh. Skip lists: A probabilistic alternative to balanced trees. Commun. ACM, 33(6):668–676, June 1990.
[26] Moinuddin K. Qureshi, Vijayalakshmi Srinivasan, and Jude A. Rivers. Scalable high performance main memory system using phase-change memory technology. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA), pages 24–33, New York, NY, USA, 2009. ACM.
[27] Sepideh Roghanchi, Jakob Eriksson, and Nilanjana Basu. ffwd: Delegation is (much) faster than you think. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP '17, pages 342–358, New York, NY, USA, 2017. ACM.
[28] Livio Soares and Michael Stumm. FlexSC: Flexible system call scheduling with exception-less system calls. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI '10, pages 33–46, Berkeley, CA, USA, 2010. USENIX Association.
[29] Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Mike Nishimoto, and Geoff Peck. Scalability in the XFS file system. In USENIX Annual Technical Conference, volume 15, 1996.
[30] Haris Volos, Sanketh Nalli, Sankarlingam Panneerselvam, Venkatanathan Varadarajan, Prashant Saxena, and Michael M. Swift. Aerie: Flexible file-system interfaces to storage-class memory. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys '14, pages 14:1–14:14, New York, NY, USA, 2014. ACM.
[31] Ying Wang, Dejun Jiang, and Jin Xiong. Caching or not: Rethinking virtual file system for non-volatile main memory. In . USENIX Association, 2018.
[32] Xiaojian Wu and A. L. Narasimha Reddy. SCMFS: A file system for storage class memory. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 39:1–39:11, New York, NY, USA, 2011. ACM.
[33] Jian Xu and Steven Swanson. NOVA: A log-structured file system for hybrid volatile/non-volatile main memories. In Proceedings of the 14th USENIX Conference on File and Storage Technologies, FAST '16, pages 323–338, Berkeley, CA, USA, 2016. USENIX Association.
[34] Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Steven Swanson, and Andy Rudoff. NOVA-Fortis: A fault-tolerant non-volatile main memory file system. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP '17, pages 478–496, New York, NY, USA, 2017. ACM.
[35] Esma Yildirim, Engin Arslan, Jangyoung Kim, and Tevfik Kosar. Application-level optimization of big data transfers through pipelining, parallelism and concurrency. IEEE Transactions on Cloud Computing, 4(1):63–75, 2016.
[36] Shengan Zheng, Morteza Hoseinzadeh, and Steven Swanson. Ziggurat: A tiered file system for non-volatile main memories and disks. In , pages 207–219, 2019.
[37] Deng Zhou, Wen Pan, Tao Xie, and Wei Wang. A file system bypassing volatile main memory: Towards a single-level persistent store. In Proceedings of the 15th ACM International Conference on Computing Frontiers, CF '18, pages 97–104, New York, NY, USA, 2018. ACM.
[38] Ping Zhou, Bo Zhao, Jun Yang, and Youtao Zhang. A durable and energy efficient main memory using phase change memory technology. In