Kernel/User-level Collaborative Persistent Memory File System with Efficiency and Protection
Youmin Chen Youyou Lu Bohong Zhu Jiwu Shu Tsinghua University
Abstract
Emerging high-performance non-volatile memories recall the importance of efficient file system design. To avoid the virtual file system (VFS) and syscall overhead of kernel-based file systems, recent works deploy file systems directly in user level. Unfortunately, a user-level file system can easily be corrupted by a buggy program with misused pointers, and is hard to scale on multi-core platforms when it incorporates a centralized coordination service.
In this paper, we propose KucoFS, a Kernel and user-level collaborative file system. It consists of two parts: a user-level library with direct-access interfaces, and a kernel thread, which performs metadata updates and enforces write protection by toggling the permission bits in the page table. Hence, KucoFS achieves both the direct access of user-level designs and the fine-grained write protection of kernel-level ones. We further explore its scalability to multicores. For metadata scalability, KucoFS rebalances the pathname resolution overhead between the kernel and userspace by adopting the index offloading technique. For data access efficiency, it coordinates data allocation between kernel and userspace, and uses range-lock write and lock-free read to improve concurrency. Experiments on Optane DC persistent memory show that KucoFS significantly outperforms existing file systems and shows better scalability.
Emerging byte-addressable non-volatile memories (NVMs), such as PCM [19, 26, 38], ReRAM [5], and the recently released Intel Optane DC persistent memory [4], provide performance comparable to DRAM and data persistence similar to disks. Such high-performance hardware recalls the importance of redesigning efficient file systems. Efficiency here refers not only to the lightweight software overhead of the file system itself, but also to its scalability to multicores, which is needed to exploit the hardware performance of non-volatile memories.
File systems have long been part of the operating system and are placed at the kernel level to provide data protection from arbitrary user writes. System calls (syscalls) are used for communication between the kernel and userspace. In the kernel, the virtual file system (VFS) is an abstraction layer that hides concrete file system designs to provide uniform access. However, both syscall and VFS incur non-negligible overhead in file systems for NVMs. Our evaluation on NOVA [33] shows that even this highly scalable and efficient NVM-aware file system still suffers great overhead in the VFS layer and fails to scale on some file operations (e.g., creat/unlink). For syscall, the context-switch overhead occupies up to 34% of the file system access time, even without counting the effects of TLB and CPU cache misses on the subsequent execution.
Recent works like Strata [18] and Aerie [30] propose to design NVM file systems at user level. By bypassing the operating system, they exploit the benefits of direct access. However, since the NVM space is exported into the applications' address space, a programmer can easily corrupt the file system image by misusing pointers that accidentally point to the NVM space. Moreover, these file systems adopt a trusted but centralized component to coordinate critical updates, which inevitably restricts their scalability to multi-cores.
It is difficult to achieve both performance efficiency and write protection simultaneously as long as the VFS and the kernel/user-space architecture remain unchanged. In this paper, we revisit the file system architecture and propose a Kernel and user-level collaborative File System named KucoFS. Unlike existing user-level file systems, KucoFS enables user-level direct access while ensuring the write protection that a kernel file system provides. KucoFS decouples the file system into a kernel thread (a.k.a. master) and a user-space library (a.k.a. Ulib). Programs can directly read and write file data in user level by linking with Ulib, while the master is dedicated to updating metadata on behalf of the applications, as well as guaranteeing the integrity of file data. KucoFS prevents a buggy program from corrupting the file system by exporting the NVM space to user level in read-only mode. In this way, read operations can still be conducted in user space. To serve write operations without compromising the direct-access feature, the master carefully manipulates the page table to make the related data pages writable beforehand, and read-only again once the operation completes, retaining the protection feature.
We further explore the multicore scalability from the following aspects: (1) Metadata Scalability.
Like existing user-level file systems, KucoFS introduces a centralized master, despite the different insight behind such an architecture. As a result, the master in KucoFS can still become the bottleneck when the number of served programs increases. We introduce index offloading to migrate the pathname resolution overhead to userspace, and use batching-based logging to amortize the metadata persistence overhead. (2) Write Protocol. To write file data, Ulib needs to interact with the master both before and after the operation to enforce write protection. This not only further increases the pressure on the master, but also leads to increased latency. We propose an efficient write protocol that reduces the number of interactions between Ulib and the master when writing a file. It achieves this by lazily reserving free data pages from the master, and by coordinating concurrent write operations directly in user space with a range lock. (3) Read-Write Conflicts. Ulib is likely to read inconsistent data when it directly accesses the file system without any coordination, while master-involved reading reduces the benefits of direct access. We propose lock-free fast read to deal with read-write conflicts. By carefully checking the status of metadata, Ulib is able to consistently read file data without any interaction with the master, despite concurrent writers.
Implementing an NVM file system in the Linux kernel faces two types of unavoidable costs: the syscall overhead and the heavyweight software stack in VFS. We investigate both by analyzing NOVA [33], a well-known, highly scalable and efficient NVM-based file system. Our experimental platform is described in Section 6.1.
Syscall Overhead.
We analyze the syscall overhead by collecting the context-switch latency of common file system operations (each operation is repeated over 1 million files or directories with a single thread). The results are shown in Figure 1(a). We observe that the context-switch latency takes up to 21% of the total execution time, and this ratio is especially large for read-oriented operations (e.g., stat/open). Note that the context-switch latency we captured only includes the direct part; indirect costs (e.g., cache pollution) can further affect the efficiency of a program [28].
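For reference, a minimal sketch of how such a per-operation latency can be measured from user space is shown below; the file path and iteration count are placeholders, and the breakdown in Figure 1 comes from the authors' more detailed instrumentation.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/* Time N stat() calls on one file and report the mean latency. This
 * captures the full user-visible cost (trap + VFS + file system), not
 * the isolated context-switch share reported in Figure 1. */
int main(void)
{
    const char *path = "/mnt/pmem/testfile";   /* placeholder: must exist */
    const long iters = 1000000;
    struct stat st;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        stat(path, &st);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg stat latency: %.0f ns\n", ns / iters);
    return 0;
}
```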
Inefficiency of VFS.
In the existing Linux kernel, VFS improves the performance of storage devices (e.g., HDD/SSD) by maintaining a page cache in DRAM, but such a caching mechanism is not always effective for NVMs, since their access latency is very close to DRAM's. Therefore, a number of NVM-aware file systems choose to bypass it directly [10, 12, 13, 22, 31, 33, 37]. However, we find that the remaining software stack in VFS is still too heavyweight: our experiments show that NOVA spends an average of 34% of its execution time in the VFS layer (Figure 1(a)). In addition, VFS synchronizes concurrent syscalls with coarse-grained locks, which limits scalability. As shown in Figure 1(b), to create/rename/delete files in the same folder, VFS directly locks the parent directory, so the throughput stays unchanged despite the increasing number of client threads.
To sum up, the unified abstraction in VFS and the syscall interface do provide a safe and convenient way for programmers, but at the same time, such a classical design also restricts us from reconstructing the file system stack.
Figure 1: Analysis of OS-part Overhead with NOVA.
A group of file systems reduces the OS-part overhead by enabling user-level programs to directly access NVM devices without trapping into the kernel [18, 30]. However, they fail to provide the following important properties:
Write Protection.
Both Aerie [30] and Strata [18] rely on the hardware virtualization capabilities of modern server systems (i.e., the MMU) to enforce coarse-grained protection: they grant each application access rights to contiguous subsets of the NVM space, so as to prevent other malicious processes from corrupting the file system image. However, mapping (a subset of) the file system image into an application's address space and granting it write access is still dangerous, despite the third-party service they adopt to manage metadata: applications can access Aerie by directly updating the file data in place, and Strata allows user-level programs to directly update the per-process operation log and the DRAM cache (including both metadata and data). As a result, a buggy program can easily corrupt the file system image by misusing pointers that accidentally point to the NVM space [11, 34], and real-world evidence shows that such accidents are quite common.
Multicore Scalability.
Aerie relies on a trusted file system service (TFS, a separate user-level process) to ensure the integrity of metadata updates and to coordinate concurrent accesses with a distributed lock service. Such a centralized service easily becomes the bottleneck when the number of concurrent applications increases. Strata, in contrast, enables applications to update file data by appending their modifications directly to a per-process log without the involvement of a third-party service. However, Strata requires background threads (KernFS) to asynchronously digest the log entries (including both data and metadata) into the storage devices. If an application completely uses up its log, it must wait for an in-progress digest to complete before it can reclaim log space. Consequently, the number of digestion threads determines Strata's overall performance. Similar to Aerie, Strata also relies on the KernFS for concurrency control, which means the application needs to interact with the KernFS each time it accesses a new file. Besides, both of them access the third-party service via socket-based RPCs, which again introduce context-switch overhead, reducing the benefits of direct access.
In this section, we discuss the design goals and non-goals, and clarify the trade-offs we made when building KucoFS.
Direct-access and data protection.
The key design aspect of KucoFS lies in decoupling the functionality of a file system into two parts, so as to achieve the respective advantages of direct access of user-level file systems and write protection of kernel-based ones. Note that KucoFS mainly targets enforcing write protection against buggy programs; immunity to malicious attacks is out of the scope of this paper. Nevertheless, KucoFS is still robust to them in most cases by using checksums and leases (Section 4.3).
Scalability.
KucoFS should work well on a multi-core platform, so as to take full advantage of the internal parallelism of persistent memory. This drives us to design a scalable master service for concurrent applications. Besides, more efficient concurrency control is also required to deal with concurrent accesses and read-write conflicts.
Atomicity and Consistency, as required by most existing applications [24]. Two aspects need to be taken into consideration: 1) KucoFS should always remain consistent even after the system crashes abnormally, which requires us to carefully design failure-atomic update protocols; 2) readers should always see consistent data/metadata while other programs are concurrently updating files or directories.
Compatible APIs.
KucoFS should be backward compatible with kernel file system APIs, so that existing applications can use KucoFS without modifying their source code.
KucoFS makes a few tradeoffs that deviate from standard POSIX semantics, but without restricting its applicability in real-world applications. 1) KucoFS implements per-user directory trees (i.e., the programs within the same user share a "private" root node), instead of a global tree, so as to enforce read protection (Section 4.4). 2) We do not provide an explicit way for sharing data between different users, as it is not the common case (several feasible approaches are discussed in Section 4.4). 3) Some minor properties are not implemented in KucoFS (e.g., atime).
Figure 2: Architecture of KucoFS.
We designed KucoFS with the main goal of providing direct access while enforcing data protection, failure atomicity and consistency, as well as scalability to multi-cores.
Figure 2 shows the architecture of KucoFS. KucoFS consists of a user-level library and a global kernel thread, which are respectively called Ulib and master. Ulib communicates with the master via an exclusively owned message buffer. In KucoFS, each user owns a partition of the file system image, which is mapped into the user-level address space with read-only access rights. By linking with Ulib, an application can post memory load instructions to directly locate the data for read-only operations (e.g., read/stat). Ulib writes file data by always redirecting updates to new data pages with a copy-on-write mechanism. To enable user-level direct writes, the master modifies the permission bits in the page table to switch the newly allocated data pages between "writable" and "read-only" while Ulib is updating them. Ulib is not allowed to update metadata directly. Instead, it posts a request to the master through the message buffer, and the master updates the metadata on its behalf.
KucoFS adopts both DRAM and NVM to manage the file system image (see Figure 3). For efficiency, KucoFS only operates on the DRAM data for normal requests. In DRAM, an array of pointers (the inode table) is placed at a predefined location to point to the actual inodes. The first element in the inode table always points to the root inode of each user; therefore, Ulib can look up iteratively from the root inode to any file directly in user space. KucoFS uses an Ext2-like [9] block mapping to map a file to its data pages. We choose block mapping, instead of the widely used extent tree, to support lock-free fast read (Section 4.4). We then introduce a skip-list [25] to organize the dentry list of each directory, so as to achieve atomicity and consistency (Section 4.2).
Figure 3: Data layout in KucoFS and the steps to create a file.
To ensure the durability and crash consistency of metadata, KucoFS further places an append-only persistent operation log in NVM. When the master updates the metadata, it first atomically appends a log entry, and then actually updates the in-memory metadata. To prevent the operation log from growing arbitrarily, the master periodically checkpoints the modifications to the NVM metadata pages in the background and finally truncates the log (Section 4.6). Since the operation log only contains lightweight metadata, the checkpoint overhead is not high. In the face of system failures, the in-memory metadata can always be recovered by replaying the log entries in the operation log.
In addition to the operation log and metadata pages, the remaining NVM space is cut into contiguous 4 KB data pages to store file data. The free data pages are managed with both a bitmap in NVM and a free list in DRAM (for fast allocation). Similar to the metadata pages, the bitmap is also lazily persisted by the master during checkpointing.
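A rough struct-level sketch of this layout is given below; the field names and widths are illustrative assumptions rather than KucoFS's exact on-media format.

```c
#include <stdint.h>

#define KUCO_NAME_MAX 255

enum kuco_op { KUCO_OP_CREATE, KUCO_OP_UNLINK, KUCO_OP_RENAME, KUCO_OP_WRITE };

/* One record in the append-only operation log (persisted in NVM).
 * For creat it records the new inode number, the parent directory's
 * inode number, the file name and the access attributes. */
struct kuco_log_entry {
    uint32_t op;                       /* enum kuco_op */
    uint32_t inode_no;
    uint32_t parent_inode_no;
    uint32_t mode;                     /* permission bits / ACL summary */
    uint16_t name_len;
    char     name[KUCO_NAME_MAX + 1];
} __attribute__((packed));

/* DRAM-resident index: an array of inode pointers at a predefined
 * location; entry 0 points to the per-user root inode, so Ulib can walk
 * from the root to any file entirely in user space. */
struct kuco_inode;                     /* opaque in this sketch */
extern struct kuco_inode *inode_table[1 << 20];
```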
KucoFS delegates all metadata updates to the master. To relieve the pressure on the master, we propose to 1) minimize its metadata indexing overhead with index offloading, and 2) reduce the metadata persistence overhead with batching.
Index Offloading.
To update metadata, the master needs to perform an iterative pathname resolution from the root inode down to the directory containing the file. When a large number of processes access the file system concurrently, such indexing overhead is a heavy burden for the master. Things become even worse when a directory contains a large number of sub-files or the file path is long. To address this issue, we propose to offload the pathname resolution from the master to Ulib.
Since the file system image is mapped into user space, Ulib can locate the related metadata directly in user level before posting a metadata update request. Take the creat operation for example: Ulib finds the address of the predecessor in its parent directory's dentry list, and then posts the request to the master, piggybacking the addresses of the related metadata. In this way, the master can directly insert a new dentry into the dentry list at the given address. For the unlink operation, the addresses of both the dentry in the parent directory and the inode itself are provided. However, we still need extra techniques to ensure correctness.
First, we need to ensure that
Ulib can always read a consistent directory tree while the master is updating it concurrently. To this end, we organize the dentry list of each directory as a skip-list [25], keyed by the hash value of each file name. A skip-list is a linked-list-like data structure with multiple layers, where each higher layer acts as an "express lane" for the lists below, thus providing O(logN) search/insert complexity (see Figure 3). More importantly, by performing simple pointer manipulations on a singly linked list with the CPU's atomic operations, we can atomically update the list. We enforce the master to update the dentry list at a different point in time for different operations: for creat, it inserts the new dentry in the final step, to atomically make the created file visible; for unlink, it deletes the dentry first. Hence, Ulib is guaranteed to always have a consistent view of the directory tree even without acquiring a lock. Renaming involves updating two dentries simultaneously, so a program could observe two identical files at some point in time. To address this issue, we add a dirty flag to each dentry to prevent Ulib from reading such an inconsistent state.
Second, it is possible that the metadata pre-located by Ulib becomes obsolete before it is actually accessed by the master (e.g., the inode or dentry has already been deleted by the master on behalf of other concurrent processes). To solve this problem, we reuse the dirty bit in each inode/dentry: once an item is deleted, this bit is set to an invalid state, so other applications and the master itself can determine the liveness of each metadata item. The deleted items are temporarily kept in place and reclaimed via an epoch-based reclamation (EBR) mechanism [14]. We follow the classic approach of using three reclamation queues, each associated with an epoch number. The master pushes deleted items only to the currently active epoch queue. When all Ulib instances are active in the current epoch, the master increases the global epoch and begins to reclaim space from the oldest queue. A reader executing in user level can suffer arbitrary delays due to thread scheduling, impacting the reclamation efficiency, but we believe this is not a serious issue, since KucoFS only reclaims these obsolete items periodically.
Third, a pre-located dentry may no longer be the predecessor when a new dentry has been inserted between it and its successor in the meantime. Hence, the master also needs to check the validity of the pre-located metadata by comparing the related fields. Note that the master can update in-memory metadata without any synchronization overhead (i.e., locking), since all metadata updates are delegated to the master [27].
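Because all metadata updates are funneled through the single-writer master while Ulib readers traverse the dentry list lock-free, an insert only needs a release store of the predecessor's next pointer, and an unlink only needs to set the dirty flag before detaching the node. A simplified sketch of the bottom level of the skip-list (field names are assumptions):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Bottom level of the dentry skip-list; the higher "express lanes" are
 * maintained the same way. Field names are illustrative. */
struct dentry {
    uint64_t name_hash;               /* key: hash of the file name */
    uint32_t inode_no;
    _Atomic uint32_t dirty;           /* set before the node becomes invalid */
    struct dentry *_Atomic next;
};

/* Master-side insert (single writer): link the new node after `pred`.
 * The release store makes the fully initialized node visible to
 * concurrent lock-free readers in Ulib. */
static void dentry_insert(struct dentry *pred, struct dentry *node)
{
    atomic_store_explicit(&node->next,
                          atomic_load_explicit(&pred->next, memory_order_relaxed),
                          memory_order_relaxed);
    atomic_store_explicit(&pred->next, node, memory_order_release);
}

/* Master-side unlink: mark the node dead first, then detach it; the
 * memory is reclaimed later by the epoch-based reclamation scheme. */
static void dentry_unlink(struct dentry *pred, struct dentry *node)
{
    atomic_store_explicit(&node->dirty, 1, memory_order_release);
    atomic_store_explicit(&pred->next,
                          atomic_load_explicit(&node->next, memory_order_relaxed),
                          memory_order_release);
}
```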
Examples.
To create a file, Ulib sends a creat request to the master; the address of the predecessor in the parent directory's dentry list is included in the message as well. Upon receiving the request, the master performs the following steps (as shown in Figure 3): (1) it reserves an empty inode number from the inode table and appends a log entry to guarantee crash consistency; this log entry records the inode number, file name, parent directory inode number, and other attributes; (2) it allocates an inode with each field filled and updates the inode table to point to this inode; and (3) it inserts a dentry into the dentry list at the given address, making the created file visible. To delete a file, the master appends a log entry first, deletes the dentry in the parent directory using the given addresses, and finally frees the related space (e.g., the inode, NVM file pages and block mapping). With such a strict execution order, the failure atomicity and consistency described in Section 3 are guaranteed.
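A hedged sketch of the master-side creat path implied by these steps is shown below; the helper functions are hypothetical stand-ins for KucoFS internals.

```c
#include <stdint.h>

/* Hypothetical stand-ins for KucoFS internals. */
struct kuco_inode;
struct dentry;
extern uint32_t           kuco_reserve_inode_no(void);
extern void               kuco_log_append(int op, uint32_t ino,
                                          uint32_t parent_ino, const char *name);
extern struct kuco_inode *kuco_alloc_inode(uint32_t ino, const char *name);
extern struct dentry     *kuco_alloc_dentry(const char *name, uint32_t ino);
extern int                dentry_pred_valid(struct dentry *pred, const char *name);
extern struct dentry     *dentry_find_pred(uint32_t parent_ino, const char *name);
extern void               dentry_insert(struct dentry *pred, struct dentry *node);
extern struct kuco_inode *inode_table[];
#define KUCO_OP_CREATE 0

/* Master-side creat: persist the log entry first, then update the DRAM
 * metadata, and insert the dentry last so the file appears atomically. */
uint32_t master_handle_creat(uint32_t parent_ino, const char *name,
                             struct dentry *pred)
{
    uint32_t ino = kuco_reserve_inode_no();                  /* step 1 */
    kuco_log_append(KUCO_OP_CREATE, ino, parent_ino, name);  /* persist intent */

    inode_table[ino] = kuco_alloc_inode(ino, name);          /* step 2 */

    if (!dentry_pred_valid(pred, name))      /* pre-located hint became stale */
        pred = dentry_find_pred(parent_ino, name);
    dentry_insert(pred, kuco_alloc_dentry(name, ino));       /* step 3 */
    return ino;
}
```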
Batching-based Metadata Logging.
The master ensures the crash consistency of metadata by appending log entries and flushing them out of the CPU cache. However, cache flushing leads to significant overhead since NVM has poor write bandwidth. Fortunately, the master serves many user-space applications, so it can flush log entries in batches. Following this idea, we let the master fetch multiple requests at a time from concurrent applications and process them as a batch: multiple log entries from different requests can be merged into one large log entry. After it is persisted, the master updates the in-memory metadata one by one in the order described above, and finally sends the acknowledgments back. Such a processing mode has the following advantage: the CPU flushes data at cacheline granularity (typically 64 B), which is larger than most log entries, so by merging and persisting them together, the number of flush operations is dramatically reduced. Note that this batching is different from that of Aerie and Strata: Aerie batches requests before sending them to the TFS, so as to reduce the cost of posting RPCs, and the KernFS in Strata digests batches of operations from the log, which coalesces adjacent writes into sequential writes. Instead, KucoFS batches log entries to amortize the data persistence overhead, leveraging the mismatch between the flush granularity and the log entry size.
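The following sketch illustrates the batching idea: entries from several requests are appended back-to-back and the touched cachelines are written back once, followed by a single fence. It assumes a toolchain and CPU with CLWB support (compile with -mclwb); CLFLUSHOPT or CLFLUSH would be substituted otherwise.

```c
#include <immintrin.h>   /* _mm_clwb, _mm_sfence */
#include <stdint.h>
#include <string.h>

/* Append a batch of small log entries contiguously at the log tail and
 * persist them with per-cacheline write-backs plus one fence, instead
 * of flushing each (sub-cacheline) entry individually. */
static void log_append_batch(uint8_t *log_tail,
                             const void **entries, const size_t *sizes, int n)
{
    uint8_t *p = log_tail;
    for (int i = 0; i < n; i++) {
        memcpy(p, entries[i], sizes[i]);
        p += sizes[i];
    }
    /* Write back every cacheline the batch touched. */
    for (uint8_t *cl = (uint8_t *)((uintptr_t)log_tail & ~63UL); cl < p; cl += 64)
        _mm_clwb(cl);
    _mm_sfence();        /* order the write-backs before the tail update */
    /* The persistent tail pointer would be advanced (and flushed) here. */
}
```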
Another key design principle lies in how to provide an efficient, consistent and safe write protocol. To achieve these goals, we propose pre-allocation and a direct-access range lock to simplify the interaction between Ulib and the master.
Figure 4: Layout of Direct Access Range-Lock.
Similar to NOVA [33] and PMFS [13], we use a copy-on-write (CoW) mechanism to update data pages: a write moves the unmodified part of the data from the old place, together with the application data, to new data pages. CoW causes extra copying overhead for small-sized updates. In most cases, however, it dismisses the double-write overhead of redo/undo logging and the log cleaning overhead of log-structured data management.
KucoFS first uses CoW to update the data pages, and then atomically appends a log entry to record the metadata modifications, during which the old data and metadata are never touched. If a system failure occurs before a write operation is finished, KucoFS simply rolls back to its original state. As such, failure atomicity and consistency are guaranteed. We rely on the master to enforce write protection over each file page by leveraging the permission bits in the page table: when a user-level program directly writes data, the master carefully manipulates the permission bits of the related data pages. An intuitive write protocol is:
1) Ulib sends the first request to the master to lock the file, reserve free data pages, and make them "writable";
2) Ulib relies on CoW to copy both the unmodified data from the old place and the new data from the user buffer to the reserved data pages, and flushes them out of the CPU cache;
3) Ulib sends a second request to the master to reset the newly written data pages to "read-only", append a new log entry (inode number, offset, size and related NVM addresses) to the operation log, update the metadata (i.e., inode, block mapping), and finally release the lock.
We can observe that a single write operation involves posting two requests to the master. This not only leads to high write latency, but also limits the efficiency of the master, since it is frequently involved. Thus, we propose pre-allocation and a direct access range-lock to avoid sending the first request to the master.
Pre-allocation.
Rather than posting a request to the master to reserve free pages for each write operation, we allow Ulib to lazily allocate data pages from the master (4 MB at a time in our implementation). These data pages are managed privately by Ulib with a free list. When an application exits, the unused data pages are returned to the master. After an abnormal exit, these data pages are temporarily non-reusable by other applications, but they can still be reclaimed after reboot by replaying the operation log.
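A sketch of the Ulib-side allocator under these assumptions follows; master_reserve_extent() is a hypothetical call standing in for the actual request to the master.

```c
#include <stddef.h>
#include <stdint.h>

#define KUCO_PAGE_SIZE   4096u
#define KUCO_PREALLOC_SZ (4u << 20)                 /* 4 MB per reservation */
#define KUCO_BATCH_PAGES (KUCO_PREALLOC_SZ / KUCO_PAGE_SIZE)

/* Hypothetical RPC: the master reserves a writable 4 MB extent of NVM
 * data pages for this process and returns its base address. */
extern void *master_reserve_extent(size_t bytes);

/* Ulib-private stack of free page addresses (DRAM, no locking needed). */
static void *free_pages[KUCO_BATCH_PAGES];
static unsigned n_free;

static void *ulib_alloc_page(void)
{
    if (n_free == 0) {                              /* lazily refill from master */
        uint8_t *base = master_reserve_extent(KUCO_PREALLOC_SZ);
        for (unsigned i = 0; i < KUCO_BATCH_PAGES; i++)
            free_pages[n_free++] = base + (size_t)i * KUCO_PAGE_SIZE;
    }
    return free_pages[--n_free];
}
```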
Direct Access Range-Lock.
To completely avoid sending the first request of the naive write protocol, we further propose the direct access range-lock. It coordinates concurrent writes directly in user level, since we can no longer rely on the master to acquire the lock.
As shown in Figure 4, we assign each opened file a range lock (i.e., a DRAM ring buffer), which is pointed to by the inode. Ulib writes a file by acquiring the range lock first, and the file write is delayed once a lock conflict occurs. Each slot in the ring buffer has five fields: state, offset, size, lease and a checksum. The checksum is the hash value of the first four fields. We also place a version at the head of each ring buffer to describe the ordering of write operations. To acquire the lock of a file, Ulib first increments the version with a fetch_and_add instruction. It then inserts a lock item into a specific slot of the ring buffer, whose location is determined by the fetched version (modulo the ring buffer size). After this, Ulib traverses the ring buffer backward to find the first conflicting lock item (i.e., one whose written data overlaps). If it exists, Ulib verifies its checksum, and then polls on its state until it is released. Ulib also checks its lease field repeatedly to avoid deadlock in case an application aborted before releasing the lock. Once the lock has been acquired, Ulib performs the second and third steps described in the naive protocol. In step 3, the version is encapsulated in the request, so the master can persist it in the log entry.
Worth noticing, the proposed range lock supports concurrent writes to the same file covering different data pages. Such fine-grained concurrency control is important in high-performance computing [8, 35] and in emerging rack-scale computers with hundreds to thousands of cores [17].
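The sketch below illustrates the acquire path of such a range lock. The slot layout mirrors Figure 4, but the field order, the lease length, and the keyed_hash() helper are assumptions, and the checksum here covers only the offset, size and lease fields.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

#define RING_SLOTS 8                     /* lock items per ring buffer */
#define LEASE_NS   (10ULL * 1000 * 1000) /* illustrative 10 ms lease */

struct lock_item {
    _Atomic uint32_t state;              /* 0 = free, 1 = held */
    uint64_t offset, size;               /* byte range being written */
    uint64_t lease;                      /* absolute expiry of the holder */
    uint64_t checksum;                   /* keyed hash of offset/size/lease */
};

struct range_lock {
    _Atomic uint64_t version;            /* orders all writes to this file */
    struct lock_item slot[RING_SLOTS];
};

/* Assumed helper: keyed hash so a corrupted slot can be detected. */
extern uint64_t keyed_hash(const struct lock_item *it);

static uint64_t now_ns(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (uint64_t)t.tv_sec * 1000000000ULL + (uint64_t)t.tv_nsec;
}

static int overlap(const struct lock_item *it, uint64_t off, uint64_t len)
{
    return it->offset < off + len && off < it->offset + it->size;
}

/* Acquire the range [off, off+len); returns the version that is later
 * embedded in the write request so the master can persist it. */
uint64_t range_lock_acquire(struct range_lock *rl, uint64_t off, uint64_t len)
{
    uint64_t ver = atomic_fetch_add(&rl->version, 1);
    struct lock_item *me = &rl->slot[ver % RING_SLOTS];

    me->offset = off;
    me->size = len;
    me->lease = now_ns() + LEASE_NS;
    me->checksum = keyed_hash(me);
    atomic_store_explicit(&me->state, 1, memory_order_release);

    /* Scan earlier slots and wait for any conflicting, still-held writer. */
    for (uint64_t d = 1; d < RING_SLOTS && d <= ver; d++) {
        struct lock_item *it = &rl->slot[(ver - d) % RING_SLOTS];
        if (it->checksum != keyed_hash(it) || !overlap(it, off, len))
            continue;                    /* corrupted/reused slot or no overlap */
        while (atomic_load_explicit(&it->state, memory_order_acquire) == 1)
            if (now_ns() > it->lease) {  /* holder aborted: treat as released */
                atomic_store(&it->state, 0);
                break;
            }
    }
    return ver;
}

/* In KucoFS the master clears the slot after applying the metadata
 * update; a user-level release would simply be: */
void range_lock_release(struct range_lock *rl, uint64_t ver)
{
    atomic_store(&rl->slot[ver % RING_SLOTS].state, 0);
}
```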
Write Protection.
KucoFS strictly controls the access rights to the file system image: both the in-memory metadata and the persistent operation log are critical to the file system, so the master is the only one allowed to update them. Ulib only has write access to its privately managed free data pages. However, these pages are immediately changed to "read-only" once they are allocated to serve write operations. Since both the metadata and the valid data pages are non-writable, KucoFS is immune to arbitrary memory writes. However, there are still two anomalies: 1) the private data pages can still be corrupted within a write operation by other concurrent threads; however, a kernel-based file system cannot cope with such a case either [13], and we believe it is unlikely to happen. 2) A buggy application can still corrupt the range lock or the message buffer, since they are directly writable in user space. We add checksum and lease fields to each slot, enabling user-level programs to identify whether an inserted element has been corrupted. Both the lock item and the request message contain only a few tens of bytes of data, so the hash calculation overhead is not high. Besides, the secret key for generating the checksum is owned by the master and granted only to trusted applications. Therefore, KucoFS is even immune to some malicious attacks (e.g., replay or DoS), though this is not the main target of this paper.
When the master updates the page table for a write operation, it needs to explicitly flush the related TLB entries to make the modifications visible. This implies that each write operation in KucoFS involves two TLB flushes. Luckily, we can allocate multiple data pages at a time in the pre-allocation phase, so the TLB entries can be flushed in batches, which reduces the flushing overhead dramatically.
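KucoFS performs these permission switches by editing page-table entries from the kernel; as a rough user-level analogue of the idea (not the actual KucoFS mechanism), one can toggle a mapped region between read-only and writable with mprotect, which likewise triggers TLB invalidations and therefore also benefits from batching several pages per call:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* User-level analogue of the master's permission toggling: make a batch
 * of pages writable, fill them, then seal them read-only again. KucoFS
 * performs the equivalent page-table edits in the kernel so that only
 * the master can grant write access. */
int main(void)
{
    size_t len = 16 * 4096;                        /* batch of 16 pages */
    char *pages = mmap(NULL, len, PROT_READ,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (pages == MAP_FAILED) return 1;

    mprotect(pages, len, PROT_READ | PROT_WRITE);  /* open the write window */
    memset(pages, 0xAB, len);                      /* CoW-style data copy */
    mprotect(pages, len, PROT_READ);               /* seal: writes now fault */

    printf("first byte after sealing: 0x%02x\n", (unsigned char)pages[0]);
    munmap(pages, len);
    return 0;
}
```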
KucoFS updates data pages with a CoW mechanism; hence, any data page is in either the old or the new version. This gives us the opportunity to design an efficient read protocol that reads directly in user level. Considering that the master may be updating the metadata for other concurrent writers, the main challenge is how to read a consistent snapshot of the block mappings efficiently despite those writers.
Figure 5: Lock-Free Fast Read with Version Checking.
Hence, we propose lock-free fast read, which guarantees that readers never read data from unfinished writes. It achieves this by embedding a version field in each pointer of the block mapping: as shown in Figure 5, each 96-bit block mapping item contains four fields, which are start, version, end and pointer. Take a write operation that updates three data pages for example: when the master updates the block mapping, the headers of the three mapping items are constructed as [start=1, V] [V] [V, end=1]. Note that all three items share the same version (i.e., V), which is provided by Ulib when it acquires the range lock (Section 4.3); the start bit of the first item and the end bit of the last item are set to 1. We only reserve 40 bits for the pointer field since it always points to a 4 KB-aligned page (the lower 12 bits can be discarded). It is easy to see that when there are no concurrent writers, the block mapping items satisfy one of the following conditions (Figure 5):
(a) Writes without overlapping. The items with the same version are enclosed by a start bit and an end bit, indicating that multiple threads have updated the same file but different data pages.
(b) Overlapping in the tail. The reader sees a start bit where the version increases, indicating that a thread has overwritten the tail part of the pages updated by a former thread.
(c) Overlapping in the head. The reader sees an end bit before the version decreases, indicating that a thread has overwritten the front part of the pages updated by a former thread.
If Ulib encounters any case that violates the above conditions, we assert that the master is updating the block mappings for other concurrent write threads. In this case, Ulib needs to reload the metadata and check its validity again. To reduce the overhead of retrying, the read thread copies the file data to the user's buffer only after it has successfully collected a consistent version of the block mapping. This is safe because the obsolete NVM pages are reclaimed lazily. When the modified mapping items span multiple cachelines, the master also adds extra mfence instructions to serialize the updates, so the read threads see the updates in order.
Read Protection.
Leveraging the permission bits to enforce read protection is more challenging, since metadata have semantically richer permissions [30]. Hence, instead of maintaining a fully compatible hierarchical/group access control as in kernel-based file systems, we partition the directory tree into per-user sub-trees, and each user has a private root node. When a program accesses KucoFS, only the sub-tree (i.e., the inode table, inodes, dentry lists, etc.) and the related data pages of the current user are mapped into its address space, while the other space is invisible to it. To alleviate the bookkeeping overhead of page mapping, the master assigns each user 4 MB of contiguous DRAM/NVM blocks, which form the per-user file system image (i.e., DRAM metadata, operation log, data pages, etc.). Similar to Arrakis [23], KucoFS does not provide an explicit way to share data between different users, yet there are several practical approaches: 1) create a standalone partition that every user has read/write access to; 2) issue user-level RPCs to a specific user to acquire the data. We believe this tradeoff is unlikely to be an obstacle in real-world scenarios, since KucoFS naturally supports efficient sharing between applications within the same user, which is the more common case.
We introduce a checkpoint mechanism to prevent the operation log from growing arbitrarily: when the master is not busy, or when the operation log grows beyond a maximum size, it applies the metadata modifications to the NVM metadata pages by replaying log entries in the operation log. The bitmap used to manage the NVM free space is updated and persisted as well. After that, the operation log is truncated. Each time KucoFS is restarted, the master first replays the un-checkpointed log entries in the operation log, so as to make the NVM metadata pages up to date. It then copies the NVM metadata pages to DRAM. The free list of NVM data pages is also reconstructed according to the bitmap stored in NVM. Keeping redundant copies of metadata in DRAM and NVM introduces higher DRAM/NVM space consumption, but we believe it is worth the cost: by selectively placing the (un)structured metadata in DRAM and NVM, we can perform fast indexing directly in DRAM, append log entries with reduced persistence overhead (batching), and lazily checkpoint in the background without affecting performance. As future work, we plan to reduce the DRAM footprint by only keeping the metadata of active files in DRAM.
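The restart path implied above can be summarized by the following sketch, where the helper functions are hypothetical placeholders for the corresponding KucoFS routines.

```c
/* Hypothetical placeholders for the corresponding KucoFS routines. */
struct kuco_log_iter;                        /* walks the un-checkpointed tail */
extern int  log_next(struct kuco_log_iter *it, void *entry_out);
extern void apply_to_nvm_metadata(const void *entry);
extern void copy_nvm_metadata_to_dram(void);
extern void rebuild_free_list_from_bitmap(void);

/* Restart path: bring the NVM metadata pages up to date, then rebuild
 * the DRAM copies and the data-page allocator state. */
void kuco_recover(struct kuco_log_iter *it)
{
    unsigned char entry[512];
    while (log_next(it, entry))              /* replay un-checkpointed entries */
        apply_to_nvm_metadata(entry);

    copy_nvm_metadata_to_dram();             /* inode table, dentry lists */
    rebuild_free_list_from_bitmap();         /* free NVM data pages */
}
```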
We finally summarize the design of KucoFS by walking through an example of writing 4 KB of data to a new file and then reading it back. First of all, the program links with Ulib to map the related NVM/DRAM space of the current user into its address space.
Open.
Before sending the open system call, Ulib pre-locates the related metadata. Since this is a new file, Ulib cannot find its inode; instead, it finds the predecessor in its parent directory's dentry list for the later creation. The address, as well as other information (e.g., file name, O_CREAT flags, etc.), is encapsulated in the open request. When the master receives the request, it creates the file based on the given address. It also allocates a range-lock ring buffer for this file, since it is opened for the first time. Then the master sends a response message. After this, Ulib creates a file descriptor for the opened file and returns it to the application.
Write.
The application then uses the write call via Ulib to write 4 KB of data to the newly created file. First, Ulib finds the inode of this file and locks it with the direct access range lock. Ulib blocks the program when there is a write conflict and waits until the corresponding lock has been released; after that, Ulib acquires the lock successfully. It then allocates a 4 KB page from its privately managed free list, copies the data into it, and flushes it out of the CPU cache. Ulib needs to post an extra request to the master to allocate more free data pages once its own space is used up. Finally, Ulib sends the write request to the master to finish the remaining steps: changing the permission bits of the written data pages to "read-only", atomically appending a log entry describing this write operation, updating the in-memory metadata, and finally unlocking the file.
Read.
KucoFS enables reading file data without interacting with the master. To read the first 4 KB of this file, Ulib directly locates the inode in user space and reads the first block mapping item (i.e., the pointer). Version checking is performed to ensure that its state satisfies one of the three conditions described in Section 4.4. After this, Ulib can safely read the file data page pointed to by the pointer.
Close.
Ulib also needs to send a close system call to the master upon closing the file. The master then reclaims the space of the range-lock ring buffer if no other process is accessing this file.
KucoFS is implemented in two parts: a loadable kernel module (i.e., the master) and a shared library (i.e., Ulib). Each Ulib instance communicates with the master via an exclusively owned message buffer.
KucoFS’s APIs.
KucoFS provides a POSIX-like interface, so existing applications can access it without any modification to their source code. It achieves this by setting the LD_PRELOAD environment variable. Ulib intercepts all APIs in the standard C library that are related to file system operations. Ulib processes a call directly if the prefix of the accessed file matches a predefined string (e.g., "/kuco"); otherwise, the call is processed in legacy mode. Note that write operations only pass file descriptors to locate the file data; therefore,
Ulib distinguishes the write operations targeting KucoFS from those targeting legacy file systems by using only big file descriptor numbers (greater than 2 in our implementation).
Figure 6: Read and write throughput with FxMark. ("Low": different threads read (write) data from (to) different files; "Medium": the same file but different data blocks; the default I/O size is 4 KB.)
Ulib copies the file data to its privately managed datapages, and then sends a request to the master to map thesepages into contiguous address space. When the applicationissues a msync system call,
Ulib then handles it as a writeoperation, so as to atomically makes the updates in these datapages visible to other applications.
In this section, we evaluate the overall performance of KucoFS with micro- and macro-benchmarks as well as real-world applications. We also study the effects of its internal mechanisms.
Testbed.
Our experimental testbed is equipped with two Intel Xeon Gold 6240M CPUs (36 physical cores and 72 logical threads), 192 GB of DDR4 DRAM, and six Optane DC persistent memory DIMMs (256 GB per module, 1.5 TB in total). Our evaluation on Optane DC shows that its read bandwidth peaks at 37 GB/s and its write bandwidth at 13.2 GB/s. The server runs Ubuntu 19.04 and Linux kernel 5.1, the kernel version supported by NOVA.
Compared Systems.
We evaluate KucoFS against NVM-aware file systems including PMFS [13], NOVA [33], and Strata [18], as well as traditional file systems with DAX support, including Ext4-DAX [2] and XFS-DAX [29]. (We use the publicly available versions: https://github.com/NVSL/PMFS-new, https://github.com/NVSL/linux-nova, and https://github.com/ut-osa/strata.) Strata only supports a few applications and has trouble running multi-threaded workloads [36], so we only give its single-threaded performance results in the Filebench and Redis evaluations.
Aerie is based on Linux 3.2.2, which does not have the drivers required to support Optane DC. Hence, we compare with Aerie [30] by emulating persistent memory with DRAM (due to limited space, we only describe these experimental results in words, without including them in the figures).
We use FxMark [21] to evaluate the basic file system operations (in terms of both throughput and multi-core scalability). FxMark provides 19 micro-benchmarks, categorized by four criteria: data types (i.e., data or metadata), modes (i.e., read or write), operations (i.e., read, overwrite, append, create, etc.), and sharing levels (i.e., low, medium or high). We only include some of them in the paper due to limited space.
File Read.
Figure 6 (a)-(b) show the file read performance of each file system with a varying number of client threads and different sharing levels (Low/Medium). We observe that KucoFS exhibits significantly higher throughput than the other file systems, and its throughput scales linearly as the number of clients increases. Specifically, with 36 client threads and the Low sharing level, KucoFS outperforms NOVA and PMFS by 6x on average, and has two orders of magnitude higher performance than XFS-DAX and Ext4-DAX. Such performance advantage stems primarily from the design of lock-free fast read, which enables user-space direct access without the involvement of the master. The kernel file systems (e.g., XFS, Ext4, NOVA and PMFS) have to perform context switches and walk through the VFS layer, which hurts read performance. Besides, all of the compared systems need to lock the file before actually reading the file data; such locking overhead impacts their performance severely even though contention is low [20]. We further observe that the throughput of the compared systems remains steady and low under the Medium sharing level, since all the threads acquire the lock of the same file. Instead, the performance of KucoFS is unchanged as the sharing level varies, because it does not rely on a per-file lock to coordinate concurrent readers. Note that the measured read performance via FxMark
is larger than the raw bandwidth of Optane DC (37 GB/s), because FxMark lets each thread read one file page repeatedly, so the accessed data is served from the CPU cache. With our emulated persistent memory, Aerie shows almost the same performance as KucoFS at the Low sharing level, but its throughput falls far behind the others at the Medium sharing level. This is because Aerie needs to contact the TFS frequently to acquire the lock, causing extra context-switch overhead.
Figure 7: readdir performance with FxMark. ("Medium": in the same folder.)
File Write.
The throughput of both append and overwrite operations is given in Figure 6 (c)-(e). For overwrite operations with the "Low" sharing level, all systems exhibit a performance curve that increases first and then decreases. In the increasing part, KucoFS shows the highest throughput among the compared systems because it can directly write data in user space. XFS and NOVA also show good scalability: NOVA partitions the free space to avoid global locking overhead when allocating new data pages, while XFS writes data in place without allocating new pages. Both PMFS and Ext4 fail to scale since they adopt transactions to write data, introducing extra locking overhead. In the decreasing part, their throughput is restricted by the Optane bandwidth because of its poor scalability [15]. For overwrite operations with the "Medium" sharing level, the throughput of KucoFS is one order of magnitude higher than the other three file systems when the number of threads is small. Such performance benefits mainly come from the range-lock design in KucoFS, which enables parallel updates to different data blocks of the same file. The performance of KucoFS drops again when the number of clients exceeds 8, which is mainly restricted by the ring buffer size of the range lock (we reserve 8 lock items in each ring buffer). For append operations, XFS-DAX, Ext4-DAX and PMFS exhibit un-scalable performance as the number of client threads increases, because all of them use a global lock to manage the metadata journal and free data pages, so lock contention contributes the major overhead. Both NOVA and KucoFS show better scalability, and KucoFS outperforms NOVA by 1.1x to 3x as the number of threads varies. On our emulated persistent memory, Aerie shows the worst performance because the trusted service is the bottleneck: the clients need to frequently interact with it to acquire locks and allocate new data pages.
We conclude that by fully exploiting the benefits of direct access, KucoFS always shows the highest performance among the evaluated file systems.
Figure 8: Filebench Throughput with Different File Systems.
Metadata Read.
Figure 7(a) shows the performance of readdir operations with the Medium sharing level (i.e., all threads read the same directory); Aerie does not support this operation. We observe that only KucoFS exhibits scalable performance, and PMFS cannot even complete the workload as the number of clients increases. The kernel file systems lock the parent directory's inode in VFS before reading the dentry list and the file inodes; as a result, the execution of different client threads is serialized when they access the same directory. In contrast, the skip-list used in KucoFS supports lock-free reads and atomic updates, enabling multiple readers to concurrently read the same directory.
File Creation.
To evaluate the performance of creat with the Medium sharing level, FxMark lets each client thread create 10 K files in a shared directory. As shown in Figure 7(b), KucoFS achieves one order of magnitude higher throughput than the compared file systems, and it exhibits scalable performance as the number of threads increases. XFS-DAX, Ext4-DAX and PMFS use a global lock to perform metadata journaling and manage the free space, which leads to their un-scalable performance. Besides, the VFS layer needs to lock the inode of the parent directory before creating the files, so NOVA also fails to scale even though it avoids a global lock. We explain the high performance of KucoFS from the following aspects: (1) in KucoFS, all metadata updates are delegated to the master, so it can update them without any locking overhead; (2) by offloading all the indexing overhead to user space, the master only needs to do very lightweight operations; (3) KucoFS can persist metadata in batches, while the other kernel file systems do not have such an opportunity. Aerie synchronizes the metadata of the created files to its trusted service in batches, so it achieves performance comparable to KucoFS, but it fails to work properly with more threads.
We then use Filebench [1] as a macro-benchmark to evaluate the performance of KucoFS. We select two workloads, Fileserver and Varmail, with the same settings as in the NOVA paper: files are created with an average size of 128 KB and
32 KB for Fileserver and Varmail, respectively. The I/O sizes of both read and write operations are set to 16 KB in Fileserver. Varmail has a read I/O size of 1 MB and a write I/O size of 16 KB. Fileserver and Varmail have write-to-read ratios of 2:1 and 1:1, respectively. The total number of files in each workload is set to 100 K. Fileserver emulates the I/O activity of a simple file server [3] by randomly performing creates, deletes, appends, reads and writes. Varmail emulates an email server and uses a write-ahead log for crash consistency; it contains a large number of small files involving both read and write operations. We only give a single-threaded evaluation of Strata. Figure 8 shows the results, and we make the following observations:
(1) KucoFS shows the highest performance on all the evaluated workloads. In the single-threaded evaluation, its throughput is 2.5x, 2x, 1.34x, 1.29x and 1.26x higher than XFS, Ext4, PMFS, NOVA, and Strata respectively for the Fileserver workload, and 2.7x, 6x, 2x, 1.67x and 1.3x higher for the Varmail workload. Such performance advantage mainly comes from the direct access feature of KucoFS: it executes file I/O operations directly in user level, thus dismissing the OS-part overhead (i.e., context saving and reloading, and executing in the VFS layer). Strata also benefits from direct access; however, it needs to acquire a lease from the third-party service each time it accesses a new file, which limits its efficiency. We also observe that the design of KucoFS is a good fit for the Varmail workload. This is expected: Varmail frequently creates/deletes files, so it generates more metadata operations and issues system calls more frequently. As described before, KucoFS eliminates the OS-part overhead and is better at handling metadata operations. Besides, Strata shows much higher throughput than NOVA since the file I/Os in Varmail are small-sized; Strata only needs to append these small updates to its operation log, reducing write amplification dramatically.
(2) KucoFS is better at handling concurrent workloads. With 20 concurrent client threads and the Fileserver workload, KucoFS outperforms XFS-DAX and Ext4-DAX by 3.5x on average, PMFS by 2.3x, and NOVA by 1.4x. Such performance advantage is more obvious for the Varmail workload: it achieves 15% higher performance than XFS-DAX and Ext4-DAX on average. Two reasons contribute to this good performance: 1) KucoFS incorporates techniques like index offloading to enable the master to provide scalable metadata access performance; 2) KucoFS avoids using a global lock by letting each client manage private free data pages. NOVA also exhibits good scalability since it uses a per-file log structure and partitioned free space management.
Figure 9: Redis SET throughput with different value sizes.
Many modern cloud applications use key-value stores like Redis for storing data. Redis exports an API that allows applications to process and query structured data, but uses the file system for persistent data storage. Redis has two approaches to persist its data: one is to log operations to an append-only file (AOF), and the other is to use an asynchronous snapshot mechanism. We only evaluate Redis in AOF mode in this paper. Similar to Strata [18], we configure Redis to use AOF mode and to persist data synchronously.
Figure 9 shows the throughput of SET operations using 12-byte keys and various value sizes. For small values, the throughput of Redis on KucoFS is 53% higher on average than on PMFS, NOVA and Strata, and 76% higher than on XFS-DAX and Ext4-DAX. This is consistent with the evaluation results of Append operations, where KucoFS outperforms the other systems by at least 2x with a single thread. With larger object sizes, KucoFS achieves only slightly higher throughput than the other file systems, since the Optane bandwidth becomes the major limiting factor.
Figure 10: Benefits of Each Optimization in KucoFS.
In this section, we analyze the performance improvements brought by each optimization in KucoFS. First, we measure the individual benefits of index offloading and batching-based logging. To do so, we disable batching by letting the master persist log entries one by one, and we move the metadata indexing operations back to the master to see the effect of index offloading. Figure 10(a) shows the results, measuring the throughput of creat with a varying number of clients. We make the following observations:
(1) In the single-thread evaluation, index offloading does not improve performance: since moving the metadata indexing from Ulib back to the master does not reduce the total execution latency of each operation, the single-thread throughput is unchanged. We also find that batching does not degrade the single-thread performance, which is in contrast to the broad belief that batching causes higher latency. In our implementation, the master simply scans the message buffer to fetch the pending requests, and the overhead of scanning is insignificant.
(2) When the number of client threads increases, we find that index offloading improves throughput by 55% at most for the creat operation. Since KucoFS only allows the master to update metadata on behalf of multiple
Ulib instances, the theoretical throughput limit is T_max = 1 / L_req, where L_req is the latency for the master to process one request. Therefore, the proposed offloading mechanism improves performance by shortening the execution time of each request (i.e., L_req). Similarly, batching is introduced to speed up the processing efficiency of the master by reducing the data persistence overhead. From the figure, we find that it improves throughput by 33% at most for the creat operation.
Second, we demonstrate the efficiency of lock-free fast read by concurrently reading and writing data in the same file. In our evaluation, one read thread sequentially reads a file with an I/O size of 16 KB, and an increasing number of threads are launched to overwrite the same file concurrently (4 KB writes to random offsets). We let the read thread issue read operations 1 million times and measure its execution time while varying the number of write threads. For comparison, we also implement KucoFS r/w lock, which reads file data by acquiring the read-write lock in the range-lock ring buffer, and KucoFS w/o lock, which reads file data directly without regard to correctness. We make the following observations from Figure 10(b): (1) the proposed lock-free fast read achieves almost the same performance as KucoFS w/o lock, which shows that the overhead of version checking is extremely low; (2) KucoFS r/w lock needs much more time to finish reading (7% to 3.2x more time than the lock-free design for different I/O sizes), because atomic operations are needed to acquire the range lock, which severely impacts read performance when there are more conflicts; (3) the execution time of NOVA is orders of magnitude higher than that of KucoFS. NOVA directly uses a mutex to synchronize concurrent readers and writers; as a result, the reader is delayed dramatically by the writers.
Kernel/Userspace Collaboration.
The emergence of high-throughput and low-latency hardware (e.g., InfiniBand networks, NVMe SSDs and NVMs) prompts the idea of moving I/O operations from the kernel to user level. Belay et al. [6] abstract the Dune process leveraging the virtualization hardware in modern processors; it enables direct access to privileged CPU instructions in user space and executes syscalls with reduced overhead. Based on Dune, IX [7] steps further to improve the performance of data-center applications by separating the management and scheduling functions of the kernel (control plane) from network processing (data plane). Arrakis [23] is a new network server operating system. It splits the traditional role of the kernel in two: applications have direct access to virtualized I/O devices, while the kernel only enforces coarse-grained protection and does not need to be involved in every operation.
Persistent Memory File System.
Existing research on NVM-based file systems can be classified into three categories: (1)
Kernel-Level.
BPFS [12] adopts short-circuit shadow paging to guarantee metadata and data consistency, and introduces epoch hardware modifications to efficiently enforce orderings. SCMFS [32] simplifies file management by mapping files to contiguous virtual address regions with the virtual memory management (VMM) of the existing OS, but it fails to support consistency for both data and metadata. Both PMFS [13] and NOVA [33] use separate mechanisms to guarantee the consistency of metadata and data: PMFS uses journaling for metadata updates and performs writes with a copy-on-write mechanism; NOVA is a log-structured file system deployed on a hybrid DRAM-NVM architecture, which manages metadata with per-inode logs to improve scalability and moves file data out of the log (file data is managed with CoW) to achieve efficient garbage collection. NOVA-Fortis [34] steps further to be fault-tolerant by providing a snapshot mechanism. While these kernel file systems provide POSIX I/O and propose different approaches to enforce (meta)data consistency, their performance is still restricted by the existing OS abstractions (e.g., syscall and VFS). (2)
User-Level.
Both Aerie [30] and Strata [18] propose to avoid the OS-part overhead by implementing the file system in user space. With this design, user-level applications have direct access to the file system image. Both of them adopt a third-party trusted service to coordinate concurrent operations and process other essential work (e.g., metadata management in Aerie and data digestion in Strata). However, by exporting the file system image to user-level applications, they are vulnerable to arbitrary writes from buggy applications. (3)
Device-Level.
DevFS [16] proposes to push the file system implementation into the storage device that has compute capability and device-level RAM, which requires the support of dedicated hardware.
In this paper, we revisit the file system architecture for non-volatile memories by proposing a kernel and user-level collaborative file system named KucoFS. It fully exploits the respective advantages of direct access in user level and data protection in kernel space. We further improve its scalability to multicores by rebalancing the load between kernel and user space and by carefully coordinating read and write conflicts. Experiments show that KucoFS provides both efficient and scalable non-volatile memory management.

References

[1] Filebench file system benchmark, 2004.
[2] Support ext4 on NV-DIMMs. https://lwn.net/Articles/588218, 2014.
[3] SPEC SFS 2014, 2017.
[4] Intel Optane DC persistent memory, 2019.
[5] IG Baek, MS Lee, S Seo, MJ Lee, DH Seo, D-S Suh, JC Park, SO Park, HS Kim, IK Yoo, et al. Highly scalable nonvolatile resistive memory using simple binary oxide driven by asymmetric unipolar voltage pulses. In Electron Devices Meeting, 2004. IEDM Technical Digest. IEEE International, pages 587–590. IEEE, 2004.
[6] Adam Belay, Andrea Bittau, Ali Mashtizadeh, David Terei, David Mazières, and Christos Kozyrakis. Dune: Safe user-level access to privileged CPU features. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI '12, pages 335–348, Berkeley, CA, USA, 2012. USENIX Association.
[7] Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. IX: A protected dataplane operating system for high throughput and low latency. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI '14, pages 49–65, Berkeley, CA, USA, 2014. USENIX Association.
[8] Spyros Blanas, Kesheng Wu, Surendra Byna, Bin Dong, and Arie Shoshani. Parallel data analysis directly on scientific file formats. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pages 385–396. ACM, 2014.
[9] Remy Card, Theodore Ts'o, and Stephen Tweedie. Design and implementation of the second extended filesystem. In Proceedings of the 1st Dutch International Symposium on Linux, pages 1–6, 1994.
[10] Youmin Chen, Jiwu Shu, Jiaxin Ou, and Youyou Lu. HiNFS: A persistent memory file system with both buffering and direct-access. ACM Trans. Storage, 14(1):4:1–4:30, April 2018.
[11] Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. NV-Heaps: Making persistent objects fast and safe with next-generation, non-volatile memories. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 105–118, New York, NY, USA, 2011. ACM.
[12] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee. Better I/O through byte-addressable, persistent memory. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pages 133–146, New York, NY, USA, 2009. ACM.
[13] Subramanya R. Dulloor, Sanjay Kumar, Anil Keshavamurthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson. System software for persistent memory. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys '14, pages 15:1–15:15, New York, NY, USA, 2014. ACM.
[14] Keir Fraser. Practical lock-freedom. Technical report, University of Cambridge, Computer Laboratory, 2004.
[15] Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu, Subramanya R Dulloor, et al. Basic performance measurements of the Intel Optane DC persistent memory module. arXiv preprint arXiv:1903.05714, 2019.
[16] Sudarsun Kannan, Andrea C Arpaci-Dusseau, Remzi H Arpaci-Dusseau, Yuangang Wang, Jun Xu, and Gopinath Palani. Designing a true direct-access file system with DevFS. In , page 241, 2018.
[17] Kimberly Keeton. The Machine: An architecture for memory-centric computing. In Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2015.
[18] Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, and Thomas Anderson. Strata: A cross media file system. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP '17, pages 460–477, New York, NY, USA, 2017. ACM.
[19] Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. Architecting phase change memory as a scalable DRAM alternative. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA), pages 2–13, New York, NY, USA, 2009. ACM.
[20] Bojie Li, Tianyi Cui, Zibo Wang, Wei Bai, and Lintao Zhang. SocksDirect: Datacenter sockets can be fast and compatible. In Proceedings of the ACM Special Interest Group on Data Communication, SIGCOMM '19, pages 90–103, New York, NY, USA, 2019. ACM.
[21] Changwoo Min, Sanidhya Kashyap, Steffen Maass, Woonhak Kang, and Taesoo Kim. Understanding manycore scalability of file systems. In Proceedings of the 2016 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC '16, pages 71–85, Berkeley, CA, USA, 2016. USENIX Association.
[22] Jiaxin Ou, Jiwu Shu, and Youyou Lu. A high performance file system for non-volatile main memory. In Proceedings of the Eleventh European Conference on Computer Systems, EuroSys '16, pages 12:1–12:16, New York, NY, USA, 2016. ACM.
[23] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The operating system is the control plane. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI '14, pages 1–16, Berkeley, CA, USA, 2014. USENIX Association.
[24] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. All file systems are not created equal: On the complexity of crafting crash-consistent applications. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI '14, pages 433–448, Berkeley, CA, USA, 2014. USENIX Association.
[25] William Pugh. Skip lists: A probabilistic alternative to balanced trees. Commun. ACM, 33(6):668–676, June 1990.
[26] Moinuddin K. Qureshi, Vijayalakshmi Srinivasan, and Jude A. Rivers. Scalable high performance main memory system using phase-change memory technology. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA), pages 24–33, New York, NY, USA, 2009. ACM.
[27] Sepideh Roghanchi, Jakob Eriksson, and Nilanjana Basu. ffwd: Delegation is (much) faster than you think. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP '17, pages 342–358, New York, NY, USA, 2017. ACM.
[28] Livio Soares and Michael Stumm. FlexSC: Flexible system call scheduling with exception-less system calls. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI '10, pages 33–46, Berkeley, CA, USA, 2010. USENIX Association.
[29] Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Mike Nishimoto, and Geoff Peck. Scalability in the XFS file system. In USENIX Annual Technical Conference, volume 15, 1996.
[30] Haris Volos, Sanketh Nalli, Sankarlingam Panneerselvam, Venkatanathan Varadarajan, Prashant Saxena, and Michael M. Swift. Aerie: Flexible file-system interfaces to storage-class memory. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys '14, pages 14:1–14:14, New York, NY, USA, 2014. ACM.
[31] Ying Wang, Dejun Jiang, and Jin Xiong. Caching or not: Rethinking virtual file system for non-volatile main memory. In . USENIX Association, 2018.
[32] Xiaojian Wu and A. L. Narasimha Reddy. SCMFS: A file system for storage class memory. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 39:1–39:11, New York, NY, USA, 2011. ACM.
[33] Jian Xu and Steven Swanson. NOVA: A log-structured file system for hybrid volatile/non-volatile main memories. In Proceedings of the 14th USENIX Conference on File and Storage Technologies, FAST '16, pages 323–338, Berkeley, CA, USA, 2016. USENIX Association.
[34] Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Steven Swanson, and Andy Rudoff. NOVA-Fortis: A fault-tolerant non-volatile main memory file system. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP '17, pages 478–496, New York, NY, USA, 2017. ACM.
[35] Esma Yildirim, Engin Arslan, Jangyoung Kim, and Tevfik Kosar. Application-level optimization of big data transfers through pipelining, parallelism and concurrency. IEEE Transactions on Cloud Computing, 4(1):63–75, 2016.
[36] Shengan Zheng, Morteza Hoseinzadeh, and Steven Swanson. Ziggurat: A tiered file system for non-volatile main memories and disks. In , pages 207–219, 2019.
[37] Deng Zhou, Wen Pan, Tao Xie, and Wei Wang. A file system bypassing volatile main memory: Towards a single-level persistent store. In Proceedings of the 15th ACM International Conference on Computing Frontiers, CF '18, pages 97–104, New York, NY, USA, 2018. ACM.
[38] Ping Zhou, Bo Zhao, Jun Yang, and Youtao Zhang. A durable and energy efficient main memory using phase change memory technology. In