Analyzing IO Amplification in Linux File Systems
Jayashree Mohan, Rohan Kadekodi, Vijay Chidambaram
Department of Computer Science, University of Texas at Austin
Department of Computer Science, University of Wisconsin-Madison
Abstract
We present the first systematic analysis of read, write, and space amplification in Linux file systems. While many researchers are tackling write amplification in key-value stores, IO amplification in file systems has been largely unexplored. We analyze data and metadata operations on five widely-used Linux file systems: ext2, ext4, XFS, btrfs, and F2FS. We find that data operations result in significant write amplification (2–32×) and that metadata operations have a large IO cost. For example, a single rename requires 648 KB of write IO in btrfs. We also find that small random reads result in read amplification of 2–13×. Based on these observations, we present the CReWS conjecture about the relationship between IO amplification, consistency, and storage space utilization. We hope this paper spurs people to design future file systems with less IO amplification, especially for non-volatile memory technologies.

Introduction

File systems were developed to enable users to easily and efficiently store and retrieve data. Early file systems such as the Unix Fast File System [1] and ext2 [2] were simple file systems. To enable fast recovery from crashes, crash-consistency techniques such as journaling [3] and copy-on-write [4] were incorporated into file systems, resulting in file systems such as ext4 [5] and XFS [6]. Modern file systems such as btrfs [7] include features such as snapshots and checksums for data, making the file system even more complex.

While the new features and strong crash-consistency guarantees have enabled wider adoption of Linux file systems, they have resulted in the loss of a crucial aspect: efficiency. File systems now maintain a large number of data structures on storage, and both data and metadata paths are complex and involve updating several blocks on storage. In this paper, we ask: what is the IO cost of various Linux file-system data and metadata operations? What is the IO amplification of various operations on Linux file systems? While this question is receiving wide attention in the world of key-value stores [8–13] and databases [14], it has been largely ignored in file systems. File systems have traditionally optimized for latency and overall throughput [15–18], not for IO or space amplification.

We present the first systematic analysis of read, write, and space amplification in Linux file systems. Read amplification is the ratio of total read IO to user-requested data. For example, if the user wanted to read 4 KB, and the file system read 24 KB off storage to satisfy that request, the read amplification is 6×. Write amplification is defined similarly. Space amplification measures how efficiently the file system stores data: if the user writes 4 KB, and the file system consumes 40 KB on storage (including data and metadata), the space amplification is 10×.

We analyze five widely-used Linux file systems that occupy different points in the design space: ext2 (no crash-consistency guarantees), ext4 (metadata journaling), XFS (metadata journaling), F2FS (log-structured file system), and btrfs (copy-on-write file system). We analyze the write IO and read IO resulting from various metadata operations, and the IO amplification arising from data operations. We also analyze these measures for two macro-benchmarks: compiling the Linux kernel, and Filebench Varmail [19]. We break down write IO cost into IO that was performed synchronously (during fsync()) and IO that was performed during delayed background checkpointing.

We find several interesting results.
For data operations such as overwriting a file or appending to a file, there was significant write amplification (2–32×). Small random reads resulted in a read amplification of 2–8×, even with a warm cache. Metadata operations such as directory creation or file rename result in significant storage IO: for example, a single file rename required 12–648 KB to be written to storage. Even though ext4 and XFS both implement metadata journaling, we find XFS significantly more efficient for file updates. Similarly, though F2FS and btrfs are both based on the log-structured approach (copy-on-write is a dual of the log-structured approach), we find F2FS to be significantly more efficient across all workloads. In fact, in all our experiments, btrfs was an outlier, producing the highest read, write, and space amplification. While this may partly arise from the new features of btrfs (that other file systems do not provide), the copy-on-write nature of btrfs is also part of the reason.

We find that IO amplification arises due to three main factors: the block-based interface, the crash-consistency mechanisms of file systems, and the different data structures maintained on storage to support features such as snapshots. Based on these observations, we introduce the CReWS conjecture. The CReWS conjecture states that for a general-purpose file system on a shared storage device, it is impossible to provide strong crash-consistency guarantees while also minimizing read, write, and space amplification. We discuss different designs of file systems, and show that for a general-purpose file system (used by many applications), minimizing write amplification leads to space amplification. We hope the CReWS conjecture helps guide future file-system designers.

With the advent of non-volatile memory technologies such as Phase Change Memory [20] that have limited write cycles, file-system designers can no longer ignore IO amplification. Such technologies offer a byte-based interface, which can greatly help reduce IO amplification. Data structures can be updated byte-by-byte if required, and critical metadata operations can be redesigned to have a low IO footprint. We hope this paper indicates the current state of IO amplification in Linux file systems, and provides a useful guide for the designers of future file systems.
We now analyze five Linux file systems which represent a variety of file-system designs. We first present our methodology, and then describe the file systems we study and the IO costs of their data and metadata operations.

Methodology. We use blktrace [21], dstat [22], and iostat [23] to monitor the block IO trace of different file-system operations such as rename() on five different Linux file systems. These tools allow us to accurately identify the following three metrics.
Write Amplification. The ratio of total storage write IO to user data. For example, if the user wrote 4 KB, and that resulted in the file system writing 8 KB to storage, the write amplification is 2. For operations such as file renames, where there is no user data, we simply report the total write IO. Write IO and write amplification should both be minimized.
Read Amplification. Similar to write amplification, this is the ratio of total storage read IO to user-requested data. For example, if the user wants to read 4 KB, and the file system reads 12 KB off storage to serve the read request, the read amplification is 3. We report the total read IO for metadata operations such as file creation. Read amplification should also be minimized.
Space Amplification. The ratio of bytes consumed on storage to bytes stored by the user. For example, suppose the user wants to store 4 KB. If the file system has to consume 20 KB on storage to store that 4 KB of user data, the space amplification is 5. Space amplification is a measure of how efficiently the file system uses storage, and thus should be minimized. We calculate space amplification based on the unique disk locations written to storage during the workloads.

Note that if the user stores one byte of data, the write and space amplification is trivially 4096, since the file system performs IO in 4096-byte block-sized units. We assume that a careful application will perform reads and writes in multiples of the block size. We also use noatime when mounting the file systems we study. Thus, our results represent amplification that will be observed even for careful real-world applications.
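To make these definitions concrete, the short C sketch below computes the three ratios from raw byte counters. In our setting such counters would be derived from the blktrace/iostat output for the device under test; the values shown here are purely illustrative placeholders.

/* Sketch: computing amplification metrics from measured byte counts.
 * The counter values below are hypothetical; in practice they would be
 * extracted from the block IO trace of the device under test. */
#include <stdio.h>

int main(void) {
    double user_bytes_written    = 10.0 * 1024 * 1024;  /* 10 MB of user writes */
    double storage_bytes_written = 40.0 * 1024 * 1024;  /* total write IO seen on disk */
    double user_bytes_read       = 4096.0;              /* one 4 KB user read */
    double storage_bytes_read    = 24576.0;             /* 24 KB read off storage */
    double unique_bytes_on_disk  = 40.0 * 1024 * 1024;  /* unique disk locations written */

    printf("write amplification: %.2f\n", storage_bytes_written / user_bytes_written);
    printf("read amplification:  %.2f\n", storage_bytes_read / user_bytes_read);
    printf("space amplification: %.2f\n", unique_bytes_on_disk / user_bytes_written);
    return 0;
}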
We analyze five different Linux file systems. Each of these file systems is (or was in the recent past) widely used, and represents a different point in the file-system design spectrum.

ext2. The ext2 file system [24] is a simple file system based on the Unix Fast File System [1]. ext2 does not include machinery for providing crash consistency, instead opting to fix the file system with fsck after reboot. ext2 writes data in place, and stores file metadata in inodes. ext2 uses direct and indirect blocks to find data blocks.

ext4. ext4 [2] builds on the ext2 codebase, but uses journaling [3] to provide strong crash-consistency guarantees. All metadata is first written to the journal before being checkpointed (written in place) to the file system. ext4 uses extents to keep track of allocated blocks.
XFS. The XFS [6] file system also uses journaling to provide crash consistency. However, XFS implements journaling differently from ext4. XFS was designed for high scalability and parallelism. XFS manages allocated inodes through an inode B+ tree, while free-space information is managed by separate B+ trees. The inodes keep track of their own allocated extents.
F2FS. F2FS [25] is a log-structured file system designed specifically for solid-state drives. Similar to the original LFS [26], F2FS writes all updates to storage sequentially. The logs in F2FS are composed of multiple segments, with segment utilization monitored using the Segment Information Table (SIT). Additionally, to avoid the wandering-tree problem [27], F2FS assigns a node ID to metadata structures such as inodes and direct and indirect blocks. The mapping between node ID and the actual block address is maintained in the Node Address Table (NAT), which has to be consulted to read data off storage, resulting in some overhead. Though data is written sequentially to the logs, NAT and SIT updates are first journaled and then written out in place.

btrfs. btrfs [7] is a copy-on-write file system based on B+ trees. The entire file system is composed of different B+ trees (e.g., the file-system tree, the extent tree, the checksum tree, etc.), all emerging from a single tree called the tree of tree roots. All btrfs metadata is located in these trees. The file-system tree stores information about all the inodes, while the extent tree holds the metadata related to each allocated extent. btrfs uses copy-on-write logging, in which any modification to a B+ tree leaf or node is preceded by copying the entire leaf or node to the log tree.
We measure the read IO, write IO, and space consumed by different file-system operations.
First, we focus on data operations: file read, file overwrite, and file append. For such operations, it is easy to calculate write amplification, since the workload involves a fixed amount of user data. The results are presented in Table 1.
Measure                   ext2     ext4     xfs      f2fs     btrfs

File Overwrite
  Write Amplification     2.00     4.00     2.00     2.66     32.65
  Space Amplification     1.00     4.00     2.00     2.66     31.17

File Append
  Write Amplification     3.00     6.00     2.01     2.66     30.85
  Space Amplification     1.00     6.00     2.00     2.66     29.77

File Read (cold cache)
  Read Amplification      6.00     6.00     8.00     9.00     13.00

File Read (warm cache)
  Read Amplification      2.00     2.00     5.00     3.00     8.00

Table 1: Amplification for Data Operations. The table shows the read, write, and space amplification incurred by different file systems when reading and writing files.

File Overwrite. The workload randomly seeks to a 4 KB-aligned location in a 100 MB file, does a 4 KB write (overwriting file data), then calls fsync() to make the data durable. The workload does 10 MB of such writes. From Table 1, we observe that ext2 has the lowest write and space amplification, primarily because it has no extra machinery for crash consistency; the overwrites are simply performed in place. The 2× write amplification arises from writing both the data block and the inode (to reflect the modified time). XFS has a similarly low write amplification, but higher space amplification since the metadata is first written to the journal. Compared to XFS, ext4 has higher write and space amplification: this is because ext4 writes the superblock and other information into its journal with every transaction; in other words, XFS journaling is more efficient than ext4 journaling. Interestingly, F2FS has an efficient implementation of the copy-on-write technique, leading to low write and space amplification. The roll-forward recovery mechanism of F2FS allows F2FS to write only the direct node block and data on every fsync(), with other data checkpointed infrequently [25]. In contrast, btrfs has a complex implementation of the copy-on-write technique (mostly due to a push to provide more features such as snapshots and stronger data integrity) that leads to extremely high space and write amplification. When btrfs is mounted with the default mount options that enable copy-on-write and checksumming of both data and metadata, we see 32× write amplification, as shown in Table 1. However, if checksumming of user data is disabled, the write amplification drops to 28×, and when the copy-on-write feature is also disabled for user data (metadata is still copied on write), the write amplification for overwrites comes down to about 18.6×. An interesting take-away from this analysis is that even if you pre-allocate all your files on these file systems, writes will still lead to 2–30× write amplification.

File Append. Our next workload appends a 4 KB block to the end of a file and calls fsync(). The workload does 10 MB of such writes. The appended file is initially empty. Our analysis for the file-overwrite workload mostly holds for this workload as well; the main difference is that more metadata (for block allocation) has to be persisted, leading to more write and space amplification for the ext2 and ext4 file systems. In F2FS and XFS, the block-allocation information is not persisted at the time of fsync(), leading to behavior similar to file overwrites. Thus, on XFS and F2FS, pre-allocating files does not provide a benefit in terms of write amplification.

We should note that write amplification is high in our workloads because we do small writes followed by an fsync(). The fsync() call forces file-system activity, such as committing metadata transactions, which has a fixed cost regardless of the size of the write. As Figure 1 shows, as the size of the write increases, the write amplification drops close to one. Applications which issue small writes should take note of this effect: even if the underlying hardware does not benefit from big sequential writes (such as SSDs), the file system itself benefits from larger writes.

Figure 1: Write Amplification for Various Write Sizes. The figure shows the write amplification observed for writes of various sizes followed by an fsync() call.
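For concreteness, the overwrite microbenchmark described above can be approximated by the following sketch. This is our reconstruction from the workload description, not the authors' actual harness; the file name and the use of pwrite() are illustrative assumptions. The append workload is the same loop writing at the current end of an initially empty file.

/* Sketch of the file-overwrite workload: random 4 KB-aligned overwrites of a
 * 100 MB file, each followed by fsync(), for roughly 10 MB of user writes. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK  4096
#define FILESZ (100L * 1024 * 1024)   /* pre-created 100 MB file */
#define TOTAL  (10L * 1024 * 1024)    /* 10 MB of user writes in total */

int main(void) {
    char buf[BLOCK];
    memset(buf, 'a', BLOCK);

    int fd = open("testfile", O_WRONLY);   /* file created beforehand */
    if (fd < 0) { perror("open"); return 1; }

    for (long done = 0; done < TOTAL; done += BLOCK) {
        off_t off = (rand() % (FILESZ / BLOCK)) * (off_t)BLOCK;   /* 4 KB-aligned */
        if (pwrite(fd, buf, BLOCK, off) != BLOCK) { perror("pwrite"); return 1; }
        if (fsync(fd) != 0) { perror("fsync"); return 1; }        /* make it durable */
    }
    close(fd);
    return 0;
}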
File Reads. The workload seeks to a random 4 KB-aligned block in a 10 MB file and reads one block. In Table 1, we make a distinction between a cold-cache read and a warm-cache read. On a cold cache, the file read usually involves reading a lot of file-system metadata: for example, the directory, the file inode, the superblock, etc. On subsequent reads (warm cache), reads to these blocks will be served out of memory. The cold-cache read amplification is quite high for all the file systems. Even in the case of simple file systems such as ext2, reading a file requires reading the inode. The inode read triggers a read-ahead of the inode table, increasing the read amplification. Since the read path does not include crash-consistency machinery, ext2 and ext4 have the same read amplification. The high read amplification of XFS results from reading the metadata B+ tree and read-ahead for file data. F2FS read amplification arises from reading extra metadata structures such as the NAT and SIT tables [25]. In btrfs, a cold-cache file read involves reading the tree of tree roots, the file-system tree, and the checksum tree, leading to high read amplification. On a warm cache, the read amplification of all file systems is greatly reduced, since global data structures are likely to be cached in memory. Even in this scenario, there is 2–8× read amplification for Linux file systems.
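A minimal sketch of the random-read workload appears below, again reconstructed from the description rather than taken from the paper's harness. How the cold cache is obtained is our assumption: one common approach is to drop the page cache (as root, by writing 3 to /proc/sys/vm/drop_caches) before the cold-cache run, while the warm-cache numbers come from repeating the read without dropping caches.

/* Sketch of the random-read workload: read one random 4 KB-aligned block
 * from a 10 MB file. Drop the page cache beforehand for the cold-cache case. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK  4096
#define FILESZ (10L * 1024 * 1024)    /* 10 MB file */

int main(void) {
    char buf[BLOCK];
    int fd = open("testfile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    off_t off = (rand() % (FILESZ / BLOCK)) * (off_t)BLOCK;   /* 4 KB-aligned offset */
    if (pread(fd, buf, BLOCK, off) != BLOCK) { perror("pread"); return 1; }

    close(fd);
    return 0;
}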
We now analyze the read and write IO (and the space consumed) by different file-system metadata operations. We present file create, directory create, and file rename. We have experimentally verified that the behavior of other metadata operations, such as file link, file deletion, and directory deletion, is similar to our presented results. Table 2 presents the results. Overall, we find that metadata operations are very expensive: even a simple file rename results in 12–648 KB being written to storage. On storage with limited write cycles, a metadata-intensive workload may wear out the storage quickly if any of these file systems is used.

In many file systems, there is a distinction between IO performed at the time of the fsync() call, and IO performed later in the background. The fsync() IO is performed in the critical path, and thus contributes to user-perceived latency. However, both kinds of IO ultimately contribute to write amplification. We show this breakdown for the write cost in Table 2.

Measure                   ext2     ext4     xfs      f2fs     btrfs

File Create
  Write Cost (KB)         24       52       52       16       116
    fsync                 4        28       4        4        68
    checkpoint            20       24       48       12       48
  Read Cost (KB)          24       24       32       36       40
  Space Cost (KB)         24       52       20       16       116

Directory Create
  Write Cost (KB)         28       64       80       20       132
    fsync                 4        36       4        8        68
    checkpoint            24       28       76       12       64
  Read Cost (KB)          20       20       60       36       60
  Space Cost (KB)         28       64       54       20       132

File Rename
  Write Cost (KB)         12       32       16       20       648

Table 2: IO Cost for Metadata Operations. The table shows the read, write, and space IO costs incurred by different file systems for different metadata operations. The write cost is broken down into IO at the time of fsync(), and checkpointing IO performed later.
File Create. The workload creates a new file in a pre-existing directory of depth three (e.g., a/b/c) and calls fsync() on the parent directory to ensure the creation is persisted. File creation requires allocating a new inode and updating a directory, and thus requires 16–116 KB of write IO and 24–40 KB of read IO in the various file systems. F2FS is the most efficient in terms of write IO (but requires a lot of read IO). Overall, ext2 is the most efficient at performing file creations. ext2, XFS, and F2FS all strive to perform the minimum amount of IO in the fsync() critical path. Due to metadata journaling, ext4 writes 28 KB in the critical path. btrfs performs the worst, requiring 116 KB of write IO (68 KB in the critical path) and 40 KB in checkpointing IO. The poor performance of btrfs results from having to update a number of data structures, including the file-system tree, the directory index, and backreferences, to create a file [7].
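The file-create workload can be sketched as follows; this is our reconstruction, and the directory path and file name are illustrative. The fsync() on the parent directory is the persistence step described above.

/* Sketch of the file-create workload: create a file in a directory of depth
 * three (a/b/c) and fsync() the parent directory so the creation is durable. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("a/b/c/newfile", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("create"); return 1; }
    close(fd);

    int dirfd = open("a/b/c", O_RDONLY);                       /* parent directory */
    if (dirfd < 0) { perror("open dir"); return 1; }
    if (fsync(dirfd) != 0) { perror("fsync dir"); return 1; }  /* persist the new entry */
    close(dirfd);
    return 0;
}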
Directory Create. The workload creates a new directory in an existing directory of depth four, and calls fsync() on the parent directory. Directory creation follows a similar trend to file creation. The main difference is the additional IO for creating the directory itself. As before, btrfs experiences the highest write IO cost and read IO cost for this workload. ext2 and F2FS are the most efficient.
File Rename. The workload renames a file within the same directory, and calls fsync() on the parent directory to ensure the rename is persisted. Renaming a file requires updating two directories. Performing the rename atomically requires machinery such as journaling or copy-on-write. ext2 is the most efficient, requiring only 32 KB of IO overall. Renaming a file is a surprisingly complex process in btrfs. Apart from linking and unlinking files, renames also change the backreferences of the files involved. btrfs also logs the inode of every file and directory (from the root to the parent directory) involved in the operation. The root directory is persisted twice, once for the unlink, and once for the link. As a result, btrfs is the least efficient, requiring 696 KB of IO to rename a single file. Even if many of these inodes are cached, btrfs renames are significantly less efficient than in other file systems.
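The rename workload follows the same pattern as the create workload; again, the paths below are illustrative, not taken from the authors' harness.

/* Sketch of the file-rename workload: rename a file within a directory and
 * fsync() the parent directory so the rename is durable. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    if (rename("dir/oldname", "dir/newname") != 0) { perror("rename"); return 1; }

    int dirfd = open("dir", O_RDONLY);                         /* parent directory */
    if (dirfd < 0) { perror("open dir"); return 1; }
    if (fsync(dirfd) != 0) { perror("fsync dir"); return 1; }  /* persist the rename */
    close(dirfd);
    return 0;
}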
Macro-benchmark: Kernel Compilation. To provide a more complete picture of the IO amplification of file systems, we also measure IO amplification for a macro-benchmark: uncompressing a Linux kernel tarball and compiling the kernel. The results are presented in Table 3. The file systems perform 6.09–6.41 GB of write IO and 0.23–0.27 GB of read IO. ext2 is the most efficient file system, achieving the lowest write and space cost. Among file systems providing crash-consistency guarantees, ext4 and XFS perform well, achieving lower write and space cost than the copy-on-write file systems F2FS and btrfs. btrfs performs the most write IO, and uses the most space on storage. The kernel compilation workload does not result in a lot of write amplification (or variation between file systems), because fsync() is not called often; thus each file system is free to group operations together to reduce IO and space cost. Even in this scenario, the higher write and space amplification of btrfs is observed.

Measure             ext2     ext4     xfs      f2fs     btrfs

Kernel Compilation
  Write Cost (GB)   6.09     6.19     6.21     6.38     6.41
  Read Cost (GB)    0.25     0.24     0.24     0.27     0.23
  Space Cost (GB)   5.94     6.03     5.96     6.20     6.25

Filebench Varmail
  Write Cost (GB)   1.52     1.63     1.71     1.82     2.10
  Read Cost (KB)    116      96       116      1028     0
  Space Cost (GB)   1.45     1.57     1.50     1.77     2.02

Table 3: IO Cost for Macro-benchmarks. The table shows the read, write, and space IO costs incurred by different file systems when compiling the Linux kernel 3.0 and when running the Varmail benchmark in the Filebench suite.

Macro-benchmark: Filebench Varmail. We ran the Varmail benchmark from the Filebench benchmark suite [19] with the following parameters: 16 threads, 100K files in total, and a mean file size of 16 KB. Varmail simulates a mail server, and performs small writes followed by fsync() on different files using multiple threads. In this fsync()-heavy workload, the effects of write, read, and space amplification are clear. ext2 still performs the least IO and uses the least storage space. btrfs performs 38% more write IO than ext2, and uses 39% more space on storage. F2FS performs better than btrfs, but has a high read cost (10× that of the other file systems).

Discussion. IO and space amplification arise in Linux file systems from the use of the block interface, from crash-consistency techniques, and from the need to maintain and update a large number of data structures on storage. A comparison of XFS and ext4 shows that even when the same crash-consistency technique (journaling) is used, the implementation leads to a significant difference in IO amplification. With byte-addressable non-volatile memory technologies arriving on the horizon, using such block-oriented file systems will be disastrous. We need to develop lean, efficient file systems where operations such as file renames result in a few bytes written to storage, not tens to hundreds of kilobytes.
The CReWS Conjecture
Inspired by the RUM conjecture [28] from the world of key-value stores, we propose a similar conjecture for file systems: the CReWS conjecture. (We spent some time trying to come up with something cool like RUM, but alas, this is the best we could do.) The CReWS conjecture states that it is impossible for a general-purpose file system to provide strong crash (C)onsistency guarantees while simultaneously achieving low (R)ead amplification, (W)rite amplification, and (S)pace amplification.

By a general-purpose file system we mean a file system used by multiple applications on a shared storage device. If the file system can be customized for a single application on a dedicated storage device, we believe it is possible to achieve the other four properties simultaneously.

For example, consider a file system designed specifically for an append-only log such as Corfu [29] (without the capability to delete blocks). The storage device is dedicated to the append-only log. In this scenario, the file system can drop all metadata and treat the device as a big sequential log; storage block 0 is block 0 of the append-only log, and so on. Since there is no metadata, the file system is implicitly consistent at all times, and there is low write, read, and space amplification. However, this only works if the storage device is completely dedicated to one application.

Note that we can extend our simple file system to a case where there are N applications. In this case, we would divide the storage into N units, and assign one unit to each application. For example, let's say we divide up a 100 GB disk among 10 applications. Even if an application only uses one byte, the rest of its 10 GB is not available to other applications; thus, this design leads to high space amplification.

In general, if multiple applications want to share a single storage device without space amplification, dynamic allocation is required. Dynamic allocation necessitates metadata to keep track of which resources are available; if file data can be dynamically located, metadata such as the inode is required to keep track of the data locations. The end result is a simple file system such as ext2 [24] or NoFS [15]. While such systems offer low read, write, and space amplification, they compromise on consistency: ext2 does not offer any guarantees on a crash, and a crash during a file rename on NoFS could result in the file disappearing.

File systems that offer strong consistency guarantees such as ext4 and btrfs incur significant write amplification and space amplification, as we have shown in previous sections. Thus, to the best of our knowledge, the CReWS conjecture is true.
Implications. The CReWS conjecture has useful implications for the design of storage systems. If we seek to reduce write amplification for a specific application such as a key-value store, it is essential to sacrifice one of the above aspects. For example, by specializing the file system to a single application, it is possible to minimize the three amplification measures. For applications seeking to minimize space amplification, the file-system design might sacrifice low read amplification or strong consistency guarantees. For non-volatile memory file systems [30, 31], given the limited write cycles of non-volatile memory [32], file systems should be designed to trade space amplification for write amplification; given the high density of non-volatile memory technologies [20, 33–36], this should be acceptable. Thus, given a goal, the CReWS conjecture focuses our attention on possible avenues to achieve it.
Conclusion

We analyze the read, write, and space amplification of five Linux file systems. We find that all examined file systems have high write amplification (2–32×) and read amplification (2–13×). File systems that use crash-consistency techniques such as journaling and copy-on-write also suffer from high space amplification (2–30×). Metadata operations such as file renames have a large IO cost, requiring 32–696 KB of IO for a single rename. Based on our results, we present the CReWS conjecture: that a general-purpose file system cannot simultaneously achieve low read, write, and space amplification while providing strong consistency guarantees. With the advent of byte-addressable non-volatile memory technologies, we need to develop leaner file systems without significant IO amplification: the CReWS conjecture will hopefully guide the design of such file systems.

References

[1] Marshall K McKusick, William N Joy, Samuel J Leffler, and Robert S Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems (TOCS), 2(3):181–197, 1984.
[2] Avantika Mathur, Mingming Cao, Suparna Bhattacharya, Andreas Dilger, Alex Tomas, and Laurent Vivier. The new ext4 filesystem: current status and future plans. In Proceedings of the Linux Symposium, volume 2, pages 21–33. Citeseer, 2007.
[3] Robert Hagmann. Reimplementing the Cedar file system using logging and group commit. In SOSP, 1987.
[4] Dave Hitz, James Lau, and Michael Malcolm. File System Design for an NFS File Server Appliance. In Proceedings of the USENIX Winter Technical Conference (USENIX Winter '94), San Francisco, California, January 1994.
[5] Avantika Mathur, Mingming Cao, Suparna Bhattacharya, Andreas Dilger, Alex Tomas, and Laurent Vivier. The New Ext4 Filesystem: Current Status and Future Plans. In Ottawa Linux Symposium (OLS '07), Ottawa, Canada, July 2007.
[6] Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Mike Nishimoto, and Geoff Peck. Scalability in the XFS file system. In USENIX Annual Technical Conference, volume 15, 1996.
[7] Ohad Rodeh, Josef Bacik, and Chris Mason. Btrfs: The Linux B-tree filesystem. ACM Transactions on Storage (TOS), 9(3):9, 2013.
[8] Michael A Bender, Martin Farach-Colton, Jeremy T Fineman, Yonatan R Fogel, Bradley C Kuszmaul, and Jelani Nelson. Cache-oblivious streaming B-trees. In Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 81–92. ACM, 2007.
[9] Leonardo Marmol, Swaminathan Sundararaman, Nisha Talagala, and Raju Rangaswami. NVMKV: A scalable, lightweight, FTL-aware key-value store. In USENIX Annual Technical Conference (USENIX ATC '15), pages 207–219, 2015.
[10] Xingbo Wu, Yuehai Xu, Zili Shao, and Song Jiang. LSM-trie: An LSM-tree-based ultra-large key-value store for small data items. In USENIX Annual Technical Conference (USENIX ATC '15), pages 71–82, 2015.
[11] Russell Sears and Raghu Ramakrishnan. bLSM: A general purpose log structured merge tree. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 217–228. ACM, 2012.
[12] Pradeep J Shetty, Richard P Spillane, Ravikant R Malpani, Binesh Andrews, Justin Seyster, and Erez Zadok. Building workload-independent storage with VT-trees. In 11th USENIX Conference on File and Storage Technologies (FAST '13), pages 17–30, 2013.
[13] Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C Arpaci-Dusseau, and Remzi H Arpaci-Dusseau. WiscKey: Separating keys from values in SSD-conscious storage. In 14th USENIX Conference on File and Storage Technologies (FAST '16), pages 133–148, 2016.
[14] Percona TokuDB.
[15] Vijay Chidambaram, Tushar Sharma, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Consistency Without Ordering. In Proceedings of the 10th USENIX Symposium on File and Storage Technologies (FAST '12), pages 101–116, San Jose, California, February 2012.
[16] Vijay Chidambaram, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Optimistic Crash Consistency. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13), Farmington, PA, November 2013.
[17] Thanumalayan Sankaranarayana Pillai, Ramnatthan Alagappan, Lanyue Lu, Vijay Chidambaram, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Application crash consistency and performance with CCFS. In 15th USENIX Conference on File and Storage Technologies (FAST '17), pages 181–196, Santa Clara, CA, 2017. USENIX Association.
[18] William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Esmet, Yizheng Jiao, Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Leif Walsh, et al. BetrFS: A right-optimized write-optimized file system. In FAST, pages 301–315, 2015.
[19] Andrew Wilson. The new and improved filebench, 2008.
[20] Simone Raoux, Geoffrey W. Burr, Matthew J. Breitwisch, Charles T. Rettner, Y. C. Chen, Robert M. Shelby, Martin Salinga, Daniel Krebs, S. H. Chen, H. L. Lung, and C. H. Lam. Phase-change random access memory: A scalable technology. IBM Journal of Research and Development, 52(4.5):465–479, 2008.
[21] Block I/O Layer Tracing. https://linux.die.net/man/8/blktrace, December 2016.
[22] Generating System Resource Statistics. https://linux.die.net/man/1/dstat, December 2016.
[23] Reporting I/O Statistics. https://linux.die.net/man/1/iostat, December 2016.
[24] Remy Card, Theodore Ts'o, and Stephen Tweedie. Design and Implementation of the Second Extended Filesystem. In First Dutch International Symposium on Linux, Amsterdam, Netherlands, December 1994.
[25] Changman Lee, Dongho Sim, Joo Young Hwang, and Sangyeun Cho. F2FS: A new file system for flash storage. In FAST, pages 273–286, 2015.
[26] Mendel Rosenblum and John K Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems (TOCS), 10(1):26–52, 1992.
[27] Artem B Bityutskiy. JFFS3 design issues, 2005.
[28] Manos Athanassoulis, Michael S Kester, Lukas M Maas, Radu Stoica, Stratos Idreos, Anastasia Ailamaki, and Mark Callaghan. Designing access methods: The RUM conjecture. In International Conference on Extending Database Technology, pages 461–466, 2016.
[29] Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, and John D Davis. CORFU: A shared log design for flash clusters. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI '12), pages 1–14, 2012.
[30] Jian Xu and Steven Swanson. NOVA: A log-structured file system for hybrid volatile/non-volatile main memories. In FAST, 2016.
[31] Subramanya R Dulloor, Sanjay Kumar, Anil Keshavamurthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson. System software for persistent memory. In Proceedings of the Ninth European Conference on Computer Systems, page 15. ACM, 2014.
[32] Ping Zhou, Bo Zhao, Jun Yang, and Youtao Zhang. A durable and energy efficient main memory using phase change memory technology. In ACM SIGARCH Computer Architecture News, volume 37, pages 14–23. ACM, 2009.
[33] Chun Jason Xue, Youtao Zhang, Yiran Chen, Guangyu Sun, J Jianhua Yang, and Hai Li. Emerging non-volatile memories: Opportunities and challenges. In Proceedings of the Seventh IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pages 325–334, 2011.
[34] Yenpo Ho, Garng M Huang, and Peng Li. Nonvolatile Memristor Memory: Device Characteristics and Design Implications. In Proceedings of the 2009 International Conference on Computer-Aided Design, pages 485–490. ACM, 2009.
[35] Dmitri B. Strukov, Gregory S. Snider, Duncan R. Stewart, and R. Stanley Williams. The missing memristor found. Nature, 2008.
[36] Leon Chua. Resistance switching memories are memristors. Applied Physics A, 102(4):765–783, 2011.