Accelerating Filesystem Checking and Repair with pFSCK
David Domingo, Kyle Stratton, Sudarsun Kannan
Rutgers University
Abstract
File system checking and recovery (C/R) tools play a pivotal role in increasing the reliability of storage software, identifying and correcting file system inconsistencies. However, with increasing disk capacity and data content, file system C/R tools notoriously suffer from long runtimes. We posit that current file system checkers fail to exploit CPU parallelism and the high throughput offered by modern storage devices. To overcome these challenges, we propose pFSCK, a tool that redesigns C/R to enable fine-grained parallelism at the granularity of inodes without impacting the correctness of C/R's functionality. To accelerate C/R, pFSCK first employs data parallelism by identifying functional operations in each stage of the checker and isolating dependent operations and their shared data structures. However, fully isolating shared structures is infeasible, consequently requiring serialization that limits scalability. To reduce the impact of synchronization bottlenecks and exploit CPU parallelism, pFSCK designs pipeline parallelism, allowing multiple stages of C/R to run simultaneously without impacting correctness. To realize efficient pipeline parallelism for different file system data configurations, pFSCK provides techniques for ordering updates to global data structures, efficient per-thread I/O cache management, and dynamic thread placement across different passes of a C/R. Finally, pFSCK designs a resource-aware scheduler aimed at reducing the impact of C/R on other applications sharing CPUs and the file system. Evaluation of pFSCK shows more than 2.6x gains over e2fsck and more than 1.8x over XFS's checker, which provides coarse-grained parallelism.
Modern ultra-fast storage devices such as SSDs, NVMe, and byte-addressable NVM storage technologies offer higher bandwidth and lower latency compared to hard disks, providing better opportunities for exploiting CPU parallelism. While I/O access performance has increased, storage hardware errors have continued to grow, coupled with newer and exploratory high-performance designs impacting file system reliability [9, 19, 39]. For decades, file system checking and repair tools (referred to as C/R henceforth) have played a pivotal role in increasing the reliability of software storage stacks, identifying and correcting file system inconsistencies [35]. In fact, in the event of a system crash or storage failure in data centers, file system checkers are typically used as the first remedial solution for system recovery [19].

File system C/R tools work by identifying and fixing structural inconsistencies in file system metadata, such as inconsistencies in inodes, data and inode bitmaps, links, and directory entries. Well-known and widely used tools such as e2fsck (the file system checker for Ext4) divide C/R across multiple stages (commonly referred to as passes), with each pass responsible for checking a file system structure (e.g., directories, files, links). However, C/Rs are known to be notoriously slow, showing a linear increase in C/R time with an increase in file and directory count and disk utilization [21, 32–35]. Although modern flash and NVM technologies provide lower latency and higher bandwidth, current C/R tools fail to fully utilize such hardware capabilities or multicore CPU parallelism.
While modern C/Rs have attempted to increase parallelism, their coarse-grained approaches, such as parallelizing C/R across logical volumes or logical groups, are insufficient to accelerate C/R on file systems with data imbalance across logical groups [18, 21, 33, 36]. To overcome such limitations, we propose pFSCK, a parallel C/R that exploits CPU parallelism and modern storage's high bandwidth to accelerate file system checking and repair time without compromising correctness. Accelerating file system C/R could significantly reduce system downtime and improve storage availability [9, 18, 19, 33]. In this pursuit, pFSCK introduces fine-grained parallelism, i.e., parallelism at the granularity of inodes and directory blocks, resulting in significantly faster execution compared to traditional C/Rs.

pFSCK first employs data parallelism by breaking up the work done at each pass, redesigning data structures for scalability, and allowing multiple threads to process them. Although data parallelism accelerates checking, updates to global data structures (e.g., bitmaps) within each pass are designed to match the file system's layout (e.g., the block bitmap in an Ext4 file system) and must be synchronized and serialized to ensure checking correctness. As a result, with increasing threads, the cost of synchronization and serialization can quickly outweigh the performance gains. Hence, pFSCK introduces pipeline parallelism to parallelize C/R along the logical flow (i.e., across multiple passes).

Supporting data and pipeline parallelism within pFSCK requires addressing several challenges. First, updates to shared data structures must be ordered for C/R correctness. For example, a directory cannot be certified as error-free by the directory checking pass unless all its files are verified as consistent by the inode checking pass.
To address these ordering constraints, taking inspiration from out-of-order execution in hardware processors, we isolate the global data structures and perform all necessary operations in parallel, but certify correctness only when the results are merged. Second, static partitioning of CPU threads across different passes is suboptimal because each pass checks different metadata (e.g., files, directories, links), and the amount of each kind of metadata can vary across different file system configurations. In addition, the time to process different types of metadata varies significantly (e.g., checking a directory can take significantly longer than checking a file). Hence, we propose a dynamic thread scheduler that monitors progress across different passes of pFSCK and uses the pending work ratio for thread assignment.

Third, I/O optimizations such as I/O caching and read-ahead mechanisms in current C/Rs are not designed for multi-threaded parallelism, which we address by designing thread-aware I/O caching, thereby substantially reducing I/O wait times. Finally, to exploit multi-core parallelism in ways that do not affect the performance of other co-running applications that share CPUs or access the same disks checked by C/R (online checking), we propose a resource-aware mechanism that allows for scaling the number of threads used to perform checking by monitoring the overall CPU utilization of the system.

The combination of pFSCK's above techniques significantly reduces C/R runtime. For example, pFSCK's data parallelism and pipeline parallelism on an 800GB file system on NVMe reduce runtime by up to 2.6x for a file-intensive disk configuration.
For a directory-intensive disk configuration, pFSCK is able to reduce runtime by up to 1.6x compared to e2fsck and by up to 1.8x over the XFS file system checker. In the pursuit of increasing multicore parallelism, pFSCK increases memory usage by only 1.17x over e2fsck (from 3 GB in e2fsck to 3.5 GB in pFSCK for an 800GB file system). Further, pFSCK's scheduler increases gains by 1.1x over pFSCK without a scheduler. When sharing CPUs between pFSCK and RocksDB, the system-wide resource-aware mechanism limits pFSCK's performance degradation to 1.07x and the overhead on RocksDB to 1.05x. Finally, pFSCK provides a significant performance boost during live checking compared to e2fsck, improving performance by 1.7x.

Storage hardware advancements have opened up the potential for accelerating I/O-bound applications. One critical set of applications that could potentially benefit are file system checking and repair (C/R) tools, which run in almost all computing systems. We first give some background on current hardware trends and C/R tools, and then discuss prior approaches that accelerate C/R and their limitations.
With increasing core counts and the advent of faster storage devices, system performance has seen vast speedups due to hardware advancements. More specifically, flash memory technologies like PCI-attached SSD and NVMe devices provide increased throughput (8-16 GB/s) and lower access latency (20-50 µs) compared to hard disks. At the other end, fast storage-class memories such as Intel's DC Persistent Memory [4] and other classes of nonvolatile memory (NVM) directly attach to the memory controller and provide byte-addressable persistence. These technologies scale 4x larger than DRAM capacity, with variable read (100-200ns) and write (400-800ns) latencies, and bandwidth capabilities ranging from 8 GB/s to 20 GB/s.

On the software side, a huge body of prior research is in progress to redesign and optimize file systems for modern storage hardware, which includes file systems for SSD, NVMe, and NVMs [26, 29, 30, 37, 41] and the storage stack in general [13, 27]. Besides, the open-source community is investing substantial effort to optimize traditional file systems such as Ext4 and XFS for modern storage hardware, given the wide usage and reliability of these file systems. For example, file systems such as Ext4-DAX continue to retain the traditional file system structure and optimize performance by removing components such as the page cache, schedulers, and logging. Reducing data corruption for both these approaches can be challenging and requires a few years of production use [8, 24].

Since the dawn of file systems, file system consistency has always been an issue. Though modern file systems deploy mechanisms such as journaling, copy-on-write, log-structured writes, and soft updates to handle inconsistencies, they cannot fix errors that may have been present due to corruptions manifested in the past by events such as a failing disk, bit flips, overheating, or correlated crashes [10–12, 25, 42].
A widely used approach to handle disk and file system corruptions and errors is to check and fix inconsistencies using file system C/R tools, such as e2fsck and xfs_repair, that scan file system metadata for corruptions and fix them. Most C/R tools are designed specifically for the layout of a file system. C/Rs such as e2fsck and xfs_repair have multiple passes that check inode consistency, directory consistency, connectivity of all directories, directory entries, and lastly, the reference counts of inodes and blocks. The repair process involves fixing errors such as updating a block bitmap that does not indicate a block referenced by a file as being used. For more complex errors (e.g., a block referenced by several inodes), administrators are given an option to accept or decline repairs.
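To make the bitmap-style repair described above concrete, the following is a minimal sketch (with hypothetical data structures, not e2fsck's actual code) of a block-bitmap consistency check: rebuild the expected bitmap from the blocks referenced by inodes and report blocks whose on-disk bitmap state disagrees.

```python
# Hypothetical sketch of a block-bitmap consistency check: rebuild the
# "expected" bitmap from inode block references, then compare it with
# the on-disk bitmap and report blocks whose state disagrees. A repair
# pass would fix the mismatch by rewriting the on-disk bitmap.
def check_block_bitmap(inodes, on_disk_bitmap, total_blocks):
    expected = [False] * total_blocks
    for inode in inodes:
        for block in inode["blocks"]:   # blocks referenced by this inode
            expected[block] = True
    return [b for b in range(total_blocks) if expected[b] != on_disk_bitmap[b]]

inodes = [{"blocks": [0, 2]}, {"blocks": [5]}]
disk_bitmap = [True, True, True, False, False, True]  # block 1 wrongly marked used
print(check_block_bitmap(inodes, disk_bitmap, 6))  # → [1]
```

A real checker operates on packed on-disk bitmaps and must also detect the converse case shown here (a block marked used that no inode references) before deciding whether to prompt the administrator.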
With increasing disk capacities and file system sizes, C/Rs notoriously tend to run longer; specifically, the increase in system downtime may dominate the repair cost by orders of magnitude [1, 6, 7, 22, 31]. We next discuss state-of-the-art C/R optimizations for offline (unmounted file systems) and online C/Rs and their limitations.
Offline C/Rs.
To reduce C/R time, open-source C/Rs, such as Ext4's e2fsck and the XFS file system's xfs_repair, parallelize checking across disks (e2fsck) or logical groups (xfs_repair). C/R techniques like Ffsck [35] and Chunkfs [23] speed up checking by modifying the file system to provide a better balance across logical groups. For example, Chunkfs is designed to utilize disk bandwidth by partitioning the file system into smaller, isolated groups that can be repaired individually and in parallel, whereas Ffsck [35] rearranges metadata blocks within the file system to reduce seek cost and optimize file system traversal. SQCK [17] enhances C/R by utilizing declarative queries for consistency checking across file system structures.

Overall, while prior C/R designs have attempted to improve C/R performance, they suffer from several weaknesses. First, Chunkfs (and XFS) require a coarse-grained separation of file system blocks to accelerate file system checking. Take the case of an imbalanced file system with several large files spread across different logical groups of a disk. Interestingly, we show in Section § 6 that xfs_repair's parallelism is limited to simple inode scans, omitting any parallelism for checking directory metadata. Other techniques such as SQCK and Ffsck require intrusive changes to the way file system metadata is manipulated or need complete rebuilding of the C/R, which could reduce or prevent widespread adoption.
Online C/Rs.
To reduce system C/R downtime, proprietary online C/Rs such as the WAFL file system's Iron [28] (a NetApp-based C/R tool for the WAFL file system) and ReFS [16] fix corruptions as they are encountered, allowing file system operations to continue. WAFL-Iron performs incremental live C/R. Because storage blocks are made available while C/R is in progress, WAFL-Iron imposes invariants such as (1) checking all blocks before any software use, and (2) checking ancestor blocks (directories) before any data or metadata block (inode block) is checked. These invariants avoid repeated checking of an inode for every data block and also reduce memory usage. To scale C/R to petabytes, WAFL-Iron expects the presence of block-level checksums, RAID, and most importantly, good storage practices by customers. Open-source C/Rs such as e2fsck allow for online checking by utilizing LVM-based snapshotting and running C/R on the snapshot while the file system is still in use [2]. We evaluate e2fsck's LVM-based online C/R in Section § 6. Recon protects file system metadata from buggy operations by verifying metadata consistency at runtime [14]. Doing so allows Recon to detect metadata corruption before committing it to disk, preventing error propagation. Recon does not perform a global scan and hence cannot identify or fix errors originating from hardware failures.
C/R Correctness.
To ensure the correctness and crash-consistency of C/Rs themselves and to recover more reliably in light of system faults, Rfsck-lib [15] provides C/Rs with robust undo logging. pFSCK's fine-grained parallelism goals are orthogonal to Rfsck-lib; however, incorporating Rfsck-lib can improve the reliability of pFSCK in case of failures.
Summary.
To summarize, unlike prior systems, pFSCK is aimed at fine-grained parallelism, the ability to utilize storage bandwidth efficiently across multiple passes of C/R, adapting to system resources, and the capability to reduce the impact on other applications.
(a) File Count Sensitivity: Runtime of e2fsck as total file count increases. (b) Directory Count Sensitivity: Runtime of e2fsck as total directory count increases.
Figure 1: Runtime of C/R for an 800GB file system with varying counts of files or directories
In the pursuit of accelerating C/Rs, we first decipher the performance bottlenecks of the widely used Ext4 file system's e2fsck C/R tool. We first provide an overview of e2fsck and then examine e2fsck's runtime for different file system configurations. For brevity, we study xfs_repair in Section § 6.
E2fsck uses five sequential passes for C/R: the first pass (referred to as Pass-1) checks the consistency of inode metadata; Pass-2 checks directory consistency; Pass-3 checks directory connectivity; Pass-4 checks reference counts; finally, Pass-5 checks data and metadata bitmap consistency.
To analyze and decipher the breakdown of e2fsck's runtime, we run e2fsck on file systems with varying configurations. We conduct our analysis on a 64-core dual Intel® Xeon Gold 5218 at 2.30GHz, with 64GB of DDR memory and 1TB of NVMe flash storage, running Ubuntu 18.04.1. We fill the file system using fs_mark, an open-source file system benchmark tool [40]. For our analysis, we mainly focus on file systems without corruptions. To get a finer understanding of how e2fsck scales with file system configurations, we study the sensitivity of C/R's runtime to multiple file system variables such as file count and directory count.
File-intensive file systems.
First, to understand how file count affects runtime, we generate multiple file-intensive file system configurations with a 95:1 file-to-directory ratio. Operating on file-intensive file systems, Pass-1, which checks the consistency of inode structures, dominates e2fsck runtime, followed by Pass-2, which checks directory block consistency. Figure 2 shows the function-wise breakdown of Pass-1, which checks the consistency of file inodes as well as tracks the directory blocks to be examined in the next pass. We notice that dcigettext, a seemingly innocuous language translator used for error handling, gets (incorrectly) invoked for every inode check and poses a substantial slowdown on C/R performance. Other Pass-1 steps such as check blocks, which checks the blocks referenced by an inode; next inode, which reads the next inode blocks from disk; mark bitmap, which updates global bitmaps to track the metadata encountered; and icount store, which stores inode references, also increase in runtime. Although the number of directories is small, the Pass-2 (directory checking pass) runtime increases because the number of directory blocks that store directory entries increases. Pass-3 checks connectivity and ensures the reachability of directories from the root. For a small directory count, its runtime is small compared to the runtime of Pass-1 and Pass-2. We also find that increasing file size while keeping the number of files constant does not increase e2fsck runtime significantly (not shown for brevity).

Figure 2: e2fsck Pass-1 Time Breakdown. Time spent within the inode checking pass (Pass-1) as the total file count increases.
Directory-intensive file system.
We next analyze the impact of directory count on the runtime of e2fsck in Figure 1b. We generate multiple directory-intensive file systems, each with an increasing number of directories. We define a directory-intensive file system as a file system with over a 1:1 file-to-directory ratio. To ensure that each directory requires the same amount of work, we create a single file in each directory. As expected, with an increase in directory count, Pass-2's runtime significantly increases due to the increased number of directory blocks, as well as the fact that directory blocks hold checksums for consistency, forcing Pass-2 to recompute and verify these directory block checksums. Interestingly, Pass-1's runtime also increases; this is because Pass-1 is responsible for identifying directory blocks and adding them to a directory block list (db_list) to be used in Pass-2.
To understand the computational vs. I/O bottlenecks, in Figure 3 we show the compute vs. I/O wait time ratio for e2fsck. As shown, compute time dominates I/O wait time. In all our experimental runs, we observe that e2fsck's peak and average I/O bandwidth usage is 260 MB/s and 100 MB/s, respectively, on an NVMe device with 2 GB/s sequential and 512 MB/s random read bandwidth. (We reported the dcigettext issue to e2fsck developers, and the fix has been upstreamed.) In general, file system C/R tools (e.g., e2fsck and xfs_repair) suffer not only from I/O access time but also from computational cost.

Figure 3: Percent of I/O wait time (Left: File-intensive file system, Right: Directory-intensive file system)
Summary.
To summarize, our analysis shows high runtime overheads of e2fsck across file system configurations; this is mainly due to the serial, single-threaded nature of e2fsck, designed in the era of spinning hard drives. The linear complexity of its runtime is unsuitable as file system capacities trend upward, potentially taking hours, or even days, to check datacenter-scale file systems. Besides, C/R's repair work when there are file system inconsistencies could further increase C/R runtime.

pFSCK aims to overcome the limitations of traditional file system C/Rs by exploiting fine-grained multi-core parallelism, higher disk bandwidth, efficient use of CPUs via a pFSCK scheduler, and ways to reduce the impact on other co-running applications. We next outline the goals and provide an overview of pFSCK's design.
Decrease file system C/R runtime.
The main goal is to make file system C/R faster. We want to increase the speed at which file system metadata can be scanned and inconsistencies identified, without compromising repair capabilities.
Adapt to different file system configurations.
C/R performance should improve regardless of file system size, utilization, or configuration, such as a file-intensive or directory-intensive file system.
Support offline and online C/R.
C/Rs can be used when a disk is not mounted and the file system is offline, or when a system is online and the file system is actively used. Hence, pFSCK aims to support both offline and online C/R.
Adapt to system utilization.
C/R should have the ability to adapt to varying system resource utilization over time to reduce the potential performance impact on any currently running applications. pFSCK aims to adapt to varying system-wide CPU use.

4.2 pFSCK Design Insights
We next describe the key design insights to realize the above goals.
Insight 1: Maximize potential bandwidth through multiple cores and data parallelism.
To overcome the bottlenecks of current serial C/Rs and of C/Rs that parallelize at a coarse granularity, such as across logical volumes or logical groups, pFSCK exploits fine-grained inode and directory block parallelism. Toward this goal, pFSCK first introduces data parallelism to C/R for better utilization of the CPU parallelism enabled by modern storage bandwidth capabilities. At a high level, in each pass, basic file system structures such as inodes, directory blocks, dirents, and links are divided and checked across a pool of worker threads. While seemingly simple, achieving data parallelism requires data structure isolation across threads to reduce synchronization bottlenecks.
Insight 2: Enable pipeline parallelism by reducing inter-pass dependencies.
Though data parallelism improves performance, updates to several inter-pass global data structures used for building a consistent view of the file system and identifying inconsistencies (e.g., bitmaps) must be serialized. Consequently, this limits data parallelism's scalability at higher CPU counts and even degrades performance at higher thread counts due to contention on the shared structures. Pipeline parallelism breaks the rigid wall across passes, allowing multiple passes to be executed simultaneously along the logical flow of the application, thereby increasing CPU parallelism and reducing the performance impact of serialization. Realizing pipeline parallelism requires managing per-pass thread pools, isolating inter-pass shared structures using divide-and-merge approaches, delineating checking and certification of inodes, and reducing I/O wait times.
Insight 3: Adapt to file system configurations with dynamic thread scheduling.
Enabling data and pipeline parallelism requires assigning threads across the different passes of pFSCK. Static partitioning of CPU threads across passes is suboptimal due to the lack of information about metadata types (files, directories, links) and the work in each pass; for example, checking directory blocks in Pass-2 (the directory checking pass) requires more processing time than checking inodes in Pass-1 (the file checking pass), as discussed in Section § 3. To overcome the challenge of accelerating C/R for different file system configurations, we design a dynamic thread scheduler that assigns threads to process different types of file system objects as they are discovered and migrates threads across different passes of the pipeline.
Insight 4: Reduce system impact through resource utilization awareness.
File system C/Rs could potentially run alongside other applications that share CPUs while performing checking on separate disks. Given pFSCK's goal of exploiting available CPUs, this could potentially impact other co-running applications. Similarly, C/R could run on disks that are also actively used by other applications to store data. To reduce the overall system impact on co-running applications as well as on pFSCK, we equip pFSCK's scheduler with resource awareness to dynamically identify the number of cores to use at any single point in time, minimizing the potential impact on other co-running applications without significantly impacting pFSCK's performance.
To realize the goals of pFSCK, we discuss the design and implementation of pFSCK's data parallelism, pipeline parallelism, dynamic thread scheduler, and resource-aware scheduling. pFSCK extends e2fsck to realize these design changes.

pFSCK's data parallelism divides the work in each pass among a group of worker threads at the granularity of inodes and enables concurrent C/R. While seemingly simple, efficient data parallelism during C/R demands an efficient threading model for fine-grained inode parallelism, functional separation of C/R within each pass, and per-thread contexts for isolating data structures and reducing synchronization cost.
Fine-grained Inode-level Parallelism.
For fine-grained inode-level parallelism, pFSCK uses the superblock information to identify the total number of inodes in the file system and evenly divides the inodes across a given set of C/R workers. To reduce the cost of worker thread management, pFSCK uses a thread-pool framework [38] that provides the ability to assign tasks to multiple worker threads. The worker threads are then reused across different passes of a C/R. pFSCK also co-locates the threads of a pass on the same CPU and memory socket to avoid lock variables bouncing across processor caches on different sockets. We discuss the need for dynamically identifying the work done across threads and scheduling in Section § 5.3.
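The inode-partitioning scheme above can be sketched as follows. This is a minimal illustration, not pFSCK's actual code: Python's ThreadPoolExecutor stands in for the thread-pool framework the paper cites, and check_inode is a placeholder consistency predicate.

```python
# Sketch of fine-grained inode-level parallelism: evenly divide the
# inode space (whose total size is known from the superblock) into
# contiguous ranges and check each range on a worker thread.
from concurrent.futures import ThreadPoolExecutor

def check_inode(ino):
    return ino % 2 == 0  # placeholder "consistency" predicate

def check_range(start, end):
    # count inconsistencies found in one inode range
    return sum(1 for ino in range(start, end) if not check_inode(ino))

def parallel_check(total_inodes, workers):
    step = (total_inodes + workers - 1) // workers  # ceiling division
    ranges = [(i, min(i + step, total_inodes))
              for i in range(0, total_inodes, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda r: check_range(*r), ranges))

print(parallel_check(1000, 4))  # → 500 (every odd inode fails the toy check)
```

In the real checker, each range's results feed per-thread data structures that are merged at the end of the pass, as described below.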
Functional Parallelism for Reducing Synchronization Overheads.
Only dividing inodes for checking across worker threads is insufficient. To benefit from fine-grained parallelism, it is critical to reduce synchronization across worker threads in each pass without compromising correctness. We first break each C/R pass into four main functional steps and reduce synchronization across these steps. The steps include: (1) file system metadata C/R, (2) global file system metadata updates, (3) C/R-level accounting, and finally, (4) intermediate result sharing; these four steps comprise 95% of the work. The metadata check performs logical checks that verify integrity within each pass (for example, the blocks of an inode). Next, updating global file system metadata includes updating file-system-level bitmaps that keep track of the blocks and inodes currently used and referenced. The bitmaps are also used to detect inconsistencies between inodes, such as duplicate block references where more than one inode claims the same block. Third, C/R-level accounting involves updating counters that track statistics such as file types. Finally, intermediate result sharing across passes involves creating and updating data structures such as a red-black tree with inode information and a hash-tree-based directory list.

Figure 4: Parallelism in pFSCK. (a) Thread pools within each pass allow data to be operated on in parallel (data parallelism). (b) Use of multiple thread pools allows each pass of pFSCK to operate simultaneously (pipeline parallelism). (c) Any dependent checks that need to be carried out synchronously are delayed within their own logical pass.

While synchronization between the file system metadata check (step 1) and the global metadata update (step 2) is essential, synchronization between the first two steps and step 3 (C/R counter/statistics updates) can be avoided by allowing threads to maintain per-thread stats. The results of steps 1 and 2 can be aggregated before the next pass, reducing the synchronization cost significantly.
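The lock-free accounting idea (step 3) can be sketched as follows; this is an illustrative example, not pFSCK's implementation. Each worker keeps a private statistics counter and the pass merges them once at the end, instead of serializing every counter update behind a shared lock.

```python
# Sketch of lock-free C/R accounting: each worker thread tallies
# statistics (e.g., file-type counts) into its own Counter, and the
# pass aggregates all per-thread Counters once at the end.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def check_chunk(inode_types):
    stats = Counter()          # per-thread stats, no synchronization
    for t in inode_types:
        stats[t] += 1          # count files, directories, links, ...
    return stats

chunks = [["file", "file", "dir"], ["file", "link"], ["dir", "file"]]
with ThreadPoolExecutor(max_workers=3) as pool:
    merged = sum(pool.map(check_chunk, chunks), Counter())  # aggregate once
print(merged["file"], merged["dir"], merged["link"])  # → 4 2 1
```

The same divide-and-merge pattern applies to the intermediate structures of step 4 (per-thread trees and lists merged before the next pass).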
Thread Contexts for Isolation.
In current C/Rs such as e2fsck, upon which pFSCK is built, we identify significant data structure sharing across functions (steps 1 to 4) inside each pass and across passes. To reduce sharing and provide isolation, we introduce per-thread contexts (in contrast to the global context in e2fsck). The thread contexts are similar to OS thread contexts and contain information such as buffers used for processing file system objects, intermediate data structures, structures to track progress, locks held for shared data structures, and the CPUs used. At the end of each pass, the information within each thread context, such as per-thread buffers and generated intermediate data structures, is aggregated before the subsequent pass.
While data parallelism achieves concurrency for processing file system objects within a pass, fully isolating per-pass shared data structures and global data structures is not feasible without substantial changes to either the file system layout or the C/R. As a result, data parallelism does not fully benefit from increasing CPU counts and, in fact, as our results show, can degrade substantially in performance at higher core counts due to increasing synchronization overheads. To reduce the time spent on synchronization and increase CPU effectiveness, pipeline parallelism breaks the restriction that C/R passes must be sequentially executed, thereby allowing a subsequent C/R pass (Pass i+1) to start even before the completion of an earlier pass (Pass i) in a pipelined fashion (i.e., checking directories in the directory checking pass (Pass-2) even before the inode checking pass (Pass-1) has completed).

First, to facilitate each pass operating in parallel, we use per-pass thread pools. As shown in Figure 4, the inode and directory checking passes each maintain a separate thread pool that holds the threads carrying out the logic within the pass. In addition to the per-pass thread pools, each pass maintains a dedicated work queue filled with file system objects needing to be checked. As each pass operates, any intermediate work generated is placed in the next pass's work queue. For example, within the inode checking pass (Pass-1), as directory inodes are identified, their directory blocks are queued to the directory checking pass's work queue so they can be checked.
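The per-pass work-queue handoff can be sketched as a simple producer-consumer pipeline; this is an illustrative toy (hypothetical inode records, a None sentinel for shutdown), not pFSCK's actual code.

```python
# Sketch of pipeline parallelism: the inode pass (Pass-1) and the
# directory pass (Pass-2) run on separate threads. Directory blocks
# discovered by Pass-1 are streamed into Pass-2's dedicated work
# queue, so both passes operate simultaneously.
import queue
import threading

dir_queue = queue.Queue()   # Pass-2's dedicated work queue
checked_dirs = []

def pass1_inodes(inodes):
    for ino in inodes:
        if ino["is_dir"]:
            dir_queue.put(ino["block"])   # hand off to Pass-2 immediately
    dir_queue.put(None)                   # sentinel: Pass-1 is done

def pass2_dirs():
    while (blk := dir_queue.get()) is not None:
        checked_dirs.append(blk)          # placeholder directory check

inodes = [{"is_dir": i % 3 == 0, "block": i} for i in range(9)]
t1 = threading.Thread(target=pass1_inodes, args=(inodes,))
t2 = threading.Thread(target=pass2_dirs)
t1.start(); t2.start(); t1.join(); t2.join()
print(sorted(checked_dirs))  # → [0, 3, 6]
```

The real pipeline differs in that each stage owns a pool of threads rather than a single thread, and certain directory checks must be deferred for correctness, as described next.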
Allowing multiple passes to run in parallel using pipeline parallelism requires reordering logical checks for correctness. Take the example of the inode checking pass (Pass-1) and the directory checking pass (Pass-2): in pFSCK (and e2fsck), the inode checking pass reads inodes from disk, checks all inodes including directories, and adds directory block information to a shared directory block list (dblist) so the directory checking pass can check the directory entries. While the inode and directory checking passes can proceed in parallel, the directories can be marked as consistent only after the inode checking pass verifies the consistency of the inodes representing the files and subdirectories inside a directory.
Providing Ordering Guarantees.
To address the challenge of providing ordering guarantees, pFSCK delays certain checks until the prior pipeline pass is complete. For example, the inode checking pass within the pipeline is responsible for creating the in-memory directory structures that are used in the directory checking pass to check directories. The directory checking pass stores a list of subdirectories and checks whether each subdirectory's parent entry (represented by double dot, ..) maps back to the directory. However, because the inode checking and directory checking passes run in parallel, not all the inodes of the directory entries will have been checked when the parent directories are checked in the directory checking pass. To handle such scenarios, pFSCK delays certification by adding the uncompleted checks, like the one described, to a separate work queue and completing them only after all inodes have been checked (i.e., after the inode checking pass completes), as shown in Figure 4.

Effective use of multiple CPUs for an I/O-intensive C/R requires efficient I/O prefetching and caching even on fast modern storage devices such as NVMe. Though current C/Rs such as e2fsck cache and prefetch file system blocks, they are inflexible and lack thread awareness. For example, e2fsck by default prefetches a few blocks at a time when reading inodes and fetches only one block when fetching directory blocks. It is possible to change the amount of readahead done; however, we observe that statically or naively increasing the prefetch depth negatively impacts performance because threads access the file system blocks at different offsets, frequently invalidating previously read cache entries and consequently increasing I/O overheads. To overcome such limitations and accelerate I/O, we implement a per-thread caching and readahead-based prefetching mechanism that prevents the eviction of cache entries when multiple threads operate in parallel.
As our results show in Section 6, combining pipeline parallelism with data parallelism and employing dynamic workload-based threading (discussed next) improves pFSCK's performance across different file system configurations.
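The per-thread caching and readahead described above can be sketched as a private prefetch window per worker, so one thread's misses never evict another thread's entries. This is a simplified sketch under our own names (`struct thread_cache`, `cache_read`, `RA_DEPTH`); `device_read` stands in for the `pread()` on the device that a real checker would issue:

```c
#include <string.h>

#define RA_DEPTH 8   /* blocks prefetched per miss (illustrative) */

struct thread_cache {
    unsigned long base;              /* first block held, or -1UL if empty */
    unsigned char data[RA_DEPTH][4096];
};

/* Stand-in device read: fills buf for block `blk`. */
static void device_read(unsigned long blk, unsigned char *buf)
{
    memset(buf, (int)(blk & 0xff), 4096);
}

/* Return a pointer to the cached block, prefetching RA_DEPTH consecutive
 * blocks on a miss. The cache is thread-private, so misses never touch
 * any other worker's window. */
const unsigned char *cache_read(struct thread_cache *tc, unsigned long blk)
{
    if (tc->base == (unsigned long)-1 ||
        blk < tc->base || blk >= tc->base + RA_DEPTH) {
        tc->base = blk;                        /* miss: refill the window */
        for (unsigned i = 0; i < RA_DEPTH; i++)
            device_read(blk + i, tc->data[i]);
    }
    return tc->data[blk - tc->base];
}
```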
The runtime of C/Rs can vary significantly depending on the configuration of the file system. For example, C/R on a file system with a larger ratio of smaller files could result in a substantially longer runtime compared to a file system with few but large files, because more metadata must be checked. Similarly, heterogeneity in inode types (files, directories, links) can impact runtime, and the exact configuration remains unknown until the inodes are iterated over in the inode checking pass (Pass-1). Additionally, each pass within C/R has a differing degree of access to shared structures. Therefore, a static assignment of threads across each pass could be ineffective. Hence, to adapt to file system configurations, pFSCK implements a C/R-aware scheduler, pFSCK-sched, supported by extending the thread pools to allow migration of threads between the passes. In addition, pFSCK-sched maintains an idle thread pool to hold any threads not scheduled to run for any of the passes.
Thread Assignment and Migration of Worker Threads.
In pFSCK, we enable dynamic assignment of threads across each pass by implementing a scheduler that actively monitors progress and migrates threads across the passes. The scheduler periodically scans through the work queues of each pass to identify the work distribution ratio across the pipelined passes and uses this ratio to assign threads across them. Figure 5 shows an example of pFSCK-sched across the first two passes. Initially, all the CPU threads are assigned to the first pass (inode checker), given that pFSCK only knows the total inode count from the file system superblock and not the types of inodes. When the inode checker's worker thread identifies a group of directory inodes, it places the directory inodes and their corresponding directory blocks into the work queue of the directory checking pass. If no threads are present in the thread pool used for the directory checking pass, threads from the inode checking pass (first pass) are migrated to the directory checking pass. To calculate the number of threads to be reassigned, a dedicated scheduler thread finds the total work to be done across all passes using the following model.
Figure 5: Dynamic Thread Scheduling.
A dedicated scheduler thread periodically samples the work queues of all the passes and redistributes threads based on the proportion of outstanding work.

Let W_total be the total amount of work to be done, q_i the length of the work queue for pass i, n_i the number of discrete elements to be processed for each entry in the work queue, w_i a weight that normalizes the work to be done for each element in pass i, C the core budget, and t_i the number of threads to assign to pass i:

    W_total = \sum_{i=1}^{N} q_i \cdot n_i \cdot w_i        (1)

    t_i = C \cdot (q_i \cdot n_i \cdot w_i) / W_total        (2)

As shown in Equation (1), the total work to be done is the sum of the outstanding work across all the passes. The outstanding work in each pass is the product of the work queue length (q_i), the number of objects encapsulated within each queue entry (n_i), and a normalizing weight (w_i). As shown in Equation (2), given the total amount of work, the scheduler determines the ideal number of threads to assign to a pass (t_i) based on the total core budget (C) and the relative amount of work calculated for each pass. Note that the normalizing weights are essential for accounting for the differences in the time to process different file types (directories vs. regular files). In our experimentation, we find it beneficial to use higher weights to prioritize work in the directory checking queue, as directories can take significantly longer to check than regular files due to directory checksum calculations.

File system C/Rs could potentially coexist or even share CPUs with other applications using the same or another file system (or disk). In the pursuit of exploiting parallelism, pFSCK's approaches must avoid or minimize the performance impact on other applications. To address this goal, we introduce pFSCK-rsched, which adds resource awareness to pFSCK's scheduler.
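Equations (1) and (2) can be computed directly; the following sketch uses our own function names (`total_work`, `assign_threads`) and simple rounding, not pFSCK's actual code:

```c
/* q[i]: work-queue length of pass i, n[i]: elements per queue entry,
 * w[i]: normalizing weight, C: core budget, t[i]: threads for pass i. */

/* Equation (1): total outstanding work across all N passes. */
double total_work(int N, const double *q, const double *n, const double *w)
{
    double W = 0.0;
    for (int i = 0; i < N; i++)
        W += q[i] * n[i] * w[i];
    return W;
}

/* Equation (2): assign threads to each pass in proportion to its share
 * of the outstanding work, given a core budget C. */
void assign_threads(int N, const double *q, const double *n, const double *w,
                    int C, int *t)
{
    double W = total_work(N, q, n, w);
    for (int i = 0; i < N; i++)
        t[i] = (int)(C * (q[i] * n[i] * w[i]) / W + 0.5);  /* round */
}
```

For example, with two passes, queue lengths {100, 50}, elements per entry {1, 2}, weights {1, 3}, and a budget of 8 cores, the total work is 400 and the split is 2 threads for Pass-1 and 6 for Pass-2, reflecting the higher directory weight.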
First, we discuss the case where pFSCK-rsched runs alongside other applications that use different file systems, with pFSCK-rsched performing C/R on a separate, unmounted disk. Initially, pFSCK-rsched schedules the main pFSCK process using the SCHED_IDLE priority to minimize contention on CPUs with regular processes. The SCHED_IDLE priority mostly schedules a process on idle CPUs [5]. As the scheduler periodically runs, pFSCK-rsched first determines a core budget that represents the maximum number of threads pFSCK-rsched should be running at any point in time. It does this by identifying the number of CPU threads currently running, the number of idle cores available, and the number of cores pFSCK-rsched is currently running on. For idle cores not being utilized by any application (including pFSCK), pFSCK-rsched increases the core budget by the number of idle cores available. On the contrary, if pFSCK-rsched identifies that the total number of pFSCK threads is more than the idle cores, pFSCK-rsched reduces the core budget to avoid multiplexing pFSCK's threads on the cores it runs on. The core budget remains unchanged if the available idle cores and pFSCK threads remain the same. After determining the core budget, the scheduler identifies the work ratio across passes to determine the ideal number of threads that should be assigned to each pass. The scheduler then redistributes the threads across the thread pools. If threads need to be added due to an increase in the core budget, threads are taken from the idle thread pool and assigned to the per-pass thread pools to fulfill the ideal thread count. If threads need to be removed due to a decrease in the core budget, threads are signaled and reassigned to the idle thread pool. Our results in Section 6 show the performance benefits and implications of pFSCK-rsched when co-running and sharing CPUs with another application (RocksDB).
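The core-budget update described above can be sketched as a small pure function. This is our reconstruction of the policy from the text, with an illustrative name (`update_core_budget`) and a clamp we added so at least one thread always remains:

```c
/* Grow the budget when there are spare idle cores, shrink it when pFSCK
 * runs more threads than there are idle cores (to avoid multiplexing
 * pFSCK threads on cores shared with other applications), and leave it
 * unchanged when the two are balanced. */
int update_core_budget(int budget, int idle_cores, int pfsck_threads)
{
    if (pfsck_threads > idle_cores) {
        /* more pFSCK threads than idle cores: back off */
        budget -= pfsck_threads - idle_cores;
    } else if (idle_cores > pfsck_threads) {
        /* spare idle cores: claim them */
        budget += idle_cores - pfsck_threads;
    }
    /* equal counts: budget unchanged */
    return budget > 1 ? budget : 1;   /* always keep one thread */
}
```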
Given the renewed focus on supporting online checking, open-source C/Rs such as e2fsck support C/R on an online file system (and disk) that is mounted and actively used by applications. E2fsck supports online checking by utilizing the snapshot feature of Linux's Logical Volume Manager (LVM), which preserves a file system's state by capturing changes to the file system [20]. E2fsck can then perform C/R on the snapshot. The C/R time is dominated by an application's activity in changing file system objects; consequently, an actively modified file system results in a longer C/R time. Despite this, pFSCK still shows general improvement over e2fsck. First, pFSCK's generic fine-grained parallelism supports and accelerates online C/R even when applications share the same file system (and disks). Second, pFSCK's resource awareness reduces the impact on co-running applications. As our results show, even when running online C/R alongside I/O-intensive applications (RocksDB [3]), pFSCK provides considerable performance gains compared to running online C/R with vanilla e2fsck.
Correctness.
To ensure correctness with fine-grained C/R parallelism, pFSCK employs a series of steps. First, although the checks are done in parallel, an inode is not marked complete unless prior passes in the pipeline complete. For example, recall that a directory within the directory checking pass (Pass-2) cannot be marked as complete until all the inodes for
Name         Description
e2fsck       original FSCK for EXT file systems
e2fsck-opt   optimized e2fsck
xfs_repair   XFS file system checker
pFSCK        proposed file system checker
Table 1: C/R systems evaluated.
Name                              Description
datapara                          Only data parallelism enabled
datapara+pipeline-split-equal     Pipeline parallelism combined with data parallelism, using an equal distribution of threads across the passes
datapara+pipeline-split-optimal   Pipeline parallelism combined with data parallelism, using optimal manual thread assignment
sched                             All prior parallelization optimizations, but using the dynamic thread scheduler for thread assignment
rsched                            Sched configuration with resource-awareness enabled
Table 2: pFSCK incremental system designs.

all of its directory entries are checked. Second, threads are synchronized when performing complex fixes upon detecting errors. When a worker thread detects an inconsistency, all threads across the different passes are notified and stalled using a barrier. The thread that detected the inconsistency attempts to fix the error either with user input (e.g., an incorrect inode, or blocks claimed by multiple inodes) or without user input (e.g., an inconsistent bitmap), after which parallel execution resumes. In addition, we plan to explore C/R crash consistency for pFSCK in the future using prior approaches [15].
Optimizations.
As additional optimizations to both e2fsck and pFSCK, we restrict the overheads of the language localization discussed earlier to inode and directory block checking, use Intel hardware-accelerated CRC instead of the default CRC, and improve the cache-readahead mechanism. While the language localization optimization has been reported and upstreamed to e2fsck's mainline code, the other optimizations are under review. We evaluate the benefits of these optimizations in Section 6 (referred to as e2fsck-opt in graphs).
Our evaluation of pFSCK aims to answer the following important questions:
• Does data parallelism improve runtime performance by increasing CPU parallelism?
• Can pipeline parallelism address the limitations of data parallelism by running multiple passes of C/R simultaneously?
• How effective is pFSCK's dynamic thread placement mechanism for different file system configurations?
• Can pFSCK's resource-aware scheduler effectively minimize the performance impact on other applications when sharing CPUs?
• How does pFSCK perform for online file system checking?
Figure 6: Data parallelism impact on a file-intensive configuration.
The x-axis shows different C/Rs and different thread configurations.
We use a 64-core dual Intel® Xeon Gold 5218 (2.30GHz) machine with 64GB of DDR memory and 1TB of NVMe flash storage, running Ubuntu 18.04.1. We run pFSCK on various file system configurations with varying thread counts. As seen in Table 1, we compare against vanilla e2fsck, e2fsck-opt (an optimized version of e2fsck that removes localization overheads and utilizes Intel CPU-accelerated CRC calculations), and finally xfs_repair. Table 2 shows pFSCK's incremental design optimizations.
To evaluate the potential performance improvement with data parallelism, we run pFSCK with just data parallelism (bars shown as pFSCK[datapara]), which parallelizes each pass of a C/R by partitioning work. We compare two file system configurations: a file-intensive (99% files) and a directory-intensive (50% directories) configuration. The x-axis shows the reduction in runtime with pFSCK, and the stacked bars show the runtime breakdown for each pass.

First, with a file-intensive configuration, as shown in Figure 6, the inode checking pass (Pass-1) shows a higher runtime compared to other passes. Second, our optimized e2fsck (e2fsck-opt) outperforms the vanilla e2fsck by optimizing the CRC mechanism, avoiding language localization overheads for every inode, and improving the readahead mechanism. Interestingly, both e2fsck and pFSCK outperform xfs_repair in all cases. Although xfs_repair is able to check inodes in parallel at the granularity of allocation groups, it is unable to check directory entries and link counts in parallel, which unfortunately dominates the checking time for large file systems.

Finally, enabling data parallelism within pFSCK reduces the runtime of the first pass by 2.1x with 4 threads. Data parallelism also reduces the runtime of the directory checking pass (Pass-2) by 1.8x, resulting in an overall C/R speedup of 1.9x. Beyond 4 threads, data parallelism does not scale due to higher serialization and lock contention overheads. Specifically, we find the functions that update shared structures such as the used/free block bitmap to be the most prominent source of bottlenecks that hinder scaling.

Figure 7: Data parallelism impact on a directory-intensive configuration. The x-axis shows different C/Rs and different thread configurations.

Next, for the directory-intensive file system, as shown in Figure 7, pFSCK parallelizes inode checking (in Pass-1) and directory checking (in Pass-2). The runtime of Pass-1 and Pass-2 reduces by 1.8x and 1.3x respectively, resulting in an overall C/R speedup of 1.4x compared to the vanilla e2fsck. Similar to the file-intensive configuration, we see a performance gain of 1.8x compared to xfs_repair, for which the directory metadata check is not parallelized. Finally, pFSCK's data parallelism does not scale beyond 4 cores.
We next evaluate the benefits of combining data and pipeline parallelism and the need for dynamic thread placement. To evaluate the performance improvement of pipeline parallelization, we compare pFSCK's data parallelism (pFSCK[datapara]) with data and pipeline parallelism (pFSCK[datapara+pipeline]) on file-intensive and directory-intensive file system configurations. When using pipeline parallelism, the threads of each pass add work to be processed in the next pass. Because of this, the logical boundaries between the passes within e2fsck diminish; hence, we only report the full runtime of C/R. Because e2fsck and pFSCK outperform xfs_repair in all cases, we do not show xfs_repair's results.

Figure 8a and Figure 8b show the results for a file-intensive and a directory-intensive file system configuration. The x-axis shows the increase in the number of threads used for the C/R. In the figures, we compare four cases: (1) pFSCK[datapara], which only uses data parallelism, running one pass at a time; (2) pFSCK[datapara+pipeline-split-equal], which combines pipeline and data parallelism and statically uses an equal number of threads for each of the simultaneously executing passes (e.g., 2 threads assigned to Pass-1 and 2 threads to Pass-2 in a 4-thread configuration); (3) pFSCK[datapara+pipeline-split-optimal], which represents the best manually selected thread configuration; and (4) pFSCK-sched, which dynamically assigns threads to each pass based on the amount of work to be done and current progress.

Figure 8: Comparison of pipeline parallelism and scheduler. (a) File-intensive FS on NVMe. (b) Directory-intensive FS on NVMe. (c) File-intensive FS on SSD.

First, for the file-intensive configuration, pFSCK[datapara+pipeline-split-equal] is not beneficial at low thread counts (2 and 4 threads) compared to the data parallelism-only approach, because threads are statically and equally assigned to Pass-1 and Pass-2 irrespective of the file system configuration. For a file-intensive configuration, most work is done in the inode checking pass (Pass-1); as a result, threads statically assigned to subsequent passes are under-utilized. Increasing the thread count (along the x-axis) improves pFSCK's performance only marginally compared to using data parallelism alone. In contrast, pFSCK[datapara+pipeline-split-optimal], the manually selected thread configuration, improves performance by up to 1.3x compared to data parallelism by assigning three-fourths of the threads to the inode checker. More importantly, the manually selected configuration avoids the high synchronization and contention overheads of data parallelism beyond four threads. Finally, pFSCK's scheduler (pFSCK-sched) automatically migrates threads based on the relative amount of outstanding work to be completed across each pass. As a result, pFSCK-sched can automatically provide optimal performance, improving performance by 1.1x compared to pFSCK[datapara+pipeline-split-optimal] and resulting in an overall speedup of 2.6x compared to vanilla e2fsck.

Next, for the directory-intensive file system, unlike the data parallelism-only approach, pipeline parallelism, which enables simultaneous inode and directory checking (without compromising correctness), provides some performance improvement as well. pFSCK[datapara+pipeline-split-optimal] reduces runtime by 1.1x with up to 8 threads compared to just using data parallelism.
When employing pFSCK's scheduler (pFSCK-sched), it automatically mitigates thread contention within directory block checking by limiting the number of threads assigned to each pass. Consequently, performance improves by 1.05x over pFSCK[datapara+pipeline-split-equal], resulting in an overall speedup of up to 1.6x over the vanilla e2fsck. Compared to the speedup with a file-intensive file system, because pFSCK employs delayed certification of directories, beyond 8 threads, threads processing dependent directories (a directory with subdirectories) must join, synchronize, and merge their work for correctness, limiting scalability.

Interestingly, although not shown, we see that e2fsck uses the most memory for directory-intensive file systems. To check an 800GB directory-intensive file system, e2fsck uses as much as 3GB of memory, as a significant amount of in-memory data structures is needed to track all the directory information required to verify the relationships between directories and the files and subdirectories that exist within them. Despite this, we find pFSCK's memory usage comparable, using at most 3.5GB, only a 1.17x increase in memory usage.

Lastly, Figure 8c shows pFSCK's performance on a file-intensive file system backed by an SSD. We see that pFSCK shows similar speedups of up to 2.1x over vanilla e2fsck despite SSDs having lower bandwidth capabilities compared to NVMe.
In summary, pFSCK’s pipeline parallelism reducesthe serialization bottlenecks of data parallelism, and the dy-namic thread placement reduces work imbalance, all leadingto significant performance gains.
We next evaluate the effectiveness of pFSCK's resource-aware scheduler (pFSCK-rsched) in reducing the impact on other applications compared to e2fsck. To illustrate the effectiveness, we pick a popular persistent key-value store, RocksDB [3], and use it to run a multithreaded workload alongside each system. We evaluate both an offline setting, where the checker and RocksDB operate on separate file systems, and an online setting, where the checker and RocksDB operate on the same file system. For each setting, we consider the following cases: (1) e2fsck-no-cpu-sharing, where e2fsck runs with RocksDB without sharing the same CPU cores; (2) e2fsck-cpu-sharing, where e2fsck runs with RocksDB while sharing the same CPU cores; (3) pFSCK-rsched-no-cpu-sharing, where pFSCK-rsched runs with RocksDB without sharing the same CPU cores; and (4) pFSCK-rsched-cpu-sharing, where pFSCK-rsched runs with RocksDB while sharing the same CPU cores. We run RocksDB with 12 threads. In the case of CPU sharing, we force e2fsck to share cores with RocksDB by restricting the affinity of all the threads to 12 cores, resulting in the overlap of one core. In the case of CPU sharing with pFSCK-rsched, we run pFSCK-rsched with 12 threads and restrict the affinity of all threads to 16 cores, resulting in the overlap of 8 cores. For brevity, we show only the results for checking a file-intensive file system.
Figure 9a shows the performance of e2fsck and pFSCK-rsched when sharing CPU cores with RocksDB. In this experiment, the C/R and RocksDB do not share the file system, and the C/R runs on an offline unmounted disk. On the y-axis, the results are normalized relative to no CPU sharing between the C/R and RocksDB as the baseline.

First, sharing CPUs between e2fsck and RocksDB impacts the performance of both (shown as e2fsck-cpu-share on the x-axis) compared to the baseline that does not share CPUs (e2fsck-no-cpu-share). When sharing CPUs, e2fsck's performance degrades by 1.2x and RocksDB's performance degrades by 1.5x compared to the baseline. E2fsck is context-switched to run periodically, taking effective CPU time away from RocksDB and introducing context-switching overhead.

Next, for pFSCK, the baseline pFSCK-rsched-no-cpu-share configuration shows the performance of pFSCK and RocksDB without CPU sharing, whereas pFSCK-rsched-cpu-share shows the performance with CPU sharing. For CPU sharing, we see that although we overlap 8 out of 16 cores, the performance of pFSCK-rsched and RocksDB does not degrade as significantly as in the e2fsck case. The performance of pFSCK-rsched and RocksDB degrades only by 1.07x and 1.05x, respectively; this is because pFSCK-rsched's ability to downscale the number of threads used for C/R prevents CPU time from being taken away from RocksDB and minimizes potential context-switching overhead. Because RocksDB mainly utilizes 12 out of the 16 cores for the majority of execution, the effective performance of pFSCK-rsched is equivalent to the performance of utilizing only 4-6 threads.
Figure 9b shows the results when the C/R and RocksDB share the CPU as well as the file system. As discussed earlier in Section 5.4.2, pFSCK utilizes LVM-based snapshots to capture changes to the file system and perform incremental C/R. Similar to our evaluation with separate file systems, we normalize all results to the baseline of e2fsck running with RocksDB without overlapping cores.

First, when overlapping e2fsck and RocksDB, performance significantly degrades, by 1.4x and 1.6x respectively. Interestingly, this is not only due to context switching between e2fsck and RocksDB but also due to the overheads of utilizing LVM snapshots. As discussed before, LVM preserves file system state for a snapshot by capturing all updates to the file system and making a copy of the original data when modified, for the entire duration the LVM snapshot is active.

Figure 9: Impact of resource-aware pFSCK for offline and online C/R. (a) Offline C/R. (b) Online C/R. Results shown for the file-intensive configuration.

In the case of e2fsck sharing CPUs with RocksDB, since CPU sharing and context switching naturally decrease the performance of e2fsck, the snapshot is active for a longer period of time, compounding the performance degradation for both e2fsck and RocksDB.

Despite the compounding performance degradation due to LVM snapshot overheads, the performance degradation when co-running pFSCK-rsched with RocksDB is minimal for the following reasons: (1) pFSCK-rsched's parallelization reduces C/R runtime compared to e2fsck, minimizing the amount of time the snapshot is active and reducing performance degradation for the C/R and RocksDB. (2) In the case of overlapping cores, similar to the offline setting, pFSCK-rsched mitigates significant performance impact by scaling the number of threads it uses, limiting the performance degradation of both pFSCK-rsched and RocksDB to only 1.2x. Compared to the offline setting, this increase in degradation from 1.07x for pFSCK-rsched and 1.05x for RocksDB to 1.2x for both is due to the compounding overheads of using LVM snapshots.
Summary.
To summarize, pFSCK-rsched’s resource aware-ness is able to effectively adapt its number of threads to maxi-mize the utilization of available cores (and performance) inboth an offline and online setting while effectively minimizingthe amount of impact on RocksDB.
With the goal of accelerating file system checking and repair tools, in this paper we propose pFSCK, a parallel C/R tool that exploits CPU parallelism and the high bandwidth capabilities of modern storage to accelerate file system checking and repair time without compromising correctness. pFSCK explores fine-grained parallelism by assigning threads to inodes, blocks, or directories, and efficiently performs C/R using data parallelism within each pass and pipeline parallelism across multiple passes. In addition, pFSCK enables efficient thread management techniques to adapt to varying file system configurations as well as to minimize the performance impact on other applications. Evaluation of pFSCK shows more than 2.6x gains over e2fsck and 1.8x over xfs_repair, which provides coarse-grained parallelism.

References

[1] Disk check takes too long to check. linuxquestions.org.
[2] e2scrub: online fsck for ext4.
[3] Facebook RocksDB. http://rocksdb.org/.
[4] Intel-Micron Memory 3D XPoint. http://intel.ly/1eICR0a.
[5] Linux sched() man page.
[6] Extremely long time for an ext4 fsck. stackexchange, Mar 2013.
[7] File system check (fsck) is slow and running for a very long time, Sep 2016.
[8] Abutalib Aghayev, Sage Weil, Michael Kuchnik, Mark Nelson, Gregory R. Ganger, and George Amvrosiadis. File systems unfit as distributed storage backends: lessons from 10 years of Ceph evolution. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 353–369, 2019.
[9] Ramnatthan Alagappan, Aishwarya Ganesan, Yuvraj Patel, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Correlated crash vulnerabilities. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 151–167, Berkeley, CA, USA, 2016. USENIX Association.
[10] Lakshmi N. Bairavasundaram, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Garth R. Goodson, and Bianca Schroeder. An analysis of data corruption in the storage stack. ACM Transactions on Storage (TOS), 4(3):8, 2008.
[11] Lakshmi N. Bairavasundaram, Garth R. Goodson, Shankar Pasupathy, and Jiri Schindler. An analysis of latent sector errors in disk drives. In Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '07, 2007.
[12] Vijay Chidambaram, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Optimistic crash consistency. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 228–243. ACM, 2013.
[13] Pradeep Fernando, Sudarsun Kannan, Ada Gavrilovska, and Karsten Schwan. Phoenix: Memory speed HPC I/O with NVM. In , pages 121–131. IEEE Computer Society, 2016.
[14] Daniel Fryer, Kuei Sun, Rahat Mahmood, TingHao Cheng, Shaun Benjamin, Ashvin Goel, and Angela Demke Brown. Recon: Verifying file system consistency at runtime. ACM Transactions on Storage (TOS), 8(4):1–29, 2012.
[15] Om Rameshwar Gatla, Muhammad Hameed, Mai Zheng, Viacheslav Dubeyko, Adam Manzanares, Filip Blagojević, Cyril Guyot, and Robert Mateescu. Towards robust file system checkers. In , pages 105–122, Oakland, CA, February 2018. USENIX Association.
[16] Gawatu. Resilient file system (ReFS) overview.
[17] Haryadi S. Gunawi, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. SQCK: A declarative file system checker.
[18] Haryadi S. Gunawi, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. SQCK: A declarative file system checker. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI'08, pages 131–146, Berkeley, CA, USA, 2008. USENIX Association.
[19] Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, Caitie McCaffrey, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree Jacobson, Robert Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, and Huaicheng Li. Fail-slow at scale: Evidence of hardware performance faults in large production systems. In , pages 1–14, Oakland, CA, 2018. USENIX Association.
[20] Michael Hasenstein. The Logical Volume Manager (LVM). White paper, 2001.
[21] Val Henson, Zach Brown, and Arjan van de Ven. Reducing fsck time for ext2 file systems. 04 2019.
[22] Val Henson, Amit Gud, Arjan van de Ven, and Zach Brown. Chunkfs: Using divide-and-conquer to improve file system reliability and repair. In Proceedings of the Second Conference on Hot Topics in System Dependability, HotDep'06, pages 7–7, Berkeley, CA, USA, 2006. USENIX Association.
[23] Val Henson, Arjan van de Ven, Amit Gud, and Zach Brown. Chunkfs: Using divide-and-conquer to improve file system reliability and repair. In HotDep, 2006.
[24] Shehbaz Jaffer, Stathis Maneas, Andy Hwang, and Bianca Schroeder. Evaluating file system reliability on solid state drives. In USENIX Annual Technical Conference (USENIX ATC), pages 783–798, 2019.
[25] Shehbaz Jaffer, Stathis Maneas, Andy Hwang, and Bianca Schroeder. Evaluating file system reliability on solid state drives. In , pages 783–798, Renton, WA, July 2019. USENIX Association.
[26] Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap, Taesoo Kim, Aasheesh Kolli, and Vijay Chidambaram. SplitFS: Reducing software overhead in file systems for persistent memory. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 494–508, 2019.
[27] Sudarsun Kannan, Nitish Bhat, Ada Gavrilovska, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Redesigning LSMs for non-volatile memory with NoveLSM. In Haryadi S. Gunawi and Benjamin Reed, editors, , pages 993–1005. USENIX Association, 2018.
[28] Ram Kesavan, Harendra Kumar, and Sushrut Bhowmik. WAFL Iron: Repairing live enterprise file systems. In , pages 33–48, Oakland, CA, February 2018. USENIX Association.
[29] Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, and Thomas Anderson. Strata: A cross media file system. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP '17, 2017.
[30] Changman Lee, Dongho Sim, Joo-Young Hwang, and Sangyeun Cho. F2FS: A new file system for flash storage. In Proceedings of the 13th USENIX Conference on File and Storage Technologies, FAST'15, Santa Clara, CA, 2015.
[31] W. Li, Y. Yang, J. Chen, and D. Yuan. A cost-effective mechanism for cloud data reliability management based on proactive replica checking. In , pages 564–571, 2012.
[32] M. Lu, T. Chiueh, and S. Lin. An incremental file system consistency checker for block-level CDP systems. In , pages 157–162, Oct 2008.
[33] Ao Ma, Chris Dragga, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Marshall Kirk McKusick. ffsck: The fast file-system checker. Trans. Storage, 10(1):2:1–2:28, January 2014.
[34] Marshall K. McKusick. Improving the performance of fsck in FreeBSD. ;login:, 38(2), 2013.
[35] Marshall Kirk McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. Fsck: the UNIX file system check program. Unix System Manager's Manual, 4.3 BSD Virtual VAX-11 Version, 1986.
[36] Mtanski. mtanski/xfsprogs. github.com/mtanski/xfsprogs/preadv2/repair, Feb 2015.
[37] Jiaxin Ou, Jiwu Shu, and Youyou Lu. A high performance file system for non-volatile main memory. In Proceedings of the Eleventh European Conference on Computer Systems, pages 1–16, 2016.
[38] Henry Qin, Qian Li, Jacqueline Speiser, Peter Kraft, and John Ousterhout. Arachne: Core-aware thread management. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'18, pages 145–160, Berkeley, CA, USA, 2018. USENIX Association.
[39] Omar Sandoval. A survey of bugs in the Btrfs filesystem. https://courses.cs.washington.edu/courses/cse551/15sp/projects/osandov.pdf.
[40] Ric Wheeler. fs_mark.
[41] Jian Xu and Steven Swanson. NOVA: A log-structured file system for hybrid volatile/non-volatile main memories. In Proceedings of the 14th USENIX Conference on File and Storage Technologies, FAST'16, 2016.
[42] Mai Zheng, Joseph Tucek, Feng Qin, Mark Lillibridge, Bill W. Zhao, and Elizabeth S. Yang. Reliability analysis of SSDs under power fault. ACM Trans. Comput. Syst., 34(4):10:1–10:28, November 2016.