Accelerating Filesystem Checking and Repair with pFSCK
David Domingo, Kyle Stratton, Sudarsun Kannan
Rutgers University
Abstract
File system checking and recovery (C/R) tools play a pivotal role in increasing the reliability of storage software, identifying and correcting file system inconsistencies. However, with increasing disk capacity and data content, file system C/R tools notoriously suffer from long runtimes. We posit that current file system checkers fail to exploit CPU parallelism and the high throughput offered by modern storage devices. To overcome these challenges, we propose pFSCK, a tool that redesigns C/R to enable fine-grained parallelism at the granularity of inodes without impacting the correctness of C/R's functionality. To accelerate C/R, pFSCK first employs data parallelism by identifying functional operations in each stage of the checker and isolating dependent operations and their shared data structures. However, fully isolating shared structures is infeasible, consequently requiring serialization that limits scalability. To reduce the impact of synchronization bottlenecks and exploit CPU parallelism, pFSCK designs pipeline parallelism, allowing multiple stages of C/R to run simultaneously without impacting correctness. To realize efficient pipeline parallelism for different file system data configurations, pFSCK provides techniques for ordering updates to global data structures, efficient per-thread I/O cache management, and dynamic thread placement across different passes of a C/R. Finally, pFSCK designs a resource-aware scheduler aimed at reducing the impact of C/R on other applications sharing CPUs and the file system. Evaluation of pFSCK shows more than 2.6x gains over e2fsck and more than 1.8x over XFS's checker, which provides coarse-grained parallelism.
Modern ultra-fast storage devices such as SSDs, NVMe, and byte-addressable NVM storage technologies offer higher bandwidth and lower latency compared to hard disks, providing better opportunities for exploiting CPU parallelism. While I/O access performance has increased, storage hardware errors have continued to grow, coupled with newer and exploratory high-performance designs impacting file system reliability [9, 19, 39]. For decades, file system checking and repair tools (referred to as C/R henceforth) have played a pivotal role in increasing the reliability of software storage stacks, identifying and correcting file system inconsistencies [35]. In fact, in the event of a system crash or storage failure in data centers, file system checkers are typically used as the first remedial solution for system recovery [19].

File system C/R tools work by identifying and fixing structural inconsistencies in file system metadata, such as inconsistencies in inodes, data and inode bitmaps, links, and directory entries. Well-known and widely used tools such as e2fsck (the file system checker for Ext4) divide C/R across multiple stages (commonly referred to as passes), with each pass responsible for checking a file system structure (e.g., directories, files, links). However, C/Rs are known to be notoriously slow, showing a linear increase in C/R time with an increase in file and directory count and disk utilization [21, 32–35]. Although modern flash and NVM technologies provide lower latency and higher bandwidth, current C/R tools fail to fully utilize such hardware capabilities or multicore CPU parallelism.
While modern C/Rs have attempted to increase parallelism, their coarse-grained approaches, such as parallelizing C/R across logical volumes or logical groups, are insufficient to accelerate C/R on file systems with data imbalance across logical groups [18, 21, 33, 36]. To overcome such limitations, we propose pFSCK, a parallel C/R that exploits CPU parallelism and modern storage's high bandwidth to accelerate file system checking and repair time without compromising correctness. Accelerating file system C/R could significantly reduce system downtime and improve storage availability [9, 18, 19, 33]. In this pursuit, pFSCK introduces fine-grained parallelism, i.e., parallelism at the granularity of inodes and directory blocks, resulting in significantly faster execution compared to traditional C/Rs.

pFSCK first employs data parallelism by breaking up the work done at each pass, redesigning data structures for scalability, and allowing multiple threads to process them. Although data parallelism accelerates checking, updates to global data structures (e.g., bitmaps) within each pass are designed to match the file system's layout (e.g., the block bitmap in an Ext4 file system) and must be synchronized and serialized to ensure checking correctness. As a result, with increasing threads, the cost of synchronization and serialization can quickly outweigh the performance gains. Hence, pFSCK introduces pipeline parallelism to parallelize C/R along the logical flow (i.e., across multiple passes).

Supporting data and pipeline parallelism within pFSCK requires addressing several challenges. First, updates to shared data structures must be ordered for C/R correctness. For example, a directory cannot be certified as error-free by the directory checking pass unless all its files are verified as consistent by the inode checking pass.
To address these ordering constraints, taking inspiration from out-of-order execution in hardware processors, we isolate the global data structures and perform all necessary operations in parallel, but certify correctness only when the results are merged. Second, static partitioning of CPU threads across different passes is suboptimal because each pass checks different metadata (e.g., files, directories, links), and the amount of each kind of metadata can vary across different file system configurations. In addition, the time to process different types of metadata varies significantly (e.g., checking a directory can take significantly longer than checking a file). Hence, we propose a dynamic thread scheduler that monitors progress across different passes of pFSCK and uses the pending work ratio for thread assignment.

Third, I/O optimizations such as I/O caching and read-ahead mechanisms in current C/Rs are not designed for multi-threaded parallelism, which we address by designing thread-aware I/O caching, thereby substantially reducing I/O wait times. Finally, to exploit multi-core parallelism in ways that do not affect the performance of other co-running applications that share CPUs or access the same disks checked by C/R (online checking), we propose a resource-aware mechanism that allows for scaling the number of threads used to perform checking by monitoring the overall CPU utilization of the system.

The combination of pFSCK's above techniques significantly reduces C/R runtime. For example, pFSCK's data parallelism and pipeline parallelism on an 800GB file system on NVMe reduce runtime by up to 2.6x for a file-intensive disk configuration.
For a directory-intensive disk configuration, pFSCK is able to reduce runtime by up to 1.6x compared to e2fsck and by up to 1.8x over the XFS file system checker. In the pursuit of increasing multicore parallelism, pFSCK increases memory usage by only 1.17x over e2fsck (from 3 GB in e2fsck to 3.5 GB in pFSCK for an 800GB file system). Further, pFSCK's scheduler increases gains by 1.1x over pFSCK without a scheduler. When sharing CPUs between pFSCK and RocksDB, the system-wide resource-aware mechanism limits pFSCK's performance degradation to 1.07x and the overhead on RocksDB to 1.05x. Finally, pFSCK provides a significant performance boost during live checking compared to e2fsck, improving performance by 1.7x.

Storage hardware advancements have opened up the potential for accelerating I/O-bound applications. One critical set of applications that could potentially benefit are file system checking and repair (C/R) tools, which run in almost all computing systems. We first give some background on current hardware trends and C/R tools, and then discuss prior approaches that accelerate C/R and their limitations.
With increasing core counts and the advent of faster storage devices, system performance has seen vast speedups due to hardware advancements. More specifically, flash memory technologies like PCI-attached SSD and NVMe devices provide increased throughput (8-16 GB/s) and lower access latency (20-50 µs) compared to hard disks. At the other end, fast storage-class memories such as Intel's DC Persistent Memory [4] and other classes of nonvolatile memory (NVM) directly attach to the memory controller and provide byte-addressable persistence. These technologies scale 4x larger than DRAM capacity, with variable read (100-200ns) and write (400-800ns) latencies, and bandwidth capabilities ranging from 8 GB/s to 20 GB/s.

On the software side, a huge body of prior research is in progress to redesign and optimize file systems for modern storage hardware, which includes file systems for SSD, NVMe, and NVMs [26, 29, 30, 37, 41] and the storage stack in general [13, 27]. Besides, the open-source community is investing substantial effort to optimize traditional file systems such as Ext4 and XFS for modern storage hardware, given the wide usage and reliability of these file systems. For example, file systems such as Ext4-DAX continue to retain the traditional file system structure and optimize performance by removing components such as the page cache, schedulers, and logging. Reducing data corruption for both these approaches can be challenging and requires a few years of production use [8, 24].

Since the dawn of file systems, file system consistency has always been an issue. Though modern file systems deploy mechanisms such as journaling, copy-on-write, log-structured writes, and soft updates to handle inconsistencies, they cannot fix errors that may have been present due to corruptions manifested in the past by events such as a failing disk, bit flips, overheating, or correlated crashes [10–12, 25, 42].
A widely used approach to handle disk and file system corruptions and errors is to check and fix inconsistencies using file system C/R tools, such as e2fsck and xfs_repair, that scan file system metadata for corruptions and fix them. Most C/R tools are designed specifically for the layout of a file system. C/Rs such as e2fsck and xfs_repair have multiple passes that check inode consistency, directory consistency, connectivity of all directories, directory entries, and lastly, the reference counts of inodes and blocks. The repair process involves fixing errors such as updating a block bitmap that does not indicate a block referenced by a file as being used. For more complex errors (e.g., a block referenced by several inodes), administrators are given an option to accept or decline repairs.
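To make the bitmap-style repair described above concrete, the following is a minimal sketch (with hypothetical data structures, not e2fsck's actual code) of a block-bitmap consistency check: rebuild the expected bitmap from the blocks referenced by inodes and report blocks whose on-disk bitmap state disagrees.

```python
# Hypothetical sketch of a block-bitmap consistency check: rebuild the
# "expected" bitmap from inode block references, then compare it with
# the on-disk bitmap and report blocks whose state disagrees. A repair
# pass would fix the mismatch by rewriting the on-disk bitmap.
def check_block_bitmap(inodes, on_disk_bitmap, total_blocks):
    expected = [False] * total_blocks
    for inode in inodes:
        for block in inode["blocks"]:   # blocks referenced by this inode
            expected[block] = True
    return [b for b in range(total_blocks) if expected[b] != on_disk_bitmap[b]]

inodes = [{"blocks": [0, 2]}, {"blocks": [5]}]
disk_bitmap = [True, True, True, False, False, True]  # block 1 wrongly marked used
print(check_block_bitmap(inodes, disk_bitmap, 6))  # → [1]
```

A real checker operates on packed on-disk bitmaps and must also detect the converse case shown here (a block marked used that no inode references) before deciding whether to prompt the administrator.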
With increasing disk capacities and file system sizes, C/Rs notoriously tend to run longer; specifically, the increase in system downtime may dominate the repair cost by orders of magnitude [1, 6, 7, 22, 31]. We next discuss state-of-the-art C/R optimizations for offline (unmounted file systems) and online C/Rs and their limitations.
Offline C/Rs.
To reduce C/R time, open-source C/Rs, such as Ext4's e2fsck and the XFS file system's xfs_repair, parallelize checking across disks (e2fsck) or logical groups (xfs_repair). C/R techniques like Ffsck [35] and Chunkfs [23] speed up checking by modifying the file system to provide a better balance across logical groups. For example, Chunkfs is designed to utilize disk bandwidth by partitioning the file system into smaller, isolated groups that can be repaired individually and in parallel, whereas Ffsck [35] rearranges metadata blocks within the file system to reduce seek cost and optimize file system traversal. SQCK [17] enhances C/R by utilizing declarative queries for consistency checking across file system structures.

Overall, while prior C/R designs have attempted to improve C/R performance, they suffer from several weaknesses. First, Chunkfs (and XFS) require a coarse-grained separation of file system blocks to accelerate file system checking. Take the case of an imbalanced file system with several large files spread across different logical groups of a disk. Interestingly, we show in Section § 6 that xfs_repair's parallelism is limited to simple inode scans, omitting any parallelism for checking directory metadata. Other techniques such as SQCK and Ffsck require intrusive changes to the way file system metadata is manipulated or need complete rebuilding of the C/R, which could reduce or prevent widespread adoption.
Online C/Rs.
To reduce system C/R downtime, proprietary online C/Rs such as the WAFL file system's Iron [28] (a NetApp-based C/R tool for the WAFL file system) and ReFS [16] fix corruptions as they are encountered, allowing file system operations to continue. WAFL-Iron performs incremental live C/R. Because storage blocks are made available while C/R is in progress, WAFL-Iron imposes invariants such as (1) checking all blocks before any software use, and (2) checking ancestor blocks (directories) before any data or metadata block (inode block) is checked. These invariants avoid repeated checking of an inode for every data block and also reduce memory usage. To scale C/R to petabytes, WAFL-Iron expects the presence of block-level checksums, RAID, and most importantly, good storage practices by customers. Open-source C/Rs such as e2fsck allow for online checking by utilizing LVM-based snapshotting and running C/R on the snapshot while the file system is still in use [2]. We evaluate e2fsck's LVM-based online C/R in Section § 6. Recon protects file system metadata from buggy operations by verifying metadata consistency at runtime [14]. Doing so allows Recon to detect metadata corruption before committing it to disk, preventing error propagation. Recon does not perform a global scan and hence cannot identify or fix errors originating from hardware failures.
C/R Correctness.
To ensure the correctness and crash-consistency of C/Rs themselves and to recover more reliably in light of system faults, Rfsck-lib [15] provides C/Rs with robust undo logging. pFSCK's fine-grained parallelism goals are orthogonal to Rfsck-lib; however, incorporating Rfsck-lib can improve the reliability of pFSCK in case of failures.
Summary.
To summarize, unlike prior systems, pFSCK is aimed at fine-grained parallelism, the ability to utilize storage bandwidth efficiently across multiple passes of C/R, adapting to system resources, and the capability to reduce the impact on other applications.
(a) File Count Sensitivity: Runtime of e2fsck as total file count increases. (b) Directory Count Sensitivity: Runtime of e2fsck as total directory count increases.
Figure 1: Runtime of C/R for an 800GB file system with varying counts of files or directories
In the pursuit of accelerating C/Rs, we first decipher the performance bottlenecks of the widely used Ext4 file system's e2fsck C/R tool. We first provide an overview of e2fsck and then examine e2fsck's runtime for different file system configurations. For brevity, we study xfs_repair in Section § 6.
E2fsck uses five sequential passes for C/R: the first pass (referred to as Pass-1) checks the consistency of inode metadata; Pass-2 checks directory consistency; Pass-3 checks directory connectivity; Pass-4 checks reference counts; finally, Pass-5 checks data and metadata bitmap consistency.
To analyze and decipher the breakdown of e2fsck's runtime, we run e2fsck on file systems with varying configurations. We conduct our analysis on a 64-core dual Intel® Xeon Gold 5218 at 2.30GHz, with 64GB of DDR memory and 1TB of NVMe flash storage, running Ubuntu 18.04.1. We fill the file system using fs_mark, an open-source file system benchmark tool [40]. For our analysis, we mainly focus on file systems without corruptions. To get a finer understanding of how e2fsck scales with file system configurations, we study the sensitivity of C/R's runtime to multiple file system variables such as file count and directory count.
File-intensive file systems.
First, to understand how file count affects runtime, we generate multiple file-intensive file system configurations with a 95:1 file-to-directory ratio. Operating on file-intensive file systems, Pass-1, which checks the consistency of inode structures, dominates e2fsck runtime, followed by Pass-2, which checks directory block consistency. Figure 2 shows the function-wise breakdown of Pass-1, which checks the consistency of file inodes as well as tracks the directory blocks to be examined in the next pass. We notice that dcigettext, a seemingly innocuous language translator used for error handling, gets (incorrectly) invoked for every inode check and poses a substantial slowdown on C/R performance. Other Pass-1 steps such as check blocks, which checks the blocks referenced by an inode; next inode, which reads the next inode blocks from disk; mark bitmap, which updates global bitmaps to track the metadata encountered; and icount store, which stores inode references, also increase in runtime. Although the number of directories is small, the Pass-2 (directory checking pass) runtime increases because the number of directory blocks that store directory entries increases. Pass-3 checks connectivity and ensures the reachability of directories from the root. For a small directory count, its runtime is small compared to the runtime of Pass-1 and Pass-2. We also find that increasing file size while keeping the number of files constant does not increase e2fsck runtime significantly (not shown for brevity).

Figure 2: e2fsck Pass-1 Time Breakdown. Time spent within the inode checking pass (Pass-1) as the total file count increases.
Directory-intensive file system.
We next analyze the impact of directory count on the runtime of e2fsck in Figure 1b. We generate multiple directory-intensive file systems, each with an increasing number of directories. We define a directory-intensive file system as a file system with over a 1:1 file-to-directory ratio. To ensure that each directory requires the same amount of work, we create a single file in each directory. As expected, with an increase in directory count, Pass-2's runtime significantly increases due to the increased number of directory blocks, as well as the fact that directory blocks hold checksums for consistency, forcing Pass-2 to recompute and verify these directory block checksums. Interestingly, Pass-1's runtime also increases; this is because Pass-1 is responsible for identifying directory blocks and adding them to a directory block list (db_list) to be used in Pass-2.
To understand the computational vs. I/O bottlenecks, in Figure 3 we show the compute vs. I/O wait time ratio for e2fsck. As shown, compute time dominates I/O wait time. In all our experimental runs, we observe that e2fsck's peak and average I/O bandwidth usage is 260 MB/s and 100 MB/s, respectively, on an NVMe device with 2 GB/s sequential and 512 MB/s random read bandwidth. (We reported the dcigettext issue to e2fsck developers, and the fix has been upstreamed.) In general, file system C/R tools (e.g., e2fsck and xfs_repair) suffer not only from I/O access time but also from computational cost.

Figure 3: Percent of I/O wait time (Left: File-intensive file system, Right: Directory-intensive file system)
Summary.
To summarize, our analysis shows high runtime overheads of e2fsck across file system configurations; this is mainly due to the serial, single-threaded nature of e2fsck, designed in the era of spinning hard drives. The linear complexity of its runtime is unsuitable as file system capacities trend upward, potentially taking hours, or even days, to check datacenter-scale file systems. Besides, C/R's repair work when there are file system inconsistencies could further increase C/R runtime.

pFSCK aims to overcome the limitations of traditional file system C/Rs by exploiting fine-grained multi-core parallelism, higher disk bandwidth, efficient use of CPUs via a pFSCK scheduler, and ways to reduce the impact on other co-running applications. We next outline the goals and provide an overview of pFSCK's design.
Decrease file system C/R runtime.
The main goal is to make file system C/R faster. We want to increase the speed at which file system metadata can be scanned and inconsistencies identified, without compromising repair capabilities.
Adapt to different file system configurations.
C/R performance should improve regardless of file system size, utilization, or configuration, such as a file-intensive or directory-intensive file system.
Support offline and online C/R.
C/Rs can be used when a disk is not mounted and the file system is offline, or when a system is online and the file system is actively used. Hence, pFSCK aims to support both offline and online C/R.
Adapt to system utilization.
C/R should have the ability to adapt to varying system resource utilization over time to reduce the potential performance impact on any currently running applications. pFSCK aims to adapt to varying system-wide CPU use.

4.2 pFSCK Design Insights
We next describe the key design insights to realize the above goals.
Insight 1: Maximize potential bandwidth through multiple cores and data parallelism.
To overcome the bottlenecks of current serial C/Rs and of C/Rs that parallelize at a coarse granularity, such as across logical volumes or logical groups, pFSCK exploits fine-grained inode and directory block parallelism. Toward this goal, pFSCK first introduces data parallelism to C/R for better utilization of the CPU parallelism enabled by modern storage bandwidth capabilities. At a high level, in each pass, basic file system structures such as inodes, directory blocks, dirents, and links are divided and checked across a pool of worker threads. While seemingly simple, achieving data parallelism requires data structure isolation across threads to reduce synchronization bottlenecks.
Insight 2: Enable pipeline parallelism by reducing inter-pass dependencies.
Though data parallelism improves performance, updates to several inter-pass global data structures used for building a consistent view of the file system and identifying inconsistencies (e.g., bitmaps) must be serialized. Consequently, this limits data parallelism's scalability at higher CPU counts and even degrades performance at higher thread counts due to contention on the shared structures. Pipeline parallelism breaks the rigid wall across passes, allowing multiple passes to be executed simultaneously along the logical flow of the application, thereby increasing CPU parallelism and reducing the performance impact of serialization. Realizing pipeline parallelism requires managing per-pass thread pools, isolating inter-pass shared structures using divide-and-merge approaches, delineating checking and certification of inodes, and reducing I/O wait times.
Insight 3: Adapt to file system configurations with dynamic thread scheduling.
Enabling data and pipeline parallelism requires assigning threads across the different passes of pFSCK. Static partitioning of CPU threads across passes is suboptimal due to the lack of information about metadata types (files, directories, links) and the work in each pass; for example, checking directory blocks in Pass-2 (the directory checking pass) requires more processing time than checking inodes in Pass-1 (the file checking pass), as discussed in Section § 3. To overcome the challenge of accelerating C/R for different file system configurations, we design a dynamic thread scheduler that assigns threads to process different types of file system objects as they are discovered and migrates threads across different passes of the pipeline.
Insight 4: Reduce system impact through resource utilization awareness.
File system C/Rs could potentially run alongside other applications that share CPUs while performing checking on separate disks. Given pFSCK's goal of exploiting available CPUs, this could potentially impact other co-running applications. Similarly, C/R could run on disks that are also actively used by other applications to store data. To reduce the overall system impact on co-running applications as well as on pFSCK, we equip pFSCK's scheduler with resource awareness to dynamically identify the number of cores to use at any single point in time, minimizing the potential impact on other co-running applications without significantly impacting pFSCK's performance.
To realize the goals of pFSCK, we discuss the design and implementation of pFSCK's data parallelism, pipeline parallelism, dynamic thread scheduler, and resource-aware scheduling. pFSCK extends e2fsck to realize these design changes.

pFSCK's data parallelism divides the work in each pass among a group of worker threads at the granularity of inodes and enables concurrent C/R. While seemingly simple, efficient data parallelism during C/R demands an efficient threading model for fine-grained inode parallelism, functional separation of C/R within each pass, and per-thread contexts for isolating data structures and reducing synchronization cost.
Fine-grained Inode-level Parallelism.
For fine-grained inode-level parallelism, pFSCK uses the superblock information to identify the total number of inodes in the file system and evenly divides the inodes across a given set of C/R workers. To reduce the cost of worker thread management, pFSCK uses a thread-pool framework [38] that provides the ability to assign tasks to multiple worker threads. The worker threads are then reused across different passes of a C/R. pFSCK also co-locates the threads of a pass on the same CPU and memory socket to avoid lock variables bouncing across processor caches on different sockets. We discuss the need for dynamically identifying the work done across threads and scheduling in Section § 5.3.
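The inode-partitioning scheme above can be sketched as follows. This is a minimal illustration, not pFSCK's actual code: Python's ThreadPoolExecutor stands in for the thread-pool framework the paper cites, and check_inode is a placeholder consistency predicate.

```python
# Sketch of fine-grained inode-level parallelism: evenly divide the
# inode space (whose total size is known from the superblock) into
# contiguous ranges and check each range on a worker thread.
from concurrent.futures import ThreadPoolExecutor

def check_inode(ino):
    return ino % 2 == 0  # placeholder "consistency" predicate

def check_range(start, end):
    # count inconsistencies found in one inode range
    return sum(1 for ino in range(start, end) if not check_inode(ino))

def parallel_check(total_inodes, workers):
    step = (total_inodes + workers - 1) // workers  # ceiling division
    ranges = [(i, min(i + step, total_inodes))
              for i in range(0, total_inodes, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda r: check_range(*r), ranges))

print(parallel_check(1000, 4))  # → 500 (every odd inode fails the toy check)
```

In the real checker, each range's results feed per-thread data structures that are merged at the end of the pass, as described below.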
Functional Parallelism for Reducing Synchronization Overheads.
Only dividing inodes for checking across worker threads is insufficient. To benefit from fine-grained parallelism, it is critical to reduce synchronization across worker threads in each pass without compromising correctness. We first break each C/R pass into four main functional steps and reduce synchronization across these steps. The steps include: (1) file system metadata C/R, (2) global file system metadata updates, (3) C/R-level accounting, and finally, (4) intermediate result sharing; these four steps comprise 95% of the work. The metadata check performs logical checks that verify integrity within each pass (for example, the blocks of an inode). Next, updating global file system metadata includes updating file-system-level bitmaps that keep track of the blocks and inodes currently used and referenced. The bitmaps are also used to detect inconsistencies between inodes, such as duplicate block references where more than one inode claims the same block. Third, C/R-level accounting involves updating counters that track statistics such as file types. Finally, intermediate result sharing across passes involves creating and updating data structures such as a red-black tree with inode information and a hash-tree-based directory list.

Figure 4: Parallelism in pFSCK. (a) Thread pools within each pass allow data to be operated on in parallel (data parallelism). (b) Use of multiple thread pools allows each pass of pFSCK to operate simultaneously (pipeline parallelism). (c) Any dependent checks that need to be carried out synchronously are delayed within their own logical pass.

While synchronization between the file system metadata check (step 1) and the global metadata update (step 2) is essential, synchronization between the first two steps and step 3 (C/R counter/statistics updates) can be avoided by allowing threads to maintain per-thread stats. The results of steps 1 and 2 can be aggregated before the next pass, reducing the synchronization cost significantly.
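The lock-free accounting idea (step 3) can be sketched as follows; this is an illustrative example, not pFSCK's implementation. Each worker keeps a private statistics counter and the pass merges them once at the end, instead of serializing every counter update behind a shared lock.

```python
# Sketch of lock-free C/R accounting: each worker thread tallies
# statistics (e.g., file-type counts) into its own Counter, and the
# pass aggregates all per-thread Counters once at the end.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def check_chunk(inode_types):
    stats = Counter()          # per-thread stats, no synchronization
    for t in inode_types:
        stats[t] += 1          # count files, directories, links, ...
    return stats

chunks = [["file", "file", "dir"], ["file", "link"], ["dir", "file"]]
with ThreadPoolExecutor(max_workers=3) as pool:
    merged = sum(pool.map(check_chunk, chunks), Counter())  # aggregate once
print(merged["file"], merged["dir"], merged["link"])  # → 4 2 1
```

The same divide-and-merge pattern applies to the intermediate structures of step 4 (per-thread trees and lists merged before the next pass).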
Thread Contexts for Isolation.
In current C/Rs such as e2fsck, upon which pFSCK is built, we identify significant data structure sharing across functions (steps 1 to 4) inside each pass and across passes. To reduce sharing and provide isolation, we introduce per-thread contexts (in contrast to the global context in e2fsck). The thread contexts are similar to OS thread contexts and contain information such as buffers used for processing file system objects, intermediate data structures, structures to track progress, locks held for shared data structures, and the CPUs used. At the end of each pass, the information within each thread context, such as per-thread buffers and generated intermediate data structures, is aggregated before the subsequent pass.
While data parallelism achieves concurrency for processing file system objects within a pass, fully isolating per-pass shared data structures and global data structures is not feasible without substantial changes to either the file system layout or the C/R. As a result, data parallelism does not fully benefit from increasing CPU counts and, in fact, as our results show, can degrade substantially in performance at higher core counts due to increasing synchronization overheads. To reduce the time spent on synchronization and increase CPU effectiveness, pipeline parallelism breaks the restriction that C/R passes must be sequentially executed, thereby allowing a subsequent C/R pass (Pass i+1) to start even before the completion of an earlier pass (Pass i) in a pipelined fashion (i.e., checking directories in the directory checking pass (Pass-2) even before the inode checking pass (Pass-1) has completed).

First, to facilitate each pass operating in parallel, we use per-pass thread pools. As shown in Figure 4, the inode and directory checking passes each maintain a separate thread pool that holds the threads carrying out the logic within the pass. In addition to the per-pass thread pools, each pass maintains a dedicated work queue filled with file system objects needing to be checked. As each pass operates, any intermediate work generated is placed in the next pass's work queue. For example, within the inode checking pass (Pass-1), as directory inodes are identified, their directory blocks are queued to the directory checking pass's work queue so they can be checked.
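The per-pass work-queue handoff can be sketched as a simple producer-consumer pipeline; this is an illustrative toy (hypothetical inode records, a None sentinel for shutdown), not pFSCK's actual code.

```python
# Sketch of pipeline parallelism: the inode pass (Pass-1) and the
# directory pass (Pass-2) run on separate threads. Directory blocks
# discovered by Pass-1 are streamed into Pass-2's dedicated work
# queue, so both passes operate simultaneously.
import queue
import threading

dir_queue = queue.Queue()   # Pass-2's dedicated work queue
checked_dirs = []

def pass1_inodes(inodes):
    for ino in inodes:
        if ino["is_dir"]:
            dir_queue.put(ino["block"])   # hand off to Pass-2 immediately
    dir_queue.put(None)                   # sentinel: Pass-1 is done

def pass2_dirs():
    while (blk := dir_queue.get()) is not None:
        checked_dirs.append(blk)          # placeholder directory check

inodes = [{"is_dir": i % 3 == 0, "block": i} for i in range(9)]
t1 = threading.Thread(target=pass1_inodes, args=(inodes,))
t2 = threading.Thread(target=pass2_dirs)
t1.start(); t2.start(); t1.join(); t2.join()
print(sorted(checked_dirs))  # → [0, 3, 6]
```

The real pipeline differs in that each stage owns a pool of threads rather than a single thread, and certain directory checks must be deferred for correctness, as described next.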
Allowing multiple passes to run in parallel using pipeline parallelism requires reordering logical checks for correctness. Take the example of the inode checking pass (Pass-1) and the directory checking pass (Pass-2): in pFSCK (and e2fsck), the inode checking pass reads inodes from disk, checks all inodes including directories, and adds directory block information to a shared directory block list (dblist) so the directory checking pass can check the directory entries. While the inode and directory checking passes can proceed in parallel, the directories can be marked as consistent only after the inode checking pass verifies the consistency of the inodes representing the files and subdirectories inside a directory.
Providing Ordering Guarantees.
To address the challenge of providing ordering guarantees, pFSCK delays certain checks until the prior pipeline pass is complete. For example, the inode checking pass within the pipeline is responsible for creating the in-memory directory structures that are used in the directory checking pass to check directories. The directory checking pass stores a list of subdirectories and checks whether each subdirectory's parent entry (represented by double dot, ..) maps back to the directory. However, because the inode checking and directory checking passes run in parallel, not all the inodes of the directory entries will have been checked when the parent directories are checked in the directory checking pass. To handle such scenarios, pFSCK delays certification by adding the uncompleted checks, like the one described, to a separate work queue and completing them only after all inodes have been checked (i.e., after the inode checking pass completes), as shown in Figure 4.

Effective use of multiple CPUs for an I/O-intensive C/R requires efficient I/O prefetching and caching even on fast modern storage devices such as NVMe. Though current C/Rs such as e2fsck cache and prefetch file system blocks, they are inflexible and lack thread awareness. For example, e2fsck by default prefetches a few blocks at a time when reading inodes and fetches only one block when fetching directory blocks. It is possible to change the amount of readahead done; however, we observe that statically or naively increasing the prefetch depth negatively impacts performance because threads access the file system blocks at different offsets, frequently invalidating previously read cache entries and consequently increasing I/O overheads. To overcome such limitations and accelerate I/O, we implement a per-thread caching and readahead-based prefetching mechanism that prevents the eviction of cache entries when multiple threads operate in parallel.
As our results show in Section 6, combining pipeline parallelism with data parallelism and employing dynamic workload-based threading (discussed next) improves pFSCK's performance across different file system configurations.
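The per-thread caching and readahead described above can be sketched as a private prefetch window per worker, so one thread's misses never evict another thread's entries. This is a simplified sketch under our own names (`struct thread_cache`, `cache_read`, `RA_DEPTH`); `device_read` stands in for the `pread()` on the device that a real checker would issue:

```c
#include <string.h>

#define RA_DEPTH 8   /* blocks prefetched per miss (illustrative) */

struct thread_cache {
    unsigned long base;              /* first block held, or -1UL if empty */
    unsigned char data[RA_DEPTH][4096];
};

/* Stand-in device read: fills buf for block `blk`. */
static void device_read(unsigned long blk, unsigned char *buf)
{
    memset(buf, (int)(blk & 0xff), 4096);
}

/* Return a pointer to the cached block, prefetching RA_DEPTH consecutive
 * blocks on a miss. The cache is thread-private, so misses never touch
 * any other worker's window. */
const unsigned char *cache_read(struct thread_cache *tc, unsigned long blk)
{
    if (tc->base == (unsigned long)-1 ||
        blk < tc->base || blk >= tc->base + RA_DEPTH) {
        tc->base = blk;                        /* miss: refill the window */
        for (unsigned i = 0; i < RA_DEPTH; i++)
            device_read(blk + i, tc->data[i]);
    }
    return tc->data[blk - tc->base];
}
```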
The runtime of C/Rs can vary significantly depending on the configuration of the file system. For example, C/R on a file system with a larger ratio of smaller files could result in a substantially longer runtime compared to a file system with few but large files, because more metadata must be checked. Similarly, heterogeneity in inode types (files, directories, links) can impact runtime, and the exact configuration remains unknown until the inodes are iterated over in the inode checking pass (Pass-1). Additionally, each pass within C/R has a differing degree of access to shared structures. Therefore, a static assignment of threads across each pass could be ineffective. Hence, to adapt to file system configurations, pFSCK implements a C/R-aware scheduler, pFSCK-sched, supported by extending the thread pools to allow migration of threads between the passes. In addition, pFSCK-sched maintains an idle thread pool to hold any threads not scheduled to run for any of the passes.
Thread Assignment and Migration of Worker Threads.
In pFSCK, we enable dynamic assignment of threads across each pass by implementing a scheduler that actively monitors progress and migrates threads across the passes. The scheduler periodically scans through the work queues of each pass to identify the work distribution ratio across the pipelined passes and uses this ratio to assign threads across them. Figure 5 shows an example of pFSCK-sched across the first two passes. Initially, all the CPU threads are assigned to the first pass (inode checker), given that pFSCK only knows the total inode count from the file system superblock and not the types of inodes. When the inode checker's worker thread identifies a group of directory inodes, it places the directory inodes and their corresponding directory blocks into the work queue of the directory checking pass. If no threads are present in the thread pool used for the directory checking pass, threads from the inode checking pass (first pass) are migrated to the directory checking pass. To calculate the number of threads to be reassigned, a dedicated scheduler thread finds the total work to be done across all passes using the following model.
Figure 5: Dynamic Thread Scheduling.
A dedicated scheduler thread periodically samples the work queues of all the passes and redistributes threads based on the proportion of outstanding work.

Let W_total be the total amount of work to be done, q_i the length of the work queue for pass i, n_i the number of discrete elements to be processed for each entry in the work queue, w_i a weight that normalizes the work to be done for each element in pass i, C the core budget, and t_i the number of threads to assign to pass i:

    W_total = \sum_{i=1}^{N} q_i \cdot n_i \cdot w_i        (1)

    t_i = C \cdot (q_i \cdot n_i \cdot w_i) / W_total        (2)

As shown in Equation (1), the total work to be done is the sum of the outstanding work across all the passes. The outstanding work in each pass is the product of the work queue length (q_i), the number of objects encapsulated within each queue entry (n_i), and a normalizing weight (w_i). As shown in Equation (2), given the total amount of work, the scheduler determines the ideal number of threads to assign to a pass (t_i) based on the total core budget (C) and the relative amount of work calculated for each pass. Note that the normalizing weights are essential for accounting for the differences in the time to process different file types (directories vs. regular files). In our experimentation, we find it beneficial to use higher weights to prioritize work in the directory checking queue, as directories can take significantly longer to check than regular files due to directory checksum calculations.

File system C/Rs could potentially coexist or even share CPUs with other applications using the same or another file system (or disk). In the pursuit of exploiting parallelism, pFSCK's approaches must avoid or minimize the performance impact on other applications. To address this goal, we introduce pFSCK-rsched, which adds resource awareness to pFSCK's scheduler.
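Equations (1) and (2) can be computed directly; the following sketch uses our own function names (`total_work`, `assign_threads`) and simple rounding, not pFSCK's actual code:

```c
/* q[i]: work-queue length of pass i, n[i]: elements per queue entry,
 * w[i]: normalizing weight, C: core budget, t[i]: threads for pass i. */

/* Equation (1): total outstanding work across all N passes. */
double total_work(int N, const double *q, const double *n, const double *w)
{
    double W = 0.0;
    for (int i = 0; i < N; i++)
        W += q[i] * n[i] * w[i];
    return W;
}

/* Equation (2): assign threads to each pass in proportion to its share
 * of the outstanding work, given a core budget C. */
void assign_threads(int N, const double *q, const double *n, const double *w,
                    int C, int *t)
{
    double W = total_work(N, q, n, w);
    for (int i = 0; i < N; i++)
        t[i] = (int)(C * (q[i] * n[i] * w[i]) / W + 0.5);  /* round */
}
```

For example, with two passes, queue lengths {100, 50}, elements per entry {1, 2}, weights {1, 3}, and a budget of 8 cores, the total work is 400 and the split is 2 threads for Pass-1 and 6 for Pass-2, reflecting the higher directory weight.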
First, we discuss the case where pFSCK-rsched runs alongside other applications that use different file systems, with pFSCK-rsched performing C/R on a separate, unmounted disk. Initially, pFSCK-rsched schedules the main pFSCK process using the SCHED_IDLE priority to minimize contention on CPUs with regular processes. The SCHED_IDLE priority mostly schedules a process on idle CPUs [5]. As the scheduler periodically runs, pFSCK-rsched first determines a core budget that represents the maximum number of threads pFSCK-rsched should be running at any point in time. It does this by identifying the number of CPU threads currently running, the number of idle cores available, and the number of cores pFSCK-rsched is currently running on. For idle cores not being utilized by any application (including pFSCK), pFSCK-rsched increases the core budget by the number of idle cores available. On the contrary, if pFSCK-rsched identifies that the total number of pFSCK threads is more than the idle cores, pFSCK-rsched reduces the core budget to avoid multiplexing pFSCK's threads on the cores it runs on. The core budget remains unchanged if the available idle cores and pFSCK threads remain the same. After determining the core budget, the scheduler identifies the work ratio across passes to determine the ideal number of threads that should be assigned to each pass. The scheduler then redistributes the threads across the thread pools. If threads need to be added due to an increase in the core budget, threads are taken from the idle thread pool and assigned to the per-pass thread pools to fulfill the ideal thread count. If threads need to be removed due to a decrease in the core budget, threads are signaled and reassigned to the idle thread pool. Our results in Section 6 show the performance benefits and implications of pFSCK-rsched when co-running and sharing CPUs with another application (RocksDB).
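The core-budget update described above can be sketched as a small pure function. This is our reconstruction of the policy from the text, with an illustrative name (`update_core_budget`) and a clamp we added so at least one thread always remains:

```c
/* Grow the budget when there are spare idle cores, shrink it when pFSCK
 * runs more threads than there are idle cores (to avoid multiplexing
 * pFSCK threads on cores shared with other applications), and leave it
 * unchanged when the two are balanced. */
int update_core_budget(int budget, int idle_cores, int pfsck_threads)
{
    if (pfsck_threads > idle_cores) {
        /* more pFSCK threads than idle cores: back off */
        budget -= pfsck_threads - idle_cores;
    } else if (idle_cores > pfsck_threads) {
        /* spare idle cores: claim them */
        budget += idle_cores - pfsck_threads;
    }
    /* equal counts: budget unchanged */
    return budget > 1 ? budget : 1;   /* always keep one thread */
}
```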
Given the renewed focus on supporting online checking, open-source C/Rs such as e2fsck support C/R on an online file system (and disk) that is mounted and actively used by applications. E2fsck supports online checking by utilizing the snapshot feature of Linux's Logical Volume Manager (LVM), which preserves a file system's state by capturing changes to the file system [20]. E2fsck can then perform C/R on the snapshot. The C/R time is dominated by an application's activity in changing file system objects; consequently, an actively modified file system results in a longer C/R time. Despite this, pFSCK still shows general improvement over e2fsck. First, pFSCK's generic fine-grained parallelism supports and accelerates online C/R even when applications share the same file system (and disks). Second, pFSCK's resource awareness reduces the impact on co-running applications. As our results show, even when running online C/R alongside I/O-intensive applications (RocksDB [3]), pFSCK provides considerable performance gains compared to running online C/R with vanilla e2fsck.
Correctness.
To ensure correctness with fine-grained C/R parallelism, pFSCK employs a series of steps. First, although the checks are done in parallel, an inode is not marked complete unless prior passes in the pipeline complete. For example, recall that a directory within the directory checking pass (Pass-2) cannot be marked as complete until all the inodes for
Name         Description
e2fsck       original FSCK for EXT file systems
e2fsck-opt   optimized e2fsck
xfs_repair   XFS file system checker
pFSCK        proposed file system checker
Table 1: C/R systems evaluated.
Name                              Description
datapara                          Only data parallelism enabled
datapara+pipeline-split-equal     Pipeline parallelism combined with data parallelism, using an equal distribution of threads across the passes
datapara+pipeline-split-optimal   Pipeline parallelism combined with data parallelism, using optimal manual thread assignment
sched                             All prior parallelization optimizations, but using the dynamic thread scheduler for thread assignment
rsched                            Sched configuration with resource-awareness enabled
Table 2: pFSCK incremental system designs.

all of its directory entries are checked. Second, threads are synchronized when performing complex fixes upon detecting errors. When a worker thread detects an inconsistency, all threads across the different passes are notified and stalled using a barrier. The thread that detected the inconsistency attempts to fix the error either with user input (e.g., an incorrect inode, or blocks claimed by multiple inodes) or without user input (e.g., an inconsistent bitmap), after which parallel execution resumes. In addition, we plan to explore C/R crash consistency for pFSCK in the future using prior approaches [15].
Optimizations.
As additional optimizations to both e2fsck and pFSCK, we restrict the overheads of the language localization discussed earlier to inode and directory block checking, use Intel hardware-accelerated CRC instead of the default CRC, and improve the cache-readahead mechanism. While the language localization optimization has been reported and upstreamed to e2fsck's mainline code, the other optimizations are under review. We evaluate the benefits of these optimizations in Section 6 (referred to as e2fsck-opt in graphs).
Our evaluation of pFSCK aims to answer the following important questions:
• Does data parallelism improve runtime performance by increasing CPU parallelism?
• Can pipeline parallelism address the limitations of data parallelism by running multiple passes of C/R simultaneously?
• How effective is pFSCK's dynamic thread placement mechanism for different file system configurations?
• Can pFSCK's resource-aware scheduler effectively minimize the performance impact on other applications when sharing CPUs?
• How does pFSCK perform for online file system checking?
Figure 6: Data parallelism impact on a file-intensive configuration.
The x-axis shows different C/Rs and different thread configurations.
We use a 64-core dual Intel® Xeon Gold 5218 (2.30GHz) machine with 64GB of DDR memory and 1TB of NVMe flash storage, running Ubuntu 18.04.1. We run pFSCK on various file system configurations with varying thread counts. As seen in Table 1, we compare against vanilla e2fsck, e2fsck-opt (an optimized version of e2fsck that removes localization overheads and utilizes Intel CPU-accelerated CRC calculations), and finally xfs_repair. Table 2 shows pFSCK's incremental design optimizations.
To evaluate the potential performance improvement with data parallelism, we run pFSCK with just data parallelism (bars shown as pFSCK[datapara]), which parallelizes each pass of a C/R by partitioning work. We compare two file system configurations: a file-intensive (99% files) and a directory-intensive (50% directories) configuration. The x-axis shows the reduction in runtime with pFSCK, and the stacked bars show the runtime breakdown for each pass.

First, with a file-intensive configuration, as shown in Figure 6, the inode checking pass (Pass-1) shows a higher runtime compared to other passes. Second, our optimized e2fsck (e2fsck-opt) outperforms the vanilla e2fsck by optimizing the CRC mechanism, avoiding language localization overheads for every inode, and improving the readahead mechanism. Interestingly, both e2fsck and pFSCK outperform xfs_repair in all cases. Although xfs_repair is able to check inodes in parallel at the granularity of allocation groups, it is unable to check directory entries and link counts in parallel, which unfortunately dominates the checking time for large file systems.

Finally, enabling data parallelism within pFSCK reduces the runtime of the first pass by 2.1x with 4 threads. Data parallelism also reduces the runtime of the directory checking pass (Pass-2) by 1.8x, resulting in an overall C/R speedup of 1.9x. Beyond 4 threads, data parallelism does not scale due to higher serialization and lock contention overheads. Specifically, we find the functions that update shared structures such as the used/free block bitmap to be the most prominent source of bottlenecks that hinder scaling.

Figure 7: Data parallelism impact on a directory-intensive configuration. The x-axis shows different C/Rs and different thread configurations.

Next, for the directory-intensive file system, as shown in Figure 7, pFSCK parallelizes inode checking (in Pass-1) and directory checking (in Pass-2). The runtime of Pass-1 and Pass-2 reduces by 1.8x and 1.3x respectively, resulting in an overall C/R speedup of 1.4x compared to the vanilla e2fsck. Similar to the file-intensive configuration, we see a performance gain of 1.8x compared to xfs_repair, for which the directory metadata check is not parallelized. Finally, pFSCK's data parallelism does not scale beyond 4 cores.
We next evaluate the benefits of combining data and pipeline parallelism and the need for dynamic thread placement. To evaluate the performance improvement of pipeline parallelization, we compare pFSCK's data parallelism (pFSCK[datapara]) with data and pipeline parallelism (pFSCK[datapara+pipeline]) on file-intensive and directory-intensive file system configurations. When using pipeline parallelism, the threads of each pass add work to be processed in the next pass. Because of this, the logical boundaries between the passes within e2fsck diminish; hence, we only report the full runtime of C/R. Because e2fsck and pFSCK outperform xfs_repair in all cases, we do not show xfs_repair's results.

Figure 8a and Figure 8b show the results for a file-intensive and a directory-intensive file system configuration. The x-axis shows the increase in the number of threads used for the C/R. In the figures, we compare four cases: (1) pFSCK[datapara], which only uses data parallelism, running one pass at a time; (2) pFSCK[datapara+pipeline-split-equal], which combines pipeline and data parallelism and statically uses an equal number of threads for each of the simultaneously executing passes (e.g., 2 threads assigned to Pass-1 and 2 threads to Pass-2 in a 4-thread configuration); (3) pFSCK[datapara+pipeline-split-optimal], which represents the best manually selected thread configuration; and (4) pFSCK-sched, which dynamically assigns threads to each pass based on the amount of work to be done and current progress.

Figure 8: Comparison of pipeline parallelism and scheduler. (a) File-intensive FS on NVMe. (b) Directory-intensive FS on NVMe. (c) File-intensive FS on SSD.

First, for the file-intensive configuration, pFSCK[datapara+pipeline-split-equal] is not beneficial at low thread counts (2 and 4 threads) compared to the data parallelism-only approach, because threads are statically and equally assigned to Pass-1 and Pass-2 irrespective of the file system configuration. For a file-intensive configuration, most work is done in the inode checking pass (Pass-1); as a result, threads statically assigned to subsequent passes are under-utilized. Increasing the thread count (along the x-axis) improves pFSCK's performance only marginally compared to using data parallelism alone. In contrast, pFSCK[datapara+pipeline-split-optimal], the manually selected thread configuration, improves performance by up to 1.3x compared to data parallelism by assigning three-fourths of the threads to the inode checker. More importantly, the manually selected configuration avoids the high synchronization and contention overheads of data parallelism beyond four threads. Finally, pFSCK's scheduler (pFSCK-sched) automatically migrates threads based on the relative amount of outstanding work to be completed across each pass. As a result, pFSCK-sched can automatically provide optimal performance, improving performance by 1.1x compared to pFSCK[datapara+pipeline-split-optimal] and resulting in an overall speedup of 2.6x compared to vanilla e2fsck.

Next, for the directory-intensive file system, unlike the data parallelism-only approach, pipeline parallelism, which enables simultaneous inode and directory checking (without compromising correctness), provides some performance improvement as well. pFSCK[datapara+pipeline-split-optimal] reduces runtime by 1.1x with up to 8 threads compared to just using data parallelism.
When employing pFSCK's scheduler (pFSCK-sched), it automatically mitigates thread contention within directory block checking by limiting the number of threads assigned to each pass. Consequently, performance improves by 1.05x over pFSCK[datapara+pipeline-split-equal], resulting in an overall speedup of up to 1.6x over the vanilla e2fsck. Compared to the speedup with a file-intensive file system, because pFSCK employs delayed certification of directories, beyond 8 threads, threads processing dependent directories (a directory with subdirectories) must join, synchronize, and merge their work for correctness, limiting scalability.

Interestingly, although not shown, we see that e2fsck uses the most memory for directory-intensive file systems. To check an 800GB directory-intensive file system, e2fsck uses as much as 3GB of memory, as a significant amount of in-memory data structures is needed to track all the directory information required to verify the relationships between directories and the files and subdirectories that exist within them. Despite this, we find pFSCK's memory usage comparable, using at most 3.5GB, only a 1.17x increase in memory usage.

Lastly, Figure 8c shows pFSCK's performance on a file-intensive file system backed by an SSD. We see that pFSCK shows similar speedups of up to 2.1x over vanilla e2fsck despite SSDs having lower bandwidth capabilities compared to NVMe.
In summary, pFSCK’s pipeline parallelism reducesthe serialization bottlenecks of data parallelism, and the dy-namic thread placement reduces work imbalance, all leadingto significant performance gains.
We next evaluate the effectiveness of pFSCK's resource-aware scheduler (pFSCK-rsched) in reducing the impact on other applications compared to e2fsck. To illustrate the effectiveness, we pick a popular persistent key-value store, RocksDB [3], and use it to run a multithreaded workload alongside each system. We evaluate both an offline setting, where the checker and RocksDB operate on separate file systems, and an online setting, where the checker and RocksDB operate on the same file system. For each setting, we consider the following cases: (1) e2fsck-no-cpu-sharing, where e2fsck runs with RocksDB without sharing the same CPU cores; (2) e2fsck-cpu-sharing, where e2fsck runs with RocksDB while sharing the same CPU cores; (3) pFSCK-rsched-no-cpu-sharing, where pFSCK-rsched runs with RocksDB without sharing the same CPU cores; and (4) pFSCK-rsched-cpu-sharing, where pFSCK-rsched runs with RocksDB while sharing the same CPU cores. We run RocksDB with 12 threads. In the case of CPU sharing, we force e2fsck to share cores with RocksDB by restricting the affinity of all the threads to 12 cores, resulting in the overlap of one core. In the case of CPU sharing with pFSCK-rsched, we run pFSCK-rsched with 12 threads and restrict the affinity of all threads to 16 cores, resulting in the overlap of 8 cores. For brevity, we show only the results for checking a file-intensive file system.
Figure 9a shows the performance of e2fsck and pFSCK-rsched when sharing CPU cores with RocksDB. In this experiment, the C/R and RocksDB do not share the file system, and the C/R runs on an offline unmounted disk. On the y-axis, the results are normalized relative to no CPU sharing between the C/R and RocksDB as the baseline.

First, sharing CPUs between e2fsck and RocksDB impacts the performance of both (shown as e2fsck-cpu-share on the x-axis) compared to the baseline that does not share CPUs (e2fsck-no-cpu-share). When sharing CPUs, e2fsck's performance degrades by 1.2x and RocksDB's performance degrades by 1.5x compared to the baseline. E2fsck is context-switched to run periodically, taking effective CPU time away from RocksDB and introducing context-switching overhead.

Next, for pFSCK, the baseline pFSCK-rsched-no-cpu-share configuration shows the performance of pFSCK and RocksDB without CPU sharing, whereas pFSCK-rsched-cpu-share shows the performance with CPU sharing. For CPU sharing, we see that although we overlap 8 out of 16 cores, the performance of pFSCK-rsched and RocksDB does not degrade as significantly as in the e2fsck case. The performance of pFSCK-rsched and RocksDB degrades only by 1.07x and 1.05x, respectively; this is because pFSCK-rsched's ability to downscale the number of threads used for C/R prevents CPU time from being taken away from RocksDB and minimizes potential context-switching overhead. Because RocksDB mainly utilizes 12 out of the 16 cores for the majority of execution, the effective performance of pFSCK-rsched is equivalent to the performance of utilizing only 4-6 threads.
Figure 9b shows the results when the C/R and RocksDB share the CPU as well as the file system. As discussed earlier in Section 5.4.2, pFSCK utilizes LVM-based snapshots to capture changes to the file system and perform incremental C/R. Similar to our evaluation with separate file systems, we normalize all results to the baseline of e2fsck running with RocksDB without overlapping cores.

First, when overlapping e2fsck and RocksDB, performance significantly degrades, by 1.4x and 1.6x respectively. Interestingly, this is not only due to context switching between e2fsck and RocksDB but also due to the overheads of utilizing LVM snapshots. As discussed before, LVM preserves file system state for a snapshot by capturing all updates to the file system and making a copy of the original data when modified, for the entire duration the LVM snapshot is active.

Figure 9: Impact of resource-aware pFSCK for offline and online C/R. (a) Offline C/R. (b) Online C/R. Results shown for the file-intensive configuration.

In the case of e2fsck sharing CPUs with RocksDB, since CPU sharing and context switching naturally decrease the performance of e2fsck, the snapshot is active for a longer period of time, compounding the performance degradation for both e2fsck and RocksDB.

Despite the compounding performance degradation due to LVM snapshot overheads, the performance degradation when co-running pFSCK-rsched with RocksDB is minimal for the following reasons: (1) pFSCK-rsched's parallelization reduces C/R runtime compared to e2fsck, minimizing the amount of time the snapshot is active and reducing performance degradation for the C/R and RocksDB. (2) In the case of overlapping cores, similar to the offline setting, pFSCK-rsched mitigates significant performance impact by scaling the number of threads it uses, limiting the performance degradation of both pFSCK-rsched and RocksDB to only 1.2x. Compared to the offline setting, this increase in degradation from 1.07x for pFSCK-rsched and 1.05x for RocksDB to 1.2x for both is due to the compounding overheads of using LVM snapshots.
Summary.
To summarize, pFSCK-rsched’s resource aware-ness is able to effectively adapt its number of threads to maxi-mize the utilization of available cores (and performance) inboth an offline and online setting while effectively minimizingthe amount of impact on RocksDB.
With the goal of accelerating file system checking and repair tools, in this paper we propose pFSCK, a parallel C/R tool that exploits CPU parallelism and the high bandwidth capabilities of modern storage to accelerate file system checking and repair time without compromising correctness. pFSCK explores fine-grained parallelism by assigning threads to inodes, blocks, or directories, and efficiently performs C/R using data parallelism within each pass and pipeline parallelism across multiple passes. In addition, pFSCK enables efficient thread management techniques to adapt to varying file system configurations as well as to minimize the performance impact on other applications. Evaluation of pFSCK shows more than 2.6x gains over e2fsck and 1.8x over xfs_repair, which provides coarse-grained parallelism.

References

[1] Disk check takes too long to check. linuxquestions.org.
[2] e2scrub: online fsck for ext4.
[3] Facebook RocksDB. http://rocksdb.org/.
[4] Intel-Micron Memory 3D XPoint. http://intel.ly/1eICR0a.
[5] Linux sched() man page.
[6] Extremely long time for an ext4 fsck. stackexchange, Mar 2013.
[7] File system check (fsck) is slow and running for a very long time, Sep 2016.
[8] Abutalib Aghayev, Sage Weil, Michael Kuchnik, Mark Nelson, Gregory R. Ganger, and George Amvrosiadis. File systems unfit as distributed storage backends: lessons from 10 years of Ceph evolution. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 353–369, 2019.
[9] Ramnatthan Alagappan, Aishwarya Ganesan, Yuvraj Patel, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Correlated crash vulnerabilities. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 151–167, Berkeley, CA, USA, 2016. USENIX Association.
[10] Lakshmi N. Bairavasundaram, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Garth R. Goodson, and Bianca Schroeder. An analysis of data corruption in the storage stack. ACM Transactions on Storage (TOS), 4(3):8, 2008.
[11] Lakshmi N. Bairavasundaram, Garth R. Goodson, Shankar Pasupathy, and Jiri Schindler. An analysis of latent sector errors in disk drives. In Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '07, 2007.
[12] Vijay Chidambaram, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Optimistic crash consistency. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 228–243. ACM, 2013.
[13] Pradeep Fernando, Sudarsun Kannan, Ada Gavrilovska, and Karsten Schwan. Phoenix: Memory speed HPC I/O with NVM. In , pages 121–131. IEEE Computer Society, 2016.
[14] Daniel Fryer, Kuei Sun, Rahat Mahmood, TingHao Cheng, Shaun Benjamin, Ashvin Goel, and Angela Demke Brown. Recon: Verifying file system consistency at runtime. ACM Transactions on Storage (TOS), 8(4):1–29, 2012.
[15] Om Rameshwar Gatla, Muhammad Hameed, Mai Zheng, Viacheslav Dubeyko, Adam Manzanares, Filip Blagojević, Cyril Guyot, and Robert Mateescu. Towards robust file system checkers. In , pages 105–122, Oakland, CA, February 2018. USENIX Association.
[16] Gawatu. Resilient file system (ReFS) overview.
[17] Haryadi S. Gunawi, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. SQCK: A declarative file system checker.
[18] Haryadi S. Gunawi, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. SQCK: A declarative file system checker. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI'08, pages 131–146, Berkeley, CA, USA, 2008. USENIX Association.
[19] Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, Caitie McCaffrey, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree Jacobson, Robert Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, and Huaicheng Li. Fail-slow at scale: Evidence of hardware performance faults in large production systems. In , pages 1–14, Oakland, CA, 2018. USENIX Association.
[20] Michael Hasenstein. The Logical Volume Manager (LVM). White paper, 2001.
[21] Val Henson, Zach Brown, and Arjan van de Ven. Reducing fsck time for ext2 file systems. 04 2019.
[22] Val Henson, Amit Gud, Arjan van de Ven, and Zach Brown. Chunkfs: Using divide-and-conquer to improve file system reliability and repair. In Proceedings of the Second Conference on Hot Topics in System Dependability, HotDep'06, pages 7–7, Berkeley, CA, USA, 2006. USENIX Association.
[23] Val Henson, Arjan van de Ven, Amit Gud, and Zach Brown. Chunkfs: Using divide-and-conquer to improve file system reliability and repair. In HotDep, 2006.
[24] Shehbaz Jaffer, Stathis Maneas, Andy Hwang, and Bianca Schroeder. Evaluating file system reliability on solid state drives. In USENIX Annual Technical Conference (USENIX ATC), pages 783–798, 2019.
[25] Shehbaz Jaffer, Stathis Maneas, Andy Hwang, and Bianca Schroeder. Evaluating file system reliability on solid state drives. In , pages 783–798, Renton, WA, July 2019. USENIX Association.
[26] Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap, Taesoo Kim, Aasheesh Kolli, and Vijay Chidambaram. SplitFS: Reducing software overhead in file systems for persistent memory. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 494–508, 2019.
[27] Sudarsun Kannan, Nitish Bhat, Ada Gavrilovska, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Redesigning LSMs for non-volatile memory with NoveLSM. In Haryadi S. Gunawi and Benjamin Reed, editors, , pages 993–1005. USENIX Association, 2018.
[28] Ram Kesavan, Harendra Kumar, and Sushrut Bhowmik. WAFL Iron: Repairing live enterprise file systems. In , pages 33–48, Oakland, CA, February 2018. USENIX Association.
[29] Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, and Thomas Anderson. Strata: A cross media file system. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP '17, 2017.
[30] Changman Lee, Dongho Sim, Joo-Young Hwang, and Sangyeun Cho. F2FS: A new file system for flash storage. In Proceedings of the 13th USENIX Conference on File and Storage Technologies, FAST'15, Santa Clara, CA, 2015.
[31] W. Li, Y. Yang, J. Chen, and D. Yuan. A cost-effective mechanism for cloud data reliability management based on proactive replica checking. In , pages 564–571, 2012.
[32] M. Lu, T. Chiueh, and S. Lin. An incremental file system consistency checker for block-level CDP systems. In , pages 157–162, Oct 2008.
[33] Ao Ma, Chris Dragga, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Marshall Kirk McKusick. ffsck: The fast file-system checker. Trans. Storage, 10(1):2:1–2:28, January 2014.
[34] Marshall K. McKusick. Improving the performance of fsck in FreeBSD. ;login:, 38(2), 2013.
[35] Marshall Kirk McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. Fsck: the UNIX file system check program. Unix System Manager's Manual, 4.3 BSD Virtual VAX-11 Version, 1986.
[36] Mtanski. mtanski/xfsprogs. github.com/mtanski/xfsprogs/preadv2/repair, Feb 2015.
[37] Jiaxin Ou, Jiwu Shu, and Youyou Lu. A high performance file system for non-volatile main memory. In Proceedings of the Eleventh European Conference on Computer Systems, pages 1–16, 2016.
[38] Henry Qin, Qian Li, Jacqueline Speiser, Peter Kraft, and John Ousterhout. Arachne: Core-aware thread management. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'18, pages 145–160, Berkeley, CA, USA, 2018. USENIX Association.
[39] Omar Sandoval. A survey of bugs in the Btrfs filesystem. https://courses.cs.washington.edu/courses/cse551/15sp/projects/osandov.pdf.
[40] Ric Wheeler. fs_mark.
[41] Jian Xu and Steven Swanson. NOVA: A log-structured file system for hybrid volatile/non-volatile main memories. In Proceedings of the 14th USENIX Conference on File and Storage Technologies, FAST'16, 2016.
[42] Mai Zheng, Joseph Tucek, Feng Qin, Mark Lillibridge, Bill W. Zhao, and Elizabeth S. Yang. Reliability analysis of SSDs under power fault. ACM Trans. Comput. Syst., 34(4):10:1–10:28, November 2016.