Understanding and taming SSD read performance variability: HDFS case study
María F. Borge
University of Sydney
Florin Dinu
University of Sydney
Willy Zwaenepoel
EPFL and University of Sydney
Abstract
In this paper we analyze the influence that lower layers (file system, OS, SSD) have on HDFS' ability to extract maximum performance from SSDs on the read path. We uncover and analyze three surprising performance slowdowns induced by lower layers that result in HDFS read throughput loss. First, intrinsic slowdown affects reads from every new file system extent for a variable amount of time. Second, temporal slowdown appears temporarily and periodically and is workload-agnostic. Third, in permanent slowdown, some files can individually and permanently become slower after a period of time.

We analyze the impact of these slowdowns on HDFS and show significant throughput loss. Individually, each of the slowdowns can cause a read throughput loss of 10-15%. However, their effect is cumulative. When all slowdowns happen concurrently, read throughput drops by as much as 30%. We further analyze mitigation techniques and show that two of the three slowdowns could be addressed via increased IO request parallelism in the lower layers. Unfortunately, HDFS cannot automatically adapt to use such additional parallelism. Our results point to a need for adaptability in storage stacks. The reason is that an access pattern that maximizes performance in the common case is not necessarily the same one that can mask performance fluctuations.
Layering is a popular way to design big data storage stacks [24]. Distributed file systems (HDFS [25], GFS [9]) are usually layered on top of a local file system (ext4 [19], Btrfs [23], F2FS [17], XFS [28], ZFS [2]) running in an OS. This approach is desirable because it encourages rapid development by reusing existing functionality. An important goal for such storage stacks is to extract maximum performance from the underlying storage. With respect to this goal, layering is a double-edged sword [22]. On one hand, the OS and the file system can compensate for and mask inefficiencies found in hardware. On the other hand, they can introduce their own performance bottlenecks and sources of variability [6, 21, 26].

In this paper we perform a deep dive into the performance of a critical layer in today's big data storage stacks, namely the Hadoop Distributed File System (HDFS [25]). We particularly focus on the influence that lower layers have on HDFS' ability to extract the maximum throughput that the underlying storage is capable of providing. We focus on the HDFS read path and on SSD as a storage medium. We use ext4 as the file system due to its popularity. The HDFS read path is important because it can easily become a performance bottleneck for an application's input stage, which accesses slower storage media compared to later stages, which are often optimized to work fully in memory [31].

Central to our exploration is the storage access pattern of a single HDFS read request: single-threaded, sequential access to large files (hundreds of megabytes) using buffered IO. This access pattern is simple but tried and tested and has remained unchanged since the beginnings of HDFS. It is increasingly important to understand whether a single HDFS read using this access pattern can consistently extract by itself the maximum throughput that the underlying storage can provide. As many big data processing systems are heavily IO provisioned and the ratio of cores to disks reaches 1 [1], relying on task-level parallelism to generate enough parallel requests to saturate storage is no longer sufficient. Moreover, relying on such parallelism is detrimental to application performance since each of the concurrent HDFS reads is served slower than it would be in isolation.

With proper parameter configuration, the HDFS read access pattern is sufficient to extract maximum performance out of our SSDs. However, we uncover and analyze three surprising performance slowdowns that can affect the HDFS read path at different timescales (short, medium and long) and result in throughput loss. All three slowdowns are caused by lower layers (file system, SSDs). Furthermore, our analysis shows that the three slowdowns can affect not only HDFS but any application that uses buffered reads. As opposed to related work that shows worn out SSDs can cause various performance problems [3, 10], our slowdowns occur on SSDs with very low usage throughout their lifetime.

The first slowdown, which we call intrinsic slowdown, affects HDFS reads at short time scales (seconds). HDFS read throughput drops at the start of every new ext4 extent read. A variable number of IO requests from the start of each extent are served with increased latency regardless of the extent size. The time necessary for throughput to recover is also variable as it depends on the number of requests affected.

The second slowdown, which we call temporal slowdown, affects HDFS reads at medium time scales (tens of minutes to hours).
Tail latencies inside the drive increase periodically and temporarily and cause HDFS read throughput loss. While this slowdown may be confused with write-triggered SSD garbage collection [8], we find, surprisingly, that it appears in a workload-agnostic manner.

The third slowdown, which we call permanent slowdown, affects HDFS reads at long time scales (days to weeks). After a period of time, HDFS read throughput from a file permanently drops and never recovers for that specific file. Importantly, this is not caused by a single drive-wide malfunction event but rather it is an issue that affects files individually and at different points in time. The throughput loss is caused by latency increases inside the drive, but, compared to temporal slowdown, all requests are affected, not just the tail.

Interestingly, we find that two of the three slowdowns can be completely masked by increased parallelism in the lower layers, yet HDFS cannot trigger this increased parallelism and performance suffers. Our results point to a need for adaptability in storage stacks. An access pattern that maximizes performance in the common case is not necessarily the one that can mask unavoidable hardware performance fluctuations.

With experiments on 3 SSD based systems, we show that each of the slowdowns we identified can individually introduce at least 10-15% HDFS read throughput degradation. The effect is cumulative. In the worst case, all slowdowns can overlap, leading to a 30% throughput loss.

The contributions of the paper are as follows:

• We identify and analyze the intrinsic slowdown affecting reads at the start of every new ext4 extent for a variable amount of time.

• We identify and analyze the temporal slowdown that affects reads temporarily and periodically while being workload-agnostic.

• We identify and analyze the permanent slowdown affecting reads from an individual file over time.

• We analyze the impact of these slowdowns on HDFS performance and show significant throughput loss.

• We analyze mitigation techniques and show that two of the three slowdowns can be addressed via increased parallelism.

The rest of the paper continues as follows. §2 provides background on HDFS and details our configuration, §3 describes our experimental setup, §§4-6 introduce and analyze the intrinsic, temporal and permanent slowdowns, §7 discusses the implications of our findings, §8 presents related work, and §9 concludes.

HDFS is a user-level distributed file system that runs layered on top of a local file system (e.g. ext4, ZFS, Btrfs). An HDFS file is composed of several blocks, and each block is stored as one separate file in the local file system. Blocks are large files; a common size is 256 MB.

The HDFS design follows a server-client architecture. The server, called the NameNode, is a logically centralized metadata manager that also decides file placement and performs load balancing. The clients, called DataNodes, work at the level of HDFS blocks and provide block data to requests coming from compute tasks.

To perform an HDFS read request, a compute task (e.g. a Hadoop mapper) first contacts the NameNode to find all the DataNodes holding a copy of the desired data block. The task then chooses a DataNode and asks for the entire block of data. The DataNode reads data from the drive and sends it to the task. The size of the DataNode's reads is controlled by the parameter io.file.buffer.size (default 64KB). The DataNode sends as much data as allowed by the OS socket buffer sizes. A DataNode normally uses the sendfile system call to read data from a drive and send it to the task. This approach is general and handles both local tasks (i.e. on the same node as the DataNode) as well as remote tasks.
As an optimization, a task can bypass the DataNode for local reads (short-circuit reads) and read data directly using standard read system calls.
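To make the data path concrete, the following minimal C sketch (our illustration, not actual HDFS code) shows the shape of a DataNode-style worker serving one block file over a socket with sendfile; the chunk parameter plays the role of io.file.buffer.size:

    /* Minimal sketch of a DataNode-style read path: stream one block file
     * to a client socket with sendfile(2), in chunks that play the role of
     * io.file.buffer.size. Illustrative only; not actual HDFS code. */
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Returns 0 on success, -1 on error. */
    int serve_block(int client_sock, const char *block_path, size_t chunk) {
        int fd = open(block_path, O_RDONLY);
        if (fd < 0) { perror("open"); return -1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return -1; }

        off_t off = 0;
        while (off < st.st_size) {
            /* sendfile copies file data to the socket inside the kernel,
             * going through the page cache (buffered IO). */
            ssize_t n = sendfile(client_sock, fd, &off, chunk);
            if (n < 0) { perror("sendfile"); close(fd); return -1; }
        }
        close(fd);
        return 0;
    }

Short-circuit reads follow the same sequential pattern, but with the task itself issuing plain read system calls on the block file.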
We now summarize the main characteristics of the HDFS read access pattern since this pattern is central to our work.

• Single-threaded. One request from a compute task is handled by a single worker thread in the DataNode.

• Large files. HDFS blocks are large files (e.g. 128MB, 256MB). Each HDFS block is a separate file in the local file system.

• Sequential access. HDFS reads access data sequentially for performance reasons.

• Buffered IO. HDFS uses buffered IO in the DataNode (via sendfile or read system calls).
We now detail the configuration changes we made to alleviate network and compute bottlenecks affecting HDFS and to eliminate sources of interference. As a result, we observed that HDFS can extract the maximum performance that our SSDs can generate.
File system configuration.
We disabled access time updates, directory access time updates, and data-ordered journaling (we use write-back journaling) in ext4. This removes sources of interference so that we can profile HDFS in isolation.
OS configuration.
Small socket buffer sizes limit the number of packets that the DataNode can send to a task and thus reduce performance by interrupting disk reads and inducing disk idleness. We increase the socket buffer sizes for both reads and writes to match the size of the HDFS blocks.
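The change amounts to a setsockopt call on the DataNode-task connection. A minimal sketch, assuming a 256MB block size (note that Linux clamps these values to the net.core.rmem_max and net.core.wmem_max sysctls, which must be raised as well for the full size to take effect):

    /* Sketch: enlarge socket send/receive buffers to the HDFS block size
     * (256MB assumed here). The kernel clamps these values to
     * net.core.wmem_max / net.core.rmem_max. */
    #include <sys/socket.h>
    #include <stdio.h>

    int enlarge_socket_buffers(int sock) {
        int size = 256 * 1024 * 1024;
        if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0 ||
            setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0) {
            perror("setsockopt");
            return -1;
        }
        return 0;
    }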
HDFS configuration.
We configured io.file.buffer.size to be equal to the HDFS block size. The default value of this parameter (64KB) results in too many sendfile system calls, which in turn create a lot of context switching between user and kernel space, which results in idleness for the IO device. We modified the HDFS code to allow the parameter to be set to 256MB, as by default the maximum size is 32MB.
An important goal for HDFS is to maximize the throughput obtained from the storage devices. One way to achieve this is via multi-threading in the DataNode. This is already part of the design, as different DataNode threads can serve different task requests concurrently. While this can maximize device throughput, it does so at the expense of single-thread performance, which reduces task performance.

State-of-the-art data processing systems are heavily IO provisioned, with a ratio of CPU to disk close to 1 [1]. In this context, relying on parallelism to make the most of the storage is unlikely to help because the number of tasks is roughly the same as the number of disks (tasks are usually scheduled on a separate core). As a result, it is important to understand and ensure that the HDFS access pattern (single-threaded, large files, sequential access, buffered IO) can by itself extract maximum performance from SSDs.
In this section, we describe the hardware and software settings, tools, workloads and metrics used in our analysis.
Hardware.
We use three types of machines:
Machine A has 2 Intel Xeon 2.4GHz E5-2630v3 processors, with 32 cores in total, 128GB of RAM, and a 450GB Intel DC S3500 Series (MLC) SATA 3.0 SSD.
Machine B has 4 Intel Xeon 2.7GHz E5-4650 processors, with 32 cores in total, 1.5TB of RAM, and an 800GB HP 6G Enterprise SATA 3.0 SSD.
Machine C has 2 Intel Xeon 2.4GHz E5-2630v3 processors, with 32 cores in total, 128GB of RAM, and a 512GB Samsung 860 Pro (V-NAND) SATA 3.0 SSD.

Our SSDs have been very lightly used throughout their lifetimes. After concluding our analysis we computed the total lifetime reads and writes performed on the drives using the "sectors read" and "sectors written" fields in /proc/diskstats in Linux. The value was less than 1TB for both reads and writes for each drive. This is orders of magnitude less than the manufacturer-provided guarantees for SSDs. Thus, past heavy use of the drives is not a factor in our findings. Moreover, the disk utilization of our SSDs in the experiments is very low, under 20%.
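For reference, the lifetime figures can be computed as in the sketch below (our own code, with "sda" as a placeholder device name). Fields 6 and 10 of each /proc/diskstats line are sectors read and sectors written, and a sector in this interface is 512 bytes:

    /* Sketch: compute lifetime bytes read/written for a drive from
     * /proc/diskstats. "sda" is a placeholder device name. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        FILE *f = fopen("/proc/diskstats", "r");
        if (!f) { perror("fopen"); return 1; }

        char line[512], name[64];
        unsigned major, minor;
        unsigned long long rd_ios, rd_merges, rd_sectors, rd_ms;
        unsigned long long wr_ios, wr_merges, wr_sectors;
        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "%u %u %63s %llu %llu %llu %llu %llu %llu %llu",
                       &major, &minor, name, &rd_ios, &rd_merges, &rd_sectors,
                       &rd_ms, &wr_ios, &wr_merges, &wr_sectors) == 10 &&
                strcmp(name, "sda") == 0) {
                printf("read: %.2f GB, written: %.2f GB\n",
                       rd_sectors * 512.0 / 1e9, wr_sectors * 512.0 / 1e9);
            }
        }
        fclose(f);
        return 0;
    }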
Software.
We use Ubuntu 16.04 with Linux kernel version 4.4.0. As the local file system we use ext4, one of the most popular Linux file systems. We use HDFS version 2.7.1.
Monitoring Tools.
To monitor IO at the storage device, we rely on block layer measurements using blktrace and blkparse. Blktrace collects IO information at the block layer, while blkparse makes the traces human-readable. Where necessary we use perf and strace to analyze program behavior.
Workloads.
We use HDFS via Hadoop, where we run a simple WordCount job. The exact type of Hadoop job is inconsequential for our findings because we have already decoupled HDFS performance from software bottlenecks in Section 2.3. We modified the Hadoop WordCount job to not write any output so that we can reliably measure read performance. We use the FIO tool for analysis beyond HDFS. The data read by HDFS (or FIO) is composed of randomly generated strings and is divided into 8 ext4 files of 256MB each. Our experiments consist in reading (with Hadoop or FIO) the 8 ext4 files repeatedly over 24 hours. We ensure that all experiments run in isolation and are not affected by interference from any other processes.
Presenting slowdowns.
For every slowdown we are able to separate its effect and present results for periods with and without that slowdown. The results without a slowdown include the effects of all other slowdowns that occur at shorter timescales. For example, when comparing results with or without temporal slowdown, the results include the effect of intrinsic slowdown but not that of permanent slowdown. This is acceptable because the effect of a slowdown is roughly constant over longer periods of time.
What we measure.
We measure performance at the HDFS DataNode level. We measure the throughput of HDFS reads and the latency of individual block layer IO requests. We do not measure end-to-end performance related to Hadoop tasks. Before every experiment we drop all caches to ensure reads actually come from the drive.

The Hadoop tasks are collocated with the HDFS DataNodes on the same machines. The DataNodes send data to the tasks via the loopback interface using the sendfile system call. We also analyzed short-circuit reads, which enable Hadoop tasks to read local input directly (using standard read calls) by completely bypassing HDFS, but the findings remained the same.
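The cache-dropping step corresponds to writing "3" to /proc/sys/vm/drop_caches (page cache plus dentries and inodes), as in this minimal sketch (requires root):

    /* Sketch: drop the page, dentry and inode caches before an experiment
     * so that reads are served from the drive. sync(2) first so dirty
     * pages are flushed rather than discarded. Requires root. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int drop_caches(void) {
        sync();  /* flush dirty pages */
        int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
        if (fd < 0) { perror("open"); return -1; }
        int ok = (write(fd, "3\n", 2) == 2) ? 0 : -1;
        close(fd);
        return ok;
    }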
Generating single requests.
In our experiments using buffered IO, two requests overlap in the device driver. In such a case, increases in request latency could be caused either by drive internals or by a sub-optimal request overlap. To distinguish such cases we tweak buffered IO to send one request at a time. To send a single request of size X we first set the Linux read-ahead size to X KB by tuning /sys/block/<device>/queue/read_ahead_kb. We then use the dd command to read one chunk of size X (dd bs=X count=1). The influence of the read-ahead size on block layer request size is known and discussed in related work [13].

In the rest of the paper, the word "file" refers to one 256MB ext4 file. In HDFS parlance this represents one HDFS block.
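A minimal C equivalent of the single-request technique above, with the device name, file path and a 256KB request size as placeholders (requires root for the sysfs write):

    /* Sketch of the single-request technique: set the block-layer
     * read-ahead to X KB via sysfs, then issue one buffered read of X KB
     * from the start of the file (the equivalent of `dd bs=X count=1`).
     * Paths and the 256KB size are placeholders for our setup. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        /* 1. Set read-ahead so one buffered read maps to one block-layer request. */
        int ra = open("/sys/block/sda/queue/read_ahead_kb", O_WRONLY);
        if (ra < 0) { perror("open read_ahead_kb"); return 1; }
        dprintf(ra, "%d\n", 256);
        close(ra);

        /* 2. Read one 256KB chunk; blktrace should now show a single request. */
        size_t x = 256 * 1024;
        char *buf = malloc(x);
        int fd = open("/path/to/ext4/file", O_RDONLY);
        if (fd < 0) { perror("open file"); return 1; }
        ssize_t n = read(fd, buf, x);
        printf("read %zd bytes in one request\n", n);
        close(fd);
        free(buf);
        return 0;
    }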
File throughput.
The number of bytes read from the target file, divided by the period of time. The period starts with the submission of the first block layer IO request in the file (as timestamped by blktrace) and finishes with the completion of the last block layer IO request in the file (as timestamped by blktrace). During this period we only count time when the disk is active, i.e. there is at least one IO request being serviced by or queued for the drive. This metric removes the impact of disk idle time caused by context switches between user and kernel space in the application. Our HDFS results show no disk idle time after applying the changes in Section 2.3. Nevertheless, disk idle time appears in FIO and we chose to discard it for a fair comparison to HDFS. Overall, the disk idle time does not influence our main findings.
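Stated as a formula (our notation): if B is the number of bytes read from the file and [s_1, e_1], ..., [s_n, e_n] are the maximal intervals during which at least one IO request for the file is queued or being serviced, then

    file throughput = B / Σ_i (e_i − s_i)

so idle gaps between the intervals are excluded from the denominator.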
Request Latency.
The time between the timestamp when a block layer request is sent to the drive (D symbol in blktrace) and the timestamp of its completion (C symbol in blktrace). Both timestamps are taken from blktrace.
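As an illustration of how such latencies can be extracted, the following sketch (ours, not part of blktrace) pairs D and C events by starting sector in the default blkparse text output; a linear scan over pending requests is enough at these request rates:

    /* Sketch: compute per-request latency (D-to-C) from default blkparse
     * text output read on stdin. Events are matched by starting sector. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_PENDING 1024

    int main(void) {
        unsigned long long pend_sector[MAX_PENDING];
        double pend_time[MAX_PENDING];
        int npend = 0;
        char line[512];

        while (fgets(line, sizeof(line), stdin)) {
            char dev[16], action[8], rwbs[8];
            int cpu; long seq, pid;
            double ts; unsigned long long sector;
            /* default line: dev cpu seq time pid action rwbs sector + nblocks */
            if (sscanf(line, "%15s %d %ld %lf %ld %7s %7s %llu",
                       dev, &cpu, &seq, &ts, &pid, action, rwbs, &sector) != 8)
                continue;
            if (strcmp(action, "D") == 0 && npend < MAX_PENDING) {
                pend_sector[npend] = sector;
                pend_time[npend] = ts;
                npend++;
            } else if (strcmp(action, "C") == 0) {
                for (int i = 0; i < npend; i++) {
                    if (pend_sector[i] == sector) {
                        printf("sector %llu latency %.3f ms\n",
                               sector, (ts - pend_time[i]) * 1000.0);
                        pend_sector[i] = pend_sector[--npend];  /* swap-remove */
                        pend_time[i] = pend_time[npend];
                        break;
                    }
                }
            }
        }
        return 0;
    }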
Fragmentation.
The number of extents an ext4 file has. Note that all of our files are 256MB. The maximum extent size in ext4 is 128MB. Therefore, the minimum possible number of extents in a file is 2.
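Extent counts can be obtained with filefrag or, programmatically, with the FIEMAP ioctl that filefrag uses underneath; a minimal sketch:

    /* Sketch: count the ext4 extents of a file with the FIEMAP ioctl
     * (what filefrag(8) does under the hood). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int count_extents(const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return -1; }

        /* Room for up to 1024 extents; a 256MB file needs at least 2. */
        size_t sz = sizeof(struct fiemap) + 1024 * sizeof(struct fiemap_extent);
        struct fiemap *fm = calloc(1, sz);
        fm->fm_start = 0;
        fm->fm_length = FIEMAP_MAX_OFFSET; /* map the whole file */
        fm->fm_extent_count = 1024;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
            perror("ioctl"); close(fd); free(fm); return -1;
        }
        int n = fm->fm_mapped_extents;
        close(fd);
        free(fm);
        return n;
    }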
Recovery Time.
The period of time during which IO requests have higher than usual latency due to intrinsic slowdown. This is measured starting from the first IO read request of an ext4 extent until either the latency of the requests decreases to normal or the extent is fully read, whichever comes first.
In this section, we introduce the intrinsic slowdown, a performance degradation that predictably affects files at short time scales (seconds to minutes). This slowdown is related to logical file fragmentation. Every time a new file system extent is read, a number of IO requests from the start of the extent are served with increased latency, and that is correlated with throughput drops. Interestingly, even un-fragmented files are affected, since a file has to have at least one extent and every extent is affected by the slowdown. The more fragmented a file is, the more extents it has, and the bigger the impact of the slowdown.

Intrinsic slowdown appears on all the machines we tested and causes a drop in throughput of 10-15% depending on the machine. The slowdown lasts a variable amount of time but there is no correlation with extent size. This slowdown affects not only HDFS but all applications using buffered IO.

The remainder of this section presents an overview of the throughput loss, an analysis of the results, a discussion of causes and an analysis of mitigation strategies.
Figure 1 illustrates the influence that an increased number of extents has on throughput for each of the 3 machines. In this figure, each point represents the average file throughput of a set of files with the same number of extents on one machine. Files were created using ext4's default file allocation policies, so we had no control over the number of extents each file was allocated. We observed that ext4 allocations result in highly variable fragmentation levels even on our drives, which were less than 20% full. We often saw cases where one file was allocated 30 extents and a file created seconds after was allocated 2 extents. A thorough analysis of ext4 allocation patterns is, however, beyond the scope of this work.

The figure shows that an increase in fragmentation is correlated with a loss in throughput. This finding holds on all 3 machines but the magnitude of the throughput loss is different because the SSDs are different. With 29 extents, throughput drops by roughly 13% for machines A and B, but by less than 5% for machine C. There is a limit to the throughput loss, and that is best exemplified by the fact that throughput drops very slowly for machines A and B after 20 extents. The reason is that the extents are smaller but the recovery period is not correlated with the extent size, so a very large percentage of the IO requests is affected by the slowdown.
Correlations.
We next analyze IO request latency. Figure 2 presents the request latencies on machines A, B and C during an HDFS read. The dashed lines correspond to request latencies after the intrinsic slowdown disappeared, while the continuous ones show latencies during the slowdown. Latencies increase during slowdown both at the median but especially in the tail. Machine C shows both the smallest latencies and the smallest degradation, and this is due to the fact that its SSD is based on a different technology (V-NAND).

Figure 1: HDFS read. Average file throughput vs number of extents.

Figure 2: HDFS read. Request latency CDFs during and after intrinsic slowdown.

The above latencies are with the standard buffered IO configuration in which 2 requests overlap in the device driver. We also measured (not illustrated) the request latency when sending one single request at a time. We find that latency remains unaffected even during the parts of the extent that are normally affected by intrinsic slowdown. This suggests that the latency increase and throughput loss are not solely due to drive internals. This is expected, as the SSD FTL has no notion of extents, which are a file system construct.

We find that intrinsic slowdown is correlated with a sub-optimal request overlap in the device. Consider a request R_i and let S_i and E_i be its enqueueing and completion times. With buffered IO, the execution of R_i overlaps with the final part of R_{i-1} and the first part of R_{i+1}, R_{i-1} being the previous request and R_{i+1} the next. We have that S_i < E_{i-1} < S_{i+1} < E_i. We find that periods of intrinsic slowdown are correlated with an imbalanced overlap, that is, R_i overlaps much more with either R_{i-1} or R_{i+1}. In other words, an imbalanced overlap occurs when T_imb = abs((E_{i-1} − S_i) − (E_i − S_{i+1})) is large. For example, if R_{i-1} completes 0.9ms after R_i is enqueued while R_i completes only 0.1ms after R_{i+1} is enqueued, then T_imb = 0.8ms and R_i overlaps mostly with its predecessor. To exemplify, Figure 3 shows the correlation between T_imb and request latency for a sample extent. For the first 20 requests, the overlap is sub-optimal and latency suffers. The overlap imbalance is corrected around request 22 and soon after latency drops under 1ms, which is the latency we normally see outside of intrinsic slowdown.

Figure 3: Correlation between increased latency and request overlap imbalance.
Characterization of recovery periods.
We next analyze the duration and variability of the recovery periods. There are two main insights. First, even for a single extent size and one machine, there can be significant variation in the duration of the recovery period. Second, the duration of the recovery period is not correlated to the extent size.

Figure 4 shows CDFs of the duration of the recovery period on the 3 machines. For each machine we show large extents (128MB) with continuous lines and smaller extents (32-40 MB) with dashed lines. We aggregated results from extents from multiple files if they have the target size and reside on the same machine. We measure the recovery duration in number of requests. The request size is 256KB.

The CDFs for any one of the machines show a similar pattern in the recovery period despite the different extent sizes. Therefore, extent size is not a factor with respect to the duration of the recovery period.

There is significant variability in the recovery period for every extent size on machines A and B. The worst-case recovery duration is more than 5x that of the best case. In contrast, machine C shows much less variability.

If we compute the recovery period relative to extent size (not illustrated), we find that for the smallest extents (e.g. 8MB) it is common for at least 50% of the requests in the extent to be affected by intrinsic slowdown. In the worst case, we have seen 90% of an extent being affected.
Discussion on internal SSD root cause.
Since we do not have access to the proprietary SSD FTL design, we cannot directly search for the root cause internal to the drive. We believe that sub-optimal request overlap leads to throughput loss because it forces the drive to be inefficient by serving both overlapping requests in parallel when the most efficient strategy would sometimes be to focus on the oldest one first. The request stream enters this state due to the initial requests at the start of the extent. The stream self-corrects by eventually reaching the optimal (balanced) request overlap and remaining there. The software does not help in the correction, as it functions in a reactive manner. It sends a new request as soon as one completes. The self-correction happens solely due to timing, based on the request latencies. This also explains the variability in the recovery periods.

Figure 4: HDFS read. CDFs of the number of requests executed in the recovery period for large and small extents.

Figure 5: FIO direct IO read. Average file throughput vs number of extents.
We consider mitigation strategies that are more aggressive in generating request-level parallelism, in the hope that they could compensate for the loss in throughput due to the slowdown. We find that both direct IO as well as increasing the number of requests sent in parallel with buffered IO can mask intrinsic slowdown.

Figure 5 compares average file throughput vs number of extents when using direct IO, across different machines. The files are the same as in Figure 1. The figure shows that average throughput is maintained across different numbers of extents with direct IO. The tendency holds across all machines tested. In other words, direct IO can mask intrinsic slowdown. The reason is that by sending more and larger requests, direct IO better leverages the device parallelism. We observe the same effect when increasing parallelism in buffered IO by increasing the read-ahead size. This setting results in both larger requests as well as more requests being sent in parallel to the drive.
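For reference, the direct IO variant of the read path amounts to opening the file with O_DIRECT and using aligned buffers. A minimal sketch with a 1MB request size (the alignment requirement is drive- and kernel-dependent; 4KB is a safe assumption on our setup):

    /* Sketch: sequential read of a file with direct IO, which masked the
     * intrinsic slowdown on our drives. O_DIRECT requires aligned buffers,
     * offsets and sizes. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int read_direct(const char *path) {
        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return -1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 1 << 20)) { close(fd); return -1; }

        ssize_t n;
        while ((n = read(fd, buf, 1 << 20)) > 0)
            ;  /* consume data */
        if (n < 0) perror("read");

        free(buf);
        close(fd);
        return n < 0 ? -1 : 0;
    }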
Figure 6: HDFS read. File throughput timeline on Machine A.
In this section, we introduce the temporal slowdown, a periodic and temporary performance degradation that affects files at medium timescales (minutes to hours). At a high level, the pattern in which temporal slowdown manifests might be confused with write-induced SSD garbage collection (GC). However, temporal slowdown is not always GC. Surprisingly, on machine A, it always manifests, even in read-only workloads. On machine B, it is indeed triggered by writes, but interestingly it takes a very small amount of writes relative to the drive capacity to trigger temporal slowdown. Moreover, our SSDs have a very low utilization (under 20%). We link the slowdown to tail latency increases inside the drive. Temporal slowdown causes a throughput drop of up to 14%. Temporal slowdown affects not only HDFS but all applications using either direct or buffered IO.

The remainder of this section presents an overview of the throughput loss, an analysis of the results, a discussion of causes and an analysis of mitigation strategies.
Figure 6 presents the throughput timeline of a file affected by temporal slowdown on machine A. It shows three instances of the slowdown, around the 1:00, 3:30 and 5:40 marks. The rest of the throughput variation is caused by intrinsic slowdown. The average throughput of the periods not affected by the slowdown is 430 MB/s. The first instance of slowdown causes a drop in throughput to 370 MB/s, a 14% drop from the 430 MB/s average. On machine A, temporal slowdown appears on average every 130 min and lasts on average 5 min.

Figure 7 shows the same experiment on machine B. There are 5 instances of temporal slowdown, clearly visible due to the pronounced drops in throughput. The average throughput of the periods not affected by the slowdown is 455 MB/s. The biggest impact is caused by the third slowdown instance, which causes a drop to 390 MB/s, almost 15% down from the average. On machine B, temporal slowdown appears on average every 18 min and lasts for 1.5 min.

Figure 7: HDFS read. File throughput timeline on Machine B.

Figure 8: HDFS read. CDFs of read request latencies during and outside of temporal slowdown on Machine A.
Correlations.
We next analyze IO request latency. Figure 8 shows a CDF of the request latencies for one file. One line shows latencies during temporal slowdown while another shows latencies during periods not affected by the slowdown. The difference lies in the tail behavior. During temporal slowdown, a small percentage of the requests show much larger latency. This is consistent with the impact of background activities internal to the drive.

The experiments in Figures 6 and 7 do introduce writes, and they are responsible for triggering temporal slowdown on machine B. Even though from the application perspective (i.e. Hadoop) the workload is read-only, a small number of writes appear due to HDFS metadata management. These are the only writes in the system, as we explicitly turned off journaling and metadata updates in ext4. Interestingly, a small amount of writes relative to the drive size is sufficient to trigger temporal slowdown. On machine B, temporal slowdown occurs approximately every 120MB of writes. That amounts to only 0.015% of the disk size.
Temporal slowdown without writes.
Our main finding related to temporal slowdown is that it can occur in the complete absence of writes. This occurs only on machine A, so we focus on it for these experiments. To avoid any writes, we repeat the experiment using FIO instead of HDFS. We configure FIO to use the read system call and evaluated both direct IO and buffered IO. The results were similar, so we only show direct IO. We confirm that there are no writes performed during the experiments by checking the number of written sectors on the drive (from /proc/diskstats) before and after the experiments. In addition, we ensure that no writes have been performed in the system for at least one hour before the start of the experiments.

In Figure 9, we show the throughput timeline when using FIO with direct IO. FIO shows more variability in the common case compared to Hadoop because of context switches between kernel and user space. The temporal slowdown is again visible despite the absence of writes. The slowdown appears every 130 min on average and lasts 5 min on average. The periodicity is almost identical to the HDFS case, suggesting that the HDFS metadata writes did not play a role in triggering temporal slowdown on machine A.

Figure 9: FIO direct IO read. File throughput timeline on Machine A.

Figure 10 presents the IO request latency for FIO with direct IO. Again, tail latency increases during slowdown. The four different latency steps appear because direct IO sends, by default, four large requests (1 MB) to the drive.

Figure 10: FIO direct IO read. CDFs of read request latencies before and during temporal slowdown on Machine A.
Trigger of slowdown without writes.
Next, we analyze whether temporal slowdown in the absence of writes is correlated with the number of reads performed or whether it is time-based. We introduce periods of inactivity using sleep periods between the reads. We make sure that these periods are much smaller than the duration of temporal slowdown so that we do not miss slowdown events. We find that regardless of the inactivity period induced, the periodicity remains the same, suggesting time-based triggers.
Discussion on internal SSD root cause.
Since we do not have access to the proprietary SSD FTL design, we cannot directly search for the root cause internal to the drive. In theory, there are three known culprits for temporal slowdowns in SSDs, yet our findings do not match any of them. The first one is write-induced GC [27, 30]. However, we show that temporal slowdown can appear in the absence of writes as well. The last two culprits are read disturbance and retention errors [3]. In the related work, in Section 8, we argue at length that these culprits appear on drives that are far more worn out (orders of magnitude more P/E cycles) than ours and after orders of magnitude more reads have been performed. We hypothesize that temporal slowdown on our drives is triggered by periodic internal bookkeeping tasks unrelated to past drive usage or the current workload.
We have found no simple way of masking temporal slowdown. It occurs with both buffered IO and direct IO. One could attempt to detect early signs of slowdown or estimate its start via profiling, and then avoid performing reads during the slowdown period. This would yield more predictable performance at the expense of delays.
In this section, we introduce the permanent slowdown, an irreversible performance degradation that affects files at long timescales (days to weeks). Permanent slowdown occurs at a file level. It is not triggered by a single drive-wide event. Thus, at any point in time, a drive can contain both files affected by permanent slowdown and files unaffected by it. The exact amount of time it takes for a file to be affected by permanent slowdown varies from file to file and is not influenced by how many times a file was read. We only see permanent slowdown on machines of type A. Permanent slowdown causes a throughput drop of up to 15%.

We find that permanent slowdown is not specific to HDFS but affects all read system calls that use buffered IO. We link the slowdown to unexpected and permanent latency increases inside the drive for all IO requests.

For terminology, in the context of permanent slowdown, "before" means before the first signs of slowdown and "after" means after slowdown completely set in. The CDFs represent a single HDFS file composed of 8 blocks (i.e. 8 ext4 files). Figure 11 shows a different file, where we caught the onset of the slowdown. Nevertheless, we have seen that all files affected by the slowdown show a similar degradation pattern and magnitude.
Figure 11: HDFS read. File throughput over a 10h period centered around the onset of permanent slowdown.

Figure 12: HDFS read. CDFs of file throughput before and after permanent slowdown.

The remainder of this section presents an overview of the throughput loss, an analysis of the results, a discussion of causes and an analysis of mitigation strategies.
Figure 11 shows the onset and impact of permanent slowdown. The plot shows a 10 hour interval centered around the onset of permanent slowdown. The file was created several days before this experiment was run. For the first 4 hours, read throughput lies between 340 MB/s and 430 MB/s. This variation is explained by the intrinsic and the temporal slowdowns described in Sections 4 and 5. Around the fourth hour, the permanent slowdown appears, and after less than one hour it completely sets in. From that point on, read throughput remains between 320 MB/s and 380 MB/s in this experiment and all future experiments involving this file.

Figure 12 compares the CDFs of the read throughput of the same file before and after slowdown. At the median, throughput drops by 14.7%, from 418 MB/s to 365 MB/s.
Generality.
We start by analyzing the generality of the permanent slowdown. HDFS uses the sendfile system call to transfer data. Using the perf tool we find that sendfile shares most of its IO path in the Linux kernel with the read system calls that use buffered IO. Therefore, we ask whether permanent slowdown affects only sendfile system calls or also read system calls that use buffered IO.

We use FIO to generate reads using buffered IO. We configure FIO to use the read system call (i.e. the sync IO engine as a FIO parameter). Figure 13 presents a throughput comparison between HDFS (sendfile system call) and FIO (read system call). The two rightmost CDFs show the throughput for HDFS and FIO before permanent slowdown. HDFS and FIO behave similarly. The same applies after permanent slowdown sets in (leftmost CDFs). Similar results were obtained using libaio as an IO engine for FIO. This result shows that permanent slowdown does not affect a particular system call (sendfile) but the group of read system calls that perform buffered IO.

Figure 13: HDFS read vs buffered FIO. CDFs of file throughput before and after permanent slowdown.

Figure 14: HDFS read. CDFs of request latency before and after permanent slowdown.
Correlations.
We next analyze IO request latency. Figure 14 compares the CDFs of request latency in HDFS on one file before and after permanent slowdown. Permanent slowdown induced an increase in latency at almost every percentile. Thus, most requests are treated more slowly. At the median, the latency increases by 25%. The latencies in the tail of the CDF are explained by the intrinsic and the temporal slowdowns. We re-ran the experiment using FIO and saw similar results.

We also measure latency when sending one request at a time. We vary the request size between 128KB and 1MB. We find that single request latency also increases after permanent slowdown, and for all sizes. The increase in latency is constant in absolute terms and is thus not correlated to request size. The latency for the default request size of 256 KB increased by 33%. These findings show that the latency increases are due to the drive and not due to the software layers.

Figure 15: HDFS read vs FIO direct IO. CDFs of file throughput before and after permanent slowdown.
Discussion on internal SSD root cause.
The read disturbance and retention errors discussed as potential culprits for temporal slowdown could conceivably lead to permanent slowdown [3] if left uncorrected by the drive. However, the same argument we made for temporal slowdown applies. Read disturbance and retention errors occur on drives much more worn out (orders of magnitude more P/E cycles) than ours and after performing orders of magnitude more reads. We hypothesize that the reason for permanent slowdown lies with error correction algorithms being triggered inside the drive after enough time has passed since file creation.
We consider mitigation strategies that are more aggressive in generating request-level parallelism, in the hope that they could compensate for the throughput loss. We find that both direct IO as well as increasing the number of requests sent in parallel with buffered IO can mask permanent slowdown.

First, we look at the behavior of permanent slowdown when reading with direct IO. It is known that direct IO issues more and larger requests to the block layer when compared to buffered IO [13]. In our experiments it issues four 1MB requests in parallel. Figure 15 presents a throughput comparison between HDFS and FIO with direct IO. The two rightmost CDFs correspond to the throughput of FIO with direct IO before and after permanent slowdown. The difference between the two is minimal. We repeated the experiment using a smaller, 256KB direct IO request size (by tuning /sys/block/<device>/queue/max_sectors_kb). The results remained the same, suggesting that having a larger number of parallel requests is key for best performance.

We also analyzed making buffered IO more aggressive. We increase the request size from 256KB to 2MB by modifying the read-ahead value. This change automatically brings about a change in the number of requests sent in parallel to the drive. When the request size is 256KB, two requests execute in parallel. For a 2MB request size, four requests execute in parallel. Figure 16 presents four CDFs representing the throughput after permanent slowdown with HDFS reads and FIO buffered reads. The leftmost CDFs correspond to the default request size of 256KB and show the impact of permanent slowdown. The rightmost CDFs are for a request size of 2MB. The modified buffered IO is able to mask the permanent slowdown with increased parallelism.

Figure 16: HDFS read vs buffered FIO. CDFs of file throughput, before permanent slowdown with IO requests of 256KB, and after permanent slowdown with large IO requests of 2MB.
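Operationally, the more aggressive buffered IO configuration reduces to two sysfs writes. A sketch, assuming drive sda and requiring root (read_ahead_kb drives the buffered request size discussed above, while max_sectors_kb caps the size of individual block layer requests and was used for the 256KB direct IO experiment):

    /* Sketch: sysfs tuning used in the mitigation experiments. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int write_sysfs_kb(const char *path, int kb) {
        int fd = open(path, O_WRONLY);
        if (fd < 0) { perror(path); return -1; }
        dprintf(fd, "%d\n", kb);  /* sysfs attributes take decimal strings */
        close(fd);
        return 0;
    }

    int main(void) {
        /* 2MB read-ahead: larger buffered requests, four in flight instead of two. */
        write_sysfs_kb("/sys/block/sda/queue/read_ahead_kb", 2048);
        /* Cap block-layer request size at 256KB (direct IO experiment). */
        write_sysfs_kb("/sys/block/sda/queue/max_sectors_kb", 256);
        return 0;
    }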
We showed that intrinsic and permanent slowdowns occur because software cannot adapt its strategy for extracting the maximum performance and parallelism from the device in the face of changes in SSD behavior. In the common case, software can extract the maximum performance and parallelism from the SSD using a set strategy for generating IO requests. When SSD performance drops due to internal causes, the same set strategy cannot continue to extract the same performance level. A more aggressive strategy is needed. Yet, HDFS cannot adapt. This points to the need to consider more adaptable software designs that readjust according to perceived performance drops and instabilities in hardware.

The more aggressive approaches that we evaluated were switching to direct IO and increasing the size and number of parallel IO requests in buffered IO. Unfortunately, for existing applications, especially those with large code bases like HDFS, these more aggressive approaches may not always be easy to leverage. Switching to direct IO may require extensive changes to application code. Increasing the aggressiveness of buffered IO may lead to wasted disk reads if turned on at the machine level. If turned on per application, aggressive buffered IO may influence fairness in the co-existence with collocated workloads. In addition, from an operational perspective, increasing the aggressiveness of buffered IO is not straightforward. First, it is not intuitive, because in the common case the default strategy for buffered IO is enough to extract maximum performance from SSDs. Moreover, in Linux, the settings required to increase aggressiveness are controlled by a seemingly unrelated configuration that controls the read-ahead size.

Our findings have an impact on the way systems are benchmarked on SSDs. If two systems are tested on copies of the exact same file, 10-20% of the performance difference may come from intrinsic slowdown (a copy is more fragmented) and/or permanent slowdown (a copy is older). Even if the same input file is used but at different points in time, 10% of the performance difference may come from permanent slowdown. Finally, if systems are tested for short periods of time, 10% of the difference can come from temporal slowdown if one of the systems is unlucky enough to run during one slowdown episode. In the extreme case, one system may be affected by all three slowdowns at the same time while another may only be slightly affected by intrinsic slowdown. In this case, almost 30% of the performance difference may come from the slowdowns and not the systems under test.
Related to sources of performance variation internal to SSDs.
Garbage collection (GC) in SSDs is known to trigger temporary slowdowns, but it is write-induced [27, 30]. Flash on Rails [27] reports no GC-like effects in read-only workloads. Since our paper focuses solely on read-only workloads, we do not discuss GC further.

There are two types of errors that can appear in read-only workloads: retention errors and read errors. Retention errors occur when data stored in a cell changes as time passes, and are caused by the charge in a cell dissipating over time through the leakage current [3, 5]. Read (disturbance) errors occur when the data in a cell is modified over time as a neighboring cell is read repeatedly, and are caused by the repeated reads shifting the threshold voltages of unread cells and switching them to a different logical state [3, 4]. In practice, retention errors happen much more frequently than read disturbance errors [20].

The temporal slowdowns we encountered show a different pattern compared to the two read errors described above. Related work shows that read errors are highly correlated with the number of P/E cycles that the drive went through [3]. Our drives have a very low P/E cycle count. At the end of our experiments, the amount of data written to the drives over their entire lifetime was just 1TB, double their capacity. In contrast, related work uses drives with thousands of P/E cycles to show a noticeable increase in error rates [3]. Similarly, to obtain read errors, related work [4] performs hundreds of thousands of reads on a single page in order to see noticeable effects. Our experiments perform at most a few thousand reads. In addition, the read-error results from related work [4] are on drives that already underwent thousands of P/E cycles.

Gunawi et al. [10] study 101 reports of fail-slow hardware incidents (some of which are SSD-related), collected from large-scale cluster deployments. On the SSD front, they find firmware bugs that cause latency spikes or stalls, and slow reads due to read retries or parity-based read reconstruction. The study finds that slow reads occur mostly on worn out SSDs or SSDs that approach end of life. We show that similar problems can occur on very lightly used SSDs. Moreover, we analyze the impact that these hardware issues have at the application level.

Jung et al. [15] find at least 5x increased latency on reads when enabling reliability management on reads (RMR). RMR refers collectively to handling read disturbance management, runtime bad block management, and ECC. The latency differences causing the slowdowns we uncover are much less pronounced. Moreover, our slowdowns illustrate dynamics in read latency over time, whereas this work focuses on read-related insights from parameter sweeps.

Hao et al. [11] perform a large-scale study of tail latency in production HDDs and SSDs. Their study presents a series of slowdowns and shows that drive-internal characteristics are most likely responsible for them. They characterize long slowdown periods that (may) last hours and affect the whole drive, without any particular correlation to IO rate. Like them, we find that the drive is most likely responsible for most of the slowdowns. We did not experience long periods of slowdowns across the whole drive, probably due to the fact that our drives are more lightly used.
Related to fragmentation in SSDs.
Conway et al. [7] show that certain workloads cause file systems to age (become fragmented) and that this causes performance loss even on SSDs. Their workloads involve many small files. Kadekodi et al. [16] also expose the impact of aging in SSDs across a variety of workloads and file sizes. They focus on replicating fragmentation to improve benchmarking quality. Similarly, Chopper [14] studies tail latencies introduced by block allocation in ext4 in files of at most 256 KB. In contrast, we study intrinsic slowdown in much larger files (256 MB) and we quantify the impact on HDFS.

Related to extracting best performance out of SSDs.

He et al. [13] focus on five unwritten rules that applications should abide by to get the most performance out of SSDs and analyze how a number of popular applications abide by those rules. These rules boil down to specific ways of creating and writing files: write aligned, group writes by death time, create data with similar lifetimes, etc. These findings are all complementary to our work. The authors also point to small IO request sizes and argue that they are unlikely to use the SSD parallelism well. In contrast, we see that in the common case the default IO request sizes can extract the maximum SSD performance but fall short when hardware behavior changes, as exemplified by the permanent slowdown.

Related to storage-influenced HDFS performance.
Shafer et al. [24] analyze the performance of HDFS v1 on HDDs using Hadoop jobs. They show three main findings. First, architectural bottlenecks exist in Hadoop that result in inefficient HDFS usage. Second, portability limitations prevent Java from exploiting features of the native platform. Third, HDFS makes assumptions about how native platforms manage storage resources even though these vary widely in design and behavior. Our findings complement this past work by looking at SSDs instead of HDDs. Moreover, we look at the influence that internal drive characteristics have on HDFS performance, while this past work focuses on software-level interactions. Harter et al. [12] study HDFS behavior under HBase workload constraints: storing small files.

Related to performance variability of storage stacks.
Cao et al. [6] study the performance variation of modern storage stacks, on both SSDs and HDDs. For the workloads they analyzed, they find ext4-SSD performance to be stable even across different configurations, with less than 5% relative range. In contrast, we show variations of up to 30% over time for one single HDFS configuration. Maricq et al. [18] conduct a large-scale variability study. Storage-wise, they focus on understanding performance variability between HDDs and SSDs. Similar to us, they find that sending a large number of requests to the SSDs reduces performance variability. However, they focus on workloads with direct IO and small request sizes (4KB). In contrast, we study SSD variability under both direct IO and buffered IO, and we dive deeper into the importance of the number and size of requests. Vangoor et al. [29] analyze the performance overheads of FUSE versus native ext4. Their analysis shows that in some cases FUSE overhead is negligible, while in others it can heavily degrade performance. HDFS is also a user-space file system; however, it has a different architecture, functionality and use cases than FUSE. In this work, we analyze the interaction between HDFS and the lower layers of the storage stack under HDFS' main use case: sequential IO on large files.
In this paper we introduced and analyzed three surprising performance problems (intrinsic, temporal and permanent slowdowns) that stop HDFS from extracting maximum performance from some SSDs. These problems are introduced by the layers sitting beneath HDFS (file system, SSDs). The lower layers also hold the key to masking two of the three problems, by increasing IO request parallelism during the problems. Unfortunately, HDFS does not have the ability to adapt. Its access pattern successfully extracts maximum performance from SSDs in the common case but it is not aggressive enough to mask the performance problems we found. Our results point to a need for adaptability in storage stacks.

References

[1] Nutanix hardware platforms.

[2] Bonwick, J., Ahrens, M., Henson, V., Maybee, M., and Shellenbaum, M. The zettabyte file system. In Proc. of the 2nd Usenix Conference on File and Storage Technologies (2003), vol. 215.

[3] Cai, Y., Haratsch, E. F., Mutlu, O., and Mai, K. Error patterns in MLC NAND flash memory: Measurement, characterization, and analysis. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE '12).

[4] Cai, Y., Luo, Y., Ghose, S., and Mutlu, O. Read disturb errors in MLC NAND flash memory: Characterization, mitigation, and recovery. In Proceedings of the 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '15).

[5] Cai, Y., Luo, Y., Haratsch, E. F., Mai, K., and Mutlu, O. Data retention in MLC NAND flash memory: Characterization, optimization, and recovery. In HPCA (2015).

[6] Cao, Z., Tarasov, V., Raman, H. P., Hildebrand, D., and Zadok, E. On the performance variation in modern storage stacks. In FAST (2017), pp. 329–344.

[7] Conway, A., Bakshi, A., Jiao, Y., Jannen, W., Zhan, Y., Yuan, J., Bender, M. A., Johnson, R., Kuszmaul, B. C., Porter, D. E., et al. File systems fated for senescence? Nonsense, says science! In FAST (2017), pp. 45–58.

[8] Dean, J., and Barroso, L. A. The tail at scale. Commun. ACM 56, 2 (Feb. 2013), 74–80.

[9] Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP '03.

[10] Gunawi, H. S., Suminto, R. O., Sears, R., Golliher, C., Sundararaman, S., Lin, X., Emami, T., Sheng, W., Bidokhti, N., McCaffrey, C., et al. Fail-slow at scale: Evidence of hardware performance faults in large production systems. ACM Transactions on Storage (TOS) 14, 3 (2018), 23.

[11] Hao, M., Soundararajan, G., Kenchammana-Hosekote, D. R., Chien, A. A., and Gunawi, H. S. The tail at store: A revelation from millions of hours of disk and SSD deployments. In FAST (2016), pp. 263–276.

[12] Harter, T., Borthakur, D., Dong, S., Aiyer, A. S., Tang, L., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Analysis of HDFS under HBase: A Facebook Messages case study. In FAST (2014), vol. 14.

[13] He, J., Kannan, S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. The unwritten contract of solid state drives. In Proceedings of the Twelfth European Conference on Computer Systems (2017), ACM, pp. 127–144.

[14] He, J., Nguyen, D., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Reducing file system tail latencies with Chopper. In FAST (2015), vol. 15, pp. 119–133.

[15] Jung, M., and Kandemir, M. Revisiting widely held SSD expectations and rethinking system-level implications. In Proceedings of the ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '13).

[16] Kadekodi, S., Nagarajan, V., Ganger, G. R., and Gibson, G. A. Geriatrix: Aging what you see and what you don't see. A file system aging approach for modern storage systems. In Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference (2018), USENIX Association, pp. 691–703.

[17] Lee, C., Sim, D., Hwang, J. Y., and Cho, S. F2FS: A new file system for flash storage. In FAST (2015), pp. 273–286.

[18] Maricq, A., Duplyakin, D., Jimenez, I., Maltzahn, C., Stutsman, R., and Ricci, R. Taming performance variability. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (2018), pp. 409–425.

[19] Mathur, A., Cao, M., Bhattacharya, S., Dilger, A., Tomas, A., and Vivier, L. The new ext4 filesystem: Current status and future plans. In Proceedings of the Linux Symposium (2007), vol. 2, pp. 21–33.

[20] Meza, J., Wu, Q., Kumar, S., and Mutlu, O. A large-scale study of flash memory failures in the field. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems.

[21] Mytkowicz, T., Diwan, A., Hauswirth, M., and Sweeney, P. F. Producing wrong data without doing anything obviously wrong! ACM Sigplan Notices 44, 3 (2009), 265–276.

[22] Ousterhout, J. Always measure one level deeper. Commun. ACM 61, 7 (June 2018), 74–83.

[23] Rodeh, O., Bacik, J., and Mason, C. Btrfs: The Linux B-tree filesystem. ACM Transactions on Storage (TOS) 9, 3 (2013), 9.

[24] Shafer, J., Rixner, S., and Cox, A. L. The Hadoop distributed filesystem: Balancing portability and performance. In Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on (2010), IEEE, pp. 122–133.

[25] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop Distributed File System. In MSST 2010.

[26] Sites, R. L. Benchmarking "hello, world!". Benchmarking 16, 5 (2018).

[27] Skourtis, D., Achlioptas, D., Watkins, N., Maltzahn, C., and Brandt, S. A. Flash on Rails: Consistent flash performance through redundancy. In USENIX Annual Technical Conference (2014), pp. 463–474.

[28] Sweeney, A., Doucette, D., Hu, W., Anderson, C., Nishimoto, M., and Peck, G. Scalability in the XFS file system. In USENIX Annual Technical Conference (1996), vol. 15.

[29] Vangoor, B. K. R., Tarasov, V., and Zadok, E. To FUSE or not to FUSE: Performance of user-space file systems. In FAST (2017), pp. 59–72.

[30] Yan, S., Li, H., Hao, M., Tong, M. H., Sundararaman, S., Chien, A. A., and Gunawi, H. S. Tiny-tail flash: Near-perfect elimination of garbage collection tail latencies in NAND SSDs. ACM Transactions on Storage (TOS) 13, 3 (2017), 22.

[31] Zaharia, M., et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI (2012).