Characterizing Synchronous Writes in Stable Memory Devices
William B. Mingardi, Gustavo M. D. Vieira
DComp – CCGT – UFSCar
Sorocaba, São Paulo, Brazil
[email protected], [email protected]
Abstract.
Distributed algorithms that operate in the fail-recovery model rely on the state stored in stable memory to guarantee the irreversibility of operations even in the presence of failures. The performance of these algorithms leans heavily on the performance of stable memory. Current storage technologies have a defined performance profile: data is accessed in blocks of hundreds or thousands of bytes, random access to these blocks is expensive and sequential access is somewhat better. File system implementations hide some of the performance limitations of the underlying storage devices using buffers and caches. However, fail-recovery distributed algorithms bypass some of these techniques and perform synchronous writes to be able to tolerate a failure during the write itself. Assuming the distributed system designer is able to buffer the algorithm's writes, we ask how buffer size and latency complement each other. In this paper we start to answer this question by characterizing the performance (throughput and latency) of typical stable memory devices using a representative set of current file systems.
1. Introduction
Paxos [Lamport 1998, 2006] is a distributed consensus algorithm for asynchronous distributed systems. It can be used to create replicas of distributed services to ensure greater availability of the service as a whole [Schneider 1990]. This algorithm and other similar consensus algorithms are used in mission critical systems around the world [Chandra et al. 2007, Hunt et al. 2010].

The Paxos algorithm was designed to tolerate faults assuming a fail-recovery fault model, in which the processes that compose the distributed system can fail by crashing and later recover and return to operation [Cachin et al. 2011]. When one of these failures occurs the process stops operating and loses all state stored in main memory. It can later return to normal operation, but can only rely on state stored in stable memory, such as the hard disk. This failure model is very interesting because it is a suitable representation of the way computers crash, are restarted and reload the operating system and running applications.

Paxos and similar consensus algorithms that operate in the fail-recovery model rely on the state stored in stable memory to guarantee the irreversibility of operations even in the presence of failures. This irreversibility is crucial for the consistency of these distributed algorithms [Lamport 2006]. However, accesses to stable memory are usually much slower than accesses to main memory, and algorithms in the fail-recovery model minimize the use of this costly resource. For example, Paxos has two writes to stable memory in its critical path [Lamport 1998]. This way, the performance of distributed algorithms in the fail-recovery model leans heavily on the performance of stable memory.

Currently there are two main implementations of stable memory in use: spinning disks and flash memory. Both technologies are radically different, but have a similar performance profile: data is accessed in blocks of hundreds or thousands of bytes, random access to these blocks is expensive and sequential access is somewhat more efficient [Ruemmler and Wilkes 1994, Min et al. 2012, Chen et al. 2009]. Access to these devices is mediated by the operating system using the file system abstraction. File system implementations take the performance profile of storage devices into account, and are able to hide some of the performance limitations of the underlying device using buffers and caches. However, fail-recovery distributed algorithms require that the data being written is committed to disk before proceeding, to be able to tolerate a failure during the write itself. As a consequence, every write must be a synchronous write, usually achieved by a call to fsync or a similar system call. Issuing a synchronous write, however, conflicts with most strategies employed by file systems to hide the performance cost of stable memory [Jannen et al. 2015, Yeon et al. 2018].

Considering the limitations of file systems, we ask how the designer of a distributed system can maximize the performance of the available storage devices. To maximize something it is necessary to first establish appropriate metrics. If we only consider the throughput of the storage device, its physical implementation and the way file systems are implemented suggest that we should write as much data as possible in any single synchronous write. This way, the file system will be able to fill its buffers and to allocate relatively large extents of contiguous blocks, and the device controller will have enough data to optimize the execution of the write.
However, from the point of view of the distributed algorithm, a big write will mean a larger latency for the operation to complete. Making the question much more interesting is the fact that, due to the way file systems are implemented, this latency increase is not directly proportional to the amount of data being written. Specifically, synchronous writes of only a few bytes each will reduce the throughput of the disk to almost zero, while large writes will hardly affect the latency of the operation.

Thus, if the distributed system designer is able to buffer the algorithm's writes to stable memory [Vieira and Buzato 2010] and wants to use storage devices at optimum capacity, we need to understand exactly how buffer size and latency complement each other. To this end, we should evaluate the full programming environment, including operating system, file system and storage device. In this paper we start this investigation by characterizing the performance (throughput and latency) of typical stable memory devices, a spinning disk and a solid state drive, using a representative set of current file systems. Many works evaluate file system and storage device performance [Chen et al. 2009, Min et al. 2012, Jannen et al. 2015, Sweeney et al. 1996, Mathur et al. 2007, Rodeh et al. 2013, Lee et al. 2015]; however, these works usually assume the file system will be able to handle any type of operation the application requires. This work is different in the sense that we investigate how the distributed system designer can help the file system and storage device better handle the use pattern created by distributed algorithms.

This paper is organized as follows. In Section 2 we describe the basic assumptions of our research and related work. Section 3 describes our experimental setup and Section 4 shows the experimental data and analysis. We present some concluding remarks in Section 5.
2. Background
Storage devices are orders of magnitude slower than main memory. Random access of data at the byte level is prohibitively expensive and the best performance of these devices can be achieved by accessing data in sequential blocks of hundreds or thousands of bytes [Ruemmler and Wilkes 1994, Min et al. 2012, Chen et al. 2009]. This reality has shaped the design of file systems for decades. Among the techniques used are the caching of recently accessed blocks in memory for reading and writing, and the pre-fetching of blocks that are likely to be accessed in the future [Rosenblum and Ousterhout 1992]. Write caches are particularly useful because they allow successive small writes to happen to the same block while requiring only one single write to the storage device.

The negative consequence of the use of write caches is that the application can't be sure if the data it has written is effectively recorded in the storage device. If a failure occurs after the write is issued but before the data is written to stable memory, this write may be lost. For many applications this behavior is acceptable, as a crash means the application has already unexpectedly stopped and it is inconsequential if this failure happened before or after the write. After a potential recovery, the application just redoes the interrupted write. However, some applications that interact with external systems have the requirement that a write they have made must not be forgotten. One simple example of such an application is an automated teller machine that can't undo the fact it has already dispensed money. Distributed algorithms have similar requirements, but instead of money they can't take back messages that were already sent [Lamport 2006].

Operating systems support the implementation of applications that require that a write be committed to stable storage through the fsync or similar system call. This call forces the file system to write to disk all buffers it keeps in memory. Very frequently metadata and caches are also flushed and immediately written to the underlying device [Jannen et al. 2015, Yeon et al. 2018]. This satisfies the requirement of the application, but has a considerable impact on the performance of the file system. Moreover, the application itself can be a victim of its own access patterns. To explain how this happens, let's consider the situation of a distributed algorithm that needs to write $x$ bytes for each message it sends, where $x$ is a small number. If this application writes these few bytes synchronously, it will have to wait for the write of a complete file system block plus any metadata changes required by the file system. If we assume the latency of writing a block is $l_b$ and the latency of a metadata update is $l_m$, the throughput of this write is
$$\frac{x}{l_b + l_m}.$$

However, if the application is able to and decides to batch a group of messages, sending them at the same time, it can make a single write corresponding to the record of all messages sent at once. If this group has $y$ messages and the size $xy$ of the bytes written is smaller than the file system block size, we can assume the latency $l_b + l_m$ will remain constant. Under this assumption, the throughput of this batched write will be
$$\frac{xy}{l_b + l_m},$$
that is, $y$ times larger than the throughput of recording the send of a single message. Depending on the size of the file system block and the size of this application buffer of $xy$ bytes, the increase in throughput obtained can be orders of magnitude larger than in the case where $y = 1$.
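To make the effect of batching concrete, here is a minimal sketch (our illustration, not code from the paper) of the two strategies above: one fsync per message versus one fsync per batch of $y$ messages. The file name, message size and batch size are arbitrary assumptions.

```python
import os, time

MSG = b"x" * 64          # x bytes per message (illustrative value)
BATCH = 128              # y messages per batch (illustrative value)
PATH = "stable.log"      # hypothetical log file on the file system under test

def sync_write(fd, data):
    """One synchronous write: return only after the data reaches stable storage."""
    os.write(fd, data)
    os.fsync(fd)

def per_message(n):
    """One fsync per message: throughput roughly x / (l_b + l_m)."""
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    start = time.monotonic()
    for _ in range(n):
        sync_write(fd, MSG)
    elapsed = time.monotonic() - start
    os.close(fd)
    return n * len(MSG) / elapsed        # bytes per second

def batched(n):
    """One fsync per batch of y messages: throughput roughly xy / (l_b + l_m)."""
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    start = time.monotonic()
    for _ in range(0, n, BATCH):
        sync_write(fd, MSG * BATCH)
    elapsed = time.monotonic() - start
    os.close(fd)
    return n * len(MSG) / elapsed

if __name__ == "__main__":
    print("per message:", per_message(1024))
    print("batched:    ", batched(1024))
```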
Also, the latency $l_b + l_m$ can be considered the minimum latency required to perform a synchronous write. We can improve the throughput, but the best we can achieve under ideal conditions is to maintain this minimum latency.

The assumption that $l_b + l_m$ will remain constant is a bit optimistic, though. At the least, as $xy$ increases, eventually the size of the application buffer will exceed the file system block. Moreover, the amount of metadata updates required will increase as the number of file system blocks touched by the write also increases. Then, as we increase $y$, the latency will eventually creep up, matching some of the throughput gains we obtained. The question an application programmer has to face now is how many messages one should batch in an application buffer to increase throughput to an ideal level while not increasing latency disproportionately.

This trade-off between throughput and latency is not new; actually it is at the heart of many computing devices we rely on [Patterson 2004]. What is interesting in this case is that, at least for implementations of distributed algorithms and the current designs of file systems, this trade-off falls squarely in the hands of the application programmer. Two techniques that have been employed at the application level are to use a log-like structure to keep writes sequential and to accumulate data in a large enough buffer [Vieira and Buzato 2010]. These are common sense approaches [Patterson 2004], but they lack a solid framework to evaluate their effectiveness.

Management of synchronous writes is a problem usually faced by file system designers with respect to metadata consistency [Rosenblum and Ousterhout 1992]. Coalescing write operations in the buffer cache is usually done internally by the file system, instead of by the application. For this reason, studies considering the performance of synchronous writes under different application buffer sizes are practically nonexistent.

From the point of view of file system design and implementation, many studies focus on making metadata updates more efficient. One approach is a log-structured file system [Rosenblum and Ousterhout 1992] such as F2FS [Lee et al. 2015]. This file system tries to transform random write patterns into sequential ones by avoiding changing metadata in place. Instead, changes are written to a sequential log of changes, properly indexed for later access. The F2FS designers show that synchronous writes make or break a file system's performance and that coalescing writes can be very efficient [Lee et al. 2015]. F2FS writes data in chunks of 512 kB, while a recent file system such as ext4 [Mathur et al. 2007] usually writes 4 kB. F2FS is a very efficient file system; however, there is a limit to what the file system can do alone. Many small synchronous writes will kill the performance of any file system, including F2FS, as we show in Section 4.

Research proposing new file systems is usually accompanied by performance evaluations [Sweeney et al. 1996, Mathur et al. 2007, Rodeh et al. 2013, Lee et al. 2015], but the focus is on workloads that do not include synchronous writes. One exception is [Lee et al. 2015], which shows performance data of a workload with fsync on top of F2FS. However, this evaluation assumes many concurrent threads providing enough data to batch the writes in larger buffers. We are interested in the performance of single-threaded synchronous writes, which reflect more accurately the latency expected by a distributed algorithm.
Other examples of works that perform extensive performance evaluations of storage systems, but do not consider synchronous writes, are [Chen et al. 2009] and [Jannen et al. 2015].

Write-optimized file systems such as BetrFS [Jannen et al. 2015] could be used to offer efficient writes to the application, regardless of buffer size. However, these file systems still suffer from many limitations [Jannen et al. 2015] and aren't yet in widespread use. The fsync optimizations presented in [Yeon et al. 2018] would help reduce the inherent latency of performing an fsync. That work complements our results in the sense that improvements in fsync latency amount to a constant factor reduction of $l_b + l_m$, but that alone won't change the problem of low throughput with small buffers.
3. Experimental Characterization
We want to characterize the performance of secondary memory; specifically, we are interested in the performance of sequential, synchronous writes that ignore the file system buffers. The target application is a distributed algorithm, such as Paxos, that has to perform synchronous writes (fsync) to stable memory before it can proceed with its execution. We also assume that this algorithm processes many requests in parallel and is thus able to assist the file system by coalescing its own writes into larger batches, at the cost of increased latency for its operations.

The focus of the characterization is the trade-off between stable memory throughput and latency as the application sets the size of its write buffer: the larger the buffer, the larger both throughput and latency. The objective is to give the application designer a tool that can be used to discover, for a combination of stable device implementation and file system, the optimal buffer size considering the latency requirements of the application.

We have performed tests with application buffers ranging from 4 kB to 16 MB, as these sizes cover the point where latencies start to grow proportionally to the buffer size. The tests sequentially write a new file of 16 MB, and we have observed that bigger files do not change the throughput and latency data observed. The experiments were run on different combinations of stable memory technology (spinning disk and flash memory) and file systems (ext4, XFS, BTRFS, F2FS).
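To make the measured quantities concrete, the following is a simplified sketch (ours; the actual benchmark tool is described next) of a sequential synchronous write test over a range of buffer sizes, reporting throughput and average latency per write. The file path is an arbitrary assumption.

```python
import os, time

FILE_SIZE = 16 * 1024 * 1024                          # 16 MB test file
BUFFER_SIZES = [2 ** k * 1024 for k in range(2, 15)]  # 4 kB .. 16 MB, doubling
PATH = "testfile"                                     # hypothetical path on the tested file system

def measure(buf_size):
    """Sequentially write FILE_SIZE bytes in synchronous chunks of buf_size bytes.
    Returns (throughput in kB/s, average latency per write in ms)."""
    buf = b"\0" * buf_size
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    latencies = []
    for _ in range(FILE_SIZE // buf_size):
        start = time.monotonic()
        os.write(fd, buf)
        os.fsync(fd)                                   # wait until data is on stable storage
        latencies.append(time.monotonic() - start)
    os.close(fd)
    total = sum(latencies)
    return (FILE_SIZE / 1024) / total, 1000 * total / len(latencies)

for size in BUFFER_SIZES:
    tput, lat = measure(size)
    print(f"{size // 1024:6d} kB  {tput:10.0f} kB/s  {lat:8.2f} ms")
```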
The tests were run using the Iozone multi-platform file system benchmark tool. This tool was selected because it is portable to many operating systems and it can run many different tests, including reading, writing, rereading and rewriting, besides being highly customizable. We configured the tool to perform a sequential synchronous write test with the following parameters:

Test type: The test is a sequential write of a file (-i 0).
Buffer size: The buffer sizes tested range from 4 kB to 16 MB (-a).
File size: The file written has 16 MB (-n 16m -g 16m).
Synchronous write: The writes bypass the file system buffers and go directly to disk, and the wait time of the synchronous write is added to the operation latency (-o -I -e).
Latency tests: The default for Iozone is to measure throughput, but we also tested the latency of operations (-N).

Unfortunately, the Iozone documentation is not clear about how many iterations it performs when calculating the throughput and latency for each block size. More importantly, we did not have access to the raw data or even to indirect measures beyond the average, such as the standard deviation. To achieve more rigorous results we decided to run Iozone repeatedly (30 times) and treat the output of each run as a separate data point. We then calculated the average and standard deviation from this data. Our approach was validated when we observed that, for some experimental parameters, the variation of the readings across many runs of Iozone was considerable (Section 4).

To automate the process of repeatedly running Iozone we created a Bash script that runs the benchmark, collects the data and aggregates the result, calculating the average and standard deviation. Another aspect automated by the script is that Iozone has to be run twice for each data point: the first time for throughput and the second for latency. For each run the raw output of Iozone is parsed by Awk and a simple text file describing this run is created, containing the relevant metric (throughput or latency) for each block size. After all runs are performed, these files are coalesced into a single file containing the average and standard deviation for each block size of the measurement. Finally, all intermediary data and the final results are archived in a single compressed file to allow future reprocessing of the data, if necessary.

This way, the script created integrates all steps needed to reproduce our data and, more importantly, is the seed for an automated tool that could be used to inform an application programmer about the behavior of the file system and storage device being used.
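For reference, the flags listed above assemble into invocations such as iozone -a -i 0 -n 16m -g 16m -o -I -e for the throughput pass, with -N added for the latency pass. The aggregation step can be re-created as in the following Python sketch; it assumes one intermediate text file per run with lines of the form "<record size in kB> <metric value>", which is our stand-in for the per-run files produced by the paper's Bash and Awk scripts.

```python
import glob, statistics
from collections import defaultdict

# Assumed intermediate format (one file per Iozone run, e.g. run_01.txt):
# each line holds "<record size in kB> <metric value>", the metric being either
# throughput (kB/s) or latency (ms) depending on which pass produced the file.
samples = defaultdict(list)
for path in glob.glob("run_*.txt"):          # hypothetical file names
    with open(path) as f:
        for line in f:
            size, value = line.split()
            samples[int(size)].append(float(value))

# Coalesce the runs into one table with average and standard deviation,
# mirroring the final step of the paper's Bash/Awk pipeline.
for size in sorted(samples):
    values = samples[size]
    mean = statistics.mean(values)
    stdev = statistics.stdev(values) if len(values) > 1 else 0.0
    print(f"{size:8d} {mean:12.2f} {stdev:10.2f}")
```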
The tests were run on a desktop computer running 64-bit Linux (Fedora 25), using version 4.12.9 of the Linux kernel. All tested file systems were at the standard versions found in this kernel release. The computer had an Intel Core 2 6300 processor, with a 1.86 GHz clock, and 4 GiB of RAM. As storage devices to be tested, we installed in the computer two units that throughout the text are going to be referenced as HDD and SSD:

HDD: HITACHI HDS72101 7200 rpm hard disk (1 TB),
SSD: Samsung 850 solid state drive (256 GB).

Both drives were connected to a SATA II interface and the tests were run in a dedicated 19 GiB partition on each drive. Both drives were dedicated to the test; there were no system files or user data on them and no competing accesses.

To validate that our experimental setup and data were robust, and to assess the portability of the test script, we also ran tests with a different type of storage device on another machine. These tests were run on a Vostro 5740 laptop running 64-bit Linux (Ubuntu 16.04), using version 4.4.0 of the Linux kernel. This laptop had an Intel Core i5-4210U processor, with a 1.70 GHz clock, and 4 GiB of RAM. On this machine we ran tests on an SD card attached to an integrated card reader, referenced in the text as SDCARD:
SDCARD: SanDisk Ultra SD card (32 GB).

On this device the tests were run in a dedicated 1.44 GiB partition. Although this device is not representative of the type of devices usually employed to support distributed algorithms, we found that the data obtained helps to support some of the observations we made as more general and applicable to a wide range of devices.

Even though our test load uses synchronous writes to ensure data is stable on the drive before the application can continue, we do not bypass the file system and go directly to the device. Thus, the performance of the file system is also a significant factor in the overall performance of the target applications. To assess how big this factor is, we selected a representative set of current file systems: XFS [Sweeney et al. 1996], ext4 [Mathur et al. 2007], BTRFS [Rodeh et al. 2013] and F2FS [Lee et al. 2015]. XFS and ext4 represent modern implementations of a "classic" file system, while BTRFS is a more recent design. F2FS is a log-structured file system tailored for flash-based devices.
4. Results
In this section we first present the throughput and latency data for each of the devices and file systems tested. Using this data we show how to choose a buffer size appropriate for a specific distributed algorithm. All charts in this section have a logarithmic-scaled x axis, because data on buffer size was measured by doubling the buffer size. This way we can cover a larger range of buffer sizes in less time, but it distorts the data. To compensate for this, the y axis is also in logarithmic scale. In our discussion of the results, when comparing two file systems we use an independent t-test to test for statistical significance, with p < 0.05 as the threshold.
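As an illustration of this comparison procedure (not the authors' analysis script), a minimal sketch using SciPy's independent t-test over two sets of per-run throughput readings could look like the following; the sample arrays are placeholders.

```python
from scipy import stats

# Placeholder samples: in the real experiment each list holds 30 throughput
# readings (kB/s), one per Iozone run, for a single buffer size.
xfs_runs = [50100, 50250, 49980]
ext4_runs = [45600, 45720, 45510]

# Independent two-sample t-test; a difference is reported as significant
# when the p-value falls below the 0.05 threshold.
t_stat, p_value = stats.ttest_ind(xfs_runs, ext4_runs)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant = {p_value < 0.05}")
```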
Figure 1. HDD Throughput (throughput in kB/s vs. record size in kB; XFS, ext4 and BTRFS)
Figure 2. HDD Latency (latency in ms vs. record size in kB; XFS, ext4 and BTRFS)
The results for the HDD device are shown in Figure 1 (throughput) and Figure 2 (latency), comparing the XFS, ext4 and BTRFS file systems. The first observation we can make is that, as expected, throughput increases proportionally to buffer size while latency remains mostly constant for small buffer sizes. In this range, performance is dictated by the synchronous write performance of the device and the efficiency of the file system. As buffers increase, however, latencies start to rise as memory and disk throughput start having a larger impact on the performance. For larger buffers, throughput remains constant while latency increases proportionally to buffer size. This behavior is also expected, as a throughput-saturated disk will take more time to write a larger buffer.

The behavior for very small and very large buffers is predictable, while the transition between the two is much less so. In Section 4.4 we will discuss this transition in more depth, but first we are going to make some observations about the performance of the file systems in the HDD device. For small buffers, BTRFS has the worst latency, XFS has the best, at about half of the latency of BTRFS, and ext4 stays in the middle. This latency dominates the throughput, and the relative performance of these file systems stays roughly the same for small buffers: XFS is the best, followed by ext4 and BTRFS. We don't have data to support this, but we speculate this difference in performance is due to the management of metadata in each file system, with BTRFS showing the largest overhead.

For larger buffers we are approaching the maximum throughput of the underlying device. With a saturated disk, file system differences start to disappear and optimizations to the writing of bulk data and metadata start to make a difference. This appears to be especially true for BTRFS, which takes advantage of larger file system blocks [Rodeh et al. 2013]. There are still some noticeable differences in the averages, but the standard deviation increased as well. For instance, for the buffer size of 4096 kB, which marks a point where the average throughput of ext4 surpasses the throughput of XFS, this difference is not statistically significant (t(30) = 2., p = 0.).

The results for the SSD device are shown in Figure 3 (throughput) and Figure 4 (latency), comparing the XFS, ext4, BTRFS and F2FS file systems. The data confirms that, despite their considerable difference in implementation, both HDD and SSD share a very similar performance profile. We can observe increasing throughput with constant latency for smaller buffers, and constant throughput with increasing latency for larger buffers. What sets HDD and SSD apart is the magnitude of the performance, with throughput and latency of the SSD about 4x better. Moreover, the SSD device reaches its maximum throughput with smaller buffers than the HDD. This means that this class of device has a real advantage for use with distributed algorithms, because the maximum throughput can be achieved with lower latency, as we discuss in Section 4.4.
Figure 3. SSD Throughput (throughput in kB/s vs. record size in kB; XFS, ext4, BTRFS and F2FS)
Figure 4. SSD Latency (latency in ms vs. record size in kB; XFS, ext4, BTRFS and F2FS)
Regarding the relative performance of the tested file systems, we observed in the SSD device a pattern similar to the one found in the HDD device. For small buffer sizes, XFS and ext4 have a clear advantage, with XFS leading by a small margin. For example, with a buffer size of 128 kB, XFS achieves 50127 kB/s while ext4 manages 45675 kB/s, a statistically significant difference (t(30) = 13., p < .). Both BTRFS and F2FS have weak performance with small buffers, probably due to metadata overhead. F2FS in particular buffers concurrent synchronous writes and is not particularly suited for the single-threaded workload we tested [Lee et al. 2015]. Moreover, BTRFS and F2FS show a very high standard deviation, probably indicating that the metadata overhead is not constant and has some infrequent high-cost operations.

As observed for the HDD, as we increase the size of the buffers for the SSD, file system differences tend to get smaller. But, as the SSD device shows a smaller standard deviation, these differences are more consistent. For example, for a buffer size of 4096 kB the throughput of XFS is 230612 kB/s while the throughput of ext4 is 197912 kB/s, a statistically significant difference (t(30) = 20., p < .). One unexpected observation was that ext4 had a noticeable and consistent drop in throughput for the largest buffers. We don't have data to support it, but we speculate that as the throughput increases ext4 uses more CPU and may start to be CPU bound.

The results for the SDCARD device are shown in Figure 5 (throughput) and Figure 6 (latency), comparing the XFS, ext4, BTRFS and F2FS file systems. The SDCARD device is a low capacity, removable device, with a basic FTL (flash translation layer [Chen et al. 2009, Min et al. 2012]) and consequently lower throughput and higher latency than the SSD device. The data confirms a low maximum throughput, but the device is able to reach this throughput with smaller buffers. As a consequence, at the maximum throughput the latency is surprisingly low.
Figure 5. SDCARD Throughput (throughput in kB/s vs. record size in kB; XFS, ext4, BTRFS and F2FS)
F2FS has especially low latencies up to 32 kB buffers and consequently a higher throughput, beating XFS by a small margin. This difference is only statistically significant up to 8 kB buffers (t(30) = 3., p = 0.). Nonetheless, this shows the F2FS file system is particularly optimized for this class of storage device.

For larger buffer sizes, F2FS disappointingly plateaus at about two thirds of the final throughput achieved by both XFS and ext4. For these larger buffer sizes the standard deviation increases considerably and the performance differences between XFS and ext4 are not statistically significant. With a buffer size of 1 MB, XFS reaches 25469 kB/s while ext4's throughput is 24368 kB/s, a difference that is not statistically significant (t(30) = 0., p = 0.). XFS is, however, better than F2FS for larger buffers. For the same 1 MB buffer, F2FS's throughput is 16321 kB/s, and XFS has a statistically significant advantage (t(30) = 6., p < .).
Figure 6. SDCARD Latency (latency in ms vs. record size in kB; XFS, ext4, BTRFS and F2FS)
With the throughput and latency curves obtained with our benchmark, a system designer can pick the required buffer size for the implementation of a distributed algorithm. If the application needs to move data with a minimum throughput, the designer should choose the smallest buffer that achieves the desired throughput, if possible. If the application requires responses to be sent with a maximum latency, the designer should choose the largest buffer that doesn't violate this limit, if possible. However, matters aren't as clear if the application doesn't have any hard limit on throughput and latency, but only a general desire to optimize the balance between the two.

Each metric we have measured has a transition: throughput increases until it reaches the maximum device throughput and then stabilizes; latency is stable until transfer costs start to dominate, then it starts to increase. The two transition points do not coincide; they happen at different buffer sizes. For example, considering XFS on the HDD, the throughput starts to level off at buffer sizes of 4096 kB to 8192 kB, while latency starts picking up at buffer sizes of 512 kB to 1024 kB.

To try to capture the trade-off between throughput and latency, in the face of the fact that the tipping point of each metric is different, we introduce a combined metric: the ratio of throughput to latency. This new metric gives a rough idea of the efficiency of a buffer size, indicating how many units of throughput one can gain for each unit of latency introduced. Figures 7, 8 and 9 show this new metric for the devices and file systems tested. In a sense, the curve observed in these figures can be seen as a combination of the throughput and latency curves. The ascending slope represents the phase in which increasing the buffer size increases throughput more than it increases latency. The plateau is the phase where latency and throughput increase at the same rate. The descending slope represents the phase where the increase in throughput is smaller than the increase in latency. The maximum of the curve represents the best throughput for each unit of latency: if one considers larger buffers, the latency is increased but the throughput won't improve. This is the point where throughput is maximized while requiring the least latency.
Figure 7. HDD Ratio (throughput/latency ratio vs. record size in kB; XFS, ext4 and BTRFS)
Figure 8. SSD Ratio (throughput/latency ratio vs. record size in kB; XFS, ext4, BTRFS and F2FS)
Using this metric as a guide, we can observe that the SSD device is more capable of handling the loads of distributed algorithms because it can saturate its write throughput with smaller buffers. The SSD has superior absolute performance, but the throughput/latency ratio is more relevant in this case. As an example of this, take the SDCARD device. As a low end device, without the sophisticated FTL found in the SSD, its performance parameters are arguably inferior to those of the HDD. The maximum throughput of the SDCARD is about half of the HDD's, but so is its latency. As a consequence, the optimum throughput/latency ratio is about the same for both devices, but with the SDCARD showing latencies that are 10x smaller. Throughput is also lower, but only by about 4x. Thus, for an application in which this lower throughput is acceptable, the SDCARD device would, surprisingly, be an interesting choice.

Figure 9. SDCARD Ratio (throughput/latency ratio vs. record size in kB; XFS, ext4, BTRFS and F2FS)
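As an illustration of how the selection rules discussed in this section could be applied to the measured curves, the following sketch (ours, with placeholder numbers) picks a buffer size from a table of (buffer size, throughput, latency) points.

```python
# Measured points for one device/file system combination: (buffer size in kB,
# throughput in kB/s, latency in ms). The values below are placeholders.
points = [
    (4, 900, 4.5), (64, 14000, 4.6), (512, 90000, 5.0),
    (4096, 220000, 19.0), (16384, 230000, 72.0),
]

def smallest_buffer_for_throughput(points, min_tput):
    """Rule 1: smallest buffer that reaches the required throughput (or None)."""
    for size, tput, _ in sorted(points):
        if tput >= min_tput:
            return size
    return None

def largest_buffer_for_latency(points, max_lat):
    """Rule 2: largest buffer that does not violate the latency limit (or None)."""
    ok = [size for size, _, lat in points if lat <= max_lat]
    return max(ok) if ok else None

def best_ratio_buffer(points):
    """No hard limits: buffer size that maximizes the throughput/latency ratio."""
    return max(points, key=lambda p: p[1] / p[2])[0]

print(smallest_buffer_for_throughput(points, 50000))  # 512 for the data above
print(largest_buffer_for_latency(points, 10.0))       # 512 for the data above
print(best_ratio_buffer(points))                      # 512 for the data above
```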
5. Conclusion
Distributed algorithms in the fail-recovery failure model require efficient access to stable memory. This efficiency is measured by the trade-off between the latency of each write to stable memory and the total throughput of writes. A distributed system programmer wants to balance the two by correctly sizing the buffers sent to be written by the file system, ideally achieving a target throughput with the minimum latency possible.
To aid in this task, we have characterized the performance profile of typical stable memory devices, a spinning disk and a solid state drive, using a representative set of current file systems.

Our data shows that the performance of the studied storage devices goes through three distinct phases as the application buffer size grows. In the first phase throughput increases while latency stays approximately constant. In the second phase throughput and latency increase proportionally. In the third phase throughput reaches a maximum and latency starts increasing. In general terms, a designer should choose a buffer size in the first phase, with the minimum latency that respects the required throughput of the application.

With respect to the devices tested, the ones that saturate their throughput with smaller buffer sizes tend to offer the lower latency. The SSD device was the best in this respect, with very good throughput and latency figures. Surprisingly, the SDCARD device showed a very interesting balance between throughput and latency, despite being a low performance device. With respect to file systems, the ones that handle large buffers efficiently were the best performers. In particular, XFS showed a very consistent performance.
References
Cachin, C., Guerraoui, R., and Rodrigues, L. (2011). Introduction to Reliable and Secure Distributed Programming. Springer.

Chandra, T. D., Griesemer, R., and Redstone, J. (2007). Paxos made live: an engineering perspective. In PODC '07: Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing, pages 398–407, New York, NY, USA. ACM Press.

Chen, F., Koufaty, D. A., and Zhang, X. (2009). Understanding intrinsic characteristics and system implications of flash memory based solid state drives. SIGMETRICS Perform. Eval. Rev., 37(1):181–192.

Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. (2010). ZooKeeper: Wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference, USENIXATC'10, pages 11–11, Berkeley, CA, USA. USENIX Association.

Jannen, W., Yuan, J., Zhan, Y., Akshintala, A., Esmet, J., Jiao, Y., Mittal, A., Pandey, P., Reddy, P., Walsh, L., Bender, M., Farach-Colton, M., Johnson, R., Kuszmaul, B. C., and Porter, D. E. (2015). BetrFS: A right-optimized write-optimized file system. In Proceedings of the 13th USENIX Conference on File and Storage Technologies, FAST'15, pages 301–315, Berkeley, CA, USA. USENIX Association.

Lamport, L. (1998). The part-time parliament. ACM Trans. Comput. Syst., 16(2):133–169.

Lamport, L. (2006). Fast Paxos. Distrib. Comput., 19(2):79–103.

Lee, C., Sim, D., Hwang, J.-Y., and Cho, S. (2015). F2FS: A new file system for flash storage. In Proceedings of the 13th USENIX Conference on File and Storage Technologies, FAST'15, pages 273–286, Berkeley, CA, USA. USENIX Association.

Mathur, A., Cao, M., Bhattacharya, S., Dilger, A., Tomas, A., and Vivier, L. (2007). The new ext4 filesystem: current status and future plans. In Proceedings of the Linux Symposium, volume 2, pages 21–33.

Min, C., Kim, K., Cho, H., Lee, S.-W., and Eom, Y. I. (2012). SFS: Random write considered harmful in solid state drives. In Proceedings of the 10th USENIX Conference on File and Storage Technologies, FAST'12, pages 12–12, Berkeley, CA, USA. USENIX Association.

Patterson, D. A. (2004). Latency lags bandwidth. Commun. ACM, 47(10):71–75.

Rodeh, O., Bacik, J., and Mason, C. (2013). BTRFS: The Linux B-tree filesystem. Trans. Storage, 9(3):9:1–9:32.

Rosenblum, M. and Ousterhout, J. K. (1992). The design and implementation of a log-structured file system. ACM Trans. Comput. Syst., 10(1):26–52.

Ruemmler, C. and Wilkes, J. (1994). An introduction to disk drive modeling. Computer, 27(3):17–28.

Schneider, F. B. (1990). Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput. Surv., 22(4):299–319.

Sweeney, A., Doucette, D., Hu, W., Anderson, C., Nishimoto, M., and Peck, G. (1996). Scalability in the XFS file system. In Proceedings of the 1996 Annual Conference on USENIX Annual Technical Conference, ATEC '96, pages 1–1, Berkeley, CA, USA. USENIX Association.

Vieira, G. M. D. and Buzato, L. E. (2010). Implementation of an object-oriented specification for active replication using consensus. Technical Report IC-10-26, Institute of Computing, University of Campinas.

Yeon, J., Jeong, M., Lee, S., and Lee, E. (2018). RFLUSH: Rethink the flush. In Proceedings of the 16th USENIX Conference on File and Storage Technologies, FAST'18, Berkeley, CA, USA. USENIX Association.