Barrier Enabled IO Stack for Flash Storage
Youjip Won
Hanyang University
Jaemin Jung
Texas A&M University
Gyeongyeol Choi
Hanyang University
Joontaek Oh
Hanyang University
Seongbae Son
Hanyang University
Jooyoung Hwang
Samsung Electronics
Sangyeun Cho
Samsung Electronics
Abstract
This work is dedicated to eliminating the overhead of guaranteeing the storage order in the modern IO stack. The existing block device layer adopts a prohibitively expensive resort to ensure the storage order among write requests: interleaving successive write requests with transfer and flush. Exploiting the cache barrier command of Flash storage, we overhaul the IO scheduler, the dispatch module, and the filesystem so that these layers are orchestrated to preserve the ordering condition imposed by the application until the data blocks reach the storage surface. The key ingredients of the Barrier Enabled IO stack are Epoch Based IO Scheduling, Order Preserving Dispatch, and Dual Mode Journaling. The Barrier Enabled IO stack successfully eliminates the root cause of the excessive overhead in enforcing the storage order. Dual Mode Journaling in BarrierFS dedicates separate threads to effectively decouple the control plane and the data plane of the journal commit. We implement the Barrier Enabled IO Stack on a server as well as on a mobile platform. SQLite performance increases by 270% and 75% in the server and in the smartphone, respectively. Relaxing the durability of a transaction, SQLite performance and MySQL performance increase by as much as 73× and 43×, respectively, on server storage.

1 Introduction

The modern IO stack is a collection of arbitration layers: the IO scheduler, the command queue manager, and the storage writeback cache manager. Despite the compound uncertainties from the multiple layers of arbitration, it is essential for software writers to ensure the order in which the data blocks are reflected to the storage surface, the storage order, e.g. in guaranteeing the durability and the atomicity of a database transaction [46, 26, 35], in filesystem journaling [65, 40, 64, 4], in soft-update [41, 61], or in copy-on-write or log-structured filesystems [59, 35, 58, 31]. Preserving the ordering requirement across the layers of arbitration is achieved by an extremely expensive resort: dispatching the following request only after the data block associated with the preceding request is completely transferred and is made durable. We call this the transfer-and-flush mechanism. For decades, interleaving the writes with transfer-and-flush has been the fundamental principle to guarantee the storage order in a set of requests [23, 15].

Figure 1: Ordered write() IO vs. Orderless write(); A: mobile/eMMC5.0, B: mobile/UFS2.0, C: server/SATA3.0, D: server/NVMe, E: server/SATA3.0 (supercap), F: server/PCIe, G: Flash array

The concurrency and the parallelism in Flash storage, e.g. the multi-channel/way controller [70, 6], the large storage cache [47], and the deep command queue [18, 27, 69], have brought phenomenal performance improvements. A state of the art NVMe SSD reportedly exhibits up to 750 KIOPS of random read performance [69], which is nearly 4,000× of an HDD's performance. On the other hand, the time to program a Flash cell has barely improved, if it has not deteriorated [21]. This is due to the adoption of finer processes (sub 10 nm) [24, 36] and multiple bits per cell (MLC, TLC, and QLC) [5, 10] in the endless quest for higher storage density [42]. Despite the splendid performance improvement of Flash storage claimed by the storage vendors, service providers have difficulty in fully utilizing the underlying high performance storage.

Fig. 1 alerts us to an important trend. We examine the performance of writes with an ordering guarantee (write() followed by fdatasync()) against that without an ordering guarantee (write()). We test seven Flash storages with different degrees of parallelism. In a single channel mobile storage for a smartphone (SSD A), the performance of ordered write is 20% of that of the buffered write. In a thirty-two channel Flash array (SSD G), this ratio decreases to 1%. In an SSD with supercap (SSD E), the ordered write performance is 25% of that of the buffered write. There are two important observations. First, the overhead of transfer-and-flush becomes more severe as the degree of parallelism increases. Second, the use of Power-Loss Protection (PLP) hardware fails to eliminate the transfer-and-flush overhead. The overhead is going to get worse as Flash storage employs a higher degree of parallelism and denser Flash devices.

A fair amount of work has been dedicated to addressing the overhead of the storage order guarantee. The techniques deployed in production platforms include the non-volatile writeback cache at the Flash storage [22], the no-barrier mount option of the EXT4 filesystem [14], or the transactional checksum [55, 32, 62]. Efforts such as transactional writes at the filesystem [49, 17, 53, 35, 66] and transactional block devices [30, 71, 43, 67, 51] save the application from the overhead of enforcing the storage order associated with filesystem journaling. A school of works addresses more fundamental aspects of controlling the storage order, such as separating the ordering guarantee from the durability guarantee [8], providing a programming model to define the ordering dependency among a set of writes [19], and persisting a data block only when the result needs to be externally visible [48]. These works share the same essential principle in controlling the storage order: transfer-and-flush. For example, OptFS [8] checkpoints the data blocks only after the associated journal transaction becomes durable.
Featherstitch [19] realizes the ordering dependency between patchgroups by interleaving them with transfer-and-flush.

In this work, we revisit the issue of eliminating the transfer-and-flush overhead in the modern IO stack. We aim at developing an IO stack where the host can dispatch the following command before the data blocks associated with the preceding command become durable and before the preceding command is serviced, and yet the host can enforce the storage order between them.

We develop the Barrier Enabled IO stack, which effectively addresses our design objective. The Barrier Enabled IO stack consists of the cache barrier-aware storage device, the order preserving block device layer, and the barrier enabled filesystem. The Barrier Enabled IO stack is built upon the foundation that the host can control a certain partial order in which the cache contents are flushed, the persist order. Different from rotating media, the host can enforce a persist order without the risk of anomalous delay in Flash storage. With reasonable complexity, the storage controller can be made to flush the cache contents satisfying a certain ordering condition from the host [30, 56, 39]. The mobile Flash storage standard already defines the "cache barrier" command [28], which precisely serves this purpose. For the order preserving block device layer, the command dispatch mechanism and the IO scheduler of the block device layer are overhauled so that they preserve the partial order of the incoming sequence of requests when scheduling them. For the barrier enabled filesystem, we define new interfaces, fbarrier() and fdatabarrier(), to exploit the nature of the order preserving block device layer. The fbarrier() and fdatabarrier() system calls are the ordering-guarantee-only counterparts of fsync() and fdatasync(), respectively. fbarrier() shares the same semantics as osync() of OptFS [8]; it writes the dirty pages, triggers the filesystem journal commit, and returns without persisting them. fdatabarrier() ensures the storage order between its preceding writes and the following writes without flushing the writeback cache in between and without waiting for the DMA completion of the preceding writes. It is a storage version of the memory barrier, e.g. mfence [52]. OptFS does not provide an equivalent of fdatabarrier(). The order-preserving block device layer is filesystem-agnostic; fbarrier() and fdatabarrier() can be implemented in any filesystem. We modify EXT4 to support fbarrier() and fdatabarrier(), and present only the results for the EXT4 filesystem due to the space limit. We modify the journaling module of EXT4 and develop Dual Mode Journaling for the order preserving block device. We call the modified version of EXT4 BarrierFS.

The Barrier Enabled IO stack removes not only the flush overhead but also the transfer overhead in enforcing the storage order. While a large body of the preceding works successfully eliminates the flush overhead, few works have dealt with the overhead of DMA transfer in the storage order guarantee. The benefits of the Barrier Enabled IO stack include the following:

• The application can control the storage order virtually without any overhead: without being blocked and without stalling the queue.

• The latency of a journal commit decreases significantly. The journaling module can enforce the storage order between the journal logs and the journal commit mark without interleaving them with a flush and without interleaving them with DMA transfer.

• Throughput of the filesystem journaling improves significantly. Dual Mode Journaling commits multiple transactions concurrently.

Relaxing the durability of a transaction, SQLite performance and MySQL performance increase by as much as 73× and 43×, respectively, on server storage.

The source codes are currently unavailable to the public to abide by the double blind rule of the submission. We plan to open-source them shortly.

The rest of the paper is organized as follows. Section 2 introduces the background. Section 3, Section 4, and Section 5 explain the block device layer, the filesystem layer, and the applications of the Barrier Enabled IO stack, respectively. Section 6 and Section 7 discuss the results of the experiment and survey the related works, respectively. Section 8 concludes the paper.

2 Background

A write request travels a complicated route until the associated data blocks reach the storage surface. The filesystem puts the request into the IO scheduler queue. The block device driver removes one or more requests from the queue and constructs a command. It probes the device and dispatches the command if the device is available. The device is available if the command queue at the storage device is not full. Arriving at the storage device, the command is inserted into the command queue. The storage controller removes the command from the command queue and services it, i.e. transfers the data blocks between the host and the storage. When the transfer finishes, the device sends the completion signal to the host. The contents of the writeback cache are committed to the storage surface either periodically or by an explicit request from the host.

We define four types of orders in the IO stack:
Issue Order I, Dispatch Order D, Transfer Order C, and Persist Order P. The issue order I = {i_1, i_2, ..., i_n} is a set of write requests issued by the application or by the filesystem. The subscript denotes the order in which the requests enter the IO scheduler. The dispatch order D = {d_1, d_2, ..., d_n} denotes a set of write requests which are dispatched to the storage device. The subscript denotes the order in which the requests leave the IO scheduler. The transfer order C = {c_1, c_2, ..., c_n} is the set of transfer completions. The persist order P = {p_1, p_2, ..., p_n} is a set of operations which make the associated data blocks durable. Fig. 2 schematically illustrates the layers and the associated orders in the IO stack. We say that a certain partial order is preserved if the relative position of the requests against a certain designated request, the barrier, is preserved. We use the notation '=' to denote that a certain partial order is preserved. We briefly summarize the source of arbitration at each layer.

Figure 2: Set of queues in the IO stack: the sources of arbitration

• I ≠ D. The IO scheduler reorders and coalesces the IO requests subject to its optimization criteria, e.g. CFQ, DEADLINE, etc. When there is no scheduling mechanism, e.g. the NO-OP scheduler [3] or the NVMe [12] interface, the dispatch order may be equal to the issue order.

• D ≠ C. The storage controller freely schedules the commands in its command queue. Also, the data blocks can be transferred out of order due to errors, time-outs and retries.

• C ≠ P. The cache replacement algorithm, the mapping table update algorithm, and the storage controller's policy of scheduling Flash operations govern the persist order independently of the order in which the data blocks are transferred.

Due to all these sources of arbitration, the modern IO stack is said to be orderless [7].

Enforcing a storage order corresponds to preserving a partial order between the issue order I and the persist order P, i.e. satisfying the condition I = P. It is equivalent to collectively enforcing the individual ordering constraints between the layers:

(I = P) ≡ (I = D) ∧ (D = C) ∧ (C = P)    (1)

The modern IO stack has evolved under the assumption that the host cannot control the persist order, i.e. C ≠ P. The persist order specifically denotes the order in which the contents of the writeback cache are persisted, whereas the storage order denotes the order in which the write requests from the filesystem are persisted. For rotating media such as the hard disk drive, the disk scheduling is entirely left to the storage device due to its complicated sector geometry, which is hidden from the outside [20]. Blindly enforcing a certain persist order may bring unexpected delay in IO service. The inability to control the persist order, C ≠ P, is a fundamental limitation of the modern IO stack, which makes the condition I = P in Eq. 1 unsatisfiable.

To circumvent this limitation in satisfying a storage order, the host takes an indirect and expensive resort to satisfy each component in Eq. 1. First, after dispatching the write command to the storage device, the caller is blocked until the associated DMA transfer completes, Wait-on-Transfer. This prohibits the storage controller from servicing the commands in an out-of-order manner and satisfies the transfer order, D = C. This may stall the command queue. When the DMA transfer completes, the caller issues the flush command and blocks again, waiting for its completion. When the flush returns, the caller wakes up and issues the following command; Wait-on-Flush. These two are used in tandem, leaving the caller subject to a number of context switches. Transfer-and-flush is the unfortunate sole resort for enforcing the storage order in the modern orderless IO stack.
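As a concrete illustration, the following is a minimal host-side sketch of the transfer-and-flush protocol described above. It is not taken from any particular kernel; dispatch_write(), wait_for_dma_completion() and dispatch_flush_and_wait() are hypothetical stubs standing in for the dispatch and completion paths of the block layer and the device driver.

    #include <stdio.h>

    struct write_req { int id; };

    /* Stubs standing in for the block layer and the device driver. */
    static void dispatch_write(struct write_req *r)          { printf("dispatch W%d\n", r->id); }
    static void wait_for_dma_completion(struct write_req *r) { printf("W%d transferred\n", r->id); }
    static void dispatch_flush_and_wait(void)                { printf("cache flushed\n"); }

    /* Transfer-and-flush: request i is made durable before request i+1 is dispatched. */
    static void ordered_writes(struct write_req *reqs, int n)
    {
        for (int i = 0; i < n; i++) {
            dispatch_write(&reqs[i]);
            wait_for_dma_completion(&reqs[i]);   /* Wait-on-Transfer: D = C */
            dispatch_flush_and_wait();           /* Wait-on-Flush:    C = P */
        }
    }

    int main(void)
    {
        struct write_req reqs[3] = { {1}, {2}, {3} };
        ordered_writes(reqs, 3);
        return 0;
    }

The two blocking points in the loop are precisely the context switches and queue stalls that the rest of this paper sets out to remove.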
fsync() in EXT4

We examine how the EXT4 filesystem controls the storage order among the data blocks, the journal descriptor, the journal logs and the journal commit block in fsync() under Ordered mode journaling. In Ordered mode, EXT4 ensures that the data blocks are persisted before the associated journal transaction is.

Fig. 3 illustrates the behavior of an fsync(). The application dispatches the write requests for the dirty pages, D. After dispatching the write requests, the application blocks and waits for the completion of the associated DMA transfer. When the DMA transfer completes, the application thread resumes and triggers the JBD thread to commit the journal transaction. After triggering the JBD thread, the application thread sleeps again. When the JBD thread makes the journal transaction durable, fsync() returns, waking up the caller. The JBD thread should be triggered only after D is completely transferred. Otherwise, the storage controller may service the write requests for D, JD and JC in an out-of-order manner and may persist the journal transaction prematurely, before D reaches the writeback cache. If this happens, the filesystem can be recovered incorrectly in case of an unexpected system failure.

Figure 3: DMA, flush and context switches in fsync()

A journal transaction consists of the journal descriptor block, one or more log blocks and the journal commit block. A transaction is usually written to the storage with two requests: one for writing the coalesced chunk of the journal descriptor block and the log blocks, and the other for writing the commit block. In the rest of the paper, we use JD and JC to denote the coalesced chunk of the journal descriptor and the log blocks, and the commit block, respectively. JBD needs to enforce the storage order in two situations: JD needs to be made durable before JC, and the journal transactions need to be made durable in the order in which they have been committed. When either of the two conditions is violated, the filesystem may recover incorrectly in case of an unexpected system failure [65, 8]. JBD interleaves the write request for JD and the write request for JC with transfer-and-flush. To control the storage order between the transactions, the JBD thread waits for JC to become durable before it starts committing the next journal transaction.

An fsync() can be represented as a tandem of Wait-on-Transfer and Wait-on-Flush as in Eq. 2. D, JD and JC denote the write requests for D, JD and JC, respectively. 'xfer' and 'flush' denote wait-for-transfer and wait-for-flush, respectively.

D → xfer → JD → xfer → flush → JC → xfer → flush    (2)

In the early days, the block device layer was responsible for issuing the flush and for waiting for its completion [63, 25]. This approach blocks not only the caller but all the other requests which share the same dispatch queue [14]. Since the Linux 2.6.37 kernel, this role has migrated from the block device layer to the filesystem layer [15]. The filesystem uses the flush option (REQ_FLUSH) and the force-unit-access option (REQ_FUA) in writing JC, and the filesystem blocks until it completes. With the FLUSH option, the storage device flushes the writeback cache before servicing the command. With the FUA option, the storage controller writes the given block directly to the storage surface. The last four steps in Eq. 2 can be compressed into a single write request with the FLUSH|FUA options. When the filesystem is responsible for waiting for the completion of the flush, the other commands in the dispatch queue can progress after JC with FLUSH|FUA is dispatched. In both approaches, the caller is subject to the transfer-and-flush overhead to interleave JD and JC.

3 Order Preserving Block Device Layer

We overhaul the IO scheduler, the dispatch module and the write command to satisfy each of the three conditions, I = D, D = C, and C = P, respectively.

In the legacy IO stack, the host has been entirely responsible for controlling the storage order; the host postpones sending the following command until it ensures that the result of the preceding command is made durable. In the Barrier Enabled IO stack, the host and the storage device share the responsibility. The host side block device layer is responsible for dispatching the commands in order. The host and the storage device collaborate with each other to transfer the data blocks (or, equivalently, to service the commands) in order. The way in which the host and the storage device collaborate with each other will be detailed shortly. The storage device is responsible for making them durable in order. This effective orchestration between the host and the storage device saves the IO stack from the overhead of the transfer-and-flush based storage order guarantee. Fig. 4 illustrates the organization of the Barrier Enabled IO stack.

Figure 4: Organization of the Barrier Enabled IO stack

The order preserving block device layer is responsible for dispatching the commands in order and for having them serviced in order. The IO scheduler and the command dispatch module are redesigned to preserve the order. The order preserving block device layer defines two types of write requests: orderless and order-preserving. There exists a special type of order-preserving request called the barrier. We introduce two new attributes, REQ_ORDERED and REQ_BARRIER, for the order-preserving request and the barrier request, respectively. We call a set of order-preserving write requests which can be reordered with each other an epoch [13]. A barrier request is used to delimit an epoch.
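To make the epoch and barrier notions concrete, here is a small sketch of how a block layer might tag requests with such attributes. The flag values and the request structure are hypothetical; the paper only specifies that REQ_ORDERED marks an order-preserving request and REQ_BARRIER marks the barrier that delimits an epoch.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical flag bits; the actual values are an implementation choice. */
    #define REQ_ORDERED (1u << 0)   /* order-preserving write            */
    #define REQ_BARRIER (1u << 1)   /* barrier write, delimits an epoch  */

    struct request {
        uint64_t sector;
        uint32_t nr_sectors;
        uint32_t flags;
    };

    /* An epoch is a set of order-preserving writes that may be reordered with
     * each other; the last write of the epoch is tagged as the barrier. */
    static void mark_epoch(struct request *reqs, int n)
    {
        for (int i = 0; i < n; i++)
            reqs[i].flags |= REQ_ORDERED;
        reqs[n - 1].flags |= REQ_BARRIER;
    }

    static bool is_barrier(const struct request *r)
    {
        return (r->flags & REQ_BARRIER) != 0;
    }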
Barrier Write, the Command

The "cache barrier", or "barrier" for short, command is defined in the standard command set for mobile Flash storage [28]. When the storage controller receives the barrier command, the controller guarantees that the data blocks transferred after the barrier command reach the storage surface after the data blocks transferred before the barrier command do, without flushing the cache in between. A few eMMC products in the market support the cache barrier command [1, 2]. Via the barrier command, the IO stack can satisfy the persist order without a cache flush. The essential condition C = P in ensuring the storage order can now be satisfied with the barrier command.

We start our effort by devising a more efficient barrier write command. Implementing a barrier as a separate command occupies one entry in the command queue and costs the host the latency of dispatching a command. To avoid this overhead, we define a barrier as a command flag, REQ_BARRIER, of the write command, as in the case of REQ_FUA or REQ_FLUSH. In our implementation, we designate one unused bit in the SCSI command as a barrier flag.

We now discuss the implementation aspect of a barrier command. It is a matter of how the storage controller can enforce the persist order imposed by the barrier command. When the Flash storage device has a Power Loss Protection (PLP) feature, e.g. a supercapacitor, supporting a barrier command is trivial. Thanks to PLP, the writeback cache contents are always guaranteed to be durable. The storage controller can flush the writeback cache in any order, fully utilizing its parallelism, and yet can guarantee the persist order. There is no performance overhead in enforcing the persist order.

For devices without PLP, the barrier command can be supported in three ways: in-order writeback, transactional writeback, or in-order recovery from crash. In in-order writeback, the storage controller flushes the data blocks on an epoch basis and inserts some delay in between if necessary. It may fail to fully exploit the underlying parallelism in the storage controller. In transactional writeback, the storage controller flushes the writeback cache contents as a single atomic unit [56, 39]. Since all the epochs in the writeback cache are flushed together, the constraint imposed by the barrier command is satisfied. The performance overhead of the transactional flush is 12% in the worst case with a traditional commit approach, but it can be eliminated by maintaining a next-page pointer in the spare area of the Flash page [56].

The in-order recovery method guarantees the persist order imposed by the barrier command through the crash recovery routine. When multiple controller cores concurrently write the data blocks to multiple channels, one may have to use a sophisticated crash recovery protocol such as ARIES [45] to recover the storage to a consistent state. If the entire Flash storage is treated as a single log device, we can use the simple crash recovery algorithm of LFS [59]. Since the persist order is enforced by the crash recovery logic, the controller is able to flush the writeback cache as if there were no ordering dependency. The controller is saved from the performance penalty at the cost of complexity in the recovery routine.

We implement the cache barrier command in a UFS device, which is a commercial product used in smartphones. We use a simple LFS style recovery routine. The UFS controller treats the entire storage as a single log structured device and maintains an active segment in memory. The FTL appends incoming data blocks to the active segment in the order in which they are transferred. It naturally satisfies the ordering constraints between the epochs. When an active segment becomes full, it is striped across the multiple Flash chips in a log-structured manner. In crash recovery, the UFS controller locates the beginning of the most recently flushed segment. It scans the pages in the segment from the beginning until it first encounters a page which has not been programmed properly. The storage controller discards the rest of the pages, including the incomplete one.

Developing a sophisticated barrier-aware SSD controller is subject to a number of design choices and should be dealt with in detail in a separate context. Through this work, we demonstrate that the performance benefit of using the cache barrier command deserves the complexity of implementing it, provided that the host side IO stack can properly exploit it.
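The in-order recovery scan described above can be sketched as follows. The segment and page layout and the programmed_ok check are hypothetical; the sketch only illustrates the scan-until-first-incomplete-page rule that the UFS controller applies to the most recently flushed segment.

    #include <stddef.h>

    #define PAGES_PER_SEGMENT 256

    struct flash_page { int programmed_ok; };   /* set only when the program operation completed */
    struct segment    { struct flash_page pages[PAGES_PER_SEGMENT]; };

    /* The controller scans the most recently flushed segment from its beginning
     * and discards everything from the first improperly programmed page onward.
     * Because blocks were appended in transfer order, the surviving prefix also
     * respects the order between epochs. */
    static size_t recover_segment(const struct segment *seg)
    {
        size_t valid = 0;
        for (size_t i = 0; i < PAGES_PER_SEGMENT; i++) {
            if (!seg->pages[i].programmed_ok)
                break;                           /* discard this page and the rest */
            valid++;
        }
        return valid;
    }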
Epoch Based IO Scheduling

There are three scheduling principles in Epoch Based IO Scheduling. First, it preserves the partial order between the epochs. Second, the requests within an epoch can be freely scheduled with each other. Third, the orderless requests can be scheduled freely across the epochs. It satisfies the I = D condition.

The Epoch Based IO scheduler uses an existing IO scheduler, e.g. CFQ or NO-OP, to schedule the IO requests within an epoch. The key ingredient of the order preserving IO scheduler is Epoch Based Barrier Reassignment. When an IO request enters the scheduler queue, the order preserving IO scheduler examines whether it is a barrier request. If the request is not a barrier request, it is inserted as a normal request. If the request is a barrier write request, the IO scheduler removes the barrier flag from the request and inserts it into the queue. After the scheduler inserts a barrier write, the scheduler stops accepting more requests. The IO scheduler reorders and merges the IO requests in the queue based upon its own scheduling discipline, e.g. FIFO, SCAN, or CFQ. The requests in the queue either are orderless or belong to the same epoch; therefore, they can be freely scheduled with each other without violating the ordering condition. A merged request is order-preserving if one of its constituents is order-preserving. The IO scheduler designates the order-preserving request that leaves the queue last as the new barrier. This mechanism is called Epoch Based Barrier Reassignment. When there are no more order-preserving requests in the queue, the IO scheduler starts accepting IO requests again. When the IO scheduler unblocks the queue, there can be one or more orderless requests in the queue.
These orderless requests can be scheduled with the other requests in the following epoch. By differentiating the order-preserving requests from the orderless ones, we avoid imposing unnecessary ordering constraints on the requests. Currently, the Epoch Based IO scheduler is implemented on top of the existing CFQ scheduler. Each process defines its own scheduler queue.

Figure 5: Epoch Based Barrier Reassignment

Fig. 5 illustrates how the barrier reassignment works. The circular and the rectangular write requests denote the order-preserving attribute and the barrier attribute, respectively. In Fig. 5, the application calls fsync() and, in the meantime, the pdflush daemon flushes the dirty pages. fsync() creates three write requests: w1, w2 and w4. The filesystem marks the three requests as order-preserving ones and designates the last request, w4, as a barrier write. pdflush creates three write requests, w3, w5 and w6; they are all orderless. The requests from the two threads are fed to the IO scheduler as w1, w2, w3, w5, w4 (barrier) and w6, in that order. When the barrier write w4 enters the queue, the scheduler stops accepting new requests. There are only five requests in the queue, w1, w2, w3, w4 and w5; w6 cannot be inserted into the queue since the queue is blocked. The IO scheduler reorders and merges the five requests in the queue and dispatches them. After they are scheduled, the IO scheduler puts the barrier flag on the order-preserving request that leaves the queue last. In this scenario, the orderless request w6 is going to be scheduled with the requests in the following epoch.
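A compact sketch of this logic is given below. The queue structure and helper names are hypothetical and elide request merging and the underlying CFQ discipline; the sketch only captures the rules stated above: strip the barrier flag on insertion, block the queue behind a barrier, and re-attach the barrier flag to the last order-preserving request that leaves the queue.

    #include <stdbool.h>

    #define REQ_ORDERED (1u << 0)
    #define REQ_BARRIER (1u << 1)

    struct request { unsigned int flags; /* block range omitted */ };

    struct epoch_queue {
        int  count;          /* requests currently held by the scheduler      */
        int  ordered_left;   /* order-preserving requests not yet dispatched  */
        bool blocked;        /* set once a barrier write has been inserted    */
    };

    /* Insertion: strip the barrier flag and block the queue behind the epoch. */
    static bool epoch_insert(struct epoch_queue *q, struct request *r)
    {
        if (q->blocked)
            return false;                 /* caller retries after the epoch drains */
        if (r->flags & REQ_BARRIER) {
            r->flags &= ~REQ_BARRIER;     /* the barrier will be reassigned later  */
            q->blocked = true;
        }
        if (r->flags & REQ_ORDERED)
            q->ordered_left++;
        q->count++;
        return true;
    }

    /* Dispatch: requests leave in whatever order the underlying scheduler
     * (CFQ, NO-OP, ...) picks; the last order-preserving request to leave
     * becomes the new barrier, and the queue is unblocked afterwards. */
    static void epoch_dispatch(struct epoch_queue *q, struct request *r)
    {
        q->count--;
        if ((r->flags & REQ_ORDERED) && --q->ordered_left == 0) {
            r->flags |= REQ_BARRIER;      /* Epoch Based Barrier Reassignment */
            q->blocked = false;           /* start accepting the next epoch   */
        }
    }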
Order Preserving Dispatch

Order preserving dispatch is a fundamental innovation of this work. In order preserving dispatch, the host dispatches the following write request as soon as the storage device acknowledges that the preceding request has been successfully received (Fig. 6(a)), and yet the transfer order between the two requests is preserved, i.e. D = C. Order preserving dispatch guarantees the transfer order without blocking the caller. The legacy IO stack controls the transfer order with Wait-on-Transfer. Wait-on-Transfer not only exposes the caller to the context switch overhead but also makes the IO latency less predictable. It may stall the storage device since the caller postpones dispatching the following command until the preceding command is serviced. Order preserving dispatch eliminates all these overheads.

Figure 6: Order Preserving Dispatch; (a) when the device is available, (b) when the device is busy

For order preserving dispatch, the only thing the host block device driver does is to set the priority of a barrier write command to ordered when dispatching it. Then, the SCSI compliant storage device automatically guarantees the transfer order constraint in serving the requests. The SCSI standard defines three command priority levels: head of the queue, ordered, and simple [57], with which the incoming command is put at the head of the command queue, at the tail of the command queue, or at an arbitrary position determined by the storage controller, respectively. In addition, a simple command cannot be inserted in front of the existing "ordered" or "head of the queue" commands. The head of the queue priority is used when a command requires immediate service, e.g. the flush command. Via setting the priority of the barrier write command to ordered, the host ensures that the data blocks associated with the write requests in the preceding epoch are transferred ahead of the data blocks associated with the barrier write. Likewise, the data blocks associated with the following epoch are transferred after the data blocks associated with the barrier write are transferred. The transfer order condition is satisfied.

The caller may be blocked after dispatching the write request. This can happen when the device is unavailable or when the caller is switched out involuntarily, e.g. its time quantum expires. For both cases, the block device driver of the order preserving dispatch module uses the same error handling routine adopted by the existing block device driver: the kernel daemon inherits the task and retries dispatching the request after a certain time interval, e.g. 3 msec for a SCSI device [57] (Fig. 6(b)). The thread resumes once the request is dispatched successfully.
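The dispatch rule itself reduces to a few lines. The helper and flag names below are hypothetical and are not the actual Linux SCSI API; the sketch only shows the decision the driver makes: a barrier write is sent with the ORDERED task attribute, every other write with the SIMPLE attribute.

    /* Hypothetical SCSI task attributes and command structure. */
    enum task_attr { TASK_SIMPLE, TASK_ORDERED, TASK_HEAD_OF_QUEUE };

    #define REQ_BARRIER (1u << 1)

    struct scsi_cmd {
        enum task_attr attr;
        unsigned int   flags;   /* carries REQ_BARRIER when set by the IO scheduler */
    };

    /* Order preserving dispatch: no Wait-on-Transfer is needed because the
     * device will not move other commands across an ORDERED command in its
     * command queue. */
    static void dispatch_cmd(struct scsi_cmd *cmd)
    {
        cmd->attr = (cmd->flags & REQ_BARRIER) ? TASK_ORDERED : TASK_SIMPLE;
        /* hand the command to the device here; the caller does not block */
    }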
4 BarrierFS: Barrier Enabled Filesystem

4.1 Programming Model

We propose two new filesystem interfaces, fbarrier() and fdatabarrier(), which are the ordering-guarantee-only counterparts of fsync() and fdatasync(), respectively. fbarrier() shares the same semantics as osync() in OptFS [8]. The salient feature of BarrierFS is fdatabarrier(). fdatabarrier() returns after dispatching the write requests for the dirty pages. With fdatabarrier(), the application can enforce a storage order virtually without any overhead: without a flush, without waiting for DMA completion, and even without a context switch. The following codelet illustrates the usage of fdatabarrier():

    write(fileA, "Hello");
    fdatabarrier(fileA);
    write(fileA, "World");

It ensures that "Hello" is written to the storage surface ahead of "World". Modern applications have been using the expensive fdatasync() to guarantee both durability and ordering. For example, SQLite, which is the default DBMS in mobile platforms such as Android, iOS or Tizen, uses fdatasync() to ensure that the updated database node reaches the disk surface ahead of the updated database header. In SQLite, fdatabarrier() can replace fdatasync() when it is used for ensuring the storage order, not the durability.

The Barrier Enabled IO stack is filesystem agnostic. fbarrier() and fdatabarrier() can be implemented in any filesystem using the proposed order preserving block device layer. As a seminal work, we modify the EXT4 filesystem for the order preserving block device layer. We optimize fsync() and fdatasync() for the order preserving block device layer and newly implement fbarrier() and fdatabarrier(). We name the modified EXT4 BarrierFS. fbarrier() in BarrierFS supports all journal modes of EXT4: WRITEBACK, ORDERED and DATA.
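As a usage illustration, the sketch below shows the common write-ahead-logging pattern in which only the final durability point actually needs a flush, while the intermediate ordering point can use fdatabarrier(). The file layout and names are made up, and fdatabarrier() is the interface proposed in this paper rather than part of stock Linux, so the sketch builds only against a BarrierFS-style kernel and C library.

    #include <stddef.h>
    #include <unistd.h>

    /* Proposed BarrierFS interface; not part of stock Linux. */
    extern int fdatabarrier(int fd);

    /* Commit one record: the log payload must reach the storage surface before
     * the commit mark, but only the commit mark itself must be durable. */
    static int commit_record(int log_fd, const char *payload, size_t len)
    {
        if (write(log_fd, payload, len) < 0)
            return -1;
        if (fdatabarrier(log_fd) < 0)        /* ordering only: no flush, no DMA wait */
            return -1;
        if (write(log_fd, "COMMIT", 6) < 0)
            return -1;
        return fdatasync(log_fd);            /* durability of the whole record */
    }

In a durability-relaxed setting, the final fdatasync() could itself be replaced with fdatabarrier(), which is exactly the EXT4-OD/BFS-OD configuration evaluated later in the paper.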
Dual Mode Journaling

Committing a journal transaction essentially consists of two separate tasks: dispatching the write commands for JD and JC to the storage (host side) and making them durable (storage side). In the order preserving block device design, the host (the block device layer) is responsible for controlling the dispatch order and the transfer order while the storage controller takes care of the persist order. The design of the order preserving block device layer naturally supports the separation of the control plane (dispatching the write requests) and the data plane (persisting the associated data blocks and journal transaction) in filesystem journaling. For effective separation, these two planes should work independently with minimum dependency. For filesystem journaling, we allocate separate threads for dispatching the write requests and for making them durable: the commit thread and the flush thread, respectively. This mechanism is called Dual Mode Journaling.

Figure 7: fsync() and fbarrier(); (a) fsync() in EXT4 with FLUSH/FUA, (b) fsync() and fbarrier() in BarrierFS; D: DMA for dirty pages, JD: DMA for journal descriptor, JC: DMA for journal commit block
The commit thread is responsible for dispatching the write requests for JD and JC. In BarrierFS, the commit thread tags both requests with REQ_ORDERED and REQ_BARRIER so that JD and JC are transferred in order and are guaranteed to be persisted in order. After dispatching the write request for JC, the commit thread inserts the journal transaction into the committing transaction list. Under ordering guarantee (fbarrier()), the commit thread then wakes up the caller. In the legacy IO stack, the JBD thread interleaves the write request for JD and the write request for JC with transfer-and-flush. In BarrierFS, the commit thread dispatches them under the order-preserving dispatch discipline, without the Wait-on-Transfer or Wait-on-Flush overhead.

The flush thread is responsible for (i) issuing the flush command, (ii) handling errors and retries and (iii) removing the transaction from the committing transaction list. The flush thread is triggered when JC is transferred. If the journaling is triggered by fbarrier(), the flush thread removes the transaction from the committing transaction list and returns. It does not issue a flush, and there is no caller to wake up. If the journaling is initiated by fsync(), the flush thread flushes the cache, removes the associated transaction from the committing transaction list and wakes up the caller. By separating the control plane (commit thread) and the data plane (flush thread), the commit thread can commit the following transaction right after it is done with dispatching the write requests for the preceding journal commit. In Dual Mode Journaling, there can be more than one committing transaction in flight.

In fsync() or fbarrier(), BarrierFS dispatches the write request for D as an order-preserving request. Then, the commit thread dispatches the write requests for JD and JC, each as an order-preserving barrier write. As a result, D and JD form a single epoch while JC by itself forms another. A journal commit consists of the two epochs: {D, JD} and {JC}. An fsync() in BarrierFS can be represented as in Eq. 3, where the subscript BAR denotes a barrier write; the prefix D → JD_BAR → JC_BAR by itself denotes an fbarrier().

D → JD_BAR → JC_BAR → xfer → flush    (3)

The benefit of Dual Mode Journaling is substantial. In EXT4 (Fig. 7(a)), an fsync() consists of a tandem of three DMAs and two flushes, interleaved with context switches. In BarrierFS, an fsync() consists of a single flush, three DMAs (Fig. 7(b)) and a smaller number of context switches. The transfer-and-flush between JD and JC is completely eliminated. fbarrier() returns almost instantly after the commit thread dispatches the write request for JC.

BarrierFS forces a journal commit if fdatasync() or fdatabarrier() does not find any dirty pages. Through this scheme, fdatasync() (or fdatabarrier()) can delimit an epoch despite the absence of dirty pages.
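The division of labor between the two journaling threads can be sketched as follows. The transaction structure, flag names and helpers are hypothetical stubs; the sketch only mirrors the protocol above: the caller has already dispatched D as an order-preserving write, the commit thread dispatches JD and JC as barrier writes and moves on, and the flush thread issues the flush and wakes fsync() callers only.

    #include <stdbool.h>
    #include <stdio.h>

    #define REQ_ORDERED (1u << 0)
    #define REQ_BARRIER (1u << 1)

    struct txn { bool needs_flush; };   /* true for fsync(), false for fbarrier() */

    /* Stubs standing in for the block layer, the committing-transaction list
     * and the wait queues of a real implementation. */
    static void dispatch_blocks(const char *what, unsigned int flags)
    { printf("dispatch %s%s\n", what, (flags & REQ_BARRIER) ? " (barrier)" : ""); }
    static void list_add_committing(struct txn *t)  { (void)t; }
    static void list_del_committing(struct txn *t)  { (void)t; }
    static void wake_up_caller(struct txn *t)       { (void)t; printf("caller wakes up\n"); }
    static void issue_flush_and_wait(void)          { printf("FLUSH\n"); }

    /* Control plane: D was already dispatched by the calling thread as an
     * order-preserving write; dispatch JD and JC and return immediately. */
    static void commit_thread_commit(struct txn *t)
    {
        dispatch_blocks("JD", REQ_ORDERED | REQ_BARRIER);   /* closes epoch {D, JD} */
        dispatch_blocks("JC", REQ_ORDERED | REQ_BARRIER);   /* epoch {JC}           */
        list_add_committing(t);
        if (!t->needs_flush)
            wake_up_caller(t);        /* fbarrier(): ordering only, return now */
        /* the commit thread may start committing the next transaction here */
    }

    /* Data plane: invoked when the DMA transfer of JC completes. */
    static void flush_thread_complete(struct txn *t)
    {
        if (t->needs_flush) {
            issue_flush_and_wait();   /* make the committed transaction durable */
            wake_up_caller(t);        /* the fsync() caller returns here        */
        }
        list_del_committing(t);
    }

    int main(void)
    {
        struct txn fsync_txn = { .needs_flush = true };
        commit_thread_commit(&fsync_txn);
        flush_thread_complete(&fsync_txn);
        return 0;
    }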
Handling Page Conflicts

A buffer page can belong to only one journal transaction at a time [65]. Blindly inserting a buffer page into the running transaction may remove it from the committing transaction before it becomes durable. We call this situation a page conflict. In both EXT4 and BarrierFS, when the application thread inserts a buffer page into the running transaction, it checks whether the buffer page is being held by the committing transaction. If so, the application blocks without inserting it into the running transaction. When the JBD thread of EXT4 (or the flush thread of BarrierFS) has made the committing transaction durable, it identifies the conflict pages in the committed transaction and inserts them into the running transaction. In EXT4, there is only one committing transaction at a time. The running transaction is guaranteed to be conflict free when the JBD thread resolves the page conflicts from the committed transaction.

In BarrierFS, the running transaction can conflict with more than one committing transaction, a multi-transaction page conflict. When the flush thread resolves the page conflicts from a committed transaction, the running transaction may still conflict with the other committing transactions. If the running transaction is committed prematurely with conflicted pages missing, the storage order can be compromised. Whenever the flush thread resolves the page conflicts and notifies the commit thread of its completion of persisting a transaction, the commit thread would have to scan all the pages in the other committing transactions for page conflicts. To reduce the overhead of scanning the pages, we introduce the conflict-page list. The application thread inserts a buffer page into the conflict-page list if the buffer page is being held by one of the committing transactions. When the flush thread has made a committing transaction durable, it inserts the conflict pages into the buffer page list of the running transaction and removes them from the conflict-page list. The commit thread can start committing a running transaction only when the conflict-page list is empty.
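A minimal sketch of the conflict-page list bookkeeping, assuming a simple singly linked list with hypothetical structure names, is shown below. It captures the three rules above: a page held by a committing transaction is parked on the conflict-page list instead of the running transaction, a completed flush moves parked pages back, and the commit thread only proceeds when the list is empty.

    #include <stdbool.h>
    #include <stddef.h>

    struct page {
        struct page *next;
        bool held_by_committing;     /* set while some committing txn owns the page */
    };

    struct journal {
        struct page *running;        /* buffer pages of the running transaction */
        struct page *conflict;       /* conflict-page list                      */
    };

    /* Application thread: add a dirtied page to the running transaction,
     * or park it on the conflict-page list if it is still being committed. */
    static void add_page(struct journal *j, struct page *p)
    {
        struct page **list = p->held_by_committing ? &j->conflict : &j->running;
        p->next = *list;
        *list = p;
    }

    /* Flush thread: a committing transaction became durable.  Simplified:
     * this moves every parked page; a real implementation moves only the
     * pages held by the transaction that just became durable. */
    static void resolve_conflicts(struct journal *j)
    {
        while (j->conflict) {
            struct page *p = j->conflict;
            j->conflict = p->next;
            p->held_by_committing = false;
            p->next = j->running;
            j->running = p;
        }
    }

    /* Commit thread: commit only when the conflict-page list is empty. */
    static bool can_commit(const struct journal *j)
    {
        return j->conflict == NULL;
    }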
We examine how the journaling throughput varies with different methods of journal commit: BarrierFS, EXT4 with the no-barrier option, EXT4 with a supercap SSD, and plain EXT4. Fig. 8 schematically illustrates the behaviors. With the no-barrier mount option, the filesystem does not issue the flush command in fsync() or fdatasync(). t_D, t_C and t_F denote the dispatch latency, the transfer latency, and the flush latency associated with committing a journal transaction, respectively; t_ε denotes the flush latency in a supercap SSD. With a supercap SSD, EXT4 (quick flush), the journal commits are interleaved by t_D + t_C + t_ε: the host observes the round-trip delay of the flush command and the associated context switch overhead, t_ε, which is not negligible in Flash storage. EXT4 with the no-barrier option, EXT4 (no flush), can commit a new transaction once all the associated blocks are transferred to the storage; the journal commits are interleaved by the command dispatch and the DMA transfer, t_D + t_C. In BarrierFS, the commit thread keeps dispatching the journal commit operations without waiting for the completion of the transfer. The interval between successive journal commits can be as small as t_D.

Figure 8: fsync() under different storage order guarantees: BarrierFS (t_D), EXT4 no flush (t_D + t_C), EXT4 quick flush (t_D + t_C + t_ε), EXT4 full flush (t_D + t_C + t_F); t_D: dispatch latency, t_C: transfer latency, t_ε: flush latency in supercap SSD, t_F: flush latency

5 Application

fsync() accounts for a dominant fraction of IO in modern applications, e.g. mail servers [60] or OLTP. 90% of the IOs in the TPC-C workload are created by fsync() for synchronizing the logs to the storage [50]. The order preserving IO stack can significantly improve the performance of these workloads. SQLite may be the application which the Barrier Enabled IO stack benefits the most. SQLite uses fdatasync() not only to guarantee the durability of a transaction but also to control the storage order on various occasions, e.g. between writing the undo-log and storing the journal header, and between writing the updated database node and writing the commit block [37]. In a single insert transaction, SQLite calls fdatasync() four times, three of which are to control the storage order. We can replace them with fdatabarrier()'s without compromising the durability of a transaction. Some applications prefer to trade the durability and the freshness of the result for the performance and the scalability of the operation [11, 16]. The benefit of BarrierFS can be more than significant in these applications. One can replace all fsync() and fdatasync() calls with their ordering-guarantee counterparts, fbarrier() and fdatabarrier(), respectively.

6 Experiment

We implement the Barrier Enabled IO stack on three different platforms: a smartphone (Galaxy S6, Android 5.0.2, Linux 3.10), a PC server (4 cores, Linux 3.10.61) and an enterprise server (16 cores, Linux 3.10.61). We test three storage devices: mobile storage (UFS 2.0, QD=16, single channel), 850 PRO for server (SATA 3.0, QD=32, 8 channels), and 843TN for server (SATA 3.0, QD=32, 8 channels, supercap), where QD denotes the queue depth. We call each of these UFS, plain-SSD and supercap-SSD, respectively. We implement the barrier write command in the UFS device. In the plain-SSD, we introduce a 5% performance penalty to simulate the barrier overhead. For the supercap-SSD, we assume that there is no barrier overhead.
Figure 9: 4KB Random Write; XnF: write() followed by fdatasync(), X: write() followed by fdatasync() (no-barrier option), B: write() followed by fdatabarrier(), P: plain buffered write()

We examine the performance of 4 KByte random writes with different ways of enforcing the storage order. Fig. 9 illustrates the result. In scenario 'X', where 'X' denotes Wait-on-Transfer, the host sends the following request after the data block associated with the preceding request is completely transferred. Despite the absence of the flush overhead, the storage devices exhibit less than 50% of their plain buffered write performance, scenario 'P'. All three devices are severely underutilized; the average queue depth in all three devices is less than one. The Wait-on-Transfer overhead in the modern IO stack prohibits the host from properly exploiting the underlying Flash storage. In scenario 'B', where 'B' denotes Barrier, the IO performance increases by at least 2× against scenario 'X'. The average queue depths reach near the maximum in all three Flash storages. An fdatabarrier() is not entirely free: we observe a 1% to 25% performance deficiency compared against the plain buffered write. Plain buffered write exhibits a shorter queue depth than barrier write does (Fig. 9). This is because in plain buffered write, the IO scheduler merges multiple requests and the number of commands dispatched to the storage device decreases.

Fig. 10 is another manifestation of fdatabarrier(). The storage performance is closely related to the command queue utilization [33]. When the requests are interleaved with DMA transfers, the queue depth never goes beyond one (Fig. 10(a) and Fig. 10(c)). When the write request is followed by fdatabarrier(), the queue depth grows near to its maximum in all three storages (Fig. 10(b) and Fig. 10(d)). The order preserving block layer enables the host to fully exploit the concurrency and the parallelism of the underlying Flash storage.

Figure 10: Queue depth over time, 4KB Random Write; Wait-For-Transfer: write() followed by fdatasync() with no barrier, Barrier: write() followed by fdatabarrier(); (a) Wait-For-Transfer, plain SSD, (b) Barrier, plain SSD, (c) Wait-For-Transfer, UFS, (d) Barrier, UFS

Figure 11: Average number of context switches per fsync()/fbarrier(), 4 KByte write() followed by fsync() or fbarrier(); EXT4-DR: fsync(), BFS-DR: fsync(), EXT4-OD: fsync() with no-barrier, BFS-OD: fbarrier()

Latency: In the plain-SSD and the supercap-SSD, the average fsync() latency decreases by 40% when we use BarrierFS compared with EXT4 (Table 1). UFS experiences a more significant reduction in fsync() latency than the SSDs do. The smartphone uses transactional checksum in filesystem journaling. With BarrierFS, we can eliminate not only the transfer overhead but also the checksum overhead; the fsync() latency decreases by 60% in BarrierFS. In the supercap-SSD and UFS, the fsync() latencies at the 99.99th percentile are 30× the average fsync() latency (Table 1). Using BarrierFS, the tail latencies at the 99.99th percentile decrease by 50%, 20% and 70% in UFS, plain-SSD and supercap-SSD, respectively, against EXT4.

Table 1: fsync() latency statistics (msec)
Context Switches: We examine the number of application level context switches in the various modes of journaling. Fig. 11 illustrates the result. In EXT4-DR, fsync() wakes up the caller twice: after the DMA transfer of D completes and after the journal transaction is made durable. This applies to all three Flash storages. In BarrierFS, fsync() wakes up the caller only once, after the transaction is made durable. In UFS and the supercap-SSD, fsync() of BFS-DR wakes up the caller twice, for entirely different reasons. In UFS and the supercap-SSD, the interval between successive write requests is much smaller than the timer interrupt interval due to the small flush latency. As a result, write() requests rarely update the time fields of the inode and fsync() becomes an fdatasync(). fdatasync() wakes up the caller twice in BarrierFS: after transferring D and after the flush completes. The plain-SSD uses TLC flash, and the interval between successive write()'s can be longer than the timer interrupt interval. In the plain-SSD, fsync() occasionally commits a journal transaction and the average number of context switches becomes less than two in BFS-DR.

BFS-OD manifests the benefits of BarrierFS. fbarrier() rarely finds updated metadata since it returns quickly. Most fbarrier() calls are serviced as fdatabarrier(). fdatabarrier() does not block the caller and does not release the CPU voluntarily. The number of context switches in fbarrier() is much smaller than in EXT4-OD. BarrierFS significantly improves the context switch overhead against EXT4.

Figure 12: Queue depth changes in BarrierFS: write() followed by fsync() (durability guarantee) vs. write() followed by fbarrier() (ordering guarantee)
Command Queue Utilization: In BarrierFS, fsync() drives the queue depth up to two (Fig. 12(a)). Theoretically, it can drive the queue depth up to three because the host can dispatch the write requests for D, JD and JC in tandem. According to our instrumentation, there exists a 160 µsec context switch interval between the application thread and the commit thread, and it takes approximately 70 µsec to transfer a 4 KByte block from the host to the device cache. The command from the application thread is therefore serviced before the commit thread dispatches the command for writing JD. With fbarrier(), BarrierFS successfully saturates the command queue (Fig. 12(b)); the queue depth increases to fifteen.
Throughput: We examine the throughput of filesystem journaling under a varying number of CPU cores. We use a modified DWSL workload of fxmark [44]. In the DWSL workload, each thread performs 4 KByte allocating writes, each followed by fsync(). Each thread operates on its own file and writes 1 GByte in total. BarrierFS exhibits much more scalable behavior than EXT4 (Fig. 13). In the plain-SSD, BarrierFS exhibits 2× the performance of EXT4 for all numbers of cores (Fig. 13(a)). In the supercap-SSD, the performance saturates with six cores in both EXT4 and BarrierFS; BarrierFS exhibits 1.3× the journaling throughput of EXT4 at full throttle (Fig. 13(b)).

In mobile storage, BarrierFS achieves a 75% performance improvement against EXT4 in the default PERSIST journal mode under durability guarantee (Fig. 14). We replace the first three fdatasync()'s with fdatabarrier()'s among the four fdatasync()'s in a transaction, and keep the last fdatasync() for the durability of the transaction. Under ordering guarantee, we replace all four fdatasync()'s with fdatabarrier()'s. When we remove the durability requirement, the performance increases by 2.8× in PERSIST mode against the baseline EXT4. In WAL mode, SQLite issues fdatasync() once in every commit and there is not much room for improvement for BarrierFS. The benefit of eliminating the transfer-and-flush is more significant as the storage has a higher degree of parallelism and slower Flash devices. In the plain-SSD, SQLite exhibits a 73× performance gain in BFS-OD against the baseline EXT4-DR.

We run two server workloads: the varmail workload of FILEBENCH [68] and the OLTP-insert workload of sysbench [34]. Sysbench is a database workload and uses MySQL [46]; varmail is a metadata intensive workload. We also test OptFS [8], using osync() in OptFS. We perform two sets of experiments.
Figure 13: fxmark: scalability of filesystem journaling (ops/sec vs. number of cores, EXT4-DR vs. BFS-DR); (a) plain-SSD, (b) supercap-SSD

Figure 14: SQLite performance: inserts/sec (100,000 inserts), PERSIST and WAL journal modes; (a) UFS (EXT4-DR vs. BFS-DR), (b) plain-SSD (EXT4-OD, OptFS, BFS-OD)

First, we leave the application intact and replace EXT4 with BarrierFS (EXT4-DR and BFS-DR), and compare the fsync() performance between BarrierFS and EXT4. The second set of experiments is for the ordering guarantee: in EXT4, we use the nobarrier mount option; in BarrierFS, we replace fsync() with fbarrier(). Fig. 15 illustrates the result.

In the plain-SSD, BFS-DR brings a 60% performance gain against EXT4-DR in the varmail workload. This is due to the more efficient implementation of fsync() in BarrierFS. The benefit of BarrierFS manifests itself when we relax the durability guarantee. The varmail workload is known for its heavy fsync() traffic. In EXT4-OD, the journal commit operations are interleaved by the DMA transfer latency; in BFS-OD, they are interleaved only by the dispatch latency. Dual Mode Journaling can significantly improve the journaling throughput by increasing the concurrency of journal commits. With ordering guarantee, BarrierFS achieves an 80% performance gain against EXT4 with the no-barrier option. In MySQL, BFS-OD prevails over EXT4-OD by 12%. The performance increases 43× when we replace the fsync() of EXT4 with fbarrier().
Notes on OptFS: In SQLite (Fig. 14(b)), varmail and MySQL (Fig. 15), we observe that OptFS does not show as good a performance on Flash storage as it does on rotating media [8]. OptFS is elaborately designed to reduce the seek overhead inherent in the Ordered mode journaling of EXT4. OptFS achieves this objective via two innovations: flushing a larger number of transactions together and selectively journaling the data blocks.
Figure 15: Performance for the server workloads on plain-SSD and supercap-SSD; Filebench varmail (ops/s) and Sysbench OLTP-insert (Tx/s); EXT4-DR, BFS-DR, OptFS, EXT4-OD, BFS-OD

The benefit of eliminating the seek overhead is marginal for Flash storage. For this reason, in the varmail workload, which rarely entails selective data mode journaling, OptFS and EXT4-OD exhibit similar performance on Flash storage (Fig. 15). The selective data mode journaling increases the number of pages to scan for osync(), only a few of which can be dispatched to the storage. The selective data mode journaling can negatively interfere with osync(), especially when the underlying storage has a short latency. In [8], MySQL performance decreases to one third in OptFS against EXT4-OD, and the selective data mode journaling has been designated as the prime cause. Our MySQL workload creates an even larger amount of selective data journaling, and the performance of OptFS corresponds to one eighth of that of EXT4-OD under the MySQL workload (Fig. 15).
7 Related Work

OptFS [8] is the work closest to ours; it proposes a new journaling primitive, osync(), which returns without persisting the journal transaction and yet guarantees that the write requests associated with journal commits are stored in order. OptFS does not provide a filesystem primitive that corresponds to fdatabarrier() in our Barrier Enabled IO stack, and osync() still relies on Wait-on-Transfer to enforce the storage order. Featherstitch [19] proposes a programming model to specify the set of requests that can be scheduled together, the patchgroup, and the ordering dependency between them, pg_depend(). While xsyncfs [48] successfully mitigates the overhead of fsync(), it maintains complex causal dependencies among the buffered updates. An order preserving block device layer could make the implementation of xsyncfs much simpler. NoFS (no order file system) [9] introduces the "backpointer" to entirely eliminate the transfer-and-flush ordering requirement in the file system. However, it does not support atomic transactions.

A few works propose to use multiple running transactions or multiple committing transactions to circumvent the transfer-and-flush overhead in filesystem journaling [38, 29, 54], to improve journaling performance or to isolate errors. IceFS [38] allocates separate running transactions for each container. SpanFS [29] splits a journal region into multiple partitions and allocates committing transactions for each partition. CCFS [54] allocates separate running transactions for individual threads. These systems, where each journaling session still relies on the transfer-and-flush mechanism to enforce the intra- and inter-transaction storage orders, are complementary to our work.

A number of file systems provide a multi-block atomic write feature [17, 35, 53, 66] to relieve applications from the overhead of logging and journaling. These file systems internally use the transfer-and-flush mechanism to enforce the storage order between the write requests for data blocks and the associated metadata. An order preserving block device can effectively mitigate the overheads incurred when enforcing the storage order in these file systems.
8 Conclusion

In this work, we develop a Barrier Enabled IO stack to address the transfer-and-flush overhead inherent in the legacy IO stack. The Barrier Enabled IO stack effectively eliminates the transfer-and-flush overhead associated with controlling the storage order and succeeds in fully exploiting the underlying Flash storage. We would like to conclude this paper with two important observations. First, the "cache barrier" is a necessity rather than a luxury. It is an essential tool for the host to control the persist order, which has not been possible before. Currently, the cache barrier command is only available in the standard command set for mobile storage. Given its implications for the IO stack, it should be available across the entire range of storage devices, from mobile storage to high performance Flash storage with supercap. Second, eliminating the Wait-on-Transfer overhead is not optional. Wait-on-Transfer blocks the caller and stalls the command queue, leaving the storage device severely underutilized. As the storage latency becomes shorter, the relative cost of Wait-on-Transfer becomes more significant.

Despite all the preceding sophisticated techniques to optimize the legacy IO stack for Flash storage, we carefully argue that the IO stack is still fundamentally driven by the old legacy that the host cannot control the persist order. This work shows how the IO stack can evolve when the persist order can be controlled, and demonstrates the substantial benefit of doing so. We hope that this work serves as a possible basis for the future IO stack in the era of Flash storage.
References

[1] eMMC 5.1 solution in SK hynix.
[2] Toshiba expands line-up of e-MMC version 5.1 compliant embedded NAND flash memory modules. http://toshiba.semicon-storage.com/us/company/taec/news/2015/03/memory-20150323-1.html.
[3] Axboe, J. Linux block IO: present and future. In Proc. of Ottawa Linux Symposium (Ottawa, Ontario, Canada, Jul 2004).
[4] Best, S. JFS Overview. http://jfs.sourceforge.net/project/pub/jfs.pdf, 2000.
[5] Chang, Y.-M., Chang, Y.-H., Kuo, T.-W., Li, Y.-C., and Li, H.-P. Achieving SLC performance with MLC flash memory. In Proc. of DAC 2015 (San Francisco, CA, USA, 2015).
[6] Chen, F., Lee, R., and Zhang, X. Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing. In Proc. of IEEE HPCA 2011 (San Antonio, TX, USA, Feb 2011).
[7] Chidambaram, V. Orderless and Eventually Durable File Systems. PhD thesis, University of Wisconsin–Madison, 2015.
[8] Chidambaram, V., Pillai, T. S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Optimistic crash consistency. In Proc. of ACM SOSP 2013 (Farmington, PA, USA, Nov 2013).
[9] Chidambaram, V., Sharma, T., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Consistency without ordering. In Proc. of USENIX FAST 2012 (San Jose, CA, USA, Feb 2012).
[10] Cho, Y. S., Park, I. H., Yoon, S. Y., Lee, N. H., Joo, S. H., Song, K.-W., Choi, K., Han, J.-M., Kyung, K. H., and Jun, Y.-H. Adaptive multi-pulse program scheme based on tunneling speed classification for next generation multi-bit/cell NAND flash. IEEE Journal of Solid-State Circuits (JSSC) 48, 4 (2013), 948–959.
[11] Cipar, J., Ganger, G., Keeton, K., Morrey III, C. B., Soules, C. A., and Veitch, A. LazyBase: trading freshness for performance in a scalable database. In Proc. of ACM EuroSys 2012 (Bern, Switzerland, Apr 2012).
[12] Cobb, D., and Huffman, A. NVM Express and the PCI Express SSD revolution. In Proc. of Intel Developer Forum (San Francisco, CA, USA, 2012).
[13] Condit, J., Nightingale, E. B., Frost, C., Ipek, E., Lee, B., Burger, D., and Coetzee, D. Better I/O through byte-addressable, persistent memory. In Proc. of ACM SOSP 2009 (Big Sky, MT, USA, Oct 2009).
[14] Corbet, J. Barriers and journaling filesystems. http://lwn.net/Articles/283161/.
[15] Corbet, J. The end of block barriers. https://lwn.net/Articles/400541/, August 2010.
[16] Cui, H., Cipar, J., Ho, Q., Kim, J. K., Lee, S., Kumar, A., Wei, J., Dai, W., Ganger, G. R., Gibbons, P. B., et al. Exploiting bounded staleness to speed up big data analytics. In Proc. of USENIX ATC 2014 (Philadelphia, PA, USA, Jun 2014).
[17] Dabek, F., Kaashoek, M. F., Karger, D., Morris, R., and Stoica, I. Wide-area cooperative storage with CFS. In Proc. of ACM SOSP 2001 (Chateau Lake Louise, Banff, Canada, Oct 2001).
[18] Dees, B. Native command queuing - advanced performance in desktop storage. IEEE Potentials Magazine 24, 4 (2005), 4–7.
[19] Frost, C., Mammarella, M., Kohler, E., de los Reyes, A., Hovsepian, S., Matsuoka, A., and Zhang, L. Generalized file system dependencies. In Proc. of ACM SOSP 2007 (Stevenson, WA, USA, Oct 2007).
[20] Gim, J., and Won, Y. Extract and infer quickly: Obtaining sector geometry of modern hard disk drives. ACM Transactions on Storage (TOS) 6, 2 (2010), 6.
[21] Grupp, L. M., Davis, J. D., and Swanson, S. The bleak future of NAND flash memory. In Proc. of USENIX FAST 2012 (Berkeley, CA, USA, 2012), USENIX Association, pp. 2–2.
[22] Guo, J., Yang, J., Zhang, Y., and Chen, Y. Low cost power failure protection for MLC NAND flash storage systems with PRAM/DRAM hybrid buffer. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013 (2013), IEEE, pp. 859–864.
[23] Hellwig, C. "block: update documentation for REQ_FLUSH / REQ_FUA". linux-2.6/Documentation/block/barrier.txt.
[24] Helm, M., Park, J.-K., Ghalam, A., Guo, J., Wan Ha, C., Hu, C., Kim, H., Kavalipurapu, K., Lee, E., Mohammadzadeh, A., et al. 19.1 A 128Gb MLC NAND-flash device using 16nm planar cell. In Proc. of IEEE ISSCC 2014 Dig. Tech. Papers (San Francisco, CA, USA, Feb 2014).
[25] Heo, T. I/O Barriers. Linux/Documentation/block/barrier.txt, July 2005.
[26] Jeong, S., Lee, K., Lee, S., Son, S., and Won, Y. I/O stack optimization for smartphones. In Proc. of USENIX ATC 2013 (San Jose, CA, USA, Jun 2013).
[27] JESD220C, J. S. Universal Flash Storage (UFS) Version 2.1.
[28] JESD84-B51, J. S. Embedded Multi-Media Card (eMMC) Electrical Standard (5.1).
[29] Kang, J., Zhang, B., Wo, T., Yu, W., Du, L., Ma, S., and Huai, J. SpanFS: A scalable file system on fast storage devices. In Proc. of USENIX ATC 2015 (Santa Clara, CA, USA, Jul 2015).
[30] Kang, W.-H., Lee, S.-W., Moon, B., Oh, G.-H., and Min, C. X-FTL: Transactional FTL for SQLite databases. In Proc. of ACM SIGMOD 2013 (New York, NY, USA, Jun 2013).
[31] Kesavan, R., Singh, R., Grusecki, T., and Patel, Y. Algorithms and data structures for efficient free space reclamation in WAFL. In Proc. of USENIX FAST 2017 (Santa Clara, CA, 2017), USENIX Association, pp. 1–14.
[32] Kim, H.-J., and Kim, J.-S. Tuning the ext4 filesystem performance for Android-based smartphones. In Frontiers in Computer Education. Springer, 2012, pp. 745–752.
[33] Kim, Y. An empirical study of redundant array of independent solid-state drives (RAIS). Springer Cluster Computing 18, 2 (2015), 963–977.
[34] Kopytov, A. SysBench manual. http://imysql.com/wp-content/uploads/2014/10/sysbench-manual.pdf, 2004.
[35] Lee, C., Sim, D., Hwang, J., and Cho, S. F2FS: A new file system for flash storage. In Proc. of USENIX FAST 2015 (Santa Clara, CA, USA, Feb 2015).
[36] Lee, S., Lee, J.-Y., Park, I.-H., Park, J., Yun, S.-W., Kim, M.-S., Lee, J.-H., Kim, M., Lee, K., Kim, T., et al. 7.5 A 128Gb 2b/cell NAND flash memory in 14nm technology with tPROG=640us and 800MB/s I/O rate. In Proc. of IEEE ISSCC 2016 (San Francisco, CA, USA, Feb 2016).
[37] Lee, W., Lee, K., Son, H., Kim, W.-H., Nam, B., and Won, Y. WALDIO: Eliminating the filesystem journaling in resolving the journaling of journal anomaly. In Proc. of USENIX ATC 2015 (Santa Clara, CA, USA, Jul 2015).
[38] Lu, L., Zhang, Y., Do, T., Al-Kiswany, S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Physical disentanglement in a container-based file system. In Proc. of USENIX OSDI 2014 (Broomfield, CO, USA, Oct 2014).
[39] Lu, Y., Shu, J., Guo, J., Li, S., and Mutlu, O. LightTx: A lightweight transactional design in flash-based SSDs to support flexible transactions. In Proc. of IEEE ICCD 2013.
[40] Mathur, A., Cao, M., Bhattacharya, S., Dilger, A., Tomas, A., and Vivier, L. The new ext4 filesystem: current status and future plans. In Proc. of Linux Symposium 2007 (Ottawa, Ontario, Canada, Jun 2007).
[41] McKusick, M. K., Ganger, G. R., et al. Soft Updates: A technique for eliminating most synchronous writes in the fast filesystem. In Proc. of USENIX ATC 1999 (Monterey, CA, USA, Jun 1999).
[42] Mearian, L. Flash memory's density surpasses hard drives for first time. Feb 2016.
[43] Min, C., Kang, W.-H., Kim, T., Lee, S.-W., and Eom, Y. I. Lightweight application-level crash consistency on transactional flash storage. In Proc. of USENIX ATC 2015 (Santa Clara, CA, USA, Jul 2015).
[44] Min, C., Kashyap, S., Maass, S., and Kim, T. Understanding manycore scalability of file systems. In Proc. of USENIX ATC 2016 (Denver, CO, 2016), USENIX Association, pp. 71–85.
[45] Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., and Schwarz, P. ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems (TODS) 17, 1 (1992), 94–162.
[46] MySQL, A. MySQL 5.1 reference manual. Sun Microsystems (2007).
[47] Narayanan, D., Donnelly, A., and Rowstron, A. Write off-loading: Practical power management for enterprise storage. ACM Transactions on Storage (TOS) 4, 3 (2008), 10:1–10:23.
[48] Nightingale, E. B., Veeraraghavan, K., Chen, P. M., and Flinn, J. Rethink the Sync. In Proc. of USENIX OSDI 2006 (Seattle, WA, USA, Nov 2006).
[49] Okun, M., and Barak, A. Atomic writes for data integrity and consistency in shared storage devices for clusters. In Proc. of ICA3PP 2002 (Beijing, China, Oct 2002).
[50] Ou, J., Shu, J., and Lu, Y. A high performance file system for non-volatile main memory. In Proc. of ACM EuroSys 2016 (London, UK, Apr 2016).
[51] Ouyang, X., Nellans, D., Wipfel, R., Flynn, D., and Panda, D. K. Beyond block I/O: Rethinking traditional storage primitives. In Proc. of IEEE HPCA 2011 (San Antonio, TX, USA, Feb 2011).
[52] Palanca, S., Fischer, S. A., Maiyuran, S., and Qawami, S. MFENCE and LFENCE micro-architectural implementation method and system, July 5 2016. US Patent 9,383,998.
[53] Park, S., Kelly, T., and Shen, K. Failure-atomic msync(): A simple and efficient mechanism for preserving the integrity of durable data. In Proc. of ACM EuroSys 2013 (Prague, Czech Republic, Apr 2013).
[54] Pillai, T. S., Alagappan, R., Lu, L., Chidambaram, V., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Application crash consistency and performance with CCFS. In Proc. of USENIX FAST 2017 (Santa Clara, CA, 2017), USENIX Association, pp. 181–196.
[55] Prabhakaran, V., Bairavasundaram, L. N., Agrawal, N., Gunawi, H. S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. IRON file systems. In Proc. of ACM SOSP 2005 (Brighton, UK, Oct 2005).
[56] Prabhakaran, V., Rodeheffer, T. L., and Zhou, L. Transactional flash. In Proc. of USENIX OSDI 2008, vol. 8.
[57] Rev, H. SCSI Commands Reference Manual. Jul 2014. Seagate.
[58] Rodeh, O., Bacik, J., and Mason, C. BTRFS: The Linux B-tree filesystem. ACM Transactions on Storage (TOS) 9, 3 (2013), 9.
[59] Rosenblum, M., and Ousterhout, J. K. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems (TOCS) 10, 1 (Feb. 1992), 26–52.
[60] Sehgal, P., Tarasov, V., and Zadok, E. Evaluating performance and energy in file system server workloads. In Proc. of USENIX FAST 2010 (San Jose, CA, USA, Feb 2010).
[61] Seltzer, M. I., Ganger, G. R., McKusick, M. K., Smith, K. A., Soules, C. A., and Stein, C. A. Journaling versus soft updates: Asynchronous meta-data protection in file systems. In Proc. of USENIX ATC 2000 (San Diego, CA, USA, Jun 2000).
[62] Shilamkar, G. Journal checksums. http://wiki.old.lustre.org/images/4/44/Journal-checksums.pdf, May 2007.
[63] Steigerwald, M. Imposing order: Working with write barriers and journaling filesystems. Linux Magazine 78 (2007), 60–64.
[64] Sweeney, A., Doucette, D., Hu, W., Anderson, C., Nishimoto, M., and Peck, G. Scalability in the XFS file system. In Proc. of USENIX ATC (1996), vol. 15.
[65] Tweedie, S. C. Journaling the Linux ext2fs filesystem. In Proc. of The Fourth Annual Linux Expo (Durham, NC, USA, May 1998).
[66] Verma, R., Mendez, A. A., Park, S., Mannarswamy, S., Kelly, T., and Morrey, C. Failure-atomic updates of application data in a Linux file system. In Proc. of USENIX FAST 2015 (Santa Clara, CA, USA, Feb 2015).
[67] Weiss, Z., Subramanian, S., Sundararaman, S., Talagala, N., Arpaci-Dusseau, A., and Arpaci-Dusseau, R. ANViL: Advanced virtualization for modern non-volatile memory devices. In Proc. of USENIX FAST 2015 (Santa Clara, CA, USA, Feb 2015).
[68] Wilson, A. The new and improved FileBench. In Proc. of USENIX FAST 2008 (San Jose, CA, USA, Feb 2008).
[69] Xu, Q., Siyamwala, H., Ghosh, M., Suri, T., Awasthi, M., Guz, Z., Shayesteh, A., and Balakrishnan, V. Performance analysis of NVMe SSDs and their implication on real world databases. In Proc. of ACM SYSTOR 2015 (Haifa, Israel, May 2015).
[70] Park, S. Y., Seo, E., Shin, J. Y., Maeng, S., and Lee, J. Exploiting internal parallelism of flash-based SSDs. IEEE Computer Architecture Letters (CAL) 9, 1 (2010), 9–12.
[71] Zhang, C., Wang, Y., Wang, T., Chen, R., Liu, D., and Shao, Z. Deterministic crash recovery for NAND flash based storage systems. In Proc. of ACM/EDAC/IEEE DAC 2014 (San Francisco, CA, USA, Jun 2014).