FastDrain: Removing Page Victimization Overheads in NVMe Storage Stack
Jie Zhang, Miryeong Kwon, Sanghyun Han, Nam Sung Kim, Mahmut Kandemir and Myoungsoo Jung
KAIST, Samsung, PennState
Abstract—Host-side page victimizations can easily overflow the SSD internal buffer, which interferes with the I/O services of diverse user applications, thereby degrading user-level experiences. To address this, we propose FastDrain, a co-design of OS kernel and flash firmware to avoid the buffer overflow caused by page victimizations. Specifically, FastDrain can detect a triggering point where a near-future page victimization introduces an overflow of the SSD internal buffer. Our new flash firmware then speculatively scrubs the buffer space to accommodate the requests caused by the page victimization. In parallel, our new OS kernel design controls the traffic of page victimizations by considering the target device buffer status, which can further reduce the risk of buffer overflow. To secure more buffer space, we also design a latency-aware FTL, which dumps dirty data only to the fast flash pages. Our evaluation results reveal that FastDrain reduces the 99th response time of user applications by 84%, compared to a conventional system.

Index Terms—SSD, flash translation layer, operating system, page cache, page victimization.

1 INTRODUCTION
In the past decade, SSDs have successfully replaced spinning disks and become the dominant storage media in diverse computing domains, thanks to their performance superiority. However, SSDs' access latency is still two orders of magnitude longer than that of the main memory [16]. Since such long latency can significantly degrade the performance of user applications, the host storage stack practically employs a large kernel memory buffer upon the target SSD, called the Linux page cache.
Even though the page cache, which buffers file data, can effectively hide the performance penalties of accessing the underlying SSDs, the OS kernel often needs to clean the cache by flushing dirty pages to the SSD owing to file synchronization and memory resource depletion. We observe that this cleaning task, called page victimization, can severely interfere with many legacy I/O requests issued by other user applications. In practice, modern SSDs employ a built-in DRAM to buffer multiple incoming write requests in parallel, referred to as the internal buffer [8]. However, the large number of dirty page writes generated by a page victimization can overflow the internal buffer, thereby significantly degrading user-level experiences. Specifically, scrubbing the SSD internal buffer by writing dirty data to the backend flash can block the requests of the user applications from immediate I/O services. This in turn compels the applications to violate a given service level agreement.
To quantitatively analyze how page victimization can affect user-level experiences, we perform a long-tail latency analysis by running diverse latency-critical applications [2], [11], [12] together with a throughput-oriented application [5] (cf. Section 4 for experiment details). Figure 1a shows the results, comparing the average latency with the 99th response time of the corresponding latency-critical applications. Overall, the 99th response time of all latency-critical applications is 9× longer than their average response time. To understand the root cause of the long 99th response time, we also study the execution behavior of a representative application, Apache-U. The results are shown in Figure 1b. In addition to a time series analysis of Apache-U's response time, this figure includes the number of dirty pages flushed by the page cache. Apache-U's response time drastically increases when the OS kernel starts the process of heavy page victimizations. This is because the OS kernel flushes the dirty pages in the page cache without understanding the status of the underlying SSD, which results in an overflow of the internal buffer. Specifically, this buffer overflow forces the flash firmware to write a bulk of dirty data from the built-in DRAM to the flash, which makes the SSD backend too busy to serve other legacy I/O requests (coming from Apache-U). The I/O requests are suspended until the SSD backend is available. Consequently, the following data processing is postponed due to this I/O service suspension, which in turn degrades the user experience of Apache-U.
Fig. 1: (a) Performance comparison between average and 99th response time, and (b) Apache-U response time analysis.
In this work, we propose FastDrain, which coordinates both the OS kernel and the flash firmware to address the long-tail latency issue caused by page victimizations. Specifically, the OS kernel signals the flash firmware about an incoming page victimization event via a PCIe message. Our new flash firmware design leverages the message to speculatively allocate free space in the built-in DRAM, which can immediately absorb the incoming heavy loads (imposed by the page victimization). In parallel, the flash firmware notifies the OS kernel when the page victimization will cause a buffer overflow in the near future. To this end, we modify the OS kernel to adaptively control the traffic of page victimizations by using an upcall message. Note that the internal buffer has a relatively small DRAM size, which cannot accommodate all the requests introduced by the host-side page victimization. Thus, we also design a latency-aware FTL (flash translation layer) at the device level. This FTL enables the internal buffer to secure as much DRAM space as possible by quickly dumping the dirty data to only fast LSB flash pages, which reaps the benefit of reducing the interference with applications' requests. Our evaluation results show that FastDrain reduces the 99th response time by 84% in comparison to a conventional system.

2 BACKGROUND
Figure 2a illustrates the storage stack from a user process down to the low-level flash media. When the user process issues I/O requests, the requests are forwarded to the Linux page cache and file system modules. If the target data does not exist in the page cache, an I/O scheduler (e.g., Kyber [3]) in the Linux multi-queue block layer reorders the read and write requests by prioritizing the read requests. The requests are then passed through the NVMe driver and finally served by the SSD.
Fig. 2: (a) I/O storage stack, and (b) 99th response time of Vanilla and Oracle systems (normalized to Oracle).
Page cache.
The host employs the page cache to cache frequently accessed data chunks of files, which speeds up file accesses. However, the OS kernel needs to frequently flush dirty pages from the page cache to the SSD to satisfy the data durability guarantee and/or to secure enough DRAM space for incoming I/O requests. The flush activities of page victimization can in practice be categorized as background and foreground tasks. Specifically, the flush task is executed alongside the user applications as a background activity when the number of dirty pages is greater than a low threshold (denoted by dirty_background_ratio) or when the dirty pages have been dirty for longer than a user-configured timer. Otherwise, the OS can suspend user processes and prioritize the page victimization as a foreground task when the fsync() system call is invoked or when the number of dirty pages is greater than a high threshold (denoted by dirty_ratio).
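To make the two flush paths concrete, the following C sketch mirrors the threshold logic described above. dirty_background_ratio and dirty_ratio correspond to the real page-cache knobs, but the example percentages, the helper parameters, and pick_flush_mode itself are illustrative placeholders rather than actual kernel code.

```c
/* Illustrative sketch of the flush-mode decision described above.
 * dirty_background_ratio / dirty_ratio mirror the kernel's
 * vm.dirty_background_ratio and vm.dirty_ratio knobs; the example values
 * and all other names are placeholders, not real kernel symbols. */
static unsigned int dirty_background_ratio = 5;   /* low threshold (%) */
static unsigned int dirty_ratio            = 10;  /* high threshold (%) */

enum flush_mode { FLUSH_NONE, FLUSH_BACKGROUND, FLUSH_FOREGROUND };

static enum flush_mode pick_flush_mode(unsigned long dirty_pages,
                                       unsigned long total_pages,
                                       unsigned long oldest_dirty_age_ms,
                                       unsigned long expire_ms,
                                       int fsync_requested)
{
    unsigned long low  = total_pages * dirty_background_ratio / 100;
    unsigned long high = total_pages * dirty_ratio / 100;

    /* Foreground flush: user processes are suspended until the flush finishes. */
    if (fsync_requested || dirty_pages > high)
        return FLUSH_FOREGROUND;

    /* Background flush: flusher threads run alongside the applications. */
    if (dirty_pages > low || oldest_dirty_age_ms > expire_ms)
        return FLUSH_BACKGROUND;

    return FLUSH_NONE;
}
```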
SSD internal.
Modern SSDs typically consist of an SSD controller (running the flash firmware), built-in DRAM modules, and a large number of flash packages. To increase the storage throughput, SSDs employ multiple channels, each connecting to a number of flash packages over a flash system bus (i.e., ONFI [1]). Each flash package in turn contains multiple flash dies. Flash dies across all flash packages can simultaneously serve different memory requests, which exposes a high degree of SSD internal parallelism. However, writing data to flash takes much longer than reading it. Typically, the write latency varies from 560 us to 5 ms depending on which type of flash page is accessed. Table 1 shows the write variation observed across the different types (LSB/CSB/MSB) of flash pages, extracted from a real 25 nm flash sample [14]. Due to the long write latencies, writing the flushed dirty pages of a page victimization to flash can, unfortunately, make the target SSD backend too busy to serve other legacy requests issued by user applications. To address this, most SSDs employ the built-in DRAM modules as an internal buffer and accommodate incoming write requests before writing them to the backend flash media. Since the SSD built-in DRAM buffers the flushed dirty pages of a page victimization, the SSD backend remains available to serve the applications' requests. However, the internal buffer usually has limited space, which is unable to hold the bulk of dirty pages produced by a heavy page victimization.

3 ELIMINATE PAGE VICTIMIZATION OVERHEAD
Co-design of page cache and SSD internal buffer.
Fig. 3: (a) The impact of flash write latencies on the 99th percentile response time (normalized to LSB-only), and (b) flash page usage analysis of the prior work.
To prevent the page victimization from interfering with the I/O services of the user applications, one solution is to coordinate the OS kernel and the flash firmware so that the small SSD internal buffer does not overflow. However, it is challenging to enable such coordination due to the lack of communication between them: the existing flash firmware is unaware of the host-side page victimization, while the OS kernel is completely disconnected from the device-level buffer status. To address this, we enable message passing across the storage stack. Specifically, we revise the Linux I/O service routine that delivers I/O requests from a user to a physical device by piggybacking useful information onto the requests. We also modify the NVMe driver and controller, which enables all software/hardware modules in the OS and SSD to communicate in a full-duplex manner. When an incoming page victimization event is detected in the page cache, a notification is passed down to the flash firmware. This information allows the flash firmware to prepare its internal buffer to accommodate the flushed dirty pages of the page victimization, which mitigates the interference with the I/O accesses of the user applications. In addition, the flash firmware reports the status of its internal buffer to the OS kernel when the buffer is nearly full. Using this upcall report, the OS kernel can precisely control the traffic of page victimization based on the device-level buffer status.
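The coordination described above amounts to two small messages: a downcall announcing an imminent victimization and an upcall reporting the buffer occupancy. The C sketch below shows one plausible layout for these payloads; the struct and field names are our own illustration, not definitions from the paper or from any NVMe driver.

```c
/* Hypothetical payloads for the host<->SSD coordination described above.
 * The paper specifies what information flows in each direction, not the
 * exact encoding; all names here are illustrative. */
#include <stdint.h>

/* Host -> SSD: piggybacked on an NVMe command when the page cache detects
 * an imminent page victimization. */
struct victimization_notice {
    uint32_t dirty_pages;   /* number of dirty pages about to be flushed */
    uint8_t  foreground;    /* 1 if fsync()/dirty_ratio forced a foreground flush */
};

/* SSD -> host: piggybacked on an NVMe completion when the internal buffer
 * crosses a watermark. */
struct buffer_status_report {
    uint8_t above_high_watermark;  /* 1: throttle page victimization */
    uint8_t below_low_watermark;   /* 1: resume normal flushing */
};
```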
Page victimization aware FTL.
While the co-design of the OS kernel and flash firmware can improve the efficiency of the internal buffer, some write requests should be served by the flash media as soon as possible to alleviate I/O interference and congestion. For example, data evicted from the SSD internal buffer needs to be quickly drained if the buffer is almost full and a page victimization activity (coming from the page cache) is detected. To appropriately handle such cases, we design a new FTL, which serves the urgent requests of the foreground data eviction with fast LSB pages residing across the flash blocks of all existing flash dies. This design not only shortens the write latency but also improves the internal parallelism. We also accommodate background (non-urgent) data eviction in slow CSB/MSB flash pages to balance the writes across the different types of flash pages, thereby utilizing as many flash pages as possible in the SSD.
SSD internal buffer management for page victimization.
We measure the impact of the SSD internal buffer on user-level experiences by comparing the performance of two SSDs: a traditional SSD (Vanilla) and an oracle SSD with an infinite internal buffer (Oracle). The evaluation results for various applications are shown in Figure 2b. By completely buffering dirty data, Oracle can prevent the page victimization from blocking applications' I/O services, thereby reducing the 99th response time of the applications by 91%, on average. Motivated by this, we design foreground and background buffer eviction strategies in the flash firmware to improve the efficiency of the internal buffer: the foreground eviction strategy cleans up buffer space to accommodate the page victimization as much as possible, while the background eviction strategy smoothly evicts buffered dirty pages to the underlying storage with a minimal penalty of blocking applications' I/O services. We introduce a high threshold and a low threshold as watermarks of the buffer usage and switch between the eviction strategies based on the runtime buffer usage. Specifically, our new flash firmware estimates the future buffer usage by accounting for the number of dirty pages in the incoming page victimization. If the internal buffer has sufficient space (the future buffer usage is lower than the high threshold), the background eviction strategy is triggered. The flash firmware then estimates the average number of flash channels/dies that are intensively accessed by the applications' I/O requests. For each internal buffer eviction, we evict a limited set of pages to the flash dies that are idle. On the other hand, if the internal buffer is insufficient to serve the incoming I/O requests related to the page victimization (the future buffer usage is higher than the high threshold), our foreground eviction strategy actively evicts dirty pages to all flash dies until the SSD internal buffer usage falls below the low threshold.
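A compact sketch of the watermark policy above, as we read it: the firmware estimates the post-victimization occupancy and picks background or foreground eviction accordingly. The watermark percentages, batch sizes, and evict_* routines are assumptions made for illustration.

```c
/* Firmware-side sketch of the watermark-driven eviction policy described
 * above. All identifiers and the watermark percentages are illustrative;
 * the eviction routines are assumed to exist elsewhere in the firmware. */
#define BUF_PAGES       131072u                   /* 512MB buffer, 4KB entries */
#define HIGH_WATERMARK  (BUF_PAGES * 80u / 100u)  /* assumed 80% */
#define LOW_WATERMARK   (BUF_PAGES * 40u / 100u)  /* assumed 40% */

extern void     evict_batch_to_idle_dies(unsigned max_pages);
extern unsigned evict_batch_to_all_dies(unsigned max_pages);

void on_victimization_notice(unsigned buffer_used, unsigned incoming_dirty)
{
    /* Estimate occupancy once the announced dirty pages land in the buffer. */
    unsigned future_used = buffer_used + incoming_dirty;

    if (future_used <= HIGH_WATERMARK) {
        /* Background eviction: drain a small batch to idle dies so the
         * channels/dies the applications are hitting stay responsive. */
        evict_batch_to_idle_dies(256);
    } else {
        /* Foreground eviction: scrub aggressively across all dies until the
         * occupancy falls back under the low watermark. */
        while (buffer_used > LOW_WATERMARK)
            buffer_used -= evict_batch_to_all_dies(1024);
    }
}
```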
Latency-aware FTL design.
The foreground eviction can severely block applications' I/O requests, as it actively dumps a large number of dirty pages to flash. To mitigate the disturbance from the foreground eviction, one potential solution is to selectively program the foreground eviction into fast LSB pages and write the background eviction sequentially across LSB/CSB/MSB pages. Figure 3a shows the performance benefits brought by this solution when executing latency-critical applications. As shown in the figure, programming the foreground eviction only into fast flash pages (LSB-only) can, on average, reduce the 99th response time by 65% across all workloads, compared to programming the foreground eviction sequentially into all types of flash pages (LSB/CSB/MSB). To make the FTL aware of the fast pages, prior work [7] maintains a "write point" to find the target LSB flash page. For example, if the write point marks the next available page as an MSB page in the current flash block, the FTL can skip the MSB page and place the dirty data in the next LSB page. However, as the foreground eviction simultaneously writes a large number of dirty pages to flash, prior work has to skip lots of CSB/MSB pages, which unfortunately wastes flash space. Our evaluation results show that 35% of flash pages are underutilized in the evaluated workloads due to the foreground eviction (cf. Figure 3b).
To reduce the write latency and improve flash utilization, we develop a new latency-aware FTL design. Our design groups the LSB pages across all available flash blocks to accommodate the dirty data of the foreground eviction without skipping CSB/MSB pages. In addition, we interleave the foreground and background evictions to access LSB and CSB/MSB pages, respectively, to balance the flash page usage. To manage the available resources in the flash blocks, we allocate a block pointer for each flash block, which points to the next free page for program operations. All block pointers are categorized into three groups based on the types of their free pages: LSB, CSB, and MSB page groups. Within each page group, the block pointers belonging to the same flash die are organized as a list, and all lists are indexed by the flash die ID. To accommodate dirty pages from a foreground eviction, our proposed FTL selects flash blocks from different lists in the LSB group for free LSB pages. Therefore, the dirty data are interleaved into the LSB pages across different flash dies for better parallelism. Similarly, our FTL prefers choosing flash blocks from the CSB and MSB page groups to store the dirty data from the background eviction. If CSB and MSB pages are unavailable, flash blocks in the LSB page group will be selected for background eviction. Note that the unused flash space (i.e., CSB/MSB pages) caused by the foreground eviction can be reclaimed by garbage collection, so it does not hurt the total SSD capacity. We also constrain the maximum LSB-only flash space (cf. Table 1) to keep the effective SSD capacity high.
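The bookkeeping this implies can be sketched as below: one pointer per flash block, bucketed by the type of its next free page and by die, with foreground evictions striped across the LSB lists and background evictions preferring CSB/MSB pages. All identifiers are our illustration, not firmware code from the paper.

```c
/* Sketch of the latency-aware FTL's block-pointer bookkeeping described
 * above. Identifiers and the die count are illustrative only. */
#include <stddef.h>

#define NUM_DIES 32          /* assumed die count for the example */
enum page_type { PT_LSB = 0, PT_CSB, PT_MSB, PT_COUNT };

struct block_ptr {
    unsigned block_id;
    unsigned next_free_page;   /* next programmable page in this block */
    struct block_ptr *next;    /* other blocks on the same die */
};

/* group[type][die] -> list of blocks whose next free page has that type */
static struct block_ptr *group[PT_COUNT][NUM_DIES];

/* Foreground eviction: take free LSB pages round-robin across dies so the
 * urgent dirty pages are striped for parallelism. */
struct block_ptr *alloc_foreground_page(void)
{
    static unsigned rr_die;
    for (unsigned i = 0; i < NUM_DIES; i++) {
        unsigned die = (rr_die + i) % NUM_DIES;
        if (group[PT_LSB][die]) {
            rr_die = die + 1;
            return group[PT_LSB][die];
        }
    }
    return NULL;               /* no LSB page left anywhere */
}

/* Background eviction: prefer CSB/MSB pages, fall back to LSB pages. */
struct block_ptr *alloc_background_page(unsigned die)
{
    if (group[PT_CSB][die]) return group[PT_CSB][die];
    if (group[PT_MSB][die]) return group[PT_MSB][die];
    return group[PT_LSB][die];
}
```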
Adaptive page victimization scheduling.
It is undesirable to face a situation in which the SSD internal buffer is close to full while the OS kernel is still flushing dirty pages from the page cache. Although our FTL design can accelerate the draining of the SSD internal buffer, intensive foreground eviction can overuse LSB flash pages, which wastes storage space. To address this, we propose to adjust the traffic of page victimization in the OS kernel based on the status of the SSD internal buffer. Specifically, we configure two sets of dirty_ratio and dirty_background_ratio values in the page cache: a high set (10% and 5% in this paper) and a low set (5% and 3%). By default, the low set of dirty-ratio values is applied in the page cache to reduce the occupancy of memory space. If the flash firmware reports that its internal buffer is occupied beyond the high threshold, the OS kernel stops flushing dirty pages by temporarily switching to the high set of dirty-ratio values. Once the SSD reports that its internal buffer utilization has fallen back to the low threshold, the OS kernel adopts the low set of dirty ratios again and flushes the dirty pages to the SSD.

TABLE 1: System configuration parameters.
Host        | Values     | Storage (SSD)              | Values
Processor   | ARM v8     | Read (LSB/CSB/MSB)         |
L1I/L1D     |            | Write (LSB/CSB/MSB)        | 560 us to 5 ms
L2 cache    |            | Erase / Channel / Package  |
Mem ctrler  |            | Die / Plane / Page size    |
Memory      | DDR3, 4GB  | Capacity / Internal buffer | 512MB (internal buffer)
            |            | High / Low thresholds      |
            |            | Max LSB-only region        |
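A minimal sketch of the dirty-ratio switching described in the scheduling paragraph above; the two threshold sets are the ones quoted in the text, while apply_dirty_thresholds() is a hypothetical stand-in for updating the page cache's global variables.

```c
/* Kernel-side sketch of adaptive page victimization scheduling. The two
 * threshold sets (10%/5% and 5%/3%) come from the text; the helper is an
 * illustrative placeholder for updating dirty_ratio / dirty_background_ratio. */
struct dirty_thresholds {
    unsigned dirty_ratio;             /* foreground flush threshold (%) */
    unsigned dirty_background_ratio;  /* background flush threshold (%) */
};

static const struct dirty_thresholds HIGH_SET = { 10, 5 }; /* throttle flushing */
static const struct dirty_thresholds LOW_SET  = {  5, 3 }; /* default: flush eagerly */

extern void apply_dirty_thresholds(const struct dirty_thresholds *t);

/* Called when the SSD's upcall reports its internal buffer occupancy. */
void on_buffer_status_upcall(int above_high_watermark, int below_low_watermark)
{
    if (above_high_watermark)
        apply_dirty_thresholds(&HIGH_SET);  /* hold dirty pages in the page cache */
    else if (below_low_watermark)
        apply_dirty_thresholds(&LOW_SET);   /* resume flushing to the SSD */
}
```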
Page victimization detection.
It is non-trivial for the flash firmware to speculate about page victimization events solely based on the patterns of incoming I/O requests. Since the OS kernel manages the page cache, we propose to extract the page victimization information by monitoring every flush event at runtime and checking the number of dirty pages to be flushed. Specifically, to manage page victimization, the OS kernel maintains a pool of worker threads, called bdi_writeback_threads, each of which can be invoked to process the flush tasks. By monitoring the actions of bdi_writeback_threads, we can detect each page victimization event. For every page victimization, we check the number of pages to be flushed by leveraging the bdi_writeback data structure, which maintains a list of dirty inodes for flushing. By walking through the list of dirty inodes, we can obtain the number of dirty pages.
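The accounting step might look roughly like the sketch below. It follows the bdi_writeback naming only loosely, and inode_dirty_page_count() is a hypothetical helper, since the kernel does not expose a per-inode dirty-page counter under that name.

```c
/* Sketch: estimate how many dirty pages an imminent flush will write by
 * walking the writeback structure's dirty-inode list. The list/field names
 * loosely follow the kernel's bdi_writeback conventions; the per-inode
 * counting helper is hypothetical. */
struct inode;                          /* opaque here */
struct dirty_inode_node {
    struct inode *inode;
    struct dirty_inode_node *next;
};
struct bdi_writeback_view {
    struct dirty_inode_node *b_dirty;  /* inodes queued for writeback */
};

extern unsigned long inode_dirty_page_count(struct inode *inode); /* assumed */

unsigned long estimate_victimization_pages(struct bdi_writeback_view *wb)
{
    unsigned long pages = 0;
    for (struct dirty_inode_node *n = wb->b_dirty; n; n = n->next)
        pages += inode_dirty_page_count(n->inode);
    return pages;   /* tagged onto the next NVMe command as the notice payload */
}
```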
Bidirectional datapath for communication.
To leverage the extracted page victimization information and the device buffer status, one remaining issue is the lack of a datapath between the page cache and the SSD for communication. We set up the datapath by adopting the tagging methods used in storage systems [10], [15]. Specifically, we tag the page victimization information and the SSD internal buffer status onto an NVMe command and an NVMe completion message, respectively. Figure 4 shows the details of our proposed datapath. The page victimization information passes through the file system, block I/O layer, NVMe driver, and NVMe controller. Once the storage-side NVMe controller receives the NVMe commands, it extracts the page victimization information and informs the flash firmware. Similarly, the SSD internal buffer status can be embedded in the NVMe completion message by overriding a field called sq_head. Figure 4 also shows the datapath for transferring the SSD internal buffer information (i.e., full) from the SSD to the host. When the NVMe driver extracts the SSD internal buffer information in the nvme_process_cq function, the OS kernel modifies the global flush threshold variables dirty_ratio and dirty_background_ratio.
Fig. 4: Page cache and SSD communication.
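On the driver side, decoding the upcall could look like the sketch below. sq_head is a real field of the NVMe completion entry, but the simplified struct, the bit encoding, and the handler are assumptions; the handler simply feeds the on_buffer_status_upcall() routine from the earlier scheduling sketch.

```c
/* Sketch of decoding the buffer-status upcall from an NVMe completion.
 * The struct is a simplified view of the relevant completion-entry fields;
 * the BUFSTAT_* bit encoding inside sq_head is assumed for illustration. */
#include <stdint.h>

struct nvme_cqe_view {
    uint32_t result;
    uint16_t sq_head;      /* repurposed to carry the SSD internal buffer status */
    uint16_t sq_id;
    uint16_t command_id;
    uint16_t status;
};

#define BUFSTAT_ABOVE_HIGH  0x8000u   /* assumed flag bits */
#define BUFSTAT_BELOW_LOW   0x4000u

extern void on_buffer_status_upcall(int above_high, int below_low); /* earlier sketch */

/* Conceptually invoked from the completion path (nvme_process_cq). */
void handle_buffer_status_tag(const struct nvme_cqe_view *cqe)
{
    on_buffer_status_upcall(!!(cqe->sq_head & BUFSTAT_ABOVE_HIGH),
                            !!(cqe->sq_head & BUFSTAT_BELOW_LOW));
}
```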
4 EVALUATION
Experiment environment configuration.
We leverage Amber [6], a full-system simulator, to perform precise simulation of the SSD hardware and the existing software from the OS down to the SSD firmware. In our experiments, the storage is an NVMe SSD instance with TLC flash [14]. We also employ 512MB of DDR3 DRAM as the SSD internal buffer, which is similar to the Intel 750 NVMe SSD [4]. Our simulation details are shown in Table 1.
Fig. 5: Response time analysis of multiple co-located workload execution. The results are normalized to Vanilla.
Fig. 6: (a) Response time analysis when co-running workloads Apache-U and Ungzip, and (b) flash page wastage in FD.
System configurations.
We implement three different computer systems by adding the proposed optimization techniques, and compare them against a vanilla system:
1) Vanilla: a vanilla system running with an NVMe SSD;
2) FD-Buf: integrating our SSD internal buffer management design into Vanilla, which can adaptively perform SSD buffer eviction based on page victimization events;
3) FD-FTL: introducing our latency-aware FTL into FD-Buf, which selectively programs the foreground eviction into LSB pages;
4) FD: integrating our adaptive page victimization scheduling scheme into FD-FTL, such that the Linux page cache can suspend evicting dirty pages when the SSD internal buffer is full.
Workloads.
We evaluate six representative latency-critical workloads: Apache-F, Apache-U, DB-S, DB-U, ImgServer-F, and ImgServer-U, which are obtained from [2], [11], [12]. For the DB-S and DB-U workloads, we measure the latency to select (S) and update (U) key-value pairs in a target database, respectively. For the other workloads, we measure the response time to fetch (F) or upload (U) data from/to a server. We examine both web (Apache) and image (ImgServer) services. We also select a write-intensive, throughput-oriented workload, Ungzip, which is used by a GNU application [5]. In our evaluations, we co-run latency-critical and write-intensive workloads for 30 seconds to simulate a real-world server environment.
Figure 5 plots the 99th percentile response time of the latency-critical applications when co-running the latency-critical applications and the write-intensive workload under four different system configurations. Overall, FD-Buf, FD-FTL, and FD achieve 34%, 76%, and 84% shorter 99th percentile response latency, respectively, in comparison to Vanilla. FD-Buf identifies page victimization via our holistic NVMe storage stack optimization and cleans up the SSD internal buffer in advance to accommodate the I/Os of the page victimization. As a result, compared to Vanilla, FD-Buf can isolate the page victimization from read I/Os, which in turn reduces the delay that user applications experience while waiting for an I/O response. Nonetheless, the limited SSD internal buffer can still overflow when the Linux page cache intensively evicts dirty pages. Compared to FD-Buf, FD-FTL can allocate low-latency LSB pages of the NAND flash as a write buffer to accommodate a large number of dirty pages. Therefore, FD-FTL can reduce the time during which the page victimization blocks other I/O requests. FD further reduces the 99th percentile response time of the evaluated latency-critical workloads by 37%, on average, compared to FD-FTL. This is because FD adaptively suspends the Linux page cache from evicting dirty pages when the SSD write buffer is full.
Response time analysis.
Figure 6a shows the response time of Apache-U when co-running the Apache-U and Ungzip workloads. Specifically, Vanilla achieves an average response time under 5 ms when there is no page victimization. Once the write I/Os of a page victimization arrive at the storage, the response time increases up to 35 ms. As FD-Buf can identify the page victimization, it cleans up the SSD internal buffer in advance for the incoming write I/Os. While the SSD internal buffer is sufficient to accommodate all write I/Os without accessing the underlying flash media at the start of the page victimization, this promising performance of FD-Buf unfortunately is not sustainable after serving a large number of write I/Os. This is because FD-Buf needs to evict a large number of dirty pages to the flash media when the SSD internal buffer is full. As FD-FTL can mitigate the flash write latency by placing the evicted data in flash LSB pages, it reduces the worst-case response time by 3× compared to FD-Buf. FD further reduces the long-tail response time by throttling the write I/Os during page victimization events. Note that FD is also beneficial in reducing the response time when there is no page victimization event. This is because our FTL design monitors the SSD internal traffic and schedules the background data eviction of the SSD internal buffer only when the flash resources are available.
Overhead analysis.
Figure 6b compares the number of skipped CSB/MSB pages (Wasted) with the number of programmed flash pages (Used). As shown in the figure, our latency-aware FTL design can, on average, utilize 90% of the flash pages in the evaluated workloads, which is much higher than the prior work (cf. Figure 3b). This is because our proposed FTL design skips programming data into CSB/MSB pages only during the foreground eviction of the SSD internal buffer. In addition, our FTL design spreads the foreground eviction to LSB pages across flash blocks to minimize the wastage of CSB/MSB pages.
5 RELATED WORK
Several studies [9], [13] have tried to holistically optimize the OS and storage stack to fully reap the benefits of SSD-enabled systems. These storage-stack optimizations mainly detect dependencies among I/O requests that are performed in the background and address their interference. Even though the overall system performance can be improved with such optimizations, they are unsuitable for scenarios that do not exhibit any I/O dependency. In contrast to these prior works, we first reveal that the OS page cache has a major influence on performance and blocks other I/O requests, coming from different user applications, that have no dependency on the incoming requests. FastDrain holistically optimizes the software and hardware modules from the OS down to the underlying SSD, which eliminates this critical performance bottleneck of the OS page cache in the existing storage stack.
6 CONCLUSION
We observe that page victimization can degrade user experiences. To address this, we propose a co-design of the OS kernel and flash firmware, which mitigates the overhead of page victimization. Our evaluation results demonstrate that the proposed approach outperforms the conventional system by 84% in terms of the 99th percentile response time.

ACKNOWLEDGEMENTS
We thank the anonymous reviewers for their constructive feedback. This research is mainly supported by NRF 2016R1C1B2015312, DOE DE-AC02-05CH11231, KAIST Start-Up Grant (G01190015), and a MemRay grant (G01190170). N. S. Kim is supported in part by grants from NSF CNS-1705047. M. Kandemir is supported in part by NSF grants 1822923, 1439021, 1629915, 1626251, 1629129, 1763681, 1526750, and 1439057. Myoungsoo Jung is the corresponding author.
REFERENCES
[1] Open NAND Flash Interface Specification Revision 3.0. ONFI Workgroup, March 2011.
[2] Apache HTTP server benchmark tool. httpd.apache.org, 2017.
[3] Two new block I/O schedulers for 4.12. URL: https://lwn.net/Articles/720675/, 2017.
[4] AnandTech. Intel SSD 750 review. URL: http://tiny.cc/197bnz, 2015.
[5] J.-l. Gailly. GNU gzip, 2010.
[6] D. Gouk et al. Amber*: Enabling precise full-system simulation with detailed modeling of all SSD resources. In MICRO, 2018.
[7] L. M. Grupp et al. Characterizing flash memory: anomalies, observations, and applications. In MICRO, 2009.
[8] M. Jung et al. Revisiting widely held SSD expectations and rethinking system-level implications. In SIGMETRICS, 2013.
[9] S. Kim et al. Enlightening the I/O path: A holistic approach for application performance. In FAST, 2017.
[10] M. Mesnier et al. Differentiated storage services. In SOSP, 2011.
[11] D. Mosberger et al. httperf: a tool for measuring web server performance. SIGMETRICS, 1998.
[12] A. MySQL. MySQL reference manual, 2001.
[13] S. Yang et al. Split-level I/O scheduling. In SOSP, 2015.
[14] J. Zhang et al. OpenNVM: An open-sourced FPGA-based NVM controller for low level memory characterization. In ICCD, 2015.
[15] J. Zhang et al. FlashShare: Punching through server storage stack from kernel to firmware for ultra-low latency SSDs. In OSDI, 2018.
[16] J. Zhang et al. FlashGPU: Placing new flash next to GPU cores. In