FastDrain: Removing Page Victimization Overheads in NVMe Storage Stack
Jie Zhang, Miryeong Kwon, Sanghyun Han, Nam Sung Kim, Mahmut Kandemir and Myoungsoo Jung
KAIST, Samsung, PennState
Abstract—Host-side page victimizations can easily overflow the SSD internal buffer, which interferes with the I/O services of diverse user applications, thereby degrading user-level experiences. To address this, we propose FastDrain, a co-design of OS kernel and flash firmware to avoid the buffer overflow caused by page victimizations. Specifically, FastDrain can detect a triggering point where a near-future page victimization introduces an overflow of the SSD internal buffer. Our new flash firmware then speculatively scrubs the buffer space to accommodate the requests caused by the page victimization. In parallel, our new OS kernel design controls the traffic of page victimizations by considering the target device buffer status, which can further reduce the risk of buffer overflow. To secure more buffer space, we also design a latency-aware FTL, which dumps dirty data only to the fast flash pages. Our evaluation results reveal that FastDrain reduces the 99th response time of user applications by 84%, compared to a conventional system.

Index Terms—SSD, flash translation layer, operating system, page cache, page victimization.

1 INTRODUCTION
In the past decade, SSDs have successfully replaced spinning disks and become the dominant storage media in diverse computing domains, thanks to their performance superiority. However, SSDs' access latency is still two orders of magnitude longer than that of the main memory [16]. Since such long latency can significantly degrade the performance of user applications, the host storage stack practically employs a large kernel memory buffer upon the target SSD, called the Linux page cache.
Even though the page cache, which buffers file data, can effectively hide the performance penalties of accessing the underlying SSDs, the OS kernel often needs to clean the cache by flushing dirty pages to the SSD owing to file synchronization and memory resource depletion. We observe that this cleaning task, called page victimization, can severely interfere with many legacy I/O requests issued by other user applications. In practice, modern SSDs employ a built-in DRAM to buffer multiple incoming write requests in parallel, referred to as the internal buffer [8]. However, the large number of dirty page writes generated by a page victimization can overflow the internal buffer, thereby significantly degrading user-level experiences. Specifically, scrubbing the SSD internal buffer by writing dirty data to the backend flash can block the requests of the user applications from immediate I/O services. This in turn compels the applications to violate a given service level agreement.
To quantitatively analyze how page victimization can affect user-level experiences, we perform a long-tail latency analysis by running diverse latency-critical applications [2], [11], [12] together with a throughput-oriented application [5] (cf. Section 4 for experiment details). Figure 1a shows the results, comparing the average latency with the 99th response time of the corresponding latency-critical applications. Overall, the 99th response time of all latency-critical applications is 9× longer than their average response time. To understand the root cause of the long 99th response time, we also study the execution behavior of a representative application, Apache-U. The results are shown in Figure 1b. In addition to a time series analysis of Apache-U's response time, this figure includes the number of dirty pages flushed by the page cache. Apache-U's response time drastically increases when the OS kernel starts the process of heavy page victimizations. This is because the OS kernel flushes the dirty pages in the page cache without understanding the status of the underlying SSD, which results in an overflow of the internal buffer. Specifically, this buffer overflow forces the flash firmware to write a bulk of dirty data from the built-in DRAM to the flash, which makes the SSD backend too busy to serve other legacy I/O requests (coming from Apache-U). The I/O requests are suspended until the SSD backend is available. Consequently, the following data processing is postponed due to this I/O service suspension, which in turn degrades the user experience of Apache-U.
Fig. 1: (a) Performance comparison between average and 99th response time, and (b) Apache-U response time analysis.
In this work, we propose FastDrain, which coordinates both the OS kernel and the flash firmware to address the long-tail latency issue caused by page victimizations. Specifically, the OS kernel signals the flash firmware about an incoming page victimization event via a PCIe message. Our new flash firmware design leverages the message to speculatively allocate free space in the built-in DRAM, which can immediately absorb the incoming heavy loads (imposed by the page victimization). In parallel, the flash firmware notifies the OS kernel when the page victimization will cause a buffer overflow in the near future. To this end, we modify the OS kernel to adaptively control the traffic of page victimizations by using an upcall message. Note that the internal buffer has a relatively small DRAM size, which cannot accommodate all the requests introduced by the host-side page victimization. Thus, we also design a latency-aware FTL (flash translation layer) at the device level. This FTL enables the internal buffer to secure as much DRAM space as possible by quickly dumping the dirty data to only fast LSB flash pages, which reaps the benefit of reducing the interference with applications' requests. Our evaluation results show that FastDrain reduces the 99th response time by 84% in comparison to a conventional system.

2 BACKGROUND
Figure 2a illustrates the storage stack from a user process down to the low-level flash media. When the user process issues I/O requests, the requests are forwarded to the Linux page cache and file system modules. If the target data does not exist in the page cache, an I/O scheduler (e.g., Kyber [3]) in the Linux multi-queue block layer reorders the read and write requests by prioritizing the read requests. The requests are then passed through the NVMe driver and finally served by the SSD.
Fig. 2: (a) I/O storage stack, and (b) 99th response time of Vanilla and Oracle systems (normalized to Oracle).
Page cache.
The host employs the page cache to cache frequently accessed data chunks of files, which speeds up file accesses. However, the OS kernel needs to frequently flush dirty pages from the page cache to the SSD to satisfy the data durability guarantee and/or to secure enough DRAM space for incoming I/O requests. The flush activities of page victimization can in practice be categorized as background and foreground tasks. Specifically, the flush task is executed alongside the user applications as a background activity when the number of dirty pages is greater than a low threshold (denoted by dirty_background_ratio) or when the dirty pages have been dirty for longer than a user-configured timer. Otherwise, the OS can suspend user processes and prioritize the page victimization as a foreground task when the fsync() system call is invoked or when the number of dirty pages is greater than a high threshold (denoted by dirty_ratio).
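To make the two flush paths concrete, the following C sketch mirrors the threshold logic described above. dirty_background_ratio and dirty_ratio correspond to the real page-cache knobs, but the example percentages, the helper parameters, and pick_flush_mode itself are illustrative placeholders rather than actual kernel code.

```c
/* Illustrative sketch of the flush-mode decision described above.
 * dirty_background_ratio / dirty_ratio mirror the kernel's
 * vm.dirty_background_ratio and vm.dirty_ratio knobs; the example values
 * and all other names are placeholders, not real kernel symbols. */
static unsigned int dirty_background_ratio = 5;   /* low threshold (%) */
static unsigned int dirty_ratio            = 10;  /* high threshold (%) */

enum flush_mode { FLUSH_NONE, FLUSH_BACKGROUND, FLUSH_FOREGROUND };

static enum flush_mode pick_flush_mode(unsigned long dirty_pages,
                                       unsigned long total_pages,
                                       unsigned long oldest_dirty_age_ms,
                                       unsigned long expire_ms,
                                       int fsync_requested)
{
    unsigned long low  = total_pages * dirty_background_ratio / 100;
    unsigned long high = total_pages * dirty_ratio / 100;

    /* Foreground flush: user processes are suspended until the flush finishes. */
    if (fsync_requested || dirty_pages > high)
        return FLUSH_FOREGROUND;

    /* Background flush: flusher threads run alongside the applications. */
    if (dirty_pages > low || oldest_dirty_age_ms > expire_ms)
        return FLUSH_BACKGROUND;

    return FLUSH_NONE;
}
```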
SSD internal.
Modern SSDs typically consist of an SSD controller (running the flash firmware), built-in DRAM modules, and a large number of flash packages. To increase the storage throughput, SSDs employ multiple channels, each connecting to a number of flash packages over a flash system bus (i.e., ONFI [1]). Each flash package in turn contains multiple flash dies. Flash dies across all flash packages can simultaneously serve different memory requests, which exposes a high degree of SSD internal parallelism. However, writing data to flash takes much longer than reading it. Typically, the write latency varies from 560 us to 5 ms depending on which type of flash page is accessed. Table 1 shows the write variation observed across the different types (LSB/CSB/MSB) of flash pages, extracted from a real 25 nm flash sample [14]. Due to the long write latencies, writing the flushed dirty pages of a page victimization to flash can, unfortunately, make the target SSD backend too busy to serve other legacy requests issued by user applications. To address this, most SSDs employ the built-in DRAM modules as an internal buffer and accommodate incoming write requests before writing them to the backend flash media. Since the SSD built-in DRAM buffers the flushed dirty pages of a page victimization, the SSD backend remains available to serve the applications' requests. However, the internal buffer usually has limited space, which is unable to hold the bulk of dirty pages produced by a heavy page victimization.

3 ELIMINATE PAGE VICTIMIZATION OVERHEAD
Co-design of page cache and SSD internal buffer.
Fig. 3: (a) The impact of flash write latencies on the 99th percentile response time (normalized to LSB-only), and (b) flash page usage analysis of the prior work.
To prevent the page victimization from interfering with the I/O services of the user applications, one solution is to coordinate the OS kernel and the flash firmware so that the small SSD internal buffer does not overflow. However, it is challenging to enable such coordination due to the lack of communication between them: the existing flash firmware is unaware of the host-side page victimization, while the OS kernel is completely disconnected from the device-level buffer status. To address this, we enable message passing across the storage stack. Specifically, we revise the Linux I/O service routine that delivers I/O requests from a user to a physical device by piggybacking useful information onto the requests. We also modify the NVMe driver and controller, which enables all software/hardware modules in the OS and SSD to communicate in a full-duplex manner. When an incoming page victimization event is detected in the page cache, a notification is passed down to the flash firmware. This information allows the flash firmware to prepare its internal buffer to accommodate the flushed dirty pages of the page victimization, which mitigates the interference with the I/O accesses of the user applications. In addition, the flash firmware reports the status of its internal buffer to the OS kernel when the buffer is nearly full. Using this upcall report, the OS kernel can precisely control the traffic of page victimization based on the device-level buffer status.
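The coordination described above amounts to two small messages: a downcall announcing an imminent victimization and an upcall reporting the buffer occupancy. The C sketch below shows one plausible layout for these payloads; the struct and field names are our own illustration, not definitions from the paper or from any NVMe driver.

```c
/* Hypothetical payloads for the host<->SSD coordination described above.
 * The paper specifies what information flows in each direction, not the
 * exact encoding; all names here are illustrative. */
#include <stdint.h>

/* Host -> SSD: piggybacked on an NVMe command when the page cache detects
 * an imminent page victimization. */
struct victimization_notice {
    uint32_t dirty_pages;   /* number of dirty pages about to be flushed */
    uint8_t  foreground;    /* 1 if fsync()/dirty_ratio forced a foreground flush */
};

/* SSD -> host: piggybacked on an NVMe completion when the internal buffer
 * crosses a watermark. */
struct buffer_status_report {
    uint8_t above_high_watermark;  /* 1: throttle page victimization */
    uint8_t below_low_watermark;   /* 1: resume normal flushing */
};
```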
Page victimization aware FTL.
While the co-design of the OS kernel and flash firmware can improve the efficiency of the internal buffer, some write requests should be served by the flash media as soon as possible to alleviate I/O interference and congestion. For example, data evicted from the SSD internal buffer needs to be quickly drained if the buffer is almost full and a page victimization activity (coming from the page cache) is detected. To appropriately handle such cases, we design a new FTL, which serves the urgent requests of the foreground data eviction with fast LSB pages residing across the flash blocks of all existing flash dies. This design not only shortens the write latency but also improves the internal parallelism. We also accommodate background (non-urgent) data eviction in slow CSB/MSB flash pages to balance the writes across the different types of flash pages, thereby utilizing as many flash pages as possible in the SSD.
SSD internal buffer management for page victimization.
We measure the impact of the SSD internal buffer on user-level experiences by comparing the performance of two SSDs: a traditional SSD (Vanilla) and an oracle SSD with an infinite internal buffer (Oracle). The evaluation results for various applications are shown in Figure 2b. By completely buffering dirty data, Oracle can prevent the page victimization from blocking applications' I/O services, thereby reducing the 99th response time of the applications by 91%, on average. Motivated by this, we design foreground and background buffer eviction strategies in the flash firmware to improve the efficiency of the internal buffer: the foreground eviction strategy cleans up buffer space to accommodate the page victimization as much as possible, while the background eviction strategy smoothly evicts buffered dirty pages to the underlying storage with a minimal penalty of blocking applications' I/O services. We introduce a high threshold and a low threshold as watermarks of the buffer usage and switch between the eviction strategies based on the runtime buffer usage. Specifically, our new flash firmware estimates the future buffer usage by accounting for the number of dirty pages in the incoming page victimization. If the internal buffer has sufficient space (the future buffer usage is lower than the high threshold), the background eviction strategy is triggered. The flash firmware then estimates the average number of flash channels/dies that are intensively accessed by the applications' I/O requests. For each internal buffer eviction, we evict a limited set of pages to the flash dies that are idle. On the other hand, if the internal buffer is insufficient to serve the incoming I/O requests related to the page victimization (the future buffer usage is higher than the high threshold), our foreground eviction strategy actively evicts dirty pages to all flash dies until the SSD internal buffer usage falls below the low threshold.
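A compact sketch of the watermark policy above, as we read it: the firmware estimates the post-victimization occupancy and picks background or foreground eviction accordingly. The watermark percentages, batch sizes, and evict_* routines are assumptions made for illustration.

```c
/* Firmware-side sketch of the watermark-driven eviction policy described
 * above. All identifiers and the watermark percentages are illustrative;
 * the eviction routines are assumed to exist elsewhere in the firmware. */
#define BUF_PAGES       131072u                   /* 512MB buffer, 4KB entries */
#define HIGH_WATERMARK  (BUF_PAGES * 80u / 100u)  /* assumed 80% */
#define LOW_WATERMARK   (BUF_PAGES * 40u / 100u)  /* assumed 40% */

extern void     evict_batch_to_idle_dies(unsigned max_pages);
extern unsigned evict_batch_to_all_dies(unsigned max_pages);

void on_victimization_notice(unsigned buffer_used, unsigned incoming_dirty)
{
    /* Estimate occupancy once the announced dirty pages land in the buffer. */
    unsigned future_used = buffer_used + incoming_dirty;

    if (future_used <= HIGH_WATERMARK) {
        /* Background eviction: drain a small batch to idle dies so the
         * channels/dies the applications are hitting stay responsive. */
        evict_batch_to_idle_dies(256);
    } else {
        /* Foreground eviction: scrub aggressively across all dies until the
         * occupancy falls back under the low watermark. */
        while (buffer_used > LOW_WATERMARK)
            buffer_used -= evict_batch_to_all_dies(1024);
    }
}
```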
Latency-aware FTL design.
The foreground eviction can severely block applications' I/O requests, as it actively dumps a large number of dirty pages to flash. To mitigate the disturbance from the foreground eviction, one potential solution is to selectively program the foreground eviction into fast LSB pages and write the background eviction sequentially across LSB/CSB/MSB pages. Figure 3a shows the performance benefits brought by this solution when executing latency-critical applications. As shown in the figure, programming the foreground eviction only into fast flash pages (LSB-only) can, on average, reduce the 99th response time by 65% across all workloads, compared to programming the foreground eviction sequentially into all types of flash pages (LSB/CSB/MSB). To make the FTL aware of the fast pages, prior work [7] maintains a "write point" to find the target LSB flash page. For example, if the write point marks the next available page as an MSB page in the current flash block, the FTL can skip the MSB page and place the dirty data in the next LSB page. However, as the foreground eviction simultaneously writes a large number of dirty pages to flash, prior work has to skip lots of CSB/MSB pages, which unfortunately wastes flash space. Our evaluation results show that 35% of flash pages are underutilized in the evaluated workloads due to the foreground eviction (cf. Figure 3b).
To reduce the write latency and improve flash utilization, we develop a new latency-aware FTL design. Our design groups the LSB pages across all available flash blocks to accommodate the dirty data of the foreground eviction without skipping CSB/MSB pages. In addition, we interleave the foreground and background evictions to access LSB and CSB/MSB pages, respectively, to balance the flash page usage. To manage the available resources in the flash blocks, we allocate a block pointer for each flash block, which points to the next free page for program operations. All block pointers are categorized into three groups based on the types of their free pages: LSB, CSB, and MSB page groups. Within each page group, the block pointers belonging to the same flash die are organized as a list, and all lists are indexed by the flash die ID. To accommodate dirty pages from a foreground eviction, our proposed FTL selects flash blocks from different lists in the LSB group for free LSB pages. Therefore, the dirty data are interleaved into the LSB pages across different flash dies for better parallelism. Similarly, our FTL prefers choosing flash blocks from the CSB and MSB page groups to store the dirty data from the background eviction. If CSB and MSB pages are unavailable, flash blocks in the LSB page group will be selected for background eviction. Note that the unused flash space (i.e., CSB/MSB pages) caused by the foreground eviction can be reclaimed by garbage collection, so it does not hurt the total SSD capacity. We also constrain the maximum LSB-only flash space (cf. Table 1) to keep the effective SSD capacity high.
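The bookkeeping this implies can be sketched as below: one pointer per flash block, bucketed by the type of its next free page and by die, with foreground evictions striped across the LSB lists and background evictions preferring CSB/MSB pages. All identifiers are our illustration, not firmware code from the paper.

```c
/* Sketch of the latency-aware FTL's block-pointer bookkeeping described
 * above. Identifiers and the die count are illustrative only. */
#include <stddef.h>

#define NUM_DIES 32          /* assumed die count for the example */
enum page_type { PT_LSB = 0, PT_CSB, PT_MSB, PT_COUNT };

struct block_ptr {
    unsigned block_id;
    unsigned next_free_page;   /* next programmable page in this block */
    struct block_ptr *next;    /* other blocks on the same die */
};

/* group[type][die] -> list of blocks whose next free page has that type */
static struct block_ptr *group[PT_COUNT][NUM_DIES];

/* Foreground eviction: take free LSB pages round-robin across dies so the
 * urgent dirty pages are striped for parallelism. */
struct block_ptr *alloc_foreground_page(void)
{
    static unsigned rr_die;
    for (unsigned i = 0; i < NUM_DIES; i++) {
        unsigned die = (rr_die + i) % NUM_DIES;
        if (group[PT_LSB][die]) {
            rr_die = die + 1;
            return group[PT_LSB][die];
        }
    }
    return NULL;               /* no LSB page left anywhere */
}

/* Background eviction: prefer CSB/MSB pages, fall back to LSB pages. */
struct block_ptr *alloc_background_page(unsigned die)
{
    if (group[PT_CSB][die]) return group[PT_CSB][die];
    if (group[PT_MSB][die]) return group[PT_MSB][die];
    return group[PT_LSB][die];
}
```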
Adaptive page victimization scheduling.
It is undesirable to face a situation in which the SSD internal buffer is close to full while the OS kernel is still flushing dirty pages from the page cache. Although our FTL design can accelerate the draining of the SSD internal buffer, intensive foreground eviction can overuse LSB flash pages, which wastes storage space. To address this, we propose to adjust the traffic of page victimization in the OS kernel based on the status of the SSD internal buffer. Specifically, we configure two sets of dirty_ratio and dirty_background_ratio values in the page cache: a high set (10% and 5% in this paper) and a low set (5% and 3%). By default, the low set of dirty-ratio values is applied in the page cache to reduce the occupancy of memory space. If the flash firmware reports that its internal buffer is occupied beyond the high threshold, the OS kernel stops flushing dirty pages by temporarily switching to the high set of dirty-ratio values. Once the SSD reports that its internal buffer utilization has fallen back to the low threshold, the OS kernel adopts the low set of dirty ratios again and flushes the dirty pages to the SSD.

TABLE 1: System configuration parameters.
Host        | Values     | Storage (SSD)              | Values
Processor   | ARM v8     | Read (LSB/CSB/MSB)         |
L1I/L1D     |            | Write (LSB/CSB/MSB)        | 560 us to 5 ms
L2 cache    |            | Erase / Channel / Package  |
Mem ctrler  |            | Die / Plane / Page size    |
Memory      | DDR3, 4GB  | Capacity / Internal buffer | 512MB (internal buffer)
            |            | High / Low thresholds      |
            |            | Max LSB-only region        |
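A minimal sketch of the dirty-ratio switching described in the scheduling paragraph above; the two threshold sets are the ones quoted in the text, while apply_dirty_thresholds() is a hypothetical stand-in for updating the page cache's global variables.

```c
/* Kernel-side sketch of adaptive page victimization scheduling. The two
 * threshold sets (10%/5% and 5%/3%) come from the text; the helper is an
 * illustrative placeholder for updating dirty_ratio / dirty_background_ratio. */
struct dirty_thresholds {
    unsigned dirty_ratio;             /* foreground flush threshold (%) */
    unsigned dirty_background_ratio;  /* background flush threshold (%) */
};

static const struct dirty_thresholds HIGH_SET = { 10, 5 }; /* throttle flushing */
static const struct dirty_thresholds LOW_SET  = {  5, 3 }; /* default: flush eagerly */

extern void apply_dirty_thresholds(const struct dirty_thresholds *t);

/* Called when the SSD's upcall reports its internal buffer occupancy. */
void on_buffer_status_upcall(int above_high_watermark, int below_low_watermark)
{
    if (above_high_watermark)
        apply_dirty_thresholds(&HIGH_SET);  /* hold dirty pages in the page cache */
    else if (below_low_watermark)
        apply_dirty_thresholds(&LOW_SET);   /* resume flushing to the SSD */
}
```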
Page victimization detection.
It is non-trivial for the flash firmware to speculate about page victimization events solely based on the patterns of incoming I/O requests. Since the OS kernel manages the page cache, we propose to extract the page victimization information by monitoring every flush event at runtime and checking the number of dirty pages to be flushed. Specifically, to manage page victimization, the OS kernel maintains a pool of worker threads, called bdi_writeback_threads, each of which can be invoked to process the flush tasks. By monitoring the actions of bdi_writeback_threads, we can detect each page victimization event. For every page victimization, we check the number of pages to be flushed by leveraging the bdi_writeback data structure, which maintains a list of dirty inodes for flushing. By walking through the list of dirty inodes, we can obtain the number of dirty pages.
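The accounting step might look roughly like the sketch below. It follows the bdi_writeback naming only loosely, and inode_dirty_page_count() is a hypothetical helper, since the kernel does not expose a per-inode dirty-page counter under that name.

```c
/* Sketch: estimate how many dirty pages an imminent flush will write by
 * walking the writeback structure's dirty-inode list. The list/field names
 * loosely follow the kernel's bdi_writeback conventions; the per-inode
 * counting helper is hypothetical. */
struct inode;                          /* opaque here */
struct dirty_inode_node {
    struct inode *inode;
    struct dirty_inode_node *next;
};
struct bdi_writeback_view {
    struct dirty_inode_node *b_dirty;  /* inodes queued for writeback */
};

extern unsigned long inode_dirty_page_count(struct inode *inode); /* assumed */

unsigned long estimate_victimization_pages(struct bdi_writeback_view *wb)
{
    unsigned long pages = 0;
    for (struct dirty_inode_node *n = wb->b_dirty; n; n = n->next)
        pages += inode_dirty_page_count(n->inode);
    return pages;   /* tagged onto the next NVMe command as the notice payload */
}
```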
Bidirectional datapath for communication.
To leverage the extracted page victimization information and the device buffer status, one remaining issue is the lack of a datapath between the page cache and the SSD for communication. We set up the datapath by adopting the tagging methods used in storage systems [10], [15]. Specifically, we tag the page victimization information and the SSD internal buffer status onto an NVMe command and an NVMe completion message, respectively. Figure 4 shows the details of our proposed datapath. The page victimization information passes through the file system, block I/O layer, NVMe driver, and NVMe controller. Once the storage-side NVMe controller receives the NVMe commands, it extracts the page victimization information and informs the flash firmware. Similarly, the SSD internal buffer status can be embedded in the NVMe completion message by overriding a field called sq_head. Figure 4 also shows the datapath for transferring the SSD internal buffer information (i.e., full) from the SSD to the host. When the NVMe driver extracts the SSD internal buffer information in the nvme_process_cq function, the OS kernel modifies the global flush threshold variables dirty_ratio and dirty_background_ratio.
Fig. 4: Page cache and SSD communication.
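On the driver side, decoding the upcall could look like the sketch below. sq_head is a real field of the NVMe completion entry, but the simplified struct, the bit encoding, and the handler are assumptions; the handler simply feeds the on_buffer_status_upcall() routine from the earlier scheduling sketch.

```c
/* Sketch of decoding the buffer-status upcall from an NVMe completion.
 * The struct is a simplified view of the relevant completion-entry fields;
 * the BUFSTAT_* bit encoding inside sq_head is assumed for illustration. */
#include <stdint.h>

struct nvme_cqe_view {
    uint32_t result;
    uint16_t sq_head;      /* repurposed to carry the SSD internal buffer status */
    uint16_t sq_id;
    uint16_t command_id;
    uint16_t status;
};

#define BUFSTAT_ABOVE_HIGH  0x8000u   /* assumed flag bits */
#define BUFSTAT_BELOW_LOW   0x4000u

extern void on_buffer_status_upcall(int above_high, int below_low); /* earlier sketch */

/* Conceptually invoked from the completion path (nvme_process_cq). */
void handle_buffer_status_tag(const struct nvme_cqe_view *cqe)
{
    on_buffer_status_upcall(!!(cqe->sq_head & BUFSTAT_ABOVE_HIGH),
                            !!(cqe->sq_head & BUFSTAT_BELOW_LOW));
}
```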
4 EVALUATION
Experiment environment configuration.
We leverage Amber [6], a full-system simulator, to perform precise simulation of the SSD hardware and the existing software from the OS down to the SSD firmware. In our experiments, the storage is an NVMe SSD instance with TLC flash [14]. We also employ 512MB of DDR3 DRAM as the SSD internal buffer, which is similar to the Intel 750 NVMe SSD [4]. Our simulation details are shown in Table 1.
Fig. 5: Response time analysis of multiple co-located workload execution. The results are normalized to Vanilla.
Fig. 6: (a) Response time analysis when co-running workloads Apache-U and Ungzip, and (b) flash page wastage in FD.
System configurations.
We implement three different computer systems by adding the proposed optimization techniques, and compare them against a vanilla system:
1) Vanilla: a vanilla system running with an NVMe SSD;
2) FD-Buf: integrating our SSD internal buffer management design into Vanilla, which can adaptively perform SSD buffer eviction based on page victimization events;
3) FD-FTL: introducing our latency-aware FTL into FD-Buf, which selectively programs the foreground eviction into LSB pages;
4) FD: integrating our adaptive page victimization scheduling scheme into FD-FTL, such that the Linux page cache can suspend evicting dirty pages when the SSD internal buffer is full.
Workloads.
We evaluate six representative latency-critical workloads: Apache-F, Apache-U, DB-S, DB-U, ImgServer-F, and ImgServer-U, which are obtained from [2], [11], [12]. For the DB-S and DB-U workloads, we measure the latency to select (S) and update (U) key-value pairs in a target database, respectively. For the other workloads, we measure the response time to fetch (F) or upload (U) data from/to a server. We examine both web (Apache) and image (ImgServer) services. We also select a write-intensive, throughput-oriented workload, Ungzip, which is used by a GNU application [5]. In our evaluations, we co-run latency-critical and write-intensive workloads for 30 seconds to simulate a real-world server environment.
Figure 5 plots the 99th percentile response time of the latency-critical applications when co-running the latency-critical applications and the write-intensive workload under four different system configurations. Overall, FD-Buf, FD-FTL, and FD achieve 34%, 76%, and 84% shorter 99th percentile response latency, respectively, in comparison to Vanilla. FD-Buf identifies page victimization via our holistic NVMe storage stack optimization and cleans up the SSD internal buffer in advance to accommodate the I/Os of the page victimization. As a result, compared to Vanilla, FD-Buf can isolate the page victimization from read I/Os, which in turn reduces the delay that user applications experience while waiting for an I/O response. Nonetheless, the limited SSD internal buffer can still overflow when the Linux page cache intensively evicts dirty pages. Compared to FD-Buf, FD-FTL can allocate low-latency LSB pages of the NAND flash as a write buffer to accommodate a large number of dirty pages. Therefore, FD-FTL can reduce the time during which the page victimization blocks other I/O requests. FD further reduces the 99th percentile response time of the evaluated latency-critical workloads by 37%, on average, compared to FD-FTL. This is because FD adaptively suspends the Linux page cache from evicting dirty pages when the SSD write buffer is full.
Response time analysis.
Figure 6a shows the response time of Apache-U when co-running the Apache-U and Ungzip workloads. Specifically, Vanilla achieves an average response time under 5 ms when there is no page victimization. Once the write I/Os of a page victimization arrive at the storage, the response time increases up to 35 ms. As FD-Buf can identify the page victimization, it cleans up the SSD internal buffer in advance for the incoming write I/Os. While the SSD internal buffer is sufficient to accommodate all write I/Os without accessing the underlying flash media at the start of the page victimization, this promising performance of FD-Buf unfortunately is not sustainable after serving a large number of write I/Os. This is because FD-Buf needs to evict a large number of dirty pages to the flash media when the SSD internal buffer is full. As FD-FTL can mitigate the flash write latency by placing the evicted data in flash LSB pages, it reduces the worst-case response time by 3× compared to FD-Buf. FD further reduces the long-tail response time by throttling the write I/Os during page victimization events. Note that FD is also beneficial in reducing the response time when there is no page victimization event. This is because our FTL design monitors the SSD internal traffic and schedules the background data eviction of the SSD internal buffer only when the flash resources are available.
Overhead analysis.
Figure 6b compares the number of skipped CSB/MSB pages (Wasted) with the number of programmed flash pages (Used). As shown in the figure, our latency-aware FTL design can, on average, utilize 90% of the flash pages in the evaluated workloads, which is much higher than the prior work (cf. Figure 3b). This is because our proposed FTL design skips programming data into CSB/MSB pages only during the foreground eviction of the SSD internal buffer. In addition, our FTL design spreads the foreground eviction to LSB pages across flash blocks to minimize the wastage of CSB/MSB pages.
5 RELATED WORK
Several studies [9], [13] have tried to holistically optimize the OS and storage stack to fully reap the benefits of SSD-enabled systems. These storage-stack optimizations mainly detect dependencies among I/O requests that are performed in the background and address their interference. Even though the overall system performance can be improved with such optimizations, they are unsuitable for scenarios that do not exhibit any I/O dependency. In contrast to these prior works, we first reveal that the OS page cache has a major influence on performance and blocks other I/O requests, coming from different user applications, that have no dependency on the incoming requests. FastDrain holistically optimizes the software and hardware modules from the OS down to the underlying SSD, which eliminates this critical performance bottleneck of the OS page cache in the existing storage stack.
6 CONCLUSION
We observe that page victimization can degrade user experiences. To address this, we propose a co-design of the OS kernel and flash firmware, which mitigates the overhead of page victimization. Our evaluation results demonstrate that the proposed approach outperforms the conventional system by 84% in terms of the 99th percentile response time.

ACKNOWLEDGEMENTS
We thank the anonymous reviewers for their constructive feedback. This research is mainly supported by NRF 2016R1C1B2015312, DOE DE-AC02-05CH11231, KAIST Start-Up Grant (G01190015), and a MemRay grant (G01190170). N. S. Kim is supported in part by grants from NSF CNS-1705047. M. Kandemir is supported in part by NSF grants 1822923, 1439021, 1629915, 1626251, 1629129, 1763681, 1526750, and 1439057. Myoungsoo Jung is the corresponding author.
REFERENCES
[1] Open NAND Flash Interface Specification Revision 3.0. ONFI Workgroup, March 2011.
[2] Apache HTTP server benchmark tool. httpd.apache.org, 2017.
[3] Two new block I/O schedulers for 4.12. URL: https://lwn.net/Articles/720675/, 2017.
[4] AnandTech. Intel SSD 750 review. URL: http://tiny.cc/197bnz, 2015.
[5] J.-l. Gailly. GNU gzip, 2010.
[6] D. Gouk et al. Amber*: Enabling precise full-system simulation with detailed modeling of all SSD resources. In MICRO, 2018.
[7] L. M. Grupp et al. Characterizing flash memory: anomalies, observations, and applications. In MICRO, 2009.
[8] M. Jung et al. Revisiting widely held SSD expectations and rethinking system-level implications. In SIGMETRICS, 2013.
[9] S. Kim et al. Enlightening the I/O path: A holistic approach for application performance. In FAST, 2017.
[10] M. Mesnier et al. Differentiated storage services. In SOSP, 2011.
[11] D. Mosberger et al. httperf: a tool for measuring web server performance. SIGMETRICS, 1998.
[12] A. MySQL. MySQL reference manual, 2001.
[13] S. Yang et al. Split-level I/O scheduling. In SOSP, 2015.
[14] J. Zhang et al. OpenNVM: An open-sourced FPGA-based NVM controller for low level memory characterization. In ICCD, 2015.
[15] J. Zhang et al. FlashShare: Punching through server storage stack from kernel to firmware for ultra-low latency SSDs. In OSDI, 2018.
[16] J. Zhang et al. FlashGPU: Placing new flash next to GPU cores. In