Reading from External Memory
Ruslan Savchenko [email protected]
Yandex School of Data Science
Abstract
Modern external memory is represented by several device classes. At present, HDD, SATA SSD and NVMe SSD are widely used. Recently, ultra-low latency SSDs such as Intel Optane became available on the market. Each of these types exhibits its own pattern for throughput, latency and parallelism. To achieve the highest performance one has to pick an appropriate I/O interface provided by the operating system. In this work we present a detailed overview and evaluation of modern storage reading performance with regard to the available Linux synchronous and asynchronous interfaces. While throughout this work we aim for the highest throughput, we also measure latency and CPU usage. We provide this report in the hope that the detailed results could be interesting to both researchers and practitioners.

1 Introduction
During the last 10 years the external memory market has experienced an unprecedented revolution. Well-known hard disk drives have been present since the 1970s and their performance has constantly been improved. To grasp the pace of this progress, bandwidth has increased by a factor of four since the beginning of the century. Today typical speed is from 100 to 200 megabytes per second. Solid state drives appeared in 2009 and raised external memory speed up to 300 megabytes per second. Later NVMe SSDs increased the speed further, up to 3 gigabytes per second. For manufacturers it makes more sense to sell NVMe SSDs, so vendors are switching to this interface. Industry data centers are switching to NVMe SSDs as well.

Typical NVMe SSD throughput for reading is about 3 gigabytes per second. This is quite a large value for a modern system. For example, it is close to the bandwidth of a 25 gigabit network interface. Thus a single NVMe SSD is enough to saturate a modern network interface.

It is not only bandwidth which solid state drives improve but also latency. A hard disk drive requires about 12 milliseconds to read a random block. A solid state drive needs only a few hundred microseconds to get the job done. Recently, ultra-low latency SSDs pushed this even further to an astonishing single-digit number of microseconds. As a result, external memory latency dropped by a factor of a thousand in just a decade.

Random read throughput depends on latency. If one reads small random blocks from an HDD then all the time is wasted on arm positioning, and instead of 200 MB/s the resulting data transfer speed is less than a megabyte per second. It is a common technique to increase the block size and use a cache when fetching data from an HDD. In contrast, one can saturate the whole NVMe SSD bandwidth with random reads even if the block size is only 4 kilobytes.

This revolution in hardware capabilities requires the software interfaces to change as well.
The well-known read interface from the 1960s is not useful for anything serious anymore. It was replaced by pread which allows one to specify the offset within the same system call. Next came preadv which makes it possible to fetch data into multiple buffers (yet all the data should be contiguous on the drive). It was followed by preadv2 which introduced flags to ask the kernel for enhancements (we are interested in RWF_HIPRI, which makes reading use polling and saves CPU time from unnecessary interrupts).

Although all the above system calls are synchronous, input/output operations are asynchronous by their nature. There were several attempts to express this in a programming interface. POSIX AIO made it possible to issue several requests and execute them in separate threads in parallel. It was going to be replaced by Linux aio which uses a kernel queue for requests. Linux aio requests are gathered in the kernel to be executed asynchronously and even out of order. Finally, Linux 5.1 introduced the io_uring interface which addresses some drawbacks of Linux aio and aims to be the major asynchronous userspace data transfer interface. Both Linux aio and io_uring introduced new system calls. In contrast, POSIX AIO is implemented with just a few threads defined in GNU libc. In fact, many applications implement their own I/O thread pool which works like POSIX AIO.

For data managing software all this means that the old fashioned way to read large blocks and store them in a cache has to be reconsidered. The amount of data is constantly increasing and sometimes it is impossible to keep even the most relevant things in the main memory. The popularity of distributed key-value databases nudges service developers to request random keys during online processing. As a result we have a huge amount of random external memory fetches.

In this work we present our experiments with different types of external memory devices. Our main goal is to choose the best way to read if reading is always performed from storage. So we disable the Linux filesystem cache by specifying
the O_DIRECT flag. The cache is wonderful when it hits; however, it introduces additional latency when data is fetched from a device. To stress this argument we also show a few measurements taken without the O_DIRECT flag.

The paper is organized as follows. Sections 2 and 3 give a brief overview of the hardware and specify the models used in our experiments. Section 4 studies single read latency. Section 5 studies synchronous reading from a single thread. Section 6 explores the extension to multiple threads. This concludes the synchronous interfaces. Further we study asynchronous I/O. Section 7 evaluates the Linux aio interface. Section 8 compares Linux aio performance in the 4.19 and 5.4 kernels. Section 9 gives benchmarks for the io_uring interface. Section 10 investigates io_uring extended features. Section 11 summarizes all our experiments. Section 12 covers related work.
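Throughout the report, reads bypass the page cache via O_DIRECT. As a concrete illustration of this setup, the sketch below opens a file with O_DIRECT and reads one 4 KiB block with pread. This is our illustrative code, not the benchmark harness from the paper: the scratch file path and the 4096-byte alignment are assumptions (O_DIRECT requires the buffer, offset and length to be aligned to the device logical block size), and the code falls back to buffered I/O where O_DIRECT is unsupported (e.g. tmpfs).

```c
#define _GNU_SOURCE /* exposes O_DIRECT in <fcntl.h> */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

enum { BLK = 4096 }; /* assumed alignment; a safe value for most devices */

/* Read one block at `offset`, bypassing the page cache when possible. */
static int read_block_direct(const char *path, off_t offset, void *buf)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0 && (errno == EINVAL || errno == EOPNOTSUPP))
        fd = open(path, O_RDONLY); /* filesystem without O_DIRECT support */
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, buf, BLK, offset); /* offset given in the call itself */
    close(fd);
    return n == BLK ? 0 : -1;
}

/* Self-contained demo: write a block to a scratch file, read it back. */
static int demo_direct_read(void)
{
    const char *path = "/tmp/direct_read_demo.bin"; /* hypothetical scratch file */
    char *wbuf, *rbuf;
    int rc = -1;
    if (posix_memalign((void **)&wbuf, BLK, BLK) != 0)
        return -1;
    if (posix_memalign((void **)&rbuf, BLK, BLK) != 0) {
        free(wbuf);
        return -1;
    }
    memset(wbuf, 0x5a, BLK);
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd >= 0 && write(fd, wbuf, BLK) == BLK) {
        close(fd);
        if (read_block_direct(path, 0, rbuf) == 0 && memcmp(wbuf, rbuf, BLK) == 0)
            rc = 0;
    } else if (fd >= 0) {
        close(fd);
    }
    unlink(path);
    free(wbuf);
    free(rbuf);
    return rc;
}
```

Note that the buffers come from posix_memalign: with O_DIRECT an ordinary malloc'ed buffer would make pread fail with EINVAL on most devices.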
2 Hardware Overview
In this section we briefly describe modern external memory hardware. We have an HDD, a SATA SSD, an NVMe SSD, and an Intel Optane SSD installed in our testing machine. While it is impossible to cover all the details in a single section, we try to give a grasp of what these devices are and how they differ from each other. For more information refer to the operating systems textbook [ADAD18] or the materials covered in section 12.
Probably the most well-known external memory device is the hard disk drive. The internal structure of an HDD resembles that of a gramophone record: the information is stored on tracks on aluminum platters which rotate at high speed (however, HDD tracks are concentric circles while a gramophone record has a single spiral track). To read or write data a magnetic head is used. The magnetic head is attached to a mechanical arm which positions the head over the right track. To access data the drive first has to move the arm over the right track and then wait for the platter rotation to bring the data right under the head.

It is easy to calculate how long it takes to wait for the right angle. The platter rotation speed is usually included in a drive title as the RPM (rotations per minute) keyword. For a typical drive which rotates at 7200 RPM a single rotation takes 8 milliseconds. This means that we wait 4 milliseconds on average for our data to appear under the head. As we will see in the next sections it takes 12 milliseconds on average to access a random block on an HDD. Given these two numbers we get that arm positioning (which is usually called track seeking) takes 8 milliseconds.

For small blocks the data transfer time is much less than the seeking time. Only if the block is large enough does it become noticeable. For our HDD this happens when the block size is about 256 kilobytes. It means that for a random access there is no difference whether the block size is 256 kilobytes or less. Note that this is quite large compared to both the filesystem block size (4 kilobytes) and the traditional database block size (about 8 kilobytes).

The latency of 12 milliseconds imposes a huge restriction on the number of operations per second. It is easy to see that if every access is expected to be random then the drive can perform only about 80 operations per second. In practice this can be a bit higher because the OS driver or the drive's internal controller can reorder requests to reduce the total distance of arm movements. Nevertheless, the number of requests per second is far lower than what is expected from modern information systems.
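The rotational-latency arithmetic above is easy to package as a helper; this is an illustrative sketch of our own (the function name is hypothetical), reproducing the 7200 RPM numbers from the text.

```c
/* Average rotational latency is half of one full platter rotation:
 * at 7200 RPM one rotation takes 60/7200 s, about 8.33 ms, so on
 * average we wait about 4.17 ms for the data to come under the head. */
static double avg_rotational_latency_ms(double rpm)
{
    double full_rotation_ms = 60.0 * 1000.0 / rpm;
    return full_rotation_ms / 2.0;
}
```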
Flash memory has been developed since the 1980s. It avoids some problems of the HDD but introduces some others. There are no mechanical parts in an SSD, therefore a random read doesn't have to wait for slow arm motion or platter rotation. It just needs to access the proper memory cell electrically and the data transfer can be performed immediately. Thus an SSD suits random reads better.

Nevertheless, an SSD is exposed to another problem. It happens that one cannot change a flash memory cell value easily. To be able to write data the SSD first needs to clear a large area of memory called a block in SSD terminology. Usually the block size is in the range from 128 to 512 kilobytes. After an SSD block is cleared, small portions of it called pages can be used to write new data. A typical page size is 4 kilobytes which exactly matches the filesystem block. After data is written into a page, this page becomes read-only. The only way to write new data to the page is to first clear the whole block that contains it.

It is not surprising that an SSD internally remaps pages. When a filesystem logically overwrites a page, the SSD allocates a new page and writes data into it instead of rewriting the whole block. After a while the SSD has to do defragmentation: used pages are gathered together and old blocks are cleared, so unused pages become available for new writes.

The background defragmentation process affects SSD write performance. Sometimes write latency is so high that it looks more like the SSD is frozen for a while. Moreover, the latency pattern is unpredictable and can depend on factors such as free space. We won't consider writes in this report but we believe the reader should be aware of this problem.

2.3 Non-Volatile Memory Express
The Serial ATA interface has limited capacity: 3 gigabits per second for SATA II and 6 gigabits per second for SATA III. This allows SATA to transfer 384 or 768 megabytes per second. To bypass this restriction, vendors started attaching SSDs to the PCI-E bus directly. Early devices implemented their own protocols and required proprietary OS drivers. Later this approach evolved into an industry standard which is now called Non-Volatile Memory Express (NVMe).

It is not only the physical interface that has changed but the protocol itself. The NCQ extension for SATA allowed a host to issue up to 32 commands to a device to be executed in parallel (or to be reordered according to device geometry in the case of an HDD). The NVMe interface allows the queue length to be up to 65536 commands. Moreover, a single device can support many queues; up to 65536 queues are allowed by the standard. Real-world devices have much fewer queues. The ones we consider in this report have 32 and 128 queues.

Each queue consists of two ring buffers: one for commands submitted to the device and another one to notify the host about command completion. These buffers act asynchronously: one participant adds entries to the head and the other reads requests from the tail. This looks similar to network interface devices with their Tx and Rx queues.

Modern NVMe SSDs are more advanced than old SATA SSDs and are able to sustain a stable write speed for a long period of time. If one wishes to evaluate SSD write performance we strongly suggest considering an NVMe SSD. Again, we don't do writes in our experiments.
Recently Intel announced a new persistent memory type. This memory has two advantages over prevalent SATA and NVMe SSDs. First, pages can be overwritten directly and it is no longer required to clear an entire block first. Second, latency is lower by an order of magnitude. We were able to read a 4 kilobyte block from it in just 12 microseconds.

Latency as low as 4 microseconds has been reported when SPDK is used. SPDK is a library which allows one to bypass the kernel and access a drive directly from userspace. However, when the raw device is exposed to userspace the filesystem should also be implemented in userspace. There is one for SPDK called BlobFS; however, it doesn't look as mature as kernel filesystems. For example, there is no journal in BlobFS which means it cannot be used when durability is of concern.

There are several products all marketed as Optane memory. To avoid confusion we list here all the items available on the Intel website at the time this report was written.

Intel Optane SSD DC D4800X Series.
This device is made completely from Optane memory and has an NVMe interface. The size ranges from 375 gigabytes up to 1.5 terabytes. Intel suggests it for data centers.
Intel Optane Memory (M10) Series.
An Optane memory device of size from 16 to 64 gigabytes attached via the M.2 interface. The main purpose is to be used as a filesystem cache in desktops and laptops.
Intel Optane Memory H10 with Solid State Storage.
A combined Optane and QLC NAND flash memory device. The Optane part is 16 or 32 gigabytes. The NAND flash part can be from 256 gigabytes to 1 terabyte. It attaches via the M.2 interface and can be used in desktops and laptops as a complete external memory device with a fast internal Optane cache.
Intel Optane DC Persistent Memory.
An Optane-backed DIMM module of size 128, 256 or 512 gigabytes. Recall that DIMM is the interface for main (volatile) memory. As with the first one, this device is intended for data centers.

The first one looks like a more advanced NVMe SSD. In fact there is a new class of such devices, called ultra-low latency SSDs in general. This is exactly the device we examine in our report. The second and the third devices look like a solution for laptops. Probably one could reduce OS boot time by putting it onto the Optane memory cache. The last one looks really interesting; however, it is the least available. It made it into the market only in 2019. Unfortunately, at the time we performed our experiments we didn't have access to this kind of device. Nevertheless, it has been extensively studied recently and well-written reports are available in the literature [IYZ+].

All product and company names are trademarks or registered trademarks of their respective holders. Use of them does not imply any affiliation with or endorsement by them.

3 Testing Configuration
We run all our tests on a single machine which has all four storage device types installed. The machine has 512 gigabytes of RAM and two Intel Xeon CPU E5-2660 v4 processors running at a clock speed of 2.00GHz with HyperThreading enabled. Most tests run on Linux kernel 4.19. To test io_uring we use Linux kernel 5.4. Most of the experiments were executed in November and December of 2019, hence the kernel version choice. Even at the time of writing many production servers still use earlier Linux kernel versions, so we hope our evaluations are still relevant.

The testing machine has the following storage hardware installed:

• HDD: Western Digital/HGST Ultrastar He12 7200 RPM, 12 TB.
• SATA SSD: Micron 5200 PRO, 1.92 TB.
• NVMe SSD: Micron SSD 3.2TB U.2 MLC NVMe 12V, 3.2 TB.
• NVMe Optane SSD: Intel Optane SSD DC P4800X Series, 750 GB.

The main goal of this work is to evaluate the hardware and not the filesystem. On each device we create a file of size approximately 90% of the device size. We fill it with specific data to be sure that the space is allocated and that the reading takes place. To assert the latter we verify that the received bytes match the ones we expect.

4 Single Read Latency

Figure 1: HDD single read latency. Figure 2: SATA SSD single read latency.
We start with reading a single block. Our goal is to pick the best block size for a random read. An application (or filesystem) can pick any block size and access data with respect to this block size. We vary the block size from 4 kilobytes up to 32 megabytes. For each block size we make some random reads. Among these reads we calculate the average, minimum and maximum latency as well as the 99.0 and 99.9 percentiles. We use the pread system call in this experiment. We believe that lseek followed by read should have the same performance since the observed storage access time is far longer than a system call. Later we will read from multiple threads and pread will be the only choice. We use the same interface to make our experiments comparable to each other.

Hereinafter we use the "SSD" and "NVMe" notions for SATA SSD and NVMe SSD. This jargon appears in some programmer communities and doesn't lead to confusion at present. However, one should keep in mind that NVMe is just an interface and more storage devices could use it in the future. That said, the Intel Optane SSD uses the NVMe interface. We use the "Optane" notion when we talk about the Intel Optane SSD. These notions are used in all figures throughout our report.
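A single-read latency measurement of the kind described above can be sketched as follows. This is our illustration, not the paper's harness; CLOCK_MONOTONIC, the scratch file and the helper names are our choices. CLOCK_MONOTONIC is preferred over wall-clock time because it is unaffected by clock adjustments, which matters when individual reads take only microseconds.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Time one pread of `len` bytes at `offset`; returns microseconds, -1 on error. */
static long timed_pread_us(int fd, void *buf, size_t len, off_t offset)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0); /* immune to wall-clock adjustments */
    ssize_t n = pread(fd, buf, len, offset);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (n != (ssize_t)len)
        return -1;
    return (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_nsec - t0.tv_nsec) / 1000L;
}

/* Demo on a scratch file: one timed 4 KiB read at a random block offset. */
static long demo_timed_read(void)
{
    const char *path = "/tmp/timed_read_demo.bin"; /* hypothetical scratch file */
    const size_t len = 4096, blocks = 64;
    char *data = malloc(len * blocks);
    if (!data)
        return -1;
    memset(data, 1, len * blocks);
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0 || write(fd, data, len * blocks) != (ssize_t)(len * blocks)) {
        free(data);
        return -1;
    }
    off_t offset = (off_t)(rand() % blocks) * (off_t)len; /* random block */
    long us = timed_pread_us(fd, data, len, offset);
    close(fd);
    unlink(path);
    free(data);
    return us;
}
```

Repeating timed_pread_us over many random offsets and aggregating the samples yields the average, minimum, maximum and percentile statistics reported in the figures.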
Figure 1 shows results for the HDD. The latency is almost the same for all block sizes smaller than 256 kilobytes. This happens because the seek time is much larger than the data transfer time. The seek time includes arm positioning to find the right track and waiting for the platter rotation to bring the data under the head. A simple consequence is that for an HDD random read one should use blocks of size at least 256 kilobytes. Even if an application uses smaller blocks the drive access time would be the same. However, one could still decide to use smaller blocks for better cache utilization: if the amount of data per request is small and is expected to fit in the cache, then storing a large block along with the requested data would actually make the cache capacity smaller in terms of useful data.

The 256 kilobyte block read takes 12 milliseconds on average. We experienced variations from 4 milliseconds up to 25 milliseconds. This is a really huge amount of time for a computer. For example, the typical process scheduling quantum is just a few milliseconds. An operating system can (and in fact does) execute other processes while our process waits for the data to arrive from the hard drive.
Figure 2 shows SATA SSD read latencies. Note that the time at the lower part of the figure is in microseconds (we use the standard shortenings ms for milliseconds and us for microseconds). Reading a block of size 4 kilobytes takes 140 microseconds on average and the time grows linearly as the block size increases. Compared to the HDD, reading a 4 kilobyte block from the SSD is 80 times faster. For a 256 kilobyte block the SSD is ten times faster than the HDD. When the block size is large enough (starting from 4 megabytes) the SSD is only two times faster than the HDD.

Figure 3: NVMe SSD single read latency. Figure 4: Intel Optane SSD single read latency. Figure 5: Intel Optane SSD single read latency in poll mode.

4.3 NVMe

Figure 3 shows results for the NVMe SSD. The latency is better than that of the SATA SSD. For a 4 kilobyte block size the average time improved only a little, but the 99 percentile is two times lower. It takes less than a millisecond to read a megabyte block from the NVMe SSD. For the SATA SSD it took 3 milliseconds. As we see, the upgrade from SATA SSD to NVMe SSD is not as dramatic as the upgrade from HDD to SATA SSD. This is not surprising since both SATA and NVMe SSDs are based on the same technology. Only the interfaces differ.
Figure 4 shows results for the Intel Optane SSD. The minimal latency is 12 microseconds, which is 10 times lower than that of the NVMe SSD. The average latency is 1000 times lower than that of the HDD. There is quite a large variation in small block read latency: even though the average time is quite low and close to the minimal latency, the maximum latency and even the 99 percentile are significantly worse. If somebody looks at these results and wishes to create an Optane-based service with 12 microsecond latency for reads, they would have to install a larger number of Optane drives or consider providing more realistic timings.

When latency is so small, the overheads of context switching and interrupt handling become noticeable. One can use polling mode to gain some improvement. In this mode the Linux kernel monitors the completion queue instead of switching to some other job and relying on a hardware interrupt with an interrupt handler to notify about completion. Clearly, it is advisable to use the polling mode only when the hardware response is expected to arrive fast enough.

The polling mode is used when an application calls the preadv2 system call with the RWF_HIPRI flag. The result of using this call for Intel Optane is shown in figure 5. Compared to the usual pread, the polling mode lowers the maximum latency by a factor of two for block sizes up to 256 kilobytes.
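A polled read via preadv2 can be sketched like this. This is our illustration under stated assumptions: RWF_HIPRI only actually polls when the file is opened with O_DIRECT and the NVMe driver has poll queues enabled; on plain buffered files the flag has no effect, and on some kernel/device combinations the call can be rejected, hence the fallback.

```c
#define _GNU_SOURCE /* exposes preadv2() and RWF_HIPRI in <sys/uio.h> */
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Read `len` bytes at `offset`, asking the kernel to poll for completion
 * instead of sleeping until the device interrupt arrives. */
static ssize_t read_polled(int fd, void *buf, size_t len, off_t offset)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    ssize_t n = preadv2(fd, &iov, 1, offset, RWF_HIPRI);
    if (n < 0 && (errno == EOPNOTSUPP || errno == EINVAL))
        n = preadv2(fd, &iov, 1, offset, 0); /* no polling support: plain read */
    return n;
}

/* Demo: the call also works on a regular buffered file (flag is a no-op there). */
static int demo_polled_read(void)
{
    const char *path = "/tmp/polled_read_demo.bin"; /* hypothetical scratch file */
    char out[4096], in[4096];
    memset(out, 0x7f, sizeof out);
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0 || write(fd, out, sizeof out) != (ssize_t)sizeof out)
        return -1;
    ssize_t n = read_polled(fd, in, sizeof in, 0);
    close(fd);
    unlink(path);
    return (n == (ssize_t)sizeof in && memcmp(in, out, sizeof in) == 0) ? 0 : -1;
}
```

Note that polling burns CPU while it waits, which is why it only pays off for devices like Optane whose responses arrive within tens of microseconds.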
Figure 6 shows single read latencies for all four storage types on a single chart. Starting from 4 megabytes the latency is easily predicted by linear extrapolation, so we don't show larger blocks here. To show everything on a single figure we are forced to use quite an overloaded legend. We use the vertical level to show the latency and we iterate the block size horizontally. For each block size we show four bars, from left to right: for Intel Optane, NVMe SSD, SATA SSD, and HDD. The storage type is represented by hatching and the latency by color.

We see that solid state device latencies are far better than HDD. For a single read the leader is Intel Optane; however, as we shall see later, it has its own drawback compared to the NVMe SSD. The NVMe SSD and SATA SSD look quite close to each other when the block size is small. Our observations show that the best block size for a random read is 256 kilobytes for HDD, 4 kilobytes for NVMe and SATA SSD and 8 kilobytes for Intel Optane.

Figure 6: Single read latency for HDD, SATA SSD, NVMe SSD and Optane.

5 Reading from a Single Thread

Figure 7: Whole HDD read. Figure 8: Whole SATA SSD read. Figure 9: Whole NVMe SSD read. Figure 10: Whole Intel Optane SSD read.

In the previous section we studied reading just a single block from a storage device. In real life disk I/O operations are happening all the time. When a large amount of data is read we usually talk about a stream of data. Under the hood any stream is just a set of blocks if we look at it from the point of view of a storage device.

A stream is usually measured by throughput, which is the amount of data transferred during a particular amount of time. Typically we talk about megabytes per second. Since any stream is just a set of blocks we can look at the blocks separately and measure latency. These latencies can be aggregated just like in the experiments from the previous section.

In this section we study reading a single stream of data (from a single CPU thread). We consider both sequential and random streams.
We measure throughput and aggregated latency for each stream.
Hard disk drive bandwidth depends on the physical data location. Imagine an HDD platter. It seems natural that the outer track contains more data than the inner track. However, the rotation speed is constant. Thus the further the head is from the center, the higher the bandwidth. Filesystems exploit this phenomenon and put important metadata close to the outer side of the disk to improve performance.

To demonstrate the difference in bandwidth we tried to read the whole drive in one pass. To alleviate the effect of disk rotation we used a block size of 1 megabyte in our calls. Also we disabled O_DIRECT and hoped that the OS disk scheduler would execute requests more carefully. It took 15 hours to read 10 terabytes of data from the HDD. Figure 7 shows the result. It is easy to note that the bandwidth goes down throughout this experiment. At the beginning the data transfer speed was about 250 megabytes per second. As the head moved to the end of the disk the speed fell to 150 megabytes per second.

Figure 11: Comparison of sequential and random reading throughput for HDD. Figure 12: Random HDD reading throughput.

For solid state devices the bandwidth should not depend on the data position. Nevertheless, when we did our experiments we noticed some fluctuations. Unfortunately, the origin is unknown. It could be from internal buffering inside the device, but that is only a speculation.

It turned out that with
O_DIRECT the bandwidth is increased for the NVMe SSD by a factor of two. As for the SATA SSD and Optane, it is slightly worse to use O_DIRECT, but we enable it anyway just to make our experiments consistent with each other.

Figure 8 shows results for the SATA SSD. The bandwidth is about 450 megabytes per second, which is 2-3 times higher than that of the HDD. The pattern stays the same as the reading goes from the beginning of the device to the end. This is expected since an SSD doesn't have anything which makes the speed depend on the data location. We believe that the observed fluctuations are due to caching and predictions made by the internal device controller.

The NVMe SSD exhibits bandwidth three times larger than that of the SATA SSD in this experiment. Figure 9 shows the results. The observed bandwidth is about 1.3 gigabytes per second and also oscillates. The pattern looks different from the SATA SSD though.

Figure 10 shows the result of this experiment for Intel Optane. This device demonstrated the highest bandwidth of 2.2 gigabytes per second. Since the speed is so high and the device size is the smallest (recall our model capacity is only 750 GB), it took only 5 minutes to read the whole storage. Compared to 15 hours of reading from the HDD this is indeed a dramatic difference.
Previously we observed how reading a whole drive at once performs. This experiment is actually quite far from a real world workload. We used large blocks and read them sequentially. A typical application breaks one of these conditions (or both).

Next we study reading from a single thread in more detail. We pick different block sizes and execute requests for a minute in a loop. We choose the next block both in a sequential and in a random manner. This is closer to how a real application behaves: sometimes a large amount of data should be read sequentially and sometimes the blocks of interest are scattered across the device. We are not aiming at bandwidth saturation in this section; rather, we wish to highlight some interesting properties of single stream reading.

There is a tremendous difference between sequential and random reading from an HDD. If reading is sequential the drive doesn't have to move the arm to fetch the next block since it will be right under the head when the current block ends. If blocks are picked at random, then an application has to wait for the head to move to the next block before the drive can read it. As observed earlier this takes 12 milliseconds on average.

An easy consequence is that random reading leads to performance degradation if the block size is small. If reading is sequential then the maximum bandwidth is achieved for 16 kilobyte blocks. On the other hand, random reading comes close to the maximum bandwidth only when the block size is tens of megabytes.

Figure 13: Comparison of buffered and direct reading throughput for HDD. Figure 14: Comparison of sequential and random reading throughput for SATA SSD.

Figure 11 shows the bandwidth for sequential and random readings with block sizes of 256 kilobytes and 1 megabyte. Lines for sequential readings are solid and look almost identical, which means that the bandwidth is saturated.
Lines for random readings are dashed and it is easy to see that they are way below the maximum.

Nevertheless, one can saturate the bandwidth even when blocks are picked at random if a sufficiently large block size is used. Figure 12 shows how the bandwidth depends on the block size for random readings. It is almost unnoticeable for blocks smaller than 64 kilobytes and reaches slightly above 175 megabytes per second when the block size goes up to 64 megabytes. Recall that the peak bandwidth depends on the position on the platter and varies from 250 to 140 megabytes per second. Reaching 175 megabytes per second when blocks are scattered across the whole drive seems like a fair result.

In the introduction we discussed the O_DIRECT flag. Recall that this flag controls whether the kernel page cache is used when a file is accessed. The cache brings a significant benefit for sequential reading. When the page cache is used (i.e. the file has been opened without the O_DIRECT flag) the kernel fetches subsequent data in advance. One can specifically notify the kernel that the reading pattern is sequential by calling posix_fadvise() with the POSIX_FADV_SEQUENTIAL flag. Reading data that will be requested in the future allows the kernel to saturate the drive bandwidth even if the user calls read with a small block size.

Figure 13 shows both buffered and direct sequential readings from the beginning of the file. Block sizes of 4 and 8 kilobytes are considered. When the block size is 4 kilobytes, reading in direct mode reaches a bandwidth of 150 megabytes per second while the maximum bandwidth for this part of the disk is 250 megabytes per second. On the other hand, when the block size is 8 kilobytes the bandwidth is close to the maximum even for unbuffered reading. If we had picked a block size of 16 kilobytes then the line would coincide with that of buffered reading, so it is omitted to keep the chart readable.
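The readahead hint described above can be issued right after opening the file. A minimal sketch of ours (the helper names and scratch file are hypothetical):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Open a file for sequential scanning and tell the kernel about the pattern,
 * so it may enlarge the readahead window for this descriptor. */
static int open_for_sequential_scan(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* offset = 0, len = 0 means "the advice applies to the whole file";
     * the advice is only a hint, reads still work if it is ignored */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    return fd;
}

/* Demo: write a scratch file, reopen it with the hint and read it back. */
static int demo_sequential_scan(void)
{
    const char *path = "/tmp/fadvise_demo.bin"; /* hypothetical scratch file */
    char buf[1024];
    memset(buf, 'a', sizeof buf);
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0 || write(fd, buf, sizeof buf) != (ssize_t)sizeof buf)
        return -1;
    close(fd);
    fd = open_for_sequential_scan(path);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, sizeof buf); /* small reads; kernel reads ahead */
    close(fd);
    unlink(path);
    return n == (ssize_t)sizeof buf ? 0 : -1;
}
```

Note that posix_fadvise returns an error number directly rather than setting errno, and the advice is purely advisory: an unsupported filesystem simply ignores it.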
When reading from an SSD the behavior is quite steady. Figure 14 shows some sequential and random reads for the SATA SSD. The chart could clearly be replaced with just bars for each block size. So we won't study SATA SSD, NVMe SSD and Optane in detail here and move to the combined results for all four storage types. We present charts for both latency and throughput.

Figures 15 and 16 show the latencies observed in our experiments for sequential and random block sequences. Each chart is similar to the latency chart from section 4. The difference is that now we evaluate statistics from replies observed during one minute of continuous execution. It is noticeable that for sequential reading there is a huge difference between the maximum value and the 99.9 percentile for all storage devices. Most likely this happens due to sequential reading recognition and internal buffering in the device.

For sequential reading the minimum latency for all four devices is rather close. Even the HDD behaves nearly as well as the SATA SSD in terms of minimum and average latency when reading is sequential. This matches our hypothesis about internal buffering: once data is read from the platter it is stored in an internal device memory. Subsequent requests ask the device for the next blocks. Since they already reside in the internal device memory, only the SATA interface latency contributes to the total latency of the transfer. Hence the HDD and SATA SSD behave similarly.

The NVMe SSD and Optane perform close to each other as well, though Optane is faster for small blocks on average. It is interesting that for both sequential and random reading the maximum time of all three solid memory devices is much higher and reaches 1-2 milliseconds, which is way higher than the 99.9 percentile. A possible explanation could be the context switch. Since our test runs as a usual Linux process, it should be suspended from time to time to give other processes and the kernel a chance to execute on the same CPU. The timing matches what such a gap should be on a rather idle CPU and of course there is a possibility for such an event to occur. Anyway, we test not only a device but the pread interface as a whole. This is the observed behavior and one had better be prepared for it.

When the block sequence is random the HDD performs as badly as expected and the latency is similar to what we observed in section 4. It is interesting to note that the behavior of the NVMe SSD and Intel Optane looks quite similar, with a little difference when the block size is small. Intel Optane is faster, but the difference in performance under load is not as dramatic as for a single request. On average Optane is two times faster than the NVMe SSD while the minimum latency is almost the same. On the other hand, the SATA SSD is worse than the NVMe SSD by a noticeable factor of 2-4 times. Compared to solid state devices the HDD is worse by two orders of magnitude. These days one should not use an HDD if latency is at stake.

Figure 16 also shows an annoying property of SATA SSD which is discussed quite often. For block size 32 kilobytes the maximum latency reaches almost 50 milliseconds. This is just as bad as the HDD. As we discussed in section 2, an SSD sometimes needs to rearrange its internal data layout and this requires both resources and time. To cope with this problem one could issue trim requests to the device when it is unused, but it sounds rather involved to schedule such maintenance at a cluster scale. Another way would be to accept that an SSD can respond unbearably slowly from time to time. If a system is disk-based and tries to achieve low latency, the request should be sent to multiple replicas. The good news is that we don't observe such spikes for the NVMe SSD and Optane.

Next we discuss throughput for both sequential and random reading. Note that for the NVMe SSD and Intel Optane there is almost no difference between sequential and random reading. As for the HDD and SATA SSD, random reading performs significantly worse. When the block size is small the throughput is so
When the block size is small the throughput is so low that the bars are invisible.

This is the first time we see the NVMe SSD beat Intel Optane: throughput is higher for the former when the block size is 16 megabytes or more. There is a limit of 2.3 gigabytes per second for Intel Optane. For the NVMe SSD the limit looks yet unreached: the bar for a block size of 64 megabytes is almost 2.9 gigabytes per second. For small blocks Intel Optane throughput is better than that of the NVMe SSD by a factor of two. Recall that the average latency is also lower by a factor of two, so these results match each other.

Sequential reading throughput from both the SATA SSD and the HDD doesn't depend on block size and reaches 250 megabytes per second. This experiment is not quite representative for the HDD: reading starts from the beginning of the disk rather than from a random point where throughput is lower. For small block sizes random reading throughput is so low for both the SATA SSD and the HDD that the bars are invisible. However, for sufficiently large blocks the SSD achieves almost 300 megabytes per second while the HDD reaches 175 megabytes per second.
In this section we observed how reading from a single thread behaves for both sequential and random block sequences on all four storage types. To saturate the bandwidth one should pick an appropriate block size. If data is on the HDD, the block size should be 16 megabytes or more for random reading. For the SATA SSD it suffices to use 1 megabyte blocks. For Intel Optane the block size should be 4 megabytes to saturate the 2.3 gigabytes per second bandwidth. If one uses an NVMe SSD it is hard to reach the full bandwidth with a single thread: even a block size of 64 megabytes is not sufficient. In the next section we will study reading from multiple threads. As it will turn out, three threads are enough to saturate the NVMe SSD in terms of throughput.

Figure 15: Latency for all storage types, sequential block reading.
Figure 16: Latency for all storage types, random block reading.
Figure 17: Throughput for all storage types, sequential block reading.
Figure 18: Throughput for all storage types, random block reading.
Figure 19: A failed attempt to show multiple results on a single chart.

Finally we start talking about how real production systems work with storage. As shown in section 5, reading from a single thread cannot saturate a modern device to its full potential. Even for the HDD it may be better to execute several requests simultaneously to allow them to be reordered in the kernel, thus reducing head movements between tracks.

In this section we consider random reading from multiple threads. This corresponds to what many applications do. Typically a dedicated thread pool is created and I/O requests are served by threads from this pool. In fact, this is how POSIX AIO works in Linux: it uses a thread pool created inside GNU libc. It is worth noting that we don't study here the cost of transferring a task into the thread pool; we measure only the work inside the thread pool. The results can be compared with those of asynchronous I/O in the next sections.

There will be lots of figures in this section.
Putting more information on a single chart ends up looking like figure 19.
It’s been long since hard disks and OS drivers learned a simple optimization: requests can be groupedtogether and reordered to save time spent on disk’s head movements between tracks. If it is necessaryto visit several places on a platter it makes sense to reorder them by a track number and visit in a singlesweep. More intelligent schedulers consider not only time required to move head to the next point’strack but the time which will be necessary to wait until required sector appears below the head due todisk rotation. Thus the total seek plus rotation time between two points is taken into the account at thetime of head route planning. Obviously the firmware must receive several requests in advance to makethe calculations. NCQ extension for SATA protocol allow up to 32 commands to be sent to a device.Before being sent to firmware requests are kept in a queue inside the kernel. The kernel schedulercan reorder requests and merge them if blocks are consequent. Therefore even for such sequentialdevice as HDD making requests from several threads can result in better overall bandwidth comparedto single-threaded execution. 16 igure 20: HDD reading throughput depending onnumber of threads. Figure 21: HDD reading CPU usage depending onnumber of threads.Figure 22: HDD multithread reading latency for 256kilobytes block. Figure 23: HDD multithread reading latency for 1megabyte block.
Figure 20 shows how throughput changes when the number of threads increases. Different lines correspond to different block sizes. First of all we should notice that for large blocks throughput decreases when the number of threads increases. This looks a bit strange: although a latency increase is absolutely expected in this case, it should be possible to sustain throughput. Second, there is a really sharp step for 64 and 256 kilobyte blocks when the number of threads switches from 7 to 8. It seems that heuristics in the scheduler depend on constants and are triggered in one case but not in the other. In the next subsection we will look into this a bit.

Next we look at CPU usage. This is also a resource that needs to be accounted for, even when we talk about reading from storage. It turns out CPU usage can be annoyingly noticeable. In the next sections CPU usage will be one of the parameters used to compare synchronous and asynchronous interfaces. Figure 21 shows CPU usage for reads with different block sizes from different numbers of threads. As on the previous figure, lines correspond to different block sizes and the number of threads increases from left to right. CPU usage is shown vertically, and the value is the percentage of a single core used. This should be familiar from what performance utilities such as top usually show. Reading from the HDD takes no more than 5% of a single core, and the reader could ask why we even care about such a small value. However, when we look at the NVMe SSD a little further on we will see several cores saturated, so an architect of a resource-intensive application should take this value into account. For completeness of the presentation we show CPU usage for the HDD too.

Finally, we look at the third interesting measurement: latency. Latency also changes when the number of threads increases. What makes latency different from throughput and CPU usage is that we need not just one value but several of them: minimum, maximum, average and percentiles.
We show them on one chart per block size. Figure 22 shows latency for 256 kilobyte block reads, figure 23 for 1 megabyte blocks, and figure 24 for 16 megabyte blocks. As we discussed earlier, smaller blocks are not very meaningful for HDD random reading.

Figure 24: HDD multithread reading latency for 16 megabyte blocks.
Figure 25: HDD multithread reading throughput when the mq-deadline scheduler is enabled.
Surprisingly, when the number of threads changes from 7 to 8 in the case of a 256 kilobyte block size, not only does throughput increase but latency also decreases. The value drops by a factor of two for the 99 percentile. However, the benefit is not as pronounced for the 99.9 percentile.

For a 1 megabyte block size latency behaves as expected when the number of threads increases from 1 up to 32: it just rises smoothly. Recall that throughput is constant for this block size. This meets our expectations: a hard drive has a limited capacity for reading random blocks, and if we increase the number of readers they all just wait longer. However, if we pick a smaller block size or an inadequate number of threads, there are artifacts on the chart. Nevertheless when the number of threads is 55 and higher the lines become smoother.

For a block size of 16 megabytes everything is pretty straightforward: when the number of threads increases, latency rises and throughput decreases. An obvious conclusion is that when the block size is large it is better to just use a single thread.

It is easy to see that if we want the maximum possible throughput we need to read large blocks from a single thread. For a 16 megabyte block size a throughput as high as 170 megabytes per second is achievable even for random blocks. Each read request takes 10 milliseconds in the 99.9 percentile and about 30 milliseconds on average. Taking the hard disk geometry into account, this result is pretty close to the optimum.

If the product logic dictates a block size as small as 256 kilobytes and it is necessary to maximize throughput, 8-9 threads is the best choice. The throughput is 30 megabytes per second, which is nearly 6 times worse than the maximum average throughput. The latency is even worse: 50 milliseconds on average and more than half a second in the 99.9 percentile.
It may be better to lower the expectations and accept 20 megabytes per second when reading from a single thread, and in exchange achieve a more reasonable latency of 25 milliseconds in the 99.9 percentile.
Previously we observed a sharp step in throughput when the number of threads increased from 7 to 8. This happens because of the kernel I/O scheduler. Before an application request is sent to a device it is kept inside a kernel queue. Requests in that queue can be rearranged to achieve better performance through locality. The behavior is defined by the particular I/O scheduler, and there are different schedulers for different device types. In our case the default scheduler is Budget Fair Queueing [VC10]. To find out which scheduler is currently used one can execute the following command:

$ cat /sys/block/sdb/queue/scheduler
mq-deadline kyber [bfq] none
The BFQ scheduler is rather complex and depends on heuristics based on the number of blocks. Another scheduler could be used, for example one which implements a deadline policy. Such a scheduler tries to make sure that requests don't stay in the queue for too long. To change the current scheduler we execute the following command:

$ echo mq-deadline > /sys/block/sdb/queue/scheduler

Figure 26: HDD multithread reading CPU usage when the mq-deadline scheduler is enabled.
Figure 27: HDD multithread 256k block reading latency when the mq-deadline scheduler is enabled.
When the mq-deadline scheduler is enabled the results are smoother. Figures 25, 26, and 27 show throughput, CPU usage and latency for the interval in question. The number of threads is increased from 1 up to 16. To keep the presentation compact, latency is shown only for the 256 kilobyte block size; latency for the 64 kilobyte block size looks almost the same. Also, as we saw earlier, making the block size smaller than 256 kilobytes is not particularly helpful when HDD latency is at stake.

It is easy to notice that throughput now rises smoothly as the number of threads increases. For a 256 kilobyte block size the line reaches 33 megabytes per second, though it seems unlikely that it would cross 35 megabytes per second even if we added more threads. With 12 threads a throughput of 30 megabytes per second is reached while latency stays below half a second in the 99.9 percentile. This is strictly better than the results for the BFQ scheduler. Nevertheless the resulting latency is quite high.
In this section we look at similar charts for the SATA SSD. Reading from an SSD can be done in much smaller blocks even when the access pattern is random. Actually it is in our interest to reduce the block size. Consider for example an application that requires access to small random chunks of information. Naturally its authors would like to fetch data in 4 kilobyte blocks. Therefore we study latency for block sizes of 4, 16 and 256 kilobytes.

First we look at throughput. Figure 28 shows throughput for these three block sizes. All lines behave similarly: at first they rise really fast and quite soon reach a plateau, after which additional threads give absolutely no effect. The overall maximum for all block sizes is 500 megabytes per second. However, reaching this throughput requires reading in rather large blocks: 256 kilobytes and more. If the block size is small the throughput is significantly lower and cannot be improved even with more threads.

Next we study CPU usage, shown on figure 29. It turns out that when we read in 16 kilobyte blocks a significant amount of CPU time is spent on processing SSD read requests: on the order of a few tenths of a single core. If the block size is 4 kilobytes we need to allocate almost two cores just for processing SSD read requests. Needless to say, this is a noticeable amount if our application needs to carry a CPU-intensive workload and CPU is carefully accounted for.

Figure 28 shows that if we want to achieve the maximum possible throughput for a 4 kilobyte block size we need 21 threads. For a 64 kilobyte block size we need 13 threads, and for 256 kilobyte blocks we need only 3 threads. According to figure 29 the CPU usage for these block sizes is respectively 120%, 10% and 5% of a single core.

Now let's look at latency and find out which latencies we get for the above numbers of threads (picked to optimize throughput).
On figures 30, 31, and 32 we see that a read takes on average 200 microseconds for 4 kilobyte blocks, and between 1 and 2 milliseconds for blocks of 64 and 256 kilobytes.

So, for 4 kilobyte blocks we need 21 threads. Figure 30 shows nearly 500 microseconds of latency in the 99.9 percentile. This is almost twice as bad as the latency for a single thread. There is a peak at 1 thread for the 99.9 percentile latency, but it may not be real behavior, only an artifact of a cold start. For each test there is a warm-up period when requests are made but no measurements are taken, and it may be that in this test there were not enough requests during the warm-up period.

Figure 28: SSD multithread reading throughput.
Figure 29: SSD multithread reading CPU usage.
Figure 30: SSD multithread reading latency for 4 kilobyte blocks.
Figure 31: SSD multithread reading latency for 64 kilobyte blocks.
Figure 32: SSD multithread reading latency for 256 kilobyte blocks.
Figure 33: NVMe SSD multithread reading throughput.
Figure 34: NVMe SSD multithread reading CPU usage.

To get maximum throughput for 64 kilobyte blocks we need 13 threads. As we see on figure 31, the latency is a horrifying 5 milliseconds in the 99.9 percentile. This is ten times worse than the latency for a single thread. However, reading from a single thread would give us four times less throughput.

Generally speaking, if we accept 5 milliseconds of latency we can try reading larger blocks from a single thread. Indeed, the single-thread reading latency is only 2 milliseconds even for megabyte-sized blocks. However, throughput would be smaller by nearly 100 megabytes per second. Moreover, we need to keep in mind that smaller granularity has its own advantages, and reading a few randomly distributed smaller blocks cannot be compared to reading one large block in terms of data placement and access.

If we select a block size of 256 kilobytes then the maximum throughput is achievable with only three threads.
The value is almost 500 megabytes per second. CPU usage is about 5% of a single core, which is really small. The latency happens to be 5 milliseconds in the 99.9 percentile.

Recall that when we tried to read from the HDD in 256 kilobyte blocks we achieved only 20 or 30 megabytes per second, so our results for the SSD show about twenty times better throughput. For the HDD, 20 megabytes per second came with 25 milliseconds of latency even for a single thread, which is five times worse. As for the 30 megabytes per second we saw in the multi-threaded setting, the latency there is tremendously bad: by reaching almost half a second it is a hundred times worse than that of the SSD.

Now we move to the NVMe SSD and look at throughput on figure 33. The throughput lines appear to behave similarly to the SATA SSD: many lines reach a plateau really fast and don't change after that, regardless of how many new threads are added. For the NVMe SSD the plateau is several times higher, at about 2.7 gigabytes per second. However, none of this holds for small blocks. For example, the line for 4 kilobyte blocks is only half way to the maximum. We will take a closer look at small blocks in the next subsection; for now let's look at large blocks.

For block sizes of 1 megabyte and more it is sufficient to have five threads to saturate the bandwidth. With seven threads and 16 megabyte blocks we observe the absolute maximum of nearly 2.9 gigabytes per second. Surprisingly, when the number of threads is increased further, the throughput gradually falls back to the same 2.7 gigabytes per second.

CPU usage is shown on figure 34. For large blocks it is about 20% of a single core. For small blocks there is linear growth which continues even past the right border of the chart. This is expected since throughput also grows.
But we should note that even this unsaturated CPU usage is quite high: it is already more than 100% of a single core for 16 kilobyte blocks and more than two full cores for 4 kilobyte blocks.

Figure 35 shows latency for the 4 kilobyte block size. As we see, higher throughput comes at the cost of latency. The 99.9 percentile happens to be between 1 and 2 milliseconds. It is quite interesting that it rises to a certain value and doesn't grow any more. Surprisingly, the point where this value is reached is where throughput matches the SATA SSD maximum of 500 megabytes per second. The SATA SSD latency for the same block size is half a millisecond at a maximum throughput of roughly 300 megabytes per second. Remember that it is impossible to saturate the SATA SSD bandwidth with 4 kilobyte blocks, so this is not quite a fair competition.

Another unexpected fact is that the minimum latency drops significantly when the number of threads is increased. At nearly 30 microseconds it starts to resemble Intel Optane. Unfortunately we have no explanation for this phenomenon. It probably has something to do with caching; however, it is hard to imagine a cache with such a hit rate for the whole device. One possibility is that this is an observation of a Flash Translation Layer cache. The Flash Translation Layer is the part of SSD firmware responsible for translating the LBA sector number (which the device receives in a request from the operating system) into internal pages of the flash memory itself.
Due to the physical limitations of NAND memory this is quite a complicated and carefully optimized piece of software.

When reading 64 kilobyte blocks the latency also grows as the number of threads increases. The growth itself is not that steep: it rises smoothly to around 2 milliseconds as the number of threads is increased from 9 to 64. The corresponding throughput growth is from 1.5 gigabytes per second to 2.5 gigabytes per second. It is worth noting that CPU usage is nearly a core and a half, which is already noticeable and should be taken into account.

Reading with 256 kilobyte blocks behaves as expected. The latency grows like a logarithm on a logarithmic-scale chart and in total is a few times larger. Generally speaking, we can say that reading saturates when the number of threads reaches 12. Throughput growth continues even after that, but it is already quite slow. At a throughput of 2.5 gigabytes per second the latency is almost 2 milliseconds in the 99.9 percentile. With the SATA SSD the latency for the same block size is about 1.5 milliseconds, but the amount of data transferred was five times smaller.

Figure 39: NVMe SSD multithread reading CPU usage.
Figure 40: NVMe SSD multithread reading latency for 4 kilobyte blocks.
As we have seen in the previous subsection, it is worth increasing the number of threads when we read 4 kilobyte blocks. Just in case, here we also look at smaller blocks. It appears that even if we specify a smaller block size the device itself behaves as if the whole 4 kilobyte page has been read.

Figure 38 shows the resulting throughput when the number of threads rises up to 256. Now we see that it is possible to transfer 2.5 gigabytes per second even in 4 kilobyte blocks. This is the result we obviously couldn't achieve in the previous experiments. However, we have to use 256 threads, which is five times more than the number of (hyperthreaded) cores on our testing machine. Unfortunately this result is good only as a joke, since no real-world application would use so many threads to read from a single NVMe SSD.

Despite the enormous number of threads the CPU usage is quite reasonable. It turns out that in order to read at maximum throughput with 4 kilobyte blocks we need to pay CPU time equivalent to six full cores.

And the final bad thing here is latency. Figure 40 shows that even the 99.0 percentile is larger than 1 millisecond. This is ten times worse than a single read from the NVMe SSD. Probably this is not what we expected when we first heard about single-request latency and peak NVMe SSD bandwidth.
Optane, in contrast to the NVMe SSD, turned out to be quite good for reading from multiple threads. Starting from a block size of 64 kilobytes, the maximum throughput is achieved with only two threads. The resulting throughput is about 2.4 gigabytes per second, as we can see on figure 41. For blocks of 16 kilobytes and less the maximum throughput is about 2.2 gigabytes per second. To reach it one needs seven threads for 16 kilobyte blocks and eleven threads for 4 kilobyte blocks. That sounds quite reasonable and realistic to use in an application, at least compared to 256 threads for the NVMe SSD.

CPU usage is shown on figure 42 and seems comparable to the NVMe SSD. To read with a 4 kilobyte block size we need a little more than three fully utilized cores. For 16 kilobyte blocks it is sufficient to allocate CPU time equivalent to a single core, and for larger blocks only 10 to 20 percent of a core is used.

For the 4 kilobyte block size we observe very good latency. Figure 43 shows a bit more than 50 microseconds in the 99.9 percentile. Compared to 2 milliseconds for the NVMe SSD this is a tremendous speedup. If we leave out the maximum latency and consider only the 99.9 percentile, the latency is similar to that of a single read. Recall that bandwidth is saturated when reading with only eleven threads, so we don't need to take into account the sharp growth when the number of threads reaches 37. This is a really good result. As we will see further, it is really hard to reproduce such low latency with asynchronous interfaces; in fact we will often fail to achieve it.

For larger blocks latency is also pretty good. For a 64 kilobyte block size we see about 200 microseconds on figure 44. This is worse than a single read and even a single-threaded read, but not by much. In any case it is ten times faster than reading with the same block size from the NVMe SSD.

Figure 41: Optane multithread reading throughput.
Figure 42: Optane multithread reading CPU usage.
Figure 43: Optane multithread reading latency for 4 kilobyte blocks.
Figure 44: Optane multithread reading latency for 64 kilobyte blocks.
Figure 45: Optane multithread reading latency for 256 kilobyte blocks.
We evaluated synchronous reading from multiple threads for all four types of storage devices. Despite the seemingly single-threaded physics of the HDD, we observed that in some cases reading from multiple threads allowed us to increase throughput. However, this was paid for with a tremendous worsening of latency. We also observed that to extract the maximum throughput from either the HDD or the SATA SSD we need to read with a large enough block size.

As for the NVMe SSD and Intel Optane, we succeeded in reaching nearly maximum bandwidth with 4 kilobyte blocks. To read from Optane at the highest throughput we need only eleven threads. For the NVMe SSD we required an enormous 256 threads. Obviously nobody sane would use this number of threads, at least while the number of CPU cores on a machine is several times smaller. If the number of active threads exceeds the number of cores, then unexpected latency spikes here and there are unavoidable, due to a thread being put on standby while all cores execute other threads. Moreover, an application has absolutely no control over these spikes, since their origin is inside the kernel scheduler.

Nevertheless it is possible to read from an NVMe SSD in 4 kilobyte blocks and fully utilize the bandwidth while keeping the number of threads reasonable. However, we require an asynchronous input-output interface for that, and this is exactly the topic of the next sections.

Figure 46: Asynchronous reading throughput.
The Linux kernel has had an asynchronous I/O interface for a long time already [BPPM03]. It works as follows: a process creates a queue inside the kernel and then inserts I/O requests into that queue. Requests specify the type of operation, file descriptor, offset and data buffer. Conceptually, queue insertion is non-blocking. However, the Linux aio implementation can sometimes block (this is one of the criticisms of the interface). After a request is inserted the process continues its execution, and the request will be performed asynchronously by the kernel at some point in the future. Later the process can ask the kernel for a list of completed requests. During this call the process can additionally specify the minimum number of requests that should be completed; if the number of already completed requests is smaller, the process blocks until enough requests have finished.

Namely, there are four system calls: io_setup , io_destroy , io_submit , and io_getevents . The first two manage the existence of the in-kernel asynchronous request queue itself. The third adds requests to the queue and the last one is used to get the results. Refer to the man pages for the details of how to use these system calls. A rather extensive description based on user experience can be found in [Maj19].

Figure 46 shows that even with the asynchronous interface a single thread is not enough to saturate the NVMe SSD bandwidth. However, we can achieve the goal with only three threads, even when a 4 kilobyte block size is used. Recall that with the synchronous interface we required an unpleasant 256 threads.

Here we create a queue for each thread. We should mention, however, that the Linux aio interface allows different threads to access a single queue. Maybe there is a good way to exploit this, but we won't investigate it here. Our goal is to push throughput to the maximum and alongside it latency to the minimum.
All multi-threaded programming experience says that accessing a single object from several threads would be an obstacle to such a goal.

Next we show latency, throughput, and CPU usage with respect to block size. The asynchronous interface exhibits much higher throughput than the synchronous one. However, for a small block size even the asynchronous interface is not enough to saturate the NVMe SSD bandwidth from a single thread. We need to use more threads. Here we demonstrate results for one, two, and three of them.

Figure 47: Asynchronous reading throughput for different block sizes.
Figure 48: Asynchronous reading CPU usage for different block sizes.
With the asynchronous interface we can perform an operation which is impossible with a synchronous one: read several random blocks at once in a single request. Even though the system calls preadv and preadv2 accept a vector of buffers, they access blocks on a device sequentially. With the asynchronous interface we can insert into the queue a number of absolutely arbitrary requests and wait for all of them to complete. Evidently the request duration depends on the number of blocks. However, this dependency is sub-linear, so a single asynchronous call is faster than several synchronous ones in a row.

We show latencies for reading 4 kilobyte blocks. As with the single-read experiment in section 4, we execute some requests and show the minimum, maximum, average and percentiles. Since in terms of data the request complexity grows linearly as the number of blocks increases, we use a linear scale for latency. This is different from all other sections, where latency is shown on a logarithmic scale. It is easy to see that although the growth is linear, the derivative is less than one.

Figure 49: Single-threaded asynchronous reading latency for different block sizes.
Figure 50: Two-threaded asynchronous reading latency for different block sizes.
Figure 51: Three-threaded asynchronous reading latency for different block sizes.

Figure 52 shows the results for the HDD. It is easy to see that reading multiple blocks at once with the asynchronous interface is faster than multiple synchronous reads. Recall that a single random read from the HDD takes 12 milliseconds on average. Here we see that five blocks are read in 50 milliseconds on average, which is slightly faster. If we take 20 blocks we get 150 milliseconds. With the synchronous interface the results would be 60 and 240 milliseconds respectively. It is easy to notice that the response time grows faster for the synchronous interface.

Now let's look at the SATA SSD. Figure 53 shows the results. Here we see that latency growth is really low.
It rises sharply at the beginning, when we step from two blocks to three, and then it only doubles while the number of blocks is increased by an order of magnitude. There are three spikes on this figure which go up to almost 50 milliseconds. Recall that we already observed them in previous experiments. For the sake of completeness the results are also shown on figure 54 with a logarithmic scale. One can also see why we don't use a logarithmic scale in this section: it is really hard to grasp the correlation between the number of blocks and latency on such a chart.

Results for the NVMe SSD are shown on figure 55. They appear to behave similarly to the SATA SSD. The latency doubles at the beginning when the number of blocks is increased from one to two. After that the latency smoothly increases from 25 microseconds to 400 microseconds while the number of blocks rises from three up to sixty. In the end the line is almost horizontal.

Optane performs better than the NVMe SSD here. As shown on figure 56, there is no annoying doubling of latency at the beginning. We can read almost 50 blocks and stay within the limit of 250 microseconds. However, the overall performance is not enormously better than the NVMe SSD.

Recall that we disable the kernel cache in our experiments. With the asynchronous interface the benefit is the most noticeable. For example, figure 57 shows results for Optane when the file we read from is opened without the O_DIRECT flag. It is easy to see that latencies are several times worse.
We looked at the Linux aio interface. In contrast to the synchronous interface, we were able to read from the NVMe SSD at up to 2.5 gigabytes per second with 4 kilobyte blocks using only three threads. The CPU usage is also three full cores. Surprisingly, a further increase in the number of threads doesn't help.

This finishes our overview of Linux 4.19 interfaces. Starting from the next section we move to kernel version 5.4. First we compare Linux aio performance on the 5.4 and 4.19 kernels, so that users of the old kernel can see the benefit of updating.

Figure 52: HDD multiple-block read.
Figure 53: SSD multiple-block read.
Figure 54: SSD multiple-block read (logarithmic scale).
Figure 55: NVMe SSD multiple-block read.
Figure 56: Optane multiple-block read.
Figure 57: Optane multiple-block read without O_DIRECT.
Figure 58: Asynchronous reading throughput on Linux 5.4.

Rather surprisingly, Linux aio has not been widely adopted. In Linux version 5.4 kernel developers made another attempt at an asynchronous interface. This new interface is called uring.
We will study uring in the next section. To deliver an extensive comparison between the Linux aio and uring interfaces we need to achieve the best performance for both of them (or show the various trade-offs). But first we run our tests with the same parameters as in the previous section, to compare the different kernel versions.
In this subsection we run tests with precisely the same parameters as in section 7. We do this to compare different kernel versions with each other and show that it is worth updating the Linux kernel on production systems.

Figure 58 shows throughput for all four storage devices. Note that most of the bars are well above 3 gigabytes per second. In all our previous results 3 gigabytes per second appeared as an unreachable goal, and we managed to reach it only by a small margin using the asynchronous interface. With kernel version 5.4 we see 3.2 gigabytes per second for the rather small block size of 32 kilobytes. For 4 kilobyte blocks we almost reach 2.9 gigabytes per second. When executed with kernel 4.19 the result was noticeably worse. So a kernel upgrade alone increases NVMe SSD throughput by 7%. Anyone who uses the Linux aio interface and saturates NVMe SSD bandwidth should consider updating their kernel.

As figure 59 shows, CPU usage also drops. Reading with 4 kilobyte blocks requires half a core of CPU time less. Again, if a production system saturates NVMe SSD bandwidth with 4 kilobyte blocks, the kernel upgrade results in a 17% CPU usage reduction.

Finally, let's look at latency. Figures 60, 61, and 62 show latencies for one, two, and three threads respectively. Compared to previous results we see that the 99.9 percentile is better for middle-sized blocks, but for blocks of 4 and 8 kilobytes it looks roughly the same for all three solid state devices. The results for HDD are the same; however, HDD is not our concern here. Furthermore, the block sizes represented on the figure are inadequately small for modern HDDs. Anyone concerned about HDD throughput should pick a block size of 16 megabytes or larger.

Figure 59: Asynchronous reading CPU usage on Linux 5.4. Figure 60: Single-threaded asynchronous reading latency on Linux 5.4.
Figure 61: Two-threaded asynchronous reading latency on Linux 5.4. Figure 62: Three-threaded asynchronous reading latency on Linux 5.4. Figure 63: Asynchronous reading throughput on Linux 5.4.

When creating an asynchronous input-output queue one has to specify its length. Usually programmers pick some number which seems adequate or good enough. Our tests were no exception. Strictly speaking, the queue length affects throughput and latency. This seems obvious in the corner cases: a queue of length one should behave like the synchronous interface, while a very large queue should demonstrate large latency.

Let's find the optimal queue size. We read with a block size of 4 kilobytes and vary the queue size from 1 up to 256 elements. Figure 63 shows the throughput observed in this experiment. The figure looks familiar, but the bars correspond to different queue sizes instead of block sizes.

Looking at the figure we can say that the queue sizes are clearly distinguishable. In particular, queues with 16 and 32 elements stand out the most since they achieve the best throughput for Optane and NVMe respectively. The queue of 128 elements also looks rather interesting since it allows using only two threads to read from Optane while achieving almost maximum throughput.

Figure 64 shows the observed CPU usage. It doesn't seem like it would affect our choice of the best queue size. The CPU usage grows alongside throughput growth, which is to be expected. We won't dive into the 10% oscillations of the CPU usage; however, the reader may find them important for some particular case.

Latency is more important to us. Figures 65, 66, and 67 show the results for one, two, and three threads respectively. It is straightforward to see that latency increases when the queue size is increased. In particular, for the 128 element queue Optane latency is quite bad: 1 millisecond in the 99.9 percentile. If we pick a smaller queue of 16 elements the latency is 100 microseconds for the same 99.9 percentile.
For queue size 32 the latency for both NVMe SSD and Optane is 250 microseconds. Recall that we should look at the third figure, since for this queue size we need three threads to saturate the bandwidth.

It seems that we need a queue of 16 elements and three threads when we read from either Optane or NVMe. For SSD we need the same queue size and a single thread. However, it is too early to finish our configuration at this point.

When io_getevents is called, an application specifies how many requests should be awaited for completion. Clearly one could wait for a single request to complete or for all requests in the queue.

Figure 64: Asynchronous reading CPU usage on Linux 5.4. Figure 65: Single-threaded asynchronous reading latency on Linux 5.4. Figure 66: Two-threaded asynchronous reading latency on Linux 5.4. Figure 67: Three-threaded asynchronous reading latency on Linux 5.4. Figure 68: Asynchronous reading throughput on Linux 5.4.

In some sense this parameter determines how often we add new requests to the queue. In a real application it would usually be dictated by user behavior rather than chosen freely. However, in synthetic experiments we both can and should tweak this parameter, since it shows us what performance we can hope for in real life.

It is interesting to look at how latency and throughput depend on the number of requests we insert into the queue in a single step (a batch). We select the two best queue sizes of 16 and 32 elements and vary the batch size. Earlier, when we tried to find the best queue size, the batch size was equal to one. Next we look at the throughput, CPU usage, and latency observed in these experiments and finally pick the best parameters.

The new charts can be hard to read since the legend is extended. We add one new dimension for the queue size, which is either 16 or 32. Each bar is now split in two, and each half-bar represents a different queue size.
Thus results for a specific storage device and number of threads but with different queue sizes are shown by adjacent bars.

Figures 68 and 69 show throughput and CPU usage respectively. Different queue sizes can be distinguished by hatches. In the legend, queue sizes of 16 and 32 are labeled QS16 and QS32 respectively. Different groups of bars differ by batch size, so the horizontal axis represents the batch size.

Figures 70, 71, and 72 show latencies for one, two, and three threads respectively. Unfortunately, hatches are already in use on latency charts, where they represent the storage type. Different queue sizes are instead represented by color saturation: more saturated bars correspond to the queue of 16 elements, while less saturated bars correspond to the queue of 32 elements.

Let's deduce the best batch size. First we look at latency. It is straightforward to notice that in almost all cases the queue of 32 elements leads to larger latencies than the queue of 16 elements. The only advantage of the 32 element queue is larger throughput when reading from NVMe SSD. However, a difference of 50 or even 100 megabytes per second doesn't justify a 10% or higher increase in latency. Therefore we will look at the 16 element queue.

Let's look at throughput in figure 68. Notice that for Optane, if the batch size is 4 or lower, it suffices to use only two threads to saturate the bandwidth. Almost all these variants also have the same latency. We can pick batch size 4 here to save some CPU time.

Figure 69: Asynchronous reading CPU usage on Linux 5.4. Figure 70: Single-threaded asynchronous reading latency on Linux 5.4. Figure 71: Two-threaded asynchronous reading latency on Linux 5.4. Figure 72: Three-threaded asynchronous reading latency on Linux 5.4.
As for NVMe SSD, the throughput chart shows that we need to choose between batch sizes 1, 2, and 4, with preference for the smaller size. It happens again that all three bars look similar, therefore for NVMe SSD we pick batch size 1. This leads to the highest throughput. However, the throughput is 2.8 gigabytes per second. If we want to reach 2.9 gigabytes per second we need to choose the 32 element queue and batch size 4, but this will lead to a nearly 50% latency increase in the 99.9 percentile.

The charts clearly show that it is better to read from SSD using a single thread, since this leads to 300 microsecond latency in the 99.9 percentile. If we use two threads the latency is 500 microseconds for the same percentile. For three threads the latency is around 1 millisecond; however, it drops to around 600 microseconds when the batch size is 16. The lowest latency for a single thread is with batch size 2 or 4. If we take throughput into account, batch size 2 is strictly better. The throughput is rather close to what we can achieve. If one would like to sacrifice latency and get the maximum throughput, they should use three threads, queue size 16, and batch size 8.

HDD is almost unnoticeable on the throughput and CPU usage figures. This is not surprising since we read using 4 kilobyte blocks, and for HDD all the time is spent on seeking. Therefore we just pick the lowest latency: a single thread, a 16 element queue, and batch size also 16. The resulting latency is slightly above 100 milliseconds.

This concludes selecting the parameters. We summarize our choices in the table below.

Storage  Threads  Queue size  Batch size
Optane   2        16          4
NVMe     3        16          1
SSD      1        16          2
HDD      1        16          16
In the last subsection we execute our usual experiments with different block sizes but with the queue and batch sizes picked in the previous subsection.

Figure 73 shows the resulting throughput. It seems that NVMe SSD performs slightly worse for block sizes of 4 and 8 kilobytes. For block sizes of 16 kilobytes and larger the results look similar to those observed previously. For the other storage types the results look the same.

The CPU usage is shown in figure 74. It appears to be slightly better for the 4 kilobyte block size and slightly worse for the 8 kilobyte block size, but the overall results are also nearly the same.

On the other hand, the latency difference is enormous. The results are shown in figure 75. Latency is several times better for all storage types and all block sizes. For example, reading from Optane with a 4 kilobyte block size demonstrates a latency of 100 microseconds in the 99.9 percentile. At the beginning of this section it was 250 microseconds. For NVMe SSD the improvement is even more significant: we started with 600 microseconds in the 99.9 percentile and now we see about 150 microseconds.
We looked at the Linux aio interface with kernel version 5.4, compared it to version 4.19, and selected optimal parameters. We were able to significantly improve latency. For small blocks it begins to look like single-block reading. For NVMe SSD and Optane the difference is within a factor of two. For SSD it is slightly worse, more like a factor of 2.5.

Recall that our ultimate goal is to compare Linux aio with uring. In the next section we finally start reading via uring.

Figure 73: Best asynchronous reading throughput on Linux 5.4. Figure 74: Best asynchronous reading CPU usage on Linux 5.4. Figure 75: Best asynchronous reading latency on Linux 5.4. Figure 76: Uring reading throughput.

9 uring interface

As we have stated several times already, Linux kernel version 5.4 introduced a new asynchronous input-output interface called uring. The interface itself slightly resembles NVMe SSD (refer to 2.3). There are two queues: a submission queue (SQ) and a completion queue (CQ). Both queues reside in the process address space and can be accessed without issuing a system call. Nevertheless, it is required to notify the kernel that there are new elements in the SQ by using the io_uring_enter system call. One can also wait for completion with the same call. On the other hand, an application can look at the CQ and wait for new elements to appear without making any system call. It is quite tricky to use these system calls directly, and the developers provide the liburing library which simplifies uring usage. We use liburing in our experiments. For a detailed overview of uring refer to [Axb19] and the liburing source code.
Now we start testing the uring interface. Similar to what we did in the previous section for Linux aio, we begin by choosing the optimal queue size. Figures 76, 77, 78, 79, and 80 show the observed throughput, CPU usage, and latencies respectively. Each group of bars corresponds to the same queue size, and the number of elements in the queue increases from left to right.

As we can see, for NVMe SSD and SSD the choice is between queue sizes 16 and 32, while for Optane the best variants are 4 and 8. As for HDD, the throughput is too small to make a difference in these experiments. Let's pick a single element queue for HDD and read from a single thread. This should lead to lower latency.

Next we need to choose the batch size (that is, the number of requests we insert into the queue each time). This time we are interested in four queue sizes, therefore there are too many bars on the figures. To save some space we leave out HDD. Figures 81, 82, 83, 84, and 85 show the results.

For Optane we should look at queue sizes 4 and 8. The queue of 4 elements demonstrates better latency: 50 microseconds even when reading from three threads. Throughput is just a few tens of megabytes per second lower than the optimum. It is rather hard to pick the batch size. We choose batch size 2, which saves 25% of a single core of CPU usage.

Figure 77: Uring reading CPU usage. Figure 78: Single-threaded uring reading latency. Figure 79: Two-threaded uring reading latency. Figure 80: Three-threaded uring reading latency. Figure 81: Uring reading throughput. Figure 82: Uring reading CPU usage. Figure 83: Single-threaded uring reading latency. Figure 84: Two-threaded uring reading latency. Figure 85: Three-threaded uring reading latency.

When reading from NVMe SSD we choose between queues of 16 and 32 elements. We need to use three threads to achieve the desired throughput, therefore let's look at figure 85.
For both queue sizes the 99.9 percentile is between 100 and 200 microseconds, but for the 16 element queue the latency is lower, although we have to sacrifice some throughput. As for the batch size, we again pick 2 based on the CPU usage, even though the difference is only slightly noticeable. If one would like to maximize throughput they should pick queue size 32 and batch size 4. This results in 2.9 gigabytes per second, but the price is nearly doubled latency.

As for the SSD, it seems we have to accept that all variants which lead to maximum throughput exhibit a latency of 500 microseconds and higher. If we look at the CPU usage, the winner is queue size 32 and batch size 8.

As stated earlier, for HDD we pick queue size 1 just because this makes sense. Our selected parameters are presented in the table below.

Storage  Threads  Queue size  Batch size
Optane   3        4           2
NVMe     3        16          2
SSD      1        32          8
HDD      1        1           1
Here we present the results of our experiments with the parameters selected in the previous subsection. Figures 86, 87, and 88 show throughput, CPU usage, and latency respectively.

It seems that in terms of throughput there is almost no difference. A notable exception is Optane, for which the result is worse for 4 kilobyte blocks. However, this drop in throughput is compensated by a latency reduction. It could be that HDD performance is also worse, but recall that we selected a rather unusual setting on purpose: with a single element queue uring works as a synchronous interface.

For small block sizes CPU usage is lower by 10% compared to Linux aio. This is observed for all storage types.

Figure 86: Best uring reading throughput.
Also for small blocks we observe a latency improvement for all solid state devices. Optane latency finally reaches 50 microseconds in the 99.9 percentile for 4 kilobyte blocks. Recall that the last time we observed such latency was when we used the synchronous interface from multiple threads. However, that required the CPU power of more than three full cores; now we need less than one and a half cores. So the improvement is by a factor of two.

When reading from NVMe SSD, latency is now closer to 100 microseconds than with the Linux aio interface, but in general the difference is not really noticeable on our logarithmic scale charts. As for SSD, it seems there is no change at all: for both uring and Linux aio the latency is slightly above 500 microseconds in the 99.9 percentile when reading with small block sizes. We won't discuss HDD latency here; however, it is worth noting that since it is no longer so embarrassingly high, latencies for the other drives are represented more clearly on the chart.
We looked at the uring asynchronous interface and compared it to Linux aio. At last we were able to repeat the success of the synchronous interface: 50 microsecond latency in the 99.9 percentile when reading from Optane with 4 kilobyte blocks. NVMe SSD latency is also better now. NVMe SSD throughput is the same, while Optane throughput is slightly worse for uring. We made this sacrifice to gain some improvement in latency.

There are many parameters in the uring interface. One can fix files and buffers and even utilize a special kernel thread to process requests. We didn't try these features here and leave them for the next section.

Figure 87: Best uring reading CPU usage. Figure 88: Best uring reading latency. Figure 89: Uring reading with fixed files throughput.
10 Tuning uring
In the previous section we found optimal queue parameters for the default uring mode. There are some new features in uring itself, namely fixed buffers and kernel file objects, and even a special kernel thread for processing requests. These tweaks have the potential to improve performance. In particular, one could expect that using a kernel thread would liberate the process from issuing system calls and thus improve performance. However, in our experiments we weren't able to see any justification for this. In this section we try to tune uring and show the results to dispel false expectations.
If you read from the same files over and over you should attach them to the queue. As stated in the uring documentation, when a file descriptor is passed into the kernel the corresponding file object has to be locked before performing the operation. After the request is finished the file object is unlocked. To avoid unnecessary locking and unlocking, the file object can be, in some sense, attached to the queue. The documentation calls this feature fixed files.

To measure the effect we execute the previously obtained optimal configuration with and without fixing. Figures 89, 90, and 91 show the throughput, CPU usage, and latency comparison respectively. As we can see, this results in slightly less CPU usage when reading from NVMe SSD and Optane with small blocks. There is also a huge reduction in maximum latency for Optane and NVMe SSD with 4 kilobyte blocks: in the case of Optane the drop is from nearly a millisecond to 300 microseconds. We didn't take maximum latency into account previously, but it is pleasant to see the improvement anyway, especially when it can be achieved with so little effort. All other values look pretty much the same.
Another thing which originates from internal kernel organization is buffer registration. As the uring documentation says, when O_DIRECT is used the kernel has to map the data buffer into the kernel address space. After the data is transferred into memory, the buffer is unmapped from the kernel address space. This can take even more time than locking a file object. Thus uring has a special option to fix a buffer once and reuse it afterwards without the extra memory management for each request.

Figures 92, 93, and 94 show the throughput, CPU usage, and latency comparison when buffers are fixed and when they are not. Throughput is not changed at all for 4 kilobyte blocks and is slightly improved for 8 and 16 kilobyte blocks. CPU usage is better for all block sizes. Finally, for small block sizes we again see a decrease in maximum latency.

Figure 90: Uring reading with fixed files CPU usage. Figure 91: Uring reading with fixed files latency. Figure 92: Uring reading with fixed buffers throughput.

Finally, probably the most interesting feature is the ability to create a special kernel thread which would read requests from the submission queue, execute them, and write notifications into the completion queue. Thus a userspace program doesn't need to make any system call at all, and the context switch between user and kernel modes to submit a request can be avoided. All the application has to do is look at the completion queue in its own memory. One could hope that this will lead to better latency.

Unfortunately, the results are disappointing. In general they are similar to plain uring and sometimes even worse. Maybe there is something wrong with our experiments; it would be nice to look at a report of a more successful attempt. Anyway, we try our best and present the results.

The CPU usage measurement should be interpreted carefully. Obviously the kernel poll thread consumes CPU. However, since it is a special thread inside the kernel which is not related to the process in the usual sense, its resource consumption is not shown in standard usage statistics.
The kernel poll thread is neither part of the process nor its child. In our experiments we carefully find this thread in /proc using its name and gather its CPU consumption from there. We hope that in future kernel versions this will be more straightforward.

If we execute the experiments with the same queue parameters as selected in the previous section, the results are remarkably bad. In particular, the latency is too large while the throughput stays the same. This should come as no surprise: after all, we changed the execution significantly, and the workload is distributed differently across contexts and probably even CPU cores. Therefore we once again find optimal queue parameters using the same method as in previous sections. We hope that this approach makes the comparison fair enough.

Figure 93: Uring reading with fixed buffers CPU usage. Figure 94: Uring reading with fixed buffers latency. Figure 95: Uring reading with kernel poll thread throughput. Figure 96: Uring reading with kernel poll thread CPU usage. Figure 97: Single-threaded uring reading with kernel poll thread latency. Figure 98: Two-threaded uring reading with kernel poll thread latency. Figure 99: Uring reading with kernel poll thread throughput.

At first we vary the queue size. Figures 95 and 96 show throughput and CPU usage respectively, while figures 97 and 98 show latencies for one and two threads. We pick queue sizes 8, 16, and 32 as the most interesting. Next, let's look at how the batch size affects performance. Figures 99, 100, 101, and 102 show the results.

There are a few options for Optane with throughput of more than 2 gigabytes per second. However, all of them lead to latency of more than 50 microseconds in the 99.9 percentile. We choose queue size 16, batch size 2, and two threads. Queue size 8 would give slightly better latency but the throughput would be worse. The reader should consider this option if latency is more important.
The CPU usage is slightly above 150% of a single core.

For NVMe SSD the throughput chart shows that we need to choose the 32 element queue and two threads. All other variants fall behind. Now we need to select a batch size from 1 to 8. Latencies for all four of these variants look alike and are around 200 microseconds, so we pick batch size 8 based on the lowest CPU usage, which is slightly above 250% of a single core.

For SSD we pick a 16 element queue, a single thread, and batch size 4. This leads to maximum throughput, latency of about 300 microseconds, and two full cores of CPU time.

For HDD we again take a single element queue and a single thread. The selected parameters are summarized in the table below.

Storage  Threads  Queue size  Batch size
Optane   2        16          2
NVMe     2        32          8
SSD      1        16          4
HDD      1        1           1
In this subsection we present the results of running our experiments with the selected parameters. Figures 103, 104, and 105 show throughput, CPU usage, and latency respectively.

First let's look at throughput. There is an annoying drop for 64 kilobyte blocks. Our parameters were chosen for 4 kilobyte blocks and may not be as good for other block sizes. For the other block sizes the results are similar to those without the kernel poll thread.

As for the CPU usage, everything is bad now. Kernel poll threads consume 100% of a core, and where the usage was close to zero previously we now see 100% or even 200%. Thus using the kernel poll thread leads to more CPU consumption.

Finally, let's look at latency. For Optane, latency is worse for 4 kilobyte blocks. We were able to reach 50 microseconds in the 99.9 percentile without the kernel poll thread; now this latency is about 100 microseconds. Results for NVMe SSD look similar. There is an improvement for SSD: 30 microseconds against 500. However, the throughput is noticeably worse now. Just in case, recall that for SSD it is better to read with the synchronous interface from several threads. It is possible to reach 500 megabytes per second for large block sizes; however, we were unable to repeat this result with asynchronous interfaces.

Figure 100: Uring reading with kernel poll thread CPU usage. Figure 101: Single-threaded uring reading with kernel poll thread latency. Figure 102: Two-threaded uring reading with kernel poll thread latency.

We investigated additional uring capabilities and observed that they don't help much. Nevertheless, it may be worth using fixed files and buffers to reduce maximum latency. On the other hand, enabling the kernel poll thread could lead to a fiasco.

Now we've finished our overview of Linux input-output interfaces. In the next section we give a brief summary of all our experiments.

Figure 103: Best uring reading with kernel poll thread throughput. Figure 104: Best uring reading with kernel poll thread CPU usage.
Figure 105: Best uring reading with kernel poll thread latency. Figure 106: All experiments for HDD.
11 All experiments combined
We looked at both modern external memory hardware and the Linux application interfaces used to read from it. Intel Optane shows the lowest latency of 12 microseconds. NVMe SSD shows the highest throughput of 3.2 gigabytes per second. It appears that under load Optane and NVMe SSD demonstrate latencies within a factor of two of each other (for a single read the difference is an order of magnitude), so we could say that they perform as devices of the same class.

In the previous sections we extensively studied the programming interfaces used to access storage. It turns out that it is not easy to pick an appropriate interface and select adequate parameters. Our experiments produced lots of data, only a fraction of which was selected for presentation. Now we try to show all our data on a few plots.

We are interested in three parameters: throughput, latency, and CPU usage. We are usually concerned with the 99.9 percentile, and CPU usage is a second class parameter compared to throughput and latency. Therefore we show our data on plots whose axes are throughput and 99.9 percentile latency. Instead of points we use markers whose shape represents the block size and whose color represents CPU usage. For each block size we picked a few best choices and made their markers larger.

To identify the specific setting we use labels that encode the interface, number of threads, and queue parameters. The first letter represents the interface: "P" for pread, "A" for Linux aio, and "U" for uring. For asynchronous interfaces the queue size follows, then "B" and the batch size. For uring there can be additional "F" and "M", which stand for fixed files and fixed buffers respectively. Finally, any label can end with "T" followed by the number of threads if the test was executed from more than one thread.

Figure 106 shows results for HDD. Now it is really straightforward to see that for HDD it is better to read with the synchronous interface from a single thread.
It should not be surprising that the same setting can result in slightly different points on the plot, especially for HDD, where everything depends on luck and data placement.

Another interesting thing to notice is that a larger queue size leads to more throughput for asynchronous interfaces. This should not be surprising, since a larger queue allows more possibilities for reordering. On the other hand, the latency can be as large as a few seconds. The best example here is the 1 megabyte block with settings U1B1F, A16B1, and A64B1. One can also notice the same behavior for smaller blocks.

Figure 107: All experiments for SSD.
It is not an error that almost all markers are white: in our experiments reading from HDD never utilized more than 5% of a single core. The notable exception is the case when the uring kernel poll thread is enabled, which results in 100% utilization of a single core. A careful eye should be able to find a few yellow markers.

Figure 107 shows results for SSD. There are three noticeable clusters on the plot: one with the largest throughput of about 500 megabytes per second, another which holds the best combinations of throughput and latency for 4 kilobyte blocks, and a strange vertical cluster with throughput of 300 megabytes per second and large latency. It is straightforward to notice that one can reach 500 megabytes per second only when reading with a block size larger than 4 kilobytes. It is better to use the synchronous interface from multiple threads in this case. For 4 kilobyte blocks the best choice is uring with queue size 32 and batch size 1 (label U32B1). An interesting observation is that 32 is exactly the number of requests in NCQ, the hardware request queue. The latency is about 500 microseconds. The neighboring label A16B1 shows better latency, 250 microseconds, but this comes at the sacrifice of some throughput.

Figure 108 shows all our experiments for NVMe SSD. One can note that the peak throughput of 3.2 gigabytes per second is possible only for large enough blocks. This looks similar to SSD; however, this time we need to use the asynchronous interface from multiple threads.

For 4 kilobyte blocks there are two interesting points, labeled A16B1T3 and U64B1MFT2. One can choose between them according to the desired trade-off between throughput and latency. If CPU usage is also of concern, it should be noted that one of the circles is darker than the other, which represents more CPU consumption.

Figure 109 shows results for Optane. Once again we see that the maximum throughput overall is a step beyond the maximum throughput for 4 kilobyte blocks.
Here the absolute winner is the multi-threaded synchronous interface. For blocks larger than 4 kilobytes a few threads are enough: we need three threads for 64 kilobyte blocks, a single thread for 16 megabyte blocks, and two threads for all block sizes in between. For 4 kilobyte blocks we need 16 threads. The CPU consumption is equal to three full CPU cores, but with respect to the trade-off between throughput and latency this point looks like the unrivaled best option.

This concludes our presentation. In the next section we cover related work.

Figure 108: All experiments for NVMe SSD. Figure 109: All experiments for Optane.

12 Related work

In this section we give a brief overview of recent storage performance reports and related materials. Benchmarking is quite a broad topic [TZJW08]. Many technical websites publish measurements of new devices when they appear. Here we focus mostly on academic papers.

HDD internal operation is described in [RW94] and [ADR03] with some (outdated) performance characteristics. The preference for sequential workloads has been stated in [SG04] in the form of an unwritten contract for filesystems and applications. Internal geometry reconstruction through microbenchmarking has received attention ever since the explicit cylinder-head-sector interface was replaced by logical block addressing: [SG99], [TADP99], [GW10], and recently [Won19]. HDD performance heavily depends on scheduling [WGP94]. The mentioned Linux BFQ scheduler is presented in [VC10]. Shingled disks, which appeared in the last decade as a response to increased capacity demands, are studied in [AD15].

Mass market SSDs are made of NAND flash memory cells [Hut12]. NAND flash doesn't exactly match the block-device interface, and complex software called the flash translation layer runs as part of the device firmware.
Different algorithms for the flash translation layer are surveyed in [MFL14]. Although writing is not covered in our report at all, the major concerns when working with SSDs are write performance [SA13] and endurance [BD10]. In [HKADAD17] the authors describe SSD architecture and present some patterns that developers should follow; they further investigate them using filesystems and databases as examples. Since SSDs behave differently from HDDs, a new benchmarking methodology has been proposed as a set of representative I/O patterns [BJB09]. Raw NAND memory cells have been benchmarked in [Des10].

The first performance comparison which simultaneously studies HDD, SSD, and NVMe SSD appears in [XSG+15]. Open-channel SSDs with exposed internals became available as well. Since there are more options to coordinate with an application, such a device can demonstrate better performance, in particular for a mixed workload [BGB17].

Intel Optane SSD is a notable example of an ultra-low latency SSD. Its performance is investigated in [WOHL17], however it is only compared to HDD. In [WADAD19] the authors study peculiarities of Optane SSD and try to deduce some rules for developers. There is an interesting use-case where Optane SSD is used as a replacement for a DRAM cache [EGA+18]. Optane in its byte-addressable persistent memory form is evaluated in [IYZ+19] and [PG19]. The former provides the reader with detailed latencies and throughputs compared to both volatile memory and SSD, and further proceeds with benchmarking of various filesystems and databases. The latter investigates the applicability of such storage for memory-consuming graph algorithms.

Linux block I/O has been studied for a long time [Axb04]. Asynchronous I/O was presented in [BPPM03] and analyzed in [BTSM04]. Databases such as ScyllaDB and MySQL successfully adopted asynchronous I/O. ScyllaDB developers described their investigation for the best I/O method in their blog [Kiv17].

Uring motivation and description is presented in [Axb19]. Although it is new, it has already started gaining popularity [Cos20]. To the best of our knowledge our report is the first detailed comparison between uring and other methods. Since uring is still in active development we expect more benchmarks to appear in the future.

Justification for polled I/O can be found on the Linux kernel mailing list [Axb15]. A prototype study appeared earlier in the literature [YMH12]. Although polling reduces average latency, it can severely damage high percentiles [KLKJ18]. Delayed polling, which saves CPU cycles, has also been considered. The kernel can be bypassed with user-space frameworks such as SPDK [YHW+17] or NVMeDirect [KLK16]. However this would require using an SSD as a raw device or picking a user-level filesystem such as BlobFS [Blo]. It is yet an open challenge to create a high-performance system-wide userspace filesystem [LADADK19].

References

[AD15] Abutalib Aghayev and Peter Desnoyers. Skylight—a window on shingled disk operation. In 13th USENIX Conference on File and Storage Technologies (FAST 15), pages 135–149, Santa Clara, CA, February 2015. USENIX Association.
[ADAD18] Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau.
Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books, 1.00 edition, August 2018.
[ADR03] Dave Anderson, Jim Dykes, and Erik Riedel. More than an interface—SCSI vs. ATA. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies, FAST '03, pages 245–257, USA, 2003. USENIX Association.
[Axb04] Jens Axboe. Linux block IO — present and future. In Proceedings of the Ottawa Linux Symposium 2004, July 2004.
[Axb15] Jens Axboe. Initial support for polled IO. https://lkml.org/lkml/2015/11/6/454, November 2015.
[Axb19] Jens Axboe. Efficient IO with io_uring. https://kernel.dk/io_uring.pdf, October 2019.
[BANB13] Matias Bjørling, Jens Axboe, David W. Nellans, and Philippe Bonnet. Linux block IO: introducing multi-queue SSD access on multi-core systems. In SYSTOR '13, 2013.
[BD10] Simona Boboila and Peter Desnoyers. Write endurance in flash drives: Measurements and analysis. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST '10, page 9, USA, 2010. USENIX Association.
[BGB17] Matias Bjørling, Javier Gonzalez, and Philippe Bonnet. LightNVM: The Linux open-channel SSD subsystem. In 15th USENIX Conference on File and Storage Technologies (FAST 17), pages 359–374, Santa Clara, CA, February 2017. USENIX Association.
[BJB09] Luc Bouganim, Björn Þór Jónsson, and P. Bonnet. uFLIP: Understanding flash IO patterns. In
CIDR 2009, Fourth Biennial Conference on Innovative Data Systems Research. CIDR, 2009.
[Blo] SPDK: Blobstore filesystem. https://spdk.io/doc/blobfs.html.
[BPPM03] Suparna Bhattacharya, Steven L. Pratt, Badari Pulavarty, and Janet A. Morgan. Asynchronous I/O support in Linux 2.5. In Proceedings of the Linux Symposium, July 2003.
[BTSM04] Suparna Bhattacharya, John Tran, Mike Sullivan, and Chris Mason. Linux AIO performance and robustness for enterprise workloads. In Proceedings of the Linux Symposium, July 2004.
[Cor17] Jonathan Corbet. Two new block I/O schedulers for 4.12. https://lwn.net/Articles/720675, April 2017.
[Cos20] Glauber Costa. How io_uring and eBPF will revolutionize programming in Linux. May 2020.
[Des10] Peter Desnoyers. Empirical evaluation of NAND flash memory performance. SIGOPS Oper. Syst. Rev., 44(1):50–54, March 2010.
[EGA+18] Assaf Eisenman, Darryl Gardner, Islam AbdelRahman, Jens Axboe, Siying Dong, Kim Hazelwood, Chris Petersen, Asaf Cidon, and Sachin Katti. Reducing DRAM footprint with NVM in Facebook. In Proceedings of the Thirteenth EuroSys Conference, EuroSys '18, New York, NY, USA, 2018. Association for Computing Machinery.
[FH16] A. Foong and F. Hady. Storage as fast as rest of the system. pages 1–4, 2016.
[GW10] Jongmin Gim and Youjip Won. Extract and infer quickly: Obtaining sector geometry of modern hard disk drives.
ACM Trans. Storage, 6(2), July 2010.
[HFVW17] F. T. Hady, A. Foong, B. Veal, and D. Williams. Platform storage performance with 3D XPoint technology. Proceedings of the IEEE, 105(9):1822–1833, 2017.
[HKADAD17] Jun He, Sudarsun Kannan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. The unwritten contract of solid state drives. In Proceedings of the Twelfth European Conference on Computer Systems, 2017.
[Hut12] Lee Hutchinson. Solid-state revolution: in-depth on how SSDs really work. https://arstechnica.com/information-technology/2012/06/inside-the-ssd-revolution-how-solid-state-disks-really-work/, April 2012.
[IYZ+19] Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu, Subramanya R. Dulloor, Jishen Zhao, and Steven Swanson. Basic performance measurements of the Intel Optane DC persistent memory module. April 2019.
[Kiv17] Avi Kivity. Different I/O access methods for Linux, what we chose for Scylla, and why. October 2017.
[KKLJ17] Sangwook Kim, Hwanju Kim, Joonwon Lee, and Jinkyu Jeong. Enlightening the I/O path: A holistic approach for application performance. In Proceedings of the 15th USENIX Conference on File and Storage Technologies, FAST '17, pages 345–358, USA, 2017. USENIX Association.
[KLK16] H. Kim, Young-Sik Lee, and J. Kim. NVMeDirect: A user-space I/O framework for application-specific optimization on NVMe SSDs. In HotStorage, 2016.
[KLKJ18] Sungjoon Koh, Changrim Lee, Miryeong Kwon, and Myoungsoo Jung. Exploring system challenges of ultra-low latency solid state drives. In HotStorage, Boston, MA, July 2018. USENIX Association.
[LADADK19] Jing Liu, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Sudarsun Kannan. File systems as processes. In HotStorage, Renton, WA, July 2019. USENIX Association.
[LSS+19] Gyusun Lee, Seokha Shin, Wonsuk Song, Tae Jun Ham, Jae W. Lee, and Jinkyu Jeong. Asynchronous I/O stack: A low-latency kernel I/O stack for ultra-low latency SSDs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 603–616, Renton, WA, July 2019. USENIX Association.
[Maj19] Marek Majkowski. io_submit: The epoll alternative you've never heard about. https://blog.cloudflare.com/io_submit-the-epoll-alternative-youve-never-heard-about/, January 2019.
[MFL14] Dongzhe Ma, Jianhua Feng, and Guoliang Li. A survey of address translation technologies for flash memories.
ACM Comput. Surv., 46(3), January 2014.
[PG19] I. Peng and M. Gokhale. System evaluation of the Intel Optane byte-addressable NVM. June 2019.
[RW94] Chris Ruemmler and John Wilkes. An introduction to disk drive modeling. Computer, 27(3):17–28, March 1994.
[SA13] Radu Stoica and Anastasia Ailamaki. Improving flash write performance by using update frequency. Proc. VLDB Endow., 6(9):733–744, July 2013.
[SG99] J. Schindler and G. Ganger. Automated disk drive characterization. 1999.
[SG04] Steven W. Schlosser and Gregory R. Ganger. MEMS-based storage devices and standard disk interfaces: A square peg in a round hole?, March 2004.
[TADP99] Nisha Talagala, R. Arpaci-Dusseau, and D. Patterson. Microbenchmark-based extraction of local and global disk characteristics. 1999.
[TZJW08] Avishay Traeger, Erez Zadok, Nikolai Joukov, and Charles P. Wright. A nine year study of file system and storage benchmarking. ACM Trans. Storage, 4(2), May 2008.
[VC10] P. Valente and F. Checconi. High throughput disk scheduling with fair bandwidth distribution. IEEE Transactions on Computers, 59(9):1172–1186, 2010.
[WADAD19] Kan Wu, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Towards an unwritten contract of Intel Optane SSD. In HotStorage, 2019.
[WGP94] B. Worthington, G. Ganger, and Y. Patt. Scheduling algorithms for modern disk drives. In SIGMETRICS, 1994.
[WOHL17] Kai Wu, Frank Ober, Shari Hamlin, and Dong Li. Early evaluation of Intel Optane non-volatile memory with HPC I/O workloads.
ArXiv, abs/1708.02199, 2017.
[Won19] Henry Wong. Discovering hard disk physical geometry through microbenchmarking. http://blog.stuffedcow.net/2019/09/hard-disk-geometry-microbenchmarking, September 2019.
[XSG+15] Qiumin Xu, Huzefa Siyamwala, Mrinmoy Ghosh, Tameesh Suri, Manu Awasthi, Zvika Guz, Anahita Shayesteh, and Vijay Balakrishnan. Performance analysis of NVMe SSDs and their implication on real world databases. In SYSTOR '15, 2015.
[YHW+17] Z. Yang, J. R. Harris, B. Walker, D. Verkamp, C. Liu, C. Chang, G. Cao, J. Stern, V. Verma, and L. E. Paul. SPDK: A development kit to build high performance storage applications. pages 154–161, 2017.
[YMH12] Jisoo Yang, Dave B. Minturn, and Frank Hady. When poll is better than interrupt. In