A Prompt Report on the Performance of Intel Optane DC Persistent Memory Module
To appear in IEICE TRANS. ??, VOL.Exx–??, NO.xx XXXX 2020 (LETTER)
Takahiro HIROFUCHI†, Nonmember and Ryousei TAKANO†, Member
SUMMARY
In this prompt report, we present a basic performance evaluation of the Intel Optane Data Center Persistent Memory Module (Optane DCPMM), the first commercially available, byte-addressable non-volatile memory module, released in April 2019. Since only a few reports on its performance had been published at the time of writing, this letter is intended to complement other performance studies. Through experiments using our own measurement tools, we found that the latency of random read-only access was approximately 374 ns, and that of random write-back-involving access was 391 ns. The bandwidths of read-only and write-back-involving access for interleaved memory modules were approximately 38 GB/s and 3 GB/s, respectively.
key words:
Non-volatile Memory, NVM, Optane DC PM, DCPMM
1. Introduction
In April 2019, Intel officially released the first commercially available, byte-addressable NVM technology, the Intel Optane Data Center Persistent Memory Module (DCPMM). DCPMM is a long-awaited product that drastically increases main memory capacities. Since DRAM technology is unlikely to be able to meet the growing demand for memory, non-volatile memory (NVM) technologies that are accessible in the same manner as DRAM are considered indispensable for expanding main memory capacities. However, there is a substantial performance gap between DRAM and DCPMM. Since DCPMM was released, only a few reports on its performance have been published ([1], [2]). This prompt report is intended to complement other performance reports on DCPMM and pave the way for further system software studies addressing the performance gap. We developed our own micro-benchmark programs to measure memory latency and bandwidth, and investigated the bare performance of DCPMM to see its fundamental characteristics.
National Institute of Advanced Industrial Science and Technology (AIST)
DOI: 10.1587/trans.E0.??.1
† Note that Intel Optane DCPMM (released in 2019) and Intel Optane Memory (released in 2017) are different products. The latter is a storage-class memory device connected to the PCIe NVMe interface. DCPMM is connected to the DIMM interface and is seen as main memory from the CPU if configured in App Direct mode.
†† To promptly report results and obtain feedback from the community, we uploaded an early summary of our experiments to a public preprint server [3]. It summarizes the basic performance of DCPMM as well as its feasibility for our hypervisor-based virtualization mechanism for hybrid memory systems. Considering the broader readers' interest and the page limit of the IEICE letter format, we focus this paper only on the results of the basic performance evaluation. In this letter, we added a discussion of how this work complements other performance reports.
To clarify the contribution of this letter, we summarize our obtained performance numbers and compare them with the ones reported in related work:
• Although [1] reported that the read latency of DCPMM is 305 ns, we obtained 374 ns, which is close to the 391 ns reported by [2]. As discussed later, there is a possibility that the measurement tool used in [1] (i.e., Intel MLC v3.6) outputted a relatively small value.
• In [1] and [2], the write latency of DCPMM was measured with non-temporal instructions or cache-control instructions (e.g., clflush), as illustrated by the snippet after this list. Although depending on conditions, their values were generally in the range of 100-200 ns. On the other hand, we conducted experiments from another viewpoint, in order to see the write latencies possibly experienced by ordinary applications (which do not intentionally use non-temporal and cache-control instructions for NVM). The write latency estimated through our experiments was 391 ns. Considering the write mechanism of the 3D XPoint technology, it is very unlikely that the actual write latency is much shorter than the read latency. Possibly, the write latencies obtained with non-temporal and cache-control instructions represent the time needed to deliver data to the non-volatile internal buffer of a memory controller or memory module (which ensures no data loss upon a power failure), not the time needed to actually deliver data to non-volatile memory cells.
• Regarding the read bandwidth of DCPMM, [1] reported 39.4 GB/s by measuring the performance of sequential read with Intel MLC v3.6. [2] reported 37 GB/s by measuring random read at the granularity of 4 adjacent cache lines. We obtained 37.6 GB/s in experiments in which multiple worker processes performed sequential read on non-overlapping scratch buffers. Our result corroborates the already reported performance numbers.
• Regarding its write bandwidth, [1] reported 13.9 GB/s with Intel MLC v3.6. [2] reported 4 GB/s. In our experiments, the peak performance was 3 GB/s. Although the details of the measurement algorithm of Intel MLC are not available, we consider 13.9 GB/s to be an unlikely high value, which probably does not represent the time to actually reach memory cells. Our result is more conservative than that of [2].
• While interleaving was not disabled in [1] and [2], we also measured the read/write bandwidths and latencies with non-interleaved configurations. For example, the read/write latencies were degraded by 5.4% and 17.2%, respectively. Since interleaving contributed to decreasing latencies, there are likely multiple request queues to access the memory modules. As the number of concurrent reading processes increased, the read bandwidth drastically decreased. We observed this behavior only in the case of read access with interleaving disabled. Although it is difficult to explain the exact reason for this behavior because the technical details of DCPMM are not disclosed, a possible reason is that its internal buffering mechanism does not work efficiently when the interleaving mechanism is disabled.
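For readers who are less familiar with these instructions, the snippet below is a minimal illustration of the kind of store sequence that [1] and [2] measure; it is not their benchmark code, and the function names and buffer handling are our own. Timing such a sequence captures the point at which the data is handed off toward the memory subsystem (e.g., a non-volatile internal buffer), which is not necessarily the point at which it reaches the 3D XPoint media, as discussed in the second bullet above.

#include <stdint.h>
#include <immintrin.h>  /* _mm_stream_si64, _mm_clflush, _mm_sfence */

/* Two ways to push a store toward the memory subsystem on x86-64.
 * 'p' must point to a writable, cache-line-aligned location on the
 * target memory module. */
static inline void nt_store(uint64_t *p, uint64_t v)
{
    /* Non-temporal store: bypasses the cache hierarchy. */
    _mm_stream_si64((long long *)p, (long long)v);
    _mm_sfence();               /* order the non-temporal store */
}

static inline void store_and_flush(uint64_t *p, uint64_t v)
{
    *p = v;                     /* normal store; the cacheline becomes modified */
    _mm_clflush(p);             /* explicitly evict the modified cacheline */
    _mm_sfence();               /* wait until the flush is ordered */
}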
Table 1: The overview of the test machine used in the experiments
CPU        Intel Xeon Platinum 8260L 2.40 GHz (Cascade Lake) x2
L1d cache  32 KB, L1i cache 32 KB
L2 cache   1024 KB
L3 cache   36 MB
DRAM       DDR4 DRAM 16 GB, 2666 MT/s, 12 slots
DCPMM      DDR-T 128 GB, 2666 MT/s, 12 slots
OS         Linux Kernel 4.19.16 (extended for RAMinate)
Fig. 1: The memory configuration of the tested machine (NUMA 0)
2. Evaluation
Table 1 summarizes the specification of the tested machine. Figure 1 shows its memory configuration. The machine is equipped with 2 CPU sockets. Each CPU processor has 24 physical CPU cores and 2 memory controllers. A memory controller has 3 memory channels. Each memory channel has a DDR4 DRAM module (16 GB) and a DCPMM (128 GB). The total DRAM size of the machine is 192 GB. The total DCPMM size is 1536 GB.

The Intel CPU processors supporting DCPMM allow users to configure how DCPMM is incorporated into the main memory of a computer. In the experiments, we assigned all the DCPMMs to App Direct Mode. In App Direct Mode, the memory controller maps both DRAM and DCPMM to the physical memory address space of the machine, which enables the software layer to directly access DCPMM. The host operating system leaves the DCPMMs intact. The benchmark programs directly accessed the physical memory ranges of the DCPMMs via the device file of Linux (/dev/mem). Although the operating system recognized two NUMA domains (i.e., those of CPU sockets 0 and 1, respectively), we used the CPU cores and memory modules only in the first NUMA domain.

The interleaving mechanisms of DRAM and DCPMM were enabled, respectively. For DCPMM, the interleaving configuration of App Direct Mode was used unless otherwise noted. The 6 DCPMMs connected to each NUMA domain were logically combined, and the memory controller spread memory accesses evenly over the memory modules. For DRAM, the controller interleaving (i.e., iMC interleaving) was enabled in the BIOS setting. Similarly, the 6 DRAM modules connected to each NUMA domain were logically combined. In order to simplify system behavior, we disabled the hyper-threading mechanism of the CPUs. Transparent huge pages and address randomization were also disabled in the Linux Kernel settings.

We developed micro-benchmark programs that measure the read/write access latencies and bandwidths of physical memory†. To measure read performance, the micro-benchmark programs induce Last Level Cache (LLC) misses that result in data fetches from memory modules. For write performance, the programs cause evictions of modified cachelines as well.

2.1 Read/Write Latencies

Figure 2 illustrates the overview of the micro-benchmark program that measures memory read/write latencies. Most CPU architectures perform memory prefetching and out-of-order execution to hide memory latencies from programs running on CPU cores. To measure latencies precisely, the benchmark program was carefully designed to suppress these effects. To measure the read latency of main memory, it works as follows:
• First, it allocates a certain amount of memory buffer from a target memory device. To induce LLC misses, the size of the allocated buffer must be sufficiently larger than the size of the LLC. It splits the memory buffer into 64-byte cacheline objects.
• Second, it sets up a linked list of the cacheline objects in a random order, i.e., traversing the linked list causes jumps to remote cacheline objects.
• Third, it measures the elapsed time for traversing all cacheline objects and calculates the average latency to fetch a cacheline. In most cases, a CPU core stalls due to an LLC miss upon the traversal to the next cacheline object in the linked list. The elapsed time of this CPU stall is a memory latency.

When measuring the write-back latency, in addition to the second step, it updates the second 8 bytes of a cacheline object before jumping to the next cacheline object. The status of the cacheline in the LLC changes to modified, and the cacheline is written back to main memory later. Although a write-back operation is performed asynchronously, we can estimate the average latency of a memory access involving the write-back of a cacheline from the elapsed time to traverse all the cacheline objects.

† The micro-benchmark programs were also used in our prior studies.
Refer to [4] for more information.
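As a concrete illustration of the procedure above, the following is a simplified sketch of the pointer-chasing loop; it is not the exact benchmark code. For brevity, the buffer is obtained with aligned_alloc instead of mapping a physical DCPMM range through /dev/mem, and constants such as NUM_OBJS are placeholders.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define CACHELINE 64
#define NUM_OBJS  (4UL * 1024 * 1024)   /* 256 MB, well above the LLC size */

/* One cache-line-sized object; 'next' points to a randomly chosen object. */
struct obj {
    struct obj *next;
    uint64_t payload;
    uint8_t pad[CACHELINE - sizeof(struct obj *) - sizeof(uint64_t)];
};

int main(void)
{
    /* The real benchmark maps a physical DCPMM range via /dev/mem;
     * an anonymous buffer is used here for brevity. */
    struct obj *buf = aligned_alloc(CACHELINE, NUM_OBJS * sizeof(struct obj));
    size_t *order = malloc(NUM_OBJS * sizeof(size_t));
    if (!buf || !order) return 1;

    /* Link the objects in a random permutation so that every hop is a
     * dependent load to a distant cacheline, defeating prefetching and
     * out-of-order execution. */
    for (size_t i = 0; i < NUM_OBJS; i++) order[i] = i;
    for (size_t i = NUM_OBJS - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < NUM_OBJS; i++)
        buf[order[i]].next = &buf[order[(i + 1) % NUM_OBJS]];

    /* Traverse the list; each hop typically stalls on an LLC miss.
     * For the write-back measurement, the loop additionally updates
     * p->payload (the second 8 bytes of the cacheline) before the hop. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    struct obj *p = &buf[order[0]];
    for (size_t i = 0; i < NUM_OBJS; i++)
        p = p->next;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("average access latency: %.1f ns\n", ns / NUM_OBJS);
    return (int)((uintptr_t)p & 1);   /* keep 'p' live */
}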
Fig. 2: The overview of the micro-benchmark program to measure memory read/write latencies

Figure 3 summarizes the measured read/write latencies of DRAM and DCPMM, respectively. As the size of the allocated memory buffer increased, the read/write latencies of DRAM converged to approximately 95 ns. Although the write latencies were slightly higher for every tested buffer size, the differences between the read and write latencies were only 1-2 ns. On the other hand, the read latency of DCPMM was up to 374.1 ns, and the write latency was 391.2 ns. For read access, the latency of DCPMM was 400.1% of that of DRAM; for write access, it was 407.1%. Similarly to other NVM technologies, the write latency of a bare DCPMM module was larger than the read latency, as clearly shown in the result of the non-interleaved configuration: the latency of memory access involving write-back was 458.4 ns, which was 16.1% higher than that of read-only access (394.5 ns). Compared with the interleaved cases, the read and write latencies were degraded by 5.4% and 17.2%, respectively.

It should be noted that these measured latencies include the penalty caused by TLB (Translation Lookaside Buffer) misses. The page size in the experiments was 4 KB. Our measured DRAM latencies were slightly higher than the value reported by Intel Memory Latency Checker (MLC); Intel MLC v3.6 reported a DRAM latency of 82 ns. The method of random access in Intel MLC slightly differs from that of our micro-benchmark program. According to the documentation of Intel MLC v3.6, it performs random access within a 256-KB range of memory in order to mitigate TLB misses; after completing that range, it performs random access in the next 256-KB range. We consider that memory-intensive applications randomly accessing a wide range of memory will experience memory latencies close to our obtained results. Although it is out of the scope of this report, one could use a large page size such as 2 MB or 1 GB to mitigate TLB misses.

2.2 Read/Write Bandwidths

Our micro-benchmark program measuring the read/write bandwidths of main memory launches multiple concurrent worker processes that perform memory accesses. Each worker process allocates 1 GB of memory buffer from a target memory device. The memory buffer of a worker process does not overlap the memory buffer of another worker process. Each worker process sequentially scans its allocated buffer. We increased the number of worker processes up to the number of CPU cores of a NUMA domain.
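A simplified sketch of this bandwidth benchmark is shown below; again, it is an illustration rather than the exact code used. Each forked worker scans its own private buffer (allocated with malloc here instead of a /dev/mem mapping of a DCPMM region), and the per-worker rates are summed afterwards; the constants BUF_SIZE and N_WORKERS are placeholders.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/wait.h>
#include <time.h>

#define BUF_SIZE   (1UL << 30)   /* 1 GB private buffer per worker */
#define N_WORKERS  4             /* up to the core count of one NUMA domain */

/* Sequentially read the whole buffer; the checksum keeps the loads alive. */
static uint64_t scan_read(const uint64_t *buf, size_t n_words)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n_words; i++)
        sum += buf[i];
    return sum;
}

int main(void)
{
    for (int w = 0; w < N_WORKERS; w++) {
        if (fork() == 0) {
            /* Non-overlapping buffer per worker; the real benchmark maps a
             * distinct physical DCPMM range via /dev/mem instead. */
            uint64_t *buf = malloc(BUF_SIZE);
            if (!buf) _exit(1);
            memset(buf, w, BUF_SIZE);          /* touch every page up front */

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            uint64_t sum = scan_read(buf, BUF_SIZE / sizeof(uint64_t));
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double sec = (double)(t1.tv_sec - t0.tv_sec)
                       + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
            printf("worker %d: %.2f GB/s (checksum %llu)\n",
                   w, (double)BUF_SIZE / sec / 1e9, (unsigned long long)sum);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)   /* wait for all workers to finish */
        ;
    return 0;
}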
Figure 4 shows the read/write bandwidths of DRAM and DCPMM, respectively. As the number of concurrent worker processes increased for read-only memory access, the bandwidth of DRAM reached 101.3 GB/s at peak, whereas the bandwidth of DCPMM reached 37.6 GB/s. For memory access involving write-back, the bandwidth of DRAM was 37.4 GB/s at peak, and that of DCPMM was 2.9 GB/s. For read access, the throughput of DCPMM was 37.1% of that of DRAM; for write access, it was 7.8%. The difference between read and write bandwidths is larger in DCPMM: approximately 13 times in DCPMM, compared to 2.7 times in DRAM.

With the interleaving of DCPMM disabled, the observed peak bandwidths were degraded to approximately 1/6 (i.e., 6.4 GB/s for read-only access and 0.46 GB/s for write-back-involving access). The number of memory modules accessed simultaneously was only one (i.e., 1/6 of the interleaved configuration). Interestingly, as the number of concurrent worker processes increased, the throughput of read access decreased by approximately 50%. A possible reason is that the internal buffering mechanism of DCPMM does not work efficiently when the interleaving mechanism is disabled; its design is presumably optimized for interleaved memory accesses.

2.3 Summary and Discussion

Table 2 and Table 3 summarize the key results of our experiments. The advantage of DCPMM is the large capacity of a memory module (e.g., 128 GB, 256 GB, and 512 GB), which is an order of magnitude greater than that of DRAM (i.e., typically up to 32 GB). Its disadvantage is its modest read/write performance:

Latency:
• The read latency was approximately 374.1 ns, which was 400.1% of that of DRAM.
• The memory access latency involving write-back operations was approximately 391.2 ns, which was 407.1% of that of DRAM. Without interleaving, it was degraded to 458.4 ns.

Bandwidth:
• The read bandwidth of DCPMM was approximately 37.6 GB/s, which was 37.1% of that of DRAM.
• The memory access bandwidth involving write-back operations was approximately 2.9 GB/s, which was 7.8% of that of DRAM.

The obtained performance numbers complement prior work. To make the contribution of the paper clear within the page limit of the letter format, we discussed the comparison with prior work in the latter half of Section 1.
Fig. 3: The read and write latencies of DRAM and DCPMM: (a) DRAM (interleaved), (b) DCPMM (interleaved), (c) DCPMM (non-interleaved). In the graphs, the results of the read latency are marked as RO (read-only), and those of the write latency are marked as WB (write-back).

Fig. 4: The read/write memory bandwidths of DRAM and DCPMM: (a) DRAM (interleaved), (b) DCPMM (interleaved), (c) DCPMM (non-interleaved). In the graphs, the results of the read bandwidth are marked as RO (read-only), and those of the write bandwidth are marked as WB (write-back).

Table 2: The obtained performance numbers of interleaved DRAM and DCPMM
                        DRAM         DCPMM        Ratio
Latency    Read-only    93.5 ns      374.1 ns     400.1%
           Write-back   96.1 ns      391.2 ns     407.1%
Bandwidth  Read-only    101.3 GB/s   37.6 GB/s    37.1%
           Write-back   37.4 GB/s    2.9 GB/s     7.8%

Table 3: The obtained performance numbers of interleaved and non-interleaved DCPMM
                        Interleaved   Non-Interleaved   Ratio
Latency    Read-only    374.1 ns      394.5 ns          105.5%
           Write-back   391.2 ns      458.4 ns          117.2%
Bandwidth  Read-only    37.6 GB/s     6.4 GB/s          17.0%
           Write-back   2.9 GB/s      0.46 GB/s         15.9%
3. Conclusion
In order to complement prior performance reports on Intel Optane DCPMM, we conducted experiments using our own measurement tools. We observed that the latency of random read-only access was approximately 374 ns, and that of random write-back-involving access was 391 ns. The bandwidths of read-only and write-back-involving access for interleaved memory modules were approximately 38 GB/s and 3 GB/s, respectively.

Many applications (especially large-scale HPC and AI workloads) will benefit from the large main memory capacity made possible by DCPMM. However, the substantial performance gap between DCPMM and DRAM poses new challenges for system software studies. We are currently conducting experiments using application programs and will report details in a future publication.
Acknowledgment
We would like to acknowledge the support of Intel Corporation. We also thank Dr. Jason Haga and other colleagues for their invaluable feedback.