The Preliminary Evaluation of a Hypervisor-based Virtualization Mechanism for Intel Optane DC Persistent Memory Module
Takahiro Hirofuchi and Ryousei Takano
Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST)
E-mail: [email protected]
Abstract
Non-volatile memory (NVM) technologies, being accessible in the same manner as DRAM, are considered indispensable for expanding main memory capacities. Intel Optane DCPMM is a long-awaited product that drastically increases main memory capacities. However, a substantial performance gap exists between DRAM and DCPMM. In our experiments, the read/write latencies of DCPMM were 400% and 407% of those of DRAM, respectively; the read/write bandwidths were 37% and 8% of those of DRAM. This performance gap in main memory presents a new challenge to researchers: we need a new system software technology supporting the emerging hybrid memory architecture. In this paper, we present RAMinate, a hypervisor-based virtualization mechanism for hybrid memory systems and a key technology to address the performance gap in main memory systems. It provides great flexibility in memory management and maximizes the performance of virtual machines (VMs) by dynamically optimizing memory mappings. Through experiments, we confirmed that even though a VM has only 1% of DRAM in its RAM, the performance degradation of the VM is drastically alleviated by memory mapping optimization. The elapsed time to finish the build of Linux Kernel in the VM was 557 seconds, only a 13% increase from the 100% DRAM case (i.e., 495 seconds). When the optimization mechanism was disabled, the elapsed time increased to 624 seconds (i.e., a 26% increase from the 100% DRAM case).
Keywords:
Non-volatile Memory, NVM, Optane DC PM, DCPMM, RAMinate, Hypervisor, Qemu/KVM, Hybrid Memory, Virtualization
Introduction

Data centers using virtualization technologies obtain great benefit from increased main memory capacities. Larger memory capacities of physical machines (PMs) will allow IaaS service providers to consolidate more virtual machines (VMs) onto a single PM, increasing the competitiveness of their commercial services. Since DRAM technology is unlikely to meet this growing memory demand, non-volatile memory (NVM) technologies, being accessible in the same manner as DRAM, are considered indispensable for expanding main memory capacities.

RAMinate [1, 2] is our hypervisor-based virtualization mechanism for the use of such byte-addressable NVM devices in the main memory of a computer. It creates a unified RAM for a VM from DRAM and NVM, and dynamically optimizes its memory mappings so as to place hot memory pages in the faster memory device (i.e., DRAM in the current system). It provides great flexibility in the management of hybrid memory systems. To the best of our knowledge, RAMinate is the first virtualization mechanism implemented in the hypervisor layer.

In April 2019, Intel officially released the first commercially-available, byte-addressable NVM technology, Intel Optane Data Center Persistent Memory Module (DCPMM). DCPMM is a long-awaited product that drastically increases main memory capacities. However, there is a substantial performance gap between DRAM and DCPMM, which necessitates novel system software technologies that maximize the performance of hybrid memory systems.

In this paper, we present the preliminary evaluation of hypervisor-based virtualization for the first byte-addressable NVM. We consider RAMinate to be the key technology to address the performance gap between DRAM and NVM in the main memory of a computer. We first clarify the performance characteristics of DCPMM by using micro-benchmark programs.
Tab. 1: The overview of the test machine used in experiments
CPU: Intel Xeon Platinum 8260L 2.40 GHz (Cascade Lake), 2 processors
Cache: L1d 32 KB, L1i 32 KB, L2 1024 KB, L3 36 MB
DRAM: DDR4 DRAM 16 GB, 2666 MT/s, 12 slots
DCPMM: DDR-T 128 GB, 2666 MT/s, 12 slots
OS: Linux Kernel 4.19.16 (extended for RAMinate)
Fig. 1: The memory configuration of the tested machine

Next, we applied RAMinate to the hybrid memory system composed of DRAM and DCPMM, and evaluated its feasibility through a preliminary workload. Section 2 briefly explains the overview of DCPMM and RAMinate. It also discusses the comparison between DCPMM's Memory Mode and RAMinate. Section 3 presents the performance characteristics of DCPMM and the preliminary evaluation of RAMinate for DCPMM. Section 4 concludes the paper.
Background

Intel Optane DCPMM is an NVM module connected to the DIMM interface of a computer. The capacity of a DCPMM is larger than that of ordinary DRAM modules; each DCPMM module has 128 GB in our tested machine. Its memory cell is based on the 3D XPoint technology, which is a stacked array of resistive memory elements. Although the details are not disclosed, wear-leveling and buffering are supposed to be performed inside the memory module. In this report, we focus on the increased capacity of main memory enabled by DCPMM rather than on its data persistency.

Table 1 summarizes the specification of the tested machine. Figure 1 shows its memory configuration. The machine is equipped with 2 CPU sockets. A CPU processor has 24 physical CPU cores and 2 memory controllers. A memory controller has 3 memory channels. Each memory channel has a DDR4 DRAM module (16 GB) and a DCPMM (128 GB). The total DRAM size of the machine is 192 GB. The total DCPMM size is 1536 GB.

The Intel CPU processors supporting DCPMM allow users to configure how DCPMM is incorporated into the main memory of a computer. In Memory Mode, DRAM works as a cache for the DCPMMs. From the viewpoint of the operating system running on the machine, the total size of the main memory is the sum of the DCPMMs. The memory controller caches read/written data in DRAM, thus reducing the performance overhead caused by slow memory access to DCPMM. This mechanism is implemented in the hardware layer; no modification is necessary to the software layer. In App Direct Mode, the memory controller maps both DRAM and DCPMM to the physical memory address space of the machine. It is the responsibility of the software layer to manage the memory space of DCPMM. Typically, programming libraries and persistent memory file systems are used to take advantage of DCPMM. Our hypervisor-based virtualization mechanism for hybrid memory systems also works in App Direct Mode.
RAMinate [1] is a hypervisor-based virtualization mechanism for hybrid main memory composed of DRAM and byte-addressable NVM. To our knowledge, RAMinate is the first work implementing such virtualization in the hypervisor layer. In contrast to past studies, our mechanism works at the hypervisor, not at the hardware or operating system level. It does not require any special program at the operating system level, nor any design changes to the current memory controller at the hardware level.
Fig. 2: The overview of RAMinate
Tab. 2: Comparison between DCPMM's Memory Mode and RAMinate

Where the mechanism is implemented:
• Memory Mode: hardware (i.e., the memory controller).
• RAMinate: software (i.e., the hypervisor).

How the virtual memory space is created:
• Memory Mode: system-wide. The main memory of a physical machine is extended to the size of the DCPMM space assigned to Memory Mode.
• RAMinate: per virtual machine. The main memory of each virtual machine is created by using DRAM and NVM.

Flexibility in the mixed ratio of DRAM and NVM:
• Memory Mode: the mixed ratio is a system-wide configuration; once changed, a reboot is necessary.
• RAMinate: any mixed ratio is possible for each VM; the mixed ratio can be changed dynamically without rebooting the VM.
RAMinate was originally presented at the Seventh ACM Symposium on Cloud Computing (ACM SoCC 2016). Our paper [2] received the best paper award at the symposium. In the paper, we assumed that STT-MRAM (Spin Transfer Torque Magnetoresistive RAM) would be used as a part of main memory. Since at the time of ACM SoCC 2016 there were no byte-addressable NVM modules on the market, we evaluated its feasibility by dedicating a part of DRAM to a pseudo STT-MRAM region. The design of RAMinate is basically independent of the type of NVM device; it works for any byte-addressable NVM (including DCPMM) without any modification.

Figure 2 illustrates the overview of RAMinate. It creates a VM and allocates the main memory of the VM from DRAM and byte-addressable NVM. From the viewpoint of the guest operating system, the guest main memory appears like normal memory. Under the hood, each memory page of the guest main memory is mapped to DRAM or NVM. RAMinate periodically monitors memory access of the VM and determines which memory pages are being intensively accessed. It dynamically optimizes memory mappings in order to place hot memory pages in faster memory (i.e., DRAM in the current system). RAMinate maximizes the performance of a VM even with a small amount of DRAM mapped to its main memory. For design details, please refer to our paper [2].

Table 2 summarizes the comparison between Memory Mode of DCPMM and RAMinate. Both mechanisms provide virtualization for hybrid memory systems. Since DCPMM's Memory Mode is implemented in a memory controller, the main memory of a physical machine is extended to the size of the DCPMM space assigned to Memory Mode. Although it is possible to change the percentage of the DCPMM space assigned to Memory Mode, this configuration is system-wide; rebooting the physical machine is necessary to apply a new configuration. On the other hand, RAMinate, implemented in the hypervisor layer, provides greater flexibility.
For each VM, it is possible to specify any mixed ratio of DRAM and NVM. For example, we can assign a large ratio of DRAM to a VM running a memory-intensive database server, and reduce the DRAM ratio of a VM running batch jobs that do not have tight requirements on service latencies. An IaaS provider (e.g., Amazon EC2) can thus increase the density of server consolidation without sacrificing the performance of workloads. As shown below, even when only a small amount of DRAM is assigned to a VM, RAMinate is capable of drastically alleviating performance degradation by dynamically optimizing memory mappings.
Evaluation

Fig. 3: The overview of the micro-benchmark program to measure memory read/write latencies
We first conducted preliminary experiments to measure the basic performance of DCPMM, which is supposed to differ from that of DRAM; we made the difference clear through our micro-benchmark programs. Next, we set up RAMinate for DCPMM to examine its feasibility for the first byte-addressable NVM.

We assigned all the DCPMMs to App Direct Mode. The host operating system leaves the DCPMMs intact. The benchmark programs directly accessed the physical memory ranges of the DCPMMs via the device file of Linux (/dev/mem). Although the operating system recognized two NUMA domains (i.e., those of CPU sockets 0 and 1, respectively), we used the CPU cores and memory modules only in the first NUMA domain.

The interleaving mechanisms of DRAM and DCPMM were enabled, respectively. For DCPMM, the interleaving configuration of App Direct Mode was used unless otherwise noted. The 6 DCPMMs connected to each NUMA domain were logically combined, and the memory controller spread memory accesses evenly over the memory modules. For DRAM, controller interleaving (i.e., iMC interleaving) was enabled in the BIOS settings. Similarly, the 6 DRAM modules connected to each NUMA domain were logically combined. In order to simplify system behavior, we disabled the hyper-threading mechanism of the CPUs. Transparent huge pages and address randomization were also disabled in the settings of Linux Kernel.
We developed micro-benchmark programs that measure the read/write access latencies and bandwidths of physical memory. To measure read performance, the micro-benchmark programs induce Last Level Cache (LLC) misses that result in data fetches from memory modules. For write performance, the programs cause the evictions of modified cachelines as well.
Figure 3 illustrates the overview of the micro-benchmark program to measure memory read/write latencies. Most CPU architectures perform memory prefetching and out-of-order execution to hide memory latencies from programs running on CPU cores. To measure latencies precisely, the benchmark program was carefully designed to suppress these effects. To measure the read latency of main memory, it works as follows:

• First, it allocates a certain amount of memory buffer from a target memory device. To induce LLC misses, the size of the allocated buffer should be sufficiently larger than the size of the LLC. It splits the memory buffer into 64-byte cacheline objects.

• Second, it sets up a linked list of the cacheline objects in a random order, i.e., traversing the linked list causes jumps to remote cacheline objects.

• Third, it measures the elapsed time for traversing all cacheline objects and calculates the average latency to fetch a cacheline. In most cases, a CPU core stalls due to an LLC miss upon the traversal of the next cacheline object in the linked list. The elapsed time of this CPU stall is a memory latency.

When measuring the write-back latency, in addition to the second step, it updates the second 8 bytes of a cacheline object before jumping to the next cacheline object. The status of the cacheline in the LLC changes to modified, and the cacheline is written back to main memory later. Although a write-back operation is performed asynchronously, we can estimate the average latency of a memory access involving the write-back of a cacheline from the elapsed time to traverse all the cacheline objects.

Figure 4 summarizes the measured results of the read/write latencies of DRAM and DCPMM, respectively. As the size of the allocated memory buffer increased, the read/write latencies of DRAM reached approximately 95 ns, respectively.
Fig. 4: The read and write latencies of DRAM and DCPMM: (a) DRAM (interleaved), (b) DCPMM (interleaved), (c) DCPMM (non-interleaved). In the graphs, the results of the read latency are marked as RO (read-only), and those of the write latency are marked as WB (write-back).

Although write latencies were slightly higher for all tested buffer sizes, the differences in read/write latencies were only 1-2 ns. On the other hand, the read latency of DCPMM was up to 374.1 ns, and the write latency was 391.2 ns. For read access, the latency of DCPMM was 400.1% of that of DRAM; for write access, it was 407.1%. Similarly to other NVM technologies, the write latency of a bare DCPMM module was larger than the read latency, as clearly shown in the result of the non-interleaved configuration: the latency of memory access involving write-back was 458.4 ns, which was 16.1% higher than that of read-only access (394.5 ns).

It should be noted that these measured latencies include the penalty caused by TLB (Translation Lookaside Buffer) misses. The page size in the experiments was 4 KB. Our measured latencies of DRAM were slightly higher than the value that Intel Memory Latency Checker (MLC) reported; Intel MLC v3.6 reported that the DRAM latency was 82 ns. The method of random access in Intel MLC slightly differs from that of our micro-benchmark program. According to the documentation of Intel MLC v3.6, it performs random access within a 256-KB range of memory in order to mitigate TLB misses; after completing that range, it performs random access in the next 256-KB range. We consider that memory-intensive applications randomly accessing a wide range of memory will experience memory latencies close to our obtained results. Although it is out of the scope of this report, one could use a large page size such as 2 MB or 1 GB to mitigate TLB misses.
Our micro-benchmark program measuring the read/write bandwidths of main memory launches a number of concurrent worker processes to perform memory access. Each worker process allocates 1 GB of memory buffer from a target memory device.
Fig. 5: The read/write memory bandwidths of DRAM and DCPMM: (a) DRAM (interleaved), (b) DCPMM (interleaved), (c) DCPMM (non-interleaved). In the graphs, the results of the read bandwidth are marked as RO (read-only), and those of the write bandwidth are marked as WB (write-back).

The memory buffer of a worker process does not overlap that of another worker process. Each worker process sequentially scans its allocated buffer. We increased the number of worker processes up to the number of CPU cores of a NUMA domain.

Figure 5 shows the read/write bandwidths of DRAM and DCPMM, respectively. As the number of concurrent worker processes increased for read-only memory access, the bandwidth of DRAM reached 101.3 GB/s at peak; on the other hand, the bandwidth of DCPMM was 37.6 GB/s. For memory access involving write-back, the bandwidth of DRAM was 37.4 GB/s at peak, and that of DCPMM was 2.9 GB/s. With the interleaving of DCPMM disabled, the observed peak bandwidths degraded to approximately 1/6 (i.e., 6.4 GB/s for read-only access, and 0.46 GB/s for write-back-involving access), since the number of memory modules being simultaneously accessed was only one (i.e., 1/6 of the interleaved configuration).

For read access, the throughput of DCPMM was 37.1% of that of DRAM. For write access, it was 7.8%. The difference between read and write bandwidths is larger in DCPMM: approximately 13 times in DCPMM, versus 2.7 times in DRAM.
Table 3 summarizes the results of our experiments investigating the performance characteristics of DCPMM. Although DCPMM provides a large capacity of main memory, memory access to a DCPMM region is slower than that to DRAM.

Tab. 3: The summary of the performance characteristics of interleaved DRAM and DCPMM in our experiments
Latency, Read-only: DRAM 93.5 ns, DCPMM 374.1 ns, Ratio 400.1%
Latency, Write-back: DRAM 96.1 ns, DCPMM 391.2 ns, Ratio 407.1%
Bandwidth, Read-only: DRAM 101.3 GB/s, DCPMM 37.6 GB/s, Ratio 37.1%
Bandwidth, Write-back: DRAM 37.4 GB/s, DCPMM 2.9 GB/s, Ratio 7.8%

Latency:
• The read latency was approximately 374.1 ns, which was 400.1% of that of DRAM.
• The memory access latency involving write-back operations was approximately 391.2 ns, which was 407.1% of that of DRAM. As observed in Section 3.1.1, without interleaving, it increased to approximately 458.4 ns.

Bandwidth:
• The read bandwidth of DCPMM was approximately 37.6 GB/s, which was 37.1% of that of DRAM.
• The memory access bandwidth involving write-back operations was approximately 2.9 GB/s, which was 7.8% of that of DRAM.

Fig. 6: An example command line of RAMinate to create a VM with 4 GB RAM. The mixed ratio of DRAM is 1% (40 MB).
RAMinate is the key technology to address the performance gap between DRAM and NVM in the main memory of a computer. As confirmed above, there is a substantial performance gap between DRAM and DCPMM.

To confirm its feasibility for DCPMM, we applied RAMinate to the hybrid main memory of the tested machine. We created a VM with 4 GB RAM composed of 40 MB DRAM and 4056 MB DCPMM (i.e., the mixed ratio of DRAM is 1%). To carefully examine how RAMinate spreads memory traffic over both types of memory devices, we needed to eliminate the impact of memory access by the host operating system. We therefore reserved the memory modules of Memory Controller 1 of CPU Socket 0 for RAMinate. We disabled the interleaving of App Direct Mode for DCPMM and assigned a DCPMM region in the DCPMMs of Channels 3-5 to the VM. We also disabled the interleaving of the DRAM memory controllers and assigned a DRAM region in the DRAM modules of Channels 3-5 to the VM. Figure 6 shows the command line of RAMinate to create the VM. The physical address of is the start address of a DCPMM region assigned to the VM, which is mapped to offset of the guest physical address of the VM. This DCPMM region is included in the first DCPMM of Memory Controller 1 of CPU Socket 0. The physical address of is the start address of the assigned DRAM region, which is included in the DRAM modules of the same memory controller. This DRAM region is mapped to offset (i.e., at 4056 MB) of the guest physical address.

The appropriate percentage of DRAM in the RAM of a VM depends on applications. In this preliminary report, we set it to a bare minimum value, 1%, to demonstrate the advantage of memory mapping optimization by RAMinate. As an example application, we built Linux Kernel in the guest operating system of the VM.

During the experiments, we monitored the read/write traffic of DRAM and DCPMM, respectively. The DRAM traffic of the VM was measured by the performance counters of the memory controller.
The DCPMM traffic of the VM was obtained through a utility command of DCPMM (i.e., ipmctl).

Figure 7 shows the read/write traffic of DRAM and DCPMM assigned to the VM. Just after the kernel build started, most memory traffic was generated in the DCPMM region. However, once RAMinate optimized the locations of hot memory pages, the memory traffic of DCPMM was reduced to 50%. Please note that the mixed ratio of DRAM is only 1%.

RAMinate detected hot guest physical pages and moved them to the DRAM region. It also moved cold guest physical pages to the DCPMM region. In this experiment, RAMinate was configured to swap up to a maximum number of pages in each optimization round.

Fig. 7: The read/write traffic of DRAM and DCPMM created by the VM: (a) read, (b) write. Kernel build started in the guest operating system at 40 seconds.

Fig. 8: The numbers of page relocations performed by RAMinate during kernel build in the guest operating system.
Conclusion

Intel Optane DCPMM is the first commercially-available, byte-addressable NVM module connected to the DIMM interface of a computer. While DCPMM drastically increases the capacity of main memory, the performance characteristics of DCPMM are substantially different from those of DRAM. In our experiments, we observed that the read/write latencies of DCPMM were 400% and 407% of those of DRAM, respectively, and that the read/write bandwidths were 37% and 8% of those of DRAM. Therefore, we believe that our hypervisor-based virtualization mechanism for hybrid main memory systems, RAMinate, is the key technology to address this performance gap. In this express report, we confirmed that RAMinate successfully worked for the first byte-addressable NVM. It improved the performance of a workload by dynamically optimizing memory mappings, placing hot memory pages in the region of faster memory (i.e., DRAM in the current system). Even though the VM had only 1% of DRAM in its RAM, the performance degradation of the VM was drastically alleviated. The elapsed time to finish the build of Linux Kernel was 557 seconds, which was only a 13% increase from the 100% DRAM case (i.e., 495 seconds). When the optimization mechanism was disabled, the elapsed time increased to 624 seconds (i.e., a 26% increase from the 100% DRAM case). We are conducting
additional experiments to thoroughly examine the feasibility of RAMinate under various conditions with DCPMM. Further details will be reported in future publications.

Tab. 4: The comparison of the elapsed times to finish the build of Linux Kernel. The mixed ratio of DRAM and DCPMM in the 4 GB RAM of the VM was changed, and the memory mapping optimization mechanism was enabled/disabled.
DRAM 100% (optimization not applicable): 495 s
DRAM 1% and DCPMM 99%, optimization enabled: 557 s
DRAM 1% and DCPMM 99%, optimization disabled: 624 s
DCPMM 100% (optimization not applicable): 633 s

Fig. 9: The read/write traffic of DRAM and DCPMM created by the VM: (a) read, (b) write. Kernel build started in the guest operating system at 40 seconds. Since the optimization of memory mappings was disabled during this experiment, little DRAM traffic was observed.

We would like to acknowledge the support of Intel Corporation. We also thank Dr. Jason Haga and other colleagues for their invaluable feedback.
References

[1] Takahiro Hirofuchi. Hypervisor-based virtualization for hybrid main memory systems. https://github.com/takahiro-hirofuchi/raminate

[2] Takahiro Hirofuchi and Ryousei Takano. RAMinate: Hypervisor-based virtualization for hybrid main memory systems. In Proceedings of the Seventh ACM Symposium on Cloud Computing (ACM SoCC 2016).