Analysis of Interference between RDMA and Local Access on Hybrid Memory System
Kazuichi Oe
National Institute of Informatics (NII) [email protected]
ABSTRACT
We can now use a hybrid memory system consisting of DRAM and Intel® Optane™ DC Persistent Memory (called "DCPM" in this paper), as DCPM has been commercially available since April 2019. Although the latency of DCPM is several times higher than that of DRAM, its capacity is several times larger and its cost per gigabyte is several times lower. In addition, DCPM is non-volatile. A server with this hybrid memory system can improve the performance of in-memory database systems and virtual machine (VM) systems because these systems often consume a large amount of memory. Moreover, a high-speed shared storage system can be implemented by accessing DCPM via remote direct memory access (RDMA). I assume that part of the DCPM is often assigned as an area shared with other remote servers because applications executed on a server with a hybrid memory system often cannot use the entire capacity of DCPM. This paper evaluates the interference between local memory access and RDMA from a remote server. The results indicate that the interference on this hybrid memory system is significantly different from that on a conventional DRAM-only memory system. I also believe that some kind of throttling implementation is needed when this interference occurs.
1. INTRODUCTION
In recent years, many kinds of persistent memory (PM) [2] have been under research and development, and some achievements have been merged into products for solid state drives (SSDs) and dual inline memory modules (DIMMs). Intel made Intel® Optane™ DC Persistent Memory (called "DCPM" in this paper) [7] commercially available in April 2019. DCPM is connected to a computer system via a DIMM slot and can be used not only as memory but also as storage [8, 14]. DCPM is byte-addressable, its latency is two to five times higher than that of DRAM [11, 16], and its capacity is up to 3 TB per CPU socket. For example, a server system with a two-socket CPU can provide up to 6 TB of DCPM.

DCPM must be mounted together with DRAM. Its capacity is larger than that of DRAM, and it is non-volatile. A server with DCPM is often used to operate an in-memory database system or virtual machines (VMs) because these applications need a large amount of memory capacity. In particular, there has been much research [1, 12, 15] on in-memory database systems using this hybrid memory system. In the VM field, a server can operate a larger number of VMs by using the hybrid memory system. However, I think there are many use cases in which a server with this hybrid memory system does not consume the entire DCPM capacity when executing applications. In these cases, we can operate the unused DCPM capacity as shared memory or storage for non-DCPM servers. A non-DCPM server can perform high-throughput, low-latency communication by using remote direct memory access (RDMA), which is supported by InfiniBand™ and similar interconnects. In short, this hybrid memory system might be accessed by local applications and remote applications simultaneously.

Imamura et al. [6] reported that interference on this hybrid memory system is significantly different from that on a conventional DRAM-only memory system when several applications are executed on the same server simultaneously. Therefore, I assume that similar interference will occur when the hybrid memory system is accessed by local applications and remote applications simultaneously.

In this paper, I evaluated the interference when DRAM access from a local server and DCPM access from a remote non-DCPM server were executed simultaneously. I used the Intel® Memory Latency Checker (called "MLC" in this paper) [10] as the DRAM access application and an extended version of ib_write_bw, one of the InfiniBand Verbs Performance Tests [13], as the DCPM access application. I call this extended tool "ib_write_bw+". The results indicate that the interference on this hybrid memory system is significantly different from that on a conventional DRAM-only memory system. I moreover propose that some kind of throttling technique for DCPM access from a remote non-DCPM server is needed when this interference occurs.
2. BACKGROUND
2.1 What is persistent memory (PM)
Much research has been done in the persistent memory (PM) field. Compared with DRAM, its strong points are that it is low in cost and has a large capacity, and its weak point is that its write latency is high: about two to five times higher than that of DRAM [2]. Consumers can currently use PM because Intel released Intel® Optane™ DC Persistent Memory (DCPM) in April 2019. Compared with DRAM, DCPM is byte-addressable and non-volatile, has about four times larger capacity, has two to five times higher write latency, and costs several times less per gigabyte [6]. Consumers will be able to use larger-capacity DCPM in the near future because Intel® plans to update the current DCPM.

Figure 1: Intel CPU configuration with DCPM

Figure 1 shows the configuration of a CPU with DCPM. The CPU cores, memory controller (MC), and PCI Express (PCIe) are connected by an interconnect. The MC has multiple channels (CHs), and each CH is connected to both DRAM and DCPM. DCPM must be connected to a CH together with DRAM. When an application accesses the DCPM area by InfiniBand RDMA, the access path for RDMA is from PCIe to the MC via the interconnect. The path does not include the CPU's last-level cache (LLC).
A hybrid memory system consisting of DRAM and DCPM must be set to either memory mode or app direct mode [8]. Memory mode treats DCPM as volatile memory, with DRAM acting as a cache for DCPM; the cache control mechanism is implemented in the MC. App direct mode treats DCPM as non-volatile memory. Linux supports three access methods for DCPM: block device, filesystem dax, and device dax [3]. With the block device access method, traditional filesystems can be executed on DCPM, but they cannot exploit the maximum performance of DCPM because of block-unit access. Filesystem dax maps the DCPM area directly into an application's address space through a dax-supported filesystem. Device dax maps the DCPM area directly into an application's address space through the device dax driver. Device dax is the best method for getting the most out of DCPM's performance.

Linux also has a patch that treats DCPM as normal RAM [4]. The patch can be used to mount the entire DCPM area as a NUMA node via the device dax method, and an application can then allocate memory from the DCPM area by using the Linux numactl command.
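As a concrete illustration of the device dax method, the following is a minimal sketch of mapping a DCPM area into an application's address space. The device name /dev/dax0.0 and the mapping size are assumptions for illustration; real code should first query the device's size and alignment.

/* Sketch: map a DCPM area directly into the application's address space
 * through the device dax driver. The device name /dev/dax0.0 and the
 * mapping size are assumptions for illustration. */
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t map_size = 1UL << 30;          /* assumed 1 GB mapping */

    int fd = open("/dev/dax0.0", O_RDWR);       /* assumed device dax node */
    if (fd < 0) {
        perror("open /dev/dax0.0");
        return 1;
    }

    /* MAP_SHARED gives direct load/store access to DCPM without page-cache copies. */
    void *pmem = mmap(NULL, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pmem == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    ((volatile char *)pmem)[0] = 1;             /* byte-addressable store to DCPM */

    munmap(pmem, map_size);
    close(fd);
    return 0;
}

With the NUMA node patch described above, the same placement effect can instead be obtained by allocating from the DCPM-backed NUMA node, for example via numactl or libnuma.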
3. EVALUATION
3.1 Overview
Figure 2: Evaluation system
I wanted to clarify the interference when DRAM or DCPM access from a local server and DCPM access from a remote non-DCPM server are executed simultaneously. The application for the local server was the Intel® Memory Latency Checker (MLC), and the application for the remote non-DCPM server was the extended ib_write_bw, called "ib_write_bw+" as mentioned above. The evaluation considered MLC-only performance, ib_write_bw+-only performance, and the performance when executing MLC and ib_write_bw+ simultaneously. The evaluation also used the Platform Profiler feature of Intel® VTune Amplifier 2019 [9] to clarify the internal throughput and latency for both DCPM and DRAM.
Figure 2 shows the evaluation system. A server with DCPM and a server without DCPM were connected by two InfiniBand paths. The server with DCPM consisted of two 16-core Xeon Gold 5218 (Cascade Lake) CPUs, 192 GB of DRAM, and 812 GB of DCPM (256 GB x 6). I set the six DCPM DIMMs as one DCPM area by using interleaved app direct mode. The server was also installed with the NUMA node patch described in Section 2. NUMA nodes 0 and 1 were mapped to DRAM, and NUMA nodes 2 and 3 were mapped to DCPM. SUSE Linux Enterprise 15 (kernel 5.0.0-rc1-25.25) was installed on this server. The server without DCPM consisted of two 16-core Xeon E5-2650L (Sandy Bridge) CPUs and 32 GB of DRAM, running Fedora 30 (kernel 5.1.12-300.fc30). Two InfiniBand Host Channel Adapters (HCAs) were installed on each server, and the servers were connected directly by InfiniBand. Each HCA's bandwidth is 100 Gbps per direction, so the total bandwidth is 200 Gbps per direction.
In this paper, MLC version 3.7 was used, and the hybrid memory system was evaluated by using its loaded latency mode. The read/write option was R (read only), W2 (2:1 read-write ratio), or W5 (1:1 read-write ratio). The buffer size was 4 GB, so the effect of the CPU cache can be ignored. The offset setting was random (rand) or sequential (seq), and the NUMA node was 0 (DRAM) or 2 (DCPM).
I downloaded and investigated the source code of InfiniBand Verbs Performance Tests 3.0 (March 2015). In particular, I carefully investigated the source code of ib_write_bw, which contains the RDMA test. ib_write_bw repeatedly executes RDMA with the same source and destination address, and no tool existed for a read/write mixed RDMA test. Most real applications using RDMA access various source/destination addresses, and their operations are mostly read/write mixed. I therefore added the following features to ib_write_bw (a sketch of the offset-update scheme is shown after this list):

• Setting any size for the RDMA buffer (in this evaluation, 10 MB was set).
• Choosing one of three operations: Read only, Write only, or Read/Write mixed (1:1).
• Choosing one of two RDMA offset updates: random or sequential. The offset is updated within the RDMA buffer.

In the evaluation, ten-multiplex ib_write_bw+ execution was used (see Figure 2). I decided on this in a preliminary experiment so as to create sufficient memory access to DCPM.
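The following is a minimal sketch of how such an offset-updating RDMA post could look with the ibverbs API. It is not the actual ib_write_bw+ implementation: the queue pair, memory region, and remote buffer parameters (qp, mr, remote_addr, rkey, buf_size) are assumed to have been set up by the usual perftest connection path, and the helper name post_rdma_with_offset is hypothetical.

/* Hypothetical sketch of the offset-update logic added in ib_write_bw+.
 * qp, mr, remote_addr, rkey, and buf_size are assumed to be already
 * established by the normal perftest connection setup. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>

static int post_rdma_with_offset(struct ibv_qp *qp, struct ibv_mr *mr,
                                 uint64_t remote_addr, uint32_t rkey,
                                 size_t buf_size, size_t msg_size,
                                 int do_write, int random_offset,
                                 size_t *next_offset)
{
    /* Pick the offset inside the (e.g. 10 MB) RDMA buffer: either a random
     * position or the next sequential position, wrapping at the buffer end. */
    size_t off;
    if (random_offset) {
        off = (rand() % (buf_size / msg_size)) * msg_size;
    } else {
        off = *next_offset;
        *next_offset = (off + msg_size) % buf_size;
    }

    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr + off,
        .length = (uint32_t)msg_size,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = do_write ? IBV_WR_RDMA_WRITE : IBV_WR_RDMA_READ,
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr + off;  /* same offset on the remote side */
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);      /* 0 on success */
}

/* A 1:1 Read/Write mixed pattern can then alternate do_write between calls. */

In this sketch the random and sequential modes differ only in how the offset is chosen, so the same posting path exercises both access patterns used in the evaluation.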
To understand the performance without interference, both ib_write_bw+-only and MLC-only executions were evaluated. Then, the performance when co-executing ib_write_bw+ and MLC was evaluated. The performance degradation can be understood by comparing the interference performance with the standalone performance. Moreover, to find which conditions increase the interference, both ib_write_bw+ and MLC were executed simultaneously while changing their parameters.
The ib_write_bw+ server was executed on the server with DCPM, and the ib_write_bw+ client was executed on the server without DCPM. The RDMA buffer for the server was placed on DRAM in one configuration and on DCPM in the other. In order to generate enough IO traffic, ten-multiplex execution was done. Figure 3 shows the results for a random RDMA offset, and Figure 4 shows those for a sequential RDMA offset. The RDMA size was changed from 2 KB to 64 KB, and the RDMA operation was Read, Write, or Read/Write mixed.

First, the results for the random RDMA offset are discussed. When executing RDMA Read, the throughput was nearly the same even when the server's RDMA buffer was changed from DRAM to DCPM. In particular, the throughput was almost identical when the RDMA size was more than 8 KB. When executing RDMA Write and RDMA Read/Write mixed, the throughput with the server's RDMA buffer on DRAM was three times higher than that with the server's RDMA buffer on DCPM. This is because of the higher write latency of DCPM.

Figure 3: RDMA random throughput (MB/sec)

Second, the results for the sequential RDMA offset are discussed. The results for RDMA Read were similar to those for the random RDMA offset. However, the results for RDMA Write and Read/Write mixed were different from those for the random offset. When executing RDMA Write, the difference between the DRAM and DCPM throughput decreased as a bigger RDMA size was used, and both throughputs matched at an RDMA size of 64 KB. This may be an effect of the write buffer in the MC. When executing RDMA Read/Write mixed, both throughputs reached 30 GB/sec at an RDMA size of 64 KB. This is because the total bidirectional bandwidth reached 400 Gbps.

Figure 4: RDMA sequential throughput (MB/sec)
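To make the server-side buffer placement described above concrete, the following sketch shows how the RDMA buffer could be allocated on either the DRAM NUMA node (0) or the DCPM NUMA node (2) and then registered for remote access. This is an illustrative sketch, not the actual ib_write_bw+ code; the protection domain pd is assumed to exist already, and the node numbers follow the evaluation system in Section 3.

/* Sketch: place the server-side RDMA buffer on DRAM (node 0) or DCPM (node 2)
 * and register it so the remote client can issue RDMA reads/writes.
 * pd is an already-created protection domain; the node numbers are assumptions
 * matching the evaluation system. Build with -libverbs -lnuma. */
#include <infiniband/verbs.h>
#include <numa.h>
#include <stddef.h>

struct ibv_mr *alloc_rdma_buffer(struct ibv_pd *pd, size_t size, int use_dcpm)
{
    int node = use_dcpm ? 2 : 0;                 /* 2 = DCPM, 0 = DRAM in this setup */

    void *buf = numa_alloc_onnode(size, node);   /* bind pages to the chosen node */
    if (buf == NULL)
        return NULL;

    /* Register the buffer; remote read/write access lets the non-DCPM server
     * drive the traffic entirely by RDMA. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (mr == NULL)
        numa_free(buf, size);
    return mr;
}

Because the registration step is identical for both placements, switching the buffer between DRAM and DCPM changes only the page binding, which is what makes the two configurations directly comparable.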
Table 1 shows the MLC-only results for the configurations described in Section 3.1. MLC was executed with one thread. The results indicate that the MLC throughputs with DRAM were three to four times higher than those with DCPM.
Table 1: Throughput for MLC only (MB/sec)

            R      W2     W5
rand+DCPM   2632   2592   2326
seq+DCPM    4042   5247   5715
rand+DRAM   6733   8576   7758
seq+DRAM    12414  17213  19943

Figure 5: MLC results when executing ib_write_bw+ (4 KB, random offset, RW mixed) (MB/sec)

Figure 5 shows the MLC throughput when ib_write_bw+ and MLC were co-executed. The X-axis indicates the conditions for these evaluations. R, W2, and W5 are the MLC options described in Section 3.1. The server's memory area for ib_write_bw+ was set to DRAM in one case and DCPM in the other, and the other options were a 4-KB RDMA size, random offset, Read/Write mixed operation, and ten-multiplex execution, because these options generate the most IO accesses to DCPM. The Y-axis is the MLC throughput, and the legend shows the remaining MLC options described in Section 3.1.

The results show that the interference performance was less than half of the non-interference performance even when MLC was executed on DRAM (see the "MLC + RDMA(DCPM)" portion). In particular, when the MLC options were W5+seq+DCPM, the MLC throughput became 24% of the MLC-only throughput. I assume that the MLC throughput on DRAM slowed down drastically because both DRAM and DCPM are connected to the same CH: if many IO accesses are concentrated on DCPM, IO accesses to DRAM may have to wait until the DCPM accesses complete (see Figure 1). However, when the server's memory area for ib_write_bw+ was DRAM, only small throughput drops occurred (less than 20%) (see the "MLC + RDMA(DRAM)" portion).

ib_write_bw+ performance.
Figure 6 shows the ib_write_bw+ throughput when ib_write_bw+ and MLC were co-executed. Both the X-axis and the legend are almost the same as in Figure 5. The Y-axis is the total throughput of the ten-multiplex ib_write_bw+ executions.

The results show less than 20% degradation when the server's memory area for ib_write_bw+ was DCPM. Moreover, only tiny throughput drops occurred when the server's memory area for ib_write_bw+ was DRAM.

The previous paragraph showed that the MLC performance slowed down drastically when ib_write_bw+ was executed against DCPM. However, the ib_write_bw+ performance slowed down only slightly even when the MLC target was DCPM. I assume that the implementation of Intel's MC causes this asymmetry.

Figure 6: ib_write_bw+ results when executing MLC (MB/sec). RDMA only (DCPM) = 7414 MB/sec; RDMA only (DRAM) = 20644 MB/sec.
MLC performance when traffic for ib_write_bw+ was changed.
From the results so far, this paper has indicated that the MLC throughput changes drastically when ib_write_bw+ using DCPM and MLC are co-executed. I also investigated the MLC throughput when both the amount of RDMA access to DCPM and the RDMA access pattern were changed. Both can be adjusted through the parameters of ib_write_bw+. In particular, the RDMA operations were not only RDMA Read/Write mixed but also Read-only and Write-only.

Figure 7: MLC results with various RDMA settings (MB/sec). MLC only = 12414, 17212, and 19942 MB/sec for R, W2, and W5, respectively.
The multiplex value for ib_write_bw+ ranged from 2 to 12 so as to change the amount of RDMA access. The MLC operations were R, W2, and W5 with a sequential offset to DRAM.

Figure 7 shows the results. The legend indicates the multiplex value for ib_write_bw+. All of the results (R, W2, and W5) indicate that RDMA Write-only caused large interference even when the multiplex value was small. In contrast, RDMA Read-only and RDMA Read/Write mixed increased the interference as the multiplex value increased. Figure 8 shows each result as a rate, where the MLC-only result is 1.00. It can be seen that the interference was large for RDMA Write-only. In particular, the MLC throughput when co-executing with ib_write_bw+ was 18% of the MLC-only throughput when the ib_write_bw+ options were four multiplexes with RDMA Write-only and the MLC option was W5.

To determine the reason for Figures 7 and 8, the read and write latencies were also investigated by using the Platform Profiler feature of Intel® VTune Amplifier 2019. Figure 9 shows the results for read latency, and Figure 10 shows those for write latency.

First, RDMA Write-only is discussed. As both figures show, the read and write latencies were higher when the multiplex value was smaller, and the MLC throughput slowed down when these latencies increased.

Next is RDMA Read-only. The read latencies were stable when MLC was executed with the R option, and they became a little lower as the multiplex value grew with the W2 and W5 options. The amount of RDMA read accesses was larger when the multiplex value was larger, and this increase in RDMA read accesses produced the large interference.

Last is RDMA Read/Write mixed. Both read and write latencies were higher when the multiplex value was larger, and the amount of RDMA read/write accesses was also larger. This is why the interference with the MLC throughput was large.

Figure 8: MLC results with various RDMA settings (rate relative to MLC-only)

Figure 9: DCPM read latency when using Intel VTune (ns)
4. DISCUSSION
From the results of Section 3, the MLC throughput drastically changed when ib_write_bw+ targeting the DCPM area and MLC targeting the DRAM/DCPM area were executed simultaneously, whereas the ib_write_bw+ throughput changed only a little (up to 20%). I therefore propose that some kind of throttling technique for DCPM access from a remote non-DCPM server (ib_write_bw+ in this paper) is needed when this interference occurs. For example, when a program executes many RDMA requests against DCPM, it could check the amount of MC accesses by using the CPU's statistical counters; if the amount of MC access is large, the program should reduce the number of outstanding RDMA requests until the interference no longer occurs. Figure 8 indicated that the interference for Read-only and Read/Write mixed RDMA did not occur when the number of ib_write_bw+ processes was small. For Write-only RDMA, I confirmed that the interference did not occur when ib_write_bw+ was executed over one HCA pair only. I will study this throttling technique in future work.

Figure 10: DCPM write latency when using Intel VTune (ns)

Imamura et al. [6] reported that the interference on this hybrid memory system is significantly different from that on a conventional DRAM-only memory system when several applications are executed on the same server simultaneously. They also indicated that DRAM access behavior may change because the write queue in the MC holds a large number of write requests, and that DRAM access starts to change when the write latency for DCPM reaches 1.5 microseconds. The results of this paper also indicate interference between a DCPM-accessing application using RDMA and a DRAM-accessing application using MLC. However, in this paper's results, the DRAM access started to change when the write latency for DCPM was from 0.7 to 0.9 microseconds. Both the MLC and RDMA executions share the resources from the MC to DRAM or DCPM, so the MLC throughput changed because of resource contention at the MC. The contention points may include a point other than the one in Imamura's report because of the difference in the write latency for DCPM.
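As a rough illustration of the proposed throttling, the sketch below caps the number of outstanding RDMA requests based on a memory-controller load reading. It is only a sketch of the idea, not a measured implementation: read_mc_write_pending(), post_next_rdma_request(), and poll_rdma_completions() are hypothetical placeholders (the first standing in for whatever CPU statistical counter, such as an uncore/iMC performance counter, is used), and the watermarks and window limits are assumptions.

/* Hypothetical throttling sketch for RDMA access to DCPM.
 * read_mc_write_pending() stands in for a CPU statistical counter read;
 * its name, the watermarks, and the window limits are assumptions for
 * illustration only. */
#include <stdint.h>

extern uint64_t read_mc_write_pending(void);    /* hypothetical MC load read */
extern int post_next_rdma_request(void);        /* posts one RDMA request, hypothetical */
extern int poll_rdma_completions(void);         /* returns number completed, hypothetical */

#define HIGH_WATERMARK  10000   /* assumed MC load above which we back off */
#define LOW_WATERMARK    4000   /* assumed MC load below which we ramp up */
#define MAX_OUTSTANDING    64
#define MIN_OUTSTANDING     4

void throttled_rdma_loop(long total_requests)
{
    int limit = MAX_OUTSTANDING;   /* current window of outstanding requests */
    int outstanding = 0;
    long posted = 0;

    while (posted < total_requests || outstanding > 0) {
        /* Adjust the window according to the memory-controller load. */
        uint64_t mc_load = read_mc_write_pending();
        if (mc_load > HIGH_WATERMARK && limit > MIN_OUTSTANDING)
            limit /= 2;                         /* interference suspected: back off */
        else if (mc_load < LOW_WATERMARK && limit < MAX_OUTSTANDING)
            limit++;                            /* headroom available: ramp up */

        /* Post new requests only while under the current window. */
        while (posted < total_requests && outstanding < limit) {
            if (post_next_rdma_request() != 0)
                return;                         /* posting failed; stop the sketch */
            posted++;
            outstanding++;
        }

        outstanding -= poll_rdma_completions(); /* reap finished requests */
    }
}

The multiplicative back-off with additive ramp-up mirrors the observation in Figure 8 that the interference grows with the amount of outstanding RDMA traffic; the right counter and thresholds remain future work.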
5. RELATED WORK
Many research papers have already evaluated DCPM. Izraelevitz et al. [11] evaluated each DCPM function exhaustively. van Renen et al. [16] performed a basic evaluation for database logging. Hirofuchi et al. [5] performed an evaluation for virtual machines (VMs). Weiland et al. [17] performed an evaluation for high-performance scientific applications. These pieces of research indicated that the latency for DCPM varies between 100 ns and 800 ns according to the access pattern for DCPM.
6. CONCLUSION
This paper evaluated the interference when DRAM access from a local server and DCPM access from a remote non-DCPM server were executed simultaneously. The Intel® Memory Latency Checker (MLC) was used as the DRAM access application, and an extended version of ib_write_bw, one of the InfiniBand Verbs Performance Tests, called "ib_write_bw+", was used as the DCPM access application. This paper showed that the interference on this hybrid memory system is significantly different from that on a conventional DRAM-only memory system. I moreover propose that some kind of throttling technique for DCPM access from a remote non-DCPM server is needed when this interference occurs. I would like to study this throttling technique in the future.
7. REFERENCES

[1] M. Andrei, C. Lemke, G. Radestock, R. Schulze, C. Thiel, R. Blanco, A. Meghlan, M. Sharique, S. Seifert, S. Vishnoi, D. Booss, T. Peh, I. Schreter, W. Thesing, and M. Wagle. SAP HANA Adoption of Non-Volatile Memory. In Proc. of the 43rd International Conference on Very Large Data Bases (VLDB 2017), Munich, Germany, August 2017.

[2] S. R. Dulloor, A. Roy, Z. Zhao, N. Sundaram, N. Satish, R. Sankaran, J. Jackson, and K. Schwan. Data Tiering in Heterogeneous Memory Systems. In Proc. of the 11th ACM European Conference on Computer Systems (EuroSys 2016).

IXPUG Workshop on HPC Asia 2020, Fukuoka, Japan.

Proc. of the 2015 ACM SIGMOD/PODS Conference (SIGMOD'15), Melbourne, VIC, Australia, May 2015.

[13] Linux-rdma. InfiniBand Verbs Performance Tests. https://github.com/linux-rdma/perftest.

[14] S. Romik. Introduction to Programming for Persistent Memory. May 2019.

[15] A. van Renen, V. Leis, A. Kemper, T. Neumann, T. Hashida, K. Oe, Y. Doi, L. Harada, and M. Sato. Managing Non-Volatile Memory in Database Systems. In Proc. of the 2018 ACM SIGMOD/PODS Conference (SIGMOD'18), Houston, TX, USA, June 2018.

[16] A. van Renen, L. Vogel, V. Leis, T. Neumann, and A. Kemper. Persistent Memory I/O Primitives. In Proc. of the 15th International Workshop on Data Management on New Hardware (DaMoN'19), Amsterdam, Netherlands, July 2019.

[17] M. Weiland, H. Brunst, T. Quintino, N. Johnson, O. Iffrig, S. Smart, C. Herold, A. Bonanni, A. Jackson, and M. Parsons. An Early Evaluation of Intel's Optane DC Persistent Memory Module and its Impact on High-Performance Scientific Applications. In