System measurement of Intel AEP Optane DIMM
Tianyue Lu, Haiyang Pan, and Mingyu Chen

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
{lutianyue, panhaiyang, cmy}@ict.ac.cn

Abstract.
In recent years, the memory wall has been a great performance bottleneck of computer systems. To overcome it, Non-Volatile Main Memory (NVMM) technology has been widely discussed as a way to provide a much larger main memory capacity. Last year, Intel released the AEP Optane DIMM, which provides hundreds of GB of capacity as a promising replacement for traditional DRAM memory. But as most key parameters of AEP are not open to users, there is a need to learn them, because they will guide the direction of further NVMM research. In this paper, we focus on measuring the performance and architecture features of the AEP DIMM. Together, we explore the design of the DRAM cache, which is an important part of the DRAM-AEP hybrid memory system. As a result, we estimate the write latency of the AEP DIMM, which has not previously been measured accurately. And we discover the current design parameters of the DRAM cache, such as tag organization, cache associativity, and set index mapping. All of these features are published in an academic paper for the first time and are greatly helpful to future NVMM optimizations.
1 Introduction

In the modern era, the memory wall [1] has been a great performance bottleneck for many big-data programs. Plenty of data is handled in parallel, which requires great memory bandwidth and memory capacity. Multi-channel memory systems have been a mainstream solution to provide both enough bandwidth and capacity. But for growing big-data applications, memory channels are not able to provide enough scalability. In particular, memory capacity does not scale well in traditional DRAM memory systems, because DRAM capacity per DIMM has reached 64GB and is hard to grow further [2]. New memory technology is demanded to break the limit of DRAM memory.

In this context, Non-Volatile Memory (NVM) [3] has become a research hotspot to solve the memory capacity wall. NVM, including PCM [4], RRAM [5], and MRAM [6], provides much higher storage density than DRAM. Besides, some NVM media provide read latency and bandwidth similar to DRAM, which means NVM could be a potential replacement for DRAM in composing the memory system. NVDIMM [8] is a JEDEC [9] standard for using NVM as a storage module. Among several branches, NVDIMM-P is a DRAM-NVM hybrid memory standard. In NVDIMM-P [10], DRAM and NVM are both deployed on the memory channel, where
NVM is used as main memory while DRAM is used as cache. NVM has larger capacity and DRAM has better performance, so the DRAM cache improves the access performance of the hot portion of the data while NVM meets the memory capacity demand of applications.

Intel AEP Optane [11] is a commercial product which realizes the NVDIMM-P architecture. AEP DIMMs are inserted into DIMM slots as main memory, and DRAM DIMMs on other slots are used as a transparent off-chip cache. Each AEP DIMM has 128GB to 512GB capacity (depending on the product model), which is much larger than DRAM. At the same time, the DRAM cache still has 1/8 to 1/2 the capacity of the NVM main memory, which means it can easily achieve a high cache hit rate. So with this memory system, programs which do not require large memory capacity still get the same performance as with traditional pure-DRAM memory. On the other side, programs which require hundreds of GB of memory gain great performance improvement compared with the case where main memory is lacking and disk is used as swap space. Unfortunately, as a commercial product, many performance parameters of AEP are not published. But in order to use AEP wisely, we need to know its parameters fully, including read/write latency, bandwidth, and structural design parameters such as DRAM cache associativity, cache tag placement, and AEP access granularity. With detailed parameters, academia and industry can optimize the memory architecture and scheduling more carefully.

Some prior works have done great work on AEP parameter measurement. [17] gives detailed AEP access bandwidth and read latency. [16] shows that AEP has a write buffer on the AEP DIMM which improves AEP write performance. But in most works, write latency is not well surveyed, because write instructions are committed as soon as the write data are written into the CPU cache or committed to the write command queue. Also, most works did not take DRAM cache architecture parameters into account. In many researches, DRAM cache optimization has been discussed [12,13,14,15], including cache associativity design, cache tag placement, cache replacement algorithms, and so on. Different cache designs determine cache performance, and in a real system, the DRAM cache parameters are determined by the memory controller on the DRAM channel, which is not open to most users. The DRAM cache architecture is important, and some of its parameters can be measured.

In this paper, we measure AEP and DRAM cache parameters. Different from prior works, we focus on the architecture parameters of the DRAM-AEP hybrid memory system. But first, we estimate the write latency of the AEP DIMM under high write bandwidth. High write bandwidth makes the write command queues at every level of the memory hierarchy full, so that a new write instruction cannot issue from the CPU instruction queue (actually the Re-Order Buffer). At this time, the latency measured across one write instruction is the time for the write queues to gain one empty entry, which is nearly the completion time of one write command finishing data writing on the AEP module. Furthermore, we measure the architecture parameters of the AEP DIMM and the DRAM cache. For the AEP DIMM, we test whether a small buffer is on the DIMM module. And for the DRAM cache, we measure the cache associativity and the address mapping of the cache set index. As a result, we find that a 16KB fully-associative buffer is on the AEP DIMM. And the DRAM cache is a direct-mapped cache whose tag and data are stored in one 64B cacheline, and its set index mapping is also found.
Detailed measurement results are introduced later in this paper.

In the rest of this paper, Section 2 introduces Intel AEP and its two deployment modes. Section 3 introduces our experiment platform and our test benchmarks. The measurement results are presented in Section 4. Based on these results, Section 5 provides some discussion of DRAM-AEP hybrid memory systems, and Section 6 concludes.
2 Intel AEP Optane DIMM

In this section, we will introduce the Intel AEP Optane DIMM in detail.
2.1 Hardware Architecture

Figure 1 shows the architecture of the Intel DRAM-AEP hybrid memory system. Both the AEP and DRAM DIMMs are attached to the memory bus and managed by the memory controller hardware named iMC. All DRAM DIMMs are standard DDR4 DIMMs, while the AEP DIMMs are accessed via the DDR-T protocol. As illustrated in [fast20], an on-DIMM controller named XPController manages the requests of the Optane DIMM. The XPController translates the request address into the real address on the AEP DIMM and maintains a buffer named XPBuffer to accelerate accesses, similar in function to the row buffer on DRAM DIMMs. The iMC manages the DIMMs in two modes, Memory mode and App Direct mode, and the modes can be switched dynamically via the ipmctl command on the Linux command line.
Fig. 1. Hardware architecture of Intel AEP.
2.2 Memory Mode and App Direct Mode

Figure 2 shows the storage topology in the different modes of AEP DIMMs. We can see that the DRAM DIMMs also change their roles according to the AEP mode. In Memory mode, AEP DIMMs work as main memory and the DRAM DIMMs are a cache of the AEP DIMMs on the same channel. The main memory capacity seen by the OS is the total capacity of the AEP DIMMs, while the DRAM cache is transparent to software.

In App Direct mode, AEP DIMMs work as persistent devices which can be directly accessed by applications, similar to a common Optane SSD device. In this case, AEP DIMMs are accessed as traditional block devices. Besides, Intel provides a library that implements direct load/store access to AEP DIMMs using the fsdax file system. Persistency is ensured by the Intel ADR (Asynchronous DRAM Refresh) mechanism, which guarantees that data written into the ADR domain will survive a power failure. In this architecture, the ADR domain includes the iMC and the AEP DIMMs. For programmers, adding clwb and sfence instructions after a write instruction ensures the written data is persistent: clwb writes the data from the CPU cache to the iMC, and sfence ensures the clwb is committed. In App Direct mode, the DRAM DIMMs act as traditional volatile main memory and their capacity is visible to software.

Public information and prior work give some introduction to both Memory mode and App Direct mode, but there is no clear description or measurement of specific details. For example, how is the DRAM cache tag managed? What are the cache associativity and the address mapping of the set index? What is the write latency from the iMC to the AEP module? We find these architecture parameters important, and we manage to measure them in this paper. Our methodology is introduced in the next section.
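As a concrete illustration of the persistence sequence just described, the following minimal sketch uses the compiler intrinsics behind the clwb and sfence instructions. It assumes dst points into an App Direct region mapped through an fsdax file; the function name and the mapping details are our assumptions, not from the paper.

#include <immintrin.h>  /* _mm_clwb, _mm_sfence (compile with -mclwb) */
#include <stdint.h>

/* Persist one 64-bit store to AEP in App Direct mode. */
static void persist_store(uint64_t *dst, uint64_t value) {
    *dst = value;           /* store lands in the CPU cache             */
    _mm_clwb((void *)dst);  /* write the cache line back toward the iMC */
    _mm_sfence();           /* order: the clwb must complete first      */
    /* The data now sits in the ADR domain and survives a power failure. */
}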
Fig. 2. Two modes of AEP.
2.3 DRAM Cache

As mentioned above, DRAM is used as a cache in AEP Memory mode. Many prior researches have focused on DRAM cache design and optimization. For a DRAM cache, there are two major problems to solve: hit rate and hit latency. On hit rate, it is not hard for a DRAM cache to reach a hit rate above 50%, as it has a capacity of 1/8 to 1/2 of main memory. But on the other hand, it is hard to push a hit rate that is already so high even higher. And this hit rate is more critical to DRAM-AEP hybrid memory system performance than in the traditional case, as the latency gap between the DRAM cache and AEP main memory can be larger than that between the CPU cache and DRAM main memory. On hit latency, a DRAM cache must handle cache tags carefully, because an extra tag or metadata access issues an extra DRAM access, which is much more expensive than an on-chip cache access. Considering hit rate and hit latency together, replacement algorithm design for a DRAM cache faces new challenges: a complex algorithm like LRU or RRIP maintains a large amount of metadata, which brings large storage and access overhead, while a simple algorithm like random or FIFO cannot achieve a high hit rate. So in this paper, we are concerned with the actual DRAM cache design of the Intel iMC, which we think will provide a direction for further DRAM cache research.
3 Measurement Methodology

In this section, we will introduce our measurement platform and benchmarks.
3.1 Experiment Platform

In our experiments, we deploy an Intel server with one channel of DRAM and one channel of AEP. The detailed server configuration is shown in Table 1. The DRAM DIMM has 16GB capacity and the AEP DIMM has 128GB. Both DIMMs are inserted into slots managed by the same iMC, and our benchmarks are bound to cores of the local CPU. So in our experiments, we avoid the impact of the NUMA architecture and obtain local AEP performance parameters.
Table 1. Experiment Platform Configurations

Processor         Intel(R) Xeon(R) Gold 6246 CPU @ 3.30GHz
Operating System  Linux Kernel 4.20 in CentOS 7
LLC Cache Size    24.75MB
DRAM              16GB per DIMM
NVM               128GB per DIMM
3.2 Latency Measurement

In our measurement programs, we use the RDTSC instruction before and after every read/write instruction to measure latency. Instead of an average value, we obtain a curve of the latency values over the timeline. As a result, some latencies show a stable value, while others vary regularly.
For read latency measurement, we use a single-threaded program to access malloc'ed memory space of main memory in AEP Memory mode, or mmap'ed space in App Direct mode. All accessed memory addresses are chained. For example, we first read address A, and the data at A is B, which is the next access address. In this case, access B must wait for access A to finish, so there is a dependency among all reads, which prevents read latency from being hidden in the pipeline. To make sure the read instruction has completed, an LFENCE is added before the RDTSC instruction. The following code shows an example of our timing routine:

unsigned long test() {
    unsigned long time_hi, time_lo;
    unsigned long start, end;
    asm volatile("lfence\n");
    asm volatile("rdtsc" : "=a"(time_lo), "=d"(time_hi) :);
    start = time_hi << 32 | time_lo;
    asm volatile("lfence\n");

    /* code for testing */

    asm volatile("lfence\n");
    asm volatile("rdtsc" : "=a"(time_lo), "=d"(time_hi) :);
    end = time_hi << 32 | time_lo;
    asm volatile("lfence\n");
    return end - start;
}

To measure write latency, we need to ensure that the write bandwidth is high enough to fill all memory write queues. In this case, a new write instruction is not allowed to issue, as there is no space for new writes. Only at this time does the completion time of a write instruction reflect the write latency of the AEP module. All instructions are issued and committed through the Re-Order Buffer (ROB), and a write instruction commits once it has been sent to the memory queue. Only when the queue is full does the write instruction wait in the ROB. The waiting time depends on how long it takes for a queue entry to be freed, and a write command in the iMC memory queue can only be deleted once its data has been written to the AEP module. Therefore, we use three threads to fill the memory bandwidth and memory queues, and a fourth thread measures the completion time of a write instruction. Three threads are enough to fill the memory bandwidth, which is confirmed in our AEP bandwidth experiment in Section 4.2.
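To make the chained-read setup concrete, the following sketch builds the kind of dependent pointer chain described above: every load returns the address of the next load, so reads cannot overlap in the pipeline. The names (build_chain, chase, NSLOTS, STRIDE) and the stride choice are our assumptions, not details from the paper.

#include <stdint.h>
#include <stdlib.h>

#define NSLOTS (1UL << 24)  /* number of pointer-sized slots (assumption)  */
#define STRIDE 8            /* 8 slots * 8B = 64B, one cacheline per hop   */

uintptr_t *build_chain(void) {
    uintptr_t *chain = malloc(NSLOTS * sizeof(uintptr_t));
    /* Sequential chain; a random permutation of the hops would give the
     * random-access variant of the benchmark. */
    for (size_t i = 0; i < NSLOTS; i++)
        chain[i] = (uintptr_t)&chain[(i + STRIDE) % NSLOTS];
    return chain;
}

uintptr_t *chase(uintptr_t *p, size_t hops) {
    while (hops--)
        p = (uintptr_t *)*p;  /* the loaded data is the next address */
    return p;                 /* returned so the loop is not optimized away */
}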
3.3 DRAM Cache Measurement

To deal with DRAM cache measurement, as the cache is a device transparent to the OS, we use the Lecroy Kibra DDR Protocol Analyzer Suite to support our experiments. The analyzer is connected to the server board via a standard DIMM slot, and the DRAM DIMM is inserted on the analyzer's interposer, as shown in Figure 3. At work, the DRAM DIMM still works as a normal memory module, and the DDR commands sent to it are captured by the analyzer. The command type (read/write/activate/precharge) and the access address can be obtained, but not the read/write data. In AEP Memory mode, DRAM acts as the cache, and we use the analyzer to capture the DDR commands issued by the iMC and observe their pattern. Analyzing the pattern helps us learn the workflow of the DRAM cache, such as how many DDR commands are needed during a cache hit or miss.

Besides, to test the DRAM cache set index mapping, we use pairs of memory addresses which have only a one-bit difference. When some address bits flip, the two addresses conflict in a cache set, and for other bits they do not. Based on this experiment, we can find out how memory addresses are mapped to the DRAM cache set index.
Fig. 3. DDR interposer of the Lecroy Kibra DDR Protocol Analyzer, sitting between the DIMM slot and the DIMM under test, with a cable to the analyzer.
4 Measurement Results

Our results are shown in this section. First, we show the measured AEP read/write latency in the two modes, and based on the latency results, we analyze and obtain some architecture features of the AEP DIMM. Also, in Memory mode, we measure how the DRAM cache is organized, including cache data and tag. At last, the peak bandwidths of AEP are deduced.
4.1 Read Latency

We measure AEP latency in different access modes. Both reads and writes are measured, and sequential and random accesses give different results.

First, we measure read latency in App Direct mode. Figure 4 shows the results of sequential reads. In Figure 4(a), we find that the read latency has two alternating values: the lower one is about 150ns, and the higher one about 350ns. We take 32 consecutive points from Figure 4(a) to generate Figure 4(b). There, we find that the read latency values follow a 3-low, 1-high pattern: as shown in Figure 4(c), the lower value takes 75% of accesses and the other 25% show the higher value. We think the lower value means those read accesses hit in the XPBuffer of the AEP DIMM.
Fig. 4. Sequential Read Latency in App Direct mode.

Fig. 5. Random Read Latency in App Direct mode.
The reason is that each time the latency shows the high value of about 350ns, the access address of the sequential reads reaches a 256B-aligned address. This hints that the access granularity of AEP is 256B: on the AEP module, each DDR read command fetches 256B of data from the medium. Although only 64B is returned through the DDR bus, the other 192B is buffered in the XPBuffer on the AEP DIMM.

Random reads show a stable latency value, as shown in Figure 5. Unlike sequential reads, the latency values are much more stable and equal to the higher value seen in the sequential-read results. In summary, we think the real read latency of the AEP DIMM is about 350ns.

In AEP Memory mode, we get almost the same latency results as in App Direct mode. Besides, we run our test benchmark with different memory footprints; with small footprints, all accesses hit in the DRAM cache in Memory mode. Figure 6 shows the measured latencies over different footprints. A latency jump happens at 16GB, where the memory footprint becomes larger than the DRAM cache capacity, which causes all accesses to miss in the DRAM cache. Below 16GB, the latency value equals the DRAM read latency, which is about 90ns.

Fig. 6. Random Read Latency in Memory Mode over different footprints.
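The 3-low, 1-high pattern of Figure 4 follows directly from this granularity (our reading of the data above, not an equation from the paper): a 256B XPBuffer fill covers 256B / 64B = 4 consecutive cachelines, so in a sequential sweep the

    buffer-hit fraction = (4 - 1) / 4 = 75%,

matching the 75%/25% split of Figure 4(c).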
4.2 Write Latency

Write latency behaves much differently from read latency; we show the results of the write latency experiments in this section. As introduced in Section 3.2, our benchmarks keep all memory write queues full at every level of the memory hierarchy, which makes the measured commit time of a write instruction equal to the time for one queue entry to be freed, which in turn nearly equals the memory write latency. Figure 7 shows the latency values of sequential writes over the timeline. We can see that about 10% of writes show a low latency of about 170ns, while most write latency values are distributed from 200 to 1100ns. We find a similar pattern in the random write experiment, whose result is shown in Figure 8.

According to our benchmark design, we consider the 10% low values to be the latency of write instructions that issue successfully as soon as they enter the ROB. The other write instructions need to wait for an empty entry in the memory queues: some wait a long time and some wait less. The longest case is when the next write command to be freed from the memory queue has just been issued to the AEP module; in this situation, the new write instruction must wait for a full AEP write, whose duration is the AEP write latency. In summary, we estimate that the write latency of AEP is nearly 1200ns.

Fig. 7. Sequential Write Latency in App Direct mode.

Fig. 8. Random Write Latency in App Direct mode.
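A minimal sketch of the three-writer saturation setup from Section 3.2 is shown below: three threads stream stores to keep the write queues full, while a fourth thread (elided here) times individual stores with the test() routine above. The region size, thread pinning, and all names are our assumptions, not details from the paper.

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define REGION (1UL << 28)          /* 256MB per writer thread (assumption) */
static volatile int stop = 0;

static void *writer(void *arg) {    /* streams stores to saturate bandwidth */
    volatile uint64_t *buf = arg;
    while (!stop)
        for (size_t i = 0; i < REGION / sizeof(uint64_t); i++)
            buf[i] = i;
    return NULL;
}

int main(void) {
    pthread_t t[3];
    for (int i = 0; i < 3; i++)     /* three writers fill queues at every level */
        pthread_create(&t[i], NULL, writer, malloc(REGION));
    /* ... fourth thread: time single stores with test() while queues are full ... */
    stop = 1;
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return 0;
}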
4.3 DRAM Cache Architecture

In Memory mode, we also measure the architecture features of the DRAM cache. We focus on three features: cache tag placement, cache associativity, and set index mapping.

We use the Lecroy Kibra DDR Protocol Analyzer Suite to see what happens on the DDR bus during a DRAM cache hit or miss. In the benchmarks, we use a pair of addresses and test the cases where both accesses hit or both miss in the cache. In the hit case, we capture a fragment of DDR commands as shown in Figure 9. We can see that all commands are RD (read) and the two addresses appear alternately. This means that a cache hit leads to only one DRAM read, which indicates that the cache tag and cache data are stored in one cacheline; perhaps some ECC bits are repurposed as the cache tag.

The cache miss case is shown in Figure 10, where each address corresponds to one read and one write command. In our opinion, the read command corresponds to the cache tag read. When the iMC detects a cache miss from the tag content, it sends a fetch command to AEP and updates the cache tag and data in the DRAM cache via a DRAM write command.

The conclusion that tag and data are placed in one cacheline suggests the further possibility that the DRAM cache is direct-mapped. In a set-associative cache, the ways of one cache set lie in different cachelines, so there would have to be cases where tag and data are not in the same cacheline; we have not observed such a case.

Another piece of evidence for a direct-mapped cache is the set index mapping of the DRAM cache. As introduced in Section 3.3, we use pairs of addresses with only a one-bit difference and test for which bits the cache hits or misses. In the cache miss case, the measured latency is the AEP read latency; otherwise it is the DRAM read latency. Figure 11 shows the measured latency as a function of which bit of the memory address is flipped.

We can see that when any bit from 0 to 33 changes, the latency stays low, which means all accesses are cache hits. So the set index is simply the contiguous low address bits. More precisely, as the cache granularity is 64B, bits 33-6 are the set index and bits 5-0 are the cacheline offset. So the total number of cache sets is 256M; dividing the 16GB DRAM capacity by this number, each cache set holds 64B, i.e., exactly one cacheline. This also leads to the conclusion that the DRAM cache is direct-mapped.
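The following sketch illustrates the pair-address probe under our interpretation of the method: flip one address bit, access the pair, then time a re-access of the first address with the test() routine from Section 3.2. A conflict (same set, different tag) shows the ~350ns AEP latency; otherwise the ~90ns DRAM-hit latency. flip_bit and probe_bit are hypothetical names of ours.

#include <stdint.h>

/* Address obtained by flipping bit `bit` of `base`. */
static inline volatile char *flip_bit(volatile char *base, int bit) {
    return (volatile char *)((uintptr_t)base ^ (1UL << bit));
}

/* Warm one line, touch its one-bit-flipped partner, then re-access the
 * first line. If `bit` is a tag bit (above 33), both addresses map to
 * the same direct-mapped set and the re-access is a conflict miss; if
 * it is a set-index bit (6..33), both accesses hit. */
static void probe_bit(volatile char *base, int bit) {
    (void)*base;                 /* fill the DRAM cache line          */
    (void)*flip_bit(base, bit);  /* may evict it if the set is shared */
    (void)*base;                 /* time this access with test()      */
}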
Fig. 9. DDR commands when the DRAM cache hits.
5 Discussion

As a summary of our experiments on AEP DIMMs, we find three important features of AEP which are helpful to guide further AEP optimization.

First, we have found that, rather than being small, the write latency of AEP is in fact about 3 times the read latency, which means AEP write performance could be a bottleneck for write-intensive programs. Together with bandwidth, AEP has quite unbalanced performance between reads and writes, so future design optimizations should emphasize writes. Some prior work has proposed that part of the DRAM cache can be used as a write-exclusive buffer, to achieve a higher cache hit rate for write instructions.
Fig. 10. DDR commands when the DRAM cache misses.

Fig. 11. Measured latency as a function of which bit of the memory address is flipped.
Second, the DRAM cache is a direct-mapped cache in AEP Memory mode. Compared with a set-associative cache, which is commonly used for CPU caches, a direct-mapped cache has a lower hit rate. By contrast, a direct-mapped cache has lower cache hit latency when the cache tag and data are fetched in one DRAM read, which cannot easily be achieved in a set-associative cache. Future works should weigh both hit rate and hit latency in DRAM cache designs.

At last, a small buffer named XPBuffer benefits both read and write commands on the AEP DIMM. But the capacity of the XPBuffer is limited by the area on the DIMM. If we can use the XPBuffer more wisely, or modify the DDR protocol to fit the 256B access granularity of AEP, the access efficiency of AEP could improve further. Some work has proposed that the DRAM cache can be filled and evicted at 256B granularity; this is more suitable for the AEP DIMM and would also reduce the storage space of cache tags.
6 Conclusion

In this paper, we propose a new measurement methodology for memory latency and cache architecture. As an application, we measure the latency and bandwidth of the Intel AEP Optane DIMM and the architecture design parameters of its DRAM cache. The release of the AEP DIMM has great significance for Non-Volatile Main Memory research, but its parameters are not open to most users. According to our evaluation,