A Software-based NVM Emulator Supporting Read/Write Asymmetric Latencies
Atsushi Koshiba, Takahiro Hirofuchi, Ryousei Takano, Mitaro Namiki
To appear in
IEICE TRANS. INF. & SYST., VOL.E102-D, NO.12 DECEMBER 2019 PAPER
A Software-based NVM Emulator Supporting Read/Write Asymmetric Latencies
Atsushi KOSHIBA†,††a), Takahiro HIROFUCHI††, Nonmembers, Ryousei TAKANO††, and Mitaro NAMIKI†, Members
SUMMARY
Non-volatile memory (NVM) is a promising technology for low-energy and high-capacity main memory of computers. The characteristics of NVM devices, however, tend to be fundamentally different from those of DRAM (i.e., the memory device currently used for main memory) because of differences in the principles of their memory cells. Typically, the write latency of an NVM device such as PCM and ReRAM is much higher than its read latency. This asymmetry in read/write latencies is likely to affect the performance of applications significantly. For analyzing the behavior of applications running on NVM-based main memory, most researchers use software-based emulation tools due to the limited number of commercial NVM products. However, these existing emulation tools are either too slow to emulate large-scale, realistic workloads or too simplistic to investigate the details of application behavior on NVM with asymmetric read/write latencies. This paper therefore proposes a new NVM emulation mechanism that is not only light-weight but also aware of the read/write latency gap of NVM-based main memory. We implemented a prototype of the proposed mechanism for Intel processors of the Haswell architecture. We also evaluated its accuracy and performed case studies on practical benchmarks. The results showed that our prototype accurately emulates the write latencies of NVM-based main memory: it emulated NVM write latencies in a range from 200 ns to 1000 ns with negligible errors from 0.2% to 1.1%. We confirmed that the use of our emulator enabled us to successfully estimate the performance of practical workloads on NVM-based main memory, where an existing light-weight emulation model misestimated it.
key words: middleware, non-volatile memory, performance emulation, asymmetric read/write latencies, write-back awareness
1. Introduction
Recent trends of high-speed and many-core processors lead to an increasing demand for larger memory capacity. Modern computer systems use DRAM for main memory, while scaling up DRAM capacity is becoming difficult due to its refresh energy. Because a DRAM cell holds its data as electric charge in a capacitor, periodically refreshing the cell is necessary to prevent data loss. This energy overhead rapidly increases as DRAM scales up its capacity. It is predicted that refresh energy will occupy 50% of the overall power consumption of a 64 GB DRAM module [1]. It is also reported that a server computer with 128 GB DRAM spends more than 40% of its energy consumption on its main memory [2]. This energy-greedy characteristic of DRAM is an obstacle for future large-capacity memory systems.

Non-Volatile Memory (NVM) is the key to overcoming this energy constraint. Some NVM devices with fast access latencies have the potential to be used for the main memory of computers [3]. In addition, NVM does not require refreshing to keep its data, unlike DRAM. This non-volatility prevents memory subsystems from wasting a large amount of energy. Recent NVM technologies have attracted much attention not only in academia but also in industry; new NVM products such as 3D XPoint are being developed [4]. For these reasons, NVM products are expected to achieve high-capacity and energy-efficient main memory systems.

Although NVM is effective for energy reduction, current applications and system software, designed for DRAM-based main memory, will not work efficiently on future NVM-based main memory due to its performance characteristics. In particular, the gap between read latency and write latency is generally significant. For example, phase change memory (PCM) [6] represents the state of a 1-bit cell (e.g., high or low) by switching the cell between two phases: an amorphous phase (low) and a crystalline phase (high). A read operation to a PCM cell just senses its resistance, while a write operation applies an electrical pulse to the cell to heat it and change its phase. In particular, PCM recrystallization (changing from the amorphous phase to the crystalline phase) requires a long duration of pulsing. Therefore, writing PCM typically requires much longer latency than reading. The ITRS roadmap [3] reports that the write latency of a typical PCM device is approximately 10x higher than its read latency. It also forecasts that writing PCM will still be 5x slower than reading it in 2026. This gap possibly leads to performance degradation of write-intensive application programs. For example, the results of our preliminary experiments (shown in Section 4.3 of this paper) showed that write-intensive workloads such as milc and libquantum experienced nearly 2x slower performance with NVM-based main memory in comparison to DRAM-based main memory.

† The author is with Tokyo University of Agriculture and Technology, Tokyo, 184-8588 Japan.
†† The author is with National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, 305-8560 Japan.
a) E-mail: [email protected]
Note: this paper extends our preliminary work published at NVMSA 2017 [5]. Specifically, we reimplemented a prototype of our emulator for the Intel Haswell processors, which previously targeted an older processor architecture (i.e., Sandy Bridge), to verify the portability of our emulator to newer processor families. Along with the reimplementation, we drastically improved the accuracy of the emulator by fixing bugs in cache miss measurement; the worst emulation error of the NVM write latency was mitigated from 28.6% to 1.1%. Moreover, we conducted thorough experiments using various workloads. All parts of the paper have also been thoroughly updated to improve its quality.
Copyright © 2019 The Institute of Electronics, Information and Communication Engineers
To make use of future main memory with NVM, several researchers have tried to find new system software support and memory subsystems appropriate for NVM characteristics [7], [8], [9]. However, no NVM-based main memories are commercially available. Memory emulation tools are therefore essential for researchers to analyze and evaluate the performance of their proposals without actual NVM devices. Although several simulation/emulation tools for NVM devices have been presented, these tools have problems for practical software research. Cycle-accurate simulators [10], [11] are widely used among researchers. While they can set read and write latencies independently in nanoseconds, these simulators are not appropriate for large-scale workloads because they are very time-consuming for system software emulation. In contrast to heavy-weight simulators, Volos et al. proposed Quartz, a software emulator for NVM devices [12]. Quartz emulates NVM-based main memory using a computer with DRAM-based main memory. It estimates the delays in the execution of a target process caused by accesses to an emulated NVM device and slows down the target process. It uses the performance counters of a CPU processor to obtain information on memory accesses. This emulation mechanism, slowing down a target process running on an operating system, is basically light-weight. However, Quartz is unaware of the read/write latency gap of NVM devices. Most CPU processors implement a write-back caching mechanism, in which the CPU cores of a processor are not responsible for write-backs to the main memory; instead, a cache controller handles them. Quartz, using the performance counters of CPU cores, does not incorporate write-back information into the emulation model.

To overcome these shortcomings of existing emulation tools, this paper presents a light-weight NVM emulator that takes the read/write latency gap into account. Unlike the Quartz approach, our emulator classifies cache misses of a target process into two types: read-only and write-back. The former only reads data from NVM, while the latter performs both reading and writing. On NVM systems, write-back cache misses are expected to cause longer CPU stall cycles than read-only ones. To estimate the number of write-back cache misses, our emulator monitors not only CPU cache misses but also the behavior of other components (prefetchers and cache controllers). The emulator then calculates the additional delays caused by the two types of cache misses (read-only and write-back) respectively from the read/write latencies of an emulated NVM device. This write-back aware emulation model enables an accurate emulation of NVM devices such as PCM.

To clarify the effectiveness of the proposed emulator, we developed a prototype on an Intel Xeon processor and conducted three experiments. First, we evaluated the accuracy of the prototype. We found that our prototype emulates the write latencies of NVM-based main memory in the range of 200 ns to 1000 ns with negligible errors of 0.1% to 1.1%. Second, we applied our emulator to various workloads selected from SPECCPU 2006.
Fig. 1  Memory system structure of modern multi-core computer systems. Every CPU core of state-of-the-art processors may have more local cache levels (e.g., L3). We assume that NVM-based main memory is provided through memory modules and managed by the memory controller in the same manner as DRAM.

The results demonstrated that the use of our emulator successfully estimated the execution time of these workloads. Third, we compared the proposed mechanism with Quartz using an in-memory database program, Memcached, as a case study of a realistic application. In this experiment, we executed the original Quartz on our evaluation environment, applied it to Memcached, and compared the evaluation results with those of our emulator. We found that the use of Quartz misestimated the performance of Memcached running with NVM-based main memory. These results show that our write-back aware mechanism has clear advantages in emulating NVM devices with asymmetric read/write latencies.
2. Motivation
In this section, we give an overview of memory access mechanisms and then explain how the gap between read and write latencies can impact the performance of computers.

2.1 Memory Access Mechanism

We briefly explain a typical memory access mechanism in computers. Fig. 1 shows the hardware structure of recent multi-core computer systems. CPU cores are implemented in a multi-core processor, and every CPU core has local caches (e.g., L1, L2). Each CPU core has memory prefetchers and executes instructions in an out-of-order manner. With these functions, two or more load/store instructions are sometimes performed concurrently, and CPU cores avoid long stalls when accessing the memory modules. All CPU cores share the Last Level Cache (LLC), which is larger than the local caches. LLC coherency among CPU cores is maintained by the LLC controller of the processor. The LLC controller is also responsible for issuing read/write requests to the main memory when LLC misses occur. The memory controller of the processor, receiving read/write requests from the LLC controller, operates the main memory modules. Note that CPU cores of recent processors have hardware performance counters, which measure the number of performance events (e.g., cache misses, stall cycles). We assume that NVM-based main memory is byte-addressable in the same manner as DRAM-based main memory. We also assume that both NVM- and DRAM-based main memory modules are write-back cacheable; the caches in the processor hold modified data in cache lines and do not write the data to the main memory modules until the cache lines are evicted.

In this memory architecture, memory references reaching the main memory mostly occur when load/store instructions cause LLC misses. A CPU core, executing a program, accesses memory data with load/store instructions. When a CPU core executes a load or store instruction, it refers to a source or destination address of the main memory, which is specified by the instruction. Because the data corresponding to the address may exist in the caches, the CPU core first refers to its L1 cache. If the data does not exist in the L1 cache, the CPU core refers to the next cache level (e.g., L2 and then LLC). If the data does not exist even in the LLC, the CPU core triggers an LLC miss event. It fetches a cache line of data (i.e., typically 64 bytes) from the memory module. When an LLC miss occurs, a cache controller selects an LLC line where the new data should be maintained according to a certain cache management scheme (e.g., n-way set associative). At the same time, the old data on the selected LLC line is evicted to make room for the new data.

The procedure of an LLC miss differs depending on the state of the evicted LLC line. If the state of the line is clean or invalid, the cache controller reads the new data from the memory module and overwrites the cache line. On the other hand, if the state of the line is modified, the controller not only reads the new data from the module but also writes the modified line back to the module in order to reflect the change to the main memory. Therefore, there are two types of LLC misses: one that just reads data from the memory module, and one that induces a write-back. We define the former and the latter as a read-only LLC miss and a write-back LLC miss, respectively. A conceptual sketch of this distinction is shown below.
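The distinction can be summarized in a small conceptual sketch. This is a toy model of the controller's decision, not real hardware logic; all types and helper functions are hypothetical:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical model of a victim line's state. */
    typedef enum { INVALID, CLEAN, MODIFIED } line_state_t;

    static void write_back_line(int way)           { printf("write-back of way %d\n", way); }
    static void fetch_line(uint64_t addr, int way) { printf("fetch 0x%lx into way %d\n",
                                                            (unsigned long)addr, way); }

    /* Only a MODIFIED victim forces a write to the memory module. */
    static void handle_llc_miss(uint64_t miss_addr, int victim_way,
                                line_state_t victim_state) {
        if (victim_state == MODIFIED)
            write_back_line(victim_way);   /* write-back LLC miss: read + write */
        fetch_line(miss_addr, victim_way); /* read-only LLC miss: read only */
    }

    int main(void) {
        handle_llc_miss(0x1000, 0, CLEAN);    /* read-only miss */
        handle_llc_miss(0x2000, 1, MODIFIED); /* write-back miss */
        return 0;
    }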
2.2 Impacts of Higher Write Latency

The two types of LLC misses lead to the same latency with DRAM-based main memory because reading a new line and writing an old line are executed in parallel [13]. Upon a write-back LLC miss, the LLC controller simultaneously starts reading the new line and writing the old line. In DRAM-based main memory, the duration of a write-back LLC miss is the same as that of a read-only LLC miss, because the read/write latencies of DRAM are the same. However, if NVM devices such as PCM are used for main memory, an additional duration will be necessary upon a write-back LLC miss due to the higher write latency. Fig. 2 shows the difference in penalty time per LLC miss between DRAM and NVM.
Fig. 2  The performance penalty upon an LLC miss in DRAM-based and NVM-based main memory systems, respectively. An NVM-based system likely experiences a significant penalty upon a write-back LLC miss.

The upper part of Fig. 2 shows the DRAM case, where the write latency is almost the same as the read latency. The lower part of Fig. 2 shows the NVM case, where the write latency is much longer than the read latency. We assume that a write-back LLC miss in NVM-based main memory requires a longer period because the controller waits for the eviction of the old line. Although the controller can temporarily hold write requests in a request queue to prevent write requests from interfering with read requests, the queue will not work well for write-intensive applications because of its limited size. Thus, if write-back LLC misses occur frequently, the CPU core that causes a write-back is forced to keep stalling until the eviction of the old data finishes. This problem possibly influences the performance of application programs depending on their memory access behavior. For instance, our experimental results in Sec. 4.3 show that the execution time of libquantum, a write-intensive benchmark, becomes nearly 2x slower on NVM than on DRAM.

2.3 Problem of Existing Work

As described above, the read/write latency gap of NVM-based main memory possibly has a great impact on application performance. Analyzing its impact on performance is therefore indispensable for developing future NVM systems. Because there are few commercial NVM products, researchers are forced to use emulation/simulation tools for their experiments.

However, existing tools have several issues in emulating the read/write latency gap. The most common tools are cycle-accurate simulators. These simulators are used with other CPU simulators and simulate full-system behavior with NVM per CPU cycle [14], [10]. This approach can set the read/write latencies of main memory independently, while it is too slow to emulate large-scale workloads. For instance, we experienced that a simulation system using NVMain [10] with gem5 [15] took more than eight hours to finish a simulation of a tiny program whose execution took only one second in reality.
Fig. 3  The mechanism to delay the execution of a target process in Quartz. $MA_i$ is the number of DRAM accesses caused by LLC misses during $Epoch_i$; the execution is suspended for the additional delay $\Delta_i$.

On the other hand, Quartz [12] is a light-weight emulation mechanism using the hardware performance monitoring counters implemented in the CPU cores of Intel processors. To emulate a given NVM latency, Quartz inserts delays into the execution of a target process. The inserted delays are based on the number of DRAM references obtained through the performance counters of CPU cores. Fig. 3 shows the Quartz emulation model. Quartz measures the number of DRAM accesses caused by the target process, using performance counters implemented in CPU cores, at a specific interval named Epoch. It then calculates the additional delay, $\Delta$, that would be involved if the target process were executed with NVM-based main memory. After the calculation, Quartz suspends the process execution until $\Delta$ elapses. The overhead of this emulation mechanism is negligible for most use cases.

The Quartz emulation model defines $\Delta_i$, the additional delay in $Epoch_i$, as Eq. (1):

$$\Delta_i = MA_i \times (NVM_{lat} - DRAM_{lat}) \qquad (1)$$

where $MA_i$ is the number of LLC misses during $Epoch_i$ that have caused CPU stalls of the CPU core executing the target process, and $NVM_{lat}$ and $DRAM_{lat}$ represent the NVM access latency and the DRAM access latency, respectively. It should be noted that, thanks to memory prefetching and out-of-order execution, an LLC miss does not necessarily involve a CPU stall. Thus, we need to count the number of LLC misses involving CPU stalls, not the total number of LLC misses. To obtain $MA_i$, Quartz divides the number of CPU stall cycles induced by LLC misses by the DRAM access latency (in cycles):

$$MA_i = \frac{LLC\_STALL_i}{DRAM_{lat}} \qquad (2)$$

where $LLC\_STALL_i$ represents the total cycles of CPU core stalls caused by LLC misses. The documentation of Intel CPUs [13] provides the equation to calculate $LLC\_STALL_i$ as follows:

$$LLC\_STALL_i = L_{stalls} \times \frac{W \times LLC_{miss}}{LLC_{hit} + W \times LLC_{miss}} \qquad (3)$$

where $L_{stalls}$ is the total number of core stall cycles caused by L2 cache misses, $LLC_{hit}$ and $LLC_{miss}$ are the numbers of LLC hits and LLC misses of the core, and $W$ is the ratio of the LLC miss latency (the DRAM access latency) to the LLC hit latency. The sketch below illustrates this calculation.
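To make Eqs. (1)–(3) concrete, here is a minimal sketch of the per-epoch delay calculation in the Quartz style, assuming the raw counter values have already been read; all function and parameter names are ours, not Quartz's actual code:

    #include <stdint.h>

    /* Quartz-style additional delay for one epoch, per Eqs. (1)-(3).
     * All latencies are in CPU cycles; names are illustrative. */
    double quartz_epoch_delay(uint64_t l2_stall_cycles, /* L_stalls */
                              uint64_t llc_hits,        /* LLC_hit  */
                              uint64_t llc_misses,      /* LLC_miss */
                              double W,                 /* LLC miss/hit latency ratio */
                              double dram_lat,          /* DRAM_lat */
                              double nvm_lat)           /* NVM_lat  */
    {
        double denom = (double)llc_hits + W * (double)llc_misses;
        if (denom == 0.0)
            return 0.0;  /* no LLC activity in this epoch */

        /* Eq. (3): stall cycles attributable to LLC misses. */
        double llc_stall = l2_stall_cycles * (W * (double)llc_misses) / denom;

        /* Eq. (2): number of LLC misses that actually stalled the core. */
        double ma = llc_stall / dram_lat;

        /* Eq. (1): extra delay if main memory were NVM. */
        return ma * (nvm_lat - dram_lat);
    }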
Although the Quartz approach has an advantage in processing overhead over cycle-accurate simulators, it does not take the read/write latency gap into account. The difficulty in supporting the latency gap stems from the lack of capability to monitor write-back activities through CPU cores; the CPU performance counters implemented in recent processors (e.g., Intel processors) do not support a performance event to measure the write-back LLC misses of each CPU core. The likely reason is that a modern processor assuming DRAM-based main memory does not need to pay attention to the write-back latency, since DRAM write-back operations are completely hidden behind its read operations, as shown in Fig. 2. This fact makes it difficult for an emulation approach using CPU performance counters to analyze the impact of higher write latencies on the performance of a given process. To overcome this issue, we propose an emulation mechanism estimating per-core write-back LLC misses, which are not directly countable.
3. Write-back Aware NVM Emulator
This section proposes a light-weight emulation model that distinguishes write-back LLC misses from read-only LLC misses.

3.1 Basic Idea

We assume that write-back LLC misses lead to longer CPU stalls than read-only LLC misses. To take the difference between read and write latencies into account, our emulation model monitors the two types of LLC misses separately, unlike the Quartz emulation model. Our model allows users to evaluate application performance with NVM devices whose read/write access latencies are asymmetric.

Fig. 4  The mechanism to delay the execution of a target process in the proposed emulation model. It distinguishes LLC misses into two types: read-only and write-back. The latter, LLC misses inducing write-backs, are expected to cause longer CPU stalls than the former.

Our emulator injects delays into a target process depending on the number of LLC misses, in the same manner as Quartz. However, unlike Quartz, our model divides LLC misses into two types: one just reads data from the memory modules (read-only) and the other induces both reading and writing (write-back), as shown in Fig. 4. $MA^{WB}_i$ in Fig. 4 is the number of write-back LLC misses, and $MA^{RO}_i$ is the number of read-only LLC misses within $Epoch_i$. Note that $MA^{WB}_i$ and $MA^{RO}_i$ represent the numbers of LLC misses that actually cause CPU stalls. These two types of LLC misses satisfy the following condition:

$$MA_i = MA^{WB}_i + MA^{RO}_i \qquad (4)$$

We assume that write-back LLC misses stall CPU cores for a longer period than read-only LLC misses.
Let $NVM^{Write}_{lat}$ be the average NVM write latency and $NVM^{Read}_{lat}$ be the average NVM read latency ($NVM^{Write}_{lat} \gg NVM^{Read}_{lat}$). Our model represents the additional delay $\Delta'_i$ as follows:

$$\Delta'_i = MA^{WB}_i \times (NVM^{Write}_{lat} - DRAM_{lat}) + MA^{RO}_i \times (NVM^{Read}_{lat} - DRAM_{lat}) \qquad (5)$$

To calculate the value of $\Delta'_i$, the emulator needs to periodically estimate $MA^{WB}_i$ and $MA^{RO}_i$ of the target process at run-time. However, the performance counters of CPU cores cannot measure the number of write-back LLC misses because of the cache architecture. Therefore, we present a way to estimate the number of write-back LLC misses and thereby achieve a write-back aware NVM emulator. (A sketch of the delay calculation of Eq. (5) is shown below.)
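For illustration, Eq. (5) maps directly to a few lines of code. This is a minimal sketch, assuming the per-epoch miss counts have already been estimated as in Sec. 3.2; all names are illustrative:

    /* Eq. (5): per-epoch delay under asymmetric read/write latencies.
     * ma_wb and ma_ro are the stalling write-back/read-only LLC miss
     * counts of the epoch; all latencies are in the same unit (cycles). */
    double epoch_delay(double ma_wb, double ma_ro,
                       double nvm_write_lat, double nvm_read_lat,
                       double dram_lat)
    {
        return ma_wb * (nvm_write_lat - dram_lat)
             + ma_ro * (nvm_read_lat  - dram_lat);
    }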
3.2 Run-time Estimation of Read-only/Write-back Memory Accesses

This section describes how to calculate the two types of LLC misses ($MA^{RO}_i$ and $MA^{WB}_i$). Our emulation model enables the calculation by making use of the performance counters of the LLC controller in addition to information obtained from the performance counters of CPU cores. Our model defines $MA^{WB}_i$ and $MA^{RO}_i$ as shown in Eq. (6):

$$MA^{WB}_i = \frac{LLC\_STALL^{WB}_i}{DRAM_{lat}}, \quad MA^{RO}_i = \frac{LLC\_STALL^{RO}_i}{DRAM_{lat}} \qquad (6)$$

where $LLC\_STALL^{WB}_i$ and $LLC\_STALL^{RO}_i$ are the total cycles of CPU core stalls caused by write-back LLC misses and read-only LLC misses, respectively. To calculate $LLC\_STALL^{WB}_i$ and $LLC\_STALL^{RO}_i$, our model extends Eq. (3). $LLC_{miss}$ in Eq. (3) can be classified into two types (write-back and read-only), as we have already described in Sec. 2.1. Our model then defines $LLC\_STALL^{WB}_i$ and $LLC\_STALL^{RO}_i$ as Eq. (7) and Eq. (8):

$$LLC\_STALL^{WB}_i = L_{stalls} \times \frac{W \times LLC^{WB}_{miss}}{LLC_{hit} + W \times LLC_{miss}} \qquad (7)$$

$$LLC\_STALL^{RO}_i = L_{stalls} \times \frac{W \times (LLC_{miss} - LLC^{WB}_{miss})}{LLC_{hit} + W \times LLC_{miss}} \qquad (8)$$

where $LLC^{WB}_{miss}$ is the total number of write-back LLC misses.

Due to the lack of suitable performance monitoring events in CPU cores, $LLC^{WB}_{miss}$ cannot be counted directly. Therefore, our model estimates $LLC^{WB}_{miss}$ using other available monitoring functions. To estimate $LLC^{WB}_{miss}$, there are two key factors: (1) the number of write-backs within a certain period, and (2) the degree of contribution of the target process to these write-backs. To measure factor (1), our model uses an uncore performance counter implemented in the cache controller. Intel processors such as the Intel Xeon have LLC controllers called LLC coherency engines (CBo) [16]. Because CBo counters monitor the number of cache lines written back to the memory modules, they enable our model to measure factor (1) directly. Next, to estimate factor (2), our model measures the number of all LLC misses caused by CPU cores and their prefetchers in the system. We expect that the degree of contribution of a certain CPU core to write-backs can be estimated from the proportion of its LLC misses to the whole. Assume that a certain core causes 40,000 LLC misses in an epoch and the total number of LLC misses in the same epoch is 200,000; the LLC misses caused by that core then occupy 20% of all the LLC misses. Since write-back requests are induced by LLC misses, the number of write-backs caused by the core in this epoch is expected to be 20% of all write-backs. Thus, if the total number of write-backs in this epoch is 50,000, the number of write-back LLC misses of the core is expected to be 10,000. Based on these considerations, our model estimates $LLC^{WB}_{miss}$ with Eq. (9):
$$LLC^{WB}_{miss} = WB \times \frac{LLC_{miss}}{\sum_{i=0}^{n-1} LLC_{miss,cpu_i} + \sum_{i=0}^{n-1} LLC_{miss,PF_i}} \qquad (9)$$

where $WB$ is the total number of write-back operations performed by the cache controller, $n$ is the number of CPU cores of the processor, $\sum_{i=0}^{n-1} LLC_{miss,cpu_i}$ is the sum of the numbers of LLC misses caused by every CPU core, and $\sum_{i=0}^{n-1} LLC_{miss,PF_i}$ is the sum of the numbers of LLC misses caused by every prefetcher. Eq. (9) calculates the ratio of the LLC misses of the target process to the LLC misses of the whole system and then multiplies the ratio by the number of write-backs. Thus, the equation gives us the estimated number of write-back LLC misses caused by a specific process. The sketch below combines Eqs. (6)–(9) into one estimation step.
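Putting Eqs. (6)–(9) together, a per-epoch estimation step might look like the following sketch; the counter names follow the equations, and all function and parameter names are illustrative, not the prototype's actual code:

    #include <stdint.h>

    /* Per-epoch estimation of stalling write-back/read-only LLC misses
     * (Eqs. (6)-(9)); all names are illustrative. */
    typedef struct {
        double ma_wb; /* MA^WB_i */
        double ma_ro; /* MA^RO_i */
    } miss_split_t;

    miss_split_t estimate_miss_split(uint64_t l2_stall_cycles,  /* L_stalls             */
                                     uint64_t llc_hit,          /* LLC_hit (this core)  */
                                     uint64_t llc_miss,         /* LLC_miss (this core) */
                                     uint64_t llc_miss_all_cpu, /* sum over all cores   */
                                     uint64_t llc_miss_all_pf,  /* sum over prefetchers */
                                     uint64_t wb,               /* WB from CBo counters */
                                     double W, double dram_lat)
    {
        /* Eq. (9): attribute system-wide write-backs to this core by its
         * share of all LLC misses (cores + prefetchers). */
        double all_misses = (double)llc_miss_all_cpu + (double)llc_miss_all_pf;
        double llc_wb_miss = all_misses > 0 ? wb * ((double)llc_miss / all_misses) : 0;

        /* Eqs. (7)-(8): split the LLC-miss stall cycles of Eq. (3). */
        double denom = (double)llc_hit + W * (double)llc_miss;
        double stall_wb = denom > 0 ? l2_stall_cycles * (W * llc_wb_miss) / denom : 0;
        double stall_ro = denom > 0 ? l2_stall_cycles * (W * (llc_miss - llc_wb_miss)) / denom : 0;

        /* Eq. (6): convert stall cycles to stalling miss counts. */
        miss_split_t s = { stall_wb / dram_lat, stall_ro / dram_lat };
        return s;
    }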
3.3 Applying to an Intel Processor

We implemented a prototype of our emulator for the Intel Haswell architecture. Table 1 shows the performance counter events corresponding to the variables of the above equations [17], [16]. $DRAM_{lat}$ and $W$ are static values depending on the performance of a given machine and can be measured using a tool such as Intel Memory Latency Checker (MLC) [18]. A sketch of reading such an event from user space is shown after Table 1.

Table 1  The performance monitoring events of the Haswell architecture family used in the proposed emulator.

Performance events of CPU counters [17]:
- $L_{stalls}$: CYCLE_ACTIVITY:STALLS_L2_PENDING
- $LLC_{hit}$: MEM_LOAD_UOPS_L3_HIT_RETIRED:XSNP_NONE
- $LLC_{miss}$, $LLC_{miss,cpu_i}$: MEM_LOAD_UOPS_L3_MISS_RETIRED:LOCAL_MEM
- $LLC_{miss,PF_i}$ + $LLC_{miss,cpu_i}$: OFFCORE_RESPONSE_0 (offcore rsp: 0x3FB84003F7)

Performance events of CBo (LLC controller) counters [16]:
- $WB$: LLC_VICTIMS.M_STATE
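As an illustration of how such core events can be read from user space on Linux, the sketch below programs one raw hardware event via the standard perf_event_open(2) interface and reads it for the current process. The raw event encoding shown is a placeholder; the correct event/umask/cmask encoding for each event in Table 1 must be taken from the Intel documentation, and the CBo (uncore) counters are exposed through a different PMU (e.g., /sys/bus/event_source/devices/uncore_cbox_*) rather than PERF_TYPE_RAW:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Placeholder raw encoding; look up the real value for e.g.
     * CYCLE_ACTIVITY:STALLS_L2_PENDING in the Intel documentation. */
    #define RAW_EVENT_PLACEHOLDER 0x0ULL

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_RAW;          /* raw core PMU event */
        attr.config = RAW_EVENT_PLACEHOLDER;
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        /* Monitor the calling process (pid 0) on any CPU (-1). */
        int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... run the code to be measured (one Epoch) ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count;
        if (read(fd, &count, sizeof(count)) == sizeof(count))
            printf("event count: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }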
Fig. 5  The overview of the emulation mechanism of our emulator. At every Epoch, the controller daemon suspends the emulated process with SIGSTOP, reads the counters of the CPU cores and the LLC controller, waits for the calculated delay, and resumes the process with SIGCONT.

Fig. 5 shows the execution flow of the controller daemon of the emulator. Both the controller daemon and the emulated process run on the same multi-core processor during the emulation. The controller daemon periodically calculates and injects an additional delay at every fixed interval (Epoch). When an Epoch elapses, the controller daemon suspends the execution of the target process. It then reads the performance counters of the CPU cores and the LLC controller to obtain the values of the performance events shown in Table 1. It calculates the additional delay from the obtained values using our emulation model. The target process is kept suspended until its idle time reaches the calculated delay. Finally, the controller daemon resumes the target process and waits for the next Epoch. The controller daemon uses POSIX signals to suspend and resume the process execution. Our prototype is portable because other Intel processor families are equipped with performance counters that support equivalent events. The sketch after this paragraph outlines the daemon loop.
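A minimal sketch of this control loop follows; the stand-ins for the counter reads and for the model of Sec. 3.2 (read_and_estimate, delay_ns) are illustrative, not the prototype's actual code:

    #include <signal.h>
    #include <time.h>
    #include <sys/types.h>

    /* Illustrative stand-ins for the counter reads and the model. */
    typedef struct { double ma_wb, ma_ro; } miss_split_t;
    extern miss_split_t read_and_estimate(void);   /* counters -> Eqs. (6)-(9) */
    extern double       delay_ns(miss_split_t s);  /* Eq. (5), converted to ns */

    static void sleep_ns(long ns) {
        struct timespec ts = { ns / 1000000000L, ns % 1000000000L };
        nanosleep(&ts, NULL);
    }

    /* Controller daemon loop (Fig. 5): run, suspend, measure, delay, resume. */
    void emulate(pid_t target, long epoch_ns) {
        for (;;) {
            sleep_ns(epoch_ns);            /* let the target run for one Epoch */
            kill(target, SIGSTOP);         /* suspend the target process */
            miss_split_t s = read_and_estimate();
            sleep_ns((long)delay_ns(s));   /* keep the target idle for the delay */
            kill(target, SIGCONT);         /* resume until the next Epoch */
        }
    }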
4. Evaluation
To verify the effectiveness of our emulation model, we evaluated the prototype of the proposed emulator using a computer with an Intel Xeon E5-2637 v3 processor, which is of the Intel Haswell architecture. Table 2 shows the details of our evaluation environment. We used Intel MLC to measure the values of $DRAM_{lat}$ and $W$. In the experiments, we configured the Epoch parameter so that our emulator strikes a good balance between accuracy and calculation overhead. Because setting a longer Epoch value increases the possibility that the emulator fails to track short temporal changes in the memory access behavior of a workload, a shorter Epoch value is preferable in this sense. However, a shorter Epoch value increases the CPU load due to the calculation overhead. We observed that in the experiments an Epoch of 20 ms was appropriate to accurately emulate NVM with negligible calculation overhead. In other situations, it may be necessary to tune the Epoch value to obtain sufficient accuracy, especially for workloads whose memory access behaviors change frequently. We will conduct further investigation on the relationship between emulation accuracy and Epoch in future work.

We used the machine exclusively for the emulation and also configured the operating system to stop unnecessary services. We consider that the cache hit ratio in the experiments will be close to that of a machine with a real NVM device. This paper focuses on the evaluation of our emulator for single-threaded (or single-process) workloads. However, if the emulator is applied to multi-threaded (or multi-process) workloads, the temporary suspension of each thread may change the behavior of cache contention. This may result in a different cache hit ratio and degrade the accuracy of the emulation. We will report the feasibility of the emulator for such applications in our upcoming work.
Table 2  Our evaluation environment.
Processor: Intel Xeon E5-2637 v3
OS: CentOS 6.10 (Linux 2.6.32)
Memory: 32 GB DDR4 RAM @2400 MHz
Epoch: 20 ms
$DRAM_{lat}$: 122 ns (measured with Intel MLC)
$W$: measured with Intel MLC
4.1 A Tool to Measure Write-back Latency

To evaluate the precision of our model in emulating the NVM write latency, we developed a tool named wbbench that measures the average latency of write-back LLC misses. Fig. 6 shows the pseudo code of wbbench. In order to accurately measure cache miss latencies, wbbench is carefully designed to suppress the effects of prefetching and out-of-order execution.

    double wbbench(void)
    {
        void *memory_region = malloc(line_count * 64);
        /* Split the region into a randomly ordered linked list of
         * 64-byte, cache-line-aligned objects (struct cacheline). */
        generate_random_address_list(memory_region, line_count);
        struct cacheline *clp = get_nextline_from_list();

        double start_time = get_time();
        while (clp != NULL) {
            clp->value = 0xFFFF;            /* (a) modify an LLC line */
            clp = get_nextline_from_list(); /* load a new line (causes a write-back) */
        }
        double end_time = get_time();

        return (end_time - start_time) / line_count; /* average write-back latency */
    }

Fig. 6  Pseudo code of wbbench.
First, wbbench calls malloc() to reserve a certain amount of memory region. It then calls generate_random_address_list() to split the memory region into a linked list of cache-line-aligned objects (i.e., struct cacheline). Each cacheline object is aligned to the size of an LLC line (64 bytes). The cacheline objects of the linked list are arranged in a random order; a cacheline object points to a next object that is likely located at a distant address. While executing the while() loop, wbbench writes a value to the cacheline object currently referred to by a pointer (clp). Next, it calls get_nextline_from_list() to obtain the address of the next cacheline object in the list and store it in the pointer, which causes an LLC miss with a line eviction. Since the cacheline objects in the list are arranged randomly, wbbench suppresses the effects of memory prefetching and out-of-order execution: memory prefetching is not effective for random accesses to cache lines, and the out-of-order execution of the CPU does not work effectively for the pointer traversal of a linked list. The size of the memory region is set to be sufficiently larger than the size of the LLC, so that at each iteration of the while() block a write-back LLC miss occurs with high probability. Wbbench measures the total elapsed time of the while() loop and calculates the average write-back latency.

We also implemented robench, a tool to measure the average latency of read-only LLC misses. The robench code is almost the same as the wbbench code; the only difference is that robench does not modify a cache line (i.e., it skips the line marked as (a) in the pseudo code). Since the LLC misses caused by robench do not induce write-backs, robench can measure the read-only LLC miss latency.

To confirm the accuracy of the latency measurement of wbbench and robench, we measured the LLC miss latencies of a computer with DRAM-based main memory. The read-only and write-back latencies should be the same for a DRAM reference. For comparison, Intel MLC, which does not distinguish read/write latencies, was also used to measure the DRAM latency. Table 3 shows the DRAM access latencies measured on our experimental environment. The measurement errors of wbbench and robench in comparison with Intel MLC are 1.5 ns (1.2%) and 0.9 ns (0.7%), respectively. We observed that their output results are very close to those of Intel's proprietary measurement program. The results indicate that our latency measurement programs are sufficiently accurate.

Table 3  DRAM access latencies measured with different tools. (*Intel MLC does not distinguish the read/write latencies.)
Measured read latency: 121.7 ns (Intel MLC); 122.6 ns (with robench)
Measured write latency: * (Intel MLC); 123.2 ns (with wbbench)

4.2 Validating Accuracy of Emulation

We evaluated the accuracy of the proposed emulation mechanism using wbbench and robench. We set target read/write latencies in the emulator and then measured the actually-emulated latencies with wbbench and robench, i.e., wbbench or robench was executed on our latency emulator. If our prototype can accurately emulate the write latency of NVM, a target write latency and its actually-emulated latency will be very close.
To ensure that every get_nextline_from_list() call induces an LLC miss, we set the size of the memory region reserved by wbbench/robench to 30 MB, which is twice as large as the LLC size of our environment (15 MB).

Table 4  NVM latencies configured by our prototype and measured with wbbench/robench.
Configured read/write lat. | wbbench: measured lat. (error) | robench: measured lat. (error)
122 ns / 200 ns | 202.1 ns (1.1%) | 125.3 ns (2.7%)
122 ns / 300 ns | 300.4 ns (0.1%) | 125.6 ns (3.0%)
122 ns / 400 ns | 399.2 ns (-0.2%) | 125.8 ns (3.1%)
122 ns / 500 ns | 497.8 ns (-0.4%) | 126.4 ns (3.6%)
122 ns / 1000 ns | 988.7 ns (-1.1%) | 128.6 ns (5.4%)

Table 4 shows the evaluation results. The emulated write latency was varied from 200 ns to 1000 ns, while the read latency was kept the same as the actual DRAM latency. When applying our emulator to wbbench, the NVM write latencies were emulated with errors of 0.1% to 1.1%. In addition, when applying our emulator to robench, the NVM read latencies were emulated with errors of 2.7% to 5.4%. These results show that our mechanism can emulate asymmetric read/write latencies with negligible errors.

4.3 Applying to Various Workloads

To show the effectiveness of our emulation model for estimating the performance of future NVM devices, we evaluated the performance of various workloads when emulating NVM-based main memory. We executed benchmark programs of SPECCPU 2006 and applied our prototype to them to emulate their behavior on NVM-based main memory. We measured the execution time of each benchmark program in the emulation. We used 28 benchmark programs mixing compute-intensive and memory-intensive workloads in the experiment. We also measured the memory write throughput of each benchmark program to see the intensity of memory writes. To measure write throughput, we used an internal performance counter of the memory controller, which measures the total bytes written to the memory modules. The average write throughput was calculated by dividing the total written data size by the total execution time of the benchmark.

Fig. 7 shows the execution time of each benchmark program in the latency emulator, and Fig. 8 shows their average write throughput. In the experiments, we set the target NVM read latency to the same value as the DRAM read latency, while we set the target NVM write latency to 300 ns, 500 ns, and 1000 ns. According to the results, compute-intensive workloads such as 416.gamess, 435.gromacs, and 444.namd keep their performance the same as on DRAM-based main memory because they cause a small number of write-backs. On the other hand, write-intensive workloads such as 433.milc, 459.GemsFDTD, and 462.libquantum show an increase in execution time and a degradation of write throughput due to the high NVM write latency.
Fig. 7  Execution time of SPECCPU 2006 benchmark programs when emulating the read/write asymmetric latencies of NVM. The results are normalized to the no-emulation case. The target NVM write latencies were set to higher values than the DRAM latency (300 ns, 500 ns, and 1000 ns), while the target NVM read latency was always set to the same as the DRAM latency (122 ns).

Fig. 8  Write throughput of each benchmark program in the emulation. The experimental condition is the same as in Fig. 7.

Fig. 9  Proportions of LLC misses induced by CPU cores to all the LLC misses during the experiments.

These results indicate that our model can emulate the behavior of practical workloads running with NVM-based main memory according to their memory access characteristics.

Some workloads in Figs. 7 and 8 are not sensitive to their write intensiveness. For instance, 437.leslie3d and 470.lbm are the third and fourth most write-intensive of all the benchmarks; however, the slowdown of their execution time in the emulation was smaller than that of other write-intensive workloads. On the other hand, 458.sjeng and 471.omnetpp are less write-intensive, while their execution time increased more sharply as a higher write latency was emulated. Thus, the intensity of memory writes is not the only factor that determines how a write latency impacts workload performance. The effectiveness of prefetchers explains the results of these applications. Fig. 9 shows the proportions of LLC misses induced by CPU cores to all the LLC misses during the experiment. The graph shows the values of the five selected write-intensive benchmarks in addition to 437.leslie3d, 470.lbm, 458.sjeng, and 471.omnetpp. As we described in Sec. 2.3, LLC misses are induced not only by CPU cores but also by prefetchers. As shown in Fig. 9, the proportion of CPU-induced LLC misses in the execution of 437.leslie3d and 470.lbm was quite small because of the prefetchers. On the contrary, the LLC misses that occurred when executing 458.sjeng and 471.omnetpp were more likely induced by the CPU cores themselves. These results show that our emulator is effective in estimating the performance impact of memory-level parallelism on NVM systems.

Since Quartz does not distinguish read/write latencies, users may obtain erroneous results when using it to emulate an NVM device with asymmetric read/write latencies.
Fig. 10  Execution time of SPECCPU 2006 benchmarks when setting the emulated NVM read/write latencies to 500 ns. The results are normalized to no emulation.

Table 5  Elapsed time to perform the NVM emulation/simulation. In the emulation/simulation, the NVM write latency was configured to 300 ns.
444.namd: bare execution 14.4 sec; our emulator 14.7 sec; NVMain & gem5 81210.8 sec
462.libquantum: bare execution 8.8 sec; our emulator 9.1 sec; NVMain & gem5 61917.5 sec

Thus, we examined how each SPECCPU benchmark program behaved when we did not distinguish read/write latencies. Fig. 10 shows the results when we configured both NVM read/write latencies to 500 ns in the emulation. When setting both latencies to 500 ns, several benchmark programs such as 429.mcf, 433.milc, and 471.omnetpp experienced more serious performance degradation than in the case of setting only the write latency to 500 ns. Since the read-only LLC misses induced by these workloads are more dominant than write-back LLC misses, setting both emulated NVM read/write latencies to 500 ns resulted in worse performance than setting only the write latency to 500 ns. This fact indicates that the capability of emulating read/write latencies independently is indispensable for accurate emulation of NVM devices.

To clarify the slowness of cycle-accurate simulators, we measured the elapsed time of a cycle-accurate simulation for SPECCPU 2006 benchmark programs. We set up a cycle-accurate simulation system comprising a CPU simulator (gem5 [15]) and a memory simulator (NVMain [10]), and executed the 444.namd and 462.libquantum benchmark programs of SPECCPU 2006 on it. Because the simulation system is too time-consuming, we used smaller datasets for the benchmark programs than those used in the other experiments. Table 5 shows the evaluation results. As shown in the table, the simulation is several thousand times slower than our emulator. The results indicate that our emulator is far more light-weight than cycle-accurate simulators.

4.4 A Case Study using a Realistic Workload

As a case study with a realistic workload, we applied our emulator prototype to Memcached, an in-memory key-value store database. We chose memaslap as a client application of Memcached. Memaslap randomly generates get/set requests following a given set/get proportion and sends them to a Memcached server during a given time period. We executed memaslap for one minute and measured the average throughput (operations per second). In the experiment, a Memcached server program and our emulator were executed on the machine shown in Table 2. The number of Memcached worker threads was set to one, because this paper focuses on validating the emulation accuracy for single-threaded workloads; the emulation accuracy for multi-threaded workloads will be discussed in future work. Memaslap was executed on another machine with an Intel Xeon CPU E5-2650 v4 @2.2 GHz. The number of memaslap worker threads was set to eight. The key and value sizes of each request were set to 128 bytes and 2048 bytes, respectively; we chose these parameter values so that the Memcached workload would cause a sufficient number of LLC misses.

Fig. 11 shows the throughput of memaslap when setting the emulated NVM read/write latencies to the same value. We compared our model with Quartz; in this experiment, we also executed the original Quartz on our evaluation environment and applied it to Memcached, and its evaluation results are also shown in the figure. It should be noted that memaslap achieves its best performance when the set:get ratio is 5:5; thus, most results in the figure have peak throughput at set5:get5. The figure shows that the throughput of memaslap decreased as the emulated NVM latency was set higher. When emulating the same latency, we observed nearly the same throughput with our emulator and Quartz. This fact indicates that both our emulator and Quartz accurately emulate main memory with symmetric read/write latencies.
Fig. 11  Throughput of memaslap when setting the emulated NVM read/write latencies to the same value using two emulators: ours (wb_aware) and an existing emulator (quartz). The experiments were conducted at different ratios of set/get operations; for example, set1:get9 means the ratio of set/get is 1:9.

Figs. 12, 13, and 14 show the throughput of memaslap when the emulators were intended to emulate an NVM device with asymmetric read/write latencies. We tried to emulate an NVM device with a read latency of 122 ns (i.e., the same as DRAM) and a write latency of 300 ns. Since Quartz does not distinguish read/write latencies, we had no choice but to set its latency to 300 ns. As shown in the results, there is a significant performance difference between our emulator and Quartz; in all three figures, the throughput emulated by Quartz is lower than that of our emulator. Our emulator only delays LLC misses that are expected to induce write-backs. On the other hand, Quartz delays both read-only and write-back LLC misses, since it is not aware of the difference between the two types of LLC misses. This indicates that our emulator has great advantages in emulating read/write asymmetric memory devices. The use of Quartz likely under-estimates application performance for such NVM devices.

Fig. 12  Throughput of memaslap when setting the NVM write latency to 300 ns with our emulator (wb_aware). The result is compared with Quartz setting 300 ns as the NVM read/write latency.

Fig. 13  Throughput of memaslap when setting the NVM write latency to 500 ns with our emulator (wb_aware). The result is compared with Quartz setting 500 ns as the NVM read/write latency.
5. Related Work
Cycle-accurate simulators such as NVMain, DRAMSim2, and NVSim are widely used to evaluate software performance on NVM systems [10], [14], [19]. In general, these memory simulators are combined with processor simulators such as gem5 and MARSS [15], [20]. They calculate the full-system behavior of a target architecture per CPU cycle. This approach can set NVM read/write latencies independently, while the time required for a simulation is enormous. Our experiments found that full-system simulation with NVMain and gem5 is approximately three orders of magnitude slower than the light-weight emulation of our proposed emulator.

Fig. 14  Throughput of memaslap when setting the NVM write latency to 1000 ns with our emulator (wb_aware). The result is compared with Quartz setting 1000 ns as the NVM read/write latency.

Some researchers have customized the hardware of commodity computer systems to accurately imitate the behavior of NVM-based main memory. The Persistent Memory Emulation Platform (PMEP) enables NVM latency emulation with special CPU microcode of an Intel Xeon processor and a customized BIOS [8]. The microcode monitors a batch of LLC misses and injects additional delays to emulate higher NVM latency. Lee et al. integrated an FPGA-based NVM emulator on an ARM System-on-Chip board [21]. A hardware module implemented in the FPGA part monitors read/write requests issued from the CPU cores to a DRAM controller and inserts additional delays into each request. These hardware-based mechanisms can emulate slow NVM accesses with small performance overhead, while such hardware customization is not easy for software researchers.

LEEF [22] is an NVM emulation platform that provides both full-system simulation and light-weight emulation. The emulation mode of LEEF supports several emulation models based on existing work [8], [23], [24]. However, the accuracy of these emulation models heavily depends on the type of workload; it is reported that LEEF causes emulation errors of approximately 30% to 40% in the worst case. To complement the accuracy of these models, LEEF also proposes a regression method to select an optimal emulation model according to the type of workload. However, the details of the regression method are not clear in the paper.

Quartz [12] is the work most similar to our emulator, as described before, while Quartz lacks support for asymmetric read/write latencies. HME is another software-based emulator using CPU performance counters [25]. HME also tries to emulate slow NVM writes by counting the number of LLC lines written back to the main memory modules. Its emulation model calculates a delay to be inserted into the execution of a target process from the total number of write-back requests. It, however, evenly distributes the delay to each CPU core. This approach is not accurate because it does not consider important factors such as the per-core difference in LLC miss frequency and memory-level parallelism. In contrast, our emulation model covers these factors.
6. Conclusion
In this paper, we presented a software-based emulation mechanism supporting the asymmetric read/write latencies of NVM-based main memory. It can emulate the behavior of NVM-based main memory using normal DRAM-based computers. The emulation model of our emulator inserts a delay into the execution of a target process. It calculates the delay from the number of LLC misses and write-back operations, using the performance counters of the CPU cores and the LLC controller in a processor. We implemented a prototype of the emulation model for an Intel processor family (i.e., Haswell) and evaluated its accuracy through experiments. The results of the experiments showed that our proposed mechanism successfully emulated target read/write latencies with negligible errors of 0.1% to 1.1%. We confirmed that the use of an existing emulator without support for asymmetric latencies (i.e., Quartz) seriously under-estimated the performance of several workloads. The use of our emulator, thanks to its modeling of the write-back mechanism of a processor, successfully produced realistic performance estimates for these workloads. Because emerging NVM devices such as PCM, ReRAM, and MRAM generally have asymmetric read/write latencies, our emulator has great advantages for the emulation of main memory comprising NVM.

In future work, we will further evaluate the accuracy of the proposed mechanism using actual NVM devices that are supposed to become available in the upcoming years. Since the energy consumed by reading and writing NVM differs, we assume that our write-back aware emulation model is also effective for evaluating the energy performance of NVM devices. We will clarify the effectiveness of our model for the energy asymmetry of NVM.
Acknowledgment
This work is supported by JSPS KAKENHI Grants 16K00115 and 19H01108.
References
[21] T. Lee, D. Kim, H. Park, S. Yoo, and S. Lee, "FPGA-based prototyping systems for emerging memory technologies," 2014 25th IEEE International Symposium on Rapid System Prototyping, pp.115–120, Oct. 2014.
[22] G. Zhu, K. Lu, X. Wang, and Y. Dong, "Building emulation framework for non-volatile memory," 2017 IEEE 37th International Conference on Distributed Computing Systems Workshops (ICDCSW), pp.330–333, June 2017.
[23] D. Sengupta, Q. Wang, H. Volos, L. Cherkasova, J. Li, G. Magalhaes, and K. Schwan, "A framework for emulating non-volatile memory systems with different performance characteristics," Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, ICPE '15, New York, NY, USA, pp.317–320, ACM, 2015.
[24] V. Spiliopoulos, S. Kaxiras, and G. Keramidas, "Green governors: A framework for continuously adaptive DVFS," 2011 International Green Computing Conference and Workshops, pp.1–8, July 2011.
[25] Z. Duan, H. Liu, X. Liao, and H. Jin, "HME: A lightweight emulator for hybrid memory," 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp.1375–1380, March 2018.
Atsushi Koshiba is a Ph.D. student in the Department of Electric and Information Sciences at Tokyo University of Agriculture and Technology. He received a master's degree from the Department of Computer and Information Sciences at the same university in 2016. His research interests include operating systems, heterogeneous computing, and energy-saving technologies for computer systems. He is a student member of ACM, IEEE, and IPSJ.
Takahiro Hirofuchi is a senior researcher of the National Institute of Advanced Industrial Science and Technology (AIST) in Japan. He is working on system software technologies for non-volatile memory devices. He obtained a Ph.D. of engineering in March 2007 at the Graduate School of Information Science of Nara Institute of Science and Technology (NAIST). He obtained a B.S. in Geophysics at the Faculty of Science of Kyoto University in March 2002. He is an expert in operating system, virtual machine, and network technologies.
Ryousei Takano is a research group leader of the National Institute of Advanced Industrial Science and Technology (AIST), Japan. He received his Ph.D. from the Tokyo University of Agriculture and Technology in 2008. He joined AXE, Inc. in 2003 and then, in 2008, moved to AIST. His research interests include operating systems and distributed parallel computing. He is currently exploring an operating system for heterogeneous accelerator clouds.