An Operating System Level Data Migration Scheme in Hybrid DRAM-NVM Memory Architecture
Reza Salkhordeh and Hossein Asadi
Data Storage Systems & Networks (DSN) Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran. Email: [email protected] and [email protected]
Abstract—With the emergence of Non-Volatile Memories (NVMs) and their shortcomings, such as limited endurance and high power consumption for write requests, several studies have suggested hybrid memory architectures employing both Dynamic Random Access Memory (DRAM) and NVM in the memory system. By conducting comprehensive experiments, we have observed that such studies fail to consider important aspects of hybrid memories, including the effect of: a) data migrations on performance, b) data migrations on power, and c) the granularity of data migration. This paper presents an efficient data migration scheme at the Operating System level in a hybrid DRAM-NVM memory architecture. In the proposed scheme, two Least Recently Used (LRU) queues, one for the DRAM section and one for the NVM section, are used to manage data migration. Based on a careful characterization of workloads from the PARSEC benchmark suite, the proposed scheme prevents unnecessary migrations and only allows migrations which benefit the system in terms of power and performance. The experimental results show that the proposed scheme can reduce power consumption by up to 79% compared to a DRAM-only memory and by up to 48% compared to state-of-the-art techniques.
I. INTRODUCTION
In the past decades, computer designers have steadily used Dynamic Random Access Memory (DRAM) as the main memory due to its prominent features such as high performance and low cost per GB. Despite its performance and cost efficiency, DRAM still suffers from a frequent refresh requirement and low scalability. Refreshing DRAM cells every few milliseconds imposes significant power consumption, no matter how many accesses are dispatched to the main memory. The power usage of DRAM is even more pronounced when the system is mostly idle. In addition, the low scalability of DRAM limits the maximum main memory size that can be used in a computer system [1].

To alleviate the limitations of DRAM, Non-Volatile Memories (NVMs) have emerged in recent studies, offering zero leakage current for preserving data and fewer scalability issues compared to DRAM. Among the various NVMs proposed in the past years, Phase-Change Memory (PCM), Spin-Transfer Torque RAM (STT-RAM), and Resistive RAM (ReRAM) are recognized as the most promising candidates for the main memory [2]. Despite their prominent features, NVMs have serious shortcomings, such as high dynamic write power and long write latency (similar to solid-state drives [3]), which prevent them from entirely replacing the DRAM technology. NVMs have asymmetric characteristics for read and write requests. In most emerging NVMs, write requests require more time to complete and, therefore, performance will be lower in write-dominant workloads. From the power perspective, write requests consume more power than read requests. In addition, NVMs have very limited write cycles compared to DRAM, despite several efforts to increase their lifetime [4], [5].

Due to the shortcomings of DRAM, several studies have attempted to employ NVMs in the main memory of computer systems. A few of these studies explore the possibility of entirely replacing DRAM with NVMs [6], [7]. A recent study shows that NVMs cannot reach the performance and power consumption of DRAM in the near future [6]. Other studies investigate using a hybrid memory composed of both DRAM and NVM and its possible effects on the Operating System (OS) [8], [9], [10]. Hybrid memories try to exploit the characteristics of DRAM and NVM in order to improve performance or power consumption compared to a DRAM-based main memory. Clock-DWF [8] is one of the most recent studies in this field; it uses two clock algorithms, one for managing DRAM and another for managing NVM. This technique tries to move data pages between the two memories in order to reduce power consumption while maintaining almost the same performance level. Clock-DWF outperforms previous work such as CLOCK-PRO and CAR, making it the strongest technique in the literature. The simulation of Clock-DWF over a hybrid DRAM-NVM memory, however, does not consider the effect of migrations between the DRAM and NVM memories. In addition, the effect of moving data pages between the main memory and the secondary storage has been neglected. There are also several studies that employ a hybrid memory architecture in on-chip memory [11], whose discussion is beyond the scope of this work.

This paper presents a data migration scheme in a hybrid memory architecture employing both DRAM and NVM in the main memory. The main aim of the proposed scheme is to reduce the number of non-beneficial data migrations between the DRAM and NVM memories in order to improve both performance and power efficiency. To this end, we use two Least Recently Used (LRU) queues (one for DRAM and one for NVM) and optimize the LRU queue for NVM to prevent non-beneficial migrations to DRAM. The optimizations in the LRU queue are minimal and, therefore, the proposed scheme has almost the same hit ratio as an unmodified LRU. Contrary to Clock-DWF, where each write hit results in moving the page to DRAM, in the proposed scheme every hit in the NVM LRU is treated like a hit in the plain LRU algorithm, with one difference: if a page stays among the top pages of the LRU queue for more than a threshold number of accesses, it is considered hot and is moved to DRAM. Since the cost of moving a data page between the two memories is high, this threshold prevents the non-beneficial migrations that are very likely to occur in previous studies such as Clock-DWF.

Both the proposed scheme and previous studies have been simulated using a framework developed to mimic the Linux memory management layer. The performance and power characteristics are extracted from the same sources as previous studies. We also used PARSEC to run the experiments [12]. Since the multi-level caches of the CPU affect the distribution of accesses dispatched to the main memory, we used the COTSon full-system simulator [13], which is able to simulate a multi-core system with multiple cache levels. The experimental results show that the proposed scheme can reduce power consumption by up to 48% (14% on average), improve performance by up to 70% (48% on average), and improve endurance by up to 93% (64% on average) compared to previous studies. Compared to a DRAM-based main memory, power consumption is reduced by up to 79% (43% on average).

The rest of the paper is organized as follows. Section II presents our model for evaluating the performance and power consumption of hybrid memories. The motivation of this work is discussed in Section III. The proposed data migration scheme is presented in Section IV. Experimental results are reported in Section V.
Finally, Section VI concludes the paper.

TABLE I: Parameters Description
Parameter        Description
P_HitDRAM        DRAM Memory Hit Probability
P_HitNVM         NVM Memory Hit Probability
P_RDRAM          DRAM Read Access Probability
P_RNVM           NVM Read Access Probability
P_WDRAM          DRAM Write Access Probability
P_WNVM           NVM Write Access Probability
P_Miss           Main Memory Miss Probability
P_MigD           Probability of NVM to DRAM Migration
P_MigN           Probability of DRAM to NVM Migration
P_DiskToD        Probability of Moving a Page to DRAM due to a Page Fault
P_DiskToN        Probability of Moving a Page to NVM due to a Page Fault
T_RDRAM          DRAM Memory Read Latency (s)
T_RNVM           NVM Memory Read Latency (s)
T_WDRAM          DRAM Memory Write Latency (s)
T_WNVM           NVM Memory Write Latency (s)
T_Disk           Disk Access Latency (s)
Po_RDRAM         DRAM Read Dynamic Power (nJ)
Po_WDRAM         DRAM Write Dynamic Power (nJ)
Po_RNVM          NVM Read Dynamic Power (nJ)
Po_WNVM          NVM Write Dynamic Power (nJ)
PageFactor       Number of Memory Accesses Required to Move One Data Page
AvgStaticPower   Prorated Static Power Over All Requests
StperPage        Static Power Consumption of a Page (nJ/s)
AccessperPage    Average Number of Accesses to Each Page (1/s)

II. PERFORMANCE AND POWER MODELS IN A HYBRID MEMORY
This section presents a model for the performance and power consumption of hybrid memories. The proposed model tries to consider all aspects of a computer system that influence the performance and/or the power consumption. In addition to the traditional movement of pages in case of a miss or an eviction, hybrid memories have migrations between the two memories. The migration between the two memories depends on the architecture of the hybrid memory. For the sake of generality, we consider separate memory modules for DRAM and NVM that communicate through Direct Memory Access (DMA). If both memory types can be assembled in one module, the migrations can be done more efficiently. Such an integrated memory, however, requires hardware modifications that are out of the scope of this paper. In the following sections, the performance and power models are presented.
A. Performance Model
The performance model depends on the delays of DRAM and NVM, the granularity of eviction, and the delay of migration between the memories. For measuring performance, we use Average Memory Access Time (AMAT). The overhead of migrations is prorated over all accesses to the memory. Equation 1 shows the formula for AMAT; the parameters are described in Table I. In this equation, the first two terms calculate the AMAT for all hit accesses in either DRAM or NVM. The third term considers page faults. Since transferring a data page from disk to memory is done with DMA, the delay of writing data blocks to memory is overlapped with reading the next data block from the disk. Therefore, the OS only observes the disk delay, which is the only delay considered in this term.
AMAT = P_HitDRAM * (P_RDRAM * T_RDRAM + P_WDRAM * T_WDRAM)
     + P_HitNVM * (P_RNVM * T_RNVM + P_WNVM * T_WNVM)
     + P_Miss * T_Disk
     + P_MigD * PageFactor * (T_RNVM + T_WDRAM)
     + P_MigN * PageFactor * (T_RDRAM + T_WNVM)                    (1)

The last two terms calculate the migration cost between the two memories. Upon a migration, a data page is read from one memory and written to the other. Since the granularity of data pages is considerably larger than the actual memory accesses (typically 4 up to 16B), we use PageFactor, a coefficient that converts the movement of a data page into the required number of memory accesses. The granularity of moves between the disk and the memory modules, and between the two memories, is a data page, which is typically 4KB or 8KB. In this paper, we assume 4KB data pages. Moving a data page from disk to either of the memories might result in a migration between the two memories, depending on the algorithm employed for managing the hybrid memory. The proposed performance model takes this type of migration into account.
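To make the model concrete, Equation 1 can be expressed directly as a function. This is an illustrative sketch: the function name and parameter ordering are ours, while the parameter names follow Table I.

```python
def amat(p_hit_dram, p_r_dram, p_w_dram, t_r_dram, t_w_dram,
         p_hit_nvm, p_r_nvm, p_w_nvm, t_r_nvm, t_w_nvm,
         p_miss, t_disk, p_mig_d, p_mig_n, page_factor):
    """Average Memory Access Time for a hybrid DRAM-NVM memory (Eq. 1)."""
    # Hit accesses serviced by DRAM and NVM, respectively
    hit_dram = p_hit_dram * (p_r_dram * t_r_dram + p_w_dram * t_w_dram)
    hit_nvm = p_hit_nvm * (p_r_nvm * t_r_nvm + p_w_nvm * t_w_nvm)
    # Page faults: DMA overlaps memory writes with disk reads,
    # so only the disk delay is visible to the OS
    miss = p_miss * t_disk
    # Migrations: read a whole page from one memory, write it to the other
    mig_to_dram = p_mig_d * page_factor * (t_r_nvm + t_w_dram)
    mig_to_nvm = p_mig_n * page_factor * (t_r_dram + t_w_nvm)
    return hit_dram + hit_nvm + miss + mig_to_dram + mig_to_nvm
```

With 4KB pages moved in 64B memory accesses, page_factor would be 4096/64 = 64.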
B. Power Model
The proposed power model tries to consider every aspect of hybrid memories in order to provide more accurate and realistic estimations of the power consumption of computer systems. While static power is consumed regardless of the number of requests arriving at the memory, dynamic power is consumed per request sent to the memory. Our power model considers the migrations between the two memories and the movement of pages from disk to either memory module, as well as the static and dynamic power for servicing requests.

The dynamic power consumption is calculated per access to the memory. This makes the power model independent of the application runtime and the memory size. Therefore, we introduce Average Power Per Request (APPR) as a metric for measuring power, as shown in Equation 2. Similar to the performance model, the first two terms calculate the power for all hit accesses to the memories. The third and fourth terms consider the write power for moving a data page from disk to a memory module. The last two terms take into account the power effect of the migrations between the two memories.
APPR = P_HitDRAM * (P_RDRAM * Po_RDRAM + P_WDRAM * Po_WDRAM)
     + P_HitNVM * (P_RNVM * Po_RNVM + P_WNVM * Po_WNVM)
     + P_Miss * P_DiskToD * PageFactor * Po_WDRAM
     + P_Miss * P_DiskToN * PageFactor * Po_WNVM
     + P_MigD * PageFactor * (Po_RNVM + Po_WDRAM)
     + P_MigN * PageFactor * (Po_RDRAM + Po_WNVM)                  (2)

Since the static power consumption is independent of the requests, we introduce a new parameter, called AvgStaticPower, which prorates the static power consumption over all requests arriving at the memory in a given time interval. The reason for prorating the static power over all requests is that, from the OS perspective, the main memory consumes power (both static and dynamic) for servicing the requests, and both sources of power consumption should be considered part of the cost of servicing the requests. For a specific workload, AvgStaticPower is calculated according to Equation 3.
AvgStaticPower = StperPage / AccessperPage                         (3)
Here, AvgStaticPower can be combined with the dynamic power to form an APPR that models all power aspects of hybrid memories. It is worth mentioning that the dynamic power consumption is still independent of the memory size and workload. As expected, the static power per request remains dependent on the memory size and the request service rate.
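Equations 2 and 3 can likewise be sketched as functions. This is an illustrative sketch in the same spirit as the performance model; the names follow Table I, while the function names are ours.

```python
def appr_dynamic(p_hit_dram, p_r_dram, p_w_dram, po_r_dram, po_w_dram,
                 p_hit_nvm, p_r_nvm, p_w_nvm, po_r_nvm, po_w_nvm,
                 p_miss, p_disk_to_d, p_disk_to_n,
                 p_mig_d, p_mig_n, page_factor):
    """Dynamic part of the Average Power Per Request (Eq. 2)."""
    # Hit accesses serviced by DRAM and NVM
    hit = (p_hit_dram * (p_r_dram * po_r_dram + p_w_dram * po_w_dram)
           + p_hit_nvm * (p_r_nvm * po_r_nvm + p_w_nvm * po_w_nvm))
    # Write power of moving a faulted page from disk into DRAM or NVM
    page_fault = p_miss * page_factor * (p_disk_to_d * po_w_dram
                                         + p_disk_to_n * po_w_nvm)
    # Power of page migrations between the two memories
    migration = (p_mig_d * page_factor * (po_r_nvm + po_w_dram)
                 + p_mig_n * page_factor * (po_r_dram + po_w_nvm))
    return hit + page_fault + migration

def avg_static_power(st_per_page, access_per_page):
    """Eq. 3: prorate a page's static power over its access rate."""
    return st_per_page / access_per_page
```

The total APPR is then the sum of appr_dynamic and the prorated static term.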
III. MOTIVATION
Designing hybrid memories employing both DRAM and NVM has been discussed in many previous works. One group of previous studies uses DRAM as a caching layer for the NVM memory [10], [14], [15]. Similar to other caching techniques, if the locality of the requests drops below a threshold, the performance of the cache decreases. In addition, the algorithms employed in the DRAM cache can be moved into the Last Level Cache (LLC) of the CPU in order to evict mostly read-dominant data pages [16]. Another group of previous studies, similar to our proposed scheme, uses DRAM and NVM at the same level of the memory hierarchy [8], [9], [17], [18]. Many of these studies require hardware modifications in the memory module controllers [17], [9]. There are also a few software-driven techniques that try to use the existing interfaces between the OS and the memory modules [18], [8].

Fig. 1: DRAM Power Breakdown

CLOCK-DWF [8], which is one of the most effective techniques in previous work, is the study closest to this paper, and it outperforms many earlier techniques such as CLOCK-PRO [19]. Hence, we provide an in-depth analysis of its performance and power. CLOCK-DWF uses two clock algorithms, one for each memory module. Upon a page fault, if the request causing the page fault is a write, the page is moved to DRAM; otherwise it is moved to NVM. The modifications in the clock algorithm enable CLOCK-DWF to find popular, write-dominant data pages and move them to the DRAM memory. If a write request arrives for a data page residing in NVM, the data page is moved to DRAM. Migrating pages between the two memories requires many accesses to both memories, but this effect is not considered in CLOCK-DWF, which results in inaccuracy of its model. In the remainder of this section, we analyze CLOCK-DWF with respect to the proposed performance and power models.

Before examining CLOCK-DWF, we calculate the maximum power saving that can be achieved by reducing the static power consumption. The proposed power model can also characterize homogeneous memories; hence, a single DRAM main memory can be modeled as well. Considering a DRAM-only main memory with LRU as the eviction policy, Fig. 1 shows the composition of the power consumption sources for various workloads. Since static power contributes 60-80% of the total power consumption of a DRAM main memory, reducing the static power consumption has a significant effect on the overall system power consumption. As shown in Fig. 1, the streamcluster benchmark does not behave like the other workloads.
According to Table III, this workload has a large burst of accesses and a small memory footprint, which results in higher dynamic power consumption. Workloads with a high hit ratio in the LLC of the CPU have higher static power consumption per request, since fewer requests reach the main memory and the static power is prorated over a smaller number of requests.

CLOCK-DWF maintains two clock algorithms, for DRAM and NVM. The clock algorithm in the NVM is the traditional clock algorithm with one difference: if a write access arrives for a data page in NVM, the corresponding data page is moved to DRAM. Therefore, no write access is serviced by NVM. The main aim of this method is to reduce the number of writes in NVM. Although it prevents any writes from reaching NVM, each write access to a data page in NVM results in a data page migration between the two memories. The clock algorithm for DRAM, however, is different: it tries to keep write-dominant data pages in the DRAM memory and evicts the mostly read-dominant data pages. This is motivated by the fact that read-only pages have a better performance-power trade-off in NVM than write-dominant pages. Upon a page fault, if the request is a read, the corresponding data page is moved to NVM; if it is a write, the data page is moved to DRAM.
A. Power Analysis
Fig. 2a depicts the power consumption of CLOCK-DWF normalized to the power consumption of a DRAM-only memory. In all workloads, the static power consumption is reduced by 80%, which shows the effectiveness of hybrid memories in reducing static power. Although CLOCK-DWF decreases the power consumption in many workloads, there are workloads in which it fails to improve power consumption and has worse power efficiency than a DRAM-only memory. The streamcluster benchmark is read-dominant, and CLOCK-DWF moves its read-only data pages to NVM. Therefore, the DRAM area is almost idle and NVM services most of the requests, which causes the dynamic power consumption to be higher than that of a DRAM-only main memory. The two other benchmarks with higher power consumption than DRAM are canneal and fluidanimate. Although these two workloads are read-intensive, the behavior of the applications causes CLOCK-DWF to migrate a data page to NVM and, after a short time, bring the migrated data page back to DRAM. It is worth noting that blackscholes is a read-only benchmark; its dynamic power consumption is similar to a DRAM-only memory because, while DRAM is empty, data pages are moved to DRAM regardless of the request type. In many of the workloads examined in this paper, the contribution of migrations to the power consumption is more than 40%. This is due to the fact that, when the DRAM memory is full, each write access to a data page in NVM triggers a migration from NVM to DRAM and, in turn, a migration from DRAM to NVM.
B. Performance Analysis
In terms of performance, the sources of latency observed by applications are the delay of responding to requests, the delay of migrations, and the delay of page faults. Similar to the power analysis, the performance analysis shows how much latency must be paid in order to use a hybrid memory. Fig. 2b shows the contribution of each source of delay to the AMAT, normalized to the AMAT of a DRAM-only main memory. The AMAT calculated for requests is very close to the results reported in the CLOCK-DWF study. Migrations, however, were not considered in that study. Based on the proposed model, the delay caused by migrations is considerable and contributes more than 60% of the total AMAT. Therefore, the performance, like the power, is greatly degraded by non-beneficial migrations. If the hybrid memory algorithm identifies and prevents these migrations, it reduces the migration cost in terms of performance, power, and endurance. Beneficial migrations, however, should still be allowed in order to exploit the benefits of hybrid memories.
C. Endurance Analysis
As mentioned earlier, CLOCK-DWF does not issue any write requests to the NVM; all writes are serviced by DRAM. Therefore, the only sources of writes in NVM are migrations from DRAM to NVM and the movement of data pages from disk to NVM on a page fault caused by a read request. Although the data pages in NVM are read-dominant, each write request for a data page in NVM results in a high number of physical writes, since the granularity of moving a data page is typically three orders of magnitude larger than CPU requests. Fig. 2c shows the contribution of the various sources of writes in NVM. The number of writes is normalized to an NVM-only main memory to show how much CLOCK-DWF can reduce the total number of writes. In most of the workloads, writes issued for migrations contribute more than 50% of the total writes in NVM. This excessive use of migrations makes the overall number of writes even higher than in an NVM-only main memory. Hence, the lifetime of NVM is heavily penalized by CLOCK-DWF.
IV. PROPOSED DATA MIGRATION SCHEME
Fig. 2: a) CLOCK-DWF Power Breakdown Normalized to DRAM Power Consumption, b) Normalized AMAT of CLOCK-DWF Compared to DRAM-Only Memory, c) Number of Writes in CLOCK-DWF Normalized to NVM-Only Memory

Non-beneficial migrations are the biggest flaw in CLOCK-DWF and other previous work. Therefore, the proposed scheme tries to identify and prevent this type of migration. In addition, the proposed scheme aims to maintain almost the same hit ratio as conventional algorithms in order to have performance comparable to a DRAM-only main memory with the LRU algorithm.

The proposed scheme consists of two LRU queues, one for DRAM and one for NVM. In order to preserve a high hit ratio, the algorithm employed in both queues is LRU without any modification. The proposed scheme manages the migrations between the two memories and the movement of pages from/to disk. Both memories therefore operate with LRU, and the proposed scheme decides when a data page should be migrated to the other memory. Furthermore, upon moving a data page into a memory, it is treated according to the algorithm of that memory, e.g., moved to the head of the LRU queue while the last page of the queue is evicted. This is one of the main differences between this work and previous studies: in the previous studies, the algorithms for managing pages inside the memories need to be changed, which results in a lower hit ratio.

In order to find the data pages whose migration will improve power consumption and performance (with respect to the migration cost), the proposed scheme stores additional information about data pages, such as read and write counters, in the NVM LRU queue. Note that this additional information does not interfere with LRU, and LRU does not need to know about this housekeeping information. For each data page in the NVM queue, two counters are stored that count the number of read and write accesses to the corresponding data page from the time the page enters the queue.

Fig. 3 shows the architecture of the proposed data migration scheme consisting of two LRU queues. Dashed lines depict actions performed by the proposed technique, and solid lines belong to the traditional LRU management algorithm.
Dark data pages are more frequently accessed and are considered hot data pages. Contrary to CLOCK-DWF, which places page faults issued by read requests on NVM, the proposed scheme moves all pages from disk to the DRAM area. This is motivated by the fact that moving a page to either NVM or DRAM results in a page write in NVM, since DRAM is always full and moving a data page into DRAM triggers an eviction to NVM. Therefore, the cost of moving to NVM or DRAM is the same in terms of writes in NVM. Newly accessed data pages have a higher probability of access than older data pages, and moving the new page to DRAM increases the DRAM hit ratio instead of the NVM hit ratio. This improves both performance and power efficiency, since DRAM is superior in terms of dynamic power and delay. The overhead of storing the housekeeping information is not considerable: about 0.04% for 4KB data pages. However, keeping the counters for all pages in NVM has a few drawbacks. First, it requires an ordering scheme in order to identify data pages that are cold but are accessed once in a long while. Such data pages reside in NVM long enough to accumulate high counter values and are therefore moved to DRAM, where they cannot compete with hot data pages and soon return to NVM, making their migration to DRAM without any benefit. Second, there would be no difference between pages that are frequently accessed and typically reside near the head of the NVM LRU queue the entire time, and data pages that go back and forth in the queue.

Fig. 3: Proposed Data Migration Scheme in a Hybrid Memory Architecture

In the proposed scheme, another mechanism has been added to handle both of the above-mentioned issues. The housekeeping information is stored only for a small percentage of the top positions of the NVM LRU queue. Once a data page moves past the end of this selected percentage of the LRU queue, the corresponding counter is reset to zero. This handles both the ordering scheme and the identification of bursty data accesses. Since NVMs have different costs for reads and writes in terms of power and performance, we treat them differently in the proposed scheme. Write-dominant data pages should have a higher priority than read-dominant data pages for migrating to DRAM, since they cost more in NVM. Therefore, the writeperc and writethreshold parameters are set to higher values than readperc and readthreshold.

Algorithm 1 shows the flow of the proposed scheme upon the arrival of a request. Since DRAM contains the hottest data pages, the proposed scheme searches DRAM first and, if the page is not found there, it searches NVM. Finding the data page in DRAM results in normal LRU housekeeping. Otherwise, the extra housekeeping information in NVM is updated based on the request type. The read and write counters are stored for the readperc and writeperc top data pages in the NVM, respectively. Therefore, in case of a hit, the read and write counters of data pages that drop out of the top data pages are cleared. Lines 10 through 22 initialize the counters for the corresponding data page. If the value of the counter for a data page in NVM exceeds the read_threshold or write_threshold (depending on the request type), the page is migrated to DRAM. Inserting a new data page into memory and the eviction policies are unchanged from LRU and, therefore, such details are omitted from the algorithm for the sake of brevity.
The values of read_threshold and write_threshold determine how aggressively we prevent migrations with a low probability of being beneficial. They are closely related to the cost of migration between DRAM and NVM, which in turn depends on the performance and power characteristics of the employed NVM.

V. EXPERIMENTAL RESULTS
In this section, the experimental setup used to extract traces from the workloads and the experimental results for both the proposed method and previous studies are presented.

Algorithm 1: Data Migration in a Hybrid Memory
 1: Search for request address in DRAM_LRU;
 2: if request address is found in DRAM then
 3:   Update DRAM_LRU;
 4: else
 5:   Search for request address in NVM_LRU;
 6:   if request address is found in NVM then
 7:     Update NVM_LRU;
 8:     Reset read counter for page in position readperc;
 9:     Reset write counter for page in position writeperc;
10:     if request is read then
11:       if request is within readperc then
12:         page_read_counter = page_read_counter + 1;
13:       else
14:         page_read_counter = 1;
15:       end
16:     else
17:       if request is within writeperc then
18:         page_write_counter = page_write_counter + 1;
19:       else
20:         page_write_counter = 1;
21:       end
22:     end
23:     if (request is read and page_read_counter > read_threshold) or
        (request is write and page_write_counter > write_threshold) then
24:       Migrate page to DRAM;
25:     end
26:   else
27:     Issue page fault from Disk to DRAM;
28:     Migrate from DRAM to NVM if necessary;
29:   end
30: end

TABLE II: COTSon Configuration

CPU                    Quad-core with MOESI protocol
L1 Data Cache          32KB, WB, 4-way set associative, 64B line size
L1 Instruction Cache   32KB, WB, 4-way set associative, 64B line size
Last-Level Cache       2MB, WB, 16-way set associative, 64B line size
Main Memory            2x 2GB DDR2
Secondary Storage      HDD with 5 ms response time
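The flow of Algorithm 1 can be approximated with a short Python sketch. This is an illustration, not the authors' implementation: the class name and parameter defaults are hypothetical, and the counter-reset rule is simplified (a counter restarts when its page is accessed outside the counting window, rather than being reset exactly as the page crosses the readperc/writeperc boundary).

```python
from collections import OrderedDict

class HybridMemory:
    """Sketch of Algorithm 1: two plain LRU queues plus hot-page
    promotion from NVM to DRAM. Defaults are hypothetical."""

    def __init__(self, dram_pages, nvm_pages,
                 read_perc=0.25, write_perc=0.5,
                 read_threshold=4, write_threshold=2):
        self.dram_pages, self.nvm_pages = dram_pages, nvm_pages
        self.dram = OrderedDict()            # MRU at the end
        self.nvm = OrderedDict()             # page -> [read_cnt, write_cnt]
        self.read_win = max(1, int(read_perc * nvm_pages))
        self.write_win = max(1, int(write_perc * nvm_pages))
        self.read_thr, self.write_thr = read_threshold, write_threshold

    def _position(self, page):
        # Distance from the MRU end of the NVM queue (0 = most recent)
        return list(reversed(self.nvm)).index(page)

    def access(self, page, is_write):
        if page in self.dram:                # DRAM hit: plain LRU update
            self.dram.move_to_end(page)
            return "dram_hit"
        if page in self.nvm:                 # NVM hit: update counters
            window = self.write_win if is_write else self.read_win
            in_window = self._position(page) < window
            counters = self.nvm[page]
            idx = 1 if is_write else 0
            counters[idx] = counters[idx] + 1 if in_window else 1
            self.nvm.move_to_end(page)
            threshold = self.write_thr if is_write else self.read_thr
            if counters[idx] > threshold:    # hot page: promote to DRAM
                del self.nvm[page]
                self._insert_dram(page)
                return "migrated_to_dram"
            return "nvm_hit"
        self._insert_dram(page)              # page fault: load into DRAM
        return "page_fault"

    def _insert_dram(self, page):
        if len(self.dram) >= self.dram_pages:
            # Evict the DRAM LRU page to NVM (and NVM LRU to disk if full)
            victim, _ = self.dram.popitem(last=False)
            if len(self.nvm) >= self.nvm_pages:
                self.nvm.popitem(last=False)
            self.nvm[victim] = [0, 0]
        self.dram[page] = None
```

With these defaults, a page in NVM is promoted only after its write counter exceeds write_threshold while it stays inside the top writeperc positions of the NVM queue, which is exactly the filtering that suppresses non-beneficial migrations.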
A. Experimental Setup
The proposed scheme and previous studies are evaluated using the proposed performance and power models. For further accuracy, we used COTSon [13], a full-system simulator, to obtain memory traces. The memory traces are extracted by running the actual benchmark programs in a Linux virtual machine inside COTSon, and only memory accesses from the Region Of Interest (ROI) of each benchmark are considered. PARSEC-3.0 [12] was selected as the benchmarking suite. The input of all benchmarks was set to the largest available dataset in order to minimize the effect of starting from a cold memory. The swaptions workload is not included in the results due to compilation issues on our platform.

COTSon simulated a quad-core CPU with two levels of cache and 4GB of main memory running the Ubuntu operating system. Using a quad-core CPU ensures that there are always enough requests issued to the memory to simulate a production server. The detailed configuration of the simulated hardware is reported in Table II. In order to fully understand the effect of different workload parameters on the behavior of hybrid memories, the main features of the workloads are presented in Table III and are discussed in the next subsection. In the experiments, the total memory size is set to 75% of the total pages, and the DRAM size is set to 10% of the total memory size, similar to previous studies [8]. The performance and power characteristics of DRAM and NVM, reported in Table IV, are obtained from the same source as CLOCK-DWF in order to have a fair comparison.

TABLE III: Workload Characterization

Workload   Working Set Size (KB)
TABLE IV: Memory Characteristics [8]

Memory      Latency r/w (ns)   Dynamic Power r/w (nJ)   Static Power (J/GB·s)
DRAM        50/50              3.2/3.2                  1
NVM (PCM)   100/350            6.4/32                   0.1
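Using the Table IV characteristics, the cost of a single NVM-to-DRAM page migration can be estimated. The 64B access granularity (hence PageFactor = 64 for 4KB pages) is an assumption on our part, chosen to match the 64B cache line size in Table II.

```python
# Back-of-the-envelope cost of one NVM -> DRAM page migration,
# assuming 4KB pages moved in 64B memory accesses (an assumption).
PAGE_FACTOR = 4096 // 64           # = 64 accesses per page

T_R_NVM, T_W_DRAM = 100, 50        # ns, from Table IV
PO_R_NVM, PO_W_DRAM = 6.4, 3.2     # nJ, from Table IV

# A migration reads the whole page from NVM and writes it to DRAM
mig_latency_ns = PAGE_FACTOR * (T_R_NVM + T_W_DRAM)   # 9600 ns
mig_energy_nj = PAGE_FACTOR * (PO_R_NVM + PO_W_DRAM)  # ~614.4 nJ
```

At 150 ns and about 9.6 nJ per transferred line, a single page migration costs roughly two orders of magnitude more than one ordinary memory access, which is why filtering out non-beneficial migrations pays off.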
B. Experimental Results
Fig. 4a depicts the normalized power consumption of CLOCK-DWF and the proposed scheme compared to a DRAM-only main memory. For each workload, the left and right bars represent CLOCK-DWF and the proposed scheme, respectively. In most of the workloads, the proposed scheme has better power efficiency compared to CLOCK-DWF, with a few exceptions which will be addressed later in this section. As shown in Fig. 4a, the power consumption of the proposed scheme is up to 48% (14% on average) less than that of CLOCK-DWF (average numbers reported throughout the paper are geometric means). In addition, the proposed scheme can reduce the total power consumption of the main memory by up to 79% (43% on average) compared to a DRAM-only main memory. The static power consumption is the same for both methods since they are evaluated with the same DRAM and NVM sizes. The main benefit of the proposed scheme is that the power consumed by migrations is decreased significantly compared to CLOCK-DWF; the migration cost is reduced by up to 80%.

Among the benchmark programs, canneal, fluidanimate, and streamcluster have unusual characteristics, such as a small footprint or a lack of read-dominant data pages, which increase the dynamic and migration power and make them unsuitable for hybrid memories. Contrary to the other workloads, in the raytrace workload the migration cost of the proposed scheme is higher than that of CLOCK-DWF. Our analysis shows that the optimal values of read_threshold and write_threshold for this workload differ from those of the other workloads, which causes many non-beneficial migrations between the two memories. It is worth noting that adaptive threshold prediction can further improve the efficiency of the proposed scheme; this is part of our ongoing research.

One of the main differences between CLOCK-DWF and the proposed scheme is how they treat write requests that access data pages in NVM. CLOCK-DWF moves such data pages to DRAM, while the proposed scheme tries to serve the request from NVM. Fig. 4b shows the number of writes arriving at NVM, normalized to an NVM-only main memory. Without considering migrations, CLOCK-DWF reduces the number of writes dispatched to NVM. Once migrations are considered, however, CLOCK-DWF issues up to 3.7x more writes to NVM than an NVM-only main memory, which significantly affects the lifetime of NVM. The proposed scheme, on the other hand, limits the number of migrations between memories and therefore issues fewer writes to NVM. This tradeoff between dispatching requests to NVM and migrating data pages to DRAM changes the contribution of the different sources of writes in NVM: the proposed scheme favors issuing writes to NVM instead of migrating the whole data page to DRAM, while CLOCK-DWF does the opposite. This change in policy results in a significant decrease (up to 93%) in the number of writes in NVM compared to CLOCK-DWF. In addition, the proposed scheme can reduce the number of writes in NVM by up to 75% (49% on average) compared to an NVM-only main memory, which prolongs its lifetime by up to 4x. In the streamcluster and vips benchmark programs, CLOCK-DWF performs slightly better since burst accesses to data pages fall near the threshold of being a beneficial migration, and the proposed scheme may take a wrong decision in such cases.

Fig. 4: a) Power Breakdown of CLOCK-DWF (Left Bar) and the Proposed Scheme (Right Bar) Normalized to DRAM Power Consumption, b) Number of Writes in CLOCK-DWF (Left Bar) and the Proposed Scheme (Right Bar) Normalized to NVM-Only Memory, and c) Normalized AMAT of the Proposed Scheme Compared to CLOCK-DWF

From the performance perspective, as we concluded in Section III, migrations add considerable delay to the average request response time in CLOCK-DWF. Fig. 4c depicts the normalized AMAT of the proposed scheme compared to CLOCK-DWF. The proposed scheme successfully limits the number of migrations, and the contribution of migrations to AMAT is less than 50% in most of the workloads. Limiting the migrations improves the AMAT of the proposed scheme by up to 70% (48% on average) compared to CLOCK-DWF. Preventing non-beneficial migrations is not the only reason the proposed scheme outperforms CLOCK-DWF: its policy for selecting migration targets also contributes, since placing hot data pages in DRAM improves AMAT. In the raytrace and vips benchmarks, CLOCK-DWF has better AMAT since the proposed scheme issues a high number of migrations.
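The decision logic discussed above, which serves writes from NVM and migrates a page to DRAM only once its access counters show the migration would pay off, might be sketched as follows. This is a simplified illustration under our own assumptions, not the paper's implementation; the class, method, and threshold names (`NVMQueue`, `access`, `READ_THRESHOLD`, `WRITE_THRESHOLD`) are hypothetical, and the constant values are arbitrary since the paper tunes the thresholds per workload (see the raytrace discussion).

```python
from collections import OrderedDict

READ_THRESHOLD = 4    # illustrative values only; the paper's thresholds
WRITE_THRESHOLD = 2   # are workload-dependent

class NVMQueue:
    """LRU queue of NVM-resident pages with per-page access counters.

    A page migrates to DRAM only once its counters show it is hot
    enough for the migration cost to be amortized; otherwise requests,
    including writes, are served directly from NVM (unlike CLOCK-DWF,
    which migrates a page on its first write)."""
    def __init__(self):
        self.pages = OrderedDict()   # page -> {"reads": n, "writes": n}

    def access(self, page, is_write):
        counters = self.pages.setdefault(page, {"reads": 0, "writes": 0})
        counters["writes" if is_write else "reads"] += 1
        self.pages.move_to_end(page)          # maintain LRU order
        if (counters["reads"] >= READ_THRESHOLD
                or counters["writes"] >= WRITE_THRESHOLD):
            del self.pages[page]              # hand over to the DRAM queue
            return "migrate-to-dram"
        return "serve-from-nvm"

q = NVMQueue()
print(q.access(7, is_write=True))   # first write: served from NVM
print(q.access(7, is_write=True))   # hits WRITE_THRESHOLD: migrate
```

Mistuned thresholds reproduce the raytrace behavior reported above: too-low values trigger non-beneficial migrations, while too-high values leave hot pages paying NVM write latency.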
VI. CONCLUSION
NVMs are emerging memory technologies that, unlike DRAM, do not suffer from high leakage power and do not depend on the power supply to retain data. NVMs, however, have their own limitations, which prevent them from entirely replacing DRAM. Hybrid memories try to reduce the power consumption of the main memory while maintaining high performance. Previous studies fail to consider all aspects of hybrid memories, and the inaccuracy of their models results in inefficient hybrid memories. In this paper, we first presented both performance and power models for hybrid memories. Using the proposed models, we identified the shortcomings of previous studies and proposed a novel data migration scheme for hybrid memory. The proposed scheme consists of two LRU queues with efficient algorithms to manage data migration. The experimental results show that the proposed scheme can reduce the power consumption by up to 79% compared to a DRAM-only memory and by up to 48% compared to previous studies.

REFERENCES
[2] More than Moore Technologies for Next Generation Computer Design, 2015, pp. 127–153.
[3] R. Salkhordeh, H. Asadi, and S. Ebrahimi, “Operating system level data tiering using online workload characterization,”
The Journal of Supercomputing, vol. 71, no. 4, pp. 1534–1562, 2015.
[4] S. Yazdanshenas, M. Pirbasti, M. Fazeli, and A. Patooghy, “Coding last level STT-RAM cache for high endurance and low power,” Computer Architecture Letters, vol. 13, no. 2, pp. 73–76, July 2014.
[5] M. Tarihi, H. Asadi, A. Haghdoost, M. Arjomand, and H. Sarbazi-Azad, “A hybrid non-volatile cache design for solid-state drives using comprehensive I/O characterization,” IEEE Transactions on Computers, vol. In Press, pp. 1–1, 2015.
[6] E. Kultursay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, “Evaluating STT-RAM as an energy-efficient main memory alternative,” in Performance Analysis of Systems and Software (ISPASS), 2014, pp. 256–267.
[7] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting phase change memory as a scalable DRAM alternative,” in International Symposium on Computer Architecture (ISCA), 2009, pp. 2–13.
[8] S. Lee, H. Bahn, and S. Noh, “CLOCK-DWF: A write-history-aware page replacement algorithm for hybrid PCM and DRAM memory architectures,” IEEE Transactions on Computers (TC), vol. 63, no. 9, pp. 2187–2200, 2013.
[9] G. Dhiman, R. Ayoub, and T. Rosing, “PDRAM: A hybrid PRAM and DRAM main memory system,” in , 2009, pp. 664–669.
[10] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable high performance main memory system using phase-change memory technology,” in International Symposium on Computer Architecture (ISCA), 2009, pp. 24–33.
[11] F. Sampaio, M. Shafique, B. Zatt, S. Bampi, and J. Henkel, “Energy-efficient architecture for advanced video memory,” in International Conference on Computer-Aided Design (ICCAD), 2014, pp. 132–139.
[12] C. Bienia, “Benchmarking modern multiprocessors,” Ph.D. dissertation, Princeton University, January 2011.
[13] E. Argollo, A. Falcón, P. Faraboschi, M. Monchiero, and D. Ortega, “COTSon: Infrastructure for full system simulation,” SIGOPS Operating Systems Review, vol. 43, no. 1, pp. 52–61, 2009.
[14] M. Gamell, I. Rodero, M. Parashar, and S. Poole, “Exploring energy and performance behaviors of data-intensive scientific workflows on systems with deep memory hierarchies,” in , Dec 2013, pp. 226–235.
[15] H. Khouzani, C. Yang, and J. Hu, “Improving performance and lifetime of DRAM-PCM hybrid main memory through a proactive page allocation strategy,” in , Jan 2015, pp. 508–513.
[16] R. Rodríguez-Rodríguez, F. Castro, D. Chaver, L. Pinuel, and F. Tirado, “Reducing writes in phase-change memory environments by using efficient cache replacement policies,” in Design, Automation and Test in Europe (DATE), 2013, pp. 93–96.
[17] L. E. Ramos, E. Gorbatov, and R. Bianchini, “Page placement in hybrid memory systems,” in International Conference on Supercomputing (ICS), 2011, pp. 85–95.
[18] Z. Fan, D. Du, and D. Voigt, “H-ARC: A non-volatile memory based cache policy for solid state drives,” in Mass Storage Systems and Technologies (MSST), June 2014, pp. 1–11.
[19] S. Jiang, F. Chen, and X. Zhang, “CLOCK-Pro: An effective improvement of the CLOCK replacement,” in