SoftWear: Software-Only In-Memory Wear-Leveling for Non-Volatile Main Memory
Christian Hakert∗, Kuan-Hsun Chen∗, Paul R. Genssler†, Georg von der Brüggen∗, Lars Bauer†, Hussam Amrouch†, Jian-Jia Chen∗, Jörg Henkel†
∗ Design Automation for Embedded Systems Group, TU Dortmund University, Germany
† Chair for Embedded Systems, KIT, Germany
Abstract—Several emerging technologies for byte-addressable non-volatile memory (NVM) have been considered to replace DRAM as the main memory in computer systems during the last years. The disadvantage of a lower write endurance, compared to DRAM, of NVM technologies like Phase-Change Memory (PCM) or Ferroelectric RAM (FeRAM) has been addressed in the literature. As a solution, in-memory wear-leveling techniques have been proposed, which aim to balance the wear-level over all memory cells to achieve an increased memory lifetime. Generally, to apply such advanced aging-aware wear-leveling techniques proposed in the literature, additional special hardware is introduced into the memory system to provide the necessary information about the cell age and thus enable aging-aware wear-leveling decisions.
This paper proposes software-only aging-aware wear-leveling based on common CPU features, which does not rely on any additional hardware support from the memory subsystem. Specifically, we exploit the memory management unit (MMU), performance counters, and interrupts to approximate the memory write counts as an aging indicator. Although the software-only approach may lead to slightly worse wear-leveling, it is applicable on commonly available hardware. We achieve page-level coarse-grained wear-leveling by approximating the current cell age through statistical sampling and performing physical memory remapping through the MMU. This method results in non-uniform memory usage patterns within a memory page. Hence, we further propose a fine-grained wear-leveling in the stack region of C/C++ compiled software.
By applying both wear-leveling techniques, we achieve a large share of the ideal memory lifetime, which is a lifetime improvement of orders of magnitude compared to the lifetime without any wear-leveling.

I. INTRODUCTION
Emerging technologies for non-volatile memory (NVM), like Phase-Change Memory (PCM) or Ferroelectric RAM (FeRAM), have been considered as a replacement for DRAM as the main memory over the last years. Most NVM technologies feature advantages like low energy consumption and high integration density, which makes them a desired main memory replacement. One of the major disadvantages of some NVM technologies is the lower write endurance. While classic DRAM endures more than 10^15 write cycles, PCM only endures around 10^7 to 10^9 write cycles per cell [7]. Thus, to wear out a DRAM cell within 10 years, an application would have to write the same memory cell roughly every 1000th CPU cycle on average on a 3 GHz CPU. Applying the same application to PCM, the memory would wear out within about 5 minutes. Although typical applications do not cause such an extreme write pattern, they still cause a highly non-uniform write pattern to the memory [13], [21], [24]. Accordingly, the problem has been tackled in the literature and several in-memory wear-leveling techniques have been proposed. A majority of these techniques is aging-aware [3], [8], [11], [13], [14], [16], [18], [19], [22], [26], which means that the current cell age or the current write count is taken into account for the wear-leveling decisions. The wear-leveling itself is mostly realized through an abstraction layer, which remaps the physical location of logical memory regions. However, as current memory hardware does not provide a write count, which is necessary to determine the cell age, additional hardware is introduced. This hardware requires additional chip space and might be hard to realize in a way that meets the desired granularity and clock frequency. To allow aging-aware wear-leveling in the absence of such special hardware, this paper proposes software-only wear-leveling techniques. The term software-only here means that we do not require any additional hardware from the memory subsystem and only use hardware features which are widely available.
We provide the necessary write count through a statistical online approximation of the write distribution, which only requires a memory management unit (MMU), performance counters, and an interrupt mechanism. The performance counter allows us to generate an interrupt on every n-th memory write access, which achieves an equidistant sampling of the write distribution. A special configuration of the memory access permissions allows recording the target address of a single memory write afterwards. The approximated write distribution subsequently enables an arbitrary aging-aware wear-leveling algorithm. In this paper, we implement a simple wear-leveling algorithm on the granularity of virtual memory pages, which achieves the necessary physical memory remapping through the MMU. Since the resulting memory write distribution still shows high non-uniformity due to the granularity of memory pages, we introduce an additional software-only, fine-grained wear-leveling technique, which balances the write accesses to the stack region by relocating the stack in a circular manner. This is achieved by regularly copying the current stack content to a new location and adjusting the stack pointer accordingly. A special virtual memory configuration allows a hardware-aided wraparound to achieve a circular movement.
Our contributions:
• We deliver a software-only coarse-grained in-memory wear-leveling system, consisting of an online approximation mechanism for the write distribution and an MMU-based wear-leveling algorithm.
• We further provide an extending software-only fine-grained wear-leveling technique, which targets the stack region of C/C++ compiled applications and relocates the stack in a circular manner within a bounded memory region.
We aim to balance the write count to each memory byte in the flat memory space equally to achieve a high memory lifetime. We note that other factors impact the memory endurance as well, e.g., process variation in PCM [24], but the write count is a major factor.
Our approaches can be extended according to physical models (e.g., process variation domains) to also respect advanced physical memory properties. After giving an overview of the related wear-leveling approaches in the literature in Section II, we present the memory write distribution of our benchmark applications in Section III and our method to analyze the write pattern of applications, which is also used for our evaluations, in Section IV. After this, our novel wear-leveling techniques are described in detail in Section V and Section VI. Each section contains an evaluation which uses the write-pattern analysis mechanism. The paper concludes with a short summary in Section VIII.

II. RELATED WORK
During the last years, several approaches for in-memory wear-leveling for NVM have been proposed. These approaches can be categorized along different criteria. First, there are aging-aware approaches [3], [8], [11], [13], [14], [16], [18], [19], [22], [26], which take the current cell age into account to apply wear-leveling. In contrast, there are random-based approaches [13], [21], [26], which apply wear-leveling in a circular or random manner. Both kinds are often combined to achieve random-based wear-leveling at fine granularities inside memory blocks, while an aging-aware approach is used to target these coarse-grained memory blocks. The granularity also varies from single bits [9], [25] over cache-lines [21], [26] for fine-grained approaches to memory pages [3], [8], [13], [14], [22] or even bigger memory segments [24], [26] for coarse-grained approaches.
Some approaches are not based on remapping the physical memory content through an abstraction layer, but hook into the memory allocation process of the operating system to apply wear-leveling in the memory allocator [3], [18], [22]. Li et al. [18] also propose to use a freshly allocated memory portion for the function's stack memory whenever a function is called, to wear-level the stack region.
Gogte et al. propose a software-only coarse-grained wear-leveling approach by using a sampled approximation of the write distribution [14]. They make use of advanced debugging capabilities, e.g., Intel Processor Event Based Sampling (PEBS), which allows them to sample the write requests from the CPU. These debugging capabilities, however, can rarely be found in embedded systems and resource-constrained hardware.
All other mentioned aging-aware approaches rely on the current write-count information of the memory. Most approaches introduce specialized hardware into the memory controller to collect the write-count information, which is not available in commonly available systems and might be hard to realize. Dong et al.
[11] use an offline recorded memory trace to estimate the write distribution, which limits the approach to a subset of well-known applications only.

III. PROBLEM DESCRIPTION
When considering non-volatile memory as the main memory for program executions, the system may suffer from the low write endurance of the underlying memory technology. Even if the system is also equipped with DRAM, it may be desirable to run certain applications only on the non-volatile memory, to reach energy-saving states as fast as possible. To understand the impact of program executions on main memory with low write endurance, the precise write distribution of a program should be recorded and analyzed. Separating the program's memory into the text, data, bss, and stack regions allows analyzing the write pattern of each region separately and determining the impact on the memory lifetime. This section presents the write distribution for our benchmark applications and points out the influence on the write endurance.
To determine the influence of code executions on the memory write patterns of applications, especially on the different memory regions, we run four benchmark applications. We aggregate the resulting memory trace file at a granularity of 64 bytes (a cache-line is assumed to always be written entirely) into a write-count distribution and present the distributions graphically. As the benchmark applications, we chose the following programs:
1) bitcount: A simple implementation which iterates over an array of data and counts the 1 bits. The resulting count is stored in a global counter and returned at the end.
2) pfor: A simulation of a data decompression scenario. A big set of data is available in a lightweight compressed format, namely Patched Frame of Reference (PFOR) [27]. The data is decompressed and aggregated in fixed-size windows, which simulates the processing of a stream of compressed data.
3) sha: This application is part of the MiBench security suite [15] and calculates the sha sum of a given data set.
4) dijkstra:
This application is also part of the MiBench network suite [15] and calculates a fixed number of shortest paths in a network, using the Dijkstra algorithm.
We chose these benchmarks because they are simple enough to understand the connection between the code and the memory usage of the different segments. The limitation to four benchmarks is due to the high time consumption of the required full system simulations.
Figure 1 shows the resulting write-count distributions of the benchmark applications.

Fig. 1. Memory write-count distribution - baseline

Note that the four applications face different execution times and thus the total amount of writes differs; hence, the scaling of the y axes differs as well. Considering the different memory regions, the following observations can be made:
• text: As the text segment only contains the compiled binary code, it is never written during normal application execution. This behavior is also shown in the result. In the context of wear-leveling, read-only memory regions have to be targeted as well as heavily written memory regions to distribute the wear-levels equally.
• data/bss: The data and bss segments store global program variables, such as global attributes or arrays. Naturally, these variables are written from time to time, depending on the application logic. The dijkstra benchmark has a heavy, non-uniform usage of the bss segment, since the benchmark manages the steps of the algorithm in a queue.
• stack: The stack segment causes the most non-uniform write access to the main memory. This results from the way the stack is typically used: local variables are stored on top of the stack and are removed when they are no longer used. Depending on the application logic, this makes the beginning of the stack a heavily used area with a lot of memory writes, while the rest of the stack region is used less. A wear-leveling algorithm has to distribute the memory writes of this region to all other, less written memory regions.
These results point out the need for aging-aware wear-leveling. The memory writes to hot memory regions have to be redirected mainly to unused memory regions, but also to less used memory regions. This requires monitoring the current write count and an incremental redistribution according to the current write-count distribution.

IV. MEMORY WRITE-PATTERN ANALYSIS
Section III presents the memory write-count distribution of four benchmark applications. On a usual computation platform, the memory accesses of a program cannot be captured and analyzed without special techniques. Debugging mechanisms can overcome the problem but introduce a large overhead. Using a hardware analyzer, which basically plugs an FPGA between the CPU and the memory DIMM, is considered by Bao et al. [4]. Such an analyzer is reasonably fast but requires a complex hardware setup. In this paper, we use a full system cycle-accurate simulator (including CPU, memory, buses, peripherals, etc.) on top of a Linux host instead. This section introduces our simulation environment, which is also used for the results in Section III.
We chose gem5 [6] as the full system simulator, since it can be combined with a memory simulator for non-volatile memories, namely
NVMain2.0 [20], thanks to its modular structure. This setup allows us to obtain all memory accesses of a running program in a logfile, analyze them afterwards, and perform detailed evaluations of our methods by comparing the captured logfiles. To simulate the properties of NVMs, several simulators can be considered (e.g., [23] and [12]), which precisely simulate, for instance, the timing and energy behavior. However, the methods in this paper analyze and change the write behavior of applications only, which is independent of the physical properties of the underlying memory. Thus, we do not involve them in our analysis.
A. Simulation Setup Details

NVMain2.0 provides an option to generate a memory trace file, which contains detailed information for every main memory access. Using this information, we can extract the memory address of each write access and aggregate the writes for each 64-byte sized cache-line, which results in a write-count distribution. (The simulation model of gem5 assumes cache-lines to be written to the memory entirely; hence, we also use this assumption in the analysis. In Figure 1, the gray lines indicate boundaries of 4 kB virtual memory pages, and the data and bss segments are marked as one big data segment.) This method is also independent of the CPU-internal cache configuration, since writes to the main memory are recorded. Even if a write is caused by a logical read operation (cache eviction), this write is captured in our simulation.
We simulate an ARMv8 CPU architecture, the DerivO3CPU implementation, and the VExpress_GEM5_V2 machine. This system includes an advanced CPU with pipelining and out-of-order execution as well as a set of controllers which are typically found in ARM-based systems (e.g., the GIC interrupt controller, the PL011 UART controller, etc.).
Two simulation modes are supported by gem5: system-call emulation and full system simulation. As we want to reduce the influence of the runtime infrastructure (libraries, operating system services, etc.) on the application as much as possible, we run bare-metal full system mode simulations. This requires an operating system to be started in gem5, handling the hardware initialization and providing required services for the running application. We developed a small bare-metal runtime system, which takes the place of the operating system in the simulation setup. Thus, we can initialize the hardware in a flexible way with low overhead (compared to Linux kernel modifications) and only provide the required operating system services. Even though the analyzed application is directly compiled into the binary file of the runtime system, which is started in gem5 afterwards, the runtime system can be seen as part of the simulation environment and not as part of the application. The simulation setup is illustrated in Figure 2.

Fig. 2. Overview of the simulation setup
B. Application Separation
The full system simulation mode of gem5, combined with a small, customized runtime system in place of the operating system, allows us to tightly control the hardware behavior and the memory placement. In this section, we aim to analyze the write access behavior of an application without interference from an operating system, and to analyze the memory regions of the application separately. To achieve this, we apply two separation techniques:
Spatial Separation: During the linking process of the runtime system, the application's memory regions (i.e., text, data, bss, and stack) are placed at a static, separate memory location, which resides apart from the memory locations of the runtime system. Thus, the memory accesses of the application target a separated memory region, which can be analyzed separately in the recorded write-count distribution. Furthermore, the concrete memory addresses of the memory regions can be determined after the linking process, which allows analyzing the recorded write-count distribution separately for each memory region. Hence, the runtime system has to establish an identity mapping (or at least a constant, well-known mapping) from virtual memory addresses to physical memory addresses to be able to determine the different memory regions in the recorded write-count distribution.
Interrupt Separation: The handling of interrupts is separated from the application's stack. Usually the operating system, respectively our runtime system, saves the current register set on the stack when handling an interrupt. An interrupt during the running application would cause the application's stack to be used for the register backup, which would influence the application's write pattern in the stack region. To overcome this, interrupts are handled by the hardware on another stack instead of the application's stack. For ARMv8 architectures, this can be achieved by using two different exception levels [1]. When taking an interrupt to a higher exception level, an ARMv8 CPU can be configured to switch the stack pointer to a dedicated stack pointer of the higher exception level. We run the runtime system on exception level 1 (EL1), using a stack allocated for the runtime system only. The application is executed on exception level 0 (EL0) with the application's stack. Thus, whenever an interrupt occurs during the application execution, the interrupt is handled on EL1 on the stack of the runtime system.
Accordingly, the application's stack is not influenced by interrupts at all.
Both techniques allow analyzing the memory write pattern of isolated applications. Based on this, required wear-leveling actions are deduced and proposed subsequently. In this paper, we only focus on wear-leveling for the test applications. In a real-world setup, the runtime system / operating system also requires wear-leveling to be applied to its memory regions, because its implementation uses the main memory similarly to the test applications. The solutions presented here can also be applied to the runtime system, but this requires some additional implementation effort, since they are provided as a service by the runtime system itself.

V. AGING-AWARE COARSE-GRAINED WEAR-LEVELING
Section III points out the need for aging-aware in-memory wear-leveling when the write endurance is low. If the current write behavior cannot be tracked by the hardware and no memory trace is known for the running application, aging-aware techniques cannot be applied. To overcome this issue, in this section we propose a software-only write distribution approximation technique, which estimates the memory write distribution (i.e., the write count to fixed-size memory regions) using only commonly available hardware support (i.e., MMU, performance counters, and interrupts). The write distribution approximation can subsequently be used to enable an arbitrary aging-aware wear-leveling algorithm. However, to keep our implementation software-only, we developed a simple aging-aware wear-leveling algorithm, which adjusts the virtual memory mapping of the MMU to exchange the physical location of hot (heavily written) and cold (less often written) virtual memory pages. Thus, the entire wear-leveling is coarse-grained with a 4 kB granularity. To omit the need of storing the aging state of the memory as a persistent object, we design our wear-leveling solution incrementally. Hence, at every point in time the algorithm aims to achieve an overall write-count balance in the memory. After a reboot, for instance, the memory can be assumed to be wear-leveled and the incremental wear-leveling can be continued. This furthermore removes the requirement to know the exact age of the memory at any time. Therefore, the approximation does not need to estimate absolute numbers; a relative representation of the write distribution is sufficient. At the end of this section, we evaluate the resulting wear-leveling quality on the previously mentioned benchmark applications.
A. Write Distribution Approximation
Several steps are required to record an approximation of the real write distribution of an application at runtime. To achieve an equidistant sampling of write accesses, i.e., every n-th write access is sampled, the target of every n-th memory write of the application is captured and stored in an appropriate data structure. The number n determines the temporal granularity of the approximation technique, allowing a trade-off between accuracy and introduced overhead. After capturing the write, the spatial granularity of the data structure has to be considered as well. Storing the estimated write count for every byte introduces a large storage overhead and leads to imprecise results when the temporal granularity is coarse. Instead, bytes can be related to larger memory blocks and the write counts aggregated for every write access into these blocks. For our implementation, we aggregate the write counts for 4 kB memory blocks, because the wear-leveling algorithm considers this granularity, i.e., the decision is based on memory pages. Using an 8-byte counter for every block, 8/4096 · memory-size bytes are required to store the approximated write distribution (e.g., 2 MB when 1 GB of main memory is tracked).
The detailed flow of capturing the target of every n-th memory write access requires two techniques to be implemented. First, an interrupt has to be generated after every n-th write access, so the runtime system can take action. Second, the target of the next memory write access has to be determined and stored in the data structure. Both implementations are detailed subsequently. Although the approach by Gogte et al. allows capturing CPU write requests directly at sampled intervals [14], their approach relies on a specialized debugging capability. Our method provides an alternative which makes use of more widely available hardware features.
1) Temporal Write Distribution Sampling:
To generate an interrupt after every n-th write access of the application, we use the CPU-internal performance counting mechanism. In ARMv8, each performance counter can be configured to only record events triggered on EL0; thus there is no interference from executed interrupt handlers. The BUS_ACCESS_ST event counts the total number of store requests on the memory bus, so the number of write accesses of the application is recorded. For Intel CPUs, the same behavior could be achieved by using a performance counter for writebacks of the last-level cache. If no such performance counter is available in a system, an approximation (e.g., the cycle counter) can still be considered. The performance counting mechanism allows generating an interrupt when the performance counter overflows (i.e., exceeds the value 2^32 − 1 of a 32-bit counter). To establish interrupts on every n-th write access, the performance counter is set to 2^32 − n during the handling of the overflow interrupt.
2) Write Access Trapping:
As the last written memory address cannot be determined during the interrupt handling of the performance counter overflow, a second technique is implemented to track the target address of the next memory write. During the handling of the overflow interrupt, the memory access permission for the tracked memory region is set to READ_ONLY. Note that the ARMv8 architecture allows hierarchical memory access permissions, making it possible to configure memory regions of 1 GB size as READ_ONLY by modifying only one page-table entry. Due to the READ_ONLY permission, the next write access causes a permission violation trap, which is handled as an interrupt. The address causing the violation is available to the interrupt handler in a dedicated register and is used to increment the corresponding counter in the write distribution approximation. During the handling of the trap, the access permissions are set back to READ_WRITE. Note that this mechanism does not strictly require an MMU; it could also be implemented with a very lightweight MPU on a microcontroller.

B. Wear-Leveling Algorithm
As mentioned before, the write distribution approximation enables arbitrary aging-aware wear-leveling algorithms. When this technique is used, the integration of the approximation system and the wear-leveling algorithm has to be considered as well. To provide a common interface, the approximation implementation could provide the estimated write counts in a table inside the runtime system's memory and a notification mechanism to trigger the wear-leveling algorithm when a special event occurs (e.g., one estimated counter exceeds a configured threshold). However, to reduce the overhead further, we interleave our wear-leveling algorithm with the approximation implementation to avoid redundantly stored data. Our wear-leveling algorithm uses a red-black tree [5] to maintain all managed virtual memory pages along with their estimated age. As the estimated age is already present inside the tree nodes, there is no need to store these values in the approximation implementation as well.
1) Management of Memory Pages:
Our wear-leveling algorithm is based on a red-black tree as the management data structure, which contains all managed physical memory pages together with their estimated cell age. Whenever a virtual memory page should be relocated to another physical memory page, the current minimum is extracted from the tree as the target physical page and the estimated ages are adjusted accordingly. Regarding the overhead, the wear-leveling algorithm is only called in this setup when a memory page has to be relocated. Regarding the selection policy of the wear-leveling decisions, the estimated age of all physical pages is balanced equally over time, because every page will be the current minimum page at a certain time when the estimated age is updated properly.
(Footnote: The semantics of the performance counter and of the write access trapping mechanism differ slightly. While the performance counter counts every write to the memory, including cache writebacks and other indirect memory accesses, the write access trapping only applies to CPU write operations, which require a fetch of a TLB line. However, this only implies that not the target of every n-th write is recorded, but that sometimes the distance between two recorded writes is n + x, where x is a small integer.)
(Footnote: In our runtime system implementation, memory permissions are not used for any protection purposes. If they were, the modified permissions might have to be backed up and restored later on.)
Fig. 3. Memory write-count approximation, n = 5000

Eventually, this integration of the wear-leveling algorithm and the approximation system leads to an additional configuration parameter, besides the temporal and spatial granularity of the write-count approximation. The threshold, i.e., the number of estimated writes after which a relocation is performed, is maintained by the approximation system, because the wear-leveling algorithm is called from the approximation system in that case. This configuration parameter provides a trade-off between the overhead of page relocation and the frequency, and thus the resulting quality, of wear-leveling actions, without influencing the quality of the write-count approximation.
2) Memory Page Relocation:
Once the wear-leveling algorithm has determined a pair of two virtual memory pages to swap, two steps are required to perform the relocation. First, the virtual memory mapping in the page-table is adjusted such that the physical pages of both virtual memory pages are exchanged. A Translation Lookaside Buffer (TLB) maintenance operation is required afterwards to make sure the exchanged mapping is applied. Note that the ARMv8 virtual memory system allows single entries to be invalidated in the TLB, thus a total TLB flush is not necessary. After the new page mapping is established, the physical content has to be exchanged to maintain the application's view on the virtual memory. This is achieved by copying one page to a spare buffer, copying the second page to the first page, and copying the buffer content to the second page. The size of the buffer is chosen as 4 kB for two reasons: First, copying sequential memory content can be done more efficiently in most systems than copying single bytes or words from different regions. Second, the write access pattern to the buffer memory page is completely uniform and thus has no negative influence on the memory lifetime if the buffer is also handled by the wear-leveling system.

Fig. 4. Memory write-count approximation, n = 20000

C. Evaluation
To point out how the previously presented techniques can be used to improve the balance of wear-levels, the write-count approximation system is evaluated first. The four benchmark applications shown in Figure 1 are executed again with the write-count approximation enabled. Instead of triggering the wear-leveling algorithm, the write counts are simply aggregated, resulting in an analyzable distribution. The spatial granularity is fixed to 4 kB sized memory regions (the virtual memory page size), while the temporal granularity is evaluated for two different values. For the first experiment, a sample is recorded every n = 5000-th memory write access; for the second experiment, a sample is recorded every n = 20000-th memory write access. The resulting approximated write-count distributions are illustrated in Figure 3 and Figure 4.
1) Write-Count Approximation Evaluation:
The characteristic of the real write-count distribution (compare Figure 1) is reflected properly in both experiments. The main peaks inside the distribution are shown regarding their height compared to the rest of the distribution. The variation of the temporal granularity can be observed due to the different scaling of the y axes. Since our approach performs incremental wear-leveling, the total memory lifetime is not considered. Hence, the absolute scaling of the write approximation does not matter. However, the reduction of the temporal granularity does not influence the preciseness of the approximation in this setup, because enough samples are still recorded, even for n = 20000. If the application executes for a relatively short time or the temporal granularity is configured too coarse, not enough samples might be available to reflect the characteristic of the distribution properly. This trade-off should be taken into account when choosing the temporal granularity.
Fig. 5. Coarse-grained full wear-leveling result for sha (n = 5000).
TABLE I: Overhead for the write-count approximation (bitcount, pfor, sha, dijkstra; n = 5000 and n = 20000).
When choosing a temporal granularity, the introduced overhead should also be considered. To evaluate the overhead, the necessary additional CPU cycles are calculated as a percentage of the baseline execution without write-count approximation. Table I lists the calculated CPU overhead of both experiments. The relative overhead is similar for all benchmarks, because the approximation system reacts relative to the total write count, respectively the execution time.
2) Full Wear-Leveling Evaluation:
To determine whether the estimation is precise enough to enable aging-aware wear-leveling, the approximation and the wear-leveling algorithm are plugged together and evaluated again. The red-black tree based wear-leveling algorithm is activated and triggered from the approximation system. The spatial granularity remains at 4 kB, while the temporal granularity of the approximation is again chosen as n = 5000 and n = 20000. A remapping of a page is requested whenever the write-count estimation exceeds a configured threshold (a separate value for n = 5000 and for n = 20000). This leads to mostly the same total number of page relocations in both experiments; thus, they can be compared regarding the quality of the write-count approximation.
Fig. 6. Coarse-grained full wear-leveling result for sha (n = 20000).
Figure 5 and Figure 6 show the resulting write distribution of our simulation under coarse-grained wear-leveling for the sha benchmark. The results from the other benchmarks are only presented by their calculated improvement later due to space limitations. Note that due to the logarithmic scale of the y axes, memory bytes with a write-count of 0 are not displayed. The estimated write-count distribution is precise enough to perform aging-aware relocations and balance the wear-levels across the target memory region.
3) Memory Lifetime Improvement:
Considering the gained improvement of the memory lifetime requires some assumptions. First, the system is considered dead once the first memory cell is worn out. Thus, the maximum write count to the memory determines the memory lifetime. Assuming that the target of each write access could be shuffled through the memory arbitrarily, the theoretical best memory lifetime could be achieved when every memory cell is written equally often, thus the mean write count would be applied to each cell. Combining both considerations, Equation (1) calculates the achieved endurance (AE), which is the fraction of the ideal memory lifetime that is achieved by the analyzed execution. A value of 1 means that the experiment already achieves the maximum memory lifetime, while a value of, for instance, 0.5 means that the memory lifetime could be doubled in the ideal case.

AE = mean write count / max write count (1)

Comparing the achieved endurance of an execution with enabled wear-leveling to the baseline without any wear-leveling leads to an endurance improvement (EI), which can be determined according to Equation (2). The maximum endurance improvement thus depends on the achieved endurance of the baseline.

EI = AE_analyzed / AE_baseline (2)

The endurance improvement describes how many additional write accesses can be performed before the memory wears out while using the analyzed wear-leveling technique, compared to the baseline, but does not give any insight into whether the application profits from the additional writes. For instance, an EI of 2 means that the application can perform twice as many writes compared to the situation without wear-leveling. If the wear-leveling itself causes 100% write overhead, all the additional writes would be consumed by the wear-leveling and no real benefit would be achieved. Therefore, the introduced write overhead WO (as a percentage of the total number of writes of the baseline execution) has to be considered as well to determine the lifetime improvement (LI) according to Equation (3).
A LI value of, for instance, 2 implies that the application can perform twice as many writes, respectively can run twice as long, with the introduced overhead and writes for the wear-leveling already accounted for.

LI = EI / (WO + 1) (3)

Similarly, the achieved endurance can be related to the write overhead, which leads to the normalized endurance (NE).

NE = AE / (WO + 1) (4)

In this evaluation, the write overhead is determined on the simulation results by comparing the total number of memory writes for each benchmark execution with the corresponding baseline. For the four benchmark applications, the achieved endurance, the write overhead, and the lifetime improvement are calculated for both wear-leveling experiments, and the results are collected in Table II.
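Equations (1)-(4) can be condensed into a short worked sketch. The function names are ours, not from the paper's implementation; write counts are per memory cell and WO is given as a fraction of the baseline writes.

```cpp
#include <vector>
#include <numeric>
#include <algorithm>
#include <cassert>

// Achieved endurance, Eq. (1): mean write count over max write count.
double achieved_endurance(const std::vector<double>& writes) {
    double mean = std::accumulate(writes.begin(), writes.end(), 0.0)
                  / writes.size();
    double max  = *std::max_element(writes.begin(), writes.end());
    return mean / max;
}

// Endurance improvement, Eq. (2): AE with wear-leveling over baseline AE.
double endurance_improvement(double ae_analyzed, double ae_baseline) {
    return ae_analyzed / ae_baseline;
}

// Lifetime improvement, Eq. (3): EI discounted by the write overhead.
double lifetime_improvement(double ei, double write_overhead) {
    return ei / (write_overhead + 1.0);
}

// Normalized endurance, Eq. (4): AE discounted by the write overhead.
double normalized_endurance(double ae, double write_overhead) {
    return ae / (write_overhead + 1.0);
}
```

For example, a baseline distribution {100, 0, 0, 0} has AE = 25/100 = 0.25; a wear-leveled distribution {30, 25, 25, 20} has AE = 25/30, so EI is about 3.33, and with 10% write overhead LI is about 3.03.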
TABLE II: Achieved endurance (AE), write overhead (WO), normalized endurance (NE), and lifetime improvement (LI) for coarse-grained wear-leveling (bitcount, pfor, sha, dijkstra; n = 5000 and n = 20000).
We observe the following properties. First, the memory write overhead is mostly independent from the configuration of the approximation system, because the approximation in general does not cause many additional memory writes. Second, the lifetime improvement depends on the total amount of memory which is used for the wear-leveling, since the write pattern of the application mostly targets a single memory page anyway. If this page can be remapped to colder pages, the improvement is higher. Third, although the lifetime is improved by a considerable factor, the achieved endurance remains at mostly ≈ of the ideal lifetime in all benchmarks. This stems from the high non-uniformity within memory pages, which is caused by the applications. As memory pages are only relocated to other 4 kB aligned memory pages, the non-uniformity within pages is not resolved by the wear-leveling system.
To summarize this section, aging-aware wear-leveling at the coarse granularity of 4 kB sized memory pages performs reasonably in a software-only manner due to the statistical write-count approximation. Nevertheless, a coarse-grained wear-leveling technique alone is not sufficient to achieve an equal balance of the wear-levels all over the memory due to the high non-uniformity within memory pages.

VI. Fine-Grained Stack Wear-Leveling
To overcome the problem of intra-page non-uniformity, solutions in the literature are extended with a finer-grained wear-leveling technique, which resolves the non-uniformity within the coarse-grained memory regions that are subsequently targeted by the coarse-grained technique [21], [26]. To the best of our knowledge, all these fine-grained extensions are either realized in hardware, by remapping single bytes or groups of bytes through an additional abstraction, or by functional data remapping [17], which requires at least compiler support. In this section, we propose a software-only fine-grained extension to the coarse-grained wear-leveling system (Section V), which resolves non-uniform write accesses in the memory pages of the stack region. These pages are subsequently targeted by the coarse-grained wear-leveling system and are remapped to other physical pages.
Since all existing fine-grained wear-leveling extensions are hardware based, we most likely cannot propose a generic fine-grained wear-leveling approach based on commonly available hardware. Instead, we propose a specialized technique which only targets the stack region of C / C++ compiled applications. The concept of targeting the stack with a specialized software-based wear-leveling system is also considered by Li et al. [18]. Their basic idea is to allocate every stack frame for a new function call on the heap through an aging-aware memory allocator. This approach has two major disadvantages: First, the wear-leveling quality relies on the application performing enough sufficiently fine-grained function calls to apply enough wear-leveling actions. Second, the amount of required stack memory might not be known in advance, which leads to a certain fragmentation and to worse wear-leveling results.
Due to these disadvantages, we in contrast relocate the entire stack memory without the application's cooperation. As the stack is accessed by the compiled code relative to the stack pointer (sp), the application can be instructed to use another memory location as the stack by adjusting the sp. As the stack is the main cause for non-uniform write accesses anyway (see Section V-C), we focus our fine-grained wear-leveling extension on relocating the stack to other memory locations and thus resolve the non-uniform write access pattern inside the stack.

A. Circular Stack Relocation
To evenly distribute the write accesses to the stack, we move the stack region in a circular manner through the memory. In essence, the physical memory content is relocated by a fixed offset into one direction, always with an overflow semantics at the end of the memory. (Footnotes: C99 allows dynamically sized local arrays [2]; however, this could also be achieved in assembly. Depending on the application logic, concrete pointer values may also be calculated and stored in variables. These pointers are also considered when the memory location of the stack is changed.)
Fig. 7. Shadow stack (reserved stack, shadow stack, valid stack content, sp).
For the
Start-gap approach, this can be achieved by a corresponding remapping function, because an additional abstraction layer maintains the logical view on the memory. The runtime system allocates a memory region of the size of multiple memory pages for the application's stack. The stack is relocated from time to time by moving the sp further by an offset and copying the old stack content to the according new location. The logical view of the application always expects free memory bytes left (negative offset) of the sp and the already created stack content directly right (positive offset) of the sp. As long as the stack is only relocated into one direction, this view can be maintained easily. A wraparound at the end of the reserved memory region cannot be achieved trivially when the stack should be relocated by the same offset in each step, since the stack content cannot be split. Thus, we install a mechanism, called the shadow stack, which aids to implement the wraparound at the end of the reserved memory region.
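One relocation step can be sketched on a simulated stack buffer instead of the real sp register. The SimStack structure and its field names are illustrative assumptions, not the runtime's actual data layout; the invariant shown is the one described above, namely that the valid content sits at positive offsets from sp and moves down by a fixed step.

```cpp
#include <cstring>
#include <cstdint>
#include <vector>
#include <cassert>

// Simulated stack region: the valid content lives in mem[sp, top).
struct SimStack {
    std::vector<uint8_t> mem;   // reserved stack memory region
    size_t sp;                  // index of the current stack pointer
    size_t top;                 // one past the last valid stack byte
};

// One circular relocation step: copy the valid content `step` bytes
// towards lower addresses and adjust the stack pointer, so the
// application's sp-relative view of the stack stays unchanged.
void relocate_step(SimStack& s, size_t step) {
    size_t len = s.top - s.sp;                       // valid content size
    std::memmove(&s.mem[s.sp - step], &s.mem[s.sp], len);
    s.sp  -= step;
    s.top -= step;
}
```

The wraparound at the lower end of the reserved region, which this sketch omits, is exactly what the shadow stack described next provides.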
1) Shadow Stack:
The basic concept of the shadow stack is to allow one part of the stack to remain at the end of the reserved memory region, while the rest of the stack is already wrapped around to the beginning. At any point in time, the entire stack content must be accessible by addressing memory contents right of the sp (with a positive offset). Furthermore, at any point in time, the same amount of free memory should be available left of the sp (with a negative offset). Only by maintaining these two properties can the application continue the execution at any time.
The setup of the shadow stack is illustrated in Figure 7. Technically, the real stack is present as a consecutive virtual memory region, which is shown in the right half of Figure 7. For the shadow stack, the same amount of virtual memory space left of the real stack is allocated and is mapped to exactly the same physical memory pages as the real stack. Thus, given an arbitrary virtual address A of the real stack, the same physical content is accessed at the virtual address S(A) = A − stacksize. This also implies that setting the sp from some virtual address S(A) inside the shadow stack to the corresponding real stack address A does not change the application's perspective on the stack at all. Using this mechanism, the stack relocation is implemented in two steps. First, the stack is moved down the memory periodically. At any time, the application can access the same amount of memory left of the sp, because the writes can target the shadow stack. Once the currently used stack (including all valid stack content) is entirely moved to the shadow stack, the sp is set back to the corresponding real stack address. As mentioned before, the virtual memory at the new location of the sp contains exactly the same content as at the old location. Hence, the application's perspective is maintained and the entire stack is wrapped around back to the real stack (right half).
Repeating these two steps regularly, the stack is relocated in a circular manner with the same offset in each relocation step.
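On a Linux userspace host, the double mapping behind the shadow stack can be approximated with memfd_create and two MAP_SHARED mappings of the same file; a bare-metal runtime would instead manipulate the page table directly, as our system does. The structure and function names below are ours, a sketch under these assumptions.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstddef>
#include <cassert>

// Shadow stack directly below the real stack, backed by the same
// physical pages, so S(A) = A - stacksize aliases A.
struct ShadowStack {
    uint8_t* shadow;  // virtual range [shadow, shadow + size)
    uint8_t* real;    // virtual range [real, real + size), real = shadow + size
    size_t   size;
};

ShadowStack create_shadow_stack(size_t stacksize) {
    // An anonymous memfd provides the shared physical backing.
    int fd = memfd_create("stack", 0);
    ftruncate(fd, (off_t)stacksize);
    // Reserve one contiguous virtual range for shadow + real stack, then
    // map the same file (i.e., the same physical pages) into both halves.
    uint8_t* base = (uint8_t*)mmap(nullptr, 2 * stacksize, PROT_NONE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    uint8_t* shadow = (uint8_t*)mmap(base, stacksize,
                                     PROT_READ | PROT_WRITE,
                                     MAP_SHARED | MAP_FIXED, fd, 0);
    uint8_t* real   = (uint8_t*)mmap(base + stacksize, stacksize,
                                     PROT_READ | PROT_WRITE,
                                     MAP_SHARED | MAP_FIXED, fd, 0);
    return {shadow, real, stacksize};
}
```

Any write through the real-stack range is immediately visible at the corresponding shadow-stack address, which is precisely the property that lets the sp be reset from S(A) back to A without the application noticing.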
2) Combination with Coarse-grained Wear-Leveling:
As stated before, the fine-grained wear-leveling is designed as an extension to the previously presented coarse-grained wear-leveling system (Section V). Both systems can work together nearly out of the box. Since the stack relocation only operates in the virtual memory space, a stack relocation can only be interrupted by the remapping of the page to another physical memory page. Nevertheless, when remapping hot and cold pages, the coarse-grained wear-leveling system has to be aware of the special shadow stack configuration and has to maintain it during remapping. Furthermore, the statistical write-count approximation has to aggregate the captured write accesses from the shadow stack and from the real stack to the same physical page. Eventually, we set up a frequent stack relocation by using the same performance counter overflow interrupt mechanism as the coarse-grained wear-leveling system. This ensures that stack relocations are triggered after a certain number of writes to the memory. Additionally, the overhead can be reduced by combining the interrupt mechanisms and only using one interrupt service routine (ISR).
B. Address Consistency
The concept of moving the stack in a circular manner (Section VI-A) is based on the sp-relative access of the stack region by C / C++ compiled applications. However, sp-relative access is not the only way to access memory contents within the stack memory. Sometimes, the application needs to create pointers to variables inside the stack to pass them to subsequent function calls or to store them in a central variable. Furthermore, pointers to variables on the stack may also be moved out of the stack to some global or heap data structures. During a relocation of the stack, the memory address of the variables on the stack changes, while the content of the pointers stays unchanged. This leads to invalid pointers and to wrong behavior of the application. To overcome this problem, we equip the fine-grained relocation system with two pointer adjustment mechanisms, which maintain the correctness of pointer contents across stack relocations.
1) In-memory Pointer Adjustment:
First, an in-memory pointer adjustment technique targets pointers to stack contents which are stored inside the stack itself. This is the usual case when pointers to local variables are passed to subsequent function calls or positions inside local arrays need to be remembered. For the relocation of the stack, the entire valid stack content has to be copied to the new memory location anyway, resulting in every memory word from the current valid stack being loaded into the CPU and stored back to the memory. During this process, the memory word is checked, and a pointer to a stack variable is adjusted by the relocation offset. To identify a memory word as a pointer into the stack, a strong constraint needs to be put on the memory usage of the application. As the memory word is just seen as an 8-byte number by the relocation routine, the application has to make sure not to use any logical variable content which has the same value as a pointer into the stack would have. We ensure this by allocating the virtual memory pages of the stack at a memory location above 4 GB and allowing the application to use 64-bit aligned data types with only the 32 lower bits set.
Fig. 8. Fine-grained wear-leveling result for sha (page relocation every t = 64th stack relocation).
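The copy-and-adjust pass over the valid stack content can be sketched as follows. The function name and parameters are illustrative; old_lo and old_hi denote the old stack's address range, and the heuristic relies on the constraint stated above that plain data never aliases a valid stack address.

```cpp
#include <cstdint>
#include <cstddef>
#include <cassert>

// Copy `words` 8-byte words of valid stack content. Every word whose
// value falls inside the old stack's address range [old_lo, old_hi) is
// treated as a pointer into the stack and shifted by the relocation
// offset; all other words are copied unchanged.
void copy_and_adjust(uint64_t* dst, const uint64_t* src, size_t words,
                     uint64_t old_lo, uint64_t old_hi, int64_t offset) {
    for (size_t i = 0; i < words; ++i) {
        uint64_t w = src[i];
        if (w >= old_lo && w < old_hi)       // looks like a stack pointer
            w = (uint64_t)((int64_t)w + offset);
        dst[i] = w;                          // store (possibly adjusted) word
    }
}
```

Because the stack is placed above 4 GB and logical data is restricted to the lower 32 bits, the range check cannot misclassify ordinary variable contents as pointers.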
2) Smart-Pointer Adjustment:
As the previous technique only targets pointers which are stored inside the stack, pointers which are stored in global or heap data structures would still be corrupted after a stack relocation. To solve this problem, the fine-grained wear-leveling system ships with a smart-pointer implementation, which checks the current relocation of the stack during dereferencing. The internally stored raw pointer is adjusted properly and then dereferenced. The smart-pointer implementation only allows handing out copies of variables, but not the internal raw pointer. Whenever the application aims to move a pointer out of the stack, it has to use the smart-pointer implementation instead of a raw pointer.
To summarize, maintaining the consistency of pointers during stack relocations puts strong constraints on the application and blows up in-memory data structures. Nevertheless, the constraints can be met by reimplementing applications accordingly, and this enables software-only fine-grained in-memory wear-leveling.
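A minimal sketch of such a smart pointer could look like the following. The naming is ours, and g_stack_shift stands in for the runtime's record of how far the stack has moved since the pointer was created; the actual implementation may track the relocation differently.

```cpp
#include <cstdint>
#include <cassert>

// Accumulated stack relocation in bytes; updated by the runtime on each
// stack relocation step (negative when the stack moves down in memory).
intptr_t g_stack_shift = 0;

// Smart pointer into the stack: the stored address is normalized to
// relocation offset zero and rebased on every access, and only copies of
// the pointed-to value are handed out, never the raw pointer.
template <typename T>
class StackPtr {
public:
    explicit StackPtr(T* p)
        : addr_((intptr_t)p - g_stack_shift) {}   // normalize to shift 0

    T get() const {                               // hand out a copy only
        return *(T*)(addr_ + g_stack_shift);      // rebase, then read
    }
    void set(const T& v) {
        *(T*)(addr_ + g_stack_shift) = v;         // rebase, then write
    }

private:
    intptr_t addr_;   // address as it was at relocation offset zero
};
```

Since the raw pointer is never exposed, stale addresses cannot leak into global or heap data structures; every access pays one addition to rebase against the current relocation.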
C. Evaluation
The technical details of the combined implementation of the fine-grained stack relocation technique and the coarse-grained aging-aware wear-leveling system are explained in Section VI-A2. The movement of the stack by an offset of 64 bytes is triggered periodically from the performance counter overflow mechanism. In this evaluation, the performance counter overflow is configured to trigger on every n = 1000th memory write access, thus the stack is relocated every 1000th memory write. Accordingly, the write-count approximation works on the same temporal granularity. (In our simulation setup, 64-byte cache-lines are assumed to be written entirely; a finer movement than 64 bytes has no further effect on the wear-leveling result in this case.) The
Fig. 9. Fine-grained wear-leveling result for sha (page relocation every t = 32nd stack relocation).
coarse-grained wear-leveling system is triggered whenever a page exceeds an approximated write-count of t = 64 and thus, on average, on every 64th stack relocation. Considering the relocation offset of 64 bytes, a coarse-grained page relocation is triggered whenever the stack has been relocated by 4096 bytes, which is the size of one memory page. A second experiment is executed with the trigger for the coarse-grained wear-leveling system set to t = 32. This increases the total number of page relocations at the cost of higher memory overhead. Furthermore, in this scenario, page relocations are performed when the stack has only passed half of a memory page, thus the internal non-uniformity is higher.
Figure 8 and Figure 9 show the resulting memory write-count distribution for the sha benchmark, compared to the coarse-grained wear-leveling system only (Figure 5), for both configurations. The results show that the non-uniformity within virtual memory pages can be resolved by the fine-grained stack wear-leveling technique, and thus the overall write pattern to the main memory is more uniform. Even though the total number of page relocations is higher in the second experiment (Figure 9), the results from the first experiment are slightly better due to the fact that a page relocation is only performed when the stack is moved by an offset of an entire memory page.
1) Memory Lifetime Improvement:
To finalize the evaluation, the improvement of the memory lifetime can be calculated in the same way as in Section V-C3. The according results are collected in Table III. First of all, it can be observed
TABLE III: Achieved endurance (AE), write overhead (WO), normalized endurance (NE), and lifetime improvement (LI) for fine-grained wear-leveling (bitcount, pfor, sha, dijkstra; t = 64 and t = 32).
that the write overhead WO has a high variation over the different benchmarks. This is caused by the different way each benchmark uses the stack. The sha application, for instance, uses a big part of the stack memory and thus has a very high write overhead. The total write distribution of the application in the end determines the lifetime improvement LI. The dijkstra application, for instance, also faces a highly non-uniform memory usage within the bss segment, which is not resolved by our fine-grained wear-leveling technique. Thus, the results for dijkstra are relatively bad.
In conclusion, the memory lifetime can be improved significantly if the intra-page non-uniformity can be resolved by the fine-grained stack wear-leveling, e.g., ≈ times for the bitcount application. Note that the memory lifetime improvement strongly depends on the available memory size. In this evaluation, only the minimal required amount of memory for each benchmark is considered. If a system offers additional spare memory, the memory lifetime can be further improved. The improvement is determined mostly by the resulting uniformity of the memory access distribution (AE) and the write overhead.
2) Comparison to the Literature:
Several techniques for in-memory wear-leveling for NVM have been proposed over the last years. In this section, we compare our evaluation results with the following related techniques:
Start-gap was proposed by Qureshi et al. [21] and relocates the entire memory space in a circular manner at the granularity of 256-byte cache-lines through special hardware. To resolve non-uniformity within cache-lines, a finer-grained address space randomization is introduced. Khouzani et al. [3] proposed a wear-leveling scheme which hooks into the page allocation process of the operating system. Based on knowledge about the current write-count and the write characteristics of each memory region, wear-leveling actions are decided and performed. Chen et al. [8] proposed a similar scheme with advanced management data structures to make the wear-leveling algorithm more efficient. This approach only operates at the coarse granularity of virtual memory pages.
As a metric, we adopt the normalized endurance (NE) from the Start-gap approach, which is our achieved endurance value related to the memory write overhead. As a concrete lifetime or a relative improvement always highly depends on the considered benchmark and the memory size, we use the normalized endurance as a fraction of the possible ideal memory usage, respectively the memory lifetime. Unfortunately, only a few works consider the possible ideal lifetime in their evaluation. The previously mentioned works [3], [8], [21] all report achieving almost the ideal memory lifetime in the best case (i.e., in the range of ≈ to ≈ ). Our best result achieves . of the ideal memory lifetime.
As our system requires no additional hardware and can be tuned regarding the write overhead, it enables a trade-off for the design process of a hardware platform. The costs of the hardware support required for in-memory wear-leveling can be traded against a slightly worse wear-leveling quality and a possibly bigger runtime overhead.

VII. Outlook on Further Fine-Grained Extensions
The final evaluation results in Table III show that the overall wear-leveling quality can be good if the non-uniformity of write accesses within memory pages can be resolved. However, not only the stack has to be targeted by a specific fine-grained extension, but also the data/bss and, if it exists, the heap segment. For instance, the dijkstra application has a highly non-uniform memory usage inside the bss segment, leading to a bad performance. The text segment requires no special wear-leveling, because all accesses to it are read-only by definition. While specific wear-leveling for the heap has been targeted in the form of aging-aware memory allocation in the literature [10], [18], the data/bss segment requires another special technique. For future work, we propose to relocate elements of the data/bss segment by using the features of dynamically linked code. If the application is not statically linked, the addresses or an access offset for the data/bss segment are determined and set while the application is loaded. During a maintenance phase, i.e., an interrupt, the text segment could be re-loaded with relocated addresses of the data/bss segment, and thus these segments could be relocated. This could achieve a circular movement for the data/bss segment, similar to the movement for the stack.

VIII. Conclusion
Recently, several in-memory wear-leveling techniques have been proposed to tackle a major disadvantage, namely the lower write endurance, of the NVM technologies which might replace classic DRAM in the near future. Advanced, aging-aware wear-leveling techniques rely on hardware-provided age information, such as a write-count per cell / byte / domain, to achieve good wear-leveling results. As the necessary hardware support is not available in common or commercial off-the-shelf (COTS) hardware, it introduces additional costs. The hardware at least requires additional chip space, but may also be very complex to build to meet a certain clock speed and granularity.
To overcome the need for this hardware and offer the possibility to use the chip space for other features, this paper introduced a software-only, aging-aware wear-leveling system which only makes use of widely available hardware features. The final evaluations show that we are able to achieve up to . of the theoretically ideal possible memory lifetime with our wear-leveling system without any additional hardware costs. During the design process of a system, it might be totally reasonable to only achieve roughly 80% of the possible memory lifetime (e.g., 8 instead of 10 years), but to equip the system with advanced hardware controllers to improve energy consumption, for instance.
As we believe it is important to offer the possibility for such software-only in-memory wear-leveling, we release all our sources, including benchmark applications and wear-leveling implementations: https://github.com/tu-dortmund-ls12-rt/NVMSimulator.

Acknowledgement
This paper is supported in part by the German Research Foundation (DFG) project OneMemory (project number 405422836).
References
[1] "Arm Architecture Reference Manual ARMv8, for ARMv8-A architecture profile," https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile.
[2] "Using the GNU Compiler Collection (GCC) - 6.20 Arrays of Variable Length," https://gcc.gnu.org/onlinedocs/gcc/Variable-Length.html.
[3] H. Aghaei Khouzani, Y. Xue, C. Yang, and A. Pandurangi, "Prolonging PCM lifetime through energy-efficient, segment-aware, and wear-resistant page allocation," in Proceedings of the 2014 International Symposium on Low Power Electronics and Design, ser. ISLPED '14. New York, NY, USA: ACM, 2014, pp. 327–330.
[4] In Proceedings of the 2008 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, ser. SIGMETRICS '08. New York, NY, USA: ACM, 2008, pp. 229–240.
[5] Acta Informatica, vol. 1, no. 4, pp. 290–306, Dec. 1972.
[6] SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, Aug. 2011.
[7] ACM Trans. Des. Autom. Electron. Syst., vol. 23, no. 2, pp. 14:1–14:32, Nov. 2017.
[8] In Proceedings of the 49th Annual Design Automation Conference, ser. DAC '12. New York, NY, USA: ACM, 2012, pp. 453–458.
[9] In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 42. New York, NY, USA: ACM, 2009, pp. 347–357.
[10] In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2011, Newport Beach, CA, USA, March 5–11, 2011, pp. 105–118.
[11] J. Dong, L. Zhang, Y. Han, Y. Wang, and X. Li, "Wear rate leveling: Lifetime enhancement of PRAM with endurance variation," in Proceedings of the 48th Design Automation Conference. ACM, 2011, pp. 972–977.
[12] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, "NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 7, pp. 994–1007, 2012.
[13] A. P. Ferreira, M. Zhou, S. Bock, B. Childers, R. Melhem, and D. Mossé, "Increasing PCM main memory lifetime," in Proceedings of the Conference on Design, Automation and Test in Europe, ser. DATE '10. Leuven, Belgium: European Design and Automation Association, 2010, pp. 914–919.
[14] https://www.usenix.org/conference/fast19/presentation/gogte
[15] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, "MiBench: A free, commercially representative embedded benchmark suite," in Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop, ser. WWC '01. Washington, DC, USA: IEEE Computer Society, 2001, pp. 3–14.
[16] Y. Han, J. Dong, K. Weng, Y. Wang, and X. Li, "Enhanced wear-rate leveling for PRAM lifetime improvement considering process variation," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 1, pp. 92–102, Jan. 2016.
[17] A. Jacobvitz, "Coset coding to extend the lifetime of non-volatile memory," Ph.D. dissertation, Duke University, 2014.
[18] W. Li, Z. Shuai, C. J. Xue, M. Yuan, and Q. Li, "A wear leveling aware memory allocator for both stack and heap management in PCM-based main memory systems," in Proceedings of the 2019 Design, Automation & Test in Europe (DATE), 2019.
[19] D. Liu, T. Wang, Y. Wang, Z. Shao, Q. Zhuge, and E. Sha, "Curling-PCM: Application-specific wear leveling for phase change memory based embedded systems," Jan. 2013, pp. 279–284.
[20] M. Poremba, T. Zhang, and Y. Xie, "NVMain 2.0: A user-friendly memory simulator to model (non-)volatile memory systems," IEEE Computer Architecture Letters, vol. 14, no. 2, pp. 140–143, July 2015.
[21] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali, "Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling," Dec. 2009, pp. 14–23.
[22] S. Yu, N. Xiao, M. Deng, Y. Xing, F. Liu, Z. Cai, and W. Chen, "WAlloc: An efficient wear-aware allocator for non-volatile main memory," Dec. 2015, pp. 1–8.
[23] H. Volos, G. Magalhaes, L. Cherkasova, and J. Li, "Quartz: A lightweight performance emulator for persistent memory software," in Proceedings of the 16th Annual Middleware Conference. ACM, 2015, pp. 37–49.
[24] W. Zhang and T. Li, "Characterizing and mitigating the impact of process variations on phase change based memory systems," Dec. 2009, pp. 2–13.
[25] M. Zhao, L. Shi, C. Yang, and C. J. Xue, "Leveling to the last mile: Near-zero-cost bit level wear leveling for PCM-based main memory," Oct. 2014, pp. 16–21.
[26] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, "A durable and energy efficient main memory using phase change memory technology," in Proceedings of the 36th Annual International Symposium on Computer Architecture, ser. ISCA '09. New York, NY, USA: ACM, 2009, pp. 14–23.
[27] Icde, vol. 6, 2006, p. 59.