Freezer: A Specialized NVM Backup Controller for Intermittently-Powered Systems
Davide Pala, Ivan Miro-Panades, Member, IEEE, and Olivier Sentieys, Member, IEEE
Abstract—The explosion of IoT and wearable devices has drawn rising attention to energy harvesting as a power source for these systems. In this context, many applications cannot afford a battery because of size, weight, and cost constraints. Due to the intermittent nature of ambient energy sources, such systems must therefore be able to save and restore their state in order to guarantee progress across power interruptions. In this work, we propose a specialized backup/restore controller that dynamically tracks memory accesses during program execution and commits the changes to a snapshot in a Non-Volatile Memory (NVM) when a power failure is detected. Our approach does not require complex hybrid memories and can be implemented with standard components. Results on a set of benchmarks show an average × reduction in backup size. Thanks to our dedicated controller, the backup time is further reduced by more than ×, with an area and power overhead of only 0.4% and 0.8%, respectively, w.r.t. a low-end IoT node.

Index Terms—Embedded systems, energy harvesting, intermittent computing, IoT, non-volatile processor.
I. INTRODUCTION
In the context of IoT, many applications cannot afford the presence of a battery because of size, weight, and cost issues. Recent advances in Non-Volatile Memory (NVM) technologies are paving the way for non-volatile computing systems. These systems are able to sustain computation under unstable power by quickly saving the state of the full system in a non-volatile fashion. Thus, Non-Volatile Processors (NVPs) may allow battery-less designs that do not suffer from the frequent power losses inherent in energy harvesting scenarios.

In related work, both software- and hardware-level solutions were proposed to cope with the backup and restore problem. Software-based approaches are implemented on platforms that include both some SRAM and an addressable NVM used to store the backup, such as the one presented in [1]. Checkpoints are placed at compile time [2]. Then, at run-time, the supply voltage is checked and, if an imminent power failure is identified (Vdd < Vth), a backup of the stack and the registers is executed. In some works, backups are only executed when a power failure interrupt is triggered and the full volatile state (SRAM and registers) is copied to the NVM [3], [4]. Other approaches do not take advantage of the volatile SRAM and exploit the NVM as the only system memory, backing up only the registers in the event of a power outage [5], [6]. Software-level solutions can be implemented on available hardware, but they normally come with a large overhead in terms of both backup time and energy.

Hardware solutions, on the other hand, usually implement fully Non-Volatile Processors (NVPs).

D. Pala and O. Sentieys are with Univ. Rennes, Inria, Rennes, France. I. Miro-Panades is with Univ. Grenoble Alpes, CEA List, Grenoble, France. Manuscript received April 10, 2020; revised xxx.
NVPs mostly make use of emerging NVM technologies to implement complex hybrid memory elements (nvFF and nvSRAM, i.e., non-volatile registers and SRAM memory, respectively) that allow very fast parallel backup and restore operations [7]–[12]. However, introducing these hybrid memory elements is intrusive. Moreover, it usually comes with a significant area overhead and often results in increased delay and active power. An additional limitation on the amount of data that can be saved and restored in parallel is imposed by the peak current consumption required to drive all the NVM bit cells at the same time. To mitigate these problems, distributed small non-volatile arrays, where groups of flip-flops are backed up in sequence, are proposed in [13]. An adaptive restore controller that configures the parallelism of the nvSRAM restore operation, trading off peak current against restore speed, is instead presented in [9].

The use of NVM enables persistence across power failures, but it also introduces the problem of consistency for the data stored in the NVM [14]. To address the consistency issue and improve the reliability of the system, a software framework that performs a copy-on-write of modified pages of the NVM into a shadow memory area is developed in [6]. The consistency problem can also be addressed via static analysis or with hardware techniques [15]. In particular, hybrid nvFFs can be used in a hardware scheme where an enhanced store buffer treats the execution of stores to the NVM as speculative until a checkpoint is reached [15]. Two counters are also used to periodically trigger checkpoints based on the number of executed stores or on the number of executed instructions. Previous work has also focused on the problem of optimal checkpoint placement, as in [16], where online decisions on checkpoints are taken based on a table filled offline using Q-learning.

In this paper, we propose Freezer, a hardware backup and restore controller that is able to reduce the amount of data that needs to be backed up. Our approach avoids the high cost of fully non-volatile hardware architectures, since it can be implemented in plain CMOS technology. Furthermore, contrary to other hardware-based approaches such as non-volatile processors [9], [11], [17], our proposed controller is a component that can be integrated in existing SoCs without requiring modification of the processor architecture. Moreover, Freezer achieves better performance than pure software approaches. Our contributions can be summarized as follows:
• We propose an analysis of different backup strategies based on the use of memory access traces.
• We introduce an oracle-based backup strategy that provides the optimal lower bound for the backup size.
• We present a hardware backup controller, Freezer, that dynamically keeps track of the changes in the program state and commits these changes to the NVM before the power failure. The controller spies on the address signals of the SRAM and uses dirty bits to track modified addresses with a block granularity.
• We conduct an analysis of the trade-offs and a design space exploration for our proposed strategy. Results on a set of benchmarks show an average × reduction in backup size. Thanks to Freezer, the backup time is further reduced by more than ×, with a very low area and power overhead.
• We compare the memory access energy of three different system architectures: SRAM+NVM, NVM-only, and cache+NVM, showing that NVM-only systems take on average × to × more energy than SRAM+NVM with full-memory backup, and × to × more when compared to Freezer. Our strategy also shows a clear advantage when compared to the cache+NVM architecture, requiring on average × and × less energy with RRAM and STTRAM as main memory, respectively.

The rest of the paper is organized as follows. Section II presents some background information and related work. Section III describes the main system models and architectures for a transiently powered device, and presents a model for evaluating the memory access energy of different system architectures. Section IV introduces and discusses the model for the backup strategies. Section V explains how the memory access traces of the benchmarks are processed and analysed. Section VI presents the Freezer backup controller, its algorithm, and area and power synthesis results. We report several comparison results of our study in Section VII. Finally, we briefly discuss our approach and draw conclusions in Sections VIII and IX.

II. BACKGROUND AND RELATED WORK
In this section, we briefly present the context around non-volatile processors and the problem of state retention in energy harvesting applications. We then present the motivation from which this paper is derived.

In related work, both software- and hardware-level solutions were proposed to guarantee forward progress across unpredictable power failures. There are two main approaches to cope with the backup and restore problem: periodic checkpointing [2], [6], and on-demand backup [3]–[5]. Periodic checkpointing systems try to guarantee forward progress by repeatedly executing checkpointing tasks, interleaved with the computation. These checkpoints are usually placed by the compiler, according to some heuristic. At run-time, when a checkpoint is reached, the system decides whether a backup should be executed. In [2], for example, the supply voltage level is checked to determine whether there is enough energy or whether a snapshot should be taken. After a power outage, the state is rolled back to the last saved state and the execution resumes from the last checkpoint that was reached. This approach has the advantage that the backup size can be optimised, as the location of each checkpoint is known in advance. In [6], checkpoints are instead taken on the expiration of a timer, but only the registers are saved, as the system uses only NVM as its main memory. To avoid consistency issues with NVM updates happening between a checkpoint and a power failure, the modified NVM pages are saved with a copy-on-write mechanism in a shadow memory area. These periodic checkpointing techniques also introduce overhead due to the execution of unnecessary checkpoints and backups; moreover, they may lead to the re-execution of part of the code after the rollback.

On-demand backup tries to avoid the run-time overheads introduced by periodic checkpointing by waiting until a power failure is detected before executing the backup. The typical behavior of an on-demand backup system is depicted in Fig. 1, which shows how the system responds to a power failure, signaled by a decrease in the supply voltage (Vdd), by interrupting the computation and entering the Backup phase. When the backup is completed, the system goes into the OFF state, where it waits until the power resumes. When power is newly available, the platform can leave the OFF state and start the recovery. The new interval begins when the system enters the Restore phase, to recover the state saved in the previous backup. When the restore is completed, the system can resume the computation.

Some hardware-based solutions can also be considered implementations of on-demand backup. As an example, in [18], the non-volatile processor is paired with a dedicated voltage detector used to trigger the backup mechanism. The main disadvantage of these techniques is that they often require a full backup of the system memory, as it is difficult to know in advance when a power failure will happen, and thus saving only the required memory is complicated.

To mitigate this problem, some offline static analysis techniques have been proposed [19], [20]. In particular, in [20], an offline analysis of the code is used to find the backup positions that reduce the stack size. These positions are marked in the code with the insertion of special label instructions. At run-time, a dedicated hardware module waits for the power failure signal. After this signal, the execution continues until the program reaches the label instruction; then, this dedicated hardware module executes the backup. These techniques require a compile-time analysis, with a detailed energy model of the platform. Moreover, they tend to introduce overhead, as they need to modify the program code [19] and the internal architecture of the processor [20].

Non-volatile processors can also be considered implementations of on-demand backup, as they focus on having a very fast backup (and restore) in response to power failures. In [17], architectures and techniques for implementing non-pipelined, pipelined, and out-of-order (OoO) non-volatile processors are proposed. These techniques try to optimise the backup size of the internal state of the processor, using, for example, dirty bits for a selective backup of the register file. Contrary to our approach, these architectures rely on NVM or hybrid memories for the persistence of the main memory.
Fig. 1: Division of execution time in intervals and system state during an interval.

Moreover, these techniques are in general very intrusive, as they require an in-depth modification of the internal architecture of the processor.

To address the problem of full memory backup in an on-demand scheme, we propose a hardware backup controller, Freezer, that is able to optimise the size of the backup based on information collected at run-time. Our proposed controller is an independent component that can be integrated in existing SoCs, without requiring changes to the internal architecture of the processor core.

In this work, we focus on how to optimise the backup of the main memory and we do not consider the problem of saving the internal state of the processor. However, the state of the CPU could be managed in software, by having the processor copy its internal registers into the main memory before starting the backup. Other techniques are proposed in the literature to save the internal registers. Common hardware-based solutions use nvFFs based on different technologies, such as STTRAM [11], MRAM-based nvFFs [12], FeRAM [18], and ReRAM [9], as well as FeRAM distributed mini-arrays [13] or a combination of nvFFs and NVM blocks for the backup of internal registers [17].

III. SYSTEM MODELLING

A. Considered System Model
Energy harvesting is seen as a promising source to power future battery-less IoT systems. However, due to the unpredictable nature of the energy source, these systems will be subject to sudden power outages, which can unexpectedly interrupt the execution of a program. Thus, in these intermittently (or transiently) powered systems, the execution is divided into multiple power cycles, i.e., intervals, as shown in Fig. 1. The timing breakdown of one of these intervals is depicted in Fig. 2. t_cyc is the duration of this on-off cycle and is defined as t_cyc = t_r + t_a + t_s + t_off, where t_a is the time in the active state, during which the system executes software tasks, t_off the time in the power-off state, and t_r and t_s the times to restore and save (backup) the data from and to the NVM, respectively. The energy consumed by the system during t_cyc can be modelled as (adapted from [21])

E_c = E_s N_s + E_r N_r + P_on t_a + P_off t_off,   (1)

where E_s and E_r are the energies required for saving and restoring one word, respectively, and N_s and N_r the total numbers of words to save and restore. P_on and P_off are the power consumed during the active state and the off state, respectively.

Fig. 2: Detail of an execution cycle between two consecutive power outages.

In this type of intermittently-powered system, P_off is usually zero, as the state is retained in a non-volatile manner and the whole system, including the processor core, can thus be fully shut down. Moreover, N_s and N_r are usually equal and often coincide with the full size of the volatile system state [3].

Considering an on-demand backup system that only performs a backup before a power failure, the total execution time t_exec of a program can be modeled as (adapted from [3])

t_exec = t_prog + n_i × (t_s + t_r + t_off),   (2)

where t_prog is the time needed for running the whole program without interruptions, n_i the number of interruptions, t_s and t_r the save and restore times, respectively, and t_off the average off time.

Our approach, Freezer, aims at reducing the size of the backup (N_s), thus also reducing t_s, the total execution time, and the backup energy. Moreover, the hardware implementation of our approach further decreases the backup and restore time and energy by eliminating the overhead due to software operations.

In this paper, we assume that the system has a reliable way to detect a power failure and that it has enough power to complete the backup. Therefore, we do not investigate the problem of how to deal with incomplete backups. To obtain a stronger guarantee on the consistency of the system state after recovery, a double-buffering scheme can be applied, such that a new backup does not overwrite the previous one in the NVM. Moreover, we do not deal with the issue of how to detect a power failure; for this problem, solutions such as dedicated voltage detectors [18] are proposed in the literature.

Fig. 3: Architectural models for non-volatile state retention.
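As a minimal sketch of the cycle-energy and execution-time models of Eqs. (1) and (2), the following code evaluates both; all numeric parameter values are illustrative placeholders, not figures from the paper:

```python
# Sketch of Eqs. (1) and (2); parameter values are illustrative only.

def cycle_energy(E_s, E_r, N_s, N_r, P_on, t_a, P_off, t_off):
    """Eq. (1): energy consumed during one on-off cycle."""
    return E_s * N_s + E_r * N_r + P_on * t_a + P_off * t_off

def total_exec_time(t_prog, n_i, t_s, t_r, t_off):
    """Eq. (2): total execution time with n_i power interruptions."""
    return t_prog + n_i * (t_s + t_r + t_off)

# Example: 1 nJ/word save, 0.5 nJ/word restore, 1024-word backup,
# 5 mW active power for 10 ms, zero off-state power (non-volatile retention).
E = cycle_energy(E_s=1e-9, E_r=0.5e-9, N_s=1024, N_r=1024,
                 P_on=5e-3, t_a=10e-3, P_off=0.0, t_off=0.1)
T = total_exec_time(t_prog=1.0, n_i=10, t_s=1e-3, t_r=1e-3, t_off=0.1)
```

Note how setting P_off = 0 reflects the full shut-down enabled by non-volatile retention: the off time then contributes to t_exec but not to E_c.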
B. System Architecture
In the field of non-volatile processors for energy harvesting applications, there are several possible architectural choices for achieving state retention. The most common approaches are the following:
• A CPU with an SRAM and an addressable NVM. The NVM might serve as a backup of the full memory space of the SRAM but might also be addressable by the processor.
• A CPU with an SRAM and a backup-only NVM, or a CPU with a hybrid nvSRAM as in [9].
• A CPU with an NVM as main system memory, as in [5], [6], [10]–[12].
• A CPU with an SRAM-based cache and an NVM as the only system memory [16], [17].
These approaches can be grouped into the three basic architectures depicted in Fig. 3. The first two approaches share the SRAM+NVM architecture, which, as shown in Fig. 3, exploits the SRAM for execution and the NVM for enabling backup and restore operations. The NVM-only approach relies solely on NVM as its main memory. Cache+NVM uses NVM as the main memory with the addition of a volatile cache.

A common choice for implementing intermittently-powered systems is to use commercially available SoCs with an embedded addressable NVM. As this NVM is addressable, this type of system is the common choice for implementing software-based retention schemes [2], [3]. Another option explored in related work is that of using hybrid nvSRAM [7], [9]. This choice allows the main advantages of SRAM (fast read/write and low access power) to be exploited, while also obtaining fast parallel backup through the paired non-volatile memory elements. This means that the non-volatile elements are not directly accessible by the programmer; instead, the non-volatility is made transparent by the hardware. A conceptually simple solution to guarantee state retention is to use only an NVM as the main memory. This solution is proposed in [11], where the system is fully based on STTRAM. Another example is given by the software approach of QuickRecall [5], where the available SRAM is not used and the system runs only on the FeRAM. As with hybrid nvSRAM, the non-volatility is transparent to the programmer; in this case, there is also no need to copy the data in the event of a power failure. In [17], methods for the backup and recovery of the internal state are proposed and compared for non-pipelined, pipelined, and out-of-order (OoO) processor architectures. These solutions can also be considered NVM-only systems, as they use NVM as their main memory, with the addition of hybrid or NVM caches in the case of the OoO processor.

Unfortunately, some of these new NVM technologies are still immature and often do not provide the same level of performance in terms of access time and access energy as SRAM [22]. Moreover, NVM-only designs must also face the issue of wear and the reduced endurance that characterises many of the emerging NVM technologies. To mitigate this problem, a possible solution is to use register-based or SRAM-based store buffers. As an example, enhanced store buffers are proposed in [15] to postpone the execution of NVM writes, treating store operations as speculative. However, the limited size of the store buffer still results in very frequent checkpoints and a large number of NVM writes. Another possible answer to the NVM write speed and endurance problem is to use an SRAM-based cache to buffer the accesses to the main NVM. Although this type of architecture could be of some interest for higher-performance systems, it is not very common in small IoT edge nodes, because adding a cache would significantly increase both the dynamic and static power consumption during the active period.

In this work, we consider an architecture that comprises a micro-controller with an SRAM as main memory and an NVM that is used by our proposed backup controller, Freezer, to save (and restore) the state of the system before (and after) a power failure. The general overview of such an architecture is depicted in Fig. 4. The micro-controller we consider implements the RISC-V Instruction Set Architecture (ISA).
Fig. 4: General overview of a system implementing Freezer.
C. Modelling Memory Access Energy
The energy required for a backup operation is dominated by the data transfers between the SRAM and the NVM, and is proportional to the backup size. This energy is mostly determined by the write energy of the NVM, which can be even × that of the SRAM [22]. Our approach reduces the backup energy by decreasing the number of data transfers and by improving the speed of the process compared with a software-based backup strategy.

In this section, we provide a simplified model to evaluate and compare the energy cost of some of the different system architectures introduced in Section III-B. Results provided in Section VII are based on this model. In particular, we derive the energy cost in terms of memory accesses for the following types of memory models:
• SRAM + NVM for backup,
• NVM only,
• cache + NVM as main memory.
For the SRAM+NVM architecture, we consider both a system which performs a full memory backup and a system with Freezer. For this architecture, the energy cost associated with memory accesses can be expressed as

E_SRAM+NVM = E_prog + E_backup + E_restore,   (3)

where E_prog is the energy of the memory accesses needed for running the program:

E_prog = E_sram/r N_load + E_sram/w N_store,   (4)

where E_sram/r and E_sram/w are the read and write energies of the SRAM, and N_load and N_store the total numbers of load and store operations, respectively. The additional costs required by a platform with both SRAM and NVM are expressed in Eq. (3) by the energy for the backup E_backup and the energy for the restore E_restore, defined respectively as

E_backup = N_s (E_sram/r + E_nvm/w),   (5)
E_restore = N_r (E_nvm/r + E_sram/w).   (6)

The energy for the backup depends on the total size of the backup N_s and on the energy required for reading from SRAM, E_sram/r, and writing to NVM, E_nvm/w. N_s is the total number of saved words throughout the full execution. Similarly, E_restore can be expressed as the energy for a single transfer (read from NVM and write to SRAM) multiplied by the total number of restored words N_r.

For the NVM-only architecture there is no need to perform backup and restore operations, as everything is already saved in the NVM. In this case, the memory access energy is given only by the load and store operations performed for running the program. The energy cost for a purely non-volatile system that uses an NVM as its main memory is estimated by

E_NVM = E_progNVM = E_nvm/r N_load + E_nvm/w N_store.   (7)

The cache+NVM architecture comprises both an NVM as its main memory and an SRAM-based cache to reduce the number of accesses to the NVM. This system uses a write-back cache controller that performs a flush of the dirty lines to NVM in case of a power failure. In a cache system, for every operation, the TAG memory is first read to verify whether the required address is in the cache; in case of a miss, a read from NVM is then executed. Moreover, simultaneous TAG and DATA memory reads are performed inside the cache to sustain high throughput. Finally, multiple data words may be accessed in parallel in an N-way set-associative cache when only one word is useful. Therefore, the energy per read/write operation of this system is much higher than that of a system with tightly coupled memory (SRAM+NVM). The energy cost for a cache+NVM system is therefore estimated by

E_cache = E_hits + E_misses + E_flushes,   (8)

where E_hits is the energy due to cache hits, E_misses the energy penalty due to misses, and E_flushes the energy consumed by flushes. The first part of the energy cost, E_hits, is

E_hits = N_hit/r E_hit + N_hit/w (E_hit + E_cache/w),   (9)

where N_hit/r and N_hit/w are respectively the numbers of read and write hits, E_hit the energy for a single cache access, and E_cache/w the energy for a write operation inside the cache.
E_hits therefore includes the energy due to read hits, N_hit/r E_hit, and the energy due to write hits, N_hit/w (E_hit + E_cache/w). E_misses, the energy due to the misses, is expressed as

E_misses = N_miss (E_miss + B (E_nvm/r + E_cache/w)) + N_evict E_nvm/w,   (10)

where N_miss is the total number of misses, E_miss the energy for a missing access, B the block size in words, N_evict the total number of evicted words, and E_nvm/r and E_nvm/w the energies for reading and writing a word in the NVM. Eq. (10) shows that each miss causes the reading of a full block (B = 8 words in our case) from the NVM. Moreover, a missing access may also cause the eviction of a block from the cache, resulting in writes to the NVM. E_flushes is caused by the backup of the dirty lines before a power failure happens. This operation requires all the cache lines to be scanned and the dirty ones to be written back, and is repeated before every power failure:

E_flushes = N_i N_lines E_hit + N_flush B E_nvm/w,   (11)

where N_i N_lines E_hit represents the energy for reading all the lines of the cache across the N_i interruptions, and N_flush B E_nvm/w the energy due to the writes to NVM, with N_flush the total number of flushed blocks throughout all power failures, multiplied by the energy for writing the words of each block to NVM.

IV. MODELING OF THE BACKUP STRATEGIES
By analyzing the memory access sequences, we can identify different backup strategies. The Full Memory Backup strategy corresponds to the state of the art. In this paper, we propose four backup strategies, defined as Used Address (UA), Modified Address (MA), and Modified Block (MB), a block-based evolution of the two previous strategies. The last strategy presented is an Oracle, which cannot be implemented in a real system as it requires knowledge of the future. This oracle is however very useful for comparison, as it gives the optimal lower bound for the backup size. In the rest of the paper, a word is defined as a 32-bit datum.
A. Full Memory Backup
The first and simplest solution is to backup the full content of the memory at the end of each interval, as proposed in [3]. For our study, and for a fair comparison, we consider a slightly improved version of this strategy that saves only the data section of the program, in pages of 512 bytes (128 words), thus not saving the full memory every time. Since backups are made of whole pages, a program whose data section does not fill an integer number of pages still has its backup rounded up to the next page boundary. With this approach, the backup size is constant across all intervals, and equivalent to the number of pages to be saved.

B. Used Address Backup
The first strategy that we propose is the Used Address (UA) strategy. UA consists of keeping track of all the different addresses that are accessed (for reading or writing) during an interval. When a power failure is detected, every address that was accessed during that interval is saved in the NVM. In the UA case, only the memory locations that were used during the interval are backed up.
C. Modified Address Backup
If the initial snapshot of the program is stored in the NVM, the UA scheme can be improved by implementing the Modified Address (MA) backup strategy. MA only keeps track of the memory locations that are modified (written) during a power cycle. Then, before a power outage, only the words that were modified (by a write operation) are saved to the NVM. In practice, this means saving only the addresses accessed by a store operation at least once during the interval; the number of such addresses gives the size of the backup at the end of the interval. It may happen that the data written during execution do not modify the content of the memory. However, to keep the technique simple, we do not track the content of the memory but only the addresses where a write operation happens.
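The MA bookkeeping can be sketched on a memory access trace as follows. Setting block_size > 1 groups word addresses into blocks, which is the block-based MB variant described in Section IV-E; the (cycle, op, byte-address) trace format is an assumption for illustration:

```python
# Sketch of the Modified Address (MA) strategy: track the set of word
# addresses written during an interval; the backup size is the size of
# that set. With block_size > 1 words, the same code tracks modified
# blocks instead (the MB strategy; block_size = 1 is exactly MA).

def backup_sizes(trace, interval_cycles, block_size=1):
    """Return the number of modified words (or blocks) per interval."""
    sizes = []
    modified = set()
    limit = interval_cycles
    for cycle, op, addr in trace:
        if cycle > limit:           # power failure: commit and reset
            sizes.append(len(modified))
            modified.clear()
            limit += interval_cycles
        if op == "ST":
            modified.add((addr // 4) // block_size)  # byte -> word -> block
    sizes.append(len(modified))     # last (possibly partial) interval
    return sizes

# The accesses of Table I, with an interruption after cycle 100:
trace = [(90, "ST", 0x38aaad4), (97, "LD", 0x2ba50), (99, "LD", 0x2b06c),
         (104, "LD", 0x2b954), (109, "LD", 0x38aaad4)]
print(backup_sizes(trace, interval_cycles=100))  # -> [1, 0]
```

Only the store at cycle 90 contributes to the first backup; the loads are ignored by MA, so the second interval's backup is empty.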
D. Oracle
The Oracle is defined as the strategy that saves only the words that are alive. An address is considered alive when it is going to be read at least once in a future interval. In other words, a written datum is considered alive if it is read in a future interval before being overwritten by another write to the same address. A word that will be overwritten before being read is not considered alive and thus is not backed up by the oracle.

Fig. 5: Example of the aliveness of two addresses with Load (L) and Store (S) instructions. The continuous green line indicates that the address is alive. The black dotted line is used when the address is not alive. The store at cycle 12 (S*) does not make the corresponding address alive because it is followed by another store at cycle 15, which overwrites the value written by S*.

Fig. 5 shows an example of two addresses changing between the alive and the not-alive state as the execution progresses. In the example, the first address stops being alive after it is used by the load in cycle 5 and stays not alive for the period between the sixth and the ninth clock cycles. This happens because the Oracle knows that the value will be overwritten by the store executed at clock cycle 10; therefore, between clock cycles 6 and 9, it does not consider this address alive. For the same reason, the second address stops being alive after the load in cycle 7 and is not alive between cycle 8 and cycle 14. The store operation happening at cycle 12 does not change the state of the address, because it is going to be followed by another store instruction that discards this temporary update.

The Oracle, before the power failure, only saves the words that are going to be read during a future interval. Extending this oracle, we further define the Oracle Modified (OM) strategy, which only saves the alive words that were modified in the current interval. As for the MA scheme, we can consider that a complete snapshot of the system memory is stored in the NVM at the beginning and updated during every previous interval. With the OM strategy, the data that will be read in the future are only saved if they were modified. If a datum was saved in a previous interval and has remained unchanged, it is not added to the snapshot of the memory to be saved before the next power failure. From now on, we will use Oracle to refer to the Oracle Modified when comparing with the other strategies.
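Although the Oracle cannot exist at run-time, it is easy to compute offline with a single backward pass over the trace: a word is alive at time t exactly when its next access after t is a load. The following sketch computes the alive set at given checkpoint cycles; the (cycle, op, word-address) trace format is an assumption for illustration:

```python
# Offline Oracle aliveness: walking the trace backwards, `alive` holds
# the addresses whose next access (in forward time) is a load. A store
# encountered in the backward walk kills liveness before that store.

def alive_sets(trace, checkpoints):
    """For each checkpoint cycle, return the set of alive addresses."""
    alive = set()
    result = {}
    pending = sorted(checkpoints, reverse=True)  # latest checkpoint first
    for cycle, op, addr in reversed(trace):
        # Flush every checkpoint that lies at or after this access.
        while pending and pending[0] >= cycle:
            result[pending.pop(0)] = set(alive)
        if op == "LD":
            alive.add(addr)       # will be read: alive before this point
        else:                     # "ST": value is overwritten here
            alive.discard(addr)
    for c in pending:             # checkpoints before the whole trace
        result[c] = set(alive)
    return result

# One address written, read, overwritten, and read again:
trace = [(1, "ST", 0xA), (5, "LD", 0xA), (10, "ST", 0xA), (20, "LD", 0xA)]
snapshots = alive_sets(trace, checkpoints=[6, 12])
# At cycle 6 the next access is the store at 10, so nothing is alive;
# at cycle 12 the next access is the load at 20, so 0xA is alive.
```

This mirrors the behavior illustrated in Fig. 5: a store only revives an address if a load reaches it before the next store.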
E. Block-Based Strategies
Both the Used Address and Modified Address strategies can be implemented with different degrees of granularity. Tracking each individual word may require a very large memory to store the modified addresses; a block-based strategy trades off hardware cost against backup savings. Instead of considering single word addresses, the addresses can be grouped into blocks of N words and the scheme adapted to keep track of these blocks. The Modified Block (MB) strategy therefore keeps track of the blocks that are modified during the interval. The backup size is given, for each interval, by the number of blocks that are accessed by one or more store operations. In Freezer, the modified blocks are tracked using corresponding dirty bits, which allows the size of the associated tracking memory to be reduced by a factor equal to the block size. MB with blocks of N = 1 word corresponds to the MA strategy.

V. TRACE ANALYSIS AND IMPROVEMENT IN BACKUP SIZE
In order to validate our approach, we analyzed the memory access traces of several benchmarks from a subset of MiBench (see Table III for a list of the benchmarks). The benchmarks were run on a cycle-accurate, bit-accurate RISC-V model [23]; thus only two types of memory access are possible: load and store operations. The traces report the information about each memory access during the program execution. In particular, each trace records a timestamp (cycle count), the type of operation (ST or LD for store or load) and the address for every memory access. Table I shows an example of a memory access trace. The occurrences of power failures are simulated by dividing an access trace in n time intervals. Each interval i is composed of a given number of clock cycles N_prog_i, equal to the active time t_a of the interval i divided by the processor clock period. The cycle count reported in the trace is used to divide the execution of a benchmark in these n intervals. In the rest of the paper, for simplicity and without losing generality, we divide t_prog, the time needed for running the whole program without interruptions, in n equal intervals of N_prog cycles.

TABLE I: Example of memory access trace.

  interval | cycle | op | addr
  ...      |  ...  | .. | ...
  i        |   90  | ST | 0x38aaad4
  i        |   97  | LD | 0x2ba50
  i        |   99  | LD | 0x2b06c
  i+1      |  104  | LD | 0x2b954
  i+1      |  109  | LD | 0x38aaad4
  ...      |  ...  | .. | ...

In the example reported in Table I, the interruption is placed after cycle 100. This means that the load happening in cycle 104 is considered as being executed in the next interval (i+1). This is a simple way to simulate one power failure every N_prog cycles (N_prog = 100 in this example). In practice, for our simulations we considered longer intervals. As an example, considering a device running at 10 MHz, intervals of 10^6 clock cycles would correspond to a frequency of interruptions due to power failures of 10 Hz. In Section VII-B, we present an analysis of the impact of the interval length and of the variability of interval durations on the reduction of backup size.

From these traces, the number of load and store operations per interval, as well as other memory access features, can be extracted. As an example, Fig. 6 shows the number of LD and ST in each interval during the execution of the FFT benchmark.

Fig. 6: Number of LD and ST operations per interval during the FFT benchmark execution.

Considering the duration of the full execution of the FFT benchmark on the target processor, n = 7 intervals can be simulated, as shown in the figure. These traces provide relevant information about the memory access behavior of a given program. They will be used to compare the different backup strategies in Sections VI-C and VII.

Fig. 7: Percentage w.r.t. full memory space of "alive" and "alive & modified" addresses per interval during the execution of the FFT benchmark.

Fig. 7 shows the fraction of alive and alive & modified addresses with respect to the total number of words addressed, for every interval of the FFT benchmark. In the last interval no address is considered alive, as the oracle knows that the program is going to terminate before the next power failure.
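The interval-splitting scheme just described can be sketched in a few lines. The snippet below reuses the Table I example (with one extra hypothetical store) and also computes the per-interval backup size of the Modified Address strategy; the trace format and helper names are illustrative, not the paper's tooling:

```python
# Sketch: assigning trace entries (cycle, op, addr) to intervals of N_prog
# cycles and computing the per-interval MA backup size (distinct addresses
# written in each interval). Trace values follow the Table I example; the
# store at cycle 150 is an added illustrative entry.

N_PROG = 100   # interval length in cycles, as in the Table I example

trace = [
    (90,  "ST", 0x38AAAD4),
    (97,  "LD", 0x2BA50),
    (99,  "LD", 0x2B06C),
    (104, "LD", 0x2B954),
    (109, "LD", 0x38AAAD4),
    (150, "ST", 0x2B954),
]

def ma_backup_sizes(trace, n_prog=N_PROG):
    """Per-interval backup size under the Modified Address (MA) strategy."""
    written = {}
    for cycle, op, addr in trace:
        if op == "ST":
            written.setdefault(cycle // n_prog, set()).add(addr)
    return {i: len(s) for i, s in written.items()}

assert 104 // N_PROG == 1            # the load at cycle 104 falls in interval i+1
assert ma_backup_sizes(trace) == {0: 1, 1: 1}
```

With `op == "ST"` replaced by any access, the same loop yields the used-address (UA) backup size instead.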
The figure also shows that, even with a relatively small benchmark, the number of words that really needs to be saved is less than a quarter of the total. This motivates our work on the definition of new backup strategies to reduce the volume of data to be backed up before a power failure. However, as already mentioned, the OM strategy cannot be implemented in a real system, as it requires knowledge of the future. It is however very useful for comparison, as it gives the optimal lower bound on the backup size.

Fig. 8: Average number of words saved per interval by the different backup strategies – full-memory, used-address (UA), modified (MA), and oracle modified (OM) – during the execution of different benchmarks.

Fig. 8 compares the average number of words saved per interval by the full-memory, UA, MA, and OM strategies for different benchmarks. The figure shows the great potential of the proposed strategies w.r.t. state-of-the-art approaches. Fig. 8 also demonstrates that the MA strategy always outperforms the UA strategy in terms of number of saved words, and it is the only technique that comes close to the performance of the oracle modified. Therefore, only the MA strategy will be considered in the rest of the paper, as well as its extension to a block-based strategy presented in the following section.

VI. FREEZER
In this section, we present Freezer, a backup controller that implements the Modified Block backup strategy, and study the impact of the block size in the MB strategy.
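The hardware-cost side of the MB trade-off is simple arithmetic: one dirty bit per block of N words. The sketch below reproduces that arithmetic for a 32 KB SRAM (block sizes are example values):

```python
# Dirty-bit (to_backup) memory size vs block size for a 32 KB SRAM of
# 32-bit words, with one dirty bit per block. A sketch of the MB sizing
# arithmetic; the listed block sizes are examples.

SRAM_BYTES = 32 * 1024
SRAM_WORDS = SRAM_BYTES // 4            # 8192 32-bit words

def dirty_bits(block_words):
    """Number of tracking bits needed for a given block size (in words)."""
    return SRAM_WORDS // block_words

for n in (1, 2, 4, 8, 16, 32, 64):
    print(n, dirty_bits(n))

assert dirty_bits(1) == 8192            # N = 1 degenerates to the MA strategy
assert dirty_bits(8) == 1024            # blocks of 8 words -> 1024-bit memory
assert SRAM_BYTES * 8 // dirty_bits(8) == 256   # 256x smaller than the SRAM
```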
A. Freezer Architecture
Fig. 4 shows the system-level view of the Freezer architecture. The system is composed of four major components: the CPU, the SRAM used as a main memory, the NVM used for the backup, and the backup controller (Freezer). Freezer is itself composed of two main blocks: a controller implemented as a finite-state machine (FSM) for sequencing the operations, and a small memory containing the dirty bits used to keep track of the blocks that need to be saved, as shown in Fig. 9.

Fig. 9: Freezer internal architecture.

The Freezer controller is a stand-alone component that does not need to be tightly coupled with the memories or with the core. It uses two handshake interfaces for the SRAM and NVM requests, allowing it to tolerate variable access latency. Freezer can be directly connected to the control, address and data signals of both SRAM and NVM, using these handshake interfaces. Alternatively, the SRAM and NVM interfaces can be arbitrated and share a single master port on the system bus. Moreover, Freezer is also connected to the request signals from the CPU to the SRAM; this allows Freezer to (i) spy the address of the SRAM accesses by the processor and (ii) manage the backup-to and restore-from-NVM phases in place of the processor. SRAM and NVM do not need to have two ports: CPU and Freezer accesses can be easily arbitrated, as they never access the memory at the same time.

At run-time, Freezer checks the address of the store operations in the SRAM to dynamically keep track of the blocks that are modified. When a power failure arises, the CPU is halted and the controller starts transferring the modified blocks into the non-volatile memory. The words within a block are then stored sequentially in the NVM. The controller uses the information collected during the active time to determine which blocks to save. When performing this task, Freezer has access to both the SRAM and the NVM memory. Algorithm 1 describes the behavior of the backup controller during the execution, backup, and restore phases. During execution, Freezer implements the
Modified Block backup strategy: when there is no power failure (not pwr_fail) and there is a valid store operation, the controller records the blocks that are modified in a table (to_backup), implemented in a small memory or in a register bank. When the pwr_fail condition is true, it enters the backup phase and a loop where, for each block, the to_backup memory is checked. If the block has to be saved, then a loop over every address of the block is executed, where a word is read from the SRAM and written to the NVM. This last loop can easily be pipelined, such that an NVM write at one address can be executed in the same cycle as an SRAM read at the successive address. The same holds true for the restore phase, which simply moves the data back from the NVM to the SRAM. In this way, the backup controller is able to back up and restore one word every clock cycle. This should also lead to an additional speed-up when compared with software-based backup loops executed on low-end micro-controllers, as in the case of [3].
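The behaviour just described can be captured by a small executable model. This is a Python sketch of the controller's logic, not the HLS source; sizes and the store value are toy values:

```python
# Behavioral model of the Freezer controller: track modified blocks during
# execution, copy only those blocks to the NVM on a power failure, restore
# the full memory on resume. The real controller is an FSM; this is a sketch.

LOG_BLOCK = 3                       # log2(BLOCK_SIZE)
BLOCK_SIZE = 1 << LOG_BLOCK         # 8 words per block
SRAM_SIZE = 64                      # words (toy size)
BLOCK_NUM = SRAM_SIZE // BLOCK_SIZE

sram = [0] * SRAM_SIZE
nvm = [0] * SRAM_SIZE
to_backup = [False] * BLOCK_NUM     # one dirty bit per block

def cpu_store(addr, data):
    """Valid store during execution: write SRAM, set the block's dirty bit."""
    sram[addr] = data
    to_backup[addr >> LOG_BLOCK] = True

def backup():
    """Power failure: copy only the modified blocks into the NVM."""
    for b in range(BLOCK_NUM):
        if to_backup[b]:
            base = b << LOG_BLOCK
            for a in range(BLOCK_SIZE):
                nvm[base + a] = sram[base + a]

def restore():
    """Resume after a power failure: full copy from NVM back to SRAM."""
    for i in range(SRAM_SIZE):
        sram[i] = nvm[i]

cpu_store(10, 0xAB)                 # dirties block 1 (words 8..15)
backup()
assert nvm[10] == 0xAB and sum(to_backup) == 1
restore()
assert sram[10] == 0xAB
```

In hardware the two inner copy loops are the ones that can be pipelined to one word per clock cycle, as noted above.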
Algorithm 1: Freezer backup controller algorithm

  Input: cpu_addr, address generated by the CPU
  Input: is_store = 1 if the operation is a store
  Input: op_valid = 1 if the operation is valid
  Input: pwr_fail = 1 if a power failure is detected
  Input: restore = 1 if resuming after a power failure
  Data: to_backup, a flag memory of 1 bit per block

  if restore:
      for i ← 0 to SRAM_SIZE − 1:
          sram[i] ← nvm[i]
  else:
      if not pwr_fail:
          if is_store and op_valid:
              block ← cpu_addr ≫ log2(BLOCK_SIZE)
              to_backup[block] ← 1
      else:
          for b ← 0 to BLOCK_NUM − 1:
              if to_backup[b]:
                  for a ← 0 to BLOCK_SIZE − 1:
                      addr ← (b ≪ log2(BLOCK_SIZE)) ∥ a
                      nvm[addr] ← sram[addr]

In the hardware implementation, the process of checking the dirty bits can also be optimised. As an example, the scan of the to_backup memory to find the next dirty block can happen in parallel with the backup of the current block, which is a relatively long operation. Moreover, the to_backup memory can be organised as a matrix of dirty bits, and the controller can check an entire row of dirty bits in parallel. This means that the to_backup memory can be scanned row by row. The sparsity of the dirty bits can also be exploited by skipping rows that have only clear bits (all zeros). With these and other optimisations, the throughput of the backup operation can be sustained with little to no dead cycles. However, these low-level optimisations are outside the scope of this work and will not be investigated further.

B. Area and Power Results
As our algorithm is relatively simple, the controller itself introduces small area and power overheads. The major contribution to the area and power overheads is given by the to_backup dirty-bit memory, used to keep track of the blocks that have to be saved. Table II shows the number of bits and an estimation of the area of the to_backup memory for different block sizes, considering a 32KB SRAM. For these results, the to_backup memory is synthesized with standard cells in a 28nm FDSOI technology using Synopsys Design Compiler (DC). Even when considering a fine granularity for the block size, the dirty-bit memory is small compared to the total size of the SRAM memory. As an example, for a block size of 8 words, the required 1024-bit memory is 256× smaller than the main SRAM memory. Moreover, by tuning the block size towards larger blocks, the to_backup memory can be stored in a register file with a small increase in the backup size.

TABLE II: Number of bits and area estimation of the to_backup memory, implemented with standard cells in 28 nm FDSOI.

  block size (32-bit words) |   2  |   4  |   8  |  16 |  32 |  64
  number of bits            | 4096 | 2048 | 1024 | 512 | 256 | 128
  area [µm²]                |  ... |  ... |  ... | ... | ... | ...

A non-optimized version of the controller was synthesized from a C++ specification using Mentor Graphics Catapult HLS and Synopsys DC with the same 28nm FDSOI technology at a sub-1 V supply voltage. In this configuration, Freezer's controller achieves a dynamic power P_active of around 6 µW and a leakage power of around P_leak = 40 nW at 25°C. With the same technology, we also estimated the leakage of a register-based to_backup memory. These synthesis results will be exploited in Section VII-F to estimate the energy of a system implementing Freezer.

C. Impact of Block Size
In this section, we study the impact of the block size on the size of the backup provided by the MB strategy. The size of the to_backup memory depends on two parameters: the number of 32-bit words in each block, which determines the granularity of the backup strategy, and the total size of the SRAM. Therefore, it is possible to trade off an increase in backup size for a smaller area overhead of the to_backup memory. In Table III, the backup size across a set of benchmarks is reported for different configurations of block granularity. The backup size is averaged over all intervals and normalized with respect to a block of one word (MA strategy).

TABLE III: Backup size relative to blocks of one 32-bit word (MA approach) for different benchmarks and block sizes N. The table also reports the average over all benchmarks.

Increasing the block size obviously has an impact on the performance of the MB strategy, i.e., the average size of the backup required at the end of each interval. However, MB with a relatively small block size (up to N = 8) only increases the backup size by 24% on average, while this block-based strategy decreases the size of the to_backup memory by a factor of 8×. More comparisons and the impact of the interval size are reported in the next section.

VII. RESULTS
In this section, the details of the experimental setup are explained and the results regarding the backup size (Sec. VII-A) and backup time (Sec. VII-C) are reported. The impact of the interval size is discussed in Section VII-B; for all the other results provided in this section, the interval length N_prog is kept fixed. Moreover, a discussion about power, energy, and area of our approach is presented in Sections VII-D and VII-F, while considerations on the impact of leakage are presented in Section VII-E.

A. Backup Size
For every interval, the backup size is computed considering the different approaches described in Section IV. Fig. 10 shows the backup size reported for every benchmark and for blocks of 1, 8, and 64 32-bit words. A block size equal to one word corresponds to the MA strategy. The OM strategy is also reported to provide the optimal (non-reachable) value. The backup size is averaged over all intervals and normalized against the improved Hibernus [3] approach, which saves the full memory used by the program in pages of 512 bytes. As can be seen from Fig. 10, our approach greatly reduces the average backup size per interval, reaching an 87.7% (more than 8×) reduction on average, with only a small distance from the oracle modified, when configured with a granularity of 8 words per block. This reduction in backup size can be directly converted into an energy saving in the number of writes to the NVM during the backup phase.

Fig. 10: Backup size normalized w.r.t. the program memory size (improved Hibernus strategy) of Freezer implementing the MB strategy with blocks of 1, 8, and 64 words. The lower bound in backup size of the oracle-modified is also reported.

B. Impact of Interval Size
On a system powered with intermittent ambient energy, the time length of the intervals is mostly determined by the energy source and by the energy budget of the platform. If the energy source is relatively stable, the length of the power cycle increases, and so does the amount of computation that the processor manages to complete during one interval. This means that more memory accesses will be performed, thus we can expect the average size of a backup to increase. However, this also depends on the spatial locality of the application, and considering wider blocks could be beneficial for less intermittent sources. When the length of the power cycles decreases, the processor is interrupted more frequently, and the number of memory accesses is reduced. Therefore, the average backup size is further decreased. Fig. 11 shows the average reduction in backup size, across all benchmarks, for different lengths of the power cycles (interval size N_prog expressed in number of clock cycles), considering blocks of 8 words. As can be seen, the backup size reduction remains large for all the interval lengths, and with shorter intervals the reduction becomes even greater. It must also be noted that, when the length of the interval is increased above 20 million clock cycles, the majority of the programs are able to run to completion before the first power failure occurs.

Fig. 11: Average backup size reduction for different interval sizes N_prog, expressed in number of clock cycles.

Due to the unpredictability of the energy source, an intermittently-powered system might also experience a wide variation between the time lengths of successive intervals. To better capture this behaviour, we model the occurrence of a power failure as a random variable distributed according to a binomial law. Power failure events in this model are considered independent of one another.
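This random-failure model can be sketched as a per-cycle Bernoulli trial, which makes the interval lengths geometrically distributed. The failure probability and horizon below are illustrative values, not the paper's settings:

```python
# Sketch of the random power-failure model: at each clock cycle a failure
# occurs with probability p (independent Bernoulli trials), so the number
# of failures in a window is binomial and the interval lengths between
# failures are geometric with mean 1/p. Values are illustrative.

import random

def interval_lengths(p, total_cycles, seed=42):
    """Lengths (in cycles) of the simulated intervals between failures."""
    rng = random.Random(seed)
    lengths, current = [], 0
    for _ in range(total_cycles):
        current += 1
        if rng.random() < p:          # power failure at this cycle
            lengths.append(current)
            current = 0
    return lengths

lengths = interval_lengths(p=1e-3, total_cycles=2_000_000)
mean = sum(lengths) / len(lengths)
assert 800 < mean < 1200              # mean interval close to 1/p = 1000 cycles
```

Feeding these variable-length intervals to the trace analysis (instead of fixed N_prog slices) reproduces the kind of experiment summarised in Fig. 12.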
At each clock cycle, there is a certain probability of incurring a power failure. For this experiment, we considered two failure rates. Figures 12a and 12b show, for each benchmark, the average savings with the relative standard deviation computed over 100 executions, considering blocks of 8 words. Our proposed method is robust to variability in the size of the intervals, and it achieves large average savings for both failure rates.

C. Backup Time
The reduction in the backup size comes with a corresponding reduction in the save time. On top of that, thanks to the hardware-accelerated backup process, our solution provides an additional improvement in terms of backup time. In particular, the backup process is managed directly by Freezer and can be further pipelined, so that each word can be saved in one clock cycle. Of course, the speed of this process is limited by the cycle time of the slowest NVM memory. As our approach does not rely on any specific NVM technology, we considered the numbers reported in [3] for our comparison. In particular, we considered a clock frequency of 24 MHz for normal operation using SRAM, and a clock cycle period of 125 ns (8 MHz) for the FeRAM. Table IV reports the improvement in backup time compared with a modified implementation of Hibernus that only saves the memory used by the program (in pages of 512 bytes).

Fig. 12: Average savings and standard deviation over 100 executions, with power failures distributed following the binomial law at the two considered failure rates (a) and (b).

TABLE IV: Percentage reduction of backup time w.r.t. improved Hibernus (higher is better). Columns b_N provide results with our strategy using blocks of N words. Oracle modified and NVP [9] are also provided for comparison.

  benchmark          |  b_1  |  b_8  |  b_64 | oracle | NVP [9]
  susan smooth small | 99.80 | 99.78 | 99.68 |  99.87 |  39.25
  susan edge small   | 98.93 | 98.69 | 98.20 |  99.20 |  42.58
  matmul16 float     | 99.32 | 99.00 | 98.27 |  99.79 |  39.08
  qsort              | 99.38 | 99.34 | 98.95 |  99.58 |  45.96
  fft                | 99.58 | 99.39 | 98.64 |  99.85 |  39.64
  matmul32 int       | 99.35 | 99.23 | 98.86 |  99.69 |  40.44
  str search         | 99.49 | 99.43 | 99.21 |  99.81 |  43.03
  cjpeg              | 98.10 | 97.98 | 97.48 |  99.32 |  38.97
  dijkstra           | 99.34 | 99.13 | 98.98 |  99.63 |  43.32
  matmul16 int       | 99.23 | 98.94 | 97.88 |  99.80 |  29.94
  susan edge large   | 99.93 | 99.91 | 99.89 |  99.95 |  45.78
  average            | 99.31 | 99.17 | 98.73 |  99.68 |  40.73

The results are obtained with intervals of N_prog clock cycles at 24 MHz, and we assumed an average off time equal to the active time.
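The backup-time arithmetic behind these comparisons can be sketched as follows. Only the 125 ns FeRAM cycle comes from the text above; the word counts are illustrative:

```python
# Pipelined backup: one word written to the NVM per NVM clock cycle.
# FeRAM cycle of 125 ns (8 MHz) as in the comparison above; the number of
# modified words is an illustrative value, not a measured one.

T_NVM = 125e-9                     # seconds per 32-bit word (8 MHz FeRAM)

def backup_time(words, t_nvm=T_NVM):
    return words * t_nvm

full_words = 8192                  # full 32 KB memory, in 32-bit words
dirty_words = 520                  # modified words only (illustrative)

reduction = 100 * (1 - backup_time(dirty_words) / backup_time(full_words))
assert abs(backup_time(full_words) - 1.024e-3) < 1e-9   # ~1 ms full backup
assert round(reduction, 1) == 93.7
```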
As reported in Table V, our strategy achieves a 32% average decrease of the total execution time when compared with improved Hibernus. We also compared Freezer against approaches like QuickRecall [5] that run the programs using only the NVM. In this case, the save and restore times are roughly zero (only the registers need to be saved), but the frequency of the core is limited to the frequency of the FeRAM. As a consequence, in most cases, the QuickRecall approach leads to a longer execution time than Hibernus, whereas our solution always performs better and is very close to the oracle.

TABLE V: Percentage reduction of execution time w.r.t. improved Hibernus (higher is better). Columns b_N provide results with our strategy using blocks of N words. The oracle and the NVM-only solution of [5] are also provided for comparison.

  benchmark          |  b_1  |  b_8  | oracle | NVM only [5]
  susan smooth small | 20.75 | 20.75 |  20.76 |  -17.59
  susan edge small   | 33.35 | 33.31 |  33.41 |    2.35
  matmul16 float     |  9.90 |  9.88 |   9.93 |  -34.50
  qsort              | 88.79 | 88.77 |  88.90 |   89.00
  fft                | 10.68 | 10.67 |  10.70 |  -33.30
  matmul32 int       | 15.32 | 15.31 |  15.35 |  -26.01
  str search         | 24.90 | 24.89 |  24.95 |  -11.05
  cjpeg              | 27.93 | 27.91 |  28.13 |   -5.94
  dijkstra           | 35.21 | 35.17 |  35.27 |    5.14
  matmul16 int       |  8.72 |  8.71 |   8.75 |  -36.34
  susan edge large   | 83.49 | 83.48 |  83.50 |   80.27
  average            | 32.64 | 32.62 |  32.69 |    1.09

D. Energy Comparison with other Memory Models
We use Eqs. (3), (7), and (8) to compare the dynamic memory access energy of the different system configurations. Figures 13a and 13b show these dynamic energies normalised w.r.t. the system using Freezer, with RRAM and STT, respectively. For the cache+NVM architecture, four different cache sizes of 2KB, 4KB, 8KB and 16KB are reported. The caches are all 4-way set associative with lines of 8 words (256 bits), which is representative of this type of device. We considered blocks of 8 words also for the system using Freezer. The read and write dynamic energies per 32-bit word for the memories used in this comparison are reported in Table VI, and were obtained using NVSim [24]. Table VII reports the hit and write dynamic energies for the different cache sizes, also obtained with NVSim. Miss energies were in all cases equal to hit energies. As can be seen from Fig. 13, our proposed approach provides a significant reduction in the energy due to memory accesses when compared with all the other methods: the memory access energy of a full-memory backup strategy is on average several times that required by Freezer, for both STT and RRAM.

Fig. 13: Relative dynamic energy of memory accesses, normalized w.r.t. Freezer, using RRAM (a) and STT (b) as NVMs for backup.

Being based on the same SRAM+NVM architecture, the Freezer and full-memory backup strategies require the same absolute amount of energy for the execution of the program, i.e., the energy required for executing load and store operations is the same for the same benchmark. Moreover, as the two strategies rely on a full memory restore after a power failure, they spend the same amount of energy on the restore memory accesses for the same benchmark. Tables VIII and IX show the energy decomposition, across all benchmarks, for the Freezer and full-memory strategies when using STT NVM. The two tables show the clear advantage that Freezer brings in terms of backup energy, reducing its weight from an average 23.25% to an average 3.44% of the total memory access energy.

Figures 13a and 13b also show that, due to their higher read and write dynamic energies, using the NVM as the main memory is often detrimental even when compared with full-memory backup systems; compared to Freezer, NVM-only systems require on average several times more energy, for both RRAM and STT.

As described in Section III-C, the cache+NVM system uses a write-back policy and flushes the dirty lines to the NVM when a power failure arises. Thus, the cache+NVM system shows a behaviour that is similar to that of Freezer during power failures, but with a higher energy per operation. There are however some major differences between a system that implements Freezer and a system with a write-back cache and an NVM main memory. First of all, Freezer is meant to be simple, to reduce the energy overhead of tracking the modified blocks. Moreover, Freezer is able to track the full main memory and only needs to write to the NVM before a power failure happens. A write-back cache, on the other hand, might perform additional writes to the NVM at run-time: if an access causes a conflict, the cache will evict the conflicting line, thus causing additional NVM writes. These additional writes may reduce the lifetime of the NVM due to the limited endurance of this type of memory.

When it comes to cache+NVM based systems, the size that on average provides the smallest energy is 4KB, with 2KB and 8KB caches performing better in some benchmarks. Accesses to smaller caches require less energy, as shown in Table VII, but they might incur the high cost of additional NVM reads and writes due to a larger number of misses and evictions. The increased number of writes to the NVM could also cause endurance problems because of wear-out, which might prevent this solution from being applied for long-lasting operations.
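The per-word cost model behind this comparison can be sketched directly from the Table VI numbers. The per-word energies below are the 32 KB SRAM and STT entries of Table VI; the count of modified words per interval is an illustrative value:

```python
# Backup-energy sketch: each backed-up word costs one SRAM read plus one
# NVM write. Per-word energies are the 32 KB entries of Table VI; the
# number of modified words per interval is illustrative.

E_SRAM_READ_32K = 1.664    # pJ per 32-bit word (Table VI, 32 KB SRAM read)
E_STT_WRITE_32K = 20.873   # pJ per 32-bit word (Table VI, 32 KB STT write)

def backup_energy_pj(words):
    return words * (E_SRAM_READ_32K + E_STT_WRITE_32K)

full_memory = backup_energy_pj(8192)   # full-memory strategy, 32 KB
freezer = backup_energy_pj(1000)       # modified words only (illustrative)
assert freezer < full_memory / 8       # ~8x fewer words -> ~8x less energy
```

Since both strategies pay the same load/store and restore energies, this backup term is where the whole difference between the rows of Tables VIII and IX comes from.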
A larger cache can reduce the number of accesses to the NVM, up to the point where the cache is so large that it is able to buffer the full application. In that case, it is possible to obtain a number of writes to the NVM which is close to what Freezer achieves. However, this comes at the cost of having a large cache that is complex and energy hungry. Moreover, it is unusual to see a cache used in small low-power edge devices, where the system memory is embedded on chip and seldom exceeds 64KB. To summarize, for our set of benchmarks, the energy required by a 4KB cache + STT system is several times that of Freezer, and even the larger 16KB cache requires on average more energy than Freezer.

TABLE VI: Energy and leakage power parameters used for memory access cost simulation. Read/write energy is reported in pJ per 32-bit word access.

                 SRAM                          STT                              RRAM
  Size [KB]      4      16     32     64       4      16     32     64          4      16     32     64
  Read [pJ]    0.219  0.703  1.664  2.50     7.754  7.889  8.426  8.692       5.101  5.477  6.004  6.667
  Write [pJ]   0.111  0.215  1.175  1.388   20.244 20.614 20.873 21.416      21.349 27.449 24.176 28.575
  Leakage [µW]  0.78   2.16   3.58   7.16

TABLE VII: Cache hit and write dynamic energies (E_cache/r, E_cache/w) in pJ per 32-bit word access, for the different cache sizes.

TABLE VIII: Memory access energy percentage decomposition for Freezer using STT.

  Trace | backup | restore | prog. loads | prog. stores
  sss   |  0.74  |  22.60  |    74.86    |    1.80
  ses   |  5.97  |  29.30  |    59.23    |    5.49
  mm16f |  5.53  |  21.89  |    49.88    |   22.70
  fft   |  4.72  |  28.28  |    46.12    |   20.88
  cjpeg |  7.53  |  16.30  |    57.64    |   18.53
  str   |  2.62  |  23.86  |    43.65    |   29.87
  mm16i | 15.13  |  33.12  |    40.55    |   11.21
  dijk  |  3.45  |  23.56  |    61.34    |   11.64
  mm32i |  4.46  |  26.86  |    60.42    |    8.26
  avg   |  3.44  |  23.52  |    61.18    |   11.86

E. Impact of Leakage Power
For a fair comparison, it is also important to study the impact of the leakage power of the SRAM+NVM memory model, especially when compared to NVM-only architectures. Eq. (3) is therefore enhanced by considering the leakage power of low-power SRAMs of the appropriate size, as reported in Table VI. The leakage power of STT and RRAM is considered to be zero, which is obviously not the case for real designs.

TABLE IX: Memory access energy percentage decomposition for full-memory backup using STT.

  Trace | backup | restore | prog. loads | prog. stores
  sss   | 21.39  |  17.90  |    59.29    |    1.43
  ses   | 28.29  |  22.35  |    45.18    |    4.19
  mm16f | 21.66  |  18.15  |    41.36    |   18.82
  fft   | 26.15  |  21.92  |    35.75    |   16.18
  cjpeg | 17.40  |  14.56  |    51.49    |   16.55
  str   | 22.65  |  18.95  |    34.67    |   23.72
  mm16i | 31.33  |  26.80  |    32.81    |    9.07
  dijk  | 23.60  |  18.65  |    48.54    |    9.21
  mm32i | 25.77  |  20.87  |    46.94    |    6.42
  avg   | 23.25  |  18.69  |    48.63    |    9.42

TABLE X: Backup energy using Freezer (for RRAM and STT NVMs), SRAM leakage energy, and memory size for different benchmarks. Energy in [µJ], memory size in 32-bit words.

  Trace | mem size [32-bit words] | E_freezer RRAM | E_freezer STT | E_leakage SRAM
  sss   |  8192  | 11.0  | 11.0  |  6.1
  ses   | 16384  |  3.3  |  3.2  |  1.9
  mm16f |  2048  |  2.5  |  2.4  |  2.0
  fft   |  2048  |  3.4  |  3.4  |  3.7
  cjpeg |  8192  |  3.9  |  3.6  |  1.3
  sl    |  8192  |  3.6  |  3.7  |  1.9
  mm16i |  1024  | 0.051 | 0.045 | 0.028
  dijk  | 16384  | 51.0  | 50.0  | 27.0
  mm32i |  4096  |  0.58 |  0.56 |  0.41

Table X reports, for each benchmark, the absolute dynamic energy of memory accesses for Freezer with both RRAM and STT as NVMs, equivalent to the Freezer bars in Figures 13a and 13b, respectively. The table also reports an estimation of the leakage energy due to the main SRAM memory, and the total memory size of the benchmark. Table X shows that the leakage energy represents around half of the dynamic energy of memory accesses when using Freezer. Even accounting for the leakage of the SRAM, the approaches based on SRAM+NVM are still better than running an NVM-only system. Compared to full-memory backup, which would consume roughly the same leakage energy, Freezer still benefits from the backup size reduction.

Moreover, even accounting for the leakage of the NVM memories would not change the outcome of the analysis. In fact, when considering NVMs of the same size running for similar periods of time, the leakage due to the NVMs would be roughly the same for both SRAM+NVM and NVM-only architectures. Furthermore, an SRAM+NVM system could even activate the NVM only during the backup and restore phases, further reducing the impact of NVM leakage. In both cases, the SRAM+NVM architecture would still show an advantage.
F. Energy and Area Overhead Considerations
In this section, we provide insights about the energy overhead due to our backup controller. The use of the Freezer hardware backup strategy in an energy harvesting platform will introduce a small overhead at run-time, but will also decrease the energy required for the backup and restore operations. We can account for the overhead and the reduction in the backup size by modifying Eq. (1), which becomes

  E_c = E_s N'_s + E_r N_r + (P_on + P_ovh) t_on + P_off t_off,   (12)

where N'_s is the reduced backup size and P_ovh represents the overhead introduced at run-time. The energy required for moving the data (E_s for save and E_r for restore) is heavily dependent on the memory technology. However, software-based approaches introduce additional overhead. In our case, as a backup operation may require hundreds or even thousands of transfers, we can approximate the energy required for saving one word as

  E_s = E_sram/r + E_nvm/w,   (13)

where E_sram/r is the energy for reading a word from the SRAM and E_nvm/w the energy required for a write in the NVM.

The power overhead introduced by our strategy can be estimated as P_ovh = α × P_active + P_leak, where P_leak is the leakage power, which will be mostly determined by the to_backup memory, and P_active the active power. P_leak and P_active were provided in Section VI-B. P_active is consumed whenever the processor performs a store operation, and α = N_store / N_prog is the fraction of clock cycles spent performing store operations over the execution of the program in the whole interval.

This overhead can be compared with the advantage gained in terms of save and restore energy. If we compare against a system that saves everything but does not introduce any overhead, we can estimate the maximum active time t_on after which the power consumed by the controller during the active time becomes greater than the energy reduction obtained at backup time.
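This overhead model is a handful of multiplications, sketched below. All numeric values are illustrative placeholders (only the rough P_active and P_leak magnitudes echo Section VI-B), not the paper's measurements:

```python
# Sketch of the run-time overhead model: P_ovh = alpha * P_active + P_leak,
# and the maximum active time t_on <= delta * E_s * N_tot / P_ovh described
# above. All numeric values are illustrative placeholders.

def p_ovh(alpha, p_active, p_leak):
    """Run-time power overhead of the controller, in watts."""
    return alpha * p_active + p_leak

def t_on_bound(delta, e_s, n_tot, p_overhead):
    """Maximum active time (s) before the overhead cancels the saving."""
    return delta * e_s * n_tot / p_overhead

alpha = 0.15                        # fraction of cycles that are stores
P = p_ovh(alpha, p_active=6e-6, p_leak=40e-9)   # ~6 uW active, ~40 nW leak
E_s = 32 * 0.5e-12 * 101            # J/word: SRAM read + 100x costlier NVM write
bound = t_on_bound(delta=0.877, e_s=E_s, n_tot=8192, p_overhead=P)
assert P < 1e-6                     # overhead stays below a microwatt here
assert bound > 0                    # finite, positive active-time bound
```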
t_on is constrained by the following inequality:

  t_on ≤ δ E_s N_tot / P_ovh,   (14)

where N_tot is the number of words to be backed up without Freezer (full memory), E_s the energy required to back up one word (E_sram/r + E_nvm/w), and δ E_s N_tot the energy saved during the backup operation. With Freezer, considering δ = 87.7%, an SRAM read energy E_sram/r below 1 pJ/bit, and E_nvm/w = 100 × E_sram/r, we obtained for the two extreme configurations, depending on the considered benchmark, a t_on bound of a fraction of a second with P_ovh of roughly 1 µW for susan smooth, and a similar bound and P_ovh for the FFT benchmark. Both these t_on values allow the programs to be executed completely and are well above the typical active time of intermittently-powered systems. Moreover, Eq. (14) is obtained by comparing our solution to a system that introduces no overhead at run-time and no overhead during the backup process, which would not be the case in real systems.

To give an idea of how Freezer would fit in a low-end IoT node, we can compare it with an ultra-low-power, size-optimised SoC implemented in the same 28nm FDSOI technology node, such as the one presented in [25]. In terms of area the SoC is 0.7 mm², while its power consumption is 3 µW/MHz, giving at 48 MHz a power consumption of 144 µW. From these numbers we can see that Freezer, even with our non-optimised implementation, would lead to a small overhead. In particular, assuming blocks of 8 words, the area overhead represents ≈ 0.4% of the SoC area. The power overhead during active time, considering the α of the FFT benchmark, could be as low as 0.82%.

VIII. DISCUSSION ABOUT THE APPROACH
Several studies have approached the problem of computing under an intermittent power supply, providing a wide variety of solutions. While software-based approaches try to solve the problem at the application level, hardware-based solutions try to provide platforms that implement the non-volatility in a way that is transparent to the programmer. The majority of the hardware solutions usually rely heavily on the underlying memory technology to accomplish the state retention. Even in [21], where no NVM is used, the technique relies on an ultra-low-power retention SRAM.

Our approach moves away from this type of scheme and tries to solve the problem from a different standpoint, by providing hardware acceleration for the backup and restore procedures, and by exploiting run-time information to optimize the backup sequence. Moreover, this approach is agnostic with respect to the NVM technology, which opens a series of possibilities. Technologies such as the hybrid nvSRAM used in [9], with circuit-level configurable memory, parallel block-wise backup and adaptive restore, may be exploited and enhanced by Freezer, thus achieving a faster and more energy-efficient backup sequence thanks to the backup size reduction.

Furthermore, our approach could be extended to implement a programmable backup hardware accelerator, or a dedicated ISA extension. This would provide programs with some level of control over the save and restore procedures, and allow the hardware to exploit some of the information available to the program. As an example, a program may signal that a certain buffer or memory region is no longer used, allowing the controller to exclude it from the backup process. This would also make it possible to integrate static analysis techniques such as the ones presented in [19] and [20] on top of Freezer.

IX. CONCLUSION
Applications that run on ambient harvested energy suffer from frequent and unpredictable power losses. To guarantee progress of computation under these circumstances, such applications have to rely on some mechanism to retain their state. In this paper, we proposed Freezer, a backup and restore controller that reduces the backup size by monitoring memory accesses and that provides hardware acceleration for the backup and restore procedures. The controller only requires a small memory to keep track of the store operations. Moreover, it can be implemented with plain CMOS technology and does not rely on complex and expensive hybrid non-volatile memory elements. Furthermore, Freezer is a drop-in component that can be integrated in existing SoCs without requiring modifications to the internal architecture of the processor. Our proposed solution achieves an 87.7% average reduction in backup size on a set of benchmarks, and a two-orders-of-magnitude reduction in backup time when compared with software-based state-of-the-art approaches. The code and traces used in this paper are available for reproducibility at https://gitlab.inria.fr/dpala/freezer-resources.

ACKNOWLEDGMENT
This work was supported by the Inria Project Lab "Zero-Power Systems" (ZEP). The authors would like to thank the anonymous reviewers for their comments and feedback.

REFERENCES

[1] M. Zwerg et al., "An 82 µA/MHz microcontroller with embedded FeRAM for energy-harvesting applications," in IEEE Int. Solid-State Circuits Conf., Feb. 2011, pp. 334–336.
[2] B. Ransford, J. Sorber, and K. Fu, "Mementos: System Support for Long-running Computation on RFID-scale Devices," in ACM Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2011, pp. 159–170.
[3] D. Balsamo et al., "Hibernus: Sustaining Computation During Intermittent Supply for Energy-Harvesting Systems," IEEE Embedded Systems Letters, vol. 7, no. 1, pp. 15–18, 2015.
[4] ——, "Hibernus++: A Self-Calibrating and Adaptive System for Transiently-Powered Embedded Devices," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, no. 12, pp. 1968–1980, 2016.
[5] H. Jayakumar, A. Raha, W. S. Lee, and V. Raghunathan, "QuickRecall: A HW/SW Approach for Computing Across Power Cycles in Transiently Powered Computers," J. Emerg. Technol. Comput. Syst., vol. 12, no. 1, pp. 8:1–8:19, 2015.
[6] J. Choi, H. Joe, Y. Kim, and C. Jung, "Achieving Stagnation-Free Intermittent Computation with Boundary-Free Adaptive Execution," Apr. 2019, pp. 331–344.
[7] W. K. Yu, S. Rajwade, S. E. Wang, B. Lian, G. E. Suh, and E. Kan, "A non-volatile microcontroller with integrated floating-gate transistors," in IEEE/IFIP 41st Int. Conf. on Dependable Systems and Networks Workshops (DSN-W), Jun. 2011, pp. 75–80.
[8] Y. Wang et al., "A 3µs wake-up time nonvolatile processor based on ferroelectric flip-flops," in Proc. of ESSCIRC, 2012, pp. 149–152.
[9] Y. Liu et al., "A 65nm ReRAM-enabled nonvolatile processor with 6× reduction in restore time and 4× higher clock frequency using adaptive data retention and self-write-termination nonvolatile logic," in IEEE Int. Solid-State Circuits Conf. (ISSCC), 2016, pp. 84–86.
[10] Z. Wang et al., "A 130nm FeRAM-based parallel recovery nonvolatile SoC for normally-off operations with 3.9× faster running speed and 11× higher energy efficiency using fast power-on detection and nonvolatile radio controller," in Symposium on VLSI Circuits, 2017, pp. C336–C337.
[11] N. Sakimura et al., "A 90nm 20MHz fully nonvolatile microcontroller for standby-power-critical applications," in IEEE Int. Solid-State Circuits Conf. (ISSCC), 2014, pp. 184–185.
[12] S. Senni, L. Torres, G. Sassatelli, and A. Gamatie, "Non-volatile processor based on MRAM for ultra-low-power IoT devices," J. Emerg. Technol. Comput. Syst., vol. 13, no. 2, Dec. 2016.
[13] S. C. Bartling, S. Khanna, M. P. Clinton, S. R. Summerfelt, J. A. Rodriguez, and H. P. McAdams, "An 8MHz 75µA/MHz zero-leakage nv logic-based Cortex-M0 MCU SoC exhibiting 100% digital state retention at VDD=0V with <400ns wakeup and sleep transitions," in IEEE Int. Solid-State Circuits Conf., 2013, pp. 432–433.
[14] B. Ransford and B. Lucia, "Nonvolatile memory is a broken time machine," in Proc. of the Workshop on Memory Systems Performance and Correctness (MSPC), Edinburgh, UK, Jun. 2014, pp. 1–3.
[15] Q. Liu and C. Jung, "Lightweight hardware support for transparent consistency-aware checkpointing in intermittent energy-harvesting systems," Aug. 2016, pp. 1–6.
[16] Z. Ghodsi, S. Garg, and R. Karri, "Optimal Checkpointing for Secure Intermittently-powered IoT Devices," 2017, pp. 376–383.
[17] K. Ma et al., "Architecture exploration for ambient energy harvesting nonvolatile processors," Feb. 2015, pp. 526–537.
[18] F. Su, Y. Liu, Y. Wang, and H. Yang, "A ferroelectric nonvolatile processor with 46µs system-level wake-up time and 14µs sleep time for energy harvesting applications," IEEE Trans. on Circuits and Systems I: Regular Papers, vol. 64, no. 3, pp. 596–607, Mar. 2017.
[19] M. Zhao et al., "Software assisted non-volatile register reduction for energy harvesting based cyber-physical system," in IEEE/ACM Design, Automation & Test in Europe Conf. (DATE), Mar. 2015, pp. 567–572.
[20] M. Zhao, C. Fu, Z. Li, Q. Li, M. Xie, Y. Liu, J. Hu, Z. Jia, and C. J. Xue, "Stack-Size Sensitive On-Chip Memory Backup for Self-Powered Nonvolatile Processors," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 11, pp. 1804–1816, 2017.
[21] P. A. Hager, H. Fatemi, J. P. de Gyvez, and L. Benini, "A scan-chain based state retention methodology for IoT processors operating on intermittent energy," in IEEE/ACM Design, Automation & Test in Europe Conf. (DATE), 2017, pp. 1171–1176.
[22] S. Yu and P. Chen, "Emerging memory technologies: Recent trends and prospects," IEEE Solid-State Circuits Mag., vol. 8, no. 2, pp. 43–56, 2016.
[23] S. Rokicki, D. Pala, J. Paturel, and O. Sentieys, "What You Simulate Is What You Synthesize: Designing a Processor Core from C++ Specifications," Nov. 2019, pp. 1–8.
[24] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, "NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory,"