A Machine Learning Accelerator In-Memory for Energy Harvesting
Salonik Resch, S. Karen Khatamifard, Zamshed Iqbal Chowdhury, Masoud Zabihi, Zhengyang Zhao, Jian-Ping Wang, Sachin S. Sapatnekar, Ulya R. Karpuzcu
Salonik Resch [email protected]
S. Karen Khatamifard [email protected]
Zamshed Iqbal Chowdhury [email protected]
Masoud Zabihi [email protected]
Zhengyang Zhao [email protected]
Jian-Ping Wang [email protected]
Sachin S. Sapatnekar [email protected]
Ulya R. Karpuzcu [email protected]
Abstract
There is increasing demand to bring machine learning capabilities to low power devices. By integrating the computational power of machine learning with the deployment capabilities of low power devices, a number of new applications become possible. In some applications, such devices will not even have a battery, and must rely solely on energy harvesting techniques. This puts extreme constraints on the hardware, which must be energy efficient and capable of tolerating interruptions due to power outages. Here, as a representative example, we propose an in-memory support vector machine learning accelerator utilizing non-volatile spintronic memory. The combination of processing-in-memory and non-volatility provides a key advantage in that progress is effectively saved after every operation. This enables instant shut down and restart capabilities with minimal overhead. Additionally, the operations are highly energy efficient, leading to low power consumption.
Machine learning is desirable for low-power, edge devices as it provides the capability to solve a wide variety of problems. As a result, much research has been devoted to optimizing hardware for machine learning inference on such devices [15, 37]. Going even further, energy harvesting techniques [33] remove the need for a battery, enabling the placement of such devices into almost any conceivable environment. There are many exciting possible applications, such as low power sensor networks [46], wearable tech, or even implants [25]. Previous work has already experimentally demonstrated machine learning capability on energy harvesting devices using commercially available hardware [24].

Energy harvesting applications present numerous and unique challenges. Power limitations are extreme. The energy harvested from the environment is likely far less than what can be supplied by a battery. Thus, energy efficiency is even more critical than in mobile applications. Significantly, the process of energy harvesting also introduces the requirement for intermittent processing. Energy sources (such as sunlight, heat, movement) may be unreliable, and a device will have to shut down when the power source goes away. Additionally, even when available, the power source may be insufficient to run the device continually. In order to operate within the power budget, the device must acquire energy over time and consume it in bursts [8]. Intermittent processing introduces new considerations and metrics for performance [39]. Significantly, correctness has to be guaranteed over shut down and restart operations. If the state is not properly stored, a process known as checkpointing, restarting a device can lead to memory inconsistencies and incorrect operation [12]. Additionally, the efficiency of these shut down and restart operations becomes critical, as they take away precious energy from operations that enable forward progress.
Also critical, it has to be ensured that forward progress can be made during phases of power-on time. If the energy required between two checkpoints is too large, the device will be unable to complete the computation. This results in a program getting stuck repeating the same computation, which is referred to as non-termination. Thus, effective energy harvesting devices must have efficient techniques which enable correctness and forward progress, all while remaining within a modest hardware budget.

A recently proposed spintronic processing-in-memory (PIM) substrate, CRAM [10], is uniquely well suited for energy harvesting applications. Operations on CRAM are highly energy efficient, enabling a low power budget. Further, as it is a PIM solution, it removes the need for energy hungry data transfers between processor logic and (volatile) memories. The main advantage, however, is that progress is automatically saved after every operation. CRAM consists entirely of non-volatile devices and the results of all computation are immediately stored in permanent memory. As there are very few variables required to maintain the architectural state, these can also be saved after each operation with minimal energy cost. Effectively, checkpointing occurs after every operation.

Checkpointing after each operation is not a new idea [43], and for most systems this would generally be considered inefficient [12]. However, as CRAM is a non-volatile PIM substrate, most of the checkpointing operations come for free. Hence, CRAM can restart a program from the very last operation with fast and efficient shut down and restart. Additionally, CRAM is always in a state that can be recovered from. The power can be cut instantly and unexpectedly, and it will still restart correctly. The maximum penalty is repeating the last instruction. We refer to this capability as instantly restart-able.
This provides a significant advantage, as shut down and restart procedures for more conventional energy harvesting devices introduce additional latency and energy, and significant complexity.

While PIM has been used previously in energy harvesting devices [57], in such cases the PIM array acts as a sub-component of the system, leaving much of the computation to an external processor. Hence, these systems do not exploit the full potential of PIM as CRAM does. Other non-volatile PIM substrates such as [36], which could potentially be adapted similarly, use external logic at the periphery of the memory array (including sense amplifiers) for computation, which is not only less energy efficient, but also makes adaptation for intermittent processing more complex.

In this paper, we introduce MASTER (Machine Learning Accelerator in STT-MRAM for Energy HaRvesting Applications), which is built using CRAM [10]. While based on CRAM, MASTER has a different cell design which reduces energy consumption during computation. As a case study, we implement support vector machines (SVM), which are widely used machine learning algorithms. We demonstrate how MASTER can provide high performance and energy efficiency on such applications while also having efficient shut down and restart procedures. Additionally, we show how another modification to the CRAM cell, the addition of a spin-hall effect (SHE) channel, can further increase energy efficiency.

In Section 2 we provide the working principles of CRAM. In Section 3 we describe the support vector machine we use as an application. We introduce design specifics of MASTER in Section 4 and show how we guarantee correctness in Section 5. We set up the evaluation in Section 6, show our results in Section 7, discuss related work in Section 8, and conclude in Section 9.

Spintronic memory in the form of STT-MRAM is an emerging technology, with a few products already commercially available [1].
Due to its non-volatility, high density, speed, and endurance, STT-MRAM is being considered as a universal memory replacement [18]. STT-MRAM arrays use one magnetic tunnel junction (MTJ) and one access transistor per cell. MASTER maintains the same basic cell structure. By making light modifications to the array, we are able to connect MTJs in such a way as to enable logic operations to be implemented within the array. Therefore, MASTER is capable of being used as both a standard STT-MRAM array and as a computational substrate. MASTER is unique in that the computation does not require any external logic circuits or the use of sense amplifiers, making the computation contained entirely within the array. In the following, we explain MTJ basics and show how they can be used in logic operations. Then we demonstrate how these operations can be performed within the array structure.

2.1 Magnetic Tunnel Junction (MTJ)
STT-MRAM arrays are built with magnetic tunnel junctions (MTJs). The MTJ is a resistive memory device which consists of two magnetic layers (a fixed layer and a free layer) separated by an insulator. The polarity of the free layer can change but that of the fixed layer cannot. When the fixed and free layers are aligned, the MTJ is in the parallel (P) state, which has a low resistance and corresponds to logic value 0. When the layers are opposing, the MTJ is in the anti-parallel (AP) state, which has a high resistance and corresponds to logic value 1.

The state can be determined by applying a voltage across the device and sensing the amount of current that travels through it. If a sufficient amount of current is driven through the device, it will change state. Importantly, the state it changes to depends on the direction of the current. This is key to our ability to ensure correctness in spite of power outages. When current flows from the free layer (fixed layer) to the fixed layer (free layer), it switches the MTJ to the AP (P) state.
Before showing how logic can be implemented in the MASTER array, we demonstrate how logic gates are performed on MTJs. The configuration for a two-input logic gate is shown in Figure 1. The two MTJs in parallel are the inputs to the logic gate, and the MTJ in series with them is the output. The output must be preset to a known value. For example, the output is preset to 0 (low resistance) for a NAND gate. To implement a NAND gate, a voltage is applied across the two terminals, V1 and V2, such that current flows from the input MTJs to the output MTJ. If either of the input MTJs is 0 (low resistance) there will be sufficient current to switch the output MTJ to 1. If both input MTJs are 1, there will be insufficient current to change the state of the output MTJ, and it will remain at 0. Therefore, the state of the output MTJ follows the truth table for a NAND gate: it is 0 only if both inputs are 1. Due to the underlying physics, MTJ switching depends on the direction of the current. Current flowing from the input MTJs to the output MTJ can only cause the output MTJ to switch to 1; it cannot cause it to switch to 0.

All other logic gates are performed similarly. In order to implement other gates, we can change the number of inputs, the preset value of the output, or the direction of the current. For example, using the same circuit (same number of inputs), we can perform an AND gate on the two inputs. In this case, the output MTJ is preset to 1 and current is applied in the opposite direction. This is because we want the output MTJ to switch from 1 to 0 (rather than 0 to 1) if either of the input MTJs is 0 (low resistance). Hence, current flows from the output MTJ to the input MTJs, and if either of the input MTJs is 0, there is sufficient current to switch the output to 0. This follows the logic of an AND gate, where the output will be 0 if either of the inputs is 0. Many other common gates can be implemented in this way, such as NOT, COPY, and (N)OR.
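The preset-and-threshold mechanism described above can be captured in a small behavioral model. In the sketch below, the resistance values, gate voltages, and switching threshold are illustrative assumptions (the discussion above gives no device parameters), but the same divider-plus-threshold logic reproduces the NAND and AND truth tables:

```python
# Behavioral sketch of the MTJ threshold gate: two input MTJs in parallel,
# in series with a preset output MTJ. All numbers are illustrative.
R_P, R_AP = 1.0, 3.0  # parallel (logic 0) / anti-parallel (logic 1) resistance

def mtj_gate(kind, a, b, i_th=0.5):
    r = lambda x: R_AP if x else R_P
    r_in = 1.0 / (1.0 / r(a) + 1.0 / r(b))   # input MTJs in parallel
    if kind == "NAND":
        v, r_out = 1.0, R_P                  # output preset to 0, forward current
        # sufficient current can only flip the output 0 -> 1
        return 1 if v / (r_in + r_out) > i_th else 0
    elif kind == "AND":
        v, r_out = 2.0, R_AP                 # output preset to 1, reversed current
        # sufficient current can only flip the output 1 -> 0
        return 0 if v / (r_in + r_out) > i_th else 1

for a in (0, 1):
    for b in (0, 1):
        assert mtj_gate("NAND", a, b) == 1 - (a & b)
        assert mtj_gate("AND", a, b) == a & b
```

Note that, as in the text, a low-resistance (logic 0) input raises the current above the threshold, while two high-resistance (logic 1) inputs keep it below.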
(Inverse) Majority gates with more inputs are also possible. However, the difference in the combined resistances of different inputs gets harder to distinguish for more inputs, as more resistances are added in parallel. Generally, two-input gates are robust as MTJ resistance is much larger than the parasitic resistance of the access transistors. Hence, resistance differences between different input combinations are also large. We restrict ourselves to a maximum of two inputs per gate in our evaluation.

More complex operations are broken down into these basic logic operations. For example, a full-add can be performed with 9 NAND gates and 7 temporary bits. To perform a full-add in MASTER, we perform the 9 NAND gates sequentially and use spare MTJs to hold the temporary bits. Using full-adds, full-subtracts, and other primitive operations we can perform integer or fixed-point arithmetic, thus enabling us to implement our benchmarks. Naturally, the latency for each complex operation is quite high, as it must be broken down into its constituent gates which are then performed sequentially. However, as we will show in later sections, this can be compensated for by performing many data independent operations in parallel.

Figure 1: MTJs connected to implement a 2-input logic gate. The preset value of the output MTJ and the polarity and magnitude of the voltage applied between V1 and V2 determine the type of logic gate. The fixed layer is colored in grey and the free layer in light blue.

MASTER is an STT-MRAM array with some additional hardware. Four cells located in adjacent rows and columns are shown in Figure 2. Each memory cell consists of one MTJ and one access transistor. In each column there are two bit lines, bit line even (BLE) and bit line odd (BLO), and a logic line (LL). In each row there is a wordline (WL) that controls the access transistor.
Each MTJ is connected to the LL through the access transistor and to one of the two bit lines. Cells in even rows are connected to BLE and cells in odd rows are connected to BLO. We now describe how memory and logic operations are performed in the array.
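As noted above, a full-add decomposes into 9 sequential NAND gates with 7 temporary bits. One standard 9-NAND construction (assumed here for illustration; the exact gate ordering is not spelled out above) can be checked exhaustively in a few lines:

```python
# Full adder built from 9 two-input NAND gates, executed sequentially as
# MASTER would, with t1..t7 standing in for the 7 temporary MTJ bits.
def nand(a, b):
    return 1 - (a & b)

def full_add(a, b, cin):
    t1 = nand(a, b)
    t2 = nand(a, t1)
    t3 = nand(b, t1)
    t4 = nand(t2, t3)      # t4 = a XOR b
    t5 = nand(t4, cin)
    t6 = nand(t4, t5)
    t7 = nand(cin, t5)
    s = nand(t6, t7)       # sum = a XOR b XOR cin
    cout = nand(t5, t1)    # carry out
    return s, cout

# exhaustive check over all 8 input combinations
for x in range(8):
    a, b, cin = x & 1, (x >> 1) & 1, (x >> 2) & 1
    s, cout = full_add(a, b, cin)
    assert 2 * cout + s == a + b + cin
```

Chaining such full-adds bit by bit yields the integer and fixed-point arithmetic mentioned above.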
Memory Operation:
To read or write from row n, activate WLn and apply a voltage differential across LL and the bit lines. Current will only travel through the bit line with the same parity as n. Current can be sensed on the bit lines to perform a read, or a large current can be driven through the MTJ to perform a write.

Logic Operation:
To perform a logic operation with inputs on rows n1 and n2 with output in row m, preset row m by performing a write operation. Activate WLn1, WLn2, and WLm. Apply a voltage differential across BLE and BLO. Current travels from one bit line, through the MTJs in rows n1 and n2, through the LL, through the MTJ in row m, and back to the other bit line. Depending on the states of the MTJs in rows n1 and n2, the MTJ in row m will either change state or not. Rows n1 and n2 must have the same parity and m the opposite.

The voltage which drives the operation is applied to every column in which the specified operation should take place. The peripheral circuitry determines which columns these are, which can be specified by dedicated instructions as will be described in Section 4.2. Hence, while only one operation can be performed in a column at a time, an operation can be performed in many columns simultaneously. This gives MASTER column level parallelism. This bears some resemblance to bit-serial architectures.

Augmenting each MASTER cell with a spin hall effect (SHE) channel can further improve energy efficiency. Four augmented cells in two rows and two columns are shown in Figure 3. Each device (MTJ and combined SHE channel) now has three terminals, instead of two. This necessitates the addition of a second transistor per cell. One end of the SHE channel is connected directly to the bit line. There are two word lines per row, word line read (WLR) and word line write (WLW). Each controls one of the access transistors. The access transistor t_read is controlled by WLR and connects the other end of the MTJ to the logic line. When t_read is activated, current passes through the SHE channel and the MTJ device. This allows the MTJ state to affect the current that travels through it. This is used when reading the MTJ state and when the MTJ is used as an input to a logic operation. The transistor t_write is controlled by WLW and connects the other end of the SHE channel to the logic line.
When t_write is activated, current only passes through the SHE channel. This current, while not affected by the state of the MTJ, can still change the state of the MTJ. This configuration is used when writing to the MTJ or when the MTJ is the target output of a logic operation.

Figure 2: Four cells in two columns and two rows of the 1T-1M configuration. Abbreviations are Wordline (WL), Logic Line (LL), Bitline Even (BLE), Bitline Odd (BLO). LL enables the MTJs to be connected, with voltages applied over BLE and BLO.

Figure 3: Four cells in two columns and two rows of the 2T-1M SHE MASTER configuration. There are two word lines, word line read (WLR) and word line write (WLW).

The SHE channel has a few benefits. One is that the required current density to induce switching in the SHE channel is lower, allowing for a reduction in the energy of write and logic operations. It also removes the need to preset the value of the output MTJ for logic operations, as the state of the MTJ does not affect the SHE channel resistance, and hence does not need to be accounted for. This saves latency and energy by making many write operations unnecessary during the run of a program.

To show the capability of MASTER, we implement Support Vector Machines (SVM) as a case study. SVMs are widely used machine learning algorithms. Currently, they are second in popularity to neural networks. The applications for neural networks and SVMs overlap considerably; however, they offer different advantages. Neural networks generally provide higher accuracy at a cost of more complexity. They are particularly good at image recognition, which has resulted in increased attention in the last few years.

MASTER can also serve as an accelerator for neural networks, as it is capable of performing a universal set of logic operations, hence, any program. Compressed neural networks [24] and binary neural networks [16] would also be well suited for the energy harvesting domain.
SVMs are effective and simple classifiers for typically smaller data sets, which we chose as a case study in this paper, without loss of generality. In particular, we found SVMs to perform well on image recognition and human activity recognition. However, there is a trade-off, as SVMs can struggle with some problems. For example, we were unable to achieve reasonable accuracy on the speech recognition data set, which neural networks have performed well on [24]. Generally speaking, whether SVMs or neural networks are a superior choice depends on the target problem, but both are applicable for energy harvesting applications.

SVMs work by mapping inputs to a higher dimensional space, where the different classes become linearly separable from each other. Training an SVM involves finding a set of training inputs (support vectors) and weights (coefficients) which are good indicators of a particular class output. New inputs are then compared to the chosen training inputs, and whichever class an input is most similar to is the assigned class.

For all benchmarks we use a polynomial kernel with a degree of 2. For inference, the main computation is effectively computing the dot product between an input vector and each of the support vectors. The results of these dot products are then squared, multiplied by the coefficients, and finally added together. By design, SVMs have two class outputs, where the sign of the output value is the classification.

In this work, we opt for the simplest extension to multi-class problems: we train a separate SVM for each possible output class. Each SVM has the task of identifying its assigned class. For example, MNIST has 10 different classes for digits 0-9. We train 10 SVMs, one identifying each digit. The output is 10 scores for "how similar" the input is to each digit. We take the highest output of the 10 classifiers to be the final classification. Our SVMs are custom designed; however, we compare our results with libSVM [9] and achieve comparable accuracy.
We perform training offline in software and only consider inference acceleration on MASTER.
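The inference computation described above, a one-vs-rest ensemble of SVMs with a degree-2 polynomial kernel, can be sketched as follows. The model values in the usage example are made-up toy numbers rather than trained parameters, and the kernel's `gamma` and `coef0` defaults are illustrative assumptions:

```python
import numpy as np

def svm_score(x, support_vectors, coefs, bias, gamma=1.0, coef0=0.0):
    """Score for one binary SVM: dot products with every support vector,
    passed through a degree-2 polynomial kernel, weighted and summed."""
    k = support_vectors @ x                          # the bulk of the work
    return float(np.sum(coefs * (gamma * k + coef0) ** 2) + bias)

def classify(x, models):
    """One SVM per class; the highest score is the final classification."""
    scores = [svm_score(x, sv, c, b) for sv, c, b in models]
    return int(np.argmax(scores))

# toy two-class example: each "class" keys on one input dimension
models = [
    (np.array([[1.0, 0.0]]), np.array([1.0]), 0.0),  # class 0
    (np.array([[0.0, 1.0]]), np.array([1.0]), 0.0),  # class 1
]
```

The per-support-vector dot products are exactly the data-independent operations MASTER can run in parallel across columns.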
Energy harvesting systems are powered by their environment. If the environment does not provide enough power, the system will have to accumulate energy over time and consume it in bursts [24]. Therefore, such devices must consume as little energy as possible and be capable of tolerating power outages while maintaining program correctness. MASTER is a natural fit for such a paradigm as logic operations are highly energy efficient and the memory is entirely non-volatile. Additionally, all computation occurs within the memory, so progress is effectively saved after each operation. This greatly simplifies strategies to maintain correctness. In this section, we detail a basic MASTER design which is tightly tailored to energy harvesting applications.
MASTER has a tiled architecture. Certain MASTER tiles are dedicated for instructions, while all others are dedicated for data and computation, as shown in Figure 5. MASTER has a larger storage capacity than is typical for energy harvesting devices. This is due to two reasons. First, STT-MRAM is dense and has extremely low standby power, giving the memory a low area and energy impact. For example, NVSIM [19] reports the size of a 64 MB STT-MRAM array, which is nearly twice the size of our largest configuration, as 15.12 mm². A 256 MB STT-MRAM memory device manufactured by Everspin [1] comes in a package that is 130 mm². For reference, the MSP430FR5994 micro-controller commonly used as a sub-component of energy harvesting systems [24, 13, 26, 27, 28, 53] is over 100 mm². Second, as there is no need for external processor logic or area costly volatile memory (such as SRAM), and due to minimal peripheral circuitry, nearly the entire area budget is available for memory arrays. There are only five components of MASTER that are not memory arrays:

1. A memory controller that reads instructions from the instruction arrays and issues all instructions;
2. A 128B memory buffer that facilitates communication between MASTER tiles;
3. A non-volatile register for the program counter;
4. A non-volatile register for storing a single instruction;
5. Voltage sensing circuitry for monitoring the power source.

Figure 4: MASTER instruction formats. There are three types of instructions: logic, memory, and an additional activate columns instruction for configuration. Opcodes are 4 bits; tile addresses, 9 bits; and row and column addresses, 10 bits each. Dashed items are optional.

The memory controller only needs to differentiate between three instruction types, as will be described in Section 4.2. All computation and memory operations are performed in the tiles, hence the controller needs only broadcast the appropriate command to the tiles.
The memory buffer is the same size as one line of the MASTER tiles and is used for intermediate storage when transferring data between tiles. The non-volatile registers are used for maintaining correctness during power outages, as will be described in Section 4.4. The voltage sensing circuitry is standard for energy harvesting systems, and is as described in [39].

Instructions for MASTER are 64-bit and the formats are shown in Figure 4. There are three types of instructions: logic operations, memory operations, and column activation. Memory operations are the same as standard read and write operations for STT-MRAM. Instructions for logic operations specify the type of operation (which determines the applied voltage level) and the rows on which input and output cells reside. When a logic instruction is issued, it will be applied to every column that is currently active. Columns are activated by the
Activate Columns instruction, which provides a list of column addresses to a column decoder. Once columns are activated they are held active by a latching mechanism as proposed by [36]. This allows columns to remain active over multiple instructions. As columns need to be changed infrequently, typically staying active for many instructions, the peripheral cost for activation is amortized. This cost is further reduced by modifying the encoding to allow for bulk addressing, similar to the procedure in [56].

Compiling instructions for MASTER is non-trivial as it requires some knowledge of the hardware to make efficient use of potential parallelism. This situation is analogous to compiling for GPU architectures from OpenCL or CUDA code. Unfortunately there is no equivalent for PIM. In our work the instructions are custom generated; however, the architecture and data layout for MASTER is similar to a number of other processing-in-memory (PIM) substrates [36, 56].

Figure 5: Overview of MASTER. MASTER tiles hold data and instructions. The memory controller fetches instructions and broadcasts them to the tiles. The memory controller is also responsible for maintaining the program counter and valid bits to preserve architectural state.

Some tiles are dedicated to store the instructions. The instructions are written into these tiles before deployment. Once active, the memory controller fetches each instruction from the instruction tiles, decodes it, and then broadcasts it to the tiles storing data. Instructions vary in the amount of time they take to complete. This is because specifying row and column addresses has an associated latency, and different instructions have different numbers of addresses. Logic operations can use 2 or 3 rows and column activation can specify up to 5 columns. To ensure that every instruction finishes, the memory controller waits longer than the longest instruction before issuing the next.
This does not impact performance as, due to power restrictions of energy harvesting sources described in Section 4.3, MASTER does not issue instructions as fast as possible. Hence, this wait period can use already existing spare time.

In this work, as we are only performing inference, the instructions performed are not input dependent. Instructions are performed in sequential order until the program repeats. We provide more detail on issuing instructions in Section 5.
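As a concrete illustration of the 64-bit instruction formats (4-bit opcode, 9-bit tile address, 10-bit row and column addresses), the sketch below packs and unpacks a logic instruction. The field ordering is an assumption for illustration; Figure 4 fixes only the field widths:

```python
# Pack a logic instruction (opcode, tile, three row addresses) into one
# 64-bit word and recover the fields. 4 + 9 + 3*10 = 43 bits used.
OPCODE_BITS, TILE_BITS, ADDR_BITS = 4, 9, 10

def pack_logic(opcode, tile, rows):
    assert len(rows) == 3                       # e.g. input1, input2, output row
    word = opcode & 0xF
    word = (word << TILE_BITS) | (tile & 0x1FF)
    for r in rows:
        word = (word << ADDR_BITS) | (r & 0x3FF)
    return word

def unpack_logic(word):
    rows = []
    for _ in range(3):
        rows.append(word & 0x3FF)
        word >>= ADDR_BITS
    tile = word & 0x1FF
    opcode = (word >> TILE_BITS) & 0xF
    return opcode, tile, rows[::-1]             # rows come out in reverse
```

A two-row gate would leave one address slot unused, matching the "optional" dashed fields in Figure 4.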
While MASTER operations are very energy efficient, MASTER can still consume a lot of power due to large amounts of parallelism. If unconstrained, MASTER can consume up to approximately 15 mW. Unfortunately, typical energy harvesters can only provide up to a few hundred microwatts of power [39]. Fortunately, MASTER can be easily configured to consume much less power at the cost of performance. One option is to reduce parallelism and perform more operations sequentially. Instead, we choose to reduce the rate at which we issue instructions, so operations are performed at a lower frequency. This introduces idle time between instructions. This idle time can be used to perform other useful tasks, such as updating the architectural state.
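The throttling arithmetic is straightforward: the sustainable issue rate is the harvested power budget divided by the energy of one (column-parallel) instruction. Both numbers in the example below are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope issue-rate calculation for power throttling.
def max_issue_rate(budget_watts, energy_per_instruction_joules):
    """Instructions per second sustainable within the power budget."""
    return budget_watts / energy_per_instruction_joules

# e.g. a 200 uW harvester and an assumed 10 nJ per instruction
rate = max_issue_rate(200e-6, 10e-9)   # 20,000 instructions per second
```

Issuing any slower than this rate leaves idle time between instructions, which is where the architectural-state updates described below fit.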
As energy harvesting systems frequently experience power outages, they must be designed to perform intermittent processing. This involves addressing the challenge of maintaining correct state while repeatedly shutting down and restarting. The mechanism for maintaining state must also be efficient, so as to avoid consuming the precious energy available for program execution. A number of techniques have been designed to ensure correctness [12, 50, 45, 23]. These studies have devised sophisticated techniques to ensure correctness while introducing minimal backup and restart overhead. In contrast, MASTER maintains correctness with just a program counter (PC) and an additional non-volatile status bit. While extremely simple, and while it would be crude for other architectures, this approach is a natural fit for MASTER. More sophisticated techniques are unsuitable and unnecessary as MASTER has no volatile data to back up. As MASTER performs all computation within the non-volatile memory, progress is saved after each operation. This makes restarting after the last instruction possible and ideal.

When MASTER restarts, only two pieces of information are required: the last instruction that was performed and the columns that were active. In order to restart from the last instruction, MASTER writes the PC into a non-volatile register after each instruction. When MASTER gains sufficient power to restart, it simply reads the next instruction from the address in the PC. In the worst case, the power is cut after the last instruction is issued and performed, but before the update to the PC register. This does not break correctness, as the same result is obtained if a single instruction is repeated multiple times, meaning it is idempotent, as will be shown in Section 5.1. The only requirement is that the PC update happens strictly after each instruction is performed. Restarting after the very last instruction not only minimizes the amount of work potentially lost on shutdown, but it simplifies the restart process.
The simple correctness guarantee, an operation being idempotent, does not hold if we were to repeat multiple instructions. This is because over the course of multiple instructions, multiple temporary values can be created. These temporary values may be used later in the computation or periodically overwritten. Repeating multiple instructions on startup would require some method for ensuring correctness of these temporary values, such as performing additional presetting operations. This is certainly possible to do, but it introduces additional complexity.

The second requirement is to restore the previously active columns, for which we use a similar procedure. Whenever an activate columns instruction is issued, it is stored in an additional instruction register. Reissuing this last activate columns instruction is the first action on restart. This scheme gives MASTER minimal backup and restart overhead. The cost is 1) continuous update of the program counter and activate columns registers and 2) an additional issue of an activate columns instruction on every restart. Both of these actions incur far less energy than a typical logic instruction. It is noteworthy that MASTER is always in a state which is safe to shut down in. Hence, MASTER maintains correctness even if power is cut unexpectedly.

We make sure that operations happen in the correct order by performing them sequentially; updates to (architectural) state maintaining registers occur only after the current instruction is performed. If run at full speed, MASTER consumes more power than a typical energy harvesting source can provide. This requires us to reduce the rate at which we issue instructions. Hence, there is already a time slack between instructions, during which these updates to the architectural state can be performed.
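The restart discipline above (PC written strictly after each instruction; the last activate columns instruction reissued first on restart) can be sketched as follows. The class and names are illustrative; the model abstracts away device details and relies on each instruction being idempotent:

```python
# Sketch of the backup/restart discipline. NonVolatile stands in for the
# two non-volatile registers. Key invariant: the PC register is updated
# strictly AFTER the instruction is performed, so a power cut at any point
# repeats at most one (idempotent) instruction.
class NonVolatile:
    def __init__(self):
        self.pc = 0                 # program counter register
        self.last_activate = None   # register for the last activate columns instruction

class PowerOut(Exception):
    pass

def run(nv, program, execute, power_ok):
    if nv.last_activate is not None:
        execute(nv.last_activate)        # first action on every restart
    while nv.pc < len(program):
        if not power_ok():
            raise PowerOut()             # unexpected shutdown
        inst = program[nv.pc]
        execute(inst)
        if inst.startswith("activate"):
            nv.last_activate = inst
        nv.pc += 1                       # checkpoint after the instruction

# usage: a run interrupted after two instructions, then restarted
nv, log = NonVolatile(), []
program = ["activate c0 c1", "nand r0 r2 r5", "nand r1 r3 r5"]
budget = iter([True, True, False])
try:
    run(nv, program, log.append, lambda: next(budget))
except PowerOut:
    pass
run(nv, program, log.append, lambda: True)   # restart: reissues the activation
```

In the trace, the restart costs exactly one extra activate columns issue and no repeated logic instruction; in the worst case (power cut between `execute` and the PC update) one logic instruction would be repeated, which Section 5.1 shows is safe.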
MASTER holds all static data required and performs all the computation. To be integrated into an energy harvesting system, MASTER needs to receive energy from an energy harvester, receive input from a sensor, and send output to a transmitter. In this work, we assume input data is stored in a non-volatile buffer in the sensor prior to inference. The sensor's buffer is assigned a tile address and is treated as one of the tiles. Additionally, the buffer contains a non-volatile valid bit indicating that new input is ready. When MASTER is ready for new input, the memory controller can check the valid bit and trigger a memory transfer. The memory transfer then consists of reads from the buffer and writes to the MASTER data tiles. These reads and writes are controlled by instructions at the beginning of the program. When MASTER finishes inference, the memory controller reads out the data from the tiles. This data is then available to be transferred to the transmitter. In this work, we focus only on the accelerator and do not consider any overhead for the sensor or transmitter.

Table 1: Four possible cases for re-performing an interrupted AND gate. The output MTJ either should or should not switch for correct operation, and it either did or did not switch prior to the power being cut.

Should not switch, did not switch prior: No values were changed. Repeating the operation is exactly the same as the first time; no switching will occur.

Should not switch, did switch prior: Not possible. There was insufficient current to induce switching at all points during the operation, regardless of when the interruption occurred.

Should switch, did not switch prior: No values were changed. Repeating the operation is as performing it for the first time, and will now result in correct operation.

Should switch, did switch prior: The output has already switched to 0. Reapplying voltage will result in a larger current; however, the direction of the current can only result in switching the output to 0. Hence, this is analogous to applying the voltage for a gate for a longer duration.
We show that correctness is guaranteed in spite of power outages, even when unexpected. There are two components: the correctness of individual operations when interrupted or re-performed, and the correctness of state variables in transitions between states.
In this section we show that correctness is maintained if a single operation is repeated, meaning it is idempotent. Given that the power may be cut at any moment, we must consider what happens when an operation is interrupted in all its possible stages. Since all operations in MASTER are threshold operations, the two stages are pre- and post-switching. Additionally, switching of the output MTJ either should or should not occur depending on the inputs. To be explicit, we use AND as an example; however, our observations here apply to all gates.

The preset value for the output of an AND gate is 1, meaning the MTJ has a high resistance. During operation, current is applied in a direction that could change the output state to 0. If either of the two inputs is 0, there will be a sufficient current to change the state; otherwise it will remain at 1. We show the four possible cases in Table 1. If, due to the inputs, the output is not supposed to switch, the output MTJ will not switch before the power is cut or after the power is restored. On the other hand, if the output is supposed to switch, it does not matter if it switches before the power outage or after. If the output MTJ does not switch before the power outage, it will switch once power is restored and the operation is re-performed. If the output MTJ does switch to 0 before the power outage, re-applying the power afterwards will leave the output at 0. This is because the direction of the current can only change the output to 0; it cannot revert it back to 1.

[Figure 6: State transitions to maintain correctness. The program counter (PC) is duplicated and labelled A and B. Interrupts are highlighted in red, corrective measures in blue, and forward progress in green. Individual instructions are safe to re-perform, as detailed in Section 5.1.]

Putting it all together, the basic idea is that repeating a logic gate is effectively the same as performing the gate for a longer duration. Doing so results in an identical outcome, regardless of whether the output MTJ switched before interruption or not. The case for writes is simpler. The result of a write operation does not depend on the preset value; hence repeating a write is effectively writing the value twice. Power interruptions can lead to wasted energy, by re-performing unnecessary work, but do not result in corruption of logical values.
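The four cases of Table 1 can be checked with a minimal behavioral sketch (not the authors' circuit model): the output MTJ is preset to 1, a pulse can only drive it toward 0, and whether it switches depends only on the inputs, so replaying the pulse never corrupts the result.

```python
# Minimal model of why repeating an interrupted AND gate is idempotent.

def and_gate_pulse(a, b, out):
    """One current pulse: switches the preset output (1) to 0 iff the
    input resistances allow sufficient current, i.e. iff a AND b == 0.
    The current direction can never switch the output back to 1."""
    should_switch = (a & b) == 0
    return 0 if should_switch else out

def run_with_interrupt(a, b, switched_before_cut):
    out = 1  # preset value
    if switched_before_cut:           # pulse completed before power loss
        out = and_gate_pulse(a, b, out)
    # power restored: the whole instruction is simply re-performed
    out = and_gate_pulse(a, b, out)
    return out

# All four cases of Table 1: the result equals a AND b regardless of
# whether the output had already switched when power was cut.
for a in (0, 1):
    for b in (0, 1):
        for early in (False, True):
            assert run_with_interrupt(a, b, early) == (a & b)
```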
It must also be ensured that the memory controller can tolerate unexpected interruptions. The memory controller reads instructions from the address held in the non-volatile program counter (PC), decodes them, and broadcasts them to the data tiles. It then updates the PC. If power is cut during a write operation to the PC, the value may be corrupted. We solve this by duplicating the PC register and maintaining a parity bit. If the parity bit is 0 then PC-A is valid, and if the parity bit is 1 then PC-B is valid. The valid PC register points to the instruction currently being executed. After an instruction is completed, the value stored in the valid PC register is read, updated, and written to the invalid PC register. The new PC value now points to the next instruction that is to be executed. After the PC register update, the parity bit is flipped. This process is depicted in Figure 6. With this scheme, a valid copy of the PC is maintained at all times. If power is cut after the update to the invalid PC but before the parity bit is flipped, the memory controller will consider the old PC to be valid on restart. This results in the previous instruction being re-performed. This does not introduce errors, as individual instructions are idempotent, as shown in Section 5.1. Hence, power can be cut at any point during the execution of an instruction and the memory controller can restart correctly.
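The PC update protocol just described can be sketched as a simplified software model (not the hardware): two non-volatile PC registers plus a parity bit that flips only after the new PC is fully written, so a power cut at any point leaves at least one valid PC, at worst replaying one idempotent instruction.

```python
# Sketch of the duplicated program counter scheme with a parity bit.

class DualPC:
    def __init__(self):
        self.pc = [0, 0]   # PC-A and PC-B (non-volatile)
        self.parity = 0    # 0 -> PC-A valid, 1 -> PC-B valid

    def current(self):
        return self.pc[self.parity]

    def advance(self, next_pc, cut_before_flip=False):
        # step 1: write the next PC into the *invalid* register
        self.pc[1 - self.parity] = next_pc
        if cut_before_flip:
            return  # power lost: old PC still valid -> instruction replayed
        # step 2: flip the parity bit, validating the new PC in one write
        self.parity = 1 - self.parity

pc = DualPC()
pc.advance(1)
assert pc.current() == 1
pc.advance(2, cut_before_flip=True)   # interrupted update
assert pc.current() == 1              # restart re-executes instruction 1
pc.advance(2)
assert pc.current() == 2
```

The key design choice is that the parity flip is the single atomic commit point; everything before it is harmless to lose.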
Benchmarks:
Energy harvesting systems are ideal for applications in which the system is difficult or inconvenient to power directly or with batteries. Examples include remote sensors and wearable tech. We choose benchmarks which are representative of different possible use cases, along with an additional standard benchmark.

MNIST [35], as an example of image recognition for sensor networks, is a digit recognition data set, where there are 10 classes for digits 0-9. The input is a grey scale 28 × 28 pixel image with 8-bit precision. The pixels are placed row-wise into a 784 element vector. We also use a binarized version, where pixels that are greater than a threshold value are set to 1 and others to 0. This allows us to replace multiplications with AND gates for some parts of the computation.

Human Activity Recognition (HAR) [2], as an example for wearable tech, is a data set containing measurements from an accelerometer and gyroscope embedded in a smartphone, which is carried by participants performing a variety of activities. The task is to classify each set of readings by which activity is being performed. We represent the input in fixed point integer format with 8-bit precision. Each input is a vector of 561 elements.

ADULT [34] is a commonly used benchmark for SVMs that contains census information; the task is to classify whether an individual makes greater than $50,000 per year or not. We use a reformatted version of the data set from libSVM [9]. Each input is a 15 element vector where each element is an 8-bit integer.

Our SVMs are trained and tested in R [48]. They are custom designed; however, we compare our results with libSVM [9] on the same inputs and obtain similar accuracy. In our custom implementation we do not use any operations that would be inefficient in MASTER; all programs consist of bit-wise and integer arithmetic.

Parameter             Modern        Future
P State Resistance    3.15 kΩ       — Ω
AP State Resistance   7.34 kΩ       — Ω
Switching Time        3 ns [51]     1 ns
Switching Current     40 µA [51]    3 µA

Table 2: Parameters for MTJ devices.
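The binarization and the multiplication-to-AND reduction described above can be sketched as follows (the 8-bit weights and the threshold of 128 are illustrative assumptions, not values from the paper):

```python
# With a binarized pixel x in {0, 1}, the product x * w reduces to ANDing
# x with every bit of w, which is why binarization lets bit-wise AND gates
# replace multipliers in parts of the SVM computation.

def binarize(pixels, threshold=128):
    return [1 if p > threshold else 0 for p in pixels]

def mul_via_and(x_bit, w, bits=8):
    # AND the single input bit with each bit of the 8-bit weight
    out = 0
    for i in range(bits):
        out |= ((w >> i) & 1 & x_bit) << i
    return out

pixels = [0, 200, 255, 17]
weights = [3, 5, 7, 9]
x = binarize(pixels)                       # [0, 1, 1, 0]
dot = sum(mul_via_and(xb, w) for xb, w in zip(x, weights))
assert dot == sum(xb * w for xb, w in zip(x, weights))  # 5 + 7 = 12
```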
Performance and Energy Model:
We simulate the benchmarks on MASTER with an in-house simulator, also implemented in R. MASTER has a tiled architecture. We set each tile to have a capacity of 128KB, which is a 1024 × 1024 array. We chose this size as it is a commonly recommended subarray size for non-volatile memories from NVSIM [19]. We experiment with both modern MTJs [52] and estimates of future MTJs [65]. Expected improvements in MTJ devices will drastically increase energy efficiency. For future MTJs, we test both STT and SHE based architectures. The MTJ parameters we use are shown in Table 2. For future MTJs, two techniques enable a reduction in the switching current: 1) decreasing the damping constant of ferromagnetic materials [55, 47, 20] and 2) using a dual-reference layer structure [31, 17]. To be conservative, we assume 3 µA; however, switching currents as low as 1 µA are possible. To estimate latency and energy cost due to peripheral circuitry, we take data from NVSIM [19], which reports results for modern STT-MRAM memories. We set our peripheral circuitry costs so that they consume the same percentage share of the total latency and energy as reported by NVSIM. In addition to the latency and energy required for performing the instructions, we also account for the overhead involved in reading the instructions from the arrays, updating the program counter and valid bits, storing the most recent activate columns instruction, and re-issuing the last activate columns instruction whenever the system restarts.

We first evaluate the performance of MASTER under continuous power. We do not limit the power it consumes, to allow it to achieve its maximal throughput. Then, we evaluate MASTER under energy harvesting conditions. Following the approach in [39], we model the power source as a 16 kHz square wave with a duty cycle. The duty cycle is the percentage of the time the power source is on, e.g., a duty cycle of 0.5 means power is on half the time.
We report results for each benchmark over a variety of duty cycles. Additionally, during power-on time we need to keep the power within a realistic budget for energy harvesting systems, approximately a couple hundred microwatts [39]. To reduce power consumption we idle between instructions. This increases latency but ensures the power budget is within energy harvesting limitations. Following metrics provided in [54], we report energy dedicated to different components. In addition to total energy, we report Backup energy, Dead energy, and Restore energy. Backup is operations performed prior to shut down to save state. For us, this is the continual writing of the PC, parity bit, and storing each activate columns instruction in an additional instruction register. Dead energy is energy spent re-performing work that was lost during shut down, which in this case is repeating the last instruction on restart. Restore energy includes any operation needed to prepare MASTER for computation on restart. For us, this is issuing the most recent activate columns instruction.

Benchmark (Array Capacity)   Modern   Future STT   Future SHE
MNIST (64MB)                 15.12    17.10        34.20
MNIST Binarized (8MB)        2.11     2.34         4.68
HAR (16MB)                   4.02     4.47         8.93
ADULT (1MB)                  0.41     0.46         0.91

Table 3: Area required for MASTER for different benchmarks and configurations. Units are in mm². Modern results come from NVSIM [19] and Future results come from our conservative cell area projections.

Benchmark              Latency (µs)   Energy (µJ)   —        —            —       Accuracy
MASTER
MNIST                  2,137          22.49         11,813   4.5 / 30.0   17.10   97.55
MNIST (Binarized)      66.63          1.10          12,214   1.25 / 6.0   2.34    97.37
HAR (integer) [2, 60]  1,068          7.62          3,293    2.25 / 10.0  4.47    94.57
ADULT                  116.50         0.12          1,909    0.25 / 0.5   0.46    76.12
libSVM [9]
MNIST                  7,830          234,900       8,652    -            -       98.05
MNIST (Binarized)      19,037         571,116       23,672   -            -       92.49
HAR (integer)          1,701          51,042        2,632    -            -       93.69
ADULT                  379            11,370        15,792   -            -       78.62
SONIC [24]
MNIST                  2,740,000      27,000        NA       0.256        >100    99
HAR                    1,100,000      12,500        NA       0.256        >100    88

Table 4: Unconstrained MASTER (using STT design and future MTJ devices) and related work under continuous power. Unconstrained means power consumption may be higher than what an energy harvesting power source can provide. libSVM is implemented on an Intel Haswell E5-2680v3 processor; SONIC [24] is implemented on an MSP430FR5994 microcontroller.
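The duty-cycle power model and the Backup / Dead / Restore accounting can be sketched as a toy simulator (a 16 kHz wave has a 62.5 µs period; the instruction time and per-event energies below are hypothetical placeholders, not MASTER's measured values):

```python
# Simplified sketch of intermittent execution under a square-wave supply:
# instructions complete only inside the on-window; a cut mid-instruction
# forces a wait, a column re-activation (Restore), and a replay of the
# lost instruction (Dead); every completed instruction pays a small
# non-volatile state update (Backup).

def run_intermittent(n_instr, t_instr_us, duty, period_us=62.5,
                     e_instr=1.0, e_backup=0.01, e_restore=0.05):
    on_us = period_us * duty
    t, done = 0.0, 0
    backup = dead = restore = 0.0
    restarts = 0
    while done < n_instr:
        cycle_pos = t % period_us
        if duty >= 1.0 or cycle_pos + t_instr_us <= on_us:
            t += t_instr_us                   # fits in the on-window
            done += 1
            backup += e_backup                # PC / parity-bit update
        else:                                 # power cut mid-instruction
            t = (t // period_us + 1) * period_us  # wait for next window
            restarts += 1
            restore += e_restore              # reissue activate columns
            dead += e_instr                   # replay the lost instruction
    total = done * e_instr + backup + dead + restore
    return {"time_us": t, "restarts": restarts, "backup": backup,
            "dead": dead, "restore": restore, "total": total}

full = run_intermittent(1000, 1.0, duty=1.0)
half = run_intermittent(1000, 1.0, duty=0.5)
assert full["restarts"] == 0 and full["dead"] == 0.0
assert half["time_us"] > full["time_us"]   # off-time slows completion
```

As in the evaluation, total time grows quickly as the duty cycle shrinks, while Backup, Dead, and Restore energy stay a small fraction of the total because no energy is spent while powered off.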
Area Overhead:
MASTER tiles have a similar area overhead as STT-MRAM arrays. MASTER has an extra bit line per column for the STT configuration. For the SHE configuration, it has an extra transistor and SHE channel for each cell. The impact of the additional bit line is minor, but the additional transistor has significant overhead. To get estimates of area overhead for modern STT-MRAM, we take results directly from NVSIM [19] using a 22 nm node size. This does not take into account the additional circuitry in MASTER. NVSIM only allows for memory sizes that are powers of two, thus we choose the minimum size into which each benchmark fits. To estimate area overhead for MASTER with future MTJs, we create estimates for a cell size assuming access transistors with 1 kΩ resistance. The access transistors dominate the area overhead. This is for two reasons: 1) the MTJs and SHE channel can be placed on a separate layer from the access transistors and 2) the access transistors are much larger. Technology scaling will help reduce the size of the MASTER tiles, but this is counteracted by the additional hardware required. To estimate peripheral circuitry, we take NVSIM results for area efficiency and adjust our estimates by the same ratio. We find that the MASTER arrays are slightly larger than modern STT-MRAM arrays, as shown in Table 3. As the SHE design has twice as many access transistors, the cell area is approximately twice as large.

Results for MASTER unconstrained by power limitations are summarized in Table 4. This is assuming the power source is always on and that the MASTER accelerator can draw as much power as needed, typically a few mW. Also reported are results for the same benchmarks performed using libSVM on a CPU and an energy harvesting system, SONIC [24], under continuous power. libSVM is run on a supercomputing cluster using Intel Haswell E5-2680v3 processors. To be conservative, for libSVM we account only for the processor power consumption and assume it operates at its idle power. SONIC uses a TI-MSP430FR5994 microcontroller and is powered by a Powercast P2210B energy harvester. MASTER shows significant energy efficiency advantages, and improved latency over other implementations. MASTER does require more memory than SONIC; however, we believe this to be reasonable given that MASTER is implemented in high density STT-MRAM and does not need external processing logic or area-costly volatile memory.

MASTER benefits greatly from binarizing the MNIST input. One-bit inputs enable us to replace multiplications with AND gates, which significantly reduces the amount of computation required. This comes at a small cost in accuracy. The libSVM implementation struggles on the binarized MNIST inputs, and attempts to increase accuracy by adding many more support vectors. This increases the latency and energy of inference.

We wish to specifically address the significant difference in performance between MASTER and SONIC [24]. SONIC is implemented on a conventional, low performance microprocessor. That design is highly economical, makes use of very scarce memory capacity, uses currently commercially available hardware, and has been proven experimentally. Additionally, the authors note that there is room for significant improvement in efficiency. While we are reporting a significant latency and energy advantage, MASTER is not yet ready for fabrication. MTJ based logic has been experimentally demonstrated [62]; however, a full-scale CRAM array has not yet. Integration into an energy harvesting system is still a few years away.

[Table 5: Time (T) in µs and Energy (E) in µJ for MNIST for different configurations and duty cycles. Columns: Duty Cycle, Total T, Restore T, Total E, Backup E, Dead E, Restore E; configurations: Modern STT, Future STT, Future SHE.]

[Table 6: Time (T) in µs and Energy (E) in µJ for Binarized MNIST for different configurations and duty cycles. Columns: Duty Cycle, Total T, Restore T, Total E, Backup E, Dead E, Restore E; configurations: Modern STT, Future STT, Future SHE.]

[Table 7: Time (T) in µs and Energy (E) in µJ for Human Activity Recognition (HAR) for different configurations and duty cycles. Columns: Duty Cycle, Total T, Restore T, Total E, Backup E, Dead E, Restore E; configurations: Modern STT, Future STT, Future SHE.]

[Table 8: Time (T) in µs and Energy (E) in µJ for the ADULT benchmark for different configurations and duty cycles. Columns: Duty Cycle, Total T, Restore T, Total E, Backup E, Dead E, Restore E; configurations: Modern STT, Future STT, Future SHE.]

Now we consider MASTER in a more realistic energy harvesting scenario. The power consumption of the unconstrained configuration (a few mW) is unrealistic for energy harvesting, where power budgets are typically a few hundred microwatts. Thus, we slow down the rate at which instructions are issued to remain within the power budget. Following the approach in [39], we model the energy harvesting power source as a 16 kHz square wave with various duty cycles. MASTER can only operate when the power source is on, and remains idle when the power is off. When the power is restored, there is the start-up task of re-activating the active columns with the activate columns instruction. Results are shown in Tables 5–8. The reported total time includes power-off time. A duty cycle of 1 means the system is continuously powered. The time required for each benchmark increases with decreasing duty cycle. This is due to two reasons. Naturally, more time is spent powered off and MASTER cannot make forward progress. Also, with a lower duty cycle there are more interruptions during the run of the program and hence more restart operations. This is reflected in the increasing Restore time. However, as the restore process is fast, the Restore time remains a small fraction of the total time. For STT at a duty cycle of 0.01, the Restore time is only 185.42 µs, compared to the 7,511,610 µs required for completion of the program, considering time while powered off. Because the SHE design does not require presetting of the output MTJ, it requires fewer operations than the STT design, and hence finishes faster. As a result, the SHE design experiences fewer power interruptions during the run of the program.
Thus, SHE has a lower Restore time than STT.

With decreasing duty cycle, energy does not significantly increase. This is because no energy is spent while powered off and MASTER has a highly efficient restart process. Backup energy is the continual writing of architectural state variables, Restore energy is the peripheral cost of column re-activation on restart, and Dead energy is due to the possible re-execution of the previous instruction on restart. Typically, energy will increase with decreasing duty cycle as there are more interruptions during the run of the program, leading to more restart operations. The energy cost of restart depends on where in the program progress was interrupted. The more columns that were active at the time of the interrupt, the higher the Restore energy cost will be. Due to the previously mentioned reduction in time to finish, and consequently in the number of interruptions, SHE has a lower Restore energy than STT. For the STT configuration the Dead energy is typically much larger than the Restore energy. For example, the Dead energy is 4.49 µJ whereas the Restore energy is only 1.36 µJ on the MNIST benchmark at a duty cycle of 0.01. For SHE, Dead and Restore energy are similar: the Restore energy is a comparable 1.12 µJ but the Dead energy reduces to 0.918 µJ for the same configuration. This is because most of the Dead energy goes towards logic operations, for which SHE has a higher efficiency. Backup energy is small relative to both Dead and Restore energy, as it corresponds to writing only a few bits on every cycle. For future STT, the Backup energy is only 0.025 µJ for a duty cycle of 0.01. Backup energy can increase with decreasing duty cycle due to the repeating of backup operations on restart, which happens if the previous backup operation did not finish prior to shut down.

Restore time, Dead energy, and Restore energy are all zero for the case of a continuously powered system.
This is because there are no power outages and, hence, never a need to restart the system or re-perform any potentially unfinished instructions.

As modern MTJs are less efficient than predicted future MTJs, MASTER must idle for longer between instructions when using them. As a result, MASTER has a higher latency than SONIC [24] on the MNIST benchmark and a comparable latency on the HAR benchmark. However, even with modern MTJs, MASTER still provides an energy efficiency advantage, with 1,385.91 µJ for MNIST (relative to 27,000 µJ) and 469.4 µJ for HAR (relative to 12,500 µJ).

We note that the ASIC accelerator in [37] is the most relevant comparison to MASTER at full performance. However, there are no absolute values for latency, energy, or throughput reported in [37]. All results are reported relative to a GPU baseline, for which absolute values are not reported. The only comparison we can make is that it consumes 596 mW of power, not counting external memory. MASTER consumes approximately 15 mW when unconstrained. We believe [37] has a lower latency than MASTER, as ASIC designs typically have a latency advantage over PIM.
Non-volatile processors [43, 42, 39] are uniquely designed for intermittent computing by integrating non-volatile memory near the compute units. Unlike MASTER, these devices have a structure similar to traditional CPUs. The authors of [39] propose a system using a THU1010N non-volatile processor for energy harvesting applications. They describe trade-offs in designing such a system and demonstrate its capability on a number of smaller benchmarks. A non-volatile processor is presented in [57] which features PIM components. There is a controlling CPU that performs logic and control. A few RRAM arrays are used to accelerate computing in neural networks. In this case, the PIM is a sub-component of the system, which also contains more traditional logic circuitry. SONIC [24] uses compressed neural networks to perform inference on a TI-MSP430FR5994 microcontroller. SONIC is powered by a Powercast P2210B energy harvester which collects energy from a 3W Powercaster transmitter. This design can perform MNIST image recognition, Human Activity Recognition, and speech identification with high accuracy. This work is significant as it developed methods to ensure correctness for machine learning applications on conventional hardware for intermittent systems, and was proven experimentally. Capybara [14] is a dynamic power delivery system. In energy harvesting applications, tasks can be capacity-constrained (i.e., need to perform a large computation without being interrupted) or temporally-constrained (i.e., need to run at a specific time). These constraints have conflicting needs. Capacity-constrained tasks prefer a large energy buffer so they can complete a longer task. Temporally-constrained tasks prefer a small buffer that recharges quickly. Capybara uses a re-configurable hardware energy storage mechanism and a software interface that allows the specification of energy needs for different tasks. This gives the system the ability to satisfy the requirements of both kinds of tasks.
While we do not focus on the power delivery system in this work, systems such as Capybara could be used to optimally supply MASTER with power. Hibernus [6], on the other hand, is a system that reactively hibernates and wakes up.

A number of techniques have been developed to enable intermittent computation on more traditional hardware. CleanCut [12] works with LLVM to compile programs with checkpoints. There is a difficult balance when creating checkpoints. Too many, and it wastes valuable energy and time. Too few is worse: the required energy between two checkpoints may be larger than the energy that can be stored in the energy buffer. Thus, the program will get stuck, which is called non-termination. Finding such non-terminating conditions is difficult to do by hand. CleanCut uses a statistical energy model to find potential non-terminating paths. Chinchilla [45] attempts to get the best of both worlds with adaptive checkpointing. When compiling, Chinchilla inserts many possible checkpoints. When running, it keeps a timer and only performs a checkpoint if the timer has expired. If the device fails to checkpoint before a power outage, the timer is set to half its value. This occurs until it has found an appropriate amount of time to go before performing checkpointing. It also opportunistically increases the timer at specified intervals in an attempt to increase performance by reducing the number of checkpoints. Coati [50] developed methods to ensure correctness for concurrent execution and interrupts for intermittent systems. The What's Next intermittent architecture [22] uses approximation to improve performance. Rather than an all-or-nothing approach, What's Next computes approximate results and continually improves the output. If an acceptable output is achieved it will skip to processing the next input. This enables the device to process more inputs as it does not waste time and energy achieving unnecessary accuracy.
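The adaptive checkpoint timer described for Chinchilla can be sketched as follows (an illustrative policy model, not the actual implementation; the starting value of 8.0 is arbitrary):

```python
# Adaptive checkpoint timer: checkpoint only when the timer expires,
# halve the timer after a power cycle in which the device failed to
# checkpoint, and opportunistically grow it to reduce checkpoint overhead.

def adapt_timer(timer, survived_cycle, grow_interval_hit=False):
    if not survived_cycle:       # died before checkpointing: be more eager
        return timer / 2
    if grow_interval_hit:        # periodically try fewer checkpoints
        return timer * 2
    return timer

t = 8.0
t = adapt_timer(t, survived_cycle=False)   # power cut first -> 4.0
t = adapt_timer(t, survived_cycle=False)   # still too long  -> 2.0
t = adapt_timer(t, survived_cycle=True)    # stable          -> 2.0
t = adapt_timer(t, survived_cycle=True, grow_interval_hit=True)  # -> 4.0
assert t == 4.0
```

MASTER sidesteps this tuning problem entirely: because every operation commits its result to non-volatile memory, the effective checkpoint interval is a single instruction.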
Alternatively, if there is sufficient energy available, it can continue refining the output. These works have developed sophisticated techniques to enable more traditional computation substrates to achieve accuracy and performance on intermittent systems. With MASTER, we are able to significantly simplify our strategy, as the substrate has a natural immunity to power interruptions.

The EH model [54] facilitates early design space exploration for energy harvesting architectures. It helps find a good balance that achieves minimal overhead while allowing maximal forward progress. As noted by the authors of [54], energy harvesting systems can generally be divided into two types: multi-backup, which perform many backups between power outages, and single-backup, which only save state once before a power outage. Multi-backup systems include Mementos [49], DINO [40], Chain [11], Alpaca [44], Mayfly [29], Ratchet [61], and Clank [30]. Single-backup systems include Hibernus [5], QuickRecall [32], and many others [3, 4, 7, 38, 41]. According to this categorization, MASTER fits under a multi-backup system, as we are constantly saving the architectural state.

PIM has been studied for non-volatile memories with Pinatubo [36], for DRAM with Ambit [56], and for SRAM with Neural Cache [21]. These technologies are meant to be integrated into the memory hierarchy of traditional CPUs and have not been considered for energy harvesting applications. Ambit and Neural Cache are not suitable for energy harvesting as they are volatile technologies. Pinatubo could be adapted and used similarly to CRAM in MASTER. However, Pinatubo uses logic external to the memory array for some operations. This adds complexity, as these circuits would need to be protected against errors in intermittent computing. Additionally, Pinatubo uses sense amplifiers to perform computation, which is less energy efficient than the logic operations in CRAM.

A number of RRAM PIM technologies also exist [64, 63, 59, 58].
However, the RRAM array is used as an accelerator, as a sub-component of the system. Hence, there is much additional circuitry and logic that operates outside the memory. This significantly increases the difficulty of adapting to intermittent processing. Additionally, many RRAM accelerators rely heavily on ADC units, which have a significant area and energy overhead.
In this paper we presented MASTER, a machine learning accelerator in (non-volatile) memory for energy harvesting applications. The requirements for energy harvesting applications are extreme energy efficiency, efficient shut down and restart procedures, and correctness during intermittent execution. MASTER provides all of these by combining highly energy efficient logic operations with simple and effective shut down and restart procedures. The non-volatility combined with processing in memory provides a natural progress-saving mechanism which demands very little overhead. By simulation, we demonstrated that such a device would provide significant latency and energy efficiency advantages over state of the art approaches, and is a promising candidate to bring machine learning to new domains.
References
Esann , 2013.[3] Faycal Ait Aouda, Kevin Marquet, and Guillaume Salagnac. Incremental checkpointing of program state to nvram fortransiently-powered systems. In , pages 1–4. IEEE, 2014.[4] Domenico Balsamo, Anup Das, Alex S Weddell, Davide Brunelli, Bashir M Al-Hashimi, Geoff V Merrett, and LucaBenini. Graceful performance modulation for power-neutral transient computing systems.
IEEE Transactions onComputer-Aided Design of Integrated Circuits and Systems , 35(5):738–749, 2016.[5] Domenico Balsamo, Alex S Weddell, Anup Das, Alberto Rodriguez Arreola, Davide Brunelli, Bashir M Al-Hashimi,Geoff V Merrett, and Luca Benini. Hibernus++: a self-calibrating and adaptive system for transiently-poweredembedded devices.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems , 35(12):1968–1980, 2016. 186] Domenico Balsamo, Alex S Weddell, Geoff V Merrett, Bashir M Al-Hashimi, Davide Brunelli, and Luca Benini. Hi-bernus: Sustaining computation during intermittent supply for energy-harvesting systems.
IEEE Embedded SystemsLetters , 7(1):15–18, 2014.[7] Gautier Berthou, Tristan Delizy, Kevin Marquet, Tanguy Risset, and Guillaume Salagnac. Peripheral state persistencefor transiently-powered systems. In , pages 1–6. IEEE, 2017.[8] Anantha P Chandrakasan, Denis C Daly, Joyce Kwong, and Yogesh K Ramadass. Next generation micro-powersystems. In , pages 2–5. IEEE, 2008.[9] Chih-Chung Chang and Chih-Jen Lin. Libsvm: A library for support vector machines.
ACM transactions on intelligentsystems and technology (TIST) , 2(3):27, 2011.[10] Zamshed Chowdhury, Jonathan D Harms, S Karen Khatamifard, Masoud Zabihi, Yang Lv, Andrew P Lyle, Sachin SSapatnekar, Ulya R Karpuzcu, and Jian-Ping Wang. Efficient in-memory processing using spintronics.
IEEE ComputerArchitecture Letters , 17(1):42–46, 2017.[11] Alexei Colin and Brandon Lucia. Chain: tasks and channels for reliable intermittent programs. In
ACM SIGPLANNotices , volume 51, pages 514–530. ACM, 2016.[12] Alexei Colin and Brandon Lucia. Termination checking and task decomposition for task-based intermittent programs.In
Proceedings of the 27th International Conference on Compiler Construction , pages 116–127. ACM, 2018.[13] Alexei Colin, Emily Ruppel, and Brandon Lucia. A reconfigurable energy storage architecture for energy-harvestingdevices. In
ACM SIGPLAN Notices , volume 53, pages 767–781. ACM, 2018.[14] Alexei Colin, Emily Ruppel, and Brandon Lucia. A reconfigurable energy storage architecture for energy-harvestingdevices. In
ACM SIGPLAN Notices , volume 53, pages 767–781. ACM, 2018.[15] Francesco Conti, Pasquale Davide Schiavone, and Luca Benini. Xnor neural engine: A hardware accelerator ip for21.6-fj/op binary neural network inference.
IEEE Transactions on Computer-Aided Design of Integrated Circuits andSystems , 37(11):2940–2951, 2018.[16] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks:Training deep neural networks with weights and activations constrained to+ 1 or-1. arXiv preprint arXiv:1602.02830 ,2016.[17] Zhitao Diao, Alex Panchula, Yunfei Ding, Mahendra Pakala, Shengyuan Wang, Zhanjie Li, Dmytro Apalkov,Hideyasu Nagai, Alexander Driskill-Smith, Lien-Chang Wang, et al. Spin transfer switching in dual mgo magnetictunnel junctions.
Applied Physics Letters , 90(13):132508, 2007.[18] Xiangyu Dong, Xiaoxia Wu, Guangyu Sun, Yuan Xie, Helen Li, and Yiran Chen. Circuit and microarchitectureevaluation of 3d stacking magnetic ram (mram) as a universal memory replacement. In , pages 554–559. IEEE, 2008.[19] Xiangyu Dong, Cong Xu, Yuan Xie, and Norman P Jouppi. Nvsim: A circuit-level performance, energy, and areamodel for emerging nonvolatile memory.
IEEE Transactions on Computer-Aided Design of Integrated Circuits andSystems , 31(7):994–1007, 2012.[20] Philipp D¨urrenfeld, Felicitas Gerhard, Jonathan Chico, Randy K Dumas, Mojtaba Ranjbar, Anders Bergman, LarsBergqvist, Anna Delin, Charles Gould, Laurens W Molenkamp, et al. Tunable damping, saturation magnetization,and exchange stiffness of half-heusler nimnsb thin films.
Physical Review B , 92(21):214424, 2015.1921] Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw,and Reetuparna Das. Neural cache: Bit-serial in-cache acceleration of deep neural networks. In
Proceedings of the45th Annual International Symposium on Computer Architecture , pages 383–396. IEEE Press, 2018.[22] Karthik Ganesan, Joshua San Miguel, and Natalie Enright Jerger. The what’s next intermittent computing architec-ture. In , pages 211–223.IEEE, 2019.[23] Graham Gobieski, Nathan Beckmann, and Brandon Lucia. Intermittent deep neural network inference, 2018.[24] Graham Gobieski, Brandon Lucia, and Nathan Beckmann. Intelligence beyond the edge: Inference on intermittentembedded systems. In
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 199–213. ACM, 2019.
[25] Hayit Greenspan, Bram Van Ginneken, and Ronald M Summers. Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Transactions on Medical Imaging, 35(5):1153–1159, 2016.
[26] Josiah Hester, Travis Peters, Tianlong Yun, Ronald Peterson, Joseph Skinner, Bhargav Golla, Kevin Storer, Steven Hearndon, Kevin Freeman, Sarah Lord, et al. Amulet: An energy-efficient, multi-application wearable platform. In Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems CD-ROM, pages 216–229. ACM, 2016.
[27] Josiah Hester, Lanny Sitanayah, and Jacob Sorber. Tragedy of the coulombs: Federating energy storage for tiny, intermittently-powered sensors. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, pages 5–16. ACM, 2015.
[28] Josiah Hester and Jacob Sorber. Flicker: Rapid prototyping for the batteryless internet-of-things. In Proceedings of the 15th ACM Conference on Embedded Network Sensor Systems, page 19. ACM, 2017.
[29] Josiah Hester, Kevin Storer, and Jacob Sorber. Timely execution on intermittently powered batteryless sensors. In Proceedings of the 15th ACM Conference on Embedded Network Sensor Systems, page 17. ACM, 2017.
[30] Matthew Hicks. Clank: Architectural support for intermittent computation. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 228–240. IEEE, 2017.
[31] G Hu, JH Lee, JJ Nowak, JZ Sun, J Harms, A Annunziata, S Brown, W Chen, YH Kim, G Lauer, et al. STT-MRAM with double magnetic tunnel junctions. In 2015 IEEE International Electron Devices Meeting (IEDM), pages 26–3. IEEE, 2015.
[32] Hrishikesh Jayakumar, Arnab Raha, and Vijay Raghunathan. QuickRecall: A low overhead HW/SW approach for enabling computations across power cycles in transiently powered computers. In 2014 27th International Conference on VLSI Design, pages 330–335. IEEE, 2014.
[33] Sangkil Kim, Rushi Vyas, Jo Bito, Kyriaki Niotaki, Ana Collado, Apostolos Georgiadis, and Manos M Tentzeris. Ambient RF energy-harvesting technologies for self-sustainable standalone wireless sensor platforms.
Proceedings of the IEEE, 102(11):1649–1666, 2014.
[34] Ron Kohavi. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In KDD, volume 96, pages 202–207. Citeseer, 1996.
[35] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[36] Shuangchen Li, Cong Xu, Qiaosha Zou, Jishen Zhao, Yu Lu, and Yuan Xie. Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories. In Proceedings of the 53rd Annual Design Automation Conference, page 173. ACM, 2016.
[37] Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Teman, Xiaobing Feng, Xuehai Zhou, and Yunji Chen. PuDianNao: A polyvalent machine learning accelerator. In ACM SIGARCH Computer Architecture News, volume 43, pages 369–381. ACM, 2015.
[38] Qingrui Liu and Changhee Jung. Lightweight hardware support for transparent consistency-aware checkpointing in intermittent energy-harvesting systems, pages 1–6. IEEE, 2016.
[39] Yongpan Liu, Zewei Li, Hehe Li, Yiqun Wang, Xueqing Li, Kaisheng Ma, Shuangchen Li, Meng-Fan Chang, John Sampson, Yuan Xie, et al. Ambient energy harvesting nonvolatile processors: from circuit to system. In Proceedings of the 52nd Annual Design Automation Conference, page 150. ACM, 2015.
[40] Brandon Lucia and Benjamin Ransford. A simpler, safer programming and execution model for intermittent systems. In ACM SIGPLAN Notices, volume 50, pages 575–585. ACM, 2015.
[41] Giedrius Lukosevicius, Alberto Rodriguez Arreola, and Alex S Weddell. Using sleep states to maximize the active time of transient computing systems. In Proceedings of the Fifth ACM International Workshop on Energy Harvesting and Energy-Neutral Sensing Systems, pages 31–36. ACM, 2017.
[42] Kaisheng Ma, Xueqing Li, Jinyang Li, Yongpan Liu, Yuan Xie, Jack Sampson, Mahmut Taylan Kandemir, and Vijaykrishnan Narayanan. Incidental computing on IoT nonvolatile processors. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 204–218. IEEE, 2017.
[43] Kaisheng Ma, Yang Zheng, Shuangchen Li, Karthik Swaminathan, Xueqing Li, Yongpan Liu, Jack Sampson, Yuan Xie, and Vijaykrishnan Narayanan. Architecture exploration for ambient energy harvesting nonvolatile processors. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pages 526–537. IEEE, 2015.
[44] Kiwan Maeng, Alexei Colin, and Brandon Lucia. Alpaca: intermittent execution without checkpoints. Proceedings of the ACM on Programming Languages, 1(OOPSLA):96, 2017.
[45] Kiwan Maeng and Brandon Lucia. Adaptive dynamic checkpointing for safe efficient intermittent computing. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 129–144, 2018.
[46] Milos Manic, Kasun Amarasinghe, Juan J Rodriguez-Andina, and Craig Rieger. Intelligent buildings of the future: Cyberaware, deep learning powered, and human interacting. IEEE Industrial Electronics Magazine, 10(4):32–49, 2016.
[47] S Mizukami, D Watanabe, M Oogane, Y Ando, Y Miura, M Shirai, and T Miyazaki. Low damping constant for Co2FeAl Heusler alloy films and its correlation with density of states.
Journal of Applied Physics, 105(7):07D306, 2009.
[48] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2016.
[49] Benjamin Ransford, Jacob Sorber, and Kevin Fu. Mementos: system support for long-running computation on RFID-scale devices. In ACM SIGARCH Computer Architecture News, volume 39, pages 159–170. ACM, 2011.
[50] Emily Ruppel and Brandon Lucia. Transactional concurrency control for intermittent, energy-harvesting computing systems. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 1085–1100. ACM, 2019.
[51] Daisuke Saida, Saori Kashiwada, Megumi Yakabe, Tadaomi Daibou, Naoki Hase, Miyoshi Fukumoto, Shinji Miwa, Yoshishige Suzuki, Hiroki Noguchi, Shinobu Fujita, et al. Sub-3 ns pulse with sub-100 µA switching of 1X–2X nm perpendicular MTJ for high-performance embedded STT-MRAM towards sub-20 nm CMOS. In 2016 IEEE Symposium on VLSI Technology, pages 1–2. IEEE, 2016.
[52] Daisuke Saida, Saori Kashiwada, Megumi Yakabe, Tadaomi Daibou, Naoki Hase, Miyoshi Fukumoto, Shinji Miwa, Yoshishige Suzuki, Hiroki Noguchi, Shinobu Fujita, et al. Sub-3 ns pulse with sub-100 µA switching of 1X–2X nm perpendicular MTJ for high-performance embedded STT-MRAM towards sub-20 nm CMOS. In 2016 IEEE Symposium on VLSI Technology, pages 1–2. IEEE, 2016.
[53] Alanson P Sample, Daniel J Yeager, Pauline S Powledge, Alexander V Mamishev, and Joshua R Smith. Design of an RFID-based battery-free programmable sensing platform. IEEE Transactions on Instrumentation and Measurement, 57(11):2608–2615, 2008.
[54] Joshua San Miguel, Karthik Ganesan, Mario Badr, Chunqiu Xia, Rose Li, Hsuan Hsiao, and Natalie Enright Jerger. The EH model: Early design space exploration of intermittent processor architectures. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 600–612. IEEE, 2018.
[55] H Sato, ECI Enobio, M Yamanouchi, S Ikeda, S Fukami, S Kanai, F Matsukura, and H Ohno. Properties of magnetic tunnel junctions with a MgO/CoFeB/Ta/CoFeB/MgO recording structure down to junction diameter of 11 nm.
Applied Physics Letters, 105(6):062403, 2014.
[56] Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A Kozuch, Onur Mutlu, Phillip B Gibbons, and Todd C Mowry. Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 273–287. ACM, 2017.
[57] Fang Su, Wei-Hao Chen, Lixue Xia, Chieh-Pu Lo, Tianqi Tang, Zhibo Wang, Kuo-Hsiang Hsu, Ming Cheng, Jun-Yi Li, Yuan Xie, et al. A 462GOPS/J RRAM-based nonvolatile intelligent processor for energy harvesting IoE system featuring nonvolatile logics and processing-in-memory. In 2017 Symposium on VLSI Technology, pages T260–T261. IEEE, 2017.
[58] Xiaoyu Sun, Xiaochen Peng, Pai-Yu Chen, Rui Liu, Jae-sun Seo, and Shimeng Yu. Fully parallel RRAM synaptic array for implementing binary neural network with (+1, −1) weights and (+1, 0) neurons. In Proceedings of the 23rd Asia and South Pacific Design Automation Conference, pages 574–579. IEEE Press, 2018.
[59] Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, and Huazhong Yang. Binary convolutional neural network on RRAM. In Design Automation Conference (ASP-DAC), 2017 22nd Asia and South Pacific, pages 782–787. IEEE, 2017.
[60] https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones, 2019. Accessed: 2019-06-02.
[61] Joel Van Der Woude and Matthew Hicks. Intermittent computation without hardware support or programmer intervention. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 17–32, 2016.
[62] Jian-Ping Wang, Mahdi Jamali, Angeline Klemm Smith, and Zhengyang Zhao. Magnetic tunnel junction based integrated logics and computational circuits. Nanomagnetic and Spintronic Devices for Energy-Efficient Memory and Computing, page 133, 2016.
[63] Lixue Xia, Tianqi Tang, Wenqin Huangfu, Ming Cheng, Xiling Yin, Boxun Li, Yu Wang, and Huazhong Yang. Switched by input: power efficient structure for RRAM-based convolutional neural network. In Proceedings of the 53rd Annual Design Automation Conference, page 125. ACM, 2016.
[64] Shimeng Yu, Zhiwei Li, Pai-Yu Chen, Huaqiang Wu, Bin Gao, Deli Wang, Wei Wu, and He Qian. Binary neural network with 16 Mb RRAM macro chip for classification and online training. In Electron Devices Meeting (IEDM), 2016 IEEE International, pages 16–2. IEEE, 2016.
[65] Masoud Zabihi, Zhengyang Zhao, DC Mahendra, Zamshed I Chowdhury, Salonik Resch, Thomas Peterson, Ulya R Karpuzcu, Jian-Ping Wang, and Sachin S Sapatnekar. Using spin-Hall MTJs to build an energy-efficient in-memory computation platform. In 20th International Symposium on Quality Electronic Design (ISQED)