DNN-Life: An Energy-Efficient Aging Mitigation Framework for Improving the Lifetime of On-Chip Weight Memories in Deep Neural Network Hardware Architectures
To appear at Design, Automation, and Test in Europe (DATE 2021)
Muhammad Abdullah Hanif, Muhammad Shafique
Faculty of Informatics, Technische Universität Wien (TU Wien), Vienna, Austria
Division of Engineering, New York University Abu Dhabi (NYUAD), Abu Dhabi, United Arab Emirates
[email protected], muhammad.shafi[email protected]
Abstract: Negative Bias Temperature Instability (NBTI)-induced aging is one of the critical reliability threats in nano-scale devices. This paper makes the first attempt to study NBTI aging in the on-chip weight memories of deep neural network (DNN) hardware accelerators subjected to complex DNN workloads. We propose DNN-Life, a specialized aging analysis and mitigation framework for DNNs, which jointly exploits hardware- and software-level knowledge to improve the lifetime of a DNN weight memory with reduced energy overhead. At the software level, we analyze the effects of different DNN quantization methods on the distribution of the bits of weight values. Based on the insights gained from this analysis, we propose a micro-architecture that employs low-cost memory-write (and read) transducers to achieve an optimal duty-cycle at run time in the weight memory cells, thereby balancing their aging. As a result, our DNN-Life framework enables efficient aging mitigation of the weight memory of the given DNN hardware at minimal energy overhead during the inference process.
I. INTRODUCTION
DNN accelerators have already become an essential part of various machine learning systems [1] [2]. DNNs usually require a large number of parameters to offer high accuracy, which comes at the cost of high memory requirements; see Fig. 1a. Dedicated memory hierarchies are designed to trade off between the low-cost storage offered by off-chip DRAMs and the energy-/performance-efficient access offered by on-chip SRAMs [1]; see Fig. 1b for access energy. This has led to an increasing trend towards the use of larger on-chip memory in state-of-the-art DNN accelerators [3] [4], with recent wafer-scale chips having up to 18 GB of on-chip memory [5]. However, due to continuous technology scaling, on-chip SRAM-based memories are becoming increasingly vulnerable to different reliability threats, for example, soft errors and aging [6] [7] [8]. Studies have shown that even a single fault in the weights of critical neurons can result in significant degradation of application-level accuracy [9]. State-of-the-art works have focused on analyzing and mitigating the effects of faults in DNN accelerators w.r.t. DNN accuracy [10].
However, to the best of our knowledge, no prior work has analyzed and optimized the aging of the on-chip weight memories of DNN accelerators, especially when considering the diverse dataflows of different DNNs and the impact of different types of quantization on the weight distributions.
Aging due to NBTI:
In PMOS transistors, when a negative gate-to-source voltage is applied, it can break down the Si-H bonds at the oxide interface, thereby causing a gradual increase in the threshold voltage (V_th) over the device lifetime, which results in poor drive current and a reduction in the noise margin [11]. To overcome this V_th shift, the operating frequency of the device has to be reduced by more than 20% over its entire lifetime [12]. However, due to strict performance and energy constraints (specifically for embedded applications), the V_th shift cannot be addressed just by design-time delay margins or adaptive operating frequency adjustments [13], as this leads to a significant loss in the system’s performance and energy efficiency. Therefore, in traditional computing systems, alternate opportunities have to be exploited to overcome this challenge [12]. One such opportunity lies in the fact that the NBTI aging phenomenon is partially reversed by removing the stress.
(Footnote: A similar phenomenon called PBTI happens in NMOS transistors, though NBTI has been considered relatively more serious compared to PBTI [6].)
Fig. 1: (a) Accuracy (Top-1 and Top-5) and size comparison of a few state-of-the-art DNNs (AlexNet, GoogleNet, VGG-16, ResNet-152); (b) access energy comparison of SRAM with DRAM (data source: [1]).
Fig. 2: (a) A 6T-SRAM cell, in which the PMOS transistors are mainly affected by NBTI aging; and (b) its SNM degradation after 7 years as a function of the percentage of time that the cell stores zero [15]. The NBTI effect is minimal at 50%, because the NBTI stress is then equally distributed between the two PMOS transistors of the cell.
NBTI Aging of On-chip Memories:
On-chip memories are typically built using 6T-SRAM cells to achieve high area and power efficiency. A 6T-cell is composed of two inverters coupled with two access transistors (see Fig. 2a). The inverters store complementary values to store a single bit. Each inverter has a PMOS transistor and an NMOS transistor. Depending on whether the cell is storing ‘0’ or ‘1’, one of the PMOS transistors is always under stress, namely the one that is on. As the aging of a cell is defined by its most-aged transistor, the lowest aging is achieved when both PMOS transistors receive, on average, the same amount of stress over the entire lifetime of the device, i.e., the percentage of the entire lifetime for which the cell stores a ‘1’ (duty-cycle) is 50%, as shown in Fig. 2b. Note that NBTI aging strongly depends on average long-term stress and weakly on short-term statistics [14].
Therefore, the key challenge in aging mitigation of on-chip memories is to balance their duty-cycle over the entire lifetime without affecting system-level performance.
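The relation between the duty-cycle of a cell and the stress on its most-stressed PMOS transistor can be illustrated with a short Python sketch. It is only an illustration under simplifying assumptions that are not part of the paper: every stored bit is held for an equal amount of time, and stress is counted simply as the fraction of time a transistor is on. The function names are hypothetical.

```python
def duty_cycle(stored_bits):
    """Fraction of time the cell stores '1', assuming each bit is held equally long."""
    return sum(stored_bits) / len(stored_bits)


def worst_pmos_stress(duty):
    """Stress fraction of the more-stressed PMOS transistor.

    One PMOS is stressed while the cell stores '1' and the other while it
    stores '0', so the two stress fractions are `duty` and `1 - duty`.
    The cell ages according to the larger of the two.
    """
    return max(duty, 1.0 - duty)


balanced = worst_pmos_stress(duty_cycle([1, 0, 1, 0]))  # both PMOS stressed half the time
biased = worst_pmos_stress(duty_cycle([1, 1, 1, 0]))    # one PMOS stressed 75% of the time
```

Under this simplified view, any deviation of the duty-cycle from 50% in either direction increases the stress on one of the two transistors, which is why the balanced case is the target.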
State-of-the-art techniques and their limitations:
Various techniques have been proposed at circuit level and at architecture level. At circuit level, the structure of SRAM cells is modified to reduce the aging rate [16] [13]. For example, Ricketts et al. [16] proposed an asymmetric SRAM structure for workloads having a biased bit distribution, but due to their high data dependence, such structures are applicable only in specific scenarios. Recovery boosting through a dedicated recovery-accelerating circuit is another method for enhancing the lifetime of SRAM cells [17], but it increases power/energy consumption due to additional transistors per cell, and therefore cannot be used in energy-constrained large-sized memories [18]. At architecture level, periodic inversion of data is used to reduce the aging rate of on-chip caches [19]. However, it cannot guarantee an optimal duty-cycle, specifically in cases where the same data is periodically reused, e.g., in DNN-based systems where the same set of parameters is reused for processing each input sample. Calimera et al. [20] improved recovery of unutilized portions of memory, but at the high area and energy cost of expensive online monitoring. The technique also suffers from serious performance degradation in dynamic workload scenarios. Another set of techniques uses bit rotations to address NBTI aging in registers [15], but they work only in cases where the overall distribution of bits is relatively balanced. Moreover, they use barrel shifters that incur high area and power overheads. The work in [21] proposed a configurable micro-architecture for reducing the aging rate of video memories, but it only works for streaming video applications.
In summary, the state-of-the-art techniques either incur high overheads in terms of area and power/energy or rely on certain specific workloads, and cannot be employed in DNN accelerators due to the unique properties of DNN hardware and workloads, as we will illustrate later in this paper.
Additional Challenges from the Deep Learning Perspective:
The dataflow (i.e., computation scheduling) for a given DNN on a specific hardware is defined as per the DNN architecture and the hardware implementation to achieve maximum energy-/performance-efficiency. Altering the dataflow to balance the duty-cycle in on-chip SRAM cells can result in significant degradation of system-level efficiency.
Therefore, an aging mitigation technique that does not require any alteration to the dataflow or the mapping of the data in on-chip SRAM is desired.
Our Novel Contributions:
Towards this, we propose DNN-Life, an aging analysis and mitigation framework for on-chip memories of DNN hardware (see Fig. 3). Our framework employs two key features:
1) Aging Analysis [Section III]: We analyze the impact of using different data representation formats and quantization methods for the weights of a DNN on the probability distribution of weight-bits, as this can provide useful insights for designing an effective and low-overhead aging mitigation technique.
2) Aging Mitigation [Section IV]: We propose a scheme and supporting micro-architecture for mitigating the NBTI-aging of the 6T-SRAM-based on-chip weight memory of DNN accelerators with minimal energy overhead. Notably, our scheme does not require any alteration to the dataflow of DNN inference or to the on-chip data mapping, and thereby maintains the energy and performance benefits of the system. The micro-architectural extensions for aging mitigation are integrated in the DNN accelerator before and after the on-chip weight memory in the form of aging-aware write and read transducers, as shown in Fig. 4.
Fig. 3: Overview of the design-time and run-time steps involved in our DNN-Life framework. Our novel contributions are highlighted in colored boxes.
II. OVERVIEW OF OUR DNN-LIFE FRAMEWORK
In this work, we propose DNN-Life, a novel aging analysis and mitigation framework for weight memories of DNN hardware accelerators. It employs a low-cost data encoding scheme that accounts for diverse DNN workloads and adapts over time to balance the duty-cycle in each on-chip weight memory cell, thereby alleviating the NBTI-aging effects. Towards this, the two key features of our framework are:
1) Analysis: We analyze the probability distribution of weight-bits of different pre-trained DNNs to find key insights that help in developing a low-cost aging-mitigation scheme. To capture the variations in the distribution across number representation formats and across the methods used to transform the weights to those formats, we consider different number representation formats and different commonly used conversion methods. The detailed analysis and insights are presented in Section III.
2) Architecture: Based on the gathered insights, we design a data encoding module and an aging controller. The encoder is responsible for encoding the weights before writing the values to the weight memory, and the aging controller is responsible for generating the encoding information required to encode the data such that the duty-cycle is balanced. The encoding information is then stored to be used by the corresponding decoder module. The data encoder is deployed inside the DNN hardware accelerator right before the weight memory, and the corresponding decoder is installed after the memory, to decode the weights before passing them on for computations. The integration of the encoder and the decoder modules in a DNN accelerator is illustrated in Fig. 4a. The details of the micro-architecture are presented in Section IV.
A. DNN Hardware Architecture
Our DNN hardware architecture is based on well-established DNN accelerator models, such as [22] for dense DNNs. Our accelerator is composed of an Activation Buffer, a Weight Buffer, a Processing Array, and an Accumulation Unit; see Fig. 4a. Our proposed weight-memory aging mitigation modules integrated in the architecture are also shown in the figure (see details in Section IV). The activation and weight buffers provide intermediate storage for the activations and weights, respectively, to reduce the costly off-chip memory accesses. The buffers provide data to the processing array for performing the computations. For this work, we assume a memory hierarchy similar to Bit-Tactical [22], DaDianNao [3] and TPU [4], according to which: 1) the activation buffer is large enough to store the activations of a single layer of a DNN; 2) the activation memory can provide N activation values to the processing array at a time; and 3) the weight memory can provide f × N weights to the processing array simultaneously. The processing array (see Fig. 4b) is composed of f Processing Elements (PEs) that share the activations, and can therefore perform N multiplications for f different filters at the same time. Each PE has an adder tree to compute the sum of the multiplications. The computed sum is passed to the accumulation unit, where it is added to the corresponding partial sums to generate the output activation value. Note that, as the filters can be significantly large, the computation of each output activation can take several cycles, depending on the filter size.
B. Dataflow in the DNN Accelerator
To perform the computations of a DNN layer using the above accelerator, the weights have to be partitioned into blocks that can be accommodated in the on-chip memory. The goal of partitioning is to maximize the use of available PEs. The input/output feature maps and the filters/neurons are all divided into so-called tiles, depending on the available on-chip storage for the corresponding data type. Works like SmartShuttle [23] provide methods to find an optimal tiling configuration and computation scheduling policy for a layer of a DNN for a given memory hierarchy.
Fig. 5 illustrates the policy that we employ for partitioning the filters of a CONV layer. Note that we support the well-established tiling technique so that we can demonstrate that our technique can benefit a wide range of existing DNN hardware accelerators. The figure also illustrates the sequence in which the blocks are moved to the on-chip weight memory and the corresponding computations are scheduled. The filters are first divided into sets, where each set contains f filters. Note that f is mainly defined based on the number of filters that the hardware accelerator can process in parallel. Afterwards, a chunk of data (grey boxes in Fig. 5) from a set is selected to be moved to the on-chip memory. The selected chunk contains a block of data of size r × c × ch from the same location of each filter in the set. The sequence in which the grey boxes are traversed in the filters defines the rest of the dataflow. The used sequence is shown as steps in Fig. 5.
III. ANALYSIS OF THE DISTRIBUTION OF WEIGHT-BITS FOR DIFFERENT DNNS & THEIR IMPACT ON DUTY-CYCLE
Before presenting the design of the proposed aging mitigation modules in Section IV, here we first present an analysis which highlights the rationale behind the proposed design.
Fig. 4: (a) Architecture of the baseline DNN accelerator. The highlighted boxes, i.e., Write Data Encoder (WDE), Read Data Decoder (RDD) and Aging Controller, are the proposed modules for mitigating NBTI aging of the weight memory. (b) A detailed view of the processing array and the accumulation unit.
Fig. 5 (caption not recovered): partitioning of the filters of a CONV layer into filter sets and blocks of size r × c × ch, with the traversal order of the blocks shown as steps.
For this analysis, we consider the AlexNet and the VGG-16 networks, trained on the ImageNet dataset. As different data representations for weights, we consider the 32-bit floating point representation (IEEE 754 standard) and the 8-bit integer format achieved using range-linear symmetric and asymmetric quantization techniques [24]. Fig. 6 illustrates the ratio of observing a ‘1’ to the total number of observations (which corresponds to the probability of observing a ‘1’) at each bit-location of a word for all three data representation formats for both networks.
By analyzing the distributions, the following key observations are made:
1) The probability of getting a ‘1’ value at a particular bit-location of a randomly selected weight depends on the network, the data representation format, and the method used to transform the data to the particular data representation format. For example, the probability of getting a ‘1’ at a particular bit-location in the symmetric 8-bit representation is almost the same across bit-locations within a network for both the considered DNNs; however, it varies across networks. Similarly, the probability of getting a ‘1’ at the lower bit-locations in the 32-bit floating-point representation is around 0.5; however, the distribution of bits at higher bit-locations varies across bit-locations as well as across DNNs.
2) Representation of weights using a specific format cannot guarantee a distribution that offers 0.5 probability at each bit-location, i.e., a distribution that can potentially lead to a balanced duty-cycle. For example, out of all the studied cases, only the distribution of the AlexNet when represented using the 8-bit integer format achieved using symmetric range-linear quantization offers close to 0.5 probability for all the bit-locations.
3) The average probability of getting a ‘1’ across bit-locations in a specific format is also not guaranteed to be equal to 0.5. For example, see the distributions of the 8-bit asymmetrically quantized DNNs. Therefore, barrel shifter-based balancing techniques would not produce desirable results in such cases.
Fig. 6: Distribution of bits of weights of different DNNs when represented in different data representation formats (32-bit floating point, 8-bit symmetric, and 8-bit asymmetric): probability of getting a ‘1’ at a specific bit position of a weight in (a) AlexNet and (b) VGG-16. Symmetric and asymmetric indicate which post-training quantization method is used to transform the data for the corresponding distribution.
B. A Probabilistic Model-based Analysis for Aging of 6T-SRAM On-chip Weight Memory of a DNN Accelerator
In the following, we develop a probabilistic model to analyze the effectiveness of different aging mitigation techniques.
1) Probabilistic Model:
Assume the on-chip memory of a given DNN accelerator is composed of I × J cells. For mapping the weights of a DNN onto the memory, we assume: (a) the same dataflow as presented in Fig. 5; (b) each block of weights is kept in the on-chip memory for an equal amount of time, and it is fetched only once during a single inference (similar to the dataflow for the DNN accelerator proposed in [22]); (c) each block of data mapped onto the on-chip memory fits perfectly into it. Based on the aforementioned conditions and the given DNN size, we can divide the DNN into K blocks, which translates to K data mappings onto the on-chip weight memory. Now, if the same DNN is used repeatedly for inference with the same dataflow, a single on-chip memory cell is mapped with only K different bits. If the probability of getting a ‘1’ for all the bits is given by ρ, the probability of getting a duty-cycle less than or equal to b/K, or greater than or equal to 1 − b/K, can be computed using the following equation, except when b/K = 0.5, where the probability is 1.

$$P_{b/K} = \sum_{i=0}^{b} \binom{K}{i} \rho^{i} (1-\rho)^{K-i} + \sum_{i=K-b}^{K} \binom{K}{i} \rho^{i} (1-\rho)^{K-i} \qquad (1)$$

Here, b is an arbitrary variable with the range from 0 to ⌊K/2⌋. Note that we combine (i) the cases in which the duty-cycle is less than or equal to b/K and (ii) the cases in which the duty-cycle is greater than or equal to 1 − b/K, because in a symmetric 6T-SRAM cell both cases cause the same level of stress in one of the two PMOS transistors.

Fig. 7: Probability of occurrence of duty-cycle ≤ b/K or duty-cycle ≥ 1 − b/K when (a) K = 20, and (b) K = 160. In (a), P_{b/K} > 0.1 for b/K = 0.3; in (b), the probability decreases as K increases.

Assuming the above computed probability to be the same for all the cells of the on-chip memory, the probability of at least n cells (out of I × J) experiencing a duty-cycle less than or equal to b/K, or greater than or equal to 1 − b/K, can be computed using the following equation.

$$P_{n} = \sum_{i=n}^{I \times J} \binom{I \times J}{i} P_{b/K}^{\,i} \, (1-P_{b/K})^{I \times J - i} \qquad (2)$$
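Eq. 1 and Eq. 2 are binomial tail sums and can be evaluated directly. The following Python sketch is one way to do so; the function names are hypothetical, and evaluating Eq. 2 in log-space (via `math.lgamma`) is an implementation choice made here to avoid overflow for memories with thousands of cells, not something prescribed by the model.

```python
from math import comb, exp, lgamma, log


def p_duty_outside(b, K, rho):
    """Eq. 1: probability that a cell's duty-cycle is <= b/K or >= 1 - b/K."""
    if 2 * b >= K:
        return 1.0  # b/K = 0.5: every possible duty-cycle qualifies

    def pmf(i):
        return comb(K, i) * rho ** i * (1.0 - rho) ** (K - i)

    lower_tail = sum(pmf(i) for i in range(0, b + 1))
    upper_tail = sum(pmf(i) for i in range(K - b, K + 1))
    return lower_tail + upper_tail


def p_at_least_n_cells(n, cells, p_cell):
    """Eq. 2: probability that at least n of `cells` cells fall in that range."""
    if p_cell <= 0.0:
        return 1.0 if n <= 0 else 0.0
    if p_cell >= 1.0:
        return 1.0 if n <= cells else 0.0

    def log_pmf(i):
        log_binom = lgamma(cells + 1) - lgamma(i + 1) - lgamma(cells - i + 1)
        return log_binom + i * log(p_cell) + (cells - i) * log(1.0 - p_cell)

    return sum(exp(log_pmf(i)) for i in range(max(n, 0), cells + 1))


# Illustration: K = 20 blocks, balanced bits (rho = 0.5), b/K = 0.3.
p_cell_k20 = p_duty_outside(6, 20, 0.5)
# Same ratio b/K = 0.3 with eight times as many distinct bits per cell.
p_cell_k160 = p_duty_outside(48, 160, 0.5)
```

With these helpers, the effect of increasing K at a fixed b/K can be checked by comparing `p_cell_k20` and `p_cell_k160`, and `p_at_least_n_cells` extends the per-cell result to a whole memory of I × J cells.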
2) An Example Case-Study:
Let us consider a scenario where K = 20 and ρ = 0.5 (i.e., the best case with a balanced bit distribution), and I × J = 8192. Fig. 7a shows the probability for each possible value of b computed using Eq. 1. Note that even for b/K = 0.3, the probability is over 0.1, i.e., more than 10% of the cells are expected to experience a duty-cycle of at most 0.3, or at least 0.7.
Now, if we employ a given aging mitigation technique that offers up to 7 shifts to increase the number of different bits that are mapped to a single cell, we can theoretically increase the value of K to 160, assuming the bits to be independent of each other and an ideal shifting policy. Putting K = 160 in the above-mentioned example, Fig. 7b shows the probabilities for different b/K values. As can be seen from Fig. 7b, the probabilities at lower b/K values have dropped significantly. The above analysis implies that by significantly increasing K and having ρ = 0.5, we can achieve a close to ideal duty-cycle for all the cells.
Now, instead of a barrel shifter, if we employ an inversion-based duty-cycle balancing technique where every other write to the same location is inverted, then for the given scenario the value of K remains the same, as it is even. Moreover, as ρ is defined to be 0.5, the inversion-based policy has no impact on ρ either. Therefore, we get the same probabilities as presented in Fig. 7a. However, note that the inversion-based policy is mainly useful for achieving ρ = 0.5 in cases where the distribution of bits is biased towards either ‘0’ or ‘1’.
C. Challenges in Designing an Efficient Aging Mitigation System
Based on the above analysis, we outline the following key challenges in designing a generic aging mitigation system.
1) The probability of occurrence of a non-ideal duty-cycle is considerable even with the state-of-the-art fixed aging mitigation techniques. Therefore, a more robust method has to be designed by exploiting the fact that NBTI-aging is more dependent on the average duty-cycle over the lifetime of the device [14].
2) The distribution of bits and the duty-cycle are significantly affected by the datatype used for representing the weights. Therefore, the mitigation technique should be generic and independent of the datatype used, so that it is beneficial for various DNN accelerators.
Moreover, in practical scenarios, each layer of a DNN can have a different size. Therefore, the processing time can vary significantly across layers. Also, different DNNs can have different numbers of layers. A method that keeps track of all these factors at a fine granularity can help in significantly reducing the aging rates. However, such methods are very costly.
This makes it very challenging to develop a generic method that offers effective aging mitigation at reasonable overheads.
IV. A MICRO-ARCHITECTURE FOR MITIGATING AGING OF THE ON-CHIP WEIGHT MEMORY OF DNN ACCELERATORS
To address the above challenges, we propose a Write Data Encoder (WDE) for encoding the weights before writing them to the on-chip weight memory, and a Read Data Decoder (RDD) which performs the inverse function of the WDE while reading the data from the on-chip memory and before passing it to the processing array. The integration of the proposed modules in the DNN accelerator is shown in Fig. 4a. Moreover, we propose an aging mitigation controller which generates the control signals (metadata) for the write (and read) transducer. The proposed micro-architectures of the WDE and the aging mitigation controller are shown in Fig. 8.
Fig. 8: Proposed micro-architecture for effective aging mitigation of the 6T-SRAM weight memory of DNN accelerators.
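The encoder/decoder pairing can be sketched in a few lines of Python. This is a behavioral illustration only, assuming that the encoding is a conditional bitwise inversion (an XOR of every data bit with a one-bit enable signal E) and that the same E is available again at read time; the function names and the default word width are hypothetical.

```python
def wde_encode(word, enable, width=8):
    """Write Data Encoder sketch: invert all `width` bits of `word` when enable is 1."""
    mask = (1 << width) - 1
    return (word ^ (mask if enable else 0)) & mask


def rdd_decode(stored, enable, width=8):
    """Read Data Decoder sketch: XOR with the same enable bit undoes the encoding."""
    return wde_encode(stored, enable, width)


weight = 0b10110000
stored = wde_encode(weight, enable=1)     # inverted copy is written to the memory
recovered = rdd_decode(stored, enable=1)  # original weight reaches the processing array
```

Because XOR is its own inverse, the decoder needs no logic beyond what the encoder already has, as long as the enable bit used for a block is retained as metadata until that block is read.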
Write Data Encoder (WDE):
It leverages inversion logic which, besides its low overhead, also makes it possible to perfectly balance out the distribution of bits in the cells of the memory when the distribution is originally biased towards either ‘0’ or ‘1’, as highlighted in Section III. The inversion logic in the proposed micro-architecture is implemented using XOR gates, as they allow the aging mitigation controller to enable or disable it using just an enable (E) signal. Another key advantage of this design is that the micro-architecture of the RDD is the same as that of the WDE, where the same E signal (metadata) that is used to encode the weights is used (at a later point in time) for decoding them before passing them to the processing array. Moreover, the proposed WDE and RDD modules are highly scalable, as increasing the width of the modules requires only a linear increase in the number of XOR gates. Therefore, the widths of these modules can be defined directly based on the DNN accelerator configuration without affecting the energy efficiency of the system.
Aging Mitigation Controller:
The controller is the core part of the proposed micro-architecture, as it is responsible for generating the enable signal (E) that enables/disables the inversion logic in the WDE. The design is based on the observation made in Section III that the higher the number of different bits written to an SRAM cell during its lifetime (i.e., K in Eq. 1), provided they are generated from a uniform distribution, the lower the chances of observing a deviation in its duty-cycle from 0.5 (see Fig. 7), i.e., the ideal point shown in Fig. 2b. Therefore, to increase the number of different bits written to an SRAM cell, we employ a True Random Bit Generator (TRBG) to generate the enable signal and decide whether the upcoming data should be written with or without inversion in the memory cell. The TRBG adds randomness to the bits written to the memory and thereby leads to a larger K value and lower aging.
Note that in practical scenarios, the output of TRBGs can be biased towards either ‘0’ or ‘1’, which can eventually affect the duty-cycle. To mitigate this, we periodically invert the output of the TRBG after a defined number of iterations with the help of an M-bit register before using it as the enable signal, which balances the bias.
V. RESULTS AND DISCUSSION
A. Experimental Setup
Fig. 10 illustrates the overall experimental setup used for evaluation. The setup consists of hardware synthesis for estimating the power, area and delay characteristics of the proposed modules, and simulations for aging estimation of the 6T-SRAM on-chip weight memory of different DNN hardware accelerators. For hardware synthesis, we implemented different aging balancing circuits and our DNN-Life architecture in Verilog. The circuits are synthesized for the TSMC 65nm technology using Cadence Genus.
(Footnote to the low overhead of the WDE inversion logic in Section IV: low overhead compared to other techniques such as shifting, which requires costly barrel shifters, as shown later in Section V.)
Fig. 9: SNM degradation of 6T-SRAM on-chip weight memory cells of the baseline DNN accelerator when used for performing inferences only using the AlexNet network. Each bar graph shows the percentage of the number of cells (Y-axis) experiencing different levels of SNM degradation (X-axis). Rows: 32-bit floating point format, 8-bit format (symmetric quantization), 8-bit format (asymmetric quantization). Columns: without aging mitigation, inverter-based, barrel shifter-based, DNN-Life when Bias = 0.5, DNN-Life without bias balancing when Bias = 0.7, DNN-Life with bias balancing when Bias = 0.7.
Fig. 10: Overall experimental setup used for evaluation.
For aging estimation, we use Static Noise Margin (SNM) to quantify the NBTI-aging of 6T-SRAM cells, similar to [21] [25]. The SNM defines the tolerance to noise that directly affects the read stability of a cell [26], i.e., if the SNM of a cell is low, the cell is highly susceptible to read failures. As per [15] [21] [25], SNM mainly depends on the duty-cycle over the entire lifetime of the cell, and the least SNM degradation is achieved at 50% duty-cycle. To obtain SNM results, we employ a device aging model similar to the one used in state-of-the-art studies like [21] [25]. However, due to its focus on duty-cycle optimization, our proposed technique is orthogonal to the given device aging models, and other device-level models can easily be integrated in our framework. Based on the models, the SNM degradation of a 6T-SRAM cell can be computed using the duty-cycle. From the analysis, the best SNM degradation for a 6T-SRAM cell after 7 years is 10.82% (at 50% duty-cycle), and the worst is 26.12% (at 0% and 100% duty-cycle).
For large-scale simulations, we integrated the output of these models into a memory simulator of the baseline DNN hardware (described in Section II-A). The simulator takes the DNN hardware configuration, dataflow, pre-trained DNN architecture and test samples as inputs. We also built a memory simulator for a TPU-like hardware architecture [4] to validate the proposed aging-mitigation technique across DNN hardware accelerators. The hardware configurations used for the evaluation are presented in Table I. The DNNs used are the AlexNet and the VGG-16 with the ImageNet dataset and a custom network with the MNIST dataset. The custom network is composed of two CONV layers and two FC layers, i.e., CONV(16,1,5,5), CONV(50,16,5,5), FC(256,800) and FC(10,256).
For each setting, the duty-cycles are estimated based on the values observed in 100 inferences. The bias balancing register is defined to be a 4-bit register (i.e., M = 4) for all the corresponding cases.
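To make the duty-cycle estimation step concrete, the following Python sketch simulates a single memory cell over repeated inferences under three write policies. It is a simplified stand-in for the memory simulator, not its actual implementation. The assumptions are: every block stays in the cell for an equal amount of time; the TRBG is modeled by a seeded pseudo-random generator that outputs ‘1’ with probability `trbg_bias`; and bias balancing is modeled as an M-bit counter that is incremented on every new data block and whose most significant bit is XORed with the TRBG output. The last point is one reading of Fig. 8, and all names are hypothetical.

```python
import random


def simulate_duty_cycle(block_bits, inferences, policy,
                        trbg_bias=0.5, m_bits=0, seed=0):
    """Estimate the duty-cycle of one cell that receives `block_bits` per inference.

    policy: "none" (no mitigation), "inversion" (every other write inverted),
            or "dnn_life" (random inversion, optional M-bit bias balancing).
    """
    rng = random.Random(seed)  # stand-in for a (possibly biased) TRBG
    ones = writes = counter = 0
    for _ in range(inferences):
        for bit in block_bits:
            if policy == "none":
                enable = 0
            elif policy == "inversion":
                enable = writes & 1
            elif policy == "dnn_life":
                enable = 1 if rng.random() < trbg_bias else 0
                if m_bits > 0:
                    # Assumed bias balancing: flip the TRBG output while the
                    # MSB of the M-bit block counter is set.
                    enable ^= (counter >> (m_bits - 1)) & 1
                    counter = (counter + 1) % (1 << m_bits)
            else:
                raise ValueError(f"unknown policy: {policy}")
            ones += bit ^ enable  # value actually stored in the cell
            writes += 1
    return ones / writes


stuck_at_one = [1] * 20  # worst case: the cell always receives a '1'
d_none = simulate_duty_cycle(stuck_at_one, 100, "none")
d_biased = simulate_duty_cycle(stuck_at_one, 100, "dnn_life", trbg_bias=0.7)
d_balanced = simulate_duty_cycle(stuck_at_one, 100, "dnn_life", trbg_bias=0.7, m_bits=4)
```

In this toy setting, a biased random source alone shifts the duty-cycle away from 0.5, while the counter-based flipping pulls it back, which is the qualitative behavior the bias balancing register is meant to provide. Mapping the resulting duty-cycles to SNM degradation would additionally require the device-level aging model, which is not reproduced here.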
B. Aging Estimation Results and Comparisons
In this subsection, we analyze the impact of using different aging mitigation policies on the SNM degradation of the 6T-SRAM on-chip weight memory cells after 7 years. We mainly considered four different policies: (1) No aging mitigation, (2) Inversion-based, (3) Barrel shifter-based, and (4) DNN-Life. For the proposed DNN-Life, we consider three different cases: (i) TRBG is not biased and it generates 0s and 1s with equal probability (referred to in the results as Bias=0.5); (ii) TRBG is biased and it generates 1s with 0.7 probability, and the aging controller does not have a bias balancing register (referred to in the results as without bias balancing with Bias=0.7); and (iii) TRBG is biased and it generates 1s with 0.7 probability and the aging controller has a 4-bit bias balancing register (referred to in the results as with bias balancing with Bias=0.7).

Moreover, we performed experiments considering three different data representation formats for weights: (1) 32-bit floating point format; (2) 8-bit integer format when weights are quantized using symmetric quantization method; and (3) 8-bit integer format when weights are quantized using asymmetric quantization method.

Fig. 9 shows the distributions of SNM degradation in the memory cells obtained using different aging mitigation policies and a pre-trained AlexNet model. The Y-axis of each bar graph shows the percentage of cells and the X-axis shows the SNM degradation levels. Note that, for these experiments, we assumed the baseline DNN accelerator configuration presented in Table I and the dataflow shown in Fig. 5 with f = 8. Also, we assumed that only a single DNN (i.e., the AlexNet) is used for data inference throughout the lifetime of the device. As can be seen in the figure, the inversion-based and barrel shifter-based aging balancing reduce the SNM degradation of the SRAM cells; however, they do not offer minimum SNM degradation (see markers 2 and 3 in comparison with marker 1 in Fig. 9). This behavior is observed to be consistent across all the data representation formats (see markers 2 to 7 in comparison with their respective without-aging-mitigation graphs in Fig. 9). Specifically, the inversion-based aging balancing offers sub-optimal aging mitigation in case of the 32-bit floating point format (see marker 2 in Fig. 9), where most of the cells experience around 10.8% SNM degradation (see marker a in Fig. 9). However, this is not the ideal scenario, as there are 4% of cells that experience the highest level of SNM degradation (see marker b in Fig. 9) and a few that experience a moderate level of SNM degradation (see marker c in Fig. 9). Now, if we analyze the results of the proposed DNN-Life with bias balancing, it offers maximum aging mitigation (i.e., all the cells experience around 10.8% SNM degradation) in all the cases (see markers 8, 9 and 10 in Fig. 9).

TABLE I: Hardware configurations and settings used in evaluation

| | Baseline Accelerator (Section II-A) | TPU-like NPU [4] |
|---|---|---|
| Weight memory size | 512KB | 256KB |
| Activation memory size | 4MB | 24MB |
| PE array size | 8 PEs (1 PE = 8 Multipliers) | 256 x 256 PEs (1 PE = 1 MAC) |
| Networks | AlexNet | AlexNet, VGG-16 and Custom |

Fig. 11: SNM degradation of 6T-SRAM on-chip weight memory cells of a TPU-like NPU when used for performing inferences using the AlexNet, the VGG-16 and the custom DNN, individually. The networks are quantized to 8-bit format using symmetric range-linear quantization method. (Panel columns: Without Aging Mitigation, Inverter-based, Barrel Shifter-based, DNN-Life with Bias Balancing when Bias = 0.7. Panel rows, NPU with different DNNs individually: AlexNet for ImageNet Dataset, VGG-16 for ImageNet Dataset, Custom Network for MNIST Dataset.)
Impact of biased TRBG on aging balancing of 6T-SRAM on-chip weight memory:
Fig. 9 also illustrates the impact of using the proposed design without bias correction when the duty-cycle of the TRBG is 0.7. As can be seen in the figure, having a biased TRBG and no bias correction leads to a smaller reduction in SNM degradation of the 6T-SRAM cells (e.g., see marker 11 in comparison with marker 8 in Fig. 9). This behavior is consistent across all the data representation formats.
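The effect of a biased TRBG can be illustrated with a short probability calculation. The Python sketch below assumes, purely for illustration, that the write transducer XORs every data bit with an independent random bit before storing it; the encoder details are given elsewhere in the paper and may differ. Under that assumption, a data bit that is 1 with probability q, combined with a random bit that is 1 with probability p, is stored as 1 with probability q(1-p) + (1-q)p. The second function models one hypothetical form of bias balancing, in which the polarity of the random bit is flipped for half of the time; it is not claimed to be the paper's controller design. All function names are ours.

```python
def stored_duty_cycle(q, p):
    """Probability that the stored bit is 1 when a data bit (1 with
    probability q) is XORed with an independent random bit (1 with
    probability p). Assumes XOR-based encoding, which is an illustration."""
    return q * (1.0 - p) + (1.0 - q) * p


def balanced_duty_cycle(q, p):
    """Long-run duty-cycle if the random bit's polarity is inverted for half
    of the time (hypothetical balancing scheme), so the effective bias
    alternates between p and 1 - p."""
    return 0.5 * (stored_duty_cycle(q, p) + stored_duty_cycle(q, 1.0 - p))


# A cell whose data bit is always 0 (q = 0) under an unbiased and a biased TRBG.
unbiased = stored_duty_cycle(0.0, 0.5)
biased = stored_duty_cycle(0.0, 0.7)
rebalanced = balanced_duty_cycle(0.0, 0.7)
```

With p = 0.5 the stored duty-cycle is 50% for any q, whereas with p = 0.7 a cell whose data bit is constant ends up at a 70% (or 30%) duty-cycle. This is consistent with the smaller reduction in SNM degradation observed without bias correction.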
Impact across different hardware accelerators:
Fig. 11 shows the impact of using the proposed aging-mitigation technique for a TPU-like [4] Neural Processing Unit (NPU) architecture that has an on-chip weight FIFO which is four tiles deep, where one tile is equivalent to weights for × PEs. Each PE has a single MAC unit that can perform 8-bit multiplication. For our implementation, we assumed the weight FIFO to be a circular buffer-based design. We performed the analysis using the three different networks mentioned earlier. All the DNNs are quantized to 8 bits using post-training symmetric quantization. Considering the dataflow of the NPU, the parameter f was set to 256. As can be seen in Fig. 11, the inversion-based aging mitigation policy offers optimal results for the AlexNet and the VGG-16 networks (see markers 1 and 2 in Fig. 11). However, when used for the custom DNN, almost all the memory cells experience significant SNM degradation (see marker 3 in Fig. 11). The barrel shifter-based approach also offers sub-optimal results (see markers 4 to 6 in Fig. 11). However, the proposed DNN-Life with bias balancing offers maximum aging mitigation (see markers 7 to 9 in Fig. 11). This shows that DNN-Life can be used for a wide range of DNN accelerators.

C. Area and Power Results
The area, power and delay characteristics of three different Write Data Encoders (WDEs) composed of different aging balancing units are shown in Table II. All three WDEs are designed for a 64-bit width. The barrel shifter-based WDE consumes the most area and power. The proposed design consumes slightly more power and area than the inversion-based WDE. However, as shown in the previous subsection, it offers the best aging mitigation in all the possible scenarios, regardless of the size of the given DNN, the data representation format and the on-chip weight memory size. Note that, at the hardware level, we realized the TRBG using a 5-stage ring oscillator.
TABLE II: Hardware results of different Write Data Encoders (WDEs)

| | Delay [ps] | Power [nW] | Area [cell area] |
|---|---|---|---|
| Barrel Shifter based WDE | 977.7 | 345190 | 9035 |
| Inversion based WDE | 811.6 | 10716 | 195 |
| Proposed WDE with Aging Mitigation Controller | 581.8 | 13747 | 295 |
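To make the comparison in Table II concrete, the relative overheads can be computed directly from the reported values. The short Python sketch below does only this arithmetic; the dictionary keys and the helper name `ratio` are our own labels, while the numbers are copied from Table II.

```python
# Values copied from Table II (delay in ps, power in nW, area in cell-area units).
WDE_RESULTS = {
    "barrel_shifter": {"delay_ps": 977.7, "power_nw": 345190, "area_cells": 9035},
    "inversion": {"delay_ps": 811.6, "power_nw": 10716, "area_cells": 195},
    "proposed": {"delay_ps": 581.8, "power_nw": 13747, "area_cells": 295},
}


def ratio(design, baseline, metric):
    """Return metric(design) / metric(baseline) for two WDE designs."""
    return WDE_RESULTS[design][metric] / WDE_RESULTS[baseline][metric]


# Proposed WDE relative to the inversion-based WDE.
power_vs_inversion = ratio("proposed", "inversion", "power_nw")
area_vs_inversion = ratio("proposed", "inversion", "area_cells")

# Barrel shifter-based WDE relative to the proposed WDE.
power_barrel_vs_proposed = ratio("barrel_shifter", "proposed", "power_nw")
area_barrel_vs_proposed = ratio("barrel_shifter", "proposed", "area_cells")
```

From these values, the proposed WDE uses about 1.28x the power and 1.51x the area of the inversion-based WDE, while the barrel shifter-based WDE uses about 25.1x the power and 30.6x the area of the proposed WDE. The reported delay of the proposed WDE is the lowest of the three.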
VI. CONCLUSION
In this paper, we proposed DNN-Life, an aging-mitigation framework that employs read and write transducers to reduce NBTI-induced aging of 6T-SRAM on-chip weight memory in DNN hardware accelerators. We analyzed different DNN data representation formats at the software level and their potential for balancing the duty-cycle in SRAM cells. Based on the analysis, we proposed a micro-architecture that makes use of a True Random Bit Generator (TRBG) to ensure an optimal duty-cycle at run time, thereby balancing the aging of the complementary parts in the 6T-SRAM cells of the weight memory. As a result, our DNN-Life enables efficient aging mitigation of the weight memory of a given DNN hardware with minimal energy overhead.

ACKNOWLEDGMENT
This work is partially supported by Intel Corporation through Gift funding for the project “Cost-Effective Dependability for Deep Neural Networks and Spiking Neural Networks.”

REFERENCES

[1] V. Sze et al., “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
[2] M. Capra et al., “Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead,” IEEE Access, 2020.
[3] Y. Chen et al., “Dadiannao: A machine-learning supercomputer,” in IEEE/ACM MICRO Symposium, 2014, pp. 609–622.
[4] N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,” in ACM/IEEE ISCA, 2017, pp. 1–12.
[5] P. McLellan. (2019) Hot chips: The biggest chip in the world. Accessed: 2019-09-10. [Online]. Available: https://community.cadence.com/cadence blogs8/b/breakfast-bytes/posts/the-biggest-chip-in-the-world
[6] J. Henkel et al., “Reliable on-chip systems in the nano-era: Lessons learnt and future trends,” in ACM/ESDA/IEEE DAC, 2013, p. 99.
[7] M. Shafique et al., “Robust machine learning systems: Challenges, current trends, perspectives, and the road ahead,” IEEE Design & Test, vol. 37, no. 2, pp. 30–57, 2020.
[8] J. Henkel et al., “Thermal management for dependable on-chip systems,” in IEEE ASP-DAC, 2013, pp. 113–118.
[9] M. A. Hanif et al., “Robust machine learning systems: Reliability and security for deep neural networks,” in IEEE IOLTS, 2018, pp. 257–260.
[10] S. Kim et al., “Matic: Learning around errors for efficient low-voltage neural network accelerators,” in IEEE DATE, 2018, pp. 1–6.
[11] K. Kang et al., “NBTI induced performance degradation in logic and memory circuits: How effectively can we approach a reliability solution?” in IEEE ASP-DAC, 2008, pp. 726–731.
[12] D. Gnad et al., “Hayat: Harnessing dark silicon and variability for aging deceleration and balancing,” in , 2015.
[13] J. Shin et al., “A proactive wearout recovery approach for exploiting microarchitectural redundancy to extend cache SRAM lifetime,” in ACM/IEEE SIGARCH Computer Arch. News, vol. 36, no. 3, 2008, pp. 353–362.
[14] J. Abella et al., “Penelope: The NBTI-aware processor,” in IEEE/ACM MICRO Symposium. IEEE Computer Society, 2007, pp. 85–96.
[15] S. Kothawade et al., “Analysis and mitigation of NBTI aging in register file: An end-to-end approach,” in IEEE ISQED, 2011, pp. 1–7.
[16] A. Ricketts et al., “Investigating the impact of NBTI on different power saving cache strategies,” in IEEE DATE, 2010, pp. 592–597.
[17] T. Siddiqua et al., “Enhancing NBTI recovery in SRAM arrays through recovery boosting,” IEEE TVLSI, vol. 20, no. 4, pp. 616–629, 2011.
[18] B. Zatt et al., “A low-power memory architecture with application-aware power management for motion disparity estimation in multiview video coding,” in IEEE/ACM ICCAD, 2011, pp. 40–47.
[19] T. Jin et al., “Aging-aware instruction cache design by duty cycle balancing,” in IEEE IVLSI, 2012, pp. 195–200.
[20] A. Calimera et al., “Partitioned cache architectures for reduced NBTI-induced aging,” in IEEE DATE, 2011, pp. 1–6.
[21] M. Shafique et al., “Enaam: Energy-efficient anti-aging for on-chip video memories,” in ACM/IEEE DAC, 2015, pp. 101:1–101:6.
[22] A. Delmas et al., “Bit-tactical: Exploiting ineffectual computations in convolutional neural networks: Which, why, and how,” preprint arXiv:1803.03688, 2018.
[23] J. Li et al., “Smartshuttle: Optimizing off-chip memory accesses for deep learning accelerators,” in IEEE DATE, 2018, pp. 343–348.
[24] D. Lin et al., “Fixed point quantization of deep convolutional networks,” in ICML, 2016, pp. 2849–2858.
[25] M. Shafique et al., “Content-aware low-power configurable aging mitigation for SRAM memories,” IEEE Transactions on Computers, vol. 65, no. 12, pp. 3617–3630, 2016.
[26] K. Agarwal et al., “Statistical analysis of SRAM cell stability,” in ACM/IEEE DAC, 2006, pp. 57–62.