MF-Net: Compute-In-Memory SRAM for Multibit Precision Inference using Memory-immersed Data Conversion and Multiplication-free Operators

Shamma Nasrin, Student Member, IEEE, Diaa Badawi, Ahmet Enis Cetin, Fellow, IEEE, Wilfred Gomes, and Amit Ranjan Trivedi, Member, IEEE

Shamma Nasrin ([email protected]), Diaa Badawi ([email protected]), Ahmet Enis Cetin ([email protected]), and Amit Trivedi ([email protected]) are with the Department of Electrical and Computer Engineering, University of Illinois at Chicago. Wilfred Gomes ([email protected]) is affiliated with Intel.
Abstract—We propose a co-design approach for compute-in-memory inference for deep neural networks (DNN). We use multiplication-free function approximators based on the ℓ1 norm along with a co-adapted processing array and compute flow. Using the approach, we overcome many deficiencies in the current art of in-SRAM DNN processing, such as the need for digital-to-analog converters (DACs) at each operating SRAM row/column, the need for high-precision analog-to-digital converters (ADCs), limited support for multibit precision weights, and limited vector-scale parallelism. Our co-adapted implementation seamlessly extends to multibit precision weights, does not require DACs, and easily extends to higher vector-scale parallelism. We also propose an SRAM-immersed successive approximation ADC (SA-ADC), which exploits the parasitic capacitance of the bit lines of the SRAM array as a capacitive DAC. Since the dominant area overhead in an SA-ADC comes from its capacitive DAC, exploiting the intrinsic parasitics of the SRAM array allows a low-area implementation of within-SRAM SA-ADC. Our 8×62 SRAM macro, which requires a 5-bit ADC, achieves ∼105 TOPS/W; our 8×30 SRAM macro, which requires a 4-bit ADC, achieves ∼84 TOPS/W. SRAM macros that require lower ADC precision are more tolerant of process variability but have lower TOPS/W as well. We evaluated the accuracy and performance of our proposed network for the MNIST, CIFAR10, and CIFAR100 datasets. We chose network configurations that adaptively mix multiplication-free and regular operators. The network configurations utilize the multiplication-free operator for more than 85% of the total operations. The selected configurations are 98.6% accurate for MNIST, 90.2% for CIFAR10, and 66.9% for CIFAR100. Since most of the operations in the considered configurations are based on the proposed SRAM macros, our compute-in-memory's efficiency benefits broadly translate to the system level.
Index Terms—Deep neural networks; In-SRAM processing; MNIST; CIFAR10; CIFAR100.

I. INTRODUCTION
In many practical applications, deep neural networks (DNNs) have shown remarkable prediction accuracy [1]–[4]. DNNs in these applications typically utilize thousands to millions of parameters (i.e., weights) and are trained over a huge number of example patterns [5]–[7]. Operating over such a large parametric space, which is carefully orchestrated over multiple abstraction levels (i.e., hidden layers), facilitates superior generalization and learning capacity, but it also presents critical inference constraints, especially for real-time and/or low-power applications.
Figure 1: High-level overview of in-SRAM processing in the current art and key limitations. The figure considers CONV-SRAM [8] as a motivating example; however, the challenges are common to most other designs.
For instance, when DNNs are mapped on a traditional computing engine, the inference performance is strangled by extensive memory accesses, and the high performance of the processing engine helps little.

A radical approach gaining attention to address this performance challenge of DNNs is to design memory units that can not only store DNN weights but also locally process DNN layers. Using such 'compute-in-memory', high-volume data traffic between processor and memory units is obviated, and the critical bottleneck can be alleviated. Moreover, mixed-signal in-memory processing of DNN operands reduces the necessary operations for DNN inference. For example, using a charge/current-based representation of the operands, the accumulation of products simply reduces to current/charge summation over a wire. Therefore, dedicated modules and operation cycles for product summations are not necessary.

In recent years, several in-SRAM DNN implementations have been shown. However, many critical limitations remain, which inhibit the scalability of the processing. In Figure 1, using CONV-SRAM [8] as a motivating example, we highlight these limitations; notably, the challenges are common to most other in-SRAM applications too. To compute the inner product of l-element weight and input vectors w and x, l DACs and one ADC are required. Since the DACs are concurrently active, they lead to both high area and power. With increasing precision of operands, the design of DACs also becomes more complex. In [9], time-domain DACs are used to handle this complexity; however, with increasing input precision, either the operating time increases exponentially, or complex analog-domain voltage scaling is necessitated. In [10], DACs are obviated, but the operation is limited to binary inputs and weights, which has low accuracy.

An ADC is needed to digitize the inner product of the w and x vectors in Figure 1. If x is n-bit and the ADC combines the output of l cells, the minimum necessary precision of the ADC is ∼n + log2(l) to avoid any quantization loss (e.g., for 8-bit inputs and l = 64 columns, a ∼14-bit ADC would be needed). Therefore, the ADC precision requirement becomes more stringent with increasing input precision and the number of cells being summed. Moreover, the scaled technology nodes of SRAM preclude analog-heavy ADCs embedded within SRAM. In [8], a charge-sharing-based ADC was integrated with SRAM. However, the worst-case comparison steps grow exponentially with the ADC's precision, limiting vector-scale parallelism (i.e., the number of cells/products l that can be concurrently processed). In [11], the ADC is avoided by using a comparator circuit, but this limits the implementation to step-function-based activation only and doesn't support the mapping of DNNs with larger weight matrices that cannot fit within an SRAM array.

Near-memory processing avoids the complexity of ADCs/DACs by operating in the digital domain only. The schemes use time-domain and frequency-domain summing of weight-input products. Unlike a charge/current-based sum, however, time/frequency-domain summation is not instantaneous. A counter [12] or memory delay line (MDL) [13] can be used to accumulate weight-input products.
With increasing vector-scale parallelism (length of the input/weight vector l), the integration time of the counter/MDL increases exponentially, which again limits parallelism and throughput.

Addressing the challenges of the state-of-the-art, we propose the following core concepts:

1) We use a multiplication-free neural network operator that eliminates high-precision multiplications in the input-weight correlation [14]–[16]. In the new operator, the correlation of weight w and input x is represented as

$$w \oplus x = \sum_i \big( \mathrm{sign}(x_i)\cdot\mathrm{abs}(w_i) + \mathrm{sign}(w_i)\cdot\mathrm{abs}(x_i) \big) \qquad (1)$$

Here, · is an element-wise multiplication operator, + is an element-wise addition operator, and Σ is the vector sum operator. The sign() operator is ±1, and the abs() operator produces the absolute unsigned value of its operand. Therefore, in Eq. 1, the correlation operator is inherently designed to only multiply a one-bit element of sign(x) against the full-precision w, and a one-bit sign(w) against x. By avoiding direct multiplications between full-precision variables, DACs can be avoided in in-memory computing. (A short software sketch of the operator appears at the end of this section.)

2) Additionally, we reformulate Eq. 1 as below to minimize the dynamic energy of computation:

$$\sum_i \mathrm{sign}(w_i)\cdot\mathrm{abs}(x_i) = \underbrace{2\times\sum_i \mathrm{step}(w_i)\cdot\mathrm{abs}(x_i)}_{\text{low dynamic energy}} - \underbrace{\sum_i \mathrm{abs}(x_i)}_{\text{shared computation}} \qquad (2a)$$

$$\sum_i \mathrm{sign}(x_i)\cdot\mathrm{abs}(w_i) = \underbrace{2\times\sum_i \mathrm{step}(x_i)\cdot\mathrm{abs}(w_i)}_{\text{low dynamic energy}} - \underbrace{\sum_i \mathrm{abs}(w_i)}_{\text{weight statistics}} \qquad (2b)$$

In the above reformulation, step() ∈ {0, 1}. The reformulation allows processing with a single product port of the SRAM cells; thus, it can reduce dynamic energy. Compare this to current implementations [8], [9], [11], where operations with weights w ∈ {−1, 1} require product accumulation over both bit lines. The SRAM design in [8] is 10T to support differential-ended processing, while our SRAM is 8T due to single-ended processing. However, the above reformulation also has the residue terms Σ abs(x_i) and Σ abs(w_i). The first term can be computed using a dummy row of weights, all storing ones. For a given input, this computation is referenced for all weight vectors; thus, the computing overheads amortize. The second term is a weight statistic that can be pre-computed and looked up during evaluation.

3) We also discuss that the parasitic capacitance of the bit lines of an SRAM array can be exploited as a capacitive digital-to-analog converter (DAC) for a successive approximation-based ADC (SA-ADC). In our architecture, when bit lines in one half of the array compute the weight-input correlation, bit lines in the other half implement the binary search of the SA-ADC to digitize the correlation output. Remarkably, the proposed DNN operator also helps reduce the precision constraints on the SA-ADC. With the proposed operator, each SRAM cell only performs a 1-bit logic operation; thus, to digitize the output of l columns, an ADC with log2(l) precision is needed. Compare this to CONV-SRAM in Figure 1, where the necessary ADC precision is n + log2(l) since each SRAM cell processes an n-bit DAC's output. By simplifying data converters, our scheme can also achieve higher vector-scale parallelism, i.e., it allows processing a higher number of parallel columns (l) with the same ADC complexity than in [8].

Sec. II introduces the co-adapted multiplication-free operators for the in-SRAM deep neural network. Sec. III characterizes the algorithmic accuracy of our framework. Sec. IV gives an overview of the compute-in-memory macro based on the multiplication-free operator. Sec. V discusses the power/performance characterization of the proposed compute-in-SRAM macro. Sec. VI explores the concept of synergistic integration of digital and compute-in-memory processing for DNN. Sec. VII concludes.
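To make the operator and its single-ended reformulation concrete, the following minimal NumPy sketch (ours, for illustration; the function names are not from the paper) evaluates w ⊕ x as in Eq. 1 and checks the equivalence of Eq. 2:

```python
import numpy as np

def sgn(v):
    # Paper's sign(): +1/-1 (we map 0 to +1, consistent with step() below).
    return np.where(v >= 0, 1.0, -1.0)

def step(v):
    # step() in {0, 1}; note sgn(v) == 2*step(v) - 1.
    return np.where(v >= 0, 1.0, 0.0)

def mf_correlation(w, x):
    """Multiplication-free correlation w (+) x of Eq. 1."""
    return np.sum(sgn(x) * np.abs(w) + sgn(w) * np.abs(x))

def mf_correlation_single_ended(w, x):
    """Eq. 2 reformulation: only the step() products touch the bit cells;
    sum(abs(x)) comes from a dummy all-ones weight row shared across
    filters, and sum(abs(w)) is a precomputed weight statistic."""
    return (2.0 * np.sum(step(w) * np.abs(x)) - np.sum(np.abs(x))
            + 2.0 * np.sum(step(x) * np.abs(w)) - np.sum(np.abs(w)))

rng = np.random.default_rng(0)
w, x = rng.standard_normal(31), rng.standard_normal(31)
assert np.isclose(mf_correlation(w, x), mf_correlation_single_ended(w, x))
```

The two subtracted sums in the single-ended form are exactly the residue terms that the dummy row and the precomputed weight statistic supply in hardware.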
II. CO-DESIGNED MULTIPLICATION-FREE NN OPERATOR
The multiplication-free DNN operator was presented in [17]. In this work, we expand on the potential of multiplication-free operators to considerably reduce the complexity of SRAM-based compute-in-memory design. Compared to [17], we adjusted the operator with the abs() operation on operands w and x in Eq. (1) to further simplify the compute-in-memory processing steps. Our later discussion will show that the adjusted operator also achieves high prediction accuracy on various benchmark datasets.

Figure 2: Training curves for the multiplication-free NN operator on (a) MNIST and (b) CIFAR10 data sets. (c) The effect of quantization on the multiplication-free operator.

Note that the multiplication-free operator in Eq. (1) is based on the ℓ1 norm, since x ⊕ x = 2||x||₁. In traditional neural networks, neurons perform inner products to compute the correlation between the input vector and the weights of the neuron. We define a new neuron by replacing the affine transform of a traditional neuron with the co-designed NN operator as φ(α(x ⊕ w) + b), where w ∈ R^d and α, b ∈ R are the weights, the scaling coefficient, and the bias, respectively. Moreover, since the proposed NN operator is nonlinear itself, an additional nonlinear activation layer (e.g., ReLU) is not needed, i.e., φ() can be an identity function. Most neural network structures, including multi-layer perceptrons (MLP), recurrent neural networks (RNN), and convolutional neural networks (CNN), can be easily converted into such compute-in-memory compatible network structures by just replacing ordinary neurons with the activation functions defined using ⊕ operations, without modification of the topology and the general structure.

The above co-designed neural network can be trained using standard back-propagation and related optimization algorithms. The back-propagation algorithm computes derivatives with respect to the current values of parameters. However, the key training complexity for the operator is that the derivative of α(x ⊕ w) + b with respect to x and w is undefined when x_i and w_i are zero. The partial derivatives of x ⊕ w with respect to x and w can be expressed as

$$\frac{\partial (x \oplus w)}{\partial x_i} = \mathrm{sign}(w_i)\,\mathrm{sign}(x_i) + 2 \times \mathrm{abs}(w_i)\,\delta(x_i), \qquad (3a)$$
$$\frac{\partial (x \oplus w)}{\partial w_i} = \mathrm{sign}(x_i)\,\mathrm{sign}(w_i) + 2 \times \mathrm{abs}(x_i)\,\delta(w_i). \qquad (3b)$$

Here, δ() is the Dirac-delta function. For gradient-descent steps, the discontinuity of the sign function can be approximated by a steep hyperbolic tangent, and the discontinuity of the Dirac-delta function can be approximated by a steep zero-centered Gaussian function.
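During training, the smoothed surrogates can be dropped into Eq. 3 directly. Below is a minimal sketch of the surrogate derivatives (the steepness constant K is our illustrative choice, not a value from the paper):

```python
import numpy as np

K = 50.0  # steepness of the smooth surrogates (illustrative choice)

def smooth_sign(v):
    # sign() approximated by a steep hyperbolic tangent.
    return np.tanh(K * v)

def smooth_delta(v):
    # Dirac delta approximated by a steep zero-centered Gaussian
    # (normalized so that it integrates to one).
    return (K / np.sqrt(np.pi)) * np.exp(-(K * v) ** 2)

def grad_wrt_x(w, x):
    # Eq. 3a with the smooth surrogates substituted.
    return smooth_sign(w) * smooth_sign(x) + 2.0 * np.abs(w) * smooth_delta(x)

def grad_wrt_w(w, x):
    # Eq. 3b, symmetric in w and x.
    return smooth_sign(x) * smooth_sign(w) + 2.0 * np.abs(x) * smooth_delta(w)
```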
III. ACCURACY ON BENCHMARK DATASETS

In this section, we characterize the prediction accuracy of the proposed NN operator for the MNIST [18], CIFAR10 [19], and CIFAR100 [20] data sets. For MNIST, we simulate the LeNet-5 network [21]. The network consists of two convolution layers and two fully connected layers, and uses max-pooling layers. For CIFAR10, we use a convolutional neural network consisting of five convolution layers and two fully connected layers. The DNN for CIFAR10 uses max-pooling layers and a batch normalization layer. For CIFAR100, we choose MobileNetV2 [22]. Both the MNIST and CIFAR10 data sets consist of 60,000 images with 10 classes, whereas the CIFAR100 data set consists of 60,000 images with 100 classes. A test set of 10,000 images is used to characterize the prediction accuracy.

Table I: Multiplication-free operator vs. conventional DNN and binary neural network (BNN). Accuracy on test set:

Dataset | Conventional | Multiplication-free | BNN
MNIST | 99.01% | 98.6% | 97%
CIFAR10 | 90.95% | 90.2% | 85%
CIFAR100 | 67.88% | 66.9% | -

Table I summarizes the test-set accuracy for the data sets with conventional, multiplication-free, and binarized network operators. In Table I, for the CIFAR10 results, the convolution layers are binarized in BNN and made multiplication-free in the proposed network, respectively, while the fully connected layers are implemented using regular multiplications. Accuracy results for MNIST are obtained by keeping only the last layer typical, while the other layers are binarized or multiplication-free. For the CIFAR100 results, we make only the bottleneck (BN) layers of MobileNetV2 multiplication-free and keep the rest of the network with the typical operator. In Table I, we also compare the accuracy of the multiplication-free operator against the binarized weight operator using the above configuration on various datasets. While a binary neural network (BNN) simplifies the implementation at the cost of a constrained learning space, the learning space of the multiplication-free operator is as extensive as that of the typical operator. Therefore, multiplication-free operator-based networks have much better accuracy than BNN. The accuracy of the multiplication-free operator is also quite competitive with the conventional operator on various datasets [14], [15]. Meanwhile, our later discussion will show that the multiplication-free operator significantly reduces the implementation complexity compared to the typical operator. Figures 2(a-b) show the training curves for the multiplication-free operator compared to BNN under identical training conditions for MNIST and CIFAR10. As shown in Figure 2(c), the prediction accuracy of the multiplication-free operator is also amenable to input/weight quantization. Operation with 8-bit fixed-precision inputs/weights has equivalent accuracy to the floating-point precision cases on various datasets. Therefore, in the following discussion, we consider an 8-bit fixed-precision operation with the multiplication-free operator.
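For reference, here is a minimal symmetric uniform quantizer of the kind implied by the 8-bit fixed-precision evaluation (the paper does not spell out its exact quantization scheme, so the details below are our assumptions):

```python
import numpy as np

def quantize_symmetric(v, bits=8):
    """Uniform symmetric quantization to `bits`-bit signed fixed precision.
    Returns the integer codes and the scale needed to dequantize."""
    qmax = 2 ** (bits - 1) - 1                 # e.g., 127 for 8 bits
    scale = max(np.max(np.abs(v)) / qmax, 1e-12)
    codes = np.clip(np.round(v / scale), -qmax, qmax).astype(np.int32)
    return codes, scale

w = np.random.default_rng(1).standard_normal(100)
codes, scale = quantize_symmetric(w, bits=8)
print(np.max(np.abs(codes * scale - w)))       # worst-case quantization error
```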
IV. OVERVIEW OF COMPUTE-IN-SRAM MACRO BASED ON MULTIPLICATION-FREE OPERATOR
A. Compute-in-SRAM macro based on µArrays/µChannels
Figure 3(a) shows the proposed design of the compute-in-SRAM macro for multiplication-free operator-based DNN inference. In the proposed design, an SRAM macro consists of µArrays and µChannels, as shown in the figure. Each µArray is dedicated to storing one weight channel. DNN weights are arranged across columns in a µArray, where each bit plane of the weights is arranged in a row. Therefore, an N-dimensional weight channel with m-bit precision weights will require m rows and N columns of SRAM cells in a µArray. Figure 3(b) shows the proposed 8T SRAM cell used for the in-SRAM processing of the operator. The extra transistors in the cell compared to a 6T cell decouple typical read/write operations from the within-cell product. The added transistors are selected by the row and column select lines (RL and CL) and operate on the product bit line (PL). The decoupling of read/write and product operations mitigates interference between the operations, reduces the impact of process variability, and allows operation in storage hold mode. Each µArray is augmented with a µChannel. µChannels convey digital inputs/outputs to/from the µArrays. µChannels are essentially low-overhead serial-in serial-out digital paths based on scan registers. If a weight filter has many channels, µChannels also allow stitching of µArrays so that inputs can be shared among the µArrays. Figure 3(d) shows µChannel-based column merging between two µArrays. If two columns are merged, inputs are passed to the top array directly from the bottom array, and the loading of input bits is bypassed on the top column; therefore, the overheads to load the input feature map are minimized. Figure 3(c) illustrates the input/weight mapping to the SRAM macro and the operation sequence. For the step(x)·abs(w) step in w ⊕ x, the step(x) vector is loaded on the µChannel and operated against the abs(w) rows of the µArray. For the step(w)·abs(x) step, bit planes of the abs(x) vector are sequentially loaded on the µChannel and operated against the step(w) row of the µArray.
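The row/column arrangement can be mimicked in a few lines. The sketch below (our illustration; the helper name is hypothetical) maps an N-dimensional, m-bit weight channel to the m bit-plane rows of a µArray plus a step(w) row:

```python
import numpy as np

def map_channel_to_uarray(weights, m_bits=8):
    """Map an N-dim weight channel to a uArray: row i stores bit plane i
    of abs(w) (columns are the N channel elements), and a separate row
    stores step(w) for the step(w)*abs(x) processing phase."""
    w = np.asarray(weights, dtype=np.int64)
    mag = np.abs(w)                      # assumed to fit in m_bits bits
    planes = np.stack([(mag >> i) & 1 for i in range(m_bits)])  # (m, N)
    step_row = (w >= 0).astype(np.int64)
    return planes, step_row

planes, step_row = map_channel_to_uarray([3, -5, 7, -2], m_bits=4)
print(planes.shape)                      # (4, 4): 4 bit-plane rows x 4 columns
```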
B. Operation cycles
In a µArray, to compute x ⊕ w, the operation proceeds by bit planes. If the left half computes the weight-input product, the right half digitizes. Both halves subsequently exchange their operating modes to process the weights stored in the right half. When evaluating the inner product terms step(x)·abs(w), the computations for the i-th weight bit plane are performed in one instruction cycle. At the start, the inverted logic values of the step(x) bit vector are applied to CL through the µChannels. PL is precharged. When the clock switches, tri-state MUXes float PL. The compute-in-memory controller activates the SRAM rows storing the i-th bit vector of w. In a column j, only if both w_{j,i} and step(x_j) are one does the corresponding PL segment discharge. To minimize the leakage power, we maintain the SRAM cells in their hold mode and dedicate additional clock time to discharge the PLs. The potential of all column lines is averaged on the sum lines to determine the net multiply-average (MAV), i.e., $\frac{1}{N}\sum_{j=1}^{N} w_{j,i} \times \mathrm{step}(x_j)$ for the input vector and weight bit plane i. Figure 3(e) shows the instruction sequence for the left half to compute the MAV, consisting of precharge, product, and average stages. Figure 4(b) shows the simulated transients in the left half to illustrate the execution sequence. step(w)·abs(x) is processed similarly by loading bit planes of abs(x) into the µChannels.
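Functionally, one instruction cycle is an AND between the applied step(x) bits and a stored weight bit plane, followed by charge averaging over the columns. A behavioral sketch under ideal charge sharing (our simplification; helper names are hypothetical):

```python
import numpy as np

def mav_cycle(weight_bitplane, step_x, v_pch=1.0):
    """One MAV instruction cycle: a PL segment discharges only when both
    the stored bit and step(x_j) are 1; charge averaging over the N
    columns then leaves V = V_PCH * (1 - sum(products) / N) on the SL."""
    product = weight_bitplane & step_x           # per-column 1-bit product
    return v_pch * (1.0 - product.sum() / product.size)

def correlate_bitplanes(planes, step_x, v_pch=1.0):
    """Digitally recombine per-bit-plane MAVs into sum(step(x)*abs(w))."""
    n = planes.shape[1]
    sums = [n * (1.0 - mav_cycle(p, step_x, v_pch) / v_pch) for p in planes]
    return sum(s * (1 << i) for i, s in enumerate(sums))

planes = np.array([[1, 0, 1], [0, 1, 1]])   # abs(w) = [1, 2, 3], bit plane 0 first
step_x = np.array([1, 1, 0])
print(correlate_bitplanes(planes, step_x))  # 1*1 + 1*2 + 0*3 = 3.0
```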
C. SRAM-Immersed SA-ADC
Since the MAV output at the sum line (SL) is charge-based, an analog-to-digital converter (ADC) is necessary to convert the output into digital bits. In Figure 3(a), the right half of the array implements an SRAM-immersed successive approximation (SA) data converter to digitize the output at the left sum line (SLL). Reference voltages for the SA-based data conversion are generated by exploiting the PL parasitics in the right half. Figure 5 describes the utilization of the parasitic capacitance of the product lines for the DAC implementation of the SA-ADC. The product lines of the right half are charged and discharged according to the SAR logic to produce the reference voltage at the right sum line (SLR). In the i-th SA iteration, 2^i capacitors are used to generate the reference voltage. Each half also uses a dummy PL of matching capacitance to complete the SA. Although the capacitance of SL affects the MAV range, its effect nullifies during the digitization since the capacitor is common mode to both ends of the comparator. Nonetheless, the limited voltage swing range due to SL's capacitance limits the number of parallel columns in a µArray that can be reliably operated. Figure 3(e) also shows the instruction cycles for the data conversion, consisting of precharge, average, compare, and SAR steps. One cycle of data conversion lasts two clock periods. For n-bit digitization, n clock cycles are needed. In a conversion cycle, at the start, the PLs in the right half are charged based on initialization or the SA output from the previous cycle. At the next clock transition, the PLs are merged to average their voltage. Next, a comparator compares the potentials at the left and right sum lines (SLL and SLR in Figure 3(a)). Subsequently, the SA logic operates on the comparator's output to update the digitization registers and produce the next precharge logic bits. Figure 4(b) shows the simulated transients in the right half and the interplay with the left half.

The comparator in our design must accommodate rail-to-rail input voltages at SLL and SLR. Therefore, in Figure 4(a), we use a cross-coupled comparator consisting of n-type and p-type modules. The n-type module receives inputs at NMOS transistors while the p-type receives them at PMOS. The coupling transistors that integrate both modules are highlighted in Figure 4(a). If the input voltages are closer to zero, the p-type instance dominates. Otherwise, if the input voltages are close to VDD, the n-type instance dominates. The connections to the coupling transistors in the figure ensure that the n-type or p-type instances can be overridden in the appropriate voltage range. Figure 4(c) shows the comparator output transients.
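The digitization loop behaves like a textbook SAR search in which the trial reference is formed by charge-averaging a binary-weighted number of right-half product lines. A behavioral sketch with ideal, matched capacitors (our simplification):

```python
def sa_adc(v_mav, n_bits=5, v_pch=1.0, n_lines=32):
    """Behavioral SAR conversion: each iteration precharges a number of
    right-half PLs proportional to the trial code, so charge averaging
    yields v_ref = v_pch * trial / n_lines; the comparator keeps or
    drops the trial bit, MSB first."""
    code = 0
    for bit in range(n_bits - 1, -1, -1):
        trial = code | (1 << bit)
        v_ref = v_pch * trial / n_lines      # charge-averaged reference
        if v_mav >= v_ref:                   # comparator decision (SLL vs SLR)
            code = trial
    return code

print(sa_adc(0.40))                          # ~0.40 * 32 = 12.8 -> code 12
```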
Figure 3: (a) µArray and µChannel architecture for w ⊕ x processing in a compute-in-SRAM macro. (b) 8T SRAM cell for in-memory processing. The SRAM cell has additional transistors, colored red, for input-weight correlation. (c) Mapping of the sign bits and the absolute values of weights and inputs on µArrays and µChannels. (d) Stitching of µArray columns by reconfiguration bits in µChannels. (e) Instruction cycles for in-SRAM processing.

Figure 4: (a) Cross-coupled comparator schematic. Operational transients for (b) in-SRAM processing of the multiplication-free operator and (c) the comparator.
Figure 5: The SRAM-immersed successive approximation (SA) ADC exploits bit line parasitics as a capacitive digital-to-analog converter (DAC).
D. Key advantages over the current compute-in-SRAM
Our multiplication-free inference framework using compute-in-SRAM µArrays and µChannels has many key advantages over the competitive designs [23]–[25].

First, the multiplication-free learning operator obviates digital-to-analog converters (DACs) in SRAM macros. Meanwhile, DACs incur considerable area/power in the current competitive designs [26], [27]. Although the overheads of DACs can be amortized by operating in parallel over many channels, the emerging trends in neural architectures, such as depth-wise convolutions in MobileNets [28], show that these opportunities may diminish. Comparatively, our DAC-free framework is much more efficient in handling even thin convolution layers; eliminating DACs allows fine-grained embedding of µChannels without considerable overheads. If the filter has many parallel channels, our architecture can also exploit input reuse opportunities by merging µChannels, as discussed earlier.

Secondly, the multiplication-free operator is also synergistic with the discussed bitplane-wise processing. The bitplane-wise processing followed in this work reduces the ADC's precision demand in each cycle by limiting the dynamic range of the MAV. Note that with bitplane-wise processing, for n column lines, the MAV varies over 2n levels. However, if such bitplane-wise processing were performed for the typical operator, an excessive O(n²) operating cycles would be needed for n-bit precision. Meanwhile, the multiplication-free operator only requires O(2n) cycles. Lastly, we have discussed unique opportunities to exploit SRAM array parasitics for the SRAM-immersed ADC. The proposed scheme obviates a major area overhead of the SA-ADC. Therefore, the compute-in-SRAM macro can maintain a high memory density.
Figure 6: Power performance of the SRAM array. (a) The product line discharge time and SRAM leakage power at varying hold voltage. (b) Distribution of dynamic and leakage energy for a µArray performing MAV and within-SRAM digitization.
V. POWER–PERFORMANCE CHARACTERISTICS OF COMPUTE-IN-SRAM MACRO
A. Simulation methodology
In this section, we discuss the power–performance characteristics of the compute-in-SRAM macro presented earlier. We use LeNET-5 for MNIST characterization as a running use-case to discuss various design and power optimization opportunities. The scalability of the proposed framework on more complex datasets and deeper networks will be discussed in the next section. The network weights for LeNET-5 are trained using TensorFlow. The convolutional layers in LeNET-5 and the next fully-connected layer are implemented using the multiplication-free operator. The last layer of LeNET-5 is implemented with the typical operator. Data transfer among the SRAM arrays and post-processing of the arrays' output, such as applying max-pooling, are simulated functionally. Compute-in-SRAM operations are simulated using HSPICE in 45 nm CMOS technology using predictive technology models [29]. Each µArray in the SRAM macro has 8 rows and 62 columns. A µArray is split into halves, where each half stores a weight channel by flattening it to one dimension. µArrays process 8-bit weights against 8-bit inputs. In each SA cycle in the µArrays, the MAV is digitized to a 5-bit output. If the flattened filter width is more than 31, it is partitioned and mapped onto more µArrays.
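For instance, the number of µArrays a layer occupies follows from the 31-column half-array budget. Here is a small helper illustrating this partitioning rule (our sketch; the function name is hypothetical):

```python
import math

def uarrays_needed(filter_shape, cols_per_half=31):
    """Each uArray half stores one flattened weight channel; channels
    wider than 31 columns are partitioned across several halves."""
    k_h, k_w, c_in, c_out = filter_shape
    flat_width = k_h * k_w * c_in                    # flattened channel length
    halves = math.ceil(flat_width / cols_per_half) * c_out
    return math.ceil(halves / 2)                     # two halves per 8x62 uArray

print(uarrays_needed((5, 5, 6, 16)))   # e.g., LeNET-5 conv2: 5x5x6 filters, 16 outputs
```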
B. Power–performance characterizations
We summarize the key power–performance characteristics of our compute-in-SRAM macros based on 45 nm CMOS simulations using the PDK [30]. In Figure 6(a), reducing the hold voltage of the SRAM cells reduces the leakage current through the cells; however, it also increases the PL discharge time during product computation. We chose the hold voltage to be 0.4 V since this represents an optimal balance of discharge time and leakage power. At this voltage, the worst-case discharge time (considering the slow-slow corner and 120 °C temperature) is 50 ps. The average leakage power of a µArray is 0.97 nW at the typical corner. Figure 6(b) shows the distribution of power among the various operations in a µArray. A µArray consumes µW-scale active power to perform the MAV operation while operating at 1 GHz. The MAV operation consumes 44% of the total energy, whereas digitization consumes 55% of the total energy, accounting for the capacitive DAC, comparator, and SAR logic operations. The leakage power of the SRAM accounts for <1% of the total. Since the ADC's power overheads are considerable, in the next section we discuss techniques for dynamic precision scaling to mitigate the overhead.

Figure 7: Surface plots for (a) accuracy for MNIST characterization, and (b) latency and (c) energy of compute-in-SRAM macros at varying precision of weights and ADC. (d) Parameters used in Equation 4.
C. Dynamic precision scaling
In Figure 7, we explore the design space of varying weight precision (W_P) and ADC precision (A_P) for predictions on the MNIST dataset. Figure 7(a) shows a surface plot of the prediction accuracy at varying W_P and A_P. The prediction accuracy improves with higher W_P and A_P. Figures 7(b-c) show surface plots of the latency and energy for a unit w ⊕ x operation on the µArray, respectively. The following determine the clock cycles (T) and average energy (E) for the unit operation:

$$T = W_P \times (1 + 2A_P) \qquad (4a)$$
$$E = W_P \times \Big( M C_{PL} V_{PCH}^2 + \sum_{i=0}^{A_P-1} \big( E_C + E_{SAR} + 2^i C_{PL} V_{PCH}^2 \big) \Big) \qquad (4b)$$

Here, C_PL is the product line (PL) capacitance, and V_PCH is the precharge voltage for processing. E_C and E_SAR are the average energies of a unit operation of the comparator and the SA register logic. M is the number of columns in each half of the µArray. Since the MAV operation requires all PLs to charge, the estimated dynamic energy is M·C_PL·V_PCH² per weight bit plane. Within a bit plane's processing, the SA-ADC operates the comparator and SAR logic A_P times. From our simulations in 45 nm CMOS, Figure 7(d) shows the various parameters in the equation. For C_PL, we add a 20% overhead over the transistor capacitance to account for the interconnect parasitics.

From the surface plots, W_P and A_P have different sensitivities to the latency and energy demand of computing in our scheme. Consider two iso-accuracy cases in the surface plot: Case-A (W_P = 8-bit, A_P = 2-bit) and Case-B (W_P = 4-bit, A_P = 5-bit). For both cases, the prediction accuracy is similar; Case-A has the maximum W_P and Case-B has the maximum A_P. For the iso-accuracy cases, Case-A has ∼10% lower latency than Case-B while requiring ∼30% more energy. Case-B has lower energy due to fewer MAV steps and, thereby, fewer MAV and ADC cycles. Case-A has lower latency since each MAV cycle is much shorter than in Case-B. Interestingly, the optimal W_P and A_P can be dynamically reconfigured in our design. W_P can be lowered by skipping low-significance weight bit planes. A_P can be lowered by limiting the successive approximation cycles of the SA-ADC to the desired precision.
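Eq. 4 can be evaluated directly to explore the Case-A/Case-B trade-off. In the sketch below, the component values C_PL, V_PCH, E_C, and E_SAR are placeholders, not the extracted 45 nm values of Figure 7(d):

```python
def cycles(w_p, a_p):
    # Eq. 4a: T = W_P * (1 + 2 * A_P) clock cycles per unit operation.
    return w_p * (1 + 2 * a_p)

def energy(w_p, a_p, m=31, c_pl=5e-15, v_pch=1.0, e_c=10e-15, e_sar=5e-15):
    # Eq. 4b: per bit plane, all M PLs charge once for the MAV, and the
    # SA-ADC runs A_P comparator/SAR steps, charging 2^i PLs in step i.
    e_mav = m * c_pl * v_pch ** 2
    e_adc = sum(e_c + e_sar + (2 ** i) * c_pl * v_pch ** 2 for i in range(a_p))
    return w_p * (e_mav + e_adc)

for name, (w_p, a_p) in {"Case-A": (8, 2), "Case-B": (4, 5)}.items():
    print(f"{name}: {cycles(w_p, a_p)} cycles, {energy(w_p, a_p):.3e} J")
```

With these placeholder values, Case-A takes 40 cycles versus 44 for Case-B, matching the ∼10% latency gap; the energy split depends on the actual extracted parameters.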
Figure 8: (a) MAV output levels vary due to process variability in the PL capacitors. (b) µArray columns with extremely varying PL capacitors are discarded by padding them with memory and column entries that do not contribute to the MAV numerator. (c) An on-chip scheme to estimate PL columns with extremely varying capacitance. (d) MAV crossover probability at varying PL capacitor mismatch and µArray sizes; mitigating the MAV crossover probability by discarding columns with high PL capacitor variability. (e) Estimating the comparator's variability by forcing it to the metastable point and calibrating tail currents to mitigate process variability.
D. Impact of process variability and on-chip calibration
In Figure 8(a), due to process variability among the PL capacitors, the MAV output levels follow a Gaussian distribution. The distribution of MAV output levels arises both from the variability of the PL capacitors and from the many combinations that can produce a given MAV level. If the MAV output levels cross over, the weight-input product from the µArrays can be erroneous. In our scheme, the accuracy of the MAVs is mainly affected by the PL capacitors' mismatch. The effect of global variability among the PL capacitors cancels out by bi-partitioning a µArray – generating MAVs in one half and reference voltages in the other half – so that the global variability of the PL capacitors becomes common mode. Considering a Gaussian distribution of MAV output levels, Figure 8(d) shows the probability of MAV crossover (P_F) in a µArray at varying capacitor mismatch and µArray size. P_F increases with higher PL capacitor variability as well as with an increasing number of columns in a µArray. Therefore, the maximum number of columns in a µArray (i.e., its parallelism) is constrained.

In Figure 8(c), we discuss an on-chip scheme to self-determine the usable column width of a µArray based on its process variability. In the figure, the strength of a PL capacitor is measured on-chip by repeatedly charging the sum line through it and counting the number of cycles needed to cross a set threshold. A smaller PL capacitor will require more charging cycles to cross the threshold. The most extreme PL capacitors are identified. If their process variability is more than an acceptable margin, these columns are not used [Figure 8(b)]. In our scheme, we avoid adding a switch to disconnect such columns, since it would considerably increase the area overhead of the on-chip calibration scheme. Note that a column disconnect switch and a memory cell to store the switch enable would need to be implemented for each column of the µArray. Instead, we lessen the effect of columns with extreme C_PL variation by writing one to all SRAM cells in the column and by applying a one on the CL input signal. Therefore, the column with extremely varying C_PL always discharges and only contributes to the charge averaging step. The sensitivity of the MAV to an extremely varying C_PL is thereby low since it only contributes to the denominator of the MAV, where its effect averages out against the other columns in the µArray. Based on this scheme, the right of Figure 8(d) shows the MAV crossover probability for 8×62 µArrays considering ±12% mismatch among the PL capacitors and at varying C_TH levels [Figure 8(b)]. By discarding only about 3% of the columns, the MAV crossover probability can be sufficiently suppressed.

Similarly, process variability in the comparator constrains the minimum precharge voltage and the maximum number of columns in a µArray. In Figure 8(e), we use an on-chip calibration scheme to mitigate the comparator's process variability. The scheme selects the N- and P-type counterparts of the comparator in turn. The comparator is first set to a known initial condition and then forced to a metastable point by shorting both inputs. By repeatedly resetting and setting the comparator, its bias can be estimated from the output bit sequence. An unbiased comparator should have an equal probability of 0/1 under thermal noise. The tail currents in the left and right halves of the comparator can be adjusted to minimize the comparator's bias. The calibrating transistors for the comparator are shown in Figure 4(a). A counter monitors the comparator's output and adds calibration transistors to the left or right half to minimize the bias in the comparator. In the right of Figure 8(e), using a 2-bit calibration, the comparator's mismatch can be reduced to ±12 mV from the initial ±45 mV.
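The crossover probability of Figure 8(d) can be approximated by a quick Monte-Carlo over mismatched PL capacitors. The sketch below uses a simplified mismatch model (independent Gaussian variation per column) that is our assumption, not the paper's extracted statistics:

```python
import numpy as np

def crossover_prob(n_cols=31, sigma=0.04, trials=20000, seed=0):
    """Probability that the MAV level for k+1 discharged columns exceeds
    a level for k discharged columns under per-column PL-capacitor
    mismatch (V = V_PCH * sum(C_charged) / sum(C_all))."""
    rng = np.random.default_rng(seed)
    fails = 0
    for _ in range(trials):
        c = 1.0 + sigma * rng.standard_normal(n_cols)   # mismatched PL caps
        k = int(rng.integers(0, n_cols))                # compare levels k, k+1
        sub_a = rng.choice(n_cols, size=k, replace=False)
        sub_b = rng.choice(n_cols, size=k + 1, replace=False)
        v_a = 1.0 - c[sub_a].sum() / c.sum()            # k columns discharged
        v_b = 1.0 - c[sub_b].sum() / c.sum()            # k+1 columns discharged
        fails += v_b >= v_a                             # adjacent levels crossed
    return fails / trials

print(crossover_prob())
```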
E. Comparisons to current art
Table II compares our results against the state-of-the-art on different data sets. For the various datasets, our network configuration was discussed in Sec. II.
Table II: Comparison against the state-of-the-art (*CIFAR100 accuracy extracted using MobileNetV2)

Metric | Ours | [23] (ISSCC'20) | [24] (ISSCC'20) | [25] (ISSCC'20) | [31] (JSSC'20)
Tech (nm) | 45 | 28 | 65 | 7 | 65
Efficiency (TOPS/W) | 105 (8×62 µArray), 84 (8×30 µArray) | 7 | 2.96 | 321 | 671.5
We achieve better prediction accuracy than the competitive approaches by processing with multibit precision weights and inputs, whereas many prior approaches were limited to binarized weights. Our 8×62 SRAM macro, which requires a 5-bit ADC, achieves ∼105 TOPS/W; the 8×30 SRAM macro, which requires a 4-bit ADC, achieves ∼84 TOPS/W. SRAM macros that require lower ADC precision are more tolerant of process variability but have lower TOPS/W as well. Our implementation achieves higher energy efficiency and performance by co-adapting the DNN operators with SRAM's implementation and operational constraints. Notably, our approach is DAC-free. Our bitplane-wise processing also lowers the necessary ADC precision. The area overheads of our ADC are minimized by exploiting the bit line capacitances. Due to these cross-cutting design transformations, from computing primitives to micro-architecture, our 8×62 µArray-based platform achieves 15× higher macro-level energy efficiency than [23], even though the latter is in 28 nm CMOS. Compared to the 65 nm design in [24], it achieves 35× better TOPS/W. Although our efficiency is lower than [31], note that the latter operates only on 1-bit inputs and weights. The accuracy of 1-bit processing in [31] suffers on more complex processing tasks, such as CIFAR-10 in the table.
VI. SYNERGISTIC INTEGRATION OF DIGITAL AND COMPUTE-IN-MEMORY PROCESSING FOR DNN

Compute-in-memory offers immense energy efficiency benefits over digital processing by eliminating weight movements. The mixed-signal processing of compute-in-memory also obviates the processing overheads of adders by exploiting physics (Kirchhoff's law) to sum the operands over a wire. Note that additions are a significant portion of the total workload in digital DNN inference. However, compute-in-memory is also inherently limited to weight-stationary processing only. The advantages of stationary weight processing diminish if the filter has fewer channels or if the input has smaller dimensions. Compute-in-memory is also more area-expensive than digital processing, which can leverage denser memory modules such as DRAM. The memory cells in compute-in-memory are larger to support both storage and computation within the same physical structure. Additionally, multibit precision DNN inference is complex using compute-in-memory. Therefore, many prior works [31], [32] utilize binary-weighted neural networks, which, however, constrain the learning space and reduce the prediction accuracy. The deep in-memory architecture (DIMA) in [33] considers multibit precision in-memory inference; however, the implementation suffers from an exponential reduction in throughput with increasing precision. Meanwhile, in this work, we overcame this critical challenge using a novel co-design approach by adapting the DNN operator to the in-memory processing constraints. In our multiplication-free compute-in-memory framework, the parametric learning space expands, yet the implementation complexities are equivalent to those of a binarized neural network. Even so, the accuracy of multiplication-free operators is somewhat lower than that of the typical deep learning operator due to the non-differentiability of the gradients.

Considering the above trade-offs, we find that the key to balancing scalability with energy efficiency in DNN inference is a synergistic integration of compute-in-memory with digital processing. In Figure 9, we discuss insights towards this. The figure shows a layer-wise distribution of weights and operations for the networks used for MNIST, CIFAR10, and CIFAR100. The network configurations were discussed earlier in Sec. II. Note that as the processing propagates through the networks, the weights per layer increase, but the number of operations per weight reduces. This is, in fact, typical of any DNN due to shrinking input feature map dimensions, which reduces the weight reuse opportunities. Since the starting layers have much fewer parameters but much higher weight reuse, they are quite suited for compute-in-memory. The latter layers require many more parameters but have low weight reuse. Therefore, digital processing can minimize the excessive storage overheads of these layers with denser storage.

Using this strategy, the figure also shows a mixed mapping configuration that layer-wise combines compute-in-memory and digital processing. For example, in the mixed implementation of MobileNetV2, the feature extraction layers with high weight reuse are mapped to compute-in-memory using the 8-bit multiplication-free operator. Regression layers and others with low weight reuse are mapped to digital using the typical operator. Remarkably, based on the synergistic mapping strategy, compute-in-memory only stores about a third of the total weights, yet performs more than 85% of the total operations. Therefore, the synergistic mapping can optimally translate compute-in-memory's energy-efficiency advantages to the overall system-level efficiency and yet limit its area overheads. The synergistic mapping also improves the prediction accuracy, since only the critical layers are implemented with the energy-expensive typical operator while most of the remaining network is operated with multiplication-free operators. The figure also shows similar mapping configurations for the MNIST and CIFAR10 prediction networks. In the figure, we also project the average macro-level energy efficiency in TOPS/W. For digital processing, we use 2.8 TOPS/W from [34]. For multiplication-free compute-in-memory processing, we use 105 TOPS/W from Table II. We note, however, that macro-level energy efficiency doesn't necessarily translate to system-level energy efficiency, where overheads due to routing and control flow must also be accounted for. We plan to consider this characterization in future work.
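The mapping heuristic of Figure 9 essentially thresholds on weight reuse (operations per parameter). A compact sketch of such an assignment rule follows (the threshold and the layer statistics below are illustrative; the actual assignments in Figure 9 are chosen per network and also weigh accuracy):

```python
def assign_layers(layer_stats, reuse_threshold=10.0):
    """Assign layers with high weight reuse (ops per parameter) to the
    multiplication-free compute-in-memory macros ('MF'), the rest to the
    digital engine with the regular operator ('R')."""
    return {name: "MF" if ops / params >= reuse_threshold else "R"
            for name, (params, ops) in layer_stats.items()}

# Rough LeNET-5/MNIST statistics in the spirit of Figure 9(a)
# (ops-per-param ratios of 625, 100, 1, 1).
lenet = {"Conv1": (150, 93_750), "Conv2": (2_400, 240_000),
         "FC1": (48_000, 48_000), "FC2": (840, 840)}
print(assign_layers(lenet))  # {'Conv1': 'MF', 'Conv2': 'MF', 'FC1': 'R', 'FC2': 'R'}
```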
Figure 9: Synergistic mixing of compute-in-memory and digital processing for (a) MNIST, (b) CIFAR10, and (c) CIFAR100.

(a) Network configuration for MNIST:

Layer | % of Params | % of Operations | Ops per Param | Digital | In-mem | Mixed
Conv1 | <1 | 84.28 | 625 | R | MF | MF
Conv2 | 3.08 | 6.7 | 100 | R | MF | MF
FC1 | 96 | 8.63 | 1 | R | MF | MF
FC2 | <1 | <1 | 1 | R | MF | R
Test Accuracy (%) | | | | 99 | 96 | 98.6
Avg. TOPS/W | | | | 2.8 | 105 | 103.97

(b) Network configuration for CIFAR10:

Layer | % of Params | % of Operations | Ops per Param | Digital | In-mem | Mixed
Conv1 | <1 | 67.56 | 900 | R | MF | MF
Conv2 | 2.09 | 16.92 | 784 | R | MF | MF
Conv3 | 4.18 | 8.43 | 144 | R | MF | MF
Conv4 | 8.35 | 2.74 | 100 | R | MF | MF
Conv5 | 16.7 | 1.38 | 49 | R | MF | MF
FC1 | 67 | 2.43 | 1 | R | MF | R
FC2 | <1 | <1 | 1 | R | MF | R
Test Accuracy (%) | | | | 91 | 79 | 90.2
Avg. TOPS/W | | | | 2.8 | 105 | 100.91

(c) Network configuration for CIFAR100:

Layer | % of Params | % of Operations | Ops per Param | Digital | In-mem | Mixed
Conv3×3 | <1 | 3.9 | 1024 | R | MF | R
BN1 | <1 | 8.2 | 2824 | R | MF | MF
BN2 | <1 | 21 | 1536 | R | MF | MF
BN3 | 1 | 16.7 | 576 | R | MF | MF
BN4 | 3.2 | 10 | 192 | R | MF | MF
BN5 | 8 | 13.7 | 144 | R | MF | MF
BN6 | 19 | 16.8 | 144 | R | MF | MF
BN7 | 19 | 8.3 | 48 | R | MF | MF
Conv3×3 | 17 | <1 | 48 | R | MF | R
FC1 | 28 | <1 | 1 | R | MF | R
FC2 | <1 | <1 | 1 | R | MF | R
Test Accuracy (%) | | | | 68 | 56 | 66.9
Avg. TOPS/W | | | | 2.8 | 105 | 98

VII. CONCLUSION

We presented a compute-in-SRAM macro based on a multiplication-free learning operator. The macro comprises low area/power overhead µArrays and µChannels. Operations in the macro are DAC-free. µArrays exploit bit line parasitics for low-overhead memory-immersed data conversion. We characterized the accuracy of our scheme on the MNIST, CIFAR10, and CIFAR100 data sets. Other researchers have successfully used multiplication-free operator-based structures on various other datasets [14], [15]. On an equivalent network configuration, our framework has 1.8× lower error on MNIST and 1.5× lower error on CIFAR10 compared to the binarized neural network. At 8-bit precision, our 8×62 µArray achieves ∼105 TOPS/W, which is significantly better than the current compute-in-SRAM designs at matching precision. Our platform also offers several runtime control knobs to dynamically trade off accuracy, energy, and latency. For example, the weight precision can be dynamically modulated to reduce prediction latency, and the ADC's precision can be controlled to reduce energy. Additionally, for deeper neural networks, we have discussed mapping configurations where high weight-reuse layers can be implemented in our compute-in-SRAM framework and parameter-intensive layers (such as fully-connected) can be implemented through digital accelerators. Our synergistic mapping strategy, combining both multiplication-free and typical operators, is promising for achieving both high energy efficiency and area efficiency in operating deeper neural networks.
REFERENCES

[1] K. Makantasis, K. Karantzalos, A. Doulamis, and N. Doulamis, "Deep supervised learning for hyperspectral data classification through convolutional neural networks," in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2015, pp. 4959–4962.
[2] J. Liu, Y. Deng, T. Bai, Z. Wei, and C. Huang, "Targeting ultimate accuracy: Face recognition via deep embedding," arXiv preprint arXiv:1506.07310, 2015.
[3] Y. Wei, Q. Yuan, H. Shen, and L. Zhang, "Boosting the accuracy of multispectral image pansharpening by learning a deep residual network," IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 10, pp. 1795–1799, 2017.
[4] J. Cho, K. Lee, E. Shin, G. Choy, and S. Do, "How much data is needed to train a medical image deep learning system to achieve necessary high accuracy?" arXiv preprint arXiv:1511.06348, 2015.
[5] H. Qassim, D. Feinzimer, and A. Verma, "Residual squeeze vgg16," arXiv preprint arXiv:1705.03004, 2017.
[6] T. Carvalho, E. R. De Rezende, M. T. Alves, F. K. Balieiro, and R. B. Sovat, "Exposing computer generated images by eye's region classification via transfer learning of vgg19 cnn," IEEE, 2017, pp. 866–870.
[7] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[8] A. Biswas and A. P. Chandrakasan, "Conv-sram: An energy-efficient sram with in-memory dot-product computation for low-power convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 54, no. 1, pp. 217–230, 2019.
[9] M. Kang, S. Lim, S. Gonugondla, and N. R. Shanbhag, "An in-memory vlsi architecture for convolutional neural networks," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 8, no. 3, pp. 494–505, 2018.
[10] Z. Jiang, S. Yin, J.-s. Seo, and M. Seok, "Xnor-sram: In-bitcell computing sram macro based on resistive computing mechanism," in Proceedings of the 2019 Great Lakes Symposium on VLSI. ACM, 2019, pp. 417–422.
[11] J. Zhang, Z. Wang, and N. Verma, "In-memory computation of a machine-learning classifier in a standard 6t sram array," IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 915–924, 2017.
[12] A. Amaravati, S. B. Nasir, J. Ting, I. Yoon, and A. Raychowdhury, "A 55-nm, 1.0–0.4 v, 1.25-pj/mac time-domain mixed-signal neuromorphic accelerator with stochastic synapses for reinforcement learning in autonomous mobile robots," IEEE Journal of Solid-State Circuits, vol. 54, no. 1, pp. 75–87, 2018.
[13] A. Sayal, S. T. Nibhanupudi, S. Fathima, and J. P. Kulkarni, "A 12.08-tops/w all-digital time-domain cnn engine using bi-directional memory delay lines for energy efficient edge computing," IEEE Journal of Solid-State Circuits, 2019.
[14] H. Pan, D. Badawi, X. Zhang, and A. E. Cetin, "Additive neural network for forest fire detection," Signal, Image and Video Processing, pp. 1–8, 2019.
[15] T. Ergen, A. H. Mirza, and S. S. Kozat, "Energy-efficient lstm networks for online learning," IEEE Transactions on Neural Networks and Learning Systems, 2019.
[16] A. Afrasiyabi, D. Badawi, B. Nasir, O. Yildi, F. T. Y. Vural, and A. E. Çetin, "Non-euclidean vector product for neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 6862–6866.
[17] C. E. Akbaş, A. Bozkurt, A. E. Çetin, R. Çetin-Atalay, and A. Üner, "Multiplication-free neural networks," in Signal Processing and Communications Applications Conference (SIU). IEEE, 2015, pp. 2416–2418.
[18] Y. LeCun, C. Cortes, and C. Burges, "Mnist handwritten digit database," ATT Labs. [Online]. Available: http://yann.lecun.com/exdb/mnist
[19] A. Krizhevsky, "The CIFAR-10 dataset." [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
[20] A. Krizhevsky, "The CIFAR-100 dataset." [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
[21] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[22] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "Mobilenetv2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[23] J. Su, X. Si, Y. Chou, T. Chang, W. Huang, Y. Tu, R. Liu, P. Lu, T. Liu, J. Wang, Z. Zhang, H. Jiang, S. Huang, C. Lo, R. Liu, C. Hsieh, K. Tang, S. Sheu, S. Li, H. Lee, S. Chang, S. Yu, and M. Chang, "15.2 a 28nm 64kb inference-training two-way transpose multibit 6t sram compute-in-memory macro for ai edge chips," in IEEE International Solid-State Circuits Conference (ISSCC), 2020, pp. 240–242.
[24] J. Yue, Z. Yuan, X. Feng, Y. He, Z. Zhang, X. Si, R. Liu, M. Chang, X. Li, H. Yang, and Y. Liu, "14.3 a 65nm computing-in-memory-based cnn processor with 2.9-to-35.8tops/w system energy efficiency using dynamic-sparsity performance-scaling architecture and energy-efficient inter/intra-macro data reuse," in IEEE International Solid-State Circuits Conference (ISSCC), 2020, pp. 234–236.
[25] Q. Dong, M. E. Sinangil, B. Erbagci, D. Sun, W. Khwa, H. Liao, Y. Wang, and J. Chang, "15.3 a 351tops/w and 372.4gops compute-in-memory sram macro in 7nm finfet cmos for machine-learning applications," in IEEE International Solid-State Circuits Conference (ISSCC), 2020, pp. 242–244.
[26] A. Biswas and A. P. Chandrakasan, "Conv-sram: An energy-efficient sram with in-memory dot-product computation for low-power convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 54, no. 1, pp. 217–230, 2019.
[27] S. Nasrin, S. Ramakrishna, T. Tulabandhula, and A. R. Trivedi, "Supported-binarynet: Bitcell array-based weight supports for dynamic accuracy-latency trade-offs in sram-based binarized neural network," arXiv preprint arXiv:1911.08518, 2019.
[28] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[29] W. Zhao and Y. Cao, "New generation of predictive technology model for sub-45nm design exploration," in Proceedings of the 7th International Symposium on Quality Electronic Design, ser. ISQED '06. Washington, DC, USA: IEEE Computer Society, 2006, pp. 585–590. [Online]. Available: http://dx.doi.org/10.1109/ISQED.2006.91
[30] J. E. Stine, I. Castellanos, M. Wood, J. Henson, F. Love, W. R. Davis, P. D. Franzon, M. Bucher, S. Basavarajaiah, J. Oh, and R. Jenkal, "Freepdk: An open-source variation-aware design kit," in Proceedings of the 2007 IEEE International Conference on Microelectronic Systems Education, ser. MSE '07. USA: IEEE Computer Society, 2007, pp. 173–174. [Online]. Available: https://doi.org/10.1109/MSE.2007.44
[31] Z. Jiang, S. Yin, J. Seo, and M. Seok, "C3sram: An in-memory-computing sram macro based on robust capacitive coupling computing mechanism," IEEE Journal of Solid-State Circuits, vol. 55, no. 7, pp. 1888–1897, 2020.
[32] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.
[33] M. Kang, S. Gonugondla, and N. R. Shanbhag, The Deep In-Memory Architecture (DIMA). Cham: Springer International Publishing, 2020, pp. 7–47. [Online]. Available: https://doi.org/10.1007/978-3-030-35971-3_2
[34] Y. Toyama, K. Yoshioka, K. Ban, S. Maya, A. Sai, and K. Onizuka, "An 8 bit 12.4 tops/w phase-domain mac circuit for energy-constrained deep learning accelerators,"