A Low Power In-Memory Multiplication and Accumulation Array with Modified Radix-4 Input and Canonical Signed Digit Weights
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I
Rui Xiao, Student Member, IEEE, Kejie Huang, Senior Member, IEEE, Yewei Zhang, Student Member, IEEE, and Haibin Shen
Abstract—A mass of data transfer between the processing and storage units has been the leading bottleneck in modern von-Neumann computing systems, especially when used for Artificial Intelligence (AI) tasks. Computing-in-Memory (CIM) has shown great potential to reduce both latency and power consumption. However, conventional analog CIM schemes suffer from reliability issues, which may significantly degrade the accuracy of the computation. Recently, CIM schemes with digitized input data and weights have been proposed for highly reliable computing. However, the properties of the digital memory and input data are not fully utilized. This paper presents a novel low-power CIM scheme to further reduce the power consumption by using a Modified Radix-4 (M-RD4) booth algorithm at the input and a Modified Canonical Signed Digit (M-CSD) representation for the network weights. The simulation results show that M-RD4 and M-CSD reduce the ratio of multiplications (x) by 78.5% on LeNet and 80.2% on AlexNet, and improve the computing efficiency by 41.6% on average. The computing-power rate at fixed-point 8-bit precision is 60.68 TOPS/s/W.

Index Terms—Non-volatile Memory, In-memory Computing, Charge Redistribution Integrator, Radix-4 Booth Recoding, Canonical-Signed-Digit.
I. INTRODUCTION

Along with the unceasing development of computer technology, Artificial Intelligence (AI) has been widely applied in various fields to perform specific tasks, such as transportation, education, healthcare, security, finance, etc. [1]. With the support of massive data and high-performance hardware, the Deep Neural Network (DNN), a particular kind of machine learning, achieves excellent power and flexibility by learning to represent the world as a nested hierarchy of concepts [2]. However, due to the limited on-chip memory and memory bandwidth, a mass of intermediate data generated by DNNs has to be transferred frequently between the separated computing and storage units in conventional von-Neumann machines, resulting in a tremendous amount of power consumption and propagation delay, which is known as the "Von Neumann bottleneck" [3]. Inspired by the cranial nerve structure and the information processing mechanisms uncovered by brain science research, artificial intelligence systems are breaking through the
conventional computing architecture and promoting the next generation of computing paradigms. The computing-in-memory [4] scheme is formed by a large number of interconnected low-power computing units (neurons) and re-configurable storage units (synapses), which can perform the Multiplication-and-Accumulation (MAC) operations in the memory to significantly reduce the data movement. The emerging Resistive Random Access Memory (RRAM) is one of the best candidates for CIM design [5]-[7]. The resistance value of the memory can be used for weight storage and MAC operation. The memory cells are organized into crossbar arrays for high-density storage, low power consumption, and fully parallel computing [4].

Analog computing with multi-level resistive memory is widely used to achieve massively parallel low-power computing [8]-[10]. However, data storage and transmission between computing cores require digital signals since analog signals are sensitive to noise. Most architectures require Digital-to-Analog Converters (DACs) and Analog-to-Digital Converters (ADCs) at the interface, which consume large area and high power. Moreover, most of them have overlooked the defects of the resistive Non-volatile Memory (NVM), such as nonlinearity, stochasticity, asymmetry, etc. [11]. To address the issues mentioned earlier, [12] proposes to use multiple binary RRAMs to emulate one synapse. Moreover, the DACs are also moved to the neurons to reduce the high driving power and the non-linearity caused by the analog input voltage.

The authors are with the College of Information Science & Electronic Engineering, Zhejiang University, 38 Zheda Road, Hangzhou, China, 310027, email: [email protected]; [email protected]; [email protected]; shen [email protected]. K. Huang and H. Shen are also with Zhejiang Lab, Building 10, China Artificial Intelligence Town, 1818 Wenyi West Road, Hangzhou City, Zhejiang Province, China.
However, high-performance amplifiers are used to achieve high computing speed and 8-bit resolution, resulting in high power consumption. The high-power amplifiers are removed in [13] by regulating the voltage before the passive integral neurons. However, 2's complement code is used in the synapses, resulting in balanced '1's and '0's. Moreover, the uncertainty of the memory resistance in a MAC array with 2's complement code may cause a big jump between the most negative and the most positive values. Differential weights with Modified Canonical Signed Digit (M-CSD) codes are proposed to leverage the unbalanced '1's and '0's in the weights to address the above issues. A Modified Radix-4 (M-RD4) booth algorithm is also used to further reduce the percentage of '1's in the computation. The simulation results show that the total power consumption is reduced by more than 41.55%. The performance-power ratio is 57.53 TOPS/s/W at 8-bit precision. The main contributions of this paper include:
1) The inputs are encoded with M-RD4 codes; the number of '1's is halved since the encoding length of radix-4 booth codes is only half that of binary codes.
2) The weights are stored differentially with the M-CSD code to significantly reduce the number of '1's, which completes the MAC operation with 41.55% less computation power.
3) A differential charge redistribution passive integrator and a Successive Approximation Register (SAR) ADC are proposed to enable in-memory computing with the M-RD4 and M-CSD algorithms.
The rest of the paper is organized as follows: Section II introduces the related works on resistive non-volatile-memory based in-memory computing circuits and architectures. Section III discusses the detailed design of the proposed CIM core, including M-RD4, M-CSD, the integration scheme to perform MAC operations, and the corresponding circuits. Section IV provides the circuit-level and system-level simulation results.
Finally, the conclusion is drawn in Section V.

II. RELATED WORKS
Computing near memory and computing in memory are the two typical schemes to shorten the distance between the processing and storage units. Computing-near-memory designs such as IBM TrueNorth [14] and Intel Loihi [15] can only access the memory one row at a time, so the processing speed is limited. Besides, excessive charging and discharging of the bit lines costs high power. The CIM scheme can access the whole array simultaneously to perform the MAC operations, thus significantly reducing the latency and power for computing and memory access. Resistive NVMs such as the memristor [16], Phase Change Memory (PCM) [5], [17], and RRAM [10] are potential candidates for high-density CIM schemes. Since all resistive NVMs have high write power, the network weights are usually trained offline on a server and then sent to the CIM cores for inference. CIM schemes can be divided into two groups: CIM with analog memory and input signals, and CIM with digitized memory and input signals.
A. Analog Computing-in-Memory
In the analog CIM schemes, the multiplication is usually achieved by multiplying the conductance of the multi-level memory cell and the analog input voltage based on Ohm's law [18], [19], which outputs a current. The accumulation is usually done by converging the output currents of different multiplications based on Kirchhoff's Current Law (KCL). Analog signals are difficult to preserve and sensitive to noise; therefore, the converged current has to be converted to a voltage signal for analog-to-digital conversion. The digital inputs likewise have to be converted to analog signals for analog computing. The DACs and ADCs consume enormous power and area, significantly limiting the efficiency of the scheme. A. Shafiee et al. [20] proposed the RRAM-based ISAAC scheme to perform 16-bit fixed-point MAC operations for CNN inference, where eight 4-level RRAM cells are used to store one 16-bit weight. As shown in Fig. 1, it takes 16 cycles to perform the 16-bit digital-to-analog conversion with 1-bit DACs.
Fig. 1. The CIM core of the ISAAC architecture. The S/H voltage of each column is quantized by an individual 8-bit ADC in each cycle, and the quantized results are then shifted and added to produce the 16-bit output of the MAC operations.
Fig. 2. The CIM core of the MBRAI architecture. The n-bit input data are sequentially computed in the integral multiplier and weighted at the output neurons instead of using 16-bit high-cost DACs in one cycle.

In each cycle, the analog outputs are converted to digital signals by eight 8-bit ADCs, which are then shifted and added to generate the 16-bit output. X. Qiao et al. [21] proposed AtomLayer to support 16-bit fixed-point CNN training and inference. AtomLayer achieves the ability of training by processing one network layer at a time. Besides, the data are reused to improve the efficiency. However, there are still some shortcomings in ISAAC and AtomLayer:
1) The S/H structure without an amplifier will seriously affect the analog computation accuracy due to the varying hold voltage.
2) The accuracy after the ADC is far less than 8 bits due to the nonlinearity of the multi-level RRAM cells and the loss of precision.
3) The shift-and-add operation will further reduce the accuracy because the ADC's quantization error is magnified after the shift operation.
4) The eight ADCs and the 16 cycles of conversion for a 16-bit MAC operation lead to high power consumption.
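The third shortcoming can be made concrete with a small numeric sketch (the step size and values below are our own illustration, not figures from ISAAC): quantizing a partial sum before a left shift scales the quantization error by the shift factor.

```python
# Illustrative sketch (hypothetical values): shifting a quantized partial sum
# scales its quantization error by the shift factor.

def quantize(x, step):
    """Round x to the nearest multiple of the quantization step."""
    return round(x / step) * step

step = 1.0                     # assumed ADC quantization step
partial = 37.3                 # hypothetical analog partial sum

q = quantize(partial, step)    # quantization error is at most step / 2
err_before = abs(q - partial)

shift = 8                      # e.g., weighting a partial result by 2**3 = 8
err_after = abs(q * shift - partial * shift)

assert err_after > err_before
assert abs(err_after - shift * err_before) < 1e-9
```

The error of the shifted result is exactly the shift factor times the original quantization error, so high-order partial sums dominate the overall inaccuracy.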
B. Digitized RRAM based CIM Cores
Several single-level RRAM based CIM cores have been proposed to avoid the nonlinearity issue of multi-level RRAM. M. Courbariaux et al. [22] and M. Rastegari et al. [23] used binary weights and 1-bit inputs for recognition tasks on the MNIST and CIFAR-10 datasets. However, 1-bit weights and 1-bit inputs lose a lot of information when applied to large networks. C. Xue et al. [24] proposed a BL-IN-OUT (BLIOMC) scheme with Scrambled 2's Complement Weight Mapping (S2CWM), which exploits 4-bit inputs via a 4-level read voltage and 4-bit weights represented by four single-level RRAM cells. The Dual-bit-Small-Offset Current-mode Sense Amplifier (DbSO-CSA) with two I_REF levels works as a 2-bit ADC. It achieves an efficiency of 28.9 TOPS/s/W at 4-bit input and 4-bit weight. However, the structure of this design limits its application to some extent:
1) The 4-level read voltage at the input will vary the memory resistance during the read operation, which affects the accuracy of the MAC operation.
2) The 2-bit sensing amplifier greatly limits the total precision of the MAC output. Adding multiple outputs will average the quantization error and noise, but the increase in precision is halved.
3) It needs multiple cycles to finish the vector-matrix operation, which significantly reduces the computation speed.
To address the issues mentioned above and further improve the energy efficiency, S. Zhang et al. [12] proposed a Multiple Binary RRAM with Active Integrator (MBRAI) core architecture. As shown in Fig. 2, multiple binary-RRAM cells are used to represent an 8-bit weight instead of a multi-level RRAM cell. The core uses binary code at the input instead of a time signal or analog signal. The n-bit data are sequentially computed in the integral multiplier and weighted at the output neurons. However, the amplifiers in the neurons are power-hungry components needed to achieve a wide dynamic range, and they consume more than 95% of the power in the scheme.
The computing efficiency of this CIM core is limited to 0.61 TOPS/s/W. To address this issue, Y. Zhang et al. [13] proposed an 8-bit In-Resistive-Memory Computing Core with Regulated Passive Neuron and Bit Line Weight Mapping (RPN & BLM). RPN & BLM uses passive integral circuits without amplifiers to decrease the power consumption. The regulators in the bit lines are used to improve the linearity of the integration process.

C. Differential Weight based CIM Cores
The uncertainty of the memory resistance in a MAC array with 2's complement code may cause a big jump between the most negative value (i.e., 8'b10000000) and the most positive value (i.e., 8'b01111111). Differential weights [6], [25], [26] can be used to avoid this issue. Recently, P. Yao et al. [27] proposed a memristor-based hardware system with reliable multi-level conductance states for a five-layer mCNN for MNIST digit image recognition (MBHS-mCNN). As shown in Fig. 3, the neural processing unit consists of multiple memristor tiles, and each tile contains four memristor cores.
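The size of this jump can be illustrated with a toy sketch (our own example, not from [27]): in 8-bit 2's complement, a read error on the MSB cell alone moves the decoded value by half of the full range.

```python
# Toy illustration: in 8-bit 2's complement, the MSB carries half the range,
# so one erroneous MSB read swings the decoded value across the whole scale.

def to_int8(bits):
    """Decode an 8-character bit string as a 2's complement integer."""
    v = int(bits, 2)
    return v - 256 if v >= 128 else v

assert to_int8("01111111") == 127      # the most positive value
assert to_int8("10000000") == -128     # the most negative value
# A single-bit error on the MSB cell shifts the value by half the range:
assert abs(to_int8("11111111") - to_int8("01111111")) == 128
```

In a differential code, by contrast, an erroneous bit changes the value only by its own positional weight, which is why differential weights avoid this failure mode.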
Fig. 3. The architecture of the memristor-based neural processing unit and the relevant circuit modules.
The MUX controller is used to select the positive and negative computing results. However, there are still some weaknesses in this scheme:
1) It requires 32 analog-to-digital conversions and shift-and-add operations to finish one MAC operation, which consume about 92.14% of the energy in the system.
2) The quantization error is amplified by the shift-and-add operation, and thus it cannot achieve the desired precision.

III. PROPOSED IN-MEMORY COMPUTING CORE
In this paper, we propose a booth-encoded differential core for low-power parallel MAC operations. The M-RD4 and M-CSD algorithms are applied to the inputs and weights, respectively, to reduce the power consumption of MAC operations. The overall structure of the proposed scheme is shown in Fig. 4, which consists of six components: the M-RD4 generator, differential RRAM array, regulator, integrator, controller, and differential ADC. For simplicity, only 8 x 8 memory cells are shown, while N x N memory cells are used in the real application. Each memory cell is comprised of a 1R1T pair. The binary inputs are first converted to the stimulus of the CIM core by the M-RD4 booth algorithm. The stimulus turns on the transistor of the 1R1T cell to generate a current that passes through the RRAM cells and is accumulated at the integrators, enabling massively parallel MAC computation. A regulator [13] is used before the integrator to minimize the voltage variation caused by channel-length modulation during the integration. Finally, the analog voltage at the neuron is converted to digital signals using the charge
redistribution differential SAR ADC. Only one 8-bit ADC is required by eight integrators for high density and low power. The details of each block are introduced in the rest of this section.

Fig. 4. The overall architecture of our proposed CIM core.
A. Modified Radix-4 Booth Code
Unsigned fixed-point data can be used as the CIM core input because there is no negative data after the ReLU activation function. The input data can be expressed as an n-bit unsigned fixed-point number X_k:

X_k = 2^{n-1} x_{k,n-1} + ... + 2^i x_{k,i} + ... + 2^0 x_{k,0}   (1)

Radix-4 booth code [28] is a modified booth code used for high-speed and low-power computing, and is widely used in multiplier design to halve the number of partial products. The algorithm for recoding an n-bit binary number X into a radix-4 booth number Z is as follows. First, append a '0' to the right of the Least Significant Bit (LSB) of X, and then extend the sign bit by one position if necessary to ensure that n is even. After that, every three binary bits (with 1-bit overlap) are encoded as one radix-4 digit from the LSB to the Most Significant Bit (MSB). The eight cases of the radix-4 code are tabulated in Table I. By using the radix-4 algorithm, the length of the input code is halved (e.g., 01111111 is encoded as 200\bar{1}). The number of '1's can also be reduced compared with binary codes, which means the power consumption can be reduced since more multiplications can be bypassed in the MAC calculations.

However, the radix-4 code sometimes contains more '1's than the binary code. Fig. 5(a) shows an example of encoding the binary code '01010010' into the radix-4 code '111\bar{2}', where the number of non-zero digits is increased. To obtain the fewest '1's at the input, we propose the M-RD4 code. The M-RD4 algorithm is illustrated in Algorithm 1.

TABLE I
THE TRUTH TABLE OF THE RADIX-4 CODE AND THE PROPOSED M-RD4 CODE

t_{i+2} t_{i+1} t_i | z_j
000 | 0
001 | 1
010 | 1
011 | 2
100 | -2
101 | -1
110 | -1
111 | 0
(For M-RD4, the 4-bit windows '0100' and '1011' are first rewritten to '0011' and '1100', respectively, before this mapping is applied.)

Fig. 5. Examples of (a) the radix-4 code and (b) the M-RD4 code. The proposed M-RD4 code can effectively reduce the number of '1's in the input data to a minimum.

The proposed M-RD4 algorithm observes one more bit to the left. If the sequence is '0100', it is turned into '0011'; if the sequence is '1011', it is turned into '1100'. After that, the right three bits are encoded using Eq. (2):

z_j = -2 t_{i+2} + t_{i+1} + t_i   (2)

where i = 0, 2, 4, ..., j = i/2, and t_i is the i-th bit of T, where T is defined in Algorithm 1. The cases are tabulated in Table I. The M-RD4 code further reduces the number of '1's in the input data. Fig. 5(b) illustrates the M-RD4 algorithm: the M-RD4 code of '01010010' becomes '1102' instead of '111\bar{2}'.

Fig. 6 shows the implementation of the M-RD4 booth recoding circuit, which is composed of a MUX block, a converter block, and an encoder block. The MUX block consists of three 4-to-1 multiplexers and one quaternary counter. The counter generates the control signals (S_A and S_B) to select the output of each multiplexer. In this way, the MUX block outputs the raw data for M-RD4 from the LSB to the MSB. The converter block transforms the raw data for encoding according to the M-RD4 algorithm. a_{i+3}, a_{i+2}, and a_{i+1} are the outputs of the MUX block, and a_i is generated by the converter. As shown in Fig. 5(b), in the first clock a_i = 0, and afterwards a_i is determined by the output of the converter (t_{i+2}) in the previous clock. Therefore, a_i is '0, 0, 0' in the next three clocks. According to Algorithm 1, we set

F = \bar{a}_{i+3} a_{i+2} \bar{a}_{i+1} \bar{a}_i   (3)
G = a_{i+3} \bar{a}_{i+2} a_{i+1} a_i   (4)
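Both recodings can be transcribed into a short, self-contained Python sketch (our own transcription; the function names and bit ordering are ours, while the rules are those of the text and Algorithm 1), which reproduces the running examples:

```python
def radix4_booth(bits):
    """Standard radix-4 booth recoding of a binary string (MSB first).

    A '0' is appended to the right of the LSB, the value is zero-extended to
    an even length, and each overlapping group (t[i+2], t[i+1], t[i]) maps to
    z = -2*t[i+2] + t[i+1] + t[i].  Assumes the top window needs no extra
    digit, as in the examples in the text."""
    x = [int(b) for b in bits]
    if len(x) % 2:
        x = [0] + x
    t = [0] + x[::-1] + [0, 0, 0]   # t[0] is the appended '0'; zero-pad the top
    return [-2 * t[i + 2] + t[i + 1] + t[i] for i in range(0, len(x), 2)][::-1]

def mrd4(bits):
    """M-RD4 recoding per Algorithm 1: rewrite the 4-bit windows
    '0100' -> '0011' and '1011' -> '1100' before applying the radix-4 map."""
    x = [int(b) for b in bits]
    if len(x) % 2:
        x = [0] + x
    t = [0] + x[::-1] + [0, 0, 0]
    z = []
    for i in range(0, len(x), 2):
        if (t[i + 3], t[i + 2], t[i + 1], t[i]) == (0, 1, 0, 0):
            t[i + 3], t[i + 2], t[i + 1], t[i] = 0, 0, 1, 1
        elif (t[i + 3], t[i + 2], t[i + 1], t[i]) == (1, 0, 1, 1):
            t[i + 3], t[i + 2], t[i + 1], t[i] = 1, 1, 0, 0
        z.append(-2 * t[i + 2] + t[i + 1] + t[i])
    return z[::-1]

# '01111111' (127): the code length is halved, 127 = 2*64 - 1
assert radix4_booth("01111111") == [2, 0, 0, -1]
# '01010010' (82): plain radix-4 adds a non-zero digit, M-RD4 removes it
assert radix4_booth("01010010") == [1, 1, 1, -2]
assert mrd4("01010010") == [1, 1, 0, 2]
```

Both digit lists evaluate back to the original value via sum(z_j * 4**j), and the M-RD4 result matches the worked example '1102'.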
Algorithm 1 M-RD4 Booth Code
Input: Binary n-bit X = x_{n-1} x_{n-2} ... x_i ... x_0.
Output: M-RD4 booth encoded m-bit data Z = z_{m-1} z_{m-2} ... z_j ... z_0, where m = ceil(n/2).
// Extend a '0' at the left to ensure that n is even.
// Append a '0' to the right of the Least Significant Bit (LSB).
if n is even then
    T[n:0] <= x_{n-1} x_{n-2} ... x_i ... x_0 0
else
    T[n+1:0] <= 0 x_{n-1} x_{n-2} ... x_i ... x_0 0
end if
i <= 0, j <= 0
while i <= n-1 do
    // Observe one more bit per step and transform the sequence if necessary.
    if t_{i+3} t_{i+2} t_{i+1} t_i == '0100' then
        t_{i+3} t_{i+2} t_{i+1} t_i <= '0011'
    else if t_{i+3} t_{i+2} t_{i+1} t_i == '1011' then
        t_{i+3} t_{i+2} t_{i+1} t_i <= '1100'
    end if
    // Get the M-RD4 code Z from LSB to MSB.
    z_j <= -2 t_{i+2} + t_{i+1} + t_i
    i <= i + 2
    j <= j + 1
end while
return Z

Then we can get the outputs of the converter block:

t_{i+2} = G + \bar{F} a_{i+2}   (5)
t_{i+1} = F + \bar{G} a_{i+1}   (6)
t_i = F + \bar{G} a_i   (7)

The output of the converter is sent to the encoder block for recoding. The 3-bit binary codes are recoded to 1-bit M-RD4 code by a combinational circuit according to Table I. The encoder output logic can be expressed as

Z_2 = \bar{t}_{i+2} t_{i+1} t_i   (8)
Z_{-2} = t_{i+2} \bar{t}_{i+1} \bar{t}_i   (9)
Z_1 = \bar{t}_{i+2} (t_{i+1} xor t_i)   (10)
Z_{-1} = t_{i+2} (t_{i+1} xor t_i)   (11)

where Z_2, Z_{-2}, Z_1, and Z_{-1} represent the four non-zero values of z_j (2, -2, 1, -1) in Table I. When z_j is encoded to zero, the multiplication result is always zero; therefore there are only four output terminals from the combinational logic circuit, and at most one of them is activated at a time. If z_j = 1, then the voltage of Z_1 is high and the others are low; the other cases follow analogously.

To make the M-RD4 code and its corresponding circuit clearer, we use the binary code '01010010' as an example. In the first clock, S_A = 0 and S_B = 0, so a_{i+3} a_{i+2} a_{i+1} = x_2 x_1 x_0 (010) and a_i = 0. According to Eqs. (3) and (4), F = 1 and G = 0. The outputs (t_{i+2} t_{i+1} t_i) of the converter are 011. The M-RD4 result is 2, thus Z_2 = 1, Z_{-2} = 0,
Fig. 6. The circuit implementation of the proposed M-RD4 booth recoding circuit. The MUX block selects the bits of the binary input, the converter processes them under the rules of the proposed M-RD4 algorithm, and the encoder recodes them into M-RD4 codes for the neuron circuit.

Z_1 = 0, and Z_{-1} = 0. In the second clock, S_A = 0 and S_B = 1, so a_{i+3} a_{i+2} a_{i+1} a_i = x_5 x_4 x_3 Q (1000), where Q equals t_{i+2} from the last clock. F = 0 and G = 0, so t_{i+2} t_{i+1} t_i = 000 and all of the outputs are 0. In the third clock, S_A = 1 and S_B = 0, so a_{i+3} a_{i+2} a_{i+1} a_i = x_6 x_5 x_4 Q (1010). F = 0 and G = 0, so t_{i+2} t_{i+1} t_i = 010, and therefore Z_1 = 1. In the fourth clock, S_A = 1 and S_B = 1, so a_{i+3} = gnd (0) and a_{i+2} a_{i+1} a_i = x_7 x_6 Q (010). F = 0 and G = 0, so t_{i+2} t_{i+1} t_i = 010; as in the third clock, Z_1 = 1. Therefore, the M-RD4 output is '1102'. The four output bits, which are either at VDD or ground, are directly used in the in-memory computing. The weights of 1, -1, 2, and -2 will be employed in the neuron circuit, which is discussed in Section III-C.

B. Modified CSD Weights

For 2's complement weights, the ratio of x is about 10%, which is not optimized for low-power computing. What's more, the 2's complement code may cause a big jump between the most negative value (10000000) and the most positive value (01111111) due to the uncertainty of the memory resistance. This leap will significantly influence the accuracy of in-memory computing.

Differential weights can be used to address the above-mentioned issues, and can be represented as

w = w_p - w_n = 2^{n-1}(b_{n-1} - c_{n-1}) + ... + 2^0(b_0 - c_0)   (12)

where w_p and w_n are the unsigned number representations, and b_i and c_i are the bits in the positive part and negative part
Fig. 7. The simplified distribution curves of (a) weights, (b) weights in 2's complement, and (c) differential weights.
Fig. 8. The data representation of our proposed differential weight system.

of a weight, respectively. For example, w_p = 8'b00000000 and w_n = 8'b01110111 represent the weight -119. As shown in Fig. 7(c), the red line indicates a positive value, and the blue line indicates a negative value. The digits 1 and \bar{1} are placed in the positive and negative parts of the weight, respectively. In this way, the majority of bits in the weights are 0, which can bypass the in-memory computing and save around 50% of the power consumption.

However, this does not fully utilize both parts of a differential weight. If we could represent W with fewer non-zero digits, we could further reduce the in-memory computing power consumption. CSD representation [29] is widely used to reduce the number of non-zero digits by introducing a new digit \bar{1} (-1) into the number to form a ternary number system. The pair b_i and c_i in Eq. (12) can be used to represent the digit set {\bar{1}, 0, 1} of a CSD code. A simple approach to encode a binary code into a CSD code is to search the binary code from LSB to MSB, find a string of '1's followed by a '0' (e.g., 0111), and replace it with the CSD representation (100\bar{1}). The process may need to be repeated several times to make sure there is no remaining string of '1's. CSD representation still suffers from some shortcomings:
1) In a CSD number, two consecutive non-zero digits are not allowed. Thus the maximum value of an 8-bit CSD number is limited to 170 (10101010). For 8-bit binary numbers greater than 170, an extra bit is needed in the CSD representation.
2) For the string '011', the CSD representation (10\bar{1}) does not reduce the number of '1's.

An M-CSD representation is proposed to address the above issues. The strings '11' and '\bar{1}\bar{1}' are allowed in M-CSD. The main idea of M-CSD is shown in Algorithm 2: strings containing three or more '1's are replaced by 10...0\bar{1}, and strings of three or more '\bar{1}'s are replaced by \bar{1}0...01. If the MSB of the binary code is contained in such a string, the string is not replaced with the M-CSD representation.
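The classic CSD recoding described above can be sketched with the standard nonadjacent-form loop (this is textbook CSD, not the proposed M-CSD), and the sketch exhibits both shortcomings directly:

```python
def csd(x):
    """Canonical signed-digit (nonadjacent form) digits of a non-negative
    integer, LSB first, with digits in {-1, 0, 1}."""
    digits = []
    while x > 0:
        if x & 1:
            d = 2 - (x & 3)   # +1 if x % 4 == 1, -1 if x % 4 == 3
            x -= d
        else:
            d = 0
        digits.append(d)
        x //= 2
    return digits

assert csd(7) == [-1, 0, 0, 1]            # 0111 -> 1 0 0 -1
assert sum(d * 2**i for i, d in enumerate(csd(219))) == 219
assert len(csd(170)) == 8                 # 10101010: the largest 8-digit CSD value
assert len(csd(171)) == 9                 # anything larger needs an extra digit
```

Note that csd(219) already needs 9 digits, whereas the M-CSD code below keeps 219 within 8 digits by allowing the '11' pairs of 11011011.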
With these rules, the maximum value is extended to 219 (11011011). As shown in Fig. 8, to achieve the same range as the binary code, more consecutive '1's are allowed if the weight is greater than 219 or smaller than -219. In this way, the M-CSD code fits the differential weight scheme well. To comply with the CSD design rule, w_p and w_n in the above example are changed to 8'b00001001 and 8'b10000000, respectively. Therefore, the number of '1's is significantly reduced.

Algorithm 2 Modified CSD Representation
Input: n-bit differential weight W_i = w_{n-1} w_{n-2} ... w_0.
Output: n-bit modified CSD weight W_i = w_{n-1} w_{n-2} ... w_0.
Flag <= 0   // Mark the string containing the MSB.
i <= n - 1, j <= 0, k <= 0
// The string containing the MSB will not be replaced.
while i > 0 and Flag == 0 do
    if w_i == 0 then
        Flag <= 1
    else
        i <= i - 1
    end if
end while
// From the LSB up to w_i, perform the M-CSD recoding.
while j < i do
    if w_{j+4} ... w_j == 11011 then
        j <= j + 2   // the '11' pair is allowed; leave it unchanged
    else if w_{j+4} ... w_j == \bar{1}\bar{1}0\bar{1}\bar{1} then
        j <= j + 2
    else if w_{j+2} w_{j+1} w_j == 111 then
        k <= j + 2
        while w_k == 1 do
            k <= k + 1
        end while
        w_k w_{k-1} ... w_j <= 1 0 ... 0 \bar{1}   // replace the run of '1's
        j <= k
    else if w_{j+2} w_{j+1} w_j == \bar{1}\bar{1}\bar{1} then
        k <= j + 2
        while w_k == \bar{1} do
            k <= k + 1
        end while
        w_k w_{k-1} ... w_j <= \bar{1} 0 ... 0 1
        j <= k
    else
        j <= j + 1
    end if
end while
return W

C. Neuron Circuit
The integral multiplier in the proposed CIM core is designed for massively parallel MAC operations and data conversion from digital to analog. [12] uses operational amplifiers to perform the integral operation. However, the static power consumption of the amplifiers is not optimized for low-power computing. Therefore, the regulated passive neuron from [13] is adopted in our scheme to build a differential passive integrator.
Fig. 9. The block diagram of the integration scheme in the integral multiplier. It implements digital input/weight and analog MAC operations, and completes the analog-to-digital conversion of the output by a SAR ADC.
As shown in Fig. 9, the digital inputs and digital weights are differentially multiplied and accumulated at the neurons. The proposed integration scheme contains three phases: positive integration, negative integration, and charge redistribution. The integration phases perform non-weighted MAC operations on the inputs and weights; therefore each integrator sees the same integral voltage regardless of which input bit and weight bit it processes. The charge redistribution phase performs the weighting process for the M-RD4 digits (4^0, 4^1, ..., 4^{m-1} from LSB to MSB, where m is the length of the M-RD4 code, m = ceil(n/2)).

The integral neuron is designed as a symmetrical structure to complete the positive and negative MAC operations separately. The differential integrator is illustrated in Fig. 10(a). The M-RD4 inputs are sequentially sent to the word lines from LSB to MSB. The RRAM model used in the 1R1T cells is around 10 GOhm in the High Resistance State (HRS) and 10 MOhm in the Low Resistance State (LRS) [30], [31]. The 1R1T cells are used in pairs to store b_i and c_i of Eq. (12). The positive circuit is used for MAC operations whose results are positive (I_p x W_p + I_n x W_n), while the negative circuit is used for MAC operations with negative results (I_p x W_n + I_n x W_p), where I_p, I_n, W_p, and W_n are the positive input, negative input, positive weight, and negative weight, respectively. In this way, the number of discharge paths is reduced. What's more, the positive and negative circuits compensate each other, effectively reducing the influence of parasitic parameters. Therefore, the proposed integrator can achieve higher accuracy with lower power. S_in controls the data input, S_P controls the positive integral operation, and S_N controls the negative integral operation. Three further switches control the integration phase and the charge redistribution phase.
Another switch controls the sample phase and the conversion phase of the ADC. During the positive integration phase, the sharing switch is open to separate each integrator. After that, the reset switch and S_P are closed to clear the charge in the positive integral capacitors. Then S_in is closed to input the M-RD4 data (IN_p = 1 and IN_n = -1), completing the 1-bit MAC of 1 x W_p and -1 x W_n. After the positive integration phase, S_P is opened to keep the charge in Cp_i, and S_in is opened to ensure no power is consumed by the 1R1T cells. During the negative integration phase, the sharing switch remains open to keep the integrators separated. The reset switch and S_N are closed to clear the charge in the negative capacitors. After that, S_in is closed with the inputs IN_p = -1 and IN_n = 1; this phase completes the 1-bit MAC of -1 x W_p and 1 x W_n. After the two integration phases, the redistribution switches are closed to complete the charge redistribution phase, where the equivalent analog voltages (V_p for positive and V_n for negative) are generated. According to the derivation process of [13], the positive or negative integration voltage after one step of the charge redistribution phase is

V_S = V_S^- - k ( 2^{-1} \sum_{i=0}^{p-1} A_i G_{i,n-1} + 2^{-2} \sum_{i=0}^{p-1} A_i G_{i,n-2} + ... + 2^{-n} \sum_{i=0}^{p-1} A_i G_{i,0} )   (13)

where V_S represents V_Sp or V_Sn, V_S^- represents the initial integral voltage, k = V_B T / C_f, p is the number of input lines, A_i is the 1-bit M-RD4 input of the i-th input line, and T is the fixed time period of each integration. G_{i,j} is the conductance of each binary-RRAM cell, which is 1/R_H in the HRS and 1/R_L in the LRS.
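One integration step of Eq. (13) can be sketched numerically; all parameter values here (k, the cell conductances, and the inputs) are illustrative assumptions, not values from the paper:

```python
# Numeric sketch of one integration step of Eq. (13).
# All parameter values here are illustrative assumptions, not from the paper.
R_H, R_L = 10e9, 10e6                   # HRS / LRS resistances from the text (ohms)
p, n = 2, 2                             # toy sizes: 2 input lines, 2 weight bits
G = [[1 / R_L, 1 / R_H],                # G[i][b]: conductance of cell (i, bit b)
     [1 / R_H, 1 / R_L]]
A = [1, 1]                              # 1-bit M-RD4 inputs on the p input lines
k = 1e4                                 # lumped factor k = V_B * T / C_f (assumed)

V_prev = 1.0                            # integration voltage before this step
weighted = sum(2.0 ** -(n - b) * sum(A[i] * G[i][b] for i in range(p))
               for b in range(n))       # weight bit b carries the factor 2^-(n-b)
V_S = V_prev - k * weighted
assert 0 < V_S < V_prev                 # the step discharges the capacitor
```

The LRS cells dominate the weighted sum because G_LRS is three orders of magnitude larger than G_HRS, which is what makes the stored bit pattern readable from the integration voltage.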
Fig. 10. The proposed circuit schematics. (a) Integral multiplier with a symmetrical structure. The regulators are used to maintain the drain voltage of the 1R1T cells. All regulators have the same bias current. (b) The control logic of 1-bit input data. (c) The weighting process for the weight bit and the input data are completed simultaneously in the charge redistribution phase (binary-weighted capacitors with C_f = 2C_{n-1} = 4C_{n-2} = ... = 2^n C_0, and C_s = C_f).
In the proposed scheme, the input pulse has only two possible values, which can effectively reduce the 1R1T cells' reading variation. Therefore, 1-bit M-RD4 data with different values are computed sequentially. The M-RD4 digits are weighted as A_{i,m-1} : A_{i,m-2} : ... : A_{i,0} = 4^{m-1} : 4^{m-2} : ... : 1, which means each digit needs two steps of the charge redistribution operation to achieve the weighting process for the input data. As shown in Fig. 10(b), four integration phases (two positive and two negative) and two charge redistribution phases are needed to complete the computing and weighting process for 1-bit M-RD4 data. The first two integration phases mentioned above compute the layers whose input is '1' or '-1'. As shown in Fig. 10(c), the first charge redistribution phase uses the sampling capacitor C_S to complete the weighting process for the input data. Let C_S = C_f; the charge on the capacitors C_{n-1}, C_{n-2}, ..., C_0 and C_S is then equally divided after the charge redistribution operation. Taking the positive integrator as an example, the voltage V_{p,a} of C_S can be expressed as

V_{p,a} = \frac{1}{2}(V_{Sp,a} + V_p^-)    (14)

where V_p^- represents the previous positive voltage in C_S, and V_{Sp,a} represents the positive integration voltage for the layers with input '1' or '-1'. In the second two integration phases, the layers with input '2' or '-2' are input and computed. The positive voltage of C_S after the second charge redistribution phase is

V_p = \frac{1}{2}(V_{Sp,b} + V_{p,a}) = \frac{1}{2} V_{Sp,b} + \frac{1}{4} V_{Sp,a} + \frac{1}{4} V_p^-    (15)

where V_{Sp,b} is the positive integration voltage for the layers with input '2' or '-2', and V_p^- is the positive output voltage after the last input bit was computed. Eq. (15) describes the loop executed for each digit of the input data; the input data are therefore weighted by 4^{-m}, 4^{-m+1}, ..., 4^{-1} from LSB to MSB. Initially, V_p is reset to Vdd. After the m-digit input data are computed, V_p can be expressed as

V_p = 4^{-m} V_{dd} + 4^{-m} V_{Sp,0} + \cdots + 4^{-1} V_{Sp,m-1}    (16)

where V_{Sp,i} = V_{Sp,a,i} + 2 V_{Sp,b,i}. The change of V_p is

\Delta V_p = V_{dd} - V_p = 4^{-m} \sum_{i=0}^{m-1} 4^i \Delta V_{Sp,i}    (17)

where \Delta V_{Sp,i} = \Delta V_{Sp,a,i} + 2 \Delta V_{Sp,b,i}, V_{Sp,i} is the i-th positive integration voltage, and \Delta V_{Sp,i} is the change of V_Sp in the i-th integration. Therefore, the output voltage is

V_{out} = \Delta V_p - \Delta V_n = 4^{-m} \sum_{i=0}^{m-1} 4^i (\Delta V_{Sp,i} - \Delta V_{Sn,i})    (18)

D. Mapping
There are several methods to implement the convolution layers and fully connected layers on cross-point arrays [32]–[34]. To estimate the network-level energy efficiency of the proposed scheme, the mapping method in [12] is adopted. Both the convolution kernels in convolution layers and the weight matrices in fully connected layers are mapped onto the cores. A convolution kernel of size C_in * k * k * C_out is first transformed into a 2D matrix of size (C_in * k * k) × C_out. The proposed scheme has a cross-point array size of 256 × 512 and can implement a 256 × 256 matrix. Therefore, the number of cores needed to implement the kernel is \lceil C_in * k * k / 256 \rceil × \lceil C_out / 256 \rceil. Adders are integrated in the router unit to sum the results of different cores if the kernel size is larger than 256. Similarly, for an M * N fully connected layer, the weight matrix can be mapped onto \lceil M / 256 \rceil × \lceil N / 256 \rceil cores.

IV. SIMULATION RESULTS
In this section, both circuit-level and network-level evaluation results are provided. The circuit-level simulation verifies the circuit's functionalities and shows the energy and accuracy benefits of the proposed core. The network-level evaluation presents the performance comparison with other related works. The circuit-level simulations are done in Cadence Analog Mixed Signal (AMS) with a 45 nm generic Process Design Kit (PDK). The RRAM model proposed by [35] is adopted in the circuit simulations. The network-level simulations are done on the PyTorch platform.
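Before the circuit results, the charge-redistribution weighting of Eqs. (14)–(16) can be cross-checked behaviorally. The following is a sketch under ideal assumptions (lossless capacitors, C_S = C_f; `redistribution_weighting` is a hypothetical helper name):

```python
def redistribution_weighting(v_sp_a, v_sp_b, vdd=1.0):
    """Behavioral model of the per-digit weighting loop, Eqs. (14)-(15).

    v_sp_a[i], v_sp_b[i]: integration voltages of the i-th M-RD4 digit for
    the '+-1' and '+-2' input groups, LSB first. With C_S = C_f, each
    charge-sharing step averages the two voltages (charge divided equally).
    """
    vp = vdd  # V_p is initially reset to Vdd
    for va, vb in zip(v_sp_a, v_sp_b):
        vp = 0.5 * (va + vp)  # first redistribution, Eq. (14)
        vp = 0.5 * (vb + vp)  # second redistribution, Eq. (15)
    return vp
```

Unrolling the loop reproduces the closed form of Eq. (16): V_p = 4^{-m} V_dd + sum_i 4^{i-m} (V_{Sp,a,i} + 2 V_{Sp,b,i}).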
A. Functionality
The transient simulation is performed to verify the correct function of the circuit. A random input 125 (binary representation: 8'b01111101; M-RD4 representation: 2, 0, -1, 1) is sent to the CIM core to complete the MAC operation with a random weight 123 (binary representation: 8'b01111011; differential representation: 8'b10000000 - 8'b00000101). Fig. 11(a) shows the whole MAC operation. The input bits are computed from LSB to MSB. From 0 ns to 130 ns, the circuit completes the MAC operation for the M-RD4 bit '1'. As shown in Fig. 11(b),
V_{Cp,0} is the integration voltage of the positive capacitor C_{p,0}, which is reset to 1 V when the reset switch is closed. From 16 ns to 31 ns, S_P is closed and V_{Cp,0} decreases linearly to 745.4 mV to complete the positive product. V_{Cn,0} is the integration voltage of the negative capacitor C_{n,0}, which is reset in the same way. The negative product is completed from 47 ns to 62 ns, where V_{Cn,0} decreases linearly to 745.3 mV. From 64 ns to 70 ns, the charge redistribution phase is completed, and the output voltage V_out is 61.19 mV. From 66 ns to 124 ns, V_{Cp,0} and V_{Cn,0} are kept at 1 V since no data is input. After the second charge redistribution phase, the output voltage V_out is halved to 30.49 mV, and the computing of the M-RD4 input '1' is complete. Using the difference as the output can effectively reduce the impact of parasitic parameters on accuracy. After 8 cycles of integration and charge redistribution phases, the output voltage V_out is 59.73 mV, and the digital result is 8'b00111011. The theoretical results are 59.89 mV and 8'b00111011, respectively. Therefore, the proposed scheme meets its design requirement.

B. Robustness Analysis
Fig. 11. The transient simulation results of (a) the computing progress for 1-bit M-RD4 input, and (b) the whole MAC operation for M-RD4 input '2, 0, -1, 1' and differential weight 8'b10000000 - 8'b00000101.

Fig. 12 shows the relationship between the analog output V_out (\Delta V_p - \Delta V_n) and (a) the digital input, (b) the number of input lines, and (c) the digital weight. The results show that the proposed scheme achieves high linearity and accuracy. Fig. 13(a), (b), and (c) show the Differential Non-Linearity (DNL) of the proposed scheme with different (a) input values, (b) input lines, and (c) weight values, respectively. Fig. 13(d), (e), and (f) show the corresponding Integral Non-Linearity (INL). The simulated DNLs (INLs) in terms of the digital input, the digital weight, and the number of input lines are +0.464/-0.073 LSB (-0.047/-0.809 LSB), +0.055/-0.291 LSB (+1.772/-1.061 LSB), and +0.111/-0.445 LSB (+0.205/-0.673 LSB).
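The DNL/INL metrics above can be computed from a measured transfer curve with a short script. This is a generic endpoint-style computation, not necessarily the exact extraction used for Fig. 13:

```python
import numpy as np

def dnl_inl(transition_levels):
    """DNL and INL in LSB from measured code-transition levels.

    The average step is taken as 1 LSB; INL is the running sum of DNL,
    so both start near zero and the INL returns to zero at the endpoint
    for an ideal converter.
    """
    steps = np.diff(np.asarray(transition_levels, dtype=float))
    lsb = steps.mean()
    dnl = steps / lsb - 1.0
    inl = np.concatenate(([0.0], np.cumsum(dnl)))
    return dnl, inl
```

An ideal staircase gives all-zero DNL and INL, while a narrow or wide step shows up directly as a negative or positive DNL entry.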
TABLE II
PVT SIMULATION ON ENOB

Process: ff, ss, tt; Temperature (°C): -40 and 80 (ff and ss corners), 27 (tt); Voltage (V): 0.9, 1, 1.1. The simulated ENOBs across these corners are 7.42, 7.41, 7.28, 7.29, 7.21, 7.11, 7.16, 7.25, and 7.11 bit.
Different process, voltage, and temperature corners are chosen for the PVT simulation to verify the robustness of the circuit. As shown in Table II, the ENOBs are all greater than 7.1 bits across the different PVT combinations. Therefore, the proposed scheme is reliable under variations of process, voltage, and temperature.
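The reported ENOB values follow from the standard relation between SNDR and ENOB; as a worked example (using the SNDR of the proposed scheme from Table III):

```python
def enob_from_sndr(sndr_db):
    """Effective number of bits from SNDR: ENOB = (SNDR - 1.76) / 6.02."""
    return (sndr_db - 1.76) / 6.02
```

Here `enob_from_sndr(46.48)` gives about 7.4 bits, consistent with the 7.42-bit ENOB reported for the proposed core.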
TABLE III
CIM CORE PERFORMANCE COMPARISON BETWEEN MBRAI, RPN&BLM, AND THE PROPOSED SCHEME

                | MBRAI [12] | RPN&BLM [13] | Proposed
Supply          | 1.1 V      | 1 V          | 0.6 V (M-RD4 recoder), 1 V (neuron circuit)
Computing speed | 1.85 M/s   | 1.85 M/s     | 1.85 M/s
SFDR            | 67.42 dB   | 59.13 dB     | 63.41 dB
SNDR            | 45.48 dB   | 46.13 dB     | 46.48 dB
ENOB            | 7.26 bit   | 7.37 bit     | 7.42 bit
TABLE IV
ENERGY COST COMPARISON BETWEEN THE PROPOSED CORE AND OTHERS

                 | MBRAI [12] | MBHS-mCNN [27] | RPN&BLM [13] | Proposed
Technology       | 45 nm      | 65 nm          | 45 nm        | 45 nm
Supply           | 1.1 V      | -              | 1 V          | 0.6/1 V
System Frequency | 16.7 MHz   | 20 MHz         | 16.7 MHz     | 16.7 MHz
Core Size        | 256*256    | 128*256        | 256*256      | 256*512
Power Amplifier  | 0.22 mW    | -              | -            | -
ADC              | 4.04 uW    | 25.47 uW       | 4.04 uW      | 3.99 uW
Regulator        | -          | -              | 1.11 uW      | 0.55 uW
Core             | 199.68 mW  | 7.44 mW        | 3.61 mW      | 2.00 mW
C. Performance
Table III shows the dynamic performance comparison between MBRAI [12], RPN&BLM [13], and the proposed scheme. The M-RD4 recoder has a supply voltage of 0.6 V to further decrease the power consumption. The neuron circuit's supply voltage is 1 V to ensure the robustness of the proposed scheme. The computing speed, SFDR, SNDR, and Effective Number of Bits (ENOB) of the proposed scheme are 1.85 M/s, 63.41 dB, 46.48 dB, and 7.42 bit, respectively, which are slightly better than the others. Table IV gives the energy cost comparison of MBRAI, MBHS-mCNN [27], RPN&BLM, and the proposed scheme. MBRAI consumes 0.22 mW on amplifiers for a stable read voltage, which means the amplifiers consume more than 90% of the power, resulting in a total power consumption of 199.68 mW. The ADCs consume more than 85% of the energy in MBHS-mCNN, whose power consumption for a 128*256 core is 7.44 mW. RPN&BLM uses regulators, with 1.11 uW power consumption, to keep the read voltage stable, and its total power consumption is 3.61 mW. In contrast, the power consumption of the proposed core is only 2.00 mW. Compared with MBRAI, MBHS-mCNN, and RPN&BLM, the power consumption of the proposed scheme is reduced by 98.9%, 73.1%, and 44.6%, respectively.

The core-level comparison between the proposed scheme and the other CIM core schemes is shown in Table V. The simulation results show that the proposed design achieves an energy efficiency as high as 60.68 TOPS/s/W in the 8-bit input 8-bit weight pattern, 371.49 TOPS/s/W in the 4-bit input 4-bit weight pattern, 941.55 TOPS/s/W in the 3-bit input 2-bit weight pattern, 1418.44 TOPS/s/W in the 2-bit input 2-bit weight pattern, and 1325.22 TOPS/s/W in the 3-bit input 1-bit weight pattern.
Fig. 12. The output voltage V_out at various (a) input data and RRAM weights, (b) input lines and input data, and (c) RRAM weights and input lines in the proposed scheme.

Fig. 13. The simulated DNL in terms of (a) input value, (b) input lines, (c) weight value, and INL in terms of (d) input value, (e) input lines, (f) weight value in the proposed scheme.
Compared with the other schemes, the proposed scheme achieves much higher efficiency. In the 8-bit input 8-bit weight pattern, the proposed scheme is 99.47×, 5.44×, and 1.80× as efficient as the MBRAI, MBHS-mCNN, and RPN&BLM schemes, respectively. The proposed CIM core therefore achieves better energy efficiency than the other CIM schemes.

D. Network-Level Estimation
To estimate the accuracy and energy consumption of the proposed scheme, the model of LeNet [40] on the MNIST dataset and the models of AlexNet [41], ResNet34 [42], and VGG16 [43] on ILSVRC2012 are evaluated with the mapping method mentioned in Section III.D. The estimated accuracy is shown in Table VI. The proposed scheme achieves an accuracy better than MBHS-mCNN on LeNet, and roughly equivalent to MBRAI and RPN&BLM on LeNet and AlexNet. The energy estimation of the proposed scheme and the other RRAM-based schemes is shown in Table VII. The model of LeNet on MNIST is used to test the performance of the schemes in small-scale networks, while the models of AlexNet, ResNet34, and VGG16 on ILSVRC2012 are used to evaluate the performance in large-scale networks. The proposed scheme reduces the ratio of 1 × 1 operations by 78.5% on LeNet, 80.2% on AlexNet, 70.4% on ResNet34, and 82.9% on VGG16. Therefore, the power consumption is greatly reduced. The inference energy per image is reduced by 98.9% compared with MBRAI, more than 81.5% compared with MBHS-mCNN, and more than 43.6% compared with
TABLE V
CORE-LEVEL COMPARISON BETWEEN THE PROPOSED CORE AND OTHERS

Structure                                 | Technology | Crossbar size | Weight/data bit | Throughput (GOPS) | Power (mW) | Efficiency (TOPS/s/W)
SINWP [36]                                | 55 nm      | 256*512       | fixed-3/fixed-1 | -                 | -          | 53.17
                                          |            |               | fixed-3/fixed-2 | -                 | -          | 21.9
MBRAI [12]                                | 45 nm      | 256*256       | fixed-3/fixed-1 | 1524              | 19.6       | 77.76
                                          |            |               | fixed-3/fixed-2 | 1040              | 26.8       | 38.8
                                          |            |               | fixed-8/fixed-8 | 121.4             | 199.68     | 0.61
MBHS-mCNN [27]                            | 65 nm      | 128*256       | fixed-8/fixed-8 | 81.82             | 7.348      | 11.15
7nm SRAM Macro [37]                       | 7 nm       | 4K            | fixed-4/fixed-4 | 186.2             | 1.06       | 175.5
RPN & BLM [13]                            | 45 nm      | 256*256       | fixed-2/fixed-2 | 1092.2            | 1.975      | 553.01
                                          |            |               | fixed-4/fixed-4 | 546.1             | 2.66       | 205.30
                                          |            |               | fixed-8/fixed-8 | 121.4             | 3.61       | 33.63
Synapses Integrated Analog Processor [38] | 180 nm     | 2M            | analog          | 0.33              | 15.8       | 20.7
                                          | 40 nm      | 4M            | analog          | 0.66              | 9.9        | 66.5
Fully Integrated Analog Chip [39]         | 130 nm     | 4K            | fixed-1/ternary | -                 | -          | 78.4
Proposed                                  | 45 nm      | 256*512       | fixed-3/fixed-1 | 1524              | 1.15       | 1325.22
                                          |            |               | fixed-2/fixed-2 | 1092.2            | 0.77       | 1418.44
                                          |            |               | fixed-3/fixed-2 | 1092.2            | 1.16       | 941.55
                                          |            |               | fixed-4/fixed-4 | 546.1             | 1.47       | 371.49
                                          |            |               | fixed-8/fixed-8 | 121.4             | 2.00       | 60.68
TABLE VI
ACCURACY ESTIMATE OF DIFFERENT RRAM-BASED SCHEMES

Network               | Structure      | Top-1 Error Rate
LeNet on MNIST        | Software Based | 0.90%
                      | MBRAI          | 0.97%
                      | MBHS-mCNN      | 2.44%
                      | RPN&BLM        | 0.90%
                      | Proposed       | 0.91%
AlexNet on ILSVRC12   | Software Based | 42.70%
                      | MBRAI          | 44.16%
                      | RPN&BLM        | 43.60%
                      | Proposed       | 43.60%
ResNet34 on ILSVRC12  | Software Based | 26.70%
                      | Proposed       | 27.80%
VGG16 on ILSVRC12     | Software Based | 28.40%
                      | Proposed       | 29.30%
RPN&BLM on different networks. Therefore, the inference energy is significantly reduced in the proposed scheme by abandoning the amplifiers and adopting the M-RD4 and M-CSD codes.
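The weight-side saving comes from canonical-signed-digit recoding. A sketch of standard CSD (nonadjacent-form) recoding and the differential b_i/c_i split follows; the paper's modified CSD may differ in detail, and `differential_pair` is an illustrative helper, not the paper's mapping circuit:

```python
def csd_recode(x):
    """Canonical signed-digit (nonadjacent form) recoding of an integer.

    Returns digits in {-1, 0, +1}, LSB first, with no two adjacent
    nonzero digits, so that value = sum(d_i * 2**i).
    """
    digits = []
    while x != 0:
        if x & 1:
            d = 2 - (x & 3)  # +1 if x % 4 == 1, -1 if x % 4 == 3
            x -= d
        else:
            d = 0
        digits.append(d)
        x >>= 1
    return digits

def differential_pair(digits):
    """Map CSD digits onto positive (b_i) and negative (c_i) cell patterns."""
    b = sum((d == 1) << i for i, d in enumerate(digits))
    c = sum((d == -1) << i for i, d in enumerate(digits))
    return b, c
```

For the weight 123 used in Section IV.A, `csd_recode` gives +2^7 - 2^2 - 2^0, i.e., the differential pair 8'b10000000 - 8'b00000101 quoted there, with three nonzero digits instead of the six '1's in plain binary.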
Fig. 14. The energy cost comparison between the different code combinations (RPN&BLM, RD4 only, M-RD4 only, M-RD4 with CSD, and M-RD4 with M-CSD).
As shown in Fig. 14, both the energy cost and the ratio of 1 × 1 operations decrease as the coding is refined: the ratio of 1 × 1 decreases to 13.3% by using the radix-4 input, and the M-RD4 and M-CSD codes reduce it, together with the energy cost, even further.

V. CONCLUSION
In this paper, a low power in-memory multiplication and accumulation array with modified radix-4 input and canonical-signed-digit weights has been proposed. The modified radix-4 Booth code is used to reduce the number of '1's in the input data, and differential memory pairs with the modified canonical-signed-digit code are used to reduce the '1's in the weights. The two proposed coding schemes efficiently reduce the ratio of 1 × 1 operations by 85.0% on LeNet, 79.7% on AlexNet, 70.4% on ResNet34, and 82.9% on VGG16. The simulation results have shown that the proposed 256*512 CIM core consumes 2.00 mW in the 8-bit input and 8-bit weight pattern. The computing-power rate at fixed-point 8-bit is 60.68 TOPS/s/W, which is 99.47×, 5.44×, and 1.80× that of the MBRAI, MBHS-mCNN, and RPN&BLM schemes, respectively. The core is very robust, with an ENOB of 7.42 bit, an SFDR of 63.41 dB, and an SNDR of 46.48 dB. The network-level estimation has shown that the proposed core achieves a 0.91% top-1 error rate with 7.59E-3 uJ/img on LeNet, a 43.60% top-1 error rate with 13.36 uJ/img on AlexNet, a 27.80% top-1 error rate with 77.79 uJ/img on ResNet34, and a 29.30% top-1 error rate with 297.88 uJ/img on VGG16. The core achieves very low inference energy cost and high accuracy, much better than the other schemes. The linearity and PVT simulations have been performed to verify the robustness of the circuit, and the energy efficiency comparison has shown that the proposed scheme achieves much lower power consumption than the others.
TABLE VII
ENERGY ESTIMATE OF DIFFERENT RRAM-BASED SCHEMES

Network                | Number of Operations | Structure       | Ratio of 1x1 | System Frequency | Data Bit | Crossbar Size | Energy (uJ/img)
LeNet on MNIST         | -                    | MBHS-mCNN [27]  | -            | 25 MHz           | 8        | 128*256       | 0.039
                       |                      | RPN & BLM [13]  | -            | 16.7 MHz         | 8        | 256*256       | 0.013
                       |                      | Proposed        | 0.022        | 16.7 MHz         | 8        | 256*512       | -
AlexNet on ILSVRC2012  | 720 M                | MBRAI [12]      | 0.143        | 25 MHz           | 8        | 256*256       | 1.23E+03
                       |                      | MBHS-mCNN [27]  | -            | 25 MHz           | 8        | 128*256       | 68.56
                       |                      | RPN & BLM [13]  | -            | 16.7 MHz         | 8        | 256*256       | 22.46
                       |                      | Proposed        | 0.029        | 16.7 MHz         | 8        | 256*512       | -
ResNet34 on ILSVRC2012 | 4 G                  | MBRAI [12]      | 0.125        | 25 MHz           | 8        | 256*256       | 6.92E+03
                       |                      | MBHS-mCNN [27]  | -            | 25 MHz           | 8        | 128*256       | 390.07
                       |                      | RPN & BLM [13]  | -            | 16.7 MHz         | 8        | 256*256       | 141.95
                       |                      | Proposed        | 0.037        | 16.7 MHz         | 8        | 256*512       | -
VGG16 on ILSVRC2012    | 16 G                 | MBRAI [12]      | 0.129        | 25 MHz           | 8        | 256*256       | 2.77E+04
                       |                      | MBHS-mCNN [27]  | -            | 25 MHz           | 8        | 128*256       | 1.56E+03
                       |                      | RPN & BLM [13]  | -            | 16.7 MHz         | 8        | 256*256       | 567.8
                       |                      | Proposed        | 0.022        | 16.7 MHz         | 8        | 256*512       | -

REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,”
Nature , vol. 521,no. 7553, pp. 436–444, 2015.[2] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, “Asurvey of deep neural network architectures and their applications,”
Neurocomputing , vol. 234, pp. 11–26, 2017.[3] G. W. Burr, R. M. Shelby, A. Sebastian, S. Kim, S. Kim, S. Sidler,K. Virwani, M. Ishii, P. Narayanan, A. Fumarola et al. , “Neuromorphiccomputing using non-volatile memory,”
Advances in Physics: X , vol. 2,no. 1, pp. 89–124, 2017.[4] Q. Xia and J. J. Yang, “Memristive crossbar arrays for brain-inspiredcomputing,”
Nature materials , vol. 18, no. 4, pp. 309–323, 2019.[5] K. Huang, Y. Ha, R. Zhao, A. Kuma, and Y. Lian, “A low active leakageand high reliability phase change memory (pcm) based non-volatilefpga storage element,”
Circuits and Systems I: Regular Papers, IEEETransactions on , vol. 61, no. 9, pp. 2605–2613, 2014.[6] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie,“Prime: A novel processing-in-memory architecture for neural networkcomputation in reram-based main memory,” in , 2016,pp. 27–39.[7] Y. Pan, P. Ouyang, Y. Zhao, W. Kang, S. Yin, Y. Zhang, W. Zhao,and S. Wei, “A mlc stt-mram based computing in-memory architec-ture for binary neural network.” in . IEEE, 2018, pp. 1–1.[8] A. Irmanova and A. P. James, “Multi-level memristive memory withresistive networks,” in , 2017,pp. 69–72.[9] M. Hu, C. E. Graves, C. Li, Y. Li, N. Ge, E. Montgomery, N. Davila,H. Jiang, R. S. Williams, J. J. Yang et al. , “Memristor-based analogcomputation and neural network classification with a dot product engine,”
Advanced Materials , vol. 30, no. 9, p. 1705914, 2018.[10] E. Giacomin, T. Greenberg-Toledo, S. Kvatinsky, and P. Gaillardon,“A robust digital rram-based convolutional block for low-power imageprocessing and learning applications,”
IEEE Transactions on Circuitsand Systems I: Regular Papers , vol. 66, no. 2, pp. 643–654, 2019.[11] A. Chen and M.-R. Lin, “Variability of resistive switching memoriesand its impact on crossbar array performance,” in . IEEE, 2011, pp. MY–7.[12] S. Zhang, K. Huang, and H. Shen, “A robust 8-bit non-volatilecomputing-in-memory core for low-power parallel mac operations,”
IEEE Transactions on Circuits and Systems I: Regular Papers, 2020.[13] Y. Zhang, K. Huang, R. Xiao, and H. Shen, “An 8-bit in resistive memory computing core with regulated passive neuron and bit line weight mapping,” arXiv preprint arXiv:2008.11669, 2020.[14] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura et al., “A million spiking-neuron integrated circuit with a scalable communication network and interface,”
Science , vol. 345, no. 6197, pp. 668–673, 2014.[15] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday,G. Dimou, P. Joshi, N. Imam, S. Jain et al. , “Loihi: A neuromorphic manycore processor with on-chip learning,”
IEEE Micro , vol. 38, no. 1,pp. 82–99, 2018.[16] X. Zhang, A. Huang, Q. Hu, Z. Xiao, and P. K. Chu, “Neuromorphiccomputing with memristor crossbar,” physica status solidi (a) , vol. 215,no. 13, p. 1700875, 2018.[17] G. W. Burr, R. M. Shelby, S. Sidler, C. Di Nolfo, J. Jang, I. Boybat, R. S.Shenoy, P. Narayanan, K. Virwani, E. U. Giacometti et al. , “Experimentaldemonstration and tolerancing of a large-scale neural network (165 000synapses) using phase-change memory as the synaptic weight element,”
IEEE Transactions on Electron Devices , vol. 62, no. 11, pp. 3498–3507,2015.[18] P. M. Sheridan, F. Cai, C. Du, W. Ma, Z. Zhang, and W. D. Lu, “Sparsecoding with memristor networks,”
Nature nanotechnology , vol. 12, no. 8,p. 784, 2017.[19] Y. Jiang, P. Huang, D. Zhu, Z. Zhou, R. Han, L. Liu, X. Liu, and J. Kang,“Design and hardware implementation of neuromorphic systems withrram synapses and threshold-controlled neurons for pattern recognition,”
IEEE Transactions on Circuits and Systems I: Regular Papers , vol. 65,no. 9, pp. 2726–2738, 2018.[20] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P.Strachan, M. Hu, R. S. Williams, and V. Srikumar, “Isaac: A convolutionalneural network accelerator with in-situ analog arithmetic in crossbars,”in , 2016, pp. 14–26.[21] X. Qiao, X. Cao, H. Yang, L. Song, and H. Li, “Atomlayer: A universalreram-based cnn accelerator with atomic layer computation,” in , 2018, pp.1–6.[22] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio,“Binarized neural networks: Training deep neural networks with weightsand activations constrained to+ 1 or-1,” arXiv preprint arXiv:1602.02830 ,2016.[23] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net:Imagenet classification using binary convolutional neural networks,” in
European conference on computer vision . Springer, 2016, pp. 525–542.[24] C.-X. Xue, T.-Y. Huang, J.-S. Liu, T. Chang, H.-Y. Kao, J. Wang,T. Liu, S.-Y. Wei, S.-P. Huang, W.-C. Wei et al. , “15.4 a 22nm 2mbreram compute-in-memory macro with 121-28tops/w for multibit maccomputing for tiny ai edge devices,” , pp. 244–246, 2020.[25] M. V. Nair, L. K. Muller, and G. Indiveri, “A differential memristivesynapse circuit for on-line learning in neuromorphic computing systems,”
Nano Futures , vol. 1, no. 3, p. 035003, 2017.[26] V. Joshi, M. Le Gallo, S. Haefeli, I. Boybat, S. R. Nandakumar,C. Piveteau, M. Dazzi, B. Rajendran, A. Sebastian, and E. Eleftheriou,“Accurate deep neural network inference using computational phase-change memory,”
Nature Communications , vol. 11, no. 1, p. 2473, May2020.[27] P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. J. Yang, andH. Qian, “Fully hardware-implemented memristor convolutional neuralnetwork,”
Nature , vol. 577, no. 7792, pp. 641–646, 2020.[28] A. D. Booth, “A signed binary multiplication technique,”
The QuarterlyJournal of Mechanics and Applied Mathematics , vol. 4, no. 2, pp. 236–240, 1951.
[29] A. Avizienis, “Signed-digit number representations for fast parallel arithmetic,”
IRE Transactions on Electronic Computers , vol. EC-10,no. 3, pp. 389–400, 1961.[30] T. Ahmad, W. Devulder, K. Opsomer, M. Minjauw, U. Celano,T. Hantschel, W. Vandervorst, L. Goux, G. S. Kar, and C. Detavernier,“Influence of the chalcogen element on the filament stability incuin(te,se,s)2/al2o3 filamentary switching devices,”
ACS AppliedMaterials & Interfaces , vol. 10, no. 17, pp. 14 835–14 842, 2018, pMID:29652471. [Online]. Available: https://doi.org/10.1021/acsami.7b18228[31] Y.-J. Huang and S.-C. Lee, “Graphene/h-bn heterostructures for verticalarchitecture of rram design,”
Scientific Reports , vol. 7, no. 1, p. 9679, Aug2017. [Online]. Available: https://doi.org/10.1038/s41598-017-08939-2[32] L. Gao, P. Chen, and S. Yu, “Demonstration of convolution kerneloperation on resistive cross-point array,”
IEEE Electron Device Letters ,vol. 37, no. 7, pp. 870–873, 2016.[33] X. Wang, Q. Wang, F.-H. Meng, S. H. Lee, and W. D. Lu, “Deep neuralnetwork mapping and performance analysis on tiled rram architecture,”in . IEEE, 2020, pp. 141–144.[34] F. Cai, J. M. Correll, S. H. Lee, Y. Lim, V. Bothra, Z. Zhang, M. P. Flynn,and W. D. Lu, “A fully integrated reprogrammable memristor–cmossystem for efficient multiply–accumulate operations,”
Nature Electronics ,vol. 2, no. 7, pp. 290–299, 2019.[35] Z. Jiang, Y. Wu, S. Yu, L. Yang, K. Song, Z. Karim, and H.-S. P. Wong,“A compact model for metal–oxide resistive random access memory withexperiment verification,”
IEEE Transactions on Electron Devices , vol. 63,no. 5, pp. 1884–1892, 2016.[36] C. Xue, W. Chen, J. Liu, J. Li, W. Lin, W. Lin, J. Wang, W. Wei,T. Chang, T. Chang et al. , “24.1 a 1mb multibit reram computing-in-memory macro with 14.6ns parallel mac computing time for cnn basedai edge processors,” in , 2019, pp. 388–390.[37] Q. Dong, M. E. Sinangil, B. Erbagci, D. Sun, W. Khwa, H. Liao, Y. Wang,and J. Chang, “15.3 a 351tops/w and 372.4gops compute-in-memorysram macro in 7nm finfet cmos for machine-learning applications,” in ,2020, pp. 242–244.[38] R. Mochida, K. Kouno, Y. Hayata, M. Nakayama, T. Ono, H. Suwa,R. Yasuhara, K. Katayama, T. Mikawa, and Y. Gohou, “A 4m synapsesintegrated analog reram based 66.5 tops/w neural-network processor withcell current controlled writing and flexible network architecture,” in , 2018, pp. 175–176.[39] Q. Liu, B. Gao, P. Yao, D. Wu, J. Chen, Y. Pang, W. Zhang, Y. Liao,C. Xue, W. Chen, J. Tang, Y. Wang, M. Chang, H. Qian, and H. Wu, “33.2a fully integrated analog reram based 78.4tops/w compute-in-memorychip with fully parallel mac computing,” in , 2020, pp. 500–502.[40] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learningapplied to document recognition,”
Proceedings of the IEEE , vol. 86,no. 11, pp. 2278–2324, 1998.[41] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classificationwith deep convolutional neural networks,”
Communications of the ACM ,vol. 60, no. 6, pp. 84–90, 2017.[42] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for imagerecognition,” in
Proceedings of the IEEE conference on computer visionand pattern recognition , 2016, pp. 770–778.[43] K. Simonyan and A. Zisserman, “Very deep convolutional networks forlarge-scale image recognition,” arXiv preprint arXiv:1409.1556 , 2014.
Rui Xiao (Student Member, IEEE) received the Bachelor's degree from the College of Information Science and Electronic Engineering, Zhejiang University, in 2019. She is currently pursuing the Ph.D. degree in the School of Information Science and Electronic Engineering, Zhejiang University, under the supervision of Prof. Huang. Her research interests include in-memory computing circuits and systems design using emerging resistive non-volatile memories, deep learning accelerators, and embedded system design.
Kejie Huang (Senior Member, IEEE) received the Ph.D. degree from the Department of Electrical Engineering, National University of Singapore (NUS), Singapore, in 2014. He has been a Principal Investigator with the College of Information Science and Electronic Engineering, Zhejiang University (ZJU), since 2016. Prior to joining ZJU, he spent five years in the IC design industry, including Samsung and Xilinx, two years at the Data Storage Institute, Agency for Science Technology and Research (A*STAR), and another three years at the Singapore University of Technology and Design (SUTD), Singapore. He has authored or coauthored 40 scientific articles in international peer-reviewed journals and conference proceedings. He holds four granted international patents, with another eight pending. His research interests include low power circuits and systems design using emerging non-volatile memories, architecture and circuit optimization for reconfigurable computing systems and neuromorphic systems, machine learning, and deep learning chip design. He currently serves as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-PART II: EXPRESS BRIEFS.
Yewei Zhang (Student Member, IEEE) received the bachelor's degree from the College of Information Science & Electronic Engineering, Zhejiang University, in 2018. He is currently studying for a master's degree at the College of Information Science & Electronic Engineering, Zhejiang University. He is interested in in-memory computing and non-volatile memories.