An 8-bit In Resistive Memory Computing Core with Regulated Passive Neuron and Bit Line Weight Mapping

Abstract

The rapid development of Artificial Intelligence (AI) and the Internet of Things (IoT) increases the demand for edge computing devices with low power and relatively high processing speed. Computing-In-Memory (CIM) schemes based on emerging resistive Non-Volatile Memory (NVM) show great potential in reducing the power consumption of AI computing. However, the device inconsistency of the non-volatile memory may significantly degrade the performance of the neural network. In this paper, we propose a low-power Resistive RAM (RRAM) based CIM core that not only achieves high computing efficiency but also greatly enhances robustness through a bit line regulator and a bit line weight mapping algorithm. The simulation results show that the power consumption of the proposed 8-bit CIM core (256*256) is only 3.61 mW. The SFDR and SNDR of the CIM core reach 59.13 dB and 46.13 dB, respectively. The proposed bit line weight mapping scheme improves the top-1 accuracy by 2.46% and 3.47% for AlexNet and VGG16, respectively, on the ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC 2012) in 8-bit mode.


Yewei Zhang, Student Member, IEEE, Kejie Huang, Senior Member, IEEE, Rui Xiao, Student Member, IEEE, and Haibin Shen


Index Terms: In-memory computing, Non-volatile memory, Neuromorphic chip, Resistance inconsistency, Weight quantization and mapping

K. Huang and H. Shen are with the College of Information Science & Electronic Engineering, Zhejiang University, 38 Zheda Road, Hangzhou, China, 310027, and also with Zhejiang Lab, Building 10, China Artificial Intelligence Town, 1818 Wenyi West Road, Hangzhou City, Zhejiang Province, China. Y. Zhang and R. Xiao are with the College of Information Science & Electronic Engineering, Zhejiang University, 38 Zheda Road, Hangzhou, China, 310027.

I. INTRODUCTION

In the past decade, with the Internet of Things, cloud computing, computer vision, and artificial intelligence becoming increasingly connected to perform perception, cognition, decision, and interaction, sensing devices in intelligent products are becoming the key interfaces to the real world. However, communication, storage, information retrieval, computation, and recognition will face great challenges due to the extremely large amount of sensing data. Because data acquisition, processing, and analysis are separated, conventional intelligent systems suffer from problems such as high construction cost, high power consumption, low efficiency, and long latency [1]. To address these issues, the majority of AI computations will be moved to light-weight IoT devices [2]. Nevertheless, Moore's Law has come to an end, and processor performance will benefit little from scaling down the Complementary Metal Oxide Semiconductor (CMOS) technology node. Therefore, we have to design new hardware architectures and software algorithms to meet the requirements of perception, computation, and storage at end devices with limited computation capability and storage resources.

The high-density and low-power emerging resistive Non-Volatile Memory (NVM) [3]–[11], which enables massively parallel Computing-In-Memory (CIM), is a promising candidate to solve the above-mentioned issues [12]–[14]. The majority of works utilize the multilevel resistance of the resistive memory for both storage and computation [15]–[18]. For example, Hewlett Packard Laboratories (HPL) proposed a Dot Product Engine (DPE) with an inverting amplifier [19]. [20] designed In-Situ Analog Arithmetic in Crossbars (ISAAC), which utilizes eight 4-level RRAM cells to represent a 16-bit weight. Though resistive NVM provides a potential solution as the CIM unit, its non-ideal properties greatly degrade the reliability of the system. A few widely known properties of resistive NVM are the non-linear dependence of the resistance on the biasing voltage, level-to-level resistance variation, and cell-to-cell resistance variation, which cause significant quantization errors and result in accuracy loss in the network.
To reduce the mapping errors and improve the linearity of the CIM system, a more reliable design is needed, which may come at the cost of increased computing energy.

[21], [22] proposed the Serial-Input Non-Weighted Product (SINWP), whose inputs are modulated by time instead of by analog voltage, which addresses the non-linearity caused by the biasing voltage. However, the digital-to-time converter greatly increases the computing time at high data width. [23] proposed a novel Multiple Binary RRAM with Active Integrator (MBRAI) CIM core architecture, where multiple binary RRAM cells are used to store one weight. MBRAI can save a lot of power because binary code is used at the input instead of a time signal. Therefore, it requires only $n$ CIM computations instead of $2^n$. The n-bit input data are sequentially computed by the CIM core and weighted at the output neurons, which greatly improves the linearity because of the identical input voltage. However, the power consumption of this scheme is dominated by the operational amplifier [...] the power consumption of the proposed 256*256 CIM core in 8-bit mode is reduced by 98.2% compared with MBRAI.

The rest of this paper is organized as follows. Section II introduces the background of CIM with resistive NVMs. Section III presents the proposed circuit, the problems brought by resistance inconsistency, and the corresponding optimization. Finally, simulation results are presented in Section IV, with the conclusion in Section V.

II. BACKGROUND AND RELATED WORKS

The majority of the computations in a neural network are matrix multiply and accumulate operations, which can be well implemented by the crossbar architecture shown in Fig. 1. The processing units, shown as black dots, multiply the input from the word lines by the stored weight. The neuron, represented by the triangle, accumulates the multiplication results on the same bit line.

Fig. 1. Microarchitecture of a CIM core. The word lines get input data from the former neurons, while the bit lines accumulate data at the neurons (triangles). The black dots are the computing units.

In conventional schemes, the weights of the neural network are stored in SRAM. For example, [24] proposed a 7-bit input and 1-bit weight MAC using a 10T SRAM cell. However, the precision is limited by the 1-bit weight and the area is large due to the SRAM array. Emerging NVMs, which have high density and a simpler structure as memory units, will greatly improve the precision of the weight and reduce the core area for CIM applications. A CIM core design with NVM in the DPE is shown in Fig. 2. It employs a memristor crossbar for matrix multiplication, where each memristor stores a weight as its resistance. Once the input is converted to an analog voltage by the DAC, the output voltage is determined by the conductance of the memristor as $V_{out} = \sum V_{in} G R_f$, where $R_f$ is the feedback resistance and $G$ is the conductance of the cross-point memristor device. After that, the output voltage is digitized by the ADC for data transmission. The DAC and ADC, which are power-hungry components, are necessary to resist noise and signal deformation in data transmission.

Fig. 2. Basic architecture of HPL's Dot-Product Engine.

Fig. 3. Simplified integration circuit with OPA.

MBRAI, proposed in [23], moves the input DACs to the output and shares the ADC for lower power consumption. RRAM is chosen as the storage unit for its reconfigurability, high density, and low power consumption. However, it is a great challenge to precisely control the resistance value of RRAM. Therefore, MBRAI utilizes n RRAM cells with binary resistance states, where the high resistance state (HRS) represents 0 and the low resistance state (LRS) represents 1, to represent an n-bit weight and achieve a high Effective Number of Bits (ENOB) for weight mapping. Fig. 3 is the simplified integration circuit of MBRAI. The input is sent in bit by bit from the Least Significant Bit (LSB) to the Most Significant Bit (MSB), which makes the computation more reliable since every bit is processed identically. The significance of each bit of the input data and of the network weights is applied by charge redistribution at the neurons.

Though MBRAI achieves better computing reliability, there are still two critical issues that need to be addressed.
Firstly, the power consumed by amplifiers in active integrators accounts for more than 95% of the energy cost of the whole CIM core. Secondly, the resistance of the RRAM cells has a wide distribution, resulting in significant quantization errors when mapping the weights of the neural network into the RRAM array. To address the first issue, a passive integrator is proposed, and a regulator is designed to improve the linearity of the integration results. The details are introduced in Section III.A. To address the second issue, a pseudo-binary quantization and bit line weight mapping method is proposed to reduce the impact of the resistance inconsistency. The details are introduced in Section III.B.

[Figure, apparently Fig. 5 (caption not recovered): schematic of the proposed CIM core. Recoverable labels: Input Buffer & Timing Control, word lines WL, 1R1T array, n* MUXn, Regulated Passive Neurons, capacitor array, Sampling Capacitance Inside the ADC, ADC (charge redistribution in the meantime), N-bit output.]

Fig. 4. Passive integrator with amplifier removed and its integration process.

III. PROPOSED CIM CORE AND MAPPING METHOD

Although a passive integrator can significantly reduce power consumption by removing the amplifier, it suffers from serious nonlinearity. Fig. 4 shows the passive integrator circuit and its integration process, where the current decreases as the integrating voltage $V_C$ decreases. To improve the linearity of the circuit, we design an optimized n-bit integral multiplier, shown in Fig. 5:

1) We switch the position of the RRAM and the transistor in 1T1R so that the reading voltage on the RRAM cell is mainly determined by the gate and threshold voltages of the transistor. To differentiate it from the conventional structure, the new structure is named 1R1T.

2) The saturation current of the transistor in 1R1T can be influenced by changes in the integrating voltage because of the channel length modulation effect. To minimize the impact of the integrating voltage, we add an NMOS $T_0$ at the bit line to isolate the integrating voltage from the drain voltage of 1R1T and thus reduce the variation of the integrating current.

3) Because the load of the bit line is affected by the number of input lines and the values of the weights, the linearity of the circuit is still influenced by changes in the source voltage of $T_0$ (the drain voltage of the 1R1T). Therefore, a regulator is added at $T_0$ to stabilize the drain voltage of 1R1T.

4) Besides the nonlinearity in the bit line voltage, the cell-to-cell variation makes the integrating currents of the devices inconsistent, which decreases the robustness of the system. To improve reliability, we propose a pseudo-binary quantization and bit line weight mapping method, with a corresponding circuit, which utilizes the uncertainty of resistive NVM to reduce the quantization error.

A. CIM Core with Regulated Passive Integrator

1) Core Design:

Assuming the n-bit input sequence is $X_1, X_2, \ldots, X_l$ and the weights are $W_1, W_2, \ldots, W_l$, the multiplication and accumulation (MAC) can be expressed as

$$Y = \sum_{i=1}^{l} X_i W_i = \sum_{i=1}^{l} \sum_{j=0}^{n-1} 2^j x_{i,j} W_i = \sum_{i=1}^{l} \sum_{j=0}^{n-1} 2^j \sum_{k=0}^{n-1} 2^k x_{i,j} w_{i,k} \tag{1}$$

where $x_{i,n-1} x_{i,n-2} \ldots x_{i,0}$ and $w_{i,n-1} w_{i,n-2} \ldots w_{i,0}$ are the binary formats of $X_i$ and $W_i$, respectively ($x_{i,j}, w_{i,k} \in \{0, 1\}$). It can be observed from Eq. 1 that there are three consecutive accumulations. The proposed CIM core utilizes n integrator cells to get $\sum_{i=1}^{l} x_{i,j} w_{i,k}$ by charge integration, and the results are stored in the regulated passive neuron, composed of the capacitance array in Fig. 5, for charge redistribution to get $\sum_{i=1}^{l} x_{i,j} W_i$. The $\sum_{i=1}^{l} x_{i,j} W_i$ terms are also added up by charge redistribution to get $\sum_{i=1}^{l} X_i W_i$. The integration for the resistances on the same bit line is finished simultaneously by sharing the integrator, so that the MAC can be finished in parallel to achieve a smaller core area and faster computing speed. Multiple neurons are enabled at a time in the integration phase, in which the inputs are divided into n cycles and calculated from LSB to MSB. After integration and charge redistribution, the data conversion phase starts, in which the neurons convert the analog results into digital output.
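The three nested accumulations of Eq. 1 can be checked in software. The following is a minimal sketch, assuming unsigned n-bit integers; the function names `to_bits` and `mac_bit_serial` and the sample values are illustrative and do not come from the paper.

```python
def to_bits(value, n):
    """Return the n binary digits of a non-negative integer, LSB first."""
    return [(value >> j) & 1 for j in range(n)]


def mac_bit_serial(inputs, weights, n):
    """Evaluate the right-hand side of Eq. 1 with explicit sums over i, j, k."""
    total = 0
    for j in range(n):          # input bit position: one integration cycle each
        for k in range(n):      # weight bit position: one bit line each
            # Innermost sum over the input lines i: the quantity that a
            # single integrator accumulates in one cycle.
            partial = sum(to_bits(x, n)[j] * to_bits(w, n)[k]
                          for x, w in zip(inputs, weights))
            total += (1 << j) * (1 << k) * partial
    return total


# Illustrative 8-bit inputs and weights (hypothetical values).
inputs = [186, 3, 255, 0]
weights = [236, 17, 1, 200]
direct = sum(x * w for x, w in zip(inputs, weights))
print(mac_bit_serial(inputs, weights, 8) == direct)
```

The bit-serial result matches the direct dot product because each operand is simply expanded into its binary digits before the sums are exchanged.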

2) Integral Multiplier:

The word line inputs shown in Fig. 5 are sent in one bit at a time from LSB to MSB. The multiplication in the integral multiplier consists of an integration phase and a charge redistribution phase. In the integration phase, one of the four switches $S$ of the integrator is closed and the other three are open. After the integration, the charge is redistributed in the charge redistribution phase, with two of the switches open and the other two closed. Taking $C_{n-1}$ as an example, the integrating voltage after the integration phase is

$$V_{c,n-1} = V_{c,n-1}^{-} - V_D \sum_{i=0}^{l-1} \frac{D_i T}{C_f R_i} \tag{2}$$

where $V_{c,n-1}^{-}$ is the initial voltage of $C_{n-1}$, $D_i$ is one input bit of the $i$th input line, $l$ is the number of input lines, $T$ is the integration time, $R_i$ is the equivalent resistance of the 1R1T unit of the $i$th input line, and $V_D$ is the drain voltage of the 1R1T unit. The capacitances satisfy the following constraint

$$C_f = 2 C_{n-1} = 2^2 C_{n-2} = 2^3 C_{n-3} = \ldots = 2^n C_0 \tag{3}$$

Assuming there is only one input line and the initial integrating voltage is set to $V_C^{-}$, the integrating voltage $V_S$ after one step of charge redistribution is

$$V_S = \frac{V_{c,n-1} C_{n-1} + V_{c,n-2} C_{n-2} + \ldots + V_{c,0} C_0 + V_C^{-} C_0}{C_{n-1} + C_{n-2} + \ldots + 2 C_0} = V_C^{-} - \frac{V_D T}{C_f} \left( 2^{-1} \sum_{i=0}^{l-1} \frac{D_{i,n-1}}{R_{i,n-1}} + 2^{-2} \sum_{i=0}^{l-1} \frac{D_{i,n-2}}{R_{i,n-2}} + \ldots + 2^{-n} \sum_{i=0}^{l-1} \frac{D_{i,0}}{R_{i,0}} \right) \tag{4}$$

The ADC's sampling capacitance $C_S$, which is connected to the integrator's capacitance array, is at the same time used to add up the n partial products. Letting $C_S = C_f$, the new $V_{out}$ is

$$V_{out} = 2^{-1} \left( V_S + V_{out}^{-} \right) \tag{5}$$

where $V_{out}^{-}$ represents the former voltage of $C_S$. Assuming the initial voltage of $C_S$ is $V_{init}$, then after n steps of charge redistribution, the voltage change is

$$\Delta V_{out} = V_{init} - \left( 2^{-n} V_{init} + 2^{-n} V_{S,0} + \ldots + 2^{-1} V_{S,n-1} \right) = 2^{-n} \left[ (V_{init} - V_{S,0}) + \ldots + 2^{n-1} (V_{init} - V_{S,n-1}) \right] = 2^{-n} \sum_{j=0}^{n-1} 2^j \Delta V_{S,j} \tag{6}$$

where $V_{S,n-1}$ is the $(n-1)$th integrating voltage $V_S$ and $\Delta V_{S,j}$ is the voltage change of $V_S$ in the $j$th integration. As long as $\Delta V_{S,j}$ is designed to represent the result of $\sum_{i=1}^{l} x_{i,j} W_i$, Eq. 6 gives the result of $\sum_{i=1}^{l} \sum_{j=0}^{n-1} 2^j x_{i,j} W_i$.
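The binary weighting obtained by repeated charge sharing can be cross-checked numerically. The sketch below assumes ideal, lossless charge sharing between two equal capacitances ($C_S = C_f$), so that each step averages the two voltages as in Eq. 5; the function names and the voltages are illustrative, not simulation results.

```python
def accumulate(v_init, v_s_list):
    """Apply the averaging step of Eq. 5 once per partial result, LSB first."""
    v_out = v_init
    for v_s in v_s_list:
        v_out = 0.5 * (v_s + v_out)
    return v_out


def delta_v_closed_form(v_init, v_s_list):
    """Closed form of Eq. 6: 2^-n * sum_j 2^j * (v_init - V_S,j)."""
    n = len(v_s_list)
    return sum((2 ** j) * (v_init - v_s)
               for j, v_s in enumerate(v_s_list)) / 2 ** n


# Hypothetical integrating voltages V_S,0 ... V_S,3 (volts), LSB first.
v_init = 1.0
v_s_list = [0.90, 0.80, 0.95, 0.70]
step_by_step = v_init - accumulate(v_init, v_s_list)
print(abs(step_by_step - delta_v_closed_form(v_init, v_s_list)) < 1e-12)
```

Each later partial result is halved one time fewer than the previous one, which is why sending the input LSB first yields the binary weights $2^j$ without any explicit scaling circuit.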

3) Regulated Passive Integrator:

By switching the position of 1T1R in Fig. 3 to 1R1T in Fig. 5, we get the following equations

$$I = \frac{1}{2} K_2 \left( V_{G2} - V_R - V_{th2} \right)^2 \tag{7}$$

$$I = \frac{V_R}{R} \tag{8}$$

where $K_2$ is the device parameter of the 1R1T transistor $T_2$, $V_{th2}$ is its threshold voltage, $V_{G2}$ is its gate voltage, $R$ is the resistance of the RRAM device, $V_R$ is the read voltage of the resistance, and $I$ is the integrating current passing through the 1R1T unit. According to Eqs. 7 and 8, we get

$$V_R = V_{G2} - V_{th2} - \frac{\sqrt{2 K_2 R \left( V_{G2} - V_{th2} \right) + 1} - 1}{K_2 R} \tag{9}$$

The drain voltage of $T_2$ ($V_D$), which is isolated from the integrating voltage by $T_0$, satisfies the following equation

$$I_b = \frac{1}{2} K_0 \left( V_{G0} - V_D - V_{th0} \right)^2 \tag{10}$$

where $I_b$ is the integrating current of the bit line, $K_0$ is the device parameter of $T_0$, and $V_{G0}$ and $V_{th0}$ are its gate and threshold voltages. The proposed regulator circuit shown in Fig. 5 stabilizes the $V_D$ of the 1R1T units by applying negative feedback. $T_1$ works in the saturation region, which satisfies the following equation

$$I_{ref} = \frac{1}{2} K_1 \left( V_D - V_{th1} \right)^2 \tag{11}$$

where $K_1$ is the device parameter of $T_1$ and $V_{th1}$ is the threshold voltage of $T_1$. According to Eqs. 10 and 11, we get

$$V_{G0} = V_{th1} + \sqrt{\frac{2 I_{ref}}{K_1}} + V_{th0} + \sqrt{\frac{2 I_b}{K_0}} \tag{12}$$

$$V_D = V_{th1} + \sqrt{\frac{2 I_{ref}}{K_1}} \tag{13}$$

Since $I_{ref}$ is a constant, the drain voltage $V_D$ of the 1R1T unit is stabilized by the regulator.

B. Pseudo-binary Quantization and Bit Line Weight Mapping Method

As the weights of the neural network are quantized to n bits rather than kept as continuous values, the quantization errors introduced when mapping the weights into the CIM system influence the accuracy of inference. Moreover, the resistance distribution of resistive NVM may worsen the quantization. Therefore, it is necessary to discuss the quantization method and the corresponding errors in this section. To reduce the quantization error caused by the cell-to-cell variation, a pseudo-binary quantization and bit line weight mapping method is proposed.

1) Quantization Error with NVM:

Quantization is an important method for compressing the neural network and accelerating the computation, among which uniform quantization is a basic one. The typical quantizer of uniform quantization can be expressed as

$$Q(x) = \Delta \cdot \left\lfloor \frac{x}{\Delta} + \frac{1}{2} \right\rfloor \tag{14}$$

where $\Delta$ is the quantization step size and $x$ is the value to be quantized. When the quantization step size ($\Delta$) is small relative to the variation in the signal being quantized, it is simple to show that the mean squared error, which is also called the quantization noise power, produced by such a rounding operation is $\Delta^2 / 12$. The calculation process is

$$QE = \frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} x^2 \, dx = \frac{\Delta^2}{12} \tag{15}$$

The maximum ($w_{max}$) and the minimum ($w_{min}$) of the data range and the number of quantization bits $n$ determine the quantization step size, since they usually satisfy

$$\Delta \times 2^n = \left( w_{max} - w_{min} \right) \tag{16}$$

Considering the resistance distribution, the practical non-linear quantizer satisfies

$$\sum_{i=1}^{2^n} \Delta_i = \left( w_{max} - w_{min} \right) \tag{17}$$

Assuming the resistance distribution is a general normal distribution represented as

$$f\left( x \mid \mu, \sigma^2 \right) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}} \tag{18}$$

the Probability Density Function (PDF) of the quantization error is the noncentral chi-squared distribution with one degree of freedom ($k = 1$) and noncentrality parameter $\lambda$. Then, the mean value of the quantization error is given by

$$\mu_{QE} = k + \lambda = 1 + \mu^2 = 1 + \Delta^2 = 1 + \frac{\left( w_{max} - w_{min} \right)^2}{2^{2n}} \tag{19}$$

and the variance of the quantization error is

$$\sigma_{QE}^2 = 2 (k + 2 \lambda) = 2 + 4 \mu^2 = 2 + 4 \Delta^2 = 2 + \frac{\left( w_{max} - w_{min} \right)^2}{2^{2n-2}} \tag{20}$$

As we can see, the quantization error is greatly increased when there is a distribution in resistance. Since the quantization errors are accumulated in the network, the accuracy will be greatly reduced.
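The ideal-case noise power of the uniform quantizer, $\Delta^2/12$, can be checked with a short Monte Carlo experiment. This is a sketch under stated assumptions: the values to be quantized are drawn uniformly from $[-1, 1]$ and quantized to 8 bits; the function names, the range, and the sample count are illustrative choices, not taken from the paper.

```python
import math
import random


def quantize(x, delta):
    """Uniform quantizer of Eq. 14: round x to the nearest multiple of delta."""
    return delta * math.floor(x / delta + 0.5)


def empirical_mse(delta, samples=100_000, seed=0):
    """Estimate the quantization noise power for values uniform in [-1, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        total += (quantize(x, delta) - x) ** 2
    return total / samples


n_bits = 8
delta = (1.0 - (-1.0)) / 2 ** n_bits      # Eq. 16 with w_max = 1, w_min = -1
print(empirical_mse(delta), delta ** 2 / 12)
```

The two printed values should agree to within sampling noise. This reproduces only the ideal quantizer; it does not model the resistance distribution of Eqs. 17 to 20.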

2) Resistance Measurement:

The proposed quantization and mapping method needs the resistance values of the RRAM array in LRS, so we first set all memory units to LRS and read the resistance of the RRAM array with the ADC in the resistance reading phase. The reading process consists of an integration phase and a charge redistribution phase, with the switches set differently from multiplication. For example, when reading the resistance unit $R_{n-1,m-1}$ in Fig. 5, the switches on the same bit line as $R_{n-1,m-1}$ are used while the others stay open. In the integration phase, three of the four switches $S$ are open and one is closed, and the input of the $(m-1)$th word line is 1 while the others are 0. The integration result is a special case of Eq. 2 where $l = 1$ and $D = 1$. When the ADC reads the integrating voltage during the charge redistribution, three of the switches are closed and one is open. The voltage read by the ADC is

$$V_{out} = \frac{V_{init} + V_S}{2} \tag{21}$$

where $V_{init}$ is the initial voltage for both the sampling capacitances and the integration capacitances, and $V_S$ is the integration result. The integration process satisfies

$$V_{init} - V_S = \frac{I T}{C_f} = \frac{V_D T}{R C_f} \tag{22}$$

where $I$ is the integrating current passing through the 1R1T unit, $T$ is the integration time, $R$ is the resistance of the 1R1T unit, and $V_D$ is the read voltage of the bit line. Therefore, we get

$$R = \frac{V_D T}{2 C_f \left( V_{init} - V_{out} \right)} \tag{23}$$
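The read-out relation can be exercised as a round trip: compute the voltage that the ADC would see for a known resistance, then recover the resistance from that voltage. The sketch assumes that the sampling capacitance equals $C_f$, so that charge sharing averages $V_{init}$ and $V_S$ (the reading of Eq. 21 adopted here), and that $V_D$ stays constant during integration. The numeric values of `v_d`, `c_f`, and `r_true` are hypothetical; only the 1 V initial voltage and the 110 ns integration time echo the verification section.

```python
def v_out_from_resistance(r, v_init, v_d, t, c_f):
    """Forward model: integrate for time t (Eq. 22), then share charge (Eq. 21)."""
    v_s = v_init - v_d * t / (r * c_f)
    return 0.5 * (v_init + v_s)   # assumes C_S == C_f


def resistance_from_v_out(v_out, v_init, v_d, t, c_f):
    """Inverse model: solve the two relations above for R."""
    return v_d * t / (2.0 * c_f * (v_init - v_out))


v_init = 1.0        # volts
t = 110e-9          # seconds
v_d = 0.2           # volts, hypothetical bit line read voltage
c_f = 1e-12         # farads, hypothetical integration capacitance
r_true = 1e6        # ohms, hypothetical LRS resistance

v_out = v_out_from_resistance(r_true, v_init, v_d, t, c_f)
r_est = resistance_from_v_out(v_out, v_init, v_d, t, c_f)
print(v_out, r_est)
```

In practice the ADC resolution limits how finely `v_out` is known, so the recovered resistance carries a quantization uncertainty that this ideal round trip does not show.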

3) Quantization and Mapping Method:

Since the normalized resistance in LRS is not exactly digital 1, a pseudo-binary code, in which the significance of the bits from MSB to LSB is still the same as in the conventional binary code, is proposed in our mapping scheme. The main difference is that the value of the pseudo-binary code is related to the resistance of the memory unit, which is given by

$$\hat{w} = r_{n-1} \times 2^{n-1} + r_{n-2} \times 2^{n-2} + \ldots + r_0 \times 2^0 \tag{24}$$

where, for the LRS, $r_i$ is the normalized resistance of the $i$th bit of the weight (mean value is 1). For the high resistance state, since the resistance can be much larger than the low resistance, $r_i$ is set to 0, and the uncertainty of the high resistance is ignored in this paper. The weight quantization proceeds from MSB to LSB, and we define the condition as follows

$$q = \, ! \left[ \left( r_i \times 2^{i-1} - w_{res} > [\ldots] \right) \mid \left( r_i \le 0.[\ldots] \right) \mid \left( r_i \times 2^{i-1} > 2 \times w_{res} \right) \right] \tag{25}$$

where $r_i$ is the $i$th bit of the normalized memory resistance and $w_{res}$ is the remaining weight after partial quantization. The component $r_i \le 0.[\ldots]$ abandons devices whose resistance in LRS is too large, and $r_i \times 2^{i-1} - w_{res} > [\ldots]$ checks whether the remaining weight is larger than the product of the significance of the bit and the resistance of the memory. The component $r_i \times 2^{i-1} > 2 \times w_{res}$ minimizes the quantization error of the LSB. Because of the memory resistance distribution, $| r_i \times 2^{i-1} - w_{res} |$ could be larger than $w_{res}$. In other words, the LSB memory cell should be in the high resistance state when $| r_i - w_{res} | > w_{res}$, to minimize the quantization error.

However, the initial memory sequence may not be the best solution to minimize the quantization error. For example, assume $w = 13.4$ and that four memory units with normalized resistances 1.05, 1.1, 1.125, and 0.93 are used to quantize the weight. The conventional binary code gives a quantization error of 0.4 (4'b1101). Based on the given sequence, the resistance states of the four cells are low, low, high, low, which reduces the quantization error to -0.33.
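The example above can be reproduced with a short sketch of the MSB-to-LSB decision in the spirit of Eq. 25. Two points are assumptions rather than facts from the paper: the overshoot `threshold` of 0.5 is a hypothetical value chosen only because it is consistent with the worked example (the constant is not legible in this copy), and the check that discards cells with an out-of-range LRS resistance is omitted. Resistances are listed MSB first, and state 1 means LRS.

```python
def pseudo_binary_value(resistances, states):
    """Eq. 24: weighted sum of normalized resistances (MSB first, 1 = LRS)."""
    n = len(resistances)
    return sum(r * s * 2 ** (n - 1 - idx)
               for idx, (r, s) in enumerate(zip(resistances, states)))


def quantize_msb_first(w, resistances, threshold=0.5):
    """Choose LRS/HRS per cell from MSB to LSB; threshold is hypothetical."""
    n = len(resistances)
    states = []
    w_res = w
    for idx, r in enumerate(resistances):
        contribution = r * 2 ** (n - 1 - idx)
        overshoot = contribution - w_res > threshold
        worse_than_skipping = contribution > 2 * w_res
        use_cell = not (overshoot or worse_than_skipping)
        states.append(1 if use_cell else 0)
        if use_cell:
            w_res -= contribution
    return states, w_res


cells = [1.05, 1.1, 1.125, 0.93]
states, error = quantize_msb_first(13.4, cells)
print(states)                                   # low, low, high, low
print(round(error, 2))                          # -0.33
print(round(13.4 - pseudo_binary_value(cells, states), 2))
```

The remaining weight after the last bit equals $w - \hat{w}$, so the two printed errors coincide.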
Furthermore, if we switch the third cell with the MSB, the resistances of the four cells become 1.125, 1.1, 1.05, and 0.93. In such a sequence, the resistance states of the four cells can be set to low, low, high, and high to reduce the quantization error to 0. This example shows that the sequence of the memory units is very important for minimizing the quantization error.

A traversal algorithm can be used to search all possible sequences, and the sequence with the minimal quantization error is picked and configured in the chip. However, it may require a long searching time, and its computational complexity is $O(n!)$. Moreover, the cells on the same bit line should be in the same sequential position. Assuming the size of the weight matrix to be quantized is R*1 and the size of the RRAM array is R*C, where C means the weight is quantized to C bits, we propose a greedy mapping algorithm whose loss when quantizing the $i$th bit is defined as

$$\text{loss} = \max_{j \in R} \left( w_{i-1,j} - \hat{w}_{i,j} \right) \sum_{j=1}^{R} \left( w_{i-1,j} - \hat{w}_{i,j} \right) \tag{26}$$

where $w_{i-1,j}$ is the remaining value of the $j$th weight after partial quantization and $\hat{w}_{i,j}$ is the value quantized by the pseudo-binary quantization method in the $i$th bit of the $j$th row. Eq. 26 takes both the worst case and the average case of the searching results into consideration. The bit line selection starts from the MSB, which influences the mapping result most, and proceeds to the LSB. The algorithm traverses the remaining bit lines and chooses the bit line with the minimum loss as the $i$th bit. To apply the algorithm in the circuit, the n* MUXn in Fig. 5 is used for switching the connection between the bit lines of the RRAM array and the integrators. When mapping the weights to the core, all the RRAMs are set to LRS at first. Then, according to the proposed mapping method, RRAMs with value 0 are set to HRS. By using this bit line weight mapping method, the computational complexity is reduced from $O(n!)$ to $O(n^2)$.

IV. SIMULATION RESULTS

In this section, we first verify the function of the multiplication and resistance reading processes. Then we evaluate the dynamic performance and energy cost at the circuit level and compare the design with other CIM schemes at the core level and the network level. Finally, we present the robustness of the circuit. The circuit simulations are done in Cadence Analog Mixed Signal (AMS) with a 45nm generic Process Design Kit (PDK), and the network simulations are done on the Caffe platform.

Fig. 6. Transient simulation results of (a) the integration phase and charge redistribution phase for one bit of input and (b) the core's multiplication process with 8'b10111010 as input and 8'b11101100 as weight.

Fig. 7. Process of resistance reading, where the state of the switches controlling it is presented.

TABLE I
CIM CORE PERFORMANCE COMPARISON BETWEEN MBRAI AND THE PROPOSED

| | MBRAI [23] | Proposed |
|---|---|---|
| Supply Voltage | 1.1V | 1V |
| Computing Speed | 1.85M/s | 1.85M/s |
| SFDR | 67.42dB | 59.13dB |
| SNDR | 45.48dB | 46.13dB |
| ENOB | 7.26bit | 7.37bit |

TABLE II
ENERGY COST COMPARISON BETWEEN MBRAI AND PROPOSED CIM CORE

| | MBRAI [23] | Proposed |
|---|---|---|
| Technology | 45nm | 45nm |
| Supply Voltage | 1.1V | 1V |
| System Clock | 16.7MHz | 16.7MHz |
| Integral Amplifier (MBRAI) / Regulator circuit (Proposed) | 0.22mW | 1.11uW |
| Core (256*256) | 199.68mW | 3.61mW |

A. Functional Verification

1) Multiplication Process Verification:

We simulate the computing process shown in Fig. 6 to check the correctness of the proposed circuit in 8-bit mode. Fig. 6(a) presents the integration phase in an integrator. The integrating voltage $V_C$ shown in Fig. 5 is reset to 1V at 384 ns, and the integration phase starts at 393 ns. After 20 ns, the integration phase is completed and $V_C$ has decreased to 745.2mV. Then the charge redistribution starts at 415 ns. When charge redistribution is done, the 8 integrating voltages are converted to $V_{out}$. After that, $V_C$ is reset to 1V for the next integration. Fig. 6(b) shows the whole multiplication process of an 8-bit input (8'b10111010) and an 8-bit weight (8'b11101100). The input sequence is sent in from LSB to MSB, and after 8 cycles of integration and charge redistribution, the output voltage $V_{out}$ is 831.6mV. Then the ADC converts it to the digital result 8'b10101011. The theoretical results of the output voltage and digital result are 831.5mV and 8'b10101011, respectively. Therefore, the design achieves its functional requirement.
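The reported digital result can be sanity-checked with integer arithmetic. The sketch below rests on one assumption that the text does not state explicitly: that the 8-bit ADC code corresponds to the eight most significant bits of the 16-bit product of input and weight. The function name `expected_code` is illustrative.

```python
def expected_code(x, w, n=8):
    """Top n bits of the 2n-bit product of two unsigned n-bit operands."""
    return (x * w) >> n


x = 0b10111010   # input, 186
w = 0b11101100   # weight, 236
code = expected_code(x, w)
print(x * w)                 # 43896
print(format(code, "08b"))   # 10101011
```

Under this assumption the computed code, 171, equals the 8'b10101011 quoted for both the simulated and the theoretical result.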

2) Resistance Measurement Veriﬁcation:

Fig. 7 presents the resistance measuring process of one 1R1T unit, where the state of the switches on the same bit line is simulated. The output voltage is set to 1V at first, and the integration phase starts at 190 ns. Since only one 1R1T unit is working, the integrating current is small, and thus the integrating time is set to 110 ns, which is much longer than that of the MAC operation. After integration, the sampling phase (i.e., the charge redistribution phase) starts at 440 ns, and the output voltage is 0.994 V. Then the ADC converts it to digital output.

B. Performance Evaluation

1) Circuit Level Performance:

Table I shows the dynamic performance comparison between MBRAI and the proposed core. The computing speed, SFDR, SNDR, and ENOB of the proposed CIM core are 1.85M/s, 59.13dB, 46.13dB, and 7.37bit, respectively, which are close to the performance indicators of MBRAI. Table II gives the power cost comparison between the proposed scheme and MBRAI. MBRAI consumes 0.22 mW on amplifiers for a stable read voltage, while the proposed circuit consumes only 1.11uW on the regulator circuit, and the total power consumption of the core (256*256) is reduced by 98.2%.

2) Core Level Comparison:

The core level comparison between the proposed scheme and the other CIM core schemes is shown in Table III. The simulation results show that the proposed design achieves an energy efficiency as high as 553.01 TMACs/s/W in the 2-bit input, 2-bit weight pattern, 205.30 TMACs/s/W in the 4-bit input, 4-bit weight pattern, and 33.63 TMACs/s/W in the 8-bit input, 8-bit weight pattern. Compared with MBRAI, whose energy efficiency is 77.76 TMACs/s/W in the 1-bit input, 3-bit weight pattern, 38.8 TMACs/s/W in the 2-bit input, 3-bit weight pattern, and 0.61 TMACs/s/W in the 8-bit input, 8-bit weight pattern, the proposed scheme achieves much higher energy efficiency (55.13 times in the 8-bit input, 8-bit weight pattern). Though [27] achieves a low average power consumption in the fixed-4 input and fixed-4 weight pattern, the throughput of the core is limited by the rate coding scheme. Meanwhile, the power consumption in [27] increases with the input value, which may lead to a much higher power consumption in practice. Compared with the other CIM schemes, the proposed CIM core achieves better energy efficiency.
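The efficiency figures quoted above follow from the throughput and power entries of Table III and the core power entries of Table II. The sketch below only redoes that arithmetic; the function names are illustrative.

```python
def efficiency_tmacs_per_w(throughput_gmacs, power_mw):
    """GMAC/s divided by mW equals TMAC/s per W (1e9 / 1e-3 = 1e12)."""
    return throughput_gmacs / power_mw


def reduction_percent(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (1.0 - proposed / baseline)


# (throughput in GMACS, power in mW) taken from Table III.
proposed = {"2b/2b": (1092.2, 1.975), "4b/4b": (546.1, 2.66), "8b/8b": (121.4, 3.61)}
for pattern, (gmacs, mw) in proposed.items():
    print(pattern, round(efficiency_tmacs_per_w(gmacs, mw), 2))

mbrai_8b = round(efficiency_tmacs_per_w(121.4, 199.68), 2)
print(mbrai_8b)                                    # 0.61
print(round(33.63 / mbrai_8b, 2))                  # 55.13
print(round(reduction_percent(199.68, 3.61), 1))   # 98.2
```

The 55.13 times figure is reproduced when the ratio is formed from the rounded efficiencies (33.63 and 0.61); forming it from the unrounded power values gives a slightly larger ratio.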

3) Network Level Comparison:

The accuracy and energy estimation comparison between the proposed scheme and other RRAM based schemes is shown in Table IV. Though the binary CIM scheme performs well on small-scale networks, its performance on large-scale networks is much worse than that of the multibit schemes because of its 1-bit quantization. In terms of energy cost, our scheme reduces the inference energy per image by 99.81% compared with the binary CIM scheme and by 98.17% compared with MBRAI for LeNet on MNIST. The proposed scheme also reduces the inference energy per image by 99.69% compared with the binary CIM scheme and by 98.64% compared with MBRAI for AlexNet on ILSVRC 2012. Therefore, by abandoning the amplifiers, the proposed scheme achieves a much lower inference energy cost.
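The saving percentages can be recomputed from the per-image energies listed in Table IV. This is only an arithmetic cross-check; the function name `energy_saving_percent` is illustrative, and the energies are the rounded table entries, so the last digit may differ from the quoted figures.

```python
def energy_saving_percent(other_uj, proposed_uj):
    """Energy saved by the proposed scheme relative to another scheme, in percent."""
    return 100.0 * (1.0 - proposed_uj / other_uj)


# Energy per image in uJ, as listed in Table IV.
lenet = {"BNN+ADCs [25]": 6.68, "MBRAI [23]": 0.71, "Proposed": 0.013}
alexnet = {"BNN+ADCs [25]": 5.42e3, "MBRAI [23]": 1.23e3, "Proposed": 16.65}

for name, table in (("LeNet", lenet), ("AlexNet", alexnet)):
    for scheme in ("BNN+ADCs [25]", "MBRAI [23]"):
        saving = energy_saving_percent(table[scheme], table["Proposed"])
        print(name, scheme, round(saving, 2))
```

Three of the four values round to the quoted percentages; the AlexNet versus MBRAI case comes out about 0.01 percentage points higher, which is within the rounding of the 1.23E+03 entry.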

C. Robustness Analysis

1) Linearity Analysis:

The linearity comparison of integration results under different initial integrating voltage (0.7 ∼ T , and integrator with T is shown in Fig. 8(a), Fig. 8(b), and Fig. 8(c), respectively. The Differential Nonlinearity (DNL) and Integral Nonlinearity (INL) are used to evaluate the performance. The INL/DNL is (-1.66 ∼ ∼ ∼ ∼ T , and (-0.40 ∼ ∼ T which confirms that the linearity of the integration process is greatly improved by 1T1R unit position switching and T . Fig. 9(a) and Fig. 9(b) present the linearity evaluation of the proposed integral multiplier with different input and weight by the code density measurement. The circuit achieves INL/DNL of (-0.51 ∼ ∼ ∼ ∼ ∼ -0.38)/(0.01 ∼ ∼ ∼
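For readers unfamiliar with the two metrics, the sketch below computes endpoint-fit DNL and INL (in LSB) from one measured output level per input code. It is a simplified step-size definition, not the code density measurement used in the paper, and the transfer curve in it is hypothetical illustration data rather than a simulation result.

```python
from typing import List, Tuple


def dnl_inl(levels: List[float]) -> Tuple[List[float], List[float]]:
    """Endpoint-fit DNL and INL in LSB from one output level per input code."""
    if len(levels) < 2 or levels[-1] == levels[0]:
        raise ValueError("need at least two levels spanning a non-zero range")
    # Average step between the two end points defines one LSB.
    lsb = (levels[-1] - levels[0]) / (len(levels) - 1)
    # DNL: deviation of each actual step from the ideal 1 LSB step.
    dnl = [(b - a) / lsb - 1.0 for a, b in zip(levels, levels[1:])]
    # INL: deviation of each level from the straight line through the end points.
    inl = [(v - levels[0]) / lsb - k for k, v in enumerate(levels)]
    return dnl, inl


# Hypothetical 3-bit transfer curve with one level shifted up by 0.1 LSB.
measured = [0.0, 1.0, 2.1, 3.0, 4.0, 5.0, 6.0, 7.0]
dnl, inl = dnl_inl(measured)
print("DNL:", [round(d, 2) for d in dnl])
print("INL:", [round(i, 2) for i in inl])
```

A perfectly linear transfer curve gives zero DNL and INL everywhere, so the reported ranges can be read as the worst-case deviations from that ideal.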

Fig. 8. The INL/DNL comparison of integration results under different integrating voltage (0.7 ∼ T .

TABLE III
CORE LEVEL COMPARISON BETWEEN THE PROPOSED SCHEME AND OTHER CIM SCHEMES

| Structure | Technology | Crossbar Size | Weight/Data Bit | Throughput (GMACS) | Power (mW) | Efficiency (TMACs/s/W) |
|---|---|---|---|---|---|---|
| SINWP [21] [22] | 55nm | 256*512 | fixed-3/fixed-1 | - | - | 53.17 |
| SINWP [21] [22] | 55nm | 256*512 | fixed-3/fixed-2 | - | - | 21.9 |
| MBRAI [23] | 45nm | 256*256 | fixed-3/fixed-1 | 1524 | 19.6 | 77.76 |
| MBRAI [23] | 45nm | 256*256 | fixed-3/fixed-2 | 1040 | 26.8 | 38.8 |
| MBRAI [23] | 45nm | 256*256 | fixed-8/fixed-8 | 121.4 | 199.68 | 0.61 |
| A 22nm 2Mb ReRAM CIM Macro [26] | 22nm | 512*512 | fixed-2/fixed-1 | - | - | 121.38 |
| A 22nm 2Mb ReRAM CIM Macro [26] | 22nm | 512*512 | fixed-4/fixed-2 | - | - | 45.52 |
| A 22nm 2Mb ReRAM CIM Macro [26] | 22nm | 512*512 | fixed-4/fixed-4 | - | - | 28.93 |
| Proposed | 45nm | 256*256 | fixed-2/fixed-2 | 1092.2 | 1.975 | 553.01 |
| Proposed | 45nm | 256*256 | fixed-4/fixed-4 | 546.1 | 2.66 | 205.30 |
| Proposed | 45nm | 256*256 | fixed-8/fixed-8 | 121.4 | 3.61 | 33.63 |
| A CIM SRAM Macro in 7nm FinFET CMOS [27] | 7nm | 4kb | fixed-4/fixed-4 | 186.2 | 1.06 | 175.5 |

TABLE IV
ACCURACY AND ENERGY ESTIMATION OF DIFFERENT RRAM-BASED SCHEME

| Network | Number of Operations | Structure | System Frequency | Data Bit | Crossbar Size | Top-1 Error Rate | Energy (uJ/img) | Saving (%) |
|---|---|---|---|---|---|---|---|---|
| LeNet on MNIST | 0.42M | BNN+ADCs [25] | 100MHz | 1 | 128*128 | 1.40% | 6.68 | 99.81% |
| LeNet on MNIST | 0.42M | MBRAI [23] | 25MHz | 8 | 256*256 | 0.97% | 0.71 | 98.17% |
| LeNet on MNIST | 0.42M | Proposed | 16.7MHz | 8 | 256*256 | 0.90% | 0.013 | - |
| AlexNet on ILSVRC 2012 | 720M | BNN+ADCs [25] | 100MHz | 1 | 128*128 | 73.90% | 5.42E+03 | 99.69% |
| AlexNet on ILSVRC 2012 | 720M | MBRAI [23] | 25MHz | 8 | 256*256 | 44.16% | 1.23E+03 | 98.64% |
| AlexNet on ILSVRC 2012 | 720M | Proposed | 16.7MHz | 8 | 256*256 | 43.60% | 16.65 | - |

2) PVT Simulation:

To verify the robustness of the circuit, different combinations of process, voltage, and temperature are chosen to do the PVT simulation where ENOB is used to evaluate the core's performance. The ENOBs in these PVT combinations are all greater than 7 bits as shown in Table V, which indicates that the proposed circuit is reliable with variations of process, voltage, and temperature.

TABLE V
PVT SIMULATION ON ENOB

| Process | Temperature (°C) | Voltage (V) | ENOB (bit) |
|---|---|---|---|
| ff | -40 | 0.9 | 7.36 |
| ff | -40 | 1.1 | 7.3 |
| ff | 80 | 0.9 | 7.25 |
| ff | 80 | 1.1 | 7.1 |
| ss | -40 | 0.9 | 7.35 |
| ss | -40 | 1.1 | 7.27 |
| ss | 80 | 0.9 | 7.05 |
| ss | 80 | 1.1 | 7.03 |
| tt | 27 | 1 | 7.37 |
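The claim that every corner stays above 7 bits can be checked mechanically from the Table V entries. The sketch below is only a bookkeeping aid over those reported values; the dictionary layout is an assumption made for illustration.

```python
# (process corner, temperature in degrees C, supply voltage in V) -> ENOB in bits,
# copied from Table V.
ENOB_BY_CORNER = {
    ("ff", -40, 0.9): 7.36,
    ("ff", -40, 1.1): 7.30,
    ("ff", 80, 0.9): 7.25,
    ("ff", 80, 1.1): 7.10,
    ("ss", -40, 0.9): 7.35,
    ("ss", -40, 1.1): 7.27,
    ("ss", 80, 0.9): 7.05,
    ("ss", 80, 1.1): 7.03,
    ("tt", 27, 1.0): 7.37,
}

worst_corner = min(ENOB_BY_CORNER, key=ENOB_BY_CORNER.get)
nominal = ENOB_BY_CORNER[("tt", 27, 1.0)]
all_above_7 = all(value > 7.0 for value in ENOB_BY_CORNER.values())

print("worst corner:", worst_corner, ENOB_BY_CORNER[worst_corner])
print("drop from nominal:", round(nominal - ENOB_BY_CORNER[worst_corner], 2))
print("all corners above 7 bits:", all_above_7)
```

The worst case is the slow corner at high temperature, which is where the ENOB margin over 7 bits is smallest.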

3) Quantization and Mapping Methods Comparison:

To add the impact of resistance distribution into the weight of the neural network, the resistance reading phase is needed when the ADC is used to read the resistance of the RRAM array. To make things easy, the process of ADC reading 1T1R circuit with fixed resistance is firstly simulated by 1400 Monte Carlo simulations to evaluate the impact of the transistor variation, then the resistance inconsistency is evaluated by adding a normalized Gaussian distribution. The normalized distribution of RRAM array read by ADC is shown in Fig. 10, where the standard deviation of the normalized Gaussian distribution is 0.2. Fig. 11 shows the comparison of 1400 Monte Carlo simulations on the computation error of combination of input 180, weight 75, number of input lines 128 between normal mapping and bit line weight mapping method. The average value and the standard deviation of the error in normal mapping method are 0.124 LSB and 1.744 LSB, respectively, while those of the errors in bit line weight mapping method are 0.013 LSB and 0.104 LSB, respectively. The mapping result indicates that the bit line weight mapping method significantly improves our CIM core's robustness to variations of device inconsistency.

Fig. 9. The evaluation of linearity in terms of (a) different input (0-255) and (b) different weight (0-255) and the INL/DNL comparison between (c) integral multiplier with regulator and (d) integral multiplier without regulator under different input lines (1-256).

Fig. 10. Normalized resistance distribution read by ADC where the standard deviation of normalized Gaussian distribution is 0.2.

Fig. 11. Comparison of 1400 Monte Carlo simulations on the computation error of combination of input 180, weight 75, number of input lines 128 between (a) normal mapping and (b) bit line weight mapping method.

To test the effect of the bit line weight mapping method on network level, three quantization and mapping methods are simulated. The first one is normal binary quantization and mapping method, which quantizes the weight to a digital 8-bit value and sets the resistance to HRS/LRS according to the corresponding digital bit 0/1. The second one is resistance based quantization and mapping method that quantizes the weight according to Eq. 25. The third one is resistance based quantization and bit line weight mapping method proposed in this paper. Fig. 12(a) shows the quantization error ratio (sum of absolute values of quantization error/sum of absolute values of weight) and the loss of top-1 accuracy comparison between three methods under different quantization bits (the deviation of normalized resistance is 0.2) and different standard deviation of normalized resistance distribution (the quantization bits is 8) on AlexNet and ILSVRC 2012; Fig. 12(b) presents the quantization error ratio and the loss of top-1 accuracy comparison between quantization methods with different quantization bit (the deviation of normalized resistance is 0.2) and standard deviation of the resistance distribution (the quantization bits is 8) on VGG16 and ILSVRC 2012. As shown in Fig. 12, the optimized quantization and bit line weight mapping method helps reduce the quantization errors and improve the inference accuracy both on AlexNet and VGG16. For example, the accuracy losses in 8-bit mode with 0.2 deviation on AlexNet are 2.97% and 0.51% for normal binary mapping method and bit line weight mapping method, respectively, and those on VGG16 are 2.70% and 0.39% for normal binary mapping method and bit line weight mapping method, respectively. What's more, as the uncertainty of the resistance increases, the effect of the optimization becomes more evident.
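The quantization error ratio used in Fig. 12 is defined in the text as the sum of absolute quantization errors divided by the sum of absolute weights. The sketch below illustrates only that metric. It uses plain symmetric uniform quantization as a stand-in, which is an assumption: it is neither the resistance based quantization of Eq. 25 nor the bit line weight mapping method, and the random weights are synthetic.

```python
import random
from typing import List, Sequence


def quantize_uniform(weights: Sequence[float], bits: int) -> List[float]:
    """Symmetric uniform quantization to signed `bits`-bit levels (assumed scheme)."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    levels = 2 ** (bits - 1) - 1          # e.g. 127 positive levels for 8 bits
    step = max_abs / levels
    return [round(w / step) * step for w in weights]


def quantization_error_ratio(weights: Sequence[float], quantized: Sequence[float]) -> float:
    """Sum of |w - q| divided by sum of |w|, as defined for Fig. 12."""
    error_sum = sum(abs(w - q) for w, q in zip(weights, quantized))
    weight_sum = sum(abs(w) for w in weights)
    return error_sum / weight_sum


# Synthetic Gaussian weights, seeded so the run is repeatable.
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]
ratios = {
    bits: quantization_error_ratio(weights, quantize_uniform(weights, bits))
    for bits in (2, 4, 8)
}
for bits, ratio in ratios.items():
    print(f"{bits}-bit: error ratio = {ratio:.4f}")
```

With this stand-in quantizer the ratio shrinks as the bit width grows, which is the trend the left-hand panels of Fig. 12 plot against the number of quantization bits.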

Fig. 12. The accuracy comparison between A: Resistance based quantization and bit line weight mapping method, B: Normal binary quantization and mapping method, and C: Resistance based quantization and normal mapping method. (a) The left two figures are quantization error ratio (sum of absolute values of quantization error/sum of absolute values of weight) and loss of top-1 accuracy of AlexNet on ILSVRC 2012 with different quantization bits (the deviation of normalized resistance is 0.2) while the right two are quantization error ratio and loss of top-1 accuracy of AlexNet on ILSVRC 2012 with different standard deviation of normalized resistance distribution (the quantization bits is 8). (b) The left two figures are quantization error ratio and loss of top-1 accuracy of VGG16 on ILSVRC 2012 with different quantization bits (the deviation of normalized resistance is 0.2) while the right two are quantization error ratio and loss of top-1 accuracy of VGG16 on ILSVRC 2012 with different standard deviation of normalized resistance distribution (the quantization bits is 8).

V. CONCLUSION

In this paper, an 8-bit RRAM based CIM core with regulated passive neuron and bit line weight mapping method has been proposed. The non-linearity brought by the passive integrator and the errors caused by quantization and the cell to cell variation have been discussed. To address the above issues, the detailed regulated integral multiplier and the bit line weight mapping method have been presented. The circuit level simulation has shown that the proposed CIM core achieves 3.61mW on power consumption with the size of 256*256 in 8-bit input and 8-bit weight mode, which is reduced by 98.2% compared with MBRAI while the SFDR and SNDR of the CIM core achieve 59.13 dB and 46.13 dB, respectively. The network level simulation has shown that the CIM core achieves 0.90% top-1 error rate with 0.013 uJ/img on LeNet and 43.60% top-1 error rate with 16.65 uJ/img on AlexNet, which are better than other schemes. The linearity and PVT simulation has been done to verify the robustness of the circuit. The simulation on mapping methods has shown that compared with normal mapping method, the proposed bit line weight mapping scheme achieves better performance which improves the top-1 accuracy by 2.46% and 3.47% for AlexNet and VGG16 on ILSVRC 2012 in 8-bit mode.

ACKNOWLEDGMENT

This work was supported by the Major Scientific Research Project of Zhejiang Lab (No. 2019KC0AD02).

REFERENCES

[1] K. Chang and M. Chiang. Design of data reduction approach for aiot on embedded edge node. In , pages 899–900, 2019.
[2] H. Pham, M. Nguyen, and C. Sun. Aiot solution survey and comparison in machine learning on low-cost microcontroller. In , pages 1–2, 2019.
[3] G. W. Burr, R. M. Shelby, C. di Nolfo, J. W. Jang, R. S. Shenoy, P. Narayanan, K. Virwani, E. U. Giacometti, B. Kurdi, and H. Hwang. Experimental demonstration and tolerancing of a large-scale neural network (165,000 synapses), using phase-change memory as the synaptic weight element. In , pages 29.5.1–29.5.4, 2014.
[4] K. Huang, Y. Ha, R. Zhao, A. Kumar, and Y. Lian. A low active leakage and high reliability phase change memory (pcm) based non-volatile fpga storage element. IEEE Transactions on Circuits and Systems I: Regular Papers, 61(9):2605–2613, 2014.
[5] L. Zhang, W. Kang, H. Cai, P. Ouyang, L. Torres, Y. Zhang, A. Todri-Sanial, and W. Zhao. A robust dual reference computing-in-memory implementation and design space exploration within stt-mram. In , pages 275–280, 2018.
[6] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan. Computing in memory with spin-transfer torque magnetic ram. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26(3):470–483, 2018.
[7] Y. Pan, P. Ouyang, Y. Zhao, W. Kang, S. Yin, Y. Zhang, W. Zhao, and S. Wei. A mlc stt-mram based computing in-memory architecture for binary neural network. In , pages 1–1, 2018.
[8] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan. Computing in memory with spin-transfer torque magnetic ram. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26(3):470–483, 2018.
[9] H.-S. P. Wong, H. Lee, S. Yu, Y. Chen, Y. Wu, P. Chen, B. Lee, F. T. Chen, and M. Tsai. Metal-oxide rram. Proceedings of the IEEE, 100(6):1951–1970, 2012.
[10] W. Wan, R. Kubendran, S. B. Eryilmaz, W. Zhang, Y. Liao, D. Wu, S. Deiss, B. Gao, P. Raina, S. Joshi, H. Wu, G. Cauwenberghs, and H.-S. P. Wong. 33.1 a 74 tmacs/w cmos-rram neurosynaptic core with dynamically reconfigurable dataflow and in-situ transposable weights for probabilistic graphical models. In , pages 498–500, 2020.
[11] Z. Yang and L. Wei. Logic circuit and memory design for in-memory computing applications using bipolar rrams. In , pages 1–5, 2019.
[12] Z. Liu, E. Ren, F. Qiao, Q. Wei, X. Liu, L. Luo, H. Zhao, and H. Yang. Ns-cim: A current-mode computation-in-memory architecture enabling near-sensor processing for intelligent iot vision nodes. IEEE Transactions on Circuits and Systems I: Regular Papers, pages 1–14, 2020.
[13] C. Xue and M. Chang. Challenges in circuit designs of nonvolatile-memory based computing-in-memory for ai edge devices. In , pages 164–165, 2019.
[14] W. Chen, W. Khwa, J. Li, W. Lin, H. Lin, Y. Liu, Y. Wang, Huaqiang Wu, Huazhong Yang, and M. Chang. Circuit design for beyond von neumann applications using emerging memory: From nonvolatile logics to neuromorphic computing. In , pages 23–28, 2017.
[15] G. W. Burr, P. Narayanan, R. M. Shelby, S. Sidler, I. Boybat, C. di Nolfo, and Y. Leblebici. Large-scale neural networks implemented with non-volatile memory as the synaptic weight element: Comparative performance analysis (accuracy, speed, and power). In , pages 4.4.1–4.4.4, 2015.
[16] J. Jang, S. Park, G. W. Burr, H. Hwang, and Y. Jeong. Optimization of conductance change in Pr1-xCaxMnO3-based synaptic devices for neuromorphic systems. IEEE Electron Device Letters, 36(5):457–459, 2015.
[17] A. Fumarola, P. Narayanan, L. L. Sanches, S. Sidler, J. Jang, K. Moon, R. M. Shelby, H. Hwang, and G. W. Burr. Accelerating machine learning with non-volatile memory: Exploring device and circuit tradeoffs. In , pages 1–8, 2016.
[18] E. Giacomin, T. Greenberg-Toledo, S. Kvatinsky, and P. Gaillardon. A robust digital rram-based convolutional block for low-power image processing and learning applications. IEEE Transactions on Circuits and Systems I: Regular Papers, 66(2):643–654, 2019.
[19] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams. Dot-product engine for neuromorphic computing: Programming 1t1m crossbar to accelerate matrix-vector multiplication. In , pages 1–6, 2016.
[20] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In , pages 14–26, 2016.
[21] C. Xue, W. Chen, J. Liu, J. Li, W. Lin, W. Lin, J. Wang, W. Wei, T. Chang, T. Chang, T. Huang, H. Kao, S. Wei, Y. Chiu, C. Lee, C. Lo, Y. King, C. Lin, R. Liu, C. Hsieh, K. Tang, and M. Chang. 24.1 a 1mb multibit reram computing-in-memory macro with 14.6ns parallel mac computing time for cnn based ai edge processors. In , pages 388–390, 2019.
[22] C. Xue, W. Chen, J. Liu, J. Li, W. Lin, W. Lin, J. Wang, W. Wei, T. Huang, T. Chang, T. Chang, H. Kao, Y. Chiu, C. Lee, Y. King, C. Lin, R. Liu, C. Hsieh, K. Tang, and M. Chang. Embedded 1-mb reram-based computing-in-memory macro with multibit input and weight for cnn-based ai edge processors. IEEE Journal of Solid-State Circuits, 55(1):203–215, 2020.
[23] S. Zhang, K. Huang, and H. Shen. A robust 8-bit non-volatile computing-in-memory core for low-power parallel mac operations. IEEE Transactions on Circuits and Systems I: Regular Papers, 67(6):1867–1880, 2020.
[24] A. Biswas and A. P. Chandrakasan. Conv-ram: An energy-efficient sram with embedded convolution computation for low-power cnn-based machine learning applications. In , pages 488–490, 2018.
[25] T. Tang, L. Xia, B. Li, Y. Wang, and H. Yang. Binary convolutional neural network on rram. In , pages 782–787, 2017.
[26] C. Xue, T. Huang, J. Liu, T. Chang, H. Kao, J. Wang, T. Liu, S. Wei, S. Huang, W. Wei, Y. Chen, T. Hsu, Y. Chen, Y. Lo, T. Wen, C. Lo, R. Liu, C. Hsieh, K. Tang, and M. Chang. 15.4 a 22nm 2mb reram compute-in-memory macro with 121-28tops/w for multibit mac computing for tiny ai edge devices. In , pages 244–246, 2020.
[27] Q. Dong, M. E. Sinangil, B. Erbagci, D. Sun, W. Khwa, H. Liao, Y. Wang, and J. Chang. 15.3 a 351tops/w and 372.4gops compute-in-memory sram macro in 7nm finfet cmos for machine-learning applications. In , pages 242–244, 2020.

Yewei Zhang (Student Member, IEEE) received the bachelor's degree from the College of Information Science & Electronic Engineering, Zhejiang University, in 2018. He is currently studying for a master's degree at the College of Information Science & Electronic Engineering, Zhejiang University. He is interested in in-memory computing and non-volatile memories.

Kejie Huang (Senior Member, IEEE) received the Ph.D. degree from the Department of Electrical Engineering, National University of Singapore (NUS), Singapore, in 2014. He has been a Principal Investigator with the College of Information Science & Electronic Engineering, Zhejiang University (ZJU), since 2016. Prior to joining ZJU, he spent five years in the IC design industry, including Samsung and Xilinx, two years in the Data Storage Institute, Agency for Science, Technology and Research (A*STAR), and another three years in the Singapore University of Technology and Design (SUTD), Singapore. He has authored or coauthored more than 30 scientific articles in international peer-reviewed journals and conference proceedings. He holds four granted international patents, and another eight pending ones. His research interests include low power circuits and systems design using emerging non-volatile memories, architecture and circuit optimization for reconfigurable computing systems and neuromorphic systems, machine learning, and deep learning chip design. He currently serves as the Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-PART II: EXPRESS BRIEFS.

Rui Xiao (Student Member, IEEE) received her bachelor's degree from the School of Information Science and Electronic Engineering, Zhejiang University in 2019. She is currently working for her Ph.D. degree in the School of Information Science and Electronic Engineering, Zhejiang University. Her research interests include in-memory computing, non-volatile memories, and neuromorphic systems.
