Enabling Lower-Power Charge-Domain Nonvolatile In-Memory Computing with Ferroelectric FETs
Guodong Yin, Yi Cai, Juejian Wu, Zhengyang Duan, Zhenhua Zhu, Yongpan Liu, Yu Wang, Huazhong Yang, Xueqing Li
Abstract: Compute-in-memory (CiM) is a promising approach to alleviating the memory wall problem for domain-specific applications. Compared to current-domain CiM solutions, charge-domain CiM offers the opportunity for higher energy efficiency and resistance to device variations. However, existing SRAM-based charge-domain CiM (CD-CiM) suffers from high area occupation and standby leakage power. This paper proposes the first concept and analysis of CD-CiM using nonvolatile memory (NVM) devices. The design implementation and performance evaluation are based on a proposed 2-transistor-1-capacitor (2T1C) CiM macro using ferroelectric field-effect-transistors (FeFETs), which is free from leakage power and much denser than the SRAM solution. With the supply voltage between 0.45V and 0.90V and the operating frequency between 100MHz and 1.0GHz, binary neural network application simulations show over 47%, 60%, and 64% energy consumption reduction from existing SRAM-based CD-CiM, SRAM-based current-domain CiM, and RRAM-based current-domain CiM, respectively. For classifications in MNIST and CIFAR-10 data sets, the proposed FeFET-based CD-CiM achieves an accuracy over 95% and 80%, respectively.
Index Terms: CiM, process in memory, ferroelectric, charge-domain computing-in-memory, ferroelectric transistors, FeFET.
I. INTRODUCTION
The computing capability and energy efficiency of modern computers based on the von Neumann architecture are hindered by the data movement between the memory and the processing units, known as the "memory wall" problem [1]. This problem has worsened with the advent of the big-data era. To tackle this challenge, recent computing-in-memory (CiM) approaches have attracted interest because they reduce data transfer [3][4]. As conventional memories were not designed for CiM, a key CiM enabler is to equip the memory with a computable circuit structure and/or a flexible interface, under the constraints of cost, power consumption, scalability, and reliability. From the application perspective, data-intensive convolutional neural network (CNN) acceleration is of particular interest because its computation is regular, parallel, and simple [2]. Recent work ranging from devices and circuits to architectures and algorithms has indicated the benefits of such a co-design [3]-[5].
Existing CiM methodologies can be roughly classified along two dimensions, as in Fig. 1: (i) whether the memory devices are volatile or nonvolatile, and (ii) whether computing and sensing are done in a current mode or a voltage/charge mode [3]-[6], [28]. Compared with the conventional volatile CMOS solution, NVM has potentially higher density and intrinsically zero standby power. Compared with current-mode computing and interfacing, a voltage-mode charge-domain CiM (CD-CiM) consumes only dynamic power, which is appealing for low-power applications. More importantly, the capacitor-based CD-CiM may provide higher immunity to PVT variations, including the on-state current mismatch, which is challenging to handle for both MOSFET and RRAM [5].
Therefore, it is timely to explore NVM-based CD-CiM, the bottom-right quadrant of Fig. 1, to combine the advantages of NVM and charge-domain computing for denser, more reliable, and lower-power solutions. Previously, this was challenging due to the low on/off ratio of MRAM (typically ~2) and RRAM (typically 10 -10 [22]), as discussed subsequently. Today, the emergence of the nonvolatile ferroelectric field-effect-transistor (FeFET), with a highly scalable CMOS-compatible process and an ultra-high on/off ratio (>10 ), opens a promising design space for a new CD-CiM paradigm. Other FeFET features, such as DC-power-free write, separate read and write ports, and the compact integration of NVM and transistor, also add flexibility to the device-circuit co-design.
This work proposes the concept, analysis, and design of NVM-based CD-CiM, and exploits FeFETs for the circuit implementation. Evaluation against SRAM and RRAM solutions suggests significantly improved trade-offs between density, performance, and power consumption. Itemized contributions include:
• A FeFET-based 2-transistor-1-capacitor (2T1C) compact CiM cell that supports charge-domain DC-power-free XNOR operations, along with a cell-level comparison with other NVM technologies;
• A CD-CiM macro array based on the proposed 2T1C cell for low-power, parallel, and reliable multiply-and-accumulate (MAC) operations in binary neural network (BNN) applications;
Guodong Yin, Student Member, IEEE, Yi Cai, Juejian Wu, Student Member, IEEE, Zhengyang Duan, Zhenhua Zhu, Student Member, IEEE, Yongpan Liu, Senior Member, IEEE, Yu Wang, Senior Member, IEEE, Huazhong Yang, Fellow, IEEE, and Xueqing Li, Senior Member, IEEE
Manuscript received Apr. 1, 2020. Revised July 18, 2020, Nov. 19, 2020. Accepted Jan. 2, 2021. This work is supported in part by NSFC (
Fig. 1. Extending CiM to the NVM-based charge-domain quadrant. [Figure: a 2x2 map of volatile vs. nonvolatile memory against charge vs. current domain. Each quadrant is rated on process reliability, density, idle leakage, computing power, and variation limitation; three quadrants are marked "Explored" and one "To be Explored".]
• Evaluations of the proposed charge-domain CiM, including (i) 2T1C cell-level CiM circuit performance, (ii) array-level analysis of the energy savings of the computing operation and the impact of major array-level variations, and (iii) variation-aware classification accuracy on the MNIST and CIFAR-10 datasets.
II. PROPOSED CHARGE-DOMAIN CIM
This section briefly introduces the BNN and FeFET background and presents the proposed FeFET-based 2T1C XNOR cell for the MAC CiM computing methodology of BNN applications.
A. Binary CNN
Convolutional neural networks (CNNs) can achieve high accuracy in computer vision applications such as image classification and face detection. In CNNs, the multiply-and-accumulate (MAC) is a critical operation during inference and consumes most of the power [5]. Neural networks trained with binary weights and activations, i.e., +1 and -1, reduce multiplications to atomic XNOR operations [8]. While BNNs have been successful on smaller data sets like MNIST, recent works have extended low-resolution quantization to larger data sets like ImageNet, showing improved accuracy and much lower costs with algorithm optimizations [8][24]. Binary MAC operations can be performed in three steps: (i) perform XNOR logic in atomic cells; (ii) accumulate the results; (iii) restore the binary value. These operations deliver the corresponding matrix multiplication functions in the convolutional/fully-connected layer and the nonlinear functions in the neuron layer. Below, (1) shows the batch normalization and the activation function ρ [9], and (2) shows the particular case of binary batch normalization and activation, implemented as a sign comparison between the MAC result and a reference voltage:
$$\mathrm{OUT}_i = \rho\left(\gamma_i \, \frac{\mathrm{PA}_i - \mu_i}{\sigma_i^{2}} + \beta_i\right) \quad (1)$$
$$\mathrm{OUT}_i = \mathrm{sign}\left(\mathrm{IN}_i - \alpha_i\right) \quad (2)$$
where PA_i is the pre-activation tensor, OUT_i is the output tensor, μ_i is the mean of PA_i, σ_i is the standard deviation of PA_i, and γ_i, β_i, α_i are batch normalization parameters; in (2), IN_i is the MAC result that is compared against the reference α_i. Compared to 32-bit CNNs, binary CNNs need only 1/32 of the memory and far fewer data accesses, leading to drastic power savings.
XNOR circuits can be implemented in the current mode (current domain) or the voltage mode (charge domain) [3][5]. In the current-mode design, the XNOR results are calculated in custom-designed XNOR memory cells based on Kirchhoff's current law. By measuring the output current, the sense amplifiers (SAs) can distinguish the different XNOR input scenarios. Because computing and sensing consume DC power, the current-mode XNOR operation may not fit well in power-sensitive applications [3][5]. Device variations in the output currents make this more challenging.
In contrast, in the charge domain, XNOR and MAC operations can be implemented more efficiently with the charge conservation law. Fig. 2 illustrates an example using the SRAM array. Each SRAM cell is accompanied by a local capacitor and performs an XNOR operation between the SRAM state and the input pair IA/IAb. Each XNOR result is reflected as the charge stored on the local capacitor near each SRAM cell in Fig. 2. The charge of the three capacitors is then collected on the source line ScL to sum the XNOR results. As the SRAM states are directly linked to the supply voltage VDD or the ground voltage GND (thanks to the high on/off I_DS ratio of MOSFETs), the major source that determines the MAC computation accuracy is the mismatch between the capacitors rather than the MOSFET V_TH variations.
With a proper VDD, this significantly improves the immunity to MOSFET on-state drain current variations (as compared with the current-mode sensing of summed currents). In addition, the operation involves no static current, leading to significant power savings. The main drawbacks of the SRAM-based charge-domain XNOR and MAC operations are (i) low density due to large SRAM cells, and (ii) idle-state static leakage current. As shown subsequently, the proposed FeFET-based design addresses both problems.
B. The FeFET Device Basics
An FeFET is essentially a MOSFET with a ferroelectric layer embedded in the gate [13]. The polarization of this extra layer provides a knob to tune and retain a nonvolatile V_TH, which in turn gives a tunable drain-source current I_DS. Fig. 3 shows the FeFET I_DS-V_GS curve with two states, together with the parameters of the model used in the simulation. Detailed device operating mechanisms have been reported in prior works [12][13][26]. Generally, to reduce (or increase) V_TH of an n-type FeFET, a positive V_write (or negative -V_write) voltage pulse is applied to the FeFET gate. V_write should be sufficiently high to trigger partial or full polarization switching. The negative pulse can be applied as a negative V_GS, which avoids the need for a negative supply. While sensing the FeFET V_TH state, the gate should be biased at a moderate voltage below V_write to prevent disturbance. The recent use of hafnium-based materials makes FeFETs highly scalable [14][15]. With the amplification of the embedded MOSFET, I_DS can span an ultra-high range beyond 10 , which is particularly preferred for computing in large memory arrays [17][29]. FeFETs also offer moderate endurance (up to 10 [18]), a moderate operating voltage (as low as 1.5V [13]), and nanosecond-scale speed [19].
Fig. 2. Existing SRAM charge-domain XNOR and MAC (rotated view) [5].
Fig. 3. (a) FeFET I_DS-V_GS curve; (b) write methods. [Figure annotations: ferroelectric layer thickness = 9nm; kinetic coefficient = 0.1.]
Fig. 4. Proposed FeFET-based CD-CiM: (a) cell; (b) array structure.
Notably, recent reports show that highly scaled FeFETs maintain the high on/off ratio but may also suffer from large I_DS variations. This limits the accuracy of current-sensing-based CiM [20]. Therefore, innovations that exploit the ultra-high on/off ratio rather than the absolute I_DS are preferred for computing purposes. This is the contribution of this work when compared with existing FeFET-based current-mode solutions.
C. Proposed 2T1C CD-CiM Macro
Cell and array structures. Fig. 4(a) shows the proposed 2T1C cell. It consists of two n-type FeFETs (M1 and M2) and one capacitor C_M. WL and WLB are wordlines. BL, BLB, and ScL are bitlines. The cell stores bits as FeFET states: '1' for the positive polarization state (negative V_TH) and '0' for the negative polarization state (positive V_TH). In each cell, the two FeFETs store one '0' and one '1', similar to SRAM. Fig. 4(b) shows the proposed FeFET-based CD-CiM array macro built from the 2T1C cells. Multiple rows and columns can be activated simultaneously to compute in parallel.
Cell and array write operations. TABLE I shows the write setup, with an example of writing '1' to M1 and '0' to M2 in one cell. It has two phases (Phase 1 and Phase 2), similar to the methods in [12][20]. In TABLE I, V_BL/V_BLB is set to V_write/GND to write '1' to M1 and '0' to M2. In Phase 1, V_WL and V_WLB are connected to GND. In Phase 2, V_WL and V_WLB are driven concurrently to V_write for a period of time. An effective write of '0' and '1' occurs with V_GS = -V_write and V_GS = V_write, respectively. The write operations of one cell extend easily to an array. WL/WLB of the selected row are connected to GND in Phase 1 and V_write in Phase 2. WL/WLB of unselected rows are set to V_write/2, and BL/BLB is set to V_write to write '1' to the selected FeFET, or to GND to write '0'. For unselected rows, |V_GS| = V_write/2 is maintained to avoid state disturb.
Cell XNOR operation. Fig. 5 shows the XNOR logic in one cell. Initially, all bitlines and wordlines are set to GND. Then, ScL is left floating at GND. As mentioned above, either M1 or M2 has low resistance, so the internal node voltage V_X is GND. Next, V_WL/V_WLB is set to VDD/GND and GND/VDD for an input pair of '1/0' and '0/1', respectively. Note that VDD is set lower than V_write to avoid FeFET state disturbance.
With the complementary '0' and '1' storage within each cell, when the line (WL or WLB) biased at VDD is connected to an on-state FeFET, V_X is pulled up to VDD. Otherwise, V_X remains at GND. As ScL is floating, the change of V_X is linearly delivered to the top plate of C_M, i.e., ScL. Fig. 5 illustrates V_ScL in two input scenarios (without considering the impact of parasitic capacitance): equal to V_WL = VDD in Fig. 5(a) for output '1', and equal to V_WLB = GND in Fig. 5(b) for output '0'. Fig. 6 shows a snapshot of transient simulation results, including the bitline and wordline voltages and the FeFET polarization, for XNOR outputs of '1' and '0'. Operating non-idealities are analyzed in Section III.
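The cell behaviour described above reduces to a small truth table. The following Python sketch is a purely behavioural model, assuming ideal FeFETs (infinite on/off ratio), no parasitic capacitance, and that M1 sits on the WL side and M2 on the WLB side as drawn in Fig. 5. The function name `cell_xnor` and the default VDD of 0.45 V are illustrative choices, not part of the design.

```python
VDD = 0.45  # volts; illustrative default (assumption)
GND = 0.0

def cell_xnor(stored_bit: int, input_bit: int, vdd: float = VDD) -> float:
    """Ideal V_ScL of one isolated 2T1C cell after the XNOR step.

    Hypothetical behavioural model: infinite on/off ratio, no parasitics.
    """
    if stored_bit not in (0, 1) or input_bit not in (0, 1):
        raise ValueError("bits must be 0 or 1")
    # Complementary storage: M1 holds the bit, M2 holds its inverse,
    # so exactly one FeFET is in the low-resistance (on) state.
    m1_on = stored_bit == 1
    # Input '1' drives WL/WLB = VDD/GND; input '0' drives GND/VDD.
    v_wl, v_wlb = (vdd, GND) if input_bit == 1 else (GND, vdd)
    # Node X follows the line reached through the on-state FeFET.
    v_x = v_wl if m1_on else v_wlb
    # ScL floats, so the swing of V_X is passed to it through C_M.
    return v_x

if __name__ == "__main__":
    for w in (0, 1):
        for x in (0, 1):
            print(f"stored={w} input={x} -> V_ScL={cell_xnor(w, x):.2f} V")
```

In this model the output equals VDD exactly when the stored bit matches the input bit, which is the XNOR function.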
Array MAC operation. Fig. 7 shows the array MAC operation with ScL, BL/BLB, and WL/WLB shared between unary XNOR CiM cells. In the array-level MAC operation, the output V_ScL is driven by multiple XNOR cell outputs in each column. Note that cells in the same column have the same inputs with shared BL and BLB. Therefore, V_ScL is lifted linearly as a function of the number of pulled-up cells. If M cells out of a total of N XNOR cells deliver '1', V_ScL is shifted from GND to VDD*M/N. Given a mapping scheme between -1/+1 and GND/VDD, the pre-trained weights are stored as FeFET states in the array, and the input signals are applied through the WL/WLB lines. For the convolutional layers, the weights of the same filter are stored in the same column. For the fully connected layers, the weights are loaded similarly. Every column performs MAC operations; the nonlinear activation and the binary batch normalization are then performed on the outputs presented at ScL. Wordlines (WL and WLB) and bitlines (BL and BLB) can be set to GND to deactivate the corresponding rows and columns so as to support a smaller network. Larger networks may also be supported by splitting the matrices across several smaller CiM macros, as discussed in [23].
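The column-level relation V_ScL = VDD*M/N and the subsequent sign comparison can be sketched in a few lines of Python. This is an idealized model (matched capacitors, infinite on/off ratio). The names `column_mac` and `binary_activation`, the tie-breaking rule, the reference voltage, and the example bit patterns are hypothetical and chosen only for illustration.

```python
from typing import Sequence

def column_mac(weights: Sequence[int], inputs: Sequence[int],
               vdd: float = 0.45) -> float:
    """Ideal V_ScL of one column: VDD * M / N, M = number of XNOR '1' results."""
    if len(weights) != len(inputs) or not weights:
        raise ValueError("weights and inputs must be non-empty and equally long")
    n = len(weights)
    # XNOR of each stored bit with its input bit is '1' when the two match.
    m = sum(1 for w, x in zip(weights, inputs) if w == x)
    return vdd * m / n

def binary_activation(v_scl: float, v_ref: float) -> int:
    """Sign comparison against a reference voltage; ties map to +1 (assumption)."""
    return 1 if v_scl >= v_ref else -1

if __name__ == "__main__":
    # Three-cell column with a single XNOR match, giving VDD/3.
    v = column_mac(weights=[1, 0, 1], inputs=[1, 1, 0])
    print(f"V_ScL = {v:.3f} V, output = {binary_activation(v, v_ref=0.225)}")
```

With bits mapped to -1/+1, the signed dot product equals 2M - N, so comparing V_ScL with VDD/2 corresponds to testing the sign of that dot product.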
III. EVALUATION AND DISCUSSION
This section evaluates the energy, area, and accuracy performance of the proposed 2T1C MAC CD-CiM macro against other existing techniques, in the presence of non-idealities.
A. Benchmark Settings
The 2T1C cells are simulated using the calibrated FeFET SPICE model in [21], as shown in Fig. 3. This model has been adopted in prior circuit works. All MOSFETs use the 10nm PTM models [27]. In the benchmarking, C_M is 1.2fF for all charge-domain solutions, adopted from [5]. We use R_ON = 10 kΩ and R_OFF = 1 MΩ for RRAMs.
The array size is set to 128x128 as a typical case. RC parasitic parameters are from [23]. As shown in recent reports, the SA may consume a significant portion of energy. For fair comparisons, the adopted SAs are based on those in [7].
TABLE I: WRITE OPERATION CONFIGURATION
Target example      Phase     V_ScL   V_BL      V_BLB   V_WL      V_WLB
M1: '1'; M2: '0'    Phase 1   GND     V_write   GND     GND       GND
                    Phase 2   GND     V_write   GND     V_write   V_write
Fig. 5. Proposed XNOR logic of one cell: (a) '1' output; (b) '0' output.
Fig. 6. Transient waveforms of the 2T1C cell (for settings, see Section III.A).
Fig. 7. Proposed 2-phase MAC operation (3 cells in a column example): (a) discharge capacitors and keep ScL floating; (b) XNOR and accumulate.
B. Energy and Latency Evaluation
Theoretical Analysis
In CD-CiM, most energy is consumed by capacitor charging. Fig. 8 shows the capacitor network model of the MAC operation in a column. During the charging process, the right-side plate of C_M in Fig. 8(a) is kept floating and the left-side plate, i.e., node X, is clamped to VDD or GND according to the XNOR result. The equivalent charging capacitance C_EQ of a column is
$$C_{EQ} = M \times (128 - M) \times C_M / 128, \quad (3)$$
where M is the number of cells whose XNOR result is '1'. As observed, there is no charging load when all cells deliver the same XNOR result, either all '0' or all '1', as ScL is floating. The maximum charging load occurs when half of the cells deliver '1' (and the other half deliver '0'). In contrast, in the SRAM-based CD-CiM design [5], C_M is charged when the XNOR result is '1' and is not charged when it is '0'. Fig. 8(c) compares the equivalent capacitive load, where the load of the proposed design at p = 0.5 is only half that of the SRAM-based design. Here, p denotes the fraction of XNOR cells in a column that produce '1'. On average, C_EQ of the 2T1C design is only 33% of that of the SRAM-based design. In addition, thanks to the FeFET nonvolatility, the 2T1C design consumes no idle power, another advantage over the SRAM-based design. For current-domain CiM solutions, energy is consumed while settling and sensing the bitline currents, along with maintaining the reference currents. Although a latch-style dynamic current SA could be used, there is still a trade-off between the current amplitude, the supply, and the operating frequency when minimizing the energy.
Experimental Simulation
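As a numerical aside on the capacitor model in (3) (a back-of-the-envelope check, not part of the simulation flow described next), the Python sketch below evaluates the equivalent load of both designs. It assumes, following the theoretical analysis, that the SRAM-based design charges one C_M per XNOR '1' result, and it averages over all M from 0 to N with equal weight. The function names are hypothetical.

```python
C_M = 1.2e-15  # farads, unit capacitor from the benchmark settings
N = 128        # cells per column

def c_eq_2t1c(m: int, n: int = N, c_m: float = C_M) -> float:
    """Equation (3): m capacitors tied to VDD in series with (n - m) tied to GND."""
    return m * (n - m) * c_m / n

def c_load_sram(m: int, c_m: float = C_M) -> float:
    """Assumed SRAM-based load: one C_M is charged per XNOR '1' result."""
    return m * c_m

def average_ratio(n: int = N) -> float:
    """Mean 2T1C load divided by mean SRAM load, uniform over m = 0..n."""
    total_2t1c = sum(c_eq_2t1c(m, n) for m in range(n + 1))
    total_sram = sum(c_load_sram(m) for m in range(n + 1))
    return total_2t1c / total_sram

if __name__ == "__main__":
    half = N // 2
    print(f"ratio at p = 0.5: {c_eq_2t1c(half) / c_load_sram(half):.2f}")
    print(f"average ratio:    {average_ratio():.3f}")
```

Under these assumptions the ratio is 0.5 at p = 0.5 and (N - 1)/(3N) on average, about 0.33 for N = 128, in line with the figures quoted in the theoretical analysis.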
To comply with the FeFET model, the supply voltage is set to 0.45V to avoid FeFET state disturbance. For other designs, an extra 0.90V supply is provided to investigate more options. In the evaluation, one clock cycle is allotted to the bitline and wordline controls, including precharging and clamping, and one clock cycle to sensing and latching the outputs. Evaluations are done at 100MHz and 1.0GHz, each with custom optimizations, e.g., low or high V_TH options. Fig. 9 compares the energy consumption with related works. This work achieves the highest energy efficiency. Compared with the current-domain solutions, the improvement is at least 2.5x at 1.0GHz and up to 24x at 100MHz, where more time is spent on bitline settling. In practice, the operating frequency of current-sensing CiM could also be limited by PVT variations. Compared with the SRAM-based CD-CiM, the energy efficiency improvement is 1.9x at 0.45V and 7.8x at 0.90V, which confirms the theoretical analysis above. CD-CiM evaluation results are not sensitive to the frequency as long as the design can reach the target operating speed. For example, the SRAM-based CD-CiM fails to reach 1GHz at 0.45V.
C. Precision Analysis
Theoretical Analysis and Array-Level Simulation
The energy evaluation above assumed no variation impact. In current-mode sensing, however, variations can play a key role because variations of the currents directly affect the summed result. For the proposed CD-CiM, it is also important to investigate how the variations of the FeFETs and C_M affect the overall computing accuracy. Intuitively, as FeFETs have a very large on-off ratio, the drain-source leakage current I_OFF is negligible compared with the on-state current I_ON. Therefore, the internal node X in each cell is well set at GND or VDD. Further, the non-ideality of the computation based on charge redistribution is determined by the C_M capacitor mismatch, which affects the amount of charge redistributed at the output ScL. Theoretically, the ideal MAC result VDD*M/N is reshaped as
$$V_{MAC} = V_{ScL} = \frac{1}{\sum_{i=1}^{N} C_i} \sum_{i=1}^{N} V_{Xi} \times C_i = \frac{VDD}{\sum_{i=1}^{N} C_i} \times \left( \sum_{i=1}^{M} \frac{C_i \times R_{OFFi}}{R_{ONi} + R_{OFFi}} + \sum_{i=M+1}^{N} \frac{C_i \times R_{ONi}}{R_{ONi} + R_{OFFi}} \right), \quad (4)$$
where C_i is the capacitor of the i-th cell, V_Xi is V_X of the i-th cell, R_ONi is the on-state FeFET drain-source resistance of the i-th cell, R_OFFi is the off-state FeFET drain-source resistance of the i-th cell, N is the total number of cells in the MAC operation, and M is the number of cells whose XNOR result is '1'. In the analysis, N is set to 128; C_M is modeled as a Gaussian distribution with a mean value of 1.2fF. With the normalized standard deviation σ_C of C_M between 1% and 5% and the on-off ratio set to infinity, Fig. 10 shows the normalized standard deviation of V_MAC. Considering all corners, the worst σ_MAC occurs at {p = 0.5, σ_C = 5%} and is below 0.25%. This indicates a much smaller impact than the direct current summing errors caused by typical RRAM or MOSFET I_ON variations. Fig. 11 shows normalized MAC errors with different on-off ratios, in which R_ON and R_OFF are logarithmic Gaussian random variables with a normalized standard deviation of 15% and σ_C = 5%. When the on-off ratio is over 10 , the normalized MAC error has a ~99.2% chance of being below that caused by flipping one XNOR cell. In contrast, with an on-off ratio around 10 , which is a typical value for RRAM, the accumulated error can be so significant that the average normalized MAC error is as high as 5%. This finding answers a fundamental question: why is it challenging to explore CD-CiM using RRAM and MTJ in the fourth quadrant of Fig. 1?
Fig. 8. ScL charging capacitance: (a) model of the proposed 2T1C design; (b) model of the prior SRAM-based design; (c) comparison as a function of p (C_M = 1.2fF, N = 128).
Fig. 9. Comparison of MAC operation energy (pJ) between different works in an array: SRAM current domain [25], RRAM (1T1R), SRAM charge domain [5], and this work, at 0.45V and 0.9V.
Fig. 10. Normalized standard deviation σ_MAC vs. p.
Fig. 11. Normalized MAC error vs. p with σ_C = 5% and different on-off ratios: (a) mean on-off ratio = 1e2; (b) mean on-off ratio = 1e3; (c) mean on-off ratio = 1e4; (d) mean on-off ratio = 1e5.
Application Simulation
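Looking back at the precision analysis, the capacitor-mismatch part of (4) is easy to probe with a small Monte Carlo experiment. The Python sketch below assumes an infinite on/off ratio (so V_X is exactly VDD or GND), independent Gaussian capacitors with unit mean, and normalization of V_MAC to VDD. The function name, trial count, and seed are arbitrary illustrative choices, and this is not the authors' simulation setup.

```python
import random
import statistics

def mac_sigma(p: float, sigma_c: float, n: int = 128,
              trials: int = 2000, seed: int = 0) -> float:
    """Standard deviation of V_MAC / VDD under Gaussian C_M mismatch.

    Hypothetical helper: with an infinite on/off ratio, (4) reduces to
    V_MAC / VDD = sum(C_i for XNOR-'1' cells) / sum(all C_i).
    """
    rng = random.Random(seed)
    m = round(p * n)  # number of cells whose XNOR result is '1'
    samples = []
    for _ in range(trials):
        # Capacitors normalized to the mean C_M, with relative spread sigma_c.
        caps = [rng.gauss(1.0, sigma_c) for _ in range(n)]
        samples.append(sum(caps[:m]) / sum(caps))
    return statistics.pstdev(samples)

if __name__ == "__main__":
    for p in (0.1, 0.25, 0.5):
        print(f"p = {p:.2f}: sigma_MAC = {100 * mac_sigma(p, 0.05):.3f} %")
    print(f"one-cell step 1/N = {100 / 128:.3f} %")
```

For p = 0.5 and σ_C = 5%, a first-order estimate gives σ_C/(2√N), roughly 0.22% for N = 128, which is consistent with the sub-0.25% bound quoted in the analysis and well below the 1/N step of one flipped cell.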
The proposed FeFET-based CD-CiM macro is evaluated for classification applications while considering the impact of C_M variations. The PyTorch framework is used to build the binary LeNet on the MNIST test set [11] and the binary NIN [10] on the CIFAR-10 test set. To isolate the effect of the non-idealities of the core array, it is assumed that peripherals, such as the reference voltage generator, the SAs, and the quantization blocks, do not lower the overall accuracy. Fig. 12 plots the classification accuracy as a function of σ_C. Ideally, XNOR-Net in [8] achieves ~99.0% classification accuracy on the MNIST test set and 85.6% on the CIFAR-10 test set. As shown in Fig. 12, as long as the capacitor mismatch is within a reasonable range of 20% for CIFAR-10 and 30% for MNIST, the classification accuracy is almost unaffected. In practical designs, this matching requirement can guide the C_M design in a given technology for an optimized trade-off between the target accuracy, power consumption, and layout area.
D. Area
The proposed FeFET-based cell consists of two transistors and one capacitor. Because the capacitor can be placed on top of the transistors, the area overhead of the capacitor is significantly reduced. In contrast, the SRAM-based CD-CiM cell needs one capacitor and a total of 9 transistors, including 8 for the XNOR cell and one extra transistor to connect the cell to ScL for accumulation in a MAC [5]. In addition, the 2 transistors in the proposed design are both n-type and can be placed in a more compact layout than SRAM transistors.
IV. CONCLUSIONS
This paper has presented the concept and design of an NVM-based charge-domain computing-in-memory approach. A 2T1C XNOR CiM cell is proposed based on FeFET, a nonvolatile CMOS-compatible NVM device with an ultra-high on/off ratio. The array implementation for MAC CiM macro based on the proposed cell is presented and evaluated. Comparisons show higher density and lower power than prior current-domain and charge-domain CiM designs. Circuit and application evaluations have shown the potential of improving the performance and energy efficiency of BNN accelerators while achieving high accuracy.
Acknowledgment
The authors would like to thank Prof. Nan Sun for helpful discussions, and Prof. Sumeet Gupta and Kai Ni for model support.
REFERENCES
[1] W. A. Wulf et al., "Hitting the memory wall: implications of the obvious," ACM SIGARCH Computer Architecture News, vol. 23, no. 1, pp. 20-24, 1995.
[2] M. Peemen et al., "Memory-centric accelerator design for Convolutional Neural Networks," pp. 13-19.
[3] D. Bankman et al., "An Always-On 3.8 μJ/86% CIFAR-10 Mixed-Signal Binary CNN Processor With All Memory on Chip in 28-nm CMOS," in IEEE JSSC, vol. 54, no. 1, pp. 158-172, Jan. 2019.
[4] J. Yue et al., "7.5 A 65nm 0.39-to-140.3 TOPS/W 1-to-12b unified neural network processor using block-circulant-enabled transpose-domain acceleration with 8.1× higher TOPS/mm2 and 6T HBST-TRAM-Based 2D Data-Reuse Architecture," pp. 138-140.
[5] H. Valavi et al., "A 64-Tile 2.4-Mb In-Memory-Computing CNN Accelerator Employing Charge-Domain Compute," IEEE Journal of Solid-State Circuits, vol. 54, no. 6, pp. 1789-1799, 2019.
[6] X. Sun et al., "XNOR-RRAM: A scalable and parallel resistive synaptic architecture for binary neural networks," pp. 1423-1428.
[7] S. Yu et al., "Compute-in-Memory with Emerging Nonvolatile-Memories: Challenges and Prospects," IEEE CICC 2020, pp. 1-4.
[8] M. Rastegari et al., "Xnor-net: Imagenet classification using binary convolutional neural networks," in ECCV, 2016: Springer, pp. 525-542.
[9] S. Ioffe et al., "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv:1502.03167, 2015.
[10] M. Lin et al., "Network in network," arXiv preprint arXiv:1312.4400.
[11] Y. LeCun et al., "Gradient-based learning applied to document recognition," in Proceedings of IEEE, vol. 86, no. 11, Nov. 1998.
[12] J. Wu et al., "A 3T/Cell Practical Embedded Nonvolatile Memory Supporting Symmetric Read and Write Access Based on Ferroelectric FETs."
[13] K. Ni et al., "SoC Logic Compatible Multi-bit FeMFET Weight Cell for Neuromorphic Applications," pp. 13-2.
[14] S.-Y. Wu, "A new ferroelectric memory device, metal-ferroelectric-semiconductor transistor," in TED, vol. 21, no. 8, pp. 499-504, Aug. 1974.
[15] J. Jo and C. Shin, "Negative Capacitance Field Effect Transistor with Hysteresis-Free Sub-60-mV/Decade Switching," in IEEE Electron Device Letters, vol. 37, no. 3, pp. 245-248, March 2016.
[16] J. Müller et al., "Ferroelectricity in HfO2 Enables Nonvolatile Data Storage in 28 nm HKMG," pp. 25-26.
[17] X. Li et al., "Enabling Energy-Efficient Nonvolatile Computing with Negative Capacitance FET," TED, vol. 64, no. 8, pp. 3452-3458, 2017.
[18] C. H. Cheng et al., "Low-leakage-current DRAM-like memory using a one-transistor ferroelectric MOSFET with a hf-based gate dielectric," IEEE Electron Device Lett., 35(1):138-
[19] J. Muller et al., "Nanosecond Polarization Switching and Long Retention in a Novel MFIS-FET Based on Ferroelectric HfO2," in IEEE Electron Device Letters, vol. 33, no. 2, pp. 185-187, Feb. 2012.
[20] D. Reis et al., "Computing in Memory with FeFETs," in ISLPED '18. New York, NY, USA: ACM, 2018, pp. 24:1-
[21] A. Aziz et al., "Physics-Based Circuit-Compatible SPICE Model for Ferroelectric Transistors," in EDL, vol. 37, no. 6, pp. 805-808, June 2016.
[22] S. Salahuddin et al., "The Era of Hyper-Scaling in Electronics," Nature Electronics, vol. 1, no. 8, pp. 442-
[23] X. Chen et al., "Design and optimization of FeFET-based crossbars for binary convolution neural networks," pp. 1205-1210.
[24] J. Choi et al., "Accurate and efficient 2-bit quantized neural networks," in Proc. Conf. Syst. Mach. Learn. (SysML), 2019.
[25] A. Agrawal et al., "Xcel-RAM: Accelerating Binary Neural Networks in High-Throughput SRAM Compute Arrays," in IEEE TCAS-I, vol. 66, no. 8, pp. 3064-3076, Aug. 2019.
[26] S. K. Thirumala et al., "Nonvolatile Memory utilizing Reconfigurable Ferroelectric Transistors to enable Differential Read and Energy-Efficient In-Memory Computation," 2019 IEEE/ACM ISLPED, pp. 1-6.
[27] "Predictive technology model," [Online]. Available: ptm.asu.edu.
[28] Z. Jiang et al., "C3SRAM: An In-Memory-Computing SRAM Macro Based on Robust Capacitive Coupling Computing Mechanism," in JSSC, vol. 55, no. 7, pp. 1888-1897, July 2020.
[29] M. Lee et al., "FeFET-based low-power bitwise logic-in-memory with direct write-back and data-adaptive dynamic sensing interface," in ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED'20), August 10-12, 2020, Boston, MA, USA.
Fig. 12. Impact of C_M mismatch on classification.