3D-aCortex: An Ultra-Compact Energy-Efficient Neurocomputing Platform Based on Commercial 3D-NAND Flash Memories

Abstract

The first contribution of this paper is the development of extremely dense, energy-efficient mixed-signal vector-by-matrix-multiplication (VMM) circuits based on the existing 3D-NAND flash memory blocks, without any need for their modification. Such compatibility is achieved using time-domain-encoded VMM design. Our detailed simulations have shown that, for example, the 5-bit VMM of 200-element vectors, using the commercially available 64-layer gate-all-around macaroni-type 3D-NAND memory blocks designed in the 55-nm technology node, may provide an unprecedented area efficiency of 0.14 µm²/byte and energy efficiency of ~10 fJ/Op, including the input/output and other peripheral circuitry overheads. Our second major contribution is the development of 3D-aCortex, a multi-purpose neuromorphic inference processor that utilizes the proposed 3D-VMM blocks as its core processing units. We have performed rigorous performance simulations of such a processor on both circuit and system levels, taking into account non-idealities such as drain-induced barrier lowering, capacitive coupling, charge injection, parasitics, process variations, and noise. Our modeling of the 3D-aCortex performing several state-of-the-art neuromorphic-network benchmarks has shown that it may provide the record-breaking storage efficiency of 4.34 MB/mm², the peak energy efficiency of 70.43 TOps/J, and the computational throughput up to 10.66 TOps/s. The storage efficiency can be further improved seven-fold by aggressively sharing VMM peripheral circuits at the cost of a slight decrease in energy efficiency and throughput.


M. Bavandpour et al., “3D-aCortex”, August 2019, 2nd revision

Introduction

The Vector-by-Matrix Multiplication (VMM) is the most common operation in deep neural networks and many other computationally-intensive data and signal processing systems [1-6]. This fact is the motivation for the current intensive development of efficient VMM circuits and optimal architectures for their deployment in neuromorphic processors. So far, most VMM implementations are digital, with numerous commercial and experimental processor architectures developed in the last several years [7-14]. Their performance on VMM-heavy benchmarks is much higher than that of the standard CPUs, in part due to using low-precision operations, sufficient in particular for most neuromorphic tasks [15-17], including the most frequent inference function. However, digital approaches to the VMM task lead to relatively sparse designs, which necessitates storing most of the synaptic weights off-chip and, as a result, paying a large performance penalty for memory access [18]. Due to the limited required precision, the digital implementations of the VMM may be challenged by mixed-signal (MS) circuits based on advanced analog-grade non-volatile memory devices, such as ReRAM [19-22], phase-change [23,24], and embedded floating-gate memories [25-32]. Indeed, prior work on such circuits has demonstrated the possibility of rather dramatic, orders-of-magnitude advantages in energy, speed, throughput, and circuit density over their digital counterparts [18,19,26,28]. However, the mixed-signal approaches to the VMM tasks have their own challenges. The developed technologies for fabrication of highly scalable emerging memristive devices are not yet mature, still requiring substantial (orders-of-magnitude) improvements in device-to-device uniformity and in device current reduction. The floating-gate memory cells, whose optimal design [33] mitigates these problems, are relatively large, even if implemented by re-design of highly optimized commercial flash memories [25].
The resulting relatively low circuit density may lead, just like in the case of the digital implementations, to significant inter- and intra-chip data transfer overheads [18]. An additional concern is the large area/energy overhead of conversion between analog and digital domains in MS inference accelerator architectures. These challenges have provided the main motivation for our work: the development of VMM circuits and architectures based on 3D-NAND memories [34-38]. Indeed, even the already developed commercial 3D-NAND memory technology enables record-breaking effective bit density, ultra-low fabrication cost per bit, and multi-level cell programming capability [37], while still advancing rapidly.

Mohammad Bavandpour, Shubham Sahay, Mohammad Reza Mahmoodi, Dmitri B. Strukov

University of California, Santa Barbara, ECE Department

• A novel time-domain implementation of MS-VMM using commercial 3D-NAND flash memory blocks, without the need for their modification.
• A detailed analysis of non-idealities impacting compute precision in 3D-NAND VMMs, and optimization of peripheral circuits to achieve the target precision.
• A detailed discussion of the baseline 2D-aCortex architecture, focused on its unique features.
• A 3D weight-packing algorithm and a memory-requirement analysis process for 3D-aCortex.
• Detailed performance results and their breakdowns for common deep-learning benchmarks.

Fig. 1a shows a typical 3D-NAND memory architecture. In it, many layers of memory cells are stacked on top of each other, with the cells connected in the z-direction (normal to the chip surface) to form a “string”. On the top of each string, there is a bit-select-line (BSL) transistor that connects it to the bit line (BL). The memory block consists of a 2D (x-y plane) mesh of such strings, with all memory cells of the same level (i.e., at the same z-position) sharing the common word-line (WL) metal plate. In addition, the strings share BSLs in the x-direction, and BLs in the y-direction. While showing a possible dramatic increase of the stored weight density (scaling as the number of the cell layers), this figure also indicates a major problem for the VMM implementation. Namely, the sharing of each word line by all cells of that layer does not allow using the “current-mode” approach [28, 29] that was successfully employed [18, 25] for the adaptation of a commercial 2D flash memory for MS-VMM. In the future, an appropriate redesign of the 3D wiring (perhaps, as in the 2D work [18, 25], not touching the highly optimized memory cells) may be the best option. However, such a modification (assumed in the recent work [39]) would require a major technological effort.
(Additionally, the approach [39] requires using high-resistance and high-capacitance WL on the critical path.) In this work, we show that the time-domain approach to the VMM function [27, 40-42] may enable using commercial 3D-NAND memories without any modification. After describing this approach at the beginning of section 2, we use the balance of that section to describe the methods of our detailed, quantitative analysis of the possible performance of the resulting 3D-VMM blocks, taking into account various non-idealities impacting their performance.

Design

Time-Domain VMM

The target analog VMM operation may be represented as

$$y_j = \frac{1}{M} \sum_{i=1}^{M} w_{ij} x_i, \qquad (1)$$

where $x_i$, $w_{ij}$, and $y_j$ are real numbers, which may take any values on the [0, 1] segment. In the time-domain approach [27], the components $x_i$ and $y_j$ of the input and output vectors are encoded with the durations $\Delta$ of fixed-amplitude pulses: $\Delta_i^{in} = x_i T$, $\Delta_j^{out} = y_j T$, where $T$ is a certain fixed time window, while the matrix elements (“weights”) $w_{ij}$ are represented by adjustable current sources $I_{ij}$ within a fixed range $[0, I_{max}]$: $w_{ij} = I_{ij}/I_{max}$. (In floating-gate memory cells, the weights are kept in the form of stored floating-gate charges, which define the source-to-drain currents $I_{ij}$ at a fixed drain voltage.)

The computation is performed in two phases (Fig. 1b). During the first $T_{int}$-long (integration) phase, the input pulse $\Delta_i^{in}$ turns on fixed drain voltages, and hence the current sources $I_{ij}$ of the $i$-th row, leading to the injection of electric charges equal to $I_{ij} \Delta_i^{in} \propto w_{ij} x_i$ into the $j$-th column through the corresponding memory cells. The charges from multiple rows of the $j$-th column are summed up on its load capacitor $C$. As a result, by the end of phase I, the capacitor voltages $V_C$ (which are reset before the operation) become proportional to the components of the desired VMM output vector:

$$V_{C,j} = \frac{1}{C} \sum_{i=1}^{M} I_{ij} \Delta_i^{in}. \qquad (2)$$

Fig. 1: The main idea of the 3D-VMM circuit. (a) Cartoon of a 3D-NAND flash memory block and its use in the proposed circuit. For simplicity, a layer of transistors at the bottom of the block, which connects the cell strings to the common source (ground), is not shown. (b) Basic structure and example of operation in the utilized time-domain approach [27]. (c) Circuit diagram of the peripheral neuron, which consists of a load capacitor $C$, connected to the bit line (BL), and an SR latch, implementing a unit step function of its input. (d) Equivalent circuit of a single string for the operation mode.

During the second $T$-long phase, these voltages are converted into the durations $\Delta_j^{out}$ of the output pulses (Fig. 1b). This is done by additional charging of each load capacitor with a constant “sweep” current equal to $M I_{max}$, inducing a linear ramp-up of its voltage in time, starting from the value given by Eq. (2). At the moment when the total voltage reaches the fixed threshold $V_{th}$, an output fixed-amplitude pulse is initiated, with its falling edge aligned with the end of this phase II. As a result, the duration of the output pulse generated in phase II is

$$\Delta_j^{out} = \frac{1}{M I_{max}} \sum_{i=1}^{M} I_{ij} \Delta_i^{in}, \qquad (3)$$

where, just for convenience, all load capacitances are assumed to be equal to $C = M I_{max}/V_{th}$. Also, note that $T \ge T_{int}$. The described approach can be easily extended to four-quadrant time-domain VMM by using differential rows/columns, and a set of four cells for each weight, to represent positive and negative inputs/outputs [34].

In our 3D-VMM block, each elementary (“single-shot”) VMM operation uses the weights recorded in the floating-gate cells of one x-y layer of the 3D-NAND memory circuit (see Fig. 1a again). This layer is selected by setting its word line (WL) voltage to 2 V, while setting the cells of all other layers to the highly conductive “pass” state by applying 5 V to those word lines. The cell currents are collected and integrated at the bit lines (BL). However, irrespective of the selected layer of cells, the inputs are always applied to bit-select lines.
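The two-phase operation described by Eqs. (1)-(3) can be checked with a short behavioral model. The sketch below is illustrative only: it assumes ideal current sources and an ideal threshold detector, ignores all non-idealities, and takes the load capacitance as C = M·I_max·T/V_th (the factor T is an assumption added here so that a full-scale input charges the capacitor exactly to V_th). The function name and the default parameter values are hypothetical choices, not the simulation setup of this work.

```python
import numpy as np

def time_domain_vmm(x, w, T=16e-9, I_max=300e-9, V_th=0.6):
    """Idealized two-phase time-domain VMM (behavioral sketch).

    x : inputs in [0, 1], shape (M,)
    w : weights in [0, 1], shape (M, N)
    Returns the output pulse durations normalized by T, shape (N,).
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    M = x.shape[0]

    # Assumed load capacitance: full-scale charge M*I_max*T maps to V_th.
    C = M * I_max * T / V_th

    d_in = x * T          # input pulse durations, Delta_in = x * T
    I = w * I_max         # cell currents, w = I / I_max

    # Phase I: each column integrates the charge I_ij * Delta_in_i (Eq. 2).
    V_C = (I * d_in[:, None]).sum(axis=0) / C

    # Phase II: a sweep current M*I_max ramps the voltage up to V_th.
    t_cross = (V_th - V_C) * C / (M * I_max)

    # The output pulse starts at the crossing and ends with phase II (Eq. 3).
    d_out = T - t_cross
    return d_out / T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random(200)
    w = rng.random((200, 8))
    print(np.max(np.abs(time_domain_vmm(x, w) - x @ w / 200)))
```

Under these assumptions the normalized output durations coincide with Eq. (1), which is the property the time-domain encoding relies on.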
The “sweep” currents, necessary for phase II of the operation, are injected through the top layer of cells of all strings, enabled by a positive voltage applied to all bit-select lines (BSL). Such elementary VMM operations, based on different layers, are used as steps of the time-division-multiplexing operation. Clearly, such a VMM operation mode does not require changes in the usual NAND flash memory array, and only needs to complement it with custom-designed peripheral decoder and level-shifter circuits. Note that because of significant WL parasitics in 3D-NAND memory, the total delay for performing one elementary VMM operation is $2T_{LS} + T_{int} + T$, where $T_{LS}$ is the time required to select a certain layer.

Non-Idealities

For our detailed analysis, we have specifically considered the 3D-NAND memory based on polysilicon gate-all-around macaroni-body charge-trap cells. Besides its widespread use, another reason for this choice is the availability of a behavioral compact model for such memory, which may be used for quantitative simulation. In this model, individual cells are approximated as cylindrical gate-all-around nanowire FETs with a voltage-controlled current source [39]. The model takes into account various parasitic capacitive coupling effects, and accurately reproduces the experimental string current characteristics [43, 44]. We next discuss the most important factors affecting computing precision:

A. Drain-Induced Barrier Lowering

Let us first note that since the current is sunk through the cells to the source line, we consider the scheme in which the BL is charged to a voltage $\Delta V_D + V_{th}$ at the start of phase I, where $\Delta V_D$ is the total voltage swing on the BL during computation, and then discharged to $V_{th}$ in phase II. The DIBL error is defined as the relative difference of the currents via a string of cells at the two extreme BL voltages, i.e.

$$E_{DIBL} \approx 1 - \frac{I(V_{th})}{I(V_{th} + \Delta V_D)}. \qquad (4)$$

Without considering additional headroom to deal with capacitive coupling, the typical values are $V_{th}$ = 0.6 V and $\Delta V_D$ = 0.2 V, which correspond to the quasi-optimal operation conditions for the CMOS-based neuron implementation [12]. According to Eq. 4, the DIBL error is proportional to the small-signal transconductance gain $\delta I_D/\delta V_D$ of a string over the target operating regime. Given the small-signal model shown in Fig. 1d, the transconductance gain can be formulated as

$$\frac{\delta I_D}{\delta V_D} = \frac{1}{R_D + R_0 + (1 + g_m R_0) R_S}, \qquad (5)$$

where $g_m$ and $R_0$ are the small-signal parameters of a single memory cell, and $R_D$ and $R_S$ are the lumped string resistances on the drain and source side, respectively, of the selected memory cell. According to Eq. 5, larger $R_D$ and $R_S$ help reduce the DIBL error, but at the cost of limiting the current range. Moreover, because of the stronger effect of $R_S$, the DIBL error is smaller for top memory cells (which was the reason for using the top layer for sweep currents). Also, the DIBL error is smaller for larger string currents due to intrinsically larger $R_0$, when the selected cell operates closer to the strong inversion mode. These observations are confirmed by modeling (Fig. 2). In line with Eq. 4, the DIBL error increases almost linearly with the total swing in the target operation region (Fig. 2b).

B. Capacitive Coupling

Due to the switched-capacitor nature of the proposed approach, capacitive coupling is a significant source of compute error. We break down the sources of coupling into two components. The first component, gate-drain (GD) coupling, is caused by the gate-drain overlap in the BSL transistor and by the coupling between the BSL and BL wires. The second one (DD) is caused by the parasitic capacitors between the string and the rest of the memory block. These two lumped capacitors are denoted as $C_{gd}$ and $C_{dd}$, respectively (Fig. 1d). Note that $C_{dd}$ is distributed over the total length of the string. When a 2.5 V rising edge is applied to the BSL, GD coupling results in an immediate positive disturbance charge on the BL in the amount of $C_{gd} \times (2.5\ \mathrm{V})$. Moreover, when the string is selected via the BSL, DD coupling causes a negative disturbance charge on the BL, needed to charge the string capacitors $C_{dd}$ from their initial voltage (ground) to the final DC voltage at which the string sinks the target current. When a 2.5 V falling edge is applied to the BSL, the capacitive coupling is dominated by the GD coupling, which causes an immediate negative disturbance charge of $-C_{gd} \times (2.5\ \mathrm{V})$ on the BL. The GD coupling disturbance is almost independent of the selected cell location and programming state, while the DD coupling disturbance during the rising edge is highly dependent on both (Fig. 3). The amplitude and time constant of the DD charge disturbance are both larger for the cells closer to the bottom of the string, due to the higher voltage variation on the parasitic capacitors ($C_{dd}$), especially the ones closer to the bottom but higher than the selected cell, where the paths to both ground and BL are highly resistive.
Taking into account the coupling, we can formulate the voltage disturbance on the BL for each input as $\Delta V_{cp} = Q_D/C$, where $C$ is the load capacitance per input, and $Q_D$ is the total disturbance charge caused by one input in both phase I, when the target weight layer is selected and a rising edge followed by a falling edge is applied to the BSL, and phase II, when the sweeping layer (i.e., the top layer) is selected and one rising edge is applied to the BSL. A major portion of $Q_D$, and consequently of $\Delta V_{cp}$, depends on the location of the target weight layer (Fig. 3b). Hence the maximum disturbance charge $(Q_D)_{\max}$, which causes the largest disturbance voltage swing on the BL, $(\Delta V_{cp})_{\max} = (Q_D)_{\max}/C$, occurs when the target weight layer is at the bottom of the string. In order to support VMM operation on all the layers, the reset voltage $\Delta V_D + V_{th}$ should be selected to reserve a portion of the total voltage swing on the BL for the worst-case voltage variation due to coupling. Hence, we select $\Delta V_D = \Delta V_{cmp} + (\Delta V_{cp})_{\max}$, where $\Delta V_{cmp}$ is the voltage swing without considering the capacitive coupling for the weight and sweep current sources. Though the utilized differential scheme is robust to coupling, the output time window in which the output pulse is generated should be scaled by a coupling coefficient $\alpha_{cp} = 1 + (\Delta V_{cp})_{\max}/\Delta V_{cmp}$. Note that a small portion of $(\Delta V_{cp})_{\max}$ still affects the output precision because of the difference in disturbance charge caused by positive and negative sub-weights due to process variation, and dependence of […] $(\Delta V_{cp})_{\max}$ also leads to a higher BL voltage swing and consequently a larger DIBL error.

Fig. 2: (a) Small-signal DIBL error contours (shown in %) in the $I_D$-$V_D$ space for top, middle, and bottom layer memory cells, programmed in various states in a 64-layer 3D-NAND memory. The small-signal error is defined as $100 \times (1 - I(V_D)/I(V_D + 1\ \mathrm{mV}))$, i.e., the relative change in string current for a 1 mV change in the BL voltage. (b) Total DIBL error (%) for a ±0.2 V swing on the drain voltage around $V_D$ = 0.7 V for various memory states.

Fig. 3: Charge disturbance on the BL due to capacitive coupling. (a) Time-domain representation of the drain (BL) current and its disturbances caused by coupling when a 2.5 V rising edge (at t = 0.5 ns) followed by a same-amplitude falling edge (at t = 2 ns) is applied to the BSL, for various programming states, where the selected cell is located at the top, middle, and bottom layer of the string. (b) Total string disturbance charge on the drain caused by capacitive coupling when a 2.5 V rising + falling edge is applied to the BSL and the target cell is located in the top, middle, and bottom layer and programmed in various states (corresponding to phase I of computation), as well as when a single 2.5 V rising edge is applied to the BSL and the target cell is located in the top layer and programmed in various states (corresponding to phase II of computation). Error bars represent the 3σ distribution of the disturbance charge due to process variations.

C. Noise

White (shot/thermal) noise will dominate at the considered high-bandwidth operation. (We assume that the cells with extremely high flicker noise will be set to high conductive states and avoided during mapping.) The noise power for a single string operating in subthreshold can be approximated as ~$2 q I_{max}/T$, and the SNR of a single device as $\mathrm{SNR}_{cell} \approx I_{max} T/(2q)$, where $q$ is the electron charge. Accordingly, for an $M \times 1$ VMM unit (a dot product), the noise and signal powers are $P_{noise}^{M \times 1} = 2 q M I_{max}/T$ and $P_{signal}^{M \times 1} = (M I_{max})^2$, respectively. Hence,

$$\mathrm{SNR}_{M \times 1} = \frac{P_{signal}^{M \times 1}}{P_{noise}^{M \times 1}} \approx \frac{M I_{max} T}{2q} = M \times \mathrm{SNR}_{cell}. \qquad (6)$$

The equivalent 3σ error due to noise is derived as

$$E_{3\sigma}^{M \times 1} \approx 2 \times 3 \times \frac{\sqrt{2 q M I_{max}/T}}{M I_{max}} = 6 \times \sqrt{\frac{2q}{M I_{max} T}} = \frac{E_{3\sigma}^{cell}}{\sqrt{M}}. \qquad (7)$$

Note that in the above equation, the distribution is multiplied by two due to the differential scheme. According to the derived equation, the compute error is inversely proportional to the square root of the maximum current, the compute time window, and the VMM size.

Computing Precision

The compute (output) precision $p_O$ can be defined separately from the weight precision [27] as

$$p_O = -\log_2(E_C) - 1, \qquad E_C = \frac{1}{T} \max_{\Delta^{out}} \left| \Delta^{ideal} - \Delta^{out} \right|, \qquad (8)$$

where $E_C$ is the maximum absolute difference between the ideal ($\Delta^{ideal}$) and actual ($\Delta^{out}$) output pulse durations, normalized by its maximum value. The 3D-VMM circuit can be designed following various optimization targets such as precision, energy, speed, and area. Here, we focus on the precision, which generally limits the design space in application-specific hardware design. The main tunable circuit parameters impacting the precision of our 3D-VMM are $I_{max}$ and $T_{int}$. In Table 1, various combinations of ($T_{int}$, $I_{max}$) are targeted to investigate the impact of these parameters on the compute precision of the 3D-VMM. Assuming $\Delta V_{cmp}$ = 0.2 V and $(Q_D)_{\max}$ = 6×10⁻¹⁶ C, we first calculate dependent parameters such as the load capacitor, coupling voltage disturbance, and output time window for every combination of $I_{max}$ and $T_{int}$. Then, full circuit-level SPICE simulations are performed on 10 different VMM sizes from 10×10 to 1000×1000 with 1000-times randomized inputs/weights, considering detailed parasitic models for the interconnect wires and devices, and also process variations for the 55-nm technology node. The results for different simulated scenarios show that the compute error for the noise-free circuit remains relatively constant over the target VMM size range. Table 1 also reports the SNR and 3σ noise error parameters, calculated according to Eqs. 6 and 7, and the total error for three representative VMM sizes. Fig. 4 shows that the bit-precision, corresponding to the calculated error, increases with $I_{max}$, $T_{int}$, and VMM size.
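The dependent parameters of Table 1 can be reproduced from $I_{max}$ and $T_{int}$ with a few lines of arithmetic. The sketch below assumes that the load capacitance per input is C = I_max·T_int/ΔV_cmp and that the output window is T_out = α_cp·T_int; both relations are inferred here from the tabulated values rather than stated explicitly in the text. It also assumes that the final error is the sum of the noise-free error and the 3σ noise error of Eq. (7), which matches the tabulated entries. All function names are hypothetical.

```python
import math

Q_E = 1.602176634e-19  # electron charge (C)
DV_CMP = 0.2           # compute voltage swing (V), value from the text
QD_MAX = 6e-16         # worst-case disturbance charge (C), value from the text

def snr_cell(t_int, i_max):
    """Single-device SNR (power ratio): I_max * T / (2q)."""
    return i_max * t_int / (2 * Q_E)

def noise_3sigma(t_int, i_max, m=1):
    """3-sigma noise error of an M x 1 dot product, Eq. (7), as a fraction."""
    return 6.0 / math.sqrt(m * snr_cell(t_int, i_max))

def design_point(t_int, i_max):
    """Dependent parameters for one (T_int, I_max) pair."""
    c = i_max * t_int / DV_CMP          # assumed relation, see lead-in
    dv_cp = QD_MAX / c                  # worst-case coupling swing
    alpha = 1 + dv_cp / DV_CMP          # coupling coefficient
    return {
        "C_fF": c * 1e15,
        "dV_cp_mV": dv_cp * 1e3,
        "alpha_cp": alpha,
        "T_out_ns": alpha * t_int * 1e9,  # assumed relation, see lead-in
        "SNR_cell_dB": 10 * math.log10(snr_cell(t_int, i_max)),
        "E3s_pct": 100 * noise_3sigma(t_int, i_max),
    }

def final_error_pct(noise_free_pct, t_int, i_max, m):
    """Assumed combination: noise-free error plus 3-sigma noise error."""
    return noise_free_pct + 100 * noise_3sigma(t_int, i_max, m)

def bit_precision(err):
    """Output precision p_O of Eq. (8) for a fractional error err."""
    return -math.log2(err) - 1

if __name__ == "__main__":
    for t in (8e-9, 16e-9, 32e-9):
        for i in (100e-9, 200e-9, 300e-9):
            print(t, i, design_point(t, i))
```

For most columns these relations agree with the tabulated numbers. For the 16 fF and 32 fF cases they give coupling swings of 37.5 mV and 18.75 mV, whereas the table lists 32.5 mV and 16.25 mV; the tabulated output windows (19 ns, 38 ns, and 35 ns) are consistent with the computed swings.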
| Input time window $T_{int}$ (ns) | 8 | 8 | 8 | 16 | 16 | 16 | 32 | 32 | 32 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $I_{max}$ (nA) | 100 | 200 | 300 | 100 | 200 | 300 | 100 | 200 | 300 |
| Load capacitor per input $C$ (fF) | 4 | 8 | 12 | 8 | 16 | 24 | 16 | 32 | 48 |
| Coupling voltage swing $(\Delta V_{cp})_{\max}$ (mV) | 150 | 75 | 50 | 75 | 32.5 | 25 | 32.5 | 16.25 | 12.5 |
| Coupling coefficient $\alpha_{cp}$ | … | … | … | … | … | … | … | … | … |
| $T_{out}$ (ns) | 14 | 11 | 10 | 22 | 19 | 18 | 38 | 35 | 34 |
| Single-device $\mathrm{SNR}_{cell}$ (dB) | 33.97 | 36.98 | 38.75 | 36.98 | 40 | 41.76 | 40 | 43.01 | 44.77 |
| Single-device noise 3σ error (%) | 12 | 8.48 | 6.92 | 8.48 | 6 | 4.89 | 6 | 4.24 | 3.46 |
| Noise-free VMM compute error (%) | 6.24 | 3.55 | 1.79 | 4.25 | 2.31 | 1.16 | 3.62 | 1.92 | 0.96 |
| Final compute error, M = 10 (%) | 10.03 | 6.23 | 3.98 | 6.93 | 4.20 | 2.71 | 5.51 | 3.26 | 2.05 |
| Final compute error, M = 100 (%) | 7.44 | 4.40 | 2.48 | 5.10 | 2.91 | 1.65 | 4.22 | 2.34 | 1.30 |
| Final compute error, M = 1000 (%) | 6.62 | 3.81 | 2.01 | 4.52 | 2.50 | 1.31 | 3.81 | 2.05 | 1.07 |

Table 1: Design space exploration. Circuit specs and compute error (due to noise and circuit non-idealities) for various choices of $T_{int}$ and $I_{max}$. The final VMM error is reported for three different VMM sizes (M = 10, 100, and 1000), and the achievable output bit-precision is shown by a color coding scheme in which orange = 2 bits, blue = 3 bits, green = 4 bits, and yellow = 5 bits.

Fig. 4: 3D-NAND based VMM bit precision with respect to VMM size for $I_{max}$ = 100 nA, 200 nA, and 300 nA for $T_{int}$ = (a) 8 ns, (b) 16 ns, and (c) 32 ns.

Similar to 2D flash memory circuits [25, 26], the weight precision in the 3D-VMM is also expected to be affected by the tuning accuracy and drift of the analog memory state. An additional challenge for cell current tuning will be the relatively large resistances $R_D$ and $R_S$ (Fig. 1d). The voltage drops across these resistors (especially $R_S$) must be taken into account while optimizing the programming scheme for a target output current. Quantitative analysis of such factors is challenging, mostly due to the lack of published relevant data.
It should be noted, however, that the utilization of barrier-engineered materials and the gate-all-around architecture in the 3D-NAND memory results in a narrower threshold voltage distribution and a lower threshold voltage shift due to cell-to-cell coupling, as compared to the planar counterparts. In fact, multi-level state capabilities (> 3 bits) have been routinely demonstrated in 3D-NAND memories, and are expected to further improve as the technology continues to advance [35-37].

VMM Results

As was described in the last section, the 3D-VMM parameters can be chosen to operate with any precision from 2 bits to 5 bits. In this section, we describe the results obtained for the 4-bit precision, which has been shown to be sufficient for most neuromorphic computing tasks [15-17]. A 4-bit 3D-VMM block consists of the following main components (Fig. 1a):

• DTC converts the digital input to the time-domain pulse of fixed amplitude and controllable duration. As was described earlier [27], this unit includes one shared 4-bit counter and one 4-bit comparator connected to a 1-bit latch per input.
• The 3D-NAND memory block for the M × N (per layer) VMM consists of M × 2N cells with the dimensions reported in [38,43], as well as an extra marginal space for routing the word and bit-select lines. Note that the parasitics of the word-line plate extensions by routing and vias/wires are duly taken into account in the simulations.
• CAP stands for the load capacitor. Here we assume that it is implemented as a MOSCAP in the 55-nm technology, and also account for an extra marginal space around each capacitor. (Using MOM and MIM capacitors should further improve the results.)
• NB represents the neuron circuit, consisting of a pair of NAND latches and a couple of AND and NOT logic gates for implementing the differential scheme.
• TDC converts the time-encoded digital output to the corresponding digital output number. This unit consists of a 4-bit adder and a 4-bit DFF per output. The adder and the DFFs are connected to form an accumulator, counting the duration of the output pulse, using clock pulses which are shared by all accumulators. Note that this unit, along with the DTC, constitutes the “I/O”.
• WL represents the word-line level shifters, which apply the read/pass voltages (2 V / 5 V) to the word-line plates (Fig. 1a). Note that the width of each driver transistor is made proportional to the area (M × N) of the plate it serves, in order to keep the layer selection time ($T_{LS}$) within a limited range comparable to the computation time.
• BSL is an array of level shifters driving the bit-select lines and converting the 1.2 V time-encoded, fixed-amplitude input pulses to 2.5 V digital pulses.

As Table 1 shows, the optimal design point, which guarantees the 4-bit precision across VMMs of various sizes, is {$I_{max}$ = 300 nA, $T_{int}$ = 16 ns}. Fig. 5 shows the energy, area, and throughput calculation results for various sizes of our 3D-VMM, as well as the energy and area breakdowns for this design point. According to these results, the energy consumption is dominated by the word line selection and by feeding the inputs into the bit-select lines, despite the fact that their capacitance (per cell) is lower than the load capacitance $C$. (The reason is a higher voltage swing on these lines.) Moreover, the contribution of the I/O and neuron circuits to the energy consumption (per operation) decreases as the VMM size increases, due to their higher sharing factor. As a result, the energy per operation is only ~9 fJ for M = N = 500. Fig. 5b shows the area breakdown and the area per synapse (i.e., per weight). The area is highly dominated by the CAP, though its contribution to the energy consumption is minor. Moreover, the share of the I/O and neuron circuits in the area per operation also decreases when the size of the VMM is increased, due to the enhancement in their sharing factor. Finally, Fig. 5c shows the VMM's throughput for its various sizes, with $T_{LS}$ within the range of [20 ns, 30 ns]. The results show that the proposed 3D-VMM achieves a ~100× better area efficiency than that of its 2D-NOR memory-based counterpart [27], while maintaining a comparable energy efficiency and throughput. Such high area efficiency of our 3D-VMM enables its efficient system-level deployment via minimizing the data transfer overhead (see the next section).

Fig. 5: 3D-NAND based VMM performance metrics. (a) Energy per operation breakdown. (b) Area efficiency breakdown. (c) Throughput as a function of VMM size.

aCortex

aCortex is an MS NVM-based neuromorphic inference accelerator, which is specifically designed to minimize the peripheral circuitry overhead by performing more computation and communication in the analog domain. Fig. 6a shows the overall structure of the 2D-aCortex architecture. Its main components, for processing a K-word data stream (with p bits per word), are: (1) a central eDRAM-based main memory (MM) with one input and one output port of K-word width, (2) a set of configurable local K-word buffers supporting both individual load and load & shift operations with flexible chain size, (3) a 2D mesh of processing elements (PEs), each including a core K × K analog VMM, (4) a set of integrate-digitalize unit (IDU) blocks including K neurons, ADCs, and activation functions, (5) an auxiliary block (AUX) including an array of K digital comparators/adders/multipliers to perform infrequent neuromorphic operations such as max-pooling, element-wise vector addition, and vector-by-vector multiplication in the digital domain, and finally (6) a controller including an instruction memory (IM) and digital circuitry to produce control signals for other blocks. Input data are loaded from the MM into buffers through a shared K-word digital load bus (L-Bus).
This stream is vertically propagated from buffers into the PEs through a shared K -word input bus (I-Bus) in analog (or digital) domain where the input data conversion is done globally (or locally) in the buffer (or PE) blocks. Analog outputs of the PEs are integrated on the shared analog output bus (O-Bus) and converted back to digital domain in the IDU blocks, in which the activation function is also applied. The final output data is stored back into the MM via a shared K -word digital store bus (S-Bus). The VMM operator of variable size can be implemented by enabling target PEs on which the weights are pre-programmed, as well as their corresponding buffers and IDUs, while the rest of the blocks are disabled/inactive (Fig. 6a). Note that an output re-scaling scheme in the PE/IDU might be needed (especially when PE output is current-mode) in order to handle various input sizes. Accordingly, multiple VMM kernels of various dimensions may be packed into the 2D structure of PEs and utilized one at a time. The 2D-aCortex performs inference tasks in a layer-by-layer manner, by storing intermediate data in the MM. A full connection between two layers may be performed in a single VMM cycle. In contrast, the convolutional connection is performed one output pixel at a time in the row-first manner as shown in Fig. 6b. For that, one copy of the convolutional kernel is pre-programmed into a specific location in the 2D structure of PEs, and then activated in multiple steps to calculate different output pixels. Such scheme, along with the reconfigurable buffers, allows efficient data reuse. The recurrent-layer function is also performed in multiple steps, generally by loading the output of the previous step, along with its corresponding element of the input sequence, into the buffers, and activating the same kernel until the sequence has been finished. 
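The step-by-step convolution described above can be illustrated with a short behavioral sketch. It is not the authors' implementation: the function names `vmm` and `conv_by_vmm_steps` are hypothetical, the arithmetic is ideal (no analog non-idealities), and the shift-register data reuse of the buffers is not modeled. It only shows that programming one flattened copy of the kernel and activating it once per output pixel, in row-first order, reproduces the convolution result. The 3×3 input and 2×2 kernel follow the example of Fig. 6b.

```python
def vmm(vector, matrix):
    """Ideal vector-by-matrix product: out[j] = sum_i vector[i] * matrix[i][j]."""
    n_out = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(n_out)]


def conv_by_vmm_steps(image, kernels):
    """Compute a valid 2D convolution one output pixel per VMM step (row-first).

    image:   H x W list of lists (a single input channel, for brevity)
    kernels: list of kh x kw kernels, one per output channel
    Returns (outputs, windows), where windows[s] is the input vector fed in step s.
    """
    kh, kw = len(kernels[0]), len(kernels[0][0])
    height, width = len(image), len(image[0])

    # The kernel is flattened once ("pre-programmed"): one matrix column per output channel.
    weights = [[k[r][c] for k in kernels] for r in range(kh) for c in range(kw)]

    outputs, windows = [], []
    for y in range(height - kh + 1):          # row-first order over output pixels
        out_row = []
        for x in range(width - kw + 1):
            # Input window for this step, flattened in the same order as the kernel.
            window = [image[y + r][x + c] for r in range(kh) for c in range(kw)]
            windows.append(window)
            out_row.append(vmm(window, weights))  # one VMM activation per output pixel
        outputs.append(out_row)
    return outputs, windows


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 2], [3, 4]]
outputs, windows = conv_by_vmm_steps(image, [kernel])
print(windows)   # [[1, 2, 4, 5], [2, 3, 5, 6], [4, 5, 7, 8], [5, 6, 8, 9]]
print(outputs)   # [[[37], [47]], [[67], [77]]]
```

Consecutive windows overlap (for example, steps 1 and 2 share the inputs 2 and 5), which is the redundancy that the load & shift buffers are meant to exploit.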
Note that according to this scheme, any network with various interconnections of the discussed basic layers, such as Residual layers [5], Inception layers [4], Bi-directional and Residual Recurrent layers [6], etc., may be performed on this architecture, as long as the MM is large enough to keep the intermediate data.

The architecture of the proposed 3D-aCortex is derived from that of the 2D-aCortex, using the general transformation scheme shown in Fig. 7a. Indeed, the 2D-aCortex is equivalent to a very large VMM operator in which the digital inputs are read into the buffer blocks (shown black), which can be configured as shift registers to minimize the need for MM access in convolution tasks. The inputs are converted into analog/time-domain signals and propagated to the vertical input lines of the 2D NVM array, while the analog output signals, aggregated on the shared output lines of the array, are

Fig. 6: (a) 2D-aCortex architecture for a general-purpose neuromorphic processor employing a nonvolatile memory-based analog VMM as the core processing element (PE). (b) Convolution operation on the 2D-aCortex processor.

[Fig. 6 labels. (a) General 2D Architecture: Main Memory (MM), Controller, AUX, Processing Element (PE), Digital Buffer, IDU, Digital L-Bus, Digital/Analog I-Bus, Analog O-Bus, Digital S-Bus. (b) Convolution: Input * Kernel = Output, computed in Steps 1 to 4.]

PE:

In this architecture, PEs are placed as an M × 2N 2D structure where they share time-domain inputs in the vertical direction (I-Bus) and analog BL outputs in the horizontal direction (O-Bus). As shown in Fig. 7c, each PE includes a core 64-layer 3D-NAND memory with the size of K × 2K, together with its peripheral circuitry. The peripheral circuitry for each PE includes: 1) K local load capacitors (CAP) connected to the shared BLs and also to V_reset through pass gates, 2) 64 WL level-shifters (𝐖𝐋) and drivers for selecting the target layer, 3) K BSL level-shifters (𝐁𝐒𝐋) and drivers for changing the voltage level of the shared time-domain input and for activating the inputs during phase II of the computation, and 4) control logic gates for enabling/disabling the unit components. The column select (CS) and row select (RS) lines are propagated in the vertical and horizontal directions, respectively, to select and enable the target PEs. Moreover, the CAP pass gates in the enabled PEs are set to the VMM operation mode at the appropriate time through a control signal called VMM_OP.

IDU:

Each IDU block includes four sub-blocks: 1) neuron latches receiving input from the O-Bus, 2) TDCs, which are digital accumulators with higher precision (here 6-bit, where the 2 extra bits enable accumulating the results of a VMM operation over 4 layers, i.e. 4 × 2N × K inputs, without overflow), 3) barrel shifters to select the target output bit locations, and 4) activation function circuitry, which applies a target activation function (here linear, ReLU, tanh, or sigmoid) to the TDC's output.

Controller:

Due to the flexibility of 3D-aCortex, any VMM operation up to MK × NK can be performed in one VMM step. In order to perform a VMM of the desired size on one layer of the 3D-NAND memory: 1) target PEs and their corresponding DTC and IDU units are enabled, 2) input data is loaded into

Fig. 7: Baseline 3D-aCortex architecture and results, with no CAP sharing among 3D-NAND flash blocks. (a) 2D to 3D architecture transformation. (b) 3D-aCortex architecture layout and main blocks. (c) Controller sub-blocks and signaling between them. (d) PE's main circuit components and control circuitry. (e) Performance estimates for GNMT [6], Inception-v1 [4], and ResNet [5].

The goal of network mapping is to break down the inference computation into a sequence of steps (instructions) and to determine optimal locations for storing kernel weights in the VMM arrays and temporary results in the main memory. (The mapping process was also crucial for fine-tuning architectural parameters, e.g. understanding the minimal requirements for main memory capacity.) To do that, the neural network is first converted into a computational graph in which each node represents one network layer (convolution, fully-connected, max-pooling, etc.), while each edge represents the amount of data which has to be transferred from one node (layer) to others. The layers are processed sequentially as a sequence of “processing steps”, and we assume that all input and output data of the currently processed layer have to be stored in memory. With such a scheme, the total amount of main memory occupied after each processing step can be calculated by summing the data amounts of the edges in the computational graph which are cut by a line separating all already processed nodes from the yet-to-be processed ones. Fig. 9a shows the memory requirement graph extracted from such an assessment performed for the studied networks.
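The cut-based memory estimate can be sketched in a few lines. This is an illustrative model rather than the authors' tool: the function name `memory_profile`, the layer names, and the data sizes are hypothetical, and each edge is represented here as one producer with a list of consumers, so that an output used by several layers is counted once (an assumption about how shared outputs are treated).

```python
def memory_profile(order, tensors):
    """Main-memory occupancy after each processing step.

    order:   layer names in processing order
    tensors: list of (producer, consumers, size) describing the graph edges;
             a tensor is "cut" (must be stored) when its producer has been
             processed and at least one consumer has not.
    """
    step_of = {name: i for i, name in enumerate(order)}
    profile = []
    for step in range(len(order)):
        live = 0
        for producer, consumers, size in tensors:
            produced = step_of[producer] <= step
            still_needed = any(step_of[c] > step for c in consumers)
            if produced and still_needed:
                live += size
        profile.append(live)
    return profile


# Hypothetical residual block: the input feeds both conv1 and the final addition.
order = ["input", "conv1", "conv2", "add"]
tensors = [
    ("input", ["conv1", "add"], 100),  # sizes in arbitrary units (assumed)
    ("conv1", ["conv2"], 100),
    ("conv2", ["add"], 100),
]
profile = memory_profile(order, tensors)
print(profile)        # [100, 200, 200, 0]
print(max(profile))   # peak requirement: 200
```

The peak of such a profile over all steps is the quantity that bounds the required main memory capacity; the skip connection keeps the block input alive until the addition, which is why residual graphs need more memory than a plain chain of the same layers.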
The weight matrices are mapped into the 3D structure of memory blocks using a weight placement scheme including three steps, namely quantization, reshaping, and 3D packing. According to this scheme, first, the weight kernel dimensions, i.e. the numbers of inputs and outputs, are quantized by K. In the convolution operation, the quantization, reshaping, and packing are performed in such a way that the shift operation in hardware is equivalent to the shift in convolution. Then, in the second step, the quantized weight matrix dimensions are compared to the maximum dimensions of a one-step VMM in the hardware, i.e. 2N × M. If the kernel dimensions exceed the maximum allowable 2D VMM in hardware in any dimension, the weight matrix is broken in that dimension and reshaped to a 3D matrix in such a way that the third dimension, which is equivalent to the memory layer in the hardware, indicates different weight sub-matrices (either in a row-first or column-first manner). In the third step, weight kernels are mapped into specific locations in the 3D memory array using a heuristic algorithm whose goal is to minimize the number of utilized memory cell layers. Specifically, one iteration of the algorithm involves the generation of a randomly ordered list of kernels, and then sequential mapping of the kernels from the list by greedily searching for locations within already occupied memory layers, and only allocating new layers if no such location is found. The best solution is then chosen among several iterations of the algorithm. The output results of such an algorithm are shown in Fig. 9b for the three studied networks.

SYSTEM-LEVEL PERFORMANCE
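The evaluation in this section builds on the three-step weight placement (quantization, reshaping, 3D packing) described in the mapping discussion. As a rough sketch of that heuristic, and not the authors' code, the Python below quantizes kernel dimensions to K-sized tiles, folds oversize kernels into several layers, and runs a randomized first-fit packing, keeping the best of several iterations. All names (`to_box`, `place`, `pack`), the 4 × 4 tile grid, and the kernel sizes are hypothetical; the first-fit scan order and the stacking of sub-matrices on consecutive layers at the same location are simplifying assumptions.

```python
import math
import random


def to_box(n_rows, n_cols, K, max_rows, max_cols):
    """Quantize kernel dimensions by K and fold oversize kernels into layers.

    Returns (rows, cols, depth) in units of K x K tiles, where depth is the
    number of sub-matrices stacked along the memory-layer dimension.
    """
    rows, cols = math.ceil(n_rows / K), math.ceil(n_cols / K)
    depth = math.ceil(rows / max_rows) * math.ceil(cols / max_cols)
    return min(rows, max_rows), min(cols, max_cols), depth


def is_free(layers, layer0, r0, c0, box):
    """True if the box fits at (layer0, r0, c0); layers past the list are new and empty."""
    rows, cols, depth = box
    for layer in layers[layer0:layer0 + depth]:
        for r in range(r0, r0 + rows):
            for c in range(c0, c0 + cols):
                if layer[r][c] is not None:
                    return False
    return True


def place(layers, box, kernel_id, grid_rows, grid_cols):
    """First-fit placement: prefer existing layers, allocate new ones only if needed."""
    rows, cols, depth = box
    for layer0 in range(len(layers) + 1):  # layer0 == len(layers) always fits
        for r0 in range(grid_rows - rows + 1):
            for c0 in range(grid_cols - cols + 1):
                if is_free(layers, layer0, r0, c0, box):
                    while len(layers) < layer0 + depth:
                        layers.append([[None] * grid_cols for _ in range(grid_rows)])
                    for layer in layers[layer0:layer0 + depth]:
                        for r in range(r0, r0 + rows):
                            for c in range(c0, c0 + cols):
                                layer[r][c] = kernel_id
                    return


def pack(boxes, grid_rows, grid_cols, iterations=50, seed=0):
    """Randomized greedy packing; returns the layout using the fewest layers."""
    rng = random.Random(seed)
    best = None
    for _ in range(iterations):
        order = list(range(len(boxes)))
        rng.shuffle(order)                 # randomly ordered list of kernels
        layers = []
        for kernel_id in order:
            place(layers, boxes[kernel_id], kernel_id, grid_rows, grid_cols)
        if best is None or len(layers) < len(best):
            best = layers
    return best


K, GRID = 64, 4                            # hypothetical tile size and 4 x 4 tile grid
kernels = [(256, 128), (128, 128), (64, 64), (512, 64)]
boxes = [to_box(r, c, K, GRID, GRID) for r, c in kernels]
layout = pack(boxes, GRID, GRID)
print(boxes)                               # [(4, 2, 1), (2, 2, 1), (1, 1, 1), (4, 1, 2)]
print("layers used:", len(layout))
```

A fixed seed keeps the sketch reproducible; in practice the number of iterations trades mapping time against the number of memory layers used.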

In order to evaluate the system-level performance for any target DNN/RNN network running on the 3D-aCortex, we have developed a software framework that utilizes the post-layout energy/speed/area metrics of all its blocks (buffers, buses, DTCs, TDCs, neurons, and digital circuits) in the 55-nm technology node. (The energy/throughput/area numbers for the SRAM-based instruction memory and the eDRAM-based main memory are obtained using the CACTI memory estimator [45].) This framework extracts the list of processing tasks for a given network, maps the VMM kernels on the 3D array of memory devices, and provides estimates for the energy/throughput of the inference operation along with the area of the processor for the given set of architecture specifications. Two DNN networks, Inception-v1 [4] and ResNet [5], with different computational graphs and network sizes, and also Google's neural machine translation (GNMT), a very common RNN network [6], have been selected as the benchmarks for the evaluation of the proposed general-purpose architecture. The evaluation was performed for 3D-aCortex with 4-bit computing (activation) precision, which seems to be sufficient for the studied networks. For example, Refs. 51, 52 reported a negligible drop in functional performance compared to the full-precision one for exactly the same version of ResNet which was studied in our work, a larger version of Inception, and, similar to GNMT, LSTM-based recurrent networks. Furthermore, we have performed a preliminary exploration of architectural parameters to optimize the processor's performance. As a result, the value K = 64 was chosen to improve the computational block utilization, while the parameters M and N were selected to balance the read and write time/energy. Note that the parasitics of the shared bit lines (O-Bus) bound the horizontal dimension of the processor, and it, in turn, affects the PE's aspect ratio and the number (2N) of these elements sharing one line. A detailed study of the benchmark networks has shown that a 1 MB MM is sufficient to store all intermediate data, while the flow control program requires at most a 4 KB IM. Finally, our rigorous analysis indicates that M = 32 and N = 8 satisfy the aforementioned conditions while being sufficient to perform even the largest, 128M-weight benchmark GNMT. The architecture specifications, performance measures and their breakdown are summarized in Fig. 7e for the baseline 3D-aCortex. Fig. 8b,c shows a brief report, focusing on the most important differences, for the shared-CAP architecture.

Fig. 8: 3D-aCortex with CAP sharing: (a) main changes in PE design with respect to the baseline architecture, (b) area breakdown, and (c) performance estimates.

[Fig. 8 labels. (a) PE unit in the shared-CAP design, with the CAP shared between all PEs: BSL, WL, V_reset, Layer Select Lines, I-Bus, O-Bus, Column Select (CS), Row Select (RS), VMM_OP. (b) Area breakdown (%). (c) Area: 41.7 mm²; EE (TOps/J): GNMT 65, Inception-v1 19.76, ResNet 33.94; throughput (TOps/s): GNMT 8.2, Inception-v1 0.90, ResNet 1.34.]

COMPARISON WITH PRIOR WORK

On the circuit level, to the best of our knowledge, a 3D-NAND-based VMM has been studied in only one work [39]. The main assumption of that work was that the word line of every cell in a particular layer is partitioned into separate independent lines along the x-direction. However, such a modification would require major changes in the fabrication flow of the existing 3D-NAND memory technology. This approach also faces the very challenging problem of managing a large number of word lines, which would likely result in a very heavy peripheral overhead. In addition, in this scheme, based on the current-mode VMM, analog input signals are applied to highly resistive and capacitive word lines, leading to higher energy consumption and larger delays. In contrast, our approach is fully compatible with commercial 3D-NAND flash memory. Encoding of inputs by digital pulses that are applied to bit-select lines results in better energy efficiency and speed. Moreover, as was shown in the prior work on VMMs based on 2D floating-gate memory [27], the time-domain approach to the VMM task enables more compact and energy-efficient peripheral circuits than the usual current-mode implementations. As a result, as the detailed simulations described in Sec. 3 have shown, the energy efficiency of our 3D-VMM is very high - for example, ~9 fJ/Op for M = N up to 500, at the 5-bit accuracy. Our results also show that the proposed 3D-VMM achieves a ~100× area efficiency increase in comparison with its 2D-NOR memory-based counterpart [27], while maintaining a comparable energy efficiency and throughput. Moreover, the digital nature of the peripheral circuits (level-shifters, neuron, DTC, and TDC) in our proposed time-domain VMM significantly relaxes the limitation for technology node scaling, as opposed to the analog nature of the peripheral circuits in the amplifier/current-mirror (voltage/current-mode) based approaches.
In the proposed design, the precision is mainly constrained by inherent flash device characteristics such as DIBL and capacitive coupling (and not by peripheral circuitry characteristics such as gain, noise, and their sensitivity to process variation). Considering the extremely small footprint of the flash cells due to 3D integration, the proposed approach can significantly benefit from technology node scaling even with the scaling limitation of the flash cells.

Fig. 9: (a) Memory requirement for various processing steps of a single inference task, and (b) mapping of the network weight kernels into the 3D-NAND memory block (different colors represent different kernels, and dark blue represents empty space) for Inception-v1 [4], ResNet-152 [5], and GNMT-1024 [6].

On the other hand, on the system level quite a few efforts were recently made to exploit the efficiency of MS operators to develop better DNN/RNN processor architectures [46-52]. For example, the ISAAC [46] and PUMA [47] architectures are 2D mesh structures of tiles where each tile contains several small (typically 128×128) ReRAM-based VMM units with their I/O peripheries. In these architectures, one shared memory is implemented in each tile for storing intermediate data and for communication between the VMMs, while communications between the tiles are performed through a shared 2D bus structure. Such a heavily-granular, multi-core design approach aims at increasing the VMM unit utilization, minimizing the data transfer overhead, and maximizing the system throughput via pipelining and parallel processing. However, the data conversion / communication overhead due to the partial VMM operation, the static power consumption and large area overhead of the neurons / DACs / ADCs, and a large control and communication overhead between tiles / VMMs likely limit the performance of such architectures, especially when running relatively complex computational graphs such as those of the Inception [4] and ResNet [5] tasks. In contrast to this prior system-level work, our 3D-aCortex processor architecture is harmonically matched with the proposed 3D-NAND VMM as the core processing unit. It includes a flexible/programmable granular single-bank 3D analog operator and a reconfigurable folded chain of buffers, which allows contingent implementation of various size VMMs and convolution kernels fully in the time/analog domain.
Such a design maximizes the data reuse while minimizing the area overhead of peripheral and control circuitry, as well as the energy overhead of the VMM operation (integration and I/O conversion) and of the control/data movement associated with heavily multi-core designs performing partial VMM operations [46,47]. The main advantages of the proposed architecture are:

• A flexible single-bank design, which results in a very large sharing factor of costly peripheral circuitry such as buffers, DTCs, neurons, TDCs, and programming circuitry, while maintaining the capability of performing various size VMM operations. The large sharing factor of the peripheral circuitry and the high density of 3D-NAND memory result in a remarkable storage efficiency.
• Such a design provides a flexible large VMM operator fully in the time/analog domain, and consequently allows contingent implementation of VMMs of various sizes, fully exploiting the energy efficiency and speed of computation in the time/analog domain, i.e. avoiding the overheads of partial VMM operations.
• The layer-by-layer processing scheme, combined with the single-bank deployment of analog operators, results in relatively simple control circuitry, with low energy/area overhead, while still supporting even complex computational graphs.
• The data reuse in convolution layers is fully preserved via a configurable folded buffer chain design.
• Due to the time-domain approach, zero static power of the computational blocks improves the energy efficiency.

The detailed simulation results for 3D-aCortex, benchmarked on representative RNN/DNN models, have shown a performance significantly higher than all published prior results, including the fully digital [7, 8, 14] and MS [18, 46, 47] systems - especially for mobile/IoT applications, for which the storage and energy efficiencies are the most important metrics (Table 2).
In order to make a fair comparison between 3D-aCortex and other MS approaches, we have performed a highly optimistic rescaling of the published performance metrics to the 55-nm, 4-bit design point. Even with this highly optimistic projection, the baseline and shared-CAP 3D-aCortex provide a ~17× / ~119× improvement of the storage efficiency, and a ~14× / ~13× improvement of the energy efficiency over the ISAAC [46], while maintaining a comparable computational efficiency of 0.58 / 0.2 TOps/(s-mm²). In comparison with PUMA [47], these numbers are, respectively, ~17× / ~119× and ~6× / ~5.5×. These results also show that in comparison with the 2D-aCortex based on 55-nm NOR flash memory [18], the chip footprint of the 3D-aCortex is ~16 / ~7 times smaller, while its energy efficiency is lower only by a factor of ~5.4 / ~5. Moreover, the proposed 3D-aCortex architecture is based on the 3D-NAND flash technology and digital time-domain peripheral circuitry, allowing for its further scaling beyond the 20-nm technology node without performance/precision degradation. This fact promises even more compact and energy-efficient neuromorphic processors based on future, more advanced technology nodes.

Table 2: Performance comparison of 3D-aCortex to the state-of-the-art digital and mixed-signal neuromorphic processor architectures. Except for TPU and UNPU, all performance results are based on simulations.

| Platform | DaDianNao [8] | TPU [7] | UNPU [14] | ISAAC [46] | PUMA [47] | 2D-aCortex [18] | 3D-aCortex & |
| Technology node | 28 nm | 28 nm | 65 nm | 32 nm | 32 nm | 55 nm | 55 nm |
| Approach | digital | digital | digital | ReRAM | ReRAM | 2D-NOR | 3D-NAND |
| Clock (MHz) | 606 | 700 | 200 | 1200 | 1000 | 700 | 1000 |
| Precision (bits) | 16 fixed point | 8 fixed point | 1-16 (4 here) | 16 fixed point | 16 fixed point | 4 fixed point | 4 fixed point |
| Area (mm²) | 88 | 330 | 16 | 85.4 | 90.6 | 292.9 | 18.43 / 41.7 |
| Power (W) | 20.1 | 40 | 297 | 65.8 | 62.5 | 0.039 | 0.151 / 0.126 |
| Throughput (TOps/(s-mm²)) | 0.063 | 0.28 | 0.086 | 0.46 (0.62*) | 0.58 (0.78*) | 0.051 | 0.58 / 0.2 |
| SE (MB/mm²) | 0.2 | off-chip | off-chip | 0.74 (0.25*) | 0.76 (0.257*) | 0.273 | 4.34 / 30.7 |
| EE (TOps/J) | 0.286 | 0.43 | 11.6 | 0.35 (5.14*) | 0.84 (12.09*) | 380.25 | 70.43 / 65 |

* Estimated, highly optimistic performance for 4-bit computing precision and 55-nm technology node implementation.
The performance numbers do not include the overhead of external memory access (weights/intermediate data).
& Baseline / 16× CAP sharing architectures.
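The improvement factors quoted in this comparison can be recomputed directly from the Table 2 entries. The short script below is only a convenience check with values copied from the table (the rescaled, starred numbers are used for ISAAC and PUMA); the dictionary layout and the helper name `ratio` are illustrative choices, not part of the original work.

```python
# Values copied from Table 2: storage efficiency (SE, MB/mm^2),
# energy efficiency (EE, TOps/J), and area (mm^2).
TABLE2 = {
    "ISAAC*": {"se": 0.25, "ee": 5.14},
    "PUMA*": {"se": 0.257, "ee": 12.09},
    "2D-aCortex": {"area": 292.9, "ee": 380.25},
    "3D baseline": {"area": 18.43, "se": 4.34, "ee": 70.43},
    "3D shared-CAP": {"area": 41.7, "se": 30.7, "ee": 65.0},
}


def ratio(numerator, denominator, metric):
    """Ratio of one metric between two platforms of TABLE2."""
    return TABLE2[numerator][metric] / TABLE2[denominator][metric]


for design in ("3D baseline", "3D shared-CAP"):
    for rival in ("ISAAC*", "PUMA*"):
        print(f"{design} vs {rival}: "
              f"SE x{ratio(design, rival, 'se'):.1f}, "
              f"EE x{ratio(design, rival, 'ee'):.1f}")
    # Footprint reduction and energy-efficiency penalty relative to 2D-aCortex.
    print(f"{design} vs 2D-aCortex: "
          f"area /{ratio('2D-aCortex', design, 'area'):.1f}, "
          f"EE /{ratio('2D-aCortex', design, 'ee'):.1f}")
```

Because the published factors are rounded, small differences between the printed ratios and the quoted values are to be expected.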

REFERENCES

[1] M. Mohammadi, A. Al-Fuqaha, S. Sorour, and M. Guizani, “Deep Learning for IoT Big Data and Streaming Analytics: A Survey,” in IEEE Communications Surveys & Tutorials, vol. 20, no. 4, pp. 2923-2960, 2018.
[2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436-444, May 2015.
[3] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” Adv. in Neural Info. Proc. Sys., pp. 1097-1105, 2012.
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, pp. 1-9, 2015.
[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” IEEE Conf. on Comp. Vision and Pat. Rec., pp. 770-778, 2016.
[6] Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, and J. Klingner, “Google's neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
[7] N.P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, and R. Boyle, “In-datacenter performance analysis of a tensor processing unit,” in ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, ON, pp. 1-12, 2017.
[8] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “DaDianNao: A Machine-Learning Supercomputer,” in , Cambridge, pp. 609-622, 2014.
[9] Y.H. Chen, T. Krishna, J.S. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” in IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, 2017.
[10] B. Moons, U. Roel, D. Wim, and V. Marian, “14.5 envision: A 0.26-to-10tops/w subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm fdsoi,” in 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 246-247, 2017.
[12] M. Davies, N. Srinivasa, T.H. Lin, G. Chinya, Y. Cao, S.H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, and Y. Liao, “Loihi: a neuromorphic manycore processor with on-chip learning,” IEEE Micro, 38(1), pp. 82-99, 2018.
[13] P.A. Merolla, J.V. Arthur, R. Alvarez-Icaza, A.S. Cassidy, J. Sawada, F. Akopyan, B.L. Jackson, N. Imam, C. Guo, Y. Nakamura, and B. Brezzo, “A million spiking-neuron integrated circuit with a scalable communication network and interface,” Science, 345(6197), pp. 668-673, 2014.
[14] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.J. Yoo, “UNPU: An Energy-Efficient Deep Neural Network Accelerator with Fully Variable Weight bit Precision,” IEEE Journal of Solid-State Circuits, vol. 54, no. 1, pp. 173-185, 2018.
[15] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” arXiv preprint arXiv:1609.07061, Sep. 2016.
[16] J.L. McKinstry, S.K. Esser, R. Appuswamy, D. Bablani, J.V. Arthur, I.B. Yildiz, and D.S. Modha, “Discovering Low-Precision Networks Close to Full-Precision Networks for Efficient Embedded Inference,” arXiv preprint arXiv:1809.04191, 2018.
[17] C. Xu, J. Yao, Z. Lin, W. Ou, Y. Cao, Z. Wang, and H. Zha, “Alternating Multi-bit Quantization for Recurrent Neural Networks,” arXiv preprint arXiv:1802.00150, 2018.
[18] M. Bavandpour, M.R. Mahmoodi, H. Nili, F.M. Bayat, M. Prezioso, A. Vincent, D.B. Strukov, and K.K. Likharev, “Mixed-signal neuromorphic inference accelerators: Recent results and future prospects,” in Proc. IEDM'18, San Francisco, CA, Dec. 2018.
[19] M. Hu, J.P. Strachan, Z. Li, E.M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J.J. Yang, and R.S. Williams, “Dot-product engine for neuromorphic computing: programming 1T1M crossbar to accelerate matrix-vector multiplication,” in Proc. DAC'16, Austin, TX, pp. 1-6, 2016.
[20] F. Merrikh Bayat, M. Prezioso, B. Chakrabarti, H. Nili, I. Kataeva, and D. Strukov, “Implementation of multilayer perceptron network with highly uniform passive memristive crossbar circuits,” Nature Communications, vol. 9, art. 2331, 2018.
[21] P. Yao, H. Wu, B. Gao, S.B. Eryilmaz, X. Huang, W. Zhang, Q. Zhang, N. Deng, L. Shi, H.S.P. Wong, and H. Qian, “Face classification using electronic synapses,” Nature Communications, 8, p. 15199, 2017.
[22] K.H. Kim, S. Gaba, D. Wheeler, J.M. Cruz-Albrecht, T. Hussain, N. Srinivasa, and W. Lu, “A functional hybrid memristor crossbar-array/CMOS system for data storage and neuromorphic applications,” Nano Letters, 12(1), pp. 389-395, 2011.
[23] G.W. Burr, R.M. Shelby, S. Sidler, C. Di Nolfo, J. Jang, I. Boybat, R.S. Shenoy, P. Narayanan, K. Virwani, E.U. Giacometti, and B.N. Kurdi, “Experimental Demonstration and Tolerancing of a Large-Scale Neural Network (165,000 Synapses), using Phase-Change Memory as the Synaptic Weight Element,” in IEEE International Electron Devices Meeting, San Francisco, CA, pp. 29.5.1-29.5.4, 2014.
[24] I. Boybat, M.L. Gallo, S.R. Nandakumar, T. Moraitis, T. Parnell, T. Tuma, B. Rajendran, Y. Leblebici, A. Sebastian, and E. Eleftheriou, “Neuromorphic computing with multi-memristive synapses,” Nature Communications, vol. 9 (1), art. 2514, 2018.
[25] X. Guo, F.M. Bayat, M. Bavandpour, M. Klachko, M.R. Mahmoodi, M. Prezioso, K.K. Likharev, and D.B. Strukov, “Fast, energy-efficient, robust, and reproducible mixed-signal neuromorphic classifier based on embedded NOR flash memory technology,” in Proc. IEDM'17, San Francisco, CA, Dec. 2017, pp. 6.5.1-6.5.4.
[26] M.R. Mahmoodi and D.B. Strukov, “An ultra low energy internally analog, externally digital vector-matrix multiplier circuit based on NOR flash memory technology,” in Proceedings of the 55th Annual Design Automation Conference, p. 22, ACM, 2018.
[27] M. Bavandpour, M.R. Mahmoodi, and D.B. Strukov, “Energy-Efficient Time-Domain Vector-by-Matrix Multiplier for Neurocomputing and Beyond,” in IEEE Transactions on Circuits and Systems II: Express Briefs, 2019. doi: 10.1109/TCSII.2019.2891688
[28] J. Hasler and H. Marr, “Finding a roadmap to achieve large neuromorphic hardware systems,” Front. Neurosci., vol. 7, art. 118, 2013.
[29] C.R. Schlottmann and P.E. Hasler, “A highly dense, low power, programmable analog vector-matrix multiplier: The FPAA implementation,” IEEE JETCAS, vol. 1, pp. 403-411, 2011.
[30] S. Chakrabartty and G. Cauwenberghs, “Sub-microwatt analog VLSI trainable pattern classifier,” IEEE JSSC, vol. 42, pp. 1169-1179, 2007.
[31] L. Fick, E.C. Manar, S. Skrzyniarz, and D. Fick, Mythic Inc., “System and Methods for Mixed-Signal Computing,” U.S. Patent Application.
[32] K.F. Busch, P. Vorenkamp, and S.W. Bailey, Syntiant Corp., “Systems and Methods for Customizing Neural Networks,” U.S. Patent Application.
[33] C.M. Compagnoni, A. Goda, A.S. Spinelli, P. Feeley, A.L. Lacaita, and A. Visconti, “Reviewing the evolution of the NAND flash technology,” Proc. IEEE, vol. 105, no. 9, pp. 1609-1633, 2017.
[35] K.T. Park, S. Nam, D. Kim, P. Kwak, D. Lee, Y.H. Choi, M.H. Choi, D.H. Kwak, D.H. Kim, M.S. Kim, and H.W. Park, “Three-Dimensional 128 Gb MLC Vertical nand Flash Memory With 24-WL Stacked Layers and 50 MB/s High-Speed Programming,” in IEEE Journal of Solid-State Circuits, vol. 50, no. 1, pp. 204-213, 2015.
[36] C. Kim, D.H. Kim, W. Jeong, H.J. Kim, I.H. Park, H.W. Park, J. Lee, J. Park, Y.L. Ahn, J.Y. Lee, and S.B. Kim, “A 512Gb 3b/cell 64-stacked WL 3D V-NAND flash memory,” in IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, pp. 202-203, 2017.
[37] N. Shibata, K. Kanda, T. Shimizu, J. Nakai, O. Nagao, N. Kobayashi, M. Miakashi, Y. Nagadomi, T. Nakano, T. Kawabe, and T. Shibuya, “A 1.33Tb 4-bit/Cell 3D-Flash Memory on a 96-Word-Line-Layer Technology,” in IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, pp. 210-212, 2019.
[38] S. Sahay and D.B. Strukov, “A Behavioral Compact Model for Static Characteristics of 3D NAND Flash Memory,” in IEEE Electron Device Letters, 2019. doi: 10.1109/LED.2019.2901211
[39] P. Wang, F. Xu, B. Wang, B. Gao, H. Wu, H. Qian, and S. Yu, “Three-Dimensional nand Flash for Vector-Matrix Multiplication,” in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018. doi: 10.1109/TVLSI.2018.2882194
[40] V. Ravinuthula, V. Garg, J.G. Harris, and J.A. Fortes, “Time-mode circuits for analog computation,” Int. J. Circ. Theor. App., vol. 37, pp. 631-659, Jun. 2009.
[41] Q. Wang, H. Tamukoh, and T. Morie, “A Time-domain Analog Weighted-sum Calculation Model for Extremely Low Power VLSI Implementation of Multi-layer Neural Networks,” arXiv preprint arXiv:1810.06819, 2018.
[42] T. Tohara, H. Liang, H. Tanaka, M. Igarashi, S. Samukawa, K. Endo, Y. Takahashi, and T. Morie, “Silicon nanodisk array with a fin field-effect transistor for time-domain weighted sum calculation toward massively parallel spiking neural networks,” APEX, vol. 9, art. 034201, 2016.
[43] D. Resnati, A. Mannara, G. Nicosia, G.M. Paolucci, P. Tessariol, A.S. Spinelli, A.L. Lacaita, and C.M. Compagnoni, “Characterization and Modeling of Temperature Effects in 3-D NAND Flash Arrays - Part I: Polysilicon-Induced Variability,” IEEE Transactions on Electron Devices, vol. 65, no. 8, pp. 3199-3206, Aug. 2018.
[44] G. Malavena, A.L. Lacaita, A.S. Spinelli, and C.M. Compagnoni, “Investigation and Compact Modeling of the Time Dynamics of the GIDL-Assisted Increase of the String Potential in 3-D NAND Flash Arrays,” IEEE Transactions on Electron Devices, vol. 65, no. 7, pp. 2804-2811, July 2018.
[45] N. Muralimanohar, R. Balasubramonian, and N.P. Jouppi, “CACTI 6.0: A Tool to Understand Large Caches,” Technical Report, HP Labs, HPL-2009-85, 2009.
[46] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J.P. Strachan, M. Hu, R.S. Williams, and V. Srikumar, “ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars,” in ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, pp. 14-26, 2016.
[47] A. Ankit, I.E. Hajj, S.R. Chalamalasetti, G. Ndu, M. Foltin, R.S. Williams, P. Faraboschi, J.P. Strachan, K. Roy, and D.S. Milojicic, “PUMA: A programmable ultra-efficient memristor-based accelerator for machine learning inference,” arXiv preprint arXiv:1901.10351, 2019.
[48] X. Liu, M. Mao, B. Liu, B. Li, Y. Wang, H. Jiang, M. Barnell, Q. Wu, J. Yang, H. Li, and Y. Chen, “Harmonica: A Framework of Heterogeneous Computing Systems With Memristor-Based Neuromorphic Computing Accelerators,” in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 63, no. 5, pp. 617-628, May 2016.
[49] L. Song, X. Qian, H. Li, and Y. Chen, “PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning,” in IEEE International Symposium on High Performance Computer Architecture (HPCA), Austin, TX, pp. 541-552, 2017.
[50] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory,” in ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, pp. 27-39, 2016.
[51] M. Imani, M. Samragh, Y. Kim, S. Gupta, F. Koushanfar, and T. Rosing, “RAPIDNN: In-Memory Deep Neural Network Acceleration Framework,” arXiv preprint arXiv:1806.05794, 2018.
[52] P. Srivastava, M. Kang, S.K. Gonugondla, S. Lim, J. Choi, V. Adve, N.S. Kim, and N. Shanbhag, “PROMISE: An End-to-End Design of a Programmable Mixed-Signal Accelerator for Machine-Learning Algorithms,” in