Breaking Barriers: Maximizing Array Utilization for Compute In-Memory Fabrics
Brian Crafton†, Samuel Spetalnick†, Gauthaman Murali, Tushar Krishna, Sung-Kyu Lim, and Arijit Raychowdhury
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA
[email protected], [email protected]
† These authors contributed equally
Abstract—Compute in-memory (CIM) is a promising technique that minimizes data transport, the primary performance bottleneck and energy cost of most data-intensive applications. It has found widespread adoption in accelerating neural networks for machine learning applications. Using a crossbar architecture with emerging non-volatile memories (eNVM) such as dense resistive random access memory (RRAM) or phase change random access memory (PCRAM), various forms of neural networks can be implemented to greatly reduce power and increase on-chip memory capacity. However, compute in-memory faces its own limitations at both the circuit and the device level. Although compute in-memory using the crossbar architecture can greatly reduce data transport, the rigid nature of these large fixed-weight matrices forfeits the flexibility of traditional CMOS- and SRAM-based designs. In this work, we explore the different synchronization barriers that arise from the CIM constraints. Furthermore, we propose a new allocation algorithm and data flow based on input data distributions to maximize utilization and performance for compute in-memory based designs. We demonstrate a 7.47× performance improvement over a naive allocation method for CIM accelerators on ResNet18.

I. INTRODUCTION
Modern computing systems are heavily dependent on the capacity and access time of expensive memory banks due to the ever-increasing performance gap between main memory and logic. Furthermore, the cost of moving data has become higher than the cost of operating on it [1]; not only has memory become the fundamental bottleneck of computing, but reading and transporting the data has become more expensive than the operations we seek to perform. The popularization of data-intensive applications like machine learning and artificial intelligence has further exacerbated this problem. To address these issues, new architectures based on traditional CMOS attempt to minimize the transport of data by optimizing for data reuse [1] and adopting constraints inspired by the brain [2]. While these techniques yield strong results, they still face the fundamental technological limitations of CMOS.

Fortunately, a new class of embedded non-volatile memory (eNVM) is positioned to minimize data transport by performing compute in-memory. In-memory computing seeks to perform matrix multiplication ($\vec{y} = W\vec{x}$) in a crossbar structure using Ohm's law and the non-volatile conductance state provided by the non-volatile memory. Using this technique, each weight of the matrix ($W_{ij}$) is programmed as a conductance to a cell, and each value of the vector ($\vec{x}_i$) is converted to a voltage and applied to the rows of the memory crossbar. By Ohm's law, the current through each cell is proportional to the product of the programmed conductance ($W_{ij}$) and the applied voltage ($\vec{x}_i$). By Kirchhoff's current law (KCL), the resulting currents summed along the columns of the crossbar are proportional to the matrix-vector product ($\vec{y}$). Under this procedure, the only data transport required for matrix multiplication is the feature vector ($\vec{x}$) from memory and the result ($\vec{y}$) to memory. Therefore, in-memory computing eliminates the majority of data transfer and thus the energy cost of data-intensive operations.

Although compute in-memory using the crossbar architecture can greatly reduce data transport, the rigid nature of these large fixed-weight matrices forfeits the flexibility of traditional CMOS- and SRAM-based designs. Given that eNVM has high density but unfortunately high write energy compared to traditional SRAM, CIM-based inference-only designs avoid writing to the eNVM cells once they are programmed. While this is advantageous for data transport and energy efficiency, it means each CIM processing element (PE) can only perform the operations it holds the weights for. If there is an unbalanced workload where some PEs' operations take longer than others, we cannot simply re-allocate these operations to other PEs. Therefore, we must use synchronization barriers for all PEs so that a distributed matrix multiplication completes before another is started. In contrast, every CMOS- and SRAM-based PE is computationally identical and can perform any operation in the DNN graph.

A fundamental problem in CIM-based designs is therefore array utilization, the percentage of time an array is in use. Recent large-scale CIM designs [3] use weight duplication and layer pipelining techniques to maximize performance. We describe these techniques in detail in Section II. While impressive performance is achieved, these techniques only perform well when the workloads are deterministic.
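Before moving on, we make the crossbar computation above concrete. The following minimal numpy sketch (ours, not the paper's simulator) models an ideal crossbar: conductances stand in for $W_{ij}$, applied voltages for $\vec{x}_i$, and the column-wise current sums recover $\vec{y} = W\vec{x}$; device non-idealities are ignored.

```python
import numpy as np

def crossbar_mvm(G, v):
    """Ideal crossbar: Ohm's law gives the per-cell current G[i, j] * v[i];
    Kirchhoff's current law sums those currents down each column."""
    cell_currents = G * v[:, None]    # element-wise Ohm's law, shape (rows, cols)
    return cell_currents.sum(axis=0)  # KCL: column sums are proportional to y

G = np.random.rand(128, 16)           # programmed conductances (the weights)
v = np.random.rand(128)               # input features applied as row voltages
assert np.allclose(crossbar_mvm(G, v), G.T @ v)
```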
Circuit-level techniques like zero-skipping greatly increase performance, but create non-deterministic workloads that compromise array utilization. In this work we identify and profile these new challenges using a simple simulator framework. We then propose a novel algorithm which makes use of input statistics to find optimal array allocation policies that maximize utilization and break synchronization barriers. Furthermore, we introduce a new data flow that generalizes CIM arrays to maximize their utilization. We run our experiments on ImageNet using ResNet18 and on CIFAR10 using VGG11. Although we apply our techniques to deep learning, we claim that they can be extended to any compute in-memory application. We note that a combination of these strategies yields a 7.47× improvement in performance over a baseline naive array allocation.

Fig. 1. Typical compute in-memory PE (processing engine) and sub-array (SA) architecture. (A) N×N sub-array PE with L1 cache and psum buffer. In this work N is 8. (B) Typical sub-array design with dual word line drivers, ADCs, shift and add units, and an adder tree.

II. BACKGROUND AND MOTIVATION
Compute in-memory systems use binary or multi-level cells as weights to perform matrix multiplication in memory. In this work we focus our attention on binary cells, given that the current state of the art in eNVM [4] already struggles with variance, making multi-level cells even more difficult to utilize. However, the same techniques demonstrated in this work can easily be applied to multi-level cells as well. Given binary cells, we must use 8 adjacent cells to form a single 8-bit weight, like those shown in the columns of Figure 1. The 8-bit vector inputs to this array are shifted in 1 bit at a time, and the resulting binary product collected at the ADCs is shifted left by the same amount the inputs are shifted right. In this way, each array is able to perform an 8-bit matrix multiplication.

There are two common techniques for performing compute in memory. The first technique, which we call baseline, simply reads as many rows as the ADC precision allows (e.g., for a 3-bit ADC, we read 8 rows simultaneously). The second technique is commonly called zero-skipping [5], where only rows with '1's are read. This technique exploits sparsity in the input features or activations (for neural networks). Zero-skipping is faster than the baseline technique because in most cases it processes more total rows per cycle. In Figure 2, we provide an example case for zero-skipping where 8 total rows are read using a 2-bit ADC. Baseline (2A) requires 2 cycles since it targets four consecutive rows at a time. Zero-skipping (2B) finishes all 8 rows in a single cycle because we only consider the '1's in the input vector. There are few reasons not to perform zero-skipping, unless there is limited input data bandwidth or the eNVM has high variance and accumulates too many errors.
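The cycle-count difference between the two read schedules can be sketched as follows. This is an illustrative simplification (ours), counting row reads for one bit-plane of the input, with `adc_rows` set to 8 per the 3-bit ADC convention above.

```python
import math

def baseline_cycles(word_line_bits, adc_rows=8):
    # read every group of adc_rows consecutive rows, '1's or not
    return math.ceil(len(word_line_bits) / adc_rows)

def zero_skip_cycles(word_line_bits, adc_rows=8):
    # only rows whose word line carries a '1' consume ADC range,
    # so each read covers the next adc_rows enabled rows
    return math.ceil(sum(word_line_bits) / adc_rows)

bits = [1, 0, 0, 1, 0, 0, 1, 0] * 16      # one bit-plane across 128 rows
print(baseline_cycles(bits))               # 16 reads, regardless of sparsity
print(zero_skip_cycles(bits))              # 6 reads (48 ones / 8 per read)
```

With an all-zero bit-plane the zero-skipping count is 0, matching the intuition that disabled rows need no read at all.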
Fig. 2. Simplified breakdown of ADC reads in baseline and zero-skipping with 2-bit ADC precision. (A) Baseline targets four consecutive rows at a time since the 2-bit ADCs are capable of distinguishing 4 states. (B) Zero-skipping targets the next 4 rows where the word line is enabled. This way we can read more rows and not overflow our ADC.

By encapsulating the array, ADCs, and shift-and-add logic, a matrix multiplication engine can be created. Using these arrays as building blocks, prior work has implemented CNNs (convolutional neural networks), where a group of arrays implements a larger matrix multiplication. Despite performing more complex operations, the core operations of CNNs can be converted into matrix multiplication. In Figure 1 we illustrate this idea, showing how a group of arrays is tiled together to form a PE. In Figure 3 we further depict how these arrays can be pieced together to form a larger matrix. In this example, both input feature maps and filters are vectorized, with the filters forming the columns of a matrix. The vectorized feature maps are input to the crossbar to perform matrix multiplication, and the results are the output feature maps for this layer of the CNN.

Given the high density of these PEs, hundreds or thousands of them can be tiled in the same area used by modern ICs. Although similar in concept, CIM-based DNN accelerators have numerous differences from traditional CMOS-based designs that introduce challenges in maximizing performance. First, a CIM-based PE has fixed weights that cannot be reprogrammed due to the high energy cost of writing eNVM. Traditional CMOS-based PEs are generalized compute units that can operate on any input data, since they do not contain fixed weights. Thus a fundamental issue in CIM-based accelerators is array utilization. Several works have addressed this issue, introducing ideas such as weight duplication and layer pipelining.
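The vectorization used in Figure 3 is the standard im2col transformation. A minimal sketch (ours, assuming unit stride, no padding, and hypothetical shapes) is:

```python
import numpy as np

def conv_as_matmul(ifm, filters):
    """ifm: (H, W, C) input feature map; filters: (K, K, C, F)."""
    K, _, C, F = filters.shape
    H, W, _ = ifm.shape
    # each K x K x C input patch becomes one row vector
    patches = np.stack([ifm[i:i + K, j:j + K, :].reshape(-1)
                        for i in range(H - K + 1)
                        for j in range(W - K + 1)])   # (num_patches, K*K*C)
    W_mat = filters.reshape(K * K * C, F)             # vectorized filters as columns
    return patches @ W_mat                            # (num_patches, F)
```

Each row of the product is one output pixel across all F output feature maps, matching the OFM view described above.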
Weight duplication [3] is used to maximize throughput in large-scale CIM accelerators where the amount of on-chip memory exceeds the number of weights in the model. In [6], 24,960 arrays are used for a total on-chip memory capacity of nearly 104 MB (2b cells), while occupying only a small die area. Using this enormous on-chip memory capacity, they not only fit ResNet [7] but duplicate shallow layers up to 32×. When weights are duplicated, the input data is divided equally amongst the duplicate arrays so they can process in parallel. We illustrate this idea for a convolutional layer in Figure 3. The input patches from the input feature maps (IFMs) are divided into groups based on the number of duplicates, and then mapped to each duplicate.

Fig. 3. Convolutional layer mapped to a CIM array. Both input feature maps (IFMs) and filters are vectorized, with the filters forming the columns of a matrix. The vectorized feature maps are applied to the crossbar to perform matrix multiplication, where the results are output feature maps (OFMs).

Layer pipelining [3] is used to maximize throughput in eNVM CIM accelerators, where arrays are not re-programmed due to the large amount of on-chip memory and high write energy. At the same time, most modern neural networks contain 20 or more layers that must be processed sequentially. Given that most designs use 128 × 128 arrays, it becomes infeasible to partition arrays such that they can be used for each layer without being re-programmed. This implies that the majority of PEs would sit idle waiting for their layer to be processed. To solve this problem, images are pipelined through the network to keep all arrays utilized. Although this compromises single-example latency, it maintains maximum throughput. A sketch of the duplication rule assumed in prior work appears below.
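This greedy paraphrase (ours; duplicate counts in [6] are derived analytically) spends spare arrays on the layer with the most MACs per existing copy, which assumes every array runs at the same rate — exactly the assumption that zero-skipping breaks.

```python
def weight_based_duplicates(layer_macs, arrays_per_copy, total_arrays):
    """Duplicate-count heuristic assuming uniform array speed (prior work)."""
    dup = [1] * len(layer_macs)              # one copy of each layer is mandatory
    free = total_arrays - sum(arrays_per_copy)
    while free > 0:
        # duplicate the layer with the most work per existing copy
        i = max(range(len(layer_macs)), key=lambda k: layer_macs[k] / dup[k])
        if arrays_per_copy[i] > free:
            break                            # cannot fit another copy of the bottleneck
        dup[i] += 1
        free -= arrays_per_copy[i]
    return dup
```

For a MAC-heavy shallow layer this qualitatively reproduces the up-to-32× duplication reported above.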
III. BLOCK-WISE ARRAY ALLOCATION

In the previous section, we discussed several techniques that are used in CIM accelerators to increase throughput, but each introduces its own synchronization barrier that limits array-level utilization. In this work, we identify two of these barriers and propose a solution to mitigate them. The two techniques that create these barriers are weight duplication and layer pipelining. In previous work these barriers were not a problem because array performance was deterministic. Zero-skipping instigates these barriers because it introduces non-deterministic computation time for each array. Zero-skipping by itself only improves the performance of a CIM accelerator, since each array performs equal to or faster than the baseline algorithm. However, because the number of ones in the input vector of a CIM operation follows a random distribution, the time to finish a dot product is non-deterministic. This means that the several arrays performing parts of a larger matrix multiplication must be synchronized to the slowest-performing array. As the size of the operation (and the number of arrays) increases, more stalls occur. In the following section, we explore the implications of zero-skipping at the architectural level; the short experiment below illustrates the effect.
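As a back-of-the-envelope illustration (ours, using an arbitrary Gaussian latency model rather than the paper's measured distributions): the barrier penalty is the gap between the mean and the maximum of the per-array completion times, and it widens as more arrays share a barrier.

```python
import random

def barrier_overhead(n_arrays, trials=10000):
    """Ratio of barrier (max) latency to mean latency across arrays."""
    mean_t, max_t = 0.0, 0.0
    for _ in range(trials):
        times = [random.gauss(100, 15) for _ in range(n_arrays)]
        mean_t += sum(times) / n_arrays
        max_t += max(times)
    return max_t / mean_t            # > 1: relative stall factor

for n in (2, 8, 32, 128):
    print(n, round(barrier_overhead(n), 3))   # overhead grows with n
```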
A. Identifying Synchronization Barriers
The non-determinism introduced by zero-skipping induces the need for synchronization barriers. A synchronization barrier is required when a group of arrays processing a distributed workload finish at different times, but must be synchronized before starting another task. The first barrier occurs at the layer level and is a result of using layer pipelining. When the arrays are distributed to each layer, we attempt to divide them evenly so that all layers finish at the same time. If any layer consistently performs faster than the others, it will have to stall because layers downstream will not be able to buffer its outputs. Previous work [6] allocated arrays to layers based on the number of duplicates required such that all layers in the pipeline complete their workload at the same time, and thus sustain full utilization. This allocation method works under the assumption that all arrays perform at the same rate and that we can choose the number of arrays on chip. However, as [5] points out, neither of these assumptions holds in a realistic design. Prior works [3], [6] assume 128 cells can be read at once using 5- and 8-bit ADCs. Although feasible in theory, we note that such a design would yield very high error given that state-of-the-art devices have 5% device-to-device variance [4], and thus at most 8 rows (3-bit) can be read at once. Such a design also yields very poor memory density, since large (5-8 bit) ADCs occupy many times the area of the eNVM itself. Instead, columns must be processed in batches using zero-skipping, where current summation is used for 8 rows at a time and intermediate results are stored and accumulated using existing digital logic in the array.

When zero-skipping is used, each array performs at a non-deterministic speed that follows the distribution of the input data it receives. In Figure 4, we plot the average time for an array to perform a 128 × 16 matrix multiplication versus the percentage of '1's in all the 8-bit input features for the 20 convolutional layers in ResNet18. To compute the percentage of '1's for a layer, we average the 8 bits of all 8-bit input features together. For example, a 1000-entry 8-bit input vector contains 8000 bits, and we average over all 8000 bits to compute this percentage. From Figure 4, we infer a linear relationship between the percentage of '1's in the input features to a layer and the expected number of cycles to perform the matrix multiplication.

Fig. 4. Cycles per array versus the percentage of '1's in all 8-bit input features. Each point represents the average percentage for one of the 20 layers in ResNet18.

Naturally, we can use this information to better allocate duplicates to each layer in our design. We approach this problem by quantifying the total number of multiply-and-accumulate (MAC) operations in each layer and the average number of MAC operations per cycle an array can perform. In prior works, performance per array is constant since each array takes the same number of cycles to perform a matrix multiplication. Therefore, arrays are allocated to each layer based only on the total MACs per layer. When zero-skipping is introduced and performance per array is not constant, this allocation method fails to allocate evenly. To achieve equal utilization, we can instead allocate arrays to each layer based on the expected number of cycles it will take to finish without any duplicate arrays. We can compute the expected number of cycles a layer will take to finish by dividing the total MACs in the layer by the average per-array performance in that layer; a sketch of this computation follows.
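In the sketch below (ours), the linear-fit coefficients are hypothetical placeholders for the Figure 4 trend, not fitted values from the paper.

```python
def expected_cycles(total_macs, ones_frac, rate_at_zero=16.0, rate_slope=-12.0):
    # hypothetical linear model of Fig. 4: MACs/cycle falls as '1' density rises
    macs_per_cycle = rate_at_zero + rate_slope * ones_frac
    return total_macs / macs_per_cycle

def performance_based_alloc(layer_macs, layer_ones_frac, total_arrays):
    """Allocate arrays in proportion to each layer's expected latency."""
    cycles = [expected_cycles(m, p) for m, p in zip(layer_macs, layer_ones_frac)]
    total = sum(cycles)
    # rounding remainders are left unhandled for brevity
    return [max(1, round(total_arrays * c / total)) for c in cycles]
```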
We call this allocation method performance-based allocation, whereas allocation that assumes all arrays perform evenly is weight-based allocation.

While this technique ensures that all our layers will be equally utilized, it does not ensure that the arrays inside each layer will be equally utilized. Each layer in our DNN (convolutional or fully connected) is implemented as a matrix consisting of eNVM arrays. We visualize this idea in Figure 5, where a 3 × 3 × 128 × 128 filter is mapped to 72 arrays arranged in a 9 × 8 grid. In each of the 9 rows, all 8 arrays share the same input data and, consequently, the same word lines. This implies that all 8 arrays will operate at the same speed and form our minimal deterministic compute unit, which we call a block. Because the 9 different rows do not share the same input vectors, they will operate at different speeds. If some arrays receive fewer '1's than other arrays, they will sit idle waiting for the arrays that receive more '1's to finish. In Figure 6, we plot the average cycle time of the arrays in each block of layers 10 and 15 (ResNet18) versus the percentage of '1's they receive. Layer 10 is a 3 × 3 × 128 × 128 filter (Figure 5) that contains 9 different blocks, and layer 15 is a 3 × 3 × 256 × 256 filter that contains 18 different blocks. Just as before, we observe a linear relationship between cycle time and the percentage of '1's. Since layer 15 contains more blocks, it is more susceptible to longer delays, because the expected slowest block's cycle time increases with the number of arrays. In this figure, we observe a 12% and 27% difference in cycle time for layers 10 and 15, respectively, which motivates a better allocation technique to prevent significant idle time.

Fig. 5. The 3 × 3 × 128 × 128 filter used in layer 10 of ResNet18 converted into a matrix with annotated blocks. This filter requires 72 128 × 128 arrays to store, arranged in a 9 × 8 grid.

Fig. 6. Cycles per array versus the percentage of '1's in all 8-bit input features. The blue crosses represent the average percentage for 1 of the 18 blocks in layer 15 of ResNet18. The black ×s represent 1 of the 9 blocks in layer 10.
B. Optimizing Array Allocation

Finding the optimal allocation policy for blocks is more difficult. We cannot add redundant blocks to the same layer, because each layer uses each weight only once per operation. Instead, we adopt a new grouping strategy for arrays: rather than duplicating layers of arrays, we duplicate blocks of arrays. To find the optimal array allocation policy, we propose a linear-time (O(N) complexity) solution. This is especially important for larger networks like ResNet18, where there are 247 blocks and finding an optimal solution could otherwise be quite difficult.

With this new grouping strategy, we can allocate using the same technique as before. First, we gather an approximation of the average MACs per cycle for each block of arrays. We can do this in two ways. The first option is running a cycle-accurate simulator on some example data to get a very accurate approximation. The second option is to profile the distribution of '1's in the activations gathered from a large set of examples run on a GPU. Once we have an approximation of the MACs per cycle of each block, we can compute the expected number of cycles each block will take to perform its partial dot product. Once we have cycle approximations for each block, we begin allocating arrays to the blocks. While we have free (unallocated) arrays, we loop through and allocate arrays to the block with the highest expected latency. Once we run out of arrays, or the number of arrays left over is not enough to allocate to the slowest block, we have found the optimal allocation. We call this allocation method block-wise, whereas allocation based on the layer is layer-wise. A sketch of this greedy loop follows.
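The sketch below (ours) uses a max-heap keyed on each block's effective latency (expected cycles divided by its duplicate count) for clarity; the paper's linear-time formulation makes the same greedy choice. Block sizes and cycle counts are assumed inputs.

```python
import heapq

def block_wise_alloc(block_cycles, block_sizes, free_arrays):
    """Greedily duplicate the currently slowest block until arrays run out."""
    dup = [1] * len(block_cycles)          # one copy of every block is mandatory
    heap = [(-c, i) for i, c in enumerate(block_cycles)]
    heapq.heapify(heap)
    while heap:
        _, i = heapq.heappop(heap)         # currently slowest block
        if block_sizes[i] > free_arrays:
            break                          # cannot duplicate the bottleneck: done
        dup[i] += 1
        free_arrays -= block_sizes[i]
        heapq.heappush(heap, (-block_cycles[i] / dup[i], i))
    return dup
```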
C. Block-wise Data Flow

To make use of our new allocation policy, a new data flow strategy is required. Since arrays from the same layer are no longer grouped together, we treat blocks as generalized compute units rather than binding them to a specific duplicate. Therefore, we no longer stall for the slowest block in a layer, but rather just send work to the next available block. This means that the same blocks will no longer be working together on the same input data, and thus will not be part of the same gather-and-accumulate procedure. As a result, a new routing and scheduling policy is required, because blocks will not always send their partial sums to the same accumulator for every input feature map. To implement this idea, we include output feature destination addresses in the packet containing the data when sending input features to each block. Upon completing a partial dot product, a block sends its computed partial sums to the designated accumulator and requests additional work from the memory controller. A sketch of the packetized dispatch follows.
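Field and function names below are illustrative stand-ins (ours), not the implemented packet format; the point is that each input carries the address of the accumulator its partial sums must be routed to.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class WorkPacket:
    features: list       # vectorized 8-bit input patch
    ofm_addr: int        # destination accumulator for the partial sums

def dispatch(packets, idle_blocks):
    """Send each packet to the next available duplicate, not a fixed one."""
    packets, idle_blocks = deque(packets), deque(idle_blocks)
    schedule = []
    while packets and idle_blocks:
        pkt, blk = packets.popleft(), idle_blocks.popleft()
        # the block later routes its partial sums to pkt.ofm_addr and
        # re-enters idle_blocks by requesting more work
        schedule.append((blk, pkt))
    return schedule
```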
IV. CIM-BASED ARCHITECTURE
Although our allocation policy will work for any general CIM-based accelerator, we adopt an architecture similar to previous work [3], [6]. Our basic processing element (PE) contains 64 128 × 128 arrays. We choose 64 arrays because it provides each block with sufficient network bandwidth and SRAM capacity, while maintaining good SRAM density and low interconnect overhead. Our input data, weights, and activations are all 8 bits. Each array has one 3-bit ADC for every 8 columns, where a single column is pitch-matched with a comparator. We choose 3-bit because state-of-the-art devices [4] have 5% variance, and 3 bits is the maximum precision that can be read with no error. We shift in one bit from each of the 128 inputs at a time, which takes 8 cycles. In the best-case scenario, we process all 128 rows at the same time. In the worst-case scenario, it takes 16 reads per bit since we enable every single row. Therefore, each array takes anywhere from 64 to 1024 cycles to perform a 128 × 16 dot product. In all designs we consider, we use the same 64-array PE and simply increase the PE count per design.

The activation inputs to the RRAM sub-arrays are stored in on-chip SRAM, while the input images are read in from external DRAM. Matrix multiplication is performed by the PEs, while custom vector units are used to perform vector-wise accumulation, bias addition, quantization, and ReLU.
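One consistent reading of these bounds (our interpretation of the numbers above, assuming the single 3-bit ADC serving each group of 8 columns converts them serially) is the following worked arithmetic:

```python
def array_cycles(row_reads_per_bitplane, bitplanes=8, conversions_per_read=8):
    # total = bit-planes x row-group reads per bit-plane x serial ADC conversions
    return bitplanes * row_reads_per_bitplane * conversions_per_read

print(array_cycles(1))    # best case: all enabled rows fit in one read -> 64
print(array_cycles(16))   # worst case: 128 rows / 8 rows per read     -> 1024
```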
We use an N × N mesh network for communication between PEs, memory, and vector units, shown in Figure 7. Since blocks vary in size and no block contains all 64 sub-arrays, we have to partition each PE to contain several blocks. This configuration implies that the different blocks share the same virtualized input and output ports. As discussed in Section III, input and output vectors are packetized to include destination information. Each block in the PE is given an id that is used to route packets to and from it. Upon completing a partial dot product, a block sends its partial sums to the vector units, where they are accumulated and where activation functions and quantization are applied.

Fig. 7. Block-wise network architecture with 1 router (R) per PE. All input features are routed from the global buffer to the PEs. All partial sums are routed from PE to vector unit (V), and from vector unit to the output feature buffer.

Fig. 8. Inference performance for ResNet18 and VGG11 by algorithm and design size, assuming a 100 MHz clock. For ResNet18, block-wise allocation sustains an 8.83×, 7.47×, and 1.29× speedup over baseline (no zero-skipping), weight-based, and performance-based layer-wise allocation. For VGG11, block-wise allocation sustains a 7.04×, 3.50×, and 1.19× speedup.

V. RESULTS
To benchmark block-wise allocation, we compare with several other techniques: weight-based allocation, performance-based layer-wise allocation, and the baseline algorithm, which does not use zero-skipping. We empirically evaluate performance and array utilization for these techniques on ImageNet using ResNet18 and on CIFAR10 using VGG11. We run these techniques in a custom simulation framework designed to evaluate the performance and power of compute in-memory using standard CMOS and RRAM models from [8]. In this work we focus on performance evaluations; however, higher array utilization will also result in less leakage power and improved energy efficiency.

Our simulator performs cycle-accurate implementations of convolutional and fully connected layers. It is based in Python, but runs array-level operations in C for faster evaluation. We model components of the design in an object-oriented fashion, iterating through all components in all PEs each cycle. We embed performance counters in our ADC and sub-array objects to track metrics like stalls so we can calculate utilization. As input, the simulator takes the network weights, input images, PE-level configuration, and chip-level configuration. The PE-level configuration includes details like the precision of each ADC and the size of the sub-arrays. The chip-level configuration contains the number of PEs and details about array allocation and mapping. As output, the simulator produces a table with all desired performance counters and all intermediate layer activations, which are verified against a TensorFlow implementation for correctness.

To show how our algorithm scales with the size of the design, we have evaluated the different allocation algorithms on several designs with increasing numbers of PEs. In Figure 8, we plot performance versus the number of PEs in the design for both ResNet18 and VGG11. For ResNet18, we begin at 86 PEs since this contains the minimum number of arrays (5472) required to store ResNet18. At 86 PEs, all algorithms yield the same result since no duplication can be done and weights are simply allocated to store ResNet18. From there, we increase the design size by powers of 2. Block-wise allocation performs the best, achieving a 29% improvement over layer-wise allocation, and 7.47× and 8.83× improvements over the weight-based and baseline (no zero-skipping) algorithms, respectively (Figure 8). We follow the same procedure for VGG11; however, we observe that block-wise allocation yields less of a performance advantage. This is because VGG11 has roughly half the layers of ResNet18. It is more difficult to allocate evenly amongst a deeper network, and therefore block-wise allocation yields better results on deeper networks.

To better understand these large performance improvements, it is useful to analyze array utilization. In Figure 9, we visualize the layer-wise utilization of the 20 convolutional layers of ResNet18 under the different techniques. It is clear that block-wise allocation sustains the highest array utilization across nearly all layers of the network, easily outperforming the other techniques. Weight-based allocation performs very poorly because of the very different speeds of each layer and block, shown in Figures 4 and 6. We do not plot the baseline algorithm because it has different array-level performance given that zero-skipping is not used.

Fig. 9. Array utilization by layer for ResNet18 on ImageNet. Baseline is not shown because zero-skipping is not used.
VI. CONCLUSION

In this paper we demonstrate the efficacy of a new technique and data flow that improve array utilization in CIM accelerators. Given that the write energy of eNVM is high, CIM arrays contain fixed weights, unlike CMOS PEs, which can perform any operation in a DNN. Thus array utilization becomes a key challenge, since only some arrays can perform particular operations. By profiling input statistics and relaxing our data flow, we can allocate arrays to maximize utilization and, as a result, performance. The proposed allocation algorithm and data flow perform 7.47× better than naive allocation with a layer-wise dataflow.

VII. ACKNOWLEDGEMENT
This work was funded by the U.S. Department of Defense's Multidisciplinary University Research Initiatives (MURI) Program under grant number FOA: N00014-16-R-FO05, and by the Semiconductor Research Corporation under the Center for Brain-Inspired Computing (C-BRIC) and Qualcomm.

REFERENCES
[1] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, 2017.
[2] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, et al., "Loihi: A neuromorphic manycore processor with on-chip learning," IEEE Micro, vol. 38, no. 1, pp. 82-99, 2018.
[3] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 14-26, 2016.
[4] J. Wu, Y. Chen, W. Khwa, S. Yu, T. Wang, J. Tseng, Y. Chih, and C. H. Diaz, "A 40nm low-power logic compatible phase change memory technology," in IEEE International Electron Devices Meeting (IEDM), pp. 27-6, IEEE, 2018.
[5] T.-H. Yang, H.-Y. Cheng, C.-L. Yang, I.-C. Tseng, H.-W. Hu, H.-S. Chang, and H.-P. Li, "Sparse ReRAM engine: Joint exploration of activation and weight sparsity in compressed neural networks," in Proceedings of the 46th International Symposium on Computer Architecture, pp. 236-249, 2019.
[6] X. Peng, R. Liu, and S. Yu, "Optimizing weight mapping and data flow for convolutional neural networks on processing-in-memory architectures," IEEE Transactions on Circuits and Systems I: Regular Papers, 2019.
[7] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[8] P.-Y. Chen, X. Peng, and S. Yu, "NeuroSim: A circuit-level macro model for benchmarking neuro-inspired architectures in online learning," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.