Xcel-RAM: Accelerating Binary Neural Networks in High-Throughput SRAM Compute Arrays
Amogh Agrawal*, Akhilesh Jaiswal*, Deboleena Roy, Bing Han, Gopalakrishnan Srinivasan, Aayush Ankit, Kaushik Roy, Fellow, IEEE
The authors are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, USA. (*These authors contributed equally.)
Abstract—Deep neural networks are a biologically-inspired class of algorithms that have recently demonstrated state-of-the-art accuracy in large-scale classification and recognition tasks. Hardware acceleration of deep networks is of paramount importance to ensure their ubiquitous presence in future computing platforms. Indeed, a major landmark that enables efficient hardware accelerators for deep networks is the recent advance from the machine learning community demonstrating the viability of aggressively scaled deep binary networks. In this paper, we demonstrate how deep binary networks can be accelerated in modified von-Neumann machines by enabling binary convolutions within the SRAM array. In general, a binary convolution consists of a bit-wise XNOR followed by a population count (popcount). We present two proposals: one based on a charge-sharing approach to perform vector XNORs and an approximate popcount, and another based on bit-wise XNORs followed by a digital bit-tree adder for an accurate popcount. We highlight the trade-offs in terms of circuit complexity, speed-up and classification accuracy for both approaches. Key techniques presented in this manuscript are the use of a low-precision, low-overhead ADC to achieve a fairly accurate popcount for the charge-sharing scheme, and the sectioning of the SRAM array by adding switches onto the read-bitlines, thereby achieving improved parallelism. Our results on a benchmark image classification dataset, CIFAR-10, with a binarized neural network architecture show energy improvements of 6.1× and 2.3× for the two proposals, compared to conventional SRAM banks. In terms of latency, improvements of 15.8× and 8.1× were achieved for the two respective proposals.

Index Terms—In-memory computing, SRAM, binary convolution, binary neural networks, deep CNNs.
I. INTRODUCTION
Deep convolutional neural networks (CNNs) have been established as the state-of-the-art for recognition and classification tasks [1], [2], often surpassing human capabilities [3]–[5]. Most popular networks that won the ImageNet [6] challenge, such as AlexNet [7], GoogLeNet [8], ResNet [4], etc., are based on deep CNNs. However, hardware running these networks consumes large amounts of energy, in fact, orders of magnitude more than the human brain [9]. This immense energy gap stems from the underlying architecture of the current state-of-the-art hardware implementations, which are variants of the von-Neumann machine [10]. They contain physically separate computation and memory blocks, connected via a system bus. Although this architecture has worked wonders for general-purpose computing tasks, when it comes to deep CNNs and data-intensive applications in
general, frequent data transfers between the memory and the computation unit become a bottleneck, given the limited bandwidth of the bus. Moreover, since each transaction is expensive, a large power penalty is incurred per memory access.

Recent developments in the neural network community have identified these problems and have produced simpler, memory-friendly algorithms. Binary neural networks [11]–[13] and XNOR-nets [14] have been developed recently and have shown large potential. The idea is to reduce the precision of input activations and the network weights to a single bit. This immensely simplifies the computations to Boolean bit-wise operations, with only minimal degradation in the state-of-the-art accuracies. Since convolution is the most power-hungry operation in neural networks, it is reduced to a bit-wise XNOR followed by a population count (popcount) of the XNORed output. This opens pathways for adopting new, simplified binary in-memory computing paradigms for accelerating neural networks. As shown in [15]–[17], bit-wise Boolean operations, including XORs or XNORs, as well as non-Boolean vector-matrix dot-products, can easily be incorporated within standard SRAM arrays. Such SRAM-based in-memory computations open up new possibilities of augmenting the existing memory arrays with compute capabilities. Thereby, one can imagine a modified von-Neumann machine which caters well to general-purpose computing tasks while also acting as an on-demand compute accelerator.

To that effect, we propose novel techniques to compute in-memory binary convolutions as an added functionality of standard 10-transistor (10T) SRAM bitcells. In the first approach (Proposal-A), we use charge sharing between the parasitic capacitances inherently present in the SRAM array to perform the XNOR and popcount operations involved in the binary convolution. Although this approach is digital, with binary weights and binary inputs stored in the memory array, the popcount is generated as an analog voltage on the source-lines. In order to sense this analog voltage, we propose a low-overhead, low-precision ADC (owing to area and energy constraints in the memory array). Another key highlight of this approach is that we employ a sectioned-SRAM by dividing memory sub-banks into smaller sections. With n sections in a particular sub-bank, we can accomplish n binary convolutions in parallel. This is important because obtaining the popcount output for large kernels is non-trivial. For large networks, the kernel sizes in deeper layers are typically too large to be stored in a single row of a given memory sub-array. As such, popcount for larger binary networks inevitably requires a scheme to estimate the partial popcount from each row, which can then be summed up from different sub-arrays to get the final popcount. However, the low-overhead, low-precision ADC induces approximations in the popcount output, which results in overall system accuracy degradation. Thus, we propose another approach (Proposal-B), where we alter the peripheral circuitry of the SRAM array and enable two word-lines simultaneously. This approach, although not as energy/throughput efficient as Proposal-A, generates accurate XNOR and popcount operations, thereby not affecting the overall system accuracy.
The proposed circuit techniques in Proposal-A and Proposal-B allow us to process multiple kernels at once, thereby improving the overall system throughput and making the proposal suitable for a range of deep binary networks.

There have been several earlier works to develop hardware platforms that can accelerate CNN algorithms. Hardware architectures that use highly sub-banked memory units feeding an array of multiply-accumulate processing elements have been presented in many works, including [18]–[20]. A key drawback of such distributed processing-array based customized designs is that they make the underlying computing hardware application-specific, and in many cases specific to neural network accelerators. Further, emerging technologies like memristive crossbars have been employed in many proposals as convolution accelerators geared towards neural networks in general [21], [22]. The very use of memristors as a convolution engine renders such platforms unsuitable for general-purpose computing due to various challenges faced by state-of-the-art memristive technologies. These include the limited endurance of memristive devices, the multi-cycle write-verify programming scheme [23] and the drift in programmed resistance state with aging [24]. More recently, [25] demonstrated an analog approach to binary convolution using charge sharing. However, the work presented in [25] was limited to smaller networks. This is because with larger and more complex networks, the inaccuracies in interfacing conventional low-precision DACs/ADCs unacceptably degrade the network accuracy.

The main highlights of the present work are as follows:
1) We present two novel techniques to compute binary convolutions. Proposal-A uses charge sharing between the parasitic capacitances inherently present within the standard 10T-SRAM array to accomplish a fairly accurate popcount operation. Proposal-B alters the SRAM peripheral circuitry to perform accurate in-memory XNOR and popcount operations.
2) Further, we propose sectioned-SRAM to increase parallelism within the SRAM arrays, thereby improving the computation throughput and energy-efficiency of the binary convolution operation.

II. IN-MEMORY BINARY CONVOLUTION − PROPOSAL-A

As discussed in the introduction, a convolution operation in binary neural networks (BNNs) is simplified to a bitwise XNOR, followed by a popcount of the XNORed output. Although the bitwise XNOR operation is simple to incorporate within the memory, the popcount operation is not as straightforward.
Fig. 1. The 10-transistor SRAM cell featuring a differential decoupled read-port comprising transistors M1-M2 and M1'-M2'. The write port is constituted by the write access transistors connected to WWL.

We exploit the inherent SRAM structure, utilizing the internal parasitic capacitances to perform the XNOR and popcount of two vectors stored within the memory array. Although our approach to binary convolution is digital, we sense an analog voltage within the memory array to evaluate the popcount output. Sensing analog voltages, in general, is difficult without precise ADCs. Most common precise ADCs, such as Flash ADCs and SAR-type ADCs, require excessively large power and area [26], making them unsuitable for memory applications. Thus, we propose a dual read-wordline (Dual RWL) scheme along with a dual-stage ADC to minimize the errors in the popcount output. Further, we describe the sectioned-SRAM technique to improve the throughput and the energy-efficiency of the binary convolution. Since the same set of inputs needs to be convolved with multiple kernels, each section in the sectioned-SRAM stores a different kernel while the inputs are shared among all sections, thereby performing the operations concurrently.
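To make the mapping concrete, a minimal Python sketch is given below (illustrative only; the function name, the NumPy encoding of ±1 values as 0/1 bits, and the 64-bit vector length are assumptions of this sketch, not part of the circuit): a bipolar dot product is fully recovered from a bitwise XNOR followed by a popcount.

```python
import numpy as np

def binary_dot(a_bits, k_bits):
    """Bipolar (+1/-1) dot product computed as XNOR + popcount.

    a_bits, k_bits: 0/1 arrays, where bit 1 encodes +1 and bit 0 encodes -1.
    """
    xnor = ~(a_bits ^ k_bits) & 1      # 1 wherever the two bits match
    popcount = int(xnor.sum())         # number of matching positions
    n = a_bits.size
    return 2 * popcount - n            # matches minus mismatches

# Check against the explicit +/-1 arithmetic on two random 64-bit vectors
rng = np.random.default_rng(0)
a, k = rng.integers(0, 2, 64), rng.integers(0, 2, 64)
assert binary_dot(a, k) == int(((2 * a - 1) * (2 * k - 1)).sum())
```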
A. Circuit Description
We use a standard 10T-SRAM cell as the basic memory unit. Fig. 1 shows a schematic of the 10T-SRAM cell, containing the basic 6T cell as the storage unit, along with transistors M1-M2 and M1'-M2' forming the differential read ports. Writing into the cell is functionally similar to the 6T write operation, through the write ports (WWL, BL, BLB). For reading, RBL and RBLB are pre-charged to V_DD, SL is connected to ground, and RWL is enabled. If the bit-cell stores a '1' (Q = V_DD, QB = 0V), RBL discharges to 0V and RBLB holds its charge. Similarly, if the bit-cell stores a '0' (Q = 0V, QB = V_DD), RBLB discharges to 0V and RBL holds its charge. A differential sense amplifier senses the voltage difference between RBL and RBLB to generate the output.

We use the inherent parasitic capacitances on the RBLs, RBLBs and SLs (C_RBL, C_RBLB and C_SL, respectively) in the 10T-SRAM structure to compute the binary convolution within the memory array itself. The operation can be described in three steps as follows:

a) Pseudo-read: A read operation is performed on a row storing the binary vector inputs, say A1 (refer Fig. 2(a)).

Fig. 2. Illustration of the binary convolution operation within the 10T-SRAM array. a) Step 1: Pseudo-read. RBLs/RBLBs are pre-charged and the RWL for a row storing the input activation (A1) is enabled. Depending on the data A1, RBLs/RBLBs either discharge or stay pre-charged. The SAs are not enabled, in contrast to a usual memory read. Thus, the charge on the RBLs/RBLBs represents the data A1. b) Step 2: XNOR on SL. Once the charges on the RBLs/RBLBs have settled, the RWL for the row storing the kernel (K1) is enabled. Charge sharing occurs between the RBLs/RBLBs and the SL, depending on the data K1. The RBLs either deposit charge on the SL or take away charge from the SL. c) The truth table for Step 2. The pull-up and pull-down of the SL follow the XNOR truth table. Moreover, since the SL is common along the row, the pull-ups and pull-downs are cumulative. Thus, the final voltage on SL represents the XNOR + popcount of A1 and K1.

Fig. 3. a) Dual RWL technique. The left block shows the memory array with dual RWLs. Each row in the memory array consists of two read-wordlines, RWL1 and RWL2. Half of the cells along the row are connected to RWL1, while the other half are connected to RWL2. Only one of RWL1 and RWL2 is enabled at a time, ensuring that only half of the cells participate in charge sharing at any instant, thereby reducing the number of voltage states on the SL to be sensed. b) Dual-stage ADC scheme. The ADC consists of two dummy bitcells (only one shown), two SAs, a counter and a control block. The ADC control block generates the reference signals V_REFN and V_REFP, and SAE, which are fed to the two SAs. These are used in the first stage of the ADC to determine the sub-class (first 2 bits of the ADC output). It also generates the signals PCH, PCH_b and WL_ADC, which operate on the dummy cells during the second stage to either pump charge into or out of the SL, depending on the sub-class. The counter counts the number of cycles during the process to generate the final 3 bits of the ADC output.
First, all RBLs/RBLBs are precharged to V_DD, as in the usual read operation. Next, when the RWL corresponding to the row storing A1 is enabled, the precharged RBLs and RBLBs discharge conditionally, depending on the data values, thereby stabilizing at V_DD or 0V. For the example shown in the figure, the data stored is '1' in both cells corresponding to the input vector A1; thus, both RBLs discharge to 0V and the RBLBs stay at V_DD. Note that the differential sense-amplifiers are not enabled in this pseudo-read step.

b) XNOR on SL: After the pseudo-read operation, the RBLs/RBLBs store the information of A1 as their respective voltages. Now, the RWL of the row storing a weight kernel, say K1, is enabled (refer Fig. 2(b)). Interestingly, this causes charge sharing between C_RBL, C_RBLB and C_SL, as shown in the figure by the charge current paths. In the example, the two cells corresponding to K1 store a '0' and a '1', respectively. Thus, when the RWL is enabled, charge flows into the SL from M1' in the left cell, while charge flows out of the SL through M1 in the right cell. This 'pull-up' and 'pull-down' of the SL follows the XNOR operation of the data stored in the cell (K1) and the RBL/RBLB charge (A1). With respect to the example chosen above, one can observe that the first two rows of the XNOR truth table of Fig. 2(c) are taken care of. If the bits corresponding to the activation (A1) were '0' and '0', i.e., RBLB is at 0V while RBL is at V_DD, then charge flows out of the SL through M1' in the left cell, while it flows into the SL through M1 in the right cell. This represents the bottom two rows of the XNOR truth table. Thus, we perform a bitwise XNOR operation between vectors A1 and K1, represented by the charge stored on the line SL.

c) Popcount: Since the SL is shared by all the cells along the row, these 'pull-ups' and 'pull-downs' are cumulative. As can be seen from Fig. 2(c), an SL 'pull-up' corresponds to a '1' in the output XNORed vector, while an SL 'pull-down' corresponds to a '0' in the output XNORed vector. In order to evaluate the popcount of the output vector, we need to count the number of 1's. More 1's in the output vector implies more 'pull-ups' on the SL, which in turn implies a higher voltage on SL. Thus, the final SL voltage represents the popcount of the output vector (A1 XNOR K1). We boost the RWL voltage such that the SL swing spans 0V to V_DD. To sense this analog voltage we use a charge-sharing based, sequentially integrating ADC, adopted from [25]. Note, however, that this is an approximate, low-precision ADC. Thus, in order to achieve a fairly accurate estimate of the popcount of the entire row, we use the two techniques described in the next sub-section.

Fig. 4. The final SL voltage with and without the Dual RWL approach. A larger sense margin is obtained with the Dual RWL approach, thus relaxing the constraints on the low-overhead ADC. Note that with the Dual RWL technique the distinct voltage levels on SL are restricted to 32 at a time, instead of 64. However, the voltage swing on SL remains the same, thereby increasing the sense margin between the states.
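A behavioral sketch of the three steps above can be written as follows; it assumes an idealized, linear charge-sharing model in which the boosted RWL makes the final SL voltage proportional to the popcount (V_SL ≈ popcount/N · V_DD), with an optional Gaussian term standing in for device variations. The function name and the noise model are assumptions of this sketch, not the circuit itself.

```python
import numpy as np

VDD = 1.0

def sl_voltage(activation_bits, kernel_bits, sigma=0.0, rng=None):
    """Idealized SL voltage after pseudo-read followed by XNOR-on-SL."""
    xnor = ~(activation_bits ^ kernel_bits) & 1    # SL pull-ups
    popcount = int(xnor.sum())
    n = activation_bits.size
    v = popcount / n * VDD                         # assumed linear SL model
    if sigma > 0.0:
        rng = rng or np.random.default_rng()
        v += rng.normal(0.0, sigma)                # stand-in for variations
    return float(np.clip(v, 0.0, VDD)), popcount

a = np.array([1, 1, 0, 1] * 16)    # 64-bit activation row A1
k = np.array([1, 0, 0, 1] * 16)    # 64-bit kernel row K1
v_sl, pc = sl_voltage(a, k)
print(f"popcount = {pc}, SL settles near {v_sl:.3f} V")
```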
B. Dual Read-Wordline based Dual-stage ADC

In order to evaluate the popcount of the entire memory row at once, we should be able to distinguish N distinct states in the analog SL voltage, where N is the number of columns in a memory array (we choose N = 64 for a reasonably sized array). The output XNORed vector can contain zero 1's, one 1, two 1's, ..., up to N 1's. Correspondingly, there are N different voltage levels on the SL which need to be sensed by the ADCs. However, due to area and power constraints within the memory array, it is infeasible to use high-precision ADCs, such as the area-expensive SAR or power-hungry Flash ADCs. We adopt a simple charge-sharing based, serially integrating ADC for our purposes. However, instead of having to sense N distinct analog levels, the ADC only needs to sense N/8 levels. This is enabled by using a Dual RWL memory structure, along with a dual-stage ADC.

The Dual RWL technique is shown in Fig. 3(a). Note that we use two read word-lines (RWL1a, RWL1b) for every memory row. The first half of the cells along the row are connected to RWL1a, while the rest are connected to RWL1b. Step 2 of the binary convolution (XNOR on SL) described above is split into two parts. First, only RWL1a is enabled. Thus, only N/2 cells are enabled to share charge with the SL, either pulling up or pulling down the SL voltage. The other half of the cells are cut off from the SLs and cannot participate in the charge sharing. Once the SL voltage has been sensed by the ADC, RWL1a is disabled and RWL1b is enabled. Now, the other half of the cells share charge to generate a voltage on the SL. Note that this does not change the swing on the SL, since the SL voltage depends on the capacitive ratio C_RBL/C_SL. Thus, the N/2 voltage levels are equally separated out from 0V to V_DD. This can be confirmed from Fig. 4, which shows the SL voltages for N = 64, with and without the Dual RWL technique, as a function of the popcount. Since the separation between the states has increased, it becomes easier to sense the levels with a low-overhead ADC.

Fig. 5. Timing diagrams for the dual-stage ADC scheme. The figure plots the SL voltage for various popcount cases. In the first stage, the sub-class SC1-4 is determined using multiple references (0.25V, 0.5V and 0.75V). In the second stage, charge is pumped into/out of the SL successively, depending on the sub-class. The number of cycles it takes for SL to reach V_REF is counted. V_REF for SC1-4 is 0.25V, 0.5V, 0.5V and 0.75V, respectively.

The ADC used is shown schematically in Fig. 3(b). It consists of two dummy bitcells per row (only one shown in the figure), two SAs, a counter and an ADC logic block. We employ a dual-stage ADC to sense the analog voltage on SL. In the first stage of ADC sensing, we use multiple voltage references (V_DD/4, V_DD/2 and 3V_DD/4) to classify the analog voltage levels into four sub-classes SC1, SC2, SC3 and SC4 − [0 - V_DD/4], [V_DD/4 - V_DD/2], [V_DD/2 - 3V_DD/4] and [3V_DD/4 - V_DD], respectively. This is done using two voltage SAs, since the voltage swing on SL spans 0V to V_DD. On SA_N, a V_REF of 3V_DD/4 is applied, while for SA_P, a V_REF of V_DD/4 is applied. If both SA outputs are LOW, the SL voltage is classified in SC1. Similarly, when both SA outputs are HIGH, the SL voltage is classified in SC4. Otherwise, V_REF is changed to V_DD/2, and the SA outputs are observed again. If both outputs are HIGH, the SL voltage is classified in SC3, otherwise in SC2. Thus, the first stage of the ADC generates the 2 MSBs of the ADC output.

Once the sub-class of the analog voltage has been determined, the second stage of the ADC is initiated. The ADC logic block generates a set of control signals − PCH, PCH_b and WL_ADC − which operate on the dummy bitcells. For SC1 and SC2, SA_P is enabled with a V_REF of V_DD/4 and V_DD/2, respectively. PCH is pulsed alternately with WL_ADC, to pump a small amount of charge into the SL every cycle through the dummy cells. In each cycle, when WL_ADC is LOW and PCH is HIGH, the RBL of the dummy cell is precharged to V_DD. When WL_ADC is HIGH and PCH is LOW, the precharged RBL pumps charge into the SL. In successive cycles, the voltage on SL increases. As soon as the SL voltage exceeds V_REF, the SA_P output flips from LOW to HIGH. The number of cycles in the process is counted using a digital counter. On the other hand, for sub-classes SC3 and SC4, SA_N is enabled with a V_REF of V_DD/2 and 3V_DD/4, respectively. PCH_b is enabled instead of PCH, thereby pumping charge out of the SL every cycle. Again, the number of cycles is counted until SA_N flips from HIGH to LOW. This is illustrated in Fig. 5, which shows the operation of the ADC taking the example popcount cases 0, 8, 24 and 32, for N = 64. For the popcount cases 32 and 24, the sub-classes SC4 and SC3 are determined respectively; thus, charge is pumped out of the SL every cycle. Similarly, for the popcount cases 8 and 0, SC2 and SC1 are determined respectively, and charge is pumped into the SL every cycle. The two dummy bitcells are used to mimic the capacitances of the RBLs/RBLBs, such that the charge being pumped in/out of the SL every cycle by the dummy bitcells mimics the charge sharing of the RBLs/RBLBs and SL in the Step 2 (XNOR on SL) operation. Note that the amount of charge being pumped in/out decreases exponentially with time. This is a fundamental limit of charge-sharing type ADCs, and thus they work only if the number of counts is small. In our case, for N = 64, we count only N/8 = 8 states using this ADC, which gives us fairly accurate results, as shown later. Thus, the output from the first stage (sub-class SC1-4) along with the output from the second stage (ADC counts) estimates the number of 1's (popcount) of the XNORed output vector. Note that two sets of popcounts, one from RWL1a and the other from RWL1b, are read sequentially and then added together to get the final popcount of the vector.
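The two-stage quantization can be pictured with the behavioral sketch below. It is an idealization: a constant charge-pump step of V_DD/32 per cycle is assumed (the real charge-sharing step shrinks every cycle), and the boundary assignments between sub-classes are illustrative. The 2-bit sub-class and the 3-bit cycle count together identify one of the 32 SL levels produced under the Dual RWL scheme.

```python
VDD = 1.0
STEP = VDD / 32          # assumed per-cycle pump step for 32 SL levels

def dual_stage_adc(v_sl, max_cycles=8):
    """Return (sub_class, cycles): the 2 MSBs and 3 LSBs of the popcount."""
    # First stage: classify the SL voltage using VDD/4, VDD/2, 3VDD/4.
    if v_sl < VDD / 4:
        sc, vref, pump_in = 0, VDD / 4, True       # SC1: pump charge in
    elif v_sl < VDD / 2:
        sc, vref, pump_in = 1, VDD / 2, True       # SC2: pump charge in
    elif v_sl < 3 * VDD / 4:
        sc, vref, pump_in = 2, VDD / 2, False      # SC3: pump charge out
    else:
        sc, vref, pump_in = 3, 3 * VDD / 4, False  # SC4: pump charge out
    # Second stage: pump charge until SL crosses VREF, counting cycles.
    cycles, v = 0, v_sl
    while cycles < max_cycles:
        if (pump_in and v >= vref) or (not pump_in and v <= vref):
            break
        v += STEP if pump_in else -STEP
        cycles += 1
    return sc, cycles

for pc in (1, 9, 23, 31):            # example popcounts out of 32 levels
    print(pc, dual_stage_adc(pc / 32 * VDD))
```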
C. Sectioned Memory Array for Parallel Computing

Fig. 6. a) Typical SRAM memory array with row and column peripherals, storing the activations A1-Am and kernels K1-Kn. b) Proposed sectioned-SRAM array. By introducing switches along the RBLs, the array is divided into sections. The kernels are mapped into the sectioned-SRAM with each section storing a different kernel. Once the activations are read onto the RBLs, the switches are opened, and the memory array is divided into sections. c) Since the RBLs for each section have been decoupled, one RWL in each section can be simultaneously enabled such that each section performs the binary convolution concurrently. For example, if A1 was read onto the RBLs before sectioning, enabling the rows K1 and K2 in Sections 1 and 2 respectively, we obtain A1*K1 and A1*K2 in parallel.

We have seen that XNOR and popcount operations can be computed within the SRAM array. The manner in which these computations are done opens possibilities for improving the throughput and energy-efficiency of binary convolutions. A typical operation in a CNN layer involves the convolution of input activations with multiple kernels. This gives us an opportunity for data re-use, since the same set of activations needs to be convolved with different kernels. Our proposed scheme described above is well suited to exploit this property of CNNs. Given a set of activations A1, A2, ..., Am and kernels K1, K2, ..., Kn stored within the memory array (see Fig. 6(a)), we need to compute A1*K1, A1*K2, ..., A1*Kn, A2*K1, A2*K2, ..., A2*Kn and so on. In the computations described above, specifically in the pseudo-read step, the data corresponding to A1 is read onto the RBL/RBLB voltages. We propose sectioning the memory array into sub-sections by introducing switches along the RBLs, as shown in Fig. 6(b), such that the kernels are grouped into different sections. Each section consists of a separate ADC control block, as shown in Fig. 6(c). After A1 has been read onto the RBLs/RBLBs, the switches are opened. The RBLs/RBLBs in the individual sections store the information of data A1, but have been decoupled. This allows us to enable one memory row in each of the sections, corresponding to kernel K1 in Section 1, K2 in Section 2, and so on, thereby evaluating the XNOR-popcount operations concurrently in all n sections. We thus obtain the outputs A1*K1, A1*K2, ..., A1*Kn in a single cycle. This step can be repeated for all activations A1, A2, ..., Am. Thus, sectioning the memory array improves the throughput of our computations n-fold. Moreover, with a single pseudo-read step we are able to perform n convolutions, thereby saving multiple pseudo-read cycles, which consume bitline precharge energy. Specifically, without sectioning, one RBL and RBLB pre-charge is required for every convolution operation, in addition to the ADC energy consumption. With n sections per sub-bank we obtain n convolutions per pre-charging of the RBL and RBLB, thereby not only increasing parallelism but also energy-efficiency.

Let us now discuss how binary convolutions can be obtained for large kernels using the distributive property of popcount. If the kernel size is larger than the memory word length, which is often the case in deeper state-of-the-art CNN layers, a single kernel occupies multiple rows in the same or different sub-banks. In-memory binary convolution is performed for each of these kernel rows separately, and the partial popcounts obtained from each operation are added to generate the final popcount:

popcount(N1 + N2 + ...) = popcount(N1) + popcount(N2) + ...

where N1, N2, ... denote the segments of the XNORed vector stored in separate rows. Once the final popcount is obtained, the output of the binary convolution operation is '1' if the final popcount (number of 1's) is greater than half the kernel size, and '0' otherwise.
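A sketch of this row-wise accumulation and the final binarization is shown below (the 64-bit row width and the helper names are assumptions for illustration; the thresholding against half the kernel size follows the rule stated above):

```python
import numpy as np

ROW_BITS = 64   # assumed SRAM row width

def xnor_popcount_row(a_row, k_row):
    """One in-memory macro-op: XNOR + popcount of a single row."""
    return int((~(a_row ^ k_row) & 1).sum())

def binary_conv_bit(a_bits, k_bits):
    """Binary convolution output bit for a kernel spanning multiple rows."""
    total = 0
    for start in range(0, k_bits.size, ROW_BITS):
        total += xnor_popcount_row(a_bits[start:start + ROW_BITS],
                                   k_bits[start:start + ROW_BITS])
    # '1' if more than half of the XNOR bits are 1, '0' otherwise.
    return 1 if total > k_bits.size // 2 else 0

rng = np.random.default_rng(1)
a = rng.integers(0, 2, 3 * 3 * 64)   # e.g., a 3x3 kernel over 64 channels
k = rng.integers(0, 2, 3 * 3 * 64)
print(binary_conv_bit(a, k))
```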
D. Results
The sectioned-SRAM array, assuming a section size of 32 rows and 64 columns, was simulated in HSPICE using the 45nm predictive technology models (PTM) [27]. As described in the previous section, the final voltage at SL denotes the popcount output of the binary convolution. The SL voltage is sensed using the ADC described in the previous sub-section. Again, the 45nm PTM models were used to simulate the SA and the ADC logic block. Using the Dual RWL technique along with a dual-stage ADC, the ADC output is relaxed to only 5 bits. The most significant 2 bits are generated in the first stage of the ADC (sub-classes SC1-4) using multiple references, while the lower 3 bits are generated in the second stage by the integrating ADC. We observe the effects of CMOS process variation on the ADC output using Monte Carlo simulations, in the presence of 30mV sigma threshold voltage variation. Fig. 7 plots the distribution of the second-stage ADC output for various popcount cases. Note that a similar trend repeats for higher popcount cases modulo 8, since only the lower 3 bits of the output are generated in the second stage. The ADC output is fairly accurate, with a small overlap with the neighboring counts. The small inaccuracy is attributed to the transistor threshold voltage variations in the memory array and in the SAs used in the ADC. Moreover, the charge being pumped into/out of SL decreases with each cycle, due to charge sharing, thereby inducing errors for higher counts. The inset shows a best-fitting normal distribution for the variations in the ADC output. The average standard deviation of the counts was found to be ∼ across the popcount cases. The energy and latency per operation were also extracted from these simulations; by one operation, we mean the XNOR + popcount of a 64-bit input activation and a 64-bit kernel, both of which are stored in the SRAM. The energy consumption includes the pre-charge energy in the pseudo-read step and the ADC energy, while the latency includes the ADC cycles required to estimate the popcount.

Fig. 7. Monte Carlo simulations. The figure plots the histogram of the second-stage output of the ADC, for various popcount cases, in the presence of process variations. Inset: each histogram is fitted with a Gaussian distribution. The average standard deviation of the counts is ∼ . A similar trend repeats for higher popcount cases modulo 8, since only the lower 3 bits of the output are generated in the second stage.

III. IN-MEMORY BINARY CONVOLUTION − PROPOSAL-B

In the previous section, we described an energy-efficient implementation of binary convolutions within the SRAM array. However, the low-overhead ADC used to determine the popcount induces errors in the convolution output, which may impact the system accuracy, as we will show later. The primary cause of the inaccuracy is the generation and detection of an analog voltage, which is susceptible to noise, offset, etc. Thus, in this section, we propose another implementation of binary convolutions in standard SRAM arrays, by modifying the peripheral circuitry. This approach is robust since the popcount is computed using digital logic gates (full-adders), unlike Proposal-A which uses analog voltages. Although this robustness comes at a cost in energy-efficiency and throughput compared to the previous charge-sharing based proposal, our simulations show that this implementation is still better than the typical von-Neumann based approach, as it leverages in-memory computing for the XNOR and popcount operations.
A. Bitwise XNORs
Bitwise Boolean operations within SRAM arrays have recently been demonstrated in [15], [16], [28]. The idea is to enable two RWLs together during a read operation. Let us consider words A and B stored in two rows of the memory array. Note that we can simultaneously enable the two corresponding RWLs without worrying about read disturbs, since the bit-cell has decoupled read-write paths (shown in Fig. 8(a)). The RBL/RBLB are pre-charged to V_DD. For the case AB = '00' ('11'), RBL (RBLB) discharges to 0V, but RBLB (RBL) remains in the precharged state. However, for the cases '10' and '01', both RBL and RBLB discharge simultaneously. The four cases are summarized in Fig. 8(b). Now, in order to sense the bit-wise XNOR from the RBL/RBLB voltages, we use two asymmetric SAs (see Fig. 8(c) [15]) which compute the bitwise NAND/NOR in parallel. Asymmetric SAs work by sizing one of the transistors M_BL/M_BLB bigger than the other. In Fig. 8(c), if the transistor M_BL is sized bigger than M_BLB, its current carrying capability increases. Thus, for the cases '01' and '10', where both RBL and RBLB discharge simultaneously, the SA_out node discharges faster, and the cross-coupled inverter pair of the SA stabilizes with SA_out = '0'. For the case '11' ('00'), RBL (RBLB) starts to discharge while RBLB (RBL) stays at V_DD, making SA_out = '1' ('0'). Thus, it can be observed that SA_out realizes an AND gate (and SA_outb a NAND gate); we therefore call this sense-amplifier SA_NAND. Similarly, by sizing M_BLB bigger than M_BL, OR/NOR gates can be obtained, and we call that sense-amplifier SA_NOR. Next, by ORing the NOR and AND outputs obtained from SA_NOR and SA_NAND respectively, the bitwise XNOR operation is realized. A detailed description of the bit-wise Boolean XNOR used in this work can be found in [15].

Fig. 8. (a) The 10T-SRAM bitcell schematic, repeated here for convenience. (b) Timing diagram used for in-memory computing with 10T-SRAM bitcells. (c) Circuit schematic of the asymmetric differential sense amplifier [15].
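In logic terms, each column thus produces an AND (from SA_NAND) and a NOR (from SA_NOR), and the XNOR is simply their OR. A minimal sketch of this recombination (pure Boolean logic, not a model of the sense amplifiers themselves):

```python
def xnor_from_asymmetric_sas(a, b):
    """Recover A XNOR B from the two asymmetric sense-amplifier outputs."""
    and_out = a & b           # SA_NAND path: '1' only for the '11' case
    nor_out = 1 - (a | b)     # SA_NOR path: '1' only for the '00' case
    return and_out | nor_out  # XNOR = AND OR NOR

# Exhaustive truth-table check
for a in (0, 1):
    for b in (0, 1):
        assert xnor_from_asymmetric_sas(a, b) == (1 if a == b else 0)
```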
B. Popcount
In order to utilize the above-mentioned approach for enabling binary convolutions, we propose to add a bit-tree adder after the asymmetric-SA stage to generate the popcount, as shown in Fig. 9. By enabling the RWLs corresponding to the rows storing the activation (A1) and the kernel (K1), the asymmetric SAs generate the XNORed vector. The output XNORed vector is passed to the bit-tree adder. It consists of multiple full-adder (FA) blocks connected in a tree. The bit-tree adder sums up all the bits of the output XNORed vector to generate the popcount. The first layer of the bit-tree adder consists of single FA blocks, each of which is capable of adding three consecutive bits to generate a 2-bit output. In the next layer, 2-bit adders are used, which are constructed using two stacked FA blocks. The second layer generates a 3-bit output. In subsequent layers, multiple FA blocks are stacked to construct multi-bit adders. Finally, in the log(N) layer, where N is the number of columns in the sub-array, the popcount output is generated, and is read out from the memory.

To incorporate convolutions with large kernel sizes, the partial popcounts generated by the bit-tree adder can be summed up over multiple cycles to generate the final popcount. Note that the generated popcount is exact, as it is computed using conventional digital logic gates. Also note that the sectioned-SRAM concept described in the previous section is not applicable to this proposal.
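The reduction performed by the bit-tree adder can be sketched behaviorally as below. For simplicity the sketch reduces the bits pairwise, whereas the hardware described above starts with 3-input full adders; the logarithmic layer structure and the final 6-bit result for N = 64 are the same.

```python
def popcount_tree(bits):
    """Sum N single-bit inputs using a tree of small adders."""
    layer = list(bits)                         # layer 0: the XNORed bits
    while len(layer) > 1:
        nxt = [layer[i] + layer[i + 1]         # one adder per pair
               for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:                     # odd element passes through
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]                            # 6-bit popcount for N = 64

xnor_bits = [1, 0, 1, 1] * 16                  # example 64-bit XNORed vector
assert popcount_tree(xnor_bits) == sum(xnor_bits)
```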
Fig. 9. Modified peripheral circuitry of the SRAM array to enable binary convolution operations. It consists of two asymmetric SAs, SA_NOR and SA_NAND, which pass the XNORed data vector to a bit-tree adder. The adder has log(N) layers, where N is the number of inputs to the adder. It sums the input bits to generate the popcount.

Fig. 10. (a) Modified von-Neumann architecture based on Xcel-RAM memory banks and an enhanced instruction set architecture (ISA) of the processor. (b) Snippet of assembly code for performing a binary convolution operation using conventional instructions and custom instructions.

C. Results

An SRAM array along with the asymmetric SAs, SA_NOR and SA_NAND, was simulated in HSPICE using the 45nm predictive technology models (PTM) [27]. As described above, two RWLs are enabled simultaneously, and depending on the data stored in each of the bits, SA_NOR and SA_NAND generate the bitwise NOR/OR and NAND/AND, respectively. Readers are referred to [15] for more circuit details and simulations. The energy consumption and latency of the bitwise XNOR operation were estimated to be 29.67fJ/bit and 1ns, respectively. The energy consumption includes the pre-charge energy and the energy consumed in the asymmetric SAs. The bit-tree adder was modeled in Verilog and synthesized using Synopsys Design Compiler at the 45nm technology node. The inputs to the bit-tree adder block are 64 wires, which represent the bitwise XNORed data generated from the SA stage. The output is a 6-bit popcount. The total power and the critical-path delay of the bit-tree adder in performing a 64-bit popcount were estimated to be 0.26mW and 0.3ns, respectively.
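A rough back-of-the-envelope combination of these figures (an estimate only, assuming energy ≈ power × critical-path delay for the synthesized adder) can be sketched as:

```python
# Per-operation estimate for one 64-bit XNOR + popcount in Proposal-B,
# combining the circuit-level figures quoted above.
E_XNOR_PER_BIT_FJ = 29.67          # fJ/bit, bitwise XNOR (pre-charge + SAs)
ADDER_POWER_MW = 0.26              # mW, synthesized 64-input bit-tree adder
ADDER_DELAY_NS = 0.3               # ns, critical-path delay

e_xnor_fj = 64 * E_XNOR_PER_BIT_FJ                    # ~1899 fJ for 64 bits
e_adder_fj = ADDER_POWER_MW * ADDER_DELAY_NS * 1e3    # mW*ns -> fJ, ~78 fJ
print(f"~{(e_xnor_fj + e_adder_fj) / 1e3:.2f} pJ per 64-bit macro-op")
```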
IV. SYSTEM-LEVEL EVALUATION FRAMEWORK FOR BNN

In this section, we describe the framework developed to evaluate the benefits of our proposals at the system level, taking the example of a deep binary neural network. We use a modified von-Neumann system architecture, in which the SRAM banks are replaced with our proposed Xcel-RAM banks (Proposal-A/Proposal-B) with embedded convolution compute capabilities. By utilizing these in-memory convolutions, we demonstrate the benefits in the overall system energy consumption and latency per inference.
A. Simulation Methodology
The modified von-Neumann processing architecture is shown in Fig. 10(a). It consists of a processor, an Xcel-RAM memory block and an instruction memory, connected by a system bus. The Xcel-RAM block consists of multiple subarrays that are arranged in a typical banked structure. We use the CACTI tool [29] to model a 64KB Xcel-RAM bank. The circuit numbers for a subarray, obtained from HSPICE with the 45nm PTM models [27], were fed into CACTI to obtain the per-access energy and latency of memory read/write operations as well as of the binary convolution operation. These include the energy consumed in the H-trees, WL decoders, BL drivers, SAs, muxes, etc.
TABLE I: Benchmark binary neural network [11] used for classifying the CIFAR-10 dataset.
Next, a cycle-accurate RTL model was developed for the Xcel-RAM banks, which was integrated with Intel's programmable Nios-II processor [30], with instruction set architecture (ISA) extensions to leverage the Xcel-RAM compute capabilities (see Fig. 10(b)). The system bus follows the Avalon memory-mapped protocol, with an enhanced bus architecture to support passing multiple addresses at a time. Note that this is not a large overhead, since in-memory instructions do not pass the data operands, and thus the data channel is used to pass the extra memory addresses over the bus [31]. Note that although we show a typical von-Neumann based system, Xcel-RAM banks can be interfaced with general-purpose graphics processing unit (GP-GPU) based systems as well, to leverage data parallelism along with in-memory computing. Our aim here was to show the benefits of replacing conventional SRAM banks with compute-capable Xcel-RAM banks.

The binary neural network (BNN) proposed in [11] uses bipolar binary values (±1) for both weights and activations. Note that in our memory, +1 is stored as a logic HIGH bit, while −1 is stored as a logic LOW bit. We trained a BNN using the algorithm proposed in [11] on the PyTorch platform [32], using the GitHub repository [33] of the same work. The neural network architecture is given in Table I. The network was evaluated on CIFAR-10 [34]. All layers were binarized, except the Conv1 and FC3 layers. It was observed that the vast majority of the total computations occur in the binarized layers, Conv2-6 and FC1-2, all of which can utilize the Xcel-RAM convolution capabilities (see Table I). In other words, most of the computations per inference can be mapped to custom Xcel-RAM instructions, thereby giving us significant improvements in energy and throughput. Each of these layers was run on the modified von-Neumann architecture described above. We assume that the binarized kernels are stored in an off-chip memory, and the kernels for a particular layer are loaded into the SRAM before processing that layer. Typical values of DRAM access energy and latency were taken from the literature [35]. The software was modified by replacing repetitive convolution operations with our custom instruction macros. In every layer, the convolutions are split into multiple 64-bit XNOR + popcount operations, which are then accumulated to compute the final output. The final output is stored back into the SRAM, and becomes the input activations for the succeeding layer.

Fig. 11. Layer-wise (a) energy consumption and (b) latency, for running the CIFAR-10 image classification benchmark on the proposed designs and the baseline.

As a baseline, we use a similar system architecture, but with standard SRAM banks with only read/write capability, instead of Xcel-RAM banks. The convolution operation is performed in software through conventional instructions. A snippet of the assembly code for convolution in the baseline and Xcel-RAM based designs is shown in Fig. 10(b).
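The bookkeeping behind the per-inference numbers can be pictured with the sketch below: per-layer macro-op counts are multiplied by the per-operation energy and latency extracted from the circuit and CACTI models. The function, its arguments and the example figures are assumptions of this sketch, not the actual RTL evaluation framework or its measured values.

```python
from math import ceil

def layer_cost(n_outputs, kernel_bits, e_op_pj, t_op_ns,
               e_dram_pj_per_bit=0.0):
    """Rough energy (pJ) and latency (ns) model of one binarized layer.

    n_outputs        : number of output-map elements in the layer
    kernel_bits      : bits per kernel (k * k * input_channels)
    e_op_pj, t_op_ns : cost of one 64-bit XNOR + popcount macro-op
    e_dram_pj_per_bit: optional cost of loading one kernel's bits from DRAM
    """
    macro_ops = n_outputs * ceil(kernel_bits / 64)
    energy = macro_ops * e_op_pj + kernel_bits * e_dram_pj_per_bit
    latency = macro_ops * t_op_ns        # assumes serialized macro-ops
    return energy, latency

# Hypothetical layer and per-op costs, purely to illustrate the accounting
print(layer_cost(n_outputs=32 * 32 * 128, kernel_bits=3 * 3 * 128,
                 e_op_pj=1.0, t_op_ns=2.0))
```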
B. Results and Discussion

The binarized network exhibits a small drop in classification accuracy relative to its full-precision counterpart, an expected consequence of binarization. We then evaluate the impact of the ADC inaccuracies in Proposal-A (due to process variations) on the classification accuracy using our simulation framework. At every binarized layer, each element of the output map is a sum of N binary XNORs, where N = k × k × I, k being the filter height (and width) and I the number of input channels. Our proposed methodology can perform 64 binary operations at once, in two steps of 32 bits each. Hence, the number of popcounts done per element of an output map is M = ceil(N/64). We add the popcount error, obtained from circuit simulations, to the output during inference, and observe a further decrease in accuracy relative to the ideal BNN. Proposal-B, on the other hand, obtains the ideal BNN accuracy because the computations are done using a digital adder tree.

Fig. 11 shows the layer-wise energy consumption and latency for Proposal-A, Proposal-B, and the baseline. Note that we focus only on layers Conv2-6 and FC1-2, as they constitute the majority of the total computations. It can be observed that layers Conv2, 4 and 6 are the most compute-intensive layers, due to their larger kernels. Overall, per inference, 6.1× and 2.3× improvements were obtained in energy consumption for Proposal-A and Proposal-B, respectively, compared to the baseline. In terms of latency, 15.8× and 8.1× improvements were obtained per inference for Proposal-A and Proposal-B, respectively. These improvements can be attributed to the fact that the most compute-intensive operations involved in BNN inference − bitwise XNOR followed by popcount − are performed efficiently within the memory, thereby eliminating a majority of unnecessary memory accesses and computations. Moreover, the energy and latency benefits of Proposal-A arise from the low-overhead ADC and the sectioned SRAM arrays, which enable multiple operations in a single memory access. In Proposal-B, although sectioning is not applicable, the energy and latency benefits arise from the bit-wise XNOR computations on the bitlines using asymmetric SAs and the digital bit-tree adder that generates the result in the memory array itself.
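As an illustration of how the popcount error can be injected during inference, a hedged PyTorch-style sketch is given below; the module name, the Gaussian error model and the error magnitude are assumptions of this sketch (the actual evaluation uses the error distributions extracted from the circuit simulations).

```python
import torch

class NoisyPopcountConv(torch.nn.Module):
    """Wrap a binarized conv layer and inject per-popcount ADC error."""

    def __init__(self, conv, sigma_counts=0.5, bits_per_op=64):
        super().__init__()
        self.conv = conv
        self.sigma = sigma_counts       # assumed std-dev of one popcount
        self.bits = bits_per_op         # bits handled per macro-op

    def forward(self, x):
        out = self.conv(x)                        # exact binary conv output
        n = self.conv.weight[0].numel()           # XNORs per output element
        m = -(-n // self.bits)                    # M = ceil(N / 64) popcounts
        # Each of the M partial popcounts contributes an independent error;
        # an error of one count shifts the bipolar sum by two.
        noise = torch.randn_like(out) * (2.0 * self.sigma * m ** 0.5)
        return out + noise
```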
V. CONCLUSION

Enhanced memory blocks with built-in compute functionality can operate as on-demand accelerators for machine learning computations, while simultaneously operating as usual memory read-write units for general-purpose workloads. In this work, we demonstrated two novel techniques to enable binary convolutions within standard SRAM memory arrays. In the first proposal, we use charge sharing on the parasitic capacitances inherently present in the 10T-SRAM structure to embed vector XNOR operations. Further, we use a dual read-wordline scheme along with a dual-stage ADC to handle the inaccuracies of the low-precision, low-overhead ADC. A key highlight of this proposal is the sectioned-SRAM, which enables multi-row convolutions in parallel, thereby improving the overall system performance and energy-efficiency. The second proposal uses asymmetric SAs and a bit-tree adder in the memory peripherals to perform the bit-wise XNOR computations and popcount in-memory. A complete framework was developed to evaluate a benchmark application (CIFAR-10 image classification) using our proposed memory arrays. For a system with the proposed Xcel-RAM banks, 6.1× and 2.3× improvements were obtained in energy consumption, and 15.8× and 8.1× improvements were obtained in latency, for the respective proposals, compared to a conventional SRAM based system.
ACKNOWLEDGEMENTS

The research was funded in part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, the National Science Foundation, Intel Corporation and the Vannevar Bush Faculty Fellowship.
REFERENCES
[1] Y. Bengio et al., "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[2] N. Jones, "The learning machines," Nature, vol. 505, no. 7482, p. 146, 2014.
[3] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Jun. 2016.
[5] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[7] A. Krizhevsky et al., "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Jun. 2015.
[9] M. L. Schneider, C. A. Donnelly, S. E. Russek, B. Baek, M. R. Pufall, P. F. Hopkins, P. D. Dresselhaus, S. P. Benz, and W. H. Rippard, "Ultralow power artificial synapses using nanotextured magnetic Josephson junctions," Science Advances, vol. 4, no. 1, p. e1701329, 2018.
[10] J. Backus, "Can programming be liberated from the von Neumann style?: A functional style and its algebra of programs," Commun. ACM, vol. 21, no. 8, pp. 613–641, Aug. 1978.
[11] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1," arXiv preprint arXiv:1602.02830, 2016.
[12] G. Srinivasan, A. Sengupta, and K. Roy, "Magnetic tunnel junction enabled all-spin stochastic spiking neural network," in Design, Automation & Test in Europe Conference & Exhibition (DATE), March 2017, pp. 530–535.
[13] ——, "Magnetic tunnel junction based long-term short-term stochastic synapse for a spiking neural network with on-chip STDP learning," Scientific Reports, vol. 6, no. 1, Jul. 2016.
[14] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in European Conference on Computer Vision. Springer, 2016, pp. 525–542.
[15] A. Agrawal, A. Jaiswal, C. Lee, and K. Roy, "X-SRAM: Enabling in-memory Boolean computations in CMOS static random access memories," IEEE Transactions on Circuits and Systems I: Regular Papers, pp. 1–14, 2018.
[16] Q. Dong, S. Jeloka, M. Saligane, Y. Kim, M. Kawaminami, A. Harada, S. Miyoshi, D. Blaauw, and D. Sylvester, "A 0.3V VDDmin 4+2T SRAM for searching and in-memory computing using 55nm DDC technology," in 2017 Symposium on VLSI Circuits. IEEE, Jun. 2017.
[17] A. Jaiswal, I. Chakraborty, A. Agrawal, and K. Roy, "8T SRAM cell as a multi-bit dot product engine for beyond von-Neumann computing," arXiv preprint arXiv:1802.08601, 2018.
[18] Y. Chen et al., "DaDianNao: A machine-learning supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2014, pp. 609–622.
[19] Y.-H. Chen et al., "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[20] A. Agrawal, A. Ankit, and K. Roy, "SPARE: Spiking neural network acceleration using ROM-embedded RAMs as in-memory-computation primitives," IEEE Transactions on Computers, pp. 1–1, 2018.
[21] A. Ankit et al., "RESPARC: A reconfigurable and energy-efficient architecture with memristive crossbars for deep spiking neural networks," in Proceedings of the 54th Annual Design Automation Conference (DAC). ACM Press, 2017.
[22] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, Jun. 2016.
[23] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, "High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm," Nanotechnology, vol. 23, no. 7, p. 075201, Jan. 2012.
[24] A. Chen and M.-R. Lin, "Variability of resistive switching memories and its impact on crossbar array performance," in Reliability Physics Symposium (IRPS), 2011 IEEE International. IEEE, 2011, pp. MY-7.
[25] A. Biswas and A. P. Chandrakasan, "Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications," in Solid-State Circuits Conference (ISSCC), 2018 IEEE International. IEEE, 2018, pp. 488–490.
[26] B. Razavi, Analog-to-Digital Converter Architectures. Wiley-IEEE Press, 1995, pp. 272–.
[27] "Predictive technology models," [Online]. Available: http://ptm.asu.edu/, June 2016.
[28] S. Jeloka, N. B. Akesh, D. Sylvester, and D. Blaauw, "A 28 nm configurable memory (TCAM/BCAM/SRAM) using push-rule 6T bit cell enabling logic-in-memory," IEEE Journal of Solid-State Circuits, vol. 51, no. 4, pp. 1009–1021, April 2016.
[29] "CACTI 6.0: A tool to understand large caches."
[30] "Nios II processor overview," in Embedded SoPC Design with Nios II Processor and VHDL Examples. John Wiley & Sons, Inc., Sep. 2011, pp. 179–188.
[31] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, "Computing in memory with spin-transfer torque magnetic RAM," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 3, pp. 470–483, March 2018.
[32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in NIPS-W, 2017.
[33] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "BinaryNet.pytorch," https://github.com/itayhubara/BinaryNet.pytorch, 2017.
[34] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Citeseer, Tech. Rep., 2009.
[35] N. Chatterjee, M. O'Connor, D. Lee, D. R. Johnson, S. W. Keckler, M. Rhu, and W. J. Dally, "Architecting an energy-efficient DRAM system for GPUs," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).