An Ultra-Efficient Memristor-Based DNN Framework with Structured Weight Pruning and Quantization Using ADMM
Geng Yuan, Xiaolong Ma, Caiwen Ding, Sheng Lin, Tianyun Zhang, Zeinab S. Jalali, Yilong Zhao, Li Jiang, Sucheta Soundarajan, Yanzhi Wang
Geng Yuan†, Northeastern University, Boston
Xiaolong Ma†, Northeastern University, Boston
Caiwen Ding, Northeastern University, Boston
Sheng Lin, Northeastern University, Boston
Tianyun Zhang, Syracuse University, Syracuse
Zeinab S. Jalali, Syracuse University, Syracuse
Yilong Zhao, Shanghai Jiao Tong University, Shanghai
Li Jiang, Shanghai Jiao Tong University, Shanghai
Sucheta Soundarajan, Syracuse University, Syracuse
Yanzhi Wang, Northeastern University, Boston

† These authors contributed equally.
Abstract—The high computation and memory storage requirements of large deep neural network (DNN) models pose intensive challenges to the conventional Von-Neumann architecture, incurring substantial data movements in the memory hierarchy. The memristor crossbar array has emerged as a promising solution to mitigate these challenges and enable low-power acceleration of DNNs. Memristor-based weight pruning and weight quantization have been separately investigated and proven effective in reducing area and power consumption compared to the original DNN model. However, there has been no systematic investigation of memristor-based neuromorphic computing (NC) systems considering both weight pruning and weight quantization. In this paper, we propose a unified and systematic memristor-based framework considering both structured weight pruning and weight quantization by incorporating the alternating direction method of multipliers (ADMM) into DNN training. We consider hardware constraints such as crossbar block pruning, conductance range, and the mismatch between weight values and real devices, to achieve high accuracy, low power, and a small area footprint. Our framework consists of three main steps: memristor-based ADMM regularized optimization, masked mapping, and retraining. Experimental results show that our proposed framework achieves a 29.81× (20.88×) weight compression ratio, with 98.38% (96.96%) power reduction and 98.29% (97.47%) area reduction on the VGG-16 (ResNet-18) network, with only 0.5% (0.76%) accuracy loss compared to the original DNN models. We share our models at the anonymous link http://bit.ly/2Jp5LHJ.

I. INTRODUCTION
With the rise of artificial intelligence, deep neural networks (DNNs) have been widely used thanks to their high accuracy, excellent scalability, and self-adaptiveness [1]. DNN models are becoming deeper and larger, and are evolving fast to satisfy the diverse characteristics of broad applications. The high computation and memory storage requirements of DNN models pose intensive challenges to the conventional Von-Neumann architecture, incurring substantial data movements in the memory hierarchy.

To achieve high performance and energy efficiency, hardware acceleration of DNNs is intensively studied in both academia and industry [2–9]. DNN model compression techniques, including weight pruning [10–15] and weight quantization [16–18], have been developed to facilitate hardware acceleration by reducing storage and computation in DNN inference with negligible impact on accuracy. However, as Moore's law is coming to an end [19], the acceleration achievable with the conventional Von-Neumann architecture is limited.

To further mitigate the intensive computation and memory storage of DNN models, next-generation device/circuit technologies beyond CMOS and novel computing architectures beyond the traditional Von-Neumann machine are being investigated. The crossbar array of the recently discovered memristor device (i.e., the memristor crossbar) can be utilized to perform matrix-vector multiplication in the analog domain and solve systems of linear equations in O(1) time complexity [20, 21]. Ankit et al. [22] applied weight pruning techniques to NC systems using memristor crossbar arrays, which reduces area (energy) consumption compared to the original network. However, for hardware implementations on on-chip neuromorphic computing systems, there are several limitations: (i) unbalanced workload; (ii) extra memory footprint for indices; (iii) irregular memory access. These cause circuit overheads in hardware implementations. To address these limitations, Wang et al. [23] proposed group connection deletion, which prunes connections to reduce routing congestion between crossbar arrays.

On the other hand, Zhang et al. [24] discussed the effectiveness of using quantized conductance in memristors for multi-level logic. Song et al. [25] investigated the generation of quantization loss in memristor-based NC systems and its impact on computation accuracy, and proposed a regularized offline learning method that minimizes the impact of quantization loss during neural network mapping. Weight quantization can also mitigate hardware imperfections of memristors, including state drift and process variations, which are caused by the imperfect fabrication process or by the device features themselves.

Because weight pruning and weight quantization techniques leverage different sources of redundancy, they may be combined to achieve higher DNN compression. However, there has been no systematic investigation of this effect in memristor-based NC systems considering both weight pruning and weight quantization. In this paper, we propose a unified and systematic memristor-based framework considering both structured weight pruning and weight quantization, by incorporating ADMM into DNN training. We consider hardware constraints such as crossbar block pruning, conductance range, and the mismatch between weight values and real devices, to achieve high accuracy, low power, and a small area footprint. Our proposed framework can better mitigate the inaccuracy caused by hardware imperfections compared to quantization-only methods [24, 25].
It contains memristor-based ADMM regularized optimization, masked mapping, and retraining steps, which guarantee solution feasibility (satisfying all constraints) and provide high solution quality (maintaining test accuracy) at the same time. The contributions of this paper include:

• We systematically investigate the combination of structured weight pruning and weight quantization techniques, which leverage different sources of redundancy, to achieve a higher DNN compression ratio, lower power, and smaller area in the domain of memristor-based NC systems.

• We adopt ADMM, an effective optimization technique for general and non-convex optimization problems, to jointly optimize the weight pruning and weight quantization problems during training for higher model accuracy.

We evaluate our proposed framework on different networks. Experimental results show that our proposed framework can achieve a 29.81× (20.88×) weight compression ratio, with 98.38% (96.96%) power reduction and 98.29% (97.47%) area reduction on the VGG-16 (ResNet-18) network, with only 0.5% (0.76%) accuracy loss compared to the original DNN models.

II. BACKGROUND ON MEMRISTORS
A. Memristor Crossbar Model

Fig. 1: (a) Memristor crossbar performs matrix-vector multiplication. (b) Memristor model and its V-I curve.

The memristor has shown remarkable characteristics as one of the most promising emerging technologies, as shown in Figure 1 [26]. It has many promising features, such as non-volatility, low power, high integration density, and excellent scalability. Memristors can be formed into a crossbar structure [27], as shown in Figure 1. Each pair of a horizontal word-line (WL) and a vertical bit-line (BL) is connected across a memristor device. Given the input voltage vector v_i and the weight matrix W, which can be constructed by a preprogrammed crossbar array, the matrix multiplication result v_o can easily be obtained by measuring the current across the resistor R_S. By nature, the memristor crossbar array is attractive for matrix computations with a high degree of parallelism, achieving a time complexity of O(1). Based on this superior feature, memristor-based computing systems provide a promising solution to reduce the latency and improve the energy efficiency of neuromorphic computation.
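To make the analog computation concrete, the ideal crossbar can be modeled as a single matrix-vector product in which each bit-line current is the conductance-weighted sum of the word-line voltages. The sketch below is our own idealized illustration (it ignores the sensing resistor R_S, wire resistance, sneak paths, and ADC quantization; the function name and example values are ours):

```python
import numpy as np

def crossbar_mvm(G, v_in):
    """Ideal memristor crossbar: bit-line currents I = G^T @ v_in.

    G[r, c] is the conductance programmed between word-line r and bit-line c.
    All columns integrate their currents simultaneously, which is why the
    matrix-vector product is performed in effectively O(1) time."""
    return G.T @ v_in

# Conductances drawn from a 0.1-1 uS window (i.e., 1-10 MOhm resistances).
G = np.random.uniform(1e-7, 1e-6, size=(4, 3))
v_in = np.array([0.1, 0.2, 0.0, 0.3])   # word-line input voltages in volts
print(crossbar_mvm(G, v_in))            # three bit-line currents in amperes
```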
B. Hardware Imperfection of Memristor Crossbars and Mitigation Techniques

Hardware imperfections of memristors are mainly caused by the imperfect fabrication process or by the device features themselves. These significant issues cannot be ignored in hardware design, which makes it different from software-based system design.
1) State Drift:
A memristor device consists of a thin-film structure, and the film is divided into two regions: one region is highly doped with oxygen vacancies, and the other region is undoped. Applying an electric field across the device over time leads to the migration of oxygen vacancies and changes the memristance state, which is called state drift [28]. Thus, after a certain number of read operations, the resistance of the device will drift due to the accumulative effect of applying voltage in the same direction. As a result of state drift, imprecision is incurred when the memristor's state drifts to other state levels.
2) Process Variation:

Process variation also becomes prominent as process technology scales to the nanometer level. It mainly comes from line-edge roughness, oxide thickness fluctuations, and random dopant variations that affect memristor device performance [29]. Process variation causes non-ideal hardware behavior, which usually means accuracy degradation [30].

It can be observed that quantization of resistance values plays an important role in dealing with hardware imperfections. However, prior work on mitigating the effect of hardware imperfections is mainly ad hoc, lacking a systematic algorithm-hardware co-optimization framework to improve overall resilience. Our proposed framework mitigates the inaccuracy caused by hardware imperfections while achieving high hardware efficiency as well.

III. A UNIFIED AND SYSTEMATIC MEMRISTOR-BASED FRAMEWORK FOR DNNS

The memristor crossbar structure has shown promising features for neuromorphic computing systems compared to traditional CMOS technologies [22]. However, as DNNs go deeper and deeper, the massive weight computation and weight storage introduce severe challenges for neuromorphic computing system hardware implementations. To address these challenges, and to systematically handle the hardware imperfections of memristor crossbars, we propose an integrated memristor-based framework in this paper.
A. Unified and Systematic Memristor-Based Framework using ADMM

1) Connection to ADMM:
ADMM [31] is a powerful optimization tool that decomposes an original problem into two subproblems that can be solved separately and iteratively. Consider the optimization problem min_x f(x) + g(x). In ADMM, the problem is first rewritten as

$$\min_{x, z} \; f(x) + g(z), \quad \text{subject to } x = z. \tag{1}$$

Next, by using the augmented Lagrangian [31], the above problem is decomposed into two subproblems on x and z. The first is min_x f(x) + q(x), where q(x) is a quadratic function. As q(x) is convex, the complexity of solving subproblem 1 is the same as that of minimizing f(x). Subproblem 2 is min_z g(z) + q(z), where q(z) is again a quadratic function. The two subproblems are solved iteratively until convergence is achieved [32].
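For reference, a standard way to write the resulting updates is the scaled-form ADMM iteration from [31] (our paraphrase, not text from this paper; ρ > 0 is the penalty parameter and u the scaled dual variable). The two quadratic penalty terms below are the q(x) and q(z) referred to above:

$$
\begin{aligned}
x^{k+1} &:= \arg\min_{x} \; f(x) + \tfrac{\rho}{2}\,\|x - z^{k} + u^{k}\|_2^2, \\
z^{k+1} &:= \arg\min_{z} \; g(z) + \tfrac{\rho}{2}\,\|x^{k+1} - z + u^{k}\|_2^2, \\
u^{k+1} &:= u^{k} + x^{k+1} - z^{k+1}.
\end{aligned}
$$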
2) Unified Memristor-Based Framework:
There is a difficulty in using ADMM directly due to the non-convex nature of the objective function in DNN training, which leaves no guarantee of solution feasibility or solution quality. It becomes even more challenging when incorporating ADMM into training the memristor-based DNN model, where we need to consider hardware constraints such as crossbar block pruning, conductance range, and the mismatch between weight values and real devices. To overcome this challenge, we propose a unified memristor-based framework including memristor-based ADMM regularized optimization, masked mapping, and retraining steps, which guarantees solution feasibility (satisfying all constraints) and provides high solution quality (maintaining test accuracy) at the same time.

First, the memristor-based ADMM regularized optimization starts from a pre-trained DNN model without compression. Consider an N-layer DNN where the sets of weights and biases of the i-th (CONV or FC) layer are denoted by W_i and b_i, respectively, and the loss function of the N-layer DNN is denoted by f({W_i}_{i=1}^N, {b_i}_{i=1}^N). Combining the tasks of memristor-based structured pruning and weight quantization, the overall problem is defined as

$$\underset{\{W_i\},\{b_i\}}{\text{minimize}} \; f\big(\{W_i\}_{i=1}^{N}, \{b_i\}_{i=1}^{N}\big), \quad \text{subject to } W_i \in P_i,\; W_i \in Q_i,\; i = 1, \dots, N. \tag{2}$$

Given the value of α_i, the set P_i = {W_i | the number of non-zero structured weights in W_i is at most α_i} reflects the constraint for memristor-based structured weight pruning. Elements in P_i are solutions W_i in which the number of non-zero elements (after structured pruning and memristor crossbar mapping) is limited by α_i for layer i. Similarly, elements in Q_i are solutions W_i in which every element assumes one of the values q_{i,1}, q_{i,2}, ..., q_{i,M_i} (memristor state values), where M_i denotes the number of available quantization levels in layer i. Note that q_{i,j} indicates the j-th quantization level in layer i, and q_{i,j} ∈ [-cond_max, -cond_min] ∪ [cond_min, cond_max], where cond_min and cond_max are the minimum and maximum valid conductance values of the specified memristor device. More specifically, we use indicator functions to incorporate the memristor-based structured pruning and weight quantization constraints into the objective function:

$$g_i(W_i) = \begin{cases} 0 & \text{if } W_i \in P_i, \\ +\infty & \text{otherwise,} \end{cases} \qquad h_i(W_i) = \begin{cases} 0 & \text{if } W_i \in Q_i, \\ +\infty & \text{otherwise,} \end{cases}$$

for i = 1, ..., N. Then the original problem (2) can be equivalently rewritten as

$$\underset{\{W_i\},\{b_i\}}{\text{minimize}} \; f\big(\{W_i\}_{i=1}^{N}, \{b_i\}_{i=1}^{N}\big) + \sum_{i=1}^{N} g_i(W_i) + \sum_{i=1}^{N} h_i(W_i). \tag{3}$$

We introduce auxiliary variables Y_i and Z_i and dual variables U_i and V_i, then apply ADMM to decompose problem (3) into three subproblems, which are solved iteratively until convergence. In iteration k, the first subproblem is

$$\underset{\{W_i\},\{b_i\}}{\text{minimize}} \; f\big(\{W_i\}_{i=1}^{N}, \{b_i\}_{i=1}^{N}\big) + \sum_{i=1}^{N} \frac{\rho_i}{2}\,\big\|W_i - Y_i^{k} + U_i^{k}\big\|_F^2 + \sum_{i=1}^{N} \frac{\rho_i}{2}\,\big\|W_i - Z_i^{k} + V_i^{k}\big\|_F^2. \tag{4}$$

The first term in problem (4) is the differentiable (non-convex) loss function of the DNN, while the other quadratic terms are convex.
As a result, this subproblem can be solved by stochastic gradient descent (e.g., the Adam algorithm [33]), similar to training the original DNN. The solution {W_i} of subproblem 1 is denoted by {W_i^{k+1}}. We then derive {Y_i^{k+1}} and {Z_i^{k+1}} in subproblems 2 and 3. Thanks to the characteristics of the combinatorial constraints (the memristor-based structured pruning and weight quantization constraints), the optimal, analytical solutions of these two subproblems are Euclidean projections: keep the α_i structured weights with the largest magnitudes and set the remaining weights to zero, and quantize every weight element to the closest valid memristor state value. Finally, we update the dual variables U_i and V_i according to the ADMM rule [31], which completes the k-th iteration of memristor-based ADMM regularized optimization.
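To make the three-step decomposition concrete, the following PyTorch-style sketch shows one ADMM iteration for the constrained CONV layers. It is a minimal illustration under our own simplifications (function names such as admm_step, project_structured, and project_quantized are ours and not the paper's released code; filter-wise pruning stands in for the structured pattern, and the per-layer tensor of valid state values, levels, is assumed to be given):

```python
import torch

def project_structured(W, alpha):
    # Euclidean projection onto the pruning set P_i: keep the alpha filters
    # (rows in GEMM view) with the largest L2 norms, zero out the rest.
    W2d = W.reshape(W.shape[0], -1)
    keep = torch.topk(W2d.norm(dim=1), k=min(alpha, W2d.shape[0])).indices
    mask = torch.zeros(W2d.shape[0], device=W.device)
    mask[keep] = 1.0
    return (W2d * mask.unsqueeze(1)).reshape(W.shape)

def project_quantized(W, levels):
    # Euclidean projection onto the quantization set Q_i: snap every weight
    # to the closest valid memristor state value in `levels`.
    idx = torch.argmin((W.unsqueeze(-1) - levels).abs(), dim=-1)
    return levels[idx]

def admm_step(model, loss_fn, data, target, Y, Z, U, V, alpha, levels, rho, opt):
    # Subproblem 1: one SGD/Adam step on the loss plus the quadratic penalties of (4).
    opt.zero_grad()
    loss = loss_fn(model(data), target)
    for name, W in model.named_parameters():
        if name in Y:  # only the constrained CONV layers carry penalties
            loss = loss + rho / 2 * ((W - Y[name] + U[name]).norm() ** 2 +
                                     (W - Z[name] + V[name]).norm() ** 2)
    loss.backward()
    opt.step()
    # Subproblems 2 and 3 (Euclidean projections) and the dual updates.
    with torch.no_grad():
        for name, W in model.named_parameters():
            if name in Y:
                Y[name] = project_structured(W + U[name], alpha[name])
                Z[name] = project_quantized(W + V[name], levels[name])
                U[name] += W - Y[name]
                V[name] += W - Z[name]
```

In a typical ADMM training setup, the W-update is solved approximately over several mini-batch epochs before each projection and dual update, and ρ_i is a per-layer hyperparameter.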
Masked Mapping and Retraining: We first perform the Euclidean projection (mapping) on the derived W_i to guarantee that at most α_i values in each layer are non-zero. Since the zero weights will not be mapped onto the memristor crossbars, we can mask the zero weights and retrain the DNN with the non-zero weights using the training set. This retraining step is similar to the ADMM regularized optimization step, but only the memristor weight quantization constraints need to be satisfied. In this way, test accuracy can be partially restored.
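A rough sketch of this masking step, continuing the names from the previous sketch (again our own simplification, not the paper's code): pruned positions are frozen with a gradient mask during retraining, and the surviving weights are then snapped onto the valid memristor state values.

```python
# Masked mapping: project the trained weights so at most alpha_i filters survive,
# then freeze the pruned positions with a gradient mask during retraining.
with torch.no_grad():
    for name, W in model.named_parameters():
        if name in Y:
            W.copy_(project_structured(W, alpha[name]))

masks = {name: (W != 0).float() for name, W in model.named_parameters() if name in Y}

for name, W in model.named_parameters():
    if name in masks:
        W.register_hook(lambda g, m=masks[name]: g * m)  # pruned weights stay pruned

# ... run standard retraining epochs here; afterwards, enforce the quantization
# constraint on the surviving weights before mapping them onto the crossbars:
with torch.no_grad():
    for name, W in model.named_parameters():
        if name in masks:
            W.copy_(project_quantized(W, levels[name]) * masks[name])
```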
B. Memristor-Based Structured Pruning & Quantization

1) Memristor-Based Structured Weight Pruning:
In order to be hardware-friendly, we use a structured pruning method [11] instead of the irregular pruning method [10] to reduce the number of weight parameters. There are different types of structured sparsity, namely filter-wise, channel-wise, and shape-wise sparsity, as shown in Figure 2 (see also the short sketch after the figure). In the proposed framework, we incorporate structured pruning into the ADMM regularization while taking memristor features into account. Compared to [23], our proposed method better exploits the sparsity of the weight matrices with negligible accuracy degradation, resulting in larger area savings and lower power consumption.
Fig. 2: Illustration of filter-wise, channel-wise, and shape-wise structured sparsities.
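The three patterns can be written directly as index sets on the 4-D CONV weight tensor; the small sketch below (our own illustration with arbitrary indices) shows how each pattern becomes a zero row or zero column once the tensor is reshaped for GEMM:

```python
import torch

W = torch.randn(8, 4, 3, 3)        # CONV weights: (filters, channels, kh, kw)

# Filter-wise sparsity: an entire filter (one GEMM row) is zeroed.
W[2, :, :, :] = 0
# Channel-wise sparsity: one input channel is zeroed in every filter (kh*kw GEMM columns).
W[:, 1, :, :] = 0
# Shape-wise sparsity: the same kernel position is zeroed across all filters (one GEMM column).
W[:, 3, 0, 2] = 0

# In GEMM view (filters x (channels*kh*kw)) these become zero rows/columns.
gemm = W.reshape(W.shape[0], -1)
print((gemm.abs().sum(dim=1) == 0).sum().item(), "zero rows")
print((gemm.abs().sum(dim=0) == 0).sum().item(), "zero columns")
```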
To better illustrate how structured pruning saves memristor crossbars, we transform the weight tensors of a CONV layer into general matrix multiplication (GEMM) format [34]. As shown in Figure 3(a) (GEMM view), structured pruning corresponds to reducing rows or columns. The three structured sparsities, along with their combinations, reduce the weight dimensions in GEMM while maintaining a full matrix: indices are not needed, and weight quantization is better supported. Figure 3(b) shows the memristor implementation size view and the memristor crossbar area reduction for the different types of sparsity. By applying filter (row) pruning and shape/channel (column) pruning, as shown at the top of Figure 3(b), either blocks of a memristor crossbar or entire memristor crossbars can be saved compared to the original design.

Fig. 3: Structured weight pruning and reduction of hardware resources.

Figure 4 shows how we map the weight parameters onto the memristor crossbars. As shown in the GEMM view of Figure 3(a), assume a CONV layer has n filters and m channels (comprising k columns of weights in total), denoted as W ∈ R^{n×k}. Generally, the size of a single memristor crossbar is limited, because reading and writing errors increase with larger crossbar sizes [35]. Thus, multiple memristor crossbars are used to accommodate a large weight matrix. To maintain accuracy, the single memristor crossbar size in our design is no larger than 128×64 [36] and is identical for all DNN layers. As shown in Figure 4, each crossbar has i rows and j columns. We use X and f to represent the inputs and filters, where c represents a column of weights as shown in Figure 3(a). Since there are k weights in a filter, we need to use the columns at the same position from at least k/i different crossbars to store one filter's weights. Therefore, j filters can be fully mapped onto those crossbars as one block, as shown in Figure 4. There are n filters in total, so we need at least p = n/j blocks to fully map the whole weight matrix W ∈ R^{n×k}. Within each block, the output of each crossbar is passed through an ADC, and the intermediate results of all crossbars are then summed column-wise.

Fig. 4: Weight mapping onto memristor crossbars.
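As a back-of-the-envelope illustration of this mapping (our own sketch: the function name and the example layer size are ours, the 128×64 crossbar is assumed to have 128 rows and 64 columns, and the positive/negative and multi-bit crossbar duplication discussed later is ignored):

```python
import math

def crossbars_needed(n_filters, in_channels, kernel_h, kernel_w,
                     xbar_rows=128, xbar_cols=64):
    """Count the crossbars needed to map one CONV layer in GEMM view.

    The GEMM weight matrix is n x k with n = n_filters and
    k = in_channels * kernel_h * kernel_w. One filter occupies one column
    position replicated across ceil(k / xbar_rows) vertically stacked
    crossbars, and each block of stacked crossbars holds xbar_cols filters.
    """
    k = in_channels * kernel_h * kernel_w
    xbars_per_block = math.ceil(k / xbar_rows)   # crossbars stacked per block
    blocks = math.ceil(n_filters / xbar_cols)    # groups of xbar_cols filters
    return blocks * xbars_per_block

# Hypothetical example: a 3x3 CONV layer with 256 input and 512 output channels.
# k = 256*3*3 = 2304 -> 18 crossbars per block; 512/64 = 8 blocks -> 144 crossbars.
print(crossbars_needed(512, 256, 3, 3))
```

Structured pruning shrinks n (filter pruning) or k (shape/channel pruning) before this mapping, which is exactly how whole crossbars or crossbar blocks are saved.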
2) Memristor-Based Weight Quantization:
The weights of the DNN are represented by the conductance of the memristors in the crossbars, and the output of a memristor crossbar is obtained by measuring the accumulated current. Due to the limited conductance range of memristor devices, weight values exceeding the conductance range cannot be represented precisely. On the other hand, even within the conductance range, accuracy loss exists because of the mismatch between weight values and real memristor devices.

To mitigate the limitation of the conductance range, we incorporate the conductance range constraint of the memristor device (i.e., Q_i ⊆ [cond_min, cond_max]) into DNN training. To mitigate the accuracy degradation caused by weight mismatch, we incorporate the constraint of conductance state levels (i.e., q_{i,1}, q_{i,2}, ..., q_{i,M_i} ∈ [cond_min, cond_max]) into DNN training. Here the set Q_i = {W_i | the value of every element is one of q_1, q_2, ..., q_M} represents the constraint, where q_1, q_2, ..., q_M are all available quantized states. Theoretically, the conductance of a memristor can be set to any state within its available range. In reality, the memristor conductance states are limited by the resolution that the peripheral write and read circuitry can provide. Generally speaking, more state levels require more sophisticated peripheral circuitry. To reduce the overhead caused by the peripheral circuitry and to ensure the robustness of the whole system, the conductance range is quantized into several distinct state levels and represented by discrete states.
Fig. 5: Multi-level memristor storing a 3-bit weight.
Figure 5 illustrates an example of an 8-level (3-bit) memristor with linear conductance levels (the levels may behave non-linearly in real designs [37]). The distribution curve shows the possible range that the memristor state may actually be set to when the target state of a write is q_2. An error (hardware imprecision) occurs when the actual written state differs from the target state level. In order to minimize the error caused by hardware imprecision, in our constraint on conductance state levels we set the quantized values to the mean value of each state level. To optimize overall performance, the number of memristor state levels is also considered in our proposed framework. By quantizing the weights to fewer bits while maintaining overall accuracy, we can further improve performance, since fewer state levels leave a larger margin for each state, resulting in better error resilience and less hardware imprecision.

Another advantage is that the design area and power consumption can be reduced by quantizing the weights to fewer bits. According to state-of-the-art neuromorphic computing designs using memristors, a practical assumption is that a memristor cell can represent 16 weight levels (4-bit weights) [38]. To ensure relatively high accuracy, two (or more) memristors are usually bundled to represent weights with higher resolution (more bits) [39]. On the other hand, since a memristor device only has positive conductance values while weights can be positive or negative, we use different memristor crossbars to represent the positive and negative weights separately. As illustrated in Figure 6, a 9-bit weight value can be represented using an 8-bit positive block and an 8-bit negative block, where each 8-bit block is formed by two 4-bit memristor crossbars. In general, the cost of the ADCs and other peripheral circuits grows exponentially with every extra bit of precision. Thus, the overhead of the peripheral circuitry can be significantly reduced by quantizing the weights to fewer bits, and the total design area and power consumption are reduced as well.

Figure 7 shows the weight distribution of a CONV layer in ResNet-18 on the CIFAR-10 dataset before and after quantization, following structured weight pruning. With 5-bit quantization using our proposed method, the weights are quantized into 32 different levels within the memristor's valid conductance range.
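As a concrete illustration of how such a zero-symmetric, non-zero level set can be built for one layer (our own sketch: uniform level spacing is assumed, the per-layer weight scale w_max is a hypothetical value, and the ratio 0.1 mirrors a cond_min/cond_max ratio such as the 1 MΩ / 10 MΩ device model used in our experiments):

```python
import torch

def memristor_levels(bits, w_max, w_min_ratio=0.1):
    """Zero-symmetric quantization levels for one layer, excluding zero.

    2**bits levels are split evenly between [-w_max, -w_min] and
    [w_min, w_max], mirroring the valid conductance window
    [cond_min, cond_max] on the positive and negative crossbars.
    """
    w_min = w_min_ratio * w_max
    per_sign = 2 ** bits // 2
    pos = torch.linspace(w_min, w_max, per_sign)
    return torch.cat([-pos.flip(0), pos])

levels = memristor_levels(bits=5, w_max=0.25)   # 32 levels for 5-bit weights
# project_quantized(W, levels) from the earlier sketch snaps weights onto them.
```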
IV. EXPERIMENTAL RESULTS

In this section, we evaluate our systematic structured weight pruning and weight quantization framework on the MNIST dataset using the LeNet-5 network and on the CIFAR-10 dataset using ConvNet (4 CONV layers and 1 FC layer), VGG-16, and ResNet-18. All models are implemented with the PyTorch API and are oriented to match the memristor's physical characteristics. Hardware performance results, such as the power consumption and area cost of the memristor devices and their peripheral circuits, are simulated using NVSim [40] and our MATLAB model. In our memristor model, R_min = 1 MΩ and R_max = 10 MΩ with 4-bit precision, and the peripheral circuits use 45nm technology. We use a 128×64 crossbar size for ResNet-18 and VGG-16, and a 32×32 crossbar size for ConvNet and LeNet-5. The experiments are run on a server with eight NVIDIA GTX-2080Ti GPUs.
Fig. 6: Representing weights using multiple memristor crossbars.

TABLE I: Structured weight pruning results on multi-layer networks on the MNIST and CIFAR-10 datasets (*calculation is based on bolded results). Columns 2-5 give the structured weight pruning statistics (9-bit); the last three columns give accuracy under 7-bit, 6-bit, and 5-bit quantization.

Method | Original Accuracy | Pruned Accuracy | Crossbar Area Saved | Compression Ratio | 7-bit | 6-bit | 5-bit
MNIST: Group Scissor [23] | 99.15% | 99.14% | 75.94% | 4.16× | - | - | -
MNIST: our LeNet-5 | 99.17% | 99.15% | 94.34% | 17.69× | | |
CIFAR-10: Group Scissor [23] | 82.01% | 82.09% | 57.45% | 2.35× | - | - | -
CIFAR-10: our ConvNet | 84.41% | | | | | |
CIFAR-10: our VGG-16 | 93.70% | | | | | |
*numbers of parameters reduced on ConvNet: , VGG-16: , ResNet-18:

Fig. 7: Distribution of the weights: (a) before quantization, (b) after 5-bit quantization.
In this work, multiple 9-bit non-pruned models on different networks are used as our original DNN models. The structured weight pruning results show that, on the memristor-based LeNet-5 model, we achieve 17.69× weight reduction without accuracy loss, 37.06× weight reduction with negligible accuracy loss, and 105.52× weight reduction within 1% accuracy loss, while shrinking the memristor crossbar area by more than 94%. On a multi-layer CNN for CIFAR-10, we achieve higher accuracy compared with [23]. On deeper network structures such as VGG-16 and ResNet-18, we compress the models by 29.81× with negligible accuracy loss and by 20.88× within 1% accuracy loss, respectively. We save more crossbar area than [23], reducing the crossbar area by 96.65% for VGG-16 and 95.21% for ResNet-18. These experimental results illustrate the great potential of incorporating ADMM into structured weight pruning and quantization for memristor-based DNN design, which tremendously reduces area and power consumption.

A. Experimental Results on Structured Weight Pruning
In our experiments, we compare our proposed framework with Group Scissor [23], as shown in Table I. Note that we only prune CONV layers because they perform most of the FLOPs in network computation. On the MNIST dataset, our original CNN model achieves 99.17% accuracy, and 99.15% accuracy with structured weight pruning. We also reduce the model size using an extreme pruning configuration; the results show that our method reaches 98.33% accuracy when the model is compressed by 105.52×.

On the CIFAR-10 dataset, we construct different networks to test our method. Compared with Group Scissor [23], we not only achieve higher test accuracy at the same compression ratio (2.35×), but also maintain the same accuracy at an even higher compression ratio (2.93×). For deeper network structures like VGG-16 and ResNet-18, we introduce such highly regular sparsity into the networks without accuracy degradation. Our framework reduces the weight volume by 13.98M and 10.46M for VGG-16 and ResNet-18, respectively.

B. Experimental Results on Weight Quantization for Memristor Crossbar Mapping
From the discussion in Section III-B.2, we can see that fewer bits reduce both the power and the memristor crossbar area. However, quantizing weights to specific values can cause non-negligible accuracy degradation. In this paper, to mitigate accuracy degradation, we adopt ADMM to dynamically optimize well-leveled groups of weights that can actually be mapped onto the memristors. By including the memristor characteristics discussed in Section III-B.2, our quantization process does not map weights to zero, and our state levels are zero-symmetric. Table I shows different configurations for weight quantization, and Figure 8 shows the power and area. Experimental results demonstrate that our framework maintains a high weight pruning ratio and few bits with promising test accuracy. According to the 6-bit quantization results in Table I, there is only 0.1% accuracy degradation after quantizing the 17.69× LeNet-5 model, and only 0.2% accuracy degradation after quantizing the 105.52× model. For a larger dataset such as CIFAR-10, the shallow ConvNet incurs around 1.0% accuracy degradation for our designed configuration (2.35×) and 2.0% degradation on a 5.88× compressed model. As the network structure gets deeper, the accuracy drops only around 0.1% for a 9.31× compressed VGG-16 model and 0.5% for an 11.75× compressed ResNet-18 model; as the compression ratio gets larger, accuracy drops 0.8% for a 29.81× compressed VGG-16 model and 0.6% for a 20.88× compressed ResNet-18 model.

Fig. 8: Total power and area reduction on compressed models using different quantization bits and networks.

As shown in Figure 8, fewer-bit representation results in less power consumption and a smaller area footprint, because the overhead of peripheral circuits such as ADCs and DACs decreases significantly when the computing precision is lowered. There is a tremendous power and area reduction with 5-bit quantization, since all memristor crossbars for higher-bit representations are no longer needed. Besides the power and area reduction, fewer-bit representation mitigates hardware imperfections of the memristor, including state drift and process variations. Compared to the original DNN models, our 5-bit quantization models achieve the largest power (area) reductions of 96.95% (97.46%), 98.38% (98.28%), 95.91% (89.74%), and 96.96% (93.97%) on ResNet-18, VGG-16, ConvNet, and LeNet-5, respectively, among the different bit representations.

V. CONCLUSION
In this paper, we propose a unified and systematic memristor-based framework with both structured weight pruning and weight quantization by incorporating ADMM into DNN training. Three main steps are incorporated in our framework: memristor-based ADMM regularized optimization, masked mapping, and retraining. We evaluate our proposed framework on different networks, and for each network, several pruning and quantization scenarios are tested. On LeNet-5 and ConvNet, we easily achieve better results than Group Scissor. On VGG-16 and ResNet-18, structured weight pruning and quantization with a 5-bit weight representation achieve a significant weight compression ratio together with large power and area reductions, with only negligible accuracy loss compared to the original DNN models.

ACKNOWLEDGMENT
This work is funded by National Science Foundation CCF-1637559. We thank all anonymous reviewers for their feedback.
REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[2] C. Ding, S. Liao, Y. Wang, Z. Li et al., "CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2017, pp. 395-408.
[3] Y. Wang, C. Ding et al., "Towards ultra-high performance and energy efficiency of deep learning systems: an algorithm-hardware co-optimization framework," AAAI 2018, Feb 2018.
[4] C. Ding, A. Ren et al., "Structured weight matrices-based hardware accelerators in deep neural networks," in Proceedings of GLSVLSI '18, 2018.
[5] X. Ma, Y. Zhang et al., "An area and energy efficient design of domain-wall memory-based deep convolutional neural networks using stochastic computing," Mar 2018.
[6] A. Shrestha, H. Fang, Q. Wu, and Q. Qiu, "Approximating back-propagation for a biologically plausible local learning rule in spiking neural networks," in ICONS, 2019 (in press).
[7] H. Fang, A. Shrestha, D. Ma, and Q. Qiu, "Scalable NoC-based neuromorphic hardware learning and inference," July 2018.
[8] H. Fang, A. Shrestha, Z. Zhao, Y. Wang, and Q. Qiu, "A general framework to map neural networks onto neuromorphic processor," March 2019, pp. 20-25.
[9] H. Li, N. Liu et al., "ADMM-based weight pruning for real-time deep learning acceleration on mobile devices," in Proceedings of GLSVLSI. ACM, 2019.
[10] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in NeurIPS, 2015.
[11] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in NeurIPS, 2016, pp. 2074-2082.
[12] T. Zhang, K. Zhang et al., "ADAM-ADMM: A unified, systematic framework of structured weight pruning for DNNs," arXiv preprint arXiv:1807.11091, 2018.
[13] X. Ma, G. Yuan et al., "ResNet can be pruned 60x: Introducing network purification and unused path removal (P-RM) after weight pruning," arXiv preprint arXiv:1905.00136, 2019.
[14] S. Ye, X. Feng et al., "Progressive DNN compression: A key to achieve ultra-high weight pruning and quantization rates using ADMM," arXiv preprint arXiv:1903.09769, 2019.
[15] W. Niu, X. Ma, Y. Wang, and B. Ren, "26ms inference time for ResNet-50: Towards real-time execution of all DNNs on smartphone," arXiv preprint arXiv:1905.00571, 2019.
[16] E. Park, J. Ahn, and S. Yoo, "Weighted-entropy-based quantization for deep neural networks," in CVPR, 2017.
[17] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," in CVPR, 2016.
[18] S. Lin, X. Ma et al., "Toward extremely low bit and lossless accuracy in DNNs with progressive ADMM," arXiv preprint arXiv:1905.00789, 2019.
[19] M. M. Waldrop, "The chips are down for Moore's law," Nature News, vol. 530, no. 7589, 2016.
[20] L. Chua, "Memristor-the missing circuit element," IEEE Transactions on Circuit Theory, vol. 18, no. 5, pp. 507-519, 1971.
[21] G. Yuan, C. Ding et al., "Memristor crossbar-based ultra-efficient next-generation baseband processors," in MWSCAS. IEEE, Aug 2017.
[22] A. Ankit, A. Sengupta, and K. Roy, "TraNNsformer: Neural network transformation for memristive crossbar based neuromorphic system design," in Proceedings of the 36th International Conference on Computer-Aided Design. IEEE Press, 2017, pp. 533-540.
[23] Y. Wang, W. Wen, B. Liu, D. Chiarulli, and H. Li, "Group Scissor: Scaling neuromorphic computing design to large neural networks," in DAC. IEEE, 2017.
[24] Y. Zhang, N. I. Mou, P. Pai, and M. Tabib-Azar, "Quantized current conduction in memristors and its physical model," in SENSORS, 2014 IEEE. IEEE, 2014, pp. 819-822.
[25] C. Song, B. Liu, W. Wen, H. Li, and Y. Chen, "A quantization-aware regularized learning method in multilevel memristor-based neuromorphic computing system," IEEE, 2017.
[26] A. G. Radwan, M. A. Zidan, and K. N. Salama, "HP memristor mathematical model for periodic signals and DC," Aug 2010.
[27] M. Hu, H. Li, Q. Wu, and G. S. Rose, "Hardware realization of BSB recall function using memristor crossbar arrays," in DAC Design Automation Conference 2012, 2012, pp. 498-503.
[28] J. J. Yang, M. D. Pickett, X. Li, D. A. Ohlberg, D. R. Stewart, and R. S. Williams, "Memristive switching mechanism for metal/oxide/metal nanodevices," Nature Nanotechnology, 2008.
[29] S. Kaya, A. R. Brown, A. Asenov, D. Magot, T. Linton, and C. Tsamis, "Analysis of statistical fluctuations due to line edge roughness in sub-0.1 µm MOSFETs," in Simulation of Semiconductor Processes and Devices 2001. Springer Vienna, 2001, pp. 78-81.
[30] S. Pi et al., "Cross point arrays of 8 nm x 8 nm memristive devices fabricated with nanoimprint lithography," Journal of Vacuum Science & Technology B: Microelectronics and Nanometer Structures, 2013.
[31] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, 2011.
[32] H. Ouyang, N. He, L. Tran, and A. Gray, "Stochastic alternating direction method of multipliers," in ICML, 2013, pp. 80-88.
[33] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[34] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient primitives for deep learning," arXiv preprint arXiv:1410.0759, 2014.
[35] M. Hu, J. P. Strachan, Z. Li, R. S. Williams et al., "Dot-product engine as computing memory to accelerate machine learning algorithms," IEEE, Mar 2016, pp. 374-379.
[36] M. Hu, C. E. Graves et al., "Memristor-based analog computation and neural network classification with a dot product engine," Advanced Materials, 2018.
[37] J. Lin, L. Xia, Z. Zhu, H. Sun, Y. Cai, H. Gao, M. Cheng, X. Chen, Y. Wang, and H. Yang, "Rescuing memristor-based computing with non-linear resistance levels," in DATE 2018, 2018.
[38] M. Courbariaux, J. P. David, and Y. Bengio, "Training deep neural networks with low precision multiplications," in ICLR, 2015.
[39] P. Chi, S. Li, C. Xu, T. Zhang et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," 2016.
[40] X. Dong, C. Xu et al., "NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory."