Tiny but Accurate: A Pruned, Quantized and Optimized Memristor Crossbar Framework for Ultra Efficient DNN Implementation
Xiaolong Ma†, Geng Yuan†, Sheng Lin, Caiwen Ding, Fuxun Yu, Tao Liu, Wujie Wen, Xiang Chen, Yanzhi Wang
Northeastern University, University of Connecticut, George Mason University, Florida International University
E-mail: {ma.xiaol, yuan.geng, lin.sheng}@husky.neu.edu, [email protected], [email protected], {fyu2, xchen26}@gmu.edu, {tliu023, wwen}@fiu.edu
† These authors contributed equally.

Abstract— State-of-the-art DNN structures involve intensive computation and high memory storage. To mitigate these challenges, the memristor crossbar array has emerged as an intrinsically suitable matrix-computation and low-power acceleration framework for DNN applications. However, a high-accuracy solution for extreme model compression on the memristor crossbar array architecture is still missing. In this paper, we propose a memristor-based DNN framework which combines both structured weight pruning and quantization by incorporating the alternating direction method of multipliers (ADMM) algorithm for better pruning and quantization performance. We also discover the non-optimality of the ADMM solution in weight pruning and the unused data paths in a structured pruned model. Motivated by these discoveries, we design a software-hardware co-optimization framework which contains the first proposed Network Purification and Unused Path Removal algorithms targeting post-processing of a structured pruned model after the ADMM steps. By taking memristor hardware constraints into our whole framework, we achieve an extremely high compression ratio on state-of-the-art neural network structures with minimum accuracy loss. When quantizing a structured pruned model, our framework achieves nearly no accuracy loss after quantizing weights to an 8-bit memristor weight representation. We share our models at the anonymous link https://bit.ly/2VnMUy0.

Structured weight pruning [1–3] and weight quantization [4–6] techniques have been developed to facilitate weight compression and computation acceleration in response to the high demand for parallel computation and storage resources [7–9]. Even though models are compressed, computational complexity still burdens the overall performance of state-of-the-art CMOS hardware implementations. To mitigate the bottleneck caused by CMOS-based DNN architectures, next-generation device/circuit technologies [10, 11] triumph over CMOS in their non-volatility, high energy efficiency, in-memory computing capability and high scalability. The memristor crossbar device has shown its potential for bearing all these characteristics, which makes it intrinsically suitable for large DNN hardware architecture design. A memristor crossbar device can perform matrix-vector multiplication in the analog domain, and the computation has O(1) time complexity [12, 13]. Motivated by the fact that there is no precedent model that is structured pruned and quantized while also satisfying memristor hardware constraints, in this work a memristor-based ADMM regularized optimization method is utilized for both structured pruning and weight quantization in order to mitigate accuracy degradation during extreme model compression. A structured pruned model can potentially benefit high-parallelism implementation in the crossbar architecture. Furthermore, quantized weights can reduce hardware imprecision during read/write procedures and save more hardware footprint, since fewer peripheral circuits are needed to support fewer bits.

However, to achieve an ultra-high compression ratio, an ADMM pruning method [3, 14] cannot fully exploit all the redundancy in a neural network model.
As a result, we design a hardware-software co-optimization framework in which we investigate Network Purification and Unused Path Removal after the procedure of structured weight pruning with ADMM. Moreover, we utilize distilled knowledge from a software model to guide our memristor hardware-constrained quantization. To the best of our knowledge, we are the first to combine extreme structured weight pruning and weight quantization in a unified and systematic memristor-based framework. We are also the first to discover the redundant weights and unused paths in a structured pruned DNN model and to design a co-optimization framework that boosts the model compression rate while maintaining high network accuracy. By incorporating memristor hardware constraints in our model, our framework is guaranteed feasible for a real memristor crossbar device. The contributions of this paper include:
• We adopt ADMM for efficiently optimizing the non-convex problem and utilize this method for structured weight pruning.
• We systematically investigate weight quantization on a pruned model with memristor hardware constraints.
• We design a software-hardware co-optimization framework in which Network Purification and Unused Path Removal are first proposed.
We evaluate our proposed memristor framework on different networks. We conclude that the structured pruning method with memristor-based ADMM regularized optimization achieves a high compression ratio and desirably high accuracy. Software and hardware experimental results show that our memristor framework is very energy efficient and saves a great amount of hardware footprint.

Heuristic weight pruning methods [15] are widely used in neuromorphic computing designs to reduce weight storage and computing delay [16]. [16] implemented weight pruning techniques on a neuromorphic computing system using irregular pruning, which caused an unbalanced workload, greater circuit overheads and extra memory requirements for indices. To overcome these limitations, [17] proposed group connection deletion, which structurally prunes connections to reduce routing congestion between memristor crossbar arrays. Weight quantization can mitigate hardware imperfections of memristors, including state drift and process variations, caused by the imperfect fabrication process or by the device features themselves [4, 5]. [18] presented a technique to reduce the overhead of digital-to-analog converters (DACs) / analog-to-digital converters (ADCs) in resistive random-access memory (ReRAM) neuromorphic computing systems. They first normalized the data, and then quantized intermediary data to 1-bit values. This can be directly used as the analog input for the ReRAM crossbar and hence avoids the need for DACs.
A memristor crossbar [10] is an array structure consisting of memristors, horizontal word-lines and vertical bit-lines, as shown in Figure 1. Due to their outstanding performance in computing matrix-vector multiplications (MVM), memristor crossbars are widely used as dot-product accelerators in recent neuromorphic computing designs [19]. By programming the conductance state (also known as the "memristance") of each memristor, the weight matrix W can be mapped onto the memristor crossbar. Given the input voltage vector V_i, the MVM output current vector I_j can be obtained in O(1) time complexity.

Different from software-based designs, hardware imperfection is one of the key issues that causes non-ideal hardware behavior and needs to be considered in memristor-based designs. The hardware imperfections of memristor devices mainly come from the imperfect fabrication process and the memristor features themselves.
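To make the ideal crossbar computation above concrete, the following minimal NumPy sketch maps a weight matrix onto conductance values and recovers the matrix-vector product from the bit-line currents. The memristance range and the differential positive/negative pair scheme are illustrative assumptions for this sketch, not the exact circuit of our design, and device non-idealities are ignored.

```python
import numpy as np

# Illustrative memristance range (assumption): 1 MOhm .. 10 MOhm,
# i.e. conductance between 0.1 uS and 1.0 uS.
G_MIN, G_MAX = 1.0 / 10e6, 1.0 / 1e6

def weights_to_conductance(W):
    """Map a real-valued weight matrix onto two crossbars (positive /
    negative parts), scaling |w| linearly into [G_MIN, G_MAX]."""
    scale = (G_MAX - G_MIN) / np.abs(W).max()
    G_pos = np.where(W > 0, G_MIN + W * scale, G_MIN)
    G_neg = np.where(W < 0, G_MIN - W * scale, G_MIN)
    return G_pos, G_neg, scale

def crossbar_mvm(G_pos, G_neg, scale, v_in):
    """Ideal crossbar: bit-line currents are I = G^T V (Ohm's and
    Kirchhoff's laws); the differential pair recovers the signed product."""
    i_pos = G_pos.T @ v_in
    i_neg = G_neg.T @ v_in
    return (i_pos - i_neg) / scale  # back to the weight domain

W = np.random.randn(4, 3)   # 4 inputs (word-lines) x 3 outputs (bit-lines)
v = np.random.rand(4)       # input voltage vector
G_pos, G_neg, s = weights_to_conductance(W)
print(np.allclose(crossbar_mvm(G_pos, G_neg, s, v), W.T @ v))  # True for an ideal device
```

For an ideal device the recovered product matches W^T V exactly; the imperfections discussed next perturb the programmed conductances and hence the result.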
Process Variation.
Process variation is one major hardware imperfection caused by fluctuations in the fabrication process. It mainly comes from line-edge roughness, oxide thickness fluctuations, and random dopant variations [20]. Inevitably, process variation plays an increasingly significant role as process technology scales down to the nanometer level. In a DNN hardware design, the non-ideal behaviors caused by process variations may lead to accuracy degradation.
State Drift.
State drift is the phenomenon that the memristance changes after several reading operations [21]. A memristor is a thin-film device constructed from a region highly doped with oxygen vacancies and an undoped region. By nature, applying an electric field across the memristor over a period of time causes the oxygen vacancies to migrate along the direction of the electric field, which leads to the (memristance) state drift. Consequently, an error occurs when the state of a memristor drifts to another state level. It has been shown that applying quantization in memristor-based designs can mitigate the undesired impacts caused by hardware imperfections [22].

Figure 1: Memristor and memristor crossbar.
The memristor crossbar structure has shown its potential for neuromorphic computing systems compared to CMOS technologies [16]. Due to the great number of weights and computations involved in networks, an efficient and high-performance framework is needed to conquer the memory storage and energy consumption problems. We propose a unified memristor-based framework including memristor-based ADMM regularized optimization and masked mapping.

ADMM [23] is an advanced optimization technique which decomposes an original problem into subproblems that can be solved separately and iteratively. By adopting memristor-based ADMM regularized optimization, the framework can guarantee solution feasibility (satisfying memristor hardware constraints) while providing high solution quality (no obvious accuracy degradation after pruning).

The memristor-based ADMM regularized optimization starts from a pre-trained full-size DNN model without compression. Consider an N-layer DNN in which the set of weights of the i-th (CONV or FC) layer is denoted by W_i and the loss function associated with the DNN is denoted by f({W_i}_{i=1}^N). The overall problem is defined as
\[
\underset{\{W_i\}}{\text{minimize}} \; f\big(\{W_i\}_{i=1}^{N}\big), \quad \text{subject to } W_i \in P_i,\; W_i \in Q_i,\; i = 1,\dots,N. \tag{1}
\]
Given the value of α_i, the memristor-based constraint sets are P_i = {W_i | Σ(structured W_i ≠ 0) ≤ α_i} and Q_i = {the weights in the i-th layer are mapped to the quantization values}, where α_i is a predefined hyper-parameter. The general constraint can be extended to structured pruning such as filter pruning, channel pruning and column pruning, which facilitates high-parallelism implementation in hardware.

Figure 2: Illustration of filter-wise, channel-wise and shape-wise structured sparsities.

Similarly, for weight quantization, the elements in Q_i are the solutions of W_i. Assume the set q_{i,1}, q_{i,2}, ..., q_{i,M_i} is the set of available memristor state values, i.e., the elements in W_i, where M_i denotes the number of available quantization levels in layer i. Suppose q_{i,j} indicates the j-th quantization level in layer i, which gives
\[
q_{i,j} \in [-\mathrm{memr}_{\max}, -\mathrm{memr}_{\min}] \cup [\mathrm{memr}_{\min}, \mathrm{memr}_{\max}], \tag{2}
\]
where memr_min and memr_max are the minimum and maximum memristance values of a specified memristor device.

Corresponding to every memristor-based constraint set P_i and Q_i, indicator functions are utilized to incorporate P_i and Q_i into the objective function:
\[
g_i(W_i) = \begin{cases} 0 & W_i \in P_i, \\ +\infty & \text{otherwise,} \end{cases}
\qquad
h_i(W_i) = \begin{cases} 0 & W_i \in Q_i, \\ +\infty & \text{otherwise,} \end{cases}
\]
for i = 1, ..., N. Substituting into (1), we get
\[
\underset{\{W_i\}}{\text{minimize}} \; f\big(\{W_i\}_{i=1}^{N}\big) + \sum_{i=1}^{N} g_i(Y_i) + \sum_{i=1}^{N} h_i(Z_i), \quad \text{subject to } W_i = Y_i = Z_i,\; i = 1,\dots,N. \tag{3}
\]
We incorporate auxiliary variables Y_i and Z_i and dual variables U_i and V_i, and the augmented Lagrangian formation L_ρ{·} of problem (3) is
\[
\underset{\{W_i\}}{\text{minimize}} \; f\big(\{W_i\}_{i=1}^{N}\big) + \sum_{i=1}^{N} \frac{\rho_i}{2}\, \|W_i - Y_i + U_i\|_F^2 + \sum_{i=1}^{N} \frac{\rho_i}{2}\, \|W_i - Z_i + V_i\|_F^2. \tag{4}
\]
The first term in problem (4) is the original DNN loss function, and the second and third terms are differentiable and convex.
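Because the indicator terms are absorbed into the auxiliary variables, the W-step of problem (4) is just the normal training loss plus two quadratic penalties, so it can be minimized with any stochastic optimizer. The following PyTorch sketch shows that augmented loss; the per-layer dictionaries, the single shared ρ value, and the surrounding training loop are placeholders rather than the exact implementation used in this work.

```python
import torch

def admm_augmented_loss(task_loss, weights, Y, Z, U, V, rho):
    """Augmented Lagrangian of Eq. (4): the DNN task loss plus, per layer,
    (rho/2)*||W - Y + U||_F^2 and (rho/2)*||W - Z + V||_F^2."""
    loss = task_loss
    for name, W in weights.items():
        loss = loss + 0.5 * rho * torch.norm(W - Y[name] + U[name]) ** 2
        loss = loss + 0.5 * rho * torch.norm(W - Z[name] + V[name]) ** 2
    return loss

# Usage inside an ordinary training step (model, criterion, optimizer and the
# Y/Z/U/V dictionaries are assumed to be defined elsewhere):
#   out = model(x)
#   loss = admm_augmented_loss(criterion(out, target),
#                              dict(model.named_parameters()), Y, Z, U, V, rho=1e-3)
#   loss.backward(); optimizer.step()
```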
As a result, subproblem (4) can be solved by stochastic gradient descent [24], just like the original DNN training.

The standard ADMM algorithm [23] proceeds by repeating, for k = 0, 1, ..., the following subproblem iterations:
\[
\{W_i^{k+1}\} := \underset{\{W_i\}}{\text{minimize}} \; L_\rho\big(\{W_i\},\{Y_i^k\},\{U_i^k\}\big) + L_\rho\big(\{W_i\},\{Z_i^k\},\{V_i^k\}\big), \tag{5}
\]
\[
\{Y_i^{k+1}\},\{Z_i^{k+1}\} := \underset{\{Y_i\},\{Z_i\}}{\text{minimize}} \; L_\rho\big(\{W_i^{k+1}\},\{Y_i\},\{U_i^k\}\big) + L_\rho\big(\{W_i^{k+1}\},\{Z_i\},\{V_i^k\}\big), \tag{6}
\]
\[
U_i^{k+1} := U_i^{k} + W_i^{k+1} - Y_i^{k+1}; \qquad V_i^{k+1} := V_i^{k} + W_i^{k+1} - Z_i^{k+1}, \tag{7}
\]
where (5) is the proximal step, (6) is the projection step and (7) is the dual variable update. The optimal solution of (6) is the Euclidean projection (masked mapping) of W_i^{k+1} + U_i^k and W_i^{k+1} + V_i^k onto P_i and Q_i, respectively. Namely, elements of the solution that fall below the threshold determined by α_i are set to zero, while the kept elements are quantized to the closest valid memristor state value.
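The projection step (6), i.e., the masked mapping, has a closed form for each layer. A minimal NumPy sketch is given below, assuming column pruning and interpreting the constraint P_i as keeping the α_i columns with the largest norms; the state-level list passed for Q_i is a placeholder rather than the memristance levels of a particular device.

```python
import numpy as np

def project_column_sparse(W, alpha):
    """Euclidean projection onto P_i for column pruning: keep the alpha
    columns of the GEMM-view matrix with the largest L2 norms, zero the rest."""
    col_norms = np.linalg.norm(W, axis=0)
    keep = np.argsort(col_norms)[-alpha:]
    Y = np.zeros_like(W)
    Y[:, keep] = W[:, keep]
    return Y

def project_quantized(W, levels):
    """Euclidean projection onto Q_i: snap every weight to the nearest
    available state value (the level set may exclude a band around zero, as in Eq. (2))."""
    levels = np.asarray(sorted(levels))
    idx = np.abs(W[..., None] - levels).argmin(axis=-1)
    return levels[idx]

W = np.random.randn(64, 576)                  # one CONV layer in GEMM view (n x k)
Y = project_column_sparse(W, alpha=64)        # e.g. keep 64 of the 576 columns
Z = project_quantized(W, np.linspace(-1.0, 1.0, 16))  # placeholder 16-level set
```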
In order to accommodate high-parallelism implementation in hardware, we use the structured pruning method [1] instead of the irregular pruning method [15] to reduce the size of the weight matrix while avoiding extra memory storage for indices. Figure 2 shows the different types of structured sparsity, which include filter-wise, channel-wise and shape-wise sparsity.

Figure 3 (a) shows the general matrix multiplication (GEMM) view of the DNN weight matrix and the different structured weight pruning methods. Structured pruning corresponds to removing rows (filter-wise), columns (shape-wise), or a combination of them. After structured weight pruning, the remaining weight matrix is still regular and requires no extra indices. Figure 3 (b) illustrates the memristor crossbar schematic size reduction resulting from structured weight pruning, and Figure 3 (c) shows the physical view of the memristor crossbar blocks. A CONV layer has n filters and m channels comprising k columns in total, and is denoted by W ∈ R^{n×k}. Due to the increasing reading/writing errors caused by enlarging the memristor crossbar size, we limit our design to multiple 128 × 64 [25] crossbars for all DNN layers. In Figure 3 (c), i and j denote the columns and rows of each crossbar, X represents the inputs, and c is the column index also shown in Figure 3 (a). One can derive that k/j different crossbars store one filter's weights as a block unit, so in total p = n/j blocks store W ∈ R^{n×k}. Within each block, the output of each crossbar is propagated through an ADC, and we then sum the intermediate results of all crossbars column-wise.

Figure 3: Structured weight pruning and reduction of hardware resources.

Due to the non-optimality of the ADMM process and the accuracy degradation problem of quantizing a sparse DNN, a software-hardware co-optimization framework is desired. In this section, we propose: (i) network purification and unused path removal to efficiently remove redundant channels and filters, and (ii) memristor model quantization using distilled knowledge from a software helper model.
Weight pruning with memristor-based ADMM regularized optimization can significantly reduce the number of weights while maintaining high accuracy. However, does the pruning process really remove all unnecessary weights? From our analysis of the DNN data flow, we find that if a whole filter is pruned, the feature maps generated by this filter after GEMM will be "blank". If we map those "blank" feature maps to the next layer, the corresponding unused input channel weights become removable. By the same token, a pruned channel also makes the corresponding filter in the previous layer removable. Figure 4 gives a clear illustration of the correspondence between pruned filters/channels and the resulting unused channels/filters.

To better exploit the unused path removal effect discussed above, we derive an emptiness ratio parameter η to define what can be treated as an empty channel. Suppose Λ_i is the number of columns per channel in layer i, and j is the channel index. We have
\[
\eta_{i,j} = \Big[\sum_{k=1}^{\delta} (\mathrm{column}_k \neq 0)\Big] \big/ \delta, \qquad \delta \in \Lambda_i. \tag{8}
\]
If η_{i,j} exceeds a pre-defined threshold, we can assume that this channel is empty and thus actually prune every column in it. However, if we remove all columns that satisfy η, a dramatic accuracy drop will occur that is hard to recover by retraining, because some relatively "important" weights might be removed. To mitigate this problem, we design the Network Purification algorithm to deal with the non-optimality problem of the ADMM process. We set up a criterion constant σ_{i,j} to represent channel j's importance score, which is derived from an accumulation procedure:
\[
\sigma_{i,j} = \sum_{k=1}^{\delta} \|\mathrm{column}_k\|_F \big/ \delta, \qquad \delta \in \Lambda_i. \tag{9}
\]
One can think of this process as collecting evidence for whether each channel that contains one or several columns needs to be removed. A channel can only be treated as empty when both equations (8) and (9) are satisfied. Network Purification also works on purifying the remaining filters and thus removes more unused paths in the network. Algorithm 1 shows our generalized P-RM method, where
the Th values are hyper-parameter thresholds.
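Equations (8) and (9) reduce to two per-channel statistics over the columns belonging to that channel in the GEMM-view weight matrix. The sketch below shows how they could be computed; the matrix shape, channel layout and threshold values are placeholders, and the removal condition follows Algorithm 1 below.

```python
import numpy as np

def channel_stats(W_gemm, num_channels):
    """Per-channel emptiness ratio eta (Eq. 8) and importance score sigma
    (Eq. 9). W_gemm is an (n x k) GEMM-view weight matrix whose k columns
    are grouped channel by channel."""
    delta = W_gemm.shape[1] // num_channels      # columns per channel
    etas, sigmas = [], []
    for j in range(num_channels):
        cols = W_gemm[:, j * delta:(j + 1) * delta]
        col_norms = np.linalg.norm(cols, axis=0)
        etas.append(np.count_nonzero(col_norms) / delta)   # Eq. (8)
        sigmas.append(col_norms.sum() / delta)             # Eq. (9)
    return np.array(etas), np.array(sigmas)

# Removal candidates following the condition in Algorithm 1 (placeholder thresholds):
eta, sigma = channel_stats(np.random.randn(64, 576), num_channels=64)
candidates = np.where((eta < 0.1) & (sigma < 0.05))[0]
```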
Figure 4: Unused data path caused by structured pruning
Algorithm 1: Network purification & unused path removal
Result: redundant weights and unused paths removed
Load the ADMM-pruned model; δ = number of columns per channel
for i ← 1 until last layer do
  for j ← 1 until last channel in layer i do
    for each k ∈ δ with ||column_k||_F < Th do
      calculate equations (8) and (9);
    end
    if η_{i,j} < Th and σ_{i,j} < Th then
      prune(channel_{i,j});
      prune(filter_{i−1,j}) when i ≠ 1;
    end
  end
  for m ← 1 until last filter in layer i do
    if filter_m is empty or ||filter_m||_F < Th then
      prune(filter_{i,m});
      prune(channel_{i+1,m}) when i ≠ last layer index;
    end
  end
end

Traditionally, a DNN in software is composed of 32-bit weights. On a memristor device, however, the weights of a neural network are represented by the memristance of the memristors (i.e., the memristance range constraint Q_i in the ADMM process). Due to the limited memristance range of memristor devices, weight values exceeding the memristance range cannot be represented precisely. Meanwhile, the mismatch between the write-on value and the exact value when mapping weights onto the memristor crossbar will also cause a reading mismatch if the amount of the value shift exceeds the state-level range. In order to mitigate the memristance range limitation and the mapping mismatch, a larger gap between state levels (q_{i,1}, q_{i,2}, ..., q_{i,M_i}) is needed, which means fewer bits are used to represent the weights.

Table 1: Structured weight pruning results of multi-layer networks on the MNIST, CIFAR-10 and ImageNet datasets (P-RM: Network Purification and Unused Path Removal). ImageNet accuracies are reported as Top-5 accuracy.
Columns: Method | Original model accuracy | Compression rate without P-RM | Accuracy without P-RM | Prune ratio with P-RM | Accuracy with P-RM | Weight quantization accuracy (8-bit)
MNIST: Group Scissor [17] LeNet-5, 99.15%, 4.16×; our LeNet-5 models.
CIFAR-10: Group Scissor [17] ConvNet, 82.01%, 2.35×; our ConvNet, VGG-16 and ResNet-18 models.
ImageNet: SSL [1] AlexNet, 80.40%, 1.40×; our AlexNet, ResNet-18 and ResNet-50 models.

Algorithm 2: Distillation Quantization
Result: distillation quantization with memristor hardware constraints
student ← model pruned and ready to apply quantization;
teacher ← model with a deeper structure and higher accuracy;
for step ← 1 until l_student converges do
  student_q = apply_quantization(w_s, Q);
  calculate TL(p_s, p_t) of student_q and teacher;
  back-propagate on student ← ∂(TL(p_s, p_t)) / ∂(student_q);
end

To better maintain accuracy, we use a pretrained high-accuracy teacher model to provide a distillation loss that is added to the loss of our memristor model (referred to as the student model), yielding better training performance:
\[
l_{student} = (1-\sigma)\, L(p_s, p_r) + \sigma\, TL(p_s, p_t). \tag{10}
\]
The L in the first term of (10) is the memristor (student) model loss, and TL in the second term is the distillation loss between student and teacher. p_s and p_t are the outputs of the student and teacher, and p_r is the ground-truth label. σ is a balancing parameter, and T is the temperature parameter.
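A minimal PyTorch sketch of the combined loss in Eq. (10) is given below. The KL-divergence form of the distillation term and the T² scaling on it are conventional knowledge-distillation choices assumed here; Eq. (10) itself only fixes the weighting between the hard loss and the distillation loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      sigma=0.5, temperature=4.0):
    """l_student = (1 - sigma) * L(p_s, p_r) + sigma * TL(p_s, p_t)   (Eq. 10)
    L: cross-entropy against the ground-truth labels p_r.
    TL: KL divergence between temperature-softened student/teacher outputs
    (the T**2 factor is the conventional distillation scaling, an assumption here)."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=1),
                    F.softmax(teacher_logits / temperature, dim=1),
                    reduction="batchmean") * temperature ** 2
    return (1.0 - sigma) * hard + sigma * soft
```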
In this section, we show the experimental results of our proposed memristor-based DNN framework, which includes structured weight pruning and quantization with memristor-based ADMM regularized optimization. Our software-hardware co-optimization framework (i.e., Network Purification and Unused Path Removal, P-RM) is also thoroughly compared. We test the MNIST dataset on LeNet-5 and the CIFAR-10 dataset using ConvNet (4 CONV layers and 1 FC layer), VGG-16 and ResNet-18, and we also show ImageNet results on AlexNet, ResNet-18 and ResNet-50. The accuracies of the pruned and quantized models are tested on our software models incorporating the memristor hardware constraints. Models are trained on a server with eight NVIDIA GTX-2080Ti GPUs using the PyTorch API. Our MATLAB memristor model and NVSim [26] are used to calculate the power consumption and area cost of the memristors and memristor crossbars.

Figure 5: Effect of removing redundant weights and unused paths (dataset: CIFAR-10; accuracy: VGG-16 93.36%, ResNet-18 93.79%).

The 1R crossbar structure is used in our design, and we choose a memristor device with R_on = 1 MΩ and R_off = 10 MΩ. The memristor precision is 4-bit, which means that 16 state levels can be represented by a single memristor device, and two memristors are combined to represent an 8-bit weight in our framework. For the peripheral circuits, power and area are calculated based on 45 nm technology, and H-tree distribution networks are used to access all the memristor crossbars.
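As an illustration of the two-memristor weight representation, the sketch below splits an 8-bit quantization level across two 4-bit (16-state) devices. The high/low digit scheme is an assumption for illustration; the text above only states that two 4-bit memristors jointly represent one 8-bit weight.

```python
import numpy as np

# Illustrative scheme (an assumption, not the exact circuit): an 8-bit
# quantization level is split into two 4-bit digits, each stored on one
# 16-state memristor; the pair is recombined as high*16 + low.
LEVELS_PER_DEVICE = 16

def split_8bit(level):
    """Split an 8-bit quantization level (0..255) across two 4-bit memristors."""
    assert 0 <= level < LEVELS_PER_DEVICE ** 2
    return level // LEVELS_PER_DEVICE, level % LEVELS_PER_DEVICE

def combine(high, low):
    """Recombine the two device states into the 8-bit level."""
    return high * LEVELS_PER_DEVICE + low

w_levels = np.random.randint(0, 256, size=5)        # quantized 8-bit weight levels
pairs = [split_8bit(int(l)) for l in w_levels]
assert all(combine(h, lo) == l for (h, lo), l in zip(pairs, w_levels))
```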
As shown in Table 1, we report groups of different prune ratios and 8-bit quantization accuracies for each network structure. Figure 5 supports our earlier argument that ADMM's non-optimality exists in a structured pruned model; P-RM can further optimize the loss function. Please note that all results are based on a non-retraining process. Below are some result highlights on different datasets and network structures.

MNIST.
With the LeNet-5 network, compared to the original accuracy (99.17%), our proposed P-RM framework achieves 231.82× compression with minor accuracy loss, while our other reported compression ratios are lossless. No accuracy loss is observed after quantization of the 40× and 88× models, and only a 0.4% accuracy drop on the 231.82× model. In contrast, Group Scissor [17] only reaches a 4.16× compression rate.

CIFAR-10.
The ConvNet structure is relatively shallow, so ADMM reaches a relatively good local minimum and post-processing is not necessary. We still outperform Group Scissor [17] in accuracy (84.55% vs. 82.09%) at the same compression rate (2.35×). For larger networks, when a minor accuracy loss is allowed, our proposed P-RM method improves the prune ratio to 50.02× and 59.84× on VGG-16 and ResNet-18 respectively, with no obvious accuracy loss after quantization of the pruned models.

Table 2: Area/power comparison between models with and without P-RM for ResNet-18 and VGG-16 on CIFAR-10.

ImageNet.
Our AlexNet model outperforms SSL [1] both in compression rate (4.69× vs. 1.40×) and network accuracy (81.76% vs. 80.40%), with or without P-RM. Our ResNet-18 and ResNet-50 models also achieve unprecedented 3.33× compression with 88.36% accuracy and 2.70× with 92.27% accuracy, respectively. No accuracy loss is observed after quantization of the pruned ResNet-18/50 models, and around 1% accuracy loss on the 5.13× compressed AlexNet model.

Table 2 shows our highlighted memristor crossbar power and area comparisons for the ResNet-18 and VGG-16 models. By using our proposed P-RM method, the area and power of the 5.× (15.×) ResNet-18 models are reduced from 0.235 mm² (0.117 mm²) and 3.359 W (1.622 W) to 0.042 mm² (0.041 mm²) and 0.585 W (0.556 W), without any accuracy loss. For the VGG-16 20.× model, after using our P-RM method, the area and power are reduced from 0.113 mm² and 1.611 W to 0.056 mm² (0.053 mm²) and 0.824 W (0.754 W), where the achieved compression ratio is 44.67× (50.02×) with 0% (0.63%) accuracy degradation.

In this paper, we designed a unified memristor-based DNN framework which is tiny in overall hardware footprint and accurate in test performance. We incorporate ADMM into structured weight pruning and quantization to reduce the model size so that it fits our designed tiny framework. We find the non-optimality of the ADMM solution and design
Network Purification and
Unused Path Removal in our software-hardware co-optimization framework, which achieve better results compared to Group Scissor [17] and SSL [1]. On AlexNet, VGG-16 and ResNet-18/50, after structured weight pruning and 8-bit quantization, model size, power and area are significantly reduced with negligible accuracy loss.
References
[1] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in NeurIPS, 2016, pp. 2074–2082.
[2] X. Ma, G. Yuan, S. Lin, Z. Li, H. Sun, and Y. Wang, "Resnet can be pruned 60x: Introducing network purification and unused path removal (p-rm) after weight pruning," arXiv preprint arXiv:1905.00136, 2019.
[3] T. Zhang, K. Zhang, S. Ye, J. Li, J. Tang, W. Wen, X. Lin, M. Fardad, and Y. Wang, "Adam-admm: A unified, systematic framework of structured weight pruning for dnns," arXiv preprint arXiv:1807.11091, 2018.
[4] E. Park, J. Ahn, and S. Yoo, "Weighted-entropy-based quantization for deep neural networks," in CVPR, 2017.
[5] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," in CVPR, 2016.
[6] S. Lin, X. Ma, S. Ye, G. Yuan, K. Ma, and Y. Wang, "Toward extremely low bit and lossless accuracy in dnns with progressive admm," arXiv preprint arXiv:1905.00789, 2019.
[7] W. Niu, X. Ma, Y. Wang, and B. Ren, "26ms inference time for resnet-50: Towards real-time execution of all dnns on smartphone," arXiv preprint arXiv:1905.00571, 2019.
[8] H. Li, N. Liu, X. Ma, S. Lin, S. Ye, T. Zhang, X. Lin, W. Xu, and Y. Wang, "Admm-based weight pruning for real-time deep learning acceleration on mobile devices," in Proceedings of the 2019 Great Lakes Symposium on VLSI, 2019.
[9] C. Ding, A. Ren, G. Yuan, X. Ma, J. Li, N. Liu, B. Yuan, and Y. Wang, "Structured weight matrices-based hardware accelerators in deep neural networks: Fpgas and asics," in Proceedings of the 2018 Great Lakes Symposium on VLSI, 2018.
[10] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, "The missing memristor found," Nature, vol. 453, no. 7191, p. 80, 2008.
[11] X. Ma, Y. Zhang, G. Yuan, A. Ren, Z. Li, J. Han, J. Hu, and Y. Wang, "An area and energy efficient design of domain-wall memory-based deep convolutional neural networks using stochastic computing," in ISQED. IEEE, 2018.
[12] L. Chua, "Memristor - the missing circuit element," IEEE Transactions on Circuit Theory, vol. 18, no. 5, pp. 507–519, 1971.
[13] G. Yuan, C. Ding, R. Cai, X. Ma, Z. Zhao, A. Ren, B. Yuan, and Y. Wang, "Memristor crossbar-based ultra-efficient next-generation baseband processors," in MWSCAS, 2017.
[14] S. Ye, X. Feng, T. Zhang, X. Ma, S. Lin, Z. Li, K. Xu, W. Wen, S. Liu, J. Tang et al., "Progressive dnn compression: A key to achieve ultra-high weight pruning and quantization rates using admm," arXiv preprint arXiv:1903.09769, 2019.
[15] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in NeurIPS, 2015.
[16] A. Ankit, A. Sengupta, and K. Roy, "Trannsformer: Neural network transformation for memristive crossbar based neuromorphic system design," in Proceedings of ICCD, 2017.
[17] Y. Wang, W. Wen, B. Liu, D. Chiarulli, and H. Li, "Group scissor: Scaling neuromorphic computing design to large neural networks," in DAC. IEEE, 2017.
[18] L. Xia, T. Tang, W. Huangfu, M. Cheng, X. Yin, B. Li, Y. Wang, and H. Yang, "Switched by input: power efficient structure for rram-based convolutional neural network," in DAC. ACM, 2016, p. 125.
[19] A. Shafiee, A. Nag, N. Muralimanohar, et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in ISCA, 2016.
[20] S. Kaya, A. R. Brown, A. Asenov, D. Magot, T. Linton, and C. Tsamis, "Analysis of statistical fluctuations due to line edge roughness in sub-0.1 µm mosfets," 2001.
[21] J. J. Yang, M. D. Pickett, X. Li, D. A. Ohlberg, D. R. Stewart, and R. S. Williams, "Memristive switching mechanism for metal/oxide/metal nanodevices," Nature Nanotechnology, 2008.
[22] C. Song, B. Liu, W. Wen, H. Li, and Y. Chen, "A quantization-aware regularized learning method in multilevel memristor-based neuromorphic computing system," IEEE, 2017.
[23] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, 2011.
[24] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[25] M. Hu, C. E. Graves, C. Li, Y. Li, et al., "Memristor-based analog computation and neural network classification with a dot product engine," Advanced Materials, 2018.
[26] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, "Nvsim: A circuit-level performance, energy, and area model for emerging nonvolatile memory."