DeepCABAC: Context-adaptive binary arithmetic coding for deep neural network compression
Simon Wiedemann, Heiner Kirchhoffer, Stefan Matlage, Paul Haase, Arturo Marban, Talmaj Marinc, David Neumann, Ahmed Osman, Detlev Marpe, Heiko Schwarz, Thomas Wiegand, Wojciech Samek
Abstract
We present DeepCABAC, a novel context-adaptive binary arithmetic coder for compressing deep neural networks. It quantizes each weight parameter by minimizing a weighted rate-distortion function, which implicitly takes the impact of quantization on the accuracy of the network into account. Subsequently, it compresses the quantized values into a bitstream representation with minimal redundancies. We show that DeepCABAC is able to reach very high compression ratios across a wide set of different network architectures and datasets. For instance, we are able to compress the VGG16 ImageNet model by 63.6× with no loss of accuracy, thus being able to represent the entire network with merely 8.7 MB.
1. Introduction
In spite of their state-of-the-art performance across a wide spectrum of problems (LeCun et al., 2015), deep neural networks have the well-known caveat that they most often have high memory complexity. This does not only imply high storage requirements, but also high energy demands and slower runtimes for execution (Horowitz, 2014; Sze et al., 2017; Wang et al., 2019). This greatly limits their adoption in industrial applications and their deployment on resource-constrained devices. Moreover, it also hinders their transmission over communication channels with limited capacity, which becomes an obstacle for distributed training scenarios such as federated learning (McMahan et al., 2016; Sattler et al., 2018; 2019).

As a reaction, a plethora of work has been published on the topic of deep neural network compression (Cheng et al., 2017; Cheng et al., 2018). Among the different proposed methods, sparsification followed by weight quantization and entropy coding arguably belongs to the most popular approaches, since very high compression ratios can be achieved under this paradigm (Han et al., 2015a; Louizos et al., 2017; Wiedemann et al., 2018a;b). Whereas much of the research has focused on the sparsification part, substantially less attention has been paid to improving the latter two steps. In fact, most of the proposed (post-sparsity) compression algorithms come with at least one of the following caveats: 1) they decouple the quantization procedure from the subsequent lossless compression algorithm, 2) they ignore correlations between the parameters, and 3) they apply a lossless compression algorithm that produces a bitstream with more redundancies than principally needed (e.g., scalar Huffman coding). Moreover, some of the proposed compression algorithms also do not take the impact of quantization on the accuracy of the network into account.

In this work we present DeepCABAC, a compression algorithm that overcomes all of the above limitations. It is based on applying a context-adaptive binary arithmetic coder (CABAC) to the quantized parameters, which is the state of the art for lossless compression. It also couples the quantization procedure with CABAC by minimizing a rate-distortion cost function where the rate explicitly measures the bit-size of the network parameters as determined by CABAC. Moreover, it implicitly takes the impact of quantization on the network's accuracy into account by weighting the distortion with a term that measures the "robustness" of the network's parameters. In our experiments we show that we can significantly boost the compression performance of a wide set of pre-sparsified network architectures, consequently achieving new state-of-the-art results for the VGG16 model.
2. CABAC
Context-adaptive binary arithmetic coding (CABAC) is a form of lossless coding which was originally designed for the video compression standard H.264/AVC (Marpe et al., 2003), but is also an integral part of its successor H.265/HEVC. CABAC does not only offer high flexibility of adaptation, but also a highly efficient implementation, thus attaining higher compression performance as well as throughput compared to other entropy coding methods (Marpe & Wiegand, 2003). In short, it applies three powerful coding techniques: 1) Firstly, it binarizes the data to be encoded. That is, it predefines a series of binary decisions (also called bins) under which each data element (or symbol) is uniquely identified. 2) Then, it assigns a binary probability model to each bin (also named context model), which is updated on-the-fly by the local statistics of the data. This provides CABAC with a high degree of adaptation to different data distributions. 3) Finally, it employs an arithmetic coder in order to optimally and efficiently code each bin, based on the respective context model. To recall, arithmetic coding is a form of entropy coding which encodes entire strings of symbols into a single number. It is well known to outperform other coding techniques such as the Huffman code (Huffman, 1952) with regards to both compactness of the data representation and coding efficiency (Witten et al., 1987).

Due to the above reasons, we chose CABAC as our lossless compression method and adapted it to the task of neural network compression. Inspired by a prior analysis of the empirical weight distribution of different neural network architectures, we adopted the following binarization procedure. Given a quantized weight tensor in its matrix form (trivial for fully-connected layers; convolutional layers are converted into their respective matrix form according to (Chetlur et al., 2014)), DeepCABAC scans the weight elements in row-major order (left to right, top to bottom) and encodes each quantized weight element as follows: 1) It first determines whether the weight element is a significant element or not. That is, each weight element is assigned a bit which indicates whether the element is 0 or not. This bit is then encoded using a binary arithmetic coder, according to its respective context model. The context model is initially set to 0.5 (thus, 50% probability that a weight element is 0 or not), but is automatically adapted to the local statistics of the weight parameters as DeepCABAC encodes more elements. 2) Then, if the element is not 0, the sign bit is encoded analogously, according to its respective context model. 3) Subsequently, a series of bits is encoded analogously, which determine whether the element is greater than 1, 2, ..., n, with n ∈ N. The number n becomes a hyperparameter of the encoder. 4) Finally, the remainder is encoded using a fixed-length binary code. The decoding process is performed analogously. An example of the binarization procedure is depicted in Figure 1.
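To make the notion of a context model concrete, the following is a minimal sketch of an adaptive binary probability estimate that adapts to the local statistics of the bins it observes. It is not the actual CABAC probability estimator (which is a table-driven finite-state machine); the class name, the exponential update rule and the adaptation rate are illustrative assumptions.

```python
import math

class ContextModel:
    """Adaptive binary probability model for one bin type (illustrative sketch)."""

    def __init__(self, p_one: float = 0.5, adaptation_rate: float = 0.05):
        self.p_one = p_one                      # current estimate of P(bin = 1)
        self.adaptation_rate = adaptation_rate  # how quickly the model adapts

    def update(self, bin_value: int) -> None:
        # Move the estimate towards the observed bin value (local statistics).
        target = 1.0 if bin_value else 0.0
        self.p_one += self.adaptation_rate * (target - self.p_one)

    def bits(self, bin_value: int) -> float:
        # Ideal code length (in bits) an arithmetic coder would spend on this
        # bin under the current probability estimate.
        p = self.p_one if bin_value else 1.0 - self.p_one
        return -math.log2(max(p, 1e-12))
```

Encoding a bin thus costs roughly the negative log-probability assigned to it by its context model, which is the quantity that the rate term in Section 3 accumulates.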
Figure 1.
DeepCABAC binarization of neural networks. Each weight element is encoded by performing the following steps: 1) a bit named sigflag is encoded, which determines whether the weight is a significant element or not (in other words, whether it is 0 or not). 2) If it is not 0, then the sign bit, signflag, is encoded. 3) Subsequently, a series of bits is encoded which indicate whether the weight value is greater than or equal to 1, 2, ..., n, with n ∈ N (the so-called AbsGr(n)Flag). 4) Finally, the remainder is encoded. The grey bits (also named regular bins) represent bits that are encoded using an arithmetic coder according to a context model. The other bits, the so-called bypass bins, are encoded in fixed-point form. The decoder is analogous.
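The following is a simplified sketch of this binarization for a single quantized (integer-valued) weight. It only produces the sequence of bins and labels them as regular (context-coded) or bypass; the arithmetic-coding stage is omitted, and the function name, the default n and the fixed remainder width are illustrative assumptions rather than the exact DeepCABAC configuration.

```python
def binarize_weight(q: int, n: int = 4, remainder_bits: int = 8):
    """Return the list of (bin_value, bin_type) pairs for one quantized weight q."""
    bins = []

    # 1) sigflag: is the weight a significant (non-zero) element?
    sig = int(q != 0)
    bins.append((sig, "regular"))
    if not sig:
        return bins

    # 2) signflag: sign of the non-zero weight.
    bins.append((int(q < 0), "regular"))

    # 3) AbsGr(k)Flags: one flag per threshold k = 1..n, "is |q| greater than k?".
    mag = abs(q)
    for k in range(1, n + 1):
        greater = int(mag > k)
        bins.append((greater, "regular"))
        if not greater:
            return bins

    # 4) Remainder beyond the last threshold, written as a fixed-length
    #    (bypass) binary code.
    remainder = mag - n - 1
    for i in reversed(range(remainder_bits)):
        bins.append(((remainder >> i) & 1, "bypass"))
    return bins

# Example: binarize_weight(3) yields sigflag=1, signflag=0, AbsGr1=1, AbsGr2=1, AbsGr3=0.
```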
3. Weighted rate-distortion function
Before applying CABAC, we first have to quantize the weight parameters of the network. We do this by minimizing a generalized form of a rate-distortion function. Namely, we quantize each weight parameter w_i to the quantization point q_{k*} that minimizes the cost function

    w_i → q_{k*},    k* = arg min_k  η_i (w_i − q_k)² + λ R_{ik}        (1)

where R_{ik} is the bit-size of the quantization point q_k as determined by DeepCABAC, and λ is the Lagrangian multiplier that specifies the desired trade-off between the bit-size and the distortion incurred by the quantization. Notice how the bit-size R_{ik} now also depends on the index i of the weight to be encoded. This is due to the context-adaptive models, which update their probabilities as new elements are encoded; the probabilities, and consequently the bit-size of each quantization point q_k, therefore differ for each weight w_i.

Moreover, (1) introduces a weight-specific parameter η_i which takes into account the relative impact that the distortion of a particular weight incurs on the accuracy of the network. In this work we take a Bayesian approach in order to estimate this parameter. Namely, we assume a Gaussian prior for each weight parameter and apply scalable Bayesian techniques in order to estimate their sufficient statistics (Kingma et al., 2015; Molchanov et al., 2017; Louizos et al., 2017). As a result, we obtain a mean and a standard deviation value for each weight parameter of the network, where the former can be interpreted as its (new) value and the latter as a measure of its "robustness". Thus, when quantizing each w_i, we set η_i = 1/σ_i in (1), with σ_i being the respective standard deviation. This is also theoretically motivated, since (Achille et al., 2017) established a connection between the variances and the diagonal elements of the Fisher information matrix.

In order to minimize (1), we also need to define a set of quantization points q_k. We chose them to be equidistant to each other with a particular distance ∆ ∈ R, namely,

    q_k = ∆ I_k,    ∆ = 2 |w_max| / (|w_max| / σ_min + S),    S, I_k ∈ Z        (2)

where σ_min is the smallest standard deviation and w_max the parameter with the highest magnitude. S is then a hyperparameter which controls the "coarseness" of the quantization points. By selecting ∆ in this manner we ensure that the quantization points lie within the range of the standard deviation of each weight parameter, in particular for values S ≥ 0. Moreover, by constraining them to be equidistant to each other we encourage fixed-point representations, which can be exploited in order to perform inference with lower complexity (QNN; TFL).

Table 1. Compression ratios achieved when combining DeepCABAC with a sparsification method. Values in parentheses are the results from previous work, (Han et al., 2015a) and (Louizos et al., 2017).

Model           Dataset   Org. acc. (top-1) [%]  Org. size  Spars. |w≠0|/|w| [%]  Comp. ratio [%]  Acc. (top-1) [%]
VGG16           ImageNet  69.43                  553.43 MB  9.85 (7.5)            (2.05)           (68.83)
ResNet50        ImageNet  76.13                  102.23 MB  25.40 (29.0)          (5.95)           (76.15)
MobileNet-v1    ImageNet  70.69                  16.93 MB   50.73
Small-VGG16     CIFAR10   91.35                  59.9 MB    7.57 (5.5)            (0.86)           (90.8)
LeNet5          MNIST     99.22                  1722 KB    1.90 (8.0)(0.6)       (2.55)(0.13)     (99.26)(99.00)
LeNet-300-100   MNIST     98.29                  1066 KB    9.05 (8.0)(2.2)       (2.49)(0.88)     (98.42)(98.20)
FCAE            CIFAR10   30.14 PSNR             304.72 KB  55.69
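As a concrete illustration of (1) and (2), the following sketch quantizes a layer's weights against an equidistant grid using η_i = 1/σ_i. It is a simplification under stated assumptions: the rate term is only a crude per-point proxy for the number of bins spent by the binarization, whereas DeepCABAC uses the actual bit cost produced by its context models, which evolves during encoding. All function and variable names are illustrative.

```python
import numpy as np

def quantization_grid(weights: np.ndarray, sigma: np.ndarray, S: int) -> np.ndarray:
    """Equidistant quantization points q_k = Delta * I_k, following Eq. (2)."""
    w_max = float(np.max(np.abs(weights)))
    sigma_min = float(np.min(sigma))
    delta = 2.0 * w_max / (w_max / sigma_min + S)        # step size Delta
    n = int(np.ceil(w_max / delta))
    return delta * np.arange(-n, n + 1)                  # symmetric grid around 0

def rd_quantize(weights: np.ndarray, sigma: np.ndarray, S: int = 32,
                lam: float = 1e-4) -> np.ndarray:
    """Assign each weight w_i to the point minimizing eta_i*(w_i - q_k)^2 + lam*R_k."""
    q = quantization_grid(weights, sigma, S)
    delta = q[1] - q[0]

    # Crude rate proxy R_k: one sigflag bin for every element, plus a sign bin
    # and a unary-style magnitude part for non-zero points (see Section 2).
    idx = np.round(np.abs(q) / delta)
    rate = 1.0 + (idx > 0) * (1.0 + idx)

    eta = (1.0 / sigma).reshape(-1, 1)                   # robustness weights eta_i
    cost = eta * (weights.reshape(-1, 1) - q.reshape(1, -1)) ** 2 \
           + lam * rate.reshape(1, -1)
    return q[np.argmin(cost, axis=1)]

# Example on a random toy layer:
# w = np.random.randn(1000) * 0.05
# s = np.full_like(w, 0.01)
# w_q = rd_quantize(w, s, S=32, lam=1e-4)
```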
4. Experiments
We applied DeepCABAC to the set of models described in the evaluation framework (MPEG Requirements, 2019a) of the MPEG call for proposals on neural network representations (MPEG Requirements, 2019b). This includes the VGG16, ResNet50 and MobileNet-v1 models, as well as a fully-convolutional autoencoder pretrained on a task of end-to-end image compression (which we named FCAE). In addition, we also applied DeepCABAC to the LeNet-300-100 and LeNet5 models and to a smaller version of the VGG16 model (which we named Small-VGG16).

We applied the variational sparsification method introduced in (Molchanov et al., 2017) to the LeNet-300-100, LeNet5, Small-VGG16, FCAE and MobileNet-v1 models. However, due to the high training complexity that this method requires, we adopted a slightly different approach for VGG16 and ResNet50. Namely, we first sparsified them by applying the iterative algorithm of (Han et al., 2015b), and subsequently applied the method of (Molchanov et al., 2017), but only for estimating the variances of the distributions (thus fixing the mean values during training). After sparsification, we applied DeepCABAC to the weight parameters of each layer separately, excluding biases and normalization parameters. Since the compression result can be sensitive to the parameter S in (2), we probed the compression performance for a range of values of S and selected the best performing model. A sketch of this per-layer procedure is given below.

The resulting sparsities as well as the compression ratios are displayed in Table 1. Notice that for most networks we are not able to reproduce the sparsity ratios reported in the literature. In addition, we did not perform any fine-tuning after compression, thus having a particularly challenging setup for achieving good post-sparsity compression ratios. Nevertheless, in spite of these two disadvantages, DeepCABAC is able to compress the models significantly further, boosting the compression by 74% on average.
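The per-layer procedure with the sweep over S can be summarized by the following sketch. The helper arguments (quantize_fn, size_fn, accuracy_drop_fn) and the selection criterion (smallest bitstream subject to a maximum accuracy drop) are illustrative assumptions; the paper only states that the best performing setting was selected.

```python
def compress_model(layers, candidate_S, quantize_fn, size_fn, accuracy_drop_fn,
                   max_acc_drop=0.0):
    """Sweep S and keep the best setting.

    layers:           list of (weights, sigma) pairs, one per layer
                      (biases and normalization parameters are excluded)
    quantize_fn:      e.g. rd_quantize from the sketch above
    size_fn:          returns the encoded bit-size of one quantized layer
    accuracy_drop_fn: evaluates the quantized model and returns the accuracy drop
    """
    best = None
    for S in candidate_S:
        quantized = [quantize_fn(w, s, S) for (w, s) in layers]
        size_bits = sum(size_fn(qw) for qw in quantized)
        if accuracy_drop_fn(quantized) <= max_acc_drop:
            if best is None or size_bits < best[0]:
                best = (size_bits, S, quantized)
    return best
```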
5. Conclusion
We show that one can significantly boost the compression gains by applying state-of-the-art coding techniques to pre-sparsified deep neural networks. In particular, our proposed coding scheme, DeepCABAC, is able to increase the compression rates of pre-sparsified networks by 74% on average, thus attaining compression ratios comparable to (and sometimes higher than) the current state of the art. In future work we will benchmark DeepCABAC also on non-sparsified networks, as well as apply it in the context of distributed learning scenarios where memory complexity is critical (e.g., federated learning).

References
QNNPACK: open source library for optimized mobile deep learning. https://github.com/pytorch/QNNPACK. Accessed: 28.02.2019.

TensorFlow Lite. Accessed: 28.02.2019.

Achille, A., Rovere, M., and Soatto, S. Critical learning periods in deep neural networks. arXiv:1711.08856, 2017.

Cheng, Y., Wang, D., Zhou, P., and Zhang, T. A survey of model compression and acceleration for deep neural networks. arXiv:1710.09282, 2017.

Cheng, Y., Wang, D., Zhou, P., and Zhang, T. Model compression and acceleration for deep neural networks: The principles, progress, and challenges. IEEE Signal Processing Magazine, 35(1):126–136, Jan. 2018.

Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., and Shelhamer, E. cuDNN: Efficient primitives for deep learning. arXiv:1410.0759, 2014.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv:1510.00149, 2015a.

Han, S., Pool, J., Tran, J., and Dally, W. J. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1135–1143, 2015b.

Horowitz, M. 1.1 Computing's energy problem (and what we can do about it). In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14, Feb. 2014.

Huffman, D. A. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, Sep. 1952.

Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems (NIPS), pp. 2575–2583, 2015.

LeCun, Y., Bengio, Y., and Hinton, G. E. Deep learning. Nature, 521:436–444, 2015.

Louizos, C., Ullrich, K., and Welling, M. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems (NIPS), pp. 3288–3298, 2017.

Marpe, D. and Wiegand, T. A highly efficient multiplication-free binary arithmetic coder and its application in video coding. In Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429), volume 2, pp. 263–266, Sep. 2003.

Marpe, D., Schwarz, H., and Wiegand, T. Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):620–636, July 2003.

McMahan, H. B., Moore, E., Ramage, D., and y Arcas, B. A. Federated learning of deep networks using model averaging. arXiv:1602.05629, 2016.

Molchanov, D., Ashukha, A., and Vetrov, D. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning (ICML), pp. 2498–2507, 2017.

MPEG Requirements. Updated evaluation framework for compressed representation of neural networks. N18162. Technical report, Moving Picture Experts Group (MPEG), Marrakech, MA, Jan. 2019a.

MPEG Requirements. Updated call for proposals on neural network compression. N18129. CfP, Moving Picture Experts Group (MPEG), Marrakech, MA, Jan. 2019b.

Sattler, F., Wiedemann, S., Müller, K.-R., and Samek, W. Sparse binary compression: Towards distributed deep learning with minimal communication. arXiv:1805.08768, 2018.

Sattler, F., Wiedemann, S., Müller, K.-R., and Samek, W. Robust and communication-efficient federated learning from non-iid data. arXiv:1903.02891, 2019.

Sze, V., Chen, Y.-H., Yang, T.-J., and Emer, J. S. Efficient processing of deep neural networks: A tutorial and survey. arXiv:1703.09039, 2017.

Wang, E., Davis, J. J., Zhao, R., Ng, H.-C., Niu, X., Luk, W., Cheung, P. Y. K., and Constantinides, G. A. Deep neural network approximation for custom hardware: Where we've been, where we're going. arXiv:1901.06955, 2019.

Wiedemann, S., Marbán, A., Müller, K.-R., and Samek, W. Entropy-constrained training of deep neural networks. arXiv:1812.07520, 2018a.

Wiedemann, S., Müller, K.-R., and Samek, W. Compact and computationally efficient representation of deep neural networks. arXiv:1805.10692, 2018b.

Witten, I. H., Neal, R. M., and Cleary, J. G. Arithmetic coding for data compression. Communications of the ACM, 30(6):520–540, 1987.