End-to-end Learning of Compressible Features

Saurabh Singh†, Sami Abu-El-Haija∗, Nick Johnston, Johannes Ballé, Abhinav Shrivastava∗, George Toderici
Google Research

† [email protected]. ∗ Work done while at Google.
ABSTRACT
Pre-trained convolutional neural networks (CNNs) are powerful off-the-shelf feature generators and have been shown to perform very well on a variety of tasks. Unfortunately, the generated features are high dimensional and expensive to store: potentially hundreds of thousands of floats per example when processing videos. Traditional entropy-based lossless compression methods are of little help as they do not yield the desired level of compression, while general-purpose lossy compression methods based on energy compaction (e.g., PCA followed by quantization and entropy coding) are sub-optimal, as they are not tuned to the task-specific objective. We propose a learned method that jointly optimizes for compressibility along with the task objective for learning the features. The plug-in nature of our method makes it straightforward to integrate with any target objective and trade off against compressibility. We present results on multiple benchmarks and demonstrate that our method produces features that are an order of magnitude more compressible, while having a regularization effect that leads to a consistent improvement in accuracy.
Index Terms — Feature compression, Neural networks
1. INTRODUCTION
Convolutional neural networks (CNNs) have been hugely successful in computer vision and machine learning and have helped push the frontier on a variety of problems [1–7]. Their success is attributed to their ability to learn a hierarchy of features, ranging from very low-level image features, such as lines and edges, to high-level semantic concepts, such as objects and parts [8, 9]. As a result, pre-trained CNNs have been shown to be very powerful as off-the-shelf feature generators [10]. Razavian et al. [10] demonstrated that a pre-trained network as a feature generator, coupled with a simple classifier such as an SVM or logistic regression, performs surprisingly well and often outperforms hand-tuned features on a variety of tasks. This is observed even on tasks that are very different from the original tasks that the CNN was trained on. As a result, CNNs are being widely used as a feature-computing module in larger computer vision and machine learning application pipelines such as video and image analysis. Although features tend to be smaller in size than the original data they are computed from, their storage can still be prohibitive for large datasets. For example, the YouTube-8M dataset [11] requires close to two terabytes of disk space. Compounded by the fact that disk reads are slow, training large pipelines on such datasets becomes slow and expensive. We propose a method to address this issue. Our method jointly optimizes for the original training objective as well as compressibility, yielding features that are as powerful but only require a fraction of the storage cost.

Features also need to be pre-computed and stored for certain types of applications where the target task evolves over time. A typical example is an indexing system or a content analysis system where the image features may be one of the many signals that the full model relies on. Such a model may change over time by improving how it integrates various signals. It becomes prohibitively expensive to compute features and continuously train such a system end-to-end. While pre-computing features speeds up training, the storage of these features can exceed petabytes for internet-scale applications. Our method enables such systems to operate at a fraction of the cost without sacrificing performance on the target tasks.

CNN features are typically derived by removing the top few layers and using the activations of the remaining topmost layer. These features tend to be very high dimensional, taking up a significant amount of storage space, especially when computed at a large scale. For example, [11] computes features for 8 million videos and mentions that the original size consisted of hundreds of terabytes, implying that uncompressed features for even a small fraction of YouTube would require hundreds of petabytes. Off-the-shelf lossy compression methods are undesirable for such data as they are content agnostic, resulting in unwanted distortions in the semantic information and a loss of the discriminative power of the original features.
Contributions:
We present a method that jointly optimizes for compressibility as well as the target objective used for learning the features. We introduce a penalty that enables a trade-off between compressibility and informativeness of the features. The plug-in nature of our method makes it easy to integrate with any target objective. We demonstrate that our method produces features that are orders of magnitude more compressible in comparison to traditional methods, while having a regularization effect leading to a consistent improvement in accuracy across benchmarks.
2. COMPRESSIBLE FEATURE LEARNING
In a typical supervised classification or regression problem, we are concerned with minimizing a loss function L over a set of parameters θ:

$$\theta^* = \arg\min_{\theta} \sum_{x, y \in \mathcal{D}} L(\hat{y}, y) \quad \text{with} \quad \hat{y} = f(x; \theta), \qquad (1)$$

where x is the input variable (e.g., image pixels, or features), y the target variable (e.g., classification labels, or regression targets), ŷ is the prediction, f is often an artificial neural network (ANN) with parameters θ comprising its set of filter weights, and D is a set of training data.

Applications using such pre-trained neural networks as feature generators typically remove the top few layers and use the output of the remaining topmost layer. We refer to this output as z. For the application, z is a set of representative features that can be used in place of x. We represent this process of construction of z by splitting f into two parts, f_z and f_ŷ. The prediction ŷ is then given by:

$$\hat{y} = f(x; \theta) = f_{\hat{y}}(z; \theta_{\hat{y}}) \quad \text{with} \quad z = f_z(x; \theta_z). \qquad (2)$$

We are interested in learning a model f such that z are compressible while still maintaining the performance of f on the original classification or regression task. Our method achieves this by augmenting the original loss L in eq. (1) with a compression loss R to yield the following optimization problem:

$$\theta^* = \arg\min_{\theta} \sum_{x, y \in \mathcal{D}} L(\hat{y}, y) + \lambda R(z), \qquad (3)$$

with ŷ and z defined as before and λ serving as a trade-off parameter. R encourages the compressibility of z by penalizing an approximation of its entropy, as described in the next section. Refer to section A and fig. 2 in the appendix for additional details.

General data compression maps each possible data point to a variable-length string of symbols (typically bits) [12], storing or transmitting them, and inverting the mapping at the receiver side. The optimal number of bits needed to store a discrete-valued data set Z is given by the Shannon entropy

$$H = -\sum_{\hat{z} \in \mathcal{Z}} \log p(\hat{z}), \qquad (4)$$

where p is a prior probability distribution of the data points, which needs to be available to both the sender and the receiver. The probability distribution is used by entropy coding techniques such as arithmetic coding [13] or Huffman coding [14] to implement the mapping. The entropy is also referred to as the bit rate (R) of the compression method.

It is common to speak of an intermediate representation like z as a bottleneck. In many cases, for example in the context of autoencoders [15], a bottleneck serves to reduce dimensionality without compromising predictive power, i.e., z is forced to have a smaller number of dimensions than x. Since the number of dimensions is a hyperparameter (an architectural choice), no changes to the loss function are necessary, and the model is simply trained to minimize the loss under the given constraint. However, dimensionality reduction is only a crude approximation to data compression.

In data compression, the compressibility of a dataset can be greatly increased by allowing errors in the compression process, referred to as lossy compression. A lossy compression method allows the bit rate R to be traded off against the distortion D introduced in the data. This rate–distortion trade-off is represented as the following optimization problem:

$$\hat{z}^* = \arg\min_{\hat{z}} \sum_{x \in \mathcal{D}} D(\hat{z}, x) + \lambda R(\hat{z}), \qquad (5)$$

where ẑ is a discrete lossy representation of x. Note the similarity between eq. (3) and eq. (5). The key difference is that while eq. (5) measures the distortion in the data directly, eq. (3) measures the "distortion" using the target variable y and ignores the input x. Our key observation is that supervised losses, such as L(ŷ, y) used for classification, can take the place of distortion in eq. (5). We therefore re-cast our original optimization problem in eq. (3) as the following rate–distortion optimization problem:

$$\theta^*, \phi^* = \arg\min_{\theta, \phi} \sum_{x, y \in \mathcal{D}} \underbrace{L(\hat{y}, y)}_{\text{distortion } D} + \lambda \cdot \underbrace{\bigl(-\log p(\hat{z}; \phi)\bigr)}_{\text{bit rate } R}, \qquad (6)$$

where $\hat{z} = \lfloor f_z(x; \theta_z) \rceil$, $\hat{y} = f_{\hat{y}}(\hat{z}; \theta_{\hat{y}})$, and p is a probability model over ẑ with parameters φ, which are trained jointly with θ ≡ {θ_z, θ_ŷ}. $\lfloor \cdot \rceil$ here indicates that we round the output of f_z to the nearest integers. This quantization is necessary, as compression can take place only in a discrete space with a countable number of possible states. Note that this is how a trade-off between compression performance and prediction performance is achieved. By reducing the number of possible states in ẑ, for example by scaling down the outputs of f_z, the bit rate R can be reduced at the expense of prediction performance. On the other hand, the prediction performance can be improved by increasing the number of possible states, at the expense of compressibility. The hyperparameter λ controls this trade-off. We call this type of bottleneck an entropy bottleneck.

It is not feasible to minimize the objective in eq. (6) directly with descent methods, as the quantization leads to gradients that are zero almost everywhere. Instead, we closely follow the approach introduced in [16] with one key difference. Ballé et al. [16] substitute the quantization with additive uniform noise during training, while we do so only for modeling the rate. For the distortion, we discretize (by rounding) and substitute the gradients by identity (straight-through). Further, rather than using a piecewise linear density model as in [16], we use a more refined density model, which is described in [17].

For all the experiments in this paper, a separate model was used for each vector element ẑ_i in ẑ, yielding a fully factorized probability model p(ẑ) = ∏_i p(ẑ_i). For bottlenecks with a spatial configuration, all the spatial elements within the same channel share the distribution.
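To make the training objective concrete, the sketch below (our illustration, not the authors' released implementation) shows one way to realize this behaviour in TensorFlow: additive uniform noise is used when evaluating the rate term, straight-through rounding is used for the features passed to the classifier, and the factorized prior is a learned per-dimension Gaussian, which merely stands in for the more refined density model of [17]. The class name, the Gaussian choice, and the helper names are assumptions made for illustration.

```python
import math
import tensorflow as tf


class EntropyBottleneck(tf.keras.layers.Layer):
    """Entropy bottleneck sketch: noisy rate estimate + straight-through rounding."""

    def __init__(self, num_features, **kwargs):
        super().__init__(**kwargs)
        # Parameters phi of the factorized prior p(z_hat; phi): one Gaussian per dimension.
        self.loc = self.add_weight(name="loc", shape=(num_features,), initializer="zeros")
        self.log_scale = self.add_weight(name="log_scale", shape=(num_features,), initializer="zeros")

    def _cdf(self, x):
        scale = tf.exp(self.log_scale)
        return 0.5 * (1.0 + tf.math.erf((x - self.loc) / (scale * math.sqrt(2.0))))

    def _bits(self, z):
        # Probability mass of the unit-width quantization bin around z, converted to bits.
        p = self._cdf(z + 0.5) - self._cdf(z - 0.5) + 1e-9
        return tf.reduce_sum(-tf.math.log(p), axis=-1) / math.log(2.0)

    def call(self, z, training=False):
        if training:
            # Additive uniform noise stands in for quantization when estimating the rate.
            z_noisy = z + tf.random.uniform(tf.shape(z), -0.5, 0.5)
            # Straight-through rounding: hard quantization forward, identity gradient backward.
            z_hat = z + tf.stop_gradient(tf.round(z) - z)
        else:
            z_noisy = z_hat = tf.round(z)
        return z_hat, self._bits(z_noisy)


# Sketch of a training step implementing eq. (6); `lam` is the trade-off weight lambda:
#   z = feature_extractor(images)              # f_z(x; theta_z)
#   z_hat, bits = bottleneck(z, training=True) # quantized features and their rate estimate
#   logits = classifier(z_hat)                 # f_yhat(z_hat; theta_yhat)
#   loss = cross_entropy(labels, logits) + lam * tf.reduce_mean(bits)
```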
3. EXPERIMENTS
We evaluate our method using classification models, as they are the most common off-the-shelf feature generation method. Unless otherwise stated, in all the following experiments, we follow the standard practice of considering the activations of the penultimate layer, immediately after the non-linearity, as the feature layer. We treat it as the bottleneck ẑ and apply the rate penalty over it. We train several models by varying the trade-off parameter λ and present our results in the form of an error vs. relative compression graph. Relative compression is measured as a fraction of the compressed size achieved by the lossless compression baseline zlib described below. For all the methods, the representation for each image is computed in float32 precision and compressed independently. Additional details are provided in section C of the appendix. We compare our method with the following standard baselines.
Lossless compression:
The features are compressed using the gzip-compatible zlib compression library in Python. The representation is first converted to a byte array and then compressed using zlib at the highest compression level of 9.
16bit-gzip:
The features are first cast to a 16-bit floating point representation and then losslessly compressed using zlib as described above.
Quantized:
The features are scaled to a unit range followed by quantization to equal-length intervals. We report performance as a function of the number of quantization bins in the set { , , , }. These quantized values are losslessly compressed using gzip as above. If fewer than 256 quantization bins are used, the data is natively stored as a byte, not packed, before gzip is used.

PCA:
We compute principal components from the full covariance matrix of the features computed over the entire training set. We report the performance as a function of the number of components used from the set { , , , , , , }. We exclude the cost of the PCA basis from the compression cost.
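As a rough sketch of how the simpler baselines above could be evaluated for a single float32 feature vector (our reading of the description, not the exact evaluation code; function names and the example bin count are ours):

```python
import zlib
import numpy as np

def lossless_size(z):
    # Lossless baseline: zlib at the highest compression level over raw float32 bytes.
    return len(zlib.compress(np.asarray(z, dtype=np.float32).tobytes(), 9))

def gzip16_size(z):
    # 16bit-gzip baseline: cast to float16, then compress losslessly.
    return len(zlib.compress(np.asarray(z, dtype=np.float16).tobytes(), 9))

def quantized_size(z, num_bins=256):  # num_bins is an example value
    # Quantized baseline: scale to the unit range, quantize to equal-width bins,
    # store one byte per value (not bit-packed), then compress losslessly.
    z = np.asarray(z, dtype=np.float32)
    z01 = (z - z.min()) / (z.max() - z.min() + 1e-12)
    q = np.round(z01 * (num_bins - 1)).astype(np.uint8)
    return len(zlib.compress(q.tobytes(), 9))

# Relative compression ratio of any method, as plotted in Fig. 1:
#   ratio = compressed_size_method / lossless_size(z)
```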
CIFAR-10/100:
The CIFAR-10 and CIFAR-100 image classification datasets contain 10 and 100 classes respectively. Both contain 50000 training and 10000 testing images. We use a 20-layer ResNetV2 [18] model and train using SGD with Adam [19] for 128k iterations with a batch size of 128.

Results:
Figure 1 shows that our method consistently produces features that are an order of magnitude more compressible than when losslessly compressed, while maintaining the discriminative power of the learned features. We visualize the classification error of the decompressed features as a function of the relative compression ratio with respect to the lossless compression. On CIFAR-10 our method is able to produce features that are 1% the size of the lossless compression while matching the accuracy. This is likely due to the fact that there are only 10 classes, which would ideally require only log₂ 10 ≈ 3.3 bits. For CIFAR-100, we observe that our method produces features that can be compressed to 10% the size of the lossless compression while retaining the same accuracy. Here we see an order of magnitude reduction in the achieved compression in comparison to CIFAR-10, with an order of magnitude increase in the number of categories (from 10 to 100). On both datasets, 16bit-gzip consistently retains performance, indicating that 16-bit precision is accurate enough for these features. Quantization quickly loses performance as the number of quantization bins is decreased. PCA performs better than the other baselines. However, its performance quickly degrades as fewer components are used. The results summarized in Table 1 also show that the best performing rate points on the validation set also exhibit a higher training error than the baseline. This is an indication of the regularization effect of our method.

ImageNet:
We train on the ≈1.28M training images and report results on the 50000 validation images of the ImageNet classification dataset [20]. We use a 50-layer ResNetV2 [18] model as our base model. All networks are trained using SGD with Adam [19] for 300k steps using a batch size of 256.

Results:
Our method produces highly compressible representations in comparison to the other baseline methods and is able to preserve the accuracy while reducing the storage cost to a small fraction of the losslessly compressed file size (25.95% error for ours vs. 25.91% for lossless). Note that lossless storage at 16-bit precision results in a 0.14% increase in error. Similar to the CIFAR-10/100 datasets, we observe a regularization effect. As evident in Table 1, despite a higher error on the training set in comparison to the baseline, validation performance improves.

YouTube-8M:
YouTube-8M [11] is one of the largest publicly available video classification datasets. We use the second version (v2), which contains 6.1 million videos and 3862 classes. We first aggregate the video sequence features into a fixed-size vector using mean pooling. We use a three fully-connected layer model, with ReLU activation and batch normalization on the hidden layers. We apply the compression on the last hidden activations, just before the output layer. Models are trained using TensorFlow's Adam optimizer [19] for 300,000 steps using a batch size of 100.
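For orientation, a hedged sketch of a classifier with this shape is given below (ours, not the released training code; the hidden width and the per-frame feature size are assumptions, and the rate penalty itself is omitted and only marked by a comment):

```python
import tensorflow as tf

NUM_CLASSES = 3862
HIDDEN_UNITS = 1024  # hypothetical width; the layer sizes are not stated in this section
FRAME_DIM = 1152     # assumed per-frame feature size (1024 visual + 128 audio)

frames = tf.keras.Input(shape=(None, FRAME_DIM))            # variable-length frame features
video = tf.keras.layers.GlobalAveragePooling1D()(frames)    # mean pooling over time

h = video
for _ in range(2):                                           # two hidden fully-connected layers
    h = tf.keras.layers.Dense(HIDDEN_UNITS)(h)
    h = tf.keras.layers.BatchNormalization()(h)
    h = tf.keras.layers.ReLU()(h)

# The entropy penalty R(z_hat) would be applied to `h`, the last hidden activations,
# before the final classification layer (cf. the bottleneck sketch in Section 2).
logits = tf.keras.layers.Dense(NUM_CLASSES)(h)
model = tf.keras.Model(frames, logits)
```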
Fig. 1: We visualize the classification error of the decompressed features as a function of the relative compression ratio with respect to the lossless compression on CIFAR-10 (1a), CIFAR-100 (1b), ImageNet (1c) and YouTube-8M (1d). On CIFAR-10 our method produces representations that preserve the accuracy at 1% of the losslessly compressed size, while on CIFAR-100 at 10% of the losslessly compressed size. On both ImageNet and YouTube-8M, our method preserves the accuracy while substantially reducing the storage cost relative to lossless compression (to about 10% of the losslessly compressed file size for YouTube-8M).

Results:
Figure 1d and Table 1 report the accuracy on the validation partition of YouTube-8M [11]. Similar to the other benchmarks, our method can drastically lower the storage requirements while preserving and even improving the accuracy, providing further evidence for the regularization effect. Refer to section D in the appendix for additional results and discussion.
Please refer to section B in the appendix for additional details and discussion.

Table 1: We compare the total compressed size of the evaluation datasets along with the final training and validation errors. For each dataset we select the lowest rate model with error lower than the baseline. For YouTube-8M the reported size is of video-level features. The gap between train and validation errors is consistently smaller for our model, indicating that the entropy penalty has a regularization effect. At the same time, our model significantly reduces the total size.

              Training Error     Validation Error    Validation Set Size
              Lossless   Ours    Lossless   Ours     Lossless   Ours      Raw
ImageNet      17.04      17.35   25.91      25.89    6.95GB     0.85GB    38.15GB
CIFAR-10      0.14       0.29    8.73       8.45     41.53MB    2.78MB    156.25MB
CIFAR-100     0.25       1.54    33.39      33.03    69.14MB    9.28MB    156.25MB
YouTube-8M    19.56      19.75   19.76      19.49    5.30GB     0.27GB    17.80GB
4. RELATED WORK

Representations from off-the-shelf compression algorithms:
Compressed representations have been directly used for training machine learning algorithms as they have low memory and computational requirements and enable efficient real-time processing while avoiding decoding overhead. Aghagolzadeh and Radha [21] used a linear SVM classifier for pixel classification on compressive hyperspectral data. Hahn et al. [22] performed hyperspectral pixel classification in the compressive domain using an adaptive probabilistic approach. Fu et al. [23] fed DCT-compressed image data into the network to speed up machine learning algorithms applied to the images. Biswas et al. [24] proposed an approach to classify H.264 compressed videos. Chadha et al. [25] used a 3D CNN architecture for video classification that directly utilized compressed video bitstreams. Yeo et al. [26] designed a system for performing action recognition on videos compressed with MPEG. Kantorov et al. [27] proposed a method for extracting and encoding local video descriptors for action recognition on MPEG compressed video representations. Our work differs in that we jointly optimize for a compressible representation along with the target task.
Joint optimization for compression:
Torfason et al. [28] propose to extend an auto-encoding compression network by adding an additional inference branch over the bottleneck for auxiliary tasks. Our method does not use any auto-encoding penalty and directly optimizes for the entropy along with the task-specific objective.
Dimensionality reduction methods:
While not performing information-theoretic compression, there are many lossy dimensionality reduction methods which can be applied to reduce the space needed for precomputed CNN features, for example PCA, LDA, ICA, and Product Quantization [29]. However, none of these methods takes into account the task-specific loss. Instead they all rely on surrogate losses (e.g., the L2 reconstruction error).

Similarity preserving hashing:
Hashing based methods have been used to produce a neighborhood-preserving compact binary embedding of the data [30–33]. Such methods are similar to compression in that a binary representation smaller than the data itself is found. However, the storage size is a hyperparameter that is not directly optimized, and representations are typically of identical length with the goal of minimizing lookup time as opposed to storage. Refer to section B.4 in the appendix for further details.
Variational information bottleneck:
Our approach can be viewed as a particular instantiation of the more general information bottleneck framework [34, 35]. While these works discuss mutual information as the parameterization-independent measure of informativeness, we note that a task-dependent measure can typically be used and may be better suited if target tasks are known ahead of time. As most classification models are typically trained to optimize cross entropy, we use the same in this paper as a measure of informativeness.
5. CONCLUSION
We presented an end-to-end trained method to learn compressible features while training for a task-dependent objective. By evaluating on four different benchmarks we demonstrated that our method achieves high compression rates compared to classical methods, while having a regularization effect leading to a consistent improvement in accuracy across benchmarks.
References

[1] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915, 2016.
[2] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[3] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587. IEEE, 2014.
[4] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. Towards accurate multi-person pose estimation in the wild. arXiv:1701.01779, 2017.
[5] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. arXiv:1406.2199, 2014.
[6] Pulkit Agrawal, Ross B. Girshick, and Jitendra Malik. Analyzing the performance of multilayer neural networks for object recognition. In ECCV, 2014.
[7] Mihir Jain, Jan C. van Gemert, and Cees G. M. Snoek. What do 15,000 object categories tell us about classifying and localizing actions? In CVPR, 2015.
[8] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv:1506.06579, 2015.
[9] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833. Springer, 2014.
[10] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR Workshops, pages 512–519. IEEE, 2014.
[11] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. YouTube-8M: A large-scale video classification benchmark. arXiv:1609.08675, 2016.
[12] Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3), 1948. doi: 10.1002/j.1538-7305.1948.tb01338.x.
[13] Jorma Rissanen and Glen G. Langdon, Jr. Universal modeling and coding. IEEE Transactions on Information Theory, 27(1), 1981. doi: 10.1109/TIT.1981.1056282.
[14] Jan van Leeuwen. On the construction of Huffman trees. In ICALP, pages 382–410, 1976.
[15] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006. doi: 10.1126/science.1127647.
[16] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. End-to-end optimization of nonlinear transform codes for perceptual quality. In Picture Coding Symposium, 2016.
[17] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. In ICLR, 2018.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645. Springer, 2016.
[19] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
[20] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[21] Mohammad Aghagolzadeh and Hayder Radha. On hyperspectral classification in the compressed domain. arXiv:1508.00282, 2015.
[22] Jurgen Hahn, Simon Rosenkranz, and Abdelhak M. Zoubir. Adaptive compressed classification for hyperspectral imagery. In ICASSP, pages 1020–1024. IEEE, 2014.
[23] Dan Fu and Gabriel Guimaraes. Using compression to speed up image classification in artificial neural networks. 2016.
[24] Sovan Biswas and R. Venkatesh Babu. H.264 compressed video classification using histogram of oriented motion vectors (HOMV). In ICASSP, pages 2040–2044. IEEE, 2013.
[25] Aaron Chadha, Alhabib Abbas, and Yiannis Andreopoulos. Video classification with CNNs: Using the codec as a spatio-temporal activity sensor. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
[26] Chuohao Yeo, Parvez Ahammad, Kannan Ramchandran, and S. Shankar Sastry. High-speed action recognition and localization in compressed domain videos. IEEE Transactions on Circuits and Systems for Video Technology, 18(8):1006–1015, 2008.
[27] Vadim Kantorov and Ivan Laptev. Efficient feature extraction, encoding and classification for action recognition. In CVPR, pages 2593–2600, 2014.
[28] Robert Torfason, Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. Towards image understanding from deep compression without decoding. In ICLR, 2018.
[29] Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. TPAMI, 33(1):117–128, 2011.
[30] Haomiao Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen. Deep supervised hashing for fast image retrieval. In CVPR, pages 2064–2072, 2016.
[31] Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan. Simultaneous feature learning and hash coding with deep neural networks. In CVPR, pages 3270–3278, 2015.
[32] Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan. Supervised hashing for image retrieval via image representation learning. In AAAI, volume 1, page 2, 2014.
[33] Fang Zhao, Yongzhen Huang, Liang Wang, and Tieniu Tan. Deep semantic ranking based hashing for multi-label image retrieval. In CVPR, pages 1556–1564, 2015.
[34] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
[35] Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv:1612.00410, 2016.
[36] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
[37] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
APPENDIX & SUPPLEMENTAL MATERIAL

Fig. 2: Overview of our method. During training (green box and arrows), uniform noise u is added to z to simulate quantization while allowing gradient-based optimization. An entropy model p(ẑ; φ) is used to model the distribution of ẑ and impose the compression loss R(ẑ). During testing (red box and arrows), z is quantized by rounding to yield ẑ, and entropy coding (EC) is then used for lossless compression, yielding a variable-length bit string for storage and transmission. This bit string can be decoded using entropy decoding (ED) to yield ẑ, which can then be further processed. Both EC and ED use probability tables produced from the entropy model p(ẑ; φ) after training is complete. Note that ẑ represents the noise-added z as well as the quantized z, depending on the context.

A. ADDITIONAL METHOD DETAILS
In fig. 2 we show the model during training using the green box and green arrows, while the red box and red arrows show the model at test time. Common components are shown outside the colored boxes and use black arrows. Note that during training, uniform noise is added to simulate quantization, while during testing, rounding is used for quantization. Since entropy coding and decoding are lossless operations, they are not used during training. Also note that the entropy model p(ẑ; φ) learned during training is used to produce probability tables which are used by entropy coding and decoding during testing. These dependencies are shown as dotted arrows in the figure. Once the model is trained and the probability tables are produced, the entropy model p(ẑ; φ) is not required and can be discarded.

B. ADDITIONAL DISCUSSION

B.1. Importance of Adam optimizer
The weight λ for the entropy penalty in eq. (6) can affect the magnitude of the gradient updates for the parameters φ of the probability model. A smaller value of λ can reduce the effective learning rate of φ, causing the model to learn more slowly. This may result in a disconnect between the observed distribution and the model. The Adam optimizer [19] computes updates normalized by the square root of a running average of the squared gradients. This has the desirable property that a constant scaling of the loss does not affect the magnitude of updates. Therefore, for the combined loss in eq. (6), λ only affects the relative weight of the gradient due to the entropy penalty, without changing the effective learning rate of φ.
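A small, self-contained check of this scale-invariance property (illustrative only; not from the paper):

```python
import tensorflow as tf

def final_weight(loss_scale, steps=2000, lr=1e-2):
    # Minimize `loss_scale * w^2` with Adam; the end point is (nearly) independent
    # of loss_scale because updates are normalized by the square root of a running
    # average of the squared gradients.
    w = tf.Variable(1.0)
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            loss = loss_scale * tf.square(w)
        opt.apply_gradients([(tape.gradient(loss, w), w)])
    return float(w.numpy())

# final_weight(1.0) and final_weight(1e-3) both end up close to the minimum,
# whereas plain SGD with the same learning rate would move ~1000x more slowly
# for the smaller scale.
```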
B.2. Regularization effect

In addition to pure lossy compression, we show that our method has an interesting side effect: it acts as an activation regularizer, allowing higher classification accuracy on the validation set than the original network, while exhibiting higher training error (Table 1). Interestingly, the sweet spot of the regularization effect may provide some insight into the complexity of the problem to be solved. Unlike normal regularization methods, our approach makes a trade-off in the information passed between the encoder network and the classifier; therefore we can explicitly measure how much information is required for a particular classification task. We observed that CIFAR-100 requires less compression to achieve the best result, whereas CIFAR-10 requires about half as much information to obtain the best result, which signals that the network designed to solve both problems is perhaps a bit larger than it should be in the case of CIFAR-10.
B.3. Note on deep compression without decoding [28]:
While there are significant differences in the model and setup of Torfason et al. [28], we can qualitatively compare the performance on the ImageNet classification task in terms of the relative increase in error rate versus the baselines at roughly 0.3 bits per pixel (bpp). We observe a relative increase of 10% in error at 0.388 bpp (corresponding to the highest compression rate in Fig. 1c), while in [28] (Table 2), the relative increase reported at 0.330 bpp is 20.3%, indicating that our model is better able to preserve the informativeness of the features.
B.4. Differences from similarity preserving hashing
We enumerate the key differences below:

• In hashing, the binary representations are required to preserve neighborhoods to enable direct retrieval based on the hash value. In our method the compressed bits are the output of arithmetic coding, with no such constraints.

• The storage size of the representation is fixed (a hyperparameter) in binary hashing and not directly optimized, while in compression it is directly incorporated in the loss (as a rate term, eq. (6)) to trade off with accuracy.

• In hashing, values of identical length are typically produced, with lookup-speed benefits, while in compression storage size is the primary concern and entropy coding is used to produce variable-length bit representations.
C. TRAINING DETAILS

C.1. CIFAR-10/100
We used a cosine decay learning rate schedule [36] with an initial learning rate of 0.005 (selected as the best among {0.1, 0.05, 0.025, 0.01, 0.005, 0.0025, 0.001}). We use the standard data augmentation of left-right flips and zero-padding all sides by 4 pixels followed by a 32 × 32 crop. We use a weight decay of 0.0001 and train our model on a single GPU using a batch size of 128.
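A minimal sketch of such a schedule in TensorFlow (an assumption about the exact form; the paper only cites [36], whose schedule also supports warm restarts, omitted here):

```python
import tensorflow as tf

# Single-cycle cosine decay of the learning rate over the whole training run.
initial_lr = 0.005
total_steps = 128_000
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(initial_lr, decay_steps=total_steps)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```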
C.2. ImageNet

We use a cosine decay learning rate schedule [36] with an initial learning rate of 0.001. We use the standard data augmentation as used in [37] and train on crops of 224 × 224. We use a weight decay of 0.0001 and train each model on 8 GPUs with a batch size of 32 per GPU, resulting in a combined batch size of 256 with synchronous updates. We report top-1 classification error computed on a 224 × 224 center crop.

C.3. YouTube 8 Million Dataset
We apply weight decay and train each model on one CPU with a batch size of 100, minimizing the cross-entropy loss using TensorFlow's Adam optimizer [19] for 300,000 steps. We sweep the initial learning rate over a small set of choices, decaying it by a constant factor at regular intervals, and for each model architecture we use the best learning rate according to a held-out validation set.

D. ADDITIONAL RESULTS ON YOUTUBE 8 MILLION DATASET
Figure 1d in the main text presented results for a three-layer model. For comparison, we also present results for a two-layer model in fig. 3a. Figure 1d is reproduced as fig. 3b for ease of comparison. As before, we observe a drastic reduction in the storage requirements while preserving accuracy. However, the three-layer model preserves accuracy up to a higher compression ratio than the two-layer model. Note that the accuracy metrics are measured on the "validation" partition of YouTube-8M [11].