Neural Networks Compression for Language Modeling
Artem M. Grachev, Dmitry I. Ignatov, and Andrey V. Savchenko

Samsung R&D Institute Rus, Moscow, Russia
National Research University Higher School of Economics, Moscow, Russia
National Research University Higher School of Economics, Laboratory of Algorithms and Technologies for Network Analysis, Nizhny Novgorod, Russia

[email protected]
Abstract.
In this paper, we consider several compression techniques for the language modeling problem based on recurrent neural networks (RNNs). It is known that conventional RNNs, e.g., LSTM-based networks in language modeling, are characterized by either high space complexity or substantial inference time. This problem is especially crucial for mobile applications, in which the constant interaction with a remote server is inappropriate. Using the Penn Treebank (PTB) dataset, we compare pruning, quantization, low-rank factorization, and tensor train decomposition for LSTM networks in terms of model size and suitability for fast inference.
Keywords:
LSTM, RNN, language modeling, low-rank factorization, pruning, quantization
1 Introduction

Neural network models can require a lot of space on disk and in memory. They can also need a substantial amount of time for inference. This is especially important for models that we put on devices like mobile phones. There are several approaches to solving these problems. Some of them are based on sparse computations and include pruning or more advanced methods. In general, such approaches are able to provide a large reduction in the size of a trained network when the model is stored on disk. However, some problems arise when such models are used for inference, caused by the high computation time of sparse operations. Another branch of methods uses different matrix-based approaches in neural networks: methods based on Toeplitz-like structured matrices [1] or different matrix decomposition techniques, such as low-rank decomposition [1] and TT-decomposition (tensor train decomposition) [2,3]. Also, [4] proposes a new type of RNN, called uRNN (Unitary Evolution Recurrent Neural Networks).

In this paper, we analyze some of the aforementioned approaches. The material is organized as follows. In Section 2, we give an overview of language modeling methods and then focus on the respective neural network approaches. Next, we describe different types of compression. In Section 3.1, we consider the simplest methods for neural network compression, such as pruning and quantization. In Section 3.2, we consider approaches to compression of neural networks based on different matrix factorization methods. Section 3.3 deals with TT-decomposition. Section 4 describes our results and some implementation details. Finally, in Section 5, we summarize the results of our work.
2 Language Modeling with Neural Networks

Consider the language modeling problem: we need to compute the probability of a sentence, i.e. a sequence of words (w_1, ..., w_T), in a language L:

P(w_1, ..., w_T) = P(w_1, ..., w_{T-1}) P(w_T | w_1, ..., w_{T-1}) = ∏_{t=1}^{T} P(w_t | w_1, ..., w_{t-1}).   (1)

Using such a model directly would require calculating P(w_t | w_1, ..., w_{t-1}), which is in general too difficult due to the large number of computation steps. That is why a common approach is to fix some value of N and approximate (1) with P(w_t | w_{t-N}, ..., w_{t-1}). This leads us to the widely known N-gram models [5,6], which were a very popular approach until the middle of the 2000s. A new milestone in language modeling was the use of recurrent neural networks [7]. A lot of work in this area was done by Tomas Mikolov [8].

Consider a recurrent neural network (RNN), where N is the number of time steps, L is the number of recurrent layers, and x_{ℓ-1}^t is the input of layer ℓ at moment t. Here t ∈ {1, ..., N}, ℓ ∈ {1, ..., L}, and x_0^t is the embedding vector. We can describe each layer as follows:

z_ℓ^t = W_ℓ x_{ℓ-1}^t + V_ℓ x_ℓ^{t-1} + b_ℓ,   (2)
x_ℓ^t = σ(z_ℓ^t),   (3)

where W_ℓ and V_ℓ are weight matrices and σ is an activation function. The output of the network is given by

y^t = softmax(W_{L+1} x_L^t + b_{L+1}).   (4)

Then we define

P(w_t | w_{t-N}, ..., w_{t-1}) = y^t.   (5)

While N-gram models require a lot of space even for moderate N due to the combinatorial explosion, neural networks can learn representations of words and their sequences without directly memorizing all options.

The currently used variations of RNNs are designed to solve the problem of decaying gradients [9]. The most popular variations are Long Short-Term Memory (LSTM) [7] and the Gated Recurrent Unit (GRU) [10]. Let us describe one layer of LSTM:

i_ℓ^t = σ(W_ℓ^i x_{ℓ-1}^t + V_ℓ^i x_ℓ^{t-1} + b_ℓ^i)   (input gate),   (6)
f_ℓ^t = σ(W_ℓ^f x_{ℓ-1}^t + V_ℓ^f x_ℓ^{t-1} + b_ℓ^f)   (forget gate),   (7)
c_ℓ^t = f_ℓ^t · c_ℓ^{t-1} + i_ℓ^t · tanh(W_ℓ^c x_{ℓ-1}^t + V_ℓ^c x_ℓ^{t-1} + b_ℓ^c)   (cell state),   (8)
o_ℓ^t = σ(W_ℓ^o x_{ℓ-1}^t + V_ℓ^o x_ℓ^{t-1} + b_ℓ^o)   (output gate),   (9)
x_ℓ^t = o_ℓ^t · tanh(c_ℓ^t),   (10)

where again t ∈ {1, ..., N}, ℓ ∈ {1, ..., L}, and c_ℓ^t is the memory vector at layer ℓ and time step t. The output of the network is given by the same formula (4) as above.

3 Compression Methods

Approaches to the language modeling problem based on neural networks are efficient and widely adopted, but they still require a lot of space. In each LSTM layer with input and hidden size k we have 8 matrices of size k × k. Moreover, the first (or zeroth) layer of such a network is usually an embedding layer that maps a word's vocabulary index to some vector, and we need to store this embedding matrix too. Its size is n_vocab × k, where n_vocab is the vocabulary size. We also have an output softmax layer with the same number of parameters as the embedding, i.e. k × n_vocab. In our experiments, we try to reduce the embedding size and to decompose the softmax layer as well as the hidden layers.

We perform our compression experiments on standard PTB models. There are three main benchmarks: the Small, Medium and Large LSTM models [11], but we mostly work with the Small and Medium ones.
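To make this size accounting concrete, the following sketch (our illustration, not code from the paper) counts the parameters of such an LSTM language model. For the Small PTB configuration (two layers, k = 200, and the standard 10,000-word PTB vocabulary, which we assume here) it gives roughly 4.65 million parameters, in line with the 4.64 M reported in Table 1 below.

```python
def lstm_lm_params(n_vocab, k, n_layers):
    """Rough parameter count of an LSTM language model.

    Assumes the embedding size equals the hidden size k, as in the PTB benchmarks.
    """
    embedding = n_vocab * k            # input embedding matrix, n_vocab x k
    per_layer = 8 * k * k + 4 * k      # eight k x k matrices (W and V for the four gates) plus biases
    softmax = k * n_vocab + n_vocab    # output projection, k x n_vocab, plus bias
    return embedding + n_layers * per_layer + softmax

# Small PTB configuration: about 4.65 million parameters
print(lstm_lm_params(n_vocab=10000, k=200, n_layers=2))
```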
3.1 Pruning and Quantization

In this subsection, we consider techniques that are perhaps not very effective but still useful. Some of them were described in application to audio processing [12] or image processing [13,14], but for language modeling this field is not yet well explored.

Pruning is a method for reducing the number of parameters of a neural network. In Fig. 1 (left), we can see that the majority of weight values are usually concentrated near zero. It means that such weights do not provide a valuable contribution to the final output. We can therefore set some threshold and remove from the network all connections whose weights are below it. After that, we retrain the network to learn the final weights for the remaining sparse connections.

Fig. 1. Weights distribution before (left) and after (right) pruning (histograms of weight values; axes: Value vs. Frequency).

Quantization is a method for reducing the size of a compressed neural network in memory. We compress each float value to an eight-bit integer representing the closest real number in one of 256 equally sized intervals within the weights' range.
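As a minimal sketch of these two operations (our illustration, not the authors' implementation), magnitude pruning and uniform 8-bit quantization of a single weight matrix can be written as follows; the threshold value and the matrix size are arbitrary here:

```python
import numpy as np

def prune(weights, threshold):
    """Zero out connections whose absolute weight is below the threshold.

    The boolean mask is returned so that only the surviving weights are
    updated when the network is retrained afterwards.
    """
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

def quantize_uint8(weights):
    """Map each float to one of 256 equally sized intervals within [min, max]."""
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / 255.0
    codes = np.round((weights - w_min) / scale).astype(np.uint8)
    return codes, w_min, scale

W = np.random.randn(200, 200).astype(np.float32)   # a toy weight matrix
W_sparse, mask = prune(W, threshold=0.1)
codes, w_min, scale = quantize_uint8(W)            # 1 byte per weight instead of 4
```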
Pruning and quantization share common disadvantages: training from scratch is impossible and their usage is quite laborious. For pruning, the reason mostly lies in the inefficiency of sparse computations. With quantization, we store the model in an 8-bit representation, but we still need to perform 32-bit computations, which means that we get no advantage in RAM usage at inference time, at least unless we use hardware such as a tensor processing unit (TPU) adapted for efficient 8- and 16-bit computations.
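To illustrate this point, the 8-bit codes from the previous sketch still have to be expanded back to 32-bit floats before an ordinary matrix-vector product, so the arithmetic itself does not get cheaper (again only our illustration, with toy values):

```python
import numpy as np

def dequantize(codes, w_min, scale):
    # Restore approximate float32 weights from the stored 8-bit codes
    return codes.astype(np.float32) * scale + w_min

codes = np.random.randint(0, 256, size=(200, 200), dtype=np.uint8)  # toy stored weights
W_approx = dequantize(codes, w_min=-0.5, scale=1.0 / 255.0)
x = np.random.randn(200).astype(np.float32)
y = W_approx @ x   # inference still runs in full 32-bit precision
```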
3.2 Low-Rank Factorization

Low-rank factorization represents a more powerful family of methods. For example, in [1] the authors applied it to a voice recognition task. A simple factorization can be done as follows:

x_ℓ^t = σ(W_ℓ^a W_ℓ^b x_{ℓ-1}^t + U_ℓ^a U_ℓ^b x_ℓ^{t-1} + b_ℓ).   (11)

Following [1], we require W_ℓ^b = U_{ℓ-1}^b. After this we can rewrite the equations for the RNN:

x_ℓ^t = σ(W_ℓ^a m_{ℓ-1}^t + U_ℓ^a m_ℓ^{t-1} + b_ℓ),   (12)
m_ℓ^t = U_ℓ^b x_ℓ^t,   (13)
y^t = softmax(W_{L+1} m_L^t + b_{L+1}).   (14)

For LSTM it is mostly the same, only with more complicated formulas. The main advantage comes from the sizes of the matrices W_ℓ^a, U_ℓ^a and U_ℓ^b: they have sizes n × r and r × n, whereas the original matrices W_ℓ and V_ℓ have size n × n. With a small r we gain both in size and in multiplication speed. We discuss some implementation details in Section 4.

3.3 TT-Decomposition

In the light of recent advances of the tensor train approach [2,3], we have also decided to apply this technique to LSTM compression in language modeling.

The tensor train decomposition was originally proposed as an alternative and more efficient form of tensor representation [15]. The TT-decomposition (or TT-representation) of a tensor A ∈ R^{n_1 × ... × n_d} is a set of matrices G_k[j_k] ∈ R^{r_{k-1} × r_k}, where j_k = 1, ..., n_k, k = 1, ..., d, and r_0 = r_d = 1, such that each element of the tensor can be represented as A(j_1, j_2, ..., j_d) = G_1[j_1] G_2[j_2] ... G_d[j_d]. In the same paper, the author proposes to consider an input matrix as a multidimensional tensor and apply the same decomposition to it. If we have a matrix A of size N × M, we can fix d and choose n_1, ..., n_d and m_1, ..., m_d such that ∏_{j=1}^{d} n_j = N and ∏_{i=1}^{d} m_i = M. Then we reshape the matrix A into a tensor 𝒜 with d dimensions of sizes n_1 m_1 × n_2 m_2 × ... × n_d m_d and, finally, perform the tensor train decomposition of this tensor. This approach was successfully applied to compress fully connected neural networks [2] and to develop a convolutional TT layer [3].

In turn, we have applied this approach to LSTM. As above for the usual matrix decomposition, here we also describe only the RNN layer. We apply TT-decomposition to each of the matrices W and V in equation (2) and get

z_ℓ^t = TT(W_ℓ) x_{ℓ-1}^t + TT(V_ℓ) x_ℓ^{t-1} + b_ℓ,   (15)

where TT(W) means that we apply TT-decomposition to the matrix W. It is necessary to note that even with a fixed number of tensors in the TT-decomposition and fixed sizes, we still have plenty of variants, because we can choose the rank of each tensor core.
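Below is a minimal TT-SVD sketch (our own NumPy illustration, not the implementation used in the paper): a weight matrix is reshaped into a d-dimensional tensor and decomposed into cores G_k by successive truncated SVDs. The shape (24, 25, 24, 25) and the maximum rank are arbitrary illustrative choices for a 600 × 600 matrix.

```python
import numpy as np

def tt_decompose(matrix, dims, max_rank):
    """TT-SVD of a matrix reshaped into a tensor with the given dims.

    Returns cores G_k of shape (r_{k-1}, dims[k], r_k) with r_0 = r_d = 1.
    """
    assert np.prod(dims) == matrix.size
    d = len(dims)
    cores, r_prev = [], 1
    rest = matrix.reshape(dims)
    for k in range(d - 1):
        unfolding = rest.reshape(r_prev * dims[k], -1)
        u, s, vt = np.linalg.svd(unfolding, full_matrices=False)
        r = min(max_rank, s.size)
        cores.append(u[:, :r].reshape(r_prev, dims[k], r))
        rest = np.diag(s[:r]) @ vt[:r, :]
        r_prev = r
    cores.append(rest.reshape(r_prev, dims[-1], 1))
    return cores

def tt_element(cores, index):
    """Reconstruct one element as the product G_1[j_1] G_2[j_2] ... G_d[j_d]."""
    res = np.ones((1, 1))
    for core, j in zip(cores, index):
        res = res @ core[:, j, :]
    return res[0, 0]

W = np.random.randn(600, 600)
cores = tt_decompose(W, dims=(24, 25, 24, 25), max_rank=16)
approx_element = tt_element(cores, (0, 1, 2, 3))
```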
4 Results

For testing pruning and quantization we choose the Small PTB benchmark. The results can be found in Table 1. We can see that we obtain a reduction in size at the cost of a small loss of quality.

For matrix decomposition we perform experiments with the Medium and Large PTB benchmarks. When we talk about language modeling, we must note that the embedding and the output layer each occupy about one third of the total network size, which leads to the necessity of reducing their sizes too. We reduce the output layer by applying matrix decomposition. We describe the sizes of LR LSTM 650-650, since it is the most useful model for practical applications. We start with the basic sizes 650 × 650 for W and V and 10000 × 650 for the embedding. We reduce each W and V down to 650 × 128 and reduce the embedding down to 10000 × 128. The value 128 is chosen as the most suitable power of 2 for efficient device implementation. We have performed several experiments, but this configuration is close to the best.
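As a sketch of how such rank-128 factors can be obtained for a single trained 650 × 650 matrix, one option (ours, for illustration only; the paper does not state how its factors are initialized) is a truncated SVD:

```python
import numpy as np

def low_rank_factors(W, r):
    """Return W_a (n x r) and W_b (r x n) such that W_a @ W_b approximates W."""
    u, s, vt = np.linalg.svd(W, full_matrices=False)
    W_a = u[:, :r] * s[:r]      # absorb the singular values into the left factor
    W_b = vt[:r, :]
    return W_a, W_b

W = np.random.randn(650, 650).astype(np.float32)
W_a, W_b = low_rank_factors(W, r=128)
# Storage: 650*650 = 422,500 values versus 650*128 + 128*650 = 166,400 values
x = np.random.randn(650).astype(np.float32)
y = W_a @ (W_b @ x)             # two thin products instead of one square one
```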
Our compressed model, LR LSTM 650-650, is even smaller than LSTM 200-200 and has better perplexity. The results of these experiments can be found in Table 2.

Table 1. Pruning and quantization results on the PTB dataset

Model | Size | No. of params | Test PP
LSTM 200-200 (Small benchmark) | 18.6 Mb | 4.64 M | 117.659
Pruning output layer 90%, w/o additional training | 5.5 Mb | 0.5 M | 149.310
Pruning output layer 90%, with additional training | 5.5 Mb | 0.5 M | 121.123
Quantization (1 byte per number) | 4.7 Mb | 4.64 M | 118.232

In the TT-decomposition we have some freedom in the choice of internal ranks and the number of tensor cores. We fix the basic configuration of an LSTM network with two 600-600 layers and four tensors for each matrix in a layer, and perform a grid search over different numbers of dimensions and various ranks. We have trained about 100 models using the Adam optimizer [16]. The average training time for each model is about 5-6 hours on a GeForce GTX TITAN X (Maxwell architecture), but unfortunately none of them has achieved acceptable quality. The best obtained result
(TT LSTM 600-600) is even worse than LSTM 200-200 both in terms of size and perplexity.
Table 2. Matrix decomposition results on the PTB dataset

Model | Size | No. of params | Test PP
PTB benchmarks:
  LSTM 200-200 | 18.6 Mb | 4.64 M | 117.659
  LSTM 650-650 | 79.1 Mb | 19.7 M | 82.07
  LSTM 1500-1500 | 264.1 Mb | 66.02 M | 78.29
Ours:
  LR LSTM 650-650 | 16.8 Mb | 4.2 M | 92.885
  TT LSTM 600-600 | 50.4 Mb | 12.6 M | 168.639
  LR LSTM 1500-1500 | 94.9 Mb | 23.72 M | 89.462
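As a consistency check we added (not part of the paper), the Size column in both tables corresponds to roughly 4 bytes (one 32-bit float) per parameter, and about 1 byte per parameter for the quantized model:

```python
# 4 bytes per float32 parameter
print(4.64e6 * 4 / 1e6)   # ~18.6 MB for LSTM 200-200
print(4.2e6 * 4 / 1e6)    # ~16.8 MB for LR LSTM 650-650
print(4.64e6 * 1 / 1e6)   # ~4.6 MB for the 8-bit quantized Small model (reported as 4.7 Mb)
```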
5 Conclusion

In this article, we have considered several methods of neural network compression for the language modeling problem. The first part is about pruning and quantization. We have shown that for language modeling there is no difference in applying these two techniques. The second part is about matrix decomposition methods, which show some advantages when we implement models on devices, since in such tasks there are usually tight restrictions on the model size and its structure. From this point of view, the model
LR LSTM 650-650 has nice characteristics: it is even smaller than the smallest benchmark on PTB and demonstrates quality comparable with the medium-sized benchmarks on PTB.

Acknowledgements.
This study is supported by the Russian Federation President grant MD-306.2017.9. A.V. Savchenko is supported by the Laboratory of Algorithms and Technologies for Network Analysis, National Research University Higher School of Economics.
References
1. Lu, Z., Sindhwani, V., Sainath, T.N.: Learning compact recurrent neural networks. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2016)
2. Novikov, A., Podoprikhin, D., Osokin, A., Vetrov, D.P.: Tensorizing neural networks. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015 (2015) 442-450
3. Garipov, T., Podoprikhin, D., Novikov, A., Vetrov, D.P.: Ultimate tensorization: compressing convolutional and FC layers alike. CoRR abs/1611.03214; NIPS 2016 workshop "Learning with Tensors: Why Now and How?" (2016)
4. Arjovsky, M., Shah, A., Bengio, Y.: Unitary evolution recurrent neural networks. In: Proceedings of the 33rd International Conference on Machine Learning (ICML 2016) (2016) 1120-1128
5. Jelinek, F.: Statistical Methods for Speech Recognition. MIT Press (1997)
6. Kneser, R., Ney, H.: Improved backing-off for m-gram language modeling. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (1995) 181-184
7. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8) (1997) 1735-1780
8. Mikolov, T.: Statistical Language Models Based on Neural Networks. PhD thesis, Brno University of Technology (2012)
9. Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J.: Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In: Kremer, S.C., Kolen, J.F. (eds.) A Field Guide to Dynamical Recurrent Neural Networks (2001)
10. Cho, K., van Merrienboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014)
11. Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. arXiv preprint (2014)
12. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. Acoustics, Speech and Signal Processing (ICASSP) (2016)
13. Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural networks for resource efficient transfer learning. arXiv preprint arXiv:1611.06440 (2016)
14. Rassadin, A.G., Savchenko, A.V.: Deep neural networks performance optimization in image recognition. In: Proceedings of the 3rd International Conference on Information Technologies and Nanotechnologies (ITNT) (2017)
15. Oseledets, I.V.: Tensor-train decomposition. SIAM J. Scientific Computing 33(5) (2011) 2295-2317
16. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)