Compressing LSTM Networks by Matrix Product Operators
Ze-Feng Gao∗, Xingwei Sun†, Lan Gao∗, Junfeng Li†, Zhong-Yi Lu∗‡

∗ Renmin University of China; email: {zfgao, lgao, zlu}@ruc.edu.cn
† Institute of Acoustics, Chinese Academy of Sciences and University of Chinese Academy of Sciences; email: {sunxingwei, lijunfeng}@hccl.ioa.ac.cn
‡ Corresponding author

Abstract
Long Short-Term Memory (LSTM) models are the building blocks of many state-of-the-art algorithms for Natural Language Processing (NLP). However, an LSTM model contains a large number of parameters, which usually requires a large amount of memory and computational resources for training and for predicting on new data, leading to computational inefficiency. Here we propose an alternative LSTM model that significantly reduces the number of parameters by representing the weight matrices with matrix product operators (MPO), which are used in physics to characterize the local correlations in quantum states. We further experimentally compare the compressed models based on the MPO-LSTM model and on the pruning method on sequence classification and sequence prediction tasks. The experimental results show that our proposed MPO-based method outperforms the pruning method.
Introduction

The Long Short-Term Memory (LSTM) model Hochreiter & Schmidhuber (1997) has become a popular choice for modeling many practical tasks, such as speech recognition Xiong et al. (2016), language modeling Jozefowicz et al. (2016); Shazeer et al. (2017); Sundermeyer et al. (2012), machine translation Wu et al. (2016), and many others. In such temporal and sequential modeling, many state-of-the-art results He et al. (2017); Lee et al. (2017); Seo et al. (2016); Peters et al. (2018) have been achieved with the LSTM model.

Nevertheless, the LSTM model has an obvious shortage in scalability: most LSTM models have a large number of parameters and incur a high computational cost. Since an LSTM model usually consists of multiple linear and nonlinear transformations, multiple high-dimensional matrices are required to represent its parameters. At each time-step, we need to apply multiple linear transformations between the dense matrices and the high-dimensional input and previous hidden state. Especially in the fields of speech recognition and machine translation, the latest models take a huge amount of computation with millions of parameters and can only be run in high-end cluster environments. This prevents LSTM models from being fast enough for large-scale real-time inference or small enough to be implemented in low-end devices such as mobile phones or embedded systems with limited memory Schuster (2010). Moreover, although the storage capacity of an LSTM model is considered to be proportional to the size of the model, recent research has proved the opposite, suggesting that LSTM models are indeed over-parameterized Denil et al. (2013); Levy et al. (2018); Melis et al. (2017); Merity et al. (2017). It is thus in strong demand to find a more efficient compression method for LSTM models.

To bridge the gap between the high performance and the high cost, many approaches have been proposed to compress such large, over-parameterized neural networks, including parameter pruning and sharing Gong et al. (2014); Huang et al. (2018), low-rank matrix decomposition Jaderberg et al. (2014), and knowledge distillation Hinton et al. (2015). However, most of these methods have been applied to feed-forward neural networks and convolutional neural networks (CNN), while little attention has been paid to compressing LSTM models Belletti et al. (2018); Lu et al. (2016), especially in NLP tasks. It is worth noting that See et al. (2016) applied parameter pruning to the standard Seq2Seq architecture Sutskever et al. (2014) in neural machine translation, which adopts LSTM models for the encoder and decoder. In addition, in language modeling, Tjandra et al. (2017) used the tensor-train decomposition proposed by Oseledets (2011), which is mathematically equivalent to matrix product operators; Wen et al. (2017) used binarization; Yang et al. (2017) adopted architectural changes to approximate a low-rank decomposition in part of the LSTM model; and Gao et al. (2020) demonstrated that the matrix product operator (MPO) method is effective for model compression by using MPOs to replace the linear transformations of fully connected and convolutional layers. Sun et al.
(2020) also verified that it is very effective to compress the linear parts of a network with the matrix product operator method on acoustic datasets. Nevertheless, to the best of our knowledge, no study has focused on compressing an LSTM model fully with the MPO-based representation.

In this work, we propose an MPO-LSTM model, which is an MPO network based on the LSTM model architecture. Specifically, we apply the MPO format to reformulate the two dense matrices in the LSTM model structure: one is the dense matrix between the input vector and the hidden layer, and the other is the dense matrix between the previous hidden state and the hidden layer at each time-step.

Our method differs from the existing methods in two respects. First, we propose a model called MPO-LSTM that combines the MPO format with the nonlinear structure, while previous methods use low-rank decomposition only in part of an LSTM model. Second, the proposed MPO-LSTM model can be applied to an existing model directly; no new layers need to be defined. In principle, our method can also be combined with methods of other categories, such as quantization Courbariaux et al. (2015) and Huffman coding Han et al. (2015), to obtain higher compression ratios. In the experiment section, we evaluate the proposed MPO-LSTM model on sequence classification and sequence prediction tasks, and compare it with the corresponding pruning method. The experimental results show that our method has more advantages than pruning.

In summary, our contributions are as follows:

• We demonstrate that deep neural networks can be well compressed using matrix product operators with proper structure and balanced dimensions, without performance degradation.
• We propose a novel compression method based on the MPO that can drastically compress an LSTM model while learning.

• We propose a new network structure using the MPO method, and show experimentally that we can achieve even higher accuracy than the pruning method at the same compression ratio in the natural language processing and acoustic fields.
Preliminaries

Learning long-range dependencies with a Recurrent Neural Network (RNN) is challenging due to the vanishing and exploding gradient problems Bengio et al. (1994); Pascanu et al. (2013). To address this issue, the LSTM model was introduced by Hochreiter & Schmidhuber (1997), with the following recurrent computations:
$$\mathrm{LSTM}: \; h_{t-1},\, c_{t-1},\, x_t \;\to\; h_t,\, c_t. \quad (1)$$

Here $x_t$ is the input vector, $h_t$ is the hidden state, and $c_t$ is the cell memory. Equation (1) is computed as follows:

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ \hat{c}_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} \left( \begin{pmatrix} W_i & W_h \end{pmatrix} \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix} \right), \quad (2)$$

$$W_i = \begin{pmatrix} W_{ii} \\ W_{fi} \\ W_{oi} \\ W_{ci} \end{pmatrix}, \qquad W_h = \begin{pmatrix} W_{ih} \\ W_{fh} \\ W_{oh} \\ W_{ch} \end{pmatrix}, \quad (3)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t, \qquad h_t = o_t \odot \tanh(c_t), \quad (4)$$

where $x_t \in \mathbb{R}^{N_x}$ and $h_t \in \mathbb{R}^{N_h}$ at time $t$. In the above equations, $\sigma(\cdot)$ and $\odot$ denote the sigmoid function and the element-wise multiplication operator, respectively; $i_t$, $f_t$, $o_t$, and $\hat{c}_t$ are respectively the input gates, the forget gates, the output gates, and the candidate memory cells. The input gates retain the candidate memory cell values that are useful for the current memory cell, and the forget gates retain the previous memory cell values that are also useful for the current memory cell. The output gates retain the memory cell values that are useful for the output and for the next time-step's hidden layer computation.

The major part of the computational cost lies in $W_i$ and $W_h$, where $W_i \in \mathbb{R}^{N_x \times N_h}$ is the input matrix and $W_h \in \mathbb{R}^{N_h \times N_h}$ is the hidden matrix at each time-step.
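To make the recurrence concrete, here is a minimal single-step sketch of Eqs. (2)-(4) in NumPy. It is our illustration rather than code from the paper; the function name `lstm_step` and the stacked gate layout are our own choices, and a bias term is included (which Eq. (2) omits but Eq. (9) below uses).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_i, W_h, b):
    """One LSTM time-step following Eqs. (2)-(4).

    W_i: (4*N_h, N_x) input matrix, the four gate blocks of Eq. (3) stacked.
    W_h: (4*N_h, N_h) hidden matrix, stacked the same way.
    """
    N_h = h_prev.shape[0]
    z = W_i @ x_t + W_h @ h_prev + b              # joint pre-activation, Eq. (2)
    i_t = sigmoid(z[0 * N_h:1 * N_h])             # input gate
    f_t = sigmoid(z[1 * N_h:2 * N_h])             # forget gate
    o_t = sigmoid(z[2 * N_h:3 * N_h])             # output gate
    c_hat = np.tanh(z[3 * N_h:4 * N_h])           # candidate memory cell
    c_t = f_t * c_prev + i_t * c_hat              # Eq. (4), element-wise products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```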
Pruning

Pruning is the most commonly used sparsification method in the original parameter domain. Han et al. (2015, 2016) recursively trained a neural network and pruned unimportant connections based on their weight magnitudes. Guo et al. (2016) proposed dynamic network surgery, which prunes and splices branches of the network. In the pruning method, some unimportant weights of the matrix between layers are cut off, usually by setting their values to zero; during training, pruning can be implemented by filtering the values of the weights. After pruning, only the important weights of the matrix are left, and the corresponding unimportant weights are trimmed away.

Matrix Product Operators

The MPO method originates from quantum many-body physics and is based on high-order tensor singular value decomposition. An MPO factorizes a higher-order tensor into a sequential product of so-called local tensors, and it is a more generalized form of the tensor-train approximation Verstraete et al. (2004). By representing the linear transformations of a model with MPOs, the number of parameters needed shrinks greatly, since the number of parameters contained in an MPO decomposition grows only linearly with the system size Poulin et al. (2011). The MPO method has been demonstrated to be effective for model compression by using MPOs to replace the linear transformations of fully-connected and convolutional layers Gao et al. (2020).

To clarify the MPO method, we assume a weight matrix $W \in \mathbb{R}^{N_x \times N_y}$ and reshape it into a $2n$-indexed tensor

$$W = W_{j_1 j_2 \cdots j_n,\, i_1 i_2 \cdots i_n}. \quad (5)$$

As Eq. (5) shows, the dimension $N_x$ is reshaped into a coordinate in an $n$-dimensional space labelled by $(i_1 i_2 \cdots i_n)$, so there is a one-to-one mapping between the input vector $x$ and the MPO labels $(i_1 i_2 \cdots i_n)$. Likewise, we can set up another one-to-one correspondence between $y$ and $(j_1 j_2 \cdots j_n)$. If $I_k$ and $J_k$ are the dimensions of the indices $i_k$ and $j_k$ respectively, then

$$\prod_{k=1}^{n} I_k = N_x, \qquad \prod_{k=1}^{n} J_k = N_y. \quad (6)$$

The MPO representation of $W$ is obtained by factorizing it into a product of $n$ local tensors (usually called core tensors),

$$W_{j_1 \cdots j_n,\, i_1 \cdots i_n} = \mathrm{Tr}\!\left( w^{(1)}[j_1, i_1]\, w^{(2)}[j_2, i_2] \cdots w^{(n)}[j_n, i_n] \right), \quad (7)$$

where $w^{(k)}[j_k, i_k]$ is a $d_{k-1} \times d_k$ matrix and $d_k$ denotes the dimension of the bond linking $w^{(k)}$ and $w^{(k+1)}$, with $d_0 = d_n = 1$.

Figure 1: (a) Graphical representation of the weight matrix W in a fully connected layer. The blue circles represent neurons. The solid line connecting input neuron x_i with output neuron y_j represents the weight element W_{ji}. (b) The weight matrix represented by an MPO. The local tensors w^{(k)} are represented by filled circles; the hollow circles denote the input and output indices i_l and j_l, respectively. Given i_k and j_k, w^{(k)}[j_k, i_k] is a matrix.

Any matrix can be written in the MPO representation; we show this schematically in Fig. 1. Such a matrix product structure reduces the scaling of the parameter number from exponential to polynomial, which is a great advantage of the MPO representation. Specifically, the parameter number shrinks as

$$\prod_{k=1}^{n} I_k J_k \;\to\; \sum_{k=2}^{n-1} I_k J_k d_{k-1} d_k + I_1 J_1 d_1 + I_n J_n d_{n-1}. \quad (8)$$

In Eq. (8), $d_k$ is dubbed the bond dimension. In practice, $d_k$ can be regarded as a tunable parameter that controls the accuracy of the representation: the larger $d_k$ is, the more parameters the MPO contains. We discuss this issue in detail in the sentiment analysis experiments below and illustrate it in Figure 2.
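As a concrete rendering of Eqs. (5)-(7), the sketch below factorizes a weight matrix into MPO core tensors by sequential truncated SVDs, i.e. the tensor-train construction of Oseledets (2011) that the text identifies with the MPO format. This is our own illustration under a stated reshaping convention (cores of shape (d_{k-1}, I_k, J_k, d_k)), not the authors' code; in the proposed model the cores are trained directly by backpropagation, so an SVD like this would at most serve for initialization or analysis.

```python
import numpy as np

def matrix_to_mpo(W, I_dims, J_dims, d_max):
    """Factorize W of shape (prod(I_dims), prod(J_dims)) into n core tensors
    w^(k) of shape (d_{k-1}, I_k, J_k, d_k), truncating every bond to d_max."""
    n = len(I_dims)
    # Reshape to the 2n-indexed tensor of Eq. (5), then interleave (i_k, j_k) pairs.
    T = W.reshape(list(I_dims) + list(J_dims))
    T = T.transpose([p for k in range(n) for p in (k, n + k)])
    cores, d_prev = [], 1
    for k in range(n - 1):
        T = T.reshape(d_prev * I_dims[k] * J_dims[k], -1)
        U, S, Vt = np.linalg.svd(T, full_matrices=False)
        d_k = min(d_max, len(S))                   # truncate the bond dimension
        cores.append(U[:, :d_k].reshape(d_prev, I_dims[k], J_dims[k], d_k))
        T = S[:d_k, None] * Vt[:d_k]               # absorb the rest and continue
        d_prev = d_k
    cores.append(T.reshape(d_prev, I_dims[-1], J_dims[-1], 1))
    return cores

# A 256 x 256 matrix with I = J = (4, 4, 4, 4): the core count grows linearly,
# and the parameter count follows Eq. (8).
W = np.random.default_rng(0).normal(size=(256, 256))
cores = matrix_to_mpo(W, (4, 4, 4, 4), (4, 4, 4, 4), d_max=16)
print([c.shape for c in cores], sum(c.size for c in cores), "vs", W.size)
```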
Additionally, Table 1 compares the forward and backward propagation times and the memory complexity of the fully connected layer and the MPO layer in Big-O notation, following Novikov et al. (2015). We compare the fully connected layer with matrix $W \in \mathbb{R}^{N_x \times N_y}$ against the MPO layer in the format $\mathrm{MPO}(W, x)$ with MPO dimensions $\{d_k\}_{k=0}^{n}$. As can be seen from the table, the MPO format has advantages over the traditional one in terms of both time and memory consumption.

Operation       Time                          Memory
FC forward      O(N_x N_y)                    O(N_x N_y)
MPO forward     O(n d^2 m max(N_x, N_y))      O(d max(N_x, N_y))
FC backward     O(N_x N_y)                    O(N_x N_y)
MPO backward    O(n^2 d^4 m max(N_x, N_y))    O(d^3 max(N_x, N_y))

Table 1: Running time and memory of the fully connected (FC) layer and the MPO layer. In this table, n denotes the number of MPO core tensors, m denotes max({I_k}_{k=1}^{n}), d denotes max({d_k}_{k=0}^{n}), and N_x and N_y denote the total input and output dimensions, respectively.

MPO-LSTM Model

The tendency of the LSTM model to overfit suggests that there is always redundancy among the weights. Inspired by the low-rank decomposition of weight matrices in Gao et al. (2020) and Denil et al. (2013), which can well reduce the model size and the computational cost at the same time, we adopt the more generalized MPO method to replace all the linear parts of the LSTM model.

In this work, we investigate several typical cases, such as speech enhancement, with the LSTM model. Accordingly, we factorize the input-to-hidden weight matrix W as MPO(W, x) and represent the hidden-to-hidden weight matrix U as MPO(U, h). The MPO-LSTM model then computes

$$\begin{aligned}
k^{[t]} &= \sigma\big( \mathrm{MPO}(W_k, x^{[t]}) + \mathrm{MPO}(U_k, h^{[t-1]}) + b_k \big),\\
f^{[t]} &= \sigma\big( \mathrm{MPO}(W_f, x^{[t]}) + \mathrm{MPO}(U_f, h^{[t-1]}) + b_f \big),\\
o^{[t]} &= \sigma\big( \mathrm{MPO}(W_o, x^{[t]}) + \mathrm{MPO}(U_o, h^{[t-1]}) + b_o \big),\\
g^{[t]} &= \tanh\big( \mathrm{MPO}(W_g, x^{[t]}) + \mathrm{MPO}(U_g, h^{[t-1]}) + b_g \big),\\
c^{[t]} &= f^{[t]} \circ c^{[t-1]} + k^{[t]} \circ g^{[t]},\\
h^{[t]} &= o^{[t]} \circ \tanh\big(c^{[t]}\big). \quad (9)
\end{aligned}$$

We can see that to construct an MPO-LSTM model we need to prepare eight MPOs, one for each of the gating units and hidden units. Instead of calculating these MPOs directly, we increase the dimension of the first tensor to form the output tensor. This trick, inspired by the implementation of the standard LSTM model in Chollet et al. (2015), further reduces the number of parameters, since the concatenation actually participates in the tensorization. The compression ratio for the input-to-hidden weight matrix W now becomes

$$\rho_w = \frac{\sum_{k=1}^{n} i_k j_k d_{k-1} d_k + 3\, i_n j_n d_{n-1} d_n}{4 \prod_{k=1}^{n} i_k j_k}. \quad (10)$$

Meanwhile, the compression ratio for the hidden-to-hidden weight matrix U becomes

$$\rho_u = \frac{\sum_{k=1}^{n} i'_k j'_k d'_{k-1} d'_k + 3\, i'_n j'_n d'_{n-1} d'_n}{4 \prod_{k=1}^{n} i'_k j'_k}. \quad (11)$$

Thus, the total compression ratio is

$$\rho^* = \frac{\sum_{k=1}^{n} i_k j_k d_{k-1} d_k + 3\, i_n j_n d_{n-1} d_n + \sum_{k=1}^{n} i'_k j'_k d'_{k-1} d'_k + 3\, i'_n j'_n d'_{n-1} d'_n}{4 \prod_{k=1}^{n} i_k j_k + 4 \prod_{k=1}^{n} i'_k j'_k}. \quad (12)$$

In a specific MPO, the structure is variational. In our calculations, for simplicity, all $d_k$ within the same MPO are set equal and denoted by $d$, as introduced above. The index decomposition is not unique; in this work we simply choose it for convenience.
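To illustrate Eq. (9), the sketch below contracts MPO cores (in the same (d_{k-1}, I_k, J_k, d_k) layout as above) against a vector and uses the result inside one recurrent step. This is our minimal rendering, not the authors' implementation; following the concatenation trick described in the text, the output index of the first core is enlarged fourfold so that a single MPO produces all four gate pre-activations at once.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mpo_matvec(cores, x):
    """Contract x with MPO cores of shape (d_{k-1}, I_k, J_k, d_k): y = x^T W."""
    T = x.reshape([1] + [c.shape[1] for c in cores])   # (1, I_1, ..., I_n)
    for c in cores:
        T = np.tensordot(T, c, axes=([0, 1], [0, 1]))  # sum over bond and i_k
        T = np.moveaxis(T, -1, 0)                      # next bond to the front
    return T.reshape(-1)                               # length J_1 * ... * J_n

def mpo_lstm_step(x_t, h_prev, c_prev, W_cores, U_cores, b):
    """One MPO-LSTM time-step per Eq. (9); the gates sit in contiguous blocks
    because the enlarged index J_1 varies slowest in the output vector."""
    N_h = h_prev.shape[0]
    z = mpo_matvec(W_cores, x_t) + mpo_matvec(U_cores, h_prev) + b
    k_t = sigmoid(z[:N_h])                             # input gate
    f_t = sigmoid(z[N_h:2 * N_h])                      # forget gate
    o_t = sigmoid(z[2 * N_h:3 * N_h])                  # output gate
    g_t = np.tanh(z[3 * N_h:])                         # candidate cell
    c_t = f_t * c_prev + k_t * g_t
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Random cores for N_x = N_h = 256 with factors (8, 2, 2, 8) and bond dimension 7;
# enlarging J_1 to 4 * 8 = 32 makes prod(J) = 1024 = 4 * N_h.
rng = np.random.default_rng(0)
def random_cores(I, J, d):
    bonds = [1] + [d] * (len(I) - 1) + [1]
    return [0.1 * rng.normal(size=(bonds[k], I[k], J[k], bonds[k + 1]))
            for k in range(len(I))]
I, J = (8, 2, 2, 8), (32, 2, 2, 8)
W_cores, U_cores = random_cores(I, J, 7), random_cores(I, J, 7)
h, c = mpo_lstm_step(rng.normal(size=256), np.zeros(256), np.zeros(256),
                     W_cores, U_cores, np.zeros(1024))
```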
Experiments

In this section, we evaluate our proposed LSTM model with the MPO method (MPO-LSTM) and compare it with the baseline LSTM model and the pruning-LSTM model at the same compression rate. Specifically, we compare our method with the pruning method at model compression rates from 5 to 100. In our experiments, the models with the pruning method are referred to as pruning-LSTM and the models with the MPO method are referred to as MPO-LSTM.

To evaluate and compare the performance of the MPO method with the pruning method, we conducted experiments on sentiment analysis classification tasks, in which we predict whether a sequence of tokens (usually words or sentences) carries a positive or a negative meaning, and on a sequence regression task, in which we predict the clean speech from the noisy speech. We used the Internet Movie Database (IMDB) and Stanford Sentiment Treebank (SST) datasets for the sentiment analysis classification task, and the speech enhancement dataset VoiceBank-DEMAND (VBD) Valentini-Botinhao et al. (2016) for the sequence regression task. For all tasks, we adopted Adam Kingma & Ba (2014) to optimize the model parameters.
Sentiment Analysis

We evaluated our proposed MPO-LSTM model and the pruning-LSTM model on the classification task using the IMDB dataset Maas et al. (2011) and the Stanford Sentiment Treebank (SST) Socher et al. (2013) with five categories. The IMDB dataset consists of 50,000 polarized reviews belonging to two classes, either positive or negative, with a training set of 25,000 reviews and a test set of 25,000 reviews. We took the most frequent 25,000 words for the IMDB dataset and 17,200 for SST, embedded them with a standard embedding layer, and performed classification with the pruning-LSTM model and the MPO-LSTM model, both with hidden size h. In our experiments, we set h to 256.

As shown in the MPO-LSTM formulation above, there are two weight matrices, namely the input-to-hidden weight matrix W and the hidden-to-hidden weight matrix U. In the MPO-based compression method, both weight matrices are decomposed. Accordingly, the factorization factors i_k^W, j_k^W, i_k^U, j_k^U and the bond dimension factors d_k^W, d_k^U are adjustable. In these tasks, we fixed the factorization factors as (8, 2, 2, 8) × (8, 2, 2, 8) and adjusted the bond dimension factors of the two weight matrices to achieve a given compression rate. The values of the tunable bond dimensions at different compression rates are listed in Table 2.

ρ*     5    10   15   20   25   50   75   100
d_W    64   41   32   26   22   13   9    7
d_U    64   40   29   24   20   13   9    7

Table 2: The bond dimension factors of the two weight matrices adjusted to achieve each given compression rate in the IMDB and SST-5 tasks, with the factorization factors fixed as (8, 2, 2, 8) × (8, 2, 2, 8). Here ρ* denotes the total compression ratio of the neural network, d_W denotes the bond dimension of the matrix mapping from the input to the hidden layer, and d_U denotes the bond dimension of the matrix mapping from the hidden layer to the hidden layer, respectively.
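As a sanity check on Table 2, the parameter count of one MPO-factorized LSTM matrix follows from Eq. (8) plus the fourfold-enlarged gate core of Eq. (10). The short routine below is our own illustration, assuming a uniform bond dimension d; it reproduces the listed compression rates to within rounding.

```python
def lstm_mpo_params(I, J, d):
    """Parameters of one MPO-factorized LSTM weight matrix, Eqs. (8) and (10):
    the core carrying the gate index has a 4x larger output dimension."""
    n = len(I)
    bonds = [1] + [d] * (n - 1) + [1]                  # d_0 = d_n = 1
    total = sum(I[k] * J[k] * bonds[k] * bonds[k + 1] for k in range(n))
    return total + 3 * I[-1] * J[-1] * bonds[-2] * bonds[-1]

# Baseline: W and U together hold 2 * 4 * 256 * 256 = 524288 parameters.
I = J = (8, 2, 2, 8)
for d_w, d_u in [(64, 64), (41, 40), (22, 20), (7, 7)]:
    mpo = lstm_mpo_params(I, J, d_w) + lstm_mpo_params(I, J, d_u)
    print(d_w, d_u, round(524288 / mpo, 1))            # approx. 5, 10, 25, 100
```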
In the pruning method, we only need to determine the sparsity of the weight matrix.

We show the parameter number of a single MPO-LSTM model as a function of the bond dimension in Fig. 2. We can see that as the bond dimension of the MPO-LSTM model becomes larger, the number of parameters also increases. When we set d = 64, although the model parameters increase, they are still much fewer than those of the baseline LSTM model.

Figure 2: The number of parameters w.r.t. the bond dimension d of the MPO-LSTM model, in the setting of I = 256, J = 256, while the baseline LSTM model contains 524,288 parameters. The parameter number of the MPO-LSTM model grows with d; meanwhile, the total number always remains smaller than that of the baseline model.

Our findings are summarized in Table 3 and Table 4. We observe that as the compression rate becomes larger, the accuracy of the MPO method does not decrease and remains higher than that of the baseline LSTM model, indicating that there are many redundant parameters in the original LSTM model. Thus we need effective methods, such as the MPO-based method, to reduce these redundant parameters.

Compression Rate   Compression Method   Test Accuracy (%)
0                  -                    88.30
5                  Pruning              87.02
5                  MPO
10                 Pruning              86.57
10                 MPO
15                 Pruning              87.42
15                 MPO
20                 Pruning              86.59
20                 MPO
25                 Pruning              86.76
25                 MPO
50                 Pruning              87.68
50                 MPO
75                 Pruning              87.21
75                 MPO
100                Pruning              87.11
100                MPO

Table 3: The test accuracy results on the IMDB dataset of the LSTM model with the pruning and MPO-based methods, respectively.

Compression Rate   Compression Method   Test Accuracy (%)
0                  -                    44.10
5                  Pruning              41.16
5                  MPO
10                 Pruning              41.59
10                 MPO
15                 Pruning              41.67
15                 MPO
20                 Pruning              41.38
20                 MPO
25                 Pruning              41.45
25                 MPO
50                 Pruning              41.29
50                 MPO
75                 Pruning              41.15
75                 MPO
100                Pruning              41.09
100                MPO

Table 4: The test accuracy results on the SST dataset of the LSTM model with the pruning and MPO-based methods, respectively.

In addition, we observe that the models heavily compressed by the MPO method can perform equally well as, or even better than, the uncompressed models. At various compression rates, the results of the MPO method are better than those of the pruning method. This shows that the MPO method can be applied as a simple and efficient alternative to the pruning method.

Speech Enhancement

In the speech enhancement task, we used the LSTM model to estimate the ideal ratio mask (IRM) from several acoustic features for denoising Wang et al. (2013).
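For readers unfamiliar with this training target, a common definition of the IRM (after Wang et al. 2013) is sketched below. The paper does not spell out the exact variant or exponent it uses, so this is illustrative only, with β = 0.5 assumed.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, beta=0.5):
    """IRM(t, f) = (S^2 / (S^2 + N^2))^beta from clean-speech and noise
    magnitude spectrograms; beta = 0.5 is a common choice, assumed here."""
    s2 = clean_mag ** 2
    n2 = noise_mag ** 2
    return (s2 / (s2 + n2 + 1e-12)) ** beta    # epsilon avoids division by zero

def enhance(noisy_spec, mask):
    """Apply an estimated mask to the noisy spectrogram to enhance the speech."""
    return mask * noisy_spec
```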
In our experiments, the VBD dataset was used, in which the speech of 30 speakers selected from the Voice Bank corpus Veaux et al. (2013) was mixed with 10 noise types: 8 from the Demand dataset Thiemann et al. (2013) and 2 artificially generated ones. The test set was generated with 5 noise types from Demand that did not coincide with those of the training data. Consequently, 11,572 and 824 noisy-clean speech pairs were provided as the training and test sets, respectively. This dataset is openly available and frequently used in experiments on DNN-based speech enhancement.

In the MPO-based LSTM model compression, we used the same network structure as in the sentiment analysis classification task, except for the unit numbers of the input and output layers. In this speech enhancement model, the input and output layers both had 256 hidden units, the same as the dimensions of the input feature and the training target. To evaluate the speech enhancement performance of the LSTM models with and without compression, and to further compare the two compression methods, we adopted an objective measure proposed by Rix et al. (2002), namely the perceptual evaluation of speech quality (PESQ).

The speech enhancement evaluation results of the LSTM models are shown in Table 5. These results confirm the effectiveness of the LSTM model in the speech enhancement task. We can see that the LSTM model without compression gains 0.52 PESQ improvement, averaged over all utterances in the test set, in comparison with the noisy speech. For the compressed models, the speech enhancement performance decreases as the compression rate increases for both compression methods. However, the MPO-based compression method performs better than the pruning method at the same compression rates. Even at a compression rate as high as 100, the model compressed with the MPO-based method can still achieve satisfactory performance, with only 0.17 PESQ loss.

Compression Rate   Compression Method   PESQ (MOS)
0                  -                    2.50
5                  Pruning              2.44
5                  MPO
10                 Pruning              2.42
10                 MPO
15                 Pruning              2.37
15                 MPO
20                 Pruning              2.35
20                 MPO
25                 Pruning              2.31
25                 MPO
50                 Pruning              2.25
50                 MPO
75                 Pruning              2.28
75                 MPO
100                Pruning              2.33
100                MPO
Noisy Speech       -                    1.98

Table 5: The speech enhancement performance evaluation results of the LSTM model with the pruning and MPO-based methods.
Conclusion

In this paper, we demonstrate that an LSTM model can be well compressed using the MPO method with proper orders and balanced dimensions of the modes, and we present the MPO-LSTM model based on this demonstration for LSTM model compression. We do not need to add new layers to implement the MPO decomposition, as other tensor-based methods do. The advantage of our method over the pruning method is that we do not need to record the indices of the nonzero elements. In this method, we use the MPO decomposition format to replace the weight matrices of the linear transformations in the LSTM models. We evaluate the models at different compression rates on several datasets. The experimental results on IMDB, SST, and VBD show that our proposed MPO method clearly outperforms the pruning method in the NLP problems and in speech enhancement performance at the same compression rate for an LSTM model. Thus, the MPO method can be applied as a simple and efficient alternative to pruning.

In the future, the MPO-based model compression method can be used in many other tasks, and it is also an interesting problem to explore the combination of the MPO method with other compression methods.
Acknowledgments
This research is financially supported by the National Natural Science Foundation of China under Grants 11934020, 11722437, 11674352 and 11774422.

Ze-Feng Gao and Xingwei Sun contributed equally to this work.
References
Belletti, F., Beutel, A., Jain, S., & Chi, E. (2018). Factorized recurrent neural architectures for longer range dependence. In International Conference on Artificial Intelligence and Statistics (pp. 1522–1530).

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.

Chollet, F. et al. (2015). Keras: Deep learning library for Theano and TensorFlow. URL: https://keras.io.
Courbariaux, M., Bengio, Y., & David, J.-P. (2015). Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems (pp. 3123–3131).

Denil, M., Shakibi, B., Dinh, L., Ranzato, M., & De Freitas, N. (2013). Predicting parameters in deep learning. In Advances in Neural Information Processing Systems (pp. 2148–2156).
Gao, Z.-F., Cheng, S., He, R.-Q., Xie, Z., Zhao, H.-H., Lu, Z.-Y., & Xiang, T. (2020). Compressing deep neural networks by matrix product operators. Physical Review Research, 2(2), 023300.

Gong, Y., Liu, L., Yang, M., & Bourdev, L. (2014). Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115.

Guo, Y., Yao, A., & Chen, Y. (2016). Dynamic network surgery for efficient dnns. In Advances in Neural Information Processing Systems (pp. 1379–1387).

Han, S., Mao, H., & Dally, W. J. (2016). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR 2016.

Han, S., Pool, J., Tran, J., & Dally, W. (2015). Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (pp. 1135–1143).

He, L., Lee, K., Lewis, M., & Zettlemoyer, L. (2017). Deep semantic role labeling: What works and what next. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 473–483).

Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Hochreiter, S. & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

Huang, Q., Zhou, K., You, S., & Neumann, U. (2018). Learning to prune filters in convolutional neural networks. In (pp. 709–718). IEEE.

Jaderberg, M., Vedaldi, A., & Zisserman, A. (2014). Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866.

Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., & Wu, Y. (2016). Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.

Kingma, D. P. & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Lee, K., He, L., Lewis, M., & Zettlemoyer, L. (2017). End-to-end neural coreference resolution. arXiv preprint arXiv:1707.07045.

Levy, O., Lee, K., FitzGerald, N., & Zettlemoyer, L. (2018). Long short-term memory as a dynamically computed element-wise weighted sum. arXiv preprint arXiv:1805.03716.

Lu, Z., Sindhwani, V., & Sainath, T. N. (2016). Learning compact recurrent neural networks. In (pp. 5960–5964). IEEE.

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (pp. 142–150). Association for Computational Linguistics.
Melis, G., Dyer, C., & Blunsom, P. (2017). On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589.

Merity, S., Keskar, N. S., & Socher, R. (2017). Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182.

Novikov, A., Podoprikhin, D., Osokin, A., & Vetrov, D. P. (2015). Tensorizing neural networks. In Advances in Neural Information Processing Systems (pp. 442–450).

Oseledets, I. V. (2011). Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5), 2295–2317.

Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In International Conference on Machine Learning (pp. 1310–1318).

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Poulin, D., Qarry, A., Somma, R., & Verstraete, F. (2011). Quantum simulation of time-dependent hamiltonians and the convenient illusion of hilbert space. Physical Review Letters, 106(17), 170501.

Rix, A. W., Beerends, J. G., Hollier, M. P., & Hekstra, A. P. (2002). Perceptual evaluation of speech quality (PESQ), a new method for speech quality assessment of telephone networks and codecs. In IEEE International Conference on Acoustics.

Schuster, M. (2010). Speech recognition for mobile devices at google. In Pacific Rim International Conference on Artificial Intelligence (pp. 8–10). Springer.

See, A., Luong, M.-T., & Manning, C. D. (2016). Compression of neural machine translation models via pruning. arXiv preprint arXiv:1606.09274.

Seo, M., Kembhavi, A., Farhadi, A., & Hajishirzi, H. (2016). Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1631–1642).
Sun, X., Gao, Z.-F., Lu, Z.-Y., Li, J., & Yan, Y. (2020). A model compression method with matrix product operators for speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 2837–2847.

Sundermeyer, M., Schlüter, R., & Ney, H. (2012). Lstm neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association.

Sutskever, I., Vinyals, O., & Le, Q. (2014). Sequence to sequence learning with neural networks. Advances in NIPS.

Thiemann, J., Ito, N., & Vincent, E. (2013). The diverse environments multi-channel acoustic noise database: A database of multichannel environmental noise recordings. The Journal of the Acoustical Society of America, 133(5), 3591–3591.

Tjandra, A., Sakti, S., & Nakamura, S. (2017). Compressing recurrent neural network with tensor train. In (pp. 4451–4458). IEEE.

Valentini-Botinhao, C., Wang, X., Takaki, S., & Yamagishi, J. (2016). Investigating rnn-based speech enhancement methods for noise-robust text-to-speech. In SSW (pp. 146–152).

Veaux, C., Yamagishi, J., & King, S. (2013). The voice bank corpus: Design, collection and data analysis of a large regional accent speech database. In (pp. 1–4). IEEE.

Verstraete, F., Garcia-Ripoll, J. J., & Cirac, J. I. (2004). Matrix product density operators: simulation of finite-temperature and dissipative systems. Physical Review Letters, 93(20), 207204.

Wang, Y., Han, K., & Wang, D. (2013). Exploring monaural features for classification-based speech segregation. IEEE Transactions on Audio, Speech, and Language Processing, 21(2), 270–279.

Wen, W., He, Y., Rajbhandari, S., Zhang, M., Wang, W., Liu, F., Hu, B., Chen, Y., & Li, H. (2017). Learning intrinsic sparse structures within long short-term memory. arXiv preprint arXiv:1709.05027.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M., Stolcke, A., Yu, D., & Zweig, G. (2016). Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256.

Yang, Y., Krompass, D., & Tresp, V. (2017). Tensor-train recurrent neural networks for video classification. In International Conference on Machine Learning.