Trace norm regularization and faster inference for embedded speech recognition RNNs
Markus Kliegl, Siddharth Goyal, Kexin Zhao, Kavya Srinet, Mohammad Shoeybi
Baidu Silicon Valley Artificial Intelligence Lab
{klieglmarkus, goyalsiddharth, zhaokexin01, srinetkavya, mohammad}@baidu.com

ABSTRACT
We propose and evaluate new techniques for compressing and speeding up dense matrix multiplications as found in the fully connected and recurrent layers of neural networks for embedded large vocabulary continuous speech recognition (LVCSR). For compression, we introduce and study a trace norm regularization technique for training low rank factored versions of matrix multiplications. Compared to standard low rank training, we show that our method leads to good accuracy versus number of parameters trade-offs and can be used to speed up training of large models. For speedup, we enable faster inference on ARM processors through new open sourced kernels optimized for small batch sizes, resulting in 3x to 7x speedups over the widely used gemmlowp library. Beyond LVCSR, we expect our techniques and kernels to be more generally applicable to embedded neural networks with large fully connected or recurrent layers.
1 INTRODUCTION
For embedded applications of machine learning, we seek models that are as accurate as possible given constraints on size and on latency at inference time. For many neural networks, the parameters and computation are concentrated in two basic building blocks:

1. Convolutions. These tend to dominate in, for example, image processing applications.

2. Dense matrix multiplications (GEMMs), as found, for example, inside fully connected layers or recurrent layers such as GRU and LSTM. These are common in speech and natural language processing applications.

These two building blocks are the natural targets for efforts to reduce parameters and speed up models for embedded applications. Much work on this topic already exists in the literature. For a brief overview, see Section 2.

In this paper, we focus only on dense matrix multiplications and not on convolutions. Our two main contributions are:
1. Trace norm regularization: We describe a trace norm regularization technique and an accompanying training methodology that enables the practical training of models with competitive accuracy versus number of parameters trade-offs. It automatically selects the rank and eliminates the need for any prior knowledge on suitable matrix rank.

2. Efficient kernels for inference: We explore the importance of optimizing for low batch sizes in on-device inference, and we introduce kernels for ARM processors that vastly outperform publicly available kernels in the low batch size regime. The kernels are available at https://github.com/paddlepaddle/farm.

These two topics are discussed in Sections 3 and 4, respectively. Although we conducted our experiments and report results in the context of large vocabulary continuous speech recognition (LVCSR) on embedded devices, the ideas and techniques are broadly applicable to other deep learning networks. Work on compressing any neural network for which large GEMMs dominate the parameters or computation time could benefit from the insights presented in this paper.

2 RELATED WORK
Our work is most closely related to that of Prabhavalkar et al. (2016), where low rank factored acoustic speech models are similarly trained by initializing weights from a truncated singular value decomposition (SVD) of pretrained weight matrices. This technique was also applied to speech recognition on mobile devices (McGraw et al., 2016; Xue et al., 2013). We build on this method by adding a variational form of trace norm regularization that was first proposed for collaborative prediction (Srebro et al., 2005) and also applied to recommender systems (Koren et al., 2009). The use of this technique with gradient descent was recently justified theoretically (Ciliberto et al., 2017). Furthermore, Neyshabur et al. (2015) argue that trace norm regularization could provide a sensible inductive bias for neural networks. To the best of our knowledge, we are the first to combine the training technique of Prabhavalkar et al. (2016) with variational trace norm regularization.

Low rank factorization of neural network weights in general has been the subject of many other works (Denil et al., 2013; Sainath et al., 2013; Ba & Caruana, 2014; Kuchaiev & Ginsburg, 2017). Some other approaches for dense matrix compression include sparsity (LeCun et al., 1989; Narang et al., 2017), hash-based parameter sharing (Chen et al., 2015), and other parameter-sharing schemes such as circulant, Toeplitz, or more generally low-displacement-rank matrices (Sindhwani et al., 2015; Lu et al., 2016). Kuchaiev & Ginsburg (2017) explore splitting activations into independent groups; doing so is akin to using block-diagonal matrices.

The techniques for compressing convolutional models are different and beyond the scope of this paper. We refer the interested reader to, e.g., Denton et al. (2014); Han et al. (2016); Iandola et al. (2016) and references therein.
3 TRAINING LOW RANK MODELS
Low rank factorization is a well studied and effective technique for compressing large matrices. In Prabhavalkar et al. (2016), low rank models are trained by first training a model with unfactored weight matrices (we refer to this as stage 1), and then initializing a model with factored weight matrices from the truncated SVD of the unfactored model (we refer to this as warmstarting a stage 2 model from a stage 1 model). The truncation is done by retaining only as many singular values as required to explain a specified percentage of the variance.

If the weight matrices from stage 1 had only a few nonzero singular values, then the truncated SVD used for warmstarting stage 2 would yield a much better or even error-free approximation of the stage 1 matrix. This suggests applying a sparsity-inducing ℓ1 penalty on the vector of singular values during stage 1 training. This is known as trace norm regularization in the literature. Unfortunately, there is no known way of directly computing the trace norm and its gradients that would be computationally feasible in the context of large deep learning models. Instead, we propose to combine the two-stage training method of Prabhavalkar et al. (2016) with an indirect variational trace norm regularization technique (Srebro et al., 2005; Ciliberto et al., 2017). We describe this technique in more detail in Section 3.1 and report experimental results in Section 3.2.
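As a concrete reference point, the warmstarting step above amounts to a truncated SVD whose retained rank is chosen by explained variance. The NumPy sketch below illustrates this; the function name and the convention of measuring explained variance via squared singular values are our own assumptions, not details taken from the papers cited above.

    import numpy as np

    def svd_warmstart(W, variance_fraction=0.9):
        # Factor W ~= U @ V, keeping just enough singular values to explain
        # `variance_fraction` of the variance (cumulative squared singular values).
        u, s, vt = np.linalg.svd(W, full_matrices=False)
        explained = np.cumsum(s ** 2) / np.sum(s ** 2)
        r = int(np.searchsorted(explained, variance_fraction)) + 1  # smallest rank reaching the threshold
        sqrt_s = np.sqrt(s[:r])
        U = u[:, :r] * sqrt_s            # shape (m, r)
        V = sqrt_s[:, None] * vt[:r]     # shape (r, n)
        return U, V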
3.1 TRACE NORM REGULARIZATION

First we introduce some notation. Let us denote by ||·||_T the trace norm of a matrix, that is, the sum of the singular values of the matrix. The trace norm is also referred to as the nuclear norm or the Schatten 1-norm in the literature. Furthermore, let us denote by ||·||_F the Frobenius norm of a matrix, defined as

    \|A\|_F = \sqrt{\mathrm{Tr}(A A^*)} = \sqrt{\sum_{i,j} |A_{ij}|^2}.    (1)

The Frobenius norm is identical to the Schatten 2-norm of a matrix, i.e. the ℓ2 norm of the singular value vector of the matrix. The following lemma provides a variational characterization of the trace norm in terms of the Frobenius norm.

Lemma 1 (Jameson (1987); Ciliberto et al. (2017)). Let W be an m × n matrix and denote by σ its vector of singular values. Then

    \|W\|_T := \sum_{i=1}^{\min(m,n)} \sigma_i(W) = \min \tfrac{1}{2} \left( \|U\|_F^2 + \|V\|_F^2 \right),    (2)

where the minimum is taken over all U: m × min(m, n) and V: min(m, n) × n such that W = UV. Furthermore, if W = Ũ Σ Ṽ* is a singular value decomposition of W, then equality holds in (2) for the choice U = Ũ √Σ and V = √Σ Ṽ*.

The procedure to take advantage of this characterization is as follows. First, for each large GEMM in the model, replace the m × n weight matrix W by the product W = UV, where U: m × min(m, n) and V: min(m, n) × n. Second, replace the original loss function ℓ(W) by

    \ell(UV) + \tfrac{1}{2}\lambda \left( \|U\|_F^2 + \|V\|_F^2 \right),    (3)

where λ is a hyperparameter controlling the strength of the approximate trace norm regularization. Proposition 1 in Ciliberto et al. (2017) guarantees that minimizing the modified loss (3) is equivalent to minimizing the actual trace norm regularized loss

    \ell(W) + \lambda \|W\|_T.    (4)

In Section 3.2.1 we show empirically that use of the modified loss (3) is indeed highly effective at reducing the trace norm of the weight matrices.
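The equality case of Lemma 1 is easy to check numerically. The short NumPy sketch below (our own illustration, not code from the paper) verifies that with U = Ũ √Σ and V = √Σ Ṽ*, the quantity ½(||U||_F^2 + ||V||_F^2) equals the sum of singular values.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((5, 3))

    u, s, vt = np.linalg.svd(W, full_matrices=False)
    trace_norm = s.sum()                      # ||W||_T: sum of singular values

    U = u * np.sqrt(s)                        # U = U_tilde * sqrt(Sigma)
    V = np.sqrt(s)[:, None] * vt              # V = sqrt(Sigma) * V_tilde^*

    assert np.allclose(U @ V, W)              # the factorization reproduces W
    assert np.isclose(0.5 * ((U ** 2).sum() + (V ** 2).sum()), trace_norm)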
To summarize, we propose the following basic training scheme:

• Stage 1:
  – For each large GEMM in the model, replace the m × n weight matrix W by the product W = UV, where U: m × r, V: r × n, and r = min(m, n).
  – Replace the original loss function ℓ(W) by

        \ell(UV) + \tfrac{1}{2}\lambda \left( \|U\|_F^2 + \|V\|_F^2 \right),    (5)

    where λ is a hyperparameter controlling the strength of the trace norm regularization.
  – Train the model to convergence.

• Stage 2:
  – For the trained model from stage 1, recover W = UV by multiplying the two trained matrices U and V.
  – Train low rank models warmstarted from the truncated SVD of W. By varying the number of singular values retained, we can control the parameter versus accuracy trade-off.

One modification to this is described in Section 3.2.3, where we show that it is actually not necessary to train the stage 1 model to convergence before switching to stage 2. By making the transition earlier, training time can be substantially reduced.
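The stage 1 objective (5) is straightforward to implement as a factored layer whose Frobenius penalties are added to the task loss. The PyTorch-style sketch below is a minimal illustration under our own naming conventions; it is not the code used in the paper.

    import torch
    import torch.nn as nn

    class FactoredLinear(nn.Module):
        # Weight W = U @ V, trained in factored form for stage 1.
        def __init__(self, in_features, out_features):
            super().__init__()
            r = min(in_features, out_features)          # full rank r = min(m, n) in stage 1
            self.U = nn.Parameter(0.01 * torch.randn(out_features, r))
            self.V = nn.Parameter(0.01 * torch.randn(r, in_features))

        def forward(self, x):
            return x @ (self.U @ self.V).t()

        def frobenius_penalty(self):
            return self.U.pow(2).sum() + self.V.pow(2).sum()

    # Stage 1 loss of Eq. (5): task loss plus (lambda / 2) * (||U||_F^2 + ||V||_F^2),
    # summed over all factored GEMMs in the model.
    def stage1_loss(task_loss, factored_layers, lam):
        penalty = sum(layer.frobenius_penalty() for layer in factored_layers)
        return task_loss + 0.5 * lam * penalty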
3.2 EXPERIMENTS AND RESULTS

We report here the results of our experiments related to trace norm regularization. Our baseline model is a forward-only Deep Speech 2 model, and we train and evaluate on the widely used Wall Street Journal (WSJ) speech corpus. Except for a few minor modifications described in Appendix B, we follow closely the original paper describing this architecture (Amodei et al., 2016), and we refer the reader to that paper for details on the inputs, outputs, exact layers used, training methodology, and so on. For the purposes of this paper, suffice it to say that the parameters and computation are dominated by three GRU layers and a fully connected layer. It is these four layers that we compress through low-rank factorization.

As described in Appendix B.2, in our factorization scheme, each GRU layer involves two matrix multiplications: a recurrent and a non-recurrent one. For a simple recurrent layer, we would write

    h_t = f(W_{\mathrm{nonrec}} x_t + W_{\mathrm{rec}} h_{t-1}).    (6)

For a GRU layer, there are also weights for reset and update gates, which we group with the recurrent matrix. See Appendix B.2 for details and the motivation for this split.

Since our work focuses only on compressing acoustic models and not language models, the error metric we report is the character error rate (CER) rather than word error rate (WER). As the size and latency constraints vary widely across devices, whenever possible we compare techniques by comparing their accuracy versus number of parameters trade-off curves. All CERs reported here are computed on a validation set separate from the training set.

Figure 1: CER dependence on λ_rec and λ_nonrec for trace norm regularization (left) and ℓ2 regularization (right).

3.2.1 STAGE 1 EXPERIMENTS
In this section, we investigate the effects of training with the modified loss function in (3). For simplicity, we refer to this as trace norm regularization.

As the WSJ corpus is relatively small at around 80 hours of speech, models tend to benefit substantially from regularization. To make comparisons more fair, we also trained unfactored models with an ℓ2 regularization term and searched the hyperparameter space just as exhaustively.

For both trace norm and ℓ2 regularization, we found it beneficial to introduce separate λ_rec and λ_nonrec parameters for determining the strength of regularization for the recurrent and non-recurrent weight matrices, respectively. In addition to λ_rec and λ_nonrec, in initial experiments we also roughly tuned the learning rate. Since the same learning rate was found to be optimal for nearly all experiments, we simply used that value for all the experiments reported in this section. The dependence of final CER on λ_rec and λ_nonrec is shown in Figure 1. Separate λ_rec and λ_nonrec values are seen to help for both trace norm and ℓ2 regularization. However, for trace norm regularization, it appears better to fix λ_rec as a multiple of λ_nonrec rather than tuning the two parameters independently.

The first question we are interested in is whether our modified loss (3) is really effective at reducing the trace norm. As we are interested in the relative concentration of singular values rather than their absolute magnitudes, we introduce the following nondimensional metric.
Definition 1. Let W be a nonzero m × n matrix with d = min(m, n) ≥ 2. Denote by σ the d-dimensional vector of singular values of W. Then we define the nondimensional trace norm coefficient of W as

    \nu(W) := \frac{\|\sigma\|_{\ell_1} / \|\sigma\|_{\ell_2} - 1}{\sqrt{d} - 1}.    (7)

Figure 2: Nondimensional trace norm coefficient versus strength of regularization, by type of regularization used during training. On the left are the results for the non-recurrent weight of the third GRU layer, with λ_rec = 0. On the right are the results for the recurrent weight of the third GRU layer, with λ_nonrec = 0. The plots for the other weights are similar.

Figure 3: The truncated SVD rank required to explain 90% of the variance of the weight matrix versus CER, by type of regularization used during training. Shown here are results for the non-recurrent (left) and recurrent (right) weights of the third GRU layer. The plots for the other weights are similar.

We show in Appendix A that ν is scale-invariant and ranges from 0 for rank 1 matrices to 1 for maximal-rank matrices with all singular values equal. Intuitively, the smaller ν(W), the better W can be approximated by a low rank matrix.

As shown in Figure 2, trace norm regularization is indeed highly effective at reducing the nondimensional trace norm coefficient compared to ℓ2 regularization. At very high regularization strengths, ℓ2 regularization also leads to small ν values. However, from Figure 1 it is apparent that this comes at the expense of relatively high CERs. As shown in Figure 3, this translates into requiring a much lower rank for the truncated SVD to explain, say, 90% of the variance of the weight matrix for a given CER. Although a few ℓ2-regularized models occasionally achieve low rank, we observe this only at relatively high CERs and only for some of the weights. Note also that some form of regularization is very important on this dataset: the unregularized baseline model (the green points in Figure 3) achieves a comparatively poor CER.
3.2.2 STAGE 2 EXPERIMENTS

In this section, we report the results of stage 2 experiments warmstarted from either trace norm or ℓ2 regularized stage 1 models.

For each regularization type, we took the three best stage 1 models (in terms of final CER: all were below 6.8) and used the truncated SVD of their weights to initialize the weights of stage 2 models. By varying the threshold of variance explained for the SVD truncation, each stage 1 model resulted in multiple stage 2 models. The stage 2 models were trained without regularization (i.e., λ_rec = λ_nonrec = 0) and with the initial learning rate set to three times the final learning rate of the stage 1 model.

Figure 4: Number of parameters versus CER of stage 2 models, colored by the type of regularization used for training the stage 1 model.

As shown in Figure 4, the best models from either trace norm or ℓ2 regularization exhibit similar accuracy versus number of parameters trade-offs. For comparison, we also warmstarted some stage 2 models from an unregularized stage 1 model. These models are seen to have significantly lower accuracies, accentuating the need for regularization on the WSJ corpus.

3.2.3 REDUCING TRAINING TIME
In the previous sections, we trained the stage 1 models for 40 epochs to full convergence and then trained the stage 2 models for another 40 epochs, again to full convergence. Since the stage 2 models are drastically smaller than the stage 1 models, it takes less time to train them. Hence, shifting the stage 1 to stage 2 transition point to an earlier epoch could substantially reduce training time. In this section, we show that it is indeed possible to do so without hurting final accuracy.

Specifically, we took the stage 1 trace norm and ℓ2 models from Section 3.2.1 that resulted in the best stage 2 models in Section 3.2.2. In that section, we were interested in the parameters versus accuracy trade-off and used each stage 1 model to warmstart a number of stage 2 models of different sizes. In this section, we instead set a fixed target of 3M parameters and a fixed overall training budget of 80 epochs but vary the stage 1 to stage 2 transition epoch. For each of the stage 2 runs, we initialize the learning rate with the learning rate of the stage 1 model at the transition epoch. So the learning rate follows the same schedule as if we had trained a single model for 80 epochs. As before, we disable all regularization for stage 2.

The ℓ2 stage 1 model has 21.7M parameters, whereas the trace norm stage 1 model at 29.8M parameters is slightly larger due to the factorization. Since the stage 2 models have roughly 3M parameters and the training time is approximately proportional to the number of parameters, stage 2 models train about 7x and 10x faster, respectively, than the ℓ2 and trace norm stage 1 models. Consequently, large overall training time reductions can be achieved by reducing the number of epochs spent in stage 1 for both ℓ2 and trace norm.

The results are shown in Figure 5. Based on the left panel, it is evident that we can lower the transition epoch number without hurting the final CER. In some cases, we even see marginal CER improvements. For transition epochs of at least 15, we also see slightly better results for trace norm than ℓ2. In the right panel, we plot the convergence of CER when the transition epoch is 15. We find that the trace norm model's CER is barely impacted by the transition, whereas the ℓ2 models see a huge jump in CER at the transition epoch. Furthermore, the plot suggests that a total of 60 epochs may have sufficed. However, the savings from reducing stage 2 epochs are negligible compared to the savings from reducing the transition epoch.

Figure 5: Left: CER versus transition epoch, colored by the type of regularization used for training the stage 1 model. Right: CER as training progresses, colored by the type of regularization used in stage 1. The dotted line indicates the transition epoch.

Table 1: WER of three tiers of low rank speech recognition models (tier-1, tier-2, tier-3) and a production server model (baseline) on an internal test set, reporting parameters (M), WER, and % relative change. This table illustrates the effect of shrinking just the acoustic model; the same large server-grade language model was used for all rows. *The tier-3 model is larger but faster than the tier-2 model. See main text for details.

4 APPLICATION TO PRODUCTION-GRADE EMBEDDED SPEECH RECOGNITION
With low rank factorization techniques similar to those described in Section 3, we were able to train large vocabulary continuous speech recognition (LVCSR) models with acceptable numbers of parameters and acceptable loss of accuracy compared to a production server model (baseline). Table 1 shows the baseline along with three different compressed models with a much lower number of parameters. The tier-3 model employs the techniques of Sections B.4 and B.3. Consequently, it runs significantly faster than the tier-1 model, even though they have a similar number of parameters. Unfortunately, this comes at the expense of some loss in accuracy. (This work was done prior to the development of our trace norm regularization. Due to long training cycles for the 10,000+ hours of speech used in this section, we started from pretrained models. However, the techniques in this section are entirely agnostic to such differences.)

Although low rank factorization significantly reduces the overall computational complexity of our LVCSR system, we still require further optimization to achieve real-time inference on mobile or embedded devices. One approach to speeding up the network is to use low-precision 8-bit integer representations for weight matrices and matrix multiplications (the GEMM operation in BLAS terminology). This type of quantization after training reduces both memory and computation requirements of the network while introducing only a small relative increase in WER. Quantization for embedded speech recognition has also been previously studied in (Alvarez et al., 2016; Vanhoucke et al., 2011), and it may be possible to reduce the relative WER increase by quantizing the forward passes during training (Alvarez et al., 2016). As the relative WER losses from compressing the acoustic and language models were much larger for us, we did not pursue this further for the present study.

Figure 6: Comparison of our kernels (farm) and the gemmlowp library for matrix multiplication on iPhone 7 (left), iPhone 6 (middle), and Raspberry Pi 3 (right). The benchmark computes Ax = b, where A is a random matrix and x is a random matrix whose number of columns equals the batch size. All matrices are in unsigned 8-bit integer format.

To perform low precision matrix multiplications, we originally used the gemmlowp library, which provides state-of-the-art low precision GEMMs using unsigned 8-bit integer values (Jacob & Warden, 2015–2017). However, gemmlowp's approach is not efficient for small batch sizes. Our application, LVCSR on embedded devices with a single user, is dominated by low batch size GEMMs due to the sequential nature of recurrent layers and latency constraints. This can be demonstrated by looking at a simple RNN cell, which has the form

    h_t = f(W x_t + U h_{t-1}).    (8)

This cell contains two main GEMMs: the first, U h_{t-1}, is sequential and requires a GEMM with batch size 1. The second, W x_t, can in principle be performed at higher batch sizes by batching across time. However, choosing too large a batch size can significantly delay the output, as the system needs to wait for more future context. In practice, we found that batch sizes higher than around 4 resulted in too high latencies, negatively impacting user experience.

This motivated us to implement custom assembly kernels for the 64-bit ARM architecture (AArch64, also known as ARMv8 or ARM64) to further improve the performance of the GEMM operations. We do not go through the methodological details in this paper. Instead, we are making the kernels and implementation details available at https://github.com/paddlepaddle/farm.

Figure 6 compares the performance of our implementation (denoted by farm) with the gemmlowp library for matrix multiplication on iPhone 7, iPhone 6, and Raspberry Pi 3 Model B. The farm kernels are significantly faster than their gemmlowp counterparts for batch sizes 1 to 4. The gap between the peak single-core theoretical performance of these devices and the achieved throughput is mostly due to the kernels being limited by memory bandwidth. For a more detailed analysis, we refer to the farm website.

In addition to low precision representation and customized ARM kernels, we explored other approaches to speed up our LVCSR system. These techniques are described in Appendix B.

Finally, by combining low rank factorization, some techniques from Appendix B, int8 quantization, and the farm kernels, as well as using smaller language models, we could create a range of speech recognition models suitably tailored to various devices. These are shown in Table 2.

Table 2: Embedded speech recognition models, reporting for each device the acoustic model tier, language model size (MB), WER, % relative change, speedup over real-time, and % of time spent in the acoustic model. The rows cover a GPU server (baseline), iPhone 7 (tier-1), iPhone 6 (tier-2), and Raspberry Pi 3 (tier-3).
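To make the batching-across-time point around Eq. (8) concrete, the following NumPy sketch (our own illustration, with a hypothetical window parameter) computes the non-recurrent GEMM for a small window of timesteps in one call, while the recurrent GEMM necessarily proceeds one step at a time at batch size 1.

    import numpy as np

    def rnn_forward(x, W, U, h0, f=np.tanh, window=4):
        # Simple RNN of Eq. (8): h_t = f(W x_t + U h_{t-1}).
        # The W x_t products for `window` consecutive timesteps are computed as a
        # single wider GEMM; the U h_{t-1} product is inherently sequential and
        # runs with batch size 1 at every step.
        d_in, T = x.shape
        h, outputs = h0, []
        for start in range(0, T, window):
            chunk = x[:, start:start + window]   # d_in x (<= window)
            wx = W @ chunk                       # one batched non-recurrent GEMM
            for t in range(chunk.shape[1]):
                h = f(wx[:, t] + U @ h)          # sequential batch-size-1 GEMM
                outputs.append(h)
        return np.stack(outputs, axis=1)         # d_hidden x T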
5 CONCLUSION
We worked on compressing and reducing the inference latency of LVCSR speech recognition models. To better compress models, we introduced a trace norm regularization technique and demonstrated its potential for faster training of low rank models on the WSJ speech corpus. To reduce latency at inference time, we demonstrated the importance of optimizing for low batch sizes and released optimized kernels for the ARM64 platform. Finally, by combining the various techniques in this paper, we demonstrated an effective path towards production-grade on-device speech recognition on a range of embedded devices.

ACKNOWLEDGMENTS
We would like to thank Gregory Diamos, Christopher Fougner, Atul Kumar, Julia Li, Sharan Narang, Thuan Nguyen, Sanjeev Satheesh, Richard Wang, Yi Wang, and Zhenyao Zhu for their helpful comments and assistance with various parts of this paper. We also thank the anonymous referees for their comments that greatly improved the exposition and helped uncover a mistake in an earlier version of this paper.

REFERENCES
Raziel Alvarez, Rohit Prabhavalkar, and Anton Bakhtin. On the efficient representation and execution of deep acoustic models. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2016. URL https://arxiv.org/abs/1607.04683.

Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pp. 173–182, 2016.

Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pp. 2654–2662, 2014.

Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pp. 2285–2294, 2015.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. Syntax, Semantics and Structure in Statistical Translation, pp. 103, 2014.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Reexamining low rank matrix factorization for trace norm regularization. arXiv preprint arXiv:1706.08934, 2017.

Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pp. 2148–2156, 2013.

Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pp. 1269–1277, 2014.

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR), 2016.

Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.

Benoit Jacob and Pete Warden. gemmlowp: a small self-contained low-precision GEMM library. https://github.com/google/gemmlowp, 2015–2017.

Graham James Oscar Jameson. Summing and Nuclear Norms in Banach Space Theory, volume 8. Cambridge University Press, 1987.

Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8), 2009.

Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks. arXiv preprint arXiv:1703.10722, 2017.

Yann LeCun, John S Denker, Sara A Solla, Richard E Howard, and Lawrence D Jackel. Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605, 1989.

Hairong Liu, Zhenyao Zhu, Xiangang Li, and Sanjeev Satheesh. Gram-CTC: Automatic unit selection and target decomposition for sequence labelling. arXiv preprint arXiv:1703.00096, 2017.

Zhiyun Lu, Vikas Sindhwani, and Tara N Sainath. Learning compact recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 5960–5964. IEEE, 2016.

Ian McGraw, Rohit Prabhavalkar, Raziel Alvarez, Montse Gonzalez Arenas, Kanishka Rao, David Rybach, Ouais Alsharif, Haşim Sak, Alexander Gruenstein, Françoise Beaufays, et al. Personalized speech recognition on mobile devices. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 5955–5959. IEEE, 2016.

Sharan Narang, Gregory Diamos, Shubho Sengupta, and Erich Elsen. Exploring sparsity in recurrent neural networks. In International Conference on Learning Representations (ICLR), 2017.

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In Workshop track ICLR, 2015. arXiv preprint arXiv:1412.6614.

Rohit Prabhavalkar, Ouais Alsharif, Antoine Bruguier, and Ian McGraw. On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 5970–5974. IEEE, 2016.

Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6655–6659. IEEE, 2013.

Vikas Sindhwani, Tara Sainath, and Sanjiv Kumar. Structured transforms for small-footprint deep learning. In Advances in Neural Information Processing Systems, pp. 3088–3096, 2015.

Nathan Srebro, Jason Rennie, and Tommi S Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, pp. 1329–1336, 2005.

Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, pp. 4, 2011.

Jian Xue, Jinyu Li, and Yifan Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, pp. 2365–2369, 2013.
A NONDIMENSIONAL TRACE NORM COEFFICIENT
In this section, we describe some of the properties of the nondimensional trace norm coefficient defined in Section 3.1.
Proposition 1. Let W, d, σ be as in Definition 1. Then

(i) ν(cW) = ν(W) for all scalars c ∈ ℝ \ {0}.
(ii) 0 ≤ ν(W) ≤ 1.
(iii) ν(W) = 0 if and only if W has rank 1.
(iv) ν(W) = 1 if and only if W has maximal rank and all singular values are equal.

Proof. Since we are assuming W is nonzero, at least one singular value is nonzero and hence ||σ||_ℓ2 ≠ 0. Property (i) is immediate from the scaling property ||cσ|| = |c| · ||σ|| satisfied by all norms.

To establish the other properties, observe that we have

    (\sigma_i + \sigma_j)^2 \ge \sigma_i^2 + \sigma_j^2 \ge 2\left(\tfrac{1}{2}\sigma_i + \tfrac{1}{2}\sigma_j\right)^2.    (9)

The first inequality holds since singular values are nonnegative, and the inequality is strict unless σ_i or σ_j vanishes. The second inequality comes from an application of Jensen's inequality and is strict unless σ_i = σ_j. Thus, replacing (σ_i, σ_j) by (σ_i + σ_j, 0) preserves ||σ||_ℓ1 while increasing ||σ||_ℓ2 unless one of σ_i or σ_j is zero. Similarly, replacing (σ_i, σ_j) by (½(σ_i + σ_j), ½(σ_i + σ_j)) preserves ||σ||_ℓ1 while decreasing ||σ||_ℓ2 unless σ_i = σ_j. By a simple argument by contradiction, it follows that the minima occur for σ = (σ̄, 0, ..., 0), in which case ν(W) = 0, and the maxima occur for σ = (σ̄, ..., σ̄), in which case ν(W) = 1.

We can also obtain a better intuition about the minimum and maximum of ν(W) by looking at the 2D case visualized in Figure 7. For a fixed ||σ||_ℓ2 = σ̄, ||σ||_ℓ1 can vary from σ̄ to √2·σ̄. The minimum ||σ||_ℓ1 happens when either σ_1 or σ_2 is zero. For these values ||σ||_ℓ1 = ||σ||_ℓ2, and as a result ν(W) = 0. Similarly, the maximum ||σ||_ℓ1 happens for σ_1 = σ_2, resulting in ν(W) = 1.

Figure 7: Contours of ||σ||_ℓ1 and ||σ||_ℓ2 in the 2D case. ||σ||_ℓ2 is kept constant at σ̄; in this case, ||σ||_ℓ1 can vary from σ̄ to √2·σ̄.
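The quantity in Definition 1 is simple to compute from the singular values. The following NumPy sketch (our own illustration, not code from the paper) implements Eq. (7) and checks the extreme cases from Proposition 1.

    import numpy as np

    def nu(W):
        # Nondimensional trace norm coefficient of Eq. (7); requires min(W.shape) >= 2.
        sigma = np.linalg.svd(W, compute_uv=False)      # singular values
        d = min(W.shape)
        return (sigma.sum() / np.linalg.norm(sigma) - 1.0) / (np.sqrt(d) - 1.0)

    # Extreme cases from Proposition 1:
    rank_one = np.outer(np.arange(1.0, 4.0), np.arange(1.0, 5.0))   # rank 1 -> nu = 0
    identity = np.eye(5)                                            # equal singular values -> nu = 1
    assert abs(nu(rank_one)) < 1e-9
    assert abs(nu(identity) - 1.0) < 1e-9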
B MODEL DESIGN CONSIDERATIONS
We describe here a few preliminary insights that informed our choice of baseline model for the experiments reported in Sections 3 and 4.

Since the target domain is on-device streaming speech recognition with low latency, we chose to focus on Deep Speech 2 like models with forward-only GRU layers (Amodei et al., 2016).
B.1 GROWING RECURRENT LAYER SIZES
Across several data sets and model architectures, we consistently found that the sizes of the recurrent layers closer to the input could be shrunk without affecting accuracy much. A related phenomenon was observed in Prabhavalkar et al. (2016): when doing low rank approximations of the acoustic model layers using SVD, the rank required to explain a fixed threshold of explained variance grows with distance from the input layer.

To reduce the number of parameters of the baseline model and speed up experiments, we thus chose to adopt growing GRU dimensions. Since the hope is that the compression techniques studied in this paper will automatically reduce layers to a near-optimal size, we chose not to tune these dimensions, but simply picked a reasonable affine increasing scheme of 768, 1024, 1280 for the GRU dimensions, and dimension 1536 for the final fully connected layer.
B.2 PARAMETER SHARING IN THE LOW RANK FACTORIZATION
For the recurrent layers, we employ the Gated Recurrent Unit (GRU) architecture proposed in Cho et al. (2014); Chung et al. (2014), where the hidden state h_t is computed as follows:

    z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)
    r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)
    \tilde{h}_t = f(W_h x_t + r_t \cdot U_h h_{t-1} + b_h)
    h_t = (1 - z_t) \cdot h_{t-1} + z_t \cdot \tilde{h}_t    (10)

where σ is the sigmoid function, z and r are update and reset gates respectively, U_z, U_r, U_h are the three recurrent weight matrices, and W_z, W_r, W_h are the three non-recurrent weight matrices.

We consider here three ways of performing weight sharing when doing low rank factorization of the 6 weight matrices:

1. Completely joint factorization. Here we concatenate the 6 weight matrices along the first dimension and apply low rank factorization to this single combined matrix.

2. Partially joint factorization. Here we concatenate the 3 recurrent matrices into a single matrix U and likewise concatenate the 3 non-recurrent matrices into a single matrix W. We then apply low rank factorization to each of U and W separately.

3. Completely split factorization. Here we apply low rank factorization to each of the 6 weight matrices separately.

In (Prabhavalkar et al., 2016; Kuchaiev & Ginsburg, 2017), the authors opted for the LSTM analog of completely joint factorization, as this choice has the most parameter sharing and thus the highest potential for compression of the model. However, we decided to go with partially joint factorization instead, largely for two reasons. First, in pilot experiments, we found that the U and W matrices behave qualitatively quite differently during training. For example, on large data sets the W matrices may be trained from scratch in factored form, whereas factored U matrices need to be either warmstarted via SVD from a trained unfactored model or trained with a significantly lowered learning rate. Second, the U and W split is advantageous in terms of computational efficiency. For the non-recurrent W GEMM, there is no sequential time dependency and thus its inputs x may be batched across time.

Finally, we compared the partially joint factorization to the completely split factorization and found that the former indeed led to better accuracy versus number of parameters trade-offs. Some results from this experiment are shown in Table 3.

Table 3: Performance of completely split versus partially joint factorization of recurrent weights, reporting the number of parameters (M) and CER for each scheme at several SVD truncation thresholds.
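For concreteness, the partially joint scheme can be sketched as follows: stack the three recurrent matrices, stack the three non-recurrent matrices, and factor each stacked block by truncated SVD. This NumPy sketch uses our own function names and rank parameters and is meant only to illustrate the scheme, not to reproduce our implementation.

    import numpy as np

    def low_rank_factor(M, rank):
        # Rank-`rank` factorization M ~= A @ B via truncated SVD.
        u, s, vt = np.linalg.svd(M, full_matrices=False)
        sqrt_s = np.sqrt(s[:rank])
        return u[:, :rank] * sqrt_s, sqrt_s[:, None] * vt[:rank]

    def partially_joint_factorization(U_z, U_r, U_h, W_z, W_r, W_h, r_rec, r_nonrec):
        # Stack the three recurrent (U_*) and three non-recurrent (W_*) GRU weight
        # matrices and factor each stacked block separately.
        U = np.vstack([U_z, U_r, U_h])   # recurrent block, shape (3n, n)
        W = np.vstack([W_z, W_r, W_h])   # non-recurrent block, shape (3n, d)
        return low_rank_factor(U, r_rec), low_rank_factor(W, r_nonrec)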
B.3 MEL AND SMALLER CONVOLUTION FILTERS

Switching from 161-dimensional linear spectrograms to 80-dimensional mel spectrograms reduces the per-timestep feature dimension by roughly a factor of 2. Furthermore, and likely owing to this switch, we could reduce the frequency-dimension size of the convolution filters by a factor of 2. In combination, this means about a 4x reduction in compute for the first and second convolution layers, and a 2x reduction in compute for the first GRU layer.

On the WSJ corpus as well as an internal dataset of around 1,000 hours of speech, we saw little impact on accuracy from making this change, and hence we adopted it for all experiments in Section 3.

B.4 GRAM-CTC AND INCREASED STRIDE IN CONVOLUTIONS
Gram-CTC is a recently proposed extension to CTC for training models that output variable-size grams as opposed to single characters (Liu et al., 2017). Using Gram-CTC, we were able to increase the time stride in the second convolution layer by a factor of 2 with little to no loss in CER, though we did have to double the number of filters in that same convolution layer to compensate. The net effect is a roughly 2x speedup for the second and third GRU layers, which are the largest. This speedup more than makes up for the size increase in the softmax layer and the slightly more complex language model decoding when using Gram-CTC. However, for a given target accuracy, we found that Gram-CTC models could not be shrunk as much as CTC models by means of low rank factorization. That is, the net effect of this technique is to increase model size in exchange for reduced latency.
B.5 LOW RANK FACTORIZATION VERSUS LEARNED SPARSITY
Shown in Figure 8 is the parameter reduction versus relative CER increase trade-off for various techniques on an internal data set of around 1,000 hours of speech.

Figure 8: CER versus parameters on an internal dataset, colored by parameter reduction technique (baseline, scaled baseline, low rank, low rank (fast), and sparse).

The baseline model is a Deep Speech 2 model with three forward-GRU layers of dimension 2560, as described in Amodei et al. (2016). This is the same baseline model used in the experiments of Narang et al. (2017), from which paper we also obtained the sparse data points in the plot. Shown also are versions of the baseline model but with the GRU dimension scaled down to 1536 and 1024. Overall, models with low rank factorizations on all non-recurrent and recurrent weight matrices are seen to provide the best CER versus parameters trade-off. All the low rank models use growing GRU dimensions and the partially joint form of low rank factorization, as discussed in Sections B.1 and B.2. The models labeled fast additionally employ the speedup techniques of Sections B.3 and B.4.