Grow and Prune Compact, Fast, and Accurate LSTMs
Xiaoliang Dai ∗ Princeton University [email protected]
Hongxu Yin ∗ Princeton University [email protected]
Niraj K. Jha
Princeton University [email protected]
Abstract
Long short-term memory (LSTM) has been widely used for sequential data modeling. Researchers have increased LSTM depth by stacking LSTM cells to improve performance. This incurs model redundancy, increases run-time delay, and makes the LSTMs more prone to overfitting. To address these problems, we propose a hidden-layer LSTM (H-LSTM) that adds hidden layers to LSTM's original one-level non-linear control gates. H-LSTM increases accuracy while employing fewer external stacked layers, thus reducing the number of parameters and run-time latency significantly. We employ grow-and-prune (GP) training to iteratively adjust the hidden layers through gradient-based growth and magnitude-based pruning of connections. This learns both the weights and the compact architecture of H-LSTM control gates. We have GP-trained H-LSTMs for image captioning and speech recognition applications. For the NeuralTalk architecture on the MSCOCO dataset, our three models reduce the number of parameters by 38.7× [floating-point operations (FLOPs) by 45.5×], run-time latency by 4.5×, and improve the CIDEr score by 2.6. For the DeepSpeech2 architecture on the AN4 dataset, our two models reduce the number of parameters by 19.4× (FLOPs by 23.5×), run-time latency by 15.7%, and the word error rate from 12.9% to 8.7%. Thus, GP-trained H-LSTMs can be seen to be compact, fast, and accurate.

∗ indicates equal contribution. This work was supported by NSF Grant No. CNS-1617640.

1 Introduction

Recurrent neural networks (RNNs) have been ubiquitously employed for sequential data modeling because of their ability to carry information through recurrent cycles. Long short-term memory (LSTM) is a special type of RNN that uses control gates and cell states to alleviate the exploding and vanishing gradient problems [1]. LSTMs are adept at modeling long-term and short-term dependencies. They deliver state-of-the-art performance for a wide variety of applications, such as speech recognition [2], image captioning [3], and neural machine translation [4].

Researchers have kept increasing the model depth and size to improve the performance of LSTMs. For example, the DeepSpeech2 architecture [2], which has been used for speech recognition, is more than 2× deeper and 10× larger than the initial DeepSpeech architecture proposed in [5]. However, large LSTM models may suffer from three problems. First, deployment of a large LSTM model consumes substantial storage, memory bandwidth, and computational resources. Such demands may be excessive for mobile phones and embedded devices. Second, large LSTMs are prone to overfitting but hard to regularize [6]. Employing standard regularization methods that are used for feed-forward neural networks (NNs), such as dropout, in an LSTM cell is challenging [7, 8]. Third, the increasingly stringent run-time latency constraints in real-time applications make large LSTMs, which incur high latency, inapplicable in these scenarios. All these problems pose a significant design challenge in obtaining compact, fast, and accurate LSTMs.

In this work, we tackle these design challenges simultaneously by combining two novelties. We first propose a hidden-layer LSTM (H-LSTM) that introduces multi-level abstraction in the LSTM control gates by adding several hidden layers. Then, we describe a grow-and-prune (GP) training method that combines gradient-based growth [9] and magnitude-based pruning [10] techniques to learn the weights and derive compact architectures for the control gates.
This yields inference models that outperform the baselines from all three targeted design perspectives.

2 Related Work

Going deeper is a common strategy for improving LSTM performance. The conventional approach of stacking LSTM cells has shown significant performance improvements on a wide range of applications, such as speech recognition and machine translation [11–13]. The recently proposed skipped connection technique has made it possible to train very deeply stacked LSTMs. This leads to high-performance architectures, such as residual LSTM [14] and highway LSTM [15].

Stacking LSTMs improves accuracy but incurs substantial computation and storage costs. Numerous recent approaches try to shrink the size of large NNs. A popular direction is to simplify the matrix representation. Typical techniques include matrix factorization [16], low-rank approximation [17], and basis filter set reduction [18]. Another direction focuses on efficient storage and representation of weights. Various techniques, such as weight sharing within Toeplitz matrices [19], weight tying through effective hashing [20], and appropriate weight quantization [21–23], can greatly reduce model size, in some cases at the expense of a slight performance degradation.

Network pruning has emerged as another popular approach for LSTM compression. Han et al. show that pruning can significantly cut down on the size of deep convolutional neural networks (CNNs) and LSTMs [10, 24, 25]. Moreover, Yu et al. show that post-pruning sparsity in weight matrices can even improve speech recognition accuracy [26]. Narang et al. incorporate pruning in the training process and compress the LSTM model size by approximately 10× while reducing training time significantly [27].

Apart from size reduction, run-time latency reduction has also attracted an increasing amount of research attention. A recent work by Zen et al. uses unidirectional LSTMs with a recurrent output layer to reduce run-time latency [28]. Amodei et al. propose a more efficient beam search strategy to speed up inference with only a minor accuracy loss [2]. Narang et al. exploit hardware-driven cuSPARSE libraries on a TitanX Maxwell GPU to speed up post-pruning sparse LSTMs by 1.16× to 6.80× [27].

3 Methodology

In this section, we explain our LSTM synthesis methodology that is based on H-LSTM cells and GP training. We first describe the H-LSTM structure, after which we illustrate GP training in detail.
3.1 H-LSTM

Recent years have witnessed the impact of increasing NN depth on its performance. A deep architecture allows an NN to capture low/mid/high-level features through multi-level abstraction. However, since a conventional LSTM employs a fixed single-layer non-linearity for gate control, the current standard approach for increasing model depth is to stack several LSTM cells externally. In this work, we argue for a different approach that increases depth within LSTM cells. We propose an H-LSTM whose control gates are enhanced by adding hidden layers. We show that stacking fewer H-LSTM cells can achieve higher accuracy with fewer parameters and smaller run-time latency than conventionally stacked LSTM cells.

We show the schematic diagram of an H-LSTM in Fig. 1. The internal computation flow is governed by Eq. (1), where f_t, i_t, o_t, g_t, x_t, h_t, and c_t refer to the forget gate, input gate, output gate, vector for cell updates, input, hidden state, and cell state at step t, respectively; h_{t-1} and c_{t-1} refer to the hidden and cell states at step t-1; DNN, H, W, b, σ, and ⊗ refer to the deep neural network (DNN) gates, hidden layers (each performs a linear transformation followed by the activation function), weight matrix, bias, sigmoid function, and element-wise multiplication, respectively; * indicates zero or more H layers for the DNN gate.

Figure 1: Schematic diagram of H-LSTM.

    f_t = DNN_f([x_t, h_{t-1}]) = σ(W_f H*([x_t, h_{t-1}]) + b_f)
    i_t = DNN_i([x_t, h_{t-1}]) = σ(W_i H*([x_t, h_{t-1}]) + b_i)
    o_t = DNN_o([x_t, h_{t-1}]) = σ(W_o H*([x_t, h_{t-1}]) + b_o)
    g_t = DNN_g([x_t, h_{t-1}]) = tanh(W_g H*([x_t, h_{t-1}]) + b_g)
    c_t = f_t ⊗ c_{t-1} + i_t ⊗ g_t
    h_t = o_t ⊗ tanh(c_t)                                            (1)

The introduction of DNN gates provides three major benefits to an H-LSTM (a code-level sketch of such a cell follows the list below):

1. Strengthened control: Hidden layers in DNN gates enhance gate control through multi-level abstraction. This makes an H-LSTM more capable and intelligent, and alleviates its reliance on external stacking. Consequently, an H-LSTM can achieve comparable or even improved accuracy with fewer external stacked layers relative to a conventional LSTM, leading to higher compactness.

2. Easy regularization: The conventional approach only uses dropout in the input/output layers and recurrent connections of LSTMs. In our case, it becomes possible to apply dropout even to all control gates within an LSTM cell. This reduces overfitting and leads to better generalization.

3. Flexible gates: Unlike the fixed but specially-crafted gate control functions in LSTMs, DNN gates in an H-LSTM offer a wide range of choices for internal activation functions, such as the ReLU. This may provide additional benefits to the model. For example, networks typically learn faster with ReLUs [7]. They can also take advantage of ReLU's zero outputs for FLOPs reduction.
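To make Eq. (1) concrete, below is a minimal PyTorch-style sketch of an H-LSTM cell with one hidden (H) layer per DNN gate. The class name, argument names, and widths are our own illustrative choices rather than code from the paper, and a production cell would also fold in dropout and support for stacking and bidirectionality.

```python
import torch
import torch.nn as nn


class HLSTMCell(nn.Module):
    """Sketch of an H-LSTM cell: each control gate is a small DNN
    (one hidden layer + ReLU here) instead of a single linear layer."""

    def __init__(self, input_size, hidden_size, gate_hidden_size):
        super().__init__()

        def dnn_gate(out_activation):
            # One H* layer (linear + ReLU) followed by the gate's output layer.
            return nn.Sequential(
                nn.Linear(input_size + hidden_size, gate_hidden_size),
                nn.ReLU(),
                nn.Linear(gate_hidden_size, hidden_size),
                out_activation,
            )

        self.f_gate = dnn_gate(nn.Sigmoid())  # forget gate
        self.i_gate = dnn_gate(nn.Sigmoid())  # input gate
        self.o_gate = dnn_gate(nn.Sigmoid())  # output gate
        self.g_gate = dnn_gate(nn.Tanh())     # cell-update vector

    def forward(self, x_t, state):
        h_prev, c_prev = state
        xh = torch.cat([x_t, h_prev], dim=-1)        # [x_t, h_{t-1}]
        f_t, i_t = self.f_gate(xh), self.i_gate(xh)
        o_t, g_t = self.o_gate(xh), self.g_gate(xh)
        c_t = f_t * c_prev + i_t * g_t               # element-wise, as in Eq. (1)
        h_t = o_t * torch.tanh(c_t)
        return h_t, c_t


# One step with batch size 4; widths of 512 mirror the NeuralTalk experiments.
cell = HLSTMCell(input_size=512, hidden_size=512, gate_hidden_size=512)
x_t = torch.randn(4, 512)
h0, c0 = torch.zeros(4, 512), torch.zeros(4, 512)
h1, c1 = cell(x_t, (h0, c0))
```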
3.2 Grow-and-Prune (GP) Training

Conventional training based on back-propagation on fully-connected NNs yields over-parameterized models. Han et al. have successfully implemented pruning to drastically reduce the size of large CNNs and LSTMs [10, 24]. The pruning phase is complemented with a brain-inspired growth phase for large CNNs in [9]. The network growth phase allows a CNN to grow neurons, connections, and feature maps, as necessary, during training. Thus, it enables automated search in the architecture space. It has been shown that a sequential combination of growth and pruning can yield additional compression on CNNs relative to pruning-only methods (e.g., 1.7× for AlexNet and 2.0× for VGG-16 on top of the pruning-only methods) [9]. In this work, we extend GP training to LSTMs.

We illustrate the GP training flow in Fig. 2. It starts from a randomly initialized sparse seed architecture. The seed architecture contains a very limited fraction of connections to facilitate initial gradient back-propagation. The remaining connections in the matrices are dormant and masked to zero. The flow ensures that all neurons in the network are connected. During training, it first grows connections based on the gradient information. Then, it prunes away redundant connections for compactness, based on their magnitudes. Finally, GP training rests at an accurate, yet compact, inference model. We explain the details of each phase next.

Figure 2: An illustration of GP training.

GP training adopts the following policies:

Growth policy: Activate a dormant w iff |w.grad| is larger than the (100α)th percentile of all elements in |W.grad|.

Pruning policy: Remove a w iff |w| is smaller than the (100β)th percentile of all elements in |W|.

Here, w, W, .grad, α, and β refer to the weight of a single connection, the weights of all connections within one layer, the operation to extract the gradient, the growth ratio, and the pruning ratio, respectively.

In the growth phase, the main objective is to locate the most effective dormant connections to reduce the value of the loss function L. We first evaluate ∂L/∂w for each dormant connection w based on its average gradient over the entire training set. Then, we activate each dormant connection whose gradient magnitude |w.grad| = |∂L/∂w| surpasses the (100α)th percentile of the gradient magnitudes of its corresponding weight matrix. This rule activates only those dormant connections that are most effective at reducing L. Growth can also help avoid local minima, as observed by Han et al. in their dense-sparse-dense training algorithm to improve accuracy [29].

We prune the network after the growth phase. Pruning of insignificant weights is an iterative process. In each iteration, we first prune away insignificant weights whose magnitudes are smaller than the (100β)th percentile within their respective layers. We prune a neuron if all its input (or output) connections are pruned away. We then retrain the NN after weight pruning to recover its performance before starting the next pruning iteration. The pruning phase terminates when retraining cannot achieve a pre-defined accuracy threshold. GP training finalizes a model based on the last complete iteration.
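As a concrete illustration of these two policies on a single weight matrix, here is a minimal PyTorch sketch of one growth step followed by one pruning step. The function names, the way the average gradient is obtained, and the mask bookkeeping are our simplifications under the stated policies, not the authors' implementation.

```python
import torch


def grow(mask, grad, alpha):
    """Growth policy: activate a dormant w iff |w.grad| is larger than the
    (100*alpha)-th percentile of all elements in |W.grad|."""
    threshold = torch.quantile(grad.abs().flatten(), alpha)
    newly_active = (~mask) & (grad.abs() > threshold)
    return mask | newly_active


def prune(weight, mask, beta):
    """Pruning policy: remove a w iff |w| is smaller than the
    (100*beta)-th percentile of all elements in |W|."""
    threshold = torch.quantile(weight.abs().flatten(), beta)
    return mask & (weight.abs() >= threshold)


# Toy usage on one layer: 50%-sparse seed, one growth step, one pruning step.
W = torch.randn(800, 1600)
mask = torch.rand_like(W) < 0.5         # sparse seed architecture
W = W * mask                            # dormant connections are masked to zero
avg_grad = torch.randn_like(W)          # stand-in for the average gradient of L
mask = grow(mask, avg_grad, alpha=0.8)  # growth ratio (illustrative value)
mask = prune(W, mask, beta=0.9)         # pruning ratio (illustrative value)
W = W * mask                            # keep pruned weights at zero
print(f"sparsity: {1.0 - mask.float().mean().item():.2%}")
```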
4 Experimental Results

We present our experimental results for image captioning and speech recognition benchmarks next. We implement our experiments using PyTorch [30] on Nvidia GTX 1060 and Tesla P100 GPUs.

4.1 Image Captioning

We first show the effectiveness of our proposed methodology on image captioning.
Architecture:
We experiment with the NeuralTalk architecture [3, 31] that uses the last hidden layer of a pre-trained CNN image encoder as an input to a recurrent decoder for sentence generation. The recurrent decoder applies a beam search technique for sentence generation. A beam size of k indicates that at step t, the decoder considers the set of k best sentences obtained so far as candidates to generate sentences in step t + 1, and keeps the best k results [3, 31, 32]. In our experiments, we use VGG-16 [33] as the CNN encoder, same as in [3, 31]. We then use H-LSTM and LSTM cells with the same width of 512 for the recurrent decoder and compare their performance. We use Beam = 2 as the default beam size.
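To illustrate the beam search procedure described above, here is a minimal, generic sketch of beam decoding with width k around a per-token scoring function. The function names, the dummy scorer, and the end-of-sentence handling are our own simplifications, not the NeuralTalk decoder.

```python
import torch


def beam_search(step_fn, init_state, bos_id, eos_id, beam_size=2, max_len=20):
    """Keep the `beam_size` best partial sentences at every step, ranked by
    cumulative log-probability; step_fn(token, state) -> (log_probs, state)."""
    beams = [([bos_id], 0.0, init_state)]          # (tokens, score, state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score, state in beams:
            log_probs, new_state = step_fn(tokens[-1], state)
            top = torch.topk(log_probs, beam_size)
            for lp, idx in zip(top.values, top.indices):
                candidates.append((tokens + [int(idx)], score + float(lp), new_state))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:        # keep the best k candidates
            (finished if cand[0][-1] == eos_id else beams).append(cand)
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]    # best-scoring sentence


# Tiny usage example with a dummy scorer over a 10-token vocabulary.
def dummy_step(token, state):
    return torch.log_softmax(torch.randn(10), dim=0), state


print(beam_search(dummy_step, init_state=None, bos_id=0, eos_id=9, beam_size=2))
```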
Dataset:
We report results on the MSCOCO dataset [34]. It contains 123287 images of size 256 × 256 × 3, along with five reference sentences per image. We use the publicly available split [31], which has 113287, 5000, and 5000 images in the training, validation, and test sets, respectively.
Training:
We use the Adam optimizer [35] for this experiment. We use a batch size of 64 for training. We initialize the learning rate at × −. In the first 90 epochs, we fix the weights of the CNN and train the LSTM decoder only. We decay the learning rate by 0.8 every six epochs in this phase. After 90 epochs, we start to fine-tune both the CNN and LSTM at a fixed × − learning rate. We use a dropout ratio of 0.2 for the hidden layers in the H-LSTM. We also use a dropout ratio of 0.5 for the input and output layers of the LSTM, same as in [6]. We use the CIDEr score [36] as our evaluation criterion.

We first compare the performance of a fully-connected H-LSTM with a fully-connected LSTM to show the benefits emanating from using the H-LSTM cell alone. The NeuralTalk architecture with a single LSTM achieves a 91.0 CIDEr score [3]. We also experiment with stacked 2-layer and 3-layer LSTMs, which achieve 92.1 and 92.8 CIDEr scores, respectively. We next train a single H-LSTM and compare the results in Fig. 3 and Table 1. Our single H-LSTM achieves a CIDEr score of 95.4, which is 4.4, 3.3, and 2.6 higher than the single LSTM, stacked 2-layer LSTM, and stacked 3-layer LSTM, respectively.

Figure 3: Comparison of NeuralTalk CIDEr for the LSTM and H-LSTM cells. Number and area indicate size.

Table 1: Cell comparison for the NeuralTalk architecture on the MSCOCO dataset.
Model              CIDEr   #Param   Latency
H-LSTM (Beam=2)    95.4    3.1M     24ms
H-LSTM (Beam=1)    93.0    3.1M     8ms

H-LSTM can also reduce run-time latency. Even with
Beam = 1, a single H-LSTM achieves a higher accuracy than the three LSTM baselines. Reducing the beam size leads to run-time latency reduction. H-LSTM is 4.5×, 3.6×, and 2.6× faster than the stacked 3-layer LSTM, stacked 2-layer LSTM, and single LSTM, respectively, while providing higher accuracy.

Next, we implement both network pruning and our GP training to synthesize compact inference models for an H-LSTM (
Beam = 2). The seed architecture for GP training has a sparsity of 50%. In the growth phase, we use a 0.8 growth ratio in the first five epochs. We summarize the results in Table 2, where CR refers to the compression ratio relative to a fully-connected model. GP training provides an additional 1.40× improvement in CR compared with network pruning.

Table 2: Training algorithm comparison
Method        Cell     # Layers   CIDEr   #Param   CR
GP training   H-LSTM   1          95.4    394K     8.0×

We list our GP-trained H-LSTM models in Table 3. Note that the accurate and fast models are the same network with different beam sizes. The compact model is obtained through further pruning of the accurate model. We choose the stacked 3-layer LSTM as our baseline due to its high accuracy. Our accurate, fast, and compact models demonstrate improvements in all aspects (accuracy, speed, and compactness), with a 2.6 higher CIDEr score, 4.5× speedup, and 38.7× fewer parameters, respectively.

Table 3: Different inference models for the MSCOCO dataset (models: Ours: accurate; Ours: fast; Ours: compact).

We now consider another well-known application: speech recognition.

4.2 Speech Recognition
Architecture:
We implement a bidirectional DeepSpeech2 architecture that employs stacked recurrent layers following convolutional layers for speech recognition [2]. We use Mel-frequency cepstral coefficients (MFCCs) as network inputs, extracted from raw speech data at a 16 kHz sampling rate with a 20 ms feature extraction window. There are two CNN layers prior to the recurrent layers and one connectionist temporal classification layer for decoding [37] after the recurrent layers. The width of the hidden and cell states is 800, same as in [38, 39]. We also set the width of H-LSTM hidden layers to 800.
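As an illustration of this input pipeline, the following sketch extracts MFCC features from 16 kHz audio with a 20 ms window using torchaudio. The number of coefficients, hop length, and FFT size are our own assumptions, since the paper does not state them.

```python
import torch
import torchaudio

SAMPLE_RATE = 16000                        # 16 kHz sampling rate, as in the text
WIN_LENGTH = int(0.020 * SAMPLE_RATE)      # 20 ms feature extraction window

# n_mfcc, n_fft, and hop_length are illustrative assumptions.
mfcc = torchaudio.transforms.MFCC(
    sample_rate=SAMPLE_RATE,
    n_mfcc=13,
    melkwargs={"n_fft": 512, "win_length": WIN_LENGTH, "hop_length": WIN_LENGTH // 2},
)

waveform = torch.randn(1, SAMPLE_RATE)     # one second of dummy audio
features = mfcc(waveform)                  # shape: (1, n_mfcc, n_frames)
```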
Dataset:
We use the AN4 dataset [40] to evaluate the performance of our DeepSpeech2 architecture. It contains 948 training utterances and 130 testing utterances.
Training:
We utilize a Nesterov SGD optimizer in our experiment. We initialize the learning rate to × −, decayed per epoch by 0.99. We use a batch size of 16 for training. We use a dropout ratio of 0.2 for the hidden layers in the H-LSTM. We apply batch normalization between recurrent layers. We apply L2 regularization during training with a weight decay of × −. We use word error rate (WER) as our evaluation criterion, same as in [38, 39, 41].

We first compare the performance of the fully-connected H-LSTM against the fully-connected LSTM and gated recurrent unit (GRU) to demonstrate the benefits provided by the H-LSTM cell alone. A GRU uses reset and update gates for memory control and has fewer parameters than an LSTM [42]. For the baseline, we train various DeepSpeech2 models containing a different number of stacked layers based on GRU and LSTM cells. The stacked 4-layer and 5-layer GRUs achieve a WER of 14.35% and 11.60%, respectively. The stacked 4-layer and 5-layer LSTMs achieve a WER of 13.99% and 10.56%, respectively. We next train an H-LSTM to make a comparison. Since an H-LSTM is intrinsically deeper, we aim to achieve a similar accuracy with a smaller stack. We reach a WER of 12.44% and 8.92% with stacked 2-layer and 3-layer H-LSTMs, respectively.

We summarize the cell comparison results in Fig. 4 and Table 4, where all the sizes are normalized to the size of a single LSTM. We can see that the H-LSTM can reduce WER by more than 1.5% with two fewer layers relative to LSTMs and GRUs, thus satisfying our initial design goal to stack fewer cells that are individually deeper. H-LSTM models contain fewer parameters for a given target WER, and can achieve lower WER for a given number of parameters.
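Returning to the optimizer setup in the training paragraph above, it can be sketched as follows. The initial learning rate, weight decay coefficient, and momentum are placeholders of our own (the exact values are not legible in this copy), while the Nesterov momentum and per-epoch 0.99 decay follow the text.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ExponentialLR

model = nn.Linear(800, 800)      # stand-in for the DeepSpeech2 model

INITIAL_LR = 3e-4                # placeholder value
WEIGHT_DECAY = 1e-5              # placeholder for the L2 coefficient

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=INITIAL_LR,
    momentum=0.9,                # assumed; Nesterov requires non-zero momentum
    nesterov=True,
    weight_decay=WEIGHT_DECAY,   # L2 regularization
)
scheduler = ExponentialLR(optimizer, gamma=0.99)   # decay per epoch by 0.99

for epoch in range(70):          # illustrative number of epochs
    # ... train for one epoch with batch size 16 ...
    scheduler.step()
```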
We next implement GP training to show its additional benefits on top of just performing network pruning. We select the stacked 3-layer H-LSTM for this experiment due to its highest accuracy. For GP training, we initialize the seed architecture with a connection sparsity of 50%. We grow the networks for three epochs using a 0.9 growth ratio.

Figure 4: Comparison of DeepSpeech2 WERs for the GRU, LSTM, and H-LSTM cells. Number and area indicate relative size to one LSTM.

Table 4: Cell comparison for the DeepSpeech2 architecture on the AN4 dataset
Cell Type   # Stacked Layers   Relative Size   WER
H-LSTM      2                  3.0             12.44%
GRU         5                  3.8             11.64%
LSTM        5                  5.0             10.56%
H-LSTM      3                  4.5             8.92%
Table 5: Training algorithm comparison
Method        Cell     # Layers   WER      #Param   CR
GP training   H-LSTM   3          10.37%   2.6M     17.2×

For best compactness, we set an accuracy threshold for both GP training and the pruning-only process at 10.52% (the lowest WER from relevant work [38, 39, 41]). We compare these two approaches in Table 5. Compared to network pruning, GP training can further boost the CR by 2.44× while improving the accuracy slightly. This is consistent with prior observations that pruning large CNNs potentially inherits certain redundancies from the original fully-connected model that the growth phase can alleviate [9].

We obtain two GP-trained models by varying the WER constraint during the pruning phase: an accurate model aimed at higher accuracy (9.00% WER constraint), and a compact model aimed at extreme compactness (10.52% WER constraint). We compare our results against prior work from the literature in Table 6. We select a stacked 5-layer LSTM [38] as our baseline. On top of the substantial parameter and FLOPs reductions, both the accurate and compact models also reduce the average run-time latency per instance from 691.4 ms to 583.0 ms (15.7% reduction), even without any sparse matrix library support.

Table 6: Different inference models for the AN4 dataset
Model                  RNN Type   WER(%)   ∆WER(%)   #Param (M)     FLOPs
Baseline [38]          LSTM       12.90    –         – (1.0×)       100.8 (1.0×)
Alistarh et al. [41]   LSTM       18.85    +5.95     13.0 (3.9×)    26.0 (3.9×)
Sharen et al. [39]     GRU        10.52    -2.38     37.8 (1.3×)    75.7 (1.3×)
Ours: accurate         H-LSTM     8.73     -4.17     22.5 (2.3×)    37.1 (2.9×)
Ours: compact          H-LSTM     10.37    -2.53     2.6 (19.4×)    4.3 (23.5×)

The introduction of the ReLU activation function in DNN gates provides additional FLOPs reduction for the H-LSTM. This effect does not apply to LSTMs and GRUs, which only use tanh and sigmoid gate control functions. At inference time, the average activation percentage of the ReLU outputs is 48.3% for forward-direction LSTMs and 48.1% for backward-direction LSTMs. This further reduces the overall run-time FLOPs by 14.5%.

Table 7: GP-trained compact 3-layer H-LSTM DeepSpeech2 model at 10.37% WER
Layers           Seed Sparsity   Post-Growth Sparsity   Post-Pruning Sparsity
H-LSTM layer 1   50.00%          38.35%                 94.26%
H-LSTM layer 2   50.00%          37.68%                 94.20%
H-LSTM layer 3   50.00%          37.86%                 94.21%
Total            50.00%          37.96%                 94.22%
The details of the final inference models are summarized in Table 7. The final sparsity of the compact model is as high as 94.22% due to the compounding effect of growth and pruning.
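For reference, per-layer and total sparsity figures like those in Table 7 can be computed directly from the pruning masks; the small helper below is our own illustration (with arbitrary mask shapes), not the authors' tooling.

```python
import torch


def sparsity(mask: torch.Tensor) -> float:
    """Fraction of connections that are pruned away (masked to zero)."""
    return 1.0 - mask.float().mean().item()


# Random stand-ins for the three H-LSTM layers' connection masks.
masks = {f"H-LSTM layer {i}": torch.rand(800, 1600) < 0.06 for i in (1, 2, 3)}
for name, mask in masks.items():
    print(f"{name}: {sparsity(mask):.2%} sparse")

active = sum(m.sum().item() for m in masks.values())
total = sum(m.numel() for m in masks.values())
print(f"Total: {1.0 - active / total:.2%} sparse")
```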
5 Discussion

We observe that regularization has a significant impact on the final performance of the H-LSTM. We summarize the comparison between fully-connected models with and without dropout for both applications in Table 8. By appropriately regularizing the DNN gates, we improve the CIDEr score by 2.0 on NeuralTalk and reduce the WER from 9.88% to 8.92% on DeepSpeech2. A small sketch of gate-level dropout follows Table 8.

Table 8: Impact of dropout on H-LSTM
Architecture   Dropout   CIDEr        Architecture   Dropout   WER
NeuralTalk     N         93.4         DeepSpeech2    N         9.88%
NeuralTalk     Y         95.4         DeepSpeech2    Y         8.92%
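As referenced above, gate-level dropout simply means placing a dropout layer inside each DNN gate's hidden stack. The snippet below is a sketch with the 0.2 ratio from the text and illustrative layer widths, not the paper's exact configuration.

```python
import torch.nn as nn

# One DNN gate with dropout applied after its H* hidden layer.
f_gate = nn.Sequential(
    nn.Linear(512 + 512, 512),   # H* hidden layer acting on [x_t, h_{t-1}]
    nn.ReLU(),
    nn.Dropout(p=0.2),           # dropout inside the control gate
    nn.Linear(512, 512),
    nn.Sigmoid(),
)
```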
Some real-time applications may emphasize stringent memory and delay constraints instead of accuracy. In this case, the deployment of stacked LSTMs may be infeasible. The extra parameters used in H-LSTM's hidden layers may also seem disadvantageous in this scenario. However, we next show that the extra parameters can easily be compensated for by a reduced hidden layer width. We compare several models for image captioning in Table 9. If we reduce the width of the hidden layers and cell states in the H-LSTM from 512 to 320, we can easily arrive at a single-layer H-LSTM that dominates the conventional LSTM from all three design perspectives (a rough parameter-count sketch follows Table 9). Our observation coincides with prior experience in neural network training, where slimmer but deeper NNs (in this case the H-LSTM) normally exhibit better performance than shallower but wider NNs (in this case the LSTM).

Table 9: H-LSTM with reduced width for further speedup and compactness.
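The following back-of-the-envelope sketch illustrates why a narrower H-LSTM can undercut a wider LSTM in size. It assumes one hidden layer per gate with the same width as the cell and an input width equal to the cell width; these are our simplifications, not the paper's exact configurations, so the counts are only indicative.

```python
def lstm_params(d_in, d_h):
    # Four gates, each a single linear layer on [x_t, h_{t-1}] plus a bias.
    return 4 * ((d_in + d_h) * d_h + d_h)


def hlstm_params(d_in, d_h, d_hid):
    # Four DNN gates, each with one hidden layer of width d_hid plus biases.
    per_gate = (d_in + d_h) * d_hid + d_hid + d_hid * d_h + d_h
    return 4 * per_gate


print(lstm_params(512, 512))         # ~2.1M parameters for a 512-wide LSTM
print(hlstm_params(320, 320, 320))   # ~1.2M parameters for a 320-wide H-LSTM
```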
We also employ an activation function shift trick in our experiment. In the growth phase, we adopt a leaky ReLU (negative slope of 0.01) as the activation function for H* in Eq. (1). Leaky ReLU effectively alleviates the 'dying ReLU' phenomenon, in which a zero output of a ReLU neuron blocks it from any future gradient update. Then, we change all the activation functions from leaky ReLU to ReLU while keeping the weights unchanged, retrain the network to recover performance, and continue to the pruning phase.
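One minimal way to realize this activation shift in PyTorch is to swap the activation modules in place while leaving all weight tensors untouched; the recursive helper below is our own sketch with illustrative layer widths, not the authors' code.

```python
import torch.nn as nn


def leaky_relu_to_relu(module: nn.Module) -> None:
    """Replace every LeakyReLU in `module` with ReLU, keeping all weights."""
    for name, child in module.named_children():
        if isinstance(child, nn.LeakyReLU):
            setattr(module, name, nn.ReLU())
        else:
            leaky_relu_to_relu(child)


# Growth phase: H* layers use leaky ReLU with a negative slope of 0.01.
gate = nn.Sequential(nn.Linear(1600, 800), nn.LeakyReLU(0.01), nn.Linear(800, 800))
# Before pruning: switch to ReLU, then retrain to recover performance.
leaky_relu_to_relu(gate)
```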
6 Conclusion

In this work, we combined H-LSTM and GP training to learn compact, fast, and accurate LSTMs. An H-LSTM adds hidden layers to control gates, as opposed to conventional architectures that just employ a one-level nonlinearity. GP training combines gradient-based growth and magnitude-based pruning to ensure H-LSTM compactness. We GP-trained H-LSTMs for the image captioning and speech recognition applications. For the NeuralTalk architecture on the MSCOCO dataset, our models reduced the number of parameters by 38.7× (FLOPs by 45.5×) and run-time latency by 4.5×, and improved the CIDEr score by 2.6. For the DeepSpeech2 architecture on the AN4 dataset, our models reduced the number of parameters by 19.4× (FLOPs by 23.5×), run-time latency by 15.7%, and WER from 12.9% to 8.7%.

References

[1] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[2] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in Proc. Int. Conf. Machine Learning, vol. 48, 2016, pp. 173–182.
[3] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
[4] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[5] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., "Deep Speech: Scaling up end-to-end speech recognition," arXiv preprint arXiv:1412.5567, 2014.
[6] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[8] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, A. Courville, and C. Pal, "Zoneout: Regularizing RNNs by randomly preserving hidden activations," arXiv preprint arXiv:1606.01305, 2016.
[9] X. Dai, H. Yin, and N. K. Jha, "NeST: A neural network synthesis tool based on a grow-and-prune paradigm," arXiv preprint arXiv:1711.02017, 2017.
[10] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang et al., "ESE: Efficient speech recognition engine with sparse LSTM on FPGA," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2017, pp. 75–84.
[11] A. Graves, N. Jaitly, and A.-R. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in Proc. IEEE Workshop Automatic Speech Recognition and Understanding, 2013, pp. 273–278.
[12] A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 2013, pp. 6645–6649.
[13] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, "How to construct deep recurrent neural networks," arXiv preprint arXiv:1312.6026, 2013.
[14] J. Kim, M. El-Khamy, and J. Lee, "Residual LSTM: Design of a deep recurrent architecture for distant speech recognition," arXiv preprint arXiv:1701.03360, 2017.
[15] Y. Zhang, G. Chen, D. Yu, K. Yaco, S. Khudanpur, and J. Glass, "Highway long short-term memory RNNs for distant speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2016, pp. 5755–5759.
[16] M. Denil, B. Shakibi, L. Dinh, N. De Freitas et al., "Predicting parameters in deep learning," in Proc. Advances in Neural Information Processing Systems, 2013, pp. 2148–2156.
[17] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in Proc. Advances in Neural Information Processing Systems, 2014, pp. 1269–1277.
[18] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," arXiv preprint arXiv:1405.3866, 2014.
[19] Z. Lu, V. Sindhwani, and T. N. Sainath, "Learning compact recurrent neural networks," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2016, pp. 5960–5964.
[20] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in Proc. Int. Conf. Machine Learning, 2015, pp. 2285–2294.
[21] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak, "Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices," arXiv preprint arXiv:1606.06061, 2016.
[22] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on CPUs," in Proc. NIPS Workshop Deep Learning and Unsupervised Feature Learning, vol. 1, 2011, p. 4.
[23] Y. Gong, L. Liu, M. Yang, and L. Bourdev, "Compressing deep convolutional networks using vector quantization," arXiv preprint arXiv:1412.6115, 2014.
[24] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Proc. Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
[25] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang, "A systematic DNN weight pruning framework using alternating direction method of multipliers," arXiv preprint arXiv:1804.03294, 2018.
[26] D. Yu, F. Seide, G. Li, and L. Deng, "Exploiting sparseness in deep neural networks for large vocabulary speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2012, pp. 4409–4412.
[27] S. Narang, E. Elsen, G. Diamos, and S. Sengupta, "Exploring sparsity in recurrent neural networks," arXiv preprint arXiv:1704.05119, 2017.
[28] H. Zen and H. Sak, "Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2015, pp. 4470–4474.
[29] S. Han, J. Pool, S. Narang, H. Mao, S. Tang, E. Elsen, B. Catanzaro, J. Tran, and W. J. Dally, "DSD: Regularizing deep neural networks with dense-sparse-dense training flow," arXiv preprint arXiv:1607.04381, 2016.
[30] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," NIPS Workshop Autodiff, 2017.
[31] A. Karpathy, "Image captioning in Torch," https://github.com/karpathy/neuraltalk2, 2016.
[32] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
[33] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[34] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. European Conf. Computer Vision, 2014, pp. 740–755.
[35] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[36] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.
[37] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. Int. Conf. Machine Learning, 2006, pp. 369–376.
[38] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, "Deep gradient compression: Reducing the communication bandwidth for distributed training," arXiv preprint arXiv:1712.01887, 2017.
[39] S. Naren, "Speech recognition using DeepSpeech2," https://github.com/SeanNaren/deepspeech.pytorch/releases, 2018.
[40] A. Acero, "Acoustical and environmental robustness in automatic speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 1990.
[41] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-efficient SGD via gradient quantization and encoding," in Proc. Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.
[42] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.