Sparsification via Compressed Sensing for Automatic Speech Recognition
Kai Zhen∗, Hieu Duy Nguyen, Feng-Ju Chang, Athanasios Mouchtaris, and Ariya Rastrow
Indiana University Bloomington; Alexa Machine Learning, Amazon, USA
[email protected], {hieng, fengjc, mouchta, arastrow}@amazon.com

∗This work was conducted during Kai's internship in Amazon, Pittsburgh, PA, USA.

ABSTRACT
In order to achieve high accuracy, machine learning (ML) applications typically employ models with a large number of parameters. Certain applications, such as Automatic Speech Recognition (ASR), however, require real-time interaction with users, compelling the model to have as low latency as possible. Deploying large-scale ML applications thus necessitates model quantization and compression, especially when running ML models on resource-constrained devices. For example, by forcing some of the model weight values to zero, it is possible to apply zero-weight compression, which reduces both the model size and the time to read the model from memory. In the literature, such methods are referred to as sparse pruning. The fundamental questions are when and which weights should be forced to zero, i.e., be pruned. In this work, we propose a compressed sensing based pruning (CSP) approach to effectively address those questions. By reformulating sparse pruning as a dual problem of sparsity induction and compression-error reduction, we introduce the classic compressed sensing process into the ML model training process. Using the ASR task as an example, we show that CSP consistently outperforms existing approaches in the literature.
Index Terms — Model pruning, automatic speech recognition (ASR), sparsity, Recurrent Neural Network Transducer (RNN-T), compressed sensing.
1. INTRODUCTION
Automatic Speech Recognition (ASR) is an important component of a virtual assistant system. The main focus of ASR is to convert users' voice commands into transcriptions, upon which further processing acts. Recently, end-to-end (E2E) approaches have attracted much attention due to their ability to directly transduce audio frame features into sequence outputs [1]. Without explicitly injecting domain knowledge or manually tweaking intermediate components (such as a lexicon model), building and maintaining an E2E ASR system is much more efficient than a hybrid deep neural network (DNN)-Hidden Markov Model (HMM) system.

In order to provide the best user experience, an ASR system is required to achieve not only high accuracy but also small user-perceived latency. This motivates the trend of moving the processing from Cloud/remote servers to users' devices to reduce the latency further. More often than not, hardware limitations impose strict constraints on the model complexity [2, 3]. Firstly, the hardware may only support integer arithmetic operations for run-time inference, which necessitates model quantization. Secondly, the hardware often performs multiple tasks supported by different models in sequence; with limited memory, it thus needs to move multiple models in and out of the processing units.

To compress the model for the hardware, two widely applied methods are (a) model selection/structured pruning, i.e., choosing a model structure with pruned layers/channels and small performance degradation [4, 5], and (b) zero-weight compression/sparse pruning, i.e., pruning small-value weights to zero [6, 7]. Model selection differs from sparse pruning in that it deletes entire channels or layers, yielding a more efficient speedup during inference but a more severe performance degradation [4, 5]. These two types of methods are usually complementary: after being structurally pruned, a model can undergo further zero-weight compression to improve the inference speed. In this study, we focus on sparse pruning, targeting the memory storage and bandwidth requirements, as they largely contribute to the latency of on-device ASR models.

A naïve approach to sparse pruning is to push the weight values smaller than a threshold to zero after training, which often leads to significant performance degradation [6].
To mitigate this problem, Tibshirani et al. applied LASSO regularization to penalize large-value model weights [8]. The drawbacks are twofold: firstly, it does not exert an explicit specification of the target sparsity level; secondly, it is subject to the gradient vanishing issue as the model grows deeper. The gradual pruning approach [6] resolves those concerns by defining a sparsity function that maps each training step to a corresponding intermediate sparsity level. During model training, the pruning threshold is adjusted gradually according to this function to eventually reach the target sparsity level. However, gradual pruning assumes that pruning the values smaller than the threshold leads to the least degradation, which is heuristic and sub-optimal. Consequently, gradual pruning techniques provide guidance on "when to prune and by how much", but a less satisfying answer to "which (weights) to prune", thus leading to inefficiency.

In this work, we propose a compressed sensing (CS)-based pruning method, referred to as CSP subsequently, that is sparse-aware and addresses both "when to prune" and "which to prune". CSP reformulates the feedforward operations in machine learning architectures, such as Long Short-Term Memory (LSTM) or Fully-Connected (FC) cells, as a sensing procedure with the inputs and hidden states acting as random sensing matrices. Under that perspective, a sparsification process simultaneously enhances sparsity and reduces the compression error due to pruning. Following [9], we adopt $\ell_1$ regularization to enforce sparsity and $\ell_2$ regularization to mitigate the compression loss, and reformulate the sparsification procedure as an optimization problem. We demonstrate the effectiveness of our method by compressing the recurrent neural network transducer (RNN-T), one of the E2E ASR models. The RNN-T model is sparsified via a hybrid training mechanism in which CSP is conducted during the feedforward stage, along with backpropagation for the global optimization. Our proposed method consistently outperforms the state-of-the-art gradual pruning approach in terms of word error rate (WER) under all settings. In particular, at a sparsity ratio of 50%, where half of the weights are 0, CSP yields little to no performance degradation on the LibriSpeech dataset.

The rest of the paper is structured as follows. In Sec. 2, we briefly review related pruning methods. Our CSP method is introduced in Sec. 3. Sec. 4 describes our experiment setup and results. Finally, we conclude our work with some remarks in Sec. 5.
2. RELATED WORK
One of the most straightforward approaches to sparse-aware training is applying $\ell_k$ regularization, where $k = 0, 1, 2$, etc. There has been a rich literature comparing various forms of sparsity regularizers. Consider the model training

$$\mathbf{W} \leftarrow \arg\min_{\mathbf{W}} \mathcal{L}_{\text{accuracy}}(\mathbf{W}) + \|\mathbf{W}\|_k,$$

where the $\ell_k$ norm is used for the regularization. The fundamental idea is to optimize the model prediction while penalizing large weight values. In DNN model compression, the regularization is usually implemented as an extra loss term during training.

This training-aware sparsity regularization leads to promising pruning results, especially for convolutional neural networks (CNNs) [10] with residual learning techniques [11, 12], but may not apply well to models employing recurrent neural network (RNN) components such as LSTMs: the error due to a global $\ell_1$ sparsity constraint is propagated to all time steps. This drawback is even more severe for architectures, such as RNN-T, which contain feedback loops from one part of the model to the others.

Another well-known, state-of-the-art pruning method for ML models is gradual pruning [6]. This method does not resort to $\ell_k$ regularization for sparsity but, as its name indicates, dynamically updates the pruning threshold during model training. To answer the question "when to prune", the authors define a sparsity function parametrized by the target sparsity $s_f$ at step $t_n$ with an initial pruning step $t_0$. Concretely, at training step $t$, the pruning threshold is adjusted to match the sparsity $s_t$ calculated in Eq. 1. The main complication is to adjust the pruning procedure such that the model weights are relatively converged while the learning rate is still sufficiently large to reduce the pruning-induced loss.

$$s_t = s_f \left( 1 - \left( 1 - \frac{t - t_0}{t_n - t_0} \right)^3 \right) \quad (1)$$

One concern is that finding an optimal setup for these hyperparameters can be hard without going through a rigorous ablation analysis. Furthermore, with gradual pruning, gradient-updating backpropagation is the only mechanism to limit the degradation. Most importantly, gradual pruning only addresses the question "when to prune" but not "which (weights) to prune". At each time step, the weights below the (gradually increased) threshold are pruned. This is based on the premise that the smaller the weight, the less important it is, which is heuristic and sub-optimal.
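For concreteness, below is a minimal Python sketch of the schedule in Eq. 1, assuming the cubic ramp used in [6]; the function name and the clamping outside $[t_0, t_n]$ are ours.

```python
def gradual_sparsity(t: int, s_f: float, t_0: int, t_n: int) -> float:
    """Intermediate sparsity target at training step t (Eq. 1).

    s_f: final target sparsity in [0, 1]
    t_0: step at which pruning starts
    t_n: step at which the final sparsity s_f is reached
    """
    if t <= t_0:
        return 0.0
    if t >= t_n:
        return s_f
    progress = (t - t_0) / (t_n - t_0)
    # Cubic ramp: prune quickly early (many redundant weights),
    # slowly late (let the model recover between pruning steps).
    return s_f * (1.0 - (1.0 - progress) ** 3)
```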
3. COMPRESSED SENSING BASED PRUNING

3.1. Adapting Compressed Sensing for Sparse Pruning
CS aims at compressing potentially redundant information into a sparse representation and reconstructing it efficiently [13, 14], which has facilitated a wide scope of engineering scenarios, such as magnetic resonance imaging (MRI) and radar imaging. For example, in MRI [15], high-resolution scanned images are generated every millisecond or microsecond, leading to significant storage cost and transmission overhead. CS learns a sparse representation of each image with which, at decoding time, CS can recover the reference image almost perfectly. Assume an image compression task with a reference image $x \in \mathbb{R}^n$, where $x$ is usually decomposed into an orthogonal transformation basis $\psi$ and the activation $s$, as $x = \psi \ast s$. Given that $s$ satisfies the $K$-sparse property, CS is capable of locating those $K$ salient activation elements. Concretely, CS introduces a sensing matrix $\phi$ to project $x$ into $y$, as $y = \phi \ast x$. In [9, 16], it has been proved that by optimizing the $\ell_2$ loss in the sensing dimensionality while exerting the $\ell_1$ norm regularizer on $s$, a $K$-sparse solution of $s$, denoted as $\hat{s}$, can be found in polynomial time. Consequently, $\psi \ast \hat{s}$ can estimate the original image $x$ with high fidelity and relatively small latency.

In this work, we investigate the effectiveness of CS based pruning for ML models. We consider the ASR task, i.e., converting audio speech to transcriptions, using an RNN-T architecture. Due to the space constraint, we only describe the transformation of the LSTM cell, which is a major building block in various E2E ASR models. It is straightforward to extend the transformations to other architectures/layers like the fully-connected (FC) network and CNN.

Consider a vanilla LSTM cell: the element-wise multiplication between the input at time step $t$, $x^{(t)}$, and the kernels is given in Eq. 2, while that of the hidden states from the previous step $h^{(t-1)}$ and the recurrent kernels is in Eq. 3. Here, $\mathbf{W}_f$, $\mathbf{W}_c$, $\mathbf{W}_i$, and $\mathbf{W}_o$ (correspondingly $\mathbf{U}_f$, $\mathbf{U}_c$, $\mathbf{U}_i$, and $\mathbf{U}_o$) denote the kernel (correspondingly recurrent kernel) weights of the cell ($c$), the input gate ($i$), the output gate ($o$), and the forget gate ($f$), respectively. All gating mechanisms to update the cell and hidden states are encapsulated in Eq. 4, where $C^{(t)}$ is the cell state vector at time $t$ and $\mathcal{G}$ denotes the transformation of $z_x$, $z_h$, $C^{(t-1)}$ into $C^{(t)}$, $h^{(t)}$. The bias terms are omitted for ease of presentation.

$$z_x = [\mathbf{W}_f, \mathbf{W}_c, \mathbf{W}_i, \mathbf{W}_o] \odot [x^{(t)}, x^{(t)}, x^{(t)}, x^{(t)}] \quad (2)$$

$$z_h = [\mathbf{U}_f, \mathbf{U}_c, \mathbf{U}_i, \mathbf{U}_o] \odot [h^{(t-1)}, h^{(t-1)}, h^{(t-1)}, h^{(t-1)}] \quad (3)$$

$$[C^{(t)}, h^{(t)}] = \mathcal{G}(z_x, z_h, C^{(t-1)}) \quad (4)$$

To prune all kernel weights (denoted as $\mathbf{W} = [\mathbf{W}_f, \mathbf{W}_c, \mathbf{W}_i, \mathbf{W}_o]$) in LSTM cells, we adapt and reformulate a CS-like pruning procedure by adopting $\ell_1$ regularization for sparsity induction and $\ell_2$ regularization for compression-error reduction, as outlined in Fig. 1. The procedure starts with a midway inference to collect the activation inputs $[z_x, z_h]$ as in Eq. 2 and Eq. 3. When those kernels are sparsified, the activation inputs are consequently updated to $[z'_x, z'_h]$ (see Fig. 1). The goal is to sparsify and prune the kernels while preserving the value of $[z_x, z_h]$ to minimize the pruning-induced loss. To that end, the $\ell_1$ regularizer is applied to the input kernels while the $\ell_2$ regularizer controls the reconstruction loss on $z_x$.
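A minimal NumPy sketch of the per-gate computation in Eqs. 2–4 follows. The paper's $\odot$ between stacked kernels and repeated inputs is realized here as one kernel–input product per gate, a standard LSTM reading; all helper names are ours.

```python
import numpy as np

def lstm_step(W, U, x_t, h_prev, C_prev):
    """One vanilla LSTM step with per-gate kernels (Eqs. 2-4, biases omitted).

    W: dict of input kernels {f, c, i, o}, each of shape (hidden, input)
    U: dict of recurrent kernels {f, c, i, o}, each of shape (hidden, hidden)
    """
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    # Eq. 2 / Eq. 3: each gate applies its own kernel to (a copy of) the input.
    z_x = {g: W[g] @ x_t for g in "fcio"}
    z_h = {g: U[g] @ h_prev for g in "fcio"}
    # Eq. 4: the gating transformation G producing C_t and h_t.
    f = sigmoid(z_x["f"] + z_h["f"])
    i = sigmoid(z_x["i"] + z_h["i"])
    o = sigmoid(z_x["o"] + z_h["o"])
    c_tilde = np.tanh(z_x["c"] + z_h["c"])
    C_t = f * C_prev + i * c_tilde
    h_t = o * np.tanh(C_t)
    return C_t, h_t
```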
Hence, our CS solver is embedded in a local optimizer triggered periodically by feedforward steps in a stochastic manner. As stipulated by the restricted isometry property (RIP), the sensing matrix $\phi$ is expected to be random for an accurate signal reconstruction [16]. The proposed CS solver satisfies the RIP with high probability, since the sensing matrices (the input $x^{(t)}$ and hidden state $h^{(t)}$) in LSTM cells vary at different time steps.

Fig. 1: A CSP-LSTM cell with the local sparse optimizer: the $\ell_1$ regularizer compresses the kernels, the $\ell_2$ regularizer minimizes the sensing error, and the activation inputs are updated after pruning.

Our CS solver is kernel-wise, i.e., in each LSTM layer there is one CS solver for each of the input-kernel and recurrent-kernel sparsification processes. A general CS loss for the local optimizer is defined in Eq. 5. By adjusting the sensing coefficient, the local optimizer is capable of calibrating the model to the target sparsity by balancing the $\ell_1$ and $\ell_2$ regularizers:

$$\mathcal{L}_{CS}(\mathbf{W}, y, h) = \lambda \|\mathbf{W}\|_1 + \|y - \mathbf{W} \odot h\|_2, \quad (5)$$

where the hyperparameter $\lambda$ in Eq. 5 is referred to as the sensing coefficient subsequently. $\mathbf{W}$, $h$, and $y$ denote the kernel weights, inputs, and activation input, respectively. To remove the manual tuning, we dynamically update $\lambda$ via Eq. 6:

$$\lambda \leftarrow \begin{cases} \max(\lambda_{\text{lower}}, \lambda - \epsilon), & \text{if } s > s_t \\ \min(\lambda_{\text{upper}}, \lambda + \epsilon), & \text{otherwise.} \end{cases} \quad (6)$$

During training, if the actual sparsity overshoots the target sparsity at a certain step, we reduce $\lambda$ by $\epsilon$; otherwise, $\lambda$ is increased by $\epsilon$. In our setup, $\lambda$ is constrained between $\lambda_{\text{lower}}$ and $\lambda_{\text{upper}}$ (roughly between 0 and 1), with $\epsilon$ being 0.005. The pruning threshold, $\rho$, is initialized to a small value and updated as in [6] to adjust the sparsity. Algorithm 1 summarizes our proposed CSP procedure for input kernels. The CSP for recurrent kernels $\mathbf{U} = [\mathbf{U}_f, \mathbf{U}_c, \mathbf{U}_i, \mathbf{U}_o]$ is executed similarly and is omitted for brevity.

The proposed CSP sparsification optimizer is combined with the conventional backpropagation algorithm in a sparse-aware training scheme (see Fig. 2). Since CSP is conducted kernel-wise for each layer, the local optimization is triggered during the feedforward stage as the samples pass through the layers, from the first encoder layer to the last decoder layer, sequentially. When the model makes its prediction, the global loss is calculated to update the parameters in all preceding layers through backpropagation. The hybrid training scheme is not subject to gradient vanishing, thanks to the local $\ell_1$ regularization, and is compatible with global optimization.
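To make the local optimizer concrete, here is a sketch that minimizes Eq. 5 with a few ISTA-style (proximal gradient) steps and adapts $\lambda$ per Eq. 6. The paper does not specify the inner solver, and the matrix–vector realization of $\mathbf{W} \odot h$ is our assumption; all function names are ours.

```python
import numpy as np

def soft_threshold(W, tau):
    """Proximal operator of the l1 norm: shrink every entry toward zero by tau."""
    return np.sign(W) * np.maximum(np.abs(W) - tau, 0.0)

def cs_solve(W, h, y, lam, n_iter=10, lr=1e-2):
    """Approximately minimize Eq. 5, lam * |W|_1 + ||y - W @ h||_2^2,
    via ISTA. Shapes: W (m, n), h (n,), y (m,)."""
    for _ in range(n_iter):
        residual = y - W @ h                         # l2 sensing error
        grad = -2.0 * np.outer(residual, h)          # gradient of the l2 term w.r.t. W
        W = soft_threshold(W - lr * grad, lr * lam)  # proximal step for the l1 term
    return W

def update_lambda(lam, sparsity, target, lam_lower, lam_upper, eps=0.005):
    """Eq. 6: reduce lam if sparsity overshoots the target, raise it otherwise."""
    if sparsity > target:
        return max(lam_lower, lam - eps)
    return min(lam_upper, lam + eps)
```

The soft-thresholding step is what actively drives small weights to exactly zero, rather than merely penalizing them as a loss term would.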
4. EXPERIMENTS

4.1. Experimental Setup
We consider the ASR task with an RNN-T architecture.
Algorithm 1: Proposed CSP for an LSTM cell

Inputs: the input data at time step $t$, $x^{(t)}$; the hidden state from the previous step, $h^{(t-1)}$; the cell state from the previous step, $C^{(t-1)}$
Outputs: the updated hidden state at time step $t$, $h^{(t)}$; the updated cell state at time step $t$, $C^{(t)}$

if sparsity level < the target then
    Midway inference: $[z_x, z_h] = F(\mathbf{W}, x^{(t)}, h^{(t-1)})$
    Sensing: $\mathbf{W}' \leftarrow \arg\min_{\mathbf{W}} \mathcal{L}_{CS}(\mathbf{W}, x^{(t)}, h^{(t-1)}, z_x, z_h)$
    Pruning: $\mathbf{W}_p(i,j) \leftarrow \begin{cases} 0, & \text{if } |\mathbf{W}'(i,j)| < \rho \\ \mathbf{W}'(i,j), & \text{otherwise} \end{cases}$
    Update the coefficient $\lambda$ and the threshold $\rho$
end if
Update cell outputs: $[z_x, z_h] = F(\mathbf{W}_p, x^{(t)}, h^{(t-1)})$; $[C^{(t)}, h^{(t)}] = \mathcal{G}(z_x, z_h, C^{(t-1)})$
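The following sketch wires Algorithm 1 around the earlier helpers (`lstm_step`, `cs_solve`, `update_lambda` are the hypothetical functions from the previous sketches). As in the paper, only the input kernels are shown; the recurrent kernels would be handled analogously with `h_prev`.

```python
import numpy as np

def csp_lstm_step(W, U, x_t, h_prev, C_prev, state):
    """One CSP-LSTM step following Algorithm 1.

    `state` carries: lam (sensing coefficient), rho (pruning threshold),
    sparsity, target_sparsity, lam_lower, lam_upper.
    """
    if state["sparsity"] < state["target_sparsity"]:
        # Midway inference: activation inputs from the unpruned kernels.
        z_x = {g: W[g] @ x_t for g in "fcio"}
        # Sensing: local CS optimization (Eq. 5), one solver per kernel.
        W = {g: cs_solve(W[g], x_t, z_x[g], state["lam"]) for g in "fcio"}
        # Pruning: zero the weights whose magnitude falls below rho.
        W = {g: np.where(np.abs(W[g]) < state["rho"], 0.0, W[g]) for g in "fcio"}
        # Track the achieved sparsity and adapt lam (Eq. 6).
        total = sum(W[g].size for g in "fcio")
        nonzero = sum(np.count_nonzero(W[g]) for g in "fcio")
        state["sparsity"] = 1.0 - nonzero / total
        state["lam"] = update_lambda(state["lam"], state["sparsity"],
                                     state["target_sparsity"],
                                     state["lam_lower"], state["lam_upper"])
    # Cell outputs with the (possibly pruned) kernels, as in Eqs. 2-4.
    C_t, h_t = lstm_step(W, U, x_t, h_prev, C_prev)
    return C_t, h_t, W, state
```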
Fig. 2: Sparse-aware training scheme for CSP: a local CS solver ($\ell_1$, $\ell_2$) per LSTM layer operates in the feedforward flow (local optimization), while the global accuracy loss drives the gradient flow through backpropagation (global optimization).
Compared to the listen-attend-spell (LAS) model [17], RNN-T features online streaming capability while not requiring a separate lexicon/pronunciation system as in the connectionist temporal classification (CTC) model [18]. To rigorously evaluate CSP along with existing sparsification methods, we experiment with 4 different RNN-T based topologies, all with 5 encoding layers and 2 decoding layers. Models M-I, M-III, and M-IV all have 768 units per layer (UpL) and 4000 word-pieces (WPs), while M-II uses 1024 UpL with 2500 WPs. Among the 4 models, only M-III includes a joint network (J-N), which combines the outputs of the RNN-T encoder and decoder to achieve better performance. M-I and M-II are trained on a far-field dataset with 25k hours of English audio, while M-III and M-IV are trained on the LibriSpeech dataset with 960 hours [19]. Note that these models are reasonably small compared with counterparts in the literature, to highlight the effect of sparse pruning on model performance. Please refer to Table 1 for the total number of parameters of each model.
Table 1: Model performance under various sparsity levels for far-field (left) and LibriSpeech (right) datasets.
             |          | M-I, 38.7M    | M-II, 60.0M
Sparsity (%) | Method   | Rel. Dgrd (%) | Rel. Dgrd (%)
0            | –        | –             | –
25           | A        | 1.83          | 0.80
25           | B        | 1.12          | 0.63
25           | Proposed |               |
50           | A        | 23.40         | 20.48
50           | B        | 8.77          | 7.14
50           | Proposed |               |

             |          | M-III, 34.0M                | M-IV, 37.1M
Sparsity (%) | Method   | WER   | Abs. Dgrd | Rel. Dgrd | WER   | Abs. Dgrd | Rel. Dgrd
0            | –        | 7.27  | –         | –         | 9.58  | –         | –
50           | A        | 17.45 | 10.18     | 140.03%   | 37.24 | 27.66     | 288.73%
50           | B        | 7.34  | 0.07      | 0.96%     | 10.35 | 0.77      | 8.04%
50           | Proposed |       |           |           |       |           |
75           | A        | 99.76 | 92.49     | 1272.21%  | 95.06 | 85.48     | 892.28%
75           | B        | 8.13  | 0.86      | 11.83%    | 10.43 | 0.85      | 8.87%
75           | Proposed |       |           |           |       |           |
We compare our proposed CSP with two other pruning methods: naïve pruning and gradual pruning. In the naïve pruning approach, termed method-A, the smallest weights are pruned post-training to reach the target sparsity level. Note that for a significantly over-parameterized model, achieving a certain sparsity level with little degradation may not be challenging even with method-A, since most of the weights are not effectively involved in the optimization. Method-A, although not being sparse-aware during training, thus helps us probe the level of robustness of an RNN-T topology under various sparsity levels. The gradual pruning approach, denoted as method-B, is derived from [6]. As illustrated in Sec. 3, the pruning threshold is calibrated kernel-wise for a fair comparison.

The learning rate in all experiments is specified via a warm-hold-decay scheduler: the initial learning rate is raised to its peak value at the 3K-th step, held until the 75K-th step, and then decayed to its final value at the 200K-th step. The pruning starts at the 100K-th step, and the pruning threshold gradually increases to reach the target sparsity level at the 150K-th step. The intuition, similar to [6], is to apply pruning neither too early, so that the weights are reasonably distributed, nor too late, so that the model can recover from the sparsification-induced degradation. All models are trained with dropout. Since the sensing coefficient $\lambda$ is adjusted according to Eq. 6 during training, the results are not predominantly contingent on its initial value.

Consider the performance of M-I and M-II, which do not have a joint network and are trained on the far-field dataset. It is observed in Table 1 that the models are relatively robust at the sparsity level of 25%. At 50%, the degradation becomes noticeable for all 3 methods. The hard pruning approach does not yield a desirable performance, while our proposed CSP method gives the lowest relative WER degradation in these experimental settings. As expected, the results also indicate a higher relative degradation when the model size decreases.

In Table 1, we also report absolute WERs from models trained on the LibriSpeech train dataset and decoded on the LibriSpeech test-clean dataset. Not surprisingly, the models trained with method-A also suffer significant degradation at the 50% sparsity level. Again, it is observed that the proposed CSP method consistently outperforms all other approaches. Compared to M-IV, M-III indicates a higher robustness to pruning, likely due to the J-N. At 75% sparsity, all methods experience substantial performance degradation, especially for models with J-N. One reason is that the additional layer actually exacerbates the error, thus leading to higher relative degradation. However, it is worth noting that RNN-T models with J-N still outperform the counterparts without it by a large margin.

To understand the effect of CSP, we investigate how the model weights are redistributed when CSP is applied. Fig. 3 shows the weight distribution when CSP is applied to M-III.
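A sketch of the warm-hold-decay schedule described above; the peak and final learning-rate values are truncated in the source, so they are left as parameters, and the linear shape of the ramps is our assumption for illustration.

```python
def warm_hold_decay(step, lr_init, lr_peak, lr_final,
                    warm_end=3_000, hold_end=75_000, decay_end=200_000):
    """Warm-hold-decay schedule: ramp to lr_peak by warm_end, hold until
    hold_end, then decay to lr_final by decay_end (linear ramps assumed)."""
    if step <= warm_end:                      # warm-up phase
        return lr_init + (lr_peak - lr_init) * step / warm_end
    if step <= hold_end:                      # hold at the peak
        return lr_peak
    if step <= decay_end:                     # decay to the final value
        frac = (step - hold_end) / (decay_end - hold_end)
        return lr_peak + (lr_final - lr_peak) * frac
    return lr_final
```

Note that the pruning window (100K to 150K steps) sits entirely inside the decay phase, matching the stated intuition: weights are already reasonably settled, yet the learning rate is still large enough to recover from pruning-induced loss.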
Fig. 3: Model weight histogram when the sparsity (SP) goes up from 50% to 75%: the threshold is barely increased, with the newly pruned weights selected via the CSP method, as shown in the zoomed-in insets.
Fig. 4: Pruning threshold comparison at 75% sparsity: CSP can redistribute weights to approach the same sparsity level with a smaller pruning threshold than the gradual pruning method.

It is observed that, to achieve a higher sparsity ratio from 50% to 75% with CSP, the pruning threshold does not move by a large margin. Instead, a set of the weights is driven towards 0 and consequently pruned. In particular, from the top-right zoomed-in figure, we can see that most of the additional weights pruned at 75% in CSP are not those closest to the threshold at the 50% sparsity level. In contrast, the gradual pruning approach significantly increases the pruning threshold to accommodate the higher sparsity level, and then simply prunes the weights with the smallest values (Fig. 4). Rather than a hard reset of the threshold, CSP determines "which (weights) to prune" via a joint optimization of the sparsity and reconstruction regularizers (see Eq. 5), thus leading to a much smaller WER degradation.
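To make the contrast concrete: under magnitude pruning, a higher sparsity target necessarily raises the threshold to a higher quantile of the weight magnitudes, whereas CSP moves weights below the existing threshold instead. A minimal check of the magnitude-pruning rule (helper name ours):

```python
import numpy as np

def magnitude_threshold(weights, target_sparsity):
    """Smallest magnitude threshold that zeroes a `target_sparsity`
    fraction of the weights (the magnitude-pruning rule)."""
    return float(np.quantile(np.abs(weights).ravel(), target_sparsity))

# With a fixed weight distribution, 75% sparsity demands a strictly larger
# threshold than 50%; CSP sidesteps this by redistributing weights toward
# zero so that (nearly) the same threshold yields the higher sparsity.
w = np.random.randn(1024, 1024)
print(magnitude_threshold(w, 0.50), magnitude_threshold(w, 0.75))
```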
5. CONCLUSIONS
We propose a novel pruning approach for machine learning model compression based on compressed sensing, termed CSP. Compared to existing sparsification methods, which focus only on "when to prune", CSP further addresses the question "which (weights) to prune" by considering both sparsity-inducing and compression-error reduction mechanisms. We validate the effectiveness of CSP on the speech recognition task with an RNN-T model. CSP achieves superior results compared to other sparsification approaches. The proposed method can be straightforwardly incorporated into other ML models and/or compression methods to further reduce model complexity.

6. REFERENCES

[1] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly, "A comparison of sequence-to-sequence models for speech recognition," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017, pp. 939–943.
[2] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in Proceedings of the International Conference on Learning Representations (ICLR), 2016.
[3] S. Punjabi, H. Arsikere, Z. Raeesy, C. Chandak, N. Bhave, A. Bansal, M. Müller, S. Murillo, A. Rastrow, S. Garimella, et al., "Streaming end-to-end bilingual ASR systems with joint language identification," arXiv preprint arXiv:2007.03900, 2020.
[4] S. Cao, C. Zhang, Z. Yao, W. Xiao, L. Nie, D. Zhan, Y. Liu, M. Wu, and L. Zhang, "Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity," in Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2019, pp. 63–72.
[5] S. Wang, P. Lin, R. Hu, H. Wang, J. He, Q. Huang, and S. Chang, "Acceleration of LSTM with structured pruning method on FPGA," IEEE Access, vol. 7, pp. 62930–62937, 2019.
[6] M. Zhu and S. Gupta, "To prune, or not to prune: Exploring the efficacy of pruning for model compression," in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
[7] S. Narang, E. Elsen, G. Diamos, and S. Sengupta, "Exploring sparsity in recurrent neural networks," in Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[8] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight, "Sparsity and smoothness via the fused lasso," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 1, pp. 91–108, 2005.
[9] D. L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[10] C. Louizos, M. Welling, and D. P. Kingma, "Learning sparse neural networks through L0 regularization," in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
[11] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[12] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700–4708.
[13] X. Yuan and R. Haimi-Cohen, "Image compression based on compressive sensing: End-to-end comparison with JPEG," IEEE Transactions on Multimedia, 2020.
[14] M. Qiao, Z. Meng, J. Ma, and X. Yuan, "Deep learning for video compressive sensing," APL Photonics, vol. 5, no. 3, 2020.
[15] M. Lustig, D. L. Donoho, J. M. Santos, and J. M. Pauly, "Compressed sensing MRI," IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 72–82, 2008.
[16] E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006.
[17] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964.
[18] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the International Conference on Machine Learning (ICML), 2006, pp. 369–376.
[19] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.