Tiny Transducer: A Highly-efficient Speech Recognition Model on Edge Devices
Yuekai Zhang*†, Sining Sun†, Long Ma

Tencent Technology Co., Ltd, Beijing, China
The Johns Hopkins University, Baltimore, MD, USA
[email protected], {siningsun, malonema}@tencent.com

*Work performed during internship at Tencent. †The first two authors contributed equally to this work.

ABSTRACT
This paper proposes an extremely lightweight phone-based transducer model with a tiny decoding graph for edge devices. First, a phone synchronous decoding (PSD) algorithm based on blank label skipping is used to speed up the transducer decoding process. Then, to decrease the deletion errors introduced by high blank scores, a blank label deweighting approach is proposed. To reduce parameters and computation, deep feedforward sequential memory network (DFSMN) layers are used in the transducer encoder, and a CNN-based stateless predictor is adopted. SVD compresses the model further. A WFST-based decoding graph takes the context-independent (CI) phone posteriors as input and allows us to flexibly bias user-specific information. Finally, with only 0.9M parameters after SVD, our system gives a relative 9.1%–20.5% improvement over a larger conventional hybrid system on edge devices.
Index Terms— Transducer, on-device model, phone synchronous decoding
1. INTRODUCTION
Recently, end-to-end (E2E) models [1, 2, 3, 4, 5] for automatic speech recognition (ASR) have become popular in the ASR community. Compared with conventional ASR systems [6, 7], which include three components, an acoustic model (AM), a pronunciation model (PM), and a language model (LM), E2E models consist of a single end-to-end trained neural network, yet achieve performance comparable to conventional systems. Thus, E2E models are gradually replacing traditional hybrid models in industry [5, 8].

Another line of research focuses on deploying ASR systems on devices such as cellphones, tablets, and embedded devices [8, 9, 10, 11]. However, deploying E2E models on devices poses several challenges. First, on-device ASR tasks usually require a streamable E2E model with low latency. Popular E2E models such as attention-based encoder-decoder (AED) models [12, 13, 14] have shown state-of-the-art performance on many tasks, but the attention mechanism is inherently unfriendly to online ASR. Second, customizability is desired in many on-device ASR scenarios: the model should perform well on user-specific information such as contacts' phone numbers and favorite song names. In [15], shallow fusion is combined with the E2E model's predictions during decoding. In [16], text-to-speech (TTS) technology is used to generate training samples from text-only data. However, these approaches all need to retrain the acoustic model or the language model (LM). Finally, ASR systems have to be very compact, especially on edge devices where memory and computing resources are highly constrained (e.g., embedded devices for vehicles can allocate only a small memory and computing budget to ASR).

To satisfy the above requirements, we present a highly efficient ASR system suitable for ASR tasks with limited computing resources. Our proposed system consists of a lightweight phone-based speech transducer and a tiny decoding graph. The transducer converts speech features to phone sequences. The decoding graph, composed of a lexicon FST and a grammar FST and named the LG graph, maps phone posteriors to word sequences. On the one hand, compared with conventional senone-based acoustic modeling, the phone-based speech transducer simplifies the acoustic modeling process. On the other hand, combining it with the LG graph makes it easy to fuse a language model or bias user-specific information into the decoding graph.

Within our proposed architecture, we first adopt a phone synchronous decoding (PSD) algorithm for the transducer with a blank skipping strategy, improving decoding speed dramatically with no drop in recognition performance. Then, to alleviate the deletion errors caused by over-scored blank predictions, we propose a blank label deweighting approach during transducer decoding, which reduces deletion errors significantly in our experiments. To reduce model parameters and computation, deep feedforward sequential memory network (DFSMN) layers are used to replace the RNN encoder, and a causal 1-D CNN-based (Conv1d) stateless predictor [17, 18] is adopted. Finally, we apply singular value decomposition (SVD) to the speech transducer to further compress the model. Our tiny transducer achieves promising performance with only 0.9M parameters.
2. TINY TRANSDUCER

The RNN-T model was proposed in [19] as an improvement over connectionist temporal classification (CTC) [20]; it removes CTC's strong prediction independence assumption. An RNN-T consists of three parts: an encoder, a predictor, and a joint network. Traditionally, both the encoder and the predictor consist of multi-layer recurrent neural networks such as LSTMs, resulting in high computation on devices. In this work, a DFSMN-based encoder and a causal Conv1d stateless predictor are used to achieve efficient computation on devices. Fig. 1 illustrates the architecture of our transducer model.
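For concreteness, below is a minimal PyTorch sketch of a standard additive transducer joint network. The paper does not spell out the joint's dimensions or nonlinearity, so all sizes here (and the tanh combination) are illustrative assumptions; the vocabulary of 211 matches the 210 CI phones plus blank described later.

```python
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    """Additive RNN-T joint: fuses one encoder frame with one predictor
    state and emits log posteriors over CI phones + blank."""
    def __init__(self, enc_dim=256, pred_dim=256, joint_dim=256, vocab=211):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab)

    def forward(self, h_enc, h_pred):
        # h_enc: (B, enc_dim), h_pred: (B, pred_dim)
        z = torch.tanh(self.enc_proj(h_enc) + self.pred_proj(h_pred))
        return torch.log_softmax(self.out(z), dim=-1)
```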
Fig. 1. The architecture of the transducer model.
2.1. DFSMN Encoder

Due to the limited computation resources, the popular streaming architecture, the LSTM, is replaced with DFSMN layers. DFSMN combines FSMN [21] with low-rank matrix factorization [22, 23] to reduce network parameters. To keep a good trade-off between model accuracy and latency, we set the number of left context frames in each DFSMN layer to eight and the number of right context frames to two. In this way, deeper layers have a wider receptive field with more future information. Additionally, two CNN layers, each with stride two, are inserted before the DFSMN layers to perform subsampling, which leads to four-times subsampling overall.
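A minimal sketch of one DFSMN layer follows, assuming the usual formulation: a low-rank projection plus a per-dimension memory block, realized here as a depthwise convolution with 8 left and 2 right context taps, and a skip connection between memory blocks. Hidden and projection widths are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFSMNLayer(nn.Module):
    """Sketch of a DFSMN layer: low-rank projection + depthwise
    convolutional memory block with asymmetric (8 left, 2 right) context."""
    def __init__(self, hidden=512, proj=128, left=8, right=2):
        super().__init__()
        self.in_proj = nn.Linear(hidden, proj)   # low-rank factorization
        self.left, self.right = left, right
        # Depthwise conv realizes the per-dimension memory coefficients.
        self.memory = nn.Conv1d(proj, proj, kernel_size=left + right + 1,
                                groups=proj, bias=False)
        self.out_proj = nn.Linear(proj, hidden)

    def forward(self, x, mem_skip=None):
        # x: (B, T, hidden)
        p = self.in_proj(x)                      # (B, T, proj)
        pc = p.transpose(1, 2)                   # (B, proj, T)
        pc = F.pad(pc, (self.left, self.right))  # asymmetric context padding
        m = p + self.memory(pc).transpose(1, 2)  # memory block + identity
        if mem_skip is not None:                 # skip connection between
            m = m + mem_skip                     # memory blocks (DFSMN)
        return F.relu(self.out_proj(m)), m
```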
2.2. Conv1d Stateless Predictor

The predictor network has only one causal Conv1d layer. It takes the M previous predictions as input; we set M to four in our experiments. Formally, the predictor output at step u is

h^pred_u = Conv1d(Embed(y_{u−M}, ..., y_{u−1}))   (1)

where Embed() maps the predicted labels to the corresponding embeddings. Fig. 1 also shows our Conv1d predictor.
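A sketch of this stateless predictor under the stated configuration (M = 4 look-back, one causal Conv1d); the embedding and output widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StatelessPredictor(nn.Module):
    """Conv1d stateless predictor: the output at step u depends only on
    the last M previously emitted labels, not on a recurrent state."""
    def __init__(self, vocab=211, embed=256, out_dim=256, context=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        # kernel_size = M consumes exactly the M previous labels
        self.conv = nn.Conv1d(embed, out_dim, kernel_size=context)

    def forward(self, prev_labels):
        # prev_labels: (B, M) int tensor of the last M predicted phones
        e = self.embed(prev_labels).transpose(1, 2)  # (B, embed, M)
        return self.conv(e).squeeze(-1)              # (B, out_dim)
```

Because the output depends only on the last M labels, the predictor carries no recurrent state, which both shrinks the model and simplifies the incremental updates in Algorithm 1 below.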
3. DECODING WITH TINY TRANSDUCER
In this work, we choose CI phones as prediction units. Combining the transducer with a traditional WFST decoder allows us to flexibly inject biasing contextual information into the decoding graph without retraining the acoustic model or the LM. During decoding, the CI phone posterior probabilities from the transducer model are the WFST decoder's input. Our WFST decoder combines two separate WFSTs: the lexicon (L) and the language model, or grammar (G). The final search graph (LG) can be written as

LG = min(det(L ∘ G))   (2)

where min and det denote the minimize and determinize operations, respectively. To further speed up the decoding process and reduce parameters, we introduce our PSD algorithm and SVD technique in this section.

3.1. Phone Synchronous Decoding

The PSD algorithm was first used in [24] to speed up decoding and reduce memory usage with CTC lattices. A CTC model's peaky posterior property allows the PSD algorithm to ignore blank prediction frames and compress the search space. We found that the same peaky posterior property also exists in an RNN-T model: in the transducer lattice, most frames are aligned with blank symbols. Motivated by this, we present a PSD algorithm based on the RNN-T lattice, introduced below.

The decoding formulation for the RNN-T model using phones as prediction units is derived as follows:

w* = argmax_w { P(w) p(x|w) }
   = argmax_w { P(w) p(x|p_w) }
   = argmax_w { P(w) P(x) p(p_w|x) / P(p_w) }   (3)

where x is the acoustic feature sequence, and w and p_w are the word sequence and the corresponding phone sequence. Since P(x) is constant for a given utterance, Equation (3) can be simplified to

w* = argmax_w { (P(w) / P(p_w)) max_{p_w} p(p_w|x) }   (4)

We denote the standard decoding method as the frame synchronous decoding (FSD) algorithm. With Viterbi beam search, the FSD search of Equation (4) becomes

w* = argmax_w { (P(w) / P(p_w)) max_{π: π∈L', β(π)=p_w} [ ∏_{t∉U} y^t_{π_t} × ∏_{t∈U} y^t_blank ] }   (5)

where π is a possible alignment path, L' is the CI phone set plus the blank symbol, y^t_k is the posterior probability of RNN-T output unit k at time t, and U is the set of time steps where y^t_blank is close to one. The size of U can be controlled by setting a threshold on the blank posterior y^t_blank. Since π_t = blank does not change the corresponding output phone sequence β(π), and assuming all competing alignment paths share similar blank frame positions, we can ignore the scores of the blank frames. The equation below formulates our PSD algorithm on the RNN-T lattice:

w* = argmax_w { (P(w) / P(p_w)) max_{π: π∈L', β(π)=p_w} ∏_{t∉U} y^t_{π_t} }   (6)

In this way, the PSD method avoids redundant searches over the many blank frames. The PSD algorithm is summarized in Algorithm 1, and a Python sketch follows the listing. Note that we break the transducer lattice rule slightly in decoding: one frame outputs only one phone label or blank [25].

To reduce the deletion errors caused by high blank label scores, we combine the blank frame skipping strategy with a blank label deweighting technique in Algorithm 1. We first deweight the blank scores by subtracting a deweighting factor in the log domain. Then, frames with blank scores above a predefined threshold are filtered out. The results in Section 4.4 show that the deweighting method reduces deletion errors significantly. By changing the blank threshold, we can control how many blank frames are skipped.
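Before turning to Algorithm 1, note that the LG construction in Eq. (2) maps directly onto standard OpenFst operations. A minimal sketch using the pywrapfst bindings is given below; the file names are illustrative, and L and G are assumed to have been built by the usual lexicon/LM recipes (e.g., EESEN's scripts).

```python
import pywrapfst as fst

L = fst.Fst.read("L.fst")        # lexicon: CI phone sequences -> words
G = fst.Fst.read("G.fst")        # grammar: 4-gram LM / biasing rules
L.arcsort(sort_type="olabel")    # composition expects sorted arcs
LG = fst.determinize(fst.compose(L, G))
LG.minimize()                    # in place: LG = min(det(L o G))
LG.write("LG.fst")
```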
Algorithm 1: PSD algorithm

Input: features {x_0, ..., x_{T−1}}, blank deweighting value β_blank, blank threshold γ_blank, Conv1d look-back M
Output: predicted word sequence w*

  y_in = Zeros(M); u = 1; Q_posterior = {}; w* = {}; h^pred_0 = Predictor(y_in)
  for t ← 0 to T−1 do
      h^enc_t = Encoder(x_t); p_{t,u} = Joint(h^enc_t, h^pred_{u−1})
      p_{t,u}(blank) = p_{t,u}(blank) × β_blank
      y^t_u = argmax p_{t,u}
      if y^t_u ≠ blank then
          u = u + 1
          y_in ← [y_in[1:], y^t_u]
          h^pred_u = Predictor(y_in)
      end if
      if p_{t,u}(blank) ≤ γ_blank then
          Enqueue(Q_posterior, p_{t,u})
          w_{t,u} = WFSTDecoding(LG, Q_posterior)
          Enqueue(w*, w_{t,u})
      end if
  end for
  return w*
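The loop below is a hedged Python rendering of Algorithm 1. `encoder`, `predictor`, `joint`, and `wfst_decode` are stand-ins for the transducer components and the EESEN-style LG search; `joint` is assumed to return a NumPy vector of log posteriors. Deweighting is done by log-domain subtraction, matching the text (the listing's multiplicative form is the same operation on posteriors).

```python
import numpy as np

def psd_decode(encoder, predictor, joint, wfst_decode, feats,
               blank_id=0, beta_blank=2.0, gamma_blank=0.95, M=4):
    y_in = [blank_id] * M              # Conv1d look-back window
    h_pred = predictor(y_in)
    q_posterior, hyps = [], []
    for x_t in feats:                  # one encoder frame per step
        h_enc = encoder(x_t)
        logp = joint(h_enc, h_pred)    # log posteriors over phones + blank
        logp[blank_id] -= beta_blank   # blank deweighting (log domain)
        y = int(np.argmax(logp))
        if y != blank_id:              # at most one phone per frame [25]
            y_in = y_in[1:] + [y]
            h_pred = predictor(y_in)
        if np.exp(logp[blank_id]) <= gamma_blank:
            q_posterior.append(logp)   # keep only non-blank-dominated frames
            hyps = wfst_decode(q_posterior)  # incremental LG search
    return hyps
```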
3.2. SVD-based Model Compression

We further reduce the model parameters using SVD. Since the parameters mainly come from the feed-forward projection layers in the DFSMN encoder, SVD is applied only to these projection layers' weight matrices. Following the strategy in [26], we first reduce the model size with SVD and then fine-tune the compressed model to recover the accuracy loss.
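A sketch of the SVD step for a single projection layer, assuming PyTorch Linear modules; the rank is a free parameter chosen to hit the parameter budget, with fine-tuning applied afterwards as described.

```python
import torch
import torch.nn as nn

def svd_compress(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Factor one Linear layer's weight W (out x in) into two smaller
    layers with rank * (in + out) parameters in total [26]."""
    W = linear.weight.data                   # (out, in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]
    first = nn.Linear(W.shape[1], rank, bias=False)
    second = nn.Linear(rank, W.shape[0], bias=linear.bias is not None)
    first.weight.data = torch.diag(S) @ Vh   # (rank, in)
    second.weight.data = U                   # (out, rank)
    if linear.bias is not None:
        second.bias.data = linear.bias.data
    return nn.Sequential(first, second)
```

Replacing a D_out × D_in weight with two factors costs rank × (D_in + D_out) parameters, so the compression pays off whenever the rank is much smaller than min(D_in, D_out).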
4. EXPERIMENT
4.1. Dataset and Experimental Setup

The experiments are conducted on an 18,000-hour in-car Mandarin speech dataset, which includes queries, navigation commands, and conversational speech collected from Tencent in-car speech assistant products. All the data are anonymized and hand-transcribed. The development and test sets consist of 3,382 and 6,334 utterances, about 4 hours and 7 hours of audio, respectively.
Our model takes 40-dimensional power-normalized cepstral coefficient (PNCC) features [27] as input, computed with a 25 ms window and a 10 ms stride. The Adam optimizer, with an initial learning rate of 0.0005, is used to train the transducer model. SpecAugment [28] with frequency mask parameter F = 20 and ten time masks with a maximum time-mask ratio pS = 0.05 is used as preprocessing. A 4-gram language model is trained on the transcripts and an additional text-only corpus. We use three different configurations for the large, medium, and small models' encoders. The predictors are all one-layer CNNs with input dimensions matched to the corresponding encoder size. The output units comprise 210 context-independent (CI) phones and the blank symbol. The transducer models are implemented with the ESPnet toolkit [29]. We first store the predicted posterior probability matrices of the CI phones; the EESEN toolkit [30] then processes the posteriors and produces the decoding results. Table 1 summarizes the model architecture details.
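The ESPnet pipeline applies SpecAugment internally during training; purely as an illustration of the stated configuration, a NumPy sketch follows. We assume a single frequency mask of width up to F = 20 (the number of frequency masks is not stated in the text) and ten time masks, each covering at most pS = 0.05 of the utterance.

```python
import numpy as np

def spec_augment(feats, F=20, num_time_masks=10, p_s=0.05, rng=np.random):
    """Illustrative SpecAugment [28]: one frequency mask plus ten time
    masks. `feats` is a (T, num_features) array, modified in place."""
    T, D = feats.shape
    f = rng.randint(0, F + 1)                 # frequency mask width
    f0 = rng.randint(0, max(1, D - f + 1))
    feats[:, f0:f0 + f] = 0.0
    max_t = max(1, int(p_s * T))              # per-mask time budget
    for _ in range(num_time_masks):
        t = rng.randint(0, max_t + 1)
        t0 = rng.randint(0, max(1, T - t + 1))
        feats[t0:t0 + t, :] = 0.0
    return feats
```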
Table 1. Details for the large, medium, and small models.

Model | Large | Medium | Small
4.2. WER Results

Table 2 shows the word error rate (WER) results of the conventional hybrid system and four RNN-T models of different sizes, all using the same language model. The hybrid system uses a TDNN acoustic model with 2.5M parameters, which is comparable to our small transducer model. By combining the end-to-end transducer model with the LG WFST decoder, we surpass the hybrid system's performance while keeping the flexibility of WFST decoding to customize the ASR system. Furthermore, Table 2 also shows the results of the small model after SVD. With only 0.9M parameters, the SVD model with fine-tuning achieves 19.57% CER, still better than the hybrid system.
Table 2. WER (%) results on the dev and test sets.

4.3. RTF Results for PSD and FSD Algorithms
In this section, we show the relationship between the blank rate and the corresponding threshold, and then give real-time factor (RTF) and WER results for our small model. We define the blank rate α as

α = size(U(γ_blank)) / T   (7)

where T is the sequence length and U(γ_blank) is the set of all blank frames:

U(γ_blank) = { t : p_{t,u}(blank) > γ_blank }   (8)

The number of blank frames is controlled by the blank posterior probability threshold γ_blank. In decoding, all frames in the set U are skipped. When γ_blank is larger than 1, no frames are regarded as blank frames; in this case, the PSD algorithm degrades to the FSD algorithm.

Table 3. Results with different threshold values.

Method | γ_blank | α (%) | RTF | S-RTF | Dev WER(%) | Test WER(%)
FSD | 1.0 | 0 | 0.069 | 0.053 | 14.12 | 18.11
PSD | 0.95 | 77.08 | 0.034 | 0.017 | 14.12 | 18.10
PSD | 0.85 | 80.73 | 0.033 | 0.017 | 19.73 | 23.85
PSD | 0.75 | 82.76 | 0.032 | 0.016 | 36.79 | 41.28

Table 3 compares different threshold values and the corresponding results. Since the PSD and FSD algorithms differ only in WFST decoding time, we use RTF to represent the entire computation process, including transducer forward time and decoding time; S-RTF denotes the WFST search time alone. We conduct these experiments on a server with an Intel(R) Xeon(R) E5 CPU as a proof of concept. The results show that setting γ_blank to 0.95 gives a good balance between speed and accuracy.

4.4. Blank Label Deweighting Results

Following the strategy in [31], we do not normalize the phone label posteriors in decoding. We deweight the blank labels' posteriors to add a cost to deletion errors during decoding, while the other labels' posteriors remain unchanged. We subtract a blank deweighting value in the log probability domain, which is equivalent to dividing the blank labels' posterior probability by a constant. Figure 2 shows the deletion, substitution, and insertion errors for our small transducer model on the development set. Deletion errors can always be reduced by subtracting a larger deweighting value; however, too large a deweighting value increases the total WER. We tune the deweighting value on the development set and use a value of two to obtain the best result.
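Two small helpers make the relationships above concrete; the names are illustrative. Subtracting β_blank = 2 in the log domain divides the blank posterior by e² ≈ 7.39, and the blank rate α of Eqs. (7)–(8) is simply the fraction of frames whose blank posterior exceeds γ_blank.

```python
import numpy as np

def deweight_blank(logp, blank_id=0, beta_blank=2.0):
    # Log-domain subtraction == dividing the blank posterior by exp(beta).
    logp = logp.copy()
    logp[blank_id] -= beta_blank
    return logp

def blank_rate(blank_posteriors, gamma_blank=0.95):
    # Eq. (7)-(8): fraction of frames skipped by the PSD search.
    p = np.asarray(blank_posteriors)
    return float(np.mean(p > gamma_blank))
```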
Fig. 2. WER for different blank deweighting values β_blank.

4.5. On-device Results

We also deploy our system on edge devices. Int8 quantization is used to reduce memory consumption and speed up inference. Note that, to trade off speech recognition accuracy against inference efficiency, only the FSMN layers, which are parameter intensive, are quantized (a minimal sketch of this step follows Table 4). Because the quantized model obtains accuracy similar to the results in Table 2, we report only the mean CPU usage and RTF of our small RNN-T model in Table 4. As Table 4 shows, our proposed PSD method significantly reduces CPU usage and RTF compared with FSD.
Table 4. On-device CPU usage and RTF results.

Arm CPU | ARMv7 4-core | AArch64 4-core
CPU usage, FSD | 48.5% | 38.8%
CPU usage, PSD | 21.5% | 6.2%
RTF, FSD | 2.88 | 2.66
RTF, PSD | 0.55 | 0.42
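The on-device inference runtime is not specified in the paper; as an analogous PyTorch-side sketch, post-training dynamic int8 quantization can be restricted to the parameter-heavy projection (Linear) layers, mirroring the choice to quantize only the FSMN layers.

```python
import torch
import torch.nn as nn

def quantize_fsmn_layers(model: nn.Module) -> nn.Module:
    """Dynamic int8 quantization of Linear layers only, leaving the rest
    of the model in float (a stand-in for quantizing the FSMN layers)."""
    return torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8)
```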
5. CONCLUSION
This paper introduces the pipeline for designing a highly compact speech recognition system for extremely low-resource edge devices. To fulfill the streaming requirement under low-computation and small-model-size constraints, we choose a transducer with a DFSMN encoder, and the LSTM predictor is replaced with a Conv1d layer to further reduce parameters and computation. To retain contextual and customizable recognition ability, we use CI phones as modeling units and bias the language model in the WFST decoding stage. A novel PSD decoding algorithm based on the transducer lattice is proposed to speed up the decoding process. In addition, blank label deweighting and SVD are adopted to improve recognition accuracy and compactness. The proposed system realizes streaming, fast, and accurate speech recognition with very few parameters.

6. REFERENCES
[1] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in ICASSP, 2017.
[2] Senmao Wang, Pan Zhou, Wei Chen, Jia Jia, and Lei Xie, "Exploring RNN-Transducer for Chinese speech recognition," in APSIPA ASC, 2019, pp. 1364–1369.
[3] Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, et al., "Recent developments on ESPnet toolkit boosted by Conformer," arXiv preprint arXiv:2010.13956, 2020.
[4] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al., "Conformer: Convolution-augmented transformer for speech recognition," Interspeech, 2020.
[5] Jinyu Li, Rui Zhao, Zhong Meng, Yanqing Liu, Wenning Wei, Sarangarajan Parthasarathy, Vadim Mazalov, Zhenghao Wang, Lei He, Sheng Zhao, et al., "Developing RNN-T models surpassing high-performance hybrid models with customization capability," Interspeech, 2020.
[6] Arnab Ghoshal and Daniel Povey, "Sequence discriminative training of deep neural networks," in Interspeech, 2013.
[7] Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Interspeech, 2015.
[8] Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen, et al., "A streaming on-device end-to-end model surpassing server-side conventional model quality and latency," in ICASSP, 2020.
[9] Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, et al., "Streaming end-to-end speech recognition for mobile devices," in ICASSP, 2019.
[10] Jinhwan Park, Yoonho Boo, Iksoo Choi, et al., "Fully neural network based speech recognition on mobile and embedded devices," in NeurIPS, 2018.
[11] Xiong Wang, Zhuoyuan Yao, Xian Shi, and Lei Xie, "Cascade RNN-Transducer: Syllable based streaming on-device Mandarin speech recognition with a syllable-to-character converter," arXiv preprint arXiv:2011.08469, 2020.
[12] Linhao Dong, Shuang Xu, and Bo Xu, "Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition," in ICASSP, 2018.
[13] Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al., "A comparative study on Transformer vs RNN in speech applications," in ASRU, 2019.
[14] Haoneng Luo, Shiliang Zhang, Ming Lei, and Lei Xie, "Simplified self-attention for transformer-based end-to-end speech recognition," arXiv preprint arXiv:2005.10463, 2020.
[15] Ding Zhao, Tara N. Sainath, David Rybach, Pat Rondon, Deepti Bhatia, Bo Li, and Ruoming Pang, "Shallow-fusion end-to-end contextual biasing," Interspeech, 2019.
[16] Khe Chai Sim, Françoise Beaufays, Arnaud Benard, Dhruv Guliani, Andreas Kabel, Nikhil Khare, Tamar Lucassen, Petr Zadrazil, Harry Zhang, Leif Johnson, et al., "Personalization of end-to-end speech recognition on mobile devices for named entities," in ASRU, 2019.
[17] Mohammadreza Ghodsi, Xiaofeng Liu, James Apfel, Rodrigo Cabrera, and Eugene Weinstein, "RNN-Transducer with stateless prediction network," in ICASSP, 2020.
[18] Chao Weng, Chengzhu Yu, Jia Cui, et al., "Minimum Bayes risk training of RNN-Transducer for end-to-end speech recognition," Interspeech, 2020.
[19] Alex Graves, "Sequence transduction with recurrent neural networks," arXiv preprint arXiv:1211.3711, 2012.
[20] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in ICML, 2006.
[21] Shiliang Zhang, Cong Liu, Hui Jiang, et al., "Feedforward sequential memory networks: A new structure to learn long-term dependency," arXiv preprint arXiv:1512.08301, 2015.
[22] Shiliang Zhang, Hui Jiang, et al., "Compact feedforward sequential memory networks for large vocabulary continuous speech recognition," Interspeech, 2016.
[23] Shiliang Zhang, Ming Lei, Zhijie Yan, and Lirong Dai, "Deep-FSMN for large vocabulary continuous speech recognition," in ICASSP, 2018.
[24] Zhehuai Chen, Yimeng Zhuang, Yanmin Qian, and Kai Yu, "Phone synchronous speech recognition with CTC lattices," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 1, pp. 90–101, 2016.
[25] Anshuman Tripathi, Han Lu, Hasim Sak, and Hagen Soltau, "Monotonic recurrent neural network transducer and decoding strategies," in ASRU, 2019, pp. 944–948.
[26] Jian Xue, Jinyu Li, and Yifan Gong, "Restructuring of deep neural network acoustic models with singular value decomposition," in Interspeech, 2013, pp. 2365–2369.
[27] Chanwoo Kim and Richard M. Stern, "Power-normalized cepstral coefficients (PNCC) for robust speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 7, pp. 1315–1329, 2016.
[28] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," Interspeech, 2019.
[29] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al., "ESPnet: End-to-end speech processing toolkit," Interspeech, 2018.
[30] Yajie Miao, Mohammad Gowayyed, and Florian Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in ASRU, 2015.
[31] Haşim Sak, Andrew Senior, Kanishka Rao, Ozan Irsoy, Alex Graves, Françoise Beaufays, and Johan Schalkwyk, "Learning acoustic frame labeling for speech recognition with recurrent neural networks," in ICASSP, 2015.