FAST OFFLINE TRANSFORMER-BASED END-TO-END AUTOMATIC SPEECH RECOGNITION FOR REAL-WORLD APPLICATIONS
Yoo Rhee Oh, Kiyoung Park, and Jeon Gyu Park
Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute (ETRI), Daejeon, 34129, Republic of Korea
{yroh, pkyoung, jgp}@etri.re.kr

ABSTRACT
Many real-world applications require converting speech files into text with high accuracy under limited computational resources. This paper proposes a method for quickly recognizing a large speech database using a Transformer-based end-to-end model. Transformers have improved the state-of-the-art performance in many fields, including speech recognition, but they are not easy to apply to long sequences. In this paper, several techniques to speed up the recognition of real-world speech are proposed and tested: parallelizing the recognition with a batched beam search, detecting end-of-speech based on connectionist temporal classification (CTC), restricting the CTC prefix score, and splitting long recordings into short segments. Experiments are conducted on a real-world Korean speech recognition task. Experimental results on an 8-hour test corpus show that the proposed system converts the speech into text in less than 3 minutes with a 10.73% character error rate, which is 27.1% relatively lower than that of a conventional DNN-HMM based recognition system.
Index Terms — Speech recognition, Transformer, end-to-end, segmentation, connectionist temporal classification
1. INTRODUCTION
Owing to the recent rapid advances in automatic speech recognition (ASR) research, the technology is now widely adopted in practical applications in which ASR had previously been difficult to apply, such as automatic response systems, automatic subtitle generation, and meeting transcription.

More recently, Transformer-based end-to-end ASR has achieved state-of-the-art performance, providing a major breakthrough comparable to the introduction of deep learning into hidden Markov model (HMM) systems [1, 2]. However, a Transformer-based end-to-end system is not yet well suited to recognizing continuously spoken speech because self-attention is computed over the entire utterance, and there are many studies aiming to overcome this drawback, including [3], [4], and [5].

In the applications mentioned above, the system must recognize speech uttered spontaneously and continuously by multiple speakers. Abundant recordings are waiting to be transcribed, and more data are generated every day. It is therefore important to recognize a large speech database very quickly and at the lowest possible cost. In this paper, we propose a method that accomplishes this through highly parallelized processing, including batched recognition on GPUs as in [6], connectionist temporal classification (CTC) based end-of-speech detection, a restricted CTC prefix score, and segmentation of long recordings.

The organization of this paper is as follows. In Section 2, a Transformer-based end-to-end Korean ASR system is presented along with a description of the model architecture and the training and test corpora. In Section 3, the methods we use to speed up recognition of a large speech corpus are presented. In Section 4, the performance of the proposed ASR is evaluated on a real-world meeting corpus spoken by multiple speakers. Finally, we conclude the paper with our findings in Section 5.
2. BASELINE ASR SYSTEM AND SPEECH CORPUS

2.1. Transformer-based end-to-end model
The Transformer-based end-to-end speech recognizer used in this work is based on [7] and trained using ESPnet, an end-to-end speech processing toolkit [8].

As input features, 80-dimensional log-Mel filterbank coefficients are extracted for every 10 msec analysis frame. Pitch features are not used since tone and pitch are not distinctive features in Korean. During training and decoding, global cepstral mean and variance normalization is applied to the feature vectors.

Most hyper-parameters of the Transformer model follow the default settings of the toolkit. The encoder consists of two convolutional layers, a linear projection layer, and a positional encoding layer, followed by 12 self-attention blocks with layer normalization. An additional linear layer is used for CTC. The decoder has 6 self-attention blocks. Every Transformer layer uses a 2048-dimensional feed-forward network and 4 attention heads with 256 dimensions. Training uses the Noam optimizer [1], early stopping, warmup steps, label smoothing, gradient clipping, and gradient accumulation [9].

Text tokenization is performed at the character level. In Korean, a character consists of 2 or 3 graphemes and corresponds to a syllable. Although the number of plausible Korean characters exceeds 10,000, only the 2,273 tokens seen in the training corpus are used as output symbols, including digits, alphabet letters, and a spacing symbol. No other text processing is performed.
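For reference, the model and training settings described above can be summarized in a single configuration sketch. This is only an illustrative summary written as a Python dictionary; the key names (and the label-smoothing value) are assumptions in the spirit of ESPnet-style configurations, not the toolkit's exact option names.

```python
# Illustrative summary of the hyper-parameters described in Section 2.1.
# Key names are assumptions; they are not ESPnet's exact configuration keys.
transformer_config = {
    "input_feature": "80-dim log-Mel filterbank, 10 ms frame shift",
    "encoder": {
        "input_layer": "conv2d",   # two convolutional layers + linear projection + positional encoding
        "num_blocks": 12,          # self-attention blocks with layer normalization
        "attention_dim": 256,
        "attention_heads": 4,
        "linear_units": 2048,      # feed-forward dimension
    },
    "decoder": {
        "num_blocks": 6,
        "attention_dim": 256,
        "attention_heads": 4,
        "linear_units": 2048,
    },
    "ctc": {"extra_linear_layer": True},
    "training": {
        "optimizer": "noam",
        "early_stopping": True,
        "label_smoothing": 0.1,    # assumed value; the paper does not state it
        "gradient_clipping": True,
        "accumulate_gradients": True,
    },
    "output_tokens": 2273,         # characters seen in the training corpus
}
```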
2.2. Speech corpus

To train the end-to-end ASR model, we use about 12.4k hours of Korean speech drawn from a variety of sources, including read digits and sentences, recordings of spontaneous conversation, and, mostly (about 11.1k hours), broadcast data [10]. All sentences are segmented and transcribed manually or automatically, and training utterances longer than 30 seconds are excluded.

As the test corpus, meetings were recorded at a public institute where an automatic meeting transcription system is to be introduced, for evaluation purposes. In total, 8 hours of recordings were collected from 7 meetings. In each meeting, 4 to 22 people participated, and every participant had their own gooseneck microphone. Each microphone is on only while the corresponding speaker is talking, and each channel is recorded separately. There are, however, some overlaps and cross-talk from adjacent speakers, which are ignored in the manual transcription. Each recording lasts from 5 seconds to 36 minutes and was manually transcribed and split according to its content by human transcribers. The resulting segments are 0.6 to 42.9 seconds long; their number and average length are shown in Table 2.
3. A FAST AND EFFICIENT TRANSFORMER-BASED SPEECH RECOGNITION

3.1. Fast decoding based on a batched beam search
Batch processing accelerates GPU parallelization [6, 11, 12, 13, 14]. In particular, [6, 12] introduce a vectorized beam search for a joint CTC/attention recurrent-neural-network end-to-end ASR. Similarly, we adopt a batched beam search for a Transformer-based end-to-end ASR to efficiently parallelize the recognition of multiple utterances.

Assume that speech features are extracted from multiple utterances and pushed into a queue $Q$. First, $U$ feature sequences $(x_1, \cdots, x_U)$ are popped from $Q$ and batched as $X = \{x_1, \cdots, x_U\}$, where each $x_i$ is extended to the length $\max_i |x_i|$ and the extended positions are masked with zeros. Using the attention encoder, $X$ is converted into the intermediate representations $H = \{h_1, \cdots, h_U\}$, where $h_i$ is the encoder output for $x_i$; all $h_i$ can be computed in parallel. Next, using the attention and CTC decoders, $H$ is converted into the set of text sequences $Y = \{y_1, \cdots, y_U\}$ by a $B$-width beam search, where $y_i$ is the predicted text sequence for $x_i$. At the $l$-th step of the beam search, the hypothesis set $Y^l$ is estimated by joint CTC/attention decoding with $H$ and $Y^{l-1}$. The hypothesis set $Y^k$ is defined as $\{y^k_{1,1}, \cdots, y^k_{U,B}\}$, where $y^k_{i,j}$ is the $j$-th hypothesis for $h_i$ with sequence length $k$; every hypothesis $y^k_{i,j}$ can be evaluated in parallel. The beam search terminates once every $y_i$ has encountered an end-of-speech label ($\langle\mathrm{eos}\rangle$) or the decoding step $l$ exceeds $|x_i|$. The output text sequence for $x_i$ is then determined as

$$\hat{y}_i = \operatorname*{argmax}_{y \in Y^{l_{\mathrm{last}}}} P(y \mid x_i), \qquad (1)$$

where $l_{\mathrm{last}}$ is the step at which decoding terminated and $P(y \mid x_i)$ is the joint CTC/attention probability of $y$ given $x_i$.

For joint CTC/attention decoding, the CTC prefix score $\log p_{\mathrm{ctc}}(y^l_{i,j}, \cdots \mid X)$ [6] is defined as

$$\sum_{l \le t \le \max_i |x_i|} \varphi_{t-1}(y^{l-1}_{i,j}) \, p(z_t = y^l_{i,j} \mid X), \qquad (2)$$

where $\varphi_{t-1}(y^{l-1}_{i,j})$ is the CTC forward probability of $y^{l-1}_{i,j}$ up to time $t-1$ and $p(z_t \mid X)$ is the CTC posterior probability at time $t$. We first define $\tau^l_{i,j}$ and $\tilde{\tau}^l_{i,j}$ as

$$\tau^l_{i,j} = \operatorname*{argmax}_{\tau^{l-1}_{i,j} \le t \le |x_i|} \varphi_t(y^l_{i,j}), \qquad (3)$$

$$\tilde{\tau}^l_{i,j} = \operatorname*{argmax}_{\tau^{l-1}_{i,j} \le t \le |x_i|} \varphi_t(\tilde{y}^l_{i,j}), \qquad (4)$$

where $\tilde{y}$ denotes the text sequence obtained by appending a blank symbol to $y$.

3.2. CTC-based end-of-speech detection and restricted CTC prefix score

To reduce computation, we propose an end-of-speech detection based on $\tau^l_{i,j}$ and $\tilde{\tau}^l_{i,j}$. The proposed method is applied when the end-of-speech detection of [15] fails to detect end-of-speech. It counts the hypotheses $y^l_{i,j}$ that end with $\langle\mathrm{eos}\rangle$ and whose $\tau^l_{i,j}$ equals $|x_i|$; if this count exceeds a threshold, end-of-speech is declared. In this work, the threshold is set to 3.

For a further reduction, we propose a time-restricted CTC prefix score obtained by restricting the time range $t$ of Eq. (2) as

$$\sum_{s^l_{i,j} \le t \le e^l_{i,j}} \varphi_{t-1}(y^{l-1}_{i,j}) \, p(z_t = y^l_{i,j} \mid X), \qquad (5)$$

where $s^l_{i,j}$ and $e^l_{i,j}$ are the start and end times of the calculation. To prevent irregular alignments caused by attention [15], $s^l_{i,j}$ and $e^l_{i,j}$ are calculated as

$$s^l_{i,j} = \max(\tau^{l-1}_{i,j} - M_1, \; l), \qquad (6)$$

$$e^l_{i,j} = \min(\tilde{\tau}^{l-1}_{i,j} + M_2, \; |x_i|), \qquad (7)$$

where $M_1$ and $M_2$ are tunable margin parameters. For batch processing of $Y^l$, the restricted range is defined as

$$s^l = \min_{1 \le i \le U,\; 1 \le j \le B} s^l_{i,j}, \qquad (8)$$

$$e^l = \max_{1 \le i \le U,\; 1 \le j \le B} e^l_{i,j}. \qquad (9)$$
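To make the restricted range and the end-of-speech counting concrete, the following is a minimal Python sketch of Eqs. (5)-(9) and of the counting rule, assuming the CTC forward probabilities and posteriors are already available as arrays. All function names, the hypothesis data layout, and the default margin values are illustrative assumptions, not the ESPnet implementation.

```python
def restricted_range(tau_prev, tau_tilde_prev, step, num_frames, m1=5, m2=20):
    """Per-hypothesis start/end frames of Eqs. (6)-(7), widened by margins M1/M2.
    m2=20 is an illustrative default; the paper tunes M2 experimentally."""
    s = max(tau_prev - m1, step)
    e = min(tau_tilde_prev + m2, num_frames - 1)
    return s, e

def batched_range(ranges):
    """Batched restricted range of Eqs. (8)-(9): the union over all hypotheses (i, j)."""
    starts, ends = zip(*ranges)
    return min(starts), max(ends)

def restricted_ctc_prefix_score(phi_prev, ctc_post, new_token, s, e):
    """Time-restricted CTC prefix score of Eq. (5).
    phi_prev[t]      : CTC forward probability of the previous prefix up to frame t
    ctc_post[t][v]   : CTC posterior p(z_t = v | X) at frame t for token v
    Only frames s..e are accumulated; Eq. (6) guarantees s >= 1."""
    score = 0.0
    for t in range(s, e + 1):
        score += phi_prev[t - 1] * ctc_post[t][new_token]
    return score

def count_eos_hypotheses(hyps, eos_id, num_frames):
    """CTC-based end-of-speech detection: count hypotheses that end with <eos>
    and whose most probable CTC end frame (tau) has reached the last frame."""
    return sum(1 for h in hyps
               if h["tokens"][-1] == eos_id and h["tau"] == num_frames - 1)

# An utterance is marked as finished early when the count exceeds the threshold
# (3 in this paper):
#   if count_eos_hypotheses(hyps, eos_id, num_frames) > 3: stop decoding it.
```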
3.3. Segmentation of long recordings

When it comes to recognizing long utterances, the performance of a Transformer-based end-to-end ASR tends to degrade significantly because of the characteristics of Transformer self-attention and its sensitivity to the utterance lengths seen in the training data [16]. Furthermore, the computational complexity of self-attention increases quadratically with the utterance length [1, 17].

To tackle these issues, we perform segmentation before recognition. Two segmentation methods are tested: the first splits at short pauses using a voice activity detector (VAD), and the second is a simple hard segmentation.

In [18], VAD information is used as a triggering signal for the decoder in real time. In this work, explicit segmentation is performed beforehand using a VAD based on the deep neural network (DNN) that serves as the acoustic model (AM) of a DNN-HMM ASR system. Let $o^t_i$ be the output of the $i$-th node of the neural network at time $t$. The speech presence probability $P_S(t)$ and speech absence probability $P_N(t)$ at time $t$ are estimated as

$$\log P_S(t) \approx \max_{k \in \mathcal{S}} o^t_k, \qquad (10)$$

$$\log P_N(t) \approx \max_{k \in \mathcal{N}} o^t_k, \qquad (11)$$

$$\mathrm{LLR}(t) = \log P_N(t) - \log P_S(t), \qquad (12)$$

where $k \in \mathcal{S}$ and $k \in \mathcal{N}$ denote output nodes corresponding to speech states and noise states, respectively. A frame is regarded as non-speech if $\mathrm{LLR}(t)$ is larger than a given threshold. To obtain more robust decisions, the frame-wise results are smoothed over multiple frames as in [11].

Segmentation based on DNN-VAD generally splits long utterances well, but it requires considerable computational resources. A simpler VAD algorithm such as [19] could also be used; instead, we additionally try hard segmentation by segment length, i.e., splitting long utterances into short pieces of equal length without regard to where words are truncated. Hard segmentation requires no computation, and the resulting segments are all of the same length, which reinforces the benefit of the batch processing of Section 3.1.
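The following is a minimal sketch of the DNN-VAD frame decision of Eqs. (10)-(12) with simple moving-average smoothing, assuming the DNN acoustic model's per-frame outputs and the index sets of speech and noise states are available. The array layout, smoothing window, and threshold are illustrative assumptions; [11] describes the smoothing actually used.

```python
import numpy as np

def vad_frame_decisions(dnn_outputs, speech_states, noise_states,
                        threshold=0.0, smooth_win=11):
    """Per-frame speech/non-speech decisions based on Eqs. (10)-(12).

    dnn_outputs   : array of shape (T, num_states) with per-frame log outputs
                    of the DNN acoustic model (illustrative layout).
    speech_states : indices of output nodes tied to speech states.
    noise_states  : indices of output nodes tied to noise/silence states.
    Returns a boolean array where True marks a speech frame.
    """
    # Eqs. (10)-(11): approximate log P_S(t) and log P_N(t) by the max over each state set.
    log_p_speech = dnn_outputs[:, speech_states].max(axis=1)
    log_p_noise = dnn_outputs[:, noise_states].max(axis=1)

    # Eq. (12): LLR(t) = log P_N(t) - log P_S(t); large values indicate non-speech.
    llr = log_p_noise - log_p_speech

    # Smooth the frame-wise LLR over a window for more robust decisions
    # (a simple moving average stands in for the smoothing of [11]).
    kernel = np.ones(smooth_win) / smooth_win
    llr_smooth = np.convolve(llr, kernel, mode="same")

    return llr_smooth <= threshold  # True = speech frame
```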
4. EXPERIMENTS
This section reports speech recognition experiments on the test corpus described in Section 2.2. First, the proposed parallelization methods are tested on the manually segmented corpus, and then the automatic segmentation of long recordings is evaluated.
4.1. Fast recognition with manual segmentation

Table 1. Performance of the baseline ASR and the proposed ASRs employing the 3-width batched beam search, the CTC-based end-of-speech detection, and the restricted CTC prefix score. Experiments are performed using two GPUs.

Method                                     CER (%)   Elapsed time (s)
baseline                                   9.06      7,401
+ batched decoding                         9.03      543
+ CPU CTC decoding                         9.06      331
+ CTC end-of-speech detection              9.08      303
+ restricted CTC, M1 = 5, M2 = ∞           –         263
+ restricted CTC, M1 = 5, M2 = –           –         211
+ restricted CTC, M1 = 5, M2 = –           –         210
+ restricted CTC, M1 = 5, M2 = –           –         200

As a baseline, we use the Transformer-based end-to-end ASR of Section 2.1 and perform a B-width beam search on two GPUs. At each decoding step, the B hypotheses are batched and their probabilities are computed in parallel. While the proposed batched decoding accounts for the different lengths of multiple utterances, the baseline ASR does not, since it processes one utterance at a time. This paper sets B to 3 and uses the manually segmented test corpus. The baseline ASR achieves a CER of 9.06% and an average elapsed time of 7,401 s, as shown in Table 1. For comparison, our previously developed conventional DNN-HMM based ASR system, which uses a 5-layer bidirectional long short-term memory (bi-LSTM) acoustic model trained on the same training corpus, achieves a CER of 14.72% with an external language model.

To exploit batch processing, the segments are sorted by length so that segments of similar length are processed in the same batch. For fast speech recognition, we progressively apply the batched beam search of Section 3.1 with a batch size of 21, move the CTC prefix score calculation to the CPU, and apply the end-of-speech detection of Section 3.2. As shown in the second, third, and fourth rows of Table 1, the decoding time is reduced to 543 s, 331 s, and 303 s with comparable accuracy. The reduction comes from improved parallelization, the avoidance of sequential processing, and quick detection of end-of-speech.

For a further speed-up, the restricted CTC prefix score of Section 3.2 is applied. We first restrict the start time of the CTC prefix score with M1 = 5 and then additionally restrict the end time with three values of M2. As shown in the fifth row of Table 1, the decoding time drops to 263 s with comparable accuracy when only the start time is restricted. As shown in the last three rows of Table 1, restricting the end time further reduces the decoding time to 211 s, 210 s, and 200 s, with the accuracy degrading as M2 decreases.

4.2. Recognition with automatic segmentation

In the previous subsection, manually segmented utterances were recognized with the Transformer model. However, manual segmentation is not feasible for a large test corpus in real-world applications. To overcome this issue, we evaluate the two automatic segmentation methods of Section 3.3.

First, the DNN-based VAD is applied; it splits the 83 recordings into 8,620 short pieces with an average length of 3.34 seconds and a maximum length of 12 seconds. Consecutive short pieces are merged into segments longer than a minimum length, and pieces longer than a maximum length are split again uniformly. Second, hard segmentation is applied so that the resulting segments, including the last one, are between the minimum and maximum lengths and as uniform as possible; a minimal sketch of this uniform splitting is given below. The lengths and numbers of segments produced by the two methods are given in Table 2.

Table 2. Statistics of the length and number of segments before and after splitting.

Method      Num. of segments   Avg. length (s)   Std. dev. (s)
Source      83                 347.7             464.4
Manual      1,173              22.8              7.28
DNN-VAD     1,838              15.7              2.79
Hard Seg.   1,445              20.0              1.34
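As a concrete illustration of the uniform hard segmentation, the following sketch splits an utterance into the smallest number of pieces whose lengths stay at or below a maximum length and are as equal as possible, including the last piece. The function name and frame-based interface are illustrative assumptions, not the authors' implementation.

```python
import math

def hard_segment(num_frames, max_len_frames):
    """Split an utterance of num_frames frames into nearly equal pieces,
    each no longer than max_len_frames (uniform hard segmentation).
    Returns a list of (start, end) frame indices, end exclusive."""
    num_segments = max(1, math.ceil(num_frames / max_len_frames))
    target_len = num_frames / num_segments   # fractional target keeps pieces uniform
    boundaries = [round(i * target_len) for i in range(num_segments + 1)]
    return [(boundaries[i], boundaries[i + 1]) for i in range(num_segments)]

# Example: a 100-second utterance (10,000 frames at 10 ms) with a 20-second cap
# is split into five 20-second segments.
print(hard_segment(10_000, 2_000))
# [(0, 2000), (2000, 4000), (4000, 6000), (6000, 8000), (8000, 10000)]
```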
For DNN-VAD, the minimum and maximum lengths are set to 15 and 20 seconds, respectively, and for hard segmentation to 19 and 20 seconds. These values are chosen in consideration of the batch size allowed by the GPU cards used in the experiments. The segmented utterances are fed into the end-to-end recognizer configured as in the seventh row of Table 1. Table 3 shows the recognition accuracy and speed. The difference in CER between manual segmentation and DNN-VAD is mainly due to insertion errors, since in the manual segmentation overlapped speech that was hard to transcribe was trimmed. The accuracy drop with hard segmentation is due to deletions at segment boundaries, and further work to reduce this type of error is ongoing. Splitting the recordings into shorter segments improves recognition speed by allowing a larger batch size, as shown in the table.
Table 3. Recognition accuracy in CER (%) and elapsed time (s) averaged over 3 trials. Allowed batch size is the largest batch size applicable in the experiment.

Method      CER (%)   Elapsed time (s)   Allowed batch size
Manual      9.10      210                21
DNN-VAD     10.73     168                66
Hard Seg.   12.22     184                64
5. CONCLUSION
This paper proposed fast and efficient recognition methods for using an offline Transformer-based end-to-end speech recognizer in real-world applications with limited computational resources. For fast decoding, we adopted a batched beam search for the Transformer-based ASR to accelerate GPU parallelization. The proposed CTC-based end-of-speech detection completes recognition quickly, and it is expected to be even more effective for noisy and sparsely uttered speech. Moreover, the proposed restricted CTC prefix score reduces the computational complexity by limiting the time range examined at each decoding step. For efficient decoding of long recordings, we proposed splitting them into segments using two methods: (a) DNN-VAD based segmentation and (b) hard segmentation. The DNN-VAD based segmentation achieved better recognition accuracy than the hard segmentation; on the other hand, it requires considerable computational resources, whereas the hard segmentation needs no additional computation. Segmenting long recordings enables stable recognition of speech from various applications with Transformer models under limited GPU memory, maintaining accuracy and achieving high speed by strengthening the proposed batch processing.

Speech recognition experiments were performed on a real-world speech corpus recorded at meetings with multiple participants. After segmentation, the 8 hours of speech were converted to text in less than 3 minutes by a Transformer-based end-to-end ASR system employing the proposed methods on two GPU cards. Moreover, the ASR system achieved a CER of 10.73%, which is 27.1% relatively lower than that of the conventional DNN-HMM based ASR system.
6. ACKNOWLEDGEMENT
This work was supported by an Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01376, Development of the multi-speaker conversational speech recognition technology).

7. REFERENCES

[1] A. Vaswani et al., "Attention is All you Need," in Advances in Neural Information Processing Systems (NIPS), pp. 5998-6008. Curran Associates, Inc., 2017.

[2] L. Dong, S. Xu, and B. Xu, "Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition," in Proc. ICASSP, 2018, pp. 5884-5888.

[3] N. Moritz, T. Hori, and J. Le Roux, "Streaming End-to-End Speech Recognition with Joint CTC-Attention Based Models," in Proc. ASRU, 2019, pp. 936-943.

[4] N. Moritz, T. Hori, and J. Le Roux, "Streaming Automatic Speech Recognition with the Transformer Model," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6074-6078.

[5] H. Hwang and C. Lee, "Linear-Time Korean Morphological Analysis Using an Action-based Local Monotonic Attention Mechanism," ETRI Journal, vol. 42, no. 1, pp. 101-107, 2020.

[6] H. Seki et al., "Vectorized Beam Search for CTC-Attention-Based Speech Recognition," in Proc. Interspeech 2019, 2019, pp. 3825-3829.

[7] S. Karita et al., "A Comparative Study on Transformer vs RNN in Speech Applications," in Proc. ASRU, 2019, pp. 449-456.

[8] S. Watanabe et al., "ESPnet: End-to-End Speech Processing Toolkit," in Proceedings of Interspeech, 2018, pp. 2207-2211.

[9] M. Ott et al., "Scaling Neural Machine Translation," in Proceedings of the Third Conference on Machine Translation: Research Papers, 2018, pp. 1-9.

[10] J. Bang et al., "Automatic Construction of a Large-Scale Speech Recognition Database Using Multi-Genre Broadcast Data with Inaccurate Subtitle Timestamps," IEICE Transactions on Information and Systems, vol. E103.D, pp. 406-415, Feb. 2020.

[11] Y. R. Oh, K. Park, and J. G. Park, "Online Speech Recognition Using Multichannel Parallel Acoustic Score Computation and Deep Neural Network (DNN)-Based Voice-Activity Detector," Applied Sciences, vol. 10, no. 12, p. 4091, 2020.

[12] H. Seki, T. Hori, and S. Watanabe, "Vectorization of Hypotheses and Speech for Faster Beam Search in Encoder Decoder-Based Speech Recognition," CoRR, vol. abs/1811.04568, 2018.

[13] D. Amodei et al., "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin," in ICML, 2016.

[14] H. Braun et al., "GPU-Accelerated Viterbi Exact Lattice Decoder for Batched Online and Offline Speech Recognition," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7874-7878.

[15] T. Hori, S. Watanabe, and J. R. Hershey, "Joint CTC/Attention Decoding for End-to-End Speech Recognition," in ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers), 2017, vol. 1, pp. 518-529.

[16] P. Zhou et al., "Improving Generalization of Transformer for Speech Recognition with Parallel Schedule Sampling and Relative Positional Embedding," CoRR, vol. abs/1911.00203, 2019.

[17] N. Kitaev, L. Kaiser, and A. Levskaya, "Reformer: The Efficient Transformer," in International Conference on Learning Representations (ICLR), 2020.

[18] T. Yoshimura et al., "End-to-End Automatic Speech Recognition Integrated with CTC-Based Voice Activity Detection," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6999-7003.

[19] K. Park, "A robust endpoint detection algorithm for the speech recognition in noisy environments," in