BCN2BRNO: ASR System Fusion for Albayzin 2020 Speech to Text Challenge
Martin Kocour, Guillermo Cámbara, Jordi Luque, David Bonet, Mireia Farrús, Martin Karafiát, Karel Veselý, Jan "Honza" Černocký
Brno University of Technology, Speech@FIT, IT4I CoE; Telefónica Research; Universitat Pompeu Fabra; Universitat de Barcelona
[email protected]
Abstract
This paper describes the joint effort of BUT and Telefónica Research on the development of Automatic Speech Recognition systems for the Albayzin 2020 Challenge. We compare approaches based on either hybrid or end-to-end models. In hybrid modelling, we explore the impact of a SpecAugment [1, 2] layer on performance. For end-to-end modelling, we use a convolutional neural network with gated linear units (GLUs). The performance of such a model is also evaluated with an additional n-gram language model to improve word error rates. We further inspect source separation methods to extract speech from noisy environments (i.e. TV shows). More precisely, we assess the effect of using a neural-based music separator named Demucs [3]. A fusion of our best systems achieved 23.33 % WER in the official Albayzin 2020 evaluations. Aside from the techniques used in our final submitted systems, we also describe our efforts in retrieving high-quality transcripts for training.
Index Terms: fusion, end-to-end model, hybrid model, semi-supervised, automatic speech recognition, convolutional neural network.
1. Introduction
The Albayzin 2020 challenge is a continuation of the Albayzin 2018 challenges [4], with evaluations for the following tasks: Speech to Text, Speaker Diarization and Identity Assignment, Multimodal Diarization and Scene Description, and Search on Speech. The target domain of the series is broadcast TV and radio content, with shows in a notable variety of Spanish accents.

This paper describes the BCN2BRNO team's Automatic Speech Recognition (ASR) system for the IberSPEECH-RTVE 2020 Speech to Text Transcription Challenge, a joint collaboration between the Speech@FIT research group, Telefónica Research (TID) and Universitat Pompeu Fabra (UPF). Our goal is to develop two distinct ASR systems, one based on a hybrid model [5] and the other on an end-to-end approach [6], and to let them complement each other through a joint fusion.

We submitted one primary system and one contrastive system. The primary system, Fusion B, is a word-level ROVER fusion of hybrid ASR models and end-to-end models. It achieved 23.33 % WER on the official evaluation dataset. However, the same result was obtained by the contrastive system, Fusion A, a fusion which comprises only hybrid ASR models. In this paper we describe both ASR systems, plus a post-evaluation analysis and experiments that lead to a better performance of the primary fusion. We also discuss the effect of speech enhancement techniques such as background music removal and speech denoising.
2. Data
The Albayzin 2020 challenge comes with two databases:
RTVE2018 and RTVE2020. The RTVE2018 database is the main source of training and development data, while the RTVE2020 database is used for the final evaluation of the submitted systems.

The RTVE2018 database [7] comprises 15 different TV programs broadcast between 2015 and 2018 by the Spanish public television Radiotelevisión Española (RTVE). The programs contain a great variety of speech scenarios, from read speech to spontaneous speech, live broadcasts, political debates, etc. They also cover different Spanish accents, including Latin-American ones. The database is partitioned into 4 subsets: train, dev1, dev2 and test. The train set is provided with subtitles only, while the dev1, dev2 and test sets are human-revised. Both the hybrid and the end-to-end models use the dev1 and train sets for training, while the dev2 and test sets serve as validation data.

The RTVE2020 database [8] consists of TV shows of different genres broadcast by RTVE from 2018 to 2019. It has been manually annotated in full.

In addition, three Linguistic Data Consortium (LDC) corpora were used for training the language model of the hybrid ASR system: Fisher Spanish Speech, CALLHOME Spanish Speech and Spanish Gigaword Third Edition. The Fisher Spanish Speech corpus [9] comprises spontaneous telephone conversations from native Caribbean and non-Caribbean Spanish speakers with full orthographic transcripts. The CALLHOME Spanish Speech corpus [10] consists of 120 telephone conversations between native Spanish speakers, each lasting less than 30 minutes. Spanish Gigaword Third Edition [11] is an extensive database of Spanish newswire text acquired by the LDC. It includes reports, news, news briefs, etc., collected from 1994 through December 2010. We also downloaded text data from the Spanish Wikipedia.

The end-to-end model is trained on Fisher Spanish Speech, Mozilla's Common Voice Spanish corpus and Telefónica's in-house Call Center data (23 hours). Mozilla's Common Voice Spanish [12] is an open-source dataset that consists of recordings of volunteer contributors pronouncing scripted sentences, recorded at 48 kHz. The sentences come from original contributor donations and public-domain movie scripts. The version of the Common Voice corpus used for this work is 5.1, which has 521 hours of recorded speech. However, we kept only speech validated by the contributors, an amount of 290 hours.

2.1. Transcript retrieval

The training data from the RTVE2018 database includes many hours of subtitled speech. However, the captions contain several errors. In most cases the captions are shifted by a few seconds, so a segment with a correct transcript corresponds to a different portion of audio. This phenomenon also occurs in the human-revised development and test sets. Another problem with subtitled speech is "partly-said" captions, i.e., misspelled and unspoken words in the transcription.

Since the training procedure of the hybrid ASR is quite error-prone in case of misaligned labels, we decided to apply the transcript retrieval technique developed by Manohar et al. [13]: the closed captions related to the same audio, i.e., a whole TV show, are first concatenated according to the original timeline. This creates a small text corpus containing a few hundred words. The text corpus is used for training a biased N-gram language model (LM) with N = 7, so the model is biased only towards the currently processed captions. During decoding, the weight of the acoustic model (AM) is significantly smaller than the weight of the LM, because we expect the captions to occur in the hypotheses. Then, the "winning" path is retrieved from the hypothesis lattice as the path with minimum edit cost w.r.t. the original transcript. Finally, the retrieved transcripts are segmented using the CTMs obtained from the oracle alignment (previous step). More details can be found in [14, 13].

Table 1: Amount of training and development data recovered by transcript cleaning.
                 Train   Dev1   Dev2   Test
Original [h]       468     60      -      -
2-pass [h]           -      -      -      -
Recovered [%]       50     91     94     92
The transcript retrieval technique is applied twice. First, we train an initial ASR system on out-of-domain data, e.g., Fisher and CALLHOME. This system is used in the first pass of transcript retrieval. Then, a new system is trained from scratch on the already cleaned data and the whole transcript retrieval process is repeated. Table 1 shows how this 2-pass cleaning recovers almost all of the manually annotated development data and about half of the subtitled training data.
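As an illustration of the selection step only (not the actual Kaldi lattice-based recipe), one can think of it as choosing, from an n-best list produced by the caption-biased decoding, the hypothesis with the minimum word-level edit cost with respect to the original captions. The helper below is a simplified sketch; the real procedure operates on lattices and CTMs.

# Illustrative sketch of the transcript-retrieval selection step.
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def retrieve_transcript(captions, nbest_hypotheses):
    """captions: list of caption strings of one show (original timeline).
    nbest_hypotheses: list of decoded word sequences from the biased-LM pass.
    Returns the hypothesis with minimum edit cost w.r.t. the captions."""
    reference = " ".join(captions).split()
    return min(nbest_hypotheses,
               key=lambda hyp: edit_distance(reference, hyp))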
Figure 1: Amount of cleaned audio per TV show, in hours (original vs. 2-pass cleaned).
Figure 1 depicts how many hours have been recovered in the individual TV programs. It also shows how the data is distributed in the database. Most speech comes from the La Mañana (LM) TV program. We discarded most of the data in this TV program after 2-pass data cleaning, because this particular TV show was quite challenging for our ASR model.
3. Hybrid speech recognition
In all our experiments, the acoustic model was based on a hybrid Deep Neural Network – Hidden Markov Model architecture trained in Kaldi [15]. The NN part of the model contains 6 convolutional layers followed by 19 TDNN layers with semi-orthogonal factorization [5] (CNN-TDNNf). The input consists of 40-dim MFCCs concatenated with speaker-dependent 100-dim i-vectors. The whole model is trained using the LF-MMI objective function with bi-phone acoustic units as the targets.

In order to make the NN model training more robust, we introduced a feature dropout layer into the architecture. This prevents the model from overfitting on the training data; in fact, it turns the overfitting problem into an underfitting one and thus slows down convergence during training. We therefore increased the number of epochs from 6 to 8 to balance the underfitting in our system. This technique is also known as spectral augmentation. It was first suggested for multi-stream hybrid NN models in [1] and fully examined in [2].
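As an illustration of the idea (not the exact Kaldi layer used in our recipe), spectral augmentation can be sketched as random masking of frequency bands and time spans in the input feature matrix; the mask sizes below are illustrative defaults, not the values used in our systems.

import numpy as np

def spec_augment(features, num_freq_masks=2, max_freq_width=8,
                 num_time_masks=2, max_time_width=20, rng=None):
    """Randomly zero out frequency bands and time spans of a feature matrix.

    features: numpy array of shape (num_frames, num_bins), e.g. 40-dim MFCCs.
    """
    rng = rng or np.random.default_rng()
    out = features.copy()
    num_frames, num_bins = out.shape

    for _ in range(num_freq_masks):                   # mask random frequency bands
        width = rng.integers(0, max_freq_width + 1)
        start = rng.integers(0, max(1, num_bins - width))
        out[:, start:start + width] = 0.0

    for _ in range(num_time_masks):                   # mask random time spans
        width = rng.integers(0, max_time_width + 1)
        start = rng.integers(0, max(1, num_frames - width))
        out[start:start + width, :] = 0.0

    return out

During training, such masking is applied on the fly to each utterance, acting as a regularizer on the inputs, similar in spirit to dropout.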
We trained three different N-gram language models: Alb, Wiki and Giga. The names indicate which text corpus was used during training. The Albayzin LM was trained on the dev1 and train sets from RTVE2018. This text mixture is rather small, so an N-gram LM trained only on it does not generalize well. We therefore also included larger text corpora: Wikipedia and Spanish Gigaword. These databases were further processed to remove unrelated text such as advertisements, emoji, URLs, etc., which left several million clean sentences from Wikipedia and from Spanish Gigaword. We experimented with combinations of interpolation: Alb, Alb+Wiki, Alb+Giga and Alb+Wiki+Giga (a sketch of dev-set-driven tuning of interpolation weights is given at the end of this section).

Our vocabulary consists of words from the RTVE2018 database and from the Santiago lexicon. The pronunciation of Spanish words was extracted using the public TTS system eSpeak [16]. The vocabulary was then extended with auxiliary labels for noise, music and overlapped speech to form the final lexicon.

Voice activity detection (VAD) was applied to the evaluation data in order to segment the audio into smaller chunks. The VAD is based on a feed-forward neural network with two outputs. It expects 15-dimensional filterbank features with 3 additional Kaldi pitch features [17] as the input. The features are normalized with cepstral mean normalization. More details can be found in [18].
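As a rough illustration of how interpolation weights between two such LMs can be chosen, the sketch below grid-searches the mixture weight that minimizes perplexity on a held-out development text, using the kenlm Python bindings. The file names are hypothetical, and the interpolation for the submitted systems may well have been done differently (e.g., at the count level inside an LM toolkit); only the idea of dev-set-driven weight selection is shown.

import math
import kenlm

alb = kenlm.Model("alb.arpa")     # hypothetical paths to the component LMs
giga = kenlm.Model("giga.arpa")

def dev_perplexity(weight, sentences):
    """Perplexity of a linear mixture p = w * p_alb + (1 - w) * p_giga."""
    log_prob, num_words = 0.0, 0
    for sentence in sentences:
        # full_scores() yields one (log10 prob, ngram order, is_oov) per token
        p_alb = [s[0] for s in alb.full_scores(sentence)]
        p_giga = [s[0] for s in giga.full_scores(sentence)]
        for la, lg in zip(p_alb, p_giga):
            mixed = weight * 10.0 ** la + (1.0 - weight) * 10.0 ** lg
            log_prob += math.log10(max(mixed, 1e-30))
            num_words += 1
    return 10.0 ** (-log_prob / num_words)

dev_text = [line.strip() for line in open("dev2.txt")]   # hypothetical dev text
best = min((w / 10.0 for w in range(1, 10)),
           key=lambda w: dev_perplexity(w, dev_text))
print("best interpolation weight for Alb:", best)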
4. End-to-end speech recognition
The end-to-end acoustic model is based on a convolutional architecture proposed by Collobert et al. [6] that uses gated linear units (GLUs). Using GLUs in convolutional approaches helps avoid vanishing gradients by providing linear paths for them, while keeping high performance. Concretely, we used the model from wav2letter's Wall Street Journal (WSJ) recipe. This model has approximately 17M parameters, with dropout applied after each of its 17 layers. The WSJ dataset contains around 80 hours of audio recordings, which is smaller than the amount of our own training data (~600 hours).
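A minimal sketch of a GLU-gated 1-D convolution block of the kind used in such architectures follows; the dimensions, kernel sizes and layer count are illustrative and do not reproduce the exact wav2letter WSJ configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUConvBlock(nn.Module):
    """One 1-D convolutional block with a gated linear unit.

    The convolution outputs 2 * out_channels; F.glu() splits them into a
    linear part and a gate, giving out = a * sigmoid(b), which provides a
    linear path for gradients.
    """
    def __init__(self, in_channels, out_channels, kernel_size, dropout=0.2):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, 2 * out_channels, kernel_size,
                              padding=kernel_size // 2)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):             # x: (batch, channels, time)
        return self.dropout(F.glu(self.conv(x), dim=1))

# Example: stack a few blocks over 40-dim features (sizes are illustrative).
features = torch.randn(8, 40, 300)           # (batch, feature_dim, frames)
model = nn.Sequential(GLUConvBlock(40, 200, 13),
                      GLUConvBlock(200, 200, 13),
                      nn.Conv1d(200, 37, 1))  # project to 37 output tokens
logits = model(features)                      # (batch, tokens, frames)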
Regarding the lexicon, we extract it from the train and validation transcripts, plus the Sala lexicon [20]. The resulting lexicon is grapheme-based and contains 271k words. We use the standard Spanish alphabet as tokens, plus the "ç" letter from Catalan and the vowels with diacritical marks, making a total of 37 tokens.

The LM is a 5-gram model trained with KenLM [21] using only transcripts from the training sets: RTVE2018 train and dev1, plus Common Voice, Fisher and Call Center. The resulting LM is referred to in this paper as Alb+Others.
Fine-tuning of the decoder hyperparameters is done via grid search on the RTVE2018 dev2 set. The best results are achieved with an LM weight of 2.25, a word score of 2.25 and a silence score of -0.35. The same configuration is then applied to the evaluation datasets from RTVE2018 and RTVE2020.
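Such tuning can be sketched as an exhaustive search over a small grid; decode_wer below is a hypothetical helper standing in for a full decoding run with the given settings (it is not a wav2letter API), and the grid values are illustrative.

import itertools

def tune_decoder(decode_wer, dev_set):
    """Grid-search decoder hyperparameters on a development set.

    decode_wer(dev_set, lm_weight, word_score, silence_score) is assumed to
    run decoding with the given settings and return the resulting WER.
    """
    lm_weights = [1.75, 2.0, 2.25, 2.5]
    word_scores = [1.75, 2.0, 2.25, 2.5]
    silence_scores = [-0.5, -0.35, -0.2]

    best = None
    for lm_w, word_s, sil_s in itertools.product(lm_weights, word_scores,
                                                 silence_scores):
        wer = decode_wer(dev_set, lm_w, word_s, sil_s)
        if best is None or wer < best[0]:
            best = (wer, lm_w, word_s, sil_s)
    return best  # (best WER, lm weight, word score, silence score)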
5. Experiments
Data cleaning by means of 2-pass transcript retrieval improves the performance of our models the most. Table 2 shows the effect of each pass. The 2nd pass improved the accuracy by almost 2 % absolute in terms of WER. We also ran a 3rd pass, but that did not help anymore: it recovered only a small amount of additional data from the original transcripts. We could not train the models with the original subtitles, since these contained wrong timestamps.

It is very common to find background music in TV programs, which can confuse our recognizer if it has a prominent presence. Table 2:
Effect of 2-pass transcript cleaning evaluated on the RTVE2018 test set.

AM           LM    Training data   Test WER [%]
CNN-TDNNf    Alb   1-pass              17.2
                   2-pass              15.5
                   3-pass              15.5

This gave us the idea of processing the audio through a music source separator called Demucs [3]. It separates the original audio into voice, bass, drums and other components. By keeping only the voice component, we managed to significantly reduce the background music while maintaining relatively good quality of the original voice.

We enhanced both validation sets in order to assess possible WER reductions. As seen in Table 4, this approach yielded a small increase in WER. We also tried applying a specialized denoiser [19] after background music removal, but the WER on dev2 increased by an absolute 1.6 % compared to the original system without enhancement. Since neither of these two approaches (Demucs and Demucs+Denoiser) provided WER improvements at first, we did not apply them to the end-to-end model used in the fusion. However, the end-to-end, end-to-end + Demucs and end-to-end + Demucs + Denoiser models were submitted as separate systems by the UPF-TID team; see Table 5 for details.

Our hypothesis is that not all samples contain background music. Speech enhancement of already clean samples is detrimental because it causes a slight degradation of the signal. Hence, we evaluated the effect of applying music source separation only to samples within certain SNR ranges, measured with the WADA-SNR algorithm [22]. As shown in Table 3, the application of music separation on the RTVE dataset is optimal for SNR ranges between -5 and 5 or 8. Looking at Figure 2, the best improvements are found for TV shows with higher WER (thus harder, noisier speech), e.g., AV, where most of the time the speakers are in a car, or LM and DH, where music and speech often overlap. Other shows show slighter benefits, since they already contain good-quality audio. The exception is the AFI show, which is reported to have poor-quality audio, so further audio degradation from Demucs might cause worse performance.
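A rough sketch of this selective enhancement is given below; estimate_wada_snr and demucs_extract_vocals are hypothetical helpers standing in for the WADA-SNR estimator and for running Demucs and keeping only the vocals stem.

def selectively_enhance(wav_paths, estimate_wada_snr, demucs_extract_vocals,
                        snr_low=-5.0, snr_high=8.0):
    """Return a mapping from input path to the audio that should be decoded.

    estimate_wada_snr(path) -> SNR estimate in dB (hypothetical WADA-SNR helper).
    demucs_extract_vocals(path) -> path of a vocals-only copy of the audio
    (hypothetical wrapper around the Demucs source separator).
    """
    decode_inputs = {}
    for path in wav_paths:
        snr = estimate_wada_snr(path)
        if snr_low <= snr <= snr_high:
            decode_inputs[path] = demucs_extract_vocals(path)  # enhanced copy
        else:
            decode_inputs[path] = path                          # leave untouched
    return decode_inputs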
Figure 2: Variation of the mean WER per TV show between Demucs-cleaned and original samples on the RTVE2018 test set. Negative values represent Demucs improvements. Note that only samples with SNR between -5 and 8 are enhanced.

Table 3: WER impact of cleaning speech signals within certain SNR ranges, using a music source separator. The end-to-end ConvNet GLU model is used without LM; the percentage of cleaned samples is also reported.
SNR range       Cleaned samples [%]      Test WER [%]
                  2018       2020        2018      2020
(-∞, ∞)            100        100       37.50     53.53
(-∞, -5)         25.84      31.33       -0.05     -0.88
(-5, 5)           5.14      11.88       -0.07     -1.03
(-5, 8)          14.95      22.11       -0.08     -0.97

Rows with restricted SNR ranges report the WER change relative to the unenhanced baseline; negative values are improvements.

Table 4 compares models with and without spectral augmentation. The technique helps quite significantly: all models with the feature dropout layer outperformed their counterparts, with a fairly constant improvement of roughly 0.4 % absolute WER on the RTVE2018 test set and 0.6-0.7 % absolute on the RTVE2018 dev2 set.

We also fuse the outputs of our best systems to further improve the performance. The overall results of the systems considered for the fusion are shown in Table 4. Since the models with spectral augmentation performed significantly better, we decided to fuse only those systems. We analyzed two different approaches: a pure hybrid-model fusion (Fusion A) and a fusion of hybrid and end-to-end models (Fusion B).

Considering that the end-to-end model does not provide word-level timestamps, we had to force-align its transcripts with the hybrid ASR system in order to obtain CTM output. The word-level fusion was done using the ROVER toolkit [23]. Fusion B, which includes the end-to-end model, performed slightly better than its counterpart Fusion A, despite the fact that the end-to-end model alone achieved worse results. This supports the idea that a fusion can benefit from different modelling approaches.
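ROVER aligns the per-system CTM hypotheses into a word transition network and votes over the aligned slots. A heavily simplified illustration of the voting idea (not the actual SCTK rover tool, and omitting the alignment step) could look like this:

from collections import Counter

def vote(aligned_hypotheses):
    """Pick the most frequent word at each aligned position.

    aligned_hypotheses: list of equal-length word lists, where a None entry
    marks a deletion produced by the (omitted) alignment step.
    """
    output = []
    for position in zip(*aligned_hypotheses):
        word, _ = Counter(position).most_common(1)[0]
        if word is not None:           # a majority "no word" drops the slot
            output.append(word)
    return output

# Toy example with three already-aligned system outputs.
hyps = [["buenos", "dias", "a", "todos"],
        ["buenos", "dias", None, "todos"],
        ["buenas", "dias", "a", "todos"]]
print(" ".join(vote(hyps)))   # -> "buenos dias a todos"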
6. Final systems
Table 5 shows the results on the RTVE2020 test set. For the end-to-end ConvNet GLU model, the performance drops by around 15 % absolute WER compared with the previous results on the development sets. Since the TV shows in those sets are also present in the training data, our hypothesis is that the model slightly overfits to them. Therefore, when facing the different acoustic conditions, voices, background noises and music present in the RTVE2020 test set, the WER noticeably increases. Enhancing the test samples with Demucs or with Demucs+Denoiser yields a worse WER score, probably due to an inherent degradation of the signal. A deeper analysis of more effective ways to apply such enhancements is given in Section 5.

Also note that the submitted systems had a leak of dev2 stm transcripts into the LM, causing hyperparameter overfitting during LM tuning. This degraded the WER of all end-to-end systems, yielding WERs of 41.4 %, 42.3 % and 58.6 %. Table 5 also displays the results of the same systems with the leakage and the LM tuning corrected in the post-evaluation analysis.
Table 4: Overall results on the RTVE2018 dataset with various language models and fusions.

 #   AM            LM               Dev2 WER [%]   Test WER [%]
 1   CNN-TDNNf     Alb                   14.1          15.5
 2                 Alb+Wiki              13.6          14.9
 3                 Alb+Giga              13.6          15.1
 4                 Alb+Wiki+Giga         13.5          15.0
 5   + SpecAug     Alb                   13.4          15.0
 6                 Alb+Wiki              12.9          14.5
 7                 Alb+Giga              13.0          14.7
 8                 Alb+Wiki+Giga         12.9          14.6
 9   ConvNet GLU   None                  36.1          37.5
10                 Alb+Others            20.8          20.7
11   + Demucs      None                  36.4          37.5
12                 Alb+Others            21.1          20.8
13   Fusion A (rows 5-8)
14   Fusion B (rows 5-8 and 10)
Table 5: Official and post-evaluation final results on the RTVE2020 eval set for the submitted systems.

Model            Official WER [%]   Post-eval WER [%]
CNN-TDNNf               -                24.3
 + SpecAug              -                23.5
ConvNet GLU          41.4

The end-to-end systems (ConvNet GLU, + Demucs and + Demucs + Denoiser) were submitted separately as the primary, first contrastive and second contrastive systems of the UPF-TID team.
7. Conclusions
In this paper we described two different ASR model architectures and their fusion. We focused on improving the original subtitled data in order to train our models on high-quality target labels. We also improved the N-gram language model by incorporating publicly available text data from Wikipedia and the Spanish Gigaword corpus from the LDC. We further successfully incorporated spectral augmentation into our AM architecture. Our best system achieved 23.24 % WER on the RTVE2020 test set.

The performance of our hybrid system can be further improved by using lattice fusion with Minimum Bayes Risk decoding [24]. Another space for improvement is offered by adding RNN-LM lattice rescoring. Our end-to-end model shows relatively competitive performance on the RTVE2018 test set in comparison with its hybrid counterpart. However, its performance on RTVE2020 exposes that the model was not able to generalize very well, since this database turned out to contain slightly different acoustic conditions. Despite this fact, the model still managed to improve the results of the final fusion with the hybrid systems. An exploration of background music removal shows that it yields the best results for lower SNR ranges, thus having a different impact depending on the acoustic conditions of each TV show.

8. References

[1] S. H. R. Mallidi and H. Hermansky, "A Framework for Practical Multistream ASR," in Interspeech 2016, 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, September 8-12, 2016, N. Morgan, Ed. ISCA, 2016, pp. 3474–3478.
[2] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition," in Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019. ISCA, 2019, pp. 2613–2617.
[3] A. Défossez, N. Usunier, L. Bottou, and F. Bach, "Music source separation in the waveform domain," arXiv preprint arXiv:1911.13254, 2019.
[4] E. Lleida, A. Ortega, A. Miguel, V. Bazán-Gil, C. Pérez, M. Gómez, and A. de Prada, "Albayzin 2018 evaluation: the IberSpeech-RTVE challenge on speech technologies for Spanish broadcast media," Applied Sciences, vol. 9, no. 24, p. 5412, 2019.
[5] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, "Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks," in Proceedings of Interspeech, Sep. 2018, pp. 3743–3747.
[6] R. Collobert, C. Puhrsch, and G. Synnaeve, "Wav2Letter: an end-to-end ConvNet-based speech recognition system," CoRR, vol. abs/1609.03193, 2016. [Online]. Available: http://arxiv.org/abs/1609.03193
[7] E. Lleida, A. Ortega, A. Miguel, V. Bazán, C. Pérez, M. Zotano, and A. de Prada, "RTVE2018 Database Description," 2018. [Online]. Available: http://catedrartve.unizar.es/reto2018/RTVE2018DB.pdf
[8] E. Lleida, A. Ortega, A. Miguel, V. Bazán-Gil, C. Pérez, M. Gómez, and A. de Prada, "RTVE2020 Database Description," 2020. [Online]. Available: http://catedrartve.unizar.es/reto2020/RTVE2020DB.pdf
[9] D. Graff, S. Huang, I. Cartagena, K. Walker, and C. Cieri, "Fisher Spanish Speech," LDC2010S01. DVD. Philadelphia: Linguistic Data Consortium, 2010.
[10] A. Canavan and G. Zipperlen, "CALLHOME Spanish Speech," LDC96S35. Web Download. Philadelphia: Linguistic Data Consortium, 1996.
[11] Â. Mendonça, D. Jaquette, D. Graff, and D. DiPersio, "Spanish Gigaword Third Edition," LDC2011T12. Web Download. Philadelphia: Linguistic Data Consortium, 2011.
[12] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," 2019.
[13] V. Manohar, D. Povey, and S. Khudanpur, "JHU Kaldi system for Arabic MGB-3 ASR challenge using diarization, audio-transcript alignment and transfer learning," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 346–352.
[14] M. Kocour, "Automatic Speech Recognition System Continually Improving Based on Subtitled Speech Data," Diploma thesis, Brno University of Technology, Faculty of Information Technology, Brno, 2019. Technical supervisor Dr. Ing. Jordi Luque Serrano, supervisor Doc. Dr. Ing. Jan Černocký.
[15] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi Speech Recognition Toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011, IEEE Catalog No.: CFP11SRW-USB.
[16] J. Duddington and R. Dunn, "eSpeak text to speech," 2012. [Online]. Available: http://espeak.sourceforge.net
[17] P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, "A pitch extraction algorithm tuned for automatic speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. Florence, Italy: IEEE, May 2014.
[18] O. Plchot, P. Matějka, O. Novotný, S. Cumani, A. D. Lozano, J. Slavíček, M. S. Diez, F. Grézl, O. Glembek, M. V. Kamsali, A. Silnova, L. Burget, L. Ondel, S. Kesiraju, and A. J. Rohdin, "Analysis of BUT-PT submission for NIST LRE 2017," in Proceedings of Odyssey 2018: The Speaker and Language Recognition Workshop, 2018, pp. 47–53.
[19] A. Défossez, G. Synnaeve, and Y. Adi, "Real time speech enhancement in the waveform domain," 2020.
[20] A. Moreno, O. Gedge, H. Heuvel, H. Höge, S. Horbach, P. Martin, E. Pinto, A. Rincón, F. Senia, and R. Sukkar, "SpeechDat across all America: SALA II," 2002.
[21] K. Heafield, "KenLM: Faster and smaller language model queries," in Proceedings of the Sixth Workshop on Statistical Machine Translation, ser. WMT '11. USA: Association for Computational Linguistics, 2011, pp. 187–197.
[22] C. Kim and R. M. Stern, "Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis," in Ninth Annual Conference of the International Speech Communication Association, 2008.
[23] J. G. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER)," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, 1997, pp. 347–354.
[24] P. Swietojanski, A. Ghoshal, and S. Renals, "Revisiting hybrid and GMM-HMM system combination techniques," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.