CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition

Ludwig Kürzinger*, Dominik Winkelbauer*, Lujun Li, Tobias Watzel, and Gerhard Rigoll

Institute for Human-Machine Communication, Technische Universität München, Germany
{ludwig.kuerzinger,dominik.winkelbauer}@tum.de

* These authors contributed equally to this work. This is a preprint article; the full paper [8] can be found at https://doi.org/10.1007/978-3-030-60276-5_27
Abstract.
Recent end-to-end Automatic Speech Recognition (ASR) systems have demonstrated the ability to outperform conventional hybrid DNN/HMM ASR. Aside from architectural improvements, those models grew in terms of depth, parameters and model capacity. However, these models also require more training data to achieve comparable performance.

In this work, we combine freely available corpora for German speech recognition, including yet-unlabeled speech data, into a big dataset of over 1700 h of speech data. For data preparation, we propose a two-stage approach that uses an ASR model pre-trained with Connectionist Temporal Classification (CTC) to bootstrap more training data from unsegmented or unlabeled training data. Utterances are then extracted from label probabilities obtained from the network trained with CTC to determine segment alignments. With this training data, we trained a hybrid CTC/attention Transformer model that achieves 12.8% WER on the Tuda-DE test set, surpassing the previous baseline of 14.4% of conventional hybrid DNN/HMM ASR.

Index Terms: German speech dataset, end-to-end automatic speech recognition, hybrid CTC/attention, CTC-segmentation
1 Introduction

Conventional speech recognition systems combine Deep Neural Networks (DNN) with Hidden Markov Models (HMM). The DNN serves as an acoustic model that infers classes, or their posterior probabilities respectively, originating from hand-crafted HMMs and complex linguistic models. Hybrid DNN/HMM models also require multiple processing steps during training to refine frame-wise acoustic model labels. In comparison to hybrid DNN/HMM systems, end-to-end ASR simplifies training and decoding by directly inferring sequences of letters, or tokens, given a speech signal. For training, end-to-end systems only require the raw text corresponding to an utterance. Connectionist Temporal Classification (CTC) is a popular loss function to train end-to-end ASR architectures [5]. In principle, its concept is similar to an HMM: the label sequence is modeled as a sequence of states, and during training, a slightly modified forward-backward algorithm is used to calculate the CTC loss. Another popular approach for end-to-end ASR is to directly infer letter sequences, as employed in attention-based encoder-decoder architectures [3]. Hybrid CTC/attention ASR architectures combine these two approaches [19].

End-to-end models also require more training data to learn acoustic representations. Many large corpora, such as Librispeech or TEDlium, are provided as large audio files partitioned into segments that contain speech with transcriptions. Although end-to-end systems do not need a frame-wise temporal alignment or segmentation, an utterance-wise alignment between audio and text is necessary. To reduce training complexity, previous works used frameworks like Sphinx [9] or MAUS [16] to partition speech data into sentence-length segments, each containing one utterance. Those frameworks determine the start and the end of a sentence from acoustic models (often HMMs) and the Viterbi algorithm. However, there are three disadvantages in using these for end-to-end ASR: (1) As only words in the lexicon can be detected, the segmentation tool needs a strategy for out-of-vocabulary words. (2) Scaling the Viterbi algorithm to generate alignments within larger audio files requires additional mitigations. (3) As these algorithms provide forced alignments, they assume that the audio contains only the text to be aligned; for most public domain audio, this is not the case. For example, all audio files from the Librivox dataset contain an additional prologue and epilogue in which the speaker lists their name, the book title and the license. The speaker may also skip some sentences or add new ones due to differing text versions. Therefore, aligning segments of large datasets, such as TEDlium [15], is done in multiple iterations that often include manual examination. Unfortunately, this process is tedious and error-prone; for example, by inspection of the SWC corpus, some of the automatically generated transcriptions are missing words.

We aim for a method to extract labeled utterances in the form of correctly aligned segments from large audio files. To achieve this, we propose CTC-segmentation, an algorithm to correctly align the start and end of utterance segments, supported by a CTC-based end-to-end ASR network. Furthermore, we demonstrate additional data cleanup steps for German language orthography. Our contributions are:
– We propose CTC-segmentation, a scalable method to extract utterance segments from speech corpora. In comparison to other automated segmentation tools, alignments generated with CTC-segmentation were observed to correspond more closely to manually segmented utterances. The source code underlying this work is available at https://github.com/cornerfarmer/ctc_segmentation
– We extended and refined the existing recipe of the ASR toolkit kaldi for a collection of open source German corpora by two additional corpora, namely
Librivox and
CommonVoice, and ported it to the end-to-end ASR toolkit ESPnet.
2 Related Work

Milde et al. [11] proposed to combine freely available German language speech corpora into an open source German speech recognition system. A more detailed description of the German datasets can be found in [11], of which we give a short summary:

– The Tuda-DE dataset [14] combines recordings of multiple sentences concerning various topics, spoken by 180 speakers and recorded with five microphones.
– The Spoken Wikipedia Corpus (SWC, [2]) is an open source collection of recordings of different Wikipedia articles made by volunteers. The transcription already includes alignment notations between audio and text, but as these alignments were often incorrect, Milde et al. re-aligned utterance segments using the Sphinx speech recognizer [9].
– The M-AILABS Speech Dataset [17] mostly consists of utterances extracted from political speeches and audio books from Librivox. Audio and text were aligned by using synthetically generated audio (TTS) based on the text and by manually removing intros and outros.

In this work, we additionally combine the following German speech corpora:

– The CommonVoice dataset [1] consists of utterances recorded and verified by volunteers; an utterance-wise alignment therefore already exists.
– Librivox [10] is a platform for volunteers to publish their recordings of readings of public domain books. All recordings are published under a Creative Commons license. We use the audio recordings of German books; the corresponding texts are retrieved from Project Gutenberg-DE [6], which hosts a database of books in the public domain.

Milde et al. [11] mainly used a conventional DNN/HMM model, as provided by the kaldi toolkit [13]. Denisov et al. [4] used a similar collection of German language corpora that additionally includes non-free pre-labeled speech corpora. Their ASR tool IMS Speech is based on a hybrid CTC/attention ASR architecture using a BLSTM model with location-aware attention, as proposed by Watanabe et al. [19]. The architecture used in our work is also based on the hybrid CTC/attention ASR of the ESPnet toolkit [20], however in combination with the Transformer architecture [18] that uses self-attention. As we only give a short description of its architecture, an in-detail description of the Transformer model is given by Karita et al. [7].
There are several tools to extract labeled utterance segments from speech corpora. The Munich Automatic Segmentation (MAUS) system [16] first transforms the given transcript into a graph representing different sequences of phones by applying predefined rules. Afterwards, the actual alignment is estimated by finding the most probable path using a set of HMMs and pre-trained acoustic models. Gentle works in a similar way, but while MAUS uses HTK [21], Gentle is built on top of kaldi [13]. Both methods yield phone-wise alignments. Aeneas [12] uses a different approach: it first converts the given transcript into audio using text-to-speech (TTS) and then applies the Dynamic Time Warping (DTW) algorithm to align the synthetic and the actual audio by warping the time axis. In this way, it is possible to estimate the beginning and end of a given utterance within the audio file; a toy sketch of this idea follows at the end of this section.

We propose to use a CTC-based network for segmentation. CTC was originally proposed as a loss function to train RNNs on unsegmented data. At the same time, using CTC as a segmentation algorithm was also proposed by Graves et al. [5]. However, to the best knowledge of the authors, while the CTC algorithm is widely used for end-to-end speech recognition, there is not yet a segmentation tool for speech audio based on CTC.
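To make the contrast with the CTC-based approach of the next section concrete, the core of such a DTW-based alignment can be sketched in a few lines. This is a toy version written for illustration; Aeneas' actual implementation differs.

```python
# Toy DTW alignment in the spirit of Aeneas (illustrative only): warp a
# synthetic (TTS) feature sequence onto the real recording's features,
# then read utterance boundaries off the warping path.
import numpy as np

def dtw_path(a: np.ndarray, b: np.ndarray):
    """a: (Ta, d) synthetic features, b: (Tb, d) real features."""
    Ta, Tb = len(a), len(b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal warping path from the end to the start.
    path, i, j = [], Ta, Tb
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]
```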
3 Methodology

3.1 CTC-Segmentation

The following paragraphs describe CTC-segmentation, an algorithm to extract proper audio-text alignments in the presence of additional unknown speech sections at the beginning or end of the audio recording. It uses a CTC-based end-to-end network that was trained on already aligned data beforehand, e.g., as provided by a CTC/attention ASR system. For a given audio recording, the CTC network generates frame-based character posteriors p(c|t, X), with X denoting the audio input. From these probabilities, we compute via dynamic programming all maximum joint probabilities k_{t,j} for aligning the text up to character index j ∈ [1; M] to the audio up to frame t ∈ [1; T]. Probabilities are mapped into a trellis diagram by the following rules:

$$k_{t,j} = \begin{cases} \max\left(k_{t-1,j} \cdot p(\mathrm{blank}|t),\; k_{t-1,j-1} \cdot p(c_j|t)\right) & \text{if } t > 0 \wedge j > 0 \\ 0 & \text{if } t = 0 \wedge j > 0 \\ 1 & \text{if } j = 0 \end{cases} \qquad (1)$$

The maximum joint probability at a given point is computed by taking the more probable of the two possible transitions: either only a blank symbol or the next character is consumed. The transition cost for staying at the first character is set to zero, so that the start of the transcription can be aligned to an arbitrary point of the audio file. The character-wise alignment is then calculated by backtracking, starting from the most probable temporal position of the last character in the transcription, i.e., t = \arg\max_{t'} k_{t',M}. Transitions with the highest probability then determine the alignment a_t of audio frame t to its corresponding character from the text, such that

$$a_t = \begin{cases} M - 1 & \text{if } t \geq \arg\max_{t'} k_{t',M} \\ a_{t+1} & \text{if } k_{t,a_{t+1}} \cdot p(\mathrm{blank}|t+1) > k_{t,a_{t+1}-1} \cdot p(c_j|t+1) \\ a_{t+1} - 1 & \text{else} \end{cases} \qquad (2)$$

As this algorithm yields a probability ρ_t for every audio frame being aligned in a given way, a confidence score s_seg is derived for each segment in order to sort out utterances with deviations between speech and corresponding text. It is calculated as

$$s_{\mathrm{seg}} = \min_j m_j \quad \text{with} \quad m_j = \frac{1}{L} \sum_{t=jL}^{(j+1)L} \rho_t \qquad (3)$$

Here, the audio frames that were segmented to correspond to a given utterance are first split into parts of length L. For each of these parts, a mean value m_j of the frame-wise probabilities ρ_t is calculated. The total score s_seg of a given utterance is defined as the minimum of these per-part values m_j. This method inflicts a penalty on the confidence score on any mismatch, e.g., even if only a single word is missing in the transcription of a long utterance.

The complexity of the alignment algorithm is reduced from O(M · N) to O(M), where N is the number of audio frames, by using the heuristic that the ratio between the aligned audio and text positions is nearly constant. Instead of calculating all probabilities k_{t,j}, for every character position j one only considers the audio frames in the interval [t − W/2, t + W/2], with t = jN/M as the audio position proportional to the character position and W as the window size.
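The trellis recursion of Eq. (1) and the backtracking of Eq. (2) translate almost directly into code. The following is a deliberately naive O(T · M) NumPy sketch of the alignment core, in log space and without the windowing heuristic; function names and the interface are ours, and the authors' released implementation is available at the repository linked in Sec. 1.

```python
# Naive NumPy sketch of CTC-segmentation alignment (Eqs. 1 and 2),
# in log space and without the O(M) windowing heuristic.
import numpy as np

def ctc_align(log_probs: np.ndarray, char_ids: list, blank: int = 0):
    """log_probs: (T, C) frame-wise CTC log posteriors.
    char_ids: token ids c_1..c_M of the transcript, in order.
    Returns a per-frame list of 0-based character indices,
    with -1 for frames outside the aligned span."""
    T, M = log_probs.shape[0], len(char_ids)
    # k[t, j]: max log joint probability of aligning the first j
    # characters to the first t frames; j = 0 is the free
    # "transcript not yet started" state (Eq. 1, case j = 0).
    k = np.full((T + 1, M + 1), -np.inf)
    k[:, 0] = 0.0
    for t in range(1, T + 1):
        for j in range(1, M + 1):
            stay = k[t - 1, j] + log_probs[t - 1, blank]
            advance = k[t - 1, j - 1] + log_probs[t - 1, char_ids[j - 1]]
            k[t, j] = max(stay, advance)
    # Backtracking (Eq. 2): start at the most probable end frame of
    # the last character, then walk backwards frame by frame.
    alignment = [-1] * T
    t, j = int(np.argmax(k[:, M])), M
    while t > 0 and j > 0:
        alignment[t - 1] = j - 1
        stay = k[t - 1, j] + log_probs[t - 1, blank]
        advance = k[t - 1, j - 1] + log_probs[t - 1, char_ids[j - 1]]
        if advance >= stay:
            j -= 1
        t -= 1
    return alignment
```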
3.2 Text Preparation

The ground truth text from free corpora, such as Librivox or the SWC corpus, is often not directly usable for ASR and therefore has to be cleaned. To maximize generalization to the Tuda-DE test set, this is done in a way that matches the style of the ground truth text used in Tuda-DE, which only consists of letters, i.e., a-z and the umlauts (ä, ö, ü, ß). Punctuation characters are removed, and all sentences containing other characters are taken out of the dataset. All abbreviations and units are replaced with their full spoken equivalent; furthermore, all numbers are replaced by their full spoken equivalent. Here it is also necessary to consider the grammatical case, as it may influence the suffix of the resulting word: "1800 Soldaten" needs to be replaced by "eintausendachthundert Soldaten", whereas "Es war 1800" is replaced according to its pronunciation by "Es war achtzehnhundert". The correct case can be determined from neighboring words with simple heuristics; for this, the NLP tagger provided by the spaCy framework is used.
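Number expansion of this kind can be prototyped with the third-party num2words package, which provides German output. The paper does not name the tool it used for this step, so the snippet below is an illustrative substitute, with the year-style reading built manually:

```python
# Illustrative German number expansion; num2words is an assumed
# substitute for whichever expansion tool the authors actually used.
from num2words import num2words

def expand_year(n: int) -> str:
    # Year-style reading for 1100-1999, e.g. 1800 -> "achtzehnhundert",
    # built from the cardinal forms of the century and the remainder.
    if 1100 <= n < 2000:
        century, rest = divmod(n, 100)
        word = num2words(century, lang="de") + "hundert"
        if rest:
            word += num2words(rest, lang="de")
        return word
    return num2words(n, lang="de")

print(num2words(1800, lang="de"))  # eintausendachthundert (cardinal)
print(expand_year(1800))           # achtzehnhundert (year reading)
```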
Another issue arose from old German orthography. Text obtained from Librivox is, due to its expired copyright, usually at least 70 years old and follows old German spelling rules. For an automated transition to the reformed German orthography, we implemented a self-updating lookup table of letter replacements. This table was compiled based on a list of known German words from correctly spelled text.
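A word-level sketch of such a lookup table is shown below; the replacement pairs are illustrative examples of pre-reform spellings, not entries from the paper's automatically compiled list:

```python
# Sketch of lookup-table based modernization of old German spellings.
# The pairs below are illustrative; the paper compiled its table
# automatically from correctly spelled text.
OLD_TO_NEW = {
    "Thür": "Tür",            # pre-1901 "th" -> "t"
    "thun": "tun",
    "giebt": "gibt",          # pre-reform "ie" -> "i"
    "Litteratur": "Literatur",
}

def modernize(text: str) -> str:
    for old, new in OLD_TO_NEW.items():
        text = text.replace(old, new)
    return text

print(modernize("Es giebt eine Thür."))  # Es gibt eine Tür.
```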
4 Evaluation

4.1 Evaluation of CTC-Segmentation

In this section, we evaluate how well the proposed CTC-segmentation algorithm aligns utterance-wise text and audio. Evaluation is done on the dev and test sets of the TEDlium v2 corpus [15], which consist of recordings of 19 unique speakers talking in front of an audience. This corpus contains labeled sentence-length utterances, each with the information of the start and end of its segment in the audio recording. As these alignments were made manually, we use them as reference for the evaluation of the forced alignment algorithms. The comparison is based on three parameters: the mean deviation of the predicted start or end from ground truth, its standard deviation, and the ratio of predictions that are at most 0.5 seconds apart from ground truth. To evaluate the impact of the ASR model on CTC-segmentation, we include both BLSTM and Transformer models in the comparison. The pre-trained models were provided by the ESPnet toolkit [20]; the Transformer model has a self-attention encoder, the BLSTM model a BLSTMP encoder with sub-sampling in the second and third layer. We compare our approach with three existing forced alignment methods from the literature: MAUS, Gentle and Aeneas. To obtain utterance-wise from phone-wise alignments, we determine the begin time of the first phone and the end time of the last phone of the given utterance. As can be seen in Tab. 1, segment alignments generated by CTC-segmentation correspond significantly closer to ground truth than the segments generated by all other tested alignment algorithms.

Table 1: Accuracy of different alignment methods on the dev and test sets of TEDlium v2, compared via the mean deviation from ground truth, its standard deviation, and the ratio of predictions that are at most 0.5 seconds apart from ground truth.

  Method                                              | Mean  | Std   | <0.5s
  Conventional segmentation approaches
  MAUS (HMM-based, using HTK)                         | 1.38s | 11.62 | 74.1%
  Aeneas (DTW-based)                                  | 9.01s | 38.47 | 64.7%
  Gentle (HMM-based, using kaldi)                     | 0.41s |  1.97 | 82.0%
  CTC-segmentation (ours)
  Hybrid CTC/att. BLSTM trained on TEDlium v2         | 0.34s |  1.16 | 90.1%
  Hybrid CTC/att. Transformer trained on TEDlium v2   | 0.31s |  0.85 | 88.8%
  Hybrid CTC/att. Transformer trained on Librispeech  | 0.35s |  0.68 | 85.1%

Fig. 1 visualizes the density of segmentation timing deviations across all predictions. We thereby compare our approach, using the LSTM-based model trained on TEDlium v2, with the Gentle alignment tool. It can be seen that both approaches have timing deviations smaller than one second for most predictions. Apart from that, our approach has a higher density of deviations between 0 and 0.5 seconds, while it is the other way around in the interval from 0.5 to 1 second. This indicates that our approach generates more accurately aligned segments than Viterbi- or DTW-based algorithms.

Fig. 1: Relative deviation, denoted in seconds, of segments generated by Gentle and by our CTC-segmentation, compared to manually labeled segments from TEDlium v2. CTC-segmentation exhibited a greater accuracy at the start of the segments (top) than Gentle, and was also observed to be slightly more accurate towards the end of the segments (bottom). The y axis denotes density in a histogram.
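The three statistics reported in Tab. 1 and Tab. 2 can be computed directly from predicted and reference boundary times. The following is a minimal sketch, not the paper's evaluation script:

```python
# Sketch of the three comparison statistics: mean deviation from the
# hand-labeled boundary, its standard deviation, and the ratio of
# predictions within 0.5 seconds.
import numpy as np

def boundary_stats(pred: np.ndarray, ref: np.ndarray):
    """pred, ref: segment boundary times in seconds (starts or ends)."""
    dev = np.abs(pred - ref)
    return dev.mean(), dev.std(), (dev <= 0.5).mean()

pred = np.array([0.9, 12.4, 33.0])  # hypothetical predicted starts
ref = np.array([1.0, 12.2, 31.8])   # hypothetical manual labels
mean, std, ratio = boundary_stats(pred, ref)
print(f"{mean:.2f}s {std:.2f} {ratio:.1%}")
```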
As explained in Sec. 3.1, one of the main motivations for CTC-segmentation is to determine utterance segments in a robust manner, regardless of preambles or deviating transcriptions. To simulate such cases using the TEDlium v2 dev and test sets, we prepended the last N seconds of every audio file before its start and appended the first M seconds to its end. Hereby, N and M are randomly sampled, with a minimum of 10 s. Table 2 shows how the same algorithms perform on this altered dataset. Especially the accuracy of the alignment tools MAUS and Aeneas drops drastically when additional unknown parts of the audio recording are added. Gentle and our method, however, are able to retain their alignment abilities in such cases.

Table 2: Alignment methods on the augmented dev and test sets of TEDlium v2. The evaluation procedure is the same as in Tab. 1, but the audio samples are augmented by adding random speech parts to their start and end; in this way, the robustness of the different approaches is evaluated.

  Method                                | Mean   | Std   | <0.5s
  Existing methods
  MAUS (HMM-based, using HTK)           |  3.18s | 18.97 | 66.9%
  Aeneas (DTW-based)                    | 10.91s | 40.50 | 62.2%
  Gentle (HMM-based, using kaldi)       |  0.46s |  2.40 | 81.7%
  CTC-segmentation (ours)
  BLSTM trained on TEDlium v2           |  0.40s |  1.63 | 89.3%
  Transformer trained on TEDlium v2     |  0.35s |  1.38 | 89.2%
  Transformer trained on Librispeech    |  0.40s |  1.21 | 84.2%

To conclude both experiments: alignments generated by CTC-segmentation correspond closer to ground truth than those of DTW- and HMM-based methods, independent of the architecture and training set used. By inspection, the quality of the obtained alignments varies slightly across domains and conditions: the Transformer model, with its more powerful encoder, performs better than the BLSTM model. Also, the alignments of a model trained on the TEDlium v2 corpus are on average more accurate on its corresponding dev and test sets; this corpus contains more reverberation and noise from an audience than the Librispeech corpus.

4.2 German ASR Training Data

Model evaluation is performed on multiple combinations of datasets, listed in Tab. 3. We thereby build upon the corpora collection used by Milde et al. [11], namely Tuda-DE, SWC and M-AILABS. As in [11], we neglect the recordings made with the Realtek microphone due to their bad quality. In addition to these three corpora, we train our model on CommonVoice and Librivox. Data preparation of the CommonVoice dataset only required post-processing of the ground truth text by replacing all numbers with their full spoken equivalent. As the Viterbi alignment provided by [11] for SWC is not perfect, with some utterances missing their first words in the transcription, we realigned and cleaned the data using CTC-segmentation, as in Sec. 3.1; utterance alignments with a confidence score s_seg below a fixed log-space threshold were discarded. To perform CTC-segmentation on the Librivox corpus, we combined the audio files with the corresponding ground truth text pieces from Project Gutenberg-DE [6]. Comparable evaluation results were obtained by decoding the Tuda-DE dev and test sets, as also used in [11].

In total, the cumulative size of these corpora spans over 1700 h of speech data, of which we use three partially overlapping subsets for training: the first configuration uses the selection as provided by Milde et al., i.e., Tuda-DE, SWC and M-AILABS. The second subset is created by adding the CommonVoice corpus. The third selection conjoins the Tuda-DE corpus and CommonVoice with the two CTC-segmented corpora, SWC and Librivox.
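The confidence-based filtering described above amounts to applying Eq. (3) in log space and discarding low-scoring utterances. A minimal sketch, with a placeholder threshold:

```python
# Sketch of the confidence-based filtering of CTC-segmented utterances:
# Eq. (3) applied to frame-wise log probabilities; the threshold value
# here is a placeholder, not the one used in the paper.
import numpy as np

def segment_score(log_rho: np.ndarray, part_len: int = 30) -> float:
    """Minimum over per-part means of frame-wise log probabilities."""
    n_parts = max(1, len(log_rho) // part_len)
    parts = np.array_split(log_rho[: n_parts * part_len], n_parts)
    return float(min(p.mean() for p in parts))

THRESHOLD = -2.0  # hypothetical log-space threshold
utterances = {
    "utt1": np.full(120, -0.1),  # well-aligned segment
    "utt2": np.full(120, -3.0),  # transcript deviates from audio
}
kept = [u for u, lp in utterances.items() if segment_score(lp) >= THRESHOLD]
print(kept)  # ['utt1']
```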
Table 3: Datasets used for training and evaluation.

  Dataset                        |      | Length | Speakers | Utterances
  Tuda-DE train [14]             | TD   |   —    |   147    |   55497
  Tuda-DE dev [14]               | dev  |   —    |    16    |    3678
  Tuda-DE test [14]              | test |   —    |    17    |    4100
  SWC [2], aligned by [11]       | SW   |   —    |   363    |  171380
  M-AILABS [17]                  | MA   |   —    |    29    |  118521
  CommonVoice [1]                | CV   |   —    |    —     |     —
  CTC-segmented SWC              | SW*  |   —    |   363    |   78214
  CTC-segmented Librivox [6,10]  | LV*  |   —    |   251    |  368532
For all experiments, the hybrid CTC/attention architecture with the Transformer is used. It consists of a 12-layer encoder and a 6-layer decoder, both with 2048 units in each layer; attention blocks contain 4 heads with 256 units each, the default configuration of the Transformer model in ESPnet v0.5.3. All models were trained for 23 epochs using the noam optimizer. We did not use data augmentation, such as SpecAugment. At inference time, the test and dev sets are decoded using beam search. To further improve the results, a language model was used to guide the beam search. Language models of two sizes were used in decoding; the RNNLM language models were trained for 20 epochs on the same text corpus as used in [11]. The first RNNLM has two LSTM layers and achieves a perplexity of 8.53. The second RNNLM consists of four layers and reaches a perplexity of 6.46.
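At decoding time, the RNNLM guides the beam search by adding a weighted LM score to the hybrid CTC/attention score of each hypothesis. The sketch below shows this shallow-fusion style scoring in generic form; the weights are placeholders, not ESPnet's exact configuration:

```python
# Generic sketch of LM-guided beam search scoring (shallow fusion);
# ctc_weight and lm_weight are placeholder values.
def hypothesis_score(log_p_att: float, log_p_ctc: float, log_p_lm: float,
                     ctc_weight: float = 0.3, lm_weight: float = 0.5) -> float:
    """Hybrid CTC/attention score with an added, weighted LM score."""
    asr = ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att
    return asr + lm_weight * log_p_lm

# Rank two partial hypotheses in the beam:
h1 = hypothesis_score(-1.2, -1.5, -0.7)
h2 = hypothesis_score(-1.0, -2.4, -1.9)
print(h1 > h2)  # True: the LM breaks the tie towards the first one
```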
4.3 Results

The benchmark results are listed in Tab. 4. First, the effects of using different dataset combinations are inspected.
Table 4: A comparison of different dataset combinations, ASR models and language models. Word error rates are in percent and evaluated on the Tuda-DE dev and test sets.

  Datasets        | ASR model     | LM               | dev  | test
  TD SW           | TDNN-HMM [11] | 4-gram KN        | 15.3 | 16.5
  TD SW           | TDNN-HMM [11] | LSTM             | 13.1 | 14.4
  TD SW MA        | TDNN-HMM [11] | 4-gram KN        | 14.8 | 15.9
  TD SW MA        | Transformer   | RNNLM (2 layers) |  —   |  —
  TD SW MA CV     | Transformer   | RNNLM (2 layers) |  —   |  —
  TD SW MA CV     | Transformer   | RNNLM (4 layers) |  —   |  —
  TD CV SW* LV*   | Transformer   | None             | 19.3 | 19.7
  TD CV SW* LV*   | Transformer   | RNNLM (2 layers) |  —   |  —
  TD CV SW* LV*   | Transformer   | RNNLM (4 layers) |  —   | 12.8

By using the CommonVoice dataset in addition to Tuda-DE, SWC and M-AILABS, the test WER decreases; further replacing SWC and M-AILABS with the custom-aligned SWC and Librivox datasets brought the test set WER down to 12.8%.

The second observation is that the size of the language model, and with it the perplexity achieved on the text corpus, highly influences the WER. The significant improvement in WER can be explained by the better ability of the big RNNLM to detect and predict German words and grammatical forms. For example, Milde et al. [11] described that compound words pose a challenge for the ASR system; an unrecognized compound results in at least two errors, a substitution and an insertion. This was also observed in a decoding run without the RNNLM, e.g., "Tunneleinfahrt" was recognized as "Tunnel ein fahrt". By inspection of the recognized transcriptions, most of these cases were correctly determined when decoding with a language model, even more so with the large RNNLM.

Tab. 4 also gives further clues as to how the benefits to end-to-end ASR scale with the amount of automatically aligned data. The benchmark results obtained with the small language model improved on the Tuda-DE test set after the addition of the CommonVoice dataset. The biggest performance improvement was obtained with the third selection of corpora, the largest of the three. Whereas the composition of corpora is slightly different in this selection, two main factors contributed to this improvement: the increased amount of training data and the better utterance alignments obtained with CTC-segmentation.
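The two-error effect of a missed compound can be reproduced with any WER scorer, for instance with the jiwer package (the sentence is our own illustration, not taken from the test set):

```python
# A missed compound produces two word errors (one substitution plus
# one insertion), which doubles the damage of a single mistake.
from jiwer import wer

ref = "die tunneleinfahrt ist gesperrt"
hyp = "die tunnel einfahrt ist gesperrt"
print(wer(ref, hyp))  # 0.5: 1 substitution + 1 insertion over 4 words
```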
5 Conclusion

End-to-end ASR models require more training data than conventional DNN/HMM ASR systems, as those models grow in terms of depth, parameters and model capacity. In order to compile a large dataset from yet-unlabeled audio recordings, we proposed CTC-segmentation. This algorithm uses a CTC-based end-to-end neural network to extract utterance segments with exact time-wise alignments. The evaluation of our method is two-fold: As evaluated on the hand-labeled dev and test sets of TEDlium v2, alignments generated by CTC-segmentation were more accurate than those obtained from Viterbi- or DTW-based approaches. In terms of ASR performance, we built on a composition of German speech corpora [11] and trained an end-to-end ASR model with CTC-segmented training data; the best model achieved 12.8% WER on the Tuda-DE test set, an improvement of 1.6% WER absolute in comparison with the conventional hybrid DNN/HMM ASR system.
References