USING HETEROGENEOUS CORPORA FOR TRAINING OF AN ASR SYSTEM

Jan Trmal, Gaurav Kumar, Vimal Manohar, Sanjeev Khudanpur, Matt Post, Paul McNamee

Center for Language and Speech Processing & Human Language Technology Center of Excellence
The Johns Hopkins University, Baltimore, MD 21218, USA
ABSTRACT
This paper summarizes the development of the LVCSR system built as a part of the Pashto speech-translation system at the SCALE (Summer Camp for Applied Language Exploration) 2015 workshop on "Speech-to-text-translation for low-resource languages". The Pashto language was chosen as a good "proxy" low-resource language, exhibiting multiple phenomena that make the development of speech-recognition and speech-to-text-translation systems hard. Even when the amount of data is seemingly sufficient, given that the data originates from multiple sources, preliminary experiments reveal that there is little to no benefit in merging (concatenating) the corpora, and more elaborate ways of making use of all of the data must be worked out. This paper concentrates only on the LVCSR part and presents a range of different techniques that were found useful for benefiting from multiple different corpora.
Index Terms — speech translation; Pashto; Babel; multiple corpora; neural networks; discriminative training
1. INTRODUCTION
Pashto belongs to the southeastern Iranian branch of the Indo-Iranian languages. It has three main variants: Northern and Central (both spoken mainly in Pakistan) and Southern (spoken mainly in Afghanistan). Each of these variants has a number of dialectal varieties. It is estimated that Pashto has 66 million speakers across the world [1]. While written Pashto has existed since the 16th century, standardization of the writing system is still in progress. A substantial number of words have more than one publicly accepted spelling (cf. English adviser vs. advisor). Other issues include present/missing spacing after certain characters (especially those belonging to the non-connecting Arabic characters) and frequent substitution of visually similar (and similar-sounding) characters. The last problem is amplified by the fact that several different keyboard layouts are used "in the wild" for writing Pashto. There is, of course, the official Pashto keyboard layout, although the Arabic and Urdu layouts are used as well. The alternative layouts provide a majority of the Pashto characters, and people freely substitute those that are missing with visually similar characters. Also, different fonts can have small deficiencies in the rendering of glyphs, especially during kerning or joining of characters, and users often try to fix this by substituting a different character that looks better (i.e., closer to the expected shape) in the given context.

The work was supported by the NSF CRI Grant No 1513128, by the DARPA LORELEI Contract No HR0011-15-2-0024 and by the IARPA BABEL Contract No 2012-12050800010. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation hereon.

Fig. 1. ARABIC LETTER FARSI YEH and ARABIC LETTER ALEF MAKSURA: two different codepoints whose glyphs are rendered identically.

A direct impact of this is that a word, as a sequence of glyphs (visual representations of characters), can be represented as multiple sequences of Unicode codepoints (numerical codes of the characters). One example is the pair of Unicode codepoints ARABIC LETTER FARSI YEH and ARABIC LETTER ALEF MAKSURA (see Figure 1), whose glyphs are rendered with no visible differences, despite the fact that the codepoints are different. This also causes problems when looking only for visual differences (for example, during debugging of problems with the lexicon). We mention this somewhat anecdotal evidence to show readers who are familiar only with languages that use romanized scripts that there are a number of hidden peculiarities that are quite surprising when encountered for the first time.
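To make this concrete, the following minimal Python snippet (our illustration, not part of the original system) shows that the two codepoints compare as different strings and are not merged even by standard Unicode normalization, which is why hand-written substitution rules are needed:

```python
import unicodedata

farsi_yeh = "\u06CC"     # ARABIC LETTER FARSI YEH
alef_maksura = "\u0649"  # ARABIC LETTER ALEF MAKSURA

for ch in (farsi_yeh, alef_maksura):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  {ch}")

# The glyphs are typically rendered identically, yet the strings differ:
print(farsi_yeh == alef_maksura)  # False

# Standard Unicode normalization does NOT unify them, since they are
# distinct letters rather than canonical/compatibility equivalents:
print(unicodedata.normalize("NFKC", alef_maksura) == farsi_yeh)  # False
```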
2. SPEECH CORPORA AVAILABLE
In this section, we give a short overview of the different corpora we worked with. We acquired two corpora before the start of the workshop and one additional corpus during its course. As the latter is substantially different and not completely suitable for our tasks (LVCSR and machine translation), we did not achieve any gain from using it, and we report its numbers here just for the sake of completeness.
2.1. Appen Pashto (dataset "A")

The Appen Pashto corpus contains approximately 90 hours of conversational telephone (8 kHz) speech. As the train/dev partitions were not externally defined in the Appen data release, we partitioned the data into 85 hours for training and 5 hours as a dev set (these splits are part of the Kaldi recipe for release IARPA-babel104b-v04.b). The lexicon was included with the corpus, and the lexical entries included vowelized representations and the romanized forms of words. Moreover, the lexicon contained four dialectal variants of the pronunciation of each word (we assume these were generated automatically).

2.2. Babel Pashto (dataset "B")

Babel Pashto is the Full Language Pack in Babel program terminology. Simply put, it is a dataset of 80 hours of 8 kHz sampled telephone speech data together with the associated lexicon and transcripts. We used the development set defined in the corpus. Although our description of the corpora "A" and "B" may suggest that the datasets are similar, our observations demonstrated that this was not the case. We observed that dataset "B" was, overall, prepared and transcribed more carefully than dataset "A".

2.3. TransTac Pashto (dataset "T")

The TransTac Pashto corpus is significantly different from the two aforementioned corpora. By nature, it is more scripted speech, albeit with a high level of spontaneity. It is also professionally recorded (in a recording room), and hence we had to down-sample it to 8 kHz for use together with corpora "A" and "B". The dataset was created as follows: the participating individuals were given one out of a limited set of scenarios and were then asked to re-enact that scenario. We were unable to obtain the train/dev partitions that were used for this dataset in the DARPA TransTac project [2]. Because of this, and because we used the down-sampled speech, our performance is not directly comparable to previously reported results on this dataset. We started work on this corpus relatively late in the course of the workshop, and hence a majority of the reported experiments on the "merged" training data used only the A+B training set. Also, our primary aim was not to achieve the best result on this corpus but to establish whether this dataset could be used in the 8 kHz telephone speech scenario.
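The down-sampling itself is a one-line sox invocation; a minimal sketch with placeholder file names (the original pipeline details are not part of the corpus description):

```python
import subprocess

# Down-sample a studio recording to 8 kHz so that it matches the
# telephone-speech corpora; "studio.wav"/"telephone.wav" are placeholders.
subprocess.run(["sox", "studio.wav", "-r", "8000", "telephone.wav"], check=True)
```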
2.4. Lexicon and transcript normalization

As mentioned in the introduction, there is significant variability in the process used to transcribe corpora "A" and "B". Our first efforts targeted making the transcriptions from both corpora more consistent in their use of characters. This was motivated by our inspection of the lexicons from the two corpora: starting with the lexicons and removing the vowel marks from all words, the overlap (defined as the number of words the lexicons share) was only 15 %. In an attempt to determine the minimal number of character changes that would increase the lexicon overlap, we developed a simple algorithm that allowed a language expert to determine the edit rules:

1. Take the 1000 most frequent words from each lexicon (based on the associated transcripts). We chose this number so as to cover over 90 % of the frequency mass in each corpus.
2. Find the best word pairs in terms of the least character distance, one word from each lexicon, based on
   (a) Character Edit Distance,
   (b) Phone Edit Distance: we used the Festvox toolkit (http://festvox.org) to generate the phone sequence for each word in the lexicon.
3. Generate count-based statistics for character-based substitution and deletion rules based on the previous step.
4. Get an expert to verify the most frequent rules.
5. Use the rules to modify the lexicons and transcripts.
6. Go to Step 1.

Operation   Character                            Replacement
DEL         ARABIC KASRA
DEL         ARABIC FATHATAN
DEL         ARABIC KASRATAN
DEL         ZERO WIDTH NON-JOINER
SUB         ARABIC LETTER KAF                    ARABIC LETTER KEHEH
SUB         ARABIC LETTER GAF                    ARABIC LETTER KAF WITH RING
SUB         ARABIC LETTER FARSI YEH              ARABIC LETTER YEH
SUB         ARABIC LETTER YEH WITH TAIL          ARABIC LETTER YEH
SUB         ARABIC LETTER YEH WITH HAMZA ABOVE   ARABIC LETTER YEH
SUB         ARABIC LETTER E                      ARABIC LETTER YEH
SUB         ARABIC LETTER ALEF WITH HAMZA ABOVE  ARABIC LETTER ALEF

Table 1. Examples of automatically discovered rules for lexicon/text normalization, confirmed later by a language expert. DEL = Deletion, SUB = Substitution.

To elaborate, step 2 (the character edit computations) can provide highly informative observations: for each edit operation, we keep track of the global effect of applying such a rule, and we choose the minimal set of rules that leads to the maximal corpus overlap. The most frequent operations provide insights for discovering character-substitution or deletion rules; see Table 1 for examples of these automatically discovered (and expert-confirmed) rules, and the sketch below for an illustration of steps 2 and 3. After about four iterations of this process, we increased the overlap of the most frequent 1000 words across corpora "A" and "B" to approximately 70 %. After the fourth iteration, we did not find any additional systematic differences, so we stopped there.
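The following Python sketch is our reconstruction of steps 2 and 3 above, not the workshop code: it pairs each frequent word from lexicon "A" with its closest match in lexicon "B" by character edit distance, collects the SUB/DEL/INS operations along each alignment, and tallies them so that the most frequent ones can be shown to the expert (step 4). The brute-force nearest-neighbor search is a simplification.

```python
from collections import Counter

def edit_ops(src: str, dst: str):
    """Levenshtein alignment returning (op, src_char, dst_char) tuples."""
    n, m = len(src), len(dst)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + (src[i - 1] != dst[j - 1]))
    # Backtrace, recording only the non-matching operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (src[i - 1] != dst[j - 1]):
            if src[i - 1] != dst[j - 1]:
                ops.append(("SUB", src[i - 1], dst[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("DEL", src[i - 1], ""))
            i -= 1
        else:
            ops.append(("INS", "", dst[j - 1]))
            j -= 1
    return ops

def rule_counts(lexicon_a, lexicon_b):
    """Pair each word in A with its nearest word in B; count edit operations."""
    counts = Counter()
    for w in lexicon_a:
        best = min(lexicon_b, key=lambda v: len(edit_ops(w, v)))
        counts.update(edit_ops(w, best))
    return counts

# counts.most_common(20) is what the language expert would review (step 4).
```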
3. BASELINE ACOUSTIC SYSTEM
In this section, we provide a high-level description of how we trained the baseline system. As the objective was to develop a sufficiently simple, single-pass, minimum-delay system, we opted for training a deep neural network system (more precisely, a TDNN, as described in [3]). We used the Kaldi toolkit [4] for training the ASR system. A couple of details are worth mentioning.

We utilized augmentation of the data via speed perturbation (as described in [5]) during training. We used sox (http://sox.sourceforge.net/) to obtain two additional copies of the training data (with speed factors of 0.9 and 1.1, as in [5]). Our experience confirms this improves the overall robustness of the resulting models. The network was trained in a parallel fashion using model averaging, as described in [6].

The pronunciation lexicon rarely contains the probabilities of the individual pronunciation variants. It is, however, possible to estimate these probabilities from the alignments of the training data. Moreover, it is possible to model word-dependent silence probabilities, in addition to modeling the probability of silence, to estimate (and suitably smooth) the probability of each word appearing after silence. See [7] for a detailed analysis of this and several related ideas. In our experience, it is also beneficial to re-estimate these probabilities iteratively several times during the training process. We saw reasonable gains (given that this phase itself is not computationally expensive) from using the probabilistic pronunciation lexicon even during training.
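As an illustration of the pronunciation-probability estimation described above, here is a small Python sketch of the basic counting idea (ours; the actual system uses Kaldi's implementation, which also models the word-dependent silence probabilities of [7]). The smoothing constant and the restriction to variants seen in the alignments are simplifications:

```python
from collections import Counter

def pron_probs(alignments, smoothing=1.0):
    """Estimate pronunciation-variant probabilities from training alignments.

    alignments: iterable of (word, pron_variant) pairs extracted from the
    aligned training data. Additive smoothing keeps rare variants alive.
    """
    counts = Counter(alignments)
    word_totals = Counter()
    for (word, _), c in counts.items():
        word_totals[word] += c
    variants_per_word = Counter(word for word, _ in counts)
    probs = {}
    for (word, pron), c in counts.items():
        denom = word_totals[word] + smoothing * variants_per_word[word]
        probs[(word, pron)] = (c + smoothing) / denom  # sums to 1 per word
    return probs
```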
For the sequence training, we used the sMBR method [8, 9], which is reported [10] to give the best performance (measured with respect to WER). We also found it beneficial to adjust the prior probabilities (used during decoding for converting TDNN posterior probabilities into likelihoods) after finishing the discriminative training. Historically, the priors are computed from the alignments. As mentioned in [11], marginalizing the DNN posteriors over all acoustic vectors gives better performance, especially when the data is noisy. Only a limited subset of the full training data is usually needed, so again, this improvement comes relatively cheap.
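A numpy sketch of this prior re-estimation, following the idea in [11] (our illustration, not the Kaldi code): instead of counting states in the alignments, the state priors are obtained by averaging the network's output posteriors over a subset of training frames:

```python
import numpy as np

def estimate_priors(posterior_batches):
    """Average DNN posteriors over frames to obtain state priors.

    posterior_batches: iterable of (num_frames, num_states) arrays,
    e.g. forward passes over a subset of the training data.
    """
    total, frames = None, 0
    for post in posterior_batches:
        s = post.sum(axis=0)
        total = s if total is None else total + s
        frames += post.shape[0]
    return total / frames

# During decoding, pseudo-likelihoods are obtained as
#   log p(x|s) = log p(s|x) - log prior(s) + const,
# so the re-estimated priors directly replace the alignment-based ones.
```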
After the final lattices were generated, we applied the duration-model rescoring described in [12], using the software provided by the author of that paper (https://github.com/alumae/kaldi-nnet-dur-model). We found the improvements fairly consistent, albeit smaller than the numbers reported in the original paper.
The overall performance of the resulting baseline system is reported in Table 2. Please note that a separate baseline system is trained for each test set (Appen, Babel, and TransTac) using only speech data from the corresponding training set.

                          A-dev     B-dev     T-dev
triphone GMM system       64.10 %   61.90 %   36.45 %
 + pron. probs            62.03 %   60.30 %   35.37 %
 + TDNN system            52.09 %   48.19 %   25.29 %
 + sequence training      48.33 %   45.43 %   23.23 %
 + duration modeling      48.19 %   44.93 %   23.12 %

Table 2. Performance (WER) of the baseline ASR system on three Pashto test sets: "A" - the Appen Lila corpus, "B" - the Babel corpus, and "T" - the DARPA TransTac corpus.

Babel FullLP (B-dev) performance                WER
Radical-JHU 3-way combination                   50.70 %
Radical-JHU single system best number           53.60 %
Scale B-train only without normalization        47.30 %
Scale B-train only with normalization           43.92 %
Scale A+B-train with normalization              45.14 %

Table 3. Comparison of the performance (WER) of the newly developed baseline vs. the best single system previously developed by us for the Babel dataset.

                A-dev     B-dev     T-dev
"native" data   49.49 %   43.92 %   23.12 %
A+B-train       48.19 %   45.14 %   53.46 %

Table 4. Comparison of the performance of the system trained on corpus-specific ("native") data vs. the training set obtained by merging the training data of the "A" and "B" sets (TDNN system).
4. RELATED WORK
This comparison is not completely straightforward. A substantial amount of work was reported as a part of the Babel Pashto project, but the reported numbers generally come from a combination of multiple (sometimes a huge number of) systems. As our team participated in this project (as a part of team Radical), we present a comparison of our current system with our best-performing Babel system (a hybrid DNN system) from two years ago (see Table 3). The comparison has to be made carefully, as we applied the rule set described in Section 2.4 in order to unify the corpora at hand. From Table 3 it can be seen that the new training procedure ("Scale B-train only without normalization") gives us a ~6 % absolute gain over the older system, and the normalization rules provide another ~3 % gain. Note, as an aside, that simply adding the Appen data to the Babel training set degrades WER by about 1.2 %.
5. JOINT MULTI-CORPUS TRAINING
During the course of the workshop, it became apparent that the three corpora do not combine well. The corpora "A" and "B" are the closest, but even their combination for training did not produce better results (see Table 4). Another piece of evidence can be gathered from Table 5: the language model created from the training data of the "T" dataset was not useful for the language model interpolation.

        lm A   lm B   lm T   ppl orig   ppl interp
text A  0.8    0.2    0.0     99.2       92.9
text B  0.1    0.8    0.1    141.9      140.0
text T  0.0    0.0    1.0     86.7       86.7
Table 5. Optimal mixing weights, and the resulting perplexities on the three dev-test sets, for interpolating language models trained on the three corpora: "A" (Appen), "B" (Babel) and "T" (TransTac).
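Mixing weights such as those in Table 5 can be obtained with the standard EM procedure for linear language-model interpolation (implemented, for example, by SRILM's compute-best-mix). A small sketch, where probs[t, k] is the probability that component LM k assigns to the t-th word of the dev text (the array and function names are ours):

```python
import numpy as np

def best_mix(probs, iters=50):
    """EM for linear LM interpolation weights.

    probs: (num_words, num_lms) array; probs[t, k] = P_k(w_t | h_t).
    Returns weights maximizing dev-text likelihood (minimizing perplexity).
    """
    n, k = probs.shape
    lam = np.full(k, 1.0 / k)
    for _ in range(iters):
        mix = probs @ lam                    # mixture probability per word
        resp = probs * lam / mix[:, None]    # posterior of each LM per word
        lam = resp.mean(axis=0)              # M-step: average responsibilities
    ppl = np.exp(-np.mean(np.log(probs @ lam)))
    return lam, ppl

# E.g., weights like [0.8, 0.2, 0.0] correspond to the first row of Table 5.
```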
As the diversity of the data proved too high to allow for training a single well-performing model on all of the data, we decided to train dataset-specific models, i.e., three models, each specialized to a given dataset. Moreover, we tried to find out whether there was a way to benefit from the fact that we had multiple (similar) corpora. The method we used to exploit this fact was sharing (i.e., training jointly) the hidden layers and having only the first and the last layers be dataset-specific. The reason for doing this was two-fold. First, it allowed us to train a larger neural network with potentially better performance. Second, the shared layers would hopefully learn more general/robust hyperplane separations. See Figure 2 for an illustration of this method.

Fig. 2. Scheme of sharing the layers for joint multi-corpus training. The best performance was achieved with the "share all but first and last" approach.

We experimented with different sharing strategies. The best performance was obtained when all layers except the first and the last were shared, i.e., when the input and output layers remained corpus-specific. Sharing fewer or more layers made the performance worse.

                                dev-A     dev-B     dev-T
"Matched" single corpus         55.38 %   46.81 %   24.76 %
A+B-train                       51.89 %   48.19 %   52.59 %
share all but last              53.20 %   47.35 %   27.03 %
share all but first and last    51.22 %   45.02 %   25.38 %
 + optimized LM                 50.83 %   44.77 %   25.38 %
 + duration modeling            50.44 %   44.26 %   24.83 %

Table 6. ASR performance (WER) for multi-corpus training of acoustic models, with shared hidden layers and corpus-specific input and/or output layers.

The training procedure was similar to the training of a single network. In each step, for each dataset, a new, updated network was obtained using model averaging. After that, these corpus-specific networks were averaged, and the shared layers of this averaged network were copied back into the corpus-specific networks. This represents one iteration of the joint multi-corpus training (see the sketch below).
Using the described approach, we were able to train a 7-layer p-norm network with a hidden layer dimension of 4500 and a p-norm pooling ratio of 1:10. The best network trainable using only the native data was a 7-layer p-norm network with a hidden layer dimension of 2500.
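A schematic of the averaging step in one joint-training iteration (our simplification, using plain numpy weight dictionaries; the actual system used Kaldi's parallel training with model averaging [6]):

```python
import numpy as np

def joint_update(nets, shared_layers):
    """One iteration of multi-corpus training with shared hidden layers.

    nets: dict corpus -> dict layer_name -> weight array, where each net
          has already been updated (parallel SGD + model averaging) on
          its own corpus.
    shared_layers: names of the layers tied across corpora (all but the
          first and last layer in the best configuration).
    """
    for name in shared_layers:
        avg = np.mean([net[name] for net in nets.values()], axis=0)
        # Copy the averaged shared layers back into every corpus-specific
        # network; the corpus-specific first/last layers stay untouched.
        for net in nets.values():
            net[name] = avg.copy()
    return nets
```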
6. MAIN RESULTS
A key lesson from Table 6 is that, relative to single-corpus training, straightforward pooling of multiple corpora (A and B) to train a single acoustic model results in a significant degradation of WER on 2 out of 3 test sets (Babel and TransTac). Only the Appen test set benefits from training on the Babel data. The main new result, by contrast, is that training separate TDNN acoustic models for each corpus while sharing the internal layers (an idea akin to the training of multilingual acoustic models), followed by some LM and duration-model optimization, results in significant improvements in WER. The Appen WER reduces from 51.9 % to 50.4 % (3 % relative), while the Babel WER reduces from 46.8 % to 44.3 % (5 % relative).

The TransTac corpus is too different from the Appen and Babel corpora to see any benefit from data pooling. However, it is also remarkable that acoustic models trained on only the Appen and Babel data with the new method attain the same performance as models trained exclusively on TransTac data: 24.8 %. By contrast, acoustic models trained on Appen and Babel using traditional data pooling degrade WER to 53 %. This illustrates the cross-corpus robustness of the new method.
7. CONCLUSION
This paper presents an overview of the Pashto low-resource ASR system built during the SCALE'15 workshop. Initially, we developed a single-pass LVCSR system for the Pashto language. This system, trained on a corpus obtained by concatenating two different corpora (called "A" and "B" in this paper), significantly outperformed the systems developed as a part of the Babel program. Our single-pass system achieved comparable, if not better, performance with respect to very complex systems (multiple passes plus combinations of multiple systems).

While achieving as good a WER as possible was important, the body of the work done during the workshop, and hence described in this paper, concentrates on providing insight into how to combine training data from multiple different sources. Pashto is a good proxy for demonstrating various issues that can be seen "in the wild". While the aspects of having different audio channels (i.e., sampling frequencies, additive line noise, room impulse responses) are generally fully appreciated, the issues stemming from the inner complexities of the language are largely overlooked. Using Pashto, we demonstrate some issues native to languages whose computer-implemented writing systems are not yet mature, together with our take on how to deal with them in order to obtain data homogeneous enough to be useful for ASR system training.

8. REFERENCES

[1] M. Paul Lewis, Gary F. Simons, and Charles D. Fennig, Eds., Ethnologue: Languages of the World, SIL International, Dallas, Texas, eighteenth edition, 2015.

[2] Gregory A. Sanders, Brian A. Weiss, Craig Schlenoff, Michelle P. Steves, and Sherri Condon, "Evaluation methodology and metrics employed to assess the TransTac two-way, speech-to-speech translation systems," Computer Speech & Language, vol. 27, no. 2, pp. 528-553, 2013, Special Issue on Speech-speech translation.

[3] Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Proceedings of Interspeech. 2015, ISCA.

[4] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Dec. 2011, IEEE Signal Processing Society.

[5] Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur, "Audio augmentation for speech recognition," in Proceedings of Interspeech. 2015, ISCA.

[6] Daniel Povey, Xiaohui Zhang, and Sanjeev Khudanpur, "Parallel training of deep neural networks with natural gradient and parameter averaging," CoRR, vol. abs/1410.7455, 2014.

[7] Guoguo Chen, Hainan Xu, Minhua Wu, Daniel Povey, and Sanjeev Khudanpur, "Pronunciation and silence probability modeling for ASR," in Proceedings of Interspeech. 2015, ISCA.

[8] Janez Kaiser, Bogomir Horvat, and Zdravko Kacic, "A novel loss function for the overall risk criterion based discriminative training of HMM models," in Sixth International Conference on Spoken Language Processing. 2000, ISCA.

[9] Matthew Gibson and Thomas Hain, "Hypothesis spaces for minimum Bayes risk training in large vocabulary speech recognition," in Proceedings of Interspeech. 2006, ISCA.

[10] Karel Veselý, Arnab Ghoshal, Lukáš Burget, and Daniel Povey, "Sequence-discriminative training of deep neural networks," in Proceedings of Interspeech. 2013, pp. 2345-2349, ISCA.

[11] Vimal Manohar, Daniel Povey, and Sanjeev Khudanpur, "Semi-supervised maximum mutual information training of deep neural network acoustic models," in Proceedings of Interspeech. 2015, ISCA.

[12] Tanel Alumäe, "Neural network phone duration model for speech recognition," in Proceedings of ICASSP. 2014, IEEE.