Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yoloxóchitl Mixtec

Jiatong Shi, Jonathan D. Amith, Rey Castillo García, Esteban Guadalupe Sierra, Kevin Duh, Shinji Watanabe

Johns Hopkins University, Baltimore, Maryland, United States; Department of Anthropology, Gettysburg College; Secretaría de Educación Pública, Estado de Guerrero, Mexico
{jiatong_shi@, kevinduh@cs.}jhu.edu, {jonamith, reyyoloxochitl, estebanyoloxochitl}@gmail.com, [email protected]

Abstract

"Transcription bottlenecks," created by a shortage of effective human transcribers, are one of the main challenges to endangered language (EL) documentation. Automatic speech recognition (ASR) has been suggested as a tool to overcome such bottlenecks. Following this suggestion, we investigated the effectiveness for EL documentation of end-to-end ASR, which, unlike Hidden Markov Model ASR systems, eschews linguistic resources but is instead more dependent on large-data settings. We open-source a Yoloxóchitl Mixtec EL corpus. First, we review our method for building an end-to-end ASR system in a way that would be reproducible by the ASR community. We then propose a novice transcription correction task and demonstrate how ASR systems and novice transcribers can work together to improve EL documentation. We believe this combinatory methodology would mitigate the transcription bottleneck and transcriber shortage that hinder EL documentation.
1 Introduction

Grenoble et al. (2011) warned that half of the world's 7,000 languages would disappear by the end of the 21st century. Consequently, a concern with endangered language documentation has emerged from the convergence of interests of two major groups: (1) native speakers who wish to document their language and cultural knowledge for future generations; (2) linguists who wish to document endangered languages to explore linguistic structures that may soon disappear. Endangered language (EL) documentation aims to mitigate these concerns by developing and archiving corpora, lexicons, and grammars (Lehmann, 1999). There are two major challenges:

(a) Transcription Bottleneck:
The creation of EL resources through documentation is extremely challenging, primarily because the traditional method of preserving primary data involves not simply audio recordings but also time-coded transcriptions. In a best-case scenario, texts are presented in interlinear format with aligned parses and glosses along with a free translation (Anastasopoulos and Chiang, 2017). But interlinear transcriptions are difficult to produce in meaningful quantities: (1) ELs often lack a standardized orthography (if written at all); (2) invariably, few speakers can accurately transcribe recordings. Even a highly skilled native speaker or linguist will require a minimum of 30 to 50 hours simply to transcribe one hour of recording (Michaud et al., 2014; Zahrer et al., 2020). Additional time is needed for parsing, glossing, and translation. This creates what has been called a "transcription bottleneck," a situation in which expert transcribers cannot keep up with the amount of material recorded for documentation.

(b) Transcriber Shortage:
It is generally understood that any viable solution to the transcription bottleneck must involve native-speaker transcribers. Yet usually few, if any, native speakers have the skills (or time) to transcribe their language. Training new transcribers is one solution, but it is time-consuming, especially with languages that present complicated phonology and morphology. The situation is distinct for major languages, for which transcription can be crowd-sourced to speakers with little need for specialized training (Das and Hasegawa-Johnson, 2016). In Yoloxóchitl Mixtec (YM; Glottocode=yolo1241, ISO 639-3=xty), the focus of this study, training is time-consuming: after one year of part-time transcription training, a proficient native speaker, Esteban Guadalupe Sierra, still has problems with certain phones, particularly tones and glottal stops. Documentation requires accurate transcriptions, a goal yet beyond even the capability of an enthusiastic speaker with many months of training.

As noted, ASR has been proposed to mitigate the transcription bottleneck and create increasingly extensive EL corpora. Previous studies first investigated HMM-based ASR for EL documentation (Ćavar et al., 2016; Mitra et al., 2016; Adams et al., 2018; Jimerson et al., 2018; Jimerson and Prud'hommeaux, 2018; Michaud et al., 2018; Cruz and Waring, 2019; Thai et al., 2020; Zahrer et al., 2020; Gupta and Boulianne, 2020a). Along with HMM-based ASR, natural language processing and semi-supervised learning have been suggested as ways to produce morphological and syntactic analyses. As HMM-based systems have become more precise, they have been increasingly promoted as a mechanism to bypass the transcription bottleneck. However, ASR's context for ELs is quite distinct from that of major languages. Endangered languages seldom have sufficient extant language lexicons to train an HMM system and invariably suffer from a dearth of skilled transcribers to create these necessary resources (Gupta and Boulianne, 2020b).

As we have confirmed with the present study, end-to-end ASR systems have shown comparable or better results than conventional HMM-based methods (Graves and Jaitly, 2014; Chiu et al., 2018; Pham et al., 2019; Karita et al., 2019a). As end-to-end systems directly predict textual units from acoustic information, they save much effort on lexicon construction. Nevertheless, end-to-end ASR systems still suffer from the limitation of training data. Attempts with resource-scarce languages have produced relatively high character (CER) or word (WER) error rates (Thai et al., 2020; Matsuura et al., 2020; Hjortnaes et al., 2020). It has nevertheless become possible to utilize ASR with ELs to reduce significantly, but not eliminate, the need for human input and annotation to create acceptable ("archival quality") transcriptions.

This Work:
This work represents end-to-end ASR efforts on Yoloxóchitl Mixtec (YM), an endangered language from western Mexico. Specifically, we used material from the community of Yoloxóchitl (YMC), one of four communities in which YM is spoken. The YMC corpus comprises two sub-corpora. The first ("YMC-EXP", the expert-transcribed corpus) includes 100 hours of transcribed speech that have been carefully checked for accuracy. We built an ESPnet (Watanabe et al., 2018) recipe that shows the whole process of constructing an end-to-end ASR system using the YMC-EXP corpus (https://github.com/espnet/espnet/tree/master/egs/yoloxochitl_mixtec/asr1). The second corpus ("YMC-NT", the native-trainee corpus) includes 8+ hours of additional recordings not included in the YMC-EXP corpus. This second corpus contains novice transcriptions with subsequent expert corrections, which has allowed us to evaluate the skill level of the novice. Both the YMC-EXP and YMC-NT corpora are publicly available at OpenSLR under a CC BY-NC-SA 3.0 license.

The contributions of our research are:

• A new Yoloxóchitl Mixtec corpus to support ASR efforts in EL documentation.
• A reproducible workflow to build an end-to-end ASR system for EL documentation.
• A comparative study between HMM-based ASR and end-to-end ASR, demonstrating the feasibility of the latter. To test the framework's generalizability, we also experiment with another EL: Highland Puebla Nahuatl (Glottocode=high1278; ISO 639-3=azz).
• An in-depth analysis of errors in novice transcription and ASR. Considering the discrepancies in error types, we propose Novice Transcription Correction (NTC) as a task for the EL documentation community. A rule-based method and a voting-based method (the latter built on the system combination method Recognizer Output Voting Error Reduction; Fiscus, 1997) are proposed. In clean speech, the best system reduces the relative word error rate of the novice transcription by 38.9%.
2 The Yoloxóchitl Mixtec Corpus

In this section, we first introduce the linguistic specifics of YM and YMC. We then discuss the recording settings. Since YM is a spoken language without a standardized textual format, we next explain the transcription style designed for this language. Finally, we present the corpus partition and some statistics on corpus size.
2.1 Yoloxóchitl Mixtec

Yoloxóchitl Mixtec is an endangered, relatively low-resource Mixtecan language. It is mainly spoken in the municipality of San Luis Acatlán, state of Guerrero, Mexico. It is one of some 50 languages in the Mixtec language family, which is part of a larger unit, Otomanguean, that Suárez (1983) considers "a 'hyper-family' or 'stock'." Mixtec languages (spoken in Oaxaca, Guerrero, and Puebla) are highly varied, the result of approximately 2,000 years of diversification.

YM is spoken in four communities: Yoloxóchitl, Cuanacaxtitlan, Arroyo Cumiapa, and Buena Vista. Mutual intelligibility among the four YM communities is high despite significant differences in phonology, morphology, and syntax. All villages have a simple segmental inventory but significant, though still undocumented, variation in tonal phonology. YMC (referring only to the Mixtec of the community of Yoloxóchitl [16.81602, -98.68597]) manifests 28 distinct tonal patterns on 1,451 identified bimoraic lexical stems. The tonal patterns carry a significant functional load in regard to the lexicon and inflection. For example, 24 distinct tonal patterns on the bimoraic segmental sequence [nama] yield 30 words (including six homophones). This ample tonal inventory presents challenges both to a native speaker learning to write and to an ASR system learning to recognize. Notably, it also introduces difficulties in constructing a language lexicon for training HMM-based systems.
2.2 Recording Settings

Two corpora are used in this study. The first (YMC-EXP) was used for ASR training. The second (YMC-NT) was used to train the novice speaker (e.g., to set up a curriculum for him to learn how to transcribe) and for Novice Transcription Correction. The YMC-EXP corpus comprises expert transcriptions used as the gold-standard reference for ASR development. The YMC-NT corpus has paired novice-expert transcriptions, as it was used to train and evaluate the novice writer.

The corpus used for ASR development comprises mostly conversational speech in two-channel recordings (split for training). Each conversation is between two speakers, and each of the two speakers was fitted with a separate head-worn mic (usually a Shure SM10a). Over two dozen speakers (mostly male) contributed to the corpus. The topics and their distribution were varied (plants, animals, hunting/fishing, food preparation, ritual speech).

The YMC-NT corpus comprises single-channel field recordings made with a Zoom H4n at the moment plants were collected during ethnobotanical research. Speakers were interviewed one after another; there is no overlap. However, the recordings often registered background sounds (crickets, birds) that we expected would negatively impact ASR accuracy more than seems to have occurred. The topic was always a discussion of plant knowledge (a theme of only 9% of the YMC-EXP corpus). Expectedly, there were many out-of-vocabulary (OOV) words (e.g., plant names not elsewhere recorded) in this YMC-NT corpus. (After separating enclitics and prefixes as separate tokens, the OOV rate in YMC-NT is 4.84%.)

Corpus     Subset       UttNum   Dur (h)
EXP        Train        52763    92.46
EXP        Validation   2470     4.01
EXP        Test         1577     2.52
EXP(-CS)   Train        35144    58.60
EXP(-CS)   Validation   1301     2.16
EXP(-CS)   Test         2603     4.35
NT         Clean-Dev    2523     3.45
NT         Clean-Test   2346     3.31
NT         Noise-Test   1335     1.60

Table 1: YMC corpus partition for EXP (corpus with expert transcription), EXP(-CS) (subset of EXP without "code-switching"), and NT (corpus with paired novice and expert transcription).

2.3 Transcription Conventions

(a) Transcription Levels: The YMC-EXP corpus presently has two levels of transcription: (1) a practical orthography that represents underlying forms; (2) surface forms. The underlying form marks prefixes (separated from the stem by a hyphen), enclitics (separated by an = sign), and tone elision (with the elided tones in parentheses). All these "breaks" and phonological processes disappear in the surface form. For example, the underlying be'e=an (house=3sgFem; 'her house') surfaces as be'ã, and be'e(3)= ('my house') surfaces as be'e. Another example is the completive prefix ni-, which is separated from the stem as in ni-xixi(3)= (completive-eat-1sgS; 'I ate'); the surface form would be written nĩxixi. Again, processes such as nasalization, vowel harmony, palatalization, and labialization are not represented in the practical (underlying) orthography but are generated in the surface forms. The only phonological process encoded in the underlying orthography is tone elision, for which parentheses are used.

The practical, underlying orthography mentioned above was chosen as the default system for ASR training for three reasons: (1) it is easier than a surface representation for native speakers to write; (2) it represents morphological boundaries and thus serves to teach native speakers the morphology of their language; and (3) for a researcher interested in generating concordances for a corpus-based lexicographic project, it is much easier to discover the root for 'house' in be'e=an and be'e(3)= than in the surface forms be'ã and be'e (a toy version of this underlying-to-surface mapping is sketched below).
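The following sketch illustrates only the two mappings described above, boundary-marker removal and elided-tone deletion; it is not the project's actual surface-form generator, since nasalization, vowel harmony, palatalization, and labialization are not modeled:

```python
import re

def underlying_to_surface_sketch(underlying: str) -> str:
    """Approximate a surface form from the practical (underlying) orthography.

    Models only two of the processes described above: morpheme-boundary
    markers ('-' for prefixes, '=' for enclitics) disappear, and elided
    tones written in parentheses are deleted. Nasalization, vowel harmony,
    palatalization, and labialization are NOT modeled, so, e.g.,
    ni-xixi(3)= becomes nixixi rather than the true surface form nĩxixi.
    """
    form = re.sub(r"\([^)]*\)", "", underlying)    # drop elided tones
    return form.replace("-", "").replace("=", "")  # drop boundary markers

# underlying_to_surface_sketch("be'e(3)=") -> "be'e", matching the text above
```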
(b) "Code-Switching" in YMC: Endangered, colonialized Indigenous languages often manifest extensive lexical input from a dominant Western language, and speakers often talk with "code-switching" (for lack of a better term). Yoloxóchitl Mixtec is no exception. Amith considered how best to write such forms and decided that Spanish-origin words would be written in Spanish and without tone when their phonology and meaning are close to those of Spanish. Thus Spanish docena appears over a dozen times in the corpus and is written tucena; it always has the meaning of 'dozen'. All month and day names are also written without tones. Note, however, that Spanish camposanto ('cemetery') is also found in the corpus and pronounced as pa san tu. The decision was made to write this with tone markings, as it is significantly different in pronunciation from the Spanish original word. In effect, words like pa san tu are considered loans into YM and are treated orthographically as Mixtec. Words such as tucena are considered "code-switching" and written without tones.

(c) Transcription Process: The initial time-aligned transcriptions were made in Transcriber (Barras et al., 1998). However, given that Transcriber cannot handle multiple tiers (e.g., transcription and translation, or underlying and surface orthographies), the Transcriber transcriptions were then imported into ELAN (Wittenburg et al., 2006) for further processing (e.g., correction, surface-form generation, translation).
2.4 Corpus Partition

Though endangered, YMC does not suffer from the same level of resource limitations that affect most ASR work with ELs (Ćavar et al., 2016; Jimerson et al., 2018; Thai et al., 2020). The YMC-EXP corpus, developed over more than ten years, provided 100 hours for the ASR training, validation, and test corpora. There are 505 recordings from 34 speakers in the YMC-EXP corpus, and the transcriptions for YMC-EXP were all carefully proofed by an expert native-speaker linguist. As shown in Table 1, we offer a train-valid-test split in which there is no overlap in content between the sets. The partition considers the balance between speakers and the relative size of each part (a simple recording-level split of this kind is sketched at the end of this section).

As introduced in Section 2.2, the YMC-NT corpus has both expert and novice transcription. It includes only three speakers, for a total of 8.36 hours. In the recordings of two consultants, the environment is relatively clean and free of background noise. The speech of the other individual, however, is frequently affected by background noise. This seems coincidental, as all three were recorded together, one after the other in random order. Given this situation, we split the corpus into three sets: clean-dev (speaker EGS), clean-test (speaker CTB), and noise-test (speaker FEF; see Table 1).

The "code-switching" discussed in Section 2.3 (b) introduces different phonological representations and makes it difficult to train an HMM-based model using language lexicons. Therefore, previous work (Mitra et al., 2016) using an HMM-based system for YMC did not consider phrases with "code-switching." To compare our model with their results, we have used the same experimental corpus in our evaluation. Their corpus (YMC-EXP(-CS)), shown in Table 1, is a subset of YMC-EXP; the YMC-EXP(-CS) corpus does not contain "code-switching" phrases, i.e., phrases with words tagged as Spanish-origin and transcribed without tone.
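A content-disjoint partition of this kind can be sketched by assigning whole recordings to a single set, since all utterances of one recording share its topic. The dictionary keys below are hypothetical, and the real partition was additionally balanced by speaker and set size:

```python
import random
from collections import defaultdict

def split_corpus(utterances, ratios=(0.92, 0.04, 0.04), seed=0):
    """Assign whole recordings to train/valid/test so that the same
    narrative never appears in two sets. 'recording_id' is a
    hypothetical key grouping utterances by source recording."""
    by_recording = defaultdict(list)
    for utt in utterances:
        by_recording[utt["recording_id"]].append(utt)
    recordings = sorted(by_recording)
    random.Random(seed).shuffle(recordings)
    cut1 = int(len(recordings) * ratios[0])
    cut2 = cut1 + int(len(recordings) * ratios[1])
    train = [u for r in recordings[:cut1] for u in by_recording[r]]
    valid = [u for r in recordings[cut1:cut2] for u in by_recording[r]]
    test = [u for r in recordings[cut2:] for u in by_recording[r]]
    return train, valid, test
```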
3 End-to-End ASR

3.1 Model Architecture

As ESPnet (Watanabe et al., 2018) is widely used in open-source end-to-end ASR research, our end-to-end ASR systems are all constructed with ESPnet. For the encoder, we employed the conformer structure (Gulati et al., 2020), while for the decoder we used the transformer structure to condition on the full context, following the work of Karita et al. (2019b). The conformer architecture is a state-of-the-art development of previous transformer-based encoding methods (Karita et al., 2019a; Guo et al., 2020). A comparison between the conformer and transformer encoders shows the value of applying state-of-the-art end-to-end ASR to ELs.

3.2 Experiments and Results
As discussed above, our end-to-end model applies an encoder-decoder architecture with a conformer encoder and a transformer decoder. The architecture of the model follows Gulati et al. (2020), while its configuration follows the aishell conformer recipe from ESPnet (see Appendix for details about the model configuration). The experiment is reproducible using ESPnet.

As the end-to-end system models are based on word pieces, we adopted CER and WER as evaluation metrics; together they demonstrate system performance at different levels of granularity. But because the HMM-based systems decode with a word-based lexicon, we use only the WER metric for comparison with HMM. Both metrics reduce to a normalized token-level edit distance, as sketched below. To thoroughly examine the model, we conducted several comparative experiments, discussed in what follows.
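For reference, CER and WER are the Levenshtein distance between reference and hypothesis, normalized by reference length, computed over characters or words respectively. A minimal self-contained sketch of the metric itself (standard toolkits provide their own scoring):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance over token sequences (words or characters)."""
    d = list(range(len(hyp) + 1))      # DP row for the empty ref prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i           # prev holds the old d[j-1]
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,        # deletion
                      d[j - 1] + 1,    # insertion
                      prev + (r != h)) # substitution (0 if tokens match)
            prev, d[j] = d[j], cur
    return d[-1]

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

def cer(ref, hyp):
    r, h = ref.replace(" ", ""), hyp.replace(" ", "")
    return edit_distance(list(r), list(h)) / max(len(r), 1)
```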
(a) Comparison with HMM-based Methods: We first compared our end-to-end method with the Deep Neural Network-Hidden Markov Model (DNN-HMM) methods proposed in Mitra et al. (2016). That work configured Gammatone filterbank (GFB), articulation, and pitch features for the DNN-HMM model; a DNN-HMM model using Mel filterbanks (MFB) serves as its baseline. In recent unpublished work, Kwon and Kathol developed a state-of-the-art CNN-HMM-based ASR model for YMC based on the lattice-free Maximum Mutual Information (LF-MMI) approach, also known as the "chain model" (Povey et al., 2016) (see Appendix for details about the model configuration). The experimental data for the above HMM-based models is YMC-EXP(-CS), discussed in Section 2.4. For the comparison, our end-to-end model adopted the same partition to ensure fair comparability with their results.

Table 2 shows the comparison between the DNN-HMM systems and our end-to-end system on YMC-EXP(-CS). It indicates that, even without an external language lexicon, the end-to-end system significantly outperforms both the DNN-HMM baseline models and the CNN-HMM-based state-of-the-art model.

Model             Feature               WER
DNN-HMM           MFB                   36.9
DNN-HMM           GFB + Articu. + Pitch 31.1
CNN-HMM (Chain)   MFCC                  19.1
E2E-Conformer     MFB + Pitch           –

Table 2: Comparison between HMM-based models and the end-to-end conformer (E2E-Conformer) model on YMC-EXP(-CS), a subset of YMC-EXP without "code-switching."

In Section 2.3 (b), we note that "code-switching" is invariably present in EL speech (e.g., YMC). Thus, ASR models built on "code-switching"-free corpora (like YMC-EXP(-CS)) are not practical for real-world usage. However, a language lexicon is available only for the YMC-EXP(-CS) corpus, so we cannot conduct HMM-based experiments with either the YMC-EXP or YMC-NT corpora.

Model             CER (dev/test)   WER (dev/test)
E2E-RNN           9.2/9.3          19.1/19.2
E2E-Transformer   7.8/7.9          16.3/16.7
E2E-Conformer     –/–              –/–

Table 3: End-to-end ASR results on YMC-EXP (corpus with "code-switching").

(b) Comparison with Different End-to-End ASR Architectures:
We also conducted experiments comparing models with different encoders and decoders on the YMC-EXP corpus. For the Recurrent Neural Network-based (E2E-RNN) model, we followed the best hyper-parameter configuration discussed in Zeyer et al. (2018). For the Transformer-based (E2E-Transformer) model, the same configuration as in Karita et al. (2019b) was adopted. Both models shared the same data preparation process as the E2E-Conformer model.

Table 3 compares the different end-to-end ASR architectures on the YMC-EXP corpus. The E2E-Conformer obtained the best results, with significant WER improvements over the E2E-RNN and E2E-Transformer models. The E2E-Conformer's WER on YMC-EXP(-CS) is slightly lower than that obtained on the whole YMC-EXP corpus, despite the significantly smaller training set in YMC-EXP(-CS), the subset of YMC-EXP from which all phrases containing a Spanish-origin word have been removed. Since the subset excludes Spanish words, "code-switching" may well be a problem to consider in ASR for endangered languages such as YM.
(c) Comparison with Different Transcription Levels:
In addition to comparing model architectures, we compared the impact of transcription level on the ASR model. E2E-Conformer models with the same configuration were trained on both the surface and the underlying transcription forms, which are discussed in Section 2.3. We also trained separate RNN language models for fusion and unigram language models to extract word pieces for each transcription level (a word-piece training sketch follows at the end of this subsection).

Table 4 shows the E2E-Conformer results for both the underlying and surface transcription levels. As introduced in Section 2.3, the surface form collapses several linguistic and phonological distinctions relative to the underlying practical form. The results indicate that the end-to-end system is able to infer those morphological and phonological processes automatically and maintain a consistently low error rate.

Transcription Level   CER (dev/test)   WER (dev/test)
Underlying            –/–              –/–
Surface               8.0/–            –/7.7

Table 4: E2E-Conformer results for two transcription levels (Underlying represents morphological divisions and underlying phonemes before the application of phonological rules; Surface reflects spoken forms and lacks morphological parsing).
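The word-piece extraction mentioned above can be reproduced with the SentencePiece toolkit (Kudo and Richardson, 2018). A minimal sketch, assuming one transcription per line in a text file; the file names are hypothetical placeholders:

```python
import sentencepiece as spm

# Train a 150-piece unigram model on one transcription level
# (underlying or surface), matching the appendix's target size.
spm.SentencePieceTrainer.train(
    input="train_underlying.txt",   # one transcription per line
    model_prefix="ymc_unigram150",
    vocab_size=150,
    model_type="unigram",
)

# Segment text into word pieces with the trained model.
sp = spm.SentencePieceProcessor(model_file="ymc_unigram150.model")
pieces = sp.encode("an example transcription", out_type=str)  # placeholder text
```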
(d) Comparison with Different Corpus Sizes: As introduced in Section 1, most ELs are considered low-resource for ASR purposes. To measure the impact of resource availability on ASR accuracy, we trained the E2E-Conformer model on 10-, 20-, and 50-hour subsets of YMC-EXP. The results demonstrate the model's performance across different resource sizes.

Corpus        CER (dev/test)   WER (dev/test)
10h           19.4/19.5        39.1/39.2
20h           12.6/12.7        26.2/26.2
50h           8.6/8.7          18.0/18.0
Whole (92h)   –/–              –/–

Table 5: E2E-Conformer results for different corpus sizes.

Table 5 shows the E2E-Conformer performance for different amounts of training data; it demonstrates how the model consumes data. As corpus size is incrementally increased, WER decreases significantly, and it is apparent that the model still has the capacity to improve with more data. The results also indicate that our system achieves reasonable performance from 50 hours of data, an important guideline when collecting a new EL database.
(e) The Framework's Generalizability:
To test the end-to-end ASR systems' generalization ability, we conducted the same end-to-end training and test procedures on another endangered language: Highland Puebla Nahuatl (high1278; azz). This corpus is also open access, under the same CC license, at http://openslr.org/92. It comprises 954 recordings totaling 185 hours 22 minutes (the recordings are almost all two-channel, with two speakers in natural conversation), including 120 hours of transcribed data in ELAN; the remaining 65 hours, still only in Transcriber, were not used in ASR training.

Model             CER (dev/test)   WER (dev/test)
E2E-RNN           10.3/9.9         26.8/25.4
E2E-Transformer   –/9.1            23.7/–
E2E-Conformer     9.9/–            –/–

Table 6: End-to-end ASR results on another EL: Highland Puebla Nahuatl.

Corpus         CER (dev/test)   WER (dev/test)
10h            18.3/17.5        44.7/43.3
20h            14.2/12.9        34.8/33.3
50h            11.0/10.2        27.0/24.9
Whole (120h)   –/–              –/–

Table 7: E2E-Conformer results on Highland Puebla Nahuatl for different corpus sizes.

Table 6 shows the performance of three different end-to-end ASR architectures on Highland Puebla Nahuatl. For this language, the E2E-Conformer again performs better than the other models. Table 7 shows the E2E-Conformer performance for different amounts of training data for Highland Puebla Nahuatl. We again observe that 50 hours is a reasonable corpus size for an EL, similar to the experiments in Table 5. These experiments indicate that end-to-end ASR systems can be applied consistently across ELs.
Error Types        Novice   ASR
Enclitics (=)      –        –
Glottal Stop (')   341      –
Parenthesis        1607     –
Tone               4144     –
Stem-Nasal (n)     –        –

Table 8: Character error-type distribution for the novice and the ASR system (by number of errors).
4 Novice Transcription Correction

Finally, this paper presents novice transcription correction (NTC) as a task for EL documentation. That is, in this experiment we explore not only the possibility of using ASR to enhance the accuracy of a YM novice transcription, but of combining novice transcription and ASR to achieve accurate results that surpass those of either component. Below, we first analyze the patterns manifested in novice transcriptions. Next, we introduce two baselines that fuse ASR hypotheses with the novice transcription for the NTC task.
4.1 Novice Transcription Analysis

As mentioned in Section 1, transcriber shortages have been a severe challenge for EL documentation. Before 2019, only the native-speaker linguist Rey Castillo García could accurately transcribe the segments and tones of YMC. To mitigate the YMC transcriber shortage, in 2019 Castillo began to train another speaker, Esteban Guadalupe Sierra. First, a computer course was designed to incrementally teach Guadalupe segmental and tonal phonology. In the next stage, he was given YMC-NT corpus recordings to transcribe. Compared to the paired expert transcription, the novice achieved a CER of 6.0% on clean-dev, defined in Table 1. However, it is not feasible to spend many months training speakers with no literacy skills to acquire the transcription proficiency achieved by Guadalupe in our project. Moreover, even with a 6.0% CER, there are still enough errors to require significant annotation/correction by the expert, Castillo. The state-of-the-art ASR system (e.g., the E2E-Conformer shown in Table 3) gets an 8.2% CER on the clean-dev set, more errors than the novice. So for YMC, ASR alone is still not a good enough substitute for a proficient novice.
[Figure 1: Novice-ASR Fusion Process. Novice transcriptions and ASR hypotheses pass through alignment and rules at the word, syllable, and character levels to produce a hybrid transcription.]
As Amith and Castillo worked with the novice, they saw a repetition of error types, which they worked to correct by giving the novice exercises focused on these transcription shortcomings. The end-to-end ASR, however, demonstrated a different pattern of errors: for example, it developed a fair command of the rules for elided tones, which are marked by parentheses around the elided tones. Rather than over-specify the NTC correction algorithm, we first analyzed the error-type distribution on the clean-dev set of the YMC-NT corpus, as shown in Table 8 (a sketch of such an analysis follows below).
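An error-type tally of the kind behind Table 8 can be derived from an edit-distance alignment of paired transcriptions. The category tests below are crude proxies (any digit counts as a tone error, any 'n' confusion as a stem-nasal error); the tagging actually used for the paper's analysis may differ:

```python
def classify_char_errors(aligned_pairs):
    """Tally character-level error types from an aligned transcription pair.

    aligned_pairs: (ref_char, hyp_char) tuples from an edit-distance
    alignment, with None filling insertion/deletion slots. Categories
    mirror Table 8; tones are assumed to be written with digits, as in
    the practical orthography.
    """
    counts = {"enclitic (=)": 0, "glottal stop (')": 0, "parenthesis": 0,
              "tone": 0, "stem-nasal (n)": 0, "other": 0}
    for ref_c, hyp_c in aligned_pairs:
        if ref_c == hyp_c:
            continue
        c = ref_c if ref_c is not None else hyp_c  # char involved in the error
        if c == "=":
            counts["enclitic (=)"] += 1
        elif c == "'":
            counts["glottal stop (')"] += 1
        elif c in "()":
            counts["parenthesis"] += 1
        elif c.isdigit():
            counts["tone"] += 1
        elif c == "n":          # crude proxy for stem-final nasal errors
            counts["stem-nasal (n)"] += 1
        else:
            counts["other"] += 1
    return counts
```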
4.2 Fusion Methods

A rapid comparison of the types of errors in each transcription (novice and ASR) demonstrated consistent patterns and led us to hypothesize that a fusion system might automatically correct many of these errors. Two baseline methods are examined for the fusion: a voting-based system (Fiscus, 1997) and a rule-based system.

The voting-based system follows the definition in Fiscus (1997), combining hypotheses from different ASR models with the novice transcription. The framework of the rule-based fusion is shown in Figure 1. The rules are defined over different linguistic units: words, syllables, and characters. They assume a hierarchical alignment between the novice transcription and the ASR hypotheses, and they are applied to the transcription from the word level down to the syllable and character levels (the overall cascade is sketched below). The rules were developed through continual evaluation of the novice's progress. They will thus be different, but equally discoverable, when the approach is applied to a new language; the general principle should carry over to other ELs, since novice trainees will learn certain transcription tasks more easily than others. Below we explain the rules for YMC.
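Before the individual rules, the cascade of Figure 1 can be sketched as follows. This is a toy rendering: the positional alignment stands in for the hierarchical edit-distance alignment described later in this subsection, the word and character rules implement only the cases fully specified below, and the syllable stage is omitted:

```python
from itertools import zip_longest

def align_pairs(xs, ys):
    """Toy positional alignment; the paper instead aligns units by
    (hierarchical) edit distance, sketched after Eq. (1) below."""
    return zip_longest(xs, ys, fillvalue="")

def apply_word_rules(nov_w, asr_w):
    """Word rule: a novice word with no tones (digits) and no Mixtec
    markers (-, =, ') is treated as Spanish and kept; a novice word with
    no ASR counterpart is also kept. Returns None to defer downward."""
    if nov_w and not any(c.isdigit() or c in "-='" for c in nov_w):
        return nov_w
    if asr_w == "":
        return nov_w
    return None

def apply_character_rules(nov_c, asr_c):
    """Character rule: trust the ASR for hyphen, equals sign,
    parentheses, and glottal stop; otherwise keep the novice character."""
    if asr_c in "-=()'":
        return asr_c
    return nov_c or asr_c

def fuse(novice_words, asr_words):
    """Word -> (syllable) -> character cascade over aligned units."""
    fused = []
    for nov_w, asr_w in align_pairs(novice_words, asr_words):
        word = apply_word_rules(nov_w, asr_w)
        if word is None:  # syllable rules would go here in a full system
            word = "".join(apply_character_rules(nc, ac)
                           for nc, ac in align_pairs(nov_w, asr_w))
        fused.append(word)
    return " ".join(fused)
```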
Word Rules: If a word in the novice transcription is Spanish (i.e., has no tones and none of the linguistic indications [-, =, '] that mark it as Mixtec), keep the novice transcription. If the novice has extra words not in the ASR hypothesis, keep those extra words.
Syllable Rules: If a novice syllable is tone-initial, use the corresponding ASR syllable. If the novice and the ASR have identical segments but different tones, use the ASR tones. When an ASR syllable has CVV or CV'V and its corresponding novice syllable has CV, use the ASR syllable (CVV or CV'V). (A CV syllable can occur in a monomoraic word, but the novice will often write a CV word when it should be CVV or CV'V; likewise, stem-final syllables can be CV, CVV, or CV'V, but the novice tends to write CV in these cases.) If the tone from either transcription system follows a consonant (except a stem-final n), use the other system's transcription.

Character Rules: If the ASR has a hyphen, equals sign, parenthesis, or glottal stop that is absent from the novice transcription, always trust the ASR and maintain the aforementioned symbols in the final transcription.

We apply edit distance (Wagner and Fischer, 1974) to find the alignment between the ASR model hypothesis {C_1, ..., C_n} and the novice transcription {C'_1, ..., C'_m}. The losses L_I, L_D, and L_S enter the dynamic program as the insertion, deletion, and substitution costs, respectively. In the naive setting, L_I and L_D are both set to 1, and L_S is set to 1 if C_i differs from C'_j and to 0 otherwise. This setting is computationally efficient, but it does not consider how the contents of C_i and C'_j mismatch. We therefore adopt a hierarchical dynamic alignment: the character alignment follows the naive setting, while the substitution loss L_S(C_i, C'_j) for syllable alignment is defined as the normalized character-level edit distance between C_i and C'_j:

L_S(C_i, C'_j) = D[C_i, C'_j] / |C_i|    (1)

where |C_i| is the length of the syllable C_i. Similarly, the L_S for word alignment is defined on top of the syllable alignment. (An implementation sketch of this alignment appears at the end of this subsection.)

The novice transcription, the E2E-Transformer model, and the E2E-Conformer model were considered as baselines for the NTC task. To evaluate the system under reduced training data, we also show results for the E2E-Conformer trained with a 50-hour subset. For the end-to-end models, we adopted the trained models from Section 3 with the same decoding setups. To test the effectiveness of the hierarchical dynamic alignment, we evaluated two fusion systems, Fusion1 and Fusion2: Fusion1 uses the naive edit-distance setting, while Fusion2 adopts the hierarchical dynamic alignment. Both fusion systems apply the rules defined in Section 4.2. Two configurations of the voting-based method were tested: "ROVER" combines three hypotheses (the E2E-Transformer, the E2E-Conformer, and the novice), while "ROVER-Fusion2" combines the Fusion2 system with those three.
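A minimal implementation of this hierarchical alignment cost, with Eq. (1) as the syllable-level substitution loss; the word-level normalization shown is one plausible reading of "defined based on syllable alignment":

```python
def align_cost(xs, ys, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Edit-distance dynamic program D[i][j] with pluggable losses.

    xs, ys are sequences of units (characters, syllables, or words);
    sub_cost(x, y) is the substitution loss L_S for one pair of units.
    """
    m, n = len(xs), len(ys)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * del_cost
    for j in range(1, n + 1):
        D[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + del_cost,      # deletion
                          D[i][j - 1] + ins_cost,      # insertion
                          D[i - 1][j - 1] + sub_cost(xs[i - 1], ys[j - 1]))
    return D[m][n]

def char_sub(a, b):
    """Naive 0/1 substitution loss used at the character level."""
    return 0.0 if a == b else 1.0

def syllable_sub(s1, s2):
    """Eq. (1): character-level edit distance normalized by |C_i|."""
    return align_cost(s1, s2, char_sub) / max(len(s1), 1)

def word_sub(w1, w2):
    """Word-level loss built on the syllable alignment; w1 and w2 are
    lists of syllables."""
    return align_cost(w1, w2, syllable_sub) / max(len(w1), 1)
```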
4.3 Results

Model                        Clean-Dev    Clean-Test   Noise-Test   Overall
                             CER / WER    CER / WER    CER / WER    CER / WER
A. Novice                    6.0 / 21.5   6.4 / 22.6   – / –        – / –
J. ROVER-Fusion2 (A+B+C+E)   – / –        – / –        – / –        – / –

Table 9: NTC results on YMC-NT (evaluated against the expert transcription in YMC-NT). Model D is trained with the 50-hour subset of YMC-EXP shown in Table 5.

As shown in Table 9, the voting-based and rule-based methods all significantly reduce the novice errors on clean speech. For the noise-test, however, the novice transcription is the most robust method. Overall, the ROVER system (model I) has the lower WER, while the ROVER-Fusion2 system (model J) reaches the lower CER. Model J significantly reduces specific error types, including tone errors (25%), enclitic errors (50%), and parentheses errors (87.5%). In addition, models D, F, and H indicate that the system can still reduce clean-environment novice errors using ASR models trained with only a 50-hour subset of the YMC-EXP corpus.

As discussed in Section 4, novice and ASR transcriptions manifest distinct patterns of error and can thus complement each other. Table 9 shows that our proposed rule-based and voting-based fusion methods can eliminate much of the error introduced by the novice transcriber, and transcriber-shortage problems can be mitigated with these fusion methods. We should note, however, that noisy recording conditions negatively affect a fusion approach, as ASR does poorly under such conditions (>23% CER); for practical purposes, the novice transcription alone is then the better choice.

5 Conclusion

This work presents an open-source endangered-language corpus of Yoloxóchitl Mixtec and a comparative, reproducible study of various approaches to end-to-end ASR. We demonstrate that end-to-end approaches are feasible and give comparable results to conventional HMM approaches, which require resources, such as language lexicons, that are unnecessary for end-to-end ASR. Additionally, we propose novice transcription correction as a potential task for ASR in EL documentation and examine two methods for this task. The first is a rule-based approach that uses hierarchical dynamic alignment and linguistic rules to perform novice-ASR hybridization. (Note that the rules were developed for YM specifics, so they cannot be applied to other languages directly; readers should view this as a case study.) The second is a voting-based method that combines hypotheses from the novice and the end-to-end ASR systems. Empirical studies on the YMC-NT corpus indicate that both methods significantly reduce the CER/WER of the novice transcription on clean speech.

The above discussion suggests that a useful approach to EL documentation using both human and computational (ASR) resources might focus on training each system (human and ASR) for particular transcription tasks. If we know from the start that ASR will be used to correct novice transcriptions in areas of difficulty, we can train an ASR system to maximize accuracy in exactly those areas that challenge novice learning.
References
Oliver Adams, Trevor Cohn, Graham Neubig, Hilaria Cruz, Steven Bird, and Alexis Michaud. 2018. Evaluating phonemic transcription of low-resource tonal languages for language documentation. In LREC 2018 (Language Resources and Evaluation Conference), pages 3356–3365.

Antonios Anastasopoulos and David Chiang. 2017. A case study on using speech-to-translation alignments for language documentation. In Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 170–178.

Claude Barras, Edouard Geoffrois, Zhibiao Wu, and Mark Liberman. 1998. Transcriber: a free tool for segmenting, labeling and transcribing speech. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC), pages 1373–1376.

Malgorzata Ćavar, Damir Ćavar, and Hilaria Cruz. 2016. Endangered language documentation: Bootstrapping a Chatino speech corpus, forced aligner, ASR. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), pages 4004–4011.

Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, et al. 2018. State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4774–4778.

Hilaria Cruz and Joseph Waring. 2019. Deploying technology to save endangered languages. arXiv preprint arXiv:1908.08971.

Amit Das and Mark Hasegawa-Johnson. 2016. An investigation on training deep neural networks using probabilistic transcriptions. In Interspeech, pages 3858–3862.

Jonathan G. Fiscus. 1997. A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER). In 1997 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 347–354.

Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. 2014. A pitch extraction algorithm tuned for automatic speech recognition. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2494–2498.

Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772.

Lenore A. Grenoble, Peter K. Austin, and Julia Sallabank. 2011. Handbook of endangered languages. Cambridge University Press.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. 2020. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.

Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, et al. 2020. Recent developments on ESPnet toolkit boosted by Conformer. arXiv preprint arXiv:2010.13956.

Vishwa Gupta and Gilles Boulianne. 2020a. Automatic transcription challenges for Inuktitut, a low-resource polysynthetic language. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2521–2527.

Vishwa Gupta and Gilles Boulianne. 2020b. Speech transcription challenges for resource constrained indigenous language Cree. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages 362–367.

Nils Hjortnaes, Niko Partanen, Michael Rießler, and Francis M. Tyers. 2020. Towards a speech recognizer for Komi, an endangered and low-resource Uralic language. In Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages, pages 31–37.

Robbie Jimerson and Emily Prud'hommeaux. 2018. ASR for documenting acutely under-resourced indigenous languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Robbie Jimerson, Kruthika Simha, Raymond Ptucha, and Emily Prud'hommeaux. 2018. Improving ASR output for endangered language documentation. In Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages, pages 187–191.

Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al. 2019a. A comparative study on transformer vs RNN in speech applications. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 449–456.

Shigeki Karita, Nelson Enrique Yalta Soplin, Shinji Watanabe, Marc Delcroix, Atsunori Ogawa, and Tomohiro Nakatani. 2019b. Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. Proceedings of Interspeech 2019, pages 1408–1412.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.

Christian Lehmann. 1999. Documentation of endangered languages: A priority task for linguistics. ASSIDUE Arbeitspapiere des Seminars für Sprachwissenschaft der Universität Erfurt, 1.

Kohei Matsuura, Sei Ueno, Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara. 2020. Speech corpus of Ainu folklore and end-to-end speech recognition for Ainu language. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2622–2628.

Alexis Michaud, Oliver Adams, Trevor Cohn, Graham Neubig, and Séverine Guillaume. 2018. Integrating automatic transcription into the language documentation workflow: Experiments with Na data and the Persephone toolkit. Language Documentation & Conservation, 12:393–429.

Alexis Michaud, Eric Castelli, et al. 2014. Towards the automatic processing of Yongning Na (Sino-Tibetan): developing a 'light' acoustic model of the target language and testing 'heavyweight' models from five national languages. In 4th International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU), pages 153–160.

Vikramjit Mitra, Andreas Kathol, Jonathan D. Amith, and Rey Castillo García. 2016. Automatic speech transcription for low-resource languages: the case of Yoloxóchitl Mixtec (Mexico). In Proceedings of Interspeech 2016, pages 3076–3080.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. Proc. Interspeech 2019, pages 2613–2617.

Ngoc-Quan Pham, Thai-Son Nguyen, Jan Niehues, Markus Müller, and Alex Waibel. 2019. Very deep self-attention networks for end-to-end speech recognition. Proceedings of Interspeech 2019, pages 66–70.

Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. 2016. Purely sequence-trained neural networks for ASR based on lattice-free MMI. In Interspeech, pages 2751–2755.

Jorge A. Suárez. 1983. The Mesoamerican Indian Languages. Cambridge University Press.

Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 4278–4284.

Bao Thai, Robert Jimerson, Raymond Ptucha, and Emily Prud'hommeaux. 2020. Fully convolutional ASR for less-resourced endangered languages. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages 126–130.

Robert A. Wagner and Michael J. Fischer. 1974. The string-to-string correction problem. Journal of the ACM (JACM), 21(1):168–173.

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al. 2018. ESPnet: End-to-end speech processing toolkit. Proceedings of Interspeech 2018, pages 2207–2211.

Peter Wittenburg, Hennie Brugman, Albert Russel, Alex Klassmann, and Han Sloetjes. 2006. ELAN: a professional framework for multimodality research. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), pages 1556–1559.

Alexander Zahrer, Andrej Zgank, and Barbara Schuppler. 2020. Towards building an automatic transcription system for language documentation: Experiences from Muyu. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2893–2900.

Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney. 2018. Improved training of end-to-end attention models for speech recognition. Proceedings of Interspeech 2018, pages 7–11.
A Appendices
Experimental Settings for End-to-End ASR:
All the end-to-end ASR systems adopted the hybrid CTC/attention architecture integrated with an RNN language model. The best model was selected on the basis of performance on the development set. The input acoustic features were 83-dimensional log-Mel filterbank features with pitch features (Ghahremani et al., 2014). The window length and frame shift were set to 25 ms and 10 ms. SpecAugment was adopted for data augmentation (Park et al., 2019). The prediction targets were 150 word pieces trained using unigram language modeling (Kudo and Richardson, 2018), for both the surface and underlying forms. All the end-to-end models are fused with RNN language models. The CTC weight for hybrid CTC/attention was set to 0.3, and the decoding beam size was 20 (the combined scoring rule is sketched below). Training and testing are based on PyTorch.
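For reference, hybrid CTC/attention decoding with shallow LM fusion scores each beam hypothesis by combining three log-probabilities. A minimal sketch, assuming the standard formulation; the LM weight shown is an illustrative placeholder, not a value stated in this paper:

```python
def hybrid_score(att_logp: float, ctc_logp: float, lm_logp: float,
                 ctc_weight: float = 0.3, lm_weight: float = 0.3) -> float:
    """Beam-search score of one hypothesis under hybrid CTC/attention
    decoding with RNN-LM shallow fusion. ctc_weight=0.3 follows the
    setting above; lm_weight here is hypothetical."""
    return ((1.0 - ctc_weight) * att_logp
            + ctc_weight * ctc_logp
            + lm_weight * lm_logp)
```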
E2E-Conformer Configuration: The E2E-Conformer used 12 encoder blocks and 6 decoder blocks. All blocks adopted a 2048-dimensional feed-forward layer and four-head multi-head attention with 256 dimensions. The kernel size in the conformer block was set to 15. For training, the batch size was set to 32. The Adam optimizer with a 1.0 learning rate and a Noam scheduler with 25000 warmup steps were used. We trained for a maximum of 50 epochs. The parameter count is 43M.
E2E-RNN Configuration: The E2E-RNN used 3 encoder blocks and 2 decoder blocks. All blocks adopt 1024 hidden units, and location-based attention with 1024 dimensions was used. Adadelta was chosen as the optimizer, and we trained for a maximum of 15 epochs. The parameter count is 108M.

E2E-Transformer Configuration: The E2E-Transformer used 12 encoder blocks and 6 decoder blocks. All blocks adopted a 2048-dimensional feed-forward layer and four-head multi-head attention with 256 dimensions. The Adam optimizer with a 1.0 learning rate and a Noam scheduler with 25000 warmup steps were used in training. We trained for a maximum of 100 epochs. The parameter count is 27M.
Experimental Settings for HMM-based ASR: