BembaSpeech: A Speech Recognition Corpus for the Bemba Language
Claytone Sikasote∗
Department of Computer Science, University of Zambia, Zambia
[email protected]
Antonios Anastasopoulos
Department of Computer Science, George Mason University, USA
[email protected]
Abstract
Ili ipepala lilelanda pa mashiwi mu Cibemba na ifyebo fyalembwa ifyabikwa pamo nga mashiwi yakopwa elyo na yalembwa ukupanga iileitwa BembaSpeech. Iikwete amashiwi ayengabelengwa ukufika ku maawala amakumi yabili na yane mu lulimi lwa Cibemba, ululandwa na impendwa ya bantu ba mu Zambia ukufika cipendo ca 30%. Pakufwaisha ukumona ubukankala bwa kubomfiwa mu mukupanga ifya mibombele ya ASR mu Cibemba, tupanga imibombele ya ASR iya mu Cibemba ukufuma pa ntendekelo ukufika na pampela, kubomfya elyo na ukuwaminisha icilangililo ca mibomfeshe ya mashiwi na ifyebo ifyabikwa pamo mu Cisungu icitwa DeepSpeech na ukupangako iciputulwa ca mashiwi na ifyebo fyalembwa mu Cibemba (BembaSpeech corpus). Imibombele yesu iyakunuma ilangisha icipimo ca kupusa nelyo ukulufyanya kwa mashiwi ukwa 54.78%. Ifyakufumamo filangisha ukuti ifyalembwa kuti fyabomfiwa ukupanga imibombele ya ASR mu Cibemba.

We present a preprocessed, ready-to-use automatic speech recognition corpus, BembaSpeech, consisting of over 24 hours of read speech in the Bemba language, a written but low-resourced language spoken by over 30% of the population of Zambia. To assess its usefulness for training and testing ASR systems for Bemba, we train an end-to-end Bemba ASR system by fine-tuning a pre-trained DeepSpeech English model on the training portion of the BembaSpeech corpus. Our best model achieves a word error rate (WER) of 54.78%. The results show that the corpus can be used for building ASR systems for Bemba.

∗ Work done while at the African Masters in Machine Intelligence (AMMI). The corpus and models will be publicly released at https://github.com/csikasote/BembaSpeech.

Introduction

Speech-to-Text, also known as Automatic Speech Recognition (ASR) or simply Speech Recognition (SR), is the task of recognising and transcribing spoken utterances into text. In recent years, there has been tremendous growth in the popularity of speech-enabled applications.
This can be attributed to their usability and integration across a wide range of application domains, such as voice-controlled systems. However, building well-performing ASR systems typically requires massive amounts of transcribed speech, as well as large text corpora. This is generally not an issue for well-resourced languages such as English and Chinese, where ASR applications have been successfully built with remarkable results (Amodei et al., 2016, et alia).

Unfortunately, this is not the case for Africa and its over 2000 languages (Heine and Nurse, 2000). The prevalence of speech recognition applications for African languages is very low. This can at least partially be attributed to the lack or unavailability of linguistic resources (speech and text) for most African languages (Martinus and Abbott, 2019). This is particularly the case for Zambian languages: there exist no general speech or textual datasets curated for building natural language processing systems, including ASR systems.

In this paper we present a speech corpus, BembaSpeech, consisting of over 24 hours of read speech in Bemba, a written but under-resourced language spoken by over 30% of the population of Zambia. We also present an end-to-end Bemba speech recognition model obtained by fine-tuning a pre-trained DeepSpeech English model on the BembaSpeech corpus. To our knowledge, this is the first work carried out towards building ASR systems for any Zambian language.

The rest of the paper is organized as follows. In Section 2, we summarise similar work in ASR for under-resourced languages, with a focus on African languages. In Section 3 we provide details on the Bemba language. In Section 4, we outline the development process of the BembaSpeech corpus, and in Section 5 we provide details of our experiments towards building a Bemba ASR model. Last, Section 6 discusses our experimental results, before drawing conclusions and sketching out future research directions.
Related Work

In the recent past, despite the challenge of limited availability of linguistic resources, several works have been carried out to improve the prevalence of ASR applications in Africa. For example, Gauthier et al. (2016c) collected speech data and developed ASR systems for four languages: Wolof, Hausa, Swahili and Amharic. In South Africa, researchers (de Wet and Botha, 1999; Badenhorst et al., 2011; Henselmans et al., 2013; Van Heerden et al., 2016; De Wet et al., 2017) have investigated and built speech recognition systems for South African languages. Other languages that have seen development of linguistic resources for ASR applications include: Fongbe (Laleye et al., 2016) of Benin; Swahili (Gelas et al., 2012), predominantly spoken in East Africa; Amharic, Tigrigna, Oromo and Wolaytta of Ethiopia (Abate et al., 2005; Tachbelie and Besacier, 2014; Abate et al., 2020; Woldemariam, 2020); Hausa (Schlippe et al., 2012) of Nigeria; and Somali (Abdillahi et al., 2006) of Somalia. In all the aforementioned works, Hidden Markov Models (Juang and Rabiner, 1991) and traditional statistical language models are adopted to develop ASR systems, typically using the Kaldi (Povey et al., 2011) or HTK (Young et al., 2009) frameworks. The disadvantage of such approaches is that they typically require separate training for all their pipeline components, including the acoustic model, phonetic dictionary, and language model.

Recently, end-to-end deep neural network approaches have successfully been applied to speech recognition tasks (Amodei et al., 2016; Pratap et al., 2018, et alia), achieving remarkable results and outperforming traditional HMM-GMM approaches. Such methods require only a speech dataset with speech utterances and their transcriptions for training. In this work, we use an open-source end-to-end neural network system, Mozilla's DeepSpeech (Hannun et al., 2014), to develop a Bemba ASR model using our BembaSpeech corpus.
The Bemba Language

The language we focus on is Bemba (also referred to as ChiBemba or Icibemba), a Bantu language principally spoken in Zambia, in the Northern, Copperbelt, and Luapula Provinces. It is also spoken in southern parts of the Democratic Republic of Congo and in Tanzania. It is estimated to be spoken by over 30% of the population of Zambia (Kula and Marten, 2008; Kapambwe, 2018).

Bemba has 5 vowels and 19 consonants (Spitulnik and Kashoki, 2001). Its syllable structure is characteristically open and is of four main types: V, CV, NCV, and NCGV, where V = vowel (long or short), C = consonant, N = nasal, and G = glide (w or y) (Spitulnik and Kashoki, 2014). The writing system is based on the Latin script (Mwansa, 2017). Similar to other Bantu languages, Bemba is described as having a very elaborate noun class system, which involves pluralization patterns, agreement marking, and patterns of pronominal reference. There are 20 different classes in Bemba: 15 basic classes, 2 subclasses, and 3 locative classes (Spitulnik and Kashoki, 2001, 2014). Each noun class is indicated by a class prefix (typically VCV-, VC-, or V-) and the co-occurring agreement markers on adjectives, numerals and verbs.

In terms of tone, Bemba is considered a tone language, with two basic tones, high (H) and low (L) (Kula and Hamann, 2016). A high tone is marked with an acute accent (e.g. á) while a low tone is typically unmarked. As with most other Bantu languages, tone can be phonemic and is an important functional marker in Bemba, signaling semantic distinctions between words (Spitulnik and Kashoki, 2001, 2014).
The BembaSpeech Corpus

Description
The corpus has a size of 2.8 gigabytes, with a total duration of approximately 24 hours of speech data. We provide fixed train, development, and test splits to facilitate future experimentation. The subsets have no speaker overlap among them. Table 1 summarises the characteristics of the corpus and its subsets. All audio files are encoded in the Waveform Audio File Format (WAV), with a single track (mono) and a sample rate of 16 kHz.
Subset        Duration        Utterances  Speakers  Male  Female
Whole corpus:
Train         20 hrs              11,906         8     5       3
Dev           2 hrs, 30 min        1,555         7     3       4
Test          2 hrs                  977         2     1       1
Total         24 hrs, 30 min      14,438        17     9       8
Used in our experiments:
Train         14 hrs, 20 min      10,200         8     5       3
Dev           2 hrs                1,437         7     3       4
Test          1 hr, 18 min           756         2     1       1
Subset total  17 hrs, 38 min      12,393        17     9       8
Table 1: General characteristics of the BembaSpeech ASR corpus. We use a subset (audio files shorter than 10 seconds) for our baseline experiments.
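The two properties of the splits described above — speaker-disjoint subsets, and the 10-second duration cap applied before training — can be checked mechanically. A minimal sketch, assuming a hypothetical in-memory metadata list of (split, speaker_id, duration_seconds) tuples rather than the corpus's actual file layout:

```python
from collections import defaultdict

def check_splits(metadata, max_seconds=10.0):
    """metadata: iterable of (split, speaker_id, duration_s) tuples.
    Returns (speaker_disjoint, kept), where `kept` holds only the
    utterances short enough for training (duration <= max_seconds)."""
    speakers = defaultdict(set)
    kept = []
    for split, speaker, duration in metadata:
        speakers[split].add(speaker)
        if duration <= max_seconds:
            kept.append((split, speaker, duration))
    splits = list(speakers)
    disjoint = all(
        speakers[a].isdisjoint(speakers[b])
        for i, a in enumerate(splits) for b in splits[i + 1:]
    )
    return disjoint, kept

meta = [("train", "spk01", 4.2), ("train", "spk02", 12.0),
        ("dev", "spk03", 3.1), ("test", "spk04", 9.9)]
ok, usable = check_splits(meta)
print(ok, len(usable))  # True 3
```

The tuple layout and speaker IDs are illustrative; the duration filter reproduces the Table 1 resizing from 14,438 to 12,393 utterances when applied to the real metadata.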
Data collection
To build the BembaSpeech corpus we used the Lig-Aikuma app (Gauthier et al., 2016c) for recording speech. Speakers used the elicitation mode of the software to record audio from text scripts tokenized at the sentence level. Lig-Aikuma has been used by other researchers for similar work (Blachon et al., 2016; Gauthier et al., 2016a,b).
Speakers
The speakers involved in the BembaSpeech recording were students of Computer Science in the School of Natural Sciences at the University of Zambia. The corpus consists of 14,438 audio files recorded by 17 speakers, 9 male and 8 female. Based on the information extracted from the metadata supplied by the speakers, their ages range between 22 and 28 years, and all of them identified as black. All the speakers were selected based on their fluency in speaking and reading Bemba, and they are not necessarily native speakers of the language: there are 14 native Bemba speakers, 1 Lozi, 1 Lunda and 1 Nsenga. It is also important to note that the recordings in this corpus were conducted outside controlled conditions. Speakers recorded at their own convenience and have varied accents. Therefore, some utterances are expected to have some background noise. We consider this "more of a feature than a bug" for our corpus: it will allow us to train and, importantly, evaluate ASR systems that match real-world conditions, rather than a quiet studio setting.
Preprocessing
The corpus was preprocessed and validated to ensure data accuracy by eliminating all corrupted audio files and, most importantly, ensuring that all utterances matched their transcripts. All numbers, dates and times in the text were replaced with their text equivalents according to the utterances. We also followed the LibriSpeech (Panayotov et al., 2015) file organization and nomenclature by grouping all the audio files by speaker, using the speaker ID number. In addition, we renamed all the audio files by prepending the speaker ID number to the utterance ID numbers.
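The LibriSpeech-style grouping and renaming can be expressed as a small pure function. The zero-padding widths and the underscore separator below are our assumptions for illustration, not the corpus's documented naming scheme:

```python
def libri_style_name(speaker_id: int, utterance_id: int) -> str:
    """Group files per speaker and prepend the speaker ID to the
    utterance ID, in the spirit of the LibriSpeech layout
    (speaker_dir/speakerID_utteranceID.wav)."""
    return f"{speaker_id:02d}/{speaker_id:02d}_{utterance_id:04d}.wav"

print(libri_style_name(7, 123))  # 07/07_0123.wav
```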
Text Sources
The phrases and sentences recorded were extracted from diverse sources in the Bemba language, mainly Bemba literature. Table 2 summarises the sources of the text contained in BembaSpeech. The length of the phrases varies from a single word to as many as 20 words.
Availability
The corpus is made available to the research community under the Creative Commons BY-NC-ND 4.0 license and can be found at our GitHub project repository.
Experiments

In this section, we describe the experiments we ran to ascertain the usefulness of the speech corpus for ASR applications. Code to reproduce our experiments is available at https://github.com/csikasote/BembaASR.

ID   Source Name   Size (%)
Table 2: Sources of the text contained in the BembaSpeech corpus. The Bemba literature includes publicly available books, magazines and training materials written in Bemba. Other online resources include various websites with Bemba content.
In our experiments, we use Mozilla's DeepSpeech, an open-source implementation of a variation of Baidu's first DeepSpeech paper (Hannun et al., 2014). This architecture is an end-to-end sequence-to-sequence model trained via stochastic gradient descent (Bottou, 2012) with the Connectionist Temporal Classification (Graves et al., 2006, CTC) loss function. The model is six layers deep: three fully connected layers, followed by a unidirectional LSTM (Hochreiter and Schmidhuber, 1997) layer, followed by two more fully connected layers. All hidden layers have a dimensionality of 2048 and a clipped ReLU (Nair and Hinton, 2010) activation. The output layer has as many dimensions as there are characters in the alphabet of the target language (including desired punctuation and the blank symbol used for CTC). The input layer accepts a vector of 19 spliced frames (9 past frames, 1 present frame and 9 future frames) with 26 MFCC features each. We use the DeepSpeech v0.8.2 release for all our experiments.

We preprocessed the data to conform to the expectations of the DeepSpeech input pipeline. We converted all transcriptions to lower case. Since DeepSpeech only accepts audio files not exceeding 10 seconds, we considered only audio files within that duration for training. This resized the corpus, as shown in Table 1. We also generated an alphabet of the characters and symbols which appear in the text, the length of which determines the size of the output layer of the DeepSpeech model. We note that, since Bemba uses the Latin alphabet, our alphabet was the same as that of the pretrained DeepSpeech English model.

https://github.com/mozilla/DeepSpeech/tree/v0.8.2
Language Model   Sentences   Unique Tokens   Total Tokens
LM1                 13,461             27K           123K
LM2                403,452            189K           5.8M
Table 3: The token counts for the two sets of text sources used to create the language models.
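Counts like those in Table 3 are straightforward to compute. A sketch over an in-memory list of sentences; whitespace tokenization is our assumption, as the paper does not specify its tokenizer:

```python
def lm_stats(sentences):
    """Return (num_sentences, unique_tokens, total_tokens) for a
    list of already-normalized sentences, using whitespace tokens."""
    tokens = [tok for s in sentences for tok in s.split()]
    return len(sentences), len(set(tokens)), len(tokens)

corpus = ["ili ipepala lilelanda", "ili ipepala", "bembaspeech"]
print(lm_stats(corpus))  # (3, 4, 6)
```

Run over the train and development transcripts this gives the LM1 row; adding the JW300 Bemba sentences gives the LM2 row.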
Similar to Hjortnaes et al. (2020) and Meyer (2020), we trained DeepSpeech from scratch using the default parameters on the BembaSpeech dataset, providing a baseline model for our experiments. In search of a better-performing model, we also experimented with cross-lingual transfer learning. We achieved this by fine-tuning a well-performing pre-trained DeepSpeech English model on our Bemba dataset, using a learning rate of 0.00005, dropout of 0.4, and 50 training epochs with early stopping.
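The DeepSpeech input described above — 19 spliced frames of 26 MFCCs each, i.e. a 494-dimensional vector per time step — can be sketched in a few lines of numpy. The edge-padding behaviour at utterance boundaries is our assumption, not something the paper specifies:

```python
import numpy as np

def splice_frames(mfcc, context=9):
    """Stack each frame with `context` past and `context` future
    frames (boundary frames padded by repetition), yielding
    (2*context + 1) * n_features values per time step."""
    num_frames, _ = mfcc.shape
    padded = np.pad(mfcc, ((context, context), (0, 0)), mode="edge")
    window = 2 * context + 1
    return np.stack(
        [padded[t:t + window].ravel() for t in range(num_frames)]
    )

feats = np.random.randn(100, 26)  # e.g. 100 frames of 26 MFCCs
spliced = splice_frames(feats)
print(spliced.shape)  # (100, 494)
```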
Similar to the original DeepSpeech approach presented by Hannun et al. (2014), we considered adding a language model to the acoustic model to improve performance. In order to identify the language model that most improved performance, we evaluated two sets of language models, each consisting of a 3-gram, a 4-gram and a 5-gram model. The first set of language models, denoted LM1, was generated from text sourced from the train and development transcripts. The second set, denoted LM2, was sourced from a combination of text from the train and development transcripts and additional Bemba text from the JW300 dataset (Agic and Vulic, 2020). In Table 3 we give the token counts for LM1 and LM2. All the language models were generated using the KenLM (Heafield, 2011) language model library. We used the DeepSpeech native library to create the trie-based models with default parameter values. The same speech recognition model obtained from Section 5.4 was used, changing only the language model.

With the exception of the batch size: instead of using the default batch size of 1 for train, dev and test, we used 64, 32 and 32, respectively, for all our experiments.
Model        WER (%)   CER (%)
BL              1.00     85.67
FT             71.21     16.68
FT + LM1-3     54.79     18.54
FT + LM1-4     54.80     18.08
FT + LM1-5     54.78     17.05
FT + LM2-3     55.65     19.69
FT + LM2-4     55.84     20.49
FT + LM2-5     55.75     19.99
Table 4: Results of our experiments. The best results were obtained by fine-tuning a pretrained model and combining it with a 5-gram LM generated only from the transcripts. In the table, BL denotes the baseline model and FT the fine-tuned model.
Results and Discussion

Table 4 summarises the results obtained from our experiments. The best-performing model was FT + LM1-5, obtained by fine-tuning DeepSpeech and decoding with a 5-gram language model generated from text sourced from the transcripts. The model achieved a word error rate (WER) of 54.78% and a character error rate (CER) of 17.05%.

The results also show the impact of the language model on improving the performance of the Bemba ASR model. By including the language model we were able to improve model performance by a significant margin, from 71.21% to 54.78% WER. Interestingly, no significant change in performance was observed from the inclusion of the additional 389,991 sentences from the JW300 Bemba data.
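The WER and CER figures above are normalized edit distances, computed at the word and character level respectively. A minimal reference implementation:

```python
def edit_distance(ref, hyp):
    """Classic dynamic-programming Levenshtein distance over
    two sequences (lists of words, or strings of characters)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def wer(ref, hyp):
    """Word error rate: word-level edits / reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(ref, hyp) / len(ref)

print(round(wer("twatotela sana mukwai", "twatotela saana mukwai"), 2))  # 0.33
```

The example sentence is illustrative only; in practice the DeepSpeech evaluation script reports these metrics directly.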
Conclusion

In this paper, we presented an ASR corpus for the Bemba language, BembaSpeech. We also demonstrated the usefulness of the corpus by building an end-to-end Bemba ASR model, obtained by fine-tuning a well-performing DeepSpeech English model on a 17.5-hour subset of BembaSpeech.

In the future, there are many things we can do to improve the results of our model. We are interested in tuning the models (both acoustic and language) by expanding the parameter search. We are also interested in further improving our corpus, both in size and in the number of speakers involved. In addition, it would be interesting to compare these results with other frameworks such as Facebook's wav2letter++ (Pratap et al., 2018) and PyTorch-Kaldi (Ravanelli et al., 2019).

Lastly, we plan to (a) collect even more data in Bemba, (b) collect data in the different Bemba varieties spoken throughout Zambia, and (c) collect data in other Zambian languages.
Acknowledgments
We would like to express our gratitude to all the speakers who were involved in the creation of our corpus. This work would not have been successful without their time and effort. We also want to thank Eunice Mukonde-Mulenga for her help with the Bemba translation of the abstract. Twatotela saana. Thank you!
References
Solomon Teferra Abate, Wolfgang Menzel, and Bairu Tafila. 2005. An Amharic speech corpus for large vocabulary continuous speech recognition.

Solomon Teferra Abate, Martha Yifiru Tachbelie, Michael Melese, Hafte Abera, Tewodros Abebe, Wondwossen Mulugeta, Yaregal Assabie, Million Meshesha, Solomon Atinafu, and Binyam Ephrem. 2020. Large vocabulary read speech corpora for four Ethiopian languages: Amharic, Tigrigna, Oromo and Wolaytta. In LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings.

Nimaan Abdillahi, Nocera Pascal, and Bonastre Jean-François. 2006. Towards automatic transcription of Somali language. In Proceedings of the 5th International Conference on Language Resources and Evaluation, LREC 2006.

Željko Agić and Ivan Vulić. 2020. JW300: A wide-coverage parallel corpus for low-resource languages. In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference.

Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, Jie Chen, Jingdong Chen, Zhijie Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Ke Ding, Niandong Du, Erich Elsen, Jesse Engel, Weiwei Fang, Linxi Fan, Christopher Fougner, Liang Gao, Caixia Gong, Awni Hannun, Tony Han, Lappi Vaino Johannes, Bing Jiang, Cai Ju, Billy Jun, Patrick Legresley, Libby Lin, Junjie Liu, Yang Liu, Weigao Li, Xiangang Li, Dongpeng Ma, Sharan Narang, Andrew Ng, Sherjil Ozair, Yiping Peng, Ryan Prenger, Sheng Qian, Zongfeng Quan, Jonathan Raiman, Vinay Rao, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Kavya Srinet, Anuroop Sriram, Haiyuan Tang, Liliang Tang, Chong Wang, Jidong Wang, Kaifu Wang, Yi Wang, Zhijian Wang, Zhiqian Wang, Shuang Wu, Likai Wei, Bo Xiao, Wen Xie, Yan Xie, Dani Yogatama, Bin Yuan, Jun Zhan, and Zhenyao Zhu. 2016. Deep Speech 2: End-to-end speech recognition in English and Mandarin.

Jaco Badenhorst, Charl van Heerden, Marelie Davel, and Etienne Barnard. 2011. Collecting and evaluating speech recognition corpora for 11 South African languages. Language Resources and Evaluation.

David Blachon, Elodie Gauthier, Laurent Besacier, Guy Noël Kouarata, Martine Adda-Decker, and Annie Rialland. 2016. Parallel speech collection for under-resourced language studies using the Lig-Aikuma mobile device app. In Procedia Computer Science.

Léon Bottou. 2012. Stochastic gradient descent tricks.

Febe De Wet, Neil Kleynhans, Dirk Van Compernolle, and Reza Sahraeian. 2017. Speech recognition for under-resourced languages: Data sharing in hidden Markov model systems. South African Journal of Science.

Elodie Gauthier, Laurent Besacier, and Sylvie Voisin. 2016a. Automatic speech recognition for African languages with vowel length contrast. In Procedia Computer Science.

Elodie Gauthier, Laurent Besacier, Sylvie Voisin, Michael Melese, and Uriel Pascal Elingui. 2016b. Collecting resources in sub-Saharan African languages for automatic speech recognition: A case study of Wolof. In Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016.

Elodie Gauthier, David Blachon, Laurent Besacier, Guy Noël Kouarata, Martine Adda-Decker, Annie Rialland, Gilles Adda, and Grégoire Bachman. 2016c. LIG-AIKUMA: A mobile app to collect parallel speech for under-resourced language studies. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH.

Hadrien Gelas, Laurent Besacier, and Francois Pellegrino. 2012. Developments of Swahili resources for an automatic speech recognition system. SLTU - Workshop on Spoken Language Technologies for Under-Resourced Languages.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ACM International Conference Proceeding Series.

Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. 2014. Deep Speech: Scaling up end-to-end speech recognition.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. Proceedings of the Sixth Workshop on Statistical Machine Translation.

B. Heine and D. Nurse. 2000. African Languages: An Introduction. Cambridge University Press.

Daan Henselmans, Thomas Niesler, and David Van Leeuwen. 2013. Baseline speech recognition of South African languages using Lwazi and AST. In Proceedings of the twenty-fourth annual symposium of the Pattern Recognition Association of South Africa (PRASA).

Nils Hjortnaes, Timofey Arkhangelskiy, Niko Partanen, Michael Rießler, and Francis Tyers. 2020. Improving the language model for low-resource ASR with online text corpora. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

B. H. Juang and L. R. Rabiner. 1991. Hidden Markov models for speech recognition. Technometrics.

Mazuba Kapambwe. 2018. An Introduction to Zambia's Bemba Tribe.

Nancy C. Kula and Silke Hamann. 2016. Intonation in Bemba.

Nancy C. Kula and Lutz Marten. 2008. Language and National Identity in Africa. Oxford University.

Frejus A. A. Laleye, Laurent Besacier, Eugene C. Ezin, and Cina Motamed. 2016. First automatic Fongbe continuous speech recognition system: Development of acoustic models and language models. In Proceedings of the 2016 Federated Conference on Computer Science and Information Systems, FedCSIS 2016.

Laura Martinus and Jade Z. Abbott. 2019. Benchmarking neural machine translation for Southern African languages.

Josh Meyer. 2020. Multi-Task and Transfer Learning in Low-Resource Speech Recognition. Ph.D. thesis, The University of Arizona.

Joseph M. Mwansa. 2017. Theoretical reflections on the teaching of literacy in Zambian Bantu languages. International Journal of Humanities, Social Sciences and Education, 4(10):116–129.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML 2010 - Proceedings, 27th International Conference on Machine Learning.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. 2011. The Kaldi speech recognition toolkit. In Proc. ASRU.

Vineel Pratap, Awni Hannun, Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Synnaeve, Vitaliy Liptchinsky, and Ronan Collobert. 2018. wav2letter++: The fastest open-source speech recognition system.

Mirco Ravanelli, Titouan Parcollet, and Yoshua Bengio. 2019. The PyTorch-Kaldi speech recognition toolkit. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings.

Tim Schlippe, Edy Guevara Komgang Djomgang, Ngoc Thang Vu, Sebastian Ochs, and Tanja Schultz. 2012. Hausa large vocabulary continuous speech recognition. In Proceedings of the Workshop on Spoken Languages Technologies for Under-Resourced Languages (SLTU).

Debra Spitulnik and Mubanga E. Kashoki. 2001. Facts About the World's Languages: An Encyclopedia of the World's Major Languages, Past and Present. H.W. Wilson, New York.

Vidali D. Spitulnik and Mubanga E. Kashoki. 2014. Bemba morphology.

Vidali D. Spitulnik and Mubanga E. Kashoki. 2014. Bemba phonology.

Martha Yifiru Tachbelie and Laurent Besacier. 2014. Using different acoustic, lexical and language modeling units for ASR of an under-resourced language - Amharic. Speech Communication.

Charl Van Heerden, Neil Kleynhans, and Marelie Davel. 2016. Improving the Lwazi ASR baseline. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH.

F. de Wet and E. C. Botha. 1999. Towards speech technology for South African languages: Automatic speech recognition in Xhosa. South African Journal of African Languages.

Yonas Woldemariam. 2020. Transfer learning for less-resourced Semitic languages speech recognition: the case of Amharic. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL).

S. J. Young, G. Evermann, M. J. F. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. C. Woodland. 2009. The HTK Book (for HTK Version 3.4).