A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline
Yerbolat Khassanov, Saida Mussakhojayeva, Almas Mirzakhmetov, Alen Adiyev, Mukhamet Nurpeiissov, Huseyin Atakan Varol
Institute of Smart Systems and Artificial Intelligence (ISSAI), Nazarbayev University, Nur-Sultan, Kazakhstan
{yerbolat.khassanov, saida.mussakhojayeva, almas.mirzakhmetov, alen.adiyev, mukhamet.nurpeiissov, ahvarol}@nu.edu.kz

Abstract
We present an open-source speech corpus for the Kazakh language. The Kazakh speech corpus (KSC) contains around 335 hours of transcribed audio comprising over 154,000 utterances spoken by participants from different regions, age groups, and genders. It was carefully inspected by native Kazakh speakers to ensure high quality. The KSC is the largest publicly available database developed to advance various Kazakh speech and language processing applications. In this paper, we first describe the data collection and preprocessing procedures, followed by a description of the database specifications. We also share our experience and the challenges faced during database construction. To demonstrate the reliability of the database, we performed preliminary speech recognition experiments. The experimental results imply that the quality of the audio and transcripts is promising. To enable experiment reproducibility and to ease the corpus usage, we also released an ESPnet recipe.
1 Introduction

We present an open-source Kazakh speech corpus (KSC) constructed to advance the development of speech and language processing applications for the Kazakh language. Kazakh is an agglutinative language with vowel harmony, belonging to the family of Turkic languages. During the Soviet period, the Kazakh language was overwhelmed by the Russian language, causing a decline in Kazakh language usage (Dave, 2007). In the 1990s, it was declared an official language of Kazakhstan, and many initiatives were launched to increase the number of Kazakh speakers. Today, it is spoken by over 10 million people in Kazakhstan and by over 3 million people in other countries. Our goal is to accelerate the penetration of the Kazakh language into Internet of things (IoT) technologies and to promote research in Kazakh speech processing applications by introducing the KSC corpus.

Although several Kazakh speech corpora have been previously presented (Makhambetov et al., 2013; Shi et al., 2017; Mamyrbayev et al., 2019), there is no generally accepted common corpus. Most of them are either publicly unavailable or contain an insufficient amount of data to train reliable models. In particular, these databases are too small for building recent end-to-end models, which are extremely data hungry (Hannun et al., 2014). Consequently, different research groups usually conduct their experiments on internally collected data, which prevents the reproducibility and comparison of different approaches.

To address the aforementioned limitations, we created the KSC containing around 335 hours of transcribed audio. It was crowdsourced through the Internet, where volunteers were asked to read presented sentences in a web browser. In total, we accepted over 154,000 utterances submitted from over 1,500 unique device IDs identified by web cookies stored on the users’ devices. The recordings were first checked manually, and when a sufficient amount of data had been collected, they were partially checked automatically. To the best of our knowledge, KSC is the largest open-source speech corpus in Kazakh, and it is available for public and commercial use upon request at https://issai.nu.edu.kz/kz-speech-corpus/ under the Creative Commons Attribution 4.0 International License. We expect that this database will be a valuable resource for the research communities in both academia and industry. The primary application domains of the corpus are speech recognition, speech synthesis, and speaker recognition. To demonstrate the reliability of the database, we performed preliminary automatic speech recognition (ASR) experiments where promising results were achieved, sufficient for practical usage. We also provide a practical guide on the development of ASR systems for the Kazakh language by sharing a reproducible recipe and pretrained models (https://github.com/IS2AI/ISSAI_SAIDA_Kazakh_ASR). We leave the utilization of the database for speech synthesis and speaker recognition tasks as future work.

The rest of the paper is organized as follows. In section 2, we review related works. In section 3, we present our KSC database and describe the database construction procedures in detail. In section 4, we describe the speech recognition experiment setup and the obtained results. In section 5, we discuss the obtained results and challenges. Lastly, section 6 concludes this paper and mentions potential future works.

2 Related Work

ASR has intrigued researchers for decades.
In the past few years, interest has surged due to its new applications in smart devices such as voice command, voice search, message dictation, and virtual assistants (Yu and Deng, 2014). In response to this technological shift, many speech corpora have been introduced for various languages. For example, Du et al. (2018) released a 1,000-hour open-source Mandarin read speech corpus to bridge the gap between academia and industry. The utterances were recorded using iOS-based mobile phones and cover the following domains: voice command, smart home, autonomous driving, and others. Similarly, Koh et al. (2019) developed a 2,000-hour read speech corpus to help speech technology developers and researchers build speech-related applications in Singapore.

Several works have presented speech databases for the Kazakh language. For example, Makhambetov et al. (2013) developed a Kazakh language corpus (KLC) containing around 40 hours of transcribed read speech data recorded in a sound-proof studio. Similarly, Mamyrbayev et al. (2019) collected 76 hours of data using a professional recording booth, which was further extended to 123 hours in (Mamyrbayev et al., 2020). Khomitsevich et al. (2015) utilized a 147-hour bilingual Kazakh-Russian speech corpus to build code-switching ASR systems. Shi et al. (2017) released 78 hours of transcribed Kazakh speech data recorded by 96 students from China. The IARPA Babel project has released a Kazakh language pack (https://catalog.ldc.upenn.edu/LDC2018S13) consisting of around 50 hours of conversational and 14 hours of scripted telephone speech. Unfortunately, the aforementioned databases are either publicly unavailable or of insufficient size to build robust Kazakh ASR systems. Additionally, some of them are nonrepresentative, i.e., they cover speakers from a narrow set of categories, such as the same region or age group. Furthermore, since most of these databases were collected in optimal lab settings, they might be ineffective for real-world applications.

The emergence of crowdsourcing platforms and the growth in Internet connectivity have motivated researchers to employ the crowdsourcing approach for constructing annotated corpora. In contrast to expert-based approaches, crowdsourcing tends to be cheaper and faster, though additional measures should be taken to ensure data quality (Snow et al., 2008; Novotney and Callison-Burch, 2010). Furthermore, crowdsourcing allows gathering a variety of dialects and accents from distant geographical locations and enables the participation of people with disabilities and the elderly, which otherwise would be impossible or too costly (Takamichi and Saruwatari, 2018). Inspired by this, we followed the best crowdsourcing practices to construct a large-scale, open-source speech corpus for Kazakh, described in the following sections.

3 The KSC Database

The KSC project was conducted with the approval of the Institutional Research Ethics Committee (IREC) of Nazarbayev University. Each reader participated voluntarily and was informed of the data collection and use protocols through an online consent form.
We first extracted Kazakh textual data from various sources such as electronic books, laws, and websites including Wikipedia, news portals, and blogs. For each website, we designed a specialized web crawler to improve the quality of the extracted text. The extracted texts were manually filtered to eliminate improper content involving sensitive political issues, user privacy, violence, and so on. Additionally, we filtered out documents entirely consisting of Russian words. Documents consisting of mixed Kazakh-Russian words were kept because there are many borrowed Russian words in Kazakh, and language mixing is a common practice among Kazakh speakers (Khomitsevich et al., 2015). Next, we divided the documents into sentences and removed the sentences consisting of more than 25 words. Lastly, we removed duplicate sentences. The total number of extracted sentences reached around 2.3 million.

To narrate the extracted sentences, we prepared a web-based speech recording platform capable of running on personal computers and smartphones. The platform randomly samples a sentence from the pool of extracted text and presents it to a reader (see Figure 1). It has “pause” and “next” buttons to control the recording process, and readers were allowed to quit at any time. We attracted readers by advertising the project in social media, news, and open messaging communities on WhatsApp and Telegram. We invited readers who are at least 18 years old so that they could legally agree to participate in the data collection. The audios were recorded at 48 kHz and 16 bits but downsampled to 16 kHz and 16 bits for online publication. Following our experimental protocol, we did not store readers’ personal information except for the region identified using the IP addresses.

We hired several transcribers among native Kazakh speakers to check the quality of the recordings. Transcribers logged on to a special transcription checking platform and were provided with an audio segment and the corresponding sentence that the reader had read.
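The sentence-level filtering described at the beginning of this section (splitting documents into sentences, dropping sentences longer than 25 words, and removing duplicates) is straightforward to reproduce. The following is a minimal sketch of that step; the naive sentence splitter, file handling, and function names are our own simplifications and not the exact crawler code used for KSC.

```python
import re

MAX_WORDS = 25  # per-sentence length limit used in the paper

def split_sentences(text: str) -> list[str]:
    # Naive splitter on sentence-final punctuation; the real pipeline
    # likely used a more careful, language-aware splitter.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def filter_sentences(documents: list[str]) -> list[str]:
    seen = set()
    kept = []
    for doc in documents:
        for sent in split_sentences(doc):
            if len(sent.split()) > MAX_WORDS:  # drop overly long sentences
                continue
            if sent in seen:                   # drop exact duplicates
                continue
            seen.add(sent)
            kept.append(sent)
    return kept

if __name__ == "__main__":
    docs = ["Бұл бірінші сөйлем. Бұл екінші сөйлем!", "Бұл бірінші сөйлем."]
    print(filter_sentences(docs))  # duplicates across documents are removed
```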
Figure 1: The speech recording web interface.
Category            Train   Valid   Test    Total
Duration (hours)    320     7.4     7.2     334.6

Table 1: The KSC database specifications.
Their task was to check whether the readers had read according to the prompt, and to transcribe any deviations or other acoustic events based on a set of transcription instructions. As an additional quality measure, we hired a linguist who was assigned to supervise and to randomly check the work done by the transcribers. To harmonize the transcriptions, the linguist also held “go through errors” sessions with the transcribers. The transcribers were instructed to reject utterances containing obvious mispronunciations or severe noise, to convert the numbers into written forms (note that after converting the numbers into written forms, some sentence lengths exceeded 25 words), and to trim the long silences at the beginning and end of the audio segments. Additionally, they were instructed to enclose partial repetitions and hesitations in parentheses, e.g. ‘(he) hello’ and ‘(ah)’, and to indicate other non-verbal sounds produced by the readers, such as sneezing and coughing, using a special ‘[noise]’ token. The background noises were not labeled.

When the size of the accepted utterances reached 100 hours, we built an ASR system to automatically check the recordings. The ASR accepted only the recordings perfectly matching the corresponding text prompts, i.e., 0% character error rate (CER), whereas the remaining utterances were left to human transcribers (a minimal sketch of this acceptance check is given at the end of this section).

The KSC database specifications are provided in Table 1. We split the data into three sets with non-overlapping speakers: training, validation, and test. While the training set recordings were collected from anonymous speakers, the validation and test sets were collected from known speakers to ensure that they do not overlap with the training set, represent different age groups and regions, and are gender balanced (see Table 2). In total, around 154,000 utterances were accepted, yielding 335 hours of transcribed speech data. Note that the device IDs cannot be used to represent the number of speakers in the training set, as several speakers might have used the same device or the same speaker might have used different devices. The whole database creation process took around 4 months to complete, and its size is around 38 GB.

The Kazakh writing system differs depending on the region where it is spoken. For example, the Cyrillic alphabet is used in Kazakhstan and Mongolia, and an Arabic-derived alphabet is used in China. In KSC, we represented all text using the Cyrillic alphabet consisting of 42 letters. The distribution of these letters in KSC is given in Figure 2.

Figure 2: The distribution of letters in KSC (%).

One of the important features of the proposed KSC database is that it was collected in various environment conditions (e.g., home, office, cafe, transport, and street) with diverse background noise, through mobile devices (e.g., phones and tablets) and personal computers, with and without headphone sets, which is similar to realistic use-case scenarios. Consequently, our database enables the development and evaluation of ASR systems designed to operate in real-world voice-enabled applications such as voice commands, voice search, message dictation, and others.
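Returning to the automatic checking step described above: accepting a recording only when the ASR hypothesis matches the prompt with 0% CER is equivalent to an exact match of the normalized character sequences. The sketch below illustrates this; the normalization rules and function names are our own assumptions, not the exact procedure used for KSC.

```python
import re

def normalize(text: str) -> str:
    # Lowercase, drop punctuation, and collapse whitespace; the actual
    # normalization used for KSC may differ.
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def accept_recording(asr_hypothesis: str, prompt: str) -> bool:
    # 0% character error rate against the prompt is equivalent to an
    # exact match of the normalized strings.
    return normalize(asr_hypothesis) == normalize(prompt)

# Example: the first recording would be auto-accepted, the second would
# be left to a human transcriber.
print(accept_recording("Сәлем әлем", "Сәлем әлем!"))    # True
print(accept_recording("Сәлем әлем", "Сәлеметсіз бе"))  # False
```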
Category         Subcategory     Valid   Test
Gender (%)       Female          51.7    51.7
                 Male            48.3    48.3
Age (%)          18-27           37.9    34.5
                 28-37           34.5    31.0
                 38-47           10.4    13.8
                 48 and above    17.2    20.7
Region (%)       East            13.8    13.8
                 West            20.7    17.2
                 North           13.8    20.7
                 South           37.9    41.4
                 Center          13.8     6.9
Device (%)       Phone           62.1    79.3
                 Computer        37.9    20.7
Headphone (%)    Yes             20.7    17.2
                 No              79.3    82.8
Table 2: The validation and test set speaker details.

The KSC database consists of audio recordings, transcripts, and metadata stored in separate folders. The audio and the corresponding transcription file names are the same, except that the audio recordings are stored as WAV files whereas the transcriptions are stored as TXT files using UTF-8 encoding. The metadata contains the data splitting information (training, validation, and test) and the speaker details (gender, age, region, device, and headphones) of the validation and test sets.
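Given this layout, pairing each recording with its transcript reduces to matching file stems across the audio and transcript folders. The following is a minimal sketch assuming folder names Audio/ and Transcripts/; the actual folder names in the released corpus may differ.

```python
import wave
from pathlib import Path

def load_utterances(corpus_root: str):
    """Yield (utterance_id, duration_seconds, transcript) triples.

    Assumes 16-bit PCM WAV files under Audio/ and UTF-8 TXT files with
    matching stems under Transcripts/; adjust to the released layout.
    """
    root = Path(corpus_root)
    for wav_path in sorted((root / "Audio").glob("*.wav")):
        txt_path = root / "Transcripts" / (wav_path.stem + ".txt")
        if not txt_path.exists():
            continue  # skip recordings without a transcript
        with wave.open(str(wav_path), "rb") as wav:
            duration = wav.getnframes() / wav.getframerate()
        transcript = txt_path.read_text(encoding="utf-8").strip()
        yield wav_path.stem, duration, transcript

if __name__ == "__main__":
    for utt_id, dur, text in load_utterances("KSC"):
        print(f"{utt_id}\t{dur:.2f}s\t{text}")
```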
4 Speech Recognition Experiments

To demonstrate the utility and reliability of the KSC database, we conducted speech recognition experiments using both the traditional hidden Markov model-deep neural network (HMM-DNN) and the recently proposed end-to-end (E2E) architectures. We did not compare or perform thorough architecture searches for either the HMM-DNN or the E2E models, since this is outside the scope of this paper.
All ASR models were trained on the training set using a single V100 GPU on an NVIDIA DGX-2 server. All hyper-parameters were tuned using the validation set. The best-performing model was evaluated using the test set. All results are reported without lattice or n-best hypotheses rescoring, and no external data was used.
The HMM-DNN ASR system was built using the Kaldi (Povey et al., 2011) framework. We followed the Wall Street Journal (WSJ) recipe with the “nnet3+chain” setup and other recent Kaldi developments. The acoustic model was constructed using the factorized time-delay neural network (TDNN-F) (Povey et al., 2018) trained with the lattice-free maximum mutual information (LF-MMI) criterion (Povey et al., 2016). The inputs were Mel-frequency cepstral coefficient (MFCC) features with cepstral mean and variance normalization, extracted every 10 ms over a 25 ms window. In addition, we applied data augmentation using the speed perturbation (Ko et al., 2015) technique at rates 0.9, 1.0, and 1.1.

We employed a graphemic lexicon due to the strong correspondence between word spellings and pronunciations in Kazakh (e.g., “hello → h e l l o”). The graphemic lexicon was constructed by extracting all words present in the training set, resulting in 157,616 unique words. During the decoding stage, we employed a 3-gram language model (LM) with Kneser-Ney smoothing built using the SRILM (Stolcke, 2002) toolkit. The 3-gram LM was trained using the transcripts of the training set and a vocabulary covering all the words in the graphemic lexicon.

The E2E ASR system was built using the ESPnet (Watanabe et al., 2018) framework. We followed the WSJ recipe to train Transformer networks (Vaswani et al., 2017). The model was jointly trained with the connectionist temporal classification (CTC) (Graves et al., 2006) objective function under the multi-task learning framework (Kim et al., 2017). The input speech was represented as 80-dimensional filterbank features with pitch, computed every 10 ms over a 25 ms window. The acoustic features were first processed by a VGG network (Simonyan and Zisserman, 2015). Since Kazakh is a morphologically rich language, it is susceptible to severe data sparseness. To overcome this issue, we employed character-level output units. In total, we used 45 distinct output units consisting of the 42 letters of the Kazakh alphabet and 3 special tokens, i.e., <unk>, <blank>, and <space>. The E2E ASR systems do not require a lexicon when modeling with grapheme-based output units (Sainath et al., 2018). The character-level LM was built using the transcripts of the training set as a 2-layer RNN with 650 long short-term memory (LSTM) units in each layer. We utilized the RNN LM during the decoding stage via shallow fusion (Gülçehre et al., 2015). In addition, we augmented the training data using the speed perturbation and SpecAugment (Park et al., 2019) techniques.
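Because Kazakh spelling closely follows pronunciation, both the Kaldi graphemic lexicon and the E2E character-level output units described above boil down to splitting each word into its letters. A minimal sketch of such a grapheme mapping is given below; the special token names follow the paper, while the function names and data handling are our own illustration rather than the actual recipe code.

```python
SPECIAL_TOKENS = ["<unk>", "<blank>", "<space>"]  # as listed in the paper

def word_to_graphemes(word: str) -> list[str]:
    # A graphemic "pronunciation" is simply the sequence of letters.
    return list(word.lower())

def build_lexicon(words: set[str]) -> dict[str, list[str]]:
    # Kaldi-style lexicon: word -> its graphemes.
    return {w: word_to_graphemes(w) for w in sorted(words)}

def output_units(words: set[str]) -> list[str]:
    # E2E output units: all distinct letters plus the special tokens.
    letters = sorted({ch for w in words for ch in word_to_graphemes(w)})
    return SPECIAL_TOKENS + letters

if __name__ == "__main__":
    vocab = {"сәлем", "әлем"}
    for word, graphemes in build_lexicon(vocab).items():
        print(word, " ".join(graphemes))
    print(output_units(vocab))
```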
For decoding, we set the beam size to 30 and the RNN LM interpolation weight to 1. The Transformer-based E2E ASR system consists of 12 encoder and 6 decoder blocks. We set the number of heads in the self-attention layers to 4, with 256-dimensional hidden states, and the feed-forward network dimension to 2,048. We used a dropout rate of 0.1 and label smoothing of 0.1. The model was trained for 160 epochs using the Noam optimizer (Vaswani et al., 2017) with an initial learning rate of 10 and 25,000 warmup steps. The batch size was set to 96. We report results on an average model computed over the last 10 checkpoints. The interpolation weight for the CTC objective was set to 0.3.
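To make the decoding configuration concrete, the sketch below shows how per-hypothesis scores are typically combined in ESPnet-style joint CTC-attention decoding with LM shallow fusion: the attention and CTC log-probabilities are interpolated with the CTC weight (0.3 here), and the LM log-probability is added with the LM weight (1.0 here). This is an illustrative formulation rather than the exact ESPnet implementation.

```python
def joint_score(att_logprob: float,
                ctc_logprob: float,
                lm_logprob: float,
                ctc_weight: float = 0.3,  # CTC interpolation weight from the paper
                lm_weight: float = 1.0) -> float:
    """Combined log-score of one partial hypothesis during beam search.

    Illustrative ESPnet-style shallow fusion: decoder (attention) and CTC
    scores are interpolated, and the RNN LM score is added on top.
    """
    return ((1.0 - ctc_weight) * att_logprob
            + ctc_weight * ctc_logprob
            + lm_weight * lm_logprob)

# Example: two candidate hypotheses compared under the combined score;
# with beam size 30, the top 30 such hypotheses would be kept per step.
print(joint_score(-1.2, -1.5, -2.0))
print(joint_score(-1.0, -2.5, -2.2))
```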
Model           Valid           Test
                CER     WER     CER     WER
HMM-DNN         8.5     21.6    7.9     19.8
Transformer     4.5     13.7    3.9     12.4

Table 3: The CER (%) and WER (%) performance of different ASR models built using KSC.

The experiment results are presented in Table 3 in terms of both character error rate (CER) and word error rate (WER). All ASR models achieved competitive results on both the validation and test sets. However, the performance of the HMM-DNN model is inferior compared to the Transformer model. These experimental results successfully demonstrate the utility of the KSC database for the speech recognition task. We leave the exploration of optimal hyper-parameter settings and a detailed comparison of different ASR architectures as future work.
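For reference, CER and WER are both edit-distance-based measures: the Levenshtein distance between the hypothesis and the reference (over characters or words, respectively) divided by the reference length. A minimal sketch is given below; it is a generic implementation, not the scoring scripts used in the Kaldi and ESPnet recipes.

```python
def edit_distance(ref: list, hyp: list) -> int:
    # Standard dynamic-programming Levenshtein distance over one row.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution or match
    return d[-1]

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

print(cer("сәлем әлем", "сәлем алем"))  # one substituted character
print(wer("сәлем әлем", "сәлем алем"))  # one substituted word
```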
5 Discussion

Kazakh speech recognition is considered challenging due to the agglutinative nature of the language, where word structures are formed by the affixation of derivational and inflectional suffixes to stems in a specific order. As a result, the vocabulary size might grow considerably, resulting in data sparsity, especially for models operating at the word level, such as our HMM-DNN architecture. A potential solution is to break words down into finer-level linguistic units such as characters or subword units (Sennrich et al., 2016), as sketched below.
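One common way to obtain such subword units is byte-pair encoding (BPE). The sketch below uses the sentencepiece Python package to train a small BPE model on the training transcripts and to segment an agglutinative word form; the input file name, vocabulary size, and model prefix are illustrative assumptions, not settings from the paper.

```python
import sentencepiece as spm

# Train a small BPE model on the training transcripts (one sentence per
# line); "train_text.txt", the vocabulary size, and the model prefix are
# illustrative choices, not the settings used for KSC.
spm.SentencePieceTrainer.Train(
    "--input=train_text.txt --model_prefix=ksc_bpe "
    "--vocab_size=2000 --model_type=bpe --character_coverage=1.0"
)

sp = spm.SentencePieceProcessor()
sp.Load("ksc_bpe.model")

# An agglutinative word form is split into a few reusable subword pieces
# instead of inflating the word-level vocabulary.
print(sp.EncodeAsPieces("кітапханаларымызда"))
```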
Another challenge is the Kazakh-Russian code-switching practice, which is common in daily communication as the majority of Kazakh speakers are bilingual. Mostly inter-sentential and intra-sentential code-switching is practiced; however, intra-word code-switching is also possible. For example, one can say “Men magazinga bardym” (“I went to the store”), where the Russian word “magazin” is appended with the Kazakh suffix “ga” representing the preposition “to”. Furthermore, while the spelling of Kazakh words closely matches their pronunciation, this is not the case for Russian words, e.g., the letter “o” is sometimes pronounced as “a”, which might confuse the ASR system. We inspected the recognized outputs and observed that the ASR consistently makes errors on code-mixed utterances and rare proper nouns. Therefore, future works should focus on alleviating those errors.

Although we cannot directly compare our results to previous works, our WER results are appealing. For example, Mamyrbayev et al. (2019) used 76 hours of speech data to build an HMM-DNN ASR system, which achieved 32.7% WER on clean read speech. Similarly, an HMM-DNN ASR system built using 78 hours of data in (Shi et al., 2017) achieved 25.1% WER on read speech. On the other hand, Mamyrbayev et al. (2020) achieved 17.8% WER on clean read speech using an E2E ASR system trained on 126 hours of data.

We also envision that the KSC corpus can be utilized in cross-lingual transfer learning techniques (Das and Hasegawa-Johnson, 2015) to improve ASR systems for other Turkic languages such as Kyrgyz and Tatar.

6 Conclusion

In this work, we presented the KSC database containing around 335 hours of transcribed speech data. It is developed to advance Kazakh speech processing applications such as speech recognition, speech synthesis, and speaker recognition. We described the database construction procedures and discussed challenges that should be addressed in future works. The database is freely available under the Creative Commons Attribution 4.0 International License for any purpose, including research and commercial use. We also conducted preliminary speech recognition experiments using both the traditional hybrid HMM-DNN and the recently proposed Transformer-based E2E architectures. To ease the database usage and ensure the reproducibility of the experiments, we split it into three non-overlapping sets (training, validation, and test) and released our ESPnet recipe. The detailed exploration of better ASR settings and the adaptation of the database to other applications are left as future work.
References
Amit Das and Mark Hasegawa-Johnson. 2015. Cross-lingual transfer learning during supervised training in low resource scenarios. In Proc. of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH), Dresden, Germany, September 6-10, 2015, pages 3531–3535.

Bhavna Dave. 2007. Kazakhstan: Ethnicity, Language and Power. Routledge.

Jiayu Du, Xingyu Na, Xuechen Liu, and Hui Bu. 2018. AISHELL-2: Transforming Mandarin ASR research into industrial scale. CoRR, abs/1808.10583.

Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proc. of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, pages 369–376.

Çaglar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. CoRR, abs/1503.03535.

Awni Y. Hannun, C. Case, J. Casper, Bryan Catanzaro, Greg Diamos, E. Elsen, Ryan Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Ng. 2014. Deep Speech: Scaling up end-to-end speech recognition. ArXiv, abs/1412.5567.

Olga Khomitsevich, Valentin Mendelev, Natalia A. Tomashenko, Sergey Rybin, Ivan Medennikov, and Saule Kudubayeva. 2015. A bilingual Kazakh-Russian system for automatic speech recognition and synthesis. In Proc. of the 17th International Conference on Speech and Computer (SPECOM), Athens, Greece, September 20-24, 2015, volume 9319, pages 25–33.

Suyoun Kim, Takaaki Hori, and Shinji Watanabe. 2017. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, March 5-9, 2017, pages 4835–4839.

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio augmentation for speech recognition. In Proc. of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH), Dresden, Germany, September 6-10, 2015, pages 3586–3589.

Jia Xin Koh, Aqilah Mislan, Kevin Khoo, Brian Ang, Wilson Ang, Charmaine Ng, and Ying-Ying Tan. 2019. Building the Singapore English National Speech Corpus. In Proc. of the 20th Annual Conference of the International Speech Communication Association (INTERSPEECH), Graz, Austria, September 15-19, 2019, pages 321–325.

Olzhas Makhambetov, Aibek Makazhanov, Zhandos Yessenbayev, Bakhyt Matkarimov, Islam Sabyrgaliyev, and Anuar Sharafudinov. 2013. Assembling the Kazakh language corpus. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Seattle, Washington, USA, October 18-21, 2013, pages 1022–1031.

Orken Mamyrbayev, Keylan Alimhan, Bagashar Zhumazhanov, Tolganay Turdalykyzy, and Farida Gusmanova. 2020. End-to-end speech recognition in agglutinative languages. In Proc. of the 12th Asian Conference on Intelligent Information and Database Systems (ACIIDS), Phuket, Thailand, March 23-26, 2020, volume 12034, pages 391–401.

Orken J. Mamyrbayev, Mussa Turdalyuly, Nurbapa Mekebayev, Keylan Alimhan, Aizat Kydyrbekova, and Tolganay Turdalykyzy. 2019. Automatic recognition of Kazakh speech using deep neural networks. In Proc. of the 11th Asian Conference on Intelligent Information and Database Systems (ACIIDS), Yogyakarta, Indonesia, April 8-11, 2019, pages 465–474.

Scott Novotney and Chris Callison-Burch. 2010. Cheap, fast and good enough: Automatic speech recognition with non-expert transcription. In Proc. of Human Language Technologies: Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Los Angeles, California, USA, June 2-4, 2010, pages 207–215.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proc. of the 20th Annual Conference of the International Speech Communication Association (INTERSPEECH), Graz, Austria, September 15-19, 2019, pages 2613–2617.

Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohammadi, and Sanjeev Khudanpur. 2018. Semi-orthogonal low-rank matrix factorization for deep neural networks. In Proc. of the 19th Annual Conference of the International Speech Communication Association (INTERSPEECH), Hyderabad, India, September 2-6, 2018, pages 3743–3747.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. 2011. The Kaldi speech recognition toolkit. In Proc. of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. 2016. Purely sequence-trained neural networks for ASR based on lattice-free MMI. In Proc. of the 17th Annual Conference of the International Speech Communication Association (INTERSPEECH), San Francisco, CA, USA, September 8-12, 2016, pages 2751–2755.

Tara N. Sainath, Rohit Prabhavalkar, Shankar Kumar, Seungji Lee, Anjuli Kannan, David Rybach, Vlad Schogol, Patrick Nguyen, Bo Li, Yonghui Wu, Zhifeng Chen, and Chung-Cheng Chiu. 2018. No need for a lexicon? Evaluating the value of the pronunciation lexica in end-to-end models. In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, April 15-20, 2018, pages 5859–5863.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proc. of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), Berlin, Germany, August 7-12, 2016.

Ying Shi, Askar Hamdullah, Zhiyuan Tang, Dong Wang, and Thomas Fang Zheng. 2017. A free Kazakh speech database and a speech recognition baseline. In Proc. of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Kuala Lumpur, Malaysia, December 12-15, 2017, pages 745–748.

Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proc. of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, May 7-9, 2015.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Honolulu, Hawaii, USA, October 25-27, 2008, pages 254–263.

Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proc. of the International Conference on Spoken Language Processing (ICSLP), Denver, Colorado, USA, September 16-20, 2002.

Shinnosuke Takamichi and Hiroshi Saruwatari. 2018. CPJD corpus: Crowdsourced parallel speech corpus of Japanese dialects. In Proc. of the 11th International Conference on Language Resources and Evaluation (LREC), Miyazaki, Japan, May 7-12, 2018.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. of the Annual Conference on Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, December 4-9, 2017, pages 5998–6008.

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai. 2018. ESPnet: End-to-end speech processing toolkit. In Proc. of the 19th Annual Conference of the International Speech Communication Association (INTERSPEECH), Hyderabad, India, September 2-6, 2018, pages 2207–2211.

Dong Yu and Li Deng. 2014. Automatic Speech Recognition: A Deep Learning Approach. Springer.