Mispronunciation Detection in Non-native (L2) English with Uncertainty Modeling
Daniel Korzekwa*†, Jaime Lorenzo-Trueba*, Szymon Zaporowski†, Shira Calamaro*, Thomas Drugman*, Bozena Kostek†
* Amazon Speech
† Gdansk University of Technology, Faculty of ETI, Poland
ABSTRACT
A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare them to the expected pronunciation of a native speaker. This approach makes two simplifying assumptions: a) phonemes can be recognized from speech with high accuracy, b) there is a single correct way for a sentence to be pronounced. These assumptions do not always hold, which can result in a significant amount of false mispronunciation alarms. We propose a novel approach to overcome this problem based on two principles: a) taking into account uncertainty in the automatic phoneme recognition step, b) accounting for the fact that there may be multiple valid pronunciations. We evaluate the model on non-native (L2) English speech of German, Italian and Polish speakers, where it is shown to increase the precision of detecting mispronunciations by up to 18% (relative) compared to the common approach.
Index Terms: Pronunciation Assessment, Second Language Learning, Uncertainty Modeling, Deep Learning
1. INTRODUCTION
In Computer Assisted Pronunciation Training (CAPT), students are presented with a text and asked to read it aloud. A computer informs students of mispronunciations in their speech, so that they can repeat it and improve. CAPT has been found to be an effective tool that helps non-native (L2) speakers of English to improve their pronunciation skills [1, 2].

A common approach to CAPT is based on recognizing the phonemes produced by a student and comparing them with the expected (canonical) phonemes that a native speaker would pronounce [3, 4, 5, 6]. It makes two simplifying assumptions. First, it assumes that phonemes can be automatically recognized from speech with high accuracy. However, even in native (L1) speech, it is difficult to get the Phoneme Error Rate (PER) below 15% [7]. Second, this approach assumes that there is only one 'correct' way for a sentence to be pronounced, but due to phonetic variability this is not always true. For example, the word 'enough' can be pronounced by native speakers in multiple correct ways: /ih n ah f/ or /ax n ah f/ (short 'i' or 'schwa' phoneme at the beginning). These assumptions do not always hold, which can result in a significant amount of false mispronunciation alarms that confuse students.

We propose a novel approach that results in fewer false mispronunciation alarms, by formalizing the intuition that we will not be able to recognize exactly what a student has pronounced or say precisely how a native speaker would pronounce it. First, the model estimates a belief over the phonemes produced by the student, intuitively representing the uncertainty in the student's pronunciation. Then, the model converts this belief into the probabilities that a native speaker would pronounce it, accounting for phonetic variability.
Finally, the model makes a decision on which words were mispronounced in the sentence by processing three pieces of information: a) what the student pronounced, b) how likely a native speaker would pronounce it that way, and c) what the student was expected to pronounce.

In Section 2, we review the related work. In Section 3, we describe the proposed model. In Section 4, we present the experiments, and we conclude in Section 5.
2. RELATED WORK
In 2000, Witt et al. coined the term Goodness of Pronunciation (GoP) [3]. GoP starts by aligning the canonical phonemes with the speech signal using a forced-alignment technique. This technique aims to find the most likely mapping between phonemes and the regions of a corresponding speech signal. In the next step, GoP computes the ratio between the likelihoods of the canonical and the most likely pronounced phonemes. Finally, it detects a mispronunciation if the ratio falls below a given threshold. GoP was further extended with Deep Neural Networks (DNNs), replacing Hidden Markov Model (HMM) and Gaussian Mixture Model (GMM) techniques for acoustic modeling [4, 5]. Cheng et al. [8] improved the performance of GoP with the latent representation of speech extracted in an unsupervised way.

As opposed to GoP, we do not use forced-alignment, which requires both speech and phoneme inputs. Following the work of Leung et al. [6], we use a phoneme recognizer, which recognizes phonemes from only the speech signal. The phoneme recognizer is based on a Convolutional Neural Network (CNN), a Gated Recurrent Unit (GRU), and Connectionist Temporal Classification (CTC) loss. Leung et al. report that it outperforms other forced-alignment [4] and forced-alignment-free [9] techniques on the task of detecting phoneme-level mispronunciations in L2 English. Contrary to Leung et al., who rely only on a single recognized sequence of phonemes, we obtain the top N decoded sequences of phonemes, along with the phoneme-level posterior probabilities.

It is common in pronunciation assessment to employ the speech signal of a reference speaker. Xiao et al. use a pair of speech signals from a student and a native speaker to classify native and non-native speech [10]. Mauro et al. incorporate the speech of a reference speaker to detect mispronunciations at the phoneme level [11]. Wang et al. use siamese networks for modeling discrepancy between normal and distorted children's speech [12].
We take a similar approach, but we do not need a database of reference speech. Instead, we train a statistical model to estimate the probability of a sentence being pronounced by a native speaker. Qian et al. propose a statistical pronunciation model as well [13]. Unlike our work, in which we create a model of 'correct' pronunciation, they build a model that generates hypotheses of mispronounced speech.
3. PROPOSED MODEL
The design consists of three subsystems: a Phoneme Recognizer (PR), a Pronunciation Model (PM), and a Pronunciation Error Detector (PED), illustrated in Figure 1. The PR recognizes phonemes spoken by a student. The PM estimates the probabilities of these phonemes having been pronounced by a native speaker. Finally, the PED computes word-level mispronunciation probabilities. In Figure 2, we present detailed architectures of the PR, PM, and PED.

For example, consider the text 'I said alone not gone' with the canonical representation /ay - s eh d - ax l ow n - n aa t - g aa n/. Polish L2 speakers of English often mispronounce the /eh/ phoneme in the second word as /ey/. The PM would identify the /ey/ as having a low probability of being pronounced by a native speaker in the middle of the word 'said', which the PED would translate into a high probability of mispronunciation.
The PR (Figure 2a) uses beam decoding [14] to estimate N hypotheses of the most likely sequences of phonemes that are recognized in the speech signal o. A single hypothesis is denoted as r_o ~ p(r_o | o). The speech signal o is represented by a mel-spectrogram with f frames and 80 mel-bins. Each sequence of phonemes r_o is accompanied by the posterior phoneme probabilities of shape (l_{r_o}, l_s + 1), where l_{r_o} is the length of the sequence and l_s is the size of the phoneme set (45 phonemes including 'pause', 'end of sentence (eos)', and a 'blank' label required by the CTC-based model).

The PM (Figure 2b) is an encoder-decoder neural network following Sutskever et al. [15]. Instead of building a text-to-text translation system between two languages, we use it for phoneme-to-phoneme conversion. The sequence of phonemes r_c that a native speaker was expected to pronounce is converted into the sequence of phonemes r they had pronounced, denoted as r ~ p(r | r_c). Once trained, the PM acts as a probability mass function, computing the likelihood sequence π of the phonemes r_o pronounced by a student conditioned on the expected (canonical) phonemes r_c. The PM is denoted in Eq. 1, which we implemented in MxNet [16] using 'sum' and 'element-wise multiply' linear-algebra operations:

    π = Σ_{r_o} p(r_o | o) p(r = r_o | r_c)    (1)

The model is trained on phoneme-to-phoneme speech data created automatically by passing the speech of the native speakers through the PR. By annotating the data with the PR, we can make the PM model more resistant to possible phoneme recognition inaccuracies of the PR at testing time.

The PED (Figure 2c) computes the probabilities of mispronunciations e at the word level, denoted as e ~ p(e | r_o, π, r_c). The PED is conditioned on three inputs: the phonemes r_o recognized by the PR, the corresponding pronunciation likelihoods π from the PM, and the canonical phonemes r_c.
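The marginalization in Eq. 1 amounts to a probability-weighted sum of PM likelihoods over the N beam hypotheses. A toy NumPy sketch of this computation (all probabilities below are illustrative, not taken from the paper):

```python
import numpy as np

# N = 2 PR hypotheses for the same 3-phoneme word.
hyp_probs = np.array([0.7, 0.3])           # p(r_o | o) from beam decoding
# Per-phoneme PM likelihoods p(r = r_o | r_c) for each hypothesis,
# scored against the canonical phonemes r_c.
pm_probs = np.array([[0.9, 0.8, 0.95],
                     [0.2, 0.8, 0.95]])
# Eq. 1: 'sum' and 'element-wise multiply', as in the MxNet implementation.
pi = (hyp_probs[:, None] * pm_probs).sum(axis=0)
```

Here `pi` plays the role of the likelihood sequence π consumed by the PED: each entry estimates how likely a native speaker would be to pronounce that phoneme in that position.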
The model starts with aligning the canonical and recognized sequences of phonemes. We adopted the dynamic programming algorithm for aligning biological sequences developed by Needleman and Wunsch [17]. Then, the probability of mispronunciation for a given word is computed with Equation 2, where k denotes the word index and j is the index of the phoneme in the word with the lowest probability of pronunciation:

    p(e_k) = { 0,            if the aligned phonemes match
             { 1 − π_{k,j},  otherwise                        (2)

We compute the probabilities of mispronunciation for the N phoneme recognition hypotheses from the PR. Mispronunciation for a given word is detected if the probability of mispronunciation exceeds a given threshold for all hypotheses. The hyper-parameter N = 4 was manually tuned on a single L2 speaker from the testing set to optimize the PED in the precision metric.

Fig. 1: Architecture of the system for detecting mispronounced words in a spoken sentence.

Fig. 2: Architecture of the PR, PM, and PED subsystems. l_s - the size of the phoneme set.
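The alignment step and the decision rule of Eq. 2 can be sketched as follows. The match/mismatch/gap scores (+1/-1/-1) and the '-' gap symbol are assumptions for illustration; the paper does not state its scoring parameters:

```python
def needleman_wunsch(canonical, recognized, match=1, mismatch=-1, gap=-1):
    """Global alignment of two phoneme sequences (Needleman-Wunsch).
    Scoring parameters are illustrative assumptions."""
    n, m = len(canonical), len(recognized)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if canonical[i - 1] == recognized[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback: pair up phonemes, using '-' for insertions/deletions.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if canonical[i - 1] == recognized[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                pairs.append((canonical[i - 1], recognized[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((canonical[i - 1], '-'))
            i -= 1
        else:
            pairs.append(('-', recognized[j - 1]))
            j -= 1
    return pairs[::-1]

def word_mispronunciation_prob(aligned_word, pi_word):
    """Eq. 2: zero if every aligned phoneme pair matches, otherwise
    1 - pi at the least likely phoneme j of the word."""
    if all(c == r for c, r in aligned_word):
        return 0.0
    return 1.0 - min(pi_word)

# 'said' read as /s ey d/ instead of the canonical /s eh d/:
aligned = needleman_wunsch(['s', 'eh', 'd'], ['s', 'ey', 'd'])
p_err = word_mispronunciation_prob(aligned, [0.9, 0.1, 0.8])  # 1 - 0.1 = 0.9
```

With a deleted phoneme, e.g. canonical /ay s eh d/ against recognized /s eh d/, the alignment inserts a gap for /ay/, which Eq. 2 likewise treats as a mismatch.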
4. EXPERIMENTS AND DISCUSSION
We want to understand the effect of accounting for uncertainty in the PR-PM system presented in Section 3. To do this, we compare it with two other variants, PR-LIK and PR-NOLIK, and analyze precision and recall metrics. The PR-LIK system helps us understand how important it is to account for the phonetic variability in the PM. To switch the PM off, we modify it so that it considers only a single way for a sentence to be pronounced correctly.

The PR-NOLIK variant corresponds to the CTC-based mispronunciation detection model proposed by Leung et al. [6]. To reflect this, we make two modifications compared to the PR-PM system. First, we switch the PM off in the same way we did in the PR-LIK system. Second, we set the posterior probabilities of recognized phonemes in the PR to 100%, which means that the PR is always certain about the phonemes produced by a speaker. There are some slight implementation differences between Leung's model and PR-NOLIK, for example, regarding the number of units in the neural network layers. We use our configuration to make a consistent comparison with the PR-PM and PR-LIK systems. One can hence consider PR-NOLIK a fair state-of-the-art baseline [6].
For extracting mel-spectrograms, we used a time step of 10 ms and a window size of 40 ms. The PR was trained with the CTC loss and the Adam optimizer (batch size: 32, learning rate: 0.001, gradient clipping: 5). We tuned the following hyper-parameters of the PR with Bayesian Optimization: dropout, CNN channels, GRU units, and dense units. The PM was trained with the cross-entropy loss and the AdaDelta optimizer (batch size: 20, learning rate: 0.01, gradient clipping: 5). The location-sensitive attention in the PM follows the work by Chorowski et al. [7]. The PR and PM models were implemented in the MxNet Deep Learning framework.
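At 16 kHz, the stated 10 ms time step and 40 ms window translate to 160- and 640-sample frames. A minimal pure-NumPy framing sketch (the actual feature-extraction pipeline is not published; the mel filterbank itself is left to a DSP library):

```python
import numpy as np

SR = 16000                 # sampling rate per the paper
HOP = int(0.010 * SR)      # 10 ms time step -> 160 samples
WIN = int(0.040 * SR)      # 40 ms window    -> 640 samples
N_MELS = 80                # mel-bins per frame, per the paper

def frame_signal(x, win=WIN, hop=HOP):
    """Slice a waveform into overlapping analysis frames."""
    n_frames = 1 + max(0, (len(x) - win) // hop)
    return np.stack([x[i * hop: i * hop + win] for i in range(n_frames)])

# One second of audio yields f = 97 frames of 640 samples each.
frames = frame_signal(np.zeros(SR))
```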
For training and testing the PR and PM, we used 125.28 hours of L1 and L2 English speech from 983 speakers, segmented into 102,812 sentences and sourced from multiple speech corpora: TIMIT [18], LibriTTS [19], Isle [20], and GUT Isle [21]. We summarize it in Table 1. All speech data were downsampled to 16 kHz. Both L1 and L2 speech were phonetically transcribed using an Amazon proprietary grapheme-to-phoneme model and used by the PR. Automatic transcriptions of L2 speech do not capture pronunciation errors, but we found that it is still worth including automatically transcribed L2 speech in the PR. The L2 corpora were also annotated by 5 native speakers of American English for word-level pronunciation errors. There are 3624 mispronounced words out of 13,191 in the Isle corpus and 1046 mispronounced words out of 5064 in the GUT Isle corpus.

From the collected speech, we held out 28 L2 speakers and used them only to assess the performance of the systems in the mispronunciation detection task. This includes 11 Italian and 11 German speakers from the Isle corpus [20], and 6 Polish speakers from the GUT Isle corpus [21].
The PR-NOLIK detects mispronounced words based on the difference between the canonical and recognized phonemes.

Table 1: The summary of speech corpora used by the PR.
Native Language         Hours     Speakers
English                 90.47     640
Unknown                 19.91     285
German and Italian      13.41     46
Polish                  1.49      12
Therefore, this system does not offer any flexibility in optimizing the model for higher precision.

The PR-LIK system incorporates the posterior probabilities of recognized phonemes. It means that we can tune this system towards higher precision, as illustrated in Figure 3. Accounting for uncertainty in the PR helps when there is more than one likely sequence of phonemes that could have been uttered by a user, and the PR model is uncertain which one it is. For example, the PR reports two likely pronunciations for the text 'I said' /ay s eh d/: the first one, /s eh d/, with the /ay/ phoneme missing at the beginning, and the alternative one, /ay s eh d/, with the /ay/ phoneme present. If the PR considered only the most likely sequence of phonemes, as PR-NOLIK does, it would incorrectly raise a pronunciation error. In the second example, a student read the text 'six' /s ih k s/, mispronouncing the first phoneme /s/ as /t/. The likelihood of the recognized phoneme is only 34%, which suggests that the PR model is quite uncertain about which phoneme was pronounced. However, sometimes even in such cases, we can be confident that the word was mispronounced. This is because the PM computes the probability of pronunciation based on the posterior probability from the PR model. In this particular case, the other phoneme candidates that account for the remaining 66% of uncertainty are also unlikely to be pronounced by a native speaker. The PM can take this into account and correctly detect a mispronunciation.

However, we found that the effect of accounting for uncertainty in the PR is quite limited. Compared to the PR-NOLIK system, the PR-LIK raises precision on the GUT Isle corpus only by 6% relative (55% divided by 52%), at the cost of dropping recall by about 23%. We can observe a much stronger effect when we account for uncertainty in the PM model. Compared to the PR-LIK system, the PR-PM system further increases precision by between 11% and 18%, depending on the decrease in recall, which ranges between 20% and 40%.
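The relative figures quoted above are ratios of precisions, not percentage-point differences; e.g. the 6% number:

```python
# 52% -> 55% precision is a ~6% relative gain, as stated in the text.
pr_nolik_precision = 0.52
pr_lik_precision = 0.55
relative_gain_pct = (pr_lik_precision / pr_nolik_precision - 1.0) * 100
```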
One example where the PM helps is illustrated by the word 'enough', which can be pronounced in two similar ways: /ih n ah f/ or /ax n ah f/ (short 'i' or 'schwa' phoneme at the beginning). The PM can account for phonetic variability and recognize both versions as pronounced correctly. Another example is word linking [22]. Native speakers tend to merge phonemes of neighboring words. For example, in the text 'her arrange' /hh er - er ey n jh/, the two neighboring /er/ phonemes can be pronounced as a single phoneme: /hh er ey n jh/. The PM model can correctly recognize multiple variations of such pronunciations.

Fig. 3: Precision-recall curves for the evaluated systems.

Complementary to the precision-recall curves shown in Figure 3, we present in Table 2 one configuration of the precision and recall scores for the PR-LIK and PR-PM systems. This configuration is selected in such a way that: a) recall for both systems is close to the same value, b) it illustrates that the PR-PM model has a much bigger potential for increasing precision than the PR-LIK system. A similar conclusion can be made by inspecting multiple different precision and recall configurations in the precision-recall plots for both the Isle and GUT Isle corpora.
Table 2: Precision and recall of detecting word-level mispronunciations. CI - Confidence Interval.

Model     Precision [%, 95% CI]    Recall [%, 95% CI]

Isle corpus (German and Italian)
PR-LIK    49.39 (47.59-51.19)      40.20 (38.62-41.81)
PR-PM     54.20 (52.32-56.08)      40.20 (38.62-41.81)

GUT Isle corpus (Polish)
PR-LIK    54.91 (50.53-59.24)      40.29 (36.66-44.02)
PR-PM     61.21 (56.63-65.65)      40.15 (36.51-43.87)
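Confidence intervals like those in Table 2 can be obtained with a normal-approximation (Wald) interval over the word-level decisions. The paper does not state which CI method it used, and the counts below are illustrative:

```python
import math

def wald_ci(successes, trials, z=1.96):
    """Normal-approximation 95% CI for a proportion, e.g. precision
    = true positives / predicted positives. One common choice of CI;
    the paper does not specify its method."""
    p = successes / trials
    half = z * math.sqrt(p * (1.0 - p) / trials)
    return p, max(0.0, p - half), min(1.0, p + half)

# Illustrative: 60 correct detections out of 100 flagged words.
p, lo, hi = wald_ci(60, 100)
```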
5. CONCLUSION AND FUTURE WORK
To report fewer false pronunciation alarms, it is important to move away from the two simplifying assumptions that are usually made by common methods for pronunciation assessment: a) phonemes can be recognized with high accuracy, b) a sentence can be read in a single correct way. We acknowledged that these assumptions do not always hold. Instead, we designed a model that: a) accounts for the uncertainty in phoneme recognition, and b) accounts for multiple ways a sentence can be pronounced correctly due to phonetic variability. We found that to optimize precision, it is more important to account for the phonetic variability of speech than for the uncertainty in phoneme recognition. We showed that the proposed model can raise the precision of detecting mispronounced words by up to 18% compared to common methods.

In the future, we plan to adapt the PM model to correctly pronounced L2 speech to account for the phonetic variability of non-native speakers. We also plan to combine the PR, PM, and PED modules and train the model jointly to eliminate the accumulation of statistical errors coming from disjoint training of the system.

6. REFERENCES

[1] A. Neri, O. Mich, M. Gerosa, and D. Giuliani, "The effectiveness of computer assisted pronunciation training for foreign language learning by children," Computer Assisted Language Learning, vol. 21, no. 5, pp. 393-408, 2008.
[2] C. Tejedor-García, D. Escudero, E. Cámara-Arenas, C. González-Ferreras, and V. Cardeñoso-Payo, "Assessing pronunciation improvement in students of English using a controlled computer-assisted pronunciation tool," IEEE Transactions on Learning Technologies, 2020.
[3] S. M. Witt and S. J. Young, "Phone-level pronunciation scoring and assessment for interactive language learning," Speech Communication, vol. 30, no. 2-3, pp. 95-108, 2000.
[4] K. Li, X. Qian, and H. Meng, "Mispronunciation detection and diagnosis in L2 English speech using multi-distribution deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 1, pp. 193-207, 2016.
[5] S. Sudhakara, M. K. Ramanathi, C. Yarra, and P. K. Ghosh, "An improved goodness of pronunciation (GoP) measure for pronunciation evaluation with DNN-HMM system considering HMM transition probabilities," in INTERSPEECH, 2019, pp. 954-958.
[6] W. Leung, X. Liu, and H. Meng, "CNN-RNN-CTC based end-to-end mispronunciation detection and diagnosis," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 8132-8136.
[7] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577-585.
[8] S. Cheng et al., "ASR-free pronunciation assessment," arXiv preprint arXiv:2005.11902, 2020.
[9] A. M. Harrison, W. Lo, X. Qian, and H. Meng, "Implementation of an extended recognition network for mispronunciation detection and diagnosis in computer-assisted pronunciation training," in Intl. Workshop on Speech and Language Technology in Education, 2009.
[10] Y. Xiao, F. K. Soong, and W. Hu, "Paired phone-posteriors approach to ESL pronunciation quality assessment," 2018.
[11] M. Nicolao, A. V. Beeston, and T. Hain, "Automatic assessment of English learner pronunciation using discriminative classifiers," IEEE, 2015, pp. 5351-5355.
[12] J. Wang, Y. Qin, Z. Peng, and T. Lee, "Child speech disorder detection with siamese recurrent network using speech attribute features," in INTERSPEECH, 2019, pp. 3885-3889.
[13] X. Qian, H. Meng, and F. Soong, "Capturing L2 segmental mispronunciations with joint-sequence models in computer-aided pronunciation training (CAPT)," IEEE, 2010, pp. 84-88.
[14] A. Graves, A. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," IEEE, 2013, pp. 6645-6649.
[15] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104-3112.
[16] T. Chen et al., "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," arXiv preprint arXiv:1512.01274, 2015.
[17] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, vol. 48, no. 3, pp. 443-453, 1970.
[18] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," STIN, vol. 93, p. 27403, 1993.
[19] H. Zen et al., "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," in Proc. Interspeech 2019, 2019, pp. 1526-1530.
[20] E. S. Atwell, P. A. Howarth, and D. C. Souter, "The ISLE corpus: Italian and German spoken learner's English," ICAME Journal: Intl. Computer Archive of Modern and Medieval English Journal, vol. 27, pp. 5-18, 2003.
[21] D. Weber, S. Zaporowski, and D. Korzekwa, "Constructing a dataset of speech recordings with Lombard effect," 2020.
[22] A. E. Hieke, "Linking as a marker of fluent speech."