Context-aware Goodness of Pronunciation for Computer-Assisted Pronunciation Training
Jiatong Shi, Nan Huo, Qin Jin*

Department of Computer Science, Johns Hopkins University, U.S.A.
School of Information, Renmin University of China, P.R. China
{jiatong shi, nhuo1}@jhu.edu, [email protected]

*Corresponding Author.

Abstract
Mispronunciation detection is an essential component of Computer-Assisted Pronunciation Training (CAPT) systems. State-of-the-art mispronunciation detection models use Deep Neural Networks (DNN) for acoustic modeling and a Goodness of Pronunciation (GOP) based algorithm for pronunciation scoring. However, GOP-based scoring models have two major limitations: (i) they depend on forced alignment, which splits the speech into phonetic segments and scores each segment independently, neglecting the transitions between phonemes within a segment; (ii) they focus only on phonetic segments and thus fail to consider context effects across phonemes (such as liaison, omission, incomplete plosive sounds, etc.). In this work, we propose the Context-aware Goodness of Pronunciation (CaGOP) scoring model. Specifically, two factors, the transition factor and the duration factor, are injected into CaGOP scoring. The transition factor identifies the transitions between phonemes and uses them to weight the frame-wise GOP. Moreover, a self-attention based phonetic duration model is proposed to introduce the duration factor into the scoring model. The proposed scoring model significantly outperforms the baselines, achieving 20% and 12% relative improvement over the GOP model on phoneme-level and sentence-level mispronunciation detection, respectively.
Index Terms: Computer-Assisted Pronunciation Training, Goodness of Pronunciation, Computer-Assisted Language Learning, Phonetic Duration Modeling
1. Introduction
Computer-Assisted Pronunciation Training (CAPT) is an important technology that offers automatic feedback to help users learn new spoken languages [1]. Because of its objectiveness, some standardized examinations also use CAPT systems for automatic speech proficiency evaluation (e.g., TOEFL [2], AZELLA [3]).

Several directions have been explored for the CAPT problem in the literature. One is to recognize pronunciation error patterns. The patterns can either be pre-defined based on linguistic knowledge or obtained in a data-driven way [4, 5]. However, this approach is often hard to implement due to the difficulties of obtaining comprehensive expert knowledge and collecting enough erroneous data. The dominant CAPT systems are based on an Automatic Speech Recognition (ASR)-like architecture [6]. It mainly includes three components: an acoustic module, a decoding module, and a scoring module. The acoustic module first converts speech information into frame-level phonetic posterior-probabilities. Next, the decoding module force-aligns the posterior-probabilities into phonetic segments.
Figure 1: Posterior-probability entropy of the word "Study" (S T AH D IY); the posterior-probability is computed from an ASR acoustic model trained on Librispeech [17].
Lastly, the scoring module scores the given segment according to the reference phoneme. The acoustic module follows the ASR's evolution from the Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) [7, 8] to Deep Neural Networks (DNN) [9, 10, 11] and Recurrent Neural Networks (RNN) [12]. For the scoring module, the dominant method is the Goodness of Pronunciation (GOP) [7]. It is a confidence measure for phonetic pronunciation, relating the test speech to an ASR model trained on native speech. Later, weighted GOP (wGOP) was proposed to improve the bad cases where the system cannot successfully distinguish phonemes with similar sounds [13]. Similarly, Zhang et al. employed a confused phoneme set to solve the same problem [14].

However, the above GOP-like methods do not fully consider the context information within and between the phonetic segments. Within the segments, the scoring strategies depend on the forced alignments, which split the entire speech sequence into phonetic segments corresponding to reference phonemes. From the facts of speech production, the vocal tract changes gradually during the production of different phonemes [15]. Therefore, a hard assignment of phonemes in the time domain includes the transitions between phonemes within the force-aligned segments. As stated in information theory, entropy represents the degree of "disorder" [16]: it is highest for a uniform distribution and becomes zero for a definite event. Figure 1 shows an example of the posterior-probability entropy for the word "Study", where the green lines mark the boundaries of the forced alignment. We can observe that the entropy rises between phonemes, which indicates that the model has a lower confidence level in its predictions there. As GOP uses the whole segment for scoring, it may include misleading information (i.e., phonetic transitions) that does not relate to the target phoneme. In addition, the GOP scoring module only considers very short context information through context-dependent phonemes [18]. It does not consider longer dependencies (e.g., word-level context information), which may influence the original sound of phonemes (speech co-articulation). As a result, the GOP scoring module tends to give over-strict suggestions when there are accepted co-articulation effects [19]; for example, it often gives a low score to the "D" in "AND", which is often less stressed in spoken English.

In this paper, we propose a Context-aware GOP (CaGOP) scoring module which takes both the transition factor and the duration factor into account. The transition factor is captured using a proposed transition-aware mechanism, while the duration factor depends on a duration prediction model with self-attention. On the mispronunciation detection task, the proposed CaGOP outperforms the GOP with around 20% relative improvement on both accuracy and F1 score. Meanwhile, the sentence-level correlation with human raters also achieves 12% relative improvement over the baselines. We further observe 25% relative improvement in phonetic duration modeling with the self-attention structure over the baseline Long Short-Term Memory (LSTM) model.
2. Context-aware GOP
Figure 2 shows the framework of our CAPT system. Our focus in this paper is on the scoring module. In this section, we first present the background of GOP and then introduce the transition factor and the duration factor in our proposed context-aware GOP (CaGOP) scoring module.
2.1. Goodness of Pronunciation

The Goodness of Pronunciation (GOP) scoring module assumes that the orthographic phoneme sequence is given. It applies an acoustic module (e.g., HMM-DNN) to infer the likelihood p(o_1, o_2, ..., o_N | a') of the acoustic features o_1, o_2, ..., o_N corresponding to each phone a' in the phoneme sequence. The final score is defined as the duration-normalized log-posterior-probability log p(a | o_1, o_2, ..., o_N) for the reference phoneme a [7]. Using the Bayesian theorem, it is computed as in Eq. (1):

\[
\mathrm{GOP}(a) = \frac{1}{N}\,\log\!\left(\frac{p(o_1, o_2, \dots, o_N \mid a)\, p(a)}{\sum_{a' \in A} p(o_1, o_2, \dots, o_N \mid a')\, p(a')}\right) \tag{1}
\]

where a stands for the phone index to be scored, A stands for the phone set, and N represents the length of the acoustic features. In practice, all phones are assumed to be equally likely (i.e., p(a) = p(a')). Hence, GOP can be approximated as in Eq. (2). Due to the logarithmic operation, GOP is not bounded to a certain range; mispronunciation is determined by phone-dependent thresholds.

\[
\mathrm{GOP}(a) = \frac{1}{N}\,\log\!\left(\frac{p(o_1, o_2, \dots, o_N \mid a)}{\sum_{a' \in A} p(o_1, o_2, \dots, o_N \mid a')}\right) \tag{2}
\]
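To make Eq. (2) concrete, here is a minimal numpy sketch, assuming forced alignment has already produced a matrix of per-frame log-likelihoods log p(o_t | a') for one segment (the function name and input layout are illustrative, not from the paper); the segment likelihood is factorized over frames under the same HMM conditional-independence assumption that Eq. (3) uses below.

```python
import numpy as np

def gop(frame_loglik: np.ndarray, a: int) -> float:
    """Segment-level GOP of Eq. (2) with equal phone priors.

    frame_loglik: (N, |A|) array of log p(o_t | a') for every frame t
                  in the force-aligned segment and every phone a' in A.
    a:            index of the reference phone to score.
    """
    N = frame_loglik.shape[0]
    # log p(o_1, ..., o_N | a') for each phone, summing frames under
    # the HMM conditional-independence assumption.
    seg_loglik = frame_loglik.sum(axis=0)
    # Denominator of Eq. (2): log-sum-exp over the whole phone set.
    denom = np.logaddexp.reduce(seg_loglik)
    return float((seg_loglik[a] - denom) / N)
```

A phone would then be flagged as mispronounced when its GOP falls below the phone-dependent threshold tuned on the development set.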
Figure 2: Our proposed CAPT system framework. The audio passes through the acoustic model to produce posterior probabilities, the text passes through the decoding model to produce the alignment, and the transition factor and duration factor feed the CaGOP scoring model.
2.2. Transition Factor

As mentioned above, scoring over whole segments from forced alignment inevitably takes the phone transitions into account. Inspired by the entropy concept in information theory, which refers to the uncertainty of random variables [16], and the Entropy Weight Method, which uses entropy to measure the contrast intensity of each attribute [20], we propose to calculate entropy weights for the frame-level posterior-probabilities in order to reduce the transition effect on the scoring.

For each frame, a lower entropy of the posterior-probability means that the acoustic module is more confident in its prediction, while a fluctuation in posterior-probability entropy indicates that a phonetic transition probably exists within the phonetic segment. In order to incorporate the transition factor into the scoring, we first reformulate the GOP into a frame-wise GOP using conditional independence (inferred from the HMM assumptions of the acoustic module) as in Eq. (3), and define the frame-wise posterior-probability entropy as in Eq. (4):

\[
\mathrm{GOP}(a) = \frac{1}{N}\sum_{t=1}^{N} \log p(a \mid o_t) = \frac{1}{N}\sum_{t=1}^{N} \log\!\left(\frac{p(o_t \mid a)}{\sum_{a' \in A} p(o_t \mid a')}\right) \tag{3}
\]

\[
E_t = -\sum_{a' \in A} p(a' \mid o_t) \cdot \log p(a' \mid o_t) \tag{4}
\]

We then compute the transition-aware pronunciation score by weighting the frame-wise GOP with the reciprocal of the entropy:

\[
\mathrm{TAScore}(a) = \sum_{t=1}^{N} \frac{E_t^{-1}}{\sum_{t'=1}^{N} E_{t'}^{-1}} \cdot \log p(a \mid o_t) \tag{5}
\]
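Below is a companion numpy sketch of Eqs. (3)-(5), again with illustrative names: it derives the frame-wise entropy from the model's log-posteriors and uses normalized reciprocal-entropy weights, which down-weight high-entropy transition frames.

```python
import numpy as np

def ta_score(log_post: np.ndarray, a: int, eps: float = 1e-8) -> float:
    """Transition-aware score of Eq. (5).

    log_post: (N, |A|) array of log p(a' | o_t) for the force-aligned
              segment (e.g., the log-softmax output of the acoustic model).
    a:        index of the reference phone.
    """
    post = np.exp(log_post)
    # Eq. (4): frame-wise entropy of the posterior distribution.
    entropy = -(post * log_post).sum(axis=1)
    # Reciprocal-entropy weights, normalized over the segment; eps
    # guards against division by zero on extremely confident frames.
    weights = 1.0 / (entropy + eps)
    weights /= weights.sum()
    # Eq. (5): entropy-weighted frame-wise GOP.
    return float((weights * log_post[:, a]).sum())
```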
2.3. Duration Factor

As the GOP-based model does not consider the context information between phonetic segments, it cannot account for pronunciation phenomena caused by contextual phonemes. To tackle this issue, we introduce the duration factor as a prosodic context for pronunciation scoring. The computation of the duration factor includes two steps. First, we use a context-dependent duration model to predict the duration of the given phoneme sequence. Then, we take the duration factor to be the phonetic duration mismatch between the reference and the test utterances.

Phonetic duration is an essential factor in speech articulation and benefits many speech processing tasks (e.g., text-to-speech, ASR) [21, 22]. Previous works have explored statistical graphical models (e.g., HMM) [21] and neural networks [23, 24, 25]. In this work, we propose to use a multi-head self-attention structure to model the phonetic duration of a given phoneme sequence. The model structure is shown in Figure 3. First, we pass the reference text to the text-to-phone converter and use the phoneme index sequence as our model input. Then we add the speed and positional information. The speed is represented as the average duration of the phonemes in the given speech. The self-attention encoder follows the definition of [26] as shown in Eq. (6), but with a local diagonal Gaussian matrix [27].

Figure 3: Phonetic duration prediction with self-attention (phone embeddings and the speed feature pass through linear layers, positional encoding is added, and N stacked self-attention encoder blocks are followed by a linear output layer).

We introduce the diagonal Gaussian matrix to favor local information because phonetic duration does not have very long dependencies; the co-articulation and omission of phonemes are usually caused by nearby phonemes.

\[
\mathrm{Attention}(q, k, v) = \mathrm{softmax}\!\left(\frac{q \cdot k^{T}}{\sqrt{d}} + M\right)\cdot v \tag{6}
\]

\[
M_{j,k} = -\frac{(j-k)^2}{\sigma^2} \tag{7}
\]

where d is the dimension of the input vectors and M is the local Gaussian matrix. The M in Eq. (6) is defined as in Eq. (7); it is a T × T matrix for a sequence of length T, and σ is a learnable parameter of the network.
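The following PyTorch sketch shows one head of the locally biased attention of Eqs. (6)-(7), under our reading that the extraction dropped the squares in Eq. (7); σ is a learnable scalar and all names are illustrative.

```python
import torch

def gaussian_bias(T: int, sigma: torch.Tensor) -> torch.Tensor:
    """Local diagonal Gaussian bias M of Eq. (7): M[j, k] = -(j - k)^2 / sigma^2."""
    idx = torch.arange(T, dtype=torch.float32)
    return -((idx[None, :] - idx[:, None]) ** 2) / sigma ** 2

def local_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                    sigma: torch.Tensor) -> torch.Tensor:
    """Eq. (6): scaled dot-product attention plus the Gaussian bias (one head).

    q, k, v: (T, d) tensors; sigma: scalar parameter (e.g., an nn.Parameter).
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    scores = scores + gaussian_bias(q.size(-2), sigma)
    # Distant positions receive a large negative bias, so the softmax
    # concentrates on nearby phonemes.
    return torch.softmax(scores, dim=-1) @ v
```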
We compute the duration mismatch between the reference and the test utterances as the duration factor. As phone duration strongly correlates with the speed of speech, we apply phone- and speed-dependent balance factors when computing the duration mismatch. The factors are computed from the phonetic durations of the training set. For each sentence speed, we compute the absolute error between our reference duration prediction and the ground truth (the ground-truth durations are generated using alignments from the ASR model). The balance factor T_{a,s} is then the mean plus 1.5 times the standard deviation of this absolute error. Our estimate of the duration mismatch is defined in Eq. (8):

\[
\delta(a) = \lvert D_{\text{align}} - D_{\text{pred}} \rvert - T_{a,s} \tag{8}
\]

where D_align is the force-aligned duration of phone a (i.e., the ground truth), D_pred is the prediction from the reference duration network for phone a, and T_{a,s} is the balance factor of phone a at speed s. Note that δ(a) can be negative, which indicates that the duration of the target phoneme is as expected.

2.4. Context-aware Scoring

Injecting both the transition factor and the duration factor into the pronunciation scoring, we compute the context-aware pronunciation score as follows:

\[
\mathrm{CaGOP}(a) = (1 - \beta \cdot \delta(a)) \cdot \mathrm{TAScore}(a) \tag{9}
\]

where β is a hyper-parameter. Note that replacing TAScore(a) with GOP(a) yields a reduced model that injects only the duration factor; this variant is compared in the following experiments.
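Putting the two factors together, a minimal sketch of Eqs. (8)-(9) for a single phone might look as follows (argument names are illustrative; the balance factor is assumed to be precomputed per phone and speed from the training set):

```python
def cagop(ta_score_a: float, d_align: float, d_pred: float,
          t_balance: float, beta: float = 0.1) -> float:
    """Context-aware GOP of Eq. (9) for one phone.

    ta_score_a: transition-aware score TAScore(a) from Eq. (5).
    d_align:    force-aligned duration of the phone (the "ground truth").
    d_pred:     duration predicted by the reference duration model.
    t_balance:  balance factor T_{a,s} (mean + 1.5 * std of the absolute
                duration error for this phone at this speech speed).
    beta:       hyper-parameter (0.1 in the paper's experiments).
    """
    # Eq. (8): duration mismatch; negative values mean the duration
    # is within the expected range.
    delta = abs(d_align - d_pred) - t_balance
    # Eq. (9): inject the duration factor into the transition-aware score.
    return (1.0 - beta * delta) * ta_score_a
```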
3. Experimental Settings
In this section, we first introduce the dataset. Then we present the implementation details for the proposed system and the baselines. Since the main focus of this work is on the scoring module, we only briefly describe the settings of the acoustic and decoding modules. Finally, we present the evaluation metrics.
Dataset: The Librispeech 960-hour training set is used for training the acoustic module and the phonetic duration model [17]. The TIMIT [28] dataset is used to evaluate the phonetic duration model. The duration information in TIMIT is quantized to 30-ms frames to align with the training set. There are phoneme mismatches between TIMIT and Librispeech; therefore, for evaluation, we convert all phonemes in TIMIT to Librispeech-style. For mispronunciation detection evaluation, a CAPT dataset with 2.8 hours of English reading speech from 129 non-native children is collected. The dataset is annotated by five English teachers, including phonetic and sentence-level annotation. The phonetic annotation follows the guideline in [29]. For the sentence-level annotation, the teachers are asked to evaluate the whole utterance with a score ranging from 0 to 10 regarding the utterance's naturalness. In addition, 50% of the data is used as the development set for hyper-parameter tuning, while the other 50% is used for testing.

Acoustic and Decoding Modules: We employ the Hidden Markov Model-Time Delay Neural Network (HMM-TDNN) as the acoustic module. It is trained on the Librispeech ASR task with Kaldi [30]. The input features are 40-dimensional Mel-Frequency Cepstrum Coefficients (MFCC) with a 25-ms window size and a 10-ms window shift. I-vectors are concatenated with the MFCCs to add environment and speaker information. The HMM-TDNN applies discriminative chain settings with the lattice-free maximum mutual information criterion [31], sub-sampling techniques, and the factorized mechanism [32, 33]. Because of the down-sampling effect of the sub-sampling TDNN model, the temporal resolution of the acoustic prediction is 30 ms [32]. For decoding, we align the phoneme sequence to the speech signal using the Viterbi algorithm (with optional silences between phonemes).
Duration Factor: Our proposed phonetic duration model uses 256-dimensional phoneme embeddings. For the encoder, we employ six four-head self-attention blocks with 256-dimensional hidden states and 1024-dimensional feed-forward layers. The output is frame-wise and 1-dimensional, indicating the phonetic duration. The positional encoding follows the work in [26]. The parameters are initialized with a uniform distribution as reported in [34]. Dropout is set to 0.1. Note that long phonetic sequences are constrained to a length of 100 for fast processing in training. We use the L1 loss for training. We choose the Noam optimizer introduced in [26] with a learning rate of 0.001 and 25,000 warm-up steps. The batch size is set to 64. We use the clean validation set of Librispeech to select the best model over 100 epochs.

The baselines for duration prediction include a DNN-style model proposed in [25] and an LSTM model introduced in [24]. The DNN performs a context expansion (seven left and seven right contexts). It uses six layers of 256 nodes, each with ReLU activation functions. The loss for this model is a cross-entropy loss. The LSTM model has three stacked LSTM layers with 256 nodes. We interpolate the cross-entropy loss with a mean square error loss, as suggested in [24]. In [24, 25], linguistic features are adopted, such as position in word/phrase/sentence, part of speech, etc. However, to make the models comparable in terms of features, we only use the phoneme indicators and some phonetic properties (long/short vowels, voiced/unvoiced consonants, plosive, affricative, nasal, etc.). All the training configuration follows the self-attention model but with the Adam optimizer.
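As a rough illustration of this configuration, the PyTorch sketch below assembles a duration predictor with the stated dimensions and the Noam learning-rate schedule; it substitutes standard Transformer encoder layers for the Gaussian-biased attention sketched in Section 2.3, omits the positional encoding and the length constraint for brevity, and every name is an assumption rather than the authors' code.

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """256-d phone embeddings, six 4-head self-attention blocks with
    1024-d feed-forward layers, and a linear frame-wise duration output."""

    def __init__(self, n_phones: int, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(n_phones, d_model)
        self.speed_proj = nn.Linear(1, d_model)  # speed = average phone duration
        layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=1024, dropout=0.1)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.out = nn.Linear(d_model, 1)  # duration per phone position

    def forward(self, phones: torch.Tensor, speed: torch.Tensor) -> torch.Tensor:
        # phones: (T, B) phone indices; speed: (B, 1) average duration.
        x = self.embed(phones) + self.speed_proj(speed)[None]
        return self.out(self.encoder(x)).squeeze(-1)  # (T, B) durations

def noam_lr(step: int, warmup: int = 25_000, base_lr: float = 1e-3) -> float:
    """Noam schedule of [26], scaled so the peak reaches 0.001 at warm-up end."""
    step = max(step, 1)
    return base_lr * warmup ** 0.5 * min(step ** -0.5, step * warmup ** -1.5)

model = DurationPredictor(n_phones=70)  # phone-set size is illustrative
optimizer = torch.optim.Adam(model.parameters(), lr=1.0)  # scaled by noam_lr
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
criterion = nn.L1Loss()  # the paper trains with L1 loss
```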
Scoring Module: In the following experiments, the likelihoods p(o_t | a) and p(o_1, o_2, ..., o_t | a) are computed using the Viterbi approximation with the TDNN prediction and HMM alignment. For GOP and CaGOP, we determine the hyper-parameters (i.e., the phone-dependent thresholds for mispronunciation detection and β in CaGOP) using the CAPT development set. We compare five scoring modules in the experiments: 1) GOP; 2) center GOP, which only uses the center frame of each segment for scoring; 3) CaGOP, which is our proposed scoring model; 4) CaGOP-TA, which removes the transition factor from the CaGOP model; and 5) CaGOP-Dur, which removes the duration factor from the CaGOP model. From 4) and 5) we can analyze the contributions of the two factors in our proposed scoring module.
Evaluation Metrics: For duration prediction, we use the Mean Absolute Error (MAE) as the evaluation metric. For mispronunciation detection, we adopt accuracy and the F1 measure. The labels (i.e., mispronounced or not) are imbalanced (in our case, more than 80% of the phones are labeled as correct), so the F1 measure is the more reliable metric. As stated in [11], a long context of speech can serve as a better indicator of overall proficiency. Therefore, we also compute a sentence-level score as the mean score of the phonemes throughout the whole utterance. For the sentence-level evaluation, we use the Pearson correlation coefficient (PCC) and the Spearman correlation coefficient (SCC), where the PCC focuses on numeric correlation and the SCC focuses on ranking correlation. The metric for the sentence-level score is the mean of the correlation coefficients between each human rater and the scoring algorithm under evaluation.
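For reference, all of these metrics are available in standard libraries; the toy arrays below are purely illustrative, not data from the paper.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical phone-level decisions (1 = mispronounced) and
# sentence-level scores from one human rater vs. one scoring method.
y_true = np.array([1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0])
print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))

rater = np.array([7.5, 4.0, 8.0, 6.0])
system = np.array([-0.4, -1.2, -0.3, -0.8])  # e.g., mean CaGOP per utterance
print("PCC:", pearsonr(rater, system)[0])
print("SCC:", spearmanr(rater, system)[0])
```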
4. Experimental Results
Table 1 presents the frame-level phonetic duration prediction performance on the TIMIT dataset. It shows that our self-attention based duration model achieves a 25% relative improvement in MAE compared with the best baseline, the LSTM system.

Table 1: Frame-level Duration Prediction on TIMIT

Model          | MAE (ms)
DNN [23]       | 89.87
LSTM [24]      | 42.74
Self-attention |
The results of the mispronunciation detection task are shown in Table 2. β is set to 0.1 after tuning on the CAPT development set. Our proposed CaGOP significantly outperforms GOP by more than 14% absolute (20% relative) on both accuracy and F1. Specifically, the transition factor contributes 10% absolute improvement, while the duration factor contributes 4% absolute improvement. Meanwhile, center GOP, which only focuses on the center of the phonetic segment, obtains a better result than GOP as well. This indicates that there are indeed phonetic transitions within the forced alignment: after removing them from GOP, the scoring module performs better. Since the performance of center GOP is worse than that of CaGOP-Dur, which considers the transition factor, we can also conclude that the phonetic transitions are not always located around the boundaries of the aligned segments. The result shows that our transition factor can successfully identify those "skewed" phonetic transitions.

The sentence-level scoring in Table 3 shows a similar trend. While the human raters' PCC and SCC scores reach around 0.5, our CaGOP module achieves 0.392 PCC and 0.416 SCC. Compared with the GOP baseline, CaGOP achieves 12% relative improvement on both correlation metrics. Both the transition and duration factors contribute to the sentence-level scoring.

In addition to the above results, we also observe some interesting facts regarding the posterior-probability entropy of the acoustic module. First, the entropy of a phonetic segment is always "U"-shaped. Second, the phonetic transitions (entropy peaks) depend on vocal tract shapes: for phones with similar vocal tract shapes, the transitions are not very pronounced (e.g., AH and N). Empirically, the "U" shapes are often right-skewed, which we assume might be due to the acoustic module's structure.

Table 2: Model Performance on Mispronunciation Detection

Method     | Accuracy (%) | F1 (%)
GOP [29]   | 53.44        | 65.81
center GOP | 60.88        | 72.98
CaGOP      |              |
CaGOP-Dur  | 64.07        | 75.85
CaGOP-TA   | 56.80        | 70.73
Table 3: Correlation between Human Raters & Scoring Methods

Method     | PCC   | SCC
Human      | 0.502 | 0.501
GOP [29]   | 0.343 | 0.360
center GOP | 0.363 | 0.387
CaGOP      | 0.392 | 0.416
CaGOP-Dur  | 0.377 | 0.399
CaGOP-TA   | 0.375 | 0.393
5. Conclusion
To deal with the limitations of GOP scoring, we propose the context-aware GOP (CaGOP) scoring model in this work, which injects two context-related factors into the model: the transition factor and the duration factor. The transition factor is represented using the frame-wise posterior-probability entropy. The duration factor is based on the duration mismatch, which is computed using a duration model with a self-attention network. Experimental results prove the effectiveness of our proposed CaGOP scoring model, which achieves 20% relative improvement at the phoneme level and 12% relative improvement at the sentence level over the GOP baselines. Both factors contribute to the performance boost. Our proposed duration model also achieves 25% relative improvement for phonetic duration prediction on the TIMIT dataset.
6. Acknowledgment
This work was partially supported by the National Natural Science Foundation of China (No. 61772535) and the Beijing Natural Science Foundation (No. 4192028).

7. References

[1] A. Neri, C. Cucchiarini, H. Strik, and L. Boves, "The pedagogy-technology interface in computer assisted pronunciation training," Computer Assisted Language Learning, vol. 15, no. 5, pp. 441–467, 2002.
[2] K. Evanini and X. Wang, "Automated speech scoring for non-native middle school students with multiple task types," in Interspeech, 2013, pp. 2435–2439.
[3] A. Metallinou and J. Cheng, "Using deep neural networks to improve proficiency assessment for children English language learners," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[4] S. Xu, J. Jiang, Z. Chen, and B. Xu, "Automatic pronunciation error detection based on linguistic knowledge and pronunciation space," in ICASSP. IEEE, 2009, pp. 4841–4844.
[5] Y.-B. Wang and L.-S. Lee, "Supervised detection and unsupervised discovery of pronunciation error patterns for computer-assisted language learning," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 564–579, 2015.
[6] S. M. Witt, "Automatic error detection in pronunciation training: Where we are and where we need to go," in Proceedings of the International Symposium on Automatic Detection of Errors in Pronunciation Training, vol. 1, 2012.
[7] S. Witt and S. Young, "Computer-assisted pronunciation teaching based on automatic speech recognition," in Language Teaching and Language Technology, Groningen, The Netherlands, 1997.
[8] Y. Tsubota, T. Kawahara, and M. Dantsuji, "Recognition and verification of English by Japanese students for computer-assisted language learning system," in Seventh International Conference on Spoken Language Processing, 2002.
[9] W. Hu, Y. Qian, and F. K. Soong, "A new DNN-based high quality pronunciation evaluation for computer-aided language learning (CALL)," in Interspeech, 2013, pp. 1886–1890.
[10] X. Qian, H. Meng, and F. K. Soong, "The use of DBN-HMMs for mispronunciation detection and diagnosis in L2 English to support computer-aided pronunciation training," in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[11] K. Li, X. Qian, and H. Meng, "Mispronunciation detection and diagnosis in L2 English speech using multidistribution deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 1, pp. 193–207, 2016.
[12] Y. Qian, K. Evanini, X. Wang, C. M. Lee, and M. Mulholland, "Bidirectional LSTM-RNN for improving automated assessment of non-native children's speech," in Interspeech, 2017, pp. 1417–1421.
[13] J. van Doremalen, C. Cucchiarini, and H. Strik, "Using non-native error patterns to improve pronunciation verification," in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[14] L. Zhang, H. Li, and L. Ma, "Exploit posterior probability algorithm for pronunciation quality evaluation," Journal of Computational Information Systems, vol. 8, pp. 9251–9258, 2012.
[15] G. Fant, Auditory Analysis and Perception of Speech. Elsevier, 2012.
[16] C. E. Shannon, "A mathematical theory of communication," ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3–55, 2001.
[17] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP. IEEE, 2015, pp. 5206–5210.
[18] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2011.
[19] R. Tong, N. F. Chen, B. Ma, and H. Li, "Context aware mispronunciation detection for Mandarin pronunciation training," in Interspeech, 2016, pp. 3112–3116.
[20] M. Zeleny, Linear Multiobjective Programming. Springer Science & Business Media, 2012, vol. 95.
[21] J. Pylkkonen and M. Kurimo, "Duration modeling techniques for continuous speech recognition," in Eighth International Conference on Spoken Language Processing, 2004.
[22] K. Tokuda, K. Hashimoto, K. Oura, and Y. Nankaku, "Temporal modeling in neural network based statistical parametric speech synthesis," in SSW, 2016, pp. 106–111.
[23] G. E. Henter, S. Ronanki, O. Watts, M. Wester, Z. Wu, and S. King, "Robust TTS duration modelling using DNNs," in ICASSP. IEEE, 2016, pp. 5130–5134.
[24] B. Chen, T. Bian, and K. Yu, "Discrete duration model for speech synthesis," in Interspeech, 2017, pp. 789–793.
[25] X. Wei, M. Hunt, and A. Skilling, "Neural network-based modeling of phonetic durations," in Interspeech, 2019, pp. 1751–1755.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[27] M. Sperber, J. Niehues, G. Neubig, S. Stüker, and A. Waibel, "Self-attentional acoustic models," in Interspeech, 2018, pp. 3723–3727.
[28] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, 1993.
[29] S. M. Witt, "Use of speech recognition in computer-assisted language learning," Ph.D. dissertation, University of Cambridge, Cambridge, United Kingdom, 1999.
[30] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[31] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Interspeech, 2016, pp. 2751–2755.
[32] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[33] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, "Semi-orthogonal low-rank matrix factorization for deep neural networks," in Interspeech, 2018, pp. 3743–3747.
[34] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.