Text-Conditioned Transformer for Automatic Pronunciation Error Detection
Zhan Zhang^a, Yuehai Wang^a,*, Jianyi Yang^a
^a Department of Information and Electronic Engineering, Zhejiang University, China
Abstract
Automatic pronunciation error detection (APED) plays an important role in the domain of language learning. In previous ASR-based APED methods, the decoded results need to be aligned with the target text so that the errors can be found. However, since the decoding process and the alignment process are independent, the prior knowledge about the target text is not fully utilized. In this paper, we propose to use the target text as an extra condition for the Transformer backbone to handle the APED task. The proposed method can output the error states with consideration of the relationship between the input speech and the target text in a fully end-to-end fashion. Meanwhile, as the prior target text is used as a condition for the decoder input, the Transformer works in a feed-forward manner instead of autoregressively in the inference stage, which can significantly boost the speed in actual deployment. We set the ASR-based Transformer as the baseline APED model and conduct several experiments on the L2-Arctic dataset. The results demonstrate that our approach can obtain an 8.4% relative improvement on the F1 score metric.

Keywords: automatic pronunciation error detection (APED), computer-assisted pronunciation training (CAPT), Transformer

* Corresponding author
Email addresses: [email protected] (Zhan Zhang), [email protected] (Yuehai Wang), [email protected] (Jianyi Yang)
1. Introduction

With the quick development of globalization and education, the number of language learners is rapidly increasing. However, most learners are faced with the problem of teacher shortage or of finding a proper time to follow systematic learning. Thus, computer-assisted language learning (CALL) [1] systems have recently been studied to offer a flexible education service, which can be used to meet language learning requirements in fragmented time. In particular, oral practice is an important part of daily communication, and computer-assisted pronunciation training (CAPT) [2] systems are designed for this task. Such systems generally play the role of automatic pronunciation error detection (APED). The APED system first gives a predefined utterance text (and a reference speech of a professional teacher if needed), and the learner tries to pronounce this target text correctly. By accurately detecting the pronunciation errors and providing precise feedback, the APED system guides the learner to correct their pronunciation towards the target utterance and improve their speaking ability.

APED has been widely studied for decades. Depending on how the matching degree between the student's pronounced speech and the standard pronunciation is evaluated, several comparison-based or goodness of pronunciation (GOP) methods have been proposed to solve the APED task [3, 4, 5, 6, 7, 8]. Recently, with the rising trend of neural networks and the development of automatic speech recognition (ASR) technologies, some end-to-end APED models [9, 10] have been studied to simplify the workflow. They use ASR backbones to recognize the canonical pronunciation and locate the errors based on the alignment between the predicted phonemes and the standard phonemes. The ASR-based methods can significantly decrease the deployment effort compared with conventional GOP methods or comparison-based methods. In particular, the Transformer structure [11] has recently shown a strong capability for sequence-to-sequence modelling and achieves promising performance in ASR tasks [12, 13, 14, 15]. Thus, we choose the Transformer as the backbone for APED tasks in this paper.

However, the main deficiency of the Transformer for APED tasks is that the autoregressive decoding slows down the inference [16]. Unfortunately, the APED task generally requires the system to give a quick response about the errors so that the learners can adapt their pronunciations and evaluate again. Another consideration is that, for ASR-based APED, the decoded text sequence needs to be aligned with the target text to detect the errors. Since the target text is already known in advance, it is a waste to ignore this prior knowledge during the autoregressive inference. On the one hand, the length of the target text is fixed, but the autoregressive decoding is length-agnostic. On the other hand, the recognized sequence is generally close to the prior target text in this evaluation task. These two factors inspire us to use the target text as an extra input for the network.

In this paper, we propose a Transformer-based APED workflow, which can incorporate both the audio feature and the text information and output the error states directly. Compared with ASR-based methods, which optimize the recognition result to improve the APED performance, the proposed method works in a fully end-to-end manner. Thus, the proposed method can optimize the APED metric directly.
We observe an 8.4% relative improvement on the F1 score for the L2-Arctic dataset [17] with the proposed method. Meanwhile, by using the prior target text as an input condition, the inference process works in a feed-forward manner rather than autoregressively, which can significantly boost the inference speed, as suggested in [18, 19].

The rest of this paper is organized as follows. In Section 2, we analyze the related works on the APED task and how they inspired the text-conditioned Transformer. In Section 3, we present the baseline ASR-based APED method for comparison and describe the proposed method in detail. Next, we analyze the results obtained by the conventional methods and the proposed method in Section 4. Finally, we conclude this paper in Section 5.

2. Related Works

From the perspective of language learning, an error detected by an APED system can be described as a produced pronunciation that is nonstandard. In other words, the pronounced speech deviates too far from the standard target speech. Based on this simple idea, comparison-based APED methods [3, 4, 5, 6] have been explored. These methods generally adopt dynamic time warping (DTW) [20] algorithms to align the extracted features of the input speech with those of the standard target speech. Depending on the distance between each text unit, the pronunciation quality score can be calculated. As a result, the comparison-based methods need to prepare a standard speech for reference, which is inconvenient when evaluating a new utterance.

Apart from directly comparing to a specific standard speech, the input speech can also be evaluated by whether a standard acoustic model can recognize each phoneme. In particular, the likelihood of each phoneme has proven to be an effective feature for indicating whether an error happens, and such a likelihood-based scoring method is often referred to as GOP [7, 8]. In practice, this approach utilizes the hidden Markov model (HMM) to model the sequential phone states. The likelihood score is calculated from the force-aligned states and the open phone states. Since the first proposal of GOP by [7], many variants [21, 22, 23, 24, 25] have been studied to adapt its original equation for a better measurement of the goodness.

With the rise of deep learning, the performance of ASR tasks has been greatly improved. Thus, by utilizing the advanced acoustic model of an ASR system and recognizing the input speech, ASR-based APED can be another efficient approach to detect the errors. Such a method can also avoid the deployment effort of conventional HMM-based GOP methods or comparison-based DTW methods, and several ASR-based APED systems have been proposed [9, 10]. Currently, ASR systems are generally built upon the CTC loss [26] or the attention mechanism [27] to handle the sequential features. The main deficiency of the CTC loss is its independence assumption. Such an assumption may not be valid for continuous speech. The ASR performance is reported to be better when combining the CTC loss with the attention mechanism [28] or using the Transformer structure [14, 15].
In particular, the Transformer structure, which was originally designed to handle natural language processing (NLP) problems [29, 30], has been successfully utilized in several other domains, such as computer vision (CV) [31, 32], and speech-related tasks including text to speech (TTS) [33, 34, 18, 19], voice conversion (VC) [35], and ASR [12, 13].

Despite the convenience of ASR-based APED systems, alignment is still an inevitable process to obtain the final evaluation results. The recognized phonemes should be aligned with the target phonemes to find the mispronunciations. As the alignment process is not integrated into the backward optimization of the ASR model, such a method is not fully end-to-end. In other words, the decoding process and the evaluation process are independent. However, intuitively, human raters will first keep the target text in mind and then compare the input speech against it to find out where the errors take place. Focusing on the prior target text limits the search space for the decoding process. The Extended Recognition Network (ERN) [36] utilizes this idea to incorporate prior knowledge about common mispronunciations into the HMM states. However, the predefined error HMM paths lead to bad performance when faced with unseen mispronunciations. Despite its weakness, ERN still shows that prior knowledge is of vital importance for improving the performance of APED tasks. This inspires us to directly take the prior target text as an extra condition, together with the speech features, as input. Meanwhile, the attention mechanism is a logical approach to fuse both the speech feature and the text feature. Thus, the Transformer is an ideal backbone to start with.

However, although the ASR performance of the Transformer is reported to be better in [14, 15], Transformer-based methods generally adopt autoregressive decoding to predict the next entity. This leads to slow inference, which can be a deficiency for an APED system. On the contrary, Transformers that work in a feed-forward manner can greatly boost the speed [16, 18, 19]. As analyzed in [16], since the output target is already known in the training stage, the Transformer can run in parallel, whereas this prior does not exist in the inference stage, and the Transformer must run sequentially. However, for the APED task, if we can utilize the prior text to be evaluated, the aforementioned limitation no longer exists.

Based on the analysis above, we propose the text-conditioned Transformer for the APED task. We give a detailed description of the proposed method in the next section.
3. Proposed Method
In this section, we first show the conventional ASR-based APED workflow for comparison. Next, we demonstrate the proposed fully end-to-end workflow and describe the network structure and its training method in detail.
Figure 1: Workflow for the ASR-based APED method. The alignment process is independent from the decoding process.
A typical workflow for the ASR-based APED is depicted in Figure 1. The training dataset is generally constructed from three parts: the target text to be read, the collected speech, and the canonical pronounced text marked by professional teachers. Based on this dataset, an ASR model is trained to recognize the canonical phoneme-level text p = (p_1, p_2, ..., p_n, p_{n+1}) from the extracted audio features x = (x_1, x_2, ..., x_m). The cross-entropy loss is used between the predicted phonemes \hat{p} and the canonical phonemes p:

l_{asr} = CrossEntropy(\hat{p}, p),    (1)

where p_{n+1} = ⟨EOS⟩, which is the end-of-sentence tag.

In the inference stage, the Transformer works quite differently from the training stage. The Transformer uses autoregressive decoding to recognize the canonical phonemes sequentially. The recognized phoneme string ends with ⟨EOS⟩.
Next, the Needleman-Wunsch algorithm [37] is applied to align the recognized sequence \hat{p} with the target phonemes t = (t_1, t_2, ..., t_k). After the alignment process, the error states e = (e_1, e_2, ..., e_k), which take the target phonemes into consideration, can be returned to the user. An alignment example is shown in Table 1. We can observe that this sample includes 1 deletion and 2 substitution errors. The mispronounced phonemes, whose error states are marked as 1, can be returned to the users.

Table 1: Alignment sample
Target words:  IF YOU ONLY COULD KNOW HOW I THANK YOU
Target:        IH F  Y  UW OW N  L  IY K  UH D  N  OW HH AW AY TH AE NG K  Y  UW
Pronounced:    IH F  Y  UW AO N  L  IY K  UH -  N  AO HH AW AY TH AE NG K  Y  UW
Error states:  0  0  0  0  1  0  0  0  0  0  1  0  1  0  0  0  0  0  0  0  0  0
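To make the alignment step of the ASR-based baseline concrete, the following is a minimal sketch, not the authors' released code, of a Needleman-Wunsch aligner that maps the recognized phonemes onto the target phonemes and marks every substituted or deleted target phoneme with error state 1; the scoring values (match = 1, mismatch = gap = -1) are illustrative assumptions.

```python
# Sketch of the baseline's alignment step: global alignment of the recognized
# phoneme string against the target phonemes, then per-target error states.

def needleman_wunsch(target, recognized, match=1, mismatch=-1, gap=-1):
    """Global alignment of two phoneme sequences; returns aligned (target, recognized) pairs."""
    n, m = len(target), len(recognized)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if target[i - 1] == recognized[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, preferring diagonal moves.
    aligned, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if target[i - 1] == recognized[j - 1] else mismatch):
            aligned.append((target[i - 1], recognized[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned.append((target[i - 1], "-"))       # deletion: target phoneme not pronounced
            i -= 1
        else:
            aligned.append((None, recognized[j - 1]))  # insertion: extra phoneme, no target slot
            j -= 1
    return aligned[::-1]

def error_states(target, recognized):
    """One binary state per target phoneme: 1 = substituted or deleted."""
    return [0 if t == r else 1
            for t, r in needleman_wunsch(target, recognized) if t is not None]

# The sample of Table 1 (two OW->AO substitutions and a deleted D).
target = "IH F Y UW OW N L IY K UH D N OW HH AW AY TH AE NG K Y UW".split()
recognized = "IH F Y UW AO N L IY K UH N AO HH AW AY TH AE NG K Y UW".split()
print(error_states(target, recognized))  # 1s at the two OW positions and at the deleted D
```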
For better clarification, we summarize the training and the inference stage of the ASR-based model in Table 2. We use a 39-dim Mel frequency cepstral coefficients (MFCC) feature as the encoder input. The start-of-sentence tag (⟨SOS⟩) and the right-shifted 1-dim label of the canonical phonemes are concatenated as the decoder input in the training stage. This input is replaced by ⟨SOS⟩ and the autoregressively decoded phoneme string in the inference stage. The decoder tries to predict the probability of the next phoneme and ⟨EOS⟩ for output. There are in total 42 tags for classification, including 39 phonemes and ⟨SOS⟩, ⟨EOS⟩, ⟨PAD⟩.
Table 2: Training and inference summary of the ASR-based Transformer

Training stage:
  Encoder input:   speech features (len = m, dim = 39)
  Decoder input:   ⟨SOS⟩ + canonical phonemes, shifted (len = 1 + n, dim = 1)
  Decoder output:  canonical phonemes + ⟨EOS⟩ (len = n + 1, dim = 42), loss l_{asr}

Inference stage:
  Encoder input:   speech features (len = m, dim = 39)
  Decoder input:   ⟨SOS⟩ + recognized phonemes (ends with ⟨EOS⟩, dim = 1)
  Decoder output:  next recognized phoneme (ends with ⟨EOS⟩, dim = 42)

We should note that there are several lengths defined for the described sequences. First, the attention mechanism is adopted to match the speech features (length = m) and the recognized phonemes (length = n + 1). Next, the alignment operation is applied to find the error states, whose length is equal to that of the target phonemes (length = k). However, such an alignment operation is performed in the inference stage and is thus not jointly optimized with the ASR model. This dilemma inspires us to integrate the alignment operation, or the target text, into the training stage.
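To make the training/inference asymmetry summarized in Table 2 concrete, here is a minimal sketch, not the authors' implementation, assuming a generic encoder-decoder callable model(audio_features, decoder_input_ids) that returns per-step logits over the 42 tags; the tag ids and the maximum length are placeholders.

```python
# Teacher-forced training step (one parallel pass) vs. autoregressive greedy
# decoding (one forward pass per emitted phoneme) for the ASR-based baseline.
import torch

SOS, EOS, MAX_LEN = 40, 41, 200  # 39 phonemes + special tags; ids are placeholders

def training_step(model, feats, canonical):
    """Teacher forcing: the whole shifted target is fed at once."""
    decoder_in = torch.cat([torch.tensor([SOS]), canonical])        # <SOS> + p_1..p_n
    decoder_target = torch.cat([canonical, torch.tensor([EOS])])    # p_1..p_n + <EOS>
    logits = model(feats, decoder_in.unsqueeze(0))                  # (1, n+1, 42)
    return torch.nn.functional.cross_entropy(logits.squeeze(0), decoder_target)  # Eq. (1)

@torch.no_grad()
def greedy_decode(model, feats):
    """Autoregressive inference: the phonemes are emitted one by one."""
    hyp = [SOS]
    for _ in range(MAX_LEN):
        logits = model(feats, torch.tensor(hyp).unsqueeze(0))
        next_id = int(logits[0, -1].argmax())
        if next_id == EOS:
            break
        hyp.append(next_id)
    return hyp[1:]  # recognized phonemes, later aligned to the target text
```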
Figure 2: Workflow for the proposed APED method. We move the alignment process into the data preparing stage. The proposed model can directly output the error states. Meanwhile, the auxiliary accent and phoneme classification tasks are adopted.
As shown in Figure 2, for the proposed method, we move the alignment operation into the data preparing stage. We align the canonical phonemes and the target phonemes to obtain where the errors occur in advance.
Figure 3: Network architecture of the text-conditioned Transformer. We append an accent classifier after the encoder to extract the L1-related information. Target phonemes are used as an extra condition for the decoder input. The error states are obtained in a feed-forward manner. Meanwhile, phoneme classification is also performed as an auxiliary task. The mispronounced word "APPLE" is shown in this figure for demonstration.

The auxiliary accent classification loss l_a is the cross-entropy between the predicted accent \hat{a} and the ground truth accent a presented in the dataset:

l_a = CrossEntropy(\hat{a}, a).    (2)

Since the speech evaluation dataset is scarce, we first obtain a basic acoustic model by training the model on ASR datasets. The training process is similar to the conventional ASR-based APED methods discussed in Sec. 3.1, and the new ASR loss function is

l'_{asr} = l_{asr} + α l_a,    (3)

where α is the weight of the auxiliary accent task.

We further adapt this basic acoustic model to the APED task. A training and inference summary of the proposed model is shown in Table 3. We will discuss the details and the differences between the proposed model and the ASR-based model in the remaining paragraphs.

Table 3: Training and inference summary of the proposed Transformer
Training stage:
  Encoder input:     speech features (len = m, dim = 39)
  Encoder output:    accent (len = 1, dim = 6), loss l_a
  Decoder input:     ⟨SOS⟩ + target phonemes (len = 1 + k, dim = 1)
  Decoder output 1:  aligned canonical phonemes + ⟨EOS⟩ (len = k + 1, dim = 42), loss l_{asr}
  Decoder output 2:  ⟨SOS⟩ + error states (len = 1 + k, dim = 1), loss l_{eval}

Inference stage:
  Encoder input:     speech features (len = m, dim = 39)
  Encoder output:    accent (len = 1, dim = 6)
  Decoder input:     ⟨SOS⟩ + target phonemes (len = 1 + k, dim = 1)
  Decoder output 1:  canonical phonemes + ⟨EOS⟩ (len = k + 1, dim = 42)
  Decoder output 2:  ⟨SOS⟩ + error states (len = 1 + k, dim = 1)
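The following is a minimal sketch, not the authors' code, of the auxiliary accent branch and the combined pre-training loss of Eqs. (2)-(3); the pooling shown is the GlobalMean variant, the layer sizes follow the settings reported in Section 4, and alpha = 0.1 is a placeholder rather than the value used in the paper.

```python
# Utterance-level accent classification over the encoder states plus the
# combined pre-training loss l'_asr = l_asr + alpha * l_a.
import torch
import torch.nn as nn

class AccentHead(nn.Module):
    def __init__(self, d_model=512, num_accents=6):
        super().__init__()
        self.linear = nn.Linear(d_model, num_accents)

    def forward(self, encoder_states):          # (batch, frames, d_model)
        pooled = encoder_states.mean(dim=1)     # GlobalMean over the time axis
        return self.linear(pooled)              # (batch, 6) accent logits

def pretraining_loss(asr_loss, accent_logits, accent_labels, alpha=0.1):
    """Eq. (3); alpha is a placeholder weight for the auxiliary accent task."""
    l_a = nn.functional.cross_entropy(accent_logits, accent_labels)   # Eq. (2)
    return asr_loss + alpha * l_a
```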
Firstly, for the auxiliary accent classification task, while the input audio features are sequential, the accent is a 1-dim global attribute. We try to process the sequential data with gated recurrent units (GRU) [41] or a simple GlobalMean. Experiments in Section 4 show that GlobalMean performs a little better. Note that there are 6 kinds of accent in the dataset used in our experiments.

Secondly, the prior target phonemes are used as an extra condition for the decoder input instead of the canonical pronounced phonemes, in both the training and the inference stage. For the audio features x and a certain target phoneme t_i (the target phoneme at step i), the decoder output is adapted to \hat{e}_i, which indicates the matching degree between the audio features and t_i. As we use a binary state to judge its goodness, we use the sigmoid activation at the last layer for binary classification in Figure 3.

As the whole process is differentiable, we can directly optimize the loss between the predicted error states \hat{e} and the ground truth error states e. Several classification losses can be used for this model. We first apply a basic binary cross-entropy (BCE) loss between the predicted error states \hat{e} = (\hat{e}_0, \hat{e}_1, \hat{e}_2, ..., \hat{e}_k) and the ground truth error states e = (e_0, e_1, e_2, ..., e_k) as the evaluation loss,

l^{BCE}_{eval} = BCE(\hat{e}, e),    (4)

where e_0 = ⟨SOS⟩. A further discussion about the choice of loss functions is presented in Section 4.3.

However, compared with ASR-based methods, a binary state only concerns whether the target phoneme is correct or mispronounced. Thus, the model may lose information about the exact phoneme. To fix this, we still require the proposed model to conduct the ASR task with an auxiliary weight of β, and the whole loss function is

l = l_{eval} + β l_{asr} + α l_a.    (5)

The canonical phonemes to be recognized are aligned with the target phonemes for the proposed model, so that these two phoneme strings have equal length k + 1.

Lastly, we should note that the proposed model has consistent behavior in the training and the inference stage, as shown in Table 3. This characteristic makes the inference in our method faster than ASR-based autoregressive Transformers, and readers can refer to [18] for a comparison of inference latency.
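As a rough illustration of the two decoder outputs and the total loss of Eqs. (4)-(5), the sketch below assumes the decoder states for the 1 + k conditioned target phonemes are already computed; the linear heads and the alpha/beta placeholder weights are our assumptions, not the authors' exact configuration.

```python
# Per-target-phoneme error-state head (sigmoid) plus auxiliary phoneme
# classifier, combined with the accent loss as in Eq. (5).
import torch
import torch.nn as nn

class ErrorStateHeads(nn.Module):
    def __init__(self, d_model=512, num_tags=42):
        super().__init__()
        self.error_head = nn.Linear(d_model, 1)            # sigmoid -> error probability
        self.phoneme_head = nn.Linear(d_model, num_tags)   # auxiliary ASR output

    def forward(self, decoder_states):                     # (batch, 1 + k, d_model)
        error_prob = torch.sigmoid(self.error_head(decoder_states)).squeeze(-1)
        phoneme_logits = self.phoneme_head(decoder_states)
        return error_prob, phoneme_logits

def total_loss(error_prob, error_labels, phoneme_logits, phoneme_labels,
               accent_loss, alpha=0.1, beta=0.1):
    """l = l_eval + beta * l_asr + alpha * l_a (Eq. 5); alpha/beta are placeholders."""
    l_eval = nn.functional.binary_cross_entropy(error_prob, error_labels.float())  # Eq. (4)
    l_asr = nn.functional.cross_entropy(
        phoneme_logits.flatten(0, 1), phoneme_labels.flatten())
    return l_eval + beta * l_asr + alpha * accent_loss
```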
4. Experiment
We use the SpeechTransformer backbone proposed in [42] for the experiments. The SpeechTransformer is constructed with 6 encoder layers and 6 decoder layers in our experiments. Meanwhile, an attention modelling dimension d_{model} = 512, 4 attention heads, and a feed-forward dimension d_{ff} = 1024 are adopted. We extract the MFCC features of the audio files with the Kaldi toolkit [43]. These MFCC features are subsampled with a factor of n = 4 and stacked with m = 5 frames, which is the same as the settings in [42]; a minimal sketch of this frame stacking and subsampling is given below. We demonstrate the ASR performance for phoneme recognition in the first subsection, 4.1. Then we use this pretrained model, adapt it to the APED task, and show the results in the next subsection, 4.2. Finally, we analyze the loss functions and the behavior of the proposed model in the last two subsections, 4.3 and 4.4, respectively.
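The sketch below only illustrates the resulting dimensionality of the front-end just described (39-dim MFCC frames, stacked in groups of 5 and subsampled by a factor of 4); the exact striding and padding details of [42] may differ.

```python
# Frame stacking and subsampling of MFCC features (illustrative, not the
# authors' exact preprocessing).
import numpy as np

def stack_and_subsample(mfcc, stack=5, subsample=4):
    """mfcc: (num_frames, 39) -> roughly (num_frames // subsample, 39 * stack)."""
    num_frames, dim = mfcc.shape
    # Pad by repeating the last frame so every kept frame has `stack` frames to stack.
    padded = np.concatenate([mfcc, np.repeat(mfcc[-1:], stack, axis=0)], axis=0)
    stacked = np.concatenate(
        [padded[i:i + num_frames] for i in range(stack)], axis=1)   # (num_frames, 39*stack)
    return stacked[::subsample]                                     # keep every 4th stacked frame

features = stack_and_subsample(np.random.randn(1000, 39).astype(np.float32))
print(features.shape)   # (250, 195)
```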
We use Librispeech [44] as the dataset for ASR training. As the APED task focuses on phoneme-level errors, we first convert the dataset into phoneme-level transcriptions using the Montreal Forced Aligner tool [45]. Next, we train the Transformer on different parts of the trainset for 300 epochs, including train-clean-100, train-clean-460, and the whole train-960. We use dev-clean as the validation dataset to choose the best model and test-clean for the inference performance comparison. The Adam optimizer is used, with a learning rate of 10^{-…}. We use a CTC-based ASR model called Jasper5x3, proposed in [46], for comparison. We show the phone error rate (PER) performance in Table 4.

Table 4: Performance of PER on the Librispeech dataset.

              CTC-based                 Transformer-based
  train set   dev-clean   test-clean    dev-clean   test-clean
  100h        8.13%       8.50%         4.55%       8.11%
  460h        4.88%       5.50%         2.32%       4.24%
  960h        4.02%       4.23%         1.70%       3.17%

(The dev-clean results of the Transformer are obtained with teacher-forcing training.)

As we can see from the table, the attention-based Transformer structure generally performs better than the CTC-based method on PER. This observation is in accord with the conclusion in [14, 15], as the attention mechanism in the Transformer can capture more relevant information compared with the CTC loss, which holds the independence assumption.

Next, we conduct the APED task on the L2-Arctic dataset [17]. This corpus contains 26,867 utterances with 6 different accents, from 24 nonnative speakers. The 3,599 utterances annotated at the phoneme level are used for the APED task. The trainset, valset, and testset are divided 8:1:1. For the APED task, the model should strike a good balance between detecting the wrong pronunciations and accepting the correct ones. Thus, the F1 score is chosen as the main indicator of the performance. As defined in [47], the hierarchical evaluation structure is first divided into correct pronunciations and wrong pronunciations by the canonical pronounced phoneme. Next, depending on whether the predicted error state matches the ground truth label, the outcomes are further divided into true acceptance (TA), false rejection (FR), false acceptance (FA), and true rejection (TR). In other words, T/F indicates whether the prediction of the model is correct for the APED task, and A/R is the decision of the model. Based on this evaluation structure, the F1 score of the APED system is defined as follows:

Precision = TR / (TR + FR),    (6)
Recall = TR / (TR + FA),    (7)

F1 = 2 * Precision * Recall / (Precision + Recall).    (8)

The predicted binary error states (\hat{e}_1, \hat{e}_2, ..., \hat{e}_k) are first filtered by a threshold θ, turning each probability in (0, 1) into a binary integer in {0, 1}:

\hat{e}_i ← 1 if \hat{e}_i ≥ θ, and 0 otherwise.    (9)

Next, each outcome is calculated by the following equations:

TR = \sum_{i=1}^{k} \hat{e}_i * e_i,    (10)

FR = \sum_{i=1}^{k} \hat{e}_i * (1 − e_i),    (11)

FA = \sum_{i=1}^{k} (1 − \hat{e}_i) * e_i,    (12)

TA = \sum_{i=1}^{k} (1 − \hat{e}_i) * (1 − e_i).    (13)

Apart from the conventional classification-related metrics, including the F1 score, accuracy, precision, and recall, the false rejection rate (FRR) and the false acceptance rate (FAR) are also of vital importance to the APED task. They are calculated as follows:

FRR = FR / (TA + FR),    (14)

FAR = FA / (FA + TR).    (15)
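A compact sketch of Eqs. (6)-(15) as we read them, with the threshold θ left as a parameter to be supplied by the caller; this is our own illustration, not an official evaluation script.

```python
# APED evaluation: binarize predicted error probabilities and compute the
# hierarchical outcomes and the derived metrics.
import numpy as np

def aped_metrics(error_prob, error_labels, theta):
    """error_prob, error_labels: arrays of length k; theta: decision threshold."""
    e_hat = (np.asarray(error_prob) >= theta).astype(int)     # Eq. (9)
    e = np.asarray(error_labels).astype(int)
    tr = int(np.sum(e_hat * e))                 # true rejection,   Eq. (10)
    fr = int(np.sum(e_hat * (1 - e)))           # false rejection,  Eq. (11)
    fa = int(np.sum((1 - e_hat) * e))           # false acceptance, Eq. (12)
    ta = int(np.sum((1 - e_hat) * (1 - e)))     # true acceptance,  Eq. (13)
    precision = tr / (tr + fr) if tr + fr else 0.0            # Eq. (6)
    recall = tr / (tr + fa) if tr + fa else 0.0               # Eq. (7)
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0  # Eq. (8)
    acc = (tr + ta) / max(tr + ta + fr + fa, 1)
    frr = fr / (ta + fr) if ta + fr else 0.0                  # Eq. (14)
    far = fa / (fa + tr) if fa + tr else 0.0                  # Eq. (15)
    return {"precision": precision, "recall": recall, "F1": f1,
            "accuracy": acc, "FRR": frr, "FAR": far}
```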
We first conduct experiments to explore the auxiliary accent classification task. We start from the model obtained on the Librispeech dataset and train for another 200 epochs, with the learning rate decreased to 10^{-…}. We find that the GlobalMean method performs a little better than the GRU, as shown in Figure 4. We use α = 0.… in the following experiments.
Figure 4: F1 score comparison of GlobalMean and GRU.

Table 5: Comparison between different models.
Columns: accent classification, phoneme classification, FAR, FRR, Acc, Precision, Recall, F1.

GOP-based:
  GMM-HMM (Librispeech):              Precision 0.290, Recall 0.290, F1 0.290
ASR-based:
  Initial (Librispeech):              FAR 0.485, FRR 0.207, Acc 0.753, Precision 0.295, Recall 0.515, F1 0.375
  Fine-tuned on L2-Arctic (baseline): FAR 0.375, FRR 0.103, Acc 0.858, Precision 0.504, Recall 0.625, F1 0.558
Proposed:
  BCE loss, F1 loss, and focal loss variants, with ablations of the auxiliary accent and phoneme classification tasks (best: focal loss, γ = 0.5, F1 = 0.605).

(Footnote: the GOP result is taken from [17], Figure 4; it is trained on Librispeech train-960 and tested on the L2-Arctic dataset.)
Next, we adapt this pretrained ASR-based model to the proposed text-conditioned version. We still train the whole model for 200 epochs, with a learning rate of 10^{-…}. We set the ASR-based model without the auxiliary task as the baseline and the proposed methods with ablation for comparison. We find that, with β = 0.…, the F1 score is increased by nearly 0.1. If we simply use the target text as the condition and change the prediction target to the error states, the basic binary cross-entropy loss can bring a 0.19 improvement in terms of the F1 score. We discuss the effect of different loss functions for the proposed method in the next subsection.

First of all, as the F1 score is an important metric for the APED task, inspired by [48], we directly utilize a generalized F1 score to optimize the predicted error states. To make it differentiable, sums of probabilities are used instead of counts. That is, we do not apply Eq. (9) before calculating Eqs. (10)-(13). As we try to maximize the F1 score, the F1 evaluation loss of the proposed method is

l^{F1}_{eval} = 1 − F1.    (16)

Another consideration is that only 2.18% of the labelled phone segments are mispronounced in the L2-Arctic dataset, which may cause an imbalance between correct pronunciations and mispronunciations. Thus, we adopt the focal loss [49] to mine the hard labels. Formally, if we define e_t as

e_t = \hat{e} if e = 1, and 1 − \hat{e} otherwise,    (17)

the focal loss is

l^{focal}_{eval} = −(1 − e_t)^γ log(e_t),    (18)

where γ modulates how much the well-classified samples are down-weighted. When γ = 0, this loss function is equivalent to Eq. (4).

We apply the F1 loss function and the focal loss with different γ values to the proposed model. We can see from Table 5 that, when adopting the F1 loss function instead of the basic BCE loss, the result can be slightly improved. For the focal loss, we find that a small γ value (γ = 0.5 in our experiments) performs the best, and a bigger value leads to a degraded F1 score. Meanwhile, the auxiliary ASR task can boost the performance for all these loss functions. The focal loss version has the highest F1 score, 0.605, which is a relative 8.4% improvement over the baseline ASR-based method.
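For illustration, here is a minimal sketch of the two alternative evaluation losses just described: the probability-sum F1 loss of Eq. (16) and the focal loss of Eqs. (17)-(18). The value gamma = 0.5 follows the best setting reported above, while the eps terms are our additions for numerical stability, not part of the paper's formulation.

```python
# Differentiable F1 loss (counts replaced by probability sums) and focal loss
# over the predicted error probabilities.
import torch

def f1_loss(error_prob, error_labels, eps=1e-8):
    """l_eval^F1 = 1 - F1, with TR/FR/FA computed from probabilities, not counts."""
    e_hat, e = error_prob, error_labels.float()
    tr = torch.sum(e_hat * e)
    fr = torch.sum(e_hat * (1 - e))
    fa = torch.sum((1 - e_hat) * e)
    precision = tr / (tr + fr + eps)
    recall = tr / (tr + fa + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return 1 - f1                                                    # Eq. (16)

def focal_loss(error_prob, error_labels, gamma=0.5):
    """l_eval^focal = -(1 - e_t)^gamma * log(e_t); reduces to BCE when gamma = 0."""
    e_hat, e = error_prob, error_labels.float()
    e_t = torch.where(e == 1, e_hat, 1 - e_hat)                      # Eq. (17)
    return torch.mean(-((1 - e_t) ** gamma) * torch.log(e_t + 1e-8)) # Eq. (18)
```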
We further analyse the behavior of the proposed method. For the APED task, we need to make a trade-off between FAR and FRR. Meanwhile, as noted in [50], it is usually more unacceptable to take correct pronunciations as wrong ones (false rejection) than to regard mispronunciations as correct ones (false acceptance). We can observe from Table 5 that the proposed methods all have a higher FAR and a decreased FRR compared with the ASR-based models, which suggests our model obeys the former principle.

For actual deployment, as the proficiency level of the target language varies among different students, the trade-off between FAR and FRR should be easy to adjust. Compared with the ASR-based models, the proposed method can simply change the threshold θ to control how strict the APED system is. We further explore the effect of θ for different loss functions. The output probability distributions and the metrics are shown in Figure 5. Compared with the F1 loss version, the BCE loss and the focal loss versions have a more reasonably distributed output. As a result, they have a wider range of FAR and FRR when adjusting θ and can be a better choice for actual deployment.

Figure 5: Output probability distribution and the metrics for different θ parameters.

5. Conclusion

In this study, we propose a text-conditioned Transformer for automatic pronunciation error detection. By conditioning on the target phonemes as an extra input, the Transformer can directly evaluate the relationship between the input speech and the target phonemes. Thus, the error states are obtained in a fully end-to-end manner. Meanwhile, unlike the conventional autoregressive Transformer, the proposed method works in a feed-forward manner in both the training and the inference stage. We conduct a number of experiments to compare the performance of different methods and find that the proposed text-conditioned Transformer can boost the F1 score of the APED task on the L2-Arctic dataset. The proposed method has a more reasonable FAR and FRR, and the degree of strictness can be easily adjusted through the threshold θ.

References

[1] K. Beatty, Teaching and Researching: Computer-assisted Language Learning, Routledge, 2013. doi:10.4324/9781315833774.
[2] N. Stenson, B. Downing, J. Smith, K. Smith, The effectiveness of computer-assisted pronunciation training, Calico Journal (1992) 5–19.
[3] A. Lee, J. Glass, Pronunciation assessment via a comparison-based system, in: Speech and Language Technology in Education, 2013.
[4] A. Lee, Y. Zhang, J. Glass, Mispronunciation detection via dynamic time warping on deep belief network-based posteriorgrams, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2013, pp. 8227–8231. doi:10.1109/icassp.2013.6639269.
[5] A. Lee, N. F. Chen, J. Glass, Personalized mispronunciation detection and diagnosis based on unsupervised error pattern discovery, in: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 6145–6149. doi:10.1109/icassp.2016.7472858.
[6] A. Lee, J. Glass, A comparison-based approach to mispronunciation detection, in: 2012 IEEE Spoken Language Technology Workshop (SLT), IEEE, 2012, pp. 382–387. doi:10.1109/slt.2012.6424254.
[7] S. M. Witt, Use of speech recognition in computer-assisted language learning.
[8] S. Witt, S. Young, Phone-level pronunciation scoring and assessment for interactive language learning, Speech Communication 30 (2-3) (2000) 95–108. doi:10.1016/s0167-6393(99)00044-8.
[9] W.-K. Leung, X. Liu, H. Meng, CNN-RNN-CTC based end-to-end mispronunciation detection and diagnosis, in: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 8132–8136. doi:10.1109/icassp.2019.8682654.
[10] L. Zhang, Z. Zhao, C. Ma, L. Shan, H. Sun, L. Jiang, S. Deng, C. Gao, End-to-end automatic pronunciation error detection based on improved hybrid CTC/Attention architecture, Sensors 20 (7) (2020) 1809. doi:10.3390/s20071809.
[11] M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, M. Schuster, N. Shazeer, N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, Z. Chen, Y. Wu, M. Hughes, The best of both worlds: Combining recent advances in neural machine translation, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2018, pp. 5998–6008. doi:10.18653/v1/p18-1008.
[12] N. Moritz, T. Hori, J. Le, Streaming automatic speech recognition with the transformer model, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 6074–6078. doi:10.1109/icassp40776.2020.9054476.
[13] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, S. Kumar, Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 7829–7833. doi:10.1109/icassp40776.2020.9053896.
[14] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, et al., ESPnet: End-to-end speech processing toolkit, arXiv preprint arXiv:1804.00015.
[15] L. Dong, S. Xu, B. Xu, Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 5884–5888. doi:10.1109/icassp.2018.8462506.
[16] J. Gu, J. Bradbury, C. Xiong, V. O. Li, R. Socher, Non-autoregressive neural machine translation, arXiv preprint arXiv:1711.02281.
[17] S. Ding, C. Liberatore, S. Sonsaat, I. Lui, A. Silpachai, G. Zhao, E. Chukharev-Hudilainen, J. Levis, R. Gutierrez-Osuna, Golden speaker builder – an interactive tool for pronunciation training, Speech Communication 115 (2019) 51–66. doi:10.1016/j.specom.2019.10.005.
[18] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, T.-Y. Liu, FastSpeech: Fast, robust and controllable text to speech, in: Advances in Neural Information Processing Systems, 2019, pp. 3171–3180.
[19] K. Peng, W. Ping, Z. Song, K. Zhao, Parallel neural text-to-speech, arXiv preprint arXiv:1905.08459.
[20] D. J. Berndt, J. Clifford, Using dynamic time warping to find patterns in time series, in: KDD Workshop, Vol. 10, Seattle, WA, USA, 1994, pp. 359–370.
[21] Y. Kim, H. Franco, L. Neumeyer, Automatic pronunciation scoring for language instruction, in: 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE Comput. Soc. Press, 1997. doi:10.1109/icassp.1997.596227.
[22] H. Franco, L. Neumeyer, M. Ramos, H. Bratt, Automatic detection of phone-level mispronunciation for language learning, in: Sixth European Conference on Speech Communication and Technology, 1999.
[23] J. Proença, C. Lopes, M. Tjalve, A. Stolcke, S. Candeias, F. Perdigão, Detection of mispronunciations and disfluencies in children reading aloud, in: Interspeech 2017, ISCA, 2017, pp. 1437–1441. doi:10.21437/interspeech.2017-1522.
[24] W. Hu, Y. Qian, F. K. Soong, A new DNN-based high quality pronunciation evaluation for computer-aided language learning (CALL), in: Interspeech, 2013, pp. 1886–1890.
[25] J. Cheng, X. Chen, A. Metallinou, Deep neural network acoustic models for spoken assessment applications, Speech Communication 73 (2015) 14–27. doi:10.1016/j.specom.2015.07.006.
[26] A. Graves, S. Fernández, F. Gomez, J. Schmidhuber, Connectionist temporal classification, in: Proceedings of the 23rd International Conference on Machine Learning - ICML '06, ACM Press, 2006, pp. 369–376. doi:10.1145/1143844.1143891.
[27] W. Chan, N. Jaitly, Q. Le, O. Vinyals, Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, in: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 4960–4964. doi:10.1109/icassp.2016.7472621.
[28] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, T. Hayashi, Hybrid CTC/Attention architecture for end-to-end speech recognition, IEEE J. Sel. Top. Signal Process. 11 (8) (2017) 1240–1253. doi:10.1109/jstsp.2017.2763455.
[29] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, R. Salakhutdinov, Transformer-XL: Attentive language models beyond a fixed-length context, arXiv preprint arXiv:1901.02860.
[30] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805.
[31] L. Sampaio Ferraz Ribeiro, T. Bui, J. Collomosse, M. Ponti, Sketchformer: Transformer-based representation for sketched structure, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2020, pp. 14153–14162. doi:10.1109/cvpr42600.2020.01416.
[32] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, arXiv preprint arXiv:2005.12872.
[33] T. Okamoto, T. Toda, Y. Shiga, H. Kawai, Transformer-based text-to-speech with weighted forced attention, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 6729–6733. doi:10.1109/icassp40776.2020.9053915.
[34] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, Neural speech synthesis with Transformer network, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 2019, pp. 6706–6713.
[35] R. Liu, X. Chen, X. Wen, Voice conversion with Transformer network, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 7759–7759. doi:10.1109/icassp40776.2020.9054523.
[36] A. M. Harrison, W.-K. Lo, X.-J. Qian, H. Meng, Implementation of an extended recognition network for mispronunciation detection and diagnosis in computer-assisted pronunciation training, in: International Workshop on Speech and Language Technology in Education, 2009.
[37] V. Likic, The Needleman-Wunsch algorithm for sequence alignment, Lecture given at the 7th Melbourne Bioinformatics Course, Bio21 Molecular Science and Biotechnology Institute, University of Melbourne (2008) 1–46.
[38] C. Chang, First language phonetic drift during second language acquisition, Ph.D. thesis (10 2010).
[39] Y. Jiao, M. Tu, V. Berisha, J. Liss, Accent identification by combining deep neural networks and recurrent neural networks trained on long and short term features, in: Interspeech 2016, ISCA, 2016, pp. 2388–2392. doi:10.21437/interspeech.2016-1148.
[40] M. Tu, A. Grabek, J. Liss, V. Berisha, Investigating the role of L1 in automatic pronunciation evaluation of L2 speech, arXiv preprint arXiv:1807.01738.
[41] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078.
[42] Y. Zhao, J. Li, X. Wang, Y. Li, The SpeechTransformer for large-scale Mandarin Chinese speech recognition, in: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 7095–7099. doi:10.1109/icassp.2019.8682586.
[43] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., The Kaldi speech recognition toolkit, in: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society, 2011.
[44] V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: An ASR corpus based on public domain audio books, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2015, pp. 5206–5210. doi:10.1109/icassp.2015.7178964.
[45] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, M. Sonderegger, Montreal Forced Aligner: Trainable text-speech alignment using Kaldi, in: Interspeech 2017, Vol. 2017, ISCA, 2017, pp. 498–502. doi:10.21437/interspeech.2017-1386.
[46] J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, R. T. Gadde, Jasper: An end-to-end convolutional neural acoustic model, arXiv preprint arXiv:1904.03288.
[47] X. Qian, F. K. Soong, H. Meng, Discriminative acoustic model for improving mispronunciation detection and diagnosis in computer-aided pronunciation training (CAPT), in: Eleventh Annual Conference of the International Speech Communication Association, 2010.
[48] E. Eban, M. Schain, A. Mackey, A. Gordon, R. Rifkin, G. Elidan, Scalable learning of non-decomposable objectives, in: Artificial Intelligence and Statistics, 2017, pp. 832–840.
[49] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: 2017 IEEE International Conference on Computer Vision (ICCV), IEEE, 2017, pp. 2980–2988. doi:10.1109/iccv.2017.324.
[50] M. Eskenazi, An overview of spoken language technology for education, Speech Communication 51 (10) (2009) 832–844. doi:10.1016/j.specom.2009.04.005.
[51] H. H. Mao, A survey on self-supervised pre-training for sequential transfer learning in neural networks, arXiv preprint arXiv:2007.00800.
[52] V. Sanh, T. Wolf, A. M. Rush, Movement pruning: Adaptive sparsity by fine-tuning, arXiv preprint arXiv:2005.07683.

[Figure: decoder attention between the target phonemes and the canonical phonemes/target words for two L2-Arctic samples.]

(a) Sample arctic a0052 by speaker YDCK, "IT WAS A CURIOUS COINCIDENCE". The OW phoneme in "COINCIDENCE" is pronounced to be AO by mistake.
(b) Sample arctic a0129 by speaker ZHAA, "HER FACE WAS AGAINST HIS BREAST". The EY phoneme in "AGAINST" is pronounced to be EH by mistake.