Black-box Adaptation of ASR for Accented Speech
Kartik Khandelwal, Preethi Jyothi, Abhijeet Awasthi, Sunita Sarawagi
Indian Institute of Technology Bombay, Mumbai, India
{kartikk,pjyothi,awasthi,sunita}@cse.iitb.ac.in

Abstract
We introduce the problem of adapting a black-box, cloud-based ASR system to speech from a target accent. While leading online ASR services obtain impressive performance on mainstream accents, they perform poorly on sub-populations: we observed that the word error rate (WER) achieved by Google's ASR API on Indian accents is almost twice the WER on US accents. Existing adaptation methods either require access to model parameters or overlay an error-correcting module on output transcripts. We highlight the need for correlating outputs with the original speech to fix accent errors. Accordingly, we propose a novel coupling of an open-source accent-tuned local model with the black-box service, where the output from the service guides frame-level inference in the local model. Our fine-grained merging algorithm is better at fixing accent errors than existing word-level combination strategies. Experiments on Indian and Australian accents, with three leading ASR models as the service, show that we achieve up to 28% relative reduction in WER over both the local and service models.
Index Terms: black-box ASR systems, accented speech recognition, adaptation.
1. Introduction
The emergence of cloud-based AI services for tasks like machine translation and speech recognition has greatly increased the accessibility of machine learning. These services are powered by sophisticated engines trained on large proprietary datasets. The internals of these engines are often not exposed to clients, and a client's input often comes from a different domain than the training domain of the server. The existing fix of retraining to adapt to new target domains is not an option in this case. This leads us to our problem of black-box adaptation.
In this work, the task of interest is automatic speech recognition (ASR) in English, and the domains correspond to different accents. While leading online services like the Google ASR API [1] attain superior performance on high-resource English accents, they perform poorly on a large number of under-represented English accents: the API gives word error rates (WERs) above 22% on our datasets in Australian and Indian accents, almost twice its WER on US accents. As ASR systems start getting deployed in several critical applications, it is increasingly imperative to design light-weight methods of accent adaptation to provide fair access to users of all regions and ethnicities [2]. Existing methods of adapting ASR models to a specific accent [3, 4, 5, 6, 7, 8] require modifying model parameters, which is not an option in the black-box setting.

One could implement black-box adaptation in the form of an error-correction model that alters the outputs from the service [9]. When the mismatch is in the language model, error correction using a domain-specific language model has been proposed before [10]. However, to recover from accent errors an error correction model would need to correlate the service's transcript with the original speech; to handle speech, the model would in turn need to incorporate an ASR system. This leads us to switch our perspective, so that black-box adaptation amounts to building a local ASR system that is retargeted to correct accent errors in the service's output.

The local ASR system would be an open-source ASR architecture like DeepSpeech2 [11], pretrained on a publicly available corpus like the US-accented LibriSpeech [12] corpus and further fine-tuned using a small amount of data in the target accent. Typically the local model would be less accurate than the service everywhere except on the parts with systematic accent differences. If the outputs from the local and service models are combined via standard combination approaches at the word or transcript level [13, 14], we obtain only limited improvements in accuracy over the service. In other words, if the local system were also used as a black-box, we would not obtain the performance improvements we seek.

Hence, we exploit our white-box access to the local system. Our idea, at a high level, is to use the transcript obtained from the service to guide the inference of the local ASR system. Our guided inference algorithm (named FineMerge) aligns the characters in the service transcript with input frames using a Viterbi-like decoding and then selectively modifies the frame-level distribution of the local model. Our fine-grained merging step is easy to plug into existing speech pipelines, fast during inference, and specifically tailored to fixing accent errors: we often recover words that were absent from the stand-alone outputs of both the local and service models. Experiments with different service APIs on two different English accents show that FineMerge provides a significant reduction in WER over either the local or service models, and over existing methods of combining them at the word level. To summarize, our overall contributions in this paper are:
1. We introduce the problem of black-box accent adaptation of ASR service APIs.
2. We propose an efficient coupling of a local white-box model with a black-box service to adapt to an accent with limited labeled data, without incurring the cost of accessing the service during training.
3. We design a novel guided inference algorithm on the local model that is specifically tailored to correct focused accent errors in an otherwise strong service API.
4. We evaluate our algorithm on two accents and three service APIs. Our approach provides up to 28% reduction in relative WER over both local and service models. Existing methods based on rescoring N-best lists or combining outputs at the word level are not as effective.
2. Related Work
Accent Adaptation in Speech.
Accent adaptation in speech has been a problem of long-standing interest. One category of methods attempts to create accent-invariant systems, ranging from early approaches that simply merged data from multiple accents to train a single model [15] to more recent work that uses adversarial learning objectives to extract accent-invariant feature representations from speech [16, 17]. A second category of methods is accent-dependent and adapts to the speaker's accent. Early approaches were HMM-based acoustic model adaptation and pronunciation model augmentation with accent-specific pronunciations [18, 19]. Within neural models, accent adaptation was achieved via accent-specific output layers [3, 4] and hierarchical models in a multitask learning setting [8]. More recent work jointly learns an accent classifier and accent-dependent models [5, 6, 7]. Our method is also accent-dependent, but we need to adapt a black-box service model. We build local accent-adapted ASR systems, which are in turn guided during inference by service predictions.

Black-box ASR Systems.
Speech transcription services have seen widespread use in recent years. However, the underlying ASR systems in these services are black-box systems. Adapting such models to a client's needs would be of great utility, but prior work in this area is sparse. [20] shows how to optimize black-box ASR systems, and [21] shows how to improve confidence estimates produced by such black-box systems. Another closely related work [10] uses a domain-specific language model and a semantic parser to rescore the hypotheses from a black-box ASR system. Unlike their method, we achieve a more fine-grained integration of our client model with the service.
System Combination Approaches.
Ours can be viewed as a type of system combination approach, which has seen wide use in ASR. ROVER (Recognizer Output Voting Error Reduction) [13] is one of the most popular techniques: it first combines predictions from different systems using an alignment step, followed by a weighted voting step. Prior work on dialectal speech recognition [14] observed that using the best output from a dialect-specific model is more accurate than techniques like ROVER. Unlike ROVER, which considers each individual system as a black-box, our method leverages white-box access to a local accent-adapted ASR system and is thus more targeted at correcting accent errors and ultimately more accurate.
3. Our Approach
Given an audio signal $x$, we invoke the service model $S$ on $x$ and get the transcript $s$ comprising tokens $s_1, \ldots, s_k$, along with token-level confidences $p_1, \ldots, p_k$. In addition, the client can invoke a local white-box model $C$ that has been trained or fine-tuned on a limited amount of accented labeled data. On input $x$, let $c = c_1, \ldots, c_r$ denote the transcript from the local model $C$ with token-level confidences $q = q_1, \ldots, q_r$. In general, the number of tokens in the two outputs $(k, r)$ could be different.

One option to merge the transcripts of the two models is a word-level aligner like ROVER [13]. However, for accent errors we expect the service to be wrong only on a sub-part of a word, say a 't' being wrongly identified as a 'd'. The local transcript $c$ might correct some accent errors while missing out on other parts of the word. In general, the local model is expected to be weaker than the service on all but the accent errors, for the client to want to pay for the service. As an example, consider the first sentence in Table 3, showing the gold transcript $y$, service transcript $s$, and local transcript $c$ for an Indian-accented model. The service model fails to recognize the t in "toasted" and outputs "posted". The local model recognizes the t but yields "to state". To reconstruct the correct word in such cases we need a finer-grained splicing at the sub-word level.

Given the prevalence of character-level models in modern ASR systems, we then sought to splice the two transcripts at the character level. Designing a good character-level merging strategy is challenging because of large divergences between the two outputs, both because of the differential strengths of their acoustic models and because of the introduction of unheard characters when biasing with their respective language models. Strategies like combining the characters from the two outputs using ROVER-like algorithms fail to distinguish between the two types of errors in the absence of accurate character-level confidences from the service. For example, aligning the characters in "posted" with "to state" yielded "too sttd".

Table 1: Example of the client model revising its frame-level character distribution $P \rightarrow P^s$ using the service transcript $s$ = "posted" in FineMerge, where $d_t = \arg\max_c P_t(c)$ and $r_t = \arg\max_c P^s_t(c)$. The first row shows the aligned service characters $S_t$ and their probabilities under $P$, the second row shows the modes $d_t$ of $P$, and the third row shows the argmax $r_t$ of the revised distribution with its probability under both the revised and original distributions.

We finally designed an algorithm that exploits white-box access to the local model $C$ to guide its decoding using the service transcript $s$, instead of merging a fixed $c$ from $C$. We assume the local model $C$ is trained using the standard CTC loss invoked on frame-level character distributions [22], which maximizes the likelihood of the target $y$ by marginalizing over all alignments compatible with $y$. During inference, the trained model generates the distribution over alignments for an input $x$ and predicts character distributions $P_1, \ldots, P_T \triangleq P$ at each of the $T$ frames of the input.
From these probability distributions, an output sequence $c$ is recovered using beam decoding in conjunction with a language model (LM). We guide this inference using the service transcript in two steps: first, align the service characters with each frame of the local model using its frame-level probability distributions $P$; next, revise $P$ to selectively support $s$. We elaborate on these steps next. Pseudo-code appears in Algorithm 1.
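To make the decoding step concrete, here is a minimal Python sketch of CTC decoding, using greedy (best-path) decoding as a simplified stand-in for the LM-fused beam search used here; the names `probs` and `alphabet` and the blank-at-index-0 convention are our assumptions, not details from the paper:

```python
import numpy as np

def ctc_greedy_decode(probs: np.ndarray, alphabet: str, blank: int = 0) -> str:
    """Best-path CTC decoding: pick the argmax character at each of the
    T frames, collapse consecutive repeats, then drop blanks."""
    best = probs.argmax(axis=1)  # shape (T,): most likely character per frame
    out = []
    for i, c in enumerate(best):
        if (i == 0 or c != best[i - 1]) and c != blank:  # collapse, drop blank
            out.append(alphabet[c])
    return "".join(out)

# Example vocabulary with the blank symbol at index 0:
# transcript = ctc_greedy_decode(P, "_abcdefghijklmnopqrstuvwxyz' ", blank=0)
```

Beam search replaces the per-frame argmax with a search over prefixes scored jointly with the LM, but the same collapse-and-drop rule maps frame sequences to transcripts.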
Aligning service characters.
Our first step is to expand out the characters in $s$ over the $T$ frames by repeating characters or inserting blanks so as to maximize the probability of the aligned characters as per $P_1, \ldots, P_T$. Let $S$ denote the highest-probability expanded character sequence. An example is shown in Table 1, where $s$ = "posted" is aligned over $T = 10$ frames and the resulting $S$ is shown in the first row. The full $P$ cannot be shown, but we show the probability of the aligned character below it and the maximizing character probability in the second row. Such a forced alignment of $s$ with $P$ can be solved optimally using a simple Viterbi-like dynamic programming algorithm. The algorithm processes $s$ time-synchronously over the $T$ frames such that either a symbol from $s$ or a blank is produced as output at each frame. This is referred to as "Viterbi-align" in Algorithm 1. Successfully aligning the service characters requires an additional consideration: the server's output $s$ contains characters that can be attributed both to accent errors and to cascaded language model errors. We therefore smooth the $P$ distribution by adding a small constant to all probability entries so that even unheard characters get non-zero probability.
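As a concrete illustration of the Viterbi-align step, the sketch below force-aligns the service transcript over the $T$ frames using the standard CTC state topology (blanks interleaved between characters, with stay/advance/skip transitions). It is a minimal reading of the description above, assuming smoothed frame-level log-probabilities `log_probs` (shape T × |V|, e.g. `np.log(probs + eps)` for a small constant `eps`) and a `char_to_idx` vocabulary map; neither name comes from the paper:

```python
import numpy as np

def viterbi_align(service_text: str, log_probs: np.ndarray,
                  char_to_idx: dict, blank: int = 0) -> list:
    """Force-align the service transcript over T frames: at every frame
    either a blank or a service character is emitted, maximizing the
    total log-probability under the local model's frame distributions.
    Returns the per-frame character indices S_1..S_T."""
    T = log_probs.shape[0]
    labels = [blank]                       # CTC topology: _ s1 _ s2 ... sk _
    for ch in service_text:
        labels += [char_to_idx[ch], blank]
    U = len(labels)

    dp = np.full((T, U), -np.inf)          # best score ending at (frame, state)
    bp = np.zeros((T, U), dtype=int)       # how many states we advanced (0/1/2)
    dp[0, 0] = log_probs[0, labels[0]]
    if U > 1:
        dp[0, 1] = log_probs[0, labels[1]]
    for t in range(1, T):
        for u in range(U):
            cands = [dp[t - 1, u]]                              # stay
            if u >= 1:
                cands.append(dp[t - 1, u - 1])                  # advance
            if u >= 2 and labels[u] != blank and labels[u] != labels[u - 2]:
                cands.append(dp[t - 1, u - 2])                  # skip a blank
            k = int(np.argmax(cands))
            dp[t, u] = cands[k] + log_probs[t, labels[u]]
            bp[t, u] = k
    # End in the final blank or final character, whichever scores higher.
    u = U - 1 if U < 2 or dp[T - 1, U - 1] >= dp[T - 1, U - 2] else U - 2
    aligned = [0] * T
    for t in range(T - 1, -1, -1):
        aligned[t] = labels[u]
        if t > 0:
            u -= bp[t, u]
    return aligned
```

The smoothing ensures that even characters the local acoustic model never 'heard' retain a non-zero-probability path through this alignment.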
Algorithm 1: The FineMerge Inference Algorithm
Input: $x$: input audio with $T$ frames; $C$: local model fine-tuned on the target accent; $\psi$: service probability threshold; $\omega$: service weight for mixing; $\gamma$: probability of blank
Output: final transcript
  $s, p \leftarrow$ transcript and token-level confidences from the service on $x$
  $P_1, \ldots, P_T \leftarrow$ frame-level probabilities from $C(x)$
  $S_1, \ldots, S_T \leftarrow$ Viterbi-align($s$, Smooth($P$))
  for $t \leftarrow 1$ to $T$ do
    if $\psi < P_t[S_t] < \max_c P_t[c]$ then
      $\omega_t \leftarrow \gamma$ if $S_t$ is blank, else $\omega \cdot p[\text{word index of } S_t]$
      $P^s_t \leftarrow (1 - \omega_t) P_t + \omega_t \cdot \text{oneHot}(S_t)$
    else
      $P^s_t \leftarrow P_t$
  $P^s \leftarrow P^s_1, \ldots, P^s_T$
  return Beam-decode using $P^s$ and the local LM of $C$

Revising $P$ with $s$, $p$.
Now each frame $t$ is aligned with a character $S_t$ from the service. We need to revise $P$ so as to 'support' the aligned service characters while ignoring those characters that may have been erroneously introduced during LM-based decoding. For this, we boost the probability of the aligned character in $P_t$ on those frames $t$ where the probability $P_t(S_t)$ is less than the maximum probability in $P_t$ but greater than a threshold $\psi$. The lower limit $\psi$ suppresses those characters in $s$ which are not 'heard' at all by the client's acoustic model and are likely to have been introduced by the LM. The amount of boosting is the product of a hyper-parameter $\omega$ and the confidence of the parent word of $S_t$. If $S_t$ is blank, we use a fixed probability $\gamma$. We use $P^s$ to denote the $P$ distribution after this revision with $s$. In Table 1 we show the mode of the revised distribution $P^s$ in row 3. Note how 'p' in frame 2 was ignored in favor of the gold character 't' since $P(p)$ has a very small probability (1e-11). In frame 8, the 'e' from the service was used to boost the probability of 'e' in the $P$ distribution from 0.44 to 0.66, and likewise in frame 9. Greedy decoding on the revised distribution yields "to sted", which is closer to the gold token "toasted" than either the service token "posted" or the local token "to state". Beam decoding on the revised $P^s$ recovers the gold token.

The above merging algorithm is simple and requires tuning only three hyper-parameters. Since the client's labeled data is limited, we found that more elaborate attention-based merging models using several parameters did not perform well.
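Below is a matching Python sketch of the revision step from Algorithm 1. The bookkeeping arrays `word_index` (which service word each aligned character belongs to) and `word_conf` (the service's per-word confidences $p$) are our naming for plumbing the paper leaves implicit:

```python
import numpy as np

def revise_distribution(P: np.ndarray, aligned: list, word_conf: list,
                        word_index: list, psi: float, omega: float,
                        gamma: float, blank: int = 0) -> np.ndarray:
    """Revise frame-level distributions P (T x |V|) towards the aligned
    service characters S_t, as in Algorithm 1: boost only frames where
    the local model partially 'heard' S_t (probability above psi but
    below the frame's mode), weighting by the service word confidence."""
    P_s = P.copy()
    for t in range(P.shape[0]):
        s_t = aligned[t]
        if psi < P[t, s_t] < P[t].max():
            w_t = gamma if s_t == blank else omega * word_conf[word_index[t]]
            one_hot = np.zeros(P.shape[1])
            one_hot[s_t] = 1.0
            P_s[t] = (1.0 - w_t) * P[t] + w_t * one_hot
    return P_s
```

Beam decoding with the local LM then runs on `P_s` exactly as it would on `P`; only the frame-level distributions change.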
4. Experiments
We evaluate FineMerge on three accents and three service combinations, and contrast it against four other methods. We present anecdotes and analyze the kind of accent adaptations we achieve.

Datasets
We used the Mozilla Common Voice v4 (MCV-v4) dataset. The dataset is crowd-sourced and contains 1,118 hours of validated speech data in varying accents. We obtained around 28K Indian, 27K Australian, and 63K British accented utterances. For each accent, we split the data into train, validation, and test sets roughly in the ratio 85-5-10, ensuring no overlap of speakers or transcripts across sets. The MCV-v4 audio clips were normalized in a preprocessing step. Our code is available at https://github.com/Kartik14/FineMerge.
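The speaker-disjoint split described above can be made by partitioning speakers rather than utterances. A rough sketch under assumed field names (`speaker_id`; the 85-5-10 ratio is from the text), leaving out the additional transcript de-duplication:

```python
import random
from collections import defaultdict

def speaker_disjoint_split(utts, train_frac=0.85, valid_frac=0.05, seed=0):
    """Split utterances (dicts with a 'speaker_id' key) roughly 85-5-10
    so that no speaker appears in more than one set."""
    by_speaker = defaultdict(list)
    for u in utts:
        by_speaker[u["speaker_id"]].append(u)
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n = len(speakers)
    c1, c2 = int(train_frac * n), int((train_frac + valid_frac) * n)
    flatten = lambda group: [u for s in group for u in by_speaker[s]]
    return flatten(speakers[:c1]), flatten(speakers[c1:c2]), flatten(speakers[c2:])
```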
Service and Local Models
We used the Google Cloud Speech-to-Text API [1] as our default service model, and include two other service models later. For the local model, we used the DeepSpeech2 (DS2) [11] model pretrained on the LibriSpeech corpus [12] and then fine-tuned individually for each accent. We used a trigram LM trained on sentences from the MCV-v4 corpus after removing sentences overlapping with the test sets. The DS2 decoding parameters $\alpha$ (LM weight) and $\beta$ (to encourage more words) were tuned on the validation set for each accent, as were the hyper-parameters $\omega$, $\psi$, $\gamma$ of our method.

Methods compared
We measure word error rates (WER) of five different models: the service model, the local model, ROVER [13] on the confidence-weighted transcripts of the service and local models, LM rescoring of the top-N whole transcripts from the service, and our FineMerge method.
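For completeness, a minimal sketch of the WER metric used in all tables (a word-level Levenshtein distance normalized by reference length; the CER used in the later analysis is the same recursion over characters):

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))          # DP row against the empty reference
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1,              # deletion
                                   d[j - 1] + 1,          # insertion
                                   prev + (rw != hw))     # substitution/match
    return d[len(h)] / max(len(r), 1)
```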
Overall Results
In Table 2 we show the WERs on the Indian, Australian, and British accents for these five methods. Observe that overall the error rate of Service is lower than that of the accent-adapted Local model. ROVER's word-level merging provides significantly better results than either of the two, indicating that the two models exhibit complementary strengths. LM rescoring does not improve results much, establishing that the local LM may not have much impact on the improved results. Our algorithm FineMerge provides the greatest gains in WER over all methods. For Australian, we obtain a 28% relative reduction in WER over both the service and client models. Table 3 presents some anecdotes which show how the fine-grained merging enables us to recover the highlighted word, even when neither the service nor the client transcript contains that word.

Table 2: Overall comparison of WER and CER for Indian, Australian, and British accented data.

Method       WER: Ind   Aus     UK      CER: Ind   Aus     UK
Local        27.99      24.41   25.06   16.98      14.55   14.28
Service      22.32      23.52   20.82   11.96      13.27   11.20
Rover        21.12      18.04   18.10   11.95      9.81    9.88
LM rescore   22.10      23.42   20.96   12.10      13.56   11.56
FineMerge    18.45      16.90   -       -          -       -
Comparing methods of character alignment
A centerpiece of our method is Viterbi-aligning $s$ with the frame-level character probability distribution. We show that this achieves a character-level alignment that is more accurate than existing methods by focusing on the character error rate (CER) before beam decoding. The CER columns of Table 2 present the CER of Local (before LM decoding), Service (as is), ROVER applied at the character level to these two, LM rescoring, and FineMerge after selecting the modes of the revised distribution $P^s$, i.e., before LM decoding. We observe that FineMerge's CER is much lower, particularly for the Indian accent. This confirms that the main reason for our gains is our novel frame-level fine-grained merging algorithm.

Table 3: Anecdotes comparing transcripts of Indian and Australian accented speech from five different methods.

            INDIAN                         AUSTRALIAN
Gold        everyone toasted the ..        nora finds herself ugly ..
Service     everyone posted the ..         nora van to self ugly ..
Local       everyone to state the ..       nor iphones herself ugly ..
Rover       everyone to posted the ..      nor to self ugly ..
FineMerge   everyone toasted the ..        nora finds herself ugly ..

Gold        for a brief time ..            hannelore is an ..
Service     soda beef time ..              i don't know what is an ..
Local       for a breese time ..           hailar is an ..
Rover       for a beef time ..             i don't know what is an ..
FineMerge   for a brief time ..            hannelore is an ..

Gold        the condition also occurs..    ..rope a bull while on a
Service     definition of circus..         ..work a bowl while on a
Local       the condition also acres..     ..rope the ball while on a
Rover       the definition also circus..   ..work a bowl while on a
FineMerge   the condition also occurs..    ..rope a bull while on a
Varying Quality of Service Model
In addition to the default Google Speech API service (G-US), we evaluate two other models as the service: a second Google speech-to-text model (G-Video) [23] meant for transcribing audio from video files, which works significantly better for MCV-v4 utterances because of their low fidelity, and Jasper [24], a recent end-to-end convolutional neural ASR model trained on the LibriSpeech dataset. We note here that we opted for G-US rather than Google's ASR API for Indian English because of the latter's poor performance (compared to G-US) on the low-bandwidth MCV-v4 utterances. Table 4 shows the results. The WER of Local stays the same since the service plays no role during its training. We see a wide difference in accuracies across the different services. G-Video is the most accurate, but even in this case FineMerge obtains a relative WER reduction of at least 3%. The Jasper model is worse than the Indian fine-tuned local model, yet FineMerge achieves more than 15% relative WER reduction with respect to both service and local. This shows that the hyper-parameters of our service-guided local inference adapt even to a weaker service.

Table 4: Effect of changing the service model (WER).

             Indian                       Australian
Method       G-US    G-Video   Jasper    G-US    G-Video   Jasper
Local        27.99   27.99     27.99     24.41   24.41     24.41
Service      22.32   13.77     31.82     23.52   11.08     19.56
Rover        21.12   20.51     26.95     18.04   13.84     17.57
LM rescore   22.10   13.37     31.38     23.42   10.99     19.35
FineMerge    18.45   -         -         16.90   -         -
Importance of Accent Adaptation
One interesting question is whether our gains were merely due to ensembling any two independent models adapted to the test data domain, or whether we specifically adapted to the accent. To answer this, we run FineMerge with a local model fine-tuned on a similarly-sized MCV corpus from a different accent. Table 5 compares our WER to the WER obtained when the client model for each accent is instead fine-tuned on a similarly-sized US-accented sample.

Table 5: WER comparison with different local models.

Test      Service   FineMerge with     FineMerge with
accent              (ind/aus)-local    us-local
Indian    22.32     18.45              21.01
Aus       23.52     16.90              20.66
Figure 1: Highest reductions in error per word on Indian-accented test samples (per-word error rate, FineMerge vs. Service).
Observe that FineMerge outperforms the service even when the local model is fine-tuned on a different accent. This captures the base benefit of ensembling. However, after fine-tuning on data of its own accent, the gains are higher: for the Australian accent, the service WER of 23.52 drops to 20.66 with FineMerge on an Indian local model, but drops further to 16.90 with an Australian local model.

Figure 1 shows the largest reductions in errors per word on Indian test samples obtained by FineMerge over the service. Error rates are cut in half for most words, revealing FineMerge's ability to do accent adaptation. The word "however" is an interesting example to highlight. The diphthong /AW/ in "however" has a wide range of phonetic realizations across Indian speakers and has been investigated in prior work [25]. This variability is difficult for the service to model accurately, while FineMerge substantially cuts the errors on "however". Another interesting example is "were". The phonemes /v/ and /w/ are indistinguishable in most Indian languages, making minimal pairs like veil and wail homophones when articulated by Indian speakers. /DH/-initial words like "then", "these", "their" and "there" are other likely targets of accent errors due to the lack of dental fricatives like /DH/ in most Indian languages. FineMerge is able to substantially reduce these errors.
5. Conclusion and Future Work
In this paper we motivated and introduced the problem of black-box adaptation of an ASR service. We presented a novel coupling of an open-source accent-adapted model with the black-box service model to fix accent errors in an otherwise strong service model. We presented FineMerge, an algorithm that achieves a fine-grained mixing of the service output and the local model's frame-level distributions. We showed that such fine-grained mixing is specifically effective at fixing accent errors that word-level mixing cannot fix. Our strategy achieves up to 28% reduction in word error rate over service APIs of varying grades of quality. Future work could consider combining outputs from multiple services and fixing both dialect and accent differences.

6. References

[1] "Cloud Speech-to-Text API." [Online]. Available: https://cloud.google.com/speech-to-text/docs/reference/rest
[2] A. Koenecke, A. Nam, E. Lake, J. Nudell, M. Quartey, Z. Mengesha, C. Toups, J. R. Rickford, D. Jurafsky, and S. Goel, "Racial disparities in automated speech recognition," Proceedings of the National Academy of Sciences, vol. 117, no. 14, pp. 7684–7689, 2020.
[3] Y. Huang, D. Yu, C. Liu, and Y. Gong, "Multi-accent deep neural network acoustic model with accent-specific top layer using the KLD-regularized model adaptation," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[4] M. Chen, Z. Yang, J. Liang, Y. Li, and W. Liu, "Improving deep neural networks based multi-accent Mandarin speech recognition using i-vectors and accent-specific top layer," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[5] X. Yang, K. Audhkhasi, A. Rosenberg, S. Thomas, B. Ramabhadran, and M. Hasegawa-Johnson, "Joint modeling of accents and acoustics for multi-accent speech recognition," in ICASSP. IEEE, 2018, pp. 1–5.
[6] A. Jain, M. Upreti, and P. Jyothi, "Improved accented speech recognition using accent embeddings and multi-task learning," in Proceedings of Interspeech, 2018.
[7] T. Viglino, P. Motlicek, and M. Cernak, "End-to-end accented speech recognition," 2019.
[8] K. Rao and H. Sak, "Multi-accent speech recognition with hierarchical grapheme based models," in ICASSP. IEEE, 2017, pp. 4815–4819.
[9] E. K. Ringger and J. F. Allen, "Error correction via a post-processor for continuous speech recognition," in ICASSP, vol. 1. IEEE, 1996, pp. 427–430.
[10] R. Corona, J. Thomason, and R. Mooney, "Improving black-box speech recognition using semantic parsing," in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2017, pp. 122–127.
[11] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning, 2016, pp. 173–182.
[12] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in ICASSP. IEEE, 2015, pp. 5206–5210.
[13] J. G. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 1997, pp. 347–354.
[14] V. Soto, O. Siohan, M. Elfeky, and P. Moreno, "Selection and combination of hypotheses for dialectal speech recognition," in ICASSP. IEEE, 2016, pp. 5845–5849.
[15] D. Vergyri, L. Lamel, and J.-L. Gauvain, "Automatic speech recognition of multiple accented English data," in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[16] S. Sun, C.-F. Yeh, M.-Y. Hwang, M. Ostendorf, and L. Xie, "Domain adversarial training for accented speech recognition," in ICASSP. IEEE, 2018, pp. 4854–4858.
[17] Y.-C. Chen, Z. Yang, C.-F. Yeh, M. Jain, and M. L. Seltzer, "AIPNet: Generative adversarial pre-training of accent-invariant networks for end-to-end speech recognition," arXiv preprint arXiv:1911.11935, 2019.
[18] J. J. Humphries, P. C. Woodland, and D. Pearce, "Using accent-specific pronunciation modelling for robust speech recognition," in Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP '96), vol. 4. IEEE, 1996, pp. 2324–2327.
[19] Y. Zheng, R. Sproat, L. Gu, I. Shafran, H. Zhou, Y. Su, D. Jurafsky, R. Starr, and S.-Y. Yoon, "Accent detection and speech recognition for Shanghai-accented Mandarin," in Ninth European Conference on Speech Communication and Technology, 2005.
[20] S. Watanabe and J. Le Roux, "Black box optimization for automatic speech recognition," in ICASSP. IEEE, 2014, pp. 3256–3260.
[21] A. Kastanos, A. Ragni, and M. Gales, "Confidence estimation for black box automatic speech recognition systems using lattice recurrent neural networks," arXiv preprint arXiv:1910.11933, 2019.
[22] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.
[23] "Cloud Speech-to-Text API video model." [Online]. Available: https://cloud.google.com/speech-to-text/docs/transcription-model
[24] J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde, "Jasper: An end-to-end convolutional neural acoustic model," arXiv preprint arXiv:1904.03288, 2019.
[25] O. Maxwell and J. Fletcher, "The acoustic characteristics of diphthongs in Indian English."