An evaluation of word-level confidence estimation for end-to-end automatic speech recognition
Dan Oneață, Alexandru Caranica, Adriana Stan, Horia Cucu
University POLITEHNICA of Bucharest, Romania; Technical University of Cluj-Napoca, Romania
ABSTRACT
Quantifying the confidence (or, conversely, the uncertainty) of a prediction is a highly desirable trait of an automatic system, as it improves the robustness and usefulness in downstream tasks. In this paper we investigate confidence estimation for end-to-end automatic speech recognition (ASR). Previous work has addressed confidence measures for lattice-based ASR, while current machine learning research mostly focuses on confidence measures for unstructured deep learning. However, as ASR systems are increasingly being built upon deep end-to-end methods, there is little work that tries to develop confidence measures in this context. We fill this gap by providing an extensive benchmark of popular confidence methods on four well-known speech datasets. There are two challenges we overcome in adapting existing methods: working on structured data (sequences) and obtaining confidences at a coarser level than the predictions (words instead of tokens). Our results suggest that a strong baseline can be obtained by scaling the logits by a learnt temperature, followed by estimating the confidence as the negative entropy of the predictive distribution and, finally, sum pooling to aggregate at word level.
Index Terms: Confidence scoring, uncertainty estimation, automatic speech recognition, end-to-end deep learning
1. INTRODUCTION
Reasoning under uncertainty is one of the tenets of intelligence. The first step towards this goal is to endow systems with reliable uncertainty estimates of their predictions. Ideally, the larger the uncertainty, the more likely the prediction is erroneous. Alternatively, one can solve the complementary problem of confidence estimation: in this case, the more confident a prediction, the more likely the output is correct.

In the context of automatic speech recognition (ASR), confidence estimation can be of crucial importance for many end-user applications, as it improves the robustness of systems in safety-critical tasks, helps avoid errors in human-computer dialogue systems, and facilitates manual corrections in audio transcription tasks by flagging the errors. Moreover, previous research has leveraged confidence estimates for a number of downstream tasks: propagating uncertainties for automatic speech translation [1], selecting confident predictions for self-training [2], and manually annotating the less confident predictions for active learning [3].
In this paper we consider confidence estimation for end-to-end ASR systems, also known as lattice-free speech recognition [4]. End-to-end models for ASR have been gaining traction recently, as their performance matches that of classical ASR while having the additional benefits of being conceptually simple and allowing unified training [5, 6, 7]. However, there is surprisingly little work on confidence estimation for end-to-end speech recognition systems, most of the ongoing research on confidence estimation being carried out on computer vision tasks (image classification or segmentation). We believe that there are two main challenges in developing confidence scoring methods for ASR systems: the structured output and the granular predictions (e.g., tokens or graphemes versus words).

ASR systems are structured models (mapping sequences to sequences), as opposed to the usual recognition networks (such as image classifiers) whose output is a single label. The sequential nature of the output imposes a decoding step, which complicates not only the prediction but also the confidence scoring algorithm, as we need to estimate the confidence in an auto-regressive context (conditioned on the already predicted sequence). For this reason, we fix the predictions of a pre-trained ASR system and apply the confidence scoring methods on top of the token probabilities, which are conditioned on the fixed transcript.

In order to enable open-vocabulary predictions, end-to-end ASR systems usually use subword tokens to represent the output (byte-pair encoded tokens or even graphemes). However, given that the tokens lack semantics, for many downstream applications we are interested in estimating the confidence of words. To this end, we explore ways of aggregating the token-level uncertainty measures to larger units corresponding to words; in fact, the presented techniques can be extended to even coarser predictions, such as the sentence or utterance level.

In this context, our main contributions are the following: (i) we adapt several state-of-the-art uncertainty estimation methods to the end-to-end ASR pipeline; (ii) we propose and evaluate aggregation techniques to obtain user-relevant confidence estimates (i.e., word-level); (iii) we perform a thorough evaluation on multiple speech benchmark datasets. To the best of our knowledge, this is the first study that provides an in-depth analysis of confidence measures for end-to-end ASR.
2. RELATED WORK
In this section we review two lines of research that are related to our work.
Confidence scoring for speech recognition.
Most prior work on confidence scoring for ASR targets classical systems based on the HMM-GMM paradigm. These methods first extract a set of features from the decoding lattice, the acoustic model or the language model, and then train a classifier to predict whether the transcription is correct or not. Typical examples of features include the log-likelihood of the acoustic realization, the language model score, the word duration, and the number of alternatives in the confusion network [8, 9, 10]. More recently, Swarup et al. have augmented the feature set with deep embeddings of the input audio and the predicted text [11], while Errattahi et al. have shown the benefits of domain adaptation on the extracted features [12]. The classifiers employed by the confidence scoring methods range from conditional random fields [13, 14] and multi-layer perceptrons [15] to bidirectional recurrent neural networks [16, 17, 18].
Confidence scoring in end-to-end systems.
The baseline method for confidence estimation in neural networks is to directly use the probability of the most likely prediction [19]. However, neural networks tend to be overconfident, and the probability estimates can be improved through temperature scaling [20], which typically leads to better calibration [21, 22]. The most promising direction in terms of simplicity and usefulness involves Monte Carlo estimation: Gal and Ghahramani use dropout at test time to obtain multiple predictions, which are then averaged [23], while Lakshminarayanan et al. average the predictions over an ensemble of networks, usually trained with different initializations [24]. The latter has been shown to be very reliable on challenging out-of-domain datasets [25], but comes at a high cost [22]. The literature on general confidence scoring is rich and continually evolving; the most interesting research avenues involve Bayesian averaging [26], generative models [27, 28], input perturbations [29, 30], and exploiting inner activations [31, 32].

At the intersection of the two lines of research lies the recent work of Malinin and Gales [33], which, similarly to us, addresses the task of confidence estimation for end-to-end ASR systems. However, they are concerned with token- and sentence-level uncertainty estimation, while we are interested in estimation at the word level and, consequently, focus more on the aggregation techniques. Furthermore, they employ ensembles as their primary method of confidence estimation, while we also evaluate temperature scaling and dropout methods. Dropout was previously used for obtaining confidence scores for ASR [34], but our approaches differ: in [34] multiple hypotheses are generated via dropout and word confidences are then assigned based on their frequency of appearance in the aligned hypotheses; in contrast, we aggregate the posterior probabilities and not the hypotheses, which simplifies the procedure as it avoids the alignment step.
3. METHODOLOGY
This section presents the confidence estimation methodology and our proposed ways of improving it. We start with a description of the setup and the involved notation.

We consider a sequence-to-sequence model that maps an audio sequence a to a sequence of tokens t = (t_1, ..., t_T). The model is specified by the parameters θ, which are learned by minimizing losses such as the CTC or KL divergence on the training set. At test time the model outputs probabilities for the next token k in an auto-regressive manner, p(t_k | t̂_{1:k−1}, a) [35, 36, 37, 38].

To measure the confidence in a prediction at token level we use two variants:

• Log probability (log-proba) of the most probable prediction given by the classifier, that is s(t_k) = log max p_k, where p_k is the predictive distribution over the token vocabulary at step k. This type of feature has been shown to yield a strong baseline for the related tasks of misclassification and out-of-distribution detection [19].

• Negative entropy (neg-entropy) computed over the vocabulary of tokens at each time step, that is s(t_k) = p_k^⊤ log p_k. A large entropy means a large uncertainty or, conversely, a large negative entropy implies a confident prediction.

Aggregation. To obtain word-level features from the token-level ones, we experiment with three types of aggregation functions: sum, average, minimum. Since both proposed features are negative, summing across tokens will result in lower scores for longer words. Next we describe three ways of improving the conditional probabilities p(t_k | t̂_{1:k−1}, a) on which the features are computed.

Temperature scaling [20, 21] consists of dividing the logit activations (pre-softmax values) by a scalar τ (known as the temperature). The value of τ ranges from zero to infinity and controls the shape of the distribution: when τ → ∞ we obtain a uniform distribution, while when τ → 0 we obtain a Dirac distribution on the most likely output. Based on τ we update the token-level probabilities p_k at each time step k, as follows:

    p'_k = softmax(log(p_k) / τ).    (1)

We then extract the features s(t) on the updated probabilities p', aggregate them into the word-level score s(w) and, finally, classify the word as either correct or incorrect:

    P(correct) = σ(α · s(w) + β).    (2)

The variables α, β and τ are parameters and are learnt by optimizing the cross-entropy loss on a validation set. The labels are set at word level by aligning the ground-truth text with the transcription. Note that the parameters α and β do not change the ranking of the predictions, but allow us to learn a calibrated confidence model.

Dropout [39] is a technique that masks out random parts of the activations in a network, making the network less prone to overfitting. In [23] it has been observed that dropout induces a probability distribution over the weights of the network and can consequently be used for approximate Bayesian inference. We follow this idea and average the token probabilities obtained over multiple runs of dropout:

    p'_k = (1/N) Σ_{n=1}^{N} p̂_k^{(n)},    (3)

where p̂_k^{(n)} denotes the prediction of the n-th dropout run. While the original work [23] employed entropy as a confidence measure, there is no reason not to use other uncertainty features; we use the updated probabilities p' to extract both log-proba and neg-entropy features.

Ensembles [24] are based on the same idea of averaging predictions from multiple sources, but in this case the sets of weights come from independently trained networks (different random seeds used in initialization and batch selection). In our case, we average the token predictions over the N models:

    p'_k = (1/N) Σ_{n=1}^{N} p^{(n)}(t_k | t̂_{1:k−1}, a).    (4)
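To make the pipeline concrete, the sketch below strings together equations (1) and (2) with the two token-level features and the three aggregations. It is a minimal NumPy rendition under our own assumptions: in particular, the word_spans token-to-word mapping (which would come from the subword tokenizer) and all function names are hypothetical, not taken from the paper's code.

```python
import numpy as np

def temperature_scale(token_probs, tau):
    """Eq. (1): sharpen or flatten each token distribution by a temperature tau.
    token_probs: array of shape (T, V), one distribution per decoded token."""
    logits = np.log(token_probs + 1e-12) / tau
    logits -= logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

def token_features(token_probs, feature="log-proba"):
    """Token-level confidence features from Section 3."""
    if feature == "log-proba":
        # log of the probability of the most likely token at each step
        return np.log(token_probs.max(axis=-1) + 1e-12)
    if feature == "neg-entropy":
        # p^T log p: the negative entropy of each predictive distribution
        return np.sum(token_probs * np.log(token_probs + 1e-12), axis=-1)
    raise ValueError(f"unknown feature: {feature}")

def word_scores(token_scores, word_spans, agg="sum"):
    """Aggregate token scores to word level.
    word_spans: list of (start, end) token index pairs, one per word;
    this mapping is an assumption here (it comes from the tokenizer)."""
    pool = {"sum": np.sum, "avg": np.mean, "min": np.min}[agg]
    return np.array([pool(token_scores[s:e]) for s, e in word_spans])

def p_correct(word_score, alpha, beta):
    """Eq. (2): calibrated probability that a word is correct."""
    return 1.0 / (1.0 + np.exp(-(alpha * word_score + beta)))
```

As in the paper, α, β and τ would be fitted by minimizing the cross-entropy between p_correct and the word-level labels on a development set; the sketch only covers the forward computation.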
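The dropout and ensemble variants both reduce to averaging token posteriors over several sources before extracting the features above. A minimal sketch of that averaging, under the assumption (stated in Section 3) that the transcript is fixed by the pre-trained model and each source only rescores it; the callables and attribute names below are hypothetical:

```python
import numpy as np

def averaged_token_probs(predict_fns, audio, transcript):
    """Monte Carlo averaging behind Eq. (3) and Eq. (4): mean token
    posteriors over N sources, each scoring the same fixed transcript.
    predict_fns: list of callables, each returning an array of shape (T, V)."""
    return np.mean([fn(audio, transcript) for fn in predict_fns], axis=0)

# Dropout (Eq. 3): the same network called N = 64 times with dropout kept
# active at test time; `forward_with_dropout` is a hypothetical wrapper.
# p_prime = averaged_token_probs([forward_with_dropout] * 64, audio, transcript)

# Ensembles (Eq. 4): one callable per independently trained model.
# p_prime = averaged_token_probs([m.token_posteriors for m in models], audio, transcript)
```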
4. EXPERIMENTAL SETUP

In this section we describe the datasets used for evaluation, the ASR systems for which we build confidence estimates, and the evaluation metrics.

4.1. Datasets

We have opted for multiple publicly-available and widely-used datasets for our experimental setup.

LibriSpeech [40] is a corpus of approximately 1000 hours of read audiobooks derived from the LibriVox project. We use the dataset both for training the ASR and for evaluating the confidence scoring: for training we use the three splits clean100, clean360 and other500, while for development and evaluation we use the standard clean and other splits.

TED-LIUM 2 [41] consists of talks and their transcripts collected from the TED website. We use the dataset for evaluation and consequently employ only the predefined dev and test subsets.

CommonVoice [42] is a collaborative dataset of short transcripts that are read by people across the world; we use the first release of the dataset (https://common-voice-data-download.s3.amazonaws.com/cv_corpus_v1.tar.gz). The data is used for evaluation and we defined dev and test subsets by choosing 10% random samples for each of them.

Table 1 presents the test size of each evaluation dataset.

Table 1. Size of the datasets (test split) used for confidence estimation evaluation.

dataset        no. utts.   duration
Libri clean    2.6K        5.4 h
Libri other    2.9K        5.3 h
TED            1.1K        2.6 h
CommonVoice    66K         72 h

4.2. ASR systems

The main ASR system is based on the pre-trained LibriSpeech model provided by the ESPnet toolkit [43]. The model implements the transformer architecture [44], takes as input 80-dimensional Mel filter banks (extracted with the Kaldi toolkit [45]) and outputs a sequence of tokens. The token vocabulary has dimension 5000 and is obtained by subword segmentation based on a unigram language model [46]. The model is trained on the 960 h of the LibriSpeech dataset, which are further augmented using the SpecAugment techniques (time warping, frequency masking, time masking) [47]. For decoding we use a language model, which is also implemented as a transformer and is trained on the LibriSpeech transcriptions and another 14,500 public domain books [40]. The vocabulary of the language model consists of the same 5000 tokens as used by the ASR model.

For the ensemble experiments we re-train the ASR system using the same architecture and data, but different random seeds. We repeat the process four times, obtaining four independent models. Due to computational constraints, these models were trained for a smaller number of epochs than the main system (10 versus 120), but we observed that the validation loss curve began to flatten and that the test performance is reasonable (5.5% WER on Libri clean vs. 2.7% obtained by the pre-trained model).

4.3. Evaluation metrics

Ideally, we want the confidence score to be correlated with the correctness of the transcription, that is, correct words should have a large confidence score and incorrect ones a low score. Following previous work [19, 31, 33], we employ metrics that are generally used for evaluating binary classifiers, but with the discrimination threshold varied. More precisely, we measure the area under the precision-recall curve (AUPR) and the area under the receiver operating characteristic curve (AUROC). However, depending on what we want to focus on (correctly or erroneously transcribed words), we obtain different variants: if we are interested in detecting erroneously transcribed words, we treat the errors as the positive class; if, on the other hand, we are interested in the correctly transcribed words, we treat the latter as the positive class. Hence, for AUPR we use two variants: AUPR_e (when errors are treated as positives) and AUPR_s (when correct words are treated as positives). For AUROC the same value is obtained for either choice, so there is no need to make this distinction.

We do not evaluate calibration, since our methodology is not designed to necessarily yield a probability, but rather a score that is correlated with the label. The temperature scaling approach does indeed transform the score into a probability (since it learns the scaling coefficients α and β), but the same cannot be said about the other approaches (for example, negative entropy).
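Both the calibration in Section 3 and the metrics above require word-level correct/incorrect labels, which are obtained by aligning the ground-truth text with the transcription. The exact alignment procedure is not spelled out in the source; the sketch below is one plausible realization based on a standard Levenshtein alignment, with all names chosen by us:

```python
def label_words(ref_words, hyp_words):
    """Mark each hypothesis word as correct (1) or incorrect (0) by
    Levenshtein-aligning the hypothesis against the reference transcript."""
    R, H = len(ref_words), len(hyp_words)
    # dp[i][j] = edit distance between ref_words[:i] and hyp_words[:j]
    dp = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        dp[i][0] = i
    for j in range(H + 1):
        dp[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = dp[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrack: a hypothesis word is correct iff the optimal path matches
    # it to an identical reference word.
    labels = [0] * H
    i, j = R, H
    while i > 0 and j > 0:
        if ref_words[i - 1] == hyp_words[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            labels[j - 1] = 1          # exact match
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1        # substitution
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1                     # deletion from the reference
        else:
            j -= 1                     # insertion in the hypothesis
    return labels
```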
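Given such labels and the word-level confidence scores, the three metrics can be computed with scikit-learn. A minimal sketch, assuming scores where higher means more confident:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def confidence_metrics(scores, correct):
    """scores: word-level confidence scores (higher = more confident);
    correct: 1 if the word was transcribed correctly, 0 otherwise."""
    scores, correct = np.asarray(scores), np.asarray(correct)
    errors = 1 - correct
    return {
        # AUPR_e: errors are the positive class, so low confidence should rank first
        "AUPR_e": average_precision_score(errors, -scores),
        # AUPR_s: correctly transcribed words are the positive class
        "AUPR_s": average_precision_score(correct, scores),
        # AUROC is invariant to which class is treated as positive
        "AUROC": roc_auc_score(correct, scores),
    }
```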
5. RESULTS

This section presents the experimental results. We start with an evaluation of the features and their aggregations (§5.1), and then report results for the improved variants involving temperature scaling, dropout (§5.2) and ensembles (§5.3).

5.1. Features and aggregations

We evaluate the proposed uncertainty features and aggregation techniques on the four datasets described in subsection 4.1. We use the pre-trained model to obtain text predictions for all the audio files in the test split of each dataset, and then estimate the confidence based on the methodology described in subsection 3.1. Table 2 presents the results for all combinations of features and aggregations.

Table 2. Confidence scoring results for combinations of features and aggregations on the four test splits. For all three metrics reported (AUPR_e, AUPR_s, AUROC), larger values are better. We indicate the word error rate of the pre-trained ASR system on each dataset by the figure to the right of its name.

                     Libri clean / 2.7%       Libri other / 6.0%       TED / 13.3%              CommonVoice / 28.6%
feat.        agg.    AUPR_e  AUPR_s  AUROC    AUPR_e  AUPR_s  AUROC    AUPR_e  AUPR_s  AUROC    AUPR_e  AUPR_s  AUROC
log-proba    sum     21.55
log-proba    min
log-proba    avg     20.12   99.10   80.90    26.72   97.93   80.47    38.74   95.88   80.29    44.51   75.82   60.87
neg-entropy  sum     17.31
neg-entropy  min
neg-entropy  avg     17.55   98.95   77.72    24.26   97.59   77.46    36.28   95.42   78.29    42.64   74.83   58.75

Comparison of features. We observe that the log probability features outperform the entropy features across all settings (aggregations and datasets). The only notable exception is the CommonVoice dataset, where the results are comparable.

Comparison of aggregations. Generally, the sum aggregation works better with the log-proba features, while the min aggregation works better for the entropy features. The sum might not be well suited for the entropy features because their magnitude is larger than for log-proba and the word confidence gets penalized too much by the length; but, as we will see further, this behaviour can be alleviated by temperature scaling. Averaging generally underperforms for both features, suggesting that length-invariant measures are detrimental.
Indeed, a closer look at the frequency of errors as a function of word length indicates that the more tokens a word has, the more likely it is to be incorrect; see Figure 2. Statistical tests (paired t-tests on the twelve results from each configuration, at p = 0.) confirm that for both features the sum and min aggregations are significantly better than avg, while the statistical test between sum and min did not reject the null hypothesis for either feature.

Fig. 2. Fraction of errors as a function of the word length, on Libri clean, Libri other, TED and CommonVoice. The fraction of errors is computed as the number of erroneously transcribed words divided by the total number of words, while the word length is measured as the number of tokens.

Comparison across datasets. As expected, the pre-trained model performs best on in-domain data (2.7% WER on Libri clean and 6.0% on Libri other), with the performance then dropping sharply as we evaluate on out-of-domain data (13.3% on TED and 28.6% on CommonVoice). In each of these settings the number of words that are correctly classified changes, going from more on the Libri splits to fewer on TED and CommonVoice. This observation explains why the AUPR_s performance drops as a function of the domain of the data and, conversely, why the AUPR_e performance improves. Unfortunately, for this exact reason (the different performance of the base ASR system on the four datasets) it is impossible to compare the confidence methods across datasets, as they use a different ground truth [22].

5.2. Temperature scaling and dropout

We benchmark the confidence scoring methods after improving the token probabilities with two of the described techniques: temperature scaling and dropout. We use the pre-trained ASR system and report results only on the TED test set. The parameters of the temperature scaling method are learnt on the dev split of the TED dataset for each setting of feature and aggregation. When temperature scaling is combined with dropout, we first apply the temperature scaling (using the same temperature) and then follow with the aggregation over dropout. The dropout method averages 64 independent predictions. Table 3 presents the results for all combinations of features, aggregations and improvement techniques.

The results indicate that both proposed methods improve the results, as does their combination, which gives the best overall result. We observe that the log-proba features benefit more from dropout, while the neg-entropy features yield more improvements when temperature scaling is used. Interestingly, the best results are now obtained for the neg-entropy feature with sum aggregation (row 16). Figure 3 shows that the dropout performance improves with the number of runs and plateaus around the chosen value of 64.

Table 3. Confidence scoring results on the TED test set for combinations of features, aggregations and their improved variants: temperature scaling (TS) and dropout (D). The bullet sign • indicates whether a variant is employed. Bold results indicate the best results for each feature-aggregation combination; these results show that using both temperature scaling and dropout yields the best results.

row  feat.        agg.  TS  D   AUPR_e  AUPR_s  AUROC
 1   log-proba    sum           39.97   95.88   79.95
 2   log-proba    sum   •
 3   log-proba    sum       •
 4   log-proba    sum   •   •
 5   log-proba    min           39.74   95.94   80.58
 6   log-proba    min   •
 7   log-proba    min       •
 8   log-proba    min   •   •
 9   log-proba    avg           38.74   95.88   80.29
10   log-proba    avg   •
11   log-proba    avg       •
12   log-proba    avg   •   •
13   neg-entropy  sum           34.96   95.41   77.57
14   neg-entropy  sum   •
15   neg-entropy  sum       •
16   neg-entropy  sum   •   •
17   neg-entropy  min           37.55   95.56   79.01
18   neg-entropy  min   •
19   neg-entropy  min       •
20   neg-entropy  min   •   •
21   neg-entropy  avg           36.28   95.42   78.29
22   neg-entropy  avg   •
23   neg-entropy  avg       •
24   neg-entropy  avg   •   •

Fig. 3. AUPR_e performance as a function of the number of dropout runs N on the TED test set. The horizontal red line indicates the performance of the model without dropout. The model uses neg-entropy features, sum aggregation and temperature scaling.

5.3. Ensembles

We present results for confidence scoring using ensembles of models and their combinations with the other improved variants (temperature scaling and dropout). For each of the retrained models from the ensemble we use the predictions of the pre-trained model to select the transcription; the retrained model is used only for confidence scoring, by extracting the confidence features described previously. The results are presented in Table 4. For the rows that do not use ensembles (rows 1, 2, 3 and 5) we evaluate each of the four single models independently and report the mean performance.

Table 4. Confidence scoring results on the TED test set for combinations of temperature scaling (TS), dropout (D) and ensembles (E), using neg-entropy features and sum aggregation.

row  TS  D   E   AUPR_e  AUPR_s  AUROC
1                28.58   95.30   75.79
2    •
3        •
4            •
5    •   •
6    •       •
7        •   •
8    •   •   •

The pre-trained model (Table 3, row 13) generally has a better performance than the retrained ones (Table 4, row 1), suggesting that the predictive performance of a model can correlate with its confidence scoring performance. Among the three improvement methods, we note that temperature scaling gives the largest performance boost on all three metrics (row 2). Surprisingly, the dropout method improves only the AUPR_s performance over the baseline (row 3). Of the combinations of two methods, temperature scaling and ensembles complement each other and obtain better performance.

6. CONCLUSIONS

This paper presented an approach for word-level confidence scoring in end-to-end speech recognition systems. We carried out a thorough ablation study of features and their aggregation on three well-known speech databases (LibriSpeech, TED-LIUM and CommonVoice) and further evaluated improved methods, which modify the token probabilities, and their combinations. Our main observation is that temperature scaling improves both uncertainty features (log-proba and neg-entropy) as well as the other two methods (dropout and ensembles). Using a pre-trained model allows replicability and enables comparison with future confidence scoring methods that will use the same ASR. We strived for simplicity by using a compact feature set (based on readily-available token posteriors); in future work we will consider augmenting these features with complementary information (e.g., token duration extracted from attention).

7. ACKNOWLEDGEMENTS

This work was supported by the PCCDI UEFISCDI project (funded by the Romanian Ministry of Research and Innovation, PN-III-P1-1.2-PCCDI-2017-0818/73) and the POCU project (funded by the Romanian Ministry of European Funds, financial agreement 51675/09.07.2019, SMIS code 125125).

8. REFERENCES

[1] Matthias Sperber, Graham Neubig, Jan Niehues, and Alex Waibel, "Neural lattice-to-sequence models for uncertain inputs," in EMNLP, 2017, pp. 1380–1389.
[2] Karel Veselý, Mirko Hannemann, and Lukas Burget, "Semi-supervised training of deep neural networks," in ASRU, 2013, pp. 267–272.
[3] Dong Yu, Balakrishnan Varadarajan, Li Deng, and Alex Acero, "Active learning and semi-supervised learning for speech recognition: A unified framework using the global entropy reduction maximization criterion," Computer Speech & Language, vol. 24, no. 3, pp. 433–444, 2010.
[4] Hossein Hadian, Hossein Sameti, Daniel Povey, and Sanjeev Khudanpur, "End-to-end speech recognition using lattice-free MMI," in Interspeech, 2018, pp. 12–16.
[5] Christoph Lüscher, Eugen Beck, Kazuki Irie, Markus Kitza, Wilfried Michel, Albert Zeyer, Ralf Schlüter, and Hermann Ney, "RWTH ASR systems for LibriSpeech: Hybrid vs attention," in Interspeech, 2019, pp. 231–235.
[6] Zoltán Tüske, Kartik Audhkhasi, and George Saon, "Advancing sequence-to-sequence based speech recognition," in Interspeech, 2019, pp. 3780–3784.
[7] Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, and Wangyou Zhang, "A comparative study on transformer vs RNN in speech applications," in ASRU, 2019, pp. 449–456.
[8] Thomas Kemp and Thomas Schaaf, "Estimating confidence using word lattices," in Eurospeech, 1997.
[9] Mitch Weintraub, Francoise Beaufays, Zeev Rivlin, Yochai Konig, and Andreas Stolcke, "Neural-network based measures of confidence for word recognition," in ICASSP, 1997, vol. 2, pp. 887–890.
[10] Timothy J. Hazen, Stephanie Seneff, and Joseph Polifroni, "Recognition confidence scoring and its use in speech understanding systems," Computer Speech & Language, vol. 16, no. 1, pp. 49–67, 2002.
[11] Prakhar Swarup, Roland Maas, Sri Garimella, Sri Harish Mallidi, and Björn Hoffmeister, "Improving ASR confidence scores for Alexa using acoustic and hypothesis embeddings," in Interspeech, 2019, pp. 2175–2179.
[12] Rahhal Errattahi, Salil Deena, Asmaa El Hannani, Hassan Ouahmane, and Thomas Hain, "Improving ASR error detection with RNNLM adaptation," in SLT, 2018, pp. 190–196.
[13] Mathew Stephen Seigel, Confidence Estimation for Automatic Speech Recognition Hypotheses, Ph.D. thesis, University of Cambridge, 2013.
[14] Isaías Sánchez Cortina, Jesús Andrés-Ferrer, Alberto Sanchis, and Alfons Juan, "Speaker-adapted confidence measures for speech recognition of video lectures," Computer Speech & Language, vol. 37, pp. 11–23, 2016.
[15] Kaustubh Kalgaonkar, Chaojun Liu, Yifan Gong, and Kaisheng Yao, "Estimating confidence scores on ASR results using recurrent neural networks," in ICASSP, 2015, pp. 4999–5003.
[16] Atsunori Ogawa and Takaaki Hori, "Error detection and accuracy estimation in automatic speech recognition using deep bidirectional recurrent neural networks," Speech Communication, vol. 89, pp. 70–83, 2017.
[17] M. A. Del-Agua, A. Gimenez, A. Sanchis, J. Civera, and A. Juan, "Speaker-adapted confidence measures for ASR using deep bidirectional recurrent neural networks," Transactions on Audio, Speech, and Language Processing, vol. 26, no. 7, pp. 1198–1206, 2018.
[18] Qiujia Li, PM Ness, Anton Ragni, and Mark JF Gales, "Bi-directional lattice recurrent neural networks for confidence estimation," in ICASSP, 2019, pp. 6755–6759.
[19] Dan Hendrycks and Kevin Gimpel, "A baseline for detecting misclassified and out-of-distribution examples in neural networks," in ICLR, 2016.
[20] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[21] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger, "On calibration of modern neural networks," in ICML, 2017, pp. 1321–1330.
[22] Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry Vetrov, "Pitfalls of in-domain uncertainty estimation and ensembling in deep learning," in ICLR, 2020.
[23] Yarin Gal and Zoubin Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," in ICML, 2016, pp. 1050–1059.
[24] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell, "Simple and scalable predictive uncertainty estimation using deep ensembles," in NeurIPS, 2017, pp. 6402–6413.
[25] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek, "Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift," in NeurIPS, 2019, pp. 13991–14002.
[26] Wesley J. Maddox, Pavel Izmailov, Timur Garipov, Dmitry P. Vetrov, and Andrew Gordon Wilson, "A simple baseline for Bayesian uncertainty in deep learning," in NeurIPS, 2019, pp. 13153–13164.
[27] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan, "Do deep generative models know what they don't know?," in ICLR, 2018.
[28] Tong Che, Xiaofeng Liu, Site Li, Yubin Ge, Ruixiang Zhang, Caiming Xiong, and Yoshua Bengio, "Deep verifier networks: Verification of deep discriminative models with deep generative models," arXiv preprint arXiv:1911.07421, 2019.
[29] Shiyu Liang, Yixuan Li, and R. Srikant, "Enhancing the reliability of out-of-distribution image detection in neural networks," in ICLR, 2018.
[30] Sunil Thulasidasan, Gopinath Chennupati, Jeff A. Bilmes, Tanmoy Bhattacharya, and Sarah Michalak, "On mixup training: Improved calibration and predictive uncertainty for deep neural networks," in NeurIPS, 2019, pp. 13888–13899.
[31] Charles Corbière, Nicolas Thome, Avner Bar-Hen, Matthieu Cord, and Patrick Pérez, "Addressing failure prediction by learning model confidence," in NeurIPS, 2019, pp. 2902–2913.
[32] Tongfei Chen, Jiří Navrátil, Vijay Iyengar, and Karthikeyan Shanmugam, "Confidence scoring using whitebox meta-models with linear classifier probes," in AISTATS, 2019, pp. 1467–1475.
[33] Andrey Malinin and Mark Gales, "Uncertainty in structured prediction," arXiv preprint arXiv:2002.07650, 2020.
[34] Apoorv Vyas, Pranay Dighe, Sibo Tong, and Hervé Bourlard, "Analyzing uncertainties in speech recognition using dropout," in ICASSP, 2019, pp. 6730–6734.
[35] Alex Graves, "Sequence transduction with recurrent neural networks," arXiv preprint arXiv:1211.3711, 2012.
[36] Hasim Sak, Matt Shannon, Kanishka Rao, and Françoise Beaufays, "Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping," in Interspeech, 2017, pp. 1298–1302.
[37] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in ICASSP, 2016, pp. 4945–4949.
[38] Parnia Bahar, Albert Zeyer, Ralf Schlüter, and Hermann Ney, "On using 2D sequence-to-sequence models for speech recognition," in ICASSP, 2019, pp. 5671–5675.
[39] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[40] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP, 2015, pp. 5206–5210.
[41] Anthony Rousseau, Paul Deléglise, and Yannick Estève, "Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks," in LREC, 2014, pp. 3935–3939.
[42] Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M.
Tyers, and Gregor Weber, "Common Voice: A massively-multilingual speech corpus," in LREC, 2020.
[43] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, "ESPnet: End-to-end speech processing toolkit," in Interspeech, 2018, pp. 2207–2211.
[44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in NeurIPS, 2017, pp. 5998–6008.
[45] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, "The Kaldi speech recognition toolkit," in ASRU, 2011.
[46] Taku Kudo, "Subword regularization: Improving neural network translation models with multiple subword candidates," in ACL, 2018, pp. 66–75.
[47] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Interspeech, 2019.