Uncertainty Estimation in Autoregressive Structured Prediction
Andrey Malinin
Yandex, Higher School of Economics [email protected]
Mark Gales
University of Cambridge [email protected]

Abstract
Uncertainty estimation is important for ensuring safety and robustness of AI systems. While most research in the area has focused on unstructured prediction tasks, limited work has investigated general uncertainty estimation approaches for structured prediction. Thus, this work aims to investigate uncertainty estimation for autoregressive structured prediction tasks within a single unified and interpretable probabilistic ensemble-based framework. We consider: uncertainty estimation for sequence data at the token level and complete sequence level; interpretations for, and applications of, various measures of uncertainty; and discuss both the theoretical and practical challenges associated with obtaining them. This work also provides baselines for token-level and sequence-level error detection, and sequence-level out-of-domain input detection on the WMT'14 English-French and WMT'17 English-German translation and LibriSpeech speech recognition datasets.
1 Introduction
Neural Networks (NNs) have become the dominant approach in numerous applications (Simonyan & Zisserman, 2015; Mikolov et al., 2013; 2010; Bahdanau et al., 2015; Vaswani et al., 2017; Hinton et al., 2012) and are being widely deployed in production. As a consequence, predictive uncertainty estimation is becoming an increasingly important research area, as it enables improved safety in automated decision making (Amodei et al., 2016). Important advancements have been the definition of baseline tasks and metrics (Hendrycks & Gimpel, 2016) and the development of ensemble approaches, such as Monte-Carlo Dropout (Gal & Ghahramani, 2016) and Deep Ensembles (Lakshminarayanan et al., 2017). An in-depth comparison of ensemble methods was conducted in (Ashukha et al., 2020; Ovadia et al., 2019). Ensemble-based uncertainty estimates have been successfully applied to detecting misclassifications, out-of-distribution inputs and adversarial attacks (Carlini & Wagner, 2017; Smith & Gal, 2018; Malinin & Gales, 2019) and to active learning (Kirsch et al., 2019). Crucially, they allow total uncertainty to be decomposed into data uncertainty, the intrinsic uncertainty associated with the task, and knowledge uncertainty, which is the model's uncertainty in the prediction due to a lack of understanding of the data (Malinin, 2019). Data and knowledge uncertainty are sometimes also called aleatoric and epistemic uncertainty. Estimates of knowledge uncertainty are particularly useful for detecting anomalous and unfamiliar inputs (Kirsch et al., 2019; Smith & Gal, 2018; Malinin & Gales, 2019; Malinin, 2019).

Despite recent advances, most work on uncertainty estimation has focused on unstructured tasks, such as image classification. Meanwhile, uncertainty estimation within a general, unsupervised, probabilistically interpretable ensemble-based framework for structured prediction tasks, such as language modelling, machine translation (MT) and speech recognition (ASR), has received little attention. Previous work has examined bespoke supervised confidence estimation techniques for each task separately (Evermann & Woodland, 2000; Liao & Gales, 2007; Ragni et al., 2018; Chen et al., 2017; Koehn, 2009; Kumar & Sarawagi, 2019), which construct an "error-detection" model on top of the original ASR/NMT system. While useful, these approaches suffer from a range of limitations. Firstly, they require token-level supervision, typically obtained via minimum edit-distance alignment to a ground-truth transcription (ASR) or translation (NMT), which can itself be noisy. Secondly, such token-level supervision is generally inappropriate for translation, as it does not account for the validity of re-arrangements. Thirdly, we are unable to determine whether the error is due to knowledge or data uncertainty. Finally, this error-detection model is itself subject to the pitfalls of the original system: domain shift, noise, etc. Thus, unsupervised uncertainty-estimation methods are more desirable.

Recently, however, initial investigations into unsupervised uncertainty estimation for structured prediction have appeared. The nature of data uncertainty for translation tasks was examined in (Ott et al., 2018a). Estimation of sequence- and word-level uncertainty via Monte-Carlo Dropout ensembles has been investigated for machine translation (Xiao et al., 2019; Wang et al., 2019; Fomicheva et al., 2020).
However, these works focus on machine translation, consider only a small range of ad-hoc uncertainty measures, provide limited theoretical analysis of their properties and do not make their limitations explicit. Furthermore, they do not identify or tackle the challenges in estimating uncertainty which arise from an exponentially large output space. Finally, to our knowledge, no work has examined uncertainty estimation for autoregressive ASR models.

This work examines uncertainty estimation for structured prediction tasks within a general, probabilistically interpretable ensemble-based framework. The five core contributions are as follows. First, we derive information-theoretic measures of both total uncertainty and knowledge uncertainty at both the token level and the sequence level, make explicit the challenges involved and state any assumptions made. Second, we introduce a novel uncertainty measure, reverse mutual information, which has a set of desirable attributes for structured uncertainty estimation. Third, we examine a range of Monte-Carlo approximations for sequence-level uncertainty. Fourth, for structured tasks there is a choice of how ensembles of models can be combined; we examine how this choice impacts predictive performance and the derived uncertainty measures. Fifth, we explore the practical challenges associated with obtaining uncertainty estimates for structured prediction tasks and provide performance baselines for token-level and sequence-level error detection, and out-of-domain (OOD) input detection on the WMT'14 English-French and WMT'17 English-German translation datasets and the LibriSpeech ASR dataset.

2 Uncertainty for Structured Prediction
In this section we develop an ensemble-based uncertainty estimation framework for structured prediction and introduce a novel uncertainty measure. We take a Bayesian viewpoint on ensembles, as it yields an elegant probabilistic framework within which interpretable uncertainty estimates can be obtained. The core of the Bayesian approach is to treat the model parameters θ as random variables and place a prior p(θ) over them to compute a posterior p(θ|D) via Bayes' rule, where D is the training data. Unfortunately, exact Bayesian inference is intractable for neural networks, and it is necessary to consider an explicit or implicit approximation q(θ) to the true posterior p(θ|D) to generate an ensemble. A number of different approaches to generating ensembles have been developed, such as Monte-Carlo Dropout (Gal & Ghahramani, 2016) and Deep Ensembles (Lakshminarayanan et al., 2017); an overview is available in (Ashukha et al., 2020; Ovadia et al., 2019).

Consider an ensemble of models {P(y|x; θ^(m))}_{m=1}^{M} sampled from an approximate posterior q(θ), where each model captures the mapping between variable-length sequences of inputs x = {x_1, ..., x_T} ∈ X and targets y = {y_1, ..., y_L} ∈ Y, with x_t ∈ {w_1, ..., w_V} and y_l ∈ {ω_1, ..., ω_K}. The predictive posterior is obtained by taking the expectation over the ensemble:

\[ \mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D}) = \mathbb{E}_{q(\theta)}\big[\mathrm{P}(\mathbf{y}|\mathbf{x},\theta)\big] \approx \frac{1}{M}\sum_{m=1}^{M}\mathrm{P}(\mathbf{y}|\mathbf{x},\theta^{(m)}),\qquad \theta^{(m)}\sim q(\theta)\approx p(\theta|\mathcal{D}) \tag{1} \]

The total uncertainty in the prediction of y is given by the entropy of the predictive posterior:

\[ \underbrace{\mathcal{H}\big[\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})\big]}_{\text{Total Uncertainty}} = \mathbb{E}_{\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})}\big[-\ln\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})\big] = -\sum_{\mathbf{y}\in\mathcal{Y}}\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})\ln\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D}) \tag{2} \]

The sources of uncertainty can be decomposed via the mutual information I between θ and y:

\[ \underbrace{\mathcal{I}\big[\mathbf{y},\theta\,|\,\mathbf{x},\mathcal{D}\big]}_{\text{Knowledge Uncertainty}} = \mathbb{E}_{q(\theta)}\Big[\mathbb{E}_{\mathrm{P}(\mathbf{y}|\mathbf{x},\theta)}\Big[\ln\frac{\mathrm{P}(\mathbf{y}|\mathbf{x},\theta)}{\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})}\Big]\Big] = \underbrace{\mathcal{H}\big[\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})\big]}_{\text{Total Uncertainty}} - \underbrace{\mathbb{E}_{q(\theta)}\big[\mathcal{H}[\mathrm{P}(\mathbf{y}|\mathbf{x},\theta)]\big]}_{\text{Expected Data Uncertainty}} \tag{3} \]

Mutual information (MI) is a measure of 'disagreement' between models in the ensemble, and therefore a measure of knowledge uncertainty (Malinin, 2019). It can be expressed as the difference between the entropy of the predictive posterior and the expected entropy of each model in the ensemble; the former is a measure of total uncertainty and the latter is a measure of data uncertainty (Depeweg et al., 2017). Another measure of ensemble diversity is the expected pairwise KL-divergence (EPKL):

\[ \mathcal{K}\big[\mathbf{y},\theta\,|\,\mathbf{x},\mathcal{D}\big] = \mathbb{E}_{q(\theta)q(\tilde{\theta})}\Big[\mathbb{E}_{\mathrm{P}(\mathbf{y}|\mathbf{x},\theta)}\Big[\ln\frac{\mathrm{P}(\mathbf{y}|\mathbf{x},\theta)}{\mathrm{P}(\mathbf{y}|\mathbf{x},\tilde{\theta})}\Big]\Big],\qquad q(\tilde{\theta}) = q(\theta) \tag{4} \]

where θ̃ is a dummy variable. This measure is an upper bound on the mutual information, obtainable via Jensen's inequality. A novel measure of diversity which we introduce in this work is the reverse mutual information (RMI) between each model and the predictive posterior:

\[ \mathcal{M}\big[\mathbf{y},\theta\,|\,\mathbf{x},\mathcal{D}\big] = \mathbb{E}_{q(\theta)}\Big[\mathbb{E}_{\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})}\Big[\ln\frac{\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})}{\mathrm{P}(\mathbf{y}|\mathbf{x},\theta)}\Big]\Big],\qquad q(\theta)\approx p(\theta|\mathcal{D}) \tag{5} \]

This is the reverse-KL-divergence counterpart of the mutual information (3), and has not been previously explored.
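To make these definitions concrete, the following minimal sketch (our illustration, not code from the paper; function and variable names are our own) computes the unstructured versions of total uncertainty, expected data uncertainty, MI, EPKL and RMI from an ensemble of categorical distributions, as would arise at a single prediction.

```python
import numpy as np

def ensemble_uncertainties(probs, eps=1e-12):
    """Uncertainty measures from an ensemble of categorical distributions.

    probs: array of shape (M, K) holding P(y | x, theta^(m)) for M models.
    """
    probs = np.clip(probs, eps, 1.0)
    post = probs.mean(axis=0)                              # Eq. (1): predictive posterior

    total = -np.sum(post * np.log(post))                   # Eq. (2): total uncertainty
    data = -np.sum(probs * np.log(probs), axis=1).mean()   # expected data uncertainty
    mi = total - data                                      # Eq. (3): knowledge uncertainty

    # Eq. (5): RMI = expected KL(posterior || ensemble member).
    rmi = float(np.mean([np.sum(post * np.log(post / p)) for p in probs]))

    # Eq. (4): EPKL; for empirical averages over the members this
    # equals MI + RMI (the identity given in Eq. (6) below).
    epkl = mi + rmi
    return {"total": total, "data": data, "mi": mi, "epkl": epkl, "rmi": rmi}

# Example: three models disagreeing over a four-class output.
ens = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.1, 0.7, 0.1, 0.1],
                [0.4, 0.4, 0.1, 0.1]])
print(ensemble_uncertainties(ens))
```

Note that models which disagree (high MI, EPKL, RMI) can each still be individually confident (low expected data uncertainty); that separation is exactly what decomposition (3) provides.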
As will be shown in the next section, RMI is particularly attractive for estimating uncertainty in structured prediction. Interestingly, RMI is the difference between EPKL and MI:

\[ \mathcal{M}\big[\mathbf{y},\theta\,|\,\mathbf{x},\mathcal{D}\big] = \mathcal{K}\big[\mathbf{y},\theta\,|\,\mathbf{x},\mathcal{D}\big] - \mathcal{I}\big[\mathbf{y},\theta\,|\,\mathbf{x},\mathcal{D}\big] \ge 0 \tag{6} \]

While mutual information, EPKL and RMI all yield estimates of knowledge uncertainty, only mutual information 'cleanly' decomposes into total and data uncertainty; EPKL and RMI do not yield clean measures of total and data uncertainty, respectively. For details see appendix A.

3 Uncertainty Estimation in Autoregressive Models

Unfortunately, we cannot in practice construct a model which directly yields a distribution over the infinite set of variable-length sequences y ∈ Y, nor can we take expectations over this set. Instead, autoregressive models are used to factorize the joint distribution over y into a product of conditionals over a finite set of classes, such as words or BPE tokens (Sennrich et al., 2015):

\[ \mathrm{P}(\mathbf{y}|\mathbf{x},\theta) = \prod_{l=1}^{L}\mathrm{P}(y_l\,|\,\mathbf{y}_{<l},\mathbf{x},\theta) \tag{7} \]

The key challenge of autoregressive models is that expressions (2)-(5) are intractable to evaluate. Specifically, all expectations over y are intractable due to the combinatorial explosion of the hypothesis space: there are K^L possible L-length sequences in Y_L, where K is the vocabulary size, and a forward pass through the model is needed for each hypothesis. This issue was ignored in prior work (Wang et al., 2019; Fomicheva et al., 2020; Xiao et al., 2019). Clearly, it is necessary to consider Monte-Carlo approximations to make this tractable. A key desideratum of these approximations is that they should be obtainable at no extra cost on top of standard beam-search inference from the ensemble. We examine two types of Monte-Carlo (MC) approximations for expressions (2)-(5) which are identical in the limit, but have different attributes given a finite sample size. Properties of these approximations are detailed in appendix A.

A result of the autoregressive conditional independence assumption is that distributions over long sequences can have higher entropy than over short ones. To compare uncertainties of sequences of different lengths, in accordance with the desiderata in the introduction, we consider length-normalized 'rate' (Cover & Thomas, 2006) equivalents of all uncertainty measures, denoted by a hat.

The simplest Monte-Carlo estimate approximates (2) using S samples:

\[ \hat{\mathcal{H}}^{(S)}_{\text{S-MC}}\big[\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})\big] \approx -\frac{1}{S}\sum_{s=1}^{S}\frac{1}{L^{(s)}}\ln\mathrm{P}(\mathbf{y}^{(s)}|\mathbf{x},\mathcal{D}),\qquad \mathbf{y}^{(s)}\sim\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D}) \tag{8} \]

where y^(s) is a realization of the random variable y. Alternatively, we can approximate (2) as a sum of conditional entropies via the entropy chain rule (Cover & Thomas, 2006):

\[ \hat{\mathcal{H}}^{(S)}_{\text{C-MC}}\big[\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})\big] \approx \frac{1}{S}\sum_{s=1}^{S}\frac{1}{L^{(s)}}\sum_{l=1}^{L^{(s)}}\mathcal{H}\big[\mathrm{P}(y_l\,|\,\mathbf{y}^{(s)}_{<l},\mathbf{x},\mathcal{D})\big] \tag{9} \]

Before applying the proposed Monte-Carlo approximations, two practicalities need to be considered. Firstly, due to the vastness of the hypothesis space Y_L, Monte-Carlo sampling requires prohibitively many samples to find a good set of hypotheses. Decoding by sampling is rarely used, as it yields poor predictive performance (Eikema & Aziz, 2020; Holtzman et al., 2019). Instead, beam-search is typically used for inference, as it efficiently finds high-quality hypotheses.
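As a concrete illustration of estimators (8) and (9) before they are adapted to beam search below, the following sketch (ours; the names are assumptions, not the paper's code) computes the two length-normalized entropy estimates from S sampled sequences, given quantities collected during decoding.

```python
import numpy as np

def entropy_smc(joint_logp, lengths):
    """Eq. (8): sample-based estimate of length-normalized sequence entropy.

    joint_logp: ln P(y^(s) | x, D) for each of the S sampled sequences.
    lengths:    sequence lengths L^(s).
    """
    return -float(np.mean([lp / L for lp, L in zip(joint_logp, lengths)]))

def entropy_cmc(step_posteriors):
    """Eq. (9): chain-rule estimate of length-normalized sequence entropy.

    step_posteriors: list with one array per sampled sequence; entry s has
    shape (L^(s), K) and holds P(y_l | y^(s)_{<l}, x, D) at every step l.
    """
    per_seq = []
    for p in step_posteriors:
        p = np.clip(p, 1e-12, 1.0)
        step_entropy = -np.sum(p * np.log(p), axis=1)   # H at each position l
        per_seq.append(step_entropy.mean())             # (1/L^(s)) * sum over l
    return float(np.mean(per_seq))
```

The sample-based estimator (8) uses only the probabilities of the tokens that were actually generated, while the chain-rule estimator (9) also draws on the probabilities of all non-generated tokens at each position; the consequences of this difference are examined in section 4.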
With regard to the Monte-Carlo estimators above, beam-search can be interpreted as a form of importance sampling which yields hypotheses from high-probability regions of the hypothesis space. As each hypothesis is seen only once during beam-search, the uncertainty associated with each hypothesis y^(b) within a beam B in the MC estimators above must be importance-weighted in proportion to P(y^(b)|x, D):

\[ \hat{\mathcal{H}}^{(B)}_{\text{S-IW}}\big[\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})\big] \approx -\sum_{b=1}^{B}\frac{\pi_b}{L^{(b)}}\ln\mathrm{P}(\mathbf{y}^{(b)}|\mathbf{x},\mathcal{D}),\qquad \pi_b = \frac{\exp\big(\frac{1}{T}\ln\mathrm{P}(\mathbf{y}^{(b)}|\mathbf{x},\mathcal{D})\big)}{\sum_{k=1}^{B}\exp\big(\frac{1}{T}\ln\mathrm{P}(\mathbf{y}^{(k)}|\mathbf{x},\mathcal{D})\big)} \]

\[ \hat{\mathcal{H}}^{(B)}_{\text{C-IW}}\big[\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})\big] \approx \sum_{b=1}^{B}\frac{\pi_b}{L^{(b)}}\sum_{l=1}^{L^{(b)}}\mathcal{H}\big[\mathrm{P}(y_l\,|\,\mathbf{y}^{(b)}_{<l},\mathbf{x},\mathcal{D})\big] \]

where y^(b) ∈ B and T is a calibration temperature.
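The importance weights and the two beam-based estimators above are straightforward to compute from quantities already produced by beam search. A sketch (ours, with assumed variable names):

```python
import numpy as np

def importance_weights(beam_logp, T=1.0):
    """pi_b proportional to exp((1/T) ln P(y^(b)|x, D)), normalized over the beam.

    T > 1 flattens the weights, increasing the contribution of competing
    hypotheses (the calibration effect explored in appendix G).
    """
    z = np.asarray(beam_logp) / T
    z = z - z.max()                     # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def entropy_s_iw(beam_logp, lengths, T=1.0):
    """Joint-sequence importance-weighted entropy estimate."""
    pi = importance_weights(beam_logp, T)
    return -float(np.sum(pi * np.asarray(beam_logp) / np.asarray(lengths)))

def entropy_c_iw(step_entropies, beam_logp, lengths, T=1.0):
    """Chain-rule importance-weighted entropy estimate.

    step_entropies: list of B arrays; entry b holds the token-level
    entropies H[P(y_l | y^(b)_{<l}, x, D)] along hypothesis b.
    """
    pi = importance_weights(beam_logp, T)
    per_hyp = np.array([h.sum() / L for h, L in zip(step_entropies, lengths)])
    return float(np.sum(pi * per_hyp))
```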
4 Experiments

This section provides performance baselines on three applications of structured uncertainty estimates: sequence-level and token-level error detection, and out-of-distribution input (anomaly) detection. Additional analysis is provided in appendices C-J. We also compare performance to prior heuristic ensemble-based approaches. This work only considers ensembles of autoregressive neural machine translation (NMT) and speech recognition (ASR) models generated by training identical models from different random initializations (Lakshminarayanan et al., 2017). This approach was shown to consistently outperform other ensemble-generation techniques using exponentially smaller ensembles (Ashukha et al., 2020; Ovadia et al., 2019; Fort et al., 2019). Ensembles of 10 transformer-big (Vaswani et al., 2017) models were trained on the WMT'17 English-to-German (EN-DE) and WMT'14 English-to-French (EN-FR) translation tasks and evaluated on the newstest14 (nwt14) dataset; ensembles of VGG-Transformer (Mohamed et al., 2019) ASR models were trained on LibriSpeech (see appendix B). All NMT models were trained using the configuration described in (Ott et al., 2018b). In the current Fairseq (Ott et al., 2019) implementation, ensembles are combined as a product-of-expectations. We do not compare to supervised uncertainty estimation techniques for NMT/ASR, such as those described in (Liao & Gales, 2007; Koehn, 2009), for two reasons. Firstly, the focus of this work is general, unsupervised uncertainty estimation approaches based on ensemble methods. Secondly, to our knowledge, they have not been applied to autoregressive models and doing so is beyond the scope of this work.

Table 1: Predictive performance in terms of BLEU, %WER and NLL on newstest14 and LibriSpeech. (Values for the Single row beyond 28.8 and the entire ENS-PrEx row were lost in extraction.)

Model      NMT BLEU        ASR %WER      NMT NLL         ASR NLL
           EN-DE   EN-FR   LTC    LTO    EN-DE   EN-FR   LTC    LTO
Single     28.8
ENS-PrEx
ENS-ExPr   29.9    46.3    4.5    12.6   1.36    1.05    0.23   0.58

Choice of Ensemble Combination. As discussed in section 3, ensembles can be combined as an expectation-of-products (ExPr) or as a product-of-expectations (PrEx); see appendix A.1. Therefore, it is necessary to evaluate which yields superior predictive performance. We evaluate the EN-DE and EN-FR NMT models on newstest14 and the ASR models on LibriSpeech test-clean (LTC) and test-other (LTO). Results in table 1 show that a product-of-expectations combination consistently yields marginally higher translation BLEU and lower ASR word-error-rate (WER) in beam-search decoding for all tasks. Beam width for the NMT and ASR models is 5 and 20, respectively. We speculate that this is because beam-search inference, which is a sequence of greedy token-level decisions, benefits more from token-level Bayesian model averaging. At the same time, both combination strategies yield equivalent teacher-forcing mean length-normalized negative log-likelihood (NLL) on reference data. This may be because the models in the ensemble yield consistent predictions on in-domain data, in which case the two combinations will yield similar probabilities. Further experiments in this work use hypotheses obtained from a product-of-expectations ensemble combination, as it yields marginally better predictive performance. Additional analysis and results are available in appendix C.

Table 2: Sequence-level Error Detection % Prediction Rejection Ratio in beam-search decoding. (Columns: total uncertainty H_C-IW, H_S-IW and knowledge uncertainty I_C-IW, K_C-IW, M_C-IW, M_S-IW under ENS-PrEx and ENS-ExPr, all over the 1-best hypothesis; entries beyond the first, 61.2 for LSP/LTC, were lost in extraction.)

Sequence-level Error Detection. We now investigate whether the sequence-level uncertainty measures can be used to detect sentences which are challenging to translate or transcribe. In the following experiment, a model's 1-best hypotheses are sorted in order of decreasing uncertainty and incrementally replaced by the references. The mean sentence-BLEU (sBLEU) or sentence-WER (sWER) is plotted against the fraction of data replaced on a rejection curve. BLEU was calculated using sacrebleu (Post, 2018) and WER using sclite. If the uncertainties are informative, then the increase in sBLEU or decrease in sWER should be greater than random (linear). Rejection curves are summarized using the Prediction Rejection Ratio (PRR) (Malinin, 2019; Malinin et al., 2020), described in appendix D.1, which is 100% if uncertainty estimates perfectly correlate with sentence BLEU/WER, and 0% if they are uninformative. In these experiments, only information from the 1-best hypothesis is considered; uncertainty derived from all hypotheses in the beam is analyzed in appendix D. While the 1-best hypotheses are obtained from a product-of-expectations combination, we consider uncertainty estimates obtained by expressing the predictive posterior both as a product-of-expectations (ENS-PrEx) and as an expectation-of-products (ENS-ExPr).

Table 2 shows several trends. First, measures of total uncertainty yield the best performance. Furthermore, joint-sequence estimates of total uncertainty consistently outperform chain-rule based estimates, as they consider only the information along the 1-best hypothesis and therefore directly assess its quality. This is consistent with results for unstructured prediction (Malinin, 2019). Second, measures derived from a product-of-expectations predictive posterior tend to yield superior performance to their expectation-of-products counterparts. However, this does not seem to be a property of the 1-best hypotheses, as results in appendix D on hypotheses obtained from an expectation-of-products ensemble show a similar trend. Third, of all the measures of knowledge uncertainty, joint-sequence RMI performs best. Finally, the performance gap between chain-rule and joint-sequence estimates of total uncertainty is larger for NMT. This is because, compared to ASR, NMT is a task with intrinsically higher uncertainty, and therefore more irrelevant information is introduced by considering the probabilities of non-generated tokens.

The results also show that uncertainty-based rejection works better for ASR than NMT. The issue lies in the nature of NMT: it is inherently difficult to objectively define a bad translation. While WER is an objective measure of quality, BLEU is only a proxy measure. While a high sBLEU indicates a good translation, a low sBLEU does not necessarily indicate a poor one.
Thus, a model may yield a low-uncertainty, high-quality translation which has little word overlap with the reference and low sBLEU, negatively impacting PRR. A better, but more expensive, approach to assessing uncertainty estimates in NMT is whether they correlate well with human assessment of translation quality.

Table 3: Token-level Error Detection %AUPR for LibriSpeech in beam-search decoding regime. (Columns: total uncertainty H and P under ENSM-PrEx and ENSM-ExPr; knowledge uncertainty I, K, M under ENS-PrEx and ENS-ExPr; %TER; entries beyond the first, 34.7 for LTC, were lost in extraction.)

Token-level Error Detection. We now assess whether token-level uncertainties can be used to detect token-level errors in the models' 1-best hypotheses. Note that token-level error labelling is ill-posed for translation, where correct tokens can be mislabelled as errors due to valid word re-arrangements and substitutions. Thus, token-level error detection is only investigated for ASR. Ground-truth error labels are obtained by aligning the hypotheses to the references using the SCLITE NIST scoring tool and marking insertions and substitutions (detecting deletions is, in general, a far more challenging task). Performance is assessed via the area under a Precision-Recall curve (AUPR). Random performance corresponds to the baseline recall, which is equal to the token error rate. Results in table 3 are consistent with the previous section. First, measures of total uncertainty outperform measures of knowledge uncertainty. Second, estimates derived from the conditional log-score P of the generated token outperform the entropy H of the token-level predictive posterior. This is because the latter relates to the probability of an error at this position, while the former relates to the probability of this particular token being an error. Finally, deriving uncertainties from a product-of-expectations token-level predictive posterior P_PE(y_l | y^(1)_{<l}, x, D) consistently yields better performance.

Out-of-Domain Input Detection. We now consider out-of-domain input (anomaly) detection. The goal is to use uncertainty estimates to discriminate between in-domain test data and a selection of out-of-domain (OOD) datasets. Performance is assessed via the area under a ROC curve (ROC-AUC), where 100% is ideal performance, 50% is random, and below 50% indicates that the model yields lower uncertainty for the OOD data. Results are presented in table 4, with additional results in appendices F, G and I. First, let's examine OOD detection for speech recognition. Three OOD datasets are considered, each covering a different form of domain shift. First, LibriSpeech test-other (LTO), which represents sentences which are noisier and more difficult to transcribe. Second, the AMI meeting transcription dataset (Kraaij et al., 2005), which represents spoken English from a different domain, mismatched to LibriSpeech, which consists of read books. Finally, we consider speech in a different language (French), taken from the Common Voice project (Ardila et al., 2019). The results show that OOD detection becomes easier the greater the domain mismatch. Curiously, there is marginal difference between the performance of the various measures of uncertainty. This is likely because ASR models tend to be very 'sharp' (provided the model is high-performing in general) and are naturally more entropic in mismatched conditions.
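The detection protocol itself is simple to reproduce; the sketch below (ours, using scikit-learn's roc_auc_score) shows how the reported %ROC-AUC would be computed from per-sentence uncertainty scores on an in-domain and an OOD test set.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ood_roc_auc(unc_id, unc_ood):
    """%ROC-AUC for discriminating OOD inputs via uncertainty scores.

    unc_id, unc_ood: per-sentence uncertainties (e.g. importance-weighted
    entropy or RMI estimates from the beam). OOD is the positive class,
    so values below 50% mean the model is *less* uncertain on OOD data.
    """
    labels = np.concatenate([np.zeros(len(unc_id)), np.ones(len(unc_ood))])
    scores = np.concatenate([unc_id, unc_ood])
    return 100.0 * roc_auc_score(labels, scores)
```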
Table 4: OOD detection % ROC-AUC in beam-search decoding for ASR and NMT. (Caption and all rows beyond the first ASR rows were lost in extraction.)

Task  OOD    T   B    ENS-PrEx TU       ENS-ExPr TU       ENS-PrEx KU                   ENS-ExPr KU
      Data            H_C-IW  H_S-IW    H_C-IW  H_S-IW    I     K     M_C-IW  M_S-IW    I     K     M_C-IW  M_S-IW
ASR   LTO    1   1    76.7    75.5      76.2    75.0      76.4  76.6  76.6    73.9      74.0  76.3  76.4    73.4
                 20   76.9    76.3      76.4    (remaining entries and rows lost in extraction)

Now let's consider OOD detection for the WMT'17 English-German machine translation models. The following OOD datasets are considered. First, the LibriSpeech test-clean (LTC) reference transcriptions, which are OOD in terms of both domain and structure, as spoken English is structurally distinct from written English. Second, newstest14 sentences corrupted by randomly permuting the source tokens (PRM). Third, French and German source sentences from newstest14 (L-FR and L-DE). Results show that discriminating between spoken and written English is challenging. In contrast, it is possible to near-perfectly detect corrupted English. Interestingly, detection of text from other languages is particularly difficult. Inspection of the model's output shows that the ensemble displays a pathological copy-through effect, where the input tokens are copied to the output with high confidence. As a result, estimates of total uncertainty are lower for the (OOD) French or German data than for the (ID) English data. Notably, estimates of knowledge uncertainty, especially reverse mutual information (RMI), M_C-IW and M_S-IW, are affected far less and discriminate between the in-domain and OOD data. This effect holds in general, but is especially pronounced when the copy-through effect is triggered. This highlights the value of RMI, for which asymptotically exact approximations can be obtained.

Clearly, ASR ensembles are better at OOD detection than NMT ensembles. This is expected, as ASR models receive a continuous-valued input signal which contains information not only about the content of the speech, but also the domain, language, speaker characteristics, background noise and recording conditions. This makes the task easier, as the model conditions on more information. This is also why ASR has low intrinsic data uncertainty and why the best OOD detection performance for ASR is obtained using measures of total uncertainty. In contrast, NMT models only have access to a sequence of discrete tokens, which contains far less information. This also highlights the value of knowledge uncertainty, as it disregards the high intrinsic data uncertainty of NMT.

An interesting effect, which is fully explored in appendix G, is that when considering all the hypotheses within the beam for uncertainty estimation, it is beneficial for NMT to use a higher importance-weighting temperature (T = 10), increasing the contribution of competing hypotheses. In contrast, this is detrimental for ASR, where the temperature is kept at T = 1. We hypothesise that this may be an artifact of the multi-modality of translation: multiple hypotheses could be equally good and contribute valuable information. In contrast, in ASR there is only one correct transcription, though not necessarily the 1-best, and considering competing hypotheses is detrimental.

The results also show that chain-rule and joint-sequence approximations yield similar performance and that, with the exception of M_S-IW, using information from the full beam yields only minor improvements compared to using just the 1-best hypotheses.
Uncertainties derived from P_EP(y|x, D) and P_PE(y|x, D) yield comparable performance, with the exception of mutual information and EPKL, where P_EP(y|x, D) yields consistently poorer performance. This suggests that P_PE(y|x, D) yields more robust uncertainty estimates.

Comparison to Heuristic Uncertainty Measures. We close with a comparison of the proposed information-theoretic measures of knowledge uncertainty to the 'heuristic' measures described in (Wang et al., 2019; Fomicheva et al., 2020; Xiao et al., 2019). These measures, and our modifications thereof, are detailed in appendix I. We examine the variance of the length-normalized probability and log-probability of hypotheses across the ensemble, as well as the cross-hypothesis WER/BLEU. The use of more than the 1-best hypothesis is our extension of the variance-based measures.

All of these measures aim to evaluate the diversity of the ensemble of models in different ways; in this regard they are all measures of knowledge uncertainty. Their main limitations, as originally presented, are the following. All measures focus on the diversity in the probabilities or surface forms of only the 1-best hypothesis. While sufficient for some tasks, such as sequence-error detection, this prevents them from fully capturing information about the behaviour of the space of possible translations/transcriptions. In this regard, the information-theoretic measures presented in our work are at an advantage, as they naturally allow this. We attempt to address this for the variance-based measures by considering the importance-weighted variance of the probability/log-probability of each hypothesis in the beam. While not strictly rigorous, this nonetheless attempts to address the problem. Such an extension to cross-BLEU/WER is not possible, as it is not clear how to match up different hypotheses across all decodings of each model in the ensemble. Cross-BLEU/WER has the additional limitation of needing a separate decoding of each model in the ensemble, which is undesirable and expensive. Finally, there is likely a bias towards longer hypotheses being judged more diverse, as there is a greater chance of a surface-form mismatch.

The results in table 5 show that the information-theoretic measures consistently yield better performance, though sometimes only marginally. Cross-BLEU/WER typically yields the worst performance, especially for NMT. Finally, including information from competing hypotheses can be advantageous for the variance-based measures. However, sometimes performance is degraded; this is because the information was integrated in an ad-hoc, rather than theoretically meaningful, fashion. We also show in appendix I that length-normalization, which was used inconsistently in prior work, is important for these measures, and in appendix H for the information-theoretic ones.

Table 5: Comparison of info-theoretic and heuristic measures on OOD detection (% ROC-AUC). (Columns: M_C-IW, M_S-IW, V[P], V[ln P], X-BLEU, X-WER; rows beyond the first were lost in extraction.)

Task  OOD    T   B    M_C-IW  M_S-IW  V[P]  V[ln P]  X-BLEU  X-WER
ASR   LTO    1   1    76.6    73.9    72.0  72.7     74.3    71.8
                 20   (lost in extraction)
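For concreteness, the following sketch shows one way the importance-weighted variance measures V[P] and V[ln P] could be computed; this is our own formulation of the extension described above (length-normalized per-hypothesis scores, weighted by the beam importance weights π_b), not a reference implementation.

```python
import numpy as np

def iw_variance(model_logp, lengths, pi, use_log=True):
    """Importance-weighted variance of hypothesis (log-)probabilities.

    model_logp: array (M, B) with ln P(y^(b) | x, theta^(m)) for each of
                the M ensemble members and B beam hypotheses.
    lengths:    array (B,) of hypothesis lengths (length-normalization).
    pi:         array (B,) of beam importance weights.
    """
    norm_logp = model_logp / lengths                  # length-normalized log-probs
    vals = norm_logp if use_log else np.exp(norm_logp)
    var_per_hyp = vals.var(axis=0)                    # disagreement across members
    return float(np.sum(pi * var_per_hyp))
```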
5 Conclusion

This work investigated applying a general, probabilistically interpretable ensemble-based uncertainty estimation framework to structured tasks, focusing on autoregressive models. A range of information-theoretic uncertainty measures at both the token level and sequence level were considered, including a novel measure of knowledge uncertainty called reverse mutual information (RMI). Two types of Monte-Carlo approximation were proposed: one based on the entropy chain rule, and the other on sequence samples. Additionally, this work examined ensemble combination through both token-level and sequence-level Bayesian model averaging. Performance baselines for sequence- and token-level error detection, and out-of-domain (OOD) input detection were provided on the WMT'14 English-French and WMT'17 English-German translation datasets, and the LibriSpeech ASR dataset. The results show that ensemble-based measures of uncertainty are useful for all applications considered. Estimates of knowledge uncertainty are especially valuable for NMT OOD detection. Crucially, it was shown that RMI is consistently the most informative measure of knowledge uncertainty for structured prediction. Notably, it was found that token-level Bayesian model averaging consistently yields both marginally better predictive performance and more robust estimates of uncertainty. However, it remains unclear why this is the case, which should be investigated in future work. Future work should also investigate alternative ensemble-generation techniques and compare ensemble-based uncertainty estimates to the task-specific confidence-score estimates previously explored for ASR and NMT. Another interesting direction is to assess the calibration of autoregressive ASR models.

References

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common Voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670, 2019.

Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry Vetrov. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJxI5gHKDr.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proc. International Conference on Learning Representations (ICLR), 2015.

Nicholas Carlini and David A. Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. CoRR, 2017. URL http://arxiv.org/abs/1705.07263.

William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. Listen, Attend and Spell. arXiv preprint arXiv:1508.01211, 2015.

Zhehuai Chen, Yimeng Zhuang, and Kai Yu. Confidence measures for CTC-based phone synchronous decoding. In Proc. ICASSP, pp. 4850-4854. IEEE, 2017.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2006.

Stefan Depeweg, José Miguel Hernández-Lobato, Finale Doshi-Velez, and Steffen Udluft. Decomposition of uncertainty for active learning and reliable reinforcement learning in stochastic systems. stat, 1050:11, 2017.

Bryan Eikema and Wilker Aziz. Is MAP decoding all you need? The inadequacy of the mode in neural machine translation. arXiv preprint arXiv:2005.10283, 2020.

G. Evermann and P.C. Woodland. Large vocabulary decoding and confidence estimation using word posterior probabilities. In Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2000.
Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco Guzmán, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, and Lucia Specia. Unsupervised quality estimation for neural machine translation. arXiv preprint arXiv:2005.10608, 2020.

Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proc. 33rd International Conference on Machine Learning (ICML-16), 2016.

Mark Gales and Steve Young. The Application of Hidden Markov Models in Speech Recognition. Now Publishers Inc, 2008.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369-376, 2006.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281, 2017.

Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.

Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition. Signal Processing Magazine, 2012.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.

Andreas Kirsch, Joost van Amersfoort, and Yarin Gal. BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning, 2019.

Philipp Koehn. Statistical Machine Translation. Cambridge University Press, 2009.

Wessel Kraaij, Thomas Hain, Mike Lincoln, and Wilfried Post. The AMI meeting corpus. 2005.

Aviral Kumar and Sunita Sarawagi. Calibration of encoder decoder models for neural machine translation. arXiv preprint arXiv:1903.00802, 2019.

B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Proc. Conference on Neural Information Processing Systems (NIPS), 2017.

Hank Liao and Mark J.F. Gales. Uncertainty decoding for noise robust speech recognition. In Proceedings of Interspeech, volume 37. Citeseer, 2007.

A. Malinin, A. Ragni, M.J.F. Gales, and K.M. Knill. Incorporating uncertainty into deep learning for spoken language assessment. In Proc. 55th Annual Meeting of the Association for Computational Linguistics (ACL), 2017.

Andrey Malinin. Uncertainty Estimation in Deep Learning with Application to Spoken Language Assessment. PhD thesis, University of Cambridge, 2019.

Andrey Malinin and Mark J.F. Gales. Reverse KL-divergence training of prior networks: Improved uncertainty and adversarial robustness. 2019.

Andrey Malinin, Bruno Mlodozeniec, and Mark J.F. Gales. Ensemble distribution distillation. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BygSP6Vtvr.

Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Proc. INTERSPEECH, 2010.

Tomas Mikolov et al. Linguistic regularities in continuous space word representations. In Proc. NAACL-HLT, 2013.

Abdelrahman Mohamed, Dmytro Okhonko, and Luke Zettlemoyer. Transformers with convolutional context for ASR. arXiv preprint arXiv:1904.11660, 2019.
Kevin P. Murphy. Machine Learning. The MIT Press, 2012.

Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato. Analyzing uncertainty in neural machine translation. arXiv preprint arXiv:1803.00047, 2018a.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. arXiv preprint arXiv:1806.00187, 2018b.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.

Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua V. Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. Advances in Neural Information Processing Systems, 2019.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: an ASR corpus based on public domain audio books. In Proc. ICASSP, pp. 5206-5210. IEEE, 2015.

Matt Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186-191, Brussels, Belgium, October 2018. Association for Computational Linguistics.

Anton Ragni, Qiujia Li, Mark J.F. Gales, and Yongqiang Wang. Confidence estimation and deletion prediction using bidirectional recurrent neural networks. In Proc. IEEE Spoken Language Technology Workshop (SLT), pp. 204-211. IEEE, 2018.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. International Conference on Learning Representations (ICLR), 2015.

L. Smith and Y. Gal. Understanding measures of uncertainty for adversarial example detection. In UAI, 2018.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.

Shuo Wang, Yang Liu, Chao Wang, Huanbo Luan, and Maosong Sun. Improving back-translation with uncertainty-based confidence estimation. arXiv preprint arXiv:1909.00157, 2019.

Tim Z. Xiao, Aidan N. Gomez, and Yarin Gal. Wat heb je gezegd? Detecting out-of-distribution translations with variational transformers. 2019. URL http://bayesiandeeplearning.org/2019/papers/90.pdf.

A Derivations of Uncertainty Measures

This appendix details token-level measures of uncertainty for autoregressive models and provides the derivations of the sequence-level measures of uncertainty discussed in section 3, as well as an extended discussion of their theoretical properties.

A.1 Token-Level Uncertainty Estimates

As stated in section 2, token-level ensemble-based uncertainty estimates for autoregressive models are isomorphic to unstructured uncertainty estimates (Malinin, 2019). However, for completeness, they are described here. First, consider the token-level predictive posterior P(y_l | y_{<l}, x, D), which can be obtained via two forms of Bayesian model averaging. The first, a product-of-expectations combination, averages the models' token-level distributions given a common back-history:

\[ \mathrm{P}_{\text{PE}}(y_l\,|\,\mathbf{y}_{<l},\mathbf{x},\mathcal{D}) = \mathbb{E}_{q(\theta)}\big[\mathrm{P}(y_l\,|\,\mathbf{y}_{<l},\mathbf{x},\theta)\big] \approx \frac{1}{M}\sum_{m=1}^{M}\mathrm{P}(y_l\,|\,\mathbf{y}_{<l},\mathbf{x},\theta^{(m)}) \]

The second, the token-level conditional of an expectation-of-products combination, weights each model's token-level distribution by the probability the model assigns to the back-history:

\[ \mathrm{P}_{\text{EP}}(y_l\,|\,\mathbf{y}_{<l},\mathbf{x},\mathcal{D}) = \frac{\sum_{m=1}^{M}\mathrm{P}(y_l\,|\,\mathbf{y}_{<l},\mathbf{x},\theta^{(m)})\,\mathrm{P}(\mathbf{y}_{<l}\,|\,\mathbf{x},\theta^{(m)})}{\sum_{m=1}^{M}\mathrm{P}(\mathbf{y}_{<l}\,|\,\mathbf{x},\theta^{(m)})} \]
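A minimal sketch (our illustration, with assumed names) of the two token-level combination strategies defined above, given each model's current-step distribution and back-history log-probability:

```python
import numpy as np

def combine_prex(step_probs):
    """Product-of-expectations: average the token-level distributions.

    step_probs: array (M, K) with P(y_l | y_<l, x, theta^(m)) per model.
    """
    return step_probs.mean(axis=0)

def combine_expr(step_probs, prefix_logp):
    """Expectation-of-products: weight each model's token-level
    distribution by the probability it assigns to the back-history.

    prefix_logp: array (M,) with ln P(y_<l | x, theta^(m)).
    """
    w = np.exp(prefix_logp - prefix_logp.max())   # unnormalized back-history weights
    w = w / w.sum()
    return (w[:, None] * step_probs).sum(axis=0)
```

On in-domain data, where the models assign similar probabilities to the back-history, the weights become nearly uniform and the two combinations nearly coincide, consistent with the matched teacher-forcing NLLs in table 1.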
A.2 Sequence-Level Uncertainty Estimates

In this section we detail the derivations of the joint-sequence and chain-rule Monte-Carlo approximations of the sequence-level measures of uncertainty defined in section 3. Crucially, they make use of the chain rules of entropy and relative entropy (Cover & Thomas, 2006):

\[ \hat{\mathcal{H}}\big[\mathrm{P}(\mathbf{y}|\mathbf{x},\theta)\big] = \frac{1}{L}\sum_{l=1}^{L}\mathbb{E}_{\mathrm{P}(\mathbf{y}_{<l}|\mathbf{x},\theta)}\Big[\mathcal{H}\big[\mathrm{P}(y_l\,|\,\mathbf{y}_{<l},\mathbf{x},\theta)\big]\Big] \]

As discussed in section 3, beam-search decoding can be interpreted as a form of importance sampling. For the Monte-Carlo approximations of sequence-level measures of uncertainty to be used with beam search, they must be adjusted such that the uncertainty associated with each hypothesis y^(b) within the beam B is weighted in proportion to its probability:

\[ \mathbf{y}^{(b)}\in\mathcal{B},\qquad \pi_b = \frac{\exp\big(\frac{1}{T}\ln\mathrm{P}(\mathbf{y}^{(b)}|\mathbf{x},\mathcal{D})\big)}{\sum_{k=1}^{B}\exp\big(\frac{1}{T}\ln\mathrm{P}(\mathbf{y}^{(k)}|\mathbf{x},\mathcal{D})\big)} \tag{36} \]

All chain-rule derived measures of uncertainty are then expressed as follows:

\[ \hat{\mathcal{H}}^{(B)}_{\text{C-IW}}\big[\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})\big] \approx \sum_{b=1}^{B}\frac{\pi_b}{L^{(b)}}\sum_{l=1}^{L^{(b)}}\mathcal{H}\big[\mathrm{P}(y_l\,|\,\mathbf{y}^{(b)}_{<l},\mathbf{x},\mathcal{D})\big] \tag{37} \]

B Experimental Configuration

This appendix provides both a description of the datasets and details of the models and experimental setups used in this work.

B.1 ASR Model Configuration

In this work, ensembles of the VGG-Transformer sequence-to-sequence ASR model (Mohamed et al., 2019) were considered. An ensemble of 6 models was constructed using a different seed for both initialization and mini-batch shuffling in each model; all 6 models were used for inference. We used the Fairseq (Ott et al., 2019) implementation and training recipe for this model with no modifications. Specifically, models were trained at a fixed learning rate for 80 epochs, where an epoch is a full pass through the entire training set. Checkpoints over the last 30 epochs were averaged together, which proved crucial to ensuring good performance. Training took 8 days using 8 V100 GPUs. Models were trained on the full 960 hours of the LibriSpeech dataset (Panayotov et al., 2015), in exactly the same configuration as described in (Mohamed et al., 2019). LibriSpeech contains roughly 1000 hours of read English speech derived from audiobooks.

Table 6: Description of ASR datasets.

Dataset           Subset       Hours   Utterances   Words/Utterance   Domain
LibriSpeech       Train        960     281.2K       33.4              Story books
                  Dev-Clean    5.4     2703         17.8
                  Dev-Other    5.3     2864         18.9
                  Test-Clean   5.4     2620         20.1
                  Test-Other   5.1     2939         17.8
AMI               Eval         -       12643        7.1               Meetings
Common Voice FR   Test         -       14760        9.5               General

B.2 NMT Model Configuration

This work considered ensembles of Transformer-Big (Vaswani et al., 2017) neural machine translation (NMT) models, trained on the WMT'14 English-French and WMT'17 English-German datasets. For each dataset and translation direction, an ensemble of 10 models was constructed using a different seed for both initialization and mini-batch shuffling in each model; all 10 models were used during inference. All models were trained using the standard Fairseq (Ott et al., 2019) implementation and recipe, which is consistent with the baseline setup described in (Ott et al., 2018b). The data was tokenized using a BPE vocabulary of 40,000 tokens as per the standard recipe (Sennrich et al., 2015).

Table 7: Description of NMT datasets.

Dataset            Subset     LNG   Sentences   Words/Sent.   Domain
WMT'14 EN-FR       Train      En    40.8M       29.2          Policy, News, Web
                              Fr                33.5
WMT'17 EN-DE       Train      En    4.5M        26.2          Policy, News, Web
                              De                24.8
Newstest14         -          En    3003        27.0          News
                              Fr                32.1
                              De                28.2
Khresmoi-Summary   Dev+Test   En    1500        19.0          Medical
                              Fr                21.8
                              De                17.9
Models trained on WMT'17 English-German were trained for 193,000 steps of gradient descent, which corresponds to roughly 49 epochs, while WMT'14 English-French models were trained for 800,000 steps, which corresponds to roughly 19 epochs. Models were checkpoint-averaged across the last 10 epochs. All models were trained using mixed-precision training. Models were evaluated on newstest14, which was treated as in-domain data. OOD data was constructed by considering BPE-token-permuted and language-flipped versions of the newstest14 dataset. Furthermore, the khresmoi-summary medical dataset as well as the reference transcriptions of the LibriSpeech test-clean and test-other datasets were also used as OOD evaluation datasets. All additional datasets used consistent tokenization with the 40K BPE vocabulary.

C Predictive Performance Ablation Studies

This appendix provides additional results assessing the predictive performance and negative log-likelihood of ensembles of autoregressive NMT and ASR models. Additionally, we include an ablation study of how the number of models in an ensemble affects performance in terms of BLEU and NLL. Tables 8 and 9 include an expanded set of results. Crucially, the results show that for all languages, tasks and datasets a product-of-expectations combination yields superior performance (with one exception) in beam-search decoding, and a consistently lower NLL on reference transcriptions and translations.

Table 8: Predictive performance in terms of BLEU and NLL on newstest14 (NWT'14) and khresmoi-summary (MED). (Values for the Single row beyond 28.8 and the entire ENS-PrEx row were lost in extraction; the surviving ENS-ExPr row begins 29.9, 46.3, 31.4 and is then truncated.)

Table 9: Predictive performance in terms of ASR %WER and NLL on LibriSpeech test-clean (LTC), test-other (LTO) and AMI. (Values for the Single row beyond 5.6 and the entire ENS-PrEx row were lost in extraction.)

Model      %WER: LTC   LTO    AMI    NLL: LTC   LTO    AMI
Single     5.6
ENS-PrEx
ENS-ExPr   4.5         12.6   53.4   0.23       0.58   4.62

Finally, we present an ablation study showing how predictive performance varies with the number of models in an ensemble. The ablation shows several trends. Firstly, both BLEU/WER and NLL begin to show diminishing returns from using more models. This suggests that using 4-6 NMT models or 2-3 ASR models captures most of the gains at half the cost of a full 10- or 6-model ensemble. Secondly, the advantage of a product-of-expectations combination remains consistent as the number of models grows: regardless of the number of models available, it is always better to combine as a product-of-expectations.

[Figure 1: BLEU and NLL ablation study for NMT EN2DE; shading indicates ±σ.]
[Figure 2: WER and NLL ablation study for ASR on LTC, LTO and AMI; shading indicates ±σ.]

D Sequence-Level Error Detection

This appendix provides a description of the Prediction Rejection Ratio metric, the rejection curves which correspond to the results in section 4, and histograms of sentence-WER and sentence-BLEU which provide insight into the behaviour of the corresponding rejection curves.

D.1 Prediction Rejection Ratio

Here we describe the Prediction Rejection Ratio metric, proposed in (Malinin, 2019; Malinin et al., 2017), which in this work is used to assess how well measures of sequence-level uncertainty are able to identify sentences which are hard to translate/transcribe.
Consider the task of identifying misclassifications: ideally, we would like to detect all of the inputs which the model has misclassified based on a measure of uncertainty. The model can then either decline to provide any prediction for these inputs, or pass them over ('reject' them) to an oracle (i.e. a human) to obtain the correct prediction (or translation/transcription). The latter process can be visualized using a rejection curve, where the predictions of the model are replaced with predictions provided by an oracle in a particular order based on estimates of uncertainty. If the estimates of uncertainty are uninformative, then, in expectation, the rejection curve is a straight line from the base error rate to the lower right corner, provided the error metric is a linear function of individual errors. However, if the estimates of uncertainty are 'perfect' and always larger for a misclassification than for a correct classification, they produce the 'oracle' rejection curve, which descends linearly to zero classification error at the percentage of rejected examples equal to the number of misclassifications. A rejection curve produced by estimates of uncertainty which are imperfect but still informative will sit between the 'random' and 'oracle' curves.

[Figure 3: Example prediction rejection curves (Malinin, 2019). Panel (a) shades the area AR_orc; panel (b) shades the area AR_uns.]

The quality of the rejection curve can be assessed by considering the ratio of the area between the 'uncertainty' and 'random' curves, AR_uns (orange in figure 3), and the area between the 'oracle' and 'random' curves, AR_orc (blue in figure 3). This yields the prediction rejection area ratio PRR:

\[ \mathrm{PRR} = \frac{AR_{\text{uns}}}{AR_{\text{orc}}} \tag{43} \]

A rejection area ratio of 1.0 indicates optimal rejection; a ratio of 0.0 indicates 'random' rejection. A negative rejection ratio indicates that the estimates of uncertainty are 'perverse': they are higher for accurate predictions than for misclassifications. An important property of this performance metric is that it is independent of classification performance, unlike AUPR, and thus it is possible to compare models with different base error rates. Similar approaches to assessing misclassification detection were considered in (Lakshminarayanan et al., 2017; Malinin et al., 2017; Malinin, 2019). In this work, instead of considering misclassifications, we assess whether measures of uncertainty correlate well with sentence-level BLEU or WER. The overall 'error' is then the average of sentence-level BLEU/WER over the test set.
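A sketch of how PRR could be computed from per-sentence errors and uncertainties; this is our own implementation of the description above, and assumes a lower-is-better error metric such as sentence-WER.

```python
import numpy as np

def rejection_curve(errors, order):
    """Mean error as examples are rejected (error set to 0) in `order`.

    Returns n+1 values: entry i is the mean error after the first i
    examples in `order` have been replaced by oracle predictions.
    """
    errs = np.asarray(errors, dtype=float)[order]
    remaining = errs.sum() - np.concatenate([[0.0], np.cumsum(errs)])
    return remaining / len(errs)

def prediction_rejection_ratio(errors, uncertainties):
    """PRR = AR_uns / AR_orc, Eq. (43)."""
    n = len(errors)
    random_curve = np.mean(errors) * (1.0 - np.arange(n + 1) / n)
    unc_curve = rejection_curve(errors, np.argsort(uncertainties)[::-1])
    orc_curve = rejection_curve(errors, np.argsort(errors)[::-1])
    ar_uns = np.trapz(random_curve - unc_curve)   # uncertainty area vs. random
    ar_orc = np.trapz(random_curve - orc_curve)   # oracle area vs. random
    return ar_uns / ar_orc
```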
D.2 Rejection Curves

The rejection curves for all NMT models on newstest14 and the ASR model on LibriSpeech test-clean and test-other are presented in figure 4. The main difference between the NMT and ASR curves is that the 'oracle' rejection curve for the former is not much better than random, while for the latter it is far better than random. This can be explained by the histograms of sentence-level BLEU and sentence-level WER presented in figure 5. Notice that the sentence-level BLEUs are spread across the spectrum, and very few sentences reach a BLEU of 100. In contrast, 55-75% of all utterances transcribed by the ASR models have a sentence-WER of 0-10%, with a few utterances having a much larger WER. Thus, if the measures of uncertainty can identify the largest errors, which contribute most to the mean WER over the dataset, a large decrease can be achieved; hence the shape of the 'oracle' WER-rejection curve. In contrast, the contributions from each sentence to mean sentence-BLEU are more evenly spread, so it is difficult to significantly raise the mean sentence-BLEU by rejecting just a few sentences; hence the shape of the 'oracle' BLEU rejection curve for NMT.

[Figure 4: Sequence-level rejection curves for NMT (EN2DE, EN2FR) and ASR (test-clean, test-other, AMI).]

Figure 5e shows that the sentence-WER on AMI eval is distributed more like sentence-BLEU is for the NMT tasks: few correct sentences and a much more uniform distribution of error. Thus, the corresponding 'oracle' rejection curve's shape is more similar to the NMT 'oracle' rejection curves. This clearly shows that the shape of the oracle curve is not determined by the task (ASR/NMT), but by the distribution of the error (BLEU/WER) across a dataset.

[Figure 5: Sentence-BLEU (EN-DE newstest14) and sentence-WER (ASR) histograms.]

The second trend in the results provided in section 4 is that score-based measures of uncertainty work better than entropy-based measures on NMT tasks, while on ASR they perform comparably. The justification is that NMT models yield far less confident predictions, and therefore entropy-based measures suffer due to the probability mass assigned to other tokens. In contrast, ASR models yield more confident predictions, as shown in figure 6. Notably, on the AMI and Common Voice datasets the ASR model also yields less confident predictions, and thus the score-based measures of uncertainty do better than entropy-based ones in the AMI rejection curve in figure 4e. These results show that on tasks where it is important to determine which particular translation/transcription hypotheses are worse, score-based measures of uncertainty do as well as or better than entropy-based measures. This result is consistent with confidence being a better measure of uncertainty for misclassification detection in unstructured prediction tasks (Malinin, 2019).

[Figure 6: Histograms of predicted-token confidence for NMT and ASR.]

D.3 Additional Results

This section provides additional sequence-level error detection results. We examine sequence-level error detection of hypotheses produced by an ensemble combined as an expectation-of-products. Results presented in table 10 confirm the previously observed trends and illustrate that the superior performance of measures of uncertainty derived from a product-of-expectations posterior does not depend on the nature of the hypotheses.

Table 10: Sequence-level Error Detection % PRR in beam-search decoding using P_EP(y|x, D). (Columns as in table 2; the surviving entry is 64.7 for LSP/LTC; the remaining values were lost in extraction.)

E Token-Level Error Detection

This appendix provides additional results for token-level error detection. Notably, we present results using a score-based measure of knowledge uncertainty, M_{ω_l}, as well as results on hypotheses derived from an expectation-of-products ensemble combination.

Table 11: %AUPR for LibriSpeech in beam-search decoding regime using P_PE(y|x, D). (Columns: H, P for total uncertainty under ENSM-PrEx and ENSM-ExPr; I, K, M, M_{ω_l} for knowledge uncertainty under ENS-PrEx and ENS-ExPr; %TER. The surviving entry is 34.7 for LTC; the remaining values were lost in extraction.)
Results in table 11 show that the new measure of uncertainty consistently outperforms token-level mutual information and RMI. Results in table 12 show that the observed trends do not depend on which ensemble combination the hypotheses were obtained from.

Table 12: %AUPR for LibriSpeech in beam-search decoding regime using P_EP(y|x, D). (Same columns as table 11; the surviving entry is 38.0 for LTC; the remaining values were lost in extraction.)

F Out-of-Distribution Input Detection

This appendix provides additional OOD input detection results for the EN-DE and EN-FR NMT models and the ASR model in a beam-search decoding regime. Additionally, we provide results for the EN-DE and EN-FR models on reference hypotheses in a teacher-forcing regime.

F.1 Additional Results

Table 13 provides the full set of OOD detection results using an ensemble of EN-FR translation models. All hypotheses are derived from a product-of-expectations ensemble.

Table 13: OOD Detection % ROC-AUC in beam-search decoding regime for the EN-FR NMT ensemble (T = 10). (Rows beyond those shown were lost in extraction.)

OOD    B    ENS-PrEx TU       ENS-ExPr TU       ENS-PrEx KU                   ENS-ExPr KU
Data        H_C-IW  H_S-IW    H_C-IW  H_S-IW    I     K     M_C-IW  M_S-IW    I     K     M_C-IW  M_S-IW
LTC    1    64.1    77.3      63.3    77.2      78.5  78.4  78.3    81.7      65.2  75.3  78.0    78.9
       5    65.4    79.2      64.7    79.2      79.1  78.9  78.8    83.6      66.7  76.8  78.6    82.0
PRM    1    92.7    91.6      90.9    91.6      (remaining entries and rows lost in extraction)

These results tell essentially the same story as OOD detection for the EN-DE models. However, because WMT'14 EN-FR is roughly ten times larger than WMT'17 EN-DE, OOD detection is in some cases, notably LTC, PRM and L-FR, easier. However, results for L-DE are significantly worse. Note that for EN-FR, L-DE is the held-out language, while L-FR is more familiar. One explanation is that the copy-through effect is so strong on an unfamiliar language that even measures of knowledge uncertainty are drastically affected. This suggests that it is necessary to eliminate this regime, as it strongly compromises the measures of uncertainty.

Table 14: OOD Detection % ROC-AUC in beam-search decoding regime for ASR and NMT, with hypotheses from an expectation-of-products ensemble.
Task  OOD    T    B    ENS-PrEx TU       ENS-ExPr TU       ENS-PrEx KU                   ENS-ExPr KU
      Data             H_C-IW  H_S-IW    H_C-IW  H_S-IW    I     K     M_C-IW  M_S-IW    I     K     M_C-IW  M_S-IW
ASR   LTO    1    1    76.5    75.4      75.6    74.6      76.2  76.5  76.4    73.9      68.5  76.0  76.1    74.1
                  20   77.0    77.0      76.3    75.5      76.8  77.1  77.1    77.0      71.8  76.9  77.0    77.2
      AMI    1    1    97.5    97.4      97.3    97.1      96.3  96.2  96.1    96.2      79.7  96.1  96.0    96.2
                  20   96.6    97.9      96.7    97.3      95.1  95.1  95.0    97.7      86.7  95.2  95.2    97.7
      C-FR   1    1    99.9    99.7      99.7    99.1      99.9  99.9  99.9    99.8      33.4  99.9  99.9    99.9
                  20   100.0   99.7      99.9    98.9      99.9  99.9  99.9    99.9      38.9  99.9  99.9    99.9
ENDE  LTC    10   1    65.3    72.0      63.7    70.1      72.3  72.0  71.7    72.8      49.5  66.4  71.0    71.2
                  5    65.4    74.8      63.8    72.3      72.7  72.4  72.1    75.0      50.6  67.4  71.4    73.9
      PRM    10   1    83.0    85.2      78.6    79.4      96.4  96.7  96.7    95.9      45.3  92.3  96.3    94.3
                  5    83.6    86.3      79.3    79.5      96.7  96.9  97.0    96.5      47.6  93.5  96.6    95.6
      L-FR   10   1    27.1    20.4      21.8    16.5      63.3  68.9  72.2    69.9      19.8  43.7  69.4    69.4
                  5    28.0    23.9      22.7    20.2      65.9  71.7  75.2    78.6      18.9  47.5  73.4    76.8
      L-DE   10   1    41.7    33.0      34.6    24.4      74.9  78.5  80.5    75.2      34.1  67.8  78.3    73.2
                  5    42.6    39.8      35.7    31.2      76.9  80.6  82.6    88.6      33.3  72.2  81.2    85.5
ENFR  LTC    10   1    63.1    78.0      61.6    76.5      78.1  78.0  77.9    81.6      56.2  73.0  77.5    79.2
                  5    64.4    80.1      63.1    78.3      78.7  78.6  78.5    83.8      58.7  75.0  78.2    82.2
      PRM    10   1    93.2    92.7      90.7    90.6      98.7  98.7  98.6    98.5      38.2  88.8  98.6    98.1
                  5    93.7    93.7      91.4    91.2      98.8  98.7  98.7    98.8      38.6  90.6  98.8    98.7
      L-FR   10   1    55.9    38.3      49.9    29.5      86.6  88.0  88.9    84.6      45.2  81.2  88.1    83.9
                  5    58.1    48.0      52.2    38.8      88.3  89.7  90.5    94.8      46.7  84.8  90.2    93.4
      L-DE   10   1    12.5    7.6       11.0    6.0       35.0  38.1  40.3    38.9      20.6  27.6  39.8    41.3
                  5    13.2    15.4      11.6    13.6      39.2  43.1  45.8    67.6      18.6  30.8  46.8    66.6

Table 14 provides the full set of OOD detection results on hypotheses generated from an ensemble combined as an expectation-of-products. The results tell essentially the same story as those obtained on hypotheses generated from an ensemble combined as a product-of-expectations. This confirms the trend that measures of uncertainty derived by expressing the predictive posterior as a product-of-expectations typically yield marginally better performance, with exceptions where they yield significantly better performance, regardless of the nature of the hypotheses.

F.2 Teacher-Forcing

Table 15: OOD Detection % ROC-AUC in teacher-forcing regime. (Columns as in tables 13-14; surviving entries for ENDE L-DEEN are 98.3, 98.9, 98.0, 99.0, 99.4; the remaining rows, including L-FREN, were lost in extraction.)

Table 15 provides OOD detection results for the EN-DE/EN-FR translation models evaluated in a teacher-forcing regime, where 'references' are fed into the decoder. The aim is to further explore OOD detection of foreign languages and the copy-through effect. Here, in-domain data is the En-De and En-Fr newstest14 for the En-De and En-Fr models, respectively. As OOD data we also consider newstest14 data, but where either the source, the target or both languages are changed. The results show that when the source and target languages are both changed (L-DEEN, L-FREN), the scenario is easy to detect, as copy-through is forcibly avoided. We also consider situations where we forcibly induce copy-through; here, we have matched pairs of source-source or target-target language. Measures of total uncertainty fail, while measures of knowledge uncertainty do not. However, the effect is more severe when source sentences are copied through, rather than target sentences.
We speculate that this is an effect of the decoder being familiar with target sentences and still trying to do something sensible. These results clearly show that if the copy-through effect is somehow eliminated, then detection of OOD sentences by NMT models becomes as easy as for ASR models, where copy-through cannot occur by construction. Finally, we note that, again, deriving uncertainties from a product-of-expectations predictive posterior yields marginally better results. Additionally, chain-rule RMI seems to be the best measure of knowledge uncertainty overall. Note that in teacher-forcing we have access to only a single hypothesis per input, which is a regime where joint-sequence estimates of knowledge uncertainty, specifically RMI, tend to perform worse than chain-rule derived estimates.

G SENSITIVITY OF MC ESTIMATORS TO NUMBER OF SAMPLES AND CALIBRATION

In this section we explore in detail the effect of using more than the 1-best hypothesis in the Monte-Carlo estimation of entropy and reverse mutual information (RMI). We consider both chain-rule and joint-sequence estimators. Performance is evaluated on the tasks of sequence error detection and OOD detection. Figure 7 shows that for sequence error detection, using more hypotheses within the beam either has little effect or is detrimental. This is expected, as it makes the estimate less sensitive to the particular hypothesis which is being assessed. Notably, joint-sequence estimates are affected the most, while chain-rule estimators demonstrate more stable behaviour.

Figure 7: Sensitivity of uncertainty measures to the number of samples on NMT and ASR sequence error detection.

Figure 8 shows that considering more hypotheses is generally beneficial for OOD detection, especially for joint-sequence estimates of RMI. This is also expected, as for OOD detection it is useful to have information about the effect of the input on more than just the 1-best hypothesis. The performance gain is, ultimately, unsatisfying.

Figure 8: Sensitivity of uncertainty measures to the number of samples on NMT and ASR OOD detection.

We consider what happens if we increase the importance-weighting temperature T, which increases the contribution of the remaining hypotheses to the uncertainty estimate. Figure 9 shows, unsurprisingly, that this is extremely detrimental to sequence error detection, as it introduces even more irrelevant information. However, figure 10 shows unexpectedly interesting behaviour for OOD detection. Using a higher temperature for NMT always leads to significantly better performance, especially for joint-sequence estimates of RMI. However, what is especially surprising is that for ASR the OOD detection performance degrades. It is not entirely clear why up-weighting information from competing hypotheses is detrimental to performance. Unfortunately, investigating the underlying cause of this effect is beyond the scope of this work. Ultimately, it suggests that analysis of the calibration of autoregressive structured prediction models is an interesting area for future research.

Figure 9: Sensitivity of uncertainty measures to importance-weighting calibration.

Figure 10: Sensitivity of uncertainty measures to importance-weighting calibration.

H EFFECTS OF LENGTH-NORMALIZATION

In section 3 we make the claim that length-normalization is important; in this section we provide evidence in support of that claim.
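As a reminder of what the '-'/'+' length-normalization toggle in tables 16 and 17 below refers to, the following is a minimal sketch for a single hypothesis, shown here for the sequence log-probability; the entropy- and RMI-based measures are normalized analogously, and the function and variable names are illustrative:

```python
import numpy as np

def sequence_log_score(token_log_probs, length_norm=True):
    # Under the chain rule, ln P(y|x) is the sum of per-token log-probabilities.
    # The length-normalized variant divides by the hypothesis length L, so that
    # long hypotheses are not scored as more uncertain merely because they
    # accumulate more per-token terms.
    log_p = float(np.sum(token_log_probs))
    return log_p / len(token_log_probs) if length_norm else log_p
```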
Here we compare sequence error detection and OOD detection performance for WMT'17 En-De NMT and LibriSpeech ASR systems when using the standard length-normalized versions of the uncertainties as well as their non-length-normalized counterparts. In these experiments we only consider translation/transcription hypotheses obtained via beam-search decoding from a product-of-expectations ensemble, and uncertainty measures obtained using the same combination approach.

Table 16: Effect of length-normalization on sequence-level Error Detection % PRR.

|      |          |          | ENS-PrEx TU |           | ENS-PrEx KU |           |           |           |
| Task | Test set | len-norm | Ĥ(1) C-IW   | Ĥ(1) S-IW | Î(1) C-IW   | K̂(1) C-IW | M̂(B) C-IW | M̂(1) S-IW |
| LSP  | LTC      | -        | 48.8 | 56.3 | 47.6 | 46.7 | 46.4 | 54.3 |
|      |          | +        | 64.7 |      |      |      |      |      |

The results in tables 16-17 show definitively, with one exception, that length-normalization consistently boosts performance on all tasks. The only exception is OOD detection when French (L-FR) input sentences are given to the ensemble. In this case there is a large improvement from not using length-normalization. This seems odd, given that when German input sentences are used (in an En-De system) the same effect does not appear. Note that in both of these cases the pathological copy-through effect appears. However, it seems likely that the copy-through effect is far more pronounced for L-FR. The likely reason for length-normalization proving detrimental lies in long sentences: as the length of the translation increases, so does the chance of a token which is not copied successfully, on which the models will yield a far higher estimate of ensemble diversity. Length-normalization would mask such an effect.

Table 17: Effect of length-normalization on OOD Detection % ROC-AUC for ASR and NMT.

|      |      |    |    |          | ENS-PrEx TU |           | ENS-PrEx KU |           |           |           |
| Task | OOD  | T  | B  | Len-Norm | Ĥ(B) C-IW   | Ĥ(B) S-IW | Î(B) C-IW   | K̂(B) C-IW | M̂(B) C-IW | M̂(B) S-IW |
| ASR  | LTO  | 1  | 20 | -        | 73.2 | 73.5 | 73.4 | 73.9 | 74.0 | 74.3 |
|      |      |    |    | +        | 76.9 | 76.3 | 76.6 |      |      |      |
| ENDE | PRM  | 10 | 5  | -        | 65.6 | 68.7 | 86.8 | 88.3 | 89.1 | 92.1 |
|      |      |    |    | +        | 82.9 | 82.7 | 96.7 |      |      |      |
|      | L-FR | 10 | 5  | +        | 27.1 | 21.7 | 65.2 | 71.2 | 74.8 | 79.6 |
|      | L-DE | 10 | 5  | -        | 44.9 | 37.1 | 70.1 | 73.8 | 76.1 | 85.8 |
|      |      |    |    | +        | 40.8 | 34.9 | 76.1 | 79.9 | 82.1 |      |

I COMPARISON TO HEURISTIC MEASURES OF UNCERTAINTY

In this work we considered a range of information-theoretic measures of uncertainty, described them theoretically in section 2, and provided Monte-Carlo estimators in section 3. However, in (Xiao et al., 2019; Fomicheva et al., 2020; Wang et al., 2019) a range of other measures was considered, though their properties were not analyzed. Firstly, Xiao et al. (2019) considered computing the cross-BLEU or 'BLEU-Variance' of an ensemble. Here, the average squared complement of the pairwise sentence-BLEU between the 1-best hypotheses ŷ(m) produced by each individual model in the ensemble is used as a measure of uncertainty:

\text{X-BLEU} = \frac{1}{M(M-1)} \sum_{m=1}^{M} \sum_{q \neq m} \big(1 - \text{BLEU}(\hat{y}^{(m)}, \hat{y}^{(q)})\big)^2 \qquad (44)

Similarly, an equivalent measure can be derived for ASR, which we will call cross-WER:

\text{X-WER} = \frac{1}{M(M-1)} \sum_{m=1}^{M} \sum_{q \neq m} \text{WER}(\hat{y}^{(m)}, \hat{y}^{(q)}) \qquad (45)

These two measures of uncertainty assess the diversity between the 1-best hypotheses of each model in an ensemble. In this way they are conceptually related to measures of knowledge uncertainty. Notably, as they are only sensitive to the 1-best hypotheses, they are 'hard' versions of ensemble diversity, as they operate on surface forms rather than over probabilities.
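A minimal sketch of the X-BLEU and X-WER measures in equations (44)-(45), assuming the sacrebleu and jiwer packages for sentence-BLEU and WER respectively (any pairwise similarity implementation would do); `hyps` is a hypothetical list of the M one-best hypothesis strings, one per ensemble member:

```python
from itertools import permutations
import sacrebleu
import jiwer

def x_bleu(hyps):
    # Eq. (44): average squared complement of pairwise sentence-BLEU over
    # all ordered pairs; sacrebleu scores are in [0, 100], so rescale to [0, 1].
    m = len(hyps)
    total = sum((1.0 - sacrebleu.sentence_bleu(h, [r]).score / 100.0) ** 2
                for h, r in permutations(hyps, 2))
    return total / (m * (m - 1))

def x_wer(hyps):
    # Eq. (45): average pairwise word error rate over all ordered pairs.
    m = len(hyps)
    total = sum(jiwer.wer(r, h) for h, r in permutations(hyps, 2))
    return total / (m * (m - 1))
```

Since `permutations(hyps, 2)` yields the M(M-1) ordered pairs of distinct hypotheses, the normalization matches the equations directly.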
While heuristically sensible, the main drawback of these measures is that they require running beam-search inference for each model in the ensemble, which is expensive and undesirable, especially if the final prediction is derived through inference of the joint ensemble.

Additionally, Fomicheva et al. (2020) and Wang et al. (2019) consider the variance of the sentence-level probability and of the length-normalized log-probability of the 1-best hypothesis across the models in an ensemble:

\mathrm{V}[P] = \mathrm{V}_{q(\theta)}\big[P(y|x,\theta)\big] \qquad (46)

\hat{\mathrm{V}}[\ln P] = \mathrm{V}_{q(\theta)}\big[\tfrac{1}{L}\ln P(y|x,\theta)\big] \qquad (47)

These are also measures of diversity, and therefore of knowledge uncertainty. However, they are not strictly information-theoretically meaningful, and it is not clear how to relate them to concepts such as entropy. Given these two measures, it is also unclear how to include information from other hypotheses and whether length-normalization is necessary. First we address the latter: we consider a length-normalized version of the variance of the probability, and the non-length-normalized variance of the log-probability:

\hat{\mathrm{V}}[P] = \mathrm{V}_{q(\theta)}\big[P(y|x,\theta)^{\frac{1}{L}}\big] \qquad (48)

\mathrm{V}[\ln P] = \mathrm{V}_{q(\theta)}\big[\ln P(y|x,\theta)\big] \qquad (49)

Secondly, we extend all four of these measures to importance-weighted averages of the variances of the hypotheses within the beam as follows:

\mathrm{V}^{(B)}[P] = \sum_{b=1}^{B} \pi_b\, \mathrm{V}_{q(\theta)}\big[P(y^{(b)}|x,\theta)\big], \qquad y^{(b)} \in \mathcal{B} \qquad (51)

\hat{\mathrm{V}}^{(B)}[P] = \sum_{b=1}^{B} \pi_b\, \mathrm{V}_{q(\theta)}\big[P(y^{(b)}|x,\theta)^{\frac{1}{L^{(b)}}}\big], \qquad y^{(b)} \in \mathcal{B} \qquad (52)

\mathrm{V}^{(B)}[\ln P] = \sum_{b=1}^{B} \pi_b\, \mathrm{V}_{q(\theta)}\big[\ln P(y^{(b)}|x,\theta)\big], \qquad y^{(b)} \in \mathcal{B} \qquad (53)

\hat{\mathrm{V}}^{(B)}[\ln P] = \sum_{b=1}^{B} \pi_b\, \mathrm{V}_{q(\theta)}\big[\tfrac{1}{L^{(b)}}\ln P(y^{(b)}|x,\theta)\big], \qquad y^{(b)} \in \mathcal{B} \qquad (54)

Thus, we explore the effect of considering additional hypotheses and whether these measures need length-normalization or not.

Table 18 explores the utility of these uncertainty measures and compares them to the best-performing measure of knowledge uncertainty, reverse mutual information, on the task of OOD input detection. The results show that, with one exception, estimates of reverse mutual information outperform these 'heuristic' measures. With regards to the heuristic measures, it is clear that length-normalization, with one exception, improves performance. We can also see that these heuristic measures can be improved by considering importance-weighted averages across the hypotheses within the beam. However, these gains are inconsistent, and sometimes this is detrimental. This highlights the value of information-theoretic measures, whose behaviour is far more consistent.

Table 18: Comparison of information-theoretic and heuristic measures on OOD detection (% ROC-AUC).

|      |     |   |    | Info. Theor. |           | Heuristic |         |            |            |        |       |
| Task | OOD | T | B  | M̂(B) C-IW   | M̂(B) S-IW | V(B)[P]   | V̂(B)[P] | V(B)[ln P] | V̂(B)[ln P] | X-BLEU | X-WER |
| ASR  | LTO | 1 | 1  | 76.6 | 73.9 | 52.6 | 72.0 | 71.0 | 72.7 | 74.3 | 71.8 |
|      |     |   | 20 |      |      |      |      |      |      |      |      |

J CHECKPOINT ENSEMBLES

This work focused mainly on ensembles of models constructed by training from different random initializations. While this tends to yield the best ensembles, it is an extremely expensive process, especially for large transformer models trained on industrial-scale corpora. Thus, in this section we conduct a preliminary investigation of checkpoint ensembles: ensembles of models constructed by considering different checkpoints within a single training run.
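As a minimal sketch of the two ways the final checkpoints of a single run can be combined, assuming PyTorch-style models whose forward pass returns logits (all names are illustrative, not the actual training code):

```python
import torch

def checkpoint_average(state_dicts):
    # CPT-AVG: average the parameters of the last K checkpoints into a single
    # model; cheap at inference time, but provides no ensemble diversity.
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
            for k in state_dicts[0]}

@torch.no_grad()
def checkpoint_ensemble_probs(models, inputs):
    # CPT-ENS: treat the K checkpoints as an ensemble and average their
    # per-token predictive distributions, i.e. the token-level expectation
    # used in a product-of-expectations combination.
    probs = torch.stack([model(inputs).softmax(dim=-1) for model in models])
    return probs.mean(dim=0)
```

The key practical difference is that CPT-AVG yields one model and one forward pass, while CPT-ENS pays K forward passes but retains the diversity needed for the uncertainty measures studied here.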
In this work we consider the checkpoints of the last 10 epochs of NMT model training. For ASR, we also checkpoint-average single models (CPT-AVG). Tables 19 and 20 show that checkpoint ensembles (CPT-ENS) are competitive with checkpoint-averaged single models on average. However, random-init ensembles (those considered in the main work) yield consistently superior performance.

Table 19: Predictive performance of the NMT models in terms of BLEU and NLL on newstest14 (NWT'14) and MED.

|         | NWT'14 BLEU |       | MED BLEU |       | NWT'14 NLL |       | MED NLL |       |
| Model   | EN-DE       | EN-FR | EN-DE    | EN-FR | EN-DE      | EN-FR | EN-DE   | EN-FR |
| CPT-AVG | 28.8 |      |      |      |      |      |      |      |
| CPT-ENS | 29.3 |      | 30.5 |      | 1.42 |      | 1.25 |      |
| RND-ENS |      |      |      |      |      |      |      |      |

Table 20: Predictive performance of the ASR models in terms of %WER and NLL on LibriSpeech (LTC, LTO) and AMI.

|         | %WER |     |     | NLL |     |     |
| Model   | LTC  | LTO | AMI | LTC | LTO | AMI |
| CPT-AVG | 5.6  |     |     |     |     |     |
| CPT-ENS | 5.3  |     |     |     |     |     |
| RND-ENS |      |     |     |     |     |     |

We also compare checkpoint ensembles with random-init ensembles on OOD detection. The results, which are a pleasant surprise, show that while checkpoint ensembles are consistently inferior, they are only marginally so. The biggest differences occur on the NMT L-FR and L-DE datasets, where the pathological 'copy-through' effect kicks in. However, even here the difference between the best CPT-ENS and RND-ENS measures is only about 6% ROC-AUC. This shows that the approach considered in this work need not be too expensive for practical application, and that useful ensembles can be formed even from the last 10 checkpoints of a standard transformer model training run, enabling usage in extreme-scale industrial applications.

Table 21: OOD Detection % ROC-AUC in Beam-Search decoding regime for ASR and NMT.

|      |          |           | ENS-PrEx TU |           | ENS-ExPr TU |           | ENS-PrEx KU |           |           |           | ENS-ExPr KU |           |           |           |
| Task | Test set | ENSM type | Ĥ(B) C-IW   | Ĥ(B) S-IW | Ĥ(B) C-IW   | Ĥ(B) S-IW | Î(B) C-IW   | K̂(B) C-IW | M̂(B) C-IW | M̂(B) S-IW | Î(B) C-IW   | K̂(B) C-IW | M̂(B) C-IW | M̂(B) S-IW |