Confidence Estimation via Auxiliary Models
Charles Corbière, Nicolas Thome, Antoine Saporta, Tuan-Hung Vu, Matthieu Cord, Patrick Pérez
Abstract—Reliably quantifying the confidence of deep neural classifiers is a challenging yet fundamental requirement for deploying such models in safety-critical applications. In this paper, we introduce a novel target criterion for model confidence, namely the true class probability (TCP). We show that TCP offers better properties for confidence estimation than the standard maximum class probability (MCP). Since the true class is by essence unknown at test time, we propose to learn the TCP criterion from data with an auxiliary model, introducing a specific learning scheme adapted to this context. We evaluate our approach on the tasks of failure prediction and of self-training with pseudo-labels for domain adaptation, which both necessitate effective confidence estimates. Extensive experiments are conducted for validating the relevance of the proposed approach in each task. We study various network architectures and experiment with small and large datasets for image classification and semantic segmentation. In every tested benchmark, our approach outperforms strong baselines.
Index Terms—Confidence Estimation, Uncertainty, Deep Neural Networks, Classification with Reject Option, Misclassification Detection, Failure Prediction, Self-Training, Pseudo-Labeling, Unsupervised Domain Adaptation, Semantic Image Segmentation
1 INTRODUCTION

Last decade's research in deep learning led to tremendous boosts in predictive performance for various tasks including image classification [1], object recognition [2], [3], [4], natural language processing [5], [6] and speech recognition [7], [8]. However, safety remains a great concern when it comes to deploying these models in real-world conditions [9], [10]. Failing to detect possible errors or overestimating the confidence of a prediction may carry serious repercussions in critical visual-recognition applications such as autonomous driving, medical diagnosis [11] or nuclear power plant monitoring [12].

Classification with a reject option [13], [14], [15], also known as selective classification [16], [17], consists in a scenario where the classifier is given the option to reject an instance instead of predicting its label. Equipped with a reject option, a classifier could decide to stick to the prediction or, on the contrary, to hand over to a human or a back-up system with, e.g., other sensors, or simply to trigger an alarm. One common approach for tackling the problem is to discriminate with a confidence-based criterion: for an instance $\mathbf{x}$, along with a prediction $f(\mathbf{x})$, a scalar value $g(\mathbf{x})$ that quantifies the confidence of the classifier in its prediction is also provided.

Correctly identifying uncertain predictions thanks to low confidence values $g(\mathbf{x})$ could be beneficial for classification improvements in active learning [18] or for efficient exploration in reinforcement learning [19]. On a related matter, one would expect the confidence criterion to correlate successful predictions with high values. Some paradigms, such as self-training with pseudo-labeling [20], [21], consist in picking and labeling the most confident samples before retraining the network accordingly. The performance improves by selecting successful predictions thanks to an accurate confidence criterion. A final perspective, linked to failure prediction [22], [23], [24], is the capacity of models to provide a ranking which enables to distinguish correct from erroneous predictions. In each of the previous tasks, obtaining reliable estimates of the predictive confidence is then of prime importance.

Confidence estimation has been explored in a wide variety of applications, including computer vision [23], [25], speech recognition [26], [27], [28], reinforcement learning [19] and machine translation [29]. A widely used baseline with neural-network classifiers is to take the value of the predicted class's probability, namely the maximum class probability (MCP), given by the softmax layer output. Although recent evaluations of MCP with modern deep models reveal reasonable performance [23], it still suffers from several conceptual drawbacks. In particular, MCP leads by design to high confidence values, even for erroneous predictions, since the largest softmax output is used. This design tends to make erroneous and correct predictions overlap in terms of confidence and thus limits the capacity to distinguish them.

In this work, we identify a better confidence criterion, the true class probability (TCP), for deep neural network classifiers with a reject option. For a sample $\mathbf{x}$, TCP corresponds to the probability of the model with respect to the true class $y$ of that sample, which naturally reflects a better-behaved model's confidence.
We provide theoretical guarantees of the quality of this criterion regarding confidence estimation. Since the true class is obviously unknown at test time, we propose a novel approach which consists in designing an auxiliary network specifically dedicated to estimating the confidence of a prediction. Given a trained classifier $f$, this auxiliary network learns the TCP criterion from data. At inference, we use its scalar output as the confidence estimate $g(\mathbf{x})$ associated to the prediction. When applied to failure prediction, we observe significant improvements over strong baselines. Our approach is also adequate for self-training strategies in unsupervised domain adaptation. To meet the challenge of this task in semantic segmentation, we propose an enhanced architecture with structured output and adopt an adversarial learning scheme which enforces alignment between confidence maps in source and target domains. A thorough analysis of our approach, including relevant variations, ablation studies and qualitative evaluations of confidence estimates, helps to gain insight about its behavior.
In summary, our contributions are as follows:
• We define a novel confidence criterion, the true class probability, which exhibits an adequate behavior for confidence estimation;
• We propose to design an auxiliary neural network, coined ConfidNet, which aims to learn this confidence criterion from data;
• We apply this approach to the task of failure prediction and to self-training in unsupervised domain adaptation with adequate choices of architecture, loss function and learning scheme;
• We extensively experiment across various benchmarks and backbone networks to validate the relevance of our approach on both tasks.

The paper is organized as follows. In Section 2, we provide an overview of the most relevant related works on confidence estimation, failure prediction, self-training and unsupervised domain adaptation. Section 3 exposes our approach for confidence estimation based on learning an adequate criterion via an auxiliary network. We also describe how it relates to classification with a reject option. In Section 4, we adapt our approach to failure prediction by introducing an architecture, a loss function and a learning scheme for this task. Similarly, Section 5 details the instantiation of our approach for confidence-based self-training in unsupervised domain adaptation (DA), which we denote as ConDA. In particular, we present two additions, an adversarial loss and a multi-scale confidence architecture, which help further improve the performance for this task. Finally, we report experimental studies in Section 6. This paper extends a previous conference publication [30] by introducing: (1) a comprehensive adaptation of the approach to improve the key step of self-training from pseudo-labels in semantic segmentation with DA; (2) an exploration of the classification-with-rejection framework, which strengthens the rationale of the proposed approach.
2 RELATED WORK
Confidence estimation in machine learning has been around for many decades, firstly linked to the idea of classification with a reject option [13]. Following works [14], [15], [31], [32] explored alternative rejection criteria. In particular, [31] proposes to jointly learn the classifier and the selection function. El-Yaniv [16] provides an analysis of the risk-coverage trade-off that occurs when classifying with a reject option. More recently, [17], [33] extend the approach to deep neural networks, considering various confidence measures.

Since the wide adoption of deep learning methods, confidence estimation has raised even more interest as recent works reveal that modern neural networks tend to be overconfident [34], non-calibrated [35], [36], sensitive to adversarial attacks [37], [38] and inadequate to distinguish in- from out-of-distribution examples [23], [39], [40].

Bayesian neural networks [41] offer a principled approach for confidence estimation by adopting a Bayesian formalism which models the weight posterior distribution. As the true posterior cannot be evaluated analytically in complex models, various approximations have been developed, such as variational inference [19], [42], [43] or expectation propagation [44]. In particular, MC Dropout [19] has raised a lot of interest due to the simplicity of its implementation. Predictions are obtained by averaging softmax vectors from multiple feed-forward passes through the network with dropout layers. When applied to regression, the predictive distribution uncertainty can be summarized by computing statistics, e.g., variance. However, when using MC Dropout for uncertainty estimation in classification tasks, the predictive distribution is averaged to a point-wise softmax estimate before computing standard uncertainty criteria, e.g., entropy or variants such as mutual information. It is worth mentioning that these entropy-based criteria measure the softmax output dispersion, where the uniform distribution has maximum entropy. It is not clear how well these dispersion measures are adapted to distinguishing failures from correct predictions, especially with deep neural networks which output overconfident predictions [35]: for example, it might be very challenging to discriminate a peaky prediction corresponding to a correct prediction from an incorrect overconfident one. Lakshminarayanan et al. [40] propose an alternative to Bayesian neural networks by leveraging an ensemble of neural networks to produce well-calibrated uncertainty estimates. However, it requires training multiple classifiers, which has a considerable computing cost in training and inference time.
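For concreteness, the following PyTorch sketch (our own, not from the paper; the number of passes and the assumption that the model outputs raw logits are placeholders) illustrates the MC Dropout confidence measure described above: dropout is kept active at test time and the entropy of the averaged softmax distribution serves as the uncertainty estimate.

```python
import torch
import torch.nn.functional as F

def mc_dropout_entropy(model, x, n_samples=20):
    """Entropy-based MC Dropout uncertainty (higher entropy = lower confidence).

    Averages the softmax distribution over `n_samples` stochastic forward
    passes with dropout kept active at inference time.
    """
    model.train()  # keeps dropout active; in practice only dropout layers
                   # should be switched to train mode (not, e.g., batch norm)
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(model(x), dim=1) for _ in range(n_samples)]
        ).mean(dim=0)  # (batch, K) averaged predictive distribution
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return probs, entropy
```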
In the context of classification, a widely used baseline for failure prediction is to take the value of the predicted class's probability given by the softmax layer output, namely the maximum class probability (MCP), suggested by [45] and revised by [23]. As stated before, MCP presents several limits regarding both failure prediction and out-of-distribution detection, as it outputs unduly high confidence values. Blatz et al. [29] introduce a method for confidence estimation in machine translation by solving a binary classification between correct and erroneous predictions. More recently, Jiang et al. [24] proposed a new confidence measure, 'Trust Score', which measures the agreement between the classifier and a modified nearest-neighbor classifier on the test examples. More precisely, the confidence criterion used in Trust Score is the ratio between the distance from the sample to the nearest class different from the predicted class and the distance to the predicted class. One clear drawback of this approach is its lack of scalability, since computing nearest neighbors in large datasets is extremely costly in both computation and memory. Another more fundamental limitation related to the Trust Score itself is that local distance computation becomes less meaningful in high-dimensional spaces [46], which is likely to negatively affect the performance of this method, as shown in Section 6.1.

In tasks closely related to failure prediction, Guo et al. [35], for confidence calibration, and Liang et al. [39], for out-of-distribution detection, proposed to use temperature scaling to mitigate confidence values. However, this does not affect the ranking of the confidence score and therefore the separability between errors and correct predictions. DeVries et al. [47] share with us the same purpose of learning confidence in neural networks. Their work differs from ours by focusing on out-of-distribution detection and learning jointly a distribution confidence score and classification probabilities. In addition, their criterion is based on an interpolation between output probabilities and target distribution whereas we specifically define a criterion suited to failure prediction.

Fig. 1: Distributions of different confidence measures over correct and erroneous predictions of a given model. (a) Maximum Class Probability. (b) True Class Probability. When ranking the test predictions of a convolutional model trained on CIFAR-10 according to MCP (a), we observe that correct ones (in green) and misclassifications (in red) overlap considerably, making it difficult to distinguish them. On the other hand, ranking samples according to TCP (b) alleviates this issue and allows a much better separation.
Unsupervised Domain Adaptation (UDA). UDA has received a lot of attention over the past few years because of its importance for a variety of real-world problems, such as robotics or autonomous driving. Most works in this line of research aim at minimizing the discrepancy between the data distributions in source and target domains. For the semantic segmentation task, most recent progress has been obtained by adopting an adversarial training approach to produce indistinguishable source-target distributions in the space of features extracted by modern convolutional deep neural nets. To cite a few methods: CyCADA [48] first stylizes the source-domain images as target-domain images before aligning source and target in the feature space; AdaptSegNet [49] constructs a multi-level adversarial network to perform output-space domain adaptation at different feature levels; AdvEnt [50] aligns the entropy of the pixel-wise predictions with an adversarial loss; BDL [21] learns alternately an image translation model and a segmentation model that promote each other.
Self-Training. Semi-supervised learning designates the general problem where a decision rule must be learned from both labeled and unlabeled data. Among the methods applied to address this problem, self-training with pseudo-labeling [20] is a simple strategy that relies on picking up the current predictions on the unlabeled data and using them as if they were true labels for further training. It is shown in [20] that the effect of pseudo-labeling is equivalent to entropy regularization [51]. In a UDA setting, the idea is to collect pseudo-labels on the unlabeled target-domain samples in order to have an additional supervision loss in the target domain. To select only reliable pseudo-labels, such that the performance of the adapted semantic segmentation network effectively improves, BDL [21] resorts to standard selection with MCP. ESL [52] uses instead the entropy of the prediction as confidence criterion for its pseudo-label selection. CBST [53] proposes an iterative self-training procedure where the pseudo-labels are generated based on a loss minimization. In [53], the authors also propose a way to balance the classes in their pseudo-labels to avoid the dominance of large classes as well as a way to introduce spatial priors. More recently, the CRST framework [54] proposes multiple types of confidence regularization to limit the propagation of errors caused by noisy pseudo-labels.
3 LEARNING A MODEL'S CONFIDENCE WITH AN AUXILIARY MODEL
In this section, we first introduce briefly the task of classification with a reject option, along with necessary notations. We then introduce an effective confidence-rate function for neural-net classifiers and we present our approach to learn this target confidence-rate function thanks to an auxiliary neural network. For sake of simplicity, we consider in this section a generic classification task, where the input is raw or transformed signals and the expected output is a predicted category. The semantic segmentation task we address in Section 5 is in effect a pixel-wise classification of localized features derived from the input image.
Let us consider a dataset $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ composed of $N$ i.i.d. training samples, where $\mathbf{x}_n \in \mathcal{X} \subset \mathbb{R}^D$ is a $D$-dimensional data representation, deep feature maps from an image or the image itself for instance, and $y_n \in \mathcal{Y} = \llbracket 1, K \rrbracket$ is its true class among the $K$ pre-defined categories. These samples are drawn from an unknown joint distribution $P(X, Y)$ over $(\mathcal{X}, \mathcal{Y})$.
Fig. 2: Learning confidence approach. The fixed classification network $F$, with parameters $\mathbf{w} = (\mathbf{w}_E, \mathbf{w}_{\text{cls}})$, is composed of a succession of convolutional and fully-connected layers (encoder $E$) followed by last classification layers with softmax activation. The auxiliary confidence network $C$, with parameters $\theta$, builds upon the feature maps extracted by the encoder $E$, or its fine-tuned version $E'$ with parameters $\mathbf{w}_{E'}$: they are passed to ConfidNet, a trainable multi-layer module with parameters $\varphi$. The auxiliary model outputs a confidence score $C(\mathbf{x}; \theta) \in [0, 1]$, with $\theta = \varphi$ in absence of encoder fine-tuning and $\theta = (\mathbf{w}_{E'}, \varphi)$ in case of fine-tuning.

A selective classifier [16], [17] is a pair $(f, g)$ where $f : \mathcal{X} \rightarrow \mathcal{Y}$ is a prediction function and $g : \mathcal{X} \rightarrow \{0, 1\}$ is a selection function which enables to reject a prediction:
$$(f, g)(\mathbf{x}) = \begin{cases} f(\mathbf{x}), & \text{if } g(\mathbf{x}) = 1, \\ \text{reject}, & \text{if } g(\mathbf{x}) = 0. \end{cases} \tag{1}$$
In this work, we focus on classifiers based on artificial neural networks. Given an input $\mathbf{x}$, such a network $F$ with parameters $\mathbf{w}$ outputs non-negative scores over all classes, which are normalized through softmax. If well trained, this output can be interpreted as the predictive distribution $P(Y \mid \mathbf{x}, \hat{\mathbf{w}}) = F(\mathbf{x}; \hat{\mathbf{w}}) \in \Delta$, with $\Delta$ the probability $K$-simplex in $\mathbb{R}^K$ and $\hat{\mathbf{w}}$ the learned weights. Based on this distribution, the predicted sample class is usually the maximum a posteriori estimate:
$$f(\mathbf{x}) = \operatorname*{argmax}_{k \in \mathcal{Y}} P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) = \operatorname*{argmax}_{k \in \mathcal{Y}} F(\mathbf{x}; \hat{\mathbf{w}})[k]. \tag{2}$$

We are not interested here in trying to improve the accuracy of the already-trained model $F$, but rather to make its future use more reliable by endowing the system with the ability to recognize when the prediction might be wrong. To this end, a confidence-rate function $\kappa_f : \mathcal{X} \rightarrow \mathbb{R}^+$ is associated to $f$ so as to assess the degree of confidence of its predictions, the higher the value the more certain the prediction [16], [17]. A suitable confidence-rate function should correlate erroneous predictions with low values and successful predictions with high values. Finally, given a user-defined threshold $\delta \in \mathbb{R}^+$, the selection function $g$ can be simply derived from the confidence rate:
$$g(\mathbf{x}) = \begin{cases} 1 & \text{if } \kappa_f(\mathbf{x}) \geq \delta, \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

For a given input $\mathbf{x}$, a standard confidence-rate function for a classifier $F$ is the probability associated to the predicted max-score class, that is the maximum class probability:
$$\text{MCP}_F(\mathbf{x}) = \max_{k \in \mathcal{Y}} P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) = \max_{k \in \mathcal{Y}} F(\mathbf{x}; \hat{\mathbf{w}})[k]. \tag{4}$$
However, by taking the largest softmax probability as confidence estimate, MCP leads to high confidence values both for correct and erroneous predictions alike, making it hard to distinguish them, as shown in Figure 1a. On the other hand, when the model misclassifies an example, the probability associated to the true class $y$ is lower than the maximum one and likely to be low. Based on this simple observation, we propose to consider instead this true class probability as a suitable confidence-rate function. For any admissible input $\mathbf{x} \in \mathcal{X}$, we assume the true class $y(\mathbf{x})$ is known, which we denote $y$ for simplicity. The TCP confidence rate is defined as
$$\text{TCP}_F(\mathbf{x}, y) = P(Y = y \mid \mathbf{x}, \hat{\mathbf{w}}) = F(\mathbf{x}; \hat{\mathbf{w}})[y]. \tag{5}$$
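As a minimal illustration of the two criteria (a sketch of our own; tensor names are hypothetical), MCP (4) and TCP (5) differ only in which entry of the softmax output is read: the predicted class versus the true class. Note that TCP requires the true labels and is therefore only computable on annotated data, which is precisely what motivates learning it:

```python
import torch
import torch.nn.functional as F

def mcp_and_tcp(logits, targets):
    """Compute MCP (4) and TCP (5) from classifier outputs.

    logits:  (N, K) raw class scores F(x; w)
    targets: (N,)   true labels y
    """
    probs = F.softmax(logits, dim=1)                        # predictive distribution
    mcp, preds = probs.max(dim=1)                           # probability of the predicted class
    tcp = probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # probability of the true class
    return mcp, tcp, preds
```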
Theoretical guarantees. With TCP, the following properties hold (see derivation in Appendix A.1). Given a properly labelled example $(\mathbf{x}, y)$, then:
• $\text{TCP}_F(\mathbf{x}, y) > 1/2 \Rightarrow f(\mathbf{x}) = y$, i.e., the example is correctly classified by the model;
• $\text{TCP}_F(\mathbf{x}, y) < 1/K \Rightarrow f(\mathbf{x}) \neq y$, i.e., the example is wrongly classified by the model,
where the class prediction $f(\mathbf{x})$ is defined by (2).

Within the range $[1/K, 1/2]$, there is no theoretical guarantee that correct and incorrect predictions will not overlap in terms of TCP. However, when using deep neural networks, we observe that the actual overlap area is extremely small in practice, as illustrated in Figure 1b on the CIFAR-10 dataset. One possible explanation comes from the fact that modern deep neural networks output overconfident predictions and therefore non-calibrated probabilities [35]. We provide consolidated results and analyses on this aspect in Section 6 and in Appendix A.2.
Using TCP as a confidence-rate function on a model's output would be of great help when it comes to reliably estimating its confidence. However, the true classes $y$ are obviously not available when estimating confidence on test inputs.

We propose to learn TCP confidence from data. More formally, for the classification task at hand, we consider a parametric selective classifier $(f, g)$, with $f$ based on an already-trained neural network $F$. We aim at deriving its companion selection function $g$ from a learned estimate of the TCP function of $F$. To this end, we introduce an auxiliary model $C$, with parameters $\theta$, that is intended to predict $\text{TCP}_F$ and to act as a confidence-rate function for the selection function $g$. An overview of the proposed approach is available in Figure 2. This model is trained such that, at runtime, for an input $\mathbf{x} \in \mathcal{X}$ with (unknown) true label $y$, we have:
$$C(\mathbf{x}; \theta) \approx \text{TCP}_F(\mathbf{x}, y). \tag{6}$$
In practice, this auxiliary model $C$ will be a neural network trained under full supervision on $\mathcal{D}$ to produce this confidence estimate. To design this network, we can transfer knowledge from the already-trained classification network. Throughout its training, $F$ has indeed learned to extract increasingly-complex features that are fed to its final classification layers. Calling $E$ the encoder part of $F$, a simple way to transfer knowledge consists in defining and training a multi-layer head with parameters $\varphi$ that regresses $\text{TCP}_F$ from features encoded by $E$. We call this module ConfidNet. As a result of this design, the complete confidence network $C$ is composed of a frozen encoder followed by trained ConfidNet layers. As we shall see in Section 4, the complete architecture might be later fine-tuned, including the encoder, as in classic transfer learning. In that case, $\theta$ will encompass the parameters of both the encoder and the ConfidNet layers.

In the rest of the paper, we detail the different network architectures, loss functions and learning schemes of ConfidNet for two distinct applications: classification failure prediction and self-training for semantic segmentation with domain adaptation. In both tasks, a ranking of unlabelled samples that allows a clear distinction of correct predictions from erroneous ones is crucial. The proposed auxiliary model offers a new solution to this problem.

4 APPLICATION TO FAILURE PREDICTION
Given a trained model, failure prediction is the task of predicting at run-time whether the model has taken a correct decision or not for a given input. As discussed in Section 2, there are different ways to attack this task, which has many real-world applications, in safety-critical systems especially. With a confidence-rate function in hand, the task can be simply set as thresholding this function, exactly in the same way the selection function works in prediction with a reject option. In this section, we discuss how ConfidNet can be used for that exact purpose in the context of image classification.
State-of-the-art image classification models are composed of convolutional layers followed by one or more fully-connected layers and a final softmax operation. In order to work with such a classification network $F$, we build ConfidNet upon a late intermediate representation of $F$. ConfidNet is designed as a small multilayer perceptron composed of a succession of dense layers with a final sigmoid activation that outputs $C(\mathbf{x}; \theta) \in [0, 1]$. As explained in Section 3, we train this network in a supervised manner, such that it predicts well the true-class probability assigned by $F$ to the input image. Regarding the capacity of ConfidNet, we have empirically found that increasing its depth further leaves performance unchanged for estimating the confidence of the classification network (see Appendix B.4 for more details).

As we want to regress a score between 0 and 1, we use a mean-square-error (MSE) loss to train the confidence model:
$$\mathcal{L}_{\text{conf}}(\theta; \mathcal{D}) = \frac{1}{N} \sum_{n=1}^{N} \big( C(\mathbf{x}_n; \theta) - \text{TCP}_F(\mathbf{x}_n, y_n) \big)^2. \tag{7}$$
Since the final task here is the prediction of failures, with confidence prediction being only a means toward it, a more explicit supervision with failure/success information could be considered. In that case, the previous regression loss could still be used, with 0 (failure) and 1 (success) target values instead of TCP. Alternatively, a binary cross-entropy loss (BCE) for the error-prediction task using the predicted confidence as a score could be used. Seeing failure detection as a ranking problem, where good predictions must be ranked before erroneous ones according to the predicted confidence, a batch-wise ranking loss can also be utilized [55]. We assessed experimentally all these alternative losses, including a focal version [56] of the BCE to focus on hard examples, as discussed in Section 6.1.3. They lead to inferior performance compared to using (7). This might be due to the fact that TCP conveys more detailed information than a mere binary label about the quality of the classifier's prediction for a sample. In situations where only very few error samples are available, this fine-grained information improves the performance of the final failure detection (see Section 6.1.3).

We decompose the parameters of the classification network $F$ into $\mathbf{w} = (\mathbf{w}_E, \mathbf{w}_{\text{cls}})$, where $\mathbf{w}_E$ denotes its encoder's weights and $\mathbf{w}_{\text{cls}}$ the weights of its last classification layers. As in transfer learning, the training of the confidence network $C$ starts by fixing the shared encoder and training only ConfidNet's weights $\varphi$. In this phase, the loss (7) is thus minimized only w.r.t. $\theta = \varphi$.

In a second phase, we further fine-tune the complete network $C$, including its encoder which is now untied from the classification encoder $E$ (the main classification model must remain unchanged, by definition of the addressed problem). Denoting $E'$ this now independent encoder, and $\mathbf{w}_{E'}$ its weights, this second training phase optimizes (7) w.r.t. $\theta = (\mathbf{w}_{E'}, \varphi)$, with $\mathbf{w}_{E'}$ initially set to $\mathbf{w}_E$.
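A compact sketch of this design and of its training objective (our own illustration: the hidden width and layer count are placeholders, since the paper only specifies a small MLP of dense layers with a final sigmoid, trained with the MSE loss (7)):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfidNet(nn.Module):
    """Small confidence head regressing TCP from penultimate features."""
    def __init__(self, feat_dim, hidden_dim=400):  # hidden_dim is a placeholder
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, feats):
        return torch.sigmoid(self.net(feats)).squeeze(1)  # confidence in [0, 1]

def confidence_loss(confid_net, encoder, classifier, x, y):
    """MSE loss (7) between predicted confidence and the TCP target.

    Phase 1: `encoder` is the frozen classification encoder E (theta = phi).
    Phase 2: `encoder` is a fine-tuned copy E' of E (theta = (w_E', phi)),
    in which case torch.no_grad() below should only wrap the classifier.
    """
    with torch.no_grad():  # the classifier F itself always stays frozen
        feats = encoder(x)
        probs = F.softmax(classifier(feats), dim=1)
        tcp = probs.gather(1, y.unsqueeze(1)).squeeze(1)
    return F.mse_loss(confid_net(feats), tcp)
```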
We also deactivate dropout layers in this last training phase and reduce the learning rate to mitigate stochastic effects that may lead the new encoder to deviate too much from the original one used for classification. Data augmentation can thus still be used. In Section 6, we put this framework to work on several standard image-classification benchmarks and analyse its effectiveness in comparison with alternative approaches.

5 APPLICATION TO SELF-TRAINING IN SEMANTIC SEGMENTATION WITH DOMAIN ADAPTATION

Fig. 3: Overview of proposed confidence learning for domain adaptation (ConDA) in semantic segmentation. Given images in source and target domains, we pass them to the encoder part of the segmentation network $F$ to obtain their feature maps. This network $F$ is fixed during this phase and its weights are not updated. The confidence maps are obtained by feeding these feature maps to the trainable head of the confidence network $C$, which includes a multi-scale ConfidNet module. For source-domain images, a regression loss $\mathcal{L}_{\text{conf}}$ (8) is computed to minimize the distance between $C^\theta_{\mathbf{x}_s}$ and the fixed true-class-probability map $\text{TCP}_F(\mathbf{x}_s, \mathbf{y}_s)$. An adversarial training scheme, based on the discriminator's loss $\mathcal{L}_D(\psi)$ (Eq. 10) and the adversarial part $\mathcal{L}_{\text{adv}}(\theta)$ of the confidence net's loss (Eq. 12), is also added to enforce the consistency between the $C^\theta_{\mathbf{x}_s}$'s and the $C^\theta_{\mathbf{x}_t}$'s. Dashed arrows stand for paths that are used only at train time.
Unsupervised domain adaptation for semantic segmentation aims to adapt a segmentation model trained on a labeled source domain to a target domain devoid of annotation. Formally, let us consider the annotated source-domain training set $\mathcal{D}_s = \{(\mathbf{x}_{s,n}, \mathbf{y}_{s,n})\}_{n=1}^{N_s}$, where $\mathbf{x}_{s,n}$ is a color image of size $(H, W)$ and $\mathbf{y}_{s,n} \in \mathcal{Y}^{H \times W}$ its associated ground-truth segmentation map. A segmentation network $F$ with parameters $\mathbf{w}$ takes as input an image $\mathbf{x}$ and returns a predicted soft-segmentation map $F(\mathbf{x}; \mathbf{w}) = P^{\mathbf{w}}_{\mathbf{x}} \in [0, 1]^{H \times W \times K}$, where $P^{\mathbf{w}}_{\mathbf{x}}[h, w, :] = P(Y[h, w] \mid \mathbf{x}; \mathbf{w}) \in \Delta$. The final prediction of the network is the segmentation map $f(\mathbf{x})$ defined pixel-wise as $f(\mathbf{x})[h, w] = \operatorname{argmax}_{k \in \mathcal{Y}} P^{\mathbf{w}}_{\mathbf{x}}[h, w, k]$. This network is learned with full supervision from the source-domain samples in $\mathcal{D}_s$, using a cross-entropy loss, while leveraging a set $\mathcal{D}_t$ of unlabelled target-domain examples.

In UDA, the main challenge is to use the unlabeled target set $\mathcal{D}_t = \{\mathbf{x}_{t,n}\}_{n=1}^{N_t}$ available during training to learn domain-invariant features on which the segmentation model would behave similarly in both domains. As reviewed in Section 2, a variety of techniques have been proposed to do that, in particular for the task of semantic segmentation. Leveraging automatic pseudo-labeling of target-domain training examples is in particular a simple, yet powerful way to further improve UDA performance with self-training. One key ingredient of such an approach being the selection of the most promising pseudo-labels, the proposed auxiliary confidence-prediction model lends itself particularly well to this task. In the rest of this section, we detail how the proposed approach to confidence prediction can be adapted to semantic segmentation, with application to domain adaptation through self-training. The resulting framework, called ConDA, is illustrated in Figure 3.

A high-level view of self-training for semantic segmentation with UDA is as follows:
1) Train a segmentation network for the target domain using a chosen UDA technique;
2) Collect pseudo-labels among the predictions that this network makes on the target-domain training images;
3) Train a new semantic-segmentation network from scratch using the chosen UDA technique in combination with supervised training on target-domain data with pseudo-labels;
4) Possibly, repeat from step 2 by collecting better pseudo-labels after each iteration.

While the general idea of self-training is simple and intuitive, collecting good pseudo-labels is quite tricky: if too many of them correspond to erroneous predictions of the current segmentation network, the performance of the whole UDA can deteriorate. Thus, a measure of confidence should be used in order to only gather reliable predictions as pseudo-labels and to reject the others, as sketched below.
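The pixel-wise selection underlying step 2 can be sketched as follows (our own sketch; the threshold value and the ignore index are illustrative placeholders, 255 being simply a common 'ignore' convention in segmentation losses):

```python
import torch

def select_pseudo_labels(seg_probs, conf_map, delta=0.9, ignore_index=255):
    """Confidence-based pseudo-label selection (step 2 of the loop above).

    seg_probs: (K, H, W) soft segmentation map of the trained network F
    conf_map:  (H, W)    confidence map predicted by the auxiliary model C
    Pixels whose confidence falls below `delta` are marked with `ignore_index`
    so that they are excluded from the supervision loss when re-training.
    """
    labels = seg_probs.argmax(dim=0)         # hard predictions f(x)
    labels[conf_map < delta] = ignore_index  # reject unreliable pixels
    return labels
```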
Following the self-training framework previously described, a confidence network $C$ is learned at step (2) to predict the confidence of the UDA-trained semantic segmentation network $F$ and used to select only trustworthy pseudo-labels on target-domain images. To this end, the framework proposed in Section 3 in an image classification setup, and applied to predicting erroneous image classification in Section 4, needs here to be adapted to the structured output of semantic segmentation.

Semantic segmentation can be seen as a pixel-wise classification problem. Given a target-domain image $\mathbf{x}_t$, we want to predict both its soft semantic map $F(\mathbf{x}_t; \mathbf{w})$ and, using an auxiliary model with trainable parameters $\theta$, its confidence map $C(\mathbf{x}_t; \theta) = C^\theta_{\mathbf{x}_t} \in [0, 1]^{H \times W}$. Given a pixel $(h, w)$, if its confidence $C^\theta_{\mathbf{x}_t}[h, w]$ is above a chosen threshold $\delta$, we label it with its predicted class $f(\mathbf{x}_t)[h, w] = \operatorname{argmax}_{k \in \mathcal{Y}} P^{\mathbf{w}}_{\mathbf{x}_t}[h, w, k]$, otherwise it is masked out. Computed over all images in $\mathcal{D}_t$, these incomplete segmentation maps constitute target pseudo-labels that are used to train a new semantic-segmentation network. Optionally, we may repeat from step (2) and learn alternately a confidence model to collect pseudo-labels and a segmentation network using this self-training.

To train the segmentation confidence network $C$, we propose to jointly optimize two objectives. Following the approach proposed in Section 3, the first one supervises the confidence prediction on annotated source-domain examples using the known true class probabilities for the predictions from $F$. Specific to semantic segmentation with UDA, the second one is an adversarial loss that aims at reducing the domain gap between source and target. A complete overview of the approach is provided in Figure 3.
Confidence loss. The first objective is a pixel-wise version of the confidence loss in (7). On annotated source-domain images, it requires $C$ to predict at each pixel the score assigned by $F$ to the (known) true class:
$$\mathcal{L}_{\text{conf}}(\theta; \mathcal{D}_s) = \frac{1}{N_s} \sum_{n=1}^{N_s} \big\| C^\theta_{\mathbf{x}_{s,n}} - \text{TCP}_F(\mathbf{x}_{s,n}, \mathbf{y}_{s,n}) \big\|^2_F, \tag{8}$$
where $\|\cdot\|_F$ denotes the Frobenius norm and, for an image $\mathbf{x}$ with true segmentation map $\mathbf{y}$ and predicted soft one $F(\mathbf{x}; \hat{\mathbf{w}})$, we note
$$\text{TCP}_F(\mathbf{x}, \mathbf{y})[h, w] = F(\mathbf{x}; \hat{\mathbf{w}})\big[h, w, \mathbf{y}[h, w]\big] \tag{9}$$
at location $(h, w)$. On a new input image, $C$ should predict at each pixel the score that $F$ will assign to the unknown true class, which will serve as a confidence measure.

However, compared to the application in the previous section, we have here the additional problem of the gap between source and target domains, an issue that might affect the training of the confidence model as it does the training of the segmentation model.
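A direct implementation sketch of the source-domain loss (8) with its pixel-wise TCP target (9) (our own; tensor layouts are assumptions):

```python
import torch
import torch.nn.functional as F

def segmentation_confidence_loss(conf_maps, seg_logits, gt_maps):
    """Pixel-wise confidence loss (8) on annotated source-domain images.

    conf_maps:  (B, H, W)    confidence maps predicted by C
    seg_logits: (B, K, H, W) logits of the frozen segmentation network F
    gt_maps:    (B, H, W)    ground-truth label maps y
    """
    with torch.no_grad():
        probs = F.softmax(seg_logits, dim=1)
        tcp_maps = probs.gather(1, gt_maps.unsqueeze(1)).squeeze(1)  # TCP map (9)
    # mean squared error, i.e. the squared Frobenius distance averaged over pixels
    return F.mse_loss(conf_maps, tcp_maps)
```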
Adversarial loss. The second objective concerns the domain gap. While model $C$ learns to estimate TCP on source-domain images, its confidence estimation on target-domain images may suffer dramatically from this domain shift. As classically done in UDA, we propose an adversarial learning of our auxiliary model in order to address this problem. More precisely, we want the confidence maps produced by $C$ in the source domain to resemble those obtained in the target domain.

A discriminator $D : [0, 1]^{H \times W} \rightarrow \{0, 1\}$, with parameters $\psi$, is trained concurrently with $C$ with the aim to recognize the domain (1 for source, 0 for target) of an image given its confidence map. The following loss is minimized w.r.t. $\psi$:
$$\mathcal{L}_D(\psi; \mathcal{D}_s \cup \mathcal{D}_t) = \frac{1}{N_s} \sum_{n=1}^{N_s} \mathcal{L}_{\text{adv}}(\mathbf{x}_{s,n}, 1) + \frac{1}{N_t} \sum_{n=1}^{N_t} \mathcal{L}_{\text{adv}}(\mathbf{x}_{t,n}, 0), \tag{10}$$
where $\mathcal{L}_{\text{adv}}$ denotes the cross-entropy loss of the discriminator based on confidence maps:
$$\mathcal{L}_{\text{adv}}(\mathbf{x}, \lambda) = -\lambda \log\big(D(C^\theta_{\mathbf{x}}; \psi)\big) - (1 - \lambda) \log\big(1 - D(C^\theta_{\mathbf{x}}; \psi)\big), \tag{11}$$
for $\lambda \in \{0, 1\}$, which is a function of both $\psi$ and $\theta$. In alternation with the training of the discriminator using (10), the adversarial training of the confidence net is conducted by minimizing, w.r.t. $\theta$, the following loss:
$$\mathcal{L}_C(\theta; \mathcal{D}_s \cup \mathcal{D}_t) = \mathcal{L}_{\text{conf}}(\theta; \mathcal{D}_s) + \frac{\lambda_{\text{adv}}}{N_t} \sum_{n=1}^{N_t} \mathcal{L}_{\text{adv}}(\mathbf{x}_{t,n}, 1), \tag{12}$$
where the second term, weighted by $\lambda_{\text{adv}}$, encourages $C$ to produce maps in the target domain that will confuse the discriminator.

This adversarial scheme for confidence learning also acts as a regularizer during training, improving the robustness of the unknown TCP target confidence. As the training of $C$ may actually be unstable, adversarial training provides an additional information signal, in particular imposing that confidence estimation should be invariant to domain shifts. We empirically observed that this adversarial confidence learning provides better confidence estimates and improves convergence and stability of the training scheme.
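The alternating optimization of (10) and (12) can be sketched as follows (our own sketch: the discriminator is assumed to end with a sigmoid, and the optimizers, map shapes and λ_adv value are placeholders):

```python
import torch
import torch.nn.functional as F

def adversarial_step(conf_net, discriminator, opt_c, opt_d,
                     feats_src, feats_tgt, tcp_src, lambda_adv=1e-3):
    """One alternating update of the discriminator (10) and of C (12)."""
    conf_src = conf_net(feats_src)  # source-domain confidence maps
    conf_tgt = conf_net(feats_tgt)  # target-domain confidence maps

    # Discriminator update: source maps labeled 1, target maps labeled 0.
    d_src = discriminator(conf_src.detach())
    d_tgt = discriminator(conf_tgt.detach())
    loss_d = F.binary_cross_entropy(d_src, torch.ones_like(d_src)) \
           + F.binary_cross_entropy(d_tgt, torch.zeros_like(d_tgt))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Confidence-net update: TCP regression on source (8) plus the
    # adversarial term of (12), where target maps are labeled as source.
    loss_conf = F.mse_loss(conf_src, tcp_src)
    d_tgt = discriminator(conf_tgt)
    loss_adv = F.binary_cross_entropy(d_tgt, torch.ones_like(d_tgt))
    loss_c = loss_conf + lambda_adv * loss_adv
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    return loss_d.item(), loss_c.item()
```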
TABLE 1: Comparison of confidence estimation methods for failure prediction and selective classification. For each dataset, all methods share the same classification network. For MC Dropout, test accuracy is averaged through random sampling. The first three metrics are percentages and concern failure prediction. The last two (the lower, the better) concern selective classification and their values have been multiplied by 10³ for clarity. Best results are in bold, second best ones are underlined.

Dataset     Model         Method            FPR@95%TPR↓  AUPR↑   AUROC↑  AURC↓   E-AURC↓
MNIST       MLP           MCP [23]          14.87        37.70   97.13   0.77    0.58
                          MC Dropout [19]   15.15        38.22   97.15   0.79    0.59
                          Trust Score [24]  12.31        52.18   97.52   0.69    0.50
                          ConfidNet
MNIST       SmallConvNet  MCP [23]          5.56         35.05   98.63   0.17    0.12
                          MC Dropout [19]   5.26         38.50   98.65   0.17    0.13
                          Trust Score [24]  10.00        35.88   98.20   0.21    0.17
                          ConfidNet
SVHN        SmallConvNet  MCP [23]          31.28        48.18   93.20   5.44    4.38
                          MC Dropout [19]   36.60        43.87   92.85   5.68    4.57
                          Trust Score [24]  34.74        43.32   92.16   6.08    5.03
                          ConfidNet
CIFAR-10    VGG16         MCP [23]          47.50        45.36   91.53   10.45   7.32
                          MC Dropout [19]   49.02        46.40   92.08   9.97    6.92
                          Trust Score [24]  55.70        38.10   88.47   16.04   12.91
                          ConfidNet
CIFAR-100   VGG16         MCP [23]          67.86        71.99   85.67   120.28  54.35
                          MC Dropout [19]   64.68        72.59   86.09
                          Trust Score [24]  71.74        66.82   84.17   126.90  60.97
                          ConfidNet
In semantic segmentation, models consist of fully convolutional networks where hidden representations are 2D feature maps. This is in contrast with the architecture of the classification models considered in Section 4. As a result, the ConfidNet module must have a different design here: instead of fully-connected layers, it is composed of 1×1 convolutional layers with the adequate number of channels.

In many segmentation datasets, the existence of objects at multiple scales may complicate confidence estimation. As in recent works dealing with varying object sizes [57], we further improve our confidence network $C$ by adding a multi-scale architecture based on spatial pyramid pooling. It consists of a computationally efficient scheme to re-sample a feature map at different scales, and then to aggregate the confidence maps.

From a feature map, we apply parallel atrous convolutional layers with 3×3 kernel size and different sampling rates, each of them followed by a series of 4 standard convolutional layers with 3×3 kernel size. In contrast with convolutional layers with large kernels, atrous convolutional layers enlarge the field of view of filters and help to incorporate a larger context without increasing the number of parameters and the computation time. Resulting features are then summed before upsampling to the original image size of $H \times W$. We apply a final sigmoid activation to output a confidence map with values between 0 and 1.

The whole architecture of the confidence model $C$ is represented in the orange block of Figure 3, along with its training given a fixed segmentation model $F$ (blue block) with which it shares the encoder. As in the previous section, fine-tuning the encoder within $C$ is also possible, although we did not explore this option in the semantic segmentation context due to the excessive memory overhead it implies.
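A sketch of such a multi-scale head in the spirit of atrous spatial pyramid pooling (our own; channel widths and dilation rates are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleConfidNet(nn.Module):
    """Fully convolutional confidence head with parallel atrous branches."""
    def __init__(self, in_ch, mid_ch=128, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList()
        for r in rates:
            layers = [nn.Conv2d(in_ch, mid_ch, 3, padding=r, dilation=r), nn.ReLU()]
            for _ in range(4):  # a series of standard 3x3 convolutions
                layers += [nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU()]
            layers += [nn.Conv2d(mid_ch, 1, 3, padding=1)]
            self.branches.append(nn.Sequential(*layers))

    def forward(self, feats, out_size):
        conf = sum(b(feats) for b in self.branches)  # sum the per-scale maps
        conf = F.interpolate(conf, size=out_size, mode="bilinear",
                             align_corners=False)    # upsample to (H, W)
        return torch.sigmoid(conf).squeeze(1)        # confidence map in [0, 1]
```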
6 EXPERIMENTS

We evaluate our approach on the two tasks presented in the previous sections: failure prediction in classification settings, and semantic segmentation with domain adaptation.
6.1 Failure Prediction

In this section, we present comparative experiments against state-of-the-art confidence-estimation approaches and Bayesian methods on various datasets. Then, we study the effect of learning variants on our approach.
The experiments are conducted on image datasets of varying scale and complexity: MNIST [58] and SVHN [59] provide small and relatively simple images of digits (10 classes); CIFAR-10 and CIFAR-100 [60] propose more complex object-recognition tasks on low-resolution images. The classification models range from small convolutional networks for MNIST and SVHN to the larger VGG-16 architecture for the CIFAR datasets. We also consider a multi-layer perceptron (MLP) with one hidden layer to investigate performance on small models. ConfidNet is attached to the penultimate layer of the convolutional neural network. Further details about datasets, architectures, training and metrics can be found in Appendix B.

We measure the quality of failure prediction following standard metrics used in the literature [23]: AUROC, the area under the receiver operating characteristic; FPR@95%TPR, the false-positive rate measured when the true-positive rate is 95%; and AUPR, the area under the precision-recall curve, using here incorrect model predictions as positive detection samples (see details in Appendix B.2). Among these metrics, AUPR is the most directly related to the failure detection task, and is thus the prevalent one in our assessment.

As an additional, indirect way to assess the quality of the predicted classifier's confidence, we also consider the selective classification problem that was discussed in Section 3.1. In this setup, the predictions by the classifier $F$ that get a predicted confidence below a defined threshold are rejected. Given a coverage rate (the fraction of examples that are not rejected), the performance of the classifier should improve. The impact of this selection, and hence of the underlying confidence-rate function, is measured on average with the area under the risk-coverage curve (AURC) and its normalized variant, Excess-AURC (E-AURC) [61].
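All these metrics can be computed from per-sample confidence scores and correctness indicators, e.g. with scikit-learn (a sketch of our own; errors are taken as the positive class for AUPR, as in the paper):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def failure_prediction_metrics(confidence, correct):
    """confidence: (N,) scores; correct: (N,) 1 if the prediction is right."""
    aupr_errors = average_precision_score(1 - correct, -confidence)  # errors as positives
    auroc = roc_auc_score(correct, confidence)
    fpr, tpr, _ = roc_curve(correct, confidence)
    fpr_at_95tpr = fpr[np.searchsorted(tpr, 0.95)]  # FPR when TPR reaches 95%
    return aupr_errors, auroc, fpr_at_95tpr
```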
Fig. 4: Limitations of MC Dropout's confidence measure. Two test samples from the SVHN dataset, which are respectively misclassified (left) and correctly classified (right) by a given model $F$, illustrate these limits. The entropies of the predicted class distributions (averaged over Monte Carlo dropout passes and displayed as histograms) are equally high, at around 0.79, resulting in equally low MC Dropout confidence estimates. In contrast, both MCP and TCP approximated by ConfidNet clearly differ as expected for the two examples. Yet, ConfidNet has the best behavior, being the lowest for the erroneous model prediction and the highest for the correct one.

TABLE 2: Impact of the choice of training data on the error-prediction performance of ConfidNet. Comparison in AUPR between training on the model's train set or on a validation set.

Variant           MNIST (MLP)   MNIST (SmallConvNet)   SVHN (SmallConvNet)   CIFAR-10 (VGG-16)   CIFAR-100 (VGG-16)
ConfidNet-train
ConfidNet-val
Along with our approach, we implemented competitive confidence and uncertainty estimation methods including MCP [23], Trust Score [24], and Monte-Carlo Dropout (MC Dropout) [19].

Comparative results are summarized in Table 1. We observe that our approach outperforms the other methods in every setting, with a significant gap on small models/datasets. This confirms that TCP is an adequate confidence criterion for failure prediction and that our approach ConfidNet is able to learn it. Trust Score also delivers good results on small datasets/models such as MNIST. While ConfidNet still performs well on more complex datasets, Trust Score's performance drops, which might be explained by high-dimensionality issues with distances. Regarding selective classification results (AURC and E-AURC), we also provide risk-coverage curves in Appendix B.8.

We also improve on state-of-the-art performances from MC Dropout. While MC Dropout leverages ensembling based on dropout layers, taking as confidence measure the entropy of the average softmax distribution may not always be adequate. In Figure 4, we show side-by-side samples with similar distribution entropy. The left image is misclassified while the right one enjoys a correct prediction. In fact, the entropy is a permutation-invariant measure on discrete probability distributions: a correct 3-class prediction whose score vector is a permutation of an incorrect one has exactly the same entropy-based confidence. In contrast, our approach can discriminate an incorrect from a correct prediction, despite both having similarly-spread distributions.
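This permutation invariance is easy to verify numerically (a toy example of our own, with made-up probabilities where class 0 is the true class):

```python
import torch

p_correct = torch.tensor([0.70, 0.20, 0.10])  # argmax = true class 0
p_wrong = torch.tensor([0.20, 0.70, 0.10])    # same values permuted: argmax is class 1

entropy = lambda p: -(p * p.log()).sum()
print(entropy(p_correct), entropy(p_wrong))   # identical entropies
print(p_correct[0], p_wrong[0])               # the TCP of class 0 clearly differs
```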
We analyse in Table 3 the effect of the encoder fine-tuning that is described in Section 4.3. Learning only ConfidNet on top of the pre-trained encoder $E$ (that is, $\theta = \varphi$), our confidence network already achieves significant improvements w.r.t. the baselines. With a subsequent fine-tuning of both modules (that is, $\theta = (\mathbf{w}_{E'}, \varphi)$), its performance is further boosted in every setting, by around 1-2%. Note that using a vanilla fine-tuning without the deactivation of the dropout layers did not bring any improvement.

TABLE 3: Impact of the encoder fine-tuning on the error-prediction performance of ConfidNet. Comparison in AUPR on two benchmarks with different backbones.

Variant                 MNIST (SmallConvNet)   CIFAR-100 (VGG-16)
Confidence training     43.94%                 72.68%
+ Encoder fine-tuning   45.89%                 73.68%
Given the small number of erroneous-prediction samples that are available due to deep neural network over-fitting, we also experimented with confidence training on a hold-out dataset. We report the results on all datasets in Table 2 for validation sets with 10% of samples. We observe a general performance drop when using a validation set for training TCP confidence. The drop is especially pronounced for small datasets (MNIST), where models reach more than 97% train and validation accuracies. Consequently, with a high accuracy and a small validation set, we do not get a larger absolute number of errors using a hold-out set rather than the train set. One solution would be to increase the validation-set size, but this would damage the model's prediction performance. By contrast, we take care in our experiments to base our confidence estimation on models with levels of test predictive performance that are similar to those of the baselines. On CIFAR-100, the gap between train accuracy and validation accuracy is substantial (95.56% vs. 65.96%), which may explain the slight improvement for confidence estimation using a validation set (+0.17%). We think that training ConfidNet on a validation set with models reporting low/medium test accuracies could improve the approach.

TABLE 4: Comparative performance on semantic segmentation with synth-to-real unsupervised domain adaptation. Results in per-class IoU and class-averaged mIoU on GTA5 → Cityscapes. All methods are based on a DeepLabv2 backbone.

TABLE 5: Effect of the loss on the error-detection performance of ConfidNet. Comparison in AUPR between the proposed MSE loss and three other alternatives.

Dataset    MSE      BCE   Focal   Ranking
SVHN       50.72%
CIFAR-10   49.94%
In Table 5, we compare training ConfidNet with the MSE loss (7) to training with a binary-classification cross-entropy loss (BCE), a focal BCE loss and a batch-wise approximate ranking loss. Even though BCE specifically addresses the failure prediction task, it achieves lower performance on the CIFAR-10 and SVHN datasets. Similarly, the focal loss and the ranking one yield results below TCP's performance in every tested benchmark. Our intuition is that TCP regularizes the training by providing finer-grained information about the quality of the classifier's predictions. This is especially important in the difficult learning configuration where only very few error samples are available due to the good performance of the classifier.
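For reference, the BCE and focal alternatives compared in Table 5 can be written as follows (our own formulation; the focal exponent γ = 2 is an assumption in the spirit of [56]):

```python
import torch
import torch.nn.functional as F

def bce_conf_loss(conf, correct):
    """BCE: treat the predicted confidence as a probability of success."""
    return F.binary_cross_entropy(conf, correct.float())

def focal_conf_loss(conf, correct, gamma=2.0):
    """Focal variant of the BCE, down-weighting already easy examples."""
    p_t = torch.where(correct.bool(), conf, 1.0 - conf)
    return (-(1.0 - p_t).pow(gamma) * p_t.clamp_min(1e-12).log()).mean()
```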
6.2 Semantic Segmentation with Domain Adaptation

In this section, we analyse on several semantic segmentation benchmarks the performance of ConDA, our approach to domain adaptation with confidence-based self-training. We report comparisons with state-of-the-art methods on each benchmark. We also analyse further the quality of ConDA's pseudo-labelling and demonstrate via an ablation study the importance of each of its components.
As in many UDA works for semantic segmentation, we consider the specific task of adapting from synthetic to real data in urban scenes. We present in particular experiments in the common set-up, denoted GTA5 → Cityscapes, where GTA5 [63] is the synthetic source dataset while the real-world target dataset is Cityscapes [64]. We also validate our approach on two other benchmarks, SYNTHIA → Cityscapes and SYNTHIA → Mapillary Vistas [65], in Appendix C.3. The GTA5 [63] dataset is composed of 24,966 images extracted from the eponymous game, of dimension 1914 × 1052, and semantically annotated with 19 classes in common with Cityscapes [64]. Cityscapes [64] is a dataset of real street-level images. For domain adaptation, we use the training set as target dataset during training. It is composed of 2,975 images of dimension 2048 × 1024. All results are reported in terms of intersection over union (IoU) per class or mean IoU over all classes (mIoU); the higher this percentage, the better.

We evaluate the proposed self-training method on AdvEnt [50], a state-of-the-art UDA approach. AdvEnt [50] proposes an adversarial learning framework for domain adaptation: instead of the softmax output predictions, AdvEnt aligns the entropy of the pixel-wise predictions. All the implementations are done with the PyTorch framework [66]. The semantic segmentation models are initialized with DeepLabv2 backbones pretrained on ImageNet [1]. Due to computational constraints, we only train the multi-scale ConfidNet without encoder fine-tuning. Further information about architectures and implementation details of training and metrics can be found in Appendix C.1.

The results of semantic segmentation on the Cityscapes validation set using GTA5 as source domain are given in Table 4. All the methods rely on DeepLabv2 as their segmentation backbone. We first notice that self-training-based methods from the literature are superior on this benchmark, with ESL [52] reporting the best mIoU among them. ConDA outperforms all those methods.
Ablation Study. To study the effect of the adversarial training and of the multi-scale confidence architecture on the confidence model, we perform an ablation study on the GTA5 → Cityscapes benchmark. The results on domain adaptation after re-training the segmentation network using collected pseudo-labels are reported in Table 6. In this table, 'ConfidNet' refers to the simple network architecture defined in Section 4 (adapted to segmentation by replacing the fully-connected layers by 1×1 convolutions of suitable width); 'Adv. ConfidNet' denotes the same architecture but with the adversarial loss from Section 5.3 added to its learning scheme; 'Multi-scale ConfidNet' stands for the architecture introduced in Section 5.4; finally, the full method, 'ConDA', amounts to having both this architecture and the adversarial loss. We notice that adding the adversarial learning achieves significantly better performance, for both ConfidNet and multi-scale ConfidNet. Multi-scale ConfidNet (resp. adv. multi-scale ConfidNet) also improves performance over its ConfidNet architecture counterpart. These results stress the importance of both components of the proposed confidence model.

Fig. 5: Qualitative results of pseudo-label selection for semantic-segmentation adaptation. The first three columns present target-domain images of the GTA5 → Cityscapes benchmark (a) along with their ground-truth segmentation maps (b) and the predicted maps before self-training (c). We compare pseudo-labels collected with MCP (d) and with ConDA (e). Green (resp. red) pixels are correct (resp. erroneous) predictions selected by the method and black pixels are discarded predictions. ConDA retains fewer errors while preserving approximately the same amount of correct predictions.
TABLE 6: Ablation study on semantic segmentation with pseudo-labelling-based adaptation. The full-fledged ConDA approach is compared on GTA5 → Cityscapes to stripped-down variants (with and without multi-scale architecture in ConfidNet, with and without adversarial learning).

Model                   Multi-scale   Adv.   mIoU
ConfidNet                                    47.6
Multi-scale ConfidNet   ✓
Adv. ConfidNet                        ✓
ConDA                   ✓             ✓
Quality of pseudo-labels. We analyze here the effectiveness of MCP and ConDA as confidence measures to select relevant pseudo-labels in the target domain. For a given fraction of retained pseudo-labels (coverage) on target-domain training images, we compare in Figure 6 the proportion of those labels that are correct (precision). ConDA outperforms MCP at all coverage levels, meaning that it selects significantly fewer erroneous predictions for the next round of segmentation-model training. Along with the segmentation adaptation improvements presented earlier, these coverage results demonstrate that reducing the amount of noise in the pseudo-labels is key to learning a better segmentation adaptation model. Figure 5 presents qualitative results of those pseudo-label methods. We find again that MCP and ConDA seem to select around the same amount of correct predictions in their pseudo-labels, but ConDA picks out far fewer erroneous ones.

Fig. 6: Comparative quality of selected pseudo-labels. Proportion of correct pseudo-labels (precision) for different coverages on GTA5 → Cityscapes, for MCP and ConDA.
7 CONCLUSION
In this paper, we defined a new confidence criterion, TCP, which enjoys both theoretical guarantees and empirical evidence of improving confidence estimation for classifiers with a reject option. We proposed a specific method to learn this criterion with an auxiliary neural network built upon the encoder of the model that is monitored. Applied to failure prediction, this learning scheme consists in training the auxiliary network and then enabling the fine-tuning of its encoder (the one of the monitored classifier remains frozen). In each image classification experiment, we were able to improve the capacity of the model to distinguish correct from erroneous samples and to achieve better selective classification. Besides failure prediction, other applications can benefit from this improved confidence estimation. In particular, we showed that, applied to self-training with pseudo-labels, our approach reaches state-of-the-art results on three synthetic-to-real unsupervised-domain-adaptation benchmarks (GTA5 → Cityscapes, SYNTHIA → Cityscapes and SYNTHIA → Mapillary Vistas). To achieve these results, we equipped the auxiliary model with a multi-scale confidence architecture and supplemented the confidence loss with an adversarial training scheme which enforces alignment between confidence maps in source and target domains. One clear limitation of this approach is the number of errors available in training. Further work includes exploring methods to artificially generate errors, such as aggressive data augmentation.

APPENDIX A
THE TRUE CLASS PROBABILITY (TCP) CRITERION
A.1 Proof of TCP’s theoretical guarantees
Let $F$ be a trained neural network classifier with learned weights $\hat{\mathbf{w}}$ as defined in the main paper, $K$ be the number of labels and $\mathbf{x} \in \mathbb{R}^D$ a sample with its associated true label $y \in \mathcal{Y}$ such that $\text{TCP}_F(\mathbf{x}, y) > \frac{1}{2}$. Starting from the definition of TCP, we have:
$$\text{TCP}_F(\mathbf{x}, y) = P(Y = y \mid \mathbf{x}, \hat{\mathbf{w}}) > \frac{1}{2} \tag{13}$$
$$\iff 1 - \sum_{k \in \mathcal{Y}, k \neq y} P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) > \frac{1}{2} \tag{14}$$
$$\iff \sum_{k \in \mathcal{Y}, k \neq y} P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) < \frac{1}{2}. \tag{15}$$
Since probabilities are positive, we obtain that $\forall k \neq y$, $P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) < \frac{1}{2} < P(Y = y \mid \mathbf{x}, \hat{\mathbf{w}})$. Denoting $\hat{y}$ the class predicted by the network, we have $\hat{y} = \operatorname{argmax}_k P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}})$. Hence $\hat{y} = y$.

In the same way, for any $(\mathbf{x}, y) \in \mathbb{R}^D \times \mathcal{Y}$ such that $\text{TCP}_F(\mathbf{x}, y) < \frac{1}{K}$, we have:
$$P(Y = y \mid \mathbf{x}, \hat{\mathbf{w}}) < \frac{1}{K} \tag{16}$$
$$\iff 1 - \sum_{k \in \mathcal{Y}, k \neq y} P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) < \frac{1}{K} \tag{17}$$
$$\iff \sum_{k \in \mathcal{Y}, k \neq y} P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) > \frac{K - 1}{K}. \tag{18}$$
If the model correctly classified this sample, i.e., $\hat{y} = y$, then $\forall k \neq y$, $P(Y = y \mid \mathbf{x}, \hat{\mathbf{w}}) \geq P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}})$. We would have:
$$\sum_{k \in \mathcal{Y}, k \neq y} P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) \leq (K - 1)\, P(Y = y \mid \mathbf{x}, \hat{\mathbf{w}}) < \frac{K - 1}{K}, \tag{19}$$
which contradicts Equation (18). Hence, there exists at least one $k$ such that $P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) > P(Y = y \mid \mathbf{x}, \hat{\mathbf{w}})$, which results in $\hat{y} \neq y$.
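Both implications are easy to check numerically (a quick sanity check of our own, on random softmax distributions):

```python
import torch

torch.manual_seed(0)
K = 10
probs = torch.softmax(torch.randn(10000, K), dim=1)  # random predictive distributions
y = torch.randint(K, (10000,))                       # arbitrary "true" labels
tcp = probs.gather(1, y.unsqueeze(1)).squeeze(1)
pred = probs.argmax(dim=1)

# TCP > 1/2 implies a correct prediction; TCP < 1/K implies a wrong one.
assert ((tcp > 0.5) <= (pred == y)).all()
assert ((tcp < 1.0 / K) <= (pred != y)).all()
```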
A.2 Empirical error and success distributions

In this section, we provide the plots, analogous to Figure 1 in the main paper, that show the distribution of the confidence measures over correct and incorrect predictions respectively, for each dataset and each model in our failure-prediction experiments. We also include the absolute numbers of incorrect and correct predictions grouped into 3 bins ('< 1/K', '[1/K, 1/2]' and '> 1/2') to validate our theoretical assumptions about TCP's properties. The plots are available for MNIST with an MLP in Fig. 7, for MNIST with a small convnet in Fig. 8, for SVHN with a small convnet in Fig. 9, for CIFAR-100 with VGG-16 in Fig. 10 and for CamVid with SegNet in Fig. 11.

APPENDIX B
EXPERIMENTS ON FAILURE PREDICTION
B.1 Implementation details
Datasets.
We run experiments on image datasets of varying scale and complexity: MNIST [58] and SVHN [59] provide relatively simple and small images of digits (10 classes; 28 × 28 pixels for MNIST, 32 × 32 for SVHN). MNIST is split into 60,000 training samples and 10,000 testing samples. CIFAR-10 and CIFAR-100 [60] bring more complexity to the classification of low-resolution images. In each dataset, we further keep 10% of the training samples as a validation dataset. We also report experiments for semantic segmentation on CamVid [67], using ConfidNet's training and architecture introduced in Section 4 of the main paper, with dense layers replaced by 1 × 1 convolutions with an adequate number of channels. CamVid is a standard road-scene dataset. Images are resized to 360 × 480 pixels and are segmented according to 11 classes such as 'road', 'building', 'car' or 'pedestrian'.
Classification network. For each dataset, we use standard neural network architectures as classifiers. For fair comparison, we re-implemented in PyTorch [66] the network architectures proposed in [24]. They range from small convolutional networks for MNIST and SVHN to VGG-16 architectures for the CIFAR datasets. We also added a multi-layer perceptron (MLP) with one hidden layer of size 100 for MNIST in order to investigate performance on small models. Finally, we implemented SegNet following [25]. All models are trained in a standard way with a cross-entropy loss and an SGD optimizer with a learning rate of $10^{-1}$, a momentum of 0.9 and a weight decay of $10^{-4}$. The number of training epochs depends on the dataset considered, varying from 100 epochs on MNIST to 250 epochs on CIFAR-100. As we also want to compute Monte Carlo samples following [19], we include dropout layers. The best models are selected on validation-set accuracy.
ConfidNet. For each of the considered classification models, ConfidNet is built upon the penultimate layer, which is a convolutional layer with non-linear activation, optionally followed by a normalization layer. We train ConfidNet for 100 epochs with the Adam optimizer, a learning rate of $1 \times 10^{-4}$, dropout, a weight decay of $10^{-4}$ and the same data augmentation as in the classifier's training. We select the best model based on the AUPR on the validation dataset. In the second training step, which involves encoder fine-tuning, training is completed over very few epochs starting from the previous best model, using the Adam optimizer with a learning rate of $1 \times 10^{-6}$ or $1 \times 10^{-7}$ and no dropout, to mitigate stochastic effects that may lead the new encoder to deviate too much from the original one used for classification. Once again, the best model is selected on validation-set AUPR.
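The two-step procedure described above can be summarized in a short PyTorch sketch: the confidence head is first trained on TCP targets with the encoder frozen, then a private copy of the encoder is unfrozen and fine-tuned at a much smaller learning rate, while the monitored classifier itself is never updated. This is a simplified sketch under our own assumptions; the names `encoder`, `classifier` and `confidnet_head`, and the attachment of the head directly to the encoder output, are stand-ins for the actual implementation.

```python
import copy
import torch
import torch.nn.functional as F

def tcp_targets(logits, labels):
    """TCP target: softmax probability of the true class of each sample."""
    probs = torch.softmax(logits, dim=1)
    return probs.gather(1, labels.unsqueeze(1)).squeeze(1)

def train_confidnet(encoder, classifier, confidnet_head, loader, fine_tune=False):
    """Step 1 (fine_tune=False): train the head on frozen features.
    Step 2 (fine_tune=True): also fine-tune a private copy of the encoder;
    the monitored classifier itself is never updated."""
    enc = copy.deepcopy(encoder) if fine_tune else encoder
    params = list(confidnet_head.parameters())
    if fine_tune:
        params += list(enc.parameters())
    lr = 1e-6 if fine_tune else 1e-4            # learning rates assumed from the text
    optimizer = torch.optim.Adam(params, lr=lr, weight_decay=1e-4)
    for images, labels in loader:
        with torch.no_grad():                   # TCP targets from the frozen classifier
            logits = classifier(encoder(images))
            target = tcp_targets(logits, labels)
        if fine_tune:
            feats = enc(images)                 # gradients flow into the encoder copy
        else:
            with torch.no_grad():
                feats = enc(images)             # encoder frozen in the first step
        confidence = confidnet_head(feats).squeeze(1)
        loss = F.mse_loss(confidence, target)   # regress the TCP criterion
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return enc, confidnet_head
```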
1. https://github.com/EN10/KerasMNIST
2. https://github.com/tohinz/SVHN-Classifier
3. https://github.com/geifmany/cifar-vgg
Fig. 7:
Distributions of MCP and TCP confidence estimates.
These distributions are computed over correct and erroneous predictions by a trained MLP on MNIST. (a) MCP, (b) TCP.
                  Nb. of Errors                    Nb. of Successes
Model    < 1/K   [1/K, 1/2]   > 1/2     < 1/K   [1/K, 1/2]   > 1/2     AUPR ↑    AUROC ↑
MCP      0       25           170       0       28           9777      37.70%    97.13%
TCP      81      114          0         0       28           9777      98.77%    99.98%

Fig. 8: Same as Figure 7 on MNIST using a small convnet architecture. (a) MCP, (b) TCP.
                  Nb. of Errors                    Nb. of Successes
Model    < 1/K   [1/K, 1/2]   > 1/2     < 1/K   [1/K, 1/2]   > 1/2     AUPR ↑    AUROC ↑
MCP      0       8            82        0       11           9899      35.05%    98.63%
TCP      32      58           0         0       11           9899      99.41%    99.41%
Other baseline details.
For Trust Score [24], we used the code provided by the authors. We added parallel processing when computing distances for each class to speed up inference. This parallelization alters neither the algorithm nor its performance. Specifically for semantic segmentation with CamVid, each image contains 172,800 pixels. Even though CamVid remains a small dataset (367 training images, 101 validation images, 233 test images) compared to other semantic-segmentation datasets, computational complexity forced us to drastically reduce the number of training neighbors and the number of test samples. We randomly sample in each training and test image a small percentage of pixels to compute a proxy.

4. https://github.com/google/TrustScore

For MC Dropout [19], we use the same model as the baseline (which already includes dropout layers) and we sample 100 times from the classification model at test time, keeping dropout layers activated. We then compute the average softmax probability over all samples to conduct Monte Carlo integration. Model uncertainty is estimated, following [19], by computing the entropy of the averaged probability vector across the class dimension.
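As a reference, the MC Dropout procedure just described amounts to a few lines of PyTorch: dropout is kept active at test time, the softmax outputs of the stochastic forward passes are averaged, and the entropy of that average is the uncertainty estimate. This is our own minimal sketch; in practice, normalization layers should be kept in evaluation mode.

```python
import torch

def mc_dropout_uncertainty(model, images, n_samples=100):
    """Predictive entropy under MC Dropout (higher = more uncertain)."""
    model.train()   # keep dropout active; batch-norm should ideally stay in eval mode
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(images), dim=1)
                             for _ in range(n_samples)]).mean(dim=0)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)
    return probs.argmax(dim=1), entropy
```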
Fig. 9: Same as Figure 7 on SVHN using a small convnet architecture. (a) MCP, (b) TCP.
                  Nb. of Errors                    Nb. of Successes
Model    < 1/K   [1/K, 1/2]   > 1/2     < 1/K   [1/K, 1/2]   > 1/2     AUPR ↑    AUROC ↑
MCP      0       329          857       0       206          24640     48.18%    93.20%
TCP      500     686          0         0       206          24640     98.93%    99.95%

Fig. 10: Same as Figure 7 on CIFAR-100 using a VGG-16 architecture. (a) MCP, (b) TCP.
                  Nb. of Errors                    Nb. of Successes
Model    < 1/K   [1/K, 1/2]   > 1/2     < 1/K   [1/K, 1/2]   > 1/2     AUPR ↑    AUROC ↑
MCP      0       603          2801      0       118          6478      71.99%    85.67%
TCP      2724    680          0         0       118          6478      99.91%    99.91%
B.2 Evaluation metrics
Failure prediction being the task of detecting samples that have been incorrectly classified by the main model (when their estimated confidence is below a chosen threshold δ), classic detection metrics can be used.

False-positive rate at 95% true-positive rate (FPR@95%TPR). This is the probability that an image predicted incorrectly by the model is wrongly seen as correctly classified by the error detector (a false positive) when the true-positive rate is as high as 95%. True- and false-positive rates are defined as TPR = TP / (TP + FN) and FPR = FP / (FP + TN), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives respectively.

Area under the receiver operating characteristic curve (AUROC). The receiver operating characteristic (ROC) curve is the graph of TPR as a function of FPR. This metric is threshold-independent. It can be interpreted as the probability that an incorrectly classified sample (a positive) receives a higher detection score than a correctly classified one.

Area under the precision-recall curve (AUPR). The precision-recall (PR) curve is the graph of the precision = TP / (TP + FP) as a function of the recall = TP / (TP + FN). In our experiments, classification errors are used as the positive detection class. As we specifically want to detect failures, AUPR is the primary metric to assess performance.
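These detection metrics can be computed directly from an array of confidence estimates and error indicators, for instance with scikit-learn as in the short sketch below (our own illustrative wrapper, not the authors' code). Note that errors form the positive class, so the negated confidence serves as the detection score.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def failure_detection_metrics(confidence, is_error):
    """AUROC, AUPR and FPR@95%TPR with errors as the positive class."""
    score = -np.asarray(confidence)            # low confidence = high error score
    is_error = np.asarray(is_error).astype(int)
    auroc = roc_auc_score(is_error, score)
    aupr = average_precision_score(is_error, score)
    fpr, tpr, _ = roc_curve(is_error, score)
    fpr_at_95tpr = fpr[np.searchsorted(tpr, 0.95)]  # first point reaching 95% TPR
    return auroc, aupr, fpr_at_95tpr
```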
Fig. 11: Same as Figure 7 on CamVid using a SegNet architecture. (a) MCP, (b) TCP.
                       Nb. of Errors                              Nb. of Successes
Model    < 1/K        [1/K, 1/2]   > 1/2         < 1/K   [1/K, 1/2]   > 1/2         AUPR ↑    AUROC ↑
MCP      0            401,573      55,506,172    0       188,128      34,166,526    48.53%    84.42%
TCP      54,184,874   1,722,871    0             0       188,128      34,166,526    99.92%    99.99%

TABLE 7: Accuracies for used classification models. Image classification accuracies on train, validation and test sets for all the models whose confidence is estimated in our experiments.

                      MNIST    MNIST          SVHN           CIFAR-10   CIFAR-100   CamVid
                      MLP      SmallConvNet   SmallConvNet   VGG-16     VGG-16      SegNet
Train accuracy        98.32%   98.94%         95.06%         98.69%     95.55%      96.69%
Validation accuracy   97.95%   99.03%         96.56%         99.80%     66.96%      91.72%
Test accuracy         98.05%   99.10%         95.44%         92.19%     65.96%      85.33%
TABLE 8: Impact of the encoder fine-tuning on the error-prediction performance of ConfidNet. Comparison in AUPR (the higher, the better) on all benchmarks. This table extends Table 3 in the main paper.

                        MNIST    MNIST          SVHN           CIFAR-10   CIFAR-100   CamVid
                        MLP      SmallConvNet   SmallConvNet   VGG-16     VGG-16      SegNet
Confidence training     57.34%   43.94%         50.43%         46.44%     72.68%      50.12%
+ Fine-tuning ConvNet   57.37%   45.89%         50.72%         49.94%     73.68%      50.51%
Area under the risk-coverage curve (AURC). In classification with a reject option, the risk-coverage curve is the graph of the empirical risk of the classifier under a given loss (usually the 0/1 loss) as a function of the empirical coverage, which is the proportion of non-rejected samples. Like AUROC and AUPR, this metric is threshold-independent.

Excess-AURC (E-AURC). This is a normalized AURC metric defined in [61]. It takes into account the optimal ranking given the error rate of the classifier. More specifically, if we denote $\kappa_f^*$ the perfect confidence-rate function and $\hat{r}$ the risk of classifier $f$, it writes as:

$$\text{E-AURC}(\kappa_f) = \text{AURC}(\kappa_f) - \text{AURC}(\kappa_f^*) \qquad (20)$$
$$\approx \text{AURC}(\kappa_f) - \left(\hat{r} + (1 - \hat{r}) \log(1 - \hat{r})\right). \qquad (21)$$

A computational sketch of both quantities is given at the end of Section B.3 below.

B.3 Classification accuracies of model F

Most neural networks used in our experiments tend to overfit. On small datasets such as MNIST and SVHN, convolutional neural networks already achieve nearly perfect accuracy on the test set, above 96%, which leaves very few errors available. We provide in Table 7 the accuracies on the training, validation and test sets of the classification model F whose confidence must be predicted.
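Following Equations (20) and (21), the sketch below computes AURC by averaging the empirical 0/1 risk over the coverage levels obtained when samples are rejected in increasing order of confidence, and subtracts the approximate optimal value to obtain E-AURC. It is an illustrative implementation under our own assumptions, not the authors' evaluation code.

```python
import numpy as np

def aurc_and_eaurc(confidence, is_error):
    """AURC and E-AURC for a confidence-rate function (0/1 loss, risk < 1)."""
    order = np.argsort(-np.asarray(confidence))     # most confident kept first
    errors = np.asarray(is_error, dtype=float)[order]
    # Risk at coverage k/n: error rate among the k most confident samples.
    risks = np.cumsum(errors) / np.arange(1, len(errors) + 1)
    aurc = risks.mean()                             # average over all coverage levels
    r = errors.mean()                               # overall risk of the classifier
    optimal_aurc = r + (1 - r) * np.log(1 - r)      # approximation from Eq. (21)
    return aurc, aurc - optimal_aurc                # (AURC, E-AURC)
```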
B.4 Effect of ConfidNet's architecture

We experimented with different architectures for ConfidNet on the SVHN dataset, varying the number of layers. Except for the first and last layers, whose dimensions depend respectively on the size of the input and of the output, each layer has the same number of units (400). In Figure 12, we observe that, starting from 3 layers, ConfidNet already improves on the baseline performance.
TABLE 9: Effect of the loss and of the confidence criterion on the error-detection performance of ConfidNet. Comparison between the proposed MSE and three other alternatives, all based on TCP as confidence criterion. Using MSE with the normalized criterion TCP^r is also reported. This table extends Table 5 in the main paper.

Dataset / Model        Loss           FPR@95%TPR ↓   AUPR ↑    AUROC ↑
SVHN / SmallConvNet    MSE            …              …         …
                       BCE            29.34%         50.00%    92.76%
                       Focal          28.67%         49.96%    93.01%
                       Ranking        31.04%         48.11%    92.90%
                       MSE w/ TCP^r   …              …         …
CIFAR-10 / VGG-16      MSE            …              …         …
                       MSE w/ TCP^r   …              …         …
CamVid / SegNet        MSE            61.52%         50.51%    85.02%
                       BCE            61.68%         48.96%    83.41%
                       Focal          61.64%         49.05%    84.09%
                       MSE w/ TCP^r   …              …         …
TABLE 10: Comparative calibration results. Performance in ECE (the lower, the better) when using the MCP baseline ('Baseline') or ConfidNet as confidence estimator on the six benchmarks, and when using dedicated temperature scaling ('T. Scaling').

              MNIST    MNIST          SVHN           CIFAR-10   CIFAR-100   CamVid
              MLP      SmallConvNet   SmallConvNet   VGG-16     VGG-16      SegNet
Baseline      0.37%    …              …              …          …           …
ConfidNet     …        …              …              …          …           …
T. Scaling    …        …              …              …          …           …
Fig. 12: Influence of ConfidNet's depth on its performance. Performance in AUPR as a function of the number of layers used in ConfidNet, on the SVHN test set, compared to the performance of the MCP and Trust Score baselines.
B.5 Effect of learning variants
In Table 8, we detail the effect of fine-tuning on all classification and segmentation datasets for the failure-prediction experiments. The influence of the loss (MSE, BCE, Focal or TCP-based Ranking) is analyzed for SVHN, CIFAR-10 and CamVid in Table 9. We also tested a normalized variant of the TCP confidence criterion, which consists in the ratio between TCP and MCP:

$$\mathrm{TCP}^r_F(\mathbf{x}, y) = \frac{F(\mathbf{x}; \hat{w})[y]}{\max_{k \in \mathcal{Y}} F(\mathbf{x}; \hat{w})[k]}. \qquad (22)$$

This criterion presents stronger theoretical guarantees than TCP, since correct predictions are, by design, assigned a confidence of 1, whereas the confidence of errors lies in [0, 1). On the other hand, learning this criterion may be more challenging since all correct predictions share a unique target confidence value. We note in Table 9 that its performance is lower than that of TCP on small datasets such as CIFAR-10, where few errors are present, but higher on larger datasets such as CamVid, where each pixel is a sample. This emphasizes once again the difficulty of incorrect/correct classification training.
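As a small illustration of Equation (22) (our own toy example, not from the paper): the normalized criterion equals 1 exactly when the true class coincides with the predicted one, and is strictly below 1 otherwise.

```python
import numpy as np

probs = np.array([0.05, 0.70, 0.25])   # softmax output F(x; w) over 3 classes
for y in range(3):                      # try each possible true label
    tcp_r = probs[y] / probs.max()      # Eq. (22): TCP normalized by MCP
    print(f"true class {y}: TCP^r = {tcp_r:.3f}")
# The predicted class (1) gets TCP^r = 1.0; the two error cases get values in [0, 1).
```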
B.6 Effect on calibration

We observed that ConfidNet tends to lower the confidence of an example that the model wrongly classified while being over-confident (high MCP). As a side experiment, we studied whether using ConfidNet as the confidence estimate can improve the calibration of deep neural networks. In Table 10, we report the expected calibration error (ECE), an approximate measure of the miscalibration between confidence and accuracy [35]. ConfidNet yields equivalent or better ECE results than the MCP baseline, with a clear superiority on complex datasets such as CIFAR-10, CIFAR-100 and CamVid. On MNIST and SVHN, the baseline already offers a small ECE. These results confirm our intuition about the capacity of ConfidNet to address over-confident predictions, even though it has not been designed for that purpose. Nevertheless, dedicated methods such as the temperature scaling used in [35] remain preferable for calibrating deep neural networks.
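For completeness, here is a minimal sketch of the standard binned ECE estimator from [35] that Table 10 reports; the number of bins and the array names are our own choices.

```python
import numpy as np

def expected_calibration_error(confidence, is_correct, n_bins=15):
    """Binned ECE: weighted mean |accuracy - confidence| over equal-width bins."""
    confidence = np.asarray(confidence)
    is_correct = np.asarray(is_correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(is_correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap        # weight by the fraction of samples in the bin
    return ece
```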
B.7 Qualitative assessment
We provide in Figure 13 an illustration of confidence-based failure prediction on CamVid. Compared to the MCP baseline, our approach produces higher confidence scores for pixels correctly classified by DeepLabv2 and lower ones for incorrectly classified pixels. This allows one to better identify error areas in semantic segmentation maps, as we leverage in ConDA.
Fig. 13: Confidence estimation in semantic segmentation. On this example from CamVid (a), the confidence at each pixel of the class prediction delivered by DeepLabv2 (c) can be estimated classically with MCP (f), or with the proposed auxiliary model (e). The second approach appears better aligned with the actual errors of the semantic segmentation (shown in white in (d)). Indeed, ConfidNet (50.51% AUPR) predicts these errors better than MCP (48.53%).
Fig. 14: Risk-coverage curves of selective image classification. On the five benchmarks, the RC curves are provided for the MCP baseline, MC Dropout, Trust Score and ConfidNet. For a given coverage (fraction of non-rejected samples), the selective risk indicates the percentage of erroneous predictions in the remaining test set. (a) MNIST (MLP), (b) MNIST (SmallConvNet), (c) SVHN, (d) CIFAR-10, (e) CIFAR-100.
B.8 Risk-coverage curves
In relation with the selective-classification experiments, we provide the risk-coverage curves for the five classification benchmarks: MLP on MNIST (Fig. 14a), SmallConvNet on MNIST (Fig. 14b), SmallConvNet on SVHN (Fig. 14c), and VGG-16 on CIFAR-10 and CIFAR-100 (Figs. 14d and 14e).

APPENDIX C
EXPERIMENTS WITH SELF-TRAINING FOR UNSUPERVISED DOMAIN ADAPTATION
C.1 Experimental details
The segmentation network is a DeepLabv2 [57] architecture, optimized by stochastic gradient descent with a learning rate of $2.5 \times 10^{-4}$, a momentum of 0.9 and a weight decay of $10^{-4}$. As in recent state-of-the-art methods [21], [49], [50], we adopt an adversarial learning approach to align source and target output distributions. The discriminator in the segmentation network is optimized by Adam [68] with a learning rate of $10^{-4}$. The hyperparameters λ_adv and λ_ST are fixed at $10^{-3}$ and 1, respectively. In Table 5 of the main paper, the CBST results [53] are drawn from the authors' most recent paper [54], where CBST serves as a baseline.

Self-training procedure. Following the implementation in BDL [21], we carried out two self-training iterations for AdvEnt in the GTA5 → Cityscapes experiments. For the other experiments in Section C.3 below, only one self-training iteration is used. The experiments start by pre-training the AdvEnt model, using the published code, on source-domain images translated into the target domain. Those translated images are pre-computed using a CycleGAN [69], as provided by [21].
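To make the selection step concrete, here is a simplified per-image sketch of confidence-based pseudo-labeling as used in such self-training rounds: pixels whose confidence (MCP, or the auxiliary model's output for ConDA) falls below a coverage-derived threshold receive an ignore label and are excluded from the next training round. The quantile-based threshold and the ignore index 255 are our own simplifications, not the exact procedure of the paper.

```python
import numpy as np

IGNORE_LABEL = 255  # convention for pixels excluded from the segmentation loss

def select_pseudo_labels(pred_labels, confidence, coverage=0.5):
    """Keep the `coverage` fraction of most confident pixels as pseudo-labels."""
    threshold = np.quantile(confidence, 1.0 - coverage)
    pseudo = pred_labels.copy()
    pseudo[confidence < threshold] = IGNORE_LABEL   # reject low-confidence pixels
    return pseudo

# Toy usage on one 4x4 "image" with 19 Cityscapes-style classes.
rng = np.random.default_rng(0)
pred = rng.integers(0, 19, size=(4, 4))             # predicted classes
conf = rng.random((4, 4))                           # MCP or ConDA confidence map
print(select_pseudo_labels(pred, conf, coverage=0.5))
```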
5. https://github.com/liyunsheng13/BDL
TABLE 11: Ablation study on semantic segmentation with pseudo-labeling-based adaptation. This table extends Table 6 in the main paper by providing IoUs for the 19 classes of the GTA5 → Cityscapes UDA benchmark, using the AdaptSegNet model.

GTA5 → Cityscapes
Method                  Multi-Scale   Adv. Training   road   sidewalk   building   wall   fence   pole   light   sign   veg    terrain   sky    person   rider   car    truck   bus    train   mbike   bike   mIoU
ConfidNet                                             88.9   44.4       84.6       34.3   26.2    32.8   39.0    32.5   84.9   33.6      82.8   59.8     31.7    80.9   30.3    45.9   0.0     30.9    40.8   47.6
Multi-Scale ConfidNet   ✓                             …
ConDA                   ✓             ✓               …
TABLE 12: Additional comparative performance on semantic segmentation with synth-to-real unsupervised domain adaptation. Results in per-class IoU and aggregated mIoU on SYNTHIA → Cityscapes ('mIoU*' is the 13-class setup, excluding the classes 'wall', 'fence' and 'pole', as used in earlier works). All methods are based on a DeepLabv2 backbone.

SYNTHIA → Cityscapes
Method             Self-Train.   road   sidewalk   building   wall   fence   pole   light   sign   veg    sky    person   rider   car    bus    mbike   bike   mIoU   mIoU*
AdaptSegNet [49]                 84.3   42.7       77.5       -      -       -      4.7     7.0    77.9   82.5   54.3     21.0    72.3   32.2   18.9    32.3   -      46.7
DISE [62]                        …
ConDA              ✓             …

C.2 Detailed ablation study
We complement the ablation results in Table 6 of the main paper by providing class-wise IoUs in Table 11.
C.3 Additional results
We extend our experiments by using another synthetic source-domain dataset, SYNTHIA [70]. More specifically, we use the SYNTHIA-RAND-CITYSCAPES split, composed of 9,400 color images of dimension 1280 × 760, generated in a simulator and annotated for semantic segmentation with 16 classes in common with Cityscapes [64]. In the experiments with this dataset, we do not use translated source images.

SYNTHIA → Cityscapes. Similar to Table 4 in the main paper, we report in Table 12 the comparative results on SYNTHIA → Cityscapes. Following the literature on this dataset, the mIoU metric is computed over 16 categories as well as over 13 categories (ignoring 'wall', 'fence' and 'pole'). Again, ConDA achieves strong performance among methods with a DeepLabv2 backbone on this benchmark.
SYNTHIA → Mapillary. Along with the previous results on Cityscapes, we further study domain adaptation with another target dataset. Mapillary Vistas [65] is a dataset of street-level images, split into a training set, a validation set and a test set. The ground-truth semantic maps are missing from the test set. For domain adaptation, we use the 18,000 images of the training set as target data and the 2,000 images of the validation set for testing. We consider 7 'super classes' that include the 16 classes used in the Cityscapes [64] experiments with SYNTHIA [70]. Table 13 presents semantic-segmentation results using SYNTHIA as the source dataset. This benchmark has also been used in other recent works, such as AdvEnt [50] and DADA [71]. ConDA outperforms the baseline methods, improving over the AdvEnt mIoU. We also tested the proposed confidence-based self-training approach on DADA [71], another domain-adaptation baseline, which uses the depth information available on source-domain synthetic scenes as privileged information during segmentation training. Again, the proposed method (ConDA*) provides a boost in performance over DADA*.

TABLE 13: Combining ConDA with the AdvEnt and DADA approaches to domain adaptation in semantic segmentation. Performance in IoU and mIoU on SYNTHIA → Mapillary. 'DADA*' and 'ConDA*' are trained using depth as privileged information.

SYNTHIA → Mapillary
Method        Self-Train.   flat   constr.   object   nature   sky    human   vehicle   mIoU
AdvEnt [50]                 86.9   58.8      …
ConDA         ✓             …
DADA* [71]                  86.7   62.1      34.9     75.9     88.6   51.1    73.8      67.6
ConDA*        ✓             …

C.4 Additional model analysis
In this section, we provide additional plots analogous to those in Figure 7 of the main paper. These graphs show the precision w.r.t. the coverage of extracted pseudo-labels, using MCP and ConDA respectively, in the following set-ups:
• AdvEnt on SYNTHIA → Cityscapes (Figure 15);
• AdvEnt on SYNTHIA → Mapillary Vistas (Figure 16);
• DADA on SYNTHIA → Mapillary Vistas (Figure 17).
Fig. 15: Comparative quality of selected pseudo-labels. Proportion of correct pseudo-labels (precision) for different coverages on SYNTHIA → Cityscapes for AdvEnt.

Fig. 16: Same as Fig. 15 for AdvEnt on SYNTHIA → Mapillary.

Fig. 17: Same as Fig. 15 for DADA on SYNTHIA → Mapillary.

In all cases, we observe that ConDA collects fewer erroneous pseudo-labels than MCP.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[2] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NIPS, 2015.
[3] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in ECCV, 2016.
[4] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in CVPR, 2016.
[5] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," CoRR, 2013.
[6] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in INTERSPEECH, 2010.
[7] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., 2012.
[8] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, "Deep speech: Scaling up end-to-end speech recognition," 2014.
[9] D. Amodei, C. Olah, J. Steinhardt, P. F. Christiano, J. Schulman, and D. Mané, "Concrete problems in AI safety," CoRR, 2016.
[10] J. Janai, F. Güney, A. Behl, and A. Geiger, "Computer vision for autonomous vehicles: Problems, datasets and state-of-the-art," Foundations and Trends in Computer Graphics and Vision, 2017.
[11] J. Nam, S. Park, E. J. Hwang, J. Lee, K.-N. Jin, K. Lim, T. Vu, J. Sohn, S. Hwang, J. M. Goo, and C. M. Park, "Development and validation of deep learning-based automatic detection algorithm for malignant pulmonary nodules on chest radiographs," Radiology, 2018.
[12] O. Linda, T. Vollmer, and M. Manic, "Neural network based intrusion detection system for critical infrastructures," in IJCNN, 2009.
[13] C. Chow, "An optimum character recognition system using decision functions," IRE Trans. Electron. Comput., 1957.
[14] P. L. Bartlett and M. H. Wegkamp, "Classification with a reject option using a hinge loss," JMLR, 2008.
[15] C. Cortes, G. DeSalvo, and M. Mohri, "Boosting with abstention," in NIPS, 2016.
[16] R. El-Yaniv and Y. Wiener, "On the foundations of noise-free selective classification," JMLR, 2010.
[17] Y. Geifman and R. El-Yaniv, "Selective classification for deep neural networks," in NIPS, 2017.
[18] Y. Gal, R. Islam, and Z. Ghahramani, "Deep Bayesian active learning with image data," in Proceedings of Machine Learning Research, 2017.
[19] Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," in ICML, 2016.
[20] D.-H. Lee, "Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks," ICML (Workshop), 2013.
[21] Y. Li, L. Yuan, and N. Vasconcelos, "Bidirectional learning for domain adaptation of semantic segmentation," in CVPR, 2019.
[22] S. Hecker, D. Dai, and L. V. Gool, "Failure prediction for autonomous driving," in IV, 2018.
[23] D. Hendrycks and K. Gimpel, "A baseline for detecting misclassified and out-of-distribution examples in neural networks," ICLR, 2017.
[24] H. Jiang, B. Kim, M. Guan, and M. Gupta, "To trust or not to trust a classifier," in NIPS, 2018.
[25] A. Kendall, V. Badrinarayanan, and R. Cipolla, "Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding," arXiv preprint arXiv:1511.02680, 2015.
[26] A. Ragni, Q. Li, M. J. F. Gales, and Y. Wang, "Confidence estimation and deletion prediction using bidirectional recurrent neural networks," in SLT Workshop, 2018.
[27] Q. Li, P. Ness, A. Ragni, and M. Gales, "Bi-directional lattice recurrent neural networks for confidence estimation," in ICASSP, 2018.
[28] D. Yu, J. Li, and L. Deng, "Calibration of confidence measures in speech recognition," IEEE Trans. Audio Speech Lang. Process., 2011.
[29] J. Blatz, E. Fitzgerald, G. Foster, S. Gandrabur, C. Goutte, A. Kulesza, A. Sanchis, and N. Ueffing, "Confidence estimation for machine translation," in COLING, 2004.
[30] C. Corbière, N. Thome, A. Bar-Hen, M. Cord, and P. Pérez, "Addressing failure prediction by learning model confidence," in NeurIPS, 2019.
[31] C. Cortes, G. DeSalvo, and M. Mohri, "Learning with rejection," in ALT, 2016.
[32] H. Zaragoza and F. d'Alché-Buc, "Confidence measures for neural network classifiers," in IPMU, 1998.
[33] Y. Geifman and R. El-Yaniv, "SelectiveNet: A deep neural network with an integrated reject option," in ICML, 2019.
[34] A. M. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in CVPR, 2015.
[35] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On calibration of modern neural networks," in ICML, 2017.
[36] L. Neumann, A. Zisserman, and A. Vedaldi, "Relaxed softmax: Efficient confidence auto-calibration for safe pedestrian detection," in NIPS (Workshop), 2018.
[37] I. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in ICLR, 2015.
[38] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," in ICLR, 2014.
[39] S. Liang, Y. Li, and R. Srikant, "Enhancing the reliability of out-of-distribution image detection in neural networks," in ICLR, 2018.
[40] B. Lakshminarayanan, A. Pritzel, and C. Blundell, "Simple and scalable predictive uncertainty estimation using deep ensembles," in NIPS, 2017.
[41] R. M. Neal, Bayesian Learning for Neural Networks. Springer, 1996.
[42] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, "Weight uncertainty in neural networks," in ICML, 2015.
[43] W. J. Maddox, P. Izmailov, T. Garipov, D. P. Vetrov, and A. G. Wilson, "A simple baseline for Bayesian uncertainty in deep learning," in NeurIPS, 2019.
[44] J. M. Hernandez-Lobato and R. Adams, "Probabilistic backpropagation for scalable learning of Bayesian neural networks," in ICML, 2015.
[45] L. P. Cordella, C. De Stefano, F. Tortorella, and M. Vento, "A method for improving classification reliability of multilayer perceptrons," IEEE Transactions on Neural Networks, 1995.
[46] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is "nearest neighbor" meaningful?" in ICDT, 1999.
[47] T. DeVries and G. W. Taylor, "Learning confidence for out-of-distribution detection in neural networks," arXiv preprint arXiv:1802.04865, 2018.
[48] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell, "CyCADA: Cycle-consistent adversarial domain adaptation," in ICML, 2018.
[49] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker, "Learning to adapt structured output space for semantic segmentation," in CVPR, 2018.
[50] T.-H. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez, "ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation," in CVPR, 2019.
[51] Y. Grandvalet and Y. Bengio, "Semi-supervised learning by entropy minimization," in NIPS, 2005.
[52] A. Saporta, T.-H. Vu, M. Cord, and P. Pérez, "ESL: Entropy-guided self-supervised learning for domain adaptation in semantic segmentation," 2020.
[53] Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang, "Unsupervised domain adaptation for semantic segmentation via class-balanced self-training," in ECCV, 2018.
[54] Y. Zou, Z. Yu, X. Liu, B. V. Kumar, and J. Wang, "Confidence regularized self-training," in ICCV, 2019.
[55] P. Mohapatra, M. Rolínek, C. Jawahar, V. Kolmogorov, and M. Pawan Kumar, "Efficient optimization for rank-based loss functions," in CVPR, 2018.
[56] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in ICCV, 2017.
[57] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," CoRR, 2016.
[58] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," in Proceedings of the IEEE, 1998.
[59] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," in NIPS (Workshop), 2011.
[60] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Master's thesis, Department of Computer Science, University of Toronto, 2009.
[61] Y. Geifman, G. Uziel, and R. El-Yaniv, "Bias-reduced uncertainty estimation for deep neural classifiers," in ICLR, 2019.
[62] W.-L. Chang, H.-P. Wang, W.-H. Peng, and W.-C. Chiu, "All about structure: Adapting structural information across domains for boosting semantic segmentation," in CVPR, 2019.
[63] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, "Playing for data: Ground truth from computer games," in ECCV, 2016.
[64] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in CVPR, 2016.
[65] G. Neuhold, T. Ollmann, S. Rota Bulò, and P. Kontschieder, "The Mapillary Vistas dataset for semantic understanding of street scenes," in ICCV, 2017.
[66] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in NIPS (Workshop), 2017.
[67] G. J. Brostow, J. Fauqueur, and R. Cipolla, "Semantic object classes in video: A high-definition ground truth database," Pattern Recogn. Lett., 2009.
[68] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[69] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in ICCV, 2017.
[70] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, "The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes," in CVPR, 2016.
[71] T.-H. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez, "DADA: Depth-aware domain adaptation in semantic segmentation," in ICCV, 2019.
Charles Corbière is a Ph.D. student in Deep Learning and Computer Vision for Autonomous Driving at the Conservatoire National des Arts et Métiers (CNAM, France) and the valeo.ai research lab (France). He received an M.Sc. degree in Applied Mathematics and Statistics from Université Paris-Saclay (France) in 2017 and an M.Eng. degree in Computer Science from École Centrale de Lille (France) in 2016. His research interests include uncertainty in deep learning and certified robustness.
Nicolas Thome is a full professor at the Conservatoire National des Arts et Métiers (Cnam Paris). His research interests include machine learning and deep learning for understanding low-level signals, e.g., vision, time series and acoustics. He also explores solutions for combining low-level data with higher-level data for multi-modal data processing. His current application domains are mainly healthcare, autonomous driving and physics. He is involved in several French, European and international collaborative research projects on artificial intelligence and deep learning.
Antoine Saporta is a Ph.D. student in Deep Learning and Computer Vision for Autonomous Driving in the Machine Learning and Deep Learning for Information Access (MLIA) team of LIP6, Sorbonne Université (France), and at the Valeo.ai research lab. He is a graduate of École Polytechnique (France) and received a Master's degree in Computer Science from Technische Universität München (Germany) in 2019. His research interests include domain adaptation and semantic segmentation.
Tuan-Hung Vu is a research scientist at Valeo.ai. He received a Ph.D. degree in Computer Science from École Normale Supérieure in 2018. His research interests include deep learning, object recognition, domain adaptation and, more recently, data augmentation. Tuan-Hung has published at, and regularly serves as a reviewer for, computer vision conferences and journals such as CVPR, ICCV, ECCV and IJCV.
Matthieu Cord is a full professor at Sorbonne University. He is also a part-time principal scientist at Valeo.ai. His research expertise includes computer vision, machine learning and artificial intelligence. He is the author of more than 150 publications on image classification, segmentation, deep learning, and multimodal vision-and-language understanding. He is an honorary member of the Institut Universitaire de France and served from 2015 to 2018 as an AI expert at CNRS and the ANR (French National Research Agency).