Confidence Estimation via Auxiliary Models
Charles Corbière, Nicolas Thome, Antoine Saporta, Tuan-Hung Vu, Matthieu Cord, Patrick Pérez
Abstract—Reliably quantifying the confidence of deep neural classifiers is a challenging yet fundamental requirement for deploying such models in safety-critical applications. In this paper, we introduce a novel target criterion for model confidence, namely the true class probability (TCP). We show that TCP offers better properties for confidence estimation than the standard maximum class probability (MCP). Since the true class is by essence unknown at test time, we propose to learn the TCP criterion from data with an auxiliary model, introducing a specific learning scheme adapted to this context. We evaluate our approach on the tasks of failure prediction and of self-training with pseudo-labels for domain adaptation, which both necessitate effective confidence estimates. Extensive experiments are conducted for validating the relevance of the proposed approach in each task. We study various network architectures and experiment with small and large datasets for image classification and semantic segmentation. In every tested benchmark, our approach outperforms strong baselines.
Index Terms—Confidence Estimation, Uncertainty, Deep Neural Networks, Classification with Reject Option, Misclassification Detection, Failure Prediction, Self-Training, Pseudo-Labeling, Unsupervised Domain Adaptation, Semantic Image Segmentation
1 INTRODUCTION

Last decade's research in deep learning led to tremendous boosts in predictive performance for various tasks including image classification [1], object recognition [2], [3], [4], natural language processing [5], [6] and speech recognition [7], [8]. However, safety remains a great concern when it comes to deploying these models in real-world conditions [9], [10]. Failing to detect possible errors or overestimating the confidence of a prediction may carry serious repercussions in critical visual-recognition applications such as autonomous driving, medical diagnosis [11] or nuclear power plant monitoring [12].

Classification with a reject option [13], [14], [15], also known as selective classification [16], [17], consists in a scenario where the classifier is given the option to reject an instance instead of predicting its label. Equipped with a reject option, a classifier could decide to stick to the prediction or, on the contrary, to hand over to a human or a back-up system with, e.g., other sensors, or simply to trigger an alarm. One common approach for tackling the problem is to discriminate with a confidence-based criterion: for an instance $\mathbf{x}$, along with a prediction $f(\mathbf{x})$, a scalar value $g(\mathbf{x})$ that quantifies the confidence of the classifier in its prediction is also provided.

Correctly identifying uncertain predictions thanks to low confidence values $g(\mathbf{x})$ could be beneficial for classification improvements in active learning [18] or for efficient exploration in reinforcement learning [19]. On a related matter, one would expect the confidence criterion to correlate successful predictions with high values. Some paradigms, such as self-training with pseudo-labeling [20], [21], consist in picking and labeling the most confident samples before retraining the network accordingly. The performance improves by selecting successful predictions thanks to an accurate confidence criterion. A final perspective, linked to failure prediction [22], [23], [24], is the capacity of models to provide a ranking which enables to distinguish correct from erroneous predictions. In each of the previous tasks, obtaining reliable estimates of the predictive confidence is then of prime importance.

Confidence estimation has been explored in a wide variety of applications, including computer vision [23], [25], speech recognition [26], [27], [28], reinforcement learning [19] and machine translation [29]. A widely used baseline with neural-network classifiers is to take the value of the predicted class's probability, namely the maximum class probability (MCP), given by the softmax layer output. Although recent evaluations of MCP with modern deep models reveal reasonable performance [23], it still suffers from several conceptual drawbacks. In particular, MCP leads by design to high confidence values, even for erroneous predictions, since the largest softmax output is used. This design tends to make erroneous and correct predictions overlap in terms of confidence and thus limits the capacity to distinguish them.

In this work, we identify a better confidence criterion, the true class probability (TCP), for deep neural network classifiers with a reject option. For a sample $\mathbf{x}$, TCP corresponds to the probability of the model with respect to the true class $y$ of that sample, which naturally reflects a better-behaved model's confidence.
We provide theoretical guarantees of the quality of this criterion regarding confidence estimation. Since the true class is obviously unknown at test time, we propose a novel approach which consists in designing an auxiliary network specifically dedicated to estimating the confidence of a prediction. Given a trained classifier $f$, this auxiliary network learns the TCP criterion from data. At inference, we use its scalar output as the confidence estimate $g(\mathbf{x})$ associated to the prediction. When applied to failure prediction, we observe significant improvements over strong baselines. Our approach is also adequate for self-training strategies in unsupervised domain adaptation. To meet the challenge of this task in semantic segmentation, we propose an enhanced architecture with structured output and adopt an adversarial learning scheme which enforces alignment between confidence maps in source and target domains. A thorough analysis of our approach, including relevant variations, ablation studies and qualitative evaluations of confidence estimates, helps to gain insight about its behavior.
In summary, our contributions are as follows:
• We define a novel confidence criterion, the true class probability, which exhibits an adequate behavior for confidence estimation;
• We propose to design an auxiliary neural network, coined ConfidNet, which aims to learn this confidence criterion from data;
• We apply this approach to the task of failure prediction and to self-training in unsupervised domain adaptation with adequate choices of architecture, loss function and learning scheme;
• We extensively experiment across various benchmarks and backbone networks to validate the relevance of our approach on both tasks.

The paper is organized as follows. In Section 2, we provide an overview of the most relevant related works on confidence estimation, failure prediction, self-training and unsupervised domain adaptation. Section 3 exposes our approach for confidence estimation based on learning an adequate criterion via an auxiliary network. We also describe how it relates to classification with a reject option. In Section 4, we adapt our approach to failure prediction by introducing an architecture, a loss function and a learning scheme for this task. Similarly, Section 5 details the instantiation of our approach for confidence-based self-training in unsupervised domain adaptation (DA), which we denote as ConDA. In particular, we present two additions, an adversarial loss and a multi-scale confidence architecture, which help further improve the performance for this task. Finally, we report experimental studies in Section 6. This paper extends a previous conference publication [30] by introducing: (1) a comprehensive adaptation of the approach to improve the key step of self-training from pseudo-labels in semantic segmentation with DA; (2) an exploration of the classification-with-rejection framework, which strengthens the rationale of the proposed approach.
2 RELATED WORK
Confidence estimation in machine learning has been around for many decades, firstly linked to the idea of classification with a reject option [13]. Following works [14], [15], [31], [32] explored alternative rejection criteria. In particular, [31] proposes to jointly learn the classifier and the selection function. El-Yaniv [16] provides an analysis of the risk-coverage trade-off that occurs when classifying with a reject option. More recently, [17], [33] extend the approach to deep neural networks, considering various confidence measures.

Since the wide adoption of deep learning methods, confidence estimation has raised even more interest as recent works reveal that modern neural networks tend to be overconfident [34], non-calibrated [35], [36], sensitive to adversarial attacks [37], [38] and inadequate to distinguish in- from out-of-distribution examples [23], [39], [40].

Bayesian neural networks [41] offer a principled approach for confidence estimation by adopting a Bayesian formalism which models the weight posterior distribution. As the true posterior cannot be evaluated analytically in complex models, various approximations have been developed, such as variational inference [19], [42], [43] or expectation propagation [44]. In particular, MC Dropout [19] has raised a lot of interest due to the simplicity of its implementation. Predictions are obtained by averaging softmax vectors from multiple feed-forward passes through the network with dropout layers. When applied to regression, the predictive distribution uncertainty can be summarized by computing statistics, e.g., variance. However, when using MC Dropout for uncertainty estimation in classification tasks, the predictive distribution is averaged to a point-wise softmax estimate before computing standard uncertainty criteria, e.g., entropy or variants such as mutual information. It is worth mentioning that these entropy-based criteria measure the softmax output dispersion, where the uniform distribution has maximum entropy. It is not clear how well these dispersion measures are adapted to distinguishing failures from correct predictions, especially with deep neural networks which output overconfident predictions [35]: for example, it might be very challenging to discriminate a peaky prediction corresponding to a correct prediction from an incorrect overconfident one. Lakshminarayanan et al. [40] propose an alternative to Bayesian neural networks by leveraging an ensemble of neural networks to produce well-calibrated uncertainty estimates. However, it requires training multiple classifiers, which has a considerable computing cost in training and inference time.
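For concreteness, the following PyTorch sketch (our own, not from the paper; the number of passes and the assumption that the model outputs raw logits are placeholders) illustrates the MC Dropout confidence measure described above: dropout is kept active at test time and the entropy of the averaged softmax distribution serves as the uncertainty estimate.

```python
import torch
import torch.nn.functional as F

def mc_dropout_entropy(model, x, n_samples=20):
    """Entropy-based MC Dropout uncertainty (higher entropy = lower confidence).

    Averages the softmax distribution over `n_samples` stochastic forward
    passes with dropout kept active at inference time.
    """
    model.train()  # keeps dropout active; in practice only dropout layers
                   # should be switched to train mode (not, e.g., batch norm)
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(model(x), dim=1) for _ in range(n_samples)]
        ).mean(dim=0)  # (batch, K) averaged predictive distribution
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return probs, entropy
```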
In the context of classification, a widely used baseline for failure prediction is to take the value of the predicted class's probability given by the softmax layer output, namely the maximum class probability (MCP), suggested by [45] and revised by [23]. As stated before, MCP presents several limits regarding both failure prediction and out-of-distribution detection, as it outputs unduly high confidence values. Blatz et al. [29] introduce a method for confidence estimation in machine translation by solving a binary classification between correct and erroneous predictions. More recently, Jiang et al. [24] proposed a new confidence measure, 'Trust Score', which measures the agreement between the classifier and a modified nearest-neighbor classifier on the test examples. More precisely, the confidence criterion used in Trust Score is the ratio between the distance from the sample to the nearest class different from the predicted class and the distance to the predicted class. One clear drawback of this approach is its lack of scalability, since computing nearest neighbors in large datasets is extremely costly in both computation and memory. Another more fundamental limitation related to the Trust Score itself is that local distance computation becomes less meaningful in high-dimensional spaces [46], which is likely to negatively affect the performance of this method, as shown in Section 6.1.

In tasks closely related to failure prediction, Guo et al. [35], for confidence calibration, and Liang et al. [39], for out-of-distribution detection, proposed to use temperature scaling to mitigate confidence values. However, this does not affect the ranking of the confidence score and therefore the separability between errors and correct predictions. DeVries et al. [47] share with us the same purpose of learning confidence in neural networks. Their work differs from ours by focusing on out-of-distribution detection and learning jointly a distribution confidence score and classification probabilities. In addition, their criterion is based on an interpolation between output probabilities and target distribution whereas we specifically define a criterion suited to failure prediction.

Fig. 1: Distributions of different confidence measures over correct and erroneous predictions of a given model. (a) Maximum Class Probability. (b) True Class Probability. When ranking the test predictions of a convolutional model trained on CIFAR-10 according to MCP (a), we observe that correct ones (in green) and misclassifications (in red) overlap considerably, making it difficult to distinguish them. On the other hand, ranking samples according to TCP (b) alleviates this issue and allows a much better separation.
Unsupervised Domain Adaptation (UDA). UDA has received a lot of attention over the past few years because of its importance for a variety of real-world problems, such as robotics or autonomous driving. Most works in this line of research aim at minimizing the discrepancy between the data distributions in source and target domains. For the semantic segmentation task, most recent progress has been obtained by adopting an adversarial training approach to produce indistinguishable source-target distributions in the space of features extracted by modern convolutional deep neural nets. To cite a few methods: CyCADA [48] first stylizes the source-domain images as target-domain images before aligning source and target in the feature space; AdaptSegNet [49] constructs a multi-level adversarial network to perform output-space domain adaptation at different feature levels; AdvEnt [50] aligns the entropy of the pixel-wise predictions with an adversarial loss; BDL [21] learns alternately an image translation model and a segmentation model that promote each other.
Self-Training. Semi-supervised learning designates the general problem where a decision rule must be learned from both labeled and unlabeled data. Among the methods applied to address this problem, self-training with pseudo-labeling [20] is a simple strategy that relies on picking up the current predictions on the unlabeled data and using them as if they were true labels for further training. It is shown in [20] that the effect of pseudo-labeling is equivalent to entropy regularization [51]. In a UDA setting, the idea is to collect pseudo-labels on the unlabeled target-domain samples in order to have an additional supervision loss in the target domain. To select only reliable pseudo-labels, such that the performance of the adapted semantic segmentation network effectively improves, BDL [21] resorts to standard selection with MCP. ESL [52] uses instead the entropy of the prediction as confidence criterion for its pseudo-label selection. CBST [53] proposes an iterative self-training procedure where the pseudo-labels are generated based on a loss minimization. In [53], the authors also propose a way to balance the classes in their pseudo-labels to avoid the dominance of large classes as well as a way to introduce spatial priors. More recently, the CRST framework [54] proposes multiple types of confidence regularization to limit the propagation of errors caused by noisy pseudo-labels.
3 LEARNING A MODEL'S CONFIDENCE WITH AN AUXILIARY MODEL
In this section, we first introduce briefly the task of classification with a reject option, along with necessary notations. We then introduce an effective confidence-rate function for neural-net classifiers and we present our approach to learn this target confidence-rate function thanks to an auxiliary neural network. For sake of simplicity, we consider in this section a generic classification task, where the input is raw or transformed signals and the expected output is a predicted category. The semantic segmentation task we address in Section 5 is in effect a pixel-wise classification of localized features derived from the input image.
Let us consider a dataset $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ composed of $N$ i.i.d. training samples, where $\mathbf{x}_n \in \mathcal{X} \subset \mathbb{R}^D$ is a $D$-dimensional data representation, deep feature maps from an image or the image itself for instance, and $y_n \in \mathcal{Y} = \llbracket 1, K \rrbracket$ is its true class among the $K$ pre-defined categories. These samples are drawn from an unknown joint distribution $P(X, Y)$ over $(\mathcal{X}, \mathcal{Y})$.
Fig. 2: Learning confidence approach. The fixed classification network $F$, with parameters $\mathbf{w} = (\mathbf{w}_E, \mathbf{w}_{\text{cls}})$, is composed of a succession of convolutional and fully-connected layers (encoder $E$) followed by last classification layers with softmax activation. The auxiliary confidence network $C$, with parameters $\theta$, builds upon the feature maps extracted by the encoder $E$, or its fine-tuned version $E'$ with parameters $\mathbf{w}_{E'}$: they are passed to ConfidNet, a trainable multi-layer module with parameters $\varphi$. The auxiliary model outputs a confidence score $C(\mathbf{x}; \theta) \in [0, 1]$, with $\theta = \varphi$ in absence of encoder fine-tuning and $\theta = (\mathbf{w}_{E'}, \varphi)$ in case of fine-tuning.

A selective classifier [16], [17] is a pair $(f, g)$ where $f : \mathcal{X} \rightarrow \mathcal{Y}$ is a prediction function and $g : \mathcal{X} \rightarrow \{0, 1\}$ is a selection function which enables to reject a prediction:
$$(f, g)(\mathbf{x}) = \begin{cases} f(\mathbf{x}), & \text{if } g(\mathbf{x}) = 1, \\ \text{reject}, & \text{if } g(\mathbf{x}) = 0. \end{cases} \tag{1}$$
In this work, we focus on classifiers based on artificial neural networks. Given an input $\mathbf{x}$, such a network $F$ with parameters $\mathbf{w}$ outputs non-negative scores over all classes, which are normalized through softmax. If well trained, this output can be interpreted as the predictive distribution $P(Y \mid \mathbf{x}, \hat{\mathbf{w}}) = F(\mathbf{x}; \hat{\mathbf{w}}) \in \Delta$, with $\Delta$ the probability $K$-simplex in $\mathbb{R}^K$ and $\hat{\mathbf{w}}$ the learned weights. Based on this distribution, the predicted sample class is usually the maximum a posteriori estimate:
$$f(\mathbf{x}) = \operatorname*{argmax}_{k \in \mathcal{Y}} P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) = \operatorname*{argmax}_{k \in \mathcal{Y}} F(\mathbf{x}; \hat{\mathbf{w}})[k]. \tag{2}$$

We are not interested here in trying to improve the accuracy of the already-trained model $F$, but rather to make its future use more reliable by endowing the system with the ability to recognize when the prediction might be wrong. To this end, a confidence-rate function $\kappa_f : \mathcal{X} \rightarrow \mathbb{R}^+$ is associated to $f$ so as to assess the degree of confidence of its predictions, the higher the value the more certain the prediction [16], [17]. A suitable confidence-rate function should correlate erroneous predictions with low values and successful predictions with high values. Finally, given a user-defined threshold $\delta \in \mathbb{R}^+$, the selection function $g$ can be simply derived from the confidence rate:
$$g(\mathbf{x}) = \begin{cases} 1 & \text{if } \kappa_f(\mathbf{x}) \geq \delta, \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

For a given input $\mathbf{x}$, a standard confidence-rate function for a classifier $F$ is the probability associated to the predicted max-score class, that is the maximum class probability:
$$\text{MCP}_F(\mathbf{x}) = \max_{k \in \mathcal{Y}} P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) = \max_{k \in \mathcal{Y}} F(\mathbf{x}; \hat{\mathbf{w}})[k]. \tag{4}$$
However, by taking the largest softmax probability as confidence estimate, MCP leads to high confidence values both for correct and erroneous predictions alike, making it hard to distinguish them, as shown in Figure 1a. On the other hand, when the model misclassifies an example, the probability associated to the true class $y$ is lower than the maximum one and likely to be low. Based on this simple observation, we propose to consider instead this true class probability as a suitable confidence-rate function. For any admissible input $\mathbf{x} \in \mathcal{X}$, we assume the true class $y(\mathbf{x})$ is known, which we denote $y$ for simplicity. The TCP confidence rate is defined as
$$\text{TCP}_F(\mathbf{x}, y) = P(Y = y \mid \mathbf{x}, \hat{\mathbf{w}}) = F(\mathbf{x}; \hat{\mathbf{w}})[y]. \tag{5}$$
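As a minimal illustration of the two criteria (a sketch of our own; tensor names are hypothetical), MCP (4) and TCP (5) differ only in which entry of the softmax output is read: the predicted class versus the true class. Note that TCP requires the true labels and is therefore only computable on annotated data, which is precisely what motivates learning it:

```python
import torch
import torch.nn.functional as F

def mcp_and_tcp(logits, targets):
    """Compute MCP (4) and TCP (5) from classifier outputs.

    logits:  (N, K) raw class scores F(x; w)
    targets: (N,)   true labels y
    """
    probs = F.softmax(logits, dim=1)                        # predictive distribution
    mcp, preds = probs.max(dim=1)                           # probability of the predicted class
    tcp = probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # probability of the true class
    return mcp, tcp, preds
```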
Theoretical guarantees. With TCP, the following properties hold (see derivation in Appendix A.1). Given a properly labelled example $(\mathbf{x}, y)$, then:
• $\text{TCP}_F(\mathbf{x}, y) > 1/2 \Rightarrow f(\mathbf{x}) = y$, i.e., the example is correctly classified by the model;
• $\text{TCP}_F(\mathbf{x}, y) < 1/K \Rightarrow f(\mathbf{x}) \neq y$, i.e., the example is wrongly classified by the model,
where the class prediction $f(\mathbf{x})$ is defined by (2).

Within the range $[1/K, 1/2]$, there is no theoretical guarantee that correct and incorrect predictions will not overlap in terms of TCP. However, when using deep neural networks, we observe that the actual overlap area is extremely small in practice, as illustrated in Figure 1b on the CIFAR-10 dataset. One possible explanation comes from the fact that modern deep neural networks output overconfident predictions and therefore non-calibrated probabilities [35]. We provide consolidated results and analyses on this aspect in Section 6 and in Appendix A.2.
Using TCP as a confidence-rate function on a model's output would be of great help when it comes to reliably estimating its confidence. However, the true classes $y$ are obviously not available when estimating confidence on test inputs.

We propose to learn TCP confidence from data. More formally, for the classification task at hand, we consider a parametric selective classifier $(f, g)$, with $f$ based on an already-trained neural network $F$. We aim at deriving its companion selection function $g$ from a learned estimate of the TCP function of $F$. To this end, we introduce an auxiliary model $C$, with parameters $\theta$, that is intended to predict $\text{TCP}_F$ and to act as a confidence-rate function for the selection function $g$. An overview of the proposed approach is available in Figure 2. This model is trained such that, at runtime, for an input $\mathbf{x} \in \mathcal{X}$ with (unknown) true label $y$, we have:
$$C(\mathbf{x}; \theta) \approx \text{TCP}_F(\mathbf{x}, y). \tag{6}$$
In practice, this auxiliary model $C$ will be a neural network trained under full supervision on $\mathcal{D}$ to produce this confidence estimate. To design this network, we can transfer knowledge from the already-trained classification network. Throughout its training, $F$ has indeed learned to extract increasingly-complex features that are fed to its final classification layers. Calling $E$ the encoder part of $F$, a simple way to transfer knowledge consists in defining and training a multi-layer head with parameters $\varphi$ that regresses $\text{TCP}_F$ from features encoded by $E$. We call this module ConfidNet. As a result of this design, the complete confidence network $C$ is composed of a frozen encoder followed by trained ConfidNet layers. As we shall see in Section 4, the complete architecture might be later fine-tuned, including the encoder, as in classic transfer learning. In that case, $\theta$ will encompass the parameters of both the encoder and the ConfidNet layers.

In the rest of the paper, we detail the different network architectures, loss functions and learning schemes of ConfidNet for two distinct applications: classification failure prediction and self-training for semantic segmentation with domain adaptation. In both tasks, a ranking of unlabelled samples that allows a clear distinction of correct predictions from erroneous ones is crucial. The proposed auxiliary model offers a new solution to this problem.

4 APPLICATION TO FAILURE PREDICTION
Given a trained model, failure prediction is the task of predicting at run-time whether the model has taken a correct decision or not for a given input. As discussed in Section 2, there are different ways to attack this task, which has many real-world applications, in safety-critical systems especially. With a confidence-rate function in hand, the task can be simply set as thresholding this function, exactly in the same way the selection function works in prediction with a reject option. In this section, we discuss how ConfidNet can be used for that exact purpose in the context of image classification.
State-of-the-art image classification models are composed of convolutional layers followed by one or more fully-connected layers and a final softmax operation. In order to work with such a classification network $F$, we build ConfidNet upon a late intermediate representation of $F$. ConfidNet is designed as a small multilayer perceptron composed of a succession of dense layers with a final sigmoid activation that outputs $C(\mathbf{x}; \theta) \in [0, 1]$. As explained in Section 3, we train this network in a supervised manner, such that it predicts well the true-class probability assigned by $F$ to the input image. Regarding the capacity of ConfidNet, we have empirically found that increasing its depth further leaves performance unchanged for estimating the confidence of the classification network (see Appendix B.4 for more details).

As we want to regress a score between 0 and 1, we use a mean-square-error (MSE) loss to train the confidence model:
$$\mathcal{L}_{\text{conf}}(\theta; \mathcal{D}) = \frac{1}{N} \sum_{n=1}^{N} \big( C(\mathbf{x}_n; \theta) - \text{TCP}_F(\mathbf{x}_n, y_n) \big)^2. \tag{7}$$
Since the final task here is the prediction of failures, with confidence prediction being only a means toward it, a more explicit supervision with failure/success information could be considered. In that case, the previous regression loss could still be used, with 0 (failure) and 1 (success) target values instead of TCP. Alternatively, a binary cross-entropy loss (BCE) for the error-prediction task using the predicted confidence as a score could be used. Seeing failure detection as a ranking problem, where good predictions must be ranked before erroneous ones according to the predicted confidence, a batch-wise ranking loss can also be utilized [55]. We assessed experimentally all these alternative losses, including a focal version [56] of the BCE to focus on hard examples, as discussed in Section 6.1.3. They lead to inferior performance compared to using (7). This might be due to the fact that TCP conveys more detailed information than a mere binary label about the quality of the classifier's prediction for a sample. In situations where only very few error samples are available, this fine-grained information improves the performance of the final failure detection (see Section 6.1.3).

We decompose the parameters of the classification network $F$ into $\mathbf{w} = (\mathbf{w}_E, \mathbf{w}_{\text{cls}})$, where $\mathbf{w}_E$ denotes its encoder's weights and $\mathbf{w}_{\text{cls}}$ the weights of its last classification layers. As in transfer learning, the training of the confidence network $C$ starts by fixing the shared encoder and training only ConfidNet's weights $\varphi$. In this phase, the loss (7) is thus minimized only w.r.t. $\theta = \varphi$.

In a second phase, we further fine-tune the complete network $C$, including its encoder which is now untied from the classification encoder $E$ (the main classification model must remain unchanged, by definition of the addressed problem). Denoting $E'$ this now independent encoder, and $\mathbf{w}_{E'}$ its weights, this second training phase optimizes (7) w.r.t. $\theta = (\mathbf{w}_{E'}, \varphi)$, with $\mathbf{w}_{E'}$ initially set to $\mathbf{w}_E$.
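A compact sketch of this design and of its training objective (our own illustration: the hidden width and layer count are placeholders, since the paper only specifies a small MLP of dense layers with a final sigmoid, trained with the MSE loss (7)):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfidNet(nn.Module):
    """Small confidence head regressing TCP from penultimate features."""
    def __init__(self, feat_dim, hidden_dim=400):  # hidden_dim is a placeholder
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, feats):
        return torch.sigmoid(self.net(feats)).squeeze(1)  # confidence in [0, 1]

def confidence_loss(confid_net, encoder, classifier, x, y):
    """MSE loss (7) between predicted confidence and the TCP target.

    Phase 1: `encoder` is the frozen classification encoder E (theta = phi).
    Phase 2: `encoder` is a fine-tuned copy E' of E (theta = (w_E', phi)),
    in which case torch.no_grad() below should only wrap the classifier.
    """
    with torch.no_grad():  # the classifier F itself always stays frozen
        feats = encoder(x)
        probs = F.softmax(classifier(feats), dim=1)
        tcp = probs.gather(1, y.unsqueeze(1)).squeeze(1)
    return F.mse_loss(confid_net(feats), tcp)
```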
We also deactivate dropout layers in this last training phase and reduce the learning rate to mitigate stochastic effects that may lead the new encoder to deviate too much from the original one used for classification. Data augmentation can thus still be used. In Section 6, we put this framework to work on several standard image-classification benchmarks and analyse its effectiveness in comparison with alternative approaches.

5 APPLICATION TO SELF-TRAINING IN SEMANTIC SEGMENTATION WITH DOMAIN ADAPTATION

Fig. 3: Overview of proposed confidence learning for domain adaptation (ConDA) in semantic segmentation. Given images in source and target domains, we pass them to the encoder part of the segmentation network $F$ to obtain their feature maps. This network $F$ is fixed during this phase and its weights are not updated. The confidence maps are obtained by feeding these feature maps to the trainable head of the confidence network $C$, which includes a multi-scale ConfidNet module. For source-domain images, a regression loss $\mathcal{L}_{\text{conf}}$ (8) is computed to minimize the distance between $C^\theta_{\mathbf{x}_s}$ and the fixed true-class-probability map $\text{TCP}_F(\mathbf{x}_s, \mathbf{y}_s)$. An adversarial training scheme, based on the discriminator's loss $\mathcal{L}_D(\psi)$ (Eq. 10) and the adversarial part $\mathcal{L}_{\text{adv}}(\theta)$ of the confidence net's loss (Eq. 12), is also added to enforce the consistency between the $C^\theta_{\mathbf{x}_s}$'s and the $C^\theta_{\mathbf{x}_t}$'s. Dashed arrows stand for paths that are used only at train time.
Unsupervised domain adaptation for semantic segmentation aims to adapt a segmentation model trained on a labeled source domain to a target domain devoid of annotation. Formally, let us consider the annotated source-domain training set $\mathcal{D}_s = \{(\mathbf{x}_{s,n}, \mathbf{y}_{s,n})\}_{n=1}^{N_s}$, where $\mathbf{x}_{s,n}$ is a color image of size $(H, W)$ and $\mathbf{y}_{s,n} \in \mathcal{Y}^{H \times W}$ its associated ground-truth segmentation map. A segmentation network $F$ with parameters $\mathbf{w}$ takes as input an image $\mathbf{x}$ and returns a predicted soft-segmentation map $F(\mathbf{x}; \mathbf{w}) = P^{\mathbf{w}}_{\mathbf{x}} \in [0, 1]^{H \times W \times K}$, where $P^{\mathbf{w}}_{\mathbf{x}}[h, w, :] = P(Y[h, w] \mid \mathbf{x}; \mathbf{w}) \in \Delta$. The final prediction of the network is the segmentation map $f(\mathbf{x})$ defined pixel-wise as $f(\mathbf{x})[h, w] = \operatorname{argmax}_{k \in \mathcal{Y}} P^{\mathbf{w}}_{\mathbf{x}}[h, w, k]$. This network is learned with full supervision from the source-domain samples in $\mathcal{D}_s$, using a cross-entropy loss, while leveraging a set $\mathcal{D}_t$ of unlabelled target-domain examples.

In UDA, the main challenge is to use the unlabeled target set $\mathcal{D}_t = \{\mathbf{x}_{t,n}\}_{n=1}^{N_t}$ available during training to learn domain-invariant features on which the segmentation model would behave similarly in both domains. As reviewed in Section 2, a variety of techniques have been proposed to do that, in particular for the task of semantic segmentation. Leveraging automatic pseudo-labeling of target-domain training examples is in particular a simple, yet powerful way to further improve UDA performance with self-training. One key ingredient of such an approach being the selection of the most promising pseudo-labels, the proposed auxiliary confidence-prediction model lends itself particularly well to this task. In the rest of this section, we detail how the proposed approach to confidence prediction can be adapted to semantic segmentation, with application to domain adaptation through self-training. The resulting framework, called ConDA, is illustrated in Figure 3.

A high-level view of self-training for semantic segmentation with UDA is as follows:
1) Train a segmentation network for the target domain using a chosen UDA technique;
2) Collect pseudo-labels among the predictions that this network makes on the target-domain training images;
3) Train a new semantic-segmentation network from scratch using the chosen UDA technique in combination with supervised training on target-domain data with pseudo-labels;
4) Possibly, repeat from step 2 by collecting better pseudo-labels after each iteration.

While the general idea of self-training is simple and intuitive, collecting good pseudo-labels is quite tricky: if too many of them correspond to erroneous predictions of the current segmentation network, the performance of the whole UDA can deteriorate. Thus, a measure of confidence should be used in order to only gather reliable predictions as pseudo-labels and to reject the others, as sketched below.
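The pixel-wise selection underlying step 2 can be sketched as follows (our own sketch; the threshold value and the ignore index are illustrative placeholders, 255 being simply a common 'ignore' convention in segmentation losses):

```python
import torch

def select_pseudo_labels(seg_probs, conf_map, delta=0.9, ignore_index=255):
    """Confidence-based pseudo-label selection (step 2 of the loop above).

    seg_probs: (K, H, W) soft segmentation map of the trained network F
    conf_map:  (H, W)    confidence map predicted by the auxiliary model C
    Pixels whose confidence falls below `delta` are marked with `ignore_index`
    so that they are excluded from the supervision loss when re-training.
    """
    labels = seg_probs.argmax(dim=0)         # hard predictions f(x)
    labels[conf_map < delta] = ignore_index  # reject unreliable pixels
    return labels
```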
Following the self-training framework previously described, a confidence network $C$ is learned at step (2) to predict the confidence of the UDA-trained semantic segmentation network $F$ and used to select only trustworthy pseudo-labels on target-domain images. To this end, the framework proposed in Section 3 in an image classification setup, and applied to predicting erroneous image classification in Section 4, needs here to be adapted to the structured output of semantic segmentation.

Semantic segmentation can be seen as a pixel-wise classification problem. Given a target-domain image $\mathbf{x}_t$, we want to predict both its soft semantic map $F(\mathbf{x}_t; \mathbf{w})$ and, using an auxiliary model with trainable parameters $\theta$, its confidence map $C(\mathbf{x}_t; \theta) = C^\theta_{\mathbf{x}_t} \in [0, 1]^{H \times W}$. Given a pixel $(h, w)$, if its confidence $C^\theta_{\mathbf{x}_t}[h, w]$ is above a chosen threshold $\delta$, we label it with its predicted class $f(\mathbf{x}_t)[h, w] = \operatorname{argmax}_{k \in \mathcal{Y}} P^{\mathbf{w}}_{\mathbf{x}_t}[h, w, k]$, otherwise it is masked out. Computed over all images in $\mathcal{D}_t$, these incomplete segmentation maps constitute target pseudo-labels that are used to train a new semantic-segmentation network. Optionally, we may repeat from step (2) and learn alternately a confidence model to collect pseudo-labels and a segmentation network using this self-training.

To train the segmentation confidence network $C$, we propose to jointly optimize two objectives. Following the approach proposed in Section 3, the first one supervises the confidence prediction on annotated source-domain examples using the known true class probabilities for the predictions from $F$. Specific to semantic segmentation with UDA, the second one is an adversarial loss that aims at reducing the domain gap between source and target. A complete overview of the approach is provided in Figure 3.
Confidence loss. The first objective is a pixel-wise version of the confidence loss in (7). On annotated source-domain images, it requires $C$ to predict at each pixel the score assigned by $F$ to the (known) true class:
$$\mathcal{L}_{\text{conf}}(\theta; \mathcal{D}_s) = \frac{1}{N_s} \sum_{n=1}^{N_s} \big\| C^\theta_{\mathbf{x}_{s,n}} - \text{TCP}_F(\mathbf{x}_{s,n}, \mathbf{y}_{s,n}) \big\|^2_F, \tag{8}$$
where $\|\cdot\|_F$ denotes the Frobenius norm and, for an image $\mathbf{x}$ with true segmentation map $\mathbf{y}$ and predicted soft one $F(\mathbf{x}; \hat{\mathbf{w}})$, we note
$$\text{TCP}_F(\mathbf{x}, \mathbf{y})[h, w] = F(\mathbf{x}; \hat{\mathbf{w}})\big[h, w, \mathbf{y}[h, w]\big] \tag{9}$$
at location $(h, w)$. On a new input image, $C$ should predict at each pixel the score that $F$ will assign to the unknown true class, which will serve as a confidence measure.

However, compared to the application in the previous section, we have here the additional problem of the gap between source and target domains, an issue that might affect the training of the confidence model as it does the training of the segmentation model.
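A direct implementation sketch of the source-domain loss (8) with its pixel-wise TCP target (9) (our own; tensor layouts are assumptions):

```python
import torch
import torch.nn.functional as F

def segmentation_confidence_loss(conf_maps, seg_logits, gt_maps):
    """Pixel-wise confidence loss (8) on annotated source-domain images.

    conf_maps:  (B, H, W)    confidence maps predicted by C
    seg_logits: (B, K, H, W) logits of the frozen segmentation network F
    gt_maps:    (B, H, W)    ground-truth label maps y
    """
    with torch.no_grad():
        probs = F.softmax(seg_logits, dim=1)
        tcp_maps = probs.gather(1, gt_maps.unsqueeze(1)).squeeze(1)  # TCP map (9)
    # mean squared error, i.e. the squared Frobenius distance averaged over pixels
    return F.mse_loss(conf_maps, tcp_maps)
```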
Adversarial loss. The second objective concerns the domain gap. While model $C$ learns to estimate TCP on source-domain images, its confidence estimation on target-domain images may suffer dramatically from this domain shift. As classically done in UDA, we propose an adversarial learning of our auxiliary model in order to address this problem. More precisely, we want the confidence maps produced by $C$ in the source domain to resemble those obtained in the target domain.

A discriminator $D : [0, 1]^{H \times W} \rightarrow \{0, 1\}$, with parameters $\psi$, is trained concurrently with $C$ with the aim to recognize the domain (1 for source, 0 for target) of an image given its confidence map. The following loss is minimized w.r.t. $\psi$:
$$\mathcal{L}_D(\psi; \mathcal{D}_s \cup \mathcal{D}_t) = \frac{1}{N_s} \sum_{n=1}^{N_s} \mathcal{L}_{\text{adv}}(\mathbf{x}_{s,n}, 1) + \frac{1}{N_t} \sum_{n=1}^{N_t} \mathcal{L}_{\text{adv}}(\mathbf{x}_{t,n}, 0), \tag{10}$$
where $\mathcal{L}_{\text{adv}}$ denotes the cross-entropy loss of the discriminator based on confidence maps:
$$\mathcal{L}_{\text{adv}}(\mathbf{x}, \lambda) = -\lambda \log\big(D(C^\theta_{\mathbf{x}}; \psi)\big) - (1 - \lambda) \log\big(1 - D(C^\theta_{\mathbf{x}}; \psi)\big), \tag{11}$$
for $\lambda \in \{0, 1\}$, which is a function of both $\psi$ and $\theta$. In alternation with the training of the discriminator using (10), the adversarial training of the confidence net is conducted by minimizing, w.r.t. $\theta$, the following loss:
$$\mathcal{L}_C(\theta; \mathcal{D}_s \cup \mathcal{D}_t) = \mathcal{L}_{\text{conf}}(\theta; \mathcal{D}_s) + \frac{\lambda_{\text{adv}}}{N_t} \sum_{n=1}^{N_t} \mathcal{L}_{\text{adv}}(\mathbf{x}_{t,n}, 1), \tag{12}$$
where the second term, weighted by $\lambda_{\text{adv}}$, encourages $C$ to produce maps in the target domain that will confuse the discriminator.

This adversarial scheme for confidence learning also acts as a regularizer during training, improving the robustness of the unknown TCP target confidence. As the training of $C$ may actually be unstable, adversarial training provides an additional information signal, in particular imposing that confidence estimation should be invariant to domain shifts. We empirically observed that this adversarial confidence learning provides better confidence estimates and improves convergence and stability of the training scheme.
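The alternating optimization of (10) and (12) can be sketched as follows (our own sketch: the discriminator is assumed to end with a sigmoid, and the optimizers, map shapes and λ_adv value are placeholders):

```python
import torch
import torch.nn.functional as F

def adversarial_step(conf_net, discriminator, opt_c, opt_d,
                     feats_src, feats_tgt, tcp_src, lambda_adv=1e-3):
    """One alternating update of the discriminator (10) and of C (12)."""
    conf_src = conf_net(feats_src)  # source-domain confidence maps
    conf_tgt = conf_net(feats_tgt)  # target-domain confidence maps

    # Discriminator update: source maps labeled 1, target maps labeled 0.
    d_src = discriminator(conf_src.detach())
    d_tgt = discriminator(conf_tgt.detach())
    loss_d = F.binary_cross_entropy(d_src, torch.ones_like(d_src)) \
           + F.binary_cross_entropy(d_tgt, torch.zeros_like(d_tgt))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Confidence-net update: TCP regression on source (8) plus the
    # adversarial term of (12), where target maps are labeled as source.
    loss_conf = F.mse_loss(conf_src, tcp_src)
    d_tgt = discriminator(conf_tgt)
    loss_adv = F.binary_cross_entropy(d_tgt, torch.ones_like(d_tgt))
    loss_c = loss_conf + lambda_adv * loss_adv
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    return loss_d.item(), loss_c.item()
```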
TABLE 1: Comparison of confidence estimation methods for failure prediction and selective classification. For each dataset, all methods share the same classification network. For MC Dropout, test accuracy is averaged through random sampling. The first three metrics are percentages and concern failure prediction. The last two (the lower, the better) concern selective classification and their values have been multiplied by 10³ for clarity. Best results are in bold, second best ones are underlined.

Dataset     Model         Method            FPR@95%TPR↓  AUPR↑   AUROC↑  AURC↓   E-AURC↓
MNIST       MLP           MCP [23]          14.87        37.70   97.13   0.77    0.58
                          MC Dropout [19]   15.15        38.22   97.15   0.79    0.59
                          Trust Score [24]  12.31        52.18   97.52   0.69    0.50
                          ConfidNet
MNIST       SmallConvNet  MCP [23]          5.56         35.05   98.63   0.17    0.12
                          MC Dropout [19]   5.26         38.50   98.65   0.17    0.13
                          Trust Score [24]  10.00        35.88   98.20   0.21    0.17
                          ConfidNet
SVHN        SmallConvNet  MCP [23]          31.28        48.18   93.20   5.44    4.38
                          MC Dropout [19]   36.60        43.87   92.85   5.68    4.57
                          Trust Score [24]  34.74        43.32   92.16   6.08    5.03
                          ConfidNet
CIFAR-10    VGG16         MCP [23]          47.50        45.36   91.53   10.45   7.32
                          MC Dropout [19]   49.02        46.40   92.08   9.97    6.92
                          Trust Score [24]  55.70        38.10   88.47   16.04   12.91
                          ConfidNet
CIFAR-100   VGG16         MCP [23]          67.86        71.99   85.67   120.28  54.35
                          MC Dropout [19]   64.68        72.59   86.09
                          Trust Score [24]  71.74        66.82   84.17   126.90  60.97
                          ConfidNet
In semantic segmentation, models consist of fully convolutional networks where hidden representations are 2D feature maps. This is in contrast with the architecture of the classification models considered in Section 4. As a result, the ConfidNet module must have a different design here: instead of fully-connected layers, it is composed of 1×1 convolutional layers with the adequate number of channels.

In many segmentation datasets, the existence of objects at multiple scales may complicate confidence estimation. As in recent works dealing with varying object sizes [57], we further improve our confidence network $C$ by adding a multi-scale architecture based on spatial pyramid pooling. It consists of a computationally efficient scheme to re-sample a feature map at different scales, and then to aggregate the confidence maps.

From a feature map, we apply parallel atrous convolutional layers with 3×3 kernel size and different sampling rates, each of them followed by a series of 4 standard convolutional layers with 3×3 kernel size. In contrast with convolutional layers with large kernels, atrous convolutional layers enlarge the field of view of filters and help to incorporate a larger context without increasing the number of parameters and the computation time. Resulting features are then summed before upsampling to the original image size of $H \times W$. We apply a final sigmoid activation to output a confidence map with values between 0 and 1.

The whole architecture of the confidence model $C$ is represented in the orange block of Figure 3, along with its training given a fixed segmentation model $F$ (blue block) with which it shares the encoder. As in the previous section, fine-tuning the encoder within $C$ is also possible, although we did not explore this option in the semantic segmentation context due to the excessive memory overhead it implies.
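A sketch of such a multi-scale head in the spirit of atrous spatial pyramid pooling (our own; channel widths and dilation rates are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleConfidNet(nn.Module):
    """Fully convolutional confidence head with parallel atrous branches."""
    def __init__(self, in_ch, mid_ch=128, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList()
        for r in rates:
            layers = [nn.Conv2d(in_ch, mid_ch, 3, padding=r, dilation=r), nn.ReLU()]
            for _ in range(4):  # a series of standard 3x3 convolutions
                layers += [nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU()]
            layers += [nn.Conv2d(mid_ch, 1, 3, padding=1)]
            self.branches.append(nn.Sequential(*layers))

    def forward(self, feats, out_size):
        conf = sum(b(feats) for b in self.branches)  # sum the per-scale maps
        conf = F.interpolate(conf, size=out_size, mode="bilinear",
                             align_corners=False)    # upsample to (H, W)
        return torch.sigmoid(conf).squeeze(1)        # confidence map in [0, 1]
```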
6 EXPERIMENTS

We evaluate our approach on the two tasks presented in the previous sections: failure prediction in classification settings, and semantic segmentation with domain adaptation.
6.1 Failure Prediction

In this section, we present comparative experiments against state-of-the-art confidence-estimation approaches and Bayesian methods on various datasets. Then, we study the effect of learning variants on our approach.
The experiments are conducted on image datasets of varying scale and complexity: MNIST [58] and SVHN [59] provide small and relatively simple images of digits (10 classes); CIFAR-10 and CIFAR-100 [60] propose more complex object-recognition tasks on low-resolution images. The classification models range from small convolutional networks for MNIST and SVHN to the larger VGG-16 architecture for the CIFAR datasets. We also consider a multi-layer perceptron (MLP) with one hidden layer to investigate performance on small models. ConfidNet is attached to the penultimate layer of the convolutional neural network. Further details about datasets, architectures, training and metrics can be found in Appendix B.

We measure the quality of failure prediction following standard metrics used in the literature [23]: AUROC, the area under the receiver operating characteristic; FPR@95%TPR, the false-positive rate measured when the true-positive rate is 95%; and AUPR, the area under the precision-recall curve, using here incorrect model predictions as positive detection samples (see details in Appendix B.2). Among these metrics, AUPR is the most directly related to the failure detection task, and is thus the prevalent one in our assessment.

As an additional, indirect way to assess the quality of the predicted classifier's confidence, we also consider the selective classification problem that was discussed in Section 3.1. In this setup, the predictions by the classifier $F$ that get a predicted confidence below a defined threshold are rejected. Given a coverage rate (the fraction of examples that are not rejected), the performance of the classifier should improve. The impact of this selection, and hence of the underlying confidence-rate function, is measured on average with the area under the risk-coverage curve (AURC) and its normalized variant, Excess-AURC (E-AURC) [61].
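All these metrics can be computed from per-sample confidence scores and correctness indicators, e.g. with scikit-learn (a sketch of our own; errors are taken as the positive class for AUPR, as in the paper):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def failure_prediction_metrics(confidence, correct):
    """confidence: (N,) scores; correct: (N,) 1 if the prediction is right."""
    aupr_errors = average_precision_score(1 - correct, -confidence)  # errors as positives
    auroc = roc_auc_score(correct, confidence)
    fpr, tpr, _ = roc_curve(correct, confidence)
    fpr_at_95tpr = fpr[np.searchsorted(tpr, 0.95)]  # FPR when TPR reaches 95%
    return aupr_errors, auroc, fpr_at_95tpr
```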
Fig. 4: Limitations of MC Dropout's confidence measure. Two test samples from the SVHN dataset, which are respectively misclassified (left) and correctly classified (right) by a given model $F$, illustrate these limits. The entropies of the predicted class distributions (averaged over Monte Carlo dropout passes and displayed as histograms) are equally high, at around 0.79, resulting in equally low MC Dropout confidence estimates. In contrast, both MCP and TCP approximated by ConfidNet clearly differ as expected for the two examples. Yet, ConfidNet has the best behavior, being the lowest for the erroneous model prediction and the highest for the correct one.

TABLE 2: Impact of the choice of training data on the error-prediction performance of ConfidNet. Comparison in AUPR between training on the model's train set or on a validation set.

Variant           MNIST (MLP)   MNIST (SmallConvNet)   SVHN (SmallConvNet)   CIFAR-10 (VGG-16)   CIFAR-100 (VGG-16)
ConfidNet-train
ConfidNet-val
Along with our approach, we implemented competitive confidence and uncertainty estimation methods including MCP [23], Trust Score [24], and Monte-Carlo Dropout (MC Dropout) [19].

Comparative results are summarized in Table 1. We observe that our approach outperforms the other methods in every setting, with a significant gap on small models/datasets. This confirms that TCP is an adequate confidence criterion for failure prediction and that our approach ConfidNet is able to learn it. Trust Score also delivers good results on small datasets/models such as MNIST. While ConfidNet still performs well on more complex datasets, Trust Score's performance drops, which might be explained by high-dimensionality issues with distances. Regarding selective classification results (AURC and E-AURC), we also provide risk-coverage curves in Appendix B.8.

We also improve on state-of-the-art performances from MC Dropout. While MC Dropout leverages ensembling based on dropout layers, taking as confidence measure the entropy of the average softmax distribution may not always be adequate. In Figure 4, we show side-by-side samples with similar distribution entropy. The left image is misclassified while the right one enjoys a correct prediction. In fact, the entropy is a permutation-invariant measure on discrete probability distributions: a correct 3-class prediction whose score vector is a permutation of an incorrect one has exactly the same entropy-based confidence. In contrast, our approach can discriminate an incorrect from a correct prediction, despite both having similarly-spread distributions.
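This permutation invariance is easy to verify numerically (a toy example of our own, with made-up probabilities where class 0 is the true class):

```python
import torch

p_correct = torch.tensor([0.70, 0.20, 0.10])  # argmax = true class 0
p_wrong = torch.tensor([0.20, 0.70, 0.10])    # same values permuted: argmax is class 1

entropy = lambda p: -(p * p.log()).sum()
print(entropy(p_correct), entropy(p_wrong))   # identical entropies
print(p_correct[0], p_wrong[0])               # the TCP of class 0 clearly differs
```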
We analyse in Table 3 the effect of the encoder fine-tuning that is described in Section 4.3. Learning only ConfidNet on top of the pre-trained encoder $E$ (that is, $\theta = \varphi$), our confidence network already achieves significant improvements w.r.t. the baselines. With a subsequent fine-tuning of both modules (that is, $\theta = (\mathbf{w}_{E'}, \varphi)$), its performance is further boosted in every setting, by around 1-2%. Note that using a vanilla fine-tuning without the deactivation of the dropout layers did not bring any improvement.

TABLE 3: Impact of the encoder fine-tuning on the error-prediction performance of ConfidNet. Comparison in AUPR on two benchmarks with different backbones.

Variant                 MNIST (SmallConvNet)   CIFAR-100 (VGG-16)
Confidence training     43.94%                 72.68%
+ Encoder fine-tuning   45.89%                 73.68%
Given the small number of erroneous-prediction samples that are available due to deep neural network over-fitting, we also experimented with confidence training on a hold-out dataset. We report the results on all datasets in Table 2 for validation sets with 10% of samples. We observe a general performance drop when using a validation set for training TCP confidence. The drop is especially pronounced for small datasets (MNIST), where models reach more than 97% train and validation accuracies. Consequently, with a high accuracy and a small validation set, we do not get a larger absolute number of errors using a hold-out set rather than the train set. One solution would be to increase the validation-set size, but this would damage the model's prediction performance. By contrast, we take care in our experiments to base our confidence estimation on models with levels of test predictive performance that are similar to those of the baselines. On CIFAR-100, the gap between train accuracy and validation accuracy is substantial (95.56% vs. 65.96%), which may explain the slight improvement for confidence estimation using a validation set (+0.17%). We think that training ConfidNet on a validation set with models reporting low/medium test accuracies could improve the approach.

TABLE 4: Comparative performance on semantic segmentation with synth-to-real unsupervised domain adaptation. Results in per-class IoU and class-averaged mIoU on GTA5 → Cityscapes. All methods are based on a DeepLabv2 backbone.

TABLE 5: Effect of the loss on the error-detection performance of ConfidNet. Comparison in AUPR between the proposed MSE loss and three other alternatives.

Dataset    MSE      BCE   Focal   Ranking
SVHN       50.72%
CIFAR-10   49.94%
In Table 5, we compare training ConfidNet with the MSE loss (7) to training with a binary-classification cross-entropy loss (BCE), a focal BCE loss and a batch-wise approximate ranking loss. Even though BCE specifically addresses the failure prediction task, it achieves lower performance on the CIFAR-10 and SVHN datasets. Similarly, the focal loss and the ranking one yield results below TCP's performance in every tested benchmark. Our intuition is that TCP regularizes the training by providing finer-grained information about the quality of the classifier's predictions. This is especially important in the difficult learning configuration where only very few error samples are available due to the good performance of the classifier.
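For reference, the BCE and focal alternatives compared in Table 5 can be written as follows (our own formulation; the focal exponent γ = 2 is an assumption in the spirit of [56]):

```python
import torch
import torch.nn.functional as F

def bce_conf_loss(conf, correct):
    """BCE: treat the predicted confidence as a probability of success."""
    return F.binary_cross_entropy(conf, correct.float())

def focal_conf_loss(conf, correct, gamma=2.0):
    """Focal variant of the BCE, down-weighting already easy examples."""
    p_t = torch.where(correct.bool(), conf, 1.0 - conf)
    return (-(1.0 - p_t).pow(gamma) * p_t.clamp_min(1e-12).log()).mean()
```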
6.2 Semantic Segmentation with Domain Adaptation

In this section, we analyse on several semantic segmentation benchmarks the performance of ConDA, our approach to domain adaptation with confidence-based self-training. We report comparisons with state-of-the-art methods on each benchmark. We also analyse further the quality of ConDA's pseudo-labelling and demonstrate via an ablation study the importance of each of its components.
As in many UDA works for semantic segmentation, we consider the specific task of adapting from synthetic to real data in urban scenes. We present in particular experiments in the common set-up, denoted GTA5 → Cityscapes, where GTA5 [63] is the synthetic source dataset while the real-world target dataset is Cityscapes [64]. We also validate our approach on two other benchmarks, SYNTHIA → Cityscapes and SYNTHIA → Mapillary Vistas [65], in Appendix C.3. The GTA5 [63] dataset is composed of 24,966 images extracted from the eponymous game, of dimension 1914 × 1052, and semantically annotated with 19 classes in common with Cityscapes [64]. Cityscapes [64] is a dataset of real street-level images. For domain adaptation, we use the training set as target dataset during training. It is composed of 2,975 images of dimension 2048 × 1024. All results are reported in terms of intersection over union (IoU) per class or mean IoU over all classes (mIoU); the higher this percentage, the better.

We evaluate the proposed self-training method on AdvEnt [50], a state-of-the-art UDA approach. AdvEnt [50] proposes an adversarial learning framework for domain adaptation: instead of the softmax output predictions, AdvEnt aligns the entropy of the pixel-wise predictions. All the implementations are done with the PyTorch framework [66]. The semantic segmentation models are initialized with DeepLabv2 backbones pretrained on ImageNet [1]. Due to computational constraints, we only train the multi-scale ConfidNet without encoder fine-tuning. Further information about architectures and implementation details of training and metrics can be found in Appendix C.1.

The results of semantic segmentation on the Cityscapes validation set using GTA5 as source domain are given in Table 4. All the methods rely on DeepLabv2 as their segmentation backbone. We first notice that self-training-based methods from the literature are superior on this benchmark, with ESL [52] reporting the best mIoU among them. ConDA outperforms all those methods.
Ablation Study. To study the effect of the adversarial training and of the multi-scale confidence architecture on the confidence model, we perform an ablation study on the GTA5 → Cityscapes benchmark. The results on domain adaptation after re-training the segmentation network using collected pseudo-labels are reported in Table 6. In this table, 'ConfidNet' refers to the simple network architecture defined in Section 4 (adapted to segmentation by replacing the fully-connected layers by 1×1 convolutions of suitable width); 'Adv. ConfidNet' denotes the same architecture but with the adversarial loss from Section 5.3 added to its learning scheme; 'Multi-scale ConfidNet' stands for the architecture introduced in Section 5.4; finally, the full method, 'ConDA', amounts to having both this architecture and the adversarial loss. We notice that adding the adversarial learning achieves significantly better performance, for both ConfidNet and multi-scale ConfidNet. Multi-scale ConfidNet (resp. adv. multi-scale ConfidNet) also improves performance over its ConfidNet architecture counterpart. These results stress the importance of both components of the proposed confidence model.

Fig. 5: Qualitative results of pseudo-label selection for semantic-segmentation adaptation. The first three columns present target-domain images of the GTA5 → Cityscapes benchmark (a) along with their ground-truth segmentation maps (b) and the predicted maps before self-training (c). We compare pseudo-labels collected with MCP (d) and with ConDA (e). Green (resp. red) pixels are correct (resp. erroneous) predictions selected by the method and black pixels are discarded predictions. ConDA retains fewer errors while preserving approximately the same amount of correct predictions.
TABLE 6: Ablation study on semantic segmentation with pseudo-labelling-based adaptation. The full-fledged ConDA approach is compared on GTA5 → Cityscapes to stripped-down variants (with and without multi-scale architecture in ConfidNet, with and without adversarial learning).

Model                   Multi-scale   Adv.   mIoU
ConfidNet                                    47.6
Multi-scale ConfidNet   ✓
Adv. ConfidNet                        ✓
ConDA                   ✓             ✓
Quality of pseudo-labels. We analyze here the effectiveness of MCP and ConDA as confidence measures to select relevant pseudo-labels in the target domain. For a given fraction of retained pseudo-labels (coverage) on target-domain training images, we compare in Figure 6 the proportion of those labels that are correct (precision). ConDA outperforms MCP at all coverage levels, meaning that it selects significantly fewer erroneous predictions for the next round of segmentation-model training. Along with the segmentation adaptation improvements presented earlier, these coverage results demonstrate that reducing the amount of noise in the pseudo-labels is key to learning a better segmentation adaptation model. Figure 5 presents qualitative results of those pseudo-label methods. We find again that MCP and ConDA seem to select around the same amount of correct predictions in their pseudo-labels, but ConDA picks out far fewer erroneous ones.

Fig. 6: Comparative quality of selected pseudo-labels. Proportion of correct pseudo-labels (precision) for different coverages on GTA5 → Cityscapes, for MCP and ConDA.
7 CONCLUSION
In this paper, we defined a new confidence criterion, TCP, which enjoys both theoretical guarantees and empirical evidence of improving confidence estimation for classifiers with a reject option. We proposed a specific method to learn this criterion with an auxiliary neural network built upon the encoder of the model that is monitored. Applied to failure prediction, this learning scheme consists in training the auxiliary network and then enabling the fine-tuning of its encoder (the one of the monitored classifier remains frozen). In each image classification experiment, we were able to improve the capacity of the model to distinguish correct from erroneous samples and to achieve better selective classification. Besides failure prediction, other applications can benefit from this improved confidence estimation. In particular, we showed that, applied to self-training with pseudo-labels, our approach reaches state-of-the-art results on three synthetic-to-real unsupervised-domain-adaptation benchmarks (GTA5 → Cityscapes, SYNTHIA → Cityscapes and SYNTHIA → Mapillary Vistas). To achieve these results, we equipped the auxiliary model with a multi-scale confidence architecture and supplemented the confidence loss with an adversarial training scheme which enforces alignment between confidence maps in source and target domains. One clear limitation of this approach is the number of errors available in training. Further work includes exploring methods to artificially generate errors, such as aggressive data augmentation.

APPENDIX A
THE TRUE CLASS PROBABILITY (TCP) CRITERION
A.1 Proof of TCP’s theoretical guarantees
Let $F$ be a trained neural network classifier with learned weights $\hat{\mathbf{w}}$ as defined in the main paper, $K$ be the number of labels and $\mathbf{x} \in \mathbb{R}^D$ a sample with its associated true label $y \in \mathcal{Y}$ such that $\text{TCP}_F(\mathbf{x}, y) > \frac{1}{2}$. Starting from the definition of TCP, we have:
$$\text{TCP}_F(\mathbf{x}, y) = P(Y = y \mid \mathbf{x}, \hat{\mathbf{w}}) > \frac{1}{2} \tag{13}$$
$$\iff 1 - \sum_{k \in \mathcal{Y}, k \neq y} P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) > \frac{1}{2} \tag{14}$$
$$\iff \sum_{k \in \mathcal{Y}, k \neq y} P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) < \frac{1}{2}. \tag{15}$$
Since probabilities are positive, we obtain that $\forall k \neq y$, $P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) < \frac{1}{2} < P(Y = y \mid \mathbf{x}, \hat{\mathbf{w}})$. Denoting $\hat{y}$ the class predicted by the network, we have $\hat{y} = \operatorname{argmax}_k P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}})$. Hence $\hat{y} = y$.

In the same way, for any $(\mathbf{x}, y) \in \mathbb{R}^D \times \mathcal{Y}$ such that $\text{TCP}_F(\mathbf{x}, y) < \frac{1}{K}$, we have:
$$P(Y = y \mid \mathbf{x}, \hat{\mathbf{w}}) < \frac{1}{K} \tag{16}$$
$$\iff 1 - \sum_{k \in \mathcal{Y}, k \neq y} P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) < \frac{1}{K} \tag{17}$$
$$\iff \sum_{k \in \mathcal{Y}, k \neq y} P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) > \frac{K - 1}{K}. \tag{18}$$
If the model correctly classified this sample, i.e., $\hat{y} = y$, then $\forall k \neq y$, $P(Y = y \mid \mathbf{x}, \hat{\mathbf{w}}) \geq P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}})$. We would have:
$$\sum_{k \in \mathcal{Y}, k \neq y} P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) \leq (K - 1)\, P(Y = y \mid \mathbf{x}, \hat{\mathbf{w}}) < \frac{K - 1}{K}, \tag{19}$$
which contradicts Equation (18). Hence, there exists at least one $k$ such that $P(Y = k \mid \mathbf{x}, \hat{\mathbf{w}}) > P(Y = y \mid \mathbf{x}, \hat{\mathbf{w}})$, which results in $\hat{y} \neq y$.
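Both implications are easy to check numerically (a quick sanity check of our own, on random softmax distributions):

```python
import torch

torch.manual_seed(0)
K = 10
probs = torch.softmax(torch.randn(10000, K), dim=1)  # random predictive distributions
y = torch.randint(K, (10000,))                       # arbitrary "true" labels
tcp = probs.gather(1, y.unsqueeze(1)).squeeze(1)
pred = probs.argmax(dim=1)

# TCP > 1/2 implies a correct prediction; TCP < 1/K implies a wrong one.
assert ((tcp > 0.5) <= (pred == y)).all()
assert ((tcp < 1.0 / K) <= (pred != y)).all()
```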
A.2 Empirical error and success distributions

In this section, we provide the plots, analogous to Figure 1 in the main paper, that show the distribution of the confidence measures over correct and incorrect predictions respectively, for each dataset and each model in our failure-prediction experiments. We also include the absolute numbers of incorrect and correct predictions grouped into 3 bins ('< 1/K', '[1/K, 1/2]' and '> 1/2') to validate our theoretical assumptions about TCP's properties. The plots are available for MNIST with an MLP in Fig. 7, for MNIST with a small convnet in Fig. 8, for SVHN with a small convnet in Fig. 9, for CIFAR-100 with VGG-16 in Fig. 10 and for CamVid with SegNet in Fig. 11.

APPENDIX B
EXPERIMENTS ON FAILURE PREDICTION
B.1 Implementation details
Datasets.
We run experiments on image datasets of varying scale and complexity: MNIST [58] and SVHN [59] provide relatively simple and small images of digits (10 classes; 28 × 28 pixels for MNIST, 32 × 32 for SVHN). MNIST is split into 60,000 training samples and 10,000 testing samples. CIFAR-10 and CIFAR-100 [60] bring more complexity to the classification of low-resolution images. In each dataset, we further keep 10% of the training samples as a validation dataset. We also report experiments for semantic segmentation on CamVid [67], using ConfidNet's training and architecture introduced in Section 4 of the main paper, with dense layers replaced by 1 × 1 convolutions with an adequate number of channels. CamVid is a standard road-scene dataset. Images are resized to 360 × 480 pixels and are segmented according to 11 classes such as 'road', 'building', 'car' or 'pedestrian'.
Classification network. For each dataset, we use standard neural network architectures as classifiers. For fair comparison, we re-implemented in PyTorch [66] the network architectures proposed in [24]. They range from small convolutional networks for MNIST and SVHN to VGG-16 architectures for the CIFAR datasets. We also added a multi-layer perceptron (MLP) with one hidden layer of size 100 for MNIST in order to investigate performance on small models. Finally, we implemented SegNet following [25]. All models are trained in a standard way with a cross-entropy loss and an SGD optimizer with a learning rate of $10^{-1}$, a momentum of 0.9 and a weight decay of $10^{-4}$. The number of training epochs depends on the dataset considered, varying from 100 epochs on MNIST to 250 epochs on CIFAR-100. As we also want to compute Monte Carlo samples following [19], we include dropout layers. The best models are selected on validation-set accuracy.
ConfidNet. For each of the considered classification models, ConfidNet is built upon the penultimate layer, which is a convolutional layer with non-linear activation, optionally followed by a normalization layer. We train ConfidNet for 100 epochs with the Adam optimizer, a learning rate of $1 \times 10^{-4}$, dropout, a weight decay of $10^{-4}$ and the same data augmentation as in the classifier's training. We select the best model based on the AUPR on the validation dataset. In the second training step, which involves encoder fine-tuning, training is completed over very few epochs starting from the previous best model, using the Adam optimizer with a learning rate of $1 \times 10^{-6}$ or $1 \times 10^{-7}$ and no dropout, to mitigate stochastic effects that may lead the new encoder to deviate too much from the original one used for classification. Once again, the best model is selected on validation-set AUPR.
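The two-step procedure described above can be summarized in a short PyTorch sketch: the confidence head is first trained on TCP targets with the encoder frozen, then a private copy of the encoder is unfrozen and fine-tuned at a much smaller learning rate, while the monitored classifier itself is never updated. This is a simplified sketch under our own assumptions; the names `encoder`, `classifier` and `confidnet_head`, and the attachment of the head directly to the encoder output, are stand-ins for the actual implementation.

```python
import copy
import torch
import torch.nn.functional as F

def tcp_targets(logits, labels):
    """TCP target: softmax probability of the true class of each sample."""
    probs = torch.softmax(logits, dim=1)
    return probs.gather(1, labels.unsqueeze(1)).squeeze(1)

def train_confidnet(encoder, classifier, confidnet_head, loader, fine_tune=False):
    """Step 1 (fine_tune=False): train the head on frozen features.
    Step 2 (fine_tune=True): also fine-tune a private copy of the encoder;
    the monitored classifier itself is never updated."""
    enc = copy.deepcopy(encoder) if fine_tune else encoder
    params = list(confidnet_head.parameters())
    if fine_tune:
        params += list(enc.parameters())
    lr = 1e-6 if fine_tune else 1e-4            # learning rates assumed from the text
    optimizer = torch.optim.Adam(params, lr=lr, weight_decay=1e-4)
    for images, labels in loader:
        with torch.no_grad():                   # TCP targets from the frozen classifier
            logits = classifier(encoder(images))
            target = tcp_targets(logits, labels)
        if fine_tune:
            feats = enc(images)                 # gradients flow into the encoder copy
        else:
            with torch.no_grad():
                feats = enc(images)             # encoder frozen in the first step
        confidence = confidnet_head(feats).squeeze(1)
        loss = F.mse_loss(confidence, target)   # regress the TCP criterion
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return enc, confidnet_head
```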
1. https://github.com/EN10/KerasMNIST
2. https://github.com/tohinz/SVHN-Classifier
3. https://github.com/geifmany/cifar-vgg
Fig. 7:
Distributions of MCP and TCP confidence estimates.
These distributions are computed over correct and erroneous predictions by a trained MLP on MNIST. (a) MCP, (b) TCP.
                  Nb. of Errors                    Nb. of Successes
Model    < 1/K   [1/K, 1/2]   > 1/2     < 1/K   [1/K, 1/2]   > 1/2     AUPR ↑    AUROC ↑
MCP      0       25           170       0       28           9777      37.70%    97.13%
TCP      81      114          0         0       28           9777      98.77%    99.98%

Fig. 8: Same as Figure 7 on MNIST using a small convnet architecture. (a) MCP, (b) TCP.
                  Nb. of Errors                    Nb. of Successes
Model    < 1/K   [1/K, 1/2]   > 1/2     < 1/K   [1/K, 1/2]   > 1/2     AUPR ↑    AUROC ↑
MCP      0       8            82        0       11           9899      35.05%    98.63%
TCP      32      58           0         0       11           9899      99.41%    99.41%
Other baseline details.
For Trust Score [24], we used the code provided by the authors. We added parallel processing when computing distances for each class to speed up inference. This parallelization alters neither the algorithm nor its performance. Specifically for semantic segmentation with CamVid, each image contains 172,800 pixels. Even though CamVid remains a small dataset (367 training images, 101 validation images, 233 test images) compared to other semantic-segmentation datasets, computational complexity forced us to drastically reduce the number of training neighbors and the number of test samples. We randomly sample in each training and test image a small percentage of pixels to compute a proxy.

4. https://github.com/google/TrustScore

For MC Dropout [19], we use the same model as the baseline (which already includes dropout layers) and we sample 100 times from the classification model at test time, keeping dropout layers activated. We then compute the average softmax probability over all samples to conduct Monte Carlo integration. Model uncertainty is estimated, following [19], by computing the entropy of the averaged probability vector across the class dimension.
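As a reference, the MC Dropout procedure just described amounts to a few lines of PyTorch: dropout is kept active at test time, the softmax outputs of the stochastic forward passes are averaged, and the entropy of that average is the uncertainty estimate. This is our own minimal sketch; in practice, normalization layers should be kept in evaluation mode.

```python
import torch

def mc_dropout_uncertainty(model, images, n_samples=100):
    """Predictive entropy under MC Dropout (higher = more uncertain)."""
    model.train()   # keep dropout active; batch-norm should ideally stay in eval mode
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(images), dim=1)
                             for _ in range(n_samples)]).mean(dim=0)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)
    return probs.argmax(dim=1), entropy
```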
Fig. 9: Same as Figure 7 on SVHN using a small convnet architecture. (a) MCP, (b) TCP.
                  Nb. of Errors                    Nb. of Successes
Model    < 1/K   [1/K, 1/2]   > 1/2     < 1/K   [1/K, 1/2]   > 1/2     AUPR ↑    AUROC ↑
MCP      0       329          857       0       206          24640     48.18%    93.20%
TCP      500     686          0         0       206          24640     98.93%    99.95%

Fig. 10: Same as Figure 7 on CIFAR-100 using a VGG-16 architecture. (a) MCP, (b) TCP.
                  Nb. of Errors                    Nb. of Successes
Model    < 1/K   [1/K, 1/2]   > 1/2     < 1/K   [1/K, 1/2]   > 1/2     AUPR ↑    AUROC ↑
MCP      0       603          2801      0       118          6478      71.99%    85.67%
TCP      2724    680          0         0       118          6478      99.91%    99.91%
B.2 Evaluation metrics
Failure prediction being the task of detecting samples that have been incorrectly classified by the main model (when their estimated confidence is below a chosen threshold δ), classic detection metrics can be used.

False-positive rate at 95% true-positive rate (FPR@95%TPR). This is the probability that an image predicted incorrectly by the model is wrongly seen as correctly classified by the error detector (a false positive) when the true-positive rate is as high as 95%. True- and false-positive rates are defined as TPR = TP / (TP + FN) and FPR = FP / (FP + TN), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives respectively.

Area under the receiver operating characteristic curve (AUROC). The receiver operating characteristic (ROC) curve is the graph of TPR as a function of FPR. This metric is threshold-independent. It can be interpreted as the probability that an incorrectly classified sample (a positive) receives a higher detection score than a correctly classified one.

Area under the precision-recall curve (AUPR). The precision-recall (PR) curve is the graph of the precision = TP / (TP + FP) as a function of the recall = TP / (TP + FN). In our experiments, classification errors are used as the positive detection class. As we specifically want to detect failures, AUPR is the primary metric to assess performance.
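These detection metrics can be computed directly from an array of confidence estimates and error indicators, for instance with scikit-learn as in the short sketch below (our own illustrative wrapper, not the authors' code). Note that errors form the positive class, so the negated confidence serves as the detection score.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def failure_detection_metrics(confidence, is_error):
    """AUROC, AUPR and FPR@95%TPR with errors as the positive class."""
    score = -np.asarray(confidence)            # low confidence = high error score
    is_error = np.asarray(is_error).astype(int)
    auroc = roc_auc_score(is_error, score)
    aupr = average_precision_score(is_error, score)
    fpr, tpr, _ = roc_curve(is_error, score)
    fpr_at_95tpr = fpr[np.searchsorted(tpr, 0.95)]  # first point reaching 95% TPR
    return auroc, aupr, fpr_at_95tpr
```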
Fig. 11: Same as Figure 7 on CamVid using a SegNet architecture. (a) MCP, (b) TCP.
                       Nb. of Errors                              Nb. of Successes
Model    < 1/K        [1/K, 1/2]   > 1/2         < 1/K   [1/K, 1/2]   > 1/2         AUPR ↑    AUROC ↑
MCP      0            401,573      55,506,172    0       188,128      34,166,526    48.53%    84.42%
TCP      54,184,874   1,722,871    0             0       188,128      34,166,526    99.92%    99.99%

TABLE 7: Accuracies for used classification models. Image classification accuracies on train, validation and test sets for all the models whose confidence is estimated in our experiments.

                      MNIST    MNIST          SVHN           CIFAR-10   CIFAR-100   CamVid
                      MLP      SmallConvNet   SmallConvNet   VGG-16     VGG-16      SegNet
Train accuracy        98.32%   98.94%         95.06%         98.69%     95.55%      96.69%
Validation accuracy   97.95%   99.03%         96.56%         99.80%     66.96%      91.72%
Test accuracy         98.05%   99.10%         95.44%         92.19%     65.96%      85.33%
TABLE 8: Impact of the encoder fine-tuning on the error-prediction performance of ConfidNet. Comparison in AUPR (the higher, the better) on all benchmarks. This table extends Table 3 in the main paper.

                        MNIST    MNIST          SVHN           CIFAR-10   CIFAR-100   CamVid
                        MLP      SmallConvNet   SmallConvNet   VGG-16     VGG-16      SegNet
Confidence training     57.34%   43.94%         50.43%         46.44%     72.68%      50.12%
+ Fine-tuning ConvNet   57.37%   45.89%         50.72%         49.94%     73.68%      50.51%
Area under the risk-coverage curve (AURC). In classification with a reject option, the risk-coverage curve is the graph of the empirical risk of the classifier under a given loss (usually the 0/1 loss) as a function of the empirical coverage, which is the proportion of non-rejected samples. Like AUROC and AUPR, this metric is threshold-independent.

Excess-AURC (E-AURC). This is a normalized AURC metric defined in [61]. It takes into account the optimal ranking given the error rate of the classifier. More specifically, if we denote $\kappa_f^*$ the perfect confidence-rate function and $\hat{r}$ the risk of classifier $f$, it writes as:

$$\text{E-AURC}(\kappa_f) = \text{AURC}(\kappa_f) - \text{AURC}(\kappa_f^*) \qquad (20)$$
$$\approx \text{AURC}(\kappa_f) - \left(\hat{r} + (1 - \hat{r}) \log(1 - \hat{r})\right). \qquad (21)$$

A computational sketch of both quantities is given at the end of Section B.3 below.

B.3 Classification accuracies of model F

Most neural networks used in our experiments tend to overfit. On small datasets such as MNIST and SVHN, convolutional neural networks already achieve nearly perfect accuracy on the test set, above 96%, which leaves very few errors available. We provide in Table 7 the accuracies on the training, validation and test sets of the classification model F whose confidence must be predicted.
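Following Equations (20) and (21), the sketch below computes AURC by averaging the empirical 0/1 risk over the coverage levels obtained when samples are rejected in increasing order of confidence, and subtracts the approximate optimal value to obtain E-AURC. It is an illustrative implementation under our own assumptions, not the authors' evaluation code.

```python
import numpy as np

def aurc_and_eaurc(confidence, is_error):
    """AURC and E-AURC for a confidence-rate function (0/1 loss, risk < 1)."""
    order = np.argsort(-np.asarray(confidence))     # most confident kept first
    errors = np.asarray(is_error, dtype=float)[order]
    # Risk at coverage k/n: error rate among the k most confident samples.
    risks = np.cumsum(errors) / np.arange(1, len(errors) + 1)
    aurc = risks.mean()                             # average over all coverage levels
    r = errors.mean()                               # overall risk of the classifier
    optimal_aurc = r + (1 - r) * np.log(1 - r)      # approximation from Eq. (21)
    return aurc, aurc - optimal_aurc                # (AURC, E-AURC)
```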
B.4 Effect of ConfidNet's architecture

We experimented with different architectures for ConfidNet on the SVHN dataset, varying the number of layers. Except for the first and last layers, whose dimensions depend respectively on the size of the input and of the output, each layer has the same number of units (400). In Figure 12, we observe that, starting from 3 layers, ConfidNet already improves on the baseline performance.
TABLE 9: Effect of the loss and of the confidence criterion on the error-detection performance of ConfidNet. Comparison between the proposed MSE and three other alternatives, all based on TCP as confidence criterion. Using MSE with the normalized criterion TCP^r is also reported. This table extends Table 5 in the main paper.

Dataset / Model        Loss           FPR@95%TPR ↓   AUPR ↑    AUROC ↑
SVHN / SmallConvNet    MSE            …              …         …
                       BCE            29.34%         50.00%    92.76%
                       Focal          28.67%         49.96%    93.01%
                       Ranking        31.04%         48.11%    92.90%
                       MSE w/ TCP^r   …              …         …
CIFAR-10 / VGG-16      MSE            …              …         …
                       MSE w/ TCP^r   …              …         …
CamVid / SegNet        MSE            61.52%         50.51%    85.02%
                       BCE            61.68%         48.96%    83.41%
                       Focal          61.64%         49.05%    84.09%
                       MSE w/ TCP^r   …              …         …
TABLE 10: Comparative calibration results. Performance in ECE (the lower, the better) when using the MCP baseline ('Baseline') or ConfidNet as confidence estimator on the six benchmarks, and when using dedicated temperature scaling ('T. Scaling').

              MNIST    MNIST          SVHN           CIFAR-10   CIFAR-100   CamVid
              MLP      SmallConvNet   SmallConvNet   VGG-16     VGG-16      SegNet
Baseline      0.37%    …              …              …          …           …
ConfidNet     …        …              …              …          …           …
T. Scaling    …        …              …              …          …           …
Fig. 12: Influence of ConfidNet's depth on its performance. Performance in AUPR as a function of the number of layers used in ConfidNet, on the SVHN test set, compared to the performance of the MCP and Trust Score baselines.
B.5 Effect of learning variants
In Table 8, we detail the effect of fine-tuning on all classification and segmentation datasets for the failure-prediction experiments. The influence of the loss (MSE, BCE, Focal or TCP-based Ranking) is analyzed for SVHN, CIFAR-10 and CamVid in Table 9. We also tested a normalized variant of the TCP confidence criterion, which consists in the ratio between TCP and MCP:

$$\mathrm{TCP}^r_F(\mathbf{x}, y) = \frac{F(\mathbf{x}; \hat{w})[y]}{\max_{k \in \mathcal{Y}} F(\mathbf{x}; \hat{w})[k]}. \qquad (22)$$

This criterion presents stronger theoretical guarantees than TCP, since correct predictions are, by design, assigned a confidence of 1, whereas the confidence of errors lies in [0, 1). On the other hand, learning this criterion may be more challenging since all correct predictions share a unique target confidence value. We note in Table 9 that its performance is lower than that of TCP on small datasets such as CIFAR-10, where few errors are present, but higher on larger datasets such as CamVid, where each pixel is a sample. This emphasizes once again the difficulty of incorrect/correct classification training.
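As a small illustration of Equation (22) (our own toy example, not from the paper): the normalized criterion equals 1 exactly when the true class coincides with the predicted one, and is strictly below 1 otherwise.

```python
import numpy as np

probs = np.array([0.05, 0.70, 0.25])   # softmax output F(x; w) over 3 classes
for y in range(3):                      # try each possible true label
    tcp_r = probs[y] / probs.max()      # Eq. (22): TCP normalized by MCP
    print(f"true class {y}: TCP^r = {tcp_r:.3f}")
# The predicted class (1) gets TCP^r = 1.0; the two error cases get values in [0, 1).
```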
B.6 Effect on calibration

We observed that ConfidNet tends to lower the confidence of an example that the model wrongly classified while being over-confident (high MCP). As a side experiment, we studied whether using ConfidNet as the confidence estimate can improve the calibration of deep neural networks. In Table 10, we report the expected calibration error (ECE), an approximate measure of the miscalibration between confidence and accuracy [35]. ConfidNet yields equivalent or better ECE results than the MCP baseline, with a clear superiority on complex datasets such as CIFAR-10, CIFAR-100 and CamVid. On MNIST and SVHN, the baseline already offers a small ECE. These results confirm our intuition about the capacity of ConfidNet to address over-confident predictions, even though it has not been designed for that purpose. Nevertheless, dedicated methods such as the temperature scaling used in [35] remain preferable for calibrating deep neural networks.
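For completeness, here is a minimal sketch of the standard binned ECE estimator from [35] that Table 10 reports; the number of bins and the array names are our own choices.

```python
import numpy as np

def expected_calibration_error(confidence, is_correct, n_bins=15):
    """Binned ECE: weighted mean |accuracy - confidence| over equal-width bins."""
    confidence = np.asarray(confidence)
    is_correct = np.asarray(is_correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(is_correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap        # weight by the fraction of samples in the bin
    return ece
```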
B.7 Qualitative assessment
We provide in Figure 13 an illustration of confidence-based failure prediction on CamVid. Compared to the MCP baseline, our approach produces higher confidence scores for pixels correctly classified by DeepLabv2 and lower ones for incorrectly classified pixels. This allows one to better identify error areas in semantic segmentation maps, as we leverage in ConDA.
Fig. 13: Confidence estimation in semantic segmentation. On this example from CamVid (a), the confidence at each pixel of the class prediction delivered by DeepLabv2 (c) can be estimated classically with MCP (f), or with the proposed auxiliary model (e). The second approach appears better aligned with the actual errors of the semantic segmentation (shown in white in (d)). Indeed, ConfidNet (50.51% AUPR) predicts these errors better than MCP (48.53%).
Fig. 14: Risk-coverage curves of selective image classification. On the five benchmarks, the RC curves are provided for the MCP baseline, MC Dropout, Trust Score and ConfidNet. For a given coverage (fraction of non-rejected samples), the selective risk indicates the percentage of erroneous predictions in the remaining test set. (a) MNIST (MLP), (b) MNIST (SmallConvNet), (c) SVHN, (d) CIFAR-10, (e) CIFAR-100.
B.8 Risk-coverage curves
In relation with the selective-classification experiments, we provide the risk-coverage curves for the five classification benchmarks: MLP on MNIST (Fig. 14a), SmallConvNet on MNIST (Fig. 14b), SmallConvNet on SVHN (Fig. 14c), and VGG-16 on CIFAR-10 and CIFAR-100 (Figs. 14d and 14e).

APPENDIX C
EXPERIMENTS WITH SELF-TRAINING FOR UNSUPERVISED DOMAIN ADAPTATION
C.1 Experimental details
The segmentation network is a DeepLabv2 [57] architecture, optimized by stochastic gradient descent with a learning rate of $2.5 \times 10^{-4}$, a momentum of 0.9 and a weight decay of $10^{-4}$. As in recent state-of-the-art methods [21], [49], [50], we adopt an adversarial learning approach to align source and target output distributions. The discriminator in the segmentation network is optimized by Adam [68] with a learning rate of $10^{-4}$. The hyperparameters λ_adv and λ_ST are fixed at $10^{-3}$ and 1, respectively. In Table 5 of the main paper, the CBST results [53] are drawn from the authors' most recent paper [54], where CBST serves as a baseline.

Self-training procedure. Following the implementation in BDL [21], we carried out two self-training iterations for AdvEnt in the GTA5 → Cityscapes experiments. For the other experiments in Section C.3 below, only one self-training iteration is used. The experiments start by pre-training the AdvEnt model, using the published code, on source-domain images translated into the target domain. Those translated images are pre-computed using a CycleGAN [69], as provided by [21].
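To make the selection step concrete, here is a simplified per-image sketch of confidence-based pseudo-labeling as used in such self-training rounds: pixels whose confidence (MCP, or the auxiliary model's output for ConDA) falls below a coverage-derived threshold receive an ignore label and are excluded from the next training round. The quantile-based threshold and the ignore index 255 are our own simplifications, not the exact procedure of the paper.

```python
import numpy as np

IGNORE_LABEL = 255  # convention for pixels excluded from the segmentation loss

def select_pseudo_labels(pred_labels, confidence, coverage=0.5):
    """Keep the `coverage` fraction of most confident pixels as pseudo-labels."""
    threshold = np.quantile(confidence, 1.0 - coverage)
    pseudo = pred_labels.copy()
    pseudo[confidence < threshold] = IGNORE_LABEL   # reject low-confidence pixels
    return pseudo

# Toy usage on one 4x4 "image" with 19 Cityscapes-style classes.
rng = np.random.default_rng(0)
pred = rng.integers(0, 19, size=(4, 4))             # predicted classes
conf = rng.random((4, 4))                           # MCP or ConDA confidence map
print(select_pseudo_labels(pred, conf, coverage=0.5))
```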
5. https://github.com/liyunsheng13/BDL
TABLE 11: Ablation study on semantic segmentation with pseudo-labeling-based adaptation. This table extends Table 6 in the main paper by providing IoUs for the 19 classes of the GTA5 → Cityscapes UDA benchmark, using the AdaptSegNet model.

GTA5 → Cityscapes
Method                  Multi-Scale   Adv. Training   road   sidewalk   building   wall   fence   pole   light   sign   veg    terrain   sky    person   rider   car    truck   bus    train   mbike   bike   mIoU
ConfidNet                                             88.9   44.4       84.6       34.3   26.2    32.8   39.0    32.5   84.9   33.6      82.8   59.8     31.7    80.9   30.3    45.9   0.0     30.9    40.8   47.6
Multi-Scale ConfidNet   ✓                             …
ConDA                   ✓             ✓               …
TABLE 12: Additional comparative performance on semantic segmentation with synth-to-real unsupervised domain adaptation. Results in per-class IoU and aggregated mIoU on SYNTHIA → Cityscapes ('mIoU*' is the 13-class setup, excluding the classes 'wall', 'fence' and 'pole', as used in earlier works). All methods are based on a DeepLabv2 backbone.

SYNTHIA → Cityscapes
Method             Self-Train.   road   sidewalk   building   wall   fence   pole   light   sign   veg    sky    person   rider   car    bus    mbike   bike   mIoU   mIoU*
AdaptSegNet [49]                 84.3   42.7       77.5       -      -       -      4.7     7.0    77.9   82.5   54.3     21.0    72.3   32.2   18.9    32.3   -      46.7
DISE [62]                        …
ConDA              ✓             …

C.2 Detailed ablation study
We complement the ablation results in Table 6 of the main paper by providing class-wise IoUs in Table 11.
C.3 Additional results
We extend our experiments by using another synthetic source-domain dataset, SYNTHIA [70]. More specifically, we use the SYNTHIA-RAND-CITYSCAPES split, composed of 9,400 color images of dimension 1280 × 760, generated in a simulator and annotated for semantic segmentation with 16 classes in common with Cityscapes [64]. In the experiments with this dataset, we do not use translated source images.

SYNTHIA → Cityscapes. Similar to Table 4 in the main paper, we report in Table 12 the comparative results on SYNTHIA → Cityscapes. Following the literature on this dataset, the mIoU metric is computed over 16 categories as well as over 13 categories (ignoring 'wall', 'fence' and 'pole'). Again, ConDA achieves strong performance among methods with a DeepLabv2 backbone on this benchmark.
SYNTHIA → Mapillary. Along with the previous results on Cityscapes, we further study domain adaptation with another target dataset. Mapillary Vistas [65] is a dataset of street-level images, split into a training set, a validation set and a test set. The ground-truth semantic maps are missing from the test set. For domain adaptation, we use the 18,000 images of the training set as target data and the 2,000 images of the validation set for testing. We consider 7 'super classes' that include the 16 classes used in the Cityscapes [64] experiments with SYNTHIA [70]. Table 13 presents semantic-segmentation results using SYNTHIA as the source dataset. This benchmark has also been used in other recent works, such as AdvEnt [50] and DADA [71]. ConDA outperforms the baseline methods, improving over the AdvEnt mIoU. We also tested the proposed confidence-based self-training approach on DADA [71], another domain-adaptation baseline, which uses the depth information available on source-domain synthetic scenes as privileged information during segmentation training. Again, the proposed method (ConDA*) provides a boost in performance over DADA*.

TABLE 13: Combining ConDA with the AdvEnt and DADA approaches to domain adaptation in semantic segmentation. Performance in IoU and mIoU on SYNTHIA → Mapillary. 'DADA*' and 'ConDA*' are trained using depth as privileged information.

SYNTHIA → Mapillary
Method        Self-Train.   flat   constr.   object   nature   sky    human   vehicle   mIoU
AdvEnt [50]                 86.9   58.8      …
ConDA         ✓             …
DADA* [71]                  86.7   62.1      34.9     75.9     88.6   51.1    73.8      67.6
ConDA*        ✓             …

C.4 Additional model analysis
In this section, we provide additional plots analogous to those in Figure 7 of the main paper. These graphs show the precision w.r.t. the coverage of extracted pseudo-labels, using MCP and ConDA respectively, in the following set-ups:
• AdvEnt on SYNTHIA → Cityscapes (Figure 15);
• AdvEnt on SYNTHIA → Mapillary Vistas (Figure 16);
• DADA on SYNTHIA → Mapillary Vistas (Figure 17).
Fig. 15: Comparative quality of selected pseudo-labels. Proportion of correct pseudo-labels (precision) for different coverages on SYNTHIA → Cityscapes for AdvEnt.

Fig. 16: Same as Fig. 15 for AdvEnt on SYNTHIA → Mapillary.

Fig. 17: Same as Fig. 15 for DADA on SYNTHIA → Mapillary.

In all cases, we observe that ConDA collects fewer erroneous pseudo-labels than MCP.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[2] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NIPS, 2015.
[3] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in ECCV, 2016.
[4] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in CVPR, 2016.
[5] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," CoRR, 2013.
[6] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in INTERSPEECH, 2010.
[7] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., 2012.
[8] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, "Deep speech: Scaling up end-to-end speech recognition," 2014.
[9] D. Amodei, C. Olah, J. Steinhardt, P. F. Christiano, J. Schulman, and D. Mané, "Concrete problems in AI safety," CoRR, 2016.
[10] J. Janai, F. Güney, A. Behl, and A. Geiger, "Computer vision for autonomous vehicles: Problems, datasets and state-of-the-art," Foundations and Trends in Computer Graphics and Vision, 2017.
[11] J. Nam, S. Park, E. J. Hwang, J. Lee, K.-N. Jin, K. Lim, T. Vu, J. Sohn, S. Hwang, J. M. Goo, and C. M. Park, "Development and validation of deep learning-based automatic detection algorithm for malignant pulmonary nodules on chest radiographs," Radiology, 2018.
[12] O. Linda, T. Vollmer, and M. Manic, "Neural network based intrusion detection system for critical infrastructures," in IJCNN, 2009.
[13] C. Chow, "An optimum character recognition system using decision functions," IRE Trans. Electron. Comput., 1957.
[14] P. L. Bartlett and M. H. Wegkamp, "Classification with a reject option using a hinge loss," JMLR, 2008.
[15] C. Cortes, G. DeSalvo, and M. Mohri, "Boosting with abstention," in NIPS, 2016.
[16] R. El-Yaniv and Y. Wiener, "On the foundations of noise-free selective classification," JMLR, 2010.
[17] Y. Geifman and R. El-Yaniv, "Selective classification for deep neural networks," in NIPS, 2017.
[18] Y. Gal, R. Islam, and Z. Ghahramani, "Deep Bayesian active learning with image data," in Proceedings of Machine Learning Research, 2017.
[19] Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," in ICML, 2016.
[20] D.-H. Lee, "Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks," ICML (Workshop), 2013.
[21] Y. Li, L. Yuan, and N. Vasconcelos, "Bidirectional learning for domain adaptation of semantic segmentation," in CVPR, 2019.
[22] S. Hecker, D. Dai, and L. V. Gool, "Failure prediction for autonomous driving," in IV, 2018.
[23] D. Hendrycks and K. Gimpel, "A baseline for detecting misclassified and out-of-distribution examples in neural networks," ICLR, 2017.
[24] H. Jiang, B. Kim, M. Guan, and M. Gupta, "To trust or not to trust a classifier," in NIPS, 2018.
[25] A. Kendall, V. Badrinarayanan, and R. Cipolla, "Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding," arXiv preprint arXiv:1511.02680, 2015.
[26] A. Ragni, Q. Li, M. J. F. Gales, and Y. Wang, "Confidence estimation and deletion prediction using bidirectional recurrent neural networks," in SLT Workshop, 2018.
[27] Q. Li, P. Ness, A. Ragni, and M. Gales, "Bi-directional lattice recurrent neural networks for confidence estimation," in ICASSP, 2018.
[28] D. Yu, J. Li, and L. Deng, "Calibration of confidence measures in speech recognition," IEEE Trans. Audio Speech Lang. Process., 2011.
[29] J. Blatz, E. Fitzgerald, G. Foster, S. Gandrabur, C. Goutte, A. Kulesza, A. Sanchis, and N. Ueffing, "Confidence estimation for machine translation," in COLING, 2004.
[30] C. Corbière, N. Thome, A. Bar-Hen, M. Cord, and P. Pérez, "Addressing failure prediction by learning model confidence," in NeurIPS, 2019.
[31] C. Cortes, G. DeSalvo, and M. Mohri, "Learning with rejection," in ALT, 2016.
[32] H. Zaragoza and F. d'Alché-Buc, "Confidence measures for neural network classifiers," in IPMU, 1998.
[33] Y. Geifman and R. El-Yaniv, "SelectiveNet: A deep neural network with an integrated reject option," in ICML, 2019.
[34] A. M. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in CVPR, 2015.
[35] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On calibration of modern neural networks," in ICML, 2017.
[36] L. Neumann, A. Zisserman, and A. Vedaldi, "Relaxed softmax: Efficient confidence auto-calibration for safe pedestrian detection," in NIPS (Workshop), 2018.
[37] I. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in ICLR, 2015.
[38] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," in ICLR, 2014.
[39] S. Liang, Y. Li, and R. Srikant, "Enhancing the reliability of out-of-distribution image detection in neural networks," in ICLR, 2018.
[40] B. Lakshminarayanan, A. Pritzel, and C. Blundell, "Simple and scalable predictive uncertainty estimation using deep ensembles," in NIPS, 2017.
[41] R. M. Neal, Bayesian Learning for Neural Networks. Springer, 1996.
[42] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, "Weight uncertainty in neural networks," in ICML, 2015.
[43] W. J. Maddox, P. Izmailov, T. Garipov, D. P. Vetrov, and A. G. Wilson, "A simple baseline for Bayesian uncertainty in deep learning," in NeurIPS, 2019.
[44] J. M. Hernandez-Lobato and R. Adams, "Probabilistic backpropagation for scalable learning of Bayesian neural networks," in ICML, 2015.
[45] L. P. Cordella, C. De Stefano, F. Tortorella, and M. Vento, "A method for improving classification reliability of multilayer perceptrons," IEEE Transactions on Neural Networks, 1995.
[46] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is "nearest neighbor" meaningful?" in ICDT, 1999.
[47] T. DeVries and G. W. Taylor, "Learning confidence for out-of-distribution detection in neural networks," arXiv preprint arXiv:1802.04865, 2018.
[48] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell, "CyCADA: Cycle-consistent adversarial domain adaptation," in ICML, 2018.
[49] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker, "Learning to adapt structured output space for semantic segmentation," in CVPR, 2018.
[50] T.-H. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez, "ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation," in CVPR, 2019.
[51] Y. Grandvalet and Y. Bengio, "Semi-supervised learning by entropy minimization," in NIPS, 2005.
[52] A. Saporta, T.-H. Vu, M. Cord, and P. Pérez, "ESL: Entropy-guided self-supervised learning for domain adaptation in semantic segmentation," 2020.
[53] Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang, "Unsupervised domain adaptation for semantic segmentation via class-balanced self-training," in ECCV, 2018.
[54] Y. Zou, Z. Yu, X. Liu, B. V. Kumar, and J. Wang, "Confidence regularized self-training," in ICCV, 2019.
[55] P. Mohapatra, M. Rolínek, C. Jawahar, V. Kolmogorov, and M. Pawan Kumar, "Efficient optimization for rank-based loss functions," in CVPR, 2018.
[56] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in ICCV, 2017.
[57] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," CoRR, 2016.
[58] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," in Proceedings of the IEEE, 1998.
[59] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," in NIPS (Workshop), 2011.
[60] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Master's thesis, Department of Computer Science, University of Toronto, 2009.
[61] Y. Geifman, G. Uziel, and R. El-Yaniv, "Bias-reduced uncertainty estimation for deep neural classifiers," in ICLR, 2019.
[62] W.-L. Chang, H.-P. Wang, W.-H. Peng, and W.-C. Chiu, "All about structure: Adapting structural information across domains for boosting semantic segmentation," in CVPR, 2019.
[63] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, "Playing for data: Ground truth from computer games," in ECCV, 2016.
[64] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in CVPR, 2016.
[65] G. Neuhold, T. Ollmann, S. Rota Bulò, and P. Kontschieder, "The Mapillary Vistas dataset for semantic understanding of street scenes," in ICCV, 2017.
[66] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in NIPS (Workshop), 2017.
[67] G. J. Brostow, J. Fauqueur, and R. Cipolla, "Semantic object classes in video: A high-definition ground truth database," Pattern Recogn. Lett., 2009.
[68] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[69] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in ICCV, 2017.
[70] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, "The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes," in CVPR, 2016.
[71] T.-H. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez, "DADA: Depth-aware domain adaptation in semantic segmentation," in ICCV, 2019.
Charles Corbière is a Ph.D. student in Deep Learning and Computer Vision for Autonomous Driving at the Conservatoire National des Arts et Métiers (CNAM, France) and the valeo.ai research lab (France). He received an M.Sc. degree in Applied Mathematics and Statistics from Université Paris-Saclay (France) in 2017 and an M.Eng. degree in Computer Science from École Centrale de Lille (France) in 2016. His research interests include uncertainty in deep learning and certified robustness.
Nicolas Thome is a full professor at the Conservatoire National des Arts et Métiers (Cnam Paris). His research interests include machine learning and deep learning for understanding low-level signals, e.g., vision, time series and acoustics. He also explores solutions for combining low-level data with higher-level data for multi-modal data processing. His current application domains are mainly healthcare, autonomous driving and physics. He is involved in several French, European and international collaborative research projects on artificial intelligence and deep learning.
Antoine Saporta is a Ph.D. student in Deep Learning and Computer Vision for Autonomous Driving in the Machine Learning and Deep Learning for Information Access (MLIA) team of LIP6, Sorbonne Université (France), and at the Valeo.ai research lab. He is a graduate of École Polytechnique (France) and received a Master's degree in Computer Science from Technische Universität München (Germany) in 2019. His research interests include domain adaptation and semantic segmentation.
Tuan-Hung Vu is a research scientist at Valeo.ai. He received a Ph.D. degree in Computer Science from École Normale Supérieure in 2018. His research interests include deep learning, object recognition, domain adaptation and, more recently, data augmentation. Tuan-Hung has published at, and regularly serves as a reviewer for, computer vision conferences and journals such as CVPR, ICCV, ECCV and IJCV.
Matthieu Cord is a full professor at Sorbonne University. He is also a part-time principal scientist at Valeo.ai. His research expertise includes computer vision, machine learning and artificial intelligence. He is the author of more than 150 publications on image classification, segmentation, deep learning, and multimodal vision-and-language understanding. He is an honorary member of the Institut Universitaire de France and served from 2015 to 2018 as an AI expert at CNRS and the ANR (French National Research Agency).