Entropy Maximization and Meta Classification for Out-Of-Distribution Detection in Semantic Segmentation
Robin Chan, Matthias Rottmann and Hanno Gottschalk
University of Wuppertal, Germany
Faculty of Mathematics and Natural Sciences
{rchan, rottmann, hgottsch}@uni-wuppertal.de

Abstract
Deep neural networks (DNNs) for the semantic segmentation of images are usually trained to operate on a predefined closed set of object classes. This is in contrast to the "open world" setting which DNNs are envisioned to be deployed to. From a functional safety point of view, the ability to detect so-called "out-of-distribution" (OoD) samples, i.e., objects outside of a DNN's semantic space, is crucial for many applications such as automated driving. A natural baseline approach to OoD detection is to threshold on the pixel-wise softmax entropy. We present a two-step procedure that significantly improves that approach. Firstly, we utilize samples from the COCO dataset as OoD proxy and introduce a second training objective to maximize the softmax entropy on these samples. Starting from pretrained semantic segmentation networks, we re-train a number of DNNs on different in-distribution datasets and consistently observe improved OoD detection performance when evaluating on completely disjoint OoD datasets. Secondly, we perform a transparent post-processing step to discard false positive OoD samples by so-called "meta classification". To this end, we apply linear models to a set of hand-crafted metrics derived from the DNN's softmax probabilities. In our experiments we consistently observe a clear additional gain in OoD detection performance, cutting down the number of detection errors by 52% when comparing the best baseline with our results. We achieve this improvement while sacrificing only marginally in original segmentation performance. Therefore, our method contributes to safer DNNs with more reliable overall system performance.
1. Introduction
In recent years, spectacular advances in the computer vision task of semantic segmentation have been achieved by deep learning [43, 46]. Deep convolutional neural networks (CNNs) are envisioned to be deployed to real-world applications, where they are likely to be exposed to data that is substantially different from the model's training data.
[Figure 1 panels: baseline segmentation mask, baseline entropy heatmap; our segmentation mask, our entropy heatmap]
Figure 1: Comparison of segmentation mask and softmax entropy before our OoD training (top row) and after (bottom row). While there are minor differences in the segmentation masks, the annotated unknown object (marked with yellow lines) becomes clearly recognizable in the entropy heatmap due to our OoD training. In the heatmap, high values are red.

We consider data samples that are not included in the set of a model's semantic space as out-of-distribution (OoD) samples. State-of-the-art neural networks for semantic segmentation, however, are trained to recognize a predefined closed set of object classes [11, 29], e.g., for usage in the environment perception systems of autonomous vehicles [22]. In open-world settings there are countless possibly occurring objects. Defining additional classes requires a large amount of annotated data (cf. [10, 47]) and may even lead to performance drops [13]. One natural approach is to introduce a none-of-the-known output for objects not belonging to any of the predefined classes [45]. In other words, one uses a set of object classes that is sufficient for most scenarios and covers all OoD objects by enforcing a specific model output for such samples. This additional output can be implemented by introducing an additional class or by setting a threshold on the softmax entropy or any other dispersion or uncertainty measure. From a functional safety point of view, it is a crucial but yet missing prerequisite that neural networks are capable of reliably indicating when they are operating out of their proper domain, i.e., detecting OoD objects, in order to initiate a fallback policy.

As images from everyday scenes usually contain many different objects, of which only some could be out-of-distribution, knowing the location where an OoD object occurs is desired for practical applications. Therefore, we address the problem of detecting anomalous regions in an image, which is the case if an OoD object is present (see figure 1) and which is a research area of high interest [5, 18, 30, 39]. This so-called anomaly segmentation [4, 18] can be pursued, for instance, by incorporating sophisticated uncertainty estimates [2, 16] or by adding an extra class to the model's learnable set of classes [45].

In this work, we detect OoD objects in semantic segmentation with a different approach, which is composed of two steps: As a first step, we re-train the segmentation network to predict class labels with low confidence scores on OoD inputs by enforcing the segmentation model to output high prediction uncertainty. In order to quantify uncertainty, we compute the softmax entropy, which is maximized when a model outputs uniform probability scores over all classes [27]. By deliberately including annotated OoD objects as known unknowns into the training process and employing a modified multi-objective loss function, we observe that the semantic segmentation network generalizes the learnt uncertainty to unseen OoD samples (unknown unknowns) without significantly sacrificing original performance on the primary task, see figure 1.

The primal model for semantic segmentation is trained on the Cityscapes data [11]. As proxy for OoD samples we randomly pick images from the COCO dataset [29], excluding the ones with instances that are also available in Cityscapes, cf. [17, 20, 34] for a related approach in image classification.
We evaluate the pixel-wise OoD detection performance via entropy thresholding for OoD samples from the LostAndFound [39] and Fishyscapes [5] datasets, respectively. Both datasets share the same setup as Cityscapes but include OoD road obstacles.

The second step incorporates a meta classifier flagging incorrect class predictions at segment level, similar to what is proposed in [31, 41, 42] for the detection of false positive instances in semantic segmentation. After increasing the sensitivity towards predicting OoD objects, we aim at removing false predictions which are produced due to the preceding entropy boost (cf. [8]). The removal of false positive OoD object predictions is based on aggregated dispersion measures and geometry features within segments (connected components of pixels), with all information derived solely from the neural network's softmax output. As meta classifier we employ a simple linear model, allowing us to track and understand the impact of each metric.

To sum up our contributions, we show that only a little modification of training is required to make semantic segmentation networks much more sensitive to the detection of OoD samples. Re-training segmentation networks with a specific choice of OoD images from COCO [29] clearly outperforms the natural baseline approach of plain softmax entropy thresholding [19] by up to 73 percent points in average precision. In addition, we are the first to demonstrate that entropy based OoD object predictions in semantic segmentation can be meta classified reliably, i.e., classified as to whether one considered OoD prediction is true positive or false positive, without access to the ground truth. For this task we employ simple logistic regression. Combining entropy maximization and meta classification therefore is an efficient and yet lightweight method, particularly suitable as an integrated monitoring system for safety-critical real-world applications based on deep learning.
2. Related Work
Methods from prior works have already proven their efficiency in identifying OoD input for image data. The proposed methods are either modifications of the training procedure [17, 20, 27, 28, 34] or post-processing techniques adjusting the estimated confidence [14, 19, 27]. However, most of these works treat entire images as out-of-distribution.

When considering the semantic space to be fixed, anomaly segmentation, i.e., treating pixels as OoD, is necessarily based on estimates of uncertainty for neural networks. Early approaches to uncertainty estimation involve Bayesian neural networks (BNNs), yielding posterior distributions over the model's weight parameters [32, 37]. In practice, approximations such as Monte-Carlo dropout [16] or stochastic batch normalization [2] are mainly used due to their cheaper computational cost. Frameworks using dropout for uncertainty estimation applied to semantic segmentation have been developed in [3, 24]. Other approaches to model uncertainty consist of using an ensemble of neural networks [26], which captures model uncertainty by averaging predictions over multiple models, and density estimation [5, 9, 36, 40] via estimating the likelihood of samples with respect to the training distribution. Methods for OoD detection in semantic segmentation based on label-prediction (or classification) uncertainty have been analyzed in [6, 21, 23, 30, 33, 38].

Using BNNs for estimating uncertainty in deep neural networks is associated with prohibitive computational cost. Uncertainty estimates that are generated by multiple models or by multiple forward passes are still computationally expensive compared to single-inference based ones. In our approach, we unite semantic segmentation and OoD detection in one model without any modifications of the underlying network's architecture. Therefore, our re-training approach can even be combined with existing OoD detection techniques and potentially enhance their efficiency.

Works with training approaches similar to ours use a different OoD proxy and are presented in [5, 23]. They train neural networks on the unlabeled objects in Cityscapes as OoD approximation. The training process includes only one single dataset, but in our experiments we observe that the unlabeled data lacks diversity and therefore tends to be too dataset specific. With respect to other OoD datasets, such as LostAndFound and Fishyscapes on which we perform our experiments, we observe in our tests that these methods fail to generalize. Furthermore, in contrast to those works, we incorporate a post-processing step that significantly boosts the OoD detection performance.

Another line of work detects OoD samples in semantic segmentation by incorporating autoencoders [1, 4, 12, 30]. Training such a model only on specific samples from a closed set of classes, it is assumed that the autoencoder performs less accurately when fed with samples from never-seen-before classes. The identification of an OoD sample then relies on the reconstruction quality. In this way, no OoD data is required, except for further adjusting the sensitivity of the method.

Autoencoders are in fact deep neural networks themselves. For the goal of safe real-time semantic segmentation, e.g., necessary for automated driving [22], more lightweight approaches are favorable. We avoid incorporating deep auxiliary models altogether and only employ a lightweight linear model instead. Furthermore, usually the more complex a model, the greater the lack of interpretability.
As monitoring systems are supposed to make deep learning models safer, one seeks simpler and thereby more explainable approaches. We post-process our entropy boosted semantic segmentation network output via logistic regression, whose computational overhead is negligible. This linear model is transparent, as it allows us to analyze the impact of each single feature fed into the model, and in our experiments it proves to efficiently reduce the number of OoD detection errors.
3. Entropy based OoD Detection
In this section, we present our training strategy to improve the detection of OoD pixels in semantic segmentation via spatial entropy heatmaps.
Let $f(x) \in (0,1)^q$ be the softmax probabilities after processing the input image $x \in \mathcal{X}$ with some machine learning model $f : \mathcal{X} \to (0,1)^q$, and let $q \in \mathbb{N}$ be the number of classes. For the sake of brevity, we omit the consideration of image pixels in this section. We compute the softmax entropy via

    $E(f(x)) = - \sum_{j=1}^{q} f_j(x) \log(f_j(x))$ .    (1)

By $(x, y(x)) \sim \mathcal{D}_{in}$ we denote an "in-distribution" sample with $y(x) \in \{1, \ldots, q\}$ being its corresponding ground truth class label; by $x' \sim \mathcal{D}_{out}$ we denote an "out-distribution" sample for which there is no label given. We aim at minimizing the overall objective

    $L := (1 - \lambda)\, \mathbb{E}_{(x,y) \sim \mathcal{D}_{in}} [\ell_{in}(f(x), y(x))] + \lambda\, \mathbb{E}_{x' \sim \mathcal{D}_{out}} [\ell_{out}(f(x'))]$ , $\lambda \in [0,1]$ ,    (2)

where

    $\ell_{in}(f(x), y(x)) := - \sum_{j=1}^{q} \mathbb{1}_{j = y(x)} \log(f_j(x))$    (3)

and

    $\ell_{out}(f(x')) := - \sum_{j=1}^{q} \frac{1}{q} \log(f_j(x'))$    (4)

with the indicator function $\mathbb{1}_{j = y(x)} \in \{0, 1\}$ being equal to one if $j = y(x)$ and zero else. In other words, for in-distribution samples we apply the commonly used empirical cross entropy loss, i.e., the negative log-likelihood of the target class. For out-distribution samples, we consider the negative log-likelihood of each class, weighted inversely proportionally to the number of classes.

By that choice of out-distribution loss function, minimizing $\ell_{out}(f(x'))$ is equivalent to maximizing the softmax entropy $E(f(x'))$, see equation (1). Since the softmax definition implies $f_j(x) \in (0,1)$ and $\sum_{j=1}^{q} f_j(x) = 1$, Jensen's inequality applied to the convex function $-\log(\cdot)$ yields

    $\ell_{out}(f(x)) \geq - \log\Big( \sum_{j=1}^{q} \frac{1}{q} f_j(x) \Big) = \log(q)$    (5)

and applied to the concave function $\log(\cdot)$

    $E(f(x)) \leq \log\Big( \sum_{j=1}^{q} f_j(x) \frac{1}{f_j(x)} \Big) = \log(q)$ ,    (6)

with equality (for both inequalities (5) and (6), respectively) if $f_j(x) = 1/q$ for all $j = 1, \ldots, q$, i.e., if the softmax probabilities are uniformly distributed over all classes.

In order to control the impact of each single objective on the overall objective $L$, the convex combination of the expected in-distribution loss and the expected out-distribution loss in equation (2) can be adjusted by varying the parameter $\lambda$.

[Figure 2 panels: entropy heatmap w/o OoD training, OoD prediction w/o OoD training; entropy heatmap w/ OoD training, OoD prediction w/ OoD training]

Figure 2: Comparison of softmax entropy heatmap and OoD prediction mask without our OoD training (top row) and with it (bottom row). The yellow lines in the entropy heatmaps mark the annotation of the OoD object. The OoD object prediction is obtained by simply thresholding on the entropy heatmap, yielding the red pixels in the OoD prediction masks.

The softmax probabilities output by neural networks for semantic segmentation, $f(x) \in (0,1)^{|\mathcal{H}| \times |\mathcal{W}| \times q}$ with $x \in \mathcal{X} \subseteq [0,1]^{|\mathcal{H}| \times |\mathcal{W}| \times 3}$, can be viewed as pixel-wise probability distributions that express how likely each potential class affiliation $j = 1, \ldots, q$ of a given pixel $z \in \mathcal{H} \times \mathcal{W}$ is, according to the model $f$. Let $f_z(x) \in (0,1)^q$ denote the softmax output at pixel location $z$, which we implicitly considered throughout the preceding paragraphs. In semantic segmentation, one minimizes the pixel-wise classification loss averaged over the image, cf. equation (2).
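To make the training objective concrete, the following is a minimal PyTorch sketch of equations (2)-(4), applied pixel-wise as just described. The function names are ours, and details such as the reduction over pixels and batches or the default value of lambda may differ from the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def loss_in(logits, targets):
    # Equation (3): empirical cross entropy, i.e., the negative
    # log-likelihood of the target class.
    # logits: (N, q, H, W), targets: (N, H, W) with labels in {0, ..., q-1}
    return F.cross_entropy(logits, targets)

def loss_out(logits):
    # Equation (4): negative log-likelihood of every class, each weighted
    # by 1/q; minimizing this maximizes the softmax entropy, cf. (5), (6).
    log_probs = F.log_softmax(logits, dim=1)  # (N, q, H, W)
    return -log_probs.mean(dim=1).mean()      # mean over classes = sum of (1/q) log f_j

def overall_objective(logits_in, targets_in, logits_out, lam=0.5):
    # Equation (2): convex combination of both expected losses,
    # balanced by the hyperparameter lambda (placeholder default).
    return (1.0 - lam) * loss_in(logits_in, targets_in) + lam * loss_out(logits_out)
```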
For the sake of simplicity, in the following we consider the normalized entropy $\bar{E}(f_z(x))$ at pixel location $z$, which we obtain by dividing $E(f_z(x))$ by $\log(q)$. A pixel is then assumed to be out-of-distribution if the normalized entropy $\bar{E}(f_z(x))$ at that pixel location $z$ is greater than a chosen threshold $t \in [0,1]$, i.e., $z$ is predicted to be OoD if

    $z \in \hat{Z}_{out}(x) := \{ z' \in \mathcal{H} \times \mathcal{W} : \bar{E}(f_{z'}(x)) \geq t \}$ .    (7)

A connected component $k \in \hat{\mathcal{K}}(x) \subseteq \mathcal{P}(\hat{Z}_{out}(x))$ (the latter being the power set of $\hat{Z}_{out}(x)$), consisting of neighboring pixels fulfilling the condition in equation (7), gives us an OoD segment / object prediction. An illustration can be viewed in figure 2. Obviously, the better in-distribution pixels can be separated from out-distribution pixels by means of the entropy, the more accurate the OoD object prediction will be.
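In practice, equation (7) and the subsequent grouping into connected components can be implemented in a few lines. Below is a numpy/scipy sketch; the names are ours, and the choice of 4-connectivity (scipy's default) is an assumption.

```python
import numpy as np
from scipy.ndimage import label

def ood_prediction(softmax_probs, t):
    """softmax_probs: (q, H, W) array of pixel-wise class probabilities."""
    q = softmax_probs.shape[0]
    # Normalized entropy: equation (1) divided by its maximum value log(q).
    ebar = -(softmax_probs * np.log(softmax_probs)).sum(axis=0) / np.log(q)
    z_out = ebar >= t                       # equation (7)
    # Neighboring OoD pixels form the OoD segment / object predictions.
    segments, num_segments = label(z_out)   # 4-connectivity by default
    return ebar, segments, num_segments
```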
4. Meta Classifier in Semantic Segmentation
By training the segmentation network to output uniform confidence scores on OoD inputs as presented in section 3, we increase the sensitivity towards predicting OoD objects, aiming for an "entropy boost" on OoD samples. However, it is not guaranteed that only OoD samples have high entropy. Therefore, detecting OoD samples via entropy boosting potentially comes along with a considerable number of false OoD predictions, resulting in an unfavorable trade-off. In this context, we consider one entire OoD object prediction (see section 3.2) as true positive if its intersection over union
(IoU, [15]) with a ground truth OoD object is greater than zero. More formally, let $Z_{out}(x)$ be the set of pixel locations in $x$ which are labeled OoD according to the ground truth. Then $k \in \hat{\mathcal{K}}(x)$ is true positive (TP) if

    $\mathrm{IoU}(k, Z_{out}(x)) > 0 \;\Leftrightarrow\; \exists\, z \in k : \bar{E}(f_z(x)) \geq t \,\wedge\, z \in Z_{out}(x)$ .    (8)

In [8] it has been demonstrated that false positives due to increased prediction sensitivity can be removed based on a meta classifier's decision, achieving improved trade-offs between error rates. This meta classifier is essentially a binary classification model added on top of an underlying segmentation network [31, 41, 42]. We construct hand-crafted metrics per connected component of pixels by aggregating different pixel-wise uncertainty measures derived from the softmax probabilities, one of which is the entropy. The entropy metric has proven to be highly correlated with the segment-wise IoU and therefore contributes greatly to the meta classifier's performance, cf. [41]. Different from existing approaches, which consider neighboring pixels sharing the same class label as a segment, we generate metrics per segment above the given entropy threshold t. Given the importance of entropy for meta classifiers, in combination with entropy based segment generation, we expect the learned entropy maximization on OoD objects to boost the meta classification performance.

Given the softmax output, we include further pixel-wise dispersion measures, such as the probability margin, based on the difference between the highest and second highest softmax probability, and the variation ratio, based on the maximum softmax probability. They have all proven their efficiency in terms of meta classification performance [8, 31, 42]. Moreover, we also consider geometry features, such as the segment's size or its ratio between interior and boundary [41]. These metrics serve as inputs for the auxiliary meta model that classifies OoD object predictions into true positive and false positive (FP), i.e., classifying $k \in \hat{\mathcal{K}}(x)$ into the classes / sets

    $C_{TP} := \{ k' \in \hat{\mathcal{K}}(x) : \mathrm{IoU}(k', Z_{out}(x)) > 0 \}$ and $C_{FP} := \{ k' \in \hat{\mathcal{K}}(x) : \mathrm{IoU}(k', Z_{out}(x)) = 0 \}$ .    (9)

[Figure 3 panels: violin plots of entropy over in-distribution vs. out-distribution pixels, for the baseline model and the model after OoD training (epoch 4); (a) LostAndFound, (b) Fishyscapes]

Figure 3: Relative pixel frequencies of LostAndFound (a) and Fishyscapes (b) OoD pixels, respectively, at different entropy values for the baseline model, i.e., before OoD training (a & b, left), and after OoD training (a & b, right). The red lines indicate the thresholds of highest accuracy. See appendix A for more details and appendix E for a visualization.

The outlined hand-crafted metrics form a structured dataset of features, where the rows correspond to predicted segments and the columns to metrics, see also appendix B.
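As a rough illustration of this meta classification step, the sketch below fits a logistic regression on such a structured dataset of segment-wise metrics in order to separate the two sets of equation (9). The feature matrix and labels here are random placeholders, and all names are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
mu = rng.random((500, 46))        # one row per predicted OoD segment (appendix B)
is_tp = rng.integers(0, 2, 500)   # 1 if IoU with ground truth OoD > 0, else 0

meta_clf = LogisticRegression(max_iter=1000).fit(mu, is_tp)
tp_score = meta_clf.predict_proba(mu)[:, 1]   # per-segment TP probability

# The linear coefficients remain inspectable, so the influence of every
# single metric on the TP/FP decision can be tracked and understood.
print(meta_clf.coef_.shape)   # (1, 46)
```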
5. Setup of Experiments
Semantic segmentation is one of the basic components in environment perception systems of autonomous vehicles [22]. We therefore consider the semantic segmentation of the Cityscapes dataset [11] as the original task, i.e., we consider Cityscapes as in-distribution data D_in. The (standard) training split consists of 2,975 pixel-annotated urban street scene images. As the original model, we use the state-of-the-art semantic segmentation DeepLabv3+ model with a WideResNet38 backbone trained by Nvidia [46]. This model is initialized with publicly available weights and serves as our baseline model. For testing, we evaluate the OoD detection performance on two datasets comprising street scene images and unexpected obstacles. We consider images from the LostAndFound test split [39], containing 1,203 images with annotations of small obstacles and of the road in front of the (ego-)car, and Fishyscapes Static [5], containing 30 images with annotated anomalous objects extracted from Pascal VOC [15] which are then overlaid into Cityscapes images. Both datasets share the same setup as Cityscapes but include small road obstacles.

In order to perform the OoD training as proposed in section 3.1, we approximate the out-distribution via images from the COCO [29] dataset. This dataset contains images of everyday objects captured in everyday scenes. We only consider COCO images with instances that are not included in Cityscapes (no persons, no cars, no traffic lights, ...) and images that have a minimum height and width of at least 480 pixels. After filtering, 1,489 images remain, serving as our proxy for D_out (see appendix C for experiments with another OoD proxy).

We finetune the DeepLabv3+ model with the loss functions according to equation (3) and equation (4). As training data we randomly sample 297 images from our COCO subset per epoch and mix them into all 2,975 Cityscapes training images (1:10 ratio of out-distribution to in-distribution images). We train the model's weight parameters on random crops of size 480 pixels for 4 epochs in total and set the (out-distribution) loss weight λ accordingly (see equation (2)). As optimizer we use Adam [25].
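A minimal sketch of this per-epoch data mixing is given below. The dataset objects and names are assumptions; the authors' actual implementation is available in their repository.

```python
import random
from torch.utils.data import ConcatDataset, Subset

def mixed_epoch_dataset(cityscapes_train, coco_ood, ratio=10, seed=None):
    # Randomly sample len(cityscapes) / ratio OoD proxy images per epoch,
    # e.g., 297 COCO images mixed into 2,975 Cityscapes images (1:10).
    rng = random.Random(seed)
    n_ood = len(cityscapes_train) // ratio
    idx = rng.sample(range(len(coco_ood)), n_ood)
    return ConcatDataset([cityscapes_train, Subset(coco_ood, idx)])
```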
6. Pixel-wise Evaluation
Based on the softmax probabilities, we compute the normalized entropy Ē for all pixels in the respective test dataset. This gives us a per-pixel anomaly / OoD score which we compare with the ground truth anomaly segmentation. For the sake of clarity, in this section we refer to in-distribution pixels as samples of the negative class and to out-distribution pixels as samples of the positive class. We emphasize that none of the OoD objects in the test data have been seen during our OoD training, since we use separate datasets for training and testing, with different objects corresponding to completely disjoint semantic class labels.

On the basis of the violin plots in figure 3, one already notices the improved separability of in-distribution and out-distribution pixels, as large masses of the distributions corresponding to the respective classes can be well separated via multiple entropy thresholds. One also notices that our OoD training is beneficial with respect to separability. This effect can be further quantified with the aid of receiver operating characteristic (ROC) curves, see figure 4 (a) & (b), left. The area under the curve (AUC) of the ROC curve (AUROC) then represents the degree of separability: the higher the AUC, the better the separability.
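Since every pixel carries a scalar OoD score and a binary label, this evaluation reduces to standard binary-classification metrics. A short scikit-learn sketch with placeholder arrays (the names are ours); the precision recall curve used later in this section is summarized analogously.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
ood_labels = rng.integers(0, 2, 10_000)   # 1 = out-distribution (positive class)
ebar_scores = rng.random(10_000)          # normalized entropy per pixel

auroc = roc_auc_score(ood_labels, ebar_scores)
# average_precision_score summarizes the precision recall curve (AUPRC).
auprc = average_precision_score(ood_labels, ebar_scores)
```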
[Figure 4 panels: (a) LostAndFound, left: ROC curves with AUC = 0.93 (baseline) and AUC = 0.98 (ours); right: PR curves with AUC = 0.46 (baseline) and AUC = 0.76 (ours). (b) Fishyscapes, left: ROC curves with AUC = 0.94 (baseline) and AUC = 0.99 (ours); right: PR curves with AUC = 0.28 (baseline) and AUC = 0.81 (ours). Legend: baseline, OoD training after epoch 4.]
Figure 4: Detection ability for LostAndFound (a) and Fishyscapes (b) OoD pixels, respectively, evaluated by means of receiver operating characteristic curves (a & b, left) and precision recall curves (a & b, right). The red lines indicate the performance according to random guessing.

By comparing the ROC curves for LostAndFound (figure 4 (a)), we observe that there is a performance gain over the baseline model when OoD training is applied. The baseline curve indicates that the corresponding model has a lower true positive rate across various fixed false positive rates, i.e., our model after OoD training assigns higher uncertainty / entropy values to OoD samples, which is beneficial for OoD detection. Although the AUC of the baseline model is already decent at 0.93, we outperform this AUC significantly with a value of 0.98, which is 5/7 of what we could maximally gain due to our OoD training.

We observe the same effect for Fishyscapes. From the violins already, the discrimination performance seems close to perfect due to our OoD training. This is confirmed by means of the corresponding ROC curve, as the AUC score has increased up to 0.99. In comparison to the baseline model, there is a gain of 5 percent points, which makes up a considerable 5/6 of the possible performance gain.

As the AUROC essentially measures the overlap of the distributions corresponding to negative and positive samples, this score does not place more emphasis on one class over the other in case of class imbalance. In both our test datasets, there is a considerably strong class imbalance, as OoD pixels make up only a small fraction of all pixels in LostAndFound and Fishyscapes, respectively. Therefore, we additionally measure the separability by means of precision recall curves (PRC), see figure 4 (a) & (b), right, thus ignoring true negatives and emphasizing the detection of the positive class / OoD samples. Now the AUC of the PRC (AUPRC) serves as measure of separability.

For LostAndFound as well as for Fishyscapes objects, the re-trained model is superior to the baseline model in terms of precision when we fix recall at any value. The AUC quantifies this performance gain and thus further clarifies the improved capability at detecting pixels corresponding to an OoD object. Regarding LostAndFound, the OoD training increases the AUC by 0.30 up to a score of 0.76, a relative improvement with respect to the baseline model of roughly 65%. Regarding Fishyscapes, the performance gain is even more significant: we raise the AUC from 0.28 up to 0.81, which is nearly a threefold performance increase. We conclude that, measured by AUPRC scores, our OoD training is highly beneficial for detecting OoD samples.

In order to monitor that the baseline model does not unlearn its original task due to OoD training, we evaluate the model's performance on in-distribution data with OoD predictions at different entropy thresholds. The original task is the semantic segmentation of the Cityscapes images, and we evaluate by means of the most commonly used performance metric, the mean intersection over union (mIoU, [15]). In addition to the Cityscapes class predictions, which are obtained via the standard maximum a posteriori (MAP) decision principle [7, 35], we consider an extra OoD class prediction if the softmax entropy is above the given threshold. We compute the mIoU for the Cityscapes validation dataset, but average only over the 19 Cityscapes class IoUs.

[Figure 5 plot: Cityscapes validation mIoU over the entropy threshold t for OoD prediction, for the baseline and the model after OoD training]
Figure 5: Mean intersection over union (mIoU) for the Cityscapes validation split with OoD predictions at different entropy thresholds t. The dashed red line indicates the performance loss that we consider to be "acceptable".
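The evaluation protocol described above, overriding the MAP prediction with an extra OoD class wherever the entropy exceeds t and then averaging the IoU over the 19 Cityscapes classes only, could look as follows; this is a sketch with names of our choosing.

```python
import numpy as np

def miou_with_ood(pred, entropy, gt, t, num_classes=19, ood_id=19):
    # The MAP prediction is overridden by the extra OoD class wherever
    # the normalized entropy exceeds the threshold t.
    pred = np.where(entropy >= t, ood_id, pred)
    ious = []
    for c in range(num_classes):  # average over the 19 Cityscapes classes only
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```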
Method                       | AUROC ↑ | FPR95 ↓ | AUPRC ↑ | mIoU ↑ (Cityscapes Val.)

LostAndFound Test:
Lis et al. [30] Best         |  0.93   |   -     |   -     |  0.80
Li et al. [44] Plain         |  0.91   |  0.30   |  0.36   |  0.80
Zhu et al. [46] Plain        |  0.93   |  0.35   |  0.46   |   -
Ours: Li et al. + OoD T.     |  0.94   |  0.12   |  0.51   |  0.76
Ours: Zhu et al. + OoD T.    |  0.98   |   -     |  0.76   |   -

Fishyscapes:
Li et al. [44] Plain         |  0.85   |   -     |  0.08   |  0.80
Zhu et al. [46] Plain        |  0.94   |   -     |  0.28   |   -
Ours: Li et al. + OoD T.     |  0.94   |  0.21   |  0.38   |  0.76
Ours: Zhu et al. + OoD T.    |  0.99   |   -     |  0.81   |   -
Table 1: Benchmark results for LostAndFound and Fishyscapes with the DeepLabv3+ (Zhu et al. [46]) and the slightly weaker DualGCNNet (Li et al. [44]) CNNs. The gray rows mark scores with OoD training, otherwise only entropy thresholding is applied (Plain). For comparison, we included scores reported in [30] and [5] which are, to the best of our knowledge, the only works comparable to ours.

The state-of-the-art DeepLabv3+ model [46], which serves as our baseline throughout our experiments, achieves its best mIoU on the Cityscapes validation dataset without OoD predictions (implying t = 1.0). By re-training the neural network with entropy maximization on OoD inputs, we observe improved OoD-AUPRC scores over the course of training, peaking at 0.76. This gain at detecting OoD samples in LostAndFound comes with only a marginal loss in Cityscapes validation mIoU. These two mIoU scores remain nearly constant (deviations of less than one percent point) over the range of evaluated thresholds t. In general, the lower the entropy threshold, the more pixels are predicted to be OoD, which for small t results in a noticeable performance decrease for both the baseline and the re-trained model. As displayed in figure 5, further lowering the threshold leads to an even more significant sacrifice of original performance. Consequently, in the following experiments we consider only entropy thresholds above the cutoff marked in figure 5, as there the performance loss seems acceptable, especially in view of the substantially improved OoD detection capability. We refer to appendix F for more details.

All the results presented in this section are summarized in table 1, where we additionally provide the false positive rates at 95% true positive rate (FPR95). Moreover, we conducted the same experiments as for the DeepLabv3+ model [46] also for the weaker DualGCNNet [44], see appendix D. We re-trained the latter model for 11 epochs in total and report the scores in table 1 as well.
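The FPR95 score reported in table 1 is the false positive rate at the decision threshold where the true positive rate first reaches 95%. One common way to compute it from the ROC curve is sketched below; the function name is ours.

```python
import numpy as np
from sklearn.metrics import roc_curve

def fpr_at_95_tpr(labels, scores):
    # roc_curve returns fpr and tpr jointly ordered from strict to
    # permissive thresholds; interpolate the fpr where tpr hits 0.95.
    fpr, tpr, _ = roc_curve(labels, scores)
    return float(np.interp(0.95, tpr, fpr))
```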
7. Segment-wise Evaluation
In this section we evaluate the meta classification performance on LostAndFound. The main metrics of the segment-wise evaluation are the numbers of FPs and FNs with respect to an OoD object prediction, cf. equation (8).
[Figure 6 panels: OoD training, OoD training + meta classifier]
Figure 6: OoD detection after OoD training and meta classification at a fixed entropy threshold t. The yellow lines mark the annotations of the OoD objects. OoD predictions labeled as background area according to the ground truth are ignored (this includes, e.g., the garbage bin). See appendix G for more examples.

[Figure 7 plot: false positive vs. false negative OoD segments for the baseline, baseline + meta classifier, OoD training, and OoD training + meta classifier]
Figure 7: Detection errors of LostAndFound OoD objects. In this plot, the numbers of errors for several entropy thresholds t are displayed (when in the axes' range). The pie-chart markers indicate the road miss rate ε. See also table 2 for exact numbers.

As the removal of FP OoD predictions should not come at the cost of a significant loss in original performance, see figure 6, we additionally consider the miss rate of road pixels:

    $\varepsilon := 1 - \Big| \bigcup_{x \in \mathcal{X}} \big( \hat{Z}_{in}(x) \cap Z_{in}(x) \big) \Big| \cdot \Big| \bigcup_{x \in \mathcal{X}} Z_{in}(x) \Big|^{-1}$    (10)

with pixel locations predicted to be in-distribution in $\hat{Z}_{in}$ and annotated as in-distribution in $Z_{in}$. The road miss rate ε measures the proportion of actual road pixels in the whole dataset which are incorrectly identified.
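A direct translation of equation (10) into numpy, assuming boolean per-image masks, could look as follows (names are ours):

```python
import numpy as np

def road_miss_rate(pred_in_masks, gt_in_masks):
    # Equation (10): one minus the fraction of annotated in-distribution
    # (road) pixels across the whole dataset that are also predicted
    # to be in-distribution.
    hit = sum(np.logical_and(p, g).sum() for p, g in zip(pred_in_masks, gt_in_masks))
    total = sum(g.sum() for g in gt_in_masks)
    return 1.0 - hit / total
```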
We compute per-segment metrics as outlined in section 4 for OoD object predictions in the LostAndFound test set and feed them through meta classification models, which are simple logistic regressions throughout our experiments. The segments are then leave-one-out cross-validated as to whether they are TP or FP, see equation (9).

We consistently observe a gain in OoD detection performance due to meta classification. The number of detection errors as well as the road miss rate ε at different entropy thresholds t are summarized in table 2. The performance of FP OoD removal is given in table 3.

Table 2: Detection errors for LostAndFound OoD objects at different entropy thresholds t, reporting FP, FN, their sum, and the road miss rate ε in % for the baseline, baseline + meta classifier, OoD training, and OoD training + meta classifier. We consider the road miss rate ε, see equation (10), as a further measure of loss in original performance (for Cityscapes mIoU, see figure 5). Below the horizontal line we consider the loss in original performance to be acceptable, see section 6.3 for further details.

Table 3: Meta classification performance on LostAndFound at different entropy thresholds t, reporting AUROC and AUPRC. As comparison to the meta classifier, we include the detection of OoD prediction errors via the maximum softmax probability (MSP, [19]).

In general, the higher the entropy threshold, the fewer OoD objects are predicted and consequently the less data is fed through the linear models. This explains the observation that meta classifiers identify FPs more reliably the lower t is. However, also for larger thresholds, meta classifiers still clearly outperform the natural maximum softmax probability (MSP, [19]) approach. Due to our OoD training, the meta classifiers prove to be even more effective, reaching an AUPRC score of up to 0.72, which is 19 percent points higher than without OoD training. In our experiments, OoD training in combination with meta classification turns out to be the best OoD detection approach, achieving the best result with only 598 errors in total while having a road miss rate of merely 0.06%. In comparison, there are 7,567 errors at a road miss rate of 0.38% when applying neither OoD training nor meta classifiers, which can be reduced to decent scores of 714 and 0.09%, respectively, when adding the meta classifier. Compared with the best baseline, we decrease the number of total errors by 52% from 1,242 down to 598. More safety-relevantly, at the same time we significantly reduce the number of overlooked OoD objects by 70% from 1,084 down to 308.
8. Conclusion & Outlook
In this work, we presented a novel re-training approach for deep neural networks that unites improved OoD detection capability and state-of-the-art semantic segmentation in one model. Up to now, only a small number of prior works exist for anomaly segmentation on LostAndFound and Fishyscapes, respectively. We demonstrate that our OoD training significantly improves the detection efficiency via softmax entropy thresholding, leading to a performance superior to existing methods.

Moreover, we introduced meta classifiers for entropy based OoD object predictions. By applying lightweight logistic regressions, we have shown that entire LostAndFound OoD segments are meta classified reliably. This observation already holds for the tested neural network in its plain version. Due to the increased sensitivity of OoD predictions via entropy maximization, the meta classifiers' efficiency is even more pronounced. In view of emerging safety-critical deep learning applications, the combination of OoD training and meta classification has the potential to considerably improve the overall system's performance.

For future work, we plan to apply OoD training to the retrieval of OoD objects in order to assess the importance of their occurrence and whether a new concept is required to be learned. Our code is publicly available at https://github.com/robin-chan/meta-ood.
Acknowledgement
The research leading to these results is funded by the German Federal Ministry for Economic Affairs and Energy within the project "KI Absicherung – Safe AI for Automated Driving", grant no. 19A19013Q. The authors would like to thank the consortium for the successful cooperation and acknowledge fruitful discussions with Sharat Gujamagadi and Fabian Kunst.

References

[1] Samet Akçay, Amir Atapour-Abarghouei, and Toby P. Breckon. Skip-GANomaly: Skip connected and adversarially trained encoder-decoder anomaly detection. In International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2019.
[2] Andrei Atanov, Arsenii Ashukha, Dmitry Molchanov, Kirill Neklyudov, and Dmitry Vetrov. Uncertainty estimation via stochastic batch normalization. In Advances in Neural Networks – ISNN 2019, pages 261–269, Cham, 2019. Springer International Publishing.
[3] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. In Proceedings of the British Machine Vision Conference (BMVC), pages 57.1–57.12. BMVA Press, September 2017.
[4] Christoph Baur, Benedikt Wiestler, Shadi Albarqouni, and Nassir Navab. Deep autoencoding models for unsupervised anomaly segmentation in brain MR images. In International MICCAI Brainlesion Workshop, pages 161–169. Springer, 2018.
[5] Hermann Blum, Paul-Edouard Sarlin, Juan Nieto, Roland Siegwart, and Cesar Cadena. Fishyscapes: A benchmark for safe semantic segmentation in autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, October 2019.
[6] Dominik Brüggemann, Robin Chan, Matthias Rottmann, Hanno Gottschalk, and Stefan Bracke. Detecting out of distribution objects in semantic segmentation of street scenes. In The 30th European Safety and Reliability Conference (ESREL), 2020.
[7] Robin Chan, Matthias Rottmann, Fabian Hüger, Peter Schlicht, and Hanno Gottschalk. Application of maximum likelihood decision rules for handling class imbalance in semantic segmentation. In The 30th European Safety and Reliability Conference (ESREL), 2020.
[8] Robin Chan, Matthias Rottmann, Fabian Hüger, Peter Schlicht, and Hanno Gottschalk. Controlled false negative reduction of minority classes in semantic segmentation. 2020.
[9] Hyunsun Choi and Eric Jang. Generative ensembles for robust anomaly detection. ArXiv, abs/1810.01392, 2018.
[10] Pascal Colling, Lutz Roese-Koerner, Hanno Gottschalk, and Matthias Rottmann. MetaBox+: A new region based active learning method for semantic segmentation using priority maps. 2020.
[11] Marius Cordts, Mohamed Omran, Sebastian Ramos, et al. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[12] Clement Creusot and Asim Munawar. Real-time small obstacle detection on highways using compressive RBM road reconstruction. Pages 162–167, 2015.
[13] Jia Deng, Alexander C. Berg, Kai Li, and Li Fei-Fei. What does classifying more than 10,000 image categories tell us? In Computer Vision – ECCV 2010, pages 71–84, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
[14] Terrance DeVries and Graham W. Taylor. Learning confidence for out-of-distribution detection in neural networks. February 2018.
[15] Mark Everingham, S. M. Ali Eslami, Luc Van Gool, et al. The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, January 2015.
[16] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1050–1059. PMLR, 2016.
[17] Matthias Hein, Maksym Andriushchenko, and Julian Bitterwolf. Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[18] Dan Hendrycks, Steven Basart, Mantas Mazeika, Mohammadreza Mostajabi, Jacob Steinhardt, and Dawn Song. A benchmark for anomaly segmentation. arXiv preprint arXiv:1911.11132, 2019.
[19] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2017.
[20] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. In Proceedings of the International Conference on Learning Representations, 2019.
[21] S. Isobe and S. Arai. Deep convolutional encoder-decoder network with model uncertainty for semantic segmentation. Pages 365–370, 2017.
[22] Joel Janai, Fatma Güney, Aseem Behl, Andreas Geiger, et al. Computer vision for autonomous vehicles: Problems, datasets and state of the art. Foundations and Trends in Computer Graphics and Vision, 12(1–3):1–308, 2020.
[23] Nicolas Jourdan, Eike Rehder, and Uwe Franke. Identification of uncertainty in artificial neural networks. In Proceedings of the 13th Uni-DAS e.V. Workshop Fahrerassistenz und automatisiertes Fahren, July 2020.
[24] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems 30, pages 5574–5584. Curran Associates, Inc., 2017.
[25] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[26] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems 30, pages 6402–6413. Curran Associates, Inc., 2017.
[27] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In International Conference on Learning Representations, 2018.
[28] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations, 2018.
[29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, pages 740–755. Springer International Publishing, 2014.
[30] Krzysztof Lis, Krishna Nakka, Pascal Fua, and Mathieu Salzmann. Detecting the unexpected via image resynthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[31] Kira Maag, Matthias Rottmann, and Hanno Gottschalk. Time-dynamic estimates of the reliability of deep semantic segmentation networks. In IEEE International Conference on Tools with Artificial Intelligence (ICTAI), November 2020.
[32] David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
[33] A. Mehrtash, W. M. Wells, C. M. Tempany, P. Abolmaesumi, and T. Kapur. Confidence calibration and predictive uncertainty estimation for deep medical image segmentation. IEEE Transactions on Medical Imaging, pages 1–1, 2020.
[34] Alexander Meinke and Matthias Hein. Towards neural networks that provably know when they don't know. In International Conference on Learning Representations, 2020.
[35] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
[36] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don't know? In International Conference on Learning Representations, 2019.
[37] Radford M. Neal. Bayesian Learning for Neural Networks, volume 118. Springer Science & Business Media, 2012.
[38] Philipp Oberdiek, Matthias Rottmann, and Gernot A. Fink. Detection and retrieval of out-of-distribution objects in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2020.
[39] Peter Pinggera, Sebastian Ramos, Stefan Gehrig, Uwe Franke, Carsten Rother, and Rudolf Mester. Lost and found: Detecting small road hazards for self-driving vehicles. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016.
[40] Jie Ren, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji Lakshminarayanan. Likelihood ratios for out-of-distribution detection. In Advances in Neural Information Processing Systems 32, pages 14707–14718. Curran Associates, Inc., 2019.
[41] Matthias Rottmann, Pascal Colling, Thomas Paul Hack, Robin Chan, Fabian Hüger, Peter Schlicht, and Hanno Gottschalk. Prediction error meta classification in semantic segmentation: Detection via aggregated dispersion measures of softmax probabilities. 2020.
[42] Matthias Rottmann and Marius Schubert. Uncertainty measures and prediction quality rating for the semantic segmentation of nested multi resolution street scene images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[43] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, April 2020.
[44] Li Zhang, Xiangtai Li, Anurag Arnab, Kuiyuan Yang, Yunhai Tong, and Philip H. S. Torr. Dual graph convolutional network for semantic segmentation. In Proceedings of the British Machine Vision Conference (BMVC), 2019.
[45] Xiang Zhang and Yann LeCun. Universum prescription: Regularization using unlabeled data. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[46] Yi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao, and Bryan Catanzaro. Improving semantic segmentation via video propagation and label relaxation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[47] Aleksandar Zlateski, Ronnachai Jaroensri, Prafull Sharma, and Frédo Durand. On the importance of label quality for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

Appendix

A. Separability by means of Data Distribution

The violin plots in figure 3 visualize the separability of in-distribution and out-distribution pixels (binary classification) in LostAndFound and Fishyscapes, respectively. These plots summarize statistics such as the median and the interquartile range and also show the full distribution of the data. The density corresponds to the relative pixel frequency at a given entropy value of the considered class. In the following, we refer to the shape of the violin plots as the distribution.

First, we focus on evaluating LostAndFound OoD objects, see figure 3 (a). For the baseline model we observe that a large mass of the data corresponding to the negative class is located at very low entropy values, i.e., most road pixels are classified with high confidence; the 75th percentile and even the sample of highest value lie at comparatively low entropy values. Regarding the pixels of the positive class, we see that the distribution is rather dispersed, with the median and the interquartile range lying at considerably higher entropy values. We conclude that, on average, positive samples have higher entropy values than negative samples, i.e., pixels of an OoD object are classified with higher uncertainty than road pixels. However, for perfect performance one seeks a threshold such that both distributions (of the positive and the negative class) are separated. This is not the case for the baseline model, since a substantial amount of positive samples still has very low entropy; e.g., the 10th percentile of the positive samples lies at the entropy value that is also the median of the negative samples.

After OoD training, the distribution of negative samples remains in large parts similar compared to the baseline, with only little changes; noteworthy, the median and the upper quartile decrease to even lower entropy values. On the contrary, the changes of the distribution of the positive samples are significant, as a large mass is now concentrated at very high entropy values. The median of the positive samples is roughly at the same magnitude as the maximum of the negative samples, and even the minimum value of the positive pixels equals the median of the negative samples. In particular the latter underlines the significant improvement of separability due to our OoD training.

We observe the same behavior for Fishyscapes OoD objects, but even more pronounced, see figure 3 (b). After the OoD training, the medians of the two classes differ substantially; besides, the lower quartile of the positive samples as well as their 1st percentile are still above the median of the negative samples. Consequently, we conclude that our OoD training is beneficial for identifying OoD pixels.

B. Segment-wise Metrics for Meta Classifiers
As outlined in section 4, we train meta classifiers based on hand-crafted metrics. These metrics are derived from the softmax probabilities $f(x) \in (0,1)^{|\mathcal{H}| \times |\mathcal{W}| \times q}$, $x \in \mathcal{X}$, of deep convolutional neural networks, information we get in every forward pass. As a reminder, let $\hat{Z}_{out}(x)$ be the set of pixel locations in image $x \in \mathcal{X}$ that are predicted to be OoD, see equation (7). A connected component $k \in \hat{\mathcal{K}}(x) \subseteq \mathcal{P}(\hat{Z}_{out}(x))$ represents an OoD segment / object prediction due to the entropy being above the given threshold. This is different from other works dealing with segment-wise meta classification [8, 31, 41, 42], as they consider connected components sharing the same class label as segments.

We estimate uncertainty per OoD segment $k$ by averaging pixel-wise scores at the segment's pixel locations $z \in k$. In addition to the plain softmax probabilities $f_z(x)$, we also incorporate three pixel-wise dispersion measures, namely $\forall z \in k$ the (normalized) entropy

    $\bar{E}(f_z(x)) = - \frac{1}{\log(q)} \sum_{j=1}^{q} f_j^z(x) \log(f_j^z(x))$ ,    (11)

the variation ratio

    $V(f_z(x)) = 1 - f_{j^*(z)}^z(x)$ ,    (12)

and the probability margin

    $M(f_z(x)) = V(f_z(x)) + \max_{j \in \{1,\ldots,q\} \setminus \{j^*(z)\}} f_j^z(x)$    (13)

with $j^*(z) := \arg\max_{j=1,\ldots,q} f_j^z(x)$ being the class label according to the maximum a posteriori principle.

The segment's size $S(k) = |k|$ is not only needed for averaging but also serves as meta classification input on its own. Moreover, let $k_{in} \subset k$ be the set of pixel locations in the interior of the segment $k$, i.e., $k_{in} = \{ (h,w) \in k : [h \pm 1] \times [w \pm 1] \subseteq k \}$. This also gives us the pixel locations of the boundary, $k_{bd} = k \setminus k_{in}$. In order to capture geometry features of a segment, we consider the relative sizes

    $\tilde{S} = S / S_{bd}$ and $\tilde{S}_{in} = S_{in} / S_{bd}$    (14)

by treating the segment's boundary and interior separately. Let $k_{nb} = \{ z' \in [h \pm 1] \times [w \pm 1] \subseteq \mathcal{H} \times \mathcal{W} : (h,w) \in k, z' \notin k \}$ be the neighborhood of $k$. As a metric for whether a segment is misplaced, we include

    $N(j \mid k) = \frac{1}{|k_{nb}|} \sum_{z \in k_{nb}} \mathbb{1}_{\{j = j^*(z)\}} \quad \forall j = 1, \ldots, q$ ,    (15)

which is the proportion of neighborhood pixels for which class $j \in \{1, \ldots, q\}$ has the highest softmax score, relative to the neighborhood size.
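A compact numpy sketch of the dispersion metrics in equations (11)-(13) and their segment-wise aggregation follows; the names are ours, and the geometry and neighborhood features of equations (14)-(16) are omitted for brevity.

```python
import numpy as np

def pixel_dispersion_measures(probs):
    """probs: (q, H, W) softmax probabilities; returns three uncertainty maps."""
    q = probs.shape[0]
    sorted_p = np.sort(probs, axis=0)             # ascending along the class axis
    entropy = -(probs * np.log(probs)).sum(axis=0) / np.log(q)  # eq. (11)
    variation_ratio = 1.0 - sorted_p[-1]          # eq. (12): 1 - max probability
    prob_margin = variation_ratio + sorted_p[-2]  # eq. (13): add 2nd-highest prob
    return entropy, variation_ratio, prob_margin

def segment_metrics(measure_map, segment_mask):
    # Aggregate a pixel-wise uncertainty measure over one predicted segment
    # by averaging, plus the segment size S(k) = |k| as a feature of its own.
    values = measure_map[segment_mask]
    return values.mean(), int(segment_mask.sum())
```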
[Figure 8 panels: entropy violin plots for the baseline performance and for Cityscapes void OoD training, detecting Cityscapes unlabeled objects]

Figure 8: Separability between in-distribution and out-distribution pixels in Cityscapes. Pixels labeled as a train class according to the ground truth are considered as in-distribution, pixels labeled with the void class as out-of-distribution. For the results with Cityscapes void OoD training, the baseline model (left) was retrained with entropy maximization on the Cityscapes void class (right), i.e., using Cityscapes unlabeled objects as OoD proxy for D_out.

Another metric for localization purposes is the segment's geometric center,

    $C_h(k) = \frac{1}{S} \sum_{i=1}^{S} h_i$ and $C_w(k) = \frac{1}{S} \sum_{i=1}^{S} w_i$    (16)

with $(h_i, w_i) \in k\ \forall i = 1, \ldots, |k|$, i.e., averaging over the segment's pixel coordinates in vertical and horizontal direction.

For each segment $k$ we then have 46 metrics in total (as $q = 19$ in our experiments). This forms a structured dataset

    $\mu \subseteq \mathbb{R}^{|\bigcup_{x \in \mathcal{X}} \hat{\mathcal{K}}(x)| \times 46}$    (17)

serving as input for the meta classification model $g : \mu \to [0,1]$, the latter being a simple logistic regression in our case. By means of this linear model, we learn to discriminate whether a segment $k$ has an intersection with the ground truth (while all inputs are independent of the ground truth segmentation), see also equation (9).
C. OoD Training with Cityscapes void Class

Before using the COCO dataset as OoD proxy, we conducted some experiments with the Cityscapes void class as OoD proxy for D_out in order to perform entropy maximization. This class includes objects that cannot be assigned to any of the Cityscapes training classes; therefore they remain unlabeled and are ignored during training. We refer to this retraining approach, using the Cityscapes unlabeled objects as OoD proxy, as void OoD training. We find the best results in our experiments for the DeepLabv3+ baseline model after 8 epochs of void OoD training.

[Figure 9 panels: entropy violin plots for the baseline performance and for Cityscapes void OoD training, detecting LostAndFound OoD objects]
Figure 9: Separability between in-distribution and out-of-distribution pixels in the OoD dataset LostAndFound. For the results with Cityscapes void OoD training, the baseline model (left) was retrained with entropy maximization on the Cityscapes void class (right).

With respect to the Cityscapes validation dataset, the retrained model clearly improves at identifying unseen unlabeled objects, see figure 8. However, the same retrained model fails to generalize to unseen OoD objects available in the LostAndFound dataset, see figure 9. Not only the softmax entropy of OoD pixels is boosted but also the entropy of a significant amount of in-distribution pixels. This is even more considerable due to the strong class imbalance in LostAndFound. With respect to the AUROC, the void OoD training decreases the OoD detection score by 5 percent points down to 0.88, while decreasing the more relevant metric AUPRC by even 29 percent points down to 0.17 compared to the baseline model.

A visual comparison of the effects of void OoD training is shown in figure 10. The retraining does not noticeably impact the segmentation performance, neither for Cityscapes nor for LostAndFound. In particular for the segmentation of the Cityscapes scenes, there are only minor differences visible, i.e., the difference in performance on the original task is marginal. This is in line with the observation that retraining with the multi-objective loss function, see equation (2), and the COCO dataset as OoD proxy leads only to a marginal loss of mIoU for the Cityscapes validation dataset. With respect to the Cityscapes images, the softmax entropy inside unlabeled objects is clearly boosted due to void OoD training. This makes identifying such objects easier in comparison to the baseline model.

Regarding the LostAndFound segmentations, the differences are more visible although still not significant. On the contrary, by comparing the entropy heatmaps for the baseline model and the model after void OoD training, one observes that not only the entropy of pixels inside the OoD objects is boosted but also that of many in-distribution pixels.

[Figure 10 panels: Cityscapes baseline segmentation and entropy, Lost&Found baseline segmentation and entropy; void OoD training segmentation and entropy for both scenes]

Figure 10: Comparison between the baseline model and the retrained model, with entropy maximization on Cityscapes unlabeled objects, for one Cityscapes and one LostAndFound scene. The first and third columns display the segmentations obtained by the respective models on either a Cityscapes or a LostAndFound input image, the second and fourth columns display the corresponding entropy heatmaps. In the entropy heatmaps, the OoD objects are marked with yellow lines.

[Figure 11 panels: violin plots of entropy over in-distribution vs. out-distribution pixels for the DualGCNNet before and after OoD training (epoch 11); (a) LostAndFound, (b) Fishyscapes]

Figure 11: Relative pixel frequencies of LostAndFound (a) and Fishyscapes (b) OoD pixels, respectively, at different entropy values for the baseline model, i.e., before OoD training (a & b, left), and after OoD training (a & b, right). The red lines indicate the thresholds of highest accuracy.
This has a detrimental impact on the discrimination performance between in-distribution and out-distribution pixels, as these two classes cannot be separated well via entropy thresholding. These visualizations support the impression from the pixel-wise evaluation that void OoD training is not suitable for the detection of objects other than the Cityscapes unlabeled objects.
D. OoD Training for DualGCNNet
As a second model, complementary to the DeepLabv3+ model, we performed the same OoD training experiments, i.e., retraining with the COCO dataset as OoD proxy, for the DualGCNNet, which is a weaker and more lightweight network compared to the state-of-the-art DeepLabv3+ segmentation network. We find the best results after 11 epochs of OoD training. As optimizer we used Adam.

The pixel-wise evaluation results are presented by means of the violin plots in figure 11 and by ROC as well as PR curves in figure 12. We evaluated the OoD detection on the LostAndFound test and Fishyscapes Static datasets in the same manner as in the experiments for DeepLabv3+.

We observe that OoD training is not as effective as for the DeepLabv3+ model in terms of absolute performance gain. However, we still observe a decent improvement in separability. By applying OoD training, the AUROC increases by 3 percent points for LostAndFound and even by 9 percent points for Fishyscapes, up to a score of 0.94 for both datasets. With respect to the PR curves, the AUC improves by 15 percent points up to 0.51 for LostAndFound and by 20 percent points up to 0.38 for Fishyscapes.
[Figure 12 panels: (a) LostAndFound, left: ROC curves with AUC = 0.91 (plain) and AUC = 0.94 (OoD training); right: PR curves with AUC = 0.36 (plain) and AUC = 0.51 (OoD training). (b) Fishyscapes, left: ROC curves with AUC = 0.85 (plain) and AUC = 0.94 (OoD training); right: PR curves with AUC = 0.08 (plain) and AUC = 0.38 (OoD training). Legend: DualGCNNet, OoD training after epoch 11.]
Figure 12: Detection ability of LostAndFound (a) and Fishyscapes (b) OoD pixels, respectively, evaluated by means of receiver operating characteristic curves (a & b, left) and precision recall curves (a & b, right). The red lines indicate the performance according to random guessing.
[Figure 13 panels: before OoD training, after OoD training]
Figure 13: Comparison of softmax entropy heatmaps before (left) and after OoD training (right). The yellow lines mark the OoD objects according to their ground truth annotation.

Noteworthy, these AUC scores after OoD training are higher than for the plain DeepLabv3+ (baseline) model, which is already a strong OoD detection model. These results for the weaker DualGCNNet model further demonstrate the positive effect on the OoD detection ability when performing OoD training with the COCO dataset as OoD proxy.
E. OoD Training Visualization
The improved separation ability due to OoD training is not only achieved by increasing the softmax entropy of OoD pixels but also by decreasing the softmax entropy for in-distribution pixels. This can also be observed by means of the in-distribution violins, for instance in figure 11. By comparing the shapes of the violins corresponding to the DualGCNNet plain model and the model after OoD training, we notice that the violin shapes remain similar in large parts. The median and the upper quartile, however, decrease to lower entropy values after OoD training. This indicates that after entropy maximization the model is, on the one hand, more uncertain at OoD pixel locations and, on the other hand, more certain about its prediction at in-distribution pixel locations. The same observation also holds for the DeepLabv3+ model, see figure 3. This is in line with the observation made in [45] that training with an OoD proxy may have a regularizing effect.
[Figure 14 plots: Cityscapes validation mIoU and LostAndFound test split AUPRC over the course of OoD training]

Figure 14: Mean intersection over union (mIoU) for the Cityscapes validation dataset split over the course of OoD training.

An illustration is provided in figure 13. For comparison purposes, we refer to the entropy heatmaps provided in figure 10, as both figures show the same scene. The visualization of the heatmaps clearly shows that, due to OoD training, pixels with high entropy are more concentrated inside OoD objects. Moreover, the in-distribution objects, especially the pixels corresponding to the road, have lower entropy values than before OoD training. This makes the road seem cleaner with respect to the possible occurrence of OoD objects. After entropy maximization, the OoD objects are (visibly) better recognizable within the softmax entropy heatmaps. Therefore, we expect that the meta classification performance benefits, as the meta classifiers are able to estimate the shape of OoD objects even better. Moreover, higher entropy values are more strongly correlated with the presence of OoD objects.

[Figure 15 panels: baseline plain, baseline plain + meta classifier; OoD training, OoD training + meta classifier]
Figure 15: OoD detection for one scene with different combinations of entropy thresholding for the plain model, entropy thresholding after OoD training, and meta classification. For all OoD predictions the same threshold score t was used. The red segments indicate OoD object predictions.

[Figure 16 panels: OoD training, OoD training + meta classifier (two scenes)]
Figure 16: OoD detection performed by the OoD-trained network with and without meta classifiers. The red segments indicate OoD object predictions.
F. Course of OoD Training