Self-Supervised Domain Mismatch Estimation for Autonomous Perception
Jonas Löhdefink, Justin Fehrling, Marvin Klingner, Fabian Hüger, Peter Schlicht, Nico M. Schmidt, Tim Fingscheidt
{j.loehdefink, j.fehrling, m.klingner, t.fingscheidt}@tu-bs.de
{fabian.hueger, peter.schlicht, nico.maurice.schmidt}@volkswagen.de
Technische Universität Braunschweig — Volkswagen Group Automation
Abstract
Autonomous driving requires self-awareness of its perception functions. Technically speaking, this can be realized by observers, which monitor the performance indicators of various perception modules. In this work we choose, exemplarily, a semantic segmentation to be monitored, and propose an autoencoder, trained in a self-supervised fashion on the very same training data as the semantic segmentation to be monitored. While the autoencoder's image reconstruction performance (PSNR) during online inference already shows good predictive power w.r.t. semantic segmentation performance, we propose a novel domain mismatch metric DM as the earth mover's distance between a pre-stored PSNR distribution on training (source) data and an online-acquired PSNR distribution on any inference (target) data. We show by experiments that the DM metric has a strong rank order correlation with the semantic segmentation performance within its functional scope. We also propose a training-domain-dependent threshold for the DM metric to define this functional scope.
1. Introduction
Semantic segmentation is an essential function of camera-based perception for autonomous driving. Because of its highly safety-critical nature, it is crucial to observe its performance during inference. Domain shifts in the input space of images are one of the various issues that come into play; they are part of everyday scenarios and must be handled. Such domain shifts could be, e.g., changing lighting or weather conditions such as rain or fog. The first step towards a better assessment of the input domain is to detect and measure an occurring domain shift.

The commonly used quality measure for object detection and semantic segmentation is the mean intersection over union (mIoU). Unfortunately, the mIoU can only be computed with ground truth semantic segmentation labels at hand, which are of course not available online during driving.

Figure 1: Performance evaluation of semantic segmentation (simplified sketch). Evaluation of the mean intersection over union (mIoU) requires ground truth segmentation labels ȳ, while the proposed domain mismatch estimation is performed on the basis of the PSNR of an autoencoder, trained and evaluated without labels.

Besides semantic segmentation networks, we assume that other learned functions (even for different tasks) also perform worse when a domain shift degrades the segmentation performance, assuming they were trained on the same data distribution. Hence, we propose the use of a (self-supervised) autoencoder, which allows to monitor domain shifts by computation of a peak signal-to-noise ratio (PSNR) between input and output images without requiring labels, see Figure 1. Clearly, it is difficult to determine the domain shift on single images, as there may always be unusual images, so we focus on investigating batches of images. In fact, we train and evaluate the framework on various datasets simulating domain shifts. A first simple approach to estimate the domain shift is to evaluate the resulting autoencoder's mean PSNR scores. We also compute PSNR performance histograms both for the training data and for different inference data domains and compare them by the earth mover's distance (EMD) [59], obtaining a domain mismatch (DM) metric between two datasets. In our experimental evaluation, we compare the PSNR and our novel DM metric with the absolute segmentation performance difference in mIoU, showing a strong correlation for both.

The rest of this paper is structured as follows: Section 2 presents an overview of the state of the art in related fields of research. In Section 3, we explain the details of our domain mismatch estimation. Section 4 then discusses and interprets the results of the conducted experiments. Finally, we conclude our findings in Section 5.
2. Related Work
In this section we provide an overview of the most relevant state-of-the-art approaches to semantic segmentation, autoencoders, and domain shift.
Semantic Segmentation can be considered as pixel-wise classification of images. Some areas of application for semantic segmentation are medical image analysis, perception for autonomous driving [5, 6], video surveillance, and augmented reality [38]. The architectural concepts for semantic segmentation can be categorized into fully convolutional networks (FCNs) [33], graphical models [11], encoder-decoder-based models [40], multi-scale architectures [30], region CNNs (R-CNNs) [24], networks based on dilated convolutions [11, 12], recurrent neural networks (RNNs) [53], attention-based models [13], generative adversarial networks (GANs) [20, 34], and active contour models [26], as comprehensively investigated in [38]. Furthermore, there is also a variety of image segmentation datasets in 2D, 2.5D (including depth), and 3D. Often-used 2D datasets are PASCAL VOC [16], PASCAL VOC12 [15], MS COCO [31], Cityscapes [14], KITTI [19], SYNTHIA [45], Berkeley DeepDrive [60], and CamVid [9]. For the evaluation of semantic segmentation models, several quality measures are frequently used, e.g., pixel accuracy (PA), mean pixel accuracy (MPA), mean intersection over union (mIoU), precision, F1-score, and Dice coefficient [38].

Due to its efficient implementation and the resulting training and inference time savings, we use the encoder-decoder-based ERFNet [44], which adopts its architecture from [42] and [4]. For our experiments, we use the Cityscapes dataset [14], the KITTI dataset [19], and the Berkeley DeepDrive dataset [60], and report the mIoU since it is the most widespread segmentation metric.
Autoencoders are a special case of encoder-decoder architectures, trained to have the same input and output in a self-supervised fashion. Variations of autoencoders can be found in their respective architectures, loss functions, learning principles, and strategies. Due to the bottleneck in the autoencoder, it is inherently closely related to image compression [2, 32, 49], which often adds quantization, and also to image (and video) super-resolution (SR) methods [22, 37], which focus on reconstructing the original high-resolution image from a low-resolution representation. Furthermore, texture synthesis [29, 51], image inpainting [58, 61], and style transfer [18, 25] also incorporate autoencoder structures. In many cases, decoders make use of transposed convolutions [62, 63] and multi-task learning [10, 24]. Besides this, many architectures use generative adversarial networks (GANs) [20] or extensions such as the conditional GAN (cGAN) [39] or the least-squares GAN (LSGAN) [36]. The Wasserstein GAN (WGAN) [3] is another famous representative of GANs, using the Wasserstein-1 distance, also known as the earth mover's distance (EMD) [59], which we will use as domain mismatch metric. Commonly used quality measures for image compression systems, super-resolution approaches, and autoencoders in general are the peak signal-to-noise ratio (PSNR) [32, 47], structural similarity (SSIM) [55], and multi-scale SSIM (MS-SSIM) [56], as well as the mean opinion score (MOS), which is the human-evaluated perceptual quality. Besides, there are numerous other image quality assessment methods trying to simulate the human perception system [35, 48].

We use the autoencoder architecture for learned image compression from [2], with the difference that we omit the quantization block, since we do not aim at compression.
Domain Shift deals with variations between data domains or distributions, where domains can be considered as environments of different technical or natural data characteristics and different data distributions. Examples of such domain shifts are differing sensor setups in capture devices, or traffic signs in different countries. Learning models on data distributions differing from the application distributions is referred to as transfer learning [41, 52], since the goal is to transfer the learned knowledge. Specifically, domain adaptation approaches [7, 17] aim at adjusting models to perform well in two (or more) domains in a (semi-)supervised or unsupervised fashion. Moreover, time-variant domains often lead to conceptual drifts [50, 57], posing a particularly difficult problem, since the direction of the drift is unknown. This makes the drift even more important to detect. The maximum mean discrepancy (MMD) [8, 21] is another task-independent method to measure a domain shift between a source and a target domain. In this technique, a function in a reproducing kernel Hilbert space (RKHS) is to be found, being large for samples from the first distribution p and small for samples from the second distribution q. The MMD is then computed by subtracting the mean of function outputs with inputs from q from the mean of function outputs with inputs from p. This method can be thought of as comparing not only the means of two distributions but also their higher-order moments such as the variance.

Figure 2: Our proposed domain mismatch estimation. The loss function for the autoencoder is only used during self-supervised training and is not needed during inference. The histogram of PSNR values in the training data domain (A) is compared to an acquired histogram during inference (domain B), using the earth mover's distance (EMD), yielding the proposed domain mismatch metric DM.

The main differences between the MMD and our method are that, first, the MMD maximizes the sample expectation differences from two distributions in a reproducing kernel Hilbert space over a set of functions for each domain pair to be evaluated, while our proposed method is trained only once on the training (source) domain. Second, the MMD uses the difference of mean values to obtain the final metric, while we evaluate the outputs by the EMD. And third, we use neural networks both for semantic segmentation and for domain mismatch estimation, while the function optimized in the typical MMD is not related to neural networks.
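To make the MMD construction above concrete, the following is a minimal numpy sketch of the (biased) squared-MMD estimate with an RBF kernel; the function name and the kernel bandwidth are our own illustrative choices, not taken from [8, 21]:

```python
import numpy as np

def rbf_mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between sample sets x and y
    (RBF kernel). Larger values indicate a larger discrepancy between
    the underlying distributions p and q."""
    def k(a, b):
        # pairwise squared Euclidean distances, then Gaussian kernel
        d2 = (np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :]
              - 2.0 * a @ b.T)
        return np.exp(-d2 / (2.0 * sigma**2))
    # E_p[k(x, x')] + E_q[k(y, y')] - 2 E_{p,q}[k(x, y)]
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()
```

As the RBF kernel induces an infinite-dimensional RKHS, this estimate is sensitive to all moments of the two distributions, not only their means.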
3. Domain Mismatch Estimation
A detailed block diagram of the proposed domain mismatch estimation can be seen in Figure 2. It consists of an autoencoder along with a loss function and computational steps to obtain a domain mismatch metric DM. The image x = (x_i) with height H and width W, consisting of normalized (color) pixels x_i ∈ [−1, 1]^C, with C = 3 color channels and pixel index i ∈ I = {1, 2, ..., H·W}, is the input both to an undisplayed but to-be-observed semantic segmentation and to our proposed domain mismatch estimator. Its autoencoder receives the normalized image x and produces an image reconstruction x̂ = (x̂_i) with x̂_i ∈ [−1, 1]^C. An advantage of all autoencoder settings is the fact that no explicit labels are needed because of the self-supervised training. So in addition to the image reconstruction, the loss and quality measure also use the input image x. Different domains result in different self-supervised quality measure distributions, which can then be compared by the earth mover's distance [59], providing our proposed domain mismatch metric.

We use the ERFNet [44] for the task of the semantic segmentation to be observed. The network is optimized to run in real time, while still achieving accurate results. It has an encoder/decoder structure and makes use of factorized residual layers consisting of a combination of two 1D filters instead of one 2D filter. Since the semantic segmentation architecture and loss function are identical to those used in [44], we refer the interested reader to this reference.

Concerning our autoencoder, we use an adversarial architecture adopted from [2], [54], and [23]. Speaking in terms of a generative adversarial network, the generator combines the encoder and decoder networks of the autoencoder, and the discriminator evaluates its reconstructions in a simultaneous training.
In the encoder, decoder, and discriminator, each convolutional operation is zero-padded, always preserving the image dimensions, and followed by an instance normalization layer as well as a ReLU activation function, if not stated otherwise.

First in the encoder, there is a convolutional layer with kernel size 7×7, a stride of one, and 60 feature maps. Afterwards, 4 downsampling blocks follow, each consisting of a convolutional layer with kernel size 3×3 and a stride of two for spatial reduction, with (120, 240, 480, 960) feature maps. The last convolutional layer has a kernel size of 3×3, a stride of one, and 8 feature maps, shaping the bottleneck. The final encoder layer has a tanh activation to yield outputs in the range [−1, 1].

The decoder architecture first has a convolutional layer with kernel size 3×3, a stride of one, and 960 feature maps. Afterwards, there are 9 residual blocks, each consisting of two convolutional layers bypassed by an identity function, where the second convolutional layer omits the ReLU activation function. The initial image resolution is restored by 4 transposed convolutional layers with kernel size 3×3, a stride of two, and (960, 480, 240, 120) feature maps. The architecture is finalized by a convolutional layer with kernel size 7×7, a stride of one, three feature maps, and a tanh activation function.

In the discriminator, the leaky ReLU function is used instead of the ReLU activation function. The discriminator consists of 4 convolutional layers with kernel size 4×4, a stride of two, and (64, 128, 256, 512) feature maps.
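As a quick sanity check of the encoder geometry described above, the following sketch computes the size of the latent representation after the four stride-two downsampling blocks; the 512×1024 input resolution and the helper name are our own illustrative assumptions:

```python
def bottleneck_values(h, w, n_down=4, c_bottleneck=8):
    """Number of values in the bottleneck: each stride-2 convolution
    halves both spatial dimensions; the bottleneck has 8 channels."""
    return (h // 2**n_down) * (w // 2**n_down) * c_bottleneck

h, w, c_in = 512, 1024, 3                 # assumed input resolution, RGB
latent = bottleneck_values(h, w)          # 32 * 64 * 8 = 16384 values
reduction = (h * w * c_in) / latent       # ~96x fewer values than the input
```

This dimensionality reduction is what forces the autoencoder to learn domain-specific image statistics, which the PSNR later probes.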
A final convolutional layer with kernel size 4×4, a stride of one, one feature map, and ReLU activation delivers the discriminator outputs.

The autoencoder loss
J_AE = α₁ J_dist + α₂ J_FM + (1 − α₁ − α₂) J_G,adv ,   (1)
with the weighting factors α₁, α₂ ∈ [0, 1], α₁ + α₂ ≤ 1, consists of an MSE distortion loss J_dist, the L1 feature map loss J_FM between the discriminator's feature activations fed with the image x and the reconstruction x̂, and the generator-specific least-squares (LS) GAN loss J_G,adv [36]. The discriminator is trained with the discriminator-specific LS-GAN loss J_D,adv, which pursues the opposed goal of the generator.

Evaluating the semantic segmentation performance for a set of images, commonly the mean intersection over union
mIoU = (1/|S|) Σ_{s∈S} TP_s / (TP_s + FP_s + FN_s)   (2)
is used, being composed of the numbers of true-positive (TP_s) pixels, false-positive (FP_s) pixels, and false-negative (FN_s) pixels w.r.t. the ground truth, with the class index s ∈ S = {1, 2, ..., S}, summed up over all images beforehand.

For the evaluation of the autoencoder, the image reconstruction quality for input and output color image pixels in the number range x′_i, x̂′_i ∈ [0, 255]^C is usually computed by the peak signal-to-noise ratio (PSNR), performing a direct MSE comparison of pixel values:
PSNR = 10 log₁₀( (x′_max)² · C·H·W / Σ_{i∈I} ‖x′_i − x̂′_i‖² ) [dB]   (3)
with x′_max = 255.

The comparison of two discrete probability distributions P(µ), µ ∈ M = {1, 2, ..., M}, and Q(ν), ν ∈ N = {1, 2, ..., N}, can be performed by the earth mover's distance (EMD) [59]. This metric computes the minimum work W required to convert one distribution into the other by multiplying the distance d_µν = |µ − ν| ∈ {0, 1, ..., max(M, N) − 1} between the bins with indices µ and ν with the M × N flow matrix F = (f_µν), with f_µν ∈ [0, 1] being the flow from bin µ to bin ν.
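The two evaluation measures (2) and (3) can be sketched in a few lines of numpy; the confusion-matrix convention and the function names are our own illustrative choices:

```python
import numpy as np

def miou(conf):
    """mIoU (2) from a confusion matrix conf[s, s'] = number of pixels of
    ground-truth class s predicted as class s'. Classes without any pixels
    would yield NaN and are assumed absent in this sketch."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp   # predicted as s, but ground truth differs
    fn = conf.sum(axis=1) - tp   # ground truth s, but predicted differently
    return float(np.mean(tp / (tp + fp + fn)))

def psnr(x, x_hat, x_max=255.0):
    """PSNR (3) in dB for images with pixel values in [0, x_max].
    Assumes x != x_hat (the MSE must be nonzero)."""
    mse = np.mean((np.asarray(x, float) - np.asarray(x_hat, float)) ** 2)
    return 10.0 * np.log10(x_max**2 / mse)
```

A perfect segmentation (diagonal confusion matrix) yields mIoU = 1, while the PSNR of a maximally wrong reconstruction (all-black vs. all-white) is 0 dB.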
The optimal flow is found by minimizing the work according to
F* = argmin_F W(P, Q, F) = argmin_F Σ_{µ∈M} Σ_{ν∈N} f_µν d_µν   (4)
under consideration of the four (stochastic) constraints
f_µν ≥ 0, µ ∈ M, ν ∈ N,
Σ_{ν∈N} f_µν ≤ P(µ), µ ∈ M,
Σ_{µ∈M} f_µν ≤ Q(ν), ν ∈ N,
Σ_{µ∈M} Σ_{ν∈N} f_µν = min( Σ_{µ∈M} P(µ), Σ_{ν∈N} Q(ν) ).
We then obtain the earth mover's distance as
DM(P, Q) = ( Σ_{µ∈M} Σ_{ν∈N} f*_µν d_µν ) / ( Σ_{µ∈M} Σ_{ν∈N} f*_µν ),   (5)
which we will use as our proposed domain mismatch metric by computing the difference of reconstruction qualities for various datasets.

We use Kendall's rank order coefficient [1, 27] τ = τ_b, which accounts for ties in one quantity, whereby in the following we will omit the index b. Having K observations o_k = (a_k, b_k) with k ∈ {1, ..., K}, the total number of observation pairs
(o_k, o_ℓ) = ((a_k, b_k), (a_ℓ, b_ℓ))   (6)
with k < ℓ is n_p = C(K, 2) = K(K − 1)/2. A pair of observations is called concordant if the observation's components have the same order (both ascending or both descending), otherwise it is discordant. If the values of one component in the pair are equal, it is called a tie in this component (here: a tie in a or a tie in b) and is neither concordant nor discordant. The numbers of concordant pairs n_c, discordant pairs n_d, ties in a, n_a, and ties in b, n_b, are used to calculate Kendall's rank order coefficient
τ = (n_c − n_d) / √( (n_p − n_a)(n_p − n_b) ) ∈ [−1, 1],   (7)
where τ = 1 means that the observations are perfectly in the same order, τ = −1 means that they are perfectly in reversed order, and τ = 0 means that there is no correlation in rank order.
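For one-dimensional histograms, the optimal flow in (4) need not be computed by a linear program: the EMD reduces to the area between the two cumulative distributions. Below is a sketch under the assumption of normalized histograms over shared, equally spaced PSNR bins (so the denominator of (5) equals one, and bin-index distances are converted to dB by the bin width); the 0.5 dB binning and the function name are our own illustrative choices:

```python
import numpy as np

def domain_mismatch(psnr_source, psnr_target, bin_edges=None):
    """DM (5): EMD between normalized PSNR histograms, reported in dB."""
    if bin_edges is None:
        bin_edges = np.arange(0.0, 50.5, 0.5)   # 0.5 dB bins, illustrative
    p, _ = np.histogram(psnr_source, bins=bin_edges)
    q, _ = np.histogram(psnr_target, bins=bin_edges)
    p = p / p.sum()                              # normalize: total flow = 1
    q = q / q.sum()
    bin_width = bin_edges[1] - bin_edges[0]
    # 1-D closed form of the optimal-flow cost (4): area between the CDFs
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * bin_width)
```

Shifting all target PSNR scores by a constant offset moves the histogram by the corresponding number of bins, so the DM then equals exactly that offset in dB.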
4. Evaluation and Discussion
In this section, we will introduce the training setup, describe the performance of the segmentation and autoencoder networks on different datasets, and analyze the proposed method for domain mismatch estimation.
For experimental evaluation, we use Cityscapes [14], containing images from several German cities, Berkeley DeepDrive [60], containing data from the U.S., and KITTI [19], containing data from a single German city including surroundings. All these datasets provide the same class labeling scheme for segmentation and are therefore compatible. Furthermore, they all provide a training and a validation set with segmentation labels. For our experiments we distinguish between the Cityscapes training set (CS_train), the Cityscapes validation set (CS_val), the Berkeley DeepDrive training set (BDD_train), the Berkeley DeepDrive validation set (BDD_val), and the KITTI set (which consists of all first images in the stereo training set of KITTI 2015). CS_train and CS_val consist of 2,975 and 500 images, respectively, and are downsampled before processing. BDD_train and BDD_val have 7,000 and 1,000 images, respectively. Finally, the KITTI training split has 200 images.

The models for the semantic segmentation and the autoencoder are trained with PyTorch [43], either with CS_train or BDD_train, on an NVIDIA GTX 1080 Ti GPU. The encoder of the segmentation network is pretrained on ImageNet [46]. For data augmentation, the training images are randomly flipped horizontally and randomly cropped. After the pretraining, we continue training for 200 epochs with a batch size of 6, an initial learning rate of 0.0005, an Adam optimizer [28], and a weight decay of 0.0002, while ignoring the background class.

The GAN training procedure first optimizes the generator while fixing the discriminator weights, and vice versa afterwards. The autoencoder is likewise trained with an Adam optimizer. Concerning the autoencoder loss function (1), the weighting factor α₁ is used for the MSE loss and α₂ for the feature matching loss. Furthermore, early stopping w.r.t. the PSNR on the validation set is applied.

Table 1: Mean PSNR results for the autoencoder and mIoU results for the semantic segmentation, trained and evaluated on various datasets ("–" marks values not recoverable here).

Trained on | Model        | Measure | CS_train | CS_val   | BDD_train | BDD_val  | KITTI    | Kendall τ
CS_train   | Autoencoder  | PSNR    | –        | 28.24 dB | 21.01 dB  | 21.26 dB | 20.13 dB | –
CS_train   | Segmentation | mIoU    | –        | –        | –         | –        | –        |
BDD_train  | Autoencoder  | PSNR    | 25.18 dB | 25.13 dB | 25.87 dB  | 25.37 dB | 22.10 dB | –
BDD_train  | Segmentation | mIoU    | –        | –        | –         | –        | –        |

In this section, we first evaluate the performance of semantic segmentation and autoencoder individually with mIoU (2) and PSNR (3), respectively, for the different datasets. The results for the models trained on CS_train and BDD_train can be seen in Table 1. We also report Kendall's rank order coefficient τ (7), evaluating the degree of rank similarity of the PSNR and mIoU series.

For the CS_train-trained autoencoder, the PSNR performance is best on CS_train (obviously, because it is the training set) and second best on CS_val, which is also plausible since it is the in-domain case. Evaluated on BDD_train and BDD_val, the PSNR falls by several dB compared to the source domain, to 21.01 dB and 21.26 dB, respectively, due to the domain shift. The lowest performance is achieved on KITTI with 20.13 dB. We observe a similar ranking of performances in the semantic segmentation results of the segmentation trained on CS_train, with the surprising exception that the KITTI dataset this time does not yield the largest drop in mIoU. When comparing rank orders, only the positions of BDD_val and KITTI seem to be swapped. The rank order coefficient τ ∈ [−1, 1] still indicates a positive correlation in the behavior of PSNR and mIoU. Conclusively, we observe a huge domain-shift-induced performance drop for both models trained on the Cityscapes data and evaluated on BDD and KITTI data.

As before, the autoencoder trained on BDD_train performs best in its own domain, with a PSNR of 25.87 dB on the training set and 25.37 dB on the validation set. Evaluation on CS_train and CS_val is ranked third and fourth w.r.t. PSNR, even though the dB difference to the source domain is quite small. The performance on KITTI is again lower than on the other datasets. In the semantic segmentation, the mIoU again is best for the in-domain datasets BDD_train and BDD_val, while CS_train, CS_val, and KITTI achieve similar mIoU, which is a bit in contrast to the autoencoder performance, which indicates that KITTI has a larger domain shift than the others. Kendall's τ underlines the strong correlation of rank orders.

The models trained on CS_train and BDD_train show at least similar trends in both of the investigated tasks (autoencoder and segmentation), which encourages us to assign the autoencoder the role of an observer for the semantic segmentation. The general trend is: once the PSNR drops, the mIoU can also be assumed to drop, while the achievable absolute PSNR scores are data-dependent. This makes it a bit tedious to define a threshold for an acceptable domain shift, since it varies for each training dataset. Rank orders are not necessarily kept in the low-PSNR regime (BDD_train, BDD_val, KITTI for models trained on CS_train, and CS_val, KITTI for models trained on BDD_train); however, even here we can reliably assume that the mIoU drops as well, to unacceptably low values. Already in this preliminary experiment, investigating mean performance scores, we observed that if the semantic segmentation performance (mIoU) drops below the training or validation set performance, the autoencoder performance (PSNR) drops as well.

For better visualization of the domains, Figure 3 shows PSNR histograms resulting from the evaluation on the individual datasets. For both source domains CS and BDD, evaluating the training set itself yields smooth distributions of PSNR scores around their mean values, as expected (almost Gaussian), see Figures 3a and 3e.

Figure 3: Histograms P(µ) (source domain) and Q(ν) (target domain), with µ, ν representing autoencoder performance PSNRs with models trained and evaluated on different datasets. The upper histograms (red) stem from the autoencoder trained on CS_train, evaluated on (a) CS_train, (b) CS_val, (c) BDD_val, and (d) KITTI; the lower ones (blue) from the autoencoder trained on BDD_train, evaluated on (e) BDD_train, (f) BDD_val, (g) CS_val, and (h) KITTI.
The transition to the validation set in the source domain, and further on to one of the target domains, implies a decrease of the mean PSNR and an increase of the standard deviation of the distribution, as can be seen in Figures 3b to 3d for the CS-trained autoencoder, and in Figures 3f to 3h for the BDD-trained autoencoder. Noteworthy, the KITTI dataset stems from only a single German city, which may be the cause of the small standard deviation in the histograms 3d and 3h.

Table 2 shows the mIoU differences and the earth mover's distance (EMD) scores, namely our proposed domain mismatch scores DM (5), based on the PSNR histograms, for the segmentation and the autoencoder, respectively. Also, Kendall's rank order coefficient τ is provided, here evaluating the rank order similarity of the DM and ∆mIoU series. The segmentation performance drop is simply stated as the mIoU difference between the training domains (CS_train and BDD_train, respectively) and the target domains.

Table 2: Domain mismatch metric DM (5), absolute mIoU differences between the references (CS_val and BDD_val) and various datasets, and Kendall's rank order τ ("–" marks values not recoverable here).

Trained on | Reference | Model        | Measure | CS_train | CS_val  | BDD_train | BDD_val | KITTI   | Kendall τ
CS_train   | CS_train  | Autoencoder  | DM      | 0.00 dB  | 1.31 dB | 8.53 dB   | 8.29 dB | 9.41 dB | –
CS_train   | CS_train  | Segmentation | ∆mIoU   | –        | –       | –         | –       | –       |
BDD_train  | BDD_train | Autoencoder  | DM      | 0.68 dB  | 0.74 dB | 0.00 dB   | 0.51 dB | 3.77 dB | –
BDD_train  | BDD_train | Segmentation | ∆mIoU   | –        | –       | –         | –       | –       |

In consideration of the results for the Cityscapes-trained models, the DM metric for the validation set (here: 1.31 dB) indicates what is to be considered as the default (or: typical) domain shift for in-domain data. For each of the out-of-domain shifts, regardless of whether the target domain is BDD_train, BDD_val, or KITTI, the autoencoder reconstruction performance drops significantly, so our DM metric increases to more than 8 dB. In each of these cases, the drop in mIoU is also large, with ∆mIoU being more than 50% absolute for both BDD splits. Again, the mIoU drop on KITTI is not the worst (although the DM metric is), but the substantial absolute mIoU drop definitely justifies considering KITTI as "out-of-domain", as it is marked by the high DM = 9.41 dB. The pure rank orders of the DM metric and the ∆mIoU series lead to a rank order coefficient τ that still indicates a positive rank correlation.

Considering the models trained on BDD_train, the validation set domain shift of 0.51 dB is smaller than for Cityscapes, corresponding to a small mIoU difference to the training set. The domain mismatch estimate DM for both CS datasets is a bit higher than for BDD_val, so we assume that DM and ∆mIoU behave proportionally. And indeed, as the DM metric increases from 0.51 dB for BDD_val over 0.68 dB for CS_train to 0.74 dB for CS_val, the ∆mIoU also increases, following the same rank order of datasets. Interestingly again, the DM metric for KITTI is the highest (here: by far the highest), which is appropriate, since ∆mIoU is more than doubled w.r.t. the source validation set BDD_val. Due to the concordant rank order of the DM metric and the ∆mIoU in all but one case, Kendall's rank order coefficient for the BDD-trained models is correspondingly high.

We infer that the autoencoder is even more sensitive to domain shifts than the semantic segmentation, since for both training datasets the PSNR evaluated on KITTI dropped significantly while the mIoU showed a smaller decrease. Nevertheless, for small values of our DM metric, the experiments show that the rank orders are concordant, as can especially be seen for the BDD-trained models. Therefore, we propose to set a threshold for the DM metric to define its functional scope, in which the rank orders of the DM metric are expected to correspond to those of the ∆mIoU. The threshold should be two times the DM score of the in-domain validation set, so it depends on the specific domain the model is trained and validated in. Hence, for the CS-trained autoencoder the threshold lies at 2 × 1.31 dB = 2.62 dB, excluding BDD_train, BDD_val, and KITTI from the functional scope (meaning these are clearly out-of-domain datasets!), and for the BDD-trained autoencoder the threshold is 2 × 0.51 dB = 1.02 dB, which excludes only the KITTI dataset.
Inside its functional scope, the DM metric makes a statement about the semantic segmentation performance with concordant rank ordering. In comparison to the PSNR, we believe that the DM metric is the better generalizing metric, since the proposed threshold relies on PSNR distributions and is therefore less sensitive to single unusual images, which do not yet necessarily make up a domain shift. As a result, the autoencoder is well-suited as a batch-type observer, since the DM metric exhibits reliable gradual estimations of the domain shift until the DM threshold is exceeded, where the PSNR will collapse even before the mIoU of the semantic segmentation. DM results beyond the DM threshold always indicate a critical domain shift.
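The batch-type decision rule described above can be stated in a few lines; the function names are ours, and the 1.31 dB in-domain validation DM is the Cityscapes value from Table 2:

```python
def dm_threshold(dm_val):
    """Functional-scope threshold: twice the in-domain validation DM."""
    return 2.0 * dm_val

def in_functional_scope(dm, dm_val):
    """True if a batch's DM still permits rank-order statements on mIoU;
    False flags a critical domain shift (batch is out-of-domain)."""
    return dm <= dm_threshold(dm_val)
```

For the CS-trained model this reproduces the 2.62 dB threshold, which excludes the BDD splits (DM > 8 dB) and KITTI (9.41 dB) from the functional scope.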
5. Conclusions
Observing the performance of safety-critical perception functions during autonomous driving is essential, because vehicles are by nature exposed to various environments, implying domain shifts. We proposed a novel framework to monitor the quality of a semantic segmentation. We accomplish this by estimating the domain shift with an autoencoder trained in a self-supervised fashion. A first approach is to evaluate mean PSNR scores, which already show a strong rank order correlation to the mIoU. However, comparing autoencoder outputs for various datasets by the earth mover's distance yields a more stable estimation of the domain shift, which we propose as the domain mismatch metric DM. We found that the task of reconstructing an image is even more sensitive to domain shifts than semantic segmentation, being pixel-wise classification, which ultimately results in a certain functional scope for the autoencoder, beyond which input data can clearly be classified as "out-of-domain". Within the valid functional scope of the autoencoder, the rank orders of our DM metric and the mIoU differences are strongly correlated. The proposed DM metric is therefore shown to be well-suited for an observer.
Acknowledgment
The research leading to the results presented above is funded by the German Federal Ministry for Economic Affairs and Energy within the project "KI Absicherung – Safe AI for Automated Driving".
References
[1] Johannes Abel, Magdalena Kaniewska, Cyril Guillaume, Wouter Tirry, and Tim Fingscheidt. Objective Assessment of Artificial Speech Bandwidth Extension Approaches. In Proc. of 12. ITG Symposium Speech Communication, pages 190–194, Paderborn, Germany, Oct. 2016.
[2] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. Generative Adversarial Networks for Extreme Learned Image Compression. In Proc. of ICCV, pages 221–231, Seoul, Korea, Oct. 2019.
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. In Proc. of ICML, pages 214–223, Sydney, Australia, Aug. 2017.
[4] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
[5] Andreas Bär, Fabian Hüger, Peter Schlicht, and Tim Fingscheidt. On the Robustness of Teacher-Student Frameworks for Semantic Segmentation. In Proc. of CVPR Workshops, pages 1–9, Long Beach, CA, USA, June 2019.
[6] Jan-Aike Bolte, Andreas Bär, Daniel Lipinski, and Tim Fingscheidt. Towards Corner Case Detection for Autonomous Driving. In Proc. of IV, pages 366–373, Paris, France, June 2019.
[7] Jan-Aike Bolte, Markus Kamp, Antonia Breuer, Silviu Homoceanu, Peter Schlicht, Fabian Hüger, Daniel Lipinski, and Tim Fingscheidt. Unsupervised Domain Adaptation to Improve Image Segmentation Quality Both in the Source and Target Domain. In Proc. of CVPR Workshops, pages 1–10, Long Beach, CA, USA, June 2019.
[8] Karsten M. Borgwardt, Arthur Gretton, Malte J. Rasch, Hans-Peter Kriegel, Bernhard Schölkopf, and Alex J. Smola. Integrating Structured Biological Data by Kernel Maximum Mean Discrepancy. Bioinformatics, 22(14):e49–e57, July 2006.
[9] Gabriel J. Brostow, Julien Fauqueur, and Roberto Cipolla. Semantic Object Classes in Video: A High-Definition Ground Truth Database. Pattern Recognition Letters, 30(2):88–97, Jan. 2009.
[10] Rich Caruana. Multitask Learning. Machine Learning, 28(1):41–75, July 1997.
[11] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic Image Segmentation With Deep Convolutional Nets and Fully Connected CRFs. In Proc. of ICLR, pages 1–14, San Diego, CA, USA, May 2015.
[12] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv, June 2017. (arXiv:1706.05587).
[13] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L. Yuille. Attention to Scale: Scale-Aware Semantic Image Segmentation. In Proc. of CVPR, Las Vegas, NV, USA, June 2016.
[14] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proc. of CVPR, Las Vegas, NV, USA, June 2016.
Proc.of CVPR , pages 3213–3223, Las Vegas, NV, USA, June2016. 2, 4[15] Mark Everingham, Luc Van Gool, Christopher K. I.Williams, John Winn, and Andrew Zisserman. The PAS-CAL Visual Object Classes (VOC) Challenge.
InternationalJournal of Computer Vision , 88(2):303–338, Sept. 2010. 2 [16] Mark Everingham, Luc Van Gool, Christopher K. I.Williams, John Winn, and Andrew Zisserman. The PascalVisual Object Classes Challenge: A Retrospective.
Interna-tional Journal of Computer Vision (IJCV) , 111(1):98–136,Jan. 2015. 2[17] Yaroslav Ganin and Victor Lempitsky. Unsupervised Do-main Adaptation by Backpropagation. In
Proc. of ICML ,pages 1180–1189, Lille, France, July 2015. 2[18] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge.Image Style Transfer Using Convolutional Neural Networks.In
Proc. of CVPR , pages 2414–2423, Las Vegas, NV, USA,June 2016. 2[19] Andreas Geiger, Philip Lenz, Christoph Stiller, and RaquelUrtasun. Vision Meets Robotics: The KITTI Dataset.
Inter-national Journal of Robotics Research (IJRR) , 32(11):1231–1237, Aug. 2013. 2, 4[20] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, BingXu, David Warde-Farley, Sherjil Ozair, Aaron Courville, andYoshua Bengio. Generative Adversarial Nets. In
Proc. ofNIPS , pages 2672–2680, Montr´eal, Canada, Dec. 2014. 2[21] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch,Bernhard Sch¨olkopf, and Alexander Smola. A kernel two-sample test.
Journal of Machine Learning Research , 13:723–773, Mar. 2012. 2[22] Muhammad Haris, Gregory Shakhnarovich, and Norim-ichi Ukita. Deep Back-Projection Networks for Super-Resolution. In
Proc. of CVPR , pages 1664–1673, Salt LakeCity, UT, USA, June 2018. 2[23] Justin Johnson and Alexandre Alahi and Li Fei-Fei. Per-ceptual Losses for Real-Time Style Transfer and Super-Resolution. In
Proc. of ECCV , pages 694–711, Amsterdam,Netherlands, Oct. 2016. 3[24] Kaiming He and Georgia Gkioxari and Piotr Doll´ar and RossGirshick. Mask R-CNN. In
Proc. of ICCV , pages 2980–2988, Venice, Italy, Oct. 2017. 2[25] Tero Karras, Samuli Laine, and Timo Aila. A Style-Based Generator Architecture for Generative AdversarialNetworks. In
Proc. of CVPR , pages 4401–4410, Long Beach,CA, USA, June 2019. 2[26] Michael Kass, Andrew Witkin, and Demetri Terzopoulos.Snakes: Active Contour Models.
International Journal ofComputer Vision , 1(4):321–331, Jan. 1988. 2[27] Maurice G. Kendall. The Treatment of Ties in Ranking Prob-lems.
Biometrika , 33(3):239–251, Nov. 1945. 4[28] Diederik P. Kingma and Jimmy Ba. Adam: A Method forStochastic Optimization. In
Proc. of ICLR , pages 1–15, SanDiego, CA, USA, May 2015. 5[29] Chuan Li and Michael Wand. Precomputed Real-Time Tex-ture Synthesis With Markovian Generative Adversarial Net-works. In
Proc. of ECCV , pages 702–716, Amsterdam,Netherlands, Oct. 2016. 2[30] Tsung-Yi Lin, Piotr Doll´ar, Ross Girshick, Kaiming He,Bharath Hariharan, and Serge Belongie. Feature PyramidNetworks for Object Detection. In
Proc. of CVPR , pages2117–2125, Honolulu, HI, USA, July 2017. 2[31] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays,Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrenceitnick. Microsoft COCO: Common Objects in Context. In
Proc. of ECCV , pages 740–755, Zurich, Switzerland, Sept.2014. 2[32] Jonas L¨ohdefink, Andreas B¨ar, Nico M. Schmidt, FabianH¨uger, Peter Schlicht, and Tim Fingscheidt. On Low-BitrateImage Compression for Distributed Automotive Perception:Higher Peak SNR Does Not Mean Better Semantic Segmen-tation. In
Proc. of IV , pages 352–359, Paris, France, June2019. 2[33] Jonathan Long, Evan Shelhamer, and Trevor Darrell. FullyConvolutional Networks for Semantic Segmentation. In
Proc. of CVPR , pages 3431–3440, Boston, MA, USA, June2015. 2[34] Pauline Luc, Camille Couprie, Soumith Chintala, and JakobVerbeek. Semantic Segmentation using Adversarial Net-works. In
NIPS Workshop on Adversarial Training , pages1–12, Barcelona, Spain, Dec. 2016. 2[35] Chao Ma, Chih-Yuan Yang, Xiaokang Yang, and Ming-Hsuan Yang. Learning a No-Reference Quality Metric forSingle-Image Super-Resolution.
Computer Vision and Im-age Understanding , 158:1–16, May 2017. 2[36] Xudong Mao, Qing Li, Haoran Xie, Raymond Yiu KeungLau, Zhen Wang, and Stephen Paul Smolley. Least SquaresGenerative Adversarial Networks. In
Proc. of ICCV , pages2794–2802, Venice, Italy, Oct. 2017. 2, 4[37] Xiaojiao Mao, Chunhua Shen, and Yu-Bin Yang. Im-age Restoration Using Very Deep Convolutional Encoder-Decoder Networks With Symmetric Skip Connections. In
Proc. of NIPS , pages 2802–2810, Barcelona, Spain, Dec.2016. 2[38] Shervin Minaee, Yuri Boykov, Fatih Porikli, Antonio Plaza,Nasser Kehtarnavaz, and Demetri Terzopoulos. Image Seg-mentation Using Deep Learning: A Survey. arXiv , Jan. 2020.(arXiv:2001.05566). 2[39] Mehdi Mirza and Simon Osindero. Conditional GenerativeAdversarial Nets. arXiv , Nov. 2014. (arXiv:1411.1784). 2[40] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han.Learning Deconvolution Network for Semantic Segmenta-tion. In
Proc. of ICCV , pages 1520–1528, Las Condes, Chile,Dec. 2015. 2[41] Sinno Jialin Pan and Qiang Yang. A Survey on TransferLearning.
IEEE Transactions on Knowledge and Data Engi-neering , 22(10):1345–1359, Oct. 2010. 2[42] Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Euge-nio Culurciello. ENet: A Deep Neural Network Architecturefor Real-Time Semantic Segmentation. arXiv , June 2016.(arXiv:1606.02147). 2[43] Adam Paszke, Sam Gross, Soumith Chintala, GregoryChanan, Edward Yang, Zachary DeVito, Zeming Lin, AlbanDesmaison, Luca Antiga, and Adam Lerer. Automatic Dif-ferentiation in PyTorch. In
Proc. of NIPS - Workshops , pages1–4, Long Beach, CA, USA, Dec. 2017. 5[44] Eduardo Romera, Jos´e M. ´Alvarez, Luis M. Bergasa,and Roberto Arroyo. ERFNet: Efficient Residual Fac-torized ConvNet for Real-Time Semantic Segmentation.
IEEE Transactions on Intelligent Transportation Systems ,19(1):263–272, Jan. 2018. 2, 3 [45] German Ros, Laura Sellart, Joanna Materzynska, DavidVazquez, and Antonio M. Lopez. The Synthia Dataset: ALarge Collection of Synthetic Images for Semantic Segmen-tation of Urban Scenes. In
Proc. of CVPR , pages 3234–3243,Las Vegas, NV, USA, June 2016. 2[46] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San-jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy,Aditya Khosla, Michael Bernstein, Alexander C. Berg, andLi Fei-Fei. ImageNet Large Scale Visual Recognition Chal-lenge.
International Journal of Computer Vision (IJCV) ,115(3):211–252, Dec. 2015. 5[47] David Salomon.
Data Compression: The Complete Refer-ence . Springer Science & Business Media, 2004. 2[48] Hossein Talebi and Peyman Milanfar. NIMA: Neural ImageAssessment.
IEEE Trans. on Image Processing , 27(8):3998–4011, Sept. 2018. 2[49] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and FerencHusz´ar. Lossy Image Compression With Compressive Au-toencoders. In
Proc. of ICLR , pages 1–19, Toulon, France,Apr. 2017. 2[50] Alexey Tsymbal. The Problem of Concept Drift: Definitionsand Related Work.
Computer Science Department, TrinityCollege Dublin , 106(2):58, Apr. 2004. 2[51] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Im-proved Texture Networks: Maximizing Quality and Diver-sity in Feed-Forward Stylization and Texture Synthesis. In
Proc. of CVPR , pages 6924–6932, Honulu, HI, USA, July2017. 2[52] Hemanth Demakethepalli Venkateswara, ShayokChakraborty, and Sethuraman Panchanathan. Deep-Learning Systems for Domain Adaptation in ComputerVision: Learning Transferable Feature Representations.
IEEE Signal Processing Magazine , 34(6):117–129, Nov.2017. 2[53] Francesco Visin, Marco Ciccone, Adriana Romero, KyleKastner, Kyunghyun Cho, Yoshua Bengio, Matteo Mat-teucci, and Aaron Courville. Reseg: A Recurrent NeuralNetwork-Based Model for Semantic Segmentation. In
Proc.of CVPR - Workshops , pages 41–48, Las Vegas, NV, USA,June 2016. 2[54] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao,Jan Kautz, and Bryan Catanzaro. High-Resolution Im-age Synthesis and Semantic Manipulation With ConditionalGANs. In
Proc. of CVPR , pages 8798–8807, Salt Lake City,UT, USA, June 2018. 3[55] Zhou Wang, Alan Conrad Bovik, Hamid Rahim Sheikh, andEero P. Simoncelli. Image Quality Assessment: From Er-ror Visibility to Structural Similarity.
IEEE Trans. on ImageProcessing , 13(4):600–612, Apr. 2004. 2[56] Zhou Wang, Eero Simoncelli, and Alan Bovik. Multi-ScaleStructural Similarity for Image Quality Assessment. In
Proc.of ACSSC , pages 1398–1402, Pacific Grove, CA, USA, Nov.2003. 2[57] Gerhard Widmer and Miroslav Kubat. Learning in the Pres-ence of Concept Drift and Hidden Contexts.
Machine Learn-ing , 23(1):69–101, Apr. 1996. 258] Raymond A. Yeh, Chen Chen, Teck Yian Lim, Alexander G.Schwing, Mark Hasegawa-Johnson, and Minh N. Do. Se-mantic Image Inpainting With Deep Generative Models. In
Proc. of CVPR , pages 5485–5493, Honolulu, HI, USA, July2017. 2[59] Yossi Rubner and Carlo Tomasi and Leonidas J. Guibas. TheEarth Mover’s Distance as a Metric for Image Retrieval.
In-ternational Journal of Computer Vision , 40(2):99–121, Nov.2000. 2, 3, 4[60] Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, MikeLiao, Vashisht Madhavan, and Trevor Darrell. BDD100K: ADiverse Driving Video Database With Scalable AnnotationTooling. arXiv , Aug. 2018. (arXiv:1805.04687). 2, 4[61] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, andThomas S. Huang. Generative Image Inpainting With Con-textual Attention. In
Proc. of CVPR , pages 5505–5514, SaltLake City, UT, USA, June 2018. 2[62] Matthew D. Zeiler and Rob Fergus. Visualizing and Under-standing Convolutional Networks. In
Proc. of ECCV , pages818–833, Zurich, Switzerland, Sept. 2014. 2[63] Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, andRob Fergus. Deconvolutional Networks. In