Transferring and Regularizing Prediction for Semantic Segmentation
Yiheng Zhang, Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Dong Liu, Tao Mei
† University of Science and Technology of China, Hefei, China    ‡ JD AI Research, Beijing, China    § City University of Hong Kong, Kowloon, Hong Kong

Abstract
Semantic segmentation often requires a large set of images with pixel-level annotations. In the view of extremely expensive expert labeling, recent research has shown that the models trained on photo-realistic synthetic data (e.g., computer games) with computer-generated annotations can be adapted to real images. Despite this progress, without constraining the prediction on real images, the models will easily overfit on synthetic data due to severe domain mismatch. In this paper, we novelly exploit the intrinsic properties of semantic segmentation to alleviate such problem for model transfer. Specifically, we present a Regularizer of Prediction Transfer (RPT) that imposes the intrinsic properties as constraints to regularize model transfer in an unsupervised fashion. These constraints include patch-level, cluster-level and context-level semantic prediction consistencies at different levels of image formation. As the transfer is label-free and data-driven, the robustness of prediction is addressed by selectively involving a subset of image regions for model regularization. Extensive experiments are conducted to verify the proposal of RPT on the transfer of models trained on GTA5 and SYNTHIA (synthetic data) to the Cityscapes dataset (urban street scenes). RPT shows consistent improvements when injecting the constraints on several neural networks for semantic segmentation. More remarkably, when integrating RPT into the adversarial-based segmentation framework, we report to-date the best results: mIoU of 53.2%/51.7% when transferring from GTA5/SYNTHIA to Cityscapes, respectively.
1. Introduction
Semantic segmentation aims at assigning semantic labels to every pixel of an image. Leveraging on CNNs [18, 22, 42, 45, 46], significant progress has been reported for this fundamental task [6, 7, 30, 36]. One drawback of the existing approaches, nevertheless, is the requirement of large quantities of pixel-level annotations, such as in the VOC [15], COCO [28] and Cityscapes [11] datasets, for model training. Labeling of semantics at the pixel level is cost expensive and time consuming. For example, the Cityscapes dataset is composed of 5,000 high-quality pixel-wise annotated images, and the annotation on a single image is reported to take more than 1.5 hours.

An alternative is by utilizing synthetic data, which is largely available in 3D engines (e.g., SYNTHIA [41]) and 3D computer games (e.g., GTA5 [40]). The ground-truth semantics of these data can be automatically generated without manual labeling. Nevertheless, in the case where the synthetic data is different from the real images, the domain gap might be difficult to bridge. Unsupervised domain adaptation is generally regarded as an appealing way to address the problem of domain gap. The existing approaches include narrowing the gap by transferring images across domains [14, 32, 50] and learning domain-invariant representation via an adversarial mechanism [13, 31, 49].

In this paper, we consider model overfitting in the source domain as the major cause of domain mismatch. As shown in Figure 1(a), although Fully Convolutional Networks (FCN) perfectly segment the synthetic image by correct labeling of pixels, directly deploying this model on real images yields poor results. Instead of leveraging training samples in the target domain for model fine-tuning, this paper explores label-free constraints to alleviate the problem of model overfitting. These constraints are intrinsic and generic in the context of semantic segmentation. Figure 1(b)∼(d) illustrate the three label-free constraints being investigated. The first two constraints, namely patch-based and cluster-based consistencies, guide the segmentation based on the prediction consistency among the pixels in an image patch and among the clusters of patches sharing similar visual properties, respectively.

* This work was performed at JD AI Research.
The last criterion, namely spatial logic, contextualizes the prediction of labels based on the spatial relation between image patches. Based on these criteria, we propose a novel Regularizer of Prediction Transfer (RPT) for transferring the model trained on synthetic data for semantic segmentation of real images.
Figure 1. The examples of (a) predictions on two domains by fully convolutional networks trained on synthetic data; (b)∼(d) the three evaluation criteria we studied, i.e., patch-based consistency, cluster-based consistency and spatial logic.

The main contribution of this paper is on the exploration of label-free data-driven constraints for the transferring of a model to bridge the domain gap. These constraints are imposed as regularizers during training to transfer an overfitted source model for proper labeling of pixels in the target domain. Specifically, at the lowest level of regularization, majority voting is performed to derive a dominative category for each image patch. The dominative category serves as a local cue for pixels with low prediction confidence to adjust their label prediction during training. The patch-level regularization is then extended to a higher level of regularization to explore cluster-level and context-level prediction consistency. Despite its simplicity, the three regularizers, when jointly optimized in a fully convolutional network with adversarial learning, show impressive performances by outperforming several state-of-the-art methods, when transferring the models trained on GTA5 and SYNTHIA for semantic segmentation on the Cityscapes dataset.
2. Related Work
CNN Based Semantic Segmentation.
As one of the most challenging computer vision tasks, semantic segmentation has received intensive research attention. With the surge of deep learning and convolutional neural networks (CNNs), the Fully Convolutional Network (FCN) [30] successfully serves as an effective approach that employs CNNs to perform dense semantic prediction. Following FCN, various schemes, ranging from multi-path feature aggregation and refinement [16, 27, 36, 38, 56, 58] to multi-scale context extraction and integration [5, 6, 17, 39, 51, 53, 59], have been developed and achieved great success in leveraging contextual information for semantic segmentation. Post-processing techniques, such as CRF [6] and MRF [29], could further be applied to take the spatial consistency of labels into account and improve the predictions from FCNs. Considering that such methods typically rely on datasets with pixel-level annotations, which are extremely expensive and laborious to collect, researchers have also strived to utilize weaker forms of annotation, such as image-level tags [34, 37], bounding boxes [12], scribbles [2] and statistics [35], for semantic segmentation. The development of computer graphics techniques provides an alternative approach that exploits synthetic data with free annotations. This work aims to study methods of applying a semantic segmentation model learnt on computer-generated synthetic data to unlabeled real data.
Domain Adaptation of Semantic Segmentation.
To alleviate the issue of expensive labeling efforts in collecting pixel-level annotations, domain adaptation is studied for semantic segmentation. FCNWild [20], which is one of the early works, attempts to align the features of different domains from both global and local aspects by adversarial training. Curriculum [55] proposes a curriculum-style learning approach to bridge the domain gap between synthetic and real data. Later on, similar to domain adaptation in image recognition and object detection [3, 33, 52], visual appearance-level and/or representation-level adaptation is exploited in [14, 32, 47, 57] for this task. [14, 32] perform an image-to-image translation that transfers the synthetic images to the real domain at the appearance level. From the perspective of representation-level adaptation, AdaSegNet [47] proposes to apply adversarial learning on segmentation maps for adapting the structured output space. FCAN [57] employs the two levels of adaptation simultaneously, in which the appearance gap between synthetic and real images is minimized and the network is encouraged to learn domain-invariant representations. Several other strategies [4, 9, 10, 19, 23, 25, 61] have been performed for cross-domain semantic segmentation. For example, ROAD [10] devises a target guided distillation module and a spatial-aware adaptation module for real style and distribution orientation. Labels from the source domain are transferred to the target domain as additional supervision in CyCADA [19]. Depth maps, which are available in virtual 3D environments, are utilized as geometric information to reduce domain shift in [9]. [23, 25, 61] treat target predictions as the guide for learning a model applicable to the images in the target domain by self-supervised learning. [4] proposes a domain invariant structure extraction framework that decouples the structure and texture representations of images and improves the performance of segmentation.
Summary.
Most of the aforementioned approaches mainly investigate the problem of domain adaptation for semantic segmentation through bridging the domain gap during training. Our work is different in the way that we seek additional regularization for the prediction in the target domain based on the intrinsic and generic properties of the semantic segmentation task. Such a solution formulates an innovative and promising research direction for this task.
Figure 2. Example of pixels to be unpunished (a) or punished (b) in optimization. (a) For the unpunished cases, some pixels are very confident in a class differed from the dominative category. (b) For the punished cases, most pixels inside the region predict relatively high probabilities for the dominative category.
3. Regularizer of Prediction Transfer
We start by introducing the Regularizer of Prediction Transfer (RPT) for semantic segmentation. Three criteria are defined to assess the quality of segmentation. The result of the assessment is leveraged to guide the transfer of a model learnt in the source domain for semantic segmentation in the target domain.
The idea is to enforce all pixels in a patch to be consistent in the prediction of semantic labels. Here, a patch is defined as a superpixel that groups neighboring pixels with similar visual appearance. We employ Simple Linear Iterative Clustering (SLIC) [1], which is both speed and memory efficient in the generation of superpixels by adopting the k-means algorithm. Given one image x_t from the target domain, SLIC splits the image into N superpixels {S_i | i = 1, ..., N}. Each superpixel S_i = {p_i^j | j = 1, ..., M_i} is composed of M_i adjacent pixels with similar appearance. We assume that all or the majority of its pixels will be annotated with the same semantic label. Here, the dominative category ŷ_i of a superpixel is defined as the label predicted by the largest number of pixels in this superpixel.

As SLIC considers only the visual cue, a superpixel usually contains multiple regions of different semantic labels. Simply involving all pixels in network optimization can run into the risk of skewed optimization. To address this problem, a subset of pixels is masked out from patch-based regularization. Specifically, in superpixel S_i, pixels p_i^j ∈ S_i are clustered into two groups depending on the predicted probability of the dominative category ŷ_i: (a) P_seg(ŷ_i | p_i^j) <= λ_pc means that the probability is less than or equal to a pre-defined threshold λ_pc. In other words, the pixel p_i^j is predicted with labels different from the dominative category with relatively high probability. This group of pixels should be exempted from regularization. (b) P_seg(ŷ_i | p_i^j) > λ_pc represents that p_i^j has relatively higher confidence to be predicted as the dominative category. In this case, the dominative category ŷ_i is leveraged as a cue to guide the prediction of these pixels.

Figure 3. Feature space visualization of seven superpixel clusters using t-SNE. The dominative category is given for each cluster.
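As a small illustration, the majority-vote assignment of dominative categories might be sketched as follows (pure numpy; the superpixel index map would come from SLIC in practice, and the function and argument names are ours, not the paper's):

```python
import numpy as np

def dominative_categories(segments, pred_labels):
    """Given a superpixel index map `segments` (H x W ints, e.g. from SLIC)
    and per-pixel predicted labels `pred_labels` (H x W ints), return the
    dominative category of each superpixel by majority vote."""
    dominative = {}
    for s in np.unique(segments):
        labels_in_s = pred_labels[segments == s]
        dominative[int(s)] = int(np.bincount(labels_in_s).argmax())
    return dominative
```

Ties are broken toward the smaller label index by `argmax`; any consistent rule would serve the regularizer equally well.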
To this end, the loss for patch-based consistency regularization of a target image x_t is formulated as:

L_pc(x_t) = − Σ_{i,j} I(P_seg(ŷ_i | p_i^j) > λ_pc) · log P_seg(ŷ_i | p_i^j),   (1)

where I(·) is an indicator function that selectively masks out pixels from optimization by thresholding. Figure 2 shows examples of superpixels that are masked out (i.e., unpunished) and involved (i.e., punished) in optimization.

In addition to patches, we also enforce the consistency of label prediction among the clusters of patches that are visually similar. Specifically, cluster-level regularization imposes a constraint that the superpixels with similar visual properties should predict the cluster dominative category as their label. To this end, superpixels are further grouped into clusters. The feature representation of a superpixel is extracted through ResNet-101 [18], which is pre-trained on the ImageNet dataset [42]. The feature vector utilized for clustering is generated by average pooling the feature maps of the superpixel region from the res_c layer. All the superpixels from target domain images are grouped into K = 2048 clusters by the k-means algorithm. The cluster-level dominative category ỹ_k is determined by majority voting among the superpixels within a cluster. Figure 3 visualizes seven examples of clusters and the corresponding dominative categories by t-SNE [48]. As clustering is imperfect, it is expected that some superpixels will be incorrectly grouped. Denote P_seg(ỹ_k | p_i^j), where p_i^j ∈ S_i ∈ C_k, as the probability of predicting the cluster-level dominative category as the label for pixel p_i^j.
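Both the patch-based loss in Eq. (1) and its cluster-based counterpart are cross-entropies masked by a confidence threshold. A minimal numpy sketch (the names and the thresholding interface are our own assumptions; the target category per superpixel is ŷ_i or ỹ_k depending on which regularizer is computed):

```python
import numpy as np

def masked_consistency_loss(probs, segments, category_of, threshold):
    """Cross-entropy toward a per-superpixel target category, counting only
    pixels whose probability for that category exceeds `threshold` (the
    indicator I(.)); low-confidence pixels are left unpunished."""
    loss = 0.0
    for s, y in category_of.items():
        p = probs[..., y][segments == s]   # P_seg(y | pixel) inside superpixel s
        loss -= np.sum(np.log(p[p > threshold]))
    return loss
```

Here `probs` is an H x W x C softmax map and `category_of` maps superpixel id to its dominative category.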
Similar to patch-based consistency regularization, pixels with low confidence on the cluster-level category will not be punished during network optimization. Thus, the loss of cluster-based consistency regularization for a target image x_t is defined as:

L_cc(x_t) = − Σ_{i,j: S_i ∈ C_k} I(P_seg(ỹ_k | p_i^j) > λ_cc) · log P_seg(ỹ_k | p_i^j),   (2)

where λ_cc is a pre-defined threshold to gate whether a pixel should be masked out from regularization.

A useful cue to leverage for target-domain segmentation is the spatial relation between semantic labels. For instance, a superpixel of category sky is likely on the top of another superpixel labeled with building or road, and not vice versa. These relations are expected to be invariant across the source and target domains. The supportive hypothesis behind this is introduced in [4]: the high-level structure information of an image is informative for semantic segmentation and can be readily shared across domains. As such, the motivation of spatial logic is to preserve the spatial relations learnt in the source domain in the target domain. Formally, we exploit an LSTM encoder-decoder architecture to learn the vertical relation between superpixels, as shown in Figure 4. The main goal of this architecture is to speculate the category of the masked segment in the sequence according to context information. Then, the produced probability can be used to evaluate the logical validity of the predicted category in the masked segment. Suppose we have a prediction sequence Y = {y_1, y_2, ..., y_{T−1}, y_T} consisting of T superpixel predictions sliced from one column of the prediction map. Let y_t ∈ R^{C+1} denote the one-hot vector of the t-th prediction in the sequence; the dimension of y_t, i.e., C + 1, is the number of semantic categories plus one symbol as an identification of masked prediction.
The masked prediction sequence Ŷ, which is fed into the LSTM encoder, is generated by masking a segment of consecutive predictions with the identical semantic category in the original sequence Y. The LSTM encoder embeds the masked prediction sequence Ŷ into a sequence representation. The LSTM decoder, which is attached on the top of the encoder, then speculates the categories of the masked segment and reconstructs the original sequence Y. To learn the aforementioned spatial logic, the encoder-decoder architecture is optimized with the cross-entropy loss supervised by the labels from the source domain. Next, the optimized model can be utilized to estimate the validity of each prediction from the view of spatial logic. For the target image x_t, we first slice the prediction map
into several columns consisting of vertically neighbored superpixels. The patch-level dominative categories of the superpixels in the column are organized into a prediction sequence. For the superpixel S_i in the column, the spatial logical probability P_logic(ŷ_i | S_i) is measured by the LSTM encoder-decoder only when the prediction of this superpixel is masked in the input sequence. Once this probability is lower than the threshold λ_sl, we consider this prediction to be illogical and punish the prediction of ŷ_i by the segmentation network. The loss of spatial logic regularization is computed as:

L_sl(x_t) = Σ_{i,j} I(P_logic(ŷ_i | S_i) < λ_sl) · log P_seg(ŷ_i | p_i^j),   (3)

where P_logic(·) denotes the prediction from the LSTM encoder-decoder architecture.

Figure 4. The LSTM encoder-decoder architecture to learn the spatial logic in the prediction map.
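The construction of the masked input sequence Ŷ from one column of patch-level predictions can be sketched as follows (a numpy illustration; the run selection and all names are assumptions of ours, and the LSTM itself is omitted):

```python
import numpy as np

def masked_sequence(column_preds, start, num_classes):
    """Build the masked input sequence Y-hat for the LSTM encoder: the
    maximal run of identical patch-level predictions beginning at `start`
    is replaced by the extra mask symbol (index num_classes), then the
    sequence is one-hot encoded into R^(num_classes + 1)."""
    seq = np.asarray(column_preds)
    end = start
    while end < len(seq) and seq[end] == seq[start]:
        end += 1
    masked = seq.copy()
    masked[start:end] = num_classes          # the (C+1)-th symbol marks the mask
    one_hot = np.eye(num_classes + 1, dtype=np.float32)[masked]
    return masked, one_hot
```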
4. Semantic Segmentation with RPT
The proposed Regularizer of Prediction Transfer (RPT) can be easily integrated into most of the existing frameworks for domain adaptation of semantic segmentation. Here, we choose the widely adopted framework based on adversarial learning, as shown in Figure 5. The principle in this framework is equivalent to guiding the semantic segmentation in both domains by fooling a domain discriminator D with the learnt source and target representations. Formally, given the training set X_s = {x_s^i | i = 1, ..., N_s} in the source domain and X_t = {x_t^i | i = 1, ..., N_t} in the target domain, the adversarial loss L_adv is the average classification loss, which is formulated as:

L_adv(X_s, X_t) = −E_{x_t∼X_t}[log(D(x_t))] − E_{x_s∼X_s}[log(1 − D(x_s))],   (4)
Figure 5. The adversarial-based semantic segmentation adaptation framework with RPT. The shared FCN is learnt with adversarial loss for domain-invariant representations across two domains. The predictions on the source domain are optimized by the supervised label, while the target domain predictions are regularized by the RPT loss.

where E denotes the expectation over the image set. The discriminator D will attempt to minimize this loss by differentiating between source and target representations, and the shared Fully Convolutional Network (FCN) is learnt to fool the domain discriminator. Considering that the image region corresponding to the receptive field of each spatial unit in the final feature map is treated as an individual instance during semantic segmentation, the representations of such instances are expected to be invariant across domains. Thus we employ a fully convolutional domain discriminator whose outputs are the domain predictions of each image region corresponding to a spatial unit in the feature map. Since training labels are available in the source domain, the loss function there is based on the pixel-level classification loss L_seg. In contrast, due to the absence of training labels, the loss function in the target domain is defined upon the following three regularizers:

L_rpt(X_t) = E_{x_t∼X_t}[L_cc(x_t) + L_pc(x_t) + L_sl(x_t)].   (5)

Here, we empirically treat each loss in RPT equally. Thus, the overall objective of the segmentation framework integrates L_adv, L_seg and L_rpt as:

min_FCN { −ε · min_D L_adv(X_s, X_t) + L_seg(X_s) + L_rpt(X_t) },   (6)

where ε is the trade-off parameter to align the scale of the different losses.
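Under these definitions, Eq. (4) and the overall objective of Eq. (6) amount to the following arithmetic (a numpy sketch; the discriminator outputs are taken as given arrays, and ε is passed in since it is a tuned hyper-parameter):

```python
import numpy as np

def adversarial_loss(d_target, d_source):
    """Eq. (4): the discriminator D outputs the probability of 'target'
    for each spatial unit; both arguments are arrays of values in (0, 1)."""
    return -np.mean(np.log(d_target)) - np.mean(np.log(1.0 - d_source))

def overall_objective(l_adv, l_seg, l_rpt, eps):
    """Eq. (6): supervised source loss plus the RPT regularizer of Eq. (5),
    minus the (maximized) adversarial loss scaled by the trade-off eps."""
    return -eps * l_adv + l_seg + l_rpt
```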
5. Implementation
Training strategy.
Our proposed network is implemented in the Caffe [24] framework and the weights are trained by the SGD optimizer. We employ dilated FCN [6] originated from the ImageNet pre-trained ResNet-101 as our backbone, followed by a PSP module [59], unless otherwise stated. The domain discriminator for adversarial learning is borrowed from FCAN [57]. During the training stage, images are randomly cropped due to the limitation of GPU memory. Both random horizontal flipping and image resizing are utilized for data augmentation. To make the training process stable, we pre-train the FCN on data from the source domain with annotations. At the stage of pre-training, the "poly" policy whose power is fixed to 0.9 is adopted with the initial learning rate 0.001. Momentum and weight decay are 0.9 and 0.0005, respectively. Each mini-batch has 8 samples and the maximum number of training iterations is set as 30K. With the source domain pre-trained weights, we perform the domain adaptation by fine-tuning the whole adaptation framework equipped with our proposed RPT. The initial learning rate is 0.0001 and the total training iteration is 10K. Other training hyper-parameters remain unchanged. Following [26], we randomly selected 500 images from the official training set of Cityscapes as a general validation set. The hyper-parameters (λ_pc, λ_cc, λ_sl, ε) are all determined on this set.

Complexity of superpixel.
RPT highly relies on the quality of superpixel extraction. For robustness, superpixels with complex content ideally should be excluded from model training. The term "complex" refers to the distribution of semantic labels in a superpixel. In our case, we measure complexity based on the proportion of pixels being predicted with the dominative category over the number of pixels in a superpixel. A larger value implies consistency in prediction and hence it is safer to involve the corresponding superpixel in regularization. Empirically, RPT only regularizes the top 50% of superpixels. This empirical choice will be further validated in the next section.
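The top-fraction filtering by superpixel complexity described above might look like this (numpy sketch; the names are illustrative):

```python
import numpy as np

def filter_complex_superpixels(segments, pred_labels, dominative, keep=0.5):
    """Rank superpixels by prediction consistency (fraction of pixels
    predicted as the dominative category) and keep the top `keep` fraction
    for regularization; the rest are deemed too 'complex'."""
    scores = {}
    for s, y_hat in dominative.items():
        labels = pred_labels[segments == s]
        scores[s] = np.mean(labels == y_hat)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[: max(1, int(len(ranked) * keep))])
```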
State update of RPT.
During network optimization, the segmentation prediction P_seg, the superpixel dominative category ŷ_i and the cluster dominative category ỹ_k change gradually. Iteratively updating these "states" is computationally expensive because reassigning the categories to superpixels and clusters (e.g., ŷ_i and ỹ_k) requires the semantic predictions collected from the whole training set of the target domain. Considering that these predictions only change slightly during training, we first calculate these states before the optimization (without regularization) and fix them at the beginning of iterations. Then, we update the predictions or states N_su times evenly during training.
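The fixed-then-refresh schedule can be sketched as evenly spaced update points (an illustrative helper, not the paper's code):

```python
def state_update_iterations(total_iters, n_su):
    """Evenly spaced iterations at which the states (P_seg, patch- and
    cluster-level dominative categories) are recomputed over the whole
    target training set; between updates the states stay fixed."""
    step = total_iters // (n_su + 1)
    return [step * (i + 1) for i in range(n_su)]
```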
6. Experiments
The experiments are conducted on the GTA5 [40], SYNTHIA [41] and Cityscapes [11] datasets. The proposed RPT is trained on GTA5 and SYNTHIA (source domain) and Cityscapes (target domain). GTA5 is composed of 24,966 synthetic images. These images are generated by Grand Theft Auto V (GTA5), a modern computer game, to render city scenes. The pixels of these images are annotated with 19 classes that are compatible with the labels in Cityscapes. Similarly, SYNTHIA consists of synthetic images of urban scenes. Following [4, 9, 21, 25, 47], we use the subset SYNTHIA-RAND-CITYSCAPES, which has 9,400 images annotated with labels consistent with Cityscapes, for the experiments. Cityscapes is composed of 5,000 images. These images are split into three subsets of sizes 2,975, 500 and 1,525 for training, validation and testing, respectively. The pixels of these images are annotated with 19 classes. In the experiments, the training subset is treated as the target-domain training data, where the pixel-level annotation is assumed unknown to RPT. On the other hand, the target-domain testing data is from the validation subset. The same setting is also exploited in [4, 25, 47]. To this end, the performance of RPT is assessed by treating GTA5 as the source domain and Cityscapes as the target domain (i.e., GTA5 → Cityscapes), and similarly, SYNTHIA → Cityscapes. The metrics are per-class Intersection over Union (IoU) and mean IoU over all the classes.

Table 1. RPT performances in terms of mean IoU for domain adaptation of semantic segmentation on GTA5 → Cityscapes.

Method     | ResNet-50: FCN  +ABN  +ADV | ResNet-101: FCN  +ABN  +ADV
baseline   |            30.1  35.7  45.7 |             32.3  39.1  47.2

Figure 6. Two analysis experiments of (a) the effectiveness of state updating during training of RPT; (b) the percentage of filtered complex superpixels of RPT.
RPT is experimented on top of six different network architectures derived from FCN, which leverages either ResNet-50 or ResNet-101 as the backbone network. Specially, we adopt Adaptive Batch Normalization (ABN) to replace the mean and variance of BN in the original version of FCN, resulting in a variant of the network named FCN+ABN. Note that the BN layer is first learnt in the source domain and then replaced by ABN when being applied to the target domain. In addition, leveraging the adversarial training (ADV), another variant, FCN+ABN+ADV, is trained to learn domain-invariant representations.

We first verify the impact of N_su, the number of state updates, in RPT. Table 1 summarizes the impact on six variants of the network for domain adaptation on GTA5 → Cityscapes. All the networks are pre-trained on the ImageNet dataset and then injected with RPT. The superscript, RPT_n, refers to the number of times of state updating (see Table 1 for the exact number). The baselines are obtained by performing domain adaptation of semantic segmentation on the use of the corresponding network architectures, but without RPT. Overall, RPT improves the baseline without regularization. The improvement is consistently observed across the variants of networks, and is proportional to the number of state updates at the expense of computation cost. RPT achieves the best performance (mIoU = 52.6%), a 5.4% improvement over the baseline of the same network (FCN+ABN+ADV). Figure 6(a) shows the performance changes in terms of mIoU during training over different times of state updating. The training starts with model learning in the source domain. State updating, such as the assignment of dominative categories at superpixel and cluster levels, is then performed three times evenly during the training process in the target domain. Despite dropping in performance at the start of training after each state update, mIoU gradually improves and eventually converges to a higher value than in the previous round. Figure 6(b) shows the performance trend when the percentage of complex superpixels being excluded from learning gradually increases. As shown, the mIoU value constantly increases till reaching the level where 50% of superpixels are filtered. In the remaining experiments, we fix the setting of RPT to involve 50% of superpixels in regularization.

Next, we conduct an ablation study to assess the performance impact of different design components. We separately assess the three regularizations in RPT: patch-based consistency regularization (PCR), cluster-based consistency regularization (CCR) and spatial logic regularization (SLR). Table 2 details the contribution of each component towards the overall performance. FCN_adv, by considering adaptive batch normalization and adversarial learning (ABN+ADV), successfully boosts the mIoU from 32.3% to 47.2%. The result indicates the importance of narrowing the domain gap between synthetic data and real images. The three regularizations in the target domain introduce 1.8%, 0.6% and 0.8% of improvement, respectively. Furthermore, increasing the number of state updates during network optimization brings an additional 2.2% of improvement.

Table 2. Contribution of each design in RPT for domain adaptation of semantic segmentation on GTA5 → Cityscapes.

Method | ABN  ADV  PCR  CCR  SLR  SU | mIoU
FCN    |  -    -    -    -    -   - | 32.3

Table 3. Comparisons with the state-of-the-art unsupervised domain adaptation methods on GTA5 → Cityscapes adaptation. Please note that the baseline methods are divided into five groups: (1) representation-level domain adaptation by adversarial learning [10, 13, 19, 21, 31, 44, 47]; (2) appearance-level domain adaptation by image translation [14, 32]; (3) appearance-level + representation-level adaptation [4, 50, 57]; (4) self-learning [23, 26, 54, 61]; (5) others [8, 25, 43, 55, 60].
Method | road sdwlk bldng wall fence pole light sign vgttn trrn sky person rider car truck bus train mcycl bcycl | mIoU
FCNWild [20] | 70.4 32.4 62.1 14.9 5.4 10.9 14.2 2.7 79.2 21.3 64.6 44.1 4.2 70.4 8.0 7.3 0.0 3.5 0.0 | 27.1
Learning [44] | 88.0 30.5 78.6 25.2 23.5 16.7 23.5 11.6 78.7 27.2 71.9 51.3 19.5 80.4 19.8 18.3 0.9 20.8 18.4 | 37.1
ROAD [10] | 76.3 36.1 69.6 28.6 22.4 28.6 29.3 14.8 82.3 35.3 72.9 54.4 17.8 78.9 27.7 30.3 4.0 24.9 12.6 | 39.4
CyCADA [19] | 79.1 33.1 77.9 23.4 17.3 32.1 33.3 31.8 81.5 26.7 69.0 62.8 14.7 74.5 20.9 25.6 6.9 18.8 20.4 | 39.5
AdaptSegNet [47] | 86.5 36.0 79.9 23.4 23.3 23.9 35.2 14.8 83.4 33.3 75.6 58.5 27.6 73.7 32.5 35.4 3.9 30.1 28.1 | 42.4
CLAN [31] | 87.0 27.1 79.6 27.3 23.3 28.3 35.5 24.2 83.6 27.4 74.2 58.6 28.0 76.2 33.1 36.7 6.7 31.9 31.4 | 43.2
Conditional [21] | 89.2 49.0 70.7 13.5 10.9 38.5 29.4 33.7 77.9 37.6 65.8
Figure 7. Examples of semantic segmentation results on GTA5 → Cityscapes adaptation. The original images, their ground truth and comparative results at different stages of FCN_adv+RPT are given.

Figure 7 shows the gradual improvement on semantic segmentation of five images, when different design components are incrementally integrated. We compare with several state-of-the-art techniques for unsupervised domain adaptation on GTA5 → Cityscapes. Broadly, we can categorize the baseline methods into five categories: (1) representation-level domain adaptation by adversarial learning [10, 13, 19, 21, 31, 44, 47]; (2) appearance-level domain adaptation by image translation [14, 32]; (3) appearance-level + representation-level adaptation [4, 50, 57]; (4) self-learning [23, 26, 54, 61]; (5) others [8, 25, 43, 55, 60]. The performance comparisons on GTA5 → Cityscapes adaptation are summarized in Table 3. FCN_adv+RPT achieves new state-of-the-art performance with mIoU of 52.6%. Benefiting from the proposed regularizations, FCN_adv+RPT outperforms SSF-DAN [13] and ADVENT [49], which also adopt a similar adversarial mechanism, by additional improvements of 7.2% and 7.1%, respectively. The performance is also better than the most recently proposed FCAN [57] and Stylization [14], which exploit a novel appearance transferring module that is not considered in RPT. Comparing to the best reported result to-date by MLSL [23], our proposed model still leads the performance by 3.6%. By further integrating with the multi-scale (MS) scheme, i.e., FCN_adv+RPT+MS, the mIoU boosts to 53.2%, with 9 out of the 19 categories reaching to-date the best reported performances.

To verify the generalization of RPT, we also test the performance on SYNTHIA → Cityscapes using the same settings. Following previous works [23, 26, 49, 61], the performances are reported in terms of mIoU@16 and mIoU@13 by not considering the different number of categories.
The performance comparisons are summarized in Table 4. Similarly, FCN_adv+RPT+MS achieves the best performance with mIoU@16 = 51.7% and mIoU@13 = 59.5%. These results are better than those of PyCDA [26], which reports the best previously known results, by 5% and 6.2%, respectively.

Table 4. Comparisons with the state-of-the-art unsupervised domain adaptation methods on SYNTHIA→Cityscapes transfer.

Method           | road sdwlk bldng wall fence pole light sign vgttn sky  person rider car  bus  mcycl bcycl | mIoU@16 mIoU@13
Learning [44]    | 80.1 29.1  77.5  2.8  0.4   26.8 11.1  18.0 78.1  76.7 48.2   15.2  70.5 17.4 8.7   16.7  | 36.1    -
ROAD [10]        | 77.7 30.0  77.5  9.6  0.3   25.8 10.3  15.6 77.6  79.8 44.5   16.6  67.8 14.5 7.0   23.8  | 36.2    -
AdaptSegNet [47] | 84.3 42.7  77.5  -    -     -    4.7   7.0  77.9  82.5 54.3   21.0  72.3 32.2 18.9  32.3  | -       46.7
CLAN [31]        | 81.3 37.0  80.1  -    -     -    16.1  13.7 78.2  81.5 53.4   21.2  73.0 32.9 22.6  30.7  | -       47.8
Conditional [21] | 85.0 25.8  73.5  3.4
Figure 8. Examples showing the effectiveness of (a) patch-based consistency and (b) cluster-based consistency in RPT.
Figure 8 shows examples demonstrating the effectiveness of the patch-based and cluster-based consistency regularizations. Here, we crop several highlighted regions of the input image, the ground truth, the prediction by FCN_adv, and the prediction by FCN_adv+RPT, respectively. On one hand, as shown in Figure 8(a), patch-based consistency encourages the pixels to be predicted as the dominative category of the superpixel. On the other hand, cluster-based consistency is able to correct the predictions with the cue of visual similarity across superpixels, as illustrated in Figure 8(b). These examples validate our motivation of enforcing label consistency within each superpixel and cluster, where most semantic labels are correctly predicted in the target domain. Figure 9 further visualizes the merit of modeling spatial context by the spatial logic regularization. Given the segmentation results from FCN_adv, our proposed LSTM encoder-decoder outputs the logical probability of assigning the current semantic labels to each region. Darkness indicates that a region is predicted with low logical probability. Better results are achieved by penalizing the illogical predictions, such as road on top of vegetation (1st row) or car (2nd row), sky below building (3rd row), and fence above building (4th row).

Figure 9. Examples of patches penalized by the spatial logic regularization.
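The patch-based consistency illustrated in Figure 8(a) can be sketched as a simple voting procedure over superpixels: the dominant predicted category among confident pixels of a superpixel becomes the regularization target for its unconfident pixels. The sketch below is our own illustration, not the paper's implementation; the fixed confidence threshold and the plain majority vote are assumptions.

```python
import numpy as np

def patch_consistency_targets(probs, superpixels, conf_thresh=0.8):
    """Illustrative patch-level consistency: propagate each superpixel's
    dominant predicted category to its unconfident pixels.

    probs:       (H, W, C) softmax predictions.
    superpixels: (H, W) integer superpixel ids (e.g., from SLIC [1]).
    """
    labels = probs.argmax(axis=-1)   # per-pixel hard predictions
    conf = probs.max(axis=-1)        # per-pixel confidence
    targets = labels.copy()
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        confident = mask & (conf >= conf_thresh)
        # Vote among confident pixels; fall back to all pixels if none qualify.
        votes = labels[confident] if confident.any() else labels[mask]
        dominant = np.bincount(votes).argmax()
        # Only unconfident pixels are pulled toward the dominant category.
        targets[mask & (conf < conf_thresh)] = dominant
    return targets
```

In training, such targets would drive an auxiliary loss on the unconfident pixels only, matching the paper's idea of selectively involving a subset of image regions for regularization.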
7. Conclusion
We have presented the Regularizer of Prediction Transfer (RPT) for unsupervised domain adaptation of semantic segmentation. RPT sheds light on a novel research direction by directly exploring three intrinsic criteria of semantic segmentation to restrict label prediction on the target domain. These criteria, when imposed as regularizers during training, are found to be effective in alleviating the problem of model overfitting. The patch-based consistency attempts to unify the prediction inside each region by introducing its dominative category to the unconfident pixels. The cluster-based consistency further amends the prediction according to other visually similar regions belonging to the same cluster. In pursuit of suppressing illogical predictions, spatial logic is involved to regularize the spatial relations that are shared across domains. Experiments conducted on the transfer from GTA5 to Cityscapes show that injecting RPT consistently improves domain adaptation across different network architectures. More remarkably, the setting of FCN_adv+RPT achieves new state-of-the-art performance. A similar conclusion is also drawn from the adaptation from SYNTHIA to Cityscapes, which demonstrates the generalization ability of RPT.

References
[1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. on PAMI, 34(11):2274–2282, 2012.
[2] Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. What's the point: Semantic segmentation with point supervision. In ECCV, 2016.
[3] Qi Cai, Yingwei Pan, Chong-Wah Ngo, Xinmei Tian, Lingyu Duan, and Ting Yao. Exploring object relation in mean teacher for cross-domain detection. In CVPR, 2019.
[4] Wei-Lun Chang, Hui-Po Wang, Wen-Hsiao Peng, and Wei-Chen Chiu. All about structure: Adapting structural information across domains for boosting semantic segmentation. In CVPR, 2019.
[5] Liang-Chieh Chen, Maxwell D. Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jonathon Shlens. Searching for efficient multi-scale architectures for dense image prediction. In NIPS, 2018.
[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. on PAMI, 40(4):834–848, 2018.
[7] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[8] Minghao Chen, Hongyang Xue, and Deng Cai. Domain adaptation for semantic segmentation with maximum squares loss. In ICCV, 2019.
[9] Yuhua Chen, Wen Li, Xiaoran Chen, and Luc Van Gool. Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach. In CVPR, 2019.
[10] Yuhua Chen, Wen Li, and Luc Van Gool. ROAD: Reality oriented adaptation for semantic segmentation of urban scenes. In CVPR, 2018.
[11] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[12] Jifeng Dai, Kaiming He, and Jian Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, 2015.
[13] Liang Du, Jingang Tan, Hongye Yang, Jianfeng Feng, Xiangyang Xue, Qibao Zheng, Xiaoqing Ye, and Xiaolin Zhang. SSF-DAN: Separated semantic feature based domain adaptation network for semantic segmentation. In ICCV, 2019.
[14] Aysegul Dundar, Ming-Yu Liu, Ting-Chun Wang, John Zedlewski, and Jan Kautz. Domain stylization: A strong, simple baseline for synthetic to real image domain adaptation. arXiv preprint arXiv:1807.09384, 2018.
[15] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.
[16] Golnaz Ghiasi and Charless C. Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In ECCV, 2016.
[17] Junjun He, Zhongying Deng, and Yu Qiao. Dynamic multi-scale filters for semantic segmentation. In ICCV, 2019.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[19] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML, 2018.
[20] Judy Hoffman, Dequan Wang, Fisher Yu, and Trevor Darrell. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.
[21] Weixiang Hong, Zhenzhen Wang, Ming Yang, and Junsong Yuan. Conditional generative adversarial network for structured domain adaptation. In CVPR, 2018.
[22] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
[23] Javed Iqbal and Mohsen Ali. MLSL: Multi-level self-supervised learning for domain adaptation with spatially independent and semantically consistent labeling. arXiv preprint arXiv:1909.13776, 2019.
[24] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
[25] Yunsheng Li, Lu Yuan, and Nuno Vasconcelos. Bidirectional learning for domain adaptation of semantic segmentation. In CVPR, 2019.
[26] Qing Lian, Fengmao Lv, Lixin Duan, and Boqing Gong. Constructing self-motivated pyramid curriculums for cross-domain semantic segmentation: A non-adversarial approach. In ICCV, 2019.
[27] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, 2017.
[28] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[29] Ziwei Liu, Xiaoxiao Li, Ping Luo, Chen Change Loy, and Xiaoou Tang. Deep learning Markov random field for semantic segmentation. IEEE Trans. on PAMI, 40(8):1814–1828, 2018.
[30] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[31] Yawei Luo, Liang Zheng, Tao Guan, Junqing Yu, and Yi Yang. Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. In CVPR, 2019.
[32] Zak Murez, Soheil Kolouri, David Kriegman, Ravi Ramamoorthi, and Kyungnam Kim. Image to image translation for domain adaptation. In CVPR, 2018.
[33] Yingwei Pan, Ting Yao, Yehao Li, Yu Wang, Chong-Wah Ngo, and Tao Mei. Transferrable prototypical networks for unsupervised domain adaptation. In CVPR, 2019.
[34] George Papandreou, Liang-Chieh Chen, Kevin Murphy, and Alan L. Yuille. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In ICCV, 2015.
[35] Deepak Pathak, Philipp Krähenbühl, and Trevor Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV, 2015.
[36] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters – improve semantic segmentation by global convolutional network. In CVPR, 2017.
[37] Pedro O. Pinheiro and Ronan Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, 2015.
[38] Tobias Pohlen, Alexander Hermans, Markus Mathias, and Bastian Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In CVPR, 2017.
[39] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning deep spatio-temporal dependence for semantic video segmentation. IEEE Trans. on Multimedia, 2017.
[40] Stephan R. Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In ECCV, 2016.
[41] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.
[42] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[43] Fatemeh Sadat Saleh, Mohammad Sadegh Aliakbarian, Mathieu Salzmann, Lars Petersson, and Jose M Alvarez. Effective use of synthetic data for urban scene semantic segmentation. In ECCV, 2018.
[44] Swami Sankaranarayanan, Yogesh Balaji, Arpit Jain, Ser Nam Lim, and Rama Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In CVPR, 2018.
[45] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[46] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[47] Yi-Hsuan Tsai, Wei-Chih Hung, Samuel Schulter, Kihyuk Sohn, Ming-Hsuan Yang, and Manmohan Chandraker. Learning to adapt structured output space for semantic segmentation. In CVPR, 2018.
[48] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 2008.
[49] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Perez. ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation. In CVPR, 2019.
[50] Zuxuan Wu, Xintong Han, Yen-Liang Lin, Mustafa Gokhan Uzunbas, Tom Goldstein, Ser Nam Lim, and Larry S Davis. DCAN: Dual channel-wise alignment networks for unsupervised scene adaptation. In ECCV, 2018.
[51] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. DenseASPP for semantic segmentation in street scenes. In CVPR, 2018.
[52] Ting Yao, Yingwei Pan, Chong-Wah Ngo, Houqiang Li, and Tao Mei. Semi-supervised domain adaptation with subspace learning for visual recognition. In CVPR, 2015.
[53] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR, 2018.
[54] Junting Zhang, Chen Liang, and C-C Jay Kuo. A fully convolutional tri-branch network (FCTN) for domain adaptation. In ICASSP, 2018.
[55] Yang Zhang, Philip David, Hassan Foroosh, and Boqing Gong. A curriculum domain adaptation approach to the semantic segmentation of urban scenes. IEEE Trans. on PAMI, 2019.
[56] Yiheng Zhang, Zhaofan Qiu, Jingen Liu, Ting Yao, Dong Liu, and Tao Mei. Customizable architecture search for semantic segmentation. In CVPR, 2019.
[57] Yiheng Zhang, Zhaofan Qiu, Ting Yao, Dong Liu, and Tao Mei. Fully convolutional adaptation networks for semantic segmentation. In CVPR, 2018.
[58] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. ICNet for real-time semantic segmentation on high-resolution images. In ECCV, 2018.
[59] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
[60] Xinge Zhu, Hui Zhou, Ceyuan Yang, Jianping Shi, and Dahua Lin. Penalizing top performers: Conservative loss for semantic segmentation adaptation. In ECCV, 2018.
[61] Yang Zou, Zhiding Yu, BVK Vijaya Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV, 2018.