Category Anchor-Guided Unsupervised Domain Adaptation for Semantic Segmentation
Qiming Zhang∗  Jing Zhang∗  Wei Liu  Dacheng Tao
UBTECH Sydney AI Centre, School of Computer Science, Faculty of Engineering, The University of Sydney, Darlington, NSW 2008, Australia
Tencent AI Lab, China
[email protected], [email protected]@columbia.edu, [email protected]
Abstract
Unsupervised domain adaptation (UDA) aims to enhance the generalization capability of a model trained on a source domain to a target domain. UDA is of particular significance since no extra effort is devoted to annotating target domain samples. However, the different data distributions in the two domains, i.e., the domain shift/discrepancy, inevitably compromise UDA performance. Although there has been progress in matching the marginal distributions between the two domains, the classifier favors the source domain features and makes incorrect predictions on the target domain due to category-agnostic feature alignment. In this paper, we propose a novel category anchor-guided (CAG) UDA model for semantic segmentation, which explicitly enforces category-aware feature alignment to learn shared discriminative features and classifiers simultaneously. First, the category-wise centroids of the source domain features are used as guided anchors to identify the active features in the target domain and to assign them pseudo-labels. Then, we leverage an anchor-based pixel-level distance loss and a discriminative loss to drive the intra-category features closer and the inter-category features further apart, respectively. Finally, we devise a stagewise training mechanism to reduce the error accumulation and adapt the proposed model progressively. Experiments on both the GTA5 → Cityscapes and SYNTHIA → Cityscapes scenarios demonstrate the superiority of our CAG-UDA model over the state-of-the-art methods. The code is available at https://github.com/RogerZhangzz/CAG_UDA.

∗ indicates equal contributions.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Semantic segmentation is a classical computer vision task that refers to assigning pixel-wise category labels to a given image to facilitate downstream applications such as autonomous driving, video surveillance, and image editing. Recent progress in semantic segmentation has been dominated by deep neural networks trained on large datasets. Despite their success, annotating labels at the pixel level is prohibitively expensive and time-consuming, e.g., about 90 minutes for a single image in the Cityscapes dataset [8]. One economical alternative is to exploit computer graphics techniques to simulate a virtual 3D environment and automatically generate images and labels, e.g., GTA5 [31] and SYNTHIA [32]. Although synthetic images have similar appearances to real images, there still exist subtle differences in textures, layouts, colors, and illumination conditions [11, 42–44], which result in different data distributions, or domain discrepancy. Consequently, the performance of a model trained on synthetic datasets degrades drastically when applied to realistic scenes. To address this issue, one promising approach is domain adaptation [1, 45, 15, 34, 36, 27, 33, 40, 47, 13], which reduces the domain shift and learns a shared discriminative model for both domains. In this paper, we tackle the more challenging unsupervised domain adaptation (UDA) setting, where no labels are available in the target domain during training.

Previous methods have tried to learn domain-invariant representations by matching the distributions between source and target domains at the appearance level [27, 34, 40, 13, 21], feature level [14, 27, 3, 13], or output level [45, 36, 26]. However, even though matching the global marginal distributions can bring the two domains closer, e.g.,
reaching a lower maximum mean discrepancy (MMD) [25] or a saddle point in the minimax game via adversarial learning [13], it does not guarantee that samples from different categories in the target domain are properly separated, hence compromising the generalization ability. To tackle this issue, one could instead consider category-aware feature alignment by matching the local joint distributions of features and categories [7, 19, 33]. Other approaches adopt the idea of self-training by generating pseudo-labels for samples in the target domain and providing extra supervision to the classifier [47, 21, 3]. Together with supervision from the source domain, this enforces the network to simultaneously learn domain-invariant discriminative feature representations and shared decision boundaries through back-propagation. The ideas of minimizing the entropy (uncertainty) of the output [39] or the discrepancies between the outputs of two classifiers (voters) [26] have also been exploited to implicitly enforce category-level alignment.

Although category-level alignment and self-training methods have produced promising results, some outstanding issues still need to be addressed to further improve adaptation performance. For example, error-prone pseudo-labels will mislead the classifier and accumulate errors. Meanwhile, implicit category-level alignment may be affected by category imbalance. To deal with these issues and take advantage of both approaches, we propose the novel idea of category anchors, which facilitate both category-wise feature alignment and self-training. It is motivated by the observation that features from the same category tend to be clustered together. Moreover, the centroids of source domain features in each category can serve as explicit anchors to guide adaptation.

Specifically, we propose a novel category anchor-guided unsupervised domain adaptation model (CAG-UDA) for semantic segmentation. This model explicitly enforces category-wise feature alignment to learn shared feature representations and classifiers for both domains simultaneously. First, the centroids of category-wise features in the source domain are used as anchors to identify the active features in the target domain. Then, we assign pseudo-labels to these active features according to the category of the closest anchor. Lastly, two loss functions are proposed: the first is a pixel-level distance loss between the guiding anchors and the active features, which pushes them closer and explicitly minimizes the intra-category feature variance; the other is a pixel-level discriminative loss to supervise the classifier and maximize the inter-category feature variance. To reduce the error accumulation caused by incorrect pseudo-labels, we propose a stagewise training mechanism to adapt the model progressively.

The main contributions of this paper can be summarized as follows. First, we propose a novel category anchor idea to tackle the challenging UDA problem in semantic segmentation. Second, we propose a simple yet effective category anchor-based method to identify active features in the target domain, further enabling category-wise feature alignment. Finally, the proposed CAG-UDA model achieves new state-of-the-art performance in both the GTA5 → Cityscapes and SYNTHIA → Cityscapes scenarios.
2 Related work

Many recent advances in computer vision [20, 12, 30, 11, 24, 46, 5] have been based on deep neural networks trained on large-scale labeled datasets such as ImageNet [9], Pascal VOC [10], MS COCO [22], and Cityscapes [8]. However, a domain shift between training data and testing data impairs model performance [29, 17, 18]. To overcome this issue, a variety of domain adaptation methods for classification [6, 23, 37, 28, 41, 3, 19], detection [38, 16], and segmentation [7, 14, 13, 27, 34, 40, 21, 47] have been proposed. In this paper, we focus on the challenging semantic segmentation problem. The current mainstream approaches include style transfer [27, 34, 40, 13, 21], feature alignment [7, 14, 13], and self-training [47, 21]. As our work is most related to the latter two approaches, we briefly review and discuss their characteristics.
Feature distribution alignment: Previous methods that match the global marginal distributions between two domains [14, 13, 27] do not distinguish local category-wise feature distribution shifts. Consequently, error-prone predictions are made for misaligned features with shared decision boundaries. In contrast to these methods, we propose a category-wise feature alignment method to explicitly reduce category-level mismatches and learn discriminative domain-invariant features. The idea of category-level feature alignment was also exploited in [26, 33] for semantic segmentation. Luo et al. proposed a weighted adversarial learning method to align the category-level feature distributions implicitly [26]. Saito et al. tried to align the feature distributions and learn discriminative domain-invariant features by utilizing task-specific classifiers as a discriminator [33]. In contrast to the implicit feature alignment in the aforementioned methods, we propose a novel category anchor-guided method, which directly aligns category-wise features in both domains.
Pseudo-label assignment: Assigning pseudo-labels to target domain samples based on the trained classifiers helps adapt the feature extractor and classifier to the target domain. Zou et al. [47] proposed an iterative self-training UDA model that alternately generates pseudo-labels and retrains the model. They also dealt with the category imbalance issue by controlling the proportion of selected pseudo-labels in each category [47]. Li et al. [21] proposed a bidirectional learning domain adaptation model that alternately trains the image translation model and the self-supervised segmentation adaptation model. In contrast to these methods, where pseudo-labels were determined according to the predicted category probability, we propose a category anchor-based method to generate trustable pseudo-labels. Compared with selected samples that have been "correctly" classified with high confidence, our selected samples are not determined by the decision boundaries and are thus more informative for the classifier to further adapt to the target domain.

The idea of assigning pseudo-labels based on category centers has also been utilized in domain adaptation for classification, e.g., category centroids in [41], prototypes in [3], and cluster centers in [19]. The former two methods minimize the distance loss against category centroids, while the third minimizes contrastive domain discrepancies. Our method differs from these methods in several ways. First, we tackle the more challenging task of image semantic segmentation rather than image classification, where dense pixel-wise labels need to be predicted instead of a single label for the entire image. Second, we fix the category centroids (hence called category anchors) instead of updating them at each iteration. On one hand, the mini-batch size used for segmentation (e.g., 1 in this paper) is much smaller than that used for classification. On the other hand, pixels are spatially coherent in an image, so category centroids calculated at each iteration would be biased and unreliable due to the dominance of homogeneous features. Third, the pseudo-labels of target domain samples are determined by their distances to the category centroids from the source domain instead of the target domain. This is reasonable since: 1) the source domain category centroids are calculated from all training samples based on ground-truth labels, which are reliable; and 2) driving the target domain features towards the source domain category centroids can effectively reduce the domain discrepancy. Fourth, together with the category anchor-based distance loss, we also add a segmentation loss based on the pseudo-labeled target samples to learn discriminative feature representations and adapt the decision boundaries simultaneously.
3 Method

A semantic segmentation model M can be formulated as a mapping function from the image domain X to the output label domain Y:

M : X \rightarrow Y, \quad (1)

which predicts a pixel-wise category label ŷ close to the ground-truth annotation y ∈ Y for a given image x ∈ X. Usually, the segmentation model M is trained in a supervised manner by minimizing the difference between the prediction ŷ and its ground-truth y for every training sample x. The cross-entropy (CE) loss is widely used as the measurement, defined as:

L_{CE} = -\sum_{i=1}^{N} \sum_{j=1}^{H \times W} \sum_{c=1}^{C} y_{ijc} \log(p_{ijc}), \quad (2)

where N is the number of training images, H and W denote the image size, j is the pixel index, C is the number of categories, c is the category index, y_{ijc} ∈ {0, 1} is the one-hot vector representation of the ground-truth label, i.e., ∀ i, j, \sum_{c} y_{ijc} = 1, and p_{ijc} is the category probability predicted by M.
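To make Eq. (2) concrete, the following is a minimal PyTorch sketch of the pixel-wise CE loss, assuming logits of shape (N, C, H, W) and integer label maps of shape (N, H, W); the function name and shapes are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def segmentation_ce_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Pixel-wise cross-entropy of Eq. (2).

    logits: (N, C, H, W) raw scores from the segmentation model M.
    labels: (N, H, W) integer ground-truth category indices.
    """
    # log p_{ijc}: softmax over the category dimension, then log.
    log_probs = F.log_softmax(logits, dim=1)                       # (N, C, H, W)
    # Select log p at the ground-truth category of every pixel, which is
    # equivalent to the inner product with the one-hot y_{ijc}.
    picked = log_probs.gather(1, labels.unsqueeze(1)).squeeze(1)   # (N, H, W)
    return -picked.sum()   # summed over images and pixels, as in Eq. (2)
```

In practice this reduces to `F.cross_entropy(logits, labels, reduction="sum")`, with a mean reduction commonly used instead of the sum when training.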
UDA for semantic segmentation: Generally, a segmentation model trained on a source domain X_s has limited generalization capability to a target domain X_t when the distributions of X_s and X_t are different, i.e., when there is a domain shift/discrepancy.
Figure 1: An illustration of the proposed category anchor-guided UDA model for semantic segmentation. (a) The architecture of the proposed CAG-UDA model consists of an encoder, a feature transformer (f_D), and a classifier. The green part denotes the source domain flow while the orange parts represent the target domain flow. (b) The process of active target sample identification and pseudo-label assignment described in Section 3.2. (c) The proposed category-wise feature alignment with the anchor-based pixel-level distance loss L_dis and cross-entropy loss L_CE described in Section 3.3. Best viewed in color.

Several unsupervised domain adaptation models have been proposed, which can be formulated as the following mapping function:

M_{uda} : X_s \cup X_t \rightarrow Y_s \cup Y_t, \quad (3)

where M_uda is trained on the labeled training samples (X_s, Y_s) in the source domain together with the unlabeled training samples X_t in the target domain. Typically, the aforementioned CE loss and some domain adaptation losses are used to align the distributions of both domains (e.g., p(X_s) and p(X_t)) and to learn domain-invariant discriminative feature representations.

Model components:
The main semantic segmentation approaches have been based on fully convolutional neural networks (CNNs) since the seminal work in [24]. Usually, a DCNN-based model has two parts: an encoder Enc and a decoder Dec, where the encoder maps the input image into a low-dimensional feature space and the decoder then decodes it to the label space. The decoder can be further divided into a feature transformation net f_D and a classifier Cls, where Cls denotes the last classification layer and f_D denotes the remaining part of Dec. Typical encoders are classification networks pretrained on ImageNet [9], e.g., VGGNet [35] and ResNet [12]. The decoder consists of convolutional layers responsible for context modeling, multi-scale feature fusion, etc. UDA methods typically employ a segmentation model with carefully designed modules for domain adaptation; a sketch of this decomposition is given below.
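As a rough illustration of the Enc / f_D / Cls decomposition (not the authors' implementation), a DeepLab-style network could be split as follows; the module names, feature width, and layer choices are assumptions made only for illustration.

```python
import torch.nn as nn

class SegModel(nn.Module):
    """Illustrative Enc / f_D / Cls split of a segmentation network."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.enc = backbone                       # Enc: e.g., a ResNet-101 trunk
        self.f_D = nn.Sequential(                 # f_D: context modeling / feature fusion
            nn.Conv2d(feat_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.cls = nn.Conv2d(256, num_classes, kernel_size=1)   # Cls: last classification layer

    def forward(self, x):
        feat = self.f_D(self.enc(x))    # features used for anchor construction and alignment
        return self.cls(feat), feat     # low-resolution logits (to be upsampled) and features
```

The later sketches assume the model exposes the f_D features in this way, since the anchors and distance losses are computed on them.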
The network architecture of our proposed CAG-UDA model is shown in Figure 1(a). The CAG-UDA model employs Deeplab v2 [4] as the base segmentation model, where ResNet-101 is used as the encoder Enc and the ASPP module is used in the decoder Dec. To reduce the domain shift, we devise a category anchor-guided alignment module on the features from f_D, consisting of category anchor construction (CAC), active target sample identification (ATI), and pseudo-label assignment (PLA), as shown in Figure 1(b). The details are as follows.

Category anchor construction (CAC):
Based on the observation that pixels of the same category cluster in the feature space, we propose to calculate the centroid of the features of each category in the source domain as a representative of the feature distribution, i.e., the mean. Considering that the features fed into the classifier directly relate to the decision boundaries, we choose the features from f_D to calculate these centroids. Mathematically, this can be written as:

f_c^s = \frac{1}{|\Lambda_c^s|} \sum_{i=1}^{N} \sum_{j=1}^{H \times W} y_{ijc}^s \left( f_D(Enc(x_i^s))|_j \right), \quad (4)

where \Lambda_c^s is the index set of all pixels on the training images in the source domain X_s belonging to the c-th category, i.e., \Lambda_c^s = \{(i, j) \mid y_{ijc}^s = 1\}, |\Lambda_c^s| denotes the number of pixels in \Lambda_c^s, i.e., |\Lambda_c^s| = \sum_{i=1}^{N} \sum_{j=1}^{H \times W} y_{ijc}^s, and f_D(Enc(x_i^s))|_j is the feature vector at index j of the feature map f_D(Enc(x_i^s)). It is noteworthy that we calculate the category centroids at the beginning of each training stage and then keep them fixed during training (we propose a stagewise training mechanism in Section 3.4). Therefore, we call these centroids category anchors (CAs) in this paper, i.e., CA = \{f_c^s, c = 1, ..., C\}.
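A possible implementation of Eq. (4) is sketched below, assuming the model returns the f_D feature map alongside the logits and that the ground-truth label maps can be downsampled to the feature resolution with nearest-neighbour interpolation; the function signature and data-loader interface are assumptions, not the released code.

```python
import torch

@torch.no_grad()
def build_category_anchors(model, source_loader, num_classes, device="cuda"):
    """Category anchors f^s_c of Eq. (4): per-category mean of source f_D features."""
    sums, counts = None, torch.zeros(num_classes, device=device)
    for images, labels in source_loader:                 # labeled source samples
        _, feat = model(images.to(device))               # feat: (N, D, h, w)
        if sums is None:
            sums = torch.zeros(num_classes, feat.shape[1], device=device)
        # bring the labels to the feature resolution
        lab = torch.nn.functional.interpolate(
            labels.float().unsqueeze(1), size=feat.shape[-2:], mode="nearest"
        ).squeeze(1).long().to(device)                    # (N, h, w)
        flat = feat.permute(0, 2, 3, 1).reshape(-1, feat.shape[1])   # (N*h*w, D)
        lab = lab.reshape(-1)
        for c in range(num_classes):
            mask = lab == c
            if mask.any():
                sums[c] += flat[mask].sum(dim=0)
                counts[c] += mask.sum().float()
    return sums / counts.clamp(min=1).unsqueeze(1)        # (C, D): one anchor per category
```

Following the stagewise scheme of Section 3.4, such a routine would be called once at the beginning of each training stage and the returned anchors kept fixed afterwards.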
Active target sample identification (ATI): To align the category-wise feature distributions between the two domains, we expect the category centroids of the target domain to move closer to the category anchors during training. However, on one hand, target sample labels are unavailable. On the other hand, centroids calculated on target samples are very unstable at each iteration, since the mini-batch size is very small (i.e., 1) in this paper and image pixels are spatially coherent. To tackle these issues, we propose identifying active target samples and assigning them pseudo-labels for the subsequent feature alignment. The term "active target samples" refers to target samples that are near one category anchor and far from the other anchors, i.e., that are activated by one specific category anchor. Mathematically, this can be formulated as follows. We first define the distance between a target feature f_D(Enc(x_i^t))|_j and the c-th category anchor as

d_{ijc}^t = \| f_c^s - f_D(Enc(x_i^t))|_j \|, \quad (5)

where \|\cdot\| is the L2 norm of a vector. Then, we sort \{d_{ijc}^t, c = 1, ..., C\} in ascending order and compare the shortest distance d_{ijc^*}^t with the second shortest d_{ijc'}^t. If their difference is larger than a predefined threshold \Delta d, we identify this target sample as active, i.e.,

a_{ij}^t = \begin{cases} 1, & d_{ijc'}^t - d_{ijc^*}^t > \Delta d, \\ 0, & \text{otherwise}, \end{cases} \quad (6)

where a_{ij}^t denotes the active state of the target feature f_D(Enc(x_i^t))|_j. Like the category anchors, we calculate the active states at the beginning of each training stage and keep them fixed throughout the stage. This is explained in Section 3.4, where we introduce a stagewise training mechanism.

Pseudo-label assignment (PLA):
After obtaining the active state according to Eq. (6), a pseudo-label c^* can be assigned to x_i^t at pixel j according to its closest category anchor f_{c^*}^s with a reliable margin \Delta d:

\hat{y}_{ijc^*}^t = 1, \quad \text{if } d_{ijc^*}^t < d_{ijc}^t - \Delta d, \ \forall c \neq c^*. \quad (7)

Due to the lack of target domain labels, the classifier layer is biased to the source domain and does not generalize well to the target domain, as shown in Figure 1(c). Consequently, some of the pseudo-labels derived from predicted probabilities may be error-prone. However, based on the observation of the intra-category clustering characteristics, the pseudo-labels generated via category anchors are independent of the biased classifier and are thus more reliable than those assigned by predicted category probabilities. Further, considering that high-probability samples have already been "correctly" classified by the classifier layer with high confidence, these samples provide only weak supervision signals. In contrast, active samples are more informative for adapting the classifier to the target domain, since the classifier layer may not predict these active samples with high probabilities.
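Eqs. (5)–(7) could be realized as in the following sketch, which computes the pixel-to-anchor L2 distances, the active mask of Eq. (6), and the pseudo-labels of Eq. (7) in one pass. The function name and tensor shapes are assumptions; the default margin of 2.5 mirrors the \Delta d value reported later in the implementation details.

```python
import torch

@torch.no_grad()
def identify_and_label(target_feat: torch.Tensor, anchors: torch.Tensor, margin: float = 2.5):
    """Active target identification (Eq. 6) and pseudo-label assignment (Eq. 7).

    target_feat: (N, D, h, w) f_D features of target images.
    anchors:     (C, D) category anchors f^s_c.
    Returns the active mask a^t_{ij} and the pseudo-labels, both of shape (N, h, w).
    """
    n, d, h, w = target_feat.shape
    feat = target_feat.permute(0, 2, 3, 1).reshape(-1, d)        # (N*h*w, D)
    # d^t_{ijc} = || f^s_c - f_D(Enc(x^t_i))|_j ||_2 for every pixel and category
    dists = torch.cdist(feat, anchors)                           # (N*h*w, C)
    two_nearest, idx = dists.topk(2, dim=1, largest=False)       # two smallest distances
    # active iff the runner-up anchor is at least `margin` farther than the closest one
    active = (two_nearest[:, 1] - two_nearest[:, 0]) > margin    # Eq. (6)
    pseudo = idx[:, 0]                                           # Eq. (7): closest anchor's category
    return active.reshape(n, h, w), pseudo.reshape(n, h, w)
```

Note that the condition of Eq. (7), d^t_{ijc^*} < d^t_{ijc} - \Delta d for all c ≠ c^*, is equivalent to the margin test of Eq. (6) against the second-nearest anchor, which is why a single comparison suffices here.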
When training the CAG-UDA model, we leverage a CE loss L^s_CE on the source domain samples as defined in Eq. (2). We also propose a category-wise distance loss L^s_dis on the source domain samples and two domain adaptation losses on the active target samples, i.e., a CE loss L^t_CE and a category-wise distance loss L^t_dis based on the pseudo-labels, to guide the adaptation process. These are defined as:

L_{dis}^s = \sum_{i=1}^{N} \sum_{j=1}^{H \times W} \sum_{c=1}^{C} y_{ijc}^s \| f_c^s - f_D(Enc(x_i^s))|_j \|, \quad (8)

L_{CE}^t = -\sum_{i=1}^{M} \sum_{j=1}^{H \times W} a_{ij}^t \sum_{c=1}^{C} \hat{y}_{ijc}^t \log(p_{ijc}^t), \quad (9)

L_{dis}^t = \sum_{i=1}^{M} \sum_{j=1}^{H \times W} a_{ij}^t \sum_{c=1}^{C} \hat{y}_{ijc}^t \| f_c^s - f_D(Enc(x_i^t))|_j \|. \quad (10)

Although only the active target samples contribute to L^t_dis, other inactive target samples within each category may also follow the active samples due to being clustered. Therefore, minimizing L^t_dis indeed reduces the intra-category variances in the target domain. Meanwhile, L^t_CE leverages the pseudo-labels to update the network weights together with the source domain CE loss, prompting the encoder, decoder, and classifier to adapt to the target domain and therefore reducing the intra- and inter-category variances simultaneously. This is illustrated in Figure 1(c). To leverage the complementarity between the proposed category anchor-based PLA and the category probability-based PLA in [47], we also identify active target samples based on the predicted category probability and add an extra CE loss L^tP_CE similar to Eq. (9):

L_{PCE}^t = -\sum_{i=1}^{M} \sum_{j=1}^{H \times W} a_{ij}^{tP} \sum_{c=1}^{C} \hat{y}_{ijc}^{tP} \log(p_{ijc}^t), \quad (11)

where a_{ij}^{tP} and \hat{y}_{ijc}^{tP} refer to the probability-based active state and the assigned pseudo-labels, respectively. The final objective function is then:

L = L_{CE}^s + \lambda_1 (L_{dis}^s + L_{dis}^t) + \lambda_2 (L_{CE}^t + L_{PCE}^t), \quad (12)

where \lambda_1 and \lambda_2 are loss weights.

We tried to train the CAG-UDA model in a single stage, updating the pseudo-labels at each iteration. However, this is not stable: error-prone pseudo-labels may produce incorrect supervision signals, iteratively lead to more erroneous pseudo-labels, and eventually trap the network in a local minimum with poor performance, e.g., less than 30 mIoU. To address this issue, we propose a stagewise training mechanism, summarized in Algorithm 1. First, we pretrain the segmentation model on the source domain. Then, we leverage the global feature alignment method in [14] to warm up the training process and obtain a well-initialized model. Next, we train the CAG-UDA model with the proposed losses for several stages. At the beginning of each stage, we calculate the CAs, identify the active target samples, and assign pseudo-labels to them. By using this stagewise delayed updating mechanism, we avoid updating the pseudo-labels at each iteration and reduce the error accumulation. Hence, L^t_dis and L^t_CE serve as two regularizations on the network.
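As a rough sketch (not the released implementation), the overall objective of Eq. (12) could be assembled as below. The probability-based term L^tP_CE of Eq. (11) is omitted for brevity, all logits, features, and label maps are assumed to share the feature-map resolution, and the default values λ1 = 0.3 and λ2 = 0.7 follow the hyper-parameters reported in the implementation details.

```python
import torch
import torch.nn.functional as F

def anchor_distance_loss(feat, labels, select, anchors):
    """Eqs. (8)/(10): L2 distance between each selected pixel's feature and the anchor
    of its (pseudo-)label, summed over the selected pixels."""
    n, d, h, w = feat.shape
    flat = feat.permute(0, 2, 3, 1).reshape(-1, d)            # (N*h*w, D)
    lab, sel = labels.reshape(-1), select.reshape(-1)         # sel: boolean pixel selector
    diff = flat[sel] - anchors[lab[sel]]                      # f^s_c - f_D(Enc(x))|_j
    return diff.norm(dim=1).sum()

def total_loss(src_logits, src_feat, src_lab, tgt_logits, tgt_feat, tgt_pseudo,
               tgt_active, anchors, lam1=0.3, lam2=0.7):
    """Eq. (12), without the probability-based term L^tP_CE."""
    l_s_ce = F.cross_entropy(src_logits, src_lab, reduction="sum")             # Eq. (2)
    all_src = torch.ones_like(src_lab, dtype=torch.bool)                       # every source pixel
    l_s_dis = anchor_distance_loss(src_feat, src_lab, all_src, anchors)        # Eq. (8)
    l_t_dis = anchor_distance_loss(tgt_feat, tgt_pseudo, tgt_active, anchors)  # Eq. (10)
    masked = tgt_pseudo.clone()
    masked[~tgt_active] = -100            # inactive pixels are ignored in L^t_CE (Eq. 9)
    l_t_ce = F.cross_entropy(tgt_logits, masked, ignore_index=-100, reduction="sum")
    return l_s_ce + lam1 * (l_s_dis + l_t_dis) + lam2 * l_t_ce
```

Within the stagewise scheme of Algorithm 1 below, the anchors, active masks, and pseudo-labels passed to this loss stay fixed for a whole stage, so the two adaptation terms act as the regularizers described above.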
Algorithm 1: Stagewise training of the CAG-UDA model
Input: training dataset (X_s, Y_s, X_t), maximum number of stages K, maximum number of iterations L, distance threshold \Delta d.
Output: M_K and (Ŷ_s, Ŷ_t).
Pretraining: M_p ← (X_s, Y_s) according to [4];
Warm-up: M_0 ← (X_s, Y_s) and M_p according to [14];
for k ← 1 to K do
    CAC: {f_c^s} ← M_{k-1} and (X_s, Y_s) according to Eq. (4);
    ATI: {d_{ijc}^t}, {a_{ij}^t} ← M_{k-1}, (X_s, Y_s, X_t), {f_c^s}, and \Delta d according to Eq. (5) and Eq. (6);
    PLA: {ŷ_{ijc^*}^t} ← {d_{ijc}^t}, \Delta d according to Eq. (7);
    for n ← 1 to L do
        SGD: train M_{k-1} on (X_s, Y_s, X_t, {ŷ_{ijc^*}^t}, {f_c^s}, {a_{ij}^t}) according to Eq. (12);
    end for
    M_k ← M_{k-1}
end for
Prediction: (Ŷ_s, Ŷ_t) ← (X_s, X_t) and M_K.

4 Experiments

Following [21], we evaluate the CAG-UDA model in two common scenarios, GTA5 [31] → Cityscapes [8] and SYNTHIA [32] → Cityscapes [8]. We report results on the 19 common categories in the GTA5 → Cityscapes scenario and the 16 common categories in the SYNTHIA → Cityscapes scenario. Some methods [36, 26, 21] only reported the mIoU over 13 common categories in the SYNTHIA → Cityscapes scenario, denoted as mIoU* in this paper.

Table 1: Results of the CAG-UDA model and SOTA methods (GTA5 → Cityscapes). Per-category IoU columns: road, sidewalk, building, wall, fence, pole, light, sign, vege., terrace, sky, person, rider, car, truck, bus, train, motor, bike; the last column is mIoU.
Source only       75.8 16.8 77.2 12.5 21.0 25.5 30.1 20.1 81.3 24.6 70.3 53.8 26.4 49.9 17.2 25.9 6.5 25.3 36.0 | 36.6
AdaptSegNet [36]  86.5 25.9 79.8 22.1 20.0 23.6 33.1 21.8 81.8 25.9 75.9 57.3 26.2 76.3 29.8 32.1 7.2 29.5 32.5 | 41.4
Source only       69.9 22.3 75.6 15.8 20.1 18.8 28.2 17.1 75.6 8.0 73.5 55.0 2.9 66.9 34.4 30.8 0.0 18.4 0.0 | 33.3
DCAN [40]         85.0 30.8 81.3 25.8 21.2 22.2 25.4 26.6 83.4 36.7 76.2 58.9 24.9 80.7 29.5 42.9 2.5 26.9 11.6 | 41.7
Source only       75.8 16.8 77.2 12.5 21.0 25.5 30.1 20.1 81.3 24.6 70.3 53.8 26.4 49.9 17.2 25.9 6.5 25.3 36.0 | 36.6
CLAN [26]         87.0 27.1 79.6 27.3 23.3 28.3 35.5 24.2 83.6 27.4 74.2 58.6 28.0 76.2 33.1 36.7 6.7 …

Table 2: Results of the CAG-UDA model on the testing set (GTA5 → Cityscapes). Columns as in Table 1.
CAG-UDA 93.2 57.0 85.6 35.7 25.1 37.5 30.8 45.3 87.1 50.1 89.4 62.7 40.8 87.8 18.0 32.4 34.5 34.4 35.4 | 51.7
Implementation details:
In our experiments, training images were randomly cropped to 1280 × … . Due to GPU memory limitations, the batch size was set to 1 and the weights of all batch normalization layers were frozen. In the warm-up phase, we used a CNN-based domain discriminator comprising five convolutional layers with a kernel size of 3 × 3. The …, …, λ1, and λ2 were set to 1e-4, 0.9, 0.3, and 0.7, respectively. \Delta d was set to 2.5. We also assigned pseudo-labels based on predicted category probabilities, with the threshold P set to 0.95. Experiments were conducted on a TITAN Tesla V100 GPU with a PyTorch implementation. Code will be made publicly available.

The results of the GTA5 → Cityscapes scenario are presented in Table 1, with the best results highlighted in bold. All the models adopted ResNet-101 as the backbone network for a fair comparison. Overall, our CAG-UDA model strikingly outperforms all other models with 50.2 mIoU, surpassing the model trained only on the source domain by a significant gain of 16.1. Compared with CLAN [26] and DISE [2], which implicitly align category-level features, our model achieves an extra gain of 4.5 and outperforms them on fence, traffic sign, rider, train, and bike by large margins. This is due to the proposed category anchor-guided alignment method, which explicitly uses category centroids as representatives of the feature distributions, reducing the side effect of category imbalance. Like [40, 13], BLF in [21] also involves a style-transfer module but combines it with self-training in a bidirectional learning framework. It achieved the second-best mIoU of 48.5. BLF achieves better results than the CAG-UDA model on stuff categories such as road, building, wall, terrace, and sky, but is inferior to the CAG-UDA model for small objects. This is because BLF includes a style-transfer module that benefits from the texture clues in the stuff categories and assigns reliable pseudo-labels accordingly. In contrast, CAG-UDA uses a category anchor-guided method that can tackle the category imbalance and generate more informative pseudo-labels, leading to better results on more categories.

We also present the result on the testing set of the Cityscapes dataset in Table 2. The CAG-UDA model reaches 51.7 mIoU, demonstrating the good generalization of our method.

Table 3: Results of the CAG-UDA model and SOTA methods (SYNTHIA → Cityscapes). Per-category IoU columns: road, sidewalk, building, wall, fence, pole, light, sign, vege., sky, person, rider, car, bus, motor, bike; the last two columns are mIoU and mIoU*.
AdaptSegNet [36]  79.2 37.2 78.8 - - - 9.9 10.5 78.2 80.5 53.5 19.6 67.0 29.5 21.6 31.3 | - | 45.9
CLAN [26]         81.3 37.0 80.1 - - - - - - 13.7 - …
DCAN [40]         82.8 36.4 75.7 5.1 0.1 25.8 8.0 18.7 74.7 76.9 51.1 15.9 77.7 24.8 4.1 37.3 | 38.4 | -
DISE [2]          …

Figure 2: (a) Subjective evaluation of the CAG-UDA model on some images from the Cityscapes validation set. (b) Comparison between probability-based PLA and the proposed CAs-based PLA on an image from the Cityscapes training set. Best viewed in color and zoomed in.

Results in the SYNTHIA → Cityscapes scenario are listed in Table 3. Following previous work, we report the performance of the CAG-UDA model with two mIoU metrics, over 13 categories (mIoU*) and 16 categories (mIoU), for fair comparison. Since the domain shift is much larger than in the above scenario, the performance is slightly worse. The CAG-UDA model still achieves better results than all previous SOTA methods, including CLAN, BLF, etc.
Similar to the above discussion of the GTA5 scenario, the superiority of the CAG-UDA model again lies in small objects such as pole, sign, person, and bike.
Qualitative results:
Some qualitative segmentation examples are given in Figure 2(a). Training merely on the source domain dataset leads to limited generalization ability, e.g., the road and person are incorrectly predicted as sidewalk and building in the first row. Benefiting from the category anchor-guided adaptation, the proposed CAG-UDA model achieves better results, especially for small objects, e.g., pole, sign, and person. Besides, we also attribute this to the proposed CAs-based pseudo-label assignment, which successfully activates small objects and assigns them trustable pseudo-labels, as highlighted by the red circles in Figure 2(b). More results can be found in the supplement.

Table 4: Results of the ablation study (GTA5 → Cityscapes). Per-category IoU columns: road, side., buil., wall, fenc., pole, light, sign, vege., terr., sky, person, rider, car, truck, bus, train, motor, bike; the last two columns are mIoU and gain.
Source only  69.8 25.4 74.7 11.3 18.3 24.2 35.6 23.3 72.0 14.4 65.3 58.7 29.0 53.1 14.3 19.2 7.9 15.1 16.3 | 34.1 | -
Warm-up      88.4 45.2 82.0 30.1 22.0 35.4 36.7 23.7 82.7 27.6 70.8 51.4 26.9 81.5 14.5 25.0 21.4 13.0 7.9 | 41.4 | 7.3
+ L^tP_CE    …
+ L^t_CE     …
+ L^sP_dis + L^tP_dis    …
+ L^s_dis + L^t_dis      …
+ L^s_dis + L^t_dis + L^t_CE    …
+ L^t_CE + L^tP_CE       …
The ablation study results are listed in Table 4. We add a superscript P to the symbols of the losses to denote that the active target samples are identified by category probabilities, as described in Section 3.3. Several models were trained by combining L^t_CE with different losses. As can be seen from the 2nd and 3rd rows, the proposed category anchor-guided PLA is more effective than the predicted category probability-based one. More detailed comparisons of different hyper-parameters can be found in the supplement. In addition, the CE loss is more effective than the distance loss. The results in the remaining rows demonstrate the complementarity between the CE loss and the distance loss, as well as between the category anchor-based and probability-based PLA. We combine them as in Eq. (12) to train the CAG-UDA model and obtain a better result, as listed in the bottom row. Finally, the stagewise trained CAG-UDA model obtains an mIoU of 50.2, outperforming the SOTA models. Besides, the CAG-UDA model has been trained for an extra stage, e.g., Stage 4; however, it saturates at 50.2 mIoU with no further improvement.

The proposed CAG-UDA model relies on reliable pseudo-labels to guarantee correct supervision of the network being trained. To this end, we adopt a warm-up strategy to roughly align the two domains and increase the reliability of the pseudo-labels generated by the CAs, as described in Section 3.4. For comparison, we also conducted an experiment removing the warm-up stage and observed a significant drop of 6.3 mIoU. Other techniques could also be used to obtain reliable pseudo-labels, such as enforcing local smoothness on the probability map, utilizing a normalized threshold when assigning pseudo-labels, and reducing the appearance bias through a style-transfer module. We leave building a stage-free, end-to-end CAG-UDA model as future work.
5 Conclusion

In this paper, we proposed a novel category anchor-guided (CAG) unsupervised domain adaptation (UDA) model for semantic segmentation. The CAG-UDA model successfully adapts the segmentation model to the target domain through category-wise feature alignment guided by category anchors. Specifically, we proposed a category anchor construction module, an active target sample identification module, and a pseudo-label assignment module. We utilized a distance loss and a CE loss based on the identified active target samples, which complementarily enhance the adaptation performance. We also proposed a stagewise training mechanism to reduce the error accumulation and adapt the CAG-UDA model progressively. The experiments on the GTA5 and SYNTHIA datasets demonstrate the superiority of the CAG-UDA model over representative methods in generalizing to the Cityscapes dataset.
Acknowledgements
This work is supported by the Australian Research Council Project FL-170100117 and the National Natural Science Foundation of China Project 61806062.

References

[1] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3722–3731, 2017.
[2] W. Chang, H. Wang, W. Peng, and W. Chiu. All about structure: Adapting structural information across domains for boosting semantic segmentation. CoRR, abs/1903.12212, 2019.
[3] C. Chen, W. Xie, T. Xu, W. Huang, Y. Rong, X. Ding, Y. Huang, and J. Huang. Progressive feature alignment for unsupervised domain adaptation. arXiv preprint arXiv:1811.08585, 2018.
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
[5] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.
[6] M. Chen, K. Q. Weinberger, and J. Blitzer. Co-training for domain adaptation. In Advances in Neural Information Processing Systems (NeurIPS), pages 2456–2464, 2011.
[7] Y.-H. Chen, W.-Y. Chen, Y.-T. Chen, B.-C. Tsai, Y.-C. Frank Wang, and M. Sun. No more discrimination: Cross city adaptation of road scene segmenters. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1992–2001, 2017.
[8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.
[10] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2961–2969, 2017.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[13] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning (ICML), 2018.
[14] J. Hoffman, D. Wang, F. Yu, and T. Darrell. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.
[15] W. Hong, Z. Wang, M. Yang, and J. Yuan. Conditional generative adversarial network for structured domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1335–1344, 2018.
[16] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5001–5009, 2018.
[17] W. Jiang, H. Gao, W. Lu, W. Liu, F.-L. Chung, and H. Huang. Stacked robust adaptively regularized auto-regressions for domain adaptation. IEEE Transactions on Knowledge and Data Engineering, 31(3):561–574, 2018.
[18] W. Jiang, W. Liu, and F.-l. Chung. Knowledge transfer for spectral clustering. Pattern Recognition, 81:484–496, 2018.
[19] G. Kang, L. Jiang, Y. Yang, and A. G. Hauptmann. Contrastive adaptation network for unsupervised domain adaptation. arXiv preprint arXiv:1901.00976, 2019.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 1097–1105, 2012.
[21] Y. Li, L. Yuan, and N. Vasconcelos. Bidirectional learning for domain adaptation of semantic segmentation. arXiv preprint arXiv:1904.10620, 2019.
[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014.
[23] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 469–477, 2016.
[24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.
[25] M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning (ICML), pages 97–105, 2015.
[26] Y. Luo, L. Zheng, T. Guan, J. Yu, and Y. Yang. Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. arXiv preprint arXiv:1809.09478, 2018.
[27] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim. Image to image translation for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4500–4509, 2018.
[28] P. O. Pinheiro. Unsupervised domain adaptation with similarity learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8004–8013, 2018.
[29] G.-J. Qi, W. Liu, C. Aggarwal, and T. Huang. Joint intermodal and intramodal label transfers for extremely rare or unseen classes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7):1360–1373, 2016.
[30] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 91–99, 2015.
[31] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In Proceedings of the European Conference on Computer Vision (ECCV), pages 102–118. Springer, 2016.
[32] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3234–3243, 2016.
[33] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3723–3732, 2018.
[34] S. Sankaranarayanan, Y. Balaji, A. Jain, S. Nam Lim, and R. Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3752–3761, 2018.
[35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR), 2015.
[36] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7472–7481, 2018.
[37] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7167–7176, 2017.
[38] D. Vazquez, A. M. Lopez, J. Marin, D. Ponsa, and D. Geronimo. Virtual and real world adaptation for pedestrian detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):797–809, 2014.
[39] T.-H. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. arXiv preprint arXiv:1811.12833, 2018.
[40] Z. Wu, X. Han, Y.-L. Lin, M. Gokhan Uzunbas, T. Goldstein, S. Nam Lim, and L. S. Davis. Dcan: Dual channel-wise alignment networks for unsupervised scene adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 518–534, 2018.
[41] S. Xie, Z. Zheng, L. Chen, and C. Chen. Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning (ICML), pages 5419–5428, 2018.
[42] J. Zhang, Y. Cao, S. Fang, Y. Kang, and C. Wen Chen. Fast haze removal for nighttime image using maximum reflectance prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7418–7426, 2017.
[43] J. Zhang, Y. Cao, Y. Wang, C. Wen, and C. W. Chen. Fully point-wise convolutional neural network for modeling statistical regularities in natural images. In Proceedings of the ACM International Conference on Multimedia (ACM MM), pages 984–992. ACM, 2018.
[44] J. Zhang and D. Tao. Famed-net: A fast and accurate multi-scale end-to-end dehazing network. IEEE Transactions on Image Processing, 29:72–84, 2020.
[45] Y. Zhang, P. David, and B. Gong. Curriculum domain adaptation for semantic segmentation of urban scenes. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2020–2030, 2017.
[46] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2881–2890, 2017.
[47] Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), pages 289–305, 2018.