Style Normalization and Restitution for Domain Generalization and Adaptation
Xin Jin, Cuiling Lan, Member, IEEE, Wenjun Zeng, Fellow, IEEE, Zhibo Chen, Senior Member, IEEE
Abstract—For many practical computer vision applications, the learned models usually have high performance on the datasets used for training but suffer from significant performance degradation when deployed in new environments, where there are usually style differences between the training images and the testing images. An effective domain generalizable model is expected to be able to learn feature representations that are both generalizable and discriminative. In this paper, we design a novel Style Normalization and Restitution (SNR) module to simultaneously ensure both high generalization and discrimination capability of the networks. In the SNR module, particularly, we filter out the style variations (e.g., illumination, color contrast) by performing Instance Normalization (IN) to obtain style normalized features, where the discrepancy among different samples and domains is reduced. However, such a process is task-ignorant and inevitably removes some task-relevant discriminative information, which could hurt the performance. To remedy this, we propose to distill task-relevant discriminative features from the residual (i.e., the difference between the original feature and the style normalized feature) and add them back to the network to ensure high discrimination. Moreover, for better disentanglement, we enforce a dual causality loss constraint in the restitution step to encourage the better separation of task-relevant and task-irrelevant features. We validate the effectiveness of our SNR on different computer vision tasks, including classification, semantic segmentation, and object detection. Experiments demonstrate that our SNR module is capable of improving the performance of networks for domain generalization (DG) and unsupervised domain adaptation (UDA) on many tasks. Code is available at https://github.com/microsoft/SNR.
Index Terms—Discriminative and Generalizable Feature Representations; Style Normalization and Restitution; Feature Disentanglement; Domain Generalization; Unsupervised Domain Adaptation.
1 INTRODUCTION

Deep neural networks (DNNs) have advanced the state of the art for a wide variety of computer vision tasks. The trained models typically perform well on the test/validation dataset which follows similar characteristics/distribution as the training data, but suffer from significant performance degradation (poor generalization capability) on unseen datasets that may present different styles [1], [2]. This is ubiquitous in practical applications. For example, we may want to deploy a trained classification or detection model in unseen environments, like a newly opened retail store, or a house. The captured images in the new environments in general present style discrepancy with respect to the training data, such as illumination, color contrast/saturation, quality, etc. (as shown in Fig. 1). These result in domain gap/shift between the training and testing.

To address such domain gap/shift problems, many investigations have been conducted, and they can be divided into two categories: domain generalization (DG) [3], [4], [5], [6], [7], [8] and unsupervised domain adaptation (UDA) [2], [9], [10], [11], [12], [13], [14], [15], [16]. DG and UDA both aim to bridge the gaps between source and target domains. DG exploits only labeled source domain data, while UDA can also access/exploit the unlabeled data of the target domain for training/fine-tuning. Neither requires the costly labeling of the target domain data, which is desirable in
Xin Jin and Zhibo Chen are with the University of Science and Technology of China, Hefei, Anhui, 230026, China (e-mail: [email protected]). Cuiling Lan and Wenjun Zeng are with Microsoft Research Asia, Building 2, No. 5 Dan Ling Street, Haidian District, Beijing, 100080, China (e-mail: {culan, wezeng}@microsoft.com). Corresponding authors: Cuiling Lan and Zhibo Chen. This work was done when Xin Jin was an intern at Microsoft Research Asia.

Fig. 1: Due to the differences in environments (such as lighting/camera/place/weather), the captured images present style discrepancy, such as (a) illumination, (b) color contrast/saturation, (c) quality/resolutions, and (d) imaging style. These result in domain gaps between the training and testing data.

practical applications.

In particular, due to the domain gaps, directly applying a model trained on a source dataset to an unseen target dataset typically suffers from a large performance degradation [3], [4], [5], [6], [7], [8]. As a consequence, feature regularization based UDA methods have been widely investigated to mitigate the domain gap by aligning the domains for better transferring source knowledge to the target domain. Several methods align the statistics, such as the second order correlation [17], [18], [19], or both mean and variance (moment matching) [14], [20], in the networks to reduce the domain discrepancy on features [21], [22]. Some other methods introduce adversarial learning, which learns domain-invariant features to deceive domain classifiers [10], [23], [24]. The alignment of domains reduces
Fig. 2: Overall flowchart. (a) Our generalizable feature learning network with the proposed Style Normalization and Restitution (SNR) module being plugged in after some convolutional blocks. Here, we use ResNet-50 as our backbone for illustration. (b) Proposed SNR module. Instance Normalization (IN) is used to eliminate some style discrepancies, followed by identity-relevant feature restitution (marked by red solid arrows). Note that the branch with the dashed green line is only used for enforcing the loss constraint and is discarded in inference. (c) Dual causality loss constraint encourages the disentanglement of a residual feature $R$ into a task-relevant one ($R^+$) and a task-irrelevant one ($R^-$), which respectively decreases and enhances the entropy when added to the style normalized feature $\tilde{F}$ (see Section 3.1).

domain-specific variations but inevitably leads to loss of some discriminative information [25]. Even though many works investigate UDA, the study on domain generalization (DG) is not as extensive.

Domain generalization (DG) aims to design models that are generalizable to previously unseen domains [3], [5], [26], [27], [28], [29], without accessing the target domain data. Classic DG approaches tend to learn domain-invariant features by minimizing the dissimilarity in features across domains [3], [27]. Some other DG methods explore optimization strategies to help improve generalization, e.g., through meta-learning [6], episodic training [8], and adaptive ensemble learning [30]. Recently, Jia et al. [28] and Zhou et al. [31] integrate a simple but effective style regularization operation, i.e., Instance Normalization (IN), in the networks to alleviate the domain discrepancy by reducing appearance style variations, which achieves clear improvement. However, the feature style regularization using IN is task-ignorant and will inevitably remove some task-relevant discriminative information [32], [33], thus hindering the achievement of high performance.

In this paper, we propose a Style Normalization and Restitution (SNR) method to enhance both the generalization and discrimination capabilities of the networks for computer vision tasks. Fig. 2 shows our proposed SNR module and illustrates the dual causality loss. We propose to first perform style normalization by introducing Instance Normalization (IN) into our neural network architecture to eliminate style variations. For a feature map of an image, IN normalizes the features across spatial positions on each channel, which preserves the spatial structure but reduces instance-specific style like contrast and illumination [32], [34], [35]. IN reduces style discrepancy among instances and domains, but it inevitably results in the loss of some discriminative information. To remedy this, we propose to distill the task-specific information from the residues (i.e., the difference between the original features and the instance-normalized features) and add it back to the network. Moreover, to better disentangle the task-relevant features from the residual, a dual causality loss constraint is designed by ensuring that the features after restitution of the task-relevant features are more discriminative than those before restitution, and that the features after restitution of the task-irrelevant features are less discriminative than those before restitution.

We summarize our main contributions as follows:
• We propose a Style Normalization and Restitution (SNR) module, a simple yet effective plug-and-play tool, for existing neural networks to enhance their generalization capabilities. To compensate for the loss of discriminative information caused by style normalization, we propose to distill the task-relevant discriminative information from the residual (i.e., the difference between the original feature and the instance-normalized feature).
• We introduce a dual causality loss constraint in SNR to encourage the better disentanglement of task-relevant features from the residual information.
• The proposed SNR module is generic and can be applied to various networks for different vision tasks to enhance the generalization capability, including object classification, detection, semantic segmentation, etc. Moreover, thanks to the enhancement of the generalization and discrimination capability of the networks, SNR could also improve the performance of existing unsupervised domain adaptation networks.

Extensive experiments demonstrate that our SNR significantly improves the generalization capability of the networks and brings improvement to existing unsupervised domain adaptation networks. This work is an extension of our conference paper [36], which is specifically designed for person re-identification. In this work, we make the design generic and incorporate it into popular generic tasks, such as object classification, detection, and semantic segmentation. In addition, we tailor the dual causality loss to these tasks by leveraging entropy comparisons.
2 RELATED WORK
2.1 Domain Generalization

DG considers a challenging setting where the target data is unavailable during training. Some recent DG methods explore optimization strategies to improve generalization, e.g., through meta-learning [6], episodic training [8], or adaptive ensemble learning [30]. Li et al. [6] propose a meta-learning solution, which uses a model agnostic training procedure to simulate train/test domain shift during training and jointly optimizes the simulated training and testing domains within each mini-batch. Episodic training is proposed in [8], which decomposes a deep network into feature extractor and classifier components, and then trains each component by simulating it interacting with a partner that is badly tuned for the current domain. This makes both components more robust. Zhou et al. [30] propose domain adaptive ensemble learning (DAEL), which learns multiple experts (for different domains) collaboratively so that, when forming an ensemble, they can leverage complementary information from each other to be more effective for an unseen target domain. Some other methods augment the samples to enhance the generalization capability [5], [37].

Some DG approaches tend to learn domain-invariant features by aligning the domains/minimizing the feature dissimilarity across domains [3], [27]. Recently, several works attempt to add Instance Normalization (IN) to CNNs to improve the model generalization ability [28], [33]. Instance Normalization (IN) layers [38] can eliminate instance-specific style discrepancy, and IN has been extensively investigated in the field of image style transfer [32], [34], [35], where the mean and variance of IN reflect the style of images. For DG, IN alleviates the style discrepancy among domains/instances, and thus improves the domain generalization. In [33], a CNN called IBN-Net is designed by inserting IN into the shallow layers for enhancing the generalization capability.
However, instance normalization is task-ignorant and inevitably introduces the loss of discriminative information [32], [33], leading to inferior performance. Pan et al. [33] use IN and Batch Normalization (BN) together (half of the channels use IN while the other half use BN, for the IBN-a setting) in the same layer to preserve some discrimination. Nam et al. [39] determine the use of BN or IN (at the dataset level) for each channel based on learned gate parameters, which lacks adaptivity to instances. Besides, the selection of IN or BN for a channel is hard (0 or 1) rather than soft. In this paper, we propose a style normalization and restitution module. First, we perform IN for all channels to enhance generalization. To assure high discrimination, we go a step further and consider a restitution step, which adaptively distills task-specific features from the residual (removed information) and restitutes them to the network.

2.2 Unsupervised Domain Adaptation

Unsupervised domain adaptation (UDA) is a target-domain-annotation-free transfer learning task, where the labeled source domain data and unlabeled target domain data are available for training. Existing UDA methods typically explore learning domain-invariant features by reducing the distribution discrepancy between the learned features of the source and target domains. Some methods minimize distribution divergence by optimizing the maximum mean discrepancy (MMD) [2], [11], [22], second order correlation [17], [18], [19], etc. Some other methods learn to achieve domain confusion by leveraging adversarial learning to reduce the difference between the training and testing domain distributions [10], [24], [40], [41].
Moreover, some recent works tend to separate the model into a feature extractor and a classifier, and develop new metrics to pull close the learned source and target feature representations. In particular, Maximum Classifier Discrepancy (MCD) [12] maximizes the discrepancy between two classifiers while minimizing it with respect to the feature extractor. Similarly, Minimax Entropy (MME) [42] maximizes the conditional entropy on unlabeled target data w.r.t. the classifier and minimizes it w.r.t. the feature encoder. M3SDA [14] minimizes the moment distance among the source and target domains, and a per-domain classifier is used and optimized as in MCD to enhance the alignment.

Our proposed SNR module aims at enhancing the generalization ability while preserving the discrimination capability, and thus can enhance the performance of existing UDA approaches.
2.3 Feature Disentanglement

Deep neural networks are known to extract features where multiple hidden factors are highly entangled [43]. Learning disentangled representations can help remove irrelevant features [44]. To this end, some recent works [45], [46], [47] explore the learning of interpretable representations by using generative adversarial networks (GANs) [48] and variational autoencoders (VAEs) [49]. Under the fully supervised setting, Odena et al. propose an auxiliary classifier GAN (AC-GAN) to achieve representation disentanglement [47]. Liu et al. introduce a unified feature disentanglement framework to learn domain-invariant features from data across different domains [46]. Lee et al. propose to disentangle the features into a domain-invariant content space and a domain-specific attribute space, producing diverse outputs without paired training data [50]. Inspired by these works, we propose to disentangle the task-specific features from the discarded/removed residual features, in order to distill and restore the discriminative information. To encourage a better disentanglement, we introduce a dual causality loss constraint, which enforces a higher discrimination of the feature after the restitution than before. The basic idea is to make the class likelihood after the restitution sharper than before, which leads to less ambiguity for a sample.
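The "sharper likelihood means less ambiguity" intuition can be made concrete with a toy entropy computation (our illustrative sketch; the numbers are arbitrary, not from the paper):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_k p_k log p_k.
    A lower H means a sharper, less ambiguous class likelihood."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + 1e-12)).sum())

sharp = [0.90, 0.05, 0.05]   # confident prediction
flat = [0.40, 0.30, 0.30]    # ambiguous prediction
assert entropy(sharp) < entropy(flat)
```

The dual causality loss introduced later (Section 3.1.3) pushes the restituted feature toward the low-entropy regime and the contaminated feature toward the high-entropy regime.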
3 STYLE NORMALIZATION AND RESTITUTION
We propose a Style Normalization and Restitution (SNR) module, which enhances the generalization capability while preserving the discriminative power of the networks for effective DG and DA. Figure 2 shows the overall flowchart of our framework. Particularly, SNR can be used as a plug-and-play module for existing (e.g., classification/detection/segmentation) networks. Taking the widely used ResNet-50 [51] backbone network as an example (see Fig. 2(a)), an SNR module is added after each convolutional block.

In the SNR module (see Fig. 2(b)), we denote the input feature map by $F \in \mathbb{R}^{h \times w \times c}$ and the output by $\tilde{F}^+ \in \mathbb{R}^{h \times w \times c}$, where $h, w, c$ denote the height, width, and number of channels, respectively. We first eliminate style discrepancy among samples/instances by performing Instance Normalization (IN). Then, we propose a dedicated restitution step to distill the task-relevant (discriminative) feature from the residual (previously discarded by IN, which is the difference between the original feature $F$ and the style normalized feature $\tilde{F}$), and add it to the normalized feature $\tilde{F}$. Moreover, we introduce a dual causality loss constraint to facilitate the better separation of task-relevant and task-irrelevant features within the SNR module (see Fig. 2(c)).

SNR is generic and can be used in different networks for different tasks. We also present the usage of SNR (with small variations on the dual causality loss forms with respect to different tasks) in detail for different tasks (i.e., object classification, detection, and semantic segmentation). Besides, since SNR can enhance the generalization and discrimination capability of networks, which is also very important for UDA, SNR is capable of benefiting existing UDA networks.

3.1 Style Normalization and Restitution Module

Real-world images could be captured by different cameras under different scenes and environments (e.g., lighting/camera/place/weather). As shown in Figure 1, the captured images present large style discrepancies (e.g., in illumination, color contrast/saturation, quality, imaging style), especially for samples from two different datasets/domains. Domain discrepancy between the source and target domains generally hinders the generalization capability of learned models.

A learning-theoretic analysis in [3] shows that reducing feature dissimilarity improves the generalization ability on new domains. As discussed in Section 2.1, Instance Normalization (IN) actually performs a kind of style normalization which reduces the discrepancy/dissimilarity among instances/samples [32], [33], so it has the power to enhance the generalization ability of networks [28], [31], [33]. Inspired by that, in the SNR module, we first try to reduce the instance discrepancy on the input feature by performing Instance Normalization [32], [34], [35], [38] as

$$\tilde{F} = \mathrm{IN}(F) = \gamma \left( \frac{F - \mu(F)}{\sigma(F)} \right) + \beta, \qquad (1)$$

where $\mu(\cdot)$ and $\sigma(\cdot)$ denote the mean and standard deviation computed across spatial dimensions independently for each channel and each sample/instance, and $\gamma, \beta \in \mathbb{R}^c$ are parameters learned from the data. IN can filter out some instance-specific style information from the content. With IN performed in the feature space, Huang et al. have argued and experimentally shown that IN has more profound impacts than a simple contrast normalization and that it performs a form of style normalization by normalizing feature statistics [32].

However, IN inevitably removes some discriminative information and results in weaker discrimination capability [33]. To address this problem, we propose to distill and restitute the task-specific discriminative feature from the information removed by IN, by disentangling it into a task-relevant feature and a task-irrelevant feature with a dual causality loss constraint (see Fig. 2(b)). We elaborate on such restitution hereafter. As illustrated in Fig.
2(b), to ensure high discrimination of the features, we propose to restitute the task-relevant feature to the network by distilling it from the residual feature $R$, defined as

$$R = F - \tilde{F}, \qquad (2)$$

which denotes the difference between the original input feature $F$ and the style normalized feature $\tilde{F}$.

We disentangle the residual feature $R$ in a content-adaptive way through channel attention. This is crucial for learning generalizable feature representations since the discriminative components of different images are typically different. Specifically, given $R$, we disentangle it into two parts: the task-relevant feature $R^+ \in \mathbb{R}^{h \times w \times c}$ and the task-irrelevant feature $R^- \in \mathbb{R}^{h \times w \times c}$, by masking $R$ with a learned channel attention response vector $a = [a_1, a_2, \cdots, a_c] \in \mathbb{R}^c$:

$$R^+(:, :, k) = a_k R(:, :, k), \quad R^-(:, :, k) = (1 - a_k) R(:, :, k), \qquad (3)$$

where $R(:, :, k) \in \mathbb{R}^{h \times w}$ denotes the $k$-th channel of the feature map $R$, $k = 1, 2, \cdots, c$. We expect the channel attention response vector $a$ to help adaptively distill the task-relevant feature for the restitution. We derive it by SE-like [52] channel attention as

$$a = g(R) = \sigma(W_2 \, \delta(W_1 \, \mathrm{pool}(R))), \qquad (4)$$

where the attention module is implemented by a spatial global average pooling layer, followed by two FC layers (parameterized by $W_1 \in \mathbb{R}^{(c/r) \times c}$ and $W_2 \in \mathbb{R}^{c \times (c/r)}$), and $\delta(\cdot)$ and $\sigma(\cdot)$ denote the ReLU and sigmoid activation functions, respectively. To reduce the number of parameters, the dimension reduction ratio $r$ is set to 16.

By adding this distilled task-relevant feature $R^+$ to the style normalized feature $\tilde{F}$, we obtain the output feature $\tilde{F}^+$ of the SNR module as

$$\tilde{F}^+ = \tilde{F} + R^+. \qquad (5)$$

Similarly, by adding the task-irrelevant feature $R^-$ to the style normalized feature $\tilde{F}$, we obtain the contaminated feature $\tilde{F}^- = \tilde{F} + R^-$, which is used in the loss optimization in the next subsection.

It is worth pointing out that, instead of using two independent attention modules to obtain $R^+$ and $R^-$, respectively, we use $a(\cdot)$ and $1 - a(\cdot)$ to facilitate the disentanglement. We will discuss the effectiveness of this operation in the experiment section.
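Eqs. (1)-(5) can be sketched as a single PyTorch module. This is our reading of the description above, not the authors' released code; all layer and variable names are illustrative:

```python
import torch
import torch.nn as nn

class SNR(nn.Module):
    """Sketch of the SNR module of Fig. 2(b) for NCHW tensors."""

    def __init__(self, c, r=16):
        super().__init__()
        # Eq. (1): IN with learnable gamma/beta (affine=True)
        self.instance_norm = nn.InstanceNorm2d(c, affine=True)
        # Eq. (4): SE-like attention, pool -> FC (c/r) -> ReLU -> FC (c) -> sigmoid;
        # 1x1 convolutions play the role of the two FC layers W1, W2
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, c // r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(c // r, c, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, F):
        F_tilde = self.instance_norm(F)   # style normalized feature
        R = F - F_tilde                   # residual, Eq. (2)
        a = self.attn(R)                  # channel attention a in (0, 1)^c
        R_plus = a * R                    # task-relevant part, Eq. (3)
        R_minus = (1.0 - a) * R           # task-irrelevant part, Eq. (3)
        F_plus = F_tilde + R_plus         # restituted output, Eq. (5)
        F_minus = F_tilde + R_minus       # contaminated feature (loss only)
        return F_plus, F_tilde, F_minus

snr = SNR(c=64)
x = torch.randn(2, 64, 32, 32)
f_plus, f_tilde, f_minus = snr(x)   # only f_plus feeds the next conv block
```

At inference only $\tilde{F}^+$ is propagated; a module like this would be appended after each convolutional block of, e.g., a ResNet-50, as in Fig. 2(a). Note that by construction $\tilde{F}^+ + \tilde{F}^- = \tilde{F} + F$, since $R^+ + R^- = R$.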
Fig. 3: Illustration of obtaining the feature vector for causality loss optimization with respect to different tasks. (a) For the classification task, spatial average pooling is performed over the entire feature map ($h \times w \times c$) to obtain a feature vector of $c$ dimensions (see Section 3.2.1). (b) For the segmentation task (pixel-level classification), the entropy is calculated for each pixel (see Section 3.2.2). (c) For the detection task (region-level classification), spatial average pooling is performed over each ground-truth bounding box (bbox) region to obtain a feature vector of $c$ dimensions (see Section 3.2.3).

We use the channel attention vector $a$ to adaptively distill the task-relevant features for restitution for two reasons. (a) The style factors (e.g., illumination, hue, contrast, saturation) are in general regarded as spatially consistent. We leverage channel attention to select the discriminative style factors distributed in different channels. (b) In our SNR, "disentanglement" aims at better "restitution" of the discriminative information lost due to Instance Normalization (IN). IN reduces the style discrepancy of input features by performing normalization across spatial dimensions independently for each channel, where the normalization parameters are the same across different spatial positions. Consistent with IN, we disentangle the features and restitute the task-relevant ones to the normalized features at the channel level.
3.1.3 Dual Causality Loss Constraint

To promote the disentanglement of the task-relevant feature and the task-irrelevant feature, we design a dual causality loss constraint by comparing the discrimination capability of features before and after the restitution. The dual causality loss $L_{SNR}$ consists of $L^+_{SNR}$ and $L^-_{SNR}$, i.e., $L_{SNR} = L^+_{SNR} + L^-_{SNR}$. As illustrated in Figure 2(c), the physical meaning of the proposed dual causality loss constraint $L_{SNR}$ is that: after adding the task-relevant feature $R^+$ to the normalized feature $\tilde{F}$, the enhanced feature becomes more discriminative and its predicted class likelihood becomes less ambiguous (less uncertain), with a smaller entropy; on the other hand, after adding the task-irrelevant feature $R^-$ to the normalized feature $\tilde{F}$, the contaminated feature should become less discriminative, resulting in a larger entropy of the predicted class likelihood.

Taking the classification task as an example, we pass the spatially average pooled enhanced feature vector $\tilde{f}^+ = \mathrm{pool}(\tilde{F} + R^+) \in \mathbb{R}^c$ into an FC layer (of $K$ nodes, where $K$ denotes the number of classes) followed by a softmax function (we denote this as $\phi(\tilde{f}^+) \in \mathbb{R}^K$) and thus obtain its entropy. We denote the entropy function as $H(\cdot) = -\sum p(\cdot) \log p(\cdot)$. Similarly, the contaminated feature vector is obtained by $\tilde{f}^- = \mathrm{pool}(\tilde{F} + R^-)$, and the style normalized feature vector is $\tilde{f} = \mathrm{pool}(\tilde{F})$. $L^+_{SNR}$ and $L^-_{SNR}$ are defined as

$$L^+_{SNR} = \mathrm{Softplus}\big(H(\phi(\tilde{f}^+)) - H(\phi(\tilde{f}))\big), \quad L^-_{SNR} = \mathrm{Softplus}\big(H(\phi(\tilde{f})) - H(\phi(\tilde{f}^-))\big), \qquad (6)$$

where $\mathrm{Softplus}(\cdot) = \ln(1 + \exp(\cdot))$ is a monotonically increasing function that aims to reduce the optimization difficulty by avoiding negative loss values. For other tasks, e.g., segmentation and detection, there are some slight differences, e.g., in obtaining the feature vectors, which are described in the next subsection.
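For the classification case, Eq. (6) can be sketched as follows (a minimal PyTorch sketch; the linear head and all function names are our illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf

def mean_entropy(logits):
    """H(phi(f)) = -sum_k p_k log p_k, averaged over the batch."""
    logp = nnf.log_softmax(logits, dim=1)
    return -(logp.exp() * logp).sum(dim=1).mean()

def dual_causality_loss(F_plus, F_tilde, F_minus, head):
    """Eq. (6): pool each (n, c, h, w) feature map to a c-dim vector,
    score it with the K-way head, and compare prediction entropies."""
    def pooled(F):
        return nnf.adaptive_avg_pool2d(F, 1).flatten(1)
    h_plus = mean_entropy(head(pooled(F_plus)))
    h = mean_entropy(head(pooled(F_tilde)))
    h_minus = mean_entropy(head(pooled(F_minus)))
    # Restituting R+ should sharpen predictions (smaller entropy);
    # restituting R- should blur them (larger entropy).
    return nnf.softplus(h_plus - h) + nnf.softplus(h - h_minus)

head = nn.Linear(64, 10)          # assumed K = 10 classes
F_t = torch.randn(2, 64, 8, 8)
loss = dual_causality_loss(F_t + 0.1, F_t, F_t - 0.1, head)
```

Since Softplus is strictly positive, the loss is always a positive scalar; it only approaches zero when the desired entropy ordering $H(\phi(\tilde{f}^+)) < H(\phi(\tilde{f})) < H(\phi(\tilde{f}^-))$ holds with a large margin.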
3.2 SNR for Different Vision Tasks

The proposed SNR is general. It can improve the generalization and discrimination capability of networks for DG and DA. As a plug-and-play module, SNR can be easily applied to different neural networks for different computer vision tasks, e.g., object classification, segmentation, and detection. As described in Section 3.1.3, we pass the spatially average pooled enhanced/contaminated feature vector $\tilde{f}^+$/$\tilde{f}^-$ into the function $H(\phi(\cdot))$ to obtain the entropy. For the different tasks of classification (i.e., image-level classification), segmentation (i.e., pixel-level classification), and detection (i.e., region-level classification), there are some differences in obtaining the feature vectors for calculating the causality losses. Fig. 3 illustrates the manners of obtaining the feature vectors, respectively. We elaborate on them in the following subsections.

3.2.1 Object Classification

For a $K$-category classification task, we take the backbone network of ResNet-50 as an example for describing the usage of SNR. As illustrated in Fig. 2(a), we could insert the proposed SNR module after each convolution block. For an SNR module, given an input feature $F$, we obtain three features: the style normalized feature $\tilde{F}$, the enhanced feature $\tilde{F}^+$, and the contaminated feature $\tilde{F}^-$. As shown in Fig. 3(a), we spatially average pool the features to get the corresponding feature vectors (i.e., $\tilde{f}$, $\tilde{f}^+$, and $\tilde{f}^-$) to calculate the dual causality loss for optimization.

3.2.2 Semantic Segmentation

Semantic segmentation predicts the label for each pixel, which is a pixel-wise classification problem. Similar to classification, we insert the SNR modules into the backbone networks of segmentation. Differently, in our causality loss, as illustrated in Fig. 3(b), we calculate the entropy for the feature vector of each spatial position (since each spatial position has a classification likelihood) instead of over the spatially average pooled feature vector. To save computation and to be robust to pixel noises, we take the average entropy over all pixels to calculate the causality loss as

$$L^+_{SNR} = \mathrm{Softplus}\Big( \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} H(\phi(\tilde{F}^+(i, j, :))) - \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} H(\phi(\tilde{F}(i, j, :))) \Big),$$

$$L^-_{SNR} = \mathrm{Softplus}\Big( \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} H(\phi(\tilde{F}(i, j, :))) - \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} H(\phi(\tilde{F}^-(i, j, :))) \Big), \qquad (7)$$

where $\tilde{F}(i, j, :)$ denotes the feature vector at the spatial position $(i, j)$ of the feature map $\tilde{F}$. Note that this is slightly better in terms of performance than calculating the causality loss for each pixel, while requiring less computation.

3.2.3 Object Detection

The widely used object detection frameworks, like R-CNN [57], Fast/Faster R-CNN [58], and Mask R-CNN [59], perform object proposals, regress the bounding box of each object, and predict its class, where the class prediction is based on the feature region of the bounding box. Similar to the classification task, we insert SNR modules in the backbone network. Since the object detection task can be regarded as a 'region-wise' (bounding box regression) classification task, as illustrated in Fig. 3(c), we calculate the entropy for each ground-truth bounding box region, with the feature vector obtained by spatial average pooling of the features within each bounding box region. We take the average entropy of all the object regions in an image to calculate the causality loss.

Fig. 4: Four classification datasets (first two for DG and last two for UDA). (a) PACS, which includes Sketch, Photo, Cartoon, and Art. (b) Office-Home, which includes Real-world (Real), Product, Clipart, and Art. (c) Digit-Five, which includes MNIST [53] (mt), MNIST-M [54] (mm), USPS [55] (up), SVHN [56] (sv), and Synthetic [54] (syn). (d) DomainNet, which includes Clipart (clp), Infograph (inf), Painting (pnt), Quickdraw (qdr), Real (rel), and Sketch (skt). Considering the required huge computation resources, we use a subset of DomainNet (i.e., mini-DomainNet) following [30] for experiments.
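The per-pixel variant of the dual causality loss in Eq. (7) can be sketched as follows (a hedged PyTorch sketch; the 1x1-conv head, the class count, and all names are our illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf

def mean_pixel_entropy(feat, head):
    """Average per-pixel prediction entropy of an (n, c, h, w) map,
    as in Eq. (7); `head` is an assumed 1x1-conv K-way classifier
    giving each spatial position its own class likelihood."""
    logp = nnf.log_softmax(head(feat), dim=1)        # (n, K, h, w)
    return -(logp.exp() * logp).sum(dim=1).mean()    # mean over n, h, w

def seg_dual_causality_loss(F_plus, F_tilde, F_minus, head):
    h_plus = mean_pixel_entropy(F_plus, head)
    h = mean_pixel_entropy(F_tilde, head)
    h_minus = mean_pixel_entropy(F_minus, head)
    return nnf.softplus(h_plus - h) + nnf.softplus(h - h_minus)

head = nn.Conv2d(64, 19, kernel_size=1)   # e.g., 19 classes as in Cityscapes
F_t = torch.randn(2, 64, 16, 16)
seg_loss = seg_dual_causality_loss(F_t + 0.1, F_t, F_t - 0.1, head)
```

For the detection variant described above, one would instead average pool the feature inside each ground-truth box, score the resulting $c$-dim vectors with a linear head, and average their entropies over the object regions of the image.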
4 EXPERIMENT
We validate the effectiveness and superiority of our SNR method under the domain generalization and adaptation settings for object classification (Section 4.1), segmentation (Section 4.2), and detection (Section 4.3), respectively. For each task, we describe the datasets and implementation details within each section. Moreover, without loss of generality, we study some design choices on the object classification task in Section 4.1.5. In Section 4.1.6, we further provide the visualization analysis.
4.1 Object Classification

We first evaluate the effectiveness of SNR on the object classification task, under the domain generalization (DG) and unsupervised domain adaptation (UDA) settings, respectively.
We conduct experiments on four classification datasets of multiple domains: PACS (which includes Sketch, Photo, Cartoon, and Art), Office-Home [63], Digit-Five (the five most popular digit datasets, MNIST [53], MNIST-M [54], USPS [55], SVHN [56], and Synthetic [54]), and DomainNet [14]. Fig. 4 shows some samples of these datasets. PACS [4] and Office-Home [64] are two widely used DG datasets, where each dataset includes four domains. PACS has seven object categories and Office-Home has 65 categories. Digit-Five consists of five different digit recognition datasets: MNIST [53], MNIST-M [54], USPS [55], SVHN [56], and SYN [54]. We follow the same split setting as [14] to use the dataset. DomainNet is a recently introduced benchmark for large-scale multi-source domain adaptation [14], which includes six domains (i.e., Clipart, Infograph, Painting, Quickdraw, Real, and Sketch) of 600k images (345 classes). Considering the high demand on computational resources, following [30], we use a subset of DomainNet, i.e., mini-DomainNet, for experiments.

PACS and Office-Home are usually used for DG. We validate the effectiveness of DG on PACS and Office-Home. Following [14], we use the leave-one-domain-out protocol. For PACS and Office-Home, similar to [8], [65], we use ResNet18 as the backbone to build our baseline network. We train the model for 40 epochs with an initial learning rate of 0.002. Each mini-batch contains 30 images (10 per source domain). We insert an SNR module after each convolutional block of the ResNet18 baseline as our SNR scheme.
1. We use the baseline code from Epi-FCR [8] (https://github.com/HAHA-DL/Episodic-DG) as our code framework for the experiments on PACS and Office-Home.
TABLE 1: Performance (in accuracy %) comparisons with the state-of-the-art domain generalization approaches for image classification.
Method PACS Office-Home
Art Cartoon Photo Sketch Avg Art Clipart Product Real Avg
MMD-AAE [60] 75.2 72.7 96.0 64.2 77.0 | 56.5 47.3 72.1 74.8 62.7
CCSA [27] 80.5 76.9 93.6 66.8 79.4 | 59.9 49.9 74.1 75.7 64.9
JiGen [7] 79.4 75.3 …
TABLE 2: Ablation study and performance comparisons (in accuracy %) with the state-of-the-art unsupervised domain adaptation approaches for image classification.
(a) Results on Digit-Five.
Method | mm mt up sv syn | Avg
DAN [11] 63.78 96.31 94.24 62.45 85.43 | 80.44
CORAL [17] 62.53 97.21 93.45 64.40 82.77 | 80.07
DANN [23] 71.30 97.60 92.33 63.48 85.34 | 82.01
JAN [22] 65.88 97.21 95.42 75.27 86.55 | 84.07
ADDA [24] 71.57 97.89 92.83 75.48 86.45 | 84.84
DCTN [61] 70.53 96.23 92.81 77.61 86.77 | 84.79
MEDA [62] 71.31 96.47 97.01 78.45 84.62 | 85.60
MCD [12] 72.50 96.21 95.33 78.89 87.47 | 86.10
M3SDA [14] 69.76 98.58 95.23 78.56 87.56 | 86.13
M3SDA-β [14] 72.82 98.43 96.14 81.32 89.58 | 87.65
Baseline (M3SDA) 69.76 98.58 95.23 78.56 87.56 | 86.13
SNR-M3SDA …
(b) Results on mini-DomainNet.
Method | clp pnt rel skt | Avg
MCD [12] 62.91 45.77 57.57 45.88 | 53.03
DCTN [61] 62.06 48.79 58.85 48.25 | 54.49
DANN [23] 65.55 46.27 58.68 47.88 | 54.60
M3SDA [14] 64.18 49.05 57.70 49.21 | 55.03
M3SDA-β [14] 65.58 50.85 58.40 49.33 | 56.04
MME [42] 68.09 47.14 …

Digit-Five and DomainNet are usually used for DA, and we validate the effectiveness of our method on them under the DA setting. We follow prior works [4], [8], [65] and use the leave-one-domain-out protocol. For Digit-Five, following [14], we build the backbone with three convolutional layers and two fully connected layers. We insert a SNR module after each convolutional layer of the baseline as our SNR scheme. For each mini-batch, we sample 64 images from each domain. The model is trained with an initial learning rate of 0.05 for 30 epochs. For mini-DomainNet, we use ResNet18 [51] as the backbone. We insert a SNR module after each convolutional block of the ResNet18 baseline as our SNR scheme. We sample 32 images from each domain to form a mini-batch (of size 32 × …).
DG is very attractive in practical applications, as it aims at "train once and run everywhere". We perform experiments on PACS and Office-Home for DG. There are few prior works in this field.
MMD-AAE [60] learns a domain-invariant embedding by minimizing the Maximum Mean Discrepancy (MMD) distance to align the feature representations.
CCSA [27] proposes a semantic alignment loss to reduce the feature discrepancy among domains.
CrossGrad [5] uses a domain discriminator to guide the data augmentation with adversarial gradients.
2. We use the baseline code from DEAL [30] (https://github.com/KaiyangZhou/Dassl.pytorch) as our code framework for the experiments on the Digit-Five and mini-DomainNet datasets.
JiGen [7] jointly optimizes object classification and the Jigsaw puzzle problem.
Epi-FCR [8] leverages an episodic training strategy to simulate domain shift during model training.
Table 1 shows the comparisons with the state-of-the-art methods. We can see that the proposed scheme SNR achieves the best average accuracy on both PACS and Office-Home. SNR outperforms our baseline Baseline (AGG), which aggregates all source domains to train a single model, on both PACS and Office-Home. SNR also outperforms the second-best method by 1.7% on Office-Home.
The introduction of SNR modules into the networks of existing UDA methods can reduce the domain gaps while preserving discrimination, and thus facilitates domain adaptation. Table 2 shows the experimental results on the two datasets, Digit-Five and mini-DomainNet. Here, we use the alignment-based UDA method M3SDA [14] as our baseline UDA network for domain adaptive classification. We refer to the scheme after applying our SNR as SNR-M3SDA.
We have the following observations. 1) For the overall performance (as shown in the column marked by Avg), the scheme SNR-M3SDA achieves the best performance on both datasets, outperforming the second-best method (M3SDA-β [14]) significantly by 6.47% on Digit-Five and 2.03% on mini-DomainNet in accuracy. 2) In comparison with the baseline scheme Baseline (M3SDA [14]), which uses the aligning technique in [14] for domain adaptation, the introduction of SNR (scheme SNR-M3SDA) brings significant gains of 7.99% on Digit-Five and 3.04% on mini-
TABLE 3: Effectiveness of our SNR, compared to other normalization-based methods for domain generalizable classification. Note that the italics denote the left-out target domain. We use ResNet18 as our backbone.
Method | PACS: Art, Cat, Pho, Skt, Avg | Office-Home: Art, Clp, Prd, Rel, Avg
AGG 77.0 75.9 …

TABLE 4: Ablation study on the dual causality loss L_SNR for domain generalizable classification. Here, we use ResNet18 as our backbone.
Method (L_SNR, L+_SNR, L−_SNR) | PACS: Art, Cat, Pho, Skt, Avg | Office-Home: Art, Clp, Prd, Rel, Avg
Baseline (AGG) 77.0 75.9 …
DomainNet in accuracy, demonstrating the effectiveness of SNR modules for UDA.
We first perform comprehensive ablation studies to demonstrate the effectiveness of 1) the SNR module and 2) the proposed dual causality loss constraint. We evaluate the models under the domain generalization setting (on the PACS and Office-Home datasets), with ResNet18 as our backbone network. Besides, we validate that SNR is beneficial to UDA and is complementary to the existing UDA techniques on the Digit-Five dataset.
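The entropy-comparison idea behind the dual causality loss constraint can be sketched as follows. The softplus margin form, the linear classifier, and the toy logits are our own assumptions for illustration; only the intent follows the text: the enhanced feature should be more confident (lower prediction entropy) than the style-normalized feature, and the contaminated feature less confident (higher entropy).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy(p, eps=1e-12):
    """Shannon entropy of predicted class likelihoods, per sample."""
    return -(p * np.log(p + eps)).sum(axis=1)

def softplus(x):
    return np.log1p(np.exp(x))

def dual_causality_loss(logits_norm, logits_plus, logits_minus):
    """Sketch of L_SNR = L+ + L-: compare prediction entropies before and
    after restitution (normalized vs. enhanced/contaminated features)."""
    h_norm = entropy(softmax(logits_norm))
    h_plus = entropy(softmax(logits_plus))
    h_minus = entropy(softmax(logits_minus))
    loss_plus = softplus(h_plus - h_norm).mean()    # want H(f+) < H(f~)
    loss_minus = softplus(h_norm - h_minus).mean()  # want H(f-) > H(f~)
    return loss_plus + loss_minus

# toy check: a confident "enhanced" prediction and a uniform "contaminated" one
ln = np.array([[1.0, 0.5, 0.0]])
lp = np.array([[5.0, 0.0, 0.0]])   # low entropy
lm = np.array([[0.0, 0.0, 0.0]])   # high entropy (uniform)
low = dual_causality_loss(ln, lp, lm)
high = dual_causality_loss(ln, lm, lp)  # roles swapped -> larger loss
```

Swapping the enhanced and contaminated logits increases the loss, which is exactly the separation the constraint rewards.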
Effectiveness of SNR.
Here we compare several schemes with our proposed SNR. AGG: a simple but strong baseline that aggregates all source domains to train a single model. AGG-All-IN: on top of the AGG scheme, we replace all the Batch Normalization (BN) [67] layers by Instance Normalization (IN). AGG-IN: on top of the AGG scheme, an IN layer is added after each convolutional block/stage (the first four blocks) of the backbone (ResNet18), respectively. AGG-IBN-a, AGG-IBN-b: following IBNNet [33], we insert BN and IN in parallel at the beginning of the first two residual blocks for scheme AGG-IBN-a, and we add IN to the last layers of the first two residual blocks to get AGG-IBN-b. AGG-All-BIN: following [39], we replace all BN layers of the baseline network by Batch-Instance Normalization (BIN) to get the scheme AGG-All-BIN, which uses dataset-level learned gates to determine whether to perform instance normalization or batch normalization for each channel. AGG-All-BIN∗ denotes a variant of AGG-All-BIN, where we replace the original dataset-level learned gates with content-adaptive gates (via a channel attention layer [52]) for the selection of the normalization manner. AGG-SNR: our final scheme, where a SNR module is added after each block (of the first four convolutional blocks/stages) of the backbone, respectively (see Fig. 2). We also refer to it as SNR for simplicity.

Table 3 shows the results. We have the following observations/conclusions. The normalization-based methods, including AGG-All-IN, AGG-IN, AGG-IBN-a, AGG-IBN-b, AGG-All-BIN, and AGG-All-BIN∗, all improve the performance of the baseline scheme AGG on average on PACS, which demonstrates the effectiveness of IN for improving the model generalization capability. AGG-All-BIN∗ outperforms AGG-All-IN by 0.9% and 0.4% on PACS and Office-Home, respectively. This is because IN introduces some loss of discriminative information, and the selective use of BN and IN can preserve some of this discriminative information. AGG-All-BIN∗ slightly outperforms the original AGG-All-BIN, demonstrating that the instance-adaptive determination of IN or BN is better than the dataset-level determination (i.e., the same selection of IN and BN for all instances).

Thanks to our compensation of the task-relevant information in the proposed restitution step, our final scheme SNR achieves superior performance, significantly outperforming all the baseline schemes. In particular, SNR outperforms AGG by 2.3% and 1.4% on PACS and Office-Home, respectively. SNR outperforms AGG-IN by 1.7% and 0.9% on PACS and Office-Home, respectively. Such large improvements also demonstrate that style normalization alone is not enough, and the proposed restitution is critical. Thanks to our restitution design, SNR outperforms AGG-All-BIN∗ by 1.2% and 0.5% on PACS and Office-Home, respectively.

Effectiveness of Dual Causality Loss.
Here, we perform an ablation study on the proposed dual causality loss constraint. Table 4 shows the results. 1) We observe that our final scheme SNR outperforms the scheme without the dual causality loss (i.e., scheme SNR w/o L_SNR) by 1.0% and 0.8% on PACS and Office-Home, respectively. The dual causality loss effectively promotes the disentanglement of task-relevant and task-irrelevant information. Besides, both the constraint on the enhanced feature L+_SNR
TABLE 5: Influence of SNR modules for DG and UDA, respectively, on top of a simple ResNet-50 baseline without incorporating other UDA methods. The DG schemes Baseline (AGG) and SNR do not use target domain data for training; SNR-UDA uses target domain unlabeled data for training.
Method | Digit-Five: mm, mt, up, sv, syn | Avg
Baseline (AGG) 63.37 90.50 88.71 63.54 82.44 | 77.71
SNR 65.46 93.14 88.32 63.43 84.08 | 78.89
SNR-UDA 65.86 93.24 89.79 65.21 85.04 | 79.83

and that on the contaminated feature L−_SNR contribute to the good feature disentanglement. 2) In L_SNR, we compare the entropy of the predicted class likelihoods of the features before and after the feature restitution process to encourage the distillation of discriminative features. To verify the effectiveness of this strategy, we compare it with the scheme SNR w/o Comparing, which minimizes the entropy of the predicted class likelihood of the enhanced feature f+ and maximizes the entropy of the predicted class likelihood of the contaminated feature f−, i.e., without comparison with the normalized feature. Table 4 reveals that our scheme SNR with the comparison outperforms the scheme SNR w/o Comparing on both PACS and Office-Home.

SNR for DG and UDA.
One may wonder about the performance when exploiting SNR for UDA directly, i.e., without using other UDA-based methods (e.g., M3SDA) together. We perform this experiment by training the scheme SNR (the baseline powered by SNR modules) using source domain labeled data and target domain unlabeled data. We refer to this scheme as SNR-UDA. Table 5 shows the comparisons on Digit-Five. The difference between SNR and SNR-UDA is that SNR-UDA uses target domain unlabeled data for training while SNR only uses source domain data. We can see that SNR-UDA outperforms SNR by 0.94% in average accuracy. Moreover, as shown in Table 2(a), introducing SNR modules to the baseline UDA scheme M3SDA brings a 7.99% gain for UDA. These results demonstrate that SNR is helpful for UDA, especially when it is jointly used with an existing UDA method. SNR modules reduce the style discrepancy between the source and target domains, which eases the alignment and adaptation. Note, however, that SNR modules reduce the style discrepancy of instances from the source and target domains, but there is no explicit interaction between the two domains after the restitution of discriminative features. Thus, the explicit alignment as in M3SDA is still very useful for UDA.
Which Stage to Add SNR?
We compare the cases of adding a single SNR module to each different convolutional block/stage, and to all four stages (i.e., stage-1 to stage-4) of the ResNet18 (see Fig. 2(a)), respectively. The module is added after the last layer of a convolutional block/stage. Table 6 shows that, on top of the baseline scheme Baseline (AGG), SNR is not sensitive to the insertion position and brings a gain at each stage. Moreover, when SNR is added to all four stages, we achieve the best performance.
TABLE 6: Ablation study on which stage to add SNR.
Method PACS
Art Cat Pho Skt Avg
Baseline (AGG) 77.0 75.9 96.0 69.2 79.5
stage-1 77.5 76.2 …
stage-all 80.3 78.2 …
TABLE 7: Study on the disentanglement designs in SNR
Method PACS
Art Cat Pho Skt Avg
Baseline (AGG) 77.0 75.9 …
SNR_conv …
SNR_g(·) …

Influence of Disentanglement Design.
In our SNR module, as described in Eq. (3)(4) of Section 3.1.2, we use the learned channel attention vector a(·) and its complement 1 − a(·) as masks to obtain the task-relevant feature R+ and the task-irrelevant feature R−, respectively. Here, we study the influence of different disentanglement designs within SNR. SNR_conv: we disentangle the residual feature R through 1×1 convolutions, i.e., R+ = ReLU(W+ R), R− = ReLU(W− R). SNR_g(·): we use two unshared channel attention gates g(·)+ and g(·)− to obtain R+ and R−, respectively. SNR-S: different from the original SNR design that leverages channel attention to achieve the feature separation, here we disentangle the residual feature R using only a spatial attention and its complement. SNR-SC: we disentangle the residual feature R through parallel spatial and channel attention. Table 7 shows the results. We have the following observations.

Our SNR outperforms SNR_conv on average on PACS, demonstrating the benefit of the explicit design of the decomposition using attention masks. Our SNR outperforms SNR_g(·) on average on PACS, demonstrating the benefit of the design that encourages interaction between R+ and R−, where their sum is equal to R. SNR-S is inferior to SNR, which is based on channel attention. The task-irrelevant style factors (e.g., illumination, contrast, saturation) are in general spatially consistent and are characterized by the statistics of each channel. IN reduces the style discrepancy of input features by performing normalization across the spatial dimensions independently for each channel, where the normalization parameters are the same across different spatial positions. Consistent with IN, we disentangle the features at the channel level and add the task-relevant ones back to the normalized features. SNR-SC outperforms SNR, which uses only channel attention, on average on PACS. To be simple and to align with our main purpose of distilling the removed task-relevant information, we use only channel attention by default.

Fig. 5: (a) Activation maps of different features within an SNR module (SNR-3), showing that SNR can disentangle the task-relevant (classification-relevant) object features well (i.e., R+). (b) Activation maps of our scheme (bottom) and the baseline Baseline (AGG) (top) with respect to images of varied styles (original, contrast changed, illumination changed). The maps of our SNR are more consistent for images of different styles.

Fig. 6: Visualization of t-SNE distributions on the Digit-Five dataset for the UDA classification task: (a) features of the baseline scheme Baseline (M3SDA); (b) features of our SNR-M3SDA.

Feature Map Visualization.
To better understand how our SNR works, we visualize the intermediate feature maps of the SNR module inserted in the third residual block (i.e., SNR-3). Following [31], [68], we obtain each activation map by summing the feature maps along the channel dimension, followed by a spatial ℓ2 normalization.
Fig. 5(a) shows the activation maps of the normalized feature F̃, the enhanced feature F̃+ = F̃ + R+, and the contaminated feature F̃− = F̃ + R−, respectively. We see that after adding the task-irrelevant feature R−, the contaminated feature F̃− has high responses mainly on the background. In contrast, the enhanced feature F̃+ (with the restitution of the task-relevant feature R+) has high responses on the regions of the objects ('dog' and 'horse'), better capturing discriminative feature regions.
Moreover, in Fig. 5(b), we further compare the activation maps F̃+ of our scheme and those of the strong baseline scheme Baseline (AGG) by varying the styles of the input images (e.g., contrast, illumination). We can see that, for images with different styles, the activation maps of our scheme are more consistent than those of the baseline scheme Baseline (AGG). The activation maps of Baseline (AGG) are more disorganized and are easily affected by style variations. These results indicate that our scheme is more robust to style variations.
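The activation-map computation described above (channel-wise summation followed by spatial ℓ2 normalization) can be sketched as below; the feature tensor is random and purely illustrative, and whether signed values or magnitudes are summed follows [31], [68] (we sum raw values here for simplicity):

```python
import numpy as np

def activation_map(feat):
    """Summarize a (C, H, W) feature tensor into one (H, W) activation map:
    sum over channels, then normalize by the spatial l2 norm."""
    amap = feat.sum(axis=0)               # channel-wise summation -> (H, W)
    norm = np.linalg.norm(amap) + 1e-12   # spatial l2 norm
    return amap / norm

feat = np.random.default_rng(0).normal(size=(256, 14, 14))
amap = activation_map(feat)
```

The resulting map has unit ℓ2 norm, so maps from different images and schemes are directly comparable when overlaid on the inputs.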
Visualization of Feature Distributions.
In Fig. 6, we visualize the feature distributions using t-SNE [69] for UDA classification on Digit-Five (under the setting mm, mt, sv, syn → up). We compare the feature distributions of (a) the baseline scheme Baseline (M3SDA [14]) and (b) our SNR-M3SDA. We observe that the features obtained by our SNR are better separated for different classes than those of the baseline scheme.
For the semantic segmentation task, we use three representative semantic segmentation datasets: Cityscapes [70], Synthia [71], and GTA5 [72]. Cityscapes contains 5,000 annotated images of 2048 × … resolution. […] We evaluate the GTA5-to-Cityscapes case. Since SYNTHIA has only 16 shared classes with Cityscapes, we consider the IoU and mIoU of these 16 classes in the SYNTHIA-to-Cityscapes setting.
As discussed in [75], [76], it is also important to adopt a stronger baseline model to understand the effect of different generalization/adaptation approaches and to enhance the performance for practical applications. Therefore, similar to [12], in all experiments we employ two kinds of backbones for evaluation.
1) We use DRN-D-105 [12], [77] as our baseline network and apply our SNR to it. For DRN-D-105, we follow the implementation of MCD. Similar to ResNet [51], DRN uses a block-based architecture. We insert our SNR module after each convolutional block of DRN-D-105. We use momentum SGD to optimize our models, with the momentum rate set to 0.9 and the learning rate set to 0.001 in all experiments. The images are resized to 1024 × … .
3. https://github.com/mil-tokyo/MCD_DA/tree/master/segmentation
TABLE 8: Domain generalization performance (%) for semantic segmentation when we train on GTA5 and test on Cityscapes.
GTA5 → Cityscapes
Setting | Backbone | Method | mIoU | road, sidewalk, building, wall, fence, pole, light, sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle, bicycle
Source only | DRN-D-105 | Baseline 29.84 | 45.82 20.80 58.86 5.14 …
SNR 36.16 | 83.34 17.32 78.74 16.85 …
DeeplabV2 | Baseline 36.94 | 71.41 15.33 74.04 21.13 14.49 22.86 33.93 18.62 80.75 20.98 68.58 56.62 …
SNR 42.68 | 78.95 29.51 79.92 25.01 20.32 28.33 34.83 20.40 82.76 36.13 71.47 59.19 …
TABLE 9: Domain generalization performance (%) of semantic segmentation when we train on Synthia and test on Cityscapes.
Synthia → Cityscapes
Setting | Backbone | Method | mIoU | road, sidewalk, building, wall, fence, pole, light, sign, vegetation, sky, person, rider, car, bus, motorcycle, bicycle
Source only | DRN-D-105 | Baseline 23.56 | 14.63 11.49 58.96 …
SNR 26.30 | 19.33 15.21 …

2) We use DeeplabV2 with a ResNet-101 backbone as our second baseline network, which is the same as in other works [81], [82]. We insert our SNR module after each convolutional block of ResNet-101. Following the implementation of MSL [76], we train the model with the SGD optimizer with a learning rate of …, momentum …, and weight decay … . We schedule the learning rate using the "poly" policy: the learning rate is multiplied by (1 − iter/max_iter) [78]. Similar to [83], we employ random flipping and Gaussian blur for data augmentation.
Here, we evaluate the effectiveness of SNR under the DG setting (training only on the source datasets and directly testing on the target test set). Since very few previous works investigate this task, we define the comparison/validation settings here. We compare the proposed scheme
SNR with 1) the Baseline (which only uses the source dataset for training) and 2) the baseline with an IN layer added after each convolutional block, denoted Baseline-IN.
Table 8 and Table 9 show that, for DRN-D-105, our scheme SNR outperforms Baseline by 6.32% and 2.74% in mIoU for GTA5-to-Cityscapes and Synthia-to-Cityscapes, respectively. For the stronger backbone network DeeplabV2, our scheme SNR outperforms Baseline by 5.74% and 3.24% in mIoU for GTA5-to-Cityscapes and Synthia-to-Cityscapes, respectively. Compared with the scheme Baseline-IN, our SNR also consistently outperforms it with both backbones in both settings.
Unsupervised domain adaptive semantic segmentation has been extensively studied [12], [76]; there, the unlabeled target domain data is also used for training. We validate the effectiveness of our method for UDA by adding the SNR modules into two popular UDA approaches, MCD [12] and MaxSquare (MS) [76], respectively. MCD [12] maximizes the discrepancy between two task classifiers while minimizing it with respect to the feature extractor for domain adaptation. MS [76] extends the entropy minimization idea to UDA for semantic segmentation by using a proposed maximum squares loss. We refer to the two schemes powered by our SNR modules as SNR-MCD and SNR-MS. Table 10 and Table 11 show that, based on the same DRN-105 backbone, SNR-MCD significantly outperforms the second-best method MCD [12] in mIoU for GTA5→Cityscapes and Synthia→Cityscapes, respectively. In addition, based on the DeeplabV2 backbone, SNR-MS consistently outperforms MaxSquare (MS) [76] in mIoU for GTA5→Cityscapes and Synthia→Cityscapes, respectively.
4. https://github.com/ZJULearning/MaxSquareLoss
We visualize qualitative results in Fig. 7, comparing the baseline schemes with the schemes powered by our SNR. For DG (first row), we can see that the introduction of SNR to Baseline brings an obvious improvement in the segmentation results. For UDA (second row), 1) the introduction of SNR to Baseline (MCD) brings a clear improvement in the segmentation results; and 2) the segmentation results with adaptation (UDA) to the target domain data are much better than those obtained from the domain generalization model, indicating that the exploitation of target domain data is helpful for good performance.
TABLE 10: Performance (%) comparisons with the state-of-the-art semantic segmentation approaches for unsupervised domain adaptation for GTA5-to-Cityscapes.
GTA5 → Cityscapes
Network | Method | mIoU | road, sdwk, bldng, wall, fence, pole, light, sign, vgttn, trrn, sky, person, rider, car, truck, bus, train, mcycl, bcycl
DRN-105 | DANN [23] 32.8 | 64.3 23.2 73.4 11.3 18.6 29.0 31.8 14.9 82.0 16.8 73.2 53.9 12.4 53.3 20.4 11.0 5.0 18.7 9.8
MCD [12] 35.0 | 87.5 17.6 79.7 22.0 10.5 27.5 21.9 10.6 82.7 30.3 78.2 41.1 9.7 80.4 19.3 23.1 11.7 9.3 1.1
SNR-MCD (ours) 40.3
SNR-MS (ours) 46.5
TABLE 11: Performance (%) comparisons with the state-of-the-art semantic segmentation approaches for unsupervised domain adaptation for Synthia-to-Cityscapes.
Synthia → Cityscapes
Network | Method | mIoU | road, sdwk, bldng, wall, fence, pole, light, sign, vgttn, sky, person, rider, car, bus, mcycl, bcycl
DRN-105 | DANN [23] 32.5 | 67.0 29.1 71.5 14.3 0.1 28.1 12.6 10.3 72.7 76.7 48.3 12.7 62.5 11.3 2.7 0.0
MCD [12] 36.6 | 84.5 43.2 77.6 6.0 0.1 29.1 7.2 5.6 83.8 83.5 51.5 11.8 76.5 19.9 4.7 0.0
SNR-MCD (ours) 39.6
SNR-MS (ours) 45.1
Fig. 7: Qualitative results on domain generalizable segmentation (first row) and domain adaptive segmentation (second row) from GTA5 to Cityscapes. Columns show the input, the baseline, our SNR, and the ground truth. For DG (first row), Baseline denotes the baseline scheme trained on the source domain dataset and tested on the target domain directly, and SNR denotes our scheme that adds SNR modules to Baseline. For UDA (second row), we compare the baseline scheme Baseline (MCD) [12] with the scheme SNR-MCD, which is powered by our SNR.
Following [85], [86], we evaluate performance on multi- and single-label object detection tasks using three different datasets.
Cityscapes [70] is a dataset of real urban scenes containing … images captured by a dash-cam; … images are used for training and the remainder for validation (this split is different from the above-mentioned statistics for semantic segmentation). Following [85], we report results on the validation set because we do not have annotations of the test set. There are … object categories in this dataset, including person, rider, car, truck, bus, train, motorcycle, and bicycle.
5. This dataset is usually used for semantic segmentation, as described before.
Foggy Cityscapes [87] is the foggy version of Cityscapes. The depth maps provided in Cityscapes are used to simulate three intensity levels of fog [87]. In our experiments we use the fog level with the highest intensity (least visibility) to imitate a large domain gap. The same dataset split as for Cityscapes is used for Foggy Cityscapes.
KITTI [88] is another real-world dataset consisting of … images of real-world traffic situations, including freeways, urban and rural areas. Following [85], when KITTI is used as the source, we use the entire dataset for training; when it is used as the target test set for DG, we use the entire dataset for testing.
For the domain generalization (DG) experiments, we employ the original Faster R-CNN [58] as our baseline, which is trained using the source domain training data. We follow [58] to set the hyper-parameters. For our scheme SNR, we add SNR modules into the backbone of the Faster R-CNN (one SNR module after each of the first four convolutional blocks of ResNet-50), which is initialized using weights pre-trained on ImageNet. We train the network with a learning rate of … for …k iterations and then reduce the learning rate to … for another …k iterations.
For the unsupervised domain adaptation (UDA) experiments, we use the Domain Adaptive Faster R-CNN (DA Faster R-CNN) [85] model as our baseline, which tackles the domain shift on two levels, the image level and the instance level. A domain classifier is added on each level, trained in
TABLE 12: Performance (in mAP accuracy %) of object detection on the Foggy Cityscapes validation set. Models are trained on the Cityscapes training set.
Cityscapes → Foggy Cityscapes
Setting | Method | person, rider, car, truck, bus, train, mcycle, bicycle | mAP
DG | Faster R-CNN [58] | 17.8 23.6 27.1 11.9 23.8 9.1 14.4 22.8 | 18.8
DG | SNR-Faster R-CNN | …
UDA | DA Faster R-CNN [85] | 25.0 31.0 40.5 22.1 35.3 20.2 20.0 27.1 | 27.6
UDA | SNR-DA Faster R-CNN | …
TABLE 13: Performance (in AP accuracy %) for the class Car for object detection on the KITTI (K) and Cityscapes (C) datasets.
Setting | Method | K → C | C → K
DG | Faster R-CNN [58] | 30.24 | 53.52
DG | SNR | … | …
UDA | DA Faster R-CNN [85] | 38.52 | 64.15
UDA | SNR | … | …

an adversarial training manner. A consistency regularizer is incorporated within these two classifiers to learn a domain-invariant RPN for the Faster R-CNN model. Each batch is composed of two images, one from the source domain and the other from the target domain. A momentum of 0.9 and a weight decay of 0.0005 are used in our experiments. For all experiments, we report the mean average precision (mAP) with an IoU threshold of 0.5 for evaluation.

Results for Normal to Foggy Weather.
Differences in weather conditions can significantly affect visual data. In many applications (e.g., autonomous driving), the object detector needs to perform well in all conditions [87]. Here we evaluate the effectiveness of our SNR and demonstrate its generalization superiority over the current state of the art for this task. We use the Cityscapes dataset as the source domain and Foggy Cityscapes as the target domain (denoted by "Cityscapes → Foggy Cityscapes").
Table 12 compares our schemes using SNR with the two baselines (Faster R-CNN [58] and Domain Adaptive (DA) Faster R-CNN [85]) under the domain generalization and domain adaptation settings. We report the average precision for each category and the mean average precision (mAP) over all the object classes. We can see that our SNR improves Faster R-CNN in mAP for domain generalization, and improves DA Faster R-CNN in mAP for unsupervised domain adaptation.

Results for Cross-Dataset DG and UDA.
Many factors can result in domain gaps. There is usually some data bias when collecting datasets [89]. For example, different datasets are usually captured by different cameras or collected by different organizations with different preferences, resulting in different image quality/resolution/characteristics. In this subsection, we conduct experiments on two datasets: Cityscapes and KITTI. We train the detector only on annotated cars, because car is the only object category common to both Cityscapes and KITTI.
6. We use the repository https://github.com/yuhuayc/da-faster-rcnn.
Fig. 8: Qualitative comparisons of the baseline approach DA Faster R-CNN [85] and the baseline powered by our SNR on "Cityscapes → KITTI". The top and bottom rows show the cars detected by the baseline scheme DA Faster R-CNN and by our scheme SNR-DA Faster R-CNN, respectively.
TABLE 14: Comparisons of complexity and model sizes. FLOPs: the number of FLoating-point OPerations; Params: the number of parameters.
Model | FLOPs | Params
ResNet-18 | 1.83G | 11.74M
ResNet-18-SNR | 2.03G | 12.30M
Δ | +9.80% | +4.50%
ResNet-50 | 3.87G | 24.56M
ResNet-50-SNR | 4.08G | 25.12M
Δ | +5.10% | +2.20%

Table 13 compares our methods with the two baselines, Faster R-CNN [58] and Domain Adaptive (DA) Faster R-CNN [85], under the domain generalization and domain adaptation settings, respectively. We denote KITTI (source dataset) to Cityscapes (target dataset) as K → C, and vice versa. We can see that the introduction of SNR brings significant performance improvements under both the DG and UDA settings. For UDA, we visualize qualitative detection results in Fig. 8. Our SNR corrects several false positives in the first column and detects cars that DA Faster R-CNN missed in the second column.
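As background for the AP/mAP numbers above: a predicted box counts as a true positive only when its IoU with a ground-truth box reaches the 0.5 threshold used throughout our evaluation. A minimal sketch of box IoU (the box coordinates are illustrative):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

pred = (0, 0, 10, 10)
gt   = (5, 0, 15, 10)
iou = box_iou(pred, gt)         # intersection 5x10=50, union 150
is_true_positive = iou >= 0.5   # 1/3 < 0.5, so this detection would not count
```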
In Table 14, we analyze the increase in complexity brought by our SNR modules, in terms of FLOPs and model size, for different backbone networks. Here, we use our default setting where we insert a SNR module after each convolutional block (for the first four blocks) of the ResNet-18 and ResNet-50 backbones. We observe that our SNR modules bring only a small increase in complexity. For the ResNet-50 [51] backbone, SNR brings an increase of just 2.2% in model size (24.56M vs. 25.12M) and an increase of 5.1% in computational complexity (3.87G vs. 4.08G FLOPs).
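A rough back-of-the-envelope check on the parameter overhead, assuming (our assumption, for illustration only) a single SE-style channel-attention gate with reduction ratio 16 per inserted module; the real SNR module contains additional layers, so the actual overhead reported in Table 14 (+0.56M for ResNet-18) is larger:

```python
def se_gate_params(channels, reduction=16):
    """Two-layer bottleneck gate C -> C/r -> C (weight matrices only, no bias)."""
    hidden = channels // reduction
    return channels * hidden + hidden * channels

# one gated module after each of the four ResNet-18 stages
stage_channels = [64, 128, 256, 512]
extra = sum(se_gate_params(c) for c in stage_channels)  # 43,520 extra weights
frac = extra / 11.74e6  # relative to the 11.74M-parameter ResNet-18 baseline
```

Even this lower bound confirms the qualitative conclusion: the per-stage gates are tiny compared with the backbone, so the overall model-size increase stays in the low single-digit percent range.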
CONCLUSION
In this paper, we present a Style Normalization and Restitution (SNR) module, which aims to learn generalizable and discriminative feature representations for effective domain generalization and adaptation. SNR is generic: as a plug-and-play module, it can be inserted into existing backbone networks for many computer vision tasks. SNR reduces style variations by using Instance Normalization (IN). To prevent the loss of task-relevant discriminative information caused by IN, we propose to distill task-relevant discriminative features from the discarded residual features and add them back to the network through a well-designed restitution step. Moreover, to promote a better feature disentanglement of task-relevant and task-irrelevant information, we introduce a dual causality loss constraint. Extensive experimental results demonstrate the effectiveness of our SNR module for both domain generalization and domain adaptation. The schemes powered by SNR achieve state-of-the-art performance on various tasks, including classification, semantic segmentation, and object detection.

ACKNOWLEDGMENTS
This work was supported in part by NSFC under Grants U1908209 and 61632001, and by the National Key Research and Development Program of China under Grant 2018AAA0101400. We would like to thank Li Zhang, Associate Professor at Fudan University, for the valuable and constructive suggestions.

REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in
NeurIPS, 2012, pp. 1097–1105.
[2] M. Long, H. Zhu, J. Wang, and M. I. Jordan, "Unsupervised domain adaptation with residual transfer networks," in NeurIPS, 2016, pp. 136–144.
[3] K. Muandet, D. Balduzzi, and B. Schölkopf, "Domain generalization via invariant feature representation," in International Conference on Machine Learning, 2013, pp. 10–18.
[4] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales, "Deeper, broader and artier domain generalization," in ICCV, 2017, pp. 5542–5550.
[5] S. Shankar, V. Piratla, S. Chakrabarti, S. Chaudhuri, P. Jyothi, and S. Sarawagi, "Generalizing across domains via cross-gradient training," ICLR, 2018.
[6] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales, "Learning to generalize: Meta-learning for domain generalization," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[7] F. M. Carlucci, A. D'Innocente, S. Bucci, B. Caputo, and T. Tommasi, "Domain generalization by solving jigsaw puzzles," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2229–2238.
[8] D. Li, J. Zhang, Y. Yang, C. Liu, Y.-Z. Song, and T. M. Hospedales, "Episodic training for domain generalization," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1446–1455.
[9] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
[10] Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," ICML, 2014.
[11] M. Long, Y. Cao, J. Wang, and M. I. Jordan, "Learning transferable features with deep adaptation networks," ICML, 2015.
[12] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada, "Maximum classifier discrepancy for unsupervised domain adaptation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3723–3732.
[13] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell, "Cycada: Cycle-consistent adversarial domain adaptation," ICML, 2018.
[14] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang, "Moment matching for multi-source domain adaptation," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1406–1415.
[15] R. Xu, G. Li, J. Yang, and L. Lin, "Larger norm more transferable: An adaptive feature norm approach for unsupervised domain adaptation," in
ICCV , 2019, pp. 1426–1435.[16] X. Wang, Y. Jin, M. Long, J. Wang, and M. I. Jordan, “Transferablenormalization: Towards improving transferability of deep neuralnetworks,” in
NeurIPS , 2019, pp. 1951–1961.[17] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easydomain adaptation,” in
AAAI , 2016.[18] B. Sun and K. Saenko, “Deep coral: Correlation alignment for deepdomain adaptation,” in
ECCV , 2016, pp. 443–450.[19] X. Peng and K. Saenko, “Synthetic to real adaptation with genera-tive correlation alignment networks,” in
WACV . IEEE, 2018, pp.1982–1991.[20] W. Zellinger, T. Grubinger, E. Lughofer, T. Natschl¨ager, andS. Saminger-Platz, “Central moment discrepancy (cmd) fordomain-invariant representation learning,”
CoRR , 2017.[21] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deepdomain confusion: Maximizing for domain invariance,” arXivpreprint arXiv:1412.3474 , 2014.[22] M. Long, H. Zhu, J. Wang, and M. I. Jordan, “Deep transferlearning with joint adaptation networks,” in
ICML , 2017, pp. 2208–2217.[23] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle,F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,”
The Journal of MachineLearning Research , vol. 17, no. 1, pp. 2096–2030, 2016.[24] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarialdiscriminative domain adaptation,” in
CVPR , 2017, pp. 7167–7176.[25] H. Liu, M. Long, J. Wang, and M. Jordan, “Transferable adversarialtraining: A general approach to adapting deep classifiers,” in
ICML , 2019, pp. 4013–4022.[26] M. Ghifary, D. Balduzzi, W. B. Kleijn, and M. Zhang, “Scattercomponent analysis: A unified framework for domain adaptationand domain generalization,”
TPAMI , vol. 39, no. 7, pp. 1414–1430,2016.[27] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto, “Uni-fied deep supervised domain adaptation and generalization,” in
Proceedings of the IEEE International Conference on Computer Vision ,2017, pp. 5715–5725.[28] J. Jia, Q. Ruan, and T. M. Hospedales, “Frustratingly easy personre-identification: Generalizing person re-id in practice,” 2019.[29] J. Song, Y. Yang, Y.-Z. Song, T. Xiang, and T. M. Hospedales, “Gen-eralizable person re-identification by domain-invariant mappingnetwork,” 2019.[30] K. Zhou, Y. Yang, Y. Qiao, and T. Xiang, “Domain adaptiveensemble learning,” arXiv preprint arXiv:2003.07325 , 2020.[31] K. Zhou, Y. Yang, A. Cavallaro et al. , “Omni-scale feature learningfor person re-identification,” 2019.[32] X. Huang and S. Belongie, “Arbitrary style transfer in real-timewith adaptive instance normalization,” in
Proceedings of the IEEEInternational Conference on Computer Vision , 2017, pp. 1501–1510.[33] X. Pan, P. Luo, J. Shi, and X. Tang, “Two at once: Enhancinglearning and generalization capacities via ibn-net,” in
Proceedingsof the European Conference on Computer Vision (ECCV) , 2018, pp.464–479.[34] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Improved texturenetworks: Maximizing quality and diversity in feed-forward styl-ization and texture synthesis,” in
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition , 2017, pp. 6924–6932. [35] V. Dumoulin, J. Shlens, and M. Kudlur, “A learned representationfor artistic style,” ICLR , 2017.[36] X. Jin, C. Lan, W. Zeng, Z. Chen, and L. Zhang, “Style normal-ization and restitution for generalizable person re-identification,” arXiv preprint arXiv:2005.11037 , 2020.[37] R. Volpi, H. Namkoong, O. Sener, J. C. Duchi, V. Murino, andS. Savarese, “Generalizing to unseen domains via adversarial dataaugmentation,” in
NeurIPS , 2018, pp. 5334–5344.[38] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normaliza-tion: The missing ingredient for fast stylization,” arXiv preprintarXiv:1607.08022 , 2016.[39] H. Nam and H.-E. Kim, “Batch-instance normalization for adap-tively style-invariant neural networks,” in
Advances in NeuralInformation Processing Systems , 2018, pp. 2558–2567.[40] W. Zhang, W. Ouyang, W. Li, and D. Xu, “Collaborative andadversarial network for unsupervised domain adaptation,” in
CVPR , 2018, pp. 3801–3809.[41] H. Zhao, S. Zhang, G. Wu, J. M. Moura, J. P. Costeira, and G. J.Gordon, “Adversarial multiple source domain adaptation,” in
NeurIPS , 2018, pp. 8559–8570.[42] K. Saito, D. Kim, S. Sclaroff, T. Darrell, and K. Saenko, “Semi-supervised domain adaptation via minimax entropy,” in
Proceed-ings of the IEEE International Conference on Computer Vision , 2019,pp. 8050–8058.[43] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, andY. LeCun, “Disentangling factors of variation in deep representa-tion using adversarial training,” in
Advances in neural informationprocessing systems , 2016, pp. 5040–5048.[44] X. Peng, Z. Huang, Y. Zhu, and K. Saenko, “Federated adversarialdomain adaptation,”
ICLR , 2020.[45] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey,“Adversarial autoencoders,” arXiv preprint arXiv:1511.05644 , 2015.[46] A. H. Liu, Y.-C. Liu, Y.-Y. Yeh, and Y.-C. F. Wang, “A unified featuredisentangler for multi-domain image translation and manipula-tion,” in
Advances in neural information processing systems , 2018, pp.2590–2599.[47] A. Odena, C. Olah, and J. Shlens, “Conditional image synthesiswith auxiliary classifier gans,” in
Proceedings of the 34th Interna-tional Conference on Machine Learning-Volume 70 . JMLR. org, 2017,pp. 2642–2651.[48] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,S. Ozair, A. Courville, and Y. Bengio, “Generative adversarialnets,” in
Advances in neural information processing systems , 2014, pp.2672–2680.[49] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114 , 2013.[50] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang,“Diverse image-to-image translation via disentangled representa-tions,” in
Proceedings of the European conference on computer vision(ECCV) , 2018, pp. 35–51.[51] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning forimage recognition,” in
CVPR , 2016.[52] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in
CVPR , 2018, pp. 7132–7141.[53] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-basedlearning applied to document recognition,” in
IEEE , 1998.[54] Y. Ganin and V. S. Lempitsky, “Unsupervised domain adaptationby backpropagation,” in
ICML , 2015.[55] J. J. Hull, “A database for handwritten text recognition research,”
TPAMI , vol. 16, no. 5, pp. 550–554, 1994.[56] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y.Ng, “Reading digits in natural images with unsupervised featurelearning,” in
NeurIPS-W , 2011.[57] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hier-archies for accurate object detection and semantic segmentation,”in
CVPR , 2014, pp. 580–587.[58] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in
Advancesin neural information processing systems , 2015, pp. 91–99.[59] K. He, G. Gkioxari, P. Doll´ar, and R. Girshick, “Mask r-cnn,” in
ICCV , 2017, pp. 2961–2969.[60] H. Li, S. Jialin Pan, S. Wang, and A. C. Kot, “Domain generaliza-tion with adversarial feature learning,” in
Proceedings of the IEEEConference on Computer Vision and Pattern Recognition , 2018, pp.5400–5409.[61] R. Xu, Z. Chen, W. Zuo, J. Yan, and L. Lin, “Deep cocktail net-work: Multi-source unsupervised domain adaptation with cate- gory shift,” in
Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition , 2018, pp. 3964–3973.[62] J. Wang, W. Feng, Y. Chen, H. Yu, M. Huang, and P. S. Yu,“Visual domain adaptation with manifold embedded distributionalignment,” in
ACMMM , 2018, pp. 402–410.[63] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan,“Deep hashing network for unsupervised domain adaptation,” in
CVPR , 2017.[64] ——, “Deep hashing network for unsupervised domain adapta-tion,” in
CVPR , 2017.[65] F. M. Carlucci, A. D’Innocente, S. Bucci, B. Caputo, and T. Tom-masi, “Domain generalization by solving jigsaw puzzles,” in
CVPR , 2019.[66] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descentwith warm restarts,” in
ICLR , 2017.[67] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deepnetwork training by reducing internal covariate shift,” arXivpreprint arXiv:1502.03167 , 2015.[68] W.-S. Zheng, S. Gong, and T. Xiang, “Person re-identification byprobabilistic relative distance comparison,” 2011.[69] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,”2008.[70] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Be-nenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes datasetfor semantic urban scene understanding,” in
Proc. of the IEEEConference on Computer Vision and Pattern Recognition (CVPR) , 2016.[71] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez,“The synthia dataset: A large collection of synthetic images forsemantic segmentation of urban scenes,” in
Proceedings of the IEEEconference on computer vision and pattern recognition , 2016, pp. 3234–3243.[72] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data:Ground truth from computer games,” in
European Conference onComputer Vision (ECCV) , ser. LNCS, B. Leibe, J. Matas, N. Sebe, andM. Welling, Eds., vol. 9906. Springer International Publishing,2016, pp. 102–118.[73] J. Hoffman, D. Wang, F. Yu, and T. Darrell, “Fcns in the wild: Pixel-level adversarial and constraint-based adaptation,” arXiv preprintarXiv:1612.02649 , 2016.[74] Y. Zhang, P. David, and B. Gong, “Curriculum domain adaptationfor semantic segmentation of urban scenes,” in
Proceedings of theIEEE International Conference on Computer Vision , 2017, pp. 2020–2030.[75] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, andM. Chandraker, “Learning to adapt structured output space forsemantic segmentation,” in
Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition , 2018, pp. 7472–7481.[76] M. Chen, H. Xue, and D. Cai, “Domain adaptation for semanticsegmentation with maximum squares loss,” in
Proceedings of theIEEE International Conference on Computer Vision , 2019, pp. 2090–2099.[77] F. Yu, V. Koltun, and T. Funkhouser, “Dilated residual networks,”in
Proceedings of the IEEE conference on computer vision and patternrecognition , 2017, pp. 472–480.[78] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille,“Deeplab: Semantic image segmentation with deep convolutionalnets, atrous convolution, and fully connected crfs,”
CoRR , vol.abs/1606.00915, 2016.[79] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning forimage recognition,” in
CVPR , 2016.[80] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, “Imagenet: Alarge-scale hierarchical image database,” in
CVPR , 2009.[81] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chan-draker, “Learning to adapt structured output space for semanticsegmentation,” in
CVPR , 2018.[82] T. Vu, H. Jain, M. Bucher, M. Cord, and P. P´erez, “ADVENT: ad-versarial entropy minimization for domain adaptation in semanticsegmentation,”
CoRR , vol. abs/1811.12833, 2018.[83] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsingnetwork,” in
CVPR , 2017, pp. 2881–2890.[84] T.-H. Vu, H. Jain, M. Bucher, M. Cord, and P. P´erez, “Advent: Ad-versarial entropy minimization for domain adaptation in semanticsegmentation,” in
Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition , 2019, pp. 2517–2526.[85] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool, “Domainadaptive faster r-cnn for object detection in the wild,” in
Proceed- ings of the IEEE conference on computer vision and pattern recognition ,2018, pp. 3339–3348.[86] M. Khodabandeh, A. Vahdat, M. Ranjbar, and W. G. Macready, “Arobust learning approach to domain adaptive object detection,” in Proceedings of the IEEE International Conference on Computer Vision ,2019, pp. 480–490.[87] C. Sakaridis, D. Dai, and L. Van Gool, “Semantic foggy sceneunderstanding with synthetic data,”
IJCV , pp. 1–20, 2018.[88] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meetsrobotics: The kitti dataset,”
IJRR , 2013.[89] A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in
CVPR 2011 . IEEE, 2011, pp. 1521–1528.
Xin Jin received the B.S. degree in electrical & information engineering from Chang'an University in 2017. He is currently pursuing the Ph.D. degree with the Department of Electronic Engineering and Information Science, University of Science and Technology of China. His current research interests include image/video compression, computer vision, and machine learning.

Cuiling Lan received the B.S. degree in electrical engineering and the Ph.D. degree in intelligent information processing from Xidian University, Xi'an, China, in 2008 and 2014, respectively. She joined Microsoft Research Asia, Beijing, China, in 2014. Her current research interests include computer vision problems related to pose estimation, action recognition, person/vehicle re-identification, and domain generalization/adaptation.
Wenjun (Kevin) Zeng (M'97-SM'03-F'12) is a Sr. Principal Research Manager and a member of the senior leadership team at Microsoft Research Asia. He has been leading the video analytics research empowering the Microsoft Cognitive Services, Azure Media Analytics Services, Office, and Windows Machine Learning since 2014. He was with the Univ. of Missouri from 2003 to 2016, most recently as a Full Professor. Prior to that, he had worked for PacketVideo Corp., Sharp Labs of America, Bell Labs, and Panasonic Technology. Wenjun has contributed significantly to the development of international standards (ISO MPEG, JPEG2000, and OMA). He received his B.E., M.S., and Ph.D. degrees from Tsinghua Univ., the Univ. of Notre Dame, and Princeton Univ., respectively. His current research interests include mobile-cloud media computing, computer vision, and multimedia communications and security. He is on the Editorial Board of the International Journal of Computer Vision. He was an Associate Editor-in-Chief of IEEE Multimedia Magazine, and was an AE of IEEE Trans. on Circuits & Systems for Video Technology (TCSVT), IEEE Trans. on Info. Forensics & Security, and IEEE Trans. on Multimedia (TMM). He was on the Steering Committee of IEEE Trans. on Mobile Computing and IEEE TMM. He served as the Steering Committee Chair of IEEE ICME in 2010 and 2011, and has served as the General Chair or TPC Chair for several IEEE conferences (e.g., ICME'2018, ICIP'2017). He was the recipient of several best paper awards. He is a Fellow of the IEEE.
Zhibo Chen (M'01-SM'11) received the B.Sc. and Ph.D. degrees from the Department of Electrical Engineering, Tsinghua University, in 1998 and 2003, respectively. He is now a full professor at the University of Science and Technology of China. Before that he worked at SONY and Thomson from 2003 to 2012. He used to be principal scientist and research manager in Thomson Research &