Universal Domain Adaptation through Self Supervision
Kuniaki Saito, Donghyun Kim, Stan Sclaroff, Kate Saenko
Boston University, MIT-IBM Watson AI Lab
[keisaito, dohnk, sclaroff, saenko]@bu.edu
Abstract
Unsupervised domain adaptation methods traditionally assume that all source categories are present in the target domain. In practice, little may be known about the category overlap between the two domains. While some methods address target settings with either partial or open-set categories, they assume that the particular setting is known a priori. We propose a more universally applicable domain adaptation framework that can handle arbitrary category shift, called Domain Adaptive Neighborhood Clustering via Entropy optimization (DANCE). DANCE combines two novel ideas: First, as we cannot fully rely on source categories to learn features discriminative for the target, we propose a novel neighborhood clustering technique to learn the structure of the target domain in a self-supervised way. Second, we use entropy-based feature alignment and rejection to align target features with the source, or reject them as unknown categories based on their entropy. We show through extensive experiments that DANCE outperforms baselines across open-set, open-partial and partial domain adaptation settings. Implementation is available at https://github.com/VisionLearningGroup/DANCE .

Introduction

Deep neural networks can learn highly discriminative representations for image recognition tasks [8, 38, 20, 32, 16], but do not generalize well to domains that are not distributed identically to the training data. Domain adaptation (DA) aims to transfer representations of source categories to novel target domains without additional supervision. Recent deep DA methods primarily do this by minimizing the feature distribution shift between the source and target samples [11, 25, 39]. However, these methods make strong assumptions about the degree to which the source categories overlap with the target domain, which limits their applicability to many real-world settings. In this paper, we investigate the problem of
Universal DA. Suppose L_s and L_t are the label sets in the source and target domain. In Universal DA we want to handle all of the following potential "category shifts": closed-set (L_s = L_t), open-set (L_s ⊂ L_t) [1, 36], partial (L_t ⊂ L_s) [2], or a mix of open and partial [44]. Existing DA methods cannot address Universal DA well because they are each designed to handle just one of the above settings. However, since the target domain is unlabeled, we may not know in advance which of these situations will occur. Thus, an unexpected category shift could lead to catastrophic misalignment. For example, using a closed-set method when the target has novel ("unknown") classes could incorrectly align them to source ("known") classes. The underlying issue at play is that existing work heavily relies on prior knowledge about the category shift.

Figure 1: We propose DANCE, which combines a self-supervised clustering loss (red) to cluster neighboring target examples and an entropy separation loss (gray) to consider alignment with source (best viewed in color).

The second problem is that over-reliance on source supervision makes it challenging to obtain discriminative features on the target. Prior methods focus on aligning target features with source, rather than on exploiting structure specific to the target domain. In the universal DA setting, this means that we may fail to learn features useful for discriminating "unknown" categories from the known categories, because such features may not exist in the source. Self-supervision was proposed in [5] to extract domain-generalizable features, but it is limited in that it did not utilize the cluster structure of the target domain. We propose to overcome these challenging problems by introducing
Domain Adaptive Neighborhood Clustering via Entropy optimization (DANCE). An overview is shown in Fig. 1. Rather than relying only on the supervision of source categories to learn a discriminative representation, DANCE harnesses the cluster structure of the target domain using self-supervision. This is done with a "neighborhood clustering" technique that self-supervises feature learning in the target. At the same time, useful source features and class boundaries are preserved and adapted via distribution alignment with batch normalization [7] and a novel partial domain alignment loss that we refer to as the "entropy separation loss." This loss allows the model to either match each target example with the source, or reject it as an "unknown" category. Our contributions are summarized as follows: (i) we propose DANCE, a universal domain adaptation framework that can be applied out-of-the-box without prior knowledge of the specific category shift; (ii) we design two novel loss functions, neighborhood clustering and entropy separation, for shift-agnostic adaptation; (iii) we experimentally observe that DANCE is the only method that outperforms the source-only model in every setting, achieving state-of-the-art performance on all open-set and open-partial DA settings, and on some partial DA settings; and (iv) we learn discriminative features of "unknown" target samples without any supervision.
Related Work

Closed-set Domain Adaptation (CDA).
The main challenge in domain adaptation (DA) is to leverage unlabeled target data to improve the source classifier's performance while accounting for domain shift. Classic approaches measure the distance between feature distributions in source and target, then train a model to minimize this distance. Many DA methods utilize a domain classifier to measure the distance [11, 40, 25, 26], while others utilize pseudo-labels assigned to target examples [35, 48]. Clustering-based methods are proposed by [9, 37, 15]. These and other mainstream methods assume that all target examples belong to source classes. In this sense, they rely heavily on the relationship between source and target.
Partial Domain Adaptation (PDA) handles the case where the target classes are a subset of the source classes. This task is solved by performing importance weighting on source examples that are similar to samples in the target [2, 45, 3]. Open-set Domain Adaptation (ODA) deals with target examples whose class is different from any of the source classes [29, 36, 24]. The drawback of ODA methods is that they assume there are necessarily unknown examples in the target domain, and they can fail in closed or partial domain adaptation. The idea of Universal Domain Adaptation (UniDA) was proposed in [44]. However, they applied their method to a mixture of PDA and ODA, which we call OPDA, where the target domain contains a subset of the source classes plus some unknown classes. Our goal is to propose a method that works well on CDA, ODA, PDA, and OPDA. We call this task UniDA in our paper.
Entropy Minimization. Entropy minimization [13] for unlabeled samples is popular in semi-supervised learning. In CDA, its effectiveness is confirmed when combined with batch-normalization-based domain alignment [4]. Pseudo-labeling is a similar approach, since it increases the confidence of the prediction for target samples, and it also performs well on CDA when combined with domain-specific batch normalization [7]. These methods simply attempt to increase the confidence for "known" classes, while our entropy separation loss can also decrease the confidence to reject "unknown" classes.
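For reference, plain entropy minimization (a sketch of the generic idea, not this paper's loss) uniformly pushes the prediction entropy of every unlabeled sample down, which implicitly assumes every target sample belongs to some "known" class:

```python
import numpy as np

def entropy_minimization_loss(probs, eps=1e-12):
    """Mean Shannon entropy of softmax outputs; minimizing it sharpens
    every prediction toward some known class."""
    return float((-(probs * np.log(probs + eps)).sum(axis=1)).mean())

uniform = np.full((1, 4), 0.25)                  # maximally uncertain prediction
sharp = np.array([[0.97, 0.01, 0.01, 0.01]])     # confident prediction
print(entropy_minimization_loss(uniform) > entropy_minimization_loss(sharp))
```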
Self-Supervised Learning.
Self-supervised learning obtains features useful for various image recognition tasks by using a large number of unlabeled images [10]. A model is trained to solve a pretext (surrogate) task such as solving a jigsaw puzzle [28] or instance discrimination [42]. [6] trained a model to predict each sample's cluster index given by k-means clustering. Directly applying that method to UniDA is challenging, since we would need to know the number of clusters in the target domain. [19, 47] proposed to perform instance discrimination and trained a model to discover neighborhoods for each example. Their results indicate that we can cluster samples without specifying cluster centers. They calculate a cross-entropy loss on the probability distribution of similarity between examples. Our neighborhood clustering (NC) is similar in that we aim to perform unsupervised clustering of unlabeled examples without specifying cluster centers, but different in that [19, 47] require assigning neighbors to each example. Instead, we perform entropy minimization on the similarity distribution among unlabeled target examples and source prototypes. Objectives that can cluster samples without specifying centers could possibly replace NC. Note, however, that one of our contributions is to provide a framework for UniDA that exploits the cluster structure of the target domain together with a domain-alignment objective that has a rejection option. Since these methods are not designed for DA, we focus on the comparison with DA baselines in our main paper. We provide an ablation study replacing our neighborhood clustering loss with [19, 6, 5] in the supplemental material, to better understand each loss.
Proposed Method

Our task is universal domain adaptation: given a labeled source domain D_s = {(x_i^s, y_i^s)}_{i=1}^{N_s} with "known" categories L_s, and an unlabeled target domain D_t = {x_i^t}_{i=1}^{N_t} which contains all or some "known" categories and possibly "unknown" categories. Our goal is to label the target samples with either one of the L_s labels or the "unknown" label. We train the model on D_s ∪ D_t and evaluate on D_t. We seek a truly universal method that can handle any possible category shift without prior knowledge of it. The key is not to force complete alignment between the entire source and target distributions, as this may result in catastrophic misalignment. Instead, the challenge is to extract well-clustered target features while performing a relaxed alignment to the source classes and potentially rejecting "unknown" points.

We adopt a prototype-based classifier that maps samples close to their true class centroid (prototype) and far from samples of other classes. We first propose to use self-supervision in the target domain to cluster target samples. We call this technique neighborhood clustering (NC). Each target point is aligned either to a "known"-class prototype in the source or to its neighbor in the target. This allows the model to learn a discriminative metric that maps a point to its semantically close match, whether or not its class is "known". This is achieved by minimizing the entropy of the distribution over point similarity. Second, we propose an entropy separation loss (ES) to either align the target point with a source prototype or reject it as "unknown". The loss is applied to the entropy of the "known"-category classifier's output to force it to be either low (the sample should belong to a "known" class) or high (the sample should be far from any "known" class). In addition, we utilize domain-specific batch normalization [7, 23, 34] to eliminate domain style information as a form of weak domain alignment.
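The prototype-based classifier can be sketched as follows (a simplified NumPy illustration under our own assumptions, not the released PyTorch implementation; logits are cosine similarities between L2-normalized features and the prototype weight vectors, scaled by a temperature):

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def prototype_logits(features, prototypes, tau=0.05):
    """Cosine similarity between L2-normalized features (B, d) and the
    K prototype weight vectors (K, d), scaled by a temperature tau."""
    f = l2_normalize(features)
    w = l2_normalize(prototypes)
    return f @ w.T / tau                      # (B, K)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

With this parameterization, classifying a sample amounts to finding its nearest prototype on the unit sphere, which is what lets the same geometry serve both classification and clustering.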
We adopt the architecture used in [34], which has an L2 normalization layer before the last linear layer. We can regard the weight vectors in the last linear layer as prototype features of each class. This architecture is well-suited to our purpose of finding a clustering over both target features and source prototypes. Let G be the feature extraction network, which takes an input x and outputs a feature vector f. Let W be the classification network, which consists of one linear layer without bias. The layer consists of weight vectors [w_1, w_2, ..., w_K], where K represents the number of classes in the source. W takes L2-normalized features and outputs K logits. p denotes the output of W after the softmax function.

The principle behind our self-supervised clustering objective is to move each target point either to a "known"-class prototype in the source or to its neighbor in the target. By making nearby points closer, the model learns well-clustered features. If "unknown" samples share characteristics with other "unknown" samples, then this clustering objective will help us extract discriminative features. This intuition is illustrated in Fig. 1. The important point is that we do not rely on strict distribution alignment with the source in order to extract discriminative target features. Instead, we propose to minimize the entropy of each target point's similarity distribution to other target samples and to prototypes. To minimize the entropy, the point will move closer to a nearby point (we assume a neighbor exists) or to a prototype. This approach is illustrated in Fig. 2.

Figure 2: Left: Similarity distribution calculation in neighborhood clustering (best viewed in color). We minimize the entropy of the similarity distribution between each point (as shown for f), the prototypes, and the other target samples. Since most target samples are absent from the mini-batch, we store their features in a memory bank, updating it with each batch. Right: An overview of the entropy separation loss (best viewed in color). We further decrease small entropy to move the sample toward a "known"-class prototype, and increase large entropy to move it farther away. Since distinguishing "known" vs. "unknown" samples near the boundary is hard, we introduce a confidence threshold that ignores such ambiguous samples.

Specifically, we calculate the similarity to all target samples and prototypes for each mini-batch of target features. Let V ∈ R^{N_t × d} denote a memory bank which stores all target features, and let F ∈ R^{(N_t + K) × d} denote the target features in the memory bank together with the prototype weight vectors, where d is the feature dimension in the last linear layer:

V = [V_1, V_2, ..., V_{N_t}],   (1)
F = [V_1, V_2, ..., V_{N_t}, w_1, w_2, ..., w_K],   (2)

where the V_i and w_k are L2-normalized. To account for target samples absent from the mini-batch, we employ a memory bank to store their features and use them to calculate the similarity, as done in [42]. In every iteration, V is updated with the mini-batch features. Let f_i denote the features in the mini-batch and B_t denote the set of target samples' indices in the mini-batch. For all i ∈ B_t, we set

V_i = f_i.   (3)

Therefore, the memory bank V contains both updated target features from the current mini-batch and the older target features absent from the mini-batch.
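The memory-bank update and the neighborhood-clustering entropy can be sketched as follows (a simplified NumPy re-implementation written under our own assumptions, not the authors' code; features and prototypes are assumed to be pre-normalized):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nc_loss(f_batch, batch_indices, memory_bank, prototypes, tau=0.05):
    """Entropy of each target feature's similarity distribution over all
    stored target features and the source prototypes, excluding itself."""
    memory_bank[batch_indices] = f_batch                # V_i = f_i for i in B_t
    F = np.concatenate([memory_bank, prototypes], 0)    # (N_t + K, d)
    sims = f_batch @ F.T / tau                          # (B, N_t + K)
    # exclude self-similarity (the j != i condition): -inf -> probability 0
    sims[np.arange(len(batch_indices)), batch_indices] = -np.inf
    p = softmax(sims)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
    return float(entropy.mean())
```

Minimizing this value sharpens each sample's similarity distribution onto its closest neighbor or prototype, which is the clustering behavior described above.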
Unlike [42], we update the memory so that it simply stores features, without considering the momentum of features from previous epochs. Let F_j denote the j-th item in F; then the probability that the feature f_i is a neighbor of the feature or prototype F_j is, for i ≠ j and i ∈ B_t,

p_{i,j} = exp(F_j^T f_i / τ) / Z_i,   (4)

where

Z_i = Σ_{j=1, j≠i}^{N_t+K} exp(F_j^T f_i / τ),   (5)

and the temperature parameter τ controls the concentration degree of the distribution [18]. Therefore, τ controls the number of neighbors for each sample, which implicitly affects the number of clusters. We provide an analysis of this parameter in the supplemental material. The entropy is then calculated as

L_nc = −(1 / |B_t|) Σ_{i∈B_t} Σ_{j=1, j≠i}^{N_t+K} p_{i,j} log(p_{i,j}).   (6)

We minimize the above loss to align each target sample with either a target neighbor or a prototype, whichever is closer.

The neighborhood clustering loss encourages the target samples to become well-clustered, but we still need to align some of them with "known" source categories while keeping the "unknown" target samples far from the source. In addition to domain-specific batch normalization (see Sec. 3.4), which can work as a form of weak domain alignment, we need an explicit objective to encourage alignment or rejection of target samples. As pointed out in [44], "unknown" target samples are likely to have a larger entropy of the source classifier's output than "known" target samples, because "unknown" target samples do not share common features with the "known" source classes.

Table 1: Summary of the universal comparisons. Each dataset (Office, OC, OH, VisDA) has multiple domains and adaptation scenarios, and we report the average accuracy over all scenarios. Our DANCE method substantially improves performance compared to the source-only model in all settings, and the average rank of DANCE is significantly higher than that of all other baselines.

Method | Closed DA (Office, OH, VD) | Partial DA (OC, OH, VD) | Open-set DA (Office, OH, VD) | Open-Partial DA (Office, OH, VD) | Avg Acc | Avg Rank
Source Only | 76.5 54.6 46.3 | 75.9 57.0 49.9 | 89.1 69.6 43.2 | 86.4 71.0 38.8 | 61.7 | 4.8

Inspired by this, we propose to draw a boundary between "known" and "unknown" points using the entropy of the classifier's output. We visually introduce the idea in Fig. 2. The distance between the entropy and the threshold boundary ρ is defined as |H(p) − ρ|, where p is the classification output for a target sample. By maximizing the distance, we can push H(p) far from ρ. We expect the entropy of "unknown" target samples to be larger than ρ, whereas for "known" ones it will be smaller. Tuning the parameter ρ for each adaptation setting would require a validation set. Instead, we define ρ = log(K)/2, where K is the number of source classes. Since log(K) is the maximum value of H(p), we assume ρ depends on it, and we confirm that this value empirically works well. We perform an analysis of ρ in the supplemental material. The above formulation assumes that "known" and "unknown" target samples can be separated by ρ. However, in many cases the threshold can be ambiguous and can change due to domain shift. Therefore, we introduce a confidence threshold parameter m, such that the final form of the loss is

L_es = (1 / |B_t|) Σ_{i∈B_t} L_es(p_i),   L_es(p_i) = −|H(p_i) − ρ| if |H(p_i) − ρ| > m, and 0 otherwise.   (7)

The confidence threshold m allows us to apply the separation loss only to confident samples. When |H(p_i) − ρ| is sufficiently large, the network is confident about a decision of "known" or "unknown". Thus, we train the network to push such a sample's entropy farther from the value ρ.
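Eq. (7) can be sketched as follows (a simplified NumPy version under our own assumptions, with ρ = log(K)/2 and the confidence margin m as in the text):

```python
import numpy as np

def entropy_separation_loss(probs, num_source_classes, m=0.5, eps=1e-12):
    """Push classifier entropy H(p) away from rho = log(K)/2, but only for
    samples whose distance |H(p) - rho| already exceeds the margin m."""
    rho = np.log(num_source_classes) / 2.0
    H = -(probs * np.log(probs + eps)).sum(axis=1)
    dist = np.abs(H - rho)
    per_sample = np.where(dist > m, -dist, 0.0)  # minimizing grows the distance
    return float(per_sample.mean())
```

Confident "known" samples (low entropy) and confident "unknown" samples (high entropy) both yield a negative contribution that is reduced by pushing them farther from ρ, while ambiguous samples near ρ contribute zero and are ignored.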
Domain-specific Batch Normalization. To enhance alignment between the source and target domains, we utilize domain-specific batch normalization [7, 23, 34]. The batch normalization layer whitens the feature activations, which contributes to a performance gain. As reported in [34], simply splitting source and target samples into different mini-batches and forwarding them separately helps alignment. This kind of weak alignment matches our goal, because strongly aligning the feature distributions can harm performance in non-closed-set domain adaptation.
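The effect of forwarding the domains separately can be illustrated as follows (a toy NumPy sketch; real implementations instead keep separate batch-norm statistics per domain inside the network):

```python
import numpy as np

def whiten(x, eps=1e-5):
    """Batch-normalize: zero mean, unit variance per feature dimension."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# Forwarding source and target in separate mini-batches means each domain is
# whitened with its own statistics, removing domain-specific offsets (style):
source = np.random.randn(64, 8) + 5.0   # simulated domain shift: mean offset
target = np.random.randn(64, 8) - 3.0
src_n, tgt_n = whiten(source), whiten(target)
```

After per-domain whitening, both batches share zero mean and unit variance per dimension, so the large first-order domain gap is removed without forcing the class-conditional structure to match.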
Final Objective. The final objective is

L = L_cls + λ (L_nc + L_es),   (8)

where L_cls denotes the cross-entropy loss on source samples. The losses on source and target are calculated in different mini-batches to achieve domain-specific batch normalization. To reduce the number of hyper-parameters, we use the same weighting hyper-parameter λ for L_nc and L_es.

Experiments

The goal of the experiments is to compare DANCE with the baselines across all sub-cases of Universal DA (i.e., CDA, PDA, ODA, and OPDA) on four object classification datasets, with four settings for each dataset. We follow the settings of [26] for closed (CDA), [2] for partial (PDA), [24] for open-set (ODA), and [44] for open-partial domain adaptation (OPDA) in our experiments.
Datasets. As the most prevalent benchmark, we use Office [33], which has three domains (Amazon (A), DSLR (D), Webcam (W)) and 31 classes. The second benchmark dataset
OfficeHome (OH) [41] contains four domains and 65 classes.

Table 2: Results on closed-set domain adaptation.

Universal comparison
Method | Office (31 / 0 / 0): A2W D2W W2D A2D D2A W2A Avg | Office-Home (65 / 0 / 0): A2C A2P A2R C2A C2P C2R P2A P2C P2R R2A R2C R2P Avg | VisDA (12 / 0 / 0)
SO | 74.1 95.3 99.0 80.1 54.0 56.3 76.5 | 37.0 62.2 70.7 46.6 55.1 60.3 46.1 32.0 68.7 61.8 39.2 75.4 54.6 | 46.3
DANN [12] | 86.7 97.2 99.8 86.1 100 89.4
SAFN [43] | 88.8 98.4 99.8 87.7 69.8 69.7 85.7 | 52.0 71.7 76.3 64.2 69.9 71.9 63.7 51.4 77.1 70.9 57.1 81.5 67.3 | NA
CDAN [26] | 93.1 98.2 100 89.8 70.1 68.0 86.6 | 49.0 69.3 74.5 54.4 66.0 68.4 55.6 48.3 75.9 68.4 55.4 80.5 63.8 | 70.0
MDD [46] | 94.5 98.4 100 93.5 74.6 72.2 88.9 | 54.9 73.7 77.8 60.0 71.4 71.8 61.2 53.6 78.1 72.5 60.2 82.3 68.1 | 74.6

Table 3: Results on partial domain adaptation including SAN [2] and IAFN [43].

Universal comparison
Method | Office-Caltech (10 / 21 / 0): A2C W2C D2C D2A W2A Avg | Office-Home (25 / 40 / 0): A2C A2P A2R C2A C2P C2R P2A P2C P2R R2A R2C R2P Avg | VisDA (6 / 6 / 0)
SO | 75.4 70.7 68.5 80.4 84.6 75.9 | 37.1 64.5 77.1 52.0 51.3 62.4 52.0 31.3 71.6 66.6 42.6 75.1 57.0 | 46.3
DANN [12] | 41.9 42.7 43.4 41.5 41.5 42.2 | 35.5 48.2 51.6 35.2 35.4 41.4 34.8 31.7 46.2 47.5 34.7 49.0 40.9 | 38.7
ETN [3]
SAN [2] | NA NA NA 87.2 91.9 NA | 44.4 68.7 74.6 67.5 65.0 77.8 59.8 44.7 80.1 72.2 50.2 78.7 65.3 | NA
ETN [3] | 89.5 92.6 93.5 95.9 92.3 92.7 | 59.2 77.0 79.5 62.9 65.7 75.0 68.3 55.4 84.4 75.7 57.7 84.5 70.5 | 66.0
IAFN [43] | NA NA NA NA NA NA | 58.9 76.3 81.4 70.4 73.0 77.8 72.4 55.3 80.4 75.8 60.4 79.9 71.8 | 67.7

The third dataset
VisDA (VD) [31] contains 12 classes from two domains: synthetic and real images. We provide an analysis of varying the number of classes using Caltech [14] and ImageNet [8], because these datasets contain a large number of classes. Let L_s denote the set of classes present in the source and L_t the set of classes present in the target. The class split in each setting (|L_s ∩ L_t| / |L_s − L_t| / |L_t − L_s|) is shown in each table. We follow the experimental settings of [26, 2, 24, 44] for each split (see suppl. material for details).

Evaluation. We use the same evaluation metrics as previous works. In CDA and PDA, we simply calculate the accuracy over all target samples. In ODA and OPDA, we average the accuracy over classes including "unknown". For example, an average over 11 classes is reported in the Office ODA setting. We run each experiment three times and report the average result. For all settings, we activate the "unknown"-sample rejection mechanism for all methods, since we assume that we do not know the kind of category shift.
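The two evaluation paths can be sketched as follows (hypothetical helper names; using ρ = log(K)/2 as the rejection threshold is our own assumption here, since the text only states that rejection uses the entropy of the classifier's output):

```python
import numpy as np

UNKNOWN = -1

def predict_with_rejection(probs, num_source_classes, eps=1e-12):
    """Argmax over known classes, but reject high-entropy samples as unknown."""
    rho = np.log(num_source_classes) / 2.0
    H = -(probs * np.log(probs + eps)).sum(axis=1)
    preds = probs.argmax(axis=1)
    return np.where(H > rho, UNKNOWN, preds)

def per_class_avg_accuracy(y_true, y_pred):
    """ODA/OPDA metric: accuracy averaged over classes, with "unknown"
    treated as one extra class (e.g. 10 known + 1 = 11 in Office ODA)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    return float(np.mean([(y_pred[y_true == c] == c).mean() for c in classes]))
```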
Implementation. All experiments are implemented in PyTorch [30]. We employ ResNet50 [17] pre-trained on ImageNet [8] as the feature extractor in all experiments. We remove the last linear layer of the network and add a new weight matrix to construct W. For the baselines, we use their implementations. Hyper-parameters for each method are tuned on the "Amazon to DSLR" OPDA setting. We set λ in Eq. 8 to 0.05 and m in Eq. 7 to 0.5 for our method. For all comparisons, we use the same hyper-parameters, batch size, learning rate, and checkpoint. Sensitivity to the hyper-parameters is analyzed in the supplementary material.

Comparisons. We show two kinds of comparisons to provide better empirical insight. The first is the universal comparison to the 5 baselines, including state-of-the-art methods on CDA, PDA, ODA, and OPDA. As we assume no prior knowledge of the category shift in the target domain, all methods use fixed hyper-parameters, tuned on the "Amazon to DSLR" OPDA setting. This means that even for CDA and PDA, all methods use "unknown"-example rejection during testing. Since methods for CDA and PDA are not designed with "unknown"-example rejection, we reject samples using the entropy of the classifier's output. The second is the comparison with methods tailored to each setting. In addition to the 5 baselines, we report published state-of-the-art results on each setting where available. Note that the universal results should not be directly compared with the methods tailored to each setting, as the latter are optimized for each setting with prior knowledge and do not perform "unknown"-example rejection in CDA and PDA.
Universal comparison baselines: Source-only (SO).
The model is trained with source examples without using target samples. By comparing to it, we can see how much gain adaptation provides.
Closed-set DA (CDA). Since this is the most popular setting of domain adaptation, we employ DANN [11], a standard approach of feature distribution matching between domains.

Table 4: Results on open-set domain adaptation.
Universal comparison
Method | Office (10 / 0 / 11): A2W D2W W2D A2D D2A W2A Avg | Office-Home (15 / 0 / 50): A2C A2P A2R C2A C2P C2R P2A P2C P2R R2A R2C R2P Avg | VisDA (6 / 0 / 6)
SO | 83.8 95.3 95.3 89.6 85.6 84.9 89.1 | 55.1 79.8 87.2 61.8 66.2 76.6 63.9 48.5 82.4 75.5 53.7 84.2 69.6 | 43.3
DANN [12] | 87.6 90.5 91.2 88.7 87.4 87.0 88.7 | 62.1 78.0 86.4 75.5 72.0 79.3 68.8 52.5 82.7 76.1 58.0 82.7 72.8 | 48.2
ETN [3] | 86.7 90.0 90.1 89.1 86.7 86.6 88.2 | 58.2 79.9 85.5 67.7 70.9 79.6 66.2 54.8 81.2 76.8 60.7 81.7 71.9 | 51.7
STA [24] | 91.7 94.4 94.8 90.9 87.3 80.6 89.9 | 56.6 74.7 86.5 65.7 69.7 77.3 63.4 47.8 81.0 73.6 57.1 78.8 69.3 | 51.7
UAN [44] | 88.0 95.8 94.8 88.1 89.9 89.4 91.0 | 63.3 82.4 86.3 75.3 76.2 82.0 69.4 58.2 83.4 76.1 60.5 81.9 74.6 | 50.0
DANCE

(a) Source only (b) DANN (c) DANCE w/o NC (d) DANCE
Figure 3: t-SNE [27] plots of target examples (best viewed in color). Black dots are target "known" examples, while the other colors are "unknown" examples; the colors indicate different classes. DANCE extracts discriminative features for "known" examples while keeping "unknown" examples far from "known" examples. Although we do not give supervision on "unknown" classes, some of them are clustered correctly.
Partial DA (PDA). ETN [3] is a state-of-the-art method for PDA. This method applies importance weighting to source samples via adversarial learning.
Open-set DA (ODA). STA [24] tries to align target "known" examples while rejecting "unknown" samples. This method assumes that there is a particular number of "unknown" samples and rejects them as "unknown".
Open-Partial DA (OPDA). UAN [44] incorporates the value of the entropy to reject "unknown" examples.

Results. As seen in Table 1, which summarizes the universal comparison, DANCE is the only method which improves performance compared to SO, a model trained without adaptation, in all settings. In addition, DANCE performs best on open-set and open-partial adaptation in all settings, and on the partial and closed domain adaptation settings for OfficeHome and VisDA. Our average performance is much better than that of the other baselines with respect to both accuracy and rank. Due to limited space, we put the OPDA results in the supplemental material.
CDA (Table 2). DANCE significantly improves performance compared to the source-only model (SO) and shows comparable performance to some of the baseline methods. Even compared to methods specialized for this setting, DANCE shows superior performance on OfficeHome. Some baselines show better performance on Office. However, such methods designed for CDA fail to adapt when there are "unknown" examples.
PDA (Table 3). DANCE significantly improves accuracy compared to SO and achieves performance comparable to ETN, one of the state-of-the-art methods for PDA. Although ETN in the universal comparison performs better than DANCE on Office, it does not perform well on ODA and OPDA. On VisDA and OfficeHome, DANCE outperforms all baselines.
ODA (Table 4). DANCE outperforms all the other baselines, including those tailored for ODA. STA and UAN, designed for ODA and OPDA respectively, achieve decent performance in those settings but perform poorly in some CDA and PDA settings. One reason is that these methods assume that there is a particular number of "unknown" examples in the target domain and reject them as "unknown".
Feature Visualization. Fig. 3 shows target feature visualizations with t-SNE [27]. We use the ODA setting of "DSLR to Amazon" on Office. The target "known" features (black dots) are well clustered with DANCE. In addition, most of the "unknown" features (the other colors) are kept far from the "known" features, and "unknown" examples of the same class are clustered together. Although we do not give supervision on the "unknown" classes, similar examples are clustered together. The visualization supports the clustering-performance results (see below).
Ablation by clustering "unknown" examples. Here, we evaluate how well the learned features can cluster samples from both "known" and "unknown" classes in the ODA setting. The goal of ODA is to classify samples from "unknown" classes into a single class, i.e., "unknown". Here, we instead evaluate the ability to cluster the "unknown" classes into their original classes. To do so, we train a new linear classifier on top of the fixed learned feature extractor. We use one labeled example per

Table 5: Linear classification accuracy given 1 labeled target sample per class in the open-set setting (Known Accuracy / Novel Accuracy). NC provides better-clustered features for both known and novel classes. Adding the entropy separation loss (DANCE) further improves performance.
Method (known / novel) | R2A | R2C | R2P | P2A | P2C | P2R
ImgNet | 37.5 / 31.0 | 35.3 / 36.4 | 64.8 / 56.9 | 36.9 / 31.0 | 36.3 / 36.0 | 66.3 / 45.5
SO | 42.4 / 30.7 | 43.4 / 33.8 | 69.9 / 53.8 | 38.6 / 30.1 | 37.0 / 32.2 | 65.1 / 39.1
DANN | 41.3 / 30.2 | 42.4 / 33.4 | 62.8 / 50.7 | 41.6 / 28.9 | 40.1 / 31.6 | 67.2 / 38.8
NC | 48.4 /
DANCE

(a) Learning curve (b) Accuracy on all (c) AUC of "unknown" (d) Accuracy on all
Figure 4: (a): Learning curve in the VisDA PDA setting. The accuracy improves as the two losses decrease. (b)(c): Increasing the number of "unknown" classes. (d): Increasing the number of source classes. DANCE outperforms the other methods even when there are many outlier source classes or many outlier target classes.

category for training. Then, we evaluate the classification accuracy for both "known" and "unknown" classes. Since the feature extractor is fixed, we evaluate its intrinsic ability to cluster the samples. In this experiment, we use OfficeHome (15 "known" and 50 "unknown" classes). As we can see in Table 5, DANCE maintains or improves performance on the "unknown" classes compared to the ImageNet model, while the baseline methods considerably degrade the ability to classify "unknown" classes. NC provides well-clustered features for both types of classes, and adding entropy separation further improves performance. The results also show the effectiveness of our method for class-incremental domain adaptation [22]. This result and the feature visualization indicate that the features learned by DANCE are better for clustering samples from "unknown" classes.
The number of “unknown” classes.
We analyze the behavior of DANCE as the number of "unknown" classes varies. In this analysis, we use open-set adaptation from Amazon in Office to Caltech, where there are 10 shared classes and many unshared classes. Openness is defined as 1 − |L_s ∩ L_t| / |L_s ∪ L_t|; L_s ∩ L_t corresponds to the 10 shared categories. We increase the number of "unknown" categories, i.e., |L_t − L_s|. Fig. 4b shows the accuracy over all classes, whereas Fig. 4c shows the area under the receiver operating characteristic curve for "unknown" classes. As we add more "unknown" classes, the performance of all methods decreases. However, DANCE consistently performs better than the other methods and is robust to the number of "unknown" classes.

The number of source private classes.
We analyze the behavior as the number of source-private classes varies in the OPDA setting. We vary the number of classes present only in the source (i.e., |L_s − L_t|). To conduct an extensive analysis, we use ImageNet-1K [8] as the source domain and Caltech-256 as the target domain. They share 84 classes. We use all of the unshared classes of Caltech as "unknown" target classes, while we increase the number of ImageNet classes (i.e., |L_s − L_t|). The result is shown in Fig. 4d. As we add more unshared source classes, performance degrades. DANCE consistently shows better performance. Since STA tries to classify almost all target examples as "unknown," its performance is significantly worse.

Conclusion

In this paper, we introduced Domain Adaptive Neighborhood Clustering via Entropy optimization (DANCE), which performs well on universal domain adaptation. We proposed two novel self-supervision-based components, neighborhood clustering and entropy separation, which can handle arbitrary category shift. DANCE is the only model which outperforms the source-only model in all settings, and it outperforms the state-of-the-art baselines in many settings. In addition, we showed that DANCE extracts discriminative feature representations for "unknown"-class examples without any supervision on the target domain.

Broader Impact
Our work is applicable to training deep neural networks with less supervision via knowledge transfer from auxiliary datasets. Modern deep networks outperform humans on many datasets given a lot of annotated data, such as in ImageNet. Our proposed method can help reduce the burden of collecting large-scale supervised data in many applications where large related datasets are available. The positive impact of our work is to reduce the data-gathering effort for data-expensive applications. This can make the technology more accessible to institutions and individuals that do not have rich resources. It can also help applications where data is protected by privacy laws and is therefore difficult to gather, or in sim2real applications where simulated data is easy to create but real data is difficult to collect. The negative impact could be to make these systems more accessible to companies, governments, or individuals that attempt to use them for criminal activities such as fraud. Furthermore, as with all current deep learning systems, ours is susceptible to adversarial attacks and lacks interpretability. Finally, while we show improved performance relative to the state of the art, negative transfer could still occur; therefore our approach should not be used in mission-critical applications or to make important decisions without human oversight.
Acknowledgments

This work was supported by Honda, DARPA, and NSF Award No. 1535797.
References

[1] Pau Panareda Busto and Juergen Gall. Open set domain adaptation. In IEEE International Conference on Computer Vision (ICCV), 2017.
[2] Zhangjie Cao, Lijia Ma, Mingsheng Long, and Jianmin Wang. Partial adversarial domain adaptation. In European Conference on Computer Vision (ECCV), 2018.
[3] Zhangjie Cao, Kaichao You, Mingsheng Long, Jianmin Wang, and Qiang Yang. Learning to transfer examples for partial domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[4] Fabio Maria Cariucci, Lorenzo Porzi, Barbara Caputo, Elisa Ricci, and Samuel Rota Bulò. Autodial: Automatic domain alignment layers. In IEEE International Conference on Computer Vision (ICCV), 2017.
[5] Fabio M Carlucci, Antonio D'Innocente, Silvia Bucci, Barbara Caputo, and Tatiana Tommasi. Domain generalization by solving jigsaw puzzles. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[6] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In European Conference on Computer Vision (ECCV), 2018.
[7] Woong-Gi Chang, Tackgeun You, Seonguk Seo, Suha Kwak, and Bohyung Han. Domain-specific batch normalization for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[9] Zhijie Deng, Yucen Luo, and Jun Zhu. Cluster alignment with a teacher for unsupervised domain adaptation. In IEEE International Conference on Computer Vision (ICCV), pages 9944–9953, 2019.
[10] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In IEEE International Conference on Computer Vision (ICCV), pages 1422–1430, 2015.
[11] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), 2014.
[12] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 17(59):1–35, 2016.
[13] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems (NeurIPS), 2005.
[14] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. California Institute of Technology, 2007.
[15] Philip Haeusser, Thomas Frerix, Alexander Mordvintsev, and Daniel Cremers. Associative domain adaptation. In IEEE International Conference on Computer Vision (ICCV), 2017.
[16] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), 2017.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[18] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[19] Jiabo Huang, Qi Dong, Shaogang Gong, and Xiatian Zhu. Unsupervised deep learning by neighbourhood discovery. In International Conference on Machine Learning (ICML), 2019.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2012.
[21] Jogendra Nath Kundu, Naveen Venkat, and R Venkatesh Babu. Universal source-free domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[22] Jogendra Nath Kundu, Rahul Mysore Venkatesh, Naveen Venkat, Ambareesh Revanur, and R Venkatesh Babu. Class-incremental domain adaptation. In European Conference on Computer Vision (ECCV), 2020.
[23] Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, and Xiaodi Hou. Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv:1603.04779, 2016.
[24] Hong Liu, Zhangjie Cao, Mingsheng Long, Jianmin Wang, and Qiang Yang. Separate to adapt: Open set domain adaptation via progressive separation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[25] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning (ICML), 2015.
[26] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[27] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 9(11):2579–2605, 2008.
[28] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision (ECCV). Springer, 2016.
[29] Pau Panareda Busto and Juergen Gall. Open set domain adaptation. In IEEE International Conference on Computer Vision (ICCV), 2017.
[30] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. OpenReview, 2017.
[31] Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. VisDA: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.
[32] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
[33] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In European Conference on Computer Vision (ECCV), 2010.
[34] Kuniaki Saito, Donghyun Kim, Stan Sclaroff, Trevor Darrell, and Kate Saenko. Semi-supervised domain adaptation via minimax entropy. In IEEE International Conference on Computer Vision (ICCV), 2019.
[35] Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. In International Conference on Machine Learning (ICML), 2017.
[36] Kuniaki Saito, Shohei Yamamoto, Yoshitaka Ushiku, and Tatsuya Harada. Open set domain adaptation by backpropagation. In European Conference on Computer Vision (ECCV), 2018.
[37] Ozan Sener, Hyun Oh Song, Ashutosh Saxena, and Silvio Savarese. Learning transferrable representations for unsupervised domain adaptation. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
[38] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv, 2014.
[39] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
[40] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv, 2014.
[41] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[42] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[43] Ruijia Xu, Guanbin Li, Jihan Yang, and Liang Lin. Larger norm more transferable: An adaptive feature norm approach for unsupervised domain adaptation. In IEEE International Conference on Computer Vision (ICCV), 2019.
[44] Kaichao You, Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I. Jordan. Universal domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[45] Jing Zhang, Zewei Ding, Wanqing Li, and Philip Ogunbona. Importance weighted adversarial nets for partial domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[46] Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael I Jordan. Bridging theory and algorithm for domain adaptation. In International Conference on Machine Learning (ICML), 2019.
[47] Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In IEEE International Conference on Computer Vision (ICCV), 2019.
[48] Yang Zou, Zhiding Yu, BVK Vijaya Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In European Conference on Computer Vision (ECCV), 2018.

A Dataset Details
In PDA, the 10 classes shared with Caltech-256 are used as shared classes (L_s ∩ L_t). The other 21 classes are used as source private classes (L_s − L_t). Since DSLR and Webcam do not have many examples, we conduct experiments on the D to A, W to A, A to C (Caltech), D to C, and W to C shifts. In OSDA, the same 10 classes are used as shared classes (L_s ∩ L_t) and the selected 11 classes are used as unknown classes (L_t − L_s); this setting is the same as [36]. In OPDA, the same 10 classes are used as shared classes (L_s ∩ L_t); then, in alphabetical order, the next 10 classes are used as source private classes (L_s − L_t), and the remaining 11 classes are used as unknown classes (L_t − L_s). The second benchmark dataset is OfficeHome (OH) [41], which contains four domains and 65 classes. In PDA, in alphabetical order, the first 25 classes are selected as shared classes (L_s ∩ L_t) and the remaining classes are source private classes (L_s − L_t). In OSDA, the first 15 classes are used as shared classes (L_s ∩ L_t) and the remaining classes are used as unknown classes (L_t − L_s). In OPDA, the first 10 classes are used as shared classes (L_s ∩ L_t), the next 5 classes are source private classes (L_s − L_t), and the rest are unknown classes (L_t − L_s). The third dataset is VisDA [31], which contains 12 classes from two domains, synthetic and real images. The synthetic domain consists of 152,397 synthetic 2D renderings of 3D objects, and the real domain consists of 55,388 real images. In PDA, the first 6 classes are used as shared classes (L_s ∩ L_t) and the rest are source private classes (L_s − L_t). In OSDA, we follow [36] and use the 6 classes as shared classes (L_s ∩ L_t) and the rest as unknown classes (L_t − L_s). In OPDA, the first 6 classes are shared classes (L_s ∩ L_t), the next 3 are source private classes (L_s − L_t), and the other 3 classes are unknown classes (L_t − L_s).
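The splits above all follow the same recipe: order the classes alphabetically, take the first block as shared classes, the next block as source-private classes, and the rest as target-private ("unknown") classes. A minimal sketch of this protocol (the class names are placeholders, not the actual dataset labels):

```python
def make_split(classes, num_shared, num_source_private):
    """Partition a class list, sorted alphabetically, into
    shared (Ls ∩ Lt), source-private (Ls − Lt), and
    target-private/"unknown" (Lt − Ls) sets."""
    ordered = sorted(classes)
    shared = ordered[:num_shared]
    source_private = ordered[num_shared:num_shared + num_source_private]
    target_private = ordered[num_shared + num_source_private:]
    return shared, source_private, target_private

# OfficeHome OPDA split: 65 classes -> 10 shared / 5 source-private / 50 unknown
classes = [f"class_{i:02d}" for i in range(65)]
shared, src_priv, unknown = make_split(classes, 10, 5)
assert (len(shared), len(src_priv), len(unknown)) == (10, 5, 50)
```

The source domain then trains on `shared + src_priv`, while target data is drawn from `shared + unknown`.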
We mainly perform experiments on these three datasets with four settings because this enables direct comparison with many state-of-the-art results. We provide an analysis of varying the number of classes using Caltech [14] and ImageNet [8] because these datasets contain a large number of classes.

B Implementation Details
We list the implementation details that were excluded from the main paper due to space limitations. We used a TITAN X (Pascal) GPU with 12 GB of memory. One GPU is used for each experiment, and each experiment takes about 2 hours.
DANCE (universal comparison).
The batch size is set to 36. The temperature parameter in Eq. 5 is set to 0.05, following [34]. We train the model for 10,000 iterations with Nesterov momentum SGD and report the performance at the end of training. The learning rate is decayed by a factor of (1 + γ·i)^(−p), where i denotes the iteration number; we set γ = 10. The learning rate of the pre-trained layers is multiplied by a smaller constant. We follow [34] for this scheduling method.

Baselines (universal comparison).
We use the following released code for ETN [3] (https://github.com/thuml/ETN), UAN [44] (https://github.com/thuml/Universal-Domain-Adaptation), and STA [24] (https://github.com/thuml/Separate_to_Adapt). We tune the hyper-parameters of these methods by validating performance on OPDA, Amazon to DSLR (Office). Since we could not see improvements by changing the hyper-parameters provided in their code, we kept them. For ETN, we use the hyper-parameters for Office-Home; for UAN and STA, we use the hyper-parameters for Office. We implement DANN ourselves and tune its hyper-parameters by performance on OPDA, Amazon to DSLR (Office). For all of these methods, we report the performance at the end of training for comparison. We observe that there is a gap in performance between the best checkpoint and the final checkpoint, which can explain the gap between the performance reported in their papers and the performance in our universal comparisons.
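For concreteness, the learning-rate annealing used to train DANCE above can be sketched as follows. Note that γ = 10 comes from the text, while the initial rate and exponent (0.01 and 0.75, the standard values for this DANN-style schedule) are assumed values, not confirmed by this copy of the paper:

```python
def lr_at(step, base_lr=0.01, gamma=10, power=0.75):
    """DANN-style annealing: lr_i = base_lr * (1 + gamma * i)^(-power).
    base_lr and power are assumed; the paper sets gamma = 10."""
    return base_lr * (1 + gamma * step) ** (-power)

# Pre-trained backbone layers use a smaller multiple of this rate.
for step in (0, 100, 10000):
    print(f"step {step:>5}: lr = {lr_at(step):.6f}")
```

The schedule decays quickly early on and flattens out, so the final 10,000-iteration checkpoint is trained at a nearly constant small rate.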
Baselines tailored for each category shift.
We run experiments for ETN on A2C, W2C, and D2C in the PDA setting, since these results are not available in their paper. For ETN, we report the performance obtained with the same hyper-parameters as in the universal comparison, but without "unknown" sample rejection. For the methods tailored to each setting, we show the results reported in their papers; "NA" indicates that a result is not available in their paper. We observe a gap between our universal comparison and the performance reported in each paper. For example, the performance of UAN in OPDA differs substantially between the universal comparison and the reported accuracy, although we use the same hyper-parameters. We could obtain performance similar to the reported numbers if we picked the best checkpoint for each setting, but we report the performance of fixed-iteration checkpoints for a fair comparison, which can explain the gap.

Table A: Results on open-partial domain adaptation (universal comparison). USFDA [21] performs open-partial domain adaptation without access to source samples when adapting a model to a target domain. The UAN [44] number in the lower row is taken from their paper. Columns: Office (10/10/11): A2W D2W W2D A2D D2A W2A Avg; Office-Home (10/5/50): A2C A2P A2R C2A C2P C2R P2A P2C P2R R2A R2C R2P Avg; VisDA (6/3/3).
SO:         75.7 95.4 95.2 83.4 84.1 84.8 | 86.4 | 50.4 79.4 90.8 64.9 66.1 79.9 71.6 48.5 87.6 77.8 52.1 82.8 | 71.0 | 38.8
DANN [12]:  87.6 90.5 91.2 88.7 87.4 87.0 | 88.7 | 59.9 80.6 89.8 77.5 73.3 86.4 78.5 61.5 88.5 80.3 62.1 82.4 | 76.7 | 50.6
ETN [3]:    89.1 90.6 90.9 86.3 86.4 86.5 | 88.3 | 58.2 78.5 89.1 77.2 69.3 87.5 77.0 56.0 88.2 77.5 58.4 83.0 | 75.0 | 66.6
STA [24]:   85.2 96.3 95.1 88.1 87.9 86.0 | 89.8 | 54.8 76.6 …
UAN [44]:   85.6 94.8 98.0 86.5 85.5 85.1 | 89.2 | 63.0 82.8 87.9 76.9 78.7 85.4 78.2 58.6 86.8 83.4 63.2 79.4 | 77.0 | 60.8
USFDA [21]: 85.6 95.2 97.8 88.5 87.5 86.6 | 90.2 | 63.4 83.3 89.4 71.0 72.3 86.1 78.5 60.2 87.4 81.6 63.2 88.2 | 77.0 | 63.9
DANCE:      …

Table B: Evaluation with two metrics on open-set and open-partial domain adaptation. OS is the average over all classes, including "unknown"; OS* is the average over known classes only.
Open set (OS/OS* per shift: A to W, D to W, W to D, A to D, D to A, W to A):
SO:    83.8/87.7  95.3/99.0  95.3/100.0  89.6/93.6  85.6/86.3  84.9/88.2
DANN:  87.6/95.7  90.5/99.3  91.2/100.0  88.7/96.9  87.4/…     …
Open partial (OS/OS* per shift: A to W, D to W, W to D, A to D, D to A, W to A):
SO:    75.7/79.2  95.4/98.1  95.2/100.0  83.4/88.3  84.1/84.9  84.8/85.7
DANN:  83.0/90.0  89.3/97.1  89.5/96.9   81.9/88.8  80.2/86.8  78.2/77.8
ETN:   89.1/…     …
STA:   85.2/87.8  96.3/99.1  95.1/100.0  88.1/90.6  87.9/88.7  86.0/87.1
UAN:   78.8/86.7  83.5/91.9  84.9/93.4   77.5/85.3  75.7/83.3  75.6/83.1
DANCE: …

Table C: Standard deviations of DANCE in experiments on Office and VisDA, computed over three runs. DANCE shows decent deviations.
Setting (columns A2W, D2W, W2D, A2D, D2A, W2A, Avg, VisDA):
CDA:  88.6 ± … …
ODA:  93.6 ± … …
OPDA: 92.8 ± … …
Setting (columns A2C, W2C, D2C, D2A, W2A, Avg, VisDA):
PDA:  88.8 ± … …

C Supplemental Results
Detailed results of ODA and OPDA.
Tables A and B show the detailed results of ODA and OPDA. OS* is the accuracy averaged over the known classes, while OS is the accuracy averaged over all classes, including the unknown class. DANCE shows good performance on both metrics. ETN shows better results than DANCE on OS* in several scenarios; however, in the ETN results, OS* is much better than OS, which means that ETN is not good at recognizing unknown samples as unknown. This is clearly shown in Fig. 4 (c) in our main paper.
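The two metrics can be computed from per-class accuracies as follows; this is a minimal sketch with illustrative numbers, not the evaluation code used in the paper:

```python
def os_scores(known_acc, unknown_acc):
    """OS*: mean accuracy over the known classes only.
    OS: mean over the known classes plus the single "unknown" class."""
    os_star = sum(known_acc) / len(known_acc)
    os_all = (sum(known_acc) + unknown_acc) / (len(known_acc) + 1)
    return os_all, os_star

# A method that classifies known classes well but rejects unknowns poorly
# shows OS* well above OS, as observed for ETN.
os_all, os_star = os_scores(known_acc=[90.0, 80.0, 100.0], unknown_acc=10.0)
print(os_all, os_star)  # 70.0 90.0
```

Comparing OS against OS* therefore separates known-class accuracy from the quality of "unknown" rejection.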
Comparison with Jigsaw [5].
Table D shows the comparison with jigsaw-puzzle-based self-supervised learning [5, 28]. To isolate the self-supervised learning part of DANCE, we replaced the neighborhood clustering loss with the jigsaw puzzle loss on the target domain; the jigsaw puzzle loss is calculated on target samples. We can see that DANCE performed better in almost all settings, which confirms the effectiveness of clustering-based self-supervision for this task.

Table D: Comparison between jigsaw [5, 28] and DANCE on the Office dataset. For a fair comparison, we replace the similarity-entropy (neighborhood clustering) loss with the jigsaw puzzle loss. Columns: A2W, D2W, W2D, A2D, D2A, W2A, Avg.
Closed  Jigsaw: 87.7 …
Open    Jigsaw: 89.4 95.5 93.6 93.8 90.3 89.3 92.0
        DANCE:  …

Figure A: (a) Varying the value of λ in Eq. 9. (b) Varying the value of the margin m in Eq. 7. (c) Varying the value of ρ in Eq. 7, which is determined based on the number of known classes.

Results with standard deviations.
Table C shows the results of DANCE with standard deviations, calculated over three runs. In the main paper, we show only the accuracy averaged over three runs due to space limitations. We can observe that DANCE shows decent standard deviations.
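Reporting accuracy over three runs as mean ± sample standard deviation can be sketched as follows (the run accuracies below are placeholders, not the paper's numbers):

```python
import statistics

# Three runs of the same setting; report mean +/- sample standard deviation.
runs = [88.4, 88.6, 88.8]
mean = statistics.mean(runs)
std = statistics.stdev(runs)  # sample (n-1) standard deviation
print(f"{mean:.1f} ± {std:.1f}")  # 88.6 ± 0.2
```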
Sensitivity to hyper-parameters.
In Fig. A, we show the sensitivity to hyper-parameters in the OPDA setting of Amazon to DSLR, which we used to tune the hyper-parameters. Although ρ