Exploring Category-Agnostic Clusters for Open-Set Domain Adaptation
Yingwei Pan†, Ting Yao†, Yehao Li†, Chong-Wah Ngo‡, and Tao Mei†
†JD AI Research, Beijing, China   ‡City University of Hong Kong, Kowloon, Hong Kong
{panyw.ustc, tingyao.ustc, yehaoli.sysu}@gmail.com, [email protected], [email protected]

Abstract
Unsupervised domain adaptation has received significant attention in recent years. Most existing works tackle the closed-set scenario, assuming that the source and target domains share exactly the same categories. In practice, nevertheless, a target domain often contains samples of classes unseen in the source domain (i.e., unknown class). The extension of domain adaptation from the closed-set to such an open-set situation is not trivial, since the target samples in the unknown class are not expected to align with the source. In this paper, we address this problem by augmenting the state-of-the-art domain adaptation technique, Self-Ensembling, with category-agnostic clusters in the target domain. Specifically, we present Self-Ensembling with Category-agnostic Clusters (SE-CC), a novel architecture that steers domain adaptation with the additional guidance of category-agnostic clusters that are specific to the target domain. This clustering information provides domain-specific visual cues, facilitating the generalization of Self-Ensembling to both closed-set and open-set scenarios. Technically, clustering is first performed over all the unlabeled target samples to obtain the category-agnostic clusters, which reveal the underlying data-space structure peculiar to the target domain. A clustering branch is capitalized on to ensure that the learnt representation preserves such underlying structure, by matching the estimated assignment distribution over clusters to the inherent cluster distribution for each target sample. Furthermore, SE-CC enhances the learnt representation with mutual information maximization. Extensive experiments are conducted on the Office and VisDA datasets for both open-set and closed-set domain adaptation, and superior results are reported when comparing to state-of-the-art approaches.
1. Introduction
Convolutional Neural Networks (CNNs) have driven vision technologies to reach new state-of-the-arts. The achievements, nevertheless, rest on the assumption that large quantities of annotated data are accessible for model training. The assumption becomes impractical when cost-expensive and labor-intensive manual labeling is required. An alternative is to recycle off-the-shelf learnt knowledge/models in a source domain for new domain(s). Unfortunately, the performance often drops significantly on a new domain, a phenomenon known as "domain shift." One feasible way to alleviate this problem is to capitalize on unsupervised domain adaptation [3, 6, 17, 21, 35, 37], which leverages labeled source samples and unlabeled target samples to generalize a target model. One of the most critical limitations is that most existing models simply align data distributions between source and target domains. As a consequence, these models are only applicable in the closed-set scenario (Figure 1(a)), under the unrealistic assumption that both domains share exactly the same set of categories. This adversely hinders the generalization of these models in the open-set scenario, i.e., distinguishing target samples of the unknown class (unseen in the source domain) from target samples of known classes (seen in the source domain).

Figure 1. A comparison between (a) closed-set domain adaptation, (b) existing methods for open-set domain adaptation, and (c) our open-set domain adaptation with category-agnostic clusters.

The difficulty of open-set domain adaptation mainly originates from two aspects: 1) how to distinguish the unknown target samples from known ones while classifying the known target samples correctly? 2) how to learn a hybrid network for both closed-set and open-set domain adaptation? One straightforward way (Figure 1(b)) to alleviate the first issue is to employ an additional binary classifier that assigns a known/unknown label to each target sample [22]. All the unknown target samples are further taken as outliers and discarded during the adaptation from source to target.
As the unknown target samples are holistically grouped as one generic class, the inherent data structure is not fully exploited. In the case when the distribution of these target samples is diverse, or the semantic labels between known and unknown classes are ambiguous, the performance of binary classification is suboptimal. Instead, we novelly perform clustering over all unlabeled target samples to explicitly model the diverse semantics of both known and unknown classes in the target domain, as depicted in Figure 1(c). All target samples are first decomposed into clusters, and the learnt clusters, though category-agnostic, convey the discriminative knowledge of unknown and known classes specific to the target domain. As such, by further steering domain adaptation with category-agnostic clusters, the learnt representations are expected to be domain-invariant for known classes, and discriminative between unknown and known classes in the target domain. To address the second issue, we remould Self-Ensembling [5] with an additional clustering branch that estimates the assignment distribution over all clusters for each target sample, which in turn refines the learnt representations to preserve the inherent structure of the target domain.

To this end, we present a new Self-Ensembling with Category-agnostic Clusters (SE-CC), as shown in Figure 2. Specifically, clustering is first implemented to decompose all the target samples into a set of category-agnostic clusters. The underlying structure of each target sample is thus formulated as its inherent cluster distribution over all clusters, which is initially obtained by applying a softmax over the cosine similarities between this sample and each cluster centroid. With this, an additional clustering branch is integrated into the student model of Self-Ensembling to predict the cluster assignment distribution of each target sample.
For each target sample, the KL-divergence is exploited to model the mismatch between its estimated cluster assignment distribution and the inherent cluster distribution. By minimizing the KL-divergence, the learnt feature is enforced to preserve the underlying data structure in the target domain. Moreover, we uniquely maximize the mutual information among the input intermediate feature map, the output classification distribution, and the cluster assignment distribution of each target sample in the student, to further enhance the learnt feature representation. The whole SE-CC framework is jointly optimized.
2. Related Work
Unsupervised Domain Adaptation.
One common solution for unsupervised domain adaptation in the closed-set scenario is to learn transferrable features in CNNs by minimizing domain discrepancy through Maximum Mean Discrepancy (MMD) [8]. [34] is one of the early works that integrates MMD into CNNs to learn domain-invariant representations. [17] additionally incorporates a residual transfer module into the MMD-based adaptation of classifiers. Inspired by [7], another direction of unsupervised domain adaptation is to encourage domain confusion across different domains via a domain discriminator [4, 6, 33], which is devised to predict the domain (source/target) of each input sample. In particular, a domain confusion loss [33] in the domain discriminator is devised to enforce the learnt representation to be domain-invariant. [6] formulates domain confusion as a task of binary classification and utilizes a gradient reversal algorithm to optimize the domain discriminator.
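As a concrete illustration of the discrepancy measure used by such MMD-based methods, a minimal (biased) RBF-kernel MMD² estimate between two sample sets can be sketched as follows; the single-kernel form and the bandwidth value are illustrative assumptions (the cited works use multi-layer or multi-kernel variants):

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Biased estimate of MMD^2 between samples X (n, d) and Y (m, d)
    under an RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def k(A, B):
        # Pairwise squared distances, then the RBF kernel matrix.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    # MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)].
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```

Minimizing such a statistic over network features pulls the source and target feature distributions together; it vanishes when the two sample sets coincide.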
Open-Set Domain Adaptation.
The task of open-set domain adaptation goes beyond traditional domain adaptation to tackle a realistic open-set scenario, in which the target domain includes numerous samples from completely new and unknown classes not present in the source domain. [22] is one of the early attempts to tackle this realistic open-set scenario. Busto et al. additionally exploit the assignments of target samples to known/unknown classes when learning the mapping of known classes from the source to the target domain. Later on, [29] utilizes adversarial training to learn feature representations that can separate the target samples of the unknown class from the known target samples. Furthermore, [2] factorizes the source and target data into shared and private subspaces. The shared subspace models the target and source samples from known classes, while the target samples from the unknown class are modeled with a private subspace tailored to the target domain.
Summary.
In summary, similar in spirit to previous methods [2, 22], SE-CC utilizes unlabeled target samples for learning task-specific classifiers in the open-set scenario. Different from these approaches, SE-CC leverages category-agnostic clusters for representation learning. The learnt feature is driven to preserve the target data structure during domain adaptation. The structure preservation enables effective alignment of sample distributions within known and unknown classes, and discrimination of samples between known and unknown classes. As a by-product, the preservation, which is represented as a cluster probability distribution, is exploited to further enhance representation learning. This is achieved by maximizing the mutual information among the input feature and its cluster and class probability distributions. To the best of our knowledge, there is no study yet that fully explores the advantages of category-agnostic clusters for open-set domain adaptation.
3. Our Approach: SE-CC
In this paper, we remold Self-Ensembling to suit both closed-set and open-set scenarios by integrating category-agnostic clusters into the domain adaptation procedure. An overview of our Self-Ensembling with Category-agnostic Clusters (SE-CC) model is depicted in Figure 2.
Figure 2. An overview of our SE-CC. Each labeled source image is fed into the student model to train the classifier with cross entropy. Each unlabeled target image x_t is transformed into two perturbed samples, i.e., x_t^S and x_t^T, before being injected into the student and teacher models separately. Conditional entropy is applied to x_t^S in the student pathway, and a self-ensembling loss is adopted to align the classification predictions between teacher and student. To further exploit the underlying data structure of the target domain, we perform clustering to decompose the whole set of unlabeled target samples into category-agnostic clusters (top right), which are incorporated into Self-Ensembling to facilitate both closed-set and open-set scenarios. Specifically, an additional clustering branch is integrated into the student to infer the assignment distribution over all clusters for each target sample x_t^S. By aligning the estimated cluster assignment distribution to the inherent cluster distribution learnt from the original clusters via minimizing their KL-divergence, the feature representation is enforced to preserve the underlying data structure of the target domain. Furthermore, the feature representation of the student is enhanced by maximizing the mutual information among its feature map, classification, and cluster assignment distributions (bottom right). The maximization is conducted at both global and local levels, as detailed in Figure 3.

In open-set domain adaptation, we are given the labeled samples X_s = {(x_s, y_s)} in the source domain and the unlabeled samples X_t = {x_t} in the target domain belonging to N classes, where y_s is the class label of sample x_s. The set of N classes is denoted as C, which consists of N - 1 known classes shared between the two domains and an additional unknown class that aggregates all samples of unlabeled classes.
The goal of open-set domain adaptation is to learn domain-invariant representations and classifiers that recognize the N - 1 known classes in the target domain and meanwhile distinguish the unknown target samples from known ones.

We first briefly recall the method of Self-Ensembling [5]. Self-Ensembling mainly builds upon the Mean Teacher [32] for semi-supervised learning, which consists of a student model and a teacher model with the same network architecture. The main idea behind Self-Ensembling is to encourage consistent classification predictions between teacher and student under small perturbations of the input image. In other words, despite the different augmentations imposed on a target sample, both the teacher and student models should predict similar classification probability distributions over all classes. Specifically, given two perturbed target samples x_t^S and x_t^T augmented from an unlabeled sample x_t, the self-ensembling loss penalizes the difference between the classification predictions of student and teacher:

    L_{SE}(x_t) = \| P^S_{cls}(x_t^S) - P^T_{cls}(x_t^T) \|^2,    (1)

where P^S_{cls}(x_t^S) \in R^N and P^T_{cls}(x_t^T) \in R^N denote the predicted classification distributions over the N classes via the classification branch in student and teacher, respectively. During training, the student is trained using gradient descent, while the weights of the teacher are directly updated as the exponential moving average of the student weights.
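A minimal sketch of the consistency loss of Eq. (1) and the teacher update, assuming a squared-difference loss and an EMA momentum of 0.99 (the momentum value is an illustrative assumption, not taken from the paper):

```python
import numpy as np

def self_ensembling_loss(p_student, p_teacher):
    # Squared difference between student and teacher class predictions (Eq. 1).
    return float(np.sum((p_student - p_teacher) ** 2))

def ema_update(teacher_weights, student_weights, momentum=0.99):
    # Teacher weights track an exponential moving average of the student weights;
    # only the student receives gradients.
    return {name: momentum * teacher_weights[name]
                  + (1.0 - momentum) * student_weights[name]
            for name in teacher_weights}
```

After every gradient step on the student, `ema_update` is applied once to refresh the teacher, which then supplies the targets for the next consistency loss.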
Inspired by [31], we additionally adopt the unsupervised conditional entropy loss to train the classification branch in the student, aiming to drive the decision boundaries of the classifier far away from high-density regions in the target domain. Therefore, the overall training loss of Self-Ensembling is composed of the supervised cross entropy loss (L_{CSE}) on source data, and the self-ensembling loss (L_{SE}) and conditional entropy loss (L_{CDE}) on unlabeled target data:

    L_{SEC} = \sum_{(x_s, y_s) \in S} L_{CSE}(x_s, y_s) + \sum_{x_t \in T} ( L_{SE}(x_t) + L_{CDE}(x_t) ).    (2)

Open-set is more difficult than closed-set domain adaptation because it is required to classify not only inliers but also outliers into N - 1 known classes and one unknown class. The most typical way is to learn a binary classifier that recognizes each target sample as of known/unknown class. Nevertheless, such a recipe oversimplifies the problem by assuming that all unknown samples belong to one class, while leaving the inherent data distribution among them unexploited. The robustness of this approach is questionable when the unknown samples span multiple unknown classes and may not be properly grouped as one generic class. To alleviate this issue, we perform clustering to explicitly model the diverse semantics in the target domain as distilled category-agnostic clusters, which are further integrated into Self-Ensembling to guide domain adaptation. Specifically, we design an additional clustering branch in the student of Self-Ensembling to align its estimated cluster assignment distribution with the inherent cluster distribution among category-agnostic clusters. Hence, the learnt feature representations are enforced to be domain-invariant for known classes and meanwhile more discriminative between unknown and known classes in the target domain.

Category-agnostic Clusters.
Clustering is an essential data analysis technique for grouping unlabeled data in unsupervised machine learning [11]. Here we utilize k-means [19], the most popular clustering method, to decompose all unlabeled target samples X_t into a set of K clusters {C_k}_{k=1}^K, where C_k represents the set of target samples in the k-th cluster. The obtained clusters {C_k}_{k=1}^K, though category-agnostic, are still able to reveal the underlying structure tailored to the target domain, where target samples with similar semantics stay closer with local discrimination. In our implementation, we directly represent each target sample x_t by the output feature (\tilde{x}_t) of CNNs pre-trained on ImageNet [26] for clustering. We also tried to refresh the clusters according to the learnt features periodically (e.g., every 5 training epochs), but that did not make a major difference.

We encode the underlying structure of each target sample x_t as the joint relations between this sample and all category-agnostic clusters, i.e., the inherent cluster distribution over all clusters. Specifically, for each target sample x_t, we measure its inherent cluster distribution \tilde{P}_{clu}(x_t) \in R^K through a softmax over the cosine similarities between this sample and each cluster centroid. The k-th element is derived from the cosine similarity between x_t and the centroid \mu_k of the k-th cluster:

    \tilde{P}^k_{clu}(x_t) = e^{\rho \cos(\tilde{x}_t, \mu_k)} / \sum_{k'} e^{\rho \cos(\tilde{x}_t, \mu_{k'})},    \mu_k = (1 / |C_k|) \sum_{x_t \in C_k} \tilde{x}_t,    (3)

where \cos(\cdot) is the cosine similarity function and \rho is the temperature parameter of the softmax for scaling. The centroid \mu_k of each cluster is defined as the average of all samples belonging to that cluster.

Clustering Branch.
An additional branch in the student, named the clustering branch, is especially designed to predict the distribution over all category-agnostic clusters for the cluster assignment of each target sample x_t^S. Concretely, we denote the feature of target sample x_t^S along the student pathway as \mathbf{x}_t^S \in R^M. Hence, depending on the input feature \mathbf{x}_t^S, the clustering branch infers its cluster assignment distribution P_{clu}(x_t^S) \in R^K over all K clusters via a modified softmax layer [15]:

    P^k_{clu}(x_t^S) = e^{\rho \cos(\mathbf{x}_t^S, W_k)} / \sum_{k'} e^{\rho \cos(\mathbf{x}_t^S, W_{k'})},    (4)

where P^k_{clu}(x_t^S) is the k-th element of P_{clu}, representing the probability of assigning target sample x_t^S to the k-th cluster, and W_k is the k-th row of the parameter matrix W \in R^{K \times M} of the modified softmax layer, i.e., the cluster assignment parameter vector for the k-th cluster.

KL-divergence Loss.
The clustering branch is trained with the supervision of the inherent cluster distribution of each target sample. To measure the mismatch between the estimated cluster assignment distribution and the inherent cluster distribution, a KL-divergence loss is defined as

    L_{KL} = \sum_{x_t \in T} KL( \tilde{P}_{clu}(x_t) || P_{clu}(x_t^S) ) = \sum_{x_t \in T} \sum_k \tilde{P}^k_{clu}(x_t) \log( \tilde{P}^k_{clu}(x_t) / P^k_{clu}(x_t^S) ).    (5)

By minimizing the KL-divergence loss, the learnt representation is enforced to preserve the underlying data structure of the target domain, pursuing to be more discriminative for both unknown and known classes. Moreover, we incorporate the inter-cluster relationship into the KL-divergence loss as a constraint to preserve the inherent relations among the cluster assignment parameters. The spirit behind this follows the philosophy that the cluster assignment parameter vectors of two semantically similar clusters should be similar. Hence, the KL-divergence loss with the constraint of inter-cluster relationships is formulated as

    L_{KL} = \sum_{x_t \in T} KL( \tilde{P}_{clu}(x_t) || P_{clu}(x_t^S) )    s.t.  \cos(W_k, W_{k'}) = \cos(\mu_k, \mu_{k'}),  1 \le k, k' \le K.    (6)

The KL-divergence loss in Eq. (6) is further relaxed as:

    L_{KL} = \sum_{x_t \in T} KL( \tilde{P}_{clu}(x_t) || P_{clu}(x_t^S) ) + \sum_{1 \le k, k' \le K} | \cos(W_k, W_{k'}) - \cos(\mu_k, \mu_{k'}) |.    (7)

Given the input feature of a target sample, the student in our SE-CC produces both classification and cluster assignment distributions via the two parallel branches in a multi-task paradigm. To further strengthen the learnt target feature in an unsupervised manner, we leverage Mutual Information Maximization (MIM) [10] in the student to maximize the mutual information among the input feature and the two output distributions. The rationale behind this follows the philosophy that the global/local mutual information between input features and output high-level representations can be used to tune the features' suitability for downstream tasks. As a result, we design a MIM module in the student to simultaneously estimate and maximize the local and global mutual information among the input feature map, the output classification distribution, and the cluster assignment distribution.
Global Mutual Information.
Technically, let \mathbf{x}_t^S \in R^{H \times H \times D} be the output feature map of the last convolutional layer in the student model for the input target sample x_t^S (H: the size of height and width; D: the number of channels). We encode this feature map into a global feature vector G(x_t^S) \in R^D via a convolutional layer (kernel size: 1 \times 1; stride: 1; number of filters: D) plus an average pooling layer. Next, we concatenate the global feature vector G(x_t^S) with the conditioning classification distribution P^S_{cls}(x_t^S) and cluster assignment distribution P_{clu}(x_t^S). The concatenated feature is fed into the global mutual information discriminator, which discriminates whether the input global feature vector is aligned with the given classification and cluster assignment distributions. Here the global mutual information discriminator is implemented as three stacked fully-connected layers with nonlinear activations. The final output score of the global mutual information discriminator is V_g([G(x_t^S), P^S_{cls}(x_t^S), P_{clu}(x_t^S)]), which represents the probability of discriminating a real input feature with matched classification and cluster assignment distributions. As such, the global mutual information is estimated via the Jensen-Shannon MI estimator [20]:

    L_{JSD_g} = \sum_{x_t \in T} -\varphi( -V_g([G(x_t^S), P^S_{cls}(x_t^S), P_{clu}(x_t^S)]) ) - \sum_{\hat{x}_t \in T, \hat{x}_t \ne x_t} \varphi( V_g([G(\hat{x}_t^S), P^S_{cls}(x_t^S), P_{clu}(x_t^S)]) ),    (8)

where \varphi(\cdot) is the softplus function and G(\hat{x}_t^S) denotes the global feature of a different target image \hat{x}_t^S.

Local Mutual Information.
In addition, we exploit the local mutual information among the local input features at every spatial location and the output classification and cluster assignment distributions. In particular, we spatially replicate the two distributions P^S_{cls}(x_t^S) and P_{clu}(x_t^S) to construct H \times H \times N and H \times H \times K feature maps, respectively, and then concatenate them with the input feature map \mathbf{x}_t^S along the channel dimension. The concatenated feature map L(x_t^S, P^S_{cls}(x_t^S), P_{clu}(x_t^S)) \in R^{H \times H \times (D + N + K)} is fed into the local mutual information discriminator, which discriminates whether each input local feature is matched with the given classification and cluster assignment distributions. The local mutual information discriminator is constructed with three stacked convolutional layers (kernel size: 1 \times 1) with nonlinear activations. Hence the final output score map of the local mutual information discriminator is V_l(L(x_t^S, P^S_{cls}(x_t^S), P_{clu}(x_t^S))) \in R^{H \times H}. The i-th element V^i_l(L(x_t^S, P^S_{cls}(x_t^S), P_{clu}(x_t^S))) of the score map denotes the probability of discriminating a real input local feature at the i-th spatial location with matched classification and cluster assignment distributions. As such, the local mutual information is estimated as:

    L_{JSD_l} = \sum_{x_t \in T} -\sum_{i=1}^{H \cdot H} \varphi( -V^i_l(L(x_t^S, P^S_{cls}(x_t^S), P_{clu}(x_t^S))) ) - \sum_{\hat{x}_t \in T, \hat{x}_t \ne x_t} \sum_{i=1}^{H \cdot H} \varphi( V^i_l(L(\hat{x}_t^S, P^S_{cls}(x_t^S), P_{clu}(x_t^S))) ).    (9)

Accordingly, the final objective for the MIM module is measured as the combination of the local and global mutual information estimations, balanced with a tradeoff parameter \alpha:

    L_{MIM} = \alpha L_{JSD_g} + L_{JSD_l}.    (10)

Figure 3. Framework of (a) global mutual information estimation and (b) local mutual information estimation in our SE-CC.

Figure 3 conceptually depicts the process of both local and global mutual information estimation.
The overall training objective of our SE-CC integrates the cross entropy loss on source data, the unsupervised self-ensembling loss and conditional entropy loss of Eq. (2), the KL-divergence loss of the clustering branch in Eq. (7), and the mutual information estimation in Eq. (10) on target data:

    L = L_{SEC} + L_{KL} - \beta L_{MIM},    (11)

where \beta is a tradeoff parameter.

Table 1. Performance comparison with the state of the arts on Office for open-set domain adaptation. ♦ indicates a different open-set setting without unknown source examples. Each cell reports OS/OS*.

Method        | A→D        | A→W        | D→A        | D→W        | W→A        | W→D        | Avg
Source-only   | 67.1/67.0  | 64.6/63.8  | 61.9/60.7  | 90.6/92.3  | 60.2/59.7  | 96.7/98.7  | 73.5/73.7
RTN [17]      | 76.6/74.7  | 73.0/70.8  | 57.2/53.8  | 89.0/88.1  | 62.4/60.2  | 98.8/98.3  | 76.2/74.3
RevGrad [6]   | 78.3/77.3  | 75.9/73.8  | 57.6/54.1  | 89.8/88.9  | 64.0/61.8  | 98.7/98.0  | 77.4/75.7
AODA ♦ [29]   | 76.6/76.4  | 74.9/74.3  | 62.5/62.3  | 94.4/94.6  | 81.4/81.2  | 96.8/96.9  | 81.1/80.9
ATI-λ [22]    | 79.8/79.2  | 77.6/76.5  | 71.3/70.0  | 93.5/93.2  | 76.7/76.5  | 98.3/99.2  | 82.9/82.4
FRODA [2]     | - 78.7 - 76.5 - - 73.7 - 94.6 - 84.9 -
SE-CC ♦       |
SE-CC         |

Table 2. Performance comparison with the state of the arts on VisDA for open-set adaptation (Known-to-Unknown Ratio = 1:10). ♦ indicates a different open-set setting without unknown source examples. † indicates the results are referred from the official leaderboard [1].

Method         | aero bike bus  car  horse knife mbike person plant skbrd train truck unk  | Knwn Mean Overall
Source-only    | 53.8 54.2 50.3 48.7 72.7  5.3   82.0  27.0   49.6  43.4  78.0  5.1  44.2 | 46.9 47.3 44.8
RevGrad [6]    | 33.0 57.3 44.1 33.9 72.1  46.9  82.2  26.8   36.8  50.4  89.4  9.8  47.8 | 48.6 48.5 47.8
RTN [17]       | 49.2 72.6 66.5 39.5 80.8  18.8  73.8  56.8   47.4  45.2  74.0  4.5  48.7 | 52.4 52.1 49.0
SE † [5]       |
AODA ♦† [29]   | 80.2 63.1 59.1 63.1 83.2  12.1  89.1  5.0    61.0  14.0  79.2  0.0  69.0 | 50.8 52.2 67.6
ATI-λ [22]     | 85.7 74.9 60.3 49.9 80.0  19.3  88.8  40.8   54.0
SE-CC ♦        |
SE-CC          |
4. Experiments
We empirically verify the merit of our SE-CC by conducting experiments on the Office [27] and VisDA [23] datasets for both open-set and closed-set domain adaptation.
Office is the standard benchmark for domain adaptation, containing 4,110 images from 31 categories collected from three domains: Amazon (A), DSLR (D), and Webcam (W). Six transfer directions among them are evaluated for both open-set and closed-set adaptation. For open-set adaptation, as in [22], we first take 10 classes as the known classes shared between source and target domains. In alphabetical order, the classes with labels 11-20 are taken as the unknown classes in the source, and the ones with labels 21-31 as the unknown classes in the target. Two metrics, OS and OS*, are adopted for evaluation (OS: the accuracy on all known & unknown target samples; OS*: the accuracy on the target samples of the 10 known classes). We adopt AlexNet [13] pre-trained on ImageNet [26] as the basic CNN architecture for clustering and adaptation. For closed-set adaptation, we follow [16] and report accuracy on the target domain over all 31 classes. The basic CNN architecture for clustering and adaptation is ResNet50 [9] pre-trained on ImageNet.
VisDA is a large-scale dataset for the challenging synthetic-to-real image transfer, consisting of 280k images from three domains. The synthetic images generated from 3D CAD models are taken as the training domain. The validation domain contains real images from COCO [14], and the testing domain includes video frames from YTBB [25]. Given that the ground truth of the testing set is not publicly available, the synthetic images in the training domain are taken as the source and the COCO images in the validation domain as the target for evaluation. In particular, for open-set adaptation, we follow the open-set setting in [23] and take the 12 classes as the known classes for the source & target domains, the 33 background classes as the unknown classes in the source, and the other 69 COCO categories as the unknown classes in the target. The known-to-unknown ratio of samples in the target domain is strictly set to 1:10. Three metrics, i.e., Knwn, Mean, and Overall, are adopted for evaluation. Here Knwn denotes the accuracy averaged over all known classes, Mean is the accuracy averaged over all known & unknown classes, and Overall is the accuracy over all target samples. For closed-set adaptation, we report the accuracy over all 12 classes, as in the closed-set setting of [23]. We utilize ResNet152 as the CNN backbone for clustering and adaptation in both closed-set and open-set scenarios.
Implementation Details.
Our SE-CC is mainly implemented with PyTorch, and the network weights are optimized with SGD. We set the learning rate and mini-batch size to 0.001 and 56 for all experiments. The maximum training duration is set to 300 epochs on Office and 25 epochs on VisDA. The dimension D of the global feature for global mutual information estimation is set to 128/1,024 for the AlexNet/ResNet backbone. The number of clusters K is determined using the Gap statistics method (K = 25 for Office and K = 500 for VisDA). As in [10], we restrict the hyper-parameter search for each dataset to three candidate values for each of \alpha and \beta (\alpha = 1 for Office and \alpha = 5 for VisDA).

Open-Set Adaptation on Office.
The results of different models on Office for open-set adaptation are shown in Table 1. It is worth noting that AODA adopts a different open-set setting where unknown source samples are absent.

Table 3. Performance comparison with the state of the arts on VisDA dataset for closed-set domain adaptation.

Method       | aero bike bus  car  horse knife mbike person plant skbrd train truck | Mean
Source-only  | 67.1 51.4 50.8 64.5 83.4  13.0  89.9  34.4   78.8  47.0  88.1  2.0  | 55.9
RevGrad [6]  | 81.9 77.7 82.8 44.3 81.2  29.5  65.1  28.6   51.9  54.6  82.8  7.8  | 57.4
RTN [17]     | 89.1 56.4 72.4 69.7 77.9  49.5  87.7  13.0   88.1  77.4  86.7  7.2  | 64.6
MCD [28]     | 87.0 60.9 83.7 64.0 88.9  79.6  84.7  76.9   88.6  40.3  83.0  25.8 | 71.9
SimNet [24]  | 94.3 82.3 73.5 47.2 87.9  49.2  75.1  79.7   85.3  68.5  81.1  50.3 | 72.9
TPN [21]     | 93.7 85.1 69.2
Table 4. Performance comparison with the state of the arts on Office dataset for closed-set domain adaptation.

Method       | A→D  A→W  D→A  D→W  W→A  W→D  | Avg
RTN [17]     | 77.5 84.5 66.2 96.8 64.8 99.4 | 81.6
RevGrad [6]  | 79.7 82.0 68.2 96.9 67.4 99.1 | 82.2
JAN [16]     | 85.1 86.0 69.2 96.7 70.7 99.7 | 84.6
SimNet [24]  | 85.3 88.6 73.4 98.2 71.8 99.7 | 86.2
GTA [30]     | 87.7 89.5 72.8 97.9 71.4 99.8 | 86.5
iCAN [36]    | 90.1
For fair comparison with AODA, we additionally include a variant of our SE-CC (dubbed SE-CC ♦) which learns the classifier without unknown source samples. Specifically, the classifier in SE-CC ♦ is naturally able to recognize only the N-1 known classes, and a target sample is recognized as unknown if its predicted probability is lower than the threshold for all classes, as in open-set SVM [12].

Overall, the results across the two metrics consistently indicate that our SE-CC obtains better performances than other state-of-the-art closed-set adaptation models (RTN and RevGrad) and open-set adaptation methods (AODA, ATI-λ, and FRODA) on most transfer directions. Please also note that our SE-CC improves the classification accuracy evidently on the harder transfers, e.g., D→A and W→A, where the two domains are substantially different. The results generally highlight the key advantage of exploiting the underlying target data structure implicit in category-agnostic clusters for open-set domain adaptation. Such a design makes the learnt feature representation domain-invariant for known classes while discriminative enough to segregate target samples of known and unknown classes. Specifically, by aligning the data distributions between source and target domains, RTN and RevGrad exhibit better performance than Source-only, which trains the classifier only on source data while leaving unlabeled target data unexploited. By rejecting unknown target samples as outliers and aligning data distributions only for inliers, the open-set adaptation techniques (AODA, ATI-λ, and FRODA) outperform RTN and RevGrad. This confirms the effectiveness of excluding unknown target samples from the known target samples during domain adaptation in the open-set scenario. Nevertheless, AODA, ATI-λ, and FRODA are still inferior to our SE-CC, which steers domain adaptation by injecting the distribution of category-agnostic clusters as a constraint for feature learning and alignment.
Open-Set Adaptation on VisDA.
The performance comparison on VisDA for open-set adaptation is summarized in Table 2. Our SE-CC performs consistently better than the other methods across all three metrics. In particular, the Mean accuracy of our SE-CC, averaged over the 12 known classes plus one unknown class, reaches 70.5%, an absolute improvement over the best closed-set adaptation method (SE) and the best open-set adaptation approach (ATI-λ) of 5.3% and 12.3%, respectively. Similar to the observations on Office for open-set adaptation, the open-set adaptation approaches (AODA and ATI-λ) exhibit better performance than RTN and RevGrad by additionally separating unknown target samples from known target samples. Note that although the closed-set technique SE achieves higher Mean per-category accuracy than the open-set techniques (AODA and ATI-λ), the Overall accuracy of SE over all target samples is still worse than that of the open-set techniques. This is because SE aligns unknown samples across the two domains and thus fails to recognize unknown target samples. Furthermore, by integrating category-agnostic clusters into SE and steering domain adaptation to preserve the underlying target data structure of both known and unknown classes, SE-CC boosts the performances in terms of all metrics.

Table 5. Performance contribution of each design (i.e., Conditional Entropy (CE), KL-divergence Loss (KL), and Mutual Information Maximization (MIM)) in SE-CC on VisDA for open-set transfer.
Method  CE  KL  MIM  Knwn  Mean  Overall
SE                   66.4  65.2  52.7

Closed-Set Adaptation on Office and VisDA.
To further verify the generality of our proposed SE-CC, we additionally conduct experiments for domain adaptation in the closed-set scenario. Tables 4 and 3 show the performance comparisons on the Office and VisDA datasets for closed-set domain adaptation. Similar to the observations for the open-set domain adaptation task on these two datasets, our SE-CC achieves better performances than other state-of-the-art closed-set adaptation techniques. The results basically demonstrate the advantage of exploiting the underlying data structure in the target domain via category-agnostic clusters for domain adaptation, even in the closed-set scenario without any diverse and ambiguous unknown samples.
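The Mean and Overall metrics reported throughout these comparisons can be sketched as below; the helper function and its toy labels are hypothetical, assuming Mean averages per-class accuracies over the listed classes (known classes plus unknown in the open-set setting) while Overall is plain sample-level accuracy:

```python
def mean_and_overall_accuracy(y_true, y_pred, classes):
    """Mean: per-class accuracy averaged over the listed classes.
    Overall: fraction of all samples predicted correctly."""
    per_class = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        per_class.append(correct / len(idx))
    mean_acc = sum(per_class) / len(per_class)
    overall = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return mean_acc, overall

# Toy open-set labels: two known classes plus 'unk' (hypothetical).
y_true = ['a', 'a', 'b', 'b', 'unk', 'unk', 'unk', 'unk']
y_pred = ['a', 'b', 'b', 'b', 'unk', 'a', 'a', 'a']
mean_acc, overall = mean_and_overall_accuracy(y_true, y_pred, ['a', 'b', 'unk'])
```

When the unknown class carries much of the target data, a method that misclassifies unknown samples can still post a high Mean but a low Overall, which is the behavior observed for SE above.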
Ablation Study.
Table 6. Evaluation of (a) the clustering branch with different loss functions (i.e., L1: L1 distance, L2: L2 distance, and KL: KL-divergence) to measure the mismatch between two distributions, and (b) mutual information estimated over the input feature and different outputs (i.e., CLS: output of the classification branch, CLU: output of the clustering branch, and CLS+CLU: combined output of the classification and clustering branches) on VisDA for open-set transfer.

(a)
Method   Knwn  Mean  Overall

(b)
Method   Knwn  Mean  Overall
CLS      69.3  69.4  69.4
CLU      70.0  70.1  70.8
CLS+CLU  70.4  70.5  71.6

Here we investigate how each design in our SE-CC influences the overall performance. Conditional
Entropy (CE) incorporates an unsupervised conditional entropy loss into SE to drive the classifier's decision boundaries away from high-density target data regions in the student model. KL-divergence Loss (KL) aligns the estimated cluster assignment distribution to the inherent cluster distribution for each target sample, refining the feature to preserve the underlying structure of the target domain. Mutual Information Maximization (MIM) further enhances the feature's suitability for the downstream tasks by maximizing the mutual information among the input feature, the output classification distribution, and the cluster assignment distribution. Table 5 details the performance improvements on VisDA obtained by adding each design in our SE-CC for open-set domain adaptation. CE is a general way to enhance the classifier for the target domain irrespective of the domain adaptation architecture. In our case, CE improves the Mean accuracy from 65.2% to 66.3%, which demonstrates that CE is an effective choice. KL and MIM are two designs specific to our SE-CC, and their respective performance gains are 3.0% and 1.2% in the Mean metric. In other words, our SE-CC leads to a large performance boost of 4.2% in total in terms of the Mean metric. The results verify the idea of exploiting the underlying target data structure and mutual information maximization for open-set adaptation.
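The CE and KL terms above can be sketched in numpy as follows; the array shapes, the eps constant, and the direction of the KL term are illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np

def conditional_entropy(p, eps=1e-8):
    """CE loss: mean entropy of the per-sample class predictions.
    Minimizing it encourages confident predictions, pushing decision
    boundaries away from high-density target regions."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + eps)).sum(axis=1).mean())

def kl_cluster_alignment(p_clu, q_clu, eps=1e-8):
    """KL loss (one plausible direction): KL(q || p) between the
    inherent cluster distribution q and the estimated cluster
    assignment distribution p, averaged over target samples."""
    p_clu = np.asarray(p_clu, dtype=float)
    q_clu = np.asarray(q_clu, dtype=float)
    return float((q_clu * np.log((q_clu + eps) / (p_clu + eps))).sum(axis=1).mean())

# A confident prediction incurs lower conditional entropy than a flat one.
assert conditional_entropy([[0.98, 0.01, 0.01]]) < conditional_entropy([[1/3, 1/3, 1/3]])
# The KL term vanishes when the estimated assignments already match
# the inherent cluster distribution.
q = [[0.9, 0.1], [0.2, 0.8]]
assert kl_cluster_alignment(q, q) < 1e-6
```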
Evaluation of Clustering Branch.
To study how the design of the loss function in the clustering branch affects performance, we compare the use of KL-divergence in our SE-CC with the L1 and L2 distances. The results in Table 6(a) verify that KL-divergence is a better measure of the mismatch between the estimated and inherent cluster assignment distributions than the L1 and L2 distances, which yield inferior performance. Evaluation of Mutual Information Maximization.
Figure 4. The t-SNE visualization of features learnt by (a) Source-only, (b) SE, and (c) SE-CC on VisDA for open-set adaptation.

Next, we evaluate different variants of the MIM module in our SE-CC by estimating the mutual information between the input feature and different outputs, as shown in Table 6(b). CLS, CLU, and CLS+CLU estimate the local and global mutual information between the input feature and the output of the classification branch, the output of the clustering branch, and the combined output of the two branches, respectively. Compared to our SE-CC without the MIM module (Knwn: 69.3%, Mean: 69.3%, and Overall: 69.1%), CLS and CLU slightly improve the performances by additionally exploiting the mutual information between the input feature and the output of each branch. Furthermore, CLS+CLU obtains a larger performance boost by combining the outputs from both branches for mutual information estimation. The results demonstrate the merit of exploiting the mutual information among the input feature and the combined outputs of the two downstream tasks (i.e., classification and cluster assignment) in our MIM module.
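One common way to realize such an estimator is the Jensen-Shannon-style mutual information lower bound of Deep InfoMax [10], which scores matched (feature, output) pairs against shuffled negatives; the numpy sketch below uses a toy dot-product score as a hypothetical stand-in for the learned discriminator network:

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def jsd_mi_estimate(feats, outs, score):
    """Jensen-Shannon MI lower bound (as in Deep InfoMax):
    E[-softplus(-T(x, y))] over matched pairs minus
    E[softplus(T(x, y'))] over shuffled (mismatched) pairs."""
    pos = np.array([score(f, o) for f, o in zip(feats, outs)])
    neg_outs = outs[rng.permutation(len(outs))]
    neg = np.array([score(f, o) for f, o in zip(feats, neg_outs)])
    return float((-softplus(-pos)).mean() - softplus(neg).mean())

# Toy data: outputs that copy the features are maximally informative,
# while independently drawn outputs carry no information about them.
feats = rng.standard_normal((256, 8))
informative = feats.copy()
independent = rng.standard_normal((256, 8))
dot_score = lambda f, o: f @ o  # hypothetical stand-in for T
assert jsd_mi_estimate(feats, informative, dot_score) > \
       jsd_mi_estimate(feats, independent, dot_score)
```

In SE-CC's CLS+CLU variant the role of `outs` would be played by the concatenated classification and cluster assignment distributions, with a trained score network in place of the dot product.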
Feature Visualization.
We visualize the features learnt by Source-only, SE, and SE-CC with t-SNE [18] on VisDA for open-set adaptation in Figure 4(a)-(c). Compared to Source-only without domain adaptation, SE brings the source and target distributions closer, leading to a domain-invariant representation. However, in SE, all target samples, including unknown ones, are enforced to match source samples, making it difficult to recognize unknown target samples with ambiguous semantics. Through the preservation of the underlying target data structure for both known and unknown classes by SE-CC, the unknown target samples are separated from the known target samples, while the known samples in the two domains remain indistinguishable.
5. Conclusion
We have presented Self-Ensembling with Category-agnostic Clusters (SE-CC), which exploits the category-agnostic clusters in the target domain for domain adaptation in both open-set and closed-set scenarios. Particularly, we study the problem from the viewpoint of how to separate unknown target samples from known ones and how to learn a hybrid network that nicely integrates category-agnostic clusters into Self-Ensembling. We initially perform clustering to decompose all target samples into a set of category-agnostic clusters. Next, an additional clustering branch is integrated into the student model to align the estimated cluster assignment distribution to the inherent cluster distribution implicit in the category-agnostic clusters. That enforces the learnt feature to preserve the underlying data structure in the target domain. Moreover, the mutual information among the input feature and the outputs of the classification and clustering branches is exploited to further enhance the learnt feature. Experiments conducted on Office and VisDA for both open-set and closed-set adaptation tasks verify our proposal. Performance improvements are observed when comparing to state-of-the-art techniques.

References

[1] VisDA, 2018. https://competitions.codalab.org/competitions/19113.
[2] Mahsa Baktashmotlagh, Masoud Faraki, Tom Drummond, and Mathieu Salzmann. Learning factorized representations for open-set domain adaptation. In ICLR, 2019.
[3] Qi Cai, Yingwei Pan, Chong-Wah Ngo, Xinmei Tian, Lingyu Duan, and Ting Yao. Exploring object relation in mean teacher for cross-domain detection. In CVPR, 2019.
[4] Yang Chen, Yingwei Pan, Ting Yao, Xinmei Tian, and Tao Mei. Mocycle-gan: Unpaired video-to-video translation. In ACM MM, 2019.
[5] Geoffrey French, Michal Mackiewicz, and Mark Fisher. Self-ensembling for domain adaptation. In ICLR, 2018.
[6] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
[7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[8] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 2012.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[10] R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, 2019.
[11] Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn. Data clustering: a review. ACM Computing Surveys, 1999.
[12] Lalit P. Jain, Walter J. Scheirer, and Terrance E. Boult. Multi-class open set recognition using probability of inclusion. In ECCV, 2014.
[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[15] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In CVPR, 2017.
[16] Mingsheng Long, Jianmin Wang, and Michael I. Jordan. Deep transfer learning with joint adaptation networks. In ICML, 2017.
[17] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Unsupervised domain adaptation with residual transfer networks. In NIPS, 2016.
[18] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 2008.
[19] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[20] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.
[21] Yingwei Pan, Ting Yao, Yehao Li, Yu Wang, Chong-Wah Ngo, and Tao Mei. Transferrable prototypical networks for unsupervised domain adaptation. In CVPR, 2019.
[22] Pau Panareda Busto and Juergen Gall. Open set domain adaptation. In ICCV, 2017.
[23] Xingchao Peng, Ben Usman, Kuniaki Saito, Neela Kaushik, Judy Hoffman, and Kate Saenko. Syn2real: A new benchmark for synthetic-to-real visual domain adaptation. arXiv preprint arXiv:1806.09755, 2018.
[24] Pedro O. Pinheiro. Unsupervised domain adaptation with similarity learning. In CVPR, 2018.
[25] Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke. Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video. In CVPR, 2017.
[26] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[27] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In ECCV, 2010.
[28] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, 2018.
[29] Kuniaki Saito, Shohei Yamamoto, Yoshitaka Ushiku, and Tatsuya Harada. Open set domain adaptation by backpropagation. In ECCV, 2018.
[30] Swami Sankaranarayanan, Yogesh Balaji, Carlos D. Castillo, and Rama Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In CVPR, 2018.
[31] Rui Shu, Hung H. Bui, Hirokazu Narui, and Stefano Ermon. A dirt-t approach to unsupervised domain adaptation. In ICLR, 2018.
[32] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, 2017.
[33] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
[34] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
[35] Ting Yao, Yingwei Pan, Chong-Wah Ngo, Houqiang Li, and Tao Mei. Semi-supervised domain adaptation with subspace learning for visual recognition. In CVPR, 2015.
[36] Weichen Zhang, Wanli Ouyang, Wen Li, and Dong Xu. Collaborative and adversarial network for unsupervised domain adaptation. In CVPR, 2018.
[37] Yiheng Zhang, Zhaofan Qiu, Ting Yao, Dong Liu, and Tao Mei. Fully convolutional adaptation networks for semantic segmentation. In