Implicit Class-Conditioned Domain Alignment for Unsupervised Domain Adaptation
Xiang Jiang
Qicheng Lao
Stan Matwin
Mohammad Havaei

Abstract
We present an approach for unsupervised domain adaptation—with a strong focus on practical considerations of within-domain class imbalance and between-domain class distribution shift—from a class-conditioned domain alignment perspective. Current methods for class-conditioned domain alignment aim to explicitly minimize a loss function based on pseudo-label estimations of the target domain. However, these methods suffer from pseudo-label bias in the form of error accumulation. We propose a method that removes the need for explicit optimization of model parameters from pseudo-labels directly. Instead, we present a sampling-based implicit alignment approach, where the sample selection procedure is implicitly guided by the pseudo-labels. Theoretical analysis reveals the existence of a domain-discriminator shortcut in misaligned classes, which is addressed by the proposed implicit alignment approach to facilitate domain-adversarial learning. Empirical results and ablation studies confirm the effectiveness of the proposed approach, especially in the presence of within-domain class imbalance and between-domain class distribution shift.
1. Introduction
Supervised learning aims to extract statistical patterns from data by learning to approximate the conditional density p(y|x). However, the generalization of the approximation is often sensitive to dataset-specific factors. Dataset shift (Quionero-Candela et al., 2009) frequently arises in real-world applications and can manifest in many different ways, such as sample selection bias (Heckman, 1979; Torralba et al., 2011) and class distribution shift (Webb & Ting, 2005).
Explicit class-conditioned domain alignment (Xie et al., 2018; Pan et al., 2019; Liang et al., 2019a; Deng et al., 2019) has emerged as a key approach to promoting class-conditioned invariance by aligning prototypical representations of each class. While explicit alignment has the advantage of directly minimizing class-conditioned misalignment, it presents critical vulnerabilities to error accumulation (Chen et al., 2019a) and ill-calibrated probabilities (Guo et al., 2017) due to its dependence on explicit supervision from pseudo-labels provided by model predictions.

We propose Implicit Class-Conditioned Domain Alignment, which removes the need for explicit pseudo-label based optimization. Instead, we use the pseudo-labels implicitly to sample class-conditioned data in a way that maximally aligns the joint distribution between features and labels. The primary advantage of the sampling-based implicit domain alignment is the ability to address within-domain class imbalance and between-domain class distribution shift, in addition to many other benefits such as applications in cost-sensitive learning. The proposed method is simple, effective, and supported by theoretical analysis on the empirical estimations of domain divergence measures. It also overcomes limitations of explicit alignment by allowing the domain adaptation algorithm to discover class-conditioned domain-invariance in an unsupervised way, without explicit supervision from pseudo-labels.

The contributions of this paper are as follows: (i) We propose implicit class-conditioned domain alignment to address the challenge of within-domain class imbalance and between-domain class distribution shift, which overcomes the limitation of error accumulation in explicit domain alignment; (ii) We provide theoretical analysis on the empirical domain divergence and reveal the existence of a shortcut function that interferes with domain-invariant learning, which is addressed by the proposed approach; (iii) We show that the proposed approach is orthogonal to the choice of domain adaptation algorithms and offers consistent improvements to two adversarial domain adaptation algorithms; (iv) We report state-of-the-art UDA performance under extreme within-domain class imbalance and between-domain class distribution shift, and competitive results on standard UDA tasks.
2. Preliminaries
We follow the notation of (Ben-David et al., 2010) and define a domain as an ordered pair consisting of a distribution $\mathcal{D}$ on the input space $\mathcal{X}$, and a labeling function $f: \mathcal{X} \to \mathcal{Y}$ that maps $\mathcal{X}$ to the label space $\mathcal{Y}$. The source and target domains are denoted by $\langle \mathcal{D}_S, f_S \rangle$ and $\langle \mathcal{D}_T, f_T \rangle$, respectively. In unsupervised domain adaptation, the model is trained on labeled data from the source domain, together with unlabeled data from the target domain. The goal is to obtain a model $h \in \mathcal{H}$ that learns domain-invariant representations while simultaneously minimizing the classification error on $\mathcal{D}_S$.

Adversarial training is the prevailing approach for domain adaptation (Ganin et al., 2016). It formulates a minimax problem where the maximizer maximizes the estimation of the domain divergence between the empirical samples, and the minimizer minimizes the sum of the source error and the domain divergence estimation obtained from the maximizer.

While matching the marginal distribution is a good step towards domain-invariant learning, it is still susceptible to the problem of conditional distribution mismatch. Prototype-based class-conditioned domain alignment (Luo et al., 2017; Xie et al., 2018; Chen et al., 2019a; Pan et al., 2019; Liang et al., 2019a;b) is designed to address this problem. We refer to this group of methods as explicit class-conditioned domain alignment. The explicit alignment is achieved by incorporating an auxiliary loss that minimizes the Euclidean distance between the class-conditioned prototypical representations $c_j$ of the source and target domains, where the prototype $c_j$ is the average representation of all examples in a domain with class label $j$.

The main limitation of explicit class-conditioned domain alignment is its reliance on explicit optimization of model parameters based on pseudo-labels. This learning procedure is vulnerable to error accumulation (Chen et al., 2019a), as mistakes in the pseudo-label predictions can gradually accumulate, leading to poor local minima in EM-style training. Furthermore, the pseudo-labels are likely to suffer from ill-calibrated probabilities (Guo et al., 2017), especially for deep learning methods, which exacerbates the critical problem of error accumulation with misleadingly confident mistakes.
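To make the minimax formulation concrete, here is a minimal PyTorch sketch of domain-adversarial training with a gradient reversal layer; the module names and the binary domain-label convention are illustrative assumptions, not the exact implementation used in this paper.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda on the
    way back, so the feature extractor maximizes the loss the discriminator
    minimizes."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def domain_adversarial_loss(features_s, features_t, discriminator, lambd=1.0):
    """Binary cross-entropy of the domain discriminator on reversed features.
    Convention (assumed): domain label 0 = source, 1 = target."""
    feats = torch.cat([features_s, features_t], dim=0)
    domain_logits = discriminator(GradReverse.apply(feats, lambd))
    domain_labels = torch.cat([torch.zeros(len(features_s)),
                               torch.ones(len(features_t))]).unsqueeze(1)
    return nn.functional.binary_cross_entropy_with_logits(
        domain_logits, domain_labels.to(feats.device))
```

The reversal lets a single backward pass serve both players: the discriminator descends on the domain loss while the feature extractor, receiving negated gradients, ascends on it.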
3. Method
We begin with theoretical motivations of implicit alignment by decomposing the empirical domain divergence measure into class-aligned and class-misaligned divergence, and show that the misaligned divergence is detrimental to domain adaptation. We then present the proposed implicit domain alignment framework that addresses class misalignment.
The $\mathcal{H}\Delta\mathcal{H}$ divergence between two domains is defined as

$$d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) = 2 \sup_{h,h' \in \mathcal{H}} \left| \mathbb{E}_{\mathcal{D}_T}[h \neq h'] - \mathbb{E}_{\mathcal{D}_S}[h \neq h'] \right|, \quad (1)$$

where $\mathcal{H}$ denotes some hypothesis space, and $h \neq h'$ is the abbreviation for $h(x) \neq h'(x)$. (Ben-David et al., 2010) theorized that the target domain error $\epsilon_T(h)$ is bounded by the error of the source domain $\epsilon_S(h)$ and the empirical domain divergence $\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_S, \mathcal{U}_T)$, where $\mathcal{U}_S, \mathcal{U}_T$ are unlabeled empirical samples drawn from $\mathcal{D}_S, \mathcal{D}_T$.

In deep learning, minibatch-based optimization limits the amount of data available at each training step. This necessitates the analysis of the empirical estimations of $d_{\mathcal{H}\Delta\mathcal{H}}$ at the minibatch level, so as to shed light on the learning dynamics.

Definition 3.1. Let $\mathcal{B}_S, \mathcal{B}_T$ be minibatches from $\mathcal{U}_S$ and $\mathcal{U}_T$, respectively, where $\mathcal{B}_S \subseteq \mathcal{U}_S$, $\mathcal{B}_T \subseteq \mathcal{U}_T$, and $|\mathcal{B}_S| = |\mathcal{B}_T|$. The empirical estimation of $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{B}_S, \mathcal{B}_T)$ over the minibatches $\mathcal{B}_S, \mathcal{B}_T$ is defined as

$$\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{B}_S, \mathcal{B}_T) = \sup_{h,h' \in \mathcal{H}} \left| \sum_{\mathcal{B}_T} [h \neq h'] - \sum_{\mathcal{B}_S} [h \neq h'] \right|. \quad (2)$$

Theorem 3.2 (The decomposition of $\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{B}_S, \mathcal{B}_T)$). Let $\mathcal{H}$ be a hypothesis space and $\mathcal{Y}$ be the label space of the classification task, where $\mathcal{B}_S, \mathcal{B}_T$ are minibatches drawn from $\mathcal{U}_S, \mathcal{U}_T$, respectively, and $\mathcal{Y}_S, \mathcal{Y}_T$ are the label sets of $\mathcal{B}_S, \mathcal{B}_T$. We define three disjoint sets on the label space: the shared labels $\mathcal{Y}_C := \mathcal{Y}_S \cap \mathcal{Y}_T$, and the domain-specific labels $\bar{\mathcal{Y}}_S := \mathcal{Y}_S - \mathcal{Y}_C$ and $\bar{\mathcal{Y}}_T := \mathcal{Y}_T - \mathcal{Y}_C$. We also define the following disjoint sets on the input space: $\mathcal{B}_S^C := \{x \in \mathcal{B}_S \mid y \in \mathcal{Y}_C\}$, $\mathcal{B}_S^{\bar{C}} := \{x \in \mathcal{B}_S \mid y \notin \mathcal{Y}_C\}$, $\mathcal{B}_T^C := \{x \in \mathcal{B}_T \mid y \in \mathcal{Y}_C\}$, $\mathcal{B}_T^{\bar{C}} := \{x \in \mathcal{B}_T \mid y \notin \mathcal{Y}_C\}$. The empirical divergence $\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{B}_S, \mathcal{B}_T)$ can be decomposed into class-aligned divergence and class-misaligned divergence:

$$\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{B}_S, \mathcal{B}_T) = \sup_{h,h' \in \mathcal{H}} \left| \xi_C(h,h') + \xi_{\bar{C}}(h,h') \right|, \quad (3)$$

where

$$\xi_C(h,h') = \sum_{\mathcal{B}_T^C} [h \neq h'] - \sum_{\mathcal{B}_S^C} [h \neq h'], \quad (4)$$

$$\xi_{\bar{C}}(h,h') = \sum_{\mathcal{B}_T^{\bar{C}}} [h \neq h'] - \sum_{\mathcal{B}_S^{\bar{C}}} [h \neq h']. \quad (5)$$
Figure 1.
Illustration of the domain discriminator shortcut. The domain discriminator aims to distinguish between different domains (red and blue), where the decision boundary is represented by dashed lines. But misaligned samples create a shortcut where the domain labels can be directly determined by the misaligned class labels (3 and 6). The decision boundary of the resulting shortcut is independent of the covariate that causes the domain difference, which does not contribute to adversarial domain-invariant learning.
The proof is provided in the supplementary materials.
Remark 3.3 (The domain discriminator shortcut). Let the ordered triple $(x, y_c, y_d)$ denote a data sample $x$ and its associated class label $y_c$ and domain label $y_d$, respectively, where $x \in \mathcal{B}_S \cup \mathcal{B}_T$, $y_c \in \mathcal{Y}$, and $y_d \in \{0, 1\}$. Let $f_c$ be a classifier that maps $x$ to a class label $y_c$. Let $f_d$ be a domain discriminator that maps $x$ to a binary domain label $y_d$. For the empirical class-misaligned divergence $\xi_{\bar{C}}(h,h')$ with samples $x \in \mathcal{B}_S^{\bar{C}} \cup \mathcal{B}_T^{\bar{C}}$, there exists a domain discriminator shortcut function

$$f_d(x) = \begin{cases} 0 & f_c(x) \in \bar{\mathcal{Y}}_S \\ 1 & f_c(x) \in \bar{\mathcal{Y}}_T, \end{cases} \quad (6)$$

such that the domain label can be solely determined by the domain-specific class labels. This shortcut interferes with adversarial domain adaptation because the model could bypass the optimization for domain-invariant representations, and instead optimize for a shortcut function that is independent of the covariate contributing to the domain difference.

Figure 1 illustrates a toy example where the source and target domains are aligned for class 4 but misaligned between classes 3 and 6, as a result of random sampling in the minibatch construction. The domain discriminator aims to predict domain labels based on domain information, i.e., red and blue. However, due to the class shortcut for the misaligned samples (3 and 6), the domain discriminator could infer domain labels based on class information directly (digits 3 and 6), without the need to learn domain-specific information. This problem of class misalignment is especially pronounced under extreme within-domain class imbalance and between-domain class distribution shift, where a simple random sample is more likely to fail in providing good coverage of the label space.
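The shortcut can be made concrete with a toy check mirroring Figure 1; the arrays below are hypothetical, and the point is only that a rule reading class labels alone already separates the domains on the misaligned subset, so the discriminator never needs the covariate.

```python
import numpy as np

# Toy minibatch mimicking Figure 1: source classes {3, 4}, target classes {4, 6}.
source_classes = np.array([3, 3, 4, 4])
target_classes = np.array([4, 4, 6, 6])

shared = np.intersect1d(source_classes, target_classes)   # Y_C = {4}
target_only = np.setdiff1d(target_classes, shared)        # {6}

def shortcut_discriminator(y_class):
    """Predict the domain from the class label alone (Eq. 6):
    0 (source) for source-only classes, 1 (target) for target-only ones."""
    return np.where(np.isin(y_class, target_only), 1, 0)

# On the misaligned subset the shortcut is a perfect domain classifier,
# without ever touching the inputs x.
mis_s = source_classes[~np.isin(source_classes, shared)]  # class 3 samples
mis_t = target_classes[~np.isin(target_classes, shared)]  # class 6 samples
assert (shortcut_discriminator(mis_s) == 0).all()
assert (shortcut_discriminator(mis_t) == 1).all()
```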
Having identified the domain discriminator shortcut in class-misaligned empirical samples, we now propose a framework that aligns the two domains from a sampling perspective.

Figure 2.
The proposed framework. (a) We aim to align the source domain $p_S(x)$, colored by classes, with the unlabeled target domain $p_T(x)$. (b) For $p_S(x)$, we sample $x \sim p_S(x|y)\,p(y)$ based on the alignment distribution $p(y)$. For $p_T(x)$, we sample a class-aligned minibatch $x \sim p_T(x|\hat{y})\,p(y)$ using the identical $p(y)$, with the help of pseudo-labels $\hat{y}_T$. (c) The adversarial training aims to acquire domain-invariant representations $z$ from the feature extractor parameterized by $\phi$. (d) The classifier predicts class labels from $z$.

Figure 2 depicts the proposed implicit class-conditioned domain alignment framework. We aim to align $p_S(x)$ and $p_T(x)$ in the input and label space jointly with the factorization $p(x,y) = p(x|y)\,p(y)$, while ensuring that the sampled classes are aligned between the two domains. The alignment distribution $p(y)$ is pre-specified, e.g., a uniform distribution, to ensure samples are aligned in the shared label space in spite of different empirical label distributions of the two domains. This implicit alignment procedure minimizes the class-misaligned divergence $\xi_{\bar{C}}(h,h')$, providing a more reliable empirical estimation of domain divergence. For the unlabeled target domain, we use the model predictions to sample class-conditioned data from $p_T(x|\hat{y})$ to approximate $p_T(x|y)$.

3.2.1. Class-Aligned Sampling Strategy
Algorithm 1 presents the proposed sampling procedure that selects class-aligned examples for minibatch training. It is a type of stratified sampling where the dataset is partitioned into mutually exclusive subgroups to reflect the label information in a class-aligned manner.

First, we predict pseudo-labels of the target domain using the classifier $f_c(\cdot\,; \theta)$ parameterized by $\theta$; the pseudo-labels will later be used in class-conditioned sampling. Second, we sample a set $Y$ from the label space $\mathcal{Y}$, where $p(y)$ defines the probability with which we pick the classes to align, so as to ensure the empirical samples of the source and target domains share the same $Y$. This in turn minimizes the class-misaligned divergence $\xi_{\bar{C}}(h,h')$. Third, for each class $y_i \in Y$, we sample class-conditioned examples for the source and target domains, respectively, and store them in $(X'_S, Y'_S)$ and $X'_T$. This is equivalent to performing a table lookup to select a subset $\mathcal{B}_i$ where all examples belong to class $y_i$, followed by random sampling in $\mathcal{B}_i$. We use pseudo-labels to sample the target domain due to the lack of ground-truth labels. Once we have obtained the class-aligned minibatch, we use it to train the unsupervised domain adaptation algorithm, and we repeat this process until the model converges.

Algorithm 1: The proposed implicit alignment training
  Input: dataset $S = \{(x_i, y_i)\}_{i=1}^N$, $T = \{x_i\}_{i=1}^M$, label space $\mathcal{Y}$, label alignment distribution $p(y)$, classifier $f_c(\cdot\,; \theta)$
  while not converged do
    $\hat{T} \leftarrow \{(x_i, f_c(x_i; \theta))\}_{i=1}^M$ where $x_i \in T$   ▷ pseudo-label the target domain
    $Y \leftarrow$ draw $N$ samples in $\mathcal{Y}$ from $p(y)$   ▷ $N$ unique classes in the label space
    for $y_i$ in $Y$ do   ▷ $K$ examples conditioned on each $y_i \in Y$
      $(X'_S, Y'_S) \leftarrow$ draw $K$ samples in $S$ from $p_S(x \mid y_i)$
      $X'_T \leftarrow$ draw $K$ samples in $\hat{T}$ from $p_T(x \mid y_i)$
    end for
    train minibatch $(X'_S, Y'_S, X'_T)$
  end while

This algorithm addresses class imbalance within each domain, as well as class distribution shift between different domains, by specifying the sampling strategy $p(y)$ in the label space. We use uniform sampling for $p(y)$ in all experiments in this paper, and leave more advanced specifications and their applications to cost-sensitive domain adaptation as future work.
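A minimal Python sketch of this sampling step is shown below, assuming uniform p(y) and index-based dataset lookup; the function and variable names are illustrative rather than the authors' code (the official implementation is linked in Section 4).

```python
import random
from collections import defaultdict

def index_by_class(labels):
    """Build a class -> example-indices lookup table (the stratification step)."""
    table = defaultdict(list)
    for idx, y in enumerate(labels):
        table[y].append(idx)
    return table

def class_aligned_minibatch(source_labels, target_pseudo_labels,
                            n_classes, k_per_class):
    """Draw N unique classes uniformly (the paper's choice of p(y)), then K
    source and K target examples per class, so both halves of the minibatch
    share the same label set Y."""
    src_table = index_by_class(source_labels)
    tgt_table = index_by_class(target_pseudo_labels)
    # Only classes present in both lookup tables can be aligned.
    candidates = sorted(set(src_table) & set(tgt_table))
    sampled = random.sample(candidates, k=min(n_classes, len(candidates)))
    src_idx, tgt_idx = [], []
    for c in sampled:
        src_idx += random.choices(src_table[c], k=k_per_class)  # with replacement
        tgt_idx += random.choices(tgt_table[c], k=k_per_class)
    return src_idx, tgt_idx, sampled  # `sampled` doubles as the support of Y
```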
3.2.2. Integrating Implicit Alignment into Classifier-Based Domain Discrepancy Measure

Section 3.2.1 describes the implicit alignment algorithm from a sampling perspective, where we sample minibatches in a way that maximizes class alignment implicitly. This sampling strategy is independent of the choice of domain divergence measure. In this section, we show how to integrate the sampling approach into Margin Disparity Discrepancy (MDD) (Zhang et al., 2019b)—a state-of-the-art classifier-based domain discrepancy measure—to further facilitate implicit alignment. MDD is defined as

$$d_{f,\mathcal{F}}(S, T) = \sup_{f' \in \mathcal{F}} \left( \mathrm{disp}_{\mathcal{D}_T}(f', f) - \mathrm{disp}_{\mathcal{D}_S}(f', f) \right), \quad (7)$$

where $f$ and $f'$ are two independent scoring functions that predict class probabilities, and $\mathrm{disp}(f', f)$ is a disparity measure between the scores provided by the classifiers $f'$ and $f$. The domain divergence estimates the discrepancy between the disparity measures of the two domains.

Following the notation of Theorem 3.2, we define the empirical MDD on class-misaligned samples as

$$\hat{d}_{f,\mathcal{F}}(\mathcal{B}_S^{\bar{C}}, \mathcal{B}_T^{\bar{C}}) = \sup_{f' \in \mathcal{F}} \left( \sum_{\mathcal{B}_T^{\bar{C}}} \mathrm{disp}(f', f) - \sum_{\mathcal{B}_S^{\bar{C}}} \mathrm{disp}(f', f) \right). \quad (8)$$

Because $\mathcal{B}_S^{\bar{C}}$ and $\mathcal{B}_T^{\bar{C}}$ are disjoint in the label space, there exists a shortcut solution

$$\mathrm{disp}(f'(x), f(x)) = \begin{cases} 0 & f_c(x) \in \bar{\mathcal{Y}}_S \\ 1 & f_c(x) \in \bar{\mathcal{Y}}_T, \end{cases} \quad (9)$$

which maximizes the divergence estimation of Eq. (8). Although class-aligned sampling can mitigate this problem, it is difficult to fully eliminate the impact of misalignment due to imperfect pseudo-labels. To further eliminate the detrimental impact of class misalignment, we introduce a masking scheme on the scoring functions $f$ and $f'$, defined as

$$\hat{d}_{f,\mathcal{F}}(\mathcal{B}_S, \mathcal{B}_T) = \sup_{f' \in \mathcal{F}} \left( \sum_{\mathcal{B}_T} \mathrm{disp}(f' \odot \omega, f \odot \omega) - \sum_{\mathcal{B}_S} \mathrm{disp}(f' \odot \omega, f \odot \omega) \right), \quad (10)$$

where $f \odot \omega$ denotes element-wise multiplication between the output of $f$ and $\omega$. The alignment mask $\omega$ is a binary vector whose $i$-th entry denotes whether the $i$-th class is present in the sampled classes $Y$ (i.e., the classes that we intend to align in the current minibatch). By doing so, we simultaneously align the source and target domains (i) in the input space and (ii) in the functional approximations of the domain divergence, by masking the scoring functions $f$ and $f'$.
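As a sketch of how the mask could enter the computation, the snippet below applies ω to both scoring functions before a user-supplied disparity function; `disp` stands in for MDD's margin disparity, and all names are assumptions rather than the authors' implementation.

```python
import torch

def alignment_mask(sampled_classes, n_classes, device):
    """Binary mask omega of Eq. (10): 1 for classes in the aligned set Y."""
    omega = torch.zeros(n_classes, device=device)
    omega[list(sampled_classes)] = 1.0
    return omega

def masked_mdd(disp, f_s, f_t, f_prime_s, f_prime_t, omega):
    """Empirical masked MDD: target disparity minus source disparity, with
    both scoring functions restricted to the aligned classes via omega.
    All score tensors have shape (batch, n_classes); broadcasting applies
    the (n_classes,) mask column-wise."""
    return (disp(f_prime_t * omega, f_t * omega).sum()
            - disp(f_prime_s * omega, f_s * omega).sum())
```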
4. Experiments
We evaluate on Office-31, Office-Home, and VisDA2017. Office-31 (Saenko et al., 2010) has three domains (Amazon, DSLR, and Webcam) with 31 classes. We use three versions of Office-Home (Venkateswara et al., 2017), which contains four domains (Art, Clipart, Product, and Real-World) with 65 classes: (i) “standard”: the standard Office-Home dataset; (ii) “balanced” (Tan et al., 2019): a subset of the standard dataset where each class has the same number of examples; (iii) “RS-UT”: Reversely-unbalanced Source (RS) and Unbalanced-Target (UT) distributions (Tan et al., 2019), where both domains are imbalanced, but the majority class in the source domain is the minority class in the target domain. VisDA2017 (synthetic → real) (Peng et al., 2017) is a large-scale dataset with 12 classes and more than 200k images.

Model architecture.
We use ResNet-50 (He et al., 2016) pre-trained on ImageNet (Russakovsky et al., 2015) as the backbone, and use hyper-parameters from (Zhang et al., 2019b).
Code: https://github.com/xiangdal/implicit_alignment
Table 1.
Per-class average accuracy on Office-Home dataset with RS-UT label shifts (ResNet-50).
Methods                       Rw→Pr   Rw→Cl   Pr→Rw   Pr→Cl   Cl→Rw   Cl→Pr   Avg
Source Only†                  —       —       —       —       —       —       —
…                             —       —       —       —       —       —       —
MDD+Implicit Alignment        76.08   50.04   74.21   45.38   61.15   63.15   61.67

† Source: Data of these baseline methods are cited from (Tan et al., 2019).
‡ Methods using explicit class-conditioned domain alignment.
Baselines.
Our main explicit alignment baselines are COAL (Tan et al., 2019), PACET (Liang et al., 2019b), and MCS (Liang et al., 2019a), state-of-the-art explicit alignment methods based on domain-discriminator discrepancy. As our domain discrepancy measure is MDD, we also re-implement various MDD-based explicit alignment methods for a fair comparison.
Computational efficiency.
We only update pseudo-labels periodically, i.e., every 20 steps, instead of at every training step. We show in the supplementary materials that our method does not require more frequent pseudo-label updates.
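A hedged sketch of this periodic refresh, with assumed module names and a loader that yields unlabeled target batches, could look as follows.

```python
import torch

@torch.no_grad()
def refresh_pseudo_labels(feature_extractor, classifier, target_loader, device):
    """One full pass over the unlabeled target set; argmax predictions become
    the pseudo-labels consumed by the class-aligned sampler."""
    preds = []
    for x in target_loader:
        preds.append(classifier(feature_extractor(x.to(device))).argmax(1).cpu())
    return torch.cat(preds)

def train_loop(feature_extractor, classifier, target_loader, device,
               num_steps, train_step, update_every=20):
    """Refresh pseudo-labels only every `update_every` steps (20 in the paper),
    amortizing the cost of the full target pass across many updates."""
    pseudo_labels = None
    for step in range(num_steps):
        if step % update_every == 0:
            pseudo_labels = refresh_pseudo_labels(feature_extractor, classifier,
                                                  target_loader, device)
        train_step(pseudo_labels)  # sample a class-aligned minibatch, then update
```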
We use Office-Home (RS-UT), described in Figure 3 (a), to evaluate the performance of different methods under extreme within-domain class imbalance and between-domain class distribution shift, where the majority classes in the source domain are minority classes in the target domain. Table 1 presents the per-class average accuracy on Office-Home (RS-UT). Our main baseline is the explicit alignment method “covariate and label shift co-alignment” (COAL), designed to address data imbalance and class distribution shift. Our proposed implicit domain alignment works the best.

4.2.1. The Impact of Class Distribution Shift
Many baseline methods suffer from class distribution shift, and their performance degrades to that of “Source Only” training, as they do not take into account within-domain class imbalance and between-domain class distribution shift. For MDD-based methods, after we apply balanced sampling for the source domain, the per-class average accuracy improves from 55.44% to 58.50%, which indicates that balanced sampling is helpful for class distribution shift, even though it is applied only in the source domain.
Figure 3. (a) Source and target class distribution of Office-Home (RS-UT). (b) Accuracy comparison between Office-Home (RS-UT) and Office-Home (balanced) for Rw → Pr.
4.2.2. The Effectiveness of Implicit Alignment
The effectiveness of implicit alignment is demonstrated through the comparison between “MDD+Implicit Alignment” and “MDD (source-balanced sampler)”. Both methods use the same sampling procedure for the source. The only difference is that implicit alignment aligns the two domains by sampling aligned classes in the target domain, whereas the “source-balanced sampler” only takes random samples from the target domain. Table 1 shows that implicit alignment performs better than the “source-balanced sampler” because it is better aligned, which confirms the effectiveness of implicit alignment. The proposed method also outperforms MDD-based explicit alignment, which validates the effectiveness of implicit alignment over explicit alignment.

Figure 3 (b) compares the baseline, implicit, and explicit alignments on Office-Home (balanced) and Office-Home (RS-UT). We observe that implicit alignment performs the best on both datasets. More importantly, implicit alignment is more robust to class distribution shift: it greatly outperforms other methods under the RS-UT distribution shift and has a smaller performance drop from the balanced version of Office-Home.
Table 2.
Accuracy (%) on Office-31 (standard) for unsupervised domain adaptation (ResNet-50). We repeated each experiment 5 times with different random seeds and report the average and the standard error of the accuracy.
Method                         A→W    D→W    W→D    A→D    D→A    W→A    Avg
Source only                    68.4   —      —      —      —      —      —
…                              —      —      —      —      —      —      —
PACET (Liang et al., 2019b)‡   —      —      —      —      —      —      —
MDD+Implicit Alignment         —      —      —      —      —      —      —

‡ Methods using explicit class-conditioned domain alignment.
Table 3.
Accuracy (%) on Office-Home (standard) for unsupervised domain adaptation (ResNet-50).
Method                      Ar→Cl Ar→Pr Ar→Rw Cl→Ar Cl→Pr Cl→Rw Pr→Ar Pr→Cl Pr→Rw Rw→Ar Rw→Cl Rw→Pr Avg
Source only                 34.9  50.0  58.0  37.4  41.9  46.2  38.5  31.2  60.4  53.9  41.2  59.9  46.1
DAN (Long et al., 2015)     43.6  57.0  67.9  45.8  56.5  60.4  44.0  43.6  67.7  63.1  51.5  74.3  56.3
DANN (Ganin et al., 2016)   45.6  59.3  70.1  47.0  58.5  60.9  46.1  43.7  68.5  63.2  51.8  76.8  57.6
JAN (Long et al., 2017)     45.9  61.2  68.9  50.4  59.7  61.0  45.8  43.4  70.3  63.9  52.4  76.8  58.3
CDAN (Long et al., 2018)    50.7  70.6  76.0  57.6  70.0  70.0  57.4  50.9  77.3  70.9  56.7  81.6  65.8
BSP (Chen et al., 2019c)    52.0  68.6  76.1  58.0  70.3  70.2  58.6  50.2  77.6  72.2  59.3  81.9  66.3
MDD (Zhang et al., 2019b)   54.9  73.7  77.8  60.0  71.4  71.8  61.2  53.6  78.1  —     —     —     —
MDD+Implicit Alignment      56.2  77.9  79.2  64.4  73.1  74.4  64.2  54.2  79.9  —     —     —     —

‡ Methods using explicit class-conditioned domain alignment.
Table 2 and Table 3 summarize the results on the standard Office-31 and Office-Home datasets, which have a small degree of class imbalance. Our method outperforms the baselines in 3 out of 6 domain pairs for Office-31, and 10 out of 12 domain pairs for Office-Home (standard). The proposed implicit alignment exhibits larger performance gains on the Office-Home dataset because the dataset is more difficult for domain adaptation, and it has 65 classes compared with the 31 classes in Office-31. We also report state-of-the-art results for VisDA in Table 4.

Similar to the findings in Section 4.2, we observe that source-balanced sampling is helpful when comparing “MDD (source-balanced sampler)” with the MDD standard baseline, even without extreme class distribution shift.

The proposed method outperforms the state-of-the-art explicit alignment methods—PACET and MCS—across all domain pairs. We find it ineffective to incorporate prototype-based explicit alignment into MDD. This is in contrast with domain-discriminator-based adversarial learning, where explicit alignment is shown to improve domain adaptation. This is because the classifier-based discrepancy MDD contains more abundant information than the domain-discriminator-based discrepancy, owing to the availability of predictive probabilities provided by the classifiers. The rich information in the domain discrepancy removes the need for prototype-based distances.

Table 4. VisDA2017 target accuracy (ResNet-50).

method                                  acc. (%)
JAN (Long et al., 2017)                 61.6
GTA (Sankaranarayanan et al., 2018)     69.5
MCD (Saito et al., 2018)                69.8
CDAN (Long et al., 2018)                70.0
MDD (Zhang et al., 2019b)               74.6
MDD+Explicit Alignment                  67.1
MDD+Implicit Alignment                  75.8
Figure 4.
The impact of class diversity and alignment on domain adaptation for Ar → Cl, Office-Home (standard). The x-axis is N, the number of unique labels per batch; the y-axis is the test accuracy of the target domain (%).
Figure 5.
The impact of pseudo-label errors on implicit and explicit alignment, Ar → Cl, Office-Home (standard). The x-axis is pseudo-label accuracy (%); the y-axis is target accuracy after 1000 subsequent training steps (%).
Table 5. The impact of different implicit alignment options, i.e., masking in the MDD estimation and sampling class-aligned minibatches, on Office-Home (RS-UT).

Domains   masking  sampling  avg. acc.
Rw→Cl     ×        ×         —
          √        ×         —
          ×        √         —
          √        √         —
Pr→Rw     ×        ×         —
          √        ×         —
          ×        √         —
          √        √         —
4.4.1. Impact of Class Diversity and Alignment
We analyze the impact of class diversity and alignment by designing experiments along three dimensions: the number of unique labels in each minibatch, whether the classes are aligned, and whether we use pseudo-labels or ground-truth labels when sampling the target domain.
Setup. “Baseline (random)” randomly samples examples of both domains. “Baseline (S-sampled, T-random)” uses an N-way sampler for the source domain, and randomly samples the target domain. “Aligned (pseudo-labels)” is the proposed implicit alignment approach. “Aligned (oracle)” is the oracle form of implicit alignment where the target domain uses ground-truth labels for sampling.

The impact of class diversity.
Minibatch-based class diversity determines the sampling distribution of the label space, and a greater diversity corresponds to a more stable measure of this sampling distribution. Figure 4 suggests a positive correlation between model performance and class diversity: domain adaptation methods do not work well when the class diversity is very low—i.e., when only 5 classes per batch are sampled among the 65 classes—and the alignment-based methods outperform the baseline as we increase class diversity.
The impact of alignment.
We confirm the importance of the proposed implicit alignment algorithm from two perspectives. First, “Aligned (oracle)” consistently performs the best, which suggests that perfect alignment can provide substantial benefits to unsupervised domain adaptation. Second, the comparison between “Aligned (pseudo-labels)” and “Baseline (S-sampled, T-random)” validates the effectiveness of pseudo-label based implicit alignment, even though the pseudo-labels are approximations of the oracle.

4.4.2. Robustness to Pseudo-Label Errors
We investigate whether implicit alignment is indeed more robust to pseudo-label errors when compared with explicit alignment. Figure 5 illustrates the relationship between pseudo-label accuracy at training step t and the corresponding target accuracy at step t + 1000, i.e., after 1000 further domain adaptation training steps. This process resembles a Markov chain, which allows us to analyze the impact of pseudo-label accuracy on the learning dynamics. It is evident that the drawbacks of explicit alignment are more severe when the pseudo-labels are less accurate, i.e., at the lower end of the pseudo-label accuracy range in Figure 5.

4.4.3. Ablation Study on MDD
Table 5 presents the ablation study on Office-Home (RS-UT) that aims to assess the impact of different implicit alignment options: alignment in the domain divergence estimations of Section 3.2.2 (i.e., masking in MDD) and alignment in the input space of Section 3.2.1 (i.e., sampling class-conditioned examples). We observe that both alignment techniques are essential for domain adaptation because alignment should be enforced consistently across all aspects of adaptation. We report similar findings on Office-Home (standard) in the supplementary material.
Figure 6.
Interactions between within-domain class imbalance and between-domain class distribution shift.

Table 6. Per-class average accuracy (%) with a mismatched prior where the source domain is balanced while the target domain is imbalanced.

              SVHN → MNIST       MNIST → SVHN
method        mild    extreme    mild    extreme
source only   67.4    —          —       —
…             —       —          —       —

4.5. Generalization: Implicit Alignment Also Improves DANN
We design additional experiments to further demonstrate the effectiveness of the proposed approach on a different domain adaptation algorithm—DANN—on two synthetic domains with different degrees of class imbalance: “mild” (light-tailed class imbalance from a triangular-like distribution) and “extreme” (heavy-tailed class imbalance from a Pareto distribution). We synthetically manipulate the class distributions of SVHN and MNIST to simulate various interactions between within-domain class imbalance and between-domain class distribution shift. As illustrated in Figure 6, we simulate three types of distribution shift with $p_S(y) \neq p_T(y)$: (i) source-balanced, target-imbalanced; (ii) source-imbalanced, target-balanced; (iii) both imbalanced. A sketch of this imbalance protocol follows Table 8.

Tables 6, 7, and 8 present the results for the above scenarios; all experiments are repeated five times. The proposed implicit alignment approach significantly improves the performance of DANN regardless of the degree of imbalance or the type of distribution shift. Moreover, implicit alignment offers greater improvements over DANN when the degree of imbalance is more severe, i.e., comparing “mild” with “extreme”. Implicit alignment overcomes this limitation of DANN and greatly improves performance on the challenging task between SVHN and MNIST. We conclude that the proposed approach is independent of the choice of domain adaptation algorithms and helps both MDD and DANN.

Note that the aim of this subsection is to show that implicit alignment can help improve DANN on the digits datasets. More work is needed to compare with the current state-of-the-art methods (Kumar et al., 2018; Shu et al., 2018) on these datasets.

Table 7.
Per-class average accuracy (%) with a mismatched prior where the source domain is imbalanced while the target domain is balanced.

              SVHN → MNIST       MNIST → SVHN
method        mild    extreme    mild    extreme
source only   65.2    —          —       —
…             —       —          —       —

Table 8.
Per-class average accuracy (%) with a mismatched prior where both domains are imbalanced.

              SVHN → MNIST       MNIST → SVHN
method        mild    extreme    mild    extreme
source only   60.9    —          —       —
…             —       —          —       —
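As a sketch of the imbalance protocol referenced above, the snippet below generates per-class example counts for the “mild” (linearly decaying, triangular-like) and “extreme” (Pareto) regimes; the exact distribution parameters are not specified in this section, so the shape value is an assumption.

```python
import numpy as np

def class_counts(n_classes, n_max, kind, reverse=False):
    """Per-class example counts: 'mild' decays linearly (triangular-like,
    light-tailed); 'extreme' decays as a Pareto power law (heavy-tailed).
    `reverse=True` flips majority/minority classes, inducing the
    between-domain class distribution shift of Figure 6."""
    ranks = np.arange(1, n_classes + 1, dtype=float)
    if kind == "mild":
        weights = n_classes + 1 - ranks   # linear decay
    elif kind == "extreme":
        weights = ranks ** -1.5           # assumed Pareto shape alpha = 1.5
    else:
        raise ValueError(kind)
    if reverse:
        weights = weights[::-1]
    return np.maximum(1, (n_max * weights / weights.max()).astype(int))

# Setting (iii) of Figure 6, both domains imbalanced with shifted priors:
src_counts = class_counts(10, n_max=5000, kind="extreme")
tgt_counts = class_counts(10, n_max=5000, kind="extreme", reverse=True)
```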
5. Related Work
We review related work on unsupervised domain adaptation and discuss its relation to our proposed method.
Instance-based importance-weighting (Chawla et al., 2002; Kouw & Loog, 2019) aims to minimize the target error directly from the source domain data, weighted at the example level or class level. Unlike our approach, importance-weighting only uses the source data to train the classifier, without learning domain-invariant representations.
Feature-based distribution adaptation is the prevailing approach to domain adaptation; it aims to minimize the distribution discrepancy between the source and target domains. The domain difference can be measured in various ways, such as with Maximum Mean Discrepancy (MMD) (Borgwardt et al., 2006), which is then minimized to achieve domain invariance. The minimization of such discrepancy can be carried out by directly minimizing the distance (Tzeng et al., 2014) or with the help of adversarial learning (Ganin et al., 2016).
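For reference, a minimal biased MMD estimator with an RBF kernel can be written in a few lines; the bandwidth choice is an assumption.

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD with an RBF kernel:
    MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```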
Classifier-based distribution adaptation is a strong competitor to feature-based adaptation. It aims to minimize the discrepancy between two classifiers so that the learned representations respect the decision boundary of the classification task (Saito et al., 2018; Zhang et al., 2019b). We show that the proposed approach is beneficial to both the classifier-based discrepancy MDD (Zhang et al., 2019b) and the feature-based discrepancy DANN (Ganin et al., 2016).
Feature-classifier joint distribution adaptation aims to align the joint distribution between features and their corresponding predictions (Long et al., 2013; Tsai et al., 2018). The joint distribution can be represented in a multilinear map between features and classifier predictions (Long et al., 2018). In comparison, our approach aligns the joint distribution $p(x,y) = p(x|y)\,p(y)$ from a sampling perspective, where $p(y)$ is the pre-specified alignment distribution in the label space, and $p(x|y)$ represents class-conditioned sampling.

Explicit class-conditioned domain alignment, or class prototype alignment, introduces a loss function that minimizes the distances between class-level prototypes of the source and target domains (Snell et al., 2017; Pinheiro, 2018; Pan et al., 2019; Deng et al., 2019). It is prone to error accumulation due to its reliance on explicit optimization of model parameters from the pseudo-labels. A variety of recent methods have been proposed to mitigate these limitations by estimating batch-level statistics (Xie et al., 2018) and by introducing an easy-to-hard curriculum that favors confident predictions (Chen et al., 2019a). Nevertheless, these algorithms suffer from ill-calibrated probabilities in the form of confident mistakes, and more work is needed to improve model calibration so as to better utilize explicit alignment.
Self-training (Nigam & Ghani, 2000) is a special form of co-training (Blum & Mitchell, 1998) where the model iteratively uses its own predictions, i.e., pseudo-labels, as explicit supervision to re-train itself. The use of pseudo-labels has become an emerging trend in domain adaptation because they provide estimations of the target domain label distribution that can be exploited by training algorithms. Apart from class prototype based methods (Chen et al., 2011; Saito et al., 2017; Zhang et al., 2018; Deng et al., 2019) for explicit alignment, (Wen et al., 2019) proposed the use of uncertainty estimates of the target domain predictions as second-order statistics to promote feature-label joint adaptation. For semantic segmentation tasks, (Zou et al., 2018) proposed to iteratively generate pseudo-labels in the target domain and re-train the model on these labels; (Zhang et al., 2019a) proposed to use pseudo-labels to encourage examples to cluster together if they belong to the same class; and (Chen et al., 2019b) applied entropy minimization (Grandvalet & Bengio, 2005) on the pseudo-labels to encourage class overlap between domains. A main bottleneck for this approach is the bias in pseudo-label predictions. Directly optimizing on these labels is prone to “entropy over-minimization” (Zou et al., 2019) and negative transfer (Lifshitz & Wolf, 2020), where the model overfits to mistakes in the pseudo-labels. Moreover, the pseudo-labels are likely to suffer from ill-calibrated probabilities (Guo et al., 2017), especially for deep learning methods. The resulting misleadingly confident mistakes exacerbate the critical problem of error accumulation in pseudo-label bias. In contrast, our proposed method removes the need for direct supervision from pseudo-labels and, as a result, is more robust to bias in how these labels are produced.
Reinforced sample selection (Dong & Xing, 2018) was proposed for one-shot domain adaptation, where a model actively selects labeled examples to train the domain adaptation model. In comparison, the advantage of our approach lies in its simplicity: no reinforcement learning is required to obtain the sampling strategy.
6. Conclusion and Future Work
We introduce an approach for unsupervised domain adaptation—with a strong focus on practical considerations of within-domain class imbalance and between-domain class distribution shift—from a class-conditioned domain alignment perspective. We show theoretically that the proposed implicit alignment provides a more reliable measure of empirical domain divergence, which facilitates adversarial domain-invariant representation learning that would otherwise be hampered by the class-misaligned domain divergence. We show that our proposed approach leads to superior UDA performance under extreme within-domain class imbalance and between-domain class distribution shift, as well as competitive results on standard UDA tasks. We emphasize that the proposed method is robust to pseudo-label bias, simple to implement, has a unified training objective, and does not require additional parameter tuning. We also show that the proposed approach is orthogonal to the choice of domain adaptation algorithms and offers consistent improvements to feature-based and classifier-based domain adaptation algorithms.

Future work includes extensions to cost-sensitive learning for domain adaptation, other setups where the label space between the source and target domains is not identical, and other domain adaptation settings (Cao et al., 2018). It is also important to analyze the probability calibration of different domain adaptation models and develop well-calibrated methods for more effective use of pseudo-labels.
Acknowledgements
We thank the anonymous reviewers for providing thoughtful feedback. The authors also thank Lisa Di Jorio, Tanya Nair, Francis Dutil, Cecil Low-Kam, Nicolas Chapados, and the Imagia team for their support. Xiang Jiang acknowledges the support of NVIDIA Corporation with the donation of the Titan X GPU used for this research.
References
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.

Blum, A. and Mitchell, T. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100. ACM, 1998.

Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Schölkopf, B., and Smola, A. J. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.

Cao, Z., Ma, L., Long, M., and Wang, J. Partial adversarial domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 135–150, 2018.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.

Chen, C., Xie, W., Huang, W., Rong, Y., Ding, X., Huang, Y., Xu, T., and Huang, J. Progressive feature alignment for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 627–636, 2019a.

Chen, M., Weinberger, K. Q., and Blitzer, J. Co-training for domain adaptation. In Advances in Neural Information Processing Systems, pp. 2456–2464, 2011.

Chen, M., Xue, H., and Cai, D. Domain adaptation for semantic segmentation with maximum squares loss. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2090–2099, 2019b.

Chen, X., Wang, S., Long, M., and Wang, J. Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation. In International Conference on Machine Learning, pp. 1081–1090, 2019c.

Cicek, S. and Soatto, S. Unsupervised domain adaptation via regularized conditional alignment. In The IEEE International Conference on Computer Vision (ICCV), October 2019.

Deng, Z., Luo, Y., and Zhu, J. Cluster alignment with a teacher for unsupervised domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9944–9953, 2019.

Dong, N. and Xing, E. P. Domain adaption in one-shot learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 573–588. Springer, 2018.

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.

Grandvalet, Y. and Bengio, Y. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pp. 529–536, 2005.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1321–1330. JMLR.org, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Heckman, J. J. Sample selection bias as a specification error. Econometrica: Journal of the Econometric Society, pp. 153–161, 1979.

Kouw, W. M. and Loog, M. A review of domain adaptation without target labels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

Kumar, A., Sattigeri, P., Wadhawan, K., Karlinsky, L., Feris, R., Freeman, B., and Wornell, G. Co-regularized alignment for unsupervised domain adaptation. In Advances in Neural Information Processing Systems, pp. 9345–9356, 2018.

Liang, J., He, R., Sun, Z., and Tan, T. Distant supervised centroid shift: A simple and efficient approach to visual domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2975–2984, 2019a.

Liang, J., He, R., Sun, Z., and Tan, T. Exploring uncertainty in pseudo-label guided unsupervised domain adaptation. Pattern Recognition, 96:106996, 2019b.

Lifshitz, O. and Wolf, L. A sample selection approach for universal domain adaptation. arXiv preprint arXiv:2001.05071, 2020.

Lipton, Z. C., Wang, Y.-X., and Smola, A. Detecting and correcting for label shift with black box predictors. arXiv preprint arXiv:1802.03916, 2018.

Long, M., Wang, J., Ding, G., Sun, J., and Yu, P. S. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2200–2207, 2013.

Long, M., Cao, Y., Wang, J., and Jordan, M. I. Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on Machine Learning-Volume 37, pp. 97–105. JMLR.org, 2015.

Long, M., Zhu, H., Wang, J., and Jordan, M. I. Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2208–2217. JMLR.org, 2017.

Long, M., Cao, Z., Wang, J., and Jordan, M. I. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pp. 1640–1650, 2018.

Luo, Z., Zou, Y., Hoffman, J., and Fei-Fei, L. F. Label efficient learning of transferable representations across domains and tasks. In Advances in Neural Information Processing Systems, pp. 165–177, 2017.

Nigam, K. and Ghani, R. Analyzing the effectiveness and applicability of co-training. In CIKM, volume 5, pp. 3, 2000.

Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2009.

Pan, Y., Yao, T., Li, Y., Wang, Y., Ngo, C.-W., and Mei, T. Transferrable prototypical networks for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2239–2247, 2019.

Pei, Z., Cao, Z., Long, M., and Wang, J. Multi-adversarial domain adaptation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Peng, X., Usman, B., Kaushik, N., Hoffman, J., Wang, D., and Saenko, K. VisDA: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.

Pinheiro, P. O. Unsupervised domain adaptation with similarity learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8004–8013, 2018.

Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset Shift in Machine Learning. The MIT Press, 2009.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Saenko, K., Kulis, B., Fritz, M., and Darrell, T. Adapting visual category models to new domains. In European Conference on Computer Vision, pp. 213–226. Springer, 2010.

Saito, K., Ushiku, Y., and Harada, T. Asymmetric tri-training for unsupervised domain adaptation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2988–2997. JMLR.org, 2017.

Saito, K., Watanabe, K., Ushiku, Y., and Harada, T. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3723–3732, 2018.

Sankaranarayanan, S., Balaji, Y., Castillo, C. D., and Chellappa, R. Generate to adapt: Aligning domains using generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8503–8512, 2018.

Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

Shu, R., Bui, H. H., Narui, H., and Ermon, S. A DIRT-T approach to unsupervised domain adaptation. arXiv preprint arXiv:1802.08735, 2018.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087, 2017.

Tan, S., Peng, X., and Saenko, K. Generalized domain adaptation with covariate and label shift co-alignment. arXiv preprint arXiv:1910.10320, 2019.

Torralba, A., Efros, A. A., et al. Unbiased look at dataset bias. In CVPR, volume 1, pp. 7. Citeseer, 2011.

Tsai, Y.-H., Hung, W.-C., Schulter, S., Sohn, K., Yang, M.-H., and Chandraker, M. Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7472–7481, 2018.

Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., and Darrell, T. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.

Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176, 2017.

Venkateswara, H., Eusebio, J., Chakraborty, S., and Panchanathan, S. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5018–5027, 2017.

Webb, G. I. and Ting, K. M. On the application of ROC analysis to predict classification performance under varying class distributions. Machine Learning, 58(1):25–32, 2005.

Wen, J., Zheng, N., Yuan, J., Gong, Z., and Chen, C. Bayesian uncertainty matching for unsupervised domain adaptation. arXiv preprint arXiv:1906.09693, 2019.

Wu, Y., Winston, E., Kaushik, D., and Lipton, Z. Domain adaptation with asymmetrically-relaxed distribution alignment. arXiv preprint arXiv:1903.01689, 2019.

Xie, S., Zheng, Z., Chen, L., and Chen, C. Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning, pp. 5419–5428, 2018.

Zhang, Q., Zhang, J., Liu, W., and Tao, D. Category anchor-guided unsupervised domain adaptation for semantic segmentation. In Advances in Neural Information Processing Systems, pp. 433–443, 2019a.

Zhang, W., Ouyang, W., Li, W., and Xu, D. Collaborative and adversarial network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3801–3809, 2018.

Zhang, Y., Liu, T., Long, M., and Jordan, M. Bridging theory and algorithm for domain adaptation. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 7404–7413, Long Beach, California, USA, 09–15 Jun 2019b. PMLR.

Zou, Y., Yu, Z., Vijaya Kumar, B., and Wang, J. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 289–305, 2018.

Zou, Y., Yu, Z., Liu, X., Kumar, B. V., and Wang, J. Confidence regularized self-training. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
A. Theory
Definition A.1.
Let $\mathcal{B}_S, \mathcal{B}_T$ be minibatches from $\mathcal{U}_S$ and $\mathcal{U}_T$, respectively, where $\mathcal{B}_S \subseteq \mathcal{U}_S$, $\mathcal{B}_T \subseteq \mathcal{U}_T$, and $m_b = |\mathcal{B}_S| = |\mathcal{B}_T|$. The empirical estimation of $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{B}_S, \mathcal{B}_T)$ over the minibatches $\mathcal{B}_S, \mathcal{B}_T$ is defined as

$$\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{B}_S, \mathcal{B}_T) = \frac{1}{m_b} \sup_{h,h' \in \mathcal{H}} \left| \sum_{\mathcal{B}_T} [h \neq h'] - \sum_{\mathcal{B}_S} [h \neq h'] \right|. \quad (11)$$

For simplicity, we drop the multiplier $\frac{1}{m_b}$ in the following analysis, as it does not affect the result of the optimization.

Theorem A.2 (The decomposition of $\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{B}_S, \mathcal{B}_T)$). Let $\mathcal{H}$ be a hypothesis space and $\mathcal{Y}$ be the label space of the classification task, where $\mathcal{B}_S, \mathcal{B}_T$ are minibatches drawn from $\mathcal{U}_S, \mathcal{U}_T$, respectively, and $\mathcal{Y}_S, \mathcal{Y}_T$ are the label sets of $\mathcal{B}_S, \mathcal{B}_T$. We define three disjoint sets on the label space: the shared labels $\mathcal{Y}_C := \mathcal{Y}_S \cap \mathcal{Y}_T$, and the domain-specific labels $\bar{\mathcal{Y}}_S := \mathcal{Y}_S - \mathcal{Y}_C$ and $\bar{\mathcal{Y}}_T := \mathcal{Y}_T - \mathcal{Y}_C$. We also define the following disjoint sets on the input space: $\mathcal{B}_S^C := \{x \in \mathcal{B}_S \mid y \in \mathcal{Y}_C\}$, $\mathcal{B}_S^{\bar{C}} := \{x \in \mathcal{B}_S \mid y \notin \mathcal{Y}_C\}$, $\mathcal{B}_T^C := \{x \in \mathcal{B}_T \mid y \in \mathcal{Y}_C\}$, $\mathcal{B}_T^{\bar{C}} := \{x \in \mathcal{B}_T \mid y \notin \mathcal{Y}_C\}$. The empirical divergence $\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{B}_S, \mathcal{B}_T)$ can be decomposed into class-aligned divergence and class-misaligned divergence:

$$\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{B}_S, \mathcal{B}_T) = \sup_{h,h' \in \mathcal{H}} \left| \xi_C(h,h') + \xi_{\bar{C}}(h,h') \right|, \quad (12)$$

where

$$\xi_C(h,h') = \sum_{\mathcal{B}_T^C} [h \neq h'] - \sum_{\mathcal{B}_S^C} [h \neq h'], \quad (13)$$

$$\xi_{\bar{C}}(h,h') = \sum_{\mathcal{B}_T^{\bar{C}}} [h \neq h'] - \sum_{\mathcal{B}_S^{\bar{C}}} [h \neq h']. \quad (14)$$

Proof. By definition, we have

$$\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{B}_S, \mathcal{B}_T) = \sup_{h,h' \in \mathcal{H}} \left| \sum_{\mathcal{B}_T} [h \neq h'] - \sum_{\mathcal{B}_S} [h \neq h'] \right|. \quad (15)$$

We rewrite the summation over all the samples $\mathcal{B}$ as the sum over the disjoint subsets $\mathcal{B}^C$ and $\mathcal{B}^{\bar{C}}$:

$$\sum_{\mathcal{B}_T} [h \neq h'] - \sum_{\mathcal{B}_S} [h \neq h'] \quad (16)$$
$$= \sum_{\mathcal{B}_T^C} [h \neq h'] - \sum_{\mathcal{B}_S^C} [h \neq h'] \quad (17)$$
$$\;\;\; + \sum_{\mathcal{B}_T^{\bar{C}}} [h \neq h'] - \sum_{\mathcal{B}_S^{\bar{C}}} [h \neq h'] \quad (18)$$
$$= \xi_C(h,h') + \xi_{\bar{C}}(h,h'). \quad (19)$$

This completes the proof.

B. Experiments
B.1. Additional Evaluation Measures on Office-Home
Table 9.
Evaluation on Office-Home (%) with ResNet-50.

                     Ar→Cl            Pr→Rw
                     MDD    ours      MDD    ours
accuracy             54.91  56.17     77.46  79.94
macro F1 score       53.66  55.29     75.86  78.42
weighted F1 score    53.97  55.81     77.24  79.79
macro precision      57.02  57.72     78.21  79.56
weighted precision   58.85  60.30     79.60  80.97
macro recall         56.41  57.76     76.30  78.61
weighted recall      54.91  56.17     77.65  79.94
Table 9 presents additional evaluation measures on Office-Home (standard). We re-implement MDD using identical batch sizes (50) and random seeds for a fair comparison. The results show that our proposed method has consistent improvements across all evaluation measures, and that the improvements are not a result of batch sizes or random seeds.
B.2. Additional Ablation on Alignment Options
Table 10.
The impact of different implicit alignment options, i.e., masking the classifier-based domain discrepancy measure and sampling examples from the source and target domains, on Ar → Cl and Cl → Pr, Office-Home (standard).

Domains   masking  sampling  Accuracy
Ar→Cl     ×        ×         —
          √        ×         —
          ×        √         —
          √        √         —
Cl→Pr     ×        ×         —
          √        ×         —
          ×        √         —
          √        √         —

Table 10 presents the ablation study on Office-Home (standard) that aims to assess the impact of different implicit alignment options: alignment in the domain divergence estimations (i.e., masking in MDD) and alignment in the input space (i.e., sampling class-conditioned examples). We observe that both alignment techniques are essential for domain adaptation because alignment should be enforced consistently across all aspects of the domain adaptation training. This is consistent with the findings in the main paper.
Figure 7.
Learning curve of the target domain accuracy for Pr → Rw, Office-Home (RS-UT).
B.3. Learning Curve
Figure 7 shows the learning curve of the target domain accuracy for different methods. The proposed implicit alignment converges better than the other methods.
B.4. Computational Efficiency
Table 11.
The impact of pseudo-label update frequency on Ar → Cl, Office-Home (standard).

pseudo-labels updated
every N steps       accuracy
5                   56.0
10                  56.7
20                  56.2
50                  55.2
100                 56.3
500                 55.7

Self-training requires estimating the target domain labels, which can be time-consuming depending on the size of the dataset. To improve the computational efficiency of our algorithm, we only update pseudo-labels periodically, i.e., every 20 steps, instead of at every training step. We show in Table 11 that different pseudo-label update frequencies exhibit similar performance on the target domain. Notably, implicit alignment outperforms the baseline method even when the pseudo-labels are only updated every 500 training steps. This validates the robustness of implicit alignment.

For the experiments described in Section B.3, training the baseline methods takes 31 hours (wall clock time), whereas implicit alignment takes 34 hours under the same training conditions when the pseudo-labels are updated every 20 steps. The roughly 10% computational overhead is modest. Moreover, from an engineering perspective, partially updating and caching the pseudo-labels could further improve computational efficiency, and we leave this as future work.
B.5. Impact of Batch Size
Table 12.
Impact of batch size on target domain accuracy (%), Ar → Cl, Office-Home (standard). The MDD results are based on our re-implementation.

batch size   baseline   implicit
8            48.9       49.7
16           52.7       52.8
32           54.9       56.2
50           55.3       56.2
Table 12 presents the impact of batch size on the target domain accuracy. We find that implicit alignment consistently improves model performance over the MDD baseline across different batch sizes, and both methods work better with larger batch sizes.
B.6. Empirical Class Diversity
Figure 8.
Empirical class diversity while training A → W (Office-31) with batch size 31.
Figure 8 shows the empirical class diversity of implicit alignment compared with the baseline. In both experiments, the batch size is identical to the total number of classes (i.e., 31). For the baseline method, random sampling only obtains about 19 unique classes per batch, which is much smaller than the batch size, even though the batch size equals the total number of classes. This is because random sampling can be viewed as sampling with replacement in the label space, whereas implicit alignment can be viewed as sampling without replacement in the label space, which naturally increases the empirical class diversity. The expected class diversity of the baseline is

$$\mathbb{E}[|Y|] = n \left[ 1 - \left( \frac{n-1}{n} \right)^k \right], \quad (20)$$

where $n$ is the number of unique classes and $k$ is the size of the minibatch. The expected class diversity is 19.78 if $n = 31$ and $k = 31$, which is consistent with the empirical class diversity shown in Figure 8.

For the implicit alignment method shown in Figure 8, although it has low class diversity at training step 0 due to the random pseudo-labels, it shows a sharp increase in class diversity during the first few hundred training steps, and is eventually able to sample 28 classes from the total of 31 classes. This confirms that implicit alignment is effective in improving empirical class diversity beyond random sampling.
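Eq. (20) is easy to verify numerically:

```python
def expected_class_diversity(n, k):
    """Eq. (20): expected number of unique classes when drawing k labels
    uniformly with replacement from n classes."""
    return n * (1 - ((n - 1) / n) ** k)

print(round(expected_class_diversity(31, 31), 2))  # 19.78, matching Figure 8
```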
C. Datasets

Figure 9 shows the frequencies of different classes for Cl → Rw on the Office-Home (standard) dataset. This dataset exhibits natural class imbalance, where examples of different classes are not evenly distributed.
Figure 9.
Class frequency of Cl → Rw, Office-Home (standard)
Figure 10 shows the frequencies of different classes for Cl → Rw on the Office-Home (RS-UT) dataset (Tan et al., 2019). In this dataset, the minority classes in the source domain are majority classes in the target domain, which creates extreme within-domain class imbalance and between-domain distribution shift.
Figure 10.
Class distribution of Cl → Rw, Office-Home (RS-UT)
D. Model Architecture and Training Details
Code.
We use PyTorch 1.2 as the training environment, and we observe that the adaptation performance on PyTorch 1.4 is slightly better than on PyTorch 1.2. The differences between PyTorch versions do not change the findings and conclusions of this paper. Our code and training instructions are provided at https://github.com/xiangdal/implicit_alignment.

Model architecture.
We use ResNet-50 (He et al., 2016) pre-trained on ImageNet (Russakovsky et al., 2015) as the backbone, and use hyper-parameters from (Zhang et al., 2019b) for the MDD-based domain discrepancy measure. The backbone is followed by a 1-layer bottleneck. The classifier $f$ and the auxiliary classifier $f'$ are both 2-layer networks.

Optimization.
We use the SGD optimizer with Nesterov momentum and weight decay. We empirically find that SGD converges better than Adam for adversarial optimization. We use a gradient reversal layer for minimax optimization, and we use a training scheduler (Ganin et al., 2016) for the gradient reversal layer, defined as

$$\lambda_p = \frac{0.2}{1 + \exp(-0.001 \cdot i)} - 0.1, \quad (21)$$

where $i$ denotes the step number. We used the same scheduler from (Zhang et al., 2019b) for all experiments and did not perform a hyperparameter search for $\lambda_p$.
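A sketch of this scheduler, written as the generic sigmoid ramp-up of (Ganin et al., 2016); the high/low/alpha/max_iter defaults are assumptions chosen so that the expression reduces to Eq. (21).

```python
import numpy as np

def grl_coefficient(step, high=0.1, low=0.0, alpha=10.0, max_iter=10000.0):
    """Sigmoid ramp-up of the gradient-reversal coefficient. With these
    (assumed) defaults it simplifies to 0.2 / (1 + exp(-0.001 * step)) - 0.1,
    matching Eq. (21)."""
    p = step / max_iter
    return 2.0 * (high - low) / (1.0 + np.exp(-alpha * p)) - (high - low) + low
```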