Learning a Domain-Invariant Embedding for Unsupervised Domain Adaptation Using Class-Conditioned Distribution Alignment
Alexander J. Gabourie, Mohammad Rostami, Philip E. Pope, Soheil Kolouri, and Kyungnam Kim
Abstract — We address the problem of unsupervised domain adaptation (UDA) by learning a cross-domain agnostic embedding space in which the distance between the probability distributions of the source and target visual domains is minimized. We use the output space of a shared cross-domain deep encoder to model the embedding space and use the Sliced-Wasserstein Distance (SWD) to measure and minimize the distance between the embedded distributions of the two domains, which enforces the embedding to be domain-agnostic. Additionally, we use the source domain labeled data to train a deep classifier from the embedding space to the label space, which enforces the embedding to be discriminative. As a result of this training scheme, we provide an effective solution for training the deep classification network on the source domain such that it generalizes well on the target domain, where only unlabeled training data is accessible. To mitigate the challenge of class matching, we also align corresponding classes in the embedding space by using high-confidence pseudo-labels for the target domain, i.e., assigning the class for which the source classifier has a high prediction probability. We provide experimental results on UDA benchmark tasks to demonstrate that our method is effective and leads to state-of-the-art performance.

Alexander J. Gabourie is with the Department of Electrical Engineering, Stanford University, Stanford, CA, USA ([email protected]). Mohammad Rostami is with the Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA, USA ([email protected]). Philip Pope, Soheil Kolouri, and Kyungnam Kim are with the Information and Systems Laboratory, HRL Laboratories, LLC, Malibu, CA, USA ({pepope, skolouri, kkim}@hrl.com).
I. INTRODUCTION
Deep learning classification algorithms have surpassed human performance for a wide range of computer vision applications. However, this achievement is conditioned on the availability of high-quality labeled datasets to supervise the training of deep neural networks. Unfortunately, preparing huge labeled datasets is not feasible in many situations, as data labeling and annotation can be expensive [29]. Domain adaptation [12] is a paradigm to address the problem of labeled data scarcity in computer vision, where the goal is to improve learning speed and model generalization as well as to avoid expensive redundant model retraining. The major idea is to overcome labeled data scarcity in a target domain by transferring knowledge from a related auxiliary source domain, where labeled data is easy and cheap to obtain. A common technique in the domain adaptation literature is to embed data from the source and target visual domains in an intermediate embedding space such that common cross-domain discriminative relations are captured in the embedding space. For example, if the data from the source and target domains have similar class-conditioned probability distributions in the embedding space, then a classifier trained solely using labeled data from the source domain will generalize well on data points drawn from the target domain distribution [28], [30].

In this paper, we propose a novel unsupervised domain adaptation (UDA) algorithm following the above procedure. Our approach is a simpler, yet effective, alternative to the adversarial learning techniques that have been more dominant in addressing probability matching indirectly for UDA [41], [43], [23]. Our contribution is twofold. First, we train the shared encoder by minimizing the Sliced-Wasserstein Distance (SWD) [26] between the source and the target distributions in the embedding space. We also train a classifier network simultaneously using the source domain labeled data. A major benefit of SWD over alternative probability metrics is that it can be computed efficiently. Additionally, SWD is known to be suitable for gradient-based optimization, which is essential for deep learning [28]. Our second contribution is to circumvent the class matching challenge [34] by minimizing the SWD between class-conditional distributions in sequential iterations, which yields better performance compared to prior UDA methods that match probabilities explicitly. At each iteration, we assign pseudo-labels only to the target domain data points for which the classifier predicts the assigned class label with high probability, and use this portion of the target data to minimize the SWD between conditional distributions. As more learning iterations are performed, the number of target data points with correct pseudo-labels grows and progressively enforces the distributions to align class-conditionally. We provide experimental results on benchmark problems, including ablation and sensitivity studies, to demonstrate that our method is effective.
II. BACKGROUND AND RELATED WORK
There are two major approaches in the literature to address domain adaptation. The first group of methods is based on preprocessing the target domain data points: the target data is mapped from the target domain to the source domain such that the target data structure is preserved in the source [33]. Another common approach is to map data from both domains to a latent domain-invariant space [7]. Early methods within the second approach learn a linear subspace as the invariant space [15], [13], in which the target domain data points are distributed similarly to the source domain data points. A linear subspace, however, is not suitable for capturing complex distributions. For this reason, deep neural networks have recently been used to model the intermediate space as the output of the network. The network is trained such that the source and the target domain distributions at its output have minimal discrepancy. Training can be done either by adversarial learning [14] or by directly minimizing the distance between the two distributions [4].

Several important UDA methods use adversarial learning. Ganin et al. [10] pioneered and developed an effective method to match two distributions indirectly by using adversarial learning. Liu et al. [21] and Tzeng et al. [41] use the Generative Adversarial Network (GAN) structure [14] to tackle domain adaptation. The idea is to train two competing (i.e., adversarial) deep neural networks to match the source and the target distributions. A generator network maps data points from both domains to the domain-invariant space, and a binary discriminator network is trained to classify the data points, with each domain considered as a class, based on the representations of the target and the source data points. The generator network is trained such that eventually the discriminator cannot distinguish between the two domains, i.e., its classification rate drops to chance.

A second group of domain adaptation algorithms matches the distributions directly in the embedding space by using a shared cross-domain mapping such that the distance between the two distributions is minimized with respect to a distance metric [4]. Early methods use simple metrics such as the Maximum Mean Discrepancy (MMD) for this purpose [16]. MMD measures the discrepancy between two distributions simply as the distance between the means of the embedded features. In contrast, more recent techniques that use a shared deep encoder employ the Wasserstein metric [42] to address UDA [4], [6]. The Wasserstein metric has been shown to be a more accurate probability metric and can be minimized effectively by first-order deep learning optimization techniques. A major benefit of matching distributions directly is the existence of theoretical guarantees. In particular, Redko et al. [28] provided theoretical guarantees for using the Wasserstein metric to address domain adaptation. Additionally, adversarial learning often requires deliberate architecture engineering, optimization initialization, and selection of hyper-parameters to be stable [32]. In some cases, adversarial learning also suffers from a phenomenon known as mode collapse [22]: if the data distribution is multi-modal, which is the case for most classification problems, the generator network might not generate samples from some modes of the distribution.
These challenges are easier to address when the distributions are matched directly. As the Wasserstein distance finds more applications in deep learning, its efficient computation has become an active area of research. The reason is that the Wasserstein distance is defined as a linear programming problem, and solving this optimization problem is computationally expensive for high-dimensional data. Although computationally efficient variations and approximations of the Wasserstein distance have recently been proposed [5], [40], [25], these variations still require an additional optimization in each iteration of the stochastic gradient descent (SGD) steps to match distributions. Courty et al. [4] used a regularized version of optimal transport for domain adaptation. Seguy et al. [36] used a dual stochastic gradient algorithm for solving the regularized optimal transport problem. Alternatively, we propose to address the above challenges using the Sliced Wasserstein Distance (SWD). The definition of SWD is motivated by the fact that, in contrast to higher dimensions, the Wasserstein distance for one-dimensional distributions has a closed-form solution that can be computed efficiently. This fact is used to approximate the Wasserstein distance by SWD, which is a computationally efficient approximation and has recently drawn interest from the machine learning and computer vision communities [26], [1], [3], [8], [39].

III. PROBLEM FORMULATION
Consider a source domain, $\mathcal{D}_S = (X_S, Y_S)$, with $N$ labeled samples, i.e., labeled images, where $X_S = [\mathbf{x}_1^s, \ldots, \mathbf{x}_N^s] \in \mathcal{X} \subset \mathbb{R}^{d \times N}$ denotes the samples and $Y_S = [\mathbf{y}_1^s, \ldots, \mathbf{y}_N^s] \in \mathcal{Y} \subset \mathbb{R}^{k \times N}$ contains the corresponding labels. Note that the label $\mathbf{y}_n^s$ identifies the membership of $\mathbf{x}_n^s$ in one or multiple of the $k$ classes (e.g., the digits $0, \ldots, 9$ for hand-written digit recognition). We assume that the source samples are drawn i.i.d. from the source joint probability distribution, i.e., $(\mathbf{x}_i^s, \mathbf{y}_i^s) \sim p(\mathbf{x}^s, \mathbf{y}^s)$. We denote the source marginal distribution over $\mathbf{x}^s$ by $p_S$. Additionally, we have a related target domain (e.g., machine-typed digit recognition) with $M$ unlabeled data points $X_T = [\mathbf{x}_1^t, \ldots, \mathbf{x}_M^t] \in \mathbb{R}^{d \times M}$. Following existing UDA methods, we assume that the same set of labels used in the source domain holds for the target domain. The target samples are drawn from the target marginal distribution, $\mathbf{x}_i^t \sim p_T$. Despite the similarity between these domains, a distribution discrepancy exists between them, i.e., $p_S \neq p_T$. Our goal is to classify the unlabeled target data points through knowledge transfer from the source domain. Learning a good classifier for the source data points is a straightforward problem: given a large enough number of source samples, $N$, a parametric function $f_{\theta}: \mathbb{R}^d \rightarrow \mathcal{Y}$, e.g., a deep neural network with concatenated learnable parameters $\theta$, can be trained to map samples to their corresponding labels using standard supervised learning solely in the source domain. Training is conducted by minimizing the empirical risk, $\hat{\theta} = \arg\min_{\theta} \hat{e}_{\theta} = \arg\min_{\theta} \sum_i \mathcal{L}(f_{\theta}(\mathbf{x}_i^s), \mathbf{y}_i^s)$, with respect to a proper loss function $\mathcal{L}(\cdot)$ (e.g., cross-entropy loss). The learned classifier $f_{\hat{\theta}}$ generalizes well on testing data points only if they are drawn from the training data distribution; only then is the empirical risk a suitable surrogate for the real risk function, $e = \mathbb{E}_{(\mathbf{x},\mathbf{y}) \sim p(\mathbf{x}^s, \mathbf{y}^s)}[\mathcal{L}(f_{\theta}(\mathbf{x}), \mathbf{y})]$. Hence, the naive approach of using $f_{\hat{\theta}}$ on the target domain might not be effective: given the discrepancy between the source and target distributions, $f_{\hat{\theta}}$ might not generalize well on the target domain. Therefore, the training procedure of $f_{\hat{\theta}}$ must be adapted by incorporating the unlabeled target data points such that the knowledge learned from the source domain can be transferred and used for classification in the target domain using only the unlabeled samples.

Fig. 1: Architecture of the proposed unsupervised domain adaptation framework.

The main challenge is to circumvent the problem of discrepancy between the source and the target domain distributions. To that end, the mapping $f_{\theta}(\cdot)$ can be decomposed into a feature extractor $\phi_{\mathbf{v}}(\cdot)$ and a classifier $h_{\mathbf{w}}(\cdot)$, such that $f_{\theta} = h_{\mathbf{w}} \circ \phi_{\mathbf{v}}$, where $\mathbf{w}$ and $\mathbf{v}$ are the corresponding learnable parameters, i.e., $\theta = (\mathbf{w}, \mathbf{v})$. The core idea is to learn the feature extractor $\phi_{\mathbf{v}}$ for both domains such that the domain-specific distributions of the extracted features are similar to one another. The feature extraction function $\phi_{\mathbf{v}}: \mathcal{X} \rightarrow \mathcal{Z}$ maps the data points from both domains to an intermediate embedding space
$\mathcal{Z} \subset \mathbb{R}^f$ (i.e., the feature space), and the classifier $h_{\mathbf{w}}: \mathcal{Z} \rightarrow \mathcal{Y}$ maps the data point representations in the embedding space to the label set. Note that, as a deterministic function, the feature extractor $\phi_{\mathbf{v}}$ can change the distribution of the data in the embedding. Therefore, if $\phi_{\mathbf{v}}$ is learned such that the discrepancy between the source and target distributions is minimized in the embedding space, i.e., the discrepancy between $p_S(\phi(\mathbf{x}^s))$ and $p_T(\phi(\mathbf{x}^t))$ (i.e., the embedding is domain-agnostic), then the classifier will generalize well on the target domain and can be used to label the target domain data points. This is the core idea behind various prior domain adaptation approaches in the literature [24], [11].
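To make the decomposition concrete, the following is a minimal PyTorch-style sketch of one way $f_{\theta} = h_{\mathbf{w}} \circ \phi_{\mathbf{v}}$ could be instantiated. The layer sizes, the 64-dimensional embedding, and the 32x32 single-channel inputs are illustrative assumptions of this sketch, not the architectures used in our experiments (Section V uses DRCN- and VGG-based backbones).

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Shared feature extractor phi_v: maps images from either domain to Z."""
    def __init__(self, feature_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64 * 8 * 8, feature_dim)  # assumes 32x32 inputs

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class Classifier(nn.Module):
    """Shallow classifier h_w: maps embeddings to class scores (softmax via the loss)."""
    def __init__(self, feature_dim=64, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(),
                                 nn.Linear(128, num_classes))

    def forward(self, z):
        return self.net(z)

# f_theta = h_w o phi_v: the encoder is shared by both domains,
# while the classifier is trained only with source labels.
encoder, classifier = Encoder(), Classifier()
logits = classifier(encoder(torch.randn(16, 1, 32, 32)))  # -> (16, 10)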
IV. PROPOSED METHOD

We consider the case where the feature extractor, $\phi_{\mathbf{v}}(\cdot)$, is a deep convolutional encoder with weights $\mathbf{v}$ and the classifier $h_{\mathbf{w}}(\cdot)$ is a shallow fully connected neural network with weights $\mathbf{w}$. The last layer of the classifier network is a softmax layer that assigns a membership probability distribution to any given data point. Labels are typically assigned according to the class with the maximum predicted probability. In short, the encoder network is learned to mix both domains such that the extracted features in the embedding are: 1) domain-agnostic in terms of data distributions, and 2) discriminative for the source domain, so that learning $h_{\mathbf{w}}$ is feasible. Figure 1 presents a system-level overview of our framework. Following this framework, UDA reduces to solving the following optimization problem for $\mathbf{v}$ and $\mathbf{w}$:

$$\min_{\mathbf{v}, \mathbf{w}} \sum_{i=1}^{N} \mathcal{L}\big(h_{\mathbf{w}}(\phi_{\mathbf{v}}(\mathbf{x}_i^s)), \mathbf{y}_i^s\big) + \lambda D\big(p_S(\phi_{\mathbf{v}}(X_S)), p_T(\phi_{\mathbf{v}}(X_T))\big), \quad (1)$$

where $D(\cdot, \cdot)$ is a discrepancy measure between the probabilities and $\lambda$ is a trade-off parameter. The first term in Eq. (1) is the empirical risk for classifying the source labeled data points from the embedding space, and the second term is the cross-domain probability matching loss. The encoder parameters are learned using data points from both domains, and the classifier parameters are simultaneously learned using the source domain labeled data.

A major remaining question is how to select a proper metric. First, note that the actual distributions $p_S(\phi(X_S))$ and $p_T(\phi(X_T))$ are unknown and we can rely only on observed samples from these distributions. Therefore, a sensible discrepancy measure $D(\cdot, \cdot)$ should be able to measure the dissimilarity between these distributions based only on the drawn samples. In this work, we use the SWD [27], as it can be computed efficiently from samples drawn from the corresponding distributions. More importantly, the SWD is a good approximation for the optimal transport [2], which has gained interest in the deep learning community as an effective distribution metric whose gradient is non-vanishing.

The idea behind the SWD is to project two $d$-dimensional probability distributions onto their marginal one-dimensional distributions, i.e., to slice the high-dimensional distributions, and to approximate the Wasserstein distance by integrating the Wasserstein distances between the resulting marginal probability distributions over all possible one-dimensional subspaces. For the distribution $p_S$, a one-dimensional slice of the distribution is defined as:

$$\mathcal{R}p_S(t; \gamma) = \int_{\mathcal{X}} p_S(\mathbf{x}) \, \delta(t - \langle \gamma, \mathbf{x} \rangle) \, d\mathbf{x}, \quad (2)$$

where $\delta(\cdot)$ denotes the one-dimensional Dirac delta function, $\langle \cdot, \cdot \rangle$ denotes the vector dot product, $\mathbb{S}^{d-1}$ is the $d$-dimensional unit sphere, and $\gamma$ is the projection direction. In other words, $\mathcal{R}p_S(\cdot; \gamma)$ is a marginal distribution of $p_S$ obtained by integrating $p_S$ over the hyperplanes orthogonal to $\gamma$. The SWD can then be computed by integrating the Wasserstein distance between sliced distributions over all $\gamma$:

$$SW(p_S, p_T) = \int_{\mathbb{S}^{d-1}} W\big(\mathcal{R}p_S(\cdot; \gamma), \mathcal{R}p_T(\cdot; \gamma)\big) \, d\gamma, \quad (3)$$

where $W(\cdot)$ denotes the Wasserstein distance. The main advantage of the SWD is that, unlike the Wasserstein distance, its calculation does not require a numerically expensive optimization.
This is due to the fact that the Wasserstein distance between two one-dimensional probability distributions has a closed-form solution and is equal to the $\ell_p$-distance between the inverses of their cumulative distribution functions. Since only samples from the distributions are available, the one-dimensional Wasserstein distance can be approximated as the $\ell_p$-distance between the sorted samples [31]. The integral in Eq. (3) is approximated using a Monte Carlo style numerical integration. Doing so, the SWD between $f$-dimensional samples $\{\phi(\mathbf{x}_i^s) \in \mathbb{R}^f \sim p_S\}_{i=1}^{M}$ and $\{\phi(\mathbf{x}_j^t) \in \mathbb{R}^f \sim p_T\}_{j=1}^{M}$ can be approximated by the following sum:

$$SW^2(p_S, p_T) \approx \frac{1}{L} \sum_{l=1}^{L} \sum_{i=1}^{M} \big| \langle \gamma_l, \phi(\mathbf{x}_{s_l[i]}^s) \rangle - \langle \gamma_l, \phi(\mathbf{x}_{t_l[i]}^t) \rangle \big|^2, \quad (4)$$

where $\gamma_l \in \mathbb{S}^{f-1}$ is a random sample drawn uniformly from the $f$-dimensional unit sphere $\mathbb{S}^{f-1}$, and $s_l[i]$ and $t_l[i]$ are the indices that sort $\{\langle \gamma_l, \phi(\mathbf{x}_i) \rangle\}_{i=1}^{M}$ for the source and target domains, respectively. Note that for a fixed dimension $d$, the Monte Carlo approximation error is proportional to $O(1/\sqrt{L})$. We utilize the SWD as the discrepancy measure between the probability distributions to match them in the embedding space. Next, we discuss a major deficiency of Eq. (1) and our remedy to address it.
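As an illustration, the following is a minimal sketch of Eq. (4), assuming two equally sized batches of embedded features and projection directions drawn uniformly from the unit sphere; the number of projections and the use of the squared distance (p = 2) are choices of this sketch rather than prescribed values.

import torch

def sliced_wasserstein_distance(z_s, z_t, num_projections=50):
    """Monte Carlo approximation of the SWD in Eq. (4).

    z_s, z_t: (M, f) tensors of source/target embeddings phi(x).
    """
    f_dim = z_s.shape[1]
    # gamma_l: random directions drawn uniformly from the unit sphere S^{f-1}
    gamma = torch.randn(f_dim, num_projections, device=z_s.device)
    gamma = gamma / gamma.norm(dim=0, keepdim=True)

    # one-dimensional slices of each empirical distribution
    proj_s = z_s @ gamma   # (M, L)
    proj_t = z_t @ gamma

    # closed-form 1-D Wasserstein distance: match the sorted projections;
    # the mean averages over both samples and projections, i.e. a 1/(LM)-scaled Eq. (4)
    sorted_s, _ = torch.sort(proj_s, dim=0)
    sorted_t, _ = torch.sort(proj_t, dim=0)
    return ((sorted_s - sorted_t) ** 2).mean()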
A. Class-conditional Alignment of Distributions

A main shortcoming of Eq. (1) is that minimizing the discrepancy between $p_S(\phi(X_S))$ and $p_T(\phi(X_T))$ does not guarantee semantic consistency between the two domains. To clarify this point, consider the source and target domains to be images of printed digits and handwritten digits, respectively. While the feature distributions in the embedding space could have low discrepancy, the classes might not be correctly aligned in this space: digits from one class in the target domain could be matched to a wrong class of the source domain, or digits from multiple classes in the target domain could even be matched to the cluster of a single digit of the source domain. In such cases, the source classifier will not generalize well on the target domain. In other words, the shared embedding space $\mathcal{Z}$ might not be a semantically meaningful space for the target domain if we solely minimize the SWD between $p_S(\phi(X_S))$ and $p_T(\phi(X_T))$. To solve this challenge, the encoder should be learned such that the class-conditioned probabilities of both domains in the embedding space are similar, i.e., $p_S(\phi(\mathbf{x}^s) \mid \mathcal{C}_j) \approx p_T(\phi(\mathbf{x}^t) \mid \mathcal{C}_j)$, where $\mathcal{C}_j$ denotes a particular class. Given this, we can mitigate the class matching problem by using an adapted version of Eq. (1):

$$\min_{\mathbf{v}, \mathbf{w}} \sum_{i=1}^{N} \mathcal{L}\big(h_{\mathbf{w}}(\phi_{\mathbf{v}}(\mathbf{x}_i^s)), \mathbf{y}_i^s\big) + \lambda \sum_{j=1}^{k} D\big(p_S(\phi_{\mathbf{v}}(\mathbf{x}^s) \mid \mathcal{C}_j), p_T(\phi_{\mathbf{v}}(\mathbf{x}^t) \mid \mathcal{C}_j)\big), \quad (5)$$

where the discrepancy between distributions is minimized conditioned on the classes, to enforce semantic alignment in the embedding space. Solving Eq. (5), however, is not tractable, as the labels for the target domain are not available and the conditional distribution $p_T(\phi(\mathbf{x}^t) \mid \mathcal{C}_j)$ is not known.

To tackle this issue, we compute a surrogate of the objective in Eq. (5). Our idea is to approximate $p_T(\phi(\mathbf{x}^t) \mid \mathcal{C}_j)$ by generating pseudo-labels for the target data points. The pseudo-labels are obtained from the source classifier prediction, but only for the portion of target data points for which the source classifier provides a confident prediction. More specifically, we solve Eq. (5) in incremental gradient descent iterations. We first initialize the classifier network by training it on the source data. We then alternate between optimizing the classification loss for the source data and the SWD loss term at each iteration. At each iteration, we pass the target domain data points through the classifier learned on the source data and analyze the label probability distribution of the classifier's softmax layer. We choose a threshold $\tau$ and assign pseudo-labels only to those target data points for which the classifier predicts the pseudo-label with high confidence, i.e., $p(\hat{y}_i^t \mid \mathbf{x}_i^t) > \tau$. Since the source and the target domains are related, it is sensible that the source classifier can classify a subset of the target data points correctly and with high confidence. We use these data points to approximate $p_T(\phi(\mathbf{x}^t) \mid \mathcal{C}_j)$ in Eq. (5) and update the encoder parameters $\mathbf{v}$ accordingly. In our empirical experiments, we have observed that, because the domains are related, the number of data points with confident pseudo-labels increases as more optimization iterations are performed, so our approximation of Eq. (5) improves and becomes more stable, enforcing the source and the target distributions to align class-conditionally in the embedding space. As a side benefit, since we match the distributions class-conditionally, a problem similar to mode collapse is unlikely to occur. Figure 2 visualizes this process using real data. Our proposed framework, named Domain Adaptation with Conditional Alignment of Distributions (DACAD), is summarized in Algorithm 1, and a sketch of one adaptation iteration follows below.

Algorithm 1 DACAD($L, \eta, \lambda$)
Input: data $\mathcal{D}_S = (X_S, Y_S)$; $\mathcal{D}_T = (X_T)$
Pre-training: $\hat{\theta} = (\hat{\mathbf{w}}, \hat{\mathbf{v}}) = \arg\min_{\theta} \sum_i \mathcal{L}(f_{\theta}(\mathbf{x}_i^s), \mathbf{y}_i^s)$
for itr = 1, ..., ITR do
    $\mathcal{D}_{PL} = \{(\mathbf{x}_i^t, \hat{y}_i^t) \mid \hat{y}_i^t = f_{\hat{\theta}}(\mathbf{x}_i^t),\ p(\hat{y}_i^t \mid \mathbf{x}_i^t) > \tau\}$
    for alt = 1, ..., ALT do
        Update encoder parameters using pseudo-labels:
            $\hat{\mathbf{v}} = \arg\min_{\mathbf{v}} \sum_j D\big(p_S(\phi_{\mathbf{v}}(\mathbf{x}^s) \mid \mathcal{C}_j), p_{PL}(\phi_{\mathbf{v}}(\mathbf{x}^t) \mid \mathcal{C}_j)\big)$
        Update the entire model:
            $\hat{\mathbf{v}}, \hat{\mathbf{w}} = \arg\min_{\mathbf{w}, \mathbf{v}} \sum_{i=1}^{N} \mathcal{L}\big(h_{\mathbf{w}}(\phi_{\mathbf{v}}(\mathbf{x}_i^s)), \mathbf{y}_i^s\big)$
    end for
end for
V. EXPERIMENTAL VALIDATION

We evaluate our algorithm on standard benchmark UDA tasks and compare it against several UDA methods.
Datasets:
We investigate the empirical performance of our proposed method on five commonly used benchmark datasets in UDA, namely: MNIST (M) [19], USPS (U) [20], Street View House Numbers, i.e., SVHN (S), CIFAR (CI), and STL (ST). The first three are 10-class (0-9) digit classification datasets. MNIST and USPS are collections of handwritten digits, whereas SVHN is a collection of real-world RGB images of house numbers. STL and CIFAR contain RGB images that share 9 object classes: airplane, car, bird, cat, deer, dog, horse, ship, and truck. For the digit datasets, while six domain adaptation problems can be defined among these datasets, prior works often consider four of these six cases, as knowledge transfer from the simpler MNIST and USPS datasets to the more challenging SVHN domain does not seem to be tractable. Following the literature, we use 2000 randomly selected images from MNIST and 1800 images from USPS in our experiments for the U → M and S → M cases [23]. In the remaining cases, we use the full datasets. All datasets have their images scaled to a common resolution, and the SVHN images are converted to grayscale, as the encoder network is shared between the domains. CIFAR and STL maintain their RGB components. We report the target classification accuracy across the tasks.

Fig. 2: The high-level system architecture, shown on the left, illustrates the data paths used during UDA training. On the right, t-SNE visualizations demonstrate how the embedding space evolves during training for the S → U task. In the target domain, colored points are examples with assigned pseudo-labels, which increase in number with the confidence of the classifier.

Pre-training: Our experiments involve a pre-training stage to initialize the encoder and the classifier networks solely using the source data. This is an essential step because the combined deep network can generate confident pseudo-labels on the target domain only if it is initially trained on the related source domain. In other words, this initially learned network serves as a naive model on the target domain. We then boost the performance on the target domain using our proposed algorithm, demonstrating that our algorithm is indeed effective for transferring knowledge. In doing so, we investigate a less-explored issue in the UDA literature. Different UDA approaches use considerably different networks, both in terms of complexity, e.g., the number of layers and convolution filters, and structure, e.g., the use of an auto-encoder. Consequently, it is ambiguous whether the performance of a particular UDA algorithm is due to successful knowledge transfer from the source domain or simply a good baseline network that performs well on the target domain even without considerable knowledge transfer. To highlight that our algorithm can indeed transfer knowledge, we use two different network architectures: DRCN [11] and VGG [38]. We then show that our algorithm can effectively boost baseline performance (statistically significantly) regardless of the underlying network. In most of the domain adaptation tasks, we demonstrate that this boost indeed stems from transferring knowledge from the source domain. In our experiments, we use the Adam optimizer [18] with a fixed pseudo-labeling threshold $\tau$.
Data Augmentation: Following the literature, we use data augmentation to create additional training data by applying reasonable transformations to the input data in an effort to improve generalization [37]. Confirming the results reported in [11], we also found that geometric transformations and noise, applied to appropriate inputs, greatly improve performance and transferability of the source model to the target data. Data augmentation can help reduce the domain shift between the two domains. The augmentations in this work are limited to translation, rotation, skew, zoom, Gaussian noise, binomial noise, and inverted pixels; a sketch of such a pipeline is given below.
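As an illustration, a torchvision-style pipeline covering these transformations might look like the following sketch; the magnitudes and probabilities are placeholders rather than the values used in our experiments.

import torch
from torchvision import transforms

def add_gaussian_noise(x, std=0.05):
    """Additive Gaussian pixel noise, clipped back to [0, 1]."""
    return (x + std * torch.randn_like(x)).clamp(0.0, 1.0)

def add_binomial_noise(x, p=0.05):
    """Randomly drop (zero out) pixels with probability p."""
    return x * (torch.rand_like(x) > p).float()

def random_invert(x, p=0.2):
    """Invert pixel intensities with probability p."""
    return 1.0 - x if torch.rand(()).item() < p else x

augment = transforms.Compose([
    # translation, rotation, skew (shear), and zoom (scale)
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1),
                            scale=(0.9, 1.1), shear=10),
    transforms.ToTensor(),
    transforms.Lambda(add_gaussian_noise),
    transforms.Lambda(add_binomial_noise),
    transforms.Lambda(random_invert),
])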
A. Results
Figure 2 demonstrates how our algorithm successfully learns an embedding with class-conditional alignment of the distributions of both domains. The figure presents the two-dimensional t-SNE visualization of the source and target domain data points in the shared embedding space for the S → U task. The horizontal axis corresponds to the optimization iterations, where each cell presents the data visualization after a particular optimization iteration. The top sub-figures visualize the source data points, where each color represents a particular class. The bottom sub-figures visualize the target data points, where the colored data points represent the pseudo-labeled data points at each iteration and the black points represent the rest of the target domain data points. We can see that, due to the pre-training initialization, the embedding space is discriminative for the source domain at the beginning, but the target distribution differs from the source distribution. Nevertheless, the classifier is confident about a portion of the target data points. As more optimization iterations are performed, the network becomes a better classifier for the target domain and the number of pseudo-labeled target data points increases, improving our approximation of Eq. (5). As a result, the discrepancy between the two distributions progressively decreases. Over time, our algorithm learns a shared embedding which is discriminative for both domains, making the pseudo-labels a good prediction of the original labels (bottom right-most sub-figure). This result empirically validates the applicability of our algorithm to UDA.
TABLE I: Classification accuracy for UDA between MNIST, USPS, SVHN, CIFAR, and STL datasets, for the tasks M → U, U → M, S → M, S → U, ST → CI, and CI → ST. † indicates use of the full MNIST and USPS datasets as opposed to the subset described in the paper. ‡ indicates results from the reimplementation in [41].

We also compare our results against several recent UDA algorithms in Table I. In particular, we compare against the recent adversarial learning algorithms Generate to Adapt (GtA) [35], CoGAN [21], ADDA [41], CyCADA [17], and I2I-Adapt [24]. We also include FADA [23], which is originally a few-shot learning technique. For FADA, we list the reported one-shot accuracy, which is very close to the UDA setting (but arguably a simpler problem). Additionally, we include results for RevGrad [9], DRCN [11], AUDA [34], OPDA [4], and MML [36]. The latter methods are similar to our method in that they learn an embedding space to couple the domains; OPDA and MML are the most similar, as they match distributions explicitly in the learned embedding. Finally, we include the performance of fully-supervised (FS) learning on the target domain as an upper bound for UDA. In our own results, we include the baseline target performance obtained by naively employing a DRCN network, as well as the target performance of a VGG network, both learned solely on the source domain. We notice in Table I that our baseline performance is better than some of the UDA algorithms for some tasks. This is a crucial observation, as it demonstrates that, in some cases, a trained deep network with good data augmentation can extract domain-invariant features that make domain adaptation feasible even without any further transfer learning procedure. The last row demonstrates that our method is effective in transferring knowledge to boost the baseline performance. In other words, Table I serves as an ablation study demonstrating that the effectiveness of our algorithm stems from successful cross-domain knowledge transfer. We can see that our algorithm leads to near state-of-the-art or state-of-the-art performance across the tasks. Additionally, an important observation is that our method significantly outperforms the methods that match distributions directly and is competitive against the methods that use adversarial learning. This can be explained as the result of matching distributions class-conditionally and suggests that our second contribution could potentially boost the performance of these methods. Finally, we note that our proposed method provides a statistically significant boost in all but two of the cases (shown in gray in Table I).

VI. CONCLUSIONS AND DISCUSSION
We developed a new UDA algorithm based on learning a domain-invariant embedding space. We map data points from two related domains to the embedding space such that the discrepancy between the transformed distributions is minimized. We used the Sliced Wasserstein Distance as the measure for matching the distributions in the embedding space; as a result, our method is computationally efficient. Additionally, we matched the distributions class-conditionally by assigning pseudo-labels to the target domain data. As a result, our method is more robust and outperforms prior UDA methods that match distributions directly. We provided experimental validation to demonstrate that our method is competitive against recent state-of-the-art UDA methods.

REFERENCES
[1] N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
[2] N. Bonnotte. Unidimensional and evolution methods for optimal transportation. PhD thesis, Paris 11, 2013.
[3] M. Carriere, M. Cuturi, and S. Oudot. Sliced Wasserstein kernel for persistence diagrams. arXiv preprint arXiv:1706.03358, 2017.
[4] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE TPAMI, 39(9):1853–1865, 2017.
[5] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
[6] B. Damodaran, B. Kellenberger, R. Flamary, D. Tuia, and N. Courty. DeepJDOT: Deep joint distribution optimal transport for unsupervised domain adaptation. arXiv preprint arXiv:1803.10081, 2018.
[7] H. Daumé III. Frustratingly easy domain adaptation. arXiv preprint arXiv:0907.1815, 2009.
[8] I. Deshpande, Z. Zhang, and A. Schwing. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3483–3491, 2018.
[9] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2014.
[10] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
[11] M. Ghifary, B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
[12] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, pages 513–520, 2011.
[13] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2066–2073. IEEE, 2012.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[15] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 999–1006. IEEE, 2011.
[16] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf. Covariate shift by kernel mean matching, 2009.
[17] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML, 2018.
[18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[19] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pages 396–404, 1990.
[20] Y. LeCun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, et al. Comparison of learning algorithms for handwritten digit recognition. In International Conference on Artificial Neural Networks, volume 60, pages 53–60. Perth, Australia, 1995.
[21] M. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477, 2016.
[22] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
[23] S. Motiian, Q. Jones, S. Iranmanesh, and G. Doretto. Few-shot adversarial domain adaptation. In Advances in Neural Information Processing Systems, pages 6670–6680, 2017.
[24] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim. Image to image translation for domain adaptation. arXiv preprint arXiv:1712.00479, 2017.
[25] A. Oberman and Y. Ruan. An efficient linear programming method for optimal transportation. arXiv preprint arXiv:1509.03668, 2015.
[26] J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011.
[27] J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011.
[28] I. Redko, A. Habrard, and M. Sebban. Theoretical analysis of domain adaptation with optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 737–753. Springer, 2017.
[29] M. Rostami, D. Huber, and T. Lu. A crowdsourcing triage algorithm for geopolitical event forecasting. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 377–381. ACM, 2018.
[30] M. Rostami, S. Kolouri, E. Eaton, and K. Kim. Deep transfer learning for few-shot SAR image classification. Remote Sensing, 11(11):1374, 2019.
[31] M. Rostami, S. Kolouri, E. Eaton, and K. Kim. SAR image classification using few-shot cross-domain transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[32] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, pages 2018–2028, 2017.
[33] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In European Conference on Computer Vision, pages 213–226. Springer, 2010.
[34] K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training for unsupervised domain adaptation. In ICML, 2018.
[35] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In CVPR, 2018.
[36] V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel. Large-scale optimal transport and mapping estimation. In ICLR, 2018.
[37] P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition, pages 958–963, 2003.
[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[39] U. Şimşekli, A. Liutkus, S. Majewski, and A. Durmus. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. arXiv preprint arXiv:1806.08141, 2018.
[40] J. Solomon, F. De Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66, 2015.
[41] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
[42] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
[43] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.